Re: shingles work in analyzer but not real data

Dennis Gearon Fri, 03 Sep 2010 02:06:29 -0700

Does anyone have a definitive, authoritative link to the definition of a
'shingle' in search engine technology?



Dennis Gearon

Signature Warning
----------------
EARTH has a Right To Life,
  otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php


--- On Fri, 9/3/10, Jeff Rose <j...@globalorange.nl> wrote:

> From: Jeff Rose <j...@globalorange.nl>
> Subject: Re: shingles work in analyzer but not real data
> To: solr-user@lucene.apache.org
> Date: Friday, September 3, 2010, 1:48 AM
> Thanks Steven and Jonathan, we got it working by using a combination of
> quoting and the PositionFilterFactory, as shown below. The documentation
> for the position filter doesn't make much sense without understanding more
> about how the positioning of tokens is taken into account, but it appears
> to do the trick. Does anyone know why position would matter here? It seems
> like tokens would be emitted by a tokenizer, filtered, joined into pairwise
> tokens by the shingler, and then matched against the index. If position
> information is also important, it seems odd that this is not discussed in
> the documentation. (The same goes for the pre-tokenizing done by the query
> parser before it hands phrases to the tokenizer.)
> 
> Anyway, here is our final schema, which works as long as we put search
> phrases in double quotes. Thanks for all the help!
> 
> -Jeff
> 
> <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="solr.PatternTokenizerFactory" pattern=";"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.TrimFilterFactory"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <!-- <filter class="solr.ShingleFilterFactory" outputUnigrams="true"
>          outputUnigramIfNoNgram="true" maxShingleSize="2"/> -->
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.PatternTokenizerFactory" pattern="[.,?;: !]"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.TrimFilterFactory"/>
>     <filter class="solr.ShingleFilterFactory"/>
>     <filter class="solr.PositionFilterFactory"/>
>   </analyzer>
> </fieldType>
> 
> 
> On Thu, Sep 2, 2010 at 11:47 PM, Jonathan Rochkind <rochk...@jhu.edu>
> wrote:
> 
> > I've run into this before too. Both the dismax and solr-lucene _query
> > parsers_ will tokenize a query on whitespace _before_ they pass the query
> > to any field analyzers.
> > There are some reasons for this, lots of things wouldn't work if they
> > didn't do this.
> >
> > But it makes your approach kind of hard. Try doing your search as a phrase
> > search with double quotes, "apple pie", I bet it'll work then -- because
> > both dismax and solr-lucene will respect the phrase quotes and NOT
> > tokenize the stuff inside there before it gets to the field analyzers.
> >
> > So if non-tokenized fields like this are all that are included in your
> > search, and if you can get your client application to just force phrase
> > quoting of everything before sending to Solr, that might work.
> > Otherwise.... I don't know of a good solution. If you figure one out, let
> > me know.
> >
> > Jonathan
> >
> >
> > Jeff Rose wrote:
> >
> >> Hi,
> >> We are using SOLR to match query strings with a keyword database, where
> >> some of the keywords are actually more than one word. For example, a
> >> keyword might be "apple pie", and we only want it to match a query
> >> containing that word pair, but not one containing only "apple". Here is
> >> the relevant piece of the schema.xml, defining the index and query
> >> pipelines:
> >>
> >> <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
> >>   <analyzer type="index">
> >>     <tokenizer class="solr.PatternTokenizerFactory" pattern=";"/>
> >>     <filter class="solr.LowerCaseFilterFactory"/>
> >>     <filter class="solr.TrimFilterFactory"/>
> >>   </analyzer>
> >>   <analyzer type="query">
> >>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >>     <filter class="solr.LowerCaseFilterFactory"/>
> >>     <filter class="solr.TrimFilterFactory"/>
> >>     <filter class="solr.ShingleFilterFactory"/>
> >>   </analyzer>
> >> </fieldType>
> >>
> >> In the analysis tool this schema looks like it works correctly. Our
> >> multi-word keywords are indexed as a single entry, and then when a search
> >> phrase contains one of these multi-word keywords it is shingled and
> >> matched. Unfortunately, when we run the same queries against the actual
> >> index, it responds with zero matches. I can see in the index histogram
> >> that the terms are correctly indexed from our MySQL datasource containing
> >> the keywords, but somehow the shingling doesn't appear to work on this
> >> live data. Does anyone have experience with shingling that might have
> >> some tips for us, or otherwise advice for debugging the issue?
> >>
> >> Thanks,
> >> Jeff
> >>
> >>
> >>
> >
>
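
For readers unfamiliar with the term used in this thread: a shingle is a
word n-gram, that is, a token built by joining adjacent tokens. The sketch
below is a simplified Python model of that idea, not Lucene's
implementation. The names shingles and match_keywords are hypothetical, and
token positions are ignored entirely. It assumes the defaults Jeff's query
analyzer relies on (shingles of at most two tokens, unigrams also emitted)
and illustrates why an unquoted query that is split on whitespace before
analysis can never produce the "apple pie" shingle.

```python
def shingles(tokens, max_size=2, output_unigrams=True, sep=" "):
    """Return word n-grams (shingles) of 2..max_size adjacent tokens,
    optionally interleaved with the single tokens themselves."""
    out = []
    for i in range(len(tokens)):
        if output_unigrams:
            out.append(tokens[i])
        for n in range(2, max_size + 1):
            if i + n <= len(tokens):
                out.append(sep.join(tokens[i:i + n]))
    return out


def match_keywords(query_tokens, keywords):
    """Return the indexed keywords hit by any unigram or shingle."""
    return sorted(set(shingles(query_tokens)) & keywords)


# Each keyword is indexed as one term, as in the index analyzer above.
keywords = {"apple pie", "cake"}
query = "i like apple pie"

# Quoted phrase: the analyzer sees all tokens at once, so the
# two-word shingle "apple pie" is produced and can match.
whole = match_keywords(query.split(), keywords)

# Unquoted query: the parser splits on whitespace first, so the
# analyzer sees one word per call and no two-word shingle can form.
split = [hit for word in query.split()
         for hit in match_keywords([word], keywords)]

print(whole)
print(split)
```

In this model the quoted case matches "apple pie" and the pre-split case
matches nothing, which is consistent with the zero-hit behaviour Jeff
reported on the live index.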
