On Wed, Sep 1, 2010 at 8:21 AM, Jeff Rose <j...@globalorange.nl> wrote:
> Hi,
>
> We are using SOLR to match query strings with a keyword database, where
> some of the keywords are actually more than one word. For example a keyword
> might be "apple pie" and we only want it to match for a query containing
> that word pair, but not one only containing "apple". Here is the relevant
> piece of the schema.xml, defining the index and query pipelines:
>
> <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="solr.PatternTokenizerFactory" pattern=";"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.TrimFilterFactory" />
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.TrimFilterFactory" />
>     <filter class="solr.ShingleFilterFactory" />
>   </analyzer>
> </fieldType>
>
> In the analysis tool this schema looks like it works correctly. Our
> multi-word keywords are indexed as a single entry, and then when a search
> phrase contains one of these multi-word keywords it is shingled and matched.
> Unfortunately, when we do the same queries on top of the actual index it
> responds with zero matches. I can see in the index histogram that the terms
> are correctly indexed from our mysql datasource containing the keywords, but
> somehow the shingling doesn't appear to work on this live data. Does anyone
> have experience with shingling that might have some tips for us, or
> otherwise advice for debugging the issue?
>

Query-time shingling probably isn't working with the query parser you are
using. The default Lucene one first splits the query on whitespace before
sending each piece to the analyzer, e.g. a query of

  foo bar

is processed as

  TokenStream(foo) + TokenStream(bar)

so the ShingleFilter only ever sees one token at a time and never has two
adjacent tokens to join. That is why query-time shingling like this doesn't
work as you expect, even though the analysis tool (which runs the analyzer
on the whole string) shows the shingles.

-- 
Robert Muir
rcm...@gmail.com
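
To make the difference concrete, here is a rough Python simulation of the two paths. It is not Solr or Lucene code: shingles(), analyze() and parse_then_analyze() are hypothetical helpers that only mimic the query analyzer from the schema (whitespace tokenizer, lowercase, shingle filter). It assumes the shingle filter's usual defaults of a maximum shingle size of 2 with unigrams also emitted.

```python
def shingles(tokens, max_size=2):
    """Emit unigrams plus word n-grams up to max_size, roughly like a shingle filter."""
    out = []
    for i in range(len(tokens)):
        for n in range(1, max_size + 1):
            if i + n <= len(tokens):
                out.append(" ".join(tokens[i:i + n]))
    return out


def analyze(text):
    """Mimic the query analyzer: whitespace tokenize, lowercase, then shingle."""
    return shingles(text.lower().split())


def parse_then_analyze(query):
    """Mimic a parser that splits on whitespace first and analyzes each chunk alone."""
    terms = []
    for chunk in query.split():
        terms.extend(analyze(chunk))
    return terms


query = "warm Apple Pie"

# What the analysis tool effectively shows: the analyzer sees the whole string.
whole = analyze(query)

# What happens at search time if the parser pre-splits on whitespace.
split_first = parse_then_analyze(query)

print("analyzer on whole string:", whole)
print("parser splits first:     ", split_first)
```

In the first path the term "apple pie" is produced and can match the indexed keyword; in the second path only single words come out, so the multi-word keyword is never looked up.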