Does anyone have a definitive, authoritative link to the definition of a 'shingle' in search engine results/technology?

Dennis Gearon

Signature Warning
----------------
EARTH has a Right To Life, otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php

--- On Fri, 9/3/10, Jeff Rose <j...@globalorange.nl> wrote:

> From: Jeff Rose <j...@globalorange.nl>
> Subject: Re: shingles work in analyzer but not real data
> To: solr-user@lucene.apache.org
> Date: Friday, September 3, 2010, 1:48 AM
>
> Thanks Steven and Jonathan, we got it working by using a combination of
> quoting and the PositionFilterFactory, like is shown below. The
> documentation for the position filter doesn't make much sense without
> understanding more about how positioning of tokens is taken into account,
> but it appears to do the trick. Does anyone know why position would matter
> here? It seems like tokens would be emitted by a tokenizer, filtered,
> joined into pairwise tokens by the shingler, and then matched against the
> index. If position information is also important it seems odd that this is
> not discussed in the documentation.. (Same for the pre-tokenizing done by
> the query parser, before handing phrases to the tokenizer...)
>
> Anyway, here is our final schema that works as long as we put search phrases
> in double quotes. Thanks for all the help!
>
> -Jeff
>
> <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="solr.PatternTokenizerFactory" pattern=";"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.TrimFilterFactory" />
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <!-- <filter class="solr.ShingleFilterFactory" outputUnigrams="true"
>          outputUnigramIfNoNgram="true" maxShingleSize="2"/> -->
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.PatternTokenizerFactory" pattern="[.,?;: !]"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.TrimFilterFactory" />
>     <filter class="solr.ShingleFilterFactory"/>
>     <filter class="solr.PositionFilterFactory"/>
>   </analyzer>
> </fieldType>
>
> On Thu, Sep 2, 2010 at 11:47 PM, Jonathan Rochkind <rochk...@jhu.edu> wrote:
>
> > I've run into this before too. Both the dismax and solr-lucene _query
> > parsers_ will tokenize a query on whitespace _before_ they pass the query to
> > any field analyzers.
> > There are some reasons for this, lots of things wouldn't work if they
> > didn't do this.
> >
> > But it makes your approach kind of hard. Try doing your search as a phrase
> > search with double quotes, "apple pie", I bet it'll work then -- because
> > both dismax and solr-lucene will respect the phrase quotes and NOT tokenize
> > the stuff inside there before it gets to the field analyzers.
> >
> > So if non-tokenized fields like this are all that are included in your
> > search, and if you can get your client application to just force phrase
> > quoting of everything before sending to Solr, that might work. Otherwise....
> > I don't know of a good solution. If you figure one out, let me know.
> >
> > Jonathan
> >
> > Jeff Rose wrote:
> >
> >> Hi,
> >> We are using SOLR to match query strings with a keyword database, where
> >> some of the keywords are actually more than one word.
> >> For example a keyword
> >> might be "apple pie" and we only want it to match for a query containing
> >> that word pair, but not one only containing "apple". Here is the relevant
> >> piece of the schema.xml, defining the index and query pipelines:
> >>
> >> <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
> >>   <analyzer type="index">
> >>     <tokenizer class="solr.PatternTokenizerFactory" pattern=";"/>
> >>     <filter class="solr.LowerCaseFilterFactory"/>
> >>     <filter class="solr.TrimFilterFactory" />
> >>   </analyzer>
> >>   <analyzer type="query">
> >>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >>     <filter class="solr.LowerCaseFilterFactory"/>
> >>     <filter class="solr.TrimFilterFactory" />
> >>     <filter class="solr.ShingleFilterFactory" />
> >>   </analyzer>
> >> </fieldType>
> >>
> >> In the analysis tool this schema looks like it works correctly. Our
> >> multi-word keywords are indexed as a single entry, and then when a search
> >> phrase contains one of these multi-word keywords it is shingled and
> >> matched. Unfortunately, when we do the same queries on top of the actual
> >> index it responds with zero matches. I can see in the index histogram that
> >> the terms are correctly indexed from our mysql datasource containing the
> >> keywords, but somehow the shingling doesn't appear to work on this live
> >> data. Does anyone have experience with shingling that might have some tips
> >> for us, or otherwise advice for debugging the issue?
> >>
> >> Thanks,
> >> Jeff
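
Not an authoritative definition, but as the term is used in this thread a "shingle" is a word n-gram: adjacent tokens joined into a single token, so "apple pie" becomes one term instead of two. The sketch below is a toy model in plain Python, not Solr or Lucene code. The names shingles and analyze_query are made up for illustration, the sample keywords and query are assumptions, and the real ShingleFilterFactory has more options than shown here. It imitates the final schema quoted above and shows why the quoted phrase matches while the unquoted query, which the query parser splits on whitespace before analysis, does not.

```python
import re


def shingles(tokens, max_size=2, output_unigrams=True, sep=" "):
    """Return word n-grams ("shingles") of 2..max_size tokens, plus unigrams if requested."""
    out = []
    for i in range(len(tokens)):
        if output_unigrams:
            out.append(tokens[i])
        for n in range(2, max_size + 1):
            if i + n <= len(tokens):
                out.append(sep.join(tokens[i:i + n]))
    return out


def analyze_query(text):
    """Toy stand-in for the query analyzer: split on the pattern, lowercase, trim, shingle."""
    tokens = [t.strip().lower() for t in re.split(r"[.,?;: !]", text) if t.strip()]
    return shingles(tokens)


# Index side: keywords are split on ';' only, so "apple pie" stays one term.
index_terms = {t.strip().lower() for t in "Apple Pie; Cherry".split(";")}

# Quoted phrase: the analyzer receives the whole string, so the
# "apple pie" shingle is produced.
quoted = analyze_query("fresh apple pie")

# Unquoted: the query parser splits on whitespace first, so each word is
# analyzed on its own and no two-word shingle can ever form.
unquoted = [tok for word in "fresh apple pie".split() for tok in analyze_query(word)]

print(quoted)                       # ['fresh', 'fresh apple', 'apple', 'apple pie', 'pie']
print(unquoted)                     # ['fresh', 'apple', 'pie']
print(index_terms & set(quoted))    # {'apple pie'}
print(index_terms & set(unquoted))  # set()
```

This only models which terms get generated. It says nothing about token positions, so it does not explain why the PositionFilterFactory was also needed.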