Thank you mucho much, Lance.
Dennis Gearon

Signature Warning
----------------
EARTH has a Right To Life,
otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php

--- On Fri, 9/3/10, Lance Norskog <goks...@gmail.com> wrote:

> From: Lance Norskog <goks...@gmail.com>
> Subject: Re: shingles work in analyzer but not real data
> To: solr-user@lucene.apache.org
> Date: Friday, September 3, 2010, 9:55 PM
> http://en.wikipedia.org/wiki/W-shingling
>
> On Fri, Sep 3, 2010 at 6:19 AM, Steven A Rowe <sar...@syr.edu> wrote:
> > Hi Dennis,
> >
> > I took a stab at answering this question in the following
> > java-user mailing list post:
> >
> > http://www.lucidimagination.com/search/document/6cb7b54cce6872b3/lucene_indexes
> >
> > Steve
> >
> >> -----Original Message-----
> >> From: Dennis Gearon [mailto:gear...@sbcglobal.net]
> >> Sent: Friday, September 03, 2010 5:06 AM
> >> To: solr-user@lucene.apache.org
> >> Subject: Re: shingles work in analyzer but not real data
> >>
> >> Anyone got a definitive, authoritative link to the definition of a
> >> 'shingle' in search engine results/technology?
> >>
> >> Dennis Gearon
> >>
> >> --- On Fri, 9/3/10, Jeff Rose <j...@globalorange.nl> wrote:
> >>
> >> > From: Jeff Rose <j...@globalorange.nl>
> >> > Subject: Re: shingles work in analyzer but not real data
> >> > To: solr-user@lucene.apache.org
> >> > Date: Friday, September 3, 2010, 1:48 AM
> >> > Thanks Steven and Jonathan, we got it working by using a
> >> > combination of quoting and the PositionFilterFactory, like is
> >> > shown below. The documentation for the position filter doesn't
> >> > make much sense without understanding more about how positioning
> >> > of tokens is taken into account, but it appears to do the trick.
> >> > Does anyone know why position would matter here? It seems like
> >> > tokens would be emitted by a tokenizer, filtered, joined into
> >> > pairwise tokens by the shingler, and then matched against the
> >> > index. If position information is also important it seems odd
> >> > that this is not discussed in the documentation.. (Same for the
> >> > pre-tokenizing done by the query parser, before handing phrases
> >> > to the tokenizer...)
> >> >
> >> > Anyway, here is our final schema that works as long as we put
> >> > search phrases in double quotes. Thanks for all the help!
> >> >
> >> > -Jeff
> >> >
> >> > <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
> >> >   <analyzer type="index">
> >> >     <tokenizer class="solr.PatternTokenizerFactory" pattern=";"/>
> >> >     <filter class="solr.LowerCaseFilterFactory"/>
> >> >     <filter class="solr.TrimFilterFactory" />
> >> >     <filter class="solr.LowerCaseFilterFactory"/>
> >> >     <!-- <filter class="solr.ShingleFilterFactory" outputUnigrams="true"
> >> >          outputUnigramIfNoNgram="true" maxShingleSize="2"/> -->
> >> >   </analyzer>
> >> >   <analyzer type="query">
> >> >     <tokenizer class="solr.PatternTokenizerFactory" pattern="[.,?;: !]"/>
> >> >     <filter class="solr.LowerCaseFilterFactory"/>
> >> >     <filter class="solr.TrimFilterFactory" />
> >> >     <filter class="solr.ShingleFilterFactory"/>
> >> >     <filter class="solr.PositionFilterFactory"/>
> >> >   </analyzer>
> >> > </fieldType>
> >> >
> >> >
> >> > On Thu, Sep 2, 2010 at 11:47 PM, Jonathan Rochkind <rochk...@jhu.edu> wrote:
> >> >
> >> > > I've run into this before too. Both the dismax and solr-lucene
> >> > > _query parsers_ will tokenize a query on whitespace _before_
> >> > > they pass the query to any field analyzers. There are some
> >> > > reasons for this, lots of things wouldn't work if they didn't
> >> > > do this.
> >> > >
> >> > > But it makes your approach kind of hard. Try doing your search
> >> > > as a phrase search with double quotes, "apple pie", I bet it'll
> >> > > work then -- because both dismax and solr-lucene will respect
> >> > > the phrase quotes and NOT tokenize the stuff inside there
> >> > > before it gets to the field analyzers.
> >> > >
> >> > > So if non-tokenized fields like this are all that are included
> >> > > in your search, and if you can get your client application to
> >> > > just force phrase quoting of everything before sending to Solr,
> >> > > that might work. Otherwise.... I don't know of a good solution.
> >> > > If you figure one out, let me know.
> >> > >
> >> > > Jonathan
> >> > >
> >> > >
> >> > > Jeff Rose wrote:
> >> > >
> >> > >> Hi,
> >> > >> We are using SOLR to match query strings with a keyword
> >> > >> database, where some of the keywords are actually more than
> >> > >> one word. For example a keyword might be "apple pie" and we
> >> > >> only want it to match for a query containing that word pair,
> >> > >> but not one only containing "apple". Here is the relevant
> >> > >> piece of the schema.xml, defining the index and query
> >> > >> pipelines:
> >> > >>
> >> > >> <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
> >> > >>   <analyzer type="index">
> >> > >>     <tokenizer class="solr.PatternTokenizerFactory" pattern=";"/>
> >> > >>     <filter class="solr.LowerCaseFilterFactory"/>
> >> > >>     <filter class="solr.TrimFilterFactory" />
> >> > >>   </analyzer>
> >> > >>   <analyzer type="query">
> >> > >>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >> > >>     <filter class="solr.LowerCaseFilterFactory"/>
> >> > >>     <filter class="solr.TrimFilterFactory" />
> >> > >>     <filter class="solr.ShingleFilterFactory" />
> >> > >>   </analyzer>
> >> > >> </fieldType>
> >> > >>
> >> > >> In the analysis tool this schema looks like it works
> >> > >> correctly. Our multi-word keywords are indexed as a single
> >> > >> entry, and then when a search phrase contains one of these
> >> > >> multi-word keywords it is shingled and matched.
> >> > >> Unfortunately, when we do the same queries on top of the
> >> > >> actual index it responds with zero matches. I can see in the
> >> > >> index histogram that the terms are correctly indexed from our
> >> > >> mysql datasource containing the keywords, but somehow the
> >> > >> shingling doesn't appear to work on this live data. Does
> >> > >> anyone have experience with shingling that might have some
> >> > >> tips for us, or otherwise advice for debugging the issue?
> >> > >>
> >> > >> Thanks,
> >> > >> Jeff
> >> > >>
>
> --
> Lance Norskog
> goks...@gmail.com
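
As a rough illustration of what the thread describes, here is a minimal Python sketch of word shingling, and of why a query parser that splits on whitespace first keeps "apple pie" from being shingled unless the phrase is quoted. This is a toy model and not Solr's ShingleFilterFactory or its query parsers: the function names (shingles, analyze_query) and the sample keywords are made up for illustration.

```python
def shingles(tokens, min_size=2, max_size=2, output_unigrams=True, sep=" "):
    """Return word n-grams (shingles) built from adjacent tokens."""
    out = []
    for i, tok in enumerate(tokens):
        if output_unigrams:
            out.append(tok)
        for n in range(min_size, max_size + 1):
            if i + n <= len(tokens):
                out.append(sep.join(tokens[i:i + n]))
    return out


def analyze_query(chunk):
    """Toy query analyzer (hypothetical): whitespace split, lowercase, shingle."""
    return shingles(chunk.lower().split())


# Index side: the keyword list is split on ';' only, so "apple pie"
# stays a single indexed term (sample data, assumed for the example).
indexed_terms = {t.strip().lower() for t in "Apple Pie;Cherry".split(";")}

# Unquoted query: the parser splits on whitespace BEFORE analysis, so the
# analyzer only ever sees one word at a time and no shingle can form.
unquoted = [term for chunk in "apple pie".split() for term in analyze_query(chunk)]

# Quoted query: the whole phrase reaches the analyzer in one piece.
quoted = analyze_query("apple pie")

print("unquoted terms:", unquoted)
print("quoted terms:  ", quoted)
print("unquoted hits: ", indexed_terms & set(unquoted))
print("quoted hits:   ", indexed_terms & set(quoted))
```

In this model only the quoted form produces the two-word term that matches the indexed keyword, which is consistent with Jonathan's suggestion to force phrase quoting. It does not model token positions, so it says nothing about why PositionFilterFactory was also needed.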