Re: shingles work in analyzer but not real data
: Hi Robert, thanks for the response. I've looked into the query parsers a
: bit and I did find that using the raw parser on a matching multi-word
: keyword works correctly. I need to have shingling though, in order to
: support query phrases. It seems odd to have the query parser emitting

The "FieldQParser" should work for this -- unlike the raw QParser it uses the Analyzer for the specified field, but has no metacharacters of its own.

-Hoss

--
http://lucenerevolution.org/ ... October 7-8, Boston
http://bit.ly/stump-hoss ... Stump The Chump!
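A minimal sketch of what Hoss's suggestion might look like from a client, assuming the field query parser is selected with Solr's local-params syntax `{!field f=...}`. The base URL and the field name `keywords` are placeholders for illustration, not values taken from this thread.

```python
from urllib.parse import urlencode


def field_query_url(base_url, field, text):
    """Build a select URL that sends `text` through the field query parser."""
    # {!field f=<name>} picks the field query parser via local params, so the
    # whole text is handed to that field's query analyzer in one piece.
    params = {"q": "{!field f=%s}%s" % (field, text)}
    return base_url + "?" + urlencode(params)


# Hypothetical host and field name, for illustration only.
url = field_query_url("http://localhost:8983/solr/select", "keywords", "apple pie")
print(url)
```

Because the query text is not split on whitespace first, a query-side shingle filter gets to see "apple" and "pie" together.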
Re: shingles work in analyzer but not real data
Thank you mucho much, Lance.

Dennis Gearon

Signature Warning
EARTH has a Right To Life, otherwise we all die.
Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php

--- On Fri, 9/3/10, Lance Norskog wrote:

> From: Lance Norskog
> Subject: Re: shingles work in analyzer but not real data
> To: solr-user@lucene.apache.org
> Date: Friday, September 3, 2010, 9:55 PM
>
> http://en.wikipedia.org/wiki/W-shingling
Re: shingles work in analyzer but not real data
http://en.wikipedia.org/wiki/W-shingling

On Fri, Sep 3, 2010 at 6:19 AM, Steven A Rowe wrote:
> Hi Dennis,
>
> I took a stab at answering this question in the following java-user mailing
> list post:
>
> http://www.lucidimagination.com/search/document/6cb7b54cce6872b3/lucene_indexes
>
> Steve
RE: shingles work in analyzer but not real data
Hi Dennis,

I took a stab at answering this question in the following java-user mailing list post:

http://www.lucidimagination.com/search/document/6cb7b54cce6872b3/lucene_indexes

Steve

> -----Original Message-----
> From: Dennis Gearon [mailto:gear...@sbcglobal.net]
> Sent: Friday, September 03, 2010 5:06 AM
> To: solr-user@lucene.apache.org
> Subject: Re: shingles work in analyzer but not real data
>
> Anyone got a definitive, authoritative link to the definition of a
> 'shingle' in search engine results/technology?
>
> Dennis Gearon
Re: shingles work in analyzer but not real data
Look up p. 288 in the "Solr 1.4 Enterprise Search Server" book by Eric & David.

Shingling suits the phrase query case because it works at the token level: a shingle is an n-gram of whole tokens. The n-gram filters are similar, but they work on the characters within a single term.

We are currently using shingling in our index with shingle size = 3. Be careful: index build time and index size can grow dramatically as the max shingle size increases.

Scott
Re: shingles work in analyzer but not real data
I don't have any fancy links, but from the documentation shingles make pretty good sense. You typically tokenize an input string so that "the best apple pie" becomes "the" "best" "apple" "pie", so that each term can then be filtered to remove stop words, take off plurals and suffixes like "ing", etc.

The problem is if you want to search for multi-word phrases, like "apple pie". This default splitting behavior won't let you do that, so to deal with this problem you can use shingles. The shingle filter will take in successive tokens and then produce a series of output tokens composed of the last 1-n tokens, where n is a setting. So with shingles of size 2, the default, you get "the" "the best" "best" "best apple" "apple" "apple pie" "pie" from the above string. Now we can match "apple pie".

Besides the shingling there is apparently also some concept of position, which I don't yet understand.

-Jeff

On Fri, Sep 3, 2010 at 11:05 AM, Dennis Gearon wrote:
> Anyone got a definitive, authoritative link to the definition of a
> 'shingle' in search engine results/technology?
>
> Dennis Gearon
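Jeff's description of the shingle filter in this message can be sketched in a few lines. This is a toy stand-in, not Solr's ShingleFilterFactory; the function name `shingles` and the parameter `max_size` are made up for the illustration.

```python
def shingles(tokens, max_size=2):
    """Emit each token followed by the multi-token shingles that start at it."""
    out = []
    for i in range(len(tokens)):
        for n in range(1, max_size + 1):
            if i + n <= len(tokens):
                out.append(" ".join(tokens[i:i + n]))
    return out


print(shingles("the best apple pie".split()))
# ['the', 'the best', 'best', 'best apple', 'apple', 'apple pie', 'pie']
```

With `max_size=3` the same input also yields "the best apple" and "best apple pie", which gives a feel for why larger shingle sizes inflate the index.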
Re: shingles work in analyzer but not real data
Anyone got a definitive, authoritative link to the definition of a 'shingle' in search engine results/technology?

Dennis Gearon
Re: shingles work in analyzer but not real data
Thanks Steven and Jonathan, we got it working by using a combination of quoting and the PositionFilterFactory, as shown below. The documentation for the position filter doesn't make much sense without understanding more about how positioning of tokens is taken into account, but it appears to do the trick.

Does anyone know why position would matter here? It seems like tokens would be emitted by a tokenizer, filtered, joined into pairwise tokens by the shingler, and then matched against the index. If position information is also important it seems odd that this is not discussed in the documentation. (Same for the pre-tokenizing done by the query parser, before handing phrases to the tokenizer...)

Anyway, here is our final schema that works as long as we put search phrases in double quotes. Thanks for all the help!

-Jeff

    <fieldType name="..." class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.PatternTokenizerFactory" pattern=";"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.TrimFilterFactory" />
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.PatternTokenizerFactory" pattern="[.,?;: !]"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.TrimFilterFactory" />
        <filter class="solr.ShingleFilterFactory"/>
        <filter class="solr.PositionFilterFactory"/>
      </analyzer>
    </fieldType>
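On Jeff's question of why position matters, here is a toy model of one plausible explanation. It rests on assumptions the thread does not confirm: that the query parser turns analyzed tokens spread over several positions into a phrase-style query, that tokens stacked on a single position become a plain OR, and that a position filter forces every token after the first onto the first token's position. The names `position_filter` and `build_query` and the position increments in `shingled` are illustrative only.

```python
def position_filter(tokens, increment=0):
    """Toy position filter: every token after the first gets `increment`."""
    if not tokens:
        return []
    first, rest = tokens[0], tokens[1:]
    return [first] + [(term, increment) for term, _ in rest]


def build_query(tokens):
    """Toy query builder over (term, position_increment) pairs."""
    terms = [term for term, _ in tokens]
    if len(terms) == 1:
        return ("term", terms[0])
    if all(inc == 0 for _, inc in tokens[1:]):
        return ("or", terms)      # all tokens share one position
    return ("phrase", terms)      # tokens span several positions


# Assumed analyzer output for the quoted query "apple pie" after shingling.
shingled = [("apple", 1), ("apple pie", 0), ("pie", 1)]
print(build_query(shingled))
print(build_query(position_filter(shingled)))
```

Under these assumptions the unfiltered tokens form a phrase that a keyword indexed as the single term "apple pie" cannot satisfy, while the filtered tokens form an OR in which the shingle "apple pie" matches that term directly.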
Re: shingles work in analyzer but not real data
I thought shingles were either a viral infection or roof material? (Hey, it's crazy friday early for me)

Dennis Gearon
Re: shingles work in analyzer but not real data
I've run into this before too. Both the dismax and solr-lucene _query parsers_ will tokenize a query on whitespace _before_ they pass the query to any field analyzers. There are reasons for this; lots of things wouldn't work if they didn't do this.

But it makes your approach kind of hard. Try doing your search as a phrase search with double quotes, "apple pie", I bet it'll work then -- because both dismax and solr-lucene will respect the phrase quotes and NOT tokenize the stuff inside there before it gets to the field analyzers.

So if non-tokenized fields like this are all that are included in your search, and if you can get your client application to just force phrase quoting of everything before sending to Solr, that might work. Otherwise I don't know of a good solution. If you figure one out, let me know.

Jonathan

Jeff Rose wrote:
> Hi,
> We are using SOLR to match query strings with a keyword database, where
> some of the keywords are actually more than one word. For example a keyword
> might be "apple pie" and we only want it to match for a query containing
> that word pair, but not one only containing "apple". Here is the relevant
> piece of the schema.xml, defining the index and query pipelines:
>
>     <fieldType name="..." class="solr.TextField" positionIncrementGap="100">
>       <analyzer type="index">
>         <tokenizer class="solr.PatternTokenizerFactory" pattern=";"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>         <filter class="solr.TrimFilterFactory" />
>       </analyzer>
>       <analyzer type="query">
>         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>         <filter class="solr.TrimFilterFactory" />
>         <filter ... />
>       </analyzer>
>     </fieldType>
>
> In the analysis tool this schema looks like it works correctly. Our
> multi-word keywords are indexed as a single entry, and then when a search
> phrase contains one of these multi-word keywords it is shingled and matched.
> Unfortunately, when we do the same queries on top of the actual index it
> responds with zero matches. I can see in the index histogram that the terms
> are correctly indexed from our mysql datasource containing the keywords, but
> somehow the shingling doesn't appear to work on this live data. Does anyone
> have experience with shingling that might have some tips for us, or
> otherwise advice for debugging the issue?
>
> Thanks,
> Jeff
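The behaviour Jonathan describes can be imitated with a toy model. This is not the real dismax or Lucene query parser: `parse` simply splits on whitespace outside double quotes and runs a stand-in analyzer (lowercase plus size-2 shingles) over each piece on its own. Both function names are hypothetical.

```python
import re


def shingles(tokens, max_size=2):
    """Stand-in shingle filter: each token plus the shingles starting at it."""
    out = []
    for i in range(len(tokens)):
        for n in range(1, max_size + 1):
            if i + n <= len(tokens):
                out.append(" ".join(tokens[i:i + n]))
    return out


def parse(query):
    """Split on whitespace outside double quotes, then analyze each chunk alone."""
    analyzed = []
    for quoted, bare in re.findall(r'"([^"]*)"|(\S+)', query):
        text = quoted or bare
        analyzed.append(shingles(text.lower().split()))
    return analyzed


print(parse('apple pie'))    # [['apple'], ['pie']]
print(parse('"apple pie"'))  # [['apple', 'apple pie', 'pie']]
```

In the unquoted case the analyzer never sees both words at once, so the shingle "apple pie" is never produced; the analysis tool presumably behaves differently because it feeds the whole string to the analyzer directly.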
RE: shingles work in analyzer but not real data
Hi Jeff,

Have you seen PositionFilterFactory?:
<http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.PositionFilterFactory>

Steve

> -Original Message-
> From: Jeff Rose [mailto:j...@globalorange.nl]
> Sent: Thursday, September 02, 2010 9:06 AM
> To: solr-user@lucene.apache.org
> Subject: Re: shingles work in analyzer but not real data
>
> Hi Robert, thanks for the response. I've looked into the query parsers a
> bit and I did find that using the raw parser on a matching multi-word
> keyword works correctly. I need to have shingling though, in order to
> support query phrases. It seems odd to have the query parser emitting
> tokens though. If this is the case why would we ever use the
> WhitespaceTokenizer? Either way, do you know what the correct configuration
> should be to actually perform shingling as it is documented to work: joining
> adjacent tokens into a single search term? (e.g. "apple" "pie" should
> become "apple pie")
>
> Thanks a lot for the help.
>
> -Jeff
>
> P.S. Markus, putting double quotes around the query doesn't seem to have any
> effect. It would be nice to have the analysis debug output on the actual
> queries so that I could see what is being searched for after analysis...
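A rough model of what PositionFilterFactory changes may help here. The sketch below is a simplification and not Solr code: it assumes the query parser builds a phrase-style query when the analyzed tokens span more than one position, and an any-of-these-terms query when they all share one position. The token list, the function names and the "phrase" / "any-of" labels are all invented for illustration.

```python
from typing import List, Tuple

# (term text, position increment); a simplified stand-in for analyzer output
Token = Tuple[str, int]


def shingled_tokens() -> List[Token]:
    # Assumed analyzer output for the quoted query "apple pie":
    # the shingle shares a position with the unigram that starts it.
    return [("apple", 1), ("apple pie", 0), ("pie", 1)]


def position_filter(tokens: List[Token], increment: int = 0) -> List[Token]:
    """Keep the first token's increment; force every later one to `increment`."""
    return [(term, inc if i == 0 else increment)
            for i, (term, inc) in enumerate(tokens)]


def query_shape(tokens: List[Token]) -> str:
    """Return 'phrase' if tokens span several positions, else 'any-of'."""
    positions = set()
    pos = 0
    for _, inc in tokens:
        pos += inc
        positions.add(pos)
    return "phrase" if len(positions) > 1 else "any-of"


if __name__ == "__main__":
    raw = shingled_tokens()
    print(query_shape(raw))                    # phrase
    print(query_shape(position_filter(raw)))   # any-of
```

Under this model, a field indexed as a single keyword term can never satisfy a multi-position phrase, while the flattened version only needs one of the candidate terms ("apple pie" among them) to exist in the index.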
Re: shingles work in analyzer but not real data
On Wed, Sep 1, 2010 at 3:35 PM, Robert Muir wrote:

> query-time shingling probably isn't working with the queryparser you are
> using, the default lucene one first splits on whitespace before sending it
> to the analyzer: e.g. a query of foo bar is processed as TokenStream(foo) +
> TokenStream(bar)
>
> so query-time shingling like this doesn't work as you expect for this
> reason.

Hi Robert, thanks for the response. I've looked into the query parsers a bit and I did find that using the raw parser on a matching multi-word keyword works correctly. I need to have shingling though, in order to support query phrases. It seems odd to have the query parser emitting tokens though. If this is the case why would we ever use the WhitespaceTokenizer?

Either way, do you know what the correct configuration should be to actually perform shingling as it is documented to work: joining adjacent tokens into a single search term? (e.g. "apple" "pie" should become "apple pie")

Thanks a lot for the help.

-Jeff

P.S. Markus, putting double quotes around the query doesn't seem to have any effect. It would be nice to have the analysis debug output on the actual queries so that I could see what is being searched for after analysis...
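For the two things Jeff mentions here (the raw parser, and wanting to see what is actually searched after analysis), a request can be built along the following lines. This is a sketch under assumptions: the base URL and the field name `keyword` are hypothetical, and `build_select_url` is an invented helper. It relies on Solr's `{!raw f=...}` local-params syntax and on the `debugQuery` parameter, which adds the parsed query to the response.

```python
from urllib.parse import urlencode


def build_select_url(base_url: str, query: str, raw_field: str = None,
                     debug: bool = True) -> str:
    """Build a /select URL, optionally via the raw parser and with debug output.

    Hypothetical helper; base_url and raw_field are caller-supplied assumptions.
    """
    # {!raw f=FIELD} hands the whole string to the field as one term,
    # with no whitespace splitting and no analysis.
    q = f"{{!raw f={raw_field}}}{query}" if raw_field else query
    params = {"q": q}
    if debug:
        # Ask Solr to include the parsed query in the response.
        params["debugQuery"] = "true"
    return f"{base_url}/select?{urlencode(params)}"


if __name__ == "__main__":
    # Host, port and field name are placeholders.
    print(build_select_url("http://localhost:8983/solr", "apple pie",
                           raw_field="keyword"))
```

Comparing the parsed query for the raw request against the one for a plain `q=apple pie` request shows whether the shingle term ever reaches the index lookup.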
Re: shingles work in analyzer but not real data
If your use case is limited to this, why don't you enclose all queries in double quotes?

On Wednesday 01 September 2010 14:21:47 Jeff Rose wrote:
> For example a keyword might be "apple pie" and we only want it to match
> for a query containing that word pair, but not one only containing "apple".

Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
Re: shingles work in analyzer but not real data
On Wed, Sep 1, 2010 at 8:21 AM, Jeff Rose wrote:

> Unfortunately, when we do the same queries on top of the actual index it
> responds with zero matches. I can see in the index histogram that the terms
> are correctly indexed from our mysql datasource containing the keywords, but
> somehow the shingling doesn't appear to work on this live data.

Query-time shingling probably isn't working with the query parser you are using. The default Lucene one first splits on whitespace before sending the text to the analyzer: e.g. a query of foo bar is processed as TokenStream(foo) + TokenStream(bar).

So the shingle filter only ever sees one word at a time, and query-time shingling like this doesn't work as you expect.

--
Robert Muir
rcm...@gmail.com
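Robert's point can be reproduced with a toy model. The sketch below is plain Python, not Lucene: `shingle` and `analyze` are simplified stand-ins for a shingle filter and a field analyzer, and the two `parse_*` functions only mimic the order of operations he describes.

```python
from typing import List


def shingle(tokens: List[str], max_size: int = 2) -> List[str]:
    """Return each token plus adjacent-token shingles up to max_size words."""
    out = []
    for i in range(len(tokens)):
        for n in range(1, max_size + 1):
            if i + n <= len(tokens):
                out.append(" ".join(tokens[i:i + n]))
    return out


def analyze(text: str) -> List[str]:
    """Toy field analyzer: whitespace-tokenize, then shingle."""
    return shingle(text.split())


def parse_split_first(query: str) -> List[str]:
    """Mimic the default parser: split on whitespace, analyze each piece alone."""
    terms = []
    for chunk in query.split():
        terms.extend(analyze(chunk))
    return terms


def parse_whole_string(query: str) -> List[str]:
    """What the analysis tool effectively does: analyze the full string at once."""
    return analyze(query)


if __name__ == "__main__":
    print(parse_split_first("apple pie"))   # ['apple', 'pie']
    print(parse_whole_string("apple pie"))  # ['apple', 'apple pie', 'pie']
```

The "apple pie" shingle only appears when the analyzer receives both words together, which matches the difference Jeff saw between the analysis tool and live queries.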
shingles work in analyzer but not real data
Hi,

We are using SOLR to match query strings with a keyword database, where some of the keywords are actually more than one word. For example a keyword might be "apple pie" and we only want it to match for a query containing that word pair, but not one only containing "apple". Here is the relevant piece of the schema.xml, defining the index and query pipelines:

In the analysis tool this schema looks like it works correctly. Our multi-word keywords are indexed as a single entry, and then when a search phrase contains one of these multi-word keywords it is shingled and matched.

Unfortunately, when we do the same queries on top of the actual index it responds with zero matches. I can see in the index histogram that the terms are correctly indexed from our mysql datasource containing the keywords, but somehow the shingling doesn't appear to work on this live data. Does anyone have experience with shingling that might have some tips for us, or otherwise advice for debugging the issue?

Thanks,
Jeff