RE: Keyword Tokenizer Phrase Issue
I have come to the conclusion that this isn't possible because of the way dismax queries are created. I found someone else who had exactly the same issue last year:
http://lucene.472066.n3.nabble.com/Multi-word-exact-keyword-case-insensitive-search-suggestions-td2246516.html

I believe this makes it impossible to do exact matching on multi-word terms with dismax. So I have created two JIRA tickets that hopefully address the issue:

1) A suggested improvement to dismax specific to the KeywordTokenizerFactory: https://issues.apache.org/jira/browse/SOLR-3127
2) What I believe is a bug when removing terms from the query: https://issues.apache.org/jira/browse/SOLR-3128

Feedback welcome.

Thanks
Zac

-----Original Message-----
From: Zac Smith
Sent: Friday, February 10, 2012 3:30 PM
To: 'solr-user@lucene.apache.org'
Subject: RE: Keyword Tokenizer Phrase Issue
RE: Keyword Tokenizer Phrase Issue
I have done some further analysis on this and I am now even more confused. When I use the Field Analysis tool with the text 'chicken stock', it highlights that text as a match. The dismax query looks OK to me:

    +(DisjunctionMaxQuery((ingredient_synonyms:chicken^0.6)~0.01) DisjunctionMaxQuery((ingredient_synonyms:stock^0.6)~0.01)) DisjunctionMaxQuery((ingredient_synonyms:chicken stock^0.6)~0.01)

Then I ran an explainOther, and it shows a failure to meet condition. However, there does seem to be some kind of match registered:

    0.0 = (NON-MATCH) Failure to meet condition(s) of required/prohibited clause(s)
      0.0 = no match on required clause (ingredient_synonyms:chicken^0.6 ingredient_synonyms:stock^0.6)
      0.0650662 = (MATCH) weight(ingredient_synonyms:chicken stock^0.6 in 0), product of:
        0.21204369 = queryWeight(ingredient_synonyms:chicken stock^0.6), product of:
          0.6 = boost
          0.30685282 = idf(docFreq=1, maxDocs=1)
          1.1517122 = queryNorm
        0.30685282 = (MATCH) fieldWeight(ingredient_synonyms:chicken stock in 0), product of:
          1.0 = tf(termFreq(ingredient_synonyms:chicken stock)=1)
          0.30685282 = idf(docFreq=1, maxDocs=1)
          1.0 = fieldNorm(field=ingredient_synonyms, doc=0)

Any ideas? My dismax handler is set up like this:

    <requestHandler name="dismax" class="solr.SearchHandler">
      <lst name="defaults">
        <str name="defType">dismax</str>
        <str name="echoParams">explicit</str>
        <float name="tie">0.01</float>
        <str name="qf">ingredient_synonyms^0.6</str>
        <str name="pf">ingredient_synonyms^0.6</str>
      </lst>
    </requestHandler>

Zac

From: Zac Smith
Sent: Thursday, February 09, 2012 12:52 PM
To: solr-user@lucene.apache.org
Subject: Keyword Tokenizer Phrase Issue
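A note on the explainOther output in this message: the numbers on the optional phrase clause can be reproduced by hand, which helps confirm that the phrase clause really does match and that the 0.0 comes only from the required clause. The sketch below assumes the classic Lucene TF-IDF formula, idf = 1 + ln(maxDocs / (docFreq + 1)); the variable names are illustrative, and queryNorm is simply copied from the output rather than derived.

```python
import math

# Values copied from the explainOther output quoted in this thread.
boost = 0.6
query_norm = 1.1517122   # taken as given, not derived here
doc_freq, max_docs = 1, 1
tf = 1.0
field_norm = 1.0

# Assumption: classic Lucene TF-IDF similarity,
# idf = 1 + ln(maxDocs / (docFreq + 1)).
idf = 1.0 + math.log(max_docs / (doc_freq + 1))

# queryWeight = boost * idf * queryNorm
query_weight = boost * idf * query_norm

# fieldWeight = tf * idf * fieldNorm
field_weight = tf * idf * field_norm

# Score of the phrase (pf) clause on its own.
phrase_score = query_weight * field_weight

print(f"idf          = {idf:.8f}")
print(f"queryWeight  = {query_weight:.8f}")
print(f"fieldWeight  = {field_weight:.8f}")
print(f"phrase score = {phrase_score:.7f}")
```

The phrase clause scores about 0.065, but it is optional. The document is still rejected because the required clause (chicken, stock as separate terms) matches nothing in a field whose only indexed term is the whole string 'chicken stock'.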
RE: Keyword Tokenizer Phrase Issue
Hi Zac,

The Field Analysis tool (analysis.jsp) does not perform actual query parsing. One thing to be aware of when using the Keyword Tokenizer at query time: the query string (chicken stock) is pre-tokenized on whitespace before it reaches the keyword tokenizer. If you use quotes ("chicken stock"), however, the query parser does not pre-tokenize.

--- On Fri, 2/10/12, Zac Smith <z...@trinkit.com> wrote:

From: Zac Smith <z...@trinkit.com>
Subject: RE: Keyword Tokenizer Phrase Issue
To: solr-user@lucene.apache.org
Date: Friday, February 10, 2012, 10:35 AM
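The pre-tokenization Ahmet describes can be illustrated with a small toy model. This is only a sketch and not Solr's actual parser: `pre_tokenize`, `keyword_tokenizer`, and `query_terms` are hypothetical helpers that mimic "split on whitespace unless quoted, then hand each chunk to the field's analyzer".

```python
def keyword_tokenizer(text):
    """Mimic KeywordTokenizerFactory: the whole input becomes one token."""
    return [text] if text else []


def pre_tokenize(query):
    """Toy stand-in for the query parser: split on whitespace, but keep
    a double-quoted span together as a single chunk."""
    chunks, current, in_quotes = [], [], False
    for ch in query:
        if ch == '"':
            in_quotes = not in_quotes
        elif ch.isspace() and not in_quotes:
            if current:
                chunks.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        chunks.append("".join(current))
    return chunks


def query_terms(query):
    """Analyze each pre-tokenized chunk separately, as the parser would."""
    terms = []
    for chunk in pre_tokenize(query):
        terms.extend(keyword_tokenizer(chunk))
    return terms


# Index side: the analyzer sees the full field value at once.
indexed = keyword_tokenizer("chicken stock")

print(indexed)                          # ['chicken stock']
print(query_terms('chicken stock'))     # ['chicken', 'stock']
print(query_terms('"chicken stock"'))   # ['chicken stock']
```

Under this model the unquoted query never produces the term 'chicken stock', so it cannot match the single indexed token, while the quoted query does.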
RE: Keyword Tokenizer Phrase Issue
Thanks, that explains why the individual terms 'chicken' and 'stock' are still in the query (and are required). I have tried a few things to get around this, but to no avail.

First, I changed the query analyzer to use the WhitespaceTokenizerFactory with autoGeneratePhraseQueries=true. This creates the correct phrase query, but the dismax query still requires the individual terms to match ('chicken' and 'stock'):

    +(DisjunctionMaxQuery((ingredient_synonyms:chicken)~0.01) DisjunctionMaxQuery((ingredient_synonyms:stock)~0.01)) DisjunctionMaxQuery((ingredient_synonyms:chicken stock~100)~0.01)

Next, I tried to remove the individual terms during query analysis. I did this using the ShingleFilterFactory, so my query analyzer now looks like this:

    <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.ShingleFilterFactory" outputUnigrams="false" maxShingleSize="2"/>
    </analyzer>

This leaves the single term 'chicken stock' in the query analysis, and the dismax query is:

    +() DisjunctionMaxQuery((ingredient_synonyms:chicken stock)~0.01)

This looks OK except for the +(). It looks like it is requiring an empty clause.

This seems like a pretty simple requirement: to only have exact matches on multi-word text. Am I missing something here?

Thanks
Zac

-----Original Message-----
From: Ahmet Arslan [mailto:iori...@yahoo.com]
Sent: Friday, February 10, 2012 1:50 AM
To: solr-user@lucene.apache.org
Subject: RE: Keyword Tokenizer Phrase Issue
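One possible reading of where the empty +() comes from can be sketched as follows. It assumes, as an unverified simplification, that dismax analyzes each whitespace-separated word separately for the required clause while the phrase (pf) clause analyzes the whole string. `shingles` and `analyze` are hypothetical stand-ins for the WhitespaceTokenizerFactory plus ShingleFilterFactory chain with outputUnigrams=false and maxShingleSize=2, not the Lucene implementation.

```python
def shingles(tokens, max_size=2, output_unigrams=False):
    """Toy shingle filter: emit word n-grams of size 2..max_size,
    plus the single tokens only when output_unigrams is True."""
    out = []
    for i in range(len(tokens)):
        if output_unigrams:
            out.append(tokens[i])
        for size in range(2, max_size + 1):
            if i + size <= len(tokens):
                out.append(" ".join(tokens[i:i + size]))
    return out


def analyze(text):
    """Whitespace tokenization followed by bigram-only shingling."""
    return shingles(text.split(), max_size=2, output_unigrams=False)


user_query = "chicken stock"

# Assumed behaviour of the required clause: one analysis per word.
per_word = [analyze(word) for word in user_query.split()]

# Assumed behaviour of the pf clause: one analysis of the whole string.
whole_phrase = analyze(user_query)

print(per_word)       # [[], []]
print(whole_phrase)   # ['chicken stock']
```

With a single word as input there is nothing to pair it with, so each per-word analysis yields no terms, which would leave a required clause with nothing in it. The phrase analysis still yields 'chicken stock', consistent with the query shown in the message.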
Keyword Tokenizer Phrase Issue
Hi,

I have a simple field type that uses the KeywordTokenizerFactory. I would like to use this so that values in this field are only matched with the full text of the field. For example, if I indexed the text 'chicken stock', searches on this field would only match when searching for 'chicken stock'. If searching for just 'chicken' or just 'stock', there should be no match.

This mostly works, except that if there is more than one word in the text I only get a match when searching with quotes, e.g.

    "chicken stock" (matches)
    chicken stock (doesn't match)

Is there any way I can set this up so that I don't have to provide quotes? I am using dismax, and if I put quotes in it will mess up the search for the rest of my fields.

I had an idea that I could issue a separate search using the regular query parser, but couldn't work out how to do this. I thought I could do something like this:

    qt=dismax&q=fish OR _query_:ingredient:chicken stock

I am using Solr 3.5.0. My field type is:

    <fieldType name="keyword_test" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
      <analyzer type="index">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
      </analyzer>
    </fieldType>

Thanks
Zac