RE: Keyword Tokenizer Phrase Issue

2012-02-12 Thread Zac Smith
I have come to the conclusion that this isn't possible due to the way dismax 
queries are created. I found someone else who had the exact same issue last 
year: 
http://lucene.472066.n3.nabble.com/Multi-word-exact-keyword-case-insensitive-search-suggestions-td2246516.html
I believe this makes it impossible to do exact matching on multi-word terms 
with dismax.

So I have created two JIRA tickets that hopefully address the issue:
1) a suggested improvement to dismax specific to the KeywordTokenizerFactory: 
https://issues.apache.org/jira/browse/SOLR-3127
2) what I believe is a bug when removing terms from the query: 
https://issues.apache.org/jira/browse/SOLR-3128

Feedback welcome.

Thanks
Zac




RE: Keyword Tokenizer Phrase Issue

2012-02-10 Thread Zac Smith
I have done some further analysis on this and I am now even more confused. When 
I use the Field Analysis tool with the text 'chicken stock' it highlights that 
text as a match.
The dismax query looks ok to me:
+(DisjunctionMaxQuery((ingredient_synonyms:chicken^0.6)~0.01) 
DisjunctionMaxQuery((ingredient_synonyms:stock^0.6)~0.01)) 
DisjunctionMaxQuery((ingredient_synonyms:chicken stock^0.6)~0.01)

Then I have done an explainOther and it shows a failure to meet condition. 
However there does seem to be some kind of match registered:
0.0 = (NON-MATCH) Failure to meet condition(s) of required/prohibited clause(s)
  0.0 = no match on required clause (ingredient_synonyms:chicken^0.6 ingredient_synonyms:stock^0.6)
  0.0650662 = (MATCH) weight(ingredient_synonyms:chicken stock^0.6 in 0), product of:
    0.21204369 = queryWeight(ingredient_synonyms:chicken stock^0.6), product of:
      0.6 = boost
      0.30685282 = idf(docFreq=1, maxDocs=1)
      1.1517122 = queryNorm
    0.30685282 = (MATCH) fieldWeight(ingredient_synonyms:chicken stock in 0), product of:
      1.0 = tf(termFreq(ingredient_synonyms:chicken stock)=1)
      0.30685282 = idf(docFreq=1, maxDocs=1)
      1.0 = fieldNorm(field=ingredient_synonyms, doc=0)
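
The numbers in this explain output can be checked by hand. The sketch below is only an illustration in Python, not Solr code: it assumes the classic Lucene 3.x DefaultSimilarity formula idf = 1 + ln(maxDocs / (docFreq + 1)), and `boolean_score` is a hypothetical helper that models a failed required clause zeroing out the whole score, even though the optional phrase clause matched.

```python
import math

# Values copied from the explain output above.
boost = 0.6
query_norm = 1.1517122
tf = 1.0
field_norm = 1.0

# Assumption: Lucene 3.x DefaultSimilarity, idf = 1 + ln(maxDocs / (docFreq + 1)),
# here with docFreq=1 and maxDocs=1.
idf = 1.0 + math.log(1 / (1 + 1))

query_weight = boost * idf * query_norm   # queryWeight line in the explain
field_weight = tf * idf * field_norm      # fieldWeight line in the explain
phrase_score = query_weight * field_weight


def boolean_score(required, optional):
    """Toy model of a boolean query: None means the clause did not match.

    If any required clause fails, the document is a non-match (0.0),
    no matter how well the optional clauses score.
    """
    if any(score is None for score in required):
        return 0.0
    return sum(required) + sum(s for s in optional if s is not None)


# The required term clause (chicken, stock) fails; the optional phrase clause matches.
overall = boolean_score(required=[None], optional=[phrase_score])
print(idf, query_weight, phrase_score, overall)
```

This is why a MATCH line can appear underneath a NON-MATCH: the phrase clause scores, but it is optional and cannot rescue a failed required clause.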

Any ideas?

My dismax handler is set up like this:
  <requestHandler name="dismax" class="solr.SearchHandler">
    <lst name="defaults">
      <str name="defType">dismax</str>
      <str name="echoParams">explicit</str>
      <float name="tie">0.01</float>
      <str name="qf">ingredient_synonyms^0.6</str>
      <str name="pf">ingredient_synonyms^0.6</str>
    </lst>
  </requestHandler>

Zac



RE: Keyword Tokenizer Phrase Issue

2012-02-10 Thread Ahmet Arslan
Hi Zac,

The Field Analysis tool (analysis.jsp) does not perform actual query parsing.

One thing to be aware of when using the Keyword Tokenizer at query time: the 
query string (chicken stock) is pre-tokenized on whitespace before it reaches 
the keyword tokenizer.

If you use quotes ("chicken stock"), though, the query parser does not 
pre-tokenize it.
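
A toy model of that behaviour, in Python rather than Solr code, may make it easier to see. The function names here are hypothetical; the only assumptions are that the parser splits unquoted input on whitespace and that a keyword-style analyzer returns its input as a single token.

```python
import re


def pre_tokenize(query):
    """Split on whitespace, but keep a double-quoted span as one chunk."""
    chunks = []
    for quoted, bare in re.findall(r'"([^"]*)"|(\S+)', query):
        chunk = quoted if quoted else bare
        if chunk:
            chunks.append(chunk)
    return chunks


def keyword_analyze(chunk):
    """Keyword-style analysis: the whole input becomes exactly one token."""
    return [chunk]


def query_terms(query):
    """Terms that reach the index lookup: each chunk is analyzed separately."""
    return [tok for chunk in pre_tokenize(query) for tok in keyword_analyze(chunk)]


print(query_terms('chicken stock'))    # ['chicken', 'stock']
print(query_terms('"chicken stock"'))  # ['chicken stock']
```

The keyword tokenizer never sees the space in the unquoted case, so it cannot produce the single term 'chicken stock' that was indexed.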




RE: Keyword Tokenizer Phrase Issue

2012-02-10 Thread Zac Smith
Thanks, that explains why the individual terms 'chicken' and 'stock' are still 
in the query (and are required).
So I have tried a few things to get around this, but to no avail:

Changed the query analyzer to use the WhitespaceTokenizerFactory with 
autoGeneratePhraseQueries=true. This creates the correct phrase query, but the 
dismax query still requires the individual terms to match ('chicken' and 
'stock'):
+(DisjunctionMaxQuery((ingredient_synonyms:chicken)~0.01) 
DisjunctionMaxQuery((ingredient_synonyms:stock)~0.01)) 
DisjunctionMaxQuery((ingredient_synonyms:chicken stock~100)~0.01)

So the next thing I have tried is to remove the individual terms during the 
query analysis. I did this using the ShingleFilterFactory, so my query analyzer 
now looks like this:
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.ShingleFilterFactory" outputUnigrams="false" maxShingleSize="2"/>
  </analyzer>
This leaves the single term 'chicken stock' in the query analysis and the 
dismax query is:
+() DisjunctionMaxQuery((ingredient_synonyms:chicken stock)~0.01)

Which looks OK except for the +(). It looks like it is requiring an empty 
clause.
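
One plausible reading of the empty +() clause, sketched in Python under stated assumptions: `shingles` is a hypothetical stand-in for a shingle filter with outputUnigrams=false and maxShingleSize=2, not the Lucene implementation. If the query parser hands each whitespace-separated chunk to the analyzer on its own, every chunk is a single token, and a single token cannot form a two-word shingle.

```python
def shingles(tokens, max_shingle_size=2, output_unigrams=False):
    """Return word n-grams built from adjacent tokens.

    With output_unigrams=False only shingles of two or more tokens are
    emitted, so a one-token input yields nothing at all.
    """
    min_size = 1 if output_unigrams else 2
    out = []
    for start in range(len(tokens)):
        for size in range(min_size, max_shingle_size + 1):
            if start + size <= len(tokens):
                out.append(" ".join(tokens[start:start + size]))
    return out


# Whole string analyzed at once (as for the phrase clause): one shingle.
print(shingles("chicken stock".split()))  # ['chicken stock']

# Each pre-tokenized chunk analyzed separately: no tokens survive.
print(shingles(["chicken"]))  # []
print(shingles(["stock"]))    # []
```

Under that assumption the per-term clauses end up with no terms, which would explain a required clause with nothing inside it.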

This seems like a pretty simple requirement: to only have exact matches on 
multi-word text. Am I missing something here?

Thanks
Zac



Keyword Tokenizer Phrase Issue

2012-02-09 Thread Zac Smith
Hi,

I have a simple field type that uses the KeywordTokenizerFactory. I would like 
to use this so that values in this field are only matched with the full text of 
the field.
e.g. If I indexed the text 'chicken stock', searches on this field would only 
match when searching for 'chicken stock'. If searching for just 'chicken' or 
just 'stock', there should be no match.

This mostly works, except if there is more than one word in the text I only get 
a match when searching with quotes. e.g.
"chicken stock" (matches)
chicken stock (doesn't match)

Is there any way I can set this up so that I don't have to provide quotes? I am 
using dismax and if I put quotes in it will mess up the search for the rest of 
my fields. I had an idea that I could issue a separate search using the regular 
query parser, but couldn't work out how to do this:
I thought I could do something like this:
qt=dismax&q=fish OR _query_:"ingredient:chicken stock"

I am using solr 3.5.0. My field type is:
<fieldType name="keyword_test" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
</fieldType>

Thanks
Zac