dismax and WordDelimiterFilterFactory with PreserveOriginal = 1

2010-03-11 Thread Ya-Wen Hsu
Hi all,

I'm facing the same issue as a previous post here:
http://www.mail-archive.com/solr-user@lucene.apache.org/msg19511.html
Since no one answered that post, I thought I'd ask again. In my case, I use
the setting below for the index analyzer:

  <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
          generateNumberParts="1" catenateWords="1" catenateNumbers="1"
          catenateAll="0" splitOnCaseChange="0" preserveOriginal="1"/>

and this one for the query analyzer:

  <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
          generateNumberParts="1" catenateWords="0" catenateNumbers="0"
          catenateAll="0" splitOnCaseChange="0" preserveOriginal="1"/>

When I query for the word ain't, no results are returned. When I turned on
logging, I found that the word is interpreted as (ain't ain) t.

0.0 = (NON-MATCH) Failure to meet condition(s) of required/prohibited clause(s)
0.0 = no match on required clause ((description:(ain't ain) t^2.0 | 
name:(ain't ain) t^3.0 | search_keywords:(ain't ain) t)~0.1)

Does anyone know why ain't is parsed as (ain't ain) t, and how to fix it so
that it matches documents that include ain't in the name? Thanks in advance!

Wen



Re: dismax and WordDelimiterFilterFactory with PreserveOriginal = 1

2010-03-11 Thread Yonik Seeley
On Thu, Mar 11, 2010 at 1:07 PM, Ya-Wen Hsu y...@eline.com wrote:
 Hi all,

 I'm facing the same issue as previous post here:
 http://www.mail-archive.com/solr-user@lucene.apache.org/msg19511.html. Since
 no one answers this post, I thought I'll ask again. In my case, I use below
 setting for index
 <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
 generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"
 splitOnCaseChange="0" preserveOriginal="1"/>
 and
 <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
 generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"
 splitOnCaseChange="0" preserveOriginal="1"/> for query.

 When I use query with word ain't, no result is returned. When I turned on
 the logging, I found the word is interpreted as (ain't ain) t.


The problem is preserving the original in the query analyzer - try
removing that.  And if you aren't doing prefix or wildcard queries,
preserveOriginal doesn't buy you anything but wasted index space.

It's the same issue of why you can't generate and catenate at the same
time with the query parser.
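
To make the point about preserving the original more concrete, here is a rough Python sketch of the token shapes involved. It is a simplified model, not Solr's implementation: `split_word` and `render` are hypothetical helper names, only splitting on non-alphanumeric characters is modeled, and placing the preserved original at the same position as the first part is an assumption read off the debug output quoted in this thread.

```python
import re


def split_word(token, preserve_original=False):
    """Simplified stand-in for a word-delimiter split (not Solr's code).

    Returns a list of positions; each position is a list of the terms
    that share it. Only splitting on non-alphanumeric characters is
    modeled here.
    """
    parts = [p for p in re.split(r"[^A-Za-z0-9]+", token) if p]
    if not parts:
        # Nothing but delimiters: keep the token only if asked to.
        return [[token]] if preserve_original else []
    positions = [[p] for p in parts]
    # Assumption: the preserved original is stacked on the first part.
    if preserve_original and parts != [token]:
        positions[0].insert(0, token)
    return positions


def render(positions):
    """Print stacked terms in parentheses, like the debug output."""
    rendered = []
    for terms in positions:
        if len(terms) == 1:
            rendered.append(terms[0])
        else:
            rendered.append("(" + " ".join(terms) + ")")
    return " ".join(rendered)


print(render(split_word("ain't", preserve_original=True)))
print(render(split_word("ain't", preserve_original=False)))
```

Under these assumptions the first call reproduces the (ain't ain) t shape from the log, and the second gives the plain ain t sequence. Either way a single query word turns into more than one position, which is the situation the query parser has to deal with.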

-Yonik
http://www.lucidimagination.com


Re: dismax and WordDelimiterFilterFactory with PreserveOriginal = 1

2010-03-11 Thread Erick Erickson
Kind of a shot in the dark here, but your WordDelimiterFilterFactory
parameters differ between index and query; catenateWords is especially
suspicious.
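
For reference, a small Python sketch that lists which parameters differ between the two filter definitions in the original post. The dictionaries are copied from that post; `diff_params` is a hypothetical helper name used only for this illustration.

```python
# WordDelimiterFilterFactory attributes as given in the original post.
index_params = {
    "generateWordParts": 1, "generateNumberParts": 1,
    "catenateWords": 1, "catenateNumbers": 1, "catenateAll": 0,
    "splitOnCaseChange": 0, "preserveOriginal": 1,
}
query_params = {
    "generateWordParts": 1, "generateNumberParts": 1,
    "catenateWords": 0, "catenateNumbers": 0, "catenateAll": 0,
    "splitOnCaseChange": 0, "preserveOriginal": 1,
}


def diff_params(index, query):
    """Return {name: (index_value, query_value)} for differing settings."""
    names = sorted(set(index) | set(query))
    return {
        name: (index.get(name), query.get(name))
        for name in names
        if index.get(name) != query.get(name)
    }


differences = diff_params(index_params, query_params)
for name, (index_value, query_value) in differences.items():
    print(f"{name}: index={index_value} query={query_value}")
```

Only catenateWords and catenateNumbers differ in the posted settings. Whether that difference explains the missing matches is not something this sketch can show.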

You could test this by looking in your index with the Solr admin page and/or
Luke to see what your actual terms are.

And don't forget that you'll have to re-index after restarting Solr for any
index changes to take effect.

HTH
Erick

On Thu, Mar 11, 2010 at 2:20 PM, Ya-Wen Hsu y...@eline.com wrote:

 Yonik, thank you for your reply. When I don't use preserveOriginal=1 for
 WordDelimiterFilterFactory, the query ain't is parsed as ain t, and no
 match is found in this case either. If I remove the ' from the query, then I
 do get results. I used the analysis tool and saw that the term ain't is
 processed as ain t, and that it matches when the title includes ain't. But I
 get no results when querying for ain't with dismax.

 The debug output looks like:
 (NON-MATCH) Failure to meet condition(s) of required/prohibited clause(s)
 +(long_description:ain t^2.0 | name:ain t^3.0 | search_keywords:ain
 t)~0.1 (long_description:save^2.0 | name:save^3.0 |
 search_keywords:saved)~0.1) ()


 Below is my configuration for the text field type.

 <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
   <analyzer type="index">
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
             ignoreCase="true" expand="false"/>
     <filter class="solr.StopFilterFactory" ignoreCase="true"
             words="stopwords.txt" enablePositionIncrements="true"/>
     <filter class="solr.WordDelimiterFilterFactory"
             generateWordParts="1" generateNumberParts="1" catenateWords="1"
             catenateNumbers="1" catenateAll="0" splitOnCaseChange="0"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.EnglishPorterFilterFactory"
             protected="protwords.txt"/>
     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
   </analyzer>
   <analyzer type="query">
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <!--<filter class="solr.SynonymFilterFactory"
             synonyms="synonyms.txt" ignoreCase="true" expand="true"/>-->
     <filter class="solr.StopFilterFactory" ignoreCase="true"
             words="stopwords.txt"/>
     <filter class="solr.WordDelimiterFilterFactory"
             generateWordParts="1" generateNumberParts="1" catenateWords="0"
             catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.EnglishPorterFilterFactory"
             protected="protwords.txt"/>
     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
   </analyzer>
 </fieldType>


 I do get results back when I use solr.LowerCaseTokenizerFactory
 instead of solr.WhitespaceTokenizerFactory. However, my concern is that
 this might reduce search relevance. Does anyone have a better
 idea of what to try next? Thanks!

 Wen