Dismax handler - whitespace and special character behaviour

Rohk Tue, 25 Oct 2011 07:33:04 -0700

Hello,

I've got strange results when I have special characters in my query.


Here is my request :

q=histoire-france&start=0&rows=10&sort=score+desc&defType=dismax&qf=any^1.0&mm=100%

Parsed query :

<str name="parsedquery_toString">+((any:histoir any:franc)) ()</str>

I've got 17000 results because Solr is doing an OR (should be AND).

I have no problem when I'm using a whitespace instead of a special char :

q=histoire 
france&start=0&rows=10&sort=score+desc&defType=dismax&qf=any^1.0&mm=100%

<str name="parsedquery_toString">+(((any:histoir) (any:franc))~2) ()</str>

2000 results for this query.

Here is my schema.xml (relevant parts) :

<fieldType name="text" class="solr.TextField"
positionIncrementGap="100" autoGeneratePhraseQueries="false">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1" generateNumberParts="1" catenateWords="1"
catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"
preserveOriginal="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.CommonGramsFilterFactory"
words="stopwords_french.txt" ignoreCase="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords_french.txt" enablePositionIncrements="true"/>
        <filter class="solr.SnowballPorterFilterFactory"
language="French" protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
        <filter class="solr.ASCIIFoldingFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <!--<filter class="solr.SynonymFilterFactory"
synonyms="synonyms.txt" ignoreCase="true" expand="true"/>-->
        <filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1" generateNumberParts="1" catenateWords="0"
catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"
preserveOriginal="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.CommonGramsFilterFactory"
words="stopwords_french.txt" ignoreCase="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords_french.txt" enablePositionIncrements="true"/>
        <filter class="solr.SnowballPorterFilterFactory"
language="French" protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
        <filter class="solr.ASCIIFoldingFilterFactory"/>
      </analyzer>
    </fieldType>

I tried with a PatternTokenizerFactory to tokenize on whitespaces & special
chars but no change...
Even with a charFilter (PatternReplaceCharFilterFactory) to replace special
characters by whitespace, it doesn't work...

First line of analysis via solr admin, with verbose output, for query =
'histoire-france' :

org.apache.solr.analysis.PatternReplaceCharFilterFactory {replacement=
, pattern=([,;./\\'&-]), luceneMatchVersion=LUCENE_32}
text    histoire france

The '-' is replaced by ' ', then tokenized by WhitespaceTokenizerFactory.
However I still have different number of results for 'histoire-france' and
'histoire france'.

My current workaround is to replace all special chars by whitespaces before
sending query to Solr, but it is not satisfying.

Did i miss something ?

Dismax handler - whitespace and special character behaviour

Reply via email to