[jira] Created: (SOLR-1196) Incorrect matches when using non alphanumeric search string !@#$%\^\&\*

Sam Michael (JIRA) Mon, 01 Jun 2009 08:29:16 -0700

Incorrect matches when using non alphanumeric search string !@#$%\^\&\*\(\)
---------------------------------------------------------------------------


                 Key: SOLR-1196
                 URL: https://issues.apache.org/jira/browse/SOLR-1196
             Project: Solr
          Issue Type: Bug
          Components: clients - java
    Affects Versions: 1.3
         Environment: Solr 1.3/ Java 1.6/ Win XP/Eclipse 3.3
            Reporter: Sam Michael


When the search string contains no alphanumeric characters, every document is 
returned as a match. (Nothing actually matches, so nothing should be 
returned.)

When I run a query like  - (activity_type:NAME) AND title:(\...@#$%\^&\*\(\)) 
all the documents are returned even though there is not a single match. There 
is no title that matches the string (which has been escaped).
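
For reference, here is a minimal sketch of the kind of escaping meant by 
"escaped" above. The helper name escape_query_chars is hypothetical, and the 
set of special characters is an assumption based on the Lucene query parser 
syntax; it is not taken from the client code used in this report, and it 
escapes a lone & as well.

```python
# Characters assumed to be special in the Lucene query syntax.
SPECIAL_CHARS = set('+-&|!(){}[]^"~*?:\\')


def escape_query_chars(text):
    """Prefix each special character with a backslash (hypothetical helper)."""
    return "".join("\\" + ch if ch in SPECIAL_CHARS else ch for ch in text)


# Escape the raw search string before placing it in a field query.
escaped = escape_query_chars("!@#$%^&*()")
print("title:(" + escaped + ")")
```

Escaping only stops the query parser from treating these characters as 
operators. It does not affect what the field's analyzer does with them 
afterwards.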

My document structure is as follows:

<doc>
<str name="activity_type">NAME</str>
<str name="title">Bathing</str>
....
</doc> 

The title field is of type text_title which is described below. 

<fieldType name="text_title" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="1" catenateNumbers="1"
            catenateAll="1" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="1" catenateNumbers="1"
            catenateAll="1" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

-----------------------------------------------------
Yonik's analysis is as follows:

<str name="rawquerystring">-features:foo features:(\...@#$%\^&\*\(\))</str>
<str name="querystring">-features:foo features:(\...@#$%\^&\*\(\))</str>
<str name="parsedquery">-features:foo</str>
<str name="parsedquery_toString">-features:foo</str>

The text analysis is throwing away non alphanumeric chars (probably
the WordDelimiterFilter).  The Lucene (and Solr) query parser throws
away term queries when the token is zero length (after analysis).
Solr then interprets the left over "-features:foo" as "all documents
not containing foo in the features field", so you get a bunch of
matches. 
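
The effect described above can be imitated with a small sketch. This is a toy 
model, not Lucene or Solr code: analyze() is a crude stand-in for the analyzer 
chain (it is not the real WordDelimiterFilter), and build_query() and 
is_purely_negative() are hypothetical helpers named here for illustration only.

```python
import re


def analyze(text):
    """Crude stand-in for the analyzer: keep lowercase alphanumeric runs only."""
    tokens = []
    for chunk in text.split():
        tokens.extend(part.lower() for part in re.findall(r"[A-Za-z0-9]+", chunk))
    return tokens


def build_query(clauses):
    """Build clause strings from (prohibited, field, text) tuples.

    A clause whose text analyzes to no tokens is dropped, which imitates
    the query parser discarding a zero-length term.
    """
    kept = []
    for prohibited, field, text in clauses:
        tokens = analyze(text)
        if not tokens:
            continue  # nothing left after analysis: the clause disappears
        prefix = "-" if prohibited else ""
        kept.append(prefix + field + ":" + " ".join(tokens))
    return kept


def is_purely_negative(kept):
    """True when every surviving clause is a prohibited ('-') clause."""
    return bool(kept) and all(clause.startswith("-") for clause in kept)


clauses = [(True, "features", "foo"), (False, "features", "!@#$%^&*()")]
kept = build_query(clauses)
print(" ".join(kept))
print(is_purely_negative(kept))
```

In this model the second clause vanishes and only the prohibited clause is 
left, which is the same shape as the parsedquery shown above. A client could 
run a check along these lines before sending a query, as a possible workaround.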

As he suggested, I am filing this as a bug.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
