[
https://issues.apache.org/jira/browse/SOLR-1196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12717633#action_12717633
]
Oystein Steimler commented on SOLR-1196:
----------------------------------------
This looks similar to the following scenario:
<doc>
<str name="id">1</str>
<str name="phoneno">abc</str>
</doc>
Among other analysis steps, the field 'phoneno' passes through this filter:
<filter class="solr.PatternReplaceFilterFactory" pattern="[^0-9]"
replacement="" replace="all" />
When a dismax handler includes the field phoneno, the document with id=1
matches every query phrase. (I guess this is the same as any query against
that field matching.)
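
One possible reading of the scenario above, sketched in Python rather than taken from Solr itself: if the filter keeps a token even after stripping it down to an empty string, then "abc" is indexed as an empty term, and any query text without digits is analyzed to that same empty term. The names analyze_phoneno, index and search, and the second document, are hypothetical and only illustrate the idea.

```python
import re

def analyze_phoneno(text):
    # Stand-in for the filter step: pattern="[^0-9]", replacement="", replace="all".
    # Assumption: the token is kept even when nothing is left of it.
    return re.sub(r"[^0-9]", "", text)

# Hypothetical single-field inverted index: analyzed term -> set of document ids.
index = {}
for doc_id, phoneno in [("1", "abc"), ("2", "555-0100")]:
    index.setdefault(analyze_phoneno(phoneno), set()).add(doc_id)

def search(query_text):
    # The query text goes through the same analysis before the term lookup.
    return index.get(analyze_phoneno(query_text), set())

print(analyze_phoneno("abc") == "")  # the indexed term for id=1 is empty
print(search("hello"))               # digit-free query text finds id=1
print(search("555 0100"))            # a real phone number still finds id=2
```

If this is what happens, dropping zero-length tokens after the pattern replacement (on both the index and the query side) should avoid the spurious match.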
> Incorrect matches when using non alphanumeric search string !...@#$%\^\&\*\(\)
> ---------------------------------------------------------------------------
>
> Key: SOLR-1196
> URL: https://issues.apache.org/jira/browse/SOLR-1196
> Project: Solr
> Issue Type: Bug
> Components: clients - java
> Affects Versions: 1.3
> Environment: Solr 1.3/ Java 1.6/ Win XP/Eclipse 3.3
> Reporter: Sam Michael
>
> When matching strings that do not include alphanumeric chars, all the data is
> returned as matches. (There is actually no match, so nothing should be
> returned.)
> When I run a query like - (activity_type:NAME) AND title:(\...@#$%\^&\*\(\))
> all the documents are returned even though there is not a single match. There
> is no title that matches the string (which has been escaped).
> My document structure is as follows
> <doc>
> <str name="activity_type">NAME</str>
> <str name="title">Bathing</str>
> ....
> </doc>
> The title field is of type text_title which is described below.
> <fieldType name="text_title" class="solr.TextField"
> positionIncrementGap="100">
> <analyzer type="index">
> <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
> generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1"
> splitOnCaseChange="1"/>
> <filter class="solr.LowerCaseFilterFactory"/>
> <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
> </analyzer>
> <analyzer type="query">
> <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
> ignoreCase="true" expand="true"/>
> <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
> generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1"
> splitOnCaseChange="1"/>
> <filter class="solr.LowerCaseFilterFactory"/>
> <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
> </analyzer>
> </fieldType>
> -----------------------------------------------------
> Yonik's analysis as follows.
> <str name="rawquerystring">-features:foo features:(\...@#$%\^&\*\(\))</str>
> <str name="querystring">-features:foo features:(\...@#$%\^&\*\(\))</str>
> <str name="parsedquery">-features:foo</str>
> <str name="parsedquery_toString">-features:foo</str>
> The text analysis is throwing away non-alphanumeric chars (probably
> the WordDelimiterFilter). The Lucene (and Solr) query parser throws
> away term queries when the token is zero length (after analysis).
> Solr then interprets the leftover "-features:foo" as "all documents
> not containing foo in the features field", so you get a bunch of
> matches.
> As per his suggestion, a bug is filed.
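
The behaviour described in the quoted analysis can be imitated with a small Python sketch. It is a simplified model under stated assumptions, not Solr or Lucene code: analyze is a crude stand-in for the text_title query analyzer, the clause tuples and the three sample documents are hypothetical, and the punctuation-only query string is made up for the example.

```python
import re

def analyze(text):
    # Crude stand-in for the analyzer: keep alphanumeric runs only, lowercased.
    return [t.lower() for t in re.findall(r"[A-Za-z0-9]+", text)]

def parse(clauses):
    # clauses: list of (prohibited, field, raw_text).
    # A clause whose text analyzes to no tokens is dropped entirely.
    parsed = []
    for prohibited, field, raw in clauses:
        tokens = analyze(raw)
        if tokens:
            parsed.append((prohibited, field, tokens))
    return parsed

def matches(fields, field, tokens):
    doc_tokens = analyze(fields.get(field, ""))
    return any(t in doc_tokens for t in tokens)

def search(docs, clauses):
    parsed = parse(clauses)
    positive = [c for c in parsed if not c[0]]
    prohibited = [c for c in parsed if c[0]]
    hits = set()
    for doc_id, fields in docs.items():
        if any(matches(fields, f, t) for _, f, t in prohibited):
            continue
        # With no positive clause left, the query is treated as
        # "all documents minus the prohibited ones".
        if not positive or any(matches(fields, f, t) for _, f, t in positive):
            hits.add(doc_id)
    return hits

docs = {
    "1": {"features": "foo bar"},
    "2": {"features": "baz"},
    "3": {"features": "qux"},
}
# -features:foo plus a punctuation-only clause on the same field.
query = [(True, "features", "foo"), (False, "features", "#$%^&*()")]

print(parse(query))         # only the prohibited clause survives analysis
print(search(docs, query))  # every document without "foo" is returned
```

In this model the punctuation-only clause disappears during parsing, so the result is decided by the negative clause alone, which matches the parsedquery output shown in the issue.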