KuroSaka TeruHiko created SOLR-3962:
---------------------------------------

             Summary: For the match-all-docs query *:*, the (e)dismax parser passes 
"*:*" to the tokenizer. Under certain conditions, a suboptimal (<1.0) hit score is 
reported.
                 Key: SOLR-3962
                 URL: https://issues.apache.org/jira/browse/SOLR-3962
             Project: Solr
          Issue Type: Bug
          Components: query parsers
    Affects Versions: 4.0, 3.6, 3.5
            Reporter: KuroSaka TeruHiko


My understanding is that the special match-all-docs query "\*:\*" shouldn't 
call tokenizers and all hits should have score 1.0.  In fact, this is usually 
the case.

However, when all of the following conditions are met, suboptimal (<1.0) hit 
scores are reported:
* dismax or edismax parser is used
* a tokenizer that splits "\*:\*" into multiple tokens is used
* pf parameter is specified for a field that uses the above tokenizer
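
The second condition can be illustrated with a minimal Python sketch. This is not Solr code: `unigram_tokens` and `phrase_clause` are hypothetical helpers that only mimic a tokenizer emitting one token per character and the way a phrase clause over those tokens would be rendered.

```python
def unigram_tokens(text):
    """Mimic a tokenizer that emits one token per character (hypothetical helper)."""
    return list(text)


def phrase_clause(field, text):
    """Render the tokens as a quoted phrase clause on the given field (illustration only)."""
    return '{}:"{}"'.format(field, " ".join(unigram_tokens(text)))


tokens = unigram_tokens("*:*")
clause = phrase_clause("name", "*:*")
print(tokens)
print(clause)
```

Once the query string is split this way, a pf field receives an ordinary three-token phrase query instead of being skipped.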


Use case:
* We created a Japanese tokenizer which happens to break "\*:\*" into three 
tokens, one for each symbol.
* Our client uses this tokenizer for Japanese with edismax on Solr 3.6.
* They have pf=text^0.5 in the default section in solrconfig.xml.
* When a search is done with the query string "\*:\*", all the hits from Japanese 
documents have scores much lower than 1.0.

Below is how to simulate this situation with an NGramTokenizer.  (This setup 
is not realistic.)

1. Run Solr with the default settings.  Post all *.xml docs in 
example/exampledocs.
2. Stop Solr.
3. Add this fieldType:
{noformat}
    <fieldtype name="text_fake" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.NGramTokenizerFactory"
                   maxGramSize="1"
                   minGramSize="1"/>
      </analyzer>
    </fieldtype>
{noformat}
4. Change the field definition of "name" to use "text_fake".
5. Restart Solr.
6. GET this URL:
http://localhost:8983/solr/select?indent=on&version=2.2&q=*%3A*&fq=&start=0&rows=10&fl=*%2Cscore&qt=&wt=&debugQuery=on&defType=edismax&pf=name
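
For scripted reproduction, the same request URL can be assembled from its parameters. The following is a minimal sketch using only the Python standard library; the variable names are illustrative, and the host and port are simply those of the URL above.

```python
from urllib.parse import urlencode

# Parameters in the same order as the URL in step 6.
params = [
    ("indent", "on"),
    ("version", "2.2"),
    ("q", "*:*"),
    ("fq", ""),
    ("start", "0"),
    ("rows", "10"),
    ("fl", "*,score"),
    ("qt", ""),
    ("wt", ""),
    ("debugQuery", "on"),
    ("defType", "edismax"),
    ("pf", "name"),
]

# safe="*" keeps the asterisks literal, as in the URL above.
url = "http://localhost:8983/solr/select?" + urlencode(params, safe="*")
print(url)
```

Removing the `pf` entry from the list gives a request that can be compared against this one; per the conditions listed earlier, the low scores are expected only when pf is specified.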

Below is an excerpt of query debug output.  Notice that "\*:\*" is expanded 
with spaces to "\* : \*":
{noformat}
...
<doc>
<str name="id">ati</str>
<str name="compName_s">ATI Technologies</str>
<str name="address_s">
33 Commerce Valley Drive East Thornhill, ON L3T 7N6 Canada
</str>
<long name="_version_">1415830106362871808</long>
<float name="score">0.07443535</float>
</doc>
</result>
<lst name="debug">
<str name="rawquerystring">*:*</str>
<str name="querystring">*:*</str>
<str name="parsedquery">
(+MatchAllDocsQuery(*:*) DisjunctionMaxQuery((name:"* : *")))/no_coord
</str>
{noformat}
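
A check for the symptom can be sketched as follows. This is only an illustration: `suboptimal_hits` is a hypothetical helper, and `SAMPLE` is a trimmed, hand-wrapped fragment based on the excerpt above rather than a complete Solr response.

```python
import xml.etree.ElementTree as ET

# Trimmed fragment based on the excerpt above (not a full response).
SAMPLE = """<result>
<doc>
<str name="id">ati</str>
<float name="score">0.07443535</float>
</doc>
</result>"""


def suboptimal_hits(xml_text, expected=1.0):
    """Return (id, score) pairs for docs scoring below the expected match-all score."""
    root = ET.fromstring(xml_text)
    hits = []
    for doc in root.iter("doc"):
        doc_id = doc.findtext("str[@name='id']")
        score = float(doc.findtext("float[@name='score']"))
        if score < expected:
            hits.append((doc_id, score))
    return hits


flagged = suboptimal_hits(SAMPLE)
print(flagged)
```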

And here is a partial stack trace at the time the tokenizer is called from the 
query parser:
{noformat}
NGramTokenizer.incrementToken() line: 112
CachingTokenFilter.fillCache() line: 90
CachingTokenFilter.incrementToken() line: 55
ExtendedDismaxQParser$ExtendedSolrQueryParser(QueryParserBase).newFieldQuery(Analyzer, String, String, boolean) line: 513
ExtendedDismaxQParser$ExtendedSolrQueryParser.newFieldQuery(Analyzer, String, String, boolean) line: 1018
ExtendedDismaxQParser$ExtendedSolrQueryParser(QueryParserBase).getFieldQuery(String, String, boolean) line: 474
ExtendedDismaxQParser$ExtendedSolrQueryParser(SolrQueryParser).getFieldQuery(String, String, boolean) line: 169
ExtendedDismaxQParser$ExtendedSolrQueryParser.getQuery() line: 1163
ExtendedDismaxQParser$ExtendedSolrQueryParser.getAliasedQuery() line: 1105
ExtendedDismaxQParser$ExtendedSolrQueryParser.getQueries(Alias) line: 1145
ExtendedDismaxQParser$ExtendedSolrQueryParser.getAliasedQuery() line: 1073
ExtendedDismaxQParser$ExtendedSolrQueryParser.getFieldQuery(String, String, int) line: 989
ExtendedDismaxQParser$ExtendedSolrQueryParser(QueryParserBase).handleQuotedTerm(String, Token, Token) line: 1082
ExtendedDismaxQParser$ExtendedSolrQueryParser(QueryParser).Term(String) line: 462
ExtendedDismaxQParser$ExtendedSolrQueryParser(QueryParser).Clause(String) line: 257
ExtendedDismaxQParser$ExtendedSolrQueryParser(QueryParser).Query(String) line: 181
ExtendedDismaxQParser$ExtendedSolrQueryParser(QueryParser).TopLevelQuery(String) line: 170
ExtendedDismaxQParser$ExtendedSolrQueryParser(QueryParserBase).parse(String) line: 120
ExtendedDismaxQParser.addShingledPhraseQueries(BooleanQuery, List<Clause>, Map<String,Float>, int, float, int) line: 506
ExtendedDismaxQParser.parse() line: 338
ExtendedDismaxQParser(QParser).getQuery() line: 143
QueryComponent.prepare(ResponseBuilder) line: 118
SearchHandler.handleRequestBody(SolrQueryRequest, SolrQueryResponse) line: 192
SearchHandler(RequestHandlerBase).handleRequest(SolrQueryRequest, SolrQueryResponse) line: 129
...
{noformat}


