[jira] [Updated] (SOLR-3962) For the match-all-docs query *:*, (e)dismax parser passes "*:*" to tokenizer, sub-optimal (<1.0) hit scores

KuroSaka TeruHiko (JIRA) Fri, 19 Oct 2012 12:34:14 -0700

     [ 
https://issues.apache.org/jira/browse/SOLR-3962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


KuroSaka TeruHiko updated SOLR-3962:
------------------------------------

    Summary: For the match-all-docs query *:*, (e)dismax parser passes "*:*" to 
tokenizer, sub-optimal (<1.0) hit scores  (was: For the match-all-docs query 
*:*, (e)dismax parser passes "*:*" to tokenizer. Under certain conditions, hit 
suboptimal (<1.0) score is reported.)
    
> For the match-all-docs query *:*, (e)dismax parser passes "*:*" to tokenizer, 
> sub-optimal (<1.0) hit scores
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-3962
>                 URL: https://issues.apache.org/jira/browse/SOLR-3962
>             Project: Solr
>          Issue Type: Bug
>          Components: query parsers
>    Affects Versions: 3.5, 3.6, 4.0
>            Reporter: KuroSaka TeruHiko
>
> My understanding is that the special match-all-docs query "*:*" should not 
> invoke any tokenizer, and that every hit should have a score of 1.0.  This is 
> usually the case.
> However, when all of the following conditions are met, suboptimal (<1.0) hit 
> scores are reported:
> * the dismax or edismax parser is used
> * the tokenizer in use splits "*:*" into multiple tokens
> * the pf parameter is specified for a field that uses that tokenizer
> Use case:
> * We created a Japanese tokenizer that happens to break "*:*" into three 
> tokens, one per symbol.
> * Our client uses this tokenizer for Japanese with edismax on Solr 3.6.
> * They have pf=text^0.5 in the defaults section of solrconfig.xml.
> * When a search is run with the query string "*:*", all hits from Japanese 
> documents have scores much lower than 1.0.
> The steps below reproduce the situation with an NGramTokenizer.  (This setup 
> is not realistic; it only simulates the behavior.)
> 1. Run Solr with the default settings.  Post all *.xml docs in 
> example/exampledocs.
> 2. Stop Solr.
> 3. Add this fieldType:
> {noformat}
>     <fieldtype name="text_fake" class="solr.TextField" positionIncrementGap="100">
>       <analyzer type="index">
>         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>       </analyzer>
>       <analyzer type="query">
>         <tokenizer class="solr.NGramTokenizerFactory"
>            maxGramSize="1"
>            minGramSize="1" />
>       </analyzer>
>     </fieldtype>
> {noformat}
> 4. Change the field definition of "name" to use "text_fake".
> 5. Restart Solr
> 6. GET this URL:
> http://localhost:8983/solr/select?indent=on&version=2.2&q=*%3A*&fq=&start=0&rows=10&fl=*%2Cscore&qt=&wt=&debugQuery=on&defType=edismax&pf=name
> Below is an excerpt of the query debug output.  Notice that "*:*" is expanded 
> with spaces to "* : *":
> {noformat}
> ...
> <doc>
> <str name="id">ati</str>
> <str name="compName_s">ATI Technologies</str>
> <str name="address_s">
> 33 Commerce Valley Drive East Thornhill, ON L3T 7N6 Canada
> </str>
> <long name="_version_">1415830106362871808</long>
> <float name="score">0.07443535</float>
> </doc>
> </result>
> <lst name="debug">
> <str name="rawquerystring">*:*</str>
> <str name="querystring">*:*</str>
> <str name="parsedquery">
> (+MatchAllDocsQuery(*:*) DisjunctionMaxQuery((name:"* : *")))/no_coord
> </str>
> {noformat}
> And here is a partial stack trace at the time the tokenizer is called from 
> the query parser:
> {noformat}
> NGramTokenizer.incrementToken() line: 112
> CachingTokenFilter.fillCache() line: 90
> CachingTokenFilter.incrementToken() line: 55
> ExtendedDismaxQParser$ExtendedSolrQueryParser(QueryParserBase).newFieldQuery(Analyzer, String, String, boolean) line: 513
> ExtendedDismaxQParser$ExtendedSolrQueryParser.newFieldQuery(Analyzer, String, String, boolean) line: 1018
> ExtendedDismaxQParser$ExtendedSolrQueryParser(QueryParserBase).getFieldQuery(String, String, boolean) line: 474
> ExtendedDismaxQParser$ExtendedSolrQueryParser(SolrQueryParser).getFieldQuery(String, String, boolean) line: 169
> ExtendedDismaxQParser$ExtendedSolrQueryParser.getQuery() line: 1163
> ExtendedDismaxQParser$ExtendedSolrQueryParser.getAliasedQuery() line: 1105
> ExtendedDismaxQParser$ExtendedSolrQueryParser.getQueries(Alias) line: 1145
> ExtendedDismaxQParser$ExtendedSolrQueryParser.getAliasedQuery() line: 1073
> ExtendedDismaxQParser$ExtendedSolrQueryParser.getFieldQuery(String, String, int) line: 989
> ExtendedDismaxQParser$ExtendedSolrQueryParser(QueryParserBase).handleQuotedTerm(String, Token, Token) line: 1082
> ExtendedDismaxQParser$ExtendedSolrQueryParser(QueryParser).Term(String) line: 462
> ExtendedDismaxQParser$ExtendedSolrQueryParser(QueryParser).Clause(String) line: 257
> ExtendedDismaxQParser$ExtendedSolrQueryParser(QueryParser).Query(String) line: 181
> ExtendedDismaxQParser$ExtendedSolrQueryParser(QueryParser).TopLevelQuery(String) line: 170
> ExtendedDismaxQParser$ExtendedSolrQueryParser(QueryParserBase).parse(String) line: 120
> ExtendedDismaxQParser.addShingledPhraseQueries(BooleanQuery, List<Clause>, Map<String,Float>, int, float, int) line: 506
> ExtendedDismaxQParser.parse() line: 338
> ExtendedDismaxQParser(QParser).getQuery() line: 143
> QueryComponent.prepare(ResponseBuilder) line: 118
> SearchHandler.handleRequestBody(SolrQueryRequest, SolrQueryResponse) line: 192
> SearchHandler(RequestHandlerBase).handleRequest(SolrQueryRequest, SolrQueryResponse) line: 129
> ...
> {noformat}
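
The phrase clause name:"* : *" seen in the debug output of the report can be
illustrated with a small standalone sketch.  It does not use Solr or Lucene:
unigram_tokenize, whitespace_tokenize and phrase_clause are hypothetical
helpers that only mimic, under simplifying assumptions, what a query analyzer
with minGramSize=1 and maxGramSize=1 would do to the string "*:*" when it is
handed over as phrase (pf) text.

```python
MATCH_ALL = "*:*"


def unigram_tokenize(text):
    # One token per character, roughly like an n-gram tokenizer
    # configured with min and max gram size 1 (an approximation).
    return list(text)


def whitespace_tokenize(text):
    # Split on whitespace only, so "*:*" stays a single token.
    return text.split()


def phrase_clause(field, query, tokenize):
    # Render the analyzed tokens as a quoted phrase on the given field,
    # in the style of the parsedquery debug output.
    return '%s:"%s"' % (field, " ".join(tokenize(query)))


if __name__ == "__main__":
    print(phrase_clause("name", MATCH_ALL, unigram_tokenize))
    print(phrase_clause("name", MATCH_ALL, whitespace_tokenize))
```

With the unigram helper the string becomes three tokens, which is how a
three-term phrase can end up next to MatchAllDocsQuery in the parsed query;
with the whitespace helper it stays a single token.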
