[ 
https://issues.apache.org/jira/browse/LUCENE-6789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-6789:
--------------------------------
    Attachment: LUCENE-6789.patch

> change IndexSearcher default similarity to BM25
> -----------------------------------------------
>
>                 Key: LUCENE-6789
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6789
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Robert Muir
>             Fix For: 6.0
>
>         Attachments: LUCENE-6789.patch
>
>
> Since Lucene 4.0, the statistics needed for this are always present, so we 
> can make the change without any degradation.
> I think the change should be a 6.0 change only: it will prevent any 
> surprises. DefaultSimilarity is renamed to ClassicSimilarity to prevent 
> confusion. No indexing change is needed as we use the same norm format; it's
> just a runtime switch. Users can just do IndexSearcher.setSimilarity(new
> ClassicSimilarity()) to get the old behavior. I did not change Solr's
> default here; I think that should be a separate issue, since it has more
> concerns, e.g. factories in configuration files and so on.
> One issue was the generation of synonym queries (posinc=0) by QueryBuilder 
> (used by parsers). This is kind of a corner case (query-time synonyms), but 
> we should make it nicer. The current code in trunk disables coord, which 
> makes no sense for anything but the vector space impl. Instead, this patch 
> adds a SynonymQuery which treats occurrences of any term as a single 
> pseudoterm. With English WordNet as a query-time synonym dict, this query
> gives 12% improvement in MAP for title queries on BM25, and 2% with Classic
> (not significant). So it's a better generic approach for synonyms that works
> with all scoring models.
> I wanted to use BlendedTermQuery, but at a glance it seems to have problems:
> it tries to "take on the world", and it does not work with distributed
> scoring (it doesn't consult IndexSearcher for stats). Anyway, this one is a
> different, simpler approach, which only works for a single field and which
> calls tf(sum) a single time.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

