[ 
https://issues.apache.org/jira/browse/LUCENE-8633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16740265#comment-16740265
 ] 

Alan Woodward commented on LUCENE-8633:
---------------------------------------

Attached is a patch with an alternative scoring system:
* Sloppy frequency is calculated as the sum of individual interval scores.  
Each interval is scored as 1/(length - minExtent + 1), where minExtent() is a 
new method on IntervalsSource that exposes the minimum possible length of an 
interval produced by that source.  This is based on the scoring mechanism 
described in Vigna's paper on minimal interval semantics[1].
* To keep the score bounded, so that it can be used as a proximity 
boost without wrecking max-score optimizations, the sloppy frequency is 
converted to a score using a saturation function.  I've chosen 5 as the pivot 
more or less at random (meaning that a document containing 5 intervals of 
minimum possible length will get a score of boost * 0.5); better ways of 
choosing a pivot are welcome.
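
As a rough illustration of the arithmetic described in the two points above 
(this is not the code from the patch), here is a minimal standalone sketch.  
The class and method names are hypothetical, and the saturation function is 
assumed to have the form freq / (freq + pivot), which matches the stated 
behaviour that a sloppy frequency of 5 with a pivot of 5 yields boost * 0.5.

```java
// Hypothetical sketch only: names and structure are illustrative, not Lucene's API.
final class IntervalScoreSketch {

    // Pivot of the assumed saturation function; 5 is the value mentioned in the comment.
    static final float PIVOT = 5f;

    // Score of one interval: 1 / (length - minExtent + 1).
    // An interval of the minimum possible length scores exactly 1.
    static float intervalScore(int length, int minExtent) {
        return 1.0f / (length - minExtent + 1);
    }

    // Sloppy frequency: the sum of the individual interval scores in a document.
    static float sloppyFreq(int[] intervalLengths, int minExtent) {
        float freq = 0f;
        for (int length : intervalLengths) {
            freq += intervalScore(length, minExtent);
        }
        return freq;
    }

    // Assumed saturation form: boost * freq / (freq + pivot), bounded above by boost.
    static float score(float boost, float sloppyFreq) {
        return boost * sloppyFreq / (sloppyFreq + PIVOT);
    }

    public static void main(String[] args) {
        // Five intervals, each of the minimum possible length (minExtent = 2).
        float freq = sloppyFreq(new int[] {2, 2, 2, 2, 2}, 2);
        System.out.println(score(1.0f, freq)); // prints 0.5
    }
}
```

Because the saturated value approaches but never reaches boost, the boost 
itself serves as a fixed upper bound on the score, which is the property 
that max-score optimizations rely on.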

[1] 
http://vigna.di.unimi.it/ftp/papers/EfficientAlgorithmsMinimalIntervalSemantics.pdf

> Remove term weighting from interval scoring
> -------------------------------------------
>
>                 Key: LUCENE-8633
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8633
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Alan Woodward
>            Assignee: Alan Woodward
>            Priority: Major
>         Attachments: LUCENE-8633.patch
>
>
> IntervalScorer currently uses the same scoring mechanism as SpanScorer, 
> summing the IDF of all possibly matching terms from its parent 
> IntervalsSource and using that in conjunction with a sloppy frequency to 
> produce a similarity-based score.  This doesn't really make sense, however, 
> as it means that terms that don't appear in a document can still contribute 
> to the score, and it makes scores from interval queries appear comparable 
> with scores from term or phrase queries when they really aren't.
> I'd like to explore a different scoring mechanism for intervals, based purely 
> on sloppy frequency and ignoring term weighting.  This should make the scores 
> easier to reason about, as well as making them useful for things like 
> proximity boosting on boolean queries.



