[ https://issues.apache.org/jira/browse/LUCENE-8633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16740265#comment-16740265 ]
Alan Woodward commented on LUCENE-8633:
---------------------------------------

Attached is a patch with an alternative scoring system:

* Sloppy frequency is calculated as the sum of individual interval scores. Each interval is scored as 1/(length - minExtent + 1), where minExtent() is a new method on IntervalsSource that exposes the minimum possible length of an interval produced by that source. This is based on the scoring mechanism described in Vigna's paper describing intervals [1].
* In order to keep the score bounded so that it can be used as a proximity boost without wrecking max-score optimizations, the sloppy frequency is converted to a score using a saturation function. I've chosen 5 as a pivot here more-or-less at random (meaning that documents containing 5 intervals of minimum possible length will get a score of boost * 0.5) - better ways of choosing a pivot are welcome.

[1] http://vigna.di.unimi.it/ftp/papers/EfficientAlgorithmsMinimalIntervalSemantics.pdf

> Remove term weighting from interval scoring
> -------------------------------------------
>
>                 Key: LUCENE-8633
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8633
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Alan Woodward
>            Assignee: Alan Woodward
>            Priority: Major
>         Attachments: LUCENE-8633.patch
>
>
> IntervalScorer currently uses the same scoring mechanism as SpanScorer,
> summing the IDF of all possibly matching terms from its parent
> IntervalsSource and using that in conjunction with a sloppy frequency to
> produce a similarity-based score. This doesn't really make sense, however,
> as it means that terms that don't appear in a document can still contribute
> to the score, and appears to make scores from interval queries comparable
> with scores from term or phrase queries when they really aren't.
> I'd like to explore a different scoring mechanism for intervals, based purely
> on sloppy frequency and ignoring term weighting.
> This should make the scores easier to reason about, as well as making them useful for things like
> proximity boosting on boolean queries.
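
The scoring scheme described in the comment can be sketched roughly as follows. This is an illustration only, not the code from the attached patch: the class and method names are hypothetical, interval length is assumed to be the number of positions an interval spans, and the saturation function is assumed to have the common freq / (freq + pivot) form, which is consistent with the stated behaviour (5 minimum-length intervals with a pivot of 5 give boost * 0.5).

```java
/** Illustrative sketch of the proposed interval scoring; names are hypothetical. */
final class IntervalScoreSketch {

    /**
     * Score of a single interval: 1 / (length - minExtent + 1).
     * An interval of the minimum possible length scores 1; looser ones score less.
     */
    static double intervalScore(int length, int minExtent) {
        if (length < minExtent) {
            throw new IllegalArgumentException("length must be >= minExtent");
        }
        return 1.0 / (length - minExtent + 1);
    }

    /** Sloppy frequency: the sum of the individual interval scores in a document. */
    static double sloppyFrequency(int[] lengths, int minExtent) {
        double freq = 0.0;
        for (int length : lengths) {
            freq += intervalScore(length, minExtent);
        }
        return freq;
    }

    /**
     * Assumed saturation form: boost * freq / (freq + pivot).
     * The result stays below boost however many intervals match.
     */
    static double score(double sloppyFreq, double pivot, double boost) {
        return boost * sloppyFreq / (sloppyFreq + pivot);
    }

    public static void main(String[] args) {
        int minExtent = 2;                 // e.g. a source that needs two adjacent terms
        int[] tight = {2, 2, 2, 2, 2};     // five intervals of minimum possible length
        double freq = sloppyFrequency(tight, minExtent);
        System.out.println(score(freq, 5.0, 1.0)); // prints 0.5

        int[] loose = {4, 6};              // wider intervals contribute less each
        System.out.println(score(sloppyFrequency(loose, minExtent), 5.0, 1.0));
    }
}
```

Because the final score is bounded above by the boost, an upper bound per query is known in advance, which is what lets such a score be used as a proximity boost alongside max-score optimizations.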