[
https://issues.apache.org/jira/browse/LUCENE-8053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16259354#comment-16259354
]
Robert Muir commented on LUCENE-8053:
-------------------------------------
It doesn't need to be the same token: discount_overlaps (which is enabled by
default) means that all tokens with posinc=0 are dropped from the length.
I don't think we should close the issue, because it would be nice for the sim
not to have to deal with this case. I just think it wouldn't help unless we also
removed discount_overlaps completely, so that the length always "makes sense".
We could try benchmarking this across all of our sims again; maybe it's really
not needed. But I am not confident this is the case: last time I checked, it was
important because there are a lot of cases where "artificial" tokens are added
(e.g. WDF/commongrams/etc.) and this prevents skew. See LUCENE-8000 for more
details.
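
To make the discount_overlaps point concrete, here is a minimal sketch in plain Java. It does not use Lucene classes: the class name, the fieldLength helper, and the token stream are all hypothetical, and the only thing taken from the comment above is the rule that tokens with posinc=0 are left out of the length when discount_overlaps is on.

```java
public class DiscountOverlapsSketch {

    /**
     * Counts the tokens of a field from their position increments.
     * When discountOverlaps is true, tokens with posInc == 0 (tokens stacked
     * on the previous position) are not counted. Hypothetical helper, not a
     * Lucene API.
     */
    public static int fieldLength(int[] positionIncrements, boolean discountOverlaps) {
        int length = 0;
        for (int posInc : positionIncrements) {
            if (discountOverlaps && posInc == 0) {
                continue; // overlapping token: dropped from the length
            }
            length++;
        }
        return length;
    }

    public static void main(String[] args) {
        // Assumed stream of four tokens where the third one is an "artificial"
        // token injected at the same position as the second (posInc == 0).
        int[] posIncs = {1, 1, 0, 1};
        System.out.println("discounted length: " + fieldLength(posIncs, true));
        System.out.println("raw length:        " + fieldLength(posIncs, false));
    }
}
```

With this stream, four tokens are indexed but the discounted length is three, so the sum of term frequencies already exceeds the length even though no token is repeated.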
> Similarities should round the length up
> ---------------------------------------
>
> Key: LUCENE-8053
> URL: https://issues.apache.org/jira/browse/LUCENE-8053
> Project: Lucene - Core
> Issue Type: Bug
> Reporter: Adrien Grand
> Priority: Minor
>
> The encoding that we use for lengths currently rounds down in case the length
> cannot be stored accurately. We should round up instead so that frequencies
> can never be larger than the length.
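
As an illustration of the rounding question in the issue description, the sketch below uses a toy lossy encoding that keeps only the three most significant bits of the length. This is an assumption made for the example, not the encoding Lucene actually uses, and the class and method names are hypothetical. It only shows why rounding down can leave a frequency larger than the stored length while rounding up cannot.

```java
public class LengthRoundingSketch {

    // Number of low-order bits the toy encoding drops for this length.
    private static int droppedBits(int length) {
        return Math.max(0, 32 - Integer.numberOfLeadingZeros(length) - 3);
    }

    /** Largest representable value that is <= length (toy encoding). */
    public static int roundDown(int length) {
        int shift = droppedBits(length);
        return (length >>> shift) << shift;
    }

    /**
     * Smallest representable value that is >= length (toy encoding).
     * Assumes lengths far below Integer.MAX_VALUE, so the addition cannot overflow.
     */
    public static int roundUp(int length) {
        int down = roundDown(length);
        if (down == length) {
            return length; // already representable, nothing to round
        }
        return down + (1 << droppedBits(length));
    }

    public static void main(String[] args) {
        // Assumed document: nine occurrences of a single term, so freq == length == 9.
        int freq = 9;
        int length = 9;
        System.out.println("rounded down: " + roundDown(length)
                + ", freq > length? " + (freq > roundDown(length)));
        System.out.println("rounded up:   " + roundUp(length)
                + ", freq > length? " + (freq > roundUp(length)));
    }
}
```

Because the rounded-up value is never smaller than the true length, any frequency bounded by the true length stays bounded by the stored one. As the comment above notes, that bound on the true length itself only holds if overlapping tokens are counted.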