[ 
https://issues.apache.org/jira/browse/LUCENE-8053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16259354#comment-16259354
 ] 

Robert Muir commented on LUCENE-8053:
-------------------------------------

Doesn't need to be the same token: discount_overlaps (which is enabled by 
default) means that all tokens with posinc=0 are dropped from the length.

I don't think we should close the issue, because it would be nice for the 
similarity not to have to deal with this case. I just think it wouldn't help 
unless we also removed discount_overlaps completely, so that length always 
"makes sense". We could try benchmarking this across all of our sims again; 
maybe it's really not needed. But I am not confident this is the case: last 
time I checked, it was important because there are a lot of cases where 
"artificial" tokens are added (e.g. WDF/commongrams/etc.) and this prevents 
skew. See LUCENE-8000 for more details.
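
To make the discount_overlaps point concrete, here is a minimal sketch, not 
Lucene's actual code. The class and method names are hypothetical, and a token 
stream is modeled as two parallel arrays (term text and position increment). 
It only illustrates that tokens with posinc=0 are left out of the length while 
they still count toward term frequency, so a frequency can exceed the length 
regardless of how the length is rounded.

```java
final class DiscountOverlapsSketch {

    // Length as a similarity would see it: when discountOverlaps is true,
    // tokens stacked on a previous position (posinc == 0) are not counted.
    static int fieldLength(int[] posIncs, boolean discountOverlaps) {
        int length = 0;
        for (int posInc : posIncs) {
            if (discountOverlaps && posInc == 0) {
                continue; // overlapping ("artificial") token, dropped from the length
            }
            length++;
        }
        return length;
    }

    // Term frequency counts every occurrence, including posinc == 0 tokens.
    static int termFrequency(String[] terms, String term) {
        int freq = 0;
        for (String t : terms) {
            if (t.equals(term)) {
                freq++;
            }
        }
        return freq;
    }

    public static void main(String[] args) {
        // Hypothetical analysis output: "a" is emitted twice at the same position.
        String[] terms = {"a", "a"};
        int[] posIncs = {1, 0};

        int length = fieldLength(posIncs, true);
        int freq = termFrequency(terms, "a");
        System.out.println("length=" + length + " freq(a)=" + freq);
    }
}
```

With discounting enabled the length in this example is 1 while the frequency 
of "a" is 2. With discounting disabled the length would be 2.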


> Similarities should round the length up
> ---------------------------------------
>
>                 Key: LUCENE-8053
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8053
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Adrien Grand
>            Priority: Minor
>
> The encoding that we use for lengths currently rounds down in case the length 
> cannot be stored accurately. We should round up instead so that frequencies 
> can never be larger than the length.
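
The rounding concern in the issue description can be illustrated with a toy 
lossy encoding. This is a sketch under stated assumptions, not the encoding 
Lucene uses: the class name, the KEPT_BITS constant, and the "keep only the 
top few significant bits" scheme are all hypothetical. It shows why rounding 
down can leave a stored length smaller than a term frequency, and why rounding 
up avoids that.

```java
final class LengthRoundingSketch {

    // Hypothetical precision: only the top 4 significant bits are stored.
    static final int KEPT_BITS = 4;

    // Number of low-order bits that the toy encoding cannot store.
    private static int droppedBits(int length) {
        int significantBits = 32 - Integer.numberOfLeadingZeros(length);
        return Math.max(0, significantBits - KEPT_BITS);
    }

    // Round down: clear the bits that cannot be stored.
    static int roundDown(int length) {
        if (length <= 0) {
            return 0;
        }
        int shift = droppedBits(length);
        return (length >>> shift) << shift;
    }

    // Round up: if any stored-away bits were set, bump to the next
    // representable value. Assumes lengths well below Integer.MAX_VALUE,
    // so the addition cannot overflow.
    static int roundUp(int length) {
        int down = roundDown(length);
        if (down == length) {
            return length;
        }
        return down + (1 << droppedBits(length));
    }

    public static void main(String[] args) {
        // A field made of 37 occurrences of one term: freq == true length == 37.
        int freq = 37;
        System.out.println("round down: " + roundDown(37) + " vs freq " + freq);
        System.out.println("round up:   " + roundUp(37) + " vs freq " + freq);
    }
}
```

In this toy scheme a true length of 37 is stored as 36 when rounding down, 
which is smaller than the frequency of 37, and as 40 when rounding up.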


