[jira] [Commented] (LUCENE-10048) Bypass total frequency check if field uses custom term frequency

Ankur (Jira) Wed, 11 Aug 2021 20:05:06 -0700


    [ 
https://issues.apache.org/jira/browse/LUCENE-10048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17397771#comment-17397771
 ]


Ankur commented on LUCENE-10048:
--------------------------------

@[~rcmuir]

Consider the case where these term-document level scoring factors are computed 
in an offline process, indexed in Lucene and accessed at query time by a 
ranking function that does not rely on Lucene's 
[Scorer|https://lucene.apache.org/core/8_9_0/core/org/apache/lucene/search/Scorer.html]
 and 
[Similarity|https://lucene.apache.org/core/8_9_0/core/org/apache/lucene/search/similarities/Similarity.html]
 abstractions.

What is considered reasonable is up to the offline process that serves the 
needs of the ranking function and is outside our control. A single 
term-document scoring factor can still be less than {{Integer.MAX_VALUE}} but 
the sum of all such factors for a document could easily exceed the 
{{Integer.MAX_VALUE}} range.
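
To make the arithmetic concrete, here is a minimal standalone sketch. It is not Lucene's indexing code, and the class and method names are made up for illustration. It assumes the per-field check amounts to an overflow-checked {{int}} sum of the per-token frequencies:

```java
public class FieldLengthOverflowSketch {

    // Sums per-token frequencies into an int "field length".
    // Math.addExact throws ArithmeticException if the int sum overflows.
    public static int accumulateFieldLength(int[] termFrequencies) {
        int length = 0;
        for (int freq : termFrequencies) {
            length = Math.addExact(length, freq);
        }
        return length;
    }

    // Returns true if the accumulated field length does not fit in an int.
    public static boolean overflows(int[] termFrequencies) {
        try {
            accumulateFieldLength(termFrequencies);
            return false;
        } catch (ArithmeticException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // Ordinary token counts: the sum stays far below Integer.MAX_VALUE.
        int[] small = {3, 5, 7};
        // Two custom frequencies, each individually below Integer.MAX_VALUE,
        // whose sum (4,000,000,000) does not fit in an int.
        int[] large = {2_000_000_000, 2_000_000_000};

        System.out.println(overflows(small)); // prints "false"
        System.out.println(overflows(large)); // prints "true"
    }
}
```

Two tokens are enough to trip the check even though each frequency on its own is a valid {{int}}.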

Without this change, our only option (I think) is to use {{BinaryDocValues}} and 
implement mechanisms to serialize/deserialize term-document level scoring 
factors at indexing and searching time ourselves. With that approach we don't 
get the space efficiencies that come with the highly optimized terms dictionary 
and the integer compression techniques used to encode postings data (at least 
not without significant work).
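
For a sense of what "ourselves" would involve, here is a rough sketch of a hand-rolled codec for one document's term-to-factor map. The byte layout and the class and method names are hypothetical, chosen only for illustration. It uses plain JDK classes; in an actual index the resulting {{byte[]}} would presumably be wrapped in a {{BytesRef}} and stored through a {{BinaryDocValuesField}}:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class ScoringFactorCodecSketch {

    // Hypothetical layout: [entryCount][termLen][termBytes][factor]...
    // All integers are fixed-width 4-byte values, with no compression.
    public static byte[] encode(Map<String, Integer> factors) {
        int size = Integer.BYTES;
        for (String term : factors.keySet()) {
            size += Integer.BYTES
                    + term.getBytes(StandardCharsets.UTF_8).length
                    + Integer.BYTES;
        }
        ByteBuffer buf = ByteBuffer.allocate(size);
        buf.putInt(factors.size());
        for (Map.Entry<String, Integer> e : factors.entrySet()) {
            byte[] termBytes = e.getKey().getBytes(StandardCharsets.UTF_8);
            buf.putInt(termBytes.length);
            buf.put(termBytes);
            buf.putInt(e.getValue());
        }
        return buf.array();
    }

    // Reverses encode(): reads the entry count, then each (term, factor) pair.
    public static Map<String, Integer> decode(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        int count = buf.getInt();
        Map<String, Integer> factors = new LinkedHashMap<>();
        for (int i = 0; i < count; i++) {
            byte[] termBytes = new byte[buf.getInt()];
            buf.get(termBytes);
            String term = new String(termBytes, StandardCharsets.UTF_8);
            factors.put(term, buf.getInt());
        }
        return factors;
    }

    public static void main(String[] args) {
        Map<String, Integer> factors = new LinkedHashMap<>();
        factors.put("foo", 2_000_000_000);
        factors.put("bar", 2_000_000_000);
        byte[] encoded = encode(factors);
        System.out.println(decode(encoded).equals(factors)); // prints "true"
    }
}
```

Every term is stored again in full for every document, and every factor takes four bytes, which is the kind of overhead the terms dictionary and postings compression would otherwise avoid.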

Maybe we can keep the restriction that each custom term frequency be less than 
{{Integer.MAX_VALUE}} but relax the check on the per-field total token count 
for the expert use case?

> Bypass total frequency check if field uses custom term frequency
> ----------------------------------------------------------------
>
>                 Key: LUCENE-10048
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10048
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Tony Xu
>            Priority: Minor
>
> For all fields whose index option is not *IndexOptions.NONE*, there is a 
> check on the per-field total token count (i.e. field length) to ensure we 
> don't index too many tokens. This is done by accumulating each token's 
> *TermFrequencyAttribute*.
>  
> Given that Lucene currently allows a custom term frequency to be attached to 
> each token, and that usage of the frequency can vary widely, it is possible 
> for the check to fail with only a few tokens that have large frequencies, as 
> in the case below. Currently Lucene will skip indexing the whole document.
> *"foo|<very large number> bar|<very large number>"*
>  
> What should be the way to inform the indexing chain not to check the field 
> length?
> A related observation: when a custom term frequency is in use, the user is 
> not likely to use the similarity for this field. Maybe we can offer a way to 
> specify that, too?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
