[ 
https://issues.apache.org/jira/browse/LUCENE-10048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17397782#comment-17397782
 ] 

Robert Muir commented on LUCENE-10048:
--------------------------------------

By the way, there are a lot of alternatives you can look into to avoid having 
more than 2^31 - 1 (Integer.MAX_VALUE) tokens inside a single document's field. 
For example, you could use more fields, you could encode the value in the 
payload, etc.
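
As a rough illustration of the "use more fields" alternative, here is a minimal 
sketch in plain Java (no Lucene dependency). It assumes the per-field total is 
tracked as a signed 32-bit int, and it greedily spreads terms over several 
buckets so that no bucket's summed frequency exceeds Integer.MAX_VALUE. The 
class name FieldSplitter and the idea of mapping each bucket to its own field 
(for example "body_0", "body_1") are hypothetical and only for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lucene.
final class FieldSplitter {

    /**
     * Greedily assigns each term to the first bucket whose running total
     * stays within Integer.MAX_VALUE; opens a new bucket when none fits.
     */
    static List<Map<String, Integer>> split(Map<String, Integer> termFreqs) {
        List<Map<String, Integer>> buckets = new ArrayList<>();
        List<Long> totals = new ArrayList<>(); // long totals so the sum itself cannot overflow

        for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
            int freq = e.getValue();
            int target = -1;
            for (int i = 0; i < buckets.size(); i++) {
                if (totals.get(i) + freq <= Integer.MAX_VALUE) {
                    target = i;
                    break;
                }
            }
            if (target < 0) {
                buckets.add(new LinkedHashMap<>());
                totals.add(0L);
                target = buckets.size() - 1;
            }
            buckets.get(target).put(e.getKey(), freq);
            totals.set(target, totals.get(target) + freq);
        }
        return buckets;
    }

    public static void main(String[] args) {
        Map<String, Integer> freqs = new LinkedHashMap<>();
        freqs.put("foo", 2_000_000_000);
        freqs.put("bar", 2_000_000_000);
        freqs.put("baz", 100);
        // Each bucket would be indexed as its own field.
        System.out.println(split(freqs));
    }
}
```

The trade-off is that queries then have to cover every field a term may have 
landed in, so this only helps if the application can tolerate that.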

But I don't understand what you are doing. To me, it honestly sounds like 
Lucene may not be the right solution at all. Or maybe it is worth looking at 
the Lucene 9 vector format, or something very different.

> Bypass total frequency check if field uses custom term frequency
> ----------------------------------------------------------------
>
>                 Key: LUCENE-10048
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10048
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Tony Xu
>            Priority: Minor
>
> For all fields whose index option is not *IndexOptions.NONE*, there is a 
> check on the per-field total token count (i.e. field length) to ensure we 
> don't index too many tokens. This is done by accumulating each token's 
> *TermFrequencyAttribute*.
>  
> Lucene currently allows a custom term frequency to be attached to each 
> token, and the frequency values can be very large. As a result, the check 
> can fail with only a few tokens that have large frequencies, as in the 
> following case:
> *"foo|<very large number> bar|<very large number>"*
> When that happens, Lucene currently skips indexing the whole document.
>  
> What would be the way to inform the indexing chain not to check the field 
> length? A related observation: when custom term frequency is in use, the 
> user is not likely to use the similarity for this field. Maybe we can offer 
> a way to specify that, too?
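
To make the failure mode in the description concrete, here is a minimal sketch 
in plain Java of the kind of accumulation it describes. It is not Lucene's 
actual indexing code: the class name FieldLengthCheck and the method accumulate 
are hypothetical, and the sketch assumes the field length is kept in an int 
and that overflow is reported as an IllegalArgumentException.

```java
// Hypothetical illustration, not Lucene's implementation.
final class FieldLengthCheck {

    /**
     * Sums per-token term frequencies into an int "field length" and
     * rejects the field as soon as the running total would overflow.
     */
    static int accumulate(int[] termFreqs) {
        int length = 0;
        for (int freq : termFreqs) {
            if (freq < 1) {
                throw new IllegalArgumentException("term frequency must be >= 1");
            }
            try {
                // Math.addExact throws ArithmeticException on int overflow.
                length = Math.addExact(length, freq);
            } catch (ArithmeticException e) {
                throw new IllegalArgumentException("too many tokens for field", e);
            }
        }
        return length;
    }

    public static void main(String[] args) {
        System.out.println(accumulate(new int[] {3, 4}));
        try {
            // Two tokens are enough once the custom frequencies are large.
            accumulate(new int[] {2_000_000_000, 2_000_000_000});
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

With only two tokens of frequency 2,000,000,000 each, the sum exceeds 
Integer.MAX_VALUE, so the whole field (and with it the document) is rejected 
even though the token count itself is tiny.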



