[ https://issues.apache.org/jira/browse/LUCENE-10048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17397774#comment-17397774 ]
Robert Muir commented on LUCENE-10048:
--------------------------------------

Whether or not you use the similarity or scorer is irrelevant. I'm referring to the term/field statistic values stored in the segment itself. It doesn't matter which parts of Lucene you do or don't use here: overflowing these values will essentially behave as corruption. That's why I repeat myself every time a JIRA issue is opened to try to bypass these checks.

> Bypass total frequency check if field uses custom term frequency
> ----------------------------------------------------------------
>
>                 Key: LUCENE-10048
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10048
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Tony Xu
>            Priority: Minor
>
> For all fields whose index option is not *IndexOptions.NONE*, there is a
> check on the per-field total token count (i.e. field length) to ensure we
> don't index too many tokens. This is done by accumulating each token's
> *TermFrequencyAttribute*.
>
> Given that Lucene currently allows a custom term frequency to be attached to
> each token, and that the usage of the frequency can be pretty wild, it is
> possible for the check to fail with only a few tokens that have large
> frequencies. Currently Lucene will skip indexing the whole document.
> *"foo|<very large number> bar|<very large number>"*
>
> What should be the way to inform the indexing chain not to check the field
> length?
> A related observation: when custom term frequency is in use, the user is not
> likely to use the similarity for this field. Maybe we can offer a way to
> specify that, too?
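
To make the overflow concern concrete, here is a minimal plain-Java sketch of a per-field length check of the kind described in the issue. It is not Lucene's code: the class name, method names, and the exception message are hypothetical, and the only assumption carried over from the thread is that the field length is an integer accumulated from each token's term frequency. It shows two very large frequencies, as in the "foo|<very large number> bar|<very large number>" example, either being rejected by an overflow-checked sum or silently wrapping to a negative length when the check is skipped.

```java
// Illustrative sketch only; none of these names come from Lucene.
class FieldLengthCheckSketch {

    /** Sums per-token frequencies into an int field length, rejecting the document on overflow. */
    public static int checkedFieldLength(int[] termFreqs) {
        int length = 0;
        for (int freq : termFreqs) {
            try {
                // Math.addExact throws ArithmeticException instead of wrapping around.
                length = Math.addExact(length, freq);
            } catch (ArithmeticException e) {
                // Hypothetical message; the point is that the whole document is refused.
                throw new IllegalArgumentException("field length overflows int", e);
            }
        }
        return length;
    }

    /** The same sum with the check bypassed: int arithmetic wraps silently. */
    public static int uncheckedFieldLength(int[] termFreqs) {
        int length = 0;
        for (int freq : termFreqs) {
            length += freq;
        }
        return length;
    }

    public static void main(String[] args) {
        // Two tokens, each with the largest possible custom frequency.
        int[] termFreqs = {Integer.MAX_VALUE, Integer.MAX_VALUE};

        // Without the check the "length" of a two-token field becomes -2.
        System.out.println("unchecked length = " + uncheckedFieldLength(termFreqs));

        try {
            checkedFieldLength(termFreqs);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

A wrapped value like the -2 above is what would end up in any statistic derived from the sum if the check were bypassed, which is one way to read the "behave as corruption" remark: the stored number is wrong regardless of whether a similarity ever reads it.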