[jira] [Commented] (LUCENE-8947) Indexing fails with "too many tokens for field" when using custom term frequencies
[ https://issues.apache.org/jira/browse/LUCENE-8947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17231851#comment-17231851 ]

Michael McCandless commented on LUCENE-8947:
--------------------------------------------

Thanks [~dxl360], I'll look!

> Indexing fails with "too many tokens for field" when using custom term frequencies
> -----------------------------------------------------------------------------------
>
>                 Key: LUCENE-8947
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8947
>             Project: Lucene - Core
>          Issue Type: Improvement
>    Affects Versions: 7.5
>            Reporter: Michael McCandless
>            Priority: Major
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> We are using custom term frequencies (LUCENE-7854) to index per-token scoring signals. However, for one document that had many tokens, each carrying fairly large (~998,000) scoring signals, we hit this exception:
> {noformat}
> 2019-08-05T21:32:37,048 [ERROR] (LuceneIndexing-3-thread-3) com.amazon.lucene.index.IndexGCRDocument: Failed to index doc:
> java.lang.IllegalArgumentException: too many tokens for field "foobar"
>     at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:825)
>     at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:430)
>     at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:394)
>     at org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(DocumentsWriterPerThread.java:297)
>     at org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:450)
>     at org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1291)
>     at org.apache.lucene.index.IndexWriter.addDocuments(IndexWriter.java:1264)
> {noformat}
> This is happening in this code in {{DefaultIndexingChain.java}}:
> {noformat}
> try {
>   invertState.length = Math.addExact(invertState.length, invertState.termFreqAttribute.getTermFrequency());
> } catch (ArithmeticException ae) {
>   throw new IllegalArgumentException("too many tokens for field \"" + field.name() + "\"");
> }
> {noformat}
> Here Lucene is accumulating the total length (number of tokens) for the field. But total length doesn't really make sense if you are using custom term frequencies to hold arbitrary scoring signals. Or maybe it does make sense, if the user is using this as simple boosting, but then maybe we should allow this length to be a {{long}}?
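For readers following along, here is a minimal standalone sketch (not Lucene source) of the arithmetic behind the failure: with per-token custom term frequencies around 998,000, an {{int}} running total passes Integer.MAX_VALUE after roughly 2,150 tokens, which is the Math.addExact overflow the stack trace shows. The class and variable names below are hypothetical.

{noformat}
// Standalone sketch illustrating the overflow described in the issue.
// Assumes per-token custom term frequencies of ~998,000, as in the report.
public class TermFreqOverflowDemo {
  public static void main(String[] args) {
    final int perTokenFreq = 998_000; // hypothetical per-token scoring signal from the report
    int length = 0;                   // int accumulator, like invertState.length in the snippet above
    int tokens = 0;
    try {
      while (true) {
        // Same pattern as the quoted code: Math.addExact throws on int overflow.
        length = Math.addExact(length, perTokenFreq);
        tokens++;
      }
    } catch (ArithmeticException ae) {
      System.out.println("int accumulator overflowed while adding token #" + (tokens + 1));
      // The same total fits comfortably in a long:
      long asLong = (long) (tokens + 1) * perTokenFreq;
      System.out.println("as a long the running total would be " + asLong);
    }
  }
}
{noformat}

Running this reports the overflow on roughly the 2,152nd token, while the same total fits easily in a {{long}}.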
[jira] [Commented] (LUCENE-8947) Indexing fails with "too many tokens for field" when using custom term frequencies
[ https://issues.apache.org/jira/browse/LUCENE-8947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17231847#comment-17231847 ]

Duan Li commented on LUCENE-8947:
---------------------------------

I opened a PR to fix this issue: https://github.com/apache/lucene-solr/pull/2080
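Below is a minimal, self-contained sketch of the direction the issue description itself floats (tracking the per-field length as a {{long}}). It is only an illustration, not a claim about what PR #2080 actually changes, and the class and method names are invented stand-ins for Lucene's internals.

{noformat}
// Minimal sketch of the "make the length a long" idea from the issue description.
// Not the actual patch in https://github.com/apache/lucene-solr/pull/2080;
// the class and method names are simplified stand-ins for Lucene's internals.
class SketchedInvertState {
  long length; // widened from int so ~998,000-per-token frequencies do not overflow in practice

  void addTermFrequency(String fieldName, int termFreq) {
    if (termFreq < 1) {
      // the sketch assumes custom term frequencies are always >= 1
      throw new IllegalArgumentException("term frequency must be >= 1 for field \"" + fieldName + "\"");
    }
    try {
      // Math.addExact(long, long) still guards the (now astronomically unlikely) overflow case.
      length = Math.addExact(length, termFreq);
    } catch (ArithmeticException ae) {
      throw new IllegalArgumentException("too many tokens for field \"" + fieldName + "\"");
    }
  }
}
{noformat}

Widening the accumulator keeps the existing overflow guard while pushing the limit far beyond any realistic sum of custom term frequencies.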