[jira] [Commented] (LUCENE-8947) Indexing fails with "too many tokens for field" when using custom term frequencies

2020-11-13 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17231851#comment-17231851
 ] 

Michael McCandless commented on LUCENE-8947:


Thanks [~dxl360], I'll look!

> Indexing fails with "too many tokens for field" when using custom term 
> frequencies
> --
>
> Key: LUCENE-8947
> URL: https://issues.apache.org/jira/browse/LUCENE-8947
> Project: Lucene - Core
> Issue Type: Improvement
> Affects Versions: 7.5
> Reporter: Michael McCandless
> Priority: Major
> Time Spent: 10m
> Remaining Estimate: 0h
>
> We are using custom term frequencies (LUCENE-7854) to index per-token scoring 
> signals. However, for one document that had many tokens, each carrying a 
> fairly large (~998,000) scoring signal, we hit this exception:
> {noformat}
> 2019-08-05T21:32:37,048 [ERROR] (LuceneIndexing-3-thread-3) com.amazon.lucene.index.IndexGCRDocument: Failed to index doc: 
> java.lang.IllegalArgumentException: too many tokens for field "foobar"
>     at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:825)
>     at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:430)
>     at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:394)
>     at org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(DocumentsWriterPerThread.java:297)
>     at org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:450)
>     at org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1291)
>     at org.apache.lucene.index.IndexWriter.addDocuments(IndexWriter.java:1264)
> {noformat}
> This is happening in this code in {{DefaultIndexingChain.java}}:
> {noformat}
>   try {
>     invertState.length = Math.addExact(invertState.length,
>         invertState.termFreqAttribute.getTermFrequency());
>   } catch (ArithmeticException ae) {
>     throw new IllegalArgumentException("too many tokens for field \"" +
>         field.name() + "\"");
>   }
> {noformat}
> Here Lucene is accumulating the total length (number of tokens) for the 
> field, and the {{int}} sum overflows.  But total length doesn't really make 
> sense if you are using custom term frequencies to hold arbitrary scoring 
> signals.  Or maybe it does make sense, if the user is using this as simple 
> boosting, but then maybe we should allow this length to be a {{long}}?
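
To make the arithmetic behind the report concrete, here is a minimal standalone sketch. It is not Lucene code and not the change proposed in the PR; the class and method names are hypothetical, and it assumes every token carries the same term frequency of 998,000. It only illustrates how quickly an int accumulator guarded by Math.addExact overflows, and that a long accumulator holds the same sum.

```java
public class FieldLengthOverflow {

    // Hypothetical helper: adds the same term frequency repeatedly to an int
    // accumulator, mirroring the Math.addExact guard quoted above, and returns
    // the 1-based index of the token whose addition overflows.
    static int tokenThatOverflowsInt(int termFreq) {
        if (termFreq <= 0) {
            throw new IllegalArgumentException("termFreq must be positive");
        }
        int length = 0;
        int tokens = 0;
        while (true) {
            try {
                length = Math.addExact(length, termFreq);
                tokens++;
            } catch (ArithmeticException ae) {
                // The next token (tokens + 1) is the one that does not fit.
                return tokens + 1;
            }
        }
    }

    // Hypothetical alternative: accumulate the same values in a long.
    // Math.addExact(long, long) still throws on overflow, but only far
    // beyond the int range.
    static long accumulateAsLong(int termFreq, int tokenCount) {
        long length = 0L;
        for (int i = 0; i < tokenCount; i++) {
            length = Math.addExact(length, (long) termFreq);
        }
        return length;
    }

    public static void main(String[] args) {
        int termFreq = 998_000; // assumed per-token signal, taken from the report
        int failing = tokenThatOverflowsInt(termFreq);
        System.out.println("int accumulator overflows at token " + failing);
        System.out.println("long accumulator after the same tokens: "
            + accumulateAsLong(termFreq, failing));
    }
}
```

Under that assumption the int sum overflows on the 2,152nd token, since 2,152 * 998,000 = 2,147,696,000 exceeds Integer.MAX_VALUE (2,147,483,647), so a few thousand tokens are enough to trigger the exception.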



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8947) Indexing fails with "too many tokens for field" when using custom term frequencies

2020-11-13 Thread Duan Li (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17231847#comment-17231847
 ] 

Duan Li commented on LUCENE-8947:
-

I opened a PR to fix this issue: https://github.com/apache/lucene-solr/pull/2080



