[ https://issues.apache.org/jira/browse/LUCENE-9037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16970622#comment-16970622 ]
Ilan Ginzburg edited comment on LUCENE-9037 at 11/8/19 10:47 PM:
-----------------------------------------------------------------

Thanks [~mikemccand]. What about moving the call to {{DocumentsWriterFlushControl.doAfterDocument()}} up into the {{finally}} of the block calling {{DocumentsWriterPerThread.updateDocument/s()}} in {{DocumentsWriter.updateDocument/s()}}? Basically, treat {{DocumentsWriterFlushControl.doAfterDocument()}} as a "do after _successful or failed_ document". I am exploring that path to see if I can make it work (and keep the existing tests passing).

Your suggestion of throwing a meaningful exception upon reaching the limit would not help my use case if no flush happens as a consequence.
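
For concreteness, a minimal sketch of that idea against the 7.x {{DocumentsWriter.updateDocument()}} structure (simplified and untested; the variable names follow my reading of the 7.x sources, and the surrounding method does more than shown here):

{code:java}
// Inner block of DocumentsWriter.updateDocument(), with the call to
// flushControl.doAfterDocument() moved into the finally so that flush
// accounting also runs when dwpt.updateDocument() throws, e.g. an
// IOException from the analyzer.
try {
  dwpt.updateDocument(doc, analyzer, delTerm);
} catch (AbortingException ae) {
  flushControl.doOnAbort(perThread);
  dwpt.abort();
  throw ae;
} finally {
  // Existing accounting, unchanged: the doc may or may not have counted.
  numDocsInRAM.addAndGet(dwpt.getNumDocsInRAM() - dwptNumDocs);
  // Moved up from after this try block: an over-limit DWPT gets picked up
  // for flushing even when the document failed.
  flushingDWPT = flushControl.doAfterDocument(perThread, delTerm != null);
}
{code}

The interaction with the abort path would need checking, since the {{finally}} also runs after {{doOnAbort()}}.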
> ArrayIndexOutOfBoundsException due to repeated IOException during indexing
> --------------------------------------------------------------------------
>
>                 Key: LUCENE-9037
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9037
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: core/index
>    Affects Versions: 7.1
>            Reporter: Ilan Ginzburg
>            Priority: Minor
>         Attachments: TestIndexWriterTermsHashOverflow.java
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> There is a limit to the number of tokens Lucene can hold in memory when
> docs are indexed using DocumentsWriter; beyond it, bad things happen. The
> limit can be reached by submitting a really large document, by submitting a
> large number of documents without doing a commit (see LUCENE-8118), or by
> repeatedly submitting documents that fail to get indexed in certain specific
> ways, leading to Lucene not cleaning up the in-memory data structures that
> eventually overflow.
> The overflow is due to a 32-bit (signed) integer wrapping around to negative
> territory, then causing an ArrayIndexOutOfBoundsException.
> The failure path that we are reliably hitting is an IOException during doc
> tokenization: a tokenizer implementing TokenStream throws the exception from
> incrementToken(), which causes indexing of that doc to fail. The IOException
> bubbles back up to DocumentsWriter.updateDocument() (or
> DocumentsWriter.updateDocuments() in some other cases), where it is not
> treated as an AbortingException and therefore does not cause a reset of the
> DocumentsWriterPerThread.
> On repeated failures (without any successful indexing in between), if the
> upper layer (client via Solr) resubmits the doc and it fails again,
> DocumentsWriterPerThread will eventually cause the TermsHashPerField data
> structures to grow and overflow, leading to an exception stack similar to
> the one in LUCENE-8118 (stack trace below copied from a test run repro on
> 7.1):
>
> java.lang.ArrayIndexOutOfBoundsException: -65536
> at __randomizedtesting.SeedInfo.seed([394FAB2B91B1D90A:C86FB3F3CE001AA8]:0)
> at org.apache.lucene.index.TermsHashPerField.writeByte(TermsHashPerField.java:198)
> at org.apache.lucene.index.TermsHashPerField.writeVInt(TermsHashPerField.java:221)
> at org.apache.lucene.index.FreqProxTermsWriterPerField.writeProx(FreqProxTermsWriterPerField.java:80)
> at org.apache.lucene.index.FreqProxTermsWriterPerField.addTerm(FreqProxTermsWriterPerField.java:171)
> at org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:185)
> at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:792)
> at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:430)
> at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:392)
> at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:239)
> at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:481)
> at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1717)
> at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1462)
>
> Using tokens composed only of lowercase letters, it takes fewer than
> 130,000,000 different tokens (the shortest ones) to overflow
> TermsHashPerField.
> Submitting a single document (composed of the 20,000 shortest lowercase
> tokens) repeatedly for indexing requires 6352 submissions, all failing with
> an IOException from incrementToken(), to trigger the
> ArrayIndexOutOfBoundsException.
> A proposed fix is to treat an IOException in DocumentsWriter.updateDocument()
> and DocumentsWriter.updateDocuments() the same way we treat an
> AbortingException.
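
To illustrate the failure path described in the issue above, here is a hypothetical {{TokenStream}} in the spirit of the attached {{TestIndexWriterTermsHashOverflow.java}} (the class name and token-generation scheme are made up for illustration; the essential part is the throw from {{incrementToken()}} after tokens have already been emitted):

{code:java}
import java.io.IOException;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Emits a batch of distinct short tokens, then throws. Every
// IndexWriter.addDocument() call on a field using this stream fails with an
// IOException after the tokens have already been buffered in the DWPT.
final class FailingTokenStream extends TokenStream {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private int emitted;

  @Override
  public boolean incrementToken() throws IOException {
    clearAttributes();
    if (emitted < 20_000) {
      termAtt.append(Integer.toString(emitted++, Character.MAX_RADIX));
      return true;
    }
    throw new IOException("simulated tokenizer failure");
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    emitted = 0;
  }
}
{code}

Feeding such a stream via e.g. {{new TextField("f", new FailingTokenStream())}} and calling {{IndexWriter.addDocument()}} in a loop, with every call throwing, keeps growing the per-thread in-memory state until the int offset in {{TermsHashPerField}} wraps negative.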
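
And a rough sketch of the proposed fix at the {{DocumentsWriter.updateDocument()}} call site (again simplified; the 7.x code aborts the DWPT on {{AbortingException}}, and the idea is to widen that handling to {{IOException}}):

{code:java}
try {
  dwpt.updateDocument(doc, analyzer, delTerm);
} catch (AbortingException | IOException e) {
  // Abort the DWPT on IOException too, so its in-memory TermsHash state
  // is discarded instead of accumulating across repeated failures.
  flushControl.doOnAbort(perThread);
  dwpt.abort();
  throw e;
}
{code}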