Luís Filipe Nassif created LUCENE-10681:
-------------------------------------------
Summary: ArrayIndexOutOfBoundsException while indexing large
binary file
Key: LUCENE-10681
URL: https://issues.apache.org/jira/browse/LUCENE-10681
Project: Lucene - Core
Issue Type: Bug
Components: core/index
Affects Versions: 9.2
Environment: Linux Ubuntu (will check the user version), java x64
version 11.0.16.1
Reporter: Luís Filipe Nassif
Hello,
I looked for a similar issue but didn't find one, so I'm creating this one;
sorry if it was reported before. We recently upgraded from Lucene 5.5.5 to
9.2.0, and a user reported the error below while indexing a huge binary file.
We use a parent-children schema in which the strings extracted from the huge
binary file (using the strings command) are indexed as thousands of ~10MB
child documents of the parent metadata document:
Caused by: java.lang.ArrayIndexOutOfBoundsException: Index -65536 out of bounds for length 71428
    at org.apache.lucene.index.TermsHashPerField.writeByte(TermsHashPerField.java:219) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
    at org.apache.lucene.index.TermsHashPerField.writeVInt(TermsHashPerField.java:241) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
    at org.apache.lucene.index.FreqProxTermsWriterPerField.writeProx(FreqProxTermsWriterPerField.java:86) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
    at org.apache.lucene.index.FreqProxTermsWriterPerField.newTerm(FreqProxTermsWriterPerField.java:127) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
    at org.apache.lucene.index.TermsHashPerField.initStreamSlices(TermsHashPerField.java:175) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
    at org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:198) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
    at org.apache.lucene.index.IndexingChain$PerField.invert(IndexingChain.java:1224) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
    at org.apache.lucene.index.IndexingChain.processField(IndexingChain.java:729) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
    at org.apache.lucene.index.IndexingChain.processDocument(IndexingChain.java:620) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
    at org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(DocumentsWriterPerThread.java:241) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
    at org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:432) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
    at org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1532) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
    at org.apache.lucene.index.IndexWriter.addDocuments(IndexWriter.java:1503) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
    at iped.engine.task.index.IndexTask.process(IndexTask.java:148) ~[iped-engine-4.0.2.jar:?]
    at iped.engine.task.AbstractTask.processMonitorTimeout(AbstractTask.java:250) ~[iped-engine-4.0.2.jar:?]
This looks like an integer overflow to me, but I'm not sure. It didn't happen
with Lucene 5.5.5, and indexing files like this is pretty common for us. One
difference is that with Lucene 5.5.5 we split the huge file manually before
indexing and called the IndexWriter.addDocument(Document) method once for each
10MB chunk, whereas with Lucene 9.2.0 we are now using the
IndexWriter.addDocuments(Iterable) method. Any thoughts?
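To make the overflow suspicion concrete, here is a minimal sketch of the arithmetic I have in mind. It is not Lucene code: the class and method names are hypothetical, and it assumes the in-memory byte pool addresses its buffers with a 32-bit offset and 32 KB blocks (a shift of 15). Under those assumptions, advancing the offset past 2 GB wraps it negative, and the block index derived from it becomes the -65536 seen in the exception.

```java
public class OffsetOverflowSketch {
    // Assumed block geometry for this sketch: 32 KB blocks (1 << 15 bytes).
    static final int BLOCK_SHIFT = 15;
    static final int BLOCK_SIZE = 1 << BLOCK_SHIFT;

    /** Advances a 32-bit byte offset by one block; wraps silently past Integer.MAX_VALUE. */
    static int nextBlockOffset(int offset) {
        return offset + BLOCK_SIZE;
    }

    /** Maps a global byte offset to the index of the block that holds it. */
    static int blockIndex(int offset) {
        return offset >> BLOCK_SHIFT;
    }

    public static void main(String[] args) {
        // Start offset of the last block that still fits below 2 GB: 2^31 - 2^15.
        int lastValid = Integer.MAX_VALUE - (BLOCK_SIZE - 1);
        // One more block pushes the offset to 2^31, which wraps to Integer.MIN_VALUE.
        int wrapped = nextBlockOffset(lastValid);

        System.out.println(blockIndex(lastValid)); // 65535
        System.out.println(wrapped);               // -2147483648
        System.out.println(blockIndex(wrapped));   // -65536
    }
}
```

If this guess is right, the failure would mean that roughly 2 GB of term data was buffered in a single in-memory pool before a flush. I have not confirmed that against the Lucene source.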