[
https://issues.apache.org/jira/browse/ORC-1897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
zhaolong updated ORC-1897:
--------------------------
Description:
We have found many cases of ORC file corruption; errors like the following are reported when reading.
# java.lang.ArrayIndexOutOfBoundsException: 0
at org.apache.orc.impl.RunLengthIntegerReaderV2.readPatchedBaseValues(RunLengthIntegerReaderV2.java:200)
at org.apache.orc.impl.RunLengthIntegerReaderV2.readValues(RunLengthIntegerReaderV2.java:70)
at org.apache.orc.impl.RunLengthIntegerReaderV2.next(RunLengthIntegerReaderV2.java:323)
at org.apache.orc.impl.RunLengthIntegerReaderV2.nextVector(RunLengthIntegerReaderV2.java:373)
at org.apache.orc.impl.TreeReaderFactory$LongTreeReader.nextVector(TreeReaderFactory.java:696)
at org.apache.orc.impl.TreeReaderFactory$StructTreeReader.nextVector(TreeReaderFactory.java:2463)
at org.apache.orc.impl.reader.tree.StructBatchReader.readBatchColumn(StructBatchReader.java:42)
at org.apache.orc.impl.reader.tree.StructBatchReader.nextBatch(StructBatchReader.java:72)
at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1236)
# java.lang.IllegalArgumentException: Buffer size too small. size = 131072 needed = 471700 in column 1 kind DICTIONARY_DATA
at org.apache.orc.impl.InStream$CompressedStream.readHeader(InStream.java:487)
at org.apache.orc.impl.InStream$CompressedStream.ensureUncompressed(InStream.java:531)
at org.apache.orc.impl.InStream$CompressedStream.available(InStream.java:538)
at org.apache.orc.impl.TreeReaderFactory$StringDictionaryTreeReader.readDictionaryStream(TreeReaderFactory.java:1776)
at org.apache.orc.impl.TreeReaderFactory$StringDictionaryTreeReader.startStripe(TreeReaderFactory.java:1740)
at org.apache.orc.impl.TreeReaderFactory$StringTreeReader.startStripe(TreeReaderFactory.java:1491)
at org.apache.orc.impl.TreeReaderFactory$StructTreeReader.startStripe(TreeReaderFactory.java:2076)
at org.apache.orc.impl.RecordReaderImpl.readStripe(RecordReaderImpl.java:1117)
at org.apache.orc.impl.RecordReaderImpl.advanceStripe(RecordReaderImpl.java:1154)
at org.apache.orc.impl.RecordReaderImpl.advanceToNextRow(RecordReaderImpl.java:1189)
at org.apache.orc.impl.RecordReaderImpl.<init>(RecordReaderImpl.java:251)
at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:851)
at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:845)
Take the second exception as an example. The chunkLength should be between 32k (snappy) and 256k (lzo, zlib), so why is needed = 471700? We have tested the CPU and memory of the hardware and found no errors, and no HDFS erasure-coding (EC) policy is configured either.
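For context, needed is the chunkLength decoded from the 3-byte header that precedes each chunk of a compressed ORC stream; per the ORC spec the header stores (chunkLength << 1) little-endian, with the low bit marking an uncompressed chunk. Below is a simplified sketch of that decode and the check that fails, not the actual InStream code:
{code:java}
// Simplified sketch of InStream$CompressedStream.readHeader's check.
// Each chunk of a compressed stream starts with a 3-byte little-endian
// header holding (chunkLength << 1) | isOriginal (ORC file-format spec).
static int decodeChunkHeader(byte b0, byte b1, byte b2, int bufferSize) {
  boolean isOriginal = (b0 & 0x01) != 0;   // low bit: chunk left uncompressed
  int chunkLength = ((0xff & b2) << 15)
                  | ((0xff & b1) << 7)
                  | ((0xff & b0) >>> 1);
  // Damaged header bytes can decode to any value up to 2^23 - 1,
  // e.g. 471700, which then fails this buffer-size check:
  if (chunkLength > bufferSize) {
    throw new IllegalArgumentException("Buffer size too small. size = "
        + bufferSize + " needed = " + chunkLength);
  }
  return chunkLength;
}
{code}
So a single damaged byte in the 3-byte header is enough to produce an impossible needed value like 471700 even if the rest of the stream is intact, which suggests the header bytes themselves are corrupted.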
So we want to verify each ORC file right after Hive writes it in FileSinkOperator. However, considering the performance impact, we can only afford to read ORC metadata, such as stripe sizes, to check whether there is any problem; a sketch of that idea follows.
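Creating a Reader parses only the postscript and footer, so no row data is decompressed; the consistency rules below are our own illustration, not something the ORC API prescribes:
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.orc.OrcFile;
import org.apache.orc.Reader;
import org.apache.orc.StripeInformation;

// Metadata-only sanity check: opening a Reader reads just the postscript,
// footer, and metadata sections, never the row data itself.
static boolean footerLooksSane(Path file, Configuration conf) {
  try (Reader reader = OrcFile.createReader(file, OrcFile.readerOptions(conf))) {
    long rows = 0;
    for (StripeInformation stripe : reader.getStripes()) {
      if (stripe.getNumberOfRows() > 0 && stripe.getDataLength() == 0) {
        return false;                 // stripe claims rows but carries no data
      }
      rows += stripe.getNumberOfRows();
    }
    return rows == reader.getNumberOfRows();  // must match the footer total
  } catch (Exception e) {
    return false;                     // unreadable postscript/footer
  }
}
{code}
The catch is that such a check can only see footer and metadata inconsistencies; a corrupted chunk header like the one above sits inside a data stream that this check never touches.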
Is there any other way to solve the above problem?
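For comparison, the exhaustive check we are trying to avoid would look roughly like the sketch below: it forces every stream to be decompressed and decoded, so it would surface both exceptions above, but it costs a full pass over the file.
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.orc.OrcFile;
import org.apache.orc.Reader;
import org.apache.orc.RecordReader;

// Full-read verification: corruption anywhere in the row data throws here,
// right after the write, instead of in a downstream query.
static boolean fullyReadable(Path file, Configuration conf) {
  try (Reader reader = OrcFile.createReader(file, OrcFile.readerOptions(conf));
       RecordReader rows = reader.rows()) {
    VectorizedRowBatch batch = reader.getSchema().createRowBatch();
    while (rows.nextBatch(batch)) {
      // no-op: reaching end-of-file without an exception is the check
    }
    return true;
  } catch (Exception e) {
    return false;  // e.g. either of the two exceptions shown above
  }
}
{code}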
> Damaged ORC file causes many different exceptions
> -------------------------------------------------
>
> Key: ORC-1897
> URL: https://issues.apache.org/jira/browse/ORC-1897
> Project: ORC
> Issue Type: Bug
> Affects Versions: 1.6.7, 2.1.2
> Reporter: zhaolong
> Priority: Blocker
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)