zhaolong created ORC-1897:
-----------------------------
Summary: Damaged ORC files produce many different exceptions
Key: ORC-1897
URL: https://issues.apache.org/jira/browse/ORC-1897
Project: ORC
Issue Type: Bug
Affects Versions: 2.1.2, 1.6.7
Reporter: zhaolong
We have found many cases of ORC file corruption; errors are reported when the
files are read.
# java.lang.ArrayIndexOutOfBoundsException: 0
at org.apache.orc.impl.RunLengthIntegerReaderV2.readPatchedBaseValues(RunLengthIntegerReaderV2.java:200)
at org.apache.orc.impl.RunLengthIntegerReaderV2.readValues(RunLengthIntegerReaderV2.java:70)
at org.apache.orc.impl.RunLengthIntegerReaderV2.next(RunLengthIntegerReaderV2.java:323)
at org.apache.orc.impl.RunLengthIntegerReaderV2.nextVector(RunLengthIntegerReaderV2.java:373)
at org.apache.orc.impl.TreeReaderFactory$LongTreeReader.nextVector(TreeReaderFactory.java:696)
at org.apache.orc.impl.TreeReaderFactory$StructTreeReader.nextVector(TreeReaderFactory.java:2463)
at org.apache.orc.impl.reader.tree.StructBatchReader.readBatchColumn(StructBatchReader.java:42)
at org.apache.orc.impl.reader.tree.StructBatchReader.nextBatch(StructBatchReader.java:72)
at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1236)
# java.lang.IllegalArgumentException: Buffer size too small. size = 131072 needed = 471700 in column 1 kind DICTIONARY_DATA
at org.apache.orc.impl.InStream$CompressedStream.readHeader(InStream.java:487)
at org.apache.orc.impl.InStream$CompressedStream.ensureUncompressed(InStream.java:531)
at org.apache.orc.impl.InStream$CompressedStream.available(InStream.java:538)
at org.apache.orc.impl.TreeReaderFactory$StringDictionaryTreeReader.readDictionaryStream(TreeReaderFactory.java:1776)
at org.apache.orc.impl.TreeReaderFactory$StringDictionaryTreeReader.startStripe(TreeReaderFactory.java:1740)
at org.apache.orc.impl.TreeReaderFactory$StringTreeReader.startStripe(TreeReaderFactory.java:1491)
at org.apache.orc.impl.TreeReaderFactory$StructTreeReader.startStripe(TreeReaderFactory.java:2076)
at org.apache.orc.impl.RecordReaderImpl.readStripe(RecordReaderImpl.java:1117)
at org.apache.orc.impl.RecordReaderImpl.advanceStripe(RecordReaderImpl.java:1154)
at org.apache.orc.impl.RecordReaderImpl.advanceToNextRow(RecordReaderImpl.java:1189)
at org.apache.orc.impl.RecordReaderImpl.<init>(RecordReaderImpl.java:251)
at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:851)
at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:845)
Take the second exception as an example. The chunkLength should be between
32 KB (Snappy) and 256 KB (LZO, zlib), so why is needed = 471700? We tested
the CPU and memory of the hardware and found no errors, and no HDFS erasure
coding (EC) policy is configured.
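For context on where that needed value comes from: per the ORC specification, each compressed chunk is preceded by a 3-byte little-endian header storing (chunkLength * 2 + isOriginal), so the low bit of the first byte flags an uncompressed chunk and the remaining 23 bits are the length. A single corrupted header byte can therefore yield an arbitrary length such as 471700. A minimal sketch of the decoding (the class and method names are ours, mirroring what InStream.readHeader computes):

```java
public class OrcChunkHeader {
    /**
     * Decode the length from a 3-byte ORC compressed-chunk header
     * (little-endian). The header stores chunkLength * 2 + isOriginal,
     * so the low bit of b0 is a flag and the other 23 bits are length.
     */
    static int chunkLength(byte b0, byte b1, byte b2) {
        return ((0xff & b2) << 15) | ((0xff & b1) << 7) | ((0xff & b0) >> 1);
    }

    /** Low bit set means the chunk was stored uncompressed ("original"). */
    static boolean isOriginal(byte b0) {
        return (b0 & 1) == 1;
    }

    public static void main(String[] args) {
        // needed = 471700 corresponds to header bytes 0x28 0x65 0x0e
        // (471700 * 2 = 943400 = 0x0e6528, isOriginal bit clear).
        int needed = chunkLength((byte) 0x28, (byte) 0x65, (byte) 0x0e);
        System.out.println(needed + " original=" + isOriginal((byte) 0x28));
    }
}
```

Because the length is read from the file before any decompression, a reader configured with a 128 KB buffer has no way to satisfy a (bogus) 471700-byte chunk and fails exactly as in the trace above.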
So we want to verify the ORC file after Hive writes it in FileSinkOperator.
However, considering the performance impact, we can only read ORC metadata,
such as stripe sizes, to check whether there is any problem.
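As a sketch of the kind of footer-only check we have in mind (the class, record, and method names are ours, not an existing ORC API; in a real implementation the offsets and lengths would come from Reader.getStripes() / StripeInformation): verify that the stripes listed in the footer are non-empty, in order, non-overlapping, and contained within the file, without decompressing any data.

```java
public class OrcStripeSanity {
    /**
     * One footer entry: stripe start offset and total byte length.
     * (In orc-core this would be StripeInformation.getOffset() and
     * StripeInformation.getLength().)
     */
    record Stripe(long offset, long length) {}

    /**
     * Cheap metadata-only check: every stripe must be non-empty, start
     * at or after the end of the previous stripe, and end within the file.
     */
    static boolean stripesLookSane(long fileLength, java.util.List<Stripe> stripes) {
        long prevEnd = 0;
        for (Stripe s : stripes) {
            if (s.length() <= 0 || s.offset() < prevEnd) {
                return false;        // empty, out-of-order, or overlapping stripe
            }
            long end = s.offset() + s.length();
            if (end > fileLength) {
                return false;        // stripe runs past the end of the file
            }
            prevEnd = end;
        }
        return true;
    }

    public static void main(String[] args) {
        var ok  = java.util.List.of(new Stripe(3, 1000), new Stripe(1003, 500));
        var bad = java.util.List.of(new Stripe(3, 1000), new Stripe(500, 700));
        System.out.println(stripesLookSane(2000, ok));   // contiguous, in bounds
        System.out.println(stripesLookSane(2000, bad));  // second stripe overlaps
    }
}
```

A check like this only reads the postscript and footer, so its cost is independent of the data size; it cannot catch corruption inside compressed streams (like the bad chunk header above), but it rejects files whose stripe directory is already inconsistent.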
--
This message was sent by Atlassian Jira
(v8.20.10#820010)