Taraka Rama Rao Lethavadla created HIVE-29271:
-------------------------------------------------
Summary: Skip corrupted files while reading an Orc table
Key: HIVE-29271
URL: https://issues.apache.org/jira/browse/HIVE-29271
Project: Hive
Issue Type: Improvement
Components: Hive, HiveServer2
Reporter: Taraka Rama Rao Lethavadla
*Scenario:*
A large number of corrupted files, created by some external tools, are scattered across multiple partitions. When the table is queried, exceptions like the one below are thrown:
{noformat}
Caused by: com.google.protobuf.InvalidProtocolBufferException: Protocol message contained an invalid tag (zero).
    at com.google.protobuf.InvalidProtocolBufferException.invalidTag(InvalidProtocolBufferException.java:89)
    at com.google.protobuf.CodedInputStream.readTag(CodedInputStream.java:108)
    at org.apache.orc.OrcProto$PostScript.<init>(OrcProto.java:30246)
    at org.apache.orc.OrcProto$PostScript.<init>(OrcProto.java:30210)
    at org.apache.orc.OrcProto$PostScript$1.parsePartialFrom(OrcProto.java:30353)
    at org.apache.orc.OrcProto$PostScript$1.parsePartialFrom(OrcProto.java:30348)
    at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:89)
    at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:95)
    at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:49)
    at org.apache.orc.OrcProto$PostScript.parseFrom(OrcProto.java:30791)
    at org.apache.orc.impl.ReaderImpl.extractPostScript(ReaderImpl.java:644)
    at org.apache.orc.impl.ReaderImpl.extractFileTail(ReaderImpl.java:814)
    at org.apache.orc.impl.ReaderImpl.<init>(ReaderImpl.java:567)
    at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.<init>(ReaderImpl.java:61){noformat}
So it is not possible to query the data in the good files. The only workaround available today is to identify the corrupted files in the table and remove them. orc-tools takes a long time to find the corrupt files because it traverses each file sequentially and reports errors for each corrupt one.
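As an illustration of the detection side, here is a rough Java sketch (not an existing tool; the class name and traversal are made up for this example) that probes files in parallel with the ORC reader API. Opening a reader parses the file tail, which is exactly where the exception above is thrown, so a failed open flags the file as corrupt:
{code:java}
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;
import org.apache.orc.OrcFile;
import org.apache.orc.Reader;

public class CorruptOrcFinder {
  // Returns the files under tableRoot whose ORC tails cannot be parsed.
  public static List<Path> findCorrupt(Configuration conf, Path tableRoot) throws IOException {
    FileSystem fs = tableRoot.getFileSystem(conf);
    List<Path> candidates = new ArrayList<>();
    RemoteIterator<LocatedFileStatus> it = fs.listFiles(tableRoot, true); // recurse into partitions
    while (it.hasNext()) {
      Path p = it.next().getPath();
      String name = p.getName();
      if (!name.startsWith(".") && !name.startsWith("_")) { // skip hidden/metadata files
        candidates.add(p);
      }
    }
    // Probe in parallel instead of sequentially; creating a Reader parses the
    // postscript/footer, so a corrupt tail fails here with the exception above.
    return candidates.parallelStream()
        .filter(p -> {
          try (Reader ignored = OrcFile.createReader(p, OrcFile.readerOptions(conf))) {
            return false; // tail parsed fine, not corrupt
          } catch (Exception e) {
            return true;  // e.g. InvalidProtocolBufferException: flag as corrupt
          }
        })
        .collect(Collectors.toList());
  }
}
{code}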
*Proposal:*
Spark has a config, *ignoreCorruptFiles* (*spark.sql.files.ignoreCorruptFiles* for SQL reads), that skips corrupt files and reads the data from the remaining ones.
Can we implement something similar in Hive as well?
The feature could be guarded by a flag that is disabled by default.
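For reference, enabling the Spark-side behaviour looks like this (the config name is real; the SparkSession setup and table path are illustrative):
{code:java}
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class IgnoreCorruptFilesDemo {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("ignore-corrupt-files-demo")
        .getOrCreate();
    // Spark SQL option: scans skip unreadable files instead of failing the query.
    spark.conf().set("spark.sql.files.ignoreCorruptFiles", "true");
    // Illustrative path; any ORC-backed table directory behaves the same.
    Dataset<Row> rows = spark.read().orc("/warehouse/example_db.db/example_table");
    System.out.println(rows.count()); // counts rows from the readable files only
    spark.stop();
  }
}
{code}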
*Issues:*
If queries no longer fail, corrupt files may silently accumulate and cause problems later, such as inflated table size and incorrect results.
The reason for requesting this feature is that it is very difficult to identify faulty/corrupt files in large tables.
So it would also be good to be able to list all the corrupt files with a simple Hive query, so that they can be deleted without disturbing the actual query flow that skips them.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)