Taraka Rama Rao Lethavadla created HIVE-29271:
-------------------------------------------------

             Summary: Skip corrupted files while reading an Orc table
                 Key: HIVE-29271
                 URL: https://issues.apache.org/jira/browse/HIVE-29271
             Project: Hive
          Issue Type: Improvement
          Components: Hive, HiveServer2
            Reporter: Taraka Rama Rao Lethavadla


*Scenario:*

There are a large number of corrupted files scattered across multiple partitions. 
They were created by external tools. When the table is queried, 
exceptions like the one below are thrown:
{noformat}
Caused by: com.google.protobuf.InvalidProtocolBufferException: Protocol message contained an invalid tag (zero).
    at com.google.protobuf.InvalidProtocolBufferException.invalidTag(InvalidProtocolBufferException.java:89)
    at com.google.protobuf.CodedInputStream.readTag(CodedInputStream.java:108)
    at org.apache.orc.OrcProto$PostScript.<init>(OrcProto.java:30246)
    at org.apache.orc.OrcProto$PostScript.<init>(OrcProto.java:30210)
    at org.apache.orc.OrcProto$PostScript$1.parsePartialFrom(OrcProto.java:30353)
    at org.apache.orc.OrcProto$PostScript$1.parsePartialFrom(OrcProto.java:30348)
    at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:89)
    at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:95)
    at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:49)
    at org.apache.orc.OrcProto$PostScript.parseFrom(OrcProto.java:30791)
    at org.apache.orc.impl.ReaderImpl.extractPostScript(ReaderImpl.java:644)
    at org.apache.orc.impl.ReaderImpl.extractFileTail(ReaderImpl.java:814)
    at org.apache.orc.impl.ReaderImpl.<init>(ReaderImpl.java:567)
    at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.<init>(ReaderImpl.java:61){noformat}
As a result, even the data in the good files cannot be queried. The only 
workaround available today is to identify the corrupted files and remove them 
from the table.

orc-tools takes a long time to find the corrupt files, since it traverses 
each file sequentially and reports an error for every corrupt file it hits.
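As a faster first pass than a full orc-tools scan, the ORC magic bytes can be checked directly. This is only a sketch under the following assumptions: a healthy ORC file starts with the magic "ORC", and files from modern writers also carry the magic immediately before the final postscript-length byte (the byte the reader in the stack trace above starts from). A file can pass this check and still have a corrupt footer, so it is a cheap filter, not a replacement for a real read; the function name is illustrative.

```python
MAGIC = b"ORC"

def looks_like_valid_orc(path):
    """Cheap sanity check of the ORC header and tail magic bytes."""
    with open(path, "rb") as f:
        data = f.read()
    if len(data) < 2 * len(MAGIC) + 1:
        return False          # too small to be a real ORC file
    if not data.startswith(MAGIC):
        return False          # header magic missing
    ps_len = data[-1]         # last byte = postscript length
    if ps_len + 1 > len(data):
        return False          # postscript length points past file start
    # modern writers place the magic right before the length byte
    return data[-1 - len(MAGIC):-1] == MAGIC
```

Files that fail this check are almost certainly not readable ORC; files that pass still need a real footer parse to be sure.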

*Proposal:*

Spark has a config, *ignoreCorruptFiles*, which skips corrupt files and reads 
the data from the remaining ones.

Can we implement something similar in Hive?

The feature could be guarded by a flag that is disabled by default.

 

*Issues:*

If queries no longer fail, corrupt files may silently accumulate and cause 
problems later, e.g. growing table size and incorrect results.

 

The reason for requesting this feature is that identifying faulty/corrupt 
files in large tables is very difficult.

It would also be useful to list all the corrupt files with a simple Hive 
query, so that they can be deleted, rather than relying only on the skip 
behaviour in the regular query flow.
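Outside of Hive, the listing step above can be approximated by scanning a table or partition directory for files whose ORC magic bytes are missing. A minimal sketch, with these assumptions spelled out: the root path and function names are illustrative, the check is the same cheap header/tail magic test (not a full footer parse), and a real deployment would need HDFS access (e.g. a mounted path or a client library) rather than plain `os.walk`.

```python
import os
from concurrent.futures import ThreadPoolExecutor

MAGIC = b"ORC"

def magic_ok(path):
    """True if the file carries the ORC magic at both header and tail."""
    with open(path, "rb") as f:
        data = f.read()
    return (len(data) >= 2 * len(MAGIC) + 1
            and data.startswith(MAGIC)
            and data[-1 - len(MAGIC):-1] == MAGIC)

def find_suspect_files(root):
    """Walk a table/partition root and list files failing the magic check."""
    paths = [os.path.join(d, n)
             for d, _, names in os.walk(root) for n in names]
    with ThreadPoolExecutor() as pool:
        return [p for p, ok in zip(paths, pool.map(magic_ok, paths)) if not ok]
```

The returned paths can then be reviewed and deleted before the table is queried again.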



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
