Taraka Rama Rao Lethavadla created HIVE-26897:
-------------------------------------------------
Summary: Provide a command/tool to recover data in ACID table when
table data got corrupted with invalid/junk delta/delete_delta folders
Key: HIVE-26897
URL: https://issues.apache.org/jira/browse/HIVE-26897
Project: Hive
Issue Type: New Feature
Reporter: Taraka Rama Rao Lethavadla
Example: A table has the following directories:
{noformat}
drwx------ - hive hive 0 2022-11-05 19:43
/data/warehouse/tbl/delete_delta_0080483_0087704_v0973185
drwx------ - hive pdl_prod_nosh_jsin 0 2022-12-05 00:18
/data/warehouse/tbl/delete_delta_0080483_0088384_v1111507{noformat}
Reading data from this table fails with the following error:
{noformat}
java.util.concurrent.ExecutionException: java.lang.IllegalStateException:
Duplicate key null (attempted merging values
org.apache.hadoop.hive.ql.io.AcidInputFormat$DeltaFileMetaData@41776cd9 and
org.apache.hadoop.hive.ql.io.AcidInputFormat$DeltaFileMetaData@1404a054){noformat}
Both delete_delta_0080483_0087704_v0973185 and
delete_delta_0080483_0088384_v1111507 were created by minor compaction.
Normally, once a minor compaction completes, the next minor compaction picks a
min_writeId greater than the max_writeId of the previous compaction. Here,
however, both compacted directories have the same min_writeId (0080483).
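The overlap described above can be spotted from the directory names alone. The following is a minimal sketch, not part of Hive: it assumes the names follow the pattern (delete_)delta_<minWriteId>_<maxWriteId>, optionally followed by a statement ID and a _v<visibilityTxnId> suffix, and the function name find_overlapping_deltas is hypothetical. It groups directories by kind and min_writeId and reports groups whose max_writeId values differ.

```python
import re
from collections import defaultdict

# Assumed naming pattern (an assumption of this sketch):
#   (delete_)delta_<minWriteId>_<maxWriteId>[_<stmtId>][_v<visibilityTxnId>]
DELTA_RE = re.compile(
    r"^(?P<kind>delete_delta|delta)_(?P<min>\d+)_(?P<max>\d+)"
    r"(?:_(?P<stmt>\d+))?(?:_v(?P<vis>\d+))?$"
)


def find_overlapping_deltas(paths):
    """Return {(kind, min_write_id): [(max_write_id, dir_name), ...]} for
    groups that share a min write ID but have different max write IDs."""
    groups = defaultdict(list)
    for path in paths:
        # Only the last path component carries the write ID range.
        name = path.rstrip("/").rsplit("/", 1)[-1]
        match = DELTA_RE.match(name)
        if match is None:
            continue  # not a delta/delete_delta directory
        key = (match.group("kind"), int(match.group("min")))
        groups[key].append((int(match.group("max")), name))
    return {
        key: sorted(entries)
        for key, entries in groups.items()
        if len({max_id for max_id, _ in entries}) > 1
    }


if __name__ == "__main__":
    listing = [
        "/data/warehouse/tbl/delete_delta_0080483_0087704_v0973185",
        "/data/warehouse/tbl/delete_delta_0080483_0088384_v1111507",
    ]
    for key, entries in find_overlapping_deltas(listing).items():
        print(key, entries)
```

Directories that share both min and max write IDs and differ only in statement ID are deliberately not reported, since this sketch only looks for the same-min, different-max case seen in this issue.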
To mitigate the issue, we had to remove those directories manually from HDFS,
create a fresh table from the remaining data, drop the original table, and
rename the fresh table to the original name.
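The manual workaround above can be written down as an ordered list of commands. This is only an illustrative sketch: the helper build_recovery_plan and the "_recovered" suffix are hypothetical, the choice of which directories to remove is left to the caller, and the plan is returned as text rather than executed. It also assumes a plain CREATE TABLE ... AS SELECT is acceptable, which does not carry over partitioning or table properties.

```python
def build_recovery_plan(table, bad_dirs, temp_suffix="_recovered"):
    """Return the manual workaround as an ordered list of command strings.

    Nothing is executed here; the strings are meant to be reviewed and run
    by an operator. `temp_suffix` is a hypothetical naming choice.
    """
    fresh = table + temp_suffix
    # Step 1: remove the offending directories from HDFS.
    steps = [f"hdfs dfs -rm -r {d}" for d in bad_dirs]
    # Steps 2-4: copy the readable data, drop the original, rename the copy.
    steps += [
        f"CREATE TABLE {fresh} AS SELECT * FROM {table};",
        f"DROP TABLE {table};",
        f"ALTER TABLE {fresh} RENAME TO {table};",
    ]
    return steps


if __name__ == "__main__":
    plan = build_recovery_plan(
        "tbl",
        ["/data/warehouse/tbl/delete_delta_0080483_0088384_v1111507"],
    )
    for step in plan:
        print(step)
```

Backing up the removed directories before step 1 would make the procedure reversible, which is the gap the proposal in this issue aims to close.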
*Proposal*
Create a tool/command that reads the data from a corrupted ACID table so that
it can be recovered before any changes are made to the underlying data. The
problem can then be worked around by creating another table with the same data.