[
https://issues.apache.org/jira/browse/HIVE-6309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13886078#comment-13886078
]
Ashutosh Chauhan commented on HIVE-6309:
----------------------------------------
+1
> Hive incorrectly removes TaskAttempt output files if MRAppMaster fails once
> ---------------------------------------------------------------------------
>
> Key: HIVE-6309
> URL: https://issues.apache.org/jira/browse/HIVE-6309
> Project: Hive
> Issue Type: Bug
> Environment: hadoop 2.2
> Reporter: Chun Chen
> Assignee: Chun Chen
> Priority: Critical
> Fix For: 0.13.0
>
> Attachments: HIVE-6309.patch
>
>
> We recently upgraded to Hadoop 2.2 and occasionally found that some tables
> had lost several data files after a midnight ETL process. The MapReduce jobs
> that generated these partial tables had one thing in common: the MRAppMaster
> of each had failed once, and every affected table was left with only a
> single data file, 000000_1000.
> The following entries in hive.log gave us some clues about the incorrectly
> deleted data files:
> {code}
> $ grep 'hive_2014-01-24_12-33-18_507_6790415670781610350' hive.log
> 2014-01-24 12:52:43,140 WARN exec.Utilities (Utilities.java:removeTempOrDuplicateFiles(1535)) - Duplicate taskid file removed: hdfs://hadoop00.lf.sankuai.com:9000/tmp/hive-scratch/hive-sankuai/hive_2014-01-24_12-33-18_507_6790415670781610350/_tmp.-ext-10000.intermediate/000001_1000 with length 824627293. Existing file: hdfs://hadoop00.lf.sankuai.com:9000/tmp/hive-scratch/hive-sankuai/hive_2014-01-24_12-33-18_507_6790415670781610350/_tmp.-ext-10000.intermediate/000000_1000 with length 824860643
> 2014-01-24 12:52:43,142 WARN exec.Utilities (Utilities.java:removeTempOrDuplicateFiles(1535)) - Duplicate taskid file removed: hdfs://hadoop00.lf.sankuai.com:9000/tmp/hive-scratch/hive-sankuai/hive_2014-01-24_12-33-18_507_6790415670781610350/_tmp.-ext-10000.intermediate/000002_1000 with length 824681826. Existing file: hdfs://hadoop00.lf.sankuai.com:9000/tmp/hive-scratch/hive-sankuai/hive_2014-01-24_12-33-18_507_6790415670781610350/_tmp.-ext-10000.intermediate/000000_1000 with length 824860643
> 2014-01-24 12:52:43,149 WARN exec.Utilities (Utilities.java:removeTempOrDuplicateFiles(1535)) - Duplicate taskid file removed: hdfs://hadoop00.lf.sankuai.com:9000/tmp/hive-scratch/hive-sankuai/hive_2014-01-24_12-33-18_507_6790415670781610350/_tmp.-ext-10000.intermediate/000003_1000 with length 824830450. Existing file: hdfs://hadoop00.lf.sankuai.com:9000/tmp/hive-scratch/hive-sankuai/hive_2014-01-24_12-33-18_507_6790415670781610350/_tmp.-ext-10000.intermediate/000000_1000 with length 824860643
> 2014-01-24 12:52:43,151 WARN exec.Utilities (Utilities.java:removeTempOrDuplicateFiles(1535)) - Duplicate taskid file removed: hdfs://hadoop00.lf.sankuai.com:9000/tmp/hive-scratch/hive-sankuai/hive_2014-01-24_12-33-18_507_6790415670781610350/_tmp.-ext-10000.intermediate/000004_1000 with length 824753882. Existing file: hdfs://hadoop00.lf.sankuai.com:9000/tmp/hive-scratch/hive-sankuai/hive_2014-01-24_12-33-18_507_6790415670781610350/_tmp.-ext-10000.intermediate/000000_1000 with length 824860643
> {code}
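> For context, removeTempOrDuplicateFiles() keeps at most one output file per
> extracted task id and, as the log above suggests, deletes the smaller file
> whenever two files map to the same id. A simplified sketch of that idea
> (illustrative only, not Hive's actual code; extractTaskId is a hypothetical
> stand-in for the regex-based extraction quoted below):
> {code}
> // Simplified sketch only, not Hive's actual implementation: keep one output
> // file per extracted task id; on a collision the log above suggests the
> // larger file survives and the smaller is deleted as a duplicate attempt.
> import java.io.IOException;
> import java.util.HashMap;
> import java.util.Map;
> import java.util.regex.Matcher;
> import java.util.regex.Pattern;
> import org.apache.hadoop.fs.FileStatus;
> import org.apache.hadoop.fs.FileSystem;
>
> public class DedupSketch {
>   static void removeDuplicates(FileSystem fs, FileStatus[] files) throws IOException {
>     Map<String, FileStatus> byTaskId = new HashMap<String, FileStatus>();
>     for (FileStatus file : files) {
>       String taskId = extractTaskId(file.getPath().getName());
>       FileStatus existing = byTaskId.get(taskId);
>       if (existing == null) {
>         byTaskId.put(taskId, file);
>       } else if (file.getLen() > existing.getLen()) {
>         fs.delete(existing.getPath(), true); // smaller file treated as duplicate
>         byTaskId.put(taskId, file);
>       } else {
>         fs.delete(file.getPath(), true);
>       }
>     }
>   }
>
>   // Hypothetical stand-in for the FILE_NAME_TO_TASK_ID_REGEX lookup quoted below.
>   static String extractTaskId(String fileName) {
>     Matcher m = Pattern.compile("^.*?([0-9]+)(_[0-9]{1,3})?(\\..*)?$").matcher(fileName);
>     return m.matches() ? m.group(1) : fileName;
>   }
> }
> {code}
> With correct task id extraction this removes only genuine duplicate attempt
> outputs; the bug described below makes distinct tasks look like duplicates of
> one another.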
> We found that this happens because nextAttemptNumber in Hadoop 2.2 is 1000 or
> greater after an MRAppMaster restart, and Hive does not extract the task id
> correctly from such filenames. See the following code in
> org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl.java
> and ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java:
> {code}
> // org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl.java
> // All the new TaskAttemptIDs are generated based on MR
> // ApplicationAttemptID so that attempts from previous lives don't
> // over-step the current one. This assumes that a task won't have more
> // than 1000 attempts in its single generation, which is very reasonable.
> nextAttemptNumber = (appAttemptId - 1) * 1000;
> // ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java
> /**
>  * The first group will contain the task id. The second group is the optional extension. The file
>  * name looks like: "0_0" or "0_0.gz". There may be a leading prefix (tmp_). Since getTaskId() can
>  * return an integer only - this should match a pure integer as well. {1,3} is used to limit
>  * matching for attempts #'s 0-999.
>  */
> private static final Pattern FILE_NAME_TO_TASK_ID_REGEX =
>     Pattern.compile("^.*?([0-9]+)(_[0-9]{1,3})?(\\..*)?$");
> {code}
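> To make the interaction concrete: after one MRAppMaster failure the rerun has
> appAttemptId = 2, so every task in the new generation starts at attempt
> number (2 - 1) * 1000 = 1000 and writes files named like 000000_1000. A small
> standalone demo (illustrative only, not Hive or MapReduce code):
> {code}
> // Illustrative demo of the attempt numbering after one AM failure.
> public class AttemptNumberDemo {
>   public static void main(String[] args) {
>     int appAttemptId = 2;                              // AM failed once, rerun is attempt 2
>     int nextAttemptNumber = (appAttemptId - 1) * 1000; // 1000, as in TaskImpl
>     for (int task = 0; task < 5; task++) {
>       // Output files are named <taskId>_<attemptNumber>
>       System.out.println(String.format("%06d_%d", task, nextAttemptNumber));
>     }
>     // Prints 000000_1000 through 000004_1000, matching the log above.
>   }
> }
> {code}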
> As a result, for attempt numbers >= 1000 the pattern extracts the attempt
> number rather than the task id:
> {code}
> >>> import re
> >>> re.match("^.*?([0-9]+)(_[0-9]{1,3})?(\\..*)?$", 'part-r-000000_2').group(1)
> '000000'
> >>> re.match("^.*?([0-9]+)(_[0-9]{1,3})?(\\..*)?$", 'part-r-000000_1001').group(1)
> '1001'
> {code}
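> Because group 2 admits at most three digits, a suffix like _1000 cannot be
> consumed there; the matcher backtracks and group 1 ends up capturing the
> attempt number. So 000000_1000 through 000004_1000 all map to the same "task
> id" 1000, and removeTempOrDuplicateFiles deletes all but one of them. A
> minimal sketch of one possible fix, simply widening the suffix quantifier
> (illustrative only; see HIVE-6309.patch for the actual change):
> {code}
> import java.util.regex.Matcher;
> import java.util.regex.Pattern;
>
> // Illustration only; the committed fix is in HIVE-6309.patch.
> public class TaskIdRegexDemo {
>   private static final Pattern BUGGY =
>       Pattern.compile("^.*?([0-9]+)(_[0-9]{1,3})?(\\..*)?$");
>   private static final Pattern WIDENED =
>       Pattern.compile("^.*?([0-9]+)(_[0-9]+)?(\\..*)?$"); // any-length attempt suffix
>
>   static String taskId(Pattern p, String name) {
>     Matcher m = p.matcher(name);
>     return m.matches() ? m.group(1) : null;
>   }
>
>   public static void main(String[] args) {
>     System.out.println(taskId(BUGGY, "000000_1000"));   // 1000   (attempt number!)
>     System.out.println(taskId(WIDENED, "000000_1000")); // 000000 (task id)
>   }
> }
> {code}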