[ https://issues.apache.org/jira/browse/HIVE-6309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13886078#comment-13886078 ]

Ashutosh Chauhan commented on HIVE-6309:
----------------------------------------

+1

> Hive incorrectly removes TaskAttempt output files if MRAppMaster fails once
> ---------------------------------------------------------------------------
>
>                 Key: HIVE-6309
>                 URL: https://issues.apache.org/jira/browse/HIVE-6309
>             Project: Hive
>          Issue Type: Bug
>         Environment: hadoop 2.2
>            Reporter: Chun Chen
>            Assignee: Chun Chen
>            Priority: Critical
>             Fix For: 0.13.0
>
>         Attachments: HIVE-6309.patch
>
>
> We recently upgraded to hadoop 2.2 and occasionally found that some tables 
> had lost several data files after a midnight ETL process. The MapReduce jobs 
> that generated these partial tables had one thing in common: their 
> MRAppMaster had failed once, and each affected table was left with only a 
> single data file, 000000_1000.
> The following entries in hive.log gave us some clues about the incorrectly 
> deleted data files:
> {code}
> $ grep 'hive_2014-01-24_12-33-18_507_6790415670781610350' hive.log
> 2014-01-24 12:52:43,140 WARN  exec.Utilities 
> (Utilities.java:removeTempOrDuplicateFiles(1535)) - Duplicate taskid file 
> removed: 
> hdfs://hadoop00.lf.sankuai.com:9000/tmp/hive-scratch/hive-sankuai/hive_2014-01-24_12-33-18_507_6790415670781610350/_tmp.-ext-10000.intermediate/000001_1000
>  with length 824627293. Existing file: 
> hdfs://hadoop00.lf.sankuai.com:9000/tmp/hive-scratch/hive-sankuai/hive_2014-01-24_12-33-18_507_6790415670781610350/_tmp.-ext-10000.intermediate/000000_1000
>  with length 824860643
> 2014-01-24 12:52:43,142 WARN  exec.Utilities 
> (Utilities.java:removeTempOrDuplicateFiles(1535)) - Duplicate taskid file 
> removed: 
> hdfs://hadoop00.lf.sankuai.com:9000/tmp/hive-scratch/hive-sankuai/hive_2014-01-24_12-33-18_507_6790415670781610350/_tmp.-ext-10000.intermediate/000002_1000
>  with length 824681826. Existing file: 
> hdfs://hadoop00.lf.sankuai.com:9000/tmp/hive-scratch/hive-sankuai/hive_2014-01-24_12-33-18_507_6790415670781610350/_tmp.-ext-10000.intermediate/000000_1000
>  with length 824860643
> 2014-01-24 12:52:43,149 WARN  exec.Utilities 
> (Utilities.java:removeTempOrDuplicateFiles(1535)) - Duplicate taskid file 
> removed: 
> hdfs://hadoop00.lf.sankuai.com:9000/tmp/hive-scratch/hive-sankuai/hive_2014-01-24_12-33-18_507_6790415670781610350/_tmp.-ext-10000.intermediate/000003_1000
>  with length 824830450. Existing file: 
> hdfs://hadoop00.lf.sankuai.com:9000/tmp/hive-scratch/hive-sankuai/hive_2014-01-24_12-33-18_507_6790415670781610350/_tmp.-ext-10000.intermediate/000000_1000
>  with length 824860643
> 2014-01-24 12:52:43,151 WARN  exec.Utilities 
> (Utilities.java:removeTempOrDuplicateFiles(1535)) - Duplicate taskid file 
> removed: 
> hdfs://hadoop00.lf.sankuai.com:9000/tmp/hive-scratch/hive-sankuai/hive_2014-01-24_12-33-18_507_6790415670781610350/_tmp.-ext-10000.intermediate/000004_1000
>  with length 824753882. Existing file: 
> hdfs://hadoop00.lf.sankuai.com:9000/tmp/hive-scratch/hive-sankuai/hive_2014-01-24_12-33-18_507_6790415670781610350/_tmp.-ext-10000.intermediate/000000_1000
>  with length 824860643
> {code}
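> Note that every removed file carries the 4-digit attempt suffix _1000, and 
> the surviving file 000000_1000 is simply the largest of the five. As the code 
> quoted further below shows, Hive's file-name-to-task-id regex captures 1000, 
> not the real task number, from such names, so all five reducer outputs look 
> like duplicate attempts of a single task and all but the largest are removed. 
> A minimal standalone sketch (the demo class is ours; the pattern is Hive's):
> {code}
> import java.util.regex.Matcher;
> import java.util.regex.Pattern;
> 
> public class TaskIdDemo {
>   // Same pattern as Hive's Utilities.FILE_NAME_TO_TASK_ID_REGEX (quoted below)
>   private static final Pattern P =
>       Pattern.compile("^.*?([0-9]+)(_[0-9]{1,3})?(\\..*)?$");
> 
>   public static void main(String[] args) {
>     for (String name : new String[] {"000000_1000", "000001_1000", "000002_1000"}) {
>       Matcher m = P.matcher(name);
>       if (m.matches()) {
>         // The 4-digit suffix "_1000" cannot match (_[0-9]{1,3})?, so the
>         // regex backtracks and captures the trailing "1000" as the task id.
>         System.out.println(name + " -> task id " + m.group(1)); // always 1000
>       }
>     }
>   }
> }
> {code}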
> We found that this happens because nextAttemptNumber in hadoop 2.2 is 1000 or 
> greater once the MRAppMaster has failed, and Hive doesn't correctly extract 
> the task id from such file names. See the following code in 
> org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl.java
> and ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java:
> {code}
> // org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl.java
>     // All the new TaskAttemptIDs are generated based on MR
>     // ApplicationAttemptID so that attempts from previous lives don't
>     // over-step the current one. This assumes that a task won't have more
>     // than 1000 attempts in its single generation, which is very reasonable.
>     nextAttemptNumber = (appAttemptId - 1) * 1000;
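>     // Note: after a single MRAppMaster failure, appAttemptId == 2, so the
>     // new generation's task attempts are numbered starting from 1000.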
> // ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java
>    /**
>     * The first group will contain the task id. The second group is the 
> optional extension. The file
>     * name looks like: "0_0" or "0_0.gz". There may be a leading prefix 
> (tmp_). Since getTaskId() can
>     * return an integer only - this should match a pure integer as well. 
> {1,3} is used to limit
>     * matching for attempts #'s 0-999.
>     */
>    private static final Pattern FILE_NAME_TO_TASK_ID_REGEX =
>        Pattern.compile("^.*?([0-9]+)(_[0-9]{1,3})?(\\..*)?$");
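>    // With attempt numbers >= 1000, the optional (_[0-9]{1,3}) group cannot
>    // consume the whole suffix, so the attempt number itself lands in group 1.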
> {code}
> And the script below confirms that Hive fails to extract the correct task id 
> for attempt numbers >= 1000:
> {code}
> >>> re.match("^.*?([0-9]+)(_[0​-9])?(\\..*)?$", 'part-r-000000_2').group(1)
> '000000'
> >>> re.match("^.*?([0-9]+)(_[0​-9])?(\\..*)?$", 'part-r-000000_1001').group(1)
> '1001'
> {code}
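> One possible repair is to widen the quantifier on the attempt-number group so 
> that 4-digit (and larger) suffixes are stripped before the task id is 
> captured. This is only a sketch, and the attached HIVE-6309.patch may take a 
> different approach:
> {code}
> import java.util.regex.Matcher;
> import java.util.regex.Pattern;
> 
> public class FixedTaskIdDemo {
>   // Hypothetical widened pattern: {1,6} covers attempt numbers up to 999999
>   // instead of {1,3}'s 0-999. Not necessarily what HIVE-6309.patch does.
>   private static final Pattern FIXED =
>       Pattern.compile("^.*?([0-9]+)(_[0-9]{1,6})?(\\..*)?$");
> 
>   public static void main(String[] args) {
>     Matcher m = FIXED.matcher("part-r-000000_1001");
>     if (m.matches()) {
>       System.out.println(m.group(1)); // prints "000000", as intended
>     }
>   }
> }
> {code}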



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
