[ https://issues.apache.org/jira/browse/HCATALOG-499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13452283#comment-13452283 ]

Rohini Palaniswamy commented on HCATALOG-499:
---------------------------------------------

Script to Reproduce:

{code}
A = LOAD '/tmp/testdata' using PigStorage(',') as (num:int);
B = FILTER A BY num <= 5;
C = FILTER A BY num > 5;
STORE B into 'testdb.table1' using org.apache.hcatalog.pig.HCatStorer('part=1');
STORE C into 'testdb.table1' using org.apache.hcatalog.pig.HCatStorer('part=2');
{code}

Issue analysis:
Hadoop 20:

org.apache.hadoop.mapred.Task.initialize():
{code}
jobContext = new JobContext(job, id, reporter);
taskContext = new TaskAttemptContext(job, taskId, reporter);
...
Path outputPath = FileOutputFormat.getOutputPath(conf);
    if (outputPath != null) {
      if ((committer instanceof FileOutputCommitter)) {
        FileOutputFormat.setWorkOutputPath(conf, 
          ((FileOutputCommitter)committer).getTempTaskOutputPath(taskContext));
      } else {
        FileOutputFormat.setWorkOutputPath(conf, outputPath);
      }
    }
    committer.setupTask(taskContext);
{code}

Hadoop 23:
org.apache.hadoop.mapred.Task.initialize():

{code}
jobContext = new JobContextImpl(job, id, reporter);
taskContext = new TaskAttemptContextImpl(job, taskId, reporter);
...
 Path outputPath = FileOutputFormat.getOutputPath(conf);
    if (outputPath != null) {
      if ((committer instanceof FileOutputCommitter)) {
        FileOutputFormat.setWorkOutputPath(conf, 
          ((FileOutputCommitter)committer).getTaskAttemptPath(taskContext));
      } else {
        FileOutputFormat.setWorkOutputPath(conf, outputPath);
      }
    }
    committer.setupTask(taskContext);
{code}

In Hadoop 20, new JobContext/new TaskAttemptContext clone the JobConf, whereas
in Hadoop 23 JobContextImpl adopts the passed JobConf by reference. So in 20,
by the time committer.setupTask(taskContext) was called, the task context never
had mapred.work.output.dir set, because conf and taskContext did not share the
same JobConf. With 23 they do share it, so the property is visible to the
task context.
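The clone-vs-adopt difference can be illustrated outside Hadoop with a toy
configuration class (a sketch only; `Conf`, the key, and the path below are
stand-ins, not Hadoop code):

```java
import java.util.HashMap;
import java.util.Map;

// Toy stand-in for JobConf to show clone vs. adopt semantics.
class Conf {
    final Map<String, String> props = new HashMap<>();
    Conf() {}
    Conf(Conf other) { props.putAll(other.props); } // copy, as in Hadoop 20 contexts
    void set(String k, String v) { props.put(k, v); }
    String get(String k) { return props.get(k); }
}

public class CloneVsAdopt {
    public static void main(String[] args) {
        String key = "mapred.work.output.dir";

        // Hadoop 20 style: the context clones the conf, so a later
        // setWorkOutputPath on the original JobConf is never visible
        // through the context.
        Conf job20 = new Conf();
        Conf ctx20 = new Conf(job20);
        job20.set(key, "/out/_temporary/attempt_0");
        System.out.println("20 context sees: " + ctx20.get(key)); // null

        // Hadoop 23 style: the context adopts the same conf object,
        // so the setting leaks into the context.
        Conf job23 = new Conf();
        Conf ctx23 = job23; // adopted by reference
        job23.set(key, "/out/_temporary/attempt_0");
        System.out.println("23 context sees: " + ctx23.get(key));
    }
}
```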

This causes problems for custom OutputCommitters such as HCatalog's
FileOutputCommitterContainer, which wraps mapred.FileOutputCommitter. Since the
container is not an instanceof mapred.FileOutputCommitter, the work output path
is incorrectly set to outputPath, which makes the wrapped FileOutputCommitter
ignore the data that was written and report that no output was found. There is
no issue when mapreduce.lib.output.FileOutputCommitter is used, as its
commitTask does not consult getWorkOutputPath.

HCat already calls setWorkOutputPath() in FileOutputFormatContainer. With
Hadoop 23 it now needs to be done in FileOutputCommitterContainer as well, so
that commitTask sees the correct work output path.

                
> Multiple store commands does not work with Hadoop23
> ---------------------------------------------------
>
>                 Key: HCATALOG-499
>                 URL: https://issues.apache.org/jira/browse/HCATALOG-499
>             Project: HCatalog
>          Issue Type: Bug
>    Affects Versions: 0.4.1
>            Reporter: Rohini Palaniswamy
>            Assignee: Rohini Palaniswamy
>             Fix For: 0.5, 0.4.1
>
>
> There is change in the semantics of
> JobContext::JobContext(Configuration, JobID). While in .20, the Config was
> cloned, in .23 the Config is adopted (if it's a JobConf). That combined with 
> the way mapred Task.java handles output committers that do not extend 
> FileOutputCommitter has broken storing different partitions to the same table 
> with multiple store statements in pig.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
