[ 
https://issues.apache.org/jira/browse/HADOOP-4927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12658455#action_12658455
 ] 

Devaraj Das commented on HADOOP-4927:
-------------------------------------

Okay, so I figured out that I was referring to the old MapReduce API *smile*
There seem to be two approaches anyway. For the old API:
Today, the getRecordWriter calls relevant to the tasks are made in two places - 
in DirectMapOutputCollector (in the constructor) and in ReduceTask.java (just 
before starting to call the user's reduce method). We can probably move the 
calls to the respective OutputCollector.collect implementations:
{code}
// Create the RecordWriter (and hence the part file) only on the first collect call
if (out == null) {
  out = job.getOutputFormat().getRecordWriter(fs, job, finalName, reporter);
}
{code}
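
To make the idea concrete, here is a minimal, self-contained sketch of a collector that creates its writer on first use. It does not use the real Hadoop classes: SketchRecordWriter, LazyCollector, and writerCreated are hypothetical names made up for illustration, and the Supplier stands in for the job.getOutputFormat().getRecordWriter(...) call.

```java
import java.util.function.Supplier;

// Hypothetical stand-in for a record writer; NOT Hadoop's RecordWriter interface.
interface SketchRecordWriter<K, V> {
    void write(K key, V value);
    void close();
}

// Hypothetical collector that defers writer creation until the first collect().
class LazyCollector<K, V> {
    // Stands in for the getRecordWriter call, which is what creates the part file.
    private final Supplier<SketchRecordWriter<K, V>> factory;
    private SketchRecordWriter<K, V> out; // stays null until the first record arrives

    LazyCollector(Supplier<SketchRecordWriter<K, V>> factory) {
        this.factory = factory;
    }

    void collect(K key, V value) {
        if (out == null) {
            out = factory.get(); // the writer is created here, not in the constructor
        }
        out.write(key, value);
    }

    void close() {
        // A task that never collected anything has no writer to close.
        if (out != null) {
            out.close();
        }
    }

    boolean writerCreated() {
        return out != null;
    }
}
```

With this shape, a task that never calls collect never triggers the factory, so no part file would be created for it. Note that close has to tolerate the writer never having been created.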

For the new API, I am not yet sure what a good approach would be. Maybe we could 
delay creating the RecordWriter until TaskInputOutputContext.write is invoked. 

The other approach is to delay the creation of the files on the output 
filesystem, until it is necessary, in the respective RecordWriter 
implementations. But this requires users (who have implemented RecordWriters or 
will implement them in the future) to be aware of such a change, and is 
therefore error-prone.
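
As a rough illustration of this second approach, the sketch below shows a writer that opens its file only when the first record is written. It is not a Hadoop RecordWriter: LazyFileRecordWriter is a hypothetical class, the local filesystem (java.nio.file) stands in for the output filesystem, and the tab-separated line format is an assumption.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical writer, NOT Hadoop's RecordWriter. The local filesystem
// stands in for the job's output filesystem.
class LazyFileRecordWriter implements AutoCloseable {
    private final Path partFile;
    private BufferedWriter out; // opened on the first write, so empty tasks leave no file

    LazyFileRecordWriter(Path partFile) {
        this.partFile = partFile; // only remember the path; do not touch the filesystem yet
    }

    void write(String key, String value) throws IOException {
        if (out == null) {
            out = Files.newBufferedWriter(partFile); // the file is created here
        }
        out.write(key);
        out.write('\t');
        out.write(value);
        out.newLine();
    }

    @Override
    public void close() throws IOException {
        // Nothing to close (and no file on disk) if write was never called.
        if (out != null) {
            out.close();
        }
    }
}
```

Every RecordWriter implementation would need the same null check in both write and close, which is the burden on implementers mentioned above.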

Thoughts?

> Part files on the output filesystem are created irrespective of whether the 
> corresponding task has anything to write there
> --------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-4927
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4927
>             Project: Hadoop Core
>          Issue Type: Bug
>            Reporter: Devaraj Das
>             Fix For: 0.20.0
>
>
> When OutputFormat.getRecordWriter is invoked, a part file is created on the 
> output filesystem. But the created RecordWriter is not used until the 
> OutputCollector.collect call is made by the task (user's code). This results 
> in empty part files even if the OutputCollector.collect is never invoked by 
> the corresponding tasks.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
