[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dick King updated MAPREDUCE-1073:
---------------------------------

    Attachment: MAPREDUCE-1073--yhadoop20--2010-07-22.patch

The previous versions of this attachment missed one point.

The basic problem is that the existing code base computes progress from the 
records read from the input split, but pipes buffers those records before the 
pipes application processes them.  In jobs where the input splits are small, 
this makes tasks appear to have made more progress than they really have.
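
As a minimal sketch of the effect (this is not the pipes code itself; the 
class name, method name, and split size are hypothetical), consider a reader 
that pushes every record of a small split into a buffer before the consumer 
has handled more than one of them:

```java
import java.util.ArrayDeque;
import java.util.Queue;

public class BufferedProgressDemo {
    /** Fraction of the split consumed, given a record count (hypothetical helper). */
    static float fractionDone(int records, int totalRecords) {
        return totalRecords == 0 ? 1.0f : (float) records / totalRecords;
    }

    public static void main(String[] args) {
        int totalRecords = 4;                        // a small input split (assumed size)
        Queue<Integer> buffer = new ArrayDeque<>();  // stands in for the buffering in pipes
        int recordsRead = 0;
        int recordsProcessed = 0;

        // The reading side drains the whole split into the buffer without waiting.
        while (recordsRead < totalRecords) {
            buffer.add(recordsRead);
            recordsRead++;
        }

        // Meanwhile the application has only handled a single record.
        buffer.remove();
        recordsProcessed++;

        // Progress inferred from reads says 1.0; the work actually done is 0.25.
        System.out.println("reader-based progress: " + fractionDone(recordsRead, totalRecords));
        System.out.println("actual progress:       " + fractionDone(recordsProcessed, totalRecords));
    }
}
```

With a small split the whole input fits in the buffer, so the read-based 
figure reaches 100% almost immediately and tells speculation nothing useful.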

To make speculation work under pipes with small input splits, two conditions 
have to be met:

1: The pipes code has to have an API to report progress, and has to use it.  
The old patch met this goal.  You incant {{(&context)->setProgress(float)}} 
within {{HadoopPipes::Mapper.map(HadoopPipes::MapContext& context)}}.  This 
does require that you have a way of measuring progress, which I consider likely 
because this is only needed when the input splits are small, which implies that 
the "input data" is really a signal to get the real data somewhere else [or to 
generate it].

2: The job has to be able to say that the progress that would otherwise be 
inferred from input split reads should be ignored.  This newest version of the 
patch does that; you can either call 
{{JobConf.setRecordReaderProgressDisabled(true)}}, or set the attribute 
{{mapred.job.disable.record.reader.progress}} to {{true}}.
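
To illustrate the intent of the second condition, here is a simplified model 
of such a switch.  It is a sketch under stated assumptions, not the patch: a 
plain {{Map}} stands in for the job configuration, the helper names are 
hypothetical, and only the property name is taken from the text above.

```java
import java.util.HashMap;
import java.util.Map;

public class ProgressSourceSketch {
    // Property name as given in the comment above.
    static final String DISABLE_KEY = "mapred.job.disable.record.reader.progress";

    /** Hypothetical helper: true when the job asked to ignore record-reader progress. */
    static boolean readerProgressDisabled(Map<String, String> conf) {
        return Boolean.parseBoolean(conf.getOrDefault(DISABLE_KEY, "false"));
    }

    /**
     * Hypothetical helper: chooses which value the task reports.
     * Assumption: when the switch is off, read-based progress wins, as it does today.
     */
    static float effectiveProgress(Map<String, String> conf,
                                   float readerProgress, float appProgress) {
        return readerProgressDisabled(conf) ? appProgress : readerProgress;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();  // stand-in for the job configuration
        float readerProgress = 1.0f;                 // split fully read into the buffer
        float appProgress = 0.25f;                   // what the application itself reported

        System.out.println("default:  " + effectiveProgress(conf, readerProgress, appProgress));
        conf.put(DISABLE_KEY, "true");
        System.out.println("disabled: " + effectiveProgress(conf, readerProgress, appProgress));
    }
}
```

In this model the application-reported value only becomes visible once the 
job sets the switch, which is why both conditions are needed together.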

This patch addresses the second point.  I did not mark it available because it 
needs a forward port.  I attached it to this issue for comments, and for the 
record.

> Progress reported for pipes tasks is incorrect.
> -----------------------------------------------
>
>                 Key: MAPREDUCE-1073
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1073
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: pipes
>    Affects Versions: 0.20.1
>            Reporter: Sreekanth Ramakrishnan
>            Assignee: Dick King
>         Attachments: mapreduce-1073--2010-03-31.patch, 
> mapreduce-1073--2010-04-06.patch, 
> MAPREDUCE-1073--yhadoop20--2010-07-22.patch, MAPREDUCE-1073_yhadoop20.patch
>
>
> Currently in pipes, 
> {{org.apache.hadoop.mapred.pipes.PipesMapRunner.run(RecordReader<K1, V1>, 
> OutputCollector<K2, V2>, Reporter)}} we do the following:
> {code}
>         while (input.next(key, value)) {
>           downlink.mapItem(key, value);
>           if(skipping) {
>             downlink.flush();
>           }
>         }
> {code}
> This would result in consumption of all the records for current task and 
> taking task progress to 100% whereas the actual pipes application would be 
> trailing behind. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
