[ http://issues.apache.org/jira/browse/HADOOP-223?page=comments#action_12432196 ]

Bryan Pendleton commented on HADOOP-223:
----------------------------------------
This issue still plagues me. I have a large enough dataset that processing through it and storing temporary outputs often causes many failures. I still have my local fix for re-running (very cumbersome), and I'd love to see a general solution supported.

What I currently do is inject a special combiner that only produces output for partitions that didn't complete in a previous run. Thus, if 20% of my job finishes (completes writing reduce output to DFS), I need 20% less working space for the subsequent job attempt. Usually, this is enough progress. However, that approach suffers from a number of limits:

1) There's no auto-magic storage of which partitions were generated.
2) It loses the ability to have a Combiner doing something else.
3) It also requires a new output directory, and hand-reassembly of the completed tasks.

It seems like it should be solvable more generally by:

1) Storing the partitions that are generated when a job is initially run.
2) Keeping track of which partitions have completed.
3) Allowing a job to be re-submitted, using the previous partitions (and not failing on the output existence check).

Anyone else run into these issues? I'd be happy to write up how I actually work around the problem now, but it seems like a more general solution would be helpful to a wider audience.

> Should be able to re-run jobs, collecting only missing output
> -------------------------------------------------------------
>
>          Key: HADOOP-223
>          URL: http://issues.apache.org/jira/browse/HADOOP-223
>      Project: Hadoop
>   Issue Type: New Feature
>   Components: mapred
>     Reporter: Bryan Pendleton
>     Priority: Minor
>
> For jobs with no side effects (roughly == jobs with speculative execution
> enabled), if partial output has been generated, it should be possible to
> re-run the job and fill in the missing pieces. I have now run the same job
> twice, once finishing 42 of 44 reduce tasks, another time finishing only 17.
> Each time, many nodes have failed, causing many, many tasks to fail (in one
> case, 5k failures from 15k map tasks, 23 failures from 44 reduces), but some
> valid output was generated. Since the output is only dependent on the input,
> and both jobs used the same input, I will now be able to combine these two
> failed jobs' outputs to get a completed job's output. This should be something
> that can be more automatic.
> In particular, it should be possible to resubmit a job with a list of
> partitions that should be ignored. A special Combiner, or pre-Combiner, would
> throw out any map output for partitions that have already been successfully
> completed, thus reducing the amount of data that needs to be reduced to
> complete the job. It would, of course, be nice to support "filling in"
> existing outputs, rather than having to do a move operation on completed
> outputs.
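For anyone hitting the same wall: the following is only a minimal, illustrative sketch of the kind of "special combiner" workaround described above (not my actual code), written against the old org.apache.hadoop.mapred API. The class name and the configuration key "completed.partitions" are made up for illustration, and it assumes the job uses the default HashPartitioner, so the partition computed inside the combiner matches the one the framework will compute.

{code:java}
import java.io.IOException;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class SkipCompletedPartitionsCombiner extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {

  private final Set<Integer> completed = new HashSet<Integer>();
  private int numReduceTasks;

  public void configure(JobConf job) {
    numReduceTasks = job.getNumReduceTasks();
    // "completed.partitions" is a made-up key: a comma-separated list of
    // partition numbers whose reduce output already made it to DFS.
    String done = job.get("completed.partitions", "");
    for (String p : done.split(",")) {
      if (p.length() > 0) {
        completed.add(Integer.valueOf(p.trim()));
      }
    }
  }

  public void reduce(Text key, Iterator<Text> values,
                     OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    // Same hash as the default HashPartitioner; a job with a custom
    // partitioner would have to replicate that partitioner's logic here.
    int partition = (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    if (completed.contains(partition)) {
      return; // reduce output for this partition already exists; drop the data
    }
    // Otherwise pass records through unchanged. No real combining happens,
    // which is limitation (2) above: the Combiner slot is now occupied.
    while (values.hasNext()) {
      output.collect(key, values.next());
    }
  }
}
{code}

Wiring it in would look roughly like conf.setCombinerClass(SkipCompletedPartitionsCombiner.class) plus conf.set("completed.partitions", "0,3,7"), with the partition list assembled by hand from the previous run -- which is exactly the bookkeeping a general solution should do automatically.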