Drew,

Pig relies completely on Hadoop's handling of fault tolerance, and does not do anything by itself (so, for example, if a job completely fails, Pig won't try to rerun it). The way Hadoop deals with failure is pretty well documented in the O'Reilly Hadoop book as well as online: essentially, mappers and reducers can be rerun as needed (and thus one needs to be careful about side effects and non-determinism in mapper and reducer tasks, as they may run multiple times!).
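For illustration (just a sketch, not code from any real job; the class name and the sendToExternalService() call are made up), a mapper along these lines is exactly the kind of thing that caution is about:

import java.io.IOException;
import java.util.Random;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class RiskyMapper extends Mapper<LongWritable, Text, Text, Text> {
    // Non-deterministic state: a rerun attempt will draw different numbers.
    private final Random rng = new Random();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Output differs between task attempts because of the random tag.
        String tagged = rng.nextInt() + "\t" + value.toString();

        // Side effect: if the attempt is rerun, this would fire again.
        // sendToExternalService(tagged);  // hypothetical external call

        context.write(new Text(Long.toString(key.get())), new Text(tagged));
    }
}

If the node running that task dies, Hadoop simply schedules another attempt: the random tags come out different and the external call fires again, because the framework only guarantees that the output of one successful attempt is kept, not that the task code runs exactly once.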
Data replication is somewhat orthogonal to the problem of failure of a node that's processing data. If HDFS is reading from server A, and server A dies, HDFS will just switch to reading from a replica B, and kick off a background copy to replace the missing A. Tasks won't fail.

D

On Tue, May 31, 2011 at 3:23 PM, David A Boyuka, II <daboy...@ncsu.edu> wrote:
> Hello everyone. I'm currently doing some research involving Hadoop and Pig,
> evaluating the cost of data replication vs. the penalty of node failure with
> respect to job completion time. I'm currently modifying a MapReduce simulator
> called "MRPerf" to accommodate a sequence of MapReduce jobs such as those
> generated by Pig. I have some questions about how failures are handled in a
> Pig job, so I can reflect that in my implementation. I'm fairly new to Pig,
> so if some of these questions reflect a misunderstanding, please let me know:
>
> 1) How does Pig/Hadoop handle failures which would require re-execution of
> part or all of some MapReduce job that was already completed earlier in the
> sequence? The only source of information I could find on this is in the paper
> "Making cloud intermediate data fault-tolerant" by Ko et al.
> (http://portal.acm.org/citation.cfm?id=1807160), but I'm not sure whether
> this is accurate.
>
> As an example:
>
> Suppose a sequence of three MapReduce jobs has been generated, J1, J2 and
> J3, with a replication factor of 1 for the output of J1 and J2 (i.e. one
> instance of each HDFS block). Suppose jobs J1 and J2 complete, and J3 is
> almost done. Then a failure occurs on some node. Map and Reduce task outputs
> are lost from all jobs. It seems to me it would be necessary to re-execute
> some tasks, but not all, from all three jobs to complete the overall job. How
> does Pig/Hadoop handle backtracking in this situation?
>
> (This is based on my understanding that when Pig produces a series of
> MapReduce jobs, the replication factor for the intermediate data produced by
> each Reduce phase can be set individually. Correct me if I'm wrong here.)
>
> 2) On a related note, is the intermediate data produced by the Map tasks in
> an individual MapReduce job deleted after that job completes and the next in
> the sequence starts, or is it preserved until the whole Pig job finishes?
>
> Any information or guidance would be greatly appreciated.
>
> Thanks,
> Drew