Re: question on fault tolerance
just a guess, for a long-running sequence of MR jobs, how's the namenode behaving during that time? if it gets corrupted, one might see that behavior.

we have a similar situation, with 9 MR jobs back-to-back, taking much of the day.

might be good to add some notification to an external process after the end of each of those 3 MR jobs.

paco

On Mon, Aug 11, 2008 at 12:34 PM, Mori Bellamy <[EMAIL PROTECTED]> wrote:
> hey all,
> i have a job consisting of three MR jobs back to back to back. each job
> takes an appreciable percent of a day to complete (30% to 70%). even though
> i execute these jobs in a blocking fashion:
question on fault tolerance
hey all,

i have a job consisting of three MR jobs back to back to back. each job takes an appreciable percent of a day to complete (30% to 70%). even though i execute these jobs in a blocking fashion:

    JobClient.runJob(conf);
    ...
    JobClient.runJob(conf2);
    ...

the successful execution of one job does not guarantee that the next job in the pipeline starts. (i.e. i can log on to my taskTracker and see that conf's job finished successfully, but conf2's job never started).

does anybody else have this problem? can anyone offer advice? the only thing i can think of is that some other people with access to my cluster are accidentally killing jobs, but i doubt that's the case.
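
one way to narrow this down is to have the driver record what happens between jobs, along the lines of the notification idea from the reply. below is a minimal sketch, not a tested Hadoop driver: PipelineSketch, Notifier and runPipeline are made-up names, and each step is a plain Callable so the sketch runs without a cluster. in a real driver a step might wrap the blocking call, e.g. () -> JobClient.runJob(conf).isSuccessful(), which is an assumption about how the jobs get wired in.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.Callable;

public class PipelineSketch {

    /** Hypothetical hook: could append to a file, send mail, or ping a monitor. */
    public interface Notifier {
        void notify(String message);
    }

    /**
     * Runs the steps in insertion order and reports every transition.
     * Stops at the first step that returns false or throws, so the log
     * shows exactly where the pipeline ended.
     * Returns the number of steps that completed successfully.
     */
    public static int runPipeline(Map<String, Callable<Boolean>> steps, Notifier notifier) {
        int completed = 0;
        for (Map.Entry<String, Callable<Boolean>> step : steps.entrySet()) {
            String name = step.getKey();
            notifier.notify("START " + name);
            boolean ok;
            try {
                // Assumption: in a real driver this call blocks until the MR job ends.
                ok = step.getValue().call();
            } catch (Exception e) {
                // A client-side exception between jobs would otherwise be easy to miss.
                notifier.notify("ERROR " + name + ": " + e);
                return completed;
            }
            if (!ok) {
                notifier.notify("FAILED " + name);
                return completed;
            }
            notifier.notify("DONE " + name);
            completed++;
        }
        return completed;
    }

    public static void main(String[] args) {
        // Stand-in steps; the second one simulates a failure in the driver.
        Map<String, Callable<Boolean>> steps = new LinkedHashMap<>();
        steps.put("job1", () -> true);
        steps.put("job2", () -> { throw new java.io.IOException("simulated client-side failure"); });
        steps.put("job3", () -> true);

        int done = runPipeline(steps,
                msg -> System.err.println(System.currentTimeMillis() + " " + msg));
        System.out.println(done + " of " + steps.size() + " steps completed");
    }
}
```

if the log shows DONE for the first job but no START for the second, the driver process itself stopped between the two calls (killed, lost its session, ran out of memory), which points away from the cluster and toward the machine running the client.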