Re: question on fault tolerance

2008-08-11 Thread Paco NATHAN
just a guess,
for a long-running sequence of MR jobs, how's the namenode behaving
during that time? if it gets corrupted, one might see that behavior.

we have a similar situation, with 9 MR jobs back-to-back, taking much
of the day.

might be good to add some notification to an external process after
the end of each of those 3 MR jobs.
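
a rough sketch of what that notification could look like, in plain Java with
no Hadoop dependency. PipelineRunner, Step and the notifier are hypothetical
names made up for illustration, not part of Hadoop. the assumption is that each
step body would wrap a blocking call such as JobClient.runJob(conf) and return
true when it comes back without throwing. the notifier is whatever external
channel you have (a log file, email, a monitoring hook).

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.function.Consumer;

// Hypothetical helper, not part of Hadoop: runs steps strictly in order and
// reports every transition to a notifier, so a pipeline that stalls between
// two jobs leaves a visible trace of the last step that finished.
final class PipelineRunner {

    // One named step. In a real driver the body might wrap
    // JobClient.runJob(conf) (assumption) and return true on success.
    static final class Step {
        final String name;
        final Callable<Boolean> body;

        Step(String name, Callable<Boolean> body) {
            this.name = name;
            this.body = body;
        }
    }

    // Runs the steps in order and stops at the first failure or exception.
    // Returns the number of steps that completed successfully.
    static int runAll(List<Step> steps, Consumer<String> notifier) {
        int completed = 0;
        for (Step step : steps) {
            notifier.accept("START " + step.name);
            boolean ok;
            try {
                ok = Boolean.TRUE.equals(step.body.call());
            } catch (Exception e) {
                notifier.accept("ERROR " + step.name + ": " + e);
                return completed;
            }
            if (!ok) {
                notifier.accept("FAILED " + step.name);
                return completed;
            }
            notifier.accept("DONE " + step.name);
            completed++;
        }
        notifier.accept("PIPELINE COMPLETE");
        return completed;
    }
}
```

if the notifier shows "DONE job1" but never "START job2", the driver process
itself stopped between the two calls. if it shows "START job2" and nothing
after, the submission went out and the place to look is the cluster side.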

paco


On Mon, Aug 11, 2008 at 12:34 PM, Mori Bellamy <[EMAIL PROTECTED]> wrote:
> hey all,
> i have a job consisting of three MR jobs back to back to back. each job
> takes an appreciable percent of a day to complete (30% to 70%). even though
> i execute these jobs in a blocking fashion:


question on fault tolerance

2008-08-11 Thread Mori Bellamy

hey all,
i have a job consisting of three MR jobs back to back to back. each
job takes an appreciable percent of a day to complete (30% to 70%).
even though i execute these jobs in a blocking fashion:

JobClient.runJob(conf);
...
JobClient.runJob(conf2);
...

the successful execution of one job does not guarantee that the next  
job in the pipeline starts. (i.e. i can log on to my taskTracker and  
see that conf's job finished successfully, but conf2's job never  
started). does anybody else have this problem? can anyone offer advice?


the only thing i can think of is that some other people with access to  
my cluster are accidentally killing jobs, but i doubt that's the case.