I did notice that, but none of those tasks ended up being re-submitted! And all 3 of those attempts failed on the same node. All we know so far is that the disk on that node became unresponsive.
On 27 March 2014 09:33, Dieter De Witte <drdwi...@gmail.com> wrote:

> The ids of the tasks are different, so the node got killed after failing on
> 3 different(!) reduce tasks. Reduce task 48 will probably have been
> resubmitted to another node.
>
> 2014-03-27 10:22 GMT+01:00 Krishna Rao <krishnanj...@gmail.com>:
>
>> Hi,
>>
>> we have a daily Hive script that usually takes a few hours to run. The
>> other day I noticed one of the jobs was taking in excess of a few hours.
>> Digging into it I saw that there were 3 attempts to launch a job on a
>> single node:
>>
>> Task Id                            Start Time   Finish Time   Error
>> task_201312241250_46714_r_000048                              Error launching task
>> task_201312241250_46714_r_000049                              Error launching task
>> task_201312241250_46714_r_000050                              Error launching task
>>
>> I later found out that this node had a dodgy/unresponsive disk (still
>> being tested right now).
>>
>> We've seen tasks fail in the past, but be re-submitted to another node and
>> succeed. So, shouldn't this task have been kicked off on another node
>> after the first failure? Is there anything I could be missing in terms of
>> configuration that should be set?
>>
>> We're using CDH4.4.0.
>>
>> Cheers,
>>
>> Krishna
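
For reference, the per-task retry and per-node blacklisting behaviour being asked about is controlled by a few MRv1 properties. A minimal mapred-site.xml sketch, assuming MRv1 on CDH 4.4.0 and Hadoop 1.x property names (the values shown are believed to be the defaults, so treat this as a starting point rather than a recommendation):

    <!-- mapred-site.xml: retry and blacklisting knobs (MRv1 / Hadoop 1.x names) -->
    <configuration>
      <!-- Max attempts per reduce task before the job as a whole is failed -->
      <property>
        <name>mapred.reduce.max.attempts</name>
        <value>4</value>
      </property>
      <!-- Max attempts per map task -->
      <property>
        <name>mapred.map.max.attempts</name>
        <value>4</value>
      </property>
      <!-- Task failures on one TaskTracker before that tracker is blacklisted
           for the current job, so later attempts are scheduled on other nodes -->
      <property>
        <name>mapred.max.tracker.failures</name>
        <value>4</value>
      </property>
    </configuration>

If, as Dieter points out, _000048/_000049/_000050 are three different tasks rather than three attempts of one task, then the per-task retry limit was never reached for any of them, and the per-tracker failure threshold is the more relevant setting for getting the bad node excluded sooner.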