Hi, we have a daily Hive script that usually takes a few hours to run. The other day I noticed that one of its jobs was running well beyond that. Digging into it, I saw three failed attempts to launch tasks on a single node:
Task Id                            Start Time  Finish Time  Error
task_201312241250_46714_r_000048                            Error launching task
task_201312241250_46714_r_000049                            Error launching task
task_201312241250_46714_r_000050                            Error launching task

I later found out that this node had a dodgy/unresponsive disk (it is still being tested right now). We've seen tasks fail in the past, but they were resubmitted to another node and succeeded. So shouldn't this task have been started on another node after the first failure? Is there a configuration setting I might be missing? We're using CDH 4.4.0.

Cheers,
Krishna