Hi,

We have a daily Hive script that usually takes a few hours to run. The
other day I noticed that one of its jobs was taking well over a few hours.
Digging into it, I saw three failed task launches on a single node:

Task Id                            Start Time   Finish Time   Error
task_201312241250_46714_r_000048                              Error launching task
task_201312241250_46714_r_000049                              Error launching task
task_201312241250_46714_r_000050                              Error launching task

I later found out that this node had a dodgy/unresponsive disk (still being
tested right now).

We've seen tasks fail in the past and then succeed after being resubmitted
to another node. So shouldn't this task have been launched on another node
after the first failure? Is there a configuration setting I might be
missing?

We're using CDH4.4.0.

Cheers,

Krishna
