Hi,

I'm getting reports from users that a job has failed, but when I go to the
console the 'failed' jobs section is empty, the entries having been pushed
out by other jobs running on the cluster.

All of the users submit jobs as the same hadoop user, so searching by OS
user doesn't help. Of course the user can't tell me what failed or why, nor
do they have any hadoop-provided identifiers.

Watching them submit it doesn't help either, since the job can create
hundreds of mappers and reducers and by the time it fails it isn't clear
which one was running.

How can I find the failed jobs in the cluster from the logs or job files on
the nodes? I only have a few nodes, so a 'find grep' on each is okay for now.

Thanks,

Chris
