Hi, I'm getting reports from users that a job has failed, but when I go to the console the 'failed' jobs section is empty, because the failed job has already been pushed out by other jobs running on the cluster.
All of the users submit jobs as the same Hadoop user, so searching by OS user doesn't help. Of course the users can't tell me what failed or why, nor do they have any Hadoop-provided identifiers. Watching them submit the job doesn't help either, since it can create hundreds of mappers and reducers, and by the time it fails it isn't clear which one was running.

How can I find the failed jobs in the cluster from the logs or job files on the nodes? I only have a few nodes, so running find and grep on each one is okay for now.

Thanks,
Chris