Look at the NodeManager logs on perfgb0n0, search for entries for container_1342570404456_0001_*, and check for errors.
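Something along these lines should turn those entries up. A minimal sketch, assuming the stock tarball layout with log aggregation off, so container logs are still on the slave's local disk; adjust the paths to wherever yarn.nodemanager.log-dirs points on your install:

    # NodeManager daemon log: container state transitions, launch
    # failures, localization errors, etc.
    grep 'container_1342570404456_0001_' $HADOOP_LOG_DIR/yarn-*-nodemanager-*.log

    # Per-container stdout/stderr/syslog written by the tasks themselves
    # (default yarn.nodemanager.log-dirs is ${yarn.log.dir}/userlogs):
    ls -R $HADOOP_LOG_DIR/userlogs/application_1342570404456_0001/

If the containers really are exiting cleanly, the interesting part is usually the per-container stderr and syslog.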
Arun

On Jul 17, 2012, at 5:33 PM, Trevor wrote:

> Actually, the HTTP 400 is a red herring, and not the core issue. I added
> "-D mapreduce.client.output.filter=ALL" to the command line, and fetching
> the task output fails even for successful tasks:
>
> 12/07/17 19:15:55 INFO mapreduce.Job: Task Id : attempt_1342570404456_0001_m_000006_1, Status : SUCCEEDED
> 12/07/17 19:15:55 WARN mapreduce.Job: Error reading task output Server returned HTTP response code: 400 for URL: http://perfgb0n0:8080/tasklog?plaintext=true&attemptid=attempt_1342570404456_0001_m_000006_1&filter=stdout
>
> Having a better idea what to search for, I found that it's a recently
> fixed bug: https://issues.apache.org/jira/browse/MAPREDUCE-3889
>
> So the real question is: how can I debug the failing tasks on the non-AM
> slave(s)? Although I see failure on the client:
>
> 12/07/17 19:14:35 INFO mapreduce.Job: Task Id : attempt_1342570404456_0001_m_000002_0, Status : FAILED
>
> I see what appears to be success on the slave:
>
> 2012-07-17 19:13:47,476 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Container container_1342570404456_0001_01_000002 succeeded
> 2012-07-17 19:13:47,477 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1342570404456_0001_01_000002 transitioned from RUNNING to EXITED_WITH_SUCCESS
>
> Suggestions of where to look next?
>
> Thanks,
> Trevor
>
> On Tue, Jul 17, 2012 at 6:33 PM, Trevor <[email protected]> wrote:
>
> Arun, I just verified that I get the same error with 2.0.0-alpha (official
> tarball) and 2.0.1-alpha (built from svn).
>
> Karthik, thanks for forwarding.
>
> Thanks,
> Trevor
>
> On Tue, Jul 17, 2012 at 6:18 PM, Karthik Kambatla <[email protected]> wrote:
>
> Forwarding your email to the cdh-user group.
>
> Thanks,
> Karthik
>
> On Tue, Jul 17, 2012 at 2:24 PM, Trevor <[email protected]> wrote:
>
> Hi all,
>
> I recently upgraded from CDH4b2 (0.23.1) to CDH4 (2.0.0). Now, for some
> strange reason, my MRv2 jobs (TeraGen, specifically) fail if I run with
> more than one slave.
> For every slave except the one running the Application Master, I get the
> following failed tasks and warnings repeatedly:
>
> 12/07/13 14:21:55 INFO mapreduce.Job: Running job: job_1342207265272_0001
> 12/07/13 14:22:17 INFO mapreduce.Job: Job job_1342207265272_0001 running in uber mode : false
> 12/07/13 14:22:17 INFO mapreduce.Job: map 0% reduce 0%
> 12/07/13 14:22:46 INFO mapreduce.Job: map 1% reduce 0%
> 12/07/13 14:22:52 INFO mapreduce.Job: map 2% reduce 0%
> 12/07/13 14:22:55 INFO mapreduce.Job: map 3% reduce 0%
> 12/07/13 14:22:58 INFO mapreduce.Job: map 4% reduce 0%
> 12/07/13 14:23:04 INFO mapreduce.Job: map 5% reduce 0%
> 12/07/13 14:23:07 INFO mapreduce.Job: map 6% reduce 0%
> 12/07/13 14:23:07 INFO mapreduce.Job: Task Id : attempt_1342207265272_0001_m_000004_0, Status : FAILED
> 12/07/13 14:23:08 WARN mapreduce.Job: Error reading task output Server returned HTTP response code: 400 for URL: http://perfgb0n0:8080/tasklog?plaintext=true&attemptid=attempt_1342207265272_0001_m_000004_0&filter=stdout
> 12/07/13 14:23:08 WARN mapreduce.Job: Error reading task output Server returned HTTP response code: 400 for URL: http://perfgb0n0:8080/tasklog?plaintext=true&attemptid=attempt_1342207265272_0001_m_000004_0&filter=stderr
> 12/07/13 14:23:08 INFO mapreduce.Job: Task Id : attempt_1342207265272_0001_m_000003_0, Status : FAILED
> 12/07/13 14:23:08 WARN mapreduce.Job: Error reading task output Server returned HTTP response code: 400 for URL: http://perfgb0n0:8080/tasklog?plaintext=true&attemptid=attempt_1342207265272_0001_m_000003_0&filter=stdout
> ...
> 12/07/13 14:25:12 INFO mapreduce.Job: map 25% reduce 0%
> 12/07/13 14:25:12 INFO mapreduce.Job: Job job_1342207265272_0001 failed with state FAILED due to:
> ...
> Failed map tasks=19
> Launched map tasks=31
>
> The HTTP 400 error appears to be generated by the ShuffleHandler, which is
> configured to run on port 8080 of the slaves and doesn't understand that
> URL. What I've been able to piece together so far is that /tasklog is
> handled by the TaskLogServlet, which is part of the TaskTracker. However,
> isn't that an MRv1 class that shouldn't even be running in my
> configuration? Also, the TaskTracker appears to run on port 50060, so I
> don't know where port 8080 is coming from. (A way to check by hand which
> daemon owns port 8080 is sketched after this thread.)
>
> Though it could be a red herring, this warning seems to be related to the
> job failing, even though the job does make progress on the slave running
> the AM. The NodeManager logs on the AM and non-AM slaves look fairly
> similar, and I don't see any errors in the non-AM logs.
>
> Another strange data point: these failures occur when running the slaves
> on ARM systems. Running the slaves on x86 with the same configuration
> works. I'm using the same tarball on both, which means the native-hadoop
> library isn't loaded on ARM. The master/client is the same x86 system in
> both scenarios. All nodes are running Ubuntu 12.04.
>
> Thanks for any guidance,
> Trevor

--
Arun C. Murthy
Hortonworks Inc.
http://hortonworks.com/
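A footnote for anyone hitting the same symptom: here is a minimal sketch of how to confirm which daemon owns port 8080 on a slave and reproduce the 400 by hand. It assumes the ShuffleHandler is bound there via its shuffle-port setting (mapreduce.shuffle.port, which defaulted to 8080 in releases of this era) and that lsof and curl are available:

    # On the slave (e.g. perfgb0n0): find the process listening on 8080.
    # If the PID belongs to the NodeManager JVM, the listener is the
    # ShuffleHandler auxiliary service it hosts.
    sudo lsof -iTCP:8080 -sTCP:LISTEN
    # (or: sudo netstat -tlnp | grep ':8080')

    # Reproduce the client's failing request by hand. The ShuffleHandler
    # speaks only the map-output shuffle protocol, so a /tasklog request
    # should come back with HTTP 400.
    curl -v "http://perfgb0n0:8080/tasklog?plaintext=true&attemptid=attempt_1342570404456_0001_m_000006_1&filter=stdout"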

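And, assuming log aggregation is enabled (yarn.log-aggregation-enable=true), a sketch of one way to pull every container's logs for the application in a single step; if aggregation is off, the per-container files shown in the first sketch are the place to look:

    # After the application finishes, dump the aggregated
    # stdout/stderr/syslog of all of its containers:
    yarn logs -applicationId application_1342570404456_0001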