[jira] [Updated] (MAPREDUCE-4442) Accessing hadoop counters from a job is unreliable in yarn during in AM process cleanup window
[ https://issues.apache.org/jira/browse/MAPREDUCE-4442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rahul Jain updated MAPREDUCE-4442: -- Attachment: rsrc_mgr_logs_counter_failed.txt Here is the snippet from resource mgr logs, relevant to the the time this happened. > Accessing hadoop counters from a job is unreliable in yarn during in AM > process cleanup window > --- > > Key: MAPREDUCE-4442 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-4442 > Project: Hadoop Map/Reduce > Issue Type: Bug >Affects Versions: 2.0.0-alpha >Reporter: Rahul Jain > Attachments: am_logs_counter_failure.html, > rsrc_mgr_logs_counter_failed.txt > > > We found this issue during our tests moving from MapReduceV1 to MapReduceV2. > A few of our applications access job counters multiple times: > a) After submission of job, while job is execution (works fine) > b) Right after job complete notification is received (works fine) > c) Few seconds after job complete notification (fails most of the times). > The error snippet is as follows: > {code} > 2012-07-12 19:12:29,039 WARN [Client] Unexpected error reading responses on > connection Thread[IPC Client (1252749669) connection to > sjc1-ciq-ibm-grid07.carrieriq.com/10.202.50.187:47944 from hadoop,5,main] > java.lang.NullPointerException > at > org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:852) > at org.apache.hadoop.ipc.Client$Connection.run(Client.java:781) > 2012-07-12 19:12:29,044 INFO [ClientServiceDelegate] Application state is > completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server > 2012-07-12 19:12:29,132 INFO [ClientServiceDelegate] Application state is > completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server > 2012-07-12 19:12:29,216 ERROR [UserGroupInformation] > PriviledgedActionException as:hadoop (auth:SIMPLE) cause:java.io.IOException > 2012-07-12 19:12:29,216 WARN [BaseOutputStageJob] getJobCounters: Unable to > retrieve counters. null > java.io.IOException > at > org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:315) > at > org.apache.hadoop.mapred.ClientServiceDelegate.getJobCounters(ClientServiceDelegate.java:335) > at > org.apache.hadoop.mapred.YARNRunner.getJobCounters(YARNRunner.java:470) > at org.apache.hadoop.mapreduce.Job$8.run(Job.java:719) > at org.apache.hadoop.mapreduce.Job$8.run(Job.java:716) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:396) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232) > at org.apache.hadoop.mapreduce.Job.getCounters(Job.java:716) > at > org.apache.hadoop.mapred.JobClient$NetworkedJob.getCounters(JobClient.java:396) > {code} > The connection to 10.202.50.187:47944 is actually the connection to AM; > appears that we are connecting to AM to get the counters for the successful > job and not the history server. > > I'll attach the logs for AM and resource mgr separately, however no unusual > activity is seen in those. > This makes me suspect that we have a race condition in the code trying to > access job counters when AM is finishing up and the job hasn't moved to > history server yet. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (MAPREDUCE-4442) Accessing hadoop counters from a job is unreliable in yarn during in AM process cleanup window
[ https://issues.apache.org/jira/browse/MAPREDUCE-4442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rahul Jain updated MAPREDUCE-4442: -- Attachment: am_logs_counter_failure.html Attached AM logs for the full job; the timestamps should correlate to the application timestamps. > Accessing hadoop counters from a job is unreliable in yarn during in AM > process cleanup window > --- > > Key: MAPREDUCE-4442 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-4442 > Project: Hadoop Map/Reduce > Issue Type: Bug >Affects Versions: 2.0.0-alpha >Reporter: Rahul Jain > Attachments: am_logs_counter_failure.html > > > We found this issue during our tests moving from MapReduceV1 to MapReduceV2. > A few of our applications access job counters multiple times: > a) After submission of job, while job is execution (works fine) > b) Right after job complete notification is received (works fine) > c) Few seconds after job complete notification (fails most of the times). > The error snippet is as follows: > {code} > 2012-07-12 19:12:29,039 WARN [Client] Unexpected error reading responses on > connection Thread[IPC Client (1252749669) connection to > sjc1-ciq-ibm-grid07.carrieriq.com/10.202.50.187:47944 from hadoop,5,main] > java.lang.NullPointerException > at > org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:852) > at org.apache.hadoop.ipc.Client$Connection.run(Client.java:781) > 2012-07-12 19:12:29,044 INFO [ClientServiceDelegate] Application state is > completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server > 2012-07-12 19:12:29,132 INFO [ClientServiceDelegate] Application state is > completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server > 2012-07-12 19:12:29,216 ERROR [UserGroupInformation] > PriviledgedActionException as:hadoop (auth:SIMPLE) cause:java.io.IOException > 2012-07-12 19:12:29,216 WARN [BaseOutputStageJob] getJobCounters: Unable to > retrieve counters. null > java.io.IOException > at > org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:315) > at > org.apache.hadoop.mapred.ClientServiceDelegate.getJobCounters(ClientServiceDelegate.java:335) > at > org.apache.hadoop.mapred.YARNRunner.getJobCounters(YARNRunner.java:470) > at org.apache.hadoop.mapreduce.Job$8.run(Job.java:719) > at org.apache.hadoop.mapreduce.Job$8.run(Job.java:716) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:396) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232) > at org.apache.hadoop.mapreduce.Job.getCounters(Job.java:716) > at > org.apache.hadoop.mapred.JobClient$NetworkedJob.getCounters(JobClient.java:396) > {code} > The connection to 10.202.50.187:47944 is actually the connection to AM; > appears that we are connecting to AM to get the counters for the successful > job and not the history server. > > I'll attach the logs for AM and resource mgr separately, however no unusual > activity is seen in those. > This makes me suspect that we have a race condition in the code trying to > access job counters when AM is finishing up and the job hasn't moved to > history server yet. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (MAPREDUCE-4442) Accessing hadoop counters from a job is unreliable in yarn during in AM process cleanup window
[ https://issues.apache.org/jira/browse/MAPREDUCE-4442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rahul Jain updated MAPREDUCE-4442: -- Description: We found this issue during our tests moving from MapReduceV1 to MapReduceV2. A few of our applications access job counters multiple times: a) After submission of job, while job is execution (works fine) b) Right after job complete notification is received (works fine) c) Few seconds after job complete notification (fails most of the times). The error snippet is as follows: {code} 2012-07-12 19:12:29,039 WARN [Client] Unexpected error reading responses on connection Thread[IPC Client (1252749669) connection to sjc1-ciq-ibm-grid07.carrieriq.com/10.202.50.187:47944 from hadoop,5,main] java.lang.NullPointerException at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:852) at org.apache.hadoop.ipc.Client$Connection.run(Client.java:781) 2012-07-12 19:12:29,044 INFO [ClientServiceDelegate] Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server 2012-07-12 19:12:29,132 INFO [ClientServiceDelegate] Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server 2012-07-12 19:12:29,216 ERROR [UserGroupInformation] PriviledgedActionException as:hadoop (auth:SIMPLE) cause:java.io.IOException 2012-07-12 19:12:29,216 WARN [BaseOutputStageJob] getJobCounters: Unable to retrieve counters. null java.io.IOException at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:315) at org.apache.hadoop.mapred.ClientServiceDelegate.getJobCounters(ClientServiceDelegate.java:335) at org.apache.hadoop.mapred.YARNRunner.getJobCounters(YARNRunner.java:470) at org.apache.hadoop.mapreduce.Job$8.run(Job.java:719) at org.apache.hadoop.mapreduce.Job$8.run(Job.java:716) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232) at org.apache.hadoop.mapreduce.Job.getCounters(Job.java:716) at org.apache.hadoop.mapred.JobClient$NetworkedJob.getCounters(JobClient.java:396) {code} The connection to 10.202.50.187:47944 is actually the connection to AM; appears that we are connecting to AM to get the counters for the successful job and not the history server. I'll attach the logs for AM and resource mgr separately, however no unusual activity is seen in those. This makes me suspect that we have a race condition in the code trying to access job counters when AM is finishing up and the job hasn't moved to history server yet. was: We found this issue during our tests moving from MapReduceV1 to MapReduceV2. A few of our applications access job counters multiple times: a) After submission of job, while job is execution (works fine) b) Right after job complete notification is received (works fine) c) Few seconds after job complete notification (fails most of the times). The error snippet is as follows: {code} 2012-07-12 19:12:29,039 WARN [Client] Unexpected error reading responses on connection Thread[IPC Client (1252749669) connection to sjc1-ciq-ibm-grid07.carrieriq.com/10.202.50.187:47944 from hadoop,5,main] java.lang.NullPointerException at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:852) at org.apache.hadoop.ipc.Client$Connection.run(Client.java:781) 2012-07-12 19:12:29,044 INFO [ClientServiceDelegate] Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server 2012-07-12 19:12:29,132 INFO [ClientServiceDelegate] Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server 2012-07-12 19:12:29,216 ERROR [UserGroupInformation] PriviledgedActionException as:hadoop (auth:SIMPLE) cause:java.io.IOException 2012-07-12 19:12:29,216 WARN [BaseOutputStageJob] getJobCounters: Unable to retrieve counters. null java.io.IOException at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:315) at org.apache.hadoop.mapred.ClientServiceDelegate.getJobCounters(ClientServiceDelegate.java:335) at org.apache.hadoop.mapred.YARNRunner.getJobCounters(YARNRunner.java:470) at org.apache.hadoop.mapreduce.Job$8.run(Job.java:719) at org.apache.hadoop.mapreduce.Job$8.run(Job.java:716) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232) at org.apache.hadoop.mapreduce.Job.getCounters(Job.java:716) at org.apache.hadoop.mapred.JobClient$NetworkedJob.getCounters(JobClient.java:396)