[ https://issues.apache.org/jira/browse/MAPREDUCE-6762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15427590#comment-15427590 ]
Weiwei Yang commented on MAPREDUCE-6762:
----------------------------------------

The error we saw from the Pig console:
{code}
2016-07-20 07:28:13,625 [uber-SubtaskRunner] INFO org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2016-07-20 07:28:16,252 [JobControl] ERROR org.apache.pig.backend.hadoop23.PigJobControl - Error while trying to run jobs.
java.lang.NullPointerException
    at org.apache.hadoop.mapreduce.Job.getJobName(Job.java:426)
    at org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob.toString(ControlledJob.java:93)
    at java.lang.String.valueOf(String.java:2982)
    at java.lang.StringBuilder.append(StringBuilder.java:131)
    at org.apache.pig.backend.hadoop23.PigJobControl.run(PigJobControl.java:182)
    at java.lang.Thread.run(Thread.java:745)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher$1.run(MapReduceLauncher.java:276)
{code}

The error we saw from the app-master log (indicating a failure when writing job meta files):
{code}
2016-08-10 07:46:54,862 INFO [Thread-1245] org.apache.hadoop.service.AbstractService: Service org.apache.hadoop.mapreduce.v2.app.MRAppMaster failed in state STOPPED; cause: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.net.SocketTimeoutException: 70000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.87.225.170:40913 remote=/10.87.225.174:50010]
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.net.SocketTimeoutException: 70000 millis timeout while waiting for channel to be ready for read.
ch : java.nio.channels.SocketChannel[connected local=/10.87.225.170:40913 remote=/10.87.225.174:50010]
    at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler.handleEvent(JobHistoryEventHandler.java:580)
    at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler.serviceStop(JobHistoryEventHandler.java:374)
    at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
    at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
    at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
    at org.apache.hadoop.service.CompositeService.stop(CompositeService.java:157)
    at org.apache.hadoop.service.CompositeService.serviceStop(CompositeService.java:131)
    at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.serviceStop(MRAppMaster.java:1626)
    at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
    at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.stop(MRAppMaster.java:1126)
    at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.shutDownJob(MRAppMaster.java:561)
    at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$JobFinishEventHandler$1.run(MRAppMaster.java:609)
Caused by: java.net.SocketTimeoutException: 70000 millis timeout while waiting for channel to be ready for read.
ch : java.nio.channels.SocketChannel[connected local=/10.87.225.170:40913 remote=/10.87.225.174:50010]
    at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118)
    at java.io.FilterInputStream.read(FilterInputStream.java:83)
    at java.io.FilterInputStream.read(FilterInputStream.java:83)
    at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2278)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:1020)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:990)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1131)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:876)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:402)
{code}

So the cause looks like:
* The DataNode was too busy to answer JHS's request to flush job meta files
* Job meta files were missing
* The job client failed to get a job status update
* {{Job.status}} was reset to null
* {{Job.getJobName}} failed with an NPE

> ControlledJob#toString failed with NPE when job status is not successfully updated
> ----------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-6762
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6762
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 2.7.2
>            Reporter: Weiwei Yang
>
> This issue was found on a cluster where a Pig query occasionally failed with an
> NPE. Pig uses the JobControl API to track MR job status, but sometimes the Job
> History Server failed to flush job meta files to HDFS, which caused the status
> update to fail.
> Then we get an NPE in {{org.apache.hadoop.mapreduce.Job.getJobName}}. The result
> of this situation is quite confusing: the Pig query failed and the job history is
> missing, but the job status on YARN is SUCCEEDED.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
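
The last two steps of the cause chain above ({{Job.status}} ending up null, then {{Job.getJobName}} throwing) can be illustrated with a minimal, self-contained sketch. Every class and method name below ({{SketchStatus}}, {{SketchJob}}, {{NullStatusSketch}}, {{getJobNameGuarded}}) is hypothetical and exists only for illustration. This is not Hadoop's code and not necessarily how the issue was fixed; it only assumes that the job name is read through a status object that can be lost after a failed update.

```java
import java.util.Objects;

// Hypothetical stand-in for a job status record; not Hadoop's JobStatus.
final class SketchStatus {
    private final String jobName;
    SketchStatus(String jobName) { this.jobName = jobName; }
    String getJobName() { return jobName; }
}

// Hypothetical stand-in for a client-side job handle whose status can be lost.
final class SketchJob {
    private final String configuredName; // name known at submission time
    private SketchStatus status;         // may become null after a failed update

    SketchJob(String configuredName) {
        this.configuredName = Objects.requireNonNull(configuredName);
        this.status = new SketchStatus(configuredName);
    }

    // Simulates a status refresh that fails and leaves no status behind.
    void simulateFailedStatusUpdate() { status = null; }

    // Unguarded accessor: throws NullPointerException once status is null.
    String getJobNameUnguarded() { return status.getJobName(); }

    // Guarded accessor: falls back to the submission-time name.
    String getJobNameGuarded() {
        SketchStatus current = status; // read the field once
        return current != null ? current.getJobName() : configuredName;
    }
}

final class NullStatusSketch {
    // Mirrors the shape of a toString() that embeds the job name.
    static String describe(SketchJob job) {
        return "job name:\t" + job.getJobNameGuarded() + "\n";
    }

    public static void main(String[] args) {
        SketchJob job = new SketchJob("pig-query");
        job.simulateFailedStatusUpdate();
        try {
            job.getJobNameUnguarded();
        } catch (NullPointerException e) {
            System.out.println("unguarded accessor threw NullPointerException");
        }
        System.out.print(describe(job));
    }
}
```

The point of the guarded variant is that a string conversion used for logging should not be able to abort the caller; whether a fallback name or an explicit error is the better behaviour is a design choice for the real API.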