[jira] [Commented] (MAPREDUCE-4428) A failed job is not available under job history if the job is killed right around the time job is notified as failed

Robert Joseph Evans (JIRA) Wed, 11 Jul 2012 14:31:38 -0700

    [ https://issues.apache.org/jira/browse/MAPREDUCE-4428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13412022#comment-13412022 ]


Robert Joseph Evans commented on MAPREDUCE-4428:
------------------------------------------------

From the logs it looks like you had many tasks failing because of the 
following exception.

{noformat}
2012-07-11 03:04:27,122 FATAL [IPC Server handler 4 on 37900] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1341894680756_0017_m_000012_0 - exited : java.io.IOException: Exception occured validating bulk loader output
        at com.carrieriq.m2m.platform.mmp3.output.db.BulkLoaderOutputWriter.doFinish(BulkLoaderOutputWriter.java:408)
        at com.carrieriq.m2m.platform.mmp3.output.db.BulkLoaderOutputWriter.finish(BulkLoaderOutputWriter.java:331)
        at com.carrieriq.m2m.platform.mmp3.output.db.DfsAndDbOutputWriter.finish(DfsAndDbOutputWriter.java:89)
        at com.carrieriq.m2m.platform.mmp3.output.fact2db.LoadFactsToDatamartMapper.onClose(LoadFactsToDatamartMapper.java:769)
        at com.carrieriq.m2m.platform.util.hadoop.AbstractReportingMapper.close(AbstractReportingMapper.java:94)
        at org.apache.hadoop.mapred.lib.Chain.close(Chain.java:283)
        at org.apache.hadoop.mapred.lib.ChainMapper.close(ChainMapper.java:179)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:399)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:334)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:152)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:147)
Caused by: java.io.IOException: Error(s) found in bulk loader output files.
        at com.carrieriq.m2m.platform.mmp3.output.db.BulkLoaderOutputWriter.doFinish(BulkLoaderOutputWriter.java:405)
        ... 14 more
{noformat}

This caused the job itself to fail. When the AM then tried to tell the RM 
that it was exiting, the RM responded that it had never heard of 
appattempt_1341894680756_0017_000001. It looks almost as if the RM was 
restarted while the job was running, or somehow forgot about this particular 
application. Having the RM logs from around the time this was running would 
help track down what happened in the RM.

{noformat}
2012-07-11 03:04:28,574 INFO [Thread-1] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Setting job diagnostics to Task failed task_1341894680756_0017_m_000014
Job failed as tasks failed. failedMaps:1 failedReduces:0

2012-07-11 03:04:28,575 INFO [Thread-1] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: History url is sjc1-ciq-ibm-grid06.carrieriq.com:19888/jobhistory/job/job_1341894680756_0017
2012-07-11 03:04:28,580 ERROR [Thread-1] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Exception while unregistering
RemoteTrace: 
 at LocalTrace: 
        org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: 
RemoteTrace: 
 at LocalTrace: 
        org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: Application doesn't exist in cache appattempt_1341894680756_0017_000001
        at org.apache.hadoop.yarn.factories.impl.pb.YarnRemoteExceptionFactoryPBImpl.createYarnRemoteException(YarnRemoteExceptionFactoryPBImpl.java:39)
        at org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:47)
        at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.finishApplicationMaster(ApplicationMasterService.java:222)
        at org.apache.hadoop.yarn.api.impl.pb.service.AMRMProtocolPBServiceImpl.finishApplicationMaster(AMRMProtocolPBServiceImpl.java:69)
        at org.apache.hadoop.yarn.proto.AMRMProtocol$AMRMProtocolService$2.callBlockingMethod(AMRMProtocol.java:85)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:427)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:916)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1692)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1688)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1686)

        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
        at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:90)
        at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:57)
        at org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl.unwrapAndThrowException(YarnRemoteExceptionPBImpl.java:123)
        at org.apache.hadoop.yarn.api.impl.pb.client.AMRMProtocolPBClientImpl.finishApplicationMaster(AMRMProtocolPBClientImpl.java:85)
        at org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator.unregister(RMCommunicator.java:190)
        at org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator.stop(RMCommunicator.java:216)
        at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.stop(RMContainerAllocator.java:226)
        at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$ContainerAllocatorRouter.stop(MRAppMaster.java:668)
        at org.apache.hadoop.yarn.service.CompositeService.stop(CompositeService.java:99)
        at org.apache.hadoop.yarn.service.CompositeService.stop(CompositeService.java:89)
        at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$MRAppMasterShutdownHook.run(MRAppMaster.java:1036)
        at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
{noformat}


                
> A failed job is not available under job history if the job is killed right 
> around the time job is notified as failed 
> ---------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-4428
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4428
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: jobhistoryserver, jobtracker
>    Affects Versions: 2.0.0-alpha
>            Reporter: Rahul Jain
>         Attachments: appMaster_bad.txt, appMaster_good.txt
>
>
> We have observed this issue consistently when running the CDH4 version of 
> Hadoop (based on the 2.0 alpha release).
> When our Hadoop client code is notified that a job has completed 
> unsuccessfully (using a RunningJob object job, with job.isComplete() && 
> job.isSuccessful()==false), it unconditionally calls job.killJob() to 
> terminate the job.
> With earlier Hadoop versions (verified on Hadoop 0.20.2), we still have full 
> access to the job logs afterwards through the Hadoop console. However, when 
> using MapReduce V2, the failed job no longer shows up under the job history 
> server. Also, the tracking URL of the job still points to the HTTP port of 
> the Application Master, which no longer exists.
> Once we removed the call to job.killJob() for failed jobs from our Hadoop 
> client code, we were able to access the job in job history with MapReduce V2 
> as well. This therefore appears to be a race condition between job 
> management and job history for failed jobs.
> We have the Application Master and NodeManager logs collected for this 
> scenario, if that will help isolate the problem and the fix.
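
The client-side pattern in the description above can be sketched roughly as follows. This is a minimal illustration and not the reporter's code: JobHandle and FakeJob are hypothetical stand-ins that mirror only the isComplete(), isSuccessful(), and killJob() methods of org.apache.hadoop.mapred.RunningJob, so the example runs without a cluster. It shows the workaround the report mentions, which is to skip killJob() once the job has already completed.

```java
public class FailedJobHandling {

    /**
     * Hypothetical stand-in for the three RunningJob methods named in the
     * report. The real methods also declare IOException; that is omitted
     * here to keep the sketch short.
     */
    public interface JobHandle {
        boolean isComplete();
        boolean isSuccessful();
        void killJob();
    }

    /**
     * Issues killJob() only while the job is still running.
     * Returns true if killJob() was called, false if the job had already
     * completed (successfully or not) and was left alone.
     */
    public static boolean killIfStillRunning(JobHandle job) {
        if (job.isComplete()) {
            // Already finished: per the report, killing at this point is
            // what made the failed job disappear from job history.
            return false;
        }
        job.killJob();
        return true;
    }

    /** Hypothetical in-memory job used only to exercise the sketch. */
    public static final class FakeJob implements JobHandle {
        private final boolean complete;
        private final boolean successful;
        public int killCalls = 0;

        public FakeJob(boolean complete, boolean successful) {
            this.complete = complete;
            this.successful = successful;
        }

        @Override public boolean isComplete() { return complete; }
        @Override public boolean isSuccessful() { return successful; }
        @Override public void killJob() { killCalls++; }
    }

    public static void main(String[] args) {
        FakeJob failed = new FakeJob(true, false);   // completed, not successful
        FakeJob running = new FakeJob(false, false); // still in progress
        System.out.println("failed job killed:  " + killIfStillRunning(failed));
        System.out.println("running job killed: " + killIfStillRunning(running));
    }
}
```

This only avoids triggering the behaviour from the client side; it does not address whatever happens in the AM or RM when a kill arrives right after a failure.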

