[ https://issues.apache.org/jira/browse/OOZIE-3291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16526647#comment-16526647 ]
Andras Piros commented on OOZIE-3291:
-------------------------------------

[~rohit.peg] can you please post the relevant parts of your {{workflow.xml}} / {{coordinator.xml}}? Do you also encounter the issue on a newer Oozie release like 5.0.0 / 4.3.1?

> Oozie workflow hangs in running state even when the underlying action failed
> ----------------------------------------------------------------------------
>
>                 Key: OOZIE-3291
>                 URL: https://issues.apache.org/jira/browse/OOZIE-3291
>             Project: Oozie
>          Issue Type: Bug
>          Components: workflow
>    Affects Versions: 4.1.0
>            Reporter: Rohit Pegallapati
>            Priority: Major
>
> We have multiple distcp actions in a fork/join. We use Hadoop 2.6.0 (CDH 5.5.1).
> We are hitting
> https://issues.apache.org/jira/browse/MAPREDUCE-6478
> At this point the distcp action fails with the exception below.
> {code:java}
> 2018-06-10 15:19:39,179 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1520068304865_972654_m_000000_0: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /user/xxx/oozie-oozi/1951586-180303074950833-oozie-oozi-W/distcp-to-dr-0-update-action--distcp/output/_temporary/1/_temporary/attempt_1520068304865_972654_m_000000_0/part-00000 (inode 192492374): File does not exist. Holder DFSClient_NONMAPREDUCE_-2068852542_1 does not have any open files.
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3604)
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFileInternal(FSNamesystem.java:3690)
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:3660)
>     at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNodeRpcServer.java:738)
>     at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.complete(AuthorizationProviderProxyClientProtocol.java:243)
>     at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.complete(ClientNamenodeProtocolServerSideTranslatorPB.java:528)
>     at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
>     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1060)
>     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2086)
>     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2082)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2080)
> {code}
> At this point we expect the WF to be killed and the subsequent WF to start. Instead, this WF is stuck in RUNNING state and other WFs get stacked up through the coordinator, leaving no option but to kill the running WF. After this defective WF is killed, the other WFs process perfectly fine.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
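(For illustration only: the reporter's {{workflow.xml}} was not attached, so the following is a minimal sketch of the fork/join distcp pattern described above. All action names, paths, and the DR nameservice are placeholders, not taken from the reporter's setup.)

{code:xml}
<workflow-app xmlns="uri:oozie:workflow:0.4" name="distcp-fork-join-wf">
    <start to="fork-distcp"/>

    <!-- Run the distcp copies in parallel; both must reach the join -->
    <fork name="fork-distcp">
        <path start="distcp-1"/>
        <path start="distcp-2"/>
    </fork>

    <!-- Placeholder source/target paths; "dr-nameservice" is hypothetical -->
    <action name="distcp-1">
        <distcp xmlns="uri:oozie:distcp-action:0.1">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <arg>${nameNode}/data/source1</arg>
            <arg>hdfs://dr-nameservice/data/source1</arg>
        </distcp>
        <ok to="join-distcp"/>
        <error to="kill"/>
    </action>

    <action name="distcp-2">
        <distcp xmlns="uri:oozie:distcp-action:0.1">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <arg>${nameNode}/data/source2</arg>
            <arg>hdfs://dr-nameservice/data/source2</arg>
        </distcp>
        <ok to="join-distcp"/>
        <error to="kill"/>
    </action>

    <join name="join-distcp" to="end"/>

    <!-- On action failure the WF is expected to transition here and be KILLED -->
    <kill name="kill">
        <message>distcp failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
{code}

In this pattern, a failing distcp action should take its {{<error>}} transition to the kill node; the reported bug is that the WF instead stays in RUNNING.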