[ https://issues.apache.org/jira/browse/OOZIE-3291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16526647#comment-16526647 ]
Andras Piros commented on OOZIE-3291:
-------------------------------------

[~rohit.peg] can you please post the relevant parts of your {{workflow.xml}} / {{coordinator.xml}}? Do you also encounter the issue on a newer Oozie release like 5.0.0 / 4.3.1?

> Oozie workflow hangs in running state even when the underlying action failed
> ----------------------------------------------------------------------------
>
>                 Key: OOZIE-3291
>                 URL: https://issues.apache.org/jira/browse/OOZIE-3291
>             Project: Oozie
>          Issue Type: Bug
>          Components: workflow
>    Affects Versions: 4.1.0
>            Reporter: Rohit Pegallapati
>            Priority: Major
>
> We have multiple distcp actions in a fork/join. We use Hadoop 2.6.0 (CDH 5.5.1).
> We are hitting
> https://issues.apache.org/jira/browse/MAPREDUCE-6478
> At this point the distcp action fails with the exception below.
> {code:java}
> 2018-06-10 15:19:39,179 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1520068304865_972654_m_000000_0: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /user/xxx/oozie-oozi/1951586-180303074950833-oozie-oozi-W/distcp-to-dr-0-update-action--distcp/output/_temporary/1/_temporary/attempt_1520068304865_972654_m_000000_0/part-00000 (inode 192492374): File does not exist. Holder DFSClient_NONMAPREDUCE_-2068852542_1 does not have any open files.
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3604)
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFileInternal(FSNamesystem.java:3690)
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:3660)
>     at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNodeRpcServer.java:738)
>     at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.complete(AuthorizationProviderProxyClientProtocol.java:243)
>     at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.complete(ClientNamenodeProtocolServerSideTranslatorPB.java:528)
>     at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
>     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1060)
>     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2086)
>     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2082)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2080)
> {code}
> At this point we expect the WF to be killed and the subsequent WF to start. Instead, this WF is stuck in RUNNING state and other WFs get stacked up through the coordinator, leaving no option but to kill the running WF. After this defective WF is killed, the other WFs process perfectly fine.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
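(For illustration only: the reporter's {{workflow.xml}} was not attached, so the following is a minimal sketch of the fork/join distcp pattern described above. All action names, paths, and the DR nameservice are placeholders, not taken from the reporter's setup.)

{code:xml}
<workflow-app xmlns="uri:oozie:workflow:0.4" name="distcp-fork-join-wf">
    <start to="fork-distcp"/>

    <!-- Run the distcp copies in parallel; both must reach the join -->
    <fork name="fork-distcp">
        <path start="distcp-1"/>
        <path start="distcp-2"/>
    </fork>

    <!-- Placeholder source/target paths; "dr-nameservice" is hypothetical -->
    <action name="distcp-1">
        <distcp xmlns="uri:oozie:distcp-action:0.1">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <arg>${nameNode}/data/source1</arg>
            <arg>hdfs://dr-nameservice/data/source1</arg>
        </distcp>
        <ok to="join-distcp"/>
        <error to="kill"/>
    </action>

    <action name="distcp-2">
        <distcp xmlns="uri:oozie:distcp-action:0.1">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <arg>${nameNode}/data/source2</arg>
            <arg>hdfs://dr-nameservice/data/source2</arg>
        </distcp>
        <ok to="join-distcp"/>
        <error to="kill"/>
    </action>

    <join name="join-distcp" to="end"/>

    <!-- On action failure the WF is expected to transition here and be KILLED -->
    <kill name="kill">
        <message>distcp failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
{code}

In this pattern, a failing distcp action should take its {{<error>}} transition to the kill node; the reported bug is that the WF instead stays in RUNNING.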