[ https://issues.apache.org/jira/browse/OOZIE-3366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16647181#comment-16647181 ]
Satish Subhashrao Saley edited comment on OOZIE-3366 at 10/11/18 11:33 PM:
---------------------------------------------------------------------------

I correlated the logs with the relevant code. It seems we do not suspend the parent workflow when a subworkflow gets suspended.

Logs:
{code:java}
2018-04-23 02:15:25,620 WARN ActionStartXCommand:523 [pool-12-thread-224] - SERVER[wf322] USER[saley] GROUP[users] TOKEN[] APP[saleyapp] JOB[123-123-oozie-saley--W] ACTION[123-123-oozie-saley--W@saleyapp] Error starting action [saleyapp]. ErrorType [NON_TRANSIENT], ErrorCode [JA002], Message [JA002: User: oozieuser is not allowed to impersonate saley]
2018-04-23 02:15:25,620 WARN ActionStartXCommand:523 [pool-12-thread-224] - SERVER[wf322] USER[saley] GROUP[users] TOKEN[] APP[saleyapp] JOB[123-123-oozie-saley--W] ACTION[123-123-oozie-saley--W@saleyapp] Suspending Workflow Job id=123-123-oozie-saley--W
2018-04-23 02:15:25,622 DEBUG LiteWorkflowInstance:526 [pool-12-thread-224] - SERVER[wf322] USER[saley] GROUP[users] TOKEN[] APP[saleyapp] JOB[123-123-oozie-saley--W] ACTION[123-123-oozie-saley--W@saleyapp] Suspending job
{code}

While starting the action, we get a non-transient exception.
[https://github.com/apache/oozie/blob/master/core/src/main/java/org/apache/oozie/command/wf/ActionStartXCommand.java#L290-L305]
{code:java}
// ActionStartXCommand.java
catch (ActionExecutorException ex) {
    LOG.warn("Error starting action [{0}]. ErrorType [{1}], ErrorCode [{2}], Message [{3}]",
            wfAction.getName(), ex.getErrorType(), ex.getErrorCode(), ex.getMessage(), ex);
    wfAction.setErrorInfo(ex.getErrorCode(), ex.getMessage());
    switch (ex.getErrorType()) {
        case TRANSIENT:
            if (!handleTransient(context, executor, WorkflowAction.Status.START_RETRY)) {
                handleNonTransient(context, executor, WorkflowAction.Status.START_MANUAL);
                wfAction.setPendingAge(new Date());
                wfAction.setRetries(0);
                wfAction.setStartTime(null);
            }
            break;
        case NON_TRANSIENT:
            handleNonTransient(context, executor, WorkflowAction.Status.START_MANUAL);
{code}

We put the workflow action in START_MANUAL and suspend the workflow.
[https://github.com/apache/oozie/blob/master/core/src/main/java/org/apache/oozie/command/wf/ActionXCommand.java#L125-L144]
{code:java}
// ActionXCommand.java
protected void handleNonTransient(ActionExecutor.Context context, ActionExecutor executor,
        WorkflowAction.Status status) throws CommandException {
    ActionExecutorContext aContext = (ActionExecutorContext) context;
    WorkflowActionBean action = (WorkflowActionBean) aContext.getAction();
    incrActionErrorCounter(action.getType(), "nontransient", 1);
    WorkflowJobBean workflow = (WorkflowJobBean) context.getWorkflow();
    String id = workflow.getId();
    action.setStatus(status);
    action.resetPendingOnly();
    LOG.warn("Suspending Workflow Job id=" + id);
    try {
        SuspendXCommand.suspendJob(Services.get().get(JPAService.class), workflow, id, action.getId(), null);
    }
    catch (Exception e) {
        throw new CommandException(ErrorCode.E0727, id, e.getMessage());
    }
    finally {
        // Only the parent update runs here; see updateParentIfNecessary
        updateParentIfNecessary(workflow, 3);
    }
}
{code}

While updating the parent's status, we don't consider the case where a workflow's parent can be another workflow.
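
To make the gap concrete, here is a minimal standalone sketch (not Oozie code) of how a parent ID could be classified. It assumes that a coordinator action ID contains `-C@`, which is what the check in `updateParentIfNecessary` relies on, and that a workflow job ID ends in `-W`, as in the log lines above. The class `ParentKindSketch`, the enum `ParentKind`, and the coordinator action ID used in `main` are hypothetical names and values chosen for illustration.

```java
// Hypothetical illustration only: classifies a workflow's parent ID by its shape.
class ParentKindSketch {

    enum ParentKind { NONE, COORD_ACTION, WORKFLOW, OTHER }

    static ParentKind classify(String parentId) {
        if (parentId == null || parentId.isEmpty()) {
            return ParentKind.NONE;
        }
        // Same substring test as the existing coordinator check.
        if (parentId.contains("-C@")) {
            return ParentKind.COORD_ACTION;
        }
        // Assumption: the parent may be given as a workflow job ID ("...-W")
        // or as a workflow action ID ("...-W@actionName"); strip any "@..." suffix.
        int at = parentId.indexOf('@');
        String jobPart = at >= 0 ? parentId.substring(0, at) : parentId;
        if (jobPart.endsWith("-W")) {
            return ParentKind.WORKFLOW;
        }
        return ParentKind.OTHER;
    }

    public static void main(String[] args) {
        // Hypothetical coordinator action ID.
        System.out.println(classify("123-123-oozie-saley--C@1"));        // COORD_ACTION
        // IDs taken from the log lines in this comment.
        System.out.println(classify("123-123-oozie-saley--W"));          // WORKFLOW
        System.out.println(classify("123-123-oozie-saley--W@saleyapp")); // WORKFLOW
        System.out.println(classify(null));                              // NONE
    }
}
```

Under these assumptions, `WORKFLOW` is the case the existing check never reaches: a suspended subworkflow has a workflow parent, so no parent update is triggered for it.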
[https://github.com/apache/oozie/blob/master/core/src/main/java/org/apache/oozie/command/wf/WorkflowXCommand.java#L92-L97]
{code:java}
// WorkflowXCommand.java
protected void updateParentIfNecessary(WorkflowJobBean wfjob, int maxRetries) throws CommandException {
    // update coordinator action if the wf was actually started by a coord
    if (wfjob.getParentId() != null && wfjob.getParentId().contains("-C@")) {
        new CoordActionUpdateXCommand(wfjob, maxRetries).call();
    }
}
{code}

> Update workflow status and subworkflow status on suspend command
> ----------------------------------------------------------------
>
>                 Key: OOZIE-3366
>                 URL: https://issues.apache.org/jira/browse/OOZIE-3366
>             Project: Oozie
>          Issue Type: Bug
>            Reporter: Satish Subhashrao Saley
>            Assignee: Satish Subhashrao Saley
>            Priority: Major
>
> Currently, when a subworkflow gets suspended, its corresponding workflow status
> is not updated correctly. Also, when a coord is suspended, the subworkflows
> are not suspended. We need to fix this.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)