Janos Makai created OOZIE-3670:
----------------------------------

             Summary: Actions can stuck while running in a Fork-Join workflow
                 Key: OOZIE-3670
                 URL: https://issues.apache.org/jira/browse/OOZIE-3670
             Project: Oozie
          Issue Type: Bug
          Components: core
    Affects Versions: 5.2.1
            Reporter: Janos Makai
            Assignee: Janos Makai


Fork node splits one path of execution into multiple concurrent paths of 
execution and the join node waits until every concurrent execution path of a 
previous fork node arrives to it. Given a scenario, when one of the paths 
[action] fails for some exotic reason - in our case (see attachment) with an EL 
Error - then the workflow job itself will fail as well, however the other 
actions running parallelly under the same workflow job will stuck in RUNNING 
state until they are purged, which can lead to Oozie slow-down in extreme cases.

This behaviour can be reproduced using the attached 
[forkjoin.xml{^}!https://jira.cloudera.com/images/icons/link_attachment_7.gif|width=7,height=7,align=absmiddle!{^}|https://jira.cloudera.com/secure/attachment/531918/531918_forkjoin.xml],
 
[job.properties{^}!https://jira.cloudera.com/images/icons/link_attachment_7.gif|width=7,height=7,align=absmiddle!{^}|https://jira.cloudera.com/secure/attachment/531916/531916_job.properties],
  and 
[helloworld.sh{^}!https://jira.cloudera.com/images/icons/link_attachment_7.gif|width=7,height=7,align=absmiddle!{^}|https://jira.cloudera.com/secure/attachment/531917/531917_helloworld.sh].
In the above workflow, [action2] will fail due to ELError because
{code:java}
<value>${variableThatWillCauseELError}</value> {code}
could not be evaluated, but at the same time [action1] tries to complete itself 
but remains in RUNNING state.

We have examined the situation at surface level, but we need to get a deeper 
understanding regarding the mechanism of fork-join workflows to proceed further.

Suspected classes are for starting point:
 - org.apache.oozie.workflow.lite.LiteWorkflowInstance
 - org.apache.oozie.command.wf.ActionCheckXCommand
 - what if we do not throw Exception in 
org.apache.oozie.command.wf.ActionCheckXCommand#verifyPrecondition ?




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to