Janos Makai created OOZIE-3670:
----------------------------------
Summary: Actions can stuck while running in a Fork-Join workflow
Key: OOZIE-3670
URL: https://issues.apache.org/jira/browse/OOZIE-3670
Project: Oozie
Issue Type: Bug
Components: core
Affects Versions: 5.2.1
Reporter: Janos Makai
Assignee: Janos Makai
A fork node splits one path of execution into multiple concurrent paths, and
the join node waits until every concurrent execution path of the preceding fork
node arrives at it. When an action on one of the paths fails for some unusual
reason (in our case, see the attachments, with an EL error), the workflow job
itself fails as well. However, the other actions running in parallel under the
same workflow job get stuck in RUNNING state until they are purged, which can
slow Oozie down in extreme cases.
This behaviour can be reproduced using the attached
[forkjoin.xml|https://jira.cloudera.com/secure/attachment/531918/531918_forkjoin.xml],
[job.properties|https://jira.cloudera.com/secure/attachment/531916/531916_job.properties],
and
[helloworld.sh|https://jira.cloudera.com/secure/attachment/531917/531917_helloworld.sh].
In the above workflow, [action2] fails with an ELError because
{code:xml}
<value>${variableThatWillCauseELError}</value>
{code}
cannot be evaluated. Meanwhile, [action1] tries to complete but remains in
RUNNING state.
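For readers without access to the attachments, a workflow of this shape might look roughly like the sketch below. This is not the attached forkjoin.xml: the node names fork1, join1, fail and end, the property name example.property, the schema versions, and the use of shell actions are all assumptions made for illustration. Only action1, action2, helloworld.sh and the unresolved variable come from the description above.
{code:xml}
<!-- Hypothetical sketch, not the attached forkjoin.xml -->
<workflow-app xmlns="uri:oozie:workflow:0.5" name="forkjoin-sketch">
    <start to="fork1"/>

    <!-- Both paths start concurrently -->
    <fork name="fork1">
        <path start="action1"/>
        <path start="action2"/>
    </fork>

    <!-- Healthy path: expected to finish and reach the join -->
    <action name="action1">
        <shell xmlns="uri:oozie:shell-action:0.3">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <exec>helloworld.sh</exec>
            <file>helloworld.sh</file>
        </shell>
        <ok to="join1"/>
        <error to="fail"/>
    </action>

    <!-- Failing path: the variable is assumed to be undefined in job.properties -->
    <action name="action2">
        <shell xmlns="uri:oozie:shell-action:0.3">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <!-- assumed property name, for illustration only -->
                    <name>example.property</name>
                    <value>${variableThatWillCauseELError}</value>
                </property>
            </configuration>
            <exec>helloworld.sh</exec>
            <file>helloworld.sh</file>
        </shell>
        <ok to="join1"/>
        <error to="fail"/>
    </action>

    <!-- The join waits for every path of fork1 -->
    <join name="join1" to="end"/>

    <kill name="fail">
        <message>Action failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
{code}
In a layout like this, the failure of action2 ends the workflow job while action1 is still in flight, which is the window in which action1 is reported to stay in RUNNING state.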
We have examined the situation at a surface level, but we need a deeper
understanding of how fork-join workflows work before we can proceed further.
Suspected classes to use as starting points:
- org.apache.oozie.workflow.lite.LiteWorkflowInstance
- org.apache.oozie.command.wf.ActionCheckXCommand
- Open question: what happens if we do not throw an exception in
org.apache.oozie.command.wf.ActionCheckXCommand#verifyPrecondition?
--
This message was sent by Atlassian Jira
(v8.20.10#820010)