[
https://issues.apache.org/jira/browse/OOZIE-3670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17633258#comment-17633258
]
Janos Makai commented on OOZIE-3670:
------------------------------------
Fixed the Javadoc-related issues in my latest patch.
> Actions can stuck while running in a Fork-Join workflow
> -------------------------------------------------------
>
> Key: OOZIE-3670
> URL: https://issues.apache.org/jira/browse/OOZIE-3670
> Project: Oozie
> Issue Type: Bug
> Components: core
> Affects Versions: 5.2.1
> Reporter: Janos Makai
> Assignee: Janos Makai
> Priority: Major
> Attachments: OOZIE-3670-001.patch, OOZIE-3670-002.patch
>
>
> A fork node splits one path of execution into multiple concurrent paths, and
> the join node waits until every concurrent path of the preceding fork node
> arrives at it. If one of the paths [action] fails for some unusual reason (in
> our case, see attachment, with an EL error), the workflow job itself fails as
> well. However, the other actions running in parallel under the same workflow
> job get stuck in the RUNNING state until they are purged, which can slow
> Oozie down in extreme cases.
> This behaviour can be reproduced using the attached
> [forkjoin.xml|https://jira.cloudera.com/secure/attachment/531918/531918_forkjoin.xml],
> [job.properties|https://jira.cloudera.com/secure/attachment/531916/531916_job.properties],
> and
> [helloworld.sh|https://jira.cloudera.com/secure/attachment/531917/531917_helloworld.sh].
> In the above workflow, [action2] fails with an ELError because
> {code:xml}
> <value>${variableThatWillCauseELError}</value>
> {code}
> cannot be evaluated, while at the same time [action1] tries to complete but
> remains in the RUNNING state.
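> For orientation, a workflow of the shape described above might look roughly
> like the following. This is only a minimal sketch, not the attached
> forkjoin.xml: the node names fork1, join1, fail and end, the property name
> some.property, the schema versions, and the ${jobTracker} and ${nameNode}
> parameters are assumptions. Only action1, action2, helloworld.sh and the
> unresolved variable are taken from the description.
> {code:xml}
> <!-- Hypothetical sketch; names other than action1/action2 are assumed. -->
> <workflow-app xmlns="uri:oozie:workflow:0.5" name="forkjoin-sketch">
>     <start to="fork1"/>
>     <!-- The fork starts both actions concurrently. -->
>     <fork name="fork1">
>         <path start="action1"/>
>         <path start="action2"/>
>     </fork>
>     <action name="action1">
>         <shell xmlns="uri:oozie:shell-action:0.3">
>             <job-tracker>${jobTracker}</job-tracker>
>             <name-node>${nameNode}</name-node>
>             <exec>helloworld.sh</exec>
>             <file>helloworld.sh</file>
>         </shell>
>         <ok to="join1"/>
>         <error to="fail"/>
>     </action>
>     <action name="action2">
>         <shell xmlns="uri:oozie:shell-action:0.3">
>             <job-tracker>${jobTracker}</job-tracker>
>             <name-node>${nameNode}</name-node>
>             <configuration>
>                 <property>
>                     <!-- Assumed property name; the value is never defined,
>                          so evaluating it raises the EL error. -->
>                     <name>some.property</name>
>                     <value>${variableThatWillCauseELError}</value>
>                 </property>
>             </configuration>
>             <exec>helloworld.sh</exec>
>             <file>helloworld.sh</file>
>         </shell>
>         <ok to="join1"/>
>         <error to="fail"/>
>     </action>
>     <!-- The join waits for every path started by fork1. -->
>     <join name="join1" to="end"/>
>     <kill name="fail">
>         <message>Failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
>     </kill>
>     <end name="end"/>
> </workflow-app>
> {code}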
> We have examined the situation only superficially and need a deeper
> understanding of how fork-join workflows work before we can proceed further.
> Suspected classes, as a starting point:
> - org.apache.oozie.workflow.lite.LiteWorkflowInstance
> - org.apache.oozie.command.wf.ActionCheckXCommand
> - What if we do not throw an exception in
> org.apache.oozie.command.wf.ActionCheckXCommand#verifyPrecondition?
--
This message was sent by Atlassian Jira
(v8.20.10#820010)