[ 
https://issues.apache.org/jira/browse/OOZIE-3670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17640785#comment-17640785
 ] 

Hadoop QA commented on OOZIE-3670:
----------------------------------

PreCommit-OOZIE-Build started


> Actions can stuck while running in a Fork-Join workflow
> -------------------------------------------------------
>
>                 Key: OOZIE-3670
>                 URL: https://issues.apache.org/jira/browse/OOZIE-3670
>             Project: Oozie
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 5.2.1
>            Reporter: Janos Makai
>            Assignee: Janos Makai
>            Priority: Major
>         Attachments: OOZIE-3670-001.patch, forkjoin.xml, helloworld.sh, 
> job.properties
>
>
> A fork node splits one path of execution into multiple concurrent paths of 
> execution, and the join node waits until every concurrent path started by the 
> preceding fork node arrives at it. Consider a scenario in which one of these 
> paths (an action) fails for an unusual reason, in our case (see attachments) 
> with an EL error. The workflow job itself then fails as well; however, the 
> other actions running in parallel under the same workflow job get stuck in the 
> RUNNING state until they are purged, which in extreme cases can slow Oozie 
> down.
> This behaviour can be reproduced using the attached 
> [forkjoin.xml|https://jira.cloudera.com/secure/attachment/531918/531918_forkjoin.xml],
> [job.properties|https://jira.cloudera.com/secure/attachment/531916/531916_job.properties]
> and [helloworld.sh|https://jira.cloudera.com/secure/attachment/531917/531917_helloworld.sh].
> In the above workflow, [action2] fails with an ELError because the expression
> {code:xml}
> <value>${variableThatWillCauseELError}</value>
> {code}
> cannot be evaluated, while at the same time [action1] tries to complete but 
> remains in the RUNNING state.
> We have examined the situation only at a surface level; a deeper understanding 
> of how fork-join workflows are executed is needed before we can proceed 
> further.
> Suspected classes, as a starting point:
>  - org.apache.oozie.workflow.lite.LiteWorkflowInstance
>  - org.apache.oozie.command.wf.ActionCheckXCommand
>  - What if we do not throw an exception in 
> org.apache.oozie.command.wf.ActionCheckXCommand#verifyPrecondition?
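The fork/join semantics described above, and the gap between the reported behaviour and what one would expect, can be illustrated with a toy model. This is a minimal Java sketch, not Oozie code: ForkJoinModel, Status, onActionEnd and the killSiblingsOnFailure switch are all hypothetical names, and the sketch does not reproduce how LiteWorkflowInstance or ActionCheckXCommand actually handle transitions. It assumes that, once the workflow has failed, a late completion of a sibling action is simply rejected, which matches the symptom in the description.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Hypothetical toy model of fork/join semantics. None of these names are
 * Oozie classes; it only mimics the symptom reported in OOZIE-3670.
 */
public class ForkJoinModel {

    enum Status { RUNNING, OK, ERROR, KILLED }

    private final Map<String, Status> paths = new LinkedHashMap<>();
    private final boolean killSiblingsOnFailure; // hypothetical switch
    private boolean workflowFailed = false;

    ForkJoinModel(boolean killSiblingsOnFailure) {
        this.killSiblingsOnFailure = killSiblingsOnFailure;
    }

    /** Fork: start every concurrent path in the RUNNING state. */
    void fork(String... actions) {
        for (String a : actions) {
            paths.put(a, Status.RUNNING);
        }
    }

    /** Report the end of one action. */
    void onActionEnd(String action, boolean success) {
        if (workflowFailed) {
            // Assumed to mirror the reported symptom: a late completion is
            // rejected once the workflow has failed, so the action keeps
            // whatever status it had (possibly RUNNING).
            return;
        }
        if (success) {
            paths.put(action, Status.OK);
            return;
        }
        paths.put(action, Status.ERROR);
        workflowFailed = true;
        if (killSiblingsOnFailure) {
            // Expected behaviour: no forked sibling is left in RUNNING.
            for (Map.Entry<String, Status> e : paths.entrySet()) {
                if (e.getValue() == Status.RUNNING) {
                    e.setValue(Status.KILLED);
                }
            }
        }
    }

    /** The join is reached only when every forked path finished successfully. */
    boolean joinReached() {
        return !workflowFailed
                && paths.values().stream().allMatch(s -> s == Status.OK);
    }

    Status status(String action) {
        return paths.get(action);
    }

    boolean workflowFailed() {
        return workflowFailed;
    }

    public static void main(String[] args) {
        for (boolean kill : new boolean[] {false, true}) {
            ForkJoinModel wf = new ForkJoinModel(kill);
            wf.fork("action1", "action2");
            wf.onActionEnd("action2", false); // action2 fails (EL error)
            wf.onActionEnd("action1", true);  // action1 finishes afterwards
            System.out.println("killSiblingsOnFailure=" + kill
                    + " action1=" + wf.status("action1")
                    + " action2=" + wf.status("action2"));
        }
    }
}
```

With killSiblingsOnFailure=false the model ends with action1 still RUNNING, which is the stuck state from the report; with it set to true, action1 is marked KILLED when action2 fails. Whether the real fix belongs in the failure path or in ActionCheckXCommand#verifyPrecondition is exactly the open question above, and this sketch does not answer it.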



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
