[ https://issues.apache.org/jira/browse/OOZIE-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14032608#comment-14032608 ]

Robert Kanter commented on OOZIE-1879:
--------------------------------------

(Test failures unrelated)

> Workflow Rerun causes error depending on the order of forked nodes
> ------------------------------------------------------------------
>
>                 Key: OOZIE-1879
>                 URL: https://issues.apache.org/jira/browse/OOZIE-1879
>             Project: Oozie
>          Issue Type: Bug
>          Components: core
>    Affects Versions: trunk
>            Reporter: Robert Kanter
>            Assignee: Robert Kanter
>            Priority: Blocker
>         Attachments: OOZIE-1879.patch
>
>
> Suppose you have a workflow like this:
> {noformat}
> start --> fork
> fork --> shell1, shell2
> shell1 --> join
> shell2 --> join
> join --> shell3
> shell3 --> end
> {noformat}
> And all actions except shell3 succeed.
> Assuming you fix the problem with shell3 and then do a rerun, one of the
> following two outcomes occurs:
> # If shell1 finished before shell2, then the rerun succeeds
> # If shell2 finished before shell1, then the rerun fails
> The error in the second outcome is simply this log message:
> {noformat}
> 2014-05-29 17:17:03,735 ERROR 
> org.apache.oozie.workflow.lite.LiteWorkflowInstance: 
> SERVER[cdh5-1.cloudera.local] USER[pdvorak] GROUP[-] TOKEN[] 
> APP[test-rerun-wf] JOB[0000004-140521220856264-oozie-oozi-W] 
> ACTION[0000004-140521220856264-oozie-oozi-W@join] invalid execution path 
> [/shell1/]
> {noformat}
> After some digging, I discovered that during a rerun of the above workflow
> (or similar ones), LiteWorkflowInstance#signal gets called for each action
> in the fork node in the order in which they are listed in the fork node's
> XML; during the original run, however, LiteWorkflowInstance#signal gets
> called for each action in the order in which they complete (i.e. by
> endTime). When the two orders don't match, you get the above error. The
> general fix is therefore to ensure that during a rerun,
> LiteWorkflowInstance#signal gets called for each action in the fork node in
> the same order as in the original run. And if you think about it, that is
> more correct than the current behavior anyway.
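
The ordering idea described above can be illustrated with a minimal, self-contained sketch. It is not the attached patch and does not use any Oozie classes: ForkedAction, RerunSignalOrder, and signalOrder are hypothetical names invented for this illustration, and the end times are made up. It only shows how actions listed in XML order could be re-sorted by their recorded end time before being signalled again.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class RerunSignalOrder {

    /** Hypothetical stand-in for a forked action; NOT an Oozie class. */
    static final class ForkedAction {
        final String name;
        final long endTimeMillis; // assumed: completion time recorded in the original run

        ForkedAction(String name, long endTimeMillis) {
            this.name = name;
            this.endTimeMillis = endTimeMillis;
        }
    }

    /**
     * Takes the forked actions in XML order and returns their names in the
     * order in which they completed during the original run.
     */
    static List<String> signalOrder(List<ForkedAction> actionsInXmlOrder) {
        // Copy first so the caller's list is left untouched.
        List<ForkedAction> sorted = new ArrayList<>(actionsInXmlOrder);
        // List.sort is stable: actions with equal end times keep XML order.
        sorted.sort(Comparator.comparingLong((ForkedAction a) -> a.endTimeMillis));
        List<String> names = new ArrayList<>();
        for (ForkedAction action : sorted) {
            names.add(action.name);
        }
        return names;
    }

    public static void main(String[] args) {
        // XML order is shell1, shell2, but shell2 finished first (the failing case).
        List<ForkedAction> xmlOrder = Arrays.asList(
                new ForkedAction("shell1", 2000L),
                new ForkedAction("shell2", 1000L));
        System.out.println(signalOrder(xmlOrder)); // prints [shell2, shell1]
    }
}
```

In this sketch the rerun would signal shell2 before shell1, matching the original completion order rather than the order in the fork node's XML.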



--
This message was sent by Atlassian JIRA
(v6.2#6252)
