[ 
https://issues.apache.org/jira/browse/OOZIE-1849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14002563#comment-14002563
 ] 

Bowen Zhang commented on OOZIE-1849:
------------------------------------

+1

> If the underlying job finishes while a Workflow is suspended, Oozie can take 
> a while to realize it
> --------------------------------------------------------------------------------------------------
>
>                 Key: OOZIE-1849
>                 URL: https://issues.apache.org/jira/browse/OOZIE-1849
>             Project: Oozie
>          Issue Type: Improvement
>          Components: core
>    Affects Versions: 4.0.1
>            Reporter: Robert Kanter
>            Assignee: Robert Kanter
>         Attachments: OOZIE-1849.patch
>
>
> Suppose you have a Workflow and you suspend it while one of the actions is 
> still RUNNING.  The underlying MR/Pig/etc job will continue running (as 
> expected, because we can't pause those).  However, if that job finishes while 
> the workflow is SUSPENDED, the CallbackServlet will receive the callback, but 
> the ActionCheckXCommand won't update the action:
> {noformat}
> 2014-05-16 17:40:57,959  INFO CallbackServlet:541 - SERVER[rkanter-mbp.local] 
> USER[-] GROUP[-] TOKEN[-] APP[-] JOB[0000002-140516173529928-oozie-rkan-W] 
> ACTION[0000002-140516173529928-oozie-rkan-W@mr-node] callback for action 
> [0000002-140516173529928-oozie-rkan-W@mr-node]
> 2014-05-16 17:40:57,985  WARN ActionCheckXCommand:544 - 
> SERVER[rkanter-mbp.local] USER[rkanter] GROUP[-] TOKEN[] APP[map-reduce-wf] 
> JOB[0000002-140516173529928-oozie-rkan-W] 
> ACTION[0000002-140516173529928-oozie-rkan-W@mr-node] E0818: Action 
> [0000002-140516173529928-oozie-rkan-W@mr-node] status is running but WF Job 
> [0000002-140516173529928-oozie-rkan-W] status is [SUSPENDED]. Expected status 
> is RUNNING., Error Code: E0818
> {noformat}
> If you then resume the workflow, the action will stay RUNNING for up to 10 
> minutes (the default fallback polling interval). Only then does the 
> ActionCheckerService run an ActionCheckXCommand whose precondition check 
> passes; it checks the job and finally marks the action as SUCCESSFUL.
> We should fix this in one of the following ways:
> # ResumeXCommand should also queue an ActionCheckXCommand (if the workflow 
> was SUSPENDED) so we don't have to wait for the ActionCheckerService
> # ActionCheckXCommand's precondition check should allow SUSPENDED workflows
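
The two options in the issue description can be compared with a small toy model. This is only a sketch and not Oozie code: `ToyWorkflow`, its methods, and the `queue_check` / `allow_suspended` flags are hypothetical names made up for illustration, and the status names simply follow the wording of the description.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    RUNNING = "RUNNING"
    SUSPENDED = "SUSPENDED"
    SUCCESSFUL = "SUCCESSFUL"  # wording taken from the issue description


@dataclass
class ToyWorkflow:
    """Hypothetical stand-in for a workflow with a single running action."""
    wf_status: Status = Status.RUNNING
    action_status: Status = Status.RUNNING
    external_job_done: bool = False

    def action_check(self, allow_suspended: bool = False) -> bool:
        """Toy analogue of the action check; returns True if the precondition passed."""
        allowed = {Status.RUNNING}
        if allow_suspended:
            # Option 2: the precondition also accepts SUSPENDED workflows.
            allowed.add(Status.SUSPENDED)
        if self.wf_status not in allowed:
            return False  # corresponds to the E0818 warning in the log above
        if self.external_job_done and self.action_status is Status.RUNNING:
            self.action_status = Status.SUCCESSFUL
        return True

    def suspend(self) -> None:
        self.wf_status = Status.SUSPENDED

    def callback(self, allow_suspended: bool = False) -> bool:
        """The external job finished and notified the server."""
        self.external_job_done = True
        return self.action_check(allow_suspended)

    def resume(self, queue_check: bool = False) -> None:
        was_suspended = self.wf_status is Status.SUSPENDED
        self.wf_status = Status.RUNNING
        if queue_check and was_suspended:
            # Option 1: resume triggers an action check right away
            # instead of waiting for the periodic checker.
            self.action_check()


def run_scenario(queue_check: bool = False, allow_suspended: bool = False) -> Status:
    """Suspend, let the job finish, resume; return the resulting action status."""
    wf = ToyWorkflow()
    wf.suspend()
    wf.callback(allow_suspended=allow_suspended)
    wf.resume(queue_check=queue_check)
    return wf.action_status
```

In this model, `run_scenario()` leaves the action RUNNING after the resume (the reported behavior, which lasts until the next periodic check), while either option updates it without that wait. Note that under option 2 the action changes state while the workflow is still suspended, which may or may not be desirable in the real system.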



--
This message was sent by Atlassian JIRA
(v6.2#6252)
