[ https://issues.apache.org/jira/browse/OOZIE-1849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14002563#comment-14002563 ]
Bowen Zhang commented on OOZIE-1849:
------------------------------------

+1

> If the underlying job finishes while a Workflow is suspended, Oozie can take a while to realize it
> --------------------------------------------------------------------------------------------------
>
>                 Key: OOZIE-1849
>                 URL: https://issues.apache.org/jira/browse/OOZIE-1849
>             Project: Oozie
>          Issue Type: Improvement
>          Components: core
>    Affects Versions: 4.0.1
>            Reporter: Robert Kanter
>            Assignee: Robert Kanter
>         Attachments: OOZIE-1849.patch
>
>
> Suppose you have a Workflow and you suspend it while one of the actions is
> still RUNNING. The underlying MR/Pig/etc job will continue running (as
> expected, because we can't pause those). However, if that job finishes while
> the workflow is SUSPENDED, the CallbackServlet will receive the callback, but
> the ActionCheckXCommand won't update the action:
> {noformat}
> 2014-05-16 17:40:57,959 INFO CallbackServlet:541 - SERVER[rkanter-mbp.local] USER[-] GROUP[-] TOKEN[-] APP[-] JOB[0000002-140516173529928-oozie-rkan-W] ACTION[0000002-140516173529928-oozie-rkan-W@mr-node] callback for action [0000002-140516173529928-oozie-rkan-W@mr-node]
> 2014-05-16 17:40:57,985 WARN ActionCheckXCommand:544 - SERVER[rkanter-mbp.local] USER[rkanter] GROUP[-] TOKEN[] APP[map-reduce-wf] JOB[0000002-140516173529928-oozie-rkan-W] ACTION[0000002-140516173529928-oozie-rkan-W@mr-node] E0818: Action [0000002-140516173529928-oozie-rkan-W@mr-node] status is running but WF Job [0000002-140516173529928-oozie-rkan-W] status is [SUSPENDED]. Expected status is RUNNING., Error Code: E0818
> {noformat}
> If you then resume the workflow, the action will stay RUNNING for up to 10
> minutes (the default fallback polling interval), at which point the
> ActionCheckerService will run an ActionCheckXCommand that will pass, check
> the job, and finally mark the action as SUCCESSFUL.
> We should fix this by one of the following:
> # ResumeXCommand should also queue an ActionCheckXCommand (if the workflow was
> SUSPENDED) so we don't have to wait for the ActionCheckerService
> # ActionCheckXCommand's precondition check should allow SUSPENDED workflows
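
To make the two proposed options concrete, here is a minimal, self-contained Java sketch of the scenario. It is not Oozie code and does not reflect what the attached patch does: `ResumeSketch`, `WfStatus`, `ActionStatus`, the `DONE` status, and every method name below are hypothetical stand-ins. It only models the sequence suspend, callback, resume, and shows that either option lets the action be updated without waiting for the fallback poll.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical toy model of the issue; none of these types are Oozie classes.
enum WfStatus { RUNNING, SUSPENDED }
enum ActionStatus { RUNNING, DONE }

class ResumeSketch {
    WfStatus wf = WfStatus.RUNNING;
    ActionStatus action = ActionStatus.RUNNING;
    boolean externalJobDone = false;

    final boolean checkOnResume;   // option 1: resume queues an action check
    final boolean allowSuspended;  // option 2: the check tolerates SUSPENDED
    final Queue<Runnable> commandQueue = new ArrayDeque<>();

    ResumeSketch(boolean checkOnResume, boolean allowSuspended) {
        this.checkOnResume = checkOnResume;
        this.allowSuspended = allowSuspended;
    }

    // Stand-in for the action check. Returns false when the precondition
    // rejects the workflow status (the analogue of the E0818 warning).
    boolean actionCheck() {
        boolean wfOk = wf == WfStatus.RUNNING
                || (allowSuspended && wf == WfStatus.SUSPENDED);
        if (!wfOk) {
            return false;
        }
        if (externalJobDone) {
            action = ActionStatus.DONE;
        }
        return true;
    }

    void suspend() {
        wf = WfStatus.SUSPENDED;
    }

    // The external job finished and called back; a check runs immediately.
    void callback() {
        externalJobDone = true;
        actionCheck();
    }

    void resume() {
        boolean wasSuspended = wf == WfStatus.SUSPENDED;
        wf = WfStatus.RUNNING;
        if (checkOnResume && wasSuspended) {
            commandQueue.add(this::actionCheck);
        }
    }

    void drainQueue() {
        Runnable next;
        while ((next = commandQueue.poll()) != null) {
            next.run();
        }
    }

    // Runs suspend -> callback -> resume and reports the action status
    // before any periodic fallback poll would have run.
    static ActionStatus scenario(boolean checkOnResume, boolean allowSuspended) {
        ResumeSketch s = new ResumeSketch(checkOnResume, allowSuspended);
        s.suspend();
        s.callback();
        s.resume();
        s.drainQueue();
        return s.action;
    }

    public static void main(String[] args) {
        System.out.println("no fix:   " + scenario(false, false));
        System.out.println("option 1: " + scenario(true, false));
        System.out.println("option 2: " + scenario(false, true));
    }
}
```

In this model, the unfixed case leaves the action RUNNING after resume, which corresponds to waiting for the next fallback poll. Option 1 updates the action at resume time, while option 2 updates it as soon as the callback arrives, even though the workflow is still SUSPENDED. Whether updating an action inside a suspended workflow is acceptable is a design question the sketch does not answer.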