Robert Kanter created OOZIE-1849:
------------------------------------

             Summary: If the underlying job finishes while a Workflow is 
suspended, Oozie can take a while to realize it
                 Key: OOZIE-1849
                 URL: https://issues.apache.org/jira/browse/OOZIE-1849
             Project: Oozie
          Issue Type: Improvement
          Components: core
    Affects Versions: 4.0.1
            Reporter: Robert Kanter
            Assignee: Robert Kanter


Suppose you have a Workflow and you suspend it while one of the actions is 
still RUNNING.  The underlying MR/Pig/etc job will continue running (as 
expected, because we can't pause those).  However, if that job finishes while 
the workflow is SUSPENDED, the CallbackServlet will receive the callback, but 
the ActionCheckXCommand won't update the action:
{noformat}
2014-05-16 17:40:57,959  INFO CallbackServlet:541 - SERVER[rkanter-mbp.local] 
USER[-] GROUP[-] TOKEN[-] APP[-] JOB[0000002-140516173529928-oozie-rkan-W] 
ACTION[0000002-140516173529928-oozie-rkan-W@mr-node] callback for action 
[0000002-140516173529928-oozie-rkan-W@mr-node]
2014-05-16 17:40:57,985  WARN ActionCheckXCommand:544 - 
SERVER[rkanter-mbp.local] USER[rkanter] GROUP[-] TOKEN[] APP[map-reduce-wf] 
JOB[0000002-140516173529928-oozie-rkan-W] 
ACTION[0000002-140516173529928-oozie-rkan-W@mr-node] E0818: Action 
[0000002-140516173529928-oozie-rkan-W@mr-node] status is running but WF Job 
[0000002-140516173529928-oozie-rkan-W] status is [SUSPENDED]. Expected status 
is RUNNING., Error Code: E0818
{noformat}
If you then resume the workflow, the action will stay RUNNING for up to 10 
minutes (the default fallback polling interval), at which point the 
ActionCheckerService will run an ActionCheckXCommand that will pass, check the 
job, and finally mark the action as SUCCESSFUL.

We should fix this in one of the following ways:
# ResumeXCommand should also queue an ActionCheckXCommand (if the workflow was 
SUSPENDED) so we don't have to wait for the ActionCheckerService
# ActionCheckXCommand's precondition check should allow SUSPENDED workflows
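
To make the two options concrete, here is a rough, self-contained sketch. It is a toy model, not Oozie code: the class SuspendedCallbackSketch, its Workflow holder, the actionCheck/resume methods and the allowSuspended/queueCheckOnResume flags are all hypothetical names made up for illustration, and the real commands do considerably more than this.

```java
// Toy model of the behavior described above. All names here are hypothetical;
// none of this is Oozie's actual API.
public class SuspendedCallbackSketch {
    enum WorkflowStatus { RUNNING, SUSPENDED }
    enum ActionStatus { RUNNING, SUCCESSFUL }

    // Starts in the state from the description: workflow SUSPENDED, action still
    // RUNNING, and the underlying job already finished.
    static class Workflow {
        WorkflowStatus status = WorkflowStatus.SUSPENDED;
        ActionStatus actionStatus = ActionStatus.RUNNING;
        boolean externalJobFinished = true;
    }

    // Stand-in for the action check. Returns false when the precondition rejects
    // the workflow status (the E0818 case). allowSuspended=true models option 2.
    static boolean actionCheck(Workflow wf, boolean allowSuspended) {
        boolean preconditionOk = wf.status == WorkflowStatus.RUNNING
                || (allowSuspended && wf.status == WorkflowStatus.SUSPENDED);
        if (!preconditionOk) {
            return false;
        }
        if (wf.actionStatus == ActionStatus.RUNNING && wf.externalJobFinished) {
            wf.actionStatus = ActionStatus.SUCCESSFUL;
        }
        return true;
    }

    // Stand-in for resume. queueCheckOnResume=true models option 1: a check runs
    // right away instead of waiting for the periodic fallback poll.
    static void resume(Workflow wf, boolean queueCheckOnResume) {
        boolean wasSuspended = wf.status == WorkflowStatus.SUSPENDED;
        wf.status = WorkflowStatus.RUNNING;
        if (wasSuspended && queueCheckOnResume) {
            actionCheck(wf, false);
        }
    }

    public static void main(String[] args) {
        Workflow current = new Workflow();
        actionCheck(current, false);   // callback while suspended: rejected
        resume(current, false);        // action stays RUNNING until the next poll
        System.out.println("current behavior: " + current.actionStatus);

        Workflow option1 = new Workflow();
        actionCheck(option1, false);
        resume(option1, true);
        System.out.println("option 1: " + option1.actionStatus);

        Workflow option2 = new Workflow();
        actionCheck(option2, true);    // callback while suspended: accepted
        System.out.println("option 2: " + option2.actionStatus);
    }
}
```

In this model, option 1 leaves the precondition untouched and only adds work at resume time, while option 2 updates the action while the workflow is still SUSPENDED. Whether option 2 could have side effects on a suspended workflow in the real code is not modeled here.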



--
This message was sent by Atlassian JIRA
(v6.2#6252)