[ 
https://issues.apache.org/jira/browse/OOZIE-2126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Kanter updated OOZIE-2126:
---------------------------------
    Attachment: OOZIE-2126.patch

The patch fixes this issue by checking whether the action is still in PREP 
when the callback arrives; if it is, the command is requeued after a delay, up 
to a maximum number of times (both the delay and the limit are configurable).  
This is not specific to the SSH action and should work with any action type.

The crux of the issue is a race condition between when Oozie updates the 
database to mark the action as RUNNING and when it gets the callback.  We 
can't really do any synchronization for this, which is why I used a retry 
approach.
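
To illustrate the retry approach, here is a minimal, self-contained Java 
sketch.  It is not the code from the patch: the class name, the Status enum, 
requeueDelayMs and maxRequeues are all hypothetical stand-ins, and a 
ScheduledExecutorService is assumed in place of Oozie's own command queue.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch, not Oozie code: every name here is illustrative.
class CallbackRetrySketch {
    enum Status { PREP, RUNNING, DONE }

    // Stand-in for the command queue that would requeue the command.
    private final ScheduledExecutorService queue =
            Executors.newSingleThreadScheduledExecutor();
    private final long requeueDelayMs; // assumed to come from configuration
    private final int maxRequeues;     // assumed to come from configuration

    CallbackRetrySketch(long requeueDelayMs, int maxRequeues) {
        this.requeueDelayMs = requeueDelayMs;
        this.maxRequeues = maxRequeues;
    }

    // Handles a completion callback. The future completes with true once the
    // callback has been processed, or false if it was given up on.
    CompletableFuture<Boolean> onCallback(AtomicReference<Status> action) {
        CompletableFuture<Boolean> result = new CompletableFuture<>();
        attempt(action, 0, result);
        return result;
    }

    private void attempt(AtomicReference<Status> action, int requeues,
                         CompletableFuture<Boolean> result) {
        if (action.compareAndSet(Status.RUNNING, Status.DONE)) {
            // Normal case: the action was RUNNING, so process the callback.
            result.complete(true);
        } else if (action.get() == Status.PREP && requeues < maxRequeues) {
            // The callback beat the PREP -> RUNNING update: try again later.
            queue.schedule(() -> attempt(action, requeues + 1, result),
                    requeueDelayMs, TimeUnit.MILLISECONDS);
        } else {
            // Out of requeues, or the action is in some other state.
            result.complete(false);
        }
    }

    void shutdown() {
        queue.shutdownNow();
    }

    public static void main(String[] args) {
        CallbackRetrySketch sketch = new CallbackRetrySketch(50, 3);
        try {
            AtomicReference<Status> action = new AtomicReference<>(Status.PREP);
            CompletableFuture<Boolean> processed = sketch.onCallback(action);
            action.set(Status.RUNNING); // the start path catches up afterwards
            System.out.println("processed=" + processed.join());
        } finally {
            sketch.shutdown();
        }
    }
}
```

The retry limit matters here: an action that never leaves PREP stops being 
requeued once the limit is reached, and the callback is then dropped as it 
was before.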

> SSH action can be too fast for Oozie sometimes
> ----------------------------------------------
>
>                 Key: OOZIE-2126
>                 URL: https://issues.apache.org/jira/browse/OOZIE-2126
>             Project: Oozie
>          Issue Type: Bug
>          Components: action
>            Reporter: Robert Kanter
>            Assignee: Robert Kanter
>         Attachments: OOZIE-2126.patch
>
>
> We've seen a timing problem with the SSH action where the callback comes back 
> too fast, while the action is still in PREP and has not yet transitioned to 
> RUNNING.  This causes Oozie to ignore the callback, which means it won't find 
> out that the action completed until it checks on its own (default=10min).  
> This happened in an HA setup, but I think it could happen even without HA.  
> Adding a 30 second delay to the ssh scripts fixed the problem, but ideally we 
> should come up with a better solution.
> Here are the relevant logs:
> {noformat}
> 2015-01-16 18:00:12,916 INFO org.apache.oozie.action.ssh.SshActionExecutor: 
> SERVER[FOO] USER[foo] GROUP[-] TOKEN[] APP[${job_name}] 
> JOB[0000027-150113223634420-oozie-oozi-W] 
> ACTION[0000027-150113223634420-oozie-oozi-W@action-1] start() begins
> 2015-01-16 18:00:12,917 INFO org.apache.oozie.action.ssh.SshActionExecutor: 
> SERVER[FOO] USER[foo] GROUP[-] TOKEN[] APP[${job_name}] 
> JOB[0000027-150113223634420-oozie-oozi-W] 
> ACTION[0000027-150113223634420-oozie-oozi-W@action-1] Attempting to copy ssh 
> base scripts to remote host [f...@bar.com]
> 2015-01-16 18:00:15,769 INFO org.apache.oozie.servlet.CallbackServlet: 
> SERVER[FOO] USER[-] GROUP[-] TOKEN[-] APP[-] 
> JOB[0000027-150113223634420-oozie-oozi-W] 
> ACTION[0000027-150113223634420-oozie-oozi-W@action-1] callback for action 
> [0000027-150113223634420-oozie-oozi-W@action-1]
> 2015-01-16 18:00:15,774 ERROR 
> org.apache.oozie.command.wf.CompletedActionXCommand: SERVER[FOO] USER[-] 
> GROUP[-] TOKEN[] APP[-] JOB[0000027-150113223634420-oozie-oozi-W] 
> ACTION[0000027-150113223634420-oozie-oozi-W@action-1] XException,
> org.apache.oozie.command.CommandException: E0800: Action it is not running 
> its in [PREP] state, action [0000027-150113223634420-oozie-oozi-W@action-1]
>         at 
> org.apache.oozie.command.wf.CompletedActionXCommand.eagerVerifyPrecondition(CompletedActionXCommand.java:77)
>         at org.apache.oozie.command.XCommand.call(XCommand.java:251)
>         at 
> org.apache.oozie.service.CallableQueueService$CallableWrapper.run(CallableQueueService.java:174)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
