[ https://issues.apache.org/jira/browse/OOZIE-2126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Robert Kanter updated OOZIE-2126:
---------------------------------
Attachment: OOZIE-2126.patch
For reference, I've attached a copy of the final patch I committed; the only
change from the previous version is the formatting change.
> SSH action can be too fast for Oozie sometimes
> ----------------------------------------------
>
> Key: OOZIE-2126
> URL: https://issues.apache.org/jira/browse/OOZIE-2126
> Project: Oozie
> Issue Type: Bug
> Components: action
> Reporter: Robert Kanter
> Assignee: Robert Kanter
> Fix For: trunk
>
> Attachments: OOZIE-2126.patch, OOZIE-2126.patch
>
>
> We've seen a timing problem with the SSH action where the callback arrives
> too fast, while the action is still in PREP and has not yet transitioned to
> RUNNING. This causes Oozie to ignore the callback, so it won't find out that
> the action completed until it checks the action itself (default=10min). This
> happened in an HA setup, but I think it could happen even without HA. Adding
> a 30 second delay to the ssh scripts worked around the problem, but ideally
> we should come up with a better solution.
> Here are the relevant logs:
> {noformat}
> 2015-01-16 18:00:12,916 INFO org.apache.oozie.action.ssh.SshActionExecutor: SERVER[FOO] USER[foo] GROUP[-] TOKEN[] APP[${job_name}] JOB[0000027-150113223634420-oozie-oozi-W] ACTION[0000027-150113223634420-oozie-oozi-W@action-1] start() begins
> 2015-01-16 18:00:12,917 INFO org.apache.oozie.action.ssh.SshActionExecutor: SERVER[FOO] USER[foo] GROUP[-] TOKEN[] APP[${job_name}] JOB[0000027-150113223634420-oozie-oozi-W] ACTION[0000027-150113223634420-oozie-oozi-W@action-1] Attempting to copy ssh base scripts to remote host [[email protected]]
> 2015-01-16 18:00:15,769 INFO org.apache.oozie.servlet.CallbackServlet: SERVER[FOO] USER[-] GROUP[-] TOKEN[-] APP[-] JOB[0000027-150113223634420-oozie-oozi-W] ACTION[0000027-150113223634420-oozie-oozi-W@action-1] callback for action [0000027-150113223634420-oozie-oozi-W@action-1]
> 2015-01-16 18:00:15,774 ERROR org.apache.oozie.command.wf.CompletedActionXCommand: SERVER[FOO] USER[-] GROUP[-] TOKEN[] APP[-] JOB[0000027-150113223634420-oozie-oozi-W] ACTION[0000027-150113223634420-oozie-oozi-W@action-1] XException, org.apache.oozie.command.CommandException: E0800: Action it is not running its in [PREP] state, action [0000027-150113223634420-oozie-oozi-W@action-1]
>     at org.apache.oozie.command.wf.CompletedActionXCommand.eagerVerifyPrecondition(CompletedActionXCommand.java:77)
>     at org.apache.oozie.command.XCommand.call(XCommand.java:251)
>     at org.apache.oozie.service.CallableQueueService$CallableWrapper.run(CallableQueueService.java:174)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>     at java.lang.Thread.run(Thread.java:745)
> {noformat}
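
To make the race in the description concrete, here is a minimal, self-contained Java sketch. It is not the committed patch and does not use any Oozie classes: Action, Status, Result, handleCallbackStrict and handleCallbackWithRetry are all hypothetical names invented for illustration. The strict handler mirrors the behavior seen in the log (a callback that arrives while the action is still in PREP is dropped); the retry variant is only one possible mitigation, assuming a short bounded wait is acceptable.

```java
import java.util.concurrent.atomic.AtomicReference;

public class CallbackRaceSketch {
    // Hypothetical, simplified action states; the real action has more.
    enum Status { PREP, RUNNING, DONE }

    // Outcome of handling one callback.
    enum Result { PROCESSED, IGNORED }

    // Hypothetical stand-in for a stored workflow action (not an Oozie class).
    static final class Action {
        final AtomicReference<Status> status = new AtomicReference<>(Status.PREP);
    }

    // Behavior described in the issue: the callback is only accepted when the
    // action is RUNNING; a callback that arrives during PREP is dropped.
    static Result handleCallbackStrict(Action action) {
        if (action.status.compareAndSet(Status.RUNNING, Status.DONE)) {
            return Result.PROCESSED;
        }
        return Result.IGNORED;
    }

    // One possible mitigation (an assumption, not the actual fix): retry the
    // callback a bounded number of times to give the action time to reach RUNNING.
    static Result handleCallbackWithRetry(Action action, int maxAttempts, long delayMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (handleCallbackStrict(action) == Result.PROCESSED) {
                return Result.PROCESSED;
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(delayMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return Result.IGNORED;
                }
            }
        }
        return Result.IGNORED;
    }

    public static void main(String[] args) throws InterruptedException {
        // Callback arrives before the transition to RUNNING: it is dropped.
        Action early = new Action();
        System.out.println("strict: " + handleCallbackStrict(early));

        // Same race, but the handler retries while another thread finishes
        // the PREP -> RUNNING transition shortly afterwards.
        Action late = new Action();
        Thread starter = new Thread(() -> {
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            late.status.set(Status.RUNNING);
        });
        starter.start();
        System.out.println("retry: " + handleCallbackWithRetry(late, 20, 25));
        starter.join();
    }
}
```

The retry count and delay are arbitrary illustration values. In the sketch a dropped callback is simply lost, which corresponds to the situation above where completion is only noticed by the later check.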