[ https://issues.apache.org/jira/browse/OOZIE-2126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Robert Kanter updated OOZIE-2126:
---------------------------------
    Attachment: OOZIE-2126.patch

For reference, I've attached a copy of the final patch I committed; the only change from the previous version is the formatting change.

> SSH action can be too fast for Oozie sometimes
> ----------------------------------------------
>
>                 Key: OOZIE-2126
>                 URL: https://issues.apache.org/jira/browse/OOZIE-2126
>             Project: Oozie
>          Issue Type: Bug
>          Components: action
>            Reporter: Robert Kanter
>            Assignee: Robert Kanter
>             Fix For: trunk
>
>         Attachments: OOZIE-2126.patch, OOZIE-2126.patch
>
>
> We've seen a timing problem with the SSH action where the callback comes back
> too fast, before the action has transitioned to RUNNING and is still in PREP.
> This causes Oozie to ignore the callback, which means it won't find out that
> the action completed until it manually checks (default=10min). This happened
> in an HA setup, but I think it could happen even without HA. Adding a 30
> second delay into the ssh scripts fixed the problem, but ideally we should
> come up with a better solution.
> Here's the relevant logs:
> {noformat}
> 2015-01-16 18:00:12,916 INFO org.apache.oozie.action.ssh.SshActionExecutor: SERVER[FOO] USER[foo] GROUP[-] TOKEN[] APP[${job_name}] JOB[0000027-150113223634420-oozie-oozi-W] ACTION[0000027-150113223634420-oozie-oozi-W@action-1] start() begins
> 2015-01-16 18:00:12,917 INFO org.apache.oozie.action.ssh.SshActionExecutor: SERVER[FOO] USER[foo] GROUP[-] TOKEN[] APP[${job_name}] JOB[0000027-150113223634420-oozie-oozi-W] ACTION[0000027-150113223634420-oozie-oozi-W@action-1] Attempting to copy ssh base scripts to remote host [f...@bar.com]
> 2015-01-16 18:00:15,769 INFO org.apache.oozie.servlet.CallbackServlet: SERVER[FOO] USER[-] GROUP[-] TOKEN[-] APP[-] JOB[0000027-150113223634420-oozie-oozi-W] ACTION[0000027-150113223634420-oozie-oozi-W@action-1] callback for action [0000027-150113223634420-oozie-oozi-W@action-1]
> 2015-01-16 18:00:15,774 ERROR org.apache.oozie.command.wf.CompletedActionXCommand: SERVER[FOO] USER[-] GROUP[-] TOKEN[] APP[-] JOB[0000027-150113223634420-oozie-oozi-W] ACTION[0000027-150113223634420-oozie-oozi-W@action-1] XException,
> org.apache.oozie.command.CommandException: E0800: Action it is not running its in [PREP] state, action [0000027-150113223634420-oozie-oozi-W@action-1]
>     at org.apache.oozie.command.wf.CompletedActionXCommand.eagerVerifyPrecondition(CompletedActionXCommand.java:77)
>     at org.apache.oozie.command.XCommand.call(XCommand.java:251)
>     at org.apache.oozie.service.CallableQueueService$CallableWrapper.run(CallableQueueService.java:174)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>     at java.lang.Thread.run(Thread.java:745)
> {noformat}

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
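
For readers who want to see the ordering problem from the quoted description in isolation, here is a minimal, single-threaded sketch. It is not the attached patch (the patch contents are not shown in this message) and it uses no Oozie classes: `CallbackRaceSketch`, `Action`, `Callback`, `handleStrict`, `handleWithRequeue` and `MAX_ATTEMPTS` are all hypothetical names made up for illustration. The first half replays the sequence in the logs (callback arrives while the action is still in PREP and is rejected); the second half shows one possible mitigation, assumed here purely as an example, in which an early callback is put back on a queue and retried a bounded number of times.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Illustrative sketch only; none of these types exist in Oozie.
class CallbackRaceSketch {

    enum Status { PREP, RUNNING, DONE }

    /** Hypothetical stand-in for a workflow action. */
    static final class Action {
        Status status = Status.PREP;
    }

    /** A pending callback plus the number of times it has been requeued. */
    static final class Callback {
        final Action action;
        int attempts = 0;
        Callback(Action action) { this.action = action; }
    }

    /** Assumed retry bound; chosen arbitrarily for the example. */
    static final int MAX_ATTEMPTS = 3;

    /**
     * Strict handling: a callback is applied only if the action is RUNNING.
     * Returns true if the callback was applied, false if it was dropped.
     */
    static boolean handleStrict(Action action) {
        if (action.status != Status.RUNNING) {
            return false; // dropped, comparable to the E0800 rejection in the logs
        }
        action.status = Status.DONE;
        return true;
    }

    /**
     * Tolerant handling: if the action is still in PREP, requeue the callback
     * (up to MAX_ATTEMPTS times) instead of dropping it.
     */
    static boolean handleWithRequeue(Callback cb, Queue<Callback> queue) {
        if (cb.action.status == Status.PREP && cb.attempts < MAX_ATTEMPTS) {
            cb.attempts++;
            queue.add(cb);
            return false; // not applied yet, will be retried later
        }
        return handleStrict(cb.action);
    }

    public static void main(String[] args) {
        // Case 1: the callback beats the PREP -> RUNNING transition and is lost.
        Action a = new Action();
        boolean applied = handleStrict(a); // callback arrives first
        a.status = Status.RUNNING;         // start() completes afterwards
        System.out.println("strict: applied=" + applied + ", status=" + a.status);

        // Case 2: the early callback is requeued and succeeds on the retry.
        Action b = new Action();
        Queue<Callback> queue = new ArrayDeque<>();
        handleWithRequeue(new Callback(b), queue); // early, so requeued
        b.status = Status.RUNNING;                 // start() completes
        boolean appliedLater = handleWithRequeue(queue.remove(), queue);
        System.out.println("requeue: applied=" + appliedLater + ", status=" + b.status);
    }
}
```

In the strict case the action is left in RUNNING with nobody aware that it has finished, which corresponds to waiting for the periodic check mentioned in the description. The requeue variant trades that wait for a short bounded retry; whether that is the right trade-off in a real server is a separate question.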