[ https://issues.apache.org/jira/browse/OOZIE-3156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16496381#comment-16496381 ]
Andras Piros commented on OOZIE-3156:
-------------------------------------

Thanks for the new patch [~txsing]! Here is the next round of comments:

* {{SshActionExecutor#handleRetry()}}: {{sleepBeforeRetryMs /= 2;}} should rather be {{sleepBeforeRetryMs *= 2;}}
* the return value of {{SshActionExecutor#handleRetry()}} is not reused in the caller code, so we don't really get an exponential backoff - {{initWaitTime}} will always be reused
* in {{TestSshActionExecutor#testSshCheckWithHostConnectFailure()}} it's unclear to me whether {{echo "prop1=something"}} would always fail the first time. We need to inject a failure somehow to be on the safe side or, if that is already present, extract methods of the test case with appropriate names so that it's clear what's going on
* extending {{DG_SshActionExtension.twiki}} goes in the right direction. Still, we need to introduce {{oozie-default.xml#oozie.action.ssh.check.retries.max}} with the default value {{3}}, and also mention it in the docs

> SSH action status turns OK wrongly when failed to connect to host
> -----------------------------------------------------------------
>
>                 Key: OOZIE-3156
>                 URL: https://issues.apache.org/jira/browse/OOZIE-3156
>             Project: Oozie
>          Issue Type: Bug
>          Components: action
>   Affects Versions: 5.0.0
>            Reporter: TIAN XING
>            Assignee: TIAN XING
>            Priority: Major
>         Attachments: OOZIE-3156-v1.patch, OOZIE-3156-v2.patch,
> OOZIE-3156-v3.patch, ssh-check-bug.patch
>
>
> When the {{check()}} method of {{SshActionExecutor}} gets invoked, oozie will
> ssh to the host and check whether the pid of the process that the ssh action
> started is still there (by checking the return value of the command
> "{{ssh <host-ip> ps -p <pid>}}") to determine whether the ssh action has
> completed or not.
> However, we found cases where oozie fails to connect to the host during the
> action status check (e.g., the host is under heavy load, or the network is bad).
> In such cases, the return value of the command "{{ssh <host-ip> ps -p <pid>}}"
> will be 255 (ssh exits with the exit status of the remote command, or
> with 255 if an error occurred).
> According to the current logic of the method {{getActionStatus()}} in
> {{SshActionExecutor}}, the action status will be determined as OK, which may
> not be correct.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
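
For illustration only, the two backoff points from the review comment and the exit status 255 case from the issue description can be sketched together as below. This is not the patch and not Oozie code: the class {{SshCheckRetrySketch}}, the {{CheckResult}} enum, and the methods {{classify()}} and {{checkWithRetry()}} are hypothetical names, and the mapping "0 means the pid is still there, any other value except 255 means it is gone" is an assumption about the remote {{ps -p}} command. The retry limit of 3 mirrors the default proposed for {{oozie.action.ssh.check.retries.max}}.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntSupplier;
import java.util.function.LongConsumer;

// Hypothetical sketch, not part of Oozie.
public class SshCheckRetrySketch {

    // ssh exits with 255 when ssh itself hits an error (e.g. cannot connect).
    static final int SSH_ERROR_EXIT_STATUS = 255;

    enum CheckResult { RUNNING, FINISHED, CONNECT_FAILED }

    // Assumed mapping of the exit status of "ssh <host-ip> ps -p <pid>".
    static CheckResult classify(int exitStatus) {
        if (exitStatus == 0) {
            return CheckResult.RUNNING;          // assumption: ps found the pid
        }
        if (exitStatus == SSH_ERROR_EXIT_STATUS) {
            return CheckResult.CONNECT_FAILED;   // must not be reported as OK
        }
        return CheckResult.FINISHED;             // assumption: pid is gone
    }

    // Repeats the check while the connection fails, doubling the wait each time.
    // The wait lives in one local variable, so the doubled value is the one
    // actually used by the next retry (the initial wait is not reused).
    static CheckResult checkWithRetry(IntSupplier sshCheck, int maxRetries,
                                      long initWaitMs, LongConsumer sleeper) {
        long waitMs = initWaitMs;
        CheckResult result = classify(sshCheck.getAsInt());
        for (int retry = 0;
             retry < maxRetries && result == CheckResult.CONNECT_FAILED;
             retry++) {
            sleeper.accept(waitMs);
            waitMs *= 2;                         // multiply, do not divide
            result = classify(sshCheck.getAsInt());
        }
        return result;
    }

    public static void main(String[] args) {
        // Injected failure: the first two checks fail to connect, the third succeeds.
        int[] exitStatuses = {255, 255, 0};
        int[] next = {0};
        List<Long> waits = new ArrayList<>();

        // The sleeper only records the waits here instead of really sleeping.
        CheckResult result =
                checkWithRetry(() -> exitStatuses[next[0]++], 3, 100L, waits::add);

        // prints "RUNNING after waits [100, 200]"
        System.out.println(result + " after waits " + waits);
    }
}
```

Passing the check and the sleeper in as parameters is one way to inject a connection failure deterministically in a test, instead of relying on a real command failing the first time. If every attempt returns 255, the sketch gives up with {{CONNECT_FAILED}} after the configured number of retries rather than reporting the action as OK.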