[ 
https://issues.apache.org/jira/browse/SPARK-5851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14324694#comment-14324694
 ] 

Nicholas Chammas commented on SPARK-5851:
-----------------------------------------

Yeah, that's a good catch. Have you run into an issue with this btw?

Since we confirm that SSH is available on all nodes of the cluster before doing 
anything, we generally shouldn't run into a case where the call to {{ssh()}} 
fails because of connectivity. The retry behavior is a remnant from when we 
didn't wait on SSH availability explicitly and relied on the {{--wait}} option 
to spark-ec2.

So actually, it might be better if this method never retried anything.

That said, {{man ssh}} reveals:

{code}
     ssh exits with the exit status of the remote command
     or with 255 if an error occurred.
{code}

So we can probably replace [this 
line|https://github.com/apache/spark/blob/d8f69cf78862d13a48392a0b94388b8d403523da/ec2/spark_ec2.py#L959-L961]
 with {{subprocess.Popen}} and explicitly check the return code. If it's 255, 
ssh itself failed, so maybe retry. Otherwise the remote command failed, so 
bubble up the error.
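
A rough sketch of what that could look like, not the actual {{spark_ec2.py}} code. The names {{run_with_retry}}, {{tries}} and {{delay}} are made up for illustration, and it takes a full argument list so it can be tried out without a cluster:

```python
import subprocess
import sys
import time

# Per `man ssh`: ssh exits with 255 if an error occurred in ssh itself.
SSH_ERROR = 255


def run_with_retry(argv, tries=5, delay=0.0):
    """Run argv, retrying only when the exit status is 255.

    Hypothetical helper: argv would normally be the full ssh command line,
    e.g. ["ssh", "-t", "user@host", "some command"].
    Returns the number of attempts that were needed.
    """
    for attempt in range(1, tries + 1):
        proc = subprocess.Popen(argv)
        returncode = proc.wait()
        if returncode == 0:
            return attempt
        if returncode != SSH_ERROR:
            # The remote command itself failed: do not retry, bubble it up.
            raise subprocess.CalledProcessError(returncode, argv)
        if attempt < tries:
            sys.stderr.write(
                "Exit status 255; retrying (attempt %d of %d)\n" % (attempt, tries)
            )
            time.sleep(delay)
    raise RuntimeError("Gave up after %d attempts with exit status 255" % tries)
```

One caveat: a remote command that itself exits with 255 can't be told apart from an ssh failure this way, so it would still be retried.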

cc [~shivaram]

> spark_ec2.py ssh failure retry handling not always appropriate
> --------------------------------------------------------------
>
>                 Key: SPARK-5851
>                 URL: https://issues.apache.org/jira/browse/SPARK-5851
>             Project: Spark
>          Issue Type: Bug
>          Components: EC2
>            Reporter: Florian Verhein
>            Priority: Minor
>
> The following function doesn't distinguish between ssh itself failing (e.g. 
> a connection issue) and the remote command that it executes failing (e.g. 
> setup.sh). The latter should probably not result in a retry. Perhaps the 
> number of tries could be an argument that is set to 1 for certain usages.
>
> # Run a command on a host through ssh, retrying up to five times
> # and then throwing an exception if ssh continues to fail.
> spark-ec2: [{{def ssh(host, opts, 
> command)}}|https://github.com/apache/spark/blob/d8f69cf78862d13a48392a0b94388b8d403523da/ec2/spark_ec2.py#L953-L975]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
