[ 
https://issues.apache.org/jira/browse/SPARK-6900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14505979#comment-14505979
 ] 

Nicholas Chammas commented on SPARK-6900:
-----------------------------------------

Your analysis is correct.

I don't favor the option of a timeout on {{wait_for_cluster_state()}} because 
we used to have something that effectively did the same thing -- the {{--wait}} 
option -- and it was annoying. Sometimes things just take a long time to come 
up, and you have to wait.

If you really want a timeout, I suggest using the 
[{{timeout}}|http://linux.die.net/man/1/timeout] Linux utility.

{code}
timeout --foreground 30m spark-ec2 launch ...
{code}

([See here|http://stackoverflow.com/a/29662772/877069] for why the 
{{--foreground}} option is required.)

Your second suggestion is better, but I also do not favor returning 
successfully when an instance terminates unexpectedly. If the user asks for 10 
slaves, we should give them 10 slaves or fail.

It would be good to detect when instances terminate prematurely and will never 
be ready, but in that case I think spark-ec2 should error out instead of 
continuing to poll forever. I would favor such a solution.
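
For example, a fail-fast version of the wait loop might look something like 
this. This is a minimal sketch against the boto API that spark-ec2 uses; the 
function name, signature, and exit behavior are illustrative, not the actual 
{{wait_for_cluster_state()}} code:

{code}
import sys
import time


def wait_until_ssh_ready(instances, poll_interval=10):
    """Illustrative stand-in for the real wait loop: poll until every
    instance is running, but abort if any instance is terminated and
    can therefore never become ssh-ready."""
    while True:
        pending = []
        for i in instances:
            i.update()  # refresh this instance's state from EC2
            if i.state in ('shutting-down', 'terminated'):
                sys.stderr.write(
                    "Instance {0} was terminated by EC2; aborting.\n"
                    .format(i.id))
                sys.exit(1)
            if i.state != 'running':
                pending.append(i)
        if not pending:
            return  # all instances are up; proceed to the ssh checks
        time.sleep(poll_interval)
{code}

The point being: terminated instances are detected on each poll and turned 
into a hard failure, rather than being waited on forever or silently dropped.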

By the way, I want to point out that spark-ec2 was not designed for automated 
use like this. It was generally meant to be called interactively by a human, 
which is why many situations in the regular spark-ec2 workflow require human 
input.

The infinite loop, for example, was thought to be OK since a human would notice 
something was wrong and then manually fix things.

Anyway, I don't mind making changes to make spark-ec2 more automation friendly, 
but just keep this history in mind.

> spark ec2 script enters infinite loop when run-instance fails
> -------------------------------------------------------------
>
>                 Key: SPARK-6900
>                 URL: https://issues.apache.org/jira/browse/SPARK-6900
>             Project: Spark
>          Issue Type: Bug
>          Components: EC2
>    Affects Versions: 1.3.0
>            Reporter: Guodong Wang
>
> I am using the spark-ec2 scripts to launch Spark clusters in AWS.
> Recently, there were some technical issues with the EC2 service in our AWS 
> region.
> When spark-ec2 sent the run-instances requests to EC2, not all of the 
> requested instances were launched. Some instances were terminated by the 
> EC2 service before they came up.
> But the spark-ec2 script waits for all the instances to enter 'ssh-ready' 
> status, so the script enters an infinite loop, because the terminated 
> instances will never be 'ssh-ready'.
> In my opinion, it should be OK if some of the slave instances are 
> terminated. As long as the master node is running, the terminated slaves 
> should be filtered out and the cluster should still be set up.


