[ 
https://issues.apache.org/jira/browse/MESOS-1847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14164094#comment-14164094
 ] 

Killian Murphy commented on MESOS-1847:
---------------------------------------

I had the same issue.

Adding --wait 600 worked for me; adding --wait 180 did not. When I tried to ssh
into the created VM after the failure, it took about 7-8 minutes before sshd
was ready for login.
The only way I could recover was to destroy the cluster and recreate it with
the additional --wait option.

Here's the failure:

killian@nore ~/development/mesos/mesos-0.20.1/ec2: ./mesos_ec2.py -k kdefault -i ~/AWS/id_rsa-kdefault -s 1 launch k_mesos
Setting up security groups...
Checking for running cluster...
Launching instances...
Launched slaves, regid = r-87bd89ac
Launched master, regid = r-65bf8b4e
Waiting for instances to start up...
Waiting 60 more seconds...
Deploying files to master...
ssh: connect to host ec2-54-237-156-217.compute-1.amazonaws.com port 22: Connection refused
rsync: connection unexpectedly closed (0 bytes received so far) [sender]
rsync error: unexplained error (code 255) at /SourceCache/rsync/rsync-42/rsync/io.c(452) [sender=2.6.9]
Traceback (most recent call last):
  File "./mesos_ec2.py", line 571, in <module>
    main()
  File "./mesos_ec2.py", line 480, in main
    setup_cluster(conn, master_nodes, slave_nodes, zoo_nodes, opts, True)
  File "./mesos_ec2.py", line 334, in setup_cluster
    deploy_files(conn, "deploy." + opts.os, opts, master_nodes, slave_nodes, zoo_nodes)
  File "./mesos_ec2.py", line 445, in deploy_files
    subprocess.check_call(command, shell=True)
  File "/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 540, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'rsync -rv -e 'ssh -o StrictHostKeyChecking=no -i /Users/killian/AWS/id_rsa-kdefault' '/var/folders/8t/hp2txtm56h3byl8q5cdd33bm0000gp/T/tmp5VZqO3/' 'r...@ec2-54-237-156-217.compute-1.amazonaws.com:/'' returned non-zero exit status 255



> mesos-ec2 launch: tries to rsync before ssh is available
> --------------------------------------------------------
>
>                 Key: MESOS-1847
>                 URL: https://issues.apache.org/jira/browse/MESOS-1847
>             Project: Mesos
>          Issue Type: Bug
>          Components: ec2
>            Reporter: Kevin Matzen
>
> If you don't specify a wait time that is long enough, then wait_for_cluster 
> will return once the instances have launched, but ssh will not necessarily be 
> available. deploy_files will then execute rsync and possibly fail. ssh 
> should be tested before continuing on to the file deployment stage. It's not 
> really clear to me why opts.wait is even a thing when you can simply test for 
> the availability.
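
For illustration, the availability test suggested above could look roughly
like the sketch below. This is not code from mesos_ec2.py: the function name
wait_for_ssh and its parameters (port, timeout, interval) are hypothetical,
and it only checks that the TCP port accepts connections, which is a
reasonable proxy for the "Connection refused" failure but does not prove that
sshd will accept a login yet.

```python
import socket
import time


def wait_for_ssh(host, port=22, timeout=900, interval=10):
    """Poll host:port until a TCP connection succeeds or `timeout` seconds pass.

    Hypothetical helper, not part of mesos_ec2.py. Returns True as soon as
    the port accepts a connection, False if the deadline passes first.
    """
    deadline = time.time() + timeout
    while True:
        try:
            # create_connection raises on refusal or on the 5 second timeout.
            sock = socket.create_connection((host, port), timeout=5)
            sock.close()
            return True
        except (socket.error, socket.timeout):
            if time.time() >= deadline:
                return False
            time.sleep(interval)
```

Under these assumptions, a launcher could call wait_for_ssh on the master's
public DNS name before starting rsync and abort with a clear message if it
returns False, instead of relying on a fixed --wait value.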



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)