[ https://issues.apache.org/jira/browse/MESOS-1847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14164094#comment-14164094 ]
Killian Murphy commented on MESOS-1847:
---------------------------------------

I had the same issue. Adding --wait 600 worked for me; adding --wait 180 did not. When I sshed into the created VM after the failure, it took about 7-8 minutes before sshd was ready for login. The only way I could recover was to destroy the cluster and recreate it with the additional --wait option.

Here's the failure:

killian@nore ~/development/mesos/mesos-0.20.1/ec2: ./mesos_ec2.py -k kdefault -i ~/AWS/id_rsa-kdefault -s 1 launch k_mesos
Setting up security groups...
Checking for running cluster...
Launching instances...
Launched slaves, regid = r-87bd89ac
Launched master, regid = r-65bf8b4e
Waiting for instances to start up...
Waiting 60 more seconds...
Deploying files to master...
ssh: connect to host ec2-54-237-156-217.compute-1.amazonaws.com port 22: Connection refused
rsync: connection unexpectedly closed (0 bytes received so far) [sender]
rsync error: unexplained error (code 255) at /SourceCache/rsync/rsync-42/rsync/io.c(452) [sender=2.6.9]
Traceback (most recent call last):
  File "./mesos_ec2.py", line 571, in <module>
    main()
  File "./mesos_ec2.py", line 480, in main
    setup_cluster(conn, master_nodes, slave_nodes, zoo_nodes, opts, True)
  File "./mesos_ec2.py", line 334, in setup_cluster
    deploy_files(conn, "deploy." + opts.os, opts, master_nodes, slave_nodes, zoo_nodes)
  File "./mesos_ec2.py", line 445, in deploy_files
    subprocess.check_call(command, shell=True)
  File "/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 540, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'rsync -rv -e 'ssh -o StrictHostKeyChecking=no -i /Users/killian/AWS/id_rsa-kdefault' '/var/folders/8t/hp2txtm56h3byl8q5cdd33bm0000gp/T/tmp5VZqO3/' 'r...@ec2-54-237-156-217.compute-1.amazonaws.com:/'' returned non-zero exit status 255

> mesos-ec2 launch: tries to rsync before ssh is available
> --------------------------------------------------------
>
>                 Key: MESOS-1847
>                 URL: https://issues.apache.org/jira/browse/MESOS-1847
>             Project: Mesos
>          Issue Type: Bug
>          Components: ec2
>            Reporter: Kevin Matzen
>
> If you don't specify a wait time that is long enough, then wait_for_cluster
> will return once the instances have launched, but ssh will not necessarily be
> available. deploy_files will execute rsync and then possibly fail. ssh
> should be tested before continuing onto the file deployment stage. It's not
> really clear to me why opts.wait is even a thing when you can simply test for
> the availability.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
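The issue suggests testing ssh availability instead of relying on a fixed opts.wait. Below is a minimal sketch of that idea, assuming Python 3 (the traceback shows mesos_ec2.py ran under Python 2.7). The helper wait_for_ssh and its parameters are hypothetical and are not part of mesos_ec2.py. It polls TCP port 22 on a node until a connection is accepted or a deadline passes. The idea is that it would be called for each node after wait_for_cluster returns and before deploy_files runs rsync. The 600-second default simply mirrors the --wait 600 value that worked in the comment above.

```python
import socket
import time


def wait_for_ssh(host, port=22, timeout=600.0, interval=5.0, connect_timeout=3.0):
    """Hypothetical helper: poll until a TCP connection to host:port succeeds.

    Returns True as soon as a connection is accepted, or False once `timeout`
    seconds have elapsed without success. An accepted connection only shows
    that something (normally sshd) is listening on the port.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # create_connection resolves the host name and tries each address.
            with socket.create_connection((host, port), timeout=connect_timeout):
                return True
        except OSError:
            # Connection refused, connect timed out, or name not resolvable yet.
            pass
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        # Sleep before retrying, but never past the overall deadline.
        time.sleep(min(interval, remaining))
```

An open port does not guarantee that key-based login already works. A stricter variant could retry "ssh -o BatchMode=yes -o ConnectTimeout=5 HOST true" until it exits with status 0, and only then start the rsync.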