Patrick,

Thanks! Your test_ssh function does what I was doing in client.py, but
it is a much better solution because it can be done outside the fabric
code itself. I still find it a mystery why placing this check before
fabric starts makes such a difference. It is quite possible that it is
due to our code and not fabric. If I manage to discover anything, I'll
be sure to share.

Thanks for your help,
Matt

On Mon, Jun 14, 2010 at 6:38 PM, Patrick J McNerthney <pmcnerth...@clearpointmetrics.com> wrote:
> Matt,
>
> I use Fabric to orchestrate my EC2 instances also. What I did though was to
> create a loop that tests for "ssh connectability" before I invoke Fabric
> scripts. Very roughly copying and pasting the code, it looks something like
> this:
>
> # The instance state is "running" before entering this loop.
> while True:
>     time.sleep(1)
>     self.update() # This updates self.instance.state
>     if self.instance.state != "running":
>         raise Exception('Unexpected instance state "' +
>                         self.instance.state + '"')
>     if self._test_ssh(False):
>         break
> # Should be okay to run Fabric commands now.
>
> def _test_ssh(self, throw=True):
>     sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
>     try:
>         if "port" in self.configuration:
>             port = int(self.configuration["port"])
>         else:
>             port = 22
>         sock.settimeout(1)
>         sock.connect((self.address, port))
>         return True
>     except socket.timeout:
>         if throw:
>             raise
>     except socket.error as e:
>         if throw or e.errno != 111:
>             raise
>     finally:
>         sock.close()
>     return False
>
> HTH,
> Pat
>
> On 06/14/2010 12:16 PM, Matt Calder wrote:
>>
>> Patrick,
>>
>> I thought you were on to something there, but alas no. I get the same
>> using both DNS and IP. Both the errors without the fixes described,
>> and correct connections with the fixes.
>>
>> Matt
>>
>> On Mon, Jun 14, 2010 at 5:25 PM, Patrick J McNerthney
>> <pmcnerth...@clearpointmetrics.com> wrote:
>>
>>>
>>> Matt,
>>>
>>> Try eliminating the use of DNS, i.e.
>>> "ec2-174-129-96-241.compute-1.amazonaws.com", and instead connect directly
>>> to the IP address, i.e. 174.129.96.241, to see if that has something to do
>>> with it.
>>>
>>> Pat
>>>
>>> On 06/14/2010 11:16 AM, Matt Calder wrote:
>>>
>>>>
>>>> All,
>>>>
>>>> After much debugging I finally found a workaround. I'd like to explain
>>>> what I did in the hopes that someone might see what the underlying
>>>> problem is.
>>>>
>>>> I don't think I made this point explicit in my previous emails, but I
>>>> am using fabric as a library. For simplicity, say I have two
>>>> functions, createInstance and runStuff. The createInstance function
>>>> creates an ec2 instance (using boto) and waits for the instance's
>>>> state to be "running". The runStuff function uses fabric to run code
>>>> on the instance. So, my program looks like:
>>>>
>>>> createInstance()
>>>> runStuff()
>>>>
>>>> If I run it as is, I will get connection failures inside
>>>> fabric/network.py: connect, either a socket error or a timeout. I know
>>>> that ec2 instances can report their state as "running" but still not
>>>> be ready to take connections. So I added a sleep to my program:
>>>>
>>>> createInstance()
>>>> sleep(240)
>>>> runStuff()
>>>>
>>>> Now, four minutes may seem excessive, but with four minutes I still
>>>> get connection errors. During my investigations, I made a few
>>>> interesting observations. If I place a debugger breakpoint just after
>>>> the sleep, I can break and resume and I will not get connection
>>>> errors. If during the sleep period I ssh into the instance from a
>>>> terminal, I will not get connection errors, either in the terminal or
>>>> in the program when the sleep passes (yes, really). Lastly, if I run
>>>> just createInstance in one process and afterwards run just runStuff in
>>>> another separate process, I do not get connection errors.
>>>>
>>>> The workaround that I found has two parts. First, I removed the
>>>> sleep(240). Instead, I placed a sleep of 20 seconds in
>>>> paramiko/client.py, at the very beginning of Client.connect. Then I
>>>> added logic to fabric/network.py connect to retry on timeouts and
>>>> socket errors up to six times. With these changes, I often connect the
>>>> first time (that would include one 20 second sleep), and if not,
>>>> always the second time (in the ten or so runs I have done).
>>>>
>>>> Note that the connection errors are occurring prior to any ssh
>>>> activities; the connection attempt is just getting a socket to port 22
>>>> on the ec2 instance.
>>>>
>>>> For the record I am running Ubuntu 10.04; however, colleagues report
>>>> the same errors on Windows and MacOS.
>>>>
>>>> I hope someone can provide a reason for the behavior I have been
>>>> seeing. I don't mind the workaround, but while it works, it is not
>>>> based on any real understanding of what the problem is.
>>>>
>>>> Matt
>>>>
>>>> On Thu, Jun 10, 2010 at 8:57 PM, Patrick J McNerthney
>>>> <pmcnerth...@clearpointmetrics.com> wrote:
>>>>
>>>>>
>>>>> Try using the --disable-known-hosts command line option to see if it has
>>>>> something to do with a prior use of the same ip address.
>>>>>
>>>>> On 06/10/2010 01:19 PM, Matt Calder wrote:
>>>>>
>>>>>>
>>>>>> Jeff,
>>>>>>
>>>>>> On Thu, Jun 10, 2010 at 6:54 PM, Jeff Forcier <j...@bitprophet.org>
>>>>>> wrote:
>>>>>>
>>>>>>>
>>>>>>> Hi Matt,
>>>>>>>
>>>>>>> Paramiko doesn't have a connection cache that I'm aware of, but Fabric
>>>>>>> itself does. However, from your description it sounds like you are
>>>>>>> creating a new instance and then connecting to it, so I'm not sure why
>>>>>>> a cache would present a problem.
>>>>>>>
>>>>>>
>>>>>> I'm fairly certain fabric's cache is empty, because the code goes into
>>>>>> the network.py : connect function. The reason I suggested a "paramiko
>>>>>> cache" is that, while it is true that just after an instance goes from
>>>>>> "pending" to "running" there is a period when connections fail, that
>>>>>> period is usually very brief (< 10 sec). That is why I do a sleep(60)
>>>>>> after the startup, to give time for that to settle.
>>>>>>
>>>>>>>
>>>>>>> If you're rebooting a remote system or doing anything to alter the
>>>>>>> networking of an already-connected system, then you can force a
>>>>>>> reconnect by manipulating fabric.state.connections. For example, see
>>>>>>> what the (master-only) reboot() operation does:
>>>>>>>
>>>>>>> http://code.fabfile.org/repositories/entry/fabric/master/fabric/operations.py#L668
>>>>>>>
>>>>>>
>>>>>> I will look at that.
>>>>>>
>>>>>>>
>>>>>>> If the problem is as straightforward as it sounds, though, I'm
>>>>>>> honestly not sure what's up other than "possible Paramiko bug". Are
>>>>>>> you getting any prompts or anything when you connect to the new
>>>>>>> instance by hand?
>>>>>>>
>>>>>>
>>>>>> I can log in by hand, completely and correctly, from a terminal. I can
>>>>>> do this after the instance is started but before fabric's first run
>>>>>> call. The funny thing is, if I do log in from a terminal, the fabric
>>>>>> run command will work. So, a pseudo code timeline:
>>>>>>
>>>>>> # Version 1, this will fail, the run cannot connect to the instance.
>>>>>> startInstance()
>>>>>> sleep(60)
>>>>>> run("ls")
>>>>>>
>>>>>> # Version 2, this will succeed in running "ls" on the instance.
>>>>>> startInstance()
>>>>>> sleep(60) # During this sleep, using a terminal, I log into the instance.
>>>>>> run("ls")
>>>>>>
>>>>>> Another variation that works is:
>>>>>>
>>>>>> # Version 3, this also succeeds.
>>>>>> startInstance()
>>>>>> sleep(60)
>>>>>> <Debugger breakpoint here> Using debugger, look at variables (no changes), proceed
>>>>>> run("ls")
>>>>>>
>>>>>> It is the examples that work that shout out "threading error" or
>>>>>> "caching error" to me.
>>>>>>
>>>>>>>
>>>>>>> Another thing to try is to upgrade Paramiko to 1.7.6 if you're using
>>>>>>> the bundled 1.7.4.
>>>>>>>
>>>>>>
>>>>>> I will try that. Thanks for taking the time to help!
>>>>>>
>>>>>> Matt
>>>>>>
>>>>>>>
>>>>>>> -Jeff
>>>>>>>
>>>>>>> On Thu, Jun 10, 2010 at 5:38 PM, Matt Calder <mvcal...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>> Bruno,
>>>>>>>>
>>>>>>>> No, it is in a good group. I can log in using fabric if I restart it
>>>>>>>> and the instance is already running. I can see that fabric is inside
>>>>>>>> network.py trying to make the connection. I get one of two errors:
>>>>>>>> either timeout or low level socket error. In debugging, I added
>>>>>>>> retries to network.connect and it will fail repeatedly. First it times
>>>>>>>> out a few times, then gives the "low level socket" error. While it is
>>>>>>>> doing that, I can ssh into it from a terminal. I wonder, does paramiko
>>>>>>>> have a connection cache? Maybe it is not really retrying? Thanks for
>>>>>>>> any help.
>>>>>>>>
>>>>>>>> Matt
>>>>>>>>
>>>>>>>> On Thu, Jun 10, 2010 at 5:23 PM, Bruno Clermont
>>>>>>>> <bruno.clerm...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Is your instance in a security group that allows your IP and the port
>>>>>>>>> you're trying to connect to?
>>>>>>>>> If it times out, it's probably blocked by Amazon firewalls.
>>>>>>>>>
>>>>>>>>> On Thu, Jun 10, 2010 at 15:07, Matt Calder <mvcal...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> I am having problems using fabric with EC2 instances. I am not
>>>>>>>>>> entirely sure fabric is even the source of the problem, but I am
>>>>>>>>>> hoping someone on this list can suggest a solution or a path to
>>>>>>>>>> investigate. Here is the problem. I start an EC2 instance using boto.
>>>>>>>>>> I wait for the instance to report its state as "running". I wait an
>>>>>>>>>> additional 60 seconds after that.
>>>>>>>>>> Then I try to "run" things on the
>>>>>>>>>> instance through fabric. At that point I get:
>>>>>>>>>>
>>>>>>>>>> [ubu...@ec2-174-129-96-241.compute-1.amazonaws.com] run: ls
>>>>>>>>>>
>>>>>>>>>> Fatal error: Timed out trying to connect to ec2-174-129-96-241.compute-1.amazonaws.com
>>>>>>>>>>
>>>>>>>>>> Aborting.
>>>>>>>>>>
>>>>>>>>>> Now, the interesting thing is this. During that additional 60 second
>>>>>>>>>> wait I can log into the instance from a separate terminal, moreover,
>>>>>>>>>> when I do that separate login, the fabric login succeeds.
>>>>>>>>>>
>>>>>>>>>> Obviously, there is not a lot to go on here, but I am not entirely
>>>>>>>>>> sure what additional information would be helpful. If anyone has a
>>>>>>>>>> suggestion of what I might try to do, I would greatly appreciate it.
>>>>>>>>>> Thanks,
>>>>>>>>>>
>>>>>>>>>> Matt
>>>>>>>>>>
>>>>>>>>>> _______________________________________________
>>>>>>>>>> Fab-user mailing list
>>>>>>>>>> Fab-user@nongnu.org
>>>>>>>>>> http://lists.nongnu.org/mailman/listinfo/fab-user
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> Fab-user mailing list
>>>>>>>> Fab-user@nongnu.org
>>>>>>>> http://lists.nongnu.org/mailman/listinfo/fab-user
>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Jeff Forcier
>>>>>>> Unix sysadmin; Python/Ruby developer
>>>>>>> http://bitprophet.org
>>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> Fab-user mailing list
>>>>>> Fab-user@nongnu.org
>>>>>> http://lists.nongnu.org/mailman/listinfo/fab-user
>>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Fab-user mailing list
>>>>> Fab-user@nongnu.org
>>>>> http://lists.nongnu.org/mailman/listinfo/fab-user
>>>>>
>>>>
>>>> _______________________________________________
>>>> Fab-user mailing list
>>>> Fab-user@nongnu.org
>>>> http://lists.nongnu.org/mailman/listinfo/fab-user
>>>>
>>>
>>> _______________________________________________
>>> Fab-user mailing list
>>> Fab-user@nongnu.org
>>> http://lists.nongnu.org/mailman/listinfo/fab-user
>>>
>>
>> _______________________________________________
>> Fab-user mailing list
>> Fab-user@nongnu.org
>> http://lists.nongnu.org/mailman/listinfo/fab-user
>>
>

_______________________________________________
Fab-user mailing list
Fab-user@nongnu.org
http://lists.nongnu.org/mailman/listinfo/fab-user
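
For anyone finding this thread later: the "test for ssh connectability before invoking Fabric" idea discussed above can be sketched as a standalone Python 3 helper, roughly as below. This is only an illustration, not the code anyone in the thread actually ran. The names ssh_port_open and wait_for_ssh, the probe parameter, and the default deadline and interval values are all assumptions made up for the example. Unlike the _test_ssh function quoted above, which re-raises anything other than errno 111 (connection refused on Linux), this sketch treats every socket-level failure as "not ready yet".

```python
import socket
import time


def ssh_port_open(host, port=22, timeout=1.0):
    """Return True if a plain TCP connection to host:port succeeds.

    Hypothetical helper: it only checks that the port accepts a
    connection, it does not speak the SSH protocol.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # In Python 3, timeouts, refused connections and name
        # resolution failures are all subclasses of OSError.
        return False


def wait_for_ssh(host, port=22, deadline=300.0, interval=1.0,
                 probe=ssh_port_open):
    """Poll probe(host, port) until it returns True or deadline passes.

    Returns True as soon as a probe succeeds, False if the deadline
    (in seconds) is reached first. The probe argument exists so the
    loop can be exercised without a network.
    """
    end = time.monotonic() + deadline
    while True:
        if probe(host, port):
            return True
        if time.monotonic() >= end:
            return False
        time.sleep(interval)
```

A caller would run something like wait_for_ssh(address) after the instance reports "running" and only start Fabric tasks if it returns True. Bounding the wait with a deadline, rather than looping forever, is a design choice of this sketch; the quoted loop instead stops when the instance leaves the "running" state.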