I've got this on a test system. The ami was created from scratch. Looks like the master does not see a 'detached' event, even though the instance has terminated.
See log and config attached. It's "substantiating", but never "waiting for instance to start" (my guess is since it's not detached yet). Only after 20 minutes there's a timeout, the detach is forced, but the pending job never runs (I guess that's another bug). Could it be some weird TCP situation that never shuts down the connection properly? I tried tracing the code for what triggers the detached event but I got lost.. Now that I've isolated this on a test system, I'm happy to debug this further as needed with network dumps, debug prints etc.. I'm running latest eight on master and 0.8.12 on the slave. Thanks, --yuval On 04/02/16 03:42 PM, Yuval Hager wrote: > I've got a few EC2 slaves that I'd like each to run a single build and > terminate. I've set up max_builds=1 and build_wait_timeout=0 for all of them. > > They are all defined the same, except for the ami id, slave name and pwd. > They all work well, except for one. That one would only run a single job, get > terminated and never get started again. > > After running its first job, it is about to start its second. I see in the log > "substantiating slave", but I never see "waiting for instance to start". After > about 20 minutes I get a timeout from twisted, but there is still no recovery > and the build never starts, until I restart master. > > I'm running 0.8.12. I have applied the patch in > https://github.com/buildbot/buildbot/pull/1831, but it made no difference. I > also tried to implement something like > http://trac.buildbot.net/ticket/1001#comment:5, but it made no difference > either (I may have done this one wrong though). > > The only way I can get this slave to behave is to set build_wait_timeout=1. > This way everything works properly, except for the fact that it might run more > than one build before terminating. I'd like to avoid that, but it's the only > way I got this to work so far. > > Suspecting something in the master db, I've created another AMI off the first > one, given it a new name, and it behaved the same way! The only other > difference between this and the other images, is that this one is an Amazon > Linux, and the others are Ubuntu and RHEL. I can't think why it would matter > though. It takes this instance slightly longer to terminate than the others > (30-40 seconds, the others are at 25-30), but again, I can't see how it would > matter. > > I would appreciate any insight on what this might be related to, or any > suggestions on how to debug this further. > > Thanks! > > --y > _______________________________________________ > users mailing list > users@buildbot.net > https://lists.buildbot.net/mailman/listinfo/users >
from buildbot.plugins import * c = BuildmasterConfig = {} c['slaves'] = [ buildslave.EC2LatentBuildSlave('amazon-linux', 'pass', 'm3.medium', region='us-west-1', max_builds=1, build_wait_timeout=0, ami='ami-xxxxx') ] c['protocols'] = {'pb': {'port': 9989}} c['change_source'] = [] c['schedulers'] = [] c['schedulers'].append(schedulers.ForceScheduler( name="force", builderNames=["test"])) factory = util.BuildFactory() factory.addStep(steps.ShellCommand(command=["sleep", "10"])) c['builders'] = [] c['builders'].append( util.BuilderConfig(name="test", slavenames=["amazon-linux"], factory=factory)) c['status'] = [] from buildbot.status import html from buildbot.status.web import authz, auth authz_cfg=authz.Authz( auth=auth.BasicAuth([("user","pass")]), gracefulShutdown = False, forceBuild = 'auth', # use this to test your slave once it is set up forceAllBuilds = 'auth', # ..or this pingBuilder = False, stopBuild = False, stopAllBuilds = False, cancelPendingBuild = False, ) c['status'].append(html.WebStatus(http_port=8010, authz=authz_cfg)) c['title'] = "test" c['buildbotURL'] = "http://example.com:8010/" c['db'] = { 'db_url' : "sqlite:///state.sqlite"}
2016-02-07 00:31:30-0800 [-] substantiating slave <LatentSlaveBuilder builder='test' slave='amazon-linux'> 2016-02-07 00:31:30-0800 [-] EC2LatentBuildSlave amazon-linux waiting for instance i-7cc326c9 to start 2016-02-07 00:31:34-0800 [-] added buildset 14 to database 2016-02-07 00:31:50-0800 [-] EC2LatentBuildSlave amazon-linux instance i-7cc326c9 started on ec2-184-72-22-91.us-west-1.compute.amazonaws.com in about 0 minutes 20 seconds (None) 2016-02-07 00:31:56-0800 [-] added buildset 15 to database 2016-02-07 00:32:39-0800 [Broker,0,184.72.22.91] slave 'amazon-linux' attaching from IPv4Address(TCP, '184.72.22.91', 58579) 2016-02-07 00:32:39-0800 [Broker,0,184.72.22.91] Got slaveinfo from 'amazon-linux' 2016-02-07 00:32:39-0800 [Broker,0,184.72.22.91] Starting buildslave keepalive timer for 'amazon-linux' 2016-02-07 00:32:39-0800 [Broker,0,184.72.22.91] bot attached 2016-02-07 00:32:39-0800 [Broker,0,184.72.22.91] Buildslave amazon-linux attached to test 2016-02-07 00:32:39-0800 [Broker,0,184.72.22.91] Slave amazon-linux substantiated \o/ 2016-02-07 00:32:39-0800 [Broker,0,184.72.22.91] Firing amazon-linux substantiation deferred with success 2016-02-07 00:32:39-0800 [Broker,0,184.72.22.91] starting build <Build test>.. pinging the slave <LatentSlaveBuilder builder='test' slave='amazon-linux'> 2016-02-07 00:32:39-0800 [Broker,0,184.72.22.91] sending ping 2016-02-07 00:32:39-0800 [Broker,0,184.72.22.91] ping finished: success 2016-02-07 00:32:39-0800 [-] <Build test>.startBuild 2016-02-07 00:32:39-0800 [-] ShellCommand.startCommand(cmd=<RemoteShellCommand '['sleep', '10']'>) 2016-02-07 00:32:39-0800 [-] cmd.args = {'workdir': 'build', 'env': None, 'want_stderr': 1, 'logEnviron': True, 'sigtermTime': None, 'logfiles': {}, 'timeout': 1200, 'want_stdout': 1, 'usePTY': 'slave-config', 'maxTime': None, 'initial_stdin': None} 2016-02-07 00:32:39-0800 [-] <RemoteShellCommand '['sleep', '10']'>: RemoteCommand.run [0] 2016-02-07 00:32:39-0800 [-] command '['sleep', '10']' in dir 'build' 2016-02-07 00:32:49-0800 [Broker,0,184.72.22.91] <RemoteShellCommand '['sleep', '10']'> rc=0 2016-02-07 00:32:49-0800 [-] closing log <buildbot.status.logfile.LogFile instance at 0x7fad18b1f098> 2016-02-07 00:32:49-0800 [-] releaseLocks(<buildbot.steps.shell.ShellCommand object at 0x7fad18a7f810>): [] 2016-02-07 00:32:49-0800 [-] step 'shell' complete: success 2016-02-07 00:32:49-0800 [-] <Build test>: build finished 2016-02-07 00:32:49-0800 [-] setting expectations for next time 2016-02-07 00:32:49-0800 [-] new expectations: 10.0162930489 seconds 2016-02-07 00:32:49-0800 [-] releaseLocks(<EC2LatentBuildSlave 'amazon-linux'>): [] 2016-02-07 00:32:49-0800 [-] EC2LatentBuildSlave amazon-linux terminating instance i-7cc326c9 2016-02-07 00:33:20-0800 [-] EC2LatentBuildSlave amazon-linux instance i-7cc326c9 ('terminated',) after about 0 minutes 30 seconds 2016-02-07 00:33:20-0800 [-] starting build <Build test> using slave <LatentSlaveBuilder builder='test' slave='amazon-linux'> 2016-02-07 00:33:20-0800 [-] acquireLocks(slave <EC2LatentBuildSlave 'amazon-linux'>, locks []) 2016-02-07 00:33:20-0800 [-] substantiating slave <LatentSlaveBuilder builder='test' slave='amazon-linux'> 2016-02-07 00:53:20-0800 [-] disconnecting old slave amazon-linux now 2016-02-07 00:53:20-0800 [-] waiting for slave to finish disconnecting 2016-02-07 00:53:20-0800 [-] Buildslave amazon-linux detached from test 2016-02-07 00:53:20-0800 [-] while preparing slavebuilder: Traceback (most recent call last): Failure: twisted.internet.defer.TimeoutError: 2016-02-07 00:53:20-0800 [-] slave <Build test> can't build <LatentSlaveBuilder builder='test'> after all; re-queueing the request 2016-02-07 00:53:20-0800 [-] while running <function run_cleanups at 0x7fad18ada050> Traceback (most recent call last): File "/home/yuval/buildbot/sandbox/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 1184, in gotResult _inlineCallbacks(r, g, deferred) File "/home/yuval/buildbot/sandbox/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 1126, in _inlineCallbacks result = result.throwExceptionIntoGenerator(g) File "/home/yuval/buildbot/sandbox/local/lib/python2.7/site-packages/twisted/python/failure.py", line 389, in throwExceptionIntoGenerator return g.throw(self.type, self.value, self.tb) File "/home/yuval/buildbot/src/master/buildbot/process/builder.py", line 345, in _startBuildFor run_cleanups() --- <exception caught here> --- File "/home/yuval/buildbot/src/master/buildbot/process/builder.py", line 308, in run_cleanups fn() File "/home/yuval/buildbot/src/master/buildbot/process/builder.py", line 322, in <lambda> cleanups.append(lambda: slavebuilder.slave.releaseLocks()) exceptions.AttributeError: 'NoneType' object has no attribute 'releaseLocks' 2016-02-07 00:53:20-0800 [Broker,0,184.72.22.91] BuildSlave.detached(amazon-linux) 2016-02-07 00:53:20-0800 [Broker,0,184.72.22.91] releaseLocks(<EC2LatentBuildSlave 'amazon-linux'>): []
_______________________________________________ users mailing list users@buildbot.net https://lists.buildbot.net/mailman/listinfo/users