I finished being build engineer last week. Here's a summary of the things I did:
* Fixed bug #422433 (Race condition when running two ec2test instances very close together). This needed to be fixed before working on making the test suite run in parallel across several machines.

* Investigated a couple of bugs: #419421 (Buildbot: over time memory usage of the buildbot master process gets unreasonable) and #

* Got the jscheck builder running more frequently, and, after some cajoling, got it to work. Michael Hudson did the ground work for this. Fixing this problem taught me a lot about how buildbot works and how to configure it, and meant I got to look at a *lot* of source code.

* With the help of the LOSAs, got another of mwhudson's lpbuildbot branches, use-update-sourcecode, merged and rolled out. This removed quite a lot of code from lpbuildbot and replaced it with a single call to utilities/update-sourcecode.

* Landed a lpbuildbot branch, avoid-deadlock, to fix a potential problem in kill-test-pids where it could hang indefinitely. I can't tell if this has ever affected us, but it was worth a small fix to prevent it.

* Prepared a lpbuildbot branch to fix bug #455737 (PYTHONPATH should not be set when calling test_on_merge). This has been reviewed but not merged.

* Prepared a possible fix for bug #419408 (Buildbot: over time, buildbot creates zombie processes) and bug #419408 (Buildbot: over time, buildbot creates zombie processes). I think these two are related; see comment 3 in bug 419408 for an explanation of the possible culprit. There are actually two branches related to this: the fix itself, and a port to staging which rolls in the fix and other changes to the production configs. Neither has been merged, but the fix has been reviewed.

* Investigated bug 433657 (tests regularly fail on buildbot with "no space left on device"). Landed Launchpad branch log-statement-none to disable PostgreSQL statement logging (which was set to 'all') to see if that might help... but I haven't kept track of failures, so I don't actually know.
  It should be possible to go back through the build logs and figure it out. I also documented how to put statement logging back for those who want it:

    https://dev.launchpad.net/Debugging

  RT #36179 has been filed to request disk space monitoring on the slaves. It would be especially useful to get something like the disk space usage report that baobab does when a disk fills up.

* Branch ec2-buildout moves lib/devscripts to a separate place in the tree, so that it's another develop egg. The biggest driver for doing this was so that it could run with a different Python version. As of next Monday that will cease to be an issue, but I think it's still useful to treat it as a separate project. There's no need to separate it from the Launchpad tree right now, but doing so would be quite easy.

  This branch is unfortunately not quite finished; hooking in the tests to run was proving a hassle, but I think there's a way around that (using subunit, yay). Just got to do it :-/ I'm CHR next week so maybe I'll help the community by finishing this ;)

* My pet project was trying to get the test suite to split itself up and run on several machines in parallel, to reduce run time. I didn't make much tangible progress on this until the last couple of weeks of my stint - only a bus load of reading code and docs - but, with a lot of help from jml including a 2-day sprint in London, something good has come out of it.

  There's a branch in review - lp:~allenap/launchpad/ec2-parry - that both jml and I worked on, and jml has an alternative approach at lp:~jml/launchpad/dirty-parry. See the cover letter in the ec2-parry merge proposal for an idea of how it works.

  There are two outstanding issues to resolve before it'll be generally useful: security around the RPC mechanism needs tightening up, and there's a problem where workers are not running all the tests.
  I'll be dogfooding this myself to try and figure these out, but if there are any other masochists out there maybe we can squash these issues quicker than I can on my own.


Comments on being Build Engineer:

* Getting started was daunting. Suddenly having to actually know about PQM, buildbot, AWS/EC2, unittest, zope.testing, and so on, was a learning cliff-face, but a few things got me through. Figuring out the jscheck issue helped me understand buildbot and be, frankly, less scared of it. But most of all, having mwhudson and jml to talk to was probably the most reassuring thing.

* For a lot of the BE stint I was fighting little fires (with my water pistol of limited knowledge). I got an idea of what I imagine the LOSAs feel like every day :-/ (Not the water pistol bit; the fighting fires bit). I felt like I spent a lot of my time task-switching, and the lack of tangible output was a bit of a downer. Coming after mwhudson, who did a lot of build-related goodness, I put myself under a lot of pressure to make a mark. I guess it's worth reminding future Build Engineers that it's also about learning. The BuildEngineer wiki page even states as an advantage that "Knowledge about the build system is spread around the team."

* I definitely think the BE role is worth it. It's a break from the routine. I've learnt a ton that I can bring back to my normal role in Bugs. I think I've made improvements to the build side of Launchpad (though I wish I could have made more).

* Michael said in his report that "It's hard to get things done on the infrastructure in week 4!". It was difficult to get things done in the last *three* weeks of the 3.1.10 cycle because there was new hardware, U1, Karmic, and a Launchpad release. Especially when it comes to buildbot and PQM, much of the BE's role is LOSA intensive, and, as I've already fed back to Gary, I didn't feel like I had the right to push for attention from them for BE fixes (excepting show-stoppers).
* Michael also said "not being able to land branches in week 4 is a pain, even more than normal", and "... the build engineer's work is sort of sideways to the main thrust of launchpad development". It might be beneficial if the BE role was 2 weeks out of sync with the normal development cycle.

* I did do some Bugs work during my stint. It just had to be done, but it was probably <5% of my time.

* I wish I was as concise as mwhudson.

Have a good stint, stub!

Gavin.
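
A footnote on the parallel test suite item: the "workers are not running all the tests" problem is the kind of thing a small coverage check can catch early. What follows is only a minimal sketch of that idea in Python, not how ec2-parry or dirty-parry actually works. The function names (`partition`, `missing_tests`) and the round-robin split are assumptions made up for illustration.

```python
def partition(test_ids, num_workers):
    """Split test IDs into num_workers buckets, round-robin.

    Sorting first makes the split deterministic, so every machine
    that computes the partition gets the same answer.
    (Hypothetical helper; not taken from any Launchpad branch.)
    """
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    ordered = sorted(test_ids)
    return [ordered[i::num_workers] for i in range(num_workers)]


def missing_tests(test_ids, buckets):
    """Return, sorted, the test IDs that no bucket contains."""
    assigned = set()
    for bucket in buckets:
        assigned.update(bucket)
    return sorted(set(test_ids) - assigned)
```

Comparing the full list of test IDs against what the workers report having been given (or having run) would show straight away whether any tests fell through the cracks.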

