Do we have anything running that monitors disk free space? In a couple of
cases over the last few months, getting an email alert when a bot's disk hit
90% full might have alerted Sheriffs/Troopers to the problem earlier and
possibly prevented a tree closure.
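If not, even a small cron job on each slave might be enough. The sketch below
is the kind of thing I have in mind; the volumes, threshold, addresses, and
SMTP relay are all placeholders, not anything we actually run:

#!/usr/bin/env python3
# Hypothetical disk-space watchdog, meant to run from cron on each
# slave. Volumes, threshold, addresses, and relay are placeholders.
import shutil
import smtplib
import socket
from email.message import EmailMessage

THRESHOLD = 0.90                   # alert at 90% full
VOLUMES = ['/', '/b']              # assumes checkouts live under /b
ALERT_TO = 'sheriffs@example.com'  # placeholder alias
SMTP_RELAY = 'smtp.example.com'    # placeholder relay

def full_volumes():
    full = []
    for vol in VOLUMES:
        usage = shutil.disk_usage(vol)
        fraction = usage.used / usage.total
        if fraction >= THRESHOLD:
            full.append((vol, fraction))
    return full

def send_alert(full):
    host = socket.gethostname()
    msg = EmailMessage()
    msg['Subject'] = '[disk alert] %s: %s' % (
        host, ', '.join('%s at %.0f%%' % (v, f * 100) for v, f in full))
    msg['From'] = 'watchdog@%s' % host
    msg['To'] = ALERT_TO
    msg.set_content('Check the volume before the slave wedges itself.')
    with smtplib.SMTP(SMTP_RELAY) as server:
        server.send_message(msg)

if __name__ == '__main__':
    over = full_volumes()
    if over:
        send_alert(over)

Even a shell one-liner around df and mail would do; the point is just to hear
about a nearly-full disk before the slave wedges, as in the full-checkout
incidents below.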
On Mon, Sep 21, 2009 at 10:31 AM, Jeremy Orlow <jor...@chromium.org> wrote:
> On Mon, Sep 21, 2009 at 8:25 AM, Nicolas Sylvain <nsylv...@chromium.org> wrote:
>> Hi chromium-dev,
>>
>> A small group of us joined forces to create a "Green Tree" task force.
>> The goal of this task force is to make sure the tree stays green most
>> of the time. The two main pain points we are attacking at this time
>> are "reducing the buildbot cycle time", to catch errors earlier, and
>> "getting rid of the flakiness", to make sure the tree does not turn
>> red for no reason.
>>
>> I'll be prepending "[Green Tree]" to the emails I send related to the
>> task force.
>>
>> You can also follow our progress and tasks here:
>> http://code.google.com/p/chromium/issues/list?q=label:GreenTreeTaskForce
>>
>> For those interested, these are the highlights of the last week:
>>
>> - Make sure all the tasks have bugs associated with them (pamg)
>> - Make sure VMware Tools is installed on all the slaves (bev / nsylvain)
>> - Disable all services that we don't need on the slaves (bev)
>> - Split the Windows Chromium tests across 3 slaves (maruel)
>> - Change the gatekeeper to close the tree on more failures (maruel)
>> - Change LKGR to care about more tests, and make it cycle faster (maruel)
>> - Write a status page to show the cycle speed of the slaves (nsylvain)
>> - Make sure we build only what we need on Mac (thomasvl)
>> - Add more try bots (Linux views, Valgrind) (maruel)
>> - Refactor Linux Valgrind buildbots into builders/testers (mmoss)
>> - Create a dashboard to see the slowest tests (phajdan)
>> - Speed up the transfer of builds between builders/testers by
>>   reducing the compression (mmoss)
>>
>> I'm sure I forgot some; feel free to append to this list.
>>
>> Despite our efforts, this was one of the worst weeks we've seen in a
>> long time in terms of tree closures. This was caused by five main
>> events:
>>
>> - Buildbot maintenance went wrong. While changing a mounted drive on
>>   the buildbot master, we corrupted the mount table and had to reboot
>>   the main server. We started the maintenance at 7:30 AM (Pacific) and
>>   got the buildbot back online shortly after 10 AM. It had to cycle a
>>   little, so the tree was closed for almost 3 hours.
>> - A WebKit merge left some failures in the tree, and it looks like
>>   everyone left without fixing them, so the tree was closed overnight.
>>   We fixed it in the morning, but before reopening we let another
>>   WebKit merge go by, and it also broke the tree, requiring a change
>>   on webkit.org to fix the reliability tests (IIRC). Total closure
>>   time: 20 hours.
>
> The more try bots we get, the better this will get. At the very least,
> when we check in something that upsets bots covered by try bots, we can
> always roll it back out and triage without the tree closed. Maybe we
> should have one try bot for each different type of build bot?
>
> Btw, in case anyone is wondering what makes WebKit gardening special:
> WebKit is a freight train that we can't stop. So if we get behind by
> even a day, it has a serious impact on Chromium developers' ability to
> do a lot of stuff (especially two-sided patches). I'm not trying to
> condone the 20-hour closure (I don't know the details), but if we can't
> quickly figure out what's wrong when it breaks our stuff, we can get
> into a pretty bad situation pretty quickly.
>
>> - A bad gclient change got checked in. Some machines stopped running
>>   "runhooks" and some bots got confused. The damage seems to have been
>>   limited.
>> - A second bad gclient change got checked in, this time causing all
>>   the bots to throw away their checkouts. Almost every slave had to
>>   do a full checkout (which takes an hour or so), and some of them
>>   ran out of disk space, so we had to fix them manually. The tree was
>>   closed for another couple of hours.
>> - A bad DEPS file got checked in, again causing a bunch of slaves to
>>   throw away their checkouts. The tree was closed for another hour or
>>   two.
>
> Possibly crazy idea: is there any way we can have a bot that only
> updates itself, which all the other bots block on? Assuming that one
> bot syncing the world is faster than all the bots syncing the world,
> it would save time. In the normal case, this should go quite quickly
> and not block things for long. But when something is wrong with
> gclient, DEPS, etc., it will take a long time, and Sheriffs could then
> close the tree and stop that bot's build before any of the other bots
> pick it up.
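For what it's worth, here's roughly the shape I'd imagine that gate taking.
The endpoint and the publish step are made up; this is a sketch of the idea,
not real master/slave code:

#!/usr/bin/env python3
# Rough sketch of the canary-sync gate described above. The URL and
# the publish mechanism are hypothetical.
import subprocess
import time
import urllib.request

LAST_GOOD_URL = 'http://master.example.com/last_good_sync'  # made up

def canary_cycle():
    # The one slave that syncs first. If 'gclient sync' fails (bad
    # gclient change, bad DEPS file), nothing gets published and the
    # other slaves keep building at their current checkout.
    if subprocess.call(['gclient', 'sync']) == 0:
        stamp = str(int(time.time())).encode()
        # Hypothetical publish step; could be a POST to the master or
        # a file on shared storage.
        urllib.request.urlopen(LAST_GOOD_URL, data=stamp)

def wait_for_canary(last_seen, poll_secs=60, timeout_secs=3600):
    # Every other slave calls this before its own 'gclient sync' and
    # proceeds only once the canary has vouched for a newer sync.
    deadline = time.time() + timeout_secs
    while time.time() < deadline:
        raw = urllib.request.urlopen(LAST_GOOD_URL).read()
        stamp = int(raw.strip())
        if stamp > last_seen:
            return stamp
        time.sleep(poll_secs)
    raise RuntimeError('canary never went green; refusing to sync')

The normal-case cost is one canary sync of latency per cycle, and the payoff
is exactly the bad-gclient/bad-DEPS case above: a poisonous sync never gets
vouched for, so the other slaves never attempt it.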