Do we have anything running which monitors disk free space? It seems like in
a couple of cases over the last few months getting email alerts when a bot's
disk is 90% full might have helped alert Sheriffs/Troopers to a problem
earlier and possibly prevent a tree closure.
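
As a rough illustration, here is the kind of check that could run from cron on
each slave and mail the sheriffs when a volume crosses the threshold. The path,
threshold, recipient, and SMTP host below are made up for the example, not
what the bots actually use:

  #!/usr/bin/env python
  # Hypothetical cron job: warn the sheriffs when a slave's checkout disk
  # passes a usage threshold.  Path, threshold, and addresses are examples.
  import os
  import smtplib
  import socket
  from email.mime.text import MIMEText

  THRESHOLD = 0.90                     # alert at 90% full
  PATH = '/b'                          # assumed slave checkout volume
  ALERT_TO = 'sheriffs@example.com'    # placeholder address
  SMTP_HOST = 'localhost'

  def disk_usage_fraction(path):
      """Return the fraction of the filesystem at |path| that is in use."""
      st = os.statvfs(path)
      total = st.f_blocks * st.f_frsize
      free = st.f_bavail * st.f_frsize
      return 1.0 - float(free) / total

  def main():
      used = disk_usage_fraction(PATH)
      if used < THRESHOLD:
          return
      host = socket.gethostname()
      msg = MIMEText('%s: %s is %.0f%% full' % (host, PATH, used * 100))
      msg['Subject'] = '[disk alert] %s' % host
      msg['From'] = 'buildbot@%s' % host
      msg['To'] = ALERT_TO
      server = smtplib.SMTP(SMTP_HOST)
      server.sendmail(msg['From'], [ALERT_TO], msg.as_string())
      server.quit()

  if __name__ == '__main__':
      main()

A proper monitoring system would obviously be better, but even a dumb cron job
like this would get the email out before a bot fills its disk mid-build.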

On Mon, Sep 21, 2009 at 10:31 AM, Jeremy Orlow <jor...@chromium.org> wrote:

> On Mon, Sep 21, 2009 at 8:25 AM, Nicolas Sylvain <nsylv...@chromium.org> wrote:
>
>> Hi chromium-dev,
>>   A small group of us joined forces to create a "Green Tree" task force.
>> The goal of this task
>> force is to make sure the tree stays green most of the time.  The 2 main
>> pain points that
>> we are attacking at this time are "reducing the buildbot cycle time", to
>> catch errors earlier, and
>> "getting rid of the flakiness", to make sure the tree does not turn red
>> for no reason.
>>
>>   I'll be prepending "[Green Tree]" to the emails I send related to the
>> task force.
>>
>>   You can also follow the progress and our tasks there:
>> http://code.google.com/p/chromium/issues/list?q=label:GreenTreeTaskForce
>>
>> For those interested, these are the highlights of the last week:
>>
>> - Make sure all the tasks have bugs associated with them (pamg)
>> - Make sure VMWare Tools is installed on all the slaves (bev / nsylvain)
>> - Disable all services that we don't need on the slaves (bev)
>> - Split the Windows chromium tests across 3 slaves (maruel)
>> - Change the gatekeeper to close the tree on more failures (maruel)
>> - Change LKGR to care about more tests, and make it cycle faster (maruel)
>> - Write a status page to see the cycle speed on the slaves (nsylvain)
>> - Make sure we build only what we need on Mac (thomasvl)
>> - Add more try bots (linux views, valgrind) (maruel)
>> - Refactor Linux Valgrind buildbots into builder/testers. (mmoss)
>> - Create a dashboard to see the slowest tests (phajdan)
>> - Speed up the transfer of builds between builders/testers by reducing the
>> compression (mmoss)
>>
>>   I'm sure I forgot some, feel free to append to this list.
>>
>>   Despite our efforts, this was one of the worst weeks we've seen in a long
>> time in terms of tree closures. This
>> was caused by 5 main events:
>>
>>  - Buildbot maintenance went wrong. While we were changing a mounted drive
>> on the buildbot master, the mount table got corrupted, and we had to reboot
>> the main server. We started the maintenance at 7:30AM (Pacific) and got the
>> buildbot back online shortly after 10AM. It had to cycle a little, so the
>> tree was closed for almost 3 hours.
>>  - A WebKit merge left some failures in the tree, and it looks like everyone
>> left without fixing them, so the tree was closed overnight. We fixed it in
>> the morning, but before reopening we let another WebKit merge go by, which
>> also broke the tree and required a change on webkit.org to fix the
>> reliability tests (IIRC). Total closure time: 20 hours.
>>
>
> The more try bots we get, the better this will get.  At the very least,
> when we check in something that upsets bots covered by try bots, we can
> always roll it back out and triage without the tree being closed.  Maybe we
> should have one try bot for each different type of build bot?
>
> Btw, in case anyone is wondering what makes WebKit gardening special:
> WebKit is a freight train that we can't stop.  And so, if we get behind by
> even a day, it has a serious impact on Chromium developers' ability to do a
> lot of stuff (especially 2 sided patches).  I'm not trying to condone the 20
> hour closure (I don't know the details), but if we can't figure out what's
> wrong quickly (when it breaks our stuff) we can get into a pretty bad
> situation pretty quickly.
>
>
>>  - A bad gclient change got checked in. Some machines stopped running
>> "runhooks" and some bots got confused. The damage seems to have been
>> limited.
>>  - A second bad gclient change got checked in. This time it caused all the
>> bots to throw away their checkouts. Almost every slave had to do a full
>> checkout (which takes an hour or so), and some of them ran out of disk
>> space, so we had to fix them manually. The tree was closed for another
>> couple of hours.
>>  - A bad DEPS file got checked in, again causing a bunch of slaves to throw
>> away their checkouts. The tree was closed for another hour or two.
>>
>
> Possibly crazy idea:  Is there any way we can have a bot that only updates
> itself that all the other bots block on?  Assuming that one bot syncing the
> world is faster than all the bots syncing the world, it would save time.  It
> seems like in the normal case, this will go quite quickly and not block
> things for long.  But, when something's wrong with gclient, DEPS, etc it'll
> take a long time.  Sheriffs could then close the tree and stop that bot's
> build before any of the other bots pick it up.
>
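
To make the blocking idea above a bit more concrete, here is a rough sketch
(in the era's Python 2) of how regular slaves could gate on a dedicated
sync-only canary bot before running their own gclient sync. The status URL,
polling interval, and revision format are all invented for the example:

  # Sketch of the "canary sync" gate: regular slaves poll a status URL that
  # a dedicated sync-only bot updates after each successful gclient sync,
  # and refuse to sync past a revision the canary has not reached yet.
  import time
  import urllib2

  CANARY_URL = 'http://build.example.com/canary/last_synced_rev'  # hypothetical
  POLL_SECONDS = 30
  TIMEOUT_SECONDS = 60 * 60   # give up (and page the sheriff) after an hour

  def canary_revision():
      """Return the last revision the canary bot synced successfully."""
      return int(urllib2.urlopen(CANARY_URL).read().strip())

  def wait_for_canary(target_rev):
      """Block until the canary has synced to target_rev or newer."""
      deadline = time.time() + TIMEOUT_SECONDS
      while time.time() < deadline:
          if canary_revision() >= target_rev:
              return True
          time.sleep(POLL_SECONDS)
      return False  # canary is stuck: likely a bad gclient/DEPS change

  # A slave would call wait_for_canary(rev) right before its own
  # 'gclient sync --revision rev' step; if it returns False, the change is
  # probably bad and the build should stop rather than wipe its checkout.

If the canary never reaches the target revision, the other slaves never start
throwing away their checkouts, which is exactly the failure mode in the two
gclient incidents and the DEPS one described above.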

