On Mon, Sep 21, 2009 at 10:53 AM, Antony Sargent <asarg...@chromium.org> wrote:

> Do we have anything running which monitors disk free space? It seems like
> in a couple of cases over the last few months getting email alerts when a
> bot's disk is 90% full might have helped alert Sheriffs/Troopers to a
> problem earlier and possibly prevent a tree closure.


At this point the problem is that the build got bigger, and we can't fit 2
checkouts on the same machine at the same time. We are slowly replacing all
the old 30GB VMs with new ones that have 70GB.

Eventually we should try to implement some alert mechanism.
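
Something as simple as a cron job on each slave could do it. A rough sketch
(the threshold, recipient and partition list below are made up, and Windows
slaves would need something other than statvfs):

import os
import smtplib
from email.mime.text import MIMEText

THRESHOLD = 0.90                    # warn when a partition is >= 90% full
ALERT_TO = 'troopers@example.com'   # hypothetical recipient
PARTITIONS = ['/', '/b']            # hypothetical paths to watch

def fraction_used(path):
  # os.statvfs is POSIX-only; a Windows slave would need GetDiskFreeSpaceEx.
  st = os.statvfs(path)
  return 1.0 - float(st.f_bavail) / st.f_blocks

def main():
  full = []
  for path in PARTITIONS:
    used = fraction_used(path)
    if used >= THRESHOLD:
      full.append('%s is %d%% full' % (path, used * 100))
  if not full:
    return
  msg = MIMEText('\n'.join(full))
  msg['Subject'] = '[disk alert] %s' % os.uname()[1]
  msg['To'] = ALERT_TO
  smtplib.SMTP('localhost').sendmail('buildbot@localhost', [ALERT_TO],
                                     msg.as_string())

if __name__ == '__main__':
  main()

Even something that dumb would probably have flagged the out-of-disk slaves
before they broke a build.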

Nicolas


>
> On Mon, Sep 21, 2009 at 10:31 AM, Jeremy Orlow <jor...@chromium.org> wrote:
>
>> On Mon, Sep 21, 2009 at 8:25 AM, Nicolas Sylvain
>> <nsylv...@chromium.org> wrote:
>>
>>> Hi chromium-dev,
>>>   A small group of us joined forces to create a "Green Tree" task force.
>>> The goal of this task
>>> force is to make sure the tree stays green most of the time.  The 2 main
>>> pain points that
>>> we are attacking at this time are "reducing the buildbot cycle time", to
>>> catch errors earlier, and
>>> "getting rid of the flakiness", to make sure the tree does not turn red
>>> for no reason.
>>>
>>>   I'll be prepending "[Green Tree]" to the emails I send related to the
>>> task force.
>>>
>>>   You can also follow the progress and our tasks there:
>>> http://code.google.com/p/chromium/issues/list?q=label:GreenTreeTaskForce
>>>
>>> For those interested, these are the highlights of the last week:
>>>
>>> - Make sure all the tasks have bugs associated with them (pamg)
>>> - Make sure VMWare Tools is installed on all the slaves (bev / nsylvain)
>>> - Disable all services that we don't need on the slaves (bev)
>>> - Split the Windows Chromium tests across 3 slaves (maruel)
>>> - Change the gatekeeper to close the tree on more failures (maruel)
>>> - Change LKGR to care about more tests, and make it cycle faster (maruel)
>>> - Write a status page to see the cycle speed on the slaves (nsylvain)
>>> - Make sure we build only what we need on Mac (thomasvl)
>>> - Add more try bots (linux views, valgrind) (maruel)
>>> - Refactor Linux Valgrind buildbots into builder/testers. (mmoss)
>>> - Create a dashboard to see the slowest tests (phajdan)
>>> - Speed up the transfer of builds between builders/testers by reducing
>>> the compression (mmoss) (see the sketch below)
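>>>
>>> For the compression one, the idea is just to trade compression ratio for
>>> CPU time when packaging a build for transfer. A rough illustration (not
>>> the actual change; the function and arguments are made up):
>>>
>>> import tarfile
>>>
>>> def archive_build(build_dir, out_path):
>>>   # gzip level 1 compresses a bit worse than the default (9), but is much
>>>   # faster, and on a fast LAN the CPU time matters more than the archive
>>>   # size.
>>>   archive = tarfile.open(out_path, 'w:gz', compresslevel=1)
>>>   try:
>>>     archive.add(build_dir)
>>>   finally:
>>>     archive.close()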
>>>
>>>   I'm sure I forgot some, feel free to append to this list.
>>>
>>>   Despite our efforts, this was one of the worst weeks we've seen in a
>>> long time in terms of tree closures. This was caused by 5 main events:
>>>
>>>  - Buildbot maintenance went wrong. While changing a mounted drive on the
>>> buildbot master we corrupted the mount table and had to reboot the main
>>> server. We started the maintenance at 7:30AM (Pacific) and got buildbot
>>> back online shortly after 10AM. It had to cycle a little, so the tree was
>>> closed for almost 3 hours.
>>>  - A WebKit merge left some failures in the tree, and it looks like
>>> everyone left without fixing them, so the tree was closed overnight. We
>>> fixed it in the morning, but before reopening we let another WebKit merge
>>> go by, which also broke the tree and required a change on webkit.org to
>>> fix the reliability tests (IIRC). Total closure time: 20 hours.
>>>
>>
>> The more try bots we get, the better this will get.  At the very least,
>> when we check in something that upsets bots covered by try bots, we can
>> always roll it back out and triage without closing the tree.  Maybe we
>> should have one try bot for each different type of build bot?
>>
>> Btw, in case anyone is wondering what makes WebKit gardening special:
>> WebKit is a freight train that we can't stop.  If we get behind by even a
>> day, it has a serious impact on Chromium developers' ability to do a lot
>> of things (especially two-sided patches).  I'm not trying to condone the
>> 20 hour closure (I don't know the details), but if we can't quickly figure
>> out what's wrong when it breaks our stuff, we can get into a bad situation
>> very fast.
>>
>>
>>>  - A bad gclient change got checked in. Some machines stopped running
>>> "runhooks" and some bots got confused. The damage seems to have been
>>> limited.
>>>  - A second bad gclient change got checked in, this time causing all the
>>> bots to throw away their checkouts. Almost every slave had to do a full
>>> checkout (which takes an hour or so), and some of them ran out of disk
>>> space, so we had to fix them manually. The tree was closed for another
>>> couple of hours.
>>>  - A bad DEPS file got checked in, again causing a bunch of slaves to
>>> throw away their checkouts. The tree was closed for another hour or two.
>>>
>>
>> Possibly crazy idea:  Is there any way we can have a bot that does nothing
>> but update itself, which all the other bots block on?  Assuming that one
>> bot syncing the world is faster than all the bots syncing the world, it
>> would save time.  It seems like in the normal case this will go quite
>> quickly and not block things for long.  But when something's wrong with
>> gclient, DEPS, etc., it'll take a long time.  Sheriffs could then close
>> the tree and stop that bot's build before any of the other bots pick it up.
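>>
>> A rough sketch of how a slave could block on such a canary bot (the URL
>> and the helper are made up; this isn't an existing buildbot feature):
>>
>> import time
>> import urllib2
>>
>> # Hypothetical location where the canary publishes the last revision it
>> # managed to sync cleanly.
>> CANARY_URL = 'http://master.example.com/canary/last_good_sync'
>>
>> def wait_for_canary(revision, timeout=3600, poll=60):
>>   """Block until the canary reports a good sync at |revision| or newer."""
>>   deadline = time.time() + timeout
>>   while time.time() < deadline:
>>     try:
>>       last_good = int(urllib2.urlopen(CANARY_URL).read().strip())
>>       if last_good >= revision:
>>         return True
>>     except (urllib2.URLError, ValueError):
>>       pass  # canary unreachable or mid-sync; keep polling
>>     time.sleep(poll)
>>   # Canary never caught up -- probably a bad gclient or DEPS change.
>>   # Sheriffs can close the tree and stop the canary before any other
>>   # slave picks up the bad revision.
>>   return False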
>>
>>
