On Mon, Sep 21, 2009 at 8:25 AM, Nicolas Sylvain <nsylv...@chromium.org> wrote:

> Hi chromium-dev,
>   A small group of us joined forces to create a "Green Tree" task force.
> The goal of this task force is to make sure the tree stays green most of
> the time.  The 2 main pain points that we are attacking at this time are
> "reducing the buildbot cycle time", to catch errors earlier, and "getting
> rid of the flakiness", to make sure the tree does not turn red for no
> reason.
>
>   I'll be prepending "[Green Tree]" to the emails I send related to the
> task force.
>
>   You can also follow the progress and our tasks here:
> http://code.google.com/p/chromium/issues/list?q=label:GreenTreeTaskForce
>
> For those interested, these are the highlights of the last week:
>
> - Make sure all the tasks have bugs associated with them (pamg)
> - Make sure VMWare Tools is installed on all the slaves (bev / nsylvain)
> - Disable all services that we don't need on the slaves (bev)
> - Split the windows chromium tests in 3 slaves (maruel)
> - Change the gatekeeper to close the tree on more failures (maruel)
> - Change LKGR to care about more tests, and make it cycle faster (maruel)
> - Write a status page to see the cycle speed on the slaves (nsylvain)
> - Make sure we build only what we need on Mac (thomasvl)
> - Add more try bots (linux views, valgrind) (maruel)
> - Refactor Linux Valgrind buildbots into builder/testers. (mmoss)
> - Create a dashboard to see the slowest tests (phajdan)
> - Speed up the transfer of builds between builders/testers by reducing the
> compression (mmoss)
>
>   I'm sure I forgot some, feel free to append to this list.
>
>   Despite our efforts, this was one of the worst weeks we've seen in a long
> time in terms of tree closures. This was caused by 5 main events:
>
>  - Buildbot maintenance went wrong. By changing a mounted drive on the
> buildbot master, the mount table got corrupted, and we had to reboot the
> main server. We started the maintenance at 7:30AM (pacific) and we got the
> buildbot back online shortly after 10AM. It had to cycle a little, so the
> tree was closed for almost 3 hours.
>  - A WebKit merge left some failures in the tree, and it looks like
> everyone left without fixing them, so the tree was closed overnight. We
> fixed it in the morning, but before reopening we let another WebKit merge
> go by, and it also broke the tree, requiring a change on webkit.org to fix
> the reliability tests (IIRC). Total closure time: 20 hours.
>

The more try bots we get, the better this will get.  At the very least, when
we check in something that upsets bots covered by try bots, we can always
roll back out and triage without the tree closed.  Maybe we should have one
try bot for each different type of build bot?

Btw, in case anyone is wondering what makes WebKit gardening special: WebKit
is a freight train that we can't stop.  If we get behind by even a day, it
has a serious impact on Chromium developers' ability to do a lot of stuff
(especially two-sided patches).  I'm not trying to condone the 20 hour
closure (I don't know the details), but if we can't figure out what's wrong
quickly (when it breaks our stuff) we can get into a bad situation very
fast.


>  - A bad gclient change got checked in. Some machines stopped running
> "runhooks" and some bots got confused. The damage seems to have been
> limited.
>  - A second bad gclient change got checked in, this time causing all the
> bots to throw away their checkouts. Almost every slave had to do a full
> checkout (which takes an hour or so), and some of them ran out of disk
> space, so we had to fix them manually. The tree was closed for another
> couple of hours.
>  - A bad DEPS file got checked in, again causing a bunch of slaves to throw
> away their checkouts. The tree was closed for another hour or two.
>

Possibly crazy idea:  Is there any way we can have a bot that only updates
itself, and that all the other bots block on?  Assuming that one bot syncing
the world is faster than all the bots syncing the world, it would save time.
In the normal case, this should go quite quickly and not block things for
long.  But when something's wrong with gclient, DEPS, etc., it'll take a long
time.  Sheriffs could then close the tree and stop that bot's build before
any of the other bots pick it up.
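
To make the idea a bit more concrete, here is a rough sketch of the gating
logic in Python.  It is not how buildbot or gclient actually work; every name
in it (SyncResult, run_canary_sync, release_to_bots, schedule, and the sync
callback) is made up for illustration.  The assumption is that the "canary"
bot can report whether its sync succeeded and how long it took, and that an
unusually slow sync is a good enough signal that something is wrong.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class SyncResult:
    revision: int
    ok: bool
    seconds: float


def run_canary_sync(revision: int,
                    sync: Callable[[int], bool],
                    clock: Callable[[], float] = time.monotonic) -> SyncResult:
    """Run the sync step on the canary bot alone and time it.

    `sync` is a hypothetical callback standing in for the real update step.
    """
    start = clock()
    try:
        ok = sync(revision)
    except Exception:
        ok = False  # a crashing sync counts as a failed sync
    return SyncResult(revision, ok, clock() - start)


def release_to_bots(result: SyncResult, max_seconds: float) -> bool:
    """Other bots may pick up the revision only if the canary synced
    successfully and within the expected time."""
    return result.ok and result.seconds <= max_seconds


def schedule(revisions: List[int],
             sync: Callable[[int], bool],
             max_seconds: float,
             clock: Callable[[], float] = time.monotonic
             ) -> Tuple[List[int], Optional[int]]:
    """Return (released, held).

    Stops at the first revision the canary rejects, so that a sheriff can
    close the tree before any other bot picks that revision up.
    """
    released: List[int] = []
    for rev in revisions:
        result = run_canary_sync(rev, sync, clock)
        if not release_to_bots(result, max_seconds):
            return released, rev
        released.append(rev)
    return released, None
```

For example, schedule([1, 2, 3, 4], lambda r: r != 3, 60.0) releases
revisions 1 and 2 and holds revision 3, and revision 4 is never tried.  The
time limit is the part I'm least sure about: set it too tight and the canary
itself becomes a source of flakiness.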

--~--~---------~--~----~------------~-------~--~----~
Chromium Developers mailing list: chromium-dev@googlegroups.com 
View archives, change email options, or unsubscribe: 
    http://groups.google.com/group/chromium-dev
-~----------~----~----~----~------~----~------~--~---
