Re: [webkit-dev] A post-mordem of today's tree redness

2010-04-06 Thread Adam Barth
On Mon, Apr 5, 2010 at 11:27 PM, Brent Fulgham bfulg...@gmail.com wrote:
 On Apr 5, 2010, at 9:58 PM, Adam Barth wrote:
 We had some trouble today keeping the tree green.  In this email, I
 present a post-mordem analysis of what happened and what we can learn
 from these events.  I've removed most of the names from this account
 because the purpose isn't to assign blame but to document what
 happened in the hopes that we can learn from it.

 Could rollout of patches be automated in some fashion, so that a 
 previously-green tree becoming red could trigger a rollout of the last 
 checkin?

We have the tools to do this currently, but the prevailing wisdom is
that we should have some human judgement involved in the process.  We
still have enough test flakiness that false positives could roll out
perfect good patches.  In other cases, it's clear how to fix the tree
without rolling out.

 On the other hand, I've noticed that the varying speed of the various build 
 bots makes it difficult to assess which patch might have triggered a break.  
 It's not uncommon for some machines to be several patches behind others, and 
 long test runs further exacerbate the problem.

The sheriffbot does a pretty good job of narrowing down the regression
window.  Its algorithm is somewhat robust to flaky tests and other
bits of noise.  We continue to refine it based on experience.  For
example, even during today's complex overlapping failures, it correct
computed the regression window to a set of five commits.

The main failure mode we're seeing currently is that if a test fails
80% of the time, sheriffbot will generate false positives because it
thinks that the test was fixed the one time it happens to be green.  I
have some ideas for how to handle that case, but we'll need to
experiment some more.

Adam
___
webkit-dev mailing list
webkit-dev@lists.webkit.org
http://lists.webkit.org/mailman/listinfo.cgi/webkit-dev


Re: [webkit-dev] A post-mordem of today's tree redness

2010-04-06 Thread Alexey Proskuryakov

05.04.2010, в 21:58, Adam Barth написал(а):

 Leaving failures in the tree make it difficult to track
 down subsequent failures.  Rolling out a change means more work for
 you, but less costs imposed on the rest of the project.


While I agree with your analysis for the most part, there are costs associated 
with rolling out patches that you didn't mention. Some of these are:

1) Confusion about what is going on with the project. It becomes harder to know 
what's going on by reading webkit-changes - because you can't unsee a patch you 
saw landed, and because people often roll out patches with cryptic messages 
(roll out rX, because it breaks Tiger - how are you supposed to know that 
an important change you saw landed a few hours ago isn't there any more?)

2) Confusion also happens in Bugzilla - there are several styles for dealing 
with such issues (make a new bug for rollout, or just roll out and reopen). 
People often forget to document what they were doing to fix the build, so you 
end up with a resolved bug for something that has been rolled out, or a 
reopened bug without adequate explanations. Even when the original bug is 
correctly reopened, it can be hard to figure out its history, because commit 
queue clears out flags on patches.

3) Likelihood of more world rebuilds for developers and bots. A troublesome 
patch is more likely to touch common headers than a targeted build fix, so you 
get three world rebuilds instead of one.

4) It's harder to isolate regressions if these appear and disappear several 
times (aforementioned confusion doesn't help either). Screening bugs about 
regressions also becomes more error-prone. This arguments goes both ways though 
- it's even harder to isolate regressions if the platform in question had 
broken build at the time.

- WBR, Alexey Proskuryakov

___
webkit-dev mailing list
webkit-dev@lists.webkit.org
http://lists.webkit.org/mailman/listinfo.cgi/webkit-dev


Re: [webkit-dev] A post-mordem of today's tree redness

2010-04-06 Thread Adam Barth
On Tue, Apr 6, 2010 at 9:25 AM, Alexey Proskuryakov a...@webkit.org wrote:
 05.04.2010, в 21:58, Adam Barth написал(а):
 Leaving failures in the tree make it difficult to track
 down subsequent failures.  Rolling out a change means more work for
 you, but less costs imposed on the rest of the project.

 While I agree with your analysis for the most part, there are costs 
 associated with rolling out patches that you didn't mention. Some of these 
 are:

 1) Confusion about what is going on with the project. It becomes harder to 
 know what's going on by reading webkit-changes - because you can't unsee a 
 patch you saw landed, and because people often roll out patches with cryptic 
 messages (roll out rX, because it breaks Tiger - how are you supposed to 
 know that an important change you saw landed a few hours ago isn't there any 
 more?)

I don't read webkit-changes, so I might not fully appreciate this use
case, but the way I know when things are rolled out is because we
reopen the bug and comment that patch was rolled out in a certain
revision.  If you like, we can put more information in the ChangeLogs
created by sheriffbot (such as the title of the original bug).

 2) Confusion also happens in Bugzilla - there are several styles for dealing 
 with such issues (make a new bug for rollout, or just roll out and reopen). 
 People often forget to document what they were doing to fix the build, so you 
 end up with a resolved bug for something that has been rolled out, or a 
 reopened bug without adequate explanations. Even when the original bug is 
 correctly reopened, it can be hard to figure out its history, because commit 
 queue clears out flags on patches.

This problem is solvable with tooling.  The process I recommend is as follows:

1) Open a new bug for the rollout patch and mark it blocking the main
bug.  This reduces noise on the main bug and provides a location to
discuss the failures and resolve the situation, either by landing the
rollout or not.  (Creating a new bug is already automated.)

2) When landing a rollout, reopen the original bug and comment that
the patch was rolled out and provide a link to the revision and bug.
(Currently, this step is manual, but we can automate this too.)

 3) Likelihood of more world rebuilds for developers and bots. A troublesome 
 patch is more likely to touch common headers than a targeted build fix, so 
 you get three world rebuilds instead of one.

I don't see this as much of a concern.  We can track statistics, but I
bet the build time attributable to rollouts is less than 5% of all
build time.

 4) It's harder to isolate regressions if these appear and disappear several 
 times (aforementioned confusion doesn't help either). Screening bugs about 
 regressions also becomes more error-prone. This arguments goes both ways 
 though - it's even harder to isolate regressions if the platform in question 
 had broken build at the time.

Concretely, supposed we hadn't cleaned up the Tiger bot to be green
recently.  I strongly suspect the regression caused by r57081 would
have been lost in the thought process that the Tiger bot is always
red.  Even though the regression was real and affected every
platform.  Had we noticed the problem (say) a month later, we would
have had a devil of a time tracking down the issue as evidenced by the
effort required to fix the previous ancient Tiger-only failures.

Even more concretely, the Windows bot have been red for thousands of
revisions.  Today someone (it's not important who) broke the
break-blockquote-after-delete.html test on all the bots.  He resolved
the situation by updating the expected results to make the tree green
again.  However, he did not update the Windows expected results even
though the failure diff is identical:

http://build.webkit.org/results/Windows%20Release%20(Tests)/r57153%20(10992)/editing/inserting/break-blockquote-after-delete-pretty-diff.html

That means when we finally get around to tracking down the failures
some number of months ago, we'll be mystified by this failure even
though we had the knowledge yesterday to fix the failure in a few
minutes.  I strongly suspect that if the Windows bots had started the
day green and we had a culture of keeping the green, this individual
would have made them green again and we wouldn't be accumulating a
debt of mysterious failures that will drain our productivity in the
future.

I'm not saying the rollouts are always the best solution.  For
example, updating the expected results for
break-blockquote-after-delete.html yesterday appears to have been the
proper response.  What I am saying is that we should keep the bots
green, ideally by steadfastly refusing to regress tests.

Adam
___
webkit-dev mailing list
webkit-dev@lists.webkit.org
http://lists.webkit.org/mailman/listinfo.cgi/webkit-dev


Re: [webkit-dev] A post-mordem of today's tree redness

2010-04-06 Thread Alexey Proskuryakov


On 06.04.2010, at 10:06, Adam Barth wrote:


I don't read webkit-changes, so I might not fully appreciate this use
case, but the way I know when things are rolled out is because we
reopen the bug and comment that patch was rolled out in a certain
revision.  If you like, we can put more information in the ChangeLogs
created by sheriffbot (such as the title of the original bug).


Yes, please do.

I'm definitely not the only one reading webkit-changes and ChangeLogs.


What I am saying is that we should keep the bots
green, ideally by steadfastly refusing to regress tests.



There has never been any disagreement about this, as far as I can  
tell. What seems to be an object for frequent discussions is whether  
disabling tests is an effective measure for making WebKit more secure  
and stable.


- WBR, Alexey Proskuryakov


___
webkit-dev mailing list
webkit-dev@lists.webkit.org
http://lists.webkit.org/mailman/listinfo.cgi/webkit-dev


Re: [webkit-dev] A post-mordem of today's tree redness

2010-04-06 Thread Alexey Proskuryakov


On 06.04.2010, at 10:06, Adam Barth wrote:

4) It's harder to isolate regressions if these appear and disappear  
several times (aforementioned confusion doesn't help either).  
Screening bugs about regressions also becomes more error-prone.  
This arguments goes both ways though - it's even harder to isolate  
regressions if the platform in question had broken build at the time.


Concretely, supposed we hadn't cleaned up the Tiger bot to be green
recently.  I strongly suspect the regression caused by r57081 would
have been lost in the thought process that the Tiger bot is always
red.  Even though the regression was real and affected every
platform.  Had we noticed the problem (say) a month later, we would
have had a devil of a time tracking down the issue as evidenced by the
effort required to fix the previous ancient Tiger-only failures.



I was talking about regressions not covered by regression tests. When  
there is a bug about something that works in Safari 4, but not in ToT,  
you often need to find out when exactly it broke - and if there were  
several breakage points, it gets harder. Usually not by much, but  
sometimes this can lead you astray, and then it costs a lot.


Again, this is not to say that we should never roll out patches, but  
it's another cost to consider.


- WBR, Alexey Proskuryakov

___
webkit-dev mailing list
webkit-dev@lists.webkit.org
http://lists.webkit.org/mailman/listinfo.cgi/webkit-dev


Re: [webkit-dev] A post-mordem of today's tree redness

2010-04-06 Thread Ojan Vafai
On Mon, Apr 5, 2010 at 11:35 PM, Adam Barth aba...@webkit.org wrote:

 On Mon, Apr 5, 2010 at 11:27 PM, Brent Fulgham bfulg...@gmail.com wrote:
  On Apr 5, 2010, at 9:58 PM, Adam Barth wrote:
  We had some trouble today keeping the tree green.  In this email, I
  present a post-mordem analysis of what happened and what we can learn
  from these events.  I've removed most of the names from this account
  because the purpose isn't to assign blame but to document what
  happened in the hopes that we can learn from it.
 
  Could rollout of patches be automated in some fashion, so that a
 previously-green tree becoming red could trigger a rollout of the last
 checkin?

 We have the tools to do this currently, but the prevailing wisdom is
 that we should have some human judgement involved in the process.  We
 still have enough test flakiness that false positives could roll out
 perfect good patches.  In other cases, it's clear how to fix the tree
 without rolling out.

  On the other hand, I've noticed that the varying speed of the various
 build bots makes it difficult to assess which patch might have triggered a
 break.  It's not uncommon for some machines to be several patches behind
 others, and long test runs further exacerbate the problem.


Another example of tree redness that I caused a couple weeks ago:
1. Commit a change that I thought would be green on all platforms.
2. Turned out to regress some tests on the Chromium Windows bot.
3. Committed a fix for Chromium Windows, which also fixed the WebKit Windows
bots.

The WebKit Windows bots had gotten hours behind, so the original commit had
still not run the tests on the bots. Hours later, the original commit ran
and the bot turned red, even though the fix was already committed. Hours
after that, the followup commit ran and the tests were fixed. Anyways, my
point is that it's really difficult to look at the state of the tree and say
definitively that a patch should be rolled out.

Also, I'd like to add a recommendation to improve keeping the tree green.
Make the bots cycle faster. It makes it much easier to see what commit
caused a regression and it makes it easier to tell that a fix actually
greened the tree. Not sure what this involves, but some ideas:
1. Get faster/more machines.
2. Run tests in parallel. new-run-webkit-tests does this if we can get it to
be a suitable replacement for run-webkit-tests. How much benefit we get from
parallelizing depends on the number of cores on the machine, which gets back
to recommendation 1.

Ojan
___
webkit-dev mailing list
webkit-dev@lists.webkit.org
http://lists.webkit.org/mailman/listinfo.cgi/webkit-dev


Re: [webkit-dev] A post-mordem of today's tree redness

2010-04-06 Thread Maciej Stachowiak


On Apr 6, 2010, at 10:35 AM, Alexey Proskuryakov wrote:



On 06.04.2010, at 10:06, Adam Barth wrote:


I don't read webkit-changes, so I might not fully appreciate this use
case, but the way I know when things are rolled out is because we
reopen the bug and comment that patch was rolled out in a certain
revision.  If you like, we can put more information in the ChangeLogs
created by sheriffbot (such as the title of the original bug).


Yes, please do.

I'm definitely not the only one reading webkit-changes and ChangeLogs.


I agree that more informative revert ChangeLog entries would be  
useful. My preference would be to include the following:


- revision that was reverted
- bugzilla bug URL from the ChangeLog of the reverted commit (in a  
distinct format from how we'd provide it if it was the bug for the  
rollout change itself)

- summary text from the original revert

Something like:

- Reverted rN (Rotate head 90 degrees clockwise)
http://bugs.webkit.org/rollout-bug-num

Reverted bug was: http://bugs.webkit.org/bug-that-reverted




What I am saying is that we should keep the bots
green, ideally by steadfastly refusing to regress tests.



There has never been any disagreement about this, as far as I can  
tell. What seems to be an object for frequent discussions is whether  
disabling tests is an effective measure for making WebKit more  
secure and stable.


- WBR, Alexey Proskuryakov


___
webkit-dev mailing list
webkit-dev@lists.webkit.org
http://lists.webkit.org/mailman/listinfo.cgi/webkit-dev


___
webkit-dev mailing list
webkit-dev@lists.webkit.org
http://lists.webkit.org/mailman/listinfo.cgi/webkit-dev


Re: [webkit-dev] A post-mordem of today's tree redness

2010-04-06 Thread Adam Barth
On Tue, Apr 6, 2010 at 11:43 AM, Maciej Stachowiak m...@apple.com wrote:
 On Apr 6, 2010, at 10:35 AM, Alexey Proskuryakov wrote:
 On 06.04.2010, at 10:06, Adam Barth wrote:
 I don't read webkit-changes, so I might not fully appreciate this use
 case, but the way I know when things are rolled out is because we
 reopen the bug and comment that patch was rolled out in a certain
 revision.  If you like, we can put more information in the ChangeLogs
 created by sheriffbot (such as the title of the original bug).

 Yes, please do.

 I'm definitely not the only one reading webkit-changes and ChangeLogs.

 I agree that more informative revert ChangeLog entries would be useful. My
 preference would be to include the following:

 - revision that was reverted
 - bugzilla bug URL from the ChangeLog of the reverted commit (in a distinct
 format from how we'd provide it if it was the bug for the rollout change
 itself)
 - summary text from the original revert

 Something like:

 - Reverted rN (Rotate head 90 degrees clockwise)
 http://bugs.webkit.org/rollout-bug-num

 Reverted bug was: http://bugs.webkit.org/bug-that-reverted

Okiedokes.  I'll try to do that today.

Adam
___
webkit-dev mailing list
webkit-dev@lists.webkit.org
http://lists.webkit.org/mailman/listinfo.cgi/webkit-dev


Re: [webkit-dev] A post-mordem of today's tree redness

2010-04-06 Thread Maciej Stachowiak


On Apr 6, 2010, at 11:57 AM, Adam Barth wrote:

On Tue, Apr 6, 2010 at 11:43 AM, Maciej Stachowiak m...@apple.com  
wrote:

On Apr 6, 2010, at 10:35 AM, Alexey Proskuryakov wrote:

On 06.04.2010, at 10:06, Adam Barth wrote:
I don't read webkit-changes, so I might not fully appreciate this  
use

case, but the way I know when things are rolled out is because we
reopen the bug and comment that patch was rolled out in a certain
revision.  If you like, we can put more information in the  
ChangeLogs

created by sheriffbot (such as the title of the original bug).


Yes, please do.

I'm definitely not the only one reading webkit-changes and  
ChangeLogs.


I agree that more informative revert ChangeLog entries would be  
useful. My

preference would be to include the following:

- revision that was reverted
- bugzilla bug URL from the ChangeLog of the reverted commit (in a  
distinct
format from how we'd provide it if it was the bug for the rollout  
change

itself)
- summary text from the original revert

Something like:

- Reverted rN (Rotate head 90 degrees clockwise)
http://bugs.webkit.org/rollout-bug-num

Reverted bug was: http://bugs.webkit.org/bug-that-reverted


Okiedokes.  I'll try to do that today.


By the way, just to explain why I ask for this: I often search  
ChangeLogs for the history of an issue, and searching for the bug  
number or obvious related text will not generally find entries where a  
change was rolled out. I would have to already know that the change  
was rolled out at some point, find the revision number of the original  
change, and then search for ChangeLog entries mentioning that revision  
number.


Regards,
Maciej

___
webkit-dev mailing list
webkit-dev@lists.webkit.org
http://lists.webkit.org/mailman/listinfo.cgi/webkit-dev


[webkit-dev] A post-mordem of today's tree redness

2010-04-05 Thread Adam Barth
We had some trouble today keeping the tree green.  In this email, I
present a post-mordem analysis of what happened and what we can learn
from these events.  I've removed most of the names from this account
because the purpose isn't to assign blame but to document what
happened in the hopes that we can learn from it.

== Timeline ==

04/03/10 10:27 am - r57051 lands, breaking five tests that enumerate
all the objects in the global scope.  This breakage goes undetected
because an incremental build fails to rebuild the necessary files due
to bug 37102.
04/05/10 10:00 am - r57078 lands breaking Leopard Release, Leopard
Debug, and SnowLeopard, ALSO triggering a rebuild of DOMWindow.idl,
revealing that r57051 broke multiple platforms.
10:18 am - r57080 lands into a broken tree.  The committer landing
this patch expected the change to break virtually all the bots and was
landing the change to gather expected results from platforms he didn't
have on hand.  As expected, r57080 breaks essentially all of the bots.
10:27 am - r57081 lands, apparently innocuously amid the burning tree.
10:35 am - r57082 rolls out r57078, fixing 3 bots, but the
r57051-triggered failures remain.
10:36 am - r57083 lands, adding Qt expected results for r57080.
11:24 am - r57086 lands, adding Mac expected results for r57080.
11:27 am - r57087 lands, fixing expected results for r57051.  (Further
expected result fixes follow in r57092 and r57097.)
12:28 pm - The dust settles and we realize that two tests are still
broken on Tiger.  The blame list has five patches on it.  Through some
(perhaps misguided) reasoning, we conclude that r57080 is still
responsible.
12:58 pm - We trigger a clean build of Tiger, hoping the clean build
will solve our problem.
2:33 pm - The clean rebuild rolls green.  (Yes, a clean rebuild of
Tiger takes an hour and a half.)
2:52 pm - The next Tiger cycle reveals that the broken tests are still
plaguing the Tiger bot.  We continue investigating assuming r57080 is
the culprit.
~4:00pm - Ossy mentions on #webkit that reverting r57081 locally fixes
these tests on Qt.
4:07 pm - r57100 rolls out r57081.  James Robinson continues to investigate.
~4:30 pm - Tiger (and the rest of the core builders) roll green.
8:08 pm - r57116 lands with a correct version of r57081.

(I've omitted a pixel test regression during this time period, which
was invisible to the bots and rolled out, as well as a breakage later
in the day that we corrected shortly after landing.)

== Discussion ==

Dealing with the failures this morning was particularly difficult
because there were four overlapping failures revealed in the space of
27 minutes.

1) The HTMLProgressElement failures were difficult to track down
because they sat in the tree for days undetected because of incorrect
dependencies.  Unfortunately, the failures were revealed at an
inopportune time in the midst of other failures.  However, correcting
the failures was straightforward because the expected results simply
needed to be updated.

2) The failures caused by r57078 were disentangled 35 minutes later
because the patch was rolled out.  I haven't followed up on this
change to see what the ultimate resolution of this issue was.

3) Landing r57080 into a broken tree during California business hours
seems like a questionable decision, especially given that the
committer knew the change would break essentially every bot.

4) Because the tree was on fire, it took us five and a half hours to
disentangle the failures caused by r57081 from the failures caused by
r57080.

Thanks go to Ossy for uncovering that r57081 (and not r57080) caused
the two failing tests on Tiger, and special thanks go to James
Robinson for continuing to investigate the r57081-related failures and
writing an improved version of the patch (even though the original
patch wasn't his).

== Recommendations ==

A) We should work to improve our dependencies so that breakages like
the one caused by r57051 show up immediately, when they are easier to
deal with, instead of days later in the midst of another fire-fight.

B) When landing a patch you know will break many or most bots,
consider landing the patch when you're sure the tree is green and
preferably outside of business hours in California (when the tree is
especially busy).

C) We should improve the EWS bots to produce expected results so folks
don't need to light the tree on fire to generate expected results for
platforms they don't personally have access to.

D) Instead of spending five and a half hours attempting to figure out
why r57080 caused the regressions, we should have confirmed our
hypothesis by rolling out the patch.  We could then have eliminated
r57080 from suspicion and rolled it back into the tree much more
quickly.

Rolling changes out of the tree isn't the end of the story.  We still
want the improvements in the patch, we just want them without the
regressions.  Leaving failures in the tree make it difficult to track
down subsequent failures.  Rolling out