We had some trouble today keeping the tree green. In this email, I present a post-mordem analysis of what happened and what we can learn from these events. I've removed most of the names from this account because the purpose isn't to assign blame but to document what happened in the hopes that we can learn from it.
== Timeline == 04/03/10 10:27 am - r57051 lands, breaking five tests that enumerate all the objects in the global scope. This breakage goes undetected because an incremental build fails to rebuild the necessary files due to bug 37102. 04/05/10 10:00 am - r57078 lands breaking Leopard Release, Leopard Debug, and SnowLeopard, ALSO triggering a rebuild of DOMWindow.idl, revealing that r57051 broke multiple platforms. 10:18 am - r57080 lands into a broken tree. The committer landing this patch expected the change to break virtually all the bots and was landing the change to gather expected results from platforms he didn't have on hand. As expected, r57080 breaks essentially all of the bots. 10:27 am - r57081 lands, apparently innocuously amid the burning tree. 10:35 am - r57082 rolls out r57078, fixing 3 bots, but the r57051-triggered failures remain. 10:36 am - r57083 lands, adding Qt expected results for r57080. 11:24 am - r57086 lands, adding Mac expected results for r57080. 11:27 am - r57087 lands, fixing expected results for r57051. (Further expected result fixes follow in r57092 and r57097.) 12:28 pm - The dust settles and we realize that two tests are still broken on Tiger. The blame list has five patches on it. Through some (perhaps misguided) reasoning, we conclude that r57080 is still responsible. 12:58 pm - We trigger a clean build of Tiger, hoping the clean build will solve our problem. 2:33 pm - The clean rebuild rolls green. (Yes, a clean rebuild of Tiger takes an hour and a half.) 2:52 pm - The next Tiger cycle reveals that the broken tests are still plaguing the Tiger bot. We continue investigating assuming r57080 is the culprit. ~4:00pm - Ossy mentions on #webkit that reverting r57081 locally fixes these tests on Qt. 4:07 pm - r57100 rolls out r57081. James Robinson continues to investigate. ~4:30 pm - Tiger (and the rest of the core builders) roll green. 8:08 pm - r57116 lands with a correct version of r57081. (I've omitted a pixel test regression during this time period, which was invisible to the bots and rolled out, as well as a breakage later in the day that we corrected shortly after landing.) == Discussion == Dealing with the failures this morning was particularly difficult because there were four overlapping failures revealed in the space of 27 minutes. 1) The HTMLProgressElement failures were difficult to track down because they sat in the tree for days undetected because of incorrect dependencies. Unfortunately, the failures were revealed at an inopportune time in the midst of other failures. However, correcting the failures was straightforward because the expected results simply needed to be updated. 2) The failures caused by r57078 were disentangled 35 minutes later because the patch was rolled out. I haven't followed up on this change to see what the ultimate resolution of this issue was. 3) Landing r57080 into a broken tree during California business hours seems like a questionable decision, especially given that the committer knew the change would break essentially every bot. 4) Because the tree was on fire, it took us five and a half hours to disentangle the failures caused by r57081 from the failures caused by r57080. Thanks go to Ossy for uncovering that r57081 (and not r57080) caused the two failing tests on Tiger, and special thanks go to James Robinson for continuing to investigate the r57081-related failures and writing an improved version of the patch (even though the original patch wasn't his). == Recommendations == A) We should work to improve our dependencies so that breakages like the one caused by r57051 show up immediately, when they are easier to deal with, instead of days later in the midst of another fire-fight. B) When landing a patch you know will break many or most bots, consider landing the patch when you're sure the tree is green and preferably outside of business hours in California (when the tree is especially busy). C) We should improve the EWS bots to produce expected results so folks don't need to light the tree on fire to generate expected results for platforms they don't personally have access to. D) Instead of spending five and a half hours attempting to figure out why r57080 caused the regressions, we should have confirmed our hypothesis by rolling out the patch. We could then have eliminated r57080 from suspicion and rolled it back into the tree much more quickly. Rolling changes out of the tree isn't the end of the story. We still want the improvements in the patch, we just want them without the regressions. Leaving failures in the tree make it difficult to track down subsequent failures. Rolling out a change means more work for you, but less costs imposed on the rest of the project. Adam _______________________________________________ webkit-dev mailing list webkit-dev@lists.webkit.org http://lists.webkit.org/mailman/listinfo.cgi/webkit-dev