On Fri, Aug 21, 2009 at 1:00 PM, Dirk Pranke <[email protected]> wrote:
> Hi all,
>
> As Glenn noted, we made great progress last week in rebaselining the
> tests. Unfortunately, we don't have a mechanism to preserve the
> knowledge we gained last week about whether or not tests need to be
> rebaselined, and why. As a result, it's easy to imagine that we'd
> need to repeat this process every few months.
>
> I've written up a proposal for preventing this from happening again,
> and I think it will also help us notice more regressions in the
> future. Check out:
>
> http://dev.chromium.org/developers/design-documents/handlinglayouttestexpectations
>
> Here's the executive summary from that document:
>
> We have a lot of layout test failures. For each test failure, we have
> no good way of tracking whether or not someone has looked at the test
> output lately, and whether the test output is still broken or should
> be rebaselined. We just went through a week of rebaselining, and
> stand a good chance of needing to do that again in a few months and
> losing all of the knowledge that was captured last week.
>
> So, I propose a way to capture the current "broken" output from
> failing tests, and to version control it so that we can tell when a
> test's output changes from one expected failing result to another.
> Such a change may reflect that there has been a regression, or that
> the bug has been fixed and the test should be rebaselined.
>
> Changes
>
> - We modify the layout test scripts to check for 'foo-bad' as well
>   as 'foo-expected'. If the output of test foo does not match
>   'foo-expected', then we check to see if it matches 'foo-bad'. If
>   it does, then we treat it as we treat test failures today, except
>   that there is no need to save the failed test result (since a
>   version of the output is already checked in). Note that although
>   "-bad" is similar to a different platform, we cannot actually use
>   a different platform, since we need up to N different "-bad"
>   versions, one for each supported platform that a test fails on.
> - We check in a set of '*-bad' baselines based on current output
>   from the regressions. In theory, they should all be legitimate.
> - We modify the tests to also report regressions from the '*-bad'
>   baselines. In the cases where we know the failing test is also
>   flaky or nondeterministic, we can indicate that as "NDFAIL" in
>   test expectations to distinguish it from a regular deterministic
>   "FAIL".
> - We modify the rebaselining tools to handle "*-bad" output as well
>   as "*-expected".
> - Just like we require each test failure to be associated with a
>   bug, we require each "*-bad" output to be associated with a bug,
>   normally (always?) the same bug. The bug should contain comments
>   about what the difference is between the broken output and the
>   expected output, and why it's different, e.g., something like
>   "Note that the text is in two lines in the -bad output, and it
>   should be all on the same line without wrapping."
> - The same approach can be used here to justify platform-specific
>   variances in output, if we decide to become even more picky about
>   this, but I suggest we learn to walk before we try to run.
> - Eventually (?) we modify the layout test scripts themselves to
>   fail if the '*-bad' baselines aren't matched.
>
> Let me know what you think. If it's a thumbs up, I'll probably
> implement this next week. Thanks!

I really like this plan. It seems easy to implement and quite useful.
+1 from me!

--
Chromium Developers mailing list: [email protected]
View archives, change email options, or unsubscribe:
http://groups.google.com/group/chromium-dev
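The '-expected' / '-bad' matching described in the proposal above could be sketched roughly as follows. This is an illustrative sketch only, not the actual layout test script code; the function name `classify_result` and the result labels (other than "FAIL"/"NDFAIL" mentioned in the thread) are hypothetical:

```python
# Sketch of the proposed baseline-matching logic. Assumption: `baselines`
# maps the suffixes "expected" and "bad" to checked-in baseline text,
# with a missing key meaning no such baseline exists for this test.

def classify_result(actual_output, baselines):
    """Classify a layout test run against '-expected' and '-bad' baselines."""
    if actual_output == baselines.get("expected"):
        return "PASS"
    if actual_output == baselines.get("bad"):
        # Known failure: output matches the checked-in broken baseline,
        # so there is no need to save the failing result again.
        return "KNOWN_FAIL"
    if baselines.get("bad") is not None:
        # Output differs from both baselines: either a new regression,
        # or the bug was fixed and the test should be rebaselined.
        return "CHANGED_FAIL"
    return "FAIL"


baselines = {"expected": "good output", "bad": "broken output"}
print(classify_result("good output", baselines))    # PASS
print(classify_result("broken output", baselines))  # KNOWN_FAIL
print(classify_result("other output", baselines))   # CHANGED_FAIL
```

A real implementation would also need per-platform '-bad' baselines, as the proposal notes, and a separate "NDFAIL" annotation in test expectations for flaky tests rather than a baseline comparison.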
