Hi all,

As Glenn noted, we made great progress last week rebaselining the tests. Unfortunately, we have no mechanism to preserve the knowledge we gained last week about which tests need to be rebaselined and why. As a result, it's easy to imagine that we'd need to repeat this process every few months.
I've written up a proposal for preventing this from happening again, and I think it will also help us notice more regressions in the future. Check out:

http://dev.chromium.org/developers/design-documents/handlinglayouttestexpectations

Here's the executive summary from that document:

We have a lot of layout test failures. For each failing test, we have no good way of tracking whether someone has looked at the output lately, and whether the output is still broken or should be rebaselined. We just went through a week of rebaselining, stand a good chance of needing to do it again in a few months, and risk losing all of the knowledge that was captured last week.

So, I propose that we capture the current "broken" output from failing tests and put it under version control, so that we can tell when a test's output changes from one expected failing result to another. Such a change may mean that there has been a regression, or that the bug has been fixed and the test should be rebaselined.

Changes:

We modify the layout test scripts to check for 'foo-bad' as well as 'foo-expected'. If the output of test foo does not match 'foo-expected', we check whether it matches 'foo-bad'. If it does, we treat it as we treat test failures today, except that there is no need to save the failed test result, since a version of the output is already checked in. Note that although "-bad" looks like just another platform, we cannot implement it as one: we may need up to N different "-bad" versions, one for each supported platform the test fails on.

We check in a set of '*-bad' baselines based on the current output from the regressions. In theory, they should all be legitimate.

We modify the harness to also report regressions from the '*-bad' baselines. Where we know a failing test is also flaky or nondeterministic, we can mark it "NDFAIL" in test expectations to distinguish it from a regular deterministic "FAIL".
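As a rough sketch of that first change (function name, return values, and the baseline lookup here are hypothetical, not actual harness code), the comparison logic might look like:

```python
def classify_result(test_name, actual_output, baselines):
    """Classify a test's output against checked-in baselines.

    baselines maps baseline filenames (e.g. 'foo-expected.txt',
    'foo-bad.txt') to their checked-in contents.
    """
    expected = baselines.get(test_name + '-expected.txt')
    bad = baselines.get(test_name + '-bad.txt')

    if expected is not None and actual_output == expected:
        return 'PASS'
    if bad is not None:
        if actual_output == bad:
            # Known failure: no need to save the failed output,
            # since a version of it is already checked in.
            return 'KNOWN_FAIL'
        # Differs from both baselines: either a new regression, or
        # the bug was fixed and the test should be rebaselined.
        return 'REGRESSION_OR_FIXED'
    return 'FAIL'
```

The up-to-N per-platform "-bad" versions would extend the same lookup with platform-specific entries before falling through to a plain FAIL.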
We modify the rebaselining tools to handle '*-bad' output as well as '*-expected'.

Just as we require each test failure to be associated with a bug, we require each '*-bad' output to be associated with a bug - normally (always?) the same bug. The bug should describe the difference between the broken output and the expected output, and why it differs, e.g. "Note that the text is on two lines in the -bad output; it should all be on one line without wrapping."

The same approach can be used to justify platform-specific variances in output, if we decide to become even more picky about this, but I suggest we learn to walk before we try to run.

Eventually (?) we modify the layout test scripts themselves to fail if the '*-bad' baselines aren't matched.

Let me know what you think. If it's a thumbs-up, I'll probably implement this next week.

Thanks!

-- Dirk
