At least in the batch of tests I examined, the ones that needed re-baselining weren't tests we'd originally failed and suddenly started passing. They were new tests that nobody had ever taken a good look at.
If that matches everyone else's experience, then all we need is an UNTRIAGED annotation in the test_expectations file to mark the ones the next Great Re-Baselining needs to examine. I'm not convinced that passing tests we used to fail, or failing tests differently, happens often enough to warrant the extra work of producing, storing, and using expected-bad results. Of course, I may be completely wrong. What did other people see in their batches of tests?

- Pam

On Fri, Aug 21, 2009 at 1:16 PM, Jeremy Orlow <[email protected]> wrote:
> On Fri, Aug 21, 2009 at 1:00 PM, Dirk Pranke <[email protected]> wrote:
>>
>> Hi all,
>>
>> As Glenn noted, we made great progress last week in rebaselining the
>> tests. Unfortunately, we don't have a mechanism to preserve the
>> knowledge we gained last week as to whether or not tests need to be
>> rebaselined, and why not. As a result, it's easy to imagine that we'd
>> need to repeat this process every few months.
>>
>> I've written up a proposal for preventing this from happening again,
>> and I think it will also help us notice more regressions in the
>> future. Check out:
>>
>> http://dev.chromium.org/developers/design-documents/handlinglayouttestexpectations
>>
>> Here's the executive summary from that document:
>>
>> We have a lot of layout test failures. For each test failure, we have
>> no good way of tracking whether or not someone has looked at the test
>> output lately, and whether or not the test output is still broken or
>> should be rebaselined. We just went through a week of rebaselining,
>> and stand a good chance of needing to do that again in a few months
>> and losing all of the knowledge that was captured last week.
>>
>> So, I propose a way to capture the current "broken" output from
>> failing tests, and to version-control it so that we can tell when a
>> test's output changes from one expected failing result to another.
>> Such a change may reflect that there has been a regression, or that
>> the bug has been fixed and the test should be rebaselined.
>>
>> Changes
>>
>> * We modify the layout test scripts to check for 'foo-bad' as well as
>> 'foo-expected'. If the output of test foo does not match
>> 'foo-expected', we then check whether it matches 'foo-bad'. If it
>> does, we treat it as we treat test failures today, except that there
>> is no need to save the failed test result (since a version of the
>> output is already checked in). Note that although "-bad" is similar
>> to a different platform, we cannot actually use a different platform,
>> since we may need up to N different "-bad" versions, one for each
>> supported platform that a test fails on.
>> * We check in a set of '*-bad' baselines based on current output from
>> the regressions. In theory, they should all be legitimate.
>> * We modify the test scripts to also report regressions from the
>> *-bad baselines. In the cases where we know the failing test is also
>> flaky or nondeterministic, we can indicate that as "NDFAIL" in
>> test_expectations to distinguish it from a regular deterministic
>> "FAIL".
>> * We modify the rebaselining tools to handle "*-bad" output as well
>> as "*-expected".
>> * Just like we require each test failure to be associated with a bug,
>> we require each "*-bad" output to be associated with a bug - normally
>> (always?) the same bug. The bug should contain comments about what
>> the difference is between the broken output and the expected output,
>> and why it's different, e.g., something like "Note that the text is
>> in two lines in the -bad output, and it should be all on the same
>> line without wrapping."
>> * The same approach can be used here to justify platform-specific
>> variances in output, if we decide to become even more picky about
>> this, but I suggest we learn to walk before we try to run.
>> * Eventually (?) we modify the layout test scripts themselves to fail
>> if the *-bad baselines aren't matched.
>>
>> Let me know what you think. If it's a thumbs-up, I'll probably
>> implement this next week. Thanks!
>
> I really like this plan. It seems easy to implement and quite useful.
> +1 from me!

--~--~---------~--~----~------------~-------~--~----~
Chromium Developers mailing list: [email protected]
View archives, change email options, or unsubscribe:
http://groups.google.com/group/chromium-dev
-~----------~----~----~----~------~----~------~--~---
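For concreteness, the lookup order Dirk's proposal describes (compare a test's output against its '-expected' baseline first, then fall back to a checked-in '-bad' baseline) might look roughly like the sketch below. The function name, file naming, and the PASS/KNOWN_FAIL/UNEXPECTED labels are invented for illustration, not the real harness API, and per-platform '-bad' variants are omitted for brevity.

```python
import os

def classify_result(test_path, actual_output):
    """Classify one test's output against its baselines (illustrative only)."""
    base, _ = os.path.splitext(test_path)
    expected = _read(base + '-expected.txt')
    if actual_output == expected:
        return 'PASS'
    # Output differs from the good baseline; see if it matches the
    # known-broken output checked in for this test.
    bad = _read(base + '-bad.txt')
    if bad is not None and actual_output == bad:
        # Known failure: no need to archive the output again, since a
        # copy is already in version control.
        return 'KNOWN_FAIL'
    # Either a brand-new failure, or a known failure whose output has
    # drifted -- which may be a regression, or a fix to rebaseline.
    return 'UNEXPECTED'

def _read(path):
    """Return the file's contents, or None if the baseline doesn't exist."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return f.read()
```

An "UNEXPECTED" result for a test that has a '-bad' baseline is exactly the signal the proposal wants to surface: the output changed from one expected failing result to something else, and a human should look at the associated bug to decide whether it regressed or got fixed.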
