On Fri, Aug 21, 2009 at 1:00 PM, Dirk Pranke <[email protected]> wrote:

>
> Hi all,
>
> As Glenn noted, we made great progress last week in rebaselining the
> tests. Unfortunately, we don't have a mechanism to preserve the
> knowledge we gained last week about whether each test still needs to
> be rebaselined, and why or why not. As a result, it's easy to imagine
> that we'd need to repeat this process every few months.
>
> I've written up a proposal for preventing this from happening again,
> and I think it will also help us notice more regressions in the
> future. Check out:
>
>
> http://dev.chromium.org/developers/design-documents/handlinglayouttestexpectations
>
> Here's the executive summary from that document:
>
> We have a lot of layout test failures. For each test failure, we have
> no good way of tracking whether or not someone has looked at the test
> output lately, and whether or not the test output is still broken or
> should be rebaselined. We just went through a week of rebaselining,
> and stand a good chance of needing to do that again in a few months
> and losing all of the knowledge that was captured last week.
>
> So, I propose a way to capture the current "broken" output from
> failing tests, and to version control them so that we can tell when a
> test's output changes from one expected failing result to another.
> Such a change may reflect that there has been a regression, or that
> the bug has been fixed and the test should be rebaselined.
>
> Changes
>
> - We modify the layout test scripts to check for 'foo-bad' as well as
>   'foo-expected'. If the output of test foo does not match
>   'foo-expected', we check whether it matches 'foo-bad'. If it does,
>   we treat it as we treat test failures today, except that there is
>   no need to save the failed test result (since a version of the
>   output is already checked in). Note that although a '-bad' baseline
>   is similar to a platform-specific baseline, we cannot actually
>   reuse the platform mechanism, since we may need up to N different
>   '-bad' versions, one for each supported platform that a test fails
>   on.
> - We check in a set of '*-bad' baselines based on the current output
>   from the regressions. In theory, they should all be legitimate.
> - We modify the test scripts to also report regressions from the
>   '*-bad' baselines. In the cases where we know the failing test is
>   also flaky or nondeterministic, we can mark it as "NDFAIL" in the
>   test expectations to distinguish it from a regular deterministic
>   "FAIL".
> - We modify the rebaselining tools to handle '*-bad' output as well
>   as '*-expected'.
> - Just as we require each test failure to be associated with a bug,
>   we require each '*-bad' output to be associated with a bug -
>   normally (always?) the same bug. The bug should contain comments
>   about what the difference is between the broken output and the
>   expected output, and why it's different, e.g., "Note that the text
>   is on two lines in the -bad output, and it should all be on the
>   same line without wrapping."
> - The same approach can be used to justify platform-specific
>   variances in output, if we decide to become even more picky about
>   this, but I suggest we learn to walk before we try to run.
> - Eventually (?) we modify the layout test scripts themselves to fail
>   if the '*-bad' baselines aren't matched.
>
> Let me know what you think. If it's a thumbs-up, I'll probably
> implement this next week. Thanks!


I really like this plan.  It seems easy to implement and quite useful.  +1
from me!
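For concreteness, here's a rough sketch of how the '-bad' matching step could work, as I read the proposal. This is hypothetical Python, not the actual run_webkit_tests code; the function name, file layout, and result labels are my own assumptions.

```python
import os

def classify_result(test_path, actual_output, read_file=None):
    """Classify a layout test's output against its checked-in baselines.

    Sketch of the proposed '-bad' check: compare against '-expected'
    first, then fall back to '-bad' for known failures.
    """
    read_file = read_file or (lambda p: open(p).read())
    base, _ = os.path.splitext(test_path)

    expected_path = base + '-expected.txt'
    if os.path.exists(expected_path) and actual_output == read_file(expected_path):
        return 'PASS'

    bad_path = base + '-bad.txt'
    if os.path.exists(bad_path):
        if actual_output == read_file(bad_path):
            # Known failure: output matches the checked-in broken
            # baseline, so there is no need to save the result again.
            return 'KNOWN_FAIL'
        # Output no longer matches the '-bad' baseline: either a new
        # regression, or the bug was fixed and the test should be
        # rebaselined.
        return 'REGRESSION_OR_FIXED'

    return 'FAIL'
```

The interesting case is the third one: a mismatch against '-bad' is exactly the signal the proposal wants to surface, since today that change would go unnoticed.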

--~--~---------~--~----~------------~-------~--~----~
Chromium Developers mailing list: [email protected] 
View archives, change email options, or unsubscribe: 
    http://groups.google.com/group/chromium-dev
-~----------~----~----~----~------~----~------~--~---