Hi all,

As Glenn noted, we made great progress last week in rebaselining the
tests. Unfortunately, we don't have a mechanism to preserve the
knowledge we gained last week about whether each test needed to be
rebaselined, and why or why not. As a result, it's easy to imagine
that we'd need to repeat this process every few months.

I've written up a proposal for preventing this from happening again,
and I think it will also help us notice more regressions in the
future. Check out:

http://dev.chromium.org/developers/design-documents/handlinglayouttestexpectations

Here's the executive summary from that document:

We have a lot of layout test failures. For each test failure, we have
no good way of tracking whether or not someone has looked at the test
output lately, and whether or not the test output is still broken or
should be rebaselined. We just went through a week of rebaselining,
and stand a good chance of needing to do that again in a few months
and losing all of the knowledge that was captured last week.

So, I propose a way to capture the current "broken" output from
failing tests, and to version control them so that we can tell when a
test's output changes from one expected failing result to another.
Such a change may reflect that there has been a regression, or that
the bug has been fixed and the test should be rebaselined.

Changes

1. We modify the layout test scripts to check for 'foo-bad' as well as
'foo-expected'. If the output of test foo does not match
'foo-expected', we check whether it matches 'foo-bad'. If it does, we
treat it the way we treat test failures today, except that there is no
need to save the failed test result (since a version of the output is
already checked in). Note that although "-bad" behaves much like
another platform-specific baseline, we cannot actually model it as a
separate platform, since we may need up to N different "-bad"
versions, one for each supported platform on which a test fails.

2. We check in a set of '*-bad' baselines based on the current output
from the regressions. In theory, they should all be legitimate.

3. We modify the test scripts to also report regressions from the
'*-bad' baselines. Where we know a failing test is also flaky or
nondeterministic, we can mark it "NDFAIL" in the test expectations to
distinguish it from a regular deterministic "FAIL".

4. We modify the rebaselining tools to handle "*-bad" output as well
as "*-expected".

5. Just as we require each test failure to be associated with a bug,
we require each "*-bad" output to be associated with a bug - normally
(always?) the same bug. The bug should contain comments describing the
difference between the broken output and the expected output, and why
it exists, e.g., something like "Note that the text is on two lines in
the -bad output, and it should all be on one line without wrapping."

6. The same approach could be used to justify platform-specific
variances in output, if we decide to become even more picky about
this, but I suggest we learn to walk before we try to run.

7. Eventually (?) we modify the layout test scripts themselves to fail
if the '*-bad' baselines aren't matched.
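For step 3, an NDFAIL entry in the expectations file might look
something like the following. This is a mock-up - the exact syntax
(bug prefix, platform modifier, path) follows the general shape of our
current test_expectations entries, and the bug number and test path
are made up:

```
BUG12345 WIN : fast/text/example.html = NDFAIL
```

The NDFAIL keyword would tell the harness that the test has a
checked-in '-bad' baseline but its output is nondeterministic, so a
mismatch against that baseline should not by itself be flagged as a
new regression.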

Let me know what you think. If it's a thumbs-up, I'll probably
implement this next week. Thanks!

-- Dirk

--~--~---------~--~----~------------~-------~--~----~
Chromium Developers mailing list: [email protected] 
View archives, change email options, or unsubscribe: 
    http://groups.google.com/group/chromium-dev
-~----------~----~----~----~------~----~------~--~---