Re: [webkit-dev] handling failing tests (test_expectations, Skipped files, etc.)
There is a significant practical problem with turning the tree red and working with someone to rebaseline the tests. It takes multiple hours for some bots to build and test a given patch. That means that, at any moment, you will have maybe tens and in some cases hundreds of failing tests associated with some changelist that you need to track on the bots. You might have more failing tests associated with a different changelist, and so on.

How do you propose to track which baselines have been computed by which bots for which changelists? In the past I have seen failing tests because baselines were checked in too soon and some bot had not yet generated new baselines. Things are even worse when a change goes in late in the day. How will the pending rebaselines be handed off to the new gardener?

As the Chromium owner of tests that are frequently broken by third-party contributors (they do their best to help us, but cannot generate all 15 or so Chromium expectations), I would much rather have a bug filed against me and a line in the expectations file. At least I then have visibility into what's happening and a chance to verify the results. In a recent pass to clean up expectations, we found multiple cases where the wrong result had been committed because the person who did it failed to realize the result was wrong. While you might say "fix the test to make it obvious", it is not obvious how to do that for all tests.

Why not simply attach an owner and a resolution date to each expectation? The real problem right now is accountability and a way to remind people that they have left expectations hanging.

Cheers,
Stephen.

On Mon, Apr 9, 2012 at 9:19 PM, Julien Chaffraix <jchaffr...@webkit.org> wrote:

> If there's consensus in the meantime that it is better on balance to check in suppressions, perhaps we can figure out a better way to do that. Maybe (shudder) a second test_expectations file? Or maybe it would be better to actually check in suppressions marked as REBASELINE (or something like that)?
>> That sounds quirky, as it involves maintaining 2 sets of files. From my perspective, saying that we should discard the EWS result and allow changes to get into WebKit trunk, knowing they will turn the bots red, is a bad proposal regardless of how you justify it. In the small delta where the bots are red, you can bet people will miss something else that breaks.
>
> As Ryosuke points out, practically we're already in that situation - from what I can tell, the tree is red virtually all of the time, at least during US/Pacific working hours. It's not clear to me if the EWS has made this better or worse, but perhaps others have noticed a difference. That said, I doubt I like red trees any more than you do :)

I wasn't talking about the tree's status quo here, as it shouldn't impact the discussion. Just because the tree is red doesn't mean it's the right thing (tm) to just drop the towel and make it more red (even if we seem to agree on the badness of that :)).

To add some thoughts here: saying that "the tree is red" covers several states (failing tests, not building, ...), and IMHO the EWS has at least helped on the building side. As far as the tests go, a lot of platform differences are unfortunately only uncovered on the bots.

Thanks,
Julien

___
webkit-dev mailing list
webkit-dev@lists.webkit.org
http://lists.webkit.org/mailman/listinfo.cgi/webkit-dev
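Stephen's "attach an owner and a resolution date to each expectation" idea could be prototyped as a trailing annotation on each suppression line, plus a tiny tool that nags when the date has passed. The `OWNER=`/`EXPIRES=` comment syntax below is purely hypothetical (test_expectations.txt never supported it); this is only a sketch of the accountability mechanism he describes.

```python
from datetime import date

# Hypothetical annotated suppression lines. The OWNER=/EXPIRES= comment
# syntax is invented for this sketch, not real test_expectations.txt syntax.
LINES = """\
BUGWK1234 MAC : fast/css/foo.html = IMAGE  // OWNER=schenney EXPIRES=2012-04-20
BUGWK5678 WIN : fast/js/bar.html = TEXT    // OWNER=rniwa EXPIRES=2012-04-01
""".splitlines()

def stale_entries(lines, today):
    """Return (owner, bug id) pairs for entries whose EXPIRES date has passed."""
    stale = []
    for line in lines:
        if "EXPIRES=" not in line:
            continue
        owner = line.split("OWNER=")[1].split()[0]
        expires = date.fromisoformat(line.split("EXPIRES=")[1].split()[0])
        if expires < today:
            stale.append((owner, line.split()[0]))
    return stale

print(stale_entries(LINES, date(2012, 4, 10)))
```

Run daily by a bot, this is the "way to remind people that they have left expectations hanging" without any human gardener having to remember.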
Re: [webkit-dev] handling failing tests (test_expectations, Skipped files, etc.)
On Tue, Apr 10, 2012 at 6:10 AM, Stephen Chenney <schen...@chromium.org> wrote:

> There is a significant practical problem with turning the tree red and working with someone to rebaseline the tests. It takes multiple hours for some bots to build and test a given patch. That means, at any moment, you will have maybe tens and in some cases hundreds of failing tests associated with some changelist that you need to track on the bots. You might have more failing tests associated with a different changelist, and so on.

But you have to do this for non-Chromium ports anyway, because they don't use test_expectations.txt, and skipping the tests won't help you generate new baselines. In my opinion, we should not further diverge from the way things are done in other ports.

> How do you propose to track which baselines have been computed by which bots for which changelists? In the past I have seen failing tests because baselines were checked in too soon and some bot had not generated new baselines. Things are even worse when a change goes in late in the day. How will the pending rebaselines be handed off to the new gardener?

You have to stick around as long as it takes to rebaseline, or notify the relevant port contributors after your patch lands. http://www.webkit.org/coding/contributing.html clearly says, under "Keeping the tree green":

> In either case, your responsibility for the patch does not end with the patch landing in the tree. There may be regressions from your change or additional feedback from reviewers after the patch has landed. You can watch the tree at build.webkit.org to make sure your patch builds and passes tests on all platforms. It is your responsibility to be available should regressions arise and to respond to additional feedback that happens after a check-in. Changes should succeed on all platforms, but it can be difficult to test on every platform WebKit supports. Be certain that your change does not introduce new test failures on the high-traffic Mac or Windows ports by comparing the list of failing tests before and after your change. Your change must at least compile on all platforms.

> As the Chromium owner of tests that are frequently broken by third-party contributors (they do their best to help us, but cannot generate all 15 or so Chromium expectations), I would much rather have a bug filed against me and a line in the expectations file. At least I then have visibility into what's happening and a chance to verify the results. In a recent pass to clean up expectations we found multiple cases where the wrong result had been committed because the person who did it failed to realize the result was wrong. While you might say "fix the test to make it obvious", it is not obvious how to do that for all tests.

This is an orthogonal issue to adding failing expectations prior to landing your patch. It's about how rebaselining should be done. I agree that rebaselining a test without understanding what the test intends to test, or what the correct output is, is bad. But let's not mangle such a discussion into this thread.

> Why not simply attach an owner and a resolution date to each expectation? The real problem right now is accountability and a way to remind people that they have left expectations hanging.

That's what WebKit bugs are for. Ossy frequently files a bug and cc's the patch author when a new test is added or a test starts failing and he doesn't know whether the new result is correct or not. He also either skips the test or rebaselines it as needed. He also reverts patches when the patch clearly introduced serious regressions (e.g. crashes on hundreds of tests).

- Ryosuke
Re: [webkit-dev] handling failing tests (test_expectations, Skipped files, etc.)
On Tue, Apr 10, 2012 at 11:42 AM, Stephen Chenney <schen...@chromium.org> wrote:
> On Tue, Apr 10, 2012 at 1:00 PM, Ryosuke Niwa <rn...@webkit.org> wrote:
>> On Tue, Apr 10, 2012 at 6:10 AM, Stephen Chenney <schen...@chromium.org> wrote:
>>> There is a significant practical problem with turning the tree red and working with someone to rebaseline the tests. It takes multiple hours for some bots to build and test a given patch. That means, at any moment, you will have maybe tens and in some cases hundreds of failing tests associated with some changelist that you need to track on the bots. You might have more failing tests associated with a different changelist, and so on.
>>
>> But you have to do this for non-Chromium ports anyway because they don't use test_expectations.txt and skipping the tests won't help you generate new baselines. In my opinion, we should not further diverge from the way things are done in other ports.
>
> How long on average does it take a builder to get through a change on another port? Right now the Chromium Mac 10.5 and 10.6 dbg builds are processing a patch from about 3 hours ago. About 20 patches have gone in since then. For the Mac 10.5 tree to ever be green would require there being no changes at all requiring new baselines for a 3-hour window. Just because other teams do it some way does not mean that Chromium, with its greater number of bots and platforms, should do it the same way.

Yes, it does mean that we should do it the same way. What if non-Chromium ports started imposing arbitrary processes like this on the rest of us? It would be total chaos, and nobody would understand the right thing to do for all ports.

> We are discussing a process here, not code, and in my mind the goal is to have the tree be as green as possible, with all failures tracked in a minimal expectations file, using as little engineer time as possible.

That's not our project goal. We have continuous builds and regression tests to prevent regressions and improve stability, not to keep bots green. Please review http://www.webkit.org/projects/goals.html

> Just look at how often the non-Chromium Mac and Win builds are red. In particular, changes submitted via the commit queue take an indeterminate amount of time to go in, anything from an hour to several hours. Patch authors do not necessarily even have control over when the CQ+ is given.

That's why I don't use the commit queue when I know my patch requires platform-dependent rebaselines.

> Even when manually committing, if it takes 3 hours to create baselines then no patches go in in the afternoon. What if the bots are down or misbehaving?

We need to promptly fix those bots.

> I would also point out the waste of resources when every contributor needs to track every failure around commit time in order to know when their own changes cause failures, and then track the bots to know when they are free to go home.

But that's clearly stated in the contribution guideline.

>>> Why not simply attach an owner and a resolution date to each expectation? The real problem right now is accountability and a way to remind people that they have left expectations hanging.
>>
>> That's what WebKit bugs are for. Ossy frequently files a bug and cc's the patch author when a new test is added or a test starts failing and he doesn't know whether the new result is correct or not. He also either skips the test or rebaselines it as needed. He also reverts patches when the patch clearly introduced serious regressions (e.g. crashes on hundreds of tests).
>
> Yes, Ossy does an excellent job of gardening. Unfortunately, on Chrome we have tens if not hundreds of gardeners and, as this thread has revealed, no clear agreement on the best way to garden. That IS the problem.

We have too many inexperienced gardeners who don't understand the WebKit culture or the WebKit process.

> I strongly believe that keeping the tree green is more important than having a clean expectations file.

I disagree. You're effectively just disabling the test temporarily.

> Finally, there is no pain-free way to do this. The question is how to distribute the pain. Right now each gardener is using a process that distributes pain in their preferred way. From a community standpoint it would be nice if the Chromium team could come up with something consistent.

The process the Chromium port uses should be consistent with non-Chromium ports.

- Ryosuke
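The bookkeeping problem Stephen keeps returning to — knowing which bots have actually built and tested a given changelist before you trust their baselines — reduces to comparing each bot's last-tested revision against the revision of the change. A minimal sketch (bot names and revision numbers here are made up for illustration):

```python
def bots_pending(change_rev, bot_latest_rev):
    """Bots whose most recent tested revision predates the change: their
    current results (and any baselines pulled from them) do not yet
    reflect change_rev, so rebaselining from them would commit stale output."""
    return sorted(bot for bot, rev in bot_latest_rev.items()
                  if rev < change_rev)

# Made-up snapshot: the Mac dbg bots are several hours / many revisions behind.
latest = {"chromium-mac-10.5-dbg": 113600,
          "chromium-mac-10.6-dbg": 113605,
          "chromium-win-rel": 113642}
print(bots_pending(113620, latest))
```

Only when this list is empty for your changelist is it safe to pull baselines for every port — which, with 3-hour bot cycles, is exactly the multi-hour wait being debated.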
Re: [webkit-dev] handling failing tests (test_expectations, Skipped files, etc.)
I don't think we can come up with a hard-and-fast rule given current tooling.

In a theoretical future world in which it's easy to get expected results off the EWS bots (or some other infrastructure), it would be reasonable to expect people to incorporate the correct expected results for any EWS-having ports before committing the patch. I expect we'd all agree that would be better than turning the bots red or adding to test_expectations.txt/Skipped files.

In the current world, it's a judgement call. If I expect a patch to need a lot of platform-specific baselines, I'll make sure to commit it at a time when I have hours to spare to clean up any failures, or, if I can't stick around for the bots to cycle, I'll add it to test_expectations.txt appropriately. Both approaches have nasty tradeoffs.

It is probably worth writing up a wiki page outlining these two options and explaining why you might do one or the other for people new to the project, but I don't see benefit in trying to pick a hard rule that everyone must follow.

Ojan

On Tue, Apr 10, 2012 at 11:58 AM, Ryosuke Niwa <rn...@webkit.org> wrote:

> [...]
Re: [webkit-dev] handling failing tests (test_expectations, Skipped files, etc.)
On Tue, Apr 10, 2012 at 12:29 PM, Ojan Vafai <o...@chromium.org> wrote:

> I don't think we can come up with a hard and fast rule given current tooling. [...] It is probably worth writing up a wiki page outlining these two options and explaining why you might do one or the other for people new to the project, but I don't see benefit in trying to pick a hard rule that everyone must follow.

I agree. Both approaches have pros and cons. It will be good to document this on the wiki, as you say.

- Ryosuke
Re: [webkit-dev] handling failing tests (test_expectations, Skipped files, etc.)
I agree with Ojan. It's clear that there are arguments for both approaches, and my initial note did not address all the situations that come up. I will write up something further and put it on the wiki. I will also continue mulling over what sorts of changes to the tools we could make in the short term to improve things.

-- Dirk

On Tue, Apr 10, 2012 at 12:29 PM, Ojan Vafai <o...@chromium.org> wrote:

> [...]
[webkit-dev] handling failing tests (test_expectations, Skipped files, etc.)
Hi all,

Recently I've noticed more people making changes and adding test failure suppressions to various ports' test_expectations.txt files. This is great! However, I don't think we have an agreement over what the best practices are here, so I thought I'd list out what I think they are, and others can comment / correct me as necessary:

1) Don't mix test_expectations.txt files and Skipped files. This is really confusing to everyone involved ... your port should use one or the other where possible. (*)

2) Entries in the test_expectations files should be short-lived. We should be checking in what we believe the current output is, *regardless of whether it's right or wrong*. A different way of putting this is that we are more interested in changes in behavior (regressions) than in correctness. If you expect a test to be failing for a long time in the same consistent way, check in the failure and file a bug. Don't use test_expectations to suppress it. (**)

3) Don't use test_expectations.txt to suppress failures across a single cycle of the bot, just so you can gather updated baselines without the tree going red. While it might seem that you're doing tree maintainers a favor, in my experience this just makes things confusing, and it's better to let the tree go red until you can actually rebaseline things. It's too easy to add a suppression "for now" and then forget about it.

These rules are not set in stone; they're just what I try to do. If people think there are better ways of doing this, please speak up! If there is consensus, I'll update the wikis accordingly.

Thanks!

-- Dirk

(*) I have an outstanding to-do to modify new-run-webkit-tests to use a better way of tracking expectations, merging the inheritance/cascade aspects of Skipped files with the flexibility in types of failures that you get from expectations. Eventually we should have a mechanism that replaces both, but for now, we don't. See https://bugs.webkit.org/show_bug.cgi?id=83508 .

(**) Note that Chromium has not historically worked this way -- we suppress failures rather than check in failing output -- but I believe many / most of the Chromium gardeners have come to believe that this is not the way things should work, and that we'd be better off checking in the failing output instead. Note that it may make sense to do different things based on your port's maturity, so you suppress things while bringing a port up, and stop suppressing once your port is stable.
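Rule 2's "check in the failure and file a bug" pairs naturally with a lint that refuses suppression lines carrying no bug annotation. The sample entries below follow the old-style one-line Chromium expectations format only approximately (from memory), so treat the exact syntax as illustrative rather than authoritative:

```python
def missing_bug(lines):
    """Return expectation lines that carry no BUG... modifier at all,
    i.e. suppressions nobody can be held accountable for."""
    bad = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("//"):
            continue  # skip blanks and comments
        if not line.split()[0].startswith("BUG"):
            bad.append(line)
    return bad

expectations = [
    "// Suppressions pending rebaseline.",
    "BUGWK83508 MAC : fast/css/foo.html = IMAGE",
    "WIN DEBUG : fast/js/bar.html = TEXT CRASH",
]
print(missing_bug(expectations))
```

Hooked into the pre-commit checks, a lint like this would turn "file a bug" from a convention into a mechanically enforced rule.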
Re: [webkit-dev] handling failing tests (test_expectations, Skipped files, etc.)
On Mon, Apr 9, 2012 at 3:02 PM, James Robinson <jam...@google.com> wrote:

>> 3) Don't use test_expectations.txt to suppress failures across a single cycle of the bot, just so you can gather updated baselines without the tree going red. While it might seem that you're doing tree maintainers a favor, in my experience this just makes things confusing and it's better to let the tree go red until you can actually rebaseline things. It's too easy to add a suppression "for now" and then forget about it.
>
> How would you suggest someone contribute a patch that changes layout test results, if not by marking the test as failing?

Both this and Dana's response are good questions which I don't have great answers for. Let me explain my thinking in a little more detail.

In my ideal world, you would be able to get updated baselines *prior* to trying to land a patch. This is of course not really possible today for any test that fails on multiple ports with different results, as it's practically impossible to run more than a couple of ports by hand, and we don't have a bot infrastructure to help you here.

In my closer-to-the-real-world ideal, the test_expectations.txt file would be *empty*, all of the ports would be using it, and adding in suppressions for files you expect to fail in your patch would be a good signal. In practice, however, marking a test as failing has a few problems:

1) If you don't get the list of suppressions right, then what shows up on the bots can be confusing - you don't realize that more (or different) tests are failing than the ones that showed up. Also, you can see tests fail on some bots that you didn't account for, ones that don't use the expectations at all (e.g., you suppressed Chromium correctly, but the Apple bots go red).

2) Adding suppressions to test_expectations.txt increases contention on the file, confusing the gardeners / bot-watchers and making merges harder. It is practically difficult to tell the difference between "I just added these lines as a temporary suppression" and "I added these lines as a suppression two weeks ago but then forgot about them", and, as Ojan will tell you, you end up with a bunch of lines in the file that just have a comment saying "just needs rebaselining", but you have no idea if those statements are still true or not.

3) The current tool-of-the-day for managing rebaselining, garden-o-matic, is much better suited to handling the unexpected failures on the bots than the expected failures you've introduced.

That said, as Dana points out, my proposal makes it hard if not impossible for the EWS to work at all; I'm biased because I hardly ever post changes that are expected to introduce failures on the EWS bots but should still land. I don't know what to do about this, although perhaps moving to my near-ideal model might work.

If there's consensus in the meantime that it is better on balance to check in suppressions, perhaps we can figure out a better way to do that. Maybe (shudder) a second test_expectations file? Or maybe it would be better to actually check in suppressions marked as REBASELINE (or something like that)?

-- Dirk
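Dirk's REBASELINE idea need not require a second file: a tool could simply partition the existing entries on the marker, so gardeners see "pending rebaseline" and "long-lived" suppressions separately. The `REBASELINE` keyword used below is the hypothetical marker from his message, not an actually supported modifier:

```python
def partition_rebaseline(lines):
    """Split expectation entries into (pending_rebaseline, ordinary) lists,
    keyed on a hypothetical REBASELINE modifier token."""
    pending, ordinary = [], []
    for line in lines:
        if not line.strip() or line.lstrip().startswith("//"):
            continue  # ignore blanks and comments
        (pending if "REBASELINE" in line.split() else ordinary).append(line)
    return pending, ordinary

entries = [
    "BUGWK1 REBASELINE MAC : fast/a.html = IMAGE",
    "BUGWK2 WIN : fast/b.html = TEXT",
]
pending, ordinary = partition_rebaseline(entries)
print(pending)
```

This is roughly what would let garden-o-matic (or a sub-tool) treat "I suppressed this while waiting for bots to cycle" differently from a genuine expected failure.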
Re: [webkit-dev] handling failing tests (test_expectations, Skipped files, etc.)
On Monday 09 April 2012 22:42:33, Dirk Pranke wrote:

> Hi all, Recently I've noticed more people making changes and adding test failure suppressions to various ports' test_expectations.txt files. This is great!

Hi Dirk,

It would be good to have a page describing the right thing to do on each of the ports in each of the following situations:

- Adding a new test with render tree/pixel results
- Changing the render tree/pixel results for existing tests

I think the correct practice will differ between ports, as only the Chromium bots (AFAIK) check pixel results. This has the unfortunate consequence that when you change the expected pixel result for a test that has pixel results on all platforms, you end up with stale png results in the tree for those platforms whose results you are not in a position either to generate yourself or to persuade someone on #webkit to generate for you.

Thanks,
Robert
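Robert's stale-png problem is at least easy to enumerate: for a given test, list every platform-specific `-expected.png` that would need regenerating when the rendering changes. The directory layout below mimics the `LayoutTests/platform/<port>/` convention, but the miniature tree built here (ports, paths) is made up for the demo:

```python
import os
import tempfile

def platform_pngs(layout_tests_root, test_path):
    """All platform-specific -expected.png baselines for one test,
    i.e. the files that go stale when the test's pixel output changes."""
    stem = test_path.rsplit(".", 1)[0] + "-expected.png"
    platform_dir = os.path.join(layout_tests_root, "platform")
    hits = []
    for port in sorted(os.listdir(platform_dir)):
        if os.path.exists(os.path.join(platform_dir, port, stem)):
            hits.append("platform/%s/%s" % (port, stem))
    return hits

# Build a throwaway miniature LayoutTests tree for the demo.
root = tempfile.mkdtemp()
for port in ("chromium-mac", "chromium-win", "qt"):
    d = os.path.join(root, "platform", port, "fast", "css")
    os.makedirs(d)
    open(os.path.join(d, "foo-expected.png"), "w").close()

found = platform_pngs(root, "fast/css/foo.html")
print(found)
```

The length of that list for a single test is exactly the "all 15 or so Chromium expectations" burden Stephen mentions, and any entry a contributor cannot regenerate is a stale png left in the tree.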
Re: [webkit-dev] handling failing tests (test_expectations, Skipped files, etc.)
> In my ideal world, you would be able to get updated baselines *prior* to trying to land a patch. This is of course not really possible today for any test that fails on multiple ports with different results, as it's practically impossible to run more than a couple of ports by hand, and we don't have a bot infrastructure to help you here.

That would be the best outcome indeed; however, it would more or less require all the EWS bots to be able to run the tests (including pixel tests for the platforms that enable them).

> 3) the current tool-of-the-day for managing rebaselining, garden-o-matic, is much better suited to handling the unexpected failures on the bots rather than the expected failures you've introduced.

You could see it the other way: how could we make garden-o-matic handle newly added suppressions better? Maybe some new sub-tool listing the newly added suppressions would help? Ignore the suppressions added XX hours ago? Your issue seems to be suppressions sticking around in the tree, not strictly suppressing in itself. Our current tool (garden-o-matic) handles failures a lot better than it handles suppressions.

> If there's consensus in the meantime that it is better on balance to check in suppressions, perhaps we can figure out a better way to do that. Maybe (shudder) a second test_expectations file? Or maybe it would be better to actually check in suppressions marked as REBASELINE (or something like that)?

That sounds quirky, as it involves maintaining 2 sets of files. From my perspective, saying that we should discard the EWS result and allow changes to get into WebKit trunk, knowing they will turn the bots red, is a bad proposal regardless of how you justify it. In the small delta where the bots are red, you can bet people will miss something else that breaks.

Thanks,
Julien
Re: [webkit-dev] handling failing tests (test_expectations, Skipped files, etc.)
On Mon, Apr 9, 2012 at 3:50 PM, Julien Chaffraix jchaffr...@webkit.org wrote:

If there's consensus in the meantime that it is better on balance to check in suppressions, perhaps we can figure out a better way to do that. Maybe (shudder) a second test_expectations file? Or maybe it would be better to actually check in suppressions marked as REBASELINE (or something like that)? That sounds quirky, as it involves maintaining two sets of files. From my perspective, saying that we should discard the EWS results and allow changes into WebKit trunk, knowing they will turn the bots red, is a bad proposal regardless of how you justify it. In the small delta where the bots are red, you can bet people will miss something else that breaks.

But that's the status quo. Until we get to the ideal world where every port's EWS at least runs tests, there's no way around it, since we can't possibly require all contributors to run tests on all ports and platforms. Also, adding new lines to test_expectations.txt will only work on the Chromium port, since other ports use Skipped files instead, and skipping a test won't give you a new baseline on the bot. This discrepancy between the Chromium and non-Chromium ports imposes a significant cognitive burden on contributors and is not justifiable in my opinion.

- Ryosuke
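The discrepancy Ryosuke describes is easiest to see by comparing the two formats side by side (entries are illustrative). A port's Skipped file is a bare list of tests that are never run, so the bots produce no output to rebaseline from, while a test_expectations.txt line still runs the test and records its actual result:

```
# LayoutTests/platform/gtk/Skipped -- test is not run at all
# https://bugs.webkit.org/show_bug.cgi?id=12345
fast/dom/example-test.html

// Chromium test_expectations.txt -- test runs; failure is expected and tracked
BUGWK12345 : fast/dom/example-test.html = TEXT
```

Only the second mechanism leaves the bots generating fresh results that a gardener can later promote to new baselines.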
Re: [webkit-dev] handling failing tests (test_expectations, Skipped files, etc.)
On Mon, Apr 9, 2012 at 3:50 PM, Julien Chaffraix jchaffr...@webkit.org wrote:

In my ideal world, you would be able to get updated baselines *prior* to trying to land a patch. This is of course not really possible today for any test that fails on multiple ports with different results, as it's practically impossible to run more than a couple of ports by hand, and we don't have a bot infrastructure to help you here.

That would be the best outcome indeed; however, it would more or less require all the EWS bots to be able to run the tests (including pixel tests on the platforms that enable them).

Yes. Probably even harder, we'd also have to build tooling to extract the results off of that bot and merge them into your patch.

3) the current tool-of-the-day for managing rebaselining, garden-o-matic, is much better suited to handling the unexpected failures on the bots than the expected failures you've introduced.

You could see it the other way: how could we make garden-o-matic handle newly added suppressions better? Maybe a new sub-tool listing the newly added suppressions would help? Or ignoring suppressions added more than XX hours ago?

Yes, that is similar.

Your issue seems to be suppressions sticking around in the tree, not the act of suppressing itself. Our current tool (garden-o-matic) handles failures a lot better than it handles suppressions.

Yes.

If there's consensus in the meantime that it is better on balance to check in suppressions, perhaps we can figure out a better way to do that. Maybe (shudder) a second test_expectations file? Or maybe it would be better to actually check in suppressions marked as REBASELINE (or something like that)?

That sounds quirky, as it involves maintaining two sets of files. From my perspective, saying that we should discard the EWS results and allow changes into WebKit trunk, knowing they will turn the bots red, is a bad proposal regardless of how you justify it.
In the small delta where the bots are red, you can bet people will miss something else that breaks.

As Ryosuke points out, practically we're already in that situation: from what I can tell, the tree is red virtually all of the time, at least during US/Pacific working hours. It's not clear to me whether the EWS has made this better or worse, but perhaps others have noticed a difference. That said, I doubt I like red trees any more than you do :)

I had not considered the EWS implications when I wrote the initial note, so I haven't decided if (or how) to revise my opinions yet.

-- Dirk
Re: [webkit-dev] handling failing tests (test_expectations, Skipped files, etc.)
If there's consensus in the meantime that it is better on balance to check in suppressions, perhaps we can figure out a better way to do that. Maybe (shudder) a second test_expectations file? Or maybe it would be better to actually check in suppressions marked as REBASELINE (or something like that)? That sounds quirky, as it involves maintaining two sets of files. From my perspective, saying that we should discard the EWS results and allow changes into WebKit trunk, knowing they will turn the bots red, is a bad proposal regardless of how you justify it. In the small delta where the bots are red, you can bet people will miss something else that breaks.

As Ryosuke points out, practically we're already in that situation: from what I can tell, the tree is red virtually all of the time, at least during US/Pacific working hours. It's not clear to me whether the EWS has made this better or worse, but perhaps others have noticed a difference. That said, I doubt I like red trees any more than you do :)

I wasn't talking about the tree's status quo here, as it shouldn't impact the discussion. Just because the tree is red doesn't mean it's the right thing (tm) to throw in the towel and make it even redder (even if we seem to agree on the badness of that :)). To add some thoughts here: saying that the tree is red covers several states (failing tests, not building, ...), and IMHO the EWS has at least helped on the building side. As far as the tests go, a lot of platform differences are unfortunately only uncovered on the bots.

Thanks, Julien