> On Oct 26, 2012, at 11:11 PM, Ryosuke Niwa <rn...@webkit.org> wrote:
>
>> I’m sure Antti, Alexey, and others who have worked on the loader and other
>> parts of WebKit are happy to write those tests or list the kind of things
>> they want to test. Heck, I don’t mind writing those tests if someone could
>> make a list.
>>
>> I totally sympathize with the sentiment to reduce the test flakiness, but
>> loader and cache code have historically been under-tested, and we’ve had a
>> number of bugs detected only by running non-loader tests consecutively.
>>
>> On the contrary, we’ve had this DRT behavior for ages. Is there any reason
>> we can’t wait for another couple of weeks or months until we add more
>> loader & cache tests before making the behavior change?
>
Please correct me if I'm misinformed, but it's been three months since this
issue was first raised, and it doesn't sound like those tests have been
written, or that anyone is eager to write them; despite people asking on this
thread, no one has listed the kinds of tests they think are needed. Have we
actually made any progress here, or was the issue dropped until Ami raised it
again? It seems like the latter to me ... again, please correct me if this is
being actively worked on, because that would change the whole tenor of this
debate.

On Sun, Oct 28, 2012 at 6:32 AM, Maciej Stachowiak <m...@apple.com> wrote:
>
> I think the nature of loader and cache code is that it's very hard to make
> tests which always fail deterministically when regressions are introduced,
> as opposed to randomly. The reason for this is that bugs in these areas are
> often timing-dependent. I think it's likely this tendency to fail randomly
> will be the case whether or not the tests are trying to explicitly test the
> cache or are just incidentally doing so in the course of other things.
>

I am not familiar with the loader and caching code in WebKit, but I know
enough about similar problem spaces to be puzzled by why it's impossible to
write tests that can adequately test the code. Is the caching disk-based, and
is running tests in parallel perhaps interfering with it? If so, then maybe
the fact that we now run tests in parallel is why this is a problem now when
it hasn't been before. Or maybe the problem is that a given process doesn't
always see the same tests in the same order?

> Unfortunately, it's very tempting when a test is failing randomly to blame
> the test rather than to investigate whether there is an actual regression
> affecting it. And sometimes it really is the test's fault. But sometimes it
> is a genuine bug in the code.
>
> On the other hand, nondeterministic test failures make it harder to use
> test infrastructure in general.
>
> These are difficult things to reconcile. The original philosophy of WebKit
> tests is to test end-to-end under relatively realistic conditions, but at
> the same time unpredictability makes it hard to stay at zero regressions.
>

Exactly. Personally, the cost of unpredictability in the test infrastructure
is so much higher than the value we're getting (implicitly) that this is a
no-brainer to me. There are some tradeoffs (like running tests in parallel)
that are worth it, but this isn't one of them. I am happy to explain my
thinking and standards further if there's interest.

Hopefully that partially answers Alexey's questions about where we should
draw the line in trying to make our tests deterministic and hermetic: do
everything you reasonably can. We're not picking on caching here.

> I think making different ports do testing under different conditions makes
> it more likely that some contributors will introduce regressions without
> noticing, leaving it for others to clean up. So it's regrettable if we go
> that way because we are unable to reach consensus.

I agree that it is bad to have different ports behaving differently, and I
would like to avoid that as well. I don't want any port suffering from flaky
tests, but I also don't think it's reasonable for one group to force that on
everyone else indefinitely. I am also fine with having some way to test
systems more non-deterministically in order to expose more bugs, but that
needs to be clearly separated from the other testing we do; otherwise it is
an unfair cost to impose on the rest of the system, and it should be
tolerated only if we have no other choice. We have other choices.

> Creating some special opt-in --antti mode would be even worse, as it's
> almost certain that failures would creep into a mode that nobody runs.

This comment (and Antti's suggestion, below) makes me think that you didn't
understand my "virtual test suite" suggestion; that's not surprising, since
Apple doesn't actually use this feature of NRWT yet.
A virtual test suite is a way of saying: (re-)run the tests under directory X
with command-line flags Y and Z, and put the results in a new directory. For
example, Chromium runs all of the tests in fast/canvas twice: once
"normally", using the regular software code path, and once with the
--enable-accelerated-2d-canvas command-line flag, which forces things through
the GPU-accelerated code paths (using osmesa for emulation).

So, all you would have to do is identify which tests you'd like to run (or
re-run) with caching enabled, add a command-line flag, and add two lines of
code to NRWT. This isn't a separate "opt-in --antti mode"; these tests are
run twice on every single run on the bots, in every single configuration. You
can keep separate baselines for them or re-use the existing baselines, and
you can have separate TestExpectations (they aren't currently
reused/inherited, but that would also be easy to fix).

> What I personally would most wish for is good tools to catch when a test
> starts failing nondeterministically, and to identify the revision where the
> failures began. The reason we hate random failures is that they are hard to
> track down and diagnose. But some types of bugs are unlikely to manifest in
> a purely deterministic way. It would be good if we had a reliable and
> useful way to catch those types of bugs.

This is a fine idea -- and I'm always happy to talk about ways we can improve
our test tooling, so please feel free to start a separate thread on these
issues -- but I don't want to lose sight of the main issue here.

It sounds like we've identified three existing problems - please correct me
if I'm misstating them:

1. There appears to be a bug in the caching code that is causing tests for
other parts of the system to fail randomly.

2. DRT and WTR on some ports are implemented in a way that makes the system
more fragile than some of us would like it to be, and there doesn't seem to
be an a priori need for this to be the case; indeed, some ports already don't
do this.

3. We apparently don't have dedicated test coverage for caching and the
loader that people think is good enough, and getting such tests might be
"hard".

I would like us to solve all three of these problems; solving only one is
just a partial solution. I can trivially solve (2). While I might be able to
solve (1) and (3) given enough time and dedication, I am hardly the best
person to do so, nor is Ami. And while I am sensitive to the idea that
solving (2) might cost us some test coverage, I have explained that that's a
tradeoff I'm perfectly fine with -- across all ports. Others might not be,
and if they don't want me to solve that problem on their port (yet, or even
at all), I won't.

So unless someone can convince me that there is actually a plan and a
timeline for resolving (1) and (3) that we can expect to happen, and that I
should just wait a little while longer, I plan to R+ Ami's change so he can
land it for the ports that do want it. I believe we are inflicting more harm
on the project as a whole by not doing so.

-- Dirk
_______________________________________________
webkit-dev mailing list
webkit-dev@lists.webkit.org
http://lists.webkit.org/mailman/listinfo/webkit-dev
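[Editor's note: the "virtual test suite" mechanism described in the message
above can be sketched roughly as follows. This is a minimal, self-contained
illustration in Python; the names (`VirtualTestSuite`, `expand_tests`) and
the `--enable-cache` flag are assumptions for the sake of the example, not
NRWT's actual API.]

```python
# Sketch of the "virtual test suite" idea: each suite says "re-run every
# test under `base` with extra flags `args`, under a new `prefix`".
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class VirtualTestSuite:
    prefix: str      # e.g. "virtual/cache/fast/canvas"
    base: str        # e.g. "fast/canvas"
    args: List[str]  # extra command-line flags passed to DRT/WTR


def expand_tests(tests: List[str],
                 suites: List[VirtualTestSuite]) -> List[Tuple[str, List[str]]]:
    """Return (test_name, extra_args) pairs: every real test once with no
    extra flags, plus one virtual copy per suite whose base contains it."""
    expanded: List[Tuple[str, List[str]]] = [(t, []) for t in tests]
    for suite in suites:
        for t in tests:
            if t.startswith(suite.base + '/'):
                # Rewrite fast/canvas/x.html -> virtual/cache/fast/canvas/x.html
                expanded.append((suite.prefix + t[len(suite.base):], suite.args))
    return expanded


# Hypothetical suite: re-run the fast/canvas tests with caching enabled.
suites = [VirtualTestSuite('virtual/cache/fast/canvas', 'fast/canvas',
                           ['--enable-cache'])]
tests = ['fast/canvas/draw.html', 'fast/dom/node.html']
for name, args in expand_tests(tests, suites):
    print(name, args)
```

The virtual copies get their own result directory (the prefix), so they can
carry their own baselines and TestExpectations lines, and they run on every
bot run in every configuration rather than in a separate opt-in mode.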