Thanks for all the feedback. I've collected the responses and replied
inline.

> Jack Moffitt wrote:

> A sort of subtask of this which would be extremely useful is taking
> a known rendering problem and producing a minimal reproduction of
> it.  For example, many issues are discovered in existing pages with
> perhaps hundreds of kilobytes of extraneous data. It would be nice
> to reduce the failing example to a minimal size. One issue is how to
> make an oracle here. It would probably be an improvement to have it
> be only semi-automated, where it does some shrinking then asks and
> repeats.

That's a good point. One thing I've been mulling over is a way to
serialize/checkpoint/save the current test generation state. That
would let runs be restarted and scaled out with less repeated test
generation. Besides being a seeming gap in the QuickCheck/generative
testing literature, that capability might also make it possible to
deserialize a pre-existing test case back into generative/QuickCheck
state, which could then be shrunk to a more minimal reproducer. I'm
not sure how tractable that idea is, but it is certainly intriguing
and worth exploring.
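
To make that slightly more concrete, here is a rough Python sketch of
the kind of checkpointing I have in mind (the names and structure are
purely illustrative and not tied to any particular QuickCheck
implementation):

    import pickle
    import random

    class GenState:
        """Checkpointable generation state: an RNG plus a
        QuickCheck-style 'size' parameter that grows as testing
        proceeds."""

        def __init__(self, seed=0, size=0):
            self.rng = random.Random(seed)
            self.size = size

        def checkpoint(self, path):
            # Persist RNG state and size so a later run can resume here
            # instead of regenerating (and re-running) earlier cases.
            with open(path, "wb") as f:
                pickle.dump((self.rng.getstate(), self.size), f)

        @classmethod
        def resume(cls, path):
            state = cls()
            with open(path, "rb") as f:
                rng_state, state.size = pickle.load(f)
            state.rng.setstate(rng_state)
            return state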

> Each browser renders things slightly differently, so pixel by pixel
> comparison across browsers is probably not going to work well. For
> our own testing of this kind we instead produce the same result
> using two different techniques, or in a few cases we make reference
> images.  However making reference images can't account for all
> rendering differences (like text) and so we avoid it if possible.
> I imagine it would be quite difficult if the reference image was
> from another engine, not our own.

My initial plan was to start with just the elements and styling that
_should_ render similarly across browsers. Once that worked well
enough, I would consider adding elements and styling with different
expected rendering, probably using some sort of machine learning to
distinguish real defects from expected differences. I expect that
normal font variations would be one of the last things I would tackle
(if ever). The training would probably be manual, but the goal is that
the actual testing would be fully automated and scalable.
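
As a rough illustration of that restricted starting point, the kind of
generator I have in mind would produce text-free pages built only from
properties that should be portable across engines (a Python sketch;
the "safe" property list is illustrative, not vetted):

    import random

    # Properties I'd expect to render near-identically across engines
    # (an illustrative subset, not a vetted list).
    SAFE_PROPS = {
        "width":            lambda r: "%dpx" % r.randint(10, 300),
        "height":           lambda r: "%dpx" % r.randint(10, 300),
        "margin":           lambda r: "%dpx" % r.randint(0, 20),
        "background-color": lambda r: "#%06x" % r.randint(0, 0xFFFFFF),
        "float":            lambda r: r.choice(["left", "right", "none"]),
    }

    def gen_page(rng, n_boxes=20):
        """Generate a text-free page of explicitly sized boxes."""
        boxes = []
        for _ in range(n_boxes):
            # Every box gets an explicit size; other properties are optional.
            props = {"width": SAFE_PROPS["width"](rng),
                     "height": SAFE_PROPS["height"](rng)}
            for prop, f in SAFE_PROPS.items():
                if prop not in props and rng.random() < 0.7:
                    props[prop] = f(rng)
            style = ";".join("%s:%s" % kv for kv in props.items())
            boxes.append('<div style="%s"></div>' % style)
        return "<!DOCTYPE html><html><body>%s</body></html>" % "".join(boxes)

    print(gen_page(random.Random(42)))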

> Tools like this would be extremely useful.
>
> The main kind of testing we do is reference testing where the
> reference is the same content achieved by different means. This is
> pretty robust to things like font rendering changing slightly
> between versions. We have some JS level testing where JS APIs are
> invoked and then results verified, but it sounds like you are more
> focused on the visual testing aspect. As an aside, I think
> quickchecking JS  APIs is likely to find a ton of bugs and be useful
> too, plus it probably doesn't have the oracle problems.

To be honest, my working assumption has been that QuickChecking
JS-level APIs would be fairly well-trodden ground, although it's a bit
difficult to search for that without mostly turning up QuickCheck
libraries written in JavaScript. Looking through some of the
references from James and Geoffrey, I see that jsfunfuzz already seems
to be doing that category of testing for JS. In addition to feeling
less novel, QuickChecking JS APIs doesn't feel quite as useful for
Servo testing (based on the assumption that Servo will be re-using
*Monkey).

> James Graham wrote:

> There is some prior art here e.g. [1]. I wrote a similar tool that
> was specialised to reducing js code whilst at Opera, but that
> doesn't ever seem to have been released. In both cases you either
> had to write a one-off function to determine if the testcase was
> a pass or fail, or have a human judge it.  Obviously the latter is
> impractically slow if your input is large.
> [1] http://www.squarefree.com/2007/09/15/introducing-lithium-a-testcase-reduction-tool/

Thanks! That's a great reference that led me to a bunch of things I
will definitely have to dig into further.

> Yes, I imagine specifically font rendering will be a problem, along
> with antialiasing in general and legitimate-per-CSS variations in
> properties such as outline.

Do you happen to know of a reference/list of elements/styles that
cause expected differences in rendering? I expect a couple days of
reading through ACID2/3 bug lists would give me a pretty good idea,
but perhaps somebody has already done the work of collating this info.

> However I think you might make progress with some sort of
> consensus-based approach e.g. take a testcase and render it in
> gecko/blink/webkit/edge. If the difference by some metric (e.g.
> number of differing pixels, although more sophisticated approaches
> are possible) is within some threshold then check whether servo is
> within the same threshold. If it is consider that a pass otherwise
> a fail.

That lines up with my intuition as well. I would probably start with
something really straightforward like: (sum of pixel differences)
/ (number of page elements) > (arbitrary threshold). After seeing how
well that works (or more likely doesn't) and getting some real data, I
would begin adding more sophistication to the heuristic.
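
For example, a first cut of that metric might look something like the
following (a Python/Pillow sketch; the screenshot file names and the
threshold are placeholders):

    from PIL import Image, ImageChops

    def diff_score(png_a, png_b, n_elements):
        """First-cut metric: summed per-channel pixel differences,
        normalized by the number of page elements."""
        a = Image.open(png_a).convert("RGB")
        b = Image.open(png_b).convert("RGB")
        diff = ImageChops.difference(a, b)   # assumes identical viewport sizes
        hist = diff.histogram()              # 256 buckets per channel
        total = sum(count * (i % 256) for i, count in enumerate(hist))
        return total / float(max(n_elements, 1))

    # Consensus-style check (threshold arbitrary, to be tuned on real data):
    # baseline = diff_score("gecko.png", "blink.png", n_elements)
    # servo    = diff_score("servo.png", "gecko.png", n_elements)
    # verdict  = "pass" if servo <= baseline + THRESHOLD else "fail"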

> Geoffrey Sneddon wrote:

> FWIW, I was talking with a bunch of people in the Chrome team about
> such an oracle not that long ago. I think one can almost certainly
> come up with a useful oracle even though it'll have very real
> limitations.

I would love to hear more about your conversation with the Chrome
team about that. Any chance it was in public hard-copy form (seems
funny to think of IRC, mailing lists, email that way)?

> There are plenty of rendering bugs that don't involve text, and
> practically if you're generating arbitrary web pages it's easy to
> solve all of those problems by simply not including text (though
> you'll need to give boxes explicit heights!). Even if you allow
> text, you can probably get a long way by simply getting rid of all
> text and setting explicit width/height properties on everything such
> that the layout of the boxes doesn't change even if they're then
> empty. You can then compare the position of the box across browsers.

I was thinking of replacing text with a sequence of images of words of
varying lengths (styled to look as much like text as possible). You
know, "lorem ipsum imagos" :-). Although that was before Boris
mentioned the Ahem font, which I wasn't aware of previously.
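
For the box-position comparison, I'd probably start with something
like this WebDriver-based sketch (Python; the drivers and the page URL
are placeholders):

    from selenium import webdriver

    JS_RECTS = """
    return Array.prototype.map.call(
        document.querySelectorAll('body *'),
        function (el) {
            var r = el.getBoundingClientRect();
            return [r.left, r.top, r.width, r.height];
        });
    """

    def box_geometry(driver, url):
        """Return [left, top, width, height] for every element on the page."""
        driver.get(url)
        return driver.execute_script(JS_RECTS)

    # Compare the same generated page across two engines:
    # rects_a = box_geometry(webdriver.Firefox(), "file:///tmp/case.html")
    # rects_b = box_geometry(webdriver.Chrome(), "file:///tmp/case.html")
    # mismatches = [i for i, (a, b) in enumerate(zip(rects_a, rects_b))
    #               if any(abs(x - y) > 1 for x, y in zip(a, b))]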

> If you go down the JS route, I'd speak to the fuzzer team (esp.
> Jesse), as well as have a look at the approaches taken by other
> fuzzer tools (cross_fuzz for example was particularly effective at
> finding bugs).

Thanks, I'll look into cross_fuzz more. At first glance it appears to
be similar to funfuzz/DOMfuzz in that it relies on either
browser-level failures or a JS function to be the oracle.

> In either case, I expect generating arbitrary cases to actually be
> not overly interesting, as I expect it'll be sufficiently unlikely
> to combine features in interesting ways. You may well want some
> code-coverage based feedback into the generation of instances, along
> the lines of afl.

One of my inspirations for doing this is the success we've had at my
company (ViaSat) using QuickCheck (Clojure test.check) to find lots of
surprising defects in code that already had a fairly comprehensive
test suite and was considered mature. Guiding the generation of tests
using something like AFL/afl-cov definitely seems like a fruitful
avenue of exploration, especially if an initial direct approach
doesn't reveal much of interest.
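
To sketch what I mean by AFL-style feedback (purely illustrative
Python; generate() and run_with_coverage() are hypothetical stand-ins
for the page generator and an instrumented Servo run):

    import random

    def coverage_guided_run(generate, run_with_coverage, iterations=1000):
        """Minimal AFL-ish loop: keep inputs that hit new coverage and
        preferentially derive new cases from them."""
        corpus, seen = [], set()
        rng = random.Random(0)
        for _ in range(iterations):
            parent = rng.choice(corpus) if corpus and rng.random() < 0.8 else None
            case = generate(rng, parent)       # mutate parent, or build a fresh case
            covered = run_with_coverage(case)  # set of branches/lines hit
            if covered - seen:                 # new coverage: keep this case
                corpus.append(case)
                seen |= covered
        return corpus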

I would be interested in hearing more about why you think this type
of testing wouldn't uncover interesting cases, especially if it's more
than intuition (your intuition about this certainly may be better than
mine).

> Lars Bergstrom wrote:

> Along those lines, it's also worth looking at the very recent
> awesome work at the University of Washington formalizing layout
> (upcoming paper at OOPSLA): http://cassius.uwplse.org/
>
> I've been in contact with them with the hopes of trying it out in
> the context of Servo, as I believe there are both some interesting
> testing applications and some really nifty things that we could do
> with devtools using such a tool, too.

That's very cool stuff. I might use Cassius (or at least borrow the
technique) for reversing from a specific test case back to generative
test state in order to shrink the case (or for seeding an interesting
point in the generative test search space). I suppose it might make a
reasonable test oracle too (if/once it has more complete coverage of
CSS 2.1/3).

> Boris Zbarsky wrote:

> Does using the Ahem font still leave noticeable font rendering
> differences between browsers?

> Geoffrey Sneddon wrote:

> Yes, there are anti-aliasing differences. (See the infamous WebKit
> Ahem-only AA-disable hack for Acid3!)

Ahem will definitely be something I'll take a look at. Anti-aliasing
differences should be easier to detect and filter out than font
spacing and sizing differences (especially the major layout/flow
differences the latter might cause).

> Robert O'Callahan wrote:

> Cassius doesn't support any kind of fragmentation, not even line
> breaking, and they look difficult to add to the Cassius model. But
> it does look cool for the sort of testcases gsnedders was talking
> about.

Yeah, it's too bad it's not a bit more complete, but it's definitely
an interesting starting point for a lot of future work.

Again, thanks for the feedback (and keep it coming!).

Joel Martin (kanaka)