Re: [Pixman] Benchmarked: [PATCH 1/4] Change conditions for setting FAST_PATH_SAMPLES_COVER_CLIP flags

2015-09-21 Thread Pekka Paalanen
On Sun, 20 Sep 2015 15:47:01 +0300
Oded Gabbay  wrote:

> On Sun, Sep 20, 2015 at 1:22 PM, Oded Gabbay  wrote:
> > On Wed, Sep 16, 2015 at 2:25 PM, Pekka Paalanen  wrote:
> >> The series as benchmarked is available at:
> >> https://git.collabora.com/cgit/user/pq/pixman.git/log/?h=cover-benchmark-1
> >>
> >> The benchmark points are:
> >>
> >> - baseline: "test: Add cover-test v5"
> >>
> >> - cleanup: "affine-bench: remove 8e margin from COVER area"
> >> Includes the 8e extra safety margin removal.
> >>
> >> - tight: "pixman-fast-path: Make bilinear cover fetcher use
> >> COVER_CLIP_TIGHT flag"
> >> Includes all the COVER_CLIP_BILINEAR related patches from
> >> Ben.

> > I decided to also run the cairo trimmed benchmarks on my POWER8
> > ppc64le and POWER7 ppc64.
> > To make things clearer, I used the same definitions for "baseline",
> > "cleanup" and "tight".
> >
> > I used Cairo version 1.14.3, actually from git with head set to 6f7a9b4
> > I run the benchmarks doing (it's from inside a script):
> > "cairo-perf-trace benchmark -r -i8 > ../${__output}.perf"

> Please disregard the email above - the results there are bogus because
> my server is inside a VM!!!
> 
> After I sent my email, I run the cleanup version 5 times in a row. The
> first 4 times were identical, but the 5th time showed major slowdown.
> 
> I also run the tight version 5 times in a row. The 2nd run showed
> major improvement over the 1st run, the 3rd run showed an additional
> improvement on top of that, and the 4th and 5th runs were identical to
> the 3rd run.
> 
> On the one hand, I'm not running anything else on this server. On the
> other hand, this is a VM, so maybe the host machine is
> over-subscribed.
> 
> I then went to test it on a physical server without VM (the ppc64
> version). I run cleanup and tight 5 times, and all results were
> identical. So I think the issue is definitely with the VM.
> 
> And as for the results, I'm happy to say that there is no change
> between cleanup and tight :)

Hi Oded,

heh, yeah, double-checking is worth it. :-)

Thank you for confirming my results, even though that's not exactly
good news for the patches. ;-) Not bad either.

I will be trying to record a Cairo trace that would represent the usage
patterns we saw in Raspbian which is the root of Ben's efforts.


Thanks,
pq


pgpwNxZCi5PLK.pgp
Description: OpenPGP digital signature
___
Pixman mailing list
Pixman@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/pixman


Re: [Pixman] Benchmarked: [PATCH 1/4] Change conditions for setting FAST_PATH_SAMPLES_COVER_CLIP flags

2015-09-20 Thread Oded Gabbay
On Wed, Sep 16, 2015 at 2:25 PM, Pekka Paalanen  wrote:
> On Fri,  4 Sep 2015 03:09:20 +0100
> Ben Avison  wrote:
>
>> As discussed in
>> http://lists.freedesktop.org/archives/pixman/2015-August/003905.html
>>
>> the 8 * pixman_fixed_e adjustment which was applied to the transformed
>> coordinates is a legacy of rounding errors which used to occur in old
>> versions of Pixman, but which no longer apply. For any affine transform,
>> you are now guaranteed to get the same result by transforming the upper
>> coordinate as though you transform the lower coordinate and add (size-1)
>> steps of the increment in source coordinate space. No projective
>> transform routines use the COVER_CLIP flags, so they cannot be affected.
>
> Hi all,
>
> as we doing these things not just for cleaning up but with the premise
> that there are missed optimization opportunities, I have benchmarked
> this patch series.
>
> The series as benchmarked is available at:
> https://git.collabora.com/cgit/user/pq/pixman.git/log/?h=cover-benchmark-1
>
> The benchmark points are:
>
> - baseline: "test: Add cover-test v5"
>
> - cleanup: "affine-bench: remove 8e margin from COVER area"
> Includes the 8e extra safety margin removal.
>
> - tight: "pixman-fast-path: Make bilinear cover fetcher use
> COVER_CLIP_TIGHT flag"
> Includes all the COVER_CLIP_BILINEAR related patches from
> Ben.
>
> Note, that ssse3_iters[] in pixman-ssse3.c still contains
> FAST_PATH_SAMPLES_COVER_CLIP_BILINEAR.
>
> Cairo version is 1.14.2 for the benchmarks, which are run like:
> $ CAIRO_TEST_TARGET=image cairo-perf-trace -r -v -i8 > baseline-image-2.txt
>
> I tried both "image" and "image16" on an x86_64 (Sandybridge), and got
> no performance differences in the trimmed-cairo-traces set in either
> baseline/cleanup or cleanup/tight.
>
> I also tried with PIXMAN_DISABLE=ssse3 and still got no difference. I
> verified I am really running what I think I am by editing Pixman and
> seeing the effect in the benchmark.
>
> Am I missing something?
>
> I thought we would see at least some improvements also on x86_64 when
> comparing cleanup/tight.
>
> Should I run the same on rpi2? Or is the best effect on the fast paths
> we haven't merged yet?
>
> I'd rather not run this on rpi1 due to the function address /
> performance quirk, doing the required iterations there would probably
> take too long and I'd need to rearrange the result files too.
>
> Or maybe our test set is not enough? I recall having some problems with
> that in the past.
>
> So, I patched Pixman to yell whenever TIGHT is set but
> COVER_CLIP_BILINEAR is not set. Only t-firefox-canvas-swscroll and
> t-firefox-fishtank hit it with source image, each twice per iteration.
> Definitely seems like this test set is not hitting the cases we are
> interested in. I think I need to dig up our old performance profiles
> and see if we could record a trace from a real app that would hit these
> cases, now that Cairo's trace recording is supposedly fixed.
>
> The removal of the 8e extra safety margins shouldn't need performance
> profiles as justification, but for the tightening patches they'd be
> nice to have, especially since the usefulness of them has been
> questioned.
>
>
> Thanks,
> pq

Hi Pekka, Ben

I decided to also run the cairo trimmed benchmarks on my POWER8
ppc64le and POWER7 ppc64.
To make things clearer, I used the same definitions for "baseline",
"cleanup" and "tight".

I used Cairo version 1.14.3, actually from git with head set to 6f7a9b4
I run the benchmarks doing (it's from inside a script):
"cairo-perf-trace benchmark -r -i8 > ../${__output}.perf"

First of all, diff between baseline/cleanup showed no change, in both
platforms, so that's good :)

Now, for cleanup/tight:

With POWER8 ppc64le, I got the following very modest boost:

imaget-firefox-asteroids  483.10 (523.85 3.49%) -> 452.84
(480.34 3.16%):  1.07x speedup
image   t-firefox-chalkboard  691.38 (692.09 0.06%) -> 653.07
(654.60 0.26%):  1.06x speedup

However, with POWER7 ppc64, I got the following regressions, which is quite bad:

imaget-firefox-asteroids  545.55 (559.64 1.79%) -> 781.07
(791.83 2.33%):  1.43x slowdown
imaget-firefox-scrolling  1185.45 (1186.02 0.05%) -> 1748.76
(1754.85 0.20%):  1.48x slowdown
image   t-firefox-chalkboard  1444.76 (1464.55 0.88%) -> 2315.76
(2333.10 0.34%):  1.60x slowdown
imaget-firefox-paintball  681.43 (682.28 0.10%) -> 1138.15
(1140.19 0.08%):  1.67x slowdown
image   t-firefox-canvas  890.14 (890.90 0.10%) -> 1492.83
(1493.51 0.20%):  1.68x slowdown
image  t-firefox-canvas-swscroll  1369.94 (1371.66 0.05%) -> 2297.53
(2305.70 0.18%):  1.68x slowdown
imaget-xfce4-terminal-a1  829.35 (832.39 0.16%) -> 1392.50
(1414.69 1.08%):  1.68x slowdown
image t-firefox-fishbowl  3112.93 (3114.13 0.02%) -> 5227.18
(5229.05 0.03%):  1.68x slowdown
image  t-poppler  

Re: [Pixman] Benchmarked: [PATCH 1/4] Change conditions for setting FAST_PATH_SAMPLES_COVER_CLIP flags

2015-09-20 Thread Oded Gabbay
On Sun, Sep 20, 2015 at 1:22 PM, Oded Gabbay  wrote:
> On Wed, Sep 16, 2015 at 2:25 PM, Pekka Paalanen  wrote:
>> On Fri,  4 Sep 2015 03:09:20 +0100
>> Ben Avison  wrote:
>>
>>> As discussed in
>>> http://lists.freedesktop.org/archives/pixman/2015-August/003905.html
>>>
>>> the 8 * pixman_fixed_e adjustment which was applied to the transformed
>>> coordinates is a legacy of rounding errors which used to occur in old
>>> versions of Pixman, but which no longer apply. For any affine transform,
>>> you are now guaranteed to get the same result by transforming the upper
>>> coordinate as though you transform the lower coordinate and add (size-1)
>>> steps of the increment in source coordinate space. No projective
>>> transform routines use the COVER_CLIP flags, so they cannot be affected.
>>
>> Hi all,
>>
>> as we doing these things not just for cleaning up but with the premise
>> that there are missed optimization opportunities, I have benchmarked
>> this patch series.
>>
>> The series as benchmarked is available at:
>> https://git.collabora.com/cgit/user/pq/pixman.git/log/?h=cover-benchmark-1
>>
>> The benchmark points are:
>>
>> - baseline: "test: Add cover-test v5"
>>
>> - cleanup: "affine-bench: remove 8e margin from COVER area"
>> Includes the 8e extra safety margin removal.
>>
>> - tight: "pixman-fast-path: Make bilinear cover fetcher use
>> COVER_CLIP_TIGHT flag"
>> Includes all the COVER_CLIP_BILINEAR related patches from
>> Ben.
>>
>> Note, that ssse3_iters[] in pixman-ssse3.c still contains
>> FAST_PATH_SAMPLES_COVER_CLIP_BILINEAR.
>>
>> Cairo version is 1.14.2 for the benchmarks, which are run like:
>> $ CAIRO_TEST_TARGET=image cairo-perf-trace -r -v -i8 > baseline-image-2.txt
>>
>> I tried both "image" and "image16" on an x86_64 (Sandybridge), and got
>> no performance differences in the trimmed-cairo-traces set in either
>> baseline/cleanup or cleanup/tight.
>>
>> I also tried with PIXMAN_DISABLE=ssse3 and still got no difference. I
>> verified I am really running what I think I am by editing Pixman and
>> seeing the effect in the benchmark.
>>
>> Am I missing something?
>>
>> I thought we would see at least some improvements also on x86_64 when
>> comparing cleanup/tight.
>>
>> Should I run the same on rpi2? Or is the best effect on the fast paths
>> we haven't merged yet?
>>
>> I'd rather not run this on rpi1 due to the function address /
>> performance quirk, doing the required iterations there would probably
>> take too long and I'd need to rearrange the result files too.
>>
>> Or maybe our test set is not enough? I recall having some problems with
>> that in the past.
>>
>> So, I patched Pixman to yell whenever TIGHT is set but
>> COVER_CLIP_BILINEAR is not set. Only t-firefox-canvas-swscroll and
>> t-firefox-fishtank hit it with source image, each twice per iteration.
>> Definitely seems like this test set is not hitting the cases we are
>> interested in. I think I need to dig up our old performance profiles
>> and see if we could record a trace from a real app that would hit these
>> cases, now that Cairo's trace recording is supposedly fixed.
>>
>> The removal of the 8e extra safety margins shouldn't need performance
>> profiles as justification, but for the tightening patches they'd be
>> nice to have, especially since the usefulness of them has been
>> questioned.
>>
>>
>> Thanks,
>> pq
>
> Hi Pekka, Ben
>
> I decided to also run the cairo trimmed benchmarks on my POWER8
> ppc64le and POWER7 ppc64.
> To make things clearer, I used the same definitions for "baseline",
> "cleanup" and "tight".
>
> I used Cairo version 1.14.3, actually from git with head set to 6f7a9b4
> I run the benchmarks doing (it's from inside a script):
> "cairo-perf-trace benchmark -r -i8 > ../${__output}.perf"
>
> First of all, diff between baseline/cleanup showed no change, in both
> platforms, so that's good :)
>
> Now, for cleanup/tight:
>
> With POWER8 ppc64le, I got the following very modest boost:
>
> imaget-firefox-asteroids  483.10 (523.85 3.49%) -> 452.84
> (480.34 3.16%):  1.07x speedup
> image   t-firefox-chalkboard  691.38 (692.09 0.06%) -> 653.07
> (654.60 0.26%):  1.06x speedup
>
> However, with POWER7 ppc64, I got the following regressions, which is quite 
> bad:
>
> imaget-firefox-asteroids  545.55 (559.64 1.79%) -> 781.07
> (791.83 2.33%):  1.43x slowdown
> imaget-firefox-scrolling  1185.45 (1186.02 0.05%) -> 1748.76
> (1754.85 0.20%):  1.48x slowdown
> image   t-firefox-chalkboard  1444.76 (1464.55 0.88%) -> 2315.76
> (2333.10 0.34%):  1.60x slowdown
> imaget-firefox-paintball  681.43 (682.28 0.10%) -> 1138.15
> (1140.19 0.08%):  1.67x slowdown
> image   t-firefox-canvas  890.14 (890.90 0.10%) -> 1492.83
> (1493.51 0.20%):  1.68x slowdown
> image  t-firefox-canvas-swscroll  1369.94 (1371.66 0.05%) -> 2297.53
> (2305.70 0.18%):  1.68x slowdown
> image  

Re: [Pixman] Benchmarked: [PATCH 1/4] Change conditions for setting FAST_PATH_SAMPLES_COVER_CLIP flags

2015-09-20 Thread Ben Avison

On Fri, 18 Sep 2015 19:35:02 +0100, Bill Spitzak  wrote:


On Wed, Sep 16, 2015 at 7:30 AM, Ben Avison  wrote:


Just by thinking things through, I realised that we would regularly
fail to hit COVER paths if Pixman's caller set the scale factors such
that the centre of the outermost destination pixels aligned with the
centre of the outermost source pixels. There has been some argument
about whether this is representative of how Pixman should be used. I
happen to think this is a perfectly reasonable thing to expect Pixman
to support, but there are other models you can follow, notably setting
the scale factors such that the outer edges of the outermost
destination pixels align to the outer edges of the outermost source
pixels. If the Cairo traces are using this latter model, then it's
understandable if you aren't hitting the edge case that I'm concerned
about very often.


There are currently very good reasons for clients to use the first
model: it is a method of removing undesirable "fuzzy edges" from zoomed-
in images.

The alternative of scaling the outer edge, while not hitting this fast
path, will not hit the other fast path either! (the other fast path is
the one that allows a zero-weighted pixel at one end to be read from
the image).


I'm generally in agreement with you. For the sake of completeness though,
I should point out that when zooming *out*, both models will match the
conditions for COVER paths. But then again, when you're zooming out,
particularly by a large factor, it is more appropriate to use a multi-tap
filter rather than bilinear scaling, so maybe that's not much of an
argument.


NOTE: the method of scaling the centers of the pixels is generally
wrong. The math is wonky: does scaling an image by 2 produce a 2*w-1
image, or does it spread the pixels slightly more than 2 apart?
Programs are going to disagree.


I'd suggest that if the highest layer software specifies a particular
scale factor, then that's what should be used, but if it requests a
particular destination size (e.g. "full screen") then that should be the
primary consideration even if it means the corresponding scale factor is
not the "obvious" one.

But this is all really a concern for the layers above Pixman, because the
starting fractional position and per-pixel increment is part of the
Pixman API. Pixman just has to do whatever has been asked of it. Just
because some of the ways the layers above Pixman do their calculations
are impossible for Pixman to optimise further shouldn't be a reason for
Pixman not to optimise the cases that it can do, especially when that
set of cases include many that use a scale factor of 1 on one axis.

Ben
___
Pixman mailing list
Pixman@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/pixman


Re: [Pixman] Benchmarked: [PATCH 1/4] Change conditions for setting FAST_PATH_SAMPLES_COVER_CLIP flags

2015-09-18 Thread Bill Spitzak
On Wed, Sep 16, 2015 at 7:30 AM, Ben Avison  wrote:

Just by thinking things through, I realised that we would regularly fail
> to hit COVER paths if Pixman's caller set the scale factors such that the
> centre of the outermost destination pixels aligned with the centre of the
> outermost source pixels. There has been some argument about whether this
> is representative of how Pixman should be used. I happen to think this is
> a perfectly reasonable thing to expect Pixman to support, but there are
> other models you can follow, notably setting the scale factors such that
> the outer edges of the outermost destination pixels align to the outer
> edges of the outermost source pixels. If the Cairo traces are using this
> latter model, then it's understandable if you aren't hitting the edge
> case that I'm concerned about very often.
>

There are currently very good reasons for clients to use the first model:
it is a method of removing undesirable "fuzzy edges" from zoomed-in images.

The alternative of scaling the outer edge, while not hitting this fast
path, will not hit the other fast path either! (the other fast path is the
one that allows a zero-weighted pixel at one end to be read from the image).

Therefore I believe there is actual usage of this new fast path, while
there is almost zero usage of the old fast path. This is the main reason I
think the current flag should be removed, and a new one indicating that all
the non-zero samples are inside the image, added. This second one will
actually be used by real code.

NOTE: the method of scaling the centers of the pixels is generally wrong.
The math is wonky: does scaling an image by 2 produce a 2*w-1 image, or
does it spread the pixels slightly more than 2 apart? Programs are going to
disagree. And you have to alter the math as soon as the scale goes below 1
or you will get fuzzy edges again. And lots of code are going to get
coordinates in the resulting image wrong, resulting in annoying problems
like overlaid lines not lining up or shifting their alignment from one side
to another.

I very much believe Cairo should change how CAIRO_EXTEND_NONE is done. It
should instead mean that the path is intersected with a quadrilateral that
is the transform of the outer edge of the source, then drawing as though
using CAIRO_EXTEND_REPEAT. This is what most users of scaling expect, and
anybody wanting the previous effect can get it by adding a black pixel to
the outer edge of their source surface.

Now it is true that this result can be achieved with Cairo right now. The
reason for making this the default is because non-experts are unlikely to
figure this out. Currently filling a path equal to the source boundary
results in a double-mulitplication of the edge pixels, a subtle error that
can be very frustrating when graphics gets more complex. Making this error
not happen by default would be very helpful. I also believe making this the
default will result in faster implementations, as EXTEND_REPEAT fast paths
can be used, and this is often directly supported by graphics hardware.
___
Pixman mailing list
Pixman@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/pixman


Re: [Pixman] Benchmarked: [PATCH 1/4] Change conditions for setting FAST_PATH_SAMPLES_COVER_CLIP flags

2015-09-16 Thread Ben Avison

On Wed, 16 Sep 2015 12:25:59 +0100, Pekka Paalanen  wrote:

I tried both "image" and "image16" on an x86_64 (Sandybridge), and got
no performance differences in the trimmed-cairo-traces set in either
baseline/cleanup or cleanup/tight.

I also tried with PIXMAN_DISABLE=ssse3 and still got no difference. I
verified I am really running what I think I am by editing Pixman and
seeing the effect in the benchmark.

Am I missing something?


Well, there has been a long and tangled history behind this change at
this point. I'll admit my twin motivations were code cleanup and making
it possible to measure the speed of bilinear operations using a
benchmarker, but it's also ended up spawning the creation of affine-bench
and cover-test, which isn't a bad thing.

Just by thinking things through, I realised that we would regularly fail
to hit COVER paths if Pixman's caller set the scale factors such that the
centre of the outermost destination pixels aligned with the centre of the
outermost source pixels. There has been some argument about whether this
is representative of how Pixman should be used. I happen to think this is
a perfectly reasonable thing to expect Pixman to support, but there are
other models you can follow, notably setting the scale factors such that
the outer edges of the outermost destination pixels align to the outer
edges of the outermost source pixels. If the Cairo traces are using this
latter model, then it's understandable if you aren't hitting the edge
case that I'm concerned about very often.

The useful thing about the scaled-in-one-axis-only example is that on the
axis that isn't scaled, there is no dispute about how you set the scale
factors, so it sidesteps this argument.


Should I run the same on rpi2? Or is the best effect on the fast paths
we haven't merged yet?


The get a sizeable speed difference, you need two things:
1) a plot geometry that aligns the centres of the high-coordinate pixels
of the source and destination
2) a reasonable speed difference between COVER and repeat fast paths or
iterators

At present, the speed difference will be most marked when you are using
the bilinear iterators from pixman-fast-path.c or pixman-ssse3.c, which
are both currently restricted to COVER plots of a8r8g8b8 source images
only. If you test on a Raspberry Pi 2, then you also need to allow for
the fact that the ARMv7 implementation has bilinear fast paths for
src__, src__0565, over__ and add__, for most
types of image repeat as well as for COVER, so these reduce the
applicability of the iterator still further.

In my patch series from last year, I added ARMv6 bilinear iterators for
COVER plots of a8r8g8b8, x8r8g8b8, r5g6b5 and a8 images. Of course, you
won't be seeing the effect of these because they haven't been accepted
yet - but when (I hope) they are then having the new COVER_CLIP
definition in place should help demonstrate their effectiveness.
Basically, these are all just pieces of one big jigsaw puzzle.


Or maybe our test set is not enough? I recall having some problems with
that in the past.


Yes, I know I have a number of fast paths queued up which were not
represented at all in the Cairo perf traces, but were being hammered by
the Epiphany browser on the Raspberry Pi. I'm not sure how we justify
their inclusion if the Cairo perf traces is the only criterion allowed. I
hope there will be some flexibility in this respect.

If you want a more real-world example of when you might encounter
bilinear scaling in one axis only, here's one that I think is plausible.
Consider screen grabs of taken of an NTSC SD video at ITU656 sample
positions (720x480) - or equally the output of a codec for a video at
that resolution. Now try to display it with bilinear scaling at the
correct aspect ratio on a display with square pixels: the destination
rectangle will likely be chosen to be 640x480, with a vertical pixel
increment of 1 and a horizontal pixel increment of 1.125. There's also a
reasonable chance that you'll also be wanting to re-plot this 30 times
per second, for obvious reasons. You'd hope that this could be achieved
using a COVER fast path or iterator, but with the current flag
definitions, they can't be used.

Ben
___
Pixman mailing list
Pixman@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/pixman