On Wed, 02 Jan 2013 19:12:21 +0000 Chris Wilson <ch...@chris-wilson.co.uk> wrote:
> On Wed, 02 Jan 2013 19:40:58 +0100, sandm...@cs.au.dk (Søren Sandmann) wrote:
> > Chris Wilson <ch...@chris-wilson.co.uk> writes:
> >
> > > This path is being exercised by inplace compositing of trapezoids, for
> > > instance as used in the firefox-asteroids cairo-trace.
> > >
> > > core2 @ 2.66GHz,
> > >
> > > reference memcpy speed = 4898.2MB/s (1224.6MP/s for 32bpp fills)
> > >
> > > before: add_n_8888 = L1:   4.36  L2:   4.27  M:  1.61 ( 0.13%)  HT:  1.65  VT:  1.63  R:  1.63  RT:  1.59 (  21Kops/s)
> > >
> > > after:  add_n_8888 = L1:2969.09  L2:3926.11  M:603.30 (49.27%)  HT:524.69  VT:401.01  R:407.59  RT:210.34 ( 804Kops/s)
> >
> > Just two brief comments, and then I'll disappear again (until the 11th
> > or so):
> >
> > - It looks like this function will work for abgr destinations as well as
> >   argb.
> >
> > - I'm surprised that the new function is _that_ much better. The current
> >   code should hit an SSE2 combiner and noop iterators for both source
> >   and destination, so while I'd expect a solid improvement from a
> >   dedicated fast path, it is hard to believe that it would be 919 times
> >   faster than the old. If these numbers are real, there has to be
> >   something wrong with either the benchmark or the current code.
>
> Judging from the perf profile of cairo-traces, the delta is closer to 5x.
> All I did to gather the numbers was to run
>
>     ./test/lowlevel-blt-bench -n add_n_8888
>
> which is dominated by general_composite_rect:
>
>     if (repeat == PIXMAN_REPEAT_NORMAL)
>     {
>         while (*c >= size)
>             *c -= size;
>         while (*c < 0)
>             *c += size;
>     }
>
> Special-casing size == 1 there boosts the L1 results from 4 to 70, but it
> is still surprising that we hit that path at all.

The "-n" option benchmarks nearest scaling. In this particular test we hit
the general scaled fetch for a "1x1 image with normal repeat", which is
equivalent to "solid". It is impressive how you managed to find such a
poorly performing case, and it exposes quite a number of issues.
The most interesting is the use of the repeat() inline function:

    http://cgit.freedesktop.org/pixman/tree/pixman/pixman-inlines.h?id=pixman-0.28.2#n56

which is called from bits_image_fetch_affine_no_alpha() for fetching
individual pixels:

    http://cgit.freedesktop.org/pixman/tree/pixman/pixman-bits-image.c?id=pixman-0.28.2#n459

The problem here is that this repeat() function is optimized for quickly
returning the current pixel position back into image bounds when walking
sequentially over pixels, and so it tries to avoid the expensive modulo
operator. But it also happens to get used for fetching individual pixels
at arbitrarily large coordinates on the general path, where the loop may
run for a truly ridiculous number of iterations, so a 5x or 900x slowdown
is not so surprising here.

> Ah, read the options to lowlevel-blt-bench wrong...
>
>     ./test/lowlevel-blt-bench add_n_8888:
>     add_n_8888 = L1:1131.58  L2:1112.37  M:530.11 (43.24%)  HT:108.01  VT: 99.03  R: 90.03  RT: 25.11 ( 306Kops/s)

This surely looks more reasonable. And maybe adding a shortcut to the
combiner for skipping the mask would bring it somewhat closer to the
performance of the new sse2_add_n_8888 fast path:

> > > after:  add_n_8888 = L1:2969.09  L2:3926.11  M:603.30 (49.27%)  HT:524.69  VT:401.01  R:407.59  RT:210.34 ( 804Kops/s)

-- 
Best regards,
Siarhei Siamashka

_______________________________________________
Pixman mailing list
Pixman@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/pixman