On Wed, 02 Jan 2013 19:40:58 +0100, sandm...@cs.au.dk (Søren Sandmann) wrote:
> Chris Wilson <ch...@chris-wilson.co.uk> writes:
>
> > This path is being exercised by inplace compositing of trapezoids, for
> > instance as used in the firefox-asteroids cairo-trace.
> >
> > core2 @ 2.66GHz,
> >
> > reference memcpy speed = 4898.2MB/s (1224.6MP/s for 32bpp fills)
> >
> > before: add_n_8888 = L1:   4.36  L2:   4.27  M:  1.61 (  0.13%)  HT:  1.65
> > VT:  1.63  R:  1.63  RT:  1.59 (  21Kops/s)
> >
> > after:  add_n_8888 = L1:2969.09  L2:3926.11  M:603.30 ( 49.27%)  HT:524.69
> > VT:401.01  R:407.59  RT:210.34 ( 804Kops/s)
>
> Just two brief comments, and then I'll disappear again (until the 11th
> or so):
>
> - It looks like this function will work for abgr destinations as well as
>   argb.
>
> - I'm surprised that the new function is _that_ much better. The current
>   code should hit an SSE2 combiner and noop iterators for both source
>   and destination, so while I'd expect a solid improvement from a
>   dedicated fast path, it is hard to believe that it would be 919 times
>   faster than the old. If these numbers are real, there has to be
>   something wrong with either the benchmark or the current code.

Judging from the perf profile of cairo-traces, the delta is closer to 5x.
All I did to gather the numbers was to run

    ./test/lowlevel-blt-bench -n add_n_8888

which is dominated by general_composite_rect:

    if (repeat == PIXMAN_REPEAT_NORMAL)
    {
        while (*c >= size)
            *c -= size;
        while (*c < 0)
            *c += size;
    }

Special-casing size==1 there boosts the L1 results from 4 to 70, but it is
still surprising that we hit that path at all.

Ah, I read the options to lowlevel-blt-bench wrong...

./test/lowlevel-blt-bench add_n_8888:

    add_n_8888 = L1:1131.58  L2:1112.37  M:530.11 ( 43.24%)  HT:108.01
    VT: 99.03  R: 90.03  RT: 25.11 ( 306Kops/s)

-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
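
For illustration only, here is a minimal sketch of what the size==1
special case mentioned above could look like. This is not the actual
pixman code or patch: the function name wrap_coordinate is hypothetical,
and it returns the wrapped value instead of updating *c in place. It
assumes size > 0.

```c
/* Hypothetical helper, not part of pixman: wrap a coordinate into
 * [0, size), in the manner of the PIXMAN_REPEAT_NORMAL loops quoted
 * above. Assumes size > 0. */
static int
wrap_coordinate (int c, int size)
{
    /* A 1-pixel-wide (solid) source: every coordinate maps to 0, so
     * the subtract/add loops can be skipped entirely. Without this
     * shortcut the loops below run |c| iterations when size is 1. */
    if (size == 1)
        return 0;

    /* General case: same loops as in the quoted snippet. */
    while (c >= size)
        c -= size;
    while (c < 0)
        c += size;

    return c;
}
```

With size == 1 the loops would otherwise execute once per unit of the
coordinate, which would be consistent with the cost seen in the profile;
the early return makes that case constant time.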