On Wed, May 9, 2012 at 7:57 PM, Søren Sandmann <sandm...@cs.au.dk> wrote:
> Matt Turner <matts...@gmail.com> writes:
>
>> I started porting my src_8888_0565 MMX function to SSE2, and in the
>> process started thinking about using SSE3+. The useful instructions
>> added post SSE2 that I see are
>>     SSE3:  lddqu - for unaligned loads across cache lines
>
> I don't really understand that instruction. Isn't it identical to
> movdqu? Or is the idea that lddqu is faster than movdqu for cache line
> splits, but slower for plain old, non-cache split unaligned loads?
>
>>     SSSE3: palignr - for unaligned loads (but requires software
>>                      pipelining...)
>>            pmaddubsw - maybe?
>
> pmaddubsw would be very useful for bilinear interpolation if we drop
> coordinate precision to 7 bits instead of the current 8. One example way
> to use it is to put 7-bit values of (1 - x, x, 1 - x, x) in a register,
> and interleave the top/left and top/right pixels in another. pmaddubsw on
> those two registers will then produce a linear interpolation between the
> two top pixels. A similar thing can be done for the bottom pixels, and
> then the intermediate results can be interleaved and combined using
> pmaddwd.
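To make the arithmetic of the quoted scheme concrete, here is a rough scalar model of it for a single colour channel. This is only a sketch and not pixman code: the names maddubs_lane, maddwd_lane, lerp_h and bilinear_channel are made up for illustration, the weights are assumed to be (128 - x, x) with x in 0..127, and the result is truncated rather than rounded. One detail that shows up immediately is that the weight 128 (needed when x is 0) does not fit in the signed byte operand of pmaddubsw, so the sketch simply sidesteps that case; a real implementation would need its own answer.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of one 16-bit lane of pmaddubsw: two unsigned bytes are
 * multiplied by two signed bytes and the products are added with signed
 * 16-bit saturation. */
static int16_t maddubs_lane(uint8_t u0, int8_t s0, uint8_t u1, int8_t s1)
{
    int32_t sum = (int32_t)u0 * s0 + (int32_t)u1 * s1;
    if (sum > INT16_MAX) sum = INT16_MAX;
    if (sum < INT16_MIN) sum = INT16_MIN;
    return (int16_t)sum;
}

/* Scalar model of one 32-bit lane of pmaddwd: two signed 16-bit products
 * are added into a 32-bit result. */
static int32_t maddwd_lane(int16_t a0, int16_t b0, int16_t a1, int16_t b1)
{
    return (int32_t)a0 * b0 + (int32_t)a1 * b1;
}

/* Horizontal step: left * (128 - x) + right * x, with a 7-bit x.
 * The maximum value is 255 * 128 = 32640, so it fits in int16_t. */
static int16_t lerp_h(uint8_t left, uint8_t right, unsigned x)
{
    assert(x <= 127);
    if (x == 0)
        return (int16_t)(left * 128); /* weight 128 does not fit in int8_t */
    return maddubs_lane(left, (int8_t)(128 - x), right, (int8_t)x);
}

/* Full bilinear interpolation of one channel: two horizontal steps
 * (pmaddubsw), then a vertical step on the 16-bit intermediates
 * (pmaddwd). The weights multiply out to 128 * 128 = 1 << 14. */
static uint8_t bilinear_channel(uint8_t tl, uint8_t tr,
                                uint8_t bl, uint8_t br,
                                unsigned x, unsigned y)
{
    assert(y <= 127);
    int16_t top = lerp_h(tl, tr, x);
    int16_t bottom = lerp_h(bl, br, x);
    int32_t v = maddwd_lane(top, (int16_t)(128 - y), bottom, (int16_t)y);
    return (uint8_t)(v >> 14); /* truncating, no rounding term */
}
```

For example, bilinear_channel(0, 255, 0, 255, 64, 0) gives 127: halfway between 0 and 255, truncated. In a SIMD version each of these lanes would be processed in parallel, with the pixel bytes interleaved as described in the quoted text.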
I would say that improving bilinear scaling performance on x86 is
really important for pixman in order to remain competitive. The
following link might be a good source of inspiration:
    http://www.hackermusings.com/2012/05/firefoxs-graphics-performance-on-x11/

The comments with the azure backend performance numbers are
particularly interesting. For example, one of them mentions 12fps with
xrender disabled (using pixman?) vs. 15fps with azure canvas enabled
(using skia?) for FishIETank. Needless to say, it would be nice to
improve pixman performance by 30% or more.

And here are some benchmarks for the firefox-fishtank trace with
pixman-0.25.2, comparing NEON vs. SSE2 on ARM Cortex-A8 and Intel Atom
(both are superscalar dual-issue in-order cores):

=== ARM Cortex-A8 @1GHz ===

CC=gcc-4.5.3 CFLAGS="-O2 -mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp"
[  0]    image         firefox-fishtank  359.228  359.436   0.43%    3/3

CC=gcc-4.5.3 CFLAGS="-O2 -mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp -mthumb"
[  0]    image         firefox-fishtank  347.195  347.773   0.12%    3/3

=== Intel Atom N450 @1.67GHz ===

CC=gcc-4.5.3 CFLAGS="-O2 -g -march=atom -mtune=atom"
[  0]    image         firefox-fishtank  308.439  308.881   0.09%    3/3

CC=gcc-4.5.3 CFLAGS="-O2 -g -march=atom -mtune=atom -m32"
[  0]    image         firefox-fishtank  309.457  309.568   0.07%    3/3

CC=gcc-4.5.3 CFLAGS="-O2"
[  0]    image         firefox-fishtank  345.906  346.156   0.04%    3/3

CC=gcc-4.5.3 CFLAGS="-O2 -mtune=generic"
[  0]    image         firefox-fishtank  345.367  345.900   0.09%    3/3

The results for gcc-4.7.0 were nearly the same.

Currently a 1GHz ARM Cortex-A8 is almost as fast as a 1.67GHz Atom. The
ARM NEON bilinear code uses 8-bit multiplications. Atom could use
PMADDUBSW to also benefit from 8-bit multiplications and improve
performance per MHz.

-- 
Best regards,
Siarhei Siamashka