On Mon, Aug 24, 2015 at 7:04 PM, Ben Avison <bavi...@riscosopen.org> wrote: > > On Mon, 24 Aug 2015 15:20:22 +0100, Oded Gabbay <oded.gab...@gmail.com> wrote: >> >> I tested the patch on POWER8, ppc64le. >> make check passes, but when I benchmarked the performance using >> lowlevel-blt-bench over_n_8888, I got far worse results than without >> this patch (see numbers below). >> Apparently, the new C fast-path takes precedence over some vmx combine >> fast-paths, thus making performance worse instead of better. > > > If that's true, it wouldn't be the first time that this issue has arisen. > Pixman is designed to scan all fast paths (i.e. routines which take > source, mask and destination images and perform the whole operation > themselves) irrespective of how well tuned they are for a given platform, > before it starts looking at iter routines (each of which reads or writes > only one of source, mask or destination). > > Previously, in response to my attempt to work around a case where this > was happening, Søren wrote: >> >> [...] the longstanding problem that pixman sometimes won't >> pick the fastest code for some operations. In this case the general >> iter based code will be faster then the C code in pixman-fast-path.c >> because the iter code will use assembly fetchers. >> As a result you end up with a bunch of partial reimplementations of >> general_composite_rect() inside pixman-arm-simd.c. >> Maybe we need to admit failure and make general_composite_rect() >> available for other implementations to use. Essentially we would >> officially provide a way for implementations to say: My iterators are >> faster than pixman-fast-path.c for these specific operations, so just >> go directly to general_composite_rect(). >> It's definitely not a pleasant thing to do, but given that nobody is >> likely to have the time/skill combination required to do the necessary >> redesign of pixman's fast path system, it might still be preferable to >> to do this instead of duplicating the code like these patches do. >> With a setup like that, we could also fix the same issue for the >> bilinear SSSE3 fetchers and possibly other cases. > > > (ref http://lists.freedesktop.org/archives/pixman/2014-October/003457.html) > > I can't say that any cleaner solution has occurred to me since then.
I think the more immediate solution, as Soren have suggested on IRC, is for me to implement the equivalent fast-path in VMX. I see that it is already implemented in mmx, sse2, mips-dspr2 and arm-neon. From looking at the C code, I'm guessing that it is fairly simple to implement. > > I just had a quick look at the VMX source file, and it has hardly any > iters defined. My guess would be that what's being used is > > noop_init_solid_narrow() from pixman-noop.c > _pixman_iter_get_scanline_noop() from pixman-utils.c > combine_src_u() from pixman-combine32.c > > this last one would be responsible for the the bulk of the runtime - and > it palms most of the work off on memcpy(). Presumably the PPC memcpy is > super-optimised? I run perf on lowlevel-blt-bench over_n_8888 and what I got is: - 48.71% 48.68% lowlevel-blt-be lowlevel-blt-bench [.] vmx_combine_over_u_no_mask - vmx_combine_over_u_no_mask - pixman_image_composite32 - 98.87% pixman_image_composite_wrapper + 35.32% bench_L + 30.58% bench_M + 11.71% bench_RT + 8.77% bench_R + 7.48% bench_HT + 6.13% bench_VT Next major function was: 16.78% 16.72% lowlevel-blt-be libc-2.17.so [.] __memcpy_power7 > >> Without the patch: >> >> reference memcpy speed = 25711.7MB/s (6427.9MP/s for 32bpp fills) >> --- >> over_n_8888: PIXMAN_OP_OVER, src a8r8g8b8 solid, mask null, dst a8r8g8b8 >> --- >> over_n_8888 = L1: 572.29 L2:1038.08 M:1104.10 ( >> 17.18%) HT:447.45 VT:520.82 R:407.92 RT:148.90 (1100Kops/s) > > > There's something a bit odd about that - it's slower when working within > the caches (especially the L1 cache) than when operating on main memory. > I'd hazard a guess that memcpy() is using some hardware acceleration that > lives between the L1 and L2 caches. > > Presumably for patch 3 of this series (over_n_0565) you wouldn't see the > same effect, as that can't be achieved using mempcy(). > > Ben Where is that patch ? I didn't see it in the mailing list. Oded _______________________________________________ Pixman mailing list Pixman@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/pixman