On Wed, Jul 15, 2015 at 6:48 PM, Adam Jackson <a...@redhat.com> wrote:
> On Thu, 2015-07-02 at 13:04 +0300, Oded Gabbay wrote:
>> Hi,
>>
>> This patch-set implements the most heavily used fast paths, according to
>> profiling done by me using the cairo traces package.
>
> I finally got a chance to try this series on a power7, and the results
> are... mixed. A sampling of x11perf numbers (against Xvfb, just
> switching pixman before and after):
>
>       before               after             Operation
> ------------   -------------------------   -------------------------
>    6856255.6       5564651.7 ( 0.812)      10x10 rectangle
>     125522.9        455209.1 ( 3.627)      100x100 rectangle
>       5419.2         29705.8 ( 5.482)      500x500 rectangle
>
> This one is telling, I think. This should be the vmx_fill path, and it
> looks like a nice win for large ops but a hit for small ops. Is the
> vmx setup cost that high, or is there something else going on?
>

Yes, the setup is that high for fill :(
I noticed this right when I started to convert the functions from sse2 to vmx.
The reason is that for every line in the image, you first align to 16 bytes and only then start to use vmx. The alignment to 16 bytes is costly! If the width of the image is small, you may not even get any vmx operations at all! In that case, the C fast path is of course faster.
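
To make the per-row alignment cost concrete, here is a rough C sketch of how an align-first fill divides one row into a scalar head, full 16-byte stores, and a scalar tail. This is not the pixman code: `row_split_t`, `split_row` and `fill_row32` are hypothetical names made up for illustration, and a 16-byte `memcpy` stands in for the actual vector store so the sketch builds on any machine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define VEC_BYTES 16u /* one VMX register holds 16 bytes */

/* Hypothetical type: how one row is divided by an align-first loop. */
typedef struct {
    size_t head_bytes; /* scalar bytes written until 16-byte alignment */
    size_t vec_stores; /* number of full 16-byte stores */
    size_t tail_bytes; /* scalar bytes left after the last full store */
} row_split_t;

/* Hypothetical helper: split a row of row_bytes starting at addr. */
static row_split_t split_row(uintptr_t addr, size_t row_bytes)
{
    row_split_t s;
    size_t misalign = (size_t)(addr % VEC_BYTES);
    size_t head = misalign ? VEC_BYTES - misalign : 0;

    if (head > row_bytes)
        head = row_bytes; /* row ends before alignment is reached */

    s.head_bytes = head;
    s.vec_stores = (row_bytes - head) / VEC_BYTES;
    s.tail_bytes = (row_bytes - head) % VEC_BYTES;
    return s;
}

/* Fill one 32bpp row with the same head / body / tail structure.
 * The memcpy of a 16-byte block is a stand-in for a vector store. */
static void fill_row32(uint32_t *dst, size_t width, uint32_t color)
{
    uint32_t block[VEC_BYTES / sizeof(uint32_t)];
    row_split_t s;
    size_t i;

    /* Assumption: dst is pixel-aligned, so head and tail are whole pixels. */
    assert((uintptr_t)dst % sizeof(uint32_t) == 0);
    s = split_row((uintptr_t)dst, width * sizeof(uint32_t));

    for (i = 0; i < VEC_BYTES / sizeof(uint32_t); i++)
        block[i] = color;

    for (i = 0; i < s.head_bytes / sizeof(uint32_t); i++)
        *dst++ = color;                      /* scalar head */
    for (i = 0; i < s.vec_stores; i++) {
        memcpy(dst, block, VEC_BYTES);       /* "vector" body */
        dst += VEC_BYTES / sizeof(uint32_t);
    }
    for (i = 0; i < s.tail_bytes / sizeof(uint32_t); i++)
        *dst++ = color;                      /* scalar tail */
}
```

Assuming 32bpp, a 10-pixel row (40 bytes) that starts 4 bytes past a 16-byte boundary splits into 12 head bytes, one vector store, and 12 tail bytes, so only 16 of the 40 bytes go through the vector path. A 4-pixel row at the same offset gets no vector store at all, while a 500-pixel row gets 124 of them.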

I think the next optimization is to split the implementation into POWER8 and !POWER8. On POWER8, you can do unaligned access with almost no penalty, so it is better to drop the alignment requirement and use vmx from the start. But for POWER7 and below, we need to use the current code.
Another option, IMO, is to detect the image size and alignment before starting, and if the image is small and unaligned, drop to the fallback (C fast path).

>    1641838.0       1684290.9 ( 1.026)      Char in 80-char aa line (Charter 10)
>     432916.1        466759.2 ( 1.078)      Char in 30-char aa line (Charter 24)
>    1412008.5       1545401.0 ( 1.094)      Char in 80-char aa line (Courier 12)
>    1440361.7       1947014.6 ( 1.352)      Char in 80-char rgb line (Charter 10)
>     384600.6        576289.5 ( 1.498)      Char in 30-char rgb line (Charter 24)
>    1258381.8       1811421.7 ( 1.439)      Char in 80-char rgb line (Courier 12)
>
> Render text gets faster, nice.
>
>    1202555.7       1228256.6 ( 1.021)      Scroll 10x10 pixels
>     162282.8        131857.7 ( 0.813)      Scroll 100x100 pixels
>       6819.8          6256.2 ( 0.917)      Scroll 500x500 pixels
>    1695720.5       1752339.8 ( 1.033)      Copy 10x10 from pixmap to window
>     210222.2        165836.1 ( 0.789)      Copy 100x100 from pixmap to window
>      14408.8         10600.1 ( 0.736)      Copy 500x500 from pixmap to window
>
> This should be the vmx_blit path, and it gets quite a bit worse for
> large ops. Eesh.
>
>    1021293.5       1060568.6 ( 1.038)      PutImage 10x10 square
>      54803.7         56420.0 ( 1.029)      PutImage 100x100 square
>       1933.5          1935.4 ( 1.001)      PutImage 500x500 square
>    1418641.0       1432543.1 ( 1.010)      ShmPutImage 10x10 square
>     194769.2        160047.5 ( 0.822)      ShmPutImage 100x100 square
>      11951.2         10968.1 ( 0.918)      ShmPutImage 500x500 square
>
> Again, blit path, and usually worse for large ops.
>
>     576975.4        573388.4 ( 0.994)      Composite 10x10 from pixmap to window
>     156830.4        131246.8 ( 0.837)      Composite 100x100 from pixmap to window
>      12172.5         10150.2 ( 0.834)      Composite 500x500 from pixmap to window
>
> Not-quite-a-blit path, but no transformation, and the same kind of
> performance hit.
>
>     176570.2        176330.2 ( 0.999)      Scale 5x5 from pixmap to 10x10 window
>       4598.0          4460.9 ( 0.970)      Scale 50x50 from pixmap to 100x100 window
>        189.9           185.9 ( 0.979)      Scale 250x250 from pixmap to 500x500 window
>     269540.6        269767.4 ( 1.001)      Scale 10x10 from pixmap to 5x5 window
>     267201.2        268220.5 ( 1.004)      Scale 100x100 from pixmap to 5x5 window
>        766.8           740.1 ( 0.965)      Scale 500x500 from pixmap to 250x250 window
>
> All within the noise margin, so I suspect the series just doesn't hit
> these paths. (Ignore the implausible numbers from "Scale 100x100",
> that's an x11perf bug I just pushed a fix for.)
>
> I'm a little hesitant to take a 10% to 20% hit to software blit
> performance. It might be that vmx_blt is just a mistake to try, that
> the CPU and compiler are smarter than we are.
>

Almost the same story as for vmx_fill. Note that I removed this patch from the v2 I sent yesterday. Your observation strengthens my decision to remove it.

> - ajax

To sum up, I think vmx_fill gives a big boost with some drawbacks, and vmx_blt is the opposite. So I would like to keep vmx_fill and drop vmx_blt for now.
And, as I said, the next step is to differentiate between POWER8 and POWER7 (and older).

    Oded
_______________________________________________
Pixman mailing list
Pixman@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/pixman