Hi Siarhei,

I implemented a new version of the BILINEAR_INTERPOLATE_SINGLE_PIXEL macro (patch below) in which the ANDI/EXT instructions are replaced with load byte instructions (for better dual-issue instruction balancing), and got these results on my Malta board:
Original:
[  0]    image             firefox-fishtank 2289.180 2290.567   0.05%    5/6

Opt (ANDI/EXT):
[  0]    image             firefox-fishtank 1700.925 1708.314   0.22%    5/6

Opt2 (load byte instructions):
[  0]    image             firefox-fishtank 1671.700 1672.006   0.03%    4/4

There is a performance improvement, but it is not as impressive as I expected. The code also becomes sensitive to the endianness of the target CPU. Of course, this can be guarded with some #ifdefs where the byte offset within a word is changed according to the endianness of the target CPU (since MIPS CPUs can be either LE or BE). Is this small improvement worth making the code vulnerable to endian issues?

I still need to add an improvement for the packing/unpacking of the RGBA pixels after bilinear/before the OVER operation, but I don't expect a big improvement there (it is just a couple of instructions).

Thanks,
Nemanja Lukic

Patch for load byte implementation of the BILINEAR_INTERPOLATE_SINGLE_PIXEL macro:

diff --git a/pixman/pixman-mips-dspr2-asm.S b/pixman/pixman-mips-dspr2-asm.S
index 87558f0..541a6af 100644
--- a/pixman/pixman-mips-dspr2-asm.S
+++ b/pixman/pixman-mips-dspr2-asm.S
@@ -785,15 +785,12 @@ LEAF_MIPS_DSPR2(pixman_scaled_bilinear_scanline_8888_8_8888_OVER_asm_mips)
     sra     t9, s2, 16
     sll     t9, t9, 2
-    addiu   t8, t9, 4
-    lwx     t0, t9(a2)     /* t0 = tl */
-    lwx     t1, t8(a2)     /* t1 = tr */
+    addu    t0, t9, a2
     addiu   v1, v1, -1
-    lwx     t2, t9(a3)     /* t2 = bl */
-    lwx     t3, t8(a3)     /* t3 = br */
+    addu    t1, t9, a3
-    BILINEAR_INTERPOLATE_SINGLE_PIXEL t0, t1, t2, t3, \
-                                      t4, t5, t6, t7, t8, t9, s4, s5, s6, s7
+    BILINEAR_INTERPOLATE_SINGLE_PIXEL t0, t1, \
+                                      t4, t5, t6, t7, t2, t3, s4, s5, s6, s7
     lbu     t1, 0(a1)     /* t1 = mask */
     lw      t2, 0(a0)     /* t2 = dst */
     addiu   a1, a1, 1
diff --git a/pixman/pixman-mips-dspr2-asm.h b/pixman/pixman-mips-dspr2-asm.h
index 7cf3281..2ed3de3 100644
--- a/pixman/pixman-mips-dspr2-asm.h
+++ b/pixman/pixman-mips-dspr2-asm.h
@@ -566,34 +566,34 @@ LEAF_MIPS32R2(symbol) \
     addu_s.qb \out2_8888, \d2_8888, \scratch2
 .endm

-.macro BILINEAR_INTERPOLATE_SINGLE_PIXEL tl, tr, bl, br, \
+.macro BILINEAR_INTERPOLATE_SINGLE_PIXEL top, bottom, \
                                          scratch1, scratch2, \
                                          alpha, red, green, blue \
                                          wt1, wt2, wb1, wb2
-    andi \scratch1, \tl, 0xff
-    andi \scratch2, \tr, 0xff
-    andi \alpha, \bl, 0xff
-    andi \red, \br, 0xff
+    lbu \scratch1, 0(\top)
+    lbu \scratch2, 4(\top)
+    lbu \alpha, 0(\bottom)
+    lbu \red, 4(\bottom)
     multu $ac0, \wt1, \scratch1
     maddu $ac0, \wt2, \scratch2
     maddu $ac0, \wb1, \alpha
     maddu $ac0, \wb2, \red

-    ext \scratch1, \tl, 8, 8
-    ext \scratch2, \tr, 8, 8
-    ext \alpha, \bl, 8, 8
-    ext \red, \br, 8, 8
+    lbu \scratch1, 1(\top)
+    lbu \scratch2, 5(\top)
+    lbu \alpha, 1(\bottom)
+    lbu \red, 5(\bottom)

     multu $ac1, \wt1, \scratch1
     maddu $ac1, \wt2, \scratch2
     maddu $ac1, \wb1, \alpha
     maddu $ac1, \wb2, \red

-    ext \scratch1, \tl, 16, 8
-    ext \scratch2, \tr, 16, 8
-    ext \alpha, \bl, 16, 8
-    ext \red, \br, 16, 8
+    lbu \scratch1, 2(\top)
+    lbu \scratch2, 6(\top)
+    lbu \alpha, 2(\bottom)
+    lbu \red, 6(\bottom)

     mflo \blue, $ac0

@@ -602,10 +602,10 @@ LEAF_MIPS32R2(symbol) \
     maddu $ac2, \wb1, \alpha
     maddu $ac2, \wb2, \red

-    ext \scratch1, \tl, 24, 8
-    ext \scratch2, \tr, 24, 8
-    ext \alpha, \bl, 24, 8
-    ext \red, \br, 24, 8
+    lbu \scratch1, 3(\top)
+    lbu \scratch2, 7(\top)
+    lbu \alpha, 3(\bottom)
+    lbu \red, 7(\bottom)

     mflo \green, $ac1

@@ -619,7 +619,7 @@ LEAF_MIPS32R2(symbol) \
     precr.qb.ph \alpha, \alpha, \red
     precr.qb.ph \scratch1, \green, \blue
-    precrq.qb.ph \tl, \alpha, \scratch1
+    precrq.qb.ph \top, \alpha, \scratch1
 .endm

 #endif //PIXMAN_MIPS_DSPR2_ASM_H

-----Original Message-----
From: Siarhei Siamashka [mailto:siarhei.siamas...@gmail.com]
Sent: Friday, May 11, 2012 10:55 AM
To: Lukic, Nemanja
Cc: pixman@lists.freedesktop.org; nemanja.lu...@rt-rk.com
Subject: Re: [Pixman] [PATCH 2/2] MIPS: DSPr2: Added bilinear over_8888_8_8888 fast path.
On Thu, May 3, 2012 at 1:03 AM, Nemanja Lukic <nlu...@mips.com> wrote:
> From: Nemanja Lukic <nemanja.lu...@rt-rk.com>
>
> Performance numbers before/after on MIPS-74kc @ 1GHz
>
> Referent (before):
>
> cairo-perf-trace:
> [ # ]  backend                         test   min(s) median(s) stddev. count
> [ # ]    image: pixman 0.25.3
> [  0]    image             firefox-fishtank 2289.180 2290.567   0.05%    5/6
>
> Optimized:
>
> cairo-perf-trace:
> [ # ]  backend                         test   min(s) median(s) stddev. count
> [ # ]    image: pixman 0.25.3
> [  0]    image             firefox-fishtank 1700.925 1708.314   0.22%    5/6

This definitely is an improvement. But the firefox-fishtank trace is very dependent on bilinear scaling performance; both x86 SSE2 and ARM NEON demonstrate more than 3x speedup here:
http://ssvb.github.com/2012/05/04/xorg-drivers-and-software-rendering.html

I understand that MIPS DSPr2 does not stand a chance competing with 128-bit SIMD competitors, but some more performance tweaks can probably still be applied. See more comments below.

> diff --git a/pixman/pixman-mips-dspr2-asm.h b/pixman/pixman-mips-dspr2-asm.h
> index 8383060..7cf3281 100644
> --- a/pixman/pixman-mips-dspr2-asm.h
> +++ b/pixman/pixman-mips-dspr2-asm.h
> @@ -566,4 +566,60 @@ LEAF_MIPS32R2(symbol) \
>      addu_s.qb \out2_8888, \d2_8888, \scratch2
>  .endm
>
> +.macro BILINEAR_INTERPOLATE_SINGLE_PIXEL tl, tr, bl, br, \
> +                                         scratch1, scratch2, \
> +                                         alpha, red, green, blue \
> +                                         wt1, wt2, wb1, wb2
> +    andi \scratch1, \tl, 0xff
> +    andi \scratch2, \tr, 0xff
> +    andi \alpha, \bl, 0xff
> +    andi \red, \br, 0xff

I suggest having a look at
http://lists.freedesktop.org/archives/pixman/2011-February/001088.html

The ANDI/EXT instructions from the BILINEAR_INTERPOLATE_SINGLE_PIXEL macro could be replaced with byte load instructions. MIPS74K can't dual issue ALU+ALU instructions, but can dual issue LS+ALU. This looks like a potentially huge performance win on MIPS74K hardware, far exceeding the speedup observed on x86.

Why is the faster C bilinear code from my old post still not in pixman?
As I mentioned there, "the discussion is still ongoing about how to improve bilinear scaling performance when SIMD extensions are not available". Reducing interpolation precision from the current 8-bit to 7-bit allows the use of signed multiplications and can help x86 MMX/SSE2/SSSE3 code a lot. It may even make sense to reduce interpolation precision further to 4-bit, as suggested by Taekyun Kim at that time:
http://lists.freedesktop.org/archives/pixman/2011-February/001044.html
This allows halving the number of multiplications for bilinear interpolation in C code by using SIMD-like tricks. But both Taekyun Kim and I were mostly interested in ARM NEON performance, and NEON happens not to suffer much from 8-bit interpolation. Nobody else has tried pushing interpolation precision reduction for faster bilinear interpolation into pixman and ... it did not happen. But hope is not totally lost, see the recent discussion:
http://lists.freedesktop.org/archives/pixman/2012-May/001930.html

Regarding how this affects you: if bilinear interpolation precision gets changed after all, your optimized code in the bilinear over_8888_8_8888 fast path will need to be updated (if we still care about getting identical results everywhere and passing the test suite). You may also want to take part in this activity and evaluate the effects of 8-bit vs. 7-bit vs. 4-bit interpolation for MIPS.
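To make the precision and byte-load points concrete, here is a minimal C sketch (not pixman's actual code) of the per-channel weighted sum that the macro computes, parameterized by the number of interpolation bits. All function names are hypothetical. The shift variant corresponds to ANDI/EXT on already loaded words; the byte variant corresponds to LBU and therefore needs an endian-dependent offset, chosen here by a runtime check purely for illustration (a real fast path would presumably pick it at build time).

```c
#include <stdint.h>

/* Hypothetical helper: returns 1 on a big-endian host (runtime probe). */
static int host_is_big_endian(void)
{
    const uint16_t probe = 0x0100;
    return *(const uint8_t *)&probe; /* first byte in memory is 0x01 on BE */
}

/* Weights for fractional offsets dx, dy in [0, 1 << bits).
 * The four weights always sum to 1 << (2 * bits). */
static void bilinear_weights(uint32_t dx, uint32_t dy, unsigned bits,
                             uint32_t w[4])
{
    uint32_t one = 1u << bits;
    w[0] = (one - dx) * (one - dy); /* top-left     */
    w[1] = dx * (one - dy);         /* top-right    */
    w[2] = (one - dx) * dy;         /* bottom-left  */
    w[3] = dx * dy;                 /* bottom-right */
}

/* Variant 1: channels extracted with shifts and masks (endian-independent). */
static uint32_t bilinear_shift(uint32_t tl, uint32_t tr,
                               uint32_t bl, uint32_t br,
                               const uint32_t w[4], unsigned bits)
{
    uint32_t out = 0;
    for (unsigned sh = 0; sh < 32; sh += 8) {
        uint32_t acc = w[0] * ((tl >> sh) & 0xff) + w[1] * ((tr >> sh) & 0xff)
                     + w[2] * ((bl >> sh) & 0xff) + w[3] * ((br >> sh) & 0xff);
        out |= (acc >> (2 * bits)) << sh;
    }
    return out;
}

/* Variant 2: channels fetched with byte loads from the two source rows.
 * top and bottom each point at the left pixel of a pair of pixels. */
static uint32_t bilinear_bytes(const uint32_t *top, const uint32_t *bottom,
                               const uint32_t w[4], unsigned bits)
{
    const uint8_t *t = (const uint8_t *)top;
    const uint8_t *b = (const uint8_t *)bottom;
    int be = host_is_big_endian();
    uint32_t out = 0;
    for (unsigned i = 0; i < 4; i++) {
        unsigned off = be ? 3 - i : i; /* byte offset of channel i in a word */
        uint32_t acc = w[0] * t[off] + w[1] * t[4 + off]
                     + w[2] * b[off] + w[3] * b[4 + off];
        out |= (acc >> (2 * bits)) << (8 * i);
    }
    return out;
}
```

With bits = 8 the weights sum to 65536, which matches taking bits 16..23 of each accumulator in the assembly; with bits = 7 they sum to 16384, so each individual weight fits in a signed 16-bit value.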
> +    multu $ac0, \wt1, \scratch1
> +    maddu $ac0, \wt2, \scratch2
> +    maddu $ac0, \wb1, \alpha
> +    maddu $ac0, \wb2, \red
> +
> +    ext \scratch1, \tl, 8, 8
> +    ext \scratch2, \tr, 8, 8
> +    ext \alpha, \bl, 8, 8
> +    ext \red, \br, 8, 8
> +
> +    multu $ac1, \wt1, \scratch1
> +    maddu $ac1, \wt2, \scratch2
> +    maddu $ac1, \wb1, \alpha
> +    maddu $ac1, \wb2, \red
> +
> +    ext \scratch1, \tl, 16, 8
> +    ext \scratch2, \tr, 16, 8
> +    ext \alpha, \bl, 16, 8
> +    ext \red, \br, 16, 8
> +
> +    mflo \blue, $ac0
> +
> +    multu $ac2, \wt1, \scratch1
> +    maddu $ac2, \wt2, \scratch2
> +    maddu $ac2, \wb1, \alpha
> +    maddu $ac2, \wb2, \red
> +
> +    ext \scratch1, \tl, 24, 8
> +    ext \scratch2, \tr, 24, 8
> +    ext \alpha, \bl, 24, 8
> +    ext \red, \br, 24, 8
> +
> +    mflo \green, $ac1
> +
> +    multu $ac3, \wt1, \scratch1
> +    maddu $ac3, \wt2, \scratch2
> +    maddu $ac3, \wb1, \alpha
> +    maddu $ac3, \wb2, \red
> +
> +    mflo \red, $ac2
> +    mflo \alpha, $ac3
> +
> +    precr.qb.ph \alpha, \alpha, \red
> +    precr.qb.ph \scratch1, \green, \blue
> +    precrq.qb.ph \tl, \alpha, \scratch1

Here you are combining the RGBA values and then splitting them again later in the OVER_8888_8_8888 macro. Could this be exploited somehow?

--
Best regards,
Siarhei Siamashka
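One possible reading of the pack/unpack question, as a hedged C sketch rather than a proposal for the actual assembly: if the interpolation step hands over its four channel values unpacked, the masked OVER step can consume them directly and pack only the final result. The names unpacked_t, mul_un8 and over_unpacked_mask are hypothetical, and the rounding used for the divide by 255 is an assumption about what the fast path must reproduce.

```c
#include <stdint.h>

/* Hypothetical helper: a * b / 255 with rounding, for 8-bit a and b. */
static uint32_t mul_un8(uint32_t a, uint32_t b)
{
    uint32_t t = a * b + 0x80;
    return (t + (t >> 8)) >> 8;
}

/* Unpacked pixel: ch[0] = blue, ch[1] = green, ch[2] = red, ch[3] = alpha,
 * i.e. the values held in separate registers before the pack step. */
typedef struct {
    uint32_t ch[4];
} unpacked_t;

/* OVER with an 8-bit mask, taking the source unpacked and the destination
 * packed: result = src * mask + dst * (255 - alpha(src * mask)) / 255. */
static uint32_t over_unpacked_mask(const unpacked_t *src, uint8_t mask,
                                   uint32_t dst)
{
    uint32_t inv_alpha = 255 - mul_un8(src->ch[3], mask);
    uint32_t out = 0;
    for (unsigned i = 0; i < 4; i++) {
        uint32_t s = mul_un8(src->ch[i], mask);
        uint32_t d = mul_un8((dst >> (8 * i)) & 0xff, inv_alpha);
        uint32_t sum = s + d;
        if (sum > 255)
            sum = 255; /* saturate, like addu_s.qb */
        out |= sum << (8 * i);
    }
    return out;
}
```

Whether this saves anything on DSPr2 depends on how many instructions the unpacked multiplies cost compared with the packed ones, so it would need to be measured.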