From: Ben Avison <bavi...@riscosopen.org>

Benchmark results: "before" is the patch
- ARMv6: Add fast path for over_n_8888_8888_ca
and "after" contains the additional patches:
- ARMv6: Add fast path flag to force no preload of destination buffer
- ARMv6: Add fast path for in_reverse_8888_8888 (this patch)
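
For reference, IN_REVERSE keeps the destination and scales each of its
channels by the source alpha. The scalar sketch below is only an
illustration of what the assembly further down computes, not code from
pixman: the names mul_un8 and in_reverse_pixel are hypothetical. The
rounding (add 0x80, then fold the high byte back in) mirrors the
mla/uxtab16 sequence in the patch, and the two early returns mirror its
shortcuts for source alpha 0 and 0xFF.

```c
#include <stdint.h>

/* Multiply two 8-bit values as fractions of 255, with rounding.
 * Hypothetical helper name; same arithmetic as the mla + uxtab16 pair. */
static uint8_t mul_un8(uint8_t x, uint8_t a)
{
    uint32_t t = (uint32_t)x * a + 0x80;
    return (uint8_t)((t + (t >> 8)) >> 8);
}

/* IN_REVERSE for one 32-bit pixel: only the source alpha byte is used. */
static uint32_t in_reverse_pixel(uint32_t src, uint32_t dst)
{
    uint8_t a = (uint8_t)(src >> 24);
    uint32_t out = 0;
    int shift;

    if (a == 0)
        return 0;      /* transparent source clears the destination */
    if (a == 0xFF)
        return dst;    /* opaque source leaves the destination as is */

    for (shift = 0; shift < 32; shift += 8)
        out |= (uint32_t)mul_un8((uint8_t)(dst >> shift), a) << shift;
    return out;
}
```

For example, a source alpha of 0x80 applied to a destination of
0xFFFFFFFF gives 0x80808080 with this rounding.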

lowlevel-blt-bench, in_reverse_8888_8888, 100 iterations:

       Before          After
      Mean StdDev     Mean StdDev   Confidence   Change
L1    21.1    0.1     32.0    0.1    100.00%     +51.9%
L2    11.7    0.3     18.4    0.5    100.00%     +56.9%
M     10.5    0.0     16.3    0.0    100.00%     +54.8%
HT     8.2    0.0     12.0    0.0    100.00%     +46.7%
VT     8.1    0.0     11.8    0.0    100.00%     +45.4%
R      8.0    0.0     11.2    0.0    100.00%     +40.0%
RT     4.7    0.0      6.0    0.1    100.00%     +28.1%

At most 14 outliers rejected per case per set.

cairo-perf-trace with trimmed traces, 30 iterations:

                                    Before          After
                                   Mean StdDev     Mean StdDev   Confidence   Change
t-firefox-paintball.trace          17.9    0.0     14.0    0.0    100.00%     +27.8%
t-firefox-chalkboard.trace         36.6    0.0     35.8    0.0    100.00%      +2.1%
t-firefox-canvas-alpha.trace       20.7    0.3     20.3    0.3    100.00%      +1.7%
t-firefox-particles.trace          27.5    0.1     27.1    0.1    100.00%      +1.3%
t-chromium-tabs.trace               4.9    0.0      4.8    0.0    100.00%      +1.1%
t-evolution.trace                  13.0    0.1     12.9    0.1    100.00%      +1.0%
t-swfdec-youtube.trace              7.8    0.0      7.7    0.0    100.00%      +0.8%
t-gvim.trace                       33.0    0.2     32.8    0.2    100.00%      +0.7%
t-gnome-terminal-vim.trace         19.8    0.2     19.7    0.2     99.46%      +0.6%
t-grads-heat-map.trace              4.4    0.0      4.4    0.0     99.32%      +0.6%
t-firefox-fishbowl.trace           21.1    0.0     21.0    0.0    100.00%      +0.5%
t-firefox-planet-gnome.trace       10.9    0.0     10.8    0.0    100.00%      +0.4%
t-firefox-canvas-swscroll.trace    32.1    0.1     32.0    0.1    100.00%      +0.4%
t-firefox-fishtank.trace           13.2    0.0     13.1    0.0    100.00%      +0.4%
t-firefox-asteroids.trace          11.1    0.0     11.0    0.0    100.00%      +0.4%
t-firefox-canvas.trace             17.9    0.0     17.9    0.0     99.99%      +0.3%
t-poppler.trace                     9.7    0.1      9.7    0.1     79.51%      +0.2%  (insignificant)
t-firefox-talos-svg.trace          20.4    0.0     20.4    0.0     97.25%      +0.1%  (insignificant)
t-swfdec-giant-steps.trace         14.8    0.0     14.8    0.0     96.75%      +0.1%  (insignificant)
t-firefox-scrolling.trace          24.6    0.1     24.6    0.1     31.24%      +0.1%  (insignificant)
t-midori-zoomed.trace               8.0    0.0      8.0    0.0     50.76%      +0.0%  (insignificant)
t-gnome-system-monitor.trace       17.1    0.0     17.1    0.0      4.49%      -0.0%  (insignificant)
t-xfce4-terminal-a1.trace           4.8    0.0      4.8    0.0     98.08%      -0.2%  (insignificant)
t-poppler-reseau.trace             22.1    0.1     22.2    0.1     93.89%      -0.3%  (insignificant)
t-firefox-talos-gfx.trace          25.4    0.4     25.5    0.5     75.53%      -0.5%  (insignificant)

At most 4 outliers rejected per case per set.

Cairo perf reports the running time, but the change is computed for
operations per second instead (inverse of running time).
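
As a rough illustration of how the Change column relates to the means,
here is a minimal sketch in C. The function names are hypothetical and
this is not the script that produced the tables. For running times the
rate is the inverse of the time, so the ratio flips. For the
lowlevel-blt-bench figures, which the Change column treats as rates,
the ratio is taken directly.

```c
/* Percent change in operations per second, given two running times.
 * rate = 1 / time, so rate_after / rate_before = time_before / time_after. */
static double rate_change_from_times(double time_before, double time_after)
{
    return (time_before / time_after - 1.0) * 100.0;
}

/* Percent change between two values that are already rates. */
static double rate_change_from_rates(double rate_before, double rate_after)
{
    return (rate_after / rate_before - 1.0) * 100.0;
}
```

Feeding in the rounded means for t-firefox-paintball.trace (17.9 and
14.0) gives roughly +27.9%, close to the +27.8% in the table. The small
difference is presumably because the table was computed from unrounded
means.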

Confidence is based on Welch's t-test. Absolute changes of less than 1%
can be attributed to measurement error, even if they are statistically
significant.

There was a question of why FLAG_NO_PRELOAD_DST exists. If a patch
removing that flag from pixman-arm-simd-asm.S is added on top, the
results change as follows.

Before: flag in use
After: flag removed

       Before          After
      Mean StdDev     Mean StdDev   Confidence   Change
L1    32.0    0.1     31.8    0.1    100.00%      -0.6%
L2    18.4    0.5     25.0    0.5    100.00%     +36.0%
M     16.3    0.0     25.7    0.0    100.00%     +57.9%
HT    12.0    0.0     13.9    0.0    100.00%     +16.4%
VT    11.8    0.0     13.2    0.0    100.00%     +12.4%
R     11.2    0.0     14.0    0.0    100.00%     +24.3%
RT     6.0    0.1      7.0    0.1    100.00%     +15.1%

                                    Before          After
                                   Mean StdDev     Mean StdDev   Confidence   Change
t-chromium-tabs.trace               4.8    0.0      4.8    0.0    100.00%      +0.7%
t-poppler-reseau.trace             22.2    0.1     22.1    0.1     99.98%      +0.6%
t-poppler.trace                     9.7    0.1      9.6    0.1     99.70%      +0.5%
t-firefox-talos-gfx.trace          25.5    0.5     25.4    0.3     72.06%      +0.5%  (insignificant)
t-firefox-canvas-alpha.trace       20.3    0.3     20.2    0.2     80.88%      +0.4%  (insignificant)
t-firefox-canvas.trace             17.9    0.0     17.8    0.0     99.36%      +0.2%
t-firefox-canvas-swscroll.trace    32.0    0.1     31.9    0.1     84.83%      +0.1%  (insignificant)
t-firefox-asteroids.trace          11.0    0.0     11.0    0.0    100.00%      +0.1%
t-midori-zoomed.trace               8.0    0.0      8.0    0.0     99.90%      +0.1%
t-firefox-planet-gnome.trace       10.8    0.0     10.8    0.0     91.34%      +0.1%  (insignificant)
t-firefox-scrolling.trace          24.6    0.1     24.6    0.1      0.53%      +0.0%  (insignificant)
t-gnome-terminal-vim.trace         19.7    0.2     19.7    0.1     11.42%      -0.0%  (insignificant)
t-firefox-talos-svg.trace          20.4    0.0     20.4    0.0     54.68%      -0.0%  (insignificant)
t-swfdec-giant-steps.trace         14.8    0.0     14.8    0.0     78.92%      -0.0%  (insignificant)
t-firefox-fishtank.trace           13.1    0.0     13.1    0.0     97.09%      -0.0%  (insignificant)
t-gnome-system-monitor.trace       17.1    0.0     17.1    0.0     65.13%      -0.0%  (insignificant)
t-evolution.trace                  12.9    0.1     12.9    0.1     34.70%      -0.1%  (insignificant)
t-grads-heat-map.trace              4.4    0.0      4.4    0.0     28.95%      -0.1%  (insignificant)
t-firefox-fishbowl.trace           21.0    0.0     21.0    0.0     99.92%      -0.2%
t-xfce4-terminal-a1.trace           4.8    0.0      4.8    0.0     98.78%      -0.2%  (insignificant)
t-firefox-particles.trace          27.1    0.1     27.3    0.1     99.89%      -0.5%
t-swfdec-youtube.trace              7.7    0.0      7.8    0.0    100.00%      -0.7%
t-gvim.trace                       32.8    0.2     33.1    0.2    100.00%      -0.9%
t-firefox-chalkboard.trace         35.8    0.0     37.1    0.0    100.00%      -3.3%
t-firefox-paintball.trace          14.0    0.0     15.0    0.0    100.00%      -6.2%

In other words, the flag hurts lowlevel-blt-bench performance in every
case except L1, but it improves a couple of Cairo traces
(t-firefox-paintball and t-firefox-chalkboard) by a few percent.

v4, Pekka Paalanen <pekka.paala...@collabora.co.uk> :
        Rebased, re-benchmarked on Raspberry Pi, commit message.

---

Should I re-spin this without the flag? Ben?
It should not need a new benchmarking night, since I already have
the numbers.

Thanks,
pq
---
 pixman/pixman-arm-simd-asm.S | 103 +++++++++++++++++++++++++++++++++++++++++++
 pixman/pixman-arm-simd.c     |   7 +++
 2 files changed, 110 insertions(+)

diff --git a/pixman/pixman-arm-simd-asm.S b/pixman/pixman-arm-simd-asm.S
index 7bb18cb..d926226 100644
--- a/pixman/pixman-arm-simd-asm.S
+++ b/pixman/pixman-arm-simd-asm.S
@@ -954,3 +954,106 @@ generate_composite_function \
 
 
 /******************************************************************************/
 
+.macro in_reverse_8888_8888_init
+        /* Hold loop invariant in MASK */
+        ldr     MASK, =0x00800080
+        /* Set GE[3:0] to 0101 so SEL instructions do what we want */
+        uadd8   SCRATCH, MASK, MASK
+        /* Offset the source pointer: we only need the alpha bytes */
+        add     SRC, SRC, #3
+        line_saved_regs  ORIG_W
+.endm
+
+.macro in_reverse_8888_8888_head  numbytes, reg1, reg2, reg3
+        ldrb    ORIG_W, [SRC], #4
+ .if numbytes >= 8
+        ldrb    WK&reg1, [SRC], #4
+  .if numbytes == 16
+        ldrb    WK&reg2, [SRC], #4
+        ldrb    WK&reg3, [SRC], #4
+  .endif
+ .endif
+        add     DST, DST, #numbytes
+.endm
+
+.macro in_reverse_8888_8888_process_head  cond, numbytes, firstreg, unaligned_src, unaligned_mask, preload
+        in_reverse_8888_8888_head  numbytes, firstreg, %(firstreg+1), %(firstreg+2)
+.endm
+
+.macro in_reverse_8888_8888_1pixel  s, d, offset, is_only
+ .if is_only != 1
+        movs    s, ORIG_W
+  .if offset != 0
+        ldrb    ORIG_W, [SRC, #offset]
+  .endif
+        beq     01f
+        teq     STRIDE_M, #0xFF
+        beq     02f
+ .endif
+        uxtb16  SCRATCH, d                 /* rb_dest */
+        uxtb16  d, d, ror #8               /* ag_dest */
+        mla     SCRATCH, SCRATCH, s, MASK
+        mla     d, d, s, MASK
+        uxtab16 SCRATCH, SCRATCH, SCRATCH, ror #8
+        uxtab16 d, d, d, ror #8
+        mov     SCRATCH, SCRATCH, ror #8
+        sel     d, SCRATCH, d
+        b       02f
+ .if offset == 0
+48:     /* Last mov d,#0 of the set - used as part of shortcut for
+         * source values all 0 */
+ .endif
+01:     mov     d, #0
+02:
+.endm
+
+.macro in_reverse_8888_8888_tail  numbytes, reg1, reg2, reg3, reg4
+ .if numbytes == 4
+        teq     ORIG_W, ORIG_W, asr #32
+        ldrne   WK&reg1, [DST, #-4]
+ .elseif numbytes == 8
+        teq     ORIG_W, WK&reg1
+        teqeq   ORIG_W, ORIG_W, asr #32  /* all 0 or all -1? */
+        ldmnedb DST, {WK&reg1-WK&reg2}
+ .else
+        teq     ORIG_W, WK&reg1
+        teqeq   ORIG_W, WK&reg2
+        teqeq   ORIG_W, WK&reg3
+        teqeq   ORIG_W, ORIG_W, asr #32  /* all 0 or all -1? */
+        ldmnedb DST, {WK&reg1-WK&reg4}
+ .endif
+        cmnne   DST, #0   /* clear C if NE */
+        bcs     49f       /* no writes to dest if source all -1 */
+        beq     48f       /* set dest to all 0 if source all 0 */
+ .if numbytes == 4
+        in_reverse_8888_8888_1pixel  ORIG_W, WK&reg1, 0, 1
+        str     WK&reg1, [DST, #-4]
+ .elseif numbytes == 8
+        in_reverse_8888_8888_1pixel  STRIDE_M, WK&reg1, -4, 0
+        in_reverse_8888_8888_1pixel  STRIDE_M, WK&reg2, 0, 0
+        stmdb   DST, {WK&reg1-WK&reg2}
+ .else
+        in_reverse_8888_8888_1pixel  STRIDE_M, WK&reg1, -12, 0
+        in_reverse_8888_8888_1pixel  STRIDE_M, WK&reg2, -8, 0
+        in_reverse_8888_8888_1pixel  STRIDE_M, WK&reg3, -4, 0
+        in_reverse_8888_8888_1pixel  STRIDE_M, WK&reg4, 0, 0
+        stmdb   DST, {WK&reg1-WK&reg4}
+ .endif
+49:
+.endm
+
+.macro in_reverse_8888_8888_process_tail  cond, numbytes, firstreg
+        in_reverse_8888_8888_tail  numbytes, firstreg, %(firstreg+1), %(firstreg+2), %(firstreg+3)
+.endm
+
+generate_composite_function \
+    pixman_composite_in_reverse_8888_8888_asm_armv6, 32, 0, 32 \
+    FLAG_DST_READWRITE | FLAG_BRANCH_OVER | FLAG_PROCESS_CORRUPTS_PSR | FLAG_PROCESS_DOES_STORE | FLAG_SPILL_LINE_VARS | FLAG_PROCESS_CORRUPTS_SCRATCH | FLAG_NO_PRELOAD_DST \
+    2, /* prefetch distance */ \
+    in_reverse_8888_8888_init, \
+    nop_macro, /* newline */ \
+    nop_macro, /* cleanup */ \
+    in_reverse_8888_8888_process_head, \
+    in_reverse_8888_8888_process_tail
+
+/******************************************************************************/
diff --git a/pixman/pixman-arm-simd.c b/pixman/pixman-arm-simd.c
index dd6b907..c17ce5a 100644
--- a/pixman/pixman-arm-simd.c
+++ b/pixman/pixman-arm-simd.c
@@ -46,6 +46,8 @@ PIXMAN_ARM_BIND_FAST_PATH_SRC_DST (armv6, add_8_8,
                                    uint8_t, 1, uint8_t, 1)
 PIXMAN_ARM_BIND_FAST_PATH_SRC_DST (armv6, over_8888_8888,
                                    uint32_t, 1, uint32_t, 1)
+PIXMAN_ARM_BIND_FAST_PATH_SRC_DST (armv6, in_reverse_8888_8888,
+                                   uint32_t, 1, uint32_t, 1)
 
 PIXMAN_ARM_BIND_FAST_PATH_N_DST (0, armv6, over_reverse_n_8888,
                                  uint32_t, 1)
@@ -241,6 +243,11 @@ static const pixman_fast_path_t arm_simd_fast_paths[] =
     PIXMAN_STD_FAST_PATH (OVER, solid, a8, a8b8g8r8, armv6_composite_over_n_8_8888),
     PIXMAN_STD_FAST_PATH (OVER, solid, a8, x8b8g8r8, armv6_composite_over_n_8_8888),
 
+    PIXMAN_STD_FAST_PATH (IN_REVERSE, a8r8g8b8, null, a8r8g8b8, armv6_composite_in_reverse_8888_8888),
+    PIXMAN_STD_FAST_PATH (IN_REVERSE, a8r8g8b8, null, x8r8g8b8, armv6_composite_in_reverse_8888_8888),
+    PIXMAN_STD_FAST_PATH (IN_REVERSE, a8b8g8r8, null, a8b8g8r8, armv6_composite_in_reverse_8888_8888),
+    PIXMAN_STD_FAST_PATH (IN_REVERSE, a8b8g8r8, null, x8b8g8r8, armv6_composite_in_reverse_8888_8888),
+
     PIXMAN_STD_FAST_PATH_CA (OVER, solid, a8r8g8b8, a8r8g8b8, armv6_composite_over_n_8888_8888_ca),
     PIXMAN_STD_FAST_PATH_CA (OVER, solid, a8r8g8b8, x8r8g8b8, armv6_composite_over_n_8888_8888_ca),
     PIXMAN_STD_FAST_PATH_CA (OVER, solid, a8b8g8r8, a8b8g8r8, armv6_composite_over_n_8888_8888_ca),
-- 
1.8.3.2
