[Pixman] [PATCH] test: larger 0xFF/0x00 filled clusters in random images for blitters-test

2013-03-04 Thread Siarhei Siamashka
Current blitters-test program had difficulties detecting a bug in
the over_n_8888_8888_ca implementation for MIPS DSPr2:

http://lists.freedesktop.org/archives/pixman/2013-March/002645.html

In order to hit the buggy code path, two consecutive mask values had
to be equal to 0xFFFFFFFF because of loop unrolling. The current
blitters-test generates random images in such a way that each byte
has a 25% probability of having the value 0xFF. Hence each 32-bit mask
value has a ~0.4% probability of being 0xFFFFFFFF. Because we are testing
many compositing operations with many pixels, encountering at least
one 0xFFFFFFFF mask value reasonably fast is not a problem. If a
bug related to the 0xFFFFFFFF mask value is artificially introduced into
the over_n_8888_8888_ca generic C function, it gets detected on iteration
675591 in blitters-test (out of 2000000).

However, two consecutive 0xFFFFFFFF mask values are much less likely
to be generated, so the bug was missed by blitters-test.

This patch addresses the problem by also randomly setting the 32-bit
values in images to either 0x00000000 or 0xFFFFFFFF (again with 25%
probability). This produces larger clusters of consecutive 0x00 or 0xFF
bytes in the images, so that the special shortcuts which unrolled or SIMD
optimized code may have for handling them actually get exercised.
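
To put rough numbers on the probabilities discussed above, here is a small,
self-contained C sketch (purely illustrative, not part of the patch) that
reproduces the back-of-the-envelope arithmetic; the 25% per-byte figure is
taken from the description above, everything else is plain multiplication:

    #include <stdio.h>

    int main (void)
    {
        /* Each byte of a random image has a 25% chance of being 0xFF. */
        double p_byte = 0.25;
        /* A 32-bit mask is 0xFFFFFFFF only if all four bytes are 0xFF. */
        double p_word = p_byte * p_byte * p_byte * p_byte;   /* ~0.39% */
        /* Two consecutive 0xFFFFFFFF mask values are rarer still. */
        double p_pair = p_word * p_word;                      /* ~0.0015% */

        printf ("single 0xFFFFFFFF mask value : %.4f%%\n", p_word * 100.0);
        printf ("two consecutive such values  : %.6f%%\n", p_pair * 100.0);
        return 0;
    }

With the additional word-level biasing (25% probability of 0x00000000 or
0xFFFFFFFF for each 32-bit value), such pairs become orders of magnitude
more common and the unrolled code paths are reached within a normal test run.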
---
 test/blitters-test.c | 13 ++--
 test/prng-test.c |  5 -
 test/utils-prng.c| 58 +++-
 test/utils-prng.h|  5 -
 4 files changed, 72 insertions(+), 9 deletions(-)

diff --git a/test/blitters-test.c b/test/blitters-test.c
index 8766fa8..a2c6ff4 100644
--- a/test/blitters-test.c
+++ b/test/blitters-test.c
@@ -46,7 +46,16 @@ create_random_image (pixman_format_code_t *allowed_formats,
 /* do the allocation */
 buf = aligned_malloc (64, stride * height);
 
-prng_randmemset (buf, stride * height, RANDMEMSET_MORE_00_AND_FF);
+if (prng_rand_n (4) == 0)
+{
+   /* uniform distribution */
+   prng_randmemset (buf, stride * height, 0);
+}
+else
+{
+   /* significantly increased probability for 0x00 and 0xFF */
+   prng_randmemset (buf, stride * height, RANDMEMSET_MORE_00_AND_FF);
+}
 
 img = pixman_image_create_bits (fmt, width, height, buf, stride);
 
@@ -393,6 +402,6 @@ main (int argc, const char *argv[])
 }
 
 return fuzzer_test_main("blitters", 2000000,
-   0xD8265D5E,
+   0x0CF3283B,
test_composite, argc, argv);
 }
diff --git a/test/prng-test.c b/test/prng-test.c
index 0a3ad5e..c1d9320 100644
--- a/test/prng-test.c
+++ b/test/prng-test.c
@@ -106,7 +106,10 @@ int main (int argc, char *argv[])
 {
 const uint32_t ref_crc[RANDMEMSET_MORE_00_AND_FF + 1] =
 {
-0xBA06763D, 0x103FC550, 0x8B59ABA5, 0xD82A0F39
+0xBA06763D, 0x103FC550, 0x8B59ABA5, 0xD82A0F39,
+0xD2321099, 0xFD8C5420, 0xD3B7C42A, 0xFC098093,
+0x85E01DE0, 0x6680F8F7, 0x4D32DD3C, 0xAE52382B,
+0x149E6CB5, 0x8B336987, 0x15DCB2B3, 0x8A71B781
 };
 uint32_t crc1, crc2;
 uint32_t ref, seed, seed0, seed1, seed2, seed3;
diff --git a/test/utils-prng.c b/test/utils-prng.c
index 967b898..7b32e35 100644
--- a/test/utils-prng.c
+++ b/test/utils-prng.c
@@ -107,6 +107,7 @@ randmemset_internal (prng_t  *prng,
 {
 prng_t local_prng = *prng;
 prng_rand_128_data_t randdata;
+size_t i;
 
 while (size >= 16)
 {
@@ -138,6 +139,22 @@ randmemset_internal (prng_t  *prng,
 };
 randdata.vb &= (t.vb >= const_40);
 }
+if (flags & RANDMEMSET_MORE_FFFFFFFF)
+{
+const uint32x4 const_C0000000 =
+{
+0xC0000000, 0xC0000000, 0xC0000000, 0xC0000000
+};
+randdata.vw |= ((t.vw << 30) >= const_C0000000);
+}
+if (flags & RANDMEMSET_MORE_00000000)
+{
+const uint32x4 const_40000000 =
+{
+0x40000000, 0x40000000, 0x40000000, 0x40000000
+};
+randdata.vw &= ((t.vw << 30) >= const_40000000);
+}
 #else
 #define PROCESS_ONE_LANE(i)   \
 if (flags & RANDMEMSET_MORE_FF)   \
@@ -155,6 +172,18 @@ randmemset_internal (prng_t  *prng,
 mask_00 |= mask_00 >> 2;  \
 mask_00 |= mask_00 >> 4;  \
 randdata.w[i] &= mask_00; \
+} \
+if (flags & RANDMEMSET_MORE_FFFFFFFF) \
+{ \
+int32_t mask_ff = ((t

Re: [Pixman] [PATCH] glyphs: Check the return code from _pixman_implementation_lookup_composite()

2013-03-04 Thread Juan Francisco Cantero Hurtado

On 03/02/13 03:25, Juan Francisco Cantero Hurtado wrote:

On 01/13/13 22:37, Juan Francisco Cantero Hurtado wrote:

On 01/13/13 11:20, Søren Sandmann wrote:

Søren Sandmann  writes:


Juan Francisco Cantero Hurtado  writes:


OpenBSD and gcc 4.2 (the default compiler) don't support thread local
storage.


In that case, it should fall back to pthread_setspecific(). Can you try
putting in a #error above "#include <pthread.h>" in pixman-compiler.h
and see if compilation fails.

It's possible there is a bug in the pthread_setspecific() fallback in
pixman, but if so, I couldn't reproduce it on Linux by forcing
pthread_setspecific() and running the test suite.

Does the test suite pass for you if you run "make check"?


I found this thread:


https://groups.google.com/forum/?fromgroups=#!topic/comp.unix.bsd.openbsd.misc/y-qciyc6wNY



in which Marc Espie says that pthreads on OpenBSD 4.2 is a purely
userspace thread library.

Since the backtrace you posted included multiple threads, I'm guessing
those are kernel threads, which means pthread_setspecific() can't work
on them.

If this diagnosis is right, then pixman currently can't support threads
on OpenBSD 4.2. Support could potentially be added, but it would have to
be done (and maintained) by someone who understands threads on OpenBSD.



OpenBSD changed from user-level to kernel-level threads in the last
release (5.2). I'll mail the maintainer of pixman on OpenBSD; my
knowledge of OpenBSD internals isn't enough to help you with the bug.



I've been doing some tests over the last month. The error occurs in pixman
0.28 and 0.29.2 when SSE is enabled. If I disable SSE in the configure
script of pixman 0.28, everything works.


I forgot to add that disabling SSE2 in pixman isn't a good fix for OpenBSD,
because only one program crashes with SSE2 enabled on i386 (it works on
amd64). Can you give me any guidance to help you fix the bug? If you suspect
a bug in OpenBSD, a little test program would be great, because then I could
show the problem to the OpenBSD devs.


Thanks.

___
Pixman mailing list
Pixman@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/pixman


Re: [Pixman] [PATCH 12/12] ARMv6: Add fast path for src_x888_0565

2013-03-04 Thread Ben Avison

On Mon, 04 Mar 2013 17:53:01 -, Chris Wilson  
wrote:

Did you try with image16? I think it should be hit somewhere, would seem
like somebody would use it eventually...


Thanks for the tip; I wasn't aware of that. I've been working from
Siarhei's trimmed set of traces that come with a script to run the tests,
and amongst other things, this sets CAIRO_TEST_TARGET=image. I'm not
really familiar with Cairo terminology, but I'm guessing that the target
is some sort of backing store that matches the depth of the framebuffer?
Certainly, if I change that setting to image16, I do start to see
src_x888_0565 being called, but the number of calls is still tiny:

t-chromium-tabs         0
t-evolution             0
t-firefox-asteroids     1
t-firefox-canvas-alpha  1
t-firefox-canvas        1
t-firefox-chalkboard    0
t-firefox-fishbowl      3
t-firefox-fishtank      1
t-firefox-paintball     3
t-firefox-particles     1
t-firefox-planet-gnome  43
t-firefox-scrolling     11
t-firefox-talos-gfx     1
t-firefox-talos-svg     1
t-gnome-system-monitor  0
t-gnome-terminal-vim    0
t-grads-heat-map        0
t-gvim                  0
t-midori-zoomed         38
t-poppler-reseau        1
t-poppler               0
t-swfdec-giant-steps    0
t-swfdec-youtube        1
t-xfce4-terminal-a1     0

Given that the number of calls of anything that shows up in the overall
times is usually measured in millions, or at the very least hundreds, I
can't imagine any of those being significant, so I think I'll save myself
the few hours it'd take to run the profile to confirm, if you don't mind
:)

Ben
___
Pixman mailing list
Pixman@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/pixman


Re: [Pixman] [PATCH 12/12] ARMv6: Add fast path for src_x888_0565

2013-03-04 Thread Chris Wilson
On Mon, Mar 04, 2013 at 05:42:29PM +, Ben Avison wrote:
> This isn't used in the trimmed cairo-perf-trace tests at all, but these are
> the lowlevel-blt-bench results:

Did you try with image16? I think it should be hit somewhere, would seem
like somebody would use it eventually...
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
___
Pixman mailing list
Pixman@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/pixman


[Pixman] [PATCH 11/12] ARMv6: Add fast path for add_8888_8888

2013-03-04 Thread Ben Avison
lowlevel-blt-bench results:

Before  After
Mean   StdDev   Mean   StdDev  Confidence  Change
L1  27.6   0.1  125.9  0.8 100.0%  +356.0%
L2  14.0   0.5  30.8   1.6 100.0%  +120.3%
M   12.2   0.0  26.7   0.1 100.0%  +118.8%
HT  10.2   0.1  17.0   0.1 100.0%  +67.1%
VT  10.0   0.0  16.6   0.1 100.0%  +65.7%
R   9.7    0.0  15.9   0.1 100.0%  +64.8%
RT  5.8    0.1  7.6    0.1 100.0%  +30.5%

Trimmed cairo-perf-trace results:

Before  After
Mean   StdDev   Mean   StdDev  Confidence  Change
t-xfce4-terminal-a1 18.6   0.1  18.4   0.1 100.0%  +1.0%
---
 pixman/pixman-arm-simd-asm.S |   58 ++
 pixman/pixman-arm-simd.c |8 ++
 2 files changed, 66 insertions(+), 0 deletions(-)

diff --git a/pixman/pixman-arm-simd-asm.S b/pixman/pixman-arm-simd-asm.S
index 4f9a015..158de73 100644
--- a/pixman/pixman-arm-simd-asm.S
+++ b/pixman/pixman-arm-simd-asm.S
@@ -350,6 +350,64 @@ generate_composite_function \
 
 
/**/
 
+.macro test_zero numregs, reg1, reg2, reg3, reg4
+teq WK&reg1, #0
+ .if numregs >= 2
+teqeq   WK&reg2, #0
+  .if numregs >= 3
+teqeq   WK&reg3, #0
+   .if numregs == 4
+teqeq   WK&reg4, #0
+   .endif
+  .endif
+ .endif
+.endm
+
+.macro add_8888_8888_2pixels  dst1, dst2
+uqadd8  WK&dst1, WK&dst1, MASK
+uqadd8  WK&dst2, WK&dst2, STRIDE_M
+.endm
+
+.macro add_8888_8888_1pixel  dst
+uqadd8  WK&dst, WK&dst, MASK
+.endm
+
+.macro add_8888_8888_process_head  cond, numbytes, firstreg, unaligned_src, unaligned_mask, preload
+pixld   , numbytes, firstreg, SRC, 0
+add DST, DST, #numbytes
+.endm
+
+.macro add_8888_8888_process_tail  cond, numbytes, firstreg
+test_zero %(numbytes/4), firstreg, %(firstreg+1), %(firstreg+2), %(firstreg+3)
+beq 01f
+ .if numbytes == 16
+ldrd    MASK, STRIDE_M, [DST, #-16]
+add_8888_8888_2pixels  firstreg, %(firstreg+1)
+ldrd    MASK, STRIDE_M, [DST, #-8]
+add_8888_8888_2pixels  %(firstreg+2), %(firstreg+3)
+ .elseif numbytes == 8
+ldrd    MASK, STRIDE_M, [DST, #-8]
+add_8888_8888_2pixels  firstreg, %(firstreg+1)
+ .else
+ldr MASK, [DST, #-4]
+add_8888_8888_1pixel  firstreg
+ .endif
+pixst   , numbytes, firstreg, DST
+01:
+.endm
+
+generate_composite_function \
+pixman_composite_add_8888_8888_asm_armv6, 32, 0, 32, \
+FLAG_DST_READWRITE | FLAG_BRANCH_OVER | FLAG_PROCESS_CORRUPTS_PSR | FLAG_PROCESS_DOES_STORE | FLAG_PROCESS_PRESERVES_SCRATCH | FLAG_NO_PRELOAD_DST, \
+2, /* prefetch distance */ \
+nop_macro, /* init */ \
+nop_macro, /* newline */ \
+nop_macro, /* cleanup */ \
+add_8888_8888_process_head, \
+add_8888_8888_process_tail
+
+/**/
+
 .macro over_8888_8888_init
 /* Hold loop invariant in MASK */
 ldr MASK, =0x00800080
diff --git a/pixman/pixman-arm-simd.c b/pixman/pixman-arm-simd.c
index 855b703..d227065 100644
--- a/pixman/pixman-arm-simd.c
+++ b/pixman/pixman-arm-simd.c
@@ -44,6 +44,8 @@ PIXMAN_ARM_BIND_FAST_PATH_SRC_DST (armv6, src_0565_8888,
 
 PIXMAN_ARM_BIND_FAST_PATH_SRC_DST (armv6, add_8_8,
uint8_t, 1, uint8_t, 1)
+PIXMAN_ARM_BIND_FAST_PATH_SRC_DST (armv6, add_8888_8888,
+   uint32_t, 1, uint32_t, 1)
 PIXMAN_ARM_BIND_FAST_PATH_SRC_DST (armv6, over_8888_8888,
uint32_t, 1, uint32_t, 1)
 
@@ -238,6 +240,12 @@ static const pixman_fast_path_t arm_simd_fast_paths[] =
 PIXMAN_STD_FAST_PATH (OVER_REVERSE, solid, null, a8b8g8r8, 
armv6_composite_over_reverse_n_8888),
 
 PIXMAN_STD_FAST_PATH (ADD, a8, null, a8, armv6_composite_add_8_8),
+PIXMAN_STD_FAST_PATH (ADD, a8r8g8b8, null, a8r8g8b8, armv6_composite_add_8888_8888),
+PIXMAN_STD_FAST_PATH (ADD, a8r8g8b8, null, x8r8g8b8, armv6_composite_add_8888_8888),
+PIXMAN_STD_FAST_PATH (ADD, x8r8g8b8, null, x8r8g8b8, armv6_composite_add_8888_8888),
+PIXMAN_STD_FAST_PATH (ADD, a8b8g8r8, null, a8b8g8r8, armv6_composite_add_8888_8888),
+PIXMAN_STD_FAST_PATH (ADD, a8b8g8r8, null, x8b8g8r8, armv6_composite_add_8888_8888),
+PIXMAN_STD_FAST_PATH (ADD, x8b8g8r8, null, x8b8g8r8, armv6_composite_add_8888_8888),
 
 PIXMAN_STD_FAST_PATH (OVER, solid, a8, a8r8g8b8, armv6_composite_over_n_8_8888),
 PIXMAN_STD_FAST_PATH (OVER, solid, a8, x8r8g8b8, armv6_composite_over_n_8_8888),
-- 
1.7.5.4

___
Pixman mailing list
Pixman@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/pixman


[Pixman] [PATCH 12/12] ARMv6: Add fast path for src_x888_0565

2013-03-04 Thread Ben Avison
This isn't used in the trimmed cairo-perf-trace tests at all, but these are
the lowlevel-blt-bench results:

Before  After
Mean   StdDev   Mean   StdDev  Confidence  Change
L1  68.5   1.0  116.3  0.6 100.0%  +69.8%
L2  31.1   1.8  60.9   5.0 100.0%  +96.1%
M   33.6   0.1  86.4   0.4 100.0%  +157.0%
HT  19.1   0.1  35.3   0.4 100.0%  +84.3%
VT  17.7   0.2  32.1   0.3 100.0%  +81.3%
R   17.5   0.2  29.9   0.3 100.0%  +70.7%
RT  7.0    0.1  11.8   0.3 100.0%  +68.4%
---
 pixman/pixman-arm-simd-asm.S |   77 ++
 pixman/pixman-arm-simd.c |7 
 2 files changed, 84 insertions(+), 0 deletions(-)

diff --git a/pixman/pixman-arm-simd-asm.S b/pixman/pixman-arm-simd-asm.S
index 158de73..423a16d 100644
--- a/pixman/pixman-arm-simd-asm.S
+++ b/pixman/pixman-arm-simd-asm.S
@@ -303,6 +303,83 @@ generate_composite_function \
 
 
/**/
 
+.macro src_x888_0565_init
+/* Hold loop invariant in MASK */
+ldr MASK, =0x001F001F
+line_saved_regs  STRIDE_S, ORIG_W
+.endm
+
+.macro src_x888_0565_1pixel  s, d
+and WK&d, MASK, WK&s, lsr #3   @ 
000r000b
+and STRIDE_S, WK&s, #0xFC00@ 
gg00
+orr WK&d, WK&d, WK&d, lsr #5   @ 
000-r00b
+orr WK&d, WK&d, STRIDE_S, lsr #5   @ 
000-rggb
+/* Top 16 bits are discarded during the following STRH */
+.endm
+
+.macro src_x888_0565_2pixels  slo, shi, d, tmp
+and SCRATCH, WK&shi, #0xFC00   @ 
GG00
+and WK&tmp, MASK, WK&shi, lsr #3   @ 
000R000B
+and WK&shi, MASK, WK&slo, lsr #3   @ 
000r000b
+orr WK&tmp, WK&tmp, WK&tmp, lsr #5 @ 
000-R00B
+orr WK&tmp, WK&tmp, SCRATCH, lsr #5@ 
000-RGGB
+and SCRATCH, WK&slo, #0xFC00   @ 
gg00
+orr WK&shi, WK&shi, WK&shi, lsr #5 @ 
000-r00b
+orr WK&shi, WK&shi, SCRATCH, lsr #5@ 
000-rggb
+pkhbt   WK&d, WK&shi, WK&tmp, lsl #16  @ 
RGGBrggb
+.endm
+
+.macro src_x888_0565_process_head   cond, numbytes, firstreg, unaligned_src, 
unaligned_mask, preload
+WK4 .req    STRIDE_S
+WK5 .req    STRIDE_M
+WK6 .req    WK3
+WK7 .req    ORIG_W
+ .if numbytes == 16
+pixld   , 16, 4, SRC, 0
+src_x888_0565_2pixels  4, 5, 0, 0
+pixld   , 8, 4, SRC, 0
+src_x888_0565_2pixels  6, 7, 1, 1
+pixld   , 8, 6, SRC, 0
+ .else
+pixld   , numbytes*2, 4, SRC, 0
+ .endif
+.endm
+
+.macro src_x888_0565_process_tail   cond, numbytes, firstreg
+ .if numbytes == 16
+src_x888_0565_2pixels  4, 5, 2, 2
+src_x888_0565_2pixels  6, 7, 3, 4
+ .elseif numbytes == 8
+src_x888_0565_2pixels  4, 5, 1, 1
+src_x888_0565_2pixels  6, 7, 2, 2
+ .elseif numbytes == 4
+src_x888_0565_2pixels  4, 5, 1, 1
+ .else
+src_x888_0565_1pixel  4, 1
+ .endif
+ .if numbytes == 16
+pixst   , numbytes, 0, DST
+ .else
+pixst   , numbytes, 1, DST
+ .endif
+.unreq  WK4
+.unreq  WK5
+.unreq  WK6
+.unreq  WK7
+.endm
+
+generate_composite_function \
+pixman_composite_src_x888_0565_asm_armv6, 32, 0, 16, \
+FLAG_DST_WRITEONLY | FLAG_BRANCH_OVER | FLAG_PROCESS_DOES_STORE | 
FLAG_SPILL_LINE_VARS | FLAG_PROCESS_CORRUPTS_SCRATCH, \
+3, /* prefetch distance */ \
+src_x888_0565_init, \
+nop_macro, /* newline */ \
+nop_macro, /* cleanup */ \
+src_x888_0565_process_head, \
+src_x888_0565_process_tail
+
+/**/
+
 .macro add_8_8_8pixels  cond, dst1, dst2
 uqadd8&cond  WK&dst1, WK&dst1, MASK
 uqadd8&cond  WK&dst2, WK&dst2, STRIDE_M
diff --git a/pixman/pixman-arm-simd.c b/pixman/pixman-arm-simd.c
index d227065..5a1708f 100644
--- a/pixman/pixman-arm-simd.c
+++ b/pixman/pixman-arm-simd.c
@@ -41,6 +41,8 @@ PIXMAN_ARM_BIND_FAST_PATH_SRC_DST (armv6, src_8_8,
uint8_t, 1, uint8_t, 1)
 PIXMAN_ARM_BIND_FAST_PATH_SRC_DST (armv6, src_0565_8888,
uint16_t, 1, uint32_t, 1)
+PIXMAN_ARM_BIND_FAST_PATH_SRC_DST (armv6, src_x888_0565,
+   uint32_t, 1, uint16_t, 1)
 
 PIXMAN_ARM_BIND_FAST_PATH_SRC_DST (armv6, add_8_8,
uint8_t, 1, uint8_t, 1)
@@ -227,6 +229,11 @@ static const pixman_fast_path_t arm_simd_fast_paths[] =
 PIXMAN_STD_FA

[Pixman] [PATCH 10/12] ARMv6: Add fast path for over_reverse_n_8888

2013-03-04 Thread Ben Avison
lowlevel-blt-bench results:

Before  After
Mean   StdDev   Mean   StdDev  Confidence  Change
L1  15.0   0.1  276.2  4.0 100.0%  +1743.3%
L2  13.4   0.3  154.8  17.4   100.0%  +1058.0%
M   11.4   0.0  73.7   0.8 100.0%  +549.4%
HT  10.2   0.0  25.6   0.2 100.0%  +150.9%
VT  10.0   0.0  23.0   0.3 100.0%  +129.4%
R   9.8    0.1  22.9   0.2 100.0%  +134.3%
RT  6.4    0.1  11.6   0.3 100.0%  +80.8%

Trimmed cairo-perf-trace results:

Before  After
Mean   StdDev   Mean   StdDev  Confidence  Change
t-poppler   11.8   0.1  8.8    0.1 100.0%  +34.6%
---
 pixman/pixman-arm-simd-asm.S |   78 ++
 pixman/pixman-arm-simd.c |6 +++
 2 files changed, 84 insertions(+), 0 deletions(-)

diff --git a/pixman/pixman-arm-simd-asm.S b/pixman/pixman-arm-simd-asm.S
index 20ad05a..4f9a015 100644
--- a/pixman/pixman-arm-simd-asm.S
+++ b/pixman/pixman-arm-simd-asm.S
@@ -979,6 +979,84 @@ generate_composite_function \
 
 
/**/
 
+.macro over_reverse_n_8888_init
+ldr SRC, [sp, #ARGS_STACK_OFFSET]
+ldr MASK, =0x00800080
+/* Split source pixel into RB/AG parts */
+uxtb16  STRIDE_S, SRC
+uxtb16  STRIDE_M, SRC, ror #8
+/* Set GE[3:0] to 0101 so SEL instructions do what we want */
+uadd8   SCRATCH, MASK, MASK
+line_saved_regs  STRIDE_D, ORIG_W
+.endm
+
+.macro over_reverse_n_8888_newline
+mov STRIDE_D, #0xFF
+.endm
+
+.macro over_reverse_n_8888_process_head  cond, numbytes, firstreg, unaligned_src, unaligned_mask, preload
+pixld   , numbytes, firstreg, DST, 0
+.endm
+
+.macro over_reverse_n_8888_1pixel  d, is_only
+teq WK&d, #0
+beq 8f   /* replace with source */
+bics    ORIG_W, STRIDE_D, WK&d, lsr #24
+ .if is_only == 1
+beq 49f  /* skip store */
+ .else
+beq 9f   /* write same value back */
+ .endif
+mla SCRATCH, STRIDE_S, ORIG_W, MASK /* red/blue */
+mla ORIG_W, STRIDE_M, ORIG_W, MASK  /* alpha/green */
+uxtab16 SCRATCH, SCRATCH, SCRATCH, ror #8
+uxtab16 ORIG_W, ORIG_W, ORIG_W, ror #8
+mov SCRATCH, SCRATCH, ror #8
+sel ORIG_W, SCRATCH, ORIG_W
+uqadd8  WK&d, WK&d, ORIG_W
+b   9f
+8:  mov WK&d, SRC
+9:
+.endm
+
+.macro over_reverse_n_8888_tail  numbytes, reg1, reg2, reg3, reg4
+ .if numbytes == 4
+over_reverse_n_8888_1pixel  reg1, 1
+ .else
+and SCRATCH, WK&reg1, WK&reg2
+  .if numbytes == 16
+and SCRATCH, SCRATCH, WK&reg3
+and SCRATCH, SCRATCH, WK&reg4
+  .endif
+mvns    SCRATCH, SCRATCH, asr #24
+beq 49f /* skip store if all opaque */
+over_reverse_n_8888_1pixel  reg1, 0
+over_reverse_n_8888_1pixel  reg2, 0
+  .if numbytes == 16
+over_reverse_n_8888_1pixel  reg3, 0
+over_reverse_n_8888_1pixel  reg4, 0
+  .endif
+ .endif
+pixst   , numbytes, reg1, DST
+49:
+.endm
+
+.macro over_reverse_n_8888_process_tail  cond, numbytes, firstreg
+over_reverse_n_8888_tail  numbytes, firstreg, %(firstreg+1), %(firstreg+2), %(firstreg+3)
+.endm
+
+generate_composite_function \
+pixman_composite_over_reverse_n_8888_asm_armv6, 0, 0, 32, \
+FLAG_DST_READWRITE | FLAG_BRANCH_OVER | FLAG_PROCESS_CORRUPTS_PSR | FLAG_PROCESS_DOES_STORE | FLAG_SPILL_LINE_VARS | FLAG_PROCESS_CORRUPTS_SCRATCH, \
+3, /* prefetch distance */ \
+over_reverse_n_8888_init, \
+over_reverse_n_8888_newline, \
+nop_macro, /* cleanup */ \
+over_reverse_n_8888_process_head, \
+over_reverse_n_8888_process_tail
+
+/**/
+
 #ifdef PROFILING
 .p2align 9
 #endif
diff --git a/pixman/pixman-arm-simd.c b/pixman/pixman-arm-simd.c
index 5a50098..855b703 100644
--- a/pixman/pixman-arm-simd.c
+++ b/pixman/pixman-arm-simd.c
@@ -50,6 +50,9 @@ PIXMAN_ARM_BIND_FAST_PATH_SRC_DST (armv6, over_8888_8888,
 PIXMAN_ARM_BIND_FAST_PATH_SRC_DST (armv6, in_reverse_8888_8888,
uint32_t, 1, uint32_t, 1)
 
+PIXMAN_ARM_BIND_FAST_PATH_N_DST (0, armv6, over_reverse_n_8888,
+ uint32_t, 1)
+
 PIXMAN_ARM_BIND_FAST_PATH_SRC_N_DST (SKIP_ZERO_MASK, armv6, over_8888_n_8888,
  uint32_t, 1, uint32_t, 1)
 
@@ -231,6 +234,9 @@ static const pixman_fast_path_t arm_simd_fast_paths[] =
 PIXMAN_STD_FAST_PATH (OVER, a8b8g8r8, solid, a8b8g8r8, armv6_composite_over_8888_n_8888),
 PIXMAN_STD_FAST_PATH (OVER, a8b8g8r8, solid, x8b8g8r8, armv6_composite_over_8888_n_8888),
 
+PIXMAN_STD_FAST_PATH (OVER_REVERSE, solid, null, a8r8g8b8, armv6_composite_over_reverse_n_8888),
+PIXMAN_STD

[Pixman] [PATCH 09/12] ARMv6: Add fast path for in_reverse_8888_8888

2013-03-04 Thread Ben Avison
lowlevel-blt-bench results:

Before  After
Mean   StdDev   Mean   StdDev  Confidence  Change
L1  21.3   0.1  32.5   0.2 100.0%  +52.1%
L2  12.1   0.2  19.5   0.5 100.0%  +61.2%
M   11.0   0.0  17.1   0.0 100.0%  +54.6%
HT  8.7    0.0  12.8   0.1 100.0%  +46.9%
VT  8.6    0.0  12.5   0.1 100.0%  +46.0%
R   8.6    0.0  12.0   0.1 100.0%  +40.6%
RT  5.1    0.1  6.6    0.1 100.0%  +28.8%

Trimmed cairo-perf-trace results:

Before  After
Mean   StdDev   Mean   StdDev  Confidence  Change
t-firefox-paintball 17.7   0.1  14.2   0.1 100.0%  +24.5%
---
 pixman/pixman-arm-simd-asm.S |  104 ++
 pixman/pixman-arm-simd.c |8 +++
 2 files changed, 112 insertions(+), 0 deletions(-)

diff --git a/pixman/pixman-arm-simd-asm.S b/pixman/pixman-arm-simd-asm.S
index ac084c4..20ad05a 100644
--- a/pixman/pixman-arm-simd-asm.S
+++ b/pixman/pixman-arm-simd-asm.S
@@ -875,6 +875,110 @@ generate_composite_function \
 
 
/**/
 
+.macro in_reverse_8888_8888_init
+/* Hold loop invariant in MASK */
+ldr MASK, =0x00800080
+/* Set GE[3:0] to 0101 so SEL instructions do what we want */
+uadd8   SCRATCH, MASK, MASK
+/* Offset the source pointer: we only need the alpha bytes */
+add SRC, SRC, #3
+line_saved_regs  ORIG_W
+.endm
+
+.macro in_reverse_8888_8888_head  numbytes, reg1, reg2, reg3
+ldrb    ORIG_W, [SRC], #4
+ .if numbytes >= 8
+ldrb    WK&reg1, [SRC], #4
+  .if numbytes == 16
+ldrb    WK&reg2, [SRC], #4
+ldrb    WK&reg3, [SRC], #4
+  .endif
+ .endif
+add DST, DST, #numbytes
+.endm
+
+.macro in_reverse_8888_8888_process_head  cond, numbytes, firstreg, unaligned_src, unaligned_mask, preload
+in_reverse_8888_8888_head  numbytes, firstreg, %(firstreg+1), %(firstreg+2)
+.endm
+
+.macro in_reverse_8888_8888_1pixel  s, d, offset, is_only
+ .if is_only != 1
+movs    s, ORIG_W
+  .if offset != 0
+ldrb    ORIG_W, [SRC, #offset]
+  .endif
+beq 01f
+teq STRIDE_M, #0xFF
+beq 02f
+ .endif
+uxtb16  SCRATCH, d /* rb_dest */
+uxtb16  d, d, ror #8   /* ag_dest */
+mla SCRATCH, SCRATCH, s, MASK
+mla d, d, s, MASK
+uxtab16 SCRATCH, SCRATCH, SCRATCH, ror #8
+uxtab16 d, d, d, ror #8
+mov SCRATCH, SCRATCH, ror #8
+sel d, SCRATCH, d
+b   02f
+ .if offset == 0
+48: /* Last mov d,#0 of the set - used as part of shortcut for
+ * source values all 0 */
+ .endif
+01: mov d, #0
+02:
+.endm
+
+.macro in_reverse_8888_8888_tail  numbytes, reg1, reg2, reg3, reg4
+ .if numbytes == 4
+teq ORIG_W, ORIG_W, asr #32
+ldrne   WK&reg1, [DST, #-4]
+ .elseif numbytes == 8
+teq ORIG_W, WK&reg1
+teqeq   ORIG_W, ORIG_W, asr #32  /* all 0 or all -1? */
+ldmnedb DST, {WK&reg1-WK&reg2}
+ .else
+teq ORIG_W, WK&reg1
+teqeq   ORIG_W, WK&reg2
+teqeq   ORIG_W, WK&reg3
+teqeq   ORIG_W, ORIG_W, asr #32  /* all 0 or all -1? */
+ldmnedb DST, {WK&reg1-WK&reg4}
+ .endif
+cmnne   DST, #0   /* clear C if NE */
+bcs 49f   /* no writes to dest if source all -1 */
+beq 48f   /* set dest to all 0 if source all 0 */
+ .if numbytes == 4
+in_reverse_8888_8888_1pixel  ORIG_W, WK&reg1, 0, 1
+str WK&reg1, [DST, #-4]
+ .elseif numbytes == 8
+in_reverse_8888_8888_1pixel  STRIDE_M, WK&reg1, -4, 0
+in_reverse_8888_8888_1pixel  STRIDE_M, WK&reg2, 0, 0
+stmdb   DST, {WK&reg1-WK&reg2}
+ .else
+in_reverse_8888_8888_1pixel  STRIDE_M, WK&reg1, -12, 0
+in_reverse_8888_8888_1pixel  STRIDE_M, WK&reg2, -8, 0
+in_reverse_8888_8888_1pixel  STRIDE_M, WK&reg3, -4, 0
+in_reverse_8888_8888_1pixel  STRIDE_M, WK&reg4, 0, 0
+stmdb   DST, {WK&reg1-WK&reg4}
+ .endif
+49:
+.endm
+
+.macro in_reverse_8888_8888_process_tail  cond, numbytes, firstreg
+in_reverse_8888_8888_tail  numbytes, firstreg, %(firstreg+1), %(firstreg+2), %(firstreg+3)
+.endm
+
+generate_composite_function \
+pixman_composite_in_reverse_8888_8888_asm_armv6, 32, 0, 32, \
+FLAG_DST_READWRITE | FLAG_BRANCH_OVER | FLAG_PROCESS_CORRUPTS_PSR | FLAG_PROCESS_DOES_STORE | FLAG_SPILL_LINE_VARS | FLAG_PROCESS_CORRUPTS_SCRATCH | FLAG_NO_PRELOAD_DST, \
+2, /* prefetch distance */ \
+in_reverse_8888_8888_init, \
+nop_macro, /* newline */ \
+nop_macro, /* cleanup */ \
+in_reverse_8888_8888_process_head, \
+in_reverse_8888_8888_process_tail
+
+/**/
+
 #ifdef PROFILING
 .p2align 9
 #endif
diff --git a/pixman/pixman-arm-simd.c b/pi

[Pixman] [PATCH 08/12] ARMv6: Added fast path for over_n_8888_8888_ca

2013-03-04 Thread Ben Avison
lowlevel-blt-bench results:

Before  After
Mean   StdDev   Mean   StdDev  Confidence  Change
L1  2.7    0.0  16.2   0.1 100.0%  +501.7%
L2  2.4    0.0  14.8   0.2 100.0%  +502.5%
M   2.4    0.0  15.0   0.0 100.0%  +525.7%
HT  2.2    0.0  10.2   0.1 100.0%  +354.9%
VT  2.2    0.0  9.9    0.1 100.0%  +344.5%
R   2.3    0.0  10.0   0.0 100.0%  +339.7%
RT  2.0    0.0  5.7    0.1 100.0%  +191.3%

Trimmed cairo-perf-trace results:

Before  After
Mean   StdDev   Mean   StdDev  Confidence  Change
t-firefox-talos-gfx     25.7   0.1  17.7   0.2 100.0%  +45.3%
t-firefox-scrolling     20.7   0.1  16.6   0.2 100.0%  +24.7%
t-evolution             8.0    0.1  6.9    0.2 100.0%  +14.6%
t-gnome-terminal-vim    17.2   0.2  15.5   0.2 100.0%  +11.1%
t-firefox-planet-gnome  9.8    0.1  8.8    0.1 100.0%  +11.0%
t-xfce4-terminal-a1     19.9   0.1  18.5   0.1 100.0%  +7.7%
t-gvim                  20.5   0.3  20.0   0.4 99.8%   +2.8%
t-firefox-paintball     18.0   0.1  17.7   0.1 100.0%  +1.7%
t-poppler-reseau        20.8   0.2  20.5   0.1 100.0%  +1.4%
t-firefox-fishbowl      21.6   0.0  21.5   0.1 99.5%   +0.4%
t-firefox-canvas        15.6   0.1  15.8   0.1 100.0%  -1.2%
---
 pixman/pixman-arm-simd-asm.S |  264 ++
 pixman/pixman-arm-simd.c |8 ++
 2 files changed, 272 insertions(+), 0 deletions(-)

diff --git a/pixman/pixman-arm-simd-asm.S b/pixman/pixman-arm-simd-asm.S
index 259fb88..ac084c4 100644
--- a/pixman/pixman-arm-simd-asm.S
+++ b/pixman/pixman-arm-simd-asm.S
@@ -611,6 +611,270 @@ generate_composite_function \
 
 
/**/
 
+.macro over_white_8888_8888_ca_init
+HALF    .req    SRC
+TMP0    .req    STRIDE_D
+TMP1    .req    STRIDE_S
+TMP2    .req    STRIDE_M
+TMP3    .req    ORIG_W
+WK4     .req    SCRATCH
+line_saved_regs STRIDE_D, STRIDE_M, ORIG_W
+ldr SCRATCH, =0x800080
+mov HALF, #0x80
+/* Set GE[3:0] to 0101 so SEL instructions do what we want */
+uadd8   SCRATCH, SCRATCH, SCRATCH
+.endm
+
+.macro over_white_8888_8888_ca_cleanup
+.unreq  HALF
+.unreq  TMP0
+.unreq  TMP1
+.unreq  TMP2
+.unreq  TMP3
+.unreq  WK4
+.endm
+
+.macro over_white_8888_8888_ca_combine  m, d
+uxtb16  TMP1, TMP0                /* rb_notmask */
+uxtb16  TMP2, d                   /* rb_dest; 1 stall follows */
+smlatt  TMP3, TMP2, TMP1, HALF    /* red */
+smlabb  TMP2, TMP2, TMP1, HALF    /* blue */
+uxtb16  TMP0, TMP0, ror #8        /* ag_notmask */
+uxtb16  TMP1, d, ror #8           /* ag_dest; 1 stall follows */
+smlatt  d, TMP1, TMP0, HALF       /* alpha */
+smlabb  TMP1, TMP1, TMP0, HALF    /* green */
+pkhbt   TMP0, TMP2, TMP3, lsl #16 /* rb; 1 stall follows */
+pkhbt   TMP1, TMP1, d, lsl #16    /* ag */
+uxtab16 TMP0, TMP0, TMP0, ror #8
+uxtab16 TMP1, TMP1, TMP1, ror #8
+mov TMP0, TMP0, ror #8
+sel d, TMP0, TMP1
+uqadd8  d, d, m   /* d is a late result */
+.endm
+
+.macro over_white_8888_8888_ca_1pixel_head
+pixld   , 4, 1, MASK, 0
+pixld   , 4, 3, DST, 0
+.endm
+
+.macro over_white_8888_8888_ca_1pixel_tail
+mvn TMP0, WK1
+teq WK1, WK1, asr #32
+bne 01f
+bcc 03f
+mov WK3, WK1
+b   02f
+01: over_white_8888_8888_ca_combine WK1, WK3
+02: pixst   , 4, 3, DST
+03:
+.endm
+
+.macro over_white_8888_8888_ca_2pixels_head
+pixld   , 8, 1, MASK, 0
+pixld   , 8, 3, DST
+.endm
+
+.macro over_white_8888_8888_ca_2pixels_tail
+mvn TMP0, WK1
+teq WK1, WK1, asr #32
+bne 01f
+movcs   WK3, WK1
+bcs 02f
+teq WK2, #0
+beq 05f
+b   02f
+01: over_white_8888_8888_ca_combine WK1, WK3
+02: mvn TMP0, WK2
+teq WK2, WK2, asr #32
+bne 03f
+movcs   WK4, WK2
+b   04f
+03: over_white_8888_8888_ca_combine WK2, WK4
+04: pixst   , 8, 3, DST
+05:
+.endm
+
+.macro over_white_8888_8888_ca_process_head  cond, numbytes, firstreg, unaligned_src, unaligned_mask, preload
 .if numbytes == 4
+over_white_8888_8888_ca_1pixel_head
 .else
  .if numbytes == 16
+over_white_8888_8888_ca_2pixels_head
+over_white_8888_8888_ca_2pixels_tail
  .endif
+over_white_8888_8888_ca_2pixels_head
+ .endif
+.endm
+
+.macro over_white_8888_8888_ca_process_tail  cond, numbytes, firstreg
+ .if numbytes == 4
+over_whit

[Pixman] [PATCH 07/12] ARMv6: Macro to permit testing for early returns or alternate implementations

2013-03-04 Thread Ben Avison
When the source or mask is solid (as opposed to a bitmap) there is the
possibility of an immediate exit, or a branch to an alternate, more optimal
implementation in some cases. This is best achieved with a brief prologue to
the function; to permit this, the necessary boilerplate for setting up a
function entry is now available in the "startfunc" macro.

This feature was first included in my over_n_8888 fast path, but since that's
still sitting in the submission queue at the time of writing, I'm posting it
again as an independent patch.
---
 pixman/pixman-arm-simd-asm.h |   26 +++---
 1 files changed, 15 insertions(+), 11 deletions(-)

diff --git a/pixman/pixman-arm-simd-asm.h b/pixman/pixman-arm-simd-asm.h
index c7e5ca7..a41e1e0 100644
--- a/pixman/pixman-arm-simd-asm.h
+++ b/pixman/pixman-arm-simd-asm.h
@@ -107,6 +107,20 @@
 .set PREFETCH_TYPE_NONE,   0
 .set PREFETCH_TYPE_STANDARD,   1
 
+.macro startfunc fname
+#ifdef PROFILING
+ .p2align 9
+#endif
+ .func fname
+ .global fname
+ /* For ELF format also set function visibility to hidden */
+#ifdef __ELF__
+ .hidden fname
+ .type fname, %function
+#endif
+fname:
+.endm
+
 /*
  * Definitions of macros for load/store of pixel data.
  */
@@ -596,16 +610,7 @@
process_tail, \
process_inner_loop
 
-#ifdef PROFILING
- .p2align 9
-#endif
- .func fname
- .global fname
- /* For ELF format also set function visibility to hidden */
-#ifdef __ELF__
- .hidden fname
- .type fname, %function
-#endif
+startfunc fname
 
 /*
  * Make some macro arguments globally visible and accessible
@@ -717,7 +722,6 @@
 SCRATCH .reqr12
 ORIG_W  .reqr14 /* width (pixels) */
 
-fname:
 push{r4-r11, lr}/* save all registers */
 
 subsY, Y, #1
-- 
1.7.5.4

___
Pixman mailing list
Pixman@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/pixman


[Pixman] [PATCH 06/12] Add extra test to lowlevel-blt-bench and fix an existing one

2013-03-04 Thread Ben Avison
in_reverse_8888_8888 is one of the more commonly used operations in the
cairo-perf-trace suite that hasn't been in lowlevel-blt-bench until now.

The source for over_reverse_n_8888 needed to be marked as solid.
---
 test/lowlevel-blt-bench.c |3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/test/lowlevel-blt-bench.c b/test/lowlevel-blt-bench.c
index 4e16f7b..9984fa8 100644
--- a/test/lowlevel-blt-bench.c
+++ b/test/lowlevel-blt-bench.c
@@ -706,7 +706,8 @@ tests_tbl[] =
 { "outrev_n__1555_ca", PIXMAN_a8r8g8b8,1, PIXMAN_OP_OUT_REV, 
PIXMAN_a8r8g8b8, 2, PIXMAN_a1r5g5b5 },
 { "outrev_n__x888_ca", PIXMAN_a8r8g8b8,1, PIXMAN_OP_OUT_REV, 
PIXMAN_a8r8g8b8, 2, PIXMAN_x8r8g8b8 },
 { "outrev_n___ca", PIXMAN_a8r8g8b8,1, PIXMAN_OP_OUT_REV, 
PIXMAN_a8r8g8b8, 2, PIXMAN_a8r8g8b8 },
-{ "over_reverse_n_",   PIXMAN_a8r8g8b8,0, PIXMAN_OP_OVER_REVERSE, 
PIXMAN_null, 0, PIXMAN_a8r8g8b8 },
+{ "over_reverse_n_",   PIXMAN_a8r8g8b8,1, PIXMAN_OP_OVER_REVERSE, 
PIXMAN_null, 0, PIXMAN_a8r8g8b8 },
+{ "in_reverse__",  PIXMAN_a8r8g8b8,0, PIXMAN_OP_IN_REVERSE, 
PIXMAN_null,  0, PIXMAN_a8r8g8b8 },
 };
 
 int
-- 
1.7.5.4

___
Pixman mailing list
Pixman@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/pixman


[Pixman] [PATCH 05/12] ARMv6: Force fast paths to have fixed alignment to the BTAC

2013-03-04 Thread Ben Avison
Trying to produce repeatable, trustworthy profiling results from the
cairo-perf-trace benchmark suite has proved tricky, especially when testing
changes that have only a marginal (< ~5%) effect upon the runtime as a whole.

One of the problems is that some traces appear to show statistically
significant changes even when the only fast path that has changed is not even
exercised by the trace in question. This patch helps to address this by
ensuring that the aliasing of the remaining fast paths within the branch
predictor's target address cache (BTAC) is not affected by the addition,
removal or refactoring of any other fast paths.

The profiling results later in this patch series have been calculated with
this switch enabled, to ensure fair comparisons. Additionally, the
cairo-perf-trace test harness itself was modified to do timing using
getrusage() so as to exclude any kernel mode components of the runtime.
Between these two measures, the majority of false positives appear to have
been eliminated.
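
For reference, user-mode-only timing of the kind described above can be
obtained with the standard getrusage() call; the sketch below merely
illustrates the idea (the helper name and the way cairo-perf-trace actually
hooks it in are not taken from this series):

    #include <stdio.h>
    #include <sys/resource.h>

    /* Accumulated user-mode CPU time of the process, in seconds,
     * excluding any time spent in the kernel. */
    static double
    user_time (void)
    {
        struct rusage ru;

        getrusage (RUSAGE_SELF, &ru);
        return ru.ru_utime.tv_sec + ru.ru_utime.tv_usec / 1000000.0;
    }

    int
    main (void)
    {
        double t1 = user_time ();
        /* ... run the workload being benchmarked here ... */
        double t2 = user_time ();

        printf ("user time: %f s\n", t2 - t1);
        return 0;
    }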
---
 pixman/pixman-arm-simd-asm.S |3 +++
 pixman/pixman-arm-simd-asm.h |9 +
 2 files changed, 12 insertions(+), 0 deletions(-)

diff --git a/pixman/pixman-arm-simd-asm.S b/pixman/pixman-arm-simd-asm.S
index c209688..259fb88 100644
--- a/pixman/pixman-arm-simd-asm.S
+++ b/pixman/pixman-arm-simd-asm.S
@@ -611,3 +611,6 @@ generate_composite_function \
 
 
/**/
 
+#ifdef PROFILING
+.p2align 9
+#endif
diff --git a/pixman/pixman-arm-simd-asm.h b/pixman/pixman-arm-simd-asm.h
index 4c08b9e..c7e5ca7 100644
--- a/pixman/pixman-arm-simd-asm.h
+++ b/pixman/pixman-arm-simd-asm.h
@@ -54,6 +54,12 @@
  */
 
 /*
+ * Determine whether we space out fast paths to reduce the effect of
+ * different BTAC aliasing upon comparative profiling results
+ */
+#define PROFILING
+
+/*
  * Determine whether we put the arguments on the stack for debugging.
  */
 #undef DEBUG_PARAMS
@@ -590,6 +596,9 @@
process_tail, \
process_inner_loop
 
+#ifdef PROFILING
+ .p2align 9
+#endif
  .func fname
  .global fname
  /* For ELF format also set function visibility to hidden */
-- 
1.7.5.4

___
Pixman mailing list
Pixman@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/pixman


[Pixman] [PATCH 04/12] ARMv6: Add fast path flag to force no preload of destination buffer

2013-03-04 Thread Ben Avison
---
 pixman/pixman-arm-simd-asm.h |   14 +-
 1 files changed, 13 insertions(+), 1 deletions(-)

diff --git a/pixman/pixman-arm-simd-asm.h b/pixman/pixman-arm-simd-asm.h
index e481320..4c08b9e 100644
--- a/pixman/pixman-arm-simd-asm.h
+++ b/pixman/pixman-arm-simd-asm.h
@@ -78,6 +78,8 @@
 .set FLAG_PROCESS_PRESERVES_SCRATCH, 64
 .set FLAG_PROCESS_PRESERVES_WK0, 0
 .set FLAG_PROCESS_CORRUPTS_WK0,  128 /* if possible, use the specified 
register(s) instead so WK0 can hold number of leading pixels */
+.set FLAG_PRELOAD_DST,   0
+.set FLAG_NO_PRELOAD_DST,256
 
 /*
  * Offset into stack where mask and source pointer/stride can be accessed.
@@ -439,7 +441,7 @@
 preload_middle  src_bpp, SRC, 0
 preload_middle  mask_bpp, MASK, 0
   .endif
-  .if (dst_r_bpp > 0) && ((SUBBLOCK % 2) == 0)
+  .if (dst_r_bpp > 0) && ((SUBBLOCK % 2) == 0) && (((flags) & 
FLAG_NO_PRELOAD_DST) == 0)
 /* Because we know that writes are 16-byte aligned, it's relatively 
easy to ensure that
  * destination prefetches are 32-byte aligned. It's also the easiest 
channel to offset
  * preloads for, to achieve staggered prefetches for multiple 
channels, because there are
@@ -474,7 +476,9 @@
  .endif
 preload_trailing  src_bpp, src_bpp_shift, SRC
 preload_trailing  mask_bpp, mask_bpp_shift, MASK
+ .if ((flags) & FLAG_NO_PRELOAD_DST) == 0
 preload_trailing  dst_r_bpp, dst_bpp_shift, DST
+ .endif
 add X, X, #(prefetch_distance+2)*pix_per_block - 128/dst_w_bpp
 /* The remainder of the line is handled identically to the medium case 
*/
 medium_case_inner_loop_and_trailing_pixels  process_head, 
process_tail,, exit_label, unaligned_src, unaligned_mask
@@ -773,7 +777,9 @@ fname:
 newline
 preload_leading_step1  src_bpp, WK1, SRC
 preload_leading_step1  mask_bpp, WK2, MASK
+  .if ((flags) & FLAG_NO_PRELOAD_DST) == 0
 preload_leading_step1  dst_r_bpp, WK3, DST
+  .endif
 
 andsWK0, DST, #15
 beq 154f
@@ -781,7 +787,9 @@ fname:
 
 preload_leading_step2  src_bpp, src_bpp_shift, WK1, SRC
 preload_leading_step2  mask_bpp, mask_bpp_shift, WK2, MASK
+  .if ((flags) & FLAG_NO_PRELOAD_DST) == 0
 preload_leading_step2  dst_r_bpp, dst_bpp_shift, WK3, DST
+  .endif
 
 leading_15bytes  process_head, process_tail
 
@@ -821,7 +829,9 @@ fname:
 newline
 preload_line 0, src_bpp, src_bpp_shift, SRC  /* in: X, corrupts: 
WK0-WK1 */
 preload_line 0, mask_bpp, mask_bpp_shift, MASK
+ .if ((flags) & FLAG_NO_PRELOAD_DST) == 0
 preload_line 0, dst_r_bpp, dst_bpp_shift, DST
+ .endif
 
 sub X, X, #128/dst_w_bpp /* simplifies inner loop termination 
*/
 andsWK0, DST, #15
@@ -850,7 +860,9 @@ fname:
 newline
 preload_line 1, src_bpp, src_bpp_shift, SRC  /* in: X, corrupts: 
WK0-WK1 */
 preload_line 1, mask_bpp, mask_bpp_shift, MASK
+ .if ((flags) & FLAG_NO_PRELOAD_DST) == 0
 preload_line 1, dst_r_bpp, dst_bpp_shift, DST
+ .endif
 
  .if dst_w_bpp == 8
 tst DST, #3
-- 
1.7.5.4

___
Pixman mailing list
Pixman@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/pixman


[Pixman] [PATCH 03/12] ARMv6: Support for very variable-hungry composite operations

2013-03-04 Thread Ben Avison
Previously, the variable ARGS_STACK_OFFSET was available to extract values
from function arguments during the init macro. Now this changes dynamically
around stack operations in the function as a whole so that arguments can be
accessed at any point. It is also joined by LOCALS_STACK_OFFSET, which
allows access to space reserved on the stack during the init macro.

On top of this, composite macros now have the option of using all of WK0-WK3
registers rather than just the subset they were told to use; this requires the
pixel count to be spilled to the stack over the leading pixels at the start
of each line. Thus, at best, each composite operation can use 11 registers,
plus any pointer registers not required for the composite type, plus as much
stack space as it needs, divided up into constants and variables as necessary.
---
 pixman/pixman-arm-simd-asm.h |   56 +++--
 1 files changed, 53 insertions(+), 3 deletions(-)

diff --git a/pixman/pixman-arm-simd-asm.h b/pixman/pixman-arm-simd-asm.h
index 3a2c250..e481320 100644
--- a/pixman/pixman-arm-simd-asm.h
+++ b/pixman/pixman-arm-simd-asm.h
@@ -76,6 +76,8 @@
 .set FLAG_SPILL_LINE_VARS,   48
 .set FLAG_PROCESS_CORRUPTS_SCRATCH,  0
 .set FLAG_PROCESS_PRESERVES_SCRATCH, 64
+.set FLAG_PROCESS_PRESERVES_WK0, 0
+.set FLAG_PROCESS_CORRUPTS_WK0,  128 /* if possible, use the specified 
register(s) instead so WK0 can hold number of leading pixels */
 
 /*
  * Offset into stack where mask and source pointer/stride can be accessed.
@@ -87,6 +89,11 @@
 #endif
 
 /*
+ * Offset into stack where space allocated during init macro can be accessed.
+ */
+.set LOCALS_STACK_OFFSET, 0
+
+/*
  * Constants for selecting preferable prefetch type.
  */
 .set PREFETCH_TYPE_NONE,   0
@@ -359,23 +366,41 @@
 
 
 .macro test_bits_1_0_ptr
+ .if (flags) & FLAG_PROCESS_CORRUPTS_WK0
+movs    SCRATCH, X, lsl #32-1  /* C,N = bits 1,0 of DST */
+ .else
 movs    SCRATCH, WK0, lsl #32-1  /* C,N = bits 1,0 of DST */
+ .endif
 .endm
 
 .macro test_bits_3_2_ptr
+ .if (flags) & FLAG_PROCESS_CORRUPTS_WK0
+movs    SCRATCH, X, lsl #32-3  /* C,N = bits 3, 2 of DST */
+ .else
 movs    SCRATCH, WK0, lsl #32-3  /* C,N = bits 3, 2 of DST */
+ .endif
 .endm
 
 .macro leading_15bytes  process_head, process_tail
 /* On entry, WK0 bits 0-3 = number of bytes until destination is 
16-byte aligned */
+ .set DECREMENT_X, 1
+ .if (flags) & FLAG_PROCESS_CORRUPTS_WK0
+  .set DECREMENT_X, 0
+sub X, X, WK0, lsr #dst_bpp_shift
+str X, [sp, #LINE_SAVED_REG_COUNT*4]
+mov X, WK0
+ .endif
 /* Use unaligned loads in all cases for simplicity */
  .if dst_w_bpp == 8
-conditional_process2  test_bits_1_0_ptr, mi, cs, process_head, 
process_tail, 1, 2, 1, 2, 1, 1, 1
+conditional_process2  test_bits_1_0_ptr, mi, cs, process_head, 
process_tail, 1, 2, 1, 2, 1, 1, DECREMENT_X
  .elseif dst_w_bpp == 16
 test_bits_1_0_ptr
-conditional_process1  cs, process_head, process_tail, 2, 2, 1, 1, 1
+conditional_process1  cs, process_head, process_tail, 2, 2, 1, 1, 
DECREMENT_X
+ .endif
+conditional_process2  test_bits_3_2_ptr, mi, cs, process_head, 
process_tail, 4, 8, 1, 2, 1, 1, DECREMENT_X
+ .if (flags) & FLAG_PROCESS_CORRUPTS_WK0
+ldr X, [sp, #LINE_SAVED_REG_COUNT*4]
  .endif
-conditional_process2  test_bits_3_2_ptr, mi, cs, process_head, 
process_tail, 4, 8, 1, 2, 1, 1, 1
 .endm
 
 .macro test_bits_3_2_pix
@@ -705,6 +730,13 @@ fname:
 #endif
 
 init
+
+ .if (flags) & FLAG_PROCESS_CORRUPTS_WK0
+/* Reserve a word in which to store X during leading pixels */
+sub sp, sp, #4
+  .set ARGS_STACK_OFFSET, ARGS_STACK_OFFSET+4
+  .set LOCALS_STACK_OFFSET, LOCALS_STACK_OFFSET+4
+ .endif
 
 lsl STRIDE_D, #dst_bpp_shift /* stride in bytes */
 sub STRIDE_D, STRIDE_D, X, lsl #dst_bpp_shift
@@ -734,6 +766,8 @@ fname:
   .if (flags) & FLAG_SPILL_LINE_VARS_WIDE
 /* This is stmdb sp!,{} */
 .word   0xE92D | LINE_SAVED_REGS
+   .set ARGS_STACK_OFFSET, ARGS_STACK_OFFSET + LINE_SAVED_REG_COUNT*4
+   .set LOCALS_STACK_OFFSET, LOCALS_STACK_OFFSET + LINE_SAVED_REG_COUNT*4
   .endif
 151:/* New line */
 newline
@@ -767,6 +801,10 @@ fname:
 
 157:/* Check for another line */
 end_of_line 1, %((flags) & FLAG_SPILL_LINE_VARS_WIDE), 151b
+  .if (flags) & FLAG_SPILL_LINE_VARS_WIDE
+   .set ARGS_STACK_OFFSET, ARGS_STACK_OFFSET - LINE_SAVED_REG_COUNT*4
+   .set LOCALS_STACK_OFFSET, LOCALS_STACK_OFFSET - LINE_SAVED_REG_COUNT*4
+  .endif
  .endif
 
  .ltorg
@@ -776,6 +814,8 @@ fname:
  .if (flags) & FLAG_SPILL_LINE_VARS_NON_WIDE
 /* This is stmdb sp!,{} */
 .word   0xE92D | LINE_SAVED_REGS
+  .set ARGS_STACK_OFFSET, ARGS_STACK_OFFSET + LINE_SAVED_REG_COUNT*4
+  .set LOCALS_STACK_OFFSET, LOCALS_STACK_OFFSET + LINE_SAVED_REG_COUNT*4
  .endif
 161:/* 

[Pixman] [PATCH 02/12] ARMv6: Minor optimisation

2013-03-04 Thread Ben Avison
This knocks off one instruction per row. The effect is probably too small to
be measurable, but might as well be included. The second occurrence of this
sequence doesn't actually benefit at all, but is changed for consistency.
---
 pixman/pixman-arm-simd-asm.h |   11 ---
 1 files changed, 4 insertions(+), 7 deletions(-)

diff --git a/pixman/pixman-arm-simd-asm.h b/pixman/pixman-arm-simd-asm.h
index 74400c1..3a2c250 100644
--- a/pixman/pixman-arm-simd-asm.h
+++ b/pixman/pixman-arm-simd-asm.h
@@ -741,12 +741,9 @@ fname:
 preload_leading_step1  mask_bpp, WK2, MASK
 preload_leading_step1  dst_r_bpp, WK3, DST
 
-tst DST, #15
+ands    WK0, DST, #15
 beq 154f
-rsb WK0, DST, #0 /* bits 0-3 = number of leading bytes until 
destination aligned */
-  .if (src_bpp != 0 && src_bpp != 2*dst_w_bpp) || (mask_bpp != 0 && mask_bpp 
!= 2*dst_w_bpp)
-PF  and,WK0, WK0, #15
-  .endif
+rsb WK0, WK0, #16 /* bits 0-3 = number of leading bytes until 
destination aligned */
 
 preload_leading_step2  src_bpp, src_bpp_shift, WK1, SRC
 preload_leading_step2  mask_bpp, mask_bpp_shift, WK2, MASK
@@ -787,9 +784,9 @@ fname:
 preload_line 0, dst_r_bpp, dst_bpp_shift, DST
 
 sub X, X, #128/dst_w_bpp /* simplifies inner loop termination 
*/
-tst DST, #15
+ands    WK0, DST, #15
 beq 164f
-rsb WK0, DST, #0 /* bits 0-3 = number of leading bytes until 
destination aligned */
+rsb WK0, WK0, #16 /* bits 0-3 = number of leading bytes until 
destination aligned */
 
 leading_15bytes  process_head, process_tail
 
-- 
1.7.5.4

___
Pixman mailing list
Pixman@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/pixman


[Pixman] [PATCH 01/12] ARMv6: Fix some indentation in the composite macros

2013-03-04 Thread Ben Avison
---
 pixman/pixman-arm-simd-asm.h |   12 ++--
 1 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/pixman/pixman-arm-simd-asm.h b/pixman/pixman-arm-simd-asm.h
index 6543606..74400c1 100644
--- a/pixman/pixman-arm-simd-asm.h
+++ b/pixman/pixman-arm-simd-asm.h
@@ -755,18 +755,18 @@ fname:
 leading_15bytes  process_head, process_tail
 
 154:/* Destination now 16-byte aligned; we have at least one prefetch on 
each channel as well as at least one 16-byte output block */
- .if (src_bpp > 0) && (mask_bpp == 0) && ((flags) & 
FLAG_PROCESS_PRESERVES_SCRATCH)
+  .if (src_bpp > 0) && (mask_bpp == 0) && ((flags) & 
FLAG_PROCESS_PRESERVES_SCRATCH)
 and SCRATCH, SRC, #31
 rsb SCRATCH, SCRATCH, #32*prefetch_distance
- .elseif (src_bpp == 0) && (mask_bpp > 0) && ((flags) & 
FLAG_PROCESS_PRESERVES_SCRATCH)
+  .elseif (src_bpp == 0) && (mask_bpp > 0) && ((flags) & 
FLAG_PROCESS_PRESERVES_SCRATCH)
 and SCRATCH, MASK, #31
 rsb SCRATCH, SCRATCH, #32*prefetch_distance
- .endif
- .ifc "process_inner_loop",""
+  .endif
+  .ifc "process_inner_loop",""
 switch_on_alignment  wide_case_inner_loop_and_trailing_pixels, 
process_head, process_tail, wide_case_inner_loop, 157f
- .else
+  .else
 switch_on_alignment  wide_case_inner_loop_and_trailing_pixels, 
process_head, process_tail, process_inner_loop, 157f
- .endif
+  .endif
 
 157:/* Check for another line */
 end_of_line 1, %((flags) & FLAG_SPILL_LINE_VARS_WIDE), 151b
-- 
1.7.5.4

___
Pixman mailing list
Pixman@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/pixman


[Pixman] [PATCH 00/12] ARMv6: Assorted improvements

2013-03-04 Thread Ben Avison
While I have some pending contributions relating to pad-repeated
images and over_n_8888 from 2013-02-06 and 2013-02-13, I've been
continuing to work in other areas. These patches have been rebased
at the current head of git (as I understand is list policy), though
the Cairo benchmark results included in the log messages assume the
earlier patches have been applied (lowlevel-blt-bench should be
unaffected). It is likely that you will encounter conflicts if you
attempt to apply both this patch series and my series from February
in either order.

Ben Avison (12):
  ARMv6: Fix some indentation in the composite macros
  ARMv6: Minor optimisation
  ARMv6: Support for very variable-hungry composite operations
  ARMv6: Add fast path flag to force no preload of destination buffer
  ARMv6: Force fast paths to have fixed alignment to the BTAC
  Add extra test to lowlevel-blt-bench and fix an existing one
  ARMv6: Macro to permit testing for early returns or alternate
implementations
  ARMv6: Added fast path for over_n_8888_8888_ca
  ARMv6: Add fast path for in_reverse_8888_8888
  ARMv6: Add fast path for over_reverse_n_8888
  ARMv6: Add fast path for add_8888_8888
  ARMv6: Add fast path for src_x888_0565

 pixman/pixman-arm-simd-asm.S |  584 ++
 pixman/pixman-arm-simd-asm.h |  122 +++--
 pixman/pixman-arm-simd.c |   37 +++
 test/lowlevel-blt-bench.c|3 +-
 4 files changed, 720 insertions(+), 26 deletions(-)

-- 
1.7.5.4

___
Pixman mailing list
Pixman@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/pixman


[Pixman] [PATCH] MIPS: DSPr2: Fix for bug in in_n_8 routine.

2013-03-04 Thread Nemanja Lukic
The rounding logic was not implemented correctly: logical shifts were used
instead of the rounding version of the 8-bit shift. The code also used
unnecessary multiplications, which can be avoided by packing four destination
(a8) pixels into one 32-bit register, and it made unnecessary spills onto the
stack. The code is rewritten to address these issues.

The bug was revealed by increasing the number of iterations in blitters-test.
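
For context, the scalar arithmetic that the fixed routine needs to match is
the usual exact multiply of two 8-bit values treated as fractions of 255,
with rounding. A minimal C sketch of that operation (written out here only
for illustration; the helper name is not from the patch) is:

    #include <stdint.h>

    /* result = round (a * b / 255).  The 0x80 bias and the extra ">> 8"
     * feedback perform the rounding; a plain logical shift would not. */
    static uint8_t
    mul_un8 (uint8_t a, uint8_t b)
    {
        uint16_t t = a * b + 0x80;

        return (t + (t >> 8)) >> 8;
    }

In the DSPr2 code the same rounding comes from the shra_r.ph (shift right
arithmetic with rounding) instructions that the rewritten loop uses where the
original code used plain logical shifts.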

Performance numbers on MIPS-74kc @ 1GHz:

lowlevel-blt-bench results

Referent (before):
   in_n_8 =  L1:  21.20  L2:  22.86  M: 21.42 ( 14.21%)  HT: 15.97  VT: 15.69  R: 15.47  RT:  8.00 (  48Kops/s)
Optimized (first implementation, with bug):
   in_n_8 =  L1:  89.38  L2:  86.07  M: 65.48 ( 43.44%)  HT: 44.64  VT: 41.50  R: 40.77  RT: 16.94 (  66Kops/s)
Optimized (with bug fix, and code revisited):
   in_n_8 =  L1: 102.33  L2:  95.65  M: 70.54 ( 46.84%)  HT: 48.35  VT: 45.06  R: 43.20  RT: 17.60 (  66Kops/s)
---
 pixman/pixman-mips-dspr2-asm.S |  118 
 1 files changed, 48 insertions(+), 70 deletions(-)

diff --git a/pixman/pixman-mips-dspr2-asm.S b/pixman/pixman-mips-dspr2-asm.S
index b94e66f..3a4d914 100644
--- a/pixman/pixman-mips-dspr2-asm.S
+++ b/pixman/pixman-mips-dspr2-asm.S
@@ -2974,96 +2974,74 @@ END(pixman_composite_over_reverse_n__asm_mips)
 LEAF_MIPS_DSPR2(pixman_composite_in_n_8_asm_mips)
 /*
  * a0 - dst  (a8)
- * a1 - src  (a8r8g8b8)
+ * a1 - src  (32bit constant)
  * a2 - w
  */
 
-beqz  a2, 5f
+li    t9, 0x00ff00ff
+beqz  a2, 3f
  nop
-
-SAVE_REGS_ON_STACK 20, s0, s1, s2, s3, s4, s5, s6, s7
-move  t7, a1
-srl   t5, t7, 24
-replv.ph  t5, t5
-srl   t9, a2, 2   /* t1 = how many multiples of 4 src pixels */
-beqz  t9, 2f  /* branch if less than 4 src pixels */
+srl   t7, a2, 2   /* t7 = how many multiples of 4 dst pixels */
+beqz  t7, 1f  /* branch if less than 4 src pixels */
  nop
 
-1:
-addiu t9, t9, -1
-addiu a2, a2, -4
+srl   t8, a1, 24
+replv.ph  t8, t8
+
+0:
+beqz  t7, 1f
+ addiu t7, t7, -1
 lbu   t0, 0(a0)
 lbu   t1, 1(a0)
 lbu   t2, 2(a0)
 lbu   t3, 3(a0)
 
-muleu_s.ph.qbl    s0, t0, t5
-muleu_s.ph.qbr    s1, t0, t5
-muleu_s.ph.qbl    s2, t1, t5
-muleu_s.ph.qbr    s3, t1, t5
-muleu_s.ph.qbl    s4, t2, t5
-muleu_s.ph.qbr    s5, t2, t5
-muleu_s.ph.qbl    s6, t3, t5
-muleu_s.ph.qbr    s7, t3, t5
-
-shrl.ph   t4, s0, 8
-shrl.ph   t6, s1, 8
-shrl.ph   t7, s2, 8
-shrl.ph   t8, s3, 8
-addq.ph   t0, s0, t4
-addq.ph   t1, s1, t6
-addq.ph   t2, s2, t7
-addq.ph   t3, s3, t8
-shra_r.ph t0, t0, 8
-shra_r.ph t1, t1, 8
+precr_sra.ph.w    t1, t0, 0
+precr_sra.ph.w    t3, t2, 0
+precr.qb.ph   t0, t3, t1
+
+muleu_s.ph.qbl    t2, t0, t8
+muleu_s.ph.qbr    t3, t0, t8
+shra_r.ph t4, t2, 8
+shra_r.ph t5, t3, 8
+and   t4, t4, t9
+and   t5, t5, t9
+addq.ph   t2, t2, t4
+addq.ph   t3, t3, t5
 shra_r.ph t2, t2, 8
 shra_r.ph t3, t3, 8
-shrl.ph   t4, s4, 8
-shrl.ph   t6, s5, 8
-shrl.ph   t7, s6, 8
-shrl.ph   t8, s7, 8
-addq.ph   s0, s4, t4
-addq.ph   s1, s5, t6
-addq.ph   s2, s6, t7
-addq.ph   s3, s7, t8
-shra_r.ph t4, s0, 8
-shra_r.ph t6, s1, 8
-shra_r.ph t7, s2, 8
-shra_r.ph t8, s3, 8
-
-precr.qb.ph   s0, t0, t1
-precr.qb.ph   s1, t2, t3
-precr.qb.ph   s2, t4, t6
-precr.qb.ph   s3, t7, t8
+precr.qb.ph   t2, t2, t3
 
-sb    s0, 0(a0)
-sb    s1, 1(a0)
-sb    s2, 2(a0)
-sb    s3, 3(a0)
-bgtz  t9, 1b
+sb    t2, 0(a0)
+srl   t2, t2, 8
+sb    t2, 1(a0)
+srl   t2, t2, 8
+sb    t2, 2(a0)
+srl   t2, t2, 8
+sb    t2, 3(a0)
+addiu a2, a2, -4
+b 0b
+ addiu a0, a0, 4
-2:
-beqz  a2, 4f
+
+1:
+beqz  a2, 3f
  nop
-3:
-lbu   t1, 0(a0)
+srl   t8, a1, 24
+2:
+lbu   t0, 0(a0)
+
+mul   t2, t0, t8
+shra_r.ph t3, t2, 8
+andi  t3, t3, 0x00ff
+addq.ph   t2, t2, t3
+shra_r.ph t2, t2, 8
 
-muleu_s.ph.qbl    t4, t1, t5
-muleu_s.ph.qbr    t7, t1, t5
-

[Pixman] MIPS DSPr2: Fix for in_n_8 routine.

2013-03-04 Thread Nemanja Lukic
Increasing the number of iterations in blitters-test revealed a bug in the
DSPr2 optimization. The bug is in the in_n_8 routine: the rounding logic was
not implemented correctly. The code also used unnecessary multiplications,
which can be avoided by packing four destination (a8) pixels into one 32-bit
register, and it made unnecessary spills onto the stack. The code is
rewritten to address these issues.
___
Pixman mailing list
Pixman@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/pixman


Re: [Pixman] [PATCH] MIPS: DSPr2: Fix bug in over_n_8888_8888_ca/over_n_8888_0565_ca routines

2013-03-04 Thread Nemanja Lukic
> Are you referring to MIPS implementation of the following code?
> 
> http://cgit.freedesktop.org/pixman/tree/pixman/pixman-fast-path.c?id=pixman-0.29.2#n389

Yes.

> Looks like a lot of changes for only adding a missing shift. Are you
> really just fixing a single bug and not also introducing something
> unrelated?

Yes, it really does look like a huge change for a couple of missing shifts.
When I wrote this code in the first place, I misplaced those shifts, which
allowed me to combine the code for the over operation with:
  UN8x4_MUL_UN8x4 (s, ma);
  UN8x4_MUL_UN8 (ma, srca);
  ma = ~ma;
  UN8x4_MUL_UN8x4_ADD_UN8x4 (d, ma, s);
where the shifts are not present (for ma). So I decided to rewrite that piece
of code from scratch. I changed the logic, so now the assembly code mimics
the code from pixman-fast-path.c but processes two pixels at a time. This
code should be easier to debug and maintain.
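
For anyone following along, here is a rough, self-contained C sketch of the
plain (non-component-alpha) OVER operation that the ma == 0xffffffff branch
reduces to; it only illustrates the arithmetic being discussed and is not a
copy of the pixman source:

    #include <stdint.h>

    /* OVER for premultiplied a8r8g8b8:
     *     dest' = src + dest * (255 - src_alpha) / 255
     * Note that the complemented source must also be shifted right by 24
     * bits to isolate the alpha byte; complementing alone is not enough. */
    static uint32_t
    composite_over (uint32_t src, uint32_t dest)
    {
        uint32_t ia = ~src >> 24;       /* 255 - src alpha */
        uint32_t result = src;
        int      shift;

        for (shift = 0; shift < 32; shift += 8)
        {
            uint32_t d = (dest >> shift) & 0xff;
            uint32_t t = d * ia + 0x80; /* per-channel multiply, rounded */

            result += (((t + (t >> 8)) >> 8) & 0xff) << shift;
        }
        return result;
    }

The per-channel sums stay within 8 bits for valid premultiplied input, which
is why the channels can simply be added back into the result.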

> Also appears that this is not the only problem in the MIPS DSPr2
> code. Using "test/fuzzer-find-diff.pl" script, I can reproduce one
> more failure:

I'll look into this, and upload separate patch with fix for this.

Thanks,
Nemanja Lukic

-Original Message-
From: Siarhei Siamashka [mailto:siarhei.siamas...@gmail.com] 
Sent: Sunday, March 03, 2013 3:42 AM
To: Nemanja Lukic
Cc: pixman@lists.freedesktop.org; Nemanja Lukic
Subject: Re: [Pixman] [PATCH] MIPS: DSPr2: Fix bug in
over_n_8888_8888_ca/over_n_8888_0565_ca routines

On Fri,  1 Mar 2013 10:10:16 +0100
Nemanja Lukic  wrote:

> From: Nemanja Lukic 
> 
> After introducing new PRNG (pseudorandom number generator) a bug in two DSPr2
> routines was revealed. Bug manifested by wrong calculation in composite and
> glyph tests, which caused make check to fail for MIPS DSPr2 optimizations.

Thanks for spotting and addressing this issue. The test suite has a
relatively good coverage, but a chance for missing some bugs always
exists. Increasing the number of iterations in random tests can
reduce this probability, but makes the tests run longer.

One of the reasons for introducing the new PRNG was the intention to
improve performance. It was the biggest performance bottleneck for
the blitters-test program (another bottleneck is CRC32 calculation
there) as can be seen from the profiling logs included in the
commit message:

http://cgit.freedesktop.org/pixman/commit/?id=b31a696263f1ae9a

Having less than 4 seconds to run blitters-test on a reasonably fast
x86 hardware, we could increase the number of randomly tested
compositing operations quite significantly and improve reliability.
But unfortunately some slow hardware such as MIPS and ARM11 is
holding us back.

The MIPS build of blitters-test needs ~7.5 minutes to run on MIPS 74K
480MHz hardware or ~2.5 minutes in QEMU on Intel Core i7 860 (with a
single thread). That's quite a lot.

CRC32 can be still improved quite significantly by using a better
split-by-four or split-by-eight implementation from zlib or xz (either
by borrowing the code or by adding one of these libraries as an
optional dependency). But that's more like just tens percents of
overall performance improvement and not a magic solution.
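
For what it's worth, the "split-by-four" idea mentioned above boils down to
deriving four lookup tables from the basic CRC table and folding four input
bytes per step instead of one. A generic, self-contained sketch of the
technique for the standard reflected CRC-32 polynomial is below; it is only
an illustration, not the zlib/xz code nor anything proposed for pixman:

    #include <stddef.h>
    #include <stdint.h>

    static uint32_t crc_tab[4][256];

    /* Build the four tables for the reflected polynomial 0xEDB88320. */
    static void
    crc32_slice4_init (void)
    {
        uint32_t i;
        int      k;

        for (i = 0; i < 256; i++)
        {
            uint32_t c = i;

            for (k = 0; k < 8; k++)
                c = (c & 1) ? 0xEDB88320u ^ (c >> 1) : c >> 1;
            crc_tab[0][i] = c;
        }
        for (i = 0; i < 256; i++)
            for (k = 1; k < 4; k++)
                crc_tab[k][i] = (crc_tab[k - 1][i] >> 8) ^
                                crc_tab[0][crc_tab[k - 1][i] & 0xff];
    }

    /* Fold four input bytes per iteration instead of one. */
    static uint32_t
    crc32_slice4 (uint32_t crc, const uint8_t *buf, size_t len)
    {
        crc = ~crc;
        while (len >= 4)
        {
            crc ^= (uint32_t)buf[0] | ((uint32_t)buf[1] << 8) |
                   ((uint32_t)buf[2] << 16) | ((uint32_t)buf[3] << 24);
            crc = crc_tab[3][crc & 0xff]         ^
                  crc_tab[2][(crc >> 8) & 0xff]  ^
                  crc_tab[1][(crc >> 16) & 0xff] ^
                  crc_tab[0][crc >> 24];
            buf += 4;
            len -= 4;
        }
        while (len--)
            crc = (crc >> 8) ^ crc_tab[0][(crc ^ *buf++) & 0xff];
        return ~crc;
    }

How much of a win this gives over a byte-at-a-time table lookup would still
have to be measured on the slow MIPS and ARM11 hardware mentioned above.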

> Bug was in the calculation of the:
> *dst = over (src, *dst) when ma == 0xffffffff

Are you referring to MIPS implementation of the following code?

 
http://cgit.freedesktop.org/pixman/tree/pixman/pixman-fast-path.c?id=pixman-0.29.2#n389

Indeed, in order to test this branch in the blitters-test program, we
need to encounter four consecutive 0xff bytes in the mask. The randomly
generated images already have more 0xff and 0x00 bytes. But
maybe adding some code to increase the probability of getting large
clusters of 0xff and 0x00 in the randomly generated images could
improve the reliability of testing.

> In this case src was not negated and shifted right by 24 bits,
> it was only negated. Routines are rewritten, and now make check passes
> for DSPr2 optimizations. Performance improvement remained the same as in
> original commit.
> 
> The bug was revealed in commit b31a6962. Errors were detected by composite
> and glyph tests.
> ---
>  pixman/pixman-mips-dspr2-asm.S |  298
++-
>  1 files changed, 168 insertions(+), 130 deletions(-)

Looks like a lot of changes for only adding a missing shift. Are you
really just fixing a single bug and not also introducing something
unrelated?


Also appears that this is not the only problem in the MIPS DSPr2
code. Using "test/fuzzer-find-diff.pl" script, I can reproduce one
more failure:

op=PIXMAN_OP_IN
src_fmt=a8r8g8b8, dst_fmt=a8, mask_fmt=null
src_width=1, src_height=1, dst_width=124, dst_height=14
src_x=0, src_y=0, dst_x=4, dst_y=0
src_stride=12, dst_stride=128
w=13, h=12

4114763: checksum=023EF000
The problematic conditions can be reproduced by running:
./blitters-test 4114763

Which shows us that blitters-test would need to run 2-3x more
iterations to detect this problem (right now it runs 2000000
tests). It is also a good idea to run fuzzer-find-di