[Pixman] [PATCH] test: larger 0xFF/0x00 filled clusters in random images for blitters-test
The current blitters-test program had difficulties detecting a bug in the over_n_8888_8888_ca implementation for MIPS DSPr2: http://lists.freedesktop.org/archives/pixman/2013-March/002645.html

In order to hit the buggy code path, two consecutive mask values had to be equal to 0xFFFFFFFF because of loop unrolling. The current blitters-test generates random images in such a way that each byte has a 25% probability of having the value 0xFF. Hence each 32-bit mask value has a ~0.4% probability of being 0xFFFFFFFF. Because we are testing many compositing operations with many pixels, encountering at least one 0xFFFFFFFF mask value reasonably fast is not a problem. If a bug related to a 0xFFFFFFFF mask value is artificially introduced into the generic C over_n_8888_8888_ca function, it gets detected on iteration 675591 in blitters-test (out of 2000000). However, two consecutive 0xFFFFFFFF mask values are much less likely to be generated, so the bug was missed by blitters-test.

This patch addresses the problem by also randomly setting the 32-bit values in images to either 0xFFFFFFFF or 0x00000000 (each also with 25% probability). This allows larger clusters of consecutive 0x00 or 0xFF bytes in images, which may have special shortcuts for handling them in unrolled or SIMD-optimized code.
---
 test/blitters-test.c |   13 +++++++++++--
 test/prng-test.c     |    5 ++++-
 test/utils-prng.c    |   58 +++++++++++++++++++++++++++++++++++++++++++-
 test/utils-prng.h    |    5 ++++-
 4 files changed, 72 insertions(+), 9 deletions(-)

diff --git a/test/blitters-test.c b/test/blitters-test.c
index 8766fa8..a2c6ff4 100644
--- a/test/blitters-test.c
+++ b/test/blitters-test.c
@@ -46,7 +46,16 @@ create_random_image (pixman_format_code_t *allowed_formats,
     /* do the allocation */
     buf = aligned_malloc (64, stride * height);

-    prng_randmemset (buf, stride * height, RANDMEMSET_MORE_00_AND_FF);
+    if (prng_rand_n (4) == 0)
+    {
+        /* uniform distribution */
+        prng_randmemset (buf, stride * height, 0);
+    }
+    else
+    {
+        /* significantly increased probability for 0x00 and 0xFF */
+        prng_randmemset (buf, stride * height, RANDMEMSET_MORE_00_AND_FF);
+    }

     img = pixman_image_create_bits (fmt, width, height, buf, stride);

@@ -393,6 +402,6 @@ main (int argc, const char *argv[])
     }

     return fuzzer_test_main("blitters", 2000000,
-                            0xD8265D5E,
+                            0x0CF3283B,
                             test_composite, argc, argv);
 }
diff --git a/test/prng-test.c b/test/prng-test.c
index 0a3ad5e..c1d9320 100644
--- a/test/prng-test.c
+++ b/test/prng-test.c
@@ -106,7 +106,10 @@ int main (int argc, char *argv[])
 {
     const uint32_t ref_crc[RANDMEMSET_MORE_00_AND_FF + 1] =
     {
-        0xBA06763D, 0x103FC550, 0x8B59ABA5, 0xD82A0F39
+        0xBA06763D, 0x103FC550, 0x8B59ABA5, 0xD82A0F39,
+        0xD2321099, 0xFD8C5420, 0xD3B7C42A, 0xFC098093,
+        0x85E01DE0, 0x6680F8F7, 0x4D32DD3C, 0xAE52382B,
+        0x149E6CB5, 0x8B336987, 0x15DCB2B3, 0x8A71B781
     };
     uint32_t crc1, crc2;
     uint32_t ref, seed, seed0, seed1, seed2, seed3;
diff --git a/test/utils-prng.c b/test/utils-prng.c
index 967b898..7b32e35 100644
--- a/test/utils-prng.c
+++ b/test/utils-prng.c
@@ -107,6 +107,7 @@ randmemset_internal (prng_t *prng,
 {
     prng_t local_prng = *prng;
     prng_rand_128_data_t randdata;
+    size_t i;

     while (size >= 16)
     {
@@ -138,6 +139,22 @@ randmemset_internal (prng_t *prng,
             };
             randdata.vb &= (t.vb >= const_40);
         }
+        if (flags & RANDMEMSET_MORE_FFFFFFFF)
+        {
+            const uint32x4 const_C0000000 =
+            {
+                0xC0000000, 0xC0000000, 0xC0000000, 0xC0000000
+            };
+            randdata.vw |= ((t.vw << 30) >= const_C0000000);
+        }
+        if (flags & RANDMEMSET_MORE_00000000)
+        {
+            const uint32x4 const_40000000 =
+            {
+                0x40000000, 0x40000000, 0x40000000, 0x40000000
+            };
+            randdata.vw &= ((t.vw << 30) >= const_40000000);
+        }
 #else
 #define PROCESS_ONE_LANE(i)                                                \
         if (flags & RANDMEMSET_MORE_FF)                                    \
@@ -155,6 +172,18 @@ randmemset_internal (prng_t *prng,
             mask_00 |= mask_00 >> 2;                                       \
             mask_00 |= mask_00 >> 4;                                       \
             randdata.w[i] &= mask_00;                                      \
+        }                                                                  \
+        if (flags & RANDMEMSET_MORE_FFFFFFFF)                              \
+        {                                                                  \
+            int32_t mask_ff = ((t
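The SIMD hunk above works because each lane of `t` holds uniformly random bits: after `t.vw << 30` only two random bits remain, in the top two positions, so an unsigned comparison against 0xC0000000 succeeds with probability 1/4 (both bits set) and against 0x40000000 with probability 3/4. A scalar C sketch of the same idea (function name is mine, not pixman's):

```c
#include <stdint.h>

/* Derive a 25%-probability "force to 0xFFFFFFFF" mask and a
 * 25%-probability "force to 0x00000000" mask from two random bits of t,
 * mirroring the vector comparisons in the patch. */
static uint32_t apply_more_00_ff (uint32_t value, uint32_t t)
{
    /* (t << 30) >= 0xC0000000 iff the low two bits of t are both 1 */
    uint32_t force_ff = ((t << 30) >= 0xC0000000u) ? 0xFFFFFFFFu : 0;
    /* (t << 30) <  0x40000000 iff the low two bits of t are both 0 */
    uint32_t keep     = ((t << 30) >= 0x40000000u) ? 0xFFFFFFFFu : 0;

    return (value | force_ff) & keep;
}
```

With the two remaining bit patterns (01 and 10, probability 1/2 in total) the word passes through unchanged, which preserves the per-byte 0x00/0xFF bias applied earlier.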
Re: [Pixman] [PATCH] glyphs: Check the return code from _pixman_implementation_lookup_composite()
On 03/02/13 03:25, Juan Francisco Cantero Hurtado wrote:
> On 01/13/13 22:37, Juan Francisco Cantero Hurtado wrote:
>> On 01/13/13 11:20, Søren Sandmann wrote:
>>> Søren Sandmann writes:
>>>> Juan Francisco Cantero Hurtado writes:
>>>>> OpenBSD and gcc 4.2 (the default compiler) don't support thread
>>>>> local storage.
>>>>
>>>> In that case, it should fall back to pthread_setspecific(). Can you
>>>> try putting in a #error above "#include " in pixman-compiler.h and
>>>> see if compilation fails. It's possible there is a bug in the
>>>> pthread_setspecific() fallback in pixman, but if so, I couldn't
>>>> reproduce it on Linux by forcing pthread_setspecific() and running
>>>> the test suite. Does the test suite pass for you if you run "make
>>>> check"?
>>>
>>> I found this thread:
>>> https://groups.google.com/forum/?fromgroups=#!topic/comp.unix.bsd.openbsd.misc/y-qciyc6wNY
>>> in which Marc Espie says that pthreads on OpenBSD 4.2 is a purely
>>> userspace thread library. Since the backtrace you posted included
>>> multiple threads, I'm guessing those are kernel threads, which means
>>> pthread_setspecific() can't work on them. If this diagnosis is right,
>>> then pixman currently can't support threads on OpenBSD 4.2. Support
>>> could potentially be added, but it would have to be done (and
>>> maintained) by someone who understands threads on OpenBSD.
>>
>> OpenBSD changed from user-level to kernel-level threads in the last
>> release (5.2). I'll mail the maintainer of pixman on OpenBSD; my
>> knowledge of OpenBSD internals isn't enough to help you with the bug.
>
> I've been doing some tests over the last month. The error occurs in
> pixman 0.28 and 0.29.2 when SSE is enabled. If I disable SSE in the
> configure script of pixman 0.28, everything works.

I forgot to add that disabling SSE2 in pixman isn't a good fix for OpenBSD, because only one program crashes with SSE2 enabled on i386 (it works on amd64). Can you give me any guidance to help you fix the bug? If you suspect it is a bug in OpenBSD, a little test case would be great, because I could show the problem to the OpenBSD devs. Thanks.
___ Pixman mailing list Pixman@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/pixman
Re: [Pixman] [PATCH 12/12] ARMv6: Add fast path for src_x888_0565
On Mon, 04 Mar 2013 17:53:01 -0000, Chris Wilson wrote:
> Did you try with image16? I think it should be hit somewhere, would seem
> like somebody would use it eventually...

Thanks for the tip; I wasn't aware of that. I've been working from Siarhei's trimmed set of traces, which come with a script to run the tests, and amongst other things this sets CAIRO_TEST_TARGET=image. I'm not really familiar with Cairo terminology, but I'm guessing that the target is some sort of backing store that matches the depth of the framebuffer?

Certainly, if I change that setting to image16, I do start to see src_x888_0565 being called, but the number of calls is still tiny:

    t-chromium-tabs          0
    t-evolution              0
    t-firefox-asteroids      1
    t-firefox-canvas-alpha   1
    t-firefox-canvas         1
    t-firefox-chalkboard     0
    t-firefox-fishbowl       3
    t-firefox-fishtank       1
    t-firefox-paintball      3
    t-firefox-particles      1
    t-firefox-planet-gnome  43
    t-firefox-scrolling     11
    t-firefox-talos-gfx      1
    t-firefox-talos-svg      1
    t-gnome-system-monitor   0
    t-gnome-terminal-vim     0
    t-grads-heat-map         0
    t-gvim                   0
    t-midori-zoomed         38
    t-poppler-reseau         1
    t-poppler                0
    t-swfdec-giant-steps     0
    t-swfdec-youtube         1
    t-xfce4-terminal-a1      0

Given that the number of calls of anything that shows up in the overall times is usually measured in millions, or at the very least hundreds, I can't imagine any of those being significant, so I think I'll save myself the few hours it'd take to run the profile to confirm, if you don't mind :)

Ben
Re: [Pixman] [PATCH 12/12] ARMv6: Add fast path for src_x888_0565
On Mon, Mar 04, 2013 at 05:42:29PM +0000, Ben Avison wrote:
> This isn't used in the trimmed cairo-perf-trace tests at all, but these
> are the lowlevel-blt-bench results:

Did you try with image16? I think it should be hit somewhere, would seem like somebody would use it eventually...
-Chris
-- 
Chris Wilson, Intel Open Source Technology Centre
[Pixman] [PATCH 11/12] ARMv6: Add fast path for add_8888_8888
lowlevel-blt-bench results:

            Before          After
        Mean   StdDev   Mean   StdDev   Confidence   Change
L1      27.6   0.1      125.9  0.8      100.0%       +356.0%
L2      14.0   0.5      30.8   1.6      100.0%       +120.3%
M       12.2   0.0      26.7   0.1      100.0%       +118.8%
HT      10.2   0.1      17.0   0.1      100.0%       +67.1%
VT      10.0   0.0      16.6   0.1      100.0%       +65.7%
R       9.7    0.0      15.9   0.1      100.0%       +64.8%
RT      5.8    0.1      7.6    0.1      100.0%       +30.5%

Trimmed cairo-perf-trace results:

                        Before          After
                    Mean   StdDev   Mean   StdDev   Confidence   Change
t-xfce4-terminal-a1 18.6   0.1      18.4   0.1      100.0%       +1.0%
---
 pixman/pixman-arm-simd-asm.S |   58 ++++++++++++++++++++++++++++++++++++++
 pixman/pixman-arm-simd.c     |    8 ++++++
 2 files changed, 66 insertions(+), 0 deletions(-)

diff --git a/pixman/pixman-arm-simd-asm.S b/pixman/pixman-arm-simd-asm.S
index 4f9a015..158de73 100644
--- a/pixman/pixman-arm-simd-asm.S
+++ b/pixman/pixman-arm-simd-asm.S
@@ -350,6 +350,64 @@ generate_composite_function \

 /******************************************************************************/

+.macro test_zero  numregs, reg1, reg2, reg3, reg4
+        teq     WK&reg1, #0
+ .if numregs >= 2
+        teqeq   WK&reg2, #0
+  .if numregs >= 3
+        teqeq   WK&reg3, #0
+   .if numregs == 4
+        teqeq   WK&reg4, #0
+   .endif
+  .endif
+ .endif
+.endm
+
+.macro add_8888_8888_2pixels  dst1, dst2
+        uqadd8  WK&dst1, WK&dst1, MASK
+        uqadd8  WK&dst2, WK&dst2, STRIDE_M
+.endm
+
+.macro add_8888_8888_1pixel  dst
+        uqadd8  WK&dst, WK&dst, MASK
+.endm
+
+.macro add_8888_8888_process_head  cond, numbytes, firstreg, unaligned_src, unaligned_mask, preload
+        pixld   , numbytes, firstreg, SRC, 0
+        add     DST, DST, #numbytes
+.endm
+
+.macro add_8888_8888_process_tail  cond, numbytes, firstreg
+        test_zero %(numbytes/4), firstreg, %(firstreg+1), %(firstreg+2), %(firstreg+3)
+        beq     01f
+ .if numbytes == 16
+        ldrd    MASK, STRIDE_M, [DST, #-16]
+        add_8888_8888_2pixels firstreg, %(firstreg+1)
+        ldrd    MASK, STRIDE_M, [DST, #-8]
+        add_8888_8888_2pixels %(firstreg+2), %(firstreg+3)
+ .elseif numbytes == 8
+        ldrd    MASK, STRIDE_M, [DST, #-8]
+        add_8888_8888_2pixels firstreg, %(firstreg+1)
+ .else
+        ldr     MASK, [DST, #-4]
+        add_8888_8888_1pixel firstreg
+ .endif
+        pixst   , numbytes, firstreg, DST
+01:
+.endm
+
+generate_composite_function \
+    pixman_composite_add_8888_8888_asm_armv6, 32, 0, 32, \
+    FLAG_DST_READWRITE | FLAG_BRANCH_OVER | FLAG_PROCESS_CORRUPTS_PSR | FLAG_PROCESS_DOES_STORE | FLAG_PROCESS_PRESERVES_SCRATCH | FLAG_NO_PRELOAD_DST, \
+    2, /* prefetch distance */ \
+    nop_macro, /* init */ \
+    nop_macro, /* newline */ \
+    nop_macro, /* cleanup */ \
+    add_8888_8888_process_head, \
+    add_8888_8888_process_tail
+
+/******************************************************************************/
+
 .macro over_8888_8888_init
         /* Hold loop invariant in MASK */
         ldr     MASK, =0x00800080
diff --git a/pixman/pixman-arm-simd.c b/pixman/pixman-arm-simd.c
index 855b703..d227065 100644
--- a/pixman/pixman-arm-simd.c
+++ b/pixman/pixman-arm-simd.c
@@ -44,6 +44,8 @@ PIXMAN_ARM_BIND_FAST_PATH_SRC_DST (armv6, src_0565_8888,
 PIXMAN_ARM_BIND_FAST_PATH_SRC_DST (armv6, add_8_8,
                                    uint8_t, 1, uint8_t, 1)
+PIXMAN_ARM_BIND_FAST_PATH_SRC_DST (armv6, add_8888_8888,
+                                   uint32_t, 1, uint32_t, 1)
 PIXMAN_ARM_BIND_FAST_PATH_SRC_DST (armv6, over_8888_8888,
                                    uint32_t, 1, uint32_t, 1)
@@ -238,6 +240,12 @@ static const pixman_fast_path_t arm_simd_fast_paths[] =
     PIXMAN_STD_FAST_PATH (OVER_REVERSE, solid, null, a8b8g8r8, armv6_composite_over_reverse_n_8888),
     PIXMAN_STD_FAST_PATH (ADD, a8, null, a8, armv6_composite_add_8_8),
+    PIXMAN_STD_FAST_PATH (ADD, a8r8g8b8, null, a8r8g8b8, armv6_composite_add_8888_8888),
+    PIXMAN_STD_FAST_PATH (ADD, a8r8g8b8, null, x8r8g8b8, armv6_composite_add_8888_8888),
+    PIXMAN_STD_FAST_PATH (ADD, x8r8g8b8, null, x8r8g8b8, armv6_composite_add_8888_8888),
+    PIXMAN_STD_FAST_PATH (ADD, a8b8g8r8, null, a8b8g8r8, armv6_composite_add_8888_8888),
+    PIXMAN_STD_FAST_PATH (ADD, a8b8g8r8, null, x8b8g8r8, armv6_composite_add_8888_8888),
+    PIXMAN_STD_FAST_PATH (ADD, x8b8g8r8, null, x8b8g8r8, armv6_composite_add_8888_8888),
     PIXMAN_STD_FAST_PATH (OVER, solid, a8, a8r8g8b8, armv6_composite_over_n_8_8888),
     PIXMAN_STD_FAST_PATH (OVER, solid, a8, x8r8g8b8, armv6_composite_over_n_8_8888),
-- 
1.7.5.4
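For reference, the core of this fast path is the UQADD8 instruction, a per-byte saturating add across a 32-bit word; the test_zero macro merely skips the destination read/write when the loaded source words are all zero. A plain C model of one pixel (function name is mine, for illustration only):

```c
#include <stdint.h>

/* Per-byte saturating add, the operation ARMv6's UQADD8 performs on
 * four bytes at once -- the ADD compositing operator for a8r8g8b8. */
static uint32_t add_8888_8888_pixel (uint32_t dst, uint32_t src)
{
    uint32_t result = 0;
    int      shift;

    for (shift = 0; shift < 32; shift += 8)
    {
        uint32_t sum = ((dst >> shift) & 0xFF) + ((src >> shift) & 0xFF);
        if (sum > 0xFF)
            sum = 0xFF;          /* saturate instead of wrapping */
        result |= sum << shift;
    }
    return result;
}
```

Because a zero source word leaves the destination byte-for-byte unchanged, the all-zero shortcut in the asm is exact, not an approximation.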
[Pixman] [PATCH 12/12] ARMv6: Add fast path for src_x888_0565
This isn't used in the trimmed cairo-perf-trace tests at all, but these are the lowlevel-blt-bench results:

            Before          After
        Mean   StdDev   Mean   StdDev   Confidence   Change
L1      68.5   1.0      116.3  0.6      100.0%       +69.8%
L2      31.1   1.8      60.9   5.0      100.0%       +96.1%
M       33.6   0.1      86.4   0.4      100.0%       +157.0%
HT      19.1   0.1      35.3   0.4      100.0%       +84.3%
VT      17.7   0.2      32.1   0.3      100.0%       +81.3%
R       17.5   0.2      29.9   0.3      100.0%       +70.7%
RT      7.0    0.1      11.8   0.3      100.0%       +68.4%
---
 pixman/pixman-arm-simd-asm.S |   77 ++++++++++++++++++++++++++++++++++++++++
 pixman/pixman-arm-simd.c     |    7 ++++
 2 files changed, 84 insertions(+), 0 deletions(-)

diff --git a/pixman/pixman-arm-simd-asm.S b/pixman/pixman-arm-simd-asm.S
index 158de73..423a16d 100644
--- a/pixman/pixman-arm-simd-asm.S
+++ b/pixman/pixman-arm-simd-asm.S
@@ -303,6 +303,83 @@ generate_composite_function \

 /******************************************************************************/

+.macro src_x888_0565_init
+        /* Hold loop invariant in MASK */
+        ldr     MASK, =0x001F001F
+        line_saved_regs  STRIDE_S, ORIG_W
+.endm
+
+.macro src_x888_0565_1pixel  s, d
+        and     WK&d, MASK, WK&s, lsr #3         @ 00000000000rrrrr00000000000bbbbb
+        and     STRIDE_S, WK&s, #0xFC00          @ 0000000000000000gggggg0000000000
+        orr     WK&d, WK&d, WK&d, lsr #5         @ 00000000000rrrrrrrrrr000000bbbbb
+        orr     WK&d, WK&d, STRIDE_S, lsr #5     @ 00000000000rrrrrrrrrrggggggbbbbb
+        /* Top 16 bits are discarded during the following STRH */
+.endm
+
+.macro src_x888_0565_2pixels  slo, shi, d, tmp
+        and     SCRATCH, WK&shi, #0xFC00         @ 0000000000000000GGGGGG0000000000
+        and     WK&tmp, MASK, WK&shi, lsr #3     @ 00000000000RRRRR00000000000BBBBB
+        and     WK&shi, MASK, WK&slo, lsr #3     @ 00000000000rrrrr00000000000bbbbb
+        orr     WK&tmp, WK&tmp, WK&tmp, lsr #5   @ 00000000000RRRRRRRRRR000000BBBBB
+        orr     WK&tmp, WK&tmp, SCRATCH, lsr #5  @ 00000000000RRRRRRRRRRGGGGGGBBBBB
+        and     SCRATCH, WK&slo, #0xFC00         @ 0000000000000000gggggg0000000000
+        orr     WK&shi, WK&shi, WK&shi, lsr #5   @ 00000000000rrrrrrrrrr000000bbbbb
+        orr     WK&shi, WK&shi, SCRATCH, lsr #5  @ 00000000000rrrrrrrrrrggggggbbbbb
+        pkhbt   WK&d, WK&shi, WK&tmp, lsl #16    @ RRRRRGGGGGGBBBBBrrrrrggggggbbbbb
+.endm
+
+.macro src_x888_0565_process_head  cond, numbytes, firstreg, unaligned_src, unaligned_mask, preload
+WK4     .req    STRIDE_S
+WK5     .req    STRIDE_M
+WK6     .req    WK3
+WK7     .req    ORIG_W
+ .if numbytes == 16
+        pixld   , 16, 4, SRC, 0
+        src_x888_0565_2pixels  4, 5, 0, 0
+        pixld   , 8, 4, SRC, 0
+        src_x888_0565_2pixels  6, 7, 1, 1
+        pixld   , 8, 6, SRC, 0
+ .else
+        pixld   , numbytes*2, 4, SRC, 0
+ .endif
+.endm
+
+.macro src_x888_0565_process_tail  cond, numbytes, firstreg
+ .if numbytes == 16
+        src_x888_0565_2pixels  4, 5, 2, 2
+        src_x888_0565_2pixels  6, 7, 3, 4
+ .elseif numbytes == 8
+        src_x888_0565_2pixels  4, 5, 1, 1
+        src_x888_0565_2pixels  6, 7, 2, 2
+ .elseif numbytes == 4
+        src_x888_0565_2pixels  4, 5, 1, 1
+ .else
+        src_x888_0565_1pixel  4, 1
+ .endif
+ .if numbytes == 16
+        pixst   , numbytes, 0, DST
+ .else
+        pixst   , numbytes, 1, DST
+ .endif
+.unreq  WK4
+.unreq  WK5
+.unreq  WK6
+.unreq  WK7
+.endm
+
+generate_composite_function \
+    pixman_composite_src_x888_0565_asm_armv6, 32, 0, 16, \
+    FLAG_DST_WRITEONLY | FLAG_BRANCH_OVER | FLAG_PROCESS_DOES_STORE | FLAG_SPILL_LINE_VARS | FLAG_PROCESS_CORRUPTS_SCRATCH, \
+    3, /* prefetch distance */ \
+    src_x888_0565_init, \
+    nop_macro, /* newline */ \
+    nop_macro, /* cleanup */ \
+    src_x888_0565_process_head, \
+    src_x888_0565_process_tail
+
+/******************************************************************************/
+
 .macro add_8_8_8pixels  cond, dst1, dst2
         uqadd8&cond  WK&dst1, WK&dst1, MASK
         uqadd8&cond  WK&dst2, WK&dst2, STRIDE_M
diff --git a/pixman/pixman-arm-simd.c b/pixman/pixman-arm-simd.c
index d227065..5a1708f 100644
--- a/pixman/pixman-arm-simd.c
+++ b/pixman/pixman-arm-simd.c
@@ -41,6 +41,8 @@ PIXMAN_ARM_BIND_FAST_PATH_SRC_DST (armv6, src_8_8,
                                    uint8_t, 1, uint8_t, 1)
 PIXMAN_ARM_BIND_FAST_PATH_SRC_DST (armv6, src_0565_8888,
                                    uint16_t, 1, uint32_t, 1)
+PIXMAN_ARM_BIND_FAST_PATH_SRC_DST (armv6, src_x888_0565,
+                                   uint32_t, 1, uint16_t, 1)
 PIXMAN_ARM_BIND_FAST_PATH_SRC_DST (armv6, add_8_8,
                                    uint8_t, 1, uint8_t, 1)
@@ -227,6 +229,11 @@ static const pixman_fast_path_t arm_simd_fast_paths[] =
     PIXMAN_STD_FA
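The bit juggling in src_x888_0565_1pixel is equivalent to the usual truncating x8r8g8b8 → r5g6b5 conversion, which in plain C looks like this (a reference model, not the pixman code):

```c
#include <stdint.h>

/* Keep the top 5/6/5 bits of the 8-bit red/green/blue channels and
 * pack them into the 16-bit 565 layout. */
static uint16_t x888_to_0565 (uint32_t s)
{
    return (uint16_t) (((s >> 8) & 0xF800) |   /* top 5 bits of red   */
                       ((s >> 5) & 0x07E0) |   /* top 6 bits of green */
                       ((s >> 3) & 0x001F));   /* top 5 bits of blue  */
}
```

The asm version does the same thing with a single 0x001F001F mask covering red and blue at once, then merges green in, so two pixels can be packed with one PKHBT at the end.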
[Pixman] [PATCH 10/12] ARMv6: Add fast path for over_reverse_n_8888
lowlevel-blt-bench results:

            Before          After
        Mean   StdDev   Mean   StdDev   Confidence   Change
L1      15.0   0.1      276.2  4.0      100.0%       +1743.3%
L2      13.4   0.3      154.8  17.4     100.0%       +1058.0%
M       11.4   0.0      73.7   0.8      100.0%       +549.4%
HT      10.2   0.0      25.6   0.2      100.0%       +150.9%
VT      10.0   0.0      23.0   0.3      100.0%       +129.4%
R       9.8    0.1      22.9   0.2      100.0%       +134.3%
RT      6.4    0.1      11.6   0.3      100.0%       +80.8%

Trimmed cairo-perf-trace results:

              Before          After
          Mean   StdDev   Mean   StdDev   Confidence   Change
t-poppler 11.8   0.1      8.8    0.1      100.0%       +34.6%
---
 pixman/pixman-arm-simd-asm.S |   78 ++++++++++++++++++++++++++++++++++++++++
 pixman/pixman-arm-simd.c     |    6 +++
 2 files changed, 84 insertions(+), 0 deletions(-)

diff --git a/pixman/pixman-arm-simd-asm.S b/pixman/pixman-arm-simd-asm.S
index 20ad05a..4f9a015 100644
--- a/pixman/pixman-arm-simd-asm.S
+++ b/pixman/pixman-arm-simd-asm.S
@@ -979,6 +979,84 @@ generate_composite_function \

 /******************************************************************************/

+.macro over_reverse_n_8888_init
+        ldr     SRC, [sp, #ARGS_STACK_OFFSET]
+        ldr     MASK, =0x00800080
+        /* Split source pixel into RB/AG parts */
+        uxtb16  STRIDE_S, SRC
+        uxtb16  STRIDE_M, SRC, ror #8
+        /* Set GE[3:0] to 0101 so SEL instructions do what we want */
+        uadd8   SCRATCH, MASK, MASK
+        line_saved_regs  STRIDE_D, ORIG_W
+.endm
+
+.macro over_reverse_n_8888_newline
+        mov     STRIDE_D, #0xFF
+.endm
+
+.macro over_reverse_n_8888_process_head  cond, numbytes, firstreg, unaligned_src, unaligned_mask, preload
+        pixld   , numbytes, firstreg, DST, 0
+.endm
+
+.macro over_reverse_n_8888_1pixel  d, is_only
+        teq     WK&d, #0
+        beq     8f        /* replace with source */
+        bics    ORIG_W, STRIDE_D, WK&d, lsr #24
+ .if is_only == 1
+        beq     49f       /* skip store */
+ .else
+        beq     9f        /* write same value back */
+ .endif
+        mla     SCRATCH, STRIDE_S, ORIG_W, MASK  /* red/blue */
+        mla     ORIG_W, STRIDE_M, ORIG_W, MASK   /* alpha/green */
+        uxtab16 SCRATCH, SCRATCH, SCRATCH, ror #8
+        uxtab16 ORIG_W, ORIG_W, ORIG_W, ror #8
+        mov     SCRATCH, SCRATCH, ror #8
+        sel     ORIG_W, SCRATCH, ORIG_W
+        uqadd8  WK&d, WK&d, ORIG_W
+        b       9f
+8:      mov     WK&d, SRC
+9:
+.endm
+
+.macro over_reverse_n_8888_tail  numbytes, reg1, reg2, reg3, reg4
+ .if numbytes == 4
+        over_reverse_n_8888_1pixel  reg1, 1
+ .else
+        and     SCRATCH, WK&reg1, WK&reg2
+  .if numbytes == 16
+        and     SCRATCH, SCRATCH, WK&reg3
+        and     SCRATCH, SCRATCH, WK&reg4
+  .endif
+        mvns    SCRATCH, SCRATCH, asr #24
+        beq     49f       /* skip store if all opaque */
+        over_reverse_n_8888_1pixel  reg1, 0
+        over_reverse_n_8888_1pixel  reg2, 0
+  .if numbytes == 16
+        over_reverse_n_8888_1pixel  reg3, 0
+        over_reverse_n_8888_1pixel  reg4, 0
+  .endif
+ .endif
+        pixst   , numbytes, reg1, DST
+49:
+.endm
+
+.macro over_reverse_n_8888_process_tail  cond, numbytes, firstreg
+        over_reverse_n_8888_tail  numbytes, firstreg, %(firstreg+1), %(firstreg+2), %(firstreg+3)
+.endm
+
+generate_composite_function \
+    pixman_composite_over_reverse_n_8888_asm_armv6, 0, 0, 32, \
+    FLAG_DST_READWRITE | FLAG_BRANCH_OVER | FLAG_PROCESS_CORRUPTS_PSR | FLAG_PROCESS_DOES_STORE | FLAG_SPILL_LINE_VARS | FLAG_PROCESS_CORRUPTS_SCRATCH, \
+    3, /* prefetch distance */ \
+    over_reverse_n_8888_init, \
+    over_reverse_n_8888_newline, \
+    nop_macro, /* cleanup */ \
+    over_reverse_n_8888_process_head, \
+    over_reverse_n_8888_process_tail
+
+/******************************************************************************/
+
 #ifdef PROFILING
 .p2align 9
 #endif
diff --git a/pixman/pixman-arm-simd.c b/pixman/pixman-arm-simd.c
index 5a50098..855b703 100644
--- a/pixman/pixman-arm-simd.c
+++ b/pixman/pixman-arm-simd.c
@@ -50,6 +50,9 @@ PIXMAN_ARM_BIND_FAST_PATH_SRC_DST (armv6, over_8888_8888,
 PIXMAN_ARM_BIND_FAST_PATH_SRC_DST (armv6, in_reverse_8888_8888,
                                    uint32_t, 1, uint32_t, 1)
+PIXMAN_ARM_BIND_FAST_PATH_N_DST (0, armv6, over_reverse_n_8888,
+                                 uint32_t, 1)
+
 PIXMAN_ARM_BIND_FAST_PATH_SRC_N_DST (SKIP_ZERO_MASK, armv6, over_8888_n_8888,
                                      uint32_t, 1, uint32_t, 1)
@@ -231,6 +234,9 @@ static const pixman_fast_path_t arm_simd_fast_paths[] =
     PIXMAN_STD_FAST_PATH (OVER, a8b8g8r8, solid, a8b8g8r8, armv6_composite_over_8888_n_8888),
     PIXMAN_STD_FAST_PATH (OVER, a8b8g8r8, solid, x8b8g8r8, armv6_composite_over_8888_n_8888),
+    PIXMAN_STD_FAST_PATH (OVER_REVERSE, solid, null, a8r8g8b8, armv6_composite_over_reverse_n_8888),
+    PIXMAN_STD
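For readers unfamiliar with the operator: OVER_REVERSE with a solid source composites the existing destination *over* the source, i.e. result = dst + src * (1 - dst.alpha) with premultiplied channels. A C model of one pixel (my naming; the two early-out branches correspond to the "replace with source" and "skip store if all opaque" shortcuts in the asm):

```c
#include <stdint.h>

static uint8_t mul_div_255 (uint8_t a, uint8_t b)
{
    /* Rounded (a * b) / 255, the usual 8-bit blend multiply */
    uint32_t t = (uint32_t) a * b + 0x80;
    return (uint8_t) ((t + (t >> 8)) >> 8);
}

/* result = dst + src * (1 - dst.alpha), per premultiplied channel */
static uint32_t over_reverse_pixel (uint32_t src, uint32_t dst)
{
    uint8_t  inv_a = (uint8_t) (255 - (dst >> 24));
    uint32_t result = 0;
    int      shift;

    for (shift = 0; shift < 32; shift += 8)
    {
        uint32_t c = ((dst >> shift) & 0xFF) +
                     mul_div_255 ((uint8_t) (src >> shift), inv_a);
        if (c > 0xFF)
            c = 0xFF;   /* saturate, as UQADD8 does */
        result |= c << shift;
    }
    return result;
}
```

A zero destination pixel yields the source unchanged, and an opaque destination is left untouched, which is exactly why the asm tests for those two cases before doing any multiplies.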
[Pixman] [PATCH 09/12] ARMv6: Add fast path for in_reverse_8888_8888
lowlevel-blt-bench results:

            Before          After
        Mean   StdDev   Mean   StdDev   Confidence   Change
L1      21.3   0.1      32.5   0.2      100.0%       +52.1%
L2      12.1   0.2      19.5   0.5      100.0%       +61.2%
M       11.0   0.0      17.1   0.0      100.0%       +54.6%
HT      8.7    0.0      12.8   0.1      100.0%       +46.9%
VT      8.6    0.0      12.5   0.1      100.0%       +46.0%
R       8.6    0.0      12.0   0.1      100.0%       +40.6%
RT      5.1    0.1      6.6    0.1      100.0%       +28.8%

Trimmed cairo-perf-trace results:

                        Before          After
                    Mean   StdDev   Mean   StdDev   Confidence   Change
t-firefox-paintball 17.7   0.1      14.2   0.1      100.0%       +24.5%
---
 pixman/pixman-arm-simd-asm.S |  104 ++++++++++++++++++++++++++++++++++++++++
 pixman/pixman-arm-simd.c     |    8 +++
 2 files changed, 112 insertions(+), 0 deletions(-)

diff --git a/pixman/pixman-arm-simd-asm.S b/pixman/pixman-arm-simd-asm.S
index ac084c4..20ad05a 100644
--- a/pixman/pixman-arm-simd-asm.S
+++ b/pixman/pixman-arm-simd-asm.S
@@ -875,6 +875,110 @@ generate_composite_function \

 /******************************************************************************/

+.macro in_reverse_8888_8888_init
+        /* Hold loop invariant in MASK */
+        ldr     MASK, =0x00800080
+        /* Set GE[3:0] to 0101 so SEL instructions do what we want */
+        uadd8   SCRATCH, MASK, MASK
+        /* Offset the source pointer: we only need the alpha bytes */
+        add     SRC, SRC, #3
+        line_saved_regs  ORIG_W
+.endm
+
+.macro in_reverse_8888_8888_head  numbytes, reg1, reg2, reg3
+        ldrb    ORIG_W, [SRC], #4
+ .if numbytes >= 8
+        ldrb    WK&reg1, [SRC], #4
+  .if numbytes == 16
+        ldrb    WK&reg2, [SRC], #4
+        ldrb    WK&reg3, [SRC], #4
+  .endif
+ .endif
+        add     DST, DST, #numbytes
+.endm
+
+.macro in_reverse_8888_8888_process_head  cond, numbytes, firstreg, unaligned_src, unaligned_mask, preload
+        in_reverse_8888_8888_head  numbytes, firstreg, %(firstreg+1), %(firstreg+2)
+.endm
+
+.macro in_reverse_8888_8888_1pixel  s, d, offset, is_only
+ .if is_only != 1
+        movs    s, ORIG_W
+  .if offset != 0
+        ldrb    ORIG_W, [SRC, #offset]
+  .endif
+        beq     01f
+        teq     STRIDE_M, #0xFF
+        beq     02f
+ .endif
+        uxtb16  SCRATCH, d                 /* rb_dest */
+        uxtb16  d, d, ror #8               /* ag_dest */
+        mla     SCRATCH, SCRATCH, s, MASK
+        mla     d, d, s, MASK
+        uxtab16 SCRATCH, SCRATCH, SCRATCH, ror #8
+        uxtab16 d, d, d, ror #8
+        mov     SCRATCH, SCRATCH, ror #8
+        sel     d, SCRATCH, d
+        b       02f
+ .if offset == 0
+48:     /* Last mov d,#0 of the set - used as part of shortcut for
+         * source values all 0 */
+ .endif
+01:     mov     d, #0
+02:
+.endm
+
+.macro in_reverse_8888_8888_tail  numbytes, reg1, reg2, reg3, reg4
+ .if numbytes == 4
+        teq     ORIG_W, ORIG_W, asr #32
+        ldrne   WK&reg1, [DST, #-4]
+ .elseif numbytes == 8
+        teq     ORIG_W, WK&reg1
+        teqeq   ORIG_W, ORIG_W, asr #32  /* all 0 or all -1? */
+        ldmnedb DST, {WK&reg1-WK&reg2}
+ .else
+        teq     ORIG_W, WK&reg1
+        teqeq   ORIG_W, WK&reg2
+        teqeq   ORIG_W, WK&reg3
+        teqeq   ORIG_W, ORIG_W, asr #32  /* all 0 or all -1? */
+        ldmnedb DST, {WK&reg1-WK&reg4}
+ .endif
+        cmnne   DST, #0   /* clear C if NE */
+        bcs     49f       /* no writes to dest if source all -1 */
+        beq     48f       /* set dest to all 0 if source all 0 */
+ .if numbytes == 4
+        in_reverse_8888_8888_1pixel  ORIG_W, WK&reg1, 0, 1
+        str     WK&reg1, [DST, #-4]
+ .elseif numbytes == 8
+        in_reverse_8888_8888_1pixel  STRIDE_M, WK&reg1, -4, 0
+        in_reverse_8888_8888_1pixel  STRIDE_M, WK&reg2, 0, 0
+        stmdb   DST, {WK&reg1-WK&reg2}
+ .else
+        in_reverse_8888_8888_1pixel  STRIDE_M, WK&reg1, -12, 0
+        in_reverse_8888_8888_1pixel  STRIDE_M, WK&reg2, -8, 0
+        in_reverse_8888_8888_1pixel  STRIDE_M, WK&reg3, -4, 0
+        in_reverse_8888_8888_1pixel  STRIDE_M, WK&reg4, 0, 0
+        stmdb   DST, {WK&reg1-WK&reg4}
+ .endif
+49:
+.endm
+
+.macro in_reverse_8888_8888_process_tail  cond, numbytes, firstreg
+        in_reverse_8888_8888_tail  numbytes, firstreg, %(firstreg+1), %(firstreg+2), %(firstreg+3)
+.endm
+
+generate_composite_function \
+    pixman_composite_in_reverse_8888_8888_asm_armv6, 32, 0, 32, \
+    FLAG_DST_READWRITE | FLAG_BRANCH_OVER | FLAG_PROCESS_CORRUPTS_PSR | FLAG_PROCESS_DOES_STORE | FLAG_SPILL_LINE_VARS | FLAG_PROCESS_CORRUPTS_SCRATCH | FLAG_NO_PRELOAD_DST, \
+    2, /* prefetch distance */ \
+    in_reverse_8888_8888_init, \
+    nop_macro, /* newline */ \
+    nop_macro, /* cleanup */ \
+    in_reverse_8888_8888_process_head, \
+    in_reverse_8888_8888_process_tail
+
+/******************************************************************************/
+
 #ifdef PROFILING
 .p2align 9
 #endif
diff --git a/pixman/pixman-arm-simd.c b/pi
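The operator here is simple: IN_REVERSE keeps the destination scaled by the source alpha, result = dst * src.alpha, which is why the asm offsets SRC by 3 and fetches only alpha bytes with LDRB. A C model with the same two shortcuts the asm tests for (my naming, for illustration):

```c
#include <stdint.h>

static uint8_t mul_div_255 (uint8_t a, uint8_t b)
{
    /* Rounded (a * b) / 255 */
    uint32_t t = (uint32_t) a * b + 0x80;
    return (uint8_t) ((t + (t >> 8)) >> 8);
}

/* result = dst * src.alpha, per premultiplied channel */
static uint32_t in_reverse_pixel (uint32_t src, uint32_t dst)
{
    uint8_t  a = (uint8_t) (src >> 24);
    uint32_t result = 0;
    int      shift;

    if (a == 0x00)
        return 0;        /* shortcut: source alpha all 0, write zeros   */
    if (a == 0xFF)
        return dst;      /* shortcut: opaque source, dest untouched     */

    for (shift = 0; shift < 32; shift += 8)
        result |= (uint32_t) mul_div_255 ((uint8_t) (dst >> shift), a) << shift;
    return result;
}
```

Both shortcuts avoid reading the destination at all, which is what makes the "all 0 or all -1?" group tests in the tail macro worthwhile.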
[Pixman] [PATCH 08/12] ARMv6: Added fast path for over_n_8888_8888_ca
lowlevel-blt-bench results:

            Before          After
        Mean   StdDev   Mean   StdDev   Confidence   Change
L1      2.7    0.0      16.2   0.1      100.0%       +501.7%
L2      2.4    0.0      14.8   0.2      100.0%       +502.5%
M       2.4    0.0      15.0   0.0      100.0%       +525.7%
HT      2.2    0.0      10.2   0.1      100.0%       +354.9%
VT      2.2    0.0      9.9    0.1      100.0%       +344.5%
R       2.3    0.0      10.0   0.0      100.0%       +339.7%
RT      2.0    0.0      5.7    0.1      100.0%       +191.3%

Trimmed cairo-perf-trace results:

                           Before          After
                       Mean   StdDev   Mean   StdDev   Confidence   Change
t-firefox-talos-gfx    25.7   0.1      17.7   0.2      100.0%       +45.3%
t-firefox-scrolling    20.7   0.1      16.6   0.2      100.0%       +24.7%
t-evolution            8.0    0.1      6.9    0.2      100.0%       +14.6%
t-gnome-terminal-vim   17.2   0.2      15.5   0.2      100.0%       +11.1%
t-firefox-planet-gnome 9.8    0.1      8.8    0.1      100.0%       +11.0%
t-xfce4-terminal-a1    19.9   0.1      18.5   0.1      100.0%       +7.7%
t-gvim                 20.5   0.3      20.0   0.4      99.8%        +2.8%
t-firefox-paintball    18.0   0.1      17.7   0.1      100.0%       +1.7%
t-poppler-reseau       20.8   0.2      20.5   0.1      100.0%       +1.4%
t-firefox-fishbowl     21.6   0.0      21.5   0.1      99.5%        +0.4%
t-firefox-canvas       15.6   0.1      15.8   0.1      100.0%       -1.2%
---
 pixman/pixman-arm-simd-asm.S |  264 ++++++++++++++++++++++++++++++++++++++++
 pixman/pixman-arm-simd.c     |    8 ++
 2 files changed, 272 insertions(+), 0 deletions(-)

diff --git a/pixman/pixman-arm-simd-asm.S b/pixman/pixman-arm-simd-asm.S
index 259fb88..ac084c4 100644
--- a/pixman/pixman-arm-simd-asm.S
+++ b/pixman/pixman-arm-simd-asm.S
@@ -611,6 +611,270 @@ generate_composite_function \

 /******************************************************************************/

+.macro over_white_8888_8888_ca_init
+HALF    .req    SRC
+TMP0    .req    STRIDE_D
+TMP1    .req    STRIDE_S
+TMP2    .req    STRIDE_M
+TMP3    .req    ORIG_W
+WK4     .req    SCRATCH
+        line_saved_regs STRIDE_D, STRIDE_M, ORIG_W
+        ldr     SCRATCH, =0x800080
+        mov     HALF, #0x80
+        /* Set GE[3:0] to 0101 so SEL instructions do what we want */
+        uadd8   SCRATCH, SCRATCH, SCRATCH
+.endm
+
+.macro over_white_8888_8888_ca_cleanup
+.unreq  HALF
+.unreq  TMP0
+.unreq  TMP1
+.unreq  TMP2
+.unreq  TMP3
+.unreq  WK4
+.endm
+
+.macro over_white_8888_8888_ca_combine  m, d
+        uxtb16  TMP1, TMP0                /* rb_notmask */
+        uxtb16  TMP2, d                   /* rb_dest; 1 stall follows */
+        smlatt  TMP3, TMP2, TMP1, HALF    /* red */
+        smlabb  TMP2, TMP2, TMP1, HALF    /* blue */
+        uxtb16  TMP0, TMP0, ror #8        /* ag_notmask */
+        uxtb16  TMP1, d, ror #8           /* ag_dest; 1 stall follows */
+        smlatt  d, TMP1, TMP0, HALF       /* alpha */
+        smlabb  TMP1, TMP1, TMP0, HALF    /* green */
+        pkhbt   TMP0, TMP2, TMP3, lsl #16 /* rb; 1 stall follows */
+        pkhbt   TMP1, TMP1, d, lsl #16    /* ag */
+        uxtab16 TMP0, TMP0, TMP0, ror #8
+        uxtab16 TMP1, TMP1, TMP1, ror #8
+        mov     TMP0, TMP0, ror #8
+        sel     d, TMP0, TMP1
+        uqadd8  d, d, m                   /* d is a late result */
+.endm
+
+.macro over_white_8888_8888_ca_1pixel_head
+        pixld   , 4, 1, MASK, 0
+        pixld   , 4, 3, DST, 0
+.endm
+
+.macro over_white_8888_8888_ca_1pixel_tail
+        mvn     TMP0, WK1
+        teq     WK1, WK1, asr #32
+        bne     01f
+        bcc     03f
+        mov     WK3, WK1
+        b       02f
+01:     over_white_8888_8888_ca_combine WK1, WK3
+02:     pixst   , 4, 3, DST
+03:
+.endm
+
+.macro over_white_8888_8888_ca_2pixels_head
+        pixld   , 8, 1, MASK, 0
+        pixld   , 8, 3, DST
+.endm
+
+.macro over_white_8888_8888_ca_2pixels_tail
+        mvn     TMP0, WK1
+        teq     WK1, WK1, asr #32
+        bne     01f
+        movcs   WK3, WK1
+        bcs     02f
+        teq     WK2, #0
+        beq     05f
+        b       02f
+01:     over_white_8888_8888_ca_combine WK1, WK3
+02:     mvn     TMP0, WK2
+        teq     WK2, WK2, asr #32
+        bne     03f
+        movcs   WK4, WK2
+        b       04f
+03:     over_white_8888_8888_ca_combine WK2, WK4
+04:     pixst   , 8, 3, DST
+05:
+.endm
+
+.macro over_white_8888_8888_ca_process_head  cond, numbytes, firstreg, unaligned_src, unaligned_mask, preload
+ .if numbytes == 4
+        over_white_8888_8888_ca_1pixel_head
+ .else
+  .if numbytes == 16
+        over_white_8888_8888_ca_2pixels_head
+        over_white_8888_8888_ca_2pixels_tail
+  .endif
+        over_white_8888_8888_ca_2pixels_head
+ .endif
+.endm
+
+.macro over_white_8888_8888_ca_process_tail  cond, numbytes, firstreg
+ .if numbytes == 4
+        over_whit
[Pixman] [PATCH 07/12] ARMv6: Macro to permit testing for early returns or alternate implementations
When the source or mask is solid (as opposed to a bitmap), there is the possibility of an immediate exit, or a branch to an alternate, more optimal implementation in some cases. This is best achieved with a brief prologue to the function; to permit this, the necessary boilerplate for setting up a function entry is now available in the "startfunc" macro.

This feature was first included in my over_n_8888 fast path, but since that's still sitting in the submission queue at the time of writing, I'm posting it again as an independent patch.
---
 pixman/pixman-arm-simd-asm.h |   26 +++++++++++++++-----------
 1 files changed, 15 insertions(+), 11 deletions(-)

diff --git a/pixman/pixman-arm-simd-asm.h b/pixman/pixman-arm-simd-asm.h
index c7e5ca7..a41e1e0 100644
--- a/pixman/pixman-arm-simd-asm.h
+++ b/pixman/pixman-arm-simd-asm.h
@@ -107,6 +107,20 @@
 .set PREFETCH_TYPE_NONE,       0
 .set PREFETCH_TYPE_STANDARD,   1

+.macro startfunc fname
+#ifdef PROFILING
+ .p2align 9
+#endif
+ .func fname
+ .global fname
+ /* For ELF format also set function visibility to hidden */
+#ifdef __ELF__
+ .hidden fname
+ .type fname, %function
+#endif
+fname:
+.endm
+
 /*
  * Definitions of macros for load/store of pixel data.
  */
@@ -596,16 +610,7 @@
                 process_tail, \
                 process_inner_loop

-#ifdef PROFILING
- .p2align 9
-#endif
- .func fname
- .global fname
- /* For ELF format also set function visibility to hidden */
-#ifdef __ELF__
- .hidden fname
- .type fname, %function
-#endif
+        startfunc fname

 /*
  * Make some macro arguments globally visible and accessible
@@ -717,7 +722,6 @@
 SCRATCH .req    r12
 ORIG_W  .req    r14         /* width (pixels) */

-fname:
         push    {r4-r11, lr}        /* save all registers */

         subs    Y, Y, #1
-- 
1.7.5.4
[Pixman] [PATCH 06/12] Add extra test to lowlevel-blt-bench and fix an existing one
in_reverse_8888_8888 is one of the more commonly used operations in the cairo-perf-trace suite that hasn't been in lowlevel-blt-bench until now. The source for over_reverse_n_8888 needed to be marked as solid.
---
 test/lowlevel-blt-bench.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/test/lowlevel-blt-bench.c b/test/lowlevel-blt-bench.c
index 4e16f7b..9984fa8 100644
--- a/test/lowlevel-blt-bench.c
+++ b/test/lowlevel-blt-bench.c
@@ -706,7 +706,8 @@ tests_tbl[] =
     { "outrev_n_8888_1555_ca", PIXMAN_a8r8g8b8,    1, PIXMAN_OP_OUT_REV, PIXMAN_a8r8g8b8, 2, PIXMAN_a1r5g5b5 },
     { "outrev_n_8888_x888_ca", PIXMAN_a8r8g8b8,    1, PIXMAN_OP_OUT_REV, PIXMAN_a8r8g8b8, 2, PIXMAN_x8r8g8b8 },
     { "outrev_n_8888_8888_ca", PIXMAN_a8r8g8b8,    1, PIXMAN_OP_OUT_REV, PIXMAN_a8r8g8b8, 2, PIXMAN_a8r8g8b8 },
-    { "over_reverse_n_8888",   PIXMAN_a8r8g8b8,    0, PIXMAN_OP_OVER_REVERSE, PIXMAN_null, 0, PIXMAN_a8r8g8b8 },
+    { "over_reverse_n_8888",   PIXMAN_a8r8g8b8,    1, PIXMAN_OP_OVER_REVERSE, PIXMAN_null, 0, PIXMAN_a8r8g8b8 },
+    { "in_reverse_8888_8888",  PIXMAN_a8r8g8b8,    0, PIXMAN_OP_IN_REVERSE,   PIXMAN_null, 0, PIXMAN_a8r8g8b8 },
 };

 int
-- 
1.7.5.4
[Pixman] [PATCH 05/12] ARMv6: Force fast paths to have fixed alignment to the BTAC
Trying to produce repeatable, trustworthy profiling results from the cairo-perf-trace benchmark suite has proved tricky, especially when testing changes that have only a marginal (< ~5%) effect upon the runtime as a whole. One of the problems is that some traces appear to show statistically significant changes even when the only fast path that has changed is not even exercised by the trace in question.

This patch helps to address this by ensuring that the aliasing between the branch predictor's target address cache (BTAC) entries for the remaining fast paths is not affected by the addition, removal or refactoring of any other fast paths. The profiling results later in this patch series have been calculated with this switch enabled, to ensure fair comparisons.

Additionally, the cairo-perf-trace test harness itself was modified to do timing using getrusage() so as to exclude any kernel-mode components of the runtime. Between these two measures, the majority of false positives appear to have been eliminated.
---
 pixman/pixman-arm-simd-asm.S |    3 +++
 pixman/pixman-arm-simd-asm.h |    9 +++++++++
 2 files changed, 12 insertions(+), 0 deletions(-)

diff --git a/pixman/pixman-arm-simd-asm.S b/pixman/pixman-arm-simd-asm.S
index c209688..259fb88 100644
--- a/pixman/pixman-arm-simd-asm.S
+++ b/pixman/pixman-arm-simd-asm.S
@@ -611,3 +611,6 @@ generate_composite_function \

 /******************************************************************************/

+#ifdef PROFILING
+.p2align 9
+#endif
diff --git a/pixman/pixman-arm-simd-asm.h b/pixman/pixman-arm-simd-asm.h
index 4c08b9e..c7e5ca7 100644
--- a/pixman/pixman-arm-simd-asm.h
+++ b/pixman/pixman-arm-simd-asm.h
@@ -54,6 +54,12 @@
  */

 /*
+ * Determine whether we space out fast paths to reduce the effect of
+ * different BTAC aliasing upon comparative profiling results
+ */
+#define PROFILING
+
+/*
  * Determine whether we put the arguments on the stack for debugging.
  */
 #undef DEBUG_PARAMS
@@ -590,6 +596,9 @@
                 process_tail, \
                 process_inner_loop

+#ifdef PROFILING
+ .p2align 9
+#endif
 .func fname
 .global fname
 /* For ELF format also set function visibility to hidden */
-- 
1.7.5.4
[Pixman] [PATCH 04/12] ARMv6: Add fast path flag to force no preload of destination buffer
--- pixman/pixman-arm-simd-asm.h | 14 +- 1 files changed, 13 insertions(+), 1 deletions(-) diff --git a/pixman/pixman-arm-simd-asm.h b/pixman/pixman-arm-simd-asm.h index e481320..4c08b9e 100644 --- a/pixman/pixman-arm-simd-asm.h +++ b/pixman/pixman-arm-simd-asm.h @@ -78,6 +78,8 @@ .set FLAG_PROCESS_PRESERVES_SCRATCH, 64 .set FLAG_PROCESS_PRESERVES_WK0, 0 .set FLAG_PROCESS_CORRUPTS_WK0, 128 /* if possible, use the specified register(s) instead so WK0 can hold number of leading pixels */ +.set FLAG_PRELOAD_DST, 0 +.set FLAG_NO_PRELOAD_DST,256 /* * Offset into stack where mask and source pointer/stride can be accessed. @@ -439,7 +441,7 @@ preload_middle src_bpp, SRC, 0 preload_middle mask_bpp, MASK, 0 .endif - .if (dst_r_bpp > 0) && ((SUBBLOCK % 2) == 0) + .if (dst_r_bpp > 0) && ((SUBBLOCK % 2) == 0) && (((flags) & FLAG_NO_PRELOAD_DST) == 0) /* Because we know that writes are 16-byte aligned, it's relatively easy to ensure that * destination prefetches are 32-byte aligned. It's also the easiest channel to offset * preloads for, to achieve staggered prefetches for multiple channels, because there are @@ -474,7 +476,9 @@ .endif preload_trailing src_bpp, src_bpp_shift, SRC preload_trailing mask_bpp, mask_bpp_shift, MASK + .if ((flags) & FLAG_NO_PRELOAD_DST) == 0 preload_trailing dst_r_bpp, dst_bpp_shift, DST + .endif add X, X, #(prefetch_distance+2)*pix_per_block - 128/dst_w_bpp /* The remainder of the line is handled identically to the medium case */ medium_case_inner_loop_and_trailing_pixels process_head, process_tail,, exit_label, unaligned_src, unaligned_mask @@ -773,7 +777,9 @@ fname: newline preload_leading_step1 src_bpp, WK1, SRC preload_leading_step1 mask_bpp, WK2, MASK + .if ((flags) & FLAG_NO_PRELOAD_DST) == 0 preload_leading_step1 dst_r_bpp, WK3, DST + .endif andsWK0, DST, #15 beq 154f @@ -781,7 +787,9 @@ fname: preload_leading_step2 src_bpp, src_bpp_shift, WK1, SRC preload_leading_step2 mask_bpp, mask_bpp_shift, WK2, MASK + .if ((flags) & FLAG_NO_PRELOAD_DST) 
== 0 preload_leading_step2 dst_r_bpp, dst_bpp_shift, WK3, DST + .endif leading_15bytes process_head, process_tail @@ -821,7 +829,9 @@ fname: newline preload_line 0, src_bpp, src_bpp_shift, SRC /* in: X, corrupts: WK0-WK1 */ preload_line 0, mask_bpp, mask_bpp_shift, MASK + .if ((flags) & FLAG_NO_PRELOAD_DST) == 0 preload_line 0, dst_r_bpp, dst_bpp_shift, DST + .endif sub X, X, #128/dst_w_bpp /* simplifies inner loop termination */ andsWK0, DST, #15 @@ -850,7 +860,9 @@ fname: newline preload_line 1, src_bpp, src_bpp_shift, SRC /* in: X, corrupts: WK0-WK1 */ preload_line 1, mask_bpp, mask_bpp_shift, MASK + .if ((flags) & FLAG_NO_PRELOAD_DST) == 0 preload_line 1, dst_r_bpp, dst_bpp_shift, DST + .endif .if dst_w_bpp == 8 tst DST, #3 -- 1.7.5.4 ___ Pixman mailing list Pixman@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/pixman
[Pixman] [PATCH 03/12] ARMv6: Support for very variable-hungry composite operations
Previously, the variable ARGS_STACK_OFFSET was available to extract values from function arguments during the init macro. Now this changes dynamically around stack operations in the function as a whole so that arguments can be accessed at any point. It is also joined by LOCALS_STACK_OFFSET, which allows access to space reserved on the stack during the init macro. On top of this, composite macros now have the option of using all of WK0-WK3 registers rather than just the subset it was told to use; this requires the pixel count to be spilled to the stack over the leading pixels at the start of each line. Thus, at best, each composite operation can use 11 registers, plus any pointer registers not required for the composite type, plus as much stack space as it needs, divided up into constants and variables as necessary. --- pixman/pixman-arm-simd-asm.h | 56 +++-- 1 files changed, 53 insertions(+), 3 deletions(-) diff --git a/pixman/pixman-arm-simd-asm.h b/pixman/pixman-arm-simd-asm.h index 3a2c250..e481320 100644 --- a/pixman/pixman-arm-simd-asm.h +++ b/pixman/pixman-arm-simd-asm.h @@ -76,6 +76,8 @@ .set FLAG_SPILL_LINE_VARS, 48 .set FLAG_PROCESS_CORRUPTS_SCRATCH, 0 .set FLAG_PROCESS_PRESERVES_SCRATCH, 64 +.set FLAG_PROCESS_PRESERVES_WK0, 0 +.set FLAG_PROCESS_CORRUPTS_WK0, 128 /* if possible, use the specified register(s) instead so WK0 can hold number of leading pixels */ /* * Offset into stack where mask and source pointer/stride can be accessed. @@ -87,6 +89,11 @@ #endif /* + * Offset into stack where space allocated during init macro can be accessed. + */ +.set LOCALS_STACK_OFFSET, 0 + +/* * Constants for selecting preferable prefetch type. 
*/ .set PREFETCH_TYPE_NONE, 0 @@ -359,23 +366,41 @@ .macro test_bits_1_0_ptr + .if (flags) & FLAG_PROCESS_CORRUPTS_WK0 +movsSCRATCH, X, lsl #32-1 /* C,N = bits 1,0 of DST */ + .else movsSCRATCH, WK0, lsl #32-1 /* C,N = bits 1,0 of DST */ + .endif .endm .macro test_bits_3_2_ptr + .if (flags) & FLAG_PROCESS_CORRUPTS_WK0 +movsSCRATCH, X, lsl #32-3 /* C,N = bits 3, 2 of DST */ + .else movsSCRATCH, WK0, lsl #32-3 /* C,N = bits 3, 2 of DST */ + .endif .endm .macro leading_15bytes process_head, process_tail /* On entry, WK0 bits 0-3 = number of bytes until destination is 16-byte aligned */ + .set DECREMENT_X, 1 + .if (flags) & FLAG_PROCESS_CORRUPTS_WK0 + .set DECREMENT_X, 0 +sub X, X, WK0, lsr #dst_bpp_shift +str X, [sp, #LINE_SAVED_REG_COUNT*4] +mov X, WK0 + .endif /* Use unaligned loads in all cases for simplicity */ .if dst_w_bpp == 8 -conditional_process2 test_bits_1_0_ptr, mi, cs, process_head, process_tail, 1, 2, 1, 2, 1, 1, 1 +conditional_process2 test_bits_1_0_ptr, mi, cs, process_head, process_tail, 1, 2, 1, 2, 1, 1, DECREMENT_X .elseif dst_w_bpp == 16 test_bits_1_0_ptr -conditional_process1 cs, process_head, process_tail, 2, 2, 1, 1, 1 +conditional_process1 cs, process_head, process_tail, 2, 2, 1, 1, DECREMENT_X + .endif +conditional_process2 test_bits_3_2_ptr, mi, cs, process_head, process_tail, 4, 8, 1, 2, 1, 1, DECREMENT_X + .if (flags) & FLAG_PROCESS_CORRUPTS_WK0 +ldr X, [sp, #LINE_SAVED_REG_COUNT*4] .endif -conditional_process2 test_bits_3_2_ptr, mi, cs, process_head, process_tail, 4, 8, 1, 2, 1, 1, 1 .endm .macro test_bits_3_2_pix @@ -705,6 +730,13 @@ fname: #endif init + + .if (flags) & FLAG_PROCESS_CORRUPTS_WK0 +/* Reserve a word in which to store X during leading pixels */ +sub sp, sp, #4 + .set ARGS_STACK_OFFSET, ARGS_STACK_OFFSET+4 + .set LOCALS_STACK_OFFSET, LOCALS_STACK_OFFSET+4 + .endif lsl STRIDE_D, #dst_bpp_shift /* stride in bytes */ sub STRIDE_D, STRIDE_D, X, lsl #dst_bpp_shift @@ -734,6 +766,8 @@ fname: .if (flags) & FLAG_SPILL_LINE_VARS_WIDE 
/* This is stmdb sp!,{} */ .word 0xE92D | LINE_SAVED_REGS + .set ARGS_STACK_OFFSET, ARGS_STACK_OFFSET + LINE_SAVED_REG_COUNT*4 + .set LOCALS_STACK_OFFSET, LOCALS_STACK_OFFSET + LINE_SAVED_REG_COUNT*4 .endif 151:/* New line */ newline @@ -767,6 +801,10 @@ fname: 157:/* Check for another line */ end_of_line 1, %((flags) & FLAG_SPILL_LINE_VARS_WIDE), 151b + .if (flags) & FLAG_SPILL_LINE_VARS_WIDE + .set ARGS_STACK_OFFSET, ARGS_STACK_OFFSET - LINE_SAVED_REG_COUNT*4 + .set LOCALS_STACK_OFFSET, LOCALS_STACK_OFFSET - LINE_SAVED_REG_COUNT*4 + .endif .endif .ltorg @@ -776,6 +814,8 @@ fname: .if (flags) & FLAG_SPILL_LINE_VARS_NON_WIDE /* This is stmdb sp!,{} */ .word 0xE92D | LINE_SAVED_REGS + .set ARGS_STACK_OFFSET, ARGS_STACK_OFFSET + LINE_SAVED_REG_COUNT*4 + .set LOCALS_STACK_OFFSET, LOCALS_STACK_OFFSET + LINE_SAVED_REG_COUNT*4 .endif 161:/*
[Pixman] [PATCH 02/12] ARMv6: Minor optimisation
This knocks off one instruction per row. The effect is probably too small to
be measurable, but it might as well be included. The second occurrence of
this sequence doesn't actually benefit at all, but is changed for
consistency.
---
 pixman/pixman-arm-simd-asm.h | 11 ---
 1 files changed, 4 insertions(+), 7 deletions(-)

diff --git a/pixman/pixman-arm-simd-asm.h b/pixman/pixman-arm-simd-asm.h
index 74400c1..3a2c250 100644
--- a/pixman/pixman-arm-simd-asm.h
+++ b/pixman/pixman-arm-simd-asm.h
@@ -741,12 +741,9 @@ fname:
 preload_leading_step1 mask_bpp, WK2, MASK
 preload_leading_step1 dst_r_bpp, WK3, DST
-    tst     DST, #15
+    ands    WK0, DST, #15
     beq     154f
-    rsb     WK0, DST, #0 /* bits 0-3 = number of leading bytes until destination aligned */
- .if (src_bpp != 0 && src_bpp != 2*dst_w_bpp) || (mask_bpp != 0 && mask_bpp != 2*dst_w_bpp)
-    PF and, WK0, WK0, #15
- .endif
+    rsb     WK0, WK0, #16 /* bits 0-3 = number of leading bytes until destination aligned */
 preload_leading_step2 src_bpp, src_bpp_shift, WK1, SRC
 preload_leading_step2 mask_bpp, mask_bpp_shift, WK2, MASK
@@ -787,9 +784,9 @@ fname:
 preload_line 0, dst_r_bpp, dst_bpp_shift, DST
 sub     X, X, #128/dst_w_bpp /* simplifies inner loop termination */
-    tst     DST, #15
+    ands    WK0, DST, #15
     beq     164f
-    rsb     WK0, DST, #0 /* bits 0-3 = number of leading bytes until destination aligned */
+    rsb     WK0, WK0, #16 /* bits 0-3 = number of leading bytes until destination aligned */
 leading_15bytes process_head, process_tail
--
1.7.5.4
[Pixman] [PATCH 01/12] ARMv6: Fix some indentation in the composite macros
---
 pixman/pixman-arm-simd-asm.h | 12 ++--
 1 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/pixman/pixman-arm-simd-asm.h b/pixman/pixman-arm-simd-asm.h
index 6543606..74400c1 100644
--- a/pixman/pixman-arm-simd-asm.h
+++ b/pixman/pixman-arm-simd-asm.h
@@ -755,18 +755,18 @@ fname:
 leading_15bytes process_head, process_tail
154:/* Destination now 16-byte aligned; we have at least one prefetch on each channel as well as at least one 16-byte output block */
- .if (src_bpp > 0) && (mask_bpp == 0) && ((flags) & FLAG_PROCESS_PRESERVES_SCRATCH)
+ .if (src_bpp > 0) && (mask_bpp == 0) && ((flags) & FLAG_PROCESS_PRESERVES_SCRATCH)
 and SCRATCH, SRC, #31
 rsb SCRATCH, SCRATCH, #32*prefetch_distance
- .elseif (src_bpp == 0) && (mask_bpp > 0) && ((flags) & FLAG_PROCESS_PRESERVES_SCRATCH)
+ .elseif (src_bpp == 0) && (mask_bpp > 0) && ((flags) & FLAG_PROCESS_PRESERVES_SCRATCH)
 and SCRATCH, MASK, #31
 rsb SCRATCH, SCRATCH, #32*prefetch_distance
- .endif
+ .endif
- .ifc "process_inner_loop",""
+ .ifc "process_inner_loop",""
 switch_on_alignment wide_case_inner_loop_and_trailing_pixels, process_head, process_tail, wide_case_inner_loop, 157f
- .else
+ .else
 switch_on_alignment wide_case_inner_loop_and_trailing_pixels, process_head, process_tail, process_inner_loop, 157f
- .endif
+ .endif
157:/* Check for another line */
 end_of_line 1, %((flags) & FLAG_SPILL_LINE_VARS_WIDE), 151b
--
1.7.5.4
[Pixman] [PATCH 00/12] ARMv6: Assorted improvements
While I have some pending contributions relating to pad-repeated images and
over_n_ from 2013-02-06 and 2013-02-13, I've been continuing to work in other
areas. These patches have been rebased at the current head of git (as I
understand is list policy), though the Cairo benchmark results included in
the log messages assume the earlier patches have been applied
(lowlevel-blt-bench should be unaffected). It is likely that you will
encounter conflicts if you attempt to apply both this patch series and my
series from February in either order.

Ben Avison (12):
  ARMv6: Fix some indentation in the composite macros
  ARMv6: Minor optimisation
  ARMv6: Support for very variable-hungry composite operations
  ARMv6: Add fast path flag to force no preload of destination buffer
  ARMv6: Force fast paths to have fixed alignment to the BTAC
  Add extra test to lowlevel-blt-bench and fix an existing one
  ARMv6: Macro to permit testing for early returns or alternate implementations
  ARMv6: Added fast path for over_n_8888_8888_ca
  ARMv6: Add fast path for in_reverse_8888_8888
  ARMv6: Add fast path for over_reverse_n_8888
  ARMv6: Add fast path for add_8888_8888
  ARMv6: Add fast path for src_x888_0565

 pixman/pixman-arm-simd-asm.S | 584 ++
 pixman/pixman-arm-simd-asm.h | 122 +++--
 pixman/pixman-arm-simd.c     |  37 +++
 test/lowlevel-blt-bench.c    |   3 +-
 4 files changed, 720 insertions(+), 26 deletions(-)
--
1.7.5.4
[Pixman] [PATCH] MIPS: DSPr2: Fix for bug in in_n_8 routine.
The rounding logic was not implemented correctly: instead of the rounding
version of the 8-bit shift, logical shifts were used. The code also used
unnecessary multiplications, which can be avoided by packing four destination
(a8) pixels into one 32-bit register, and made unnecessary spills to the
stack. The code is rewritten to address these issues. The bug was revealed by
increasing the number of iterations in blitters-test.

Performance numbers on MIPS-74kc @ 1GHz (lowlevel-blt-bench results):

Referent (before):
in_n_8 = L1: 21.20 L2: 22.86 M: 21.42 ( 14.21%) HT: 15.97 VT: 15.69 R: 15.47 RT: 8.00 ( 48Kops/s)

Optimized (first implementation, with bug):
in_n_8 = L1: 89.38 L2: 86.07 M: 65.48 ( 43.44%) HT: 44.64 VT: 41.50 R: 40.77 RT: 16.94 ( 66Kops/s)

Optimized (with bug fix, and code revisited):
in_n_8 = L1: 102.33 L2: 95.65 M: 70.54 ( 46.84%) HT: 48.35 VT: 45.06 R: 43.20 RT: 17.60 ( 66Kops/s)
---
 pixman/pixman-mips-dspr2-asm.S | 118
 1 files changed, 48 insertions(+), 70 deletions(-)

diff --git a/pixman/pixman-mips-dspr2-asm.S b/pixman/pixman-mips-dspr2-asm.S
index b94e66f..3a4d914 100644
--- a/pixman/pixman-mips-dspr2-asm.S
+++ b/pixman/pixman-mips-dspr2-asm.S
@@ -2974,96 +2974,74 @@ END(pixman_composite_over_reverse_n_8888_asm_mips)
 LEAF_MIPS_DSPR2(pixman_composite_in_n_8_asm_mips)
 /*
  * a0 - dst (a8)
- * a1 - src (a8r8g8b8)
+ * a1 - src (32bit constant)
  * a2 - w
  */
-    beqz     a2, 5f
+    li       t9, 0x00ff00ff
+    beqz     a2, 3f
     nop
-
-    SAVE_REGS_ON_STACK 20, s0, s1, s2, s3, s4, s5, s6, s7
-    move     t7, a1
-    srl      t5, t7, 24
-    replv.ph t5, t5
-    srl      t9, a2, 2  /* t1 = how many multiples of 4 src pixels */
-    beqz     t9, 2f     /* branch if less than 4 src pixels */
+    srl      t7, a2, 2  /* t7 = how many multiples of 4 dst pixels */
+    beqz     t7, 1f     /* branch if less than 4 src pixels */
     nop
-1:
-    addiu    t9, t9, -1
-    addiu    a2, a2, -4
+    srl      t8, a1, 24
+    replv.ph t8, t8
+
+0:
+    beqz     t7, 1f
+    addiu    t7, t7, -1
     lbu      t0, 0(a0)
     lbu      t1, 1(a0)
     lbu      t2, 2(a0)
     lbu      t3, 3(a0)
-    muleu_s.ph.qbl s0, t0, t5
-    muleu_s.ph.qbr s1, t0, t5
-    muleu_s.ph.qbl s2, t1, t5
-    muleu_s.ph.qbr s3, t1, t5
-    muleu_s.ph.qbl s4, t2, t5
-    muleu_s.ph.qbr s5, t2, t5
-    muleu_s.ph.qbl s6, t3, t5
-    muleu_s.ph.qbr s7, t3, t5
-
-    shrl.ph  t4, s0, 8
-    shrl.ph  t6, s1, 8
-    shrl.ph  t7, s2, 8
-    shrl.ph  t8, s3, 8
-    addq.ph  t0, s0, t4
-    addq.ph  t1, s1, t6
-    addq.ph  t2, s2, t7
-    addq.ph  t3, s3, t8
-    shra_r.ph t0, t0, 8
-    shra_r.ph t1, t1, 8
+    precr_sra.ph.w t1, t0, 0
+    precr_sra.ph.w t3, t2, 0
+    precr.qb.ph t0, t3, t1
+
+    muleu_s.ph.qbl t2, t0, t8
+    muleu_s.ph.qbr t3, t0, t8
+    shra_r.ph t4, t2, 8
+    shra_r.ph t5, t3, 8
+    and      t4, t4, t9
+    and      t5, t5, t9
+    addq.ph  t2, t2, t4
+    addq.ph  t3, t3, t5
     shra_r.ph t2, t2, 8
     shra_r.ph t3, t3, 8
-    shrl.ph  t4, s4, 8
-    shrl.ph  t6, s5, 8
-    shrl.ph  t7, s6, 8
-    shrl.ph  t8, s7, 8
-    addq.ph  s0, s4, t4
-    addq.ph  s1, s5, t6
-    addq.ph  s2, s6, t7
-    addq.ph  s3, s7, t8
-    shra_r.ph t4, s0, 8
-    shra_r.ph t6, s1, 8
-    shra_r.ph t7, s2, 8
-    shra_r.ph t8, s3, 8
-
-    precr.qb.ph s0, t0, t1
-    precr.qb.ph s1, t2, t3
-    precr.qb.ph s2, t4, t6
-    precr.qb.ph s3, t7, t8
+    precr.qb.ph t2, t2, t3

-    sb       s0, 0(a0)
-    sb       s1, 1(a0)
-    sb       s2, 2(a0)
-    sb       s3, 3(a0)
-    bgtz     t9, 1b
+    sb       t2, 0(a0)
+    srl      t2, t2, 8
+    sb       t2, 1(a0)
+    srl      t2, t2, 8
+    sb       t2, 2(a0)
+    srl      t2, t2, 8
+    sb       t2, 3(a0)
+    addiu    a2, a2, -4
+    b        0b
     addiu    a0, a0, 4
-2:
-    beqz     a2, 4f
+
+1:
+    beqz     a2, 3f
     nop
-3:
-    lbu      t1, 0(a0)
+    srl      t8, a1, 24
+2:
+    lbu      t0, 0(a0)
+
+    mul      t2, t0, t8
+    shra_r.ph t3, t2, 8
+    andi     t3, t3, 0x00ff
+    addq.ph  t2, t2, t3
+    shra_r.ph t2, t2, 8

-    muleu_s.ph.qbl t4, t1, t5
-    muleu_s.ph.qbr t7, t1, t5
-
[Pixman] MIPS DSPr2: Fix for in_n_8 routine.
Increasing the number of iterations in blitters-test revealed a bug in the
DSPr2 optimizations, in the in_n_8 routine: the rounding logic was not
implemented correctly. The code also used unnecessary multiplications, which
can be avoided by packing four destination (a8) pixels into one 32-bit
register, and made unnecessary spills to the stack. The code is rewritten to
address these issues.
Re: [Pixman] [PATCH] MIPS: DSPr2: Fix bug in over_n_8888_8888_ca/over_n_8888_0565_ca routines
> Are you referring to MIPS implementation of the following code?
>
> http://cgit.freedesktop.org/pixman/tree/pixman/pixman-fast-path.c?id=pixman-0.29.2#n389

Yes.

> Looks like a lot of changes for only adding a missing shift. Are you
> really just fixing a single bug and not also introducing something
> unrelated?

Yes, it really does look like a huge change for a couple of missing shifts.
When I wrote this code in the first place, I misplaced those shifts, which
allowed me to combine code for the over operation and:

UN8x4_MUL_UN8x4 (s, ma);
UN8x4_MUL_UN8 (ma, srca);
ma = ~ma;
UN8x4_MUL_UN8x4_ADD_UN8x4 (d, ma, s);

where shifts are not present (for ma). So I decided to rewrite that piece of
code from scratch. I changed the logic, so now the assembly code mimics the
code from pixman-fast-path.c but processes two pixels at a time. This code
should be easier to debug and maintain.

> Also appears that this is not the only problem in the MIPS DSPr2
> code. Using "test/fuzzer-find-diff.pl" script, I can reproduce one
> more failure:

I'll look into this, and upload a separate patch with a fix for it.

Thanks,
Nemanja Lukic

-----Original Message-----
From: Siarhei Siamashka [mailto:siarhei.siamas...@gmail.com]
Sent: Sunday, March 03, 2013 3:42 AM
To: Nemanja Lukic
Cc: pixman@lists.freedesktop.org; Nemanja Lukic
Subject: Re: [Pixman] [PATCH] MIPS: DSPr2: Fix bug in over_n_8888_8888_ca/over_n_8888_0565_ca routines

On Fri, 1 Mar 2013 10:10:16 +0100
Nemanja Lukic wrote:

> From: Nemanja Lukic
>
> After introducing new PRNG (pseudorandom number generator) a bug in two DSPr2
> routines was revealed. Bug manifested by wrong calculation in composite and
> glyph tests, which caused make check to fail for MIPS DSPr2 optimizations.

Thanks for spotting and addressing this issue. The test suite has relatively
good coverage, but a chance of missing some bugs always exists. Increasing
the number of iterations in random tests can reduce this probability, but
makes the tests run longer.
One of the reasons for introducing the new PRNG was the intention to improve
performance. It was the biggest performance bottleneck for the blitters-test
program (another bottleneck is the CRC32 calculation there) as can be seen
from the profiling logs included in the commit message:

http://cgit.freedesktop.org/pixman/commit/?id=b31a696263f1ae9a

Having less than 4 seconds to run blitters-test on reasonably fast x86
hardware, we could increase the number of randomly tested compositing
operations quite significantly and improve reliability. But unfortunately
some slow hardware such as MIPS and ARM11 is holding us back. The MIPS build
of blitters-test needs ~7.5 minutes to run on MIPS 74K 480MHz hardware, or
~2.5 minutes in QEMU on Intel Core i7 860 (with a single thread). That's
quite a lot.

CRC32 can still be improved quite significantly by using a better
split-by-four or split-by-eight implementation from zlib or xz (either by
borrowing the code or by adding one of these libraries as an optional
dependency). But that's more like tens of percent of overall performance
improvement and not a magic solution.

> Bug was in the calculation of the:
> *dst = over (src, *dst) when ma == 0xffffffff

Are you referring to MIPS implementation of the following code?

http://cgit.freedesktop.org/pixman/tree/pixman/pixman-fast-path.c?id=pixman-0.29.2#n389

Indeed, in order to test this branch in the blitters-test program, we need to
encounter four consecutive 0xff bytes in the mask. The randomly generated
images already have more 0xff and 0x00 bytes. But maybe adding some code to
increase the probability of getting large clusters of 0xff and 0x00 in the
randomly generated images could improve the reliability of testing.

> In this case src was not negated and shifted right by 24 bits,
> it was only negated. Routines are rewritten, and now make check passes
> for DPSr2 optimizations. Performance improvement remained the same as in
> original commit.
>
> The bug was revealed in commit b31a6962. Errors were detected by composite
> and glyph tests.
> ---
>  pixman/pixman-mips-dspr2-asm.S | 298 ++-
>  1 files changed, 168 insertions(+), 130 deletions(-)

Looks like a lot of changes for only adding a missing shift. Are you
really just fixing a single bug and not also introducing something
unrelated?

Also appears that this is not the only problem in the MIPS DSPr2 code.
Using the "test/fuzzer-find-diff.pl" script, I can reproduce one more
failure:

op=PIXMAN_OP_IN
src_fmt=a8r8g8b8, dst_fmt=a8, mask_fmt=null
src_width=1, src_height=1, dst_width=124, dst_height=14
src_x=0, src_y=0, dst_x=4, dst_y=0
src_stride=12, dst_stride=128
w=13, h=12
4114763: checksum=023EF000

The problematic conditions can be reproduced by running:
./blitters-test 4114763

Which shows us that blitters-test would need to run 2-3x more iterations to
detect this problem (right now it runs 200 tests). It is also a good
idea to run fuzzer-find-di