Re: [Pixman] Image scaling with bilinear interpolation performance
On Monday 21 February 2011 13:07:31 김태균 wrote:
> Hi,
> Thank you for the reply.
>
> > Regarding performance, improving it twice is still a little bit too slow
> > on the hardware which has SIMD. On x86, support for SSE2 is pretty much
> > common, so it is quite natural to use it if it proves to be beneficial.
> > But for the low end embedded machines with primitive processors without
> > SIMD it may be indeed very good to have any kind of performance
> > improvements.
>
> Yes, right.
> I will fully utilize SIMD as much as I can. (NEON is available on some
> of our target machines)

Great. Contributions in this area would definitely be useful. But you may
have started this work a bit too late ;) I have been looking into improving
bilinear scaling performance for the last couple of weeks already and have
just submitted some initial SSE2 and ARM NEON optimizations for it (btw,
testing is very much welcome). And there is still a lot of work to do before
all the bilinear scaling related performance bottlenecks are eliminated.

> But I have to consider not only high end machines but also low end ones
> which do not support SIMD.
> That's why I'm trying to optimize the non-SIMD general code path.

Well, in your original e-mail you mentioned that you are interested in
getting good performance on an Intel quad core. That's why, without having
any other information, I suggested SSE2 as a solution for this problem :)

What kind of hardware do the rest of your target machines have? A lot of
ARM processors, beginning with armv5te, have special instructions for fast
signed 16-bit multiplication. If we know what the target hardware supports,
we may modify the bilinear interpolation code to make better use of it. The
current bilinear interpolation code has one problem: it needs 16-bit
unsigned multiplications (uint16_t * uint16_t -> uint32_t), which are also
not so efficient for MMX/SSE2.

Maybe going down from 256 levels to 128 levels could allow the use of
signed 16-bit multiplications and provide more optimization possibilities
on a wide range of hardware? Also SSSE3 may be worth considering because it
has the PMADDUBSW instruction (uint8_t * int8_t -> int16_t). It's just that
ARM NEON is not challenging at all and is boring, because it is totally
orthogonal and supports all kinds of vector multiplications easily (8-bit
and 16-bit, both signed and unsigned, both ordinary and long variants). I
guess it would work fine with any interpolation method, like it did with
the current one.

I also tried to benchmark your change to the bilinear code and got
something like 23% better scaling performance overall on Intel Core i7. I
guess you have benchmarked a 2x performance improvement for that function
alone, but not for the full rendering pipeline, right? It's a good
improvement, but not even close to the performance effect of using SSE2 or
NEON (or maybe even armv5te). So I would consider looking at the supported
instruction set on your target hardware first.

For these experiments, I'm typically doing benchmarks with the
'scaling-bench' program from:
http://cgit.freedesktop.org/~siamashka/pixman/log/?h=playground/test-n-bench

-- 
Best regards,
Siarhei Siamashka

___
Pixman mailing list
Pixman@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/pixman
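To make the multiplication-width argument above concrete, here is a hypothetical scalar sketch (not pixman code) of a single-channel bilinear blend. With 256 weight levels the horizontal pass already produces values up to 255 * 256 = 65280, so the vertical pass needs an unsigned 16x16->32 multiply; dropping to 128 levels keeps every intermediate below 32767, within the range that signed 16-bit multiply instructions (such as the armv5te SMULxy family) can handle. The function names and the exact shift amounts are illustrative assumptions, not pixman's actual implementation:

```c
#include <stdint.h>

/* Hypothetical sketch: blend one 8-bit channel with 256 weight levels.
 * The horizontal pass yields values up to 255 * 256 = 65280, so the
 * vertical pass requires uint16_t * uint16_t -> uint32_t multiplies. */
static uint32_t
blend_256 (uint32_t tl, uint32_t tr, uint32_t bl, uint32_t br,
           int distx, int disty)   /* weights in [0, 256] */
{
    uint32_t top = tl * (256 - distx) + tr * distx;
    uint32_t bot = bl * (256 - distx) + br * distx;

    return (top * (256 - disty) + bot * disty) >> 16;
}

/* With 128 levels the intermediates stay below 255 * 128 = 32640,
 * i.e. inside the positive range of int16_t, so signed 16 x 16 -> 32
 * multiplications become usable on hardware that only has those. */
static uint32_t
blend_128 (uint32_t tl, uint32_t tr, uint32_t bl, uint32_t br,
           int distx, int disty)   /* weights in [0, 128] */
{
    int16_t top = (int16_t) (tl * (128 - distx) + tr * distx);
    int16_t bot = (int16_t) (bl * (128 - distx) + br * distx);

    return ((int32_t) top * (128 - disty) + (int32_t) bot * disty) >> 14;
}
```

The trade-off is one bit of interpolation precision (128 sub-pixel positions instead of 256) in exchange for the wider choice of multiply instructions.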
Re: [Pixman] [cairo] scaling performance test of cairo library
On Thursday 10 February 2011 03:16:47 Siarhei Siamashka wrote:
> On Wednesday 09 February 2011 08:23:51 Siarhei Siamashka wrote:
> > On Wednesday 09 February 2011 05:28:46 cooolheater wrote:
> > > Thank you for your kind explanation.
> > > I used pixman-0.21.4 for testing.
> > > As you guessed, we are using SIMD and are looking for a method for
> > > NEON acceleration.
> > > Could you let me know the bilinear scaling interfaces in pixman and
> > > where the SIMD optimization will be applied?
> >
> > You can look here for a start:
> > http://cgit.freedesktop.org/pixman/tree/pixman/pixman-bits-image.c?id=pixman-0.21.4#n189
> >
> > But applying optimizations locally just to this small function is not
> > going to provide the best performance; it's kind of like how swinging a
> > large polearm in a narrow passage is not so effective.
>
> And here is an example of such a patch, attached. The performance
> improvement is not impressive at all. Who cares if it's now, let's say,
> ~15x slower than nearest scaling instead of ~30x?
>
> Obviously we need a better solution.

Hello cooolheater,

Could you please try to run your benchmark again with the patches from the
following link applied to pixman and share the results?
http://lists.freedesktop.org/archives/pixman/2011-February/001053.html

-- 
Best regards,
Siarhei Siamashka
[Pixman] [PATCH 3/3] test: add Makefile for Win32
---
 test/Makefile.win32 |   73 +++
 1 files changed, 73 insertions(+), 0 deletions(-)
 create mode 100644 test/Makefile.win32

diff --git a/test/Makefile.win32 b/test/Makefile.win32
new file mode 100644
index 000..c71afe1
--- /dev/null
+++ b/test/Makefile.win32
@@ -0,0 +1,73 @@
+CC = cl
+LINK = link
+
+CFG_VAR = $(CFG)
+ifeq ($(CFG_VAR),)
+CFG_VAR=release
+endif
+
+CFLAGS = -MD -nologo -D_CRT_SECURE_NO_DEPRECATE -D_CRT_NONSTDC_NO_DEPRECATE -D_BIND_TO_CURRENT_VCLIBS_VERSION -D_MT -I../pixman -I. -I../
+TEST_LDADD = ../pixman/$(CFG_VAR)/pixman-1.lib
+INCLUDES = -I../pixman -I$(top_builddir)/pixman
+
+# optimization flags
+ifeq ($(CFG_VAR),debug)
+CFLAGS += -Od -Zi
+else
+CFLAGS += -O2
+endif
+
+SOURCES = \
+    a1-trap-test.c \
+    pdf-op-test.c \
+    region-test.c \
+    region-translate-test.c \
+    fetch-test.c \
+    oob-test.c \
+    trap-crasher.c \
+    alpha-loop.c \
+    scaling-crash-test.c \
+    gradient-crash-test.c \
+    alphamap.c \
+    stress-test.c \
+    composite-traps-test.c \
+    blitters-test.c \
+    scaling-test.c \
+    affine-test.c \
+    composite.c \
+    utils.c
+
+TESTS = \
+    $(CFG_VAR)/a1-trap-test.exe \
+    $(CFG_VAR)/pdf-op-test.exe \
+    $(CFG_VAR)/region-test.exe \
+    $(CFG_VAR)/region-translate-test.exe \
+    $(CFG_VAR)/fetch-test.exe \
+    $(CFG_VAR)/oob-test.exe \
+    $(CFG_VAR)/trap-crasher.exe \
+    $(CFG_VAR)/alpha-loop.exe \
+    $(CFG_VAR)/scaling-crash-test.exe \
+    $(CFG_VAR)/gradient-crash-test.exe \
+    $(CFG_VAR)/alphamap.exe \
+    $(CFG_VAR)/stress-test.exe \
+    $(CFG_VAR)/composite-traps-test.exe \
+    $(CFG_VAR)/blitters-test.exe \
+    $(CFG_VAR)/scaling-test.exe \
+    $(CFG_VAR)/affine-test.exe \
+    $(CFG_VAR)/composite.exe
+
+
+OBJECTS = $(patsubst %.c, $(CFG_VAR)/%.obj, $(SOURCES))
+
+$(CFG_VAR)/%.obj: %.c
+	@mkdir -p $(CFG_VAR)
+	@$(CC) -c $(CFLAGS) -Fo"$@" $<
+
+$(CFG_VAR)/%.exe: $(CFG_VAR)/%.obj
+	$(LINK) /NOLOGO /OUT:$@ $< $(CFG_VAR)/utils.obj $(TEST_LDADD)
+
+all: $(OBJECTS) $(TESTS)
+	@exit 0
+
+clean:
+	@rm -f $(CFG_VAR)/*.obj $(CFG_VAR)/*.pdb || exit 0
--
1.7.1
[Pixman] [PATCH 2/3] test: Fix tests for compilation on Windows
The Microsoft C compiler cannot handle subobject initialization and
Win32 does not provide snprintf. Work around these limitations by using
normal struct initialization and directly using printf.
---
 test/composite.c    |   48 +++---
 test/fetch-test.c   |   52 ++
 test/trap-crasher.c |   20 +-
 3 files changed, 61 insertions(+), 59 deletions(-)

diff --git a/test/composite.c b/test/composite.c
index e14f954..33e8d97 100644
--- a/test/composite.c
+++ b/test/composite.c
@@ -616,22 +616,20 @@ eval_diff (color_t *expected, color_t *test, pixman_format_code_t format)
     return MAX (MAX (MAX (rdiff, gdiff), bdiff), adiff);
 }
 
-static char *
-describe_image (image_t *info, char *buf, int buflen)
+static void
+describe_image (image_t *info)
 {
     if (info->size)
     {
-	snprintf (buf, buflen, "%s %dx%d%s",
+	printf ("%s %dx%d%s",
		info->format->name,
		info->size, info->size,
		info->repeat ? "R" :"");
     }
     else
     {
-	snprintf (buf, buflen, "solid");
+	printf ("solid");
     }
-
-    return buf;
 }
 
 /* Test a composite of a given operation, source, mask, and destination
@@ -708,18 +706,13 @@ composite_test (image_t *dst,
      */
     if (diff > 3.0)
     {
-	char buf[40];
-
-	snprintf (buf, sizeof (buf),
-		  "%s %scomposite",
-		  op->name,
-		  component_alpha ? "CA " : "");
-
-	printf ("%s test error of %.4f --\n"
+	printf ("%s %scomposite test error of %.4f --\n"
		"           RGBA\n"
		"got:      %.2f %.2f %.2f %.2f [%08lx]\n"
		"expected: %.2f %.2f %.2f %.2f\n",
-		buf, diff,
+		op->name,
+		component_alpha ? "CA " : "",
+		diff,
		result.r, result.g, result.b, result.a,
		*(unsigned long *) pixman_image_get_data (dst->image),
		expected.r, expected.g, expected.b, expected.a);
@@ -735,9 +728,18 @@ composite_test (image_t *dst,
		mask->color->b, mask->color->a,
		dst->color->r, dst->color->g,
		dst->color->b, dst->color->a);
-	printf ("src: %s, ", describe_image (src, buf, sizeof (buf)));
-	printf ("mask: %s, ", describe_image (mask, buf, sizeof (buf)));
-	printf ("dst: %s\n\n", describe_image (dst, buf, sizeof (buf)));
+
+	printf ("src: ");
+	describe_image (src);
+	printf (", ");
+
+	printf ("mask: ");
+	describe_image (mask);
+	printf (", ");
+
+	printf ("dst: ");
+	describe_image (dst);
+	printf ("\n\n");
     }
     else
     {
@@ -747,8 +749,14 @@ composite_test (image_t *dst,
		src->color->b, src->color->a,
		dst->color->r, dst->color->g,
		dst->color->b, dst->color->a);
-	printf ("src: %s, ", describe_image (src, buf, sizeof (buf)));
-	printf ("dst: %s\n\n", describe_image (dst, buf, sizeof (buf)));
+
+	printf ("src: ");
+	describe_image (src);
+	printf (", ");
+
+	printf ("dst: ");
+	describe_image (dst);
+	printf ("\n\n");
     }
 
     success = FALSE;

diff --git a/test/fetch-test.c b/test/fetch-test.c
index 2ca16dd..314a072 100644
--- a/test/fetch-test.c
+++ b/test/fetch-test.c
@@ -8,7 +8,7 @@
 static pixman_indexed_t mono_palette =
 {
-    .rgba = { 0x, 0x00ff },
+    0, { 0x, 0x00ff },
 };
 
@@ -24,57 +24,53 @@ typedef struct {
 
 static testcase_t testcases[] =
 {
     {
-	.format = PIXMAN_a8r8g8b8,
-	.width = 2, .height = 2,
-	.stride = 8,
-	.src = { 0x00112233, 0x44556677,
-		 0x8899aabb, 0xccddeeff },
-	.dst = { 0x00112233, 0x44556677,
-		 0x8899aabb, 0xccddeeff },
-	.indexed = NULL,
+	PIXMAN_a8r8g8b8,
+	2, 2,
+	8,
+	{ 0x00112233, 0x44556677,
+	  0x8899aabb, 0xccddeeff },
+	{ 0x00112233, 0x44556677,
+	  0x8899aabb, 0xccddeeff },
+	NULL,
     },
     {
-	.format = PIXMAN_g1,
-	.width = 8, .height = 2,
-	.stride = 4,
+	PIXMAN_g1,
+	8, 2,
+	4,
 #ifdef WORDS_BIGENDIAN
-	.src = { 0xaa00, 0x5500 },
#else
-	.src = { 0x0055, 0x00aa },
#endif
-	.dst = { 0x00ff, 0x, 0x00ff, 0x, 0x00ff, 0x, 0x00ff, 0x,
		 0x, 0x00ff, 0x, 0x00ff, 0x, 0x00ff, 0x, 0x00ff },
-	.indexed = &mono_palette,
+	&mono_palette,
     },
 #if 0
     {
-	.format = PIXMAN_g8,
-
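For readers unfamiliar with the incompatibility being worked around: the `.field = value` form being removed above is a C99 designated initializer, which the targeted Microsoft compiler rejects, while plain positional initialization is accepted everywhere. A minimal illustration, using a hypothetical struct rather than the actual test-suite types:

```c
/* Hypothetical struct, for illustration only -- not from the test suite. */
typedef struct
{
    int width;
    int height;
    int stride;
} image_desc_t;

/* C99 designated initializers -- rejected by the MSVC version in question:
 *
 *     static const image_desc_t d = { .width = 2, .height = 2, .stride = 8 };
 *
 * Positional initialization -- the portable workaround used by the patch.
 * The fields must now be listed in declaration order: */
static const image_desc_t d = { 2, 2, 8 };
```

The cost of the workaround is readability: positional initializers silently break if the struct's field order ever changes, which is why the designated form was used in the first place.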
[Pixman] [PATCH 1/3] Fix compilation on Win32
Building the library from a clean git repository fails with:

pixman-image.c(33) : fatal error C1083: Cannot open include file: 'pixman-combine32.h': No such file or directory

pixman-combine32.h is not used by pixman-image.c, so its inclusion can
simply be removed.
---
 pixman/pixman-image.c |    1 -
 1 files changed, 0 insertions(+), 1 deletions(-)

diff --git a/pixman/pixman-image.c b/pixman/pixman-image.c
index 9103ca6..84bacf8 100644
--- a/pixman/pixman-image.c
+++ b/pixman/pixman-image.c
@@ -30,7 +30,6 @@
 #include
 #include "pixman-private.h"
-#include "pixman-combine32.h"
 
 pixman_bool_t
 _pixman_init_gradient (gradient_t * gradient,
--
1.7.1
[Pixman] Win32 fixes and improvements
In order to make pixman more maintainable on Windows, having working
Makefiles for the library and the tests is probably needed. Today I took
the Makefile attached to https://bugs.freedesktop.org/show_bug.cgi?id=33069
and tried to use it to build, but it didn't build all the tests because of
some incompatibilities between cl and gcc.

The following patches should make it possible to build pixman and the
entire test suite on Windows from git in a properly configured Cygwin
environment.

There are some remaining warnings:

c:\cygwin\home\ranma42\code\fdo\pixman\pixman\pixman-mmx.c(317) : warning C4799: function 'store' has no EMMS instruction
c:\cygwin\home\ranma42\code\fdo\pixman\pixman\pixman-mmx.c(166) : warning C4799: function 'to_uint64' has no EMMS instruction
c:\cygwin\home\ranma42\code\fdo\pixman\pixman\pixman-mmx.c(437) : warning C4799: function 'combine' has no EMMS instruction

These are warnings about some missing MMX register cleanup. I don't know if
this is required or if the compiler just does not notice that it is already
performed somewhere else.

c:\cygwin\home\ranma42\code\fdo\pixman\test\fetch-test.c(114) : warning C4715: 'reader' : not all control paths return a value
c:\cygwin\home\ranma42\code\fdo\pixman\test\stress-test.c(133) : warning C4715: 'real_reader' : not all control paths return a value
c:\cygwin\home\ranma42\code\fdo\pixman\test\composite.c(431) : warning C4715: 'calc_op' : not all control paths return a value

These come from non-returning functions (abort() / assert(0)). They can be
silenced by adding a return after the termination call, if we aim at a
warning-free build on Windows.
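The C4715 pattern and its proposed fix can be sketched as follows. This is a hypothetical example, not code from the pixman test suite; the function name is made up for illustration:

```c
#include <stdlib.h>

/* Hypothetical sketch of the warning C4715 pattern: MSVC does not track
 * that abort() never returns, so it believes the default branch can fall
 * off the end of a value-returning function. */
static int
calc_op_sketch (int op)
{
    switch (op)
    {
    case 0:
        return 1;
    case 1:
        return 2;
    default:
        abort ();
        return 0;   /* never reached; present only to silence C4715 */
    }
}
```

GCC avoids the warning because abort() is declared with the noreturn attribute and GCC honors it during flow analysis; the extra return is dead code but harmless on every compiler.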
[Pixman] [PATCH 2/7] test: check correctness of 'bilinear_pad_repeat_get_scanline_bounds'
From: Siarhei Siamashka

Individual correctness check for the new bilinear scaling related
supplementary function. This test program uses a bit wider range of input
arguments, not covered by other tests.
---
 test/Makefile.am            |    2 +
 test/scaling-helpers-test.c |   93 +++
 2 files changed, 95 insertions(+), 0 deletions(-)
 create mode 100644 test/scaling-helpers-test.c

diff --git a/test/Makefile.am b/test/Makefile.am
index 057e9ce..9dc7219 100644
--- a/test/Makefile.am
+++ b/test/Makefile.am
@@ -13,6 +13,7 @@ TESTPROGRAMS = \
	trap-crasher \
	alpha-loop \
	scaling-crash-test \
+	scaling-helpers-test \
	gradient-crash-test \
	alphamap \
	stress-test \
@@ -33,6 +34,7 @@
 alpha_loop_SOURCES = alpha-loop.c utils.c utils.h
 composite_SOURCES = composite.c utils.c utils.h
 gradient_crash_test_SOURCES = gradient-crash-test.c utils.c utils.h
 stress_test_SOURCES = stress-test.c utils.c utils.h
+scaling_helpers_test_SOURCES = scaling-helpers-test.c utils.c utils.h
 
 # Benchmarks

diff --git a/test/scaling-helpers-test.c b/test/scaling-helpers-test.c
new file mode 100644
index 000..c186138
--- /dev/null
+++ b/test/scaling-helpers-test.c
@@ -0,0 +1,93 @@
+#include
+#include
+#include
+#include
+#include
+#include "utils.h"
+#include "pixman-fast-path.h"
+
+/* A trivial reference implementation for
+ * 'bilinear_pad_repeat_get_scanline_bounds'
+ */
+static void
+bilinear_pad_repeat_get_scanline_bounds_ref (int32_t        source_image_width,
+                                             pixman_fixed_t vx_,
+                                             pixman_fixed_t unit_x,
+                                             int32_t *      left_pad,
+                                             int32_t *      left_tz,
+                                             int32_t *      width,
+                                             int32_t *      right_tz,
+                                             int32_t *      right_pad)
+{
+    int w = *width;
+    *left_pad = 0;
+    *left_tz = 0;
+    *width = 0;
+    *right_tz = 0;
+    *right_pad = 0;
+    int64_t vx = vx_;
+    while (--w >= 0)
+    {
+        if (vx < 0)
+        {
+            if (vx + pixman_fixed_1 < 0)
+                *left_pad += 1;
+            else
+                *left_tz += 1;
+        }
+        else if (vx + pixman_fixed_1 >= pixman_int_to_fixed (source_image_width))
+        {
+            if (vx >= pixman_int_to_fixed (source_image_width))
+                *right_pad += 1;
+            else
+                *right_tz += 1;
+        }
+        else
+        {
+            *width += 1;
+        }
+        vx += unit_x;
+    }
+}
+
+int
+main (void)
+{
+    int i;
+    for (i = 0; i < 1; i++)
+    {
+        int32_t left_pad1, left_tz1, width1, right_tz1, right_pad1;
+        int32_t left_pad2, left_tz2, width2, right_tz2, right_pad2;
+        pixman_fixed_t vx = lcg_rand_N(1 << 16) - (3000 << 16);
+        int32_t width = lcg_rand_N(1);
+        int32_t source_image_width = lcg_rand_N(1) + 1;
+        pixman_fixed_t unit_x = lcg_rand_N(10 << 16) + 1;
+        width1 = width2 = width;
+
+        bilinear_pad_repeat_get_scanline_bounds_ref (source_image_width,
+                                                     vx,
+                                                     unit_x,
+                                                     &left_pad1,
+                                                     &left_tz1,
+                                                     &width1,
+                                                     &right_tz1,
+                                                     &right_pad1);
+
+        bilinear_pad_repeat_get_scanline_bounds (source_image_width,
+                                                 vx,
+                                                 unit_x,
+                                                 &left_pad2,
+                                                 &left_tz2,
+                                                 &width2,
+                                                 &right_tz2,
+                                                 &right_pad2);
+
+        assert (left_pad1 == left_pad2);
+        assert (left_tz1 == left_tz2);
+        assert (width1 == width2);
+        assert (right_tz1 == right_tz2);
+        assert (right_pad1 == right_pad2);
+    }
+
+    return 0;
+}
--
1.7.3.4
[Pixman] [PATCH 7/7] ARM: NEON optimization for bilinear scaled 'src_8888_8888'
From: Siarhei Siamashka

Initial NEON optimization for bilinear scaling. Can probably be improved
further.

Benchmark on ARM Cortex-A8:
 Microbenchmark (scaling 2000x2000 image with scale factor close to 1x):
  before: op=1, src=2002, dst=2002, speed=10.72 MPix/s
  after:  op=1, src=2002, dst=2002, speed=44.27 MPix/s
---
 pixman/pixman-arm-neon-asm.S |  197 ++
 pixman/pixman-arm-neon.c     |   45 ++
 2 files changed, 242 insertions(+), 0 deletions(-)

diff --git a/pixman/pixman-arm-neon-asm.S b/pixman/pixman-arm-neon-asm.S
index 47daf45..c168e10 100644
--- a/pixman/pixman-arm-neon-asm.S
+++ b/pixman/pixman-arm-neon-asm.S
@@ -2391,3 +2391,200 @@ generate_composite_function_nearest_scanline \
     10, /* dst_r_basereg */ \
     8,  /* src_basereg   */ \
     15  /* mask_basereg  */
+
+/**/
+
+/* Supplementary macro for setting function attributes */
+.macro pixman_asm_function fname
+    .func fname
+    .global fname
+#ifdef __ELF__
+    .hidden fname
+    .type fname, %function
+#endif
+fname:
+.endm
+
+.macro bilinear_interpolate_last_pixel
+    mov       TMP1, X, asr #16
+    mov       TMP2, X, asr #16
+    add       TMP1, TOP, TMP1, asl #2
+    add       TMP2, BOTTOM, TMP2, asl #2
+    vld1.32   {d0}, [TMP1]
+    vshr.u16  d30, d24, #8
+    vld1.32   {d1}, [TMP2]
+    vmull.u8  q1, d0, d28
+    vmlal.u8  q1, d1, d29
+    /* 5 cycles bubble */
+    vshll.u16 q0, d2, #8
+    vmlsl.u16 q0, d2, d30
+    vmlal.u16 q0, d3, d30
+    /* 5 cycles bubble */
+    vshrn.u32 d0, q0, #16
+    /* 3 cycles bubble */
+    vmovn.u16 d0, q0
+    /* 1 cycle bubble */
+    vst1.32   {d0[0]}, [OUT, :32]!
+.endm
+
+.macro bilinear_interpolate_two_pixels
+    mov       TMP1, X, asr #16
+    mov       TMP2, X, asr #16
+    add       X, X, UX
+    add       TMP1, TOP, TMP1, asl #2
+    add       TMP2, BOTTOM, TMP2, asl #2
+    vld1.32   {d0}, [TMP1]
+    vld1.32   {d1}, [TMP2]
+    vmull.u8  q1, d0, d28
+    vmlal.u8  q1, d1, d29
+    mov       TMP1, X, asr #16
+    mov       TMP2, X, asr #16
+    add       X, X, UX
+    add       TMP1, TOP, TMP1, asl #2
+    add       TMP2, BOTTOM, TMP2, asl #2
+    vld1.32   {d20}, [TMP1]
+    vld1.32   {d21}, [TMP2]
+    vmull.u8  q11, d20, d28
+    vmlal.u8  q11, d21, d29
+    vshr.u16  q15, q12, #8
+    vadd.u16  q12, q12, q13
+    vshll.u16 q0, d2, #8
+    vmlsl.u16 q0, d2, d30
+    vmlal.u16 q0, d3, d30
+    vshll.u16 q10, d22, #8
+    vmlsl.u16 q10, d22, d31
+    vmlal.u16 q10, d23, d31
+    vshrn.u32 d30, q0, #16
+    vshrn.u32 d31, q10, #16
+    vmovn.u16 d0, q15
+    vst1.32   {d0}, [OUT]!
+.endm
+
+.macro bilinear_interpolate_four_pixels
+    mov       TMP1, X, asr #16
+    mov       TMP2, X, asr #16
+    add       X, X, UX
+    add       TMP1, TOP, TMP1, asl #2
+    add       TMP2, BOTTOM, TMP2, asl #2
+    vld1.32   {d0}, [TMP1]
+    vld1.32   {d1}, [TMP2]
+    vmull.u8  q1, d0, d28
+    vmlal.u8  q1, d1, d29
+    mov       TMP1, X, asr #16
+    mov       TMP2, X, asr #16
+    add       X, X, UX
+    add       TMP1, TOP, TMP1, asl #2
+    add       TMP2, BOTTOM, TMP2, asl #2
+    vld1.32   {d20}, [TMP1]
+    vld1.32   {d21}, [TMP2]
+    vmull.u8  q11, d20, d28
+    vmlal.u8  q11, d21, d29
+    vshr.u16  q15, q12, #8
+    vadd.u16  q12, q12, q13
+    vshll.u16 q0, d2, #8
+    vmlsl.u16 q0, d2, d30
+    vmlal.u16 q0, d3, d30
+    vshll.u16 q10, d22, #8
+    vmlsl.u16 q10, d22, d31
+    vmlal.u16 q10, d23, d31
+    mov       TMP1, X, asr #16
+    mov       TMP2, X, asr #16
+    add       X, X, UX
+    add       TMP1, TOP, TMP1, asl #2
+    add       TMP2, BOTTOM, TMP2, asl #2
+    vld1.32   {d4}, [TMP1]
+    vld1.32   {d5}, [TMP2]
+    vmull.u8  q3, d4, d28
+    vmlal.u8  q3, d5, d29
+    mov       TMP1, X, asr #16
+    mov       TMP2, X, asr #16
+    add       X, X, UX
+    add       TMP1, TOP, TMP1, asl #2
+    add       TMP2, BOTTOM, TMP2, asl #2
+    vld1.32   {d16}, [TMP1]
+    vld1.32   {d17}, [TMP2]
+    vmull.u8  q9, d16, d28
+    vmlal.u8  q9, d17, d29
+    vshr.u16  q15, q12, #8
+    vadd.u16  q12, q12, q13
+    vshll.u16 q2, d6, #8
+    vmlsl.u16 q2, d6, d30
+    vmlal.u16 q2, d7, d30
+    vshll.u16 q8, d18, #8
+    vmlsl.u16 q8, d18, d31
+    vmlal.u16 q8, d19, d31
+    vshrn.u32 d0, q0, #16
+    vshrn.u32 d1, q10, #16
+    vshrn.u32 d4, q2, #16
+    vshrn.u32 d5, q8, #16
+    vmovn.u16 d0, q0
+    vmovn.u16 d1, q2
+    vst1.32   {d0, d1}, [OUT]!
+.endm
+
+
+/*
+ * pixman_scaled_bilinear_scanline___SRC (uint32_t *       out,
+ *                                        const uint32_t * top,
+ *                                        const uint32_t * bottom,
+ *                                        int              wt,
+ *                                        int              wb,
+ *                                        pixman_fixed_t   x,
+ *
[Pixman] [PATCH 6/7] SSE2 optimization for bilinear scaled 'src_8888_8888'
From: Siarhei Siamashka

A primitive naive implementation of bilinear scaling using SSE2 intrinsics,
which only handles one pixel at a time. It is approximately 2x faster than
the C variant (loop unrolling contributes ~20% of this speedup).

Benchmark on Intel Core i7:
 Using cairo-perf-trace:
  before: image firefox-planet-gnome 12.019 12.054 0.15% 5/6
  after:  image firefox-planet-gnome 10.961 11.013 0.19% 5/6
 Microbenchmark (scaling 2000x2000 image with scale factor close to 1x):
  before: op=1, src=2002, dst=2002, speed=82.61 MPix/s
  after:  op=1, src=2002, dst=2002, speed=165.38 MPix/s
---
 pixman/pixman-sse2.c |  112 ++
 1 files changed, 112 insertions(+), 0 deletions(-)

diff --git a/pixman/pixman-sse2.c b/pixman/pixman-sse2.c
index 88287b4..696005f 100644
--- a/pixman/pixman-sse2.c
+++ b/pixman/pixman-sse2.c
@@ -5567,6 +5567,114 @@ FAST_NEAREST_MAINLOOP_COMMON (sse2__n__none_OVER,
                              scaled_nearest_scanline_sse2__n__OVER,
                              uint32_t, uint32_t, uint32_t, NONE, TRUE, TRUE)
 
+static void
+bilinear_interpolate_line_sse2 (uint32_t *       out,
+                                const uint32_t * top,
+                                const uint32_t * bottom,
+                                int              wt,
+                                int              wb,
+                                pixman_fixed_t   x,
+                                pixman_fixed_t   ux,
+                                int              width)
+{
+    const __m128i xmm_wt = _mm_set_epi16 (wt, wt, wt, wt, wt, wt, wt, wt);
+    const __m128i xmm_wb = _mm_set_epi16 (wb, wb, wb, wb, wb, wb, wb, wb);
+    const __m128i xmm_xorc = _mm_set_epi16 (0, 0, 0, 0, 0xff, 0xff, 0xff, 0xff);
+    const __m128i xmm_addc = _mm_set_epi16 (0, 0, 0, 0, 1, 1, 1, 1);
+    const __m128i xmm_ux = _mm_set_epi16 (ux, ux, ux, ux, ux, ux, ux, ux);
+    const __m128i xmm_zero = _mm_setzero_si128 ();
+    __m128i xmm_x = _mm_set_epi16 (x, x, x, x, x, x, x, x);
+    uint32_t pix1, pix2, pix3, pix4;
+
+#define INTERPOLATE_ONE_PIXEL(pix) \
+    do { \
+        __m128i xmm_wh, xmm_lo, xmm_hi, a; \
+        /* fetch 2x2 pixel block into sse2 register */ \
+        uint32_t tl = top [pixman_fixed_to_int (x)]; \
+        uint32_t tr = top [pixman_fixed_to_int (x) + 1]; \
+        uint32_t bl = bottom [pixman_fixed_to_int (x)]; \
+        uint32_t br = bottom [pixman_fixed_to_int (x) + 1]; \
+        a = _mm_set_epi32 (tr, tl, br, bl); \
+        x += ux; \
+        /* vertical interpolation */ \
+        a = _mm_add_epi16 (_mm_mullo_epi16 (_mm_unpackhi_epi8 (a, xmm_zero), \
+                                            xmm_wt), \
+                           _mm_mullo_epi16 (_mm_unpacklo_epi8 (a, xmm_zero), \
+                                            xmm_wb)); \
+        /* calculate horizontal weights */ \
+        xmm_wh = _mm_add_epi16 (xmm_addc, \
+                                _mm_xor_si128 (xmm_xorc, \
+                                               _mm_srli_epi16 (xmm_x, 8))); \
+        xmm_x = _mm_add_epi16 (xmm_x, xmm_ux); \
+        /* horizontal interpolation */ \
+        xmm_lo = _mm_mullo_epi16 (a, xmm_wh); \
+        xmm_hi = _mm_mulhi_epu16 (a, xmm_wh); \
+        a = _mm_add_epi32 (_mm_unpacklo_epi16 (xmm_lo, xmm_hi), \
+                           _mm_unpackhi_epi16 (xmm_lo, xmm_hi)); \
+        /* shift and pack the result */ \
+        a = _mm_srli_epi32 (a, 16); \
+        a = _mm_packs_epi32 (a, a); \
+        a = _mm_packus_epi16 (a, a); \
+        pix = _mm_cvtsi128_si32 (a); \
+    } while (0)
+
+    while ((width -= 4) >= 0)
+    {
+        INTERPOLATE_ONE_PIXEL (pix1);
+        INTERPOLATE_ONE_PIXEL (pix2);
+        INTERPOLATE_ONE_PIXEL (pix3);
+        INTERPOLATE_ONE_PIXEL (pix4);
+        *out++ = pix1;
+        *out++ = pix2;
+        *out++ = pix3;
+        *out++ = pix4;
+    }
+    if (width & 2)
+    {
+        INTERPOL
[Pixman] [PATCH 5/7] C variant of bilinear scaled 'src_8888_n_8888' fast path
From: Siarhei Siamashka

Serves no real practical purpose other than testing solid mask support
in the bilinear scaling main loop template.
---
 pixman/pixman-fast-path.c |   80 +
 1 files changed, 80 insertions(+), 0 deletions(-)

diff --git a/pixman/pixman-fast-path.c b/pixman/pixman-fast-path.c
index a2125c0..fdaad64 100644
--- a/pixman/pixman-fast-path.c
+++ b/pixman/pixman-fast-path.c
@@ -1670,6 +1670,82 @@ FAST_BILINEAR_MAINLOOP_COMMON (_8__none_SRC,
                               uint32_t, uint8_t, uint32_t,
                               NONE, TRUE, FALSE)
 
+static void
+bilinear_interpolate_s_line (uint32_t *       dst,
+                             const uint32_t * mask,
+                             const uint32_t * top_row,
+                             const uint32_t * bottom_row,
+                             int              wt,
+                             int              wb,
+                             pixman_fixed_t   x,
+                             pixman_fixed_t   ux,
+                             int              width)
+{
+    uint8_t m = *mask >> 24;
+    while (--width >= 0)
+    {
+        if (m)
+        {
+            uint32_t s;
+            uint32_t tl, tr, bl, br;
+            int distx;
+
+            tl = top_row [pixman_fixed_to_int (x)];
+            tr = top_row [pixman_fixed_to_int (x) + 1];
+            bl = bottom_row [pixman_fixed_to_int (x)];
+            br = bottom_row [pixman_fixed_to_int (x) + 1];
+
+            distx = (x >> 8) & 0xff;
+
+            s = bilinear_interpolation (tl, tr, bl, br, distx, wt, wb);
+            if (m == 0xff)
+            {
+                *dst = s;
+            }
+            else
+            {
+                *dst = in (s, m);
+            }
+        }
+        else
+        {
+            *dst = 0;
+        }
+        x += ux;
+        dst++;
+    }
+}
+
+static force_inline void
+scaled_bilinear_scanline__n__SRC (uint32_t *       dst,
+                                  const uint32_t * mask,
+                                  const uint32_t * src_top,
+                                  const uint32_t * src_bottom,
+                                  int32_t          w,
+                                  int              wt,
+                                  int              wb,
+                                  pixman_fixed_t   vx,
+                                  pixman_fixed_t   unit_x,
+                                  pixman_fixed_t   max_vx,
+                                  pixman_bool_t    zero_src)
+{
+    bilinear_interpolate_s_line (dst, mask, src_top, src_bottom,
+                                 wt, wb, vx, unit_x, w);
+}
+
+FAST_BILINEAR_MAINLOOP_COMMON (_n__cover_SRC,
+                               scaled_bilinear_scanline__n__SRC,
+                               uint32_t, uint32_t, uint32_t,
+                               COVER, TRUE, TRUE)
+FAST_BILINEAR_MAINLOOP_COMMON (_n__pad_SRC,
+                               scaled_bilinear_scanline__n__SRC,
+                               uint32_t, uint32_t, uint32_t,
+                               PAD, TRUE, TRUE)
+FAST_BILINEAR_MAINLOOP_COMMON (_n__none_SRC,
+                               scaled_bilinear_scanline__n__SRC,
+                               uint32_t, uint32_t, uint32_t,
+                               NONE, TRUE, TRUE)
+
 static force_inline uint32_t
 fetch_nearest (pixman_repeat_t src_repeat,
                pixman_format_code_t format,
@@ -2197,6 +2273,10 @@ static const pixman_fast_path_t c_fast_paths[] =
     SIMPLE_BILINEAR_A8_MASK_FAST_PATH (SRC, a8r8g8b8, x8r8g8b8, _8_),
     SIMPLE_BILINEAR_A8_MASK_FAST_PATH (SRC, x8r8g8b8, x8r8g8b8, _8_),
 
+    SIMPLE_BILINEAR_SOLID_MASK_FAST_PATH (SRC, a8r8g8b8, a8r8g8b8, _n_),
+    SIMPLE_BILINEAR_SOLID_MASK_FAST_PATH (SRC, a8r8g8b8, x8r8g8b8, _n_),
+    SIMPLE_BILINEAR_SOLID_MASK_FAST_PATH (SRC, x8r8g8b8, x8r8g8b8, _n_),
+
 #define NEAREST_FAST_PATH(op,s,d) \
     { PIXMAN_OP_ ## op, \
       PIXMAN_ ## s, SCALED_NEAREST_FLAGS, \
--
1.7.3.4
[Pixman] [PATCH 4/7] C variant of bilinear scaled 'src_8888_8_8888' fast path
From: Siarhei Siamashka

Serves no real practical purpose other than testing a8 mask support
in the bilinear scaling main loop template.
---
 pixman/pixman-fast-path.c |   80 +
 1 files changed, 80 insertions(+), 0 deletions(-)

diff --git a/pixman/pixman-fast-path.c b/pixman/pixman-fast-path.c
index 1e3094e..a2125c0 100644
--- a/pixman/pixman-fast-path.c
+++ b/pixman/pixman-fast-path.c
@@ -1594,6 +1594,82 @@ FAST_BILINEAR_MAINLOOP_COMMON (__none_SRC,
                               uint32_t, uint32_t, uint32_t,
                               NONE, FALSE, FALSE)
 
+static void
+bilinear_interpolate_a8_line (uint32_t *       dst,
+                              const uint8_t *  mask,
+                              const uint32_t * top_row,
+                              const uint32_t * bottom_row,
+                              int              wt,
+                              int              wb,
+                              pixman_fixed_t   x,
+                              pixman_fixed_t   ux,
+                              int              width)
+{
+    while (--width >= 0)
+    {
+        uint8_t m = *mask++;
+        if (m)
+        {
+            uint32_t s;
+            uint32_t tl, tr, bl, br;
+            int distx;
+
+            tl = top_row [pixman_fixed_to_int (x)];
+            tr = top_row [pixman_fixed_to_int (x) + 1];
+            bl = bottom_row [pixman_fixed_to_int (x)];
+            br = bottom_row [pixman_fixed_to_int (x) + 1];
+
+            distx = (x >> 8) & 0xff;
+
+            s = bilinear_interpolation (tl, tr, bl, br, distx, wt, wb);
+            if (m == 0xff)
+            {
+                *dst = s;
+            }
+            else
+            {
+                *dst = in (s, m);
+            }
+        }
+        else
+        {
+            *dst = 0;
+        }
+        x += ux;
+        dst++;
+    }
+}
+
+static force_inline void
+scaled_bilinear_scanline__8__SRC (uint32_t *       dst,
+                                  const uint8_t *  mask,
+                                  const uint32_t * src_top,
+                                  const uint32_t * src_bottom,
+                                  int32_t          w,
+                                  int              wt,
+                                  int              wb,
+                                  pixman_fixed_t   vx,
+                                  pixman_fixed_t   unit_x,
+                                  pixman_fixed_t   max_vx,
+                                  pixman_bool_t    zero_src)
+{
+    bilinear_interpolate_a8_line (dst, mask, src_top, src_bottom,
+                                  wt, wb, vx, unit_x, w);
+}
+
+FAST_BILINEAR_MAINLOOP_COMMON (_8__cover_SRC,
+                               scaled_bilinear_scanline__8__SRC,
+                               uint32_t, uint8_t, uint32_t,
+                               COVER, TRUE, FALSE)
+FAST_BILINEAR_MAINLOOP_COMMON (_8__pad_SRC,
+                               scaled_bilinear_scanline__8__SRC,
+                               uint32_t, uint8_t, uint32_t,
+                               PAD, TRUE, FALSE)
+FAST_BILINEAR_MAINLOOP_COMMON (_8__none_SRC,
+                               scaled_bilinear_scanline__8__SRC,
+                               uint32_t, uint8_t, uint32_t,
+                               NONE, TRUE, FALSE)
+
 static force_inline uint32_t
 fetch_nearest (pixman_repeat_t src_repeat,
                pixman_format_code_t format,
@@ -2117,6 +2193,10 @@ static const pixman_fast_path_t c_fast_paths[] =
     SIMPLE_BILINEAR_FAST_PATH (SRC, x8r8g8b8, x8r8g8b8, _),
     SIMPLE_BILINEAR_FAST_PATH (SRC, x8b8g8r8, x8b8g8r8, _),
 
+    SIMPLE_BILINEAR_A8_MASK_FAST_PATH (SRC, a8r8g8b8, a8r8g8b8, _8_),
+    SIMPLE_BILINEAR_A8_MASK_FAST_PATH (SRC, a8r8g8b8, x8r8g8b8, _8_),
+    SIMPLE_BILINEAR_A8_MASK_FAST_PATH (SRC, x8r8g8b8, x8r8g8b8, _8_),
+
 #define NEAREST_FAST_PATH(op,s,d) \
     { PIXMAN_OP_ ## op, \
       PIXMAN_ ## s, SCALED_NEAREST_FLAGS, \
--
1.7.3.4
[Pixman] [PATCH 3/7] C variant of bilinear scaled 'src_8888_8888' fast path
From: Siarhei Siamashka

Because scaling is done in a single pass without temporary buffers, it is a
bit faster than the general path on x86 (and provides an even better
speedup on MIPS and ARM).

Benchmark on Intel Core i7:
 Using cairo-perf-trace:
  before: image firefox-planet-gnome 12.566 12.610 0.23% 6/6
  after:  image firefox-planet-gnome 12.019 12.054 0.15% 5/6
 Microbenchmark (scaling 2000x2000 image with scale factor close to 1x):
  before: op=1, src=2002, dst=2002, speed=70.48 MPix/s
  after:  op=1, src=2002, dst=2002, speed=82.61 MPix/s

Benchmark on ARM Cortex-A8:
 Microbenchmark (scaling 2000x2000 image with scale factor close to 1x):
  before: op=1, src=2002, dst=2002, speed=6.70 MPix/s
  after:  op=1, src=2002, dst=2002, speed=10.72 MPix/s

Benchmark on MIPS 24K:
 Microbenchmark (scaling 2000x2000 image with scale factor close to 1x):
  before: op=1, src=2002, dst=2002, speed=5.12 MPix/s
  after:  op=1, src=2002, dst=2002, speed=6.96 MPix/s
 Microbenchmark (scaling 500x500 image with scale factor close to 1x):
  before: op=1, src=2002, dst=2002, speed=5.26 MPix/s
  after:  op=1, src=2002, dst=2002, speed=7.00 MPix/s
---
 pixman/pixman-fast-path.c |  144 +
 1 files changed, 144 insertions(+), 0 deletions(-)

diff --git a/pixman/pixman-fast-path.c b/pixman/pixman-fast-path.c
index 92f0308..1e3094e 100644
--- a/pixman/pixman-fast-path.c
+++ b/pixman/pixman-fast-path.c
@@ -1458,6 +1458,143 @@
 FAST_NEAREST_MAINLOOP (565_565_pad_SRC, uint16_t, uint16_t, PAD)
 
 static force_inline uint32_t
+bilinear_interpolation (uint32_t tl, uint32_t tr,
+                        uint32_t bl, uint32_t br,
+                        int distx, int wt, int wb)
+{
+#if SIZEOF_LONG > 4
+    uint64_t distxy, distxiy, distixy, distixiy;
+    uint64_t tl64, tr64, bl64, br64;
+    uint64_t f, r;
+
+    distxy = distx * wb;
+    distxiy = distx * wt;
+    distixy = wb * (256 - distx);
+    distixiy = (256 - distx) * wt;
+
+    /* Alpha and Blue */
+    tl64 = tl & 0xffff;
+    tr64 = tr & 0xffff;
+    bl64 = bl & 0xffff;
+    br64 = br & 0xffff;
+
+    f = tl64 * distixiy + tr64 * distxiy + bl64 * distixy + br64 * distxy;
+    r = f & 0xffffull;
+
+    /* Red and Green */
+    tl64 = tl;
+    tl64 = ((tl64 << 16) & 0x00ffull) | (tl64 & 0xff00ull);
+
+    tr64 = tr;
+    tr64 = ((tr64 << 16) & 0x00ffull) | (tr64 & 0xff00ull);
+
+    bl64 = bl;
+    bl64 = ((bl64 << 16) & 0x00ffull) | (bl64 & 0xff00ull);
+
+    br64 = br;
+    br64 = ((br64 << 16) & 0x00ffull) | (br64 & 0xff00ull);
+
+    f = tl64 * distixiy + tr64 * distxiy + bl64 * distixy + br64 * distxy;
+    r |= ((f >> 16) & 0x00ffull) | (f & 0xff00ull);
+
+    return (uint32_t)(r >> 16);
+#else
+    int distxy, distxiy, distixy, distixiy;
+    uint32_t f, r;
+
+    distxy = distx * wb;
+    distxiy = distx * wt;
+    distixy = wb * (256 - distx);
+    distixiy = (256 - distx) * wt;
+
+    /* Blue */
+    r = (tl & 0x00ff) * distixiy + (tr & 0x00ff) * distxiy
+      + (bl & 0x00ff) * distixy + (br & 0x00ff) * distxy;
+
+    /* Green */
+    f = (tl & 0xff00) * distixiy + (tr & 0xff00) * distxiy
+      + (bl & 0xff00) * distixy + (br & 0xff00) * distxy;
+    r |= f & 0xff00;
+
+    tl >>= 16;
+    tr >>= 16;
+    bl >>= 16;
+    br >>= 16;
+    r >>= 16;
+
+    /* Red */
+    f = (tl & 0x00ff) * distixiy + (tr & 0x00ff) * distxiy
+      + (bl & 0x00ff) * distixy + (br & 0x00ff) * distxy;
+    r |= f & 0x00ff;
+
+    /* Alpha */
+    f = (tl & 0xff00) * distixiy + (tr & 0xff00) * distxiy
+      + (bl & 0xff00) * distixy + (br & 0xff00) * distxy;
+    r |= f & 0xff00;
+
+    return r;
+#endif
+}
+
+static void
+bilinear_interpolate_line (uint32_t *       buffer,
+                           const uint32_t * top_row,
+                           const uint32_t * bottom_row,
+                           int              wt,
+                           int              wb,
+                           pixman_fixed_t   x,
+                           pixman_fixed_t   ux,
+                           int              width)
+{
+    while (--width >= 0)
+    {
+        uint32_t tl, tr, bl, br;
+        int distx;
+
+        tl = top_row [pixman_fixed_to_int (x)];
+        tr = top_row [pixman_fixed_to_int (x) + 1];
+        bl = bottom_row [pixman_fixed_to_int (x)];
+        br = bottom_row [pixman_fixed_to_int (x) + 1];
+
+        distx = (x >> 8) & 0xff;
+
+        *buffer++ = bilinear_interpolation (tl, tr, bl, br, distx, wt, wb);
+
+        x += ux;
+    }
+}
+
+static force_inline void
+scaled_bilinear_scanline___SRC (uint32_t * dst,
+                                const uint32_t * mask,
+                                const
[Pixman] [PATCH 1/7] Main loop template for fast single pass bilinear scaling
From: Siarhei Siamashka

Can be used for implementing SIMD optimized fast path functions which work
with bilinear scaled source images. Similar to the template for the nearest
scaling main loop, the following types of mask are supported:
 1. no mask
 2. non-scaled a8 mask with the SAMPLES_COVER_CLIP flag
 3. solid mask

PAD repeat is fully supported. NONE repeat is partially supported (right
now it only works if the source image has an alpha channel, or when the
alpha channel of the source image does not have any effect on the
compositing operation).
---
 pixman/pixman-fast-path.h | 432 +++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 432 insertions(+), 0 deletions(-)

diff --git a/pixman/pixman-fast-path.h b/pixman/pixman-fast-path.h
index d081222..1885d47 100644
--- a/pixman/pixman-fast-path.h
+++ b/pixman/pixman-fast-path.h
@@ -587,4 +587,436 @@ fast_composite_scaled_nearest ## scale_func_name (pixman_implementation_t *imp,
     SIMPLE_NEAREST_SOLID_MASK_FAST_PATH_NONE (op,s,d,func),    \
     SIMPLE_NEAREST_SOLID_MASK_FAST_PATH_PAD (op,s,d,func)
 
+/*****************************************************************/
+
+/*
+ * Identify 5 zones in each scanline for bilinear scaling, depending on
+ * whether the 2 pixels to be interpolated are fetched from the image
+ * itself, from the padding area around it, or from both.
+ */
+static force_inline void
+bilinear_pad_repeat_get_scanline_bounds (int32_t        source_image_width,
+                                         pixman_fixed_t vx,
+                                         pixman_fixed_t unit_x,
+                                         int32_t *      left_pad,
+                                         int32_t *      left_tz,
+                                         int32_t *      width,
+                                         int32_t *      right_tz,
+                                         int32_t *      right_pad)
+{
+    int width1 = *width, left_pad1, right_pad1;
+    int width2 = *width, left_pad2, right_pad2;
+
+    pad_repeat_get_scanline_bounds (source_image_width, vx, unit_x,
+                                    &width1, &left_pad1, &right_pad1);
+    pad_repeat_get_scanline_bounds (source_image_width, vx + pixman_fixed_1,
+                                    unit_x, &width2, &left_pad2, &right_pad2);
+
+    *left_pad = left_pad2;
+    *left_tz = left_pad1 - left_pad2;
+    *right_tz = right_pad2 - right_pad1;
+    *right_pad = right_pad1;
+    *width -= *left_pad + *left_tz + *right_tz + *right_pad;
+}
+
+/*
+ * Main loop template for single pass bilinear scaling. It needs to be
+ * provided with 'scanline_func' which should do the compositing operation.
+ * The needed function has the following prototype:
+ *
+ *     scanline_func (dst_type_t *       dst,
+ *                    const mask_type_t * mask,
+ *                    const src_type_t * src_top,
+ *                    const src_type_t * src_bottom,
+ *                    int32_t            width,
+ *                    int                weight_top,
+ *                    int                weight_bottom,
+ *                    pixman_fixed_t     vx,
+ *                    pixman_fixed_t     unit_x,
+ *                    pixman_fixed_t     max_vx,
+ *                    pixman_bool_t      zero_src)
+ *
+ * Where:
+ *  dst           - destination scanline buffer for storing results
+ *  mask          - mask buffer (or single value for solid mask)
+ *  src_top, src_bottom - two source scanlines
+ *  width         - number of pixels to process
+ *  weight_top    - weight of the top row for interpolation
+ *  weight_bottom - weight of the bottom row for interpolation
+ *  vx            - initial position for fetching the first pair of
+ *                  pixels from the source buffer
+ *  unit_x        - position increment needed to move to the next pair
+ *                  of pixels
+ *  max_vx        - image size as a fixed point value, can be used for
+ *                  implementing NORMAL repeat (when it is supported)
+ *  zero_src      - boolean hint variable, which is set to TRUE when
+ *                  all source pixels are fetched from the zero padding
+ *                  zone for NONE repeat
+ *
+ * Note: normally the sum of 'weight_top' and 'weight_bottom' is equal to
+ *       256, but sometimes it may be less than that for NONE repeat when
+ *       handling fuzzy antialiased top or bottom image edges. Also both
+ *       top and bottom weight variables are guaranteed to have a value in
+ *       the 0-255 range, so they can fit into an unsigned byte and be used
+ *       with 8-bit SIMD multiplication instructions.
+ */
+#define FAST_BILINEAR_MAINLOOP_INT(scale_func_name, scanline_func, src_type_t, mask_type_t, \
+                                   dst_type_t, repeat_mode, have_mask, mask_is_solid)       \
+static void
[Pixman] [PATCH 0/7] SIMD optimizations for bilinear scaling
From: Siarhei Siamashka

This patch series introduces support for creating specialized bilinear fast
path functions which perform processing in a single pass without intermediate
temporary buffers and also can make efficient use of SIMD optimizations. The
performance critical code is implemented as scanline processing functions,
with the main loop logic being reused via a common macro template. Such
scanline processing functions are simple enough to implement, and at the
same time large enough not to constrain optimization opportunities and
possibilities to do loop unrolling for processing multiple pixels per
iteration.

As a result, the bilinear scaled 'src_8888_8888' operation (simple scaled
copy of the image) becomes more than 2 times faster with SSE2 and more than
6 times faster with ARM NEON when compared to the general pixman compositing
path. Single pass processing alone provides a modest, but measurable,
speedup even without SIMD.

I'm mostly interested in ARM NEON and did not spend any extra time on tuning
this SSE2 code, so the SSE2 scaler may actually not be good enough yet.
Nevertheless it is still faster than C.

The disadvantage of this method is the high specialization, so that each
particular type of compositing operation needs its own fast path code. But
it does not prevent us from also adding universal SIMD optimized fetchers
later. Anyway, adding specialized fast paths is the way to go when targeting
best performance for some of the most common operations. I'll try to add
more SIMD optimized bilinear fast path functions shortly, based on analyzing
cairo-traces and profiling real use cases.
The same patches are also available in the following branch:
http://cgit.freedesktop.org/~siamashka/pixman/log/?h=sent/bilinear-scaling-simd-20110222

Siarhei Siamashka (7):
  Main loop template for fast single pass bilinear scaling
  test: check correctness of 'bilinear_pad_repeat_get_scanline_bounds'
  C variant of bilinear scaled 'src_8888_8888' fast path
  C variant of bilinear scaled 'src_8888_8_8888' fast path
  C variant of bilinear scaled 'src_8888_n_8888' fast path
  SSE2 optimization for bilinear scaled 'src_8888_8888'
  ARM: NEON optimization for bilinear scaled 'src_8888_8888'

 pixman/pixman-arm-neon-asm.S | 197 ++++++++++++++
 pixman/pixman-arm-neon.c     |  45 ++++
 pixman/pixman-fast-path.c    | 304 ++++++++++++++++
 pixman/pixman-fast-path.h    | 432 +++++++++++++++++++++++++
 pixman/pixman-sse2.c         | 112 ++++++++
 test/Makefile.am             |   2 +
 test/scaling-helpers-test.c  |  93 ++++++
 7 files changed, 1185 insertions(+), 0 deletions(-)
 create mode 100644 test/scaling-helpers-test.c

-- 
1.7.3.4
___
Pixman mailing list
Pixman@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/pixman
Re: [Pixman] Image scaling with bilinear interpolation performance
김태균 writes:

> original code : r = a*t0 + b*t1 + c*t2 + d*t3 (in 24 bits precision)
> optimized code : r' = a*(t0 >> 8) + b*(t1 >> 8) + c*(t2 >> 8) + d*(t3 >> 8)
> (in 16 bits precision)
> where t0 + t1 + t2 + t3 = 0x10000
>
> Now we split "t" into two terms u, v where u is upper 8 bits of t and v is
> lower 8 bits of t. (note that t0 = u0*256 + v0, t0 >> 8 = u0)
>
> So,
>
> r' = a*u0 + b*u1 + c*u2 + d*u3
>
> r = a*(u0*256 + v0) + b*(u1*256 + v1) + c*(u2*256 + v2) + d*(u3*256 + v3)
>   = 256*(a*u0 + b*u1 + c*u2 + d*u3) + a*v0 + b*v1 + c*v2 + d*v3
>   = 256*r' + a*v0 + b*v1 + c*v2 + d*v3
>
> Error would be
> e = (r - (r' << 8)) >> 16 = (r - 256*r') >> 16
>   = (a*v0 + b*v1 + c*v2 + d*v3) >> 16
>
> Each value a, b, c and d can be 0xff at most, so
>
> max(e) = (0xff*(v0 + v1 + v2 + v3)) >> 16
>        = (0xff*max(v0 + v1 + v2 + v3)) >> 16
>
> max(v0 + v1 + v2 + v3) = 0x300 (because the lower 8 bits of
> t0 + t1 + t2 + t3 should be 0x00)
>
> So max(e) = (0xff*0x300) >> 16 = 2
>
> But this does not satisfy rule 5 as you mentioned

Thanks for doing this analysis. A difference of just 2 would be fine in my
opinion, and as you mention the original code was an approximation as well.
It would be possible to satisfy rule 5 using a kind of error diffusion, as
demonstrated by this program:

static void
compute_weights (uint8_t distx, uint8_t disty)
{
    uint32_t distxy, distxiy, distixy, distixiy;
    int e, t;

    distxy = distx * disty;
    distxiy = (distx << 8) - distxy;
    distixy = (disty << 8) - distxy;
    distixiy = 256 * 256 - (disty << 8) - (distx << 8) + distxy;

    t = distxy + 0x80;
    e = (t & 0xff00) - distxy;
    distxy = t >> 8;
    distxiy -= e;

    t = distxiy + 0x80;
    e = (t & 0xff00) - distxiy;
    distxiy = t >> 8;
    distixy -= e;

    t = distixy + 0x80;
    e = (t & 0xff00) - distixy;
    distixy = t >> 8;
    distixiy -= e;

    t = distixiy + 0x80;
    e = (t & 0xff00) - distixiy;
    distixiy = t >> 8;

    assert (distxy + distxiy + distixy + distixiy == 256);
}

int main ()
{
    int i, j;

    for (i = 0; i < 256; ++i)
    {
        for (j = 0; j < 256; ++j)
            compute_weights (i, j);
    }
}

although that does do a bit more arithmetic than your code.

> > Now regarding accuracy. I have added some comments above regarding the
> > potential solid color issue, but this should be relatively easy to
> > address. I'm also a bit worried about one more thing (in the original
> > pixman code too, but let's cover this too while we are discussing
> > accuracy in general). Wouldn't it be a good idea to do the shift with
> > rounding for the final value instead of dropping the fractional part?
> > And the 'distx'/'disty' variables are also obtained by right shifting
> > 'ux' by 8 and dropping the fractional part; maybe rounding would be
> > more appropriate. Not doing rounding might cause slight image drift to
> > the left (and top) on repeated rescaling, and also a slight reduction
> > of average brightness.
>
> I agree that rounding is more appropriate.
> I think supplying distx and disty as properly rounded 4 bit values to
> the interpolation function is the best choice we have.
>
> Analysis on error is somewhat complicated in this case.
> Error may be bigger than in the previous code, at least 15 (I've done
> some brute force jobs)

Rounding to four bits is going to be a quite visible drop in quality though,
especially if you zoom more than 16x. With four bits of precision, there
will be only 16 different colors in the gradients generated by the filter,
which will show up as banding. But maybe it's good enough - 16x scaling is
not going to look great with bilinear filtering no matter what.

> > I have only one concern about testing. Supposedly when we get both C
> > and SSE2 implementations, it would be much easier for testing if they
> > produce identical results. Otherwise tests need to be improved to
> > somehow be able to take slight differences into account.
>
> I think the requirement of producing the same results for both C & SIMD
> (maybe SSE2, NEON, MMX) is relatively easy to meet.
> But SIMD can produce a much better result with less time spent, where the
> same quality can be horribly slow with a general C implementation.
> I think it is much more desirable to keep both the C and SIMD code
> optimized, in spite of producing slightly different results.

Having the C and SIMD code produce different results is not a problem in
itself, but as Siarhei says, we would need to make sure the test suite
reflects that decision. If we decide to move away from bit-exact testing,
we would need to decide on an acceptable deviation from the ideal, and then
update the tests to verify that both the C and SIMD implementations are
within that deviation. For example there could be a reference implementation
that computes t
Re: [Pixman] [cairo] pixman: New ARM NEON optimizations
Siarhei Siamashka writes: > Regarding the (b) part, probably as a side effect of current implementation, > right now it is possible to do some operations with images having > non-premultiplied alpha: > > src_img = pixman_image_create_bits ( > PIXMAN_x8b8g8r8, width, height, src, stride); > msk_img = pixman_image_create_bits ( > PIXMAN_a8b8g8r8, width, height, src, stride); > dst_img = pixman_image_create_bits ( > PIXMAN_a8r8g8b8, width, height, dst, stride); > > pixman_image_composite (PIXMAN_OP_SRC, src_img, msk_img, dst_img, > 0, 0, 0, 0, 0, 0, width, height); > > We only need to wrap the same a8r8g8b8 buffer into x8r8g8b8 > and a8r8g8b8 pixman image, and use the latter as a mask for > pixman_image_composite() calls. Any operations which don't > need mask themselves can use this trick. By also specifying > negative stride, this is useful for example when dealing with > the data returned by glReadPixels(). Yeah, this is useful, and it wouldn't directly be possible to do if the equation were changed to (s OP d) LERP_m d. However, a pretty simple way to fix it would be to just add an unpremultiplied format. Benjamin's video patches had this I believe. Using such a format as a destination can be slow because it requires divisions, but Joonas has a number of optimized implementations here: http://cgit.freedesktop.org/~joonas/unpremultiply/tree/ > So I find it convenient that we are also allowed to work with > masks which are basically interpreted as having a8x24 format. Right, having an a8x24 format would be another way to solve the problem. Soren ___ Pixman mailing list Pixman@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/pixman
Re: [Pixman] [PATCH 0/3] Some clean-ups of the test directory
Siarhei Siamashka writes: > On Thursday 10 February 2011 20:22:38 Søren Sandmann wrote: > > The following patches add a new directory "demos" and move all the > > GTK+ based test programs there. This allows the Makefiles in both test > > and demos to become much simpler with less redundancy. > > > > I'm not particularly happy about the "demos" name since the GTK+ tests > > aren't really demos, but I can't think of anything better. Suggestions > > are appreciated. > > It's a bit late comment, but eventually adding some real demo(s) which > would display some nice looking animation and scare the users with huge > FPS might be a good idea :) Yes, I think various types of real demos would be a good idea, both showing off performance and features. There is this branch: http://cgit.freedesktop.org/~sandmann/pixman/log/?h=parrot which makes the composite-test a little more interesting to look at, and shows better how the compositing operators work. Screenshot: http://www.daimi.au.dk/~sandmann/composite-test.png > Well, that is if pixman actually needs any kind of such "marketing" > stuff. I have always thought that pixman should eventually be used in more places than just the X server and cairo. For example, some features that have been proposed, such as a floating point pipeline, a JIT compiler and a shader language API, would let pixman serve the same role as Core Image does on Mac OS X. And to get people to use it, demos and other types of marketing would be useful, including getting a website and a logo. Soren ___ Pixman mailing list Pixman@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/pixman
[Pixman] [PATCH 2/3] DSPASE Cleanup and add operations
MIPS: DSPASE Modified the original commit dspase to use arm-neon bind macro MIPS: DSPASE Implemented add__ and add_n_ MIPS: DSPASE Added some simple mips function begin/end macroes. MIPS: DSPASE Implemented scanline add. --- pixman/pixman-mips-dspase1-asm.S | 331 -- pixman/pixman-mips-dspase1.c | 75 + 2 files changed, 325 insertions(+), 81 deletions(-) diff --git a/pixman/pixman-mips-dspase1-asm.S b/pixman/pixman-mips-dspase1-asm.S index b96fe83..596b38a 100644 --- a/pixman/pixman-mips-dspase1-asm.S +++ b/pixman/pixman-mips-dspase1-asm.S @@ -1,27 +1,37 @@ - .text + .setmips32r2 + .setnomips16 + .setdsp + +.macro pixman_asm_func fname + .global \fname + .ent\fname +#ifdef __ELF__ + .type \fname, @function + .hidden \fname +#endif +\fname: +.endm + +.macro pixman_end_func fname + .end \fname + .size \fname, .-\fname +.endm + .setnoreorder .setnomacro - -// void -// mips_dspase1_combine_over_u_nomask(uint32_t *dest, const uint32_t *src, -// const uint32_t *mask, int width) - - .global mips_dspase1_combine_over_u_nomask - .entmips_dspase1_combine_over_u_nomask - // note: this version to be used only when mask = NULL -mips_dspase1_combine_over_u_nomask: - beqz$a3, 1f - subu$v0, $a1, $a0 // diff = src - dest (for LWX) +pixman_asm_func pixman_composite_scanline_over_asm_dspase1 + beqz$a0, 1f + subu$v0, $a2, $a1 // diff = src - dest (for LWX) - sll $a3, $a3, 2 // width <<= 2 - addu$a3, $a0, $a3 // dest_end = dest + width + sll $a0, $a0, 2 // width <<= 2 + addu$a0, $a1, $a0 // dest_end = dest + width - lw $t0, 0($a0) // dest - lwx $t1, $v0($a0) // src (dest + diff) + lw $t0, 0($a1) // dest + lwx $t1, $v0($a1) // src (dest + diff) li $t9, 0x00800080 @@ -33,8 +43,8 @@ mips_dspase1_combine_over_u_nomask: muleu_s.ph.qbl $t3, $t0, $t2 muleu_s.ph.qbr $t4, $t0, $t2 - lw $t0, 4($a0) // dest[1] for next loop iteration - addiu $a0, $a0, 4 // dest++ + lw $t0, 4($a1) // dest[1] for next loop iteration + addiu $a1, $a1, 4 // dest++ addu$t3, $t3, $t9 // can't overflow; rev2: addu_s.ph 
addu$t4, $t4, $t9 // can't overflow; rev2: addu_s.ph @@ -46,41 +56,34 @@ mips_dspase1_combine_over_u_nomask: precrq.qb.ph$t3, $t3, $t4 addu_s.qb $t3, $t3, $t1 - lwx $t1, $v0($a0) // src (dest + diff) for next loop iteration + lwx $t1, $v0($a1) // src (dest + diff) for next loop iteration - bne $a0, $a3, 0b - sw $t3, -4($a0)// dest + bne $a1, $a0, 0b + sw $t3, -4($a1)// dest 1: jr $ra nop - .endmips_dspase1_combine_over_u_nomask - +pixman_end_func pixman_composite_scanline_over_asm_dspase1 -// void -// mips_dspase1_combine_over_u_mask(uint32_t *dest, const uint32_t *src, -// const uint32_t *mask, int width) - - .global mips_dspase1_combine_over_u_mask - .entmips_dspase1_combine_over_u_mask // note: this version to be used only when mask != NULL -mips_dspase1_combine_over_u_mask: - beqz$a3, 1f - subu$v0, $a1, $a0 // sdiff = src - dest (for LWX) +pixman_asm_func pixman_composite_scanline_over_mask_asm_dspase1 + beqz$a0, 1f + subu$v0, $a2, $a1 // sdiff = src - dest (for LWX) - subu$v1, $a2, $a0 // mdiff = mask - dest (for LWX) + subu$v1, $a3, $a1 // mdiff = mask - dest (for LWX) - sll $a3, $a3, 2 // width <<= 2 - addu$a3, $a0, $a3 // dest_end = dest + width + sll $a0, $a0, 2 // width <<= 2 + addu$a0, $a1, $a0 // dest_end = dest + width li $t9, 0x00800080 0: - lwx $t8, $v1($a0) // mask (dest + mdiff) - lwx $t1, $v0($a0) // src (dest + sdiff) + lwx $t8, $v1($a1) // mask (dest + mdiff) + lwx $t1, $v0($a1) // src (dest + sdiff) srl $t8, $t8, 24// mask >>= A_SHIFT ins $t8, $t8, 16, 8 // 0:m:0:m; equivalent to replv.ph @@ -88,7 +91,7 @@ mips_dspase1_combine_over_u_mask: muleu_s.ph.qbl $t3, $t1, $t8 muleu_s.ph.qbr $t4, $t1, $t8 - lw
[Pixman] [PATCH 0/3] Pixman MIPS DSPASE1
I originally started working on optimizing this for MIPS32R2 code (based on
the patch by Beloev), but the performance increases seem to be relatively
small, as the over_n_8_8888 results below show. The DSP ASE is much more
promising in this regard. It rather leaves me wondering whether the MIPS32R2
code should be included at all. It might however be related to the test
system, which has a MIPS 74K core; the original, I assume, was developed on
a MIPS 24K.

I used pixman-arm-common.h for the assembler binding macros, which is the
reason for the 'ARM' found in the glue. Compiling the code will result in
gcc producing warnings about macro expansion; it'd be nice not to have
these, but "fixing" them would have a (slight) negative effect on
readability.

PATCH 1 is the original patch by Georgi Beloev, modified to apply against
pixman head.

Implemented: scanline add, out reverse, over
Fast paths: over_n_8_8888, add_8888_8888, add_n_8888

Test hardware: Broadcom BCM4718, 453MHz, MIPS 74K V4.0 (incl. DSP Rev2,
MIPS16), little endian. All the test program builds used
CFLAGS="-O2 -mdsp -mips32r2".

reference memcpy speed = 176.0MB/s (44.0MP/s for 32bpp fills)

Optimizations disabled: --disable-mips32r2 --disable-mips-dspase1
 over_n_8_8888 = L1: 6.16 L2: 5.34 M: 5.35 ( 19.24%) HT: 4.78 VT: 4.62 R: 4.55 RT: 2.99 ( 28Kops/s)
 add_8888_8888 = L1: 18.11 L2: 10.15 M: 9.98 ( 45.33%) HT: 14.80 VT: 13.36 R: 13.41 RT: 6.17 ( 46Kops/s)
 add_n_8888 = L1: 14.26 L2: 10.30 M: 10.38 ( 23.59%) HT: 8.05 VT: 7.64 R: 7.63 RT: 4.05 ( 33Kops/s)

MIPS32R2: --disable-mips-dspase1
 over_n_8_8888 = L1: 6.17 L2: 5.62 M: 5.56 ( 20.33%) HT: 5.00 VT: 4.83 R: 4.76 RT: 3.33 ( 30Kops/s)

MIPS DSP ASE:
 over_n_8_8888 = L1: 9.76 L2: 7.89 M: 7.93 ( 27.11%) HT: 7.04 VT: 6.84 R: 6.63 RT: 4.06 ( 34Kops/s)
 add_8888_8888 = L1: 117.36 L2: 20.67 M: 23.22 (105.50%) HT: 17.40 VT: 15.96 R: 13.81 RT: 6.48 ( 47Kops/s)
 add_n_8888 = L1: 145.84 L2: 28.23 M: 31.11 ( 70.66%) HT: 22.95 VT: 18.54 R: 19.99 RT: 8.93 ( 50Kops/s)

Scanline ops benchmarked using lowlevel-blt-bench. I selected these ops by
adding a printf to the scanline ops and finding operations that trigger it;
if there is a more convenient way to benchmark these ops, I failed to find
it.

Optimizations disabled:
 add_8_8_8 = L1: 3.31 L2: 5.25 M: 5.16 ( 11.73%) HT: 3.61 VT: 3.60 R: 3.53 RT: 1.77 ( 18Kops/s)
 add_8888_1555 = L1: 6.51 L2: 5.32 M: 5.34 ( 18.20%) HT: 4.05 VT: 3.96 R: 3.94 RT: 2.21 ( 22Kops/s)
 outrev_n_8_8888 = L1: 6.33 L2: 5.25 M: 5.16 ( 17.60%) HT: 4.11 VT: 4.02 R: 3.97 RT: 2.23 ( 22Kops/s)
 over_8888_n_0565 = L1: 2.83 L2: 3.33 M: 3.21 ( 11.54%) HT: 2.73 VT: 2.69 R: 2.68 RT: 1.67 ( 17Kops/s)
 over_n_8888 = L1: 7.45 L2: 6.65 M: 6.66 ( 15.14%) HT: 5.65 VT: 5.43 R: 5.43 RT: 3.35 ( 30Kops/s)

MIPS DSP ASE:
 add_8_8_8 = L1: 8.81 L2: 7.67 M: 7.53 ( 17.11%) HT: 4.62 VT: 4.68 R: 4.50 RT: 1.97 ( 19Kops/s)
 add_8888_1555 = L1: 9.07 L2: 7.27 M: 7.29 ( 24.87%) HT: 5.09 VT: 4.95 R: 4.93 RT: 2.50 ( 23Kops/s)
 outrev_n_8_8888 = L1: 8.48 L2: 6.82 M: 6.88 ( 23.45%) HT: 5.04 VT: 4.90 R: 4.85 RT: 2.48 ( 23Kops/s)
 over_8888_n_0565 = L1: 5.13 L2: 4.38 M: 4.16 ( 14.24%) HT: 3.41 VT: 3.30 R: 3.34 RT: 1.93 ( 19Kops/s)
 over_n_8888 = L1: 18.58 L2: 12.91 M: 13.12 ( 29.85%) HT: 9.75 VT: 9.06 R: 9.10 RT: 4.55 ( 33Kops/s)
___
Pixman mailing list
Pixman@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/pixman
[Pixman] [PATCH 3/3] DSPASE More cleanup, out reverse op.
MIPS: DSPASE Implemented DSPASE1_UN8x4_MUL_UN8 macro. MIPS: DSPASE Implemented scanline out reverse MIPS: DSPASE over_n_8_ modified to use the macro bindings --- pixman/pixman-mips-dspase1-asm.S | 226 +- pixman/pixman-mips-dspase1.c | 50 + 2 files changed, 155 insertions(+), 121 deletions(-) diff --git a/pixman/pixman-mips-dspase1-asm.S b/pixman/pixman-mips-dspase1-asm.S index 596b38a..0cb2293 100644 --- a/pixman/pixman-mips-dspase1-asm.S +++ b/pixman/pixman-mips-dspase1-asm.S @@ -18,6 +18,26 @@ .size \fname, .-\fname .endm +# result register can be the same as any of the params +# rb_half should contain 0x00800080 +.macro DSPASE1_UN8x4_MUL_UN8_head a, b, x, y + muleu_s.ph.qbl \x, \a, \b + muleu_s.ph.qbr \y, \a, \b +.endm + +.macro DSPASE1_UN8x4_MUL_UN8_tail x, y, result, rb_half, tmp3, tmp4 + addu \x, \x, \rb_half + addu \y, \y, \rb_half + + preceu.ph.qbla \tmp3, \x + preceu.ph.qbla \tmp4, \y + + addu \x, \x, \tmp3 + addu \y, \y, \tmp4 + + precrq.qb.ph \result, \x, \y +.endm + .setnoreorder .setnomacro @@ -40,20 +60,13 @@ pixman_asm_func pixman_composite_scanline_over_asm_dspase1 srl $t2, $t2, 24// ALPHA_8(~src) ins $t2, $t2, 16, 8 // 0:a:0:a; equivalent to replv.ph - muleu_s.ph.qbl $t3, $t0, $t2 - muleu_s.ph.qbr $t4, $t0, $t2 + DSPASE1_UN8x4_MUL_UN8_head $t0, $t2, $t3, $t4 lw $t0, 4($a1) // dest[1] for next loop iteration addiu $a1, $a1, 4 // dest++ - addu$t3, $t3, $t9 // can't overflow; rev2: addu_s.ph - addu$t4, $t4, $t9 // can't overflow; rev2: addu_s.ph - preceu.ph.qbla $t5, $t3// rev2: shrl.ph - preceu.ph.qbla $t6, $t4// rev2: shrl.ph - addu$t3, $t3, $t5 // can't overflow; rev2: addu_s.ph - addu$t4, $t4, $t6 // can't overflow; rev2: addu_s.ph + DSPASE1_UN8x4_MUL_UN8_tail $t3, $t4, $t3, $t9, $t5, $t6 - precrq.qb.ph$t3, $t3, $t4 addu_s.qb $t3, $t3, $t1 lwx $t1, $v0($a1) // src (dest + diff) for next loop iteration @@ -88,35 +101,22 @@ pixman_asm_func pixman_composite_scanline_over_mask_asm_dspase1 srl $t8, $t8, 24// mask >>= A_SHIFT ins $t8, $t8, 16, 8 // 
0:m:0:m; equivalent to replv.ph - muleu_s.ph.qbl $t3, $t1, $t8 - muleu_s.ph.qbr $t4, $t1, $t8 + DSPASE1_UN8x4_MUL_UN8_head $t1, $t8, $t3, $t4 lw $t0, 0($a1) // dest - addu$t3, $t3, $t9 // can't overflow; rev2: addu_s.ph - addu$t4, $t4, $t9 // can't overflow; rev2: addu_s.ph - preceu.ph.qbla $t5, $t3// rev2: shrl.ph - preceu.ph.qbla $t6, $t4// rev2: shrl.ph - addu$t3, $t3, $t5 // can't overflow; rev2: addu_s.ph - addu$t4, $t4, $t6 // can't overflow; rev2: addu_s.ph - precrq.qb.ph$t1, $t3, $t4 + DSPASE1_UN8x4_MUL_UN8_tail $t3, $t4, $t1, $t9, $t5, $t6 not $t2, $t1// ~src srl $t2, $t2, 24// ALPHA_8(~src) ins $t2, $t2, 16, 8 // 0:a:0:a; equivalent to replv.ph - muleu_s.ph.qbl $t3, $t0, $t2 - muleu_s.ph.qbr $t4, $t0, $t2 + DSPASE1_UN8x4_MUL_UN8_head $t0, $t2, $t3, $t4 addiu $a1, $a1, 4 // dest++ + + DSPASE1_UN8x4_MUL_UN8_tail $t3, $t4, $t3, $t9, $t5, $t6 - addu$t3, $t3, $t9 // can't overflow; rev2: addu_s.ph - addu$t4, $t4, $t9 // can't overflow; rev2: addu_s.ph - preceu.ph.qbla $t5, $t3// rev2: shrl.ph - preceu.ph.qbla $t6, $t4// rev2: shrl.ph - addu$t3, $t3, $t5 // can't overflow; rev2: addu_s.ph - addu$t4, $t4, $t6 // can't overflow; rev2: addu_s.ph - precrq.qb.ph$t3, $t3, $t4 addu_s.qb $t3, $t3, $t1 bne $a1, $a0, 0b @@ -197,28 +197,18 @@ pixman_asm_func pixman_composite_scanline_add_mask_asm_dspase1 $scanline_add_mask_loop: lwx $t2, $a3($a1) lwx $t1, $a2($a1) - lw $t0, 0($a1) - - addiu $a1, $a1, 4 # based on pixman_composite_scanline_over_mask_asm_dspase1 - # converting these to macroes might make sense srl $t2, $t2, 24 ins $t2, $t2, 16, 8 // 0:m:0:m; equivalent to replv.ph - - muleu_s.ph.qbl $t3, $t1, $t2 - muleu_s.ph.qbr $t4, $t1, $t2 - - addu $t3, $t3, $t8 // can't overflow; rev2: addu_s.ph - addu $t4, $t4, $t8 // can't overflow; rev2: addu_s.ph - preceu.ph.qbla $t5, $t3 // rev2: shrl.ph - preceu.ph.qbla $t6, $t4 // rev2: shrl.ph - addu $t3, $
[Pixman] [PATCH 1/3] MIPS32R2 and MIPS DSP ASE optimized functions, adapted for pixman head
From: Veli-Matti Valtonen >From 118b1f5596f72be7fed85ba408ff2961b3308038 Mon Sep 17 00:00:00 2001 From: Georgi Beloev Date: Wed, 8 Sep 2010 17:34:22 -0700 Subject: [PATCH] Added MIPS32R2 and MIPS DSP ASE optimized functions. The following functions were implemented for MIPS32R2: - pixman_fill32() - fast_composite_over_n_8_() The following functions were implemented for MIPS DSP ASE: - combine_over_u() - fast_composite_over_n_8_() Additionally, MIPS DSP ASE uses the MIPS32R2 pixman_fill32() function. Use configure commands similar to the ones below to select the target processor and, correspondingly, the target instruction set: - MIPS32R2: configure CFLAGS='-march=24kc -O2' - MIPS DSP ASE: configure CFLAGS='-march=24kec -O2' --- configure.ac | 63 + pixman/Makefile.am | 22 + pixman/pixman-cpu.c | 21 pixman/pixman-mips-dspase1-asm.S | 189 ++ pixman/pixman-mips-dspase1.c | 107 + pixman/pixman-mips32r2-asm.S | 180 pixman/pixman-mips32r2.c | 112 ++ pixman/pixman-private.h | 11 ++ 8 files changed, 705 insertions(+), 0 deletions(-) create mode 100644 pixman/pixman-mips-dspase1-asm.S create mode 100644 pixman/pixman-mips-dspase1.c create mode 100644 pixman/pixman-mips32r2-asm.S create mode 100644 pixman/pixman-mips32r2.c diff --git a/configure.ac b/configure.ac index 5242799..2a7e49a 100644 --- a/configure.ac +++ b/configure.ac @@ -565,6 +565,69 @@ fi AM_CONDITIONAL(USE_GCC_INLINE_ASM, test $have_gcc_inline_asm = yes) +dnl == +dnl Check if the compiler supports MIPS32R2 instructions + +AC_MSG_CHECKING(whether to use MIPS32R2 instructions) +AC_COMPILE_IFELSE([[ +void test() +{ +asm("ext \$v0,\$a0,8,8"); +} +]], have_mips32r2=yes, have_mips32r2=no) + +AC_ARG_ENABLE(mips32r2, + [AC_HELP_STRING([--disable-mips32r2], + [disable MIPS32R2 fast paths])], + [enable_mips32r2=$enableval], [enable_mips32r2=auto]) + +if test $enable_mips32r2 = no ; then + have_mips32r2=disabled +fi + +if test $have_mips32r2 = yes ; then + AC_DEFINE(USE_MIPS32R2, 1, [use MIPS32R2 optimizations]) +fi + 
+AM_CONDITIONAL(USE_MIPS32R2, test $have_mips32r2 = yes) + +AC_MSG_RESULT($have_mips32r2) +if test $enable_mips32r2 = yes && test $have_mips32r2 = no ; then + AC_MSG_ERROR([MIPS32R2 not detected]) +fi + + +dnl == +dnl Check if the compiler supports MIPS DSP ASE Rev 1 instructions + +AC_MSG_CHECKING(whether to use MIPS DSP ASE Rev 1 instructions) +AC_COMPILE_IFELSE([[ +void test() +{ +asm("addu.qb \$v0,\$a0,\$a1"); +} +]], have_mips_dspase1=yes, have_mips_dspase1=no) + +AC_ARG_ENABLE(mips-dspase1, + [AC_HELP_STRING([--disable-mips-dspase1], + [disable MIPS DSP ASE Rev 1 fast paths])], + [enable_mips_dspase1=$enableval], [enable_mips_dspase1=auto]) + +if test $enable_mips_dspase1 = no ; then + have_mips_dspase1=disabled +fi + +if test $have_mips_dspase1 = yes ; then + AC_DEFINE(USE_MIPS_DSPASE1, 1, [use MIPS DSP ASE Rev 1 optimizations]) +fi + +AM_CONDITIONAL(USE_MIPS_DSPASE1, test $have_mips_dspase1 = yes) + +AC_MSG_RESULT($have_mips_dspase1) +if test $enable_mips_dspase1 = yes && test $have_mips_dspase1 = no ; then + AC_MSG_ERROR([MIPS DSP ASE Rev 1 not detected]) +fi + dnl == dnl Static test programs diff --git a/pixman/Makefile.am b/pixman/Makefile.am index ca31301..d832db1 100644 --- a/pixman/Makefile.am +++ b/pixman/Makefile.am @@ -123,5 +123,27 @@ libpixman_1_la_LIBADD += libpixman-arm-neon.la ASM_CFLAGS_arm_neon= endif +# MIPS32R2 +if USE_MIPS32R2 +noinst_LTLIBRARIES += libpixman-mips32r2.la +libpixman_mips32r2_la_SOURCES = \ + pixman-mips32r2.c \ + pixman-mips32r2-asm.S +libpixman_mips32r2_la_CFLAGS = $(DEP_CFLAGS) +libpixman_mips32r2_la_LIBADD = $(DEP_LIBS) +libpixman_1_la_LIBADD += libpixman-mips32r2.la +endif + +# MIPS DSP ASE Rev 1 +if USE_MIPS_DSPASE1 +noinst_LTLIBRARIES += libpixman-mips-dspase1.la +libpixman_mips_dspase1_la_SOURCES = \ + pixman-mips-dspase1.c \ + pixman-mips-dspase1-asm.S +libpixman_mips_dspase1_la_CFLAGS = $(DEP_CFLAGS) +libpixman_mips_dspase1_la_LIBADD = $(DEP_LIBS) +libpixman_1_la_LIBADD += libpixman-mips-dspase1.la +endif + .c.s : 
$(libpixmaninclude_HEADERS) $(BUILT_SOURCES) $(CC) $(CFLAGS) $(ASM_CFLAGS_$(@:pixman-%.s=%)) $(ASM_CFLAGS_$(@:pixman-arm-%.s=arm_%)) -DHAVE_CONFIG_H -I$(srcdir) -I$(builddir) -I$(top_builddir) -S -o $@ $< diff --git a/pixman/pixman-cpu.c b/pixman/pixman-cpu.c index 0e14ecb..ee6dc1c 100644 --- a/pixman/pixman-cpu.c +++ b/pixman/pixman-cpu.c @@ -573,6 +573,17 @@ pixman_have_sse2 (void) #endif /* __amd64__ */ #endif +#ifdef USE_