Re: [Pixman] Image scaling with bilinear interpolation performance

2011-02-22 Thread Siarhei Siamashka
On Monday 21 February 2011 13:07:31 김태균 wrote:
> Hi,
> Thank you for the reply.
> 
> > Regarding performance, improving it twice is still a little bit too slow
> > on the hardware which has SIMD. On x86, support for SSE2 is pretty much
> > common, so it is quite natural to use it if it proves to be beneficial.
> > But for the low end embedded machines with primitive processors without
> > SIMD it may be indeed very good to have any kind of performance
> > improvements.
> 
> Yes, right.
> I will use SIMD as much as I can. (NEON is available on some
> of our target machines)

Great. Contributions in this area would definitely be useful. But you
may have started this work a bit too late ;) I have been looking into improving
bilinear scaling performance for the last couple of weeks already and have just
submitted some initial SSE2 and ARM NEON optimizations for it (btw, testing is
very much welcome). And there is still a lot of work to do before all the
bilinear scaling related performance bottlenecks are eliminated.

> But I have to consider not only high end machines but also low ends which
> do not support SIMD.
> That's why I'm trying to optimize non-SIMD general code path.

Well, in your original e-mail, you mentioned that you are interested in getting
good performance on an Intel quad core. That's why, without any other
information, I suggested SSE2 as a solution for this problem :)

What kind of hardware do the rest of your target machines have? A lot of ARM
processors beginning with armv5te have special instructions for fast signed
16-bit multiplication. If we know what the target hardware supports, we may
modify bilinear interpolation code to make better use of it.

The current bilinear interpolation code has one problem: it needs 16-bit
unsigned multiplications (uint16_t * uint16_t -> uint32_t), which are also not
very efficient with MMX/SSE2. Maybe going down from 256 levels to 128 levels
would allow the use of signed 16-bit multiplications and provide more
optimization possibilities on a wide range of hardware? SSSE3 may also be worth
considering because it has the PMADDUBSW instruction (uint8_t * int8_t -> int16_t).
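
To illustrate the range argument, here is a minimal sketch in C. It is not
pixman code: the names max_vertical_sum, fits_in_int16 and
bilinear_channel_128 are made up for this example. It assumes that the two
vertical weights sum to the number of levels, so with 256 levels one 8-bit
channel can reach 255 * 256 = 65280 after the vertical pass (needs an
unsigned 16-bit value), while with 128 levels the maximum is
255 * 128 = 32640, which fits into a signed 16-bit value.

```c
#include <stdint.h>

/* Largest value one 8-bit channel can reach after the vertical pass,
 * assuming the two vertical weights sum to 'levels'. */
static uint32_t
max_vertical_sum (uint32_t levels)
{
    return 255u * levels;
}

/* Nonzero if v is representable as a signed 16-bit integer. */
static int
fits_in_int16 (uint32_t v)
{
    return v <= (uint32_t) INT16_MAX;
}

/* Hypothetical single-channel bilinear interpolation with 7-bit weights.
 * Assumptions: wt + wb == 128 and 0 <= distx <= 127. */
static uint8_t
bilinear_channel_128 (uint8_t tl, uint8_t tr, uint8_t bl, uint8_t br,
                      int distx, int wt, int wb)
{
    /* Vertical pass: each result is at most 32640, so int16_t is enough. */
    int16_t left  = (int16_t) (tl * wt + bl * wb);
    int16_t right = (int16_t) (tr * wt + br * wb);

    /* Horizontal pass: signed 16-bit operands, 32-bit accumulator. */
    int32_t sum = (int32_t) left * (128 - distx) + (int32_t) right * distx;

    /* Both passes scale by 128, so drop 14 bits in total. */
    return (uint8_t) (sum >> 14);
}
```

With these assumptions, fits_in_int16 (max_vertical_sum (256)) is false and
fits_in_int16 (max_vertical_sum (128)) is true. The price is one bit less of
weight precision in each direction.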

Only ARM NEON is not challenging at all, and even boring, because it is totally
orthogonal and easily supports all kinds of vector multiplications (8-bit and
16-bit, both signed and unsigned, both ordinary and long variants). I guess it
would work fine with any interpolation method, like it did with the current
one.

I also tried to benchmark your change to the bilinear code and got something
like 23% better scaling performance overall on Intel Core i7. I guess you
measured the 2x performance improvement for that function alone but not for a
full rendering pipeline, right? It's a good improvement, but not even close to
the performance effect of using SSE2 or NEON (or maybe even armv5te). So I
would consider looking at the instruction sets supported by your target
hardware first.

For these experiments, I'm typically doing benchmarks with the 'scaling-bench'
program from:
  http://cgit.freedesktop.org/~siamashka/pixman/log/?h=playground/test-n-bench

-- 
Best regards,
Siarhei Siamashka
___
Pixman mailing list
Pixman@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/pixman


Re: [Pixman] [cairo] scaling performance test of cairo library

2011-02-22 Thread Siarhei Siamashka
On Thursday 10 February 2011 03:16:47 Siarhei Siamashka wrote:
> On Wednesday 09 February 2011 08:23:51 Siarhei Siamashka wrote:
> > On Wednesday 09 February 2011 05:28:46 cooolheater wrote:
> > > Thank you for your kind explanation.
> > > I used pixman-0.21.4 for testing.
> > > As you guessed, we are using SIMD and are looking for a way to use
> > > NEON acceleration.
> > > Could you let me know the bilinear scaling interfaces in pixman and
> > > where the SIMD optimization will be applied?
> > 
> > You can look here for the start:
> > http://cgit.freedesktop.org/pixman/tree/pixman/pixman-bits-image.c?id=pixman-0.21.4#n189
> > 
> > But applying optimizations locally just for this small function is not
> > going to provide the best performance. It's kind of like swinging a
> > large polearm in a narrow passage: not very effective.
> 
> An example of such a patch is attached. The performance improvement is
> not impressive at all. Who cares if it's now, let's say, ~15x slower than
> nearest scaling instead of ~30x?
> 
> Obviously we need a better solution.

Hello cooolheater,

Could you please try to run your benchmark again with the patches from the
following link applied to pixman and share the results?

http://lists.freedesktop.org/archives/pixman/2011-February/001053.html

-- 
Best regards,
Siarhei Siamashka


[Pixman] [PATCH 3/3] test: add Makefile for Win32

2011-02-22 Thread Andrea Canciani
---
 test/Makefile.win32 |   73 +++
 1 files changed, 73 insertions(+), 0 deletions(-)
 create mode 100644 test/Makefile.win32

diff --git a/test/Makefile.win32 b/test/Makefile.win32
new file mode 100644
index 000..c71afe1
--- /dev/null
+++ b/test/Makefile.win32
@@ -0,0 +1,73 @@
+CC   = cl
+LINK = link
+
+CFG_VAR = $(CFG)
+ifeq ($(CFG_VAR),)
+CFG_VAR=release
+endif
+
+CFLAGS = -MD -nologo -D_CRT_SECURE_NO_DEPRECATE -D_CRT_NONSTDC_NO_DEPRECATE -D_BIND_TO_CURRENT_VCLIBS_VERSION -D_MT -I../pixman -I. -I../
+TEST_LDADD = ../pixman/$(CFG_VAR)/pixman-1.lib
+INCLUDES = -I../pixman -I$(top_builddir)/pixman
+
+# optimization flags
+ifeq ($(CFG_VAR),debug)
+CFLAGS += -Od -Zi
+else
+CFLAGS += -O2
+endif
+
+SOURCES =  \
+   a1-trap-test.c  \
+   pdf-op-test.c   \
+   region-test.c   \
+   region-translate-test.c \
+   fetch-test.c\
+   oob-test.c  \
+   trap-crasher.c  \
+   alpha-loop.c\
+   scaling-crash-test.c\
+   gradient-crash-test.c   \
+   alphamap.c  \
+   stress-test.c   \
+   composite-traps-test.c  \
+   blitters-test.c \
+   scaling-test.c  \
+   affine-test.c   \
+   composite.c \
+   utils.c
+
+TESTS =\
+   $(CFG_VAR)/a1-trap-test.exe \
+   $(CFG_VAR)/pdf-op-test.exe  \
+   $(CFG_VAR)/region-test.exe  \
+   $(CFG_VAR)/region-translate-test.exe\
+   $(CFG_VAR)/fetch-test.exe   \
+   $(CFG_VAR)/oob-test.exe \
+   $(CFG_VAR)/trap-crasher.exe \
+   $(CFG_VAR)/alpha-loop.exe   \
+   $(CFG_VAR)/scaling-crash-test.exe   \
+   $(CFG_VAR)/gradient-crash-test.exe  \
+   $(CFG_VAR)/alphamap.exe \
+   $(CFG_VAR)/stress-test.exe  \
+   $(CFG_VAR)/composite-traps-test.exe \
+   $(CFG_VAR)/blitters-test.exe\
+   $(CFG_VAR)/scaling-test.exe \
+   $(CFG_VAR)/affine-test.exe  \
+   $(CFG_VAR)/composite.exe
+
+
+OBJECTS = $(patsubst %.c, $(CFG_VAR)/%.obj, $(SOURCES))
+
+$(CFG_VAR)/%.obj: %.c
+   @mkdir -p $(CFG_VAR)
+   @$(CC) -c $(CFLAGS) -Fo"$@" $<
+
+$(CFG_VAR)/%.exe: $(CFG_VAR)/%.obj
+   $(LINK) /NOLOGO /OUT:$@ $< $(CFG_VAR)/utils.obj $(TEST_LDADD)
+
+all: $(OBJECTS) $(TESTS)
+   @exit 0
+
+clean:
+   @rm -f $(CFG_VAR)/*.obj $(CFG_VAR)/*.pdb || exit 0
-- 
1.7.1



[Pixman] [PATCH 2/3] test: Fix tests for compilation on Windows

2011-02-22 Thread Andrea Canciani
The Microsoft C compiler cannot handle subobject initialization and
Win32 does not provide snprintf.

Work around these limitations by using normal struct initialization
and calling printf directly.
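
As a minimal sketch of the initializer workaround (the struct
example_case_t and the array example_cases are made up for illustration and
are not taken from the test suite), the positional form lists the values in
member declaration order instead of naming the members:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical test case description, loosely modelled on the idea of a
 * table of test cases; not the real testcase_t from fetch-test.c. */
typedef struct
{
    int         width, height;
    int         stride;
    uint32_t    src[4];
    const char *label;
} example_case_t;

/* The C99 designated form would look like this:
 *
 *     { .width = 2, .height = 2, .stride = 8,
 *       .src = { 0x00112233, 0x44556677, 0x8899aabb, 0xccddeeff },
 *       .label = NULL },
 *
 * The positional form below relies only on member order, so it also works
 * with compilers that lack designated initializers. */
static const example_case_t example_cases[] =
{
    {
        2, 2,
        8,
        { 0x00112233, 0x44556677,
          0x8899aabb, 0xccddeeff },
        NULL,
    },
};
```

The drawback of the positional form is that it silently depends on the
member order, so reordering the struct requires touching every initializer.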
---
 test/composite.c|   48 +++---
 test/fetch-test.c   |   52 ++
 test/trap-crasher.c |   20 +-
 3 files changed, 61 insertions(+), 59 deletions(-)

diff --git a/test/composite.c b/test/composite.c
index e14f954..33e8d97 100644
--- a/test/composite.c
+++ b/test/composite.c
@@ -616,22 +616,20 @@ eval_diff (color_t *expected, color_t *test, pixman_format_code_t format)
 return MAX (MAX (MAX (rdiff, gdiff), bdiff), adiff);
 }
 
-static char *
-describe_image (image_t *info, char *buf, int buflen)
+static void
+describe_image (image_t *info)
 {
 if (info->size)
 {
-   snprintf (buf, buflen, "%s %dx%d%s",
+   printf ("%s %dx%d%s",
  info->format->name,
  info->size, info->size,
  info->repeat ? "R" :"");
 }
 else
 {
-   snprintf (buf, buflen, "solid");
+   printf ("solid");
 }
-
-return buf;
 }
 
 /* Test a composite of a given operation, source, mask, and destination
@@ -708,18 +706,13 @@ composite_test (image_t *dst,
  */
 if (diff > 3.0)
 {
-   char buf[40];
-
-   snprintf (buf, sizeof (buf),
- "%s %scomposite",
- op->name,
- component_alpha ? "CA " : "");
-
-   printf ("%s test error of %.4f --\n"
+   printf ("%s %scomposite test error of %.4f --\n"
"   RGBA\n"
"got:   %.2f %.2f %.2f %.2f [%08lx]\n"
"expected:  %.2f %.2f %.2f %.2f\n",
-   buf, diff,
+   op->name,
+   component_alpha ? "CA " : "",
+   diff,
result.r, result.g, result.b, result.a,
*(unsigned long *) pixman_image_get_data (dst->image),
expected.r, expected.g, expected.b, expected.a);
@@ -735,9 +728,18 @@ composite_test (image_t *dst,
mask->color->b, mask->color->a,
dst->color->r, dst->color->g,
dst->color->b, dst->color->a);
-   printf ("src: %s, ", describe_image (src, buf, sizeof (buf)));
-   printf ("mask: %s, ", describe_image (mask, buf, sizeof (buf)));
-   printf ("dst: %s\n\n", describe_image (dst, buf, sizeof (buf)));
+
+   printf ("src: ");
+   describe_image (src);
+   printf (", ");
+
+   printf ("mask: ");
+   describe_image (mask);
+   printf (", ");
+
+   printf ("dst: ");
+   describe_image (dst);
+   printf ("\n\n");
}
else
{
@@ -747,8 +749,14 @@ composite_test (image_t *dst,
src->color->b, src->color->a,
dst->color->r, dst->color->g,
dst->color->b, dst->color->a);
-   printf ("src: %s, ", describe_image (src, buf, sizeof (buf)));
-   printf ("dst: %s\n\n", describe_image (dst, buf, sizeof (buf)));
+
+   printf ("src: ");
+   describe_image (src);
+   printf (", ");
+
+   printf ("dst: ");
+   describe_image (dst);
+   printf ("\n\n");
}
 
success = FALSE;
diff --git a/test/fetch-test.c b/test/fetch-test.c
index 2ca16dd..314a072 100644
--- a/test/fetch-test.c
+++ b/test/fetch-test.c
@@ -8,7 +8,7 @@
 
 static pixman_indexed_t mono_palette =
 {
-.rgba = { 0x, 0x00ff },
+0, { 0x, 0x00ff },
 };
 
 
@@ -24,57 +24,53 @@ typedef struct {
 static testcase_t testcases[] =
 {
 {
-   .format = PIXMAN_a8r8g8b8,
-   .width = 2, .height = 2,
-   .stride = 8,
-   .src = { 0x00112233, 0x44556677,
-0x8899aabb, 0xccddeeff },
-   .dst = { 0x00112233, 0x44556677,
-0x8899aabb, 0xccddeeff },
-   .indexed = NULL,
+   PIXMAN_a8r8g8b8,
+   2, 2,
+   8,
+   { 0x00112233, 0x44556677,
+ 0x8899aabb, 0xccddeeff },
+   { 0x00112233, 0x44556677,
+ 0x8899aabb, 0xccddeeff },
+   NULL,
 },
 {
-   .format = PIXMAN_g1,
-   .width = 8, .height = 2,
-   .stride = 4,
+   PIXMAN_g1,
+   8, 2,
+   4,
 #ifdef WORDS_BIGENDIAN
-   .src =
{
0xaa00,
0x5500
},
 #else
-   .src =
{
0x0055,
0x00aa
},
 #endif
-   .dst =
{
0x00ff, 0x, 0x00ff, 0x, 0x00ff, 
0x, 0x00ff, 0x,
0x, 0x00ff, 0x, 0x00ff, 0x, 
0x00ff, 0x, 0x00ff
},
-   .indexed = &mono_palette,
+   &mono_palette,
 },
 #if 0
 {
-   .format = PIXMAN_g8,
-   

[Pixman] [PATCH 1/3] Fix compilation on Win32

2011-02-22 Thread Andrea Canciani
Building the library from a clean git repository fails with:
pixman-image.c(33) : fatal error C1083: Cannot open include file:
'pixman-combine32.h': No such file or directory

pixman-combine32.h is not used by pixman-image.c, so its inclusion can
simply be removed.
---
 pixman/pixman-image.c |1 -
 1 files changed, 0 insertions(+), 1 deletions(-)

diff --git a/pixman/pixman-image.c b/pixman/pixman-image.c
index 9103ca6..84bacf8 100644
--- a/pixman/pixman-image.c
+++ b/pixman/pixman-image.c
@@ -30,7 +30,6 @@
 #include 
 
 #include "pixman-private.h"
-#include "pixman-combine32.h"
 
 pixman_bool_t
 _pixman_init_gradient (gradient_t *  gradient,
-- 
1.7.1



[Pixman] Win32 fixes and improvements

2011-02-22 Thread Andrea Canciani
In order to make pixman more maintainable on Windows, working
Makefiles for the library and the tests are probably needed. Today I
took the Makefile attached to
https://bugs.freedesktop.org/show_bug.cgi?id=33069
and tried to build with it, but it did not build all the tests because
of some incompatibilities between cl and gcc.

The following patches should make it possible to build pixman and the
entire test suite on Windows from git in a properly configured Cygwin
environment.

There are some remaining warnings: 

c:\cygwin\home\ranma42\code\fdo\pixman\pixman\pixman-mmx.c(317) :
warning C4799: function 'store' has no EMMS instruction
c:\cygwin\home\ranma42\code\fdo\pixman\pixman\pixman-mmx.c(166) :
warning C4799: function 'to_uint64' has no EMMS instruction
c:\cygwin\home\ranma42\code\fdo\pixman\pixman\pixman-mmx.c(437) :
warning C4799: function 'combine' has no EMMS instruction

These are warnings about missing cleanup of the MMX registers. I don't
know if this is required or if the compiler just does not notice that
it is already performed somewhere else.

c:\cygwin\home\ranma42\code\fdo\pixman\test\fetch-test.c(114) :
warning C4715: 'reader' : not all control paths return a value
c:\cygwin\home\ranma42\code\fdo\pixman\test\stress-test.c(133) :
warning C4715: 'real_reader' : not all control paths return a value
c:\cygwin\home\ranma42\code\fdo\pixman\test\composite.c(431) :
warning C4715: 'calc_op' : not all control paths return a value

These functions end in a non-returning call (abort() / assert(0)). The
warnings can be silenced by adding a return after the termination call, if
we aim for a warning-free build on Windows.
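
A minimal sketch of that change, assuming a reader-style helper (the name
read_sized is hypothetical and this is not the code from the tests): the
extra return is never reached, it only gives every control path a return
value so that C4715 is not reported.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: read a value of 'size' bytes from 'src'.
 * 'src' must point to an object of the matching unsigned type. */
static uint32_t
read_sized (const void *src, int size)
{
    switch (size)
    {
    case 1:
        return *(const uint8_t *) src;
    case 2:
        return *(const uint16_t *) src;
    case 4:
        return *(const uint32_t *) src;
    default:
        abort ();
        return 0; /* not reached; present only to silence the warning */
    }
}
```

Some compilers may in turn report the added return as unreachable code, so
whether this is a net win depends on the warning levels in use.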




[Pixman] [PATCH 2/7] test: check correctness of 'bilinear_pad_repeat_get_scanline_bounds'

2011-02-22 Thread Siarhei Siamashka
From: Siarhei Siamashka 

Individual correctness check for the new bilinear scaling related
supplementary function. This test program uses a bit wider range
of input arguments, not covered by other tests.
---
 test/Makefile.am|2 +
 test/scaling-helpers-test.c |   93 +++
 2 files changed, 95 insertions(+), 0 deletions(-)
 create mode 100644 test/scaling-helpers-test.c

diff --git a/test/Makefile.am b/test/Makefile.am
index 057e9ce..9dc7219 100644
--- a/test/Makefile.am
+++ b/test/Makefile.am
@@ -13,6 +13,7 @@ TESTPROGRAMS =\
trap-crasher\
alpha-loop  \
scaling-crash-test  \
+   scaling-helpers-test\
gradient-crash-test \
alphamap\
stress-test \
@@ -33,6 +34,7 @@ alpha_loop_SOURCES = alpha-loop.c utils.c utils.h
 composite_SOURCES = composite.c utils.c utils.h
 gradient_crash_test_SOURCES = gradient-crash-test.c utils.c utils.h
 stress_test_SOURCES = stress-test.c utils.c utils.h
+scaling_helpers_test_SOURCES = scaling-helpers-test.c utils.c utils.h
 
 # Benchmarks
 
diff --git a/test/scaling-helpers-test.c b/test/scaling-helpers-test.c
new file mode 100644
index 000..c186138
--- /dev/null
+++ b/test/scaling-helpers-test.c
@@ -0,0 +1,93 @@
+#include 
+#include 
+#include 
+#include 
+#include 
+#include "utils.h"
+#include "pixman-fast-path.h"
+
+/* A trivial reference implementation for
+ * 'bilinear_pad_repeat_get_scanline_bounds'
+ */
+static void
+bilinear_pad_repeat_get_scanline_bounds_ref (int32_t        source_image_width,
+pixman_fixed_t vx_,
+pixman_fixed_t unit_x,
+int32_t *  left_pad,
+int32_t *  left_tz,
+int32_t *  width,
+int32_t *  right_tz,
+int32_t *  right_pad)
+{
+int w = *width;
+*left_pad = 0;
+*left_tz = 0;
+*width = 0;
+*right_tz = 0;
+*right_pad = 0;
+int64_t vx = vx_;
+while (--w >= 0)
+{
+   if (vx < 0)
+   {
+   if (vx + pixman_fixed_1 < 0)
+   *left_pad += 1;
+   else
+   *left_tz += 1;
+   }
+   else if (vx + pixman_fixed_1 >= pixman_int_to_fixed (source_image_width))
+   {
+   if (vx >= pixman_int_to_fixed (source_image_width))
+   *right_pad += 1;
+   else
+   *right_tz += 1;
+   }
+   else
+   {
+   *width += 1;
+   }
+   vx += unit_x;
+}
+}
+
+int
+main (void)
+{
+int i;
+for (i = 0; i < 1; i++)
+{
+   int32_t left_pad1, left_tz1, width1, right_tz1, right_pad1;
+   int32_t left_pad2, left_tz2, width2, right_tz2, right_pad2;
+   pixman_fixed_t vx = lcg_rand_N(1 << 16) - (3000 << 16);
+   int32_t width = lcg_rand_N(1);
+   int32_t source_image_width = lcg_rand_N(1) + 1;
+   pixman_fixed_t unit_x = lcg_rand_N(10 << 16) + 1;
+   width1 = width2 = width;
+
+   bilinear_pad_repeat_get_scanline_bounds_ref (source_image_width,
+vx,
+unit_x,
+&left_pad1,
+&left_tz1,
+&width1,
+&right_tz1,
+&right_pad1);
+
+   bilinear_pad_repeat_get_scanline_bounds (source_image_width,
+vx,
+unit_x,
+&left_pad2,
+&left_tz2,
+&width2,
+&right_tz2,
+&right_pad2);
+
+   assert (left_pad1 == left_pad2);
+   assert (left_tz1 == left_tz2);
+   assert (width1 == width2);
+   assert (right_tz1 == right_tz2);
+   assert (right_pad1 == right_pad2);
+}
+
+return 0;
+}
-- 
1.7.3.4



[Pixman] [PATCH 7/7] ARM: NEON optimization for bilinear scaled 'src_8888_8888'

2011-02-22 Thread Siarhei Siamashka
From: Siarhei Siamashka 

Initial NEON optimization for bilinear scaling. It can probably be
improved further.

Benchmark on ARM Cortex-A8:
 Microbenchmark (scaling 2000x2000 image with scale factor close to 1x):
  before: op=1, src=2002, dst=2002, speed=10.72 MPix/s
  after:  op=1, src=2002, dst=2002, speed=44.27 MPix/s
---
 pixman/pixman-arm-neon-asm.S |  197 ++
 pixman/pixman-arm-neon.c |   45 ++
 2 files changed, 242 insertions(+), 0 deletions(-)

diff --git a/pixman/pixman-arm-neon-asm.S b/pixman/pixman-arm-neon-asm.S
index 47daf45..c168e10 100644
--- a/pixman/pixman-arm-neon-asm.S
+++ b/pixman/pixman-arm-neon-asm.S
@@ -2391,3 +2391,200 @@ generate_composite_function_nearest_scanline \
 10,  /* dst_r_basereg */ \
 8,  /* src_basereg   */ \
 15  /* mask_basereg  */
+
+/**/
+
+/* Supplementary macro for setting function attributes */
+.macro pixman_asm_function fname
+.func fname
+.global fname
+#ifdef __ELF__
+.hidden fname
+.type fname, %function
+#endif
+fname:
+.endm
+
+.macro bilinear_interpolate_last_pixel
+mov   TMP1, X, asr #16
+mov   TMP2, X, asr #16
+add   TMP1, TOP, TMP1, asl #2
+add   TMP2, BOTTOM, TMP2, asl #2
+vld1.32   {d0}, [TMP1]
+vshr.u16  d30, d24, #8
+vld1.32   {d1}, [TMP2]
+vmull.u8  q1, d0, d28
+vmlal.u8  q1, d1, d29
+/* 5 cycles bubble */
+vshll.u16 q0, d2, #8
+vmlsl.u16 q0, d2, d30
+vmlal.u16 q0, d3, d30
+/* 5 cycles bubble */
+vshrn.u32 d0, q0, #16
+/* 3 cycles bubble */
+vmovn.u16 d0, q0
+/* 1 cycle bubble */
+vst1.32   {d0[0]}, [OUT, :32]!
+.endm
+
+.macro bilinear_interpolate_two_pixels
+mov   TMP1, X, asr #16
+mov   TMP2, X, asr #16
+add   X, X, UX
+add   TMP1, TOP, TMP1, asl #2
+add   TMP2, BOTTOM, TMP2, asl #2
+vld1.32   {d0}, [TMP1]
+vld1.32   {d1}, [TMP2]
+vmull.u8  q1, d0, d28
+vmlal.u8  q1, d1, d29
+mov   TMP1, X, asr #16
+mov   TMP2, X, asr #16
+add   X, X, UX
+add   TMP1, TOP, TMP1, asl #2
+add   TMP2, BOTTOM, TMP2, asl #2
+vld1.32   {d20}, [TMP1]
+vld1.32   {d21}, [TMP2]
+vmull.u8  q11, d20, d28
+vmlal.u8  q11, d21, d29
+vshr.u16  q15, q12, #8
+vadd.u16  q12, q12, q13
+vshll.u16 q0, d2, #8
+vmlsl.u16 q0, d2, d30
+vmlal.u16 q0, d3, d30
+vshll.u16 q10, d22, #8
+vmlsl.u16 q10, d22, d31
+vmlal.u16 q10, d23, d31
+vshrn.u32 d30, q0, #16
+vshrn.u32 d31, q10, #16
+vmovn.u16 d0, q15
+vst1.32   {d0}, [OUT]!
+.endm
+
+.macro bilinear_interpolate_four_pixels
+mov   TMP1, X, asr #16
+mov   TMP2, X, asr #16
+add   X, X, UX
+add   TMP1, TOP, TMP1, asl #2
+add   TMP2, BOTTOM, TMP2, asl #2
+vld1.32   {d0}, [TMP1]
+vld1.32   {d1}, [TMP2]
+vmull.u8  q1, d0, d28
+vmlal.u8  q1, d1, d29
+mov   TMP1, X, asr #16
+mov   TMP2, X, asr #16
+add   X, X, UX
+add   TMP1, TOP, TMP1, asl #2
+add   TMP2, BOTTOM, TMP2, asl #2
+vld1.32   {d20}, [TMP1]
+vld1.32   {d21}, [TMP2]
+vmull.u8  q11, d20, d28
+vmlal.u8  q11, d21, d29
+vshr.u16  q15, q12, #8
+vadd.u16  q12, q12, q13
+vshll.u16 q0, d2, #8
+vmlsl.u16 q0, d2, d30
+vmlal.u16 q0, d3, d30
+vshll.u16 q10, d22, #8
+vmlsl.u16 q10, d22, d31
+vmlal.u16 q10, d23, d31
+mov   TMP1, X, asr #16
+mov   TMP2, X, asr #16
+add   X, X, UX
+add   TMP1, TOP, TMP1, asl #2
+add   TMP2, BOTTOM, TMP2, asl #2
+vld1.32   {d4}, [TMP1]
+vld1.32   {d5}, [TMP2]
+vmull.u8  q3, d4, d28
+vmlal.u8  q3, d5, d29
+mov   TMP1, X, asr #16
+mov   TMP2, X, asr #16
+add   X, X, UX
+add   TMP1, TOP, TMP1, asl #2
+add   TMP2, BOTTOM, TMP2, asl #2
+vld1.32   {d16}, [TMP1]
+vld1.32   {d17}, [TMP2]
+vmull.u8  q9, d16, d28
+vmlal.u8  q9, d17, d29
+vshr.u16  q15, q12, #8
+vadd.u16  q12, q12, q13
+vshll.u16 q2, d6, #8
+vmlsl.u16 q2, d6, d30
+vmlal.u16 q2, d7, d30
+vshll.u16 q8, d18, #8
+vmlsl.u16 q8, d18, d31
+vmlal.u16 q8, d19, d31
+vshrn.u32 d0, q0, #16
+vshrn.u32 d1, q10, #16
+vshrn.u32 d4, q2, #16
+vshrn.u32 d5, q8, #16
+vmovn.u16 d0, q0
+vmovn.u16 d1, q2
+vst1.32   {d0, d1}, [OUT]!
+.endm
+
+
+/*
+ * pixman_scaled_bilinear_scanline___SRC (uint32_t *   out,
+ *const uint32_t * top,
+ *const uint32_t * bottom,
+ *int  wt,
+ *int  wb,
+ *pixman_fixed_t   x,
+ *

[Pixman] [PATCH 6/7] SSE2 optimization for bilinear scaled 'src_8888_8888'

2011-02-22 Thread Siarhei Siamashka
From: Siarhei Siamashka 

A primitive naive implementation of bilinear scaling using SSE2 intrinsics,
which only handles one pixel at a time. It is approximately 2x faster than
the C variant (loop unrolling contributes ~20% of this speedup).

Benchmark on Intel Core i7:
 Using cairo-perf-trace:
  before: image   firefox-planet-gnome   12.019   12.054   0.15%   5/6
  after:  image   firefox-planet-gnome   10.961   11.013   0.19%   5/6

 Microbenchmark (scaling 2000x2000 image with scale factor close to 1x):
  before: op=1, src=2002, dst=2002, speed=82.61 MPix/s
  after:  op=1, src=2002, dst=2002, speed=165.38 MPix/s
---
 pixman/pixman-sse2.c |  112 ++
 1 files changed, 112 insertions(+), 0 deletions(-)

diff --git a/pixman/pixman-sse2.c b/pixman/pixman-sse2.c
index 88287b4..696005f 100644
--- a/pixman/pixman-sse2.c
+++ b/pixman/pixman-sse2.c
@@ -5567,6 +5567,114 @@ FAST_NEAREST_MAINLOOP_COMMON (sse2__n__none_OVER,
  scaled_nearest_scanline_sse2__n__OVER,
  uint32_t, uint32_t, uint32_t, NONE, TRUE, TRUE)
 
+static void
+bilinear_interpolate_line_sse2 (uint32_t *   out,
+const uint32_t * top,
+const uint32_t * bottom,
+int  wt,
+int  wb,
+pixman_fixed_t   x,
+pixman_fixed_t   ux,
+int  width)
+{
+const __m128i xmm_wt = _mm_set_epi16 (wt, wt, wt, wt, wt, wt, wt, wt);
+const __m128i xmm_wb = _mm_set_epi16 (wb, wb, wb, wb, wb, wb, wb, wb);
+const __m128i xmm_xorc = _mm_set_epi16 (0, 0, 0, 0, 0xff, 0xff, 0xff, 0xff);
+const __m128i xmm_addc = _mm_set_epi16 (0, 0, 0, 0, 1, 1, 1, 1);
+const __m128i xmm_ux = _mm_set_epi16 (ux, ux, ux, ux, ux, ux, ux, ux);
+const __m128i xmm_zero = _mm_setzero_si128 ();
+__m128i xmm_x = _mm_set_epi16 (x, x, x, x, x, x, x, x);
+uint32_t pix1, pix2, pix3, pix4;
+
+#define INTERPOLATE_ONE_PIXEL(pix) \
+do { \
+   __m128i xmm_wh, xmm_lo, xmm_hi, a; \
+   /* fetch 2x2 pixel block into sse2 register */ \
+   uint32_t tl = top [pixman_fixed_to_int (x)]; \
+   uint32_t tr = top [pixman_fixed_to_int (x) + 1]; \
+   uint32_t bl = bottom [pixman_fixed_to_int (x)]; \
+   uint32_t br = bottom [pixman_fixed_to_int (x) + 1]; \
+   a = _mm_set_epi32 (tr, tl, br, bl); \
+   x += ux; \
+   /* vertical interpolation */ \
+   a = _mm_add_epi16 (_mm_mullo_epi16 (_mm_unpackhi_epi8 (a, xmm_zero), \
+                                       xmm_wt), \
+                      _mm_mullo_epi16 (_mm_unpacklo_epi8 (a, xmm_zero), \
+                                       xmm_wb)); \
+   /* calculate horizontal weights */ \
+   xmm_wh = _mm_add_epi16 (xmm_addc, \
+                           _mm_xor_si128 (xmm_xorc, \
+                                          _mm_srli_epi16 (xmm_x, 8))); \
+   xmm_x = _mm_add_epi16 (xmm_x, xmm_ux); \
+   /* horizontal interpolation */ \
+   xmm_lo = _mm_mullo_epi16 (a, xmm_wh); \
+   xmm_hi = _mm_mulhi_epu16 (a, xmm_wh); \
+   a = _mm_add_epi32 (_mm_unpacklo_epi16 (xmm_lo, xmm_hi), \
+                      _mm_unpackhi_epi16 (xmm_lo, xmm_hi)); \
+   /* shift and pack the result */ \
+   a = _mm_srli_epi32 (a, 16); \
+   a = _mm_packs_epi32 (a, a); \
+   a = _mm_packus_epi16 (a, a); \
+   pix = _mm_cvtsi128_si32 (a); \
+} while (0)
+
+while ((width -= 4) >= 0)
+{
+   INTERPOLATE_ONE_PIXEL (pix1);
+   INTERPOLATE_ONE_PIXEL (pix2);
+   INTERPOLATE_ONE_PIXEL (pix3);
+   INTERPOLATE_ONE_PIXEL (pix4);
+   *out++ = pix1;
+   *out++ = pix2;
+   *out++ = pix3;
+   *out++ = pix4;
+}
+if (width & 2)
+{
+   INTERPOL

[Pixman] [PATCH 5/7] C variant of bilinear scaled 'src_8888_n_8888' fast path

2011-02-22 Thread Siarhei Siamashka
From: Siarhei Siamashka 

Serves no real practical purpose other than testing solid mask support
in bilinear scaling main loop template.
---
 pixman/pixman-fast-path.c |   80 +
 1 files changed, 80 insertions(+), 0 deletions(-)

diff --git a/pixman/pixman-fast-path.c b/pixman/pixman-fast-path.c
index a2125c0..fdaad64 100644
--- a/pixman/pixman-fast-path.c
+++ b/pixman/pixman-fast-path.c
@@ -1670,6 +1670,82 @@ FAST_BILINEAR_MAINLOOP_COMMON (_8__none_SRC,
  uint32_t, uint8_t, uint32_t,
  NONE, TRUE, FALSE)
 
+static void
+bilinear_interpolate_s_line (uint32_t *   dst,
+const uint32_t * mask,
+const uint32_t * top_row,
+const uint32_t * bottom_row,
+int  wt,
+int  wb,
+pixman_fixed_t   x,
+pixman_fixed_t   ux,
+int  width)
+{
+uint8_t m = *mask >> 24;
+while (--width >= 0)
+{
+   if (m)
+   {
+   uint32_t s;
+   uint32_t tl, tr, bl, br;
+   int distx;
+
+   tl = top_row [pixman_fixed_to_int (x)];
+   tr = top_row [pixman_fixed_to_int (x) + 1];
+   bl = bottom_row [pixman_fixed_to_int (x)];
+   br = bottom_row [pixman_fixed_to_int (x) + 1];
+
+   distx = (x >> 8) & 0xff;
+
+   s = bilinear_interpolation (tl, tr, bl, br, distx, wt, wb);
+   if (m == 0xff)
+   {
+   *dst = s;
+   }
+   else
+   {
+   *dst = in (s, m);
+   }
+   }
+   else
+   {
+   *dst = 0;
+   }
+   x += ux;
+   dst++;
+}
+}
+
+static force_inline void
+scaled_bilinear_scanline__n__SRC (uint32_t *   dst,
+ const uint32_t * mask,
+ const uint32_t * src_top,
+ const uint32_t * src_bottom,
+ int32_t  w,
+ int  wt,
+ int  wb,
+ pixman_fixed_t   vx,
+ pixman_fixed_t   unit_x,
+ pixman_fixed_t   max_vx,
+ pixman_bool_t    zero_src)
+{
+bilinear_interpolate_s_line (dst, mask, src_top, src_bottom,
+wt, wb, vx, unit_x, w);
+}
+
+FAST_BILINEAR_MAINLOOP_COMMON (_n__cover_SRC,
+ scaled_bilinear_scanline__n__SRC,
+ uint32_t, uint32_t, uint32_t,
+ COVER, TRUE, TRUE)
+FAST_BILINEAR_MAINLOOP_COMMON (_n__pad_SRC,
+ scaled_bilinear_scanline__n__SRC,
+ uint32_t, uint32_t, uint32_t,
+ PAD, TRUE, TRUE)
+FAST_BILINEAR_MAINLOOP_COMMON (_n__none_SRC,
+ scaled_bilinear_scanline__n__SRC,
+ uint32_t, uint32_t, uint32_t,
+ NONE, TRUE, TRUE)
+
 static force_inline uint32_t
 fetch_nearest (pixman_repeat_t src_repeat,
   pixman_format_code_t format,
@@ -2197,6 +2273,10 @@ static const pixman_fast_path_t c_fast_paths[] =
 SIMPLE_BILINEAR_A8_MASK_FAST_PATH (SRC, a8r8g8b8, x8r8g8b8, _8_),
 SIMPLE_BILINEAR_A8_MASK_FAST_PATH (SRC, x8r8g8b8, x8r8g8b8, _8_),
 
+SIMPLE_BILINEAR_SOLID_MASK_FAST_PATH (SRC, a8r8g8b8, a8r8g8b8, _n_),
+SIMPLE_BILINEAR_SOLID_MASK_FAST_PATH (SRC, a8r8g8b8, x8r8g8b8, _n_),
+SIMPLE_BILINEAR_SOLID_MASK_FAST_PATH (SRC, x8r8g8b8, x8r8g8b8, _n_),
+
 #define NEAREST_FAST_PATH(op,s,d)  \
 {   PIXMAN_OP_ ## op,  \
PIXMAN_ ## s, SCALED_NEAREST_FLAGS, \
-- 
1.7.3.4



[Pixman] [PATCH 4/7] C variant of bilinear scaled 'src_8888_8_8888' fast path

2011-02-22 Thread Siarhei Siamashka
From: Siarhei Siamashka 

Serves no real practical purpose other than testing a8 mask support
in bilinear scaling main loop template.
---
 pixman/pixman-fast-path.c |   80 +
 1 files changed, 80 insertions(+), 0 deletions(-)

diff --git a/pixman/pixman-fast-path.c b/pixman/pixman-fast-path.c
index 1e3094e..a2125c0 100644
--- a/pixman/pixman-fast-path.c
+++ b/pixman/pixman-fast-path.c
@@ -1594,6 +1594,82 @@ FAST_BILINEAR_MAINLOOP_COMMON (__none_SRC,
   uint32_t, uint32_t, uint32_t,
   NONE, FALSE, FALSE)
 
+static void
+bilinear_interpolate_a8_line (uint32_t *   dst,
+ const uint8_t *  mask,
+ const uint32_t * top_row,
+ const uint32_t * bottom_row,
+ int  wt,
+ int  wb,
+ pixman_fixed_t   x,
+ pixman_fixed_t   ux,
+ int  width)
+{
+while (--width >= 0)
+{
+   uint8_t m = *mask++;
+   if (m)
+   {
+   uint32_t s;
+   uint32_t tl, tr, bl, br;
+   int distx;
+
+   tl = top_row [pixman_fixed_to_int (x)];
+   tr = top_row [pixman_fixed_to_int (x) + 1];
+   bl = bottom_row [pixman_fixed_to_int (x)];
+   br = bottom_row [pixman_fixed_to_int (x) + 1];
+
+   distx = (x >> 8) & 0xff;
+
+   s = bilinear_interpolation (tl, tr, bl, br, distx, wt, wb);
+   if (m == 0xff)
+   {
+   *dst = s;
+   }
+   else
+   {
+   *dst = in (s, m);
+   }
+   }
+   else
+   {
+   *dst = 0;
+   }
+   x += ux;
+   dst++;
+}
+}
+
+static force_inline void
+scaled_bilinear_scanline__8__SRC (uint32_t *   dst,
+ const uint8_t *  mask,
+ const uint32_t * src_top,
+ const uint32_t * src_bottom,
+ int32_t  w,
+ int  wt,
+ int  wb,
+ pixman_fixed_t   vx,
+ pixman_fixed_t   unit_x,
+ pixman_fixed_t   max_vx,
+ pixman_bool_t    zero_src)
+{
+bilinear_interpolate_a8_line (dst, mask, src_top, src_bottom,
+ wt, wb, vx, unit_x, w);
+}
+
+FAST_BILINEAR_MAINLOOP_COMMON (_8__cover_SRC,
+ scaled_bilinear_scanline__8__SRC,
+ uint32_t, uint8_t, uint32_t,
+ COVER, TRUE, FALSE)
+FAST_BILINEAR_MAINLOOP_COMMON (_8__pad_SRC,
+ scaled_bilinear_scanline__8__SRC,
+ uint32_t, uint8_t, uint32_t,
+ PAD, TRUE, FALSE)
+FAST_BILINEAR_MAINLOOP_COMMON (_8__none_SRC,
+ scaled_bilinear_scanline__8__SRC,
+ uint32_t, uint8_t, uint32_t,
+ NONE, TRUE, FALSE)
+
 static force_inline uint32_t
 fetch_nearest (pixman_repeat_t src_repeat,
   pixman_format_code_t format,
@@ -2117,6 +2193,10 @@ static const pixman_fast_path_t c_fast_paths[] =
 SIMPLE_BILINEAR_FAST_PATH (SRC, x8r8g8b8, x8r8g8b8, _),
 SIMPLE_BILINEAR_FAST_PATH (SRC, x8b8g8r8, x8b8g8r8, _),
 
+SIMPLE_BILINEAR_A8_MASK_FAST_PATH (SRC, a8r8g8b8, a8r8g8b8, _8_),
+SIMPLE_BILINEAR_A8_MASK_FAST_PATH (SRC, a8r8g8b8, x8r8g8b8, _8_),
+SIMPLE_BILINEAR_A8_MASK_FAST_PATH (SRC, x8r8g8b8, x8r8g8b8, _8_),
+
 #define NEAREST_FAST_PATH(op,s,d)  \
 {   PIXMAN_OP_ ## op,  \
PIXMAN_ ## s, SCALED_NEAREST_FLAGS, \
-- 
1.7.3.4



[Pixman] [PATCH 3/7] C variant of bilinear scaled 'src_8888_8888' fast path

2011-02-22 Thread Siarhei Siamashka
From: Siarhei Siamashka 

Because it does the scaling in a single pass without temporary buffers, it
is a bit faster than the general path on x86 (and provides an even better
speedup on MIPS and ARM).

Benchmark on Intel Core i7:
 Using cairo-perf-trace:
  before: image   firefox-planet-gnome   12.566   12.610   0.23%   6/6
  after:  image   firefox-planet-gnome   12.019   12.054   0.15%   5/6

 Microbenchmark (scaling 2000x2000 image with scale factor close to 1x):
  before: op=1, src=2002, dst=2002, speed=70.48 MPix/s
  after:  op=1, src=2002, dst=2002, speed=82.61 MPix/s

Benchmark on ARM Cortex-A8:
 Microbenchmark (scaling 2000x2000 image with scale factor close to 1x):
  before: op=1, src=2002, dst=2002, speed=6.70 MPix/s
  after:  op=1, src=2002, dst=2002, speed=10.72 MPix/s

Benchmark on MIPS 24K:
 Microbenchmark (scaling 2000x2000 image with scale factor close to 1x):
  before: op=1, src=2002, dst=2002, speed=5.12 MPix/s
  after:  op=1, src=2002, dst=2002, speed=6.96 MPix/s

 Microbenchmark (scaling 500x500 image with scale factor close to 1x):
  before: op=1, src=2002, dst=2002, speed=5.26 MPix/s
  after:  op=1, src=2002, dst=2002, speed=7.00 MPix/s
---
 pixman/pixman-fast-path.c |  144 +
 1 files changed, 144 insertions(+), 0 deletions(-)

diff --git a/pixman/pixman-fast-path.c b/pixman/pixman-fast-path.c
index 92f0308..1e3094e 100644
--- a/pixman/pixman-fast-path.c
+++ b/pixman/pixman-fast-path.c
@@ -1458,6 +1458,143 @@ FAST_NEAREST_MAINLOOP (565_565_pad_SRC,
   uint16_t, uint16_t, PAD)
 
 static force_inline uint32_t
+bilinear_interpolation (uint32_t tl, uint32_t tr,
+   uint32_t bl, uint32_t br,
+   int distx, int wt, int wb)
+{
+#if SIZEOF_LONG > 4
+uint64_t distxy, distxiy, distixy, distixiy;
+uint64_t tl64, tr64, bl64, br64;
+uint64_t f, r;
+
+distxy = distx * wb;
+distxiy = distx * wt;
+distixy = wb * (256 - distx);
+distixiy = (256 - distx) * wt;
+
+/* Alpha and Blue */
+tl64 = tl & 0xff0000ff;
+tr64 = tr & 0xff0000ff;
+bl64 = bl & 0xff0000ff;
+br64 = br & 0xff0000ff;
+
+f = tl64 * distixiy + tr64 * distxiy + bl64 * distixy + br64 * distxy;
+r = f & 0x0000ff0000ff0000ull;
+
+/* Red and Green */
+tl64 = tl;
+tl64 = ((tl64 << 16) & 0x000000ff00000000ull) | (tl64 & 0x0000ff00ull);
+
+tr64 = tr;
+tr64 = ((tr64 << 16) & 0x000000ff00000000ull) | (tr64 & 0x0000ff00ull);
+
+bl64 = bl;
+bl64 = ((bl64 << 16) & 0x000000ff00000000ull) | (bl64 & 0x0000ff00ull);
+
+br64 = br;
+br64 = ((br64 << 16) & 0x000000ff00000000ull) | (br64 & 0x0000ff00ull);
+
+f = tl64 * distixiy + tr64 * distxiy + bl64 * distixy + br64 * distxy;
+r |= ((f >> 16) & 0x000000ff00000000ull) | (f & 0xff000000ull);
+
+return (uint32_t)(r >> 16);
+#else
+int distxy, distxiy, distixy, distixiy;
+uint32_t f, r;
+
+distxy = distx * wb;
+distxiy = distx * wt;
+distixy = wb * (256 - distx);
+distixiy = (256 - distx) * wt;
+
+/* Blue */
+r = (tl & 0x00ff) * distixiy + (tr & 0x00ff) * distxiy
+  + (bl & 0x00ff) * distixy  + (br & 0x00ff) * distxy;
+
+/* Green */
+f = (tl & 0xff00) * distixiy + (tr & 0xff00) * distxiy
+  + (bl & 0xff00) * distixy  + (br & 0xff00) * distxy;
+r |= f & 0xff000000;
+
+tl >>= 16;
+tr >>= 16;
+bl >>= 16;
+br >>= 16;
+r >>= 16;
+
+/* Red */
+f = (tl & 0x00ff) * distixiy + (tr & 0x00ff) * distxiy
+  + (bl & 0x00ff) * distixy  + (br & 0x00ff) * distxy;
+r |= f & 0x00ff0000;
+
+/* Alpha */
+f = (tl & 0xff00) * distixiy + (tr & 0xff00) * distxiy
+  + (bl & 0xff00) * distixy  + (br & 0xff00) * distxy;
+r |= f & 0xff000000;
+
+return r;
+#endif
+}
+
+static void
+bilinear_interpolate_line (uint32_t *   buffer,
+  const uint32_t * top_row,
+  const uint32_t * bottom_row,
+  int  wt,
+  int  wb,
+  pixman_fixed_t   x,
+  pixman_fixed_t   ux,
+  int  width)
+{
+while (--width >= 0)
+{
+   uint32_t tl, tr, bl, br;
+   int distx;
+
+   tl = top_row [pixman_fixed_to_int (x)];
+   tr = top_row [pixman_fixed_to_int (x) + 1];
+   bl = bottom_row [pixman_fixed_to_int (x)];
+   br = bottom_row [pixman_fixed_to_int (x) + 1];
+
+   distx = (x >> 8) & 0xff;
+
+   *buffer++ = bilinear_interpolation (tl, tr, bl, br, distx, wt, wb);
+
+   x += ux;
+}
+}
+
+static force_inline void
+scaled_bilinear_scanline_8888_8888_SRC (uint32_t *   dst,
+   const uint32_t * mask,
+   const 

[Pixman] [PATCH 1/7] Main loop template for fast single pass bilinear scaling

2011-02-22 Thread Siarhei Siamashka
From: Siarhei Siamashka 

Can be used for implementing SIMD optimized fast path
functions which work with bilinear scaled source images.

Similar to the template for nearest scaling main loop, the
following types of mask are supported:
1. no mask
2. non-scaled a8 mask with SAMPLES_COVER_CLIP flag
3. solid mask

PAD repeat is fully supported. NONE repeat is partially
supported (right now only works if source image has alpha
channel or when alpha channel of the source image does not
have any effect on the compositing operation).
---
 pixman/pixman-fast-path.h |  432 +
 1 files changed, 432 insertions(+), 0 deletions(-)

diff --git a/pixman/pixman-fast-path.h b/pixman/pixman-fast-path.h
index d081222..1885d47 100644
--- a/pixman/pixman-fast-path.h
+++ b/pixman/pixman-fast-path.h
@@ -587,4 +587,436 @@ fast_composite_scaled_nearest  ## scale_func_name (pixman_implementation_t *imp,
 SIMPLE_NEAREST_SOLID_MASK_FAST_PATH_NONE (op,s,d,func),\
 SIMPLE_NEAREST_SOLID_MASK_FAST_PATH_PAD (op,s,d,func)
 
+/*****************************************************************************/
+
+/*
+ * Identify 5 zones in each scanline for bilinear scaling. Depending on
+ * whether 2 pixels to be interpolated are fetched from the image itself,
+ * from the padding area around it or from both image and padding area.
+ */
+static force_inline void
+bilinear_pad_repeat_get_scanline_bounds (int32_t source_image_width,
+pixman_fixed_t  vx,
+pixman_fixed_t  unit_x,
+int32_t *   left_pad,
+int32_t *   left_tz,
+int32_t *   width,
+int32_t *   right_tz,
+int32_t *   right_pad)
+{
+   int width1 = *width, left_pad1, right_pad1;
+   int width2 = *width, left_pad2, right_pad2;
+
+   pad_repeat_get_scanline_bounds (source_image_width, vx, unit_x,
+   &width1, &left_pad1, &right_pad1);
+   pad_repeat_get_scanline_bounds (source_image_width, vx + pixman_fixed_1,
+   unit_x, &width2, &left_pad2, &right_pad2);
+
+   *left_pad = left_pad2;
+   *left_tz = left_pad1 - left_pad2;
+   *right_tz = right_pad2 - right_pad1;
+   *right_pad = right_pad1;
+   *width -= *left_pad + *left_tz + *right_tz + *right_pad;
+}
+
+/*
+ * Main loop template for single pass bilinear scaling. It needs to be
+ * provided with 'scanline_func' which should do the compositing operation.
+ * The needed function has the following prototype:
+ *
+ * scanline_func (dst_type_t *        dst,
+ *                const mask_type_t * mask,
+ *                const src_type_t *  src_top,
+ *                const src_type_t *  src_bottom,
+ *                int32_t             width,
+ *                int                 weight_top,
+ *                int                 weight_bottom,
+ *                pixman_fixed_t      vx,
+ *                pixman_fixed_t      unit_x,
+ *                pixman_fixed_t      max_vx,
+ *                pixman_bool_t       zero_src)
+ *
+ * Where:
+ *  dst                 - destination scanline buffer for storing results
+ *  mask                - mask buffer (or single value for solid mask)
+ *  src_top, src_bottom - two source scanlines
+ *  width               - number of pixels to process
+ *  weight_top          - weight of the top row for interpolation
+ *  weight_bottom       - weight of the bottom row for interpolation
+ *  vx                  - initial position for fetching the first pair of
+ *                        pixels from the source buffer
+ *  unit_x              - position increment needed to move to the next pair
+ *                        of pixels
+ *  max_vx              - image size as a fixed point value, can be used for
+ *                        implementing NORMAL repeat (when it is supported)
+ *  zero_src            - boolean hint variable, which is set to TRUE when
+ *                        all source pixels are fetched from zero padding
+ *                        zone for NONE repeat
+ *
+ * Note: normally the sum of 'weight_top' and 'weight_bottom' is equal to 256,
+ *   but sometimes it may be less than that for NONE repeat when handling
+ *   fuzzy antialiased top or bottom image edges. Also both top and
+ *   bottom weight variables are guaranteed to have value in 0-255
+ *   range and can fit into unsigned byte or be used with 8-bit SIMD
+ *   multiplication instructions.
+ */
+#define FAST_BILINEAR_MAINLOOP_INT(scale_func_name, scanline_func, src_type_t, mask_type_t,\
+ dst_type_t, repeat_mode, have_mask, mask_is_solid)\
+static void   

[Pixman] [PATCH 0/7] SIMD optimizations for bilinear scaling

2011-02-22 Thread Siarhei Siamashka
From: Siarhei Siamashka 

This patch series introduces support for creating specialized
bilinear fast path functions which perform processing in a single
pass without intermediate temporary buffers and also can make
efficient use of SIMD optimizations. The performance critical
code is implemented as scanline processing functions with main
loop logic being reused via common macro template. Such scanline
processing functions are simple enough to implement and at the
same time large enough not to constrain optimization opportunities
and possibilities to do loop unrolling for processing multiple
pixels per iteration.

As a result, bilinear scaled 'src_8888_8888' operation (simple
scaled copy of the image) becomes more than 2 times faster
with SSE2 and more than 6 times faster with ARM NEON when
compared to the general pixman compositing path. And single
pass processing alone is providing some modest, but measurable
speedup even without SIMD.
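
For readers who want a plain reference to compare an optimized scanline function against, here is a minimal sketch of what a bilinear scaler computes for one a8r8g8b8 pixel. It is not taken from the patches: the name `bilinear_ref` is hypothetical, the channels are handled one at a time instead of being packed, and it assumes a horizontal weight in the 0-255 range and vertical weights that sum to at most 256.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical reference helper (not part of pixman): interpolate one
 * a8r8g8b8 pixel from its four neighbours, one channel at a time.
 * distx is the horizontal weight of the right-hand pixels (0-255);
 * wt and wb are the weights of the top and bottom rows. */
static uint32_t
bilinear_ref (uint32_t tl, uint32_t tr, uint32_t bl, uint32_t br,
              int distx, int wt, int wb)
{
    uint32_t result = 0;
    int shift;

    assert (distx >= 0 && distx < 256);
    assert (wt >= 0 && wb >= 0 && wt + wb <= 256);

    for (shift = 0; shift < 32; shift += 8)
    {
        /* Horizontal step: blend left and right neighbours in each row */
        uint32_t top = ((tl >> shift) & 0xff) * (uint32_t) (256 - distx)
                     + ((tr >> shift) & 0xff) * (uint32_t) distx;
        uint32_t bot = ((bl >> shift) & 0xff) * (uint32_t) (256 - distx)
                     + ((br >> shift) & 0xff) * (uint32_t) distx;

        /* Vertical step: blend the two rows, total weight is up to 2^16 */
        uint32_t c = (top * (uint32_t) wt + bot * (uint32_t) wb) >> 16;

        result |= c << shift;
    }

    return result;
}
```

When all four neighbours hold the same color and the vertical weights sum to 256, the total weight is exactly 2^16 and the color comes back unchanged, which is a convenient sanity check for any faster variant.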

I'm almost exclusively interested in ARM NEON and did not spend
any extra time on tuning this SSE2 code, so the SSE2 scaler may
actually not be good enough. Nevertheless it is still faster than C.

The disadvantage of this method is the high specialization, so that
each particular type of compositing operation needs its own fast path
code. But it does not prevent us from also adding universal SIMD
optimized fetchers later. Anyway, adding specialized fast paths is
the way to go when targeting best performance for some of the most
common operations. I'll try to add more SIMD optimized bilinear fast
path functions shortly, based on analyzing cairo-traces and profiling
real use cases.

The same patches are also available in the following branch:
http://cgit.freedesktop.org/~siamashka/pixman/log/?h=sent/bilinear-scaling-simd-20110222


Siarhei Siamashka (7):
  Main loop template for fast single pass bilinear scaling
  test: check correctness of 'bilinear_pad_repeat_get_scanline_bounds'
  C variant of bilinear scaled 'src_8888_8888' fast path
  C variant of bilinear scaled 'src_8888_8_8888' fast path
  C variant of bilinear scaled 'src_8888_n_8888' fast path
  SSE2 optimization for bilinear scaled 'src_8888_8888'
  ARM: NEON optimization for bilinear scaled 'src_8888_8888'

 pixman/pixman-arm-neon-asm.S |  197 +++
 pixman/pixman-arm-neon.c |   45 +
 pixman/pixman-fast-path.c|  304 +
 pixman/pixman-fast-path.h|  432 ++
 pixman/pixman-sse2.c |  112 +++
 test/Makefile.am |2 +
 test/scaling-helpers-test.c  |   93 +
 7 files changed, 1185 insertions(+), 0 deletions(-)
 create mode 100644 test/scaling-helpers-test.c

-- 
1.7.3.4



Re: [Pixman] Image scaling with bilinear interpolation performance

2011-02-22 Thread Soeren Sandmann
김태균  writes:

> original code : r = a*t0 + b*t1 + c*t2 + d*t3 (in 24 bits precision)
> optimized code : r' = a*(t0 >> 8) + b*(t1 >> 8) + c*(t2 >> 8) + d*(t3 >> 8)
> (in 16 bits precision)
> where t0 + t1 + t2 + t3 = 0x10000
> 
> Now we split "t" into two terms u, v where u is upper 8 bits of t and v is
> lower 8 bits of t. (note that t0 = u0*256 + v0, t0 >> 8 = u0)
> 
> So,
> 
> r' = a*u0 + b*u1 + c*u2 + d*u3
> 
> r = a*(u0*256 + v0) + b*(u1*256 + v1) + c*(u2*256 + v2) + d*(u3*256 + v3)
>   = 256*(a*u0 + b*u1 + c*u2 + d*u3) + a*v0 + b*v1 + c*v2 + d*v3
>   = 256*r' + a*v0 + b*v1 + c*v2 + d*v3
> 
> Error would be
> e = (r - (r' << 8)) >> 16 = (r - 256*r') >> 16 = (a*v0 + b*v1 + c*v2 + d*v3) >> 16
> 
> Each value a, b, c and d can be 0xff at most, So
> 
> max(e) = (0xff*(v0 + v1 + v2 + v3)) >> 16 = (0xff*max(v0 + v1 + v2 + v3)) >> 16
> 
> max(v0 + v1 + v2 + v3) = 0x300 (because lower 8 bits of t0 + t1 + t2 + t3
> should be 0x00)
> 
> So max(e) = (0xff*0x300) >> 16 = 2
> 
> But this does not satisfy rule 5 as you mentioned

Thanks for doing this analysis. A difference of just 2 would be fine
in my opinion, and as you mention the original code was an
approximation as well.
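
The bound in the quoted analysis can also be checked by brute force. The following is only a sketch under an assumption: that the four 16-bit weights are the usual products of 8-bit 'distx' and 'disty' values, as in the C code discussed in this thread. The helper names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: largest value of v0 + v1 + v2 + v3 (the low bytes
 * of the four 16-bit weights) over all 8-bit distx/disty pairs. */
static uint32_t
max_low_byte_sum (void)
{
    uint32_t best = 0;
    uint32_t distx, disty;

    for (distx = 0; distx < 256; ++distx)
    {
        for (disty = 0; disty < 256; ++disty)
        {
            uint32_t t[4], sum = 0;
            int i;

            /* The four weights always add up to 0x10000 */
            t[0] = (256 - distx) * (256 - disty);
            t[1] = distx * (256 - disty);
            t[2] = (256 - distx) * disty;
            t[3] = distx * disty;

            for (i = 0; i < 4; ++i)
                sum += t[i] & 0xff;

            /* Low bytes must cancel out modulo 256 */
            assert ((sum & 0xff) == 0);

            if (sum > best)
                best = sum;
        }
    }

    return best;
}

/* Worst-case error for 8-bit color values, as in the quoted formula */
static uint32_t
error_bound (uint32_t low_byte_sum)
{
    return (0xff * low_byte_sum) >> 16;
}
```

Calling `error_bound (max_low_byte_sum ())` gives the worst case for this particular way of computing the weights; it cannot exceed the value of 2 derived above.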

It would be possible to satisfy rule 5 using a kind of error
diffusion, as demonstrated by this program:

#include <assert.h>
#include <stdint.h>

static void
compute_weights (uint8_t distx, uint8_t disty)
{
    uint32_t distxy, distxiy, distixy, distixiy;
    int e, t;

    /* Full precision weights, they add up to 256 * 256 */
    distxy = distx * disty;
    distxiy = (distx << 8) - distxy;
    distixy = (disty << 8) - distxy;
    distixiy = 256 * 256 - (disty << 8) - (distx << 8) + distxy;

    /* Round each weight to 8 bits and push the rounding error
     * into the next weight */
    t = distxy + 0x80;
    e = (t & 0xff00) - distxy;
    distxy = t >> 8;

    distxiy -= e;
    t = distxiy + 0x80;
    e = (t & 0xff00) - distxiy;
    distxiy = t >> 8;

    distixy -= e;
    t = distixy + 0x80;
    e = (t & 0xff00) - distixy;
    distixy = t >> 8;

    distixiy -= e;
    t = distixiy + 0x80;
    e = (t & 0xff00) - distixiy;
    distixiy = t >> 8;

    assert (distxy + distxiy + distixy + distixiy == 256);
}

int
main ()
{
    int i, j;

    for (i = 0; i < 256; ++i)
    {
        for (j = 0; j < 256; ++j)
            compute_weights (i, j);
    }
}

although that does do a bit more arithmetic than your code.
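
To illustrate why the weights summing to 256 matters, here is a small sketch that compares plain truncation of the weights with an error diffusion variant when scaling a solid color. All function names are hypothetical, and the diffusion loop is a compact restatement of the idea above rather than the exact code from this thread.

```c
#include <assert.h>
#include <stdint.h>

/* Full precision weights for 8-bit distx/disty, they sum to 0x10000 */
static void
full_weights (int distx, int disty, int full[4])
{
    assert (distx >= 0 && distx < 256 && disty >= 0 && disty < 256);

    full[0] = distx * disty;
    full[1] = distx * (256 - disty);
    full[2] = (256 - distx) * disty;
    full[3] = (256 - distx) * (256 - disty);
}

/* 8-bit weights by dropping the low byte, the sum may fall below 256 */
static void
weights_truncated (int distx, int disty, int w[4])
{
    int full[4], i;

    full_weights (distx, disty, full);
    for (i = 0; i < 4; ++i)
        w[i] = full[i] >> 8;
}

/* 8-bit weights by rounding, carrying each rounding error into the
 * next weight so that the sum stays at 256 */
static void
weights_diffused (int distx, int disty, int w[4])
{
    int full[4], i, carry = 0;

    full_weights (distx, disty, full);
    for (i = 0; i < 4; ++i)
    {
        int v = full[i] - carry;

        w[i] = (v + 0x80) >> 8;
        carry = (w[i] << 8) - v;
    }
}

/* Interpolate four identical 8-bit color values with the given weights */
static int
blend_solid (int c, const int w[4])
{
    return (c * (w[0] + w[1] + w[2] + w[3])) >> 8;
}
```

With distx = disty = 1, the truncated weights add up to 254, so a solid 0xff area comes out slightly darker, while the diffused weights keep it at 0xff.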

> > Now regarding accuracy. I have added some comments above regarding the
> > potential solid color issue, but this should be relatively easy to
> > address. I'm also a bit worried about one more thing (in the original
> > pixman code too, but let's cover this too while we are discussing
> > accuracy in general). Wouldn't it be a good idea to do shift with
> > rounding for the final value instead of dropping the fractional part?
> > And the 'distx'/'disty' variables are also obtained by right shifting
> > 'ux' by 8 and dropping fractional part, maybe rounding would be more
> > appropriate. Not doing rounding might cause slight image drift to the
> > left (and top) on repeated rescaling, and also slight reduction of
> > average brightness.
> 
> I agree that rounding is more appropriate.
> I think supplying distx and disty as properly rounded 4-bit values to the
> interpolation function is the best choice we have.
> 
> Error analysis is somewhat complicated in this case.
> The error may be bigger than with the previous code, at least 15 (I've done
> some brute force jobs)

Rounding to four bits is going to be a quite visible drop in quality
though, especially if you zoom more than 16x. With four bits of
precision, there will be only 16 different colors in the gradients
generated by the filter, which will show up as banding.

But maybe it's good enough - 16x scaling is not going to look great
with bilinear filtering no matter what.
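
The banding effect is easy to quantify. Below is a rough sketch (the function name is made up) that counts how many distinct output values a black-to-white ramp between two neighbouring pixels can take when the sub-pixel position is rounded to a given number of bits.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: count the distinct values of a 0-to-255 ramp
 * when the 8-bit sub-pixel position is rounded to 'bits' bits. */
static int
count_gradient_levels (int bits)
{
    uint8_t seen[256] = { 0 };
    int frac, levels = 0;

    assert (bits >= 1 && bits <= 8);

    for (frac = 0; frac < 256; ++frac)
    {
        /* Round the 8-bit fraction to the requested precision */
        int w = (bits == 8) ? frac
                            : (frac + (1 << (7 - bits))) >> (8 - bits);
        /* Interpolate between 0 and 255 with weight w */
        int value = (255 * w) >> bits;

        if (!seen[value])
        {
            seen[value] = 1;
            ++levels;
        }
    }

    return levels;
}
```

With rounding to 4 bits the weight takes the values 0 to 16, so the ramp has 17 steps including both end points, compared to 255 for 8-bit weights.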

> > I have only one concern about testing. Supposedly when we get both C
> > and SSE2 implementations, it would be much easier for testing if they
> > produce identical results. Otherwise tests need to be improved to
> > somehow be able to take slight differences into account.
> 
> I think the requirement of producing the same results for both C and SIMD
> (maybe SSE2, NEON, MMX) is relatively easy to meet.
> But SIMD can produce a much better result in less time, using methods that
> would be horribly slow in a general C implementation.
> I think it is more desirable to keep both the C and SIMD code optimized,
> even if they produce slightly different results.

Having the C and SIMD code produce different results is not a problem
in itself, but as Siarhei says, we would need to make sure the
test suite reflects that decision.

If we decide to move away from bit-exact testing, we would need to
decide on an acceptable deviation from ideal, and then update the
tests to verify that both the C and SIMD implementations are within
that deviation.
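
A tolerance based comparison could look roughly like the sketch below. This is not existing pixman test code; the helper names and the idea of a single per-channel tolerance are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Largest absolute difference between corresponding 8-bit channels */
static int
max_channel_diff (uint32_t a, uint32_t b)
{
    int shift, worst = 0;

    for (shift = 0; shift < 32; shift += 8)
    {
        int d = abs ((int) ((a >> shift) & 0xff) -
                     (int) ((b >> shift) & 0xff));

        if (d > worst)
            worst = d;
    }

    return worst;
}

/* Returns 1 if every pixel of 'out' is within 'tolerance' of 'ref' */
static int
scanlines_within_tolerance (const uint32_t *ref, const uint32_t *out,
                            int width, int tolerance)
{
    int i;

    assert (width >= 0 && tolerance >= 0);

    for (i = 0; i < width; ++i)
    {
        if (max_channel_diff (ref[i], out[i]) > tolerance)
            return 0;
    }

    return 1;
}
```

Both the C and the SIMD output would be checked against the same reference scanline, with the tolerance set to whatever deviation is agreed on.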

For example there could be a reference implementation that computes
t

Re: [Pixman] [cairo] pixman: New ARM NEON optimizations

2011-02-22 Thread Soeren Sandmann
Siarhei Siamashka  writes:

> Regarding the (b) part, probably as a side effect of current implementation,
> right now it is possible to do some operations with images having
> non-premultiplied alpha:
> 
> src_img = pixman_image_create_bits (
> PIXMAN_x8b8g8r8, width, height, src, stride);
> msk_img = pixman_image_create_bits (
> PIXMAN_a8b8g8r8, width, height, src, stride);
> dst_img = pixman_image_create_bits (
> PIXMAN_a8r8g8b8, width, height, dst, stride);
> 
> pixman_image_composite (PIXMAN_OP_SRC, src_img, msk_img, dst_img,
> 0, 0, 0, 0, 0, 0, width, height);
> 
> We only need to wrap the same a8r8g8b8 buffer into x8r8g8b8
> and a8r8g8b8 pixman image, and use the latter as a mask for
> pixman_image_composite() calls. Any operations which don't
> need mask themselves can use this trick. By also specifying
> negative stride, this is useful for example when dealing with
> the data returned by glReadPixels().

Yeah, this is useful, and it wouldn't directly be possible to do if
the equation were changed to (s OP d) LERP_m d. However, a pretty
simple way to fix it would be to just add an unpremultiplied
format. Benjamin's video patches had this I believe.

Using such a format as a destination can be slow because it requires
divisions, but Joonas has a number of optimized implementations here:

http://cgit.freedesktop.org/~joonas/unpremultiply/tree/
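
To show where the divisions come from, here is a straightforward and unoptimized sketch of converting a single a8r8g8b8 pixel in both directions. The function names are hypothetical and this is not the code from the repository above; the rounding used for the multiplication is the common "add 0x80, then fold the high byte" trick.

```c
#include <assert.h>
#include <stdint.h>

/* Multiply each color channel by alpha / 255, with rounding */
static uint32_t
premultiply (uint32_t p)
{
    uint32_t a = p >> 24;
    uint32_t out = a << 24;
    int shift;

    for (shift = 0; shift < 24; shift += 8)
    {
        uint32_t t = ((p >> shift) & 0xff) * a + 0x80;
        uint32_t v = (t + (t >> 8)) >> 8;

        /* A premultiplied channel never exceeds alpha */
        assert (v <= a);
        out |= v << shift;
    }

    return out;
}

/* Divide each color channel by alpha / 255; needs one division per
 * channel, which is what makes this direction slow */
static uint32_t
unpremultiply (uint32_t p)
{
    uint32_t a = p >> 24;
    uint32_t out = a << 24;
    int shift;

    if (a == 0)
        return 0;

    for (shift = 0; shift < 24; shift += 8)
    {
        uint32_t c = (p >> shift) & 0xff;
        uint32_t v = (c * 255 + a / 2) / a;

        if (v > 255)
            v = 255;
        out |= v << shift;
    }

    return out;
}
```

Optimized implementations typically replace the per-channel division with a table of reciprocals or similar tricks.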

> So I find it convenient that we are also allowed to work with
> masks which are basically interpreted as having a8x24 format. 

Right, having an a8x24 format would be another way to solve the
problem.


Soren


Re: [Pixman] [PATCH 0/3] Some clean-ups of the test directory

2011-02-22 Thread Soeren Sandmann
Siarhei Siamashka  writes:

> On Thursday 10 February 2011 20:22:38 Søren Sandmann wrote:
> > The following patches add a new directory "demos" and move all the
> > GTK+ based test programs there. This allows the Makefiles in both test
> > and demos to become much simpler with less redundancy.
> > 
> > I'm not particularly happy about the "demos" name since the GTK+ tests
> > aren't really demos, but I can't think of anything better. Suggestions
> > are appreciated.
> 
> It's a bit late comment, but eventually adding some real demo(s) which
> would display some nice looking animation and scare the users with huge
> FPS might be a good idea :)
 
Yes, I think various types of real demos would be a good idea, both
showing off performance and features. 

There is this branch:

http://cgit.freedesktop.org/~sandmann/pixman/log/?h=parrot

which makes the composite-test a little more interesting to look at,
and shows better how the compositing operators work. Screenshot:

http://www.daimi.au.dk/~sandmann/composite-test.png

> Well, that is if pixman actually needs any kind of such "marketing"
> stuff.

I have always thought that pixman should eventually be used in more
places than just the X server and cairo. For example, some features
that have been proposed, such as a floating point pipeline, a JIT
compiler and a shader language API, would let pixman serve the same
role as Core Image does on Mac OS X. And to get people to use it,
demos and other types of marketing would be useful, including getting
a website and a logo.


Soren


[Pixman] [PATCH 2/3] DSPASE Cleanup and add operations

2011-02-22 Thread Veli-Matti Valtonen
MIPS: DSPASE Modified the original commit dspase to use arm-neon bind macro
MIPS: DSPASE Implemented add_8888_8888 and add_n_8888
MIPS: DSPASE Added some simple mips function begin/end macros.
MIPS: DSPASE Implemented scanline add.
---
 pixman/pixman-mips-dspase1-asm.S |  331 --
 pixman/pixman-mips-dspase1.c |   75 +
 2 files changed, 325 insertions(+), 81 deletions(-)

diff --git a/pixman/pixman-mips-dspase1-asm.S b/pixman/pixman-mips-dspase1-asm.S
index b96fe83..596b38a 100644
--- a/pixman/pixman-mips-dspase1-asm.S
+++ b/pixman/pixman-mips-dspase1-asm.S
@@ -1,27 +1,37 @@
-
.text
+   .set mips32r2
+   .set nomips16
+   .set dsp
+   
+.macro pixman_asm_func fname
+   .global \fname
+   .ent \fname
+#ifdef __ELF__
+   .type   \fname, @function
+   .hidden \fname
+#endif
+\fname:
+.endm
+
+.macro pixman_end_func fname
+   .end \fname
+   .size \fname, .-\fname
+.endm
+
.set noreorder
.set nomacro
 
-
-// void
-// mips_dspase1_combine_over_u_nomask(uint32_t *dest, const uint32_t *src,
-// const uint32_t *mask, int width)
-
-   .global mips_dspase1_combine_over_u_nomask
-   .ent mips_dspase1_combine_over_u_nomask
-
 // note: this version to be used only when mask = NULL
 
-mips_dspase1_combine_over_u_nomask:
-   beqz $a3, 1f
-   subu $v0, $a1, $a0   // diff = src - dest (for LWX)
+pixman_asm_func pixman_composite_scanline_over_asm_dspase1
+   beqz $a0, 1f
+   subu $v0, $a2, $a1   // diff = src - dest (for LWX)
 
-   sll $a3, $a3, 2 // width <<= 2
-   addu $a3, $a0, $a3   // dest_end = dest + width
+   sll $a0, $a0, 2 // width <<= 2
+   addu $a0, $a1, $a0   // dest_end = dest + width
 
-   lw  $t0, 0($a0) // dest
-   lwx $t1, $v0($a0)   // src (dest + diff)
+   lw  $t0, 0($a1) // dest
+   lwx $t1, $v0($a1)   // src (dest + diff)
 
li  $t9, 0x00800080
 
@@ -33,8 +43,8 @@ mips_dspase1_combine_over_u_nomask:
muleu_s.ph.qbl  $t3, $t0, $t2
muleu_s.ph.qbr  $t4, $t0, $t2
 
-   lw  $t0, 4($a0) // dest[1] for next loop iteration
-   addiu   $a0, $a0, 4 // dest++
+   lw  $t0, 4($a1) // dest[1] for next loop iteration
+   addiu   $a1, $a1, 4 // dest++
 
addu $t3, $t3, $t9   // can't overflow; rev2: addu_s.ph
addu $t4, $t4, $t9   // can't overflow; rev2: addu_s.ph
@@ -46,41 +56,34 @@ mips_dspase1_combine_over_u_nomask:
precrq.qb.ph $t3, $t3, $t4
addu_s.qb   $t3, $t3, $t1
 
-   lwx $t1, $v0($a0)   // src (dest + diff) for next loop iteration
+   lwx $t1, $v0($a1)   // src (dest + diff) for next loop iteration
 
-   bne $a0, $a3, 0b
-   sw  $t3, -4($a0) // dest
+   bne $a1, $a0, 0b
+   sw  $t3, -4($a1) // dest
 
 1:
jr  $ra
nop
 
-   .end mips_dspase1_combine_over_u_nomask
-
+pixman_end_func pixman_composite_scanline_over_asm_dspase1
 
-// void
-// mips_dspase1_combine_over_u_mask(uint32_t *dest, const uint32_t *src,
-// const uint32_t *mask, int width)
-
-   .global mips_dspase1_combine_over_u_mask
-   .ent mips_dspase1_combine_over_u_mask
 
 // note: this version to be used only when mask != NULL
 
-mips_dspase1_combine_over_u_mask:
-   beqz $a3, 1f
-   subu $v0, $a1, $a0   // sdiff = src - dest (for LWX)
+pixman_asm_func pixman_composite_scanline_over_mask_asm_dspase1
+   beqz $a0, 1f
+   subu $v0, $a2, $a1   // sdiff = src - dest (for LWX)
 
-   subu $v1, $a2, $a0   // mdiff = mask - dest (for LWX)
+   subu $v1, $a3, $a1   // mdiff = mask - dest (for LWX)
 
-   sll $a3, $a3, 2 // width <<= 2
-   addu $a3, $a0, $a3   // dest_end = dest + width
+   sll $a0, $a0, 2 // width <<= 2
+   addu $a0, $a1, $a0   // dest_end = dest + width
 
li  $t9, 0x00800080
 
 0:
-   lwx $t8, $v1($a0)   // mask (dest + mdiff)
-   lwx $t1, $v0($a0)   // src (dest + sdiff)
+   lwx $t8, $v1($a1)   // mask (dest + mdiff)
+   lwx $t1, $v0($a1)   // src (dest + sdiff)
 
srl $t8, $t8, 24// mask >>= A_SHIFT
ins $t8, $t8, 16, 8 // 0:m:0:m; equivalent to replv.ph
@@ -88,7 +91,7 @@ mips_dspase1_combine_over_u_mask:
muleu_s.ph.qbl  $t3, $t1, $t8
muleu_s.ph.qbr  $t4, $t1, $t8
 
-   lw

[Pixman] [PATCH 0/3] Pixman MIPS DSPASE1

2011-02-22 Thread Veli-Matti Valtonen

I originally started working on this by optimizing the MIPS32R2 code (based on
the patch by Beloev), but the performance increases seem to be relatively
similar to what over_n_8_8888 shows. The DSP ASE is much more promising in this
regard. It rather leaves me wondering if the mips32r2 code should be left out.

It might however be related to the test system, which has a MIPS 74K core. The 
original I assume was worked on with a MIPS 24K.

I used pixman-arm-common.h for the assembler binding macros, which is the 
reason for the 'ARM' found in the glue.

Compiling the code makes gcc produce warnings about macro expansion. It would
be nice not to have these, but "fixing" them would have a (slight) negative
effect on readability.

PATCH 1 is the original patch by Georgi Beloev, but modified to apply against 
pixman head.

Implemented:
Scanline add, out reverse, over
fast path:
over_n_8_8888
add_8888_8888
add_n_8888

Test hardware: Broadcom BCM4718, 453MHz, MIPS 74K V4.0 (Inc. DSP Rev2, MIPS16), 
Little Endian

All the test program builds used CFLAGS="-O2 -mdsp -mips32r2"

reference memcpy speed = 176.0MB/s (44.0MP/s for 32bpp fills)

Optimizations disabled: --disable-mips32r2 --disable-mips-dspase1
over_n_8_8888 =  L1:   6.16  L2:   5.34  M:  5.35 ( 19.24%)  HT:  4.78  VT:  4.62  R:  4.55  RT:  2.99 (  28Kops/s)
add_8888_8888 =  L1:  18.11  L2:  10.15  M:  9.98 ( 45.33%)  HT: 14.80  VT: 13.36  R: 13.41  RT:  6.17 (  46Kops/s)
add_n_8888 =  L1:  14.26  L2:  10.30  M: 10.38 ( 23.59%)  HT:  8.05  VT:  7.64  R:  7.63  RT:  4.05 (  33Kops/s)

MIPS32R2: --disable-mips-dspase1
over_n_8_8888 =  L1:   6.17  L2:   5.62  M:  5.56 ( 20.33%)  HT:  5.00  VT:  4.83  R:  4.76  RT:  3.33 (  30Kops/s)

MIPS DSPASE:
over_n_8_8888 =  L1:   9.76  L2:   7.89  M:  7.93 ( 27.11%)  HT:  7.04  VT:  6.84  R:  6.63  RT:  4.06 (  34Kops/s)
add_8888_8888 =  L1: 117.36  L2:  20.67  M: 23.22 (105.50%)  HT: 17.40  VT: 15.96  R: 13.81  RT:  6.48 (  47Kops/s)
add_n_8888 =  L1: 145.84  L2:  28.23  M: 31.11 ( 70.66%)  HT: 22.95  VT: 18.54  R: 19.99  RT:  8.93 (  50Kops/s)

Scanline ops benchmarked using low-level-blit:

I selected these ops by adding a printf to the scanline ops and finding one
that triggers it. If there is a more convenient way to benchmark these ops, I
failed to find it.

Optimizations disabled:
add_8_8_8 =  L1:   3.31  L2:   5.25  M:  5.16 ( 11.73%)  HT:  3.61  VT:  3.60  R:  3.53  RT:  1.77 (  18Kops/s)
add_8888_1555 =  L1:   6.51  L2:   5.32  M:  5.34 ( 18.20%)  HT:  4.05  VT:  3.96  R:  3.94  RT:  2.21 (  22Kops/s)
outrev_n_8_8888 =  L1:   6.33  L2:   5.25  M:  5.16 ( 17.60%)  HT:  4.11  VT:  4.02  R:  3.97  RT:  2.23 (  22Kops/s)
over_8888_n_0565 =  L1:   2.83  L2:   3.33  M:  3.21 ( 11.54%)  HT:  2.73  VT:  2.69  R:  2.68  RT:  1.67 (  17Kops/s)
over_n_8888 =  L1:   7.45  L2:   6.65  M:  6.66 ( 15.14%)  HT:  5.65  VT:  5.43  R:  5.43  RT:  3.35 (  30Kops/s)

MIPS DSPASE:
add_8_8_8 =  L1:   8.81  L2:   7.67  M:  7.53 ( 17.11%)  HT:  4.62  VT:  4.68  R:  4.50  RT:  1.97 (  19Kops/s)
add_8888_1555 =  L1:   9.07  L2:   7.27  M:  7.29 ( 24.87%)  HT:  5.09  VT:  4.95  R:  4.93  RT:  2.50 (  23Kops/s)
outrev_n_8_8888 =  L1:   8.48  L2:   6.82  M:  6.88 ( 23.45%)  HT:  5.04  VT:  4.90  R:  4.85  RT:  2.48 (  23Kops/s)
over_8888_n_0565 =  L1:   5.13  L2:   4.38  M:  4.16 ( 14.24%)  HT:  3.41  VT:  3.30  R:  3.34  RT:  1.93 (  19Kops/s)
over_n_8888 =  L1:  18.58  L2:  12.91  M: 13.12 ( 29.85%)  HT:  9.75  VT:  9.06  R:  9.10  RT:  4.55 (  33Kops/s)



[Pixman] [PATCH 3/3] DSPASE More cleanup, out reverse op.

2011-02-22 Thread Veli-Matti Valtonen
MIPS: DSPASE Implemented DSPASE1_UN8x4_MUL_UN8 macro.
MIPS: DSPASE Implemented scanline out reverse
MIPS: DSPASE over_n_8_8888 modified to use the macro bindings
---
 pixman/pixman-mips-dspase1-asm.S |  226 +-
 pixman/pixman-mips-dspase1.c |   50 +
 2 files changed, 155 insertions(+), 121 deletions(-)

diff --git a/pixman/pixman-mips-dspase1-asm.S b/pixman/pixman-mips-dspase1-asm.S
index 596b38a..0cb2293 100644
--- a/pixman/pixman-mips-dspase1-asm.S
+++ b/pixman/pixman-mips-dspase1-asm.S
@@ -18,6 +18,26 @@
.size \fname, .-\fname
 .endm
 
+# result register can be the same as any of the params
+# rb_half should contain 0x00800080
+.macro DSPASE1_UN8x4_MUL_UN8_head a, b, x, y
+   muleu_s.ph.qbl \x, \a, \b
+   muleu_s.ph.qbr \y, \a, \b 
+.endm
+
+.macro DSPASE1_UN8x4_MUL_UN8_tail x, y, result, rb_half, tmp3, tmp4
+   addu \x, \x, \rb_half
+   addu \y, \y, \rb_half
+
+   preceu.ph.qbla \tmp3, \x
+   preceu.ph.qbla \tmp4, \y
+
+   addu \x, \x, \tmp3
+   addu \y, \y, \tmp4
+   
+   precrq.qb.ph \result, \x, \y
+.endm
+
.set noreorder
.set nomacro
 
@@ -40,20 +60,13 @@ pixman_asm_func pixman_composite_scanline_over_asm_dspase1
srl $t2, $t2, 24// ALPHA_8(~src)
ins $t2, $t2, 16, 8 // 0:a:0:a; equivalent to replv.ph
 
-   muleu_s.ph.qbl  $t3, $t0, $t2
-   muleu_s.ph.qbr  $t4, $t0, $t2
+   DSPASE1_UN8x4_MUL_UN8_head $t0, $t2, $t3, $t4
 
lw  $t0, 4($a1) // dest[1] for next loop iteration
addiu   $a1, $a1, 4 // dest++
 
-   addu $t3, $t3, $t9   // can't overflow; rev2: addu_s.ph
-   addu $t4, $t4, $t9   // can't overflow; rev2: addu_s.ph
-   preceu.ph.qbla  $t5, $t3 // rev2: shrl.ph
-   preceu.ph.qbla  $t6, $t4 // rev2: shrl.ph
-   addu $t3, $t3, $t5   // can't overflow; rev2: addu_s.ph
-   addu $t4, $t4, $t6   // can't overflow; rev2: addu_s.ph
+   DSPASE1_UN8x4_MUL_UN8_tail $t3, $t4, $t3, $t9, $t5, $t6
 
-   precrq.qb.ph $t3, $t3, $t4
addu_s.qb   $t3, $t3, $t1
 
lwx $t1, $v0($a1)   // src (dest + diff) for next loop iteration
@@ -88,35 +101,22 @@ pixman_asm_func pixman_composite_scanline_over_mask_asm_dspase1
srl $t8, $t8, 24// mask >>= A_SHIFT
ins $t8, $t8, 16, 8 // 0:m:0:m; equivalent to replv.ph
 
-   muleu_s.ph.qbl  $t3, $t1, $t8
-   muleu_s.ph.qbr  $t4, $t1, $t8
+   DSPASE1_UN8x4_MUL_UN8_head $t1, $t8, $t3, $t4
 
lw  $t0, 0($a1) // dest

-   addu $t3, $t3, $t9   // can't overflow; rev2: addu_s.ph
-   addu $t4, $t4, $t9   // can't overflow; rev2: addu_s.ph
-   preceu.ph.qbla  $t5, $t3 // rev2: shrl.ph
-   preceu.ph.qbla  $t6, $t4 // rev2: shrl.ph
-   addu $t3, $t3, $t5   // can't overflow; rev2: addu_s.ph
-   addu $t4, $t4, $t6   // can't overflow; rev2: addu_s.ph
-   precrq.qb.ph $t1, $t3, $t4
+   DSPASE1_UN8x4_MUL_UN8_tail $t3, $t4, $t1, $t9, $t5, $t6
 
not $t2, $t1// ~src
srl $t2, $t2, 24// ALPHA_8(~src)
ins $t2, $t2, 16, 8 // 0:a:0:a; equivalent to replv.ph
 
-   muleu_s.ph.qbl  $t3, $t0, $t2
-   muleu_s.ph.qbr  $t4, $t0, $t2
+   DSPASE1_UN8x4_MUL_UN8_head $t0, $t2, $t3, $t4
 
addiu   $a1, $a1, 4 // dest++
+   
+   DSPASE1_UN8x4_MUL_UN8_tail $t3, $t4, $t3, $t9, $t5, $t6
 
-   addu $t3, $t3, $t9   // can't overflow; rev2: addu_s.ph
-   addu $t4, $t4, $t9   // can't overflow; rev2: addu_s.ph
-   preceu.ph.qbla  $t5, $t3 // rev2: shrl.ph
-   preceu.ph.qbla  $t6, $t4 // rev2: shrl.ph
-   addu $t3, $t3, $t5   // can't overflow; rev2: addu_s.ph
-   addu $t4, $t4, $t6   // can't overflow; rev2: addu_s.ph
-   precrq.qb.ph $t3, $t3, $t4
addu_s.qb   $t3, $t3, $t1
 
bne $a1, $a0, 0b
@@ -197,28 +197,18 @@ pixman_asm_func pixman_composite_scanline_add_mask_asm_dspase1
 $scanline_add_mask_loop:
lwx $t2, $a3($a1)
lwx $t1, $a2($a1)
-   lw $t0, 0($a1)
-
-   addiu $a1, $a1, 4

# based on pixman_composite_scanline_over_mask_asm_dspase1
-   # converting these to macroes might make sense
srl $t2, $t2, 24
ins $t2, $t2, 16, 8 // 0:m:0:m; equivalent to replv.ph
-   
-   muleu_s.ph.qbl $t3, $t1, $t2
-   muleu_s.ph.qbr $t4, $t1, $t2
-
-   addu $t3, $t3, $t8  // can't overflow; rev2: addu_s.ph
-   addu $t4, $t4, $t8  // can't overflow; rev2: addu_s.ph
 
-   preceu.ph.qbla $t5, $t3 // rev2: shrl.ph
-   preceu.ph.qbla $t6, $t4 // rev2: shrl.ph
-   addu $t3, $

[Pixman] [PATCH 1/3] MIPS32R2 and MIPS DSP ASE optimized functions, adapted for pixman head

2011-02-22 Thread Veli-Matti Valtonen
From: Veli-Matti Valtonen  

From 118b1f5596f72be7fed85ba408ff2961b3308038 Mon Sep 17 00:00:00 2001
From: Georgi Beloev 
Date: Wed, 8 Sep 2010 17:34:22 -0700
Subject: [PATCH] Added MIPS32R2 and MIPS DSP ASE optimized functions.

The following functions were implemented for MIPS32R2:
  - pixman_fill32()
  - fast_composite_over_n_8_8888()

The following functions were implemented for MIPS DSP ASE:
  - combine_over_u()
  - fast_composite_over_n_8_8888()

Additionally, MIPS DSP ASE uses the MIPS32R2 pixman_fill32() function.

Use configure commands similar to the ones below to select the target
processor and, correspondingly, the target instruction set:

  - MIPS32R2: configure CFLAGS='-march=24kc -O2'
  - MIPS DSP ASE: configure CFLAGS='-march=24kec -O2'
---
 configure.ac |   63 +
 pixman/Makefile.am   |   22 +
 pixman/pixman-cpu.c  |   21 
 pixman/pixman-mips-dspase1-asm.S |  189 ++
 pixman/pixman-mips-dspase1.c |  107 +
 pixman/pixman-mips32r2-asm.S |  180 
 pixman/pixman-mips32r2.c |  112 ++
 pixman/pixman-private.h  |   11 ++
 8 files changed, 705 insertions(+), 0 deletions(-)
 create mode 100644 pixman/pixman-mips-dspase1-asm.S
 create mode 100644 pixman/pixman-mips-dspase1.c
 create mode 100644 pixman/pixman-mips32r2-asm.S
 create mode 100644 pixman/pixman-mips32r2.c

diff --git a/configure.ac b/configure.ac
index 5242799..2a7e49a 100644
--- a/configure.ac
+++ b/configure.ac
@@ -565,6 +565,69 @@ fi
 
 AM_CONDITIONAL(USE_GCC_INLINE_ASM, test $have_gcc_inline_asm = yes)
 
+dnl ==
+dnl Check if the compiler supports MIPS32R2 instructions
+
+AC_MSG_CHECKING(whether to use MIPS32R2 instructions)
+AC_COMPILE_IFELSE([[
+void test()
+{
+asm("ext \$v0,\$a0,8,8");
+}
+]], have_mips32r2=yes, have_mips32r2=no)
+
+AC_ARG_ENABLE(mips32r2,
+   [AC_HELP_STRING([--disable-mips32r2],
+   [disable MIPS32R2 fast paths])],
+   [enable_mips32r2=$enableval], [enable_mips32r2=auto])
+
+if test $enable_mips32r2 = no ; then
+   have_mips32r2=disabled
+fi
+
+if test $have_mips32r2 = yes ; then
+   AC_DEFINE(USE_MIPS32R2, 1, [use MIPS32R2 optimizations])
+fi
+
+AM_CONDITIONAL(USE_MIPS32R2, test $have_mips32r2 = yes)
+
+AC_MSG_RESULT($have_mips32r2)
+if test $enable_mips32r2 = yes && test $have_mips32r2 = no ; then
+   AC_MSG_ERROR([MIPS32R2 not detected])
+fi
+
+
+dnl ==
+dnl Check if the compiler supports MIPS DSP ASE Rev 1 instructions
+
+AC_MSG_CHECKING(whether to use MIPS DSP ASE Rev 1 instructions)
+AC_COMPILE_IFELSE([[
+void test()
+{
+asm("addu.qb \$v0,\$a0,\$a1");
+}
+]], have_mips_dspase1=yes, have_mips_dspase1=no)
+
+AC_ARG_ENABLE(mips-dspase1,
+   [AC_HELP_STRING([--disable-mips-dspase1],
+   [disable MIPS DSP ASE Rev 1 fast paths])],
+   [enable_mips_dspase1=$enableval], [enable_mips_dspase1=auto])
+
+if test $enable_mips_dspase1 = no ; then
+   have_mips_dspase1=disabled
+fi
+
+if test $have_mips_dspase1 = yes ; then
+   AC_DEFINE(USE_MIPS_DSPASE1, 1, [use MIPS DSP ASE Rev 1 optimizations])
+fi
+
+AM_CONDITIONAL(USE_MIPS_DSPASE1, test $have_mips_dspase1 = yes)
+
+AC_MSG_RESULT($have_mips_dspase1)
+if test $enable_mips_dspase1 = yes && test $have_mips_dspase1 = no ; then
+   AC_MSG_ERROR([MIPS DSP ASE Rev 1 not detected])
+fi
+
 dnl ==
 dnl Static test programs
 
diff --git a/pixman/Makefile.am b/pixman/Makefile.am
index ca31301..d832db1 100644
--- a/pixman/Makefile.am
+++ b/pixman/Makefile.am
@@ -123,5 +123,27 @@ libpixman_1_la_LIBADD += libpixman-arm-neon.la
 ASM_CFLAGS_arm_neon=
 endif
 
+# MIPS32R2
+if USE_MIPS32R2
+noinst_LTLIBRARIES += libpixman-mips32r2.la
+libpixman_mips32r2_la_SOURCES = \
+   pixman-mips32r2.c \
+   pixman-mips32r2-asm.S
+libpixman_mips32r2_la_CFLAGS = $(DEP_CFLAGS)
+libpixman_mips32r2_la_LIBADD = $(DEP_LIBS)
+libpixman_1_la_LIBADD += libpixman-mips32r2.la
+endif
+
+# MIPS DSP ASE Rev 1
+if USE_MIPS_DSPASE1
+noinst_LTLIBRARIES += libpixman-mips-dspase1.la
+libpixman_mips_dspase1_la_SOURCES = \
+   pixman-mips-dspase1.c \
+   pixman-mips-dspase1-asm.S
+libpixman_mips_dspase1_la_CFLAGS = $(DEP_CFLAGS)
+libpixman_mips_dspase1_la_LIBADD = $(DEP_LIBS)
+libpixman_1_la_LIBADD += libpixman-mips-dspase1.la
+endif
+
 .c.s : $(libpixmaninclude_HEADERS) $(BUILT_SOURCES)
 	$(CC) $(CFLAGS) $(ASM_CFLAGS_$(@:pixman-%.s=%)) $(ASM_CFLAGS_$(@:pixman-arm-%.s=arm_%)) -DHAVE_CONFIG_H -I$(srcdir) -I$(builddir) -I$(top_builddir) -S -o $@ $<
diff --git a/pixman/pixman-cpu.c b/pixman/pixman-cpu.c
index 0e14ecb..ee6dc1c 100644
--- a/pixman/pixman-cpu.c
+++ b/pixman/pixman-cpu.c
@@ -573,6 +573,17 @@ pixman_have_sse2 (void)
 #endif /* __amd64__ */
 #endif
 
+#ifdef USE_