Re: [Pixman] [PATCH 0/9] lowlevel-blt-bench improvements for automated testing

2015-07-02 Thread Ben Avison

On Wed, 10 Jun 2015 14:32:49 +0100, Pekka Paalanen  wrote:

most of the patches are trivial cleanups. The meat is in the last two:
CSV output mode and skipping the memory speed benchmark.

Both new features are designed for an external benchmarking harness
that runs several different versions of Pixman with lowlevel-blt-bench
in an alternating fashion. Alternating iterations are needed to get
reliable results on platforms like the Raspberry Pi.


These look like sensible improvements to me. A few minor points:

Patch 2 commit message:

"Move explanation printing to a new file. This will help with..."

Presumably "file" here should read "function".

Patch 8: if the aim is machine readability, I'd suggest no space after
the comma. Otherwise, if the comma is used as a field separator, the first
field has no leading space but all the others do, which may make things a
little more fiddly when post-processing the results with some tools.
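
For instance, with a naive comma splitter every field after the first keeps
its leading space. A minimal sketch (hypothetical code, and the row contents
are made up):

#include <stdio.h>
#include <string.h>

int
main (void)
{
    /* A made-up row in the current output format: ", " between fields */
    char row[] = "over_8888_8888, 133.21, 132.95, 132.53";
    char *field = strtok (row, ",");

    while (field)
    {
        /* Prints [over_8888_8888], [ 133.21], [ 132.95], [ 132.53]:
         * every field after the first starts with a space. */
        printf ("[%s]\n", field);
        field = strtok (NULL, ",");
    }

    return 0;
}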

Not really your fault, but I noticed it when trying out the new versions:
it doesn't fault an attempt to list more than one pattern on the command
line. It just acts on the last one. It should really either fault it, or
(more usefully) benchmark all of the patterns specified. This would allow
a subset of the "all" tests to be performed, sharing the same memcpy
measurement overhead, or a group of operations including those not in the
"all" list.

In other respects, happy to give my Reviewed-by:.

Ben


[Pixman] [PATCH 12/12] vmx: implement fast path iterator vmx_fetch_a8

2015-07-02 Thread Oded Gabbay
No changes were observed when running cairo trimmed benchmarks.

Signed-off-by: Oded Gabbay 
---
 pixman/pixman-vmx.c | 46 ++
 1 file changed, 46 insertions(+)

diff --git a/pixman/pixman-vmx.c b/pixman/pixman-vmx.c
index f71f358..03e8afa 100644
--- a/pixman/pixman-vmx.c
+++ b/pixman/pixman-vmx.c
@@ -3481,6 +3481,49 @@ vmx_fetch_r5g6b5 (pixman_iter_t *iter, const uint32_t *mask)
 return iter->buffer;
 }
 
+static uint32_t *
+vmx_fetch_a8 (pixman_iter_t *iter, const uint32_t *mask)
+{
+int w = iter->width;
+uint32_t *dst = iter->buffer;
+uint8_t *src = iter->bits;
+vector unsigned int vmx0, vmx1, vmx2, vmx3, vmx4, vmx5, vmx6;
+
+iter->bits += iter->stride;
+
+while (w && (((uintptr_t)dst) & 15))
+{
+*dst++ = *(src++) << 24;
+w--;
+}
+
+while (w >= 16)
+{
+   vmx0 = load_128_unaligned((uint32_t *) src);
+
+   unpack_128_2x128((vector unsigned int) AVV(0), vmx0, &vmx1, &vmx2);
+   unpack_128_2x128_16((vector unsigned int) AVV(0), vmx1, &vmx3, &vmx4);
+   unpack_128_2x128_16((vector unsigned int) AVV(0), vmx2, &vmx5, &vmx6);
+
+   save_128_aligned(dst, vmx6);
+   save_128_aligned((dst +  4), vmx5);
+   save_128_aligned((dst +  8), vmx4);
+   save_128_aligned((dst + 12), vmx3);
+
+   dst += 16;
+   src += 16;
+   w -= 16;
+}
+
+while (w)
+{
+   *dst++ = *(src++) << 24;
+   w--;
+}
+
+return iter->buffer;
+}
+
 #define IMAGE_FLAGS\
 (FAST_PATH_STANDARD_FLAGS | FAST_PATH_ID_TRANSFORM |   \
  FAST_PATH_BITS_IMAGE | FAST_PATH_SAMPLES_COVER_CLIP_NEAREST)
@@ -3493,6 +3536,9 @@ static const pixman_iter_info_t vmx_iters[] =
 { PIXMAN_r5g6b5, IMAGE_FLAGS, ITER_NARROW,
   _pixman_iter_init_bits_stride, vmx_fetch_r5g6b5, NULL
 },
+{ PIXMAN_a8, IMAGE_FLAGS, ITER_NARROW,
+  _pixman_iter_init_bits_stride, vmx_fetch_a8, NULL
+},
 { PIXMAN_null },
 };
 
-- 
2.4.3



[Pixman] [PATCH 10/12] vmx: implement fast path iterator vmx_fetch_x8r8g8b8

2015-07-02 Thread Oded Gabbay
POWER8, 8 cores, 3.4GHz, RHEL 7.1 ppc64le.

cairo trimmed benchmarks :

Speedups
========
t-firefox-asteroids  533.92  -> 489.94 :  1.09x

Signed-off-by: Oded Gabbay 
---
 pixman/pixman-vmx.c | 48 
 1 file changed, 48 insertions(+)

diff --git a/pixman/pixman-vmx.c b/pixman/pixman-vmx.c
index 2f82ce7..ed248e1 100644
--- a/pixman/pixman-vmx.c
+++ b/pixman/pixman-vmx.c
@@ -3398,6 +3398,52 @@ static const pixman_fast_path_t vmx_fast_paths[] =
 {   PIXMAN_OP_NONE },
 };
 
+static uint32_t *
+vmx_fetch_x8r8g8b8 (pixman_iter_t *iter, const uint32_t *mask)
+{
+int w = iter->width;
+vector unsigned int ff000000 = mask_ff000000;
+uint32_t *dst = iter->buffer;
+uint32_t *src = (uint32_t *)iter->bits;
+
+iter->bits += iter->stride;
+
+while (w && ((uintptr_t)dst) & 0x0f)
+{
+   *dst++ = (*src++) | 0xff000000;
+   w--;
+}
+
+while (w >= 4)
+{
+   save_128_aligned(dst, vec_or(load_128_unaligned(src), ff000000));
+
+   dst += 4;
+   src += 4;
+   w -= 4;
+}
+
+while (w)
+{
+   *dst++ = (*src++) | 0xff000000;
+   w--;
+}
+
+return iter->buffer;
+}
+
+#define IMAGE_FLAGS\
+(FAST_PATH_STANDARD_FLAGS | FAST_PATH_ID_TRANSFORM |   \
+ FAST_PATH_BITS_IMAGE | FAST_PATH_SAMPLES_COVER_CLIP_NEAREST)
+
+static const pixman_iter_info_t vmx_iters[] =
+{
+{ PIXMAN_x8r8g8b8, IMAGE_FLAGS, ITER_NARROW,
+  _pixman_iter_init_bits_stride, vmx_fetch_x8r8g8b8, NULL
+},
+{ PIXMAN_null },
+};
+
 pixman_implementation_t *
 _pixman_implementation_create_vmx (pixman_implementation_t *fallback)
 {
@@ -3441,5 +3487,7 @@ _pixman_implementation_create_vmx (pixman_implementation_t *fallback)
 imp->blt = vmx_blt;
 imp->fill = vmx_fill;
 
+imp->iter_info = vmx_iters;
+
 return imp;
 }
-- 
2.4.3



[Pixman] [PATCH 07/12] vmx: implement fast path vmx_composite_over_n_8_8888

2015-07-02 Thread Oded Gabbay
POWER8, 8 cores, 3.4GHz, RHEL 7.1 ppc64le.

reference memcpy speed = 24764.8MB/s (6191.2MP/s for 32bpp fills)

Before   After   Change
------   -----   ------
L1  90.21   133.21   +47.67%
L2  94.91   132.95   +40.08%
M   95.49   132.53   +38.79%
HT  88.07   100.43   +14.03%
VT  86.65   112.45   +29.77%
R   82.77   96.25+16.29%
RT  65.64   55.14-16.00%
Kops/s  673 580  -13.82%

cairo trimmed benchmarks :

Speedups
========
t-firefox-asteroids 533.92  -> 495.51  :  1.08x

Slowdowns
=========
t-poppler   364.99  -> 393.72  :  1.08x
t-firefox-canvas-alpha  984.55  -> 1197.85 :  1.22x

Signed-off-by: Oded Gabbay 
---
 pixman/pixman-vmx.c | 126 
 1 file changed, 126 insertions(+)

diff --git a/pixman/pixman-vmx.c b/pixman/pixman-vmx.c
index 966219f..5c74a47 100644
--- a/pixman/pixman-vmx.c
+++ b/pixman/pixman-vmx.c
@@ -2557,6 +2557,128 @@ vmx_combine_add_ca (pixman_implementation_t *imp,
 }
 }
 
+static void
+vmx_composite_over_n_8_8888 (pixman_implementation_t *imp,
+  pixman_composite_info_t *info)
+{
+PIXMAN_COMPOSITE_ARGS (info);
+uint32_t src, srca;
+uint32_t *dst_line, *dst;
+uint8_t *mask_line, *mask;
+int dst_stride, mask_stride;
+int32_t w;
+uint32_t m, d;
+
+vector unsigned int vsrc, valpha, vmask;
+
+vector unsigned int vmx_dst, vmx_dst_lo, vmx_dst_hi;
+vector unsigned int vmx_mask, vmx_mask_lo, vmx_mask_hi;
+
+src = _pixman_image_get_solid (imp, src_image, dest_image->bits.format);
+
+srca = src >> 24;
+if (src == 0)
+   return;
+
+PIXMAN_IMAGE_GET_LINE (
+   dest_image, dest_x, dest_y, uint32_t, dst_stride, dst_line, 1);
+PIXMAN_IMAGE_GET_LINE (
+   mask_image, mask_x, mask_y, uint8_t, mask_stride, mask_line, 1);
+
+vmask = create_mask_1x32_128 (&src);
+vsrc = expand_pixel_32_1x128 (src);
+valpha = expand_alpha_1x128 (vsrc);
+
+while (height--)
+{
+   dst = dst_line;
+   dst_line += dst_stride;
+   mask = mask_line;
+   mask_line += mask_stride;
+   w = width;
+
+   while (w && (uintptr_t)dst & 15)
+   {
+   uint8_t m = *mask++;
+
+   if (m)
+   {
+   d = *dst;
+   vmx_mask = expand_pixel_8_1x128 (m);
+   vmx_dst = unpack_32_1x128 (d);
+
+   *dst = pack_1x128_32 (in_over (vsrc,
+  valpha,
+  vmx_mask,
+  vmx_dst));
+   }
+
+   w--;
+   dst++;
+   }
+
+   while (w >= 4)
+   {
+   m = *((uint32_t*)mask);
+
+   if (srca == 0xff && m == 0xffffffff)
+   {
+   save_128_aligned(dst, vmask);
+   }
+   else if (m)
+   {
+   vmx_dst = load_128_aligned (dst);
+
+   vmx_mask = unpack_32_1x128 (m);
+   vmx_mask = unpacklo_128_16x8 (vmx_mask,
+   (vector unsigned int) AVV(0));
+
+   /* Unpacking */
+   unpack_128_2x128 (vmx_dst, (vector unsigned int) AVV(0),
+   &vmx_dst_lo, &vmx_dst_hi);
+
+   unpack_128_2x128 (vmx_mask, (vector unsigned int) AVV(0),
+   &vmx_mask_lo, &vmx_mask_hi);
+
+   expand_alpha_rev_2x128 (vmx_mask_lo, vmx_mask_hi,
+   &vmx_mask_lo, &vmx_mask_hi);
+
+   in_over_2x128 (&vsrc, &vsrc,
+  &valpha, &valpha,
+  &vmx_mask_lo, &vmx_mask_hi,
+  &vmx_dst_lo, &vmx_dst_hi);
+
+   save_128_aligned(dst, pack_2x128_128 (vmx_dst_lo, vmx_dst_hi));
+   }
+
+   w -= 4;
+   dst += 4;
+   mask += 4;
+   }
+
+   while (w)
+   {
+   uint8_t m = *mask++;
+
+   if (m)
+   {
+   d = *dst;
+   vmx_mask = expand_pixel_8_1x128 (m);
+   vmx_dst = unpack_32_1x128 (d);
+
+   *dst = pack_1x128_32 (in_over (vsrc,
+  valpha,
+  vmx_mask,
+  vmx_dst));
+   }
+
+   w--;
+   dst++;
+   }
+}
+
+}
+
 static pixman_bool_t
 vmx_fill (pixman_implementation_t *imp,
uint32_t *   bits,
@@ -3061,6 +3183,10 @@ static const pixman_fast_path_t vmx_fast_paths[] =
 PIXMAN_STD_FAST_PATH (OVER, a8r8g8b8, nu

[Pixman] [PATCH 09/12] vmx: implement fast path scaled nearest vmx_8888_8888_OVER

2015-07-02 Thread Oded Gabbay
POWER8, 8 cores, 3.4GHz, RHEL 7.1 ppc64le.
reference memcpy speed = 24764.8MB/s (6191.2MP/s for 32bpp fills)

Before   After   Change
------   -----   ------
L1  134.36  181.68  +35.22%
L2  135.07  180.67  +33.76%
M   134.6   180.51  +34.11%
HT  121.77  128.79  +5.76%
VT  120.49  145.07  +20.40%
R   93.83   102.3   +9.03%
RT  50.82   46.93   -7.65%
Kops/s  448 422 -5.80%

cairo trimmed benchmarks :

Speedups
========
t-firefox-asteroids  533.92 -> 497.92 :  1.07x
t-midori-zoomed  692.98 -> 651.24 :  1.06x
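
For reference, the per-pixel source stepping that the scanline function
below performs can be sketched in plain C. The helper name here is
hypothetical; only the arithmetic is taken from the loops in the patch:

#include <pixman.h>  /* pixman_fixed_t, pixman_fixed_to_int */

/* Sketch: advance the 16.16 fixed-point source coordinate by one output
 * pixel; vx ends up negative after each step, so repeat wrapping is a
 * simple subtraction loop, exactly as in the vector and scalar loops. */
static int
next_src_index (pixman_fixed_t *vx, pixman_fixed_t unit_x,
                pixman_fixed_t src_width_fixed)
{
    int idx = pixman_fixed_to_int (*vx);

    *vx += unit_x;
    while (*vx >= 0)
        *vx -= src_width_fixed;

    return idx;
}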

Signed-off-by: Oded Gabbay 
---
 pixman/pixman-vmx.c | 128 
 1 file changed, 128 insertions(+)

diff --git a/pixman/pixman-vmx.c b/pixman/pixman-vmx.c
index d5ddf4b..2f82ce7 100644
--- a/pixman/pixman-vmx.c
+++ b/pixman/pixman-vmx.c
@@ -3232,6 +3232,129 @@ vmx_composite_add_8888_8888 (pixman_implementation_t *imp,
 }
 }
 
+static force_inline void
+scaled_nearest_scanline_vmx_8888_8888_OVER (uint32_t*   pd,
+const uint32_t* ps,
+int32_t w,
+pixman_fixed_t  vx,
+pixman_fixed_t  unit_x,
+pixman_fixed_t  src_width_fixed,
+pixman_bool_t   fully_transparent_src)
+{
+uint32_t s, d;
+const uint32_t* pm = NULL;
+
+vector unsigned int vmx_dst_lo, vmx_dst_hi;
+vector unsigned int vmx_src_lo, vmx_src_hi;
+vector unsigned int vmx_alpha_lo, vmx_alpha_hi;
+
+if (fully_transparent_src)
+   return;
+
+/* Align dst on a 16-byte boundary */
+while (w && ((uintptr_t)pd & 15))
+{
+   d = *pd;
+   s = combine1 (ps + pixman_fixed_to_int (vx), pm);
+   vx += unit_x;
+   while (vx >= 0)
+   vx -= src_width_fixed;
+
+   *pd++ = core_combine_over_u_pixel_vmx (s, d);
+   if (pm)
+   pm++;
+   w--;
+}
+
+while (w >= 4)
+{
+   vector unsigned int tmp;
+   uint32_t tmp1, tmp2, tmp3, tmp4;
+
+   tmp1 = *(ps + pixman_fixed_to_int (vx));
+   vx += unit_x;
+   while (vx >= 0)
+   vx -= src_width_fixed;
+   tmp2 = *(ps + pixman_fixed_to_int (vx));
+   vx += unit_x;
+   while (vx >= 0)
+   vx -= src_width_fixed;
+   tmp3 = *(ps + pixman_fixed_to_int (vx));
+   vx += unit_x;
+   while (vx >= 0)
+   vx -= src_width_fixed;
+   tmp4 = *(ps + pixman_fixed_to_int (vx));
+   vx += unit_x;
+   while (vx >= 0)
+   vx -= src_width_fixed;
+
+   tmp[0] = tmp1;
+   tmp[1] = tmp2;
+   tmp[2] = tmp3;
+   tmp[3] = tmp4;
+
+   vmx_src_hi = combine4 ((const uint32_t *) &tmp, pm);
+
+   if (is_opaque (vmx_src_hi))
+   {
+   save_128_aligned (pd, vmx_src_hi);
+   }
+   else if (!is_zero (vmx_src_hi))
+   {
+   vmx_dst_hi = load_128_aligned (pd);
+
+   unpack_128_2x128 (vmx_src_hi, (vector unsigned int) AVV(0),
+   &vmx_src_lo, &vmx_src_hi);
+
+   unpack_128_2x128 (vmx_dst_hi, (vector unsigned int) AVV(0),
+   &vmx_dst_lo, &vmx_dst_hi);
+
+   expand_alpha_2x128 (
+   vmx_src_lo, vmx_src_hi, &vmx_alpha_lo, &vmx_alpha_hi);
+
+   over_2x128 (&vmx_src_lo, &vmx_src_hi,
+   &vmx_alpha_lo, &vmx_alpha_hi,
+   &vmx_dst_lo, &vmx_dst_hi);
+
+   /* rebuild the 4 pixel data and save */
+   save_128_aligned (pd, pack_2x128_128 (vmx_dst_lo, vmx_dst_hi));
+   }
+
+   w -= 4;
+   pd += 4;
+   if (pm)
+   pm += 4;
+}
+
+while (w)
+{
+   d = *pd;
+   s = combine1 (ps + pixman_fixed_to_int (vx), pm);
+   vx += unit_x;
+   while (vx >= 0)
+   vx -= src_width_fixed;
+
+   *pd++ = core_combine_over_u_pixel_vmx (s, d);
+   if (pm)
+   pm++;
+
+   w--;
+}
+}
+
+FAST_NEAREST_MAINLOOP (vmx_8888_8888_cover_OVER,
+  scaled_nearest_scanline_vmx_8888_8888_OVER,
+  uint32_t, uint32_t, COVER)
+FAST_NEAREST_MAINLOOP (vmx_8888_8888_none_OVER,
+  scaled_nearest_scanline_vmx_8888_8888_OVER,
+  uint32_t, uint32_t, NONE)
+FAST_NEAREST_MAINLOOP (vmx_8888_8888_pad_OVER,
+  scaled_nearest_scanline_vmx_8888_8888_OVER,
+  uint32_t, uint32_t, PAD)
+FAST_NEAREST_MAINLOOP (vmx_8888_8888_normal_OVER,
+  scaled_nearest_scanline_vmx_8888_8888_OVER,
+  uint32_t, 

[Pixman] [PATCH 01/12] vmx: add LOAD_VECTOR macro

2015-07-02 Thread Oded Gabbay
This patch adds a macro for loading a single vector.
It also makes the other LOAD_VECTORx macros use this macro as a base so
that code is reused.

In addition, I fixed minor coding style issues.

Signed-off-by: Oded Gabbay 
---
 pixman/pixman-vmx.c | 50 --
 1 file changed, 24 insertions(+), 26 deletions(-)

diff --git a/pixman/pixman-vmx.c b/pixman/pixman-vmx.c
index 1809790..15ea64e 100644
--- a/pixman/pixman-vmx.c
+++ b/pixman/pixman-vmx.c
@@ -169,33 +169,29 @@ over (vector unsigned int src,
 mask ## _mask = vec_lvsl (0, mask); \
 source ## _mask = vec_lvsl (0, source);
 
-/* notice you have to declare temp vars...
- * Note: tmp3 and tmp4 must remain untouched!
- */
-
-#define LOAD_VECTORS(dest, source)   \
-do { \
+#define LOAD_VECTOR(source)  \
+do   \
+{\
 vector unsigned char tmp1, tmp2; \
 tmp1 = (typeof(tmp1))vec_ld (0, source); \
 tmp2 = (typeof(tmp2))vec_ld (15, source);\
-v ## source = (typeof(v ## source))  \
+v ## source = (typeof(v ## source))  \
vec_perm (tmp1, tmp2, source ## _mask);   \
+} while (0)
+
+#define LOAD_VECTORS(dest, source)   \
+do   \
+{\
+LOAD_VECTOR(source); \
 v ## dest = (typeof(v ## dest))vec_ld (0, dest); \
-} while (0);
+} while (0)
 
 #define LOAD_VECTORSC(dest, source, mask)\
-do { \
-vector unsigned char tmp1, tmp2; \
-tmp1 = (typeof(tmp1))vec_ld (0, source); \
-tmp2 = (typeof(tmp2))vec_ld (15, source);\
-v ## source = (typeof(v ## source))  \
-   vec_perm (tmp1, tmp2, source ## _mask);   \
-tmp1 = (typeof(tmp1))vec_ld (0, mask);   \
-v ## dest = (typeof(v ## dest))vec_ld (0, dest); \
-tmp2 = (typeof(tmp2))vec_ld (15, mask);  \
-v ## mask = (typeof(v ## mask))  \
-vec_perm (tmp1, tmp2, mask ## _mask);\
-} while (0);
+do   \
+{\
+LOAD_VECTORS(dest, source);  \
+LOAD_VECTOR(mask);   \
+} while (0)
 
 #define DECLARE_SRC_MASK_VAR vector unsigned char src_mask
 #define DECLARE_MASK_MASK_VAR vector unsigned char mask_mask
@@ -213,14 +209,16 @@ do {
 
 #define COMPUTE_SHIFT_MASKC(dest, source, mask)
 
+# define LOAD_VECTOR(source)   \
+v ## source = *((typeof(v ## source)*)source);
+
 # define LOAD_VECTORS(dest, source)\
-v ## source = *((typeof(v ## source)*)source); \
-v ## dest = *((typeof(v ## dest)*)dest);
+LOAD_VECTOR(source);   \
+LOAD_VECTOR(dest); \
 
 # define LOAD_VECTORSC(dest, source, mask) \
-v ## source = *((typeof(v ## source)*)source); \
-v ## dest = *((typeof(v ## dest)*)dest);   \
-v ## mask = *((typeof(v ## mask)*)mask);
+LOAD_VECTORS(dest, source);\
+LOAD_VECTOR(mask); \
 
 #define DECLARE_SRC_MASK_VAR
 #define DECLARE_MASK_MASK_VAR
@@ -228,7 +226,7 @@ do {
 #endif /* WORDS_BIGENDIAN */
 
 #define LOAD_VECTORSM(dest, source, mask)  \
-LOAD_VECTORSC (dest, source, mask) \
+LOAD_VECTORSC (dest, source, mask);\
 v ## source = pix_multiply (v ## source,   \
 splat_alpha (v ## mask));
 
-- 
2.4.3



[Pixman] [PATCH 04/12] vmx: implement fast path vmx_blt

2015-07-02 Thread Oded Gabbay
No changes were observed when running cairo trimmed benchmarks.

Signed-off-by: Oded Gabbay 
---
 pixman/pixman-vmx.c | 124 
 1 file changed, 124 insertions(+)

diff --git a/pixman/pixman-vmx.c b/pixman/pixman-vmx.c
index b9acd6c..b42288b 100644
--- a/pixman/pixman-vmx.c
+++ b/pixman/pixman-vmx.c
@@ -2708,6 +2708,128 @@ vmx_fill (pixman_implementation_t *imp,
 return TRUE;
 }
 
+static pixman_bool_t
+vmx_blt (pixman_implementation_t *imp,
+  uint32_t *   src_bits,
+  uint32_t *   dst_bits,
+  int  src_stride,
+  int  dst_stride,
+  int  src_bpp,
+  int  dst_bpp,
+  int  src_x,
+  int  src_y,
+  int  dest_x,
+  int  dest_y,
+  int  width,
+  int  height)
+{
+uint8_t *   src_bytes;
+uint8_t *   dst_bytes;
+int byte_width;
+
+if (src_bpp != dst_bpp)
+   return FALSE;
+
+if (src_bpp == 16)
+{
+   src_stride = src_stride * (int) sizeof (uint32_t) / 2;
+   dst_stride = dst_stride * (int) sizeof (uint32_t) / 2;
+   src_bytes = (uint8_t *)(((uint16_t *)src_bits) + src_stride * (src_y) + (src_x));
+   dst_bytes = (uint8_t *)(((uint16_t *)dst_bits) + dst_stride * (dest_y) + (dest_x));
+   byte_width = 2 * width;
+   src_stride *= 2;
+   dst_stride *= 2;
+}
+else if (src_bpp == 32)
+{
+   src_stride = src_stride * (int) sizeof (uint32_t) / 4;
+   dst_stride = dst_stride * (int) sizeof (uint32_t) / 4;
+   src_bytes = (uint8_t *)(((uint32_t *)src_bits) + src_stride * (src_y) + (src_x));
+   dst_bytes = (uint8_t *)(((uint32_t *)dst_bits) + dst_stride * (dest_y) + (dest_x));
+   byte_width = 4 * width;
+   src_stride *= 4;
+   dst_stride *= 4;
+}
+else
+{
+   return FALSE;
+}
+
+while (height--)
+{
+   int w;
+   uint8_t *s = src_bytes;
+   uint8_t *d = dst_bytes;
+   src_bytes += src_stride;
+   dst_bytes += dst_stride;
+   w = byte_width;
+
+   while (w >= 2 && ((uintptr_t)d & 3))
+   {
+   *(uint16_t *)d = *(uint16_t *)s;
+   w -= 2;
+   s += 2;
+   d += 2;
+   }
+
+   while (w >= 4 && ((uintptr_t)d & 15))
+   {
+   *(uint32_t *)d = *(uint32_t *)s;
+
+   w -= 4;
+   s += 4;
+   d += 4;
+   }
+
+   while (w >= 64)
+   {
+   vector unsigned int vmx0, vmx1, vmx2, vmx3;
+
+   vmx0 = load_128_unaligned ((uint32_t*) s);
+   vmx1 = load_128_unaligned ((uint32_t*)(s + 16));
+   vmx2 = load_128_unaligned ((uint32_t*)(s + 32));
+   vmx3 = load_128_unaligned ((uint32_t*)(s + 48));
+
+   save_128_aligned ((uint32_t*)(d), vmx0);
+   save_128_aligned ((uint32_t*)(d + 16), vmx1);
+   save_128_aligned ((uint32_t*)(d + 32), vmx2);
+   save_128_aligned ((uint32_t*)(d + 48), vmx3);
+
+   s += 64;
+   d += 64;
+   w -= 64;
+   }
+
+   while (w >= 16)
+   {
+   save_128_aligned ((uint32_t*) d, load_128_unaligned ((uint32_t*) s));
+
+   w -= 16;
+   d += 16;
+   s += 16;
+   }
+
+   while (w >= 4)
+   {
+   *(uint32_t *)d = *(uint32_t *)s;
+
+   w -= 4;
+   s += 4;
+   d += 4;
+   }
+
+   if (w >= 2)
+   {
+   *(uint16_t *)d = *(uint16_t *)s;
+   w -= 2;
+   s += 2;
+   d += 2;
+   }
+}
+
+return TRUE;
+}
+
 static void
 vmx_composite_over_8888_8888 (pixman_implementation_t *imp,
                               pixman_composite_info_t *info)
@@ -2812,6 +2934,7 @@ vmx_composite_add_8888_8888 (pixman_implementation_t *imp,
 
 static const pixman_fast_path_t vmx_fast_paths[] =
 {
+/* PIXMAN_OP_OVER */
 PIXMAN_STD_FAST_PATH (OVER, a8r8g8b8, null, a8r8g8b8, vmx_composite_over_8888_8888),
 PIXMAN_STD_FAST_PATH (OVER, a8r8g8b8, null, x8r8g8b8, vmx_composite_over_8888_8888),
 PIXMAN_STD_FAST_PATH (OVER, a8b8g8r8, null, a8b8g8r8, vmx_composite_over_8888_8888),
@@ -2865,6 +2988,7 @@ _pixman_implementation_create_vmx (pixman_implementation_t *fallback)
 imp->combine_32_ca[PIXMAN_OP_XOR] = vmx_combine_xor_ca;
 imp->combine_32_ca[PIXMAN_OP_ADD] = vmx_combine_add_ca;
 
+imp->blt = vmx_blt;
 imp->fill = vmx_fill;
 
 return imp;
-- 
2.4.3



[Pixman] [PATCH 03/12] vmx: implement fast path vmx_fill

2015-07-02 Thread Oded Gabbay
Based on sse2 impl.

Tested cairo trimmed benchmarks on POWER8, 8 cores, 3.4GHz,
RHEL 7.1 ppc64le :

speedups
========
 t-swfdec-giant-steps  1382.86 ->  719.65  :  1.92x speedup
   t-gnome-system-monitor  1405.45 ->  923.71  :  1.52x speedup
  t-evolution  550.91  ->  410.73  :  1.34x speedup
  t-firefox-paintball  800.47  ->  661.47  :  1.21x speedup
t-firefox-canvas-swscroll  1419.22 ->  1209.98 :  1.17x speedup
  t-xfce4-terminal-a1  1537.75 ->  1319.23 :  1.17x speedup
  t-midori-zoomed  693.34  ->  607.99  :  1.14x speedup
  t-firefox-scrolling  1306.05 ->  1149.77 :  1.14x speedup
  t-chromium-tabs  210.72  ->  191.17  :  1.10x speedup
   t-firefox-planet-gnome  980.11  ->  913.88  :  1.07x speedup

Signed-off-by: Oded Gabbay 
---
 pixman/pixman-vmx.c | 153 
 1 file changed, 153 insertions(+)

diff --git a/pixman/pixman-vmx.c b/pixman/pixman-vmx.c
index b81cb45..b9acd6c 100644
--- a/pixman/pixman-vmx.c
+++ b/pixman/pixman-vmx.c
@@ -2557,6 +2557,157 @@ vmx_combine_add_ca (pixman_implementation_t *imp,
 }
 }
 
+static pixman_bool_t
+vmx_fill (pixman_implementation_t *imp,
+   uint32_t *   bits,
+   int  stride,
+   int  bpp,
+   int  x,
+   int  y,
+   int  width,
+   int  height,
+   uint32_tfiller)
+{
+uint32_t byte_width;
+uint8_t *byte_line;
+
+vector unsigned int vfiller;
+
+if (bpp == 8)
+{
+   uint8_t b;
+   uint16_t w;
+
+   stride = stride * (int) sizeof (uint32_t) / 1;
+   byte_line = (uint8_t *)(((uint8_t *)bits) + stride * y + x);
+   byte_width = width;
+   stride *= 1;
+
+   b = filler & 0xff;
+   w = (b << 8) | b;
+   filler = (w << 16) | w;
+}
+else if (bpp == 16)
+{
+   stride = stride * (int) sizeof (uint32_t) / 2;
+   byte_line = (uint8_t *)(((uint16_t *)bits) + stride * y + x);
+   byte_width = 2 * width;
+   stride *= 2;
+
+   filler = (filler & 0xffff) * 0x00010001;
+}
+else if (bpp == 32)
+{
+   stride = stride * (int) sizeof (uint32_t) / 4;
+   byte_line = (uint8_t *)(((uint32_t *)bits) + stride * y + x);
+   byte_width = 4 * width;
+   stride *= 4;
+}
+else
+{
+   return FALSE;
+}
+
+vfiller = create_mask_1x32_128(&filler);
+
+while (height--)
+{
+   int w;
+   uint8_t *d = byte_line;
+   byte_line += stride;
+   w = byte_width;
+
+   if (w >= 1 && ((uintptr_t)d & 1))
+   {
+   *(uint8_t *)d = filler;
+   w -= 1;
+   d += 1;
+   }
+
+   while (w >= 2 && ((uintptr_t)d & 3))
+   {
+   *(uint16_t *)d = filler;
+   w -= 2;
+   d += 2;
+   }
+
+   while (w >= 4 && ((uintptr_t)d & 15))
+   {
+   *(uint32_t *)d = filler;
+
+   w -= 4;
+   d += 4;
+   }
+
+   while (w >= 128)
+   {
+   vec_st(vfiller, 0, (uint32_t *) d);
+   vec_st(vfiller, 0, (uint32_t *) d + 4);
+   vec_st(vfiller, 0, (uint32_t *) d + 8);
+   vec_st(vfiller, 0, (uint32_t *) d + 12);
+   vec_st(vfiller, 0, (uint32_t *) d + 16);
+   vec_st(vfiller, 0, (uint32_t *) d + 20);
+   vec_st(vfiller, 0, (uint32_t *) d + 24);
+   vec_st(vfiller, 0, (uint32_t *) d + 28);
+
+   d += 128;
+   w -= 128;
+   }
+
+   if (w >= 64)
+   {
+   vec_st(vfiller, 0, (uint32_t *) d);
+   vec_st(vfiller, 0, (uint32_t *) d + 4);
+   vec_st(vfiller, 0, (uint32_t *) d + 8);
+   vec_st(vfiller, 0, (uint32_t *) d + 12);
+
+   d += 64;
+   w -= 64;
+   }
+
+   if (w >= 32)
+   {
+   vec_st(vfiller, 0, (uint32_t *) d);
+   vec_st(vfiller, 0, (uint32_t *) d + 4);
+
+   d += 32;
+   w -= 32;
+   }
+
+   if (w >= 16)
+   {
+   vec_st(vfiller, 0, (uint32_t *) d);
+
+   d += 16;
+   w -= 16;
+   }
+
+   while (w >= 4)
+   {
+   *(uint32_t *)d = filler;
+
+   w -= 4;
+   d += 4;
+   }
+
+   if (w >= 2)
+   {
+   *(uint16_t *)d = filler;
+   w -= 2;
+   d += 2;
+   }
+
+   if (w >= 1)
+   {
+   *(uint8_t *)d = filler;
+   w -= 1;
+   d += 1;
+   }
+}
+
+return TRUE;
+}
+
 static void
 vmx_composite_over_8888_8888 (pixman_implementation_t *imp,
                               pixman_composite_info_t *info)
@@ -2714,5 +2865,7 @@ _pixman_implementation_create_vmx (pixman_implementation_t *fallback)
 imp->combine_32_ca[PIXMAN_OP_XOR] = vmx_combine_xor_ca;
 imp->combine_32_ca[PIXMAN_OP_ADD] = vmx_combine_add_ca;
 
+imp->fill = vmx_fi

[Pixman] [PATCH 11/12] vmx: implement fast path iterator vmx_fetch_r5g6b5

2015-07-02 Thread Oded Gabbay
No changes were observed when running cairo trimmed benchmarks.

Signed-off-by: Oded Gabbay 
---
 pixman/pixman-vmx.c | 52 
 1 file changed, 52 insertions(+)

diff --git a/pixman/pixman-vmx.c b/pixman/pixman-vmx.c
index ed248e1..f71f358 100644
--- a/pixman/pixman-vmx.c
+++ b/pixman/pixman-vmx.c
@@ -3432,6 +3432,55 @@ vmx_fetch_x8r8g8b8 (pixman_iter_t *iter, const uint32_t *mask)
 return iter->buffer;
 }
 
+static uint32_t *
+vmx_fetch_r5g6b5 (pixman_iter_t *iter, const uint32_t *mask)
+{
+int w = iter->width;
+uint32_t *dst = iter->buffer;
+uint16_t *src = (uint16_t *)iter->bits;
+vector unsigned int ff000000 = mask_ff000000;
+
+iter->bits += iter->stride;
+
+while (w && ((uintptr_t)dst) & 0x0f)
+{
+   uint16_t s = *src++;
+
+   *dst++ = convert_0565_to_8888 (s);
+   w--;
+}
+
+while (w >= 8)
+{
+   vector unsigned int lo, hi, s;
+
+   s = load_128_unaligned((uint32_t *) src);
+
+   lo = unpack_565_to_8888(
+   unpacklo_128_8x16(s, (vector unsigned int) AVV(0)));
+
+   hi = unpack_565_to_8888(
+   unpackhi_128_8x16(s, (vector unsigned int) AVV(0)));
+
+   save_128_aligned(dst, vec_or(hi, ff000000));
+   save_128_aligned(dst + 4, vec_or(lo, ff000000));
+
+   dst += 8;
+   src += 8;
+   w -= 8;
+}
+
+while (w)
+{
+   uint16_t s = *src++;
+
+   *dst++ = convert_0565_to_8888 (s);
+   w--;
+}
+
+return iter->buffer;
+}
+
 #define IMAGE_FLAGS\
 (FAST_PATH_STANDARD_FLAGS | FAST_PATH_ID_TRANSFORM |   \
  FAST_PATH_BITS_IMAGE | FAST_PATH_SAMPLES_COVER_CLIP_NEAREST)
@@ -3441,6 +3490,9 @@ static const pixman_iter_info_t vmx_iters[] =
 { PIXMAN_x8r8g8b8, IMAGE_FLAGS, ITER_NARROW,
   _pixman_iter_init_bits_stride, vmx_fetch_x8r8g8b8, NULL
 },
+{ PIXMAN_r5g6b5, IMAGE_FLAGS, ITER_NARROW,
+  _pixman_iter_init_bits_stride, vmx_fetch_r5g6b5, NULL
+},
 { PIXMAN_null },
 };
 
-- 
2.4.3



[Pixman] [PATCH 06/12] vmx: implement fast path vmx_composite_over_n_8888_8888_ca

2015-07-02 Thread Oded Gabbay
POWER8, 8 cores, 3.4GHz, RHEL 7.1 ppc64le.

reference memcpy speed = 24764.8MB/s (6191.2MP/s for 32bpp fills)

Before   After   Change
------   -----   ------
L1  61.92244.91  +295.53%
L2  62.74243.3   +287.79%
M   63.03241.94  +283.85%
HT  59.91144.22  +140.73%
VT  59.4 174.39  +193.59%
R   53.6 111.37  +107.78%
RT  37.9946.38   +22.08%
Kops/s  436  506 +16.06%

cairo trimmed benchmarks :

Speedups
========
t-xfce4-terminal-a1  1540.37 -> 1226.14 :  1.26x
t-firefox-talos-gfx  1488.59 -> 1209.19 :  1.23x

Slowdowns
=========
t-evolution  553.88  -> 581.63  :  1.05x
  t-poppler  364.99  -> 383.79  :  1.05x
t-firefox-scrolling  1223.65 -> 1304.34 :  1.07x

Signed-off-by: Oded Gabbay 
---
 pixman/pixman-vmx.c | 112 
 1 file changed, 112 insertions(+)

diff --git a/pixman/pixman-vmx.c b/pixman/pixman-vmx.c
index e69d530..966219f 100644
--- a/pixman/pixman-vmx.c
+++ b/pixman/pixman-vmx.c
@@ -2871,6 +2871,114 @@ vmx_composite_over_8888_8888 (pixman_implementation_t *imp,
 }
 
 static void
+vmx_composite_over_n_8888_8888_ca (pixman_implementation_t *imp,
+pixman_composite_info_t *info)
+{
+PIXMAN_COMPOSITE_ARGS (info);
+uint32_t src;
+uint32_t*dst_line, d;
+uint32_t*mask_line, m;
+uint32_t pack_cmp;
+int dst_stride, mask_stride;
+
+vector unsigned int vsrc, valpha, vmask, vdest;
+
+vector unsigned int vmx_dst, vmx_dst_lo, vmx_dst_hi;
+vector unsigned int vmx_mask, vmx_mask_lo, vmx_mask_hi;
+
+src = _pixman_image_get_solid (imp, src_image, dest_image->bits.format);
+
+if (src == 0)
+   return;
+
+PIXMAN_IMAGE_GET_LINE (
+   dest_image, dest_x, dest_y, uint32_t, dst_stride, dst_line, 1);
+PIXMAN_IMAGE_GET_LINE (
+   mask_image, mask_x, mask_y, uint32_t, mask_stride, mask_line, 1);
+
+vsrc = unpacklo_128_16x8(create_mask_1x32_128 (&src),
+   (vector unsigned int) AVV(0));
+
+valpha = expand_alpha_1x128(vsrc);
+
+while (height--)
+{
+   int w = width;
+   const uint32_t *pm = (uint32_t *)mask_line;
+   uint32_t *pd = (uint32_t *)dst_line;
+
+   dst_line += dst_stride;
+   mask_line += mask_stride;
+
+   while (w && (uintptr_t)pd & 15)
+   {
+   m = *pm++;
+
+   if (m)
+   {
+   d = *pd;
+   vmask = unpack_32_1x128(m);
+   vdest = unpack_32_1x128(d);
+
+   *pd = pack_1x128_32(in_over (vsrc, valpha, vmask, vdest));
+   }
+
+   pd++;
+   w--;
+   }
+
+   while (w >= 4)
+   {
+   /* pm is NOT necessarily 16-byte aligned */
+   vmx_mask = load_128_unaligned (pm);
+
+   pack_cmp = vec_all_eq(vmx_mask, (vector unsigned int) AVV(0));
+
+   /* if all bits in mask are zero, pack_cmp is not 0 */
+   if (pack_cmp == 0)
+   {
+   /* pd is 16-byte aligned */
+   vmx_dst = load_128_aligned (pd);
+
+   unpack_128_2x128 (vmx_mask, (vector unsigned int) AVV(0),
+   &vmx_mask_lo, &vmx_mask_hi);
+
+   unpack_128_2x128 (vmx_dst, (vector unsigned int) AVV(0),
+   &vmx_dst_lo, &vmx_dst_hi);
+
+   in_over_2x128 (&vsrc, &vsrc,
+  &valpha, &valpha,
+  &vmx_mask_lo, &vmx_mask_hi,
+  &vmx_dst_lo, &vmx_dst_hi);
+
+   save_128_aligned(pd, pack_2x128_128(vmx_dst_lo, vmx_dst_hi));
+   }
+
+   pd += 4;
+   pm += 4;
+   w -= 4;
+   }
+
+   while (w)
+   {
+   m = *pm++;
+
+   if (m)
+   {
+   d = *pd;
+   vmask = unpack_32_1x128(m);
+   vdest = unpack_32_1x128(d);
+
+   *pd = pack_1x128_32(in_over (vsrc, valpha, vmask, vdest));
+   }
+
+   pd++;
+   w--;
+   }
+}
+}
+
+static void
 vmx_composite_add_8_8 (pixman_implementation_t *imp,
 pixman_composite_info_t *info)
 {
@@ -2953,6 +3061,10 @@ static const pixman_fast_path_t vmx_fast_paths[] =
 PIXMAN_STD_FAST_PATH (OVER, a8r8g8b8, null, x8r8g8b8, vmx_composite_over_8888_8888),
 PIXMAN_STD_FAST_PATH (OVER, a8b8g8r8, null, a8b8g8r8, vmx_composite_over_8888_8888),
 PIXMAN_STD_FAST_PATH (OVER, a8b8g8r8, null, x8b8g8r8, vmx_composite_over_8888_8888),
+PIXMAN_STD_FAST_PATH_CA (OVER, solid, a8r8g8b8, a8r8g8b8, vmx_composite_over_n_8888_8888_ca),
+PIXMAN_STD_FAST_PATH_CA (OVER, solid, a8r8g8b8, x8r8g8b8

[Pixman] [PATCH 08/12] vmx: implement fast path vmx_composite_src_x888_8888

2015-07-02 Thread Oded Gabbay
POWER8, 8 cores, 3.4GHz, RHEL 7.1 ppc64le.
reference memcpy speed = 24764.8MB/s (6191.2MP/s for 32bpp fills)

Before   After   Change
------   -----   ------
L1  1115.4  5006.49 +348.85%
L2  1112.26 4338.01 +290.02%
M   1110.54 2524.15 +127.29%
HT  745.41  1140.03 +52.94%
VT  749.03  1287.13 +71.84%
R   423.91  547.6   +29.18%
RT  205.79  194.98  -5.25%
Kops/s  1414    1361    -3.75%

cairo trimmed benchmarks :

Speedups
========
t-gnome-system-monitor  1402.62  -> 1212.75 :  1.16x
   t-firefox-asteroids   533.92  ->  474.50 :  1.13x

Signed-off-by: Oded Gabbay 
---
 pixman/pixman-vmx.c | 58 +
 1 file changed, 58 insertions(+)

diff --git a/pixman/pixman-vmx.c b/pixman/pixman-vmx.c
index 5c74a47..d5ddf4b 100644
--- a/pixman/pixman-vmx.c
+++ b/pixman/pixman-vmx.c
@@ -2967,6 +2967,62 @@ vmx_composite_copy_area (pixman_implementation_t *imp,
 }
 
 static void
+vmx_composite_src_x888_8888 (pixman_implementation_t *imp,
+ pixman_composite_info_t *info)
+{
+PIXMAN_COMPOSITE_ARGS (info);
+uint32_t*dst_line, *dst;
+uint32_t*src_line, *src;
+int32_t w;
+int dst_stride, src_stride;
+
+PIXMAN_IMAGE_GET_LINE (
+   dest_image, dest_x, dest_y, uint32_t, dst_stride, dst_line, 1);
+PIXMAN_IMAGE_GET_LINE (
+   src_image, src_x, src_y, uint32_t, src_stride, src_line, 1);
+
+while (height--)
+{
+   dst = dst_line;
+   dst_line += dst_stride;
+   src = src_line;
+   src_line += src_stride;
+   w = width;
+
+   while (w && (uintptr_t)dst & 15)
+   {
+   *dst++ = *src++ | 0xff000000;
+   w--;
+   }
+
+   while (w >= 16)
+   {
+   vector unsigned int vmx_src1, vmx_src2, vmx_src3, vmx_src4;
+
+   vmx_src1 = load_128_unaligned (src);
+   vmx_src2 = load_128_unaligned (src + 4);
+   vmx_src3 = load_128_unaligned (src + 8);
+   vmx_src4 = load_128_unaligned (src + 12);
+
+   save_128_aligned (dst, vec_or (vmx_src1, mask_ff000000));
+   save_128_aligned (dst + 4, vec_or (vmx_src2, mask_ff000000));
+   save_128_aligned (dst + 8, vec_or (vmx_src3, mask_ff000000));
+   save_128_aligned (dst + 12, vec_or (vmx_src4, mask_ff000000));
+
+   dst += 16;
+   src += 16;
+   w -= 16;
+   }
+
+   while (w)
+   {
+   *dst++ = *src++ | 0xff000000;
+   w--;
+   }
+}
+}
+
+static void
 vmx_composite_over_8888_8888 (pixman_implementation_t *imp,
                               pixman_composite_info_t *info)
 {
@@ -3200,6 +3256,8 @@ static const pixman_fast_path_t vmx_fast_paths[] =
 PIXMAN_STD_FAST_PATH (ADD, a8b8g8r8, null, a8b8g8r8, 
vmx_composite_add__),
 
 /* PIXMAN_OP_SRC */
+PIXMAN_STD_FAST_PATH (SRC, x8r8g8b8, null, a8r8g8b8, vmx_composite_src_x888_8888),
+PIXMAN_STD_FAST_PATH (SRC, x8b8g8r8, null, a8b8g8r8, vmx_composite_src_x888_8888),
 PIXMAN_STD_FAST_PATH (SRC, a8r8g8b8, null, a8r8g8b8, vmx_composite_copy_area),
 PIXMAN_STD_FAST_PATH (SRC, a8b8g8r8, null, a8b8g8r8, vmx_composite_copy_area),
 PIXMAN_STD_FAST_PATH (SRC, a8r8g8b8, null, x8r8g8b8, vmx_composite_copy_area),
-- 
2.4.3



[Pixman] [PATCH 00/12] Implement more vmx fast paths

2015-07-02 Thread Oded Gabbay
Hi,

This patch-set implements the most heavily used fast paths, according to
profiling done by me using the cairo traces package.

The patch-set adds many helper functions, to ease the conversion of fast paths
between the sse2 implementations (which I used as a base) and the vmx
implementations. All the helper functions are added in a single patch.

For each fast path, a different commit was made. Inside the commit message,
I wrote the improvement of the relevant low-level-blt test (if relevant), and
the improvement in cairo trimmed benchmarks.

From my observations, the single most important fast path is vmx_fill, as it
contributed the most improvement to cairo benchmarks.

This patchset is based on the previous patch-set I already sent to the mailing
list, which contains implementation of three other fast paths.

Please review.

Thanks,

Oded

Oded Gabbay (12):
  vmx: add LOAD_VECTOR macro
  vmx: add helper functions
  vmx: implement fast path vmx_fill
  vmx: implement fast path vmx_blt
  vmx: implement fast path vmx_composite_copy_area
  vmx: implement fast path vmx_composite_over_n_8888_8888_ca
  vmx: implement fast path vmx_composite_over_n_8_8888
  vmx: implement fast path vmx_composite_src_x888_8888
  vmx: implement fast path scaled nearest vmx_8888_8888_OVER
  vmx: implement fast path iterator vmx_fetch_x8r8g8b8
  vmx: implement fast path iterator vmx_fetch_r5g6b5
  vmx: implement fast path iterator vmx_fetch_a8

 pixman/pixman-vmx.c | 1441 +--
 1 file changed, 1404 insertions(+), 37 deletions(-)

-- 
2.4.3



[Pixman] [PATCH 05/12] vmx: implement fast path vmx_composite_copy_area

2015-07-02 Thread Oded Gabbay
No changes were observed when running cairo trimmed benchmarks.

Signed-off-by: Oded Gabbay 
---
 pixman/pixman-vmx.c | 26 ++
 1 file changed, 26 insertions(+)

diff --git a/pixman/pixman-vmx.c b/pixman/pixman-vmx.c
index b42288b..e69d530 100644
--- a/pixman/pixman-vmx.c
+++ b/pixman/pixman-vmx.c
@@ -2831,6 +2831,20 @@ vmx_blt (pixman_implementation_t *imp,
 }
 
 static void
+vmx_composite_copy_area (pixman_implementation_t *imp,
+  pixman_composite_info_t *info)
+{
+PIXMAN_COMPOSITE_ARGS (info);
+vmx_blt (imp, src_image->bits.bits,
+ dest_image->bits.bits,
+ src_image->bits.rowstride,
+ dest_image->bits.rowstride,
+ PIXMAN_FORMAT_BPP (src_image->bits.format),
+ PIXMAN_FORMAT_BPP (dest_image->bits.format),
+ src_x, src_y, dest_x, dest_y, width, height);
+}
+
+static void
 vmx_composite_over_8888_8888 (pixman_implementation_t *imp,
                               pixman_composite_info_t *info)
 {
@@ -2939,12 +2953,24 @@ static const pixman_fast_path_t vmx_fast_paths[] =
 PIXMAN_STD_FAST_PATH (OVER, a8r8g8b8, null, x8r8g8b8, vmx_composite_over_8888_8888),
 PIXMAN_STD_FAST_PATH (OVER, a8b8g8r8, null, a8b8g8r8, vmx_composite_over_8888_8888),
 PIXMAN_STD_FAST_PATH (OVER, a8b8g8r8, null, x8b8g8r8, vmx_composite_over_8888_8888),
+PIXMAN_STD_FAST_PATH (OVER, x8r8g8b8, null, x8r8g8b8, vmx_composite_copy_area),
+PIXMAN_STD_FAST_PATH (OVER, x8b8g8r8, null, x8b8g8r8, vmx_composite_copy_area),
 
 /* PIXMAN_OP_ADD */
 PIXMAN_STD_FAST_PATH (ADD, a8, null, a8, vmx_composite_add_8_8),
 PIXMAN_STD_FAST_PATH (ADD, a8r8g8b8, null, a8r8g8b8, vmx_composite_add_8888_8888),
 PIXMAN_STD_FAST_PATH (ADD, a8b8g8r8, null, a8b8g8r8, vmx_composite_add_8888_8888),
 
+/* PIXMAN_OP_SRC */
+PIXMAN_STD_FAST_PATH (SRC, a8r8g8b8, null, a8r8g8b8, vmx_composite_copy_area),
+PIXMAN_STD_FAST_PATH (SRC, a8b8g8r8, null, a8b8g8r8, vmx_composite_copy_area),
+PIXMAN_STD_FAST_PATH (SRC, a8r8g8b8, null, x8r8g8b8, vmx_composite_copy_area),
+PIXMAN_STD_FAST_PATH (SRC, a8b8g8r8, null, x8b8g8r8, vmx_composite_copy_area),
+PIXMAN_STD_FAST_PATH (SRC, x8r8g8b8, null, x8r8g8b8, vmx_composite_copy_area),
+PIXMAN_STD_FAST_PATH (SRC, x8b8g8r8, null, x8b8g8r8, vmx_composite_copy_area),
+PIXMAN_STD_FAST_PATH (SRC, r5g6b5, null, r5g6b5, vmx_composite_copy_area),
+PIXMAN_STD_FAST_PATH (SRC, b5g6r5, null, b5g6r5, vmx_composite_copy_area),
+
 {   PIXMAN_OP_NONE },
 };
 
-- 
2.4.3



[Pixman] [PATCH 02/12] vmx: add helper functions

2015-07-02 Thread Oded Gabbay
This patch adds the following helper functions for code reuse, for hiding
BE/LE differences, and for maintainability.

All of the functions were defined as static force_inline.

Names were copied from pixman-sse2.c so conversion of fast-paths between
sse2 and vmx would be easier from now on. Therefore, I tried to keep the
input/output of the functions to be as close as possible to the sse2
definitions.

The functions are (a short usage sketch follows the list):

- load_128_aligned   : load 128-bit from a 16-byte aligned memory
   address into a vector

- load_128_unaligned : load 128-bit from memory into a vector,
   without guarantee of alignment for the
   source pointer

- save_128_aligned   : save 128-bit vector into a 16-byte aligned
   memory address

- create_mask_16_128 : take a 16-bit value and fill with it
   a new vector

- create_mask_1x32_128   : take a 32-bit pointer and fill a new
   vector with the 32-bit value from that pointer

- create_mask_32_128 : take a 32-bit value and fill with it
   a new vector

- unpack_32_1x128: unpack 32-bit value into a vector

- unpacklo_128_16x8  : unpack the eight low 8-bit values of a vector

- unpackhi_128_16x8  : unpack the eight high 8-bit values of a vector

- unpacklo_128_8x16  : unpack the four low 16-bit values of a vector

- unpackhi_128_8x16  : unpack the four high 16-bit values of a vector

- unpack_128_2x128   : unpack the eight low 8-bit values of a vector
   into one vector and the eight high 8-bit
   values into another vector

- unpack_128_2x128_16: unpack the four low 16-bit values of a vector
   into one vector and the four high 16-bit
   values into another vector

- unpack_565_to_8888 : unpack an RGB_565 vector to an 8888 vector

- pack_1x128_32  : pack a vector and return the LSB 32-bit of it

- pack_2x128_128 : pack two vectors into one and return it

- negate_2x128   : xor two vectors with mask_00ff (separately)

- is_opaque  : returns whether all the pixels contained in
   the vector are opaque

- is_zero: returns whether the vector equals 0

- is_transparent : returns whether all the pixels
   contained in the vector are transparent

- expand_pixel_8_1x128   : expand an 8-bit pixel into lower 8 bytes of a
   vector

- expand_pixel_32_1x128  : expand a 32-bit pixel into lower 2 bytes of a
   vector

- expand_alpha_1x128 : expand alpha from vector and return the new
   vector

- expand_alpha_2x128 : expand alpha from one vector and another alpha
   from a second vector

- expand_alpha_rev_2x128 : expand a reversed alpha from one vector and
   another reversed alpha from a second vector

- pix_multiply_2x128 : do pix_multiply for two vectors (separately)

- over_2x128 : perform over op. on two vectors

- in_over_2x128  : perform in-over op. on two vectors
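
To illustrate how these compose, here is a minimal sketch of an OVER of four
a8r8g8b8 pixels (this function is not part of the patch; it assumes the
helpers above and the vector types already used in pixman-vmx.c):

/* Sketch: OVER of four 32bpp pixels using the helpers above.
 * dst must be 16-byte aligned; src may be unaligned. */
static force_inline void
over_4_pixels_sketch (const uint32_t *src, uint32_t *dst)
{
    vector unsigned int vsrc, vsrc_lo, vsrc_hi;
    vector unsigned int vdst, vdst_lo, vdst_hi;
    vector unsigned int valpha_lo, valpha_hi;

    vsrc = load_128_unaligned (src);
    vdst = load_128_aligned (dst);

    /* widen the 8-bit channels to 16-bit lanes, low and high halves */
    unpack_128_2x128 (vsrc, (vector unsigned int) AVV(0), &vsrc_lo, &vsrc_hi);
    unpack_128_2x128 (vdst, (vector unsigned int) AVV(0), &vdst_lo, &vdst_hi);

    expand_alpha_2x128 (vsrc_lo, vsrc_hi, &valpha_lo, &valpha_hi);

    over_2x128 (&vsrc_lo, &vsrc_hi,
                &valpha_lo, &valpha_hi,
                &vdst_lo, &vdst_hi);

    save_128_aligned (dst, pack_2x128_128 (vdst_lo, vdst_hi));
}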

Signed-off-by: Oded Gabbay 
---
 pixman/pixman-vmx.c | 496 
 1 file changed, 496 insertions(+)

diff --git a/pixman/pixman-vmx.c b/pixman/pixman-vmx.c
index 15ea64e..b81cb45 100644
--- a/pixman/pixman-vmx.c
+++ b/pixman/pixman-vmx.c
@@ -30,10 +30,19 @@
 #endif
 #include "pixman-private.h"
 #include "pixman-combine32.h"
+#include "pixman-inlines.h"
 #include 
 
 #define AVV(x...) {x}
 
+static vector unsigned int mask_00ff;
+static vector unsigned int mask_ff000000;
+static vector unsigned int mask_red;
+static vector unsigned int mask_green;
+static vector unsigned int mask_blue;
+static vector unsigned int mask_565_fix_rb;
+static vector unsigned int mask_565_fix_g;
+
 static force_inline vector unsigned int
 splat_alpha (vector unsigned int pix)
 {
@@ -233,6 +242,484 @@ do
 #define STORE_VECTOR(dest) \
 vec_st ((vector unsigned int) v ## dest, 0, dest);
 
+/* load 4 pixels from a 16-byte boundary aligned address */
+static force_inline vector unsigned int
+load_128_aligned (const uint32_t* src)
+{
+return *((vector unsigned int *) src);
+}
+
+/* load 4 pixels from an unaligned address */
+static force_inline vector unsigned int
+load_128_unaligned (const uint32_t* src)
+{
+vector unsigned int vsrc;
+DECLARE_SRC_MASK_VAR;
+
+COMPUTE_SHIFT_MASK (src);
+LOAD_VECTOR (src);
+
+return vsrc;
+}
+
+/* save 4 pixels on a 16-byte boundary aligned address */
+static force_inline void
+save_128_aligned (uint32_t* data,
+ vector unsigned int vdata)
+{
+STORE_VECTOR(data)
+}

Re: [Pixman] [PATCH v2 3/5] vmx: encapsulate the temporary variables inside the macros

2015-07-02 Thread Oded Gabbay
On Thu, Jul 2, 2015 at 10:08 AM, Pekka Paalanen  wrote:
> On Thu, 25 Jun 2015 15:59:55 +0300
> Oded Gabbay  wrote:
>
>> v2: fixed whitespaces and indentation issues
>>
>> Signed-off-by: Oded Gabbay 
>> Reviewed-by: Adam Jackson 
>> ---
>>  pixman/pixman-vmx.c | 72 
>> +
>>  1 file changed, 39 insertions(+), 33 deletions(-)
>>
>> diff --git a/pixman/pixman-vmx.c b/pixman/pixman-vmx.c
>> index e33d9d9..f28a0fd 100644
>> --- a/pixman/pixman-vmx.c
>> +++ b/pixman/pixman-vmx.c
>> @@ -153,13 +153,18 @@ over (vector unsigned int src,
>>   */
>>
>>  #define LOAD_VECTORS(dest, source) \
>> +do {   \
>> +vector unsigned char tmp1, tmp2;   \
>>  tmp1 = (typeof(tmp1))vec_ld (0, source);   \
>>  tmp2 = (typeof(tmp2))vec_ld (15, source);  \
>>  v ## source = (typeof(v ## source))\
>>   vec_perm (tmp1, tmp2, source ## _mask);   \
>> -v ## dest = (typeof(v ## dest))vec_ld (0, dest);
>> +v ## dest = (typeof(v ## dest))vec_ld (0, dest);   \
>> +} while (0);
>
> Here...
>
>>
>>  #define LOAD_VECTORSC(dest, source, mask)  \
>> +do {   \
>> +vector unsigned char tmp1, tmp2;   \
>>  tmp1 = (typeof(tmp1))vec_ld (0, source);   \
>>  tmp2 = (typeof(tmp2))vec_ld (15, source);  \
>>  v ## source = (typeof(v ## source))\
>> @@ -168,7 +173,8 @@ over (vector unsigned int src,
>>  v ## dest = (typeof(v ## dest))vec_ld (0, dest);   \
>>  tmp2 = (typeof(tmp2))vec_ld (15, mask);\
>>  v ## mask = (typeof(v ## mask))\
>> - vec_perm (tmp1, tmp2, mask ## _mask);
>> +vec_perm (tmp1, tmp2, mask ## _mask);  \
>> +} while (0);
>
> and here the final semicolon is too much. People expect to write them
> when they call the macro.
>
> But, it's not a bug really, it's just extra semicolon that can be
> cleaned up later, so I won't hold this patch due that.

I'm going to send another set of patches today, so I'll add a separate
patch that fixes these issues.

>
> Another style issue is that Pixman CODING_STYLE says the braces go on
> separate lines.
>
Same thing

> Is the comment about "notice you have to declare temp vars" now moot?
> I also can't see tmp3 or tmp4 anywhere, so I suppose the whole comment
> is just stale now?
Correct, will remove that
>
> Anyway, all that can be follow-ups.
>
>
> Thanks,
> pq


Re: [Pixman] [PATCH v2 0/5] Fix vmx fast-paths for ppc64le

2015-07-02 Thread Pekka Paalanen
On Thu, 25 Jun 2015 15:59:52 +0300
Oded Gabbay  wrote:

> Hi,
> 
> Here is v2 of my patch-set to fix vmx fast-paths for ppc64le.
> The fixes in this v2 are:
> 
> - replace _LITTLE_ENDIAN with WORDS_BIGENDIAN for consistency
> - fixed whitespaces and indentation issues
> - replace #ifndef with #ifdef for readability
> - don't put ';' at the end of macro definition. Instead, move it to
>   each line the macro is used.

Hi,

all patches pushed with my Acked-by, and assuming ajax' R-b still stands.
Minor nitpicks sent on patch 3 which can be fixed later.

Pushed:
   eebc1b7..2be523b  master -> master


Thanks,
pq

> This patch-set fixes the vmx fast-paths in regard to ppc64le
> architecture (POWER8).
> 
> Although IBM's Fernando Seiti Furusato published a patch concerning this
> topic, it had some problems and I didn't see any "movement" regarding it
> for the past 2 weeks (since Jun-3).
> 
> As I'm working full-time on this issue, I took the liberty to take part of
> Fernando's patch and make it a separate patch with his name as author.
> 
> The rest of the patches are my own, based on comments made to
> Fernando's original patch and based on my investigations and judgement.
> 
> I verified that all the tests work on POWER8 server (with RHEL 7.1 LE) and
> POWER7 server (with RHEL 7.1). There is no degradation in performance.
> 
> I hope to improve performance in the next patches I will send.
> 
> 
> Fernando Seiti Furusato (1):
>   vmx: adjust macros when loading vectors on ppc64le
> 
> Oded Gabbay (4):
>   vmx: fix splat_alpha for ppc64le
>   vmx: encapsulate the temporary variables inside the macros
>   vmx: fix unused var warnings
>   vmx: fix pix_multiply for ppc64le
> 
>  pixman/pixman-vmx.c | 152 
> 
>  1 file changed, 119 insertions(+), 33 deletions(-)
> 



Re: [Pixman] [PATCH v2 3/5] vmx: encapsulate the temporary variables inside the macros

2015-07-02 Thread Pekka Paalanen
On Thu, 25 Jun 2015 15:59:55 +0300
Oded Gabbay  wrote:

> v2: fixed whitespaces and indentation issues
> 
> Signed-off-by: Oded Gabbay 
> Reviewed-by: Adam Jackson 
> ---
>  pixman/pixman-vmx.c | 72 
> +
>  1 file changed, 39 insertions(+), 33 deletions(-)
> 
> diff --git a/pixman/pixman-vmx.c b/pixman/pixman-vmx.c
> index e33d9d9..f28a0fd 100644
> --- a/pixman/pixman-vmx.c
> +++ b/pixman/pixman-vmx.c
> @@ -153,13 +153,18 @@ over (vector unsigned int src,
>   */
>  
>  #define LOAD_VECTORS(dest, source) \
> +do {   \
> +vector unsigned char tmp1, tmp2;   \
>  tmp1 = (typeof(tmp1))vec_ld (0, source);   \
>  tmp2 = (typeof(tmp2))vec_ld (15, source);  \
>  v ## source = (typeof(v ## source))\
>   vec_perm (tmp1, tmp2, source ## _mask);   \
> -v ## dest = (typeof(v ## dest))vec_ld (0, dest);
> +v ## dest = (typeof(v ## dest))vec_ld (0, dest);   \
> +} while (0);

Here...

>  
>  #define LOAD_VECTORSC(dest, source, mask)  \
> +do {   \
> +vector unsigned char tmp1, tmp2;   \
>  tmp1 = (typeof(tmp1))vec_ld (0, source);   \
>  tmp2 = (typeof(tmp2))vec_ld (15, source);  \
>  v ## source = (typeof(v ## source))\
> @@ -168,7 +173,8 @@ over (vector unsigned int src,
>  v ## dest = (typeof(v ## dest))vec_ld (0, dest);   \
>  tmp2 = (typeof(tmp2))vec_ld (15, mask);\
>  v ## mask = (typeof(v ## mask))\
> - vec_perm (tmp1, tmp2, mask ## _mask);
> +vec_perm (tmp1, tmp2, mask ## _mask);  \
> +} while (0);

and here the final semicolon is too much. People expect to write them
when they call the macro.

But, it's not a bug really, it's just extra semicolon that can be
cleaned up later, so I won't hold this patch due that.
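
A minimal sketch of the failure mode, for the record (the macro body here is
hypothetical; only the "} while (0);" part mirrors the patch). The extra ';'
expands to an empty statement that terminates the 'if', so this does not
compile:

#define LOAD_VECTORS(dest, source) \
    do { (void) (dest); (void) (source); } while (0);  /* trailing ';' */

void
load_one (int aligned, int dest, int source)
{
    if (aligned)
        LOAD_VECTORS (dest, source);  /* expands to "... while (0);;" */
    else                              /* error: 'else' without a matching 'if' */
        source = dest;
}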

Another style issue is that Pixman CODING_STYLE says the braces go on
separate lines.

Is the comment about "notice you have to declare temp vars" now moot?
I also can't see tmp3 or tmp4 anywhere, so I suppose the whole comment
is just stale now?

Anyway, all that can be follow-ups.


Thanks,
pq