Re: [Pixman] [PATCH 0/9] lowlevel-blt-bench improvements for automated testing
On Wed, 10 Jun 2015 14:32:49 +0100, Pekka Paalanen wrote:
> Most of the patches are trivial cleanups. The meat is the last two: CSV
> output mode and skipping the memory speed benchmark. Both new features
> are designed for an external benchmarking harness that runs several
> different versions of Pixman with lowlevel-blt-bench in an alternating
> fashion. Alternating iterations are needed to get reliable results on
> platforms like the Raspberry Pi.

These look like sensible improvements to me. A few minor points:

Patch 2 commit message: Move explanation printing to a new file. This
will help with function

Patch 8: if the aim is machine readability, I'd suggest no space after
the comma. Otherwise, if the comma is used as a field separator, the
first field has no leading space but all the others do, which may make
things a little more fiddly when post-processing the results with some
tools.

Not really your fault, but I noticed it when trying out the new version:
it doesn't fault an attempt to list more than one pattern on the command
line; it just acts on the last one. It should really either fault it,
or (more usefully) benchmark all of the patterns specified. This would
allow a subset of the "all" tests to be performed, sharing the same
memcpy measurement overhead, or a group of operations including those
not in the "all" list.

In other respects, happy to give my Reviewed-by.

Ben
_______________________________________________
Pixman mailing list
Pixman@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/pixman
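Ben's point about the separator is easy to demonstrate with a plain `strtok()` split. The row text below is a hypothetical example of the CSV style under discussion, not lowlevel-blt-bench's actual output:

```c
#include <assert.h>
#include <string.h>

/* Split a CSV row in place on ',' and return the number of fields.
 * With ", " as the separator style, every field after the first keeps
 * a leading space that the consumer must then trim. */
static int split_csv (char *row, char **fields, int max)
{
    int n = 0;
    char *tok;

    for (tok = strtok (row, ","); tok != NULL && n < max;
         tok = strtok (NULL, ","))
        fields[n++] = tok;

    return n;
}
```

Splitting `"over_8888_8888, 123.4, 456.7"` yields `" 123.4"` and `" 456.7"` with leading spaces, whereas a plain `","` separator would not.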
[Pixman] [PATCH 12/12] vmx: implement fast path iterator vmx_fetch_a8
No changes were observed when running cairo trimmed benchmarks.

Signed-off-by: Oded Gabbay
---
 pixman/pixman-vmx.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 46 insertions(+)

diff --git a/pixman/pixman-vmx.c b/pixman/pixman-vmx.c
index f71f358..03e8afa 100644
--- a/pixman/pixman-vmx.c
+++ b/pixman/pixman-vmx.c
@@ -3481,6 +3481,49 @@ vmx_fetch_r5g6b5 (pixman_iter_t *iter, const uint32_t *mask)
     return iter->buffer;
 }

+static uint32_t *
+vmx_fetch_a8 (pixman_iter_t *iter, const uint32_t *mask)
+{
+    int w = iter->width;
+    uint32_t *dst = iter->buffer;
+    uint8_t *src = iter->bits;
+    vector unsigned int vmx0, vmx1, vmx2, vmx3, vmx4, vmx5, vmx6;
+
+    iter->bits += iter->stride;
+
+    while (w && (((uintptr_t)dst) & 15))
+    {
+        *dst++ = *(src++) << 24;
+        w--;
+    }
+
+    while (w >= 16)
+    {
+        vmx0 = load_128_unaligned ((uint32_t *) src);
+
+        unpack_128_2x128 ((vector unsigned int) AVV(0), vmx0, &vmx1, &vmx2);
+        unpack_128_2x128_16 ((vector unsigned int) AVV(0), vmx1, &vmx3, &vmx4);
+        unpack_128_2x128_16 ((vector unsigned int) AVV(0), vmx2, &vmx5, &vmx6);
+
+        save_128_aligned (dst, vmx6);
+        save_128_aligned ((dst + 4), vmx5);
+        save_128_aligned ((dst + 8), vmx4);
+        save_128_aligned ((dst + 12), vmx3);
+
+        dst += 16;
+        src += 16;
+        w -= 16;
+    }
+
+    while (w)
+    {
+        *dst++ = *(src++) << 24;
+        w--;
+    }
+
+    return iter->buffer;
+}
+
 #define IMAGE_FLAGS                                                    \
     (FAST_PATH_STANDARD_FLAGS | FAST_PATH_ID_TRANSFORM |               \
      FAST_PATH_BITS_IMAGE | FAST_PATH_SAMPLES_COVER_CLIP_NEAREST)
@@ -3493,6 +3536,9 @@ static const pixman_iter_info_t vmx_iters[] =
     { PIXMAN_r5g6b5, IMAGE_FLAGS, ITER_NARROW,
       _pixman_iter_init_bits_stride, vmx_fetch_r5g6b5, NULL
     },
+    { PIXMAN_a8, IMAGE_FLAGS, ITER_NARROW,
+      _pixman_iter_init_bits_stride, vmx_fetch_a8, NULL
+    },
     { PIXMAN_null },
 };
--
2.4.3
_______________________________________________
Pixman mailing list
Pixman@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/pixman
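The vector loop above just widens each a8 byte into the alpha channel of an a8r8g8b8 pixel; the scalar head and tail loops spell out the whole transform. A portable sketch of that per-pixel step (`fetch_a8_pixel` is an illustrative name, not a pixman function):

```c
#include <stdint.h>

/* Expand one a8 alpha byte to an a8r8g8b8 pixel: the alpha goes to
 * bits 31..24 and the color channels stay zero, matching the patch's
 * scalar loops (*dst++ = *(src++) << 24). */
static uint32_t fetch_a8_pixel (uint8_t a)
{
    return (uint32_t)a << 24;
}
```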
[Pixman] [PATCH 10/12] vmx: implement fast path iterator vmx_fetch_x8r8g8b8
POWER8, 8 cores, 3.4GHz, RHEL 7.1 ppc64le.

cairo trimmed benchmarks:

Speedups
========
t-firefox-asteroids  533.92 -> 489.94 : 1.09x

Signed-off-by: Oded Gabbay
---
 pixman/pixman-vmx.c | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 48 insertions(+)

diff --git a/pixman/pixman-vmx.c b/pixman/pixman-vmx.c
index 2f82ce7..ed248e1 100644
--- a/pixman/pixman-vmx.c
+++ b/pixman/pixman-vmx.c
@@ -3398,6 +3398,52 @@ static const pixman_fast_path_t vmx_fast_paths[] =
     { PIXMAN_OP_NONE },
 };

+static uint32_t *
+vmx_fetch_x8r8g8b8 (pixman_iter_t *iter, const uint32_t *mask)
+{
+    int w = iter->width;
+    vector unsigned int ff000000 = mask_ff000000;
+    uint32_t *dst = iter->buffer;
+    uint32_t *src = (uint32_t *)iter->bits;
+
+    iter->bits += iter->stride;
+
+    while (w && ((uintptr_t)dst) & 0x0f)
+    {
+        *dst++ = (*src++) | 0xff000000;
+        w--;
+    }
+
+    while (w >= 4)
+    {
+        save_128_aligned (dst, vec_or (load_128_unaligned (src), ff000000));
+
+        dst += 4;
+        src += 4;
+        w -= 4;
+    }
+
+    while (w)
+    {
+        *dst++ = (*src++) | 0xff000000;
+        w--;
+    }
+
+    return iter->buffer;
+}
+
+#define IMAGE_FLAGS                                                    \
+    (FAST_PATH_STANDARD_FLAGS | FAST_PATH_ID_TRANSFORM |               \
+     FAST_PATH_BITS_IMAGE | FAST_PATH_SAMPLES_COVER_CLIP_NEAREST)
+
+static const pixman_iter_info_t vmx_iters[] =
+{
+    { PIXMAN_x8r8g8b8, IMAGE_FLAGS, ITER_NARROW,
+      _pixman_iter_init_bits_stride, vmx_fetch_x8r8g8b8, NULL
+    },
+    { PIXMAN_null },
+};
+
 pixman_implementation_t *
 _pixman_implementation_create_vmx (pixman_implementation_t *fallback)
 {
@@ -3441,5 +3487,7 @@ _pixman_implementation_create_vmx (pixman_implementation_t *fallback)
     imp->blt = vmx_blt;
     imp->fill = vmx_fill;

+    imp->iter_info = vmx_iters;
+
     return imp;
 }
--
2.4.3
_______________________________________________
Pixman mailing list
Pixman@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/pixman
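The whole iterator boils down to forcing the unused x8 byte to full alpha; the vector path does it four pixels at a time with `vec_or` against a `0xff000000` splat. A scalar model of the per-pixel operation (`x888_to_8888` is an illustrative name):

```c
#include <stdint.h>

/* x8r8g8b8 -> a8r8g8b8: OR full alpha into the top byte, exactly what
 * the scalar head/tail loops of the fetcher do per pixel. */
static uint32_t x888_to_8888 (uint32_t x)
{
    return x | 0xff000000u;
}
```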
[Pixman] [PATCH 07/12] vmx: implement fast path vmx_composite_over_n_8_8888
POWER8, 8 cores, 3.4GHz, RHEL 7.1 ppc64le.

reference memcpy speed = 24764.8MB/s (6191.2MP/s for 32bpp fills)

        Before   After    Change
        ---------------------------
L1      90.21    133.21   +47.67%
L2      94.91    132.95   +40.08%
M       95.49    132.53   +38.79%
HT      88.07    100.43   +14.03%
VT      86.65    112.45   +29.77%
R       82.77    96.25    +16.29%
RT      65.64    55.14    -16.00%
Kops/s  673      580      -13.82%

cairo trimmed benchmarks:

Speedups
========
t-firefox-asteroids  533.92 -> 495.51 : 1.08x

Slowdowns
=========
t-poppler               364.99 ->  393.72 : 1.08x
t-firefox-canvas-alpha  984.55 -> 1197.85 : 1.22x

Signed-off-by: Oded Gabbay
---
 pixman/pixman-vmx.c | 126 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 126 insertions(+)

diff --git a/pixman/pixman-vmx.c b/pixman/pixman-vmx.c
index 966219f..5c74a47 100644
--- a/pixman/pixman-vmx.c
+++ b/pixman/pixman-vmx.c
@@ -2557,6 +2557,128 @@ vmx_combine_add_ca (pixman_implementation_t *imp,
     }
 }

+static void
+vmx_composite_over_n_8_8888 (pixman_implementation_t *imp,
+                             pixman_composite_info_t *info)
+{
+    PIXMAN_COMPOSITE_ARGS (info);
+    uint32_t src, srca;
+    uint32_t *dst_line, *dst;
+    uint8_t *mask_line, *mask;
+    int dst_stride, mask_stride;
+    int32_t w;
+    uint32_t m, d;
+
+    vector unsigned int vsrc, valpha, vmask;
+
+    vector unsigned int vmx_dst, vmx_dst_lo, vmx_dst_hi;
+    vector unsigned int vmx_mask, vmx_mask_lo, vmx_mask_hi;
+
+    src = _pixman_image_get_solid (imp, src_image, dest_image->bits.format);
+
+    srca = src >> 24;
+    if (src == 0)
+        return;
+
+    PIXMAN_IMAGE_GET_LINE (
+        dest_image, dest_x, dest_y, uint32_t, dst_stride, dst_line, 1);
+    PIXMAN_IMAGE_GET_LINE (
+        mask_image, mask_x, mask_y, uint8_t, mask_stride, mask_line, 1);
+
+    vmask = create_mask_1x32_128 (&src);
+    vsrc = expand_pixel_32_1x128 (src);
+    valpha = expand_alpha_1x128 (vsrc);
+
+    while (height--)
+    {
+        dst = dst_line;
+        dst_line += dst_stride;
+        mask = mask_line;
+        mask_line += mask_stride;
+        w = width;
+
+        while (w && (uintptr_t)dst & 15)
+        {
+            uint8_t m = *mask++;
+
+            if (m)
+            {
+                d = *dst;
+                vmx_mask = expand_pixel_8_1x128 (m);
+                vmx_dst = unpack_32_1x128 (d);
+
+                *dst = pack_1x128_32 (in_over (vsrc,
+                                               valpha,
+                                               vmx_mask,
+                                               vmx_dst));
+            }
+
+            w--;
+            dst++;
+        }
+
+        while (w >= 4)
+        {
+            m = *((uint32_t*)mask);
+
+            if (srca == 0xff && m == 0xffffffff)
+            {
+                save_128_aligned (dst, vmask);
+            }
+            else if (m)
+            {
+                vmx_dst = load_128_aligned (dst);
+
+                vmx_mask = unpack_32_1x128 (m);
+                vmx_mask = unpacklo_128_16x8 (vmx_mask,
+                                              (vector unsigned int) AVV(0));
+
+                /* Unpacking */
+                unpack_128_2x128 (vmx_dst, (vector unsigned int) AVV(0),
+                                  &vmx_dst_lo, &vmx_dst_hi);
+
+                unpack_128_2x128 (vmx_mask, (vector unsigned int) AVV(0),
+                                  &vmx_mask_lo, &vmx_mask_hi);
+
+                expand_alpha_rev_2x128 (vmx_mask_lo, vmx_mask_hi,
+                                        &vmx_mask_lo, &vmx_mask_hi);
+
+                in_over_2x128 (&vsrc, &vsrc,
+                               &valpha, &valpha,
+                               &vmx_mask_lo, &vmx_mask_hi,
+                               &vmx_dst_lo, &vmx_dst_hi);
+
+                save_128_aligned (dst, pack_2x128_128 (vmx_dst_lo, vmx_dst_hi));
+            }
+
+            w -= 4;
+            dst += 4;
+            mask += 4;
+        }
+
+        while (w)
+        {
+            uint8_t m = *mask++;
+
+            if (m)
+            {
+                d = *dst;
+                vmx_mask = expand_pixel_8_1x128 (m);
+                vmx_dst = unpack_32_1x128 (d);
+
+                *dst = pack_1x128_32 (in_over (vsrc,
+                                               valpha,
+                                               vmx_mask,
+                                               vmx_dst));
+            }
+
+            w--;
+            dst++;
+        }
+    }
+}
+
 static pixman_bool_t
 vmx_fill (pixman_implementation_t *imp,
           uint32_t * bits,
@@ -3061,6 +3183,10 @@ static const pixman_fast_path_t vmx_fast_paths[] =
     PIXMAN_STD_FAST_PATH (OVER, a8r8g8b8, nu
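Per pixel, the operation above is `in_over`: the solid source and its alpha are both scaled by the 8-bit mask, then composited OVER the destination. A portable scalar model of that math, using pixman-style rounded 8-bit multiplies (the function names here are illustrative, not pixman's actual helpers):

```c
#include <stdint.h>

/* Pixman-style rounded multiply of two 8-bit values (a * b / 255). */
static uint8_t mul_un8 (uint8_t a, uint8_t b)
{
    uint16_t t = (uint16_t)a * b + 0x80;
    return (uint8_t)((t + (t >> 8)) >> 8);
}

/* Scalar model of the per-pixel step: src is a solid premultiplied
 * a8r8g8b8 color, m an a8 mask byte, dst the destination pixel. */
static uint32_t in_over_n_8_8888 (uint32_t src, uint8_t m, uint32_t dst)
{
    uint32_t res = 0;
    uint8_t srca_in = mul_un8 ((uint8_t)(src >> 24), m);
    int shift;

    for (shift = 0; shift < 32; shift += 8)
    {
        uint8_t s = mul_un8 ((uint8_t)(src >> shift), m);
        uint8_t d = (uint8_t)(dst >> shift);

        /* OVER: masked source plus destination scaled by what the
         * masked source alpha leaves uncovered. */
        res |= (uint32_t)(uint8_t)(s + mul_un8 (d, 255 - srca_in)) << shift;
    }
    return res;
}
```

With a full mask and opaque source the destination is replaced outright, which is exactly the `srca == 0xff && m == 0xffffffff` shortcut the vector loop takes.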
[Pixman] [PATCH 09/12] vmx: implement fast path scaled nearest vmx_8888_8888_OVER
POWER8, 8 cores, 3.4GHz, RHEL 7.1 ppc64le.

reference memcpy speed = 24764.8MB/s (6191.2MP/s for 32bpp fills)

        Before   After    Change
        ---------------------------
L1      134.36   181.68   +35.22%
L2      135.07   180.67   +33.76%
M       134.6    180.51   +34.11%
HT      121.77   128.79   +5.76%
VT      120.49   145.07   +20.40%
R       93.83    102.3    +9.03%
RT      50.82    46.93    -7.65%
Kops/s  448      422      -5.80%

cairo trimmed benchmarks:

Speedups
========
t-firefox-asteroids  533.92 -> 497.92 : 1.07x
t-midori-zoomed      692.98 -> 651.24 : 1.06x

Signed-off-by: Oded Gabbay
---
 pixman/pixman-vmx.c | 128 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 128 insertions(+)

diff --git a/pixman/pixman-vmx.c b/pixman/pixman-vmx.c
index d5ddf4b..2f82ce7 100644
--- a/pixman/pixman-vmx.c
+++ b/pixman/pixman-vmx.c
@@ -3232,6 +3232,129 @@ vmx_composite_add_8888_8888 (pixman_implementation_t *imp,
     }
 }

+static force_inline void
+scaled_nearest_scanline_vmx_8888_8888_OVER (uint32_t*       pd,
+                                            const uint32_t* ps,
+                                            int32_t         w,
+                                            pixman_fixed_t  vx,
+                                            pixman_fixed_t  unit_x,
+                                            pixman_fixed_t  src_width_fixed,
+                                            pixman_bool_t   fully_transparent_src)
+{
+    uint32_t s, d;
+    const uint32_t* pm = NULL;
+
+    vector unsigned int vmx_dst_lo, vmx_dst_hi;
+    vector unsigned int vmx_src_lo, vmx_src_hi;
+    vector unsigned int vmx_alpha_lo, vmx_alpha_hi;
+
+    if (fully_transparent_src)
+        return;
+
+    /* Align dst on a 16-byte boundary */
+    while (w && ((uintptr_t)pd & 15))
+    {
+        d = *pd;
+        s = combine1 (ps + pixman_fixed_to_int (vx), pm);
+        vx += unit_x;
+        while (vx >= 0)
+            vx -= src_width_fixed;
+
+        *pd++ = core_combine_over_u_pixel_vmx (s, d);
+        if (pm)
+            pm++;
+        w--;
+    }
+
+    while (w >= 4)
+    {
+        vector unsigned int tmp;
+        uint32_t tmp1, tmp2, tmp3, tmp4;
+
+        tmp1 = *(ps + pixman_fixed_to_int (vx));
+        vx += unit_x;
+        while (vx >= 0)
+            vx -= src_width_fixed;
+        tmp2 = *(ps + pixman_fixed_to_int (vx));
+        vx += unit_x;
+        while (vx >= 0)
+            vx -= src_width_fixed;
+        tmp3 = *(ps + pixman_fixed_to_int (vx));
+        vx += unit_x;
+        while (vx >= 0)
+            vx -= src_width_fixed;
+        tmp4 = *(ps + pixman_fixed_to_int (vx));
+        vx += unit_x;
+        while (vx >= 0)
+            vx -= src_width_fixed;
+
+        tmp[0] = tmp1;
+        tmp[1] = tmp2;
+        tmp[2] = tmp3;
+        tmp[3] = tmp4;
+
+        vmx_src_hi = combine4 ((const uint32_t *) &tmp, pm);
+
+        if (is_opaque (vmx_src_hi))
+        {
+            save_128_aligned (pd, vmx_src_hi);
+        }
+        else if (!is_zero (vmx_src_hi))
+        {
+            vmx_dst_hi = load_128_aligned (pd);
+
+            unpack_128_2x128 (vmx_src_hi, (vector unsigned int) AVV(0),
+                              &vmx_src_lo, &vmx_src_hi);
+
+            unpack_128_2x128 (vmx_dst_hi, (vector unsigned int) AVV(0),
+                              &vmx_dst_lo, &vmx_dst_hi);
+
+            expand_alpha_2x128 (
+                vmx_src_lo, vmx_src_hi, &vmx_alpha_lo, &vmx_alpha_hi);
+
+            over_2x128 (&vmx_src_lo, &vmx_src_hi,
+                        &vmx_alpha_lo, &vmx_alpha_hi,
+                        &vmx_dst_lo, &vmx_dst_hi);
+
+            /* rebuild the 4 pixel data and save */
+            save_128_aligned (pd, pack_2x128_128 (vmx_dst_lo, vmx_dst_hi));
+        }
+
+        w -= 4;
+        pd += 4;
+        if (pm)
+            pm += 4;
+    }
+
+    while (w)
+    {
+        d = *pd;
+        s = combine1 (ps + pixman_fixed_to_int (vx), pm);
+        vx += unit_x;
+        while (vx >= 0)
+            vx -= src_width_fixed;
+
+        *pd++ = core_combine_over_u_pixel_vmx (s, d);
+        if (pm)
+            pm++;
+
+        w--;
+    }
+}
+
+FAST_NEAREST_MAINLOOP (vmx_8888_8888_cover_OVER,
+                       scaled_nearest_scanline_vmx_8888_8888_OVER,
+                       uint32_t, uint32_t, COVER)
+FAST_NEAREST_MAINLOOP (vmx_8888_8888_none_OVER,
+                       scaled_nearest_scanline_vmx_8888_8888_OVER,
+                       uint32_t, uint32_t, NONE)
+FAST_NEAREST_MAINLOOP (vmx_8888_8888_pad_OVER,
+                       scaled_nearest_scanline_vmx_8888_8888_OVER,
+                       uint32_t, uint32_t, PAD)
+FAST_NEAREST_MAINLOOP (vmx_8888_8888_normal_OVER,
+                       scaled_nearest_scanline_vmx_8888_8888_OVER,
+                       uint32_t,
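The nearest-neighbour sampling in the scanline above walks the source in 16.16 fixed point: `vx` accumulates `unit_x` per destination pixel and `pixman_fixed_to_int` truncates it to a source index. A minimal, portable sketch of that stepping (the typedef and macros mirror pixman's 16.16 convention; `nearest_src_index` is an illustrative helper, not pixman API):

```c
#include <stdint.h>

typedef int32_t pixman_fixed_t;                 /* 16.16 fixed point */
#define pixman_fixed_to_int(f)  ((int)((f) >> 16))

/* Source index sampled for destination pixel i when scanning with a
 * fixed-point start vx0 and per-pixel step unit_x, as the scanline
 * does with vx += unit_x (repeat handling omitted). */
static int nearest_src_index (pixman_fixed_t vx0, pixman_fixed_t unit_x, int i)
{
    return pixman_fixed_to_int (vx0 + (pixman_fixed_t)(unit_x * i));
}
```

For a 2x downscale (`unit_x = 2 << 16`) starting at the half-pixel offset `0x8000`, the sampled indices are 0, 2, 4, ...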
[Pixman] [PATCH 01/12] vmx: add LOAD_VECTOR macro
This patch adds a macro for loading a single vector. It also makes the
other LOAD_VECTORx macros use this macro as a base, so the code is
reused. In addition, it fixes minor coding style issues.

Signed-off-by: Oded Gabbay
---
 pixman/pixman-vmx.c | 50 ++++++++++++++++++++++----------------------------
 1 file changed, 24 insertions(+), 26 deletions(-)

diff --git a/pixman/pixman-vmx.c b/pixman/pixman-vmx.c
index 1809790..15ea64e 100644
--- a/pixman/pixman-vmx.c
+++ b/pixman/pixman-vmx.c
@@ -169,33 +169,29 @@ over (vector unsigned int src,
     mask ## _mask = vec_lvsl (0, mask);                   \
     source ## _mask = vec_lvsl (0, source);

-/* notice you have to declare temp vars...
- * Note: tmp3 and tmp4 must remain untouched!
- */
-
-#define LOAD_VECTORS(dest, source)                        \
-do {                                                      \
+#define LOAD_VECTOR(source)                               \
+do                                                        \
+{                                                         \
     vector unsigned char tmp1, tmp2;                      \
     tmp1 = (typeof(tmp1))vec_ld (0, source);              \
     tmp2 = (typeof(tmp2))vec_ld (15, source);             \
     v ## source = (typeof(v ## source))                   \
         vec_perm (tmp1, tmp2, source ## _mask);           \
+} while (0)
+
+#define LOAD_VECTORS(dest, source)                        \
+do                                                        \
+{                                                         \
+    LOAD_VECTOR(source);                                  \
     v ## dest = (typeof(v ## dest))vec_ld (0, dest);      \
-} while (0);
+} while (0)

 #define LOAD_VECTORSC(dest, source, mask)                 \
-do {                                                      \
-    vector unsigned char tmp1, tmp2;                      \
-    tmp1 = (typeof(tmp1))vec_ld (0, source);              \
-    tmp2 = (typeof(tmp2))vec_ld (15, source);             \
-    v ## source = (typeof(v ## source))                   \
-        vec_perm (tmp1, tmp2, source ## _mask);           \
-    tmp1 = (typeof(tmp1))vec_ld (0, mask);                \
-    v ## dest = (typeof(v ## dest))vec_ld (0, dest);      \
-    tmp2 = (typeof(tmp2))vec_ld (15, mask);               \
-    v ## mask = (typeof(v ## mask))                       \
-        vec_perm (tmp1, tmp2, mask ## _mask);             \
-} while (0);
+do                                                        \
+{                                                         \
+    LOAD_VECTORS(dest, source);                           \
+    LOAD_VECTOR(mask);                                    \
+} while (0)

 #define DECLARE_SRC_MASK_VAR vector unsigned char src_mask
 #define DECLARE_MASK_MASK_VAR vector unsigned char mask_mask
@@ -213,14 +209,16 @@ do {                                                      \
 #define COMPUTE_SHIFT_MASKC(dest, source, mask)

+# define LOAD_VECTOR(source)                              \
+    v ## source = *((typeof(v ## source)*)source);
+
 # define LOAD_VECTORS(dest, source)                       \
-    v ## source = *((typeof(v ## source)*)source);        \
-    v ## dest = *((typeof(v ## dest)*)dest);
+    LOAD_VECTOR(source);                                  \
+    LOAD_VECTOR(dest);

 # define LOAD_VECTORSC(dest, source, mask)                \
-    v ## source = *((typeof(v ## source)*)source);        \
-    v ## dest = *((typeof(v ## dest)*)dest);              \
-    v ## mask = *((typeof(v ## mask)*)mask);
+    LOAD_VECTORS(dest, source);                           \
+    LOAD_VECTOR(mask);

 #define DECLARE_SRC_MASK_VAR
 #define DECLARE_MASK_MASK_VAR
@@ -228,7 +226,7 @@ do {                                                      \
 #endif /* WORDS_BIGENDIAN */

 #define LOAD_VECTORSM(dest, source, mask)                 \
-    LOAD_VECTORSC (dest, source, mask)                    \
+    LOAD_VECTORSC (dest, source, mask);                   \
     v ## source = pix_multiply (v ## source,              \
                                 splat_alpha (v ## mask));
--
2.4.3
_______________________________________________
Pixman mailing list
Pixman@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/pixman
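On big-endian VMX, `LOAD_VECTOR` builds an unaligned 16-byte load from two aligned `vec_ld`s recombined by `vec_perm` with a shift mask from `vec_lvsl`. A portable byte-level model of that recombination, assuming it is only called when 32 readable bytes exist past the aligned base (`load_128_model` is an illustrative name):

```c
#include <stdint.h>
#include <string.h>

/* Model of the big-endian LOAD_VECTOR: fetch 16 bytes from an
 * arbitrary address by reading the two aligned 16-byte blocks that
 * straddle it, then shifting by the misalignment (the role vec_lvsl's
 * permute mask plays in the real macro). */
static void load_128_model (const uint8_t *src, uint8_t out[16])
{
    const uint8_t *lo = (const uint8_t *)((uintptr_t)src & ~(uintptr_t)15);
    uint8_t tmp[32];
    size_t shift = (uintptr_t)src & 15;   /* what vec_lvsl encodes */

    memcpy (tmp, lo, 32);                 /* the two aligned blocks */
    memcpy (out, tmp + shift, 16);
}
```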
[Pixman] [PATCH 04/12] vmx: implement fast path vmx_blt
No changes were observed when running cairo trimmed benchmarks.

Signed-off-by: Oded Gabbay
---
 pixman/pixman-vmx.c | 124 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 124 insertions(+)

diff --git a/pixman/pixman-vmx.c b/pixman/pixman-vmx.c
index b9acd6c..b42288b 100644
--- a/pixman/pixman-vmx.c
+++ b/pixman/pixman-vmx.c
@@ -2708,6 +2708,128 @@ vmx_fill (pixman_implementation_t *imp,
     return TRUE;
 }

+static pixman_bool_t
+vmx_blt (pixman_implementation_t *imp,
+         uint32_t * src_bits,
+         uint32_t * dst_bits,
+         int src_stride,
+         int dst_stride,
+         int src_bpp,
+         int dst_bpp,
+         int src_x,
+         int src_y,
+         int dest_x,
+         int dest_y,
+         int width,
+         int height)
+{
+    uint8_t * src_bytes;
+    uint8_t * dst_bytes;
+    int byte_width;
+
+    if (src_bpp != dst_bpp)
+        return FALSE;
+
+    if (src_bpp == 16)
+    {
+        src_stride = src_stride * (int) sizeof (uint32_t) / 2;
+        dst_stride = dst_stride * (int) sizeof (uint32_t) / 2;
+        src_bytes = (uint8_t *)(((uint16_t *)src_bits) + src_stride * (src_y) + (src_x));
+        dst_bytes = (uint8_t *)(((uint16_t *)dst_bits) + dst_stride * (dest_y) + (dest_x));
+        byte_width = 2 * width;
+        src_stride *= 2;
+        dst_stride *= 2;
+    }
+    else if (src_bpp == 32)
+    {
+        src_stride = src_stride * (int) sizeof (uint32_t) / 4;
+        dst_stride = dst_stride * (int) sizeof (uint32_t) / 4;
+        src_bytes = (uint8_t *)(((uint32_t *)src_bits) + src_stride * (src_y) + (src_x));
+        dst_bytes = (uint8_t *)(((uint32_t *)dst_bits) + dst_stride * (dest_y) + (dest_x));
+        byte_width = 4 * width;
+        src_stride *= 4;
+        dst_stride *= 4;
+    }
+    else
+    {
+        return FALSE;
+    }
+
+    while (height--)
+    {
+        int w;
+        uint8_t *s = src_bytes;
+        uint8_t *d = dst_bytes;
+        src_bytes += src_stride;
+        dst_bytes += dst_stride;
+        w = byte_width;
+
+        while (w >= 2 && ((uintptr_t)d & 3))
+        {
+            *(uint16_t *)d = *(uint16_t *)s;
+            w -= 2;
+            s += 2;
+            d += 2;
+        }
+
+        while (w >= 4 && ((uintptr_t)d & 15))
+        {
+            *(uint32_t *)d = *(uint32_t *)s;
+
+            w -= 4;
+            s += 4;
+            d += 4;
+        }
+
+        while (w >= 64)
+        {
+            vector unsigned int vmx0, vmx1, vmx2, vmx3;
+
+            vmx0 = load_128_unaligned ((uint32_t*) s);
+            vmx1 = load_128_unaligned ((uint32_t*)(s + 16));
+            vmx2 = load_128_unaligned ((uint32_t*)(s + 32));
+            vmx3 = load_128_unaligned ((uint32_t*)(s + 48));
+
+            save_128_aligned ((uint32_t*)(d), vmx0);
+            save_128_aligned ((uint32_t*)(d + 16), vmx1);
+            save_128_aligned ((uint32_t*)(d + 32), vmx2);
+            save_128_aligned ((uint32_t*)(d + 48), vmx3);
+
+            s += 64;
+            d += 64;
+            w -= 64;
+        }
+
+        while (w >= 16)
+        {
+            save_128_aligned ((uint32_t*) d, load_128_unaligned ((uint32_t*) s));
+
+            w -= 16;
+            d += 16;
+            s += 16;
+        }
+
+        while (w >= 4)
+        {
+            *(uint32_t *)d = *(uint32_t *)s;
+
+            w -= 4;
+            s += 4;
+            d += 4;
+        }
+
+        if (w >= 2)
+        {
+            *(uint16_t *)d = *(uint16_t *)s;
+            w -= 2;
+            s += 2;
+            d += 2;
+        }
+    }
+
+    return TRUE;
+}
+
 static void
 vmx_composite_over_8888_8888 (pixman_implementation_t *imp,
                               pixman_composite_info_t *info)
@@ -2812,6 +2934,7 @@ vmx_composite_add_8888_8888 (pixman_implementation_t *imp,

 static const pixman_fast_path_t vmx_fast_paths[] =
 {
+    /* PIXMAN_OP_OVER */
     PIXMAN_STD_FAST_PATH (OVER, a8r8g8b8, null, a8r8g8b8, vmx_composite_over_8888_8888),
     PIXMAN_STD_FAST_PATH (OVER, a8r8g8b8, null, x8r8g8b8, vmx_composite_over_8888_8888),
     PIXMAN_STD_FAST_PATH (OVER, a8b8g8r8, null, a8b8g8r8, vmx_composite_over_8888_8888),
@@ -2865,6 +2988,7 @@ _pixman_implementation_create_vmx (pixman_implementation_t *fallback)
     imp->combine_32_ca[PIXMAN_OP_XOR] = vmx_combine_xor_ca;
     imp->combine_32_ca[PIXMAN_OP_ADD] = vmx_combine_add_ca;

+    imp->blt = vmx_blt;
     imp->fill = vmx_fill;

     return imp;
 }
--
2.4.3
_______________________________________________
Pixman mailing list
Pixman@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/pixman
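The structure of vmx_blt is a classic alignment ladder: peel 2-byte and 4-byte heads until the destination hits a 16-byte boundary, stream 64 bytes per iteration, then mirror the ladder on the tail. A scalar sketch of just the head computation (`head_bytes_to_align16` is an illustrative helper, not pixman code):

```c
#include <stdint.h>

/* Number of bytes the head loops must copy before dst reaches 16-byte
 * alignment, capped by the remaining width w (in bytes). This models
 * the combined effect of vmx_blt's 2-byte and 4-byte peeling loops. */
static int head_bytes_to_align16 (uintptr_t dst, int w)
{
    int head = (int)((16 - (dst & 15)) & 15);
    return head < w ? head : w;
}
```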
[Pixman] [PATCH 03/12] vmx: implement fast path vmx_fill
Based on the sse2 implementation.

Tested cairo trimmed benchmarks on POWER8, 8 cores, 3.4GHz, RHEL 7.1 ppc64le:

speedups
========
t-swfdec-giant-steps       1382.86 ->  719.65 : 1.92x speedup
t-gnome-system-monitor     1405.45 ->  923.71 : 1.52x speedup
t-evolution                 550.91 ->  410.73 : 1.34x speedup
t-firefox-paintball         800.47 ->  661.47 : 1.21x speedup
t-firefox-canvas-swscroll  1419.22 -> 1209.98 : 1.17x speedup
t-xfce4-terminal-a1        1537.75 -> 1319.23 : 1.17x speedup
t-midori-zoomed             693.34 ->  607.99 : 1.14x speedup
t-firefox-scrolling        1306.05 -> 1149.77 : 1.14x speedup
t-chromium-tabs             210.72 ->  191.17 : 1.10x speedup
t-firefox-planet-gnome      980.11 ->  913.88 : 1.07x speedup

Signed-off-by: Oded Gabbay
---
 pixman/pixman-vmx.c | 153 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 153 insertions(+)

diff --git a/pixman/pixman-vmx.c b/pixman/pixman-vmx.c
index b81cb45..b9acd6c 100644
--- a/pixman/pixman-vmx.c
+++ b/pixman/pixman-vmx.c
@@ -2557,6 +2557,157 @@ vmx_combine_add_ca (pixman_implementation_t *imp,
     }
 }

+static pixman_bool_t
+vmx_fill (pixman_implementation_t *imp,
+          uint32_t *               bits,
+          int                      stride,
+          int                      bpp,
+          int                      x,
+          int                      y,
+          int                      width,
+          int                      height,
+          uint32_t                 filler)
+{
+    uint32_t byte_width;
+    uint8_t *byte_line;
+
+    vector unsigned int vfiller;
+
+    if (bpp == 8)
+    {
+        uint8_t b;
+        uint16_t w;
+
+        stride = stride * (int) sizeof (uint32_t) / 1;
+        byte_line = (uint8_t *)(((uint8_t *)bits) + stride * y + x);
+        byte_width = width;
+        stride *= 1;
+
+        b = filler & 0xff;
+        w = (b << 8) | b;
+        filler = (w << 16) | w;
+    }
+    else if (bpp == 16)
+    {
+        stride = stride * (int) sizeof (uint32_t) / 2;
+        byte_line = (uint8_t *)(((uint16_t *)bits) + stride * y + x);
+        byte_width = 2 * width;
+        stride *= 2;
+
+        filler = (filler & 0xffff) * 0x00010001;
+    }
+    else if (bpp == 32)
+    {
+        stride = stride * (int) sizeof (uint32_t) / 4;
+        byte_line = (uint8_t *)(((uint32_t *)bits) + stride * y + x);
+        byte_width = 4 * width;
+        stride *= 4;
+    }
+    else
+    {
+        return FALSE;
+    }
+
+    vfiller = create_mask_1x32_128 (&filler);
+
+    while (height--)
+    {
+        int w;
+        uint8_t *d = byte_line;
+        byte_line += stride;
+        w = byte_width;
+
+        if (w >= 1 && ((uintptr_t)d & 1))
+        {
+            *(uint8_t *)d = filler;
+            w -= 1;
+            d += 1;
+        }
+
+        while (w >= 2 && ((uintptr_t)d & 3))
+        {
+            *(uint16_t *)d = filler;
+            w -= 2;
+            d += 2;
+        }
+
+        while (w >= 4 && ((uintptr_t)d & 15))
+        {
+            *(uint32_t *)d = filler;
+
+            w -= 4;
+            d += 4;
+        }
+
+        while (w >= 128)
+        {
+            vec_st (vfiller, 0, (uint32_t *) d);
+            vec_st (vfiller, 0, (uint32_t *) d + 4);
+            vec_st (vfiller, 0, (uint32_t *) d + 8);
+            vec_st (vfiller, 0, (uint32_t *) d + 12);
+            vec_st (vfiller, 0, (uint32_t *) d + 16);
+            vec_st (vfiller, 0, (uint32_t *) d + 20);
+            vec_st (vfiller, 0, (uint32_t *) d + 24);
+            vec_st (vfiller, 0, (uint32_t *) d + 28);
+
+            d += 128;
+            w -= 128;
+        }
+
+        if (w >= 64)
+        {
+            vec_st (vfiller, 0, (uint32_t *) d);
+            vec_st (vfiller, 0, (uint32_t *) d + 4);
+            vec_st (vfiller, 0, (uint32_t *) d + 8);
+            vec_st (vfiller, 0, (uint32_t *) d + 12);
+
+            d += 64;
+            w -= 64;
+        }
+
+        if (w >= 32)
+        {
+            vec_st (vfiller, 0, (uint32_t *) d);
+            vec_st (vfiller, 0, (uint32_t *) d + 4);
+
+            d += 32;
+            w -= 32;
+        }
+
+        if (w >= 16)
+        {
+            vec_st (vfiller, 0, (uint32_t *) d);
+
+            d += 16;
+            w -= 16;
+        }
+
+        while (w >= 4)
+        {
+            *(uint32_t *)d = filler;
+
+            w -= 4;
+            d += 4;
+        }
+
+        if (w >= 2)
+        {
+            *(uint16_t *)d = filler;
+            w -= 2;
+            d += 2;
+        }
+
+        if (w >= 1)
+        {
+            *(uint8_t *)d = filler;
+            w -= 1;
+            d += 1;
+        }
+    }
+
+    return TRUE;
+}
+
 static void
 vmx_composite_over_8888_8888 (pixman_implementation_t *imp,
                               pixman_composite_info_t *info)
@@ -2714,5 +2865,7 @@ _pixman_implementation_create_vmx (pixman_implementation_t *fallback)
     imp->combine_32_ca[PIXMAN_OP_XOR] = vmx_combine_xor_ca;
     imp->combine_32_ca[PIXMAN_OP_ADD] = vmx_combine_add_ca;

+    imp->fill = vmx_fill;
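Before splatting the filler into a vector, vmx_fill widens narrow fillers to 32 bits by replication; that replication is plain integer arithmetic and easy to check in isolation (`expand_filler` is an illustrative name wrapping the same expressions the patch uses):

```c
#include <stdint.h>

/* Replicate a narrow filler to 32 bits, as vmx_fill does for 8bpp
 * (byte -> halfword -> word) and 16bpp (halfword * 0x00010001). */
static uint32_t expand_filler (uint32_t filler, int bpp)
{
    if (bpp == 8)
    {
        uint8_t  b = (uint8_t)(filler & 0xff);
        uint16_t w = (uint16_t)((b << 8) | b);
        return ((uint32_t)w << 16) | w;
    }
    if (bpp == 16)
        return (filler & 0xffff) * 0x00010001u;
    return filler;  /* bpp == 32: already full width */
}
```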
[Pixman] [PATCH 11/12] vmx: implement fast path iterator vmx_fetch_r5g6b5
No changes were observed when running cairo trimmed benchmarks.

Signed-off-by: Oded Gabbay
---
 pixman/pixman-vmx.c | 52 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 52 insertions(+)

diff --git a/pixman/pixman-vmx.c b/pixman/pixman-vmx.c
index ed248e1..f71f358 100644
--- a/pixman/pixman-vmx.c
+++ b/pixman/pixman-vmx.c
@@ -3432,6 +3432,55 @@ vmx_fetch_x8r8g8b8 (pixman_iter_t *iter, const uint32_t *mask)
     return iter->buffer;
 }

+static uint32_t *
+vmx_fetch_r5g6b5 (pixman_iter_t *iter, const uint32_t *mask)
+{
+    int w = iter->width;
+    uint32_t *dst = iter->buffer;
+    uint16_t *src = (uint16_t *)iter->bits;
+    vector unsigned int ff000000 = mask_ff000000;
+
+    iter->bits += iter->stride;
+
+    while (w && ((uintptr_t)dst) & 0x0f)
+    {
+        uint16_t s = *src++;
+
+        *dst++ = convert_0565_to_8888 (s);
+        w--;
+    }
+
+    while (w >= 8)
+    {
+        vector unsigned int lo, hi, s;
+
+        s = load_128_unaligned ((uint32_t *) src);
+
+        lo = unpack_565_to_8888 (
+            unpacklo_128_8x16 (s, (vector unsigned int) AVV(0)));
+
+        hi = unpack_565_to_8888 (
+            unpackhi_128_8x16 (s, (vector unsigned int) AVV(0)));
+
+        save_128_aligned (dst, vec_or (hi, ff000000));
+        save_128_aligned (dst + 4, vec_or (lo, ff000000));
+
+        dst += 8;
+        src += 8;
+        w -= 8;
+    }
+
+    while (w)
+    {
+        uint16_t s = *src++;
+
+        *dst++ = convert_0565_to_8888 (s);
+        w--;
+    }
+
+    return iter->buffer;
+}
+
 #define IMAGE_FLAGS                                                    \
     (FAST_PATH_STANDARD_FLAGS | FAST_PATH_ID_TRANSFORM |               \
      FAST_PATH_BITS_IMAGE | FAST_PATH_SAMPLES_COVER_CLIP_NEAREST)
@@ -3441,6 +3490,9 @@ static const pixman_iter_info_t vmx_iters[] =
     { PIXMAN_x8r8g8b8, IMAGE_FLAGS, ITER_NARROW,
       _pixman_iter_init_bits_stride, vmx_fetch_x8r8g8b8, NULL
     },
+    { PIXMAN_r5g6b5, IMAGE_FLAGS, ITER_NARROW,
+      _pixman_iter_init_bits_stride, vmx_fetch_r5g6b5, NULL
+    },
     { PIXMAN_null },
 };
--
2.4.3
_______________________________________________
Pixman mailing list
Pixman@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/pixman
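The scalar loops convert one r5g6b5 pixel at a time, widening each channel with the usual bit-replication so that full-intensity 5/6-bit values map to 0xff. A portable sketch of that conversion (`convert_0565` is an illustrative stand-in, not pixman's actual helper):

```c
#include <stdint.h>

/* Scalar r5g6b5 -> a8r8g8b8 with bit replication: the top bits of
 * each channel are copied into the low bits of the widened channel,
 * and alpha is forced to 0xff. */
static uint32_t convert_0565 (uint16_t s)
{
    uint32_t r = (s >> 11) & 0x1f;
    uint32_t g = (s >> 5)  & 0x3f;
    uint32_t b = s & 0x1f;

    r = (r << 3) | (r >> 2);
    g = (g << 2) | (g >> 4);
    b = (b << 3) | (b >> 2);

    return 0xff000000u | (r << 16) | (g << 8) | b;
}
```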
[Pixman] [PATCH 06/12] vmx: implement fast path vmx_composite_over_n_8888_8888_ca
POWER8, 8 cores, 3.4GHz, RHEL 7.1 ppc64le.

reference memcpy speed = 24764.8MB/s (6191.2MP/s for 32bpp fills)

        Before   After    Change
        ----------------------------
L1      61.92    244.91   +295.53%
L2      62.74    243.3    +287.79%
M       63.03    241.94   +283.85%
HT      59.91    144.22   +140.73%
VT      59.4     174.39   +193.59%
R       53.6     111.37   +107.78%
RT      37.99    46.38    +22.08%
Kops/s  436      506      +16.06%

cairo trimmed benchmarks:

Speedups
========
t-xfce4-terminal-a1  1540.37 -> 1226.14 : 1.26x
t-firefox-talos-gfx  1488.59 -> 1209.19 : 1.23x

Slowdowns
=========
t-evolution           553.88 ->  581.63 : 1.05x
t-poppler             364.99 ->  383.79 : 1.05x
t-firefox-scrolling  1223.65 -> 1304.34 : 1.07x

Signed-off-by: Oded Gabbay
---
 pixman/pixman-vmx.c | 112 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 112 insertions(+)

diff --git a/pixman/pixman-vmx.c b/pixman/pixman-vmx.c
index e69d530..966219f 100644
--- a/pixman/pixman-vmx.c
+++ b/pixman/pixman-vmx.c
@@ -2871,6 +2871,114 @@ vmx_composite_over_8888_8888 (pixman_implementation_t *imp,
 }

 static void
+vmx_composite_over_n_8888_8888_ca (pixman_implementation_t *imp,
+                                   pixman_composite_info_t *info)
+{
+    PIXMAN_COMPOSITE_ARGS (info);
+    uint32_t src;
+    uint32_t *dst_line, d;
+    uint32_t *mask_line, m;
+    uint32_t pack_cmp;
+    int dst_stride, mask_stride;
+
+    vector unsigned int vsrc, valpha, vmask, vdest;
+
+    vector unsigned int vmx_dst, vmx_dst_lo, vmx_dst_hi;
+    vector unsigned int vmx_mask, vmx_mask_lo, vmx_mask_hi;
+
+    src = _pixman_image_get_solid (imp, src_image, dest_image->bits.format);
+
+    if (src == 0)
+        return;
+
+    PIXMAN_IMAGE_GET_LINE (
+        dest_image, dest_x, dest_y, uint32_t, dst_stride, dst_line, 1);
+    PIXMAN_IMAGE_GET_LINE (
+        mask_image, mask_x, mask_y, uint32_t, mask_stride, mask_line, 1);
+
+    vsrc = unpacklo_128_16x8 (create_mask_1x32_128 (&src),
+                              (vector unsigned int) AVV(0));
+
+    valpha = expand_alpha_1x128 (vsrc);
+
+    while (height--)
+    {
+        int w = width;
+        const uint32_t *pm = (uint32_t *)mask_line;
+        uint32_t *pd = (uint32_t *)dst_line;
+
+        dst_line += dst_stride;
+        mask_line += mask_stride;
+
+        while (w && (uintptr_t)pd & 15)
+        {
+            m = *pm++;
+
+            if (m)
+            {
+                d = *pd;
+                vmask = unpack_32_1x128 (m);
+                vdest = unpack_32_1x128 (d);
+
+                *pd = pack_1x128_32 (in_over (vsrc, valpha, vmask, vdest));
+            }
+
+            pd++;
+            w--;
+        }
+
+        while (w >= 4)
+        {
+            /* pm is NOT necessarily 16-byte aligned */
+            vmx_mask = load_128_unaligned (pm);
+
+            pack_cmp = vec_all_eq (vmx_mask, (vector unsigned int) AVV(0));
+
+            /* if all bits in mask are zero, pack_cmp is not 0 */
+            if (pack_cmp == 0)
+            {
+                /* pd is 16-byte aligned */
+                vmx_dst = load_128_aligned (pd);
+
+                unpack_128_2x128 (vmx_mask, (vector unsigned int) AVV(0),
+                                  &vmx_mask_lo, &vmx_mask_hi);
+
+                unpack_128_2x128 (vmx_dst, (vector unsigned int) AVV(0),
+                                  &vmx_dst_lo, &vmx_dst_hi);
+
+                in_over_2x128 (&vsrc, &vsrc,
+                               &valpha, &valpha,
+                               &vmx_mask_lo, &vmx_mask_hi,
+                               &vmx_dst_lo, &vmx_dst_hi);
+
+                save_128_aligned (pd, pack_2x128_128 (vmx_dst_lo, vmx_dst_hi));
+            }
+
+            pd += 4;
+            pm += 4;
+            w -= 4;
+        }
+
+        while (w)
+        {
+            m = *pm++;
+
+            if (m)
+            {
+                d = *pd;
+                vmask = unpack_32_1x128 (m);
+                vdest = unpack_32_1x128 (d);
+
+                *pd = pack_1x128_32 (in_over (vsrc, valpha, vmask, vdest));
+            }
+
+            pd++;
+            w--;
+        }
+    }
+}
+
+static void
 vmx_composite_add_8_8 (pixman_implementation_t *imp,
                        pixman_composite_info_t *info)
 {
@@ -2953,6 +3061,10 @@ static const pixman_fast_path_t vmx_fast_paths[] =
     PIXMAN_STD_FAST_PATH (OVER, a8r8g8b8, null, x8r8g8b8, vmx_composite_over_8888_8888),
     PIXMAN_STD_FAST_PATH (OVER, a8b8g8r8, null, a8b8g8r8, vmx_composite_over_8888_8888),
     PIXMAN_STD_FAST_PATH (OVER, a8b8g8r8, null, x8b8g8r8, vmx_composite_over_8888_8888),
+    PIXMAN_STD_FAST_PATH_CA (OVER, solid, a8r8g8b8, a8r8g8b8, vmx_composite_over_n_8888_8888_ca),
+    PIXMAN_STD_FAST_PATH_CA (OVER, solid, a8r8g8b8, x8r8g8b8, vmx_composite_over_n_8888_8888_ca),
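With component alpha, the mask carries a separate 8-bit factor per channel rather than a single alpha, so each channel sees its own `in_over`. A scalar model of one channel (`in_over_ca_channel` and `mul_un8` are illustrative names for the standard pixman-style arithmetic, not the library's actual helpers):

```c
#include <stdint.h>

/* Pixman-style rounded multiply of two 8-bit values (a * b / 255). */
static uint8_t mul_un8 (uint8_t a, uint8_t b)
{
    uint16_t t = (uint16_t)a * b + 0x80;
    return (uint8_t)((t + (t >> 8)) >> 8);
}

/* Component-alpha in_over for a single channel: the mask byte m for
 * this channel scales the source channel, and srca * m decides how
 * much of the destination channel survives. */
static uint8_t in_over_ca_channel (uint8_t s, uint8_t srca,
                                   uint8_t m, uint8_t d)
{
    return (uint8_t)(mul_un8 (s, m) +
                     mul_un8 (d, (uint8_t)(255 - mul_un8 (srca, m))));
}
```

A fully-set mask channel reduces to plain OVER for that channel; a zero mask channel leaves the destination untouched, which is why the vector loop can skip a 4-pixel group whose mask is all zero.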
[Pixman] [PATCH 08/12] vmx: implement fast path vmx_composite_src_x888_8888
POWER8, 8 cores, 3.4GHz, RHEL 7.1 ppc64le.

reference memcpy speed = 24764.8MB/s (6191.2MP/s for 32bpp fills)

        Before    After     Change
        -----------------------------
L1      1115.4    5006.49   +348.85%
L2      1112.26   4338.01   +290.02%
M       1110.54   2524.15   +127.29%
HT      745.41    1140.03   +52.94%
VT      749.03    1287.13   +71.84%
R       423.91    547.6     +29.18%
RT      205.79    194.98    -5.25%
Kops/s  1414      1361      -3.75%

cairo trimmed benchmarks:

Speedups
========
t-gnome-system-monitor  1402.62 -> 1212.75 : 1.16x
t-firefox-asteroids      533.92 ->  474.50 : 1.13x

Signed-off-by: Oded Gabbay
---
 pixman/pixman-vmx.c | 58 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 58 insertions(+)

diff --git a/pixman/pixman-vmx.c b/pixman/pixman-vmx.c
index 5c74a47..d5ddf4b 100644
--- a/pixman/pixman-vmx.c
+++ b/pixman/pixman-vmx.c
@@ -2967,6 +2967,62 @@ vmx_composite_copy_area (pixman_implementation_t *imp,
 }

 static void
+vmx_composite_src_x888_8888 (pixman_implementation_t *imp,
+                             pixman_composite_info_t *info)
+{
+    PIXMAN_COMPOSITE_ARGS (info);
+    uint32_t *dst_line, *dst;
+    uint32_t *src_line, *src;
+    int32_t w;
+    int dst_stride, src_stride;
+
+    PIXMAN_IMAGE_GET_LINE (
+        dest_image, dest_x, dest_y, uint32_t, dst_stride, dst_line, 1);
+    PIXMAN_IMAGE_GET_LINE (
+        src_image, src_x, src_y, uint32_t, src_stride, src_line, 1);
+
+    while (height--)
+    {
+        dst = dst_line;
+        dst_line += dst_stride;
+        src = src_line;
+        src_line += src_stride;
+        w = width;
+
+        while (w && (uintptr_t)dst & 15)
+        {
+            *dst++ = *src++ | 0xff000000;
+            w--;
+        }
+
+        while (w >= 16)
+        {
+            vector unsigned int vmx_src1, vmx_src2, vmx_src3, vmx_src4;
+
+            vmx_src1 = load_128_unaligned (src);
+            vmx_src2 = load_128_unaligned (src + 4);
+            vmx_src3 = load_128_unaligned (src + 8);
+            vmx_src4 = load_128_unaligned (src + 12);
+
+            save_128_aligned (dst, vec_or (vmx_src1, mask_ff000000));
+            save_128_aligned (dst + 4, vec_or (vmx_src2, mask_ff000000));
+            save_128_aligned (dst + 8, vec_or (vmx_src3, mask_ff000000));
+            save_128_aligned (dst + 12, vec_or (vmx_src4, mask_ff000000));
+
+            dst += 16;
+            src += 16;
+            w -= 16;
+        }
+
+        while (w)
+        {
+            *dst++ = *src++ | 0xff000000;
+            w--;
+        }
+    }
+}
+
+static void
 vmx_composite_over_8888_8888 (pixman_implementation_t *imp,
                               pixman_composite_info_t *info)
 {
@@ -3200,6 +3256,8 @@ static const pixman_fast_path_t vmx_fast_paths[] =
     PIXMAN_STD_FAST_PATH (ADD, a8b8g8r8, null, a8b8g8r8, vmx_composite_add_8888_8888),

     /* PIXMAN_OP_SRC */
+    PIXMAN_STD_FAST_PATH (SRC, x8r8g8b8, null, a8r8g8b8, vmx_composite_src_x888_8888),
+    PIXMAN_STD_FAST_PATH (SRC, x8b8g8r8, null, a8b8g8r8, vmx_composite_src_x888_8888),
     PIXMAN_STD_FAST_PATH (SRC, a8r8g8b8, null, a8r8g8b8, vmx_composite_copy_area),
     PIXMAN_STD_FAST_PATH (SRC, a8b8g8r8, null, a8b8g8r8, vmx_composite_copy_area),
     PIXMAN_STD_FAST_PATH (SRC, a8r8g8b8, null, x8r8g8b8, vmx_composite_copy_area),
--
2.4.3
_______________________________________________
Pixman mailing list
Pixman@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/pixman
[Pixman] [PATCH 00/12] Implement more vmx fast paths
Hi,

This patch-set implements the most heavily used fast paths, according
to profiling I did using the cairo traces package.

The patch-set adds many helper functions to ease the conversion of fast
paths between the sse2 implementations (which I used as a base) and the
vmx implementations. All the helper functions are added in a single
patch.

Each fast path was added in its own commit. The commit message notes
the improvement in the relevant low-level-blt test (where relevant) and
the improvement in the cairo trimmed benchmarks.

From my observations, the single most important fast path is vmx_fill,
as it contributed the largest improvement to the cairo benchmarks.

This patch-set is based on the previous patch-set I already sent to the
mailing list, which contains implementations of three other fast paths.

Please review.

Thanks,
Oded

Oded Gabbay (12):
  vmx: add LOAD_VECTOR macro
  vmx: add helper functions
  vmx: implement fast path vmx_fill
  vmx: implement fast path vmx_blt
  vmx: implement fast path vmx_composite_copy_area
  vmx: implement fast path vmx_composite_over_n_8888_8888_ca
  vmx: implement fast path vmx_composite_over_n_8_8888
  vmx: implement fast path vmx_composite_src_x888_8888
  vmx: implement fast path scaled nearest vmx_8888_8888_OVER
  vmx: implement fast path iterator vmx_fetch_x8r8g8b8
  vmx: implement fast path iterator vmx_fetch_r5g6b5
  vmx: implement fast path iterator vmx_fetch_a8

 pixman/pixman-vmx.c | 1441 ++++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 1404 insertions(+), 37 deletions(-)

--
2.4.3
_______________________________________________
Pixman mailing list
Pixman@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/pixman
[Pixman] [PATCH 05/12] vmx: implement fast path vmx_composite_copy_area
No changes were observed when running cairo trimmed benchmarks.

Signed-off-by: Oded Gabbay
---
 pixman/pixman-vmx.c | 26 ++
 1 file changed, 26 insertions(+)

diff --git a/pixman/pixman-vmx.c b/pixman/pixman-vmx.c
index b42288b..e69d530 100644
--- a/pixman/pixman-vmx.c
+++ b/pixman/pixman-vmx.c
@@ -2831,6 +2831,20 @@ vmx_blt (pixman_implementation_t *imp,
 }
 
 static void
+vmx_composite_copy_area (pixman_implementation_t *imp,
+                         pixman_composite_info_t *info)
+{
+    PIXMAN_COMPOSITE_ARGS (info);
+    vmx_blt (imp, src_image->bits.bits,
+             dest_image->bits.bits,
+             src_image->bits.rowstride,
+             dest_image->bits.rowstride,
+             PIXMAN_FORMAT_BPP (src_image->bits.format),
+             PIXMAN_FORMAT_BPP (dest_image->bits.format),
+             src_x, src_y, dest_x, dest_y, width, height);
+}
+
+static void
 vmx_composite_over__ (pixman_implementation_t *imp,
                       pixman_composite_info_t *info)
 {
@@ -2939,12 +2953,24 @@ static const pixman_fast_path_t vmx_fast_paths[] =
     PIXMAN_STD_FAST_PATH (OVER, a8r8g8b8, null, x8r8g8b8, vmx_composite_over__),
     PIXMAN_STD_FAST_PATH (OVER, a8b8g8r8, null, a8b8g8r8, vmx_composite_over__),
     PIXMAN_STD_FAST_PATH (OVER, a8b8g8r8, null, x8b8g8r8, vmx_composite_over__),
+    PIXMAN_STD_FAST_PATH (OVER, x8r8g8b8, null, x8r8g8b8, vmx_composite_copy_area),
+    PIXMAN_STD_FAST_PATH (OVER, x8b8g8r8, null, x8b8g8r8, vmx_composite_copy_area),
 
     /* PIXMAN_OP_ADD */
     PIXMAN_STD_FAST_PATH (ADD, a8, null, a8, vmx_composite_add_8_8),
     PIXMAN_STD_FAST_PATH (ADD, a8r8g8b8, null, a8r8g8b8, vmx_composite_add__),
     PIXMAN_STD_FAST_PATH (ADD, a8b8g8r8, null, a8b8g8r8, vmx_composite_add__),
 
+    /* PIXMAN_OP_SRC */
+    PIXMAN_STD_FAST_PATH (SRC, a8r8g8b8, null, a8r8g8b8, vmx_composite_copy_area),
+    PIXMAN_STD_FAST_PATH (SRC, a8b8g8r8, null, a8b8g8r8, vmx_composite_copy_area),
+    PIXMAN_STD_FAST_PATH (SRC, a8r8g8b8, null, x8r8g8b8, vmx_composite_copy_area),
+    PIXMAN_STD_FAST_PATH (SRC, a8b8g8r8, null, x8b8g8r8, vmx_composite_copy_area),
+    PIXMAN_STD_FAST_PATH (SRC, x8r8g8b8, null, x8r8g8b8, vmx_composite_copy_area),
+    PIXMAN_STD_FAST_PATH (SRC, x8b8g8r8, null, x8b8g8r8, vmx_composite_copy_area),
+    PIXMAN_STD_FAST_PATH (SRC, r5g6b5, null, r5g6b5, vmx_composite_copy_area),
+    PIXMAN_STD_FAST_PATH (SRC, b5g6r5, null, b5g6r5, vmx_composite_copy_area),
+
     { PIXMAN_OP_NONE },
 };
-- 
2.4.3
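The patch works because, for these format pairs, SRC (and OVER from an x888 source onto an identical x888 destination) degenerates into a plain memory copy: vmx_composite_copy_area just forwards the image pointers, rowstrides, and bpp to vmx_blt. A hedged scalar sketch of the 32-bpp case, with strides measured in uint32_t units as pixman stores them (the real vmx_blt also handles 16 bpp and vectorizes the bulk of each row):

```c
#include <stdint.h>
#include <string.h>

/* Row-by-row 32-bpp blit: each row is a straight memcpy, with the
 * source and destination strides applied independently. */
static void
blt_32bpp (const uint32_t *src, uint32_t *dst,
           int src_stride, int dst_stride,
           int src_x, int src_y, int dst_x, int dst_y,
           int width, int height)
{
    int y;

    for (y = 0; y < height; y++)
    {
        memcpy (dst + (dst_y + y) * dst_stride + dst_x,
                src + (src_y + y) * src_stride + src_x,
                (size_t) width * sizeof (uint32_t));
    }
}
```

Since no per-pixel arithmetic is involved, registering these operations as copy_area lets them share one well-tuned memory-move path.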
[Pixman] [PATCH 02/12] vmx: add helper functions
This patch adds the following helper functions for code reuse, hiding BE/LE differences, and maintainability. All of the functions are defined as static force_inline.

Names were copied from pixman-sse2.c so that converting fast paths between sse2 and vmx is easier from now on. Therefore, I tried to keep the input/output of the functions as close as possible to the sse2 definitions.

The functions are:

- load_128_aligned : load 128 bits from a 16-byte aligned memory address into a vector
- load_128_unaligned : load 128 bits from memory into a vector, without guarantee of alignment for the source pointer
- save_128_aligned : save a 128-bit vector into a 16-byte aligned memory address
- create_mask_16_128 : take a 16-bit value and fill a new vector with it
- create_mask_1x32_128 : take a 32-bit pointer and fill a new vector with the 32-bit value from that pointer
- create_mask_32_128 : take a 32-bit value and fill a new vector with it
- unpack_32_1x128 : unpack a 32-bit value into a vector
- unpacklo_128_16x8 : unpack the eight low 8-bit values of a vector
- unpackhi_128_16x8 : unpack the eight high 8-bit values of a vector
- unpacklo_128_8x16 : unpack the four low 16-bit values of a vector
- unpackhi_128_8x16 : unpack the four high 16-bit values of a vector
- unpack_128_2x128 : unpack the eight low 8-bit values of a vector into one vector and the eight high 8-bit values into another vector
- unpack_128_2x128_16 : unpack the four low 16-bit values of a vector into one vector and the four high 16-bit values into another vector
- unpack_565_to_ : unpack an RGB_565 vector
- pack_1x128_32 : pack a vector and return its LSB 32 bits
- pack_2x128_128 : pack two vectors into one and return it
- negate_2x128 : xor two vectors with mask_00ff (separately)
- is_opaque : return whether all the pixels contained in the vector are opaque
- is_zero : return whether the vector equals 0
- is_transparent : return whether all the pixels contained in the vector are transparent
- expand_pixel_8_1x128 : expand an 8-bit pixel into the lower 8 bytes of a vector
- expand_pixel_32_1x128 : expand a 32-bit pixel into the lower 2 bytes of a vector
- expand_alpha_1x128 : expand alpha from a vector and return the new vector
- expand_alpha_2x128 : expand alpha from one vector and another alpha from a second vector
- expand_alpha_rev_2x128 : expand a reversed alpha from one vector and another reversed alpha from a second vector
- pix_multiply_2x128 : do pix_multiply for two vectors (separately)
- over_2x128 : perform the over op. on two vectors
- in_over_2x128 : perform the in-over op. on two vectors

Signed-off-by: Oded Gabbay
---
 pixman/pixman-vmx.c | 496
 1 file changed, 496 insertions(+)

diff --git a/pixman/pixman-vmx.c b/pixman/pixman-vmx.c
index 15ea64e..b81cb45 100644
--- a/pixman/pixman-vmx.c
+++ b/pixman/pixman-vmx.c
@@ -30,10 +30,19 @@
 #endif
 #include "pixman-private.h"
 #include "pixman-combine32.h"
+#include "pixman-inlines.h"
 #include <altivec.h>
 
 #define AVV(x...) {x}
 
+static vector unsigned int mask_00ff;
+static vector unsigned int mask_ff00;
+static vector unsigned int mask_red;
+static vector unsigned int mask_green;
+static vector unsigned int mask_blue;
+static vector unsigned int mask_565_fix_rb;
+static vector unsigned int mask_565_fix_g;
+
 static force_inline vector unsigned int
 splat_alpha (vector unsigned int pix)
 {
@@ -233,6 +242,484 @@ do \
 #define STORE_VECTOR(dest) \
     vec_st ((vector unsigned int) v ## dest, 0, dest);
 
+/* load 4 pixels from a 16-byte boundary aligned address */
+static force_inline vector unsigned int
+load_128_aligned (const uint32_t* src)
+{
+    return *((vector unsigned int *) src);
+}
+
+/* load 4 pixels from an unaligned address */
+static force_inline vector unsigned int
+load_128_unaligned (const uint32_t* src)
+{
+    vector unsigned int vsrc;
+    DECLARE_SRC_MASK_VAR;
+
+    COMPUTE_SHIFT_MASK (src);
+    LOAD_VECTOR (src);
+
+    return vsrc;
+}
+
+/* save 4 pixels on a 16-byte boundary aligned address */
+static force_inline void
+save_128_aligned (uint32_t* data,
+                  vector unsigned int vdata)
+{
+    STORE_VECTOR(data)
+}
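Because save_128_aligned requires a 16-byte aligned destination while load_128_unaligned tolerates any source address, the fast paths built on these helpers use the familiar head/body/tail shape: consume leading pixels with scalar code until the destination is aligned, run the vector body, then finish the remainder scalar. A hedged, plain-C helper (not pixman code) that computes the length of that scalar prologue for 4-byte pixels:

```c
#include <stdint.h>

/* How many leading 4-byte pixels must be handled by scalar code before
 * `dst` reaches a 16-byte boundary, so that save_128_aligned may be used.
 * Mirrors the usual `while (w && ((uintptr_t) dst & 15))` prologue loop. */
static int
align_prologue_pixels (const uint32_t *dst, int w)
{
    int n = 0;

    while (n < w && (((uintptr_t) (dst + n)) & 15))
        n++;

    return n;
}
```

At most three leading uint32_t pixels ever need scalar handling, after which the body can store four pixels per iteration with aligned vector stores.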
Re: [Pixman] [PATCH v2 3/5] vmx: encapsulate the temporary variables inside the macros
On Thu, Jul 2, 2015 at 10:08 AM, Pekka Paalanen wrote:
> On Thu, 25 Jun 2015 15:59:55 +0300
> Oded Gabbay wrote:
>
>> v2: fixed whitespaces and indentation issues
>>
>> Signed-off-by: Oded Gabbay
>> Reviewed-by: Adam Jackson
>> ---
>>  pixman/pixman-vmx.c | 72 +
>>  1 file changed, 39 insertions(+), 33 deletions(-)
>>
>> diff --git a/pixman/pixman-vmx.c b/pixman/pixman-vmx.c
>> index e33d9d9..f28a0fd 100644
>> --- a/pixman/pixman-vmx.c
>> +++ b/pixman/pixman-vmx.c
>> @@ -153,13 +153,18 @@ over (vector unsigned int src,
>>   */
>>
>>  #define LOAD_VECTORS(dest, source) \
>> +do { \
>> +    vector unsigned char tmp1, tmp2; \
>>      tmp1 = (typeof(tmp1))vec_ld (0, source); \
>>      tmp2 = (typeof(tmp2))vec_ld (15, source); \
>>      v ## source = (typeof(v ## source)) \
>>          vec_perm (tmp1, tmp2, source ## _mask); \
>> -    v ## dest = (typeof(v ## dest))vec_ld (0, dest);
>> +    v ## dest = (typeof(v ## dest))vec_ld (0, dest); \
>> +} while (0);
>
> Here...
>
>>  #define LOAD_VECTORSC(dest, source, mask) \
>> +do { \
>> +    vector unsigned char tmp1, tmp2; \
>>      tmp1 = (typeof(tmp1))vec_ld (0, source); \
>>      tmp2 = (typeof(tmp2))vec_ld (15, source); \
>>      v ## source = (typeof(v ## source)) \
>> @@ -168,7 +173,8 @@ over (vector unsigned int src,
>>      v ## dest = (typeof(v ## dest))vec_ld (0, dest); \
>>      tmp2 = (typeof(tmp2))vec_ld (15, mask); \
>>      v ## mask = (typeof(v ## mask)) \
>> -        vec_perm (tmp1, tmp2, mask ## _mask);
>> +        vec_perm (tmp1, tmp2, mask ## _mask); \
>> +} while (0);
>
> and here the final semicolon is too much. People expect to write them
> when they call the macro.
>
> But, it's not a bug really, it's just an extra semicolon that can be
> cleaned up later, so I won't hold this patch due to that.

I'm going to send another set of patches today, so I'll add a separate
patch that fixes these issues.

> Another style issue is that Pixman CODING_STYLE says the braces go on
> separate lines.

Same thing.

> Is the comment about "notice you have to declare temp vars" now moot?
> I also can't see tmp3 or tmp4 anywhere, so I suppose the whole comment
> is just stale now?

Correct, will remove that.

> Anyway, all that can be follow-ups.
>
> Thanks,
> pq
Re: [Pixman] [PATCH v2 0/5] Fix vmx fast-paths for ppc64le
On Thu, 25 Jun 2015 15:59:52 +0300 Oded Gabbay wrote:

> Hi,
>
> Here is v2 of my patch-set to fix vmx fast-paths for ppc64le.
> The fixes in this v2 are:
>
> - replace _LITTLE_ENDIAN with WORDS_BIGENDIAN for consistency
> - fixed whitespaces and indentation issues
> - replace #ifndef with #ifdef for readability
> - don't put ';' at the end of macro definitions. Instead, move it to
>   each line where the macro is used.

Hi,

all patches pushed with my Acked-by, and assuming ajax' R-b still stands.
Minor nitpicks sent on patch 3, which can be fixed later.

Pushed:
  eebc1b7..2be523b  master -> master

Thanks,
pq

> This patch-set fixes the vmx fast-paths with regard to the ppc64le
> architecture (POWER8).
>
> Although IBM's Fernando Seiti Furusato published a patch concerning
> this topic, it had some problems and I didn't see any "movement"
> regarding it for the past 2 weeks (since Jun-3).
>
> As I'm working full-time on this issue, I took the liberty of taking
> part of Fernando's patch and making it a separate patch with his name
> as author.
>
> The rest of the patches are my own, based on comments made on
> Fernando's original patch and on my investigations and judgement.
>
> I verified that all the tests work on a POWER8 server (with RHEL 7.1 LE)
> and a POWER7 server (with RHEL 7.1). There is no degradation in
> performance.
>
> I hope to improve performance in the next patches I will send.
>
>
> Fernando Seiti Furusato (1):
>   vmx: adjust macros when loading vectors on ppc64le
>
> Oded Gabbay (4):
>   vmx: fix splat_alpha for ppc64le
>   vmx: encapsulate the temporary variables inside the macros
>   vmx: fix unused var warnings
>   vmx: fix pix_multiply for ppc64le
>
>  pixman/pixman-vmx.c | 152
>  1 file changed, 119 insertions(+), 33 deletions(-)
Re: [Pixman] [PATCH v2 3/5] vmx: encapsulate the temporary variables inside the macros
On Thu, 25 Jun 2015 15:59:55 +0300 Oded Gabbay wrote:

> v2: fixed whitespaces and indentation issues
>
> Signed-off-by: Oded Gabbay
> Reviewed-by: Adam Jackson
> ---
>  pixman/pixman-vmx.c | 72 +
>  1 file changed, 39 insertions(+), 33 deletions(-)
>
> diff --git a/pixman/pixman-vmx.c b/pixman/pixman-vmx.c
> index e33d9d9..f28a0fd 100644
> --- a/pixman/pixman-vmx.c
> +++ b/pixman/pixman-vmx.c
> @@ -153,13 +153,18 @@ over (vector unsigned int src,
>   */
>
>  #define LOAD_VECTORS(dest, source) \
> +do { \
> +    vector unsigned char tmp1, tmp2; \
>      tmp1 = (typeof(tmp1))vec_ld (0, source); \
>      tmp2 = (typeof(tmp2))vec_ld (15, source); \
>      v ## source = (typeof(v ## source)) \
>          vec_perm (tmp1, tmp2, source ## _mask); \
> -    v ## dest = (typeof(v ## dest))vec_ld (0, dest);
> +    v ## dest = (typeof(v ## dest))vec_ld (0, dest); \
> +} while (0);

Here...

>  #define LOAD_VECTORSC(dest, source, mask) \
> +do { \
> +    vector unsigned char tmp1, tmp2; \
>      tmp1 = (typeof(tmp1))vec_ld (0, source); \
>      tmp2 = (typeof(tmp2))vec_ld (15, source); \
>      v ## source = (typeof(v ## source)) \
> @@ -168,7 +173,8 @@ over (vector unsigned int src,
>      v ## dest = (typeof(v ## dest))vec_ld (0, dest); \
>      tmp2 = (typeof(tmp2))vec_ld (15, mask); \
>      v ## mask = (typeof(v ## mask)) \
> -        vec_perm (tmp1, tmp2, mask ## _mask);
> +        vec_perm (tmp1, tmp2, mask ## _mask); \
> +} while (0);

and here the final semicolon is too much. People expect to write them
when they call the macro.

But, it's not a bug really, it's just an extra semicolon that can be
cleaned up later, so I won't hold this patch due to that.

Another style issue is that Pixman CODING_STYLE says the braces go on
separate lines.

Is the comment about "notice you have to declare temp vars" now moot?
I also can't see tmp3 or tmp4 anywhere, so I suppose the whole comment
is just stale now?

Anyway, all that can be follow-ups.

Thanks,
pq
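Pekka's point about the trailing semicolon matters because `do { ... } while (0)` exists precisely so that a macro call plus the caller's own `;` parses as one statement. If the definition itself ends in `;`, an if/else wrapped around the call expands to two statements and the `else` no longer attaches. A minimal illustration with a hypothetical SWAP macro (not pixman code):

```c
/* Correct form: no trailing semicolon in the definition, so the
 * caller's own `;` completes the statement. */
#define SWAP(a, b)      \
do {                    \
    int tmp_ = (a);     \
    (a) = (b);          \
    (b) = tmp_;         \
} while (0)

/* Had SWAP's definition ended in `;` -- as the macros in this patch
 * did -- the if-branch below would expand to two statements and the
 * `else` would be a compile error ("else without a previous if"). */
static void
order (int *x, int *y)
{
    if (*x > *y)
        SWAP (*x, *y);
    else
        (void) 0;  /* placeholder else branch */
}
```

This is the standard "semicolon swallowing" idiom; the v2 patches got the `do { } while (0)` wrapper right and only left the redundant `;` behind.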