[Pixman] [PATCH 6/9] armv7: Use aligned memory writes in both copies of bilinear code

2016-04-11 Thread Ben Avison
pixman-arm-neon-asm-bilinear.S contains duplicates of some macro
definitions from pixman-arm-neon-asm.S, but they were copied before
commit 9638af9 added the alignment hints, and they have never been
brought back into line.

An equivalent macro to load from the destination buffer (not applicable
to the operations implemented in pixman-arm-neon-asm.S) benefits from
the same alignment hints.

Verified that the aligned versions don't cause memory faults for the
fast paths defined in pixman-arm-neon-asm-bilinear.S.

Signed-off-by: Ben Avison <bavi...@riscosopen.org>
---
 pixman/pixman-arm-neon-asm-bilinear.S |   16 ++++++++--------
 1 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/pixman/pixman-arm-neon-asm-bilinear.S b/pixman/pixman-arm-neon-asm-bilinear.S
index 0fd92d6..aba8d00 100644
--- a/pixman/pixman-arm-neon-asm-bilinear.S
+++ b/pixman/pixman-arm-neon-asm-bilinear.S
@@ -186,9 +186,9 @@
 
 .macro bilinear_store_ numpix, tmp1, tmp2
 .if numpix == 4
-vst1.32   {d0, d1}, [OUT]!
+vst1.32   {d0, d1}, [OUT, :128]!
 .elseif numpix == 2
-vst1.32   {d0}, [OUT]!
+vst1.32   {d0}, [OUT, :64]!
 .elseif numpix == 1
 vst1.32   {d0[0]}, [OUT, :32]!
 .else
@@ -203,11 +203,11 @@
 vuzp.u8 d0, d2
 convert__to_0565 d2, d1, d0, q1, tmp1, tmp2
 .if numpix == 4
-vst1.16   {d2}, [OUT]!
+vst1.16   {d2}, [OUT, :64]!
 .elseif numpix == 2
-vst1.32   {d2[0]}, [OUT]!
+vst1.32   {d2[0]}, [OUT, :32]!
 .elseif numpix == 1
-vst1.16   {d2[0]}, [OUT]!
+vst1.16   {d2[0]}, [OUT, :16]!
 .else
 .error bilinear_store_0565 numpix is unsupported
 .endif
@@ -251,11 +251,11 @@
 
 .macro bilinear_load_dst_ numpix, dst0, dst1, dst01
 .if numpix == 4
-vld1.32 {dst0, dst1}, [OUT]
+vld1.32 {dst0, dst1}, [OUT, :128]
 .elseif numpix == 2
-vld1.32 {dst0}, [OUT]
+vld1.32 {dst0}, [OUT, :64]
 .elseif numpix == 1
-vld1.32 {dst0[0]}, [OUT]
+vld1.32 {dst0[0]}, [OUT, :32]
 .else
 .error bilinear_load_dst_ numpix is unsupported
 .endif
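For reference, the `:N` suffix on these VLD/VST addresses is a promise to the hardware that the address is N-bit aligned; if the promise is broken the access faults, which is why the commit message mentions verifying the fast paths. A rough model of the rule the patch relies on (the hint for each store is simply the total transfer width) can be sketched in Python — this is an illustration of the alignment constraint, not pixman code:

```python
def store_alignment_hint(numpix, bpp):
    """Alignment hint (in bits) usable for a store of numpix pixels of
    bpp bits each: the whole transfer size, as in the patched macros."""
    return numpix * bpp

def access_ok(addr, hint_bits):
    """A NEON access with an :N alignment hint is only legal when the
    address is a multiple of N/8 bytes."""
    return addr % (hint_bits // 8) == 0

# 4 x 32bpp pixels -> :128 hint, which requires a 16-byte-aligned address
assert store_alignment_hint(4, 32) == 128
assert access_ok(0x1000, 128)
assert not access_ok(0x1004, 128)
# ...but the same address is fine for a single-pixel :32 store
assert access_ok(0x1004, 32)
```

This is why the hints can only be added where the surrounding code guarantees the destination pointer's alignment at that point in the scanline.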
-- 
1.7.5.4

___
Pixman mailing list
Pixman@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/pixman


[Pixman] [PATCH 4/9] armv7: Simplify constant load

2016-04-11 Thread Ben Avison
A minor point, but 0xFF000000 is already a valid immediate constant for
NEON, so there's no need to construct it in two steps.

Signed-off-by: Ben Avison <bavi...@riscosopen.org>
---
 pixman/pixman-arm-neon-asm.S |3 +--
 1 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/pixman/pixman-arm-neon-asm.S b/pixman/pixman-arm-neon-asm.S
index 9a5d85a..97315d4 100644
--- a/pixman/pixman-arm-neon-asm.S
+++ b/pixman/pixman-arm-neon-asm.S
@@ -1186,8 +1186,7 @@ generate_composite_function \
 .endm
 
 .macro pixman_composite_src_x888__init
-vmov.u8  q2, #0xFF
-vshl.u32 q2, q2, #24
+vmov.u32 q2, #0xFF000000
 .endm
 
 generate_composite_function \
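For context, NEON's VMOV (immediate) can materialise any 32-bit constant that is a single byte in one byte lane, or such a byte followed by 8 or 16 one bits. The constant built by the removed two-step sequence (0xFF shifted left by 24) falls in the first category. A sketch of that encodability test in Python — a simplification of the architecture's "modified immediate" rules, written from memory, so treat the corner cases as assumptions:

```python
def is_vmov_i32_immediate(value):
    """True if value fits NEON's vmov.i32 modified-immediate encoding:
    one byte in any byte lane, or a byte followed by 8 or 16 one bits."""
    for shift in (0, 8, 16, 24):
        if value & ~(0xFF << shift) == 0:
            return True            # 0x000000XY .. 0xXY000000
    for shift in (8, 16):
        ones = (1 << shift) - 1
        if value & ones == ones and value & ~((0xFF << shift) | ones) == 0:
            return True            # 0x0000XYFF and 0x00XYFFFF
    return False

# The constant produced by the removed vmov/vshl pair is directly
# encodable, which is the point of the patch:
assert is_vmov_i32_immediate(0xFF << 24)
# A value spread across two byte lanes is not:
assert not is_vmov_i32_immediate(0x00FF00FF)
```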
-- 
1.7.5.4



[Pixman] [PATCH 8/9 v2] armv7: More use of fast paths with localized destination alpha

2016-04-11 Thread Ben Avison
There are a group of combiner types - SRC, OVER, IN_REVERSE, OUT_REVERSE
and ADD - where the destination alpha component is only used (if at all) to
determine the alpha component of the result. This means that any such fast
paths with an a8r8g8b8 destination can also be applied to an x8r8g8b8
destination just by updating the fast path table, and likewise with
a8b8g8r8 and x8b8g8r8. The following operations are affected:

over___x888
add_n_8_x888
add__8_x888
add___x888
add__n_x888
add__x888
out_reverse_8_x888

v2: Changed summary line to make it distinct from similar patch relating
to localized source alpha
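The property being exploited can be checked numerically: with premultiplied OVER, the destination alpha only ever feeds the result's alpha channel, so two destinations differing only in alpha give results differing only in alpha. A Python sketch (the rounding here is illustrative, not pixman's exact arithmetic):

```python
def over(src, dst):
    """Premultiplied Porter-Duff OVER per channel; pixels are
    (a, r, g, b) tuples of 8-bit values."""
    sa = src[0]
    return tuple(min(255, s + ((255 - sa) * d + 127) // 255)
                 for s, d in zip(src, dst))

src = (0x80, 0x40, 0x20, 0x10)
lo = over(src, (0x00, 0x55, 0x66, 0x77))   # dest alpha 0
hi = over(src, (0xFF, 0x55, 0x66, 0x77))   # dest alpha 255
# RGB agrees regardless of destination alpha, so an a8r8g8b8 fast
# path is safe for an x8r8g8b8 destination (whose alpha is ignored)
assert lo[1:] == hi[1:]
```

The same argument applies to SRC, IN_REVERSE, OUT_REVERSE and ADD: whatever garbage the path reads from the x888 alpha byte can only pollute the written alpha byte, which an x888 destination ignores.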

Signed-off-by: Ben Avison <bavi...@riscosopen.org>
---
 pixman/pixman-arm-neon.c |   12 ++++++++++++
 1 files changed, 12 insertions(+), 0 deletions(-)

diff --git a/pixman/pixman-arm-neon.c b/pixman/pixman-arm-neon.c
index be761c9..5f0561a 100644
--- a/pixman/pixman-arm-neon.c
+++ b/pixman/pixman-arm-neon.c
@@ -331,6 +331,7 @@ static const pixman_fast_path_t arm_neon_fast_paths[] =
 PIXMAN_STD_FAST_PATH (OVER, a8b8g8r8, a8,   b5g6r5,   neon_composite_over__8_0565),
 PIXMAN_STD_FAST_PATH (OVER, r5g6b5,   a8,   r5g6b5,   neon_composite_over_0565_8_0565),
 PIXMAN_STD_FAST_PATH (OVER, b5g6r5,   a8,   b5g6r5,   neon_composite_over_0565_8_0565),
+PIXMAN_STD_FAST_PATH (OVER, a8r8g8b8, a8r8g8b8, x8r8g8b8, neon_composite_over___),
 PIXMAN_STD_FAST_PATH (OVER, a8r8g8b8, a8r8g8b8, a8r8g8b8, neon_composite_over___),
 PIXMAN_STD_FAST_PATH (OVER, a8r8g8b8, null, r5g6b5,   neon_composite_over__0565),
 PIXMAN_STD_FAST_PATH (OVER, a8b8g8r8, null, b5g6r5,   neon_composite_over__0565),
@@ -341,17 +342,26 @@ static const pixman_fast_path_t arm_neon_fast_paths[] =
 PIXMAN_STD_FAST_PATH (OVER, x8r8g8b8, null, a8r8g8b8, neon_composite_src_x888_),
 PIXMAN_STD_FAST_PATH (OVER, x8b8g8r8, null, a8b8g8r8, neon_composite_src_x888_),
 PIXMAN_STD_FAST_PATH (ADD,  solid,    a8,   a8,   neon_composite_add_n_8_8),
+PIXMAN_STD_FAST_PATH (ADD,  solid,    a8,   x8r8g8b8, neon_composite_add_n_8_),
 PIXMAN_STD_FAST_PATH (ADD,  solid,    a8,   a8r8g8b8, neon_composite_add_n_8_),
+PIXMAN_STD_FAST_PATH (ADD,  solid,    a8,   x8b8g8r8, neon_composite_add_n_8_),
 PIXMAN_STD_FAST_PATH (ADD,  solid,    a8,   a8b8g8r8, neon_composite_add_n_8_),
 PIXMAN_STD_FAST_PATH (ADD,  a8,   a8,   a8,   neon_composite_add_8_8_8),
 PIXMAN_STD_FAST_PATH (ADD,  r5g6b5,   a8,   r5g6b5,   neon_composite_add_0565_8_0565),
 PIXMAN_STD_FAST_PATH (ADD,  b5g6r5,   a8,   b5g6r5,   neon_composite_add_0565_8_0565),
+PIXMAN_STD_FAST_PATH (ADD,  a8r8g8b8, a8,   x8r8g8b8, neon_composite_add__8_),
+PIXMAN_STD_FAST_PATH (ADD,  a8b8g8r8, a8,   x8b8g8r8, neon_composite_add__8_),
 PIXMAN_STD_FAST_PATH (ADD,  a8r8g8b8, a8,   a8r8g8b8, neon_composite_add__8_),
 PIXMAN_STD_FAST_PATH (ADD,  a8b8g8r8, a8,   a8b8g8r8, neon_composite_add__8_),
+PIXMAN_STD_FAST_PATH (ADD,  a8r8g8b8, a8r8g8b8, x8r8g8b8, neon_composite_add___),
 PIXMAN_STD_FAST_PATH (ADD,  a8r8g8b8, a8r8g8b8, a8r8g8b8, neon_composite_add___),
+PIXMAN_STD_FAST_PATH (ADD,  a8r8g8b8, solid,    x8r8g8b8, neon_composite_add__n_),
+PIXMAN_STD_FAST_PATH (ADD,  a8b8g8r8, solid,    x8b8g8r8, neon_composite_add__n_),
 PIXMAN_STD_FAST_PATH (ADD,  a8r8g8b8, solid,    a8r8g8b8, neon_composite_add__n_),
 PIXMAN_STD_FAST_PATH (ADD,  a8b8g8r8, solid,    a8b8g8r8, neon_composite_add__n_),
 PIXMAN_STD_FAST_PATH (ADD,  a8,   null, a8,   neon_composite_add_8_8),
+PIXMAN_STD_FAST_PATH (ADD,  a8r8g8b8, null, x8r8g8b8, neon_composite_add__),
+PIXMAN_STD_FAST_PATH (ADD,  a8b8g8r8, null, x8b8g8r8, neon_composite_add__),
 PIXMAN_STD_FAST_PATH (ADD,  a8r8g8b8, null, a8r8g8b8, neon_composite_add__),
 PIXMAN_STD_FAST_PATH (ADD,  a8b8g8r8, null, a8b8g8r8, neon_composite_add__),
 PIXMAN_STD_FAST_PATH (IN,   solid,    null, a8,   neon_composite_in_n_8),
@@ -359,7 +369,9 @@ static const pixman_fast_path_t arm_neon_fast_paths[] =
 PIXMAN_STD_FAST_PATH (OVER_REVERSE, solid, null, a8b8g8r8, neon_composite_over_reverse_n_),
 PIXMAN_STD_FAST_PATH (OUT_REVERSE,  a8,    null, r5g6b5,   neon_composite_out_reverse_8_0565),
 PIXMAN_STD_FAST_PATH (OUT_REVERSE,  a8,    null, b5g6r5,   neon_composite_out_reverse_8_0565),
+PIXMAN_STD_FAST_PATH (OUT_REVERSE,  a8,    null, x8r8g8b8, neon_composite_out_reverse_8_),
 PIXMAN_STD_FAST_PATH (OUT_REVERSE,  a8,    null, a8r8g8b8, neon_composite_out_reverse_8_),
+PIXMAN_STD_FAST_PATH (OUT_REVERSE,  a8,    null, x8b8g8r8, neon_composite_out_reverse_8_),
 PIXMAN_STD_FAST_PATH (OUT_REVERS

[Pixman] [PATCH 3/9 repost] armv7: Use VLD-to-all-lanes

2016-04-11 Thread Ben Avison
I noticed in passing that a number of opportunities to use the all-lanes
variant of VLD have been missed. I don't expect any measurable speedup because
these are all in init code, but this simplifies the code a bit.
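The transformation is behaviour-preserving: loading a 32-bit colour word and VDUPing each of its bytes produces the same four broadcast registers as four VLD1.8 {dN[]} all-lanes loads through a post-incremented pointer. A little-endian Python model of the resulting register contents (register contents only — timing and register allocation aside):

```python
def splat_old(color):
    """Old sequence: load the word, then vdup each of its four bytes
    into its own 8-byte D register."""
    word = color.to_bytes(4, "little")
    return [bytes([word[i]]) * 8 for i in range(4)]

def splat_new(color):
    """New sequence: four vld1.8 {dN[]} all-lanes loads, with the
    pointer post-incremented one byte at a time."""
    return [bytes([b]) * 8 for b in color.to_bytes(4, "little")]

assert splat_old(0x80FF0040) == splat_new(0x80FF0040)
```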

Signed-off-by: Ben Avison <bavi...@riscosopen.org>
---
 pixman/pixman-arm-neon-asm.S |  142 +-
 1 files changed, 58 insertions(+), 84 deletions(-)

diff --git a/pixman/pixman-arm-neon-asm.S b/pixman/pixman-arm-neon-asm.S
index 7e949a3..9a5d85a 100644
--- a/pixman/pixman-arm-neon-asm.S
+++ b/pixman/pixman-arm-neon-asm.S
@@ -396,11 +396,10 @@ generate_composite_function \
 
 .macro pixman_composite_over_n_0565_init
 add DUMMY, sp, #ARGS_STACK_OFFSET
-vld1.32 {d3[0]}, [DUMMY]
-vdup.8  d0, d3[0]
-vdup.8  d1, d3[1]
-vdup.8  d2, d3[2]
-vdup.8  d3, d3[3]
+vld1.8  {d0[]}, [DUMMY]!
+vld1.8  {d1[]}, [DUMMY]!
+vld1.8  {d2[]}, [DUMMY]!
+vld1.8  {d3[]}, [DUMMY]!
 vmvn.8  d3, d3  /* invert source alpha */
 .endm
 
@@ -761,11 +760,10 @@ generate_composite_function_single_scanline \
 
 .macro pixman_composite_over_n__init
 add DUMMY, sp, #ARGS_STACK_OFFSET
-vld1.32 {d3[0]}, [DUMMY]
-vdup.8  d0, d3[0]
-vdup.8  d1, d3[1]
-vdup.8  d2, d3[2]
-vdup.8  d3, d3[3]
+vld1.8  {d0[]}, [DUMMY]!
+vld1.8  {d1[]}, [DUMMY]!
+vld1.8  {d2[]}, [DUMMY]!
+vld1.8  {d3[]}, [DUMMY]!
 vmvn.8  d24, d3  /* get inverted alpha */
 .endm
 
@@ -813,11 +811,10 @@ generate_composite_function \
 
 .macro pixman_composite_over_reverse_n__init
 add DUMMY, sp, #ARGS_STACK_OFFSET
-vld1.32 {d7[0]}, [DUMMY]
-vdup.8  d4, d7[0]
-vdup.8  d5, d7[1]
-vdup.8  d6, d7[2]
-vdup.8  d7, d7[3]
+vld1.8  {d4[]}, [DUMMY]!
+vld1.8  {d5[]}, [DUMMY]!
+vld1.8  {d6[]}, [DUMMY]!
+vld1.8  {d7[]}, [DUMMY]!
 .endm
 
 generate_composite_function \
@@ -956,11 +953,10 @@ generate_composite_function \
 .macro pixman_composite_over_n_8_0565_init
 add DUMMY, sp, #ARGS_STACK_OFFSET
 vpush   {d8-d15}
-vld1.32 {d11[0]}, [DUMMY]
-vdup.8  d8, d11[0]
-vdup.8  d9, d11[1]
-vdup.8  d10, d11[2]
-vdup.8  d11, d11[3]
+vld1.8  {d8[]}, [DUMMY]!
+vld1.8  {d9[]}, [DUMMY]!
+vld1.8  {d10[]}, [DUMMY]!
+vld1.8  {d11[]}, [DUMMY]!
 .endm
 
 .macro pixman_composite_over_n_8_0565_cleanup
@@ -981,10 +977,9 @@ generate_composite_function \
 
/**/
 
 .macro pixman_composite_over__n_0565_init
-add DUMMY, sp, #(ARGS_STACK_OFFSET + 8)
+add DUMMY, sp, #(ARGS_STACK_OFFSET + 11)
 vpush   {d8-d15}
-vld1.32 {d24[0]}, [DUMMY]
-vdup.8  d24, d24[3]
+vld1.8  {d24[]}, [DUMMY]
 .endm
 
 .macro pixman_composite_over__n_0565_cleanup
@@ -1049,12 +1044,8 @@ generate_composite_function \
 
 .macro pixman_composite_src_n_8_init
 add DUMMY, sp, #ARGS_STACK_OFFSET
-vld1.32 {d0[0]}, [DUMMY]
-vsli.u64d0, d0, #8
-vsli.u64d0, d0, #16
-vsli.u64d0, d0, #32
-vorrd1, d0, d0
-vorrq1, q0, q0
+vld1.8  {d0[],d1[]}, [DUMMY]
+vld1.8  {d2[],d3[]}, [DUMMY]
 .endm
 
 .macro pixman_composite_src_n_8_cleanup
@@ -1089,11 +1080,8 @@ generate_composite_function \
 
 .macro pixman_composite_src_n_0565_init
 add DUMMY, sp, #ARGS_STACK_OFFSET
-vld1.32 {d0[0]}, [DUMMY]
-vsli.u64d0, d0, #16
-vsli.u64d0, d0, #32
-vorrd1, d0, d0
-vorrq1, q0, q0
+vld1.16 {d0[],d1[]}, [DUMMY]
+vld1.16 {d2[],d3[]}, [DUMMY]
 .endm
 
 .macro pixman_composite_src_n_0565_cleanup
@@ -1128,10 +1116,8 @@ generate_composite_function \
 
 .macro pixman_composite_src_n__init
 add DUMMY, sp, #ARGS_STACK_OFFSET
-vld1.32 {d0[0]}, [DUMMY]
-vsli.u64d0, d0, #32
-vorrd1, d0, d0
-vorrq1, q0, q0
+vld1.32 {d0[],d1[]}, [DUMMY]
+vld1.32 {d2[],d3[]}, [DUMMY]
 .endm
 
 .macro pixman_composite_src_n__cleanup
@@ -1271,11 +1257,10 @@ generate_composite_function \
 
 .macro pixman_composite_src_n_8__init
 add DUMMY, sp, #ARGS_STACK_OFFSET
-vld1.32 {d3[0]}, [DUMMY]
-vdup.8  d0, d3[0]
-vdup.8  d1, d3[1]
-vdup.8  d2, d3[2]
-vdup.8  d3, d3[3]
+vld1.8  {d0[]}, [DUMMY]!
+vld1.8  {d1[]}, [DUMMY]!
+vld1.8  {d2[]}, [DUMMY]!
+vld1.8  {d3[]}, [DUMMY]!
 .endm
 
 .macro pixman_composite_src_n_8__cleanup
@@ -1339,9 +1324,8 @@ generate_composite_function \
 .endm
 
 .macro pixman_composite_src_n_8_8_init
-add DUMMY, sp, #ARGS_STACK_OFFSET
-vld1.32 {d16[0]}, [DUMMY]
-vdup.8  d16, d16[3]
+add DUM

[Pixman] [PATCH 0/9] Changes to existing ARMv7 routines

2016-04-11 Thread Ben Avison
Since there are a few people around on the list at the moment who are
familiar with NEON, I'm hoping someone will be able to review my work so it
can make it into git.

To keep the number of patches manageable, here are a group which improve
incrementally upon existing ARMv7 routines, without adding any new ones yet.
Most of these are reposts which have had no review of the technical content.
The patch numbers have been reassigned within this series of 9 patches, and
won't match the numbers used when originally posted.

Ben Avison (9):
  armv7: Coalesce scalar accesses where possible
  armv7: Faster fill operations
  armv7: Use VLD-to-all-lanes
  armv7: Simplify constant load
  armv7: Use prefetch for small-width images too
  armv7: Use aligned memory writes in both copies of bilinear code
  armv7: Move common bilinear macro definitions to a new header file
  armv7: More use of fast paths with localized destination alpha
  armv7: More use of fast paths with localized source alpha

 pixman/Makefile.am|3 +-
 pixman/pixman-arm-neon-asm-bilinear.S |  153 +-
 pixman/pixman-arm-neon-asm-bilinear.h |  165 +++
 pixman/pixman-arm-neon-asm.S  |  280 +++--
 pixman/pixman-arm-neon-asm.h  |   20 +++
 pixman/pixman-arm-neon.c  |   21 +++
 6 files changed, 272 insertions(+), 370 deletions(-)
 create mode 100644 pixman/pixman-arm-neon-asm-bilinear.h

-- 
1.7.5.4



[Pixman] [PATCH 5/9 repost] armv7: Use prefetch for small-width images too

2016-04-11 Thread Ben Avison
After discovering that the ARMv6 optimised fast paths often out-performed
the ARMv7 ones on a Cortex-A7, particularly on the RT benchmark, I found
that the problem was due to the fact that the ARMv7 macros didn't attempt
any sort of prefetch for small images (fewer than pixblock_size * 2 pixels
across).

Since a pixblock is chosen to be no larger than a cacheline, and is in many
cases smaller, it seemed a reasonable compromise to avoid adding a lot of
complexity by simply doing one prefetch for the start of a pixel row when
starting to process the preceding one, and that is what this patch does.
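The added PLDs compute the start of the next row from the stride register; the `lsl #bpp_shift` converts it from pixels to bytes (assuming, as the macro names suggest, that the STRIDE registers hold pixel strides and bpp_shift is log2 of bytes per pixel). Sketching that address arithmetic:

```python
def next_row_prefetch_addr(cur, stride_pixels, bpp_shift):
    """Address formed by 'pld [BASE, STRIDE, lsl #bpp_shift]': the
    current position plus one whole row of pixels, in bytes."""
    return cur + (stride_pixels << bpp_shift)

# a 640-pixel stride of 32bpp pixels (bpp_shift = 2) is 2560 bytes ahead
assert next_row_prefetch_addr(0x10000, 640, 2) == 0x10000 + 2560
# an a8 mask row (bpp_shift = 0) advances by the pixel count itself
assert next_row_prefetch_addr(0x2000, 640, 0) == 0x2000 + 640
```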

I compared the effect of using LDRB (which is what is currently used at the
end of each long pixel row) against PLD for each of the source and
destination buffers for a selection of common operations: src__,
over__ and add__, and in each case PLD of both buffers was
the most beneficial. PLDW didn't make any measurable difference.

The overall effect of this patch on the three operations is as follows
(L1, L2 and M tests can be ignored because they're known not to involve the
use of short rows):

src__

Before  After
Mean   StdDev   Mean   StdDev  Confidence  Change
HT  60.8   0.1  61.1   0.1 100.0%  +0.6%
VT  61.0   0.3  62.6   0.2 100.0%  +2.6%
R   45.5   0.2  46.2   0.2 100.0%  +1.5%
RT  19.8   0.0  21.4   0.0 100.0%  +7.8%

over__

Before  After
Mean   StdDev   Mean   StdDev  Confidence  Change
HT  40.2   0.1  40.7   0.4 100.0%  +1.0%
VT  35.5   0.2  37.9   0.3 100.0%  +6.7%
R   32.8   0.0  33.8   0.3 100.0%  +3.0%
RT  12.9   0.0  15.6   0.2 100.0%  +21.4%

add__

Before  After
Mean   StdDev   Mean   StdDev  Confidence  Change
HT  51.0   0.6  51.9   0.5 100.0%  +1.7%
VT  44.0   0.4  46.8   0.5 100.0%  +6.3%
R   39.6   0.5  41.0   0.4 100.0%  +3.5%
RT  15.2   0.2  18.0   0.2 100.0%  +18.5%

Signed-off-by: Ben Avison <bavi...@riscosopen.org>
---
 pixman/pixman-arm-neon-asm.h |    9 +++++++++
 1 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/pixman/pixman-arm-neon-asm.h b/pixman/pixman-arm-neon-asm.h
index 03257cc..a116e47 100644
--- a/pixman/pixman-arm-neon-asm.h
+++ b/pixman/pixman-arm-neon-asm.h
@@ -881,6 +881,15 @@ local skip1
  * nor prefetch are used.
  */
 8:
+.if src_bpp_shift >= 0
+PF pld, [SRC, SRC_STRIDE, lsl #src_bpp_shift]
+.endif
+.if dst_r_bpp != 0
+PF pld, [DST_R, DST_STRIDE, lsl #dst_bpp_shift]
+.endif
+.if mask_bpp_shift >= 0
+PF pld, [MASK, MASK_STRIDE, lsl #mask_bpp_shift]
+.endif
 /* Process exactly pixblock_size pixels if needed */
 tst W, #pixblock_size
 beq 1f
-- 
1.7.5.4



[Pixman] [PATCH 1/9 repost] armv7: Coalesce scalar accesses where possible

2016-04-11 Thread Ben Avison
Where the alignment of a block of elements is known to equal the size of the
block, but the block is smaller than 8 bytes, it is safe to use a larger
element size in a scalar VLD or VST without risking an alignment exception.
Typically the effect of this can be seen when accessing leading or trailing
halfwords or words in the destination buffer for long scanlines.

Sadly, the effect of this is too small to be measured, but it seems like a
good idea anyway.
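The claim is easy to check at the byte level: two adjacent halfword stores at a 4-byte-aligned address leave memory identical to one word store of the combined value (little-endian lane order assumed), so the wider element is safe whenever the block's alignment equals its size:

```python
buf_scalar = bytearray(8)
buf_wide = bytearray(8)

# two halfword (16-bit pixel) stores at a 4-byte-aligned offset
for i, px in enumerate((0x1234, 0xABCD)):
    buf_scalar[4 + 2 * i : 6 + 2 * i] = px.to_bytes(2, "little")

# one coalesced word store of the same pixel pair
buf_wide[4:8] = (0x1234 | (0xABCD << 16)).to_bytes(4, "little")

assert buf_scalar == buf_wide
```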

Signed-off-by: Ben Avison <bavi...@riscosopen.org>
---
 pixman/pixman-arm-neon-asm.h |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/pixman/pixman-arm-neon-asm.h b/pixman/pixman-arm-neon-asm.h
index bdcf6a9..76b3985 100644
--- a/pixman/pixman-arm-neon-asm.h
+++ b/pixman/pixman-arm-neon-asm.h
@@ -183,6 +183,10 @@
 pixldst30 vst3, 8, %(basereg+0), %(basereg+1), %(basereg+2), 3, mem_operand
 .elseif (bpp == 24) && (numpix == 1)
 pixldst30 vst3, 8, %(basereg+0), %(basereg+1), %(basereg+2), 1, mem_operand
+.elseif numpix * bpp == 32 && abits == 32
+pixldst 4, vst1, 32, basereg, mem_operand, abits
+.elseif numpix * bpp == 16 && abits == 16
+pixldst 2, vst1, 16, basereg, mem_operand, abits
 .else
 pixldst %(numpix * bpp / 8), vst1, %(bpp), basereg, mem_operand, abits
 .endif
-- 
1.7.5.4



[Pixman] [PATCH 2/9 repost] armv7: Faster fill operations

2016-04-11 Thread Ben Avison
This eliminates a number of branches over blocks of code that are either
empty or can be trivially combined with a separate code block at the start
and end of each scanline. This has a surprisingly big effect, at least on
Cortex-A7, for src_n_8:

Before  After
Mean   StdDev   Mean   StdDev  Confidence  Change
L1  1570.4 133.11639.6 110.7   100.0%  +4.4%
L2  1042.6 19.9 1086.6 23.4100.0%  +4.2%
M   1030.8 7.2  1036.8 3.2 100.0%  +0.6%
HT  287.4  3.5  303.3  2.9 100.0%  +5.5%
VT  262.0  2.6  263.3  2.6 99.9%   +0.5%
R   206.5  2.4  209.9  2.4 100.0%  +1.7%
RT  56.5   1.0  59.2   0.5 100.0%  +4.7%
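For background, the generated functions split each scanline into three phases — leading pixels up to a pixblock boundary, whole pixblocks, then trailing pixels — and this patch lets a fill (no source, mask or destination reads) skip the read side of the first and last phases. A sketch of that row decomposition, assuming the destination address is at least pixel-aligned:

```python
def split_row(addr, width, bpp_bytes, pixblock):
    """Decompose a scanline into (leading, bulk, trailing) pixel counts:
    leading pixels reach a pixblock-aligned address, bulk is whole
    pixblocks, trailing is whatever remains."""
    block_bytes = pixblock * bpp_bytes
    leading = min(width, (-addr % block_bytes) // bpp_bytes)
    bulk = (width - leading) // pixblock * pixblock
    trailing = width - leading - bulk
    return leading, bulk, trailing

lead, bulk, trail = split_row(0x1002, 100, 2, 8)
assert (lead, bulk, trail) == (7, 88, 5)
assert lead + bulk + trail == 100
# the leading phase ends on a 16-byte pixblock boundary
assert (0x1002 + lead * 2) % 16 == 0
```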

Signed-off-by: Ben Avison <bavi...@riscosopen.org>
---
 pixman/pixman-arm-neon-asm.h |    7 +++++++
 1 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/pixman/pixman-arm-neon-asm.h b/pixman/pixman-arm-neon-asm.h
index 76b3985..03257cc 100644
--- a/pixman/pixman-arm-neon-asm.h
+++ b/pixman/pixman-arm-neon-asm.h
@@ -468,6 +468,7 @@
 tst DST_R, #0xF
 beq 2f
 
+.if src_bpp > 0 || mask_bpp > 0 || dst_r_bpp > 0
 .irp lowbit, 1, 2, 4, 8, 16
 local skip1
 .if (dst_w_bpp <= (lowbit * 8)) && ((lowbit * 8) < (pixblock_size * dst_w_bpp))
@@ -487,6 +488,7 @@ local skip1
 1:
 .endif
 .endr
+.endif
 pixdeinterleave src_bpp, src_basereg
 pixdeinterleave mask_bpp, mask_basereg
 pixdeinterleave dst_r_bpp, dst_r_basereg
@@ -503,6 +505,9 @@ local skip1
 tst DST_W, #lowbit
 beq 1f
 .endif
+.if src_bpp == 0 && mask_bpp == 0 && dst_r_bpp == 0
+sub W, W, #(lowbit * 8 / dst_w_bpp)
+.endif
 pixst_a (lowbit * 8 / dst_w_bpp), dst_w_bpp, dst_w_basereg, DST_W
 1:
 .endif
@@ -533,6 +538,7 @@ local skip1
process_pixblock_tail_head
 tst W, #(pixblock_size - 1)
 beq 2f
+.if src_bpp > 0 || mask_bpp > 0 || dst_r_bpp > 0
 .irp chunk_size, 16, 8, 4, 2, 1
 .if pixblock_size > chunk_size
 tst W, #chunk_size
@@ -550,6 +556,7 @@ local skip1
 1:
 .endif
 .endr
+.endif
 pixdeinterleave src_bpp, src_basereg
 pixdeinterleave mask_bpp, mask_basereg
 pixdeinterleave dst_r_bpp, dst_r_basereg
-- 
1.7.5.4



[Pixman] [PATCH 7/9] armv7: Move common bilinear macro definitions to a new header file

2016-04-11 Thread Ben Avison
This reduces code duplication.

Signed-off-by: Ben Avison <bavi...@riscosopen.org>
---
 pixman/Makefile.am|3 +-
 pixman/pixman-arm-neon-asm-bilinear.S |  147 +-
 pixman/pixman-arm-neon-asm-bilinear.h |  165 +
 pixman/pixman-arm-neon-asm.S  |  135 +--
 4 files changed, 169 insertions(+), 281 deletions(-)
 create mode 100644 pixman/pixman-arm-neon-asm-bilinear.h

diff --git a/pixman/Makefile.am b/pixman/Makefile.am
index 581b6f6..b0cffaf 100644
--- a/pixman/Makefile.am
+++ b/pixman/Makefile.am
@@ -88,7 +88,8 @@ libpixman_arm_neon_la_SOURCES = \
 pixman-arm-neon-asm.S  \
pixman-arm-neon-asm-bilinear.S \
 pixman-arm-asm.h   \
-pixman-arm-neon-asm.h
+pixman-arm-neon-asm.h  \
+pixman-arm-neon-asm-bilinear.h
 libpixman_1_la_LIBADD += libpixman-arm-neon.la
 
 ASM_CFLAGS_arm_neon=
diff --git a/pixman/pixman-arm-neon-asm-bilinear.S b/pixman/pixman-arm-neon-asm-bilinear.S
index aba8d00..1194d2d 100644
--- a/pixman/pixman-arm-neon-asm-bilinear.S
+++ b/pixman/pixman-arm-neon-asm-bilinear.S
@@ -67,152 +67,7 @@
 #include "pixman-private.h"
 #include "pixman-arm-asm.h"
 #include "pixman-arm-neon-asm.h"
-
-/*
- * Bilinear macros from pixman-arm-neon-asm.S
- */
-
-/*
- * Bilinear scaling support code which tries to provide pixel fetching, color
- * format conversion, and interpolation as separate macros which can be used
- * as the basic building blocks for constructing bilinear scanline functions.
- */
-
-.macro bilinear_load_ reg1, reg2, tmp
-mov   TMP1, X, asr #16
-add   X, X, UX
-add   TMP1, TOP, TMP1, asl #2
-vld1.32   {reg1}, [TMP1], STRIDE
-vld1.32   {reg2}, [TMP1]
-.endm
-
-.macro bilinear_load_0565 reg1, reg2, tmp
-mov   TMP1, X, asr #16
-add   X, X, UX
-add   TMP1, TOP, TMP1, asl #1
-vld1.32   {reg2[0]}, [TMP1], STRIDE
-vld1.32   {reg2[1]}, [TMP1]
-convert_four_0565_to_x888_packed reg2, reg1, reg2, tmp
-.endm
-
-.macro bilinear_load_and_vertical_interpolate_two_ \
-acc1, acc2, reg1, reg2, reg3, reg4, tmp1, tmp2
-
-bilinear_load_ reg1, reg2, tmp1
-vmull.u8  acc1, reg1, d28
-vmlal.u8  acc1, reg2, d29
-bilinear_load_ reg3, reg4, tmp2
-vmull.u8  acc2, reg3, d28
-vmlal.u8  acc2, reg4, d29
-.endm
-
-.macro bilinear_load_and_vertical_interpolate_four_ \
-xacc1, xacc2, xreg1, xreg2, xreg3, xreg4, xacc2lo, xacc2hi \
-yacc1, yacc2, yreg1, yreg2, yreg3, yreg4, yacc2lo, yacc2hi
-
-bilinear_load_and_vertical_interpolate_two_ \
-xacc1, xacc2, xreg1, xreg2, xreg3, xreg4, xacc2lo, xacc2hi
-bilinear_load_and_vertical_interpolate_two_ \
-yacc1, yacc2, yreg1, yreg2, yreg3, yreg4, yacc2lo, yacc2hi
-.endm
-
-.macro bilinear_load_and_vertical_interpolate_two_0565 \
-acc1, acc2, reg1, reg2, reg3, reg4, acc2lo, acc2hi
-
-mov   TMP1, X, asr #16
-add   X, X, UX
-add   TMP1, TOP, TMP1, asl #1
-mov   TMP2, X, asr #16
-add   X, X, UX
-add   TMP2, TOP, TMP2, asl #1
-vld1.32   {acc2lo[0]}, [TMP1], STRIDE
-vld1.32   {acc2hi[0]}, [TMP2], STRIDE
-vld1.32   {acc2lo[1]}, [TMP1]
-vld1.32   {acc2hi[1]}, [TMP2]
-convert_0565_to_x888 acc2, reg3, reg2, reg1
-vzip.u8   reg1, reg3
-vzip.u8   reg2, reg4
-vzip.u8   reg3, reg4
-vzip.u8   reg1, reg2
-vmull.u8  acc1, reg1, d28
-vmlal.u8  acc1, reg2, d29
-vmull.u8  acc2, reg3, d28
-vmlal.u8  acc2, reg4, d29
-.endm
-
-.macro bilinear_load_and_vertical_interpolate_four_0565 \
-xacc1, xacc2, xreg1, xreg2, xreg3, xreg4, xacc2lo, xacc2hi \
-yacc1, yacc2, yreg1, yreg2, yreg3, yreg4, yacc2lo, yacc2hi
-
-mov   TMP1, X, asr #16
-add   X, X, UX
-add   TMP1, TOP, TMP1, asl #1
-mov   TMP2, X, asr #16
-add   X, X, UX
-add   TMP2, TOP, TMP2, asl #1
-vld1.32   {xacc2lo[0]}, [TMP1], STRIDE
-vld1.32   {xacc2hi[0]}, [TMP2], STRIDE
-vld1.32   {xacc2lo[1]}, [TMP1]
-vld1.32   {xacc2hi[1]}, [TMP2]
-convert_0565_to_x888 xacc2, xreg3, xreg2, xreg1
-mov   TMP1, X, asr #16
-add   X, X, UX
-add   TMP1, TOP, TMP1, asl #1
-mov   TMP2, X, asr #16
-add   X, X, UX
-add   TMP2, TOP, TMP2, asl #1
-vld1.32   {yacc2lo[0]}, [TMP1], STRIDE
-vzip.u8   xreg1, xreg3
-vld1.32   {yacc2hi[0]}, [TMP2], STRIDE
-vzip.u8   xreg2, xreg4
-vld1.32   {yacc2lo[1]}, [TMP1]
-vzip.u8   xreg3, xreg4
-vld1.32   {yacc2hi[1]}, [TMP2]
-vzip.u8   xreg1, xreg2
-convert_0565_to_x888 yacc2, yreg3, yreg2, yreg1
-vmull.u8  xacc1, xreg1, d28
-vzip.u8   yreg1, yreg3
-vmlal.u8  xacc1, xreg2, d29
-vzip.u8   yreg2, yreg4
-vmull.u8  xacc2, xr

Re: [Pixman] [PATCH] Add support for aarch64 neon optimization

2016-04-04 Thread Ben Avison

On Sat, 02 Apr 2016 13:30:58 +0100, Mizuki Asakura wrote:

This patch only contains STD_FAST_PATH codes, not scaling (nearest,
bilinear) codes.


Hi Mizuki,

It looks like you have used an automated process to convert the AArch32
NEON code to AArch64. Will you be able to repeat that process for other
code, or at least assist others to repeat your steps?

The reason I ask is that I have a large number of outstanding patches to
the ARM NEON support. The process of getting them merged into the
FreeDesktop git repository has been very slow because there aren't many
people on this list with the time and ability to review them, however my
versions are in many cases up to twice the speed of the FreeDesktop
versions, and it would be a shame if AArch64 couldn't benefit from them.
If your AArch64 conversion is a one-time thing, it will make it
extremely difficult to merge my changes in.


After completing optimization this patch, scaling related codes should be done.


One of my aims was to implement missing "iter" routines so as to accelerate
scaled plots for a much wider combination of pixel formats and Porter-Duff
combiner rules than the existing limited selection of fast paths could
cover. If you look towards the end of my patch series here:

https://github.com/bavison/pixman/commits/arm-neon-release1

you'll see that I discovered that I was actually outperforming Pixman's
existing bilinear plotters so consistently that I'm advocating removing
them entirely, with the additional advantage that it simplifies the code
base a lot. So you might want to consider whether it's worth bothering
converting those to AArch64 in the first place.

I would maybe go so far as to suggest that you try converting all the iters
first and only add fast paths if you find they do better than the iters.
One of the drawbacks of using iters is that the prefetch code can't be as
sophisticated - it can't easily be prefetching the start of the next row
while it is still working on the end of the current one. But since hardware
prefetchers are better now and conditional execution is hard in AArch64,
this will be less of a drawback with AArch64 CPUs.

I'll also repeat what has been said, that it's very neat the way the
existing prefetch code sneaks calculations into pipeline stalls, but it was
only ever really ideal for Cortex-A8. With Cortex-A7 (despite the number,
actually a much more recent 32-bit core) I noted that it was impossible to
schedule such complex prefetch code without adding to the cycle count, at
least when the images were already in the cache.

Ben


Re: [Pixman] Cleaning patchwork

2016-01-04 Thread Ben Avison

On Tue, 22 Dec 2015 13:14:21 -, Oded Gabbay wrote:


Hi Ben,

I would like to clean pixman's patchwork and you have about 20
outstanding patches.

[...]


[4/4] pixman-fast-path: Make bilinear cover fetcher use COVER_CLIP_TIGHT flag
[3/4] armv7/mips/sse2: Make bilinear cover fast paths use COVER_CLIP_TIGHT flag
[2/4] Introduce FAST_PATH_SAMPLES_COVER_CLIP_TIGHT_BILINEAR flag


I still think these are a worthwhile improvement and have described some
conditions under which it should be measurable using affine-bench. I believe
Pekka wanted to prove it using real-world traces though, and since there was
no measurable change for better or worse using the normal Cairo traces, he
was attempting to capture some new ones that would exercise the cases I
described. Last I heard though, he had found that Cairo's tracer was broken,
so he was unable to make progress. Under the circumstances, do you think
affine-bench results alone would be acceptable?


Resolve implementation-defined behaviour for division rounded to -infinity


This one got bogged down in an argument about whether C89 should still be
supported or not, which I'm not sure was ever resolved?


[05/37,v2] composite flags: Early detection of fully opaque 1x1
repeating source images
[10/37,v2] pixman-fast-path: Add in__8 fast path
[11/37,v2] armv6: Add in__8 fast path
[23/37,v2] armv6: Add optimised scanline fetchers and writeback for
r5g6b5 and a8
[24/37,v2] armv6: Add optimised scanline fetcher for a1r5g5b5
[25/37,v2] armv6: Add src_1555_ fast path


v1 of these patches were reviewed (excluding the ARM assembly parts) by
Søren on 05 Oct 2014, and v2 addressed the issue he raised. There haven't
been any comments on the reposted versions.


armv7: Re-use existing fast paths in more cases
armv7: Re-use existing fast paths in more cases


These apply the same rules that Søren suggested for my ARMv6 paths in the v2
patches above to the pre-existing ARMv7 paths as well. The only review
comment received so far was that the two patches needed different names.


[2/5,v2] armv7: Add in__8 fast path


The only difference for v2 here was again just a matter of defining the
widest possible set of pixel formats to match the fast path.


[5/5] armv7: Add src_1555_ fast path
[4/5] armv7: Add in_reverse__ fast path
[3/5] armv7: Add in_n_ fast path
[1/5] armv7: Use prefetch for small-width images too
[3/3] armv7: Use VLD-to-all-lanes
[2/3] armv7: Faster fill operations
[1/3] armv7: Coalesce scalar accesses where possible
Require destination images to be bitmaps


None of these have received any comments to date. In many cases, I suspect
it's the fact that they involve ARM assembly, and we don't have many (any?)
active reviewers who know it. I'm not sure what we can do about that.


I also suggest that for the relevant ones (if there are any), you
would rebase them on the latest master and re-send them as a single
series in order to restart the review process.


I can certainly do that, but I had previously been asked to split my
postings into smaller series that were independent. I have a lot more
patches waiting to post that depend on the ones that are stuck in limbo -
the complete set can be seen at

https://github.com/bavison/pixman/commits/arm-neon-release1

if you're interested. Or I could just post/repost the whole lot (72 patches
now), like I did with my 37-patch series from 2014.

Ben


Re: [Pixman] [PATCH v3 1/2] Remove the 8e extra safety margin in COVER_CLIP analysis

2015-09-24 Thread Ben Avison

On Wed, 23 Sep 2015 09:40:32 +0100, Pekka Paalanen <ppaala...@gmail.com> wrote:

Ben, Siarhei,

for the record, gentlemen, please give your reviewed-by's
so I can just push these two out without any confusion. :-)


Sure...

Reviewed-by: Ben Avison <bavi...@riscosopen.org>

Ben


[Pixman] [PATCH v2] pixman-general: Tighten up calculation of temporary buffer sizes

2015-09-22 Thread Ben Avison
Each of the aligns can only add a maximum of 15 bytes to the space
requirement. This permits some edge cases to use the stack buffer where
previously it would have deduced that a heap buffer was required.
---

This is an update of my previous patch (now posted over a year ago):
https://patchwork.freedesktop.org/patch/49898/

which conflicts with the patch just pushed:
http://patchwork.freedesktop.org/patch/60052/

Since this piece of code is fresh in people's minds, this might be a good
time to get this one pushed as well.

Note that the scope of this change has been reduced: while the old code
added space to align the end address to a cacheline boundary (which as I
pointed out, was unnecessary), the new version works using buffer lengths
only.

As a discussion point, wouldn't it be better for the ALIGN macro to
assume 32-byte cache lines? Both 16-byte and 32-byte cachelines are
currently in common use, but by aligning the buffers to 32-byte addresses
we would simultaneously achieve 16-byte alignment.
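The 15-byte bound is just the worst case of rounding a length up to a 16-byte boundary, so the three aligned scanline buffers need at most 15 * 3 bytes of slack between them:

```python
def align_up(n, alignment=16):
    """Round n up to the next multiple of alignment."""
    return (n + alignment - 1) // alignment * alignment

overheads = [align_up(n) - n for n in range(1, 1024)]
assert max(overheads) == 15      # one buffer wastes at most 15 bytes
assert 3 * max(overheads) == 45  # hence 15 * 3 for three buffers
```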

 pixman/pixman-general.c |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/pixman/pixman-general.c b/pixman/pixman-general.c
index fa88463..6141cb0 100644
--- a/pixman/pixman-general.c
+++ b/pixman/pixman-general.c
@@ -158,9 +158,9 @@ general_composite_rect  (pixman_implementation_t *imp,
 if (width <= 0 || _pixman_multiply_overflows_int (width, Bpp * 3))
return;
 
-if (width * Bpp * 3 > sizeof (stack_scanline_buffer) - 32 * 3)
+if (width * Bpp * 3 > sizeof (stack_scanline_buffer) - 15 * 3)
 {
-   scanline_buffer = pixman_malloc_ab_plus_c (width, Bpp * 3, 32 * 3);
+   scanline_buffer = pixman_malloc_ab_plus_c (width, Bpp * 3, 15 * 3);
 
if (!scanline_buffer)
return;
-- 
1.7.5.4



Re: [Pixman] [PATCH 1/2] Remove the 8e extra safety margin in COVER_CLIP analysis

2015-09-22 Thread Ben Avison

On Mon, 21 Sep 2015 06:32:48 +0100, Siarhei Siamashka 
 wrote:

Since we are trying to justify the 8e extra safety margin removal in
the context of this patch, this is what I wanted to see explained in
the commit message. But maybe I'm just bad at math and it was perfectly
obvious to everyone else without any special proof.


I think it's just that if you've come on board since the old rounding
code was removed (as I have) it's hard to see why this would ever have
been a problem. If you express your P vector as a linear combination of
vectors

| frac(x_dst) |   | int(x_dst) |
| frac(y_dst) | + | int(y_dst) |
| 0x1         |   | 0          |

then it's clear that the first component is an invariant (0x8000, 0x8000,
0x1) irrespective of whether you reach P in a stepwise manner or not,
and that the other one uses only integers. Any integer multiplied by a
fixed-point number is a fixed-point number (of the same number of
fractional digits) without any rounding errors, so in the absence of any
intermediate rounding steps, the rounding error of the expression as a
whole is the same as the rounding error of the first component, and that
hasn't changed.


Proof:

All implementations must give the same numerical results as
bits_image_fetch_pixel_nearest() / bits_image_fetch_pixel_bilinear().

The former does
int x0 = pixman_fixed_to_int (x - pixman_fixed_e);
which maps directly to the new test for the nearest flag, when you consider
that x0 must fall in the interval [0,width).

The latter does
x1 = x - pixman_fixed_1 / 2;
x1 = pixman_fixed_to_int (x1);
x2 = x1 + 1;
but then x2 is brought back in range by the repeat() function, so it can't
stray beyond the end of the source line.


The wording does not look very good here. It seems to imply that the
repeat() function has some special use in the latter (BILINEAR) case.
But the repeat handling is exactly the same for NEAREST and BILINEAR.
Also for NONE repeat and fetching pixels from the outside of the source
image bounds, we are not bringing the coordinate back into range but
interpreting the pixel value as zero.


I can't follow your argument there - I don't think I implied that
repeat() acted differently for the bilinear case?

On NONE repeat, yes I neglected that detail, but I was generalising. How
about:

The latter does
x1 = x - pixman_fixed_1 / 2;
x1 = pixman_fixed_to_int (x1);
x2 = x1 + 1;
but any values of x2 that correspond to a pixel offset beyond the end of
the source line are never used to dereference the pixel array. In the
case of NONE repeat, a pixel value of zero is substituted, and otherwise
the action of the repeat() function, when applied to x2, is to select a
different pixel offset which *does* lie within the source line (the exact
choice depends upon the repeat type selected).

Ben


Re: [Pixman] Benchmarked: [PATCH 1/4] Change conditions for setting FAST_PATH_SAMPLES_COVER_CLIP flags

2015-09-20 Thread Ben Avison

On Fri, 18 Sep 2015 19:35:02 +0100, Bill Spitzak <spit...@gmail.com> wrote:


On Wed, Sep 16, 2015 at 7:30 AM, Ben Avison <bavi...@riscosopen.org> wrote:


Just by thinking things through, I realised that we would regularly
fail to hit COVER paths if Pixman's caller set the scale factors such
that the centre of the outermost destination pixels aligned with the
centre of the outermost source pixels. There has been some argument
about whether this is representative of how Pixman should be used. I
happen to think this is a perfectly reasonable thing to expect Pixman
to support, but there are other models you can follow, notably setting
the scale factors such that the outer edges of the outermost
destination pixels align to the outer edges of the outermost source
pixels. If the Cairo traces are using this latter model, then it's
understandable if you aren't hitting the edge case that I'm concerned
about very often.


There are currently very good reasons for clients to use the first
model: it is a method of removing undesirable "fuzzy edges" from zoomed-
in images.

The alternative of scaling the outer edge, while not hitting this fast
path, will not hit the other fast path either! (the other fast path is
the one that allows a zero-weighted pixel at one end to be read from
the image).


I'm generally in agreement with you. For the sake of completeness though,
I should point out that when zooming *out*, both models will match the
conditions for COVER paths. But then again, when you're zooming out,
particularly by a large factor, it is more appropriate to use a multi-tap
filter rather than bilinear scaling, so maybe that's not much of an
argument.


NOTE: the method of scaling the centers of the pixels is generally
wrong. The math is wonky: does scaling an image by 2 produce a 2*w-1
image, or does it spread the pixels slightly more than 2 apart?
Programs are going to disagree.


I'd suggest that if the highest layer software specifies a particular
scale factor, then that's what should be used, but if it requests a
particular destination size (e.g. "full screen") then that should be the
primary consideration even if it means the corresponding scale factor is
not the "obvious" one.

But this is all really a concern for the layers above Pixman, because the
starting fractional position and per-pixel increment is part of the
Pixman API. Pixman just has to do whatever has been asked of it. Just
because some of the ways the layers above Pixman do their calculations
are impossible for Pixman to optimise further shouldn't be a reason for
Pixman not to optimise the cases that it can do, especially when that
set of cases include many that use a scale factor of 1 on one axis.

Ben


Re: [Pixman] Benchmarked: [PATCH 1/4] Change conditions for setting FAST_PATH_SAMPLES_COVER_CLIP flags

2015-09-16 Thread Ben Avison

On Wed, 16 Sep 2015 12:25:59 +0100, Pekka Paalanen  wrote:

I tried both "image" and "image16" on an x86_64 (Sandybridge), and got
no performance differences in the trimmed-cairo-traces set in either
baseline/cleanup or cleanup/tight.

I also tried with PIXMAN_DISABLE=ssse3 and still got no difference. I
verified I am really running what I think I am by editing Pixman and
seeing the effect in the benchmark.

Am I missing something?


Well, there has been a long and tangled history behind this change at
this point. I'll admit my twin motivations were code cleanup and making
it possible to measure the speed of bilinear operations using a
benchmarker, but it's also ended up spawning the creation of affine-bench
and cover-test, which isn't a bad thing.

Just by thinking things through, I realised that we would regularly fail
to hit COVER paths if Pixman's caller set the scale factors such that the
centre of the outermost destination pixels aligned with the centre of the
outermost source pixels. There has been some argument about whether this
is representative of how Pixman should be used. I happen to think this is
a perfectly reasonable thing to expect Pixman to support, but there are
other models you can follow, notably setting the scale factors such that
the outer edges of the outermost destination pixels align to the outer
edges of the outermost source pixels. If the Cairo traces are using this
latter model, then it's understandable if you aren't hitting the edge
case that I'm concerned about very often.

The useful thing about the scaled-in-one-axis-only example is that on the
axis that isn't scaled, there is no dispute about how you set the scale
factors, so it sidesteps this argument.


Should I run the same on rpi2? Or is the best effect on the fast paths
we haven't merged yet?


To get a sizeable speed difference, you need two things:
1) a plot geometry that aligns the centres of the high-coordinate pixels
of the source and destination
2) a reasonable speed difference between COVER and repeat fast paths or
iterators

At present, the speed difference will be most marked when you are using
the bilinear iterators from pixman-fast-path.c or pixman-ssse3.c, which
are both currently restricted to COVER plots of a8r8g8b8 source images
only. If you test on a Raspberry Pi 2, then you also need to allow for
the fact that the ARMv7 implementation has bilinear fast paths for
src__, src__0565, over__ and add__, for most
types of image repeat as well as for COVER, so these reduce the
applicability of the iterator still further.

In my patch series from last year, I added ARMv6 bilinear iterators for
COVER plots of a8r8g8b8, x8r8g8b8, r5g6b5 and a8 images. Of course, you
won't be seeing the effect of these because they haven't been accepted
yet - but when (I hope) they are then having the new COVER_CLIP
definition in place should help demonstrate their effectiveness.
Basically, these are all just pieces of one big jigsaw puzzle.


Or maybe our test set is not enough? I recall having some problems with
that in the past.


Yes, I know I have a number of fast paths queued up which were not
represented at all in the Cairo perf traces, but were being hammered by
the Epiphany browser on the Raspberry Pi. I'm not sure how we justify
their inclusion if the Cairo perf traces is the only criterion allowed. I
hope there will be some flexibility in this respect.

If you want a more real-world example of when you might encounter
bilinear scaling in one axis only, here's one that I think is plausible.
Consider screen grabs taken of an NTSC SD video at ITU656 sample
positions (720x480) - or equally the output of a codec for a video at
that resolution. Now try to display it with bilinear scaling at the
correct aspect ratio on a display with square pixels: the destination
rectangle will likely be chosen to be 640x480, with a vertical pixel
increment of 1 and a horizontal pixel increment of 1.125. There's also a
reasonable chance that you'll be wanting to re-plot this 30 times
per second, for obvious reasons. You'd hope that this could be achieved
using a COVER fast path or iterator, but with the current flag
definitions, they can't be used.

Ben


Re: [Pixman] [PATCH 1/4] Change conditions for setting FAST_PATH_SAMPLES_COVER_CLIP flags

2015-09-16 Thread Ben Avison

On Wed, 16 Sep 2015 03:09:50 +0100, Søren Sandmann  
wrote:

If all existing fast paths and iterators can be
changed/verified to not read outside their bounds, I have no objection
to changing the meaning of the existing flag (or adding a new one if
that's more convenient).


OK, it sounds like the way to go is use two flags for at least a
transitional period. Then if and when all fast paths and iterators are
changed, then we can delete the old flag.


I realized that there is another use for a new flag like this: It could
likely also be used here:

http://cgit.freedesktop.org/pixman/tree/pixman/pixman.c#n662

so that images would be considered opaque in more cases, leading to more
cases of reducing the operator from OVER to SRC.


I had come across that when searching the source code for references to
the COVER_CLIP flags, but had mentally filed it away because I wasn't
sure at which point it would be appropriate to switch it over to using
the new bilinear flag. But having thought about it a bit more, I agree it
could use the new definition straight away, even before any fast paths or
iterators do.

Ben


Re: [Pixman] Explaining bilinear in rounding.txt

2015-09-15 Thread Ben Avison

On Mon, 14 Sep 2015 12:53:15 +0100, Pekka Paalanen <ppaala...@gmail.com> wrote:


On Sat, 12 Sep 2015 01:26:48 +0100
"Ben Avison" <bavi...@riscosopen.org> wrote:


-- BILINEAR filtering:

The BILINEAR filter calculates the linear interpolation between (i.e. a
weighted mean of) the two closest pixels to the given position - one
found by rounding down, one by rounding up.

  round_up(x) = ceil(x - o) + o
  round_down(x) = floor(x - o) + o

The weight factor applied to each of these is given by

  1 - abs(round(x) - x)

except in the case where the two rounding functions amount to the same
pixel - which only occurs if the given position aligns precisely with
one pixel. In that case, that one pixel value is used directly.


I don't understand this. We have a definition for round(x) earlier and
used with NEAREST, but I don't think that is what you meant here.


My interpretation was that the notation "round(x)" wasn't tied to a
specific definition. It had already been used for two different purposes
by that point in the document - for round to nearest with ties to plus
and minus infinity. Here I was just using it as a catch-all for the two
functions round_up(x) and round_down(x) whose unambiguous definitions I'd
repeated in the preceding paragraph.

Perhaps we should go back and rename the existing uses of round(x) to
round_nearest_ties_up(x) and round_nearest_ties_down(x) since it seems
this wasn't obvious?


Are you saying the weights would be:

w1 = 1 - (round_up(x) - x)
w2 = 1 - (x - round_down(x)).

And then the weights do not sum to 1, when round_up(x) == x and
round_down(x) == x, because it leads to w1 = w2 = 1?


Yes, that was what I was trying to say, but more concisely.


How about the following:

[...]

To enforce w1 + w2 = 1, we can choose between two modifications to the
above choices of x1 and x2:

x1 = round_down(x)
x2 = x1 + 1

and

x1 = x2 - 1
x2 = round_up(x).


While this second pair of equalities is mathematically true, it does need to
be evaluated in the opposite order, which might be worth noting.


*** Application to Pixman


I didn't quite follow the document structuring here. I took lines
prefixed with "***" to be top-level headings and lines prefixed with "--"
to be subheadings, but I think you've assumed the existing "Application
to Pixman" section to be a subsection of "General notes on rounding"
which stops before you get to "NEAREST filtering" and you're adding a new
"Application to Pixman" subsection under "BILINEAR filtering"? Perhaps
numbering the sections would make this clearer.


As we have the choice of rounding x either up or down first, we
consider the other pixel to be the one before or after the first pixel,
respectively. This choice is directly connected to the definition of
FAST_PATH_SAMPLES_COVER_CLIP_BILINEAR in Pixman.

Flag FAST_PATH_SAMPLES_COVER_CLIP_BILINEAR signifies that bilinear
sampling will not read beyond the defined image contents so that repeat
modes can be ignored. As coordinates are transformed from destination
space to source space and rounded only once, the logic determining the
flag must assume whether the other pixel is before or after in each
dimension, and whether it will be read even if its weight is 0.

Pixman's choice is to assume that the other pixel is always after, and
it can be read even if the weight is 0. The rounding mode is therefore
round_down(x).

----- 8< -----

The last paragraphs may seem out of place, but I don't see any other
place atm. I'm fairly sure this stuff has not been documented anywhere
before, and it would be really good to have somewhere. OTOH,
rounding.txt does explain exactly which pixel gets loaded for NEAREST.


All the rest of the document is a functional specification - if you
changed the rounding mode used for NEAREST filtering, even to round
nearest-with-ties-up, then you'd be having a material effect on what
Pixman does, the test harnesses would fail, and it would hopefully be
considered a bug.

This subsection is a design specification - the COVER_CLIP flags are a
contract between one internal part of Pixman (the operation dispatcher)
and another internal part of Pixman (the fast paths and iterators). It
can be changed without affecting Pixman's correctness (although its speed
may well be affected by any such changes).

I think I'd be tempted to put the description - either the last paragraph
or the entire subsection - in a large source code comment instead - say,
alongside the #define of FAST_PATH_SAMPLES_COVER_CLIP_BILINEAR. Hiding it
away in a text file increases the likelihood of it not being kept up-to-
date.

Ben


Re: [Pixman] [PATCH 1/4] Change conditions for setting FAST_PATH_SAMPLES_COVER_CLIP flags

2015-09-11 Thread Ben Avison

On Fri, 11 Sep 2015 10:13:08 +0100, Pekka Paalanen  wrote:

If you actually want to document things, then I think
pixman/rounding.txt would be the right place (and another patch). After
all, commit messages are only used to justify the patch, they are not
documentation people usually read.


Somehow, I don't remember ever having noticed that file! Perhaps it was
because it was called rounding.txt that it never occurred to me that
filtering might be documented there?

It's an odd omission that it doesn't talk about BILINEAR filtering,
though. However, having briefly read through the text, even though some
of it goes over my head a bit, I'd say it's describing from a strictly
mathematical point of view. Discussion of exactly which pixels get loaded
from memory in order to reach this mathematical outcome feels outside the
scope of that document to me.

Here's a draft section for BILINEAR filtering, comments welcome:

----- 8< -----

-- BILINEAR filtering:

The BILINEAR filter calculates the linear interpolation between (i.e. a
weighted mean of) the two closest pixels to the given position - one
found by rounding down, one by rounding up.

 round_up(x) = ceil(x - o) + o
 round_down(x) = floor(x - o) + o

The weight factor applied to each of these is given by

 1 - abs(round(x) - x)

except in the case where the two rounding functions amount to the same
pixel - which only occurs if the given position aligns precisely with one
pixel. In that case, that one pixel value is used directly.

A common simplification, to avoid having to treat this case differently,
is to define one (and only one) of the two round functions such that when
the given position aligns with a pixel, abs(round(x) - x) = 1, and hence
the corresponding weight factor is 0. Either of the following pairs of
definitions satisfy this requirement:

 round_down(x) = floor(x - o) + o
 round_up(x) = round_down(x) + 1

 round_up(x) = ceil(x - o) + o
 round_down(x) = round_up(x) - 1

----- 8< -----

Ben


Re: [Pixman] [PATCH 1/2] Remove the 8e extra safety margin in COVER_CLIP analysis

2015-09-11 Thread Ben Avison

On Fri, 11 Sep 2015 12:30:23 +0100, Pekka Paalanen  wrote:


As samplers are allowed to fetch the pixel at x2 unconditionally, we
require
x1 >= 0
x2 < width


I may be getting picky, but that's circular logic - the samplers are only
allowed to fetch the pixel at x2 unconditionally because of the way
BILINEAR_COVER_CLIP is defined in this piece of code, so you can't use it
as its own justification. As I wrote the commit log originally, the
premise is that some samplers expect to be able to fetch the pixel at x2
unconditionally (on at least one axis) and the conclusion is that we need
to define BILINEAR_COVER_CLIP to allow that (for now).

How about:
Because samplers may fetch the pixel at x2 unconditionally, we require...

Other than that, the series looks good - good thinking about the impact
of projective transforms (though maybe someday that could do with a more
rigorous examination).

Also good to see Bill's way of phrasing the ultimate aim - that it'll be
defined to be safe to fetch all pixels that have a non-zero weight - made
it into patch 2. I think it's a clear and concise description.

I had to look up what bikeshedding meant since you've used it a couple of
times - very apt. I'll have to drop it into conversation sometime :)

Ben


Re: [Pixman] [PATCH 3/3] test: Add cover-test v4

2015-09-09 Thread Ben Avison

On Wed, 09 Sep 2015 09:37:41 +0100, Pekka Paalanen  wrote:

I think we need some indication whether cover-test runs with or without
fencing. So far I have thought that if fence-image-self-test is
skipped, then cover-test can only run without fencing. If
fence-image-self-test is not skipped, then cover-test uses fencing if
it is not skipped.

It's perhaps a bit too subtle.


Too subtle for me :)


Maybe cover-test should have a single printf telling if it is fenced or
not? That would show up on old autotools, but on new ones you have to
go look in the logs anyway.

Maybe it would be most obvious if cover-test either always used fencing
or skipped. We'd lose the CRC check on too-large-page systems, but at
least if we see it as a PASS, we can be sure it used fencing. How's that?


Since one test is compile-time (availability of fencing) and the other is
runtime (reading the page size), I admit it's easier to settle on skipping
the test in both cases - the alternative would need to be a layer of
runtime abstraction between fenced and non-fenced images.

Perhaps a compromise is to

a) skip the test if the page size is too large (i.e. treat this as an
error condition, until someone is motivated to either abstract the fence
image code so it can be disabled at runtime, or to rework it to support
larger page sizes)

b) printf a warning iff fencing isn't available (a bit like the way the
PIXMAN_DISABLE parser doesn't feel the need to list the implementations
that aren't being skipped)

?

Ben


Re: [Pixman] [PATCH v3] test: Add cover-test

2015-09-07 Thread Ben Avison

On Mon, 07 Sep 2015 10:19:04 +0100, Pekka Paalanen <ppaala...@gmail.com> wrote:

On Fri, 04 Sep 2015 14:29:22 +0100
"Ben Avison" <bavi...@riscosopen.org> wrote:

Obviously that would require work to the fence image code as well as to
cover-test, so I thought I'd ask for opinions - is it worth the added
complexity, or should we just bail out if the page size is > 32K as Pekka 
suggests?


My opinion is that let's get this test landed now with skip on too
large page size, and enchance it later if wanted.

Ben, shall I add the skips and bikeshed with style and send the
cover-test patch to the list once more?


That's OK with me if nobody else is going to object.

Ben


Re: [Pixman] [PATCH v3] test: Add cover-test

2015-09-04 Thread Ben Avison

On Fri, 04 Sep 2015 11:18:03 +0100, Pekka Paalanen  wrote:

I think we may have a problem with sizes/coordinates.

With 64 kB page size, the minimum fenced image width for r5g6b5 image
is 32768 pixels. GDB tells me src_img->bits.width << 16 is negative.

Similarly the minimum width of an a8 image becomes 64k pixels. These
coordinates do not fit in pixman_fixed_t.


This is a good point.

However, I have thought of a workaround. Only the image width is
constrained to be < 65536 (or < 32768 if we want to avoid signed issues,
since pixman_fixed_t is defined to be a signed quantity). The *stride*
isn't.

In other words, if we accept that any one fenced image can only be
guaranteed to detect accesses off the right of an image, or the left of
an image, but not both, then we can construct a page layout of
* 1 protected page
* enough unprotected pages to hold at least the specified minimum number
  of pixels
* 1 protected page
* more unprotected pages
* 1 protected page

and so on.

In the case where the page size is much larger than a row, this amounts to
alternating protected and unprotected pages. Then we set the stride to
the distance between protected pages, and set image->bits.bits such that
either the left or right edge of each row lines up with the page boundary.

For example, if we wanted a fenced 8bpp image of width 16384 pixels on a
system where the page size is 64K, then a left-fenced image would be laid
out like this:

 pixels:        **      **
 stride:        <------>
 protected? XXXX    XXXX    XXXX
            ^   ^   ^   ^   ^
            0   64K 128K192K256K

and a right-fenced one would look like:

 pixels:          **      **
 stride:          <------>
 protected? XXXX    XXXX    XXXX
            ^   ^   ^   ^   ^
            0   64K 128K192K256K

Obviously that would require work to the fence image code as well as to
cover-test, so I thought I'd ask for opinions - is it worth the added
complexity, or should we just bail out if the page size is > 32K as Pekka
suggests?

Ben


[Pixman] [PATCH 1/4] Change conditions for setting FAST_PATH_SAMPLES_COVER_CLIP flags

2015-09-03 Thread Ben Avison
As discussed in
http://lists.freedesktop.org/archives/pixman/2015-August/003905.html

the 8 * pixman_fixed_e adjustment which was applied to the transformed
coordinates is a legacy of rounding errors which used to occur in old
versions of Pixman, but which no longer apply. For any affine transform,
you are now guaranteed to get the same result by transforming the upper
coordinate as though you transform the lower coordinate and add (size-1)
steps of the increment in source coordinate space. No projective
transform routines use the COVER_CLIP flags, so they cannot be affected.

However, removing the 8 * pixman_fixed_e border exposes a bug in the
calculation of the COVER_CLIP_NEAREST flag. Because the half-way cases
round down, an adjustment by pixman_fixed_e is needed. This patch
applies this adjustment.

Many bilinear fast paths currently assume that COVER_CLIP_BILINEAR is
not set when the transformed upper coordinate corresponds exactly to the
last row/pixel in source space. This is due to a detail of many current
implementations - they assume they can always load the pixel after the one
you get by dividing the coordinate by 2^16 with rounding to -infinity.

To avoid causing these implementations to exceed array bounds in these
cases, the COVER_CLIP_BILINEAR flag retains this feature in this patch.
Subsequent patches deal with removing this assumption, to enable cover
fast paths to be used in the maximum number of cases.

Proof:

All implementations must give the same numerical results as
bits_image_fetch_pixel_nearest() / bits_image_fetch_pixel_bilinear().

The former does
int x0 = pixman_fixed_to_int (x - pixman_fixed_e);
which maps directly to the new test for the nearest flag, when you consider
that x0 must fall in the interval [0,width).

The latter does
x1 = x - pixman_fixed_1 / 2;
x1 = pixman_fixed_to_int (x1);
x2 = x1 + 1;
but then x2 is brought back in range by the repeat() function, so it can't
stray beyond the end of the source line. When you write a cover path, you
are effectively omitting the repeat() function. The weight applied to the
pixel at offset x2 will be zero, so if you skip the load and leave the
pixel value undefined, the numerical result is unaffected, but you have
avoided a memory access that could potentially cause a page fault. If
however, we assume that the implementation loads from offset x2
unconditionally, then we need
x1 >= 0
x2 < width
so
x - pixman_fixed_1 / 2 >= 0
x - pixman_fixed_1 / 2 + pixman_fixed_1 < width * pixman_fixed_1
so
pixman_fixed_to_int (x - pixman_fixed_1 / 2) >= 0
pixman_fixed_to_int (x + pixman_fixed_1 / 2) < width
which matches the source code lines for the bilinear case, once you delete
the lines that apply the 8 * pixman_fixed_e border.
---
 pixman/pixman.c |   17 -
 test/affine-bench.c |   16 ++--
 2 files changed, 14 insertions(+), 19 deletions(-)

diff --git a/pixman/pixman.c b/pixman/pixman.c
index a07c577..f932eac 100644
--- a/pixman/pixman.c
+++ b/pixman/pixman.c
@@ -497,21 +497,12 @@ analyze_extent (pixman_image_t   *image,
 if (!compute_transformed_extents (transform, extents, ))
return FALSE;
 
-/* Expand the source area by a tiny bit so account of different rounding 
that
- * may happen during sampling. Note that (8 * pixman_fixed_e) is very far 
from
- * 0.5 so this won't cause the area computed to be overly pessimistic.
- */
-transformed.x1 -= 8 * pixman_fixed_e;
-transformed.y1 -= 8 * pixman_fixed_e;
-transformed.x2 += 8 * pixman_fixed_e;
-transformed.y2 += 8 * pixman_fixed_e;
-
 if (image->common.type == BITS)
 {
-   if (pixman_fixed_to_int (transformed.x1) >= 0   &&
-   pixman_fixed_to_int (transformed.y1) >= 0   &&
-   pixman_fixed_to_int (transformed.x2) < image->bits.width&&
-   pixman_fixed_to_int (transformed.y2) < image->bits.height)
+   if (pixman_fixed_to_int (transformed.x1 - pixman_fixed_e) >= 0  
  &&
+   pixman_fixed_to_int (transformed.y1 - pixman_fixed_e) >= 0  
  &&
+   pixman_fixed_to_int (transformed.x2 - pixman_fixed_e) < 
image->bits.width &&
+   pixman_fixed_to_int (transformed.y2 - pixman_fixed_e) < 
image->bits.height)
{
*flags |= FAST_PATH_SAMPLES_COVER_CLIP_NEAREST;
}
diff --git a/test/affine-bench.c b/test/affine-bench.c
index 9e0121e..697684b 100644
--- a/test/affine-bench.c
+++ b/test/affine-bench.c
@@ -396,13 +396,17 @@ main (int argc, char *argv[])
 }
 
 compute_transformed_extents (, _box, );
-/* The source area is expanded by a tiny bit (8/65536th pixel)
- * to match the calculation of the COVER_CLIP flags in analyze_extent()
+xmin = pixman_fixed_to_int (transformed.x1 - pixman_fixed_1 / 2);
+ymin = pixman_fixed_to_int (transformed.y1 - pixman_fixed_1 / 2);
+xmax = pixman_fixed_to_int (transformed.x2 + 

Re: [Pixman] [PATCH] test: Add cover-test

2015-09-03 Thread Ben Avison

On Thu, 03 Sep 2015 11:13:25 +0100, Pekka Paalanen  wrote:

Unless, you go and change the implementation meaning of Pixman's cover
flags which, reading your other reply, is what you are doing. I'm
finally starting to see it.


Glad the penny has dropped! The issue has been muddied somewhat by the
8*epsilon border. Perhaps I should have reposted the series as soon as it
became clear what the history was in that respect.

(from your other post on this thread)

This is *not* for protecting against out-of-bounds access, but this is
allowing for the extreme cases where a cover path is still
*theoretically* possible. (Ignoring what the COVER flags actually mean
in Pixman implementation.)


Yes. Pixman as a whole should be able to handle any coordinates without
bounds violations. The COVER_CLIP flags just determine the threshold
coordinates at which it chooses to use a cover path rather than the
appropriate repeat path (PAD, NONE, NORMAL or REFLECT); while you would
get the correct mathematical result even if you always use the repeat path,
the cover path is very likely to be faster because it will be implemented
in a way that assumes it doesn't need to do bounds checking (because the
COVER_CLIP calculation has effectively done the bounds check for it).

Previously, the flag definition was overly cautious, resulting in some
cases which could have been handled by cover paths instead being sent to
repeat paths. I first encountered this myself with lowlevel-blt-bench,
which routinely used repeat paths for its bilinear tests but cover paths
for its nearest tests. However, we have to be careful when adjusting the
flag calculations, as the cases where bounds violations occur may only be
1*epsilon wide, and so would have a high chance of not being detected by
the previous test suite in a reasonable length of time - hence the need
for a new test program.

When EXACT is set, cover-test only performs operations which are
theoretically achievable with a cover path (and in fact, they *are* all
achieved with a cover path if you apply all my patches). When it isn't
set, some operations will stray into territory that definitely needs a
repeat path. The fuzz factors that are applied to the coordinates were
set large enough, using knowledge of the 8*epsilon factor
in the old definition of the COVER_CLIP flags, that we should exercise
both cover and repeat paths right up to the coordinate limits at which
either old or new definition will allow them to be used.


Of course, I've now changed cover-test so it doesn't do randmemset
straight into fenced images anyway (due to needing to handle systems
with different memory page sizes) so this is no longer an issue.


D'oh, of course, that's why Oded didn't see a segfault. But he did say
the CRC does not match on PPC64, neither LE nor BE.


Just to be clear, is that still an issue after I fixed the DST_HEIGHT
SRC_HEIGHT bug?

Ben
___
Pixman mailing list
Pixman@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/pixman


Re: [Pixman] [PATCH] test: Add cover-test

2015-09-03 Thread Ben Avison

On Thu, 03 Sep 2015 17:14:05 +0100, Siarhei Siamashka 
<siarhei.siamas...@gmail.com> wrote:

On Thu, 03 Sep 2015 13:59:07 +0100 "Ben Avison" <bavi...@riscosopen.org> wrote:

Previously, the flag definition was overly cautious, resulting in some
cases which could have been handled by cover paths instead being sent
to repeat paths. I first encountered this myself with
lowlevel-blt-bench, which routinely used repeat paths for its bilinear
tests


Oh, this was actually intended. The bilinear scaling benchmark in
the lowlevel-blt-bench program tries to stress the use case, where
we are fetching pixels from the outside of the source image a little
bit. The reason is that this is quite typical for bilinear scaling
and that's how it is designed to work.

If this whole ordeal with the new COVER flag interpretation was just
a way to make it run fast in lowlevel-blt-bench, then this new
interpretation may not make much real practical sense.


That was just where it started... I should maybe explain that I
encountered it when I was writing bilinear fetchers for ARMv6, which (at
the time (*)) only handled the COVER case. This meant that
lowlevel-blt-bench was useless to me as it stood, because my COVER
fetchers were never selected, and it was important to me because I rely
on it to be able to compare different implementations (for example to
select the best prefetch distance). I was worried there might be
objections to changing lowlevel-blt-bench, and I also knew that I
followed different code paths for different scale factors (which are
fixed in lowlevel-blt-bench), many of which were so extreme that the
L1/L2/M/HT/VT/RT/R tests of lowlevel-blt-bench no longer made sense - so
that's how I came to write affine-bench.

But while writing affine-bench, it became apparent that it's really
complicated from a high level caller's point of view to construct a
transformation matrix that's guaranteed to hit a COVER path - and it
requires knowledge of details of the internal implementation of the guts
of Pixman, even after you deal with the obsolete 8*pixman_fixed_e
adjustment. This seemed wrong to me; I can imagine applications might
well want to use Pixman to scale between pixel centres rather than pixel
edges, and to expect such an application to know whether to exclude the
left pixel, or the right pixel, or both, or neither, and the same in the
vertical direction (for which the rules were often different) in order to
get the best performance is surely unreasonable.

Ben

(*) Actually, I've been revisiting this recently, and for various reasons
I've added repeat support to these fetchers. But now I'm getting a long
way ahead of myself...
___
Pixman mailing list
Pixman@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/pixman


[Pixman] [PATCH 3/4] armv7/mips/sse2: Make bilinear cover fast paths use COVER_CLIP_TIGHT flag

2015-09-03 Thread Ben Avison
This patch affects all functions defined by the macro
FAST_BILINEAR_MAINLOOP_INT, which can be identified by the fact that their
names start with the string "fast_composite_scaled_bilinear". These are
currently found in arm-neon, mips-dspr2 and sse2 implementations.

The inner pixel loop for these functions is provided by a separate
function (typically written in assembly) whose name is given in a macro
parameter. These functions accept, amongst other things, a pair of line
buffer pointers (one upper, one lower) and fixed-point initial and
incremental offsets into those lines.

Many of these assembly functions assume that they can calculate the value
of a pixel by truncating the offset of each pixel with rounding to
-infinity to locate one source pixel, and unconditionally loading the one
to its right as well. Rather than rework all the assembly on all three
platforms, this patch permits the assembly function to continue with these
assumptions, even when the C function is called such that the rightmost
coordinate coincides exactly with the last source pixel. In such cases,
we plot the final pixel of each row via a temporary buffer (similar to the
manner in which PAD repeat is accomplished).

It is worth noting that by comparison, these operations would previously
have been treated as PAD/NONE/NORMAL/REFLECT repeat operations, so they
would have been slower to process than other COVER cases anyway.

Tested using cover-test on armv7.
---
 pixman/pixman-inlines.h |   69 ---
 1 files changed, 65 insertions(+), 4 deletions(-)

diff --git a/pixman/pixman-inlines.h b/pixman/pixman-inlines.h
index 1c8441d..e53da08 100644
--- a/pixman/pixman-inlines.h
+++ b/pixman/pixman-inlines.h
@@ -924,6 +924,67 @@ fast_composite_scaled_bilinear ## scale_func_name (pixman_implementation_t *imp,
 	src_width_fixed = pixman_int_to_fixed (src_width);		\
     }									\
 									\
+    if (PIXMAN_REPEAT_ ## repeat_mode == PIXMAN_REPEAT_COVER)		\
+    {									\
+	/* Detect cases where a typical scanline function would read beyond array bounds. */ \
+	int32_t last_src_pixel = src_image->bits.width - 1;		\
+	max_vx = v.vector[0] + (width - 1) * unit_x;			\
+	if (pixman_fixed_to_int (max_vx) == last_src_pixel)		\
+	{								\
+	    src_type_t  buf1[2];					\
+	    src_type_t  buf2[2];					\
+	    src_type_t *src_line_top;					\
+	    src_type_t *src_line_bottom;				\
+	    width--;							\
+	    while (--height >= 0)					\
+	    {								\
+		int weight1, weight2;					\
+		dst = dst_line;						\
+		dst_line += dst_stride;					\
+		vx = v.vector[0];					\
+		if (flags & FLAG_HAVE_NON_SOLID_MASK)			\
+		{							\
+		    mask = mask_line;					\
+		    mask_line += mask_stride;				\
+		}							\
+									\
+		y1 = pixman_fixed_to_int (vy);				\
+		weight2 = pixman_fixed_to_bilinear_weight (vy);		\
+		if (weight2)						\
+		{							\
+		    /* both weight1 and weight2 are smaller than

[Pixman] [PATCH 4/4] pixman-fast-path: Make bilinear cover fetcher use COVER_CLIP_TIGHT flag

2015-09-03 Thread Ben Avison
The bilinear cover fetcher was implemented with similar assumptions to
FAST_BILINEAR_MAINLOOP_INT - that for all transformed destination
coordinates, you could divide by 2^16, round down and add 1, and still be
within the source image.

This patch effectively reverses this - dividing by 2^16, rounding up and
subtracting 1. The big advantage of this is we need only test for this
subtracted number being out of bounds and skip the corresponding pixel
load just once per row, at the start before entering the loop, whereas
with the original scheme, you would need to check every pixel.

To make the rounding up a simple operation, all the X coordinates
(including increments) are negated. There is an additional offset of
(1 << (16 - BILINEAR_INTERPOLATION_BITS)) - 1, to allow for the fact
that coordinates are truncated with rounding towards -infinity during
generation of weighting factors. Because the weight is inverted along
with the coordinates, you will also see that the weight is now
considered as the fraction of the way from the right pixel to the left
pixel, rather than vice versa - none of this increases the
computational complexity per pixel.

Tested using cover-test on armv7 with PIXMAN_DISABLE="arm-neon arm-simd"
---
 pixman/pixman-fast-path.c |   37 +
 1 files changed, 25 insertions(+), 12 deletions(-)

diff --git a/pixman/pixman-fast-path.c b/pixman/pixman-fast-path.c
index 53d4a1f..a17ca26 100644
--- a/pixman/pixman-fast-path.c
+++ b/pixman/pixman-fast-path.c
@@ -2280,18 +2280,29 @@ fetch_horizontal (bits_image_t *image, line_t *line,
  int y, pixman_fixed_t x, pixman_fixed_t ux, int n)
 {
 uint32_t *bits = image->bits + y * image->rowstride;
-int i;
+int i, x0, x1;
+int32_t dist_x;
+uint32_t left, right;
+
+x = (1 << (16 - BILINEAR_INTERPOLATION_BITS)) - 1 - x;
+ux = -ux;
+
+x0 = pixman_fixed_to_int (x);
+x1 = x0 + 1;
+dist_x = pixman_fixed_to_bilinear_weight (x);
+left = dist_x ? *(bits - x1) : 0;
 
 for (i = 0; i < n; ++i)
 {
-   int x0 = pixman_fixed_to_int (x);
-   int x1 = x0 + 1;
-   int32_t dist_x;
-
-   uint32_t left = *(bits + x0);
-   uint32_t right = *(bits + x1);
+   if (i > 0)
+   {
+   x0 = pixman_fixed_to_int (x);
+   x1 = x0 + 1;
+   dist_x = pixman_fixed_to_bilinear_weight (x);
+   left = *(bits - x1);
+   }
 
-   dist_x = pixman_fixed_to_bilinear_weight (x);
+   right = *(bits - x0);
dist_x <<= (8 - BILINEAR_INTERPOLATION_BITS);
 
 #if SIZEOF_LONG <= 4
@@ -2301,11 +2312,11 @@ fetch_horizontal (bits_image_t *image, line_t *line,
 
lag = (left & 0xff00ff00) >> 8;
rag = (right & 0xff00ff00) >> 8;
-   ag = (lag << 8) + dist_x * (rag - lag);
+   ag = (rag << 8) + dist_x * (lag - rag);
 
lrb = (left & 0x00ff00ff);
rrb = (right & 0x00ff00ff);
-   rb = (lrb << 8) + dist_x * (rrb - lrb);
+   rb = (rrb << 8) + dist_x * (lrb - rrb);
 
*((uint32_t *)(line->buffer + i)) = ag;
*((uint32_t *)(line->buffer + i) + 1) = rb;
@@ -2323,7 +2334,7 @@ fetch_horizontal (bits_image_t *image, line_t *line,
lagrb = (((uint64_t)lag) << 24) | lrb;
ragrb = (((uint64_t)rag) << 24) | rrb;
 
-   line->buffer[i] = (lagrb << 8) + dist_x * (ragrb - lagrb);
+   line->buffer[i] = (ragrb << 8) + dist_x * (lagrb - ragrb);
}
 #endif
 
@@ -2350,6 +2361,8 @@ fast_fetch_bilinear_cover (pixman_iter_t *iter, const uint32_t *mask)
 
 y0 = pixman_fixed_to_int (info->y);
 y1 = y0 + 1;
+if (y1 == iter->image->bits.height)
+   y1 = y0;
 dist_y = pixman_fixed_to_bilinear_weight (info->y);
 dist_y <<= (8 - BILINEAR_INTERPOLATION_BITS);
 
@@ -3187,7 +3200,7 @@ static const pixman_iter_info_t fast_iters[] =
   (FAST_PATH_STANDARD_FLAGS|
FAST_PATH_SCALE_TRANSFORM   |
FAST_PATH_BILINEAR_FILTER   |
-   FAST_PATH_SAMPLES_COVER_CLIP_BILINEAR),
+   FAST_PATH_SAMPLES_COVER_CLIP_TIGHT_BILINEAR),
   ITER_NARROW | ITER_SRC,
   fast_bilinear_cover_iter_init,
   NULL, NULL
-- 
1.7.5.4

___
Pixman mailing list
Pixman@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/pixman


Re: [Pixman] [PATCH v3] test: add fence-image-self-test

2015-09-02 Thread Ben Avison

On Wed, 02 Sep 2015 09:15:24 +0100, Pekka Paalanen  wrote:

Changes in v3:

- Do not print progress messages unless VERBOSE environment variable is
  set. Avoid spamming the terminal output of 'make check' on some
  versions of autotools.


I can confirm that "make check" is now quiet for me.

Ben
___
Pixman mailing list
Pixman@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/pixman


Re: [Pixman] [PATCH] test: Add cover-test

2015-09-02 Thread Ben Avison

On Wed, 02 Sep 2015 12:56:40 +0100, Pekka Paalanen  wrote:

Right. I looked at fast_bilinear_cover_iter_init() and
fast_fetch_bilinear_cover() -> fetch_horizontal(), and yes, that really
is how it is implemented. The leftmost pixel is chosen essentially by
n = pixman_fixed_to_int(x - pixman_fixed_1 / 2).

So, pixman_fixed_1 / 2 -> 0, not -1.

The thing that confuses me is that with nearest, x=0 will sample pixel
n=-1. However with bilinear, x=0.5 will sample pixels n=0 and n=1, not
n=-1 and n=0.


When x=0.5, the numerical output of bilinear scaling only depends upon
source pixel n=0. The stuff about whether you consider pixel n=+1 or n=-1
to be involved but with a weighting factor of 0 is incidental to the
numerical output. With nearest scaling, you don't have the option of
sitting on the fence so you have to round one way or the other.
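The "x=0 samples pixel n=-1" behaviour for nearest filtering follows from the final truncation step subtracting pixman_fixed_e before rounding down. A minimal model of just that step, using the 16.16 definitions (and assuming an arithmetic right shift for negative values, as Pixman's own pixman_fixed_to_int does):

```c
#include <assert.h>
#include <stdint.h>

typedef int32_t pixman_fixed_t;
#define pixman_fixed_e  ((pixman_fixed_t) 1)
#define pixman_fixed_1  ((pixman_fixed_t) 1 << 16)
#define pixman_fixed_to_int(f) ((int) ((f) >> 16))

/* Nearest filtering picks floor (x - e): a coordinate sitting exactly
 * on an integer boundary rounds *down* to the pixel on its left. */
static int nearest_sample (pixman_fixed_t x)
{
    return pixman_fixed_to_int (x - pixman_fixed_e);
}
```

So x=0 lands on n=-1, while anything in (0, 1] lands on n=0, which is the asymmetry the diagrams illustrate.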

If you look at the following ASCII art diagrams, the vertical axis is the
contribution from source pixel 0 (+) and source pixel 1 (x). Where these
coincide, I have used a *. Each column represents an increment of 0.125
(you'll have to read the labels on the horizontal axis vertically). Upper
diagram is nearest, lower diagram is bilinear.

   



  *

  -   -   +   +   +   +   +   +
  1   0   0   0   1   1   2   2   3
  .   .   .   .   .   .   .   .   .
  0   5   0   5   0   5   0   5   0

  +   x
+   +   x   x
  +   *   x
+   x   +   x
  *   *

See how the bilinear diagram is already symmetrical; no rounding is
needed of coordinates on the horizontal axis.

Of course, coordinates actually have 13 more bits of granularity than
this, but the same patterns hold.


> - Looks like it processes the whole stride, not width * bytespp,
>   which means this is going to explode on fenced images in the
>   padding, right?

Yes, looks like it is. Good spot. That probably does belong in a
separate patch; are you volunteering?


I should, shouldn't I? :-)
I'll do it.


Of course, I've now changed cover-test so it doesn't do randmemset
straight into fenced images anyway (due to needing to handle systems with
different memory page sizes) so this is no longer an issue.

Ben
___
Pixman mailing list
Pixman@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/pixman


Re: [Pixman] [PATCH] test: Add cover-test

2015-09-02 Thread Ben Avison

On Wed, 02 Sep 2015 14:03:01 +0100, Pekka Paalanen  wrote:

I am still a tiny bit struggling with why 31 and not 32, but it does
add up.


Maybe the reasoning in the comments in random_scale_factor() makes more
sense? There, it only talks about the numbers as integers, avoiding the
additional confusion caused by MAX_INC being 0.32 fixed-point and the
return value being 16.16 fixed-point.

If you still want to think of it as real numbers, then:

* random_scale_factor() generates a scale factor between
  2^(-LOG2_MAX_FACTOR-0.5) and 2^(LOG2_MAX_FACTOR+0.5)
* INV_SQRT_2_0POINT32_FIXED represents 2^-0.5
* to get from 2^-0.5 to 2^(LOG2_MAX_FACTOR+0.5) we must multiply by
  2^(LOG2_MAX_FACTOR+1)
* but to convert from 0.32 to 16.16 fixed point, we multiply by 2^(-32+16)
* so the total factor is 2^(LOG2_MAX_FACTOR+1-32+16)
* but this is < 1, so is actually a shift right by
  -LOG2_MAX_FACTOR-1+32-16, which can trivially be rearranged to match
  the macro definition.
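Putting numbers to the derivation above (a quick standalone check using the constants from cover-test.c): the total shift is 31 - 16 - LOG2_MAX_FACTOR = 12, and the resulting 16.16 value should be 2^3.5 to within rounding, i.e. its square should be almost exactly 128:

```c
#include <assert.h>
#include <stdint.h>

#define LOG2_MAX_FACTOR 3
#define INV_SQRT_2_0POINT32_FIXED (0xB504F334u)

/* 0.32 -> 16.16 fixed point, times a gain of 2^(LOG2_MAX_FACTOR+1):
 * net shift right by -LOG2_MAX_FACTOR - 1 + 32 - 16 = 12 bits. */
#define MAX_INC ((int32_t) (INV_SQRT_2_0POINT32_FIXED >> (31 - 16 - LOG2_MAX_FACTOR)))
```

0xB504F334 >> 12 = 0xB504F, i.e. about 11.3137 in 16.16 fixed point, which is 2^3.5 as expected.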


+static pixman_image_t *
+create_src_image (pixman_format_code_t fmt)
+{

[...]

+prng_randmemset (tmp_img->bits.bits,
+ tmp_img->bits.rowstride * DST_HEIGHT * sizeof (uint32_t), 0);


DST_HEIGHT? Should that not be SRC_HEIGHT?


I think you're right, yes - looks like careless copy-and-paste from the
code to initialise the destination image. That means part of the image
was uninitialised, yet it never showed up as a difference on subsequent
runs. The checksums will need updating too. I'll repost shortly.


+static void
+check_transform (pixman_image_t *dst_img,
+ pixman_image_t *src_img,
+ pixman_transform_t *transform,
+ pixman_bool_t   bilinear)
+{

[...]

+if (bilinear)
+{
+assert (v1.vector[0] >= pixman_fixed_1 / 2);
+assert (v1.vector[1] >= pixman_fixed_1 / 2);
+assert (v2.vector[0] <= pixman_int_to_fixed (src_img->bits.width) -
+pixman_fixed_1 / 2);
+assert (v2.vector[1] <= pixman_int_to_fixed (src_img->bits.height) -
+pixman_fixed_1 / 2);


Since we had that discussion about bilinear filtering sampling pixels at
n=0,1 instead of n=-1,0, should the last two asserts not have < instead
of <=?


No. calc_translate() specifically ensures that (depending upon low_align)
either the lower or upper coordinate exactly corresponds to the lowest or
highest value that contains no non-zero-weighted contribution from any
pixel beyond the source image (and the image sizes are chosen so that the
opposite coordinate should always fall within the source image too, at
the largest possible increment). These are in fact the very cases that
are most likely to trigger a fast path or fetcher to perform an
out-of-bounds access, so they were deliberately included.


I mean, wouldn't sampling for x=width-0.5 cause reads on pixels
n=width-1,width, where the latter would be out of bounds (even if
weight zero)?


My recent 7-patch series deals with precisely this case:
http://lists.freedesktop.org/archives/pixman/2015-August/003878.html

Before this patch series, the conditions under which the
FAST_PATH_SAMPLES_COVER_CLIP_BILINEAR flag is set mean that this case
never gets passed to a "cover" fast path, even though such fast paths can
actually handle them by at least these three methods:

* check the weight and don't load the upper sample if it's zero (it turns
  out this can be achieved for free with ARM fetchers)
* invert the increments so that when x=width-0.5 it reads samples
  n=width-2,width-1 and applies a weight of 0 to the former (I used this
  for pixman-fast-path.c)
* detect the case in the C binding function and handle it separately (I
  used this for the existing armv7/mips/sse2 fast paths)
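The first of those methods can be sketched in a few lines of C. This is a hypothetical illustration, not Pixman's actual code: `bilinear_taps` fetches the two horizontal taps for integer position x0 with right-hand weight w, and skips the load of the pixel that may lie one past the end of the row whenever its weight is zero:

```c
#include <assert.h>
#include <stdint.h>

/* When w == 0, pixel x0+1 contributes nothing, so its load can be
 * skipped entirely - this is what makes the cover path safe when the
 * rightmost coordinate coincides exactly with the last source pixel. */
static uint32_t bilinear_taps (const uint32_t *row, int x0, int w)
{
    uint32_t left  = row[x0];
    uint32_t right = w ? row[x0 + 1] : 0;
    return (left << 8) + (uint32_t) w * (right - left);
}
```

The `w ? ... : 0` test is the per-pixel cost the discussion below is about; on ARM it collapses into a conditionally executed load.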

The fuzz factor added into the non-EXACT case means that even with the
old rules for the cover clip flag, we should end up exercising both cover
fast paths and repeat fast paths at least some of the time, right up to
the limits of their applicability.

Ben
___
Pixman mailing list
Pixman@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/pixman


Re: [Pixman] [PATCH v2 0/3] Add fence_image_create_bits() test helper

2015-08-28 Thread Ben Avison

On Fri, 28 Aug 2015 12:53:35 +0100, Pekka Paalanen ppaala...@gmail.com wrote:

This is v2, addressing the review comments from the last time:
http://lists.freedesktop.org/archives/pixman/2015-May/003643.html


The changes all look good to me.

One minor thing: the self-test spews a lot of text to my console when I'm
running make check, whereas none of the other tests do in normal
functioning (save for an occasional message about the final CRC for some
tests). I've a feeling this might be something to do with the version of
autotools I'm running, but if so it might be nice to at least record this
in a comment somewhere?

Ben
___
Pixman mailing list
Pixman@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/pixman


Re: [Pixman] Selectively disabling fast paths

2015-08-28 Thread Ben Avison

On Fri, 28 Aug 2015 13:43:06 +0100, Pekka Paalanen ppaala...@gmail.com wrote:


On Thu, 27 Aug 2015 17:20:26 +0100
Ben Avison bavi...@riscosopen.org wrote:

One thing it wouldn't be able to detect, though, would be where the fetch/
combine/writeback iterators are faster than fast paths for the *same*
implementation level - such as with the ARMv6 nearest-scaled patches I
was revisiting recently. In that specific case, it turned out that my
original solution of bespoke C wrappers for the fetchers turned out to be
even faster - but we don't have any way at present of detecting if there
are other cases where we would be better off deleting the fast paths and
letting the iterators do the work instead.


Sorry, but I'm a bit hazy on the details here. Based on the
discussions, I have developed the following mental model:

1. asm fast paths (whole operation)
2. C fast paths (whole operation)
3. _general_composite_rect (fetch/combine/writeback; iterators)
   - asm implementation or
   - C implementation for each


Yes, that's pretty much it, except that some platforms have multiple
levels of asm fast paths, and some or all of those will be enabled
depending upon the CPU features detected via a combination of compile-
time and runtime tests.

Basically, there is a chain of pixman_implementation_t structs, in
decreasing priority order (that's why you'll see the name
implementation used to refer to a set of routines tuned for a
particular instruction set). Each implementation contains a table of
pixman_fast_path_t structs (which we refer to as fast paths) and a
table of pixman_iter_info_t structs (the fetcher and writeback iterators)
and an array of combiner routines and a few other bits and pieces.

For example, on an ARMv7 platform, you'll normally find the following
implementations are enabled, in decreasing priority order:

pixman-noop.c (can't be disabled)
pixman-arm-neon.c (unless PIXMAN_DISABLE contains arm-neon)
pixman-arm-simd.c (unless PIXMAN_DISABLE contains arm-simd)
pixman-fast-path.c (unless PIXMAN_DISABLE contains fast)
pixman-general.c (can't be disabled; also references last-resort
   functions in pixman-bits-image.c / pixman-*-gradient.c /
   pixman-combine32.c / pixman-combine-float.c)

When you call pixman_image_composite(), it scans through the fast paths
from each implementation in order, looking for one which matches the
criteria in the fast path tables. pixman-general.c contains a single fast
path, which is universally applicable, and therefore handles anything
that wasn't caught by higher implementations - and it uses the function
general_composite_rect(). In turn, general_composite_rect scans the
implementations in order, looking for fetchers, combiners and writeback
function which will allow it to perform the requested operation line by
line, stage by stage.
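In outline, the fast-path dispatch works like this (a toy model for illustration only - real Pixman matches on operator, formats and flags rather than a single integer, and its structs look nothing like these):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Walk the implementation chain in decreasing priority order and take
 * the first fast path that matches; the "general" implementation at the
 * bottom matches everything, so the walk always terminates with a hit. */
typedef struct imp {
    const char  *name;
    int        (*match) (int op);   /* stand-in for the fast-path table */
    struct imp  *fallback;
} imp_t;

static const char *
dispatch (imp_t *chain, int op)
{
    imp_t *i;
    for (i = chain; i; i = i->fallback)
        if (i->match (op))
            return i->name;
    return NULL;
}

static int neon_match (int op)    { return op == 1; }
static int general_match (int op) { return 1; }  /* universal fast path */

static imp_t general = { "general",  general_match, NULL };
static imp_t neon    = { "arm-neon", neon_match,    &general };
```

Disabling an implementation via PIXMAN_DISABLE amounts to unlinking one node from this chain, which removes its fast paths and its iterators/combiners in one go.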

When you set PIXMAN_DISABLE, you knock out the whole of an
implementation, both its fast paths and its iterators/combiners.

The point I was trying to make (badly, it seems) is that iterators/
combiners are relatively widely applicable, and are chosen at lower
priority than all the fast paths, but because they were developed
relatively recently, many of the fast paths have never had their
performance compared against the iterators/combiners to see if their
inclusion is perhaps no longer warranted since the iterators/combiners
were added.


Maybe we could fix that by introducing a PIXMAN_DISABLE=wholeop or
similar, that would disable all whole operation fast paths, but leave
the iterator paths untouched?

Should I do that, would it be worth it?


It could probably be done in _pixman_implementation_create(), as long as
_pixman_implementation_create_general() explicitly initialises
imp->fast_paths so that at least general_composite_rect() always ends up
on the chain of fast paths.

Ben
___
Pixman mailing list
Pixman@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/pixman


Re: [Pixman] [PATCH 5/7] armv7/mips/sse2: Fix bounds violations in bilinear cover scaled fast paths

2015-08-27 Thread Ben Avison

On Thu, 27 Aug 2015 04:44:58 +0100, Siarhei Siamashka 
siarhei.siamas...@gmail.com wrote:

On Mon, 24 Aug 2015 21:42:04 +0100
Ben Avison bavi...@riscosopen.org wrote:

The current COVER flag assumes that every pixel is fetched from the
source image as a 2x2 block (regardless of whether the fractional part
of the offset is zero or not) and this is explicitly expected to be
safe.


The situation isn't quite that simple though. With the bilinear fast
paths defined using FAST_BILINEAR_MAINLOOP_INT(), the height of the block
is already set to 1 when the fractional part is 0 - albeit as a side
effect of not being able to express a fixed-point weight of 1.0 in a
16-bit value:

http://cgit.freedesktop.org/pixman/tree/pixman/pixman-inlines.h?id=pixman-0.33.2#n947

My ARMv6 bilinear fetchers (the ones still waiting to be committed)
already contained an optimisation to avoid processing a scanline whose
vertical weight was 0. Patch 6 of this series is a tweak to
fast_fetch_bilinear_cover() so it skips such scanlines, and
ssse3_fetch_bilinear_cover() would benefit from the same optimisation.

In the horizontal direction, it was actually Søren's post (accidentally
or otherwise) which first prompted me to see there's an issue: if you're
going to say that you always fetch two source pixels even when the
coordinate is exactly aligned to a single source pixel, then who's to say
whether the zero-weighted pixel is the one to the right of the true pixel
rather than the one to its left? In practice, FAST_BILINEAR_MAINLOOP_INT
chooses the one on the right, and this happens to match your suggested
change to the calculation of the FAST_PATH_SAMPLES_COVER_CLIP_BILINEAR,
where we simply remove the 8 * pixman_fixed_e border. As a reminder of
what I wrote earlier, this test corresponds to:

transformed.x1 >= pixman_fixed_1 / 2
transformed.y1 >= pixman_fixed_1 / 2
transformed.x2 < image->bits.width * pixman_fixed_1 - pixman_fixed_1 / 2
transformed.y2 < image->bits.height * pixman_fixed_1 - pixman_fixed_1 / 2

If we assume that zero-weighted pixels to the left are always loaded,
these would need to be the tests instead:

transformed.x1 > pixman_fixed_1 / 2
transformed.y1 > pixman_fixed_1 / 2
transformed.x2 <= image->bits.width * pixman_fixed_1 - pixman_fixed_1 / 2
transformed.y2 <= image->bits.height * pixman_fixed_1 - pixman_fixed_1 / 2
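The two sets of tests can be written as predicates over the transformed extents. This is a sketch (the function names are hypothetical; `t` is the source-space bounding box in 16.16 fixed point, and width/height stand for image->bits.width/height):

```c
#include <assert.h>
#include <stdint.h>

typedef int32_t pixman_fixed_t;
#define pixman_fixed_1 ((pixman_fixed_t) 1 << 16)

typedef struct { pixman_fixed_t x1, y1, x2, y2; } box_t;

/* Cover condition when the zero-weighted tap is loaded from the RIGHT:
 * the top/left edge may touch 0.5 exactly, but the bottom/right edge
 * must stay strictly inside width - 0.5 (else pixel 'width' is read). */
static int cover_right_taps (box_t t, int width, int height)
{
    return t.x1 >= pixman_fixed_1 / 2 &&
           t.y1 >= pixman_fixed_1 / 2 &&
           t.x2 <  width  * pixman_fixed_1 - pixman_fixed_1 / 2 &&
           t.y2 <  height * pixman_fixed_1 - pixman_fixed_1 / 2;
}

/* Mirror image when the zero-weighted tap is loaded from the LEFT. */
static int cover_left_taps (box_t t, int width, int height)
{
    return t.x1 >  pixman_fixed_1 / 2 &&
           t.y1 >  pixman_fixed_1 / 2 &&
           t.x2 <= width  * pixman_fixed_1 - pixman_fixed_1 / 2 &&
           t.y2 <= height * pixman_fixed_1 - pixman_fixed_1 / 2;
}
```

Note that a box touching 0.5 on the left and width - 0.5 on the right satisfies neither predicate, which is exactly why picking the left-tap convention (strict only at the left edge) makes the per-row check cheap.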

Søren talked about samples with index -1 being fetched - in other words,
the zero-weighted pixel being the one to the left of the true one. This
actually turns out to be more efficient, because to avoid out-of-bounds
accesses when transformed.x1 == pixman_fixed_1 / 2, you only need to test
the first pixel on each row, which can be moved outside the loop. Any
later pixels on the same row which might fetch a zero-weighted pixel to
the left, you at least know that the zero-weighted pixel is within the
pixel row. If the zero-weighted pixel were to be to the right, you'd need
to test every single pixel, which would normally be a significant speed
hit.

It turns out to be surprisingly simple to cause the zero-weighted pixel
to be on the left - just negate the X increments, as demonstrated by
patch 6.

Things work even more smoothly with my ARMv6 bilinear fetchers (though I
haven't reposted them yet). I was able to take advantage of ARM
conditional execution to avoid reading zero-weighted pixels with zero
cycles overhead (and in fact some speed gain due to fewer cachelines
being fetched from RAM). The only necessary change to the core code:

   adds   accumulator, accumulator, increment_fractional
   ldrcs  in0, [source_ptr, #increment_integer + 1]!
   ldrcc  in0, [source_ptr, #increment_integer]!
   ldr    in1, [source_ptr, #1]

is to change the last ldr to an ldrne.
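The carry-based stepping in the first two instructions can be modelled in C (a sketch under the assumption that unsigned overflow of the add plays the role of the ARM carry flag, which is what `adds` followed by `ldrcs`/`ldrcc` exploits):

```c
#include <assert.h>
#include <stdint.h>

/* Advance the source pointer by the integer part of the increment,
 * plus one more pixel whenever the fractional accumulator overflows -
 * the C analogue of 'adds' setting carry and 'ldrcs' taking the
 * incremented address. Returns the number of pixels to advance. */
static int step_source (uint32_t *accumulator, uint32_t frac_inc, int int_inc)
{
    uint32_t before = *accumulator;
    *accumulator += frac_inc;
    return int_inc + (*accumulator < before);  /* carry out of the add */
}
```

Keeping the fractional part in a full register like this is what lets the zero-weight test (`ldrne`) come for free: the accumulator being exactly zero after the add is the weight-zero case.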

This is actually an exception to what I wrote above ("If the zero-weighted
pixel were to be to the right, you'd need to test every single pixel,
which would normally be a significant speed hit"). But since conditional
execution of arbitrary instructions is pretty unique to ARM, it makes
sense to use the negated-increment method for the generic C
implementation.


However with your new proposed definition of the COVER_CLIP_BILINEAR
flag, the area is shrinked by pixman_fixed_e on the right side in


(enlarged by pixman_fixed_e, but I think that's what you meant)


order to allow a special corner case to pass through. And this is where
the bounds violations are likely coming from.

[...]

Is there actually a real bug in the
current pixman code? Because the commit summary looks kinda scary and
may be misinterpreted by the general public.


Don't worry, the existing Pixman code doesn't have any bounds violations.
In fact, there aren't any bounds violations at all, despite what the
commit summaries say.

To explain this: quite late on in development, I added the
FAST_PATH_SAMPLES_COVER_CLIP_BILINEAR_APPROX flag, as an insurance policy
against me not finding a volunteer to adapt the SSSE3 bilinear fetcher.
But then I realised that if I added the flag early on, I

Re: [Pixman] [PATCH 1/4] pixman-fast-path: Add over_n_8888 fast path (disabled)

2015-08-27 Thread Ben Avison

On Thu, 27 Aug 2015 10:59:52 +0100, Pekka Paalanen ppaala...@gmail.com wrote:

It would be *really* nice if we could somehow use a benchmark mode
where we could run an operation with every possible implementation and
compare them. I wonder, can we already do that with PIXMAN_DISABLE?


That would certainly help detect some issues, where we get worse
performance at supposedly more advanced implementation levels. Obviously
it wouldn't make any sense to bother with certain combinations, such as
PIXMAN_DISABLE=arm-neon when testing on an x86.

One thing it wouldn't be able to detect, though, would be where the fetch/
combine/writeback iterators are faster than fast paths for the *same*
implementation level - such as with the ARMv6 nearest-scaled patches I
was revisiting recently. In that specific case, it turned out that my
original solution of bespoke C wrappers for the fetchers turned out to be
even faster - but we don't have any way at present of detecting if there
are other cases where we would be better off deleting the fast paths and
letting the iterators do the work instead.

Ben
___
Pixman mailing list
Pixman@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/pixman


Re: [Pixman] [PATCH v2] scaling-test: list more details when verbose

2015-08-27 Thread Ben Avison

On Thu, 27 Aug 2015 12:16:49 +0100, Pekka Paalanen ppaala...@gmail.com wrote:


[Pekka: redo whitespace and print src,dst,mask x and y.]
Signed-off-by: Pekka Paalanen pekka.paala...@collabora.co.uk


Reviewed-by: Ben Avison bavi...@riscosopen.org
___
Pixman mailing list
Pixman@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/pixman


[Pixman] [PATCH] test: Add cover-test

2015-08-27 Thread Ben Avison
This test aims to verify both numerical correctness and the honouring of
array bounds for scaled plots (both nearest-neighbour and bilinear) at or
close to the boundary conditions for applicability of cover type fast paths
and iter fetch routines.

It has a secondary purpose: by setting the env var EXACT (to any value) it
will only test plots that are exactly on the boundary condition. This makes
it possible to ensure that cover routines are being used to the maximum,
although this requires the use of a debugger or code instrumentation to
verify.
---
 test/Makefile.sources |1 +
 test/cover-test.c |  424 +
 2 files changed, 425 insertions(+), 0 deletions(-)
 create mode 100644 test/cover-test.c

diff --git a/test/Makefile.sources b/test/Makefile.sources
index 5b9ae84..aece143 100644
--- a/test/Makefile.sources
+++ b/test/Makefile.sources
@@ -27,6 +27,7 @@ TESTPROGRAMS =  \
glyph-test\
solid-test\
stress-test   \
+   cover-test\
blitters-test \
affine-test   \
scaling-test  \
diff --git a/test/cover-test.c b/test/cover-test.c
new file mode 100644
index 000..dbdf0f8
--- /dev/null
+++ b/test/cover-test.c
@@ -0,0 +1,424 @@
+/*
+ * Copyright © 2015 RISC OS Open Ltd
+ *
+ * Permission to use, copy, modify, distribute, and sell this software and its
+ * documentation for any purpose is hereby granted without fee, provided that
+ * the above copyright notice appear in all copies and that both that
+ * copyright notice and this permission notice appear in supporting
+ * documentation, and that the name of the copyright holders not be used in
+ * advertising or publicity pertaining to distribution of the software without
+ * specific, written prior permission.  The copyright holders make no
+ * representations about the suitability of this software for any purpose.  It
+ * is provided "as is" without express or implied warranty.
+ *
+ * THE COPYRIGHT HOLDERS DISCLAIM ALL WARRANTIES WITH REGARD TO THIS
+ * SOFTWARE, INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND
+ * FITNESS, IN NO EVENT SHALL THE COPYRIGHT HOLDERS BE LIABLE FOR ANY
+ * SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
+ * WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN
+ * AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING
+ * OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS
+ * SOFTWARE.
+ *
+ * Author:  Ben Avison (bavi...@riscosopen.org)
+ *
+ */
+
+/*
+ * This test aims to verify both numerical correctness and the honouring of
+ * array bounds for scaled plots (both nearest-neighbour and bilinear) at or
+ * close to the boundary conditions for applicability of cover type fast paths
+ * and iter fetch routines.
+ *
+ * It has a secondary purpose: by setting the env var EXACT (to any value) it
+ * will only test plots that are exactly on the boundary condition. This makes
+ * it possible to ensure that cover routines are being used to the maximum,
+ * although this requires the use of a debugger or code instrumentation to
+ * verify.
+ */
+
+#include "utils.h"
+#include <stdlib.h>
+#include <stdio.h>
+
+/* Approximate limits for random scale factor generation - these ensure we can
+ * get at least 8x reduction and 8x enlargement.
+ */
+#define LOG2_MAX_FACTOR (3)
+
+/* 1/sqrt(2) (or sqrt(0.5), or 2^-0.5) as a 0.32 fixed-point number */
+#define INV_SQRT_2_0POINT32_FIXED (0xB504F334u)
+
+/* The largest increment that can be generated by random_scale_factor().
+ * This occurs when the mantissa part is 0x and the exponent
+ * part is -LOG2_MAX_FACTOR.
+ */
+#define MAX_INC ((pixman_fixed_t)(INV_SQRT_2_0POINT32_FIXED >> (31 - 16 - LOG2_MAX_FACTOR)))
+
+/* Minimum source width (in pixels) based on a typical page size of 4K and
+ * maximum colour depth of 32bpp.
+ */
+#define MIN_SRC_WIDTH (4096 / 4)
+
+/* Derive the destination width so that at max increment we fit within source */
+#define DST_WIDTH (MIN_SRC_WIDTH * pixman_fixed_1 / MAX_INC)
+
+/* Calculate heights the other way round - no limits due to page alignment here */
+#define DST_HEIGHT 3
+#define SRC_HEIGHT ((DST_HEIGHT * MAX_INC + pixman_fixed_1 - 1) / pixman_fixed_1)
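As a sanity check on the arithmetic behind these defines, the constants can be reproduced standalone (a sketch using the same values; pixman_fixed_t is re-declared locally rather than taken from pixman.h):

```c
#include <stdint.h>

typedef int32_t pixman_fixed_t;
#define pixman_fixed_1 (1 << 16)

#define LOG2_MAX_FACTOR (3)
#define INV_SQRT_2_0POINT32_FIXED (0xB504F334u)

/* Largest increment random_scale_factor() can produce: ~2^3.5 in 16.16,
 * i.e. 741455 / 65536 ~= 11.3137 */
#define MAX_INC ((pixman_fixed_t)(INV_SQRT_2_0POINT32_FIXED >> (31 - 16 - LOG2_MAX_FACTOR)))

#define MIN_SRC_WIDTH (4096 / 4)

/* By construction, DST_WIDTH * MAX_INC <= MIN_SRC_WIDTH * pixman_fixed_1,
 * so even at the largest increment the plot stays inside the source row. */
#define DST_WIDTH (MIN_SRC_WIDTH * pixman_fixed_1 / MAX_INC)

#define DST_HEIGHT 3
#define SRC_HEIGHT ((DST_HEIGHT * MAX_INC + pixman_fixed_1 - 1) / pixman_fixed_1)
```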
+
+/* At the time of writing, all the scaled fast paths use SRC, OVER or ADD
+ * Porter-Duff operators. XOR is included in the list to ensure good
+ * representation of iter scanline fetch routines.
+ */
+static const pixman_op_t op_list[] = {
+PIXMAN_OP_SRC,
+PIXMAN_OP_OVER,
+PIXMAN_OP_ADD,
+PIXMAN_OP_XOR,
+};
+
+/* At the time of writing, all the scaled fast paths use a8r8g8b8, x8r8g8b8
+ * or r5g6b5, or red-blue swapped versions of the same. When a mask channel is
+ * used, it is always a8 (and so implicitly not component alpha). a1r5g5b5 is
+ * included because it is the only other

Re: [Pixman] [PATCH 1/7] Refactor calculation of cover flags

2015-08-27 Thread Ben Avison

On Thu, 27 Aug 2015 03:55:07 +0100, Siarhei Siamashka 
siarhei.siamas...@gmail.com wrote:

-/* Expand the source area by a tiny bit so account of different rounding 
that
- * may happen during sampling. Note that (8 * pixman_fixed_e) is very far 
from
- * 0.5 so this won't cause the area computed to be overly pessimistic.
- */
-transformed.x1 -= 8 * pixman_fixed_e;
-transformed.y1 -= 8 * pixman_fixed_e;
-transformed.x2 += 8 * pixman_fixed_e;
-transformed.y2 += 8 * pixman_fixed_e;


Thanks for spotting this! I think that this code was used to compensate
matrix multiplication accuracy problems in older versions of pixman.
But we have already fixed these accuracy problems some time ago:

[...]

Now we can drop this 8 * pixman_fixed_e adjustment because there
is no accuracy loss in the matrix multiplication for affine
transformations.


Ah, thank you! I couldn't work out what rounding it was that the comment
was referring to, and it seemed to have been quite deliberate. Søren
could only recall the related issue with bilinear scalers reading pixel
pairs where one source pixel might be read from outside the source array
if its weight was going to be 0.


And it is probably better to just do this simple fix
with a single patch. There is no need to spread it across multiple
patches.



Just as you do it in
http://lists.freedesktop.org/archives/pixman/2015-August/003877.html
this part becomes:

if (pixman_fixed_to_int (transformed.x1 - pixman_fixed_e) >= 0 &&
    pixman_fixed_to_int (transformed.y1 - pixman_fixed_e) >= 0 &&
    pixman_fixed_to_int (transformed.x2 - pixman_fixed_e) < image->bits.width &&
    pixman_fixed_to_int (transformed.y2 - pixman_fixed_e) < image->bits.height)


Glad you agree about the missing pixman_fixed_e offset, which was
disguised by the old 8 * pixman_fixed_e enlargement.

I originally wrote this stuff as a single patch, but I got the impression
that it was too much to digest in one go for most people, hence why I
split it up. Perhaps the compromise is to go for two patches, one to deal
with the 8 * pixman_fixed_e issue, and one to deal with bilinear scaling
with zero-weight pixels just beyond the edges of the image.

For those following the other thread where I'm describing the operation
of the cover-test program, you may want to note that if you multiply both
sides of the above inequalities by pixman_fixed_1, you get

transformed.x1 - pixman_fixed_e >= 0
transformed.y1 - pixman_fixed_e >= 0
transformed.x2 - pixman_fixed_e < image->bits.width * pixman_fixed_1
transformed.y2 - pixman_fixed_e < image->bits.height * pixman_fixed_1

which is equivalent to

transformed.x1 >= pixman_fixed_e
transformed.y1 >= pixman_fixed_e
transformed.x2 <= image->bits.width * pixman_fixed_1
transformed.y2 <= image->bits.height * pixman_fixed_1

which lines up nicely with my description of which coordinates correspond
to which source pixels.


And [the bilinear case] does not need any changes. At least as far as
dropping 8 * pixman_fixed_e is concerned.


Reworking these in the same way, you get

transformed.x1 - pixman_fixed_1 / 2 >= 0
transformed.y1 - pixman_fixed_1 / 2 >= 0
transformed.x2 + pixman_fixed_1 / 2 < image->bits.width * pixman_fixed_1
transformed.y2 + pixman_fixed_1 / 2 < image->bits.height * pixman_fixed_1

transformed.x1 >= pixman_fixed_1 / 2
transformed.y1 >= pixman_fixed_1 / 2
transformed.x2 < image->bits.width * pixman_fixed_1 - pixman_fixed_1 / 2
transformed.y2 < image->bits.height * pixman_fixed_1 - pixman_fixed_1 / 2

and then my remaining changes basically boil down to arguing that these
last two rows should be <= not <. That could sensibly be a separate patch.
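To make the two sets of boundary conditions concrete, here is a standalone sketch of the tests being argued for (names are illustrative, not Pixman's; transformed.* are the 16.16 extremes of the source-space bounding box, and the bilinear version uses the <= upper bounds this series argues for):

```c
#include <stdint.h>

typedef int32_t pixman_fixed_t;
#define pixman_fixed_e ((pixman_fixed_t) 1)
#define pixman_fixed_1 (1 << 16)

typedef struct { pixman_fixed_t x1, y1, x2, y2; } sample_box_t; /* illustrative */

/* Nearest: coordinate x selects pixel floor(x - e), so the usable range of
 * source-space coordinates is [e, width * pixman_fixed_1] inclusive. */
static int
covers_nearest (sample_box_t t, int width, int height)
{
    return t.x1 >= pixman_fixed_e &&
           t.y1 >= pixman_fixed_e &&
           t.x2 <= width  * pixman_fixed_1 &&
           t.y2 <= height * pixman_fixed_1;
}

/* Bilinear, with the <= form argued for above: at exactly 1/2 or
 * width - 1/2, the out-of-bounds tap has zero weight, so only in-bounds
 * pixels contribute. */
static int
covers_bilinear (sample_box_t t, int width, int height)
{
    return t.x1 >= pixman_fixed_1 / 2 &&
           t.y1 >= pixman_fixed_1 / 2 &&
           t.x2 <= width  * pixman_fixed_1 - pixman_fixed_1 / 2 &&
           t.y2 <= height * pixman_fixed_1 - pixman_fixed_1 / 2;
}
```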


I believe that neither of FAST_PATH_SAMPLES_COVER_CLIP_NEAREST and
FAST_PATH_SAMPLES_COVER_CLIP_BILINEAR flags matters for projective
transformations, but this can be additionally checked just in case.


I did survey all the existing fast paths and fetchers and discovered that
both the cover flags were only being used for affine transforms when I
was trying to work out what the 8 * pixman_fixed_e rounding might have
been referring to.

By check, do you mean we should ensure that the flags are not set for
projective transforms? If so, I see that compute_image_info() is called
before analyze_extent() so it could be achieved by adding an extra test
in analyze_extent() that FAST_PATH_AFFINE_TRANSFORM is set before setting
the cover flags.

Ben
___
Pixman mailing list
Pixman@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/pixman


Re: [Pixman] [PATCH] test: Add cover-test

2015-08-26 Thread Ben Avison

On Thu, 04 Jun 2015 14:00:30 +0100, Pekka Paalanen ppaala...@gmail.com wrote:


On Tue, 26 May 2015 23:58:30 +0100
Ben Avison bavi...@riscosopen.org wrote:


+#define SQRT_0POINT5_0POINT32_FIXED (0xB504F334u)


Ok, I checked. Mathematically it is equivalent, but maybe a slightly
more obvious name would be INV_SQRT_2_...?


Fair enough, renamed (will repost when I've resolved everything else
below).


+#define MAX_INC (SQRT_0POINT5_0POINT32_FIXED >> (31 - 16 - LOG2_MAX_FACTOR))


This evaluates to 11.314 decimal I think. It does match 2^3.5 with an
error in the order of 3e-6, is what Octave tells me. log2(abs(error)) =
-18.321, so if I'm thinking right, it means around 18 fractional bits
being correct, which means pixman_fixed_t should be accurate to the
last bit? I suppose it should be obvious from the integer math, but I
just checked with Octave.


It's basically just the largest random number that can be generated by
random_scale_factor(), in other words if you start out with the largest
number and shift it right by the smallest number of places. I hope that
makes it clearer where it comes from - I'll update the comment.


MAX_INC is a pixman_fixed_t value, right? Could use a comment for that,
or making it a static const variable.


Yes. I've a feeling some compilers might force a load and runtime
calculation of expressions if you use static const, so I'll stick a cast
in the definition (being careful because pixman_fixed_t is signed but we
need to use an unsigned shift to generate the number).


I think it would be valuable to explain these high level goals in a
comment at the top, even when they are in the commit message.


I've duplicated the commit message in a comment.


+static pixman_fixed_t
+random_scale_factor(void)

[...]

Out of curiosity and because it's hard to reason about these things, I
did an experiment, and generated 10 million samples with this function.
The resulting distribution in log2 space is here:

https://people.collabora.com/~pq/pixman/random_scale_factor.png

I think it is close to uniform enough, right? And covers the expected
range.


Nice diagram. For these purposes, I'd say it was close enough to uniform;
no particular value is more than 2x more or less likely than any other.
Without the shift step in the random number generator, what you'd have
instead would be a ramp with one end being something like 128x more
likely than the other. You'd also get effects such as the probability of
a plot which was an enlargement in both X and Y being only 1 in 64, even
though that's quite a likely outcome in real life; this way it's 1 in 4.
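For readers without the patch to hand, the shape of the generator can be sketched like this (not the test's exact code: rand() stands in for Pixman's prng, and the name is mine). A random 0.32 mantissa in [0.5, 1) is scaled by 1/sqrt(2), putting it in [2^-1.5, 2^-0.5), then a random exponent in [-LOG2_MAX_FACTOR, +LOG2_MAX_FACTOR] shifts it into place as a 16.16 value, giving a roughly log-uniform scale factor in [2^-3.5, 2^3.5]:

```c
#include <stdint.h>
#include <stdlib.h>

#define LOG2_MAX_FACTOR           3
#define INV_SQRT_2_0POINT32_FIXED 0xB504F334u

static int32_t
random_scale_factor_sketch (void)
{
    /* 0.32 fixed-point mantissa with the top bit forced: [0.5, 1) */
    uint32_t mant = ((uint32_t) rand () << 16 ^ (uint32_t) rand ()) | 0x80000000u;
    /* exponent uniform in -LOG2_MAX_FACTOR .. +LOG2_MAX_FACTOR */
    int      exp  = rand () % (2 * LOG2_MAX_FACTOR + 1) - LOG2_MAX_FACTOR;

    /* scale by 1/sqrt(2): mantissa now in [2^-1.5, 2^-0.5) at 0.32 scale */
    mant = (uint32_t) (((uint64_t) mant * INV_SQRT_2_0POINT32_FIXED) >> 32);

    /* convert to 16.16 and apply the exponent (shift is 12..18 places) */
    return (int32_t) (mant >> (31 - 16 - exp));
}
```

The maximum output (largest mantissa, smallest shift) is exactly the MAX_INC discussed above.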


+static pixman_fixed_t
+calc_translate (intdst_size,
+intsrc_size,
+pixman_fixed_t scale,
+pixman_bool_t  low_align,
+pixman_bool_t  bilinear)
+{
+pixman_fixed_t ref_src, ref_dst, scaled_dst;
+
+if (low_align)
+{
+ref_src = bilinear ? pixman_fixed_1 / 2 : pixman_fixed_e;


Why the bilinear case is not pixman_fixed_1 / 2 + pixman_fixed_e?


At this point we're determining the translations required so that the
first destination pixel, when transformed into source space, has exactly
the lowest value it can have without straying into the repeat zones.

With bilinear scaling, a coordinate of pixman_fixed_1 / 2 has pixel value
determined only by source pixel 0. A coordinate of pixman_fixed_1 * 3 / 2
has pixel value determined only by source pixel 1. Any value in between
has to be assumed to contain elements from both pixels 0 and 1.

With nearest scaling, any coordinate between pixman_fixed_e and
pixman_fixed_1 inclusive is determined by source pixel 0. Yes, this is
slightly counter-intuitive, but it's the way Pixman has always done it.
See for example in bits_image_fetch_nearest_affine() in
pixman-fast-path.c where we have the lines

x0 = pixman_fixed_to_int (x - pixman_fixed_e);
y0 = pixman_fixed_to_int (y - pixman_fixed_e);

There is no equivalent offset by pixman_fixed_e for bilinear scaling, for
example in fast_bilinear_cover_iter_init() also in pixman-fast-path.c:

 info->x = v.vector[0] - pixman_fixed_1 / 2;
 info->y = v.vector[1] - pixman_fixed_1 / 2;

All the other implementations follow suit (they have to, or they wouldn't
pass the make check suite.)
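In other words, for nearest scaling a 16.16 coordinate in (n, n+1] selects source pixel n. As a one-function sketch of that convention (assuming non-negative coordinates):

```c
#include <stdint.h>

typedef int32_t pixman_fixed_t;
#define pixman_fixed_e ((pixman_fixed_t) 1)

/* Nearest-neighbour pixel selection as Pixman defines it: the index is
 * pixman_fixed_to_int (x - pixman_fixed_e), so coordinates in (n, n+1]
 * map to pixel n (assumes x >= pixman_fixed_e). */
static int
nearest_pixel (pixman_fixed_t x)
{
    return (x - pixman_fixed_e) >> 16;
}
```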


+scaled_dst = ((uint64_t) ref_dst * scale + pixman_fixed_1 / 2) / pixman_fixed_1;


ref_dst is 16.16 fp, scale is 16.16 fp, so the product is 32.32 fp, but
adding pixman_fixed_1/2 to that actually means +0.5/65536, not +0.5.
And the final division then just scales it back to .16 fp.

Basically it's adding 0.5 * pixman_fixed_e... for rounding to nearest
representable value?


Yes, the transform coefficients, source and destination coordinates are
all 16.16 fixed point, and round-to-nearest is how Pixman handles this in
all sorts of places, for example in pixman_transform_point_31_16_3d().
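The rounding idiom from the quoted line, in isolation (a sketch assuming non-negative operands): the 16.16 x 16.16 integer product is scaled by 2^32, and adding half the divisor (pixman_fixed_1 / 2) before the truncating divide rounds to the nearest representable 16.16 value:

```c
#include <stdint.h>

typedef int32_t pixman_fixed_t;
#define pixman_fixed_1 (1 << 16)

/* Multiply two 16.16 values, rounding to the nearest 16.16 result.
 * Adding pixman_fixed_1 / 2 (half the divisor) implements round-half-up. */
static pixman_fixed_t
fixed_mul_round (pixman_fixed_t a, pixman_fixed_t b)
{
    return (pixman_fixed_t) (((uint64_t) a * b + pixman_fixed_1 / 2) / pixman_fixed_1);
}
```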
It's worth noting that since we increment the destination coordinate

Re: [Pixman] [PATCH 1/4] pixman-fast-path: Add over_n_8888 fast path (disabled)

2015-08-26 Thread Ben Avison

On Wed, 26 Aug 2015 09:46:49 +0100, Pekka Paalanen ppaala...@gmail.com wrote:

It's clearly controversial to add C fast paths, because it affects all
targets that don't have an asm fast path for the same, and we cannot
tell by just review whether it is going to be for better or (much)
worse.


Yes, it's always going to be a risk changing the cross-platform
functions. Few developers are going to be able to test all the supported
platforms, so we're always going to need help checking performance.
Should that be a reason not to even try C fast paths though?

It would be good to get some understanding of why VMX appears not to
benefit from any sort of over_n_8888 fast path, considering that so many
other platforms have added one. The information might be useful in
designing any future cross-platform code.


Rpi should not be affected if we drop the C fast path patch,
there is still the asm fast path patch. Right?


Yes, any ARM targets (ARMv6 or ARMv7) would use the ARMv6 implementation
in preference to the C one.

Ben
___
Pixman mailing list
Pixman@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/pixman


Re: [Pixman] [PATCH 0/7] Cover scaling patches request for assistance

2015-08-26 Thread Ben Avison

On Wed, 26 Aug 2015 10:36:06 +0100, Pekka Paalanen ppaala...@gmail.com wrote:


On Mon, 24 Aug 2015 21:41:59 +0100
Ben Avison bavi...@riscosopen.org wrote:


Towards this goal, the following patches were posted to the list - and they 
seem to have escaped Patchwork's notice:

http://lists.freedesktop.org/archives/pixman/2015-May/003644.html
http://lists.freedesktop.org/archives/pixman/2015-May/003645.html
http://lists.freedesktop.org/archives/pixman/2015-May/003646.html


these three are marked as Changes requested, so they are on my plate,
waiting to be re-sent.


OK, I realise now I was confused by the default filter on Patchwork,
which hid all those patches. I see how to find them now.


http://lists.freedesktop.org/archives/pixman/2015-May/003678.html


This one is
http://patchwork.freedesktop.org/patch/50516/


That has actions on me, I see. Hmm, I seem to have forgotten about that
one too.


I was sort of waiting for a reply to
http://lists.freedesktop.org/archives/pixman/2015-June/003728.html
and then forgot and wandered off to vacation.


I think I was probably waiting for updates on the three related patches
of yours... deadlock!


Do I understand that right, that this series supersedes:
http://patchwork.freedesktop.org/patch/49937/
Hmm, is that the only one?


Yes, I've just updated its status in Patchwork, now I understand a bit
better how it works. It turned out to be a bit of a can of worms, that
patch - it was responsible for the cover-test program and all your
fence malloc stuff as well.

Ben
___
Pixman mailing list
Pixman@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/pixman


Re: [Pixman] [PATCH] Resolve implementation-defined behaviour for division rounded to -infinity

2015-08-26 Thread Ben Avison

On Wed, 26 Aug 2015 10:22:22 +0100, Siarhei Siamashka 
siarhei.siamas...@gmail.com wrote:


On Fri, 14 Aug 2015 15:06:07 +0100
Ben Avison bavi...@riscosopen.org wrote:


The previous implementations of DIV and MOD relied upon the built-in
/ and % operators performing round-to-zero. This is true for C99, but
rounding is implementation-defined for C89 when divisor and/or
dividend is negative


Do you have a practical example of an existing C89 compiler, which
differs from C99 when handling '/' and '%' operators?


No, but I'd have thought it was bad practice to assume C99 behaviour when
compiling C89. The absence of any discussion in the accompanying comments
made me wonder if the author was aware of the issue - it's certainly
relevant when the whole point of those macros is to force a particular
rounding direction when dealing with negative numbers.


My understanding is that C compilers just used to emit a single
division instruction as implemented by the target processor. This
provides the best performance, but is not perfectly portable in
theory. But in practice, after decades of evolution, all the
remaining (major) CPU architectures happen to handle integer
division rounding in the same way (round-to-zero). And C99 just
promoted the de-facto standard to the official standard status
(regardless of whatever was the official justification).


Don't forget not all architectures even have divide instructions - ARM
only just started regularly implementing them (at least for its A profile
CPUs) at the Cortex-A15. Typically it's then the job of a runtime support
function to do the division, and who's to say how that works?


I'm not saying that this patch is wrong, but how can we be sure that it
is doing the right thing? The new variant of this code still relies on /
and % operators (which are implementation-defined) and uses them with
negative numbers. A more in-depth explanation would be useful


OK, a bit of context. I was working on some iterator code (not yet
posted) which supported tiled repeat, which caused me to inspect the
repeat() function in pixman-inlines.h, which in turn uses the MOD macro.
This immediately set off alarm bells in my head, because I'd encountered
the woolly definition in C89 several years ago.

I confess that my approach to writing a correct algorithm was to hit
Google and adapt one of the first well-reasoned hits I encountered into
macro form, to let somebody else do the hard work. The page I used was

http://www.microhowto.info/howto/round_towards_minus_infinity_when_dividing_integers_in_c_or_c++.html

which cites a couple of other references.

As described on that page, the approach it uses will work irrespective of
the rounding rules employed by the built-in / and % operators.


If we are really worried about
rounding in general, should we review all the uses of / and %
operators in the code too? And for the sake of consistency introduce
new macro variants, which are rounding towards zero?


Arguably, maybe we should aspire to, yes. Whenever the dividend and
divisor are both positive, there's no ambiguity, and this should rule out
the vast majority of cases that don't already need round-to-minus-
infinity, I'd expect. Even if this isn't done immediately, it wouldn't
prevent us from getting the minus-infinity case right now.

Of course, there is an easy fix, which is to say that Pixman has to be
compiled in C99 mode, or at least to wrap different implementations in
#if __STDC_VERSION__ < 199901L.


There is also additional concern about the performance. Is the new
code slower/faster than the old one?


I hadn't really investigated that, but having had a bit of a play with
ARM GCC, I see that it fails to use the runtime function that returns
both the quotient and remainder (__aeabi_idivmod) with the operations in
macro form. I get more luck writing them as functions:

inline int
DIV (int a, int b)
{
    int q = a / b;
    int r = a - q * b;
    return q - (r < 0);
}

inline int
MOD (int a, int b)
{
    int r = a % b;
    if (r < 0)
        r += b;
    return r;
}

with the caveat that these are based on the macros from my 2015-08-18
post, which rely on b being positive. (Set aside for the moment whether
an inline function with an all-caps name is a good idea...)
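A quick check of the floor-division semantics these are meant to guarantee (definitions repeated so this compiles standalone; b must be positive, a may have either sign):

```c
/* DIV rounds towards minus infinity and MOD always lands in [0, b),
 * whichever way the built-in / and % round, provided b > 0. */
static int DIV (int a, int b)
{
    int q = a / b;
    int r = a - q * b;          /* remainder consistent with q */
    return q - (r < 0);         /* step down when truncation rounded up */
}

static int MOD (int a, int b)
{
    int r = a % b;
    if (r < 0)
        r += b;                 /* fold negative remainders into [0, b) */
    return r;
}
```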

Ben
___
Pixman mailing list
Pixman@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/pixman


Re: [Pixman] [PATCH 1/4] pixman-fast-path: Add over_n_8888 fast path (disabled)

2015-08-25 Thread Ben Avison

On Tue, 25 Aug 2015 13:45:48 +0100, Oded Gabbay oded.gab...@gmail.com wrote:

[exposing general_composite_rect]
I can't say that any cleaner solution has occurred to me since then.


I think the more immediate solution, as Soren have suggested on IRC,
is for me to implement the equivalent fast-path in VMX.
I see that it is already implemented in mmx, sse2, mips-dspr2 and
arm-neon. From looking at the C code, I'm guessing that it is fairly
simple to implement.


Yes, it's definitely one of the simpler fast paths, with only two
channels to worry about (source and destination) and with one of them
being a constant. I wrote an arm-simd version as well, to add to your
list - it's just that it's still waiting to be committed :)

I probably ought to get round to exposing general_composite_rect sooner
rather than later anyway - it's one of the few things from my mammoth
patch series last year that Søren commented on and which I haven't got
round to revising yet.


I just had a quick look at the VMX source file, and it has hardly any
iters defined. My guess would be that what's being used is

noop_init_solid_narrow() from pixman-noop.c
_pixman_iter_get_scanline_noop() from pixman-utils.c
combine_src_u() from pixman-combine32.c


I ran perf on lowlevel-blt-bench over_n_8888 and what I got is:

-   48.71%48.68%  lowlevel-blt-be  lowlevel-blt-bench  [.]
vmx_combine_over_u_no_mask
   - vmx_combine_over_u_no_mask


Sorry, my mistake - for some reason I must have thought we were dealing
with src_n_8888 rather than over_n_8888. If you can beat the C version
using a solid fetcher (which fills a temporary buffer the size of the row
with a constant pixel value) and an optimised OVER combiner, then you
should be able to do better still if you cut out the temporary buffer and
keep the solid colour in registers.
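For reference, the per-pixel work in an over_n_8888 path is tiny. Here is a hedged C sketch (not Pixman's actual code; mul_un8x4 is the usual rounded "x * a / 255" applied to all four bytes two at a time, and the names are illustrative):

```c
#include <stdint.h>

/* Per byte c: t = c * a + 0x80; result = (t + (t >> 8)) >> 8, which is the
 * correctly rounded c * a / 255, done on red/blue and alpha/green pairs. */
static uint32_t
mul_un8x4 (uint32_t x, uint8_t a)
{
    uint32_t rb = (x & 0x00ff00ffu) * a + 0x00800080u;
    uint32_t ag = ((x >> 8) & 0x00ff00ffu) * a + 0x00800080u;

    rb = ((rb + ((rb >> 8) & 0x00ff00ffu)) >> 8) & 0x00ff00ffu;
    ag = (ag + ((ag >> 8) & 0x00ff00ffu)) & 0xff00ff00u;
    return ag | rb;
}

/* OVER with a constant premultiplied a8r8g8b8 source kept in a register:
 * dst = src + dst * (255 - src_alpha) / 255, no temporary scanline buffer. */
static void
over_n_8888_row (uint32_t src, uint32_t *dst, int width)
{
    uint8_t inv_a = 255 - (src >> 24);
    int i;

    for (i = 0; i < width; i++)
        dst[i] = src + mul_un8x4 (dst[i], inv_a);
}
```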


Presumably for patch 3 of this series (over_n_0565) you wouldn't see
the same effect, as that can't be achieved using mempcy().


Where is that patch ? I didn't see it in the mailing list.


My bad again - in my mind, the patches for over_n_8888 and over_n_0565 in
C and ARMv6 assembly were a group of four and I overlooked the fact that
when Pekka split them in order to make the benchmarks more robust, he
only reposted the over_n_8888 ones. My original over_n_0565 patches are
here:

http://patchwork.freedesktop.org/patch/49902/
http://patchwork.freedesktop.org/patch/49903/

Ben
___
Pixman mailing list
Pixman@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/pixman


[Pixman] [PATCH 1/4] Expose general_composite_rect outside pixman-general.c

2015-08-25 Thread Ben Avison
---
 pixman/pixman-general.c |8 
 pixman/pixman-private.h |9 +
 2 files changed, 13 insertions(+), 4 deletions(-)

diff --git a/pixman/pixman-general.c b/pixman/pixman-general.c
index c8f272b..bb14b5e 100644
--- a/pixman/pixman-general.c
+++ b/pixman/pixman-general.c
@@ -123,9 +123,9 @@ operator_needs_division (pixman_op_t op)
 return needs_division[op];
 }
 
-static void
-general_composite_rect  (pixman_implementation_t *imp,
- pixman_composite_info_t *info)
+void
+_pixman_general_composite_rect  (pixman_implementation_t *imp,
+ pixman_composite_info_t *info)
 {
 PIXMAN_COMPOSITE_ARGS (info);
 uint8_t stack_scanline_buffer[3 * SCANLINE_BUFFER_LENGTH];
@@ -240,7 +240,7 @@ general_composite_rect  (pixman_implementation_t *imp,
 
 static const pixman_fast_path_t general_fast_path[] =
 {
-{ PIXMAN_OP_any, PIXMAN_any, 0, PIXMAN_any,0, PIXMAN_any, 0, 
general_composite_rect },
+{ PIXMAN_OP_any, PIXMAN_any, 0, PIXMAN_any,0, PIXMAN_any, 0, 
_pixman_general_composite_rect },
 { PIXMAN_OP_NONE }
 };
 
diff --git a/pixman/pixman-private.h b/pixman/pixman-private.h
index a8298ac..d2fbd53 100644
--- a/pixman/pixman-private.h
+++ b/pixman/pixman-private.h
@@ -573,6 +573,15 @@ _pixman_implementation_iter_init (pixman_implementation_t *imp,
                                   iter_flags_t   flags,
                                   uint32_t   image_flags);
 
+/* Provide a grubby back door to the fetcher / combiner system for
+ * use by implementations where we know the iterators are faster than
+ * pixman-fast-path.c (or more generally, any other lower-priority
+ * implementation in the chain) for a specific operation.
+ */
+void
+_pixman_general_composite_rect  (pixman_implementation_t *imp,
+ pixman_composite_info_t *info);
+
 /* Specific implementations */
 pixman_implementation_t *
 _pixman_implementation_create_general (void);
-- 
1.7.5.4

___
Pixman mailing list
Pixman@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/pixman


[Pixman] [PATCH 2/4] armv6: Add nearest-scaled-cover src_8888_8888 fast path

2015-08-25 Thread Ben Avison
This patch implements a shortcut so that the nearest scaled cover fetcher
iterator is used in preference to the fast path in pixman-fast-path.c, or
even the fast path defined using PIXMAN_ARM_BIND_SCALED_NEAREST_SRC_DST in
pixman-arm-simd.c. This is because the fetcher performs better than the
fast paths.

This is presented as an alternative to the patch I posted last year which
used macroised C wrappers around the fetcher, rather than reusing
general_composite_rect. Judging by the following benchmarks:

lowlevel-blt-bench -n src_8888_8888

   Before  Old patch  New patch   Change
  Mean StdDev Mean StdDevMean StdDevOld New
L1   116.8   4.15 72.1   1.2453.9   0.69   -38.3%  -53.8%
L235.9   2.26 43.8   1.4638.0   1.12   +22.1%   +6.1%
M 34.9   0.11 62.1   0.2947.7   0.25   +77.6%  +36.5%
HT18.4   0.14 26.4   0.3117.8   0.57   +43.7%   -3.2%
VT17.7   0.13 24.7   0.2317.2   0.50   +39.3%   -3.3%
R 16.9   0.07 23.4   0.2316.5   0.46   +38.4%   -2.7%
RT 8.0   0.09  9.2   0.28 5.8   0.23   +14.6%  -28.4%

I find it hard to advocate this patch, even though it is somewhat simpler.
---
 pixman/pixman-arm-common.h |9 +
 pixman/pixman-arm-simd.c   |7 +++
 2 files changed, 16 insertions(+), 0 deletions(-)

diff --git a/pixman/pixman-arm-common.h b/pixman/pixman-arm-common.h
index 748e933..f970868 100644
--- a/pixman/pixman-arm-common.h
+++ b/pixman/pixman-arm-common.h
@@ -494,6 +494,15 @@ cputype##_combine_##name##_u (pixman_implementation_t *imp,   \
                          FAST_PATH_X_UNIT_POSITIVE                    | \
                          FAST_PATH_Y_UNIT_ZERO)
 
+#define PIXMAN_ARM_NEAREST_SCALED_COVER_SRC_DST_FAST_PATH_VIA_ITER(op,s,d,func) \
+{   PIXMAN_OP_ ## op,                                                           \
+    PIXMAN_ ## s,                                                               \
+    PIXMAN_ARM_NEAREST_SCALED_COVER_FLAGS,                                      \
+    PIXMAN_null, 0,                                                             \
+    PIXMAN_ ## d, FAST_PATH_STD_DEST_FLAGS,                                     \
+    _pixman_general_composite_rect                                              \
+}
+
 /*/
 
 /* Support for scaled bilinear fetchers */
diff --git a/pixman/pixman-arm-simd.c b/pixman/pixman-arm-simd.c
index c37cec8..a72f9da 100644
--- a/pixman/pixman-arm-simd.c
+++ b/pixman/pixman-arm-simd.c
@@ -399,6 +399,13 @@ static const pixman_fast_path_t arm_simd_fast_paths[] =
 PIXMAN_STD_FAST_PATH_CA (OVER, solid, a8b8g8r8, a8b8g8r8, armv6_composite_over_n_8888_8888_ca),
 PIXMAN_STD_FAST_PATH_CA (OVER, solid, a8b8g8r8, x8b8g8r8, armv6_composite_over_n_8888_8888_ca),
 
+PIXMAN_ARM_NEAREST_SCALED_COVER_SRC_DST_FAST_PATH_VIA_ITER (SRC, a8r8g8b8, a8r8g8b8, src_8888_8888),
+PIXMAN_ARM_NEAREST_SCALED_COVER_SRC_DST_FAST_PATH_VIA_ITER (SRC, a8r8g8b8, x8r8g8b8, src_8888_8888),
+PIXMAN_ARM_NEAREST_SCALED_COVER_SRC_DST_FAST_PATH_VIA_ITER (SRC, x8r8g8b8, x8r8g8b8, src_8888_8888),
+PIXMAN_ARM_NEAREST_SCALED_COVER_SRC_DST_FAST_PATH_VIA_ITER (SRC, a8b8g8r8, a8b8g8r8, src_8888_8888),
+PIXMAN_ARM_NEAREST_SCALED_COVER_SRC_DST_FAST_PATH_VIA_ITER (SRC, a8b8g8r8, x8b8g8r8, src_8888_8888),
+PIXMAN_ARM_NEAREST_SCALED_COVER_SRC_DST_FAST_PATH_VIA_ITER (SRC, x8b8g8r8, x8b8g8r8, src_8888_8888),
+
 SIMPLE_NEAREST_FAST_PATH (SRC, r5g6b5, r5g6b5, armv6_0565_0565),
 SIMPLE_NEAREST_FAST_PATH (SRC, b5g6r5, b5g6r5, armv6_0565_0565),
 
-- 
1.7.5.4

___
Pixman mailing list
Pixman@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/pixman


[Pixman] [PATCH 4/4] armv6: Add nearest-scaled-cover src_0565_0565 fast path

2015-08-25 Thread Ben Avison
This is adapted from the nearest scaled cover scanline fetcher, modified to
pack output data in 16-bit units. This fetcher out-performs both the fast
path defined using PIXMAN_ARM_BIND_SCALED_NEAREST_SRC_DST in pixman-arm-simd.c
and the fast path in pixman-fast-path.c.

Since the two preceding patches no longer defined a macroised C wrapper we
can use, and general_composite_rect is no use to us here (we don't want to
do pixel format conversion twice) the C wrapper has been written out longhand.
Unsurprisingly, the results are similar to last year's version of the patch:

lowlevel-blt-bench -n src_0565_0565

   Before  Old patch  New patch   Change
  Mean StdDev Mean StdDevMean StdDevOld New
L1   118.6   3.12 71.0   1.3273.5   1.18   -40.1%  -38.0%
L242.1   0.73 52.6   2.4452.1   2.00   +25.1%  +23.7%
M 42.1   0.15 69.3   0.1069.3   0.15   +64.9%  +64.8%
HT24.4   0.35 29.2   0.3329.5   0.24   +19.4%  +20.9%
VT23.0   0.24 27.4   0.2927.7   0.35   +19.3%  +20.6%
R 20.8   0.20 25.3   0.3225.7   0.18   +21.4%  +23.2%
RT 9.1   0.25  9.3   0.24 9.7   0.24+1.7%   +6.7%
---
 pixman/pixman-arm-common.h  |9 +
 pixman/pixman-arm-simd-asm-scaled.S |4 ++
 pixman/pixman-arm-simd-asm-scaled.h |   69 ++-
 pixman/pixman-arm-simd.c|   35 ++
 4 files changed, 108 insertions(+), 9 deletions(-)

diff --git a/pixman/pixman-arm-common.h b/pixman/pixman-arm-common.h
index f970868..59190f0 100644
--- a/pixman/pixman-arm-common.h
+++ b/pixman/pixman-arm-common.h
@@ -494,6 +494,15 @@ cputype##_combine_##name##_u (pixman_implementation_t *imp,   \
                          FAST_PATH_X_UNIT_POSITIVE                    | \
                          FAST_PATH_Y_UNIT_ZERO)
 
+#define PIXMAN_ARM_NEAREST_SCALED_COVER_SRC_DST_FAST_PATH(cputype,op,s,d,func) \
+{   PIXMAN_OP_ ## op,                                                          \
+    PIXMAN_ ## s,                                                              \
+    PIXMAN_ARM_NEAREST_SCALED_COVER_FLAGS,                                     \
+    PIXMAN_null, 0,                                                            \
+    PIXMAN_ ## d, FAST_PATH_STD_DEST_FLAGS,                                    \
+    cputype ## _composite_nearest_scaled_cover_ ## func                        \
+}
+
 #define PIXMAN_ARM_NEAREST_SCALED_COVER_SRC_DST_FAST_PATH_VIA_ITER(op,s,d,func) \
 {   PIXMAN_OP_ ## op,                                                           \
     PIXMAN_ ## s,                                                               \
diff --git a/pixman/pixman-arm-simd-asm-scaled.S b/pixman/pixman-arm-simd-asm-scaled.S
index 0116889..24c1a27 100644
--- a/pixman/pixman-arm-simd-asm-scaled.S
+++ b/pixman/pixman-arm-simd-asm-scaled.S
@@ -170,6 +170,10 @@ generate_nearest_scaled_cover_function \
 pixman_get_scanline_nearest_scaled_cover_x8r8g8b8_asm_armv6, 32, \
 2, 3 /* prefetch distances */, nop_macro, convert_x888_8888
 
+generate_nearest_scaled_cover_function \
+pixman_get_scanline_r5g6b5_nearest_scaled_cover_r5g6b5_asm_armv6, 16, \
+2, 0 /* prefetch distances */, nop_macro, nop_macro, 16
+
 .macro init_ge
 msr CPSR_s, #0x5
 .endm
diff --git a/pixman/pixman-arm-simd-asm-scaled.h b/pixman/pixman-arm-simd-asm-scaled.h
index 660797d..e642e7f 100644
--- a/pixman/pixman-arm-simd-asm-scaled.h
+++ b/pixman/pixman-arm-simd-asm-scaled.h
@@ -94,7 +94,12 @@
 
 .macro nearest_scaled_cover_enlarge_nomask_innerloop  bpp, reg, convert, mask_hint, may_be_final, exit_label, store
 addsACCUM, ACCUM, UX
+ .if PIXEL_MERGE_OFFSET == 0
 mov \reg, PIXEL
+ .else
+orr \reg, \reg, PIXEL, lsl #PIXEL_MERGE_OFFSET
+ .endif
+ .set PIXEL_MERGE_OFFSET, (PIXEL_MERGE_OFFSET + out_bpp) & 31
 \store
 branch  cc, \exit_label, 1203f
  .ifnc \may_be_final,
@@ -158,10 +163,20 @@
 mov TMP, XHI
 addsXLO, XLO, UX, lsl #16
 adc XHI, XHI, UX, lsr #16
+ .if PIXEL_MERGE_OFFSET == 0
 ldrx\bpp,, \reg, [PTR]
+ .else
+ldrx\bpp,, PIXEL2, [PTR]
+ .endif
 eor TMP, TMP, XHI
 bicsTMP, TMP, #255/\bpp
+ .if PIXEL_MERGE_OFFSET == 0
 \convert \reg, TMP
+ .else
+\convert PIXEL2, TMP
+orr \reg, \reg, PIXEL2, lsl #PIXEL_MERGE_OFFSET
+ .endif
+ .set PIXEL_MERGE_OFFSET, (PIXEL_MERGE_OFFSET + out_bpp) & 31
 \store
 branch  eq, \exit_label, 1403f
 subsPLDS, PLDS, #32
@@ -185,7 +200,14 @@
 \inner_loop  \bpp, WK0, \convert, mask_is_0, 1, 1503f, add DST, DST, #4
 b   1503f
  .endif
+ .set PIXEL_MERGE_OFFSET, 0
+ .if out_bpp == 32
 1502:   \inner_loop  \bpp, WK0, \convert, mask_is_non_0, 1,, str WK0, [DST], #4
+ .elseif out_bpp == 16
+1502:   \inner_loop  \bpp, WK0, \convert, mask_is_non_0, 1,, strh 

Re: [Pixman] [PATCH 1/4] pixman-fast-path: Add over_n_8888 fast path (disabled)

2015-08-24 Thread Ben Avison

On Mon, 24 Aug 2015 15:20:22 +0100, Oded Gabbay oded.gab...@gmail.com wrote:

I tested the patch on POWER8, ppc64le.
make check passes, but when I benchmarked the performance using
lowlevel-blt-bench over_n_, I got far worse results than without
this patch (see numbers below).
Apparently, the new C fast-path takes precedence over some vmx combine
fast-paths, thus making performance worse instead of better.


If that's true, it wouldn't be the first time that this issue has arisen.
Pixman is designed to scan all fast paths (i.e. routines which take
source, mask and destination images and perform the whole operation
themselves) irrespective of how well tuned they are for a given platform,
before it starts looking at iter routines (each of which reads or writes
only one of source, mask or destination).

Previously, in response to my attempt to work around a case where this
was happening, Søren wrote:

[...] the longstanding problem that pixman sometimes won't
pick the fastest code for some operations. In this case the general
iter based code will be faster than the C code in pixman-fast-path.c
because the iter code will use assembly fetchers.
As a result you end up with a bunch of partial reimplementations of
general_composite_rect() inside pixman-arm-simd.c.
Maybe we need to admit failure and make general_composite_rect()
available for other implementations to use. Essentially we would
officially provide a way for implementations to say: My iterators are
faster than pixman-fast-path.c for these specific operations, so just
go directly to general_composite_rect().
It's definitely not a pleasant thing to do, but given that nobody is
likely to have the time/skill combination required to do the necessary
redesign of pixman's fast path system, it might still be preferable
to do this instead of duplicating the code like these patches do.
With a setup like that, we could also fix the same issue for the
bilinear SSSE3 fetchers and possibly other cases.


(ref http://lists.freedesktop.org/archives/pixman/2014-October/003457.html)

I can't say that any cleaner solution has occurred to me since then.

I just had a quick look at the VMX source file, and it has hardly any
iters defined. My guess would be that what's being used is

noop_init_solid_narrow() from pixman-noop.c
_pixman_iter_get_scanline_noop() from pixman-utils.c
combine_src_u() from pixman-combine32.c

this last one would be responsible for the bulk of the runtime - and
it palms most of the work off on memcpy(). Presumably the PPC memcpy is
super-optimised?


Without the patch:

reference memcpy speed = 25711.7MB/s (6427.9MP/s for 32bpp fills)
---
over_n_8888: PIXMAN_OP_OVER, src a8r8g8b8 solid, mask null, dst a8r8g8b8
---
 over_n_8888 =  L1: 572.29  L2:1038.08  M:1104.10 ( 17.18%)  HT:447.45  VT:520.82  R:407.92  RT:148.90 (1100Kops/s)


There's something a bit odd about that - it's slower when working within
the caches (especially the L1 cache) than when operating on main memory.
I'd hazard a guess that memcpy() is using some hardware acceleration that
lives between the L1 and L2 caches.

Presumably for patch 3 of this series (over_n_0565) you wouldn't see the
same effect, as that can't be achieved using memcpy().

Ben
___
Pixman mailing list
Pixman@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/pixman


[Pixman] [PATCH 5/7] armv7/mips/sse2: Fix bounds violations in bilinear cover scaled fast paths

2015-08-24 Thread Ben Avison
In functions named fast_composite_scaled_bilinear... as defined by
FAST_BILINEAR_MAINLOOP_INT, there are multiple calls to scanline_func
(a macro parameter) which is often implemented in assembly, and which
takes as parameters two line buffer pointers (which are range checked)
and fixed-point initial and incremental pixel offsets into those lines.

When handling edge cases, there are already implicit assumptions that
scanline_func reads at most two input pixels horizontally for each
output pixel. In practice, implementations typically read exactly two,
even when the fractional part of the offset is zero; this leads to
array bounds violations for the COVER case when the offset for the final
pixel is integer (in other cases, such offsets move us into special
handling for repeat zones and so are not affected).

This patch fixes this by special-casing operations which are affected,
and plotting the final pixel of each row from a temporary buffer (similar
to the manner in which PAD repeat is accomplished).

Tested using cover-test on armv7.
---
 pixman/pixman-inlines.h |   69 ---
 1 files changed, 65 insertions(+), 4 deletions(-)

diff --git a/pixman/pixman-inlines.h b/pixman/pixman-inlines.h
index c3324cf..8a480bd 100644
--- a/pixman/pixman-inlines.h
+++ b/pixman/pixman-inlines.h
@@ -924,6 +924,67 @@ fast_composite_scaled_bilinear ## scale_func_name 
(pixman_implementation_t *imp,
src_width_fixed = pixman_int_to_fixed (src_width);  
\
 }  
\

\
+if (PIXMAN_REPEAT_ ## repeat_mode == PIXMAN_REPEAT_COVER)  
\
+{  
\
+   /* Detect cases where a typical scanline function would read beyond 
array bounds. */\
+   int32_t last_src_pixel = src_image->bits.width - 1;
\
+   max_vx = v.vector[0] + (width - 1) * unit_x;
\
+   if (pixman_fixed_to_int (max_vx) == last_src_pixel) 
\
+   {   
\
+   src_type_t  buf1[2];
\
+   src_type_t  buf2[2];
\
+   src_type_t *src_line_top;   
\
+   src_type_t *src_line_bottom;
\
+   width--;
\
+   while (--height >= 0)
\
+   {   
\
+   int weight1, weight2;   
\
+   dst = dst_line; 
\
+   dst_line += dst_stride; 
\
+   vx = v.vector[0];   
\
+   if (flags & FLAG_HAVE_NON_SOLID_MASK)
\
+   {   
\
+   mask = mask_line;   
\
+   mask_line += mask_stride;   
\
+   }   
\
+   
\
+   y1 = pixman_fixed_to_int (vy);  
\
+   weight2 = pixman_fixed_to_bilinear_weight (vy); 
\
+   if (weight2)
\
+   {   
\
+   /* both weight1 and weight2 are smaller than 
BILINEAR_INTERPOLATION_RANGE */\
+   y2 = y1 + 1;
\
+   weight1 = BILINEAR_INTERPOLATION_RANGE - weight2;   
\
+   }   
\
+   else

[Pixman] [PATCH 0/7] Cover scaling patches request for assistance

2015-08-24 Thread Ben Avison
Some back story...

First there was this patch:

http://patchwork.freedesktop.org/patch/49937/

Back last October, Søren had this to say about it:

 A concern I have here is that code might access pixels outside the
 image that have weight 0. Ie., in the bilinear case, some code might
 attempt to read the pixel at bits[-1] and then multiply it with 0. But
 we can't assume that bits[-1] is readable.

 If I remember correctly, the +pixman_fixed_e * 8 stuff was intended to
 handle this case.

 I think it would be worthwhile to have a test that uses fence_malloc
 for the source buffer and the matrix mentioned in the commit. In fact,
 the fence_malloc() testing could benefit from being extended in
 various ways:

   - having fence pages both before and after the image
   - having fence pages in the 'stride' part of the image

Towards this goal, the following patches were posted to the list - and they
seem to have escaped Patchwork's notice:

http://lists.freedesktop.org/archives/pixman/2015-May/003644.html
http://lists.freedesktop.org/archives/pixman/2015-May/003645.html
http://lists.freedesktop.org/archives/pixman/2015-May/003646.html
http://lists.freedesktop.org/archives/pixman/2015-May/003678.html

(note that there were a few minor outstanding points on the first three).
This series relies upon the test program implemented by those patches to
prove its correctness, so it would be helpful if they could be finished off
and committed.

If you look in detail at the patches in this series, you'll see that there's
an outstanding change for a single routine - the one and only scanline fetch
iter in pixman-ssse3.c, which handles bilinear scaled a8r8g8b8 source
images. This could probably be fixed using either of the two methods I used
in the other patches, but pixman-fast-path.c is the most elegant. My problem
is that I don't know SSSE3 (or any other x86 for that matter) so it would
represent a big learning curve for me for the sake of just this one
function.

Is anyone able to help out? I've got a load of other scaling-related goodies
lined up, but it doesn't make much sense to post them while all these
fundamentals are still outstanding.

Ben Avison (7):
  Refactor calculation of cover flags
  More accurate FAST_PATH_SAMPLES_COVER_CLIP_NEAREST
  Split FAST_PATH_SAMPLES_COVER_CLIP_BILINEAR flag
  More accurate FAST_PATH_SAMPLES_COVER_CLIP_BILINEAR
  armv7/mips/sse2: Fix bounds violations in bilinear cover scaled fast
paths
  pixman-fast-path: Fix bounds violations in bilinear cover fetcher
  test: Make image size calculation match COVER_CLIP definition again

 pixman/pixman-fast-path.c |   35 +
 pixman/pixman-inlines.h   |   63 -
 pixman/pixman-private.h   |1 +
 pixman/pixman-ssse3.c |2 +-
 pixman/pixman.c   |   37 +-
 test/affine-bench.c   |   11 +++-
 6 files changed, 110 insertions(+), 39 deletions(-)

-- 
1.7.5.4



[Pixman] [PATCH 6/7] pixman-fast-path: Fix bounds violations in bilinear cover fetcher

2015-08-24 Thread Ben Avison
A common theme here: the code was unconditionally reading pixels to the
right (and in this case below) the rounded-down accumulator without
ensuring that the accumulator didn't point at the rightmost sample or
bottommost line.

The fix for the horizontal case is to negate the accumulator so that the
bounds check is only needed once, at the start of each row. Speed is less
critical vertically, so for each row we test the row number for validity
and force it in range if not.
---
 pixman/pixman-fast-path.c |   37 +
 1 files changed, 25 insertions(+), 12 deletions(-)

diff --git a/pixman/pixman-fast-path.c b/pixman/pixman-fast-path.c
index 02bbc5a..3938081 100644
--- a/pixman/pixman-fast-path.c
+++ b/pixman/pixman-fast-path.c
@@ -2280,18 +2280,29 @@ fetch_horizontal (bits_image_t *image, line_t *line,
  int y, pixman_fixed_t x, pixman_fixed_t ux, int n)
 {
uint32_t *bits = image->bits + y * image->rowstride;
-int i;
+int i, x0, x1;
+int32_t dist_x;
+uint32_t left, right;
+
+x = (1 << (16 - BILINEAR_INTERPOLATION_BITS)) - 1 - x;
+ux = -ux;
+
+x0 = pixman_fixed_to_int (x);
+x1 = x0 + 1;
+dist_x = pixman_fixed_to_bilinear_weight (x);
+left = dist_x ? *(bits - x1) : 0;
 
for (i = 0; i < n; ++i)
 {
-   int x0 = pixman_fixed_to_int (x);
-   int x1 = x0 + 1;
-   int32_t dist_x;
-
-   uint32_t left = *(bits + x0);
-   uint32_t right = *(bits + x1);
+   if (i > 0)
+   {
+   x0 = pixman_fixed_to_int (x);
+   x1 = x0 + 1;
+   dist_x = pixman_fixed_to_bilinear_weight (x);
+   left = *(bits - x1);
+   }
 
-   dist_x = pixman_fixed_to_bilinear_weight (x);
+   right = *(bits - x0);
dist_x <<= (8 - BILINEAR_INTERPOLATION_BITS);
 
#if SIZEOF_LONG <= 4
@@ -2301,11 +2312,11 @@ fetch_horizontal (bits_image_t *image, line_t *line,
 
	lag = (left & 0xff00ff00) >> 8;
	rag = (right & 0xff00ff00) >> 8;
-	ag = (lag << 8) + dist_x * (rag - lag);
+	ag = (rag << 8) + dist_x * (lag - rag);

	lrb = (left & 0x00ff00ff);
	rrb = (right & 0x00ff00ff);
-	rb = (lrb << 8) + dist_x * (rrb - lrb);
+	rb = (rrb << 8) + dist_x * (lrb - rrb);
 
*((uint32_t *)(line-buffer + i)) = ag;
*((uint32_t *)(line-buffer + i) + 1) = rb;
@@ -2323,7 +2334,7 @@ fetch_horizontal (bits_image_t *image, line_t *line,
	lagrb = (((uint64_t)lag) << 24) | lrb;
	ragrb = (((uint64_t)rag) << 24) | rrb;
 
-	line->buffer[i] = (lagrb << 8) + dist_x * (ragrb - lagrb);
+	line->buffer[i] = (ragrb << 8) + dist_x * (lagrb - ragrb);
}
 #endif
 
@@ -2350,6 +2361,8 @@ fast_fetch_bilinear_cover (pixman_iter_t *iter, const 
uint32_t *mask)
 
 y0 = pixman_fixed_to_int (info-y);
 y1 = y0 + 1;
+if (y1 == iter->image->bits.height)
+   y1 = y0;
 dist_y = pixman_fixed_to_bilinear_weight (info-y);
dist_y <<= (8 - BILINEAR_INTERPOLATION_BITS);
 
@@ -3187,7 +3200,7 @@ static const pixman_iter_info_t fast_iters[] =
   (FAST_PATH_STANDARD_FLAGS|
FAST_PATH_SCALE_TRANSFORM   |
FAST_PATH_BILINEAR_FILTER   |
-   FAST_PATH_SAMPLES_COVER_CLIP_BILINEAR_APPROX),
+   FAST_PATH_SAMPLES_COVER_CLIP_BILINEAR),
   ITER_NARROW | ITER_SRC,
   fast_bilinear_cover_iter_init,
   NULL, NULL
-- 
1.7.5.4



[Pixman] [PATCH 7/7] test: Make image size calculation match COVER_CLIP definition again

2015-08-24 Thread Ben Avison
Now that we don't need so many excess pixels to ensure both COVER_CLIP
fast path flags are set, remove the 8/65536ths of a pixel border used
when calculating how big to make the source image.
---
 test/affine-bench.c |   11 ---
 1 files changed, 4 insertions(+), 7 deletions(-)

diff --git a/test/affine-bench.c b/test/affine-bench.c
index 9e0121e..73e769b 100644
--- a/test/affine-bench.c
+++ b/test/affine-bench.c
@@ -396,13 +396,10 @@ main (int argc, char *argv[])
 }
 
 compute_transformed_extents (binfo.transform, dest_box, transformed);
-/* The source area is expanded by a tiny bit (8/65536th pixel)
- * to match the calculation of the COVER_CLIP flags in analyze_extent()
- */
-xmin = pixman_fixed_to_int (transformed.x1 - 8 * pixman_fixed_e - pixman_fixed_1 / 2);
-ymin = pixman_fixed_to_int (transformed.y1 - 8 * pixman_fixed_e - pixman_fixed_1 / 2);
-xmax = pixman_fixed_to_int (transformed.x2 + 8 * pixman_fixed_e + pixman_fixed_1 / 2);
-ymax = pixman_fixed_to_int (transformed.y2 + 8 * pixman_fixed_e + pixman_fixed_1 / 2);
+xmin = pixman_fixed_to_int (transformed.x1 - pixman_fixed_1 / 2);
+ymin = pixman_fixed_to_int (transformed.y1 - pixman_fixed_1 / 2);
+xmax = pixman_fixed_to_int (transformed.x2 + pixman_fixed_1 / 2 - pixman_fixed_e);
+ymax = pixman_fixed_to_int (transformed.y2 + pixman_fixed_1 / 2 - pixman_fixed_e);
 binfo.src_x = -xmin;
 binfo.src_y = -ymin;
 
-- 
1.7.5.4



Re: [Pixman] [PATCH] test: List more details of the operation when running scaling-test verbosely

2015-08-21 Thread Ben Avison

On Fri, 21 Aug 2015 11:34:35 +0100, Pekka Paalanen ppaala...@gmail.com wrote:


are you forgetting src_x, src_y, dst_x, dst_y? These were printed in
the old code.


Good spot - yes, not sure how they managed to escape. Something like this
should take care of it:

+printf ("src_x=%d, src_y=%d, ",
+src_x, src_y);
+if (mask_fmt != PIXMAN_null)
+{
+printf ("mask_x=%d, mask_y=%d, ",
+mask_x, mask_y);
+}
+printf ("dst_x=%d, dst_y=%d\n",
+dst_x, dst_y);

Ben


[Pixman] [PATCH] test: List more details of the operation when running scaling-test verbosely

2015-08-20 Thread Ben Avison
---
I've been doing a fair bit of work on scaled plots again recently, and I was
finding that scaling-test omitted to mention lots of useful information
(notably whether a mask was used, and what the mask parameters were).

 test/scaling-test.c |   63 +-
 1 files changed, 41 insertions(+), 22 deletions(-)

diff --git a/test/scaling-test.c b/test/scaling-test.c
index e2f7fa9..02bd0d0 100644
--- a/test/scaling-test.c
+++ b/test/scaling-test.c
@@ -73,7 +73,7 @@ test_composite (int  testnum,
pixman_op_t  op;
pixman_repeat_t  repeat = PIXMAN_REPEAT_NONE;
pixman_repeat_t  mask_repeat = PIXMAN_REPEAT_NONE;
-pixman_format_code_t src_fmt, dst_fmt;
+pixman_format_code_t src_fmt, mask_fmt, dst_fmt;
 uint32_t * srcbuf;
 uint32_t * dstbuf;
 uint32_t * maskbuf;
@@ -145,6 +145,7 @@ test_composite (int  testnum,
 prng_randmemset (dstbuf, dst_stride * dst_height, 0);
 
 src_fmt = get_format (src_bpp);
+mask_fmt = PIXMAN_a8;
 dst_fmt = get_format (dst_bpp);
 
 if (prng_rand_n (2))
@@ -169,7 +170,7 @@ test_composite (int  testnum,
 src_fmt, src_width, src_height, srcbuf, src_stride);
 
 mask_img = pixman_image_create_bits (
-PIXMAN_a8, mask_width, mask_height, maskbuf, mask_stride);
+mask_fmt, mask_width, mask_height, maskbuf, mask_stride);
 
 dst_img = pixman_image_create_bits (
 dst_fmt, dst_width, dst_height, dstbuf, dst_stride);
@@ -255,21 +256,6 @@ test_composite (int  testnum,
 else
pixman_image_set_filter (mask_img, PIXMAN_FILTER_BILINEAR, NULL, 0);
 
-if (verbose)
-{
-   printf ("src_fmt=%s, dst_fmt=%s\n",
-   format_name (src_fmt), format_name (dst_fmt));
-   printf ("op=%s, scale_x=%d, scale_y=%d, repeat=%d\n",
-   operator_name (op), scale_x, scale_y, repeat);
-   printf ("translate_x=%d, translate_y=%d\n",
-   translate_x, translate_y);
-   printf ("src_width=%d, src_height=%d, dst_width=%d, dst_height=%d\n",
-   src_width, src_height, dst_width, dst_height);
-   printf ("src_x=%d, src_y=%d, dst_x=%d, dst_y=%d\n",
-   src_x, src_y, dst_x, dst_y);
-   printf ("w=%d, h=%d\n", w, h);
-}
 if (prng_rand_n (8) == 0)
 {
pixman_box16_t clip_boxes[2];
@@ -352,10 +338,42 @@ test_composite (int  testnum,
 }
 
 if (prng_rand_n (2) == 0)
-   pixman_image_composite (op, src_img, NULL, dst_img,
-src_x, src_y, 0, 0, dst_x, dst_y, w, h);
-else
-   pixman_image_composite (op, src_img, mask_img, dst_img,
+{
+mask_fmt = PIXMAN_null;
+pixman_image_unref (mask_img);
+mask_img = NULL;
+mask_x = 0;
+mask_y = 0;
+}
+
+if (verbose)
+{
+printf ("op=%s, src_fmt=%s, mask_fmt=%s, dst_fmt=%s\n",
+operator_name (op), format_name (src_fmt), format_name (mask_fmt), format_name (dst_fmt));
+printf ("scale_x=%d, scale_y=%d, repeat=%d, filter=%d\n",
+scale_x, scale_y, repeat, src_img->common.filter);
+printf ("translate_x=%d, translate_y=%d\n",
+translate_x, translate_y);
+if (mask_fmt != PIXMAN_null)
+{
+printf ("mask_scale_x=%d, mask_scale_y=%d, mask_repeat=%d, mask_filter=%d\n",
+mask_scale_x, mask_scale_y, mask_repeat, mask_img->common.filter);
+printf ("mask_translate_x=%d, mask_translate_y=%d\n",
+mask_translate_x, mask_translate_y);
+}
+printf ("src_width=%d, src_height=%d, ",
+src_width, src_height);
+if (mask_fmt != PIXMAN_null)
+{
+printf ("mask_width=%d, mask_height=%d, ",
+mask_width, mask_height);
+}
+printf ("dst_width=%d, dst_height=%d\n",
+dst_width, dst_height);
+printf ("w=%d, h=%d\n", w, h);
+}
+
+pixman_image_composite (op, src_img, mask_img, dst_img,
 src_x, src_y, mask_x, mask_y, dst_x, dst_y, w, h);
 
 crc32 = compute_crc32_for_image (0, dst_img);
@@ -364,7 +382,8 @@ test_composite (int  testnum,
print_image (dst_img);
 
 pixman_image_unref (src_img);
-pixman_image_unref (mask_img);
+if (mask_img != NULL)
+pixman_image_unref (mask_img);
 pixman_image_unref (dst_img);
 
 if (src_stride  0)
-- 
1.7.5.4



Re: [Pixman] [PATCH 0/4] New fast paths and Raspberry Pi 1 benchmarking

2015-08-20 Thread Ben Avison

On Thu, 20 Aug 2015 19:34:37 +0100, Bill Spitzak spit...@gmail.com wrote:


Could this be whether some bad instruction ends up next to or split
by a cache line boundary? That would produce a random-looking plot,
though it really is a plot of the location of the bad instructions in
the measured function.

If this really is a problem then the ideal fix is for the compiler to
insert NOP instructions in order to move the bad instructions away from
the locations that make them bad. Yike.


Thought of that, tried it, still baffled at the results. In other words,
merely ensuring instructions retained the same alignment to cachelines
wasn't enough to ensure reproducibility - it could only be achieved by
ensuring the same absolute address (which isn't an option with shared
libraries in the presence of ASLR).

My best theory at the moment is that the branch predictor in the ARM11
uses a hash of both the source and destination addresses of a branch to
choose which index in the predictor cache. Because it's a direct-mapped
cache, any collisions due to the branch moving to a different address can
have major effects on very tight loops like src_8888_8888.

Ben


[Pixman] [PATCH] Resolve implementation-defined behaviour for division rounded to -infinity

2015-08-14 Thread Ben Avison
The previous implementations of DIV and MOD relied upon the built-in / and %
operators performing round-to-zero. This is true for C99, but rounding is
implementation-defined for C89 when divisor and/or dividend is negative, and
I believe Pixman is still supposed to support C89.
---
 pixman/pixman-private.h |8 
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/pixman/pixman-private.h b/pixman/pixman-private.h
index 73108a0..80506be 100644
--- a/pixman/pixman-private.h
+++ b/pixman/pixman-private.h
@@ -889,12 +889,12 @@ pixman_list_move_to_front (pixman_list_t *list, 
pixman_link_t *link)
 #endif
 
 /* Integer division that rounds towards -infinity */
-#define DIV(a, b) \
-    ((((a) < 0) == ((b) < 0)) ? (a) / (b) :                             \
-     ((a) - (b) + 1 - (((b) < 0) << 1)) / (b))
+#define DIV(a, b) \
+((a) / (b) - ((a) % (b) != 0 && ((a) % (b) < 0) != ((b) < 0) ? 1 : 0))
 
 /* Modulus that produces the remainder wrt. DIV */
-#define MOD(a, b) ((a) < 0 ? ((b) - ((-(a) - 1) % (b))) - 1 : (a) % (b))
+#define MOD(a, b) \
+((a) % (b) + ((a) % (b) != 0 && ((a) % (b) < 0) != ((b) < 0) ? (b) : 0))
 
 #define CLIP(v, low, high) ((v) < (low) ? (low) : ((v) > (high) ? (high) : (v)))
 
-- 
1.7.5.4



Re: [Pixman] [PATCH 0/9] lowlevel-blt-bench improvements for automated testing

2015-07-02 Thread Ben Avison

On Wed, 10 Jun 2015 14:32:49 +0100, Pekka Paalanen ppaala...@gmail.com wrote:

most of the patches are trivial cleanups. The meat are the last two:
CSV output mode and skipping the memory speed benchmark.

Both new features are designed for an external benchmarking harness,
that runs several different versions of Pixman with lowlevel-blt-bench
in an alternating fashion. Alternating iterations are needed to get
reliable results on platforms like the Raspberry Pi.


These look like sensible improvements to me. A few minor points:

Patch 2 commit message:

Move explanation printing to a new file. This will help with

   function

Patch 8: if the aim is machine readability, I'd suggest no space after
the comma. Otherwise if comma is used as a field separator, the first
field has no leading space but all the others do, which may make things a
little more fiddly when post-processing the results with some tools.

Not really your fault, but I noticed it when trying out the new versions:
it doesn't fault an attempt to list more than one pattern on the command
line. It just acts on the last one. It should really either fault it, or
(more usefully) benchmark all of the patterns specified. This would allow
a subset of the all tests to be performed, sharing the same memcpy
measurement overhead, or a group of operations including those not in the
all list.

In other respects, happy to give my Reviewed-by:.

Ben


[Pixman] Fwd: pixman clang issue

2015-06-24 Thread Ben Avison

Forwarding a message received off-list. (Speculation: you need to be subscribed 
to the list before messages will be accepted, and Shane isn't?)

--- Forwarded message ---
From: Shane Burrell sh...@shaneburrell.com
To: bavi...@riscosopen.org
Cc:
Subject: pixman clang issue
Date: Thu, 25 Jun 2015 02:57:11 +0100

I'm using the homebrew built version of pixman and it seems that clang doesn't 
like the inline assembly at line 100 in pixman-mmx.c .

It produces the following error.

pixman-mmx.c20: error: constraint 'K' expects an integer constant expression
: "y" (__A), "K" (__N)
^~~

The error is referenced by others here.  
https://code.google.com/p/nativeclient/issues/detail?id=4201

I attempted to submit a bug report via the list but received a bounce back.  
Just trying to reach someone to see if it is something that can be fixed 
upstream from homebrew or a patch or disabling-mmx is needed.


Re: [Pixman] RFC: Pixman benchmark CPU time measurement

2015-06-03 Thread Ben Avison

On Wed, 03 Jun 2015 08:51:25 +0100, Pekka Paalanen ppaala...@gmail.com
wrote:

If we fixed gettime() for clock() wraparounds rather than ignoring them,
I suppose there would be no reason to have gettimei().

Ben?


Well, that would help, but I still don't like the idea of using floating
point in the middle of benchmarking loops when an integer version works
just as well and floating point doesn't really gain you anything. Even the
division by 1000000 is nearly always undone by the time we reach the
printf because megapixels per second or megabytes per second are practical
units - and those are the same things as pixels or bytes per microsecond.
Nobody is currently doing more to the times than adding or subtracting
them.

I know they're increasingly rare these days, but a machine with no
hardware FPU might take an appreciable time to do the integer-floating
point conversion and floating point maths. Even if you have an FPU, it
might be powered down on each context switch and only have its state
restored lazily on the first floating point instruction encountered,
resulting in a random timing element. In both cases this can be avoided
by sticking to integers.

Ben


Re: [Pixman] RFC: Pixman benchmark CPU time measurement

2015-06-03 Thread Ben Avison

On Wed, 03 Jun 2015 08:40:37 +0100, Pekka Paalanen ppaala...@gmail.com wrote:


On Tue, 02 Jun 2015 17:03:30 +0100
Ben Avison bavi...@riscosopen.org wrote:


If I were to make one change to gettimei() now, looking back, it would be
to make it return int32_t instead. This is because most often you'd be
subtracting two sample outputs of the function, and it's more often
useful to consider time intervals as signed (say if you're comparing the
current time against a timeout time which may be in the past or the
future). If gettimei() returns a signed integer, then C's type promotion
rules make the result of the subtraction signed automatically.


Wasn't overflow well-defined only for unsigned integers?


You're right, I'd forgotten that aspect of C. To be fully portable, a
modulo-2^32-arithmetic comparison needs to be written

uint32_t time1, time2;
if (((time1 - time2) & (1u << 31)) != 0)

rather than

int32_t time1, time2;
if ((time1 - time2) < 0)

or

uint32_t time1, time2;
if ((int32_t) (time1 - time2) < 0)

even though the latter two are more readable and all three are actually
equivalent if you're using two's complement integers, which (nearly?)
everybody does nowadays.

Sometimes I wish C had an inbuilt modulo integer type. At the assembly
level for a lot of CPUs it's a third type of comparison with similar ease
of use as signed and unsigned comparisons, but it's a pain to express in
C.

Anyway, this doesn't change the fact that struct timeval currently uses
signed long ints, and will enter undefined behaviour territory in 2038. I
think it's reasonable to assume that in practice, tv_sec will actually
contain an unsigned long int value (albeit one that has been cast to
signed) after that, as that would break the least software.

Ben


[Pixman] [PATCH v2] test: Fix solid-test for big-endian targets

2015-05-29 Thread Ben Avison
Tested-by: Fernando Seiti Furusato ferse...@linux.vnet.ibm.com (ppc64le, 
ppc64, powerpc)
Tested-by: Ben Avison bavi...@riscosopen.org (armv6l, armv7l, i686)
---
 test/solid-test.c |9 ++---
 1 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/test/solid-test.c b/test/solid-test.c
index 7be5466..c6ea397 100644
--- a/test/solid-test.c
+++ b/test/solid-test.c
@@ -237,7 +237,7 @@ create_solid_image (const pixman_format_code_t 
*allowed_formats,
 pixman_image_unref (dummy_img);
 
 /* Now set the bitmap contents to a random value */
-*buffer = prng_rand ();
+prng_randmemset (buffer, 4, 0);
 image_endian_swap (img);
 
 if (used_fmt)
@@ -251,7 +251,10 @@ create_solid_image (const pixman_format_code_t 
*allowed_formats,
 pixman_color_t color;
 pixman_image_t *img;
 
-prng_randmemset (&color, sizeof color, 0);
+color.alpha = prng_rand_n (UINT16_MAX + 1);
+color.red   = prng_rand_n (UINT16_MAX + 1);
+color.green = prng_rand_n (UINT16_MAX + 1);
+color.blue  = prng_rand_n (UINT16_MAX + 1);
 img = pixman_image_create_solid_fill (color);
 
 if (used_fmt)
@@ -345,6 +348,6 @@ main (int argc, const char *argv[])
 }
 
return fuzzer_test_main ("solid", 500000,
-0x1B6DFF8D,
+ 0xC30FD380,
 test_solid, argc, argv);
 }
-- 
1.7.5.4



Re: [Pixman] [RFC PATCH] solid-test: Allow for big-endian targets

2015-05-29 Thread Ben Avison

On Fri, 29 May 2015 11:08:49 +0100, Pekka Paalanen ppaala...@gmail.com wrote:

I think pixman_color needs to be pre-multiplied just like all other
alpha formats, which means that the above way to randomize values into
a pixman_color can produce luminous (or invalid?) colors.

Would that raise undefined behaviour, or should we expect Pixman to
handle also those cases always the same?


Yes, when you're dealing with pre-multiplied alpha, you're allowed to
have the RGB components exceed the alpha one, and this equates to a
luminous effect. Pixman uses saturated arithmetic throughout, so it
should be able to handle these (and so they need to be tested).


Maybe the iteration count (500000 in the fuzzer_test_main call) should also
be increased?


[versus 2 million in blitters-test]

Well, I kind of settled on that on the basis of the runtime of the test
being similar to most of the other fuzz testers. Blitters-test is one of
the absolute slowest tests, and although it does include roughly the same
number of solid tests as this, they're very skewed towards solid masks
rather than solid sources and only test single-pixel bitmap images, so I
think the test as written is still usefully increasing the test coverage.

Updated patch to follow, as requested.

Ben


[Pixman] [PATCH v2] arm: Retire PIXMAN_ARM_SIMPLE_NEAREST_FAST_PATH

2015-05-27 Thread Ben Avison
This macro does exactly the same thing as the platform-neutral macro
SIMPLE_NEAREST_FAST_PATH.
---
Pekka is correct in deducing that the only difference here is a few lines
of context in the fast path tables.

 pixman/pixman-arm-common.h |7 ---
 pixman/pixman-arm-neon.c   |   24 
 pixman/pixman-arm-simd.c   |   18 +-
 3 files changed, 21 insertions(+), 28 deletions(-)

diff --git a/pixman/pixman-arm-common.h b/pixman/pixman-arm-common.h
index 3a7cb2b..c368516 100644
--- a/pixman/pixman-arm-common.h
+++ b/pixman/pixman-arm-common.h
@@ -266,13 +266,6 @@ FAST_NEAREST_MAINLOOP (cputype##_##name##_normal_##op, 
   \
scaled_nearest_scanline_##cputype##_##name##_##op, \
src_type, dst_type, NORMAL)
 
-/* Provide entries for the fast path table */
-#define PIXMAN_ARM_SIMPLE_NEAREST_FAST_PATH(op,s,d,func)  \
-SIMPLE_NEAREST_FAST_PATH_COVER (op,s,d,func), \
-SIMPLE_NEAREST_FAST_PATH_NONE (op,s,d,func),  \
-SIMPLE_NEAREST_FAST_PATH_PAD (op,s,d,func),   \
-SIMPLE_NEAREST_FAST_PATH_NORMAL (op,s,d,func)
-
 #define PIXMAN_ARM_BIND_SCALED_NEAREST_SRC_A8_DST(flags, cputype, name, op,   \
   src_type, dst_type) \
 void  \
diff --git a/pixman/pixman-arm-neon.c b/pixman/pixman-arm-neon.c
index 60e9c78..be761c9 100644
--- a/pixman/pixman-arm-neon.c
+++ b/pixman/pixman-arm-neon.c
@@ -362,21 +362,21 @@ static const pixman_fast_path_t arm_neon_fast_paths[] =
 PIXMAN_STD_FAST_PATH (OUT_REVERSE,  a8,null, a8r8g8b8, neon_composite_out_reverse_8_8888),
 PIXMAN_STD_FAST_PATH (OUT_REVERSE,  a8,null, a8b8g8r8, neon_composite_out_reverse_8_8888),
 
-PIXMAN_ARM_SIMPLE_NEAREST_FAST_PATH (OVER, a8r8g8b8, a8r8g8b8, neon_8888_8888),
-PIXMAN_ARM_SIMPLE_NEAREST_FAST_PATH (OVER, a8b8g8r8, a8b8g8r8, neon_8888_8888),
-PIXMAN_ARM_SIMPLE_NEAREST_FAST_PATH (OVER, a8r8g8b8, x8r8g8b8, neon_8888_8888),
-PIXMAN_ARM_SIMPLE_NEAREST_FAST_PATH (OVER, a8b8g8r8, x8b8g8r8, neon_8888_8888),
+SIMPLE_NEAREST_FAST_PATH (OVER, a8r8g8b8, a8r8g8b8, neon_8888_8888),
+SIMPLE_NEAREST_FAST_PATH (OVER, a8b8g8r8, a8b8g8r8, neon_8888_8888),
+SIMPLE_NEAREST_FAST_PATH (OVER, a8r8g8b8, x8r8g8b8, neon_8888_8888),
+SIMPLE_NEAREST_FAST_PATH (OVER, a8b8g8r8, x8b8g8r8, neon_8888_8888),
 
-PIXMAN_ARM_SIMPLE_NEAREST_FAST_PATH (OVER, a8r8g8b8, r5g6b5, 
neon__0565),
-PIXMAN_ARM_SIMPLE_NEAREST_FAST_PATH (OVER, a8b8g8r8, b5g6r5, 
neon__0565),
+SIMPLE_NEAREST_FAST_PATH (OVER, a8r8g8b8, r5g6b5, neon__0565),
+SIMPLE_NEAREST_FAST_PATH (OVER, a8b8g8r8, b5g6r5, neon__0565),
 
-PIXMAN_ARM_SIMPLE_NEAREST_FAST_PATH (SRC, a8r8g8b8, r5g6b5, 
neon__0565),
-PIXMAN_ARM_SIMPLE_NEAREST_FAST_PATH (SRC, x8r8g8b8, r5g6b5, 
neon__0565),
-PIXMAN_ARM_SIMPLE_NEAREST_FAST_PATH (SRC, a8b8g8r8, b5g6r5, 
neon__0565),
-PIXMAN_ARM_SIMPLE_NEAREST_FAST_PATH (SRC, x8b8g8r8, b5g6r5, 
neon__0565),
+SIMPLE_NEAREST_FAST_PATH (SRC, a8r8g8b8, r5g6b5, neon__0565),
+SIMPLE_NEAREST_FAST_PATH (SRC, x8r8g8b8, r5g6b5, neon__0565),
+SIMPLE_NEAREST_FAST_PATH (SRC, a8b8g8r8, b5g6r5, neon__0565),
+SIMPLE_NEAREST_FAST_PATH (SRC, x8b8g8r8, b5g6r5, neon__0565),
 
-PIXMAN_ARM_SIMPLE_NEAREST_FAST_PATH (SRC, b5g6r5, x8b8g8r8, 
neon_0565_),
-PIXMAN_ARM_SIMPLE_NEAREST_FAST_PATH (SRC, r5g6b5, x8r8g8b8, 
neon_0565_),
+SIMPLE_NEAREST_FAST_PATH (SRC, b5g6r5, x8b8g8r8, neon_0565_),
+SIMPLE_NEAREST_FAST_PATH (SRC, r5g6b5, x8r8g8b8, neon_0565_),
 /* Note: NONE repeat is not supported yet */
 SIMPLE_NEAREST_FAST_PATH_COVER (SRC, r5g6b5, a8r8g8b8, neon_0565_),
 SIMPLE_NEAREST_FAST_PATH_COVER (SRC, b5g6r5, a8b8g8r8, neon_0565_),
diff --git a/pixman/pixman-arm-simd.c b/pixman/pixman-arm-simd.c
index fa1ab5c..f40ff36 100644
--- a/pixman/pixman-arm-simd.c
+++ b/pixman/pixman-arm-simd.c
@@ -260,15 +260,15 @@ static const pixman_fast_path_t arm_simd_fast_paths[] =
 PIXMAN_STD_FAST_PATH_CA (OVER, solid, a8b8g8r8, a8b8g8r8, 
armv6_composite_over_n___ca),
 PIXMAN_STD_FAST_PATH_CA (OVER, solid, a8b8g8r8, x8b8g8r8, 
armv6_composite_over_n___ca),
 
-PIXMAN_ARM_SIMPLE_NEAREST_FAST_PATH (SRC, r5g6b5, r5g6b5, armv6_0565_0565),
-PIXMAN_ARM_SIMPLE_NEAREST_FAST_PATH (SRC, b5g6r5, b5g6r5, armv6_0565_0565),
-
-PIXMAN_ARM_SIMPLE_NEAREST_FAST_PATH (SRC, a8r8g8b8, a8r8g8b8, 
armv6__),
-PIXMAN_ARM_SIMPLE_NEAREST_FAST_PATH (SRC, a8r8g8b8, x8r8g8b8, 
armv6__),
-PIXMAN_ARM_SIMPLE_NEAREST_FAST_PATH (SRC, x8r8g8b8, x8r8g8b8, 
armv6__),
-PIXMAN_ARM_SIMPLE_NEAREST_FAST_PATH (SRC, a8b8g8r8, a8b8g8r8, 
armv6__),
-

Re: [Pixman] [PATCH] mmx/sse2: Use SIMPLE_NEAREST_SOLID_MASK_FAST_PATH macro

2015-05-27 Thread Ben Avison

On Wed, 27 May 2015 12:24:58 +0100, Pekka Paalanen ppaala...@gmail.com wrote:


with summary fixed, this patch seems ok to me (it does not change the
content), but it does reorder the rows. I would assume the relative
ordering change in this case does not matter, but can someone confirm
that?


Yes, there is some reordering, but the only ordering constraint needed to
ensure that the same routine is chosen is that a COVER fast path for a given
combination of operator and source/destination pixel formats must precede
all the variants of repeated fast paths for the same combination. This
patch (and the other mmx/sse2 one) still follows that rule.
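The precedence rule matters because the table is searched top to bottom and
the first matching entry wins. A toy model of this (the flag names below are
simplified stand-ins, not pixman's real FAST_PATH_* set):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy model of a pixman-style fast path table.  An entry matches when all
 * of its required flags are present in the operation's flags, and the
 * FIRST matching entry wins. */
enum {
    FLAG_SAMPLES_COVER_CLIP = 1 << 0, /* source covers the clip extents */
    FLAG_NONE_REPEAT        = 1 << 1  /* repeat mode is NONE            */
};

typedef struct {
    unsigned    required_flags;
    const char *routine;              /* stand-in for a function pointer */
} fast_path_t;

/* COVER entry first, as the ordering rule above requires. */
static const fast_path_t table[] = {
    { FLAG_SAMPLES_COVER_CLIP, "nearest_cover" },
    { FLAG_NONE_REPEAT,        "nearest_none"  }
};

static const char *
lookup (unsigned op_flags)
{
    size_t i;

    for (i = 0; i < sizeof table / sizeof table[0]; i++)
        if ((table[i].required_flags & ~op_flags) == 0)
            return table[i].routine;
    return NULL;
}
```

An operation whose source covers the clip typically also satisfies the
NONE-repeat entry's requirements, so if the two entries were swapped the
specialised COVER routine would never be reached - which is why only the
relative order within one operator/format combination matters.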

I believe that in every other case, the sets of operations matched by any
pair of fast paths that are reordered in these patches are mutually
exclusive. While there will be a very subtle timing difference due to the
distance through the table we have to search before finding a match
(sometimes faster, sometimes slower), there is no evidence that the tables
have been carefully ordered by frequency of occurrence - just for ease of
copy-and-pasting.

Ben
___
Pixman mailing list
Pixman@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/pixman


Re: [Pixman] [RFC PATCH] solid-test: Allow for big-endian targets

2015-05-27 Thread Ben Avison

On Tue, 26 May 2015 19:16:32 +0100, I wrote:


+color.alpha = prng_rand_n (UINT16_MAX);
+color.red   = prng_rand_n (UINT16_MAX);
+color.green = prng_rand_n (UINT16_MAX);
+color.blue  = prng_rand_n (UINT16_MAX);


Oops, those should have been UINT16_MAX + 1; the argument is used as a
modulus, not a bitmask. The corresponding checksum is then 0xC30FD380
for me on ARM and x86 (both little-endian, of course).
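The off-by-one follows directly from modulo reduction: with a modulus of n,
results lie in [0, n-1], so UINT16_MAX itself can never be produced unless
the argument is UINT16_MAX + 1. A stand-in sketch of just that behaviour
(not pixman's actual generator, which lives in test/utils-prng.c):

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the modulo step of a prng_rand_n()-style helper: reducing
 * a raw random value modulo n yields results in [0, n-1], so n itself is
 * never produced. */
static uint32_t
rand_n (uint32_t raw, uint32_t n)
{
    return raw % n;
}
```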

Ben


[Pixman] [PATCH] test: Add cover-test

2015-05-26 Thread Ben Avison
This test aims to verify both numerical correctness and the honouring of
array bounds for scaled plots (both nearest-neighbour and bilinear) at or
close to the boundary conditions for applicability of cover type fast paths
and iter fetch routines.

It has a secondary purpose: by setting the env var EXACT (to any value) it
will only test plots that are exactly on the boundary condition. This makes
it possible to ensure that cover routines are being used to the maximum,
although this requires the use of a debugger or code instrumentation to
verify.
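One minimal way to implement such an any-value switch (a sketch only; the
patch itself defines the actual behaviour) is to test for the variable's
mere presence, so that even an empty value enables it:

```c
#include <assert.h>
#include <stdlib.h>

/* Sketch of an EXACT-style switch: the variable's presence (any value,
 * including the empty string) enables exact-boundary testing. */
static int
read_exact_flag (void)
{
    return getenv ("EXACT") != NULL;
}
```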
---
Note that this must be pushed after Pekka's fence-image patches.

 test/Makefile.sources |1 +
 test/cover-test.c |  376 +
 2 files changed, 377 insertions(+), 0 deletions(-)
 create mode 100644 test/cover-test.c

diff --git a/test/Makefile.sources b/test/Makefile.sources
index 14a3710..5b901db 100644
--- a/test/Makefile.sources
+++ b/test/Makefile.sources
@@ -26,6 +26,7 @@ TESTPROGRAMS =  \
glyph-test\
solid-test\
stress-test   \
+   cover-test\
blitters-test \
affine-test   \
scaling-test  \
diff --git a/test/cover-test.c b/test/cover-test.c
new file mode 100644
index 000..fb173b2
--- /dev/null
+++ b/test/cover-test.c
@@ -0,0 +1,376 @@
+/*
+ * Copyright © 2015 RISC OS Open Ltd
+ *
+ * Permission to use, copy, modify, distribute, and sell this software and its
+ * documentation for any purpose is hereby granted without fee, provided that
+ * the above copyright notice appear in all copies and that both that
+ * copyright notice and this permission notice appear in supporting
+ * documentation, and that the name of the copyright holders not be used in
+ * advertising or publicity pertaining to distribution of the software without
+ * specific, written prior permission.  The copyright holders make no
+ * representations about the suitability of this software for any purpose.  It
+ * is provided "as is" without express or implied warranty.
+ *
+ * THE COPYRIGHT HOLDERS DISCLAIM ALL WARRANTIES WITH REGARD TO THIS
+ * SOFTWARE, INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND
+ * FITNESS, IN NO EVENT SHALL THE COPYRIGHT HOLDERS BE LIABLE FOR ANY
+ * SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
+ * WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN
+ * AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING
+ * OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS
+ * SOFTWARE.
+ *
+ * Author:  Ben Avison (bavi...@riscosopen.org)
+ *
+ */
+
+#include "utils.h"
+#include <stdlib.h>
+#include <stdio.h>
+
+/* Approximate limits for random scale factor generation - these ensure we can
+ * get at least 8x reduction and 8x enlargement.
+ */
+#define LOG2_MAX_FACTOR (3)
+
+#define SQRT_0POINT5_0POINT32_FIXED (0xB504F334u)
+
#define MAX_INC (SQRT_0POINT5_0POINT32_FIXED >> (31 - 16 - LOG2_MAX_FACTOR))
+
+/* Minimum source width (in pixels) based on a typical page size of 4K and
+ * maximum colour depth of 32bpp.
+ */
+#define MIN_SRC_WIDTH (4096 / 4)
+
+/* Derive the destination width so that at max increment we fit within source */
+#define DST_WIDTH (MIN_SRC_WIDTH * pixman_fixed_1 / MAX_INC)
+
+/* Calculate heights the other way round - no limits due to page alignment here */
+#define DST_HEIGHT 3
+#define SRC_HEIGHT ((DST_HEIGHT * MAX_INC + pixman_fixed_1 - 1) / pixman_fixed_1)
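The relationship between these constants can be sanity-checked with plain
integer arithmetic. The standalone sketch below restates them, assuming
pixman's 16.16 fixed-point convention (pixman_fixed_1 == 0x10000, as in
pixman.h) and a right shift in MAX_INC:

```c
#include <assert.h>
#include <stdint.h>

/* Standalone restatement of the test's geometry constants, assuming
 * pixman's 16.16 fixed-point convention (pixman_fixed_1 == 0x10000). */
#define pixman_fixed_1  0x10000u
#define LOG2_MAX_FACTOR 3
#define SQRT_0POINT5_0POINT32_FIXED 0xB504F334u
#define MAX_INC (SQRT_0POINT5_0POINT32_FIXED >> (31 - 16 - LOG2_MAX_FACTOR))
#define MIN_SRC_WIDTH (4096 / 4)
#define DST_WIDTH (MIN_SRC_WIDTH * pixman_fixed_1 / MAX_INC)
#define DST_HEIGHT 3
#define SRC_HEIGHT ((DST_HEIGHT * MAX_INC + pixman_fixed_1 - 1) / pixman_fixed_1)
```

Evaluating these gives MAX_INC = 741455 (about 11.3 in 16.16 fixed point)
and DST_WIDTH = 90, so even at the maximum increment the destination span
stays within the source extents in both axes.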
+
+/* At the time of writing, all the scaled fast paths use SRC, OVER or ADD
+ * Porter-Duff operators. XOR is included in the list to ensure good
+ * representation of iter scanline fetch routines.
+ */
+static const pixman_op_t op_list[] = {
+PIXMAN_OP_SRC,
+PIXMAN_OP_OVER,
+PIXMAN_OP_ADD,
+PIXMAN_OP_XOR,
+};
+
+/* At the time of writing, all the scaled fast paths use a8r8g8b8, x8r8g8b8
+ * or r5g6b5, or red-blue swapped versions of the same. When a mask channel is
+ * used, it is always a8 (and so implicitly not component alpha). a1r5g5b5 is
+ * included because it is the only other format to feature in any iters. */
+static const pixman_format_code_t img_fmt_list[] = {
+PIXMAN_a8r8g8b8,
+PIXMAN_x8r8g8b8,
+PIXMAN_r5g6b5,
+PIXMAN_a1r5g5b5
+};
+
+/* This is a flag reflecting the environment variable EXACT. It can be used
+ * to ensure that source coordinates corresponding exactly to the cover limits
+ * are used, rather than any near misses. This can, for example, be used in
+ * conjunction with a debugger to ensure that only COVER fast paths are used.
+ */
+static int exact;
+
+static pixman_fixed_t
+random_scale_factor(void)
+{
+/* Get a random number with top bit set. */
+uint32_t f = prng_rand () | 0x80000000u;
+
+/* In log(2) space, this is still approximately evenly spread between 31
+ * and 32. Divide by sqrt(2) to centre

[Pixman] [RFC PATCH] solid-test: Allow for big-endian targets

2015-05-26 Thread Ben Avison
---
Hi Fernando,

solid-test is a very new test, and you're quite possibly the first person
to try it on any big-endian machine. Can you try this variation and see if
it helps please?

Thanks,
Ben

 test/solid-test.c |9 ++---
 1 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/test/solid-test.c b/test/solid-test.c
index 7be5466..fd95e1f 100644
--- a/test/solid-test.c
+++ b/test/solid-test.c
@@ -237,7 +237,7 @@ create_solid_image (const pixman_format_code_t 
*allowed_formats,
 pixman_image_unref (dummy_img);
 
 /* Now set the bitmap contents to a random value */
-*buffer = prng_rand ();
+prng_randmemset (buffer, 4, 0);
 image_endian_swap (img);
 
 if (used_fmt)
@@ -251,7 +251,10 @@ create_solid_image (const pixman_format_code_t 
*allowed_formats,
 pixman_color_t color;
 pixman_image_t *img;
 
-prng_randmemset (&color, sizeof color, 0);
+color.alpha = prng_rand_n (UINT16_MAX);
+color.red   = prng_rand_n (UINT16_MAX);
+color.green = prng_rand_n (UINT16_MAX);
+color.blue  = prng_rand_n (UINT16_MAX);
 img = pixman_image_create_solid_fill (color);
 
 if (used_fmt)
@@ -345,6 +348,6 @@ main (int argc, const char *argv[])
 }
 
 return fuzzer_test_main (solid, 50,
-0x1B6DFF8D,
+ 0xFD523DB8,
 test_solid, argc, argv);
 }
-- 
1.7.5.4



[Pixman] [PATCH] mips: Retire PIXMAN_MIPS_SIMPLE_NEAREST_A8_MASK_FAST_PATH

2015-05-26 Thread Ben Avison
This macro does exactly the same thing as the platform-neutral macro
SIMPLE_NEAREST_A8_MASK_FAST_PATH.
---
Question for anyone who can test this code: can working NORMAL repeat
versions of these operations (over__8_0565 and over_0565_8_0565) be
added reasonably easily? This would enable the same
SIMPLE_NEAREST_A8_MASK_FAST_PATH macros to be used for all platforms.

 pixman/pixman-mips-dspr2.c |8 
 pixman/pixman-mips-dspr2.h |6 --
 2 files changed, 4 insertions(+), 10 deletions(-)

diff --git a/pixman/pixman-mips-dspr2.c b/pixman/pixman-mips-dspr2.c
index e10c9df..122d9dc 100644
--- a/pixman/pixman-mips-dspr2.c
+++ b/pixman/pixman-mips-dspr2.c
@@ -388,11 +388,11 @@ static const pixman_fast_path_t mips_dspr2_fast_paths[] =
 SIMPLE_NEAREST_FAST_PATH_PAD (SRC, r5g6b5, a8r8g8b8, mips_0565_),
 SIMPLE_NEAREST_FAST_PATH_PAD (SRC, b5g6r5, a8b8g8r8, mips_0565_),
 
-PIXMAN_MIPS_SIMPLE_NEAREST_A8_MASK_FAST_PATH (OVER, a8r8g8b8, r5g6b5, 
mips__8_0565),
-PIXMAN_MIPS_SIMPLE_NEAREST_A8_MASK_FAST_PATH (OVER, a8b8g8r8, b5g6r5, 
mips__8_0565),
+SIMPLE_NEAREST_A8_MASK_FAST_PATH (OVER, a8r8g8b8, r5g6b5, 
mips__8_0565),
+SIMPLE_NEAREST_A8_MASK_FAST_PATH (OVER, a8b8g8r8, b5g6r5, 
mips__8_0565),
 
-PIXMAN_MIPS_SIMPLE_NEAREST_A8_MASK_FAST_PATH (OVER, r5g6b5, r5g6b5, 
mips_0565_8_0565),
-PIXMAN_MIPS_SIMPLE_NEAREST_A8_MASK_FAST_PATH (OVER, b5g6r5, b5g6r5, 
mips_0565_8_0565),
+SIMPLE_NEAREST_A8_MASK_FAST_PATH (OVER, r5g6b5, r5g6b5, mips_0565_8_0565),
+SIMPLE_NEAREST_A8_MASK_FAST_PATH (OVER, b5g6r5, b5g6r5, mips_0565_8_0565),
 
 SIMPLE_BILINEAR_FAST_PATH (SRC, a8r8g8b8, a8r8g8b8, mips__),
 SIMPLE_BILINEAR_FAST_PATH (SRC, a8r8g8b8, x8r8g8b8, mips__),
diff --git a/pixman/pixman-mips-dspr2.h b/pixman/pixman-mips-dspr2.h
index 955ed70..b9e0684 100644
--- a/pixman/pixman-mips-dspr2.h
+++ b/pixman/pixman-mips-dspr2.h
@@ -328,12 +328,6 @@ FAST_NEAREST_MAINLOOP_COMMON (mips_##name##_pad_##op,  
   \
   scaled_nearest_scanline_mips_##name##_##op, \
   src_type, uint8_t, dst_type, PAD, TRUE, FALSE)
 
-/* Provide entries for the fast path table */
-#define PIXMAN_MIPS_SIMPLE_NEAREST_A8_MASK_FAST_PATH(op,s,d,func) \
-SIMPLE_NEAREST_A8_MASK_FAST_PATH_COVER (op,s,d,func), \
-SIMPLE_NEAREST_A8_MASK_FAST_PATH_NONE (op,s,d,func),  \
-SIMPLE_NEAREST_A8_MASK_FAST_PATH_PAD (op,s,d,func)
-
 //
 
 #define PIXMAN_MIPS_BIND_SCALED_BILINEAR_SRC_DST(flags, name, op,\
-- 
1.7.5.4



Re: [Pixman] [PATCH] mmx/sse2: Use SIMPLE_NEAREST_SOLID_MASK_FAST_PATH macro

2015-05-26 Thread Ben Avison

Sorry folks, typo in subject line, should read

mmx/sse2: Use SIMPLE_NEAREST_FAST_PATH macro

Ben


[Pixman] [PATCH] mmx/sse2: Use SIMPLE_NEAREST_SOLID_MASK_FAST_PATH macro

2015-05-26 Thread Ben Avison
---
 pixman/pixman-mmx.c  |   20 
 pixman/pixman-sse2.c |   20 
 2 files changed, 8 insertions(+), 32 deletions(-)

diff --git a/pixman/pixman-mmx.c b/pixman/pixman-mmx.c
index 42826d9..877b6e9 100644
--- a/pixman/pixman-mmx.c
+++ b/pixman/pixman-mmx.c
@@ -4094,22 +4094,10 @@ static const pixman_fast_path_t mmx_fast_paths[] =
 PIXMAN_STD_FAST_PATH(IN,   a8,   null, a8,   
mmx_composite_in_8_8  ),
 PIXMAN_STD_FAST_PATH(IN,   solid,a8,   a8,   
mmx_composite_in_n_8_8),
 
-SIMPLE_NEAREST_FAST_PATH_COVER  (OVER,   a8r8g8b8, x8r8g8b8, mmx__ 
),
-SIMPLE_NEAREST_FAST_PATH_COVER  (OVER,   a8b8g8r8, x8b8g8r8, mmx__ 
),
-SIMPLE_NEAREST_FAST_PATH_COVER  (OVER,   a8r8g8b8, a8r8g8b8, mmx__ 
),
-SIMPLE_NEAREST_FAST_PATH_COVER  (OVER,   a8b8g8r8, a8b8g8r8, mmx__ 
),
-SIMPLE_NEAREST_FAST_PATH_NONE   (OVER,   a8r8g8b8, x8r8g8b8, mmx__ 
),
-SIMPLE_NEAREST_FAST_PATH_NONE   (OVER,   a8b8g8r8, x8b8g8r8, mmx__ 
),
-SIMPLE_NEAREST_FAST_PATH_NONE   (OVER,   a8r8g8b8, a8r8g8b8, mmx__ 
),
-SIMPLE_NEAREST_FAST_PATH_NONE   (OVER,   a8b8g8r8, a8b8g8r8, mmx__ 
),
-SIMPLE_NEAREST_FAST_PATH_PAD(OVER,   a8r8g8b8, x8r8g8b8, mmx__ 
),
-SIMPLE_NEAREST_FAST_PATH_PAD(OVER,   a8b8g8r8, x8b8g8r8, mmx__ 
),
-SIMPLE_NEAREST_FAST_PATH_PAD(OVER,   a8r8g8b8, a8r8g8b8, mmx__ 
),
-SIMPLE_NEAREST_FAST_PATH_PAD(OVER,   a8b8g8r8, a8b8g8r8, mmx__ 
),
-SIMPLE_NEAREST_FAST_PATH_NORMAL (OVER,   a8r8g8b8, x8r8g8b8, mmx__ 
),
-SIMPLE_NEAREST_FAST_PATH_NORMAL (OVER,   a8b8g8r8, x8b8g8r8, mmx__ 
),
-SIMPLE_NEAREST_FAST_PATH_NORMAL (OVER,   a8r8g8b8, a8r8g8b8, mmx__ 
),
-SIMPLE_NEAREST_FAST_PATH_NORMAL (OVER,   a8b8g8r8, a8b8g8r8, mmx__ 
),
+SIMPLE_NEAREST_FAST_PATH (OVER,   a8r8g8b8, x8r8g8b8, mmx__
),
+SIMPLE_NEAREST_FAST_PATH (OVER,   a8b8g8r8, x8b8g8r8, mmx__
),
+SIMPLE_NEAREST_FAST_PATH (OVER,   a8r8g8b8, a8r8g8b8, mmx__
),
+SIMPLE_NEAREST_FAST_PATH (OVER,   a8b8g8r8, a8b8g8r8, mmx__
),
 
 SIMPLE_NEAREST_SOLID_MASK_FAST_PATH (OVER, a8r8g8b8, a8r8g8b8, 
mmx__n_ ),
 SIMPLE_NEAREST_SOLID_MASK_FAST_PATH (OVER, a8b8g8r8, a8b8g8r8, 
mmx__n_ ),
diff --git a/pixman/pixman-sse2.c b/pixman/pixman-sse2.c
index a6e7808..1a8c430 100644
--- a/pixman/pixman-sse2.c
+++ b/pixman/pixman-sse2.c
@@ -6274,22 +6274,10 @@ static const pixman_fast_path_t sse2_fast_paths[] =
 PIXMAN_STD_FAST_PATH (IN, solid, a8, a8, sse2_composite_in_n_8_8),
 PIXMAN_STD_FAST_PATH (IN, solid, null, a8, sse2_composite_in_n_8),
 
-SIMPLE_NEAREST_FAST_PATH_COVER (OVER, a8r8g8b8, x8r8g8b8, sse2__),
-SIMPLE_NEAREST_FAST_PATH_COVER (OVER, a8b8g8r8, x8b8g8r8, sse2__),
-SIMPLE_NEAREST_FAST_PATH_COVER (OVER, a8r8g8b8, a8r8g8b8, sse2__),
-SIMPLE_NEAREST_FAST_PATH_COVER (OVER, a8b8g8r8, a8b8g8r8, sse2__),
-SIMPLE_NEAREST_FAST_PATH_NONE (OVER, a8r8g8b8, x8r8g8b8, sse2__),
-SIMPLE_NEAREST_FAST_PATH_NONE (OVER, a8b8g8r8, x8b8g8r8, sse2__),
-SIMPLE_NEAREST_FAST_PATH_NONE (OVER, a8r8g8b8, a8r8g8b8, sse2__),
-SIMPLE_NEAREST_FAST_PATH_NONE (OVER, a8b8g8r8, a8b8g8r8, sse2__),
-SIMPLE_NEAREST_FAST_PATH_PAD (OVER, a8r8g8b8, x8r8g8b8, sse2__),
-SIMPLE_NEAREST_FAST_PATH_PAD (OVER, a8b8g8r8, x8b8g8r8, sse2__),
-SIMPLE_NEAREST_FAST_PATH_PAD (OVER, a8r8g8b8, a8r8g8b8, sse2__),
-SIMPLE_NEAREST_FAST_PATH_PAD (OVER, a8b8g8r8, a8b8g8r8, sse2__),
-SIMPLE_NEAREST_FAST_PATH_NORMAL (OVER, a8r8g8b8, x8r8g8b8, sse2__),
-SIMPLE_NEAREST_FAST_PATH_NORMAL (OVER, a8b8g8r8, x8b8g8r8, sse2__),
-SIMPLE_NEAREST_FAST_PATH_NORMAL (OVER, a8r8g8b8, a8r8g8b8, sse2__),
-SIMPLE_NEAREST_FAST_PATH_NORMAL (OVER, a8b8g8r8, a8b8g8r8, sse2__),
+SIMPLE_NEAREST_FAST_PATH (OVER, a8r8g8b8, x8r8g8b8, sse2__),
+SIMPLE_NEAREST_FAST_PATH (OVER, a8b8g8r8, x8b8g8r8, sse2__),
+SIMPLE_NEAREST_FAST_PATH (OVER, a8r8g8b8, a8r8g8b8, sse2__),
+SIMPLE_NEAREST_FAST_PATH (OVER, a8b8g8r8, a8b8g8r8, sse2__),
 
 SIMPLE_NEAREST_SOLID_MASK_FAST_PATH (OVER, a8r8g8b8, a8r8g8b8, 
sse2__n_),
 SIMPLE_NEAREST_SOLID_MASK_FAST_PATH (OVER, a8b8g8r8, a8b8g8r8, 
sse2__n_),
-- 
1.7.5.4


[Pixman] [PATCH] mmx/sse2: Use SIMPLE_NEAREST_SOLID_MASK_FAST_PATH for NORMAL repeat

2015-05-26 Thread Ben Avison
These two architectures were the only place where
SIMPLE_NEAREST_SOLID_MASK_FAST_PATH was used, and in both cases the
equivalent SIMPLE_NEAREST_SOLID_MASK_FAST_PATH_NORMAL macro was used
immediately afterwards, so including the NORMAL case in the main macro
simplifies the fast path table.
---
 pixman/pixman-inlines.h |3 ++-
 pixman/pixman-mmx.c |4 
 pixman/pixman-sse2.c|4 
 3 files changed, 2 insertions(+), 9 deletions(-)

diff --git a/pixman/pixman-inlines.h b/pixman/pixman-inlines.h
index dd1c2f1..76338f3 100644
--- a/pixman/pixman-inlines.h
+++ b/pixman/pixman-inlines.h
@@ -747,7 +747,8 @@ fast_composite_scaled_nearest  ## scale_func_name 
(pixman_implementation_t *imp,
 #define SIMPLE_NEAREST_SOLID_MASK_FAST_PATH(op,s,d,func)   \
 SIMPLE_NEAREST_SOLID_MASK_FAST_PATH_COVER (op,s,d,func),   \
 SIMPLE_NEAREST_SOLID_MASK_FAST_PATH_NONE (op,s,d,func),\
-SIMPLE_NEAREST_SOLID_MASK_FAST_PATH_PAD (op,s,d,func)
+SIMPLE_NEAREST_SOLID_MASK_FAST_PATH_PAD (op,s,d,func),  \
+SIMPLE_NEAREST_SOLID_MASK_FAST_PATH_NORMAL (op,s,d,func),
 
 /*/
 
diff --git a/pixman/pixman-mmx.c b/pixman/pixman-mmx.c
index 877b6e9..05c48a4 100644
--- a/pixman/pixman-mmx.c
+++ b/pixman/pixman-mmx.c
@@ -4103,10 +4103,6 @@ static const pixman_fast_path_t mmx_fast_paths[] =
 SIMPLE_NEAREST_SOLID_MASK_FAST_PATH (OVER, a8b8g8r8, a8b8g8r8, 
mmx__n_ ),
 SIMPLE_NEAREST_SOLID_MASK_FAST_PATH (OVER, a8r8g8b8, x8r8g8b8, 
mmx__n_ ),
 SIMPLE_NEAREST_SOLID_MASK_FAST_PATH (OVER, a8b8g8r8, x8b8g8r8, 
mmx__n_ ),
-SIMPLE_NEAREST_SOLID_MASK_FAST_PATH_NORMAL (OVER, a8r8g8b8, a8r8g8b8, 
mmx__n_  ),
-SIMPLE_NEAREST_SOLID_MASK_FAST_PATH_NORMAL (OVER, a8b8g8r8, a8b8g8r8, 
mmx__n_  ),
-SIMPLE_NEAREST_SOLID_MASK_FAST_PATH_NORMAL (OVER, a8r8g8b8, x8r8g8b8, 
mmx__n_  ),
-SIMPLE_NEAREST_SOLID_MASK_FAST_PATH_NORMAL (OVER, a8b8g8r8, x8b8g8r8, 
mmx__n_  ),
 
 SIMPLE_BILINEAR_FAST_PATH (SRC, a8r8g8b8,  a8r8g8b8, mmx__ 
),
 SIMPLE_BILINEAR_FAST_PATH (SRC, a8r8g8b8,  x8r8g8b8, mmx__ 
),
diff --git a/pixman/pixman-sse2.c b/pixman/pixman-sse2.c
index 1a8c430..8955103 100644
--- a/pixman/pixman-sse2.c
+++ b/pixman/pixman-sse2.c
@@ -6283,10 +6283,6 @@ static const pixman_fast_path_t sse2_fast_paths[] =
 SIMPLE_NEAREST_SOLID_MASK_FAST_PATH (OVER, a8b8g8r8, a8b8g8r8, 
sse2__n_),
 SIMPLE_NEAREST_SOLID_MASK_FAST_PATH (OVER, a8r8g8b8, x8r8g8b8, 
sse2__n_),
 SIMPLE_NEAREST_SOLID_MASK_FAST_PATH (OVER, a8b8g8r8, x8b8g8r8, 
sse2__n_),
-SIMPLE_NEAREST_SOLID_MASK_FAST_PATH_NORMAL (OVER, a8r8g8b8, a8r8g8b8, 
sse2__n_),
-SIMPLE_NEAREST_SOLID_MASK_FAST_PATH_NORMAL (OVER, a8b8g8r8, a8b8g8r8, 
sse2__n_),
-SIMPLE_NEAREST_SOLID_MASK_FAST_PATH_NORMAL (OVER, a8r8g8b8, x8r8g8b8, 
sse2__n_),
-SIMPLE_NEAREST_SOLID_MASK_FAST_PATH_NORMAL (OVER, a8b8g8r8, x8b8g8r8, 
sse2__n_),
 
 SIMPLE_BILINEAR_FAST_PATH (SRC, a8r8g8b8, a8r8g8b8, sse2__),
 SIMPLE_BILINEAR_FAST_PATH (SRC, a8r8g8b8, x8r8g8b8, sse2__),
-- 
1.7.5.4



[Pixman] [PATCH] arm: Simplify PIXMAN_ARM_SIMPLE_NEAREST_A8_MASK_FAST_PATH

2015-05-26 Thread Ben Avison
This macro is a superset of the platform-neutral macro
SIMPLE_NEAREST_A8_MASK_FAST_PATH. In other words, in addition to the
_COVER, _NONE and _PAD suffixes, its expansion includes the _NORMAL suffix.
---
 pixman/pixman-arm-common.h |4 +---
 1 files changed, 1 insertions(+), 3 deletions(-)

diff --git a/pixman/pixman-arm-common.h b/pixman/pixman-arm-common.h
index 8a552fd..c039626 100644
--- a/pixman/pixman-arm-common.h
+++ b/pixman/pixman-arm-common.h
@@ -311,9 +311,7 @@ FAST_NEAREST_MAINLOOP_COMMON 
(cputype##_##name##_normal_##op, \
 
 /* Provide entries for the fast path table */
 #define PIXMAN_ARM_SIMPLE_NEAREST_A8_MASK_FAST_PATH(op,s,d,func)  \
-SIMPLE_NEAREST_A8_MASK_FAST_PATH_COVER (op,s,d,func), \
-SIMPLE_NEAREST_A8_MASK_FAST_PATH_NONE (op,s,d,func),  \
-SIMPLE_NEAREST_A8_MASK_FAST_PATH_PAD (op,s,d,func),   \
+SIMPLE_NEAREST_A8_MASK_FAST_PATH (op,s,d,func),   \
 SIMPLE_NEAREST_A8_MASK_FAST_PATH_NORMAL (op,s,d,func)
 
 /*/
-- 
1.7.5.4



Re: [Pixman] [RFC PATCH 3/3] test: add fence-image-self-test

2015-05-13 Thread Ben Avison

On Fri, 08 May 2015 13:45:37 +0100, Pekka Paalanen ppaala...@gmail.com wrote:

+ok = test_image_faults (PIXMAN_a8r8g8b8, 7, 5);
+ok = test_image_faults (PIXMAN_r8g8b8, 7, 5);
+ok = test_image_faults (PIXMAN_r5g6b5, 7, 5);
+ok = test_image_faults (PIXMAN_a8, 7, 5);
+ok = test_image_faults (PIXMAN_a4, 7, 5);
+ok = test_image_faults (PIXMAN_a1, 7, 5);


Maybe there's a case for checking that the addresses just the other side
of the boundaries *don't* cause segfaults, in addition to checking that
these ones *do*?

Ben


Re: [Pixman] [RFC PATCH 0/3] Add fence_image_create_bits() test helper

2015-05-13 Thread Ben Avison

On Fri, 08 May 2015 13:45:34 +0100, Pekka Paalanen ppaala...@gmail.com wrote:

The self-test uses the following new things that have not been used in
Pixman code base before:
- fork()
- waitpid()
- sys/wait.h

Should I be adding tests in configure.ac for these?


I don't know the answer to that one, sorry.


Ben, I think you know better what the actual tests for More accurate
COVER_CLIP_NEAREST|BILINEAR should look like. I hope this implements
the fence part properly. Would you like to write the out-of-bounds
tests or explain me what they should do?


Without going to a lot more trouble to trap bad accesses at finer than
page granularity, I think those fences are the best you can do.

The main aim of the test should be to check that when a scaled plot
requires source pixels going right up to the edge of the source image
coordinate space, that it doesn't read beyond the limits. At present,
this is guaranteed because we fall back from a COVER fetcher to a PAD/
REPEAT/NONE/REFLECT fetcher instead (or the equivalent categories of fast
paths) before we get all the way there - although this will only be
apparent from benchmarking or by interrupting the test in a debugger.

To prove the validity of the patch in question, we need to be able to
demonstrate that we can safely continue to use a COVER routine all the
way to the edge of the source. Or at least, if we have to compromise and
set a tighter threshold before switching to a more general purpose
routine, that it is still safe to do so.

What we need to test:
* Both nearest-neighbour and bilinear scaling.
* A range of scale factors, both reduction and enlargement. I remember
  that I used a variety of routines for different scales of reduction
  (1..2x, 2..3x, 3..4x and so on) to avoid having to hold the integer
  part of the increment in a register, so we need to test up to at least
  8x reduction.
* We need to test alignment with both the extreme minimum and extreme
  maximum source coordinates permitted. Given that the scale factors
  can't be assumed to be nice numbers, I think we need to test them
  independently, using the translation offsets in the transformation
  matrix to ensure they align as desired.
* Possibly test near miss cases as well, though these would probably
  want to be weighted so that positions at or closer to the limit are
  checked most.
* Test both horizontal and vertical axes.
* We should survey which Porter-Duff operations have scaled fast paths
  (across all platforms) and make sure we test each of those, plus at
  least one other which will guarantee that we're exercising the fetchers
  - XOR is probably safe for this.

For nearest-neighbour scaling, the acceptable criteria should be:
* Low end: centre of first output pixel coincides with source coordinate
  pixman_fixed_e
* High end: centre of last pixel output coincides with source coordinate
  pixman_fixed_1*width
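These two limits can be checked arithmetically. The sketch below assumes the
round-up nearest convention implied by the criteria above (sample index =
pixman_fixed_to_int (x - pixman_fixed_e); it mirrors, but is not copied
from, pixman's nearest scanline code):

```c
#include <assert.h>
#include <stdint.h>

typedef int32_t pixman_fixed_t;

#define pixman_fixed_e         ((pixman_fixed_t) 1)       /* as in pixman.h */
#define pixman_fixed_1         ((pixman_fixed_t) 0x10000)
#define pixman_fixed_to_int(f) ((int) ((f) >> 16))

/* Nearest-neighbour sample index for a 16.16 source coordinate x, using
 * the round-up convention: the pixel whose centre is nearest from above. */
static int
nearest_index (pixman_fixed_t x)
{
    return pixman_fixed_to_int (x - pixman_fixed_e);
}
```

With this convention, x = pixman_fixed_e fetches source pixel 0 and
x = pixman_fixed_1 * width fetches pixel width - 1, so both extremes stay
in bounds while any sub-pixel step beyond them would not.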

For bilinear scaling:
* Low end: centre of first output pixel coincides with source coordinate
  pixman_fixed_1/2 (i.e. centre of first source pixel)
* High end: I'm slightly torn on this one. You could say
  pixman_fixed_1*width - pixman_fixed_1/2
  But it's also true that BILINEAR_INTERPOLATION_BITS comes into play,
  meaning that some slightly higher coordinates should also be achievable
  without requiring any input from the out-of-bounds pixel. For
  BILINEAR_INTERPOLATION_BITS=7, anything up to and including 0x0.01FF
  higher than this should be indistinguishable. And yes, that does mean
  that there is a very slight bias because coordinates are truncated by
  9 bits, rounding towards zero.
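The 0x0.01FF slack follows from the weight truncation. A sketch, assuming
the fetcher keeps only the top BILINEAR_INTERPOLATION_BITS of the 16-bit
fractional part (7 bits in the case discussed):

```c
#include <assert.h>
#include <stdint.h>

#define BILINEAR_INTERPOLATION_BITS 7

/* Interpolation weight derived from the fractional part of a 16.16
 * coordinate: only the top BILINEAR_INTERPOLATION_BITS survive, so the
 * bottom 9 bits (for the 7-bit case) cannot affect the result. */
static unsigned
bilinear_weight (uint32_t frac16)
{
    return frac16 >> (16 - BILINEAR_INTERPOLATION_BITS);
}
```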

Having had to carefully think through writing the specification above, I
have some idea how I'd go about writing it - but if you can follow that
description, feel free to have a go yourself!

Ben


[Pixman] [PATCH v2] test: Add new fuzz tester targeting solid images

2015-05-07 Thread Ben Avison
This places a heavier emphasis on solid images than the other fuzz testers,
and tests both single-pixel repeating bitmap images and those created
using pixman_image_create_solid_fill(). In the former case, it also
exercises the case where the bitmap contents are written to after the
image's first use, which is not a use-case that any other test has
previously covered.
---
 test/Makefile.sources |1 +
 test/solid-test.c |  348 +
 2 files changed, 349 insertions(+), 0 deletions(-)
 create mode 100644 test/solid-test.c

diff --git a/test/Makefile.sources b/test/Makefile.sources
index 8b0e855..f09c3e4 100644
--- a/test/Makefile.sources
+++ b/test/Makefile.sources
@@ -23,6 +23,7 @@ TESTPROGRAMS =  \
composite-traps-test  \
region-contains-test  \
glyph-test\
+   solid-test\
stress-test   \
blitters-test \
affine-test   \
diff --git a/test/solid-test.c b/test/solid-test.c
new file mode 100644
index 000..4745b1f
--- /dev/null
+++ b/test/solid-test.c
@@ -0,0 +1,348 @@
+/*
+ * Copyright © 2015 RISC OS Open Ltd
+ *
+ * Permission to use, copy, modify, distribute, and sell this software and its
+ * documentation for any purpose is hereby granted without fee, provided that
+ * the above copyright notice appear in all copies and that both that
+ * copyright notice and this permission notice appear in supporting
+ * documentation, and that the name of the copyright holders not be used in
+ * advertising or publicity pertaining to distribution of the software without
+ * specific, written prior permission.  The copyright holders make no
+ * representations about the suitability of this software for any purpose.  It
+ * is provided "as is" without express or implied warranty.
+ *
+ * THE COPYRIGHT HOLDERS DISCLAIM ALL WARRANTIES WITH REGARD TO THIS
+ * SOFTWARE, INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND
+ * FITNESS, IN NO EVENT SHALL THE COPYRIGHT HOLDERS BE LIABLE FOR ANY
+ * SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
+ * WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN
+ * AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING
+ * OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS
+ * SOFTWARE.
+ *
+ * Author:  Ben Avison (bavi...@riscosopen.org)
+ *
+ */
+
+#include "utils.h"
+#include <stdlib.h>
+#include <stdio.h>
+
+#define WIDTH 32
+#define HEIGHT 32
+
+static const pixman_op_t op_list[] = {
+PIXMAN_OP_SRC,
+PIXMAN_OP_OVER,
+PIXMAN_OP_ADD,
+PIXMAN_OP_CLEAR,
+PIXMAN_OP_SRC,
+PIXMAN_OP_DST,
+PIXMAN_OP_OVER,
+PIXMAN_OP_OVER_REVERSE,
+PIXMAN_OP_IN,
+PIXMAN_OP_IN_REVERSE,
+PIXMAN_OP_OUT,
+PIXMAN_OP_OUT_REVERSE,
+PIXMAN_OP_ATOP,
+PIXMAN_OP_ATOP_REVERSE,
+PIXMAN_OP_XOR,
+PIXMAN_OP_ADD,
+PIXMAN_OP_MULTIPLY,
+PIXMAN_OP_SCREEN,
+PIXMAN_OP_OVERLAY,
+PIXMAN_OP_DARKEN,
+PIXMAN_OP_LIGHTEN,
+PIXMAN_OP_HARD_LIGHT,
+PIXMAN_OP_DIFFERENCE,
+PIXMAN_OP_EXCLUSION,
+#if 0 /* these use floating point math and are not always bitexact on 
different platforms */
+PIXMAN_OP_SATURATE,
+PIXMAN_OP_DISJOINT_CLEAR,
+PIXMAN_OP_DISJOINT_SRC,
+PIXMAN_OP_DISJOINT_DST,
+PIXMAN_OP_DISJOINT_OVER,
+PIXMAN_OP_DISJOINT_OVER_REVERSE,
+PIXMAN_OP_DISJOINT_IN,
+PIXMAN_OP_DISJOINT_IN_REVERSE,
+PIXMAN_OP_DISJOINT_OUT,
+PIXMAN_OP_DISJOINT_OUT_REVERSE,
+PIXMAN_OP_DISJOINT_ATOP,
+PIXMAN_OP_DISJOINT_ATOP_REVERSE,
+PIXMAN_OP_DISJOINT_XOR,
+PIXMAN_OP_CONJOINT_CLEAR,
+PIXMAN_OP_CONJOINT_SRC,
+PIXMAN_OP_CONJOINT_DST,
+PIXMAN_OP_CONJOINT_OVER,
+PIXMAN_OP_CONJOINT_OVER_REVERSE,
+PIXMAN_OP_CONJOINT_IN,
+PIXMAN_OP_CONJOINT_IN_REVERSE,
+PIXMAN_OP_CONJOINT_OUT,
+PIXMAN_OP_CONJOINT_OUT_REVERSE,
+PIXMAN_OP_CONJOINT_ATOP,
+PIXMAN_OP_CONJOINT_ATOP_REVERSE,
+PIXMAN_OP_CONJOINT_XOR,
+PIXMAN_OP_COLOR_DODGE,
+PIXMAN_OP_COLOR_BURN,
+PIXMAN_OP_SOFT_LIGHT,
+PIXMAN_OP_HSL_HUE,
+PIXMAN_OP_HSL_SATURATION,
+PIXMAN_OP_HSL_COLOR,
+PIXMAN_OP_HSL_LUMINOSITY,
+#endif
+};
+
+/* The first eight formats in the list are by far the most widely
+ * used formats, so we test those more than the others
+ */
+#define N_MOST_LIKELY_FORMATS 8
+
+static const pixman_format_code_t img_fmt_list[] = {
+PIXMAN_a8r8g8b8,
+PIXMAN_a8b8g8r8,
+PIXMAN_x8r8g8b8,
+PIXMAN_x8b8g8r8,
+PIXMAN_r5g6b5,
+PIXMAN_b5g6r5,
+PIXMAN_a8,
+PIXMAN_a1,
+PIXMAN_r3g3b2,
+PIXMAN_b8g8r8a8,
+PIXMAN_b8g8r8x8,
+PIXMAN_r8g8b8a8,
+PIXMAN_r8g8b8x8,
+PIXMAN_x14r6g6b6,
+PIXMAN_r8g8b8,
+PIXMAN_b8g8r8,
+#if 0 /* These are going to use floating point in the near future */
+PIXMAN_x2r10g10b10,
+PIXMAN_a2r10g10b10,
+PIXMAN_x2b10g10r10

Re: [Pixman] [PATCH] test: Add new fuzz tester targeting solid images

2015-05-07 Thread Ben Avison

On Wed, 06 May 2015 13:25:43 +0100, Pekka Paalanen ppaala...@gmail.com wrote:

this is written based on blitters-test.c, right? That would have been
interesting to mention, so new reviewers like me can compare with
existing code you mirrored.


Well, it's a new addition to the family of fuzz testers - the others are
 affine-test
 blitters-test
 composite-traps-test
 glyph-test
 matrix-test
 region-contains-test
 rotate-test
 scaling-test

I didn't consciously set out to borrow from blitters-test in particular,
but out of that set it tests the widest range of Porter-Duff operators,
and (jointly with glyph-test) the widest range of pixel formats, so
looking back, that's obviously the one I lifted the operator and format
arrays from.

Aside from the immediate bug that prompted the creation of solid-test, I
wanted to increase the proportion of solid source and mask images over
those tested elsewhere, because they represent a sizeable proportion of
fast paths and therefore a larger attack surface for bugs than was
represented to date. Although blitters-test already does a fair amount of
testing of this, the proportions are low (1/8 for source images) and it
doesn't exercise pixman_image_create_solid_fill images at all. The only
two fuzz testers that do use pixman_image_create_solid_fill are
composite-traps-test and glyph-test: they don't use the full range of
Porter-Duff operators, and they don't use pixman_image_create_solid_fill
for mask images at all.

I think a number of your comments arise from the expectation that the
test was only to test what happens when you change the contents of a 1x1
bits image - hopefully my aim to improve testing of both types of solid
images in general helps explain things.


This test hits the case of fully transparent/opaque - arbitrary color
change, but very rarely - fully transparent/opaque, never starting
with an arbitrary color. I understand the two latter cases are not
really testable: setting the fast path flags for an arbitrary color
when you could use fully transparent/opaque is not detectable via
correctness tests. Are there any cases where more symmetric test case
selection would be useful? If not, nevermind.


My thinking was that as long as we ensured that information about a solid
colour wasn't being cached, then we'd guard against my bug or any similar
ones being reintroduced. The two cases which are most likely to be
treated differently from the others, depending upon the operator, are
fully-transparent and fully-opaque, so as long as each of those featured
reasonably frequently on at least one side of the before/after divide, I
think that should detect any caching.


Here's how I understand the code.

Source image:
- 50% chance for multi-pixel bits image
- 25% chance for 1x1 bits image
- 25% chance for solid image

Mask image:
- 50% chance to not be used
- 12.5% chance for multi-pixel bits image without CA
- 6.25% chance for 1x1 bits image without CA
- 6.25% chance for solid image without CA
- 12.5% chance for multi-pixel bits image with CA
- 6.25% chance for 1x1 bits image with CA
- 6.25% chance for solid image with CA

Destination image:
- always multi-pixel

Both source and mask, when they are 1x1 bits images, have:
- 50% chance to start fully opaque white
- 50% chance to start fully transparent


Subtle distinction here: because the 1x1 image may use any pixel format,
the choices are between 0 and 2^bpp-1. Depending upon the format, there
may not be any pixel value which corresponds to transparent, and the
maximum value may not be opaque white (because the image may use a
palette).


and then switch to arbitrary random color for the actual test.

The problems this test attempts to reveal happen only with 1x1 bits
images, so shouldn't the chances for 1x1 bits images be higher?


Well, 34% of the tests feature at least one 1x1 bits image. If you
include the solid images - to measure the number of tests which would
select fast paths which feature at least one solid component, that figure
rises to 63%. That includes 13% where both the source and mask are solid,
which is silly (it'd make more sense for the caller to multiply the
constant colours and use an operation with a solid source and no mask
image). So by increasing the ratios, while you'd cut the number of tests
with multi-pixel source and mask (which are redundant with blitters-test)
you'd raise the number of tests of unrealistic combinations.
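For what it's worth, the quoted percentages follow directly from the selection probabilities listed earlier in the thread. Here's a quick back-of-envelope check (my own arithmetic sketch, not code from solid-test.c):

```python
# Selection probabilities from the breakdown quoted earlier:
p_src_1x1, p_src_solid = 0.25, 0.25        # source: 1x1 bits / solid fill
p_mask_1x1, p_mask_solid = 0.125, 0.125    # mask: 6.25% CA + 6.25% non-CA each

# At least one 1x1 bits image in the operation
at_least_one_1x1 = 1 - (1 - p_src_1x1) * (1 - p_mask_1x1)

# At least one "constant-colour" image (1x1 bits or solid fill)
p_src_const = p_src_1x1 + p_src_solid      # 0.5
p_mask_const = p_mask_1x1 + p_mask_solid   # 0.25
at_least_one_const = 1 - (1 - p_src_const) * (1 - p_mask_const)

# Both source and mask constant-colour (the "silly" combination)
both_const = p_src_const * p_mask_const

print(at_least_one_1x1, at_least_one_const, both_const)
# 0.34375 0.625 0.125 - i.e. the 34%, 63% and 13% figures quoted above
```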

There's a danger that a more sophisticated way to choose combinations of
images and operators to better reflect the share of different fast paths
would just obfuscate the code and could just as easily be achieved by
increasing the number of tests, so I didn't feel it was worth spending
too much brainpower on it.

Perhaps you'd be happier with a three-way split between op_n_bits,
op_n_bits_bits and op_bits_n_bits as in the version I'm just about to
post? That guarantees 100% of tests involve a solid image, and 67% of
tests involve at least one 1x1 bits image.



Re: [Pixman] [PATCH 5/5] test: Add a new benchmarker targeting affine operations

2015-04-23 Thread Ben Avison

On Thu, 23 Apr 2015 17:12:59 +0100, I wrote:

I imagine most of the time, you'll have a source image of fixed size, and
you'll either have a target destination size (in which case your task is
to calculate the transform matrix coefficients) or you'll have a target
scaling factor (in which case your task is to work out the destination
size that can be achieved without running off the sides of the source
image).


Just to add to this, an example has come to mind that demonstrates why
higher level software should really know about the pixman_fixed_e offset
for nearest-neighbour scaling in order to make best use of Pixman.

Assume we have a source image that's 101 pixels wide, and we know it has
to be plotted at a scale factor of 50%. How many destination pixels do we
tell Pixman to plot?

In reality, given a transform matrix whose diagonal is (0.5, 0.5, 1),
Pixman is capable of picking out source pixels with index 0, 2, 4, 6 ...
100 - that's 51 pixels - whilst still using a COVER_CLIP fast path. If it
weren't for the pixman_fixed_e offset, it would pick out source pixels 1,
3, 5, 7 ... 99 instead - that's 50 pixels.

Now, if the caller asks for 50 pixels and Pixman uses the offset, then we
get to use the COVER_CLIP fast path, but we've trimmed two pixels off the
right and none off the left.

But if the caller asks for 51 pixels and Pixman doesn't use the offset,
then Pixman realises it needs a value for out-of-bounds pixel index 101
so won't use the COVER_CLIP fast path.

So, for the best combination of image quality and speed, the details of
Pixman's rounding algorithm *can* be significant.
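To make the arithmetic concrete, here is a floating-point model of the rounding described above - sample at the centre of the destination pixel, bias by -pixman_fixed_e, round down. This is my own sketch of the behaviour, not Pixman's actual 16.16 fixed-point code:

```python
import math

PIXMAN_FIXED_E = 1.0 / 65536   # smallest 16.16 fixed-point increment

def nearest_src_index(dst_x, dst_to_src_scale, apply_offset):
    # Sample at the centre of destination pixel dst_x, mapped into source
    # space; Pixman additionally biases by -pixman_fixed_e before flooring.
    pos = (dst_x + 0.5) * dst_to_src_scale
    if apply_offset:
        pos -= PIXMAN_FIXED_E
    return math.floor(pos)

# Plotting at 50% scale means the destination-to-source scale factor is 2
with_offset    = [nearest_src_index(x, 2, True)  for x in range(51)]
without_offset = [nearest_src_index(x, 2, False) for x in range(51)]

print(with_offset[0], with_offset[-1])        # 0 100 - source pixels 0, 2, ... 100
print(without_offset[0], without_offset[-1])  # 1 101 - and index 101 is out of bounds
```

So 51 destination pixels stay within a 101-pixel source only when the offset is applied, which is what allows the COVER_CLIP fast path.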

Ben
___
Pixman mailing list
Pixman@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/pixman


[Pixman] [PATCH] test: Add new fuzz tester targeting solid images

2015-04-22 Thread Ben Avison
This places a heavier emphasis on solid images than the other fuzz testers,
and tests both single-pixel repeating bitmap images and those created
and tests both single-pixel repeating bitmap images as well as those created
using pixman_image_create_solid_fill(). In the former case, it also
exercises the case where the bitmap contents are written to after the
image's first use, which is not a use-case that any other test has
previously covered.
---
 .gitignore|1 +
 test/Makefile.sources |1 +
 test/solid-test.c |  332 +
 3 files changed, 334 insertions(+), 0 deletions(-)
 create mode 100644 test/solid-test.c

diff --git a/.gitignore b/.gitignore
index 0f11496..8bfb48c 100644
--- a/.gitignore
+++ b/.gitignore
@@ -77,6 +77,7 @@ test/scaling-crash-test
 test/scaling-helpers-test
 test/scaling-test
 test/screen-test
+test/solid-test
 test/stress-test
 test/trap-crasher
 test/trap-test
diff --git a/test/Makefile.sources b/test/Makefile.sources
index 8b0e855..f09c3e4 100644
--- a/test/Makefile.sources
+++ b/test/Makefile.sources
@@ -23,6 +23,7 @@ TESTPROGRAMS =  \
composite-traps-test  \
region-contains-test  \
glyph-test\
+   solid-test\
stress-test   \
blitters-test \
affine-test   \
diff --git a/test/solid-test.c b/test/solid-test.c
new file mode 100644
index 000..070652a
--- /dev/null
+++ b/test/solid-test.c
@@ -0,0 +1,332 @@
+/*
+ * Copyright © 2015 RISC OS Open Ltd
+ *
+ * Permission to use, copy, modify, distribute, and sell this software and its
+ * documentation for any purpose is hereby granted without fee, provided that
+ * the above copyright notice appear in all copies and that both that
+ * copyright notice and this permission notice appear in supporting
+ * documentation, and that the name of the copyright holders not be used in
+ * advertising or publicity pertaining to distribution of the software without
+ * specific, written prior permission.  The copyright holders make no
+ * representations about the suitability of this software for any purpose.  It
+ * is provided "as is" without express or implied warranty.
+ *
+ * THE COPYRIGHT HOLDERS DISCLAIM ALL WARRANTIES WITH REGARD TO THIS
+ * SOFTWARE, INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND
+ * FITNESS, IN NO EVENT SHALL THE COPYRIGHT HOLDERS BE LIABLE FOR ANY
+ * SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
+ * WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN
+ * AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING
+ * OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS
+ * SOFTWARE.
+ *
+ * Author:  Ben Avison (bavi...@riscosopen.org)
+ *
+ */
+
+#include "utils.h"
+#include <stdlib.h>
+#include <stdio.h>
+
+#define WIDTH 32
+#define HEIGHT 32
+
+static const pixman_op_t op_list[] = {
+PIXMAN_OP_SRC,
+PIXMAN_OP_OVER,
+PIXMAN_OP_ADD,
+PIXMAN_OP_CLEAR,
+PIXMAN_OP_SRC,
+PIXMAN_OP_DST,
+PIXMAN_OP_OVER,
+PIXMAN_OP_OVER_REVERSE,
+PIXMAN_OP_IN,
+PIXMAN_OP_IN_REVERSE,
+PIXMAN_OP_OUT,
+PIXMAN_OP_OUT_REVERSE,
+PIXMAN_OP_ATOP,
+PIXMAN_OP_ATOP_REVERSE,
+PIXMAN_OP_XOR,
+PIXMAN_OP_ADD,
+PIXMAN_OP_MULTIPLY,
+PIXMAN_OP_SCREEN,
+PIXMAN_OP_OVERLAY,
+PIXMAN_OP_DARKEN,
+PIXMAN_OP_LIGHTEN,
+PIXMAN_OP_HARD_LIGHT,
+PIXMAN_OP_DIFFERENCE,
+PIXMAN_OP_EXCLUSION,
+#if 0 /* these use floating point math and are not always bitexact on different platforms */
+PIXMAN_OP_SATURATE,
+PIXMAN_OP_DISJOINT_CLEAR,
+PIXMAN_OP_DISJOINT_SRC,
+PIXMAN_OP_DISJOINT_DST,
+PIXMAN_OP_DISJOINT_OVER,
+PIXMAN_OP_DISJOINT_OVER_REVERSE,
+PIXMAN_OP_DISJOINT_IN,
+PIXMAN_OP_DISJOINT_IN_REVERSE,
+PIXMAN_OP_DISJOINT_OUT,
+PIXMAN_OP_DISJOINT_OUT_REVERSE,
+PIXMAN_OP_DISJOINT_ATOP,
+PIXMAN_OP_DISJOINT_ATOP_REVERSE,
+PIXMAN_OP_DISJOINT_XOR,
+PIXMAN_OP_CONJOINT_CLEAR,
+PIXMAN_OP_CONJOINT_SRC,
+PIXMAN_OP_CONJOINT_DST,
+PIXMAN_OP_CONJOINT_OVER,
+PIXMAN_OP_CONJOINT_OVER_REVERSE,
+PIXMAN_OP_CONJOINT_IN,
+PIXMAN_OP_CONJOINT_IN_REVERSE,
+PIXMAN_OP_CONJOINT_OUT,
+PIXMAN_OP_CONJOINT_OUT_REVERSE,
+PIXMAN_OP_CONJOINT_ATOP,
+PIXMAN_OP_CONJOINT_ATOP_REVERSE,
+PIXMAN_OP_CONJOINT_XOR,
+PIXMAN_OP_COLOR_DODGE,
+PIXMAN_OP_COLOR_BURN,
+PIXMAN_OP_SOFT_LIGHT,
+PIXMAN_OP_HSL_HUE,
+PIXMAN_OP_HSL_SATURATION,
+PIXMAN_OP_HSL_COLOR,
+PIXMAN_OP_HSL_LUMINOSITY,
+#endif
+};
+
+/* The first eight formats in the list are by far the most widely
+ * used formats, so we test those more than the others
+ */
+#define N_MOST_LIKELY_FORMATS 8
+
+static const pixman_format_code_t img_fmt_list[] = {
+PIXMAN_a8r8g8b8,
+PIXMAN_a8b8g8r8,
+PIXMAN_x8r8g8b8,
+PIXMAN_x8b8g8r8,
+PIXMAN_r5g6b5,
+PIXMAN_b5g6r5,
+PIXMAN_a8,
+PIXMAN_a1,
+PIXMAN_r3g3b2

Re: [Pixman] [PATCH 10 & 11/37] in_8888_8 fast path

2015-04-22 Thread Ben Avison

On Sun, 05 Oct 2014 20:03:42 +0100, Søren Sandmann <soren.sandm...@gmail.com> wrote:

These can presumably also be used for the a8r8g8b8_sRGB format since
all we care about is the alpha channel.


Looks like it, yes. Updated versions of the three in_8888_8 patches
coming up.


Also, is this actually showing up in the wild? Which applications?


Checking my records, it was the Epiphany browser (default browser on
Raspbian) that was responsible.

Ben


Re: [Pixman] [PATCH 05/32] pixman-image: Early detection of solid 1x1 repeating source images

2015-04-22 Thread Ben Avison

On Thu, 07 Aug 2014 17:50:01 +0100, I wrote:

Doesn't handle every pixel format, but for those that it does, enables
early conversion of OVER to SRC, for example.
---
 pixman/pixman-image.c |6 ++
 1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/pixman/pixman-image.c b/pixman/pixman-image.c
index 1ff1a49..ad6e82e 100644
--- a/pixman/pixman-image.c
+++ b/pixman/pixman-image.c

[...]

On Sun, 05 Oct 2014 20:03:42 +0100, Søren Sandmann soren.sandm...@gmail.com 
wrote:

I don't think this works as written. What if someone writes a
non-opaque pixel to the 1x1 image after it has been used once? Then
the flags would be wrong.


This appears to be a valid concern, and one which isn't exercised by any
of the existing tests. I have written a new fuzz tester which does detect
the behavioural change of the original version of this patch, and an
updated patch which doesn't write the updated flags back to the
pixman_image_t struct but re-evaluates them on every usage. I will post
these in a moment, though you may wish to note that I've slightly changed
the subject (the new fix is no longer in pixman-image.c, and in Pixman
terminology, solid means an image has the same colour and alpha
components at all spatial coordinates, rather than being a synonym for
opaque).

Ben


[Pixman] [PATCH 05/37 v2] composite flags: Early detection of fully opaque 1x1 repeating source images

2015-04-22 Thread Ben Avison
Doesn't handle every pixel format, but for those that it does, enables early
conversion of OVER to SRC, for example.
---
 pixman/pixman.c |   28 +---
 1 files changed, 25 insertions(+), 3 deletions(-)

diff --git a/pixman/pixman.c b/pixman/pixman.c
index 9555cea..08293c8 100644
--- a/pixman/pixman.c
+++ b/pixman/pixman.c
@@ -325,6 +325,27 @@ _pixman_compute_composite_region32 (pixman_region32_t * 
region,
 return TRUE;
 }
 
+static uint32_t
+check_for_single_pixel_opaque_image (pixman_image_t *image)
+{
+if (image->type == BITS &&
+image->common.extended_format_code == PIXMAN_solid &&
+(image->common.flags & FAST_PATH_NO_ACCESSORS) != 0 &&
+(((image->bits.format == PIXMAN_a8) &&
+*(uint8_t *)image->bits.bits == 0xFF) ||
+ ((image->bits.format == PIXMAN_b8g8r8a8 || image->bits.format == PIXMAN_r8g8b8a8) &&
+(0xff & ~image->bits.bits[0]) == 0) ||
+ ((image->bits.format == PIXMAN_a8r8g8b8 || image->bits.format == PIXMAN_a8b8g8r8 || image->bits.format == PIXMAN_a8r8g8b8_sRGB) &&
+(0xff000000 & ~image->bits.bits[0]) == 0)) &&
+ !image->common.alpha_map &&
+ image->common.filter != PIXMAN_FILTER_CONVOLUTION &&
+ image->common.filter != PIXMAN_FILTER_SEPARABLE_CONVOLUTION &&
+ !image->common.component_alpha)
+return FAST_PATH_SAMPLES_OPAQUE | FAST_PATH_IS_OPAQUE;
+else
+return 0;
+}
+
 typedef struct
 {
 pixman_fixed_48_16_t   x1;
@@ -595,12 +616,13 @@ pixman_image_composite32 (pixman_op_t  op,
 _pixman_image_validate (dest);
 
 src_format = src->common.extended_format_code;
-info.src_flags = src->common.flags;
+info.src_flags = src->common.flags | check_for_single_pixel_opaque_image (src);
 
-if (mask && !(mask->common.flags & FAST_PATH_IS_OPAQUE))
+if (mask)
+info.mask_flags = mask->common.flags | check_for_single_pixel_opaque_image (mask);
+if (mask && !(info.mask_flags & FAST_PATH_IS_OPAQUE))
 {
	mask_format = mask->common.extended_format_code;
-	info.mask_flags = mask->common.flags;
 }
 else
 {
-- 
1.7.5.4



[Pixman] [PATCH 11/37 v2] armv6: Add in_8888_8 fast path

2015-04-22 Thread Ben Avison
This is used instead of the equivalent C fast path.

lowlevel-blt-bench results, compared to no fast path at all:

Before  After
Mean   StdDev   Mean   StdDev  Confidence  Change
L1  12.4   0.1  117.5  2.3 100.0%  +851.2%
L2  9.50.1  46.9   2.4 100.0%  +393.8%
M   9.60.0  61.9   0.9 100.0%  +544.0%
HT  7.90.0  26.6   0.5 100.0%  +238.6%
VT  7.70.0  24.2   0.4 100.0%  +212.5%
R   7.40.0  22.4   0.4 100.0%  +204.5%
RT  4.10.0  8.70.2 100.0%  +109.4%
---
 pixman/pixman-arm-simd-asm.S |  111 ++
 pixman/pixman-arm-simd.c |5 ++
 2 files changed, 116 insertions(+), 0 deletions(-)

diff --git a/pixman/pixman-arm-simd-asm.S b/pixman/pixman-arm-simd-asm.S
index 08d6709..6c77fd3 100644
--- a/pixman/pixman-arm-simd-asm.S
+++ b/pixman/pixman-arm-simd-asm.S
@@ -1291,3 +1291,114 @@ generate_composite_function \
 over_n_0565_process_tail
 
 
/**/
+
+.macro in_8888_8_init
+SRC0.req   Y
+SRC1.req   STRIDE_D
+SRC2.req   STRIDE_S
+SRC3.req   MASK
+HALF.req   STRIDE_M
+TMP .req   ORIG_W
+line_saved_regs  Y, STRIDE_D, STRIDE_S, ORIG_W
+ldr SCRATCH, =0x00800080
+mov HALF, #0x80
+/* Set GE[3:0] to 0101 so SEL instructions do what we want */
+uadd8   SCRATCH, SCRATCH, SCRATCH
+/* Offset the source pointer: we only need the alpha bytes */
+add SRC, SRC, #3
+.endm
+
+.macro in_8888_8_cleanup
+.unreq  SRC0
+.unreq  SRC1
+.unreq  SRC2
+.unreq  SRC3
+.unreq  HALF
+.unreq  TMP
+.endm
+
+.macro in_8888_8_4pixels_head dst
+ldr TMP, [DST], #4
+ldrb    SRC0, [SRC], #12
+ldrb    SRC3, [SRC], #-4
+ldrb    SRC2, [SRC], #-4
+uxtb16  WKdst, TMP
+uxtb16  TMP, TMP, ror #8
+ldrb    SRC1, [SRC], #12
+smlabb  SRC0, SRC0, WKdst, HALF
+smlabt  SRC3, SRC3, TMP, HALF
+smlabt  SRC2, SRC2, WKdst, HALF
+smlabb  SRC1, SRC1, TMP, HALF
+orr WKdst, SRC0, SRC2, lsl #16
+/* There'd be a stall here if immediately followed by orr, so
+ * fill it with something like a preload if possible */
+.endm
+
+.macro in_8888_8_4pixels_tail dst
+orr TMP, SRC1, SRC3, lsl #16
+uxtab16 WKdst, WKdst, WKdst, ror #8
+uxtab16 TMP, TMP, TMP, ror #8
+mov WKdst, WKdst, ror #8
+sel WKdst, WKdst, TMP
+.endm
+
+.macro in_8888_8_process_head  cond, numbytes, firstreg, unaligned_src, unaligned_mask, preload
+ .if numbytes == 1
+ldrb    WK3, [DST], #1
+ldrb    SRC0, [SRC], #4
+ .elseif numbytes == 2
+ldrb    WK3, [DST], #1
+ldrb    SRC0, [SRC], #4
+ldrb    TMP, [DST], #1
+ldrb    SRC1, [SRC], #4
+ .else
+  .if numbytes >= 8
+   .if numbytes == 16
+in_8888_8_4pixels_head 0
+in_8888_8_4pixels_tail 0
+in_8888_8_4pixels_head 1
+.if preload
+PF  bic,SCRATCH, SRC, #31
+PF  pld,[SCRATCH, #32*prefetch_distance]
+.endif
+in_8888_8_4pixels_tail 1
+   .endif
+in_8888_8_4pixels_head 2
+in_8888_8_4pixels_tail 2
+  .endif
+in_8888_8_4pixels_head 3
+ .endif
+.endm
+
+.macro in_8888_8_process_tail  cond, numbytes, firstreg
+ .if numbytes == 1
+smlabb  WK3, SRC0, WK3, HALF
+add WK3, WK3, WK3, lsr #8
+mov WK3, WK3, lsr #8
+strb    WK3, [DST, #-1]
+ .elseif numbytes == 2
+smlabb  WK3, SRC0, WK3, HALF
+smlabb  TMP, SRC1, TMP, HALF
+add WK3, WK3, WK3, lsr #8
+add TMP, TMP, TMP, lsr #8
+mov WK3, WK3, lsr #8
+mov TMP, TMP, lsr #8
+strb    WK3, [DST, #-2]
+strb    TMP, [DST, #-1]
+ .else
+in_8888_8_4pixels_tail 3
+pixst   , numbytes, (4-numbytes/4), DST
+ .endif
+.endm
+
+generate_composite_function \
+pixman_composite_in_8888_8_asm_armv6, 32, 0, 8, \
+FLAG_DST_READWRITE | FLAG_BRANCH_OVER | FLAG_PROCESS_DOES_STORE | FLAG_SPILL_LINE_VARS, \
+3, /* prefetch distance */ \
+in_8888_8_init, \
+nop_macro, /* newline */ \
+in_8888_8_cleanup, \
+in_8888_8_process_head, \
+in_8888_8_process_tail
+
+/**/
diff --git a/pixman/pixman-arm-simd.c b/pixman/pixman-arm-simd.c
index 31f960d..55f16ea 100644
--- a/pixman/pixman-arm-simd.c
+++ b/pixman/pixman-arm-simd.c
@@ -48,6 +48,8 @@ PIXMAN_ARM_BIND_FAST_PATH_SRC_DST (armv6, add_8_8,
uint8_t, 1, uint8_t, 1)
 PIXMAN_ARM_BIND_FAST_PATH_SRC_DST (armv6, over_8888_8888,
uint32_t, 1, uint32_t, 1)
+PIXMAN_ARM_BIND_FAST_PATH_SRC_DST (armv6, 

[Pixman] [PATCH 10/37 v2] pixman-fast-path: Add in_8888_8 fast path

2015-04-22 Thread Ben Avison
This is a C fast path, useful for reference or for platforms that don't
have their own fast path for this operation.

lowlevel-blt-bench results on ARMv6:

Before  After
Mean   StdDev   Mean   StdDev  Confidence  Change
L1  12.4   0.1  24.4   0.3 100.0%  +97.8%
L2  9.50.1  14.1   0.2 100.0%  +48.1%
M   9.60.0  14.7   0.0 100.0%  +53.1%
HT  7.90.0  12.0   0.1 100.0%  +52.3%
VT  7.70.0  11.6   0.1 100.0%  +49.8%
R   7.40.0  10.8   0.1 100.0%  +47.2%
RT  4.10.0  6.10.1 100.0%  +48.2%
---
 pixman/pixman-fast-path.c |   40 
 1 files changed, 40 insertions(+), 0 deletions(-)

diff --git a/pixman/pixman-fast-path.c b/pixman/pixman-fast-path.c
index c7ed0f0..e465642 100644
--- a/pixman/pixman-fast-path.c
+++ b/pixman/pixman-fast-path.c
@@ -262,6 +262,43 @@ fast_composite_in_8_8 (pixman_implementation_t *imp,
 }
 
 static void
+fast_composite_in_8888_8 (pixman_implementation_t *imp,
+  pixman_composite_info_t *info)
+{
+PIXMAN_COMPOSITE_ARGS (info);
+uint8_t *dst_line, *dst;
+uint32_t*src_line, *src;
+int dst_stride, src_stride;
+int32_t w;
+uint8_t s;
+uint16_t t;
+
+PIXMAN_IMAGE_GET_LINE (src_image, src_x, src_y, uint32_t, src_stride, src_line, 1);
+PIXMAN_IMAGE_GET_LINE (dest_image, dest_x, dest_y, uint8_t, dst_stride, dst_line, 1);
+
+while (height--)
+{
+dst = dst_line;
+dst_line += dst_stride;
+src = src_line;
+src_line += src_stride;
+w = width;
+
+while (w--)
+{
+s = *src++ >> 24;
+
+if (s == 0)
+*dst = 0;
+else if (s != 0xff)
+*dst = MUL_UN8 (s, *dst, t);
+
+dst++;
+}
+}
+}
+
+static void
fast_composite_over_n_8_8888 (pixman_implementation_t *imp,
   pixman_composite_info_t *info)
 {
@@ -1948,6 +1985,9 @@ static const pixman_fast_path_t c_fast_paths[] =
 PIXMAN_STD_FAST_PATH (SRC, a1r5g5b5, null, x1r5g5b5, 
fast_composite_src_memcpy),
 PIXMAN_STD_FAST_PATH (SRC, a8, null, a8, fast_composite_src_memcpy),
 PIXMAN_STD_FAST_PATH (IN, a8, null, a8, fast_composite_in_8_8),
+PIXMAN_STD_FAST_PATH (IN, a8r8g8b8, null, a8, fast_composite_in_8888_8),
+PIXMAN_STD_FAST_PATH (IN, a8b8g8r8, null, a8, fast_composite_in_8888_8),
+PIXMAN_STD_FAST_PATH (IN, a8r8g8b8_sRGB, null, a8, fast_composite_in_8888_8),
 PIXMAN_STD_FAST_PATH (IN, solid, a8, a8, fast_composite_in_n_8_8),
 
 SIMPLE_NEAREST_FAST_PATH (SRC, x8r8g8b8, x8r8g8b8, _),
-- 
1.7.5.4



[Pixman] [PATCH 23/37 v2] armv6: Add optimised scanline fetchers and writeback for r5g6b5 and a8

2015-04-22 Thread Ben Avison
This supports r5g6b5 source and destination images, and a8 source images.

lowlevel-blt-bench results for example operations which use these because
they lack a dedicated fast path at the time of writing:

in_reverse_8_8888

Before  After
Mean   StdDev   Mean   StdDev  Confidence  Change
L1  30.0   0.3  37.0   0.3 100.0%  +23.2%
L2  23.3   0.3  29.4   0.4 100.0%  +26.1%
M   24.0   0.0  31.3   0.1 100.0%  +30.5%
HT  12.8   0.1  16.1   0.1 100.0%  +25.8%
VT  11.9   0.1  14.8   0.1 100.0%  +24.6%
R   11.7   0.1  14.6   0.1 100.0%  +24.5%
RT  5.10.1  6.20.1 100.0%  +20.2%

in_0565_8888

Before  After
Mean   StdDev   Mean   StdDev  Confidence  Change
L1  22.0   0.1  28.3   0.2 100.0%  +28.4%
L2  16.6   0.2  23.6   0.3 100.0%  +42.2%
M   16.5   0.0  24.7   0.1 100.0%  +49.5%
HT  11.0   0.1  13.7   0.1 100.0%  +24.4%
VT  10.7   0.0  13.1   0.1 100.0%  +22.0%
R   10.3   0.0  12.6   0.1 100.0%  +22.5%
RT  5.30.1  5.70.1 100.0%  +9.0%

in_reverse_8888_0565

Before  After
Mean   StdDev   Mean   StdDev  Confidence  Change
L1  16.6   0.1  20.9   0.1 100.0%  +25.5%
L2  13.1   0.1  17.7   0.3 100.0%  +35.3%
M   13.2   0.0  19.2   0.0 100.0%  +45.3%
HT  9.60.0  11.7   0.1 100.0%  +21.8%
VT  9.30.0  11.4   0.1 100.0%  +22.4%
R   9.00.0  10.9   0.1 100.0%  +21.1%
RT  4.70.1  5.20.1 100.0%  +8.7%
---
 pixman/pixman-arm-common.h   |   31 ++
 pixman/pixman-arm-simd-asm.S |   94 ++
 pixman/pixman-arm-simd.c |   55 
 3 files changed, 180 insertions(+), 0 deletions(-)

diff --git a/pixman/pixman-arm-common.h b/pixman/pixman-arm-common.h
index 3558c15..f4632b2 100644
--- a/pixman/pixman-arm-common.h
+++ b/pixman/pixman-arm-common.h
@@ -453,4 +453,35 @@ cputype##_combine_##name##_u (pixman_implementation_t 
*imp,   \
 pixman_composite_scanline_##name##_asm_##cputype (width, dest, src);  \
 }
 
+/*/
+
+#define PIXMAN_ARM_BIND_GET_SCANLINE(cputype, name) \
+void\
+pixman_get_scanline_##name##_asm_##cputype (int32_tw,   \
+uint32_t   *dst,\
+const uint32_t *src);   \
+\
+uint32_t *  \
+cputype##_get_scanline_##name (pixman_iter_t *iter, const uint32_t *mask)   \
+{   \
+pixman_get_scanline_##name##_asm_##cputype (iter->width, iter->buffer,  \
+(uint32_t *) iter->bits);   \
+iter->bits += iter->stride; \
+return iter->buffer;\
+}
+
+#define PIXMAN_ARM_BIND_WRITE_BACK(cputype, name)                           \
+void                                                                        \
+pixman_write_back_##name##_asm_##cputype (int32_t        w,                 \
+                                          uint32_t      *dst,               \
+                                          const uint32_t *src);             \
+                                                                            \
+void                                                                        \
+cputype##_write_back_##name (pixman_iter_t *iter)                           \
+{                                                                           \
+    pixman_write_back_##name##_asm_##cputype (iter->width,                  \
+                                              (uint32_t *)(iter->bits - iter->stride), \
+                                              iter->buffer);                \
+}
+
 #endif
diff --git a/pixman/pixman-arm-simd-asm.S b/pixman/pixman-arm-simd-asm.S
index f61b715..b251187 100644
--- a/pixman/pixman-arm-simd-asm.S
+++ b/pixman/pixman-arm-simd-asm.S
@@ -388,6 +388,16 @@ generate_composite_function \
 src_0565_8888_process_head, \
 src_0565_8888_process_tail
 
+generate_composite_function_single_scanline \
+pixman_get_scanline_r5g6b5_asm_armv6, 16, 0, 32, \
+FLAG_DST_WRITEONLY | FLAG_BRANCH_OVER, \
+3, /* prefetch distance */ \
+src_0565_8888_init, \
+nop_macro, /* newline */ \
+

Re: [Pixman] [PATCH 04/32] pixman-general: Tighten up calculation of temporary buffer sizes

2015-04-21 Thread Ben Avison

On Thu, 07 Aug 2014 17:50:00 +0100, I wrote:

There's no need to align after the end of the temporary destination
buffer, and each of the remaining aligns can only add a maximum of 15
bytes to the space requirement. This permits some edge cases to use the
stack buffer where previously it would have deduced that a heap buffer
was required.
---
 pixman/pixman-general.c |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/pixman/pixman-general.c b/pixman/pixman-general.c
index 7cdea29..c8f272b 100644
--- a/pixman/pixman-general.c
+++ b/pixman/pixman-general.c
@@ -159,10 +159,10 @@ general_composite_rect  (pixman_implementation_t *imp,
 mask_buffer = ALIGN (src_buffer + width * Bpp);
 dest_buffer = ALIGN (mask_buffer + width * Bpp);
-if (ALIGN (dest_buffer + width * Bpp) >
+if (dest_buffer + width * Bpp >
 	scanline_buffer + sizeof (stack_scanline_buffer))
 {
-   scanline_buffer = pixman_malloc_ab_plus_c (width, Bpp * 3, 32 * 3);
+   scanline_buffer = pixman_malloc_ab_plus_c (width, Bpp * 3, 15 * 3);
if (!scanline_buffer)
return;


I must say I never expected this patch to be controversial. I was merely
tinkering with some nearby code and saw this, which to me was an obvious
bug. I'm going to stand behind this patch for now, as explained below...

On Sun, 05 Oct 2014 20:03:42 +0100, Søren Sandmann <soren.sandm...@gmail.com> wrote:

The idea behind aligning the final buffer is that it allows fetchers
to assume that they can do a full 16-byte SIMD write to the buffer. We
may possibly not do that, though.


I assume you mean combiner rather than fetcher here; the fetchers write
to the source and mask line buffers, and the combiner writes to the
destination line buffer. I've been thinking about this, and while perhaps
what you describe was the intention at one point, I reckon we can't
permit a combiner to write beyond its buffer limits because of the dest
iters in pixman-noop.c which force the combiner to write directly into
the destination bitmap when it's a8r8g8b8 or x8r8g8b8 format. This is
already a common use case, and will only become more so.

If the combiner is beholden to write exactly the right number of bytes
then there's no benefit to over-allocating the line buffer.


The change to use 15 instead of 32 is alright I guess. Is this
something that actually shows up in practice?


Here's a worked example. For any size of stack buffer that you choose,
there will be a few row lengths that do an unnecessary heap allocate and
free that can be avoided by using this patch.

Assume stack allocations are aligned to 8-byte boundaries.
Assume Narrow pixel format (4 bytes per pixel).
Let width = 2045 pixels.
The size of stack_scanline_buffer is determined by a compile-time define,
currently 8 kiB * 3 channels.

Suppose stack_scanline_buffer = [0x100008,0x106008), then:
src_buffer  = [0x100010,0x102004)
mask_buffer = [0x102010,0x104004)
dest_buffer = [0x104010,0x106004)

In this case it is clear that the three small buffers fit within
stack_scanline_buffer, but if the test is performed on the end address of
dest_buffer rounded up to a 16-byte boundary - which would be 0x106010 -
then you would incorrectly deduce that they don't.
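The worked example can be replayed numerically; the concrete start address below is an assumption chosen to be consistent with the quoted figures:

```python
def align16(addr):
    # Round up to the next 16-byte boundary, as pixman-general.c's ALIGN() does
    return (addr + 15) & ~15

width, Bpp = 2045, 4
buf_start = 0x100008                # an 8-byte-aligned stack address (assumed)
buf_end = buf_start + 8192 * 3      # 8 kiB per channel

src_buffer  = align16(buf_start)
mask_buffer = align16(src_buffer + width * Bpp)
dest_buffer = align16(mask_buffer + width * Bpp)
end = dest_buffer + width * Bpp

print(hex(src_buffer), hex(mask_buffer), hex(dest_buffer), hex(end))
# 0x100010 0x102010 0x104010 0x106004

assert end <= buf_end               # the three buffers really do fit on the stack...
assert align16(end) > buf_end       # ...yet the old ALIGN'd test concludes they don't
```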

Ben


[Pixman] [PATCH] armv7: Re-use existing fast paths in more cases

2015-04-20 Thread Ben Avison
There are a group of combiner types - SRC, OVER_REVERSE, IN, OUT and ADD -
where the source alpha affects only the destination alpha component. This
means that any fast path with a8r8g8b8 source and destination can also be
applied to an equivalent operation with x8r8g8b8 source and destination
just by updating the fast path table, and likewise with a8b8g8r8 and
x8b8g8r8. The following operations are affected:

add_x888_8_x888   (and bilinear scaled version of same)
add_x888_8888_x888
add_x888_n_x888
add_x888_x888 (and bilinear scaled version of same)
---
 pixman/pixman-arm-neon.c |9 +
 1 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/pixman/pixman-arm-neon.c b/pixman/pixman-arm-neon.c
index 92519e3..2f669cb 100644
--- a/pixman/pixman-arm-neon.c
+++ b/pixman/pixman-arm-neon.c
@@ -349,18 +349,25 @@ static const pixman_fast_path_t arm_neon_fast_paths[] =
 PIXMAN_STD_FAST_PATH (ADD,  a8,   a8,   a8,   neon_composite_add_8_8_8),
 PIXMAN_STD_FAST_PATH (ADD,  r5g6b5,   a8,   r5g6b5,   neon_composite_add_0565_8_0565),
 PIXMAN_STD_FAST_PATH (ADD,  b5g6r5,   a8,   b5g6r5,   neon_composite_add_0565_8_0565),
+PIXMAN_STD_FAST_PATH (ADD,  x8r8g8b8, a8,   x8r8g8b8, neon_composite_add_8888_8_8888),
 PIXMAN_STD_FAST_PATH (ADD,  a8r8g8b8, a8,   x8r8g8b8, neon_composite_add_8888_8_8888),
+PIXMAN_STD_FAST_PATH (ADD,  x8b8g8r8, a8,   x8b8g8r8, neon_composite_add_8888_8_8888),
 PIXMAN_STD_FAST_PATH (ADD,  a8b8g8r8, a8,   x8b8g8r8, neon_composite_add_8888_8_8888),
 PIXMAN_STD_FAST_PATH (ADD,  a8r8g8b8, a8,   a8r8g8b8, neon_composite_add_8888_8_8888),
 PIXMAN_STD_FAST_PATH (ADD,  a8b8g8r8, a8,   a8b8g8r8, neon_composite_add_8888_8_8888),
+PIXMAN_STD_FAST_PATH (ADD,  x8r8g8b8, a8r8g8b8, x8r8g8b8, neon_composite_add_8888_8888_8888),
 PIXMAN_STD_FAST_PATH (ADD,  a8r8g8b8, a8r8g8b8, x8r8g8b8, neon_composite_add_8888_8888_8888),
 PIXMAN_STD_FAST_PATH (ADD,  a8r8g8b8, a8r8g8b8, a8r8g8b8, neon_composite_add_8888_8888_8888),
+PIXMAN_STD_FAST_PATH (ADD,  x8r8g8b8, solid,    x8r8g8b8, neon_composite_add_8888_n_8888),
 PIXMAN_STD_FAST_PATH (ADD,  a8r8g8b8, solid,    x8r8g8b8, neon_composite_add_8888_n_8888),
+PIXMAN_STD_FAST_PATH (ADD,  x8b8g8r8, solid,    x8b8g8r8, neon_composite_add_8888_n_8888),
 PIXMAN_STD_FAST_PATH (ADD,  a8b8g8r8, solid,    x8b8g8r8, neon_composite_add_8888_n_8888),
 PIXMAN_STD_FAST_PATH (ADD,  a8r8g8b8, solid,    a8r8g8b8, neon_composite_add_8888_n_8888),
 PIXMAN_STD_FAST_PATH (ADD,  a8b8g8r8, solid,    a8b8g8r8, neon_composite_add_8888_n_8888),
 PIXMAN_STD_FAST_PATH (ADD,  a8,   null, a8,   neon_composite_add_8_8),
+PIXMAN_STD_FAST_PATH (ADD,  x8r8g8b8, null, x8r8g8b8, neon_composite_add_8888_8888),
 PIXMAN_STD_FAST_PATH (ADD,  a8r8g8b8, null, x8r8g8b8, neon_composite_add_8888_8888),
+PIXMAN_STD_FAST_PATH (ADD,  x8b8g8r8, null, x8b8g8r8, neon_composite_add_8888_8888),
 PIXMAN_STD_FAST_PATH (ADD,  a8b8g8r8, null, x8b8g8r8, neon_composite_add_8888_8888),
 PIXMAN_STD_FAST_PATH (ADD,  a8r8g8b8, null, a8r8g8b8, neon_composite_add_8888_8888),
 PIXMAN_STD_FAST_PATH (ADD,  a8b8g8r8, null, a8b8g8r8, neon_composite_add_8888_8888),
@@ -416,6 +423,7 @@ static const pixman_fast_path_t arm_neon_fast_paths[] =
 
 SIMPLE_BILINEAR_FAST_PATH (ADD, a8r8g8b8, a8r8g8b8, neon_8888_8888),
 SIMPLE_BILINEAR_FAST_PATH (ADD, a8r8g8b8, x8r8g8b8, neon_8888_8888),
+SIMPLE_BILINEAR_FAST_PATH (ADD, x8r8g8b8, x8r8g8b8, neon_8888_8888),
 
 SIMPLE_BILINEAR_A8_MASK_FAST_PATH (SRC, a8r8g8b8, a8r8g8b8, neon_8888_8_8888),
 SIMPLE_BILINEAR_A8_MASK_FAST_PATH (SRC, a8r8g8b8, x8r8g8b8, neon_8888_8_8888),
@@ -432,6 +440,7 @@ static const pixman_fast_path_t arm_neon_fast_paths[] =
 
 SIMPLE_BILINEAR_A8_MASK_FAST_PATH (ADD, a8r8g8b8, a8r8g8b8, neon_8888_8_8888),
 SIMPLE_BILINEAR_A8_MASK_FAST_PATH (ADD, a8r8g8b8, x8r8g8b8, neon_8888_8_8888),
+SIMPLE_BILINEAR_A8_MASK_FAST_PATH (ADD, x8r8g8b8, x8r8g8b8, neon_8888_8_8888),
 
 { PIXMAN_OP_NONE },
 };
-- 
1.7.5.4



[Pixman] [PATCH 2/5] armv7: Add in_8888_8 fast path

2015-04-20 Thread Ben Avison
This is tuned for the Cortex-A7 (Raspberry Pi 2).
lowlevel-blt-bench results, compared to the ARMv6 fast path:

Before  After
Mean   StdDev   Mean   StdDev  Confidence  Change
L1  146.0  0.7  231.4  1.2 100.0%  +58.5%
L2  143.1  0.9  222.1  1.7 100.0%  +55.3%
M   110.9  0.0  129.0  0.5 100.0%  +16.3%
HT  57.3   0.6  73.0   0.3 100.0%  +27.4%
VT  46.6   0.5  61.6   0.4 100.0%  +32.3%
R   42.3   0.2  51.7   0.2 100.0%  +22.2%
RT  19.1   0.1  21.0   0.1 100.0%  +9.9%
---
 pixman/pixman-arm-neon-asm.S |   35 +++
 pixman/pixman-arm-neon.c |4 
 2 files changed, 39 insertions(+), 0 deletions(-)

diff --git a/pixman/pixman-arm-neon-asm.S b/pixman/pixman-arm-neon-asm.S
index 7e949a3..2fecb5b 100644
--- a/pixman/pixman-arm-neon-asm.S
+++ b/pixman/pixman-arm-neon-asm.S
@@ -2757,6 +2757,41 @@ generate_composite_function \
 
 
/**/
 
+.macro pixman_composite_in_8888_8_process_pixblock_head
+/* src is in d0-d3 (deinterleaved) */
+/* destination pixel data is in d4 */
+vmull.u8    q8, d3, d4
+.endm
+
+.macro pixman_composite_in_8888_8_process_pixblock_tail
+vrshr.u16   q9, q8, #8
+vraddhn.u16 d28, q8, q9
+/* result is in d28 */
+.endm
+
+.macro pixman_composite_in_8888_8_process_pixblock_tail_head
+vld4.8  {d0-d3}, [SRC]!
+vrshr.u16   q9, q8, #8
+vld1.8  {d4}, [DST_R :64]!
+cache_preload 8, 8
+vraddhn.u16 d28, q8, q9
+vmull.u8    q8, d3, d4
+vst1.8  {d28}, [DST_W :64]!
+.endm
+
+generate_composite_function \
+pixman_composite_in_8888_8_asm_neon, 32, 0, 8, \
+FLAG_DST_READWRITE | FLAG_DEINTERLEAVE_32BPP, \
+8, /* number of pixels, processed in a single block */ \
+4, /* prefetch distance */ \
+default_init, \
+default_cleanup, \
+pixman_composite_in_8888_8_process_pixblock_head, \
+pixman_composite_in_8888_8_process_pixblock_tail, \
+pixman_composite_in_8888_8_process_pixblock_tail_head
+
+/**/
+
 generate_composite_function_nearest_scanline \
 pixman_scaled_nearest_scanline_8888_8888_OVER_asm_neon, 32, 0, 32, \
 FLAG_DST_READWRITE | FLAG_DEINTERLEAVE_32BPP, \
diff --git a/pixman/pixman-arm-neon.c b/pixman/pixman-arm-neon.c
index 2f669cb..52ee9a4 100644
--- a/pixman/pixman-arm-neon.c
+++ b/pixman/pixman-arm-neon.c
@@ -66,6 +66,8 @@ PIXMAN_ARM_BIND_FAST_PATH_SRC_DST (neon, out_reverse_8_0565,
uint8_t, 1, uint16_t, 1)
 PIXMAN_ARM_BIND_FAST_PATH_SRC_DST (neon, out_reverse_8_8888,
uint8_t, 1, uint32_t, 1)
+PIXMAN_ARM_BIND_FAST_PATH_SRC_DST (neon, in_8888_8,
+   uint32_t, 1, uint8_t, 1)
 
 PIXMAN_ARM_BIND_FAST_PATH_N_DST (SKIP_ZERO_SRC, neon, over_n_0565,
  uint16_t, 1)
@@ -372,6 +374,8 @@ static const pixman_fast_path_t arm_neon_fast_paths[] =
 PIXMAN_STD_FAST_PATH (ADD,  a8r8g8b8, null, a8r8g8b8, neon_composite_add_8888_8888),
 PIXMAN_STD_FAST_PATH (ADD,  a8b8g8r8, null, a8b8g8r8, neon_composite_add_8888_8888),
 PIXMAN_STD_FAST_PATH (IN,   solid,    null, a8,       neon_composite_in_n_8),
+PIXMAN_STD_FAST_PATH (IN,   a8r8g8b8, null, a8,       neon_composite_in_8888_8),
+PIXMAN_STD_FAST_PATH (IN,   a8b8g8r8, null, a8,       neon_composite_in_8888_8),
 PIXMAN_STD_FAST_PATH (OVER_REVERSE, solid, null, a8r8g8b8, neon_composite_over_reverse_n_8888),
 PIXMAN_STD_FAST_PATH (OVER_REVERSE, solid, null, a8b8g8r8, neon_composite_over_reverse_n_8888),
 PIXMAN_STD_FAST_PATH (OUT_REVERSE,  a8,    null, r5g6b5,   neon_composite_out_reverse_8_0565),
-- 
1.7.5.4
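The vmull/vrshr/vraddhn sequence in this patch is the usual NEON idiom for an exact rounding division by 255. A minimal Python sketch of the per-pixel in_8888_8 computation (helper names are mine, not Pixman's):

```python
def div255(t):
    # Matches the NEON sequence: q9 = (q8 + 128) >> 8 (vrshr.u16), then
    # (q8 + q9 + 128) >> 8 (vraddhn.u16) -- an exact round(t / 255) for
    # any t in [0, 255 * 255].
    return (t + ((t + 128) >> 8) + 128) >> 8

def in_8888_8(src_argb, dst_a):
    # IN with an a8 destination only needs the source alpha:
    # new dst.a = src.a * dst.a / 255, rounded.
    src_a = (src_argb >> 24) & 0xff
    return div255(src_a * dst_a)
```

Checking div255 against round-to-nearest over the whole 16-bit product range confirms the shortcut is exact, which is why the assembly needs no correction step.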



[Pixman] [PATCH 4/5] armv7: Add in_reverse_8888_8888 fast path

2015-04-20 Thread Ben Avison
This is tuned for Cortex-A7 (Raspberry Pi 2).
lowlevel-blt-bench results, compared to the ARMv6 fast path:

Before  After
Mean   StdDev   Mean   StdDev  Confidence  Change
L1  48.9   0.2  100.2  0.5 100.0%  +104.9%
L2  47.4   0.1  99.4   0.4 100.0%  +109.6%
M   30.4   0.1  59.2   0.3 100.0%  +94.6%
HT  27.5   0.1  43.4   0.2 100.0%  +57.7%
VT  23.7   0.1  41.6   0.2 100.0%  +75.3%
R   22.1   0.1  35.5   0.0 100.0%  +60.6%
RT  12.3   0.1  16.1   0.0 100.0%  +30.9%
---
 pixman/pixman-arm-neon-asm.S |   57 ++
 pixman/pixman-arm-neon.c |6 
 2 files changed, 63 insertions(+), 0 deletions(-)

diff --git a/pixman/pixman-arm-neon-asm.S b/pixman/pixman-arm-neon-asm.S
index 8554e0c..93d680b 100644
--- a/pixman/pixman-arm-neon-asm.S
+++ b/pixman/pixman-arm-neon-asm.S
@@ -2856,6 +2856,63 @@ generate_composite_function \
 
 
/**/
 
+.macro pixman_composite_in_reverse_8888_8888_process_pixblock_head
+/* src is in d0-d3 (deinterleaved, though we only need alpha from d3) */
+/* destination pixel data is in d4-d7 (deinterleaved) */
+vmull.u8    q10, d6, d3
+vmull.u8    q11, d7, d3
+vmull.u8    q9, d5, d3
+vmull.u8    q8, d4, d3
+vrshr.u16   q15, q11, #8
+vrshr.u16   q14, q10, #8
+vrshr.u16   q13, q9, #8
+vrshr.u16   q12, q8, #8
+vraddhn.u16 d31, q11, q15
+vraddhn.u16 d30, q10, q14
+vraddhn.u16 d28, q8, q12
+vraddhn.u16 d29, q9, q13
+.endm
+
+.macro pixman_composite_in_reverse_8888_8888_process_pixblock_tail
+/* result is in d28-d31 */
+.endm
+
+.macro pixman_composite_in_reverse_8888_8888_process_pixblock_tail_head
+vld4.8  {d0-d3}, [SRC]!
+vzip.8  d28, d30
+vld4.8  {d4-d7}, [DST_R :128]!
+cache_preload 8, 8
+vzip.8  d29, d31
+vmull.u8    q10, d6, d3
+vmull.u8    q11, d7, d3
+vmull.u8    q9, d5, d3
+vmull.u8    q8, d4, d3
+vzip.8  d28, d29
+vzip.8  d30, d31
+vst1.8  {d28-d31}, [DST_W :128]!
+vrshr.u16   q15, q11, #8
+vrshr.u16   q14, q10, #8
+vrshr.u16   q13, q9, #8
+vrshr.u16   q12, q8, #8
+vraddhn.u16 d31, q11, q15
+vraddhn.u16 d30, q10, q14
+vraddhn.u16 d28, q8, q12
+vraddhn.u16 d29, q9, q13
+.endm
+
+generate_composite_function \
+pixman_composite_in_reverse_8888_8888_asm_neon, 32, 0, 32, \
+FLAG_DST_READWRITE | FLAG_DEINTERLEAVE_32BPP, \
+8, /* number of pixels, processed in a single block */ \
+3, /* prefetch distance */ \
+default_init, \
+default_cleanup, \
+pixman_composite_in_reverse_8888_8888_process_pixblock_head, \
+pixman_composite_in_reverse_8888_8888_process_pixblock_tail, \
+pixman_composite_in_reverse_8888_8888_process_pixblock_tail_head
+
+/**/
+
 generate_composite_function_nearest_scanline \
 pixman_scaled_nearest_scanline_8888_8888_OVER_asm_neon, 32, 0, 32, \
 FLAG_DST_READWRITE | FLAG_DEINTERLEAVE_32BPP, \
diff --git a/pixman/pixman-arm-neon.c b/pixman/pixman-arm-neon.c
index ab8a58c..532e903 100644
--- a/pixman/pixman-arm-neon.c
+++ b/pixman/pixman-arm-neon.c
@@ -68,6 +68,8 @@ PIXMAN_ARM_BIND_FAST_PATH_SRC_DST (neon, out_reverse_8_8888,
uint8_t, 1, uint32_t, 1)
 PIXMAN_ARM_BIND_FAST_PATH_SRC_DST (neon, in_8888_8,
uint32_t, 1, uint8_t, 1)
+PIXMAN_ARM_BIND_FAST_PATH_SRC_DST (neon, in_reverse_8888_8888,
+   uint32_t, 1, uint32_t, 1)
 
 PIXMAN_ARM_BIND_FAST_PATH_N_DST (SKIP_ZERO_SRC, neon, over_n_0565,
  uint16_t, 1)
@@ -382,6 +384,10 @@ static const pixman_fast_path_t arm_neon_fast_paths[] =
 PIXMAN_STD_FAST_PATH (IN,   solid,    null, a8b8g8r8, neon_composite_in_n_8888),
 PIXMAN_STD_FAST_PATH (OVER_REVERSE, solid, null, a8r8g8b8, neon_composite_over_reverse_n_8888),
 PIXMAN_STD_FAST_PATH (OVER_REVERSE, solid, null, a8b8g8r8, neon_composite_over_reverse_n_8888),
+PIXMAN_STD_FAST_PATH (IN_REVERSE, a8r8g8b8, null, x8r8g8b8, neon_composite_in_reverse_8888_8888),
+PIXMAN_STD_FAST_PATH (IN_REVERSE, a8b8g8r8, null, x8b8g8r8, neon_composite_in_reverse_8888_8888),
+PIXMAN_STD_FAST_PATH (IN_REVERSE, a8r8g8b8, null, a8r8g8b8, neon_composite_in_reverse_8888_8888),
+PIXMAN_STD_FAST_PATH (IN_REVERSE, a8b8g8r8, null, a8b8g8r8, neon_composite_in_reverse_8888_8888),
 PIXMAN_STD_FAST_PATH (OUT_REVERSE,  a8,    null, r5g6b5,   neon_composite_out_reverse_8_0565),
 PIXMAN_STD_FAST_PATH (OUT_REVERSE,  a8,    null, b5g6r5,   neon_composite_out_reverse_8_0565),
 PIXMAN_STD_FAST_PATH (OUT_REVERSE,  a8,    null, x8r8g8b8, neon_composite_out_reverse_8_8888),
-- 
1.7.5.4


[Pixman] [PATCH 5/5] armv7: Add src_1555_8888 fast path

2015-04-20 Thread Ben Avison
This is tuned for Cortex-A7 (Raspberry Pi 2).
lowlevel-blt-bench results, compared to the ARMv6 fast path:

Before  After
Mean   StdDev   Mean   StdDev  Confidence  Change
L1  88.6   0.2  221.3  0.5 100.0%  +149.7%
L2  88.1   0.4  219.2  0.8 100.0%  +148.9%
M   87.9   0.1  178.2  0.1 100.0%  +102.6%
HT  59.7   0.4  72.0   0.2 100.0%  +20.7%
VT  53.2   0.4  69.8   0.2 100.0%  +31.3%
R   48.5   0.3  53.6   0.1 100.0%  +10.6%
RT  21.2   0.1  23.0   0.1 100.0%  +8.5%
---
 pixman/pixman-arm-neon-asm.S |   51 ++
 pixman/pixman-arm-neon.c |8 ++
 2 files changed, 59 insertions(+), 0 deletions(-)

diff --git a/pixman/pixman-arm-neon-asm.S b/pixman/pixman-arm-neon-asm.S
index 93d680b..13c75c5 100644
--- a/pixman/pixman-arm-neon-asm.S
+++ b/pixman/pixman-arm-neon-asm.S
@@ -2913,6 +2913,57 @@ generate_composite_function \
 
 
/**/
 
+.macro pixman_composite_src_1555_8888_process_pixblock_head
+/* src is in d0-d1 */
+vshrn.i16   d31, q0, #8
+vshrn.u16   d29, q0, #2
+vshrn.i16   d30, q0, #7
+vsli.u16    q0, q0, #5
+vshr.s8 d31, d31, #7
+vsri.8  d29, d29, #5
+vsri.8  d30, d30, #5
+vshrn.u16   d28, q0, #2
+.endm
+
+.macro pixman_composite_src_1555_8888_process_pixblock_tail
+vzip.8  d29, d31
+vzip.8  d28, d30
+vzip.8  d28, d29
+vzip.8  d30, d31
+/* result is in d28-d31 */
+.endm
+
+.macro pixman_composite_src_1555_8888_process_pixblock_tail_head
+vzip.8  d29, d31
+vzip.8  d28, d30
+vld1.16 {d0-d1}, [SRC]!
+cache_preload 8, 8
+vzip.8  d28, d29
+vzip.8  d30, d31
+vst1.8  {d28-d31}, [DST_W :128]!
+vshrn.i16   d31, q0, #8
+vshrn.u16   d29, q0, #2
+vshrn.i16   d30, q0, #7
+vsli.u16    q0, q0, #5
+vshr.s8 d31, d31, #7
+vsri.8  d29, d29, #5
+vsri.8  d30, d30, #5
+vshrn.u16   d28, q0, #2
+.endm
+
+generate_composite_function \
+pixman_composite_src_1555_8888_asm_neon, 16, 0, 32, \
+FLAG_DST_WRITEONLY, \
+8, /* number of pixels, processed in a single block */ \
+6, /* prefetch distance */ \
+default_init, \
+default_cleanup, \
+pixman_composite_src_1555_8888_process_pixblock_head, \
+pixman_composite_src_1555_8888_process_pixblock_tail, \
+pixman_composite_src_1555_8888_process_pixblock_tail_head
+
+/**/
+
 generate_composite_function_nearest_scanline \
 pixman_scaled_nearest_scanline_8888_8888_OVER_asm_neon, 32, 0, 32, \
 FLAG_DST_READWRITE | FLAG_DEINTERLEAVE_32BPP, \
diff --git a/pixman/pixman-arm-neon.c b/pixman/pixman-arm-neon.c
index 532e903..9ab7259 100644
--- a/pixman/pixman-arm-neon.c
+++ b/pixman/pixman-arm-neon.c
@@ -46,6 +46,8 @@ PIXMAN_ARM_BIND_FAST_PATH_SRC_DST (neon, src_8888_0565,
uint32_t, 1, uint16_t, 1)
 PIXMAN_ARM_BIND_FAST_PATH_SRC_DST (neon, src_0565_8888,
uint16_t, 1, uint32_t, 1)
+PIXMAN_ARM_BIND_FAST_PATH_SRC_DST (neon, src_1555_8888,
+   uint16_t, 1, uint32_t, 1)
 PIXMAN_ARM_BIND_FAST_PATH_SRC_DST (neon, src_0888_8888_rev,
uint8_t, 3, uint32_t, 1)
 PIXMAN_ARM_BIND_FAST_PATH_SRC_DST (neon, src_0888_0565_rev,
@@ -286,6 +288,12 @@ static const pixman_fast_path_t arm_neon_fast_paths[] =
 PIXMAN_STD_FAST_PATH (SRC,  r5g6b5,   null, x8r8g8b8, neon_composite_src_0565_8888),
 PIXMAN_STD_FAST_PATH (SRC,  b5g6r5,   null, a8b8g8r8, neon_composite_src_0565_8888),
 PIXMAN_STD_FAST_PATH (SRC,  b5g6r5,   null, x8b8g8r8, neon_composite_src_0565_8888),
+PIXMAN_STD_FAST_PATH (SRC,  x1r5g5b5, null, x8r8g8b8, neon_composite_src_1555_8888),
+PIXMAN_STD_FAST_PATH (SRC,  a1r5g5b5, null, x8r8g8b8, neon_composite_src_1555_8888),
+PIXMAN_STD_FAST_PATH (SRC,  x1b5g5r5, null, x8b8g8r8, neon_composite_src_1555_8888),
+PIXMAN_STD_FAST_PATH (SRC,  a1b5g5r5, null, x8b8g8r8, neon_composite_src_1555_8888),
+PIXMAN_STD_FAST_PATH (SRC,  a1r5g5b5, null, a8r8g8b8, neon_composite_src_1555_8888),
+PIXMAN_STD_FAST_PATH (SRC,  a1b5g5r5, null, a8b8g8r8, neon_composite_src_1555_8888),
 PIXMAN_STD_FAST_PATH (SRC,  a8r8g8b8, null, x8r8g8b8, neon_composite_src_8888_8888),
 PIXMAN_STD_FAST_PATH (SRC,  x8r8g8b8, null, x8r8g8b8, neon_composite_src_8888_8888),
 PIXMAN_STD_FAST_PATH (SRC,  a8b8g8r8, null, x8b8g8r8, neon_composite_src_8888_8888),
-- 
1.7.5.4
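For reference, the channel expansion that the NEON shift and bit-insert sequence above performs is the standard one: replicate the top bits of each 5-bit channel into its low bits, and widen the 1-bit alpha by arithmetic shift. A scalar Python sketch (the function name is mine, not Pixman's):

```python
def src_1555_8888(pix):
    # a1r5g5b5 -> a8r8g8b8. vshr.s8 #7 replicates the alpha bit across
    # the whole byte; vsri.8 #5 inserts each 5-bit channel back into its
    # own low bits, giving c8 = (c5 << 3) | (c5 >> 2).
    a = 0xff if pix & 0x8000 else 0x00
    r = (pix >> 10) & 0x1f
    g = (pix >> 5) & 0x1f
    b = pix & 0x1f
    expand5 = lambda c: (c << 3) | (c >> 2)
    return (a << 24) | (expand5(r) << 16) | (expand5(g) << 8) | expand5(b)
```

Bit replication (rather than a plain left shift) is what makes 0x1f map to a full 0xff, so white stays white after conversion.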



[Pixman] [PATCH 3/5] armv7: Add in_n_8888 fast path

2015-04-20 Thread Ben Avison
This is tuned for Cortex-A7 (Raspberry Pi 2).
lowlevel-blt-bench results, compared to the ARMv6 fast path:

Before  After
Mean   StdDev   Mean   StdDev  Confidence  Change
L1  104.6  0.5  119.4  0.1 100.0%  +14.1%
L2  106.8  0.6  121.4  0.1 100.0%  +13.6%
M   100.3  1.3  116.4  0.0 100.0%  +16.0%
HT  64.5   1.0  70.8   0.1 100.0%  +9.8%
VT  56.0   0.8  62.2   0.1 100.0%  +11.1%
R   54.1   0.9  55.2   0.0 100.0%  +1.9%
RT  24.6   0.5  26.6   0.0 100.0%  +8.3%
---
 pixman/pixman-arm-neon-asm.S |   64 ++
 pixman/pixman-arm-neon.c |4 ++
 2 files changed, 68 insertions(+), 0 deletions(-)

diff --git a/pixman/pixman-arm-neon-asm.S b/pixman/pixman-arm-neon-asm.S
index 2fecb5b..8554e0c 100644
--- a/pixman/pixman-arm-neon-asm.S
+++ b/pixman/pixman-arm-neon-asm.S
@@ -2792,6 +2792,70 @@ generate_composite_function \
 
 
/**/
 
+.macro pixman_composite_in_n_8888_init
+add DUMMY, sp, #ARGS_STACK_OFFSET
+vld1.8  {d0[]}, [DUMMY]!
+vld1.8  {d1[]}, [DUMMY]!
+vld1.8  {d2[]}, [DUMMY]!
+vld1.8  {d3[]}, [DUMMY]!
+.endm
+
+.macro pixman_composite_in_n_8888_process_pixblock_head
+/* src is in d0-d3 (deinterleaved) */
+/* destination pixel data is in d4-d7 (deinterleaved, though we only need alpha from d7) */
+vmull.u8    q11, d3, d7
+vmull.u8    q10, d2, d7
+vmull.u8    q9, d1, d7
+vmull.u8    q8, d0, d7
+vrshr.u16   q15, q11, #8
+vrshr.u16   q14, q10, #8
+vrshr.u16   q13, q9, #8
+vrshr.u16   q12, q8, #8
+vraddhn.u16 d31, q11, q15
+vraddhn.u16 d30, q10, q14
+vraddhn.u16 d28, q8, q12
+vraddhn.u16 d29, q9, q13
+.endm
+
+.macro pixman_composite_in_n_8888_process_pixblock_tail
+/* result is in d28-d31 */
+.endm
+
+.macro pixman_composite_in_n_8888_process_pixblock_tail_head
+vld4.8  {d4-d7}, [DST_R :128]!
+cache_preload 8, 8
+vzip.8  d28, d30
+vmull.u8    q11, d3, d7
+vzip.8  d29, d31
+vmull.u8    q10, d2, d7
+vmull.u8    q9, d1, d7
+vmull.u8    q8, d0, d7
+vzip.8  d28, d29
+vzip.8  d30, d31
+vst1.8  {d28-d31}, [DST_W :128]!
+vrshr.u16   q15, q11, #8
+vrshr.u16   q14, q10, #8
+vrshr.u16   q13, q9, #8
+vrshr.u16   q12, q8, #8
+vraddhn.u16 d31, q11, q15
+vraddhn.u16 d30, q10, q14
+vraddhn.u16 d28, q8, q12
+vraddhn.u16 d29, q9, q13
+.endm
+
+generate_composite_function \
+pixman_composite_in_n_8888_asm_neon, 0, 0, 32, \
+FLAG_DST_READWRITE | FLAG_DEINTERLEAVE_32BPP, \
+8, /* number of pixels, processed in a single block */ \
+6, /* prefetch distance */ \
+pixman_composite_in_n_8888_init, \
+default_cleanup, \
+pixman_composite_in_n_8888_process_pixblock_head, \
+pixman_composite_in_n_8888_process_pixblock_tail, \
+pixman_composite_in_n_8888_process_pixblock_tail_head
+
+/**/
+
 generate_composite_function_nearest_scanline \
 pixman_scaled_nearest_scanline_8888_8888_OVER_asm_neon, 32, 0, 32, \
 FLAG_DST_READWRITE | FLAG_DEINTERLEAVE_32BPP, \
diff --git a/pixman/pixman-arm-neon.c b/pixman/pixman-arm-neon.c
index 52ee9a4..ab8a58c 100644
--- a/pixman/pixman-arm-neon.c
+++ b/pixman/pixman-arm-neon.c
@@ -77,6 +77,8 @@ PIXMAN_ARM_BIND_FAST_PATH_N_DST (SKIP_ZERO_SRC, neon, over_reverse_n_8888,
 uint32_t, 1)
 PIXMAN_ARM_BIND_FAST_PATH_N_DST (0, neon, in_n_8,
 uint8_t, 1)
+PIXMAN_ARM_BIND_FAST_PATH_N_DST (0, neon, in_n_8888,
+ uint32_t, 1)
 
 PIXMAN_ARM_BIND_FAST_PATH_N_MASK_DST (SKIP_ZERO_SRC, neon, over_n_8_0565,
   uint8_t, 1, uint16_t, 1)
@@ -376,6 +378,8 @@ static const pixman_fast_path_t arm_neon_fast_paths[] =
 PIXMAN_STD_FAST_PATH (IN,   solid,    null, a8,       neon_composite_in_n_8),
 PIXMAN_STD_FAST_PATH (IN,   a8r8g8b8, null, a8,       neon_composite_in_8888_8),
 PIXMAN_STD_FAST_PATH (IN,   a8b8g8r8, null, a8,       neon_composite_in_8888_8),
+PIXMAN_STD_FAST_PATH (IN,   solid,    null, a8r8g8b8, neon_composite_in_n_8888),
+PIXMAN_STD_FAST_PATH (IN,   solid,    null, a8b8g8r8, neon_composite_in_n_8888),
 PIXMAN_STD_FAST_PATH (OVER_REVERSE, solid, null, a8r8g8b8, neon_composite_over_reverse_n_8888),
 PIXMAN_STD_FAST_PATH (OVER_REVERSE, solid, null, a8b8g8r8, neon_composite_over_reverse_n_8888),
 PIXMAN_STD_FAST_PATH (OUT_REVERSE,  a8,    null, r5g6b5,   neon_composite_out_reverse_8_0565),
-- 
1.7.5.4


[Pixman] [PATCH] armv7: Re-use existing fast paths in more cases

2015-04-20 Thread Ben Avison
There are a group of combiner types - SRC, OVER, IN_REVERSE, OUT_REVERSE
and ADD - where the destination alpha component is only used (if at all) to
determine the destination alpha component. This means that any such fast
paths with an a8r8g8b8 destination can also be applied to an x8r8g8b8
destination just by updating the fast path table, and likewise with
a8b8g8r8 and x8b8g8r8. The following operations are affected:

over_8888_8888_x888
add_n_8_x888
add_8888_8_x888
add_8888_8888_x888
add_8888_n_x888
add_8888_x888
out_reverse_8_x888
---
 pixman/pixman-arm-neon.c |   12 
 1 files changed, 12 insertions(+), 0 deletions(-)

diff --git a/pixman/pixman-arm-neon.c b/pixman/pixman-arm-neon.c
index e5262b1..92519e3 100644
--- a/pixman/pixman-arm-neon.c
+++ b/pixman/pixman-arm-neon.c
@@ -331,6 +331,7 @@ static const pixman_fast_path_t arm_neon_fast_paths[] =
 PIXMAN_STD_FAST_PATH (OVER, a8b8g8r8, a8,   b5g6r5,   neon_composite_over_8888_8_0565),
 PIXMAN_STD_FAST_PATH (OVER, r5g6b5,   a8,   r5g6b5,   neon_composite_over_0565_8_0565),
 PIXMAN_STD_FAST_PATH (OVER, b5g6r5,   a8,   b5g6r5,   neon_composite_over_0565_8_0565),
+PIXMAN_STD_FAST_PATH (OVER, a8r8g8b8, a8r8g8b8, x8r8g8b8, neon_composite_over_8888_8888_8888),
 PIXMAN_STD_FAST_PATH (OVER, a8r8g8b8, a8r8g8b8, a8r8g8b8, neon_composite_over_8888_8888_8888),
 PIXMAN_STD_FAST_PATH (OVER, a8r8g8b8, null, r5g6b5,   neon_composite_over_8888_0565),
 PIXMAN_STD_FAST_PATH (OVER, a8b8g8r8, null, b5g6r5,   neon_composite_over_8888_0565),
@@ -341,17 +342,26 @@ static const pixman_fast_path_t arm_neon_fast_paths[] =
 PIXMAN_STD_FAST_PATH (OVER, x8r8g8b8, null, a8r8g8b8, neon_composite_src_x888_8888),
 PIXMAN_STD_FAST_PATH (OVER, x8b8g8r8, null, a8b8g8r8, neon_composite_src_x888_8888),
 PIXMAN_STD_FAST_PATH (ADD,  solid,    a8,   a8,       neon_composite_add_n_8_8),
+PIXMAN_STD_FAST_PATH (ADD,  solid,    a8,   x8r8g8b8, neon_composite_add_n_8_8888),
 PIXMAN_STD_FAST_PATH (ADD,  solid,    a8,   a8r8g8b8, neon_composite_add_n_8_8888),
+PIXMAN_STD_FAST_PATH (ADD,  solid,    a8,   x8b8g8r8, neon_composite_add_n_8_8888),
 PIXMAN_STD_FAST_PATH (ADD,  solid,    a8,   a8b8g8r8, neon_composite_add_n_8_8888),
 PIXMAN_STD_FAST_PATH (ADD,  a8,       a8,   a8,       neon_composite_add_8_8_8),
 PIXMAN_STD_FAST_PATH (ADD,  r5g6b5,   a8,   r5g6b5,   neon_composite_add_0565_8_0565),
 PIXMAN_STD_FAST_PATH (ADD,  b5g6r5,   a8,   b5g6r5,   neon_composite_add_0565_8_0565),
+PIXMAN_STD_FAST_PATH (ADD,  a8r8g8b8, a8,   x8r8g8b8, neon_composite_add_8888_8_8888),
+PIXMAN_STD_FAST_PATH (ADD,  a8b8g8r8, a8,   x8b8g8r8, neon_composite_add_8888_8_8888),
 PIXMAN_STD_FAST_PATH (ADD,  a8r8g8b8, a8,   a8r8g8b8, neon_composite_add_8888_8_8888),
 PIXMAN_STD_FAST_PATH (ADD,  a8b8g8r8, a8,   a8b8g8r8, neon_composite_add_8888_8_8888),
+PIXMAN_STD_FAST_PATH (ADD,  a8r8g8b8, a8r8g8b8, x8r8g8b8, neon_composite_add_8888_8888_8888),
 PIXMAN_STD_FAST_PATH (ADD,  a8r8g8b8, a8r8g8b8, a8r8g8b8, neon_composite_add_8888_8888_8888),
+PIXMAN_STD_FAST_PATH (ADD,  a8r8g8b8, solid,    x8r8g8b8, neon_composite_add_8888_n_8888),
+PIXMAN_STD_FAST_PATH (ADD,  a8b8g8r8, solid,    x8b8g8r8, neon_composite_add_8888_n_8888),
 PIXMAN_STD_FAST_PATH (ADD,  a8r8g8b8, solid,    a8r8g8b8, neon_composite_add_8888_n_8888),
 PIXMAN_STD_FAST_PATH (ADD,  a8b8g8r8, solid,    a8b8g8r8, neon_composite_add_8888_n_8888),
 PIXMAN_STD_FAST_PATH (ADD,  a8,       null, a8,       neon_composite_add_8_8),
+PIXMAN_STD_FAST_PATH (ADD,  a8r8g8b8, null, x8r8g8b8, neon_composite_add_8888_8888),
+PIXMAN_STD_FAST_PATH (ADD,  a8b8g8r8, null, x8b8g8r8, neon_composite_add_8888_8888),
 PIXMAN_STD_FAST_PATH (ADD,  a8r8g8b8, null, a8r8g8b8, neon_composite_add_8888_8888),
 PIXMAN_STD_FAST_PATH (ADD,  a8b8g8r8, null, a8b8g8r8, neon_composite_add_8888_8888),
 PIXMAN_STD_FAST_PATH (IN,   solid,    null, a8,       neon_composite_in_n_8),
@@ -359,7 +369,9 @@ static const pixman_fast_path_t arm_neon_fast_paths[] =
 PIXMAN_STD_FAST_PATH (OVER_REVERSE, solid, null, a8b8g8r8, neon_composite_over_reverse_n_8888),
 PIXMAN_STD_FAST_PATH (OUT_REVERSE,  a8,    null, r5g6b5,   neon_composite_out_reverse_8_0565),
 PIXMAN_STD_FAST_PATH (OUT_REVERSE,  a8,    null, b5g6r5,   neon_composite_out_reverse_8_0565),
+PIXMAN_STD_FAST_PATH (OUT_REVERSE,  a8,    null, x8r8g8b8, neon_composite_out_reverse_8_8888),
 PIXMAN_STD_FAST_PATH (OUT_REVERSE,  a8,    null, a8r8g8b8, neon_composite_out_reverse_8_8888),
+PIXMAN_STD_FAST_PATH (OUT_REVERSE,  a8,    null, x8b8g8r8, neon_composite_out_reverse_8_8888),
 PIXMAN_STD_FAST_PATH (OUT_REVERSE,  a8,    null, a8b8g8r8, neon_composite_out_reverse_8_8888),
 
 PIXMAN_ARM_SIMPLE_NEAREST_FAST_PATH (OVER, a8r8g8b8, a8r8g8b8, neon_8888_8888),
-- 

[Pixman] [PATCH 1/5] armv7: Use prefetch for small-width images too

2015-04-20 Thread Ben Avison
After discovering that the ARMv6 optimised fast paths often out-performed
the ARMv7 ones on a Cortex-A7, particularly on the RT benchmark, I found
that the problem was due to the fact that the ARMv7 macros didn't attempt
any sort of prefetch for small images (fewer than pixblock_size * 2 pixels
across).

Since a pixblock is chosen to be no larger than a cacheline, and is in many
cases smaller, it seemed a reasonable compromise to avoid adding a lot of
complexity by simply doing one prefetch for the start of a pixel row when
starting to process the preceding one, and that is what this patch does.

I compared the effect of using LDRB (which is what is currently used at the
end of each long pixel row) against PLD for each of the source and
destination buffers for a selection of common operations: src__,
over__ and add__, and in each case PLD of both buffers was
the most beneficial. PLDW didn't make any measurable difference.

The overall effect of this patch on the three operations is as follows
(L1, L2 and M tests can be ignored because they're known not to involve the
use of short rows):

src__

Before  After
Mean   StdDev   Mean   StdDev  Confidence  Change
HT  60.8   0.1  61.1   0.1 100.0%  +0.6%
VT  61.0   0.3  62.6   0.2 100.0%  +2.6%
R   45.5   0.2  46.2   0.2 100.0%  +1.5%
RT  19.8   0.0  21.4   0.0 100.0%  +7.8%

over__

Before  After
Mean   StdDev   Mean   StdDev  Confidence  Change
HT  40.2   0.1  40.7   0.4 100.0%  +1.0%
VT  35.5   0.2  37.9   0.3 100.0%  +6.7%
R   32.8   0.0  33.8   0.3 100.0%  +3.0%
RT  12.9   0.0  15.6   0.2 100.0%  +21.4%

add__

Before  After
Mean   StdDev   Mean   StdDev  Confidence  Change
HT  51.0   0.6  51.9   0.5 100.0%  +1.7%
VT  44.0   0.4  46.8   0.5 100.0%  +6.3%
R   39.6   0.5  41.0   0.4 100.0%  +3.5%
RT  15.2   0.2  18.0   0.2 100.0%  +18.5%
---
 pixman/pixman-arm-neon-asm.h |9 +
 1 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/pixman/pixman-arm-neon-asm.h b/pixman/pixman-arm-neon-asm.h
index bdcf6a9..ce06416 100644
--- a/pixman/pixman-arm-neon-asm.h
+++ b/pixman/pixman-arm-neon-asm.h
@@ -870,6 +870,15 @@ local skip1
  * nor prefetch are used.
  */
 8:
+.if src_bpp_shift >= 0
+PF pld, [SRC, SRC_STRIDE, lsl #src_bpp_shift]
+.endif
+.if dst_r_bpp != 0
+PF pld, [DST_R, DST_STRIDE, lsl #dst_bpp_shift]
+.endif
+.if mask_bpp_shift >= 0
+PF pld, [MASK, MASK_STRIDE, lsl #mask_bpp_shift]
+.endif
 /* Process exactly pixblock_size pixels if needed */
 tst W, #pixblock_size
 beq 1f
-- 
1.7.5.4



Re: [Pixman] [PATCH 25/32] armv6: Add src_1555_8888 fast path

2015-04-20 Thread Ben Avison

I hope this doesn't confuse threading too much... I've been working on
ARMv7 with the aim of bringing it up to par with the ARMv6 features that
I've been writing over the last couple of years, and I'm trying to drip
feed them to the mailing list to avoid a big bang at the end this time.
I've just added src_1555_8888 for ARMv7, and it makes sense to address
Søren's review comment for the ARMv6 version while I am at it, even
though many earlier ARMv6 patches in the series haven't been accepted yet.

On Thu, 07 Aug 2014 17:50:21 +0100, I wrote:
[...]

diff --git a/pixman/pixman-arm-simd.c b/pixman/pixman-arm-simd.c
index e6c5d81..f028794 100644
--- a/pixman/pixman-arm-simd.c
+++ b/pixman/pixman-arm-simd.c

[...]

@@ -277,6 +279,9 @@ static const pixman_fast_path_t arm_simd_fast_paths[] =
 PIXMAN_STD_FAST_PATH (SRC, a8b8g8r8, null, b5g6r5, armv6_composite_src_x888_0565),
 PIXMAN_STD_FAST_PATH (SRC, x8b8g8r8, null, b5g6r5, armv6_composite_src_x888_0565),
+PIXMAN_STD_FAST_PATH (SRC, a1r5g5b5, null, a8r8g8b8, armv6_composite_src_1555_8888),
+PIXMAN_STD_FAST_PATH (SRC, a1b5g5r5, null, a8b8g8r8, armv6_composite_src_1555_8888),
+
 PIXMAN_STD_FAST_PATH (OVER, a8r8g8b8, null, a8r8g8b8, armv6_composite_over_8888_8888),
 PIXMAN_STD_FAST_PATH (OVER, a8r8g8b8, null, x8r8g8b8, armv6_composite_over_8888_8888),
 PIXMAN_STD_FAST_PATH (OVER, a8b8g8r8, null, a8b8g8r8, armv6_composite_over_8888_8888),


On Sun, 05 Oct 2014 20:03:42 +0100, Søren Sandmann soren.sandm...@gmail.com 
wrote:

Can these be used for x8r8g8b8/x8b8g8r8 targets as well?


This is true. In fact, it can be used for x1r5g5b5/x1b5g5r5 sources as
well if the destination is x8r8g8b8/x8b8g8r8.

The ARMv6 fast path table is otherwise quite rigorous in including x888
type pixel formats whenever possible. The same cannot be said for the
ARMv7 table; logically, the same rules should apply to both. So I will
shortly be following the updated version of this patch with a couple more
patches to improve the ARMv7 table in the same way.

Following that will be a short patch series adding some of the unscaled
fast paths which had been encountered on the Raspberry Pi 1 (ARMv6) but
for which an ARMv7 implementation was missing. These are presented as a
dependent series because without the first patch in that series to
improve the ARMv7 prefetch logic for small images, the RT benchmark
scores for these fast paths showed a sizeable regression compared to the
ARMv6 version.

Ben


Re: [Pixman] [PATCH] lowlevel-blt-bench: use a8r8g8b8 for CA solid masks

2015-04-16 Thread Ben Avison

When doing component alpha, use a mask format that has all the color
channels instead of just a8.


In case you're waiting for me...

That first sentence doesn't qualify the fact that this is specific to
solid masks. Other than that, looks good to me. Not sure whether that
qualifies as a Reviewed-by or Acked-by, take your pick!

Ben


Re: [Pixman] [PATCH 0/7] lowlevel-blt-bench test pattern parser

2015-04-14 Thread Ben Avison

On Tue, 14 Apr 2015 09:36:38 +0100,
Pekka Paalanen ppaala...@gmail.com wrote:


On Mon, 13 Apr 2015 18:42:45 +0100
Ben Avison bavi...@riscosopen.org wrote:


On Mon, 13 Apr 2015 12:31:35 +0100,
Pekka Paalanen ppaala...@gmail.com wrote:
 None of the existing tests has a solid mask for CA tests. I'm not
 sure how much it makes sense.

When you're in component alpha mode, mathematically the source and mask
images are interchangeable. So you could implement over_8888_n_8888_ca
by calling over_n_8888_8888_ca, and we already have a number of fast
paths for operations of the latter form. (I'm not aware that Pixman is
- yet - aware that it can make such substitutions, but I'm sure it
could be added.)


Hmm, yes, interesting, at least in the cases where the operator indeed
allows it... I don't know the operators by heart so I don't know if
there are any that wouldn't work like that. An idea to remember.


Having thought about it some more, and checked the source code (it's
clearer in pixman-combine-float.c because the maths isn't obfuscated by
the need to work in fixed-point) I'm going to retract the claim. Or at
least restrict its applicability. (This demonstrates the danger of there
not being anyone around to contradict me!)

The problem is that although alpha is considered independently for each
colour component of the mask - so you just multiply the source red by the
mask red and so on, I'd neglected the alpha component in the source
image. Yes, because Pixman premultiplies the alpha into the RGB
components, it doesn't directly affect the RGB output of the combined
source+mask image, but there's a secondary output, a per-channel alpha,
which affects the Porter-Duff operation. For example, with an OVER
operation, the inverse of the per-channel alpha is multiplied into the
existing value from the destination buffer before you add-saturate the
combined source+mask into it.

My statement about being able to exchange the source and mask images is
true, but only if the alpha component is 1 in both images. For example,
over_0565_n_8888_ca or over_x888_n_8888_ca where (in both cases)
n.a == 1 would both be suitable candidates for exchanging the source and
mask.
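The asymmetry described above can be seen in a one-channel float sketch of component-alpha OVER (premultiplied values in [0, 1]; the function is illustrative, not Pixman code):

```python
def over_ca_channel(src_c, src_a, mask_c, dst_c):
    # The effective source for this channel is src_c * mask_c; the
    # per-channel alpha that scales down the destination is src_a * mask_c,
    # so the source image's alpha component matters.
    return src_c * mask_c + (1.0 - src_a * mask_c) * dst_c

# src channel 0.5 with src alpha 0.5, mask channel 0.8, dst 0.25:
forward = over_ca_channel(0.5, 0.5, 0.8, 0.25)   # 0.55
# Exchanging source and mask feeds the other image's alpha (here the
# mask's, taken as 1.0) into the per-channel alpha term:
swapped = over_ca_channel(0.8, 1.0, 0.5, 0.25)   # 0.525
```

The effective source term is symmetric, but the per-channel alpha term is not, which is exactly why the exchange claim had to be retracted.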

For the avoidance of doubt, the Porter-Duff and PDF combiners are all
just binary operators, defined in terms of one source and one destination
image. Whenever Pixman talks about unified or component alpha operations,
it's just concerned with how the source and mask images are combined into
the effective source image for the fundamental combiner. It just happens
to be more efficient in most cases to do both the source/mask and source/
destination calculations at the same time, rather than writing out the
intermediate image into a buffer somewhere.


So, can I take it that you gave your Reviewed-by for the whole series?


Yes.

Ben


Re: [Pixman] [PATCH 5/5] test: Add a new benchmarker targeting affine operations

2015-04-14 Thread Ben Avison

On Tue, 14 Apr 2015 11:28:42 +0100, Pekka Paalanen ppaala...@gmail.com wrote:

So the matrices are indexed as matrix[row][column] in Pixman?


That's my understanding, yes.


A short'ish comment somewhere to say why you are doing this offsetting
would be nice, and that the offsetting is the reason to allocate a
margin.


Since you've been reworking the patches anyway, do you have enough
information to add the comments yourself or do you want me to do them?


One simple way round this is to apply a 0.5 pixel translation in the
transformation matrix, so that the pattern becomes:

d[0] =1.00 * s[0]
d[1] =0.50 * s[0] + 0.50 * s[1]
d[2] =  1.00 * s[1]
d[3] =  0.50 * s[1] + 0.50 * s[2]

but this is thwarted by the 8/65536ths of a pixel fudge factor. I can't
see why the fudge factor is needed at all, at least not in the affine
case; it's described as being to compensate for rounding errors, but
there is no rounding involved in fixed-point affine transforms.


Maybe the rounding refers to the pixman_fixed_to_int() calls? I could
imagine it is to guarantee that an xmin=0.5 gets really rounded to 0,
and xmax=0.5 gets rounded to 1.


There's a slightly better worked example in the later patch I was
thinking of:

http://lists.freedesktop.org/archives/pixman/2014-September/003380.html

As I say, we're getting a bit ahead of ourselves here, as there were
about 30 patches between the ones we're reworking at the moment and that
one. Søren did give some comments on it at the end of his review here:

http://lists.freedesktop.org/archives/pixman/2014-October/003457.html

which says the 8*pixman_fixed_e was about ensuring we didn't stray beyond
the bitmap bounds if a fast path happened to over-read pixels and then
multiply them by a 0 filter coefficient. I think we can probably cater
for that better by checking whether a bitmap starts or ends at a page
boundary, and whether we're reading the first and last pixels of the
image if it does.


To be honest, both cases you describe sound strange to me. Surely if I
use an affine transform matrix { 0.5 0 0; 0 1 0 }, I'd expect
d[0] = 0.5 * s[0] + 0.5 * s[1]
when assuming the box filter (if I got my terminology right)...


You're forgetting it's pixel-centres that are fed through the transform.
To be honest, I think it's already quite a headache for an application to
work out how to set up the transform in order to hit a cover scaled
fast path. Assume the reasonable case that you want to plot the whole of
an image of size x,y at a size m,n. You need to set the diagonals of the
transform to

floor(pixman_fixed_1*(x-1)/(m-1))
floor(pixman_fixed_1*(y-1)/(n-1))
1

and then solve for

/ 0.5 \   / 0.5 \
T . | 0.5 | = | 0.5 |
\ 1   /   \ 1   /

to find the translation coefficients that will ensure the top-left source
pixel aligns exactly to the top-left destination pixel.
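In 16.16 fixed point the setup above could be sketched as follows (a hypothetical helper, not a Pixman API; remember that Pixman transforms map destination coordinates to source coordinates):

```python
PIXMAN_FIXED_1 = 1 << 16  # 1.0 in pixman_fixed_t (16.16 fixed point)

def cover_transform(x, y, m, n):
    # Diagonal scale factors for plotting the whole of an x*y source
    # image at size m*n, per the text above.
    sx = PIXMAN_FIXED_1 * (x - 1) // (m - 1)
    sy = PIXMAN_FIXED_1 * (y - 1) // (n - 1)
    # Translation solving T . (0.5, 0.5, 1) = (0.5, 0.5, 1), i.e.
    # t = (1 - s) / 2, so the top-left pixel centres coincide.
    tx = (PIXMAN_FIXED_1 - sx) // 2
    ty = (PIXMAN_FIXED_1 - sy) // 2
    return sx, sy, tx, ty
```

At 1:1 the translation collapses to zero; for an exact 2:1 reduction it becomes -0.5 of a pixel, pulling the source pixel centres onto the destination grid.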

To then require that the caller also knows that you need to nudge things
along by 8*pixman_fixed_e as well feels like it's requiring too much
knowledge of Pixman's internals to me. I didn't really like putting it in
affine-bench to begin with for this reason, but it doesn't work without
it. So I made do with removing it as quickly as I could - the original
intention was for it to be in the same patch series, but obviously the
patch series has grown too large and unwieldy for that to be the case
any more!

When I've seen timing breakdowns of which Pixman operations are being hit
by applications, I've seen far more scaled plots with repeat type PAD
than I'd really expect. My suspicions are that the bulk of these are down
to how difficult it is for applications to set up the transform for
maximum efficiency.


I think my hope was that since we're processing images of size 1920 *
1080 pixels, you wouldn't have any particular pixels persisting in the
cache from the previous iteration (unless perhaps you're testing a
high-factor enlargement on a platform that has caches that don't
allocate on write).


Right, so is there any need to flush_cache() at all?


Possibly not. I was taking my cue from lowlevel-blt-bench again, where it
flushes the cache between each type of test. And looking at the
implementation of flush_cache() again now, I note that due to the action
of allocate-on-write and physically mapped caches, it won't completely
flush most caches, only one page-size worth! Perhaps just delete it then.
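If it were kept, a version that stands a chance against a physically mapped, allocate-on-write cache would need to read (not write) a region comfortably larger than the outermost cache. A sketch, where the buffer size and the 32-byte line-length assumption are mine:

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_L2CACHE_SIZE (8 * 1024 * 1024)
#define CACHE_LINE 32          /* assumed minimum cache line length */

/* Static storage is zero-initialised, so the checksum is deterministic;
 * returning it stops the compiler from discarding the loop. */
static uint8_t scratch[2 * MAX_L2CACHE_SIZE];

static unsigned
flush_cache (void)
{
    volatile const uint8_t *p = scratch;
    unsigned sum = 0;
    size_t i;

    /* Touch one byte per cache line; reads allocate lines even on
     * caches that don't allocate on write, evicting the test data. */
    for (i = 0; i < sizeof scratch; i += CACHE_LINE)
        sum += p[i];
    return sum;
}
```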


 +bench (op, &t, src_image, mask_image, dest_image, src_x, src_y, UINT32_MAX, 500, &n, &t1, pixman_image_composite_wrapper);
 +bench (op, &t, src_image, mask_image, dest_image, src_x, src_y, n, UINT32_MAX, NULL, &t2, pixman_image_composite_empty);

[...]

I think that part I understood. My comment was about having a {} block
without any flow control statements for it. You are just using it to
scope some variables, which suggests that the block should actually be
a separate function.


Oh right. Well, I think 

Re: [Pixman] [PATCH 0/7] lowlevel-blt-bench test pattern parser

2015-04-13 Thread Ben Avison

On Mon, 13 Apr 2015 12:31:35 +0100, Pekka Paalanen ppaala...@gmail.com wrote:

Apart from restructuring, there is one significant(?) difference to
Ben's patches: for a solid mask, Ben used a8r8g8b8, but my code uses a8.


I had to remind myself how things hang together to check if this is
significant or not...

When lowlevel-blt-bench tests a solid image (source or mask), it achieves
it not by calling pixman_image_create_solid_fill() but by creating a 1x1
pixel bitmap using pixman_image_create_bits(). The effect is the same,
though, because compute_image_info() detects single-pixel bitmaps and
records this in image->common.extended_format_code before we start doing
fast path lookup. However, image->type remains BITS rather than SOLID
because the non-common parts of the pixman_image union retain their
original meaning - for a native solid fill image, it's just the colour in
three different formats, but for bitmap images it's a pointer to and
sizes of the bitmap and colour LUT (if applicable), various helper
function pointers and so on.

A typical ARM assembly fast path is written to assume that any argument
corresponding to a solid colour is already in a 32-bit format whose
red-blue ordering is the same as the destination image's. The magic that
ensures this is hidden in the wrapper C function generated by
PIXMAN_ARM_BIND_FAST_PATH_N_DST and similar, where it calls
_pixman_image_get_solid(). You'll see similar calls more explicitly in
pixman-fast-path.c (architecture-neutral fast paths) and the equivalent
sources files for other architectures.

There are a few circumstances where you might not want to use
_pixman_image_get_solid() - perhaps the most obvious are the operations
which use a solid image in pixel format a2r10g10b10, because you'd lose 2
bits of precision per colour component. There are a few examples of those
amongst the ones you picked out in special_patterns[].

It's worth noting that _pixman_image_get_solid() has optimised code to
extract the colour from image->type == BITS images if the source/mask
image is of format a8r8g8b8, x8r8g8b8 or a8. Other formats will still
work, but more laboriously.
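The cheap conversions amount to little more than byte shuffling - roughly the following (an illustrative sketch of the effect, not the real _pixman_image_get_solid() code):

```c
#include <stdint.h>

/* a8r8g8b8 passes straight through; x8r8g8b8 just forces the alpha
 * byte opaque; a8 lands in the alpha byte with r = g = b = 0 (which is
 * why a memset(0xCC) a8 image reads back as the solid 0xCC000000). */
static uint32_t
solid_from_a8 (uint8_t a)
{
    return (uint32_t) a << 24;
}

static uint32_t
solid_from_x8r8g8b8 (uint32_t pixel)
{
    return pixel | 0xff000000u;
}
```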

Not that any of this should matter at all for the purposes of
lowlevel-blt-bench. _pixman_image_get_solid() is only called once per
call of pixman_image_composite(), so it is part of the overhead that
should be accounted for by the code wrapped in #if EXCLUDE_OVERHEAD.


should we also add a condition, that if a test has CA flag and a solid
mask, then that mask needs to be a8r8g8b8?


That might be desirable, if only because lowlevel-blt-bench initialises
all its images using memset(0xCC), so an a8 solid image would be converted
by _pixman_image_get_solid() to 0xCC000000 whereas an a8r8g8b8 one would
be 0xCCCCCCCC. When you're not in component alpha mode, only the alpha byte
matters for the mask image, but in the case of component alpha
operations, a fast path might decide that it can save itself a lot of
multiplications if it spots that 3 constant mask components are already 0.


None of the existing tests has a solid mask for CA tests. I'm not sure
how much it makes sense.


When you're in component alpha mode, mathematically the source and mask
images are interchangeable. So you could implement over_8888_n_8888_ca by
calling over_n_8888_8888_ca, and we already have a number of fast paths
for operations of the latter form. (I'm not aware that Pixman is - yet -
aware that it can make such substitutions, but I'm sure it could be
added.)
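The interchangeability falls out of the maths: in component alpha mode the source and mask are combined with a per-component multiply before the operator runs, and that multiply commutes. With the usual rounded 8-bit multiply (the same formula as Pixman's MUL_UN8 macro, written here as a standalone function):

```c
#include <stdint.h>

/* Rounded multiply of two 8-bit values treated as fractions of 255,
 * i.e. (a/255 * b/255) scaled back to 0..255 with correct rounding. */
static uint8_t
mul_un8 (uint8_t a, uint8_t b)
{
    uint16_t t = (uint16_t) a * b + 0x80;
    return (uint8_t) ((t + (t >> 8)) >> 8);
}
```

Since mul_un8 (s, m) == mul_un8 (m, s) for every component, swapping the source and mask images leaves the combined operand, and hence the result, unchanged.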


Ben, when we eventually get to the actual optimization patches, I expect
I will reproduce the benchmarks myself while reviewing. Is that ok?


That's definitely fine by me, thanks for offering. Fingers crossed it
doesn't substantially change the results :-)


What do you think of this series?


In general, it looks like an improvement to me. I don't think the test
programs have had this much attention in a while! One minor point in
patch 6's log message, you copied my typo:

  operation_src[_mask]dst[_ca]

should be

  operation_src[_mask]_dst[_ca]

Ben
___
Pixman mailing list
Pixman@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/pixman


[Pixman] [PATCH] test: Add a new benchmarker targeting affine operations

2015-04-08 Thread Ben Avison
---
 .gitignore|1 +
 test/Makefile.sources |1 +
 test/affine-bench.c   |  394 +
 3 files changed, 396 insertions(+), 0 deletions(-)
 create mode 100644 test/affine-bench.c

diff --git a/.gitignore b/.gitignore
index 0f11496..7da6b6f 100644
--- a/.gitignore
+++ b/.gitignore
@@ -46,6 +46,7 @@ demos/tri-test
 pixman/pixman-srgb.c
 pixman/pixman-version.h
 test/a1-trap-test
+test/affine-bench
 test/affine-test
 test/alpha-loop
 test/alphamap
diff --git a/test/Makefile.sources b/test/Makefile.sources
index c20c34b..8b0e855 100644
--- a/test/Makefile.sources
+++ b/test/Makefile.sources
@@ -37,6 +37,7 @@ OTHERPROGRAMS = \
radial-perf-test\
 check-formats   \
scaling-bench   \
+   affine-bench\
$(NULL)
 
 # Utility functions
diff --git a/test/affine-bench.c b/test/affine-bench.c
new file mode 100644
index 000..720d066
--- /dev/null
+++ b/test/affine-bench.c
@@ -0,0 +1,394 @@
+/*
+ * Copyright © 2014 RISC OS Open Ltd
+ *
+ * Permission to use, copy, modify, distribute, and sell this software and its
+ * documentation for any purpose is hereby granted without fee, provided that
+ * the above copyright notice appear in all copies and that both that
+ * copyright notice and this permission notice appear in supporting
+ * documentation, and that the name of the copyright holders not be used in
+ * advertising or publicity pertaining to distribution of the software without
+ * specific, written prior permission.  The copyright holders make no
+ * representations about the suitability of this software for any purpose.  It
+ * is provided "as is" without express or implied warranty.
+ *
+ * THE COPYRIGHT HOLDERS DISCLAIM ALL WARRANTIES WITH REGARD TO THIS
+ * SOFTWARE, INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND
+ * FITNESS, IN NO EVENT SHALL THE COPYRIGHT HOLDERS BE LIABLE FOR ANY
+ * SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
+ * WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN
+ * AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING
+ * OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS
+ * SOFTWARE.
+ *
+ * Author:  Ben Avison (bavi...@riscosopen.org)
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <ctype.h>
+#include <stdint.h>
+#include "utils.h"
+
+#ifdef HAVE_GETTIMEOFDAY
+#include <sys/time.h>
+#else
+#include <time.h>
+#endif
+
+#define WIDTH  1920
+#define HEIGHT 1080
+#define MAX_L2CACHE_SIZE (8 * 1024 * 1024) /* How much data to read to flush all cached data to RAM */
+#define PAGE_SIZE (4 * 1024)
+
+typedef struct box_48_16 box_48_16_t;
+
+struct box_48_16
+{
+pixman_fixed_48_16_tx1;
+pixman_fixed_48_16_ty1;
+pixman_fixed_48_16_tx2;
+pixman_fixed_48_16_ty2;
+};
+
+static pixman_bool_t
+compute_transformed_extents (pixman_transform_t   *transform,
+ const pixman_box32_t *extents,
+ box_48_16_t  *transformed)
+{
+pixman_fixed_48_16_t tx1, ty1, tx2, ty2;
+pixman_fixed_t x1, y1, x2, y2;
+int i;
+
+x1 = pixman_int_to_fixed (extents->x1) + pixman_fixed_1 / 2;
+y1 = pixman_int_to_fixed (extents->y1) + pixman_fixed_1 / 2;
+x2 = pixman_int_to_fixed (extents->x2) - pixman_fixed_1 / 2;
+y2 = pixman_int_to_fixed (extents->y2) - pixman_fixed_1 / 2;
+
+if (!transform)
+{
+transformed->x1 = x1;
+transformed->y1 = y1;
+transformed->x2 = x2;
+transformed->y2 = y2;
+
+return TRUE;
+}
+
+tx1 = ty1 = INT64_MAX;
+tx2 = ty2 = INT64_MIN;
+
+for (i = 0; i < 4; ++i)
+{
+pixman_fixed_48_16_t tx, ty;
+pixman_vector_t v;
+
+v.vector[0] = (i & 0x01) ? x1 : x2;
+v.vector[1] = (i & 0x02) ? y1 : y2;
+v.vector[2] = pixman_fixed_1;
+
+if (!pixman_transform_point (transform, v))
+return FALSE;
+
+tx = (pixman_fixed_48_16_t)v.vector[0];
+ty = (pixman_fixed_48_16_t)v.vector[1];
+
+if (tx < tx1)
+tx1 = tx;
+if (ty < ty1)
+ty1 = ty;
+if (tx > tx2)
+tx2 = tx;
+if (ty > ty2)
+ty2 = ty;
+}
+
+transformed->x1 = tx1;
+transformed->y1 = ty1;
+transformed->x2 = tx2;
+transformed->y2 = ty2;
+
+return TRUE;
+}
+
+static void
+create_image (uint32_t   width,
+  uint32_t   height,
+  pixman_format_code_t   format,
+  pixman_filter_tfilter,
+  const pixman_transform_t  *t,
+  uint32_t **bits,
+  pixman_image_t   **image)
+{
+uint32_t stride = (width * PIXMAN_FORMAT_BPP(format) + 31) / 32 * 4;
+
+*bits = aligned_malloc (PAGE_SIZE, stride * height);
+memset (*bits, 0xCC, stride * height);
+*image

Re: [Pixman] [PATCH] lowlevel-blt-bench: Parse test name strings in general case

2015-04-08 Thread Ben Avison

On Wed, 08 Apr 2015 12:21:03 +0100, Pekka Paalanen ppaala...@gmail.com wrote:


But if I'll rework the lookup tables, I can rework this too. Would be
my pleasure, even, getting acquainted with Pixman style. :-)


I made some revisions to affine-bench.c (and a couple of tweaks to
lowlevel-blt-bench.c and pixman.c too) while responding to your comments
yesterday, but stopped short of posting it because I hoped someone might
express a preference about the addition of OS-specific code for reading
cache sizes etc. If you're planning on having a play, I'll repost them
as they are.

Ben


[Pixman] [PATCH] pixman.c: Coding style

2015-04-08 Thread Ben Avison
A few violations of coding style were identified in code copied from here
into affine-bench.
---
 pixman/pixman.c |   18 ++
 1 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/pixman/pixman.c b/pixman/pixman.c
index 9555cea..a07c577 100644
--- a/pixman/pixman.c
+++ b/pixman/pixman.c
@@ -325,18 +325,20 @@ _pixman_compute_composite_region32 (pixman_region32_t * 
region,
 return TRUE;
 }
 
-typedef struct
+typedef struct box_48_16 box_48_16_t;
+
+struct box_48_16
 {
-pixman_fixed_48_16_t   x1;
-pixman_fixed_48_16_t   y1;
-pixman_fixed_48_16_t   x2;
-pixman_fixed_48_16_t   y2;
-} box_48_16_t;
+pixman_fixed_48_16_tx1;
+pixman_fixed_48_16_ty1;
+pixman_fixed_48_16_tx2;
+pixman_fixed_48_16_ty2;
+};
 
 static pixman_bool_t
-compute_transformed_extents (pixman_transform_t *transform,
+compute_transformed_extents (pixman_transform_t   *transform,
 const pixman_box32_t *extents,
-box_48_16_t *transformed)
+box_48_16_t  *transformed)
 {
 pixman_fixed_48_16_t tx1, ty1, tx2, ty2;
 pixman_fixed_t x1, y1, x2, y2;
-- 
1.7.5.4



Re: [Pixman] [PATCH 4/5] lowlevel-blt-bench: Parse test name strings in general case

2015-04-07 Thread Ben Avison

Hi,

Sorry for the delay in following up...

On Mon, 16 Mar 2015 14:33:47 -, Pekka Paalanen ppaala...@gmail.com wrote:

I understood these arrays should have been replaced by
format_from_string/operator_from_string functions from patch 2 and
similar new functions doing lookups in the same arrays.

Hmm, but I see there are several alternative spellings for operators,
and the formats follow a whole new convention.


Some background to this might help. Pixman supports a lot of pixel
formats; of particular significance is the fact that in many cases you
can have the same bits-per-pixel, divided up into the same bitpattern of
colour components, but with the red and blue components exchanged.

Each of Pixman's operations act upon between 1 and 3 bitmaps, each of
which may have any supported pixel format. But in practice, any given
application tends to stick to the same colour component ordering for most
or all of its images, so the actual number of operations you're likely to
encounter in practice is 1/2 (for 2-image operations) or 1/4 (for 3-image
operations) of what you might otherwise expect. Furthermore,
mathematically, an operation on (for example) two a8r8g8b8 images is
identical to one on two a8b8g8r8 images, so typically a fast path written
for one will be listed in the fast path table under both image formats.
Because of this, a naming convention has evolved in the source code where
fast path names include the string "8888" as an indication that it can be
applied to either a8r8g8b8 or a8b8g8r8, with the implicit assumption that
the other image has the same colour component ordering.

lowlevel-blt-bench is most useful when you're testing the effectiveness
of a particular fast path, so it makes sense that its test pattern names
reflect the names of the fast path function that you're interested in.
However, there are a few tests, like src_0888_8888_rev or rpixbuf,
where the limitations of this approach start to show. I suspected I would
get objections if I changed the command line of lowlevel-blt-bench, but
in introducing a new tool, affine-bench, I saw an opportunity to allow
the pixel formats to be specified more directly, and deliberately made
its syntax different.

It is a matter for debate as to whether the two tools should use the same
syntax or not, and if so, which one they should standardise on. I'd be a
bit uncomfortable with saying that "8888" is an alias for a8r8g8b8 or
"0565" for r5g6b5, because that's not true in both directions, as I
have described. I'd probably choose the more verbose form if the
consensus was that the two tools should match. Is there anyone who cares
either way, though?
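One possible middle ground, if the tools were to standardise, would be a one-way alias table consulted before the canonical format lookup, so the shorthand spellings remain accepted without claiming to be true aliases. A sketch (function and table names are hypothetical; only the two mappings discussed above are included):

```c
#include <string.h>

/* Map a lowlevel-blt-bench-style spelling onto a canonical format
 * name; unknown strings pass through unchanged for the normal lookup. */
static const struct { const char *alias, *canonical; } format_aliases[] = {
    { "8888", "a8r8g8b8" },
    { "0565", "r5g6b5"   },
};

static const char *
resolve_format_alias (const char *name)
{
    size_t i;

    for (i = 0; i < sizeof format_aliases / sizeof format_aliases[0]; i++)
        if (strcmp (name, format_aliases[i].alias) == 0)
            return format_aliases[i].canonical;
    return name;
}
```

Because the mapping is applied only in one direction, "a8b8g8r8" would still have to be spelled out in full, which sidesteps the ambiguity described above.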


Personally I'm not really much of a fan of this much nesting,
especially where you have a number of parsing steps and each step nests
a bit more. I'd refactor this code into a function and use early (error)
returns instead of nesting.


Quick-and-dirty code there - it was only a test harness after all. But
easily fixed.


Looks like there is also no complaint, if the given string is not a
valid test name.


It never did generate any sort of error if you specified a bad test
string - you just wouldn't have seen any results. Another easy fix -
updated patch following shortly.

Ben

