Re: [AArch64_be] Fix vtbl[34] and vtbx4
On Tue, Oct 13, 2015 at 02:05:01PM +0100, Christophe Lyon wrote:
> I committed this as r228716, and noticed later that
> gcc.target/aarch64/table-intrinsics.c failed because of this patch.
>
> This is because that testcase scans the assembly for 'tbl v' or 'tbx v',
> but since I replaced some asm statements, the space is now a tab.
>
> I plan to commit this (probably obvious?):
>
> 2015-10-13  Christophe Lyon
>
> 	* gcc/testsuite/gcc.target/aarch64/table-intrinsics.c: Fix regexp
> 	after r228716 (Fix vtbl[34] and vtbx4).

Bad luck. This is fine (and yes, obvious).

Thanks,
James

> Index: gcc/testsuite/gcc.target/aarch64/table-intrinsics.c
> ===
> --- gcc/testsuite/gcc.target/aarch64/table-intrinsics.c	(revision 228759)
> +++ gcc/testsuite/gcc.target/aarch64/table-intrinsics.c	(working copy)
> @@ -435,5 +435,5 @@
>    return vqtbx4q_p8 (r, tab, idx);
>  }
>
> -/* { dg-final { scan-assembler-times "tbl v" 42} } */
> -/* { dg-final { scan-assembler-times "tbx v" 30} } */
> +/* { dg-final { scan-assembler-times "tbl\[ |\t\]*v" 42} } */
> +/* { dg-final { scan-assembler-times "tbx\[ |\t\]*v" 30} } */
Re: [AArch64_be] Fix vtbl[34] and vtbx4
On 12 October 2015 at 15:30, James Greenhalgh wrote:
> On Fri, Oct 09, 2015 at 05:16:05PM +0100, Christophe Lyon wrote:
>> On 8 October 2015 at 11:12, James Greenhalgh wrote:
>> > On Wed, Oct 07, 2015 at 09:07:30PM +0100, Christophe Lyon wrote:
>> >> On 7 October 2015 at 17:09, James Greenhalgh wrote:
>> >> > On Tue, Sep 15, 2015 at 05:25:25PM +0100, Christophe Lyon wrote:
>> >> >
>> >> > Why do we want this for vtbx4 rather than putting out a VTBX
>> >> > instruction directly (as in the inline asm versions you replace)?
>> >> >
>> >> I just followed the pattern used for vtbx3.
>> >>
>> >> > This sequence does make sense for vtbx3.
>> >> In fact, I don't see why vtbx3 and vtbx4 should be different?
>> >
>> > The difference between TBL and TBX is in their handling of a request to
>> > select an out-of-range value. For TBL this returns zero, for TBX this
>> > returns the value which was already in the destination register.
>> >
>> > Because the byte-vectors used by the TBX instruction in aarch64 are
>> > 128-bit (so two of them together allow selecting elements in the range
>> > 0-31), and vtbx3 needs to emulate the AArch32 behaviour of picking
>> > elements from 3x64-bit vectors (allowing elements in the range 0-23),
>> > we need to manually check for values which would have been out-of-range
>> > on AArch32, but are not out of range for AArch64 and handle them
>> > appropriately. For vtbx4 on the other hand, 2x128-bit registers give
>> > the range 0..31 and 4x64-bit registers give the range 0..31, so we
>> > don't need the special masked handling.
>> >
>> > You can find the suggested instruction sequences for the Neon
>> > intrinsics in this document:
>> >
>> > http://infocenter.arm.com/help/topic/com.arm.doc.ihi0073a/IHI0073A_arm_neon_intrinsics_ref.pdf
>> >
>> Hi James,
>>
>> Please find attached an updated version which hopefully addresses your
>> comments.
>> Tested on aarch64-none-elf and aarch64_be-none-elf using the Foundation
>> Model.
>>
>> OK?
>
> Looks good to me,
>
> Thanks,
> James

I committed this as r228716, and noticed later that
gcc.target/aarch64/table-intrinsics.c failed because of this patch.

This is because that testcase scans the assembly for 'tbl v' or 'tbx v',
but since I replaced some asm statements, the space is now a tab.

I plan to commit this (probably obvious?):

2015-10-13  Christophe Lyon

	* gcc/testsuite/gcc.target/aarch64/table-intrinsics.c: Fix regexp
	after r228716 (Fix vtbl[34] and vtbx4).

Index: gcc/testsuite/gcc.target/aarch64/table-intrinsics.c
===
--- gcc/testsuite/gcc.target/aarch64/table-intrinsics.c	(revision 228759)
+++ gcc/testsuite/gcc.target/aarch64/table-intrinsics.c	(working copy)
@@ -435,5 +435,5 @@
   return vqtbx4q_p8 (r, tab, idx);
 }

-/* { dg-final { scan-assembler-times "tbl v" 42} } */
-/* { dg-final { scan-assembler-times "tbx v" 30} } */
+/* { dg-final { scan-assembler-times "tbl\[ |\t\]*v" 42} } */
+/* { dg-final { scan-assembler-times "tbx\[ |\t\]*v" 30} } */
Re: [AArch64_be] Fix vtbl[34] and vtbx4
On Fri, Oct 09, 2015 at 05:16:05PM +0100, Christophe Lyon wrote:
> On 8 October 2015 at 11:12, James Greenhalgh wrote:
> > On Wed, Oct 07, 2015 at 09:07:30PM +0100, Christophe Lyon wrote:
> >> On 7 October 2015 at 17:09, James Greenhalgh wrote:
> >> > On Tue, Sep 15, 2015 at 05:25:25PM +0100, Christophe Lyon wrote:
> >> >
> >> > Why do we want this for vtbx4 rather than putting out a VTBX
> >> > instruction directly (as in the inline asm versions you replace)?
> >> >
> >> I just followed the pattern used for vtbx3.
> >>
> >> > This sequence does make sense for vtbx3.
> >> In fact, I don't see why vtbx3 and vtbx4 should be different?
> >
> > The difference between TBL and TBX is in their handling of a request to
> > select an out-of-range value. For TBL this returns zero, for TBX this
> > returns the value which was already in the destination register.
> >
> > Because the byte-vectors used by the TBX instruction in aarch64 are
> > 128-bit (so two of them together allow selecting elements in the range
> > 0-31), and vtbx3 needs to emulate the AArch32 behaviour of picking
> > elements from 3x64-bit vectors (allowing elements in the range 0-23),
> > we need to manually check for values which would have been out-of-range
> > on AArch32, but are not out of range for AArch64 and handle them
> > appropriately. For vtbx4 on the other hand, 2x128-bit registers give
> > the range 0..31 and 4x64-bit registers give the range 0..31, so we
> > don't need the special masked handling.
> >
> > You can find the suggested instruction sequences for the Neon
> > intrinsics in this document:
> >
> > http://infocenter.arm.com/help/topic/com.arm.doc.ihi0073a/IHI0073A_arm_neon_intrinsics_ref.pdf
> >
> Hi James,
>
> Please find attached an updated version which hopefully addresses your
> comments.
> Tested on aarch64-none-elf and aarch64_be-none-elf using the Foundation
> Model.
>
> OK?

Looks good to me,

Thanks,
James
Re: [AArch64_be] Fix vtbl[34] and vtbx4
On 8 October 2015 at 11:12, James Greenhalgh wrote:
> On Wed, Oct 07, 2015 at 09:07:30PM +0100, Christophe Lyon wrote:
>> On 7 October 2015 at 17:09, James Greenhalgh wrote:
>> > On Tue, Sep 15, 2015 at 05:25:25PM +0100, Christophe Lyon wrote:
>> >
>> > Why do we want this for vtbx4 rather than putting out a VTBX
>> > instruction directly (as in the inline asm versions you replace)?
>> >
>> I just followed the pattern used for vtbx3.
>>
>> > This sequence does make sense for vtbx3.
>> In fact, I don't see why vtbx3 and vtbx4 should be different?
>
> The difference between TBL and TBX is in their handling of a request to
> select an out-of-range value. For TBL this returns zero, for TBX this
> returns the value which was already in the destination register.
>
> Because the byte-vectors used by the TBX instruction in aarch64 are
> 128-bit (so two of them together allow selecting elements in the range
> 0-31), and vtbx3 needs to emulate the AArch32 behaviour of picking
> elements from 3x64-bit vectors (allowing elements in the range 0-23), we
> need to manually check for values which would have been out-of-range on
> AArch32, but are not out of range for AArch64 and handle them
> appropriately. For vtbx4 on the other hand, 2x128-bit registers give the
> range 0..31 and 4x64-bit registers give the range 0..31, so we don't
> need the special masked handling.
>
> You can find the suggested instruction sequences for the Neon intrinsics
> in this document:
>
> http://infocenter.arm.com/help/topic/com.arm.doc.ihi0073a/IHI0073A_arm_neon_intrinsics_ref.pdf

Hi James,

Please find attached an updated version which hopefully addresses your
comments.
Tested on aarch64-none-elf and aarch64_be-none-elf using the Foundation
Model.

OK?

Christophe.
>> >> /* vtrn */ >> >> >> >> __extension__ static __inline float32x2_t __attribute__ >> >> ((__always_inline__)) >> >> diff --git a/gcc/config/aarch64/iterators.md >> >> b/gcc/config/aarch64/iterators.md >> >> index b8a45d1..dfbd9cd 100644 >> >> --- a/gcc/config/aarch64/iterators.md >> >> +++ b/gcc/config/aarch64/iterators.md >> >> @@ -100,6 +100,8 @@ >> >> ;; All modes. >> >> (define_mode_iterator VALL [V8QI V16QI V4HI V8HI V2SI V4SI V2DI V2SF >> >> V4SF V2DF]) >> >> >> >> +(define_mode_iterator V8Q [V8QI]) >> >> + >> > >> > This can be dropped if you use VAR1 in aarch64-builtins.c. >> > >> > Thanks for working on this, with your patch applied, the only >> > remaining intrinsics I see failing for aarch64_be are: >> > >> > vqtbl2_*8 >> > vqtbl2q_*8 >> > vqtbl3_*8 >> > vqtbl3q_*8 >> > vqtbl4_*8 >> > vqtbl4q_*8 >> > >> > vqtbx2_*8 >> > vqtbx2q_*8 >> > vqtbx3_*8 >> > vqtbx3q_*8 >> > vqtbx4_*8 >> > vqtbx4q_*8 >> > >> Quite possibly. Which tests are you looking at? Since these are >> aarch64-specific, they are not part of the >> tests I added (advsimd-intrinsics). Do you mean >> gcc.target/aarch64/table-intrinsics.c? > > Sorry, yes I should have given a reference. I'm running with a variant of > a testcase from the LLVM test-suite repository: > > SingleSource/UnitTests/Vector/AArch64/aarch64_neon_intrinsics.c > > This has an execute test for most of the intrinsics specified for AArch64. > It needs some modification to cover the intrinsics we don't implement yet. > > Thanks, > James > 2015-10-09 Christophe Lyon * config/aarch64/aarch64-simd-builtins.def: Update builtins tables: add tbl3 and tbx4. * config/aarch64/aarch64-simd.md (aarch64_tbl3v8qi): New. (aarch64_tbx4v8qi): New. * config/aarch64/arm_neon.h (vtbl3_s8, vtbl3_u8, vtbl3_p8) (vtbl4_s8, vtbl4_u8, vtbl4_p8, vtbx4_s8, vtbx4_u8, vtbx4_p8): Rewrite using builtin functions. * config/aarch64/iterators.md (UNSPEC_TBX): New. 
diff --git a/gcc/config/aarch64/aarch64-simd-builtins.def b/gcc/config/aarch64/aarch64-simd-builtins.def index d0f298a..c16e82c9 100644 --- a/gcc/config/aarch64/aarch64-simd-builtins.def +++ b/gcc/config/aarch64/aarch64-simd-builtins.def @@ -405,3 +405,8 @@ VAR1 (BINOPP, crypto_pmull, 0, di) VAR1 (BINOPP, crypto_pmull, 0, v2di) + /* Implemented by aarch64_tbl3v8qi. */ + VAR1 (BINOP, tbl3, 0, v8qi) + + /* Implemented by aarch64_tbx4v8qi. */ + VAR1 (TERNOP, tbx4, 0, v8qi) diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md index 9777418..6027582 100644 --- a/gcc/config/aarch64/aarch64-simd.md +++ b/gcc/config/aarch64/aarch64-simd.md @@ -4716,6 +4714,27 @@ [(set_attr "type" "neon_tbl2_q")] ) +(define_insn "aarch64_tbl3v8qi" + [(set (match_operand:V8QI 0 "register_operand" "=w") + (unspec:V8QI [(match_operand:OI 1 "register_operand" "w") + (match_operand:V8QI 2 "register_operand" "w")] + UNSPEC_TBL))] + "TARGET_SIMD" + "tbl\\t%S0.8b, {%S1.16b - %T1.16b}, %S2.8b" + [(set_attr "type" "neon_tbl3")] +) + +(define_insn "aarch64_tbx4v8qi" + [(set (match_operand:V8QI 0 "register_operand" "=w") + (unspec:V8QI [(match_operand:V8QI 1 "register_
Re: [AArch64_be] Fix vtbl[34] and vtbx4
On Wed, Oct 07, 2015 at 09:07:30PM +0100, Christophe Lyon wrote:
> On 7 October 2015 at 17:09, James Greenhalgh wrote:
> > On Tue, Sep 15, 2015 at 05:25:25PM +0100, Christophe Lyon wrote:
> >
> > Why do we want this for vtbx4 rather than putting out a VTBX
> > instruction directly (as in the inline asm versions you replace)?
> >
> I just followed the pattern used for vtbx3.
>
> > This sequence does make sense for vtbx3.
> In fact, I don't see why vtbx3 and vtbx4 should be different?

The difference between TBL and TBX is in their handling of a request to
select an out-of-range value. For TBL this returns zero, for TBX this
returns the value which was already in the destination register.

Because the byte-vectors used by the TBX instruction in aarch64 are
128-bit (so two of them together allow selecting elements in the range
0-31), and vtbx3 needs to emulate the AArch32 behaviour of picking
elements from 3x64-bit vectors (allowing elements in the range 0-23), we
need to manually check for values which would have been out-of-range on
AArch32, but are not out of range for AArch64 and handle them
appropriately. For vtbx4 on the other hand, 2x128-bit registers give the
range 0..31 and 4x64-bit registers give the range 0..31, so we don't
need the special masked handling.

You can find the suggested instruction sequences for the Neon intrinsics
in this document:

http://infocenter.arm.com/help/topic/com.arm.doc.ihi0073a/IHI0073A_arm_neon_intrinsics_ref.pdf

> >> /* vtrn */
> >>
> >> __extension__ static __inline float32x2_t __attribute__
> >> ((__always_inline__))
> >> diff --git a/gcc/config/aarch64/iterators.md
> >> b/gcc/config/aarch64/iterators.md
> >> index b8a45d1..dfbd9cd 100644
> >> --- a/gcc/config/aarch64/iterators.md
> >> +++ b/gcc/config/aarch64/iterators.md
> >> @@ -100,6 +100,8 @@
> >>  ;; All modes.
> >> (define_mode_iterator VALL [V8QI V16QI V4HI V8HI V2SI V4SI V2DI V2SF V4SF > >> V2DF]) > >> > >> +(define_mode_iterator V8Q [V8QI]) > >> + > > > > This can be dropped if you use VAR1 in aarch64-builtins.c. > > > > Thanks for working on this, with your patch applied, the only > > remaining intrinsics I see failing for aarch64_be are: > > > > vqtbl2_*8 > > vqtbl2q_*8 > > vqtbl3_*8 > > vqtbl3q_*8 > > vqtbl4_*8 > > vqtbl4q_*8 > > > > vqtbx2_*8 > > vqtbx2q_*8 > > vqtbx3_*8 > > vqtbx3q_*8 > > vqtbx4_*8 > > vqtbx4q_*8 > > > Quite possibly. Which tests are you looking at? Since these are > aarch64-specific, they are not part of the > tests I added (advsimd-intrinsics). Do you mean > gcc.target/aarch64/table-intrinsics.c? Sorry, yes I should have given a reference. I'm running with a variant of a testcase from the LLVM test-suite repository: SingleSource/UnitTests/Vector/AArch64/aarch64_neon_intrinsics.c This has an execute test for most of the intrinsics specified for AArch64. It needs some modification to cover the intrinsics we don't implement yet. Thanks, James
Re: [AArch64_be] Fix vtbl[34] and vtbx4
On 7 October 2015 at 17:09, James Greenhalgh wrote: > On Tue, Sep 15, 2015 at 05:25:25PM +0100, Christophe Lyon wrote: >> This patch re-implements vtbl[34] and vtbx4 AdvSIMD intrinsics using >> existing builtins, and fixes the behaviour on aarch64_be. >> >> Tested on aarch64_be-none-elf and aarch64-none-elf using the Foundation >> Model. >> >> OK? > > Hi Christophe, > > Sorry for the delay getting back to you, comments below. > >> 2015-09-15 Christophe Lyon >> >> * config/aarch64/aarch64-builtins.c >> (aarch64_types_tbl_qualifiers): New static data. >> (TYPES_TBL): Define. >> * config/aarch64/aarch64-simd-builtins.def: Update builtins >> tables. >> * config/aarch64/aarch64-simd.md (aarch64_tbl3v8qi): New. >> * config/aarch64/arm_neon.h (vtbl3_s8, vtbl3_u8, vtbl3_p8) >> (vtbl4_s8, vtbl4_u8, vtbl4_p8): Rewrite using builtin functions. >> (vtbx4_s8, vtbx4_u8, vtbx4_p8): Emulate behaviour using other >> intrinsics. >> * config/aarch64/iterators.md (V8Q): New. > >> diff --git a/gcc/config/aarch64/aarch64-builtins.c >> b/gcc/config/aarch64/aarch64-builtins.c >> index 0f4f2b9..7ca3917 100644 >> --- a/gcc/config/aarch64/aarch64-builtins.c >> +++ b/gcc/config/aarch64/aarch64-builtins.c >> @@ -253,6 +253,11 @@ >> aarch64_types_storestruct_lane_qualifiers[SIMD_MAX_BUILTIN_ARGS] >>qualifier_none, qualifier_struct_load_store_lane_index }; >> #define TYPES_STORESTRUCT_LANE (aarch64_types_storestruct_lane_qualifiers) >> >> +static enum aarch64_type_qualifiers >> +aarch64_types_tbl_qualifiers[SIMD_MAX_BUILTIN_ARGS] >> + = { qualifier_none, qualifier_none, qualifier_none }; >> +#define TYPES_TBL (aarch64_types_tbl_qualifiers) >> + > > Do we need these? This looks like TYPES_BINOP (the predicate on the > instruction pattern will prevent the "qualifier_maybe_immediate" from > becoming a problem). > I'll give it a try, indeed I feared "qualifier_maybe_immediate" would cause problems. 
>> #define CF0(N, X) CODE_FOR_aarch64_##N##X >> #define CF1(N, X) CODE_FOR_##N##X##1 >> #define CF2(N, X) CODE_FOR_##N##X##2 >> diff --git a/gcc/config/aarch64/aarch64-simd-builtins.def >> b/gcc/config/aarch64/aarch64-simd-builtins.def >> index d0f298a..62f1b13 100644 >> --- a/gcc/config/aarch64/aarch64-simd-builtins.def >> +++ b/gcc/config/aarch64/aarch64-simd-builtins.def >> @@ -405,3 +405,5 @@ >>VAR1 (BINOPP, crypto_pmull, 0, di) >>VAR1 (BINOPP, crypto_pmull, 0, v2di) >> >> + /* Implemented by aarch64_tbl3v8qi. */ >> + BUILTIN_V8Q (TBL, tbl3, 0) > > This can be: > > VAR1 (BINOP, tbl3, 0, v8qi) > > It would be good if we could eliminate the casts in arm_neon.h by also > defining a "BINOPU" version of this, but I imagine that gets stuck on the > types accepted by __builtin_aarch64_set_qregoiv16qi - so don't worry about > making that change. OK > >> diff --git a/gcc/config/aarch64/aarch64-simd.md >> b/gcc/config/aarch64/aarch64-simd.md >> index 9777418..84a61d5 100644 >> --- a/gcc/config/aarch64/aarch64-simd.md >> +++ b/gcc/config/aarch64/aarch64-simd.md >> @@ -4716,6 +4714,16 @@ >>[(set_attr "type" "neon_tbl2_q")] >> ) >> >> +(define_insn "aarch64_tbl3v8qi" >> + [(set (match_operand:V8QI 0 "register_operand" "=w") >> + (unspec:V8QI [(match_operand:OI 1 "register_operand" "w") >> + (match_operand:V8QI 2 "register_operand" "w")] >> + UNSPEC_TBL))] >> + "TARGET_SIMD" >> + "tbl\\t%S0.8b, {%S1.16b - %T1.16b}, %S2.8b" >> + [(set_attr "type" "neon_tbl3")] >> +) >> + >> (define_insn_and_split "aarch64_combinev16qi" >>[(set (match_operand:OI 0 "register_operand" "=w") >> (unspec:OI [(match_operand:V16QI 1 "register_operand" "w") >> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c >> index 87bbf6e..91704de 100644 >> diff --git a/gcc/config/aarch64/arm_neon.h b/gcc/config/aarch64/arm_neon.h >> index 6dfebe7..e8ee318 100644 >> --- a/gcc/config/aarch64/arm_neon.h >> +++ b/gcc/config/aarch64/arm_neon.h >> /* End of temporary inline asm. 
*/ >> >> /* Start of optimal implementations in approved order. */ >> @@ -23221,6 +23182,36 @@ vtbx3_p8 (poly8x8_t __r, poly8x8x3_t __tab, >> uint8x8_t __idx) >>return vbsl_p8 (__mask, __tbl, __r); >> } >> >> +/* vtbx4 */ >> + >> +__extension__ static __inline int8x8_t __attribute__ ((__always_inline__)) >> +vtbx4_s8 (int8x8_t __r, int8x8x4_t __tab, int8x8_t __idx) >> +{ >> + uint8x8_t __mask = vclt_u8 (vreinterpret_u8_s8 (__idx), >> + vmov_n_u8 (32)); >> + int8x8_t __tbl = vtbl4_s8 (__tab, __idx); >> + >> + return vbsl_s8 (__mask, __tbl, __r); >> +} >> + >> +__extension__ static __inline uint8x8_t __attribute__ ((__always_inline__)) >> +vtbx4_u8 (uint8x8_t __r, uint8x8x4_t __tab, uint8x8_t __idx) >> +{ >> + uint8x8_t __mask = vclt_u8 (__idx, vmov_n_u8 (32)); >> + uint8x8_t __tbl = vtbl4_u8 (__tab, __idx); >> + >> + return vbsl_u8 (__mask, __tbl, __r); >> +} >> + >> +__extension__ static __inline pol
Re: [AArch64_be] Fix vtbl[34] and vtbx4
On Tue, Sep 15, 2015 at 05:25:25PM +0100, Christophe Lyon wrote: > This patch re-implements vtbl[34] and vtbx4 AdvSIMD intrinsics using > existing builtins, and fixes the behaviour on aarch64_be. > > Tested on aarch64_be-none-elf and aarch64-none-elf using the Foundation Model. > > OK? Hi Christophe, Sorry for the delay getting back to you, comments below. > 2015-09-15 Christophe Lyon > > * config/aarch64/aarch64-builtins.c > (aarch64_types_tbl_qualifiers): New static data. > (TYPES_TBL): Define. > * config/aarch64/aarch64-simd-builtins.def: Update builtins > tables. > * config/aarch64/aarch64-simd.md (aarch64_tbl3v8qi): New. > * config/aarch64/arm_neon.h (vtbl3_s8, vtbl3_u8, vtbl3_p8) > (vtbl4_s8, vtbl4_u8, vtbl4_p8): Rewrite using builtin functions. > (vtbx4_s8, vtbx4_u8, vtbx4_p8): Emulate behaviour using other > intrinsics. > * config/aarch64/iterators.md (V8Q): New. > diff --git a/gcc/config/aarch64/aarch64-builtins.c > b/gcc/config/aarch64/aarch64-builtins.c > index 0f4f2b9..7ca3917 100644 > --- a/gcc/config/aarch64/aarch64-builtins.c > +++ b/gcc/config/aarch64/aarch64-builtins.c > @@ -253,6 +253,11 @@ > aarch64_types_storestruct_lane_qualifiers[SIMD_MAX_BUILTIN_ARGS] >qualifier_none, qualifier_struct_load_store_lane_index }; > #define TYPES_STORESTRUCT_LANE (aarch64_types_storestruct_lane_qualifiers) > > +static enum aarch64_type_qualifiers > +aarch64_types_tbl_qualifiers[SIMD_MAX_BUILTIN_ARGS] > + = { qualifier_none, qualifier_none, qualifier_none }; > +#define TYPES_TBL (aarch64_types_tbl_qualifiers) > + Do we need these? This looks like TYPES_BINOP (the predicate on the instruction pattern will prevent the "qualifier_maybe_immediate" from becoming a problem). 
> #define CF0(N, X) CODE_FOR_aarch64_##N##X > #define CF1(N, X) CODE_FOR_##N##X##1 > #define CF2(N, X) CODE_FOR_##N##X##2 > diff --git a/gcc/config/aarch64/aarch64-simd-builtins.def > b/gcc/config/aarch64/aarch64-simd-builtins.def > index d0f298a..62f1b13 100644 > --- a/gcc/config/aarch64/aarch64-simd-builtins.def > +++ b/gcc/config/aarch64/aarch64-simd-builtins.def > @@ -405,3 +405,5 @@ >VAR1 (BINOPP, crypto_pmull, 0, di) >VAR1 (BINOPP, crypto_pmull, 0, v2di) > > + /* Implemented by aarch64_tbl3v8qi. */ > + BUILTIN_V8Q (TBL, tbl3, 0) This can be: VAR1 (BINOP, tbl3, 0, v8qi) It would be good if we could eliminate the casts in arm_neon.h by also defining a "BINOPU" version of this, but I imagine that gets stuck on the types accepted by __builtin_aarch64_set_qregoiv16qi - so don't worry about making that change. > diff --git a/gcc/config/aarch64/aarch64-simd.md > b/gcc/config/aarch64/aarch64-simd.md > index 9777418..84a61d5 100644 > --- a/gcc/config/aarch64/aarch64-simd.md > +++ b/gcc/config/aarch64/aarch64-simd.md > @@ -4716,6 +4714,16 @@ >[(set_attr "type" "neon_tbl2_q")] > ) > > +(define_insn "aarch64_tbl3v8qi" > + [(set (match_operand:V8QI 0 "register_operand" "=w") > + (unspec:V8QI [(match_operand:OI 1 "register_operand" "w") > + (match_operand:V8QI 2 "register_operand" "w")] > + UNSPEC_TBL))] > + "TARGET_SIMD" > + "tbl\\t%S0.8b, {%S1.16b - %T1.16b}, %S2.8b" > + [(set_attr "type" "neon_tbl3")] > +) > + > (define_insn_and_split "aarch64_combinev16qi" >[(set (match_operand:OI 0 "register_operand" "=w") > (unspec:OI [(match_operand:V16QI 1 "register_operand" "w") > diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c > index 87bbf6e..91704de 100644 > diff --git a/gcc/config/aarch64/arm_neon.h b/gcc/config/aarch64/arm_neon.h > index 6dfebe7..e8ee318 100644 > --- a/gcc/config/aarch64/arm_neon.h > +++ b/gcc/config/aarch64/arm_neon.h > /* End of temporary inline asm. */ > > /* Start of optimal implementations in approved order. 
*/ > @@ -23221,6 +23182,36 @@ vtbx3_p8 (poly8x8_t __r, poly8x8x3_t __tab, > uint8x8_t __idx) >return vbsl_p8 (__mask, __tbl, __r); > } > > +/* vtbx4 */ > + > +__extension__ static __inline int8x8_t __attribute__ ((__always_inline__)) > +vtbx4_s8 (int8x8_t __r, int8x8x4_t __tab, int8x8_t __idx) > +{ > + uint8x8_t __mask = vclt_u8 (vreinterpret_u8_s8 (__idx), > + vmov_n_u8 (32)); > + int8x8_t __tbl = vtbl4_s8 (__tab, __idx); > + > + return vbsl_s8 (__mask, __tbl, __r); > +} > + > +__extension__ static __inline uint8x8_t __attribute__ ((__always_inline__)) > +vtbx4_u8 (uint8x8_t __r, uint8x8x4_t __tab, uint8x8_t __idx) > +{ > + uint8x8_t __mask = vclt_u8 (__idx, vmov_n_u8 (32)); > + uint8x8_t __tbl = vtbl4_u8 (__tab, __idx); > + > + return vbsl_u8 (__mask, __tbl, __r); > +} > + > +__extension__ static __inline poly8x8_t __attribute__ ((__always_inline__)) > +vtbx4_p8 (poly8x8_t __r, poly8x8x4_t __tab, uint8x8_t __idx) > +{ > + uint8x8_t __mask = vclt_u8 (__idx, vmov_n_u8 (32)); > + poly8x8_t __tbl = vtbl4_p8 (__tab, __idx); > + > + return vbsl_p8 (__mask, __tbl, __r); > +} > + W
Re: [AArch64_be] Fix vtbl[34] and vtbx4
Ping?

https://gcc.gnu.org/ml/gcc-patches/2015-09/msg01096.html

On 29 September 2015 at 22:57, Christophe Lyon wrote:
> Ping?
>
> On 15 September 2015 at 18:25, Christophe Lyon wrote:
>> This patch re-implements vtbl[34] and vtbx4 AdvSIMD intrinsics using
>> existing builtins, and fixes the behaviour on aarch64_be.
>>
>> Tested on aarch64_be-none-elf and aarch64-none-elf using the Foundation
>> Model.
>>
>> OK?
>>
>> Christophe.
Re: [AArch64_be] Fix vtbl[34] and vtbx4
Ping?

On 15 September 2015 at 18:25, Christophe Lyon wrote:
> This patch re-implements vtbl[34] and vtbx4 AdvSIMD intrinsics using
> existing builtins, and fixes the behaviour on aarch64_be.
>
> Tested on aarch64_be-none-elf and aarch64-none-elf using the Foundation
> Model.
>
> OK?
>
> Christophe.
[AArch64_be] Fix vtbl[34] and vtbx4
This patch re-implements vtbl[34] and vtbx4 AdvSIMD intrinsics using existing builtins, and fixes the behaviour on aarch64_be. Tested on aarch64_be-none-elf and aarch64-none-elf using the Foundation Model. OK? Christophe. 2015-09-15 Christophe Lyon * config/aarch64/aarch64-builtins.c (aarch64_types_tbl_qualifiers): New static data. (TYPES_TBL): Define. * config/aarch64/aarch64-simd-builtins.def: Update builtins tables. * config/aarch64/aarch64-simd.md (aarch64_tbl3v8qi): New. * config/aarch64/arm_neon.h (vtbl3_s8, vtbl3_u8, vtbl3_p8) (vtbl4_s8, vtbl4_u8, vtbl4_p8): Rewrite using builtin functions. (vtbx4_s8, vtbx4_u8, vtbx4_p8): Emulate behaviour using other intrinsics. * config/aarch64/iterators.md (V8Q): New. diff --git a/gcc/config/aarch64/aarch64-builtins.c b/gcc/config/aarch64/aarch64-builtins.c index 0f4f2b9..7ca3917 100644 --- a/gcc/config/aarch64/aarch64-builtins.c +++ b/gcc/config/aarch64/aarch64-builtins.c @@ -253,6 +253,11 @@ aarch64_types_storestruct_lane_qualifiers[SIMD_MAX_BUILTIN_ARGS] qualifier_none, qualifier_struct_load_store_lane_index }; #define TYPES_STORESTRUCT_LANE (aarch64_types_storestruct_lane_qualifiers) +static enum aarch64_type_qualifiers +aarch64_types_tbl_qualifiers[SIMD_MAX_BUILTIN_ARGS] + = { qualifier_none, qualifier_none, qualifier_none }; +#define TYPES_TBL (aarch64_types_tbl_qualifiers) + #define CF0(N, X) CODE_FOR_aarch64_##N##X #define CF1(N, X) CODE_FOR_##N##X##1 #define CF2(N, X) CODE_FOR_##N##X##2 diff --git a/gcc/config/aarch64/aarch64-simd-builtins.def b/gcc/config/aarch64/aarch64-simd-builtins.def index d0f298a..62f1b13 100644 --- a/gcc/config/aarch64/aarch64-simd-builtins.def +++ b/gcc/config/aarch64/aarch64-simd-builtins.def @@ -405,3 +405,5 @@ VAR1 (BINOPP, crypto_pmull, 0, di) VAR1 (BINOPP, crypto_pmull, 0, v2di) + /* Implemented by aarch64_tbl3v8qi. 
*/ + BUILTIN_V8Q (TBL, tbl3, 0) diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md index 9777418..84a61d5 100644 --- a/gcc/config/aarch64/aarch64-simd.md +++ b/gcc/config/aarch64/aarch64-simd.md @@ -4716,6 +4714,16 @@ [(set_attr "type" "neon_tbl2_q")] ) +(define_insn "aarch64_tbl3v8qi" + [(set (match_operand:V8QI 0 "register_operand" "=w") + (unspec:V8QI [(match_operand:OI 1 "register_operand" "w") + (match_operand:V8QI 2 "register_operand" "w")] + UNSPEC_TBL))] + "TARGET_SIMD" + "tbl\\t%S0.8b, {%S1.16b - %T1.16b}, %S2.8b" + [(set_attr "type" "neon_tbl3")] +) + (define_insn_and_split "aarch64_combinev16qi" [(set (match_operand:OI 0 "register_operand" "=w") (unspec:OI [(match_operand:V16QI 1 "register_operand" "w") diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c index 87bbf6e..91704de 100644 diff --git a/gcc/config/aarch64/arm_neon.h b/gcc/config/aarch64/arm_neon.h index 6dfebe7..e8ee318 100644 --- a/gcc/config/aarch64/arm_neon.h +++ b/gcc/config/aarch64/arm_neon.h @@ -10902,13 +10902,14 @@ vtbl3_s8 (int8x8x3_t tab, int8x8_t idx) { int8x8_t result; int8x16x2_t temp; + __builtin_aarch64_simd_oi __o; temp.val[0] = vcombine_s8 (tab.val[0], tab.val[1]); temp.val[1] = vcombine_s8 (tab.val[2], vcreate_s8 (__AARCH64_UINT64_C (0x0))); - __asm__ ("ld1 {v16.16b - v17.16b }, %1\n\t" - "tbl %0.8b, {v16.16b - v17.16b}, %2.8b\n\t" - : "=w"(result) - : "Q"(temp), "w"(idx) - : "v16", "v17", "memory"); + __o = __builtin_aarch64_set_qregoiv16qi (__o, + (int8x16_t) temp.val[0], 0); + __o = __builtin_aarch64_set_qregoiv16qi (__o, + (int8x16_t) temp.val[1], 1); + result = __builtin_aarch64_tbl3v8qi (__o, idx); return result; } @@ -10917,13 +10918,14 @@ vtbl3_u8 (uint8x8x3_t tab, uint8x8_t idx) { uint8x8_t result; uint8x16x2_t temp; + __builtin_aarch64_simd_oi __o; temp.val[0] = vcombine_u8 (tab.val[0], tab.val[1]); temp.val[1] = vcombine_u8 (tab.val[2], vcreate_u8 (__AARCH64_UINT64_C (0x0))); - __asm__ ("ld1 {v16.16b - v17.16b 
}, %1\n\t" - "tbl %0.8b, {v16.16b - v17.16b}, %2.8b\n\t" - : "=w"(result) - : "Q"(temp), "w"(idx) - : "v16", "v17", "memory"); + __o = __builtin_aarch64_set_qregoiv16qi (__o, + (int8x16_t) temp.val[0], 0); + __o = __builtin_aarch64_set_qregoiv16qi (__o, + (int8x16_t) temp.val[1], 1); + result = (uint8x8_t)__builtin_aarch64_tbl3v8qi (__o, (int8x8_t)idx); return result; } @@ -10932,13 +10934,14 @@ vtbl3_p8 (poly8x8x3_t tab, uint8x8_t idx) { poly8x8_t result; poly8x16x2_t temp; + __builtin_aarch64_simd_oi __o; temp.val[0] = vcombine_p8 (tab.val[0], tab.val[1]); temp.val[1] = vcombine_p8 (tab.val[2], vcreate_p8 (__AARCH64_UINT64_C (0x0))); - __asm__ ("ld1 {v16.16b - v17.16b }, %1\n\t" - "tbl %0.8b, {v16.16b - v17.16b}, %2.8b\n\t" - : "=w"(result) - : "Q"(temp), "w"(idx) - : "v16", "v17", "memory"); + __o = __builtin_aarch6