Re: Loop peeling

2014-10-29 Thread Tejas Belagod

On 29/10/14 09:32, Richard Biener wrote:

On Tue, Oct 28, 2014 at 4:55 PM, Evandro Menezes  wrote:

While doing some benchmark flag mining on AArch64, I noticed that
-fpeel-loops was a mined option often.  As a matter of fact, when using it
always, even without FDO, it seemed to raise most benchmarks and to leave
almost all of the rest flat, with a barely noticeable cost in code-size.  It
seems to me that it might be safe enough to be implied perhaps at -O3.  Is
there any reason why this never came into being?


Loop peeling is done by default on AArch64 unless, IIRC, 
-fvect-cost-model=cheap is specified which switches it off. There was a 
general thread on loop peeling around the same time last year 
(https://gcc.gnu.org/ml/gcc/2013-11/msg00307.html) where Richard 
suggested that peeling vs. non-peeling should be factored into the 
vector cost model and is a more generic improvement.


Thanks,
Tejas.



Not sure, but peeling is/was very stupid (peeling 8 times unconditionally
or not at all).  At least without FDO (and with -fprofile-use it is enabled).
Similar case for -funroll-loops.

For GCC 5 peeling now moved to GIMPLE, so maybe things changed
for that (but I'd doubt that).  Honza?
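
For readers less familiar with the transformation being discussed, here is a hand-written sketch of what peeling one iteration looks like at the source level. It only illustrates the idea; the function names are made up for this example, and it is not meant to show what GCC emits with -fpeel-loops.

```c
#include <assert.h>
#include <stddef.h>

/* Reference loop: every iteration goes through the loop test.  */
long
sum_plain (const int *a, size_t n)
{
  assert (a != NULL || n == 0);
  long sum = 0;
  for (size_t i = 0; i < n; i++)
    sum += a[i];
  return sum;
}

/* Hand-peeled variant: the first iteration is copied out in front of
   the loop, so the remaining loop starts at i = 1.  */
long
sum_peeled_once (const int *a, size_t n)
{
  assert (a != NULL || n == 0);
  long sum = 0;
  size_t i = 0;

  if (i < n)            /* Peeled copy of iteration 0.  */
    {
      sum += a[i];
      i++;
    }

  for (; i < n; i++)    /* Remaining iterations.  */
    sum += a[i];
  return sum;
}
```

Both functions compute the same result; the peeled form trades code size for a loop body that no longer has to handle the first iteration, which is the code-size/speed trade-off mentioned above.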







Re: Restricting arguments to intrinsic functions

2014-10-24 Thread Tejas Belagod

On 24/10/14 15:44, Segher Boessenkool wrote:

On Thu, Oct 23, 2014 at 06:52:20PM +0100, Charles Baylis wrote:

( tl;dr: How do I handle intrinsic or builtin functions where there
are restrictions on the arguments which can't be represented in a C
function prototype? Do other ports have this problem, how do they
solve it? Language extension for C++98 to provide static_assert?)


In the builtin expand, you can get the operands' predicates from the
insn_data array entry for the RTL pattern generated for that builtin.
If the predicate is false, do a copy_to_mode_reg; if then the predicate
is still false, assume it had to be some constant and error out.

This works well; I stole the method from the tile* ports.  It may need
tweaks for your port.


I think we already do that in the aarch64 port in aarch64-builtins.c 
when we expand builtins.


  /* Handle constants only if the predicate allows it.  */
  bool op_const_int_p =
    (CONST_INT_P (arg)
     && (*insn_data[icode].operand[operands_k].predicate)
          (arg, insn_data[icode].operand[operands_k].mode));


But the accuracy of the source position, as Charles says, is lost by the 
time the expander kicks in. For example, in this piece of code,


#include "arm_neon.h"

int16x4_t
xget_lane(int16x4_t a, int16x4_t b, int c)
{
  return vqrdmulh_lane_s16 (a, b, 7);
}

$ aarch64-none-elf-gcc -O3 cr.c  -S

In file included from cr.c:2:0:
/work/dev/arm/bin/install/lib/gcc/aarch64-none-elf/5.0.0/include/arm_neon.h: 
In function 'xget_lane':
/work/dev/arm/bin/install/lib/gcc/aarch64-none-elf/5.0.0/include/arm_neon.h:19572:11: 
error: lane out of range

   return  __builtin_aarch64_sqrdmulh_lanev4hi (__a, __b, __c);


The diagnostic issued points to the line in arm_neon.h, but we expect 
this to point to the line in cr.c. I suspect we need something closer to 
the front-end?
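
One way to get the diagnostic at the call site is to check the constant in the wrapper itself, before any builtin is reached. The sketch below assumes C11 static_assert is available (so it does not answer the C++98 part of the question) and uses a toy vector type rather than the real arm_neon.h types; all TOY_* names are hypothetical.

```c
#include <assert.h>   /* static_assert (C11) */

#define TOY_NUM_LANES 4

/* Toy stand-in for int16x4_t so the example builds on any target.  */
typedef struct { short lane[TOY_NUM_LANES]; } toy_int16x4;

/* Evaluates to IDX, but fails to compile, at the place where the macro
   is used, unless IDX is a constant in [0, TOY_NUM_LANES).  */
#define TOY_CHECK_LANE(idx)                                     \
  ((void) sizeof (struct {                                      \
     static_assert ((idx) >= 0 && (idx) < TOY_NUM_LANES,        \
                    "lane out of range");                       \
     int unused;                                                \
   }),                                                          \
   (idx))

#define TOY_GET_LANE(vec, idx) ((vec).lane[TOY_CHECK_LANE (idx)])

/* In-range use; changing 2 to 7 is rejected on this line.  */
short
toy_third_lane (toy_int16x4 v)
{
  return TOY_GET_LANE (v, 2);
}
```

Because the check is expanded as part of the macro, the error is reported against the user's source line rather than against a line inside the header.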


Thanks,
Tejas.



Debugging LTO.

2014-05-22 Thread Tejas Belagod

Hi,

Are there any tricks I can use to debug an LTO ICE? lto1 --help does not seem to 
give me an option to output trace dumps etc.
What I suspect is happening is that cc1 builds erroneous LTO IR info in the 
objects that causes the ICEs. Is there a reader that will dump the IR from these 
LTO objects? AFAICS, this page


 https://gcc.gnu.org/wiki/LinkTimeOptimization

says such a reader is still a TODO.

Thanks,
Tejas.



Re: [RFC, LRA] Incorrect subreg resolution?

2014-04-22 Thread Tejas Belagod

Richard Sandiford wrote:

Returning to this old thread...

Richard Sandiford  writes:

Tejas Belagod  writes:

When I relaxed CANNOT_CHANGE_MODE_CLASS to undefined for AArch64,
gcc.c-torture/execute/copysign1.c generates incorrect code because LRA cannot
seem to handle subregs like

  (subreg:DI (reg:TF hard_reg) 8)

on hard registers where the subreg byte offset is unaligned to a hard register
boundary (16 for AArch64). It seems to quietly ignore the 8 and resolves this to
an incorrect hard register during reload.

When I compile this test with -O3,

long double
cl (long double x, long double y)
{
   return __builtin_copysignl (x, y);
}

cs.c.213r.ira:

(insn 26 10 33 2 (set (reg:DI 87 [ y+8 ])
 (subreg:DI (reg:TF 33 v1 [ y ]) 8)) cs.c:4 34 {*movdi_aarch64}
  (expr_list:REG_DEAD (reg:TF 33 v1 [ y ])
 (nil)))
(insn 33 26 35 2 (set (reg:TF 93)
 (reg:TF 32 v0 [ x ])) cs.c:4 40 {*movtf_aarch64}
  (expr_list:REG_DEAD (reg:TF 32 v0 [ x ])
 (nil)))
(insn 35 33 34 2 (set (reg:DI 92 [ x+8 ])
 (subreg:DI (reg:TF 93) 8)) cs.c:4 34 {*movdi_aarch64}
  (nil))
(insn 34 35 23 2 (set (reg:DI 91 [ x ])
 (subreg:DI (reg:TF 93) 0)) cs.c:4 34 {*movdi_aarch64}
  (expr_list:REG_DEAD (reg:TF 93)
 (nil)))


cs.c.214r.reload

(insn 26 10 33 2 (set (reg:DI 2 x2 [orig:87 y+8 ] [87])
 (reg:DI 33 v1 [ y+8 ])) cs.c:4 34 {*movdi_aarch64}
  (nil))
(insn 33 26 35 2 (set (reg:TF 0 x0 [93])
 (reg:TF 32 v0 [ x ])) cs.c:4 40 {*movtf_aarch64}
  (nil))
(insn 35 33 34 2 (set (reg:DI 1 x1 [orig:92 x+8 ] [92])
 (reg:DI 1 x1 [+8 ])) cs.c:4 34 {*movdi_aarch64}
  (nil))
(insn 34 35 8 2 (set (reg:DI 0 x0 [orig:91 x ] [91])
 (reg:DI 0 x0 [93])) cs.c:4 34 {*movdi_aarch64}
  (nil))
.

You can see the changes to insn 26 before and after reload - the SUBREG_BYTE
offset of 8 seems to have been translated to v1 instead of v1.d[1] by
get_hard_regno ().

What's interesting here is that the SUBREG_BYTE that is generated for

(subreg:DI (reg:TF 33 v1 [ y ]) 8)

isn't aligned to a hard register boundary on SIMD regs where UNITS_PER_VREG for
AArch64 is 16. Therefore when this subreg is resolved, it resolves to v1 instead
of v1.d[1]. Is this something going wrong in LRA or is this a more fundamental
problem with generating subregs of hard regs with unaligned subreg byte offsets?
The same subreg on a pseudo works OK because in insn 33, the TF mode is
allocated integer registers and all is well.

I think this is the same problem that was being discussed for x86
after your no-op vec-select patch:

   http://gcc.gnu.org/ml/gcc-patches/2013-12/msg00801.html

and long following thread.

I'd still like to solve this in a target-independent way rather than add
an offset to CANNOT_CHANGE_MODE_CLASS, but I haven't had time to look at
it...


FWIW, here's one possible approach.  The main part is to make the
invalid_mode_change code calculate a set of registers that are either
(a) invalid for the pseudo mode to begin with or (b) do not allow one
of the subregs to be taken (as calculated by simplify_subreg_regno,
which includes the original CANNOT_CHANGE_MODE_CLASS check).

One concern might be about compilation speed when collecting this info.
OTOH, the query is now genuinely constant time, whereas the old bitmap
test was O(num-pseudos) in the worst case.  It might also be possible
to speed things up by walking the subregs using the DF information,
if it's up-to-date at this point (haven't checked).  It would also be
possible to give an ID to each (inner mode, outer mode, byte) combination
and lazily cache the invalid register set for each one.

I went through the other uses of CANNOT_CHANGE_MODE_CLASS.  Most of them
were checking for lowpart mode changes so look safe.  The exception was
combine.c:subst.

This is really four patches squashed into one, but it's not ready to be
submitted yet.  Was just wondering whether this solved your problem.



Hi Richard,

Sorry for the delay in replying to this.

Thanks for this patch - it bootstraps and passes regression testing on aarch64. It 
also passes regression testing on ARM.
Your patch also fixes the issues I was seeing when I undefined C_C_M_C for aarch64, 
which is what I was mostly troubled by (copysign1 regression et al.)


Many Thanks,
Tejas.


Thanks,
Richard



*** /tmp/OCSP7f_combine.c   2014-03-11 07:34:37.928138693 +
--- gcc/combine.c   2014-03-10 21:39:09.428718086 +
*** subst (rtx x, rtx from, rtx to, int in_d
*** 5082,5096 
  )
return gen_rtx_CLOBBER (VOIDmode, const0_rtx);

- #ifdef CANNOT_CHANGE_MODE_CLASS
  if (code == SUBREG
  && REG_P (to)
  && REGNO (to) < FIRST_PSEUDO_REGISTER
! && REG_CANNOT_CHANGE_MODE_P (REGNO (to),
!  GET_MODE (

Re: [RFC, LRA] Repeated looping over subreg reloads.

2014-01-21 Thread Tejas Belagod

Vladimir Makarov wrote:

On 12/5/2013, 9:35 AM, Tejas Belagod wrote:

Vladimir Makarov wrote:

On 12/4/2013, 6:15 AM, Tejas Belagod wrote:

Hi,

I'm trying to relax CANNOT_CHANGE_MODE_CLASS for aarch64 to allow all
mode changes on FP_REGS as aarch64 does not have register-packing, but
I'm running into an LRA ICE. A test case generates an RTL subreg of the
following form

(set (reg:DF 97) (subreg:DF (reg:V2DF 95) 8))

LRA has to reload the subreg because the subreg is not representable as
a full register. When LRA reloads this in
lra-constraints.c:simplify_operand_subreg (), it seems to reload
SUBREG_REG() and leave the byte offset alone.

i.e.

  (set (reg:V2DF 100) (reg:V2DF 95))
  (set (reg:DF 97) (subreg:DF (reg:V2DF 100) 8))

The code in lra-constraints.c is this conditional:

   /* Force a reload of the SUBREG_REG if this is a constant or PLUS or
  if there may be a problem accessing OPERAND in the outer
  mode.  */
   if ((REG_P (reg)
   
   insert_move_for_subreg (insert_before ? &before : NULL,
   insert_after ? &after : NULL,
   reg, new_reg);
 }
   

What happens subsequently is that LRA keeps looping over this RTL and
keeps reloading the SUBREG_REG() till the limit of constraint passes is
reached.

  (set (reg:V2DF 100) (reg:V2DF 95))
  (set (reg:DF 97) (subreg:DF (reg:V2DF 100) 8))

I can't see any place where this subreg is resolved (e.g. into an equiv
memref) before the next iteration comes around for reloading the inputs
and outputs of curr_insn. Or am I missing some part of the code
that tries reloading the subreg with different alternatives or reg
classes?


I guess this behaviour is wrong.  We could spill the V2DF pseudo or
put it into another class reg. But it is not implemented.  This code
is actually a modified version of reload pass one.  We could implement
alternative strategies and a check for potential loop (such code
exists in process_alt_operands).

Could you send me the macro change and the test?  I'll look at it and
figure out what we can do.

Hi,

Thanks for looking at this.

The macro change is in this patch
http://gcc.gnu.org/ml/gcc-patches/2013-11/msg03638.html. The test is
gcc.c-torture/compile/simd-3.c and when compiled with -O1 for aarch64,
ICEs:

gcc/testsuite/gcc.c-torture/compile/simd-3.c:22:1: internal compiler
error: Maximum number of LRA constraint passes is achieved (30)

Also, I'm curious to know - is it possible to use vec_extract for
vector-mode subregs and zero/sign extract for scalars, with spilling as
the last resort if neither of these is possible? As you say, a non-zero
SUBREG_BYTE offset could also be resolved using a different regclass
where the sub-mode could just be a full register.



Here is the patch which solves the problem.  Right now it is only 
spilling but it is the best that can be done for this case.  I'll submit 
the patch next week after better testing on different platforms.




Hi Vladimir,

Have you had a chance to get this patch tested? This can fix a regression I'm 
seeing on AArch64, and I'd like to get it in if you think this patch is good to go.


Thanks,
Tejas.


Vec_extract is interesting but it is a rare case which needs a lot of 
code to implement.  I think we need a more general approach called 
bitwidth-aware RA (putting several pseudo values into regs, e.g. vec 
regs).  Although I don't know whether it will help for arm64 cpus.  Last time I 
checked bitwidth-aware RA manually for intel cpus, it made code bigger 
and slower.


If there is a mainstream processor for which it can improve performance, 
I'd put it higher on my priority list.










[RFC, LRA] Incorrect subreg resolution?

2014-01-10 Thread Tejas Belagod

Hi,

When I relaxed CANNOT_CHANGE_MODE_CLASS to undefined for AArch64, 
gcc.c-torture/execute/copysign1.c generates incorrect code because LRA cannot 
seem to handle subregs like


 (subreg:DI (reg:TF hard_reg) 8)

on hard registers where the subreg byte offset is unaligned to a hard register 
boundary (16 for AArch64). It seems to quietly ignore the 8 and resolves this to 
an incorrect hard register during reload.


When I compile this test with -O3,

long double
cl (long double x, long double y)
{
  return __builtin_copysignl (x, y);
}

cs.c.213r.ira:

(insn 26 10 33 2 (set (reg:DI 87 [ y+8 ])
(subreg:DI (reg:TF 33 v1 [ y ]) 8)) cs.c:4 34 {*movdi_aarch64}
 (expr_list:REG_DEAD (reg:TF 33 v1 [ y ])
(nil)))
(insn 33 26 35 2 (set (reg:TF 93)
(reg:TF 32 v0 [ x ])) cs.c:4 40 {*movtf_aarch64}
 (expr_list:REG_DEAD (reg:TF 32 v0 [ x ])
(nil)))
(insn 35 33 34 2 (set (reg:DI 92 [ x+8 ])
(subreg:DI (reg:TF 93) 8)) cs.c:4 34 {*movdi_aarch64}
 (nil))
(insn 34 35 23 2 (set (reg:DI 91 [ x ])
(subreg:DI (reg:TF 93) 0)) cs.c:4 34 {*movdi_aarch64}
 (expr_list:REG_DEAD (reg:TF 93)
(nil)))


cs.c.214r.reload

(insn 26 10 33 2 (set (reg:DI 2 x2 [orig:87 y+8 ] [87])
(reg:DI 33 v1 [ y+8 ])) cs.c:4 34 {*movdi_aarch64}
 (nil))
(insn 33 26 35 2 (set (reg:TF 0 x0 [93])
(reg:TF 32 v0 [ x ])) cs.c:4 40 {*movtf_aarch64}
 (nil))
(insn 35 33 34 2 (set (reg:DI 1 x1 [orig:92 x+8 ] [92])
(reg:DI 1 x1 [+8 ])) cs.c:4 34 {*movdi_aarch64}
 (nil))
(insn 34 35 8 2 (set (reg:DI 0 x0 [orig:91 x ] [91])
(reg:DI 0 x0 [93])) cs.c:4 34 {*movdi_aarch64}
 (nil))
.

You can see the changes to insn 26 before and after reload - the SUBREG_BYTE 
offset of 8 seems to have been translated to v1 instead of v1.d[1] by 
get_hard_regno ().


What's interesting here is that the SUBREG_BYTE that is generated for

(subreg:DI (reg:TF 33 v1 [ y ]) 8)

isn't aligned to a hard register boundary on SIMD regs where UNITS_PER_VREG for 
AArch64 is 16. Therefore when this subreg is resolved, it resolves to v1 instead 
of v1.d[1]. Is this something going wrong in LRA or is this a more fundamental 
problem with generating subregs of hard regs with unaligned subreg byte offsets? 
The same subreg on a pseudo works OK because in insn 33, the TF mode is 
allocated integer registers and all is well.
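
The arithmetic behind the question can be sketched in isolation. The helper below is hypothetical (locate_subreg and struct subreg_loc are not GCC names) and ignores endianness; it only shows that byte offset 8 lands on a register boundary when each register holds 8 bytes (the x registers) but in the middle of a register when each holds 16 bytes (the v registers).

```c
#include <assert.h>

/* Hypothetical helper, not GCC's API: where does byte offset BYTE inside
   a multi-register value starting at hard register REGNO land, if each
   hard register holds BYTES_PER_REG bytes?  */
struct subreg_loc
{
  unsigned regno;       /* Hard register containing the first byte.  */
  unsigned byte_in_reg; /* Remaining offset inside that register.  */
};

struct subreg_loc
locate_subreg (unsigned regno, unsigned byte, unsigned bytes_per_reg)
{
  assert (bytes_per_reg != 0);
  struct subreg_loc loc;
  loc.regno = regno + byte / bytes_per_reg;
  loc.byte_in_reg = byte % bytes_per_reg;
  return loc;
}
```

With the numbers from the dumps above, locate_subreg (0, 8, 8) gives register 1 with no residual offset (x1), whereas locate_subreg (33, 8, 16) gives register 33 with a residual offset of 8, i.e. the upper half of v1, which a plain (reg:DI 33) cannot express.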


Thanks,
Tejas Belagod
ARM.



Re: [RFC, LRA] Repeated looping over subreg reloads.

2013-12-05 Thread Tejas Belagod

Vladimir Makarov wrote:

On 12/4/2013, 6:15 AM, Tejas Belagod wrote:

Hi,

I'm trying to relax CANNOT_CHANGE_MODE_CLASS for aarch64 to allow all
mode changes on FP_REGS as aarch64 does not have register-packing, but
I'm running into an LRA ICE. A test case generates an RTL subreg of the
following form

(set (reg:DF 97) (subreg:DF (reg:V2DF 95) 8))

LRA has to reload the subreg because the subreg is not representable as
a full register. When LRA reloads this in
lra-constraints.c:simplify_operand_subreg (), it seems to reload
SUBREG_REG() and leave the byte offset alone.

i.e.

  (set (reg:V2DF 100) (reg:V2DF 95))
  (set (reg:DF 97) (subreg:DF (reg:V2DF 100) 8))

The code in lra-constraints.c is this conditional:

   /* Force a reload of the SUBREG_REG if this is a constant or PLUS or
  if there may be a problem accessing OPERAND in the outer
  mode.  */
   if ((REG_P (reg)
   
   insert_move_for_subreg (insert_before ? &before : NULL,
   insert_after ? &after : NULL,
   reg, new_reg);
 }
   

What happens subsequently is that LRA keeps looping over this RTL and
keeps reloading the SUBREG_REG() till the limit of constraint passes is
reached.

  (set (reg:V2DF 100) (reg:V2DF 95))
  (set (reg:DF 97) (subreg:DF (reg:V2DF 100) 8))

I can't see any place where this subreg is resolved (e.g. into an equiv
memref) before the next iteration comes around for reloading the inputs
and outputs of curr_insn. Or am I missing some part of the code
that tries reloading the subreg with different alternatives or reg classes?



I guess this behaviour is wrong.  We could spill the V2DF pseudo or put 
it into another class reg. But it is not implemented.  This code is 
actually a modified version of reload pass one.  We could implement 
alternative strategies and a check for potential loop (such code exists 
in process_alt_operands).


Could you send me the macro change and the test?  I'll look at it and 
figure out what we can do.


Hi,

Thanks for looking at this.

The macro change is in this patch 
http://gcc.gnu.org/ml/gcc-patches/2013-11/msg03638.html. The test is 
gcc.c-torture/compile/simd-3.c and when compiled with -O1 for aarch64, ICEs:


gcc/testsuite/gcc.c-torture/compile/simd-3.c:22:1: internal compiler error: 
Maximum number of LRA constraint passes is achieved (30)


Also, I'm curious to know - is it possible to use vec_extract for vector-mode 
subregs and zero/sign extract for scalars, with spilling as the last resort if 
neither of these is possible? As you say, a non-zero SUBREG_BYTE offset could 
also be resolved using a different regclass where the sub-mode could just be a 
full register.
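
To make the equivalence behind this question concrete: the upper DF of a V2DF value can be described either as lane 1 (what a vec_extract-style pattern would produce) or as the 8 bytes at byte offset 8 (what the subreg describes). The sketch below uses GCC's generic vector extension and assumes an 8-byte double; the function names are made up, and it models only the value being read, not LRA's behaviour.

```c
#include <assert.h>   /* static_assert (C11) */
#include <string.h>

typedef double v2df __attribute__ ((vector_size (16)));

static_assert (sizeof (double) == 8, "example assumes 8-byte double");

/* Lane extraction, which a vec_extract-style instruction would implement.  */
double
high_lane (v2df v)
{
  return v[1];
}

/* The same value read as "8 bytes at byte offset 8", which is how
   (subreg:DF (reg:V2DF ...) 8) addresses it in memory order.  */
double
high_bytes (v2df v)
{
  double d;
  memcpy (&d, (const char *) &v + 8, sizeof d);
  return d;
}
```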


Thanks,
Tejas.



[RFC, LRA] Repeated looping over subreg reloads.

2013-12-04 Thread Tejas Belagod


Hi,

I'm trying to relax CANNOT_CHANGE_MODE_CLASS for aarch64 to allow all mode 
changes on FP_REGS as aarch64 does not have register-packing, but I'm running 
into an LRA ICE. A test case generates an RTL subreg of the following form


   (set (reg:DF 97) (subreg:DF (reg:V2DF 95) 8))

LRA has to reload the subreg because the subreg is not representable as a full 
register. When LRA reloads this in lra-constraints.c:simplify_operand_subreg (), 
it seems to reload SUBREG_REG() and leave the byte offset alone.


i.e.

 (set (reg:V2DF 100) (reg:V2DF 95))
 (set (reg:DF 97) (subreg:DF (reg:V2DF 100) 8))

The code in lra-constraints.c is this conditional:

  /* Force a reload of the SUBREG_REG if this is a constant or PLUS or
 if there may be a problem accessing OPERAND in the outer
 mode.  */
  if ((REG_P (reg)
  
  insert_move_for_subreg (insert_before ? &before : NULL,
  insert_after ? &after : NULL,
  reg, new_reg);
}
  

What happens subsequently is that LRA keeps looping over this RTL and keeps 
reloading the SUBREG_REG() till the limit of constraint passes is reached.


 (set (reg:V2DF 100) (reg:V2DF 95))
 (set (reg:DF 97) (subreg:DF (reg:V2DF 100) 8))

I can't see any place where this subreg is resolved (e.g. into an equiv memref) 
before the next iteration comes around for reloading the inputs and outputs of 
curr_insn. Or am I missing some part of the code that tries reloading the 
subreg with different alternatives or reg classes?


Thanks,
Tejas.



Re: [RFC] vector subscripts/BIT_FIELD_REF in Big Endian.

2013-08-12 Thread Tejas Belagod
What's interesting to me here is the bitpos - does this not need 
BYTES_BIG_ENDIAN correction? This seems to be inconsistent with what happens 
with reduction operations in the autovectorizer, where the scalar result in the 
reduction epilogue gets extracted with a BIT_FIELD_REF but the bitpos there is 
corrected for BIG_ENDIAN.


a[0] is at the left end of the array in BIG_ENDIAN, and big-endian
machines number bits from the left, so bit position 0 is correct.



...
   vect_sum_9.17_74 = [reduc_plus_expr] vect_sum_9.15_73;
   stmp_sum_9.16_75 = BIT_FIELD_REF <vect_sum_9.17_74, 32, 96>;
   sum_76 = stmp_sum_9.16_75 + sum_47;

the BIT_FIELD_REF here seems to have been corrected for BYTES_BIG_ENDIAN


Yes, because something else is going on here.  This is a reduction
operation where the sum ends up in the rightmost element of a vector
register that contains four 32-bit integers.  This is at position 96
from the left end of the register according to big-endian numbering.



Thanks for your reply.

Sorry, I'm still a bit confused here. The reduc_splus_m documentation says

"Compute the sum of the signed elements of a vector. The vector is operand 1,
and the scalar result is stored in the least significant bits of operand 0
(also a vector)."

Shouldn't this mean the scalar result should be in bitpos 0 which is the left 
end of the register in BIG ENDIAN?


Thanks,
Tejas

If vec_extract is defined in the back-end, how does one figure out if the 
BIT_FIELD_REF is a product of the gimplifier's indirect ref folding or the 
vectorizer's bit-field extraction and apply the appropriate correction in 
vec_extract's expansion? Or am I missing something that corrects BIT_FIELD_REFs 
between the gimplifier and the RTL expander?


There is no inconsistency here.

Hope this helps!
Bill


Thanks,
Tejas.









[RFC] vector subscripts/BIT_FIELD_REF in Big Endian.

2013-08-05 Thread Tejas Belagod


Hi,

I'm looking for some help understanding how BIT_FIELD_REFs work with big-endian.

Vector subscripts in this example:

#define vector __attribute__((vector_size(sizeof(int)*4) ))

typedef int vec vector;

int foo(vec a)
{
  return a[0];
}

gets lowered into array accesses by c-typeck.c

;; Function foo (null)
{
  return *(int *) &a;
}

and gets gimplified into BIT_FIELD_REFs a bit later.

foo (vec a)
{
  int _2;

  :
  _2 = BIT_FIELD_REF <a, 32, 0>;
  return _2;

}

What's interesting to me here is the bitpos - does this not need 
BYTES_BIG_ENDIAN correction? This seems to be inconsistent with what happens 
with reduction operations in the autovectorizer, where the scalar result in the 
reduction epilogue gets extracted with a BIT_FIELD_REF but the bitpos there is 
corrected for BIG_ENDIAN.


... from tree-vect-loop.c:vect_create_epilog_for_reduction ()

  /* 2.4  Extract the final scalar result.  Create:
          s_out3 = extract_field <v_out2, bitpos>  */

  if (extract_scalar_result)
    {
      tree rhs;

      if (dump_enabled_p ())
        dump_printf_loc (MSG_NOTE, vect_location,
                         "extract scalar result");

      if (BYTES_BIG_ENDIAN)
        bitpos = size_binop (MULT_EXPR,
                             bitsize_int (TYPE_VECTOR_SUBPARTS (vectype) - 1),
                             TYPE_SIZE (scalar_type));
      else
        bitpos = bitsize_zero_node;


For example:

int foo(int * a)
{
  int i, sum = 0;

  for (i=0;i<16;i++)
   sum += a[i];

  return sum;
}

gets autovectorized into:

...
  vect_sum_9.17_74 = [reduc_plus_expr] vect_sum_9.15_73;
  stmp_sum_9.16_75 = BIT_FIELD_REF <vect_sum_9.17_74, 32, 96>;
  sum_76 = stmp_sum_9.16_75 + sum_47;

the BIT_FIELD_REF here seems to have been corrected for BYTES_BIG_ENDIAN
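
The bitpos choice made in the vectorizer excerpt can be modelled with plain integers. This is only a stand-alone illustration of that one computation; reduction_result_bitpos and its parameters are hypothetical names, not part of GCC.

```c
#include <assert.h>

/* Stand-alone model of the bitpos selection in the excerpt above: on a
   big-endian target the scalar result is taken from the last element,
   otherwise from bit position 0.  */
unsigned
reduction_result_bitpos (int bytes_big_endian, unsigned nunits,
                         unsigned elem_bits)
{
  assert (nunits > 0);
  if (bytes_big_endian)
    return (nunits - 1) * elem_bits;
  return 0;
}
```

For a vector of four 32-bit integers this gives (4 - 1) * 32 = 96 on a big-endian target and 0 otherwise, whereas the a[0] subscript earlier in this mail uses position 0 regardless of endianness.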

If vec_extract is defined in the back-end, how does one figure out if the 
BIT_FIELD_REF is a product of the gimplifier's indirect ref folding or the 
vectorizer's bit-field extraction and apply the appropriate correction in 
vec_extract's expansion? Or am I missing something that corrects BIT_FIELD_REFs 
between the gimplifier and the RTL expander?


Thanks,
Tejas.



Re: ARM/AAarch64: NEON intrinsics in the kernel

2013-07-18 Thread Tejas Belagod

Ard Biesheuvel wrote:

On 18 July 2013 16:54, Tejas Belagod  wrote:

I'd like to follow up this thread to move towards removing arm_neon.h's
dependence on stdint.h. My comments inline below.


As far as I can tell, the only dependency arm_neon.h has on the
contents of that header are the [u]int[8|16|32|64]_t typedefs. The
kernel does define those, only in a different header.



Hello Tejas,

What I did not realize at the time is that those types are part of the
visible interface of the NEON intrinsics. Just as an example, there is
a function in arm_neon.h:

uint8x8_t vset_lane_u8 (uint8_t __a, uint8x8_t __b, const int __c);

which clearly needs a type definition for uint8_t. Changing the
published and documented interface is unlikely to be a realistic
option, I'm afraid, and simply dropping the #include will cause
breakage for some existing users, which is also not very appealing.



I was thinking more on the lines of

#ifdef __INT8_TYPE__
typedef __INT8_TYPE__ int8_t;
#endif

and

#ifdef __UINT64_C
#define UINT64_C(c) __UINT64_C (c)
#endif

In other words, this perhaps reproduces a part of stdint-gcc.h. I don't know 
if there can be a situation where these predefines are not defined (e.g. some 
-m option that turns them off?)
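
Spelled out a little further, the idea might look like the sketch below. It assumes a GCC-compatible compiler that provides the __INTn_TYPE__ / __UINTn_TYPE__ and __UINT64_C predefines, and it deliberately uses made-up neon_* and NEON_* names so that it can be compiled next to a real stdint.h without clashing; an actual arm_neon.h change would have to define the standard names.

```c
#include <assert.h>   /* static_assert (C11) */

/* Each typedef is guarded, so a missing predefine shows up as a
   missing type rather than as a syntax error in this header.  */
#ifdef __INT8_TYPE__
typedef __INT8_TYPE__ neon_int8_t;
static_assert (sizeof (neon_int8_t) == 1, "expected a 1-byte type");
#endif

#ifdef __UINT8_TYPE__
typedef __UINT8_TYPE__ neon_uint8_t;
#endif

#ifdef __INT64_TYPE__
typedef __INT64_TYPE__ neon_int64_t;
#endif

#ifdef __UINT64_TYPE__
typedef __UINT64_TYPE__ neon_uint64_t;
static_assert (sizeof (neon_uint64_t) == 8, "expected an 8-byte type");
#endif

/* Constant macro, again only when the compiler predefines it.  */
#ifdef __UINT64_C
#define NEON_UINT64_C(c) __UINT64_C (c)
#endif
```

The 16- and 32-bit variants would follow the same pattern.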



Conditionally including stdint.h in case those types have not been
defined (yet) would be the only remaining option, I think, but I am
not sure if that is feasible.



Are you proposing something like:

/* arm_neon.h */

#ifndef __intxx_t_defined ...
#define __STDC_CONSTANT_MACROS
#include <stdint.h>
#endif

...

/* Prevent __STDC_CONSTANT_MACROS from polluting the environment.  */
#ifdef __STDC_CONSTANT_MACROS
#undef __STDC_CONSTANT_MACROS
#endif

/* End of arm_neon.h */

Including all of stdint.h for only the few basic types/macros that we need seems 
too heavy a hammer to me, does it not?


Thanks,
Tejas.


In the kernel case, I have worked around it by having a separate
compilation unit containing the wrapped NEON intrinsics code, and
using plain old C types to interface with the wrapper functions.

[...]

Regards,
Ard.






Re: ARM/AAarch64: NEON intrinsics in the kernel

2013-07-18 Thread Tejas Belagod

Hi Ard,

I'd like to follow up this thread to move towards removing arm_neon.h's 
dependence on stdint.h. My comments inline below.



From: Ard Biesheuvel 
Date: Tue, May 21, 2013 at 10:32 AM
Subject: ARM/AAarch64: NEON intrinsics in the kernel
To: gcc@gcc.gnu.org
Cc: Christophe Lyon , Matthew Gretton-Dann
, richard.earns...@arm.com,
ramana.radhakrish...@arm.com, marcus.shawcr...@arm.com


Hello all,

I am currently exploring various ways of using NEON instructions in
kernel mode. One of the ways of doing so is using NEON intrinsics,
which we would like to support in the kernel, but unfortunately, at
the moment we can't because the support header arm_neon.h assumes C99
conformance and includes <stdint.h>. The kernel does not supply that
header.

As far as I can tell, the only dependency arm_neon.h has on the
contents of that header are the [u]int[8|16|32|64]_t typedefs. The
kernel does define those, only in a different header.



There are also constant macros like UINT64_C etc. that cause issues when compiled 
with C++. Also, defining __STDC_CONSTANT_MACROS to get around this issue 
won't make the problem go away, I think.



I would like to propose the following way to address this issue: as
arm_neon.h is coupled very tightly with GCC's internals
(__builtin_neon_* types and functions), could we not modify arm_neon.h
to
- drop the #include <stdint.h>


Removing arm_neon.h's dependency on stdint.h is probably a good idea.


- replace every instance of [u]intxx_t with the builtin macro
__[U]INTxx_TYPE__ (as we are already dependent on specific versions of
GCC, this should not introduce any additional limitations)



The option we have here is to replace all the stdint types with the 
predefined macros:


int<8,16,32,64>_t with the predefined __INT<8,16,32,64>_TYPE__
and
UINT64_C from stdint.h with __UINT64_C etc.

But it is recommended that these never be used directly, only via the header. 
If we use them directly in arm_neon.h, it introduces a dependency on the 
implementation of the predefines in gcc, but as you point out, arm_neon.h is 
already dependent on specific versions of gcc, so this maintenance overhead is 
probably unavoidable. We do need standard typedefs from somewhere...


Thoughts?

Thanks,
Tejas Belagod.
ARM.



In this way, it is much easier to support NEON intrinsics in
environments that we care about (like the kernel) but do not conform
to the standards.

Kind regards,
Ard.