Re: Loop peeling
On 29/10/14 09:32, Richard Biener wrote:
On Tue, Oct 28, 2014 at 4:55 PM, Evandro Menezes wrote: While doing some benchmark flag mining on AArch64, I noticed that -fpeel-loops often turned up as a mined option. As a matter of fact, when using it always, even without FDO, it seemed to improve most benchmarks and to leave almost all of the rest flat, with a barely noticeable cost in code size. It seems to me that it might be safe enough to be implied, perhaps at -O3. Is there any reason why this never came into being?

Loop peeling is done by default on AArch64 unless, IIRC, -fvect-cost-model=cheap is specified, which switches it off. There was a general thread on loop peeling around the same time last year (https://gcc.gnu.org/ml/gcc/2013-11/msg00307.html) where Richard suggested that peeling vs. non-peeling should be factored into the vector cost model, which is a more generic improvement. Thanks, Tejas.

Not sure, but peeling is/was very stupid (peeling 8 times unconditionally or not at all), at least without FDO (and with -fprofile-use it is enabled). Similar case for -funroll-loops. For GCC 5 peeling has now moved to GIMPLE, so maybe things changed there (but I'd doubt it). Honza?
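For context, a minimal sketch of the FDO flow under which -fpeel-loops is already enabled, next to the non-FDO variant being proposed (source and output names here are placeholders):

$ gcc -O2 -fprofile-generate bench.c -o bench   # instrumented build
$ ./bench                                       # produces *.gcda profile data
$ gcc -O2 -fprofile-use bench.c -o bench        # FDO build; -fpeel-loops is enabled here
$ gcc -O3 -fpeel-loops bench.c -o bench         # the always-on, non-FDO variant under discussion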
Re: Restricting arguments to intrinsic functions
On 24/10/14 15:44, Segher Boessenkool wrote:
On Thu, Oct 23, 2014 at 06:52:20PM +0100, Charles Baylis wrote: (tl;dr: How do I handle intrinsic or builtin functions where there are restrictions on the arguments which can't be represented in a C function prototype? Do other ports have this problem, and how do they solve it? Language extension for C++98 to provide static_assert?)

In the builtin expand, you can get the operands' predicates from the insn_data array entry for the RTL pattern generated for that builtin. If the predicate is false, do a copy_to_mode_reg; if the predicate is then still false, assume it had to be some constant and error out. This works well; I stole the method from the tile* ports. It may need tweaks for your port.

I think we already do that in the aarch64 port in aarch64-builtins.c when we expand builtins:

  /* Handle constants only if the predicate allows it.  */
  bool op_const_int_p =
    (CONST_INT_P (arg)
     && (*insn_data[icode].operand[operands_k].predicate) (arg, insn_data[icode].operand[operands_k].mode));

But the accuracy of the source position, as Charles says, is lost by the time the expander kicks in. For example, in this piece of code:

#include "arm_neon.h"

int16x4_t xget_lane (int16x4_t a, int16x4_t b, int c)
{
  return vqrdmulh_lane_s16 (a, b, 7);
}

$ aarch64-none-elf-gcc -O3 cr.c -S
In file included from cr.c:2:0:
/work/dev/arm/bin/install/lib/gcc/aarch64-none-elf/5.0.0/include/arm_neon.h: In function 'xget_lane':
/work/dev/arm/bin/install/lib/gcc/aarch64-none-elf/5.0.0/include/arm_neon.h:19572:11: error: lane out of range
   return __builtin_aarch64_sqrdmulh_lanev4hi (__a, __b, __c);

The diagnostic issued points to the line in arm_neon.h, but we expect it to point to the line in cr.c. I suspect we need something closer to the front end? Thanks, Tejas.
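For reference, a minimal sketch of the expansion-time check Segher describes (predicate check, copy_to_mode_reg fallback, then error). The function name and the opno/op parameters are illustrative only, and the exact insn_data field types vary between GCC releases:

/* Sketch only, not the actual aarch64 or tile* code: force an operand to
   satisfy the pattern's predicate, or diagnose it.  */
static rtx
force_builtin_operand (enum insn_code icode, int opno, rtx op)
{
  enum machine_mode mode = insn_data[(int) icode].operand[opno].mode;

  /* If the operand does not already satisfy the predicate, try putting
     it into a register of the right mode.  */
  if (!insn_data[(int) icode].operand[opno].predicate (op, mode))
    op = copy_to_mode_reg (mode, op);

  /* If it still fails, the pattern must have wanted a constant
     (e.g. an in-range lane number), so report an error.  */
  if (!insn_data[(int) icode].operand[opno].predicate (op, mode))
    error ("argument %d to builtin is not valid", opno + 1);

  return op;
}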
Debugging LTO.
Hi, Are there any tricks I can use to debug an LTO ICE? lto1 --help does not seem to give me an option to output trace dumps etc. What I suspect is happening is that cc1 builds erroneous LTO IR info in the objects, and that causes the ICEs. Is there a reader that will dump the IR from these LTO objects? AFAICS, this page https://gcc.gnu.org/wiki/LinkTimeOptimization says such a reader is still a TODO. Thanks, Tejas.
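A couple of things that sometimes help here (file names below are placeholders, and this is only a sketch of the usual approach, not a complete recipe): -v together with -save-temps makes the driver print the exact lto1 command line, which can then be re-run by hand, e.g. under gdb, and the ordinary dump flags can be passed at link time so they apply to the LTO recompilation:

$ gcc -flto -O2 -v -save-temps a.c b.c -o test       # shows the lto1 invocation the driver runs
$ gcc -flto -O2 -fdump-tree-all -fdump-ipa-all a.o b.o -o test   # per-pass dumps for the LTO step
$ gcc -flto -O2 -flto-report a.o b.o -o test         # statistics from the LTO streamer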
Re: [RFC, LRA] Incorrect subreg resolution?
Richard Sandiford wrote: Returning to this old thread...

Richard Sandiford writes:
Tejas Belagod writes: When I relaxed CANNOT_CHANGE_MODE_CLASS to undefined for AArch64, gcc.c-torture/execute/copysign1.c generates incorrect code because LRA cannot seem to handle subregs like (subreg:DI (reg:TF hard_reg) 8) on hard registers where the subreg byte offset is not aligned to a hard register boundary (16 for AArch64). It seems to quietly ignore the 8 and resolves this to an incorrect hard register during reload. When I compile this test with -O3:

long double
cl (long double x, long double y)
{
  return __builtin_copysignl (x, y);
}

cs.c.213r.ira:

(insn 26 10 33 2 (set (reg:DI 87 [ y+8 ]) (subreg:DI (reg:TF 33 v1 [ y ]) 8)) cs.c:4 34 {*movdi_aarch64} (expr_list:REG_DEAD (reg:TF 33 v1 [ y ]) (nil)))
(insn 33 26 35 2 (set (reg:TF 93) (reg:TF 32 v0 [ x ])) cs.c:4 40 {*movtf_aarch64} (expr_list:REG_DEAD (reg:TF 32 v0 [ x ]) (nil)))
(insn 35 33 34 2 (set (reg:DI 92 [ x+8 ]) (subreg:DI (reg:TF 93) 8)) cs.c:4 34 {*movdi_aarch64} (nil))
(insn 34 35 23 2 (set (reg:DI 91 [ x ]) (subreg:DI (reg:TF 93) 0)) cs.c:4 34 {*movdi_aarch64} (expr_list:REG_DEAD (reg:TF 93) (nil)))

cs.c.214r.reload:

(insn 26 10 33 2 (set (reg:DI 2 x2 [orig:87 y+8 ] [87]) (reg:DI 33 v1 [ y+8 ])) cs.c:4 34 {*movdi_aarch64} (nil))
(insn 33 26 35 2 (set (reg:TF 0 x0 [93]) (reg:TF 32 v0 [ x ])) cs.c:4 40 {*movtf_aarch64} (nil))
(insn 35 33 34 2 (set (reg:DI 1 x1 [orig:92 x+8 ] [92]) (reg:DI 1 x1 [+8 ])) cs.c:4 34 {*movdi_aarch64} (nil))
(insn 34 35 8 2 (set (reg:DI 0 x0 [orig:91 x ] [91]) (reg:DI 0 x0 [93])) cs.c:4 34 {*movdi_aarch64} (nil))

You can see the changes to insn 26 before and after reload - the SUBREG_BYTE offset of 8 seems to have been translated to v0 instead of v0.d[1] by get_hard_regno (). What's interesting here is that the SUBREG_BYTE that is generated for (subreg:DI (reg:TF 33 v1 [ y ]) 8) isn't aligned to a hard register boundary on SIMD regs, where UNITS_PER_VREG for AArch64 is 16. Therefore when this subreg is resolved, it resolves to v1 instead of v1.d[1]. Is this something going wrong in LRA, or is this a more fundamental problem with generating subregs of hard regs with unaligned subreg byte offsets? The same subreg on a pseudo works OK because in insn 33 the TF mode is allocated integer registers and all is well.

I think this is the same problem that was being discussed for x86 after your no-op vec-select patch: http://gcc.gnu.org/ml/gcc-patches/2013-12/msg00801.html and the long following thread. I'd still like to solve this in a target-independent way rather than add an offset to CANNOT_CHANGE_MODE_CLASS, but I haven't had time to look at it...

FWIW, here's one possible approach. The main part is to make the invalid_mode_change code calculate a set of registers that are either (a) invalid for the pseudo mode to begin with or (b) do not allow one of the subregs to be taken (as calculated by simplify_subreg_regno, which includes the original CANNOT_CHANGE_MODE_CLASS check). One concern might be about compilation speed when collecting this info. OTOH, the query is now genuinely constant time, whereas the old bitmap test was O(num-pseudos) in the worst case. It might also be possible to speed things up by walking the subregs using the DF information, if it's up-to-date at this point (haven't checked). It would also be possible to give an ID to each (inner mode, outer mode, byte) combination and lazily cache the invalid register set for each one. I went through the other uses of CANNOT_CHANGE_MODE_CLASS.
Most of them were checking for lowpart mode changes, so they look safe. The exception was combine.c:subst. This is really four patches squashed into one, but it's not ready to be submitted yet. Was just wondering whether this solved your problem.

Hi Richard, Sorry for the delay in replying to this. Thanks for this patch - it bootstraps and regresses fine for aarch64. It also regresses OK on ARM. Your patch also fixes issues I was seeing when I undefined C_C_M_C for aarch64, which is what I was mostly troubled by (the copysign1 regression et al.). Many Thanks, Tejas.

Thanks, Richard

*** /tmp/OCSP7f_combine.c 2014-03-11 07:34:37.928138693 +
--- gcc/combine.c 2014-03-10 21:39:09.428718086 +
*** subst (rtx x, rtx from, rtx to, int in_d
*** 5082,5096 )
    return gen_rtx_CLOBBER (VOIDmode, const0_rtx);
- #ifdef CANNOT_CHANGE_MODE_CLASS
  if (code == SUBREG && REG_P (to)
      && REGNO (to) < FIRST_PSEUDO_REGISTER
! && REG_CANNOT_CHANGE_MODE_P (REGNO (to),
! GET_MODE (
Re: [RFC, LRA] Repeated looping over subreg reloads.
Vladimir Makarov wrote: On 12/5/2013, 9:35 AM, Tejas Belagod wrote:
Vladimir Makarov wrote: On 12/4/2013, 6:15 AM, Tejas Belagod wrote: Hi, I'm trying to relax CANNOT_CHANGE_MODE_CLASS for aarch64 to allow all mode changes on FP_REGS, as aarch64 does not have register packing, but I'm running into an LRA ICE. A test case generates an RTL subreg of the following form:

(set (reg:DF 97) (subreg:DF (reg:V2DF 95) 8))

LRA has to reload the subreg because the subreg is not representable as a full register. When LRA reloads this in lra-constraints.c:simplify_operand_subreg (), it seems to reload the SUBREG_REG () and leave the byte offset alone, i.e.

(set (reg:V2DF 100) (reg:V2DF 95))
(set (reg:DF 97) (subreg:DF (reg:V2DF 100) 8))

The code in lra-constraints.c is this conditional:

  /* Force a reload of the SUBREG_REG if this is a constant or PLUS or
     if there may be a problem accessing OPERAND in the outer mode.  */
  if ((REG_P (reg)
       ...
      insert_move_for_subreg (insert_before ? &before : NULL,
                              insert_after ? &after : NULL, reg, new_reg);
    }

What happens subsequently is that LRA keeps looping over this RTL and keeps reloading the SUBREG_REG () till the limit of constraint passes is reached:

(set (reg:V2DF 100) (reg:V2DF 95))
(set (reg:DF 97) (subreg:DF (reg:V2DF 100) 8))

I can't see any place where this subreg is resolved (e.g. into an equivalent memref) before the next iteration comes around for reloading the inputs and outputs of curr_insn. Or am I missing some part of the code that tries reloading the subreg with different alternatives or reg classes?

I guess this behaviour is wrong. We could spill the V2DF pseudo or put it into a register of another class, but that is not implemented. This code is actually a modified version of the one in the reload pass. We could implement alternative strategies and a check for a potential loop (such code exists in process_alt_operands). Could you send me the macro change and the test? I'll look at it and figure out what we can do.

Hi, Thanks for looking at this. The macro change is in this patch http://gcc.gnu.org/ml/gcc-patches/2013-11/msg03638.html. The test is gcc.c-torture/compile/simd-3.c, and when compiled with -O1 for aarch64 it ICEs:

gcc/testsuite/gcc.c-torture/compile/simd-3.c:22:1: internal compiler error: Maximum number of LRA constraint passes is achieved (30)

Also, I'm curious to know - is it possible to use vec_extract for vector-mode subregs and zero/sign extract for scalars, with spilling as the last resort if neither of these is possible? As you say, a non-zero SUBREG_BYTE offset could also be resolved using a different regclass where the sub-mode could just be a full register.

Here is the patch which solves the problem. Right now it only spills, but that is the best that can be done for this case. I'll submit the patch next week after better testing on different platforms.

Hi Vladimir, Have you had a chance to get this patch tested? This can fix a regression I'm seeing on AArch64, and I'd like to get it in if you think this patch is good to go. Thanks, Tejas.

vec_extract is interesting, but it is a rare case that needs a lot of code to implement. I think we need a more general approach called bitwidth-aware RA (putting several pseudo values into regs, e.g. vector regs), although I don't know whether it will help for arm64 CPUs. Last time I manually checked bitwidth-aware RA for Intel CPUs, it made code bigger and slower. If there is a mainstream processor for which it can improve performance, I'd put it higher in my priority list.
[RFC, LRA] Incorrect subreg resolution?
Hi, When I relaxed CANNOT_CHANGE_MODE_CLASS to undefined for AArch64, gcc.c-torture/execute/copysign1.c generates incorrect code because LRA cannot seem to handle subregs like (subreg:DI (reg:TF hard_reg) 8) on hard registers where the subreg byte offset is not aligned to a hard register boundary (16 for AArch64). It seems to quietly ignore the 8 and resolves this to an incorrect hard register during reload. When I compile this test with -O3:

long double
cl (long double x, long double y)
{
  return __builtin_copysignl (x, y);
}

cs.c.213r.ira:

(insn 26 10 33 2 (set (reg:DI 87 [ y+8 ]) (subreg:DI (reg:TF 33 v1 [ y ]) 8)) cs.c:4 34 {*movdi_aarch64} (expr_list:REG_DEAD (reg:TF 33 v1 [ y ]) (nil)))
(insn 33 26 35 2 (set (reg:TF 93) (reg:TF 32 v0 [ x ])) cs.c:4 40 {*movtf_aarch64} (expr_list:REG_DEAD (reg:TF 32 v0 [ x ]) (nil)))
(insn 35 33 34 2 (set (reg:DI 92 [ x+8 ]) (subreg:DI (reg:TF 93) 8)) cs.c:4 34 {*movdi_aarch64} (nil))
(insn 34 35 23 2 (set (reg:DI 91 [ x ]) (subreg:DI (reg:TF 93) 0)) cs.c:4 34 {*movdi_aarch64} (expr_list:REG_DEAD (reg:TF 93) (nil)))

cs.c.214r.reload:

(insn 26 10 33 2 (set (reg:DI 2 x2 [orig:87 y+8 ] [87]) (reg:DI 33 v1 [ y+8 ])) cs.c:4 34 {*movdi_aarch64} (nil))
(insn 33 26 35 2 (set (reg:TF 0 x0 [93]) (reg:TF 32 v0 [ x ])) cs.c:4 40 {*movtf_aarch64} (nil))
(insn 35 33 34 2 (set (reg:DI 1 x1 [orig:92 x+8 ] [92]) (reg:DI 1 x1 [+8 ])) cs.c:4 34 {*movdi_aarch64} (nil))
(insn 34 35 8 2 (set (reg:DI 0 x0 [orig:91 x ] [91]) (reg:DI 0 x0 [93])) cs.c:4 34 {*movdi_aarch64} (nil))

You can see the changes to insn 26 before and after reload - the SUBREG_BYTE offset of 8 seems to have been translated to v0 instead of v0.d[1] by get_hard_regno (). What's interesting here is that the SUBREG_BYTE that is generated for (subreg:DI (reg:TF 33 v1 [ y ]) 8) isn't aligned to a hard register boundary on SIMD regs, where UNITS_PER_VREG for AArch64 is 16. Therefore when this subreg is resolved, it resolves to v1 instead of v1.d[1]. Is this something going wrong in LRA, or is this a more fundamental problem with generating subregs of hard regs with unaligned subreg byte offsets? The same subreg on a pseudo works OK because in insn 33 the TF mode is allocated integer registers and all is well. Thanks, Tejas Belagod ARM.
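For what it's worth, the lost offset is consistent with a naive byte-to-register-number conversion that truncates: a TFmode value occupies a single 16-byte FP register, so a SUBREG_BYTE of 8 divided by the register size gives a register offset of 0, and the subreg collapses onto v1 rather than v1.d[1]. A small illustration of that arithmetic (this is only a sketch of the suspected conversion, not the actual LRA/get_hard_regno code):

/* Illustration only: how a byte offset of 8 can vanish when converted
   to a hard-register offset on a target whose FP/SIMD registers are
   16 bytes wide.  */
#include <stdio.h>

int main (void)
{
  const unsigned units_per_vreg = 16;  /* AArch64 SIMD register size in bytes.  */
  const unsigned subreg_byte = 8;      /* as in (subreg:DI (reg:TF v1) 8)  */

  unsigned regno_offset = subreg_byte / units_per_vreg;  /* truncates to 0 */

  printf ("register offset = %u (v1 rather than v1.d[1])\n", regno_offset);
  return 0;
}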
Re: [RFC, LRA] Repeated looping over subreg reloads.
Vladimir Makarov wrote: On 12/4/2013, 6:15 AM, Tejas Belagod wrote: Hi, I'm trying to relax CANNOT_CHANGE_MODE_CLASS for aarch64 to allow all mode changes on FP_REGS, as aarch64 does not have register packing, but I'm running into an LRA ICE. A test case generates an RTL subreg of the following form:

(set (reg:DF 97) (subreg:DF (reg:V2DF 95) 8))

LRA has to reload the subreg because the subreg is not representable as a full register. When LRA reloads this in lra-constraints.c:simplify_operand_subreg (), it seems to reload the SUBREG_REG () and leave the byte offset alone, i.e.

(set (reg:V2DF 100) (reg:V2DF 95))
(set (reg:DF 97) (subreg:DF (reg:V2DF 100) 8))

The code in lra-constraints.c is this conditional:

  /* Force a reload of the SUBREG_REG if this is a constant or PLUS or
     if there may be a problem accessing OPERAND in the outer mode.  */
  if ((REG_P (reg)
       ...
      insert_move_for_subreg (insert_before ? &before : NULL,
                              insert_after ? &after : NULL, reg, new_reg);
    }

What happens subsequently is that LRA keeps looping over this RTL and keeps reloading the SUBREG_REG () till the limit of constraint passes is reached:

(set (reg:V2DF 100) (reg:V2DF 95))
(set (reg:DF 97) (subreg:DF (reg:V2DF 100) 8))

I can't see any place where this subreg is resolved (e.g. into an equivalent memref) before the next iteration comes around for reloading the inputs and outputs of curr_insn. Or am I missing some part of the code that tries reloading the subreg with different alternatives or reg classes?

I guess this behaviour is wrong. We could spill the V2DF pseudo or put it into a register of another class, but that is not implemented. This code is actually a modified version of the one in the reload pass. We could implement alternative strategies and a check for a potential loop (such code exists in process_alt_operands). Could you send me the macro change and the test? I'll look at it and figure out what we can do.

Hi, Thanks for looking at this. The macro change is in this patch http://gcc.gnu.org/ml/gcc-patches/2013-11/msg03638.html. The test is gcc.c-torture/compile/simd-3.c, and when compiled with -O1 for aarch64 it ICEs:

gcc/testsuite/gcc.c-torture/compile/simd-3.c:22:1: internal compiler error: Maximum number of LRA constraint passes is achieved (30)

Also, I'm curious to know - is it possible to use vec_extract for vector-mode subregs and zero/sign extract for scalars, with spilling as the last resort if neither of these is possible? As you say, a non-zero SUBREG_BYTE offset could also be resolved using a different regclass where the sub-mode could just be a full register. Thanks, Tejas.
[RFC, LRA] Repeated looping over subreg reloads.
Hi, I'm trying to relax CANNOT_CHANGE_MODE_CLASS for aarch64 to allow all mode changes on FP_REGS, as aarch64 does not have register packing, but I'm running into an LRA ICE. A test case generates an RTL subreg of the following form:

(set (reg:DF 97) (subreg:DF (reg:V2DF 95) 8))

LRA has to reload the subreg because the subreg is not representable as a full register. When LRA reloads this in lra-constraints.c:simplify_operand_subreg (), it seems to reload the SUBREG_REG () and leave the byte offset alone, i.e.

(set (reg:V2DF 100) (reg:V2DF 95))
(set (reg:DF 97) (subreg:DF (reg:V2DF 100) 8))

The code in lra-constraints.c is this conditional:

  /* Force a reload of the SUBREG_REG if this is a constant or PLUS or
     if there may be a problem accessing OPERAND in the outer mode.  */
  if ((REG_P (reg)
       ...
      insert_move_for_subreg (insert_before ? &before : NULL,
                              insert_after ? &after : NULL, reg, new_reg);
    }

What happens subsequently is that LRA keeps looping over this RTL and keeps reloading the SUBREG_REG () till the limit of constraint passes is reached:

(set (reg:V2DF 100) (reg:V2DF 95))
(set (reg:DF 97) (subreg:DF (reg:V2DF 100) 8))

I can't see any place where this subreg is resolved (e.g. into an equivalent memref) before the next iteration comes around for reloading the inputs and outputs of curr_insn. Or am I missing some part of the code that tries reloading the subreg with different alternatives or reg classes? Thanks, Tejas.
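For illustration, the "resolve into a memory reference" alternative alluded to above might look roughly like the following RTL (a hand-written sketch, not actual LRA output; pseudo 101 stands for the address of a spill slot):

(set (mem:V2DF (reg/f:DI 101)) (reg:V2DF 95))                          ;; spill the V2DF value
(set (reg:DF 97) (mem:DF (plus:DI (reg/f:DI 101) (const_int 8))))      ;; load the DF half at byte offset 8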
Re: [RFC] vector subscripts/BIT_FIELD_REF in Big Endian.
What's interesting to me here is the bitpos - does this not need BYTES_BIG_ENDIAN correction? This seems to be inconsistent with what happens with reduction operations in the autovectorizer, where the scalar result in the reduction epilogue gets extracted with a BIT_FIELD_REF but the bitpos there is corrected for BIG_ENDIAN.

a[0] is at the left end of the array in BIG_ENDIAN, and big-endian machines number bits from the left, so bit position 0 is correct.

...
vect_sum_9.17_74 = [reduc_plus_expr] vect_sum_9.15_73;
stmp_sum_9.16_75 = BIT_FIELD_REF ;
sum_76 = stmp_sum_9.16_75 + sum_47;

the BIT_FIELD_REF here seems to have been corrected for BYTES_BIG_ENDIAN.

Yes, because something else is going on here. This is a reduction operation where the sum ends up in the rightmost element of a vector register that contains four 32-bit integers. This is at position 96 from the left end of the register according to big-endian numbering.

Thanks for your reply. Sorry, I'm still a bit confused here. The reduc_splus_ documentation says "Compute the sum of the signed elements of a vector. The vector is operand 1, and the scalar result is stored in the least significant bits of operand 0 (also a vector)." Shouldn't this mean the scalar result should be in bitpos 0, which is the left end of the register in big-endian? Thanks, Tejas.

If vec_extract is defined in the back end, how does one figure out whether the BIT_FIELD_REF is a product of the gimplifier's indirect-ref folding or of the vectorizer's bit-field extraction, and apply the appropriate correction in vec_extract's expansion? Or am I missing something that corrects BIT_FIELD_REFs between the gimplifier and the RTL expander?

There is no inconsistency here. Hope this helps! Bill

Thanks, Tejas.
[RFC] vector subscripts/BIT_FIELD_REF in Big Endian.
Hi, I'm looking for some help understanding how BIT_FIELD_REFs work with big-endian. Vector subscripts in this example:

#define vector __attribute__((vector_size(sizeof(int)*4) ))
typedef int vec vector;

int foo(vec a)
{
  return a[0];
}

get lowered into array accesses by c-typeck.c:

;; Function foo (null)
{
  return *(int *) &a;
}

and get gimplified into BIT_FIELD_REFs a bit later:

foo (vec a)
{
  int _2;

  :
  _2 = BIT_FIELD_REF ;
  return _2;
}

What's interesting to me here is the bitpos - does this not need BYTES_BIG_ENDIAN correction? This seems to be inconsistent with what happens with reduction operations in the autovectorizer, where the scalar result in the reduction epilogue gets extracted with a BIT_FIELD_REF but the bitpos there is corrected for BIG_ENDIAN.

... from tree-vect-loop.c:vect_create_epilog_for_reduction ()

  /* 2.4  Extract the final scalar result.  Create:
	 s_out3 = extract_field  */

  if (extract_scalar_result)
    {
      tree rhs;

      if (dump_enabled_p ())
	dump_printf_loc (MSG_NOTE, vect_location,
			 "extract scalar result");

      if (BYTES_BIG_ENDIAN)
	bitpos = size_binop (MULT_EXPR,
			     bitsize_int (TYPE_VECTOR_SUBPARTS (vectype) - 1),
			     TYPE_SIZE (scalar_type));
      else
	bitpos = bitsize_zero_node;

For example:

int foo(int * a)
{
  int i, sum = 0;
  for (i=0;i<16;i++)
    sum += a[i];
  return sum;
}

gets autovectorized into:

...
vect_sum_9.17_74 = [reduc_plus_expr] vect_sum_9.15_73;
stmp_sum_9.16_75 = BIT_FIELD_REF ;
sum_76 = stmp_sum_9.16_75 + sum_47;

the BIT_FIELD_REF here seems to have been corrected for BYTES_BIG_ENDIAN.

If vec_extract is defined in the back end, how does one figure out whether the BIT_FIELD_REF is a product of the gimplifier's indirect-ref folding or of the vectorizer's bit-field extraction, and apply the appropriate correction in vec_extract's expansion? Or am I missing something that corrects BIT_FIELD_REFs between the gimplifier and the RTL expander? Thanks, Tejas.
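To make the two positions concrete, here is a small worked example of the bitpos arithmetic under big-endian bit numbering (assuming a 4 x 32-bit vector, as in the examples above):

/* Worked example of the two BIT_FIELD_REF positions discussed above;
   assumes a V4SI-style vector (4 elements of 32 bits) and big-endian
   numbering from the left end of the vector.  */
#include <stdio.h>

int main (void)
{
  const unsigned nelems = 4, elem_bits = 32;

  /* a[0] is the leftmost element, so its bit position is 0.  */
  unsigned subscript_bitpos = 0 * elem_bits;            /* 0  */

  /* The reduction result lands in the rightmost element, so the
     vectorizer corrects the position to (nelems - 1) * elem_bits.  */
  unsigned reduc_bitpos = (nelems - 1) * elem_bits;     /* 96 */

  printf ("subscript bitpos = %u, reduction bitpos = %u\n",
          subscript_bitpos, reduc_bitpos);
  return 0;
}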
Re: ARM/AAarch64: NEON intrinsics in the kernel
Ard Biesheuvel wrote: On 18 July 2013 16:54, Tejas Belagod wrote: I'd like to follow up this thread to move towards removing arm_neon.h's dependence on stdint.h. My comments are inline below.

As far as I can tell, the only dependency arm_neon.h has on the contents of that header are the [u]int[8|16|32|64]_t typedefs. The kernel does define those, only in a different header.

Hello Tejas, What I did not realize at the time is that those types are part of the visible interface of the NEON intrinsics. Just as an example, there is a function in arm_neon.h:

uint8x8_t vset_lane_u8 (uint8_t __a, uint8x8_t __b, const int __c);

which clearly needs a type definition for uint8_t. Changing the published and documented interface is unlikely to be a realistic option, I'm afraid, and simply dropping the #include will cause breakage for some existing users, which is also not very appealing.

I was thinking more along the lines of

#ifdef __INT8_TYPE__
typedef __INT8_TYPE__ int8_t;
#endif

and

#ifdef __UINT64_C
#define UINT64_C(c) __UINT64_C (c)
#endif

In other words, this is perhaps reproducing a part of stdint-gcc.h. I don't know if there can be a situation where these predefines are not defined (e.g. some -m option that turns them off?).

Conditionally including stdint.h in case those types have not been defined (yet) would be the only remaining option, I think, but I am not sure if that is feasible.

Are you proposing something like:

/* arm_neon.h */
#ifndef __intxx_t_defined
...
#define __STDC_CONSTANT_MACROS
#include <stdint.h>
#endif
...
/* Prevent __STDC_CONSTANT_MACROS from polluting the environment.  */
#ifdef __STDC_CONSTANT_MACROS
#undef __STDC_CONSTANT_MACROS
#endif
/* End of arm_neon.h */

Including all of stdint.h for only the few basic types/macros that we need seems too heavy a hammer, does it not? Thanks, Tejas.

In the kernel case, I have worked around it by having a separate compilation unit containing the wrapped NEON intrinsics code, and using plain old C types to interface with the wrapper functions. [...] Regards, Ard.
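To make the guarded-typedef idea above concrete, here is a hedged sketch of what arm_neon.h might contain if it reproduced that slice of stdint-gcc.h itself; whether the __INT*_TYPE__/__UINT64_C predefines can ever be missing, and how this interacts with a later #include of stdint.h, are exactly the open questions in this thread:

/* Sketch only: guarded typedefs built from GCC's predefined type macros,
   as an alternative to including stdint.h from arm_neon.h.  */
#ifdef __INT8_TYPE__
typedef __INT8_TYPE__ int8_t;
#endif
#ifdef __UINT8_TYPE__
typedef __UINT8_TYPE__ uint8_t;
#endif
#ifdef __INT64_TYPE__
typedef __INT64_TYPE__ int64_t;
#endif
#ifdef __UINT64_TYPE__
typedef __UINT64_TYPE__ uint64_t;
#endif
/* ... likewise for the 16- and 32-bit types ... */

#if defined (__UINT64_C) && !defined (UINT64_C)
#define UINT64_C(c) __UINT64_C (c)
#endif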
Re: ARM/AAarch64: NEON intrinsics in the kernel
Hi Ard, I'd like to follow up this thread to move towards removing arm_neon.h's dependence on stdint.h. My comments are inline below.

From: Ard Biesheuvel
Date: Tue, May 21, 2013 at 10:32 AM
Subject: ARM/AAarch64: NEON intrinsics in the kernel
To: gcc@gcc.gnu.org
Cc: Christophe Lyon , Matthew Gretton-Dann , richard.earns...@arm.com, ramana.radhakrish...@arm.com, marcus.shawcr...@arm.com

Hello all, I am currently exploring various ways of using NEON instructions in kernel mode. One of the ways of doing so is using NEON intrinsics, which we would like to support in the kernel, but unfortunately, at the moment we can't, because the support header arm_neon.h assumes C99 conformance and includes <stdint.h>. The kernel does not supply that header. As far as I can tell, the only dependency arm_neon.h has on the contents of that header are the [u]int[8|16|32|64]_t typedefs. The kernel does define those, only in a different header.

There are also constant macros like UINT64_C etc. that cause issues when compiled with C++. Also, defining __STDC_CONSTANT_MACROS to get around this issue won't make the problem go away, I think.

I would like to propose the following way to address this issue: as arm_neon.h is coupled very tightly with GCC's internals (__builtin_neon_* types and functions), could we not modify arm_neon.h to

- drop the #include <stdint.h>

Removing arm_neon.h's dependency on stdint.h is probably a good idea.

- replace every instance of [u]intxx_t with the builtin macro __[U]INTxx_TYPE__ (as we are already dependent on specific versions of GCC, this should not introduce any additional limitations)

The choice we have here is to replace the stdint types int<8,16,32,64>_t with the predefined __INT<8,16,32,64>_TYPE__ macros, and UINT64_C from stdint.h with __UINT64_C, etc. But it is recommended that these never be used directly - only via the header. If we use these directly in arm_neon.h, it introduces a dependency on the predefines implementation in GCC, but as you point out, arm_neon.h is already dependent on specific versions of GCC, so this maintenance overhead is probably unavoidable. We do need the standard typedefs from somewhere... Thoughts? Thanks, Tejas Belagod. ARM.

In this way, it is much easier to support NEON intrinsics in environments that we care about (like the kernel) but which do not conform to the standards. Kind regards, Ard.