[Bug target/96366] [AArch64] ICE due to lack of support for VNx2SI sub instruction
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96366 --- Comment #5 from Bu Le ---

(In reply to rsand...@gcc.gnu.org from comment #3)
> (In reply to Bu Le from comment #2)
> > (In reply to rsand...@gcc.gnu.org from comment #1)
> > > (In reply to Bu Le from comment #0)
> Generating a subtraction out of an addition seemed odd since
> canonicalisations usually go the other way.  But if the target
> says it supports negation too then that changes things.  It doesn't
> make much sense to support addition and negation but not subtraction.

If some mode or target does not have a subtraction pattern, should we let the compiler try to use addition and negation before it falls into an ICE? If so, the changes to optabs seem reasonable as well.

(In reply to rsand...@gcc.gnu.org from comment #4)
> Fixed by g:9623f61b142174b87760c81f78928dd14af7cbc6.
>
> As far as I know, only GCC 11 needs the fix, but we can backport
> to GCC 10 as well if we find a testcase that needs it.

Sure, I will check whether the problem also exists in GCC 10.
[Bug target/96366] [AArch64] ICE due to lack of support for VNx2SI sub instruction
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96366 --- Comment #2 from Bu Le ---

(In reply to rsand...@gcc.gnu.org from comment #1)
> (In reply to Bu Le from comment #0)
> Hmm.  In general, the lack of a vector pattern shouldn't cause ICEs,
> but I suppose the add/sub pairing is somewhat special because of
> the canonicalisation rules.  It would be worth looking at exactly
> why we generate the subtract though, just to confirm that this is
> an “expected” ICE rather than a symptom of a deeper problem.

Sure. The subtraction is expanded at expr.c:8989, before which I believe everything still works fine. The gimple statement to be expanded is

  vect__5.16_77 = { 4294967273, 4294967154, 4294967294, 4294967265 } - vect__1.14_73

As the logic goes on, it enters the binop routine at expr.c:9948 because op1 (vect__1.14_73) is not a constant operand, and so misses the opportunity to be turned into an add-of-negation pair. The routine then calls expand_binop to finalize the subtraction. The expand_binop function also has an opportunity to turn this subtraction into an addition of a negated value, but misses it for the same reason: op1 is not a constant.

It occurs to me that we could bring a check for the availability of the pattern into the decision about whether to turn the subtraction into an addition of the negated equivalent. This would act as insurance for similar cases where a pattern is missing, preventing the ICE. So I tried the following change, which indeed solves the problem by turning the subtraction into an addition as expected.

diff -Nurp gcc-20200728-org/gcc/optabs.c gcc-20200728/gcc/optabs.c
--- gcc-20200728-org/gcc/optabs.c	2020-07-29 15:53:52.76000 +0800
+++ gcc-20200728/gcc/optabs.c	2020-07-30 11:00:00.96400 +0800
@@ -1171,10 +1171,12 @@ expand_binop (machine_mode mode, optab b
   mclass = GET_MODE_CLASS (mode);

-  /* If subtracting an integer constant, convert this into an addition of
-     the negated constant.  */
+  /* If subtracting an integer constant, or if no subtraction pattern available
+     for this mode, convert this into an addition of the negated constant.  */

-  if (binoptab == sub_optab && CONST_INT_P (op1))
+  if (binoptab == sub_optab
+      && (CONST_INT_P (op1)
+          || optab_handler (binoptab, mode) == CODE_FOR_nothing))
     {
       op1 = negate_rtx (mode, op1);
       binoptab = add_optab;

> The idea was for that patch to add the bare minimum needed
> to support the “unpacked vector” infrastructure.  Then, once the
> infrastructure was in place, we could add support for other
> unpacked vector operations too.
>
> However, the infrastructure went in late during the GCC 10
> cycle, so the idea was to postpone any new unpacked vector
> support to GCC 11.  So far the only additional operations
> have been Joe Ramsay's patches for logical operations
> (g:bb3ab62a8b4a108f01ea2eddfe31e9f733bd9cb6 and
> g:6802b5ba8234427598abfd9f0163eb5e7c0d6aa8).
>
> The reason for not changing many operations at once is that,
> in each case, a decision needs to be made whether the
> operation should use the container mode (as for INDEX),
> the element mode (as for right shifts, once they're
> implemented) or whether it doesn't matter (as for addition).
> Each operation also needs tests.  So from that point of view,
> it's more convenient to have a separate patch for each
> operation (or at least closely-related groups of operations).

Oh, I see. From a performance point of view, I believe adding the subtraction pattern is eventually necessary. I compiled and ran the attached test case with the subtraction-pattern solution, and it works fine. Logically, subtraction should behave the same as addition, which is not sensitive to the operand mode. My idea for solving this problem is to upstream the patch for the mode extension independently, and afterwards upstream the sub-to-add patch as insurance for other cases that might step into the same routine. Any suggestions?
[Bug target/96366] New: [AArch64] ICE due to lack of support for VNx2SI sub instruction
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96366

            Bug ID: 96366
           Summary: [AArch64] ICE due to lack of support for VNx2SI sub
                    instruction
           Product: gcc
           Version: 11.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: bule1 at huawei dot com
                CC: richard.sandiford at arm dot com
  Target Milestone: ---

Created attachment 48950
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=48950&action=edit
preprocessed source code for reproducing the problem

Hi,

The test case bb-slp-20.c in the GCC testsuite causes an ICE in the expand pass because GCC lacks a pattern for subtraction in the VNx2SI mode. The preprocessed file is attached; the problem is triggered when compiling with the -march=armv8.5-a+sve -msve-vector-bits=256 -O3 -fno-tree-forwprop options.

By tracing the debug information, I found that the error is due to a vectorized subtraction gimple statement with VNx2SI mode that cannot find its pattern during the expand pass. I tried extending the mode of this pattern from SVE_FULL_I to SVE_I as follows, after which the problem is solved.

diff -Nurp a/gcc/config/aarch64/aarch64-sve.md b/gcc/config/aarch64/aarch64-sve.md
--- a/gcc/config/aarch64/aarch64-sve.md	2020-07-29 15:54:39.36000 +0800
+++ b/gcc/config/aarch64/aarch64-sve.md	2020-07-29 14:37:21.93200 +0800
@@ -3644,10 +3644,10 @@
 ;;
 (define_insn "sub<mode>3"
-  [(set (match_operand:SVE_FULL_I 0 "register_operand" "=w, w, ?&w")
-	(minus:SVE_FULL_I
-	  (match_operand:SVE_FULL_I 1 "aarch64_sve_arith_operand" "w, vsa, vsa")
-	  (match_operand:SVE_FULL_I 2 "register_operand" "w, 0, w")))]
+  [(set (match_operand:SVE_I 0 "register_operand" "=w, w, ?&w")
+	(minus:SVE_I
+	  (match_operand:SVE_I 1 "aarch64_sve_arith_operand" "w, vsa, vsa")
+	  (match_operand:SVE_I 2 "register_operand" "w, 0, w")))]
   "TARGET_SVE"
   "@
   sub\t%0.<Vetype>, %1.<Vetype>, %2.<Vetype>

I noticed that this mode iterator was changed from SVE_I to SVE_FULL_I in Nov 2019 by Richard to support partial SVE vectors. However, in a following patch the addition pattern was supported by changing SVE_FULL_I back to SVE_I, but the subtraction pattern was not. Is there any specific reason why this pattern is not supported? Thanks.
[Bug fortran/96030] AArch64: Add an option to control 64bits simdclone of math functions for fortran
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96030 --- Comment #3 from Bu Le ---

(In reply to Jakub Jelinek from comment #1)
> The directive should be doing what
> #pragma omp declare simd
> does on the target and it is an ABI decision what exactly it does.

Hi, I am still confused by your comment. Would you mind explaining in more detail?
[Bug fortran/96030] AArch64: Add an option to control 64bits simdclone of math functions for fortran
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96030 --- Comment #2 from Bu Le ---

(In reply to Jakub Jelinek from comment #1)
> The directive should be doing what
> #pragma omp declare simd
> does on the target and it is an ABI decision what exactly it does.

I tried the test case below, but I haven't found a way to prevent the V2SF 64-bit simd clone (e.g. _ZGVnN2v_cosf) from being generated.

!GCC$ builtin (cosf) attributes simd (notinbranch)

subroutine test_cos(a_cos, b_cos, is_cos, ie_cos)
  integer, intent(in) :: is_cos, ie_cos
  REAL(4), dimension(is_cos:ie_cos), intent(inout) :: a_cos, b_cos
  do i = 1, 3
    b_cos(i) = cos(a_cos(i))
  enddo
end subroutine test_cos

Are you suggesting that we already have a way to prevent the generation of _ZGVnN2v_cosf with the !GCC$ builtin directive? Or that we haven't solved the problem yet, but the correct way to solve it is to implement more features for the !GCC$ builtin directive?
[Bug fortran/96030] New: AArch64: Add an option to control 64bits simdclone of math functions for fortran
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96030

            Bug ID: 96030
           Summary: AArch64: Add an option to control 64bits simdclone of
                    math functions for fortran
           Product: gcc
           Version: 11.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: fortran
          Assignee: unassigned at gcc dot gnu.org
          Reporter: bule1 at huawei dot com
  Target Milestone: ---
            Target: AARCH64

Created attachment 48824
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=48824&action=edit
patch for the problem

Hi,

I found that currently Fortran can only enable simd clones of math functions through the "!GCC$ builtin (exp) attributes simd" directive. This kind of declaration cannot indicate whether the simdlen is 64 bits or 128 bits. From reading the source code and from some tests, I believe simd clones of both modes are generated by a single directive declaration.

At present, the vector math libraries for aarch64 (mathlib and sleef) do not support the 64-bit mode functions. So when I want to enable simd math for an application, if the application has an opportunity for a 64-bit mode simd clone, there is no matching math library call, which leads to a link-time error.

For now, to solve this problem, I added a new backend option -msimdmath-64 to control the generation of the 64-bit mode simd clones; it defaults to disabled. The patch is attached.

I think it is reasonable to have a switch controlling the generation of the 64-bit mode simd clones. Do we have something like this already? If not, is this the right way to go? Thanks.
[Bug libquadmath/96016] AArch64: enable libquadmath
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96016 --- Comment #4 from Bu Le ---

(In reply to Andreas Schwab from comment #3)
> You are computing the sine of (double)ld.  If you want the sine of a long
> double value, you need to use the sinl function, also use acosl(-1) to
> compute pi in long double precision.

Oh, I see. Thanks for the help.
[Bug libquadmath/96016] AArch64: enable libquadmath
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96016 --- Comment #2 from Bu Le ---

(In reply to Andrew Pinski from comment #1)
> If long double is 128bit fp already, then glibc has full support of it.  So
> you dont need libquadmath at all.  It is only there if long double is not
> 128bit long double and glibc does not have support for the __float128 type.

Can you elaborate? I tried the test case below with libm, and it gives me an incorrect answer without enough precision: -f000--3ffd. What library does glibc provide for quad-precision math? Or maybe I configured libm wrongly?

#include <stdio.h>
#include <math.h>

int main(void)
{
  long double ld = 0;
  long double res;
  long double pi = acos(-1);
  int* i = (int*) &res;
  i[0] = i[1] = i[2] = i[3] = 0xdeadbeef;
  ld = pi/6;
  res = sin(ld);
  printf("sinq-1: %08x-%08x-%08x-%08x\n", i[0], i[1], i[2], i[3]);
}

/* { dg-output "sinq-1: af2139b8-fae7b900--3ffd\n" } */
[Bug libquadmath/96016] New: AArch64: enable libquadmath
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96016

            Bug ID: 96016
           Summary: AArch64: enable libquadmath
           Product: gcc
           Version: 11.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: libquadmath
          Assignee: unassigned at gcc dot gnu.org
          Reporter: bule1 at huawei dot com
  Target Milestone: ---
            Target: AARCH64

Created attachment 48815
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=48815&action=edit
patch for enabling libquadmath on aarch64

Hi,

I would like to propose a way to enable libquadmath on aarch64. Currently aarch64 supports quad-precision floating point with the type "long double". However, libquadmath won't build even if we specify --enable-quadmath in the configure step, because the build configuration checks whether the target supports the __float128 type, and the libquadmath build exits if the answer is no.

According to the ARM ABI (https://c9x.me/compile/bib/abi-arm64.pdf) and some test cases I tried, long double on aarch64 is equivalent to __float128 on x86. I happened to need a quad-precision math library, so I removed the hard limitation on detecting the __float128 type. After the change, when the target is found to be aarch64, a macro is introduced that redefines long double as __float128. With this, libquadmath builds and works successfully on aarch64. Tests have been conducted with random inputs on aarch64 and x86; the output on aarch64 agrees with the output on x86.

One minor question about my solution: aarch64 doesn't have the __builtin_huge_valq built-in function used to define HUGE_VALQ. I used the value from the original comment, which clearly states that this might cause a warning, and the warning did appear during the build. I haven't found where and how aarch64 defines all these built-in constant values. Any comment on this issue?

The patch is attached. You are welcome to check and comment on it. Is it OK for trunk?
[Bug target/95285] AArch64:aarch64 medium code model proposal
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95285 --- Comment #14 from Bu Le ---

> > Anyway, my point is that the size of single data doesn't affect the fact
> > that the medium code model is missing in aarch64 and aarch64 lacks a PIC
> > large code model.
>
> What is missing is efficient support for >4GB of data, right? How that is
> implemented is a different question - my point is that it does not require a
> new code model. It would be much better if it just worked without users even
> needing to think about code models.
>
> Also, what is the purpose of a large fpic model? Are there any applications
> that use shared libraries larger than 4GB?

Yes, I understand, and I am grateful for your suggestion. I have to say it is not a critical problem; after all, most applications work fine with the current code models. But there are some cases, like CESM with certain configurations, or my test case, which cannot be compiled with the current GCC compiler on aarch64. Unfortunately, applications larger than 4GB are quite normal in the HPC field. Meanwhile, x86 and LLVM on aarch64 can compile them, with a medium or large-PIC code model. That is why I am proposing this. By adding this feature, we take a step forward for the aarch64 GCC compiler, making it more powerful and robust. Clear enough for your concern?

As for the implementation you suggested, I believe it is a promising plan. I would like to try to implement it first. It might take weeks of development; I will see what I can get and will give you updates on the progress. Thanks for the suggestion again.
[Bug target/95285] AArch64:aarch64 medium code model proposal
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95285 --- Comment #11 from Bu Le ---

> You're right, we need an extra add, so it's like this:
>
>     adrp    x0, bar1.2782
>     movk    x1, :high32_47:bar1.2782
>     add     x0, x0, x1
>     add     x0, x0, :lo12:bar1.2782
>
> > (By the way, the high32_47 relocation you suggested is the prel_g2 in the
> > official aarch64 ABI release)
>
> It needs a new relocation because of the ADRP.  ADR could be used so the
> existing R_AARCH64_MOVW_PREL_G0-3 relocations work, but then you need 5
> instructions.

So you suggest a new relocation type "high32_47" to calculate the offset between the ADRP and bar1. Am I right?

> > And in terms of engineering, your idea saves the trouble of modifying the
> > linker to calculate the offset for the 3 movks.  But we still need a new
> > relocation type for ADRP, because it currently checks the address for
> > overflow and gives the "relocation truncated to fit" error.  Therefore both
> > ideas need work in binutils, which makes them equivalent there as well.
>
> There is relocation 276 (R_AARCH64_ADR_PREL_PG_HI21_NC).

Yes, though we still need to change the compiler so that, for the medium code model, ADRP uses the R_AARCH64_ADR_PREL_PG_HI21_NC relocation.
[Bug target/95285] AArch64:aarch64 medium code model proposal
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95285 --- Comment #10 from Bu Le ---

> Fortran already has -fstack-arrays to decide between allocating arrays on
> the heap or on the stack.

I tried the flag with my example. -fstack-arrays apparently cannot move the array out of .bss onto the heap; the problem is still there. Anyway, my point is that the size of a single data object doesn't affect the fact that the medium code model is missing on aarch64 and that aarch64 lacks a PIC large code model.
[Bug target/95285] AArch64:aarch64 medium code model proposal
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95285 --- Comment #7 from Bu Le ---

(In reply to Wilco from comment #5)
> (In reply to Bu Le from comment #0)
>
> Also it would be much more efficient to have a relocation like this if you
> wanted a 48-bit PC-relative offset:
>
>     adrp    x0, bar1.2782
>     add     x0, x0, :lo12:bar1.2782
>     movk    x0, :high32_47:bar1.2782

I am afraid that putting the PC-relative offset into x0 is not correct, because x0 is supposed to hold the final address of bar1 rather than a PC offset. Therefore an extra register is needed to hold the offset temporarily. Later, we need to add the PC address of the movk to the offset to calculate bits 32:47 of the final address of bar1. Finally, we add that part of the address to x0 to compute the entire 48-bit final address. So the code should be the following sequence:

    adrp    x0, bar1.2782
    add     x0, x0, :lo12:bar1.2782   // x0 here holds bits 0:31 of the final address
    movk    x4, :prel_g2:bar1.2782
    adr     x1, .
    sub     x1, x1, 0x4
    add     x4, x4, x1                // x4 here holds bits 32:47 of the final address
    add     x0, x4, x0

(By the way, the high32_47 relocation you suggested is the prel_g2 in the official aarch64 ABI release.)

So actually, if we just want a 48-bit PC-relative relocation, your idea and mine both need 6-7 instructions to reach the symbol. In terms of efficiency, they would be similar. And in terms of engineering, your idea saves the trouble of modifying the linker to calculate the offset for the 3 movks. But we still need a new relocation type for ADRP, because it currently checks the address for overflow and gives the "relocation truncated to fit" error. Therefore both ideas need work in binutils, which makes them equivalent there as well.
[Bug target/95285] AArch64:aarch64 medium code model proposal
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95285 --- Comment #6 from Bu Le ---

(In reply to Wilco from comment #4)
> (In reply to Bu Le from comment #3)
> > (In reply to Wilco from comment #2)
> Well the question is whether we're talking about more than 4GB of code or
> more than 4GB of data. With >4GB code you're indeed stuck with the large
> model. With data it is feasible to automatically use malloc for arrays when
> larger than a certain size, so there is no need to change the application at
> all. Something like that could be the default in the small model so that you
> don't have any extra overhead unless you have huge arrays. Making the
> threshold configurable means you can tune it for a specific application.

Is this automatic malloc already available on some target? I haven't found an example that works that way. Would you mind providing one?
[Bug target/95285] AArch64:aarch64 medium code model proposal
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95285 --- Comment #3 from Bu Le ---

(In reply to Wilco from comment #2)
> Is the main usage scenario huge arrays? If so, these could easily be
> allocated via malloc at startup rather than using bss. It means an extra
> indirection in some cases (to load the pointer), but it should be much more
> efficient than using a large code model with all the overheads.

Thanks for the reply. The large array is just used to construct the test case; it is not a necessary condition for this scenario. The common scenario is that a symbol is too far away for the small code model to reach it, which could also result from a large number of small arrays, structures, etc. Meanwhile, the large code model is able to reach the symbol but cannot be position independent, which causes the problem. Besides, the code in CESM is quite complicated to restructure with malloc, which is also not an acceptable option for my customer. Clear enough for your concern?
[Bug target/95285] AArch64:aarch64 medium code model proposal
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95285 --- Comment #1 from Bu Le --- Created attachment 48585 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=48585&action=edit patch for binutils
[Bug target/95285] New: AArch64:aarch64 medium code model proposal
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95285

            Bug ID: 95285
           Summary: AArch64: aarch64 medium code model proposal
           Product: gcc
           Version: 11.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: bule1 at huawei dot com
  Target Milestone: ---

Created attachment 48584
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=48584&action=edit
proposed patch

I would like to propose an implementation of the medium code model for aarch64. A prototype is attached; it passed bootstrap and the regression test.

-mcmodel=medium is a code model that is missing on the aarch64 architecture but supported on x86. This code model describes a situation where small data is relocated with the small code model while large data is relocated with the large code model. The official statement about the medium code model is in the x86 ABI document, page 34 (https://refspecs.linuxbase.org/elf/x86_64-abi-0.99.pdf).

The key difference between x86 and aarch64 is that x86 can use lea+movabs instructions to implement a dynamically relocatable large code model. Currently, the large code model on AArch64 relocates symbols using the ldr instruction, which can only be statically linked, while the small code model uses adrp + ldr, which can be dynamically linked. Therefore, the medium code model cannot be implemented directly by simply setting a threshold; a dynamically relocatable large code model is needed first for a functional medium code model.

I met this problem when compiling CESM, a climate forecasting application widely used in the HPC field. In some configurations, when manipulating large arrays, the large code model with dynamic relocation is needed. The following case is abstracted from CESM for this scenario.
program main
  common/baz/a,b,c
  real a,b,c
  b = 1.0
  call foo()
  print*, b
end

subroutine foo()
  common/baz/a,b,c
  real a,b,c
  integer, parameter :: nx = 1024
  integer, parameter :: ny = 1024
  integer, parameter :: nz = 1024
  integer, parameter :: nf = 1
  real :: bar(nf,nx*ny*nz)
  real :: bar1(nf,nx*ny*nz)
  bar = 0.0
  bar1 = 0.0
  b = bar(1,1024*1024*100)
  b = bar1(1,1)
  return
end

Compiling with -mcmodel=small -fPIC gives the following errors due to the access to the bar1 array:

test.f90:(.text+0x28): relocation truncated to fit: R_AARCH64_ADR_PREL_PG_HI21 against `.bss'
test.f90:(.text+0x6c): relocation truncated to fit: R_AARCH64_ADR_PREL_PG_HI21 against `.bss'

Compiling with -mcmodel=large -fPIC gives an unsupported error:

f951: sorry, unimplemented: code model ‘large’ with ‘-fPIC’

As discussed at the beginning, to tackle this problem we have to solve the static large code model problem. My solution here is to use the R_AARCH64_MOVW_PREL_Gx group relocations together with instructions that calculate the current PC value.

Before the change (mcmodel=small):

    adrp    x0, bar1.2782
    add     x0, x0, :lo12:bar1.2782

After the change (mcmodel=medium, proposed):

    movz    x0, :prel_g3:bar1.2782
    movk    x0, :prel_g2_nc:bar1.2782
    movk    x0, :prel_g1_nc:bar1.2782
    movk    x0, :prel_g0_nc:bar1.2782
    adr     x1, .
    sub     x1, x1, 0x4
    add     x0, x0, x1

The first 4 mov instructions calculate the 64-bit offset between bar1 and the last movk instruction, which fulfils the requirement of the large code model (64-bit relocation). The adr+sub instructions calculate the PC address of the last movk instruction. By adding the offset to that PC address, bar1 can be dynamically located.

Because this relocation is time consuming, a threshold is set to classify the size of the data to be relocated, like on x86. The default value of the threshold is 65536, which is the maximum relocation capability of the small code model. This implementation also requires amending the linker in binutils so that all 4 mov instructions calculate the same PC offset, namely that of the last movk instruction.
The good side of this implementation is that it can use existing relocation types to prototype a medium code model. The drawbacks also exist. For a start, the 4 mov instructions and the adr instruction must stay together in this order; no other instruction may be inserted into the sequence, which would otherwise lead to a wrong symbol address. This might impede instruction scheduling optimizations. Secondly, the linker needs corresponding changes so that every mov instruction calculates the same PC offset. For example, in my implementation, the first movz instruction needs to add 12 to the result of ":prel_g3:bar1.2782" to make up the PC offset. I haven't figured out a suitable solution for these problems yet. You are most welcome to leave your suggestions regarding these issues.
[Bug tree-optimization/94434] New: [AArch64][SVE] ICE caused by incompatibility of SRA and svst3 builtin-function
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94434

            Bug ID: 94434
           Summary: [AArch64][SVE] ICE caused by incompatibility of SRA and
                    svst3 builtin-function
           Product: gcc
           Version: 10.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: bule1 at huawei dot com
                CC: mjambor at suse dot cz
  Target Milestone: ---
            Target: aarch64

Created attachment 48154
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=48154&action=edit
patch for the problem

Test case: gcc/testsuite/gcc.target/aarch64/sve/acle/asm/st2_bf16.c

Command line: gcc st2_bf16.c -march=armv8.2-a+sve -msve-vector-bits=256 -O2 -fno-diagnostics-show-caret -fno-diagnostics-show-line-numbers -fdiagnostics-color=never -fdiagnostics-urls=never -DTEST_OVERLOADS -fno-ipa-icf -c -o st2_bf16.o

during IPA pass: sra
st2_bf16.c: In function ‘st2_vnum_bf16_x1’:
st2_bf16.c:198:1: internal compiler error: Segmentation fault
0xc995b3 crash_signal
	../.././gcc/toplev.c:328
0xa34f68 hash_map, isra_call_summary*, simple_hashmap_traits >, isra_call_summary*> >::get_or_insert(int const&, bool*)
	../.././gcc/hash-map.h:194
0xa34f68 call_summary::get_create(cgraph_edge*)
	../.././gcc/symbol-summary.h:642
0xa34f68 record_nonregister_call_use
	../.././gcc/ipa-sra.c:1613
0xa34f68 scan_expr_access
	../.././gcc/ipa-sra.c:1781
0xa37627 scan_function
	../.././gcc/ipa-sra.c:1880
0xa37627 ipa_sra_summarize_function
	../.././gcc/ipa-sra.c:2505
0xa38437 ipa_sra_generate_summary
	../.././gcc/ipa-sra.c:2555
0xbb58bb execute_ipa_summary_passes(ipa_opt_pass_d*)
	../.././gcc/passes.c:2191
0x7f672f ipa_passes
	../.././gcc/cgraphunit.c:2627
0x7f672f symbol_table::compile()
	../.././gcc/cgraphunit.c:2737
0x7f89ab symbol_table::compile()
	../.././gcc/cgraphunit.c:2717
0x7f89ab symbol_table::finalize_compilation_unit()
	../.././gcc/cgraphunit.c:2984
Please submit a full bug report, with preprocessed source if appropriate.
Please include the complete backtrace with any bug report.
See <https://gcc.gnu.org/bugs/> for instructions.

Similar problems can be found with svst2, svst4, and other functions of this kind.

This problem is caused by the "record_nonregister_call_use" function trying to access the call graph edge of an internal call, .MASK_STORE_LANE, which is a NULL pointer. The reason "scan_expr_access" steps into "record_nonregister_call_use" is that it considered the "svbfloat16x3_t z1" argument a valid candidate for further optimization. A simple solution is to disqualify the candidate at the "scan_expr_access" level when the call graph edge is NULL, which indicates that the call is either an internal call or a call with no references. In both cases, the optimization should stop before it references a NULL pointer. A proposed patch is attached.
[Bug target/94154] AArch64: Add parameters to tune the precision of reciprocal div
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94154

Bu Le changed:

           What    |Removed      |Added
------------------------------------------------------
             Status|UNCONFIRMED  |RESOLVED
         Resolution|---          |FIXED

--- Comment #1 from Bu Le ---

The patch has been reviewed and merged to master by Richard. Fixed and closed.
[Bug target/94154] New: AArch64: Add parameters to tune the precision of reciprocal div
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94154

            Bug ID: 94154
           Summary: AArch64: Add parameters to tune the precision of
                    reciprocal div
           Product: gcc
           Version: 10.0
            Status: UNCONFIRMED
          Keywords: patch
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: bule1 at huawei dot com
                CC: richard.sandiford at arm dot com
  Target Milestone: ---
            Target: AARCH64

This report suggests using parameters to control the number of Newton iterations performed when using reciprocal division on the aarch64 platform; the number is currently hard-coded in aarch64.c. This can benefit some test cases in SPEC2017 fpspeed in peak mode that do not have a high demand for precision, and it also fixes the downside that users are otherwise forced to use the reciprocal approximation at a fixed, low precision. A proposed patch is attached.