[Bug target/109632] Inefficient codegen when complex numbers are emulated with structs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109632 --- Comment #12 from rsandifo at gcc dot gnu.org --- The patch in comment 11 is just a related spot improvement. The PR itself is still unfixed.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109632 --- Comment #11 from CVS Commits --- The trunk branch has been updated by Richard Sandiford : https://gcc.gnu.org/g:b096a6ebe9d9f9fed4c105f6555f724eb32af95c

commit r14-1131-gb096a6ebe9d9f9fed4c105f6555f724eb32af95c
Author: Richard Sandiford
Date:   Tue May 23 11:34:42 2023 +0100

    aarch64: Provide FPR alternatives for some bit insertions [PR109632]

At -O2, and so with SLP vectorisation enabled:

    struct complx_t { float re, im; };
    complx_t add(complx_t a, complx_t b) {
      return {a.re + b.re, a.im + b.im};
    }

generates:

        fmov    w3, s1
        fmov    x0, d0
        fmov    x1, d2
        fmov    w2, s3
        bfi     x0, x3, 32, 32
        fmov    d31, x0
        bfi     x1, x2, 32, 32
        fmov    d30, x1
        fadd    v31.2s, v31.2s, v30.2s
        fmov    x1, d31
        lsr     x0, x1, 32
        fmov    s1, w0
        lsr     w0, w1, 0
        fmov    s0, w0
        ret

This is because complx_t is passed and returned in FPRs, but GCC gives it DImode. We therefore "need" to assemble a DImode pseudo from the two individual floats, bitcast it to a vector, do the arithmetic, bitcast it back to a DImode pseudo, then extract the individual floats.

There are many problems here. The most basic is that we shouldn't use SLP for such a trivial example. But SLP should in principle be beneficial for more complicated examples, so preventing SLP for the example above just changes the reproducer needed. A more fundamental problem is that it doesn't make sense to use single DImode pseudos in a testcase like this. I have a WIP patch to allow re and im to be stored in individual SFmode pseudos instead, but it's quite an invasive change and might end up going nowhere.

A simpler problem to tackle is that we allow DImode pseudos to be stored in FPRs, but we don't provide any patterns for inserting values into them, even though INS makes that easy for element-like insertions. This patch adds some patterns for that.
Doing that showed that aarch64_modes_tieable_p was too strict: it didn't allow SFmode and DImode values to be tied, even though both of them occupy a single GPR and FPR, and even though we allow both classes to change between the modes.

The *aarch64_bfidi<ALLX:mode>_subreg_<SUBDI_BITS> pattern is especially ugly, but it's not clear what target-independent code ought to simplify it to, if it was going to simplify it.

We should probably do the same thing for extractions, but that's left as future work.

After the patch we generate:

        ins     v0.s[1], v1.s[0]
        ins     v2.s[1], v3.s[0]
        fadd    v0.2s, v0.2s, v2.2s
        fmov    x0, d0
        ushr    d1, d0, 32
        lsr     w0, w0, 0
        fmov    s0, w0
        ret

which seems like a step in the right direction. All in all, there's nothing elegant about this patch. It just seems like the least worst option.

gcc/
        PR target/109632
        * config/aarch64/aarch64.cc (aarch64_modes_tieable_p): Allow
        subregs between any scalars that are 64 bits or smaller.
        * config/aarch64/iterators.md (SUBDI_BITS): New int iterator.
        (bits_etype): New int attribute.
        * config/aarch64/aarch64.md (*insv_reg<mode>_<SUBDI_BITS>)
        (*aarch64_bfi<GPI:mode><ALLX:mode>_<SUBDI_BITS>): New patterns.
        (*aarch64_bfidi<ALLX:mode>_subreg_<SUBDI_BITS>): Likewise.

gcc/testsuite/
        * gcc.target/aarch64/ins_bitfield_1.c: New test.
        * gcc.target/aarch64/ins_bitfield_2.c: Likewise.
        * gcc.target/aarch64/ins_bitfield_3.c: Likewise.
        * gcc.target/aarch64/ins_bitfield_4.c: Likewise.
        * gcc.target/aarch64/ins_bitfield_5.c: Likewise.
        * gcc.target/aarch64/ins_bitfield_6.c: Likewise.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109632 --- Comment #10 from rsandifo at gcc dot gnu.org --- After prototyping this further, I no longer think that lowering at the gimple level is the best answer. (I should have listened to Richi.) Although it works, its major drawback is that it's one-sided: it allows the current function's PARM_DECLs and returns to be lowered to individual scalars, but it does nothing for calls to other functions. Being one-sided means (a) that lowering only solves half the problem and (b) that tail calls cannot be handled easily after lowering. One thing that does seem to work is to force the structure to have V2SF (and fix the inevitable ABI fallout). That could only be done conditionally, based on a target hook. But it seems to fix both test cases: the pass-by-reference one and the pass-by-value one.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109632 --- Comment #9 from Tamar Christina --- Thank you!
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109632

rsandifo at gcc dot gnu.org changed:

           What            |Removed                      |Added
----------------------------------------------------------------------------
           Status          |UNCONFIRMED                  |ASSIGNED
   Last reconfirmed        |                             |2023-04-27
     Ever confirmed        |0                            |1
           Assignee        |unassigned at gcc dot gnu.org|rsandifo at gcc dot gnu.org

--- Comment #8 from rsandifo at gcc dot gnu.org --- Have a (still hacky) patch that also fixes the example in comment 4, giving:

        fadd    s1, s1, s3
        fadd    s0, s0, s2
        ret

Will work on it a bit more before sending an RFC. Can imagine the approach will be somewhat controversial!
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109632 --- Comment #7 from rsandifo at gcc dot gnu.org --- Thinking more about it, it would probably be better to defer the split until around lower_complex time, after IPA (especially inlining), NRV and tail-recursion. Doing it there should also make it easier to split arguments.

(In reply to Tamar Christina from comment #6)
> That's an interesting approach, I think it would also fix
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109391 would it not? Since the
> int16x8x3_t return would be "scalarized" avoiding the bad expansion?

I don't think it will help with that, since the returned value there is a natural V3x8HI (rather than something that the ABI splits apart). Splitting in that case might pessimise cases where the return value is loaded as a whole, rather than assigned to individually. But it might be worth giving SRA the option of splitting even in that case, as a follow-on optimisation, if it fits naturally with the definitions.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109632 --- Comment #6 from Tamar Christina --- That's an interesting approach. I think it would also fix https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109391, would it not? Since the int16x8x3_t return would be "scalarized", avoiding the bad expansion?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109632 --- Comment #5 from rsandifo at gcc dot gnu.org --- Created attachment 54941 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=54941&action=edit hacky proof-of-concept patch

This is a very hacky proof of concept patch. Don't try it on anything serious, and certainly don't try to bootstrap with it -- it'll fall over in the slightest breeze. But it does produce:

        ldp     s3, s2, [x0]
        ldp     s0, s1, [x1]
        fadd    s1, s2, s1
        fadd    s0, s3, s0
        ret

for the original testcase.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109632

rsandifo at gcc dot gnu.org changed:

           What            |Removed                      |Added
----------------------------------------------------------------------------
           CC              |                             |rsandifo at gcc dot gnu.org

--- Comment #4 from rsandifo at gcc dot gnu.org --- Maybe worth noting that if the complex arguments are passed by value, to give:

    struct complx_t {
      float re;
      float im;
    };
    complx_t add(const complx_t a, const complx_t b) {
      return {a.re + b.re, a.im + b.im};
    }

and SLP is disabled, we get:

        fmov    w4, s1
        fmov    w3, s3
        fmov    x0, d0
        fmov    x1, d2
        mov     x2, 0
        bfi     x0, x4, 32, 32
        bfi     x1, x3, 32, 32
        fmov    d0, x0
        fmov    d1, x1
        sbfx    x3, x0, 0, 32
        sbfx    x0, x1, 0, 32
        ushr    d1, d1, 32
        fmov    d3, x0
        fmov    d2, x3
        ushr    d0, d0, 32
        fadd    s2, s2, s3
        fadd    s0, s0, s1
        fmov    w1, s2
        fmov    w0, s0
        bfi     x2, x1, 0, 32
        bfi     x2, x0, 32, 32
        lsr     x0, x2, 32
        lsr     w2, w2, 0
        fmov    s1, w0
        fmov    s0, w2
        ret

which is almost impressive, in its way.

I think we need a way in gimple of "SRA-ing" the arguments and return value, in cases where that's forced by the ABI. I.e. provide separate incoming values of a.re and a.im, and store them to "a" on entry. Then similarly make the return stmt return RETURN_DECL.re and RETURN_DECL.im separately.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109632 --- Comment #3 from Tamar Christina --- Note that even if we can't stop SLP, we should be able to generate equally efficient code by being creative about the instruction selection; that's why I marked it as a target bug :)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109632 --- Comment #2 from Tamar Christina ---

(In reply to Richard Biener from comment #1)
> Well, the usual unknown ABI boundary at function entry/exit.

Yes, but LLVM gets it right, so it should be a solvable computer science problem. :) Note that this was reduced from a bigger routine, but the end result is the same: the thing shouldn't have been vectorized.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109632 --- Comment #1 from Richard Biener --- Well, the usual unknown ABI boundary at function entry/exit.