[Bug target/105617] [12/13/14 Regression] Slp is maybe too aggressive in some/many cases

2023-06-16 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105617 --- Comment #21 from Michael_S --- (In reply to Mason from comment #20) > Doh! You're right. > I come from a background where overlapping/aliasing inputs are heresy, > thus got blindsided :( > > This would be the optimal code, right? > > add4i

[Bug target/105617] [12/13/14 Regression] Slp is maybe too aggressive in some/many cases

2023-06-07 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105617 --- Comment #19 from Michael_S --- (In reply to Mason from comment #18) > Hello Michael_S, > > As far as I can see, massaging the source helps GCC generate optimal code > (in terms of instruction count, not convinced about scheduling). > > #in

[Bug libgcc/108279] Improved speed for float128 routines

2023-02-10 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108279 --- Comment #24 from Michael_S --- (In reply to Michael_S from comment #22) > (In reply to Michael_S from comment #8) > > (In reply to Thomas Koenig from comment #6) > > > And there will have to be a decision about 32-bit targets. > > > > > > >

[Bug libgcc/108279] Improved speed for float128 routines

2023-01-18 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108279 --- Comment #23 from Michael_S --- (In reply to Jakub Jelinek from comment #19) > So, if stmxcsr/vstmxcsr is too slow, perhaps we should change x86 > sfp-machine.h > #define FP_INIT_ROUNDMODE \ > do {

[Bug libgcc/108279] Improved speed for float128 routines

2023-01-18 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108279 --- Comment #22 from Michael_S --- (In reply to Michael_S from comment #8) > (In reply to Thomas Koenig from comment #6) > > And there will have to be a decision about 32-bit targets. > > > > IMHO, 32-bit targets should be left in their current

[Bug libgcc/108279] Improved speed for float128 routines

2023-01-15 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108279 --- Comment #16 from Michael_S --- (In reply to Jakub Jelinek from comment #15) > libquadmath is not needed nor useful on aarch64-linux, because long double > type there is already IEEE 754 quad. That's good to know. Thank you. If you are here

[Bug libgcc/108279] Improved speed for float128 routines

2023-01-14 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108279 --- Comment #12 from Michael_S --- (In reply to Thomas Koenig from comment #10) > What we would need for incorporation into gcc is to have several > functions, which would then called depending on which floating point > options are in force at t

[Bug libgcc/108279] Improved speed for float128 routines

2023-01-14 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108279 --- Comment #11 from Michael_S --- (In reply to Thomas Koenig from comment #9) > Created attachment 54273 [details] > matmul_r16.i > > Here is matmul_r16.i from a relatively recent trunk. Thank you. Unfortunately, I was not able to link it wit

[Bug libgcc/108279] Improved speed for float128 routines

2023-01-12 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108279 --- Comment #8 from Michael_S --- (In reply to Thomas Koenig from comment #6) > (In reply to Michael_S from comment #5) > > Hi Thomas > > Are you in or out? > > Depends a bit on what exactly you want to do, and if there is > a chance that what

[Bug libgcc/108279] Improved speed for float128 routines

2023-01-12 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108279 --- Comment #7 from Michael_S --- Either here or my yahoo e-mail

[Bug libgcc/108279] Improved speed for float128 routines

2023-01-11 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108279 --- Comment #5 from Michael_S --- Hi Thomas, are you in or out? If you are still in, I can use your help on several issues. 1. Torture. See if the Invalid Operand exception is raised properly now. Also, whether there are still remaining problems with NaN.

[Bug libgcc/108279] Improved speed for float128 routines

2023-01-04 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108279 --- Comment #4 from Michael_S --- (In reply to Jakub Jelinek from comment #2) > From what I can see, they are certainly not portable. > E.g. the relying on __int128 rules out various arches (basically all 32-bit > arches, > ia32, powerpc 32-bit

[Bug tree-optimization/97832] AoSoA complex caxpy-like loops: AVX2+FMA -Ofast 7 times slower than -O3

2022-11-26 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97832 --- Comment #22 from Michael_S --- (In reply to Alexander Monakov from comment #21) > (In reply to Michael_S from comment #19) > > > Also note that 'vfnmadd231pd 32(%rdx,%rax), %ymm3, %ymm0' would be > > > 'unlaminated' (turned to 2 uops before r

[Bug tree-optimization/97832] AoSoA complex caxpy-like loops: AVX2+FMA -Ofast 7 times slower than -O3

2022-11-26 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97832 --- Comment #20 from Michael_S --- (In reply to Richard Biener from comment #17) > (In reply to Michael_S from comment #16) > > On unrelated note, why loop overhead uses so many instructions? > > Assuming that I am as misguided as gcc about load-

[Bug tree-optimization/97832] AoSoA complex caxpy-like loops: AVX2+FMA -Ofast 7 times slower than -O3

2022-11-26 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97832 --- Comment #19 from Michael_S --- (In reply to Alexander Monakov from comment #18) > The apparent 'bias' is introduced by instruction scheduling: haifa-sched > lifts a +64 increment over memory accesses, transforming +0 and +32 > displacements t

[Bug tree-optimization/97832] AoSoA complex caxpy-like loops: AVX2+FMA -Ofast 7 times slower than -O3

2022-11-25 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97832 --- Comment #16 from Michael_S --- On an unrelated note, why does the loop overhead use so many instructions? Assuming that I am as misguided as gcc about load-op combining, I would write it as: sub %rax, %rdx .L3: vmovupd (%rdx,%rax), %ymm1 vmovupd

[Bug tree-optimization/97832] AoSoA complex caxpy-like loops: AVX2+FMA -Ofast 7 times slower than -O3

2022-11-24 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97832 --- Comment #14 from Michael_S --- I tested a smaller test bench from Comment 3 with gcc trunk on godbolt. The issue appears to be only partially fixed. The -Ofast result is no longer the horror that it was before, but it is still not as good as -O3 or -O2

[Bug target/105617] [12/13 Regression] Slp is maybe too aggressive in some/many cases

2022-07-29 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105617 --- Comment #15 from Michael_S --- (In reply to Richard Biener from comment #14) > (In reply to Michael_S from comment #12) > > On related note... > > One of the historical good features of gcc relatively to other popular > > compilers was absen

[Bug target/106220] x86-64 optimizer forgets about shrd peephole optimization pattern when faced with more than one in close proximity

2022-07-06 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106220 --- Comment #3 from Michael_S --- -march=haswell is not very important. I added it only because, in the absence of the BMI extension, the issue is somewhat obscured by the need to keep the shift count in the CL register. -O2 is also not important. -O3 is the same. An

[Bug c/106220] New: x86-64 optimizer forgets about shrd peephole optimization pattern when faced with more than one in close proximity

2022-07-06 Thread already5chosen at yahoo dot com via Gcc-bugs
: 12.1.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c Assignee: unassigned at gcc dot gnu.org Reporter: already5chosen at yahoo dot com Target Milestone: --- I am reporting about right shift issue, but left shift has

[Bug libquadmath/105101] incorrect rounding for sqrtq

2022-06-13 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105101 --- Comment #23 from Michael_S --- (In reply to jos...@codesourcery.com from comment #22) > On Mon, 13 Jun 2022, already5chosen at yahoo dot com via Gcc-bugs wrote: > > > > The function should be sqrtf128 (present in glibc 2.

[Bug libquadmath/105101] incorrect rounding for sqrtq

2022-06-13 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105101 --- Comment #21 from Michael_S --- (In reply to jos...@codesourcery.com from comment #20) > On Sat, 11 Jun 2022, already5chosen at yahoo dot com via Gcc-bugs wrote: > > > On MSYS2 _Float128 and __float128 appears to be mostly th

[Bug libquadmath/105101] incorrect rounding for sqrtq

2022-06-11 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105101 --- Comment #19 from Michael_S --- (In reply to jos...@codesourcery.com from comment #18) > libquadmath is essentially legacy code. People working directly in C > should be using the C23 _Float128 interfaces and *f128 functions, as in > curre

[Bug libquadmath/105101] incorrect rounding for sqrtq

2022-06-10 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105101 --- Comment #17 from Michael_S --- (In reply to Jakub Jelinek from comment #15) > From what I can see, it is mostly integral implementation and we already > have one such in GCC, so the question is if we just shouldn't use it (most > of the sou

[Bug libquadmath/105101] incorrect rounding for sqrtq

2022-06-10 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105101 --- Comment #16 from Michael_S --- (In reply to Thomas Koenig from comment #14) > @Michael: Now that gcc 12 is out of the door, I would suggest we try to get > your code into the gcc tree for gcc 13. > > It should follow the gcc style guideline

[Bug target/105617] [12/13 Regression] Slp is maybe too aggressive in some/many cases

2022-05-17 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105617 --- Comment #12 from Michael_S --- On a related note... One of the historically good features of gcc relative to other popular compilers was the absence of auto-vectorization at -O2. When did you decide to change it and why?

[Bug target/105617] [12/13 Regression] Slp is maybe too aggressive in some/many cases

2022-05-17 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105617 --- Comment #11 from Michael_S --- (In reply to Richard Biener from comment #10) > (In reply to Hongtao.liu from comment #9) > > (In reply to Hongtao.liu from comment #8) > > > (In reply to Hongtao.liu from comment #7) > > > > Hmm, we have speci

[Bug target/105617] [12/13 Regression] Slp is maybe too aggressive in some/many cases

2022-05-16 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105617 --- Comment #6 from Michael_S --- (In reply to Michael_S from comment #5) > > Even scalar-to-scalar or vector-to-vector moves that are hoisted at renamer > does not have a zero cost, because quite often renamer itself constitutes > the narrowes

[Bug target/105617] [12/13 Regression] Slp is maybe too aggressive in some/many cases

2022-05-16 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105617 --- Comment #5 from Michael_S --- (In reply to Richard Biener from comment #3) > We are vectorizing the store it dst[] now at -O2 since that appears > profitable: > > t.c:10:10: note: Cost model analysis: > r0.0_12 1 times scalar_store costs 12

[Bug target/105617] [12/13 Regression] Slp is maybe too aggressive in some/many cases

2022-05-16 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105617 --- Comment #4 from Michael_S --- (In reply to Andrew Pinski from comment #1) > This is just the vectorizer still being too aggressive right before a return. > It is a cost model issue and it might not really be an issue in the final > code just

[Bug target/105617] New: Regression in code generation for _addcarry_u64()

2022-05-16 Thread already5chosen at yahoo dot com via Gcc-bugs
Component: target Assignee: unassigned at gcc dot gnu.org Reporter: already5chosen at yahoo dot com Target Milestone: --- It took many years until gcc caught up with MSVC and LLVM/clang in generation of code for chains of Intel's _addcarry_u64() intrinsic calls. But your fi
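
For readers unfamiliar with the pattern under discussion, the following is a minimal sketch of what a chain of _addcarry_u64() calls looks like. The helper name add256 and the four-limb layout are assumptions chosen for illustration and are not taken from the bug report. On x86-64 the sketch uses the intrinsic from <immintrin.h>; elsewhere it falls back to a plain C carry computation that models the same result.

```c
#include <stdint.h>
#if defined(__x86_64__) || defined(_M_X64)
#include <immintrin.h>
#endif

/* Hypothetical helper (not from the bug report): add two 256-bit numbers
   stored as four 64-bit limbs, least significant limb first.
   Returns the carry out of the most significant limb.
   dst is allowed to alias a or b: each limb is read before it is written. */
static unsigned char add256(uint64_t dst[4], const uint64_t a[4],
                            const uint64_t b[4])
{
    unsigned char carry = 0;
    for (int i = 0; i < 4; ++i) {
#if defined(__x86_64__) || defined(_M_X64)
        /* The carry flows from one intrinsic call into the next. */
        unsigned long long sum;
        carry = _addcarry_u64(carry, a[i], b[i], &sum);
        dst[i] = sum;
#else
        /* Portable model of the same operation. */
        uint64_t s = a[i] + carry;
        unsigned char c1 = s < a[i];      /* overflow from adding carry-in */
        uint64_t t = s + b[i];
        unsigned char c2 = t < s;         /* overflow from adding b[i] */
        dst[i] = t;
        carry = (unsigned char)(c1 | c2); /* at most one of them is set */
#endif
    }
    return carry;
}
```

Ideally a compiler turns the x86-64 path into one add followed by three adc instructions; how close a given gcc release gets to that is the subject of the report.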

[Bug target/105468] Suboptimal code generation for access of function parameters and return values of type __float128 on x86-64 Windows target.

2022-05-03 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105468 --- Comment #4 from Michael_S --- Created attachment 52925 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52925&action=edit build script

[Bug target/105468] Suboptimal code generation for access of function parameters and return values of type __float128 on x86-64 Windows target.

2022-05-03 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105468 --- Comment #3 from Michael_S --- Created attachment 52924 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52924&action=edit Another test bench that shows lower impact on Zen3, but higher impact on some Intel CPUs

[Bug target/105468] Suboptimal code generation for access of function parameters and return values of type __float128 on x86-64 Windows target.

2022-05-03 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105468 --- Comment #2 from Michael_S --- Created attachment 52923 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52923&action=edit test bench that shows lower impact on Zen3, but higher impact on some Intel CPUs

[Bug target/105468] Suboptimal code generation for access of function parameters and return values of type __float128 on x86-64 Windows target.

2022-05-03 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105468 --- Comment #1 from Michael_S --- Created attachment 52922 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52922&action=edit test bench that demonstrates maximal impact on Zen3

[Bug target/105468] New: Suboptimal code generation for access of function parameters and return values of type __float128 on x86-64 Windows target.

2022-05-03 Thread already5chosen at yahoo dot com via Gcc-bugs
Version: unknown Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: already5chosen at yahoo dot com Target Milestone: --- Created attachment 52921 --> ht

[Bug libquadmath/105101] incorrect rounding for sqrtq

2022-04-21 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105101 --- Comment #13 from Michael_S --- It turned out that on all micro-architectures that I care about (and the majority of those that I don't care about) double precision floating point division is quite fast. It's so fast that it easily beats my clever reci

[Bug libquadmath/105101] incorrect rounding for sqrtq

2022-04-20 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105101 --- Comment #12 from Michael_S --- (In reply to Michael_S from comment #11) > (In reply to Michael_S from comment #10) > > BTW, the same ideas as in the code above could improve speed of division > > operation (on modern 64-bit HW) by factor of

[Bug libquadmath/105101] incorrect rounding for sqrtq

2022-04-18 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105101 --- Comment #11 from Michael_S --- (In reply to Michael_S from comment #10) > BTW, the same ideas as in the code above could improve speed of division > operation (on modern 64-bit HW) by factor of 3 (on Intel) or 2 (on AMD). Did it. On Intel i

[Bug libquadmath/105101] incorrect rounding for sqrtq

2022-04-16 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105101 --- Comment #10 from Michael_S --- BTW, the same ideas as in the code above could improve speed of division operation (on modern 64-bit HW) by factor of 3 (on Intel) or 2 (on AMD).

[Bug libquadmath/105101] incorrect rounding for sqrtq

2022-04-15 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105101 --- Comment #9 from Michael_S --- (In reply to Michael_S from comment #4) > If you want quick fix for immediate shipment then you can take that: > > #include > #include > > __float128 quick_and_dirty_sqrtq(__float128 x) > { > if (isnanq(x)

[Bug libquadmath/105101] incorrect rounding for sqrtq

2022-04-02 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105101 Michael_S changed: What|Removed |Added CC||already5chosen at yahoo dot com

[Bug tree-optimization/97832] AoSoA complex caxpy-like loops: AVX2+FMA -Ofast 7 times slower than -O3

2020-11-19 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97832 --- Comment #10 from Michael_S --- I lost track of what you're talking about a long time ago. But that's o.k.

[Bug target/97832] AoSoA complex caxpy-like loops: AVX2+FMA -Ofast 7 times slower than -O3

2020-11-16 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97832 --- Comment #3 from Michael_S --- (In reply to Richard Biener from comment #2) > It's again reassociation making a mess out of the natural SLP opportunity > (and thus SLP discovery fails miserably). > > One idea worth playing with would be to ch

[Bug target/79173] add-with-carry and subtract-with-borrow support (x86_64 and others)

2020-11-15 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79173 --- Comment #9 from Michael_S --- Despite what I wrote above, I did take a look at the trunk on godbolt with the same old code from a year ago. Because it was so easy. And indeed the trunk looks a lot better. But until it's released I wouldn't know if i

[Bug target/79173] add-with-carry and subtract-with-borrow support (x86_64 and others)

2020-11-15 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79173 --- Comment #8 from Michael_S --- (In reply to Jakub Jelinek from comment #7) > (In reply to Michael_S from comment #5) > > I agree with regard to "other targets", first of all, aarch64, but x86_64 > > variant of gcc already provides requested fu

[Bug target/79173] add-with-carry and subtract-with-borrow support (x86_64 and others)

2020-11-15 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79173 --- Comment #6 from Michael_S --- (In reply to Marc Glisse from comment #1) > We could start with the simpler: > > void f(unsigned*__restrict__ r,unsigned*__restrict__ s,unsigned a,unsigned > b,unsigned c, unsigned d){ > *r=a+b; > *s=c+d+(*r

[Bug target/79173] add-with-carry and subtract-with-borrow support (x86_64 and others)

2020-11-15 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79173 Michael_S changed: What|Removed |Added CC||already5chosen at yahoo dot com --- Comment

[Bug target/97832] New: AoSoA complex caxpy-like loops: AVX2+FMA -Ofast 7 times slower than -O3

2020-11-14 Thread already5chosen at yahoo dot com via Gcc-bugs
: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: already5chosen at yahoo dot com Target Milestone: --- I am reporting under 'target' because AVX2+FMA is the only 256-bit SIMD platform I have to play with. If it

[Bug tree-optimization/97428] -O3 is great for basic AoSoA packing of complex arrays, but horrible one step above the basic

2020-10-16 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97428 --- Comment #9 from Michael_S --- Hopefully, you did regression tests for all main AoS<->SoA cases. I.e. typedef struct { double re, im; } dcmlx_t; void soa2aos(double* restrict dstRe, double* restrict dstIm, const dcmlx_t src[], int nq) { for

[Bug tree-optimization/97428] -O3 is great for basic AoSoA packing of complex arrays, but horrible one step above the basic

2020-10-15 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97428 --- Comment #6 from Michael_S --- (In reply to Richard Biener from comment #4) > > while the lack of cross-lane shuffles in AVX2 requires a > > .L3: > vmovupd (%rsi,%rax), %xmm5 > vmovupd 32(%rsi,%rax), %xmm6 > vinsertf1

[Bug tree-optimization/97428] -O3 is great for basic AoSoA packing of complex arrays, but horrible one step above the basic

2020-10-15 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97428 --- Comment #5 from Michael_S --- (In reply to Richard Biener from comment #4) > I have a fix that, with -mavx512f generates just > > .L3: > vmovupd (%rcx,%rax), %zmm0 > vpermpd (%rsi,%rax), %zmm1, %zmm2 > vpermpd %zmm0,

[Bug target/97428] New: -O3 is great for basic AoSoA packing of complex arrays, but horrible one step above the basic

2020-10-14 Thread already5chosen at yahoo dot com via Gcc-bugs
Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: already5chosen at yahoo dot com Target Milestone: --- This is my next example of bad handling of AoSoA layout by the gcc optimizer/vectorizer. For discussion of

[Bug tree-optimization/97343] AVX2 vectorizer generates extremely strange and slow code for AoSoA complex dot product

2020-10-09 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97343 --- Comment #2 from Michael_S --- (In reply to Richard Biener from comment #1) > All below for Part 2. > > Without -ffast-math you are seeing GCC using in-order reductions now while > with -ffast-math the vectorizer gets a bit confused about rea

[Bug target/97343] New: AVX2 vectorizer generates extremely strange and slow code for AoSoA complex dot product

2020-10-08 Thread already5chosen at yahoo dot com via Gcc-bugs
Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: already5chosen at yahoo dot com Target Milestone: --- Let's continue our complex dot product series started here https://gcc.gnu.org/bugzilla/show_bug.c

[Bug target/97127] FMA3 code transformation leads to slowdown on Skylake

2020-09-25 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97127 --- Comment #15 from Michael_S --- (In reply to Hongtao.liu from comment #14) > > Still I don't understand why compiler does not compare the cost of full loop > > body after combining to the cost before combining and does not come to > > conclusi

[Bug target/97127] FMA3 code transformation leads to slowdown on Skylake

2020-09-24 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97127 --- Comment #13 from Michael_S --- (In reply to Hongtao.liu from comment #11) > (In reply to Michael_S from comment #10) > > (In reply to Hongtao.liu from comment #9) > > > (In reply to Michael_S from comment #8) > > > > What are values of gcc "l

[Bug target/97127] FMA3 code transformation leads to slowdown on Skylake

2020-09-24 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97127 --- Comment #10 from Michael_S --- (In reply to Hongtao.liu from comment #9) > (In reply to Michael_S from comment #8) > > What are values of gcc "loop" cost of the relevant instructions now? > > 1. AVX256 Load > > 2. FMA3 ymm,ymm,ymm > > 3. AVX2

[Bug target/97127] FMA3 code transformation leads to slowdown on Skylake

2020-09-23 Thread already5chosen at yahoo dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97127 --- Comment #8 from Michael_S --- What are the values of the gcc "loop" cost of the relevant instructions now? 1. AVX256 Load 2. FMA3 ymm,ymm,ymm 3. AVX256 Regmove 4. FMA3 mem,ymm,ymm

[Bug target/97127] FMA3 code transformation leads to slowdown on Skylake

2020-09-22 Thread already5chosen at yahoo dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97127 --- Comment #6 from Michael_S --- Why do you see it as the addition of a peephole pattern? I see it as a removal. Like, "do what's written in the source and don't try to be tricky". Probably, I am too removed from how compilers work :( Or, maybe, handl

[Bug target/97127] FMA3 code transformation leads to slowdown on Skylake

2020-09-21 Thread already5chosen at yahoo dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97127 --- Comment #3 from Michael_S --- (In reply to Alexander Monakov from comment #2) > Richard, though register moves are resolved by renaming, they still occupy a > uop in all stages except execution, and since renaming is one of the > narrowest po

[Bug target/97127] New: FMA3 code transformation leads to slowdown on Skylake

2020-09-20 Thread already5chosen at yahoo dot com
Component: target Assignee: unassigned at gcc dot gnu.org Reporter: already5chosen at yahoo dot com Target Milestone: --- The following clever gcc transformation leads to generation of slower code than non-transformed original: a = *mem; a = a + b * c; where both b and c are

[Bug target/96854] [10 Regression] avx vectorizer breaks complex arithmetic

2020-09-06 Thread already5chosen at yahoo dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96854 --- Comment #15 from Michael_S --- Thank you. That does not sound too different from what I assumed in the post above. 10.1.0 is a release, expected to be used by "normal" people. 10.1.1 was for the purpose of development of 10.2.0. Since release of 10.2.0

[Bug target/96854] [10 Regression] avx vectorizer breaks complex arithmetic

2020-09-06 Thread already5chosen at yahoo dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96854 --- Comment #13 from Michael_S --- I don't follow gcc versioning policy all that closely. What is the function of "micro" versions now? For internal use and experimentation only, but not for public release?

[Bug target/96854] [10 Regression] avx vectorizer breaks complex arithmetic

2020-09-06 Thread already5chosen at yahoo dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96854 --- Comment #11 from Michael_S --- Just to understand: will 10.1 and 10.2 be fixed?

[Bug target/96854] [10 Regression] avx vectorizer breaks complex arithmetic

2020-08-31 Thread already5chosen at yahoo dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96854 --- Comment #4 from Michael_S --- Note that it's not just AVX. '-mavx2 -mfma -Ofast' generates different code, but in the end gives the same wrong result. Unfortunately, I have no AVX512 hardware to test, but wouldn't be surprised if it'

[Bug target/96854] New: avx vectorizer breaks complex arithmetic

2020-08-30 Thread already5chosen at yahoo dot com
Assignee: unassigned at gcc dot gnu.org Reporter: already5chosen at yahoo dot com Target Milestone: --- '-Ofast -mavx -march=ivybridge' miscompiles this simple loop: double complex foo(double complex acc, const double complex *x, const double complex* y, int N) {

[Bug target/88284] New: nios2: pessimistic ldw-to-stwio scheduling

2018-11-30 Thread already5chosen at yahoo dot com
: target Assignee: unassigned at gcc dot gnu.org Reporter: already5chosen at yahoo dot com Target Milestone: --- Created attachment 45131 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45131&action=edit demonstration of bad scheduling Compiler generates bad schedul

[Bug tree-optimization/86965] nios2 optimizer forgets about known upper bits of register

2018-11-11 Thread already5chosen at yahoo dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86965 --- Comment #3 from Michael_S --- (In reply to sandra from comment #1) > I'm not sure what command-line options you were using, but with -O2 the bad2 > case now generates the expected code. > With 8.2.0 the problem exists both with -O2 and with

[Bug middle-end/80283] [6/7/8/9 Regression] bad SIMD register allocation

2018-08-27 Thread already5chosen at yahoo dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80283 --- Comment #25 from Michael_S --- Just a reminder 16 months later: x86-64 case - both 8.2 and trunk are as bad as they were. ARM-Neon case - 8.2 appears to be worse (by 5%) than either 6.x or 7.x. I didn't check trunk.

[Bug rtl-optimization/87047] [7/8/9 Regression] performance regression because of if-conversion

2018-08-24 Thread already5chosen at yahoo dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87047 --- Comment #11 from Michael_S --- Sorry for intervening, but IMHO a new __builtin is long overdue. __builtin (In reply to Jakub Jelinek from comment #9) > (In reply to Alexander Monakov from comment #8) > > Well, original_costs is already initia

[Bug tree-optimization/87031] nios2 optimization for size - two cases of regression relatively to 5.3.0

2018-08-23 Thread already5chosen at yahoo dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87031 --- Comment #7 from Michael_S --- Done. The new report is 87079.

[Bug target/87079] New: nios2 optimization for size - case of regression relatively to 5.3.0

2018-08-23 Thread already5chosen at yahoo dot com
Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: already5chosen at yahoo dot com Target Milestone: --- Created attachment 44586 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=44586&action=edit 5.3->8.3 regressio

[Bug tree-optimization/87031] nios2 optimization for size - two cases of regression relatively to 5.3.0

2018-08-22 Thread already5chosen at yahoo dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87031 --- Comment #5 from Michael_S --- It's fine that you moved the 2nd case to 'tree-optimization'. I suppose that's where it belongs. But I just saw the second case by chance in the process of reduction of the first case to bare minimum. For me it (

[Bug tree-optimization/87031] nios2 optimization for size - two cases of regression relatively to 5.3.0

2018-08-22 Thread already5chosen at yahoo dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87031 --- Comment #4 from Michael_S --- It's fine that you moved the 2nd case to 'tree-optimization'. I suppose that's where it belongs. But I just saw the second case by chance in the process of reduction of the first case to bare minimum. For me it (

[Bug middle-end/87047] New: gcc 7 & 8 - performance regression because of if-conversion

2018-08-21 Thread already5chosen at yahoo dot com
rmal Priority: P3 Component: middle-end Assignee: unassigned at gcc dot gnu.org Reporter: already5chosen at yahoo dot com Target Milestone: --- Created attachment 44570 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=44570&action=edit demonstrate performance

[Bug target/87031] nios2 optimization for size - two cases of regression relatively to 5.3.0

2018-08-20 Thread already5chosen at yahoo dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87031 --- Comment #2 from Michael_S --- After playing with the 2nd case on godbolt I found that it's not target specific. The regression occurred on all targets between gcc6 and gcc7.

[Bug target/87031] nios2 optimization for size - two cases of regression relatively to 5.3.0

2018-08-20 Thread already5chosen at yahoo dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87031 --- Comment #1 from Michael_S --- Created attachment 44564 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=44564&action=edit second case - loop unrolled

[Bug target/87031] New: nios2 optimization for size - two cases of regression relatively to 5.3.0

2018-08-20 Thread already5chosen at yahoo dot com
: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: already5chosen at yahoo dot com Target Milestone: --- Created attachment 44563 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=44563&action=edit first case -

[Bug target/86975] New: wrong peephole optimization applied on nios2 and mips targets

2018-08-16 Thread already5chosen at yahoo dot com
Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: already5chosen at yahoo dot com Target Milestone: --- On MIPS and Nios2 architectures, the logical immediate instructions (andi, ori) zero-extend the immediate field. It means that on these targets

[Bug target/86965] New: nios2 optimizer forgets about known upper bits of register

2018-08-15 Thread already5chosen at yahoo dot com
Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: already5chosen at yahoo dot com Target Milestone: --- Created attachment 44545 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=44545&action=edit source code that demonstrates

[Bug target/83528] Nios2: redundant pointers to the record fields

2017-12-21 Thread already5chosen at yahoo dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83528 Michael_S changed: What|Removed |Added Status|RESOLVED|UNCONFIRMED Resolution|WONTFIX

[Bug target/83528] Nios2: redundant pointers to the record fields

2017-12-21 Thread already5chosen at yahoo dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83528 --- Comment #5 from Michael_S --- Created attachment 42944 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=42944&action=edit good asm output (gcc 4.8.3)

[Bug target/83528] Nios2: redundant pointers to the record fields

2017-12-21 Thread already5chosen at yahoo dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83528 --- Comment #4 from Michael_S --- Created attachment 42943 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=42943&action=edit bad asm output (gcc 5.3.0)

[Bug target/83528] Nios2: redundant pointers to the record fields

2017-12-21 Thread already5chosen at yahoo dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83528 --- Comment #3 from Michael_S --- Well, the guideline here https://gcc.gnu.org/bugs/ specifically tells me that it's one of the things that you don't want ;) But yes, I can.

[Bug c/83528] Nios2: redundant pointers to the record fields

2017-12-21 Thread already5chosen at yahoo dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83528 --- Comment #1 from Michael_S --- I did a little more research and found out that it is a relatively recent regression introduced in gcc version 4.9.2 (Altera 15.1 Build 185). gcc version 4.8.3 20140320 (prerelease) (Altera 14.1 Build 186) still g

[Bug c/83528] New: Nios2: redundant pointers to the record fields

2017-12-21 Thread already5chosen at yahoo dot com
: c Assignee: unassigned at gcc dot gnu.org Reporter: already5chosen at yahoo dot com Target Milestone: --- Created attachment 42942 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=42942&action=edit example of bad code generation for Nios2 target In the loop over a

[Bug middle-end/80283] [5/6/7/8 Regression] bad SIMD register allocation

2017-08-08 Thread already5chosen at yahoo dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80283 --- Comment #18 from Michael_S --- O.k. Not a back end. The part of the compiler that is responsible for binding local variables to registers or to stack locations. I am assuming that such a part exists in gcc and acts after tree-ter phase, but before

[Bug middle-end/80283] [5/6/7/8 Regression] bad SIMD register allocation

2017-05-01 Thread already5chosen at yahoo dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80283 --- Comment #14 from Michael_S --- Created attachment 41293 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41293&action=edit another case of bad vector register allocation Here is another case of bad allocation of SIMD register that hopefu

[Bug middle-end/80283] [5/6/7 Regression] bad SIMD register allocation

2017-04-04 Thread already5chosen at yahoo dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80283 --- Comment #11 from Michael_S --- Created attachment 41128 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41128&action=edit ARMv7 case ARMv7 - very similar to x64

[Bug middle-end/80283] [5/6/7 Regression] bad SIMD register allocation

2017-04-04 Thread already5chosen at yahoo dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80283 --- Comment #10 from Michael_S --- Created attachment 41127 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41127&action=edit bad reg allocation despite no-tree-ter No problems

[Bug middle-end/80283] [5/6/7 Regression] bad SIMD register allocation

2017-04-04 Thread already5chosen at yahoo dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80283 Michael_S changed: What|Removed |Added CC||already5chosen at yahoo dot com --- Comment