[Bug target/105617] [12/13/14 Regression] Slp is maybe too aggressive in some/many cases

2023-06-16 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105617

--- Comment #21 from Michael_S  ---
(In reply to Mason from comment #20)
> Doh! You're right.
> I come from a background where overlapping/aliasing inputs are heresy,
> thus got blindsided :(
> 
> This would be the optimal code, right?
> 
> add4i:
> # rdi = dst, rsi = a, rdx = b
>   movq  0(%rdx), %r8
>   movq  8(%rdx), %rax
>   movq 16(%rdx), %rcx
>   movq 24(%rdx), %rdx
>   addq  0(%rsi), %r8
>   adcq  8(%rsi), %rax
>   adcq 16(%rsi), %rcx
>   adcq 24(%rsi), %rdx
>   movq %r8,   0(%rdi)
>   movq %rax,  8(%rdi)
>   movq %rcx, 16(%rdi)
>   movq %rdx, 24(%rdi)
>   ret
> 

If one does not care deeply about latency (which is likely for a function that stores its result into memory), then that looks good enough.
But if one does care deeply, then I'd expect interleaved loads, as in the first 8 lines of the code generated by trunk, to produce slightly lower latency on the majority of modern CPUs.

[Bug target/105617] [12/13/14 Regression] Slp is maybe too aggressive in some/many cases

2023-06-07 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105617

--- Comment #19 from Michael_S  ---
(In reply to Mason from comment #18)
> Hello Michael_S,
> 
> As far as I can see, massaging the source helps GCC generate optimal code
> (in terms of instruction count, not convinced about scheduling).
> 
> #include <immintrin.h>
> typedef unsigned long long u64;
> void add4i(u64 dst[4], const u64 A[4], const u64 B[4])
> {
>   unsigned char c = 0;
>   c = _addcarry_u64(c, A[0], B[0], dst+0);
>   c = _addcarry_u64(c, A[1], B[1], dst+1);
>   c = _addcarry_u64(c, A[2], B[2], dst+2);
>   c = _addcarry_u64(c, A[3], B[3], dst+3);
> }
> 
> 
> On godbolt, gcc-{11.4, 12.3, 13.1, trunk} -O3 -march=znver1 all generate
> the expected:
> 
> add4i:
> movq  (%rdx), %rax
> addq  (%rsi), %rax
> movq  %rax, (%rdi)
> movq  8(%rsi), %rax
> adcq  8(%rdx), %rax
> movq  %rax, 8(%rdi)
> movq  16(%rsi), %rax
> adcq  16(%rdx), %rax
> movq  %rax, 16(%rdi)
> movq  24(%rdx), %rax
> adcq  24(%rsi), %rax
> movq  %rax, 24(%rdi)
> ret
> 
> I'll run a few benchmarks to test optimal scheduling.

That's not merely "massaging the source". That's changing the semantics.
Think about what happens when dst points into the middle of A or of B.
The change of semantics is what effectively prevented the vectorizer from doing harm.
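
Here is a small self-contained sketch of the aliasing point (my own illustration, not the testcase from the PR; it assumes the original add4i reads all of A and B into temporaries before storing, which is what the r0..r3 temporaries in the PR testcase appear to do). Built the same way as the snippet above, it prints 1, i.e. the two variants disagree once dst overlaps A:

  #include <immintrin.h>
  #include <stdio.h>
  #include <string.h>

  typedef unsigned long long u64;

  /* the rewrite from comment #18: stores dst[0] before reading A[1]/B[1] */
  static void add4i_interleaved(u64 dst[4], const u64 A[4], const u64 B[4])
  {
    unsigned char c = 0;
    c = _addcarry_u64(c, A[0], B[0], dst + 0);
    c = _addcarry_u64(c, A[1], B[1], dst + 1);
    c = _addcarry_u64(c, A[2], B[2], dst + 2);
    c = _addcarry_u64(c, A[3], B[3], dst + 3);
  }

  /* loads everything into temporaries first, so an overlapping dst is harmless */
  static void add4i_load_first(u64 dst[4], const u64 A[4], const u64 B[4])
  {
    u64 r[4];
    unsigned char c = 0;
    for (int i = 0; i < 4; i++)
      c = _addcarry_u64(c, A[i], B[i], &r[i]);
    memcpy(dst, r, sizeof r);
  }

  int main(void)
  {
    u64 a[5] = {1, 2, 3, 4, 5}, b[4] = {~0ull, ~0ull, ~0ull, ~0ull};
    u64 a2[5];
    memcpy(a2, a, sizeof a);
    add4i_load_first (a  + 1, a,  b);   /* dst overlaps A */
    add4i_interleaved(a2 + 1, a2, b);
    printf("%d\n", memcmp(a, a2, sizeof a) != 0);
    return 0;
  }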

And yes, for the common non-aliasing case the scheduling is problematic, too.
It would probably not cause a slowdown on the latest and greatest cores, but it could be slow on less great cores, including your default Zen1.

[Bug libgcc/108279] Improved speed for float128 routines

2023-02-10 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108279

--- Comment #24 from Michael_S  ---
(In reply to Michael_S from comment #22)
> (In reply to Michael_S from comment #8)
> > (In reply to Thomas Koenig from comment #6)
> > > And there will have to be a decision about 32-bit targets.
> > >
> > 
> > IMHO, 32-bit targets should be left in their current state.
> > People that use them probably do not care deeply about performance.
> > Technically, I can implement 32-bit targets in the same sources, by means of
> > few ifdefs and macros, but resulting source code will look much uglier than
> > how it looks today. Still, not to the same level of horror that you have in
> > matmul_r16.c, but certainly uglier than how I like it to look.
> > And I am not sure at all that my implementation of 32-bit targets would be
> > significantly faster than current soft float.
> 
> I explored this path (implementing 32-bit and 64-bit targets from the same
> source with few ifdefs) a little more:
> Now I am even more sure that it is not a way to go. gcc compiler does not
> generate good 32-bit code for this style of sources. This especially applies
> to i386, other supported 32-bit targets (RV32, SPARC32) are affected less.
> 

I can't explain to myself why I am doing it, but I continued the exploration of 32-bit targets. Well, not quite "targets": I don't have SPARC32 or RV32 to play with, so I continued the exploration of i386.
As said above, using the same code for 32-bit and 64-bit does not produce acceptable results. But a pure 32-bit source did better than I expected.
So when I wrote on 2023-01-13 "And I am not sure at all that my implementation of 32-bit targets would be significantly faster than current soft float", I was wrong. My implementation for 32-bit targets (i.e. i386) is significantly faster than the current soft float: up to 3 times faster on Zen3 and approximately 2 times faster on various oldish Intel CPUs.
Today I put the 32-bit sources into my github repository.

I am still convinced that improving the performance of IEEE binary128 on 32-bit targets is a waste of time, but since the time is already wasted, maybe the results can be used.

And maybe it can be used to bring IEEE binary128 to the Arm Cortex-M, where it could be moderately useful in some situations.

[Bug libgcc/108279] Improved speed for float128 routines

2023-01-18 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108279

--- Comment #23 from Michael_S  ---
(In reply to Jakub Jelinek from comment #19)
> So, if stmxcsr/vstmxcsr is too slow, perhaps we should change x86
> sfp-machine.h
> #define FP_INIT_ROUNDMODE   \
>   do {  \
> __asm__ __volatile__ ("%vstmxcsr\t%0" : "=m" (_fcw));   \
>   } while (0)
> #endif
> to do that only if not round-to-nearest.
> E.g. the fast-float library uses since last November:
>   static volatile float fmin = __FLT_MIN__;
>   float fmini = fmin; // we copy it so that it gets loaded at most once.
>   return (fmini + 1.0f == 1.0f - fmini);
> as a quick test whether round-to-nearest is in effect or some other rounding
> mode.
> Most likely if this is done with -frounding-math it wouldn't even need the
> volatile stuff.  Of course, if it isn't round-to-nearest, we would need to
> do the expensive {,v}stmxcsr.

I agree with Wilco. This trick is problematic due to its effect on the inexact flag.
Also, I don't quite understand how you got to setting the rounding mode.
I don't need to set the rounding mode, I just need to read the current rounding mode.
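
To illustrate (my own sketch, not code from fast-float; compile with little enough optimization, or with -frounding-math, that the float operations are not folded away): the probe itself raises FE_INEXACT as a side effect, so it cannot be used in a routine that is supposed to leave the inexact flag untouched.

  #include <fenv.h>
  #include <stdio.h>

  int main(void)
  {
    volatile float fmin = __FLT_MIN__;
    feclearexcept(FE_ALL_EXCEPT);
    float fmini = fmin;                           /* loaded once, as in the quoted code */
    int nearest = (fmini + 1.0f == 1.0f - fmini); /* 1.0f + FLT_MIN is inexact */
    printf("round-to-nearest: %d, FE_INEXACT set: %d\n",
           nearest, fetestexcept(FE_INEXACT) != 0);  /* second value is 1 */
    return 0;
  }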

Doing it in a portable way, i.e. via fegetround(), is slow, mostly due to various overheads.
Doing it in a non-portable way on x86-64 (via _MM_GET_ROUNDING_MODE()) is not slow on Intel, but still pretty slow on AMD Zen3, although even on Zen3 it is much faster than fegetround().
Results of measurements are here:
https://github.com/already5chosen/extfloat/blob/master/binary128/reports/rm-impact.txt
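
For reference, the two ways of reading the current rounding mode that the report compares - a minimal sketch, assuming <fenv.h> and SSE:

  #include <fenv.h>
  #include <xmmintrin.h>

  /* portable: returns FE_TONEAREST, FE_DOWNWARD, FE_UPWARD or FE_TOWARDZERO */
  int read_rm_portable(void)
  {
    return fegetround();
  }

  /* x86-specific: reads MXCSR and masks out the RC field; returns
     _MM_ROUND_NEAREST, _MM_ROUND_DOWN, _MM_ROUND_UP or _MM_ROUND_TOWARD_ZERO */
  unsigned read_rm_x86(void)
  {
    return _MM_GET_ROUNDING_MODE();
  }
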
Anyway, I'd very much prefer a portable solution over a multitude of ifdefs.
It is a pity that gcc doesn't implement FLT_ROUNDS the way other compilers do.

But then again, it is a pity that gcc doesn't implement a few other things implemented by other compilers that could make the lives of developers of portable multiprecision routines in general, and of soft float in particular, so much easier.

[Bug libgcc/108279] Improved speed for float128 routines

2023-01-18 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108279

--- Comment #22 from Michael_S  ---
(In reply to Michael_S from comment #8)
> (In reply to Thomas Koenig from comment #6)
> > And there will have to be a decision about 32-bit targets.
> >
> 
> IMHO, 32-bit targets should be left in their current state.
> People that use them probably do not care deeply about performance.
> Technically, I can implement 32-bit targets in the same sources, by means of
> few ifdefs and macros, but resulting source code will look much uglier than
> how it looks today. Still, not to the same level of horror that you have in
> matmul_r16.c, but certainly uglier than how I like it to look.
> And I am not sure at all that my implementation of 32-bit targets would be
> significantly faster than current soft float.

I explored this path (implementing 32-bit and 64-bit targets from the same source with a few ifdefs) a little more.
Now I am even more sure that it is not the way to go. gcc does not generate good 32-bit code for this style of source. This especially applies to i386; the other supported 32-bit targets (RV32, SPARC32) are affected less.

In the process I encountered a funny, illogical pessimization by the i386 code generator:
https://godbolt.org/z/En6Tredox
The routines foo32() and foo64() are semantically identical, but foo32() is written with 32-bit targets in mind, while foo64() is in the style of code that would likely be written if one wanted to support 32-bit and 64-bit targets from the same source with #ifdef.

The code generated by gcc for foo32() is reasonable: as in the source, we can see 8 multiplications.
The code generated by gcc for foo64() is anything but reasonable. Somehow the compiler invented 5 more multiplications, for a total of 13.

Maybe it deserves a separate bug report.

[Bug libgcc/108279] Improved speed for float128 routines

2023-01-15 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108279

--- Comment #16 from Michael_S  ---
(In reply to Jakub Jelinek from comment #15)
> libquadmath is not needed nor useful on aarch64-linux, because long double
> type there is already IEEE 754 quad.

That's good to know. Thank you.

Since you are here, I have one more somewhat related question: is IEEE binary128 arithmetic really not supported under ARMv7, or, like in the case above, is it supported, but not under one of the common names (__float128/_Float128)?

And if it is not supported, then why?
If I had to pick one 32-bit architecture to support, it would certainly be ARMv7, the most used 32-bit architecture in the world, which probably outsells the next one by an order of magnitude.

[Bug libgcc/108279] Improved speed for float128 routines

2023-01-14 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108279

--- Comment #12 from Michael_S  ---
(In reply to Thomas Koenig from comment #10)
> What we would need for incorporation into gcc is to have several
> functions, which would then called depending on which floating point
> options are in force at the time of invocation.
> 
> So, let's go through the gcc options, to see what would fit where. Walking
> down the options tree, depth first.
> 
> From the gcc docs:
> 
> '-ffast-math'
>  Sets the options '-fno-math-errno', '-funsafe-math-optimizations',
>  '-ffinite-math-only', '-fno-rounding-math', '-fno-signaling-nans',
>  '-fcx-limited-range' and '-fexcess-precision=fast'.
> 
> -fno-math-errno is irrelevant in this context, no need to look at that.
> 
> '-funsafe-math-optimizations'
> 
>  Allow optimizations for floating-point arithmetic that (a) assume
>  that arguments and results are valid and (b) may violate IEEE or
>  ANSI standards.  When used at link time, it may include libraries
>  or startup files that change the default FPU control word or other
>  similar optimizations.
> 
>  This option is not turned on by any '-O' option since it can result
>  in incorrect output for programs that depend on an exact
>  implementation of IEEE or ISO rules/specifications for math
>  functions.  It may, however, yield faster code for programs that do
>  not require the guarantees of these specifications.  Enables
>  '-fno-signed-zeros', '-fno-trapping-math', '-fassociative-math' and
>  '-freciprocal-math'.
> 
> '-fno-signed-zeros'
>  Allow optimizations for floating-point arithmetic that ignore the
>  signedness of zero.  IEEE arithmetic specifies the behavior of
>  distinct +0.0 and -0.0 values, which then prohibits simplification
>  of expressions such as x+0.0 or 0.0*x (even with
>  '-ffinite-math-only').  This option implies that the sign of a zero
>  result isn't significant.
> 
>  The default is '-fsigned-zeros'.
> 
> I don't think this options is relevant.
> 
> '-fno-trapping-math'
>  Compile code assuming that floating-point operations cannot
>  generate user-visible traps.  These traps include division by zero,
>  overflow, underflow, inexact result and invalid operation.  This
>  option requires that '-fno-signaling-nans' be in effect.  Setting
>  this option may allow faster code if one relies on "non-stop" IEEE
>  arithmetic, for example.
> 
>  This option should never be turned on by any '-O' option since it
>  can result in incorrect output for programs that depend on an exact
>  implementation of IEEE or ISO rules/specifications for math
>  functions.
> 
>  The default is '-ftrapping-math'.
> 
> Relevant.
> 
> '-ffinite-math-only'
>  Allow optimizations for floating-point arithmetic that assume that
>  arguments and results are not NaNs or +-Infs.
> 
>  This option is not turned on by any '-O' option since it can result
>  in incorrect output for programs that depend on an exact
>  implementation of IEEE or ISO rules/specifications for math
>  functions.  It may, however, yield faster code for programs that do
>  not require the guarantees of these specifications.
> 
> This does not have further suboptions. Relevant.
> 
> '-fassociative-math'
> 
>  Allow re-association of operands in series of floating-point
>  operations.  This violates the ISO C and C++ language standard by
>  possibly changing computation result.  NOTE: re-ordering may change
>  the sign of zero as well as ignore NaNs and inhibit or create
>  underflow or overflow (and thus cannot be used on code that relies
>  on rounding behavior like '(x + 2**52) - 2**52'.  May also reorder
>  floating-point comparisons and thus may not be used when ordered
>  comparisons are required.  This option requires that both
>  '-fno-signed-zeros' and '-fno-trapping-math' be in effect.
>  Moreover, it doesn't make much sense with '-frounding-math'.  For
>  Fortran the option is automatically enabled when both
>  '-fno-signed-zeros' and '-fno-trapping-math' are in effect.
> 
>  The default is '-fno-associative-math'.
> 
> Not relevant, I think - this influences compiler optimizations.
> 
> '-freciprocal-math'
> 
>  Allow the reciprocal of a value to be used instead of dividing by
>  the value if this enables optimizations.  For example 'x / y' can
>  be replaced with 'x * (1/y)', which is useful if '(1/y)' is subject
>  to common subexpression elimination.  Note that this loses
>  precision and increases the number of flops operating on the value.
> 
>  The default is '-fno-reciprocal-math'.
> 
> Again, not relevant.
> 
> 
> '-frounding-math'
>  Disable transformations and optimizations that assume default
>  floating-point rounding behavior.  This is round-to-zero for all
>  floating point to integer 

[Bug libgcc/108279] Improved speed for float128 routines

2023-01-14 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108279

--- Comment #11 from Michael_S  ---
(In reply to Thomas Koenig from comment #9)
> Created attachment 54273 [details]
> matmul_r16.i
> 
> Here is matmul_r16.i from a relatively recent trunk.

Thank you.
Unfortunately, I was not able to link it with a main program in Fortran.
So I still can only guess why, even after replacing __multf3 and __addtf3 with my implementations, it is still more than twice as slow (on Zen3) as it should be.
Looking at the source, and assuming that the inner loop starts at line 8944, this loop looks very strange, but apart from the bad programming style and a misunderstanding of what constitutes optimal scheduling, there is nothing criminal about it. Maybe, because of the wrong scheduling, it is 10-15% slower than the best possible, but it certainly should not be 2.4 times slower, which is what I am seeing.


That's what I got from the linker:
/usr/bin/ld: matmul_r16.o: warning: relocation against
`_gfortrani_matmul_r16_avx128_fma3' in read-only section `.text'
/usr/bin/ld: matmul_r16.o: in function `matmul_r16_avx':
matmul_r16.c:(.text+0x48): undefined reference to `_gfortrani_compile_options'
/usr/bin/ld: matmul_r16.c:(.text+0x342): undefined reference to
`_gfortrani_compile_options'
/usr/bin/ld: matmul_r16.c:(.text+0x10e4): undefined reference to
`_gfortrani_size0'
/usr/bin/ld: matmul_r16.c:(.text+0x10f1): undefined reference to
`_gfortrani_xmallocarray'
/usr/bin/ld: matmul_r16.c:(.text+0x12e5): undefined reference to
`_gfortrani_runtime_error'
/usr/bin/ld: matmul_r16.c:(.text+0x13a9): undefined reference to
`_gfortrani_runtime_error'
/usr/bin/ld: matmul_r16.c:(.text+0x13e7): undefined reference to
`_gfortrani_runtime_error'
/usr/bin/ld: matmul_r16.o: in function `matmul_r16_avx2':
matmul_r16.c:(.text+0x24c8): undefined reference to
`_gfortrani_compile_options'
/usr/bin/ld: matmul_r16.c:(.text+0x27c2): undefined reference to
`_gfortrani_compile_options'
/usr/bin/ld: matmul_r16.c:(.text+0x3564): undefined reference to
`_gfortrani_size0'
/usr/bin/ld: matmul_r16.c:(.text+0x3571): undefined reference to
`_gfortrani_xmallocarray'
/usr/bin/ld: matmul_r16.c:(.text+0x3765): undefined reference to
`_gfortrani_runtime_error'
/usr/bin/ld: matmul_r16.c:(.text+0x3829): undefined reference to
`_gfortrani_runtime_error'
/usr/bin/ld: matmul_r16.c:(.text+0x3867): undefined reference to
`_gfortrani_runtime_error'
/usr/bin/ld: matmul_r16.o: in function `matmul_r16_avx512f':
matmul_r16.c:(.text+0x4948): undefined reference to
`_gfortrani_compile_options'
/usr/bin/ld: matmul_r16.c:(.text+0x4c47): undefined reference to
`_gfortrani_compile_options'
/usr/bin/ld: matmul_r16.c:(.text+0x5a32): undefined reference to
`_gfortrani_size0'
/usr/bin/ld: matmul_r16.c:(.text+0x5a3f): undefined reference to
`_gfortrani_xmallocarray'
/usr/bin/ld: matmul_r16.c:(.text+0x5c35): undefined reference to
`_gfortrani_runtime_error'
/usr/bin/ld: matmul_r16.c:(.text+0x5cfb): undefined reference to
`_gfortrani_runtime_error'
/usr/bin/ld: matmul_r16.c:(.text+0x5d35): undefined reference to
`_gfortrani_runtime_error'
/usr/bin/ld: matmul_r16.o: in function `matmul_r16_vanilla':
matmul_r16.c:(.text+0x6de8): undefined reference to
`_gfortrani_compile_options'
/usr/bin/ld: matmul_r16.c:(.text+0x70e2): undefined reference to
`_gfortrani_compile_options'
/usr/bin/ld: matmul_r16.c:(.text+0x7e84): undefined reference to
`_gfortrani_size0'
/usr/bin/ld: matmul_r16.c:(.text+0x7e91): undefined reference to
`_gfortrani_xmallocarray'
/usr/bin/ld: matmul_r16.c:(.text+0x8085): undefined reference to
`_gfortrani_runtime_error'
/usr/bin/ld: matmul_r16.c:(.text+0x8149): undefined reference to
`_gfortrani_runtime_error'
/usr/bin/ld: matmul_r16.c:(.text+0x8187): undefined reference to
`_gfortrani_runtime_error'
/usr/bin/ld: matmul_r16.o: in function `_gfortran_matmul_r16':
matmul_r16.c:(.text+0x92b7): undefined reference to
`_gfortrani_matmul_r16_avx128_fma3'
/usr/bin/ld: matmul_r16.c:(.text+0x92ea): undefined reference to
`_gfortrani_matmul_r16_avx128_fma4'
/usr/bin/ld: warning: creating DT_TEXTREL in a PIE
collect2: error: ld returned 1 exit status

[Bug libgcc/108279] Improved speed for float128 routines

2023-01-12 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108279

--- Comment #8 from Michael_S  ---
(In reply to Thomas Koenig from comment #6)
> (In reply to Michael_S from comment #5)
> > Hi Thomas
> > Are you in or out?
> 
> Depends a bit on what exactly you want to do, and if there is
> a chance that what you want to do will be incorporated into gcc.
> 

What about incorporation into Fortran?
What about incorporation into C under -ffast-math?

> If you want to replace the soft-float routines, you will have to
> replace them with the full functionality.
> 

Full functionality, including the Inexact exception that practically nobody uses?
That sounds like a waste of perfectly good CPU cycles.
Also, I am not so sure that the Inexact exception is fully supported in the existing soft-float library.

Almost-full functionality, with support for non-default rounding modes but without the Inexact exception?
I actually implemented that and did a few measurements. You can find the results in the /reports directory in my repo.
Summary: the architecture-neutral method causes a very serious slowdown. Less so on slower machines, but a massive 2.5x on the fastest machine (Zen3 under Linux under WSL).
The AMD64-specific method causes a smaller slowdown, especially on relatively old Intel cores on Windows (I have no modern Intel cores available for testing). But Zen3/Linux still suffers a 1.45x slowdown. Again, a big waste of perfectly good CPU cycles.
Also, what about other architectures? Should they suffer an "architecture-neutral" slowdown? Even if there are faster methods on other architectures, those methods would have to be found by somebody and tested by somebody. This sort of work is time-consuming. And for what?

I also measured the impact of implementing non-default rounding through an additional function parameter. The impact is very small, 0 to 5%.
You said on comp.arch that at least for Fortran it could work.
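
A hypothetical sketch of what I mean by an additional function parameter (the name __addtf3_rm and the choice of FE_* constants for the encoding are made up for illustration; this is not the actual interface in my repo or in libgcc):

  #include <fenv.h>

  /* hypothetical entry point: same operation as __addtf3, but the caller
     supplies the rounding direction instead of the callee reading it */
  __float128 __addtf3_rm(__float128 a, __float128 b, int rm);

  /* a caller that knows the mode statically passes a constant... */
  __float128 add_default(__float128 a, __float128 b)
  {
    return __addtf3_rm(a, b, FE_TONEAREST);
  }

  /* ...and only a caller that really runs under a dynamic mode pays for reading it */
  __float128 add_dynamic(__float128 a, __float128 b)
  {
    return __addtf3_rm(a, b, fegetround());
  }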

What else is missing for "full functionality"?
Surely there are other things that I forgot. Maybe additional exceptions, apart from Invalid Operand (which hopefully already works) and apart from Inexact, which I find stupid? I don't think they are hard to implement or expensive in terms of speed. Just a bit of work and more than a bit of testing.

> And there will have to be a decision about 32-bit targets.
>

IMHO, 32-bit targets should be left in their current state.
People that use them probably do not care deeply about performance.
Technically, I can implement 32-bit targets in the same sources, by means of
few ifdefs and macros, but resulting source code will look much uglier than how
it looks today. Still, not to the same level of horror that you have in
matmul_r16.c, but certainly uglier than how I like it to look.
And I am not sure at all that my implementation of 32-bit targets would be
significantly faster than current soft float. 
In short, it does not sound like a good ROI.
BTW, do you know why the current soft float supports so few 32-bit targets?
Most likely somebody felt just like I do about it - it's not too hard to support more 32-bit targets, but it's not a good ROI.

[Bug libgcc/108279] Improved speed for float128 routines

2023-01-12 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108279

--- Comment #7 from Michael_S  ---
Either here or my yahoo e-mail

[Bug libgcc/108279] Improved speed for float128 routines

2023-01-11 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108279

--- Comment #5 from Michael_S  ---
Hi Thomas
Are you in or out?

If you are still in, I can use your help on several issues.

1. Torture tests.
See if the Invalid Operand exception is raised properly now. Also check whether there are still remaining problems with NaN.

2. Run my correctness tests on as many non-AMD64 targets as you can. Preferably with 100,000,000 iterations, but on weaker HW 10,000,000 will do.

3. Run my speed tests (tests/matmulq/mm_speed_ma) on a more diverse set of AMD64 computers than I did.
Of special interest are:
- AMD Zen3 on Linux running on bare metal
- Intel Skylake, Skylake-X, Tiger/Rocket Lake and Alder Lake on Linux running on bare metal
I realize that doing speed tests is not nearly as simple as correctness tests. We need non-busy (preferably almost idle) machines that have a stable CPU clock rate. It's not easy to find machines like that nowadays. But maybe you can find at least some from the list.

4. Run my speed tests on as many non-obsolete ARM64 computers as you can find.
Well, that is probably wishful thinking on my part.


Also off topic, but of interest: the postprocessed source of matmul_r16.c.

[Bug libgcc/108279] Improved speed for float128 routines

2023-01-04 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108279

--- Comment #4 from Michael_S  ---
(In reply to Jakub Jelinek from comment #2)
> From what I can see, they are certainly not portable.
> E.g. the relying on __int128 rules out various arches (basically all 32-bit
> arches,
> ia32, powerpc 32-bit among others).  Not handling exceptions is a show
> stopper too.
> Guess better time investment would be to improve performance of the soft-fp
> versions.

The only 32-bit gcc targets [out of the targets represented on Godbolt] that support __float128/_Float128 right now are:
- i386 (IA-32)
- SPARC
- RV32
I don't know why it is supported, but I would be surprised if __float128 on RV32 is actually used by anybody. Just too slow.

32-bit SPARC sounds like a dead platform.

i386 is more problematic. People still use it, and people still develop new programs for it. And since modern x86 processors are so fast, the performance of __float128 could be usable even if it is 5 or 10 times slower than on the same machine running x86-64.

PPC32 is not supported, so no problems from that side.

[Bug tree-optimization/97832] AoSoA complex caxpy-like loops: AVX2+FMA -Ofast 7 times slower than -O3

2022-11-26 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97832

--- Comment #22 from Michael_S  ---
(In reply to Alexander Monakov from comment #21)
> (In reply to Michael_S from comment #19)
> > > Also note that 'vfnmadd231pd 32(%rdx,%rax), %ymm3, %ymm0' would be
> > > 'unlaminated' (turned to 2 uops before renaming), so selecting independent
> > > IVs for the two arrays actually helps on this testcase.
> > 
> > Both 'vfnmadd231pd 32(%rdx,%rax), %ymm3, %ymm0' and 'vfnmadd231pd 32(%rdx),
> > %ymm3, %ymm0' would be turned into 2 uops.
> 
> The difference is at which point in the pipeline. The latter goes through
> renaming as one fused uop.
> 

Intel never documents such fine details in their Optimization Reference
manuals.
But I believe you.

> > Misuse of load+op is far bigger problem in this particular test case than
> > sub-optimal loop overhead. Assuming execution on Intel Skylake, it turns
> > loop that can potentially run at 3 clocks per iteration into loop of 4+
> > clocks per iteration.
> 
> Sorry, which assembler output this refers to?
>

gcc12 -O3 -mavx2 -mfma

gcc12 -O3 -march=skylake does not suffer from this problem.

I still think that the RISC-style icc code will be a little faster on Skylake, but here we are arguing about a quarter of a cycle per iteration rather than a full cycle.
https://godbolt.org/z/nfa7c9se3

> > But I consider it a separate issue. I reported similar issue in 97127, but
> > here it is more serious. It looks to me that the issue is not soluble within
> > existing gcc optimization framework. The only chance is if you accept my old
> > and simple advice - within inner loops pretend that AVX is RISC, i.e.
> > generate code as if load-op form of AVX instructions weren't existing.
> 
> In bug 97127 the best explanation we have so far is we don't optimally
> handle the case where non-memory inputs of an fma are reused, so we can't
> combine a load with an fma without causing an extra register copy (PR 97127
> comment 16 demonstrates what I mean). I cannot imagine such trouble arising
> with more common commutative operations like mul/add, especially with
> non-destructive VEX encoding. If you hit such examples, I would suggest to
> report them also, because their root cause might be different.
> 
> In general load-op combining should be very helpful on x86, because it
> reduces the number of uops flowing through the renaming stage, which is one
> of the narrowest points in the pipeline.

If compilers were perfect, AVX load-op combining would be somewhat helpful; I have my doubts about "very helpful". But compilers are not perfect.
For the non-AVX case, where every op is destructive and repeated loads are on average cheaper than with AVX, combining load-ops is far more profitable.

[Bug tree-optimization/97832] AoSoA complex caxpy-like loops: AVX2+FMA -Ofast 7 times slower than -O3

2022-11-26 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97832

--- Comment #20 from Michael_S  ---
(In reply to Richard Biener from comment #17)
> (In reply to Michael_S from comment #16)
> > On unrelated note, why loop overhead uses so many instructions?
> > Assuming that I am as misguided as gcc about load-op combining, I would
> > write it as:
> >   sub %rax, %rdx
> > .L3:
> >   vmovupd   (%rdx,%rax), %ymm1
> >   vmovupd 32(%rdx,%rax), %ymm0
> >   vfmadd213pd  32(%rax), %ymm3, %ymm1
> >   vfnmadd213pd (%rax), %ymm2, %ymm0
> >   vfnmadd231pd   32(%rdx,%rax), %ymm3, %ymm0
> >   vfnmadd231pd (%rdx,%rax), %ymm2, %ymm1
> >   vmovupd %ymm0,   (%rax)
> >   vmovupd %ymm1, 32(%rax)
> >   addq    $64, %rax
> >   decl    %esi
> >   jnz     .L3
> >   
> > The loop overhead in my variant is 3 x86 instructions==2 macro-ops,
> > vs 5 x86 instructions==4 macro-ops in gcc variant.
> > Also, in gcc variant all memory accesses have displacement that makes them
> > 1 byte longer. In my variant only half of accesses have displacement.
> > 
> > I think, in the past I had seen cases where gcc generates optimal or
> > near-optimal
> > code sequences for loop overhead. I wonder why it can not do it here.
> 
> I don't think we currently consider IVs based on the difference of two
> addresses.  

It seems to me that I have seen you doing it.
But maybe I am confusing gcc with clang.

> The cost benefit of no displacement is only size, 

Size is pretty important in high-IPC SIMD loops, especially on Intel and when the number of iterations is small, because Intel has a 16-byte fetch out of the L1I cache. SIMD instructions tend to be long, and not many instructions fit within 16 bytes even when memory accesses have no offsets. Offsets add insult to injury.

> otherwise
> I have no idea why we have biased the %rax accesses by -32.  Why we
> fail to consider decrement-to-zero for the counter IV is probably because
> IVCANON would add such IV but the vectorizer replaces that and IVOPTs
> doesn't consider re-adding that.

Sorry, I have no idea about the meaning of IVCANON.

[Bug tree-optimization/97832] AoSoA complex caxpy-like loops: AVX2+FMA -Ofast 7 times slower than -O3

2022-11-26 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97832

--- Comment #19 from Michael_S  ---
(In reply to Alexander Monakov from comment #18)
> The apparent 'bias' is introduced by instruction scheduling: haifa-sched
> lifts a +64 increment over memory accesses, transforming +0 and +32
> displacements to -64 and -32. Sometimes this helps a little bit even on
> modern x86 CPUs.

I don't think that it ever helps on Intel Sandy Bridge or later or on AMD Zen1
or later.

> 
> Also note that 'vfnmadd231pd 32(%rdx,%rax), %ymm3, %ymm0' would be
> 'unlaminated' (turned to 2 uops before renaming), so selecting independent
> IVs for the two arrays actually helps on this testcase.

Both 'vfnmadd231pd 32(%rdx,%rax), %ymm3, %ymm0' and 'vfnmadd231pd 32(%rdx),
%ymm3, %ymm0' would be turned into 2 uops.

Misuse of load+op is a far bigger problem in this particular test case than the sub-optimal loop overhead. Assuming execution on Intel Skylake, it turns a loop that could potentially run at 3 clocks per iteration into a loop of 4+ clocks per iteration.
But I consider it a separate issue. I reported a similar issue in 97127, but here it is more serious. It looks to me that the issue is not solvable within the existing gcc optimization framework. The only chance is if you accept my old and simple advice - within inner loops, pretend that AVX is RISC, i.e. generate code as if the load-op forms of AVX instructions didn't exist.

[Bug tree-optimization/97832] AoSoA complex caxpy-like loops: AVX2+FMA -Ofast 7 times slower than -O3

2022-11-25 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97832

--- Comment #16 from Michael_S  ---
On an unrelated note, why does the loop overhead use so many instructions?
Assuming that I am as misguided as gcc about load-op combining, I would write it as:
  sub %rax, %rdx
.L3:
  vmovupd   (%rdx,%rax), %ymm1
  vmovupd 32(%rdx,%rax), %ymm0
  vfmadd213pd  32(%rax), %ymm3, %ymm1
  vfnmadd213pd (%rax), %ymm2, %ymm0
  vfnmadd231pd   32(%rdx,%rax), %ymm3, %ymm0
  vfnmadd231pd (%rdx,%rax), %ymm2, %ymm1
  vmovupd %ymm0,   (%rax)
  vmovupd %ymm1, 32(%rax)
  addq    $64, %rax
  decl    %esi
  jnz     .L3

The loop overhead in my variant is 3 x86 instructions == 2 macro-ops,
vs 5 x86 instructions == 4 macro-ops in the gcc variant.
Also, in the gcc variant all memory accesses have a displacement, which makes them 1 byte longer. In my variant only half of the accesses have a displacement.

I think in the past I have seen cases where gcc generates optimal or near-optimal code sequences for loop overhead. I wonder why it cannot do it here.

[Bug tree-optimization/97832] AoSoA complex caxpy-like loops: AVX2+FMA -Ofast 7 times slower than -O3

2022-11-24 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97832

--- Comment #14 from Michael_S  ---
I tested the smaller test bench from Comment 3 with gcc trunk on godbolt.
The issue appears to be only partially fixed.
The -Ofast result is no longer the horror that it was before, but it is still not as good as -O3 or -O2. The -Ofast code generation is still strange, and there are a few vblendpd instructions that serve no useful purpose.

And -O2/-O3 is still not as good as it should be, or as good as icc.
But, as mentioned in my original post, the over-aggressive load+op combining is a separate problem.

[Bug target/105617] [12/13 Regression] Slp is maybe too aggressive in some/many cases

2022-07-29 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105617

--- Comment #15 from Michael_S  ---
(In reply to Richard Biener from comment #14)
> (In reply to Michael_S from comment #12)
> > On related note...
> > One of the historical good features of gcc relatively to other popular
> > compilers was absence of auto-vectorization at -O2.
> > When did you decide to change it and why?
> 
> Well, people requested it ... so GCC 12 is now having it (with an extra
> eye on code size but this all doesn't work for scalar code vectorization
> which OTOH is thought to almost always benefit code size ...).

"People" is a vague term.
You didn't organize voting among all gcc users, I suppose.
So, count me as "people" who asks to remove it.
Somehow , I suspect that the other one who will vote like me has a middle name
Benedict.

[Bug target/106220] x86-64 optimizer forgets about shrd peephole optimization pattern when faced with more than one in close proximity

2022-07-06 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106220

--- Comment #3 from Michael_S  ---
-march=haswell is not very important.
I added it only because, in the absence of the BMI extension, the issue is somewhat obscured by the need to keep the shift count in the CL register.

-O2 is also not important. -O3 is the same. And -O1, due to the absence of if-conversion, demonstrates the same issue in a different form.
In practice, I'd guess the -O1 code would perform quite well, unlike -O2 and -O3, but that does not make it any less ugly looking.

[Bug c/106220] New: x86-64 optimizer forgets about shrd peephole optimization pattern when faced with more than one in close proximity

2022-07-06 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106220

Bug ID: 106220
   Summary: x86-64 optimizer forgets about shrd peephole
optimization pattern when faced with more than one in
close proximity
   Product: gcc
   Version: 12.1.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: already5chosen at yahoo dot com
  Target Milestone: ---

I am reporting a right-shift issue, but left shifts have the same problems as well.

In theory, gcc knows how to calculate the lower 64 bits of a right shift of a 128-bit number with a single instruction when it is provable that the shift count is in the range [0:63]. In practice, it does so only under very special conditions.
See here: https://godbolt.org/z/fhdo8xhxW
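
For illustration, a minimal sketch of the kind of function I mean (hypothetical, not the exact code behind the godbolt link): the shift count is masked to [0:63], so on x86-64 the whole body can be a single shrd.

  #include <stdint.h>

  /* lower 64 bits of a 128-bit value (given as two 64-bit halves) shifted
     right by cnt % 64 */
  uint64_t rshift128_lo(uint64_t lo, uint64_t hi, unsigned cnt)
  {
    cnt %= 64;                      /* provably in [0:63] */
    unsigned __int128 v = ((unsigned __int128)hi << 64) | lo;
    return (uint64_t)(v >> cnt);
  }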

foo1to1 is good.
foo2to1 is good.
foo1to2 starts well but is broken near the end by the hyperactive vectorizer.
But that's a separate issue, already reported in 105617.

foo2to2, foo2to3, foo3to4 - it looks like the compiler forgot everything it knew about double-word right shifts, or, more likely, forgot that (x % 64) is always in the range [0:63].

I am reporting it as a target issue despite being sure that the problem is not in the x86-64 back end itself, but somewhere in the interaction between various phases of the optimizer - like 80+ percent of my reports.
However, it's your call, not mine. In practice, the impact is most visible on x86-64 because, due to the existence of the shrd instruction, x86-64 is potentially very good at this sort of task. On ARM64 or on POWER64LE the relative slowdown is lower, because the optimal code is not as fast.

P.S
82261 sounds similar, but I am not sure it is related.

[Bug libquadmath/105101] incorrect rounding for sqrtq

2022-06-13 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105101

--- Comment #23 from Michael_S  ---
(In reply to jos...@codesourcery.com from comment #22)
> On Mon, 13 Jun 2022, already5chosen at yahoo dot com via Gcc-bugs wrote:
> 
> > > The function should be sqrtf128 (present in glibc 2.26 and later on 
> > > x86_64, x86, powerpc64le, ia64).  I don't know about support in non-glibc 
> > > C libraries.
> > 
> > x86-64 gcc on Godbolt does not appear to know about it.
> > I think, Godbolt uses rather standard Linux with quite new glibc and 
> > headers.
> > https://godbolt.org/z/Y4YecvxK6
> 
> Make sure to define _GNU_SOURCE or __STDC_WANT_IEC_60559_TYPES_EXT__ to 
> get these declarations.
> 

Thank you.
It made a difference on Godbolt.
It does not help on MSYS2, but at least I will be able to test on Linux.
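
For reference, a minimal sketch of the pattern that made the difference (assuming glibc 2.26 or later; the macro has to be defined before math.h is first included, and printing via long double is just for illustration):

  #define __STDC_WANT_IEC_60559_TYPES_EXT__ 1
  #include <math.h>
  #include <stdio.h>

  int main(void)
  {
    _Float128 x = 2.0f128;
    _Float128 r = sqrtf128(x);
    printf("%.20Lf\n", (long double)r);
    return 0;
  }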


> > May be. I don't know how to get soft-fp version.
> 
> For long double it's sysdeps/ieee754/soft-fp/s_fmal.c in glibc - some 
> adjustments would be needed to be able to use that as a version for 
> _Float128 (where sysdeps/ieee754/float128/s_fmaf128.c currently always 
> uses the ldbl-128 version), in appropriate cases.
>

Way too complicated for a mere gcc user like myself.
Hopefully, Thomas Koenig will understand it better.

> > It seems, you didn't pay attention that in my later posts I am giving
> > implementations of binary128 *division* rather than sqrtq().
> 
> Ah - binary128 division is nothing to do with libquadmath at all (the 
> basic arithmetic operations go in libgcc, not libquadmath).  Using a PR 
> about one issue as an umbrella discussion of various vaguely related 
> things is generally confusing and unhelpful to tracking the status of what 
> is or is not fixed.

You are right.
Again, hopefully Thomas Koenig will open a thread in which a discussion of all the issues related to libquadmath and binary128 will be on topic.
But, formally speaking, slowness of division or of fma is not a bug.
So maybe there are better places than bugzilla to discuss them; I just don't know what place that would be.
The good thing about bugzilla is its convenient forum-like user interface.


> 
> In general, working out how to optimize the format-generic code in soft-fp 
> if possible would be preferred to writing format-specific implementations.  
> Note that for multiplication and division there are already various 
> choices of implementation approaches that can be selected via macros 
> defined in sfp-machine.h.
> 

I am not sure that I like the idea of a format-generic solution for the specific case of soft binary128.
binary128 is very special because it is the smallest. It can gain a lot of speed from format-specific handling. Not sure about 3x, but pretty certain about 2x.

> > BTW, I see no mentioning of rounding control or of any sort of exceptions in
> > GCC libquadmath docs. No APIs with names resembling fesetround() or
> > mpfr_set_default_rounding_mode().
> 
> The underlying arithmetic (in libgcc, not libquadmath) uses the hardware 
> rounding mode and exceptions (if the x87 and SSE rounding modes disagree, 
> things are liable to go wrong), via various macros defined in 
> sfp-machine.h.

Oh, what a mess!
Having an implementation that is either 99% or 100% integer controlled by the SSE control word is WRONG. The x87 control word, of course, is no better than SSE.
But BOTH? I have no words.

Hopefully libquadmath is saner than that, but I am unable to figure it out from the docs.

[Bug libquadmath/105101] incorrect rounding for sqrtq

2022-06-13 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105101

--- Comment #21 from Michael_S  ---
(In reply to jos...@codesourcery.com from comment #20)
> On Sat, 11 Jun 2022, already5chosen at yahoo dot com via Gcc-bugs wrote:
> 
> > On MSYS2 _Float128 and __float128 appears to be mostly the same thing, 
> > mapped
> > to the same library routines with significant difference that _Float128 is 
> > not
> > accessible from C++. Since all my test benches are written in C++ I can't 
> > even
> > validate that what I wrote above is 100% true.
> > 
> > Also according to my understanding of glibc docs (not the clearest piece of
> > text that I ever read) a relevant square root routine should be named
> > sqrtf128().
> > Unfortunately, nothing like that appears to be present in either math.h or 
> > in
> > library. Am I doing something wrong?
> 
> The function should be sqrtf128 (present in glibc 2.26 and later on 
> x86_64, x86, powerpc64le, ia64).  I don't know about support in non-glibc 
> C libraries.

x86-64 gcc on Godbolt does not appear to know about it.
I think, Godbolt uses rather standard Linux with quite new glibc and headers.
https://godbolt.org/z/Y4YecvxK6


> 
> > Right now, there are only two [gcc] platforms with hw binary128 - IBM POWER 
> > and
> > IBM z. I am not sure about the later, but the former has xssqrtqp 
> > instruction
> > which is likely the right way to do sqrtq()/sqrtf128() on this platform. If 
> > z
> > is the same, which sound likely, then implementation based on binary128
> > mul/add/fma by now has no use cases at all.
> 
> That may well be the case for sqrt.
> 
> > > fma is a particularly tricky case because it *is* required to be 
> > > correctly 
> > > rounding, in all rounding modes, and correct rounding implies correct 
> > > exceptions, *and* correct exceptions for fma includes getting right the 
> > > architecture-specific choice of whether tininess is detected before or 
> > > after rounding.
> > 
> > I suspect that by strict IEEE-754 rules sqrt() is the same as fma(), i.e. 
> > you
> > have to calculate a root to infinite precision and then to round with
> > accordance to current mode.
> 
> Yes, but fma has the extra complication of the architecture-specific 
> tininess detection rules (to quote IEEE 754, "The implementer shall choose 
> how tininess is detected [i.e. from the two options listed immediately 
> above this quote in IEEE 754], but shall detect tininess in the same way 
> for all operations in radix two"), which doesn't apply to sqrt because 
> sqrt results can never underflow.
> 
> (I expect the existing soft-fp version of fma in glibc to be rather better 
> optimized than the soft-fp version of sqrt.)
> 

May be. I don't know how to get soft-fp version.
What I do know is that the version of fmaq() shipped with gcc/gfortran on MSYS2 x86-64 is absurdly slow - on the order of 4 microseconds on quite fast modern CPUs.
I don't know what you call this variant - hard, soft, or maybe freezing.
sqrtq() is also slow, but 3 to 5 times faster than that.

> > I don't quite or understand why you say that. First, I don't remember using 
> > any
> > double math in the variant of sqrtq() posted here. So, may be, you meant
> > division?
> > Here I use doable math, but IMO, I use it in a way that never causes
> > exceptions.
> 
> Yes, the double division.  If that division can ever be inexact when the 
> result of the square root itself is exact, you have a problem (which in 
> turn could lead to actual incorrectly rounded results from other functions 
> such as the square root operations rounding to a narrower type, when the 
> round-to-odd versions of those functions are used, because the 
> round-to-odd technique relies on correct "inexact" exceptions).

It seems, you didn't pay attention that in my later posts I am giving
implementations of binary128 *division* rather than sqrtq().

sqrtq() is in Comment 9. It's pure integer, no FP (double) math involved.

Comment 12 is a pure integer implementation of binary128 division. Again, no FP (double) math involved.

Comment 13 is again an implementation of binary128 division. It does use double math (division) internally. On modern Intel and AMD CPUs it is a little faster than the one from Comment 12, but not radically so. Both the arguments and the result of the double division are well-controlled; they never come close to the subnormal range.

All three implementations support only one rounding mode, the default (round to nearest with ties broken to even). They do not generate any exceptions, neither underflow nor even overflow. In case of overflow the division silently returns Inf with the proper sign.
Support for exce

[Bug libquadmath/105101] incorrect rounding for sqrtq

2022-06-11 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105101

--- Comment #19 from Michael_S  ---
(In reply to jos...@codesourcery.com from comment #18)
> libquadmath is essentially legacy code.  People working directly in C 
> should be using the C23 _Float128 interfaces and *f128 functions, as in 
> current glibc, rather than libquadmath interfaces (unless their code needs 
> to support old glibc or non-glibc C libraries that don't support _Float128 
> in C23 Annex H).  It would be desirable to make GCC generate *f128 calls 
> when appropriate from Fortran code using this format as well; see 
>  for 
> more discussion of the different cases involved.
> 


On MSYS2 _Float128 and __float128 appears to be mostly the same thing, mapped
to the same library routines with significant difference that _Float128 is not
accessible from C++. Since all my test benches are written in C++ I can't even
validate that what I wrote above is 100% true.

Also according to my understanding of glibc docs (not the clearest piece of
text that I ever read) a relevant square root routine should be named
sqrtf128().
Unfortunately, nothing like that appears to be present in either math.h or in
library. Am I doing something wrong?


> Most of libquadmath is derived from code in glibc - some of it can now be 
> updated from the glibc code automatically (see update-quadmath.py), other 
> parts can't (although it would certainly be desirable to extend 
> update-quadmath.py to cover that other code as well).  See the commit 
> message for commit 4239f144ce50c94f2c6cc232028f167b6ebfd506 for a more 
> detailed discussion of what code comes from glibc and what is / is not 
> automatically handled by update-quadmath.py.  Since update-quadmath.py 
> hasn't been run for a while, it might need changes to work with more 
> recent changes to the glibc code.
> 
> sqrtq.c is one of the files not based on glibc code.  That's probably 
> because glibc didn't have a convenient generic implementation of binary128 
> sqrt to use when libquadmath was added - it has soft-fp implementations 
> used for various architectures, but those require sfp-machine.h for each 
> architecture (which maybe we do in fact have in libgcc for each relevant 
> architecture, but it's an extra complication).  Certainly making it 
> possible to use code from glibc for binary128 sqrt would be a good idea, 
> but while we aren't doing that, it should also be OK to improve sqrtq 
> locally in libquadmath.
> 
> The glibc functions for this format are generally *not* optimized for 
> speed yet (this includes the soft-fp-based versions of sqrt).  Note that 
> what's best for speed may depend a lot on whether the architecture has 
> hardware support for binary128 arithmetic; if it has such support, it's 
> more likely an implementation based on binary128 floating-point operations 
> is efficient; 

It's not that simple.
Right now there are only two [gcc] platforms with hw binary128 - IBM POWER and IBM z. I am not sure about the latter, but the former has the xssqrtqp instruction, which is likely the right way to do sqrtq()/sqrtf128() on that platform. If z is the same, which sounds likely, then an implementation based on binary128 mul/add/fma by now has no use cases at all.


> if it doesn't, direct use of integer arithmetic, without 
> lots of intermediate packing / unpacking into the binary128 format, is 
> likely to be more efficient.  

It's not just the redundant packing/unpacking. A direct integer implementation does fewer arithmetic operations as well, mainly because it knows exactly which parts of the 226-bit multiplication product are relevant and does not calculate the parts that are irrelevant.
And with integer math it is much easier to achieve correct rounding in the corner cases that call for precision in excess of 226 bits, so even fmaq() is not enough. And yes, there are one or two cases like that.
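
To illustrate the kind of saving I mean, a generic sketch (my own illustration, not code from my repo): the high half of a 128x128-bit product needs only three of the four 64x64 partial products; the low-by-low product matters only through a possible carry into bit 128 and through the sticky bit.

  #include <stdint.h>

  /* high 128 bits of the 256-bit product of a = ah:al and b = bh:bl,
     ignoring the carry that al*bl could propagate into bit 128 */
  unsigned __int128 mulh128_approx(uint64_t ah, uint64_t al,
                                   uint64_t bh, uint64_t bl)
  {
    unsigned __int128 hh = (unsigned __int128)ah * bh;  /* bits 128..255 */
    unsigned __int128 hl = (unsigned __int128)ah * bl;  /* bits  64..191 */
    unsigned __int128 lh = (unsigned __int128)al * bh;  /* bits  64..191 */
    unsigned __int128 mid = hl + lh;                    /* may wrap mod 2^128 */
    unsigned __int128 wrap = (mid < hl) ? ((unsigned __int128)1 << 64) : 0;
    return hh + (mid >> 64) + wrap;                     /* al*bl never computed */
  }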

> See the discussion starting at 
>  
> for more on this - glibc is a better place for working on most optimized 
> function implementations than GCC.  See also 
>  - those functions are aiming to 
> be correctly rounding, which is *not* a goal for most glibc libm 
> functions, but are still quite likely to be faster than the existing 
> non-optimized functions in glibc.
> 
> fma is a particularly tricky case because it *is* required to be correctly 
> rounding, in all rounding modes, and correct rounding implies correct 
> exceptions, *and* correct exceptions for fma includes getting right the 
> architecture-specific choice of whether tininess is detected before or 
> after rounding.

I suspect that by strict IEEE-754 rules sqrt() is the same as fma(), i.e. you have to calculate the root to infinite precision and then round in accordance with the current mode.

[O.T.]
The whole rounding modes business complicates things quite a lot 

[Bug libquadmath/105101] incorrect rounding for sqrtq

2022-06-10 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105101

--- Comment #17 from Michael_S  ---
(In reply to Jakub Jelinek from comment #15)
> From what I can see, it is mostly integral implementation and we already
> have one  such in GCC, so the question is if we just shouldn't use it (most
> of the sources
> are in libgcc/soft-fp/ , then configuration files in libgcc/config/*/sfp-*
> and
> the simple *.c file that uses the soft-fp macros is in glibc in
> sysdeps/ieee754/soft-fp/s_fsqrtl.c ).

If you have a good implementation, what was the reasoning behind shipping a bad one?

[Bug libquadmath/105101] incorrect rounding for sqrtq

2022-06-10 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105101

--- Comment #16 from Michael_S  ---
(In reply to Thomas Koenig from comment #14)
> @Michael: Now that gcc 12 is out of the door, I would suggest we try to get
> your code into the gcc tree for gcc 13.
> 
> It should follow the gcc style guidelines, which is quite trivial to do:
> Comments should be in /* - */ format, and clang-format --style=gnu does the
> rest.
> (If you prefer, I can also handle that).
> 
> Regarding copyright, you can either assign it to the FSF (which would take
> a bit) or, probably easier, you can use the Signed-off-by: tag as per
> https://gcc.gnu.org/contribute.html#legal .
> 
> You would then submit the patch to gcc-patc...@gcc.gnu.org; if you prefer,
> I can also help a bit with the style issues (ChangeLog etc) for that.
> It would then be reviewed and, if any problems identified are fixed,
> applied - which I can also do.
> 
> How does that sound?

Sounds not bad, except that I would want to keep a copy of my own part of the work in my public github repo, to share it with the world under an MIT or BSD or plain giveaway license.
Is that a problem?

Also, if we start, then it makes sense to replace the more common routines (add, mul and especially fma) as well.

Also, I suspect that the transcendentals, especially the trig functions, can be improved quite a lot if they are currently done in a style similar to sqrt() and fma().



Also, the code will be cleaner and marginally better performance-wise if we base it on a clang-style add_carry64().
According to my understanding, gcc is going to implement it in the future anyway, so why not now?

As to these specific routines, the variants posted here still contain one bug in the handling of subnormal inputs (a corner case of the lowest non-zero value).
Also, for the square root I have a variant that is a little simpler and performs somewhat better. And it is commented better.

[Bug target/105617] [12/13 Regression] Slp is maybe too aggressive in some/many cases

2022-05-17 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105617

--- Comment #12 from Michael_S  ---
On related note...
One of the historical good features of gcc relatively to other popular
compilers was absence of auto-vectorization at -O2.
When did you decide to change it and why?

[Bug target/105617] [12/13 Regression] Slp is maybe too aggressive in some/many cases

2022-05-17 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105617

--- Comment #11 from Michael_S  ---
(In reply to Richard Biener from comment #10)
> (In reply to Hongtao.liu from comment #9)
> > (In reply to Hongtao.liu from comment #8)
> > > (In reply to Hongtao.liu from comment #7)
> > > > Hmm, we have specific code to add scalar->vector(vmovq) cost to vector
> > > > construct, but it seems not to work here, guess it's because ,and 
> > > > thought
> > > > it was load not scalar? 
> > > Yes, true for as gimple_assign_load_p
> > > 
> > > 
> > > (gdb) p debug_gimple_stmt (def)
> > > 72# VUSE <.MEM_46>
> > > 73r0.0_20 = r0;
> > It's a load from stack, and finally eliminated in rtl dse1, but here the
> > vectorizer doesn't know.
> 
> Yes, it's difficult for the SLP vectorizer to guess whether rN will come
> from memory or not.  Some friendlier middle-end representation for
> add-with-carry might be nice - the x86 backend could for example fold
> __builtin_ia32_addcarryx_u64 to use a _Complex unsinged long long for the
> return, ferrying the carry in __imag.  Alternatively we could devise
> some special GIMPLE_ASM kind ferrying RTL and not assembly so the
> backend could fold it directly to RTL on GIMPLE with asm constraints
> doing the plumbing ... (we'd need some match-scratch and RTL expansion
> would still need to allocate the actual pseudos).
> 
>[local count: 1073741824]:
>   _1 = *srcB_17(D);
>   _2 = *srcA_18(D);
>   _30 = __builtin_ia32_addcarryx_u64 (0, _2, _1, );
>   _3 = MEM[(const uint64_t *)srcB_17(D) + 8B];
>   _4 = MEM[(const uint64_t *)srcA_18(D) + 8B];
>   _5 = (int) _30;
>   _29 = __builtin_ia32_addcarryx_u64 (_5, _4, _3, );
>   _6 = MEM[(const uint64_t *)srcB_17(D) + 16B];
>   _7 = MEM[(const uint64_t *)srcA_18(D) + 16B];
>   _8 = (int) _29;
>   _28 = __builtin_ia32_addcarryx_u64 (_8, _7, _6, );
>   _9 = MEM[(const uint64_t *)srcB_17(D) + 24B];
>   _10 = MEM[(const uint64_t *)srcA_18(D) + 24B];
>   _11 = (int) _28;
>   __builtin_ia32_addcarryx_u64 (_11, _10, _9, );
>   r0.0_12 = r0;
>   r1.1_13 = r1;
>   _36 = {r0.0_12, r1.1_13};
>   r2.2_14 = r2;
>   r3.3_15 = r3;
>   _37 = {r2.2_14, r3.3_15};
>   vectp.9_35 = dst_19(D);
>   MEM  [(uint64_t *)vectp.9_35] = _36;
>   vectp.9_39 = vectp.9_35 + 16;
>   MEM  [(uint64_t *)vectp.9_39] = _37;
> 
> so for the situation at hand I don't see any reasonable way out that
> doesn't have the chance of regressing things in other places (like
> treat loads from non-indexed auto variables specially or so).  The
> only real solution is to find a GIMPLE representation for
> __builtin_ia32_addcarryx_u64 that doesn't force the alternate output
> to memory.


How about a simple heuristic: "Never auto-vectorize integer code unless it is in an inner loop"?
It would be optimal >80% of the time, and even in the cases where it is sub-optimal the impact is likely single-digit percent. On the other hand, the impact of a mistake in the opposite direction (i.e. over-vectorization) is often quite large.

[Bug target/105617] [12/13 Regression] Slp is maybe too aggressive in some/many cases

2022-05-16 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105617

--- Comment #6 from Michael_S  ---
(In reply to Michael_S from comment #5)
> 
> Even scalar-to-scalar or vector-to-vector moves that are hoisted at renamer
> does not have a zero cost, because quite often renamer itself constitutes
> the narrowest performance bottleneck. But those moves... I don't think that
> they are hoisted by renamer.

I took a look at several Intel and AMD Optimization Reference Manuals and
instruction tables. None of the existing x86 microarchitectures, either old or
new, eliminates scalar-to-SIMD moves at the renamer. Which is sort of obvious
for the newer microarchitectures (Bulldozer or later for AMD, Sandy Bridge or
later for Intel), because on these microarchitectures scalar and SIMD registers
live in separate physical register files.
As to the older microarchitectures, maybe they kept them in a common physical
storage area, but they simply were not sufficiently smart to eliminate the
moves.
So, these moves have non-zero latency. On some of the cores, including some of
the newest, the latency is even higher than one clock. And the throughput tends
to be rather low, most typically one scalar-to-SIMD move per clock. For
comparison, scalar-to-scalar and SIMD-to-SIMD moves can be executed (or
eliminated at the renamer) at rates of 2, 3 or even 4 per clock.

[Bug target/105617] [12/13 Regression] Slp is maybe too aggressive in some/many cases

2022-05-16 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105617

--- Comment #5 from Michael_S  ---
(In reply to Richard Biener from comment #3)
> We are vectorizing the store it dst[] now at -O2 since that appears
> profitable:
> 
> t.c:10:10: note: Cost model analysis:
> r0.0_12 1 times scalar_store costs 12 in body
> r1.1_13 1 times scalar_store costs 12 in body
> r2.2_14 1 times scalar_store costs 12 in body
> r3.3_15 1 times scalar_store costs 12 in body
> r0.0_12 2 times unaligned_store (misalign -1) costs 24 in body
> node 0x4b2b1e0 1 times vec_construct costs 4 in prologue
> node 0x4b2b1e0 1 times vec_construct costs 4 in prologue
> t.c:10:10: note: Cost model analysis for part in loop 0:
>   Vector cost: 32
>   Scalar cost: 48
> t.c:10:10: note: Basic block will be vectorized using SLP

That makes no sense.
4 scalar-to-vector moves + 2 vector shuffles + 2 vector stores are A LOT more
costly than 4 scalar stores.
Even more so considering that the scalar stores go to adjacent addresses, so on
good CPUs they are likely combined at the level of the store queue, and the
cache subsystem sees fewer operations.


Either your cost model is broken or there are bugs in the summation.
I'd guess that somehow the compiler thinks the moves have zero cost. But
scalar-to-vector moves are certainly not of zero cost.
Even scalar-to-scalar or vector-to-vector moves that are eliminated at the
renamer do not have a zero cost, because quite often the renamer itself
constitutes the narrowest performance bottleneck. But those moves... I don't
think that they are eliminated by the renamer.
Also, it's likely that the compiler thinks that a scalar store costs the same
as a vector store. That's also generally incorrect, especially when you don't
know your target CPU and don't know whether the stores are aligned or not, like
in this case.

[Bug target/105617] [12/13 Regression] Slp is maybe too aggressive in some/many cases

2022-05-16 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105617

--- Comment #4 from Michael_S  ---
(In reply to Andrew Pinski from comment #1)
> This is just the vectorizer still being too aggressive right before a return.
> It is a cost model issue and it might not really be an issue in the final
> code just microbenchmarks.

BTW, Andrew, in one of the older related threads you made a very wise comment:
"Maybe even generic builtins/internal functions for this case even as clang has
__builtin_addc, etc."

IMHO, that is not only necessary, but long overdue.
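
For illustration, a minimal sketch (mine, not from the thread) of how such a
generic builtin reads at the source level. __builtin_addcll is a clang
extension, not something gcc currently provides; its signature is
unsigned long long __builtin_addcll(ull x, ull y, ull carry_in, ull *carry_out).
The add4i shape follows the reproducer further down in this digest.

// Sketch only: requires clang; gcc has no equivalent generic builtin today.
void add4i_clang(unsigned long long dst[4],
                 const unsigned long long A[4],
                 const unsigned long long B[4])
{
  unsigned long long c0, c1, c2, c3;
  dst[0] = __builtin_addcll(A[0], B[0], 0,  &c0);
  dst[1] = __builtin_addcll(A[1], B[1], c0, &c1);
  dst[2] = __builtin_addcll(A[2], B[2], c1, &c2);
  dst[3] = __builtin_addcll(A[3], B[3], c2, &c3);  // c3 is the final carry-out
}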

[Bug target/105617] New: Regression in code generation for _addcarry_u64()

2022-05-16 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105617

Bug ID: 105617
   Summary: Regression in code generation for _addcarry_u64()
   Product: gcc
   Version: 12.1.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: already5chosen at yahoo dot com
  Target Milestone: ---

It took many years until gcc caught up with MSVC and LLVM/clang in generating
code for chains of Intel's _addcarry_u64() intrinsic calls. But you finally
managed to do it in gcc 11.
Unfortunately, the luck didn't last for long.


#include <stdint.h>
#include <x86intrin.h>

void add4i(uint64_t dst[4], const uint64_t srcA[4], const uint64_t srcB[4])
{
  unsigned char c;
  unsigned long long r0; c = _addcarry_u64(0, srcA[0], srcB[0], &r0);
  unsigned long long r1; c = _addcarry_u64(c, srcA[1], srcB[1], &r1);
  unsigned long long r2; c = _addcarry_u64(c, srcA[2], srcB[2], &r2);
  unsigned long long r3; c = _addcarry_u64(c, srcA[3], srcB[3], &r3);
  dst[0] = r0;
  dst[1] = r1;
  dst[2] = r2;
  dst[3] = r3;
}

gcc 11.1 -O2

add4i:
movq(%rdx), %rax
addq(%rsi), %rax
movq8(%rsi), %rcx
movq%rax, %r8
adcq8(%rdx), %rcx
movq16(%rsi), %rax
adcq16(%rdx), %rax
movq24(%rdx), %rdx
adcq24(%rsi), %rdx
movq%r8, (%rdi)
movq%rcx, 8(%rdi)
movq%rax, 16(%rdi)
movq%rdx, 24(%rdi)
ret


gcc 12.1  -O2

add4i:
movq(%rdx), %rax
movq8(%rsi), %rcx
addq(%rsi), %rax
movq16(%rsi), %r8
adcq8(%rdx), %rcx
adcq16(%rdx), %r8
movq%rax, %xmm1
movq24(%rdx), %rdx
adcq24(%rsi), %rdx
movq%r8, %xmm0
movq%rcx, %xmm3
movq%rdx, %xmm2
punpcklqdq  %xmm3, %xmm1
punpcklqdq  %xmm2, %xmm0
movups  %xmm1, (%rdi)
movups  %xmm0, 16(%rdi)
ret

What ... ?!

BTW, gcc 12.1 -O1 is still o.k.

[Bug target/105468] Suboptimal code generation for access of function parameters and return values of type __float128 on x86-64 Windows target.

2022-05-03 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105468

--- Comment #4 from Michael_S  ---
Created attachment 52925
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52925&action=edit
build script

[Bug target/105468] Suboptimal code generation for access of function parameters and return values of type __float128 on x86-64 Windows target.

2022-05-03 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105468

--- Comment #3 from Michael_S  ---
Created attachment 52924
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52924&action=edit
Another test bench that shows lower impact on Zen3, but higher impact on some
Intel CPUs

[Bug target/105468] Suboptimal code generation for access of function parameters and return values of type __float128 on x86-64 Windows target.

2022-05-03 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105468

--- Comment #2 from Michael_S  ---
Created attachment 52923
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52923&action=edit
test bench that shows lower impact on Zen3, but higher impact on some Intel
CPUs

[Bug target/105468] Suboptimal code generation for access of function parameters and return values of type __float128 on x86-64 Windows target.

2022-05-03 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105468

--- Comment #1 from Michael_S  ---
Created attachment 52922
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52922&action=edit
test bench that demonstrates maximal impact on Zen3

[Bug target/105468] New: Suboptimal code generation for access of function parameters and return values of type __float128 on x86-64 Windows target.

2022-05-03 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105468

Bug ID: 105468
   Summary: Suboptimal code generation for access of function
parameters and return values of type __float128 on
x86-64 Windows target.
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: already5chosen at yahoo dot com
  Target Milestone: ---

Created attachment 52921
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52921&action=edit
example routine

From the point of view of the x86-64 Windows ABI, __float128 is yet another
structure of size 16, 100% identical to the rest of them.
Like the rest of them, when it is passed by value into a subroutine, it has to
be stored by the caller in a temporary location (typically, the caller's stack)
and the pointer to the temporary is passed either in a GPR or also on the
stack.
The same goes for the return value - the caller allocates temporary storage on
its stack and passes a pointer to it to the callee in register RCX. Then the
callee puts the return value where it was told.
There is absolutely no "floating-pointness" or "SIMDness" about it.
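
To illustrate, a minimal sketch (hypothetical names, assuming the Win64
convention described above) of the "just another 16-byte struct" view; the
callee only ever touches integer registers and the hidden return pointer:

#include <stdint.h>

typedef struct { uint64_t lo, hi; } f128_bits;  /* same 16-byte shape as __float128 */

/* The caller makes a stack copy of 'a' and passes its address; the return value
   is written through a hidden pointer supplied by the caller (RCX on Win64). */
f128_bits f128_negate(f128_bits a)
{
  a.hi ^= (uint64_t)1 << 63;   /* flip the sign bit - pure integer work, no XMM needed */
  return a;
}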

However, the gcc compiler often (although not always) treats __float128 as if
it were somehow related to the floating-point side of the machine. I'd guess
it's somehow related to System V being the primary target, and according to the
System V x86-64 ABI __float128 values are passed in and out in XMM registers.

In practice this leads to ugly code generation, less so in user code that uses
__float128 for arithmetic, more so in library-type code that attempts to
process __float128 objects in accordance with their binary layout.

Example 1. Access to individual words of __float128 function parameter
  vmovdqu (%rdx),%xmm2
  vmovdqu %xmm2,0x20(%rsp)
  mov 0x28(%rsp),%rdx
  mov 0x20(%rsp),%rcx
instead of simple:
  mov %rcx,%r12
  mov 0x8(%rdx),%rcx
  mov(%rdx),%rdx

Example 2. Function returns __float128 value composed of a pair of 64-bit words
  mov %rax,0x20(%rsp)
  mov %rdi,0x28(%rsp)
  vmovdqu 0x20(%rsp),%xmm3
  mov %r12,%rax
  vmovdqu %xmm3,(%r12)
instead of simple:
  mov %rax,(%r12)
  mov %r12,%rax
  mov %rdi,0x8(%r12)
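
The kind of source code behind these sequences is nothing more than word-level
composition of a __float128 from two 64-bit halves, e.g. the following sketch
(my helper name; compare the ret__float128()/set__float128() helpers quoted
later in this digest). Ideally it should compile to two 64-bit GPR stores:

#include <stdint.h>
#include <string.h>

static inline __float128 mk_float128(uint64_t hi, uint64_t lo)
{
  unsigned __int128 u = ((unsigned __int128)hi << 64) | lo;
  __float128 r;
  memcpy(&r, &u, sizeof(r));   /* pure bit copy, no floating-point involvement */
  return r;
}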


If it were just ugly I wouldn't complain. Unfortunately, sometimes it's also
quite slow, and some of the best modern CPUs are among the worst affected.
As expected, the exact impact varies, from measurable (2-4 clocks) to quite
high (40 clocks).
The highest impact was measured on an AMD Zen3 CPU, but in other situations the
same CPU was among the least affected.
Intel CPUs (Ivy Bridge, Haswell, Skylake) showed an impact in the range from 2
to 21 clocks, with both the lowest and the highest values seen on the old Ivy
Bridge cores, while on the newer Skylake the impact was relatively consistent
(6-10 clocks). I didn't measure yet on the newest Intel CPUs (Ice/Tiger Lake
and Alder Lake).

Below, I attached the example code (which is not a mere dummy example, but
actually a quite good implementation of sqrtq()) compiled in two variants:
normally, and by tricking the compiler into thinking that __float128 is "just
another structure". Which, as said above, it is, at least under Win64, but it
appears the compiler thinks otherwise. Also provided: 3 test benches that
measure the difference in speed between the "normal" and the "tricky" variants
in various surroundings.
Pay attention: the trick is valid only on Windows, don't try it on Linux.

I hope that this particular (i.e. Windows) variant of the problem can be fixed
with relatively little effort.

On the other hand, the System V x86-64 target is a different matter. Here the
impact is on average smaller, but in the worst cases it is about the same as
the worst case on Windows, and since the problem there is fundamental (a stupid
ABI) I don't believe that there could be an easy fix.

[Bug libquadmath/105101] incorrect rounding for sqrtq

2022-04-21 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105101

--- Comment #13 from Michael_S  ---
It turned out that on all microarchitectures that I care about (and the
majority of those that I don't) double-precision floating-point division is
quite fast.
It's so fast that it easily beats my clever reciprocal estimator.
Which means that this simple code is faster than the complicated code above.
Not by much, but faster. And much simpler.

#include <stdint.h>
#include <string.h>

static __inline uint64_t umulx64(uint64_t mult1, uint64_t mult2, int rshift)
{
  return (uint64_t)(((unsigned __int128)mult1 * mult2)>>rshift);
}

static __inline uint64_t umulh64(uint64_t mult1, uint64_t mult2)
{
  return umulx64(mult1, mult2, 64);
}

static __inline void set__float128(__float128* dst, uint64_t wordH, uint64_t wordL) {
  unsigned __int128 res_u128 = ((unsigned __int128)wordH << 64) | wordL;
  memcpy(dst, &res_u128, sizeof(*dst)); // hopefully __int128 and __float128 have the same endianness
}

// return number of leading zeros
static __inline int normalize_float128(uint64_t* pMantH, uint64_t* pMantL) {
  const uint64_t MANT_H_MSK   = (uint64_t)-1 >> 16;
  uint64_t mantH = *pMantH, mantL = *pMantL;
  int lz;
  if (mantH == 0) {
lz = __builtin_clzll(mantL);
mantL <<= lz + 1; // shift out all leading zeroes and first leading 1
mantH = mantL >> 16;
mantL = mantL << 48;
lz += 48;
  } else {
lz = __builtin_clzll(mantH)-16;
mantH = (mantH << (lz+1)) | mantL >> (63-lz);
mantL = mantL << (lz + 1);
  }
  mantH &= MANT_H_MSK;
  *pMantH = mantH;
  *pMantL = mantL;
  return lz;
}

static __inline uint64_t d2u(double x) {
  uint64_t y;
  memcpy(&y, &x, sizeof(y));
  return y;
}

static __inline double u2d(uint64_t x) {
  double y;
  memcpy(&y, &x, sizeof(y));
  return y;
}

// calc_reciprocal
// calculate lower estimate for r = floor(2**64/(1 + mantHi*2**(-48) + mantLo*2**(-112)))
static __inline uint64_t calc_reciprocal(uint64_t mantHi, uint64_t mantLo)
{
  const uint64_t DBL_1PLUS = ((uint64_t)0x3FF << 52) + 18;
  uint64_t mant_u = (mantHi << 4) + DBL_1PLUS;
  uint64_t r1 = d2u(2.0/u2d(mant_u)) << 11; // scale = 2**64, nbits=64
  // r1 is low estimate
  // r1 is in range   [0x7fffe000..0xfffee000 ]
  // r1*x is in range [0.99595..0.999889]
  // Est. Precisions: ~47.81 bits

  //
  // Improve precision of the estimate by canonical 1st-order NR iteration
  // r = r1*(2-r1*x)
  // At this point precision is of high importance, so evaluation is done in a way
  // that minimizes impact of truncation.
  // The transformed equation is r = r1 + r1*(1-r1*x)
  //
  uint64_t mant64 = (mantHi << 16)|(mantLo >> 48); // Upper 64 bits of mantissa, not including leading 1
  uint64_t rx1= umulh64(r1,mant64) + r1 + 2;   // r1*(2**64+x+1), scale = 2**64; nbits=64
  // rx1 calculated with x2_u=x+1 instead of x2=x, in order to assure that
  // for any possible values of 48 LS bits of x, the estimate r2 is lower than exact values of r
  uint64_t r = r1 + umulh64(0-rx1, r1);// scale = 2**64, nbits=64
  // r is low estimate
  // r is in range [0x7fff..0xfffd]
  // r*x is in range [0.999_999_999_999_999_999_783..0.999_999_999_999_999_999_999_957]
  // Est. Precisions: ~62.00069 bits
  // max(r*x-1) ~= 3.998079 * 2**-64
  return r;
}

void my__divtf3(__float128* result, const __float128* num, const __float128*
den)
{
  typedef unsigned __int128 __uint128;
  __uint128 u_num, u_den;
  memcpy(&u_num, num, sizeof(u_num));
  memcpy(&u_den, den, sizeof(u_den));

  const uint64_t BIT_48   = (uint64_t)1 << 48;
  const uint64_t BIT_63   = (uint64_t)1 << 63;

  const uint64_t SIGN_BIT = BIT_63;
  const uint64_t SIGN_OFF_MSK = ~SIGN_BIT;
  const uint64_t MANT_H_MSK   = (uint64_t)-1 >> 16;
  const uint64_t EXP_MSK  = (uint64_t)0x7FFF  << 48;
  const uint64_t pNAN_MSW = (uint64_t)0x7FFF8 << 44;// +NaN
  const uint64_t pINF_MSW = EXP_MSK;// +Inf
  // const uint64_t nNAN_MSW = (uint64_t)0xFFFF8 << 44; // -NaN

  // parse num and handle special cases
  uint64_t numHi = (uint64_t)(u_num >> 64);
  uint64_t numLo = (uint64_t)u_num;
  uint64_t denHi = (uint64_t)(u_den >> 64);
  uint64_t denLo = (uint64_t)u_den;
  unsigned numSexp = numHi >> 48;
  unsigned numExp  = numSexp & 0x7FFF;
  int  inumExp = numExp;
  uint64_t numMantHi = numHi & MANT_H_MSK;
  uint64_t resSign = (numHi ^ denHi) & SIGN_BIT;
  if (numExp - 1 >= 0x7FFF - 1) { // special cases
if (numExp == 0x7FFF) { // inf or Nan
   if ((numMantHi | numLo)==0) { // inf
 numHi = resSign | pINF_MSW; // Inf with proper sign
 if ((denLo == 0) && ((denHi & SIGN_OFF_MSK)==pINF_MSW)) { // den inf
   numHi = pNAN_MSW; // Inf/Inf => NaN
 }
   }
   set__float128(result, numHi, numLo);
   return;
}

// numExp == 0 - zero or sub-normal
if (((numHi & SIGN_OFF_MSK) | numLo)==0) { // zero
  if ((denHi & EXP_MSK)==pINF_MSW) { // den = inf 

[Bug libquadmath/105101] incorrect rounding for sqrtq

2022-04-20 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105101

--- Comment #12 from Michael_S  ---
(In reply to Michael_S from comment #11)
> (In reply to Michael_S from comment #10)
> > BTW, the same ideas as in the code above could improve speed of division
> > operation (on modern 64-bit HW) by factor of 3 (on Intel) or 2 (on AMD).
> 
> Did it.
> On Intel it's even better than anticipated. 5x speedup on Haswell and
> Skylake,
> 4.5x on Ivy Bridge.
> Unfortunately, right now I have no connection to my Zen3 test system, so
> can't
> measure on it with final variant. But according to preliminary tests the
> speedup should be slightly better than 2x.

O.k.
It seems I have now fixed all the details of rounding.
Including the cases of sub-normal results.
Even ties now appear to be broken to even, as prescribed.
It could take a bit more polish, especially the comments, but even in this form
it is much better than what you have now.

#include <stdint.h>
#include <string.h>

static __inline uint64_t umulx64(uint64_t mult1, uint64_t mult2, int rshift)
{
  return (uint64_t)(((unsigned __int128)mult1 * mult2)>>rshift);
}

static __inline uint64_t umulh64(uint64_t mult1, uint64_t mult2)
{
  return umulx64(mult1, mult2, 64);
}

static __inline void set__float128(__float128* dst, uint64_t wordH, uint64_t
wordL) {
  unsigned __int128 res_u128 = ((unsigned __int128)wordH << 64) | wordL;
  memcpy(dst, &res_u128, sizeof(*dst)); // hopefully __int128 and __float128 have the same endianness
}

// return number of leading zeros
static __inline int normalize_float128(uint64_t* pMantH, uint64_t* pMantL) {
  const uint64_t MANT_H_MSK   = (uint64_t)-1 >> 16;
  uint64_t mantH = *pMantH, mantL = *pMantL;
  int lz;
  if (mantH == 0) {
lz = __builtin_clzll(mantL);
mantL <<= lz + 1; // shift out all leading zeroes and first leading 1
mantH = mantL >> 16;
mantL = mantL << 48;
lz += 48;
  } else {
lz = __builtin_clzll(mantH)-16;
mantH = (mantH << (lz+1)) | mantL >> (63-lz);
mantL = mantL << (lz + 1);
  }
  mantH &= MANT_H_MSK;
  *pMantH = mantH;
  *pMantL = mantL;
  return lz;
}

void my__divtf3(__float128* result, const __float128* num, const __float128*
den)
{
  typedef unsigned __int128 __uint128;
  __uint128 u_num, u_den;
  memcpy(&u_num, num, sizeof(u_num));
  memcpy(&u_den, den, sizeof(u_den));

  const uint64_t BIT_30   = (uint64_t)1 << 30;
  const uint64_t BIT_48   = (uint64_t)1 << 48;
  const uint64_t BIT_63   = (uint64_t)1 << 63;

  const uint64_t SIGN_BIT = BIT_63;
  const uint64_t SIGN_OFF_MSK = ~SIGN_BIT;
  const uint64_t MANT_H_MSK   = (uint64_t)-1 >> 16;
  const uint64_t EXP_MSK  = (uint64_t)0x7FFF  << 48;
  const uint64_t pNAN_MSW = (uint64_t)0x7FFF8 << 44;// +NaN
  const uint64_t pINF_MSW = EXP_MSK;// +Inf

  // parse num and handle special cases
  uint64_t numHi = (uint64_t)(u_num >> 64);
  uint64_t numLo = (uint64_t)u_num;
  uint64_t denHi = (uint64_t)(u_den >> 64);
  uint64_t denLo = (uint64_t)u_den;
  unsigned numSexp = numHi >> 48;
  unsigned numExp  = numSexp & 0x7FFF;
  int  inumExp = numExp;
  uint64_t numMantHi = numHi & MANT_H_MSK;
  uint64_t resSign = (numHi ^ denHi) & SIGN_BIT;
  if (numExp - 1 >= 0x7FFF - 1) {// special cases
if (numExp == 0x7FFF) {  // inf or Nan
   if ((numMantHi | numLo)==0) { // inf
 numHi = resSign | pINF_MSW; // Inf with proper sign
      if ((denLo == 0) && ((denHi & SIGN_OFF_MSK)==pINF_MSW)) { // den is inf
   numHi = pNAN_MSW; // Inf/Inf => NaN
 }
   }
   set__float128(result, numHi, numLo);
   return;
}

// numExp == 0 -> zero or sub-normal
if (((numHi & SIGN_OFF_MSK) | numLo)==0) { // zero
  if ((denHi & EXP_MSK)==pINF_MSW) {   // den = inf or Nan
if ((denLo == 0) && ((denHi & SIGN_OFF_MSK)==pINF_MSW)) { // den is inf
  numHi = resSign;  // return zero with proper sign
} else { // den is Nan
  // return den
  numLo = denLo;
  numHi = denHi;
}
  }
  if (((denHi & SIGN_OFF_MSK) | denLo)==0) { // den is zero
numHi = pNAN_MSW; // 0/0 => NaN
  }
  set__float128(result, numHi, numLo);
  return;
}

// num is sub-normal: normalize
    inumExp -= normalize_float128(&numMantHi, &numLo);
  }

  //
  // num is normal or normalized
  //

  // parse den and handle special cases
  unsigned denSexp   = denHi >> 48;
  unsigned denExp= denSexp & 0x7FFF;
  int  idenExp   = denExp;
  uint64_t denMantHi = denHi & MANT_H_MSK;
  if (denExp - 1 >= 0x7FFF - 1) {// special cases
if (denExp == 0x7FFF) {  // inf or Nan
   if ((denMantHi | denLo)==0) { // den is inf
 // return zero with proper sign
 denHi = resSign;
 denLo = 0;
   }
   // if den==Nan then return den
   set__float128(result, denHi, denLo);
   return;
}

// denExp == 0 -> zero or sub-normal
if (((denHi & SIGN_OFF_MSK) | denLo)==0) { // den is zero
  // return Inf with 

[Bug libquadmath/105101] incorrect rounding for sqrtq

2022-04-18 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105101

--- Comment #11 from Michael_S  ---
(In reply to Michael_S from comment #10)
> BTW, the same ideas as in the code above could improve speed of division
> operation (on modern 64-bit HW) by factor of 3 (on Intel) or 2 (on AMD).

Did it.
On Intel it's even better than anticipated: 5x speedup on Haswell and Skylake,
4.5x on Ivy Bridge.
Unfortunately, right now I have no connection to my Zen3 test system, so I
can't measure the final variant on it. But according to preliminary tests the
speedup should be slightly better than 2x.

[Bug libquadmath/105101] incorrect rounding for sqrtq

2022-04-16 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105101

--- Comment #10 from Michael_S  ---
BTW, the same ideas as in the code above could improve speed of division
operation (on modern 64-bit HW) by factor of 3 (on Intel) or 2 (on AMD).

[Bug libquadmath/105101] incorrect rounding for sqrtq

2022-04-15 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105101

--- Comment #9 from Michael_S  ---
(In reply to Michael_S from comment #4)
> If you want quick fix for immediate shipment then you can take that:
> 
> #include <math.h>
> #include <quadmath.h>
> 
> __float128 quick_and_dirty_sqrtq(__float128 x)
> {
>   if (isnanq(x))
> return x;
> 
>   if (x==0)
> return x;
> 
>   if (x < 0)
> return nanq("");
> 
>   if (isinfq(x))
> return x;
> 
>   int xExp;
>   x = frexpq(x, &xExp);
>   if (xExp & 1)
> x *= 2.0; // x in [0.5:2.0)
> 
>   __float128 r = (__float128)(1.0/sqrt((double)x)); // r=rsqrt(x) estimate
> (53 bits)
>   r *= 1.5 - r*r*x*0.5; // NR iteration improves precision of r to 105.4 bit
> 
>   __float128 y  = x*r;  // y=sqrt(x) estimate (105.4 bits)
> 
>   // extended-precision NR iteration
>   __float128 yH = (double)y;
>   __float128 yL = y - yH;
> 
>   __float128 deltaX = x - yH*yH;
>   deltaX -= yH*yL*2;
>   deltaX -= yL*yL;
>   y += deltaX*r*0.5; // improve precision of y to ~210.2 bits. Not enough
> for perfect rounding, but not too bad
> 
>   return ldexpq(y, xExp >> 1);
> }
> 
> It is very slow, even slower than what you have now, which by itself is
> quite astonishingly slow.
> It is also not sufficiently precise for correct rounding in all cases.
> But, at least, the worst error is something like (0.5+2**-98) ULP, so you are
> unlikely to be ever caught by black box type of testing.
> It's biggest advantage is extreme portability.
> Should run on all platforms where double==IEEE binary64 and __float128 ==
> IEEE binary128.
> 
> May be, few days later I'll have better variant for "good" 64-bit platforms
> i.e. for those where we have __int128.
> It would be 15-25 times faster than the variant above and rounding would be
> mathematically correct rather than just "impossible to be caught" like above.
> But it would not run everywhere.
> Also, I want to give it away under MIT or BSD license, rather than under GPL.

Here.

#include <stdint.h>
#include <string.h>

static __inline uint64_t umulx64(uint64_t mult1, uint64_t mult2, int rshift)
{
  return (uint64_t)(((unsigned __int128)mult1 * mult2)>>rshift);
}

static __inline uint64_t umulh64(uint64_t mult1, uint64_t mult2)
{
  return umulx64(mult1, mult2, 64);
}

static __inline __float128 ret__float128(uint64_t wordH, uint64_t wordL) {
  unsigned __int128 res_u128 = ((unsigned __int128)wordH << 64) | wordL;
  __float128 result;
  memcpy(&result, &res_u128, sizeof(result)); // hopefully __int128 and __float128 have the same endianness
  return result;
}

__float128 slow_and_clean_sqrtq(__float128 src)
{
  typedef unsigned __int128 __uint128;
  const uint64_t SIGN_BIT   = (uint64_t)1  << 63;
  const uint64_t MANT_H_MSK = (uint64_t)-1 >> 16;
  const uint64_t nNAN_MSW = (uint64_t)0xFFFF8 << 44; // -NaN

  __uint128 src_u128;
  memcpy(&src_u128, &src, sizeof(src_u128));
  uint64_t mshalf = (uint64_t)(src_u128 >> 64); // MS half
  uint64_t mantL  = (uint64_t)(src_u128);   // LS half

  // parse MS half of source
  unsigned exp   = mshalf >> 48;
  uint64_t mantH = mshalf & MANT_H_MSK;
  unsigned expx = exp + 0x3FFF;
  if (exp - 1 >= 0x7FFF-1) { // special cases: exp = 0, exp=max_exp or sign=1
if (exp > 0x7fff) {  // negative
      if (exp==0xffff) { // NaN or -Inf
if ((mantH | mantL)==0) // -Inf
  mshalf = nNAN_MSW;
        return ret__float128(mshalf, mantL); // in case of NaN leave the number intact
  }
  if (mshalf==SIGN_BIT && mantL==0) // -0
        return ret__float128(mshalf, mantL); // in case of -0 leave the number intact
  // normal or sub-normal
  mantL = 0;
  mshalf = nNAN_MSW;
  return ret__float128(mshalf, mantL);
}
// non-negative
    if (exp == 0x7fff) // NaN or +Inf
      return ret__float128(mshalf, mantL); // in case of positive NaN or Inf leave the number intact

// exp==0 : zero or subnormal
if ((mantH | mantL)==0) // +0
      return ret__float128(mshalf, mantL); // in case of +0 leave the number intact

// positive subnormal
// normalize
if (mantH == 0) {
  expx -= 48;
  int lz = __builtin_clzll(mantL);
  expx -= lz;
  mantL <<= lz + 1;
  mantH = mantL >> 16;
  mantL = mantL << 48;
} else {
  int lz = __builtin_clzll(mantH)-16;
  expx -= lz;
  mantH = (mantH << (lz+1)) | mantL >> (63-lz);
  mantL = mantL << (lz + 1);
}
mantH &= MANT_H_MSK;
  }

  // Normal case
  int e = expx & 1; // e= LS bit of exponent

  const uint64_t BIT_30 = (uint64_t)1 << 30;
  static const uint8_t rsqrt_tab[64] = { // floor(1/sqrt(1+i/32)*512) - 256
   252, 248, 244, 240,237, 233, 230, 226,
   223, 220, 216, 213,210, 207, 204, 201,
   199, 196, 193, 190,188, 185, 183, 180,
   178, 175, 173, 171,168, 166, 164, 162,
   159, 157, 155, 153,151, 149, 147, 145,
   143, 141, 139, 138,136, 134, 132, 131,
   129, 127, 125, 124,122, 121, 119, 117,
   116, 114, 113, 111,110, 108, 107, 106,
 };

  // Look for approximate value of rsqrt(x)=1/sqrt(x)
  // 

[Bug libquadmath/105101] incorrect rounding for sqrtq

2022-04-02 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105101

Michael_S  changed:

   What|Removed |Added

 CC||already5chosen at yahoo dot com

--- Comment #4 from Michael_S  ---
If you want quick fix for immediate shipment then you can take that:

#include <math.h>
#include <quadmath.h>

__float128 quick_and_dirty_sqrtq(__float128 x)
{
  if (isnanq(x))
return x;

  if (x==0)
return x;

  if (x < 0)
return nanq("");

  if (isinfq(x))
return x;

  int xExp;
  x = frexpq(x, &xExp);
  if (xExp & 1)
x *= 2.0; // x in [0.5:2.0)

  __float128 r = (__float128)(1.0/sqrt((double)x)); // r=rsqrt(x) estimate (53 bits)
  r *= 1.5 - r*r*x*0.5; // NR iteration improves precision of r to 105.4 bit

  __float128 y  = x*r;  // y=sqrt(x) estimate (105.4 bits)

  // extended-precision NR iteration
  __float128 yH = (double)y;
  __float128 yL = y - yH;

  __float128 deltaX = x - yH*yH;
  deltaX -= yH*yL*2;
  deltaX -= yL*yL;
  y += deltaX*r*0.5; // improve precision of y to ~210.2 bits. Not enough for perfect rounding, but not too bad

  return ldexpq(y, xExp >> 1);
}

It is very slow, even slower than what you have now, which by itself is quite
astonishingly slow.
It is also not sufficiently precise for correct rounding in all cases.
But, at least, the worst error is something like (0.5+2**-98) ULP, so you are
unlikely to be ever caught by black-box type of testing.
Its biggest advantage is extreme portability.
It should run on all platforms where double == IEEE binary64 and __float128 ==
IEEE binary128.
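
For what it's worth, a simple self-check sketch (my own harness, not part of
the comment; link with -lquadmath): any y with at most ~56 significant bits has
an exactly representable square in binary128, so a correctly rounded square
root of y*y must return y exactly.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

__float128 quick_and_dirty_sqrtq(__float128 x);  /* the routine above */

int main(void)
{
  int fails = 0;
  for (int i = 0; i < 1000000; ++i) {
    /* crude sampler: up to ~55-56 random bits */
    uint64_t m = (((uint64_t)rand() << 32) ^ (uint64_t)rand()) >> 8;
    __float128 y = (__float128)(m | 1);   /* odd, so y*y is exact in binary128 */
    if (quick_and_dirty_sqrtq(y * y) != y)
      ++fails;
  }
  printf("%d exact-square mismatches\n", fails);
  return 0;
}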

Maybe, a few days later I'll have a better variant for "good" 64-bit platforms,
i.e. for those where we have __int128.
It would be 15-25 times faster than the variant above and the rounding would be
mathematically correct rather than just "impossible to be caught" like above.
But it would not run everywhere.
Also, I want to give it away under the MIT or BSD license, rather than under
the GPL.

[Bug tree-optimization/97832] AoSoA complex caxpy-like loops: AVX2+FMA -Ofast 7 times slower than -O3

2020-11-19 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97832

--- Comment #10 from Michael_S  ---
I lost track of what you're talking about a long time ago.
But that's o.k.

[Bug target/97832] AoSoA complex caxpy-like loops: AVX2+FMA -Ofast 7 times slower than -O3

2020-11-16 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97832

--- Comment #3 from Michael_S  ---
(In reply to Richard Biener from comment #2)
> It's again reassociation making a mess out of the natural SLP opportunity
> (and thus SLP discovery fails miserably).
> 
> One idea worth playing with would be to change reassociation to rank
> references
> from the same load group (as later vectorization would discover) the same.
> 
> That said, further analysis and maybe a smaller testcase to look at is useful
> here.  There is, after all, the opportunity to turn "bad" association at the
> source level to good for vectorization when -ffast-math is enabled as well.

It turned out that a much simpler kernel suffers from the same problem.

void foo1x1(double* restrict y, const double* restrict x, int clen)
{
  int xi = clen & 2;
  double f_re = x[0+xi+0];
  double f_im = x[4+xi+0];
  int clen2 = (clen+xi) * 2;
  #pragma GCC unroll 0
  for (int c = 0; c < clen2; c += 8) {
// y[c] = y[c] - x[c]*conj(f);
#pragma GCC unroll 4
for (int k = 0; k < 4; ++k) {
  double x_re = x[c+0+k];
  double x_im = x[c+4+k];
  double y_re = y[c+0+k];
  double y_im = y[c+4+k];
  y_re = y_re - x_re * f_re - x_im * f_im;;
  y_im = y_im + x_re * f_im - x_im * f_re;
  y[c+0+k] = y_re;
  y[c+4+k] = y_im;
}
  }
}

Maybe it's possible to simplify it further, but probably not by much.
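
For reference, the plain AoS formulation of what this kernel computes (the
y[c] -= x[c]*conj(f) from the comment in the code), written with C99 complex.
This is a readability sketch only, not a drop-in replacement for the AoSoA
version above:

#include <complex.h>

void foo1x1_aos(double complex* restrict y, const double complex* restrict x,
                double complex f, int n)
{
  for (int c = 0; c < n; ++c)
    y[c] -= x[c] * conj(f);   /* same arithmetic as the unrolled re/im code above */
}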

[Bug target/79173] add-with-carry and subtract-with-borrow support (x86_64 and others)

2020-11-15 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79173

--- Comment #9 from Michael_S  ---
Despite what I wrote above, I did take a look at the trunk on godbolt with the
same old code from a year ago. Because it was so easy. And indeed the trunk
looks A LOT better.
But until it's released I won't know whether it's actually up to the speed of
MSVC and clang.

[Bug target/79173] add-with-carry and subtract-with-borrow support (x86_64 and others)

2020-11-15 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79173

--- Comment #8 from Michael_S  ---
(In reply to Jakub Jelinek from comment #7)
> (In reply to Michael_S from comment #5)
> > I agree with regard to "other targets", first of all, aarch64, but x86_64
> > variant of gcc already provides requested functionality in for of
> > _subborrow_u64 () and _addcarry_u64() intrinsic functions.
> > The problem here is not lack of functionality, but very poor implementation
> > (mentioned many times on bugzilla with minimal effect).
> > In that regard gcc is more than decade behind MSVC and ~4 years behind
> > clang/llvm. Surprisingly, icc is also quite bad.
> 
> Are you sure you have tested gcc trunk?
> There have been fixes for this a month ago as part of PR97387 fixes.

I am sure I did not :-)
And I most likely am not going to test trunk, sorry. I'd wait for a release.
Essentially I am posting here not because I deeply care about the topic right
now (I did care a year or so ago, but lost interest since then, see
https://www.realworldtech.com/forum/?threadid=188061=188061), but
because I stumbled upon it accidentally while waiting for a response to 97832.

So, sorry for the intervention.

[Bug target/79173] add-with-carry and subtract-with-borrow support (x86_64 and others)

2020-11-15 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79173

--- Comment #6 from Michael_S  ---
(In reply to Marc Glisse from comment #1)
> We could start with the simpler:
> 
> void f(unsigned*__restrict__ r,unsigned*__restrict__ s,unsigned a,unsigned
> b,unsigned c, unsigned d){
>   *r=a+b;
>   *s=c+d+(*r<a);
> }
> 

That works for dual-precision addition, but not for triple or more.
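
To spell out why: with three or more limbs the middle limb needs a carry-out
that can come either from the limb addition itself or from adding the incoming
carry, so a single compare no longer captures it. A sketch of the triple-limb
case (my example, mirroring the style of the sub1/sub2 snippets further down
this thread):

typedef unsigned long long T;

void add3(T *p, T u0, T u1, T u2, T v0, T v1, T v2)
{
  T t0 = u0 + v0;
  int c0 = t0 < u0;          /* carry out of limb 0 */
  T t1 = u1 + v1;
  int c1 = t1 < u1;          /* carry from the limb addition itself... */
  t1 += c0;
  c1 |= t1 < (T)c0;          /* ...or from adding the incoming carry */
  p[0] = t0;
  p[1] = t1;
  p[2] = u2 + v2 + c1;       /* ideally: add; adc; adc */
}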

> After combine, we have:
> 
> (insn 34 12 20 2 (set (reg:SI 93 [ _15+4 ])
> (ltu:SI (reg:CCC 17 flags)
> (const_int 0 [0]))) 608 {*setcc_si_1_movzbl}
>  (expr_list:REG_DEAD (reg:CCC 17 flags)
> (nil)))
> 
> (insn 21 20 22 2 (parallel [
> (set (reg:SI 102)
> (plus:SI (reg:SI 37 r8 [ c ])
> (reg:SI 38 r9 [ d ])))
> (clobber (reg:CC 17 flags))
> ]) "a.c":3 213 {*addsi_1}
>  (expr_list:REG_DEAD (reg:SI 38 r9 [ d ])
> (expr_list:REG_UNUSED (reg:CC 17 flags)
> (expr_list:REG_DEAD (reg:SI 37 r8 [ c ])
> (nil)
> 
> (insn 25 24 26 2 (parallel [
> (set (reg:SI 105)
> (plus:SI (reg:SI 102)
> (reg:SI 93 [ _15+4 ])))
> (clobber (reg:CC 17 flags))
> ]) "a.c":3 213 {*addsi_1}
>  (expr_list:REG_DEAD (reg:SI 93 [ _15+4 ])
> (expr_list:REG_UNUSED (reg:CC 17 flags)
> (expr_list:REG_DEAD (reg:SI 102)
> (nil)
> 
> The combine dump says "Trying 21, 34 -> 25:" but the next line is blank and
> it moves on to trying something else.
> 
> If I use parentheses *s=c+(d+(*r<a))
> Trying 23, 24 -> 25:
> Successfully matched this instruction:
> (parallel [
> (set (reg:SI 105)
> (plus:SI (plus:SI (ltu:SI (reg:CCC 17 flags)
> (const_int 0 [0]))
> (reg:SI 37 r8 [ c ]))
> (reg:SI 38 r9 [ d ])))
> (clobber (reg:CC 17 flags))
> ])
> Instruction not appropriate for target.
> 
> I don't know where that target restriction is coming from, but at least we
> seem to be getting somewhere.
> 
> If I remove c and keep *s=d+(*r<a)
> Failed to match this instruction:
> (parallel [
> (set (reg:SI 103)
> (plus:SI (ltu:SI (reg:CCC 17 flags)
> (const_int 0 [0]))
> (reg:SI 38 r9 [ d ])))
> (clobber (reg:CC 17 flags))
> ])
> Failed to match this instruction:
> (set (reg:SI 103)
> (plus:SI (ltu:SI (reg:CCC 17 flags)
> (const_int 0 [0]))
> (reg:SI 38 r9 [ d ])))
> 
> we would probably need a special pattern for this case, virtually adding 0.

[Bug target/79173] add-with-carry and subtract-with-borrow support (x86_64 and others)

2020-11-15 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79173

Michael_S  changed:

   What|Removed |Added

 CC||already5chosen at yahoo dot com

--- Comment #5 from Michael_S  ---
(In reply to Vincent Lefèvre from comment #0)
> There should be a way to support full add-with-carry and
> subtract-with-borrow by generating adc / sbb instructions on x86_64 (and
> similar instructions on other targets).
> 
> GCC could add builtins, such as __builtin_addc* and __builtin_subc* (two
> arguments, carry in, carry out, and the result), similar to Clang:
> http://clang.llvm.org/docs/LanguageExtensions.html#multiprecision-arithmetic-
> builtins
> as suggested in PR 60206 comment 3.
> 
> Detection of special constructs in standard C/... code would be useful too.
> Here are some examples from
> https://gcc.gnu.org/ml/gcc-help/2017-01/msg00067.html for subtraction:
> 
> typedef unsigned long T;
> 
> void sub1 (T *p, T u0, T u1, T u2, T v0, T v1, T v2)
> {
>   T t1;
>   int b0, b1;
> 
>   p[0] = u0 - v0;
>   b0 = u0 < v0;
>   t1 = u1 - v1;
>   b1 = u1 < v1;
>   p[1] = t1 - b0;
>   b1 |= p[1] > t1;
>   p[2] = u2 - v2 - b1;
> }
> 
> void sub2 (T *p, T u0, T u1, T u2, T v0, T v1, T v2)
> {
>   int b0, b1;
> 
>   p[0] = u0 - v0;
>   b0 = u0 < v0;
>   p[1] = u1 - v1 - b0;
>   b1 = u1 < v1 || (u1 == v1 && b0 != 0);
>   p[2] = u2 - v2 - b1;
> }
> 
> In the second example, the b1 line could also be replaced by:
> 
>   b1 = u1 < v1 + b0 || v1 + b0 < v1;
> 
> For the subtractions, optimal code would contain 1 sub and 2 sbb's.

I agree with regard to "other targets", first of all, aarch64, but x86_64
variant of gcc already provides requested functionality in for of
_subborrow_u64 () and _addcarry_u64() intrinsic functions.
The problem here is not lack of functionality, but very poor implementation
(mentioned many times on bugzilla with minimal effect).
In that regard gcc is more than decade behind MSVC and ~4 years behind
clang/llvm. Surprisingly, icc is also quite bad.
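
For reference, a sketch of the addition counterpart of the sub1/sub2 examples
quoted above, written with the intrinsic gcc already exposes (the subtraction
version would use _subborrow_u64 in the same pattern). With good code
generation this should compile to 1 add + 2 adc:

#include <x86intrin.h>

typedef unsigned long long T;

void add3i(T *p, T u0, T u1, T u2, T v0, T v1, T v2)
{
  unsigned char c;
  c = _addcarry_u64(0, u0, v0, &p[0]);
  c = _addcarry_u64(c, u1, v1, &p[1]);
  (void)_addcarry_u64(c, u2, v2, &p[2]);  /* final carry-out ignored */
}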

[Bug target/97832] New: AoSoA complex caxpy-like loops: AVX2+FMA -Ofast 7 times slower than -O3

2020-11-14 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97832

Bug ID: 97832
   Summary: AoSoA complex caxpy-like loops: AVX2+FMA -Ofast 7
times slower than -O3
   Product: gcc
   Version: 10.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: already5chosen at yahoo dot com
  Target Milestone: ---

I am reporting under 'target' because AVX2+FMA is the only 256-bit SIMD
platform I have to play with. If it's really tree-optimization, please change.

void foo(double* restrict y, const double* restrict x0, const double* restrict
x1, int clen)
{
  int xi = clen & 2;
  double f00_re = x0[0+xi+0];
  double f10_re = x1[0+xi+0];
  double f01_re = x0[0+xi+1];
  double f11_re = x1[0+xi+1];
  double f00_im = x0[4+xi+0];
  double f10_im = x1[4+xi+0];
  double f01_im = x0[4+xi+1];
  double f11_im = x1[4+xi+1];
  int clen2 = (clen+xi) * 2;
  double* y0 = &y[0];
  double* y1 = &y[clen2];
  #pragma GCC unroll 0
  for (int c = 0; c < clen2; c += 8) {
// y0[c] = y0[c] - x0[c]*conj(f00) - x1[c]*conj(f10);
// y1[c] = y1[c] - x0[c]*conj(f01) - x1[c]*conj(f11);
#pragma GCC unroll 4
for (int k = 0; k < 4; ++k) {
  double x0_re = x0[c+0+k];
  double x0_im = x0[c+4+k];
  double y0_re = y0[c+0+k];
  double y0_im = y0[c+4+k];
  double y1_re = y1[c+0+k];
  double y1_im = y1[c+4+k];
  y0_re = y0_re - x0_re * f00_re - x0_im * f00_im;
  y0_im = y0_im + x0_re * f00_im - x0_im * f00_re;
  y1_re = y1_re - x0_re * f01_re - x0_im * f01_im;
  y1_im = y1_im + x0_re * f01_im - x0_im * f01_re;
  double x1_re = x1[c+0+k];
  double x1_im = x1[c+4+k];
  y0_re = y0_re - x1_re * f10_re - x1_im * f10_im;
  y0_im = y0_im + x1_re * f10_im - x1_im * f10_re;
  y1_re = y1_re - x1_re * f11_re - x1_im * f11_im;
  y1_im = y1_im + x1_re * f11_im - x1_im * f11_re;
  y0[c+0+k] = y0_re;
  y0[c+4+k] = y0_im;
  y1[c+0+k] = y1_re;
  y1[c+4+k] = y1_im;
}
  }
}

When compiled with 'gcc 10.2 -march=skylake -O3' it produces pretty decent
code. The only problem is over-aggressive load+op combining, similar to what we
already discussed in 97127. It seems this problem can't be solved without a
major overhaul of the gcc optimizer architecture, but luckily the impact is
quite minor.
But when we compile with 'gcc 10.2 -march=skylake -Ofast' the fun begins:

.L5:
vmovupd (%r9), %ymm7
vmovupd 64(%r9), %ymm6
vunpcklpd   32(%r9), %ymm7, %ymm2
vunpckhpd   32(%r9), %ymm7, %ymm0
vmovupd 64(%r9), %ymm7
vmovupd 192(%r9), %ymm4
vunpckhpd   96(%r9), %ymm7, %ymm5
vmovupd 128(%r9), %ymm7
vunpcklpd   96(%r9), %ymm6, %ymm6
vunpcklpd   160(%r9), %ymm7, %ymm3
vunpckhpd   160(%r9), %ymm7, %ymm1
vmovupd 192(%r9), %ymm7
vunpcklpd   224(%r9), %ymm4, %ymm4
vunpckhpd   224(%r9), %ymm7, %ymm8
vpermpd $216, %ymm6, %ymm6
vpermpd $216, %ymm5, %ymm5
vpermpd $216, %ymm4, %ymm4
vpermpd $216, %ymm8, %ymm8
vpermpd $216, %ymm2, %ymm2
vpermpd $216, %ymm0, %ymm0
vpermpd $216, %ymm3, %ymm3
vpermpd $216, %ymm1, %ymm1
vunpcklpd   %ymm6, %ymm2, %ymm7
vunpckhpd   %ymm6, %ymm2, %ymm2
vunpcklpd   %ymm4, %ymm3, %ymm6
vunpckhpd   %ymm4, %ymm3, %ymm3
vunpcklpd   %ymm5, %ymm0, %ymm4
vunpckhpd   %ymm5, %ymm0, %ymm0
vunpcklpd   %ymm8, %ymm1, %ymm5
vpermpd $216, %ymm5, %ymm5
vpermpd $216, %ymm4, %ymm4
vpermpd $216, %ymm3, %ymm3
vunpcklpd   %ymm5, %ymm4, %ymm11
vpermpd $216, %ymm2, %ymm2
vunpckhpd   %ymm5, %ymm4, %ymm4
vunpckhpd   %ymm8, %ymm1, %ymm1
vpermpd $216, %ymm0, %ymm0
vpermpd $216, %ymm4, %ymm8
vpermpd $216, %ymm1, %ymm1
vunpcklpd   %ymm3, %ymm2, %ymm4
vunpckhpd   %ymm3, %ymm2, %ymm2
vpermpd $216, %ymm2, %ymm5
vunpcklpd   %ymm1, %ymm0, %ymm2
vpermpd $216, %ymm4, %ymm10
vpermpd $216, %ymm2, %ymm4
vmovupd 64(%rax), %ymm2
vmovupd (%rax), %ymm3
vmovupd %ymm4, 448(%rsp)
vunpckhpd   96(%rax), %ymm2, %ymm4
vmovupd 128(%rax), %ymm2
vpermpd $216, %ymm6, %ymm6
vunpckhpd   %ymm1, %ymm0, %ymm1
vpermpd $216, %ymm7, %ymm7
vunpcklpd   32(%rax), %ymm3, %ymm9
vunpckhpd   32(%rax), %ymm3, %ymm14
vunpckhpd   160(%rax), %ymm2, %ymm0
vmovupd 64(%rax), %ymm3
vunpcklpd   %ymm6, %ymm7, %ymm12
vunpckhpd   %ymm6, %ymm7, %ymm7
vpermpd $216, %ymm1, %ymm6
vunpcklpd   160(%rax), %ymm2, %ymm1
vmovupd 192(%rax), %ymm2
vunpcklpd   96(%rax), %ymm3, %ymm3
vmovupd 

[Bug tree-optimization/97428] -O3 is great for basic AoSoA packing of complex arrays, but horrible one step above the basic

2020-10-16 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97428

--- Comment #9 from Michael_S  ---
Hopefully, you did regression tests for all main AoS<->SoA cases.
I.e.

typedef struct { double re, im; } dcmlx_t;
void soa2aos(double* restrict dstRe, double* restrict dstIm, const dcmlx_t
src[], int nq)
{
  for (int i = 0; i < nq*4; ++i) {
dcmlx_t s = src[i];
dstRe[i] = s.re;
dstIm[i] = s.im;
  }
}

void aos2soa(dcmlx_t* restrict dst, const double* srcRe, const double* srcIm,
int nq)
{
  for (int i = 0; i < nq*4; ++i) {
dst[i].re = srcRe[i];
dst[i].im = srcIm[i];
  }
}

And equivalents with float instead of double.
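
A sketch of the float equivalents just mentioned (my guess at the intended
shape: I assumed 8 floats per 256-bit chunk; everything else mirrors the double
version):

typedef struct { float re, im; } fcmlx_t;

void soa2aos_f(float* restrict dstRe, float* restrict dstIm, const fcmlx_t src[], int nq)
{
  for (int i = 0; i < nq*8; ++i) {
    fcmlx_t s = src[i];
    dstRe[i] = s.re;
    dstIm[i] = s.im;
  }
}

void aos2soa_f(fcmlx_t* restrict dst, const float* srcRe, const float* srcIm, int nq)
{
  for (int i = 0; i < nq*8; ++i) {
    dst[i].re = srcRe[i];
    dst[i].im = srcIm[i];
  }
}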

Right now 'gcc 10.2 -march=skylake -O3' does a very good job for soa2aos() and
a suboptimal, but not horrible, job for aos2soa(). Hopefully, your changes do
not break any of it.

Personally, I don't like SoA layouts and very much prefer AoSoA, but I
recognize that in existing code bases SoA is more common.

[Bug tree-optimization/97428] -O3 is great for basic AoSoA packing of complex arrays, but horrible one step above the basic

2020-10-15 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97428

--- Comment #6 from Michael_S  ---
(In reply to Richard Biener from comment #4)
> 
> while the lack of cross-lane shuffles in AVX2 requires a
> 
> .L3:
> vmovupd (%rsi,%rax), %xmm5
> vmovupd 32(%rsi,%rax), %xmm6
> vinsertf128 $0x1, 16(%rsi,%rax), %ymm5, %ymm1
> vinsertf128 $0x1, 48(%rsi,%rax), %ymm6, %ymm3
> vmovupd (%rcx,%rax), %xmm7
> vmovupd 32(%rcx,%rax), %xmm5
> vinsertf128 $0x1, 16(%rcx,%rax), %ymm7, %ymm0
> vinsertf128 $0x1, 48(%rcx,%rax), %ymm5, %ymm2
> vunpcklpd   %ymm3, %ymm1, %ymm4
> vunpckhpd   %ymm3, %ymm1, %ymm1
> vpermpd $216, %ymm4, %ymm4
> vpermpd $216, %ymm1, %ymm1
> vmovupd %xmm4, (%rdi,%rax,2)
> vextractf128$0x1, %ymm4, 16(%rdi,%rax,2)
> vmovupd %xmm1, 32(%rdi,%rax,2)
> vextractf128$0x1, %ymm1, 48(%rdi,%rax,2)
> vunpcklpd   %ymm2, %ymm0, %ymm1
> vunpckhpd   %ymm2, %ymm0, %ymm0
> vpermpd $216, %ymm1, %ymm1
> vpermpd $216, %ymm0, %ymm0
> vmovupd %xmm1, 64(%rdi,%rax,2)
> vextractf128$0x1, %ymm1, 80(%rdi,%rax,2)
> vextractf128$0x1, %ymm0, 112(%rdi,%rax,2)
> vmovupd %xmm0, 96(%rdi,%rax,2)
> addq$64, %rax
> cmpq%rax, %rdx
> jne .L3
> 

I don't follow. AVX2 indeed can't transpose with 4 shuffles per loop iteration,
but it is fully capable of doing it with 8 shuffles per iteration, as you
demonstrated a few posts above:
.L8:
vmovupd (%rdi), %ymm1
vmovupd 32(%rdi), %ymm4
vmovupd (%rax), %ymm0
vmovupd 32(%rax), %ymm3
vunpcklpd   %ymm4, %ymm1, %ymm2
vunpckhpd   %ymm4, %ymm1, %ymm1
vpermpd $216, %ymm1, %ymm1
vmovupd %ymm1, 32(%r8)
vunpcklpd   %ymm3, %ymm0, %ymm1
vunpckhpd   %ymm3, %ymm0, %ymm0
vpermpd $216, %ymm2, %ymm2
vpermpd $216, %ymm1, %ymm1
vpermpd $216, %ymm0, %ymm0
addq$64, %rdi
vmovupd %ymm2, (%r8)
vmovupd %ymm1, 64(%r8)
vmovupd %ymm0, 96(%r8)
addq$64, %rax
subq$-128, %r8
cmpq%rdi, %rdx
jne .L8

[Bug tree-optimization/97428] -O3 is great for basic AoSoA packing of complex arrays, but horrible one step above the basic

2020-10-15 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97428

--- Comment #5 from Michael_S  ---
(In reply to Richard Biener from comment #4)
> I have a fix that, with -mavx512f generates just
> 
> .L3:
> vmovupd (%rcx,%rax), %zmm0
> vpermpd (%rsi,%rax), %zmm1, %zmm2
> vpermpd %zmm0, %zmm1, %zmm0
> vmovupd %zmm2, (%rdi,%rax,2)
> vmovupd %zmm0, 64(%rdi,%rax,2)
> addq$64, %rax
> cmpq%rax, %rdx
> jne .L3
> 

This particular kernel is less interesting on AVX512, because under AVX512 the
natural AoSoA layout is different.

typedef struct { double re, im; } dcmlx_t;
typedef struct { double re[8], im[8]; } dcmlx8_t;

void foo512(dcmlx8_t dst[], const dcmlx_t src[], int n)
{
  for (int i = 0; i < n; ++i) {
dcmlx_t s0 = src[i*8+0];
dcmlx_t s1 = src[i*8+1];
dcmlx_t s2 = src[i*8+2];
dcmlx_t s3 = src[i*8+3];
dcmlx_t s4 = src[i*8+4];
dcmlx_t s5 = src[i*8+5];
dcmlx_t s6 = src[i*8+6];
dcmlx_t s7 = src[i*8+7];

dst[i].re[0] = s0.re;
dst[i].re[1] = s1.re;
dst[i].re[2] = s2.re;
dst[i].re[3] = s3.re;
dst[i].re[4] = s4.re;
dst[i].re[5] = s5.re;
dst[i].re[6] = s6.re;
dst[i].re[7] = s7.re;

dst[i].im[0] = s0.im;
dst[i].im[1] = s1.im;
dst[i].im[2] = s2.im;
dst[i].im[3] = s3.im;
dst[i].im[4] = s4.im;
dst[i].im[5] = s5.im;
dst[i].im[6] = s6.im;
dst[i].im[7] = s7.im;
  }
}

And, respectively:
typedef struct { double re, im; } dcmlx_t;
typedef struct { double re[8], im[8]; } dcmlx8_t;

void foo512_i2(dcmlx8_t dst[], const dcmlx_t src[], int n)
{
  for (int i = 0; i < n; ++i) {
dcmlx_t s00 = src[i*8+0];
dcmlx_t s01 = src[i*8+1];
dcmlx_t s02 = src[i*8+2];
dcmlx_t s03 = src[i*8+3];
dcmlx_t s04 = src[i*8+4];
dcmlx_t s05 = src[i*8+5];
dcmlx_t s06 = src[i*8+6];
dcmlx_t s07 = src[i*8+7];

dcmlx_t s10 = src[i*8+0+n*8];
dcmlx_t s11 = src[i*8+1+n*8];
dcmlx_t s12 = src[i*8+2+n*8];
dcmlx_t s13 = src[i*8+3+n*8];
dcmlx_t s14 = src[i*8+4+n*8];
dcmlx_t s15 = src[i*8+5+n*8];
dcmlx_t s16 = src[i*8+6+n*8];
dcmlx_t s17 = src[i*8+7+n*8];

dst[i*2+0].re[0] = s00.re;
dst[i*2+0].re[1] = s01.re;
dst[i*2+0].re[2] = s02.re;
dst[i*2+0].re[3] = s03.re;
dst[i*2+0].re[4] = s04.re;
dst[i*2+0].re[5] = s05.re;
dst[i*2+0].re[6] = s06.re;
dst[i*2+0].re[7] = s07.re;

dst[i*2+0].im[0] = s00.im;
dst[i*2+0].im[1] = s01.im;
dst[i*2+0].im[2] = s02.im;
dst[i*2+0].im[3] = s03.im;
dst[i*2+0].im[4] = s04.im;
dst[i*2+0].im[5] = s05.im;
dst[i*2+0].im[6] = s06.im;
dst[i*2+0].im[7] = s07.im;

dst[i*2+1].re[0] = s10.re;
dst[i*2+1].re[1] = s11.re;
dst[i*2+1].re[2] = s12.re;
dst[i*2+1].re[3] = s13.re;
dst[i*2+1].re[4] = s14.re;
dst[i*2+1].re[5] = s15.re;
dst[i*2+1].re[6] = s16.re;
dst[i*2+1].re[7] = s17.re;

dst[i*2+1].im[0] = s10.im;
dst[i*2+1].im[1] = s11.im;
dst[i*2+1].im[2] = s12.im;
dst[i*2+1].im[3] = s13.im;
dst[i*2+1].im[4] = s14.im;
dst[i*2+1].im[5] = s15.im;
dst[i*2+1].im[6] = s16.im;
dst[i*2+1].im[7] = s17.im;
  }
}

[Bug target/97428] New: -O3 is great for basic AoSoA packing of complex arrays, but horrible one step above the basic

2020-10-14 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97428

Bug ID: 97428
   Summary: -O3 is great for basic AoSoA packing of complex
arrays, but horrible one step above the basic
   Product: gcc
   Version: 10.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: already5chosen at yahoo dot com
  Target Milestone: ---

That's my next example of bad handling of the AoSoA layout by the gcc
optimizer/vectorizer.
For a discussion of AoSoA see https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97343

The issue at hand is the transformation (packing) of complex AoS numbers into
the AoSoA format.
Compiler used: gcc 10.2
Target: AVX2 (Skylake)

Part 1.
typedef struct { double re, im; } dcmlx_t;
typedef struct { double re[4], im[4]; } dcmlx4_t;

void foo(dcmlx4_t dst[], const dcmlx_t src[], int n)
{
  for (int i = 0; i < n; ++i) {
dcmlx_t s00 = src[i*4+0];
dcmlx_t s01 = src[i*4+1];
dcmlx_t s02 = src[i*4+2];
dcmlx_t s03 = src[i*4+3];

dcmlx_t s10 = src[i*4+0+n];
dcmlx_t s11 = src[i*4+1+n];
dcmlx_t s12 = src[i*4+2+n];
dcmlx_t s13 = src[i*4+3+n];

dst[i*2+0].re[0] = s00.re;
dst[i*2+0].re[1] = s01.re;
dst[i*2+0].re[2] = s02.re;
dst[i*2+0].re[3] = s03.re;
dst[i*2+0].im[0] = s00.im;
dst[i*2+0].im[1] = s01.im;
dst[i*2+0].im[2] = s02.im;
dst[i*2+0].im[3] = s03.im;

dst[i*2+1].re[0] = s10.re;
dst[i*2+1].re[1] = s11.re;
dst[i*2+1].re[2] = s12.re;
dst[i*2+1].re[3] = s13.re;
dst[i*2+1].im[0] = s10.im;
dst[i*2+1].im[1] = s11.im;
dst[i*2+1].im[2] = s12.im;
dst[i*2+1].im[3] = s13.im;
  }
}

-march=skylake -O2 produces following inner loop:
.L3:
vmovsd  (%rdx), %xmm7
vmovsd  8(%rdx), %xmm3
vmovsd  16(%rdx), %xmm6
vmovsd  24(%rdx), %xmm2
vmovsd  32(%rdx), %xmm5
vmovsd  40(%rdx), %xmm1
vmovsd  48(%rdx), %xmm4
vmovsd  56(%rdx), %xmm0
addq$64, %rdx
vmovsd  %xmm7, (%rcx)
vmovsd  %xmm6, 8(%rcx)
vmovsd  %xmm5, 16(%rcx)
vmovsd  %xmm4, 24(%rcx)
vmovsd  %xmm3, 32(%rcx)
vmovsd  %xmm2, 40(%rcx)
vmovsd  %xmm1, 48(%rcx)
vmovsd  %xmm0, 56(%rcx)
addq$64, %rcx
cmpq%rax, %rdx
jne .L3


Quite reasonable for a non-vectorizing optimization level. It's possible to
save one instruction by using indexed addressing, but in the majority of
situations it wouldn't be faster.

-march=skylake -O3 inner loop:
.L3:
vmovupd (%rdx,%rax), %ymm0
vmovupd 32(%rdx,%rax), %ymm2
vunpcklpd   %ymm2, %ymm0, %ymm1
vunpckhpd   %ymm2, %ymm0, %ymm0
vpermpd $216, %ymm1, %ymm1
vpermpd $216, %ymm0, %ymm0
vmovupd %ymm1, (%rcx,%rax)
vmovupd %ymm0, 32(%rcx,%rax)
addq$64, %rax
cmpq%r8, %rax
jne .L3

That's excellent. It not only looks better: according to my measurements, for a
source array in external memory and a destination in the L1/L2 cache it's
actually ~1.5x faster than -O2, which is not a small feat.

Part 2.
A little more involved case. Now we want to interleave 2 lines of the source
matrix.
Sometimes it's desirable to have an interleaved layout, because it improves
locality of access for the rest of the processing, and it can also reduce
pressure on the GPRs that are used as pointers or indices.

typedef struct { double re, im; } dcmlx_t;
typedef struct { double re[4], im[4]; } dcmlx4_t;

void foo_i2(dcmlx4_t dst[], const dcmlx_t src[], int n)
{
  for (int i = 0; i < n; ++i) {
dcmlx_t s00 = src[i*4+0];
dcmlx_t s01 = src[i*4+1];
dcmlx_t s02 = src[i*4+2];
dcmlx_t s03 = src[i*4+3];

dcmlx_t s10 = src[i*4+0+n];
dcmlx_t s11 = src[i*4+1+n];
dcmlx_t s12 = src[i*4+2+n];
dcmlx_t s13 = src[i*4+3+n];

dst[i*2+0].re[0] = s00.re;
dst[i*2+0].re[1] = s01.re;
dst[i*2+0].re[2] = s02.re;
dst[i*2+0].re[3] = s03.re;
dst[i*2+0].im[0] = s00.im;
dst[i*2+0].im[1] = s01.im;
dst[i*2+0].im[2] = s02.im;
dst[i*2+0].im[3] = s03.im;

dst[i*2+1].re[0] = s10.re;
dst[i*2+1].re[1] = s11.re;
dst[i*2+1].re[2] = s12.re;
dst[i*2+1].re[3] = s13.re;
dst[i*2+1].im[0] = s10.im;
dst[i*2+1].im[1] = s11.im;
dst[i*2+1].im[2] = s12.im;
dst[i*2+1].im[3] = s13.im;
  }
}

-march=skylake -O2 produces following inner loop:
.L3:
vmovsd  (%rdx), %xmm15
vmovsd  8(%rdx), %xmm11
vmovsd  16(%rdx), %xmm14
vmovsd  24(%rdx), %xmm10
vmovsd  32(%rdx), %xmm13
vmovsd  40(%rdx), %xmm9
vmovsd  48(%rdx), %xmm12
vmovsd  56(%rdx), %xmm8
vmovsd  (%rax), %xmm7
vmovsd  8(%rax), %xmm3
vmovsd  16(%rax), %xmm6
vmovsd  24(%rax), %xmm2
vmovsd  32(%rax), %xmm5
vmovsd  40(%rax), %xmm1
vmovsd  48(%rax), %xmm4
vmovsd  56(%rax), %xmm0

[Bug tree-optimization/97343] AVX2 vectorizer generates extremely strange and slow code for AoSoA complex dot product

2020-10-09 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97343

--- Comment #2 from Michael_S  ---
(In reply to Richard Biener from comment #1)
> All below for Part 2.
> 
> Without -ffast-math you are seeing GCC using in-order reductions now while
> with -ffast-math the vectorizer gets a bit confused about reassociations done
> before, for me producing
> 

Just to understand: when you say "Without -ffast-math", does that include my
set1 == "-O3 -ffast-math -fno-associative-math"?

BTW, I like your phrasing: "a bit confused about reassociations"


> .L3:
> vmovupd 32(%rsi,%rax), %ymm3
> vmovupd (%rdx,%rax), %ymm7
> vinsertf128 $1, (%rsi,%rax), %ymm3, %ymm0
> vinsertf128 $1, 32(%rdx,%rax), %ymm7, %ymm2
> vmovupd 32(%rsi,%rax), %ymm5
> vpermpd $136, %ymm0, %ymm4
> vpermpd $40, %ymm2, %ymm7
> vpermpd $221, %ymm0, %ymm1
> vpermpd $125, %ymm2, %ymm3
> vperm2f128  $49, (%rsi,%rax), %ymm5, %ymm0
> vmovupd (%rdx,%rax), %ymm2
> vperm2f128  $49, 32(%rdx,%rax), %ymm2, %ymm2
> addq$64, %rax
> vpermpd $136, %ymm0, %ymm5
> vpermpd $221, %ymm0, %ymm0
> vpermpd $40, %ymm2, %ymm8
> vpermpd $125, %ymm2, %ymm2
> vmulpd  %ymm8, %ymm5, %ymm5
> vmulpd  %ymm2, %ymm0, %ymm0
> vfmadd132pd %ymm3, %ymm5, %ymm1
> vfmadd231pd %ymm7, %ymm4, %ymm0
> vaddpd  %ymm0, %ymm1, %ymm0
> vaddpd  %ymm0, %ymm6, %ymm6
> cmpq%rcx, %rax
> jne .L3
> 
> -ffast-math vs. non-ffast-math we're using a SLP reduction vs. 4 reduction
> chains and this SLP reduction ends up looking like
> 
> t5.c:7:21: note:   Vectorizing SLP tree:
> t5.c:7:21: note:   node 0x4100c20 (max_nunits=4, refcnt=2)
> t5.c:7:21: note:stmt 0 acc_imre_158 = acc_imre_3 + _34;
> t5.c:7:21: note:stmt 1 acc_reim_156 = acc_reim_1 + _8;
> t5.c:7:21: note:stmt 2 acc_imim_154 = _21 + acc_imim_35;
> t5.c:7:21: note:stmt 3 acc_rere_146 = _11 + acc_rere_29;
> t5.c:7:21: note:children 0x3f272e0 0x4100bb0
> t5.c:7:21: note:   node 0x3f272e0 (max_nunits=4, refcnt=1)
> t5.c:7:21: note:stmt 0 acc_imre_3 = PHI 
> t5.c:7:21: note:stmt 1 acc_reim_1 = PHI 
> t5.c:7:21: note:stmt 2 acc_imim_35 = PHI 
> t5.c:7:21: note:stmt 3 acc_rere_29 = PHI 
> t5.c:7:21: note:children 0x4100c20
> t5.c:7:21: note:   node 0x4100bb0 (max_nunits=4, refcnt=1)
> t5.c:7:21: note:stmt 0 _34 = _36 + _157;
> t5.c:7:21: note:stmt 1 _8 = _30 + _155;
> t5.c:7:21: note:stmt 2 _21 = _15 + _153;
> t5.c:7:21: note:stmt 3 _11 = _6 + _145;
> t5.c:7:21: note:children 0x4100920 0x4100b40
> t5.c:7:21: note:   node 0x4100920 (max_nunits=4, refcnt=1)
> t5.c:7:21: note:stmt 0 _36 = _37 + _73;
> t5.c:7:21: note:stmt 1 _30 = _32 + _71;
> t5.c:7:21: note:stmt 2 _15 = _10 + _69;
> t5.c:7:21: note:stmt 3 _6 = _31 + _61;
> t5.c:7:21: note:children 0x41004e0 0x41008b0
> t5.c:7:21: note:   node 0x41004e0 (max_nunits=4, refcnt=1)
> t5.c:7:21: note:stmt 0 _37 = _101 + _129;
> t5.c:7:21: note:stmt 1 _32 = _99 + _127;
> t5.c:7:21: note:stmt 2 _10 = _97 + _125;
> t5.c:7:21: note:stmt 3 _31 = _89 + _117;
> t5.c:7:21: note:children 0x3f2a550 0x3f28700
> t5.c:7:21: note:   node 0x3f2a550 (max_nunits=4, refcnt=1)
> t5.c:7:21: note:stmt 0 _101 = _88 * _94;
> t5.c:7:21: note:stmt 1 _99 = _86 * _96;
> t5.c:7:21: note:stmt 2 _97 = _94 * _96;
> t5.c:7:21: note:stmt 3 _89 = _86 * _88;
> t5.c:7:21: note:children 0x40b6990 0x3f29e00
> t5.c:7:21: note:   node 0x40b6990 (max_nunits=4, refcnt=1)
> t5.c:7:21: note:stmt 0 _88 = *_87;
> t5.c:7:21: note:stmt 1 _96 = *_95;
> t5.c:7:21: note:stmt 2 _96 = *_95;
> t5.c:7:21: note:stmt 3 _88 = *_87;
> t5.c:7:21: note:load permutation { 1 5 5 1 }
> t5.c:7:21: note:   node 0x3f29e00 (max_nunits=4, refcnt=1)
> t5.c:7:21: note:stmt 0 _94 = *_93;
> t5.c:7:21: note:stmt 1 _86 = *_85;
> t5.c:7:21: note:stmt 2 _94 = *_93;
> t5.c:7:21: note:stmt 3 _86 = *_85;
> t5.c:7:21: note:load permutation { 5 1 5 1 }
> t5.c:7:21: note:   node 0x3f28700 (max_nunits=4, refcnt=1)
> t5.c:7:21: note:stmt 0 _129 = _116 * _122;
> t5.c:7:21: note:stmt 1 _127 = _114 * _124;
> t5.c:7:21: note:stmt 2 _125 = _122 * _124;
> t5.c:7:21: note:stmt 3 _117 = _114 * _116;
> t5.c:7:21: note:children 0x3f287e0 0x3f28770
> t5.c:7:21: note:   node 0x3f287e0 (max_nunits=4, refcnt=1)
> t5.c:7:21: note:stmt 0 _116 = *_115;
> t5.c:7:21: note:stmt 1 _124 = *_123;
> t5.c:7:21: note:stmt 2 _124 = *_123;
> t5.c:7:21: note:stmt 3 _116 = *_115;
> t5.c:7:21: note:load permutation { 2 6 6 2 }
> t5.c:7:21: note:   node 0x3f28770 (max_nunits=4, refcnt=1)
> t5.c:7:21: 

[Bug target/97343] New: AVX2 vectorizer generates extremely strange and slow code for AoSoA complex dot product

2020-10-08 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97343

Bug ID: 97343
   Summary: AVX2 vectorizer generates extremely strange and slow
code for AoSoA complex dot product
   Product: gcc
   Version: 10.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: already5chosen at yahoo dot com
  Target Milestone: ---

Let's continue our complex dot product series started here
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96854

This time I have no code generation bugs for your pleasure, just "interesting"
optimization issues.
All examples below, unless stated otherwise, were compiled with gcc 10.2 for
x86-64 with the following sets of flags:
set1: -Wall -mavx2 -mfma -march=skylake -O3 -ffast-math -fno-associative-math
set2: -Wall -mavx2 -mfma -march=skylake -O3 -ffast-math

The kernel in question is an example of a complex dot product in the so-called
"hybrid AoS" layout, a.k.a. AoSoA.
https://en.wikipedia.org/wiki/AoS_and_SoA#Array_of_Structures_of_Arrays
In my experience, in dense complex linear algebra and similar computational
fields, it's quite rare for AoSoA *not* to be the optimal internal form.
So, practically, I consider these kernels more important than the AoS kernel
presented in bug 96854.

More specifically, the layout can be described as struct { double re[4],
im[4]; }. But for simplicity I omitted the type definition from the code
examples and coded it directly over flat arrays of doubles.
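
For reference, a minimal sketch of that element type (the name cplx4 is mine
and does not appear in the examples below):

typedef struct { double re[4]; double im[4]; } cplx4;

so a[c*8+k+0] plays the role of ((const cplx4*)a)[c].re[k] and a[c*8+k+4] the
role of ((const cplx4*)a)[c].im[k].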

Part 1.
void cdot(double* res, const double* a, const double* b, int N)
{
  double acc_re = 0;
  double acc_im = 0;
  for (int c = 0; c < N; ++c) {
for (int k = 0; k < 4; ++k) {
  acc_re = acc_re + a[c*8+k+0]*b[c*8+k+0] + a[c*8+k+4]*b[c*8+k+4];
  acc_im = acc_im - a[c*8+k+0]*b[c*8+k+4] + a[c*8+k+4]*b[c*8+k+0];
}
  }
  res[0] = acc_re;
  res[4] = acc_im;
}

That's how we would like to code it in an ideal world, letting the compiler
take care of the dirty details.
In a less ideal world, gcc is not the only compiler that can't cope with it.
MSVC (-W4 -O2 -fp:fast -arch:AVX2) also can't vectorize it. Even mighty icc
generates code that is not quite bad, but somewhat suboptimal.
So, let it pass. I don't want to blame gcc for not being smart enough. It's
just normal.
Except that with set2 the code generated by gcc becomes not just non-smart,
but quite crazy.
I am ignoring it in the hope that it will be magically fixed by the change
made by Richard Biener on 2020-08-31.

Part 2.
void cdot(double* res, const double* a, const double* b, int N)
{
  double acc_rere = 0;
  double acc_imim = 0;
  double acc_reim = 0;
  double acc_imre = 0;
  for (int c = 0; c < N; ++c) {
for (int k = 0; k < 4; ++k) {
  acc_rere += a[c*8+k+0]*b[c*8+k+0];
  acc_imim += a[c*8+k+4]*b[c*8+k+4];
  acc_reim += a[c*8+k+0]*b[c*8+k+4];
  acc_imre += a[c*8+k+4]*b[c*8+k+0];
}
  }
  res[0] = acc_rere+acc_imim;
  res[4] = acc_imre-acc_reim;
}

Here we are explaining it to the compiler slowly.
For icc and MSVC that's enough; they understood.
icc generates near-perfect code, roughly the shape sketched below. I could
write it more nicely, but I don't expect my variant to be any faster.
MSVC generates a near-perfect inner loop and an epilogue that is not great,
but not really much slower.
gcc still doesn't get it. It still implements the 4 accumulators literally,
as if -ffast-math were not there.
But, sad as it is, it's still a case of not being smart enough. So, I am not
complaining.
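
For illustration, here is the sketch I mean - my own code written with AVX2/FMA
intrinsics (assumes -mavx2 -mfma), not icc's actual output; the function name
cdot_i is hypothetical:

#include <immintrin.h>
void cdot_i(double* res, const double* a, const double* b, int N)
{
  __m256d acc_rere = _mm256_setzero_pd();
  __m256d acc_imim = _mm256_setzero_pd();
  __m256d acc_reim = _mm256_setzero_pd();
  __m256d acc_imre = _mm256_setzero_pd();
  for (int c = 0; c < N; ++c) {
    __m256d are = _mm256_loadu_pd(&a[c*8+0]); /* re[0..3] */
    __m256d aim = _mm256_loadu_pd(&a[c*8+4]); /* im[0..3] */
    __m256d bre = _mm256_loadu_pd(&b[c*8+0]);
    __m256d bim = _mm256_loadu_pd(&b[c*8+4]);
    acc_rere = _mm256_fmadd_pd(are, bre, acc_rere);
    acc_imim = _mm256_fmadd_pd(aim, bim, acc_imim);
    acc_reim = _mm256_fmadd_pd(are, bim, acc_reim);
    acc_imre = _mm256_fmadd_pd(aim, bre, acc_imre);
  }
  __m256d vre = _mm256_add_pd(acc_rere, acc_imim);
  __m256d vim = _mm256_sub_pd(acc_imre, acc_reim);
  /* horizontal sums of the four lanes */
  __m128d re2 = _mm_add_pd(_mm256_castpd256_pd128(vre),
                           _mm256_extractf128_pd(vre, 1));
  __m128d im2 = _mm_add_pd(_mm256_castpd256_pd128(vim),
                           _mm256_extractf128_pd(vim, 1));
  res[0] = _mm_cvtsd_f64(_mm_add_sd(re2, _mm_unpackhi_pd(re2, re2)));
  res[4] = _mm_cvtsd_f64(_mm_add_sd(im2, _mm_unpackhi_pd(im2, im2)));
}

Four plain 256-bit loads and four FMAs per group, no shuffles inside the loop -
that's all the Part 2 source is trying to describe.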

Part 3.
static inline double sum4(double x[]) {
  return x[0]+x[1]+x[2]+x[3];
}
void cdot(double* res, const double* a, const double* b, int N)
{
  double acc_re[4] = {0};
  double acc_im[4] = {0};
  for (int c = 0; c < N; ++c) {
for (int k = 0; k < 4; ++k) {
  acc_re[k] = acc_re[k] + a[c*8+k+0]*b[c*8+k+0] + a[c*8+k+4]*b[c*8+k+4];
  acc_im[k] = acc_im[k] - a[c*8+k+0]*b[c*8+k+4] + a[c*8+k+4]*b[c*8+k+0];
}
  }
  res[0] = sum4(acc_re);
  res[4] = sum4(acc_im);
}

An attempt to feed the compiler by the teaspoon.
That's not the way I want to write code in an HLL.
icc copes, producing about the same code as in Part 1.
MSVC doesn't understand the Kunststück (I am sympathetic) and generates literal
scalar code with local arrays on the stack.
gcc with set1 is a little better than MSVC - the code is fully scalar, but at
least the accumulators are kept in registers.
gcc with set2 is the most interesting. It vectorizes, but how?
Here is the inner loop:

.L3:
vpermpd $27, (%r8,%rax), %ymm2
vpermpd $27, 32(%rdx,%rax), %ymm3
vpermpd $27, (%rdx,%rax), %ymm1
vpermpd $27, 32(%r8,%rax), %ymm0
vmulpd  %ymm2, %ymm1, %ymm6
vmulpd  %ymm2, %ymm3, %ymm2
addq$64, %rax
vfnmadd132pd%ymm0, %ymm2, %ymm1
vfmadd132pd %ymm3, %ymm6, %ymm0
vaddpd  %ymm1, %ymm5, %ymm5
vaddpd  %ymm0, %ymm4, %ymm4
cmpq%rcx, %rax
jne .L3

What is all this vpermpd business about?
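(For what it's worth, the immediate $27 = 0b00011011 selects lanes 3,2,1,0,
i.e. each vpermpd merely reverses the lane order of a load. Since both operands
of every multiply are reversed consistently and the accumulators are only
summed horizontally afterwards, as far as I can tell the result would be
identical with the permutes simply dropped.)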

[Bug target/97127] FMA3 code transformation leads to slowdown on Skylake

2020-09-25 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97127

--- Comment #15 from Michael_S  ---
(In reply to Hongtao.liu from comment #14)
> > Still I don't understand why compiler does not compare the cost of full loop
> > body after combining to the cost before combining and does not come to
> > conclusion that combining increased the cost.
> 
> As Richard says, GCC does not model CPU pipelines in such detail.

I don't understand what "details" you have in mind.
The costs of the instructions that you quoted above look fine.
But for a reason I don't understand, the compiler chose the more costly
"combined" code sequence over the less costly (according to its own cost model)
"RISCy" sequence.
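
For concreteness, using the per-instruction loop costs quoted further below
(AVX256 load = 10, FMA3 ymm,ymm,ymm = 16, AVX256 regmove = 2, FMA3
mem,ymm,ymm = 32), my back-of-the-envelope comparison - mine, not gcc's actual
calculation - is:

  "RISCy"  : vmovuxx (mem)   + vfmadd ymm,ymm,ymm : 10 + 16 = 26
  combined : vmovaxx ymm,ymm + vfmadd mem,ymm,ymm :  2 + 32 = 34

i.e. by the compiler's own numbers the combined form is more expensive.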

[Bug target/97127] FMA3 code transformation leads to slowdown on Skylake

2020-09-24 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97127

--- Comment #13 from Michael_S  ---
(In reply to Hongtao.liu from comment #11)
> (In reply to Michael_S from comment #10)
> > (In reply to Hongtao.liu from comment #9)
> > > (In reply to Michael_S from comment #8)
> > > > What are values of gcc "loop" cost of the relevant instructions now?
> > > > 1. AVX256 Load
> > > > 2. FMA3 ymm,ymm,ymm
> > > > 3. AVX256 Regmove
> > > > 4. FMA3 mem,ymm,ymm
> > > 
> > > For skylake, outside of register allocation.
> > > 
> > > they are
> > > 1. AVX256 Load   10
> > > 2. FMA3 ymm,ymm,ymm --- 16
> > > 3. AVX256 Regmove  --- 2
> > > 4. FMA3 mem,ymm,ymm --- 32
> > > 
> > > In RA, no direct cost for fma instrcutions, but we can disparage memory
> > > alternative in FMA instructions, but again, it may hurt performance in 
> > > some
> > > cases.
> > > 
> > > 1. AVX256 Load   10
> > > 3. AVX256 Regmove  --- 2
> > > 
> > > BTW: we have done a lot of experiments with different cost models and no
> > > significant performance impact on SPEC2017.
> > 
> > Thank you.
> > With relative costs like these gcc should generate 'FMA3 mem,ymm,ymm' only
> > in conditions of heavy registers pressure. So, why it generates it in my
> > loop, where registers pressure in the innermost loop is light and even in
> > the next outer level the pressure isn't heavy?
> > What am I missing?
> 
> the actual transformation gcc did is
> 
> vmovuxx   (mem1), %ymmA        pass_combine
> vmovuxx   (mem),  %ymmD       -------------->  vmovuxx   (mem1), %ymmA
> vfmadd213 %ymmD,%ymmC,%ymmA                    vfmadd213 (mem),%ymmC,%ymmA
> 
> then RA works like
>                                      RA
> vmovuxx   (mem1), %ymmA       -------------->  %vmovaps  %ymmB, %ymmA
> vfmadd213 (mem),%ymmC,%ymmA                    vfmadd213 (mem),%ymmC,%ymmA
> 
> it "look like" but actually not this one.
> 
>  vmovuxx  (mem), %ymmA
>  vfnmadd231xx %ymmB, %ymmC, %ymmA
> transformed to
>  vmovaxx  %ymmB, %ymmA
>  vfnmadd213xx (mem), %ymmC, %ymmA
> 
> ymmB is allocate for (mem1) not (mem)

Thank you.
Now the compiler's reasoning is starting to make more sense.
Still, I don't understand why the compiler does not compare the cost of the
full loop body after combining to the cost before combining and conclude that
combining increased the cost.

[Bug target/97127] FMA3 code transformation leads to slowdown on Skylake

2020-09-24 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97127

--- Comment #10 from Michael_S  ---
(In reply to Hongtao.liu from comment #9)
> (In reply to Michael_S from comment #8)
> > What are values of gcc "loop" cost of the relevant instructions now?
> > 1. AVX256 Load
> > 2. FMA3 ymm,ymm,ymm
> > 3. AVX256 Regmove
> > 4. FMA3 mem,ymm,ymm
> 
> For skylake, outside of register allocation.
> 
> they are
> 1. AVX256 Load   10
> 2. FMA3 ymm,ymm,ymm --- 16
> 3. AVX256 Regmove  --- 2
> 4. FMA3 mem,ymm,ymm --- 32
> 
> In RA, no direct cost for fma instrcutions, but we can disparage memory
> alternative in FMA instructions, but again, it may hurt performance in some
> cases.
> 
> 1. AVX256 Load   10
> 3. AVX256 Regmove  --- 2
> 
> BTW: we have done a lot of experiments with different cost models and no
> significant performance impact on SPEC2017.

Thank you.
With relative costs like these, gcc should generate 'FMA3 mem,ymm,ymm' only
under heavy register pressure. So why does it generate it in my loop, where
register pressure in the innermost loop is light, and even at the next outer
level the pressure isn't heavy?
What am I missing?