[Bug target/108922] fmod() 13x slowdown in gcc4.9 dropping "fprem" and calling fmod()

2023-02-27 Thread jkratochvil at azul dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108922

--- Comment #24 from Jan Kratochvil  ---
(In reply to Alexander Monakov from comment #22)
> Strange, comment #8 claims the opposite (unless Jan tested the revert not on
> trunk, but on some branch).

The testsuite ran on 4341106354c6a463ce3628a4ef9c1a1d37193b59 (=2023-02-25),
93ba85fdd253b4b9cf2b9e54e8e5969b1a3db098 reverted.

[Bug target/108922] fmod() 13x slowdown in gcc4.9 dropping "fprem" and calling fmod()

2023-02-27 Thread jkratochvil at azul dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108922

--- Comment #23 from Jan Kratochvil  ---
Created attachment 54542
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=54542&action=edit
fmoderrno.cpp

(In reply to Uroš Bizjak from comment #21)
> When g:93ba85fdd253b4b9cf2b9e54e8e5969b1a3db098 is reverted, current
> mainline does not emit anything that would handle errno (even with
> -fmath-errno flag explicitly set at command line).

With 4341106354c6a463ce3628a4ef9c1a1d37193b59 (=2023-02-25),
93ba85fdd253b4b9cf2b9e54e8e5969b1a3db098 reverted and fmoderrno.cpp I get:
fmod(1.0D, 0.0D)
g++ -o fmoderrno fmoderrno.C -O3 -Wall; ./fmoderrno
-nan errno=33=Numerical argument out of domain
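
The attachment fmoderrno.cpp is not reproduced in this thread. A minimal sketch of such an errno check, assuming only standard <cmath>/<cerrno> behaviour, might look like the following. The names fmod_errno() and report() are hypothetical and not taken from the attachment, and whether errno is set at all depends on the C library and on -fmath-errno being in effect.

```cpp
#include <cassert>
#include <cerrno>
#include <cmath>
#include <cstdio>
#include <cstring>

// Hypothetical helper (not from the attachment): call fmod() and return the
// errno value it left behind, or 0 if errno was not touched.
int fmod_errno(double x, double y, double *result)
{
    assert(result != nullptr);
    // volatile operands keep the compiler from folding the call at compile time
    volatile double vx = x, vy = y;
    errno = 0;
    *result = std::fmod(vx, vy);
    return errno;
}

// Print the outcome in a format similar to the output quoted above.
void report(double x, double y)
{
    double r;
    const int e = fmod_errno(x, y, &r);
    std::printf("fmod(%g, %g) = %f errno=%d=%s\n", x, y, r, e, std::strerror(e));
}
```

Calling report(1.0, 0.0) exercises the fmod(x, 0) domain-error case discussed here; on a C library that reports math errors through errno, the returned value would be EDOM.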

[Bug target/108922] fmod() 13x slowdown in gcc4.9 dropping "fprem" and calling fmod()

2023-02-26 Thread jkratochvil at azul dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108922

--- Comment #13 from Jan Kratochvil  ---
(In reply to Uroš Bizjak from comment #12)
> (In reply to Jan Kratochvil from comment #8)
> 
> > The revert makes it 13x faster. But the produced code still falls back to
> > calling glibc fmod() as shown in the disassembly in Comment 0.
> > If I use the "fprem" instruction directly it gets 15x faster, but I did not
> > find an (easy) way to patch GCC so that it no longer emits the call to
> > fmod() at all and emits only the "fprem" instruction.
> 
> Use -ffinite-math-only option:
> 
> -ffinite-math-only
>     Allow optimizations for floating-point arithmetic that assume that
> arguments and results are not NaNs or +-Infs.

That works for the Comment 0 reproducer, but I consider -ffinite-math-only
incorrect for the OpenJDK codebase as a whole because of its other
calculations. Java code is documented to support infinite values, so that
option could produce invalid results.

To fully fix the performance (no "call fmod" case) I find it better to use
-fno-math-errno. Nothing in OpenJDK should rely on errno from math operations.
But that option still requires reverting your patch.

The question is whether gcc can rely on the undocumented Intel behavior as
described in Comment 7. glibc already relies on it anyway.

I have submitted this revert proposal only for the benefit of GCC. Neither I
nor my employer need it ourselves, as I have already submitted a fix for
OpenJDK using an asm "fprem" expression. Relying on a fix in GCC would not be
acceptable for OpenJDK, as it is still going to be built by old/existing
OSes/compilers for years: https://github.com/openjdk/jdk/pull/12508/files
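
For illustration, an asm "fprem" expression of this kind can be sketched with GCC extended asm as below. This is not the code from the OpenJDK pull request; the helper name fmod_fprem() is hypothetical, and the sketch assumes a GCC-compatible compiler on x86 (on other targets it simply forwards to std::fmod()). The loop mirrors the fprem / fnstsw / test $0x4,%ah / jne sequence from the disassembly in Comment 0: fprem is repeated while the C2 flag (bit 10 of the x87 status word) reports an incomplete reduction.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper illustrating the technique; it is not the OpenJDK patch.
double fmod_fprem(double x, double y)
{
#if (defined(__i386__) || defined(__x86_64__)) && defined(__GNUC__)
    long double rem = x;            // dividend, placed in %st(0) by the "t" constraint
    const long double divisor = y;  // divisor, placed in %st(1) by the "u" constraint
    unsigned short status;          // x87 status word, returned in %ax
    do {
        __asm__ volatile ("fprem\n\t"
                          "fnstsw %1"
                          : "+t" (rem), "=a" (status)
                          : "u" (divisor));
    } while (status & 0x0400);      // C2 set: partial remainder only, run fprem again
    return static_cast<double>(rem);
#else
    return std::fmod(x, y);         // fallback for targets without the x87 FPU
#endif
}

// Small self-check comparing the helper against the library function.
void fmod_fprem_self_check()
{
    assert(fmod_fprem(5.5, 2.0) == std::fmod(5.5, 2.0));
    assert(fmod_fprem(1e300, 3.0) == std::fmod(1e300, 3.0));
}
```

Unlike the GCC 4.4.7 code shown in Comment 0, this sketch has no NaN check and no fallback call, so it never sets errno; that is only appropriate where nothing relies on errno from math operations.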

[Bug target/108922] fmod() 13x slowdown in gcc4.9 dropping "fprem" and calling fmod()

2023-02-26 Thread jkratochvil at azul dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108922

--- Comment #10 from Jan Kratochvil  ---
(In reply to Alexander Monakov from comment #9)
> You just need to pass -fno-math-errno (the call is for setting errno,
> similar to how gcc emits the sqrt() sequence).

True, thanks.


So I think the patch should be reverted, right? I expect the revert should have
a testcase nowadays.

[Bug target/108922] fmod() 13x slowdown in gcc4.9 dropping "fprem" and calling fmod()

2023-02-25 Thread jkratochvil at azul dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108922

--- Comment #8 from Jan Kratochvil  ---
(In reply to Andrew Pinski from comment #2)
> So the simple test is run the full GCC bootstrap/test with all languages and
> check if the testcase fails or not. I suspect it will.

It does not. Tested on Fedora 36 x86-64.

I tested only a revert of:
https://gcc.gnu.org/git/?p=gcc.git;a=commitdiff;h=93ba85fdd253b4b9cf2b9e54e8e5969b1a3db098

The revert makes it 13x faster. But the produced code still falls back to
calling glibc fmod() as shown in the disassembly in Comment 0.
If I use the "fprem" instruction directly it gets 15x faster, but I did not
find an (easy) way to patch GCC so that it no longer emits the call to
fmod() at all and emits only the "fprem" instruction.

(In reply to Alexander Monakov from comment #4)
> Plus, Glibc does use fprem/fprem1 for fmodl/remainderl on x86_64,

It is true that replacing fmod() with fmodl() makes it 5x faster (but only 5x).
There is still an infinity check, and I haven't found any real justification
for it in the glibc sources:
  if (__builtin_expect (isinf (x) || y == 0.0L, 0)
      && _LIB_VERSION != _IEEE_ && !isnan (y) && !isnan (x))
    /* fmod(+-Inf,y) or fmod(x,0) */
    return __kernel_standard_l (x, y, 227);

> The ieee_2.f90 testcase attempts to change rounding mode. In 2014 it
> probably just was "miscompiled".

The testsuite run did include "gfortran.dg/ieee/ieee_2.f90" and it has no
regression.

[Bug target/108922] New: fmod() 13x slowdown in gcc 4.8->4.9 dropping "fprem" and calling fmod()

2023-02-24 Thread jkratochvil at azul dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108922

            Bug ID: 108922
           Summary: fmod() 13x slowdown in gcc 4.8->4.9 dropping "fprem"
                    and calling fmod()
           Product: gcc
           Version: 12.2.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: jkratochvil at azul dot com
  Target Milestone: ---

Created attachment 54528
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=54528&action=edit
bench.cpp

This performance regression is since:

[PATCH, i386]: Enable reminder{sd,df,xf} and fmod{sf,df,xf} only for
flag_finite_math_only.
https://gcc.gnu.org/pipermail/gcc-patches/2014-September/400104.html

https://gcc.gnu.org/git/?p=gcc.git;a=commitdiff;h=93ba85fdd253b4b9cf2b9e54e8e5969b1a3db098

Reproducible with attached "bench.cpp":
g++ (GCC) 4.8.3 20140517 (prerelease)
real    0m0.329s
g++ (GCC) 4.9.3 20150207 (prerelease)
real    0m4.396s
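
The attached bench.cpp is not reproduced in this report. A rough sketch of a comparable timing loop, under the assumption that the benchmark repeatedly calls fmod() on operands the compiler cannot fold, could look like this. The function name time_fmod() and its parameters are hypothetical.

```cpp
#include <cassert>
#include <chrono>
#include <cmath>

// Hypothetical reconstruction, not the attached bench.cpp.
// Calls fmod() `iterations` times on volatile operands so that the compiler
// can neither hoist nor constant-fold the call, and returns the elapsed
// wall-clock time in seconds. The last result is stored in *last_result.
double time_fmod(long iterations, double x, double y, double *last_result)
{
    assert(last_result != nullptr);
    volatile double vx = x, vy = y, sink = 0.0;
    const auto start = std::chrono::steady_clock::now();
    for (long i = 0; i < iterations; ++i)
        sink = std::fmod(vx, vy);
    const auto stop = std::chrono::steady_clock::now();
    *last_result = sink;
    return std::chrono::duration<double>(stop - start).count();
}
```

An iteration count of 100000000 would match the loop bound visible in the disassembly in this report (cmp $0x5f5e100,%ebx); the operand values used by the real benchmark are not known from the thread.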

The committer claims "do not return NaN for infinities, but generate
invalid-arithmetic-operand exception.". But my attached testcase tests that all
the corner cases do have both the same result value and the same exceptions
generated.
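
The attached testcase itself is not shown here. A minimal sketch of how one corner case can be checked for both its result value and its raised exception, assuming standard <cfenv> facilities, is given below. The names FmodOutcome, checked_fmod() and same_outcome() are hypothetical; a real comparison would run the same check against the "fprem"-based code and against the library call.

```cpp
#include <cassert>
#include <cfenv>
#include <cmath>

struct FmodOutcome {
    double value;   // value returned by fmod()
    bool invalid;   // whether the call raised FE_INVALID
};

// Hypothetical helper, not the attached testcase: run one fmod() call with
// the exception flags cleared beforehand and record what it raised.
FmodOutcome checked_fmod(double x, double y)
{
    // volatile keeps the call from being folded or moved across the flag accesses
    volatile double vx = x, vy = y, vr;
    std::feclearexcept(FE_ALL_EXCEPT);
    vr = std::fmod(vx, vy);
    const bool invalid = std::fetestexcept(FE_INVALID) != 0;
    return {vr, invalid};
}

// Two outcomes agree when both values are NaN, or are equal with the same
// sign, and the FE_INVALID flag matches.
bool same_outcome(const FmodOutcome &a, const FmodOutcome &b)
{
    const bool both_nan = std::isnan(a.value) && std::isnan(b.value);
    const bool same_number = a.value == b.value
                             && std::signbit(a.value) == std::signbit(b.value);
    return (both_nan || same_number) && a.invalid == b.invalid;
}

// Example corner cases: fmod(Inf, y) is an invalid operation, fmod(x, Inf) is not.
void corner_case_self_check()
{
    assert(checked_fmod(INFINITY, 1.0).invalid);
    assert(!checked_fmod(1.0, INFINITY).invalid);
}
```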

The committer also claims "fixes ieee_2.f90 testsuite failure" but I have no
idea where to find this testsuite.

g++ (GCC) 4.4.7 20120313 (Red Hat 4.4.7-18)
/home/azul/t/zuc1182/fmod.C:7
  4005f8:   dd 44 24 30             fldl   0x30(%rsp)
  4005fc:   dd 44 24 38             fldl   0x38(%rsp)
  400600:   d9 c1                   fld    %st(1)
  400602:   d9 c1                   fld    %st(1)
  400604:   d9 f8                   fprem
  400606:   df e0                   fnstsw %ax
  400608:   f6 c4 04                test   $0x4,%ah
  40060b:   75 f7                   jne    400604
  40060d:   dd d9                   fstp   %st(1)
  40060f:   dd 5c 24 18             fstpl  0x18(%rsp)
  400613:   f2 0f 10 44 24 18       movsd  0x18(%rsp),%xmm0
  400619:   66 0f 2e c0             ucomisd %xmm0,%xmm0
^^^
Here it tests whether the result is NaN (comparing %xmm0 with itself is
unordered only for NaN); if it is, it falls back to calling fmod().
But I do not find even that needed; one could just use the "fprem" result.
  40061d:   7a 06                   jp     400625
  40061f:   74 2f                   je     400650
  400621:   d9 c9                   fxch   %st(1)
  400623:   eb 0b                   jmp    400630
  400625:   d9 c9                   fxch   %st(1)
  400627:   66 0f 1f 84 00 00 00    nopw   0x0(%rax,%rax,1)
  40062e:   00 00
  400630:   dd 5c 24 08             fstpl  0x8(%rsp)
  400634:   f2 0f 10 4c 24 08       movsd  0x8(%rsp),%xmm1
  40063a:   dd 5c 24 08             fstpl  0x8(%rsp)
  40063e:   f2 0f 10 44 24 08       movsd  0x8(%rsp),%xmm0
  400644:   e8 6f fe ff ff          callq  4004b8
  400649:   eb 09                   jmp    400654
  40064b:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
  400650:   dd d8                   fstp   %st(0)
  400652:   dd d8                   fstp   %st(0)
  400654:   83 c3 01                add    $0x1,%ebx
  400657:   f2 0f 11 44 24 28       movsd  %xmm0,0x28(%rsp)
/home/azul/t/zuc1182/fmod.C:6
  40065d:   81 fb 00 e1 f5 05       cmp    $0x5f5e100,%ebx
  400663:   75 93                   jne    4005f8

A similar issue may exist with drem() (=remainder()) vs. the "fprem1" instruction.

I expect the same issue also affects fmodf(), dremf() and remainderf().

Another topic is why the glibc fmod() implementation does not simply use
"fprem" on the i686/x86_64 architectures.