Re: Suboptimal code generated for __builtin_trunc on AMD64 without SSE4.1

Gabriel Paubert Fri, 06 Aug 2021 06:21:15 -0700

On Fri, Aug 06, 2021 at 02:43:34PM +0200, Stefan Kanthak wrote:
> Gabriel Paubert <[email protected]> wrote:
> 
> > Hi,
> > 
> > On Thu, Aug 05, 2021 at 01:58:12PM +0200, Stefan Kanthak wrote:
> >> Gabriel Paubert <[email protected]> wrote:
> >> 
> >> 
> >> > On Thu, Aug 05, 2021 at 09:25:02AM +0200, Stefan Kanthak wrote:
> 
> >> >>                               .intel_syntax
> >> >>                               .text
> >> >>    0:   f2 48 0f 2c c0        cvttsd2si rax, xmm0  # rax = trunc(argument)
> >> >>    5:   48 f7 d8              neg     rax
> >> >>                         #     jz      .L0          # argument zero?
> >> >>    8:   70 16                 jo      .L0          # argument indefinite?
> >> >>                                                    # argument overflows 64-bit integer?
> >> >>    a:   48 f7 d8              neg     rax
> >> >>    d:   f2 48 0f 2a c8        cvtsi2sd xmm1, rax   # xmm1 = trunc(argument)
> >> >>   12:   66 0f 73 d0 3f        psrlq   xmm0, 63
> >> >>   17:   66 0f 73 f0 3f        psllq   xmm0, 63     # xmm0 = (argument & -0.0) ? -0.0 : 0.0
> >> >>   1c:   66 0f 56 c1           orpd    xmm0, xmm1   # xmm0 = trunc(argument)
> >> >>   20:   c3              .L0:  ret
> >> >>                               .end
> >> > 
> >> > There is one important difference, namely setting the invalid exception
> >> > flag when the parameter can't be represented in a signed integer.
> >> 
> >> Right, I overlooked this fault. Thanks for pointing out.
> >> 
> >> > So using your code may require some option (-ffast-math comes to mind),
> >> > or you need at least a check on the exponent before cvttsd2si.
> >> 
> >> The whole idea behind these implementations is to get rid of loading
> >> floating-point constants to perform comparisons.
> > 
> > Indeed, but what I had in mind was something along the following lines:
> > 
> > movq rax,xmm0   # and copy rax to say rcx, if needed later
> > shrq rax,52     # move sign and exponent to 12 LSBs 
> > andl eax,0x7ff  # mask the sign
> > cmpl eax,0x434  # value to be checked
> > ja return       # exponent too large, we're done (what about NaNs?)
> > cvttsd2si rax,xmm0 # safe after exponent check
> > cvtsi2sd xmm0,rax  # conversion done
> > 
> > and a bit more to handle the corner cases (essentially preserve the
> > sign to be correct between -1 and -0.0).
> 
> The sign of -0.0 is the only corner case and already handled in my code.
> Both SNAN and QNAN (which have an exponent 0x7ff) are handled and
> preserved, as in the code GCC generates as well as my code.


I don't know what the standard says about NaNs in this case. As far as I
remember, arithmetic instructions typically produce a QNaN when one of
the inputs is a NaN, whether signaling or not.

> 
> > But the CPU can (speculatively) start the conversions early, so the
> > dependency chain is rather short.
> 
> Correct.
>  
> > I don't know if it's faster than your new code,
> 
> It should be faster.
> 
> > I'm almost sure that it's shorter.
> 
> "neg rax; jo ...; neg rax" is 3+2+3=8 bytes, the above sequence has but
> 5+4+5+5+2=21 bytes.
> 
> JFTR: better use "add rax,rax; shr rax,53" instead of
>       "shr rax,52; and eax,0x7ff" and save 2 bytes.

Indeed, I don't have the exact size of instructions in my head,
especially since I've not written x86 assembly since the mid 90s.

In any case, with your last improvement, the code is now down to a
single 32-bit immediate constant. And I don't see how to eliminate it...
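
For reference, the two ways of extracting the biased exponent discussed
above can be written in C roughly as follows. This is only a sketch: the
function names are made up for illustration, and memcpy stands in for
"movq rax, xmm0".

```c
#include <stdint.h>
#include <string.h>

/* Biased exponent of a binary64 value, following the
   "add rax,rax; shr rax,53" idea: shift the sign bit out on the left,
   then shift the 11 exponent bits down to the bottom. */
static unsigned biased_exponent(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);      /* reinterpret the double as an integer */
    return (unsigned)((bits << 1) >> 53);
}

/* The longer equivalent, "shr rax,52; and eax,0x7ff": shift sign and
   exponent down, then mask the sign bit away. */
static unsigned biased_exponent_masked(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);
    return (unsigned)(bits >> 52) & 0x7ffu;
}
```

Both return 1023 + 53 = 0x434 for 0x1.0p53, which is the threshold value
that the remaining immediate constant encodes.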

> 
> Complete properly optimized code for __builtin_trunc is then as follows
> (11 instructions, 44 bytes):
> 
> .code64
> .intel_syntax
> .equ    BIAS, 1023
> .text
>         movq    rax, xmm0    # rax = argument
>         add     rax, rax
>         shr     rax, 53      # rax = exponent of |argument|
>         cmp     eax, BIAS + 53
>         jae     .Lexit       # argument indefinite?

Maybe s/.Lexit/.L0/

>                              # |argument| >= 0x1.0p53?
>         cvttsd2si rax, xmm0  # rax = trunc(argument)
>         cvtsi2sd xmm1, rax   # xmm1 = trunc(argument)
>         psrlq   xmm0, 63
>         psllq   xmm0, 63     # xmm0 = (argument & -0.0) ? -0.0 : 0.0
>         orpd    xmm0, xmm1   # xmm0 = trunc(argument)
> .L0:    ret
> .end
> 

This looks nice.
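
In C terms, the sequence above amounts to something like the following.
Again only a sketch: trunc_sketch and the BIAS macro are names I made up
here, and the cast through int64_t stands in for the cvttsd2si/cvtsi2sd
pair.

```c
#include <stdint.h>
#include <string.h>

#define BIAS 1023   /* exponent bias of binary64 */

static double trunc_sketch(double x)
{
    uint64_t bits, tbits;
    double t;

    memcpy(&bits, &x, sizeof bits);

    /* Biased exponent >= BIAS + 53 means |x| >= 2^53, infinity or NaN:
       the value is already integral (or not a number), return it as is. */
    if ((unsigned)((bits << 1) >> 53) >= BIAS + 53)
        return x;

    /* |x| < 2^53 here, so the conversion to a 64-bit integer cannot
       overflow and the conversion back to double is exact. */
    t = (double)(int64_t)x;

    /* OR the original sign bit back in, so that e.g. -0.5 yields -0.0. */
    memcpy(&tbits, &t, sizeof tbits);
    tbits |= bits & UINT64_C(0x8000000000000000);
    memcpy(&t, &tbits, sizeof t);
    return t;
}
```

The exponent test is done purely on the integer representation, so no
floating-point constant has to be loaded for the comparison.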

> @Richard Biener (et al.):
> 
> 1. Is a primitive for "floating-point > 2**x", which generates such
>    an "integer" code sequence, already available, at least for
>    float/binary32 and double/binary64?
> 
> 2. the procedural code generator for __builtin_trunc() etc.  uses
>    __builtin_fabs() and __builtin_copysign() as building blocks.
>    These would need to (and of course should) be modified to generate
>    psllq/psrlq pairs instead of andpd/andnpd referencing a memory
>    location with either -0.0 or ~(-0.0).
> 
> For -ffast-math, where the sign of -0.0 is not handled and the spurious
> invalid floating-point exception for |argument| >= 2**63 is acceptable,
> it boils down to:
> 
> .code64
> .intel_syntax
> .equ    BIAS, 1023
> .text
>         cvttsd2si rax, xmm0  # rax = trunc(argument)
>         jo      .Lexit       # argument indefinite?
>                              # |argument| > 0x1.0p63?
>         cvtsi2sd xmm0, rax   # xmm0 = trunc(argument)
> .L0:    ret
> .end
> 
> [...]
> 
> >> Right, the conversions dominate both the original and the code I posted.
> >> It's easy to get rid of them, with still slightly shorter and faster
> >> branchless code (17 instructions, 84 bytes, instead of 13 instructions,
> >> 57 + 32 = 89 bytes):
> >> 
> >>                                         .code64
> >>                                         .intel_syntax noprefix
> >>                                         .text
> >>    0:   48 b8 00 00 00 00 00 00 30 43   mov     rax, 0x4330000000000000
> >>    a:   66 48 0f 6e d0                  movq    xmm2, rax        # xmm2 = 0x1.0p52 = 4503599627370496.0
> >>    f:   48 b8 00 00 00 00 00 00 f0 3f   mov     rax, 0x3FF0000000000000
> >>   19:   f2 0f 10 c8                     movsd   xmm1, xmm0       # xmm1 = argument
> >>   1d:   66 0f 73 f0 01                  psllq   xmm0, 1
> >>   22:   66 0f 73 d0 01                  psrlq   xmm0, 1          # xmm0 = |argument|
> >>   27:   66 0f 73 d1 3f                  psrlq   xmm1, 63
> >>   2c:   66 0f 73 f1 3f                  psllq   xmm1, 63         # xmm1 = (argument & -0.0) ? -0.0 : +0.0
> >>   31:   f2 0f 10 d8                     movsd   xmm3, xmm0
> >>   35:   f2 0f 58 c2                     addsd   xmm0, xmm2       # xmm0 = |argument| + 0x1.0p52
> >>   39:   f2 0f 5c c2                     subsd   xmm0, xmm2       # xmm0 = |argument| - 0x1.0p52
> >>                                                                  #      = rint(|argument|)
> >>   3d:   66 48 0f 6e d0                  movq    xmm2, rax        # xmm2 = -0x1.0p0 = -1.0
> > 
> > Huh? I see +1.0, -1 would be 0xBFF0000000000000.
> 
> Spurious error in the comment.
> I modified code which uses -1.0 and performs (a commutative) "addsd xmm2, xmm2"
> instead of "subsd xmm0, xmm2" to save a "movsd" instruction.
> 
> >>   42:   f2 0f c2 d8 01                  cmpltsd xmm3, xmm0       # xmm3 = (|argument| < rint(|argument|)) ? ~0L : 0L
> >>   47:   66 0f 54 d3                     andpd   xmm2, xmm3       # xmm2 = (|argument| < rint(|argument|)) ? 1.0 : 0.0
> >>   4b:   f2 0f 5c c2                     subsd   xmm0, xmm2       # xmm0 = rint(|argument|)
> >>                                                                  #      - (|argument| < rint(|argument|)) ? 1.0 : 0.0
> >>                                                                  #      = trunc(|argument|)
> >>   4f:   66 0f 56 c1                     orpd    xmm0, xmm1       # xmm0 = trunc(argument)
> >>   53:   c3                              ret
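
The add/subtract 0x1.0p52 idea in this listing can be sketched in C as
below. The function name is made up, the code assumes the default
round-to-nearest mode and no -ffast-math (the volatile is only there to
keep the compiler from folding the add/subtract pair away), and the C
conditional is not guaranteed to compile to branchless code. Note that
the sketch adds an explicit guard for |x| >= 2^52 and NaN, because the
add/subtract step only yields rint(|x|) below that bound.

```c
#include <stdint.h>
#include <string.h>

static double trunc_addsub(double x)
{
    const double two52 = 0x1.0p52;
    const uint64_t signmask = UINT64_C(0x8000000000000000);
    uint64_t bits, sign, rbits;
    double a, r;
    volatile double sum;

    memcpy(&bits, &x, sizeof bits);
    sign = bits & signmask;              /* sign of the argument (-0.0 or +0.0) */
    bits &= ~signmask;
    memcpy(&a, &bits, sizeof a);         /* a = |x| */

    /* Guard added in this sketch: values >= 2^52 are already integral;
       the negated comparison also lets NaN through unchanged. */
    if (!(a < two52))
        return x;

    sum = a + two52;                     /* rounds a to an integer ... */
    r = sum - two52;                     /* ... r = rint(a) */
    if (a < r)                           /* rounded up: step back by 1.0 */
        r -= 1.0;

    memcpy(&rbits, &r, sizeof rbits);
    rbits |= sign;                       /* restore the sign, also for -0.0 */
    memcpy(&r, &rbits, sizeof r);
    return r;
}
```

For example, 2.7 becomes 3.0 after the add/subtract step and is then
corrected down to 2.0, while 2.5 rounds to the even value 2.0 and needs
no correction.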
> 
> regards
> Stefan

        Regards,
        Gabriel
