Re: [PATCH 0/2] Initial support for AVX512FP16

2021-07-14 Thread Hongtao Liu via Gcc-patches
> >
> Setting excess_precision_type to FLT_EVAL_METHOD_PROMOTE_TO_FLOAT16,
> which rounds after each operation, keeps the semantics right.
> And I'll document the behavior difference between soft-fp and the
> AVX512FP16 instructions for exceptions.
I got some feedback from my colleague who's working on _Float16
support for LLVM.
The LLVM side wants to set FLT_EVAL_METHOD_PROMOTE_TO_FLOAT for
soft-fp so that the generated code can be more efficient.
i.e.
_Float16 a, b, c, d;
d = a + b + c;

would be transformed to
float tmp, tmp1, a1, b1, c1;
a1 = (float) a;
b1 = (float) b;
c1 = (float) c;
tmp = a1 + b1;
tmp1 = tmp + c1;
d = (_Float16) tmp1;

so there's only one truncation at the end.

If users want to round back after every operation, the code should be
written explicitly as
_Float16 a, b, c, d, e;
e = a + b;
d = e + c;

That's what Clang does; quoting from [1]:
 _Float16 arithmetic will be performed using native half-precision
support when available on the target (e.g. on ARMv8.2a); otherwise it
will be performed at a higher precision (currently always float) and
then truncated down to _Float16. Note that C and C++ allow
intermediate floating-point operands of an expression to be computed
with greater precision than is expressible in their type, so Clang may
avoid intermediate truncations in certain cases; this may lead to
results that are inconsistent with native arithmetic.

and so does GCC for Arm; quoting from arm.c:

/* We can calculate either in 16-bit range and precision or
   32-bit range and precision.  Make that decision based on whether
   we have native support for the ARMv8.2-A 16-bit floating-point
   instructions or not.  */
return (TARGET_VFP_FP16INST
? FLT_EVAL_METHOD_PROMOTE_TO_FLOAT16
: FLT_EVAL_METHOD_PROMOTE_TO_FLOAT);


[1] https://clang.llvm.org/docs/LanguageExtensions.html
> > --
> > Joseph S. Myers
> > jos...@codesourcery.com
>
>
>
> --
> BR,
> Hongtao



-- 
BR,
Hongtao


Re: [PATCH 0/2] Initial support for AVX512FP16

2021-07-06 Thread Hongtao Liu via Gcc-patches
On Wed, Jul 7, 2021 at 2:11 AM Joseph Myers  wrote:
>
> On Tue, 6 Jul 2021, Hongtao Liu via Gcc-patches wrote:
>
> > There may be inconsistent behavior between soft-fp and avx512fp16
> > instructions if we emulate _Float16 w/ float .
> >  i.e
> >   1) for a + b - c where b and c are variables with the same big value
> > and a + b is NAN at _Float16 and real value at float, avx512fp16
> > instruction will raise an exception but soft-fp won't(unless it's
> > rounded after every operation.)
>
> There are at least two variants of emulation using float:
>
> (a) Using the excess precision support, as on AArch64, which means the C
> front end converts the _Float16 operations to float ones, with explicit
> narrowing on assignment (and conversion as if by assignment - argument
> passing and return, casts, etc.).  Excess precision indeed involves
> different semantics compared to doing each operation directly in the range
> and precision of _Float16.
>
Yes, setting excess_precision_type to FLT_EVAL_METHOD_PROMOTE_TO_FLOAT16
makes the compiler round after each operation.
> (b) Letting the expand/optabs code generate operations in a wider mode.
> My understanding is that the result should get converted back to the
> narrower mode after each operation (by the expand/optabs code /
> convert_move called by it generating such a conversion), meaning (for
> basic arithmetic operations) that the semantics end up the same as if the
> operation had been done directly on _Float16 (but with more truncation
> operations occurring than would be the case with excess precision support
> used).
Yes, just with different behavior related to exceptions.
>
> >   2) a / b where b is denormal value and AVX512FP16 won't flush it to
> > zero even w/ -Ofast, but when it's extended to float and using divss,
> > it will be flushed to zero and raise an exception when compiling w/
> > Ofast
>
> I don't think that's a concern, flush to zero is well outside the scope of
> standards defining _Float16 semantics.
Ok.
>
> > So the key point is that the soft-fp and avx512fp16 instructions may
> > do not behave the same on the exception, is this acceptable?
>
> As far as I understand it, all cases within the standards will behave as
> expected for exceptions, whether pure software floating-point is used,
> pure hardware _Float16 arithmetic or one of the forms of emulation listed
> above.  (Where "as expected" itself depends on the value of
> FLT_EVAL_METHOD, i.e. whether excess precision is used for _Float16.)
> Flush to zero and trapping exceptions are outside the scope of the
> standards.  Since trapping exceptions is outside the scope of the
> standards, so is anything that distinguishes whether an arithmetic
> operation raises the same exception more than once or the order in which
> it raises different exceptions (e.g. the possibility of "inexact" being
> raised more than once, both by arithmetic on float and by narrowing from
> float to _Float16).
>
Setting excess_precision_type to FLT_EVAL_METHOD_PROMOTE_TO_FLOAT16,
which rounds after each operation, keeps the semantics right.
And I'll document the behavior difference between soft-fp and the
AVX512FP16 instructions for exceptions.
> --
> Joseph S. Myers
> jos...@codesourcery.com



-- 
BR,
Hongtao


Re: [PATCH 0/2] Initial support for AVX512FP16

2021-07-06 Thread Joseph Myers
On Tue, 6 Jul 2021, H.J. Lu via Gcc-patches wrote:

> > > So the key point is that the soft-fp and avx512fp16 instructions may
> > > do not behave the same on the exception, is this acceptable?
> >
> > I think that's quite often the case for soft-fp.
> 
> So this is a GCC limitation.  Please document the different behaviors
> of _Float16 with and without AVX512FP16, similar to

I don't think it's yet clear there will be any such limitation, just 
semantics that depend on whether the excess precision support is used or 
not (which is covered by FLT_EVAL_METHOD, like on AArch64).

-- 
Joseph S. Myers
jos...@codesourcery.com


Re: [PATCH 0/2] Initial support for AVX512FP16

2021-07-06 Thread Joseph Myers
On Tue, 6 Jul 2021, Richard Biener via Gcc-patches wrote:

> >   /* Look for a wider mode of the same class for which we think we
> >  can open-code the operation.  Check for a widening multiply at the
> >  wider mode as well.  */
> >
> >   if (CLASS_HAS_WIDER_MODES_P (mclass)
> >   && methods != OPTAB_DIRECT && methods != OPTAB_LIB)
> > FOR_EACH_WIDER_MODE (wider_mode, mode)
> >
> > I think pass_expand did this for some reason, so I'm a little afraid
> > to touch this part of the code.
> 
> It might be the first time we hit this ;)  I don't think it's safe for
> non-integer modes or even anything but a small set of operations.
> Just consider ssadd besides rounding issues or FP.

I think it's safe for basic arithmetic (+-*/), for IEEE floating-point 
arithmetic when the wider mode has significand more than twice as wide as 
the narrower one (given that the result is immediately converted back to 
the narrower mode, double rounding isn't an issue given such a constraint 
on the widths of the modes - and given that the wider mode has sufficient 
exponent range to avoid intermediate overflow / underflow as an issue as 
well).

(The precise requirements on the width of the modes may depend on the 
operation in question.  It's *not* safe for fused multiply-add, regardless 
of the widths in question; a software implementation of fmaf16 using float 
arithmetic could be quite simple, using round-to-odd like e.g. glibc's 
implementation of fmaf using double arithmetic, but "call fmaf then 
convert the result to _Float16" would be an incorrect implementation.)
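For concreteness, here is a hedged sketch of that round-to-odd scheme one level up, fmaf-via-double (the analogue of fmaf16-via-float; glibc's real fmaf additionally handles rounding modes, traps, and NaN/overflow corner cases that this sketch ignores). The helper names are invented, and the TwoSum trick stands in for reading the inexact flag:

```c
#include <stdint.h>
#include <string.h>

/* Move a finite double one representable step up or down via the
   sign-magnitude bit pattern.  Assumes the step does not cross zero,
   which holds for the use below.  */
static double
step_ulp (double s, int up)
{
  uint64_t bits;
  memcpy (&bits, &s, sizeof bits);
  int neg = (int) (bits >> 63);
  bits += (up != neg) ? 1 : (uint64_t) -1;  /* away from / toward zero */
  memcpy (&s, &bits, sizeof bits);
  return s;
}

/* fmaf via double with round-to-odd.  The product of two floats is
   exact in double (24 + 24 <= 53 significand bits).  The addition is
   rounded once in double; if that rounding was inexact and landed on an
   even significand, nudge to the neighbor with an odd last bit
   ("round to odd").  The final truncation to float then cannot
   double-round, so the result is the correctly rounded fmaf.
   Sketch only: assumes round-to-nearest and finite operands.  */
static float
fmaf_via_double (float x, float y, float z)
{
  double p  = (double) x * (double) y;  /* exact product */
  double zd = (double) z;
  double s  = p + zd;                   /* one rounding in double */
  /* TwoSum: err is the exact residual of the addition, so err == 0
     iff s is exact; its sign says on which side of the exact value
     s landed.  */
  double bv  = s - p;
  double err = (p - (s - bv)) + (zd - bv);
  if (err != 0)
    {
      uint64_t bits;
      memcpy (&bits, &s, sizeof bits);
      if ((bits & 1) == 0)              /* even: take the odd neighbor */
        s = step_ulp (s, err > 0);
    }
  return (float) s;                     /* single rounding to float */
}
```

Note the residual computation relies on IEEE semantics, so this must not be compiled with -ffast-math or similar value-changing options.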

-- 
Joseph S. Myers
jos...@codesourcery.com


Re: [PATCH 0/2] Initial support for AVX512FP16

2021-07-06 Thread Joseph Myers
On Tue, 6 Jul 2021, Hongtao Liu via Gcc-patches wrote:

> There may be inconsistent behavior between soft-fp and avx512fp16
> instructions if we emulate _Float16 w/ float, i.e.:
>   1) For a + b - c, where b and c are variables with the same big value
> and a + b is NaN at _Float16 but a real value at float, the avx512fp16
> instruction will raise an exception but soft-fp won't (unless it's
> rounded after every operation).

There are at least two variants of emulation using float:

(a) Using the excess precision support, as on AArch64, which means the C 
front end converts the _Float16 operations to float ones, with explicit 
narrowing on assignment (and conversion as if by assignment - argument 
passing and return, casts, etc.).  Excess precision indeed involves 
different semantics compared to doing each operation directly in the range 
and precision of _Float16.

(b) Letting the expand/optabs code generate operations in a wider mode.  
My understanding is that the result should get converted back to the 
narrower mode after each operation (by the expand/optabs code / 
convert_move called by it generating such a conversion), meaning (for 
basic arithmetic operations) that the semantics end up the same as if the 
operation had been done directly on _Float16 (but with more truncation 
operations occurring than would be the case with excess precision support 
used).

>   2) For a / b where b is a denormal value, AVX512FP16 won't flush it
> to zero even w/ -Ofast, but when it's extended to float and divss is
> used, it will be flushed to zero and raise an exception when compiling
> w/ -Ofast.

I don't think that's a concern, flush to zero is well outside the scope of 
standards defining _Float16 semantics.

> So the key point is that the soft-fp and avx512fp16 instructions may
> not behave the same for exceptions; is this acceptable?

As far as I understand it, all cases within the standards will behave as 
expected for exceptions, whether pure software floating-point is used, 
pure hardware _Float16 arithmetic or one of the forms of emulation listed 
above.  (Where "as expected" itself depends on the value of 
FLT_EVAL_METHOD, i.e. whether excess precision is used for _Float16.)  
Flush to zero and trapping exceptions are outside the scope of the 
standards.  Since trapping exceptions is outside the scope of the 
standards, so is anything that distinguishes whether an arithmetic 
operation raises the same exception more than once or the order in which 
it raises different exceptions (e.g. the possibility of "inexact" being 
raised more than once, both by arithmetic on float and by narrowing from 
float to _Float16).

-- 
Joseph S. Myers
jos...@codesourcery.com


Re: [PATCH 0/2] Initial support for AVX512FP16

2021-07-06 Thread H.J. Lu via Gcc-patches
On Tue, Jul 6, 2021 at 3:15 AM Richard Biener via Gcc-patches
 wrote:
>
> On Tue, Jul 6, 2021 at 10:46 AM Hongtao Liu  wrote:
> >
> > On Thu, Jul 1, 2021 at 9:04 PM Jakub Jelinek via Gcc-patches
> >  wrote:
> > >
> > > On Thu, Jul 01, 2021 at 02:58:01PM +0200, Richard Biener wrote:
> > > > > The main issue is complex _Float16 functions in libgcc.  If _Float16 
> > > > > doesn't
> > > > > require -mavx512fp16, we need to compile complex _Float16 functions in
> > > > > libgcc without -mavx512fp16.  Complex _Float16 performance is very
> > > > > important for our _Float16 usage.   _Float16 performance has to be
> > > > > very fast.  There should be no emulation anywhere when -mavx512fp16
> > > > > is used.   That is why _Float16 is available only with -mavx512fp16.
> > > >
> > > > It should be possible to emulate scalar _Float16 using _Float32 with a
> > > > reasonable
> > > > performance trade-off.  I think users caring for _Float16 performance 
> > > > will
> > > > use vector intrinsics anyway since for scalar code _Float32 code will 
> > > > likely
> > > > perform the same (at double storage cost)
> > >
> > > Only if it is allowed to have excess precision for _Float16.  If not, then
> > > one would need to (expensively?) round after every operation at least.
> > There may be inconsistent behavior between soft-fp and avx512fp16
> > instructions if we emulate _Float16 w/ float .
> >  i.e
> >   1) for a + b - c where b and c are variables with the same big value
> > and a + b is NAN at _Float16 and real value at float, avx512fp16
> > instruction will raise an exception but soft-fp won't(unless it's
> > rounded after every operation.)
> >   2) a / b where b is denormal value and AVX512FP16 won't flush it to
> > zero even w/ -Ofast, but when it's extended to float and using divss,
> > it will be flushed to zero and raise an exception when compiling w/
> > Ofast
> >
> > To solve the upper issue, i try to add full emulation for _Float16(for
> > all those under libgcc/soft-fp/, i.e. add/sub/mul/div/cmp, .etc),
> > problem is in pass_expand, it always try wider mode first instead of
> > using soft-fp
> >
> >   /* Look for a wider mode of the same class for which we think we
> >  can open-code the operation.  Check for a widening multiply at the
> >  wider mode as well.  */
> >
> >   if (CLASS_HAS_WIDER_MODES_P (mclass)
> >   && methods != OPTAB_DIRECT && methods != OPTAB_LIB)
> > FOR_EACH_WIDER_MODE (wider_mode, mode)
> >
> > I think pass_expand did this for some reason, so I'm a little afraid
> > to touch this part of the code.
>
> It might be the first time we hit this ;)  I don't think it's safe for
> non-integer modes or even anything but a small set of operations.
> Just consider ssadd besides rounding issues or FP.
>
> > So the key point is that the soft-fp and avx512fp16 instructions may
> > do not behave the same on the exception, is this acceptable?
>
> I think that's quite often the case for soft-fp.

So this is a GCC limitation.  Please document the different behaviors
of _Float16 with and without AVX512FP16, similar to

---
 The '__fp16' type may only be used as an argument to intrinsics defined
in '', or as a storage format.  For purposes of arithmetic
and other operations, '__fp16' values in C or C++ expressions are
automatically promoted to 'float'.

 The ARM target provides hardware support for conversions between
'__fp16' and 'float' values as an extension to VFP and NEON (Advanced
SIMD), and from ARMv8-A provides hardware support for conversions
between '__fp16' and 'double' values.  GCC generates code using these
hardware instructions if you compile with options to select an FPU that
provides them; for example, '-mfpu=neon-fp16 -mfloat-abi=softfp', in
addition to the '-mfp16-format' option to select a half-precision
format.

 Language-level support for the '__fp16' data type is independent of
whether GCC generates code using hardware floating-point instructions.
In cases where hardware support is not specified, GCC implements
conversions between '__fp16' and other types as library calls.

 It is recommended that portable code use the '_Float16' type defined by
ISO/IEC TS 18661-3:2015.  *Note Floating Types::.
---

We can similarly recommend that portable code use _Float16 with
AVX512FP16.

> > BTW, i've finished a initial patch to enable _Float16 on sse2, and
> > emulate _Float16 operation w/ float, and it passes all  312 new tests
> > which are related to _Float16, but those units tests doesn't cover the
> > scenario I'm talking about.
> > >
> > > Jakub
> > >
> >
> >
> > --
> > BR,
> > Hongtao



-- 
H.J.


Re: [PATCH 0/2] Initial support for AVX512FP16

2021-07-06 Thread Richard Biener via Gcc-patches
On Tue, Jul 6, 2021 at 10:46 AM Hongtao Liu  wrote:
>
> On Thu, Jul 1, 2021 at 9:04 PM Jakub Jelinek via Gcc-patches
>  wrote:
> >
> > On Thu, Jul 01, 2021 at 02:58:01PM +0200, Richard Biener wrote:
> > > > The main issue is complex _Float16 functions in libgcc.  If _Float16 
> > > > doesn't
> > > > require -mavx512fp16, we need to compile complex _Float16 functions in
> > > > libgcc without -mavx512fp16.  Complex _Float16 performance is very
> > > > important for our _Float16 usage.   _Float16 performance has to be
> > > > very fast.  There should be no emulation anywhere when -mavx512fp16
> > > > is used.   That is why _Float16 is available only with -mavx512fp16.
> > >
> > > It should be possible to emulate scalar _Float16 using _Float32 with a
> > > reasonable
> > > performance trade-off.  I think users caring for _Float16 performance will
> > > use vector intrinsics anyway since for scalar code _Float32 code will 
> > > likely
> > > perform the same (at double storage cost)
> >
> > Only if it is allowed to have excess precision for _Float16.  If not, then
> > one would need to (expensively?) round after every operation at least.
> There may be inconsistent behavior between soft-fp and avx512fp16
> instructions if we emulate _Float16 w/ float .
>  i.e
>   1) for a + b - c where b and c are variables with the same big value
> and a + b is NAN at _Float16 and real value at float, avx512fp16
> instruction will raise an exception but soft-fp won't(unless it's
> rounded after every operation.)
>   2) a / b where b is denormal value and AVX512FP16 won't flush it to
> zero even w/ -Ofast, but when it's extended to float and using divss,
> it will be flushed to zero and raise an exception when compiling w/
> Ofast
>
> To solve the upper issue, i try to add full emulation for _Float16(for
> all those under libgcc/soft-fp/, i.e. add/sub/mul/div/cmp, .etc),
> problem is in pass_expand, it always try wider mode first instead of
> using soft-fp
>
>   /* Look for a wider mode of the same class for which we think we
>  can open-code the operation.  Check for a widening multiply at the
>  wider mode as well.  */
>
>   if (CLASS_HAS_WIDER_MODES_P (mclass)
>   && methods != OPTAB_DIRECT && methods != OPTAB_LIB)
> FOR_EACH_WIDER_MODE (wider_mode, mode)
>
> I think pass_expand did this for some reason, so I'm a little afraid
> to touch this part of the code.

It might be the first time we hit this ;)  I don't think it's safe for
non-integer modes or even anything but a small set of operations.
Just consider ssadd besides rounding issues or FP.

> So the key point is that the soft-fp and avx512fp16 instructions may
> do not behave the same on the exception, is this acceptable?

I think that's quite often the case for soft-fp.

> BTW, i've finished a initial patch to enable _Float16 on sse2, and
> emulate _Float16 operation w/ float, and it passes all  312 new tests
> which are related to _Float16, but those units tests doesn't cover the
> scenario I'm talking about.
> >
> > Jakub
> >
>
>
> --
> BR,
> Hongtao


Re: [PATCH 0/2] Initial support for AVX512FP16

2021-07-06 Thread Hongtao Liu via Gcc-patches
On Fri, Jul 2, 2021 at 4:46 AM Joseph Myers  wrote:
>
> Some general comments, following what I said on libc-alpha:
>
>
> 1. Can you confirm that the ABI being used for 64-bit, for _Float16 and
> _Complex _Float16 argument passing and return, follows the current x86_64
> ABI document?
>
>
> 2. Can you confirm that if you build with this instruction set extension
> enabled by default, and run GCC tests for a corresponding (emulated?)
> processor, all the existing float16 tests in the testsuite are enabled and
> PASS (both compilation and execution) (both 64-bit and 32-bit testing)?
>
>
> 3. There's an active 32-bit ABI mailing list (ia32-...@googlegroups.com).
> If you want to support _Float16 in the 32-bit case, please work with it to
> get the corresponding ABI documented (using only memory and
> general-purpose registers seems like a good idea, so that the ABI can be
> supported for the base architecture without depending on SSE registers
> being present).  In the absence of 32-bit ABI support it might be better
> to disable the HFmode support for 32-bit.
>
>
> 4. Support for _Float16 really ought not to depend on whether a particular
> instruction set extension is present, just like with other floating-point
> types; it makes sense, as an API, for all x86 processors (and like many
> APIs, it will be faster on some processors than on others).  More specific
> points here are:
>
> (a) Basic arithmetic (+-*/) can be done by converting to SFmode, doing
> arithmetic there and converting back to HFmode; the results of doing so
> will be correctly rounded.  Indeed, I think optabs.c handles that
> automatically when operations are available on a wider mode but not on the
> desired mode (but you'd need to check carefully that all the expected
> conversions do occur).
So would the different exception behavior between soft-fp and
avx512fp16 be acceptable?
>
> (b) Conversions to/from all other floating-point modes will always be
> needed, whether in hardware or in software.
>
> (c) In the F16C (Ivy Bridge and later) case, where you have hardware
> conversions to/from float (only), it's fine to convert to double (or long
> double) via float.  (On efficiency grounds, widening from HFmode to TFmode
> should be a pure software operations, that should be faster than having an
> intermediate conversion to SFmode when the SFmode-to-TFmode conversion is
> a software operation.)
>
> (d) In the F16C case (where there are hardware conversions only from
> SFmode, not from wider modes), conversion *from* DFmode (or XFmode or
> TFmode) to HFmode should be a software operation, to avoid double
> rounding; an intermediate conversion to SFmode would be incorrect.
>
> (e) It's OK for conversions to/from integer modes to go via SFmode
> (although I don't know if that's efficient or not).  Any case where a
> conversion from integer to SFmode is inexact would overflow HFmode, so
> there are no double rounding issues.
>
> (f) In the F16C case, it seems the hardware instructions only work on
> vectors, not scalars, so care would need to be taken to use them for
> scalar conversions only if the other elements of the vector register are
> known to be safe to convert without raising any exceptions (e.g. all zero
> bits, or -fno-trapping-math in effect).
>
> (g) If concerned about efficiency of intermediate truncations on
> processors without hardware _Float16 arithmetic, look at
> aarch64_excess_precision; you have the option of using excess precision
> for _Float16 by default, though that only really helps for C given the
> lack of excess precision support in the C++ front end.  (Enabling this can
> cause trouble for code that only expects C99/C11 values of
> FLT_EVAL_METHOD, however; see the -fpermitted-flt-eval-methods option for
> more details.)
>
>
> 5. Suppose that in some cases you do disable _Float16 support (whether
> that's just for 32-bit until the ABI has been defined, or also in the
> absence of instruction set support despite my comments above).  Then the
> way you do that in this patch series, enabling the type in
> ix86_scalar_mode_supported_p and ix86_libgcc_floating_mode_supported_p and
> giving an error later in ix86_expand_move, is a bad idea.
>
> Errors in expanders are generally problematic (they don't have good
> location information available).  But apart from that, ordinary user code
> should be able to tell whether _Float16 is supported by testing whether
> e.g. __FLT16_MANT_DIG__ is defined (like float.h does), or by including
> float.h (with __STDC_WANT_IEC_60559_TYPES_EXT__ defined) and then testing
> whether one of the FLT16_* macros is defined, or in a configure test by
> just declaring something using the _Float16 type.  Patch 1 changes
> check_effective_target_float16 to work around your technique for disabling
> _Float16 in ix86_expand_move, but it should be considered a stable user
> API that any of the above methods can be used in user code to check for
> _Float16 support - user code shouldn't need to kno

Re: [PATCH 0/2] Initial support for AVX512FP16

2021-07-06 Thread Hongtao Liu via Gcc-patches
On Thu, Jul 1, 2021 at 9:04 PM Jakub Jelinek via Gcc-patches
 wrote:
>
> On Thu, Jul 01, 2021 at 02:58:01PM +0200, Richard Biener wrote:
> > > The main issue is complex _Float16 functions in libgcc.  If _Float16 
> > > doesn't
> > > require -mavx512fp16, we need to compile complex _Float16 functions in
> > > libgcc without -mavx512fp16.  Complex _Float16 performance is very
> > > important for our _Float16 usage.   _Float16 performance has to be
> > > very fast.  There should be no emulation anywhere when -mavx512fp16
> > > is used.   That is why _Float16 is available only with -mavx512fp16.
> >
> > It should be possible to emulate scalar _Float16 using _Float32 with a
> > reasonable
> > performance trade-off.  I think users caring for _Float16 performance will
> > use vector intrinsics anyway since for scalar code _Float32 code will likely
> > perform the same (at double storage cost)
>
> Only if it is allowed to have excess precision for _Float16.  If not, then
> one would need to (expensively?) round after every operation at least.
There may be inconsistent behavior between soft-fp and avx512fp16
instructions if we emulate _Float16 w/ float, i.e.:
  1) For a + b - c, where b and c are variables with the same big value
and a + b is NaN at _Float16 but a real value at float, the avx512fp16
instruction will raise an exception but soft-fp won't (unless it's
rounded after every operation).
  2) For a / b where b is a denormal value, AVX512FP16 won't flush it to
zero even w/ -Ofast, but when it's extended to float and divss is used,
it will be flushed to zero and raise an exception when compiling w/
-Ofast.

To solve the above issue, I tried to add full emulation for _Float16
(all the routines under libgcc/soft-fp/, i.e. add/sub/mul/div/cmp,
etc.), but the problem is that pass_expand always tries a wider mode
first instead of using soft-fp:

  /* Look for a wider mode of the same class for which we think we
 can open-code the operation.  Check for a widening multiply at the
 wider mode as well.  */

  if (CLASS_HAS_WIDER_MODES_P (mclass)
  && methods != OPTAB_DIRECT && methods != OPTAB_LIB)
FOR_EACH_WIDER_MODE (wider_mode, mode)

I think pass_expand did this for some reason, so I'm a little afraid
to touch this part of the code.

So the key point is that the soft-fp and avx512fp16 instructions may
not behave the same for exceptions; is this acceptable?

BTW, I've finished an initial patch to enable _Float16 on sse2 and
emulate _Float16 operations w/ float; it passes all 312 new tests
related to _Float16, but those unit tests don't cover the scenario I'm
talking about.
>
> Jakub
>


-- 
BR,
Hongtao


Re: [PATCH 0/2] Initial support for AVX512FP16

2021-07-05 Thread Richard Biener via Gcc-patches
On Mon, Jul 5, 2021 at 3:21 AM Hongtao Liu via Gcc-patches
 wrote:
>
> On Fri, Jul 2, 2021 at 4:03 PM Uros Bizjak  wrote:
> >
> > On Fri, Jul 2, 2021 at 8:25 AM Hongtao Liu  wrote:
> >
> > > > >   AVX512FP16 is disclosed, refer to [1].
> > > > >   There're 100+ instructions for AVX512FP16, 67 gcc patches, for the 
> > > > > convenience of review, we divide the 67 patches into 2 major parts.
> > > > >   The first part is 2 patches containing basic support for AVX512FP16 
> > > > > (options, cpuid, _Float16 type, libgcc, etc.), and the second part is 
> > > > > 65 patches covering all instructions of AVX512FP16(including 
> > > > > intrinsic support and some optimizations).
> > > > >   There is a problem with the first part, _Float16 is not a C++ 
> > > > > standard, so the front-end does not support this type and its 
> > > > > mangling, so we "make up" a _Float16 type on the back-end and use 
> > > > > _DF16 as its mangling. The purpose of this is to align with llvm 
> > > > > side, because llvm C++ FE already supports _Float16[2].
> > > > >
> > > > > [1] 
> > > > > https://software.intel.com/content/www/us/en/develop/download/intel-avx512-fp16-architecture-specification.html
> > > > > [2] https://reviews.llvm.org/D33719
> > > >
> > > > Looking through implementation of _Float16 support, I think, there is
> > > > no need for _Float16 support to depend on AVX512FP16.
> > > >
> > > > The compiler is smart enough to use either a named pattern that
> > > > describes the instruction when available or diverts to a library call
> > > > to a soft-fp implementation. So, I think that general _Float16 support
> > > > should be implemented first (similar to _float128) and then upgraded
> > > > with AVX512FP16 specific instructions.
> > > >
> > > > MOVW loads/stores to XMM reg can be emulated with MOVD and a SImode
> > > > secondary_reload register.
> > > >
> > > MOVD is under sse2, so is pinsrw, which means if we want xmm
> > > load/stores for HF, sse2 is the least requirement.
> > > Also we support PEXTRW reg/m16, xmm, imm8 under SSE4_1 under which we
> > > have 16bit direct load/store for HFmode and no need for a secondary
> > > reload.
> > > So for simplicity, can we just restrict _Float16 under sse4_1?
> >
> > When baseline is not met, the equivalent integer calling convention is
> > used, for example:
> The problem is that under TARGET_SSE w/ -mno-sse2, the float calling
> convention is available for SSE registers; it's OK for float since
> there's movss under SSE, but there's no 16-bit load/store for SSE
> registers, nor any move between a GPR and an SSE register.

You can always spill though; that's preferred on some archs
over xmm <-> gpr moves anyway.

Richard.

> >
> > --cut here--
> > typedef int __v2si __attribute__ ((vector_size (8)));
> >
> > __v2si foo (__v2si a, __v2si b)
> > {
> >   return a + b;
> > }
> > --cut here--
> >
> > will still compile with -m32 -mno-mmx with warnings:
> >
> > mmx1.c: In function ‘foo’:
> > mmx1.c:4:1: warning: MMX vector return without MMX enabled changes the
> > ABI [-Wpsabi]
> > mmx1.c:3:8: warning: MMX vector argument without MMX enabled changes
> > the ABI [-Wpsabi]
> >
> > So, by setting the baseline to SSE4.1, a big pool of targets will be
> > forced to use alternative ABI. This is quite inconvenient, and we
> > revert to the alternative ABI if we *really*  can't satisfy ABI
> > requirements (e.g. register type is not available, basic move insn
> > can't be implemented). Based on your analysis, I think that SSE2
> > should be the baseline.
> Agreed.
> >
> > Also, looking at insn tables, it looks that movzwl from memory + movd
> > is faster than pinsrw (and similar for pextrw to memory), but I have
> > no hard data here.
> >
> > Regarding secondary_reload, a scratch register is needed in case of
> > HImode moves between memory and XMM reg, since scratch register needs
> > a different mode than source and destination. Please see
> > TARGET_SECONDARY_RELOAD documentation and several examples in the
> > source.
> >
> > Uros.
>
>
>
> --
> BR,
> Hongtao


Re: [PATCH 0/2] Initial support for AVX512FP16

2021-07-04 Thread Hongtao Liu via Gcc-patches
On Fri, Jul 2, 2021 at 4:03 PM Uros Bizjak  wrote:
>
> On Fri, Jul 2, 2021 at 8:25 AM Hongtao Liu  wrote:
>
> > > >   AVX512FP16 is disclosed, refer to [1].
> > > >   There're 100+ instructions for AVX512FP16, 67 gcc patches, for the 
> > > > convenience of review, we divide the 67 patches into 2 major parts.
> > > >   The first part is 2 patches containing basic support for AVX512FP16 
> > > > (options, cpuid, _Float16 type, libgcc, etc.), and the second part is 
> > > > 65 patches covering all instructions of AVX512FP16(including intrinsic 
> > > > support and some optimizations).
> > > >   There is a problem with the first part, _Float16 is not a C++ 
> > > > standard, so the front-end does not support this type and its mangling, 
> > > > so we "make up" a _Float16 type on the back-end and use _DF16 as its 
> > > > mangling. The purpose of this is to align with llvm side, because llvm 
> > > > C++ FE already supports _Float16[2].
> > > >
> > > > [1] 
> > > > https://software.intel.com/content/www/us/en/develop/download/intel-avx512-fp16-architecture-specification.html
> > > > [2] https://reviews.llvm.org/D33719
> > >
> > > Looking through implementation of _Float16 support, I think, there is
> > > no need for _Float16 support to depend on AVX512FP16.
> > >
> > > The compiler is smart enough to use either a named pattern that
> > > describes the instruction when available or diverts to a library call
> > > to a soft-fp implementation. So, I think that general _Float16 support
> > > should be implemented first (similar to _float128) and then upgraded
> > > with AVX512FP16 specific instructions.
> > >
> > > MOVW loads/stores to XMM reg can be emulated with MOVD and a SImode
> > > secondary_reload register.
> > >
> > MOVD is under sse2, so is pinsrw, which means if we want xmm
> > load/stores for HF, sse2 is the least requirement.
> > Also we support PEXTRW reg/m16, xmm, imm8 under SSE4_1 under which we
> > have 16bit direct load/store for HFmode and no need for a secondary
> > reload.
> > So for simplicity, can we just restrict _Float16 under sse4_1?
>
> When baseline is not met, the equivalent integer calling convention is
> used, for example:
The problem is that under TARGET_SSE w/ -mno-sse2, the float calling
convention still uses SSE registers. That is OK for float, since SSE
provides movss, but there is no 16-bit load/store for SSE registers,
nor any move between a GPR and an SSE register.
>
> --cut here--
> typedef int __v2si __attribute__ ((vector_size (8)));
>
> __v2si foo (__v2si a, __v2si b)
> {
>   return a + b;
> }
> --cut here--
>
> will still compile with -m32 -mno-mmx with warnings:
>
> mmx1.c: In function ‘foo’:
> mmx1.c:4:1: warning: MMX vector return without MMX enabled changes the
> ABI [-Wpsabi]
> mmx1.c:3:8: warning: MMX vector argument without MMX enabled changes
> the ABI [-Wpsabi]
>
> So, by setting the baseline to SSE4.1, a big pool of targets will be
> forced to use alternative ABI. This is quite inconvenient, and we
> revert to the alternative ABI if we *really*  can't satisfy ABI
> requirements (e.g. register type is not available, basic move insn
> can't be implemented). Based on your analysis, I think that SSE2
> should be the baseline.
Agreed.
>
> Also, looking at insn tables, it looks that movzwl from memory + movd
> is faster than pinsrw (and similar for pextrw to memory), but I have
> no hard data here.
>
> Regarding secondary_reload, a scratch register is needed in case of
> HImode moves between memory and XMM reg, since scratch register needs
> a different mode than source and destination. Please see
> TARGET_SECONDARY_RELOAD documentation and several examples in the
> source.
>
> Uros.



-- 
BR,
Hongtao


Re: [PATCH 0/2] Initial support for AVX512FP16

2021-07-03 Thread Hongtao Liu via Gcc-patches
On Fri, Jul 2, 2021 at 4:19 PM Richard Biener
 wrote:
>
> On Fri, Jul 2, 2021 at 10:07 AM Uros Bizjak via Gcc-patches
>  wrote:
> >
> > On Fri, Jul 2, 2021 at 8:25 AM Hongtao Liu  wrote:
> >
> > > > >   AVX512FP16 is disclosed, refer to [1].
> > > > >   There're 100+ instructions for AVX512FP16, 67 gcc patches, for the 
> > > > > convenience of review, we divide the 67 patches into 2 major parts.
> > > > >   The first part is 2 patches containing basic support for AVX512FP16 
> > > > > (options, cpuid, _Float16 type, libgcc, etc.), and the second part is 
> > > > > 65 patches covering all instructions of AVX512FP16(including 
> > > > > intrinsic support and some optimizations).
> > > > >   There is a problem with the first part, _Float16 is not a C++ 
> > > > > standard, so the front-end does not support this type and its 
> > > > > mangling, so we "make up" a _Float16 type on the back-end and use 
> > > > > _DF16 as its mangling. The purpose of this is to align with llvm 
> > > > > side, because llvm C++ FE already supports _Float16[2].
> > > > >
> > > > > [1] 
> > > > > https://software.intel.com/content/www/us/en/develop/download/intel-avx512-fp16-architecture-specification.html
> > > > > [2] https://reviews.llvm.org/D33719
> > > >
> > > > Looking through implementation of _Float16 support, I think, there is
> > > > no need for _Float16 support to depend on AVX512FP16.
> > > >
> > > > The compiler is smart enough to use either a named pattern that
> > > > describes the instruction when available or diverts to a library call
> > > > to a soft-fp implementation. So, I think that general _Float16 support
> > > > should be implemented first (similar to _float128) and then upgraded
> > > > with AVX512FP16 specific instructions.
> > > >
> > > > MOVW loads/stores to XMM reg can be emulated with MOVD and a SImode
> > > > secondary_reload register.
> > > >
> > > MOVD is under sse2, so is pinsrw, which means if we want xmm
> > > load/stores for HF, sse2 is the least requirement.
> > > Also we support PEXTRW reg/m16, xmm, imm8 under SSE4_1 under which we
> > > have 16bit direct load/store for HFmode and no need for a secondary
> > > reload.
> > > So for simplicity, can we just restrict _Float16 under sse4_1?
> >
> > When baseline is not met, the equivalent integer calling convention is
> > used, for example:
> >
> > --cut here--
> > typedef int __v2si __attribute__ ((vector_size (8)));
> >
> > __v2si foo (__v2si a, __v2si b)
> > {
> >   return a + b;
> > }
> > --cut here--
> >
> > will still compile with -m32 -mno-mmx with warnings:
> >
> > mmx1.c: In function ‘foo’:
> > mmx1.c:4:1: warning: MMX vector return without MMX enabled changes the
> > ABI [-Wpsabi]
> > mmx1.c:3:8: warning: MMX vector argument without MMX enabled changes
> > the ABI [-Wpsabi]
> >
> > So, by setting the baseline to SSE4.1, a big pool of targets will be
> > forced to use alternative ABI. This is quite inconvenient, and we
> > revert to the alternative ABI if we *really*  can't satisfy ABI
> > requirements (e.g. register type is not available, basic move insn
> > can't be implemented). Based on your analysis, I think that SSE2
> > should be the baseline.
> >
> > Also, looking at insn tables, it looks that movzwl from memory + movd
> > is faster than pinsrw (and similar for pextrw to memory), but I have
> > no hard data here.
> >
> > Regarding secondary_reload, a scratch register is needed in case of
> > HImode moves between memory and XMM reg, since scratch register needs
> > a different mode than source and destination. Please see
> > TARGET_SECONDARY_RELOAD documentation and several examples in the
> > source.
>
> I would suggest for the purpose of simplifying the initial patch series to
> not make _Float16 supported on 32bits and leave that (and its ABI) for
W/o AVX512FP16 that's OK.
The problem is that AVX512FP16 instructions are also available for -m32,
and the corresponding intrinsics will need the "_Float16" type (or some
other built-in type name), which will also be used directly by users. That
means we still need a 32-bit _Float16 ABI for them.

> future enhancement.  Then the baseline should be SSE2 (x86-64 base)
> which I think should be OK despite needing some awkwardness for
> HFmode stores (scratch reg needed).
>
> Richard.
>
> > Uros.



-- 
BR,
Hongtao


Re: [PATCH 0/2] Initial support for AVX512FP16

2021-07-02 Thread Richard Biener via Gcc-patches
On Fri, Jul 2, 2021 at 10:07 AM Uros Bizjak via Gcc-patches
 wrote:
>
> On Fri, Jul 2, 2021 at 8:25 AM Hongtao Liu  wrote:
>
> > > >   AVX512FP16 is disclosed, refer to [1].
> > > >   There're 100+ instructions for AVX512FP16, 67 gcc patches, for the 
> > > > convenience of review, we divide the 67 patches into 2 major parts.
> > > >   The first part is 2 patches containing basic support for AVX512FP16 
> > > > (options, cpuid, _Float16 type, libgcc, etc.), and the second part is 
> > > > 65 patches covering all instructions of AVX512FP16(including intrinsic 
> > > > support and some optimizations).
> > > >   There is a problem with the first part, _Float16 is not a C++ 
> > > > standard, so the front-end does not support this type and its mangling, 
> > > > so we "make up" a _Float16 type on the back-end and use _DF16 as its 
> > > > mangling. The purpose of this is to align with llvm side, because llvm 
> > > > C++ FE already supports _Float16[2].
> > > >
> > > > [1] 
> > > > https://software.intel.com/content/www/us/en/develop/download/intel-avx512-fp16-architecture-specification.html
> > > > [2] https://reviews.llvm.org/D33719
> > >
> > > Looking through implementation of _Float16 support, I think, there is
> > > no need for _Float16 support to depend on AVX512FP16.
> > >
> > > The compiler is smart enough to use either a named pattern that
> > > describes the instruction when available or diverts to a library call
> > > to a soft-fp implementation. So, I think that general _Float16 support
> > > should be implemented first (similar to _float128) and then upgraded
> > > with AVX512FP16 specific instructions.
> > >
> > > MOVW loads/stores to XMM reg can be emulated with MOVD and a SImode
> > > secondary_reload register.
> > >
> > MOVD is under sse2, so is pinsrw, which means if we want xmm
> > load/stores for HF, sse2 is the least requirement.
> > Also we support PEXTRW reg/m16, xmm, imm8 under SSE4_1 under which we
> > have 16bit direct load/store for HFmode and no need for a secondary
> > reload.
> > So for simplicity, can we just restrict _Float16 under sse4_1?
>
> When baseline is not met, the equivalent integer calling convention is
> used, for example:
>
> --cut here--
> typedef int __v2si __attribute__ ((vector_size (8)));
>
> __v2si foo (__v2si a, __v2si b)
> {
>   return a + b;
> }
> --cut here--
>
> will still compile with -m32 -mno-mmx with warnings:
>
> mmx1.c: In function ‘foo’:
> mmx1.c:4:1: warning: MMX vector return without MMX enabled changes the
> ABI [-Wpsabi]
> mmx1.c:3:8: warning: MMX vector argument without MMX enabled changes
> the ABI [-Wpsabi]
>
> So, by setting the baseline to SSE4.1, a big pool of targets will be
> forced to use alternative ABI. This is quite inconvenient, and we
> revert to the alternative ABI if we *really*  can't satisfy ABI
> requirements (e.g. register type is not available, basic move insn
> can't be implemented). Based on your analysis, I think that SSE2
> should be the baseline.
>
> Also, looking at insn tables, it looks that movzwl from memory + movd
> is faster than pinsrw (and similar for pextrw to memory), but I have
> no hard data here.
>
> Regarding secondary_reload, a scratch register is needed in case of
> HImode moves between memory and XMM reg, since scratch register needs
> a different mode than source and destination. Please see
> TARGET_SECONDARY_RELOAD documentation and several examples in the
> source.

I would suggest, for the purpose of simplifying the initial patch series,
not supporting _Float16 on 32-bit and leaving that (and its ABI) for a
future enhancement.  Then the baseline should be SSE2 (x86-64 base),
which I think should be OK despite needing some awkwardness for
HFmode stores (scratch reg needed).

Richard.

> Uros.


Re: [PATCH 0/2] Initial support for AVX512FP16

2021-07-02 Thread Uros Bizjak via Gcc-patches
On Fri, Jul 2, 2021 at 8:25 AM Hongtao Liu  wrote:

> > >   AVX512FP16 is disclosed, refer to [1].
> > >   There're 100+ instructions for AVX512FP16, 67 gcc patches, for the 
> > > convenience of review, we divide the 67 patches into 2 major parts.
> > >   The first part is 2 patches containing basic support for AVX512FP16 
> > > (options, cpuid, _Float16 type, libgcc, etc.), and the second part is 65 
> > > patches covering all instructions of AVX512FP16(including intrinsic 
> > > support and some optimizations).
> > >   There is a problem with the first part, _Float16 is not a C++ standard, 
> > > so the front-end does not support this type and its mangling, so we "make 
> > > up" a _Float16 type on the back-end and use _DF16 as its mangling. The 
> > > purpose of this is to align with llvm side, because llvm C++ FE already 
> > > supports _Float16[2].
> > >
> > > [1] 
> > > https://software.intel.com/content/www/us/en/develop/download/intel-avx512-fp16-architecture-specification.html
> > > [2] https://reviews.llvm.org/D33719
> >
> > Looking through implementation of _Float16 support, I think, there is
> > no need for _Float16 support to depend on AVX512FP16.
> >
> > The compiler is smart enough to use either a named pattern that
> > describes the instruction when available or diverts to a library call
> > to a soft-fp implementation. So, I think that general _Float16 support
> > should be implemented first (similar to _float128) and then upgraded
> > with AVX512FP16 specific instructions.
> >
> > MOVW loads/stores to XMM reg can be emulated with MOVD and a SImode
> > secondary_reload register.
> >
> MOVD is under sse2, so is pinsrw, which means if we want xmm
> load/stores for HF, sse2 is the least requirement.
> Also we support PEXTRW reg/m16, xmm, imm8 under SSE4_1 under which we
> have 16bit direct load/store for HFmode and no need for a secondary
> reload.
> So for simplicity, can we just restrict _Float16 under sse4_1?

When baseline is not met, the equivalent integer calling convention is
used, for example:

--cut here--
typedef int __v2si __attribute__ ((vector_size (8)));

__v2si foo (__v2si a, __v2si b)
{
  return a + b;
}
--cut here--

will still compile with -m32 -mno-mmx with warnings:

mmx1.c: In function ‘foo’:
mmx1.c:4:1: warning: MMX vector return without MMX enabled changes the
ABI [-Wpsabi]
mmx1.c:3:8: warning: MMX vector argument without MMX enabled changes
the ABI [-Wpsabi]

So, by setting the baseline to SSE4.1, a big pool of targets would be
forced to use the alternative ABI. This is quite inconvenient, and we
revert to the alternative ABI only if we *really* can't satisfy the ABI
requirements (e.g. the register type is not available, or a basic move
insn can't be implemented). Based on your analysis, I think that SSE2
should be the baseline.

Also, looking at insn tables, it looks like movzwl from memory + movd
is faster than pinsrw (and similarly pextrw to memory), but I have
no hard data here.

Regarding secondary_reload, a scratch register is needed in the case of
HImode moves between memory and an XMM reg, since the scratch register
needs a different mode from the source and destination. Please see the
TARGET_SECONDARY_RELOAD documentation and the several examples in the
source.

Uros.
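The movzwl-from-memory + movd sequence weighed above can also be sketched from the intrinsics side. The helper names below are illustrative only (not from any proposed patch); this is a hedged sketch of an SSE2-level emulation of a 16-bit XMM load/store, and it is x86-only:

```c
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdint.h>

/* Load a 16-bit value into the low lane of an XMM register, i.e.
   movzwl (mem),%eax ; movd %eax,%xmm0 */
static __m128i load_hf(const uint16_t *p)
{
    return _mm_cvtsi32_si128((uint32_t)*p);   /* zero-extend, then movd */
}

/* Store the low 16 bits of an XMM register back to memory, i.e.
   movd %xmm0,%eax ; movw %ax,(mem) */
static void store_hf(uint16_t *p, __m128i x)
{
    *p = (uint16_t)_mm_cvtsi128_si32(x);
}
```

For the store side, SSE4.1's PEXTRW with a memory destination would allow a direct 16-bit store, which is the alternative discussed in the thread.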


Re: [PATCH 0/2] Initial support for AVX512FP16

2021-07-01 Thread Hongtao Liu via Gcc-patches
On Thu, Jul 1, 2021 at 7:10 PM Uros Bizjak  wrote:
>
> [Sorry for double post, gcc-patches address was wrong in original post]
>
> On Thu, Jul 1, 2021 at 7:48 AM liuhongt  wrote:
> >
> > Hi:
> >   AVX512FP16 is disclosed, refer to [1].
> >   There're 100+ instructions for AVX512FP16, 67 gcc patches, for the 
> > convenience of review, we divide the 67 patches into 2 major parts.
> >   The first part is 2 patches containing basic support for AVX512FP16 
> > (options, cpuid, _Float16 type, libgcc, etc.), and the second part is 65 
> > patches covering all instructions of AVX512FP16(including intrinsic support 
> > and some optimizations).
> >   There is a problem with the first part, _Float16 is not a C++ standard, 
> > so the front-end does not support this type and its mangling, so we "make 
> > up" a _Float16 type on the back-end and use _DF16 as its mangling. The 
> > purpose of this is to align with llvm side, because llvm C++ FE already 
> > supports _Float16[2].
> >
> > [1] 
> > https://software.intel.com/content/www/us/en/develop/download/intel-avx512-fp16-architecture-specification.html
> > [2] https://reviews.llvm.org/D33719
>
> Looking through implementation of _Float16 support, I think, there is
> no need for _Float16 support to depend on AVX512FP16.
>
> The compiler is smart enough to use either a named pattern that
> describes the instruction when available or diverts to a library call
> to a soft-fp implementation. So, I think that general _Float16 support
> should be implemented first (similar to _float128) and then upgraded
> with AVX512FP16 specific instructions.
>
> MOVW loads/stores to XMM reg can be emulated with MOVD and a SImode
> secondary_reload register.
>
MOVD is under sse2, and so is pinsrw, which means that if we want xmm
loads/stores for HF, sse2 is the minimum requirement.
Also, we support PEXTRW reg/m16, xmm, imm8 under SSE4_1, under which we
have a direct 16-bit load/store for HFmode and no need for a secondary
reload.
So for simplicity, can we just restrict _Float16 to sse4_1?
> soft-fp library already includes all the infrastructure to implement
> _Float16 (see half.h), so HFmode basic operations should be trivial to
> implement (I have gone through this exercise personally years ago when
> implementing __float128 soft-fp support).
>
> Looking through the patch 1/2, it looks that a new ABI is introduced,
> where FP16 values are passed through XMM registers, but I don't think
> there is updated psABI documentation available (for x86_64 as well as
> i386, where FP16 values will probably be passed through memory).
>
> So, the net effect of the above proposal(s) is that x86 will support
> _Float16 out-of the box, emulate it via soft-fp without AVX512FP16 and
> use AVX512FP16 instructions with -mavx512fp16.
>
> Uros.



-- 
BR,
Hongtao


Re: [PATCH 0/2] Initial support for AVX512FP16

2021-07-01 Thread Joseph Myers
On Thu, 1 Jul 2021, H.J. Lu via Gcc-patches wrote:

> The main issue is complex _Float16 functions in libgcc.  If _Float16 doesn't
> require -mavx512fp16, we need to compile complex _Float16 functions in
> libgcc without -mavx512fp16.  Complex _Float16 performance is very
> important for our _Float16 usage.   _Float16 performance has to be
> very fast.  There should be no emulation anywhere when -mavx512fp16
> is used.   That is why _Float16 is available only with -mavx512fp16.

You could build IFUNC versions of the libgcc functions (like float128 on 
powerpc64le), to be fast (modulo any IFUNC overhead) when run on 
AVX512FP16 hardware.  Or arrange for different libcall names to be used 
depending on the instruction set features available, and build those 
functions under multiple names, to be fast when the application is built 
with -mavx512fp16.
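The IFUNC approach can be sketched in GNU C on an ELF target. Everything here is illustrative: the symbol names are made up, the two implementations are placeholders, and the resolver keys on "avx512f" because __builtin_cpu_supports may not recognize newer feature strings on older compilers:

```c
/* Two candidate implementations; in libgcc the fast one would be built
   with -mavx512fp16 (or written in assembly). */
static float hf_add_generic(float a, float b) { return a + b; }
static float hf_add_fast(float a, float b)    { return a + b; }

/* IFUNC resolver: runs once at load time and returns the function the
   dynamic linker should bind hf_add to on this CPU. */
static float (*resolve_hf_add(void))(float, float)
{
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx512f") ? hf_add_fast
                                             : hf_add_generic;
}

/* All callers use this one name; there is no per-call dispatch cost. */
float hf_add(float a, float b) __attribute__((ifunc("resolve_hf_add")));
```

The alternative mentioned above (distinct libcall names per instruction-set level) avoids even the one-time resolver, at the cost of baking the choice in at compile time.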

Since the HCmode libgcc functions just convert to/from SFmode and do all 
their computations on SFmode (to avoid intermediate overflows / 
cancellation resulting in inaccuracy), an F16C version may make sense as 
well (assuming use of the F16C conversion instructions is still efficient 
once you allow for zeroing the unused parts of the vector register, if 
necessary to avoid spurious exceptions from converting junk data in those 
parts of the register).

-- 
Joseph S. Myers
jos...@codesourcery.com


Re: [PATCH 0/2] Initial support for AVX512FP16

2021-07-01 Thread Joseph Myers
Some general comments, following what I said on libc-alpha:


1. Can you confirm that the ABI being used for 64-bit, for _Float16 and 
_Complex _Float16 argument passing and return, follows the current x86_64 
ABI document?


2. Can you confirm that if you build with this instruction set extension 
enabled by default, and run GCC tests for a corresponding (emulated?) 
processor, all the existing float16 tests in the testsuite are enabled and 
PASS (both compilation and execution) (both 64-bit and 32-bit testing)?


3. There's an active 32-bit ABI mailing list (ia32-...@googlegroups.com).  
If you want to support _Float16 in the 32-bit case, please work with it to 
get the corresponding ABI documented (using only memory and 
general-purpose registers seems like a good idea, so that the ABI can be 
supported for the base architecture without depending on SSE registers 
being present).  In the absence of 32-bit ABI support it might be better 
to disable the HFmode support for 32-bit.


4. Support for _Float16 really ought not to depend on whether a particular 
instruction set extension is present, just like with other floating-point 
types; it makes sense, as an API, for all x86 processors (and like many 
APIs, it will be faster on some processors than on others).  More specific 
points here are:

(a) Basic arithmetic (+-*/) can be done by converting to SFmode, doing 
arithmetic there and converting back to HFmode; the results of doing so 
will be correctly rounded.  Indeed, I think optabs.c handles that 
automatically when operations are available on a wider mode but not on the 
desired mode (but you'd need to check carefully that all the expected 
conversions do occur).

(b) Conversions to/from all other floating-point modes will always be 
needed, whether in hardware or in software.

(c) In the F16C (Ivy Bridge and later) case, where you have hardware 
conversions to/from float (only), it's fine to convert to double (or long 
double) via float.  (On efficiency grounds, widening from HFmode to TFmode 
should be a pure software operation, which should be faster than having an 
intermediate conversion to SFmode when the SFmode-to-TFmode conversion is 
a software operation.)

(d) In the F16C case (where there are hardware conversions only from 
SFmode, not from wider modes), conversion *from* DFmode (or XFmode or 
TFmode) to HFmode should be a software operation, to avoid double 
rounding; an intermediate conversion to SFmode would be incorrect.

(e) It's OK for conversions to/from integer modes to go via SFmode 
(although I don't know if that's efficient or not).  Any case where a 
conversion from integer to SFmode is inexact would overflow HFmode, so 
there are no double rounding issues.

(f) In the F16C case, it seems the hardware instructions only work on 
vectors, not scalars, so care would need to be taken to use them for 
scalar conversions only if the other elements of the vector register are 
known to be safe to convert without raising any exceptions (e.g. all zero 
bits, or -fno-trapping-math in effect).

(g) If concerned about efficiency of intermediate truncations on 
processors without hardware _Float16 arithmetic, look at 
aarch64_excess_precision; you have the option of using excess precision 
for _Float16 by default, though that only really helps for C given the 
lack of excess precision support in the C++ front end.  (Enabling this can 
cause trouble for code that only expects C99/C11 values of 
FLT_EVAL_METHOD, however; see the -fpermitted-flt-eval-methods option for 
more details.)


5. Suppose that in some cases you do disable _Float16 support (whether 
that's just for 32-bit until the ABI has been defined, or also in the 
absence of instruction set support despite my comments above).  Then the 
way you do that in this patch series, enabling the type in 
ix86_scalar_mode_supported_p and ix86_libgcc_floating_mode_supported_p and 
giving an error later in ix86_expand_move, is a bad idea.

Errors in expanders are generally problematic (they don't have good 
location information available).  But apart from that, ordinary user code 
should be able to tell whether _Float16 is supported by testing whether 
e.g. __FLT16_MANT_DIG__ is defined (like float.h does), or by including 
float.h (with __STDC_WANT_IEC_60559_TYPES_EXT__ defined) and then testing 
whether one of the FLT16_* macros is defined, or in a configure test by 
just declaring something using the _Float16 type.  Patch 1 changes 
check_effective_target_float16 to work around your technique for disabling 
_Float16 in ix86_expand_move, but it should be considered a stable user 
API that any of the above methods can be used in user code to check for 
_Float16 support - user code shouldn't need to know implementation details 
that you need to do something that will go through ix86_expand_move to see 
whether _Float16 is supported or not (and user code shouldn't need to use 
a configure test at all for this, testing FLT16_* after including fl

Re: [PATCH 0/2] Initial support for AVX512FP16

2021-07-01 Thread Jakub Jelinek via Gcc-patches
On Thu, Jul 01, 2021 at 02:58:01PM +0200, Richard Biener wrote:
> > The main issue is complex _Float16 functions in libgcc.  If _Float16 doesn't
> > require -mavx512fp16, we need to compile complex _Float16 functions in
> > libgcc without -mavx512fp16.  Complex _Float16 performance is very
> > important for our _Float16 usage.   _Float16 performance has to be
> > very fast.  There should be no emulation anywhere when -mavx512fp16
> > is used.   That is why _Float16 is available only with -mavx512fp16.
> 
> It should be possible to emulate scalar _Float16 using _Float32 with a
> reasonable
> performance trade-off.  I think users caring for _Float16 performance will
> use vector intrinsics anyway since for scalar code _Float32 code will likely
> perform the same (at double storage cost)

Only if it is allowed to have excess precision for _Float16.  If not, then
one would need to (expensively?) round after every operation at least.

Jakub



Re: [PATCH 0/2] Initial support for AVX512FP16

2021-07-01 Thread Uros Bizjak via Gcc-patches
On Thu, Jul 1, 2021 at 2:40 PM H.J. Lu  wrote:
>
> On Thu, Jul 1, 2021 at 4:10 AM Uros Bizjak  wrote:
> >
> > [Sorry for double post, gcc-patches address was wrong in original post]
> >
> > On Thu, Jul 1, 2021 at 7:48 AM liuhongt  wrote:
> > >
> > > Hi:
> > >   AVX512FP16 is disclosed, refer to [1].
> > >   There're 100+ instructions for AVX512FP16, 67 gcc patches, for the 
> > > convenience of review, we divide the 67 patches into 2 major parts.
> > >   The first part is 2 patches containing basic support for AVX512FP16 
> > > (options, cpuid, _Float16 type, libgcc, etc.), and the second part is 65 
> > > patches covering all instructions of AVX512FP16(including intrinsic 
> > > support and some optimizations).
> > >   There is a problem with the first part, _Float16 is not a C++ standard, 
> > > so the front-end does not support this type and its mangling, so we "make 
> > > up" a _Float16 type on the back-end and use _DF16 as its mangling. The 
> > > purpose of this is to align with llvm side, because llvm C++ FE already 
> > > supports _Float16[2].
> > >
> > > [1] 
> > > https://software.intel.com/content/www/us/en/develop/download/intel-avx512-fp16-architecture-specification.html
> > > [2] https://reviews.llvm.org/D33719
> >
> > Looking through implementation of _Float16 support, I think, there is
> > no need for _Float16 support to depend on AVX512FP16.
> >
> > The compiler is smart enough to use either a named pattern that
> > describes the instruction when available or diverts to a library call
> > to a soft-fp implementation. So, I think that general _Float16 support
> > should be implemented first (similar to _float128) and then upgraded
> > with AVX512FP16 specific instructions.
> >
> > MOVW loads/stores to XMM reg can be emulated with MOVD and a SImode
> > secondary_reload register.
> >
> > soft-fp library already includes all the infrastructure to implement
> > _Float16 (see half.h), so HFmode basic operations should be trivial to
> > implement (I have gone through this exercise personally years ago when
> > implementing __float128 soft-fp support).
> >
> > Looking through the patch 1/2, it looks that a new ABI is introduced,
> > where FP16 values are passed through XMM registers, but I don't think
> > there is updated psABI documentation available (for x86_64 as well as
>
> _Float16 support was added to x86-64 psABI:
>
> https://gitlab.com/x86-psABIs/x86-64-ABI/-/commit/71d1183e7bb95e9f8ad732e0f2b5a4f127796e2a
>
> 2 years ago.

Uh, sorry, my psABI link [1] is way out of date, but this is what
google gives for "x86_64 psABI pdf" ...

[1] https://uclibc.org/docs/psABI-x86_64.pdf

>
> > i386, where FP16 values will probably be passed through memory).
>
> That is correct.
>
> > So, the net effect of the above proposal(s) is that x86 will support
> > _Float16 out-of the box, emulate it via soft-fp without AVX512FP16 and
> > use AVX512FP16 instructions with -mavx512fp16.
> >
>
> The main issue is complex _Float16 functions in libgcc.  If _Float16 doesn't
> require -mavx512fp16, we need to compile complex _Float16 functions in
> libgcc without -mavx512fp16.  Complex _Float16 performance is very
> important for our _Float16 usage.   _Float16 performance has to be
> very fast.  There should be no emulation anywhere when -mavx512fp16
> is used.   That is why _Float16 is available only with -mavx512fp16.

If this performance is important, then the best way is, in addition to
the generic versions, to recompile these functions for the AVX512FP16
target, or even implement them in assembly. The compiler can then call
these specific functions when -mavx512fp16 is used. Please see how
alpha implements calls to its X_floating library.

Uros.


Re: [PATCH 0/2] Initial support for AVX512FP16

2021-07-01 Thread Richard Biener via Gcc-patches
On Thu, Jul 1, 2021 at 2:41 PM H.J. Lu via Gcc-patches
 wrote:
>
> On Thu, Jul 1, 2021 at 4:10 AM Uros Bizjak  wrote:
> >
> > [Sorry for double post, gcc-patches address was wrong in original post]
> >
> > On Thu, Jul 1, 2021 at 7:48 AM liuhongt  wrote:
> > >
> > > Hi:
> > >   AVX512FP16 is disclosed, refer to [1].
> > >   There're 100+ instructions for AVX512FP16, 67 gcc patches, for the 
> > > convenience of review, we divide the 67 patches into 2 major parts.
> > >   The first part is 2 patches containing basic support for AVX512FP16 
> > > (options, cpuid, _Float16 type, libgcc, etc.), and the second part is 65 
> > > patches covering all instructions of AVX512FP16(including intrinsic 
> > > support and some optimizations).
> > >   There is a problem with the first part, _Float16 is not a C++ standard, 
> > > so the front-end does not support this type and its mangling, so we "make 
> > > up" a _Float16 type on the back-end and use _DF16 as its mangling. The 
> > > purpose of this is to align with llvm side, because llvm C++ FE already 
> > > supports _Float16[2].
> > >
> > > [1] 
> > > https://software.intel.com/content/www/us/en/develop/download/intel-avx512-fp16-architecture-specification.html
> > > [2] https://reviews.llvm.org/D33719
> >
> > Looking through implementation of _Float16 support, I think, there is
> > no need for _Float16 support to depend on AVX512FP16.
> >
> > The compiler is smart enough to use either a named pattern that
> > describes the instruction when available or diverts to a library call
> > to a soft-fp implementation. So, I think that general _Float16 support
> > should be implemented first (similar to _float128) and then upgraded
> > with AVX512FP16 specific instructions.
> >
> > MOVW loads/stores to XMM reg can be emulated with MOVD and a SImode
> > secondary_reload register.
> >
> > soft-fp library already includes all the infrastructure to implement
> > _Float16 (see half.h), so HFmode basic operations should be trivial to
> > implement (I have gone through this exercise personally years ago when
> > implementing __float128 soft-fp support).
> >
> > Looking through the patch 1/2, it looks that a new ABI is introduced,
> > where FP16 values are passed through XMM registers, but I don't think
> > there is updated psABI documentation available (for x86_64 as well as
>
> _Float16 support was added to x86-64 psABI:
>
> https://gitlab.com/x86-psABIs/x86-64-ABI/-/commit/71d1183e7bb95e9f8ad732e0f2b5a4f127796e2a
>
> 2 years ago.
>
> > i386, where FP16 values will probably be passed through memory).
>
> That is correct.
>
> > So, the net effect of the above proposal(s) is that x86 will support
> > _Float16 out-of the box, emulate it via soft-fp without AVX512FP16 and
> > use AVX512FP16 instructions with -mavx512fp16.
> >
>
> The main issue is complex _Float16 functions in libgcc.  If _Float16 doesn't
> require -mavx512fp16, we need to compile complex _Float16 functions in
> libgcc without -mavx512fp16.  Complex _Float16 performance is very
> important for our _Float16 usage.   _Float16 performance has to be
> very fast.  There should be no emulation anywhere when -mavx512fp16
> is used.   That is why _Float16 is available only with -mavx512fp16.

It should be possible to emulate scalar _Float16 using _Float32 with a
reasonable performance trade-off.  I think users caring for _Float16
performance will use vector intrinsics anyway, since for scalar code
_Float32 will likely perform the same (at double the storage cost).

Richard.

> --
> H.J.


Re: [PATCH 0/2] Initial support for AVX512FP16

2021-07-01 Thread H.J. Lu via Gcc-patches
On Thu, Jul 1, 2021 at 4:10 AM Uros Bizjak  wrote:
>
> [Sorry for double post, gcc-patches address was wrong in original post]
>
> On Thu, Jul 1, 2021 at 7:48 AM liuhongt  wrote:
> >
> > Hi:
> >   AVX512FP16 has been disclosed; refer to [1].
> >   There are 100+ instructions in AVX512FP16, covered by 67 gcc patches; for
> > the convenience of review, we divide the 67 patches into 2 major parts.
> >   The first part is 2 patches containing basic support for AVX512FP16
> > (options, cpuid, _Float16 type, libgcc, etc.), and the second part is 65
> > patches covering all instructions of AVX512FP16 (including intrinsic support
> > and some optimizations).
> >   There is a problem with the first part: _Float16 is not a standard C++
> > type, so the front end does not support this type or its mangling, so we
> > "make up" a _Float16 type in the back end and use _DF16 as its mangling.
> > The purpose of this is to align with the llvm side, because the llvm C++
> > FE already supports _Float16 [2].
> >
> > [1] 
> > https://software.intel.com/content/www/us/en/develop/download/intel-avx512-fp16-architecture-specification.html
> > [2] https://reviews.llvm.org/D33719
>
> Looking through the implementation of _Float16 support, I think there
> is no need for _Float16 support to depend on AVX512FP16.
>
> The compiler is smart enough to use a named pattern that describes the
> instruction when available, or to divert to a library call to a soft-fp
> implementation.  So, I think that general _Float16 support should be
> implemented first (similar to __float128) and then upgraded with
> AVX512FP16-specific instructions.
>
> MOVW loads/stores to XMM reg can be emulated with MOVD and a SImode
> secondary_reload register.
>
> The soft-fp library already includes all the infrastructure to implement
> _Float16 (see half.h), so HFmode basic operations should be trivial to
> implement (I have gone through this exercise personally years ago when
> implementing __float128 soft-fp support).
>
> Looking through the patch 1/2, it looks like a new ABI is introduced,
> where FP16 values are passed through XMM registers, but I don't think
> there is updated psABI documentation available (for x86_64 as well as

_Float16 support was added to x86-64 psABI:

https://gitlab.com/x86-psABIs/x86-64-ABI/-/commit/71d1183e7bb95e9f8ad732e0f2b5a4f127796e2a

2 years ago.

> i386, where FP16 values will probably be passed through memory).

That is correct.

> So, the net effect of the above proposal(s) is that x86 will support
> _Float16 out of the box, emulate it via soft-fp without AVX512FP16, and
> use AVX512FP16 instructions with -mavx512fp16.
>

The main issue is complex _Float16 functions in libgcc.  If _Float16 doesn't
require -mavx512fp16, we need to compile complex _Float16 functions in
libgcc without -mavx512fp16.  Complex _Float16 performance is very
important for our _Float16 usage, and it has to be very fast.  There
should be no emulation anywhere when -mavx512fp16 is used.  That is
why _Float16 is available only with -mavx512fp16.

-- 
H.J.


Re: [PATCH 0/2] Initial support for AVX512FP16

2021-07-01 Thread Uros Bizjak via Gcc-patches
[Sorry for double post, gcc-patches address was wrong in original post]

On Thu, Jul 1, 2021 at 7:48 AM liuhongt  wrote:
>
> Hi:
>   AVX512FP16 has been disclosed; refer to [1].
>   There are 100+ instructions in AVX512FP16, covered by 67 gcc patches; for
> the convenience of review, we divide the 67 patches into 2 major parts.
>   The first part is 2 patches containing basic support for AVX512FP16
> (options, cpuid, _Float16 type, libgcc, etc.), and the second part is 65
> patches covering all instructions of AVX512FP16 (including intrinsic support
> and some optimizations).
>   There is a problem with the first part: _Float16 is not a standard C++
> type, so the front end does not support this type or its mangling, so we
> "make up" a _Float16 type in the back end and use _DF16 as its mangling.
> The purpose of this is to align with the llvm side, because the llvm C++
> FE already supports _Float16 [2].
>
> [1] 
> https://software.intel.com/content/www/us/en/develop/download/intel-avx512-fp16-architecture-specification.html
> [2] https://reviews.llvm.org/D33719

Looking through the implementation of _Float16 support, I think there
is no need for _Float16 support to depend on AVX512FP16.

The compiler is smart enough to use a named pattern that describes the
instruction when available, or to divert to a library call to a soft-fp
implementation.  So, I think that general _Float16 support should be
implemented first (similar to __float128) and then upgraded with
AVX512FP16-specific instructions.

MOVW loads/stores to XMM reg can be emulated with MOVD and a SImode
secondary_reload register.

The soft-fp library already includes all the infrastructure to implement
_Float16 (see half.h), so HFmode basic operations should be trivial to
implement (I have gone through this exercise personally years ago when
implementing __float128 soft-fp support).

Looking through the patch 1/2, it looks like a new ABI is introduced,
where FP16 values are passed through XMM registers, but I don't think
there is updated psABI documentation available (for x86_64 as well as
i386, where FP16 values will probably be passed through memory).

So, the net effect of the above proposal(s) is that x86 will support
_Float16 out of the box, emulate it via soft-fp without AVX512FP16, and
use AVX512FP16 instructions with -mavx512fp16.

Uros.


Re: [PATCH 0/2] Initial support for AVX512FP16

2021-06-30 Thread Hongtao Liu via Gcc-patches
On Thu, Jul 1, 2021 at 1:48 PM liuhongt  wrote:
>
> Hi:
>   AVX512FP16 has been disclosed; refer to [1].
>   There are 100+ instructions in AVX512FP16, covered by 67 gcc patches; for
> the convenience of review, we divide the 67 patches into 2 major parts.
>   The first part is 2 patches containing basic support for AVX512FP16
> (options, cpuid, _Float16 type, libgcc, etc.), and the second part is 65
> patches covering all instructions of AVX512FP16 (including intrinsic support
> and some optimizations).
>   There is a problem with the first part: _Float16 is not a standard C++
> type, so the front end does not support this type or its mangling, so we
> "make up" a _Float16 type in the back end and use _DF16 as its mangling.
> The purpose of this is to align with the llvm side, because the llvm C++
> FE already supports _Float16 [2].
>
> [1] 
> https://software.intel.com/content/www/us/en/develop/download/intel-avx512-fp16-architecture-specification.html
> [2] https://reviews.llvm.org/D33719
>
>   Bootstrapped and regtested on x86_64-linux-gnu{-m32,}.
>
> Guo, Xuepeng (1):
>   AVX512FP16: Initial support for _Float16 type and AVX512FP16 feature.
>
> liuhongt (1):
>   AVX512FP16: Add HFmode support in libgcc.
>
>  gcc/common/config/i386/cpuinfo.h  |   2 +
>  gcc/common/config/i386/i386-common.c  |  26 +-
>  gcc/common/config/i386/i386-cpuinfo.h |   1 +
>  gcc/common/config/i386/i386-isas.h|   1 +
>  gcc/config.gcc|   2 +-
>  gcc/config/i386/avx512fp16intrin.h|  53 
>  gcc/config/i386/cpuid.h   |   1 +
>  gcc/config/i386/i386-builtin-types.def|   7 +-
>  gcc/config/i386/i386-builtins.c   |   6 +
>  gcc/config/i386/i386-c.c  |  20 ++
>  gcc/config/i386/i386-expand.c |   8 +
>  gcc/config/i386/i386-isa.def  |   1 +
>  gcc/config/i386/i386-modes.def|   1 +
>  gcc/config/i386/i386-options.c|  10 +-
>  gcc/config/i386/i386.c| 158 ++--
>  gcc/config/i386/i386.h|  18 +-
>  gcc/config/i386/i386.md   | 242 +++---
>  gcc/config/i386/i386.opt  |   4 +
>  gcc/config/i386/immintrin.h   |   2 +
>  gcc/config/i386/sse.md|  42 +--
>  gcc/doc/invoke.texi   |  10 +-
>  gcc/optabs-query.c|   9 +-
>  gcc/testsuite/g++.target/i386/float16-1.C |   8 +
>  gcc/testsuite/g++.target/i386/float16-2.C |  14 +
>  gcc/testsuite/g++.target/i386/float16-3.C |  10 +
>  gcc/testsuite/gcc.target/i386/avx-1.c |   2 +-
>  gcc/testsuite/gcc.target/i386/avx-2.c |   2 +-
>  gcc/testsuite/gcc.target/i386/avx512-check.h  |   3 +
>  .../gcc.target/i386/avx512fp16-12a.c  |  21 ++
>  .../gcc.target/i386/avx512fp16-12b.c  |  27 ++
>  gcc/testsuite/gcc.target/i386/float16-1.c |   8 +
>  gcc/testsuite/gcc.target/i386/float16-2.c |  14 +
>  gcc/testsuite/gcc.target/i386/float16-3a.c|  10 +
>  gcc/testsuite/gcc.target/i386/float16-3b.c|  10 +
>  gcc/testsuite/gcc.target/i386/float16-4a.c|  10 +
>  gcc/testsuite/gcc.target/i386/float16-4b.c|  10 +
>  gcc/testsuite/gcc.target/i386/funcspec-56.inc |   2 +
>  gcc/testsuite/gcc.target/i386/pr54855-12.c|  14 +
>  gcc/testsuite/gcc.target/i386/sse-13.c|   2 +-
>  gcc/testsuite/gcc.target/i386/sse-14.c|   2 +-
>  gcc/testsuite/gcc.target/i386/sse-22.c|   4 +-
>  gcc/testsuite/gcc.target/i386/sse-23.c|   2 +-
>  gcc/testsuite/lib/target-supports.exp |  13 +-
>  libgcc/Makefile.in|   4 +-
>  libgcc/config.host|   6 +-
>  libgcc/config/i386/32/sfp-machine.h   |   1 +
>  libgcc/config/i386/64/sfp-machine.h   |   1 +
>  libgcc/config/i386/64/t-softfp|   9 +
>  libgcc/config/i386/_divhc3.c  |   4 +
>  libgcc/config/i386/_mulhc3.c  |   4 +
>  libgcc/config/i386/sfp-machine.h  |   1 +
>  libgcc/config/i386/t-softfp   |  20 ++
>  libgcc/configure  |  33 +++
>  libgcc/configure.ac   |  13 +
>  libgcc/soft-fp/extendhfxf2.c  |  53 
>  libgcc/soft-fp/truncxfhf2.c   |  52 
>  56 files changed, 907 insertions(+), 106 deletions(-)
>  create mode 100644 gcc/config/i386/avx512fp16intrin.h
>  create mode 100644 gcc/testsuite/g++.target/i386/float16-1.C
>  create mode 100644 gcc/testsuite/g++.target/i386/float16-2.C
>  create mode 100644 gcc/testsuite/g++.target/i386/float16-3.C
>  create mode 100644 gcc/testsuite/gcc.target/i386/avx512fp16-12a.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/avx512fp16-12b.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/float16-1.c
>  create mode 100644 gcc/tests