Re: [PATCH 0/2] Initial support for AVX512FP16
> > > Set excess_precision_type to FLT_EVAL_METHOD_PROMOTE_TO_FLOAT16 to
> > > round after each operation could keep semantics right.
> > > And I'll document the behavior difference between soft-fp and
> > > AVX512FP16 instruction for exceptions.

I got some feedback from my colleague who's working on supporting
_Float16 for LLVM. The LLVM side wants to set
FLT_EVAL_METHOD_PROMOTE_TO_FLOAT for soft-fp so that the generated code
can be more efficient, i.e.

  _Float16 a, b, c, d;
  d = a + b + c;

would be transformed to

  float tmp, tmp1, a1, b1, c1;
  a1 = (float) a;
  b1 = (float) b;
  c1 = (float) c;
  tmp = a1 + b1;
  tmp1 = tmp + c1;
  d = (_Float16) tmp1;

so there's only one truncation at the end. If users want to round back
after every operation, the code should be written explicitly as

  _Float16 a, b, c, d, e;
  e = a + b;
  d = e + c;

That's what Clang does; quoting from [1]:

  _Float16 arithmetic will be performed using native half-precision
  support when available on the target (e.g. on ARMv8.2a); otherwise it
  will be performed at a higher precision (currently always float) and
  then truncated down to _Float16. Note that C and C++ allow
  intermediate floating-point operands of an expression to be computed
  with greater precision than is expressible in their type, so Clang
  may avoid intermediate truncations in certain cases; this may lead to
  results that are inconsistent with native arithmetic.

and so does Arm GCC; quoting from arm.c:

  /* We can calculate either in 16-bit range and precision or
     32-bit range and precision.  Make that decision based on whether
     we have native support for the ARMv8.2-A 16-bit floating-point
     instructions or not.  */
  return (TARGET_VFP_FP16INST
          ? FLT_EVAL_METHOD_PROMOTE_TO_FLOAT16
          : FLT_EVAL_METHOD_PROMOTE_TO_FLOAT);

[1] https://clang.llvm.org/docs/LanguageExtensions.html

> > --
> > Joseph S. Myers
> > jos...@codesourcery.com
>
> --
> BR,
> Hongtao

--
BR,
Hongtao
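[Editorial illustration] The difference between the two evaluation methods above can be reproduced without a compiler, using Python's `struct` support for IEEE binary16 (the `'e'` format) to model _Float16 rounding. The operand values below are our own construction, not taken from the thread:

```python
import struct

def to_half(x):
    # round a Python float (a double) to IEEE binary16 and back,
    # using struct's 'e' format (round-to-nearest, ties to even)
    return struct.unpack('<e', struct.pack('<e', x))[0]

a = 1.0         # exactly representable as _Float16
b = 2.0 ** -11  # exactly half an ulp of 1.0 in _Float16
c = 2.0 ** -11

# FLT_EVAL_METHOD_PROMOTE_TO_FLOAT16: round back after every operation.
# a + b is an exact tie, so it rounds to even (1.0), both times.
per_op = to_half(to_half(a + b) + c)

# FLT_EVAL_METHOD_PROMOTE_TO_FLOAT: evaluate in float (exact here),
# truncate to _Float16 only once at the end.
excess = to_half(a + b + c)

print(per_op)  # 1.0
print(excess)  # 1.0009765625
```

With one final truncation the two additions both survive, while per-operation rounding loses each half-ulp increment to ties-to-even; this is exactly the semantic difference FLT_EVAL_METHOD exposes.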
Re: [PATCH 0/2] Initial support for AVX512FP16
On Wed, Jul 7, 2021 at 2:11 AM Joseph Myers wrote: > > On Tue, 6 Jul 2021, Hongtao Liu via Gcc-patches wrote: > > > There may be inconsistent behavior between soft-fp and avx512fp16 > > instructions if we emulate _Float16 w/ float . > > i.e > > 1) for a + b - c where b and c are variables with the same big value > > and a + b is NAN at _Float16 and real value at float, avx512fp16 > > instruction will raise an exception but soft-fp won't(unless it's > > rounded after every operation.) > > There are at least two variants of emulation using float: > > (a) Using the excess precision support, as on AArch64, which means the C > front end converts the _Float16 operations to float ones, with explicit > narrowing on assignment (and conversion as if by assignment - argument > passing and return, casts, etc.). Excess precision indeed involves > different semantics compared to doing each operation directly in the range > and precision of _Float16. > Yes, set excess_precision_type to FLT_EVAL_METHOD_PROMOTE_TO_FLOAT16 could round after each operation. > (b) Letting the expand/optabs code generate operations in a wider mode. > My understanding is that the result should get converted back to the > narrower mode after each operation (by the expand/optabs code / > convert_move called by it generating such a conversion), meaning (for > basic arithmetic operations) that the semantics end up the same as if the > operation had been done directly on _Float16 (but with more truncation > operations occurring than would be the case with excess precision support > used). Yes, just w/ different behavior related to exceptions.. > > > 2) a / b where b is denormal value and AVX512FP16 won't flush it to > > zero even w/ -Ofast, but when it's extended to float and using divss, > > it will be flushed to zero and raise an exception when compiling w/ > > Ofast > > I don't think that's a concern, flush to zero is well outside the scope of > standards defining _Float16 semantics. Ok. 
> > So the key point is that the soft-fp and avx512fp16 instructions may
> > not behave the same on the exception, is this acceptable?
>
> As far as I understand it, all cases within the standards will behave as
> expected for exceptions, whether pure software floating-point is used,
> pure hardware _Float16 arithmetic or one of the forms of emulation listed
> above.  (Where "as expected" itself depends on the value of
> FLT_EVAL_METHOD, i.e. whether excess precision is used for _Float16.)
> Flush to zero and trapping exceptions are outside the scope of the
> standards.  Since trapping exceptions is outside the scope of the
> standards, so is anything that distinguishes whether an arithmetic
> operation raises the same exception more than once or the order in which
> it raises different exceptions (e.g. the possibility of "inexact" being
> raised more than once, both by arithmetic on float and by narrowing from
> float to _Float16).

Setting excess_precision_type to FLT_EVAL_METHOD_PROMOTE_TO_FLOAT16, so
that we round after each operation, should keep the semantics right.
And I'll document the behavior difference between soft-fp and the
AVX512FP16 instructions for exceptions.

> --
> Joseph S. Myers
> jos...@codesourcery.com

--
BR,
Hongtao
Re: [PATCH 0/2] Initial support for AVX512FP16
On Tue, 6 Jul 2021, H.J. Lu via Gcc-patches wrote: > > > So the key point is that the soft-fp and avx512fp16 instructions may > > > do not behave the same on the exception, is this acceptable? > > > > I think that's quite often the case for soft-fp. > > So this is a GCC limitation. Please document difference behaviors > of _Float16 with and without AVX512FP16, similar to I don't think it's yet clear there will be any such limitation, just semantics that depend on whether the excess precision support is used or not (which is covered by FLT_EVAL_METHOD, like on AArch64). -- Joseph S. Myers jos...@codesourcery.com
Re: [PATCH 0/2] Initial support for AVX512FP16
On Tue, 6 Jul 2021, Richard Biener via Gcc-patches wrote: > > /* Look for a wider mode of the same class for which we think we > > can open-code the operation. Check for a widening multiply at the > > wider mode as well. */ > > > > if (CLASS_HAS_WIDER_MODES_P (mclass) > > && methods != OPTAB_DIRECT && methods != OPTAB_LIB) > > FOR_EACH_WIDER_MODE (wider_mode, mode) > > > > I think pass_expand did this for some reason, so I'm a little afraid > > to touch this part of the code. > > It might be the first time we hit this ;) I don't think it's safe for > non-integer modes or even anything but a small set of operations. > Just consider ssadd besides rounding issues or FP. I think it's safe for basic arithmetic (+-*/), for IEEE floating-point arithmetic when the wider mode has significand more than twice as wide as the narrower one (given that the result is immediately converted back to the narrower mode, double rounding isn't an issue given such a constraint on the widths of the modes - and given that the wider mode has sufficient exponent range to avoid intermediate overflow / underflow as an issue as well). (The precise requirements on the width of the modes may depend on the operation in question. It's *not* safe for fused multiply-add, regardless of the widths in question; a software implementation of fmaf16 using float arithmetic could be quite simple, using round-to-odd like e.g. glibc's implementation of fmaf using double arithmetic, but "call fmaf then convert the result to _Float16" would be an incorrect implementation.) -- Joseph S. Myers jos...@codesourcery.com
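[Editorial illustration] Joseph's fmaf16 point can be made concrete. The sketch below is our own construction, not from the thread: it models float32 and _Float16 with Python's `struct` formats, and `f32_round_to_odd` is a simplified round-to-odd that ignores zero, NaN, and overflow edge cases. "Call fmaf then convert" double-rounds, while rounding the float intermediate to odd preserves enough information for the final narrowing:

```python
import struct

def to_half(x):
    return struct.unpack('<e', struct.pack('<e', x))[0]

def to_f32(x):
    return struct.unpack('<f', struct.pack('<f', x))[0]

def f32_round_to_odd(x):
    # round the double x to float32 with round-to-odd: truncate toward
    # zero, then force the last significand bit to 1 if inexact
    r = to_f32(x)
    if r == x:
        return r
    bits = struct.unpack('<I', struct.pack('<f', r))[0]
    if abs(r) > abs(x):
        bits -= 1   # nearest rounding went away from zero; step back
    bits |= 1
    return struct.unpack('<f', struct.pack('<I', bits))[0]

# _Float16 operands chosen so that a*b + c lands just above a
# half-precision tie: the exact value is 1 + 2**-11 + 2**-25
a = 145.0 / 4096.0   # 0x1.22p-5, exactly a _Float16 value
b = 113.0 / 8192.0   # 0x1.c4p-7, exactly a _Float16 value
c = 1.0
exact = a * b + c    # only 26 significant bits: exact in a double

# wrong: fmaf-then-truncate. float32 rounds the 2**-25 tail away,
# leaving an exact _Float16 tie that then rounds (to even) downward
naive = to_half(to_f32(exact))

# right: round the float32 intermediate to odd, keeping the fact that
# the exact value lies strictly above the tie
fixed = to_half(f32_round_to_odd(exact))

print(naive)             # 1.0
print(fixed)             # 1.0009765625
print(to_half(exact))    # 1.0009765625 (single rounding, ground truth)
```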
Re: [PATCH 0/2] Initial support for AVX512FP16
On Tue, 6 Jul 2021, Hongtao Liu via Gcc-patches wrote: > There may be inconsistent behavior between soft-fp and avx512fp16 > instructions if we emulate _Float16 w/ float . > i.e > 1) for a + b - c where b and c are variables with the same big value > and a + b is NAN at _Float16 and real value at float, avx512fp16 > instruction will raise an exception but soft-fp won't(unless it's > rounded after every operation.) There are at least two variants of emulation using float: (a) Using the excess precision support, as on AArch64, which means the C front end converts the _Float16 operations to float ones, with explicit narrowing on assignment (and conversion as if by assignment - argument passing and return, casts, etc.). Excess precision indeed involves different semantics compared to doing each operation directly in the range and precision of _Float16. (b) Letting the expand/optabs code generate operations in a wider mode. My understanding is that the result should get converted back to the narrower mode after each operation (by the expand/optabs code / convert_move called by it generating such a conversion), meaning (for basic arithmetic operations) that the semantics end up the same as if the operation had been done directly on _Float16 (but with more truncation operations occurring than would be the case with excess precision support used). > 2) a / b where b is denormal value and AVX512FP16 won't flush it to > zero even w/ -Ofast, but when it's extended to float and using divss, > it will be flushed to zero and raise an exception when compiling w/ > Ofast I don't think that's a concern, flush to zero is well outside the scope of standards defining _Float16 semantics. > So the key point is that the soft-fp and avx512fp16 instructions may > do not behave the same on the exception, is this acceptable? 
As far as I understand it, all cases within the standards will behave as expected for exceptions, whether pure software floating-point is used, pure hardware _Float16 arithmetic or one of the forms of emulation listed above. (Where "as expected" itself depends on the value of FLT_EVAL_METHOD, i.e. whether excess precision is used for _Float16.) Flush to zero and trapping exceptions are outside the scope of the standards. Since trapping exceptions is outside the scope of the standards, so is anything that distinguishes whether an arithmetic operation raises the same exception more than once or the order in which it raises different exceptions (e.g. the possibility of "inexact" being raised more than once, both by arithmetic on float and by narrowing from float to _Float16). -- Joseph S. Myers jos...@codesourcery.com
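[Editorial illustration] Joseph's point that emulating basic _Float16 arithmetic via float gives the same results as native arithmetic (because float's 24-bit significand is more than twice _Float16's 11 bits, so no double rounding can occur) can be spot-checked with Python's `struct` binary16/binary32 formats. This is a sketch of our own, not part of the patch; magnitudes are capped so the sum stays in the binary16 range:

```python
import random
import struct

def to_half(x):
    return struct.unpack('<e', struct.pack('<e', x))[0]

def to_f32(x):
    return struct.unpack('<f', struct.pack('<f', x))[0]

random.seed(0)
for _ in range(20000):
    # random _Float16 operands, including subnormals; magnitudes are
    # kept small enough that the sum cannot overflow the binary16 range
    scale = random.choice([1.0, 1e-4, 1e-6])
    a = to_half(random.uniform(-1000.0, 1000.0) * scale)
    b = to_half(random.uniform(-1000.0, 1000.0) * scale)

    # ground truth: the exact sum of two binary16 values always fits in
    # a double, so rounding it once to binary16 is the correct result
    direct = to_half(a + b)

    # emulation: widen to float, add with float rounding, truncate back
    emulated = to_half(to_f32(to_f32(a) + to_f32(b)))

    assert direct == emulated
print("float emulation matched direct binary16 rounding on all samples")
```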
Re: [PATCH 0/2] Initial support for AVX512FP16
On Tue, Jul 6, 2021 at 3:15 AM Richard Biener via Gcc-patches wrote: > > On Tue, Jul 6, 2021 at 10:46 AM Hongtao Liu wrote: > > > > On Thu, Jul 1, 2021 at 9:04 PM Jakub Jelinek via Gcc-patches > > wrote: > > > > > > On Thu, Jul 01, 2021 at 02:58:01PM +0200, Richard Biener wrote: > > > > > The main issue is complex _Float16 functions in libgcc. If _Float16 > > > > > doesn't > > > > > require -mavx512fp16, we need to compile complex _Float16 functions in > > > > > libgcc without -mavx512fp16. Complex _Float16 performance is very > > > > > important for our _Float16 usage. _Float16 performance has to be > > > > > very fast. There should be no emulation anywhere when -mavx512fp16 > > > > > is used. That is why _Float16 is available only with -mavx512fp16. > > > > > > > > It should be possible to emulate scalar _Float16 using _Float32 with a > > > > reasonable > > > > performance trade-off. I think users caring for _Float16 performance > > > > will > > > > use vector intrinsics anyway since for scalar code _Float32 code will > > > > likely > > > > perform the same (at double storage cost) > > > > > > Only if it is allowed to have excess precision for _Float16. If not, then > > > one would need to (expensively?) round after every operation at least. > > There may be inconsistent behavior between soft-fp and avx512fp16 > > instructions if we emulate _Float16 w/ float . > > i.e > > 1) for a + b - c where b and c are variables with the same big value > > and a + b is NAN at _Float16 and real value at float, avx512fp16 > > instruction will raise an exception but soft-fp won't(unless it's > > rounded after every operation.) > > 2) a / b where b is denormal value and AVX512FP16 won't flush it to > > zero even w/ -Ofast, but when it's extended to float and using divss, > > it will be flushed to zero and raise an exception when compiling w/ > > Ofast > > > > To solve the upper issue, i try to add full emulation for _Float16(for > > all those under libgcc/soft-fp/, i.e. 
> > add/sub/mul/div/cmp, etc.); the problem is in pass_expand: it always
> > tries a wider mode first instead of using soft-fp.
> >
> >   /* Look for a wider mode of the same class for which we think we
> >      can open-code the operation.  Check for a widening multiply at the
> >      wider mode as well.  */
> >
> >   if (CLASS_HAS_WIDER_MODES_P (mclass)
> >       && methods != OPTAB_DIRECT && methods != OPTAB_LIB)
> >     FOR_EACH_WIDER_MODE (wider_mode, mode)
> >
> > I think pass_expand did this for some reason, so I'm a little afraid
> > to touch this part of the code.
>
> It might be the first time we hit this ;)  I don't think it's safe for
> non-integer modes or even anything but a small set of operations.
> Just consider ssadd besides rounding issues or FP.
>
> > So the key point is that the soft-fp and avx512fp16 instructions may
> > not behave the same on the exception, is this acceptable?
>
> I think that's quite often the case for soft-fp.

So this is a GCC limitation.  Please document the differing behavior of
_Float16 with and without AVX512FP16, similar to

---
The '__fp16' type may only be used as an argument to intrinsics defined
in 'arm_fp16.h', or as a storage format.  For purposes of arithmetic
and other operations, '__fp16' values in C or C++ expressions are
automatically promoted to 'float'.

The ARM target provides hardware support for conversions between
'__fp16' and 'float' values as an extension to VFP and NEON (Advanced
SIMD), and from ARMv8-A provides hardware support for conversions
between '__fp16' and 'double' values.  GCC generates code using these
hardware instructions if you compile with options to select an FPU that
provides them; for example, '-mfpu=neon-fp16 -mfloat-abi=softfp', in
addition to the '-mfp16-format' option to select a half-precision
format.

Language-level support for the '__fp16' data type is independent of
whether GCC generates code using hardware floating-point instructions.
In cases where hardware support is not specified, GCC implements
conversions between '__fp16' and other types as library calls.

It is recommended that portable code use the '_Float16' type defined by
ISO/IEC TS 18661-3:2015.  *Note Floating Types::.
---

We recommend _Float16 with AVX512FP16 for portable code.

> BTW, I've finished an initial patch to enable _Float16 on sse2, and
> emulate _Float16 operations w/ float, and it passes all 312 new tests
> which are related to _Float16, but those unit tests don't cover the
> scenario I'm talking about.
>
> >         Jakub
>
> --
> BR,
> Hongtao

--
H.J.
Re: [PATCH 0/2] Initial support for AVX512FP16
On Tue, Jul 6, 2021 at 10:46 AM Hongtao Liu wrote: > > On Thu, Jul 1, 2021 at 9:04 PM Jakub Jelinek via Gcc-patches > wrote: > > > > On Thu, Jul 01, 2021 at 02:58:01PM +0200, Richard Biener wrote: > > > > The main issue is complex _Float16 functions in libgcc. If _Float16 > > > > doesn't > > > > require -mavx512fp16, we need to compile complex _Float16 functions in > > > > libgcc without -mavx512fp16. Complex _Float16 performance is very > > > > important for our _Float16 usage. _Float16 performance has to be > > > > very fast. There should be no emulation anywhere when -mavx512fp16 > > > > is used. That is why _Float16 is available only with -mavx512fp16. > > > > > > It should be possible to emulate scalar _Float16 using _Float32 with a > > > reasonable > > > performance trade-off. I think users caring for _Float16 performance will > > > use vector intrinsics anyway since for scalar code _Float32 code will > > > likely > > > perform the same (at double storage cost) > > > > Only if it is allowed to have excess precision for _Float16. If not, then > > one would need to (expensively?) round after every operation at least. > There may be inconsistent behavior between soft-fp and avx512fp16 > instructions if we emulate _Float16 w/ float . > i.e > 1) for a + b - c where b and c are variables with the same big value > and a + b is NAN at _Float16 and real value at float, avx512fp16 > instruction will raise an exception but soft-fp won't(unless it's > rounded after every operation.) > 2) a / b where b is denormal value and AVX512FP16 won't flush it to > zero even w/ -Ofast, but when it's extended to float and using divss, > it will be flushed to zero and raise an exception when compiling w/ > Ofast > > To solve the upper issue, i try to add full emulation for _Float16(for > all those under libgcc/soft-fp/, i.e. 
add/sub/mul/div/cmp, .etc), > problem is in pass_expand, it always try wider mode first instead of > using soft-fp > > /* Look for a wider mode of the same class for which we think we > can open-code the operation. Check for a widening multiply at the > wider mode as well. */ > > if (CLASS_HAS_WIDER_MODES_P (mclass) > && methods != OPTAB_DIRECT && methods != OPTAB_LIB) > FOR_EACH_WIDER_MODE (wider_mode, mode) > > I think pass_expand did this for some reason, so I'm a little afraid > to touch this part of the code. It might be the first time we hit this ;) I don't think it's safe for non-integer modes or even anything but a small set of operations. Just consider ssadd besides rounding issues or FP. > So the key point is that the soft-fp and avx512fp16 instructions may > do not behave the same on the exception, is this acceptable? I think that's quite often the case for soft-fp. > BTW, i've finished a initial patch to enable _Float16 on sse2, and > emulate _Float16 operation w/ float, and it passes all 312 new tests > which are related to _Float16, but those units tests doesn't cover the > scenario I'm talking about. > > > > Jakub > > > > > -- > BR, > Hongtao
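[Editorial illustration] Richard's ssadd example can be shown in a few lines: performing a saturating add in a wider mode and then truncating is not equivalent to saturating in the narrow mode, because the wider-mode saturation never triggers and the plain truncation wraps. A hypothetical sketch with 8- and 16-bit integers:

```python
INT8_MIN, INT8_MAX = -128, 127
INT16_MIN, INT16_MAX = -32768, 32767

def ssadd8(a, b):
    # saturating signed 8-bit addition, done directly in the narrow mode
    return max(INT8_MIN, min(INT8_MAX, a + b))

def ssadd8_via_ssadd16(a, b):
    # the wider-mode expansion: saturate in 16 bits, then truncate to 8.
    # 16-bit saturation never fires for 8-bit inputs, and the plain
    # truncation back to 8 bits wraps instead of saturating.
    wide = max(INT16_MIN, min(INT16_MAX, a + b))
    low = wide & 0xFF
    return low - 256 if low >= 128 else low

print(ssadd8(100, 100))              # 127 (saturated, correct)
print(ssadd8_via_ssadd16(100, 100))  # -56 (wrapped, wrong)
```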
Re: [PATCH 0/2] Initial support for AVX512FP16
On Fri, Jul 2, 2021 at 4:46 AM Joseph Myers wrote:
>
> Some general comments, following what I said on libc-alpha:
>
> 1. Can you confirm that the ABI being used for 64-bit, for _Float16 and
> _Complex _Float16 argument passing and return, follows the current x86_64
> ABI document?
>
> 2. Can you confirm that if you build with this instruction set extension
> enabled by default, and run GCC tests for a corresponding (emulated?)
> processor, all the existing float16 tests in the testsuite are enabled and
> PASS (both compilation and execution) (both 64-bit and 32-bit testing)?
>
> 3. There's an active 32-bit ABI mailing list (ia32-...@googlegroups.com).
> If you want to support _Float16 in the 32-bit case, please work with it to
> get the corresponding ABI documented (using only memory and
> general-purpose registers seems like a good idea, so that the ABI can be
> supported for the base architecture without depending on SSE registers
> being present).  In the absence of 32-bit ABI support it might be better
> to disable the HFmode support for 32-bit.
>
> 4. Support for _Float16 really ought not to depend on whether a particular
> instruction set extension is present, just like with other floating-point
> types; it makes sense, as an API, for all x86 processors (and like many
> APIs, it will be faster on some processors than on others).  More specific
> points here are:
>
> (a) Basic arithmetic (+-*/) can be done by converting to SFmode, doing
> arithmetic there and converting back to HFmode; the results of doing so
> will be correctly rounded.  Indeed, I think optabs.c handles that
> automatically when operations are available on a wider mode but not on the
> desired mode (but you'd need to check carefully that all the expected
> conversions do occur).

So would different behavior of exceptions between soft-fp and avx512fp16
be acceptable?

> (b) Conversions to/from all other floating-point modes will always be
> needed, whether in hardware or in software.
> > (c) In the F16C (Ivy Bridge and later) case, where you have hardware > conversions to/from float (only), it's fine to convert to double (or long > double) via float. (On efficiency grounds, widening from HFmode to TFmode > should be a pure software operations, that should be faster than having an > intermediate conversion to SFmode when the SFmode-to-TFmode conversion is > a software operation.) > > (d) In the F16C case (where there are hardware conversions only from > SFmode, not from wider modes), conversion *from* DFmode (or XFmode or > TFmode) to HFmode should be a software operation, to avoid double > rounding; an intermediate conversion to SFmode would be incorrect. > > (e) It's OK for conversions to/from integer modes to go via SFmode > (although I don't know if that's efficient or not). Any case where a > conversion from integer to SFmode is inexact would overflow HFmode, so > there are no double rounding issues. > > (f) In the F16C case, it seems the hardware instructions only work on > vectors, not scalars, so care would need to be taken to use them for > scalar conversions only if the other elements of the vector register are > known to be safe to convert without raising any exceptions (e.g. all zero > bits, or -fno-trapping-math in effect). > > (g) If concerned about efficiency of intermediate truncations on > processors without hardware _Float16 arithmetic, look at > aarch64_excess_precision; you have the option of using excess precision > for _Float16 by default, though that only really helps for C given the > lack of excess precision support in the C++ front end. (Enabling this can > cause trouble for code that only expects C99/C11 values of > FLT_EVAL_METHOD, however; see the -fpermitted-flt-eval-methods option for > more details.) > > > 5. Suppose that in some cases you do disable _Float16 support (whether > that's just for 32-bit until the ABI has been defined, or also in the > absence of instruction set support despite my comments above). 
Then the > way you do that in this patch series, enabling the type in > ix86_scalar_mode_supported_p and ix86_libgcc_floating_mode_supported_p and > giving an error later in ix86_expand_move, is a bad idea. > > Errors in expanders are generally problematic (they don't have good > location information available). But apart from that, ordinary user code > should be able to tell whether _Float16 is supported by testing whether > e.g. __FLT16_MANT_DIG__ is defined (like float.h does), or by including > float.h (with __STDC_WANT_IEC_60559_TYPES_EXT__ defined) and then testing > whether one of the FLT16_* macros is defined, or in a configure test by > just declaring something using the _Float16 type. Patch 1 changes > check_effective_target_float16 to work around your technique for disabling > _Float16 in ix86_expand_move, but it should be considered a stable user > API that any of the above methods can be used in user code to check for > _Float16 support - user code shouldn't need to kno
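[Editorial illustration] Point (d) above, that converting DFmode to HFmode through an intermediate SFmode conversion double-rounds, can be demonstrated with Python's `struct` formats. The example value is our own, not from the thread:

```python
import struct

def to_half(x):
    return struct.unpack('<e', struct.pack('<e', x))[0]

def to_f32(x):
    return struct.unpack('<f', struct.pack('<f', x))[0]

# a double whose tail below the _Float16 tie point is wiped out by the
# intermediate rounding to float
d = 1.0 + 2.0**-11 + 2.0**-30

print(to_half(d))          # 1.0009765625 (correct single rounding)
print(to_half(to_f32(d)))  # 1.0 (double rounding via float: wrong)
```

The float step rounds `d` down to an exact _Float16 tie, which ties-to-even then resolves the wrong way; a direct (software) DFmode-to-HFmode conversion avoids this.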
Re: [PATCH 0/2] Initial support for AVX512FP16
On Thu, Jul 1, 2021 at 9:04 PM Jakub Jelinek via Gcc-patches wrote:
>
> On Thu, Jul 01, 2021 at 02:58:01PM +0200, Richard Biener wrote:
> > > The main issue is complex _Float16 functions in libgcc.  If _Float16
> > > doesn't require -mavx512fp16, we need to compile complex _Float16
> > > functions in libgcc without -mavx512fp16.  Complex _Float16
> > > performance is very important for our _Float16 usage.  _Float16
> > > performance has to be very fast.  There should be no emulation
> > > anywhere when -mavx512fp16 is used.  That is why _Float16 is
> > > available only with -mavx512fp16.
> >
> > It should be possible to emulate scalar _Float16 using _Float32 with a
> > reasonable performance trade-off.  I think users caring for _Float16
> > performance will use vector intrinsics anyway since for scalar code
> > _Float32 code will likely perform the same (at double storage cost)
>
> Only if it is allowed to have excess precision for _Float16.  If not, then
> one would need to (expensively?) round after every operation at least.

There may be inconsistent behavior between soft-fp and the avx512fp16
instructions if we emulate _Float16 w/ float, i.e.:

1) For a + b - c, where b and c are variables with the same big value
and a + b is NaN at _Float16 but a real value at float, the avx512fp16
instruction will raise an exception but soft-fp won't (unless it's
rounded after every operation).

2) For a / b, where b is a denormal value, AVX512FP16 won't flush it to
zero even w/ -Ofast, but when it's extended to float and divss is used,
it will be flushed to zero and raise an exception when compiling w/
-Ofast.

To solve the above issue, I tried to add full emulation for _Float16
(for all those under libgcc/soft-fp/, i.e. add/sub/mul/div/cmp, etc.);
the problem is in pass_expand: it always tries a wider mode first
instead of using soft-fp.

  /* Look for a wider mode of the same class for which we think we
     can open-code the operation.  Check for a widening multiply at the
     wider mode as well.  */

  if (CLASS_HAS_WIDER_MODES_P (mclass)
      && methods != OPTAB_DIRECT && methods != OPTAB_LIB)
    FOR_EACH_WIDER_MODE (wider_mode, mode)

I think pass_expand did this for some reason, so I'm a little afraid
to touch this part of the code.

So the key point is that the soft-fp and avx512fp16 instructions may
not behave the same on the exception; is this acceptable?

BTW, I've finished an initial patch to enable _Float16 on sse2 and
emulate _Float16 operations w/ float, and it passes all 312 new tests
which are related to _Float16, but those unit tests don't cover the
scenario I'm talking about.

>         Jakub
>

--
BR,
Hongtao
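[Editorial illustration] Hongtao's case 1), overflow behavior diverging between per-operation rounding and float emulation with a single final truncation, can be modeled like this. The values are our own illustration; Python's `struct` `'e'` format stands in for _Float16, and CPython signals binary16 overflow with OverflowError, which we map to infinity:

```python
import math
import struct

def to_half(x):
    # round to _Float16; CPython raises OverflowError when a finite
    # value is outside the binary16 range, i.e. the result overflowed
    try:
        return struct.unpack('<e', struct.pack('<e', x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)

a, b, c = 40000.0, 40000.0, 40000.0  # all exactly _Float16 values

# native _Float16 (or per-operation rounding): a + b overflows to +inf,
# raising the overflow exception, and the infinity then sticks
per_op = to_half(to_half(a + b) - c)

# emulation in float with rounding deferred to the final store:
# 80000.0 is finite in float, so the overflow never happens at all
deferred = to_half(a + b - c)

print(per_op)    # inf
print(deferred)  # 40000.0
```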
Re: [PATCH 0/2] Initial support for AVX512FP16
On Mon, Jul 5, 2021 at 3:21 AM Hongtao Liu via Gcc-patches wrote: > > On Fri, Jul 2, 2021 at 4:03 PM Uros Bizjak wrote: > > > > On Fri, Jul 2, 2021 at 8:25 AM Hongtao Liu wrote: > > > > > > > AVX512FP16 is disclosed, refer to [1]. > > > > > There're 100+ instructions for AVX512FP16, 67 gcc patches, for the > > > > > convenience of review, we divide the 67 patches into 2 major parts. > > > > > The first part is 2 patches containing basic support for AVX512FP16 > > > > > (options, cpuid, _Float16 type, libgcc, etc.), and the second part is > > > > > 65 patches covering all instructions of AVX512FP16(including > > > > > intrinsic support and some optimizations). > > > > > There is a problem with the first part, _Float16 is not a C++ > > > > > standard, so the front-end does not support this type and its > > > > > mangling, so we "make up" a _Float16 type on the back-end and use > > > > > _DF16 as its mangling. The purpose of this is to align with llvm > > > > > side, because llvm C++ FE already supports _Float16[2]. > > > > > > > > > > [1] > > > > > https://software.intel.com/content/www/us/en/develop/download/intel-avx512-fp16-architecture-specification.html > > > > > [2] https://reviews.llvm.org/D33719 > > > > > > > > Looking through implementation of _Float16 support, I think, there is > > > > no need for _Float16 support to depend on AVX512FP16. > > > > > > > > The compiler is smart enough to use either a named pattern that > > > > describes the instruction when available or diverts to a library call > > > > to a soft-fp implementation. So, I think that general _Float16 support > > > > should be implemented first (similar to _float128) and then upgraded > > > > with AVX512FP16 specific instructions. > > > > > > > > MOVW loads/stores to XMM reg can be emulated with MOVD and a SImode > > > > secondary_reload register. > > > > > > > MOVD is under sse2, so is pinsrw, which means if we want xmm > > > load/stores for HF, sse2 is the least requirement. 
> > > Also we support PEXTRW reg/m16, xmm, imm8 under SSE4_1 under which we > > > have 16bit direct load/store for HFmode and no need for a secondary > > > reload. > > > So for simplicity, can we just restrict _Float16 under sse4_1? > > > > When baseline is not met, the equivalent integer calling convention is > > used, for example: > Problem is under TARGET_SSE and w/ -mno-sse2, float calling convention > is available for sse register, it's ok for float since there's movss > under sse, but there's no 16bit load/store for sse registers, nor > movement between gpr and sse register. You can always spill though, that's prefered for some archs over xmm <-> gpr moves anyway. Richard. > > > > --cut here-- > > typedef int __v2si __attribute__ ((vector_size (8))); > > > > __v2si foo (__v2si a, __v2si b) > > { > > return a + b; > > } > > --cut here-- > > > > will still compile with -m32 -mno-mmx with warnings: > > > > mmx1.c: In function ‘foo’: > > mmx1.c:4:1: warning: MMX vector return without MMX enabled changes the > > ABI [-Wpsabi] > > mmx1.c:3:8: warning: MMX vector argument without MMX enabled changes > > the ABI [-Wpsabi] > > > > So, by setting the baseline to SSE4.1, a big pool of targets will be > > forced to use alternative ABI. This is quite inconvenient, and we > > revert to the alternative ABI if we *really* can't satisfy ABI > > requirements (e.g. register type is not available, basic move insn > > can't be implemented). Based on your analysis, I think that SSE2 > > should be the baseline. > Agreed. > > > > Also, looking at insn tables, it looks that movzwl from memory + movd > > is faster than pinsrw (and similar for pextrw to memory), but I have > > no hard data here. > > > > Regarding secondary_reload, a scratch register is needed in case of > > HImode moves between memory and XMM reg, since scratch register needs > > a different mode than source and destination. Please see > > TARGET_SECONDARY_RELOAD documentation and several examples in the > > source. 
> > > > Uros. > > > > -- > BR, > Hongtao
Re: [PATCH 0/2] Initial support for AVX512FP16
On Fri, Jul 2, 2021 at 4:03 PM Uros Bizjak wrote: > > On Fri, Jul 2, 2021 at 8:25 AM Hongtao Liu wrote: > > > > > AVX512FP16 is disclosed, refer to [1]. > > > > There're 100+ instructions for AVX512FP16, 67 gcc patches, for the > > > > convenience of review, we divide the 67 patches into 2 major parts. > > > > The first part is 2 patches containing basic support for AVX512FP16 > > > > (options, cpuid, _Float16 type, libgcc, etc.), and the second part is > > > > 65 patches covering all instructions of AVX512FP16(including intrinsic > > > > support and some optimizations). > > > > There is a problem with the first part, _Float16 is not a C++ > > > > standard, so the front-end does not support this type and its mangling, > > > > so we "make up" a _Float16 type on the back-end and use _DF16 as its > > > > mangling. The purpose of this is to align with llvm side, because llvm > > > > C++ FE already supports _Float16[2]. > > > > > > > > [1] > > > > https://software.intel.com/content/www/us/en/develop/download/intel-avx512-fp16-architecture-specification.html > > > > [2] https://reviews.llvm.org/D33719 > > > > > > Looking through implementation of _Float16 support, I think, there is > > > no need for _Float16 support to depend on AVX512FP16. > > > > > > The compiler is smart enough to use either a named pattern that > > > describes the instruction when available or diverts to a library call > > > to a soft-fp implementation. So, I think that general _Float16 support > > > should be implemented first (similar to _float128) and then upgraded > > > with AVX512FP16 specific instructions. > > > > > > MOVW loads/stores to XMM reg can be emulated with MOVD and a SImode > > > secondary_reload register. > > > > > MOVD is under sse2, so is pinsrw, which means if we want xmm > > load/stores for HF, sse2 is the least requirement. 
> > Also we support PEXTRW reg/m16, xmm, imm8 under SSE4_1, under which we
> > have a 16-bit direct load/store for HFmode and no need for a secondary
> > reload.
> > So for simplicity, can we just restrict _Float16 under sse4_1?
>
> When baseline is not met, the equivalent integer calling convention is
> used, for example:

The problem is that under TARGET_SSE w/ -mno-sse2, the float calling
convention is available for SSE registers; that's OK for float since
there's movss under SSE, but there's no 16-bit load/store for SSE
registers, nor any move between a GPR and an SSE register.

> --cut here--
> typedef int __v2si __attribute__ ((vector_size (8)));
>
> __v2si foo (__v2si a, __v2si b)
> {
>   return a + b;
> }
> --cut here--
>
> will still compile with -m32 -mno-mmx with warnings:
>
> mmx1.c: In function ‘foo’:
> mmx1.c:4:1: warning: MMX vector return without MMX enabled changes the
> ABI [-Wpsabi]
> mmx1.c:3:8: warning: MMX vector argument without MMX enabled changes
> the ABI [-Wpsabi]
>
> So, by setting the baseline to SSE4.1, a big pool of targets will be
> forced to use alternative ABI.  This is quite inconvenient, and we
> revert to the alternative ABI if we *really* can't satisfy ABI
> requirements (e.g. register type is not available, basic move insn
> can't be implemented).  Based on your analysis, I think that SSE2
> should be the baseline.

Agreed.

> Also, looking at insn tables, it looks that movzwl from memory + movd
> is faster than pinsrw (and similar for pextrw to memory), but I have
> no hard data here.
>
> Regarding secondary_reload, a scratch register is needed in case of
> HImode moves between memory and XMM reg, since scratch register needs
> a different mode than source and destination.  Please see
> TARGET_SECONDARY_RELOAD documentation and several examples in the
> source.
>
> Uros.

--
BR,
Hongtao
Re: [PATCH 0/2] Initial support for AVX512FP16
On Fri, Jul 2, 2021 at 4:19 PM Richard Biener wrote: > > On Fri, Jul 2, 2021 at 10:07 AM Uros Bizjak via Gcc-patches > wrote: > > > > On Fri, Jul 2, 2021 at 8:25 AM Hongtao Liu wrote: > > > > > > > AVX512FP16 is disclosed, refer to [1]. > > > > > There're 100+ instructions for AVX512FP16, 67 gcc patches, for the > > > > > convenience of review, we divide the 67 patches into 2 major parts. > > > > > The first part is 2 patches containing basic support for AVX512FP16 > > > > > (options, cpuid, _Float16 type, libgcc, etc.), and the second part is > > > > > 65 patches covering all instructions of AVX512FP16(including > > > > > intrinsic support and some optimizations). > > > > > There is a problem with the first part, _Float16 is not a C++ > > > > > standard, so the front-end does not support this type and its > > > > > mangling, so we "make up" a _Float16 type on the back-end and use > > > > > _DF16 as its mangling. The purpose of this is to align with llvm > > > > > side, because llvm C++ FE already supports _Float16[2]. > > > > > > > > > > [1] > > > > > https://software.intel.com/content/www/us/en/develop/download/intel-avx512-fp16-architecture-specification.html > > > > > [2] https://reviews.llvm.org/D33719 > > > > > > > > Looking through implementation of _Float16 support, I think, there is > > > > no need for _Float16 support to depend on AVX512FP16. > > > > > > > > The compiler is smart enough to use either a named pattern that > > > > describes the instruction when available or diverts to a library call > > > > to a soft-fp implementation. So, I think that general _Float16 support > > > > should be implemented first (similar to _float128) and then upgraded > > > > with AVX512FP16 specific instructions. > > > > > > > > MOVW loads/stores to XMM reg can be emulated with MOVD and a SImode > > > > secondary_reload register. > > > > > > > MOVD is under sse2, so is pinsrw, which means if we want xmm > > > load/stores for HF, sse2 is the least requirement. 
> > > Also we support PEXTRW reg/m16, xmm, imm8 under SSE4_1 under which we > > > have 16bit direct load/store for HFmode and no need for a secondary > > > reload. > > > So for simplicity, can we just restrict _Float16 under sse4_1? > > > > When baseline is not met, the equivalent integer calling convention is > > used, for example: > > > > --cut here-- > > typedef int __v2si __attribute__ ((vector_size (8))); > > > > __v2si foo (__v2si a, __v2si b) > > { > > return a + b; > > } > > --cut here-- > > > > will still compile with -m32 -mno-mmx with warnings: > > > > mmx1.c: In function ‘foo’: > > mmx1.c:4:1: warning: MMX vector return without MMX enabled changes the > > ABI [-Wpsabi] > > mmx1.c:3:8: warning: MMX vector argument without MMX enabled changes > > the ABI [-Wpsabi] > > > > So, by setting the baseline to SSE4.1, a big pool of targets will be > > forced to use alternative ABI. This is quite inconvenient, and we > > revert to the alternative ABI if we *really* can't satisfy ABI > > requirements (e.g. register type is not available, basic move insn > > can't be implemented). Based on your analysis, I think that SSE2 > > should be the baseline. > > > > Also, looking at insn tables, it looks that movzwl from memory + movd > > is faster than pinsrw (and similar for pextrw to memory), but I have > > no hard data here. > > > > Regarding secondary_reload, a scratch register is needed in case of > > HImode moves between memory and XMM reg, since scratch register needs > > a different mode than source and destination. Please see > > TARGET_SECONDARY_RELOAD documentation and several examples in the > > source. > > I would suggest for the purpose of simplifying the initial patch series to > not make _Float16 supported on 32bits and leave that (and its ABI) for w/o AVX512FP16, it's ok. The problem is AVX512FP16 instructions are also available for -m32, and corresponding intrinsics will need the "_Float16" type(or other builtin type name) which will also be used by users. 
It means we still need a 32-bit _Float16 ABI for them. > future enhancement. Then the baseline should be SSE2 (x86-64 base) > which I think should be OK despite needing some awkwardness for > HFmode stores (scratch reg needed). > > Richard. > > > Uros. -- BR, Hongtao
Re: [PATCH 0/2] Initial support for AVX512FP16
On Fri, Jul 2, 2021 at 10:07 AM Uros Bizjak via Gcc-patches wrote: > > On Fri, Jul 2, 2021 at 8:25 AM Hongtao Liu wrote: > > > > > AVX512FP16 is disclosed, refer to [1]. > > > > There're 100+ instructions for AVX512FP16, 67 gcc patches, for the > > > > convenience of review, we divide the 67 patches into 2 major parts. > > > > The first part is 2 patches containing basic support for AVX512FP16 > > > > (options, cpuid, _Float16 type, libgcc, etc.), and the second part is > > > > 65 patches covering all instructions of AVX512FP16(including intrinsic > > > > support and some optimizations). > > > > There is a problem with the first part, _Float16 is not a C++ > > > > standard, so the front-end does not support this type and its mangling, > > > > so we "make up" a _Float16 type on the back-end and use _DF16 as its > > > > mangling. The purpose of this is to align with llvm side, because llvm > > > > C++ FE already supports _Float16[2]. > > > > > > > > [1] > > > > https://software.intel.com/content/www/us/en/develop/download/intel-avx512-fp16-architecture-specification.html > > > > [2] https://reviews.llvm.org/D33719 > > > > > > Looking through implementation of _Float16 support, I think, there is > > > no need for _Float16 support to depend on AVX512FP16. > > > > > > The compiler is smart enough to use either a named pattern that > > > describes the instruction when available or diverts to a library call > > > to a soft-fp implementation. So, I think that general _Float16 support > > > should be implemented first (similar to _float128) and then upgraded > > > with AVX512FP16 specific instructions. > > > > > > MOVW loads/stores to XMM reg can be emulated with MOVD and a SImode > > > secondary_reload register. > > > > > MOVD is under sse2, so is pinsrw, which means if we want xmm > > load/stores for HF, sse2 is the least requirement. 
> > Also we support PEXTRW reg/m16, xmm, imm8 under SSE4_1 under which we > > have 16bit direct load/store for HFmode and no need for a secondary > > reload. > > So for simplicity, can we just restrict _Float16 under sse4_1? > > When baseline is not met, the equivalent integer calling convention is > used, for example: > > --cut here-- > typedef int __v2si __attribute__ ((vector_size (8))); > > __v2si foo (__v2si a, __v2si b) > { > return a + b; > } > --cut here-- > > will still compile with -m32 -mno-mmx with warnings: > > mmx1.c: In function ‘foo’: > mmx1.c:4:1: warning: MMX vector return without MMX enabled changes the > ABI [-Wpsabi] > mmx1.c:3:8: warning: MMX vector argument without MMX enabled changes > the ABI [-Wpsabi] > > So, by setting the baseline to SSE4.1, a big pool of targets will be > forced to use alternative ABI. This is quite inconvenient, and we > revert to the alternative ABI if we *really* can't satisfy ABI > requirements (e.g. register type is not available, basic move insn > can't be implemented). Based on your analysis, I think that SSE2 > should be the baseline. > > Also, looking at insn tables, it looks that movzwl from memory + movd > is faster than pinsrw (and similar for pextrw to memory), but I have > no hard data here. > > Regarding secondary_reload, a scratch register is needed in case of > HImode moves between memory and XMM reg, since scratch register needs > a different mode than source and destination. Please see > TARGET_SECONDARY_RELOAD documentation and several examples in the > source. I would suggest for the purpose of simplifying the initial patch series to not make _Float16 supported on 32bits and leave that (and its ABI) for future enhancement. Then the baseline should be SSE2 (x86-64 base) which I think should be OK despite needing some awkwardness for HFmode stores (scratch reg needed). Richard. > Uros.
Re: [PATCH 0/2] Initial support for AVX512FP16
On Fri, Jul 2, 2021 at 8:25 AM Hongtao Liu wrote: > > > AVX512FP16 is disclosed, refer to [1]. > > > There're 100+ instructions for AVX512FP16, 67 gcc patches, for the > > > convenience of review, we divide the 67 patches into 2 major parts. > > > The first part is 2 patches containing basic support for AVX512FP16 > > > (options, cpuid, _Float16 type, libgcc, etc.), and the second part is 65 > > > patches covering all instructions of AVX512FP16(including intrinsic > > > support and some optimizations). > > > There is a problem with the first part, _Float16 is not a C++ standard, > > > so the front-end does not support this type and its mangling, so we "make > > > up" a _Float16 type on the back-end and use _DF16 as its mangling. The > > > purpose of this is to align with llvm side, because llvm C++ FE already > > > supports _Float16[2]. > > > > > > [1] > > > https://software.intel.com/content/www/us/en/develop/download/intel-avx512-fp16-architecture-specification.html > > > [2] https://reviews.llvm.org/D33719 > > > > Looking through implementation of _Float16 support, I think, there is > > no need for _Float16 support to depend on AVX512FP16. > > > > The compiler is smart enough to use either a named pattern that > > describes the instruction when available or diverts to a library call > > to a soft-fp implementation. So, I think that general _Float16 support > > should be implemented first (similar to _float128) and then upgraded > > with AVX512FP16 specific instructions. > > > > MOVW loads/stores to XMM reg can be emulated with MOVD and a SImode > > secondary_reload register. > > > MOVD is under sse2, so is pinsrw, which means if we want xmm > load/stores for HF, sse2 is the least requirement. > Also we support PEXTRW reg/m16, xmm, imm8 under SSE4_1 under which we > have 16bit direct load/store for HFmode and no need for a secondary > reload. > So for simplicity, can we just restrict _Float16 under sse4_1? 
When baseline is not met, the equivalent integer calling convention is used, for example: --cut here-- typedef int __v2si __attribute__ ((vector_size (8))); __v2si foo (__v2si a, __v2si b) { return a + b; } --cut here-- will still compile with -m32 -mno-mmx with warnings: mmx1.c: In function ‘foo’: mmx1.c:4:1: warning: MMX vector return without MMX enabled changes the ABI [-Wpsabi] mmx1.c:3:8: warning: MMX vector argument without MMX enabled changes the ABI [-Wpsabi] So, by setting the baseline to SSE4.1, a big pool of targets will be forced to use alternative ABI. This is quite inconvenient, and we revert to the alternative ABI if we *really* can't satisfy ABI requirements (e.g. register type is not available, basic move insn can't be implemented). Based on your analysis, I think that SSE2 should be the baseline. Also, looking at insn tables, it looks that movzwl from memory + movd is faster than pinsrw (and similar for pextrw to memory), but I have no hard data here. Regarding secondary_reload, a scratch register is needed in case of HImode moves between memory and XMM reg, since scratch register needs a different mode than source and destination. Please see TARGET_SECONDARY_RELOAD documentation and several examples in the source. Uros.
Re: [PATCH 0/2] Initial support for AVX512FP16
On Thu, Jul 1, 2021 at 7:10 PM Uros Bizjak wrote: > > [Sorry for double post, gcc-patches address was wrong in original post] > > On Thu, Jul 1, 2021 at 7:48 AM liuhongt wrote: > > > > Hi: > > AVX512FP16 is disclosed, refer to [1]. > > There're 100+ instructions for AVX512FP16, 67 gcc patches, for the > > convenience of review, we divide the 67 patches into 2 major parts. > > The first part is 2 patches containing basic support for AVX512FP16 > > (options, cpuid, _Float16 type, libgcc, etc.), and the second part is 65 > > patches covering all instructions of AVX512FP16(including intrinsic support > > and some optimizations). > > There is a problem with the first part, _Float16 is not a C++ standard, > > so the front-end does not support this type and its mangling, so we "make > > up" a _Float16 type on the back-end and use _DF16 as its mangling. The > > purpose of this is to align with llvm side, because llvm C++ FE already > > supports _Float16[2]. > > > > [1] > > https://software.intel.com/content/www/us/en/develop/download/intel-avx512-fp16-architecture-specification.html > > [2] https://reviews.llvm.org/D33719 > > Looking through implementation of _Float16 support, I think, there is > no need for _Float16 support to depend on AVX512FP16. > > The compiler is smart enough to use either a named pattern that > describes the instruction when available or diverts to a library call > to a soft-fp implementation. So, I think that general _Float16 support > should be implemented first (similar to _float128) and then upgraded > with AVX512FP16 specific instructions. > > MOVW loads/stores to XMM reg can be emulated with MOVD and a SImode > secondary_reload register. > MOVD is under sse2, so is pinsrw, which means if we want xmm load/stores for HF, sse2 is the least requirement. Also we support PEXTRW reg/m16, xmm, imm8 under SSE4_1 under which we have 16bit direct load/store for HFmode and no need for a secondary reload. 
So for simplicity, can we just restrict _Float16 under sse4_1? > soft-fp library already includes all the infrastructure to implement > _Float16 (see half.h), so HFmode basic operations should be trivial to > implement (I have gone through this exercise personally years ago when > implementing __float128 soft-fp support). > > Looking through the patch 1/2, it looks that a new ABI is introduced, > where FP16 values are passed through XMM registers, but I don't think > there is updated psABI documentation available (for x86_64 as well as > i386, where FP16 values will probably be passed through memory). > > So, the net effect of the above proposal(s) is that x86 will support > _Float16 out-of the box, emulate it via soft-fp without AVX512FP16 and > use AVX512FP16 instructions with -mavx512fp16. > > Uros. -- BR, Hongtao
Re: [PATCH 0/2] Initial support for AVX512FP16
On Thu, 1 Jul 2021, H.J. Lu via Gcc-patches wrote: > The main issue is complex _Float16 functions in libgcc. If _Float16 doesn't > require -mavx512fp16, we need to compile complex _Float16 functions in > libgcc without -mavx512fp16. Complex _Float16 performance is very > important for our _Float16 usage. _Float16 performance has to be > very fast. There should be no emulation anywhere when -mavx512fp16 > is used. That is why _Float16 is available only with -mavx512fp16. You could build IFUNC versions of the libgcc functions (like float128 on powerpc64le), to be fast (modulo any IFUNC overhead) when run on AVX512FP16 hardware. Or arrange for different libcall names to be used depending on the instruction set features available, and build those functions under multiple names, to be fast when the application is built with -mavx512fp16. Since the HCmode libgcc functions just convert to/from SFmode and do all their computations on SFmode (to avoid intermediate overflows / cancellation resulting in inaccuracy), an F16C version may make sense as well (assuming use of the F16C conversion instructions is still efficient once you allow for zeroing the unused parts of the vector register, if necessary to avoid spurious exceptions from converting junk data in those parts of the register). -- Joseph S. Myers jos...@codesourcery.com
Re: [PATCH 0/2] Initial support for AVX512FP16
Some general comments, following what I said on libc-alpha: 1. Can you confirm that the ABI being used for 64-bit, for _Float16 and _Complex _Float16 argument passing and return, follows the current x86_64 ABI document? 2. Can you confirm that if you build with this instruction set extension enabled by default, and run GCC tests for a corresponding (emulated?) processor, all the existing float16 tests in the testsuite are enabled and PASS (both compilation and execution) (both 64-bit and 32-bit testing)? 3. There's an active 32-bit ABI mailing list (ia32-...@googlegroups.com). If you want to support _Float16 in the 32-bit case, please work with it to get the corresponding ABI documented (using only memory and general-purpose registers seems like a good idea, so that the ABI can be supported for the base architecture without depending on SSE registers being present). In the absence of 32-bit ABI support it might be better to disable the HFmode support for 32-bit. 4. Support for _Float16 really ought not to depend on whether a particular instruction set extension is present, just like with other floating-point types; it makes sense, as an API, for all x86 processors (and like many APIs, it will be faster on some processors than on others). More specific points here are: (a) Basic arithmetic (+-*/) can be done by converting to SFmode, doing arithmetic there and converting back to HFmode; the results of doing so will be correctly rounded. Indeed, I think optabs.c handles that automatically when operations are available on a wider mode but not on the desired mode (but you'd need to check carefully that all the expected conversions do occur). (b) Conversions to/from all other floating-point modes will always be needed, whether in hardware or in software. (c) In the F16C (Ivy Bridge and later) case, where you have hardware conversions to/from float (only), it's fine to convert to double (or long double) via float. 
(On efficiency grounds, widening from HFmode to TFmode should be a pure software operations, that should be faster than having an intermediate conversion to SFmode when the SFmode-to-TFmode conversion is a software operation.) (d) In the F16C case (where there are hardware conversions only from SFmode, not from wider modes), conversion *from* DFmode (or XFmode or TFmode) to HFmode should be a software operation, to avoid double rounding; an intermediate conversion to SFmode would be incorrect. (e) It's OK for conversions to/from integer modes to go via SFmode (although I don't know if that's efficient or not). Any case where a conversion from integer to SFmode is inexact would overflow HFmode, so there are no double rounding issues. (f) In the F16C case, it seems the hardware instructions only work on vectors, not scalars, so care would need to be taken to use them for scalar conversions only if the other elements of the vector register are known to be safe to convert without raising any exceptions (e.g. all zero bits, or -fno-trapping-math in effect). (g) If concerned about efficiency of intermediate truncations on processors without hardware _Float16 arithmetic, look at aarch64_excess_precision; you have the option of using excess precision for _Float16 by default, though that only really helps for C given the lack of excess precision support in the C++ front end. (Enabling this can cause trouble for code that only expects C99/C11 values of FLT_EVAL_METHOD, however; see the -fpermitted-flt-eval-methods option for more details.) 5. Suppose that in some cases you do disable _Float16 support (whether that's just for 32-bit until the ABI has been defined, or also in the absence of instruction set support despite my comments above). Then the way you do that in this patch series, enabling the type in ix86_scalar_mode_supported_p and ix86_libgcc_floating_mode_supported_p and giving an error later in ix86_expand_move, is a bad idea. 
Errors in expanders are generally problematic (they don't have good location information available). But apart from that, ordinary user code should be able to tell whether _Float16 is supported by testing whether e.g. __FLT16_MANT_DIG__ is defined (like float.h does), or by including float.h (with __STDC_WANT_IEC_60559_TYPES_EXT__ defined) and then testing whether one of the FLT16_* macros is defined, or in a configure test by just declaring something using the _Float16 type. Patch 1 changes check_effective_target_float16 to work around your technique for disabling _Float16 in ix86_expand_move, but it should be considered a stable user API that any of the above methods can be used in user code to check for _Float16 support - user code shouldn't need to know implementation details that you need to do something that will go through ix86_expand_move to see whether _Float16 is supported or not (and user code shouldn't need to use a configure test at all for this, testing FLT16_* after including float.h).
Re: [PATCH 0/2] Initial support for AVX512FP16
On Thu, Jul 01, 2021 at 02:58:01PM +0200, Richard Biener wrote: > > The main issue is complex _Float16 functions in libgcc. If _Float16 doesn't > > require -mavx512fp16, we need to compile complex _Float16 functions in > > libgcc without -mavx512fp16. Complex _Float16 performance is very > > important for our _Float16 usage. _Float16 performance has to be > > very fast. There should be no emulation anywhere when -mavx512fp16 > > is used. That is why _Float16 is available only with -mavx512fp16. > > It should be possible to emulate scalar _Float16 using _Float32 with a > reasonable > performance trade-off. I think users caring for _Float16 performance will > use vector intrinsics anyway since for scalar code _Float32 code will likely > perform the same (at double storage cost) Only if it is allowed to have excess precision for _Float16. If not, then one would need to (expensively?) round after every operation at least. Jakub
Re: [PATCH 0/2] Initial support for AVX512FP16
On Thu, Jul 1, 2021 at 2:40 PM H.J. Lu wrote: > > On Thu, Jul 1, 2021 at 4:10 AM Uros Bizjak wrote: > > > > [Sorry for double post, gcc-patches address was wrong in original post] > > > > On Thu, Jul 1, 2021 at 7:48 AM liuhongt wrote: > > > > > > Hi: > > > AVX512FP16 is disclosed, refer to [1]. > > > There're 100+ instructions for AVX512FP16, 67 gcc patches, for the > > > convenience of review, we divide the 67 patches into 2 major parts. > > > The first part is 2 patches containing basic support for AVX512FP16 > > > (options, cpuid, _Float16 type, libgcc, etc.), and the second part is 65 > > > patches covering all instructions of AVX512FP16(including intrinsic > > > support and some optimizations). > > > There is a problem with the first part, _Float16 is not a C++ standard, > > > so the front-end does not support this type and its mangling, so we "make > > > up" a _Float16 type on the back-end and use _DF16 as its mangling. The > > > purpose of this is to align with llvm side, because llvm C++ FE already > > > supports _Float16[2]. > > > > > > [1] > > > https://software.intel.com/content/www/us/en/develop/download/intel-avx512-fp16-architecture-specification.html > > > [2] https://reviews.llvm.org/D33719 > > > > Looking through implementation of _Float16 support, I think, there is > > no need for _Float16 support to depend on AVX512FP16. > > > > The compiler is smart enough to use either a named pattern that > > describes the instruction when available or diverts to a library call > > to a soft-fp implementation. So, I think that general _Float16 support > > should be implemented first (similar to _float128) and then upgraded > > with AVX512FP16 specific instructions. > > > > MOVW loads/stores to XMM reg can be emulated with MOVD and a SImode > > secondary_reload register. 
> > > > soft-fp library already includes all the infrastructure to implement > > _Float16 (see half.h), so HFmode basic operations should be trivial to > > implement (I have gone through this exercise personally years ago when > > implementing __float128 soft-fp support). > > > > Looking through the patch 1/2, it looks that a new ABI is introduced, > > where FP16 values are passed through XMM registers, but I don't think > > there is updated psABI documentation available (for x86_64 as well as > > _Float16 support was added to x86-64 psABI: > > https://gitlab.com/x86-psABIs/x86-64-ABI/-/commit/71d1183e7bb95e9f8ad732e0f2b5a4f127796e2a > > 2 years ago. Uh, sorry, my psABI link [1] is way out of date, but this is what google gives for "x86_64 psABI pdf" ... [1] https://uclibc.org/docs/psABI-x86_64.pdf > > > i386, where FP16 values will probably be passed through memory). > > That is correct. > > > So, the net effect of the above proposal(s) is that x86 will support > > _Float16 out-of the box, emulate it via soft-fp without AVX512FP16 and > > use AVX512FP16 instructions with -mavx512fp16. > > > > The main issue is complex _Float16 functions in libgcc. If _Float16 doesn't > require -mavx512fp16, we need to compile complex _Float16 functions in > libgcc without -mavx512fp16. Complex _Float16 performance is very > important for our _Float16 usage. _Float16 performance has to be > very fast. There should be no emulation anywhere when -mavx512fp16 > is used. That is why _Float16 is available only with -mavx512fp16. If this performance is important, then the best way is that in addition to generic versions, recompile these functions for AVX512FP16 target, or even implement them in assembly. The compiler can then call these specific functions when -mavx512fp16 is used. Please see how alpha implements calls to its X_floating library. Uros.
Re: [PATCH 0/2] Initial support for AVX512FP16
On Thu, Jul 1, 2021 at 2:41 PM H.J. Lu via Gcc-patches wrote: > > On Thu, Jul 1, 2021 at 4:10 AM Uros Bizjak wrote: > > > > [Sorry for double post, gcc-patches address was wrong in original post] > > > > On Thu, Jul 1, 2021 at 7:48 AM liuhongt wrote: > > > > > > Hi: > > > AVX512FP16 is disclosed, refer to [1]. > > > There're 100+ instructions for AVX512FP16, 67 gcc patches, for the > > > convenience of review, we divide the 67 patches into 2 major parts. > > > The first part is 2 patches containing basic support for AVX512FP16 > > > (options, cpuid, _Float16 type, libgcc, etc.), and the second part is 65 > > > patches covering all instructions of AVX512FP16(including intrinsic > > > support and some optimizations). > > > There is a problem with the first part, _Float16 is not a C++ standard, > > > so the front-end does not support this type and its mangling, so we "make > > > up" a _Float16 type on the back-end and use _DF16 as its mangling. The > > > purpose of this is to align with llvm side, because llvm C++ FE already > > > supports _Float16[2]. > > > > > > [1] > > > https://software.intel.com/content/www/us/en/develop/download/intel-avx512-fp16-architecture-specification.html > > > [2] https://reviews.llvm.org/D33719 > > > > Looking through implementation of _Float16 support, I think, there is > > no need for _Float16 support to depend on AVX512FP16. > > > > The compiler is smart enough to use either a named pattern that > > describes the instruction when available or diverts to a library call > > to a soft-fp implementation. So, I think that general _Float16 support > > should be implemented first (similar to _float128) and then upgraded > > with AVX512FP16 specific instructions. > > > > MOVW loads/stores to XMM reg can be emulated with MOVD and a SImode > > secondary_reload register. 
> > > > soft-fp library already includes all the infrastructure to implement > > _Float16 (see half.h), so HFmode basic operations should be trivial to > > implement (I have gone through this exercise personally years ago when > > implementing __float128 soft-fp support). > > > > Looking through the patch 1/2, it looks that a new ABI is introduced, > > where FP16 values are passed through XMM registers, but I don't think > > there is updated psABI documentation available (for x86_64 as well as > > _Float16 support was added to x86-64 psABI: > > https://gitlab.com/x86-psABIs/x86-64-ABI/-/commit/71d1183e7bb95e9f8ad732e0f2b5a4f127796e2a > > 2 years ago. > > > i386, where FP16 values will probably be passed through memory). > > That is correct. > > > So, the net effect of the above proposal(s) is that x86 will support > > _Float16 out-of the box, emulate it via soft-fp without AVX512FP16 and > > use AVX512FP16 instructions with -mavx512fp16. > > > > The main issue is complex _Float16 functions in libgcc. If _Float16 doesn't > require -mavx512fp16, we need to compile complex _Float16 functions in > libgcc without -mavx512fp16. Complex _Float16 performance is very > important for our _Float16 usage. _Float16 performance has to be > very fast. There should be no emulation anywhere when -mavx512fp16 > is used. That is why _Float16 is available only with -mavx512fp16. It should be possible to emulate scalar _Float16 using _Float32 with a reasonable performance trade-off. I think users caring for _Float16 performance will use vector intrinsics anyway since for scalar code _Float32 code will likely perform the same (at double storage cost) Richard. > -- > H.J.
Re: [PATCH 0/2] Initial support for AVX512FP16
On Thu, Jul 1, 2021 at 4:10 AM Uros Bizjak wrote: > > [Sorry for double post, gcc-patches address was wrong in original post] > > On Thu, Jul 1, 2021 at 7:48 AM liuhongt wrote: > > > > Hi: > > AVX512FP16 is disclosed, refer to [1]. > > There're 100+ instructions for AVX512FP16, 67 gcc patches, for the > > convenience of review, we divide the 67 patches into 2 major parts. > > The first part is 2 patches containing basic support for AVX512FP16 > > (options, cpuid, _Float16 type, libgcc, etc.), and the second part is 65 > > patches covering all instructions of AVX512FP16(including intrinsic support > > and some optimizations). > > There is a problem with the first part, _Float16 is not a C++ standard, > > so the front-end does not support this type and its mangling, so we "make > > up" a _Float16 type on the back-end and use _DF16 as its mangling. The > > purpose of this is to align with llvm side, because llvm C++ FE already > > supports _Float16[2]. > > > > [1] > > https://software.intel.com/content/www/us/en/develop/download/intel-avx512-fp16-architecture-specification.html > > [2] https://reviews.llvm.org/D33719 > > Looking through implementation of _Float16 support, I think, there is > no need for _Float16 support to depend on AVX512FP16. > > The compiler is smart enough to use either a named pattern that > describes the instruction when available or diverts to a library call > to a soft-fp implementation. So, I think that general _Float16 support > should be implemented first (similar to _float128) and then upgraded > with AVX512FP16 specific instructions. > > MOVW loads/stores to XMM reg can be emulated with MOVD and a SImode > secondary_reload register. > > soft-fp library already includes all the infrastructure to implement > _Float16 (see half.h), so HFmode basic operations should be trivial to > implement (I have gone through this exercise personally years ago when > implementing __float128 soft-fp support). 
> > Looking through the patch 1/2, it looks that a new ABI is introduced, > where FP16 values are passed through XMM registers, but I don't think > there is updated psABI documentation available (for x86_64 as well as _Float16 support was added to x86-64 psABI: https://gitlab.com/x86-psABIs/x86-64-ABI/-/commit/71d1183e7bb95e9f8ad732e0f2b5a4f127796e2a 2 years ago. > i386, where FP16 values will probably be passed through memory). That is correct. > So, the net effect of the above proposal(s) is that x86 will support > _Float16 out-of the box, emulate it via soft-fp without AVX512FP16 and > use AVX512FP16 instructions with -mavx512fp16. > The main issue is complex _Float16 functions in libgcc. If _Float16 doesn't require -mavx512fp16, we need to compile complex _Float16 functions in libgcc without -mavx512fp16. Complex _Float16 performance is very important for our _Float16 usage. _Float16 performance has to be very fast. There should be no emulation anywhere when -mavx512fp16 is used. That is why _Float16 is available only with -mavx512fp16. -- H.J.
Re: [PATCH 0/2] Initial support for AVX512FP16
[Sorry for double post, gcc-patches address was wrong in original post] On Thu, Jul 1, 2021 at 7:48 AM liuhongt wrote: > > Hi: > AVX512FP16 is disclosed, refer to [1]. > There're 100+ instructions for AVX512FP16, 67 gcc patches, for the > convenience of review, we divide the 67 patches into 2 major parts. > The first part is 2 patches containing basic support for AVX512FP16 > (options, cpuid, _Float16 type, libgcc, etc.), and the second part is 65 > patches covering all instructions of AVX512FP16(including intrinsic support > and some optimizations). > There is a problem with the first part, _Float16 is not a C++ standard, so > the front-end does not support this type and its mangling, so we "make up" a > _Float16 type on the back-end and use _DF16 as its mangling. The purpose of > this is to align with llvm side, because llvm C++ FE already supports > _Float16[2]. > > [1] > https://software.intel.com/content/www/us/en/develop/download/intel-avx512-fp16-architecture-specification.html > [2] https://reviews.llvm.org/D33719 Looking through implementation of _Float16 support, I think, there is no need for _Float16 support to depend on AVX512FP16. The compiler is smart enough to use either a named pattern that describes the instruction when available or diverts to a library call to a soft-fp implementation. So, I think that general _Float16 support should be implemented first (similar to _float128) and then upgraded with AVX512FP16 specific instructions. MOVW loads/stores to XMM reg can be emulated with MOVD and a SImode secondary_reload register. soft-fp library already includes all the infrastructure to implement _Float16 (see half.h), so HFmode basic operations should be trivial to implement (I have gone through this exercise personally years ago when implementing __float128 soft-fp support). 
Looking through patch 1/2, it looks like a new ABI is introduced, where
FP16 values are passed through XMM registers, but I don't think there
is updated psABI documentation available (for x86_64 as well as i386,
where FP16 values will probably be passed through memory).

So, the net effect of the above proposal(s) is that x86 will support
_Float16 out of the box, emulate it via soft-fp without AVX512FP16, and
use AVX512FP16 instructions with -mavx512fp16.

Uros.
Re: [PATCH 0/2] Initial support for AVX512FP16
On Thu, Jul 1, 2021 at 1:48 PM liuhongt wrote:
>
> Hi:
>   AVX512FP16 is disclosed, refer to [1].
>   There're 100+ instructions for AVX512FP16 and 67 gcc patches; for
> the convenience of review, we divide the 67 patches into 2 major
> parts.  The first part is 2 patches containing basic support for
> AVX512FP16 (options, cpuid, _Float16 type, libgcc, etc.), and the
> second part is 65 patches covering all instructions of AVX512FP16
> (including intrinsic support and some optimizations).
>   There is a problem with the first part: _Float16 is not a C++
> standard type, so the front end does not support this type and its
> mangling.  We therefore "make up" a _Float16 type on the back end and
> use _DF16 as its mangling.  The purpose of this is to align with the
> llvm side, because the llvm C++ FE already supports _Float16 [2].
>
> [1] https://software.intel.com/content/www/us/en/develop/download/intel-avx512-fp16-architecture-specification.html
> [2] https://reviews.llvm.org/D33719
>
> Bootstrapped and regtested on x86_64-linux-gnu{-m32,}.
>
> Guo, Xuepeng (1):
>   AVX512FP16: Initial support for _Float16 type and AVX512FP16 feature.
>
> liuhongt (1):
>   AVX512FP16: Add HFmode support in libgcc.
>  gcc/common/config/i386/cpuinfo.h              |   2 +
>  gcc/common/config/i386/i386-common.c          |  26 +-
>  gcc/common/config/i386/i386-cpuinfo.h         |   1 +
>  gcc/common/config/i386/i386-isas.h            |   1 +
>  gcc/config.gcc                                |   2 +-
>  gcc/config/i386/avx512fp16intrin.h            |  53
>  gcc/config/i386/cpuid.h                       |   1 +
>  gcc/config/i386/i386-builtin-types.def        |   7 +-
>  gcc/config/i386/i386-builtins.c               |   6 +
>  gcc/config/i386/i386-c.c                      |  20 ++
>  gcc/config/i386/i386-expand.c                 |   8 +
>  gcc/config/i386/i386-isa.def                  |   1 +
>  gcc/config/i386/i386-modes.def                |   1 +
>  gcc/config/i386/i386-options.c                |  10 +-
>  gcc/config/i386/i386.c                        | 158 ++--
>  gcc/config/i386/i386.h                        |  18 +-
>  gcc/config/i386/i386.md                       | 242 +++---
>  gcc/config/i386/i386.opt                      |   4 +
>  gcc/config/i386/immintrin.h                   |   2 +
>  gcc/config/i386/sse.md                        |  42 +--
>  gcc/doc/invoke.texi                           |  10 +-
>  gcc/optabs-query.c                            |   9 +-
>  gcc/testsuite/g++.target/i386/float16-1.C     |   8 +
>  gcc/testsuite/g++.target/i386/float16-2.C     |  14 +
>  gcc/testsuite/g++.target/i386/float16-3.C     |  10 +
>  gcc/testsuite/gcc.target/i386/avx-1.c         |   2 +-
>  gcc/testsuite/gcc.target/i386/avx-2.c         |   2 +-
>  gcc/testsuite/gcc.target/i386/avx512-check.h  |   3 +
>  .../gcc.target/i386/avx512fp16-12a.c          |  21 ++
>  .../gcc.target/i386/avx512fp16-12b.c          |  27 ++
>  gcc/testsuite/gcc.target/i386/float16-1.c     |   8 +
>  gcc/testsuite/gcc.target/i386/float16-2.c     |  14 +
>  gcc/testsuite/gcc.target/i386/float16-3a.c    |  10 +
>  gcc/testsuite/gcc.target/i386/float16-3b.c    |  10 +
>  gcc/testsuite/gcc.target/i386/float16-4a.c    |  10 +
>  gcc/testsuite/gcc.target/i386/float16-4b.c    |  10 +
>  gcc/testsuite/gcc.target/i386/funcspec-56.inc |   2 +
>  gcc/testsuite/gcc.target/i386/pr54855-12.c    |  14 +
>  gcc/testsuite/gcc.target/i386/sse-13.c        |   2 +-
>  gcc/testsuite/gcc.target/i386/sse-14.c        |   2 +-
>  gcc/testsuite/gcc.target/i386/sse-22.c        |   4 +-
>  gcc/testsuite/gcc.target/i386/sse-23.c        |   2 +-
>  gcc/testsuite/lib/target-supports.exp         |  13 +-
>  libgcc/Makefile.in                            |   4 +-
>  libgcc/config.host                            |   6 +-
>  libgcc/config/i386/32/sfp-machine.h           |   1 +
>  libgcc/config/i386/64/sfp-machine.h           |   1 +
>  libgcc/config/i386/64/t-softfp                |   9 +
>  libgcc/config/i386/_divhc3.c                  |   4 +
>  libgcc/config/i386/_mulhc3.c                  |   4 +
>  libgcc/config/i386/sfp-machine.h              |   1 +
>  libgcc/config/i386/t-softfp                   |  20 ++
>  libgcc/configure                              |  33 +++
>  libgcc/configure.ac                           |  13 +
>  libgcc/soft-fp/extendhfxf2.c                  |  53
>  libgcc/soft-fp/truncxfhf2.c                   |  52
>  56 files changed, 907 insertions(+), 106 deletions(-)
>  create mode 100644 gcc/config/i386/avx512fp16intrin.h
>  create mode 100644 gcc/testsuite/g++.target/i386/float16-1.C
>  create mode 100644 gcc/testsuite/g++.target/i386/float16-2.C
>  create mode 100644 gcc/testsuite/g++.target/i386/float16-3.C
>  create mode 100644 gcc/testsuite/gcc.target/i386/avx512fp16-12a.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/avx512fp16-12b.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/float16-1.c
>  create mode 100644 gcc/tests