Re: GLIBC libmvec status
On Fri, Feb 28, 2020 at 05:31:56PM +0100, Jakub Jelinek wrote:
> On Fri, Feb 28, 2020 at 04:23:03PM +0000, GT wrote:
> > Do we want to change the name and title of the document since Segher
> > doesn't believe it is an ABI. My initial suggestion: "POWER Architecture
> > Specification of Scalar Function to Vector Function Mapping".
>
> It is an ABI, similarly like e.g. the C++ Itanium ABI is an ABI, it
> specifies mangling of certain functions and how the function argument
> types and return types are transformed.

It does not say anything about the machine code generated, or about the
binary format generated, other than the naming of symbols.

It is confusing to call this an "ABI": you still need to have an actual
ABI underneath, and this itself is not a "binary interface". In some
other contexts similar things are called "binding", but that is not a
very good name either :-/


Segher
Re: GLIBC libmvec status
‐‐‐ Original Message ‐‐‐
On Monday, March 2, 2020 12:14 PM, Bill Schmidt wrote:

> On 3/2/20 11:10 AM, Tulio Magno Quites Machado Filho wrote:
> > Bill Schmidt writes:
> > > One tiny nit on the document: For the "b" value, let's just say "VSX"
> > > rather than "VSX as defined in PowerISA v2.07." We will plan to only
> > > change values in case a different vector length is defined in future.
> >
> > That change would have more implications: all libmvec functions would
> > have to work on Power ISA v2.06 HW too. But half of the functions do
> > use v2.07 instructions now.
>
> Ah, I see. Well, then language such as "VSX defined at least at the
> level of PowerISA v2.07" would be appropriate. We want to define a
> minimum subset without further implied constraint. (Higher levels can
> be handled with ifunc without needing to reference this in the ABI, as
> previously discussed.)

Changed description of 'b' ISA to exactly the quoted sentence above.

Bert.
Re: GLIBC libmvec status
‐‐‐ Original Message ‐‐‐
On Monday, March 2, 2020 4:59 PM, Jakub Jelinek wrote:

> Indeed, there aren't any yet on the vectorizer side, I thought I've
> implemented it already in the vectorizer but apparently didn't, just the
> omp-simd-clone.c part is implemented (the more important part, as it
> matters for the ABI).

What is in omp-simd-clone.c? What is missing from the vectorizer? My
assumption was that the implementation of vector function masking was
complete, but no test was created to verify the functionality.

> A testcase could be something along the lines of
>
> #pragma omp declare simd
> int foo (int, int);
>
> void
> bar (int *a, int *b, int *c)
> {
>   #pragma omp simd
>   for (int i = 0; i < 1024; i++)
>     {
>       int d = b[i], e = c[i], f;
>       if (b[i] < 20)
>         f = foo (d, e);
>       else
>         f = d + e;
>     }
> }

I thought the test would be more like:

#pragma omp declare simd
int foo (int *x, int *y)
{
  *y = *x + 2;
}

void bar (int *a, float *b, float *c)
{
  #pragma omp simd
  for (int i = 0; i < 1024; i++)
    {
      int d = b[i], e = c[i], f;
      if (i % 2)
        f = foo (d, e);
    }
}

The point being that only items at odd indices are updated. That would
require masking to avoid altering items at even indices.

Bert.
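For reference, a self-consistent variant of the test Bert sketches might
look as follows. This is a minimal sketch, not code from the thread: foo
is assumed to be defined in another translation unit, and the inbranch
clause forces every call to go through the masked clone, so lanes for
even indices must be masked off rather than merely skipped.

  #pragma omp declare simd inbranch
  int foo (int x, int y);

  void
  bar (int *out, int *b, int *c)
  {
    #pragma omp simd
    for (int i = 0; i < 1024; i++)
      if (i % 2)  /* Only odd indices call foo; even elements of out
                     must be left untouched, which requires masking.  */
        out[i] = foo (b[i], c[i]);
  }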
Re: GLIBC libmvec status
On Mon, Mar 02, 2020 at 09:40:59PM +0000, GT wrote:
> Searching openmp.org located document "OpenMP API Examples". The relevant
> example for inbranch/notinbranch shows very simple functions (SIMD.6.c).
> GCC testsuite functions are similarly simple.
> Wouldn't the same effect be achieved by letting GCC inline such functions
> and having the loop autovectorizer handle the resulting code?

If it is defined in headers and inlinable, sure, then you don't need to
mark it any way. The pragmas are mainly for functions that aren't
inlinable for whatever reason.

> > There are various tests that cover it, look e.g. at tests that require
> > vect_simd_clones effective target (e.g. in gcc/testsuite/*/{vect,gomp}/
> > and libgomp/testsuite/*/).
>
> Sorry, I can't identify any test that ensures a masked vector function
> variant produces expected results. I'll check again but I need more help
> here.

Indeed, there aren't any yet on the vectorizer side, I thought I've
implemented it already in the vectorizer but apparently didn't, just the
omp-simd-clone.c part is implemented (the more important part, as it
matters for the ABI). A testcase could be something along the lines of

#pragma omp declare simd
int foo (int, int);

void
bar (int *a, int *b, int *c)
{
  #pragma omp simd
  for (int i = 0; i < 1024; i++)
    {
      int d = b[i], e = c[i], f;
      if (b[i] < 20)
        f = foo (d, e);
      else
        f = d + e;
    }
}

To make this work, one would need to tweak tree-if-conv.c (invent some
way how to represent the conditional calls in the IL during the vect
pass, probably some new internal function) and then handle that in
vectorizable_simd_clone_call.

Jakub
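To make the inlinable/non-inlinable distinction concrete, a minimal
sketch (the function name is hypothetical; the clone name follows the
_ZGV mangling quoted elsewhere in this thread):

  /* foo.h -- the prototype is visible to callers, but foo is defined in
     another translation unit, so it cannot be inlined.  The pragma tells
     GCC that vector clones such as _ZGVbN2v_foo exist and may be called
     from simd loops.  */
  #pragma omp declare simd notinbranch
  double foo (double x);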
Re: GLIBC libmvec status
‐‐‐ Original Message ‐‐‐
On Monday, March 2, 2020 3:31 PM, Jakub Jelinek wrote:

> On Mon, Mar 02, 2020 at 08:20:01PM +0000, GT wrote:
> > Which raises the question: what use-case motivated allowing the
> > compiler to auto-vectorize user defined functions? From having
> > manually created vector
>
> The feature made it into the OpenMP standard (already OpenMP 4.0) and so
> got implemented as part of the OpenMP 4.0 implementation.
>
> > versions of sin, cos and other libmvec functions, I'm wondering how
> > GCC is able to autovectorize a non-trivial user defined function.

Searching openmp.org located document "OpenMP API Examples". The relevant
example for inbranch/notinbranch shows very simple functions (SIMD.6.c).
GCC testsuite functions are similarly simple.

Wouldn't the same effect be achieved by letting GCC inline such functions
and having the loop autovectorizer handle the resulting code?

> There are various tests that cover it, look e.g. at tests that require
> vect_simd_clones effective target (e.g. in gcc/testsuite/*/{vect,gomp}/
> and libgomp/testsuite/*/).

Sorry, I can't identify any test that ensures a masked vector function
variant produces expected results. I'll check again but I need more help
here.

Bert.
Re: GLIBC libmvec status
‐‐‐ Original Message ‐‐‐
On Thursday, February 27, 2020 9:52 AM, Jakub Jelinek wrote:

> On Thu, Feb 27, 2020 at 08:47:19AM -0600, Bill Schmidt wrote:
> > But is this actually a good idea? It seems to me this will generate
> > lousy code in the absence of hardware support. Won't we be better off
> > warning and ignoring the directive, leaving the code in scalar form?
>
> Depends on the exact code, I think sometimes it will be just fine and
> will allow vectorizing something that really couldn't be otherwise.
> Isn't it better to leave it for the user to decide?
> They can always ask for it not to be generated (add notinbranch) if it
> isn't worthwhile.

I'm trying to understand what the x86_64 implementation does w.r.t.
masked versions of user defined functions. I haven't found any test
under directory testsuite which verifies that compiler-generated versions
(from inbranch being specified) produce expected results.

Which raises the question: what use-case motivated allowing the compiler
to auto-vectorize user defined functions? From having manually created
vector versions of sin, cos and other libmvec functions, I'm wondering
how GCC is able to autovectorize a non-trivial user defined function.

Any pointers to relevant tests and documentation will be really
appreciated.

Thanks.
Bert.
Re: GLIBC libmvec status
On Mon, Mar 02, 2020 at 08:20:01PM +0000, GT wrote:
> Which raises the question: what use-case motivated allowing the compiler
> to auto-vectorize user defined functions? From having manually created
> vector

The feature made it into the OpenMP standard (already OpenMP 4.0) and so
got implemented as part of the OpenMP 4.0 implementation.

> versions of sin, cos and other libmvec functions, I'm wondering how GCC
> is able to autovectorize a non-trivial user defined function.

There are various tests that cover it, look e.g. at tests that require
vect_simd_clones effective target (e.g. in gcc/testsuite/*/{vect,gomp}/
and libgomp/testsuite/*/).

In the OpenMP standard, see e.g.
https://www.openmp.org/spec-html/5.0/openmpsu42.html#x65-1390002.9.3

Jakub
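The tests Jakub refers to follow roughly this shape (a sketch in the
style of the vect_simd_clones tests, not an actual testsuite file):

  /* { dg-require-effective-target vect_simd_clones } */
  /* { dg-additional-options "-fopenmp-simd" } */

  #pragma omp declare simd notinbranch
  __attribute__ ((noinline)) int
  addone (int x)
  {
    return x + 1;
  }

  int a[1024];

  void
  doit (void)
  {
  #pragma omp simd
    for (int i = 0; i < 1024; i++)
      a[i] = addone (a[i]);
  }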
Re: GLIBC libmvec status
On 3/2/20 11:10 AM, Tulio Magno Quites Machado Filho wrote:
> Bill Schmidt writes:
> > One tiny nit on the document: For the "b" value, let's just say "VSX"
> > rather than "VSX as defined in PowerISA v2.07." We will plan to only
> > change values in case a different vector length is defined in future.
>
> That change would have more implications: all libmvec functions would
> have to work on Power ISA v2.06 HW too. But half of the functions do
> use v2.07 instructions now.

Ah, I see. Well, then language such as "VSX defined at least at the level
of PowerISA v2.07" would be appropriate. We want to define a minimum
subset without further implied constraint. (Higher levels can be handled
with ifunc without needing to reference this in the ABI, as previously
discussed.)

Thanks,
Bill
Re: GLIBC libmvec status
Bill Schmidt writes:

> One tiny nit on the document: For the "b" value, let's just say "VSX"
> rather than "VSX as defined in PowerISA v2.07." We will plan to only
> change values in case a different vector length is defined in future.

That change would have more implications: all libmvec functions would
have to work on Power ISA v2.06 HW too. But half of the functions do use
v2.07 instructions now.

--
Tulio Magno
Re: GLIBC libmvec status
On 2/28/20 10:31 AM, Jakub Jelinek wrote:
> On Fri, Feb 28, 2020 at 04:23:03PM +0000, GT wrote:
> > Do we want to change the name and title of the document since Segher
> > doesn't believe it is an ABI. My initial suggestion: "POWER
> > Architecture Specification of Scalar Function to Vector Function
> > Mapping".
>
> It is an ABI, similarly like e.g. the C++ Itanium ABI is an ABI, it
> specifies mangling of certain functions and how the function argument
> types and return types are transformed.

Agreed, let's leave that as is.

One tiny nit on the document: For the "b" value, let's just say "VSX"
rather than "VSX as defined in PowerISA v2.07." We will plan to only
change values in case a different vector length is defined in future.

Looks good otherwise!

Thanks,
Bill

> Jakub
Re: GLIBC libmvec status
‐‐‐ Original Message ‐‐‐
On Thursday, February 27, 2020 4:32 PM, Bill Schmidt wrote:

> On 2/27/20 2:21 PM, Bill Schmidt wrote:
> > On 2/27/20 12:48 PM, GT wrote:
> > > Done.
> > > The updated document is at:
> > > https://sourceware.org/glibc/wiki/HomePage?action=AttachFile&do=view&target=powerarchvectfuncabi.html
>
> Looks good. Can you please also remove the 'c' ABI from the mangling,
> as earlier agreed?

1. Reference to 'c' ABI deleted.

2. In final paragraph of section "Vector Function ABI Overview", removed
reference to ELFv2 specification. Replaced with reference to OpenPOWER
IBM Power ISA v2.07.

3. Cleaned up display of angle brackets in section "Vector Function Name
Mangling".

Question: Do we want to change the name and title of the document since
Segher doesn't believe it is an ABI. My initial suggestion: "POWER
Architecture Specification of Scalar Function to Vector Function
Mapping".

Bert.
Re: GLIBC libmvec status
On Fri, Feb 28, 2020 at 04:23:03PM +0000, GT wrote:
> Do we want to change the name and title of the document since Segher
> doesn't believe it is an ABI. My initial suggestion: "POWER Architecture
> Specification of Scalar Function to Vector Function Mapping".

It is an ABI, similarly like e.g. the C++ Itanium ABI is an ABI, it
specifies mangling of certain functions and how the function argument
types and return types are transformed.

Jakub
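For readers unfamiliar with the mangling under discussion: vector
variants are named _ZGV<isa><mask><vlen><parameter-kinds>_<scalar-name>.
Decomposing one of the names that appears later in this thread (an
illustrative reading, not normative text from the document):

  _ZGVbN2v_sincos
    b  ISA class: VSX (at least at the PowerISA v2.07 level)
    N  unmasked ("notinbranch"; a masked variant would use M)
    2  vector length: two doubles per 128-bit VSX register
    v  one parameter of kind "vector"

In _ZGVbN8vl8l8_sincos, also quoted below, the two l8 entries mark linear
pointer parameters with a step of 8 bytes per iteration.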
Re: GLIBC libmvec status
On 2/27/20 2:21 PM, Bill Schmidt wrote:
> On 2/27/20 12:48 PM, GT wrote:
> > Done.
> > The updated document is at:
> > https://sourceware.org/glibc/wiki/HomePage?action=AttachFile&do=view&target=powerarchvectfuncabi.html

Looks good. Can you please also remove the 'c' ABI from the mangling, as
earlier agreed?

Thanks!
Bill
Re: GLIBC libmvec status
On 2/27/20 12:48 PM, GT wrote:
> ‐‐‐ Original Message ‐‐‐
> On Thursday, February 27, 2020 9:26 AM, Bill Schmidt wrote:
>
> > Upon reflection, I agree. Bert, we need to make changes to the
> > document to reflect this:
> >
> > (1) "Calling convention" should refer to ELFv1 for powerpc64 and ELFv2
> > for powerpc64le.
>
> Done. Have provided names and links to respective ABI documents but no
> longer explicitly refer to ELF version.
>
> > (2) "Vector Length" should remove bullet 3, strike the word
> > "nonhomogeneous" in bullet 4, and strike the parenthetical clause in
> > bullet 4.
> > (3) "Ordering of Vector Arguments" should remove the example involving
> > homogeneous aggregates.
>
> Done.
>
> > It also occurs to me that for bullets 4 and 5 in "Vector Length", the
> > CDT should be long long, not int, since we pass aggregates in pieces
> > in 64-bit registers and/or chunks of memory.
>
> That determination of Vector Length is common for all architectures and
> is implemented in function simd_clone_compute_base_data_type. If we do
> really need PPC64 to be different, we'll have to allow the function to
> be replaced by architecture-specific versions. Before we do that, do
> you have an example of code which ends up with incorrect vectorization
> with the existing CDT of int?

No, and I'll withdraw the suggestion. It seems rather arbitrary in any
event.

Thanks for the updates!
Bill

> > Other small bugs:
> > - Bullet 4 says "the CDT determine by a) or b) above", but the
> >   referents should be "(1) or (2)" instead.
> > - First line of "Compiler generated variants of vector functions" has
> >   a typo ("umasked").
>
> Done.
>
> The updated document is at:
> https://sourceware.org/glibc/wiki/HomePage?action=AttachFile&do=view&target=powerarchvectfuncabi.html
Re: GLIBC libmvec status
‐‐‐ Original Message ‐‐‐
On Thursday, February 27, 2020 9:26 AM, Bill Schmidt wrote:

> Upon reflection, I agree. Bert, we need to make changes to the document
> to reflect this:
>
> (1) "Calling convention" should refer to ELFv1 for powerpc64 and ELFv2
> for powerpc64le.

Done. Have provided names and links to respective ABI documents but no
longer explicitly refer to ELF version.

> (2) "Vector Length" should remove bullet 3, strike the word
> "nonhomogeneous" in bullet 4, and strike the parenthetical clause in
> bullet 4.
> (3) "Ordering of Vector Arguments" should remove the example involving
> homogeneous aggregates.

Done.

> It also occurs to me that for bullets 4 and 5 in "Vector Length", the
> CDT should be long long, not int, since we pass aggregates in pieces in
> 64-bit registers and/or chunks of memory.

That determination of Vector Length is common for all architectures and
is implemented in function simd_clone_compute_base_data_type. If we do
really need PPC64 to be different, we'll have to allow the function to be
replaced by architecture-specific versions. Before we do that, do you
have an example of code which ends up with incorrect vectorization with
the existing CDT of int?

> Other small bugs:
> - Bullet 4 says "the CDT determine by a) or b) above", but the referents
>   should be "(1) or (2)" instead.
> - First line of "Compiler generated variants of vector functions" has a
>   typo ("umasked").

Done.

The updated document is at:
https://sourceware.org/glibc/wiki/HomePage?action=AttachFile&do=view&target=powerarchvectfuncabi.html
Re: GLIBC libmvec status
On 2/27/20 9:30 AM, Jakub Jelinek wrote:
> On Thu, Feb 27, 2020 at 09:19:25AM -0600, Bill Schmidt wrote:
> > On 2/27/20 8:52 AM, Jakub Jelinek wrote:
> > > On Thu, Feb 27, 2020 at 08:47:19AM -0600, Bill Schmidt wrote:
> > > > But is this actually a good idea? It seems to me this will
> > > > generate lousy code in the absence of hardware support. Won't we
> > > > be better off warning and ignoring the directive, leaving the code
> > > > in scalar form?
> > >
> > > Depends on the exact code, I think sometimes it will be just fine
> > > and will allow vectorizing something that really couldn't be
> > > otherwise.
> > > Isn't it better to leave it for the user to decide?
> > > They can always ask for it not to be generated (add notinbranch) if
> > > it isn't worthwhile.
> >
> > You need a high ratio of unguarded code to guarded code in order to
> > pay for all those vector extract and reconstruct operations. Sure,
> > some code will be fine, but a lot of code will be lousy. This will be
> > particularly true on older hardware with a less exhaustive set of
> > vector operations.
>
> Why? E.g. for integral code other than division or memory loads/stores
> where nothing will really trap, you can just perform it unguarded. Just
> use whatever the vectorizer does right now for conditional code, and if
> that isn't as efficient as it could be given a particular HW/ISA, try to
> improve it?

If that's how the vectorizer is working today, then my concerns are
certainly lessened. It's been a while since I've seen how the vectorizer
and if-conversion interact, so my perspective is probably outdated. We'll
take a look at it.

Thanks for the discussion!
Bill

> I really don't see how is it different say from SSE2 on x86 or even AVX.
>
> Jakub
Re: GLIBC libmvec status
On Thu, Feb 27, 2020 at 09:19:25AM -0600, Bill Schmidt wrote:
> On 2/27/20 8:52 AM, Jakub Jelinek wrote:
> > On Thu, Feb 27, 2020 at 08:47:19AM -0600, Bill Schmidt wrote:
> > > But is this actually a good idea? It seems to me this will generate
> > > lousy code in the absence of hardware support. Won't we be better
> > > off warning and ignoring the directive, leaving the code in scalar
> > > form?
> >
> > Depends on the exact code, I think sometimes it will be just fine and
> > will allow vectorizing something that really couldn't be otherwise.
> > Isn't it better to leave it for the user to decide?
> > They can always ask for it not to be generated (add notinbranch) if it
> > isn't worthwhile.
>
> You need a high ratio of unguarded code to guarded code in order to pay
> for all those vector extract and reconstruct operations. Sure, some
> code will be fine, but a lot of code will be lousy. This will be
> particularly true on older hardware with a less exhaustive set of
> vector operations.

Why? E.g. for integral code other than division or memory loads/stores
where nothing will really trap, you can just perform it unguarded. Just
use whatever the vectorizer does right now for conditional code, and if
that isn't as efficient as it could be given a particular HW/ISA, try to
improve it?

I really don't see how is it different say from SSE2 on x86 or even AVX.

Jakub
Re: GLIBC libmvec status
On 2/27/20 8:52 AM, Jakub Jelinek wrote:
> On Thu, Feb 27, 2020 at 08:47:19AM -0600, Bill Schmidt wrote:
> > But is this actually a good idea? It seems to me this will generate
> > lousy code in the absence of hardware support. Won't we be better off
> > warning and ignoring the directive, leaving the code in scalar form?
>
> Depends on the exact code, I think sometimes it will be just fine and
> will allow vectorizing something that really couldn't be otherwise.
> Isn't it better to leave it for the user to decide?
> They can always ask for it not to be generated (add notinbranch) if it
> isn't worthwhile.

You need a high ratio of unguarded code to guarded code in order to pay
for all those vector extract and reconstruct operations. Sure, some code
will be fine, but a lot of code will be lousy. This will be particularly
true on older hardware with a less exhaustive set of vector operations.

In the lousy-code case, my concern is that the user won't be savvy enough
to understand they should add notinbranch. They'll just notice that their
code runs badly on Power and either complain (good, then we can explain
it) or abandon porting existing code to Power (very bad, and we may never
know). I don't like the downside, and the upside is quite unpredictable.

Bill

> Jakub
Re: GLIBC libmvec status
On Thu, Feb 27, 2020 at 08:47:19AM -0600, Bill Schmidt wrote:
> But is this actually a good idea? It seems to me this will generate
> lousy code in the absence of hardware support. Won't we be better off
> warning and ignoring the directive, leaving the code in scalar form?

Depends on the exact code, I think sometimes it will be just fine and
will allow vectorizing something that really couldn't be otherwise.
Isn't it better to leave it for the user to decide?
They can always ask for it not to be generated (add notinbranch) if it
isn't worthwhile.

Jakub
Re: GLIBC libmvec status
On 2/26/20 8:31 AM, Jakub Jelinek wrote:
> On Wed, Feb 26, 2020 at 07:55:53AM -0600, Bill Schmidt wrote:
> > The hope is that we can create a vectorized version that returns
> > values in registers rather than the by-ref parameters, and add code to
> > GCC to copy things around correctly following the call. Ideally the
> > signature of the vectorized version would be sth like
> >
> > struct retval {vector double, vector double};
> > retval vecsincos (vector double);
> >
> > In the typical case where calls to sincos are of the form
> >
> > sincos (val[i], &sinval[i], &cosval[i]);
> >
> > this would allow us to only store the values in the caller upon
> > return, rather than store them in the callee and potentially reload
> > them immediately in the caller. On some Power CPUs, the latter
> > behavior can result in somewhat costly stalls if the consecutive
> > accesses hit a timing window.
>
> But can't you do
> #pragma omp declare simd linear(sinp, cosp)
> void sincos (double x, double *sinp, double *cosp);
> ?
> That is something the vectorizer code could handle and for
> for (int i = 0; i < 1024; i++)
>   sincos (val[i], &sinval[i], &cosval[i]);
> just vectorize it as
> for (int i = 0; i < 1024; i += vf)
>   _ZGVbN8vl8l8_sincos (*(vector double *)&val[i], &sinval[i], &cosval[i]);
> Anything else will need specialized code to handle sincos specially in
> the vectorizer.

After reading all the discussion on this thread, yes, I agree for now.
It will be good for everybody if we can get the vectorized cexpi sorted
out at some point, which will give us a superior interface.

> > If you feel it isn't possible to do this, then we can abandon it.
> > Right now my understanding is that GCC doesn't vectorize calls to
> > sincos yet for any targets, so it would be moot except that we really
> > should define what happens for the future.
> >
> > This calling convention would also be useful in the future for
> > vectorizing functions that return complex values either by value or
> > by reference.
>
> Only by value, you really don't know what the code does if something is
> passed by reference, whether it is read, written into, or both etc.
> And for _Complex {float,double}, e.g. the Intel ABI already specifies
> how to pass them, just GCC isn't able to do that right now.

Per the fork of the thread with Segher, I've cried uncle on the specifics
of the calling convention. :)

> > Well, as a matter of practicality, we don't have any of that
> > implemented in the rs6000 back end, and we don't have any free
> > resources to do that in GCC 11. Is there any documentation about what
> > needs to be done to support this? I've always been under the
> > impression that vectorizing for masking when there isn't any hardware
> > support is a losing proposition, so we've not investigated it.
>
> You don't need to do pretty much anything, except set
> clonei->mask_mode = VOIDmode, I think the generic code should handle
> everything beyond that, in particular add the mask argument and use it
> both on the caller side and on the expansion of the to be vectorized
> clone.

But is this actually a good idea? It seems to me this will generate lousy
code in the absence of hardware support. Won't we be better off warning
and ignoring the directive, leaving the code in scalar form?

If and when we have hardware support for vector masking, I'll be happy to
remove this restriction, but I need more convincing to do it now.

Thanks,
Bill

> Jakub
Re: GLIBC libmvec status
On 2/27/20 4:52 AM, Segher Boessenkool wrote:
> On Tue, Feb 25, 2020 at 07:43:09PM -0600, Bill Schmidt wrote:
> > The reason that homogeneous aggregates matter (at least somewhat) is
> > that the ABI ^H^H^H^HAPI requires establishing a calling convention
> > and a name-mangling formula that includes the length of parameters and
> > return values. Since ELFv2 and ELFv1 do not have the same calling
> > convention, and ELFv2 has a superior one, we chose to use ELFv2's
> > calling convention and make use of homogeneous aggregates for return
> > values in registers for the case of vectorized sincos.
> >
> > Please look at the document to see the constraints we're under to fit
> > into the different OpenMP clauses and attributes. It seems to me that
> > we can only define this for both powerpc64 and powerpc64le by
> > establishing two different calling conventions, which provides two
> > different vector length calculations for the sincos return value, and
> > therefore requires two different function implementations with
> > different mangled names. (Either that, or we cripple vectorized
> > sincos by requiring it to return values through memory.)
>
> I still don't see it. For all ABIs the length of the arguments and
> return value is the same, and homogeneous aggregates doesn't factor in
> at all; that is just a detail whether something is passed in registers
> or memory (as we have with many other ABIs as well, fwiw). So why make
> this part of the mangling rules?
>
> It is perfectly fine to design this with ELFv2 in mind, of course, but
> making a dependency on the (current!) (very complex!) ELFv2 rules for
> absolutely no reason at all is a mistake, in my opinion.

Upon reflection, I agree. Bert, we need to make changes to the document
to reflect this:

(1) "Calling convention" should refer to ELFv1 for powerpc64 and ELFv2
for powerpc64le.
(2) "Vector Length" should remove bullet 3, strike the word
"nonhomogeneous" in bullet 4, and strike the parenthetical clause in
bullet 4.
(3) "Ordering of Vector Arguments" should remove the example involving
homogeneous aggregates.

It also occurs to me that for bullets 4 and 5 in "Vector Length", the CDT
should be long long, not int, since we pass aggregates in pieces in
64-bit registers and/or chunks of memory.

Other small bugs:
- Bullet 4 says "the CDT determine by a) or b) above", but the referents
  should be "(1) or (2)" instead.
- First line of "Compiler generated variants of vector functions" has a
  typo ("umasked").

Segher, thanks for smacking my recalcitrant head until it understands...

Thanks,
Bill

> Segher
Re: GLIBC libmvec status
On Thu, Feb 27, 2020 at 11:56:49AM +0100, Richard Biener wrote:
> > > This calling convention would also be useful in the future for
> > > vectorizing functions that return complex values either by value or
> > > by reference.
> >
> > Only by value, you really don't know what the code does if something
> > is passed by reference, whether it is read, written into, or both etc.
> > And for _Complex {float,double}, e.g. the Intel ABI already specifies
> > how to pass them, just GCC isn't able to do that right now.
>
> Ah, ok. So what's missing is the standard function cexpi both GCC and
> libmvec can use.

That, plus adjust omp-simd-clone.c and the backends so that they do
support the complex modes and essentially transform those into
passing/returning of either vector of the complex elts with twice as many
subparts, or twice as many vectors, like e.g. the Intel ABI specifies.
E.g. for return type adjustment, right now we have:

  t = TREE_TYPE (TREE_TYPE (fndecl));
  if (INTEGRAL_TYPE_P (t) || POINTER_TYPE_P (t))
    veclen = node->simdclone->vecsize_int;
  else
    veclen = node->simdclone->vecsize_float;
  veclen /= GET_MODE_BITSIZE (SCALAR_TYPE_MODE (t));
  if (veclen > node->simdclone->simdlen)
    veclen = node->simdclone->simdlen;
  if (POINTER_TYPE_P (t))
    t = pointer_sized_int_node;
  if (veclen == node->simdclone->simdlen)
    t = build_vector_type (t, node->simdclone->simdlen);
  else
    {
      t = build_vector_type (t, veclen);
      t = build_array_type_nelts (t, node->simdclone->simdlen / veclen);
    }

and we'd need to deal with the complex types accordingly. And of course
then to teach the vectorizer.

The Intel ABI e.g. for SSE2 (their 'x' letter, which roughly matches our
'b' letter) has:

                  sizeof  VLEN=2    VLEN=4    VLEN=8    VLEN=16
  float           4       1*MS128   1*MS128   2*MS128   4*MS128
  double          8       1*MD128   2*MD128   4*MD128   8*MD128
  float complex   8       1*MS128   2*MS128   4*MS128   8*MS128
  double complex  16      2*MD128   4*MD128   8*MD128   16*MD128

where MS128 is __m128 and MD128 is __m128d, i.e.
float __attribute__((vector_size (16))) and
double __attribute__((vector_size (16))).

I'll need to check ICC on godbolt how they actually pass the complex,
whether it is real0 imag0 real1 imag1 real2 imag2 real3 imag3 or
real0 real1 real2 real3 imag0 imag1 imag2 imag3.

Jakub
Re: GLIBC libmvec status
On Wed, Feb 26, 2020 at 3:31 PM Jakub Jelinek wrote:
> On Wed, Feb 26, 2020 at 07:55:53AM -0600, Bill Schmidt wrote:
> > The hope is that we can create a vectorized version that returns
> > values in registers rather than the by-ref parameters, and add code to
> > GCC to copy things around correctly following the call. Ideally the
> > signature of the vectorized version would be sth like
> >
> > struct retval {vector double, vector double};
> > retval vecsincos (vector double);
> >
> > In the typical case where calls to sincos are of the form
> >
> > sincos (val[i], &sinval[i], &cosval[i]);
> >
> > this would allow us to only store the values in the caller upon
> > return, rather than store them in the callee and potentially reload
> > them immediately in the caller. On some Power CPUs, the latter
> > behavior can result in somewhat costly stalls if the consecutive
> > accesses hit a timing window.
>
> But can't you do
> #pragma omp declare simd linear(sinp, cosp)
> void sincos (double x, double *sinp, double *cosp);
> ?
> That is something the vectorizer code could handle and for
> for (int i = 0; i < 1024; i++)
>   sincos (val[i], &sinval[i], &cosval[i]);
> just vectorize it as
> for (int i = 0; i < 1024; i += vf)
>   _ZGVbN8vl8l8_sincos (*(vector double *)&val[i], &sinval[i], &cosval[i]);
> Anything else will need specialized code to handle sincos specially in
> the vectorizer.

I guess we'll need special code in the vectorizer anyway because in
GIMPLE we'll have

for (int i = 0; i < 1024; i++)
  {
    _Complex double tem = __builtin_cexpi (val[i]);
    sinval[i] = __real tem;
    cosval[i] = __imag tem;
  }

we'd have to promote tem back to memory and the call to
sincos (val[i], &__real tem, &__imag tem) virtually or explicitly. The
vectorizer is currently not happy seeing _Complex (but dataref analysis
would not be happy to see sincos). So we do need changes to the
vectorizer.

> > If you feel it isn't possible to do this, then we can abandon it.
> > Right now my understanding is that GCC doesn't vectorize calls to
> > sincos yet for any targets, so it would be moot except that we really
> > should define what happens for the future.
> >
> > This calling convention would also be useful in the future for
> > vectorizing functions that return complex values either by value or
> > by reference.
>
> Only by value, you really don't know what the code does if something is
> passed by reference, whether it is read, written into, or both etc.
> And for _Complex {float,double}, e.g. the Intel ABI already specifies
> how to pass them, just GCC isn't able to do that right now.

Ah, ok. So what's missing is the standard function cexpi both GCC and
libmvec can use.

> > Well, as a matter of practicality, we don't have any of that
> > implemented in the rs6000 back end, and we don't have any free
> > resources to do that in GCC 11. Is there any documentation about what
> > needs to be done to support this? I've always been under the
> > impression that vectorizing for masking when there isn't any hardware
> > support is a losing proposition, so we've not investigated it.
>
> You don't need to do pretty much anything, except set
> clonei->mask_mode = VOIDmode, I think the generic code should handle
> everything beyond that, in particular add the mask argument and use it
> both on the caller side and on the expansion of the to be vectorized
> clone.
>
> Jakub
Re: GLIBC libmvec status
On Tue, Feb 25, 2020 at 07:43:09PM -0600, Bill Schmidt wrote:
> The reason that homogeneous aggregates matter (at least somewhat) is
> that the ABI ^H^H^H^HAPI requires establishing a calling convention and
> a name-mangling formula that includes the length of parameters and
> return values. Since ELFv2 and ELFv1 do not have the same calling
> convention, and ELFv2 has a superior one, we chose to use ELFv2's
> calling convention and make use of homogeneous aggregates for return
> values in registers for the case of vectorized sincos.
>
> Please look at the document to see the constraints we're under to fit
> into the different OpenMP clauses and attributes. It seems to me that
> we can only define this for both powerpc64 and powerpc64le by
> establishing two different calling conventions, which provides two
> different vector length calculations for the sincos return value, and
> therefore requires two different function implementations with
> different mangled names. (Either that, or we cripple vectorized sincos
> by requiring it to return values through memory.)

I still don't see it. For all ABIs the length of the arguments and
return value is the same, and homogeneous aggregates doesn't factor in at
all; that is just a detail whether something is passed in registers or
memory (as we have with many other ABIs as well, fwiw). So why make this
part of the mangling rules?

It is perfectly fine to design this with ELFv2 in mind, of course, but
making a dependency on the (current!) (very complex!) ELFv2 rules for
absolutely no reason at all is a mistake, in my opinion.


Segher
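For reference, the kind of aggregate at issue (a sketch, not text from
the document): under ELFv2, a struct whose members are all the same
vector type is a homogeneous aggregate and, up to eight members, is
returned in vector registers; under ELFv1 the same struct is returned
through memory. Either way the struct itself is legal on both ABIs,
which is Segher's point.

  /* Homogeneous aggregate of two like vector members: returned in
     vector registers on ELFv2, via memory on ELFv1.  */
  struct retval { vector double sinv; vector double cosv; };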
Re: GLIBC libmvec status
Hi!

On Tue, Feb 25, 2020 at 07:43:09PM -0600, Bill Schmidt wrote:
> On 2/25/20 12:45 PM, Segher Boessenkool wrote:
> > I don't agree we should have a new ABI, and an API (which this *is* as
> > far as I can tell) works fine on *any* ABI. Homogeneous aggregates has
> > nothing to do with anything either.
> >
> > It is fine to only *support* powerpc64le-linux, sure. But don't
> > fragment the implementation, it only hurts, never helps -- we will end
> > up having to support ten or twenty different compilers, instead of one
> > compiler with a few (mostly) orthogonal variations. And yes, we should
> > also test everything everywhere, whenever reasonable.
>
> Thanks, Segher. Let me ask for some clarification here on how you'd
> like us to proceed.
>
> The reason that homogeneous aggregates matter (at least somewhat) is
> that the ABI ^H^H^H^HAPI requires establishing a calling convention and
> a name-mangling formula that includes the length of parameters and
> return values.

I don't see how that matters? A function that returns a struct works
fine on any implementation. Sure, for small structs it can be returned
in just registers on ELFv2, while some other ABIs push stuff through
memory. But that is just an implementation detail of the *actual* ABI.

It is good to know about this when designing the mvec stuff, sure, but
this will *work* on *any* ABI.

> Since ELFv2 and ELFv1 do not have the same calling convention, and ELFv2
> has a superior one, we chose to use ELFv2's calling convention and make
> use of homogeneous aggregates for return values in registers for the
> case of vectorized sincos.

I don't understand this. You designed this API with the ELFv2 ABI in
mind, sure, but that does not magically make homogeneous aggregates
appear on other ABIs, nor do you need them there.

I'll read the docs again, someone is missing something here, and it
probably is me ;-)


Segher
Re: GLIBC libmvec status
On Wed, Feb 26, 2020 at 07:55:53AM -0600, Bill Schmidt wrote:
> The hope is that we can create a vectorized version that returns values
> in registers rather than the by-ref parameters, and add code to GCC to
> copy things around correctly following the call. Ideally the signature
> of the vectorized version would be sth like
>
> struct retval {vector double, vector double};
> retval vecsincos (vector double);
>
> In the typical case where calls to sincos are of the form
>
> sincos (val[i], &sinval[i], &cosval[i]);
>
> this would allow us to only store the values in the caller upon return,
> rather than store them in the callee and potentially reload them
> immediately in the caller. On some Power CPUs, the latter behavior can
> result in somewhat costly stalls if the consecutive accesses hit a
> timing window.

But can't you do

#pragma omp declare simd linear(sinp, cosp)
void sincos (double x, double *sinp, double *cosp);

?
That is something the vectorizer code could handle and for

for (int i = 0; i < 1024; i++)
  sincos (val[i], &sinval[i], &cosval[i]);

just vectorize it as

for (int i = 0; i < 1024; i += vf)
  _ZGVbN8vl8l8_sincos (*(vector double *)&val[i], &sinval[i], &cosval[i]);

Anything else will need specialized code to handle sincos specially in
the vectorizer.

> If you feel it isn't possible to do this, then we can abandon it. Right
> now my understanding is that GCC doesn't vectorize calls to sincos yet
> for any targets, so it would be moot except that we really should
> define what happens for the future.
>
> This calling convention would also be useful in the future for
> vectorizing functions that return complex values either by value or by
> reference.

Only by value, you really don't know what the code does if something is
passed by reference, whether it is read, written into, or both etc.
And for _Complex {float,double}, e.g. the Intel ABI already specifies how
to pass them, just GCC isn't able to do that right now.

> Well, as a matter of practicality, we don't have any of that
> implemented in the rs6000 back end, and we don't have any free
> resources to do that in GCC 11. Is there any documentation about what
> needs to be done to support this? I've always been under the impression
> that vectorizing for masking when there isn't any hardware support is a
> losing proposition, so we've not investigated it.

You don't need to do pretty much anything, except set
clonei->mask_mode = VOIDmode, I think the generic code should handle
everything beyond that, in particular add the mask argument and use it
both on the caller side and on the expansion of the to be vectorized
clone.

Jakub
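What "set clonei->mask_mode = VOIDmode" might look like in practice,
sketched as a hypothetical fragment of the rs6000 implementation of
TARGET_SIMD_CLONE_COMPUTE_VECSIZE_AND_SIMDLEN. The field names follow
struct cgraph_simd_clone in gcc/cgraph.h; everything else here is an
assumption, not committed code.

  static int
  rs6000_simd_clone_compute_vecsize_and_simdlen (struct cgraph_node *node,
                                                 struct cgraph_simd_clone *clonei,
                                                 tree base_type, int num)
  {
    /* VSX registers are 128 bits wide for both int and float elements.  */
    clonei->vecsize_mangle = 'b';
    clonei->vecsize_int = 128;
    clonei->vecsize_float = 128;
    clonei->simdlen = 128 / GET_MODE_BITSIZE (SCALAR_TYPE_MODE (base_type));
    /* No mask registers: VOIDmode tells the generic omp-simd-clone.c
       code to pass the mask as an ordinary vector argument and to emit
       the blending itself.  */
    clonei->mask_mode = VOIDmode;
    return 1;  /* Number of clone variants produced.  */
  }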
Re: GLIBC libmvec status
On 2/26/20 2:18 AM, Jakub Jelinek wrote:
> On Tue, Feb 25, 2020 at 07:43:09PM -0600, Bill Schmidt wrote:
> > The reason that homogeneous aggregates matter (at least somewhat) is
> > that the ABI ^H^H^H^HAPI requires establishing a calling convention
> > and a name-mangling formula that includes the length of parameters
> > and return values. Since ELFv2 and ELFv1 do not have the same calling
> > convention, and ELFv2 has a superior one, we chose to use ELFv2's
> > calling convention and make use of homogeneous aggregates for return
> > values in registers for the case of vectorized sincos.
>
> Can you please explain how do you want to pass the
> void sincos (double, double *, double *);
> arguments? I must say it isn't entirely clear from the document.
> You talk there about double[2], but sincos certainly doesn't have such
> an argument.

The hope is that we can create a vectorized version that returns values
in registers rather than the by-ref parameters, and add code to GCC to
copy things around correctly following the call. Ideally the signature
of the vectorized version would be sth like

struct retval {vector double, vector double};
retval vecsincos (vector double);

In the typical case where calls to sincos are of the form

sincos (val[i], &sinval[i], &cosval[i]);

this would allow us to only store the values in the caller upon return,
rather than store them in the callee and potentially reload them
immediately in the caller. On some Power CPUs, the latter behavior can
result in somewhat costly stalls if the consecutive accesses hit a timing
window.

If you feel it isn't possible to do this, then we can abandon it. Right
now my understanding is that GCC doesn't vectorize calls to sincos yet
for any targets, so it would be moot except that we really should define
what happens for the future.

This calling convention would also be useful in the future for
vectorizing functions that return complex values either by value or by
reference.

> Also, I'd say ignoring the masked variants is a mistake, are you going
> to warn any time the user uses inbranch or even doesn't specify
> notinbranch?
> The masking can be implemented even without highly specialized
> instructions, e.g. on x86 only AVX512F has full masking support, for
> older ISAs all that is there is conditional store or e.g. for integral
> operations that can't trap/raise exceptions just doing blend-like
> operations (or even and/or) is all that is needed; just let the
> vectorizer do its job.

Well, as a matter of practicality, we don't have any of that implemented
in the rs6000 back end, and we don't have any free resources to do that
in GCC 11. Is there any documentation about what needs to be done to
support this? I've always been under the impression that vectorizing for
masking when there isn't any hardware support is a losing proposition, so
we've not investigated it.

Thanks,
Bill

> Even if you don't want it for libmvec, just use
> __attribute__((simd ("notinbranch"))) for those, but allow the user to
> use it where it makes sense.
>
> Jakub
Re: GLIBC libmvec status
On Wed, Feb 26, 2020 at 2:46 PM Jakub Jelinek wrote:
> On Wed, Feb 26, 2020 at 10:32:17AM -0300, Tulio Magno Quites Machado
> Filho wrote:
> > Jakub Jelinek writes:
> > > Can you please explain how do you want to pass the
> > > void sincos (double, double *, double *);
> > > arguments? I must say it isn't entirely clear from the document.
> > > You talk there about double[2], but sincos certainly doesn't have
> > > such an argument.
> >
> > The plan [1] is to return a struct instead, i.e.:
> >
> > struct sincosret _ZGVbN2v_sincos (vector double);
> > struct sincosretf _ZGVbN4v_sincosf (vector float);
>
> Ugh, then certainly it shouldn't be mangled as simd variant of sincos,
> it needs to be called something else.
> The ABI can't be written for a single, even if commonly used, function,
> but needs to be generic.
> And if I have a
>
> #pragma omp declare simd
> void foo (double x, double *y, double *z);
>
> then from the prototype there is no way to find out that it only uses
> the second two arguments to store a single double through them. It
> could very well do
>
> void foo (double x, double *y, double *z) {
>   y[0] = y[1] + x;
>   z[0] = z[1] + x;
> }
>
> or anything else and then you can't transform it to something like that.

Yeah. I think you have to approach the sincos case from our internal
__builtin_cexpi which means

_Complex double foo (double);

and how that represents itself with OpenMP SIMD.

Richard.
Re: GLIBC libmvec status
On Wed, Feb 26, 2020 at 10:32:17AM -0300, Tulio Magno Quites Machado
Filho wrote:
> Jakub Jelinek writes:
> > Can you please explain how do you want to pass the
> > void sincos (double, double *, double *);
> > arguments? I must say it isn't entirely clear from the document.
> > You talk there about double[2], but sincos certainly doesn't have
> > such an argument.
>
> The plan [1] is to return a struct instead, i.e.:
>
> struct sincosret _ZGVbN2v_sincos (vector double);
> struct sincosretf _ZGVbN4v_sincosf (vector float);

Ugh, then certainly it shouldn't be mangled as simd variant of sincos, it
needs to be called something else.
The ABI can't be written for a single, even if commonly used, function,
but needs to be generic.
And if I have a

#pragma omp declare simd
void foo (double x, double *y, double *z);

then from the prototype there is no way to find out that it only uses the
second two arguments to store a single double through them. It could
very well do

void foo (double x, double *y, double *z) {
  y[0] = y[1] + x;
  z[0] = z[1] + x;
}

or anything else and then you can't transform it to something like that.

Jakub
Re: GLIBC libmvec status
Jakub Jelinek writes:

> Can you please explain how do you want to pass the
> void sincos (double, double *, double *);
> arguments? I must say it isn't entirely clear from the document.
> You talk there about double[2], but sincos certainly doesn't have such
> an argument.

The plan [1] is to return a struct instead, i.e.:

struct sincosret _ZGVbN2v_sincos (vector double);
struct sincosretf _ZGVbN4v_sincosf (vector float);

Notice however, that change is still missing [2] from the libmvec patch
series [3].

[1] https://sourceware.org/ml/libc-alpha/2019-09/msg00334.html
[2] https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/powerpc/powerpc64/fpu/multiarch/vec_s_sincosf4_vsx.c;hb=refs/heads/tuliom/libmvec
[3] https://sourceware.org/git/?p=glibc.git;a=log;h=refs/heads/tuliom/libmvec

--
Tulio Magno
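The struct types in those prototypes are not spelled out in the thread;
under the plan as described they would presumably be along these lines (a
sketch assuming GCC's AltiVec vector types; the actual definitions live
in the tuliom/libmvec branch linked above in [2] and [3]):

  /* Hypothetical definitions matching the prototypes quoted above: one
     VSX register holding the sines and one holding the cosines.  */
  struct sincosret  { vector double sinv; vector double cosv; };
  struct sincosretf { vector float  sinv; vector float  cosv; };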
Re: GLIBC libmvec status
On Tue, Feb 25, 2020 at 07:43:09PM -0600, Bill Schmidt wrote:
> The reason that homogeneous aggregates matter (at least somewhat) is
> that the ABI ^H^H^H^HAPI requires establishing a calling convention and
> a name-mangling formula that includes the length of parameters and
> return values. Since ELFv2 and ELFv1 do not have the same calling
> convention, and ELFv2 has a superior one, we chose to use ELFv2's
> calling convention and make use of homogeneous aggregates for return
> values in registers for the case of vectorized sincos.

Can you please explain how do you want to pass the
void sincos (double, double *, double *);
arguments? I must say it isn't entirely clear from the document.
You talk there about double[2], but sincos certainly doesn't have such an
argument.

Also, I'd say ignoring the masked variants is a mistake, are you going to
warn any time the user uses inbranch or even doesn't specify notinbranch?
The masking can be implemented even without highly specialized
instructions, e.g. on x86 only AVX512F has full masking support, for
older ISAs all that is there is conditional store or e.g. for integral
operations that can't trap/raise exceptions just doing blend-like
operations (or even and/or) is all that is needed; just let the
vectorizer do its job.

Even if you don't want it for libmvec, just use
__attribute__((simd ("notinbranch"))) for those, but allow the user to
use it where it makes sense.

Jakub
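To illustrate the blend-style masking Jakub describes, a minimal sketch
using GCC's generic vector extensions (illustrative only, from neither
project): integer adds cannot trap, so all lanes are computed unguarded
and the mask only controls which results are kept.

  typedef int v4si __attribute__ ((vector_size (16)));

  /* Emulated masked add: lanes where mask is all-ones take a + b, the
     remaining lanes keep old.  */
  static inline v4si
  masked_add (v4si old, v4si a, v4si b, v4si mask)
  {
    v4si unguarded = a + b;                     /* compute every lane */
    return (unguarded & mask) | (old & ~mask);  /* blend by the mask  */
  }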
Re: GLIBC libmvec status
On 2/25/20 12:45 PM, Segher Boessenkool wrote:
> Hi!
>
> On Tue, Feb 25, 2020 at 04:53:17PM +0000, GT wrote:
> > ‐‐‐ Original Message ‐‐‐
> > On Sunday, February 23, 2020 11:45 AM, Bill Schmidt wrote:
> > > As I just wrote on gcc-patches, we should disable libmvec for
> > > powerpc64. The vector ABI as written isn't compatible with ELFv1.
> > > We would need a modified ABI that doesn't allow homogeneous
> > > aggregates of vectors to be returned in registers in order to
> > > support ELFv1. I do not believe that is worth pursuing until and
> > > unless there is demand for it (which I do not expect).
> >
> > Are we all agreed that the POWER Vector Function ABI will be
> > implemented only for powerpc64le?
>
> I do not agree.
>
> I don't agree we should have a new ABI, and an API (which this *is* as
> far as I can tell) works fine on *any* ABI. Homogeneous aggregates has
> nothing to do with anything either.
>
> It is fine to only *support* powerpc64le-linux, sure. But don't
> fragment the implementation, it only hurts, never helps -- we will end
> up having to support ten or twenty different compilers, instead of one
> compiler with a few (mostly) orthogonal variations. And yes, we should
> also test everything everywhere, whenever reasonable.

Thanks, Segher. Let me ask for some clarification here on how you'd like
us to proceed.

The reason that homogeneous aggregates matter (at least somewhat) is that
the ABI ^H^H^H^HAPI requires establishing a calling convention and a
name-mangling formula that includes the length of parameters and return
values. Since ELFv2 and ELFv1 do not have the same calling convention,
and ELFv2 has a superior one, we chose to use ELFv2's calling convention
and make use of homogeneous aggregates for return values in registers for
the case of vectorized sincos.

Please look at the document to see the constraints we're under to fit
into the different OpenMP clauses and attributes. It seems to me that we
can only define this for both powerpc64 and powerpc64le by establishing
two different calling conventions, which provides two different vector
length calculations for the sincos return value, and therefore requires
two different function implementations with different mangled names.
(Either that, or we cripple vectorized sincos by requiring it to return
values through memory.)

Now, we can either write a document that handles both cases now
(describes both calling conventions), and force glibc to have two
different functions at least for the sincos case; or we can restrict this
particular document to ELFv2 and leave open the possibility of writing a
very similar but slightly different document for ELFv1 at such time as
someone wants to use ELFv1 for libmvec. I'd personally rather push that
extra work out until we know there's a market for it. That is, I don't
want to preclude its use for ELFv1, but this *particular* API is specific
to ELFv2, so we need to acknowledge that in the code.

Ultimately it's your call, but if we need to rewrite the ABI/API we're
going to need concrete proposals for how to do that.

Thanks,
Bill

> For the glibc side I have no opinion.
>
> Segher
Re: GLIBC libmvec status
On Tue, 25 Feb 2020, GT wrote:

> 2. In GCC making SIMD clones available only for powerpc64le should be
> sufficient to guarantee that the Vector Function ABI is applied only
> for systems implementing the ELFv2 ABI. Right? Then, which macro is to
> be tested for in rs6000_simd_clone_usable? I expect that TARGET_VSX,
> TARGET_P8_VECTOR or TARGET_P9_VECTOR are not specific enough.

I have no advice on whether you should restrict it to ELFv2 in GCC or
not. But if you do restrict it to ELFv2, that's "DEFAULT_ABI ==
ABI_ELFv2" (not sure if you should also test TARGET_64BIT).

--
Joseph S. Myers
jos...@codesourcery.com
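Folding Joseph's suggestion into the hook Bert names would look roughly
like this (a sketch, not a committed implementation; the VSX test and the
return-value convention follow the x86 hook, where a negative value means
the clone is not usable):

  static int
  rs6000_simd_clone_usable (struct cgraph_node *node)
  {
    /* Restrict SIMD clones to the ELFv2 ABI and to hardware the vector
       ABI requires (P8 vector instructions); see the note above about
       possibly also testing TARGET_64BIT.  */
    if (DEFAULT_ABI != ABI_ELFv2 || !TARGET_P8_VECTOR)
      return -1;
    return 0;
  }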
Re: GLIBC libmvec status
GT writes:

> Are we all agreed that the POWER Vector Function ABI will be implemented
> only for powerpc64le?
>
> If so, here are a few more questions:
>
> 1. The GLIBC implementation has files Makefile, Versions, configure,
> configure.ac among others in directory sysdeps/powerpc/powerpc64/fpu.
> Do we need to create a new directory as
> sysdeps/powerpc/powerpc64/powerpc64le/fpu and into it move the
> aforementioned files?

No, the directory already exists as sysdeps/powerpc/powerpc64/le/fpu/.
If we end up agreeing to restrict it to powerpc64le, we only need to move
a couple of files or contents of files. Not all of them. That would
require to change most, if not all, of the libmvec patches. I can change
them.

Anyway, IMHO glibc needs to follow the GCC decision here.

--
Tulio Magno
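In other words, the layout is already split along the needed line (paths
as given in the message; the annotations are editorial):

  sysdeps/powerpc/powerpc64/fpu/     # shared by big- and little-endian
  sysdeps/powerpc/powerpc64/le/fpu/  # powerpc64le-only files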
Re: GLIBC libmvec status
Hi!

On Tue, Feb 25, 2020 at 04:53:17PM +0000, GT wrote:
> ‐‐‐ Original Message ‐‐‐
> On Sunday, February 23, 2020 11:45 AM, Bill Schmidt wrote:
> > As I just wrote on gcc-patches, we should disable libmvec for
> > powerpc64. The vector ABI as written isn't compatible with ELFv1. We
> > would need a modified ABI that doesn't allow homogeneous aggregates
> > of vectors to be returned in registers in order to support ELFv1. I
> > do not believe that is worth pursuing until and unless there is
> > demand for it (which I do not expect).
>
> Are we all agreed that the POWER Vector Function ABI will be implemented
> only for powerpc64le?

I do not agree.

I don't agree we should have a new ABI, and an API (which this *is* as
far as I can tell) works fine on *any* ABI. Homogeneous aggregates has
nothing to do with anything either.

It is fine to only *support* powerpc64le-linux, sure. But don't fragment
the implementation, it only hurts, never helps -- we will end up having
to support ten or twenty different compilers, instead of one compiler
with a few (mostly) orthogonal variations. And yes, we should also test
everything everywhere, whenever reasonable.

For the glibc side I have no opinion.


Segher
Re: GLIBC libmvec status
‐‐‐ Original Message ‐‐‐
On Sunday, February 23, 2020 11:45 AM, Bill Schmidt wrote:

> On 2/21/20 6:49 AM, Tulio Magno Quites Machado Filho wrote:
> > +Bill, +Segher
> >
> > GT writes:
> > > Can I have until tomorrow morning to figure out exactly where/how
> > > to link the Power Vector Function ABI page? My first quick attempt
> > > resulted in the html tags being rendered on the page verbatim.
> >
> > Sure!
> > Let me know when you update the wiki and I'll send the patches to
> > libc-alpha.
> >
> > Meanwhile, let me clarify another point...
> >
> > Bert, Bill, Segher,
> >
> > In the GCC discussion, Jakub pointed out [1] the new vector ABI
> > targets ELFv2, but there is nothing preventing it from being used on
> > powerpc64-linux or powerpc-linux.
> > On the other hand, the glibc patches enable libmvec on powerpc64le
> > and powerpc64.
> >
> > IMHO, regardless of the decision, GCC and glibc should be in sync.
> >
> > Bert, did you get a chance to test the GCC patches on
> > powerpc64-linux? I've been testing the glibc patches and they work
> > fine, but they require POWER8 (the vector ABI also requires P8).
> >
> > Bill, Segher,
> > What do you think is the best solution from the GCC point of view?
>
> As I just wrote on gcc-patches, we should disable libmvec for
> powerpc64. The vector ABI as written isn't compatible with ELFv1. We
> would need a modified ABI that doesn't allow homogeneous aggregates of
> vectors to be returned in registers in order to support ELFv1. I do
> not believe that is worth pursuing until and unless there is demand for
> it (which I do not expect).

Are we all agreed that the POWER Vector Function ABI will be implemented
only for powerpc64le?

If so, here are a few more questions:

1. The GLIBC implementation has files Makefile, Versions, configure,
configure.ac among others in directory sysdeps/powerpc/powerpc64/fpu. Do
we need to create a new directory as
sysdeps/powerpc/powerpc64/powerpc64le/fpu and into it move the
aforementioned files?

2. In GCC making SIMD clones available only for powerpc64le should be
sufficient to guarantee that the Vector Function ABI is applied only for
systems implementing the ELFv2 ABI. Right? Then, which macro is to be
tested for in rs6000_simd_clone_usable? I expect that TARGET_VSX,
TARGET_P8_VECTOR or TARGET_P9_VECTOR are not specific enough.

Bert.