On Wed, Jul 5, 2023 at 5:14 PM Sylvain Noiry via Gcc <gcc@gcc.gnu.org> wrote:
>
> Hi,
>
> My name is Sylvain, I am an intern at Kalray, and I work on improving the
> GCC backend for the KVX target. The KVX ISA has dedicated instructions for
> handling complex numbers, which cannot be selected by GCC because of how
> complex numbers are handled internally. My goal is to make GCC able to
> expose new patterns dealing with complex numbers to machine description
> files. I already have a proof of concept which can increase performance
> even on other backends like x86 if the new patterns are implemented.
>
> My approach is to prevent the lowering of complex operations when the
> backend can handle them natively and to work directly on complex modes
> (SC, DC, CDI, CSI, CHI, CQI). The cplxlower pass looks for supported
> optabs related to complex numbers and uses them directly. Another
> advantage is that native operations can now go through all GIMPLE passes
> and preserve most optimisations, like FMA generation.
I'll note that complex lowering takes advantage of complex numbers without
real/imag parts; I suppose you are preserving some of these optimizations
and only prevent lowering of ops we natively support. I think that's a
reasonable thing, and I agree the standard optabs should be used with
complex modes.

> Vectorization is also preserved with native complex operands, although
> some functions were updated. Because vectorization assumes that inner
> elements are scalar, and complex cannot be considered scalar, some
> functions which only take scalars have been adapted or duplicated to
> handle complex elements.

I don't quite understand whether you end up with vectors with complex
components or vectors with twice the number of scalar elements, implicitly
representing real/imag parts interleaved. Can you clarify? We've recently
had discussions around this and agreed we don't want vector modes with
complex component modes.

> I've also changed the representation of complex numbers during the expand
> pass. READ_COMPLEX_PART and WRITE_COMPLEX_PART have been turned into
> target hooks, and a new hook GEN_RTX_COMPLEX allows each backend to
> choose its preferred complex representation in RTL. The default one uses
> CONCAT as before, but the KVX backend uses registers with complex mode
> containing both the real and imaginary parts.
>
> Now each backend can add its own native complex operations with patterns
> in its machine description. The following example implements a complex
> multiplication with mode SC on the KVX backend:
>
> (define_insn "mulsc3"
>   [(set (match_operand:SC 0 "register_operand" "=r")
>         (mult:SC (match_operand:SC 1 "register_operand" "r")
>                  (match_operand:SC 2 "register_operand" "r")))]
>   ""
>   "fmulwc %0 = %1, %2"
>   [(set_attr "type" "mau_fpu")]
> )
>
> The main patch affects around 1400 lines of generic code, mostly located
> in expr.cc and tree-complex.cc.
> These are mainly additions or the result of moving READ_COMPLEX_PART and
> WRITE_COMPLEX_PART from expr.cc to target hooks.
>
> I know that ARM developers have added partial support for complex
> instructions. However, since they operate during vectorization and
> promote operations on vectors of floating-point numbers that look like
> operations on (vectors of) complex numbers, their approach misses simple
> cases. At this point they create operations working on vectors of
> floating-point numbers which will be caught by dedicated define_expands
> later. On the other hand, our approach propagates complex numbers through
> the whole middle end, and we have an easier time recombining the
> operations and recognizing what ARM does. Some choices will be needed to
> merge our two approaches, although I've already reused their work on
> complex rotations in my implementation.
>
> Results:
>
> I have tested my implementation on multiple code samples, as well as a
> few FFTs. On a simple in-place radix-2 FFT with precomputed twiddle
> factors (2 complex multiplications, 1 addition, and 1 subtraction per
> loop), the compute time has been divided by 3 when compiling with -O3
> (because calls to __mulsc3 are replaced by native instructions) and
> shortened by 20% with -ffast-math. In both cases, the achieved
> performance level is now on par with another version coded using
> intrinsics. These improvements do not come exclusively from the newly
> generated hardware instructions; replacing CONCATs with registers
> prevents GCC from generating instructions to extract the real and
> imaginary parts into their own registers and recombine them later.
>
> This new approach can also bring a performance uplift to other backends.
> I have tried to reuse the same complex representation in RTL as for KVX
> on x86, along with a few patterns. Although I still have useless moves on
> large programs, simple examples like the one below already show a
> performance uplift.
>
> _Complex float add(_Complex float a, _Complex float b)
> {
>   return a + b;
> }

Yeah, the splitting doesn't help our bad job dealing with parameter and
return value expansion, and your change likely skirts that issue.
Vectorizing would likely fix it in a similar way.

> Using "-O2", the assembly produced is now on par with LLVM and looks
> like:
>
> add:
>     addps %xmm1, %xmm0
>     ret
>
> Choices to be made:
>
> - Currently, ARM uses optabs which start with "c", like "cmul", to
>   distinguish between real floating-point numbers and complex numbers.
>   Since we keep the complex mode, this could simply be done with
>   mul<mode>.
> - Currently the parser does some early optimizations and lowering that
>   could be moved into the cplxlower pass. For example, I've changed a bit
>   how complex rotations by 90° and 270° are processed; they are
>   recognized in fold-const.cc. A call to a new COMPLEX_ROT90/270 internal
>   function is now inserted, which is then lowered or kept by the
>   cplxlower pass. Finally, the widening_mul pass can generate the
>   COMPLEX_ADD_ROT90/270 internal functions, which are expanded using the
>   cadd90/270 optabs; otherwise COMPLEX_ROT90/270 are expanded using the
>   new crot90/270 optabs.
> - Currently, we have to duplicate preferred_simd_mode since it only
>   accepts scalar modes. If we unify enough, we could have a new type that
>   would be a union of scalar_mode and complex_mode, but we did not do it
>   since it would incur many modifications.
> - Declaration of complex vectors through attribute directives: this would
>   be a new C extension (and clang does not support it either).
> - The KVX ISA supports some fused conjugate operations (e.g. a +
>   conjf(b)), which are caught directly in the combine pass if the
>   corresponding pattern is present in the backend. This solution is
>   simple, but they may also be caught in the middle end, like FMAs.
>
> Currently supported patterns:
>
> - all basic arithmetic operations for scalar and vector complex modes
>   (add, mul, neg, ...)
> - conj<mode> for the conjugate operation, using a new conj_optab
> - crot90<mode>/crot270<mode> for complex rotations, using new optabs
>
> I would like to have your opinion on my approach. I can send you the
> patch if you want.

It would be nice if you could split the patch into a series of changes.

Thanks,
Richard.

> Best regards,
>
> Sylvain Noiry