On Wed, Jul 5, 2023 at 5:14 PM Sylvain Noiry via Gcc <gcc@gcc.gnu.org> wrote:
>
> Hi,
>
> My name is Sylvain, I am an intern at Kalray and I work on improving the GCC 
> backend for the KVX target.  The KVX ISA has dedicated instructions for the 
> handling of complex numbers, which cannot be selected by GCC due to how 
> complex numbers are handled internally.  My goal is to make GCC able to 
> expose to machine description files new patterns dealing with complex 
> numbers.  I already have a proof of concept which can increase performance 
> even on other backends like x86 if the new patterns are implemented.
>
> My approach is to prevent the lowering of complex operations when the backend 
> can handle it natively and work directly on complex modes (SC, DC, CDI, CSI, 
> CHI, CQI).  The cplxlower pass looks for supported optabs related to complex 
> numbers and use them directly.  Another advantage is that native operations 
> can now go through all GIMPLE passes and preserve most optimisations like FMA 
> generation.

I'll note that complex lowering takes advantage of complex numbers
without real/imag parts, I suppose you are preseving some
of these optimizations and only prevent lowering of ops we natively
support.  I think that's a reasonable thing and I agree the
standard optabs should be used with complex modes.

> Vectorization is also preserved with native complex operands, although some 
> functions were updated. Because vectorization assumes that inner elements are 
> scalar and complex cannot be considered as scalar, some functions which only 
> take scalars have been adapted or duplicated to handle complex elements.

I don't quite understand whether you end up with vectors with complex
components or vectors with twice the number of
scalar elements, implicitely representing real/imag parts interleaved.
Can you clarify?  We've recently had discussions
around this and agreed we don't want vector modes with complex component modes.

> I've also changed the representation of complex numbers during the expand 
> pass.  READ_COMPLEX_PART and WRITE_COMPLEX_PART have been transformed into 
> target hooks, and a new hook GEN_RTX_COMPLEX allows each backend to choose 
> its preferred complex representation in RTL.  The default one uses CONCAT 
> like before, but the KVX backend uses registers with complex mode containing 
> both real and imaginary parts.
>
> Now each backend can add its own native complex operations with patterns in 
> its machine description. The following example implements a complex 
> multiplication with mode SC on the KVX backend:
> (define_insn "mulsc3"
>   [(set (match_operand:SC 0 "register_operand" "=r")
>         (mult:SC (match_operand:SC 1 "register_operand" "r")
>                  (match_operand:SC 2 "register_operand" "r")))]
>   ""
>   "fmulwc %0 = %1, %2"
>   [(set_attr "type" "mau_fpu")]
> )
>
> The main patch affects around 1400 lines of generic code, mostly located in 
> expr.cc and tree-complex.cc. These are mainly additions or the result of the 
> move of READ_COMPLEX_PART and WRITE_COMPLEX_PART from expr.cc to target hooks.
>
> I know that ARM developers have added partial support of complex 
> instructions.  However, since they are operating during the vectorization, 
> and are promoting operations on vectors of floating point numbers that looks 
> like operations on (vectors of) complex numbers, their approach misses simple 
> cases.  At this point they create operations working on vector of floating 
> point numbers which will be caught by dedicated define_expand later.  On the 
> other hand, our approach propagates complex numbers through all the 
> middle-end and we have an easier time to recombine the operations and 
> recognize what ARM does.  Some choices will be needed to merge our two 
> approaches, although I've already reused their work on complex rotations in 
> my implementation.
>
> Results:
>
> I have tested my implementation on multiple code samples, as well as a few 
> FFTs.  On a simple in-place radix-2 with precomputed twiddle seeds (2 complex 
> mult, 1 add, and 1 sub per loop), the compute time has been divided by 3 when 
> compiling with -O3 (because calls to __mulsc3 are replaced by native 
> instructions) and shortened by 20% with -ffast-math.  In both cases, the 
> achieved performance level is now on par with another version coded using 
> intrinsics.  These improvements do not come exclusively from the new 
> generated hardware instructions, the replacement of CONCATs to registers 
> prevents GCC from generating instructions to extract the real and imaginary 
> part into their own registers and recombine them later.
>
> This new approach can also brings a performance uplift to other backends.  I 
> have tried to reuse the same complex representation in rtl as KVX for x86, 
> and a few patterns.  Although I still have useless moves on large programs, 
> simple examples like below already show performance uplift.
>
> _Complex float add(_Complex float a, _Complex float b)
> {
>   return a + b;
> }

Yeah, the splitting doesn't help our bad job dealing with parameter
and return value expansion and your
change likely skirts that issue.  Vectorizing would likely fix it in a
similar way.

> Using "-O2" the assembly produced is now on paar with llvm and looks like :
>
> add:
>         addps  %xmm1, %xmm0
>         ret
>
> Choices to be done:
>   - Currently, ARM uses optab which start with "c" like "cmul" to distinguish 
> between a real floating point numbers and complex numbers.  Since we keep 
> complex mode, this could be simply done with mul<mode>.
>   - Currently the parser does some early optimizations and lowering that 
> could be moved into the cplxlower pass.  For example, i've changed a bit how 
> complex rotations by 90° and 270° are processed, which are recognized in 
> fold-const.cc.  A call to a new COMPLEX_ROT90/270 internal function is now 
> inserted, which is then lowered or kept in the cplxlower pass.  Finally the 
> widening_mul pass can generate COMPLEX_ADD_ROT90/270 internal function, which 
> are expanded using the cadd90/270 optabs, else COMPLEX_ROT90/270 are expanded 
> using new crot90/270 optabs.
>   - Currently, we have to duplicate the preferred_simd_mode since in only 
> accept scalar modes, if we unify enough, we could have a new type that would 
> be a union of scalar_mode and complex_mode, but we did not do it since it 
> would incur many modifications.
>   - Declaration of complex vector through attribute directives, this would be 
> a new C extension (and clang does not support it either).
>   - The KVX ISA supports some fused conjugate and operations (ex: a + 
> conjf(b)), which are caught directly in the combine pass if the corresponding 
> pattern in present the backend. This solution is simple, but it also mays be 
> caught in the middle-end like FMAs.
>
> Currently supported patterns:
>   - all basic arithmetic operations for scalar and vector complex modes (add, 
> mul, neg, ...)
>   - conj<mode> for the conjugate operation, using a new conj_optab
>   - crot90<mode>/crot270<mode> for complex rotations, using new optabs
>
> I would like to have your opinion on my approach. I can send you the patch if 
> you want.

It would be nice if you could split the patch into a series of changes.

Thanks,
Richard.

> Best regards,
>
> Sylvain Noiry
>
>
>

Reply via email to