Re: [Mono-dev] Mono generates inefficient vectorized code

Jerry Maine - KF5ADY Tue, 13 Apr 2010 17:10:29 -0700

Please do, Sergei. I am also very much interested in the code.


Rodrigo Kumpera wrote:

Hi Sergei,

I'm glad to hear about your improvements. Can you share the code?

I believe this is not the best approach. Mono.Simd was never intended to be a variable-width SIMD API, and making it one would make coding against it significantly harder.

My suggestion is to implement scalar replacement and then force inlining of all Mono.Simd operations.


For example:

Vector4f a,b,c;
...
a = b + c;

SR would replace it with:
float a0,a1,a2,a3,b0....

a0 = b0 + c0;
a1 = b1 + c1;
...

This will have acceptable performance and result in equivalent execution semantics, which is a much more usable model.
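To make the idea concrete, here is a minimal C sketch of what scalar replacement does to `a = b + c`. The `Vec4f` struct and both function names are hypothetical stand-ins for Mono.Simd's Vector4f, not Mono code. The sketch only shows that the four-lane struct add and four independent float adds compute the same values.

```c
#include <assert.h>

/* Hypothetical stand-in for Mono.Simd's Vector4f (not Mono code). */
typedef struct { float x, y, z, w; } Vec4f;

/* Vector form: the value is an aggregate, so a compiler without
 * scalar replacement tends to keep it in memory. */
static Vec4f vec4f_add(Vec4f b, Vec4f c)
{
    Vec4f a = { b.x + c.x, b.y + c.y, b.z + c.z, b.w + c.w };
    return a;
}

/* Scalar-replaced form: every lane is an independent float.
 * The out pointers exist only so the result can be inspected from C;
 * inside a JIT the four results would simply be four virtual registers. */
static void vec4f_add_scalarized(float b0, float b1, float b2, float b3,
                                 float c0, float c1, float c2, float c3,
                                 float *a0, float *a1, float *a2, float *a3)
{
    assert(a0 && a1 && a2 && a3);
    *a0 = b0 + c0;
    *a1 = b1 + c1;
    *a2 = b2 + c2;
    *a3 = b3 + c3;
}
```

Both forms perform the same four additions, which is why the transformation preserves execution semantics.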

Scalar replacement requires two major changes in the JIT. First, we need to convert all valuetype operations to use a higher-level IR without explicit memory operations. Second, with this new IR, we can scalar replace all vector types that have no memory ops over them. IOW, something like:


Right now "a  = new Vector4f (1,2,3,4)" generates an IR similar to this:

ldaddr R10 <- R8
storer4_membase [R10 + 0], 1
storer4_membase [R10 + 4], 2
storer4_membase [R10 + 8], 3
storer4_membase [R10 + 12], 4

This forces the vector type to be in memory. If we instead generate something like:


vzero R8
storer4_field [x] R8, 1
storer4_field [y] R8, 2
storer4_field [z] R8, 3
storer4_field [w] R8, 4

This new IR has no memory ops over the vector type, so we can scalar replace it with something like:


r4_const R11, 0
r4_const R12, 0
r4_const R13, 0
r4_const R14, 0

r4_const R11, 1
r4_const R12, 2
r4_const R13, 3
r4_const R14, 4

The first four stores will be removed by the DCE pass.
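As a rough illustration of why those instructions disappear, here is a minimal C sketch of a backward dead-code pass over a single basic block. `ToyInst`, `toy_dce` and `TOY_MAX_REGS` are hypothetical names for a toy IR invented for this example; this is not Mono's MonoInst or its actual DCE pass.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy IR: each instruction writes one virtual register
 * and optionally reads one. This is not Mono's MonoInst. */
typedef struct {
    int   dreg;  /* register written */
    int   sreg;  /* register read, or -1 if none (e.g. a constant load) */
    float imm;   /* constant value for constant loads */
    bool  dead;  /* set by toy_dce when the write is never observed */
} ToyInst;

enum { TOY_MAX_REGS = 32 };

/* Walks the block backwards tracking which registers are still needed.
 * A write to a register that is not needed afterwards is marked dead.
 * live_out[r] says whether register r is read after the block.
 * Returns the number of instructions marked dead. */
static size_t toy_dce(ToyInst *code, size_t n, const bool *live_out)
{
    bool live[TOY_MAX_REGS];
    size_t removed = 0;

    for (int r = 0; r < TOY_MAX_REGS; r++)
        live[r] = live_out[r];

    for (size_t i = n; i-- > 0; ) {
        ToyInst *ins = &code[i];
        assert(ins->dreg >= 0 && ins->dreg < TOY_MAX_REGS);
        assert(ins->sreg < TOY_MAX_REGS);

        if (!live[ins->dreg]) {
            ins->dead = true;        /* overwritten or unused later */
            removed++;
            continue;                /* a dead instruction's reads do not count */
        }
        live[ins->dreg] = false;     /* the value before this write is unneeded */
        if (ins->sreg >= 0)
            live[ins->sreg] = true;  /* the value read here is needed */
    }
    return removed;
}
```

Fed the eight constant loads above with R11..R14 live at the end of the block, this pass marks the four zero-initializing loads dead, because each register is overwritten before anything reads it.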

I have a WIP patch to do the first part of the transformation. It's based on a three-month-old trunk and has a bunch of bugs, so it requires quite some work before it's functional. I can send it to you, if you want to continue working on it.

On Tue, Apr 13, 2010 at 12:01 PM, Sergei Dyshel <qyron.priv...@gmail.com <mailto:qyron.priv...@gmail.com>> wrote:


    Hello Rodrigo,
    Regarding your question, unfortunately I cannot apply for GSoC due
    to time and other constraints.

    With your tips I managed to extend linear scan to vector
    registers and now SIMD code runs much faster. Thank you!

    My next (:]) question is about "scalarization", i.e. running
    programs with SIMD intrinsics on non-SIMD platforms (just
    simulating this with -O=-simd). The current implementation in Mono
    simply treats vectors as vtypes and passes them by value on the
    stack, thus doing a lot of superfluous memory copies. Therefore
    "scalarized" code runs slowly, way behind code without vector
    intrinsics.

    A better solution I'm thinking of is to "reduce" vector size to 1,
    i.e. interpret Mono.Simd datatypes as corresponding scalar types.
    For example:
    Vector4i a;
    Vector4i b;
    Vector4i c = op_addition (a, b);

    will be transformed to something like:

    int a;
    int b;
    int c = op_addition (a,b);

    Of course, not all code allows such a transformation (it must not use
    a hard-coded SIMD size but some kind of get-vector-size
    intrinsic). I tried some tests by manually replacing assembly and
    they showed great results. But now I want to do the transformation
    inside the JIT.
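The constraint about not hard-coding the SIMD size can be illustrated with a small C sketch. `add_arrays` and its `width` parameter are hypothetical: `width` stands in for whatever a get-vector-size intrinsic would return (4 on a SIMD target, 1 when vectors are reduced to scalars). Nothing here is Mono.Simd API.

```c
#include <assert.h>
#include <stddef.h>

/* Adds two int arrays in chunks of `width` lanes.
 * `width` is a hypothetical stand-in for a get-vector-size intrinsic.
 * Because the loop never assumes a fixed lane count, the same code
 * is correct for width 4 and for width 1. */
static void add_arrays(const int *a, const int *b, int *c,
                       size_t n, size_t width)
{
    size_t i = 0;

    assert(width > 0);

    /* Full chunks: one "vector" operation per iteration. */
    for (; i + width <= n; i += width)
        for (size_t lane = 0; lane < width; lane++)
            c[i + lane] = a[i + lane] + b[i + lane];

    /* Leftover elements that do not fill a whole chunk. */
    for (; i < n; i++)
        c[i] = a[i] + b[i];
}
```

Calling it with `width` 4 and with `width` 1 on the same input produces identical output, which is the property the vector-to-scalar reduction relies on. Code that indexes lanes 0..3 directly would not survive that reduction.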

    Can you please help me find the place in the JIT where I
    can do the transformation? I tried searching through
    'method-to-ir.c' but could not work out where exactly vtypes can be
    transformed to scalar types.
    Thanks!

    --
    Regards,
    Sergei Dyshel



    On Thu, Apr 8, 2010 at 18:08, Rodrigo Kumpera <kump...@gmail.com
    <mailto:kump...@gmail.com>> wrote:

        Hi Sergei,

        On Thu, Apr 8, 2010 at 11:59 AM, Sergei Dyshel
        <qyron.priv...@gmail.com <mailto:qyron.priv...@gmail.com>> wrote:

            Hello Rodrigo,
            Just picking up this conversation we had some time ago. I
            was asking why the JIT does unneeded loads and stores and you
            answered that this behavior is due to the lack of a global
            reg allocator. As I understand it, any vreg which is
            used in different basic blocks is "promoted" to a "memory
            variable" and hence gets loaded and stored each time.
            Then I asked why bare "global" 'ints' are treated
            differently (and more efficiently) and you said that there
            are callee-saved iregs so there is a specialized allocator
            for them.
            Can you please point me to the relevant place in the code?

        Look into liveness.c / linear_scan.c.
        In liveness.c look for mono_analyze_liveness.
        In linear_scan.c look for mono_linear_scan.



            On Altivec we have callee-saved vector registers too. Is
            it possible to use the same trick with them, in order to
            remove unnecessary loads/stores?

        Yes, it might be possible to do so, not sure how much work it
        will be though.




------------------------------------------------------------------------

_______________________________________________
Mono-devel-list mailing list
Mono-devel-list@lists.ximian.com
http://lists.ximian.com/mailman/listinfo/mono-devel-list
