Please do, Sergei. I am also very much interested in the code.

Rodrigo Kumpera wrote:
Hi Sergei,

I'm glad to hear about your improvements. Can you share the code?

I believe this is not the best approach. Mono.Simd was never intended to be a variable-width SIMD API, and turning it
into one would make coding against it significantly harder.

My suggestion is to implement scalar replacement and then force inlining of all Mono.Simd operations.

For example:

Vector4f a,b,c;
...
a = b + c;

Scalar replacement (SR) would replace it with:
float a0,a1,a2,a3,b0....

a0 = b0 + c0;
a1 = b1 + c1;
...

This will have acceptable performance and result in equivalent execution semantics, which is a much more usable model.
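
To make the equivalence concrete, here is a minimal C sketch of the two forms of "a = b + c". The Vector4f struct and the function names below are hypothetical stand-ins written for illustration; they are not Mono's types or JIT code, and the scalar form only writes to out[] so that the result can be inspected.

```c
#include <assert.h>

/* Hypothetical stand-in for Mono.Simd's Vector4f; not Mono's real layout. */
typedef struct { float x, y, z, w; } Vector4f;

/* Aggregate form: a = b + c on whole structs, passed and returned by value. */
static Vector4f vector4f_add(Vector4f b, Vector4f c)
{
    Vector4f a = { b.x + c.x, b.y + c.y, b.z + c.z, b.w + c.w };
    return a;
}

/* Scalar-replaced form of the same statement: every lane is its own local
 * (a0..a3, b0..b3, c0..c3), so no vector value has to be addressable.
 * The lanes are copied to out[] only so that a caller can inspect them. */
static void scalar_replaced_add(float b0, float b1, float b2, float b3,
                                float c0, float c1, float c2, float c3,
                                float out[4])
{
    float a0 = b0 + c0;
    float a1 = b1 + c1;
    float a2 = b2 + c2;
    float a3 = b3 + c3;
    out[0] = a0; out[1] = a1; out[2] = a2; out[3] = a3;
}

/* Returns 1 when both forms produce identical lanes for one sample input. */
static int forms_agree(void)
{
    Vector4f b = { 1.0f, 2.0f, 3.0f, 4.0f };
    Vector4f c = { 10.0f, 20.0f, 30.0f, 40.0f };
    Vector4f a = vector4f_add(b, c);
    float s[4];

    scalar_replaced_add(b.x, b.y, b.z, b.w, c.x, c.y, c.z, c.w, s);
    assert(a.x == 11.0f); /* exact: small integers are representable */
    return a.x == s[0] && a.y == s[1] && a.z == s[2] && a.w == s[3];
}
```

Calling forms_agree() from a test harness checks that the per-lane version computes the same values as the struct version, which is the "equivalent execution semantics" property mentioned above.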

Scalar replacement requires two major changes in the JIT. First, we need to convert all valuetype operations to use a higher-level IR without explicit memory operations. Second, with this new IR, we can scalar replace all vector types that have no memory ops over them. In other words:

Right now "a = new Vector4f (1,2,3,4)" generates an IR similar to this:

ldaddr R10 <- R8
storer4_membase [R10 + 0], 1
storer4_membase [R10 + 4], 2
storer4_membase [R10 + 8], 3
storer4_membase [R10 + 12], 4

This forces the vector type to live in memory. If we instead generate something like:

vzero R8
storer4_field [x] R8, 1
storer4_field [y] R8, 2
storer4_field [z] R8, 3
storer4_field [w] R8, 4

This new IR has no memory ops over the vector type, so we can scalar replace it to something like:

r4_const R11, 0
r4_const R12, 0
r4_const R13, 0
r4_const R14, 0

r4_const R11, 1
r4_const R12, 2
r4_const R13, 3
r4_const R14, 4

The first four instructions, which come from the vzero, are dead and will be removed by the dead code elimination (DCE) pass.
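
As a rough illustration of why those first four instructions disappear, here is a toy dead code elimination pass in C. It is only a sketch under stated assumptions: ToyInst, toy_dce and dce_example are hypothetical names invented for this example, the IR is far simpler than Mono's real one, and the pass handles a single basic block with a backward liveness scan rather than whatever Mono's DCE actually does.

```c
#include <assert.h>
#include <stddef.h>

enum { MAX_REGS = 32 };

/* Toy instruction, not Mono's MonoInst: it writes dreg and optionally
 * reads sreg (-1 means "no source", as for an r4_const). */
typedef struct {
    int   dreg; /* register written */
    int   sreg; /* register read, or -1 */
    float imm;  /* constant for r4_const-style instructions */
    int   dead; /* set to 1 by the pass when the result is never used */
} ToyInst;

/* Backward scan over one basic block. live_out[r] != 0 means register r
 * is still needed after the block. Returns how many instructions were
 * marked dead. Assumes every dreg is in [0, MAX_REGS). */
static int toy_dce(ToyInst *code, size_t n, const int live_out[MAX_REGS])
{
    int live[MAX_REGS];
    int removed = 0;

    for (int r = 0; r < MAX_REGS; r++)
        live[r] = live_out[r];

    for (size_t i = n; i-- > 0; ) {
        ToyInst *ins = &code[i];
        if (!live[ins->dreg]) {
            /* Value is overwritten or never read: the definition is dead,
             * and its source does not become live because of it. */
            ins->dead = 1;
            removed++;
            continue;
        }
        live[ins->dreg] = 0;      /* this definition satisfies later uses */
        if (ins->sreg >= 0)
            live[ins->sreg] = 1;  /* its operand is needed before this point */
    }
    return removed;
}

/* Builds the eight r4_const instructions from the example above
 * (R11..R14 set to 0, then to 1..4) and runs the pass over them. */
static int dce_example(ToyInst code[8])
{
    int live_out[MAX_REGS] = { 0 };

    for (int i = 0; i < 4; i++) {
        code[i]     = (ToyInst){ 11 + i, -1, 0.0f, 0 };
        code[4 + i] = (ToyInst){ 11 + i, -1, (float)(i + 1), 0 };
        live_out[11 + i] = 1; /* the final vector lanes are used later */
    }
    return toy_dce(code, 8, live_out);
}
```

In this sketch the four zeroing instructions are marked dead because each register is redefined before anything reads it, while the four instructions that load 1 to 4 survive.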

I have a WIP patch to do the first part of the transformation. It's based on a three-month-old trunk and has a bunch of bugs, so it requires quite some work before it's functional. I can send it to you if you want to continue working on it.


On Tue, Apr 13, 2010 at 12:01 PM, Sergei Dyshel <qyron.priv...@gmail.com <mailto:qyron.priv...@gmail.com>> wrote:

    Hello Rodrigo,
    Regarding your question, unfortunately I cannot apply for GSoC due
    to time and other constraints.

    With your tips I managed to extend linear scan to vector
    registers, and now SIMD code runs much faster. Thank you!

    My next (:]) question is about "scalarization", i.e. running
    programs with SIMD intrinsics on non-SIMD platforms (I am just
    simulating this with -O=-simd). The current implementation in Mono
    simply treats vectors as vtypes and passes them by value on the
    stack, thus doing a lot of superfluous memory copies. As a result,
    "scalarized" code runs slowly, well behind code without vector
    intrinsics.
    A better solution I'm thinking of is to "reduce" the vector size to 1,
    i.e. interpret Mono.Simd datatypes as the corresponding scalar types.
    For example:
    Vector4i a;
    Vector4i b;
    Vector4i c = op_addition (a, b);
    will be transformed to something like:
    int a;
    int b;
    int c = op_addition (a,b);

    Of course, not all code allows such a transformation (it must not use
    a hard-coded SIMD size but instead use some kind of get-vector-size
    intrinsic). I tried some tests by manually replacing the assembly, and
    they showed great results. But now I want to do the transformation
    inside the JIT.
    Can you please help me find the place in the JIT where I
    can do the transformation? I tried searching through
    'method-to-ir.c' but could not work out where exactly vtypes can be
    transformed to scalar types.
    Thanks!
    -- Regards,
    Sergei Dyshel



    On Thu, Apr 8, 2010 at 18:08, Rodrigo Kumpera <kump...@gmail.com
    <mailto:kump...@gmail.com>> wrote:

        Hi Sergei,

        On Thu, Apr 8, 2010 at 11:59 AM, Sergei Dyshel
        <qyron.priv...@gmail.com <mailto:qyron.priv...@gmail.com>> wrote:

            Hello Rodrigo,
            Just picking up the conversation we had some time ago. I
            was asking why the JIT does unneeded loads and stores, and you
            answered that this behavior is due to the lack of a global
            register allocator. As I understand it, any vreg which is
            used in different basic blocks is "promoted" to a "memory
            variable" and hence gets loaded and stored each time.
            Then I asked why bare "global" 'ints' are treated
            differently (and more efficiently), and you said that there
            are callee-saved iregs, so there is a specialized allocator
            for them.
            Can you please point me to the relevant place in the code?


        Look into liveness.c / linear_scan.c. In liveness.c look for mono_analyze_liveness.
        In linear_scan.c look for mono_linear_scan.



            On Altivec we have callee-saved vector registers too. Is
            it possible to use the same trick with them, in order to
            remove unnecessary loads/stores?

        Yes, it might be possible to do so; I'm not sure how much work it
        will be, though.




------------------------------------------------------------------------

_______________________________________________
Mono-devel-list mailing list
Mono-devel-list@lists.ximian.com
http://lists.ximian.com/mailman/listinfo/mono-devel-list
