Andrew Pinski <pins...@gmail.com> writes:
> I need some help with the vector cost model for aarch64.
> I am adding V2HI and V4QI mode support by emulating them with the
> native V4HI/V8QI instructions (similar to how MMX is emulated using
> SSE on x86).  The problem is that I am running into a cost model
> issue with gcc.target/aarch64/pr98772.c (wminus is similar to
> gcc.dg/vect/slp-gap-1.c, just with slightly different address
> offsets).
> It seems like the cost model is overestimating the number of loads
> for the V8QI case.  With the new cost model in use
> (-march=armv9-a+nosve), I get:
> ```
> t.c:7:21: note:  ***** Analysis succeeded with vector mode V4QI
> t.c:7:21: note:  Comparing two main loops (V4QI at VF 1 vs V8QI at VF 2)
> t.c:7:21: note:  Issue info for V4QI loop:
> t.c:7:21: note:    load operations = 2
> t.c:7:21: note:    store operations = 1
> t.c:7:21: note:    general operations = 4
> t.c:7:21: note:    reduction latency = 0
> t.c:7:21: note:    estimated min cycles per iteration = 2.000000
> t.c:7:21: note:  Issue info for V8QI loop:
> t.c:7:21: note:    load operations = 12
> t.c:7:21: note:    store operations = 1
> t.c:7:21: note:    general operations = 6
> t.c:7:21: note:    reduction latency = 0
> t.c:7:21: note:    estimated min cycles per iteration = 4.333333
> t.c:7:21: note:  Weighted cycles per iteration of V4QI loop ~= 4.000000
> t.c:7:21: note:  Weighted cycles per iteration of V8QI loop ~= 4.333333
> t.c:7:21: note:  Preferring loop with lower cycles per iteration
> t.c:7:21: note:  ***** Preferring vector mode V4QI to vector mode V8QI
> ```
>
> That is totally wrong: instead of vectorizing using V8QI, we
> vectorize using V4QI, and the resulting code is worse.
>
> Attached is my current patch for adding V4QI/V2HI to the aarch64
> backend (note that I have not finished the changelog or the
> testcases; I already have follow-up patches that add the testcases).
> Is there something I am missing here, or are we just overestimating
> the V8QI cost?  If so, is it something easy to fix?

Trying it locally, I get:

foo.c:15:23: note:  ***** Analysis succeeded with vector mode V4QI
foo.c:15:23: note:  Comparing two main loops (V4QI at VF 1 vs V8QI at VF 2)
foo.c:15:23: note:  Issue info for V4QI loop:
foo.c:15:23: note:    load operations = 2
foo.c:15:23: note:    store operations = 1
foo.c:15:23: note:    general operations = 4
foo.c:15:23: note:    reduction latency = 0
foo.c:15:23: note:    estimated min cycles per iteration = 2.000000
foo.c:15:23: note:  Issue info for V8QI loop:
foo.c:15:23: note:    load operations = 8
foo.c:15:23: note:    store operations = 1
foo.c:15:23: note:    general operations = 6
foo.c:15:23: note:    reduction latency = 0
foo.c:15:23: note:    estimated min cycles per iteration = 3.000000
foo.c:15:23: note:  Weighted cycles per iteration of V4QI loop ~= 4.000000
foo.c:15:23: note:  Weighted cycles per iteration of V8QI loop ~= 3.000000
foo.c:15:23: note:  Preferring loop with lower cycles per iteration

The function is:

#include <stdint.h>

extern void
wplus (uint16_t *d, uint8_t *restrict pix1, uint8_t *restrict pix2)
{
    for (int y = 0; y < 4; y++)
    {
        for (int x = 0; x < 4; x++)
            d[x + y*4] = pix1[x] + pix2[x];
        pix1 += 16;
        pix2 += 16;
    }
}
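
Spelling out the access pattern (this annotation follows directly from
the 16-byte stride), one V8QI vector iteration covers two scalar y
iterations:

/* Accesses for y = 0 and y = 1 (one vector iteration at VF 2):
     reads : pix1[0..3] and pix1[16..19] (likewise for pix2)
     writes: d[0..7] (8 uint16_t elements)
   pix1[4..15] falls in the gap, so the vectorizer sees a grouped
   load with gaps.  */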

For V8QI we need a VF of 2, so that there are 8 elements to store to d.
Conceptually, we handle those two iterations by loading 4 V8QIs from
pix1 and pix2 (32 bytes from each pointer), with mitigations against
overrun, and then permuting the results down to single V8QIs, roughly
as in the sketch below.
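
A minimal GNU C sketch of that per-pointer load scheme (the helper
name and lane numbering are illustrative only, not what the vectorizer
actually emits, and alignment is glossed over):

#include <stdint.h>

typedef uint8_t v8qi __attribute__ ((vector_size (8)));

/* Model of the four V8QI loads costed for pix1 (pix2 is handled the
   same way).  Only A and C feed the permutation; B and D are dead.  */
static v8qi
load_pix1_group (const uint8_t *pix1)
{
  v8qi a = *(const v8qi *) (pix1 + 0);   /* pix1[0..7], lanes 0-3 live */
  v8qi b = *(const v8qi *) (pix1 + 8);   /* pix1[8..15], unused */
  v8qi c = *(const v8qi *) (pix1 + 16);  /* pix1[16..23], lanes 0-3 live */
  v8qi d = *(const v8qi *) (pix1 + 24);  /* pix1[24..31], unused */
  (void) b; (void) d;
  /* Select pix1[0..3] and pix1[16..19] into one V8QI.  */
  return __builtin_shufflevector (a, c, 0, 1, 2, 3, 8, 9, 10, 11);
}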

vectorizable_load doesn't seem to be smart enough to realise that only
2 of those 4 loads are actually used in the permutation, and so only 2
loads should be costed for each of pix1 and pix2.

Thanks,
Richard
