On Thu, Jul 11, 2024 at 10:58 AM Richard Sandiford
<richard.sandif...@arm.com> wrote:
>
> Andrew Pinski <pins...@gmail.com> writes:
> > I need some help with the vector cost model for aarch64.
> > I am adding V2HI and V4QI mode support by emulating it using the
> > native V4HI/V8QI instructions (similar to how MMX is done with SSE on
> > x86). The problem is I am running into a cost model issue with
> > gcc.target/aarch64/pr98772.c (wminus is similar to
> > gcc.dg/vect/slp-gap-1.c, just slightly different offsets for the
> > address).
> > It seems like the cost model is overestimating the number of loads for
> > the V8QI case.
> > With the new cost model usage (-march=armv9-a+nosve), I get:
> > ```
> > t.c:7:21: note:  ***** Analysis succeeded with vector mode V4QI
> > t.c:7:21: note:  Comparing two main loops (V4QI at VF 1 vs V8QI at VF 2)
> > t.c:7:21: note:  Issue info for V4QI loop:
> > t.c:7:21: note:    load operations = 2
> > t.c:7:21: note:    store operations = 1
> > t.c:7:21: note:    general operations = 4
> > t.c:7:21: note:    reduction latency = 0
> > t.c:7:21: note:    estimated min cycles per iteration = 2.000000
> > t.c:7:21: note:  Issue info for V8QI loop:
> > t.c:7:21: note:    load operations = 12
> > t.c:7:21: note:    store operations = 1
> > t.c:7:21: note:    general operations = 6
> > t.c:7:21: note:    reduction latency = 0
> > t.c:7:21: note:    estimated min cycles per iteration = 4.333333
> > t.c:7:21: note:  Weighted cycles per iteration of V4QI loop ~= 4.000000
> > t.c:7:21: note:  Weighted cycles per iteration of V8QI loop ~= 4.333333
> > t.c:7:21: note:  Preferring loop with lower cycles per iteration
> > t.c:7:21: note:  ***** Preferring vector mode V4QI to vector mode V8QI
> > ```
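
(As I read the "Weighted cycles" figures, each loop's estimated cycles per
iteration is scaled by the other loop's VF so that both numbers cover the
same amount of scalar work.  That is an inference from the dump rather than
from the backend source, but it reproduces the figures above:

```
V4QI: 2.000000 cycles/iter * 2 (VF of the V8QI loop) = 4.000000
V8QI: 4.333333 cycles/iter * 1 (VF of the V4QI loop) = 4.333333
```

so with the V8QI loads overcounted, V4QI comes out marginally cheaper and is
preferred.)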
> >
> > That is totally wrong: instead of vectorizing using V8QI we
> > vectorize using V4QI, and the resulting code is worse.
> >
> > Attached is my current patch for adding V4QI/V2HI to the aarch64
> > backend (note I have not finished the changelog or the testcases yet;
> > I have follow-up patches that already add the testcases).
> > Is there something I am missing here, or are we just overestimating the
> > V8QI cost, and is it something easy to fix?
>
> Trying it locally, I get:
>
> foo.c:15:23: note:  ***** Analysis succeeded with vector mode V4QI
> foo.c:15:23: note:  Comparing two main loops (V4QI at VF 1 vs V8QI at VF 2)
> foo.c:15:23: note:  Issue info for V4QI loop:
> foo.c:15:23: note:    load operations = 2
> foo.c:15:23: note:    store operations = 1
> foo.c:15:23: note:    general operations = 4
> foo.c:15:23: note:    reduction latency = 0
> foo.c:15:23: note:    estimated min cycles per iteration = 2.000000
> foo.c:15:23: note:  Issue info for V8QI loop:
> foo.c:15:23: note:    load operations = 8
> foo.c:15:23: note:    store operations = 1
> foo.c:15:23: note:    general operations = 6
> foo.c:15:23: note:    reduction latency = 0
> foo.c:15:23: note:    estimated min cycles per iteration = 3.000000
> foo.c:15:23: note:  Weighted cycles per iteration of V4QI loop ~= 4.000000
> foo.c:15:23: note:  Weighted cycles per iteration of V8QI loop ~= 3.000000
> foo.c:15:23: note:  Preferring loop with lower cycles per iteration
>
> The function is:
>
> extern void
> wplus (uint16_t *d, uint8_t *restrict pix1, uint8_t *restrict pix2 )
> {
>     for (int y = 0; y < 4; y++ )
>     {
>         for (int x = 0; x < 4; x++ )
>             d[x + y*4] = pix1[x] + pix2[x];
>         pix1 += 16;
>         pix2 += 16;
>     }
> }
>
> For V8QI we need a VF of 2, so that there are 8 elements to store to d.
> Conceptually, we handle those two iterations by loading 4 V8QIs from
> pix1 and pix2 (32 bytes each), with mitigations against overrun,
> and then permuting the results into single V8QIs.
>
> vectorizable_load doesn't seem to be smart enough to realise that only 2
> of those 4 loads are actually used in the permutation, and so only 2
> loads should be costed for each of pix1 and pix2.

Though it has code to do that.
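
To make the expected shape concrete, here is a rough NEON-intrinsics sketch
of one V8QI iteration at VF 2, showing only the two loads per source pointer
that actually feed the permute (the helper name and the particular permute
sequence are mine, purely for illustration; it is not what the vectorizer
emits):

```c
#include <arm_neon.h>
#include <stdint.h>

/* Illustrative only: one VF-2 iteration of wplus using V8QI.  Bytes 0-3
   and 16-19 of each source pointer are the only ones the loop uses.  */
static inline void
wplus_v8qi_iter (uint16_t *d, const uint8_t *pix1, const uint8_t *pix2)
{
  /* Two 8-byte loads per pointer; only the low halves are needed.  */
  uint8x8_t p1a = vld1_u8 (pix1);        /* pix1[0..7]   */
  uint8x8_t p1b = vld1_u8 (pix1 + 16);   /* pix1[16..23] */
  uint8x8_t p2a = vld1_u8 (pix2);
  uint8x8_t p2b = vld1_u8 (pix2 + 16);

  /* Permute the two low 4-byte halves into a single V8QI per pointer.  */
  uint8x8_t p1 = vreinterpret_u8_u32
    (vzip_u32 (vreinterpret_u32_u8 (p1a), vreinterpret_u32_u8 (p1b)).val[0]);
  uint8x8_t p2 = vreinterpret_u8_u32
    (vzip_u32 (vreinterpret_u32_u8 (p2a), vreinterpret_u32_u8 (p2b)).val[0]);

  /* Widening add and a single store of 8 uint16_t.  */
  vst1q_u16 (d, vaddl_u8 (p1, p2));
}
```

So roughly 4 loads in total (plus the permutes, the widening add and one
store) should be enough per vector iteration, which is why costing 8 or 12
loads for the V8QI loop looks like an overestimate.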

Richard.

> Thanks,
> Richard
