On Fri, Jul 12, 2024 at 4:42 AM Andrew Pinski <pins...@gmail.com> wrote:
>
> On Thu, Jul 11, 2024 at 2:14 AM Richard Biener
> <richard.guent...@gmail.com> wrote:
> >
> > On Thu, Jul 11, 2024 at 10:58 AM Richard Sandiford
> > <richard.sandif...@arm.com> wrote:
> > >
> > > Andrew Pinski <pins...@gmail.com> writes:
> > > > I need some help with the vector cost model for aarch64.
> > > > I am adding V2HI and V4QI mode support by emulating them with the
> > > > native V4HI/V8QI instructions (similar to how MMX is emulated using
> > > > SSE on x86). The problem is I am running into a cost model issue with
> > > > gcc.target/aarch64/pr98772.c (wminus is similar to
> > > > gcc.dg/vect/slp-gap-1.c, just with slightly different offsets for the
> > > > address).
> > > > It seems like the cost model is overestimating the number of loads
> > > > for the V8QI case.
> > > > With the new cost model usage (-march=armv9-a+nosve), I get:
> > > > ```
> > > > t.c:7:21: note:  ***** Analysis succeeded with vector mode V4QI
> > > > t.c:7:21: note:  Comparing two main loops (V4QI at VF 1 vs V8QI at VF 2)
> > > > t.c:7:21: note:  Issue info for V4QI loop:
> > > > t.c:7:21: note:    load operations = 2
> > > > t.c:7:21: note:    store operations = 1
> > > > t.c:7:21: note:    general operations = 4
> > > > t.c:7:21: note:    reduction latency = 0
> > > > t.c:7:21: note:    estimated min cycles per iteration = 2.000000
> > > > t.c:7:21: note:  Issue info for V8QI loop:
> > > > t.c:7:21: note:    load operations = 12
> > > > t.c:7:21: note:    store operations = 1
> > > > t.c:7:21: note:    general operations = 6
> > > > t.c:7:21: note:    reduction latency = 0
> > > > t.c:7:21: note:    estimated min cycles per iteration = 4.333333
> > > > t.c:7:21: note:  Weighted cycles per iteration of V4QI loop ~= 4.000000
> > > > t.c:7:21: note:  Weighted cycles per iteration of V8QI loop ~= 4.333333
> > > > t.c:7:21: note:  Preferring loop with lower cycles per iteration
> > > > t.c:7:21: note:  ***** Preferring vector mode V4QI to vector mode V8QI
> > > > ```
> > > >
> > > > That is totally wrong: instead of vectorizing using V8QI, we
> > > > vectorize using V4QI, and the resulting code is worse.
> > > >
> > > > Attached is my current patch for adding V4QI/V2HI to the aarch64
> > > > backend (note I have not finished the changelog or the testcases;
> > > > I have follow-up patches that add the testcases).
> > > > Is there something I am missing here, or are we just overestimating
> > > > the V8QI cost, and is that easy to fix?
> > >
> > > Trying it locally, I get:
> > >
> > > foo.c:15:23: note:  ***** Analysis succeeded with vector mode V4QI
> > > foo.c:15:23: note:  Comparing two main loops (V4QI at VF 1 vs V8QI at VF 2)
> > > foo.c:15:23: note:  Issue info for V4QI loop:
> > > foo.c:15:23: note:    load operations = 2
> > > foo.c:15:23: note:    store operations = 1
> > > foo.c:15:23: note:    general operations = 4
> > > foo.c:15:23: note:    reduction latency = 0
> > > foo.c:15:23: note:    estimated min cycles per iteration = 2.000000
> > > foo.c:15:23: note:  Issue info for V8QI loop:
> > > foo.c:15:23: note:    load operations = 8
> > > foo.c:15:23: note:    store operations = 1
> > > foo.c:15:23: note:    general operations = 6
> > > foo.c:15:23: note:    reduction latency = 0
> > > foo.c:15:23: note:    estimated min cycles per iteration = 3.000000
> > > foo.c:15:23: note:  Weighted cycles per iteration of V4QI loop ~= 4.000000
> > > foo.c:15:23: note:  Weighted cycles per iteration of V8QI loop ~= 3.000000
> > > foo.c:15:23: note:  Preferring loop with lower cycles per iteration
> > >
> > > The function is:
> > >
> > > extern void
> > > wplus (uint16_t *d, uint8_t *restrict pix1, uint8_t *restrict pix2 )
> > > {
> > >     for (int y = 0; y < 4; y++ )
> > >     {
> > >         for (int x = 0; x < 4; x++ )
> > >             d[x + y*4] = pix1[x] + pix2[x];
> > >         pix1 += 16;
> > >         pix2 += 16;
> > >     }
> > > }
> > >
> > > For V8QI we need a VF of 2, so that there are 8 elements to store to d.
> > > Conceptually, we handle those two iterations by loading 4 V8QIs from
> > > pix1 and pix2 (32 bytes each), with mitigations against overrun,
> > > and then permuting the results into single V8QIs.
> > >
> > > vectorizable_load doesn't seem to be smart enough to realise that only 2
> > > of those 4 loads are actually used in the permutation, and so only 2
> > > loads should be costed for each of pix1 and pix2.
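
For reference, here is a hand-written sketch of that conceptual strategy
using GNU vector extensions, with the offsets taken from the dump below.
This illustrates the permute only; it is not the vectorizer's actual
output, and it ignores alignment and the narrowed tail load:

```c
typedef unsigned char v8qi __attribute__ ((vector_size (8)));

/* Conceptual V8QI handling at VF 2 for one of the pointers: row 0 is at
   pix[0..3] and row 1 at pix[16..19], so two 8-byte loads cover both
   rows and a permute keeps lanes 0..3 of each.  (In the real code the
   last load is narrowed, e.g. to 4 bytes, to avoid overrunning the
   buffer.)  */
static inline v8qi
load_two_rows (const unsigned char *pix)
{
  v8qi a = *(const v8qi *) pix;         /* only lanes 0..3 are used */
  v8qi b = *(const v8qi *) (pix + 16);  /* only lanes 0..3 are used */
  v8qi mask = { 0, 1, 2, 3, 8, 9, 10, 11 };
  return __builtin_shuffle (a, b, mask);
}
```

So only two loads per pointer feed the permute; anything else is dead.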
> >
> > Though it has code to do that.
>
> So, looking into this a little further: yes, there is code that does it,
> but it still adds the extra loads and then removes them, and the
> costing is done before the extra loads are removed.
>
> From an unmodified trunk:
> ```
> /app/example.cpp:2:21: note:   add new stmt: vect__34.7_15 = MEM
> <vector(8) unsigned char> [(unsigned char *)vectp_pix1.5_17];
> /app/example.cpp:2:21: note:   add new stmt: vectp_pix1.5_14 =
> vectp_pix1.5_17 + 8;
> /app/example.cpp:2:21: note:   add new stmt: vect__34.8_13 = MEM
> <vector(8) unsigned char> [(unsigned char *)vectp_pix1.5_14];
> /app/example.cpp:2:21: note:   add new stmt: vectp_pix1.5_12 =
> vectp_pix1.5_14 + 8;
> Applying pattern match.pd:2922, gimple-match-2.cc:7467
> gimple_simplified to vectp_pix1.5_12 = vectp_pix1.5_17 + 16;
> /app/example.cpp:2:21: note:   add new stmt: _11 = MEM <unsigned int>
> [(unsigned char *)vectp_pix1.5_12];
> /app/example.cpp:2:21: note:   add new stmt: _10 = {_11, 0};
> /app/example.cpp:2:21: note:   add new stmt: vect__34.9_9 =
> VIEW_CONVERT_EXPR<vector(8) unsigned char>(_10);
> /app/example.cpp:2:21: note:   add new stmt: vectp_pix1.5_8 =
> vectp_pix1.5_12 + 8;
> Applying pattern match.pd:2922, gimple-match-2.cc:7467
> gimple_simplified to vectp_pix1.5_8 = vectp_pix1.5_17 + 24;
> /app/example.cpp:2:21: note:   add new stmt: vect__34.10_7 = {};
> /app/example.cpp:2:21: note:   add new stmt: vect__34.11_6 =
> VEC_PERM_EXPR <vect__34.7_15, vect__34.9_9, { 0, 1, 2, 3, 8, 9, 10, 11
> }>;
> ```
>
> The load that goes into `vect__34.8_13` is unused. It looks like that
> load is being costed even though it is not there in the end.
> Maybe I am misunderstanding the output here, though.

Ah, yes. I think the code correctly counts the number of loads needed,
but since we now cost emitted loads immediately, the computed count is
probably ignored; that may only have become the case after the last cost
vs. code-gen reorg, though.
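
To make that concrete, a simplified pseudocode sketch of the suspected
ordering (hand-written, not actual GCC source; record_stmt_cost is real,
but compute_needed_loads and emit_vector_load are hypothetical stand-ins
for the load path):

```c
/* Hypothetical simplification: the needed-load count is computed
   correctly, but each emitted load is costed as it is generated, so a
   load that later turns out to be dead has already been charged and the
   computed count never feeds back into the recorded cost.  */
unsigned needed = compute_needed_loads (group_size, vf);
for (unsigned i = 0; i < ncopies; i++)
  {
    emit_vector_load (i);                       /* may be dead after cleanup */
    record_stmt_cost (cost_vec, 1, vector_load,
                      stmt_info, 0, vect_body); /* charged immediately */
  }
```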

Richard.

> Thanks,
> Andrew
>
> >
> > Richard.
> >
> > > Thanks,
> > > Richard
