I need some help with the vector cost model for aarch64.
I am adding V2HI and V4QI mode support by emulating them with the
native V4HI/V8QI instructions (similar to how MMX is emulated via SSE
on x86). The
problem is I am running into a cost model issue with
gcc.target/aarch64/pr98772.c (wminus is similar to
gcc.dg/vect/slp-gap-1.c, just slightly different offsets for the
address).
It seems like the cost model is overestimating the number of loads for
the V8QI case.
With the new cost model usage (-march=armv9-a+nosve), I get:
```
t.c:7:21: note:  ***** Analysis succeeded with vector mode V4QI
t.c:7:21: note:  Comparing two main loops (V4QI at VF 1 vs V8QI at VF 2)
t.c:7:21: note:  Issue info for V4QI loop:
t.c:7:21: note:    load operations = 2
t.c:7:21: note:    store operations = 1
t.c:7:21: note:    general operations = 4
t.c:7:21: note:    reduction latency = 0
t.c:7:21: note:    estimated min cycles per iteration = 2.000000
t.c:7:21: note:  Issue info for V8QI loop:
t.c:7:21: note:    load operations = 12
t.c:7:21: note:    store operations = 1
t.c:7:21: note:    general operations = 6
t.c:7:21: note:    reduction latency = 0
t.c:7:21: note:    estimated min cycles per iteration = 4.333333
t.c:7:21: note:  Weighted cycles per iteration of V4QI loop ~= 4.000000
t.c:7:21: note:  Weighted cycles per iteration of V8QI loop ~= 4.333333
t.c:7:21: note:  Preferring loop with lower cycles per iteration
t.c:7:21: note:  ***** Preferring vector mode V4QI to vector mode V8QI
```

That choice is wrong: instead of vectorizing with V8QI we vectorize
with V4QI, and the resulting code is worse.

Attached is my current patch for adding V4QI/V2HI to the aarch64
backend (Note I have not finished up the changelog nor the testcases;
I have secondary patches that add the testcases already).
Is there something I am missing here, or are we just overestimating
the V8QI cost, and is that easy to fix?

Thanks,
Andrew

Attachment: 0001-RFC-aarch64-Start-to-support-v4qi-modes-for-SLP.patch