I need some help with the vector cost model for aarch64. I am adding V2HI and V4QI mode support by emulating them using the native V4HI/V8QI instructions (similar to how MMX is emulated using SSE on x86). The problem is I am running into a cost model issue with gcc.target/aarch64/pr98772.c (wminus is similar to gcc.dg/vect/slp-gap-1.c, just with slightly different offsets for the addresses). It seems like the cost model is overestimating the number of loads for the V8QI case. With the new cost model usage (-march=armv9-a+nosve), I get:
```
t.c:7:21: note: ***** Analysis succeeded with vector mode V4QI
t.c:7:21: note: Comparing two main loops (V4QI at VF 1 vs V8QI at VF 2)
t.c:7:21: note: Issue info for V4QI loop:
t.c:7:21: note: load operations = 2
t.c:7:21: note: store operations = 1
t.c:7:21: note: general operations = 4
t.c:7:21: note: reduction latency = 0
t.c:7:21: note: estimated min cycles per iteration = 2.000000
t.c:7:21: note: Issue info for V8QI loop:
t.c:7:21: note: load operations = 12
t.c:7:21: note: store operations = 1
t.c:7:21: note: general operations = 6
t.c:7:21: note: reduction latency = 0
t.c:7:21: note: estimated min cycles per iteration = 4.333333
t.c:7:21: note: Weighted cycles per iteration of V4QI loop ~= 4.000000
t.c:7:21: note: Weighted cycles per iteration of V8QI loop ~= 4.333333
t.c:7:21: note: Preferring loop with lower cycles per iteration
t.c:7:21: note: ***** Preferring vector mode V4QI to vector mode V8QI
```
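If I am reading the dump right, its numbers can be reproduced with a simple model: min cycles per iteration is the bottleneck of memory ops and general ops against the issue widths, and the weighting scales each loop to a common VF of 2. The issue widths below (3 memory ops/cycle, 2 general ops/cycle) and the function name are my assumptions to make the arithmetic line up, not taken from the tuning tables:

```python
# Hypothetical reconstruction of the dump's arithmetic; the issue widths
# (3 memory ops/cycle, 2 general ops/cycle) are assumed, chosen so the
# results match the dumped values.
def min_cycles(loads, stores, general, mem_per_cycle=3, gen_per_cycle=2):
    # Bottleneck of the memory pipeline vs the general-ops pipeline.
    return max((loads + stores) / mem_per_cycle, general / gen_per_cycle)

v4qi = min_cycles(2, 1, 4)    # matches the dumped 2.000000
v8qi = min_cycles(12, 1, 6)   # matches the dumped 4.333333 (= 13/3)

# Weighted to a common VF of 2: the V4QI loop runs twice per V8QI iteration.
print(v4qi * 2, v8qi * 1)     # 4.0 vs ~4.333, so V4QI "wins"
```

The point being: the comparison itself is self-consistent, so the suspicious input is the 12 loads counted for the V8QI loop.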
That is totally wrong: instead of vectorizing using V8QI, we vectorize using V4QI, and the resulting code is worse. Attached is my current patch adding V4QI/V2HI support to the aarch64 backend. (Note I have not finished the changelog or the testcases yet; I have secondary patches that add the testcases already.) Is there something I am missing here, or are we just overestimating the V8QI cost, and is it something easy to fix?

Thanks,
Andrew
0001-RFC-aarch64-Start-to-support-v4qi-modes-for-SLP.patch