https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117722
--- Comment #12 from Li Pan <pan2.li at intel dot com> ---
(In reply to Robin Dapp from comment #11)
> (In reply to Li Pan from comment #9)
> > Created attachment 59663 [details]
> > before_vs_after when outer loop is 128
>
> Ok, that's a different loop then. I'm seeing vmv1rs in the current version,
> is that what you're referring to as problematic? Do they result from the
> lack of overlap constraints? I'd prefer a bit more context rather than just
> code dumps :)
Oh, I forgot this. The code and build options for the above png are listed below.
#include <stdint.h>
#include <stdlib.h>

#define T1 uint8_t
#define T2 int32_t

T2
foo (T2 * restrict op_0, T1 * restrict op_1,
     T1 * restrict op_2, T2 op_3, T2 op_4)
{
  T2 sum = 0;
  for (unsigned i = 0; i < 128; i++) // x264_pixel_sad_4x4 is i < 4.
    {
      for (unsigned k = 0; k < 8; k++)
        sum += abs (op_1[k] - op_2[k]);

      op_1 += op_3;
      op_2 += op_4;
    }

  return sum;
}
-O3 -march=rv64gcv -mabi=lp64d -c -S u_sad.c -o after.S -fno-schedule-insns -fno-schedule-insns2
-O3 -march=rv64gcv -mabi=lp64d -c -S u_sad.c -mno-vector-strict-align -o before.S -fno-schedule-insns -fno-schedule-insns2
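
For anyone who wants to sanity-check that both builds compute the same value, here is a minimal sketch of a scalar cross-check. It is not part of the test case: sad_ref, check_sad, the stride of 16 and the fill pattern are all hypothetical choices made only for illustration, and foo is copied from above with T1/T2 expanded.

```c
#include <stdint.h>
#include <stdlib.h>

/* Function under test, copied from the test case (T1 = uint8_t, T2 = int32_t). */
int32_t
foo (int32_t * restrict op_0, uint8_t * restrict op_1,
     uint8_t * restrict op_2, int32_t op_3, int32_t op_4)
{
  int32_t sum = 0;
  for (unsigned i = 0; i < 128; i++)
    {
      for (unsigned k = 0; k < 8; k++)
        sum += abs (op_1[k] - op_2[k]);

      op_1 += op_3;
      op_2 += op_4;
    }

  return sum;
}

/* Hypothetical reference: the same sum written with explicit indexing.  */
static int32_t
sad_ref (const uint8_t *a, const uint8_t *b, int32_t sa, int32_t sb)
{
  int32_t sum = 0;
  for (int i = 0; i < 128; i++)
    for (int k = 0; k < 8; k++)
      {
        int d = a[i * sa + k] - b[i * sb + k];
        sum += d < 0 ? -d : d;
      }
  return sum;
}

/* Hypothetical helper: returns 1 when foo agrees with the reference.  */
int
check_sad (void)
{
  enum { STRIDE = 16, ROWS = 128 };   /* assumed layout, 8 bytes used per row */
  static uint8_t a[STRIDE * ROWS], b[STRIDE * ROWS];

  for (int i = 0; i < STRIDE * ROWS; i++)
    {
      a[i] = (uint8_t) (i * 7 + 3);   /* arbitrary deterministic pattern */
      b[i] = (uint8_t) (i * 13 + 5);
    }

  return foo (NULL, a, b, STRIDE, STRIDE) == sad_ref (a, b, STRIDE, STRIDE);
}
```

Calling check_sad () from a small main and linking it against each of the two builds should return 1 in both cases if the generated code is correct.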