On 2/20/2026 11:57 AM, Robin Dapp wrote:
I repeated the measurements using perf stat on multiple isolated
cores, including runs after reboot and on different days. Increasing
the number of iterations from -r 10 to -r 100 did not change the outcome.
Thanks, that's good to know.
Agreed.  That's probably an indication these benchmarks aren't particularly memory intensive.  My working theory on the extreme jitter observed on the K1 is the memory subsystem.  It's got a shared L2 across the cores and no L3.  My speculation is that latency to the L2 varies based on the particular slice being accessed and which core is accessing it, and similarly for main memory depending on precisely how the controllers are laid out.  But that's all speculation.

What was clear was the 10%+ jitter on some of the spec components and those components are ones that show a high sensitivity to cache size.

In the generated code for SciMark2, the compiler selects almost
exclusively LMUL=M1 (only two MF2 occurrences in the whole assembly),
so LMUL scaling itself is effectively a no-op here. Therefore, my assumption
is that the difference in performance is caused by the base M1 latencies.

In the previous MD model, the measured load latency did not follow a
power-of-two relationship across LMULs (M1=3, M2=4, M4=8, M8=16).
To make this compatible with the dynamic -madjust-lmul-cost scaling,
I normalized M1 to 2 so higher LMULs could be approximated as ×2, ×4,
etc. Otherwise this would result in 3/6/12/24 for M1/M2/M4/M8, which
deviates significantly more from the measured 4/8/16 at higher LMULs
than adjusting M1 from 3 to 2. This improves the fit for M2/M4/M8, but
likely reduces accuracy for the dominant M1 case.
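For concreteness, the trade-off between the two normalizations can be checked with a quick script (a sketch; only the measured latencies quoted above are taken from the actual model):

```python
# Compare two ways of fitting linear (power-of-two) LMUL scaling to the
# measured load latencies quoted above (M1=3, M2=4, M4=8, M8=16).
measured = {1: 3, 2: 4, 4: 8, 8: 16}

def model(base):
    """Latency model: M1 base latency scaled linearly with LMUL."""
    return {lmul: base * lmul for lmul in measured}

for base in (3, 2):
    m = model(base)
    err = {lmul: abs(m[lmul] - measured[lmul]) for lmul in measured}
    print(f"M1 base {base}: modeled {m}, abs error {err}, total {sum(err.values())}")
```

With base 3 the model gives 3/6/12/24 (total absolute error 14 against the measurements); with base 2 it gives 2/4/8/16 (total error 1, all of it on the dominant M1 case), which is the trade-off described above.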
Hmm, so we have both cases then:  One where modelling the latency exactly
helps (Scimark), and one where modelling exactly is significantly worse (the
two others) :)
And the one where it helps is the one where it sounds like the vast majority of vector access is LMUL1.  To me that would seem to indicate the >LMUL1 case isn't working well.   That could be a modeling problem or it could be that the modeling in turn causes a secondary effect that affects performance.

Insn scheduling is always a heuristic.  How this is usually approached is to
benchmark a large number of tests/applications and check which setting performs
best overall.  Without including SPEC and others we might be in
overfitting territory.
Possible.  My biggest problem with getting data on spec is the jitter.  When Austin and I were looking at this the jitter made it virtually impossible to draw conclusions about whether or not any given change was an improvement.  We're looking for effects in the 1-3% range, but with jitter at >10% you just need too many runs to get good confidence in the results.

Point being if these changes aren't significant improvements, then we'll be in the same scenario.  We can still go forward, but we won't have significant confidence that the changes are good.  This is especially true when there is uncertainty in expectations.
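To put rough numbers on that, here's a back-of-the-envelope power calculation (a sketch, assuming independent, roughly normal run-to-run noise and a standard two-sample comparison at ~95% confidence and ~80% power; the noise and effect sizes are the ones mentioned above):

```python
import math

def runs_needed(noise_pct, effect_pct, z_alpha=1.96, z_beta=0.84):
    """Rough per-configuration run count needed to detect a mean
    difference of `effect_pct` when run-to-run noise has a standard
    deviation of `noise_pct`.  Standard two-sample size formula:
    n ~= 2 * ((z_alpha + z_beta) * sigma / delta)^2."""
    return math.ceil(2 * ((z_alpha + z_beta) * noise_pct / effect_pct) ** 2)

for noise in (2, 10):
    for effect in (1, 3):
        print(f"noise {noise}%, effect {effect}%: "
              f"~{runs_needed(noise, effect)} runs per config")
```

The run count grows with the square of the noise/effect ratio, so ~10% jitter against a 1-3% effect pushes the required number of runs into the hundreds or thousands, which matches the "too many runs" experience described above.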


I'm not 100% sure how to continue here.  On the one hand, I'd like to avoid too
much manual twiddling for the sole purpose of getting LMUL latency right.
There's also still the issue of VLS modes with -mrvv-vector-bits=zvl.
Those would all get assigned LMUL1 latencies right now, while the hook would
use the proper scaling.
So for a "proper" solution there's still more work to be done:
Including lmul-scaling into the cost model, having a broader test base, maybe
adding a custom, per uarch, lmul-scale curve/factor, etc.
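One shape such a per-uarch curve could take (everything below is hypothetical and only illustrates the idea, including the mapping of VLS modes onto an LMUL group so they stop defaulting to LMUL1 latencies):

```python
# Hypothetical sketch of a per-uarch LMUL latency curve.  Instead of a
# single multiplicative lmul-scale factor, each uarch supplies its own
# measured per-LMUL multipliers relative to M1.
UARCH_LMUL_CURVE = {
    "generic": {1: 1.0, 2: 2.0, 4: 4.0, 8: 8.0},    # ideal linear scaling
    "example": {1: 1.0, 2: 1.33, 4: 2.67, 8: 5.33},  # e.g. measured 3/4/8/16
}

def vls_lmul(mode_bits, zvl_bits):
    """Map a fixed-size (VLS) mode to the LMUL group it occupies,
    e.g. a 512-bit mode on a zvl256b target occupies LMUL2."""
    return max(1, mode_bits // zvl_bits)

def load_latency(base_m1, lmul, uarch="generic"):
    """Scale the M1 base latency by the uarch's curve."""
    return base_m1 * UARCH_LMUL_CURVE[uarch][lmul]

assert vls_lmul(512, 256) == 2
assert load_latency(3, 8) == 24.0
```

A table like this degenerates to the current single-factor scaling for uarchs with linear behavior, while still letting a uarch with non-power-of-two measurements express them directly.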

On the other hand, your first patch clearly shows an improvement, and, even if
not optimal, would improve the status quo.  It's unlikely we have time for the
full solution, so maybe we should settle for a partial one for now?
I would expect that the larger the LMUL the less often it gets used in practice.   So getting things right for LMUL1 and LMUL2 is much more important than LMUL4 and LMUL8.  Or at least that would be my theory.

The non-linear scaling is a surprise and doesn't line up that well with camel-cdr's data.  His isn't perfectly linear either, but it's much closer to the expected 2X, 4X, 8X for LMUL2..LMUL8.  Philip Reames's investigation also tended to show close to expected scaling.

As for next steps.  I think the big question is what model should we use.  We've got conflicting data on how things scale for LMUL2..LMUL8 and that impacts how we go about modeling those cases. If it's close to linear scaling, then the lmul-scaling param works. If it isn't, then we have to look at uarch specific handling in one way or another.

Given we've got two data points arguing for linear scaling with LMUL and one against, and that linear scaling makes a lot of sense when you think about uarch implementations, I'd tend to think linear scaling is the way to go.

Jeff
