On 2/20/2026 11:57 AM, Robin Dapp wrote:
I repeated the measurements using perf stat on multiple isolated
cores, including runs after reboot and on different days. Increasing
the number of iterations from -r 10 to -r 100 did not change the outcome.
Thanks, that's good to know.
Agreed.  That's probably an indication these benchmarks aren't particularly memory intensive.  My working theory on the extreme jitter observed on the K1 is the memory subsystem.  It's got a shared L2 across the cores and no L3.  My speculation is that latency to the L2 varies based on the particular slice being accessed and which core is accessing it, and similarly for main memory depending on precisely how the controllers are laid out.  But that's all speculation.

What was clear was the 10%+ jitter on some of the spec components and those components are ones that show a high sensitivity to cache size.

In the generated code for SciMark2, the compiler selects almost
exclusively LMUL=M1 (only two MF2 occurrences in the whole assembly),
so LMUL scaling itself is effectively a no-op here. Therefore, my assumption
is that the difference in performance is caused by the base M1 latencies.

In the previous MD model, the measured load latency did not follow a
power-of-two relationship across LMULs (M1=3, M2=4, M4=8, M8=16).
To make this compatible with the dynamic -madjust-lmul-cost scaling,
I normalized M1 to 2 so higher LMULs could be approximated as ×2, ×4,
etc. Otherwise this would result in 3/6/12/24 for M1/M2/M4/M8, which
deviates significantly more from the measured 4/8/16 at higher LMULs
than adjusting M1 from 3 to 2. This improves the fit for M2/M4/M8, but
likely reduces accuracy for the dominant M1 case.
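For concreteness, the trade-off between the two normalizations can be checked with a quick script (a sketch; only the measured latencies quoted above are taken from the actual model):

```python
# Compare two ways of fitting linear (power-of-two) LMUL scaling to the
# measured load latencies quoted above (M1=3, M2=4, M4=8, M8=16).
measured = {1: 3, 2: 4, 4: 8, 8: 16}

def model(base):
    """Latency model: M1 base latency scaled linearly with LMUL."""
    return {lmul: base * lmul for lmul in measured}

for base in (3, 2):
    m = model(base)
    err = {lmul: abs(m[lmul] - measured[lmul]) for lmul in measured}
    print(f"M1 base {base}: modeled {m}, abs error {err}, total {sum(err.values())}")
```

With base 3 the model gives 3/6/12/24 (total absolute error 14 against the measurements); with base 2 it gives 2/4/8/16 (total error 1, all of it on the dominant M1 case), which is the trade-off described above.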
Hmm, so we have both cases then:  One where modelling the latency exactly
helps (Scimark), and one where modelling exactly is significantly worse (the
two others) :)
And the one where it helps is the one where it sounds like the vast majority of vector access is LMUL1.  To me that would seem to indicate the >LMUL1 case isn't working well.   That could be a modeling problem or it could be that the modeling in turn causes a secondary effect that affects performance.

Insn scheduling is always a heuristic.  How this is usually approached is to
benchmark a large number of tests/applications and check which setting performs
best overall.  Without including SPEC and others we might be in
overfitting territory.
Possible.  My biggest problem with getting data on spec is the jitter.  When Austin and I were looking at this the jitter made it virtually impossible to draw conclusions about whether or not any given change was an improvement.  We're looking for effects in the 1-3% range, but with jitter at >10% you just need too many runs to get good confidence in the results.

Point being if these changes aren't significant improvements, then we'll be in the same scenario.  We can still go forward, but we won't have significant confidence that the changes are good.  This is especially true when there is uncertainty in expectations.
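To put rough numbers on that, here's a back-of-the-envelope power calculation (a sketch, assuming independent, roughly normal run-to-run noise and a standard two-sample comparison at ~95% confidence and ~80% power; the noise and effect sizes are the ones mentioned above):

```python
import math

def runs_needed(noise_pct, effect_pct, z_alpha=1.96, z_beta=0.84):
    """Rough per-configuration run count needed to detect a mean
    difference of `effect_pct` when run-to-run noise has a standard
    deviation of `noise_pct`.  Standard two-sample size formula:
    n ~= 2 * ((z_alpha + z_beta) * sigma / delta)^2."""
    return math.ceil(2 * ((z_alpha + z_beta) * noise_pct / effect_pct) ** 2)

for noise in (2, 10):
    for effect in (1, 3):
        print(f"noise {noise}%, effect {effect}%: "
              f"~{runs_needed(noise, effect)} runs per config")
```

The run count grows with the square of the noise/effect ratio, so ~10% jitter against a 1-3% effect pushes the required number of runs into the hundreds or thousands, which matches the "too many runs" experience described above.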


I'm not 100% sure how to continue here.  On the one hand, I'd like to avoid too
much manual twiddling for the sole purpose of getting LMUL latency right.
There's also still the issue of VLS modes with -mrvv-vector-bits=zvl.
Those would all get assigned LMUL1 latencies right now, while the hook would
use the proper scaling.
So for a "proper" solution there's still more work to be done:
Including lmul-scaling into the cost model, having a broader test base, maybe
adding a custom, per uarch, lmul-scale curve/factor, etc.
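One shape such a per-uarch curve could take (everything below is hypothetical and only illustrates the idea, including the mapping of VLS modes onto an LMUL group so they stop defaulting to LMUL1 latencies):

```python
# Hypothetical sketch of a per-uarch LMUL latency curve.  Instead of a
# single multiplicative lmul-scale factor, each uarch supplies its own
# measured per-LMUL multipliers relative to M1.
UARCH_LMUL_CURVE = {
    "generic": {1: 1.0, 2: 2.0, 4: 4.0, 8: 8.0},    # ideal linear scaling
    "example": {1: 1.0, 2: 1.33, 4: 2.67, 8: 5.33},  # e.g. measured 3/4/8/16
}

def vls_lmul(mode_bits, zvl_bits):
    """Map a fixed-size (VLS) mode to the LMUL group it occupies,
    e.g. a 512-bit mode on a zvl256b target occupies LMUL2."""
    return max(1, mode_bits // zvl_bits)

def load_latency(base_m1, lmul, uarch="generic"):
    """Scale the M1 base latency by the uarch's curve."""
    return base_m1 * UARCH_LMUL_CURVE[uarch][lmul]

assert vls_lmul(512, 256) == 2
assert load_latency(3, 8) == 24.0
```

A table like this degenerates to the current single-factor scaling for uarchs with linear behavior, while still letting a uarch with non-power-of-two measurements express them directly.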

On the other hand, your first patch clearly shows an improvement, and, even if
not optimal, would improve the status quo.  It's unlikely we have time for the
full solution, so maybe we should settle for a partial one for now?
I would expect that the larger the LMUL the less often it gets used in practice.   So getting things right for LMUL1 and LMUL2 is much more important than LMUL4 and LMUL8.  Or at least that would be my theory.

The non-linear scaling is a surprise and doesn't line up that well with camel-cdr's data.  His isn't perfectly linear either, but it's much closer to the expected 2X, 4X, 8X for LMUL2..LMUL8.  Philip Reames's investigation also tended to show close to expected scaling.

As for next steps.  I think the big question is what model should we use.  We've got conflicting data on how things scale for LMUL2..LMUL8 and that impacts how we go about modeling those cases. If it's close to linear scaling, then the lmul-scaling param works. If it isn't, then we have to look at uarch specific handling in one way or another.

Given we've got two data points arguing for linear scaling with LMUL and one against, and that linear scaling makes a lot of sense when you think about uarch implementations, I'd tend to think linear scaling is the way to go.

Jeff
