>> This patch introduces a vector cost model for the Spacemit-X60 core,
>> using dynamic LMUL scaling with the -madjust-lmul-cost flag.
>>
>> Compared to the previous patch, I dropped the local 'vector_lmul'
>> attribute and the corresponding LMUL-aware cost logic in spacemit-x60.md.
>> Instead, Spacemit-X60 tuning now enables -madjust-lmul-cost implicitly,
>> and riscv_sched_adjust_cost is updated so that the adjustment applies to
>> spacemit_x60 in addition to the generic out-of-order model.
>>
>> The stress tests I previously used to tune individual instruction costs
>> (with the LMUL-aware logic implemented directly in spacemit-x60.md)
>> now show a regression in performance. The most likely cause is the
>> implicit -madjust-lmul-cost scaling: some instructions performed better
>> with non-power-of-two scaling (or with no LMUL scaling at all), so the
>> uniform ×(1,2,4,8) adjustment hurts them.
>>
>> Updated performance results:
>>
>> | Benchmark        | Metric | Trunk           | Vector Cost Model | Δ (%)   |
>> |------------------|--------|-----------------|-------------------|---------|
>> | SciMark2-C       | cycles | 311,450,555,453 | 313,278,899,107   | +0.56%  |
>> |------------------|--------|-----------------|-------------------|---------|
>> | tramp3d-v4       | cycles | 23,788,980,247  | 21,073,526,428    | -12.89% |
>> |------------------|--------|-----------------|-------------------|---------|
>> | Freebench/neural | cycles | 471,707,641     | 435,842,612       | -8.23%  |
>> |------------------|--------|-----------------|-------------------|---------|
>>
>> Benchmarks were run from the LLVM test-suite
>> (MultiSource/Benchmarks) using:
>>
>> taskset -c 0 perf stat -r 10 ./...

> How sure are we about these results? It has been notoriously difficult to
> obtain reliable benchmark numbers on the BPI. Do the results hold after a
> reboot or on the next day? What about an even higher number of iterations?

I repeated the measurements using perf stat on multiple isolated cores,
including runs after a reboot and on different days. Increasing the number
of iterations from -r 10 to -r 100 did not change the outcome.

> I find it difficult to understand why two benchmarks improve a lot more
> and one regresses. If the LMUL scaling is incorrect, wouldn't we expect
> similar behavior for all three? Or does SciMark have a different
> footprint WRT instructions and e.g. uses some insns more for which the
> uniform scaling doesn't hold?

In the generated code for SciMark2, the compiler selects almost exclusively
LMUL=M1 (only two MF2 occurrences in the whole assembly), so LMUL scaling
itself is effectively a no-op here. My assumption is therefore that the
difference in performance is caused by the base M1 latencies.

In the previous MD model, the measured load latency did not follow a
power-of-two relationship across LMULs (M1=3, M2=4, M4=8, M8=16). To make
this compatible with the dynamic -madjust-lmul-cost scaling, I normalized
M1 to 2 so that the higher LMULs can be approximated as ×2, ×4, ×8.
Keeping the measured base of M1=3 would instead give 3/6/12/24 for
M1/M2/M4/M8, which deviates far more from the measured 4/8/16 at the
higher LMULs than lowering M1 from 3 to 2 does. This improves the fit for
M2/M4/M8, but likely reduces accuracy for the dominant M1 case.

Nikola
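
For illustration only (this is not code from the patch), here is a small
standalone C program that compares the measured load latencies quoted
above against the two uniform power-of-two scalings discussed, i.e. the
measured base M1=3 versus the normalized base M1=2:

/* Illustration only -- not code from the patch.  Compare the measured
   X60 vector load latencies with two uniform power-of-two scalings:
   the measured base M1=3 (giving 3/6/12/24) and the normalized base
   M1=2 (giving 2/4/8/16).  */
#include <stdio.h>
#include <stdlib.h>

int
main (void)
{
  static const int lmul[4] = { 1, 2, 4, 8 };       /* M1, M2, M4, M8 */
  static const int measured[4] = { 3, 4, 8, 16 };  /* measured latencies */

  for (int base = 2; base <= 3; base++)
    {
      int total_error = 0;
      printf ("base M1=%d:", base);
      for (int i = 0; i < 4; i++)
        {
          /* Uniform x1/x2/x4/x8 LMUL scaling of the base latency.  */
          int approx = base * lmul[i];
          total_error += abs (approx - measured[i]);
          printf (" M%d=%d (measured %d)", lmul[i], approx, measured[i]);
        }
      printf ("  total abs. error = %d\n", total_error);
    }
  return 0;
}

With these numbers, the base of 2 is off by one cycle only for M1, while
the base of 3 is off by 2/4/8 cycles for M2/M4/M8, which is the trade-off
described above.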
