On 11/19/25 9:42 AM, Andi Kleen wrote:
>> I know I was pushing for it to be enabled more widely as it's painfully hard
>> to forward from a narrow store to a wider load. But based on earlier
>> discussions I've backed off that position.
> FWIW I would expect any slightly better OOO core aimed at general
> purpose code to have some form of hardware support for a subset of the
> cases.
The narrow store to wide load is the problem space, even for OOO cores.
I fully expect any modern performance core to forward when the load can
get all of its data from a single prior store.
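To make the two access shapes concrete, here is a minimal C sketch. The function names are hypothetical and only illustrate the patterns; an optimizing compiler may well remove the memory traffic entirely, so this is not a benchmark.

```c
#include <stdint.h>
#include <string.h>

/* Hard case: two 1-byte stores followed by one 2-byte load.
 * The load needs data from two separate prior stores, so no single
 * store can supply all of it. */
static uint16_t narrow_stores_wide_load(uint8_t *buf, uint8_t a, uint8_t b)
{
    uint16_t v;

    buf[0] = a;                 /* narrow store #1 */
    buf[1] = b;                 /* narrow store #2 */
    memcpy(&v, buf, sizeof v);  /* wide load spanning both stores */
    return v;
}

/* Easy case: one 2-byte store followed by a 1-byte load.
 * The load is fully contained in a single prior store. */
static uint8_t wide_store_narrow_load(uint8_t *buf, uint16_t v)
{
    memcpy(buf, &v, sizeof v);  /* wide store */
    return buf[0];              /* narrow load contained in that store */
}
```

The second shape is the one I'd expect to forward; the first is the one that's hard.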
> The rules can be very complicated. As an example see the diagram
> in https://chipsandcheese.com/p/a-peek-at-sapphire-rapids
> https://substackcdn.com/image/fetch/$s_!rESw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69b17f38-0631-424d-8e05-7988f9b174f6_2559x1214.png
They don't look significantly more complex than I expected. Essentially
if the load is contained within the store, then it's forwarded, with a
possible penalty if there isn't a perfect start match, but it's still
forwarded.
If there's a partial overlap then no store to load forwarding occurs and
you take that full 19c penalty.
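As a rough sketch of that rule, the classification comes down to an interval containment check. The enum and function names below are hypothetical, and this deliberately ignores anything else real hardware might care about (cache line crossings, access sizes, and so on); it only models "contained", "contained with a different start", and "partial overlap".

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical outcome names, not taken from any vendor documentation. */
enum fwd_outcome {
    FWD_NO_OVERLAP, /* load does not touch the store at all */
    FWD_ALIGNED,    /* load contained in the store, same start address */
    FWD_OFFSET,     /* load contained in the store, different start;
                       forwarded, possibly with a small penalty */
    FWD_BLOCKED     /* partial overlap: no forwarding, full penalty */
};

static enum fwd_outcome classify_forwarding(uintptr_t st_addr, size_t st_size,
                                            uintptr_t ld_addr, size_t ld_size)
{
    uintptr_t st_end = st_addr + st_size; /* one past the last stored byte */
    uintptr_t ld_end = ld_addr + ld_size; /* one past the last loaded byte */

    /* Disjoint byte ranges: the store is irrelevant to this load. */
    if (ld_end <= st_addr || st_end <= ld_addr)
        return FWD_NO_OVERLAP;

    /* Load entirely inside the store: forwardable. */
    if (ld_addr >= st_addr && ld_end <= st_end)
        return ld_addr == st_addr ? FWD_ALIGNED : FWD_OFFSET;

    /* Some, but not all, of the load's bytes come from the store. */
    return FWD_BLOCKED;
}
```

For example, a 4-byte store followed by an 8-byte load at the same address comes out as FWD_BLOCKED, which is exactly the narrow store to wide load case.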
jeff