Hi Maxim,
> It appears that cores with autoprefetcher hardware prefer loads and stores
> bundled together, not interspersed with other instructions to occupy the
> rest of CPU units.
I don't believe it is as simple as that - modern cores have multiple
prefetchers but won't prefer bundling loads and stores in large blocks.
That would result in terrible performance due to dispatch and issue stalls.
Also the increased register pressure could cause extra spilling. If we group
loads and stores, we'd definitely need to limit them to say 4 or so at most,
and then interleave ALU operations.
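
To make the "4 or so, then interleave" idea concrete, here is a toy sketch.
It is not what the scheduler or the patch does: it assumes every operation
is independent (no data dependencies or latencies), and the names
interleave, max_group and max_mem_run are made up for illustration.

```python
def interleave(ops, max_group=4):
    """Emit memory ops in groups of at most max_group, spreading ALU ops
    as evenly as possible after each group.

    ops is a list of (kind, name) pairs with kind "mem" or "alu".
    Toy model only: all ops are assumed independent, so any order is legal.
    """
    mem = [op for op in ops if op[0] == "mem"]
    alu = [op for op in ops if op[0] == "alu"]
    groups = [mem[i:i + max_group] for i in range(0, len(mem), max_group)]
    if not groups:
        return alu

    per_gap, extra = divmod(len(alu), len(groups))
    schedule, pos = [], 0
    for i, group in enumerate(groups):
        schedule.extend(group)
        # The first `extra` gaps take one additional ALU op each.
        take = per_gap + (1 if i < extra else 0)
        schedule.extend(alu[pos:pos + take])
        pos += take
    return schedule


def max_mem_run(schedule):
    """Length of the longest run of consecutive memory ops."""
    longest = current = 0
    for kind, _ in schedule:
        current = current + 1 if kind == "mem" else 0
        longest = max(longest, current)
    return longest


ops = [("mem", f"ldr{i}") for i in range(8)] + \
      [("alu", f"add{i}") for i in range(4)]
for kind, name in interleave(ops):
    print(kind, name)
```

Note the cap only holds when there are enough ALU operations to separate the
groups; with too few, adjacent memory groups end up back to back.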
> Autoprefetching heuristic is enabled only for cores that support it, and
> isn't active by default.
It's enabled on most cores, including the default (generic). So we do have
to be careful that this doesn't regress any other benchmarks or do worse on
modern cores.
Cheers,
Wilco