Hi Maxim,
 
 > It appears that cores with autoprefetcher hardware prefer loads and stores
 > bundled together, not interspersed with other instructions to occupy the
 > rest of CPU units.
  
 I don't believe it is as simple as that - modern cores have multiple
 prefetchers but won't prefer bundling loads and stores in large blocks. That
 would result in terrible performance due to dispatch and issue stalls. Also
 the increased register pressure could cause extra spilling. If we group loads
 and stores, we'd definitely need to limit them to say 4 or so at most, and
 then interleave ALU operations.
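
 As a rough illustration of that grouping limit, here is a toy sketch in C.
 It is not GCC's scheduler and it ignores dependencies, latencies and real
 issue widths entirely. The function name interleave, the 'M'/'A' encoding
 and the MAX_MEM_GROUP value of 4 are all assumptions made up for this example.

```c
#include <stddef.h>

/* Hypothetical cap on adjacent memory ops (the "4 or so" above). */
#define MAX_MEM_GROUP 4

/* Toy model, not GCC code: emit `mem` memory ops ('M') and `alu` ALU ops
   ('A') into `out`, never placing more than MAX_MEM_GROUP memory ops in a
   row while ALU ops remain to separate them.  Dependencies are ignored.
   `out` must have room for mem + alu + 1 chars.  Returns the longest run
   of adjacent memory ops in the resulting order.  */
size_t interleave(size_t mem, size_t alu, char *out)
{
    size_t k = 0, run = 0, longest = 0;

    while (mem > 0 || alu > 0) {
        if (mem > 0 && (run < MAX_MEM_GROUP || alu == 0)) {
            /* Extend the current group of loads/stores.  */
            out[k++] = 'M';
            mem--;
            run++;
            if (run > longest)
                longest = run;
        } else {
            /* Group is full (or no memory ops left): issue an ALU op.  */
            out[k++] = 'A';
            alu--;
            run = 0;
        }
    }
    out[k] = '\0';
    return longest;
}
```

 With 8 memory ops and 8 ALU ops this gives two groups of 4 memory ops with
 an ALU op between them. A real heuristic would also have to weigh the
 register pressure and dispatch constraints mentioned above.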
 
 > Autoprefetching heuristic is enabled only for cores that support it, and
 > isn't active by default.
  
 It's enabled on most cores, including the default (generic). So we do have to
 be careful that this doesn't regress any other benchmarks or do worse on
 modern cores.
 
 Cheers,
 Wilco
  
     
