One vexing thing about modern CPUs is that a single CPU core cannot saturate 
RAM bandwidth. So, while throughput within L1 can be 100-200 GB/s, and RAM 
bandwidth might be 50-150 GB/s, the amount that just one core's L1 cache can 
pull in from L2/L3/DIMMs might be only 5-30 GB/s. This puts (often much 
simpler) single-threaded code, which could otherwise saturate RAM throughput 
(and thereby run as fast as possible anyway), at a major throughput 
disadvantage. But I totally agree with @mratsim here that staying with 
single-threaded code is the best idea.
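
To put rough numbers on that per-core ceiling, here is a back-of-the-envelope sketch using Little's law (sustained bandwidth is roughly bytes in flight divided by latency). The specific inputs are my assumptions for illustration, not measurements: about 10 outstanding L1 misses per core, 64-byte cache lines, 80 ns load-to-use latency from DRAM, and a 100 GB/s memory system. The function names are made up for this example, and real cores with hardware prefetchers can do noticeably better than this simple model suggests.

```python
import math


def per_core_bandwidth_gbs(outstanding_misses: int, line_bytes: int, latency_ns: float) -> float:
    """Little's law estimate: bytes in flight / latency.

    Bytes per nanosecond is numerically the same as GB/s (10^9 bytes per second).
    """
    return outstanding_misses * line_bytes / latency_ns


def cores_to_saturate(ram_gbs: float, per_core_gbs: float) -> int:
    """Smallest whole number of cores whose combined estimate reaches ram_gbs."""
    return math.ceil(ram_gbs / per_core_gbs)


# ASSUMED, illustrative inputs (not measured on any particular CPU):
one_core = per_core_bandwidth_gbs(outstanding_misses=10, line_bytes=64, latency_ns=80.0)
needed = cores_to_saturate(ram_gbs=100.0, per_core_gbs=one_core)

print(f"estimated single-core bandwidth: {one_core:.1f} GB/s")
print(f"cores needed to reach 100 GB/s: {needed}")
```

With those assumed inputs the estimate comes out to 8 GB/s for one core, which sits inside the 5-30 GB/s range above, and it takes 13 such cores to reach 100 GB/s.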

The cynic inside me guesses the throughput disparity incentivizes people to 
write more parallel code, thus selling more CPUs. So, these days CPU 
manufacturers are sadly incentivized to complicate the lives of programmers. 
At the dawn of the industry, it was the other way around. Maybe there is some 
less cynical "scaling" problem behind it, but usually with 
throughput/bandwidth you can just add more lanes of traffic.
