One vexing thing about modern CPUs is that a single core cannot saturate RAM bandwidth. While throughput out of L1 can be 100..200 GB/s and total RAM bandwidth might be 50..150 GB/s, the amount that just one core's L1 cache system can pull from L2/L3/DIMMs might be only 5..30 GB/s. This puts (often much simpler) single-threaded code, which could otherwise saturate RAM throughput and thereby run as fast as possible anyway, at a major throughput disadvantage. Still, I totally agree with @mratsim here that staying with single-threaded code is the best idea.
The cynic inside me guesses that the throughput disparity incentivizes people to write more parallel code, which in turn sells more CPUs. So, these days CPU manufacturers are sadly incentivized to complicate the lives of programmers. At the dawn of the industry, it was the other way around. Maybe there is some less cynical "scaling" problem behind it, but with throughput/bandwidth you can usually just add more lanes of traffic.
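
If anyone wants to see the single-core gap on their own machine, here is a rough sketch in Python (stdlib only). The names `copy_gbps` and `total_bytes` are just mine for illustration. It times a single-threaded buffer copy at a few working-set sizes and counts bytes read plus bytes written. The cache-resident numbers will be understated because of interpreter overhead per pass, and real DRAM traffic can be higher than what is counted, so treat the output as a ballpark, not a proper benchmark.

```python
import time


def copy_gbps(n_bytes, total_bytes=1 << 28):
    """Single-threaded copy throughput in GB/s for an n_bytes working set.

    Counts n_bytes read plus n_bytes written per pass (an assumption about
    what to count; the hardware may move more than that).
    """
    src = bytearray(n_bytes)
    dst = bytearray(n_bytes)
    # Repeat small buffers so every size moves roughly the same total volume.
    passes = max(1, total_bytes // n_bytes)
    dst[:] = src  # warm-up pass so first-touch page faults are not timed
    t0 = time.perf_counter()
    for _ in range(passes):
        dst[:] = src  # same-size slice assignment: one bulk copy per pass
    elapsed = time.perf_counter() - t0
    return 2 * n_bytes * passes / elapsed / 1e9


if __name__ == "__main__":
    # Sizes picked to land roughly in L1, L2, L3 and RAM on a typical desktop;
    # adjust them for your own cache hierarchy.
    for size in (16 << 10, 256 << 10, 4 << 20, 64 << 20):
        print(f"{size / 1024:10.0f} KiB  {copy_gbps(size):7.1f} GB/s")
```

Comparing the last line of output against the rated bandwidth of your DIMMs should show how far one core is from saturating RAM.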
