OK, benchA and benchB compile to the same assembly code. However, I _can_ reproduce the slowdown, albeit only 20%-40% on average, not a factor of 10.

It turns out that it's always the first function tested that's slower. You can check this by swapping benchA and benchB in the call to benchmark(). I suspect the reason is that the OS pages in the code the first time it runs, so we're actually seeing the cost of the page fault. If you run a second round of benchmarks after the first one, that round shows more or less the same performance for both functions.
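
To make the two checks above concrete (swap the order, and run a second round after the first), here is a minimal sketch in Python. It is not the original benchmark: bench_a, bench_b, time_once and this benchmark() helper are hypothetical stand-ins, and an interpreted language will not necessarily show the same page-fault effect as native code. It only illustrates the measurement procedure.

```python
import time

# Hypothetical stand-ins for benchA and benchB: two functions with identical bodies.
def bench_a():
    return sum(i * i for i in range(10_000))

def bench_b():
    return sum(i * i for i in range(10_000))

def time_once(fn, repeats=50):
    """Return the wall-clock seconds taken to call fn `repeats` times."""
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    return time.perf_counter() - start

def benchmark(first, second, warmup=True, repeats=50):
    """Time `first`, then `second`, optionally after one untimed warm-up call each."""
    if warmup:
        # Untimed calls, so any one-time startup cost is paid before measuring.
        first()
        second()
    return time_once(first, repeats), time_once(second, repeats)

if __name__ == "__main__":
    for label, warmup in (("no warm-up", False), ("warm-up", True)):
        # Run both orders so a "first one is slower" effect can be told apart
        # from a real difference between the two functions.
        t_a, t_b = benchmark(bench_a, bench_b, warmup=warmup)
        t_b2, t_a2 = benchmark(bench_b, bench_a, warmup=warmup)
        print(f"{label}: A first: A={t_a:.4f}s B={t_b:.4f}s | "
              f"B first: B={t_b2:.4f}s A={t_a2:.4f}s")
```

If whichever function runs first is the slower one in both orders, and the gap disappears with the warm-up, the difference comes from the measurement and not from the functions themselves.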
