On Saturday, 6 February 2016 at 07:06:27 UTC, Ivan Kazmenko wrote:
1. Primitive types (cheap swap, cheap comparison).

2. Heavy structs A (expensive swap, cheap comparison - e.g., compare one field of primitive type).

3. Heavy structs B (expensive swap, expensive comparison - e.g., call a heavy external function).

4. Heavy classes (cheap swap - pointers only, expensive comparison).

So there's perhaps no single best solution.

That's right, but other factors are more important: preventing pipeline stalls. If you are collecting from 5 different cachelines in an array you are likely to get several 40-120 cycles delays unless you do prefetching, and if you do, you need to have other instructions to fill in the latency gaps.

But also instructions have latency and concurrency issues. Which is why your version did reasonably well as it made the compares/swaps independent so that they could be concurrently scheduled.

Yet, Haswell has SIMD instructions that can do 8-16x 32-bit max/min operations per cycle, with a latency of only 1 cycle, and 4-8x 64bit compares with a latency of 1 cycle.

So if you use as small fixed N, like 5, it makes very little sense to count compares/swaps.

What makes sense is to focus on how you can avoid branching and build an algorithm with no pipeline stalls.

If sorting large arrays you also might want to look at multi-threaded parallel sort functions.

Reply via email to