- compile with the loop unrolled 1x, 2x, 4x, 8x, 16x, 32x and measure the time the benchmark takes
The optimal unrolling factor may not be a power of two. It can depend on the icache size (what if it holds 11 copies of the loop body?), the iteration count (what if it is 13*n for some unknown n?), and whether the loop does some extra work once or twice every N passes, for an N that is not a power of two.
The powers of two would probably hit a lot of the common cases, but if it's too costly to check every practical value, you might want to throw in some intermediate values too.
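
Something like the following could drive that search. It is only a sketch: the names candidate_factors and best_factor are made up for illustration, the choice of the midpoint between adjacent powers of two (3, 6, 12, 24) as the "intermediate values" is my assumption, and measure is a callback you would have to supply yourself to rebuild the benchmark with a given unroll factor and return its run time.

```python
def candidate_factors(max_factor=32):
    """Powers of two up to max_factor, plus the midpoint between each
    adjacent pair (3, 6, 12, 24, ...). The midpoints are an assumed
    choice of 'intermediate values', not a rule."""
    factors = set()
    p = 1
    while p <= max_factor:
        factors.add(p)
        mid = p + p // 2  # 1.5 * p; for p == 1 this is just 1 again
        if mid <= max_factor:
            factors.add(mid)
        p *= 2
    return sorted(factors)


def best_factor(factors, measure, repeats=3):
    """Time each factor `repeats` times, keep the fastest run of each,
    and return (best factor, {factor: best time}).

    `measure` is a hypothetical user-supplied callable: it takes an
    unroll factor, builds and runs the benchmark, and returns seconds."""
    timings = {f: min(measure(f) for _ in range(repeats)) for f in factors}
    return min(timings, key=timings.get), timings


if __name__ == "__main__":
    # Stand-in for a real build-and-run step: pretend 12x is the sweet spot.
    fake_measure = lambda f: 1.0 + abs(f - 12) * 0.01
    best, timings = best_factor(candidate_factors(32), fake_measure)
    print(best, timings)
```

Taking the minimum over a few repeats is one common way to damp timing noise. If the fastest factor turns out to be one of the midpoints, that is a hint to test the values around it as well.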
Ken