On 11/15/2013 2:26 PM, Ondřej Bílka wrote:
On Fri, Nov 15, 2013 at 09:17:14AM -0800, Hendrik Greving wrote:
Also keep in mind that costs usually go up significantly if
misalignment causes cache line splits (the processor will fetch 2 lines).
There are non-linear costs to filling up the store queue in modern
out-of-order (x86) processors. The bottom line is that it's much better
to peel, e.g. for AVX2/AVX3, if the loop would otherwise cause loads
that cross cache line boundaries. The solution is either to always
peel for alignment, or to insert an additional check for cache line
boundaries (for high trip count loops).
That is quite a bold claim; do you have a benchmark to support it?
Since Nehalem there has been no penalty for unaligned SSE loads beyond
the extra cache line fetch. On Haswell, AVX2 loads behave in a similar way.
Where gcc or gfortran chooses to split SSE2 or SSE4 loads, I found a
marked advantage in that choice on my Westmere (which I seldom power on
nowadays). You are correct that this finding is in disagreement with
Intel documentation, and it has the effect that the Intel option -xHost
is not the optimum one there. I suspect Westmere performed less well
than Nehalem on unaligned loads. Another poorly documented feature of
Nehalem and Westmere was a preference for 32-byte aligned data, more so
than on Sandy Bridge.
Intel documentation encourages use of unaligned AVX-256 loads on Ivy
Bridge and Haswell, but Intel compilers don't implement them (except for
intrinsics) until AVX2. Still, in my own Haswell tests, splitting
unaligned loads via the AVX compile option comes out ahead.
Supposedly, the preference of Windows intrinsics programmers for the
relative simplicity of unaligned moves was taken into account in the
more recent hardware designs, as it was disastrous for Sandy Bridge.
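What that splitting amounts to can be modeled in plain C. The struct
and function below are my own stand-ins (real compiler output would be
a vmovups xmm followed by vinsertf128, which is what gcc's
-mavx256-split-unaligned-load option emits):

```c
#include <string.h>

/* Fetch an unaligned 32-byte "vector" as two 16-byte halves instead
 * of a single 32-byte access, so neither half can split more than
 * one cache line boundary.  memcpy models the two 128-bit loads. */
typedef struct { float lane[8]; } v256;

static v256 load256_split(const float *p)
{
    v256 v;
    memcpy(&v.lane[0], p,     16);  /* low 128 bits  */
    memcpy(&v.lane[4], p + 4, 16);  /* high 128 bits */
    return v;
}
```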
I have only remote access to Haswell although I plan to buy a laptop
soon. I'm skeptical about whether useful findings on these points may be
obtained on a Windows laptop.
In case you didn't notice, Intel compilers introduced #pragma vector
unaligned as a means of specifying unaligned access handling without
peeling. I guess it is expected to be useful on Ivy Bridge or Haswell
for cases where the loop count is moderate but expected to match
unrolled AVX-256, or where the cases in which peeling could improve
alignment are rare.
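For reference, the pragma is placed directly on the loop. The function
here is an illustrative saxpy-style kernel of my own, not from the
Intel documentation; gcc and clang simply ignore the unknown pragma:

```c
#include <stddef.h>

/* With icc, "#pragma vector unaligned" tells the vectorizer to issue
 * unaligned vector loads/stores directly rather than peeling a scalar
 * prologue to reach alignment.  Other compilers ignore the pragma and
 * the loop still computes the same result. */
void saxpy(float *restrict y, const float *restrict x, float a, size_t n)
{
#pragma vector unaligned
    for (size_t i = 0; i < n; i++)
        y[i] += a * x[i];
}
```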
In addition, Intel compilers learned from gcc the trick of using AVX-128
for situations where frequent unaligned accesses are expected and
peeling is clearly undesirable. The new facility for vectorizing OpenMP
parallel loops (e.g. #pragma omp parallel for simd) uses AVX-128,
consistent with the fact that OpenMP chunks are more frequently
unaligned. In fact, parallel for simd seems to perform nearly the same
with gcc-4.9 as with icc.
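The combined construct mentioned above looks like this; the `scale`
kernel is my own minimal example. Threads split the iteration space
and each chunk is then vectorized, and since a chunk's starting index
generally isn't 32-byte aligned, AVX-128 is the safer code-generation
choice, per the observation above:

```c
#include <stddef.h>

/* OpenMP 4.0 combined construct: distribute iterations across threads,
 * then vectorize each thread's chunk.  Built without -fopenmp the
 * pragma is ignored and the loop runs serially with the same result. */
void scale(float *a, size_t n, float c)
{
#pragma omp parallel for simd
    for (size_t i = 0; i < n; i++)
        a[i] *= c;
}
```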
Many decisions on compiler defaults are still based on an unscientific
choice of benchmarks, with gcc evidently more responsive to input from
the community.
--
Tim Prince