On 11/15/2013 2:26 PM, Ondřej Bílka wrote:
On Fri, Nov 15, 2013 at 09:17:14AM -0800, Hendrik Greving wrote:
Also keep in mind that usually costs go up significantly if
misalignment causes cache line splits (processor will fetch 2 lines).
There are non-linear costs of filling up the store queue in modern
out-of-order processors (x86). The bottom line is that it's much better to
peel, e.g. for AVX2/AVX3, if the loop would otherwise cause loads that cross
cache-line boundaries. The solution is either to always peel for alignment,
or to insert an additional check for cache-line boundaries (for high
trip count loops).
That is quite a bold claim; do you have a benchmark to support it?

Since Nehalem there has been no overhead for unaligned SSE loads, except for the
cost of fetching the cache lines. On Haswell, AVX2 loads behave in a similar way.
Where gcc or gfortran choose to split SSE2 or SSE4 loads, I found a marked advantage in that choice on my Westmere (which I seldom power on nowadays). You are correct that this finding disagrees with Intel documentation, and it means that the Intel option -xHost is not the optimum one. I suspect Westmere performed less well than Nehalem on unaligned loads. Another poorly documented feature of Nehalem and Westmere was a preference for 32-byte aligned data, more so than Sandy Bridge.

Intel documentation encourages the use of unaligned AVX-256 loads on Ivy Bridge and Haswell, but Intel compilers don't generate them (except for intrinsics) until AVX2. Still, in my own Haswell tests, splitting unaligned loads by use of the AVX compile option comes out ahead. Supposedly, the preference of Windows intrinsics programmers for the relative simplicity of unaligned moves was taken into account in the more recent hardware designs, as it was disastrous for Sandy Bridge.

I have only remote access to Haswell, although I plan to buy a laptop soon. I'm skeptical about whether useful findings on these points can be obtained on a Windows laptop.

In case you didn't notice, Intel compilers introduced #pragma vector unaligned as a means of specifying unaligned access without peeling. I guess it is expected to be useful on Ivy Bridge or Haswell for cases where the loop count is moderate but expected to match unrolled AVX-256, or where the case in which peeling could improve alignment is rare.

In addition, Intel compilers learned from gcc the trick of using AVX-128 for situations where frequent unaligned accesses are expected and peeling is clearly undesirable. The new facility for vectorizing OpenMP parallel loops (e.g. #pragma omp parallel for simd) uses AVX-128, consistent with the fact that OpenMP chunks are more frequently unaligned. In fact, parallel for simd seems to perform nearly the same with gcc-4.9 as with icc.

Many decisions about compiler defaults are still based on an unscientific choice of benchmarks, with gcc evidently more responsive to input from the community.
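
For reference, a minimal sketch of how those two pragmas are spelled in
source; the saxpy-style loop and the function names are invented for
illustration, not taken from any of the posts above.

#include <stddef.h>

/* Intel compilers only: request unaligned vector accesses with no peel
   loop; gcc will simply warn about the unknown pragma. */
void saxpy_unaligned(float *restrict y, const float *restrict x,
                     float a, size_t n)
{
#pragma vector unaligned
    for (size_t i = 0; i < n; ++i)
        y[i] += a * x[i];
}

/* OpenMP 4.0 (icc, and gcc-4.9 with -fopenmp): distribute the loop
   across threads and vectorize each thread's chunk, even though the
   chunk boundaries are frequently unaligned. */
void saxpy_simd(float *restrict y, const float *restrict x,
                float a, size_t n)
{
#pragma omp parallel for simd
    for (size_t i = 0; i < n; ++i)
        y[i] += a * x[i];
}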

--
Tim Prince
