On 11/15/2013 2:26 PM, Ondřej Bílka wrote:
On Fri, Nov 15, 2013 at 09:17:14AM -0800, Hendrik Greving wrote:
Also keep in mind that usually costs go up significantly if
misalignment causes cache line splits (processor will fetch 2 lines).
There are non-linear costs of filling up the store queue in modern
out-of-order processors (x86). The bottom line is that it's much better to
peel, e.g. for AVX2/AVX3, if the loop would otherwise cause loads that cross
cache-line boundaries. The solution is either to always peel for alignment,
or to insert an additional check for cache-line boundaries (for high
trip count loops).
That is quite a bold claim; do you have a benchmark to support it?

Since Nehalem there has been no overhead for unaligned SSE loads, except for the
cost of fetching the cache lines. On Haswell, AVX2 loads behave in a similar way.
Where gcc or gfortran choose to split SSE2 or SSE4 loads, I found a marked advantage in that choice on my Westmere (which I seldom power on nowadays). You are correct that this finding disagrees with Intel documentation, and it means that the Intel option -xHost is not the optimum one. I suspect Westmere performed less well than Nehalem on unaligned loads. Another poorly documented feature of Nehalem and Westmere was a preference for 32-byte aligned data, more so than Sandy Bridge.

Intel documentation encourages the use of unaligned AVX-256 loads on Ivy Bridge and Haswell, but Intel compilers don't generate them (except for intrinsics) until AVX2. Still, in my own Haswell tests, splitting unaligned loads by use of the AVX compile option comes out ahead. Supposedly, the preference of Windows intrinsics programmers for the relative simplicity of unaligned moves was taken into account in the more recent hardware designs, as it was disastrous for Sandy Bridge.

I have only remote access to Haswell, although I plan to buy a laptop soon. I'm skeptical about whether useful findings on these points can be obtained on a Windows laptop.

In case you didn't notice, Intel compilers introduced #pragma vector unaligned as a means of specifying unaligned access without peeling. I guess it is expected to be useful on Ivy Bridge or Haswell for cases where the loop count is moderate but expected to match unrolled AVX-256, or where the case in which peeling could improve alignment is rare.

In addition, Intel compilers learned from gcc the trick of using AVX-128 for situations where frequent unaligned accesses are expected and peeling is clearly undesirable. The new facility for vectorizing OpenMP parallel loops (e.g. #pragma omp parallel for simd) uses AVX-128, consistent with the fact that OpenMP chunks are more frequently unaligned. In fact, parallel for simd seems to perform nearly the same with gcc-4.9 as with icc.

Many decisions about compiler defaults are still based on an unscientific choice of benchmarks, with gcc evidently more responsive to input from the community.
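
For reference, a minimal sketch of how those two pragmas are spelled in
source; the saxpy-style loop and the function names are invented for
illustration, not taken from any of the posts above.

#include <stddef.h>

/* Intel compilers only: request unaligned vector accesses with no peel
   loop; gcc will simply warn about the unknown pragma. */
void saxpy_unaligned(float *restrict y, const float *restrict x,
                     float a, size_t n)
{
#pragma vector unaligned
    for (size_t i = 0; i < n; ++i)
        y[i] += a * x[i];
}

/* OpenMP 4.0 (icc, and gcc-4.9 with -fopenmp): distribute the loop
   across threads and vectorize each thread's chunk, even though the
   chunk boundaries are frequently unaligned. */
void saxpy_simd(float *restrict y, const float *restrict x,
                float a, size_t n)
{
#pragma omp parallel for simd
    for (size_t i = 0; i < n; ++i)
        y[i] += a * x[i];
}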

--
Tim Prince
