On Wed, May 31, 2017 at 10:05:50PM -0700, Jonathan M Davis via Digitalmars-d-learn wrote:
> On Thursday, June 01, 2017 04:52:40 Patrick Schluter via Digitalmars-d-learn [...]
> > See my link above to realworldtech. Using SIMD can give good
> > results in micro-benchmarks but completely screw up the performance
> > of other things in practice (the alignment requirements are heavy
> > and result in code bloat, cache misses, TLB misses, the cost of
> > context switches, AVX warm-up time (Agner Fog observed around
> > 10000 cycles before AVX switches from 128-bit to 256-bit
> > operations), reduced turboing, etc.).
>
> Whenever you attempt more complicated optimizations, it becomes harder
> to get it right, and you always have the problem of figuring out
> whether you really did make it better in general. It's the sort of
> thing that's easier when you have a specific use case, and very
> difficult to get right when dealing with a general solution for a
> standard library. So, it doesn't surprise me at all if a particular
> optimization turns out to be a bad idea for Phobos even if it's great
> for some use cases.
[...]
It makes me want to just say: write a naïve loop that expresses exactly what you intend to achieve, and let the compiler's optimizer figure out how best to optimize it for your target architecture.

Unfortunately, just earlier today, while testing an incomplete version of count() that uses the ulong iteration optimization, I discovered to my horror that ldc (at -O3) apparently recognizes that exact hack and turns the loop into a massive bowl of SSE/MMX/AVX/etc. soup that's many times the size of the "unoptimized" loop. After reading the realworldtech thread Patrick pointed out, I'm starting to wonder whether the result is actually faster in practice, or whether it only looks good in benchmarks, because all that code bloat adds instruction cache pressure, probably TLB misses, and so on.

If your program mostly calls count() on smallish arrays -- which I'd say is rather likely in the cases that matter, since I can't imagine anyone wanting to count bytes in 1MB arrays inside an inner loop; in inner loops you tend to work with small to medium-sized chunks of data, so that's where you'd want count() to be fast -- then a small, tight "unoptimized" loop is going to perform much better: it's easily inlineable and won't add a kilobyte to your function body, so your function is less likely to overflow the instruction cache. Plus, having fewer complicated branches and other paraphernalia makes the CPU's branch predictor more likely to get it right, so you're less likely to cause pipeline stalls.

Perhaps the right approach is to check whether the array length is below some arbitrary threshold, use the naïve loop when it is, and only switch to the complicated hackish stuff when you're sure it will actually benefit, rather than hurt, performance.


T

-- 
Not all rumours are as misleading as this one.
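
P.S. For concreteness, here's roughly the kind of length-threshold dispatch I have in mind -- a quick, untested sketch, where the function name, the cutoff of 32, and the SWAR details are all made up for illustration (this is not the actual Phobos code):

    // Hypothetical sketch: count occurrences of `needle` in `haystack`,
    // using a plain loop for short arrays and a ulong word-at-a-time
    // (SWAR) loop for long ones.
    size_t countByte(const(ubyte)[] haystack, ubyte needle)
    {
        import core.bitop : popcnt;

        enum size_t cutoff = 32;  // arbitrary; would need benchmarking

        size_t result = 0;
        if (haystack.length < cutoff)
        {
            // Small input: naïve loop -- tiny code, trivially inlineable.
            foreach (b; haystack)
                if (b == needle) ++result;
            return result;
        }

        enum ulong lows  = 0x0101010101010101UL;
        enum ulong highs = 0x8080808080808080UL;
        immutable ulong pattern = lows * needle; // needle in all 8 bytes

        size_t i = 0;
        for (; i + 8 <= haystack.length; i += 8)
        {
            // Unaligned load; fine on x86, needs care elsewhere.
            ulong w = *cast(const(ulong)*)(haystack.ptr + i);
            w ^= pattern;  // bytes equal to needle become zero
            // Classic "has zero byte" trick: the high bit of each byte
            // of z is set iff the corresponding byte of w was zero.
            ulong z = (w - lows) & ~w & highs;
            result += popcnt(z);
        }

        foreach (b; haystack[i .. $])  // leftover tail bytes
            if (b == needle) ++result;
        return result;
    }

The point being that below the cutoff you never pay for the fancy path at all: the common case stays small enough to inline, and the word-at-a-time loop only kicks in once the array is long enough to amortize its overhead.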