On Tuesday, 13 January 2015 at 09:58:56 UTC, bearophile wrote:
Take a look at the ideas of "C++ seasoning" (http://channel9.msdn.com/Events/GoingNative/2013/Cpp-Seasoning ), where they suggest roughly the opposite of what you do: throwing out loops and other low-level constructs and replacing them with standard algorithms.

Yes... you can do that. For little gain, since C++'s support for high-level programming is bloat inducing. Just take a look at all the symbols you need to have a conforming iterator or allocator... An allocator should be a simple 5-line snippet, but in the STL it is bloatsome:

https://gist.github.com/donny-dont/1471329
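For contrast, here is roughly the minimum a conforming allocator needs since C++11, where std::allocator_traits fills in the rest of the boilerplate; the gist above spells out the older, full interface. A sketch, with my own naming:

    #include <cstddef>
    #include <new>

    // Minimal C++11 allocator: std::allocator_traits derives pointer
    // types, rebind, construct/destroy etc. from these few members.
    template <typename T>
    struct SimpleAllocator {
        using value_type = T;
        SimpleAllocator() = default;
        template <typename U>
        SimpleAllocator(const SimpleAllocator<U>&) {}
        T* allocate(std::size_t n) {
            return static_cast<T*>(::operator new(n * sizeof(T)));
        }
        void deallocate(T* p, std::size_t) { ::operator delete(p); }
    };

    // Stateless, so any two instances compare equal.
    template <typename T, typename U>
    bool operator==(const SimpleAllocator<T>&, const SimpleAllocator<U>&) { return true; }
    template <typename T, typename U>
    bool operator!=(const SimpleAllocator<T>&, const SimpleAllocator<U>&) { return false; }

Still not 5 lines, and the pre-C++11 interface in the gist is several times longer.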

Then I needed a circular buffer. The STL didn't have one, so I downloaded the Boost one. It was terribly inefficient because it was generic and STLish.

I ended up writing my own using a fixed-size power-of-two array with start and end indices. The power-of-two capacity and modular arithmetic give clean, efficient, conditional-free code.
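Something along these lines, as a sketch (names are mine, not my exact code):

    #include <cstddef>

    // Power-of-two ring buffer: capacity is 2^N, so wrap-around is a
    // single AND with a mask. The indices run free and modular
    // arithmetic does the rest; no compare-and-branch anywhere.
    template <typename T, std::size_t N>
    class RingBuffer {
        static constexpr std::size_t kSize = std::size_t(1) << N;
        static constexpr std::size_t kMask = kSize - 1;
        T buf_[kSize];
        std::size_t head_ = 0; // oldest element (free-running index)
        std::size_t tail_ = 0; // one past newest (free-running index)
    public:
        bool empty() const { return head_ == tail_; }
        bool full() const { return tail_ - head_ == kSize; }
        void push(const T& v) { buf_[tail_++ & kMask] = v; } // caller checks full()
        T pop() { return buf_[head_++ & kMask]; }            // caller checks empty()
    };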

So in the end I got something that is faster, produces more readable code, compiles faster, is easier to debug, and took less time to implement than finding and figuring out the Boost one...

I have no problem with using array<T> and vector<T> where they fit, but in the end templated libraries work against transparency. If you want speed you need to understand the memory layout, and concrete implementations make that easier.

A solution like list comprehensions is a lot easier on the programmer, if convenience is the goal.

There's still time to add lazy and eager sequence comprehensions (or, even better, F#'s computation expressions) to D, but past suggestions were not welcomed. D has a lot of features, and adding more and more has costs.


Phobos "ranges" need a next_simd() to be efficient. Right?

Perhaps, but first std.simd needs to be finished.

Right, but you need to support masked SIMD if you want to do filtering. Maybe autovectorization is the only path.
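By masked SIMD I mean something like the following sketch (AVX intrinsics; the function name and predicate are just for illustration). Compare all eight lanes at once, then use the resulting mask to keep the lanes that pass; the compaction step stays scalar here because plain AVX has no compress-store, AVX-512 adds one:

    #include <immintrin.h>
    #include <cstddef>

    // Keep the floats greater than `limit`; returns the number written.
    std::size_t filter_gt(const float* in, std::size_t n,
                          float limit, float* out) {
        const __m256 vlimit = _mm256_set1_ps(limit);
        std::size_t w = 0, i = 0;
        for (; i + 8 <= n; i += 8) {
            __m256 v = _mm256_loadu_ps(in + i);
            // One bit per lane: 1 where in[i + lane] > limit.
            int mask = _mm256_movemask_ps(_mm256_cmp_ps(v, vlimit, _CMP_GT_OQ));
            for (int lane = 0; lane < 8; ++lane)   // scalar compaction
                if (mask & (1 << lane)) out[w++] = in[i + lane];
        }
        for (; i < n; ++i)                          // scalar tail
            if (in[i] > limit) out[w++] = in[i];
        return w;
    }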

Still, you also need to keep your loops tiny if you want to benefit from the x86 loop buffer: the CPU can effectively unroll tight loops in hardware before they hit the execution pipeline, keeping the loop conditionals out of it.

Then you have cache locality. You need to break up long loops so you don't push things out of the caches.
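For example by strip-mining: instead of letting each pass stream over the whole array and evict everything before the next pass starts, run all the passes over one cache-sized block at a time. A sketch (the block size is a tuning assumption, not a universal constant):

    #include <cstddef>

    constexpr std::size_t kBlock = 4096; // floats per block; tune to the cache

    void process(float* a, std::size_t n) {
        for (std::size_t base = 0; base < n; base += kBlock) {
            std::size_t end = base + kBlock < n ? base + kBlock : n;
            for (std::size_t i = base; i < end; ++i) a[i] *= 2.0f; // pass 1
            for (std::size_t i = base; i < end; ++i) a[i] += 1.0f; // pass 2: data still hot
        }
    }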

So if you can gain 2x from good cache locality/prefetching and 4x from using AVX over scalar code, that compounds to an 8x gain over a naive implementation. That hurts.
