On 06/09/2010 01:28 AM, retard wrote:
Wed, 09 Jun 2010 01:13:43 -0400, Nick Sabalausky wrote:

"retard"<r...@tard.com.invalid>  wrote in message
news:hun6ok$13s...@digitalmars.com...
Tue, 08 Jun 2010 16:14:51 -0500, Andrei Alexandrescu wrote:

On 06/08/2010 04:05 PM, Walter Bright wrote:
Andrei Alexandrescu wrote:
On 06/08/2010 01:27 PM, "Jérôme M. Berger" wrote:
Please define "reasonable performance"...

Within 15% of hand-optimized code specialized for the types at hand.

I would have said O(n) or O(log n), as opposed to, say, O(n*n).

General rules for performance improvements:

1. nobody notices a 10% improvement

2. users will start noticing speedups when they exceed 2x

3. a 10x speedup is a game changer

max of n elements is O(n).

This probably means that D 2 won't be very efficient on multicore until
the authors learn some basic parallel programming skills. Now where did
you get your PhD - I'm collecting a list of toy universities people
should avoid.

You used to have meaningful things to say. Now you're just trolling.

Max of n unordered elements can be solved in O(log log n) time assuming
you have enough cores and constant time memory access. Happy now?

When calculating the complexity of an operation, you don't consider cores to be in unlimited supply; their number is a constant. Complexity is expressed in terms of the number of inputs, and max of n elements is O(n) because you need to look at each element at least once.

The number of cores being a constant k, you can then consider that the max computation can be distributed such that each core looks at n/k elements. What's left is k intermediate results, which can be further combined in log(k) time (see below); since we consider k a constant, this approach can claim at most a net k-fold speedup.
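
To make the decomposition concrete, here is a minimal sequential sketch in D; chunkedMax is a name of my own, and the per-chunk loop stands in for the work one of the k cores would do (the k partial maxima are then combined, linearly here since k is a small constant; a log(k) tree combine works the same way):

import std.algorithm : max;

int chunkedMax(const(int)[] a, size_t k)
{
    assert(a.length > 0 && k > 0);
    immutable chunkLen = (a.length + k - 1) / k;    // ceil(n / k) elements per "core"
    int result = a[0];
    for (size_t start = 0; start < a.length; start += chunkLen)
    {
        immutable end = start + chunkLen < a.length ? start + chunkLen : a.length;
        int partial = a[start];                     // the share of one core
        foreach (x; a[start + 1 .. end])
            partial = max(partial, x);
        result = max(result, partial);              // combine the partial maxima
    }
    return result;
}

unittest
{
    assert(chunkedMax([3, 1, 4, 1, 5, 9, 2, 6], 3) == 9);
}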

If the available cores are proportional to n (an unusual assumption), you first compare pairs of the n elements using n/2 cores, then do n/4 comparisons on pairs of the remaining candidates, and so on. Each step halves the number of candidates, so the complexity is O(log n). I'd be curious to find out how that can be reduced to O(log log n).
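
Here is a sequential simulation of that tournament, again only a sketch with a made-up name; the point is that the comparisons within one round are independent of one another, which is exactly where the n/2 cores would be spent:

int tournamentMax(const(int)[] a)
{
    assert(a.length > 0);
    auto work = a.dup;
    while (work.length > 1)
    {
        auto next = new int[(work.length + 1) / 2];
        foreach (i; 0 .. work.length / 2)        // independent; one comparison per core
            next[i] = work[2 * i] > work[2 * i + 1] ? work[2 * i] : work[2 * i + 1];
        if (work.length % 2 == 1)                // odd element out gets a bye
            next[$ - 1] = work[$ - 1];
        work = next;
    }
    return work[0];
}

The loop body runs once per round, and the number of candidates halves each time, hence the O(log n) rounds.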

Finally, for small values of n, you could consider the number of cores sufficiently large, but, as was pointed out, using threads to compute a max is actually impractical, so superscalar execution may be a better avenue. In practice this means exploiting instruction-level parallelism (ILP) by reducing data dependencies between intermediate results. I suggest you take a look at a thread entitled "challenge: implement the max function" that I started on 01/21/2007. That thread discusses ILP issues and continues in the thread "challenge #2: implement the varargs_reduce metafunction". Allow me to quote from my own post on 01/23/2007:

==================
That's a good point, and goes right to the core of my solution, which (similarly to Bruno Medeiros') arranges operations in an optimal way for superscalar evaluation, e.g. max(a, b, c, d) is expanded not to:

max2(max2(max2(a, b), c), d)

but instead to:

max2(max2(a, b), max2(c, d))

The number of operations is the same, but in the latter case there is logarithmically less dependency between partial results. When max2 expands into a primitive comparison, the code is easy prey for a superscalar machine to execute in parallel.

This won't count for much in a great deal of real code, but the fact that this will go into the standard library warrants the thought. Besides, the fact that the language makes it so easy that we can actually think about such subtlety is very encouraging.
==================

The messages that follow further discuss how associative reduce should be ordered to take advantage of superscalar execution.
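
For illustration, here is one way the balanced expansion could be written with variadic templates; max2 and maxOf are names of my own for this sketch, not library functions:

T max2(T)(T a, T b) { return b > a ? b : a; }

// maxOf(a, b, c, d) expands to max2(max2(a, b), max2(c, d)): the two inner
// comparisons carry no data dependency and can execute in parallel.
auto maxOf(Args...)(Args args)
{
    static if (Args.length == 1)
        return args[0];
    else static if (Args.length == 2)
        return max2(args[0], args[1]);
    else
    {
        enum half = Args.length / 2;
        return max2(maxOf(args[0 .. half]), maxOf(args[half .. $]));
    }
}

unittest
{
    assert(maxOf(3, 7, 2, 9) == 9);
}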


Andrei
