-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Mar 3, 2008, at 1:12 AM, Philippe Anel wrote:

So, does this mean that only the I/O system of your program needs low latency? If so (and maybe I'm wrong), what you need is to be able to interrupt working cores, and I'm afraid libthread doesn't help here. If not, and your algorithm requires (a lot of) fast IPC, maybe this is the reason why it doesn't scale well?

No, the whole simulation has to run in the low-latency space - it's a video game and its rendering engine, which together are generally a highly heterogeneous workload. That heterogeneity means that there are many points of contact between the various subsystems, and the (semi-)real-time constraint means that you can't just scale the problem up to cover overhead costs.


I don't know what you mean by "CSP system itself takes care about memory hierarchy". Do you mean that the CSP implementation does something about it, or do you mean that the code using the CSP approach takes care of it?
Both :)
I agree with you that programming for the memory hierarchy is way more important than optimizing CPU clocks. But I also think the synchronization primitives used in CSP systems are the main reason why CSP programs do not scale well (except for badly designed algorithms, of course). I meant that a different CSP implementation, based on a different synchronization primitive (IPI), can help here.

I'm more interested just now in working with lock-free algorithms; I've not made any good measurements of how badly our kernels would hit channels as the number of threads increases. Perhaps some of that cost could be mitigated through a better channel implementation.
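
As a rough illustration of what a lock-free alternative to a channel might look like, here is a minimal single-producer/single-consumer ring buffer using C11 atomics. It is only a sketch under stated assumptions: the names (Ring, ring_send, ring_recv, RING_CAP) are hypothetical, it is not libthread's Channel, it never blocks or rendezvouses, and it is only safe with exactly one sending thread and one receiving thread.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names throughout; this is not the libthread Channel API. */
#define RING_CAP 8
static_assert((RING_CAP & (RING_CAP - 1)) == 0,
              "RING_CAP must be a power of two");

typedef struct {
    _Atomic size_t head;   /* next slot to read; written only by the consumer */
    _Atomic size_t tail;   /* next slot to write; written only by the producer */
    int slots[RING_CAP];
} Ring;

static void ring_init(Ring *r)
{
    atomic_init(&r->head, 0);
    atomic_init(&r->tail, 0);
}

/* Producer side: returns false instead of blocking when the ring is full. */
static bool ring_send(Ring *r, int v)
{
    size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (tail - head == RING_CAP)
        return false;
    r->slots[tail & (RING_CAP - 1)] = v;
    /* Release: the slot write becomes visible before the new tail does. */
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}

/* Consumer side: returns false instead of blocking when the ring is empty. */
static bool ring_recv(Ring *r, int *out)
{
    size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head == tail)
        return false;
    *out = r->slots[head & (RING_CAP - 1)];
    /* Release: the slot read completes before the slot is handed back. */
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}
```

Neither side takes a lock or enters the kernel, but head and tail still bounce between the two cores' caches on every transfer, so this trades locking cost for coherence traffic rather than removing the cost altogether.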


IPI isn't free either - apart from the OS switch, it generates bus traffic that competes with the cache coherence protocols and memory traffic; in a well-designed compute kernel that saturates both compute and bandwidth, the latency hiccups so introduced can propagate really badly.

This is very interesting. For sure IPI is not free. But I thought the bus traffic generated by IPI was less significant than that of cache coherence protocols such as MESI, mainly because it is a one-way message.

It depends immensely on the hardware implementation of your IPI. If you wind up having to pay for MESI as well, then the advantage shrinks.

I think IPIs are now sent over the system bus (local APICs used to talk over a separate bus), so I agree with you that they can saturate the bandwidth. But I wonder whether locking primitives are not worse. It would be interesting to test this.

Agreed!

Paul

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.3 (Darwin)

iD8DBQFHzLSSpJeHo/Fbu1wRAkv/AKDKK4fuuWyYCqXv4JqbWWj+RXQd0wCfSFoS
b9E6X/a13bg6AzUGT5dLSqU=
=ppoF
-----END PGP SIGNATURE-----
