-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
On Mar 3, 2008, at 1:12 AM, Philippe Anel wrote:
So, does this mean the low latency is only required by the I/O system
of your program? If so, and maybe I'm wrong, what you need is to be
able to interrupt working cores, and I'm afraid libthread doesn't
help here.
If not, and your algorithm requires (a lot of) fast IPC, maybe this
is the reason why it doesn't scale well?
No, the whole simulation has to run in the low-latency space - it's a
video game and its rendering engine, which are generally highly
heterogeneous workloads. And that heterogeneity means that there are
many points of contact between various subsystems. And the (semi-)
real-time constraint means that you can't just scale the problem up
to cover overhead costs.
I don't know what you mean by "CSP system itself takes care about
memory hierarchy". Do you mean that the CSP implementation does
something about it, or do you mean that the code using the CSP
approach takes care of it?
Both :)
I agree with you that programming for the memory
hierarchy is way more important than optimizing CPU clocks.
But I also think the synchronization primitives used in CSP systems are
the main reason why CSP programs do not scale well (badly designed
algorithms excepted, of course).
I meant that a different CSP implementation, based on a different
synchronisation primitive (IPI), could help here.
I'm more interested just now in working with lock-free algorithms;
I've not made any good measurements of how badly our kernels would
hit channels as the number of threads increases. Perhaps some of that
cost could be mitigated through a better channel implementation.
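
As an illustration of the lock-free direction mentioned above, here is a
minimal sketch of a single-producer/single-consumer ring buffer written
with C11 atomics. It is not libthread's channel implementation and it was
not part of this thread; the names Ring, ring_send and ring_recv, the
capacity of 8 and the 64-byte cache line size are all assumptions made
for the example. It never blocks: a full or empty ring is reported to the
caller, who must decide whether to spin, yield or do other work.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_CAP 8 /* hypothetical capacity; must be a power of two */
static_assert((RING_CAP & (RING_CAP - 1)) == 0, "RING_CAP must be a power of two");

typedef struct {
    int slots[RING_CAP];
    /* Each index is written by exactly one side. Keeping them on separate
     * cache lines (64 bytes is an assumption) avoids false sharing. */
    _Alignas(64) _Atomic size_t head; /* next slot to read; consumer only */
    _Alignas(64) _Atomic size_t tail; /* next slot to write; producer only */
} Ring;

static void ring_init(Ring *r)
{
    atomic_init(&r->head, 0);
    atomic_init(&r->tail, 0);
}

/* Producer side. Returns false (without blocking) when the ring is full. */
static bool ring_send(Ring *r, int v)
{
    size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (tail - head == RING_CAP)
        return false;
    r->slots[tail & (RING_CAP - 1)] = v;
    /* Release: the slot write above becomes visible before the new tail. */
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}

/* Consumer side. Returns false (without blocking) when the ring is empty. */
static bool ring_recv(Ring *r, int *out)
{
    size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head == tail)
        return false;
    *out = r->slots[head & (RING_CAP - 1)];
    /* Release: the slot read above completes before the slot is handed back. */
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}
```

The sketch only covers one producer and one consumer, and it still pays
for cache coherence traffic on the two indices and the slots; what it
avoids is the lock and the scheduler. Whether that is a net win over a
channel would have to be measured on the target hardware.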
IPI isn't free either - apart from the OS switch, it generates bus
traffic that competes with the cache coherence protocols and
memory traffic; in a well-designed compute kernel that saturates
both compute and bandwidth, the latency hiccups so introduced can
propagate really badly.
This is very interesting. For sure IPI is not free. But I thought
the bus traffic generated by an IPI was lighter than that of cache
coherence protocols such as MESI, mainly because it is a one-way
message.
It depends immensely on the hardware implementation of your IPI. If
you wind up having to pay for MESI as well, then the advantage
shrinks.
I think IPIs are now sent over the system bus (the local APIC used to
talk over a separate bus), so I agree with you that they
can saturate the bandwidth. But I wonder whether locking primitives are
not worse. It would be interesting to test this.
Agreed!
Paul
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.3 (Darwin)
iD8DBQFHzLSSpJeHo/Fbu1wRAkv/AKDKK4fuuWyYCqXv4JqbWWj+RXQd0wCfSFoS
b9E6X/a13bg6AzUGT5dLSqU=
=ppoF
-----END PGP SIGNATURE-----