On 10 Aug 2009, at 23:42, Quentin Mathé wrote:
> Hi David,
>
> Le 10 août 09 à 01:36, David Chisnall a écrit :
>
>> One of the main reasons I did this was that I was interested in
>> Apple's claim that adding a new object to a Grand Central Dispatch
>> queue takes 15 instructions. This is somewhat disingenuous. Locking
>> and unlocking the spinlock is 5 instructions, but those instructions
>> acquire the lock's cache line and so are not particularly fast. In
>> terms of real-world performance, acquiring and releasing a pthread
>> mutex is fairly close (within a factor of two) on any platform I tried
>
> Interesting to know.
Yup. I was wondering whether they were using some clever lockless
structure, but if we just use an inline version of the little mutex I
wrote with the GCC atomic ops (on OS X I had to use some inline asm,
because Apple's GCC 4.0 doesn't support the atomic builtins) then we
should get the same performance. Of course, a full implementation of
GCD needs feedback from the kernel so that each process can determine
the optimum number of threads to spawn. This can possibly be done using
the existing code we have for counting the number of CPUs, plus a small
amount of extra code to track the current load.
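For the queue lock, something along these lines is all we need. This is
only a sketch with made-up names, not the exact code, and on OS X the
two __sync builtins would be replaced by the inline asm equivalents:

#include <sched.h>

typedef volatile int simple_mutex_t;
#define SIMPLE_MUTEX_INITIALISER 0

static inline void simple_mutex_lock(simple_mutex_t *lock)
{
    /* __sync_lock_test_and_set() is an atomic exchange with acquire
     * semantics; a return value of 0 means the lock was free and is
     * now ours. */
    while (__sync_lock_test_and_set(lock, 1))
    {
        /* Someone else holds it; let other threads run for a bit. */
        sched_yield();
    }
}

static inline void simple_mutex_unlock(simple_mutex_t *lock)
{
    /* Atomically store 0 with release semantics. */
    __sync_lock_release(lock);
}

With the two functions inlined, the uncontended path is just a single
atomic exchange.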
>> except OS X; maybe they'd shout about it a bit less if their pthread
>> implementation[1] were not an embarrassment, as the performance gain
>> relative to Apple's mutex implementation is a factor of 100 or so,
>> while relative to anyone else's (including one I wrote in five
>> minutes) it is closer to a factor of 2.
>
> The result could be completely different in Mac OS X 10.6, though.
> In fact, I'd be surprised if they didn't rewrite their mutex
> implementation given the result you report.
Looking a bit more carefully, it appears that the problem is that they
use a Mach semaphore rather than sched_yield() to make the mutex
sleep. That is an optimisation for cases where you expect threads to
block for a long time, but that is generally not how you use a pthread
mutex (that's what a condition variable is for). A better
implementation would spin a few times without sleeping, spin a few
more times calling sched_yield(), and only then wait on a semaphore,
atomically setting a flag in the mutex that tells the corresponding
unlock call to signal the semaphore. They may be doing this and just
have really badly-tuned heuristics, in which case it's quite easy to
fix. If they haven't already fixed it, though, they probably won't
before 10.6 is released; changes to a mutex implementation are not
something you should rush.
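Roughly the policy I mean is sketched below. The names, the spin counts
and the use of a waiter count (Benaphore-style) in place of a literal
flag are my own invention for illustration, and the POSIX unnamed
semaphore would have to become a Mach semaphore on OS X; it is meant to
show the shape of the slow path, not to be a finished implementation:

#include <sched.h>
#include <semaphore.h>

enum { SPIN_COUNT = 100, YIELD_COUNT = 10 };   /* untuned guesses */

typedef struct
{
    volatile int waiters;   /* threads holding or queued for the lock */
    sem_t wakeup;           /* contended lockers sleep here */
} adaptive_mutex_t;

static inline void adaptive_mutex_init(adaptive_mutex_t *m)
{
    m->waiters = 0;
    sem_init(&m->wakeup, 0, 0);
}

static inline void adaptive_mutex_lock(adaptive_mutex_t *m)
{
    int i;
    /* Phase 1: spin without giving up the CPU; 0 -> 1 means we took a
     * completely free lock. */
    for (i = 0; i < SPIN_COUNT; i++)
    {
        if (__sync_bool_compare_and_swap(&m->waiters, 0, 1))
            return;
    }
    /* Phase 2: keep trying, but yield so the current holder gets a
     * chance to run and release the lock. */
    for (i = 0; i < YIELD_COUNT; i++)
    {
        if (__sync_bool_compare_and_swap(&m->waiters, 0, 1))
            return;
        sched_yield();
    }
    /* Phase 3: queue up and sleep.  Raising the count above 1 is what
     * tells the unlock call that it must post the semaphore; if the
     * old count was 0 the lock was actually free and is now ours. */
    if (__sync_fetch_and_add(&m->waiters, 1) > 0)
        sem_wait(&m->wakeup);
}

static inline void adaptive_mutex_unlock(adaptive_mutex_t *m)
{
    /* Drop our claim; if anyone is queued (old count > 1), hand the
     * lock over by posting the semaphore. */
    if (__sync_fetch_and_sub(&m->waiters, 1) > 1)
        sem_post(&m->wakeup);
}

The point of the three phases is that the semaphore is only touched
once the lock has stayed contended for long enough that sleeping is
cheaper than continuing to spin.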
A perfect implementation of a mutex would record the average waiting
time for each {thread, mutex} pair and use that to choose the waiting
policy. In most cases, however, sched_yield() is sufficient, because
good code doesn't hold mutexes for very long (Amdahl's law), so just
kicking the other threads to run for a bit is usually enough to let
the holder give the lock up. And, of course, you should never be using
a mutex in a situation where you need latency guarantees.
In short, it looks like someone just copied the SysV semaphore code
into the pthreads implementation without noticing that SysV semaphores
and pthread mutexes are used in very different ways.
David
_______________________________________________
Etoile-dev mailing list
[email protected]
https://mail.gna.org/listinfo/etoile-dev