Here is a diff relative of OpenBSD 5.3 to avoid taking a lock in
pthread_getspecific in the common case where storage for the key is
already allocated.
I have only tested this patch for my use case where a single key is
created once and used many times. In that case, the bottleneck I
observed is gone. My parallel program sees linear speedup.
This patch would be unsafe if elements were ever removed from the
per-thread list of allocated storage. In the 5.3 implementation the
list can only grow. Its size is bounded by PTHREAD_KEYS_MAX so I
do not consider this a leak. Possibly it should be an array instead
of a list.
Index: rthread_tls.c
===================================================================
RCS file: /cvs/src/lib/librthread/rthread_tls.c,v
retrieving revision 1.13
diff -c -r1.13 rthread_tls.c
*** rthread_tls.c 6 Nov 2011 11:48:59 -0000 1.13
--- rthread_tls.c 30 Sep 2013 14:27:13 -0000
***************
*** 96,102 ****
struct rthread_storage *rs;
pthread_t self;
! _spinlock(&rkeyslock);
if (!rkeys[key].used) {
rs = NULL;
goto out;
--- 96,107 ----
struct rthread_storage *rs;
pthread_t self;
! /* No lock is needed if
! (1) the key is not valid (undefined behavior)
! (2) storage is already allocated for the key (the linked
! list is only ever modified by adding to the head)
! */
!
if (!rkeys[key].used) {
rs = NULL;
goto out;
***************
*** 109,125 ****
break;
}
if (!rs) {
! rs = calloc(1, sizeof(*rs));
! if (!rs)
! goto out;
! rs->keyid = key;
! rs->data = NULL;
! rs->next = self->local_storage;
! self->local_storage = rs;
}
out:
- _spinunlock(&rkeyslock);
return (rs);
}
--- 114,138 ----
break;
}
if (!rs) {
! _spinlock(&rkeyslock);
! /* Repeat search with lock held */
! for (rs = self->local_storage; rs; rs = rs->next) {
! if (rs->keyid == key)
! break;
! }
! if (!rs) {
! rs = calloc(1, sizeof(*rs));
! if (rs) {
! rs->keyid = key;
! rs->data = NULL;
! rs->next = self->local_storage;
! self->local_storage = rs;
! }
! }
! _spinunlock(&rkeyslock);
}
out:
return (rs);
}
>
> Problem in a nutshell: pthread_getspecific serializes execution of
> threads using it. Without a better implementation my program doesn't
> work on OpenBSD.
>
> I am trying to port Cilk+ to OpenBSD (5.3).
>
> Cilk+ is a multithreaded extension to C/C++. It adds some bookkeeping
> operations in the prologue of some functions. In many Cilk programs
> most function calls do this bookkeeping (dynamic count, not static).
>
> The per-call bookkeeping calls pthread_getspecific. pthread_getspecific
> takes out a spinlock. The lock is apparently needed in case of a race
> with pthread_key_delete. This is unlikely to happen, but I suppose it
> is possible. Every function call in this multithreaded program
> serializes waiting on the lock. Also, the cache line with the lock is
> constantly moving between processors.
>
> This is worse than useless for Cilk. You're much better off with a
> single threaded program.
>
> An older version of Cilk used a thread local storage class (__thread).
> If memory serves, the switch to pthread_getspecific was driven by a few
> considerations:
>
> 1. Thread local variables don't get along well with shared libraries.
>
> 2. Thread local variables are less portable. OpenBSD doesn't really
> support them, for example. They are emulated with pthread_getspecific.
>
> 3. On Linux/x86 pthread_getspecific is very fast, essentially a move
> instruction with a segment override.
>
> It seems to me the implementation of pthread_getspecific doesn't need to
> be as slow as it is.
>
> It ought to be possible to have multiple readers be always nonblocking
> as long as the key list doesn't change, and possibly even if it does
> change. pthread_getspecific only needs a read lock rather than a mutex.
> The rwlock in librthread starts with a spinlock, so it's not the answer.
>
> Any thoughts?
>