Here is a diff relative to OpenBSD 5.3 that avoids taking a lock in pthread_getspecific in the common case, where storage for the key is already allocated.
I have only tested this patch for my use case, where a single key is created once and used many times. In that case the bottleneck I observed is gone, and my parallel program sees linear speedup. This patch would be unsafe if elements were ever removed from the per-thread list of allocated storage; in the 5.3 implementation the list can only grow. Its size is bounded by PTHREAD_KEYS_MAX, so I do not consider this a leak. Possibly it should be an array instead of a list.
Index: rthread_tls.c
===================================================================
RCS file: /cvs/src/lib/librthread/rthread_tls.c,v
retrieving revision 1.13
diff -c -r1.13 rthread_tls.c
*** rthread_tls.c	6 Nov 2011 11:48:59 -0000	1.13
--- rthread_tls.c	30 Sep 2013 14:27:13 -0000
***************
*** 96,102 ****
  	struct rthread_storage *rs;
  	pthread_t self;
  
! 	_spinlock(&rkeyslock);
  	if (!rkeys[key].used) {
  		rs = NULL;
  		goto out;
--- 96,107 ----
  	struct rthread_storage *rs;
  	pthread_t self;
  
! 	/* No lock is needed if
! 	   (1) the key is not valid (undefined behavior)
! 	   (2) storage is already allocated for the key (the linked
! 	       list is only ever modified by adding to the head)
! 	 */
! 
  	if (!rkeys[key].used) {
  		rs = NULL;
  		goto out;
***************
*** 109,125 ****
  			break;
  	}
  	if (!rs) {
! 		rs = calloc(1, sizeof(*rs));
! 		if (!rs)
! 			goto out;
! 		rs->keyid = key;
! 		rs->data = NULL;
! 		rs->next = self->local_storage;
! 		self->local_storage = rs;
  	}
  out:
- 	_spinunlock(&rkeyslock);
  	return (rs);
  }
--- 114,138 ----
  			break;
  	}
  	if (!rs) {
! 		_spinlock(&rkeyslock);
! 		/* Repeat search with lock held */
! 		for (rs = self->local_storage; rs; rs = rs->next) {
! 			if (rs->keyid == key)
! 				break;
! 		}
! 		if (!rs) {
! 			rs = calloc(1, sizeof(*rs));
! 			if (rs) {
! 				rs->keyid = key;
! 				rs->data = NULL;
! 				rs->next = self->local_storage;
! 				self->local_storage = rs;
! 			}
! 		}
! 		_spinunlock(&rkeyslock);
  	}
  out:
  	return (rs);
  }
>
> Problem in a nutshell: pthread_getspecific serializes execution of
> threads using it. Without a better implementation my program doesn't
> work on OpenBSD.
>
> I am trying to port Cilk+ to OpenBSD (5.3).
>
> Cilk+ is a multithreaded extension to C/C++. It adds some bookkeeping
> operations in the prologue of some functions. In many Cilk programs
> most function calls do this bookkeeping (dynamic count, not static).
>
> The per-call bookkeeping calls pthread_getspecific. pthread_getspecific
> takes out a spinlock. The lock is apparently needed in case of a race
> with pthread_key_delete. This is unlikely to happen, but I suppose it
> is possible. Every function call in this multithreaded program
> serializes waiting on the lock. Also, the cache line with the lock is
> constantly moving between processors.
>
> This is worse than useless for Cilk. You're much better off with a
> single threaded program.
>
> An older version of Cilk used a thread local storage class (__thread).
> If memory serves, the switch to pthread_getspecific was driven by a few
> considerations:
>
> 1. Thread local variables don't get along well with shared libraries.
>
> 2. Thread local variables are less portable. OpenBSD doesn't really
> support them, for example. They are emulated with pthread_getspecific.
>
> 3. On Linux/x86 pthread_getspecific is very fast, essentially a move
> instruction with a segment override.
>
> It seems to me the implementation of pthread_getspecific doesn't need to
> be as slow as it is.
>
> It ought to be possible to have multiple readers be always nonblocking
> as long as the key list doesn't change, and possibly even if it does
> change. pthread_getspecific only needs a read lock rather than a mutex.
> The rwlock in librthread starts with a spinlock, so it's not the answer.
>
> Any thoughts?
>