On Mon, 2008-02-11 at 14:31 -0600, Olof Johansson wrote:
> On Mon, Feb 11, 2008 at 08:58:46PM +0100, Mike Galbraith wrote:
> > It shouldn't matter if you yield or not, really. Yield should reduce
> > the number of non-work spin cycles wasted awaiting preemption as
> > threads execute in series (the problem), and should improve your
> > performance numbers, but not beyond single threaded.
> >
> > If I plugged a yield into the busy wait, I would expect to see a
> > large behavioral difference due to yield implementation changes, but
> > that would only be a symptom in this case, no?  Yield should be a
> > noop.
>
> Exactly. It made a big impact on the first testcase from Friday, where
> the spun-off thread spent the bulk of its time in the busy-wait loop,
> with a very small initial workload loop. The yield passed the CPU over
> to the other thread, which got a chance to run the small workload,
> followed by a quick finish by both of them. The better model spends
> the bulk of the time in the first workload loop, so yielding doesn't
> gain nearly the same amount.

There is a strong dependency on execution order in this testcase.
Between CPU affinity and giving the child a little head start to reduce
the chance of a busy wait (100% if the child wakes on the same CPU and
doesn't preempt the parent), the modified testcase behaves. I don't
think I should need the CPU affinity, but I do.

If you plunk a usleep(1) in prior to calling thread_func(), does your
testcase's performance change radically? If so, I wonder whether the
real application has the same kind of dependency.

	-Mike
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <pthread.h>
#include <sched.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/syscall.h>

#ifdef __PPC__
static void atomic_inc(volatile long *a)
{
	long result;	/* scratch output for the ll/sc loop */

	asm volatile ("1:\n\
		lwarx	%0,0,%1\n\
		addic	%0,%0,1\n\
		stwcx.	%0,0,%1\n\
		bne-	1b"
		: "=&r" (result) : "r" (a) : "cc", "memory");
}
#else
static void atomic_inc(volatile long *a)
{
	asm volatile ("lock; incl %0" : "+m" (*a));
}
#endif

long usecs(void)
{
	struct timeval tv;

	gettimeofday(&tv, NULL);
	return tv.tv_sec * 1000000 + tv.tv_usec;
}

void burn(long *burnt)
{
	long then, now, delta, tolerance = 10;

	then = now = usecs();
	while (now == then)
		now = usecs();
	delta = now - then;
	/* only count small deltas, to filter out preemption gaps */
	if (delta < tolerance)
		*burnt += delta;
}

volatile long stopped;
long burn_usecs = 1000, tot_work, tot_wait;
pid_t parent;

#define gettid() syscall(SYS_gettid)

void *thread_func(void *cpus)
{
	long work = 0, wait = 0;
	cpu_set_t cpuset;
	pid_t whoami = gettid();

	if (whoami != parent) {
		CPU_ZERO(&cpuset);
		CPU_SET(1, &cpuset);
		sched_setaffinity(whoami, sizeof(cpuset), &cpuset);
		usleep(1);
	}

	while (work < burn_usecs)
		burn(&work);
	tot_work += work;

	atomic_inc(&stopped);

	/* Busy-wait until all threads have finished the workload loop. */
	while (stopped < *(int *)cpus)
		burn(&wait);
	tot_wait += wait;

	return NULL;
}

int main(int argc, char **argv)
{
	pthread_t thread;
	int iter = 500, cpus = 2;
	long t1, t2;
	cpu_set_t cpuset;

	if (argc > 1)
		iter = atoi(argv[1]);
	if (argc > 2)
		burn_usecs = atoi(argv[2]);

	parent = gettid();
	CPU_ZERO(&cpuset);
	CPU_SET(0, &cpuset);
	sched_setaffinity(parent, sizeof(cpuset), &cpuset);

	t1 = usecs();
	while (iter--) {
		stopped = 0;
		pthread_create(&thread, NULL, &thread_func, &cpus);
		/* child needs head start guarantee to avoid busy wait */
		usleep(1);
		thread_func(&cpus);
		pthread_join(thread, NULL);
	}
	t2 = usecs();

	printf("time: %ld (us) work: %ld wait: %ld idx: %2.2f\n",
		t2 - t1, tot_work, tot_wait, (double)tot_work / (t2 - t1));

	return 0;
}