Jed, I think you're right. There are several approaches within OpenMP for doing what Barry is asking. Fundamentally, when distributing a for loop, you can use a dynamic scheduler:
See the section on scheduling clauses in the Wikipedia article for a short overview: http://en.wikipedia.org/wiki/OpenMP Or, as Jed would prefer, the standard itself (Table 2-1 of the 3.0 spec, available here: http://openmp.org/wp/openmp-specifications/). The dynamic and guided schedule variants are described as follows:

"When schedule(dynamic, chunk_size) is specified, the iterations are distributed to threads in the team in chunks as the threads request them. Each thread executes a chunk of iterations, then requests another chunk, until no chunks remain to be distributed. Each chunk contains chunk_size iterations, except for the last chunk to be distributed, which may have fewer iterations. When no chunk_size is specified, it defaults to 1.

When schedule(guided, chunk_size) is specified, the iterations are assigned to threads in the team in chunks as the executing threads request them. Each thread executes a chunk of iterations, then requests another chunk, until no chunks remain to be assigned. For a chunk_size of 1, the size of each chunk is proportional to the number of unassigned iterations divided by the number of threads in the team, decreasing to 1. For a chunk_size with value k (greater than 1), the size of each chunk is determined in the same way, with the restriction that the chunks do not contain fewer than k iterations (except for the last chunk to be assigned, which may have fewer than k iterations). When no chunk_size is specified, it defaults to 1."

A task-based model, similar to OpenCL/CUDA, is also available. If you are worried about NUMA problems, I think you can work with thread ids explicitly. This is 'MPI-like' in the sense that each thread is identified by its own rank, so you can team threads by memory region, provided the compiler/runtime guarantees thread affinity.
A

On Fri, Nov 12, 2010 at 1:22 AM, Jed Brown <jed at 59a2.org> wrote:
> On Fri, Nov 12, 2010 at 02:18, Barry Smith <bsmith at mcs.anl.gov> wrote:
>>
>> How do you get adaptive load balancing (across the cores inside a process)
>> if you have the OpenMP compiler decide the partitioning/parallelism? This was
>> Bill's point about why not to use OpenMP. For example, if you give each core the
>> same amount of work up front, they will end up not finishing at the same time, so
>> you have wasted cycles.
>
> Hmm, I think this issue is largely subordinate to the memory locality (for
> the sort of work we usually care about), but the OpenMP could be more
> dynamic about distributing work. I.e. this could be an OpenMP
> implementation or tuning issue, but I don't see it as a fundamental
> disadvantage of that programming model. I could be wrong.
>
> Jed