Jed, I think you're right. There are several approaches within OpenMP for doing what Barry is asking. Fundamentally, when distributing a for loop, you can use a dynamic scheduler:
See the section on scheduling clauses in the Wikipedia article for a short overview: http://en.wikipedia.org/wiki/OpenMP Or, as Jed would prefer, the standard itself (Table 2-1 of the 3.0 spec, available here: http://openmp.org/wp/openmp-specifications/). The dynamic and guided schedule variants are described as follows:

"When schedule(dynamic, chunk_size) is specified, the iterations are distributed to threads in the team in chunks as the threads request them. Each thread executes a chunk of iterations, then requests another chunk, until no chunks remain to be distributed. Each chunk contains chunk_size iterations, except for the last chunk to be distributed, which may have fewer iterations. When no chunk_size is specified, it defaults to 1.

When schedule(guided, chunk_size) is specified, the iterations are assigned to threads in the team in chunks as the executing threads request them. Each thread executes a chunk of iterations, then requests another chunk, until no chunks remain to be assigned. For a chunk_size of 1, the size of each chunk is proportional to the number of unassigned iterations divided by the number of threads in the team, decreasing to 1. For a chunk_size with value k (greater than 1), the size of each chunk is determined in the same way, with the restriction that the chunks do not contain fewer than k iterations (except for the last chunk to be assigned, which may have fewer than k iterations). When no chunk_size is specified, it defaults to 1."

A task-based model, similar to OpenCL/CUDA, is also available. If you are worried about NUMA problems, I think you can work with thread ids explicitly. This is 'MPI-like' in the sense that each thread is identified by its own rank, so you can team threads by memory region, provided the compiler/runtime guarantees thread affinity.
A

On Fri, Nov 12, 2010 at 1:22 AM, Jed Brown <jed at 59a2.org> wrote:
> On Fri, Nov 12, 2010 at 02:18, Barry Smith <bsmith at mcs.anl.gov> wrote:
>>
>> How do you get adaptive load balancing (across the cores inside a process)
>> if you have the OpenMP compiler decide the partitioning/parallelism? This was
>> Bill's point about why not to use OpenMP. For example, if you give each core the
>> same amount of work up front, they will end up not finishing at the same time, so
>> you have wasted cycles.
>
> Hmm, I think this issue is largely subordinate to the memory locality (for
> the sort of work we usually care about), but the OpenMP could be more
> dynamic about distributing work. I.e. this could be an OpenMP
> implementation or tuning issue, but I don't see it as a fundamental
> disadvantage of that programming model. I could be wrong.
>
> Jed