http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49490

           Summary: suboptimal load balancing in loops
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Severity: minor
          Priority: P3
         Component: libgomp
        AssignedTo: unassig...@gcc.gnu.org
        ReportedBy: dennis.jesper...@nasa.gov


Created attachment 24573
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=24573
test code to show how a compiler/runtime splits an OpenMP loop

The OpenMP runtime library produces a correct but suboptimal load balance
in parallel loops.
For example, a loop of length 33 with 8 OpenMP threads will give the
threads work of lengths 5, 5, 5, 5, 5, 5, 3, 0 respectively.  This is logically
correct, but imagine a dual-socket 4 core + 4 core configuration; then
the "left" socket has 20 units of work while the "right" socket has 13
units of work.  This could put undue pressure on the left cache(s) and/or
memory connection.  It would be better to spread out the work as much
as possible, so in the example in question the threads would get work
of lengths 5, 4, 4, 4, 4, 4, 4, 4.
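
The two splits can be reproduced with a small sketch. This is illustration only, not the code in libgomp: the function names are hypothetical, and the first function assumes the observed split comes from giving every thread a fixed chunk of ceil(n / nthreads) iterations until the loop runs out, which is an assumption that happens to match the numbers above.

```python
def ceil_chunk_split(n, nthreads):
    """Assumed model of the observed split: fixed chunk of ceil(n / nthreads)."""
    chunk = -(-n // nthreads)  # ceiling division
    sizes = []
    remaining = n
    for _ in range(nthreads):
        take = min(chunk, remaining)  # later threads get whatever is left
        sizes.append(take)
        remaining -= take
    return sizes


def balanced_split(n, nthreads):
    """Requested split: floor(n / nthreads) each, first n % nthreads get one more."""
    base, extra = divmod(n, nthreads)
    return [base + 1 if t < extra else base for t in range(nthreads)]


if __name__ == "__main__":
    for split in (ceil_chunk_split(33, 8), balanced_split(33, 8)):
        # Per-socket load, assuming threads 0-3 sit on one socket and 4-7 on the other.
        print(split, sum(split[:4]), sum(split[4:]))
```

For 33 iterations and 8 threads, the first function gives 5, 5, 5, 5, 5, 5, 3, 0 (20 versus 13 units per socket), and the second gives 5, 4, 4, 4, 4, 4, 4, 4 (17 versus 16).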

It should be fairly easy to modify libgomp/iter.c to produce the better
load balancing (at least I think that's where the modification would go).

The attached Fortran code prints the load balance; the Portland Group and
Intel compilers give the desired even balance.
