http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49490
Summary: suboptimal load balancing in loops
Product: gcc
Version: unknown
Status: UNCONFIRMED
Severity: minor
Priority: P3
Component: libgomp
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: dennis.jesper...@nasa.gov

Created attachment 24573
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=24573
test code to show how a compiler/runtime splits an OpenMP loop

The OpenMP runtime library produces a correct but suboptimal load balance in parallel loops. For example, a loop of length 33 with 8 OpenMP threads will give the threads work of lengths 5, 5, 5, 5, 5, 5, 3, 0 respectively. This is logically correct, but imagine a dual-socket 4 core + 4 core configuration; then the "left" socket has 20 units of work while the "right" socket has 13 units of work. This could put undue pressure on the left cache(s) and/or memory connection.

It would be better to spread out the work as much as possible, so in the example in question the threads would get work of lengths 5, 4, 4, 4, 4, 4, 4, 4. It should be fairly easy to modify libgomp/iter.c to produce the better load balancing (at least I think that's where the modification would go).

The attached Fortran code will show the load balance; the Portland Group and Intel products give the desired even balance.
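
To make the arithmetic concrete, here is a minimal sketch of the two splitting rules. It is not the code in libgomp/iter.c. The function names split_ceiling and split_balanced are hypothetical, and the ceiling-division rule is an assumption chosen only because it reproduces the lengths reported above.

```python
def split_ceiling(n, nthreads):
    """Assumed rule: every thread takes ceil(n / nthreads) iterations
    until the loop runs out, so the last threads may get little or nothing."""
    chunk = -(-n // nthreads)  # ceiling division
    counts = []
    remaining = n
    for _ in range(nthreads):
        take = min(chunk, remaining)
        counts.append(take)
        remaining -= take
    return counts


def split_balanced(n, nthreads):
    """Even rule: every thread takes floor(n / nthreads) iterations and
    the first (n mod nthreads) threads take one extra."""
    base, extra = divmod(n, nthreads)
    return [base + 1 if t < extra else base for t in range(nthreads)]


def per_socket(counts, cores_per_socket):
    """Sum the work per socket, assuming threads are placed on cores in order."""
    return [sum(counts[i:i + cores_per_socket])
            for i in range(0, len(counts), cores_per_socket)]


if __name__ == "__main__":
    for rule in (split_ceiling, split_balanced):
        counts = rule(33, 8)
        print(rule.__name__, counts, per_socket(counts, 4))
```

For the 33-iteration, 8-thread case, the ceiling rule puts 20 and 13 units of work on the two sockets, while the even rule puts 17 and 16. No two threads differ by more than one iteration under the even rule.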