Hi Jeff,

My apologies for the delay in replying; I was flying back from the UK to the States, but I'm here now and can respond more promptly.
> I confirm that the hwloc message you sent (and your posts to the
> hwloc-users list) indicate that hwloc is getting confused by a buggy
> BIOS, but it's only dealing with the L3 cache, and that shouldn't
> affect the binding that OMPI is doing.
Great, good to know. I'd still be interested in learning how to build an
hwloc-parsable XML file as a workaround, especially if it fixes the
bindings (see below).
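In case it helps anyone else following the thread, here is a rough sketch of what I understand the XML workaround to look like. This assumes hwloc's standard lstopo tool and its HWLOC_XMLFILE / HWLOC_THISSYSTEM environment variables; "my_app" is just a placeholder, and I haven't verified that a hand-edited XML actually fixes the L3 confusion here:

```shell
# 1. Dump the topology hwloc currently detects on this node:
lstopo --of xml topology.xml

# 2. Hand-correct the bogus L3 cache objects in topology.xml, then tell
#    hwloc (and thus anything built on it) to load the fixed XML instead
#    of re-probing the buggy BIOS:
export HWLOC_XMLFILE=topology.xml
export HWLOC_THISSYSTEM=1   # assert the XML describes this very machine

# 3. Launch as usual (placeholder command):
# mpirun -np 48 ./my_app
```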
> 1. Run with "--report-bindings" and send the output. It'll
> prettyprint-render where OMPI thinks it is binding each process.
Please find it attached.
> 2. Run with "--bind-to none" and see if that helps. I.e., if, per
> #1, OMPI thinks it is binding correctly (i.e., each of the 48
> processes is being bound to a unique core), then perhaps hwloc is
> doing something wrong in the actual binding (i.e., binding the 48
> processes only among the lower 32 cores).
BINGO! As soon as I did this, all the cores indeed went to 100%! Here's
the updated timing (compared to the ~13 minutes from before):
real 1m8.442s
user 0m0.077s
sys 0m0.071s
So I guess the conclusion is that hwloc is somehow messing things up on
this chipset?
Thanks,
Andrej
test_report_bindings.stderr
