On 01/09/2015 15:59, marcin.krotkiewski wrote:
> Dear Rolf and Brice,
>
> Thank you very much for your help. I have now moved the 'dubious' IB
> card from Slot 1 to Slot 5. It is now reported by hwloc as bound to a
> separate NUMA node. In this case OpenMPI works as could be expected:
>
>  - NUMA nodes that 'own' a card decide to use only that card (the
> other card is too far away)
>  - NUMA nodes that have no card decide to use both cards and both
> ports of each card, because they are equally far away
>
> I do not have practical experience with this, so I am not sure whether
> this is the best-performing behavior, but I guess it does look reasonable.
>
> Another question is how OpenMPI should deal with the hw configuration
> that originally caused problems. Right now the distance to the
> 'dubious' card is assumed to be 0 (close to everybody) and the other,
> 'correctly' inserted card is never used. Maybe it would be better to
> assign the maximum distance/latency instead of 0? In that scenario the
> system (at least my hw configuration) would effectively behave exactly
> as described above: 2 NUMA nodes use 1 card, the other 2 use both
> cards. And either way it would at least not eliminate the other card(s)
> from being used.

My feeling is: don't bother. I think all Intel processors now have their
PCI controllers inside the sockets, AMD PCI controllers are connected to
a single NUMA node, and the same goes for Power8. So you're not supposed
to see such 'dubious' cards anymore. If the BIOS is buggy, we have hwloc
environment variables to work around it.

Now, how do you detect the issue? I wonder if hwloc should autofix such
cases, or at least issue a warning. We already do that in some cases for
E5v3. We could just warn whenever "Sandy Bridge or later" reports a PCI
locality larger than a single NUMA node. However, the warning would be
quite common given how buggy BIOSes are.
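Something along these lines (just a sketch against the public hwloc 1.x
API, not the actual internal check; HWLOC_OBJ_NODE is the NUMA node type
there, and you again need I/O discovery and -lhwloc):

/* Sketch only: flag any PCI device whose reported locality covers more
 * than one NUMA node. */
#include <stdio.h>
#include <hwloc.h>

int main(void)
{
    hwloc_topology_t topo;
    hwloc_obj_t pci = NULL;

    hwloc_topology_init(&topo);
    hwloc_topology_set_flags(topo, HWLOC_TOPOLOGY_FLAG_IO_DEVICES);
    hwloc_topology_load(topo);

    while ((pci = hwloc_get_next_pcidev(topo, pci)) != NULL) {
        hwloc_obj_t parent = hwloc_get_non_io_ancestor_obj(topo, pci);
        /* how many NUMA nodes does the reported locality cover? */
        int nnodes = hwloc_get_nbobjs_inside_cpuset_by_type(topo,
                         parent->cpuset, HWLOC_OBJ_NODE);
        if (nnodes > 1)
            printf("warning: PCI %04x:%02x:%02x.%x locality spans %d NUMA "
                   "nodes (likely a BIOS bug)\n",
                   (unsigned) pci->attr->pcidev.domain,
                   (unsigned) pci->attr->pcidev.bus,
                   (unsigned) pci->attr->pcidev.dev,
                   (unsigned) pci->attr->pcidev.func, nnodes);
    }

    hwloc_topology_destroy(topo);
    return 0;
}

On your original configuration this should flag the 'dubious' card (its
locality was the whole machine), while the correctly reported one stays
quiet.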

Brice
