[OMPI users] binding to core odd results

2020-05-23 Thread Steve Brasier via users
Hi,

While testing ReFrame, I ran the IMB pingpong benchmark using Open MPI 3.1.0
with:

mpirun -np 2 --report-bindings IMB-MPI1 pingpong

I find that if I set

OMPI_MCA_hwloc_base_binding_policy=core

I get slightly higher zero-size latency (1.58us vs 1.32us without this),
but the reported bindings are exactly the same (shown below - note
hyperthreading is enabled on this system).

I expected the bindings not to change, given that the docs say "--bind-to core"
is the default for this version of Open MPI. So any suggestions as to why
the latency changes? I know it's not necessarily a huge change, but it seems
odd.

[openhpc-compute-0.novalocal:03604] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../..]
[openhpc-compute-1.novalocal:02213] MCW rank 1 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../..]
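
One way to check "exactly the same" mechanically, rather than by eye, is to parse the --report-bindings lines from both runs and compare them. The following is a minimal sketch, assuming each report has been joined back onto a single line; the names parse_binding and same_placement are hypothetical helpers, not part of Open MPI. It only compares the host and the cores marked "B", so it cannot by itself explain a latency difference.

```python
import re

# Pattern for one unwrapped --report-bindings line in the format shown in this thread.
BINDING_RE = re.compile(
    r"\[(?P<host>[^:\]]+):(?P<pid>\d+)\] MCW rank (?P<rank>\d+) "
    r"bound to (?P<where>.+?): (?P<mask>\[.*\])\s*$"
)

# Rank 0 line copied from the output above (joined onto one line).
LINE_RANK0 = (
    "[openhpc-compute-0.novalocal:03604] MCW rank 0 bound to "
    "socket 0[core 0[hwt 0-1]]: "
    "[BB/../../../../../../../../../../../../../../..]"
    "[../../../../../../../../../../../../../../../..]"
)


def parse_binding(line):
    """Return the host, rank and (socket, core) pairs marked with 'B'."""
    match = BINDING_RE.search(line)
    if match is None:
        raise ValueError(f"not a binding report line: {line!r}")
    # Each [...] group in the mask is one socket; cores are separated by '/'.
    sockets = re.findall(r"\[([^\]]*)\]", match["mask"])
    bound = [
        (s, c)
        for s, socket in enumerate(sockets)
        for c, core in enumerate(socket.split("/"))
        if "B" in core
    ]
    return {
        "host": match["host"],
        "rank": int(match["rank"]),
        "bound_cores": bound,
    }


def same_placement(lines_a, lines_b):
    """True if both runs bind every rank to the same host and cores."""
    def key(lines):
        parsed = (parse_binding(line) for line in lines)
        return sorted((p["rank"], p["host"], p["bound_cores"]) for p in parsed)
    return key(lines_a) == key(lines_b)


if __name__ == "__main__":
    print(parse_binding(LINE_RANK0)["bound_cores"])
```

Feeding the captured lines from the run with OMPI_MCA_hwloc_base_binding_policy=core and the run without it into same_placement would confirm whether the reported placement really is identical.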


thanks
Steve

http://stackhpc.com/
Please note I work Tuesday to Friday.


[OMPI users] problems with scattering the ranks over all compute nodes using MPI spawning

2020-05-23 Thread Yang Liu via users
Dear openmpi developers,

I'm wondering how to scatter the ranks across all compute nodes when they
are spawned by a master process.

I'm working on a cluster where each node has two sockets, each populated
with a 16-core Haswell processor, and each core supports 2 hyper-threads.

Whenever the number of spawned processes is less than the total number of
cores (e.g. half the core count), I would like them to be spread as evenly
as possible across all nodes so that I can use OpenMP threading on each
rank. By default, it seems the spawning fills all 64 logical cores on one
node before moving to the next node.
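
The arithmetic behind this request can be sketched as follows. This is only an illustration: the node count of 4 and the helper names spread_layout and packed_nodes_used are assumptions, not details of the cluster described here, and the code says nothing about how Open MPI itself places spawned ranks.

```python
import math

# Node layout as described above.
SOCKETS_PER_NODE = 2
CORES_PER_SOCKET = 16
HWTHREADS_PER_CORE = 2

CORES_PER_NODE = SOCKETS_PER_NODE * CORES_PER_SOCKET        # physical cores
LOGICAL_PER_NODE = CORES_PER_NODE * HWTHREADS_PER_CORE      # hardware threads


def packed_nodes_used(n_ranks):
    """Nodes touched if ranks fill every logical core of a node first."""
    return math.ceil(n_ranks / LOGICAL_PER_NODE)


def spread_layout(n_ranks, n_nodes):
    """Ranks per node and physical cores per rank for an even spread."""
    ranks_per_node = math.ceil(n_ranks / n_nodes)
    cores_per_rank = max(1, CORES_PER_NODE // ranks_per_node)
    return ranks_per_node, cores_per_rank


if __name__ == "__main__":
    n_nodes = 4                                 # hypothetical allocation size
    n_ranks = n_nodes * CORES_PER_NODE // 2     # half of the physical cores
    print("packed: nodes used =", packed_nodes_used(n_ranks))
    print("spread: (ranks per node, cores per rank) =",
          spread_layout(n_ranks, n_nodes))
```

With these assumed numbers, 64 ranks packed by logical core fit on a single node, while an even spread gives 16 ranks per node with 2 physical cores each available for OpenMP threads.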

It seems I can use the runtime parameter
--map-by core:span to resolve this. Am I correct?

However, whenever I call MPI_spawn, I get the following runtime error. Could
you advise how to proceed with this?

[nid01876:62825] [[51146,0],0] ORTE_ERROR_LOG: Not found in file base/plm_base_receive.c at line 343
--
An internal error has occurred in ORTE:

[[51146,0],0] FORCE-TERMINATE AT (null):1 - error base/plm_base_receive.c(344)

This is something that should be reported to the developers.