Folks,

I ran some more tests and found the following with both master and v2.x:
mpirun --host n0:16,n1:16 -np 4 --tag-output hostname | sort
[1,0]<stdout>:n0
[1,1]<stdout>:n0
[1,2]<stdout>:n0
[1,3]<stdout>:n0
The output is the same with the --map-by socket option.
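For reference, the --map-by socket variant would have been run along these lines (same hosts and slot counts as above; I am only spelling out the command for clarity):
mpirun --host n0:16,n1:16 -np 4 --tag-output --map-by socket hostname | sort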
Now, without specifying the number of slots per host, and with the --oversubscribe option (which is mandatory for v2.x):

v2.x:
mpirun --host n0,n1 -np 4 --tag-output --oversubscribe hostname | sort
[1,0]<stdout>:n0
[1,1]<stdout>:n0
[1,2]<stdout>:n1
[1,3]<stdout>:n1
master:
mpirun --host n0,n1 -np 4 --tag-output --oversubscribe hostname | sort
[1,0]<stdout>:n0
[1,1]<stdout>:n0
[1,2]<stdout>:n0
[1,3]<stdout>:n0
There is no change if the --map-by socket option is used.
My observation is that the hardware topology is not retrieved when the number of slots is specified (on both v2.x and master). In that case the default policy is --map-by slot, *and* an explicit --map-by socket option seems to be ignored. Should we abort instead of silently ignoring this option?

When the number of slots is not specified (and --oversubscribe is used), the hardware topology seems to be retrieved on v2.x, but not on master; master only retrieves the number of slots and uses that.

From an end-user point of view, the default mapping policy is therefore --map-by socket on v2.x and --map-by slot on master, and --map-by socket seems to be ignored on master.
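A quick way to double check which policy is actually applied would be to add --display-map to the same command line and compare what each branch prints just before launch (a sketch only; the exact fields in the printed map may differ between branches):
mpirun --display-map --host n0,n1 -np 4 --oversubscribe --map-by socket hostname
Based on the hostname outputs above, I would expect the map to show two procs per node on v2.x, and all four procs on n0 on master.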
I re-read the previous discussions, and I do not think this level of detail was ever covered.
FWIW, the --map-by node option is correctly interpreted on both master and v2.x:
mpirun --host n0,n1 -np 4 --tag-output --oversubscribe --map-by node hostname | sort
[1,0]<stdout>:n0
[1,1]<stdout>:n1
[1,2]<stdout>:n0
[1,3]<stdout>:n1
Also, I can get the mapping I wished for/expected with --map-by ppr:2:node.
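For completeness, that run would look something like this (host list and --oversubscribe carried over from the runs above); the output shown is the mapping I was after, i.e. the first 2 tasks on n0 and the last 2 on n1:
mpirun --host n0,n1 -np 4 --tag-output --oversubscribe --map-by ppr:2:node hostname | sort
[1,0]<stdout>:n0
[1,1]<stdout>:n0
[1,2]<stdout>:n1
[1,3]<stdout>:n1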
Bottom line:
1) Should we abort if the number of slots is explicitly specified and --map-by socket (or a similar option) is requested?
2) On master only, when the number of slots per host is not specified, should we retrieve the hardware topology instead of the number of slots? If not, should we abort if --map-by socket is specified?
If there is a consensus and changes are desired, I am fine with trying to implement them.
Cheers,
Gilles
On 5/17/2016 11:01 AM, Gilles Gouaillardet wrote:
Folks,
Currently, the default mapping policy on master is different from the one on v2.x.
My preliminary question is: when will the master mapping policy land in the release branch? v2.0.0? v2.x? v3.0.0?
Here are some commands and their output (both n0 and n1 have 16 cores each; mpirun runs on n0).
First, let's force 2 slots per node via the --host parameter and play with the mapping:
[gilles@n0 ~]$ mpirun --tag-output --host n0:2,n1:2 -np 4 hostname | sort
[1,0]<stdout>:n0
[1,1]<stdout>:n0
[1,2]<stdout>:n1
[1,3]<stdout>:n1
[gilles@n0 ~]$ mpirun --tag-output --host n0:2,n1:2 -np 4 --map-by socket hostname | sort
[1,0]<stdout>:n0
[1,1]<stdout>:n0
[1,2]<stdout>:n1
[1,3]<stdout>:n1
/* so far so good: the default mapping is --map-by socket, and the mapping looks correct to me */
[gilles@n0 ~]$ mpirun --tag-output --host n0:2,n1:2 -np 4 --map-by node hostname | sort
[1,0]<stdout>:n0
[1,1]<stdout>:n1
[1,2]<stdout>:n0
[1,3]<stdout>:n1
/* mapping looks correct to me too */
Now let's force 4 slots per node:
[gilles@n0 ~]$ mpirun --tag-output --host n0:4,n1:4 -np 4 --map-by node hostname | sort
[1,0]<stdout>:n0
[1,1]<stdout>:n1
[1,2]<stdout>:n0
[1,3]<stdout>:n1
/* same output as previously, looks correct to me */
[gilles@n0 ~]$ mpirun --tag-output --host n0:4,n1:4 -np 4 --map-by socket hostname | sort
[1,0]<stdout>:n0
[1,1]<stdout>:n0
[1,2]<stdout>:n0
[1,3]<stdout>:n0
/* all tasks run on n0 even though I explicitly requested --map-by socket; that looks wrong to me */
[gilles@n0 ~]$ mpirun --tag-output --host n0:4,n1:4 -np 4 hostname | sort
[1,0]<stdout>:n0
[1,1]<stdout>:n0
[1,2]<stdout>:n0
[1,3]<stdout>:n0
/* same output as previously, which makes sense to me since the default mapping policy is --map-by socket, but all tasks run on n0, which still looks wrong to me */
If I do not force the number of slots, I get the same output (16 cores are detected on each node) regardless of the --map-by socket option. It seems --map-by core is used, regardless of what we pass on the command line.
In the last cases, is running all tasks on one node the intended behavior? If yes, which mapping option can be used to run the first 2 tasks on the first node and the last 2 tasks on the second node?
Cheers,
Gilles