Re: [OMPI users] affinity issues under cpuset torque 1.8.1

2014-06-23 Thread Maxime Boissonneault

Hi,
I've been following this thread because it may be relevant to our setup.

Is there a drawback to having orte_hetero_nodes=1 as the default MCA 
parameter? Is there a reason why the most generic case is not assumed?


Maxime Boissonneault

On 2014-06-20 13:48, Ralph Castain wrote:

Put "orte_hetero_nodes=1" in your default MCA param file - uses can override by 
setting that param to 0
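
For concreteness, a minimal sketch of what that could look like (the file path
assumes a default Open MPI install prefix; adjust for your site):

  # system-wide defaults, e.g. $PREFIX/etc/openmpi-mca-params.conf
  orte_hetero_nodes = 1

A user can then override it for a single run, e.g. with the environment variable
OMPI_MCA_orte_hetero_nodes=0 or with "-mca orte_hetero_nodes 0" on the mpirun
command line; both take precedence over the param file.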


On Jun 20, 2014, at 10:30 AM, Brock Palen  wrote:


Perfection!  That appears to do it for our standard case.

Now I know how to set MCA options by env var or config file.  How can I make 
this the default, so that a user can then override it?

Brock Palen
www.umich.edu/~brockp
CAEN Advanced Computing
XSEDE Campus Champion
bro...@umich.edu
(734)936-1985



On Jun 20, 2014, at 1:21 PM, Ralph Castain  wrote:


I think I begin to grok at least part of the problem. If you are assigning 
different cpus on each node, then you'll need to tell us that by setting 
--hetero-nodes; otherwise we won't have any way to report that back to mpirun 
for its binding calculation.
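
As a sketch of what that looks like in practice (the executable and rank count
below are placeholders), either form should be equivalent on 1.8:

  mpirun --hetero-nodes -np 16 ./a.out
  mpirun -mca orte_hetero_nodes 1 -np 16 ./a.out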

Otherwise, we expect that the cpuset of the first node we launch a daemon onto 
(or where mpirun is executing, if we are only launching local to mpirun) 
accurately represents the cpuset on every node in the allocation.

We still might well have a bug in our binding computation - but the above will 
definitely impact what you said the user did.

On Jun 20, 2014, at 10:06 AM, Brock Palen  wrote:


Extra data point if I do:

[brockp@nyx5508 34241]$ mpirun --report-bindings --bind-to core hostname
--
A request was made to bind to that would result in binding more
processes than cpus on a resource:

  Bind to: CORE
  Node:nyx5513
  #processes:  2
  #cpus:  1

You can override this protection by adding the "overload-allowed"
option to your binding directive.
--
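
For reference, that override would be spelled roughly like this (a sketch; the
colon-separated qualifier is how I'd expect 1.8 to take it):

  mpirun --report-bindings --bind-to core:overload-allowed hostname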

[brockp@nyx5508 34241]$ mpirun -H nyx5513 uptime
13:01:37 up 31 days, 23:06,  0 users,  load average: 10.13, 10.90, 12.38
13:01:37 up 31 days, 23:06,  0 users,  load average: 10.13, 10.90, 12.38
[brockp@nyx5508 34241]$ mpirun -H nyx5513 --bind-to core hwloc-bind --get
0x0010
0x1000
[brockp@nyx5508 34241]$ cat $PBS_NODEFILE | grep nyx5513
nyx5513
nyx5513

Interesting: if I force bind-to core, MPI barfs saying there is only 1 cpu 
available, while PBS says it gave the job two; and if I run hwloc-bind --get 
directly on that node (this is all inside an interactive job), I get what I expect.

Is there a way to get a map of what MPI thinks it has on each host?
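
One sketch of how to get such a map (these are standard mpirun options; the
launch itself is just a placeholder): --display-allocation prints the node/slot
list mpirun thinks it was given, and --display-map shows where each rank will
be placed before launch.

  mpirun --display-allocation --display-map --report-bindings -np 2 hostname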

Brock Palen
www.umich.edu/~brockp
CAEN Advanced Computing
XSEDE Campus Champion
bro...@umich.edu
(734)936-1985



On Jun 20, 2014, at 12:38 PM, Brock Palen  wrote:


I was able to produce it in my test.

orted affinity set by cpuset:
[root@nyx5874 ~]# hwloc-bind --get --pid 103645
0xc002

This mask (cpus 1, 14, 15), which spans sockets, matches the cpuset set up by the 
batch system:
[root@nyx5874 ~]# cat /dev/cpuset/torque/12719806.nyx.engin.umich.edu/cpus
1,14-15
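
As a quick check on the mask arithmetic (my own breakdown, not from the logs):
0xc002 = 1100 0000 0000 0010 in binary, i.e. bits 1, 14 and 15 set, which is
exactly cpus 1,14-15.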

The ranks though were then all set to the same core:

[root@nyx5874 ~]# hwloc-bind --get --pid 103871
0x8000
[root@nyx5874 ~]# hwloc-bind --get --pid 103872
0x8000
[root@nyx5874 ~]# hwloc-bind --get --pid 103873
0x8000

Which is core 15:

--report-bindings gave me the following.
You can see how on a few nodes all the ranks were bound to the same core, the last 
one in each case.  I only gave you the results for the host nyx5874.

[nyx5526.engin.umich.edu:23726] MCW rank 0 is not bound (or bound to all 
available processors)
[nyx5878.engin.umich.edu:103925] MCW rank 8 is not bound (or bound to all 
available processors)
[nyx5533.engin.umich.edu:123988] MCW rank 1 is not bound (or bound to all 
available processors)
[nyx5879.engin.umich.edu:102808] MCW rank 9 is not bound (or bound to all 
available processors)
[nyx5874.engin.umich.edu:103645] MCW rank 41 bound to socket 1[core 15[hwt 0]]: 
[./././././././.][./././././././B]
[nyx5874.engin.umich.edu:103645] MCW rank 42 bound to socket 1[core 15[hwt 0]]: 
[./././././././.][./././././././B]
[nyx5874.engin.umich.edu:103645] MCW rank 43 bound to socket 1[core 15[hwt 0]]: 
[./././././././.][./././././././B]
[nyx5888.engin.umich.edu:117400] MCW rank 11 is not bound (or bound to all 
available processors)
[nyx5786.engin.umich.edu:30004] MCW rank 19 bound to socket 1[core 15[hwt 0]]: 
[./././././././.][./././././././B]
[nyx5786.engin.umich.edu:30004] MCW rank 18 bound to socket 1[core 15[hwt 0]]: 
[./././././././.][./././././././B]
[nyx5594.engin.umich.edu:33884] MCW rank 24 bound to socket 1[core 15[hwt 0]]: 
[./././././././.][./././././././B]
[nyx5594.engin.umich.edu:33884] MCW rank 25 bound to socket 1[core 15[hwt 0]]: 
[./././././././.][./././././././B]
[nyx5594.engin.umich.edu:33884] MCW rank 26 bound to socket 1[core 15[hwt 0]]: 
[./././././././.][./././././././B]
[nyx5798.engin.umich.edu:53026] MCW rank 59 

Re: [OMPI users] affinity issues under cpuset torque 1.8.1

2014-06-23 Thread Brock Palen
Perfection, flexible, extensible, so nice.

BTW, this doesn't happen with older versions:

[brockp@flux-login2 34241]$ ompi_info --param all all
Error getting SCIF driver version 
 MCA btl: parameter "btl_tcp_if_include" (current value: "",
  data source: default, level: 1 user/basic, type:
  string)
  Comma-delimited list of devices and/or CIDR
  notation of networks to use for MPI communication
  (e.g., "eth0,192.168.0.0/16").  Mutually exclusive
  with btl_tcp_if_exclude.
 MCA btl: parameter "btl_tcp_if_exclude" (current value:
  "127.0.0.1/8,sppp", data source: default, level: 1
  user/basic, type: string)
  Comma-delimited list of devices and/or CIDR
  notation of networks to NOT use for MPI
  communication -- all devices not matching these
  specifications will be used (e.g.,
  "eth0,192.168.0.0/16").  If set to a non-default
  value, it is mutually exclusive with
  btl_tcp_if_include.


This output is normally much longer.  And yes, we don't have the Phi stuff installed 
on all nodes, hence the SCIF error.  It's strange that 'all all' is now very short; 
ompi_info -a still works, though.
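
If it helps: on the 1.7/1.8 series ompi_info only prints parameters up to level 1
(user/basic) by default, so the short listing may just be the level filter rather
than missing components.  A couple of ways to get the full list back (a sketch):

  ompi_info --param all all --level 9
  ompi_info --all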



Brock Palen
www.umich.edu/~brockp
CAEN Advanced Computing
XSEDE Campus Champion
bro...@umich.edu
(734)936-1985



On Jun 20, 2014, at 1:48 PM, Ralph Castain  wrote:

> Put "orte_hetero_nodes=1" in your default MCA param file - uses can override 
> by setting that param to 0
> 
> 
> On Jun 20, 2014, at 10:30 AM, Brock Palen  wrote:
> 
>> Perfection!  That appears to do it for our standard case.
>> 
>> Now I know how to set MCA options by env var or config file.  How can I make 
>> this the default, so that a user can then override it?
>> 
>> Brock Palen
>> www.umich.edu/~brockp
>> CAEN Advanced Computing
>> XSEDE Campus Champion
>> bro...@umich.edu
>> (734)936-1985
>> 
>> 
>> 
>> On Jun 20, 2014, at 1:21 PM, Ralph Castain  wrote:
>> 
>>> I think I begin to grok at least part of the problem. If you are assigning 
>>> different cpus on each node, then you'll need to tell us that by setting 
>>> --hetero-nodes; otherwise we won't have any way to report that back to 
>>> mpirun for its binding calculation.
>>> 
>>> Otherwise, we expect that the cpuset of the first node we launch a daemon 
>>> onto (or where mpirun is executing, if we are only launching local to 
>>> mpirun) accurately represents the cpuset on every node in the allocation.
>>> 
>>> We still might well have a bug in our binding computation - but the above 
>>> will definitely impact what you said the user did.
>>> 
>>> On Jun 20, 2014, at 10:06 AM, Brock Palen  wrote:
>>> 
 Extra data point if I do:
 
 [brockp@nyx5508 34241]$ mpirun --report-bindings --bind-to core hostname
 --
 A request was made to bind to that would result in binding more
 processes than cpus on a resource:
 
 Bind to: CORE
 Node:nyx5513
 #processes:  2
 #cpus:  1
 
 You can override this protection by adding the "overload-allowed"
 option to your binding directive.
 --
 
 [brockp@nyx5508 34241]$ mpirun -H nyx5513 uptime
 13:01:37 up 31 days, 23:06,  0 users,  load average: 10.13, 10.90, 12.38
 13:01:37 up 31 days, 23:06,  0 users,  load average: 10.13, 10.90, 12.38
 [brockp@nyx5508 34241]$ mpirun -H nyx5513 --bind-to core hwloc-bind --get
 0x0010
 0x1000
 [brockp@nyx5508 34241]$ cat $PBS_NODEFILE | grep nyx5513
 nyx5513
 nyx5513
 
 Interesting: if I force bind-to core, MPI barfs saying there is only 1 cpu 
 available, while PBS says it gave the job two; and if I run hwloc-bind --get 
 directly on that node (this is all inside an interactive job), I get what I expect.
 
 Is there a way to get a map of what MPI thinks it has on each host?
 
 Brock Palen
 www.umich.edu/~brockp
 CAEN Advanced Computing
 XSEDE Campus Champion
 bro...@umich.edu
 (734)936-1985
 
 
 
 On Jun 20, 2014, at 12:38 PM, Brock Palen  wrote:
 
> I was able to produce it in my test.
> 
> orted affinity set by cpuset:
> [root@nyx5874 ~]# hwloc-bind --get --pid 103645
> 0xc002
> 
> This mask (cpus 1, 14, 15), which spans sockets, matches the cpuset set up 
> by the batch system: 
> [root@nyx5874 ~]# cat 
> /dev/cpuset/torque/12719806.nyx.engin.umich.edu/cpus 
> 1,14-15
> 
> The ranks though were then all set to the same core:
> 
> [root@nyx5874 ~]# hwloc-bind --get --pid 103871
> 0x000