Re: [OMPI users] affinity issues under cpuset torque 1.8.1
Hi,

I've been following this thread because it may be relevant to our setup. Is there a drawback to having orte_hetero_nodes=1 as a default MCA parameter? Is there a reason why the most generic case is not assumed?

Maxime Boissonneault

On 2014-06-20 13:48, Ralph Castain wrote:

Put "orte_hetero_nodes=1" in your default MCA param file - users can override by setting that param to 0.

On Jun 20, 2014, at 10:30 AM, Brock Palen wrote:

Perfection! That appears to do it for our standard case.

Now I know how to set MCA options by env var or config file. How can I make this the default, which a user can then override?

Brock Palen
www.umich.edu/~brockp
CAEN Advanced Computing
XSEDE Campus Champion
bro...@umich.edu
(734)936-1985

On Jun 20, 2014, at 1:21 PM, Ralph Castain wrote:

I think I begin to grok at least part of the problem. If you are assigning different cpus on each node, then you'll need to tell us that by setting --hetero-nodes; otherwise we won't have any way to report that back to mpirun for its binding calculation.

Otherwise, we expect that the cpuset of the first node we launch a daemon onto (or where mpirun is executing, if we are only launching local to mpirun) accurately represents the cpuset on every node in the allocation.

We still might well have a bug in our binding computation - but the above will definitely impact what you said the user did.

On Jun 20, 2014, at 10:06 AM, Brock Palen wrote:

Extra data point: if I do

[brockp@nyx5508 34241]$ mpirun --report-bindings --bind-to core hostname
--------------------------------------------------------------------------
A request was made to bind to that would result in binding more
processes than cpus on a resource:

   Bind to:     CORE
   Node:        nyx5513
   #processes:  2
   #cpus:       1

You can override this protection by adding the "overload-allowed"
option to your binding directive.
--------------------------------------------------------------------------
[brockp@nyx5508 34241]$ mpirun -H nyx5513 uptime
 13:01:37 up 31 days, 23:06,  0 users,  load average: 10.13, 10.90, 12.38
 13:01:37 up 31 days, 23:06,  0 users,  load average: 10.13, 10.90, 12.38

[brockp@nyx5508 34241]$ mpirun -H nyx5513 --bind-to core hwloc-bind --get
0x0010
0x1000

[brockp@nyx5508 34241]$ cat $PBS_NODEFILE | grep nyx5513
nyx5513
nyx5513

Interesting: if I force bind-to-core, MPI barfs saying there is only 1 cpu available, while PBS says it gave the job two; and if I run hwloc-bind --get just on that node (this is all inside an interactive job), I get what I expect. Is there a way to get a map of what MPI thinks it has on each host?

Brock Palen
www.umich.edu/~brockp
CAEN Advanced Computing
XSEDE Campus Champion
bro...@umich.edu
(734)936-1985

On Jun 20, 2014, at 12:38 PM, Brock Palen wrote:

I was able to reproduce it in my test.

orted affinity set by cpuset:

[root@nyx5874 ~]# hwloc-bind --get --pid 103645
0xc002

This mask (cpus 1, 14, 15), which spans sockets, matches the cpuset set up by the batch system:

[root@nyx5874 ~]# cat /dev/cpuset/torque/12719806.nyx.engin.umich.edu/cpus
1,14-15

The ranks, though, were all bound to the same core:

[root@nyx5874 ~]# hwloc-bind --get --pid 103871
0x8000
[root@nyx5874 ~]# hwloc-bind --get --pid 103872
0x8000
[root@nyx5874 ~]# hwloc-bind --get --pid 103873
0x8000

Which is core 15. report-bindings gave me the following output. You can see how a few nodes had all their ranks bound to the same core - the last one in each case. I only gave you the results for the host nyx5874.
[nyx5526.engin.umich.edu:23726] MCW rank 0 is not bound (or bound to all available processors)
[nyx5878.engin.umich.edu:103925] MCW rank 8 is not bound (or bound to all available processors)
[nyx5533.engin.umich.edu:123988] MCW rank 1 is not bound (or bound to all available processors)
[nyx5879.engin.umich.edu:102808] MCW rank 9 is not bound (or bound to all available processors)
[nyx5874.engin.umich.edu:103645] MCW rank 41 bound to socket 1[core 15[hwt 0]]: [./././././././.][./././././././B]
[nyx5874.engin.umich.edu:103645] MCW rank 42 bound to socket 1[core 15[hwt 0]]: [./././././././.][./././././././B]
[nyx5874.engin.umich.edu:103645] MCW rank 43 bound to socket 1[core 15[hwt 0]]: [./././././././.][./././././././B]
[nyx5888.engin.umich.edu:117400] MCW rank 11 is not bound (or bound to all available processors)
[nyx5786.engin.umich.edu:30004] MCW rank 19 bound to socket 1[core 15[hwt 0]]: [./././././././.][./././././././B]
[nyx5786.engin.umich.edu:30004] MCW rank 18 bound to socket 1[core 15[hwt 0]]: [./././././././.][./././././././B]
[nyx5594.engin.umich.edu:33884] MCW rank 24 bound to socket 1[core 15[hwt 0]]: [./././././././.][./././././././B]
[nyx5594.engin.umich.edu:33884] MCW rank 25 bound to socket 1[core 15[hwt 0]]: [./././././././.][./././././././B]
[nyx5594.engin.umich.edu:33884] MCW rank 26 bound to socket 1[core 15[hwt 0]]: [./././././././.][./././././././B]
[nyx5798.engin.umich.edu:53026] MCW rank 59
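Eyeballing output like the above is error-prone. Below is a small sketch of two helpers (my own, not Open MPI tools) for reading these diagnostics: one expands a hwloc hex mask such as 0xc002 into the PU indices it covers, the other groups --report-bindings lines by host and core and flags cores carrying more than one rank. Both assume the exact formats quoted in this thread; note that hwloc prints masks wider than 32 bits as comma-separated chunks, which this sketch does not handle.

```python
import re
from collections import defaultdict

def mask_to_cpus(mask_str):
    """Expand a hwloc-style hex bitmask into the list of set bit indices (PUs)."""
    mask = int(mask_str, 16)
    return [bit for bit in range(mask.bit_length()) if mask >> bit & 1]

# Matches e.g.: [nyx5874.engin.umich.edu:103645] MCW rank 41 bound to socket 1[core 15[hwt 0]]
BOUND = re.compile(r"\[([^:\]]+):\d+\] MCW rank (\d+) bound to socket (\d+)\[core (\d+)")

def overloaded_cores(report):
    """Return {(host, socket, core): [ranks]} for cores that carry more than one rank."""
    ranks = defaultdict(list)
    for host, rank, sock, core in BOUND.findall(report):
        ranks[(host, int(sock), int(core))].append(int(rank))
    return {key: r for key, r in ranks.items() if len(r) > 1}

# Masks from the thread: the orted's cpuset vs. the core every rank got
print(mask_to_cpus("0xc002"))  # [1, 14, 15]
print(mask_to_cpus("0x8000"))  # [15]

sample = (
    "[nyx5874.engin.umich.edu:103645] MCW rank 41 bound to socket 1[core 15[hwt 0]]\n"
    "[nyx5874.engin.umich.edu:103645] MCW rank 42 bound to socket 1[core 15[hwt 0]]\n"
    "[nyx5526.engin.umich.edu:23726] MCW rank 0 is not bound\n"
)
print(overloaded_cores(sample))  # {('nyx5874.engin.umich.edu', 1, 15): [41, 42]}
```

Running the second helper over the full report above would flag nyx5874, nyx5786, and nyx5594 - exactly the hosts where every rank landed on core 15.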
Re: [OMPI users] affinity issues under cpuset torque 1.8.1
Perfection - flexible, extensible, so nice. BTW, this doesn't happen on older versions:

[brockp@flux-login2 34241]$ ompi_info --param all all
Error getting SCIF driver version
MCA btl: parameter "btl_tcp_if_include" (current value: "", data source: default, level: 1 user/basic, type: string)
         Comma-delimited list of devices and/or CIDR notation of networks to use for MPI communication (e.g., "eth0,192.168.0.0/16"). Mutually exclusive with btl_tcp_if_exclude.
MCA btl: parameter "btl_tcp_if_exclude" (current value: "127.0.0.1/8,sppp", data source: default, level: 1 user/basic, type: string)
         Comma-delimited list of devices and/or CIDR notation of networks to NOT use for MPI communication -- all devices not matching these specifications will be used (e.g., "eth0,192.168.0.0/16"). If set to a non-default value, it is mutually exclusive with btl_tcp_if_include.

This is normally much longer. And yes, we don't have the Phi stuff installed on all nodes - strange that the 'all all' output is now very short. ompi_info -a still works, though.

Brock Palen
www.umich.edu/~brockp
CAEN Advanced Computing
XSEDE Campus Champion
bro...@umich.edu
(734)936-1985

On Jun 20, 2014, at 1:48 PM, Ralph Castain wrote:
> Put "orte_hetero_nodes=1" in your default MCA param file - users can override
> by setting that param to 0
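For anyone landing on this thread later: Ralph's suggestion maps onto Open MPI's usual MCA-parameter precedence (command line over environment over user file over system file). A sketch, assuming a default install whose prefix is /opt/openmpi (an assumption - substitute your site's prefix):

```shell
# Site-wide default: the system MCA parameter file under the install prefix
# (assumed prefix /opt/openmpi).
echo "orte_hetero_nodes = 1" >> /opt/openmpi/etc/openmpi-mca-params.conf

# Any user can override it in their personal parameter file ...
mkdir -p ~/.openmpi
echo "orte_hetero_nodes = 0" >> ~/.openmpi/mca-params.conf

# ... or per run, via the environment or the command line:
export OMPI_MCA_orte_hetero_nodes=0
mpirun --mca orte_hetero_nodes 0 ./a.out
```

The later sources win, so a site default in the system file stays in effect until a user deliberately overrides it - which is exactly the behavior asked for above.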