Re: [OMPI users] affinity issues under cpuset torque 1.8.1

2014-06-20 Thread Brock Palen
In this case they are on a single socket, but as you can see it could be 
either/or depending on the job.

Brock Palen
www.umich.edu/~brockp
CAEN Advanced Computing
XSEDE Campus Champion
bro...@umich.edu
(734)936-1985



On Jun 19, 2014, at 2:44 PM, Ralph Castain  wrote:

> Sorry, I should have been clearer - I was asking if cores 8-11 are all on one 
> socket, or span multiple sockets
> 
> 
> On Jun 19, 2014, at 11:36 AM, Brock Palen  wrote:
> 
>> Ralph,
>> 
>> It was a large job spread across nodes.  Our system allows users to ask for 
>> 'procs', which can be laid out in any format. 
>> 
>> The list:
>> 
>>> [nyx5406:2][nyx5427:2][nyx5506:2][nyx5311:3]
>>> [nyx5329:4][nyx5398:4][nyx5396:11][nyx5397:11]
>>> [nyx5409:11][nyx5411:11][nyx5412:3]
>> 
>> That shows nyx5406 had 2 cores, nyx5427 also 2, and nyx5411 had 11.
>> 
>> They could be spread across any number of socket configurations.  We start 
>> very lax ("user requests X procs") and the user can request stricter 
>> requirements from there.  We support mostly serial users, and users can 
>> colocate on nodes.
>> 
>> That is good to know; I think we would want to turn our default to 'bind to 
>> core', except for our few users who use hybrid mode.
>> 
>> Our CPU set tells you which cores the job is assigned.  So in the problem 
>> case provided, the cpuset/cgroup shows that only cores 8-11 are available to 
>> this job on this node.
>> 
>> Brock Palen
>> www.umich.edu/~brockp
>> CAEN Advanced Computing
>> XSEDE Campus Champion
>> bro...@umich.edu
>> (734)936-1985
>> 
>> 
>> 
>> On Jun 18, 2014, at 11:10 PM, Ralph Castain  wrote:
>> 
>>> The default binding option depends on the number of procs - it is bind-to 
>>> core for np=2, and bind-to socket for np > 2. You never said, but should I 
>>> assume you ran 4 ranks? If so, then we should be trying to bind-to socket.
>>> 
>>> I'm not sure what your cpuset is telling us - are you binding us to a 
>>> socket? Are some cpus in one socket, and some in another?
>>> 
>>> It could be that the cpuset + bind-to socket is resulting in some odd 
>>> behavior, but I'd need a little more info to narrow it down.
>>> 
>>> 
>>> On Jun 18, 2014, at 7:48 PM, Brock Palen  wrote:
>>> 
 I have started using 1.8.1 for some codes (meep in this case).  It 
 sometimes works fine, but in a few cases I am seeing ranks being given 
 overlapping CPU assignments.
 
 Example job, default binding options (so by-core right?):
 
 Assigned nodes (the one in question is nyx5398); we use Torque CPU sets 
 and use TM to spawn.
 
 [nyx5406:2][nyx5427:2][nyx5506:2][nyx5311:3]
 [nyx5329:4][nyx5398:4][nyx5396:11][nyx5397:11]
 [nyx5409:11][nyx5411:11][nyx5412:3]
 
 [root@nyx5398 ~]# hwloc-bind --get --pid 16065
 0x0200
 [root@nyx5398 ~]# hwloc-bind --get --pid 16066
 0x0800
 [root@nyx5398 ~]# hwloc-bind --get --pid 16067
 0x0200
 [root@nyx5398 ~]# hwloc-bind --get --pid 16068
 0x0800
 
 [root@nyx5398 ~]# cat /dev/cpuset/torque/12703230.nyx.engin.umich.edu/cpus 
 8-11
 
 So Torque claims the CPU set set up for the job has 4 cores, but as you can 
 see the ranks were given identical bindings. 
 
 I checked the pids; they were part of the correct CPU set.  I also checked 
 orted:
 
 [root@nyx5398 ~]# hwloc-bind --get --pid 16064
 0x0f00
 [root@nyx5398 ~]# hwloc-calc --intersect PU 16064
 ignored unrecognized argument 16064
 
 [root@nyx5398 ~]# hwloc-calc --intersect PU 0x0f00
 8,9,10,11
 
 Which is exactly what I would expect.
 
 So ummm, I'm lost as to why this might happen.  What else should I check?  Like 
 I said, not all jobs show this behavior.
 
 Brock Palen
 www.umich.edu/~brockp
 CAEN Advanced Computing
 XSEDE Campus Champion
 bro...@umich.edu
 (734)936-1985
 
 
 
 ___
 users mailing list
 us...@open-mpi.org
 Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
 Link to this post: 
 http://www.open-mpi.org/community/lists/users/2014/06/24672.php
>>> 
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>> Link to this post: 
>>> http://www.open-mpi.org/community/lists/users/2014/06/24673.php
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post: 
>> http://www.open-mpi.org/community/lists/users/2014/06/24675.php
> 
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2014/06/24676.php




Re: [OMPI users] affinity issues under cpuset torque 1.8.1

2014-06-20 Thread Ralph Castain
Thanks - I'm just trying to reproduce one problem case so I can look at it. 
Given that I don't have access to a Torque machine, I need to "fake" it.


On Jun 20, 2014, at 9:15 AM, Brock Palen  wrote:

> In this case they are on a single socket, but as you can see it could be 
> either/or depending on the job.
> 
> Brock Palen
> www.umich.edu/~brockp
> CAEN Advanced Computing
> XSEDE Campus Champion
> bro...@umich.edu
> (734)936-1985
> 
> 
> 
> On Jun 19, 2014, at 2:44 PM, Ralph Castain  wrote:
> 
>> Sorry, I should have been clearer - I was asking if cores 8-11 are all on 
>> one socket, or span multiple sockets
>> 
>> 
>> On Jun 19, 2014, at 11:36 AM, Brock Palen  wrote:
>> 
>>> Ralph,
>>> 
>>> It was a large job spread across nodes.  Our system allows users to ask for 
>>> 'procs', which can be laid out in any format. 
>>> 
>>> The list:
>>> 
 [nyx5406:2][nyx5427:2][nyx5506:2][nyx5311:3]
 [nyx5329:4][nyx5398:4][nyx5396:11][nyx5397:11]
 [nyx5409:11][nyx5411:11][nyx5412:3]
>>> 
>>> That shows nyx5406 had 2 cores, nyx5427 also 2, and nyx5411 had 11.
>>> 
>>> They could be spread across any number of socket configurations.  We start 
>>> very lax ("user requests X procs") and the user can request stricter 
>>> requirements from there.  We support mostly serial users, and users can 
>>> colocate on nodes.
>>> 
>>> That is good to know; I think we would want to turn our default to 'bind to 
>>> core', except for our few users who use hybrid mode.
>>> 
>>> Our CPU set tells you which cores the job is assigned.  So in the problem 
>>> case provided, the cpuset/cgroup shows that only cores 8-11 are available to 
>>> this job on this node.
>>> 
>>> Brock Palen
>>> www.umich.edu/~brockp
>>> CAEN Advanced Computing
>>> XSEDE Campus Champion
>>> bro...@umich.edu
>>> (734)936-1985
>>> 
>>> 
>>> 
>>> On Jun 18, 2014, at 11:10 PM, Ralph Castain  wrote:
>>> 
 The default binding option depends on the number of procs - it is bind-to 
 core for np=2, and bind-to socket for np > 2. You never said, but should I 
 assume you ran 4 ranks? If so, then we should be trying to bind-to socket.
 
 I'm not sure what your cpuset is telling us - are you binding us to a 
 socket? Are some cpus in one socket, and some in another?
 
 It could be that the cpuset + bind-to socket is resulting in some odd 
 behavior, but I'd need a little more info to narrow it down.
 
 
 On Jun 18, 2014, at 7:48 PM, Brock Palen  wrote:
 
> I have started using 1.8.1 for some codes (meep in this case).  It 
> sometimes works fine, but in a few cases I am seeing ranks being given 
> overlapping CPU assignments.
> 
> Example job, default binding options (so by-core right?):
> 
> Assigned nodes (the one in question is nyx5398); we use Torque CPU sets 
> and use TM to spawn.
> 
> [nyx5406:2][nyx5427:2][nyx5506:2][nyx5311:3]
> [nyx5329:4][nyx5398:4][nyx5396:11][nyx5397:11]
> [nyx5409:11][nyx5411:11][nyx5412:3]
> 
> [root@nyx5398 ~]# hwloc-bind --get --pid 16065
> 0x0200
> [root@nyx5398 ~]# hwloc-bind --get --pid 16066
> 0x0800
> [root@nyx5398 ~]# hwloc-bind --get --pid 16067
> 0x0200
> [root@nyx5398 ~]# hwloc-bind --get --pid 16068
> 0x0800
> 
> [root@nyx5398 ~]# cat 
> /dev/cpuset/torque/12703230.nyx.engin.umich.edu/cpus 
> 8-11
> 
> So Torque claims the CPU set set up for the job has 4 cores, but as you 
> can see the ranks were given identical bindings. 
> 
> I checked the pids; they were part of the correct CPU set.  I also checked 
> orted:
> 
> [root@nyx5398 ~]# hwloc-bind --get --pid 16064
> 0x0f00
> [root@nyx5398 ~]# hwloc-calc --intersect PU 16064
> ignored unrecognized argument 16064
> 
> [root@nyx5398 ~]# hwloc-calc --intersect PU 0x0f00
> 8,9,10,11
> 
> Which is exactly what I would expect.
> 
> So ummm, I'm lost as to why this might happen.  What else should I check?  Like 
> I said, not all jobs show this behavior.
> 
> Brock Palen
> www.umich.edu/~brockp
> CAEN Advanced Computing
> XSEDE Campus Champion
> bro...@umich.edu
> (734)936-1985
> 
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2014/06/24672.php
 
 ___
 users mailing list
 us...@open-mpi.org
 Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
 Link to this post: 
 http://www.open-mpi.org/community/lists/users/2014/06/24673.php
>>> 
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users

Re: [OMPI users] affinity issues under cpuset torque 1.8.1

2014-06-20 Thread Brock Palen
Got it,

I have the input from the user and am testing it out.

It probably has less to do with Torque and more with cpusets. 

I'm working on reproducing it myself also.

Brock Palen
www.umich.edu/~brockp
CAEN Advanced Computing
XSEDE Campus Champion
bro...@umich.edu
(734)936-1985



On Jun 20, 2014, at 12:18 PM, Ralph Castain  wrote:

> Thanks - I'm just trying to reproduce one problem case so I can look at it. 
> Given that I don't have access to a Torque machine, I need to "fake" it.
> 
> 
> On Jun 20, 2014, at 9:15 AM, Brock Palen  wrote:
> 
>> In this case they are on a single socket, but as you can see it could be 
>> either/or depending on the job.
>> 
>> Brock Palen
>> www.umich.edu/~brockp
>> CAEN Advanced Computing
>> XSEDE Campus Champion
>> bro...@umich.edu
>> (734)936-1985
>> 
>> 
>> 
>> On Jun 19, 2014, at 2:44 PM, Ralph Castain  wrote:
>> 
>>> Sorry, I should have been clearer - I was asking if cores 8-11 are all on 
>>> one socket, or span multiple sockets
>>> 
>>> 
>>> On Jun 19, 2014, at 11:36 AM, Brock Palen  wrote:
>>> 
 Ralph,
 
 It was a large job spread across nodes.  Our system allows users to ask for 
 'procs', which can be laid out in any format. 
 
 The list:
 
> [nyx5406:2][nyx5427:2][nyx5506:2][nyx5311:3]
> [nyx5329:4][nyx5398:4][nyx5396:11][nyx5397:11]
> [nyx5409:11][nyx5411:11][nyx5412:3]
 
 That shows nyx5406 had 2 cores, nyx5427 also 2, and nyx5411 had 11.
 
 They could be spread across any number of socket configurations.  We start 
 very lax ("user requests X procs") and the user can request stricter 
 requirements from there.  We support mostly serial users, and users can 
 colocate on nodes.
 
 That is good to know; I think we would want to turn our default to 'bind 
 to core', except for our few users who use hybrid mode.
 
 Our CPU set tells you which cores the job is assigned.  So in the problem 
 case provided, the cpuset/cgroup shows that only cores 8-11 are available to 
 this job on this node.
 
 Brock Palen
 www.umich.edu/~brockp
 CAEN Advanced Computing
 XSEDE Campus Champion
 bro...@umich.edu
 (734)936-1985
 
 
 
 On Jun 18, 2014, at 11:10 PM, Ralph Castain  wrote:
 
> The default binding option depends on the number of procs - it is bind-to 
> core for np=2, and bind-to socket for np > 2. You never said, but should 
> I assume you ran 4 ranks? If so, then we should be trying to bind-to 
> socket.
> 
> I'm not sure what your cpuset is telling us - are you binding us to a 
> socket? Are some cpus in one socket, and some in another?
> 
> It could be that the cpuset + bind-to socket is resulting in some odd 
> behavior, but I'd need a little more info to narrow it down.
> 
> 
> On Jun 18, 2014, at 7:48 PM, Brock Palen  wrote:
> 
>> I have started using 1.8.1 for some codes (meep in this case).  It 
>> sometimes works fine, but in a few cases I am seeing ranks being given 
>> overlapping CPU assignments.
>> 
>> Example job, default binding options (so by-core right?):
>> 
>> Assigned nodes (the one in question is nyx5398); we use Torque CPU sets 
>> and use TM to spawn.
>> 
>> [nyx5406:2][nyx5427:2][nyx5506:2][nyx5311:3]
>> [nyx5329:4][nyx5398:4][nyx5396:11][nyx5397:11]
>> [nyx5409:11][nyx5411:11][nyx5412:3]
>> 
>> [root@nyx5398 ~]# hwloc-bind --get --pid 16065
>> 0x0200
>> [root@nyx5398 ~]# hwloc-bind --get --pid 16066
>> 0x0800
>> [root@nyx5398 ~]# hwloc-bind --get --pid 16067
>> 0x0200
>> [root@nyx5398 ~]# hwloc-bind --get --pid 16068
>> 0x0800
>> 
>> [root@nyx5398 ~]# cat 
>> /dev/cpuset/torque/12703230.nyx.engin.umich.edu/cpus 
>> 8-11
>> 
>> So Torque claims the CPU set set up for the job has 4 cores, but as you 
>> can see the ranks were given identical bindings. 
>> 
>> I checked the pids; they were part of the correct CPU set.  I also 
>> checked orted:
>> 
>> [root@nyx5398 ~]# hwloc-bind --get --pid 16064
>> 0x0f00
>> [root@nyx5398 ~]# hwloc-calc --intersect PU 16064
>> ignored unrecognized argument 16064
>> 
>> [root@nyx5398 ~]# hwloc-calc --intersect PU 0x0f00
>> 8,9,10,11
>> 
>> Which is exactly what I would expect.
>> 
>> So ummm, I'm lost as to why this might happen.  What else should I check?  
>> Like I said, not all jobs show this behavior.
>> 
>> Brock Palen
>> www.umich.edu/~brockp
>> CAEN Advanced Computing
>> XSEDE Campus Champion
>> bro...@umich.edu
>> (734)936-1985
>> 
>> 
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post: 
>> http://www.open-mpi.org/communit

Re: [OMPI users] affinity issues under cpuset torque 1.8.1

2014-06-20 Thread Brock Palen
I was able to reproduce it in my test.

orted affinity set by cpuset:
[root@nyx5874 ~]# hwloc-bind --get --pid 103645
0xc002

This mask (cores 1, 14, 15), which spans sockets, matches the cpu set set up by the 
batch system. 
[root@nyx5874 ~]# cat /dev/cpuset/torque/12719806.nyx.engin.umich.edu/cpus 
1,14-15
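
As a cross-check, hwloc-calc can decode such a mask back into PU indexes.  Note 
that hwloc-calc wants a bitmask or location, not a bare PID (hence the "ignored 
unrecognized argument" message earlier in this thread); a sketch combining the 
two commands already shown, assuming the same hwloc tools:

[root@nyx5874 ~]# hwloc-calc --intersect PU 0xc002
1,14,15
[root@nyx5874 ~]# hwloc-calc --intersect PU $(hwloc-bind --get --pid 103645)
1,14,15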

The ranks though were then all set to the same core:

[root@nyx5874 ~]# hwloc-bind --get --pid 103871
0x8000
[root@nyx5874 ~]# hwloc-bind --get --pid 103872
0x8000
[root@nyx5874 ~]# hwloc-bind --get --pid 103873
0x8000

Which is core 15:

report-bindings gave me:
You can see how a few nodes had all ranks bound to the same core, the last one in 
each case.  I only gave you the results for the host nyx5874.

[nyx5526.engin.umich.edu:23726] MCW rank 0 is not bound (or bound to all 
available processors)
[nyx5878.engin.umich.edu:103925] MCW rank 8 is not bound (or bound to all 
available processors)
[nyx5533.engin.umich.edu:123988] MCW rank 1 is not bound (or bound to all 
available processors)
[nyx5879.engin.umich.edu:102808] MCW rank 9 is not bound (or bound to all 
available processors)
[nyx5874.engin.umich.edu:103645] MCW rank 41 bound to socket 1[core 15[hwt 0]]: 
[./././././././.][./././././././B]
[nyx5874.engin.umich.edu:103645] MCW rank 42 bound to socket 1[core 15[hwt 0]]: 
[./././././././.][./././././././B]
[nyx5874.engin.umich.edu:103645] MCW rank 43 bound to socket 1[core 15[hwt 0]]: 
[./././././././.][./././././././B]
[nyx5888.engin.umich.edu:117400] MCW rank 11 is not bound (or bound to all 
available processors)
[nyx5786.engin.umich.edu:30004] MCW rank 19 bound to socket 1[core 15[hwt 0]]: 
[./././././././.][./././././././B]
[nyx5786.engin.umich.edu:30004] MCW rank 18 bound to socket 1[core 15[hwt 0]]: 
[./././././././.][./././././././B]
[nyx5594.engin.umich.edu:33884] MCW rank 24 bound to socket 1[core 15[hwt 0]]: 
[./././././././.][./././././././B]
[nyx5594.engin.umich.edu:33884] MCW rank 25 bound to socket 1[core 15[hwt 0]]: 
[./././././././.][./././././././B]
[nyx5594.engin.umich.edu:33884] MCW rank 26 bound to socket 1[core 15[hwt 0]]: 
[./././././././.][./././././././B]
[nyx5798.engin.umich.edu:53026] MCW rank 59 bound to socket 1[core 15[hwt 0]]: 
[./././././././.][./././././././B]
[nyx5798.engin.umich.edu:53026] MCW rank 60 bound to socket 1[core 15[hwt 0]]: 
[./././././././.][./././././././B]
[nyx5798.engin.umich.edu:53026] MCW rank 56 bound to socket 1[core 15[hwt 0]]: 
[./././././././.][./././././././B]
[nyx5798.engin.umich.edu:53026] MCW rank 57 bound to socket 1[core 15[hwt 0]]: 
[./././././././.][./././././././B]
[nyx5798.engin.umich.edu:53026] MCW rank 58 bound to socket 1[core 15[hwt 0]]: 
[./././././././.][./././././././B]
[nyx5545.engin.umich.edu:88170] MCW rank 2 is not bound (or bound to all 
available processors)
[nyx5613.engin.umich.edu:25229] MCW rank 31 is not bound (or bound to all 
available processors)
[nyx5880.engin.umich.edu:01406] MCW rank 10 is not bound (or bound to all 
available processors)
[nyx5770.engin.umich.edu:86538] MCW rank 6 is not bound (or bound to all 
available processors)
[nyx5613.engin.umich.edu:25228] MCW rank 30 is not bound (or bound to all 
available processors)
[nyx5577.engin.umich.edu:65949] MCW rank 4 is not bound (or bound to all 
available processors)
[nyx5607.engin.umich.edu:30379] MCW rank 14 is not bound (or bound to all 
available processors)
[nyx5544.engin.umich.edu:72960] MCW rank 47 is not bound (or bound to all 
available processors)
[nyx5544.engin.umich.edu:72959] MCW rank 46 is not bound (or bound to all 
available processors)
[nyx5848.engin.umich.edu:04332] MCW rank 33 is not bound (or bound to all 
available processors)
[nyx5848.engin.umich.edu:04333] MCW rank 34 is not bound (or bound to all 
available processors)
[nyx5544.engin.umich.edu:72958] MCW rank 45 is not bound (or bound to all 
available processors)
[nyx5858.engin.umich.edu:12165] MCW rank 35 is not bound (or bound to all 
available processors)
[nyx5607.engin.umich.edu:30380] MCW rank 15 is not bound (or bound to all 
available processors)
[nyx5544.engin.umich.edu:72957] MCW rank 44 is not bound (or bound to all 
available processors)
[nyx5858.engin.umich.edu:12167] MCW rank 37 is not bound (or bound to all 
available processors)
[nyx5870.engin.umich.edu:33811] MCW rank 7 is not bound (or bound to all 
available processors)
[nyx5582.engin.umich.edu:81994] MCW rank 5 is not bound (or bound to all 
available processors)
[nyx5848.engin.umich.edu:04331] MCW rank 32 is not bound (or bound to all 
available processors)
[nyx5557.engin.umich.edu:46654] MCW rank 50 is not bound (or bound to all 
available processors)
[nyx5858.engin.umich.edu:12166] MCW rank 36 is not bound (or bound to all 
available processors)
[nyx5799.engin.umich.edu:67802] MCW rank 22 is not bound (or bound to all 
available processors)
[nyx5799.engin.umich.edu:67803] MCW rank 23 is not bound (or bound to all 
available processors)
[nyx5556.engin.umich.edu:50889] MCW rank 3 is not bound (or bound to all 
available processors)

Re: [OMPI users] affinity issues under cpuset torque 1.8.1

2014-06-20 Thread Brock Palen
Extra data point if I do:

[brockp@nyx5508 34241]$ mpirun --report-bindings --bind-to core hostname
--
A request was made to bind to that would result in binding more
processes than cpus on a resource:

   Bind to:     CORE
   Node:        nyx5513
   #processes:  2
   #cpus:       1

You can override this protection by adding the "overload-allowed"
option to your binding directive.
--

[brockp@nyx5508 34241]$ mpirun -H nyx5513 uptime
 13:01:37 up 31 days, 23:06,  0 users,  load average: 10.13, 10.90, 12.38
 13:01:37 up 31 days, 23:06,  0 users,  load average: 10.13, 10.90, 12.38
[brockp@nyx5508 34241]$ mpirun -H nyx5513 --bind-to core hwloc-bind --get
0x0010
0x1000
[brockp@nyx5508 34241]$ cat $PBS_NODEFILE | grep nyx5513
nyx5513
nyx5513

Interesting: if I force bind-to core, MPI barfs saying there is only 1 cpu 
available, while PBS says it gave it two; and if I run hwloc-bind --get on just 
that node (this is all inside an interactive job), I get what I expect.

Is there a way to get a map of what MPI thinks it has on each host?
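
One sketch of a way to check (assuming these options exist in the 1.8-series 
mpirun; verify with mpirun --help on your build):

[brockp@nyx5508 34241]$ mpirun --display-allocation --display-map -H nyx5513 hostname

--display-allocation should print the node/slot list mpirun detected, and 
--display-map the process map it computed before launch.  The "overload-allowed" 
option mentioned in the error above is, I believe, given as a qualifier, e.g. 
"--bind-to core:overload-allowed".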

Brock Palen
www.umich.edu/~brockp
CAEN Advanced Computing
XSEDE Campus Champion
bro...@umich.edu
(734)936-1985



On Jun 20, 2014, at 12:38 PM, Brock Palen  wrote:

> I was able to reproduce it in my test.
> 
> orted affinity set by cpuset:
> [root@nyx5874 ~]# hwloc-bind --get --pid 103645
> 0xc002
> 
> This mask (cores 1, 14, 15), which spans sockets, matches the cpu set set up 
> by the batch system. 
> [root@nyx5874 ~]# cat /dev/cpuset/torque/12719806.nyx.engin.umich.edu/cpus 
> 1,14-15
> 
> The ranks though were then all set to the same core:
> 
> [root@nyx5874 ~]# hwloc-bind --get --pid 103871
> 0x8000
> [root@nyx5874 ~]# hwloc-bind --get --pid 103872
> 0x8000
> [root@nyx5874 ~]# hwloc-bind --get --pid 103873
> 0x8000
> 
> Which is core 15:
> 
> report-bindings gave me:
> You can see how a few nodes had all ranks bound to the same core, the last one in 
> each case.  I only gave you the results for the host nyx5874.
> 
> [nyx5526.engin.umich.edu:23726] MCW rank 0 is not bound (or bound to all 
> available processors)
> [nyx5878.engin.umich.edu:103925] MCW rank 8 is not bound (or bound to all 
> available processors)
> [nyx5533.engin.umich.edu:123988] MCW rank 1 is not bound (or bound to all 
> available processors)
> [nyx5879.engin.umich.edu:102808] MCW rank 9 is not bound (or bound to all 
> available processors)
> [nyx5874.engin.umich.edu:103645] MCW rank 41 bound to socket 1[core 15[hwt 
> 0]]: [./././././././.][./././././././B]
> [nyx5874.engin.umich.edu:103645] MCW rank 42 bound to socket 1[core 15[hwt 
> 0]]: [./././././././.][./././././././B]
> [nyx5874.engin.umich.edu:103645] MCW rank 43 bound to socket 1[core 15[hwt 
> 0]]: [./././././././.][./././././././B]
> [nyx5888.engin.umich.edu:117400] MCW rank 11 is not bound (or bound to all 
> available processors)
> [nyx5786.engin.umich.edu:30004] MCW rank 19 bound to socket 1[core 15[hwt 
> 0]]: [./././././././.][./././././././B]
> [nyx5786.engin.umich.edu:30004] MCW rank 18 bound to socket 1[core 15[hwt 
> 0]]: [./././././././.][./././././././B]
> [nyx5594.engin.umich.edu:33884] MCW rank 24 bound to socket 1[core 15[hwt 
> 0]]: [./././././././.][./././././././B]
> [nyx5594.engin.umich.edu:33884] MCW rank 25 bound to socket 1[core 15[hwt 
> 0]]: [./././././././.][./././././././B]
> [nyx5594.engin.umich.edu:33884] MCW rank 26 bound to socket 1[core 15[hwt 
> 0]]: [./././././././.][./././././././B]
> [nyx5798.engin.umich.edu:53026] MCW rank 59 bound to socket 1[core 15[hwt 
> 0]]: [./././././././.][./././././././B]
> [nyx5798.engin.umich.edu:53026] MCW rank 60 bound to socket 1[core 15[hwt 
> 0]]: [./././././././.][./././././././B]
> [nyx5798.engin.umich.edu:53026] MCW rank 56 bound to socket 1[core 15[hwt 
> 0]]: [./././././././.][./././././././B]
> [nyx5798.engin.umich.edu:53026] MCW rank 57 bound to socket 1[core 15[hwt 
> 0]]: [./././././././.][./././././././B]
> [nyx5798.engin.umich.edu:53026] MCW rank 58 bound to socket 1[core 15[hwt 
> 0]]: [./././././././.][./././././././B]
> [nyx5545.engin.umich.edu:88170] MCW rank 2 is not bound (or bound to all 
> available processors)
> [nyx5613.engin.umich.edu:25229] MCW rank 31 is not bound (or bound to all 
> available processors)
> [nyx5880.engin.umich.edu:01406] MCW rank 10 is not bound (or bound to all 
> available processors)
> [nyx5770.engin.umich.edu:86538] MCW rank 6 is not bound (or bound to all 
> available processors)
> [nyx5613.engin.umich.edu:25228] MCW rank 30 is not bound (or bound to all 
> available processors)
> [nyx5577.engin.umich.edu:65949] MCW rank 4 is not bound (or bound to all 
> available processors)
> [nyx5607.engin.umich.edu:30379] MCW rank 14 is not bound (or bound to all 
> available processors)
> [nyx5544.engin.umich.edu:72960] MCW rank 47 is not bound (or bound to all 
> available processors)

Re: [OMPI users] affinity issues under cpuset torque 1.8.1

2014-06-20 Thread Ralph Castain
I think I begin to grok at least part of the problem. If you are assigning 
different cpus on each node, then you'll need to tell us that by setting 
--hetero-nodes; otherwise we won't have any way to report that back to mpirun 
for its binding calculation.

Otherwise, we expect that the cpuset of the first node we launch a daemon onto 
(or where mpirun is executing, if we are only launching local to mpirun) 
accurately represents the cpuset on every node in the allocation.

We still might well have a bug in our binding computation - but the above will 
definitely impact what you said the user did.
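
For example (a sketch; the option spelling follows the 1.8 mpirun options, and 
the executable is a placeholder):

mpirun --hetero-nodes --report-bindings -np 4 ./a.out

With --hetero-nodes set, each daemon reports its own topology/cpuset back to 
mpirun instead of the first node's layout being assumed everywhere.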

On Jun 20, 2014, at 10:06 AM, Brock Palen  wrote:

> Extra data point if I do:
> 
> [brockp@nyx5508 34241]$ mpirun --report-bindings --bind-to core hostname
> --
> A request was made to bind to that would result in binding more
> processes than cpus on a resource:
> 
>   Bind to:     CORE
>   Node:        nyx5513
>   #processes:  2
>   #cpus:       1
> 
> You can override this protection by adding the "overload-allowed"
> option to your binding directive.
> --
> 
> [brockp@nyx5508 34241]$ mpirun -H nyx5513 uptime
> 13:01:37 up 31 days, 23:06,  0 users,  load average: 10.13, 10.90, 12.38
> 13:01:37 up 31 days, 23:06,  0 users,  load average: 10.13, 10.90, 12.38
> [brockp@nyx5508 34241]$ mpirun -H nyx5513 --bind-to core hwloc-bind --get
> 0x0010
> 0x1000
> [brockp@nyx5508 34241]$ cat $PBS_NODEFILE | grep nyx5513
> nyx5513
> nyx5513
> 
> Interesting: if I force bind-to core, MPI barfs saying there is only 1 cpu 
> available, while PBS says it gave it two; and if I run hwloc-bind --get on 
> just that node (this is all inside an interactive job), I get what I expect.
> 
> Is there a way to get a map of what MPI thinks it has on each host?
> 
> Brock Palen
> www.umich.edu/~brockp
> CAEN Advanced Computing
> XSEDE Campus Champion
> bro...@umich.edu
> (734)936-1985
> 
> 
> 
> On Jun 20, 2014, at 12:38 PM, Brock Palen  wrote:
> 
>> I was able to reproduce it in my test.
>> 
>> orted affinity set by cpuset:
>> [root@nyx5874 ~]# hwloc-bind --get --pid 103645
>> 0xc002
>> 
>> This mask (cores 1, 14, 15), which spans sockets, matches the cpu set set up 
>> by the batch system. 
>> [root@nyx5874 ~]# cat /dev/cpuset/torque/12719806.nyx.engin.umich.edu/cpus 
>> 1,14-15
>> 
>> The ranks though were then all set to the same core:
>> 
>> [root@nyx5874 ~]# hwloc-bind --get --pid 103871
>> 0x8000
>> [root@nyx5874 ~]# hwloc-bind --get --pid 103872
>> 0x8000
>> [root@nyx5874 ~]# hwloc-bind --get --pid 103873
>> 0x8000
>> 
>> Which is core 15:
>> 
>> report-bindings gave me:
>> You can see how a few nodes had all ranks bound to the same core, the last one in 
>> each case.  I only gave you the results for the host nyx5874.
>> 
>> [nyx5526.engin.umich.edu:23726] MCW rank 0 is not bound (or bound to all 
>> available processors)
>> [nyx5878.engin.umich.edu:103925] MCW rank 8 is not bound (or bound to all 
>> available processors)
>> [nyx5533.engin.umich.edu:123988] MCW rank 1 is not bound (or bound to all 
>> available processors)
>> [nyx5879.engin.umich.edu:102808] MCW rank 9 is not bound (or bound to all 
>> available processors)
>> [nyx5874.engin.umich.edu:103645] MCW rank 41 bound to socket 1[core 15[hwt 
>> 0]]: [./././././././.][./././././././B]
>> [nyx5874.engin.umich.edu:103645] MCW rank 42 bound to socket 1[core 15[hwt 
>> 0]]: [./././././././.][./././././././B]
>> [nyx5874.engin.umich.edu:103645] MCW rank 43 bound to socket 1[core 15[hwt 
>> 0]]: [./././././././.][./././././././B]
>> [nyx5888.engin.umich.edu:117400] MCW rank 11 is not bound (or bound to all 
>> available processors)
>> [nyx5786.engin.umich.edu:30004] MCW rank 19 bound to socket 1[core 15[hwt 
>> 0]]: [./././././././.][./././././././B]
>> [nyx5786.engin.umich.edu:30004] MCW rank 18 bound to socket 1[core 15[hwt 
>> 0]]: [./././././././.][./././././././B]
>> [nyx5594.engin.umich.edu:33884] MCW rank 24 bound to socket 1[core 15[hwt 
>> 0]]: [./././././././.][./././././././B]
>> [nyx5594.engin.umich.edu:33884] MCW rank 25 bound to socket 1[core 15[hwt 
>> 0]]: [./././././././.][./././././././B]
>> [nyx5594.engin.umich.edu:33884] MCW rank 26 bound to socket 1[core 15[hwt 
>> 0]]: [./././././././.][./././././././B]
>> [nyx5798.engin.umich.edu:53026] MCW rank 59 bound to socket 1[core 15[hwt 
>> 0]]: [./././././././.][./././././././B]
>> [nyx5798.engin.umich.edu:53026] MCW rank 60 bound to socket 1[core 15[hwt 
>> 0]]: [./././././././.][./././././././B]
>> [nyx5798.engin.umich.edu:53026] MCW rank 56 bound to socket 1[core 15[hwt 
>> 0]]: [./././././././.][./././././././B]
>> [nyx5798.engin.umich.edu:53026] MCW rank 57 bound to socket 1[core 15[hwt 
>> 0]]: [./././././././.][./././././././B]
>> [nyx5798.engin.umich.edu:53026] MCW rank 58 bound to socket 1[core 15[hwt 
>> 0]]: [./././././././.][./././././././B]

Re: [OMPI users] affinity issues under cpuset torque 1.8.1

2014-06-20 Thread Brock Palen
Perfection!  That appears to do it for our standard case.

Now I know how to set MCA options by env var or config file.  How can I make 
this the default, which a user can then override?

Brock Palen
www.umich.edu/~brockp
CAEN Advanced Computing
XSEDE Campus Champion
bro...@umich.edu
(734)936-1985



On Jun 20, 2014, at 1:21 PM, Ralph Castain  wrote:

> I think I begin to grok at least part of the problem. If you are assigning 
> different cpus on each node, then you'll need to tell us that by setting 
> --hetero-nodes; otherwise we won't have any way to report that back to mpirun 
> for its binding calculation.
> 
> Otherwise, we expect that the cpuset of the first node we launch a daemon 
> onto (or where mpirun is executing, if we are only launching local to mpirun) 
> accurately represents the cpuset on every node in the allocation.
> 
> We still might well have a bug in our binding computation - but the above 
> will definitely impact what you said the user did.
> 
> On Jun 20, 2014, at 10:06 AM, Brock Palen  wrote:
> 
>> Extra data point if I do:
>> 
>> [brockp@nyx5508 34241]$ mpirun --report-bindings --bind-to core hostname
>> --
>> A request was made to bind to that would result in binding more
>> processes than cpus on a resource:
>> 
>>   Bind to:     CORE
>>   Node:        nyx5513
>>   #processes:  2
>>   #cpus:       1
>> 
>> You can override this protection by adding the "overload-allowed"
>> option to your binding directive.
>> --
>> 
>> [brockp@nyx5508 34241]$ mpirun -H nyx5513 uptime
>> 13:01:37 up 31 days, 23:06,  0 users,  load average: 10.13, 10.90, 12.38
>> 13:01:37 up 31 days, 23:06,  0 users,  load average: 10.13, 10.90, 12.38
>> [brockp@nyx5508 34241]$ mpirun -H nyx5513 --bind-to core hwloc-bind --get
>> 0x0010
>> 0x1000
>> [brockp@nyx5508 34241]$ cat $PBS_NODEFILE | grep nyx5513
>> nyx5513
>> nyx5513
>> 
>> Interesting: if I force bind-to core, MPI barfs saying there is only 1 cpu 
>> available, while PBS says it gave it two; and if I run hwloc-bind --get on 
>> just that node (this is all inside an interactive job), I get what I expect.
>> 
>> Is there a way to get a map of what MPI thinks it has on each host?
>> 
>> Brock Palen
>> www.umich.edu/~brockp
>> CAEN Advanced Computing
>> XSEDE Campus Champion
>> bro...@umich.edu
>> (734)936-1985
>> 
>> 
>> 
>> On Jun 20, 2014, at 12:38 PM, Brock Palen  wrote:
>> 
>>> I was able to reproduce it in my test.
>>> 
>>> orted affinity set by cpuset:
>>> [root@nyx5874 ~]# hwloc-bind --get --pid 103645
>>> 0xc002
>>> 
>>> This mask (cores 1, 14, 15), which spans sockets, matches the cpu set set up 
>>> by the batch system. 
>>> [root@nyx5874 ~]# cat /dev/cpuset/torque/12719806.nyx.engin.umich.edu/cpus 
>>> 1,14-15
>>> 
>>> The ranks though were then all set to the same core:
>>> 
>>> [root@nyx5874 ~]# hwloc-bind --get --pid 103871
>>> 0x8000
>>> [root@nyx5874 ~]# hwloc-bind --get --pid 103872
>>> 0x8000
>>> [root@nyx5874 ~]# hwloc-bind --get --pid 103873
>>> 0x8000
>>> 
>>> Which is core 15:
>>> 
>>> report-bindings gave me:
>>> You can see how a few nodes had all ranks bound to the same core, the last one 
>>> in each case.  I only gave you the results for the host nyx5874.
>>> 
>>> [nyx5526.engin.umich.edu:23726] MCW rank 0 is not bound (or bound to all 
>>> available processors)
>>> [nyx5878.engin.umich.edu:103925] MCW rank 8 is not bound (or bound to all 
>>> available processors)
>>> [nyx5533.engin.umich.edu:123988] MCW rank 1 is not bound (or bound to all 
>>> available processors)
>>> [nyx5879.engin.umich.edu:102808] MCW rank 9 is not bound (or bound to all 
>>> available processors)
>>> [nyx5874.engin.umich.edu:103645] MCW rank 41 bound to socket 1[core 15[hwt 
>>> 0]]: [./././././././.][./././././././B]
>>> [nyx5874.engin.umich.edu:103645] MCW rank 42 bound to socket 1[core 15[hwt 
>>> 0]]: [./././././././.][./././././././B]
>>> [nyx5874.engin.umich.edu:103645] MCW rank 43 bound to socket 1[core 15[hwt 
>>> 0]]: [./././././././.][./././././././B]
>>> [nyx5888.engin.umich.edu:117400] MCW rank 11 is not bound (or bound to all 
>>> available processors)
>>> [nyx5786.engin.umich.edu:30004] MCW rank 19 bound to socket 1[core 15[hwt 
>>> 0]]: [./././././././.][./././././././B]
>>> [nyx5786.engin.umich.edu:30004] MCW rank 18 bound to socket 1[core 15[hwt 
>>> 0]]: [./././././././.][./././././././B]
>>> [nyx5594.engin.umich.edu:33884] MCW rank 24 bound to socket 1[core 15[hwt 
>>> 0]]: [./././././././.][./././././././B]
>>> [nyx5594.engin.umich.edu:33884] MCW rank 25 bound to socket 1[core 15[hwt 
>>> 0]]: [./././././././.][./././././././B]
>>> [nyx5594.engin.umich.edu:33884] MCW rank 26 bound to socket 1[core 15[hwt 
>>> 0]]: [./././././././.][./././././././B]
>>> [nyx5798.engin.umich.edu:53026] MCW rank 59 bound to socket 1[core 15[hwt 
>>> 0]]: [./././././././.][./././././././B]
>>> 

Re: [OMPI users] affinity issues under cpuset torque 1.8.1

2014-06-20 Thread Ralph Castain
Put "orte_hetero_nodes=1" in your default MCA param file - uses can override by 
setting that param to 0
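
A minimal sketch (the path assumes a default build prefix; OMPI_MCA_<name> is 
the standard env-var override spelling):

# in <prefix>/etc/openmpi-mca-params.conf -- the system-wide default
orte_hetero_nodes = 1

A user can then override it per job:

export OMPI_MCA_orte_hetero_nodes=0
# or: mpirun --mca orte_hetero_nodes 0 ...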


On Jun 20, 2014, at 10:30 AM, Brock Palen  wrote:

> Perfection!  That appears to do it for our standard case.
> 
> Now I know how to set MCA options by env var or config file.  How can I make 
> this the default, which a user can then override?
> 
> Brock Palen
> www.umich.edu/~brockp
> CAEN Advanced Computing
> XSEDE Campus Champion
> bro...@umich.edu
> (734)936-1985
> 
> 
> 
> On Jun 20, 2014, at 1:21 PM, Ralph Castain  wrote:
> 
>> I think I begin to grok at least part of the problem. If you are assigning 
>> different cpus on each node, then you'll need to tell us that by setting 
>> --hetero-nodes; otherwise we won't have any way to report that back to mpirun 
>> for its binding calculation.
>> 
>> Otherwise, we expect that the cpuset of the first node we launch a daemon 
>> onto (or where mpirun is executing, if we are only launching local to 
>> mpirun) accurately represents the cpuset on every node in the allocation.
>> 
>> We still might well have a bug in our binding computation - but the above 
>> will definitely impact what you said the user did.
>> 
>> On Jun 20, 2014, at 10:06 AM, Brock Palen  wrote:
>> 
>>> Extra data point if I do:
>>> 
>>> [brockp@nyx5508 34241]$ mpirun --report-bindings --bind-to core hostname
>>> --
>>> A request was made to bind to that would result in binding more
>>> processes than cpus on a resource:
>>> 
>>>  Bind to:     CORE
>>>  Node:        nyx5513
>>>  #processes:  2
>>>  #cpus:       1
>>> 
>>> You can override this protection by adding the "overload-allowed"
>>> option to your binding directive.
>>> --
>>> 
>>> [brockp@nyx5508 34241]$ mpirun -H nyx5513 uptime
>>> 13:01:37 up 31 days, 23:06,  0 users,  load average: 10.13, 10.90, 12.38
>>> 13:01:37 up 31 days, 23:06,  0 users,  load average: 10.13, 10.90, 12.38
>>> [brockp@nyx5508 34241]$ mpirun -H nyx5513 --bind-to core hwloc-bind --get
>>> 0x0010
>>> 0x1000
>>> [brockp@nyx5508 34241]$ cat $PBS_NODEFILE | grep nyx5513
>>> nyx5513
>>> nyx5513
>>> 
>>> Interesting: if I force bind-to core, MPI barfs saying there is only 1 cpu 
>>> available, while PBS says it gave it two; and if I run hwloc-bind --get on 
>>> just that node (this is all inside an interactive job), I get what I expect.
>>> 
>>> Is there a way to get a map of what MPI thinks it has on each host?
>>> 
>>> Brock Palen
>>> www.umich.edu/~brockp
>>> CAEN Advanced Computing
>>> XSEDE Campus Champion
>>> bro...@umich.edu
>>> (734)936-1985
>>> 
>>> 
>>> 
>>> On Jun 20, 2014, at 12:38 PM, Brock Palen  wrote:
>>> 
 I was able to reproduce it in my test.
 
 orted affinity set by cpuset:
 [root@nyx5874 ~]# hwloc-bind --get --pid 103645
 0xc002
 
 This mask (cores 1, 14, 15), which spans sockets, matches the cpu set set up by 
 the batch system. 
 [root@nyx5874 ~]# cat /dev/cpuset/torque/12719806.nyx.engin.umich.edu/cpus 
 1,14-15
 
 The ranks though were then all set to the same core:
 
 [root@nyx5874 ~]# hwloc-bind --get --pid 103871
 0x8000
 [root@nyx5874 ~]# hwloc-bind --get --pid 103872
 0x8000
 [root@nyx5874 ~]# hwloc-bind --get --pid 103873
 0x8000
 
 Which is core 15:
 
 report-bindings gave me:
 You can see how a few nodes had all ranks bound to the same core, the last one 
 in each case.  I only gave you the results for the host nyx5874.
 
 [nyx5526.engin.umich.edu:23726] MCW rank 0 is not bound (or bound to all 
 available processors)
 [nyx5878.engin.umich.edu:103925] MCW rank 8 is not bound (or bound to all 
 available processors)
 [nyx5533.engin.umich.edu:123988] MCW rank 1 is not bound (or bound to all 
 available processors)
 [nyx5879.engin.umich.edu:102808] MCW rank 9 is not bound (or bound to all 
 available processors)
 [nyx5874.engin.umich.edu:103645] MCW rank 41 bound to socket 1[core 15[hwt 
 0]]: [./././././././.][./././././././B]
 [nyx5874.engin.umich.edu:103645] MCW rank 42 bound to socket 1[core 15[hwt 
 0]]: [./././././././.][./././././././B]
 [nyx5874.engin.umich.edu:103645] MCW rank 43 bound to socket 1[core 15[hwt 
 0]]: [./././././././.][./././././././B]
 [nyx5888.engin.umich.edu:117400] MCW rank 11 is not bound (or bound to all 
 available processors)
 [nyx5786.engin.umich.edu:30004] MCW rank 19 bound to socket 1[core 15[hwt 
 0]]: [./././././././.][./././././././B]
 [nyx5786.engin.umich.edu:30004] MCW rank 18 bound to socket 1[core 15[hwt 
 0]]: [./././././././.][./././././././B]
 [nyx5594.engin.umich.edu:33884] MCW rank 24 bound to socket 1[core 15[hwt 
 0]]: [./././././././.][./././././././B]
 [nyx5594.engin.umich.edu:33884] MCW rank 25 bound to socket 1[core 15[hwt 
 0]]: [./././././././.][./././././././B]

[OMPI users] btl_openib_connect_oob.c:867:rml_recv_cb error after Infini-band stack update.

2014-06-20 Thread Ivanov, Aleksandar (INR)

Dear Sir or Madam,

I am using the Open MPI 1.6.5 library compiled with IFORT / ICC 13.1.5. Since a 
recent update of our machine, I have started getting MPI errors. The code crashes 
after completing approx. 24 % of the total job. The same code and input were 
run on the same machine before, and no such problems were ever observed. The 
actual error message is attached.
I presume that the update may have introduced an incompatibility between the 
InfiniBand stack and the Open MPI library. I think that the suggested 
"out of memory problem" should not be causing the malfunction, since the 
application uses only 1 GB of the total 32 GB available.

I would appreciate your help and ideas how to clarify this issue.

Thank you in advance

Best Regards

Aleksandar Ivanov






openmpi.log
Description: openmpi.log


Re: [OMPI users] btl_openib_connect_oob.c:867:rml_recv_cb error after Infini-band stack update.

2014-06-20 Thread Ralph Castain
What was updated? If the OS, did you remember to set the memory registration 
limits to max?
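
For reference, the usual fix (per the FAQ linked later in this thread) is to 
raise the locked-memory limit on all nodes, e.g. (a sketch; the exact file 
varies by distro, and daemons such as the resource manager may need a restart 
to pick it up):

# /etc/security/limits.conf
* soft memlock unlimited
* hard memlock unlimited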


On Jun 20, 2014, at 11:25 AM, Ivanov, Aleksandar (INR) 
 wrote:

>  
> Dear Sir or Madam,
>  
> I am using the Open MPI 1.6.5 library compiled with IFORT / ICC 13.1.5. Since 
> a recent update of our machine, I have started getting MPI errors. The code 
> crashes after completing approx. 24 % of the total job. The same code and 
> input were run on the same machine before, and no such problems were ever 
> observed. The actual error message is attached.
> I presume that the update may have introduced an incompatibility between the 
> InfiniBand stack and the Open MPI library. I think 
> that the suggested "out of memory problem" should not be causing the 
> malfunction, since the application uses only 1 GB of the total 32 GB 
> available.
>  
> I would appreciate your help and ideas how to clarify this issue.
>  
> Thank you in advance
>  
> Best Regards
>  
> Aleksandar Ivanov
>  
>  
>  
>  
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2014/06/24685.php



Re: [OMPI users] btl_openib_connect_oob.c:867:rml_recv_cb error after Infini-band stack update.

2014-06-20 Thread Ivanov, Aleksandar (INR)
Hi,

Unfortunately I was not the one updating the machine; however, I can ask my 
colleagues for a specific list of the modifications done. If I understand you 
correctly, you are referring to the "ulimit" parameters. They are properly set; 
in fact we use JMS as the job scheduler, so "ulimit -v" is set by the 
user. In my case I used 31 GB per MPI process.
The stack size is set to infinity.




From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
Sent: Friday, June 20, 2014 8:42 PM
To: Open MPI Users
Subject: Re: [OMPI users] btl_openib_connect_oob.c:867:rml_recv_cb error after 
Infini-band stack update.

What was updated? If the OS, did you remember to set the memory registration 
limits to max?


On Jun 20, 2014, at 11:25 AM, Ivanov, Aleksandar (INR) 
<aleksandar.iva...@kit.edu> wrote:



Dear Sir or Madam,

I am using the Open MPI 1.6.5 library compiled with IFORT / ICC 13.1.5. Since a 
recent update of our machine, I have started getting MPI errors. The code crashes 
after completing approx. 24 % of the total job. The same code and input were 
run on the same machine before, and no such problems were ever observed. The 
actual error message is attached.
I presume that the update may have introduced an incompatibility between the 
InfiniBand stack and the Open MPI library. I think that the suggested 
"out of memory problem" should not be causing the malfunction, since the 
application uses only 1 GB of the total 32 GB available.

I would appreciate your help and ideas how to clarify this issue.

Thank you in advance

Best Regards

Aleksandar Ivanov




___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2014/06/24685.php



Re: [OMPI users] btl_openib_connect_oob.c:867:rml_recv_cb error after Infini-band stack update.

2014-06-20 Thread Joshua Ladd
Aleksandar,

Please ensure your system administrator follows the guidelines outlined in
the link printed in the error message

http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages

Best,

Josh


On Fri, Jun 20, 2014 at 2:56 PM, Ivanov, Aleksandar (INR) <
aleksandar.iva...@kit.edu> wrote:

> Hi,
>
>
>
> Unfortunately I was not the one updating the machine; however, I can ask my
> colleagues for a specific list of the modifications done. If I understand you
> correctly, you are referring to the "ulimit" parameters. They are properly
> set; in fact we use JMS as the job scheduler, so "ulimit -v" is set
> by the user. In my case I used 31 GB per MPI process.
>
> The stack size is set to infinity.
>
>
>
>
>
>
>
>
>
> *From:* users [mailto:users-boun...@open-mpi.org] *On Behalf Of *Ralph
> Castain
> *Sent:* Friday, June 20, 2014 8:42 PM
> *To:* Open MPI Users
> *Subject:* Re: [OMPI users] btl_openib_connect_oob.c:867:rml_recv_cb
> error after Infini-band stack update.
>
>
>
> What was updated? If the OS, did you remember to set the memory
> registration limits to max?
>
>
>
>
>
> On Jun 20, 2014, at 11:25 AM, Ivanov, Aleksandar (INR) <
> aleksandar.iva...@kit.edu> wrote:
>
>
>
>
>
> Dear Sir or Madam,
>
>
>
> I am using the Open MPI 1.6.5 library compiled with IFORT / ICC 13.1.5.
> Since a recent update of our machine, I have started getting MPI errors. The
> code crashes after completing approx. 24 % of the total job. The same
> code and input were run on the same machine before, and no such problems
> were ever observed. The actual error message is attached.
>
> I presume that the update may have introduced an incompatibility between the
> InfiniBand stack and the Open MPI library. I
> think that the suggested "out of memory problem" should not be causing the
> malfunction, since the application uses only 1 GB of the total 32 GB
> available.
>
>
>
> I would appreciate your help and ideas how to clarify this issue.
>
>
>
> Thank you in advance
>
>
>
> Best Regards
>
>
>
> Aleksandar Ivanov
>
>
>
>
>
>
>
>
>
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2014/06/24685.php
>
>
>
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2014/06/24687.php
>


Re: [OMPI users] btl_openib_connect_oob.c:867:rml_recv_cb error after Infini-band stack update.

2014-06-20 Thread Ivanov, Aleksandar (INR)
Joshua,

I am using a job scheduling system, so ulimit -v is set by me. Nevertheless, 
ulimit -l is always half the value of ulimit -v. This is a bit strange; I am 
not sure whether this might be an issue (31 GB and 15.6 GB are decent values).

For completeness, the output of ulimit -a from one of the nodes:

core file size  (blocks, -c) 1
data seg size   (kbytes, -d) 32768000
scheduling priority (-e) 0
file size   (blocks, -f) unlimited
pending signals (-i) 515032
max locked memory   (kbytes, -l) 16460684
max memory size (kbytes, -m) 56047808
open files  (-n) 8192
pipe size(512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority  (-r) 0
stack size  (kbytes, -s) unlimited
cpu time   (seconds, -t) 2400
max user processes  (-u) 16308
virtual memory  (kbytes, -v) 32768000
file locks  (-x) unlimited
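
Since ulimit is a shell builtin, one way to confirm the limit the MPI processes 
actually inherit under the scheduler (a sketch) is:

mpirun -np 2 bash -c 'ulimit -l'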

Best Regards
Alex

From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Joshua Ladd
Sent: Friday, June 20, 2014 9:15 PM
To: Open MPI Users
Subject: Re: [OMPI users] btl_openib_connect_oob.c:867:rml_recv_cb error after 
Infini-band stack update.

Aleksandar,
Please ensure your system administrator follows the guidelines outlined in the 
link printed in the error message

http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
Best,
Josh

On Fri, Jun 20, 2014 at 2:56 PM, Ivanov, Aleksandar (INR) 
<aleksandar.iva...@kit.edu> wrote:
Hi,

Unfortunately I was not the one updating the machine; however, I can ask my 
colleagues for a specific list of the modifications done. If I understand you 
correctly, you are referring to the "ulimit" parameters. They are properly set; 
in fact we use JMS as the job scheduler, so "ulimit -v" is set by the 
user. In my case I used 31 GB per MPI process.
The stack size is set to infinity.




From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
Sent: Friday, June 20, 2014 8:42 PM
To: Open MPI Users
Subject: Re: [OMPI users] btl_openib_connect_oob.c:867:rml_recv_cb error after 
Infini-band stack update.

What was updated? If the OS, did you remember to set the memory registration 
limits to max?


On Jun 20, 2014, at 11:25 AM, Ivanov, Aleksandar (INR) 
<aleksandar.iva...@kit.edu> wrote:


Dear Sir or Madam,

I am using the Open MPI 1.6.5 library compiled with IFORT / ICC 13.1.5. Since a 
recent update of our machine, I have started getting MPI errors. The code crashes 
after completing approx. 24 % of the total job. The same code and input were 
run on the same machine before, and no such problems were ever observed. The 
actual error message is attached.
I presume that the update may have introduced an incompatibility between the 
InfiniBand stack and the Open MPI library. I think that the suggested 
"out of memory problem" should not be causing the malfunction, since the 
application uses only 1 GB of the total 32 GB available.

I would appreciate your help and ideas how to clarify this issue.

Thank you in advance

Best Regards

Aleksandar Ivanov




___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2014/06/24685.php


___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2014/06/24687.php