Re: [OMPI users] affinity issues under cpuset torque 1.8.1
Sorry, I should have been clearer - I was asking if cores 8-11 are all on one socket, or whether they span multiple sockets.

On Jun 19, 2014, at 11:36 AM, Brock Palen wrote:

> Ralph,
>
> It was a large job spread across. Our system allows users to ask for 'procs',
> which are laid out in any format.
>
> The list:
>
>> [nyx5406:2][nyx5427:2][nyx5506:2][nyx5311:3]
>> [nyx5329:4][nyx5398:4][nyx5396:11][nyx5397:11]
>> [nyx5409:11][nyx5411:11][nyx5412:3]
>
> shows that nyx5406 had 2 cores, nyx5427 also 2, and nyx5411 had 11.
>
> They could be spread across any number of socket configurations. We start
> very lax ("user requests X procs") and then the user can request stricter
> requirements from there. We support mostly serial users, and users can
> colocate on nodes.
>
> That is good to know; I think we would want to turn our default to 'bind to
> core', except for our few users who use hybrid mode.
>
> Our CPU set tells you what cores the job is assigned. So in the problem case
> provided, the cpuset/cgroup shows only cores 8-11 are available to this job
> on this node.
>
> Brock Palen
> www.umich.edu/~brockp
> CAEN Advanced Computing
> XSEDE Campus Champion
> bro...@umich.edu
> (734)936-1985
>
> On Jun 18, 2014, at 11:10 PM, Ralph Castain wrote:
>
>> The default binding option depends on the number of procs - it is bind-to
>> core for np=2, and bind-to socket for np > 2. You never said, but should I
>> assume you ran 4 ranks? If so, then we should be trying to bind-to socket.
>>
>> I'm not sure what your cpuset is telling us - are you binding us to a
>> socket? Are some cpus in one socket and some in another?
>>
>> It could be that the cpuset + bind-to socket is resulting in some odd
>> behavior, but I'd need a little more info to narrow it down.
>>
>> On Jun 18, 2014, at 7:48 PM, Brock Palen wrote:
>>
>>> I have started using 1.8.1 for some codes (meep in this case) and it
>>> sometimes works fine, but in a few cases I am seeing ranks being given
>>> overlapping CPU assignments - not always, though.
>>>
>>> Example job, default binding options (so by-core, right?):
>>>
>>> Assigned nodes - the one in question is nyx5398; we use torque CPU sets
>>> and use TM to spawn.
>>>
>>> [nyx5406:2][nyx5427:2][nyx5506:2][nyx5311:3]
>>> [nyx5329:4][nyx5398:4][nyx5396:11][nyx5397:11]
>>> [nyx5409:11][nyx5411:11][nyx5412:3]
>>>
>>> [root@nyx5398 ~]# hwloc-bind --get --pid 16065
>>> 0x0200
>>> [root@nyx5398 ~]# hwloc-bind --get --pid 16066
>>> 0x0800
>>> [root@nyx5398 ~]# hwloc-bind --get --pid 16067
>>> 0x0200
>>> [root@nyx5398 ~]# hwloc-bind --get --pid 16068
>>> 0x0800
>>>
>>> [root@nyx5398 ~]# cat /dev/cpuset/torque/12703230.nyx.engin.umich.edu/cpus
>>> 8-11
>>>
>>> So torque claims the CPU set set up for the job has 4 cores, but as you
>>> can see the ranks were given identical bindings.
>>>
>>> I checked the pids; they were part of the correct CPU set. I also checked
>>> orted:
>>>
>>> [root@nyx5398 ~]# hwloc-bind --get --pid 16064
>>> 0x0f00
>>> [root@nyx5398 ~]# hwloc-calc --intersect PU 16064
>>> ignored unrecognized argument 16064
>>>
>>> [root@nyx5398 ~]# hwloc-calc --intersect PU 0x0f00
>>> 8,9,10,11
>>>
>>> Which is exactly what I would expect.
>>>
>>> So ummm, I'm lost as to why this might happen. What else should I check?
>>> Like I said, not all jobs show this behavior.
>>>
>>> Brock Palen
>>> www.umich.edu/~brockp
>>> CAEN Advanced Computing
>>> XSEDE Campus Champion
>>> bro...@umich.edu
>>> (734)936-1985
>>>
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>> Link to this post:
>>> http://www.open-mpi.org/community/lists/users/2014/06/24672.php
>>
>> ___
>> users mailing list
>> us...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post:
>> http://www.open-mpi.org/community/lists/users/2014/06/24673.php
>
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2014/06/24675.php
Re: [OMPI users] affinity issues under cpuset torque 1.8.1
Ralph,

It was a large job spread across. Our system allows users to ask for 'procs', which are laid out in any format.

The list:

> [nyx5406:2][nyx5427:2][nyx5506:2][nyx5311:3]
> [nyx5329:4][nyx5398:4][nyx5396:11][nyx5397:11]
> [nyx5409:11][nyx5411:11][nyx5412:3]

shows that nyx5406 had 2 cores, nyx5427 also 2, and nyx5411 had 11.

They could be spread across any number of socket configurations. We start very lax ("user requests X procs") and then the user can request stricter requirements from there. We support mostly serial users, and users can colocate on nodes.

That is good to know; I think we would want to turn our default to 'bind to core', except for our few users who use hybrid mode.

Our CPU set tells you what cores the job is assigned. So in the problem case provided, the cpuset/cgroup shows only cores 8-11 are available to this job on this node.

Brock Palen
www.umich.edu/~brockp
CAEN Advanced Computing
XSEDE Campus Champion
bro...@umich.edu
(734)936-1985

On Jun 18, 2014, at 11:10 PM, Ralph Castain wrote:

> The default binding option depends on the number of procs - it is bind-to
> core for np=2, and bind-to socket for np > 2. You never said, but should I
> assume you ran 4 ranks? If so, then we should be trying to bind-to socket.
>
> I'm not sure what your cpuset is telling us - are you binding us to a
> socket? Are some cpus in one socket and some in another?
>
> It could be that the cpuset + bind-to socket is resulting in some odd
> behavior, but I'd need a little more info to narrow it down.
>
> On Jun 18, 2014, at 7:48 PM, Brock Palen wrote:
>
>> I have started using 1.8.1 for some codes (meep in this case) and it
>> sometimes works fine, but in a few cases I am seeing ranks being given
>> overlapping CPU assignments - not always, though.
>>
>> Example job, default binding options (so by-core, right?):
>>
>> Assigned nodes - the one in question is nyx5398; we use torque CPU sets
>> and use TM to spawn.
>>
>> [nyx5406:2][nyx5427:2][nyx5506:2][nyx5311:3]
>> [nyx5329:4][nyx5398:4][nyx5396:11][nyx5397:11]
>> [nyx5409:11][nyx5411:11][nyx5412:3]
>>
>> [root@nyx5398 ~]# hwloc-bind --get --pid 16065
>> 0x0200
>> [root@nyx5398 ~]# hwloc-bind --get --pid 16066
>> 0x0800
>> [root@nyx5398 ~]# hwloc-bind --get --pid 16067
>> 0x0200
>> [root@nyx5398 ~]# hwloc-bind --get --pid 16068
>> 0x0800
>>
>> [root@nyx5398 ~]# cat /dev/cpuset/torque/12703230.nyx.engin.umich.edu/cpus
>> 8-11
>>
>> So torque claims the CPU set set up for the job has 4 cores, but as you
>> can see the ranks were given identical bindings.
>>
>> I checked the pids; they were part of the correct CPU set. I also checked
>> orted:
>>
>> [root@nyx5398 ~]# hwloc-bind --get --pid 16064
>> 0x0f00
>> [root@nyx5398 ~]# hwloc-calc --intersect PU 16064
>> ignored unrecognized argument 16064
>>
>> [root@nyx5398 ~]# hwloc-calc --intersect PU 0x0f00
>> 8,9,10,11
>>
>> Which is exactly what I would expect.
>>
>> So ummm, I'm lost as to why this might happen. What else should I check?
>> Like I said, not all jobs show this behavior.
>>
>> Brock Palen
>> www.umich.edu/~brockp
>> CAEN Advanced Computing
>> XSEDE Campus Champion
>> bro...@umich.edu
>> (734)936-1985
>>
>> ___
>> users mailing list
>> us...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post:
>> http://www.open-mpi.org/community/lists/users/2014/06/24672.php
>
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2014/06/24673.php
Re: [OMPI users] Program abortion at a simple MPI_Get Programm
Hi,

> I wrote a little MPI program to demonstrate the MPI_Get() function
> (see attachment).
>
> The idea behind the program is that a master process with rank 0
> fills an integer array of size MPI_Comm_size with values.
> The other processes should MPI_Get the value from this shared int
> array at the index of their rank.
>
> We could compile the code, but execution will raise an error.

You use the following statements:

  int rank, size, *a, *b, namelen, i, sizereal, flag;
  ...
  MPI_Win_get_attr(win, MPI_WIN_SIZE, &sizereal, &flag);

The MPI 3.0 standard requires (page 416) that the attribute value returned for MPI_WIN_SIZE have type "MPI_Aint *".

tyr test_onesided 138 diff beispiel2_flo.c*
18c18
<   int rank, size, *a, *b, namelen, i, flag;
---
>   int rank, size, *a, *b, namelen, i, sizereal, flag;
21d20
<   MPI_Aint *sizereal;
44c43
<   {
---
>   {+
54c53
<     printf("Real Size %d\n", *sizereal);
---
>     printf("Real Size %d\n", sizereal);
59c58
<     printf ("Process %d after MPI_Win_create()\n", rank);
---
>     printf ("Process %d after MPI_Win_create()\n");
77c76
<     printf("Prozess %d# hat Wert %d von Prozess 0 geholt\n", rank, *b);
---
>     printf("Prozess %d# hat Wert %d von Prozess 0 geholt\n", rank, b);
[2]  Done  xemacs beispiel2_flo.c
tyr test_onesided 139

If you change the type of "sizereal" and fix some more bugs, you get what you want.

tyr test_onesided 139 mpiexec -np 2 a.out
Guten Tag. Ich bin Prozess 1 von 2. Ich werde auf Host tyr.informatik.hs-fulda.de ausgefuehrt
Guten Tag. Ich bin Prozess 0 von 2. Ich werde auf Host tyr.informatik.hs-fulda.de ausgefuehrt
a[0]=0  a[1]=100
ok1
ok1
Process 1 after MPI_Win_create()
ok3
ok2
Real Size 8
ok3
Prozess 1# hat Wert 100 von Prozess 0 geholt
tyr test_onesided 140

Kind regards

Siegmar
Re: [OMPI users] affinity issues under cpuset torque 1.8.1
The default binding option depends on the number of procs - it is bind-to core for np=2, and bind-to socket for np > 2. You never said, but should I assume you ran 4 ranks? If so, then we should be trying to bind-to socket.

I'm not sure what your cpuset is telling us - are you binding us to a socket? Are some cpus in one socket and some in another?

It could be that the cpuset + bind-to socket is resulting in some odd behavior, but I'd need a little more info to narrow it down.

On Jun 18, 2014, at 7:48 PM, Brock Palen wrote:

> I have started using 1.8.1 for some codes (meep in this case) and it
> sometimes works fine, but in a few cases I am seeing ranks being given
> overlapping CPU assignments - not always, though.
>
> Example job, default binding options (so by-core, right?):
>
> Assigned nodes - the one in question is nyx5398; we use torque CPU sets
> and use TM to spawn.
>
> [nyx5406:2][nyx5427:2][nyx5506:2][nyx5311:3]
> [nyx5329:4][nyx5398:4][nyx5396:11][nyx5397:11]
> [nyx5409:11][nyx5411:11][nyx5412:3]
>
> [root@nyx5398 ~]# hwloc-bind --get --pid 16065
> 0x0200
> [root@nyx5398 ~]# hwloc-bind --get --pid 16066
> 0x0800
> [root@nyx5398 ~]# hwloc-bind --get --pid 16067
> 0x0200
> [root@nyx5398 ~]# hwloc-bind --get --pid 16068
> 0x0800
>
> [root@nyx5398 ~]# cat /dev/cpuset/torque/12703230.nyx.engin.umich.edu/cpus
> 8-11
>
> So torque claims the CPU set set up for the job has 4 cores, but as you
> can see the ranks were given identical bindings.
>
> I checked the pids; they were part of the correct CPU set. I also checked
> orted:
>
> [root@nyx5398 ~]# hwloc-bind --get --pid 16064
> 0x0f00
> [root@nyx5398 ~]# hwloc-calc --intersect PU 16064
> ignored unrecognized argument 16064
>
> [root@nyx5398 ~]# hwloc-calc --intersect PU 0x0f00
> 8,9,10,11
>
> Which is exactly what I would expect.
>
> So ummm, I'm lost as to why this might happen. What else should I check?
> Like I said, not all jobs show this behavior.
>
> Brock Palen
> www.umich.edu/~brockp
> CAEN Advanced Computing
> XSEDE Campus Champion
> bro...@umich.edu
> (734)936-1985
>
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2014/06/24672.php