w00t :-) Thanks
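
For the archives, roughly the check that turned up the odd node, as described in the quoted thread below (the node list is just a placeholder; it assumes pdsh, dshbak, and hwloc's lstopo are installed on every node):

  # Group nodes by identical lstopo output; the node with Hyperthreading
  # still enabled shows up as its own group in the dshbak -c summary.
  pdsh -w nyx[0930-0936] lstopo | dshbak -c

  # Or hash each node's topology so a mismatch is one differing line
  # instead of pages of output.
  pdsh -w nyx[0930-0936] 'lstopo | md5sum' | sort -k 2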
Brock Palen
www.umich.edu/~brockp
CAEN Advanced Computing
bro...@umich.edu
(734)936-1985

On Dec 20, 2012, at 10:46 AM, Ralph Castain wrote:

> Hmmm... I'll see what I can do about the error message. I don't think there is much I can do in 1.6, but in 1.7 I could generate an appropriate error message, as we have a way to check the topologies.
>
> On Dec 20, 2012, at 7:11 AM, Brock Palen <bro...@umich.edu> wrote:
>
>> Ralph,
>>
>> Thanks for the info.
>> That said, I found the problem: one of the new nodes had Hyperthreading on and the rest didn't, so the nodes didn't all match. A quick
>>
>>   pdsh lstopo | dshbak -c
>>
>> uncovered the one different node. The error just didn't give me any clue that this was the cause, which was very odd:
>>
>> Correct node:
>>
>> [brockp@nyx0930 ~]$ lstopo
>> Machine (64GB)
>>   NUMANode L#0 (P#0 32GB) + Socket L#0 + L3 L#0 (20MB)
>>     L2 L#0 (256KB) + L1 L#0 (32KB) + Core L#0 + PU L#0 (P#0)
>>     L2 L#1 (256KB) + L1 L#1 (32KB) + Core L#1 + PU L#1 (P#1)
>>     L2 L#2 (256KB) + L1 L#2 (32KB) + Core L#2 + PU L#2 (P#2)
>>     L2 L#3 (256KB) + L1 L#3 (32KB) + Core L#3 + PU L#3 (P#3)
>>     L2 L#4 (256KB) + L1 L#4 (32KB) + Core L#4 + PU L#4 (P#4)
>>     L2 L#5 (256KB) + L1 L#5 (32KB) + Core L#5 + PU L#5 (P#5)
>>     L2 L#6 (256KB) + L1 L#6 (32KB) + Core L#6 + PU L#6 (P#6)
>>     L2 L#7 (256KB) + L1 L#7 (32KB) + Core L#7 + PU L#7 (P#7)
>>   NUMANode L#1 (P#1 32GB) + Socket L#1 + L3 L#1 (20MB)
>>     L2 L#8 (256KB) + L1 L#8 (32KB) + Core L#8 + PU L#8 (P#8)
>>     L2 L#9 (256KB) + L1 L#9 (32KB) + Core L#9 + PU L#9 (P#9)
>>     L2 L#10 (256KB) + L1 L#10 (32KB) + Core L#10 + PU L#10 (P#10)
>>     L2 L#11 (256KB) + L1 L#11 (32KB) + Core L#11 + PU L#11 (P#11)
>>     L2 L#12 (256KB) + L1 L#12 (32KB) + Core L#12 + PU L#12 (P#12)
>>     L2 L#13 (256KB) + L1 L#13 (32KB) + Core L#13 + PU L#13 (P#13)
>>     L2 L#14 (256KB) + L1 L#14 (32KB) + Core L#14 + PU L#14 (P#14)
>>     L2 L#15 (256KB) + L1 L#15 (32KB) + Core L#15 + PU L#15 (P#15)
>>
>> Bad node:
>>
>> [brockp@nyx0936 ~]$ lstopo
>> Machine (64GB)
>>   NUMANode L#0 (P#0 32GB) + Socket L#0 + L3 L#0 (20MB)
>>     L2 L#0 (256KB) + L1 L#0 (32KB) + Core L#0
>>       PU L#0 (P#0)
>>       PU L#1 (P#16)
>>     L2 L#1 (256KB) + L1 L#1 (32KB) + Core L#1
>>       PU L#2 (P#1)
>>       PU L#3 (P#17)
>>     L2 L#2 (256KB) + L1 L#2 (32KB) + Core L#2
>>       PU L#4 (P#2)
>>       PU L#5 (P#18)
>>     L2 L#3 (256KB) + L1 L#3 (32KB) + Core L#3
>>       PU L#6 (P#3)
>>       PU L#7 (P#19)
>>     L2 L#4 (256KB) + L1 L#4 (32KB) + Core L#4
>>       PU L#8 (P#4)
>>       PU L#9 (P#20)
>>     L2 L#5 (256KB) + L1 L#5 (32KB) + Core L#5
>>       PU L#10 (P#5)
>>       PU L#11 (P#21)
>>     L2 L#6 (256KB) + L1 L#6 (32KB) + Core L#6
>>       PU L#12 (P#6)
>>       PU L#13 (P#22)
>>     L2 L#7 (256KB) + L1 L#7 (32KB) + Core L#7
>>       PU L#14 (P#7)
>>       PU L#15 (P#23)
>>   NUMANode L#1 (P#1 32GB) + Socket L#1 + L3 L#1 (20MB)
>>     L2 L#8 (256KB) + L1 L#8 (32KB) + Core L#8
>>       PU L#16 (P#8)
>>       PU L#17 (P#24)
>>     L2 L#9 (256KB) + L1 L#9 (32KB) + Core L#9
>>       PU L#18 (P#9)
>>       PU L#19 (P#25)
>>     L2 L#10 (256KB) + L1 L#10 (32KB) + Core L#10
>>       PU L#20 (P#10)
>>       PU L#21 (P#26)
>>     L2 L#11 (256KB) + L1 L#11 (32KB) + Core L#11
>>       PU L#22 (P#11)
>>       PU L#23 (P#27)
>>     L2 L#12 (256KB) + L1 L#12 (32KB) + Core L#12
>>       PU L#24 (P#12)
>>       PU L#25 (P#28)
>>     L2 L#13 (256KB) + L1 L#13 (32KB) + Core L#13
>>       PU L#26 (P#13)
>>       PU L#27 (P#29)
>>     L2 L#14 (256KB) + L1 L#14 (32KB) + Core L#14
>>       PU L#28 (P#14)
>>       PU L#29 (P#30)
>>     L2 L#15 (256KB) + L1 L#15 (32KB) + Core L#15
>>       PU L#30 (P#15)
>>       PU L#31 (P#31)
>>
>> Once I removed that node from the pool the error went away, and using bind-to-core and cpus-per-rank worked.
>>
>> I don't see how an error message of the sort given would ever lead me to find a node with "more" cores, even fake ones; I was looking for a node that had a bad socket or the wrong part.
>>
>> Brock Palen
>> www.umich.edu/~brockp
>> CAEN Advanced Computing
>> bro...@umich.edu
>> (734)936-1985
>>
>> On Dec 19, 2012, at 9:08 PM, Ralph Castain wrote:
>>
>>> I'm afraid these are both known problems in the 1.6.2 release. I believe we fixed npersocket in 1.6.3, though you might check to be sure. On the large-scale issue, cpus-per-rank may well fail under those conditions. The algorithm in the 1.6 series hasn't seen much use, especially at scale.
>>>
>>> In fact, cpus-per-rank has somewhat fallen by the wayside recently due to apparent lack of interest. I'm restoring it for the 1.7 series over the holiday (it currently doesn't work in 1.7 or trunk).
>>>
>>> On Dec 19, 2012, at 4:34 PM, Brock Palen <bro...@umich.edu> wrote:
>>>
>>>> Using Open MPI 1.6.2 with Intel 13.0, though the problem is not specific to the compiler.
>>>>
>>>> Using two 12-core, 2-socket nodes:
>>>>
>>>> mpirun -np 4 -npersocket 2 uptime
>>>> --------------------------------------------------------------------------
>>>> Your job has requested a conflicting number of processes for the
>>>> application:
>>>>
>>>>   App: uptime
>>>>   number of procs: 4
>>>>
>>>> This is more processes than we can launch under the following
>>>> additional directives and conditions:
>>>>
>>>>   number of sockets: 0
>>>>   npersocket: 2
>>>>
>>>> Any idea why this wouldn't work?
>>>>
>>>> Another problem: the following does what I expect on nodes with two 8-core sockets (16 cores total per node):
>>>>
>>>> mpirun -np 8 -npernode 4 -bind-to-core -cpus-per-rank 4 hwloc-bind --get
>>>> 0x0000000f
>>>> 0x0000000f
>>>> 0x000000f0
>>>> 0x000000f0
>>>> 0x00000f00
>>>> 0x00000f00
>>>> 0x0000f000
>>>> 0x0000f000
>>>>
>>>> But it fails at larger scale:
>>>>
>>>> mpirun -np 276 -npernode 4 -bind-to-core -cpus-per-rank 4 hwloc-bind --get
>>>> --------------------------------------------------------------------------
>>>> An invalid physical processor ID was returned when attempting to bind
>>>> an MPI process to a unique processor.
>>>>
>>>> This usually means that you requested binding to more processors than
>>>> exist (e.g., trying to bind N MPI processes to M processors, where N > M).
>>>> Double check that you have enough unique processors for all the
>>>> MPI processes that you are launching on this host.
>>>> Your job will now abort.
>>>> --------------------------------------------------------------------------
>>>>
>>>> Brock Palen
>>>> www.umich.edu/~brockp
>>>> CAEN Advanced Computing
>>>> bro...@umich.edu
>>>> (734)936-1985
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
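
PS, for anyone who finds this thread in the archives: a quick sanity check before scaling a bind-to-core / cpus-per-rank job up is to add --report-bindings to the small test case quoted above (the counts here are just the ones from that example; I believe the flag is available in the 1.6 series):

  # --report-bindings makes mpirun print the binding it computed for each
  # rank, which is easier to scan than the per-rank hwloc-bind --get output.
  mpirun -np 8 -npernode 4 -bind-to-core -cpus-per-rank 4 --report-bindings hwloc-bind --get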