w00t :-)

Thanks

Brock Palen
www.umich.edu/~brockp
CAEN Advanced Computing
bro...@umich.edu
(734)936-1985



On Dec 20, 2012, at 10:46 AM, Ralph Castain wrote:

> Hmmm... I'll see what I can do about the error message. I don't think there 
> is much I can do in 1.6, but in 1.7 I could generate an appropriate error 
> message, as we have a way to check the topologies.
> 
> On Dec 20, 2012, at 7:11 AM, Brock Palen <bro...@umich.edu> wrote:
> 
>> Ralph,
>> 
>> Thanks for the info. 
>> That said, I found the problem: one of the new nodes had Hyper-Threading 
>> enabled while the rest didn't, so the nodes' topologies didn't match.  A quick 
>> 
>> pdsh lstopo | dshbak -c 
>> 
>> uncovered the one node that differed.  The error just didn't give me any 
>> clue that this was the cause, which was very odd:
>> 
>> Correct node:
>> [brockp@nyx0930 ~]$ lstopo 
>> Machine (64GB)
>> NUMANode L#0 (P#0 32GB) + Socket L#0 + L3 L#0 (20MB)
>>   L2 L#0 (256KB) + L1 L#0 (32KB) + Core L#0 + PU L#0 (P#0)
>>   L2 L#1 (256KB) + L1 L#1 (32KB) + Core L#1 + PU L#1 (P#1)
>>   L2 L#2 (256KB) + L1 L#2 (32KB) + Core L#2 + PU L#2 (P#2)
>>   L2 L#3 (256KB) + L1 L#3 (32KB) + Core L#3 + PU L#3 (P#3)
>>   L2 L#4 (256KB) + L1 L#4 (32KB) + Core L#4 + PU L#4 (P#4)
>>   L2 L#5 (256KB) + L1 L#5 (32KB) + Core L#5 + PU L#5 (P#5)
>>   L2 L#6 (256KB) + L1 L#6 (32KB) + Core L#6 + PU L#6 (P#6)
>>   L2 L#7 (256KB) + L1 L#7 (32KB) + Core L#7 + PU L#7 (P#7)
>> NUMANode L#1 (P#1 32GB) + Socket L#1 + L3 L#1 (20MB)
>>   L2 L#8 (256KB) + L1 L#8 (32KB) + Core L#8 + PU L#8 (P#8)
>>   L2 L#9 (256KB) + L1 L#9 (32KB) + Core L#9 + PU L#9 (P#9)
>>   L2 L#10 (256KB) + L1 L#10 (32KB) + Core L#10 + PU L#10 (P#10)
>>   L2 L#11 (256KB) + L1 L#11 (32KB) + Core L#11 + PU L#11 (P#11)
>>   L2 L#12 (256KB) + L1 L#12 (32KB) + Core L#12 + PU L#12 (P#12)
>>   L2 L#13 (256KB) + L1 L#13 (32KB) + Core L#13 + PU L#13 (P#13)
>>   L2 L#14 (256KB) + L1 L#14 (32KB) + Core L#14 + PU L#14 (P#14)
>>   L2 L#15 (256KB) + L1 L#15 (32KB) + Core L#15 + PU L#15 (P#15)
>> 
>> 
>> Bad node:
>> [brockp@nyx0936 ~]$ lstopo
>> Machine (64GB)
>> NUMANode L#0 (P#0 32GB) + Socket L#0 + L3 L#0 (20MB)
>>   L2 L#0 (256KB) + L1 L#0 (32KB) + Core L#0
>>     PU L#0 (P#0)
>>     PU L#1 (P#16)
>>   L2 L#1 (256KB) + L1 L#1 (32KB) + Core L#1
>>     PU L#2 (P#1)
>>     PU L#3 (P#17)
>>   L2 L#2 (256KB) + L1 L#2 (32KB) + Core L#2
>>     PU L#4 (P#2)
>>     PU L#5 (P#18)
>>   L2 L#3 (256KB) + L1 L#3 (32KB) + Core L#3
>>     PU L#6 (P#3)
>>     PU L#7 (P#19)
>>   L2 L#4 (256KB) + L1 L#4 (32KB) + Core L#4
>>     PU L#8 (P#4)
>>     PU L#9 (P#20)
>>   L2 L#5 (256KB) + L1 L#5 (32KB) + Core L#5
>>     PU L#10 (P#5)
>>     PU L#11 (P#21)
>>   L2 L#6 (256KB) + L1 L#6 (32KB) + Core L#6
>>     PU L#12 (P#6)
>>     PU L#13 (P#22)
>>   L2 L#7 (256KB) + L1 L#7 (32KB) + Core L#7
>>     PU L#14 (P#7)
>>     PU L#15 (P#23)
>> NUMANode L#1 (P#1 32GB) + Socket L#1 + L3 L#1 (20MB)
>>   L2 L#8 (256KB) + L1 L#8 (32KB) + Core L#8
>>     PU L#16 (P#8)
>>     PU L#17 (P#24)
>>   L2 L#9 (256KB) + L1 L#9 (32KB) + Core L#9
>>     PU L#18 (P#9)
>>     PU L#19 (P#25)
>>   L2 L#10 (256KB) + L1 L#10 (32KB) + Core L#10
>>     PU L#20 (P#10)
>>     PU L#21 (P#26)
>>   L2 L#11 (256KB) + L1 L#11 (32KB) + Core L#11
>>     PU L#22 (P#11)
>>     PU L#23 (P#27)
>>   L2 L#12 (256KB) + L1 L#12 (32KB) + Core L#12
>>     PU L#24 (P#12)
>>     PU L#25 (P#28)
>>   L2 L#13 (256KB) + L1 L#13 (32KB) + Core L#13
>>     PU L#26 (P#13)
>>     PU L#27 (P#29)
>>   L2 L#14 (256KB) + L1 L#14 (32KB) + Core L#14
>>     PU L#28 (P#14)
>>     PU L#29 (P#30)
>>   L2 L#15 (256KB) + L1 L#15 (32KB) + Core L#15
>>     PU L#30 (P#15)
>>     PU L#31 (P#31)
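The mismatch between the two outputs above can be spotted mechanically: on the Hyper-Threaded node each core carries two PU lines, so PUs outnumber cores. A minimal sketch of that check, assuming hypothetical node names and heavily abbreviated lstopo text (a real run would capture full `lstopo` output per node, as with the pdsh pipeline above):

```python
# Group nodes by a coarse topology fingerprint (core count, PU count)
# parsed from plain-text lstopo output, mimicking what
# `pdsh lstopo | dshbak -c` lets you spot by eye.
from collections import defaultdict

def fingerprint(lstopo_text):
    """Count Core and PU objects in textual lstopo output."""
    return (lstopo_text.count("Core L#"), lstopo_text.count("PU L#"))

def odd_ones_out(outputs):
    """Return the nodes whose (cores, PUs) fingerprint is in the minority."""
    groups = defaultdict(list)
    for node, text in outputs.items():
        groups[fingerprint(text)].append(node)
    majority = max(groups.values(), key=len)
    return sorted(n for nodes in groups.values()
                  if nodes is not majority for n in nodes)

# Abbreviated, hypothetical captures: one PU per core vs. two PUs per core.
outputs = {
    "nyx0930": "Core L#0 + PU L#0 (P#0)\nCore L#1 + PU L#1 (P#1)\n",
    "nyx0931": "Core L#0 + PU L#0 (P#0)\nCore L#1 + PU L#1 (P#1)\n",
    "nyx0936": "Core L#0\n  PU L#0 (P#0)\n  PU L#1 (P#16)\n",
}
print(odd_ones_out(outputs))  # -> ['nyx0936'], the Hyper-Threaded node
```

A node where the PU count exceeds the core count has SMT/Hyper-Threading enabled; any node whose fingerprint disagrees with the majority is the one to pull from the pool.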
>> 
>> Once I removed that node from the pool the error went away, and 
>> bind-to-core with cpus-per-rank worked. 
>> 
>> I don't see how an error message like the one given would ever have led me 
>> to a node with 'more' cores, even fake ones; I was looking for a node with 
>> a bad socket or the wrong part.
>> 
>> 
>> 
>> 
>> 
>> On Dec 19, 2012, at 9:08 PM, Ralph Castain wrote:
>> 
>>> I'm afraid these are both known problems in the 1.6.2 release. I believe we 
>>> fixed npersocket in 1.6.3, though you might check to be sure. On the 
>>> large-scale issue, cpus-per-rank may well fail under those conditions; 
>>> the algorithm in the 1.6 series hasn't seen much use, especially at scale.
>>> 
>>> In fact, cpus-per-rank has somewhat fallen by the wayside recently due to 
>>> apparent lack of interest. I'm restoring it for the 1.7 series over the 
>>> holiday (currently doesn't work in 1.7 or trunk).
>>> 
>>> 
>>> On Dec 19, 2012, at 4:34 PM, Brock Palen <bro...@umich.edu> wrote:
>>> 
>>>> Using Open MPI 1.6.2 with Intel 13.0, though the problem is not specific 
>>>> to the compiler.
>>>> 
>>>> Using two 12-core, 2-socket nodes:
>>>> 
>>>> mpirun -np 4 -npersocket 2 uptime
>>>> --------------------------------------------------------------------------
>>>> Your job has requested a conflicting number of processes for the
>>>> application:
>>>> 
>>>> App: uptime
>>>> number of procs:  4
>>>> 
>>>> This is more processes than we can launch under the following
>>>> additional directives and conditions:
>>>> 
>>>> number of sockets:   0
>>>> npersocket:   2
>>>> 
>>>> 
>>>> Any idea why this wouldn't work?  
>>>> 
>>>> Another problem: the following does what I expect on nodes with two 
>>>> 8-core sockets (16 cores/node total).
>>>> 
>>>> mpirun -np 8 -npernode 4 -bind-to-core -cpus-per-rank 4 hwloc-bind --get
>>>> 0x0000000f
>>>> 0x0000000f
>>>> 0x000000f0
>>>> 0x000000f0
>>>> 0x00000f00
>>>> 0x00000f00
>>>> 0x0000f000
>>>> 0x0000f000
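Those masks are exactly what the options ask for: with 4 cores per rank and sequentially numbered cores (as on the nodes above), local rank l should own cores 4*l through 4*l+3, i.e. a 4-bit mask shifted left by 4*l bits. A quick sketch of the expected masks under that assumption:

```python
# Expected binding masks for -npernode 4 -cpus-per-rank 4 on a 16-core
# node with sequentially numbered cores: local rank l owns cores
# 4*l .. 4*l+3, i.e. the 4-bit mask 0xF shifted left by 4*l bits.
CPUS_PER_RANK = 4
RANKS_PER_NODE = 4

masks = [0xF << (CPUS_PER_RANK * l) for l in range(RANKS_PER_NODE)]
for m in masks:
    print(f"0x{m:08x}")
# Prints 0x0000000f, 0x000000f0, 0x00000f00, 0x0000f000
```

Each mask appears twice in the run above because the 8 ranks are spread over two nodes, 4 ranks apiece.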
>>>> 
>>>> But fails at large scale:
>>>> 
>>>> mpirun -np 276 -npernode 4 -bind-to-core -cpus-per-rank 4 hwloc-bind --get
>>>> 
>>>> --------------------------------------------------------------------------
>>>> An invalid physical processor ID was returned when attempting to bind
>>>> an MPI process to a unique processor.
>>>> 
>>>> This usually means that you requested binding to more processors than
>>>> exist (e.g., trying to bind N MPI processes to M processors, where N >
>>>> M).  Double check that you have enough unique processors for all the
>>>> MPI processes that you are launching on this host.
>>>> Your job will now abort.
>>>> --------------------------------------------------------------------------
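For what it's worth, the capacity arithmetic behind this run checks out, which is part of why the message is misleading: 276 ranks at 4 per node is 69 nodes, and 4 ranks times 4 cpus is 16 PUs per node, exactly the 16 cores available. A sketch of that sanity check (numbers taken from the command line and lstopo output above):

```python
# Feasibility check for: mpirun -np 276 -npernode 4 -cpus-per-rank 4
import math

np_total = 276
npernode = 4
cpus_per_rank = 4
cores_per_node = 16  # 2 sockets x 8 cores, per the lstopo output above

nodes_needed = math.ceil(np_total / npernode)   # 69 nodes
pus_per_node = npernode * cpus_per_rank         # 16 PUs requested per node

# The request fits: 16 requested PUs <= 16 cores on every conforming node.
assert pus_per_node <= cores_per_node
print(nodes_needed, pus_per_node)  # -> 69 16
```

So "requested binding to more processors than exist" does not describe the actual failure; as the later messages in this thread show, the real cause was one node whose Hyper-Threaded topology disagreed with the rest of the pool.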
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> _______________________________________________
>>>> users mailing list
>>>> us...@open-mpi.org
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>> 
>>> 
>> 
>> 
> 
> 

