Re: [OMPI devel] Assigning processes to cores 1.4.2, 1.6.4 and 1.8.4

Ralph Castain Fri, 17 Apr 2015 19:36:22 -0400 (EDT)

Hi Tom

Glad you are making some progress! Note that the 1.8 series uses hwloc for its 
affinity operations, while the 1.4 and 1.6 series used the old plpa code. 
Hence, you will not find the “affinity” components in the 1.8 ompi_info output.
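
A minimal sketch of reading that ompi_info output, assuming the text of `ompi_info | grep -i affinity` is piped in on stdin. The helper name `affinity_backend` is hypothetical; it prints the name of the paffinity component if one is listed and `none` otherwise. For a 1.8 install `none` is the expected answer, not a sign of a broken build.

```shell
# Hypothetical helper: report which paffinity component ompi_info lists.
# Reads the text of `ompi_info | grep -i affinity` on stdin.
affinity_backend() {
    awk '
        $1 == "MCA" && $2 == "paffinity:" { name = $3 }
        END { print (name == "" ? "none" : name) }
    '
}

# Sample lines copied from the ompi_info output quoted in this thread.
printf '%s\n' '           MCA paffinity: hwloc (MCA v2.0, API v2.0, Component v1.6.4)' | affinity_backend   # hwloc
printf '%s\n' '           MCA paffinity: linux (MCA v2.0, API v2.0, Component v1.4.2)' | affinity_backend   # linux
printf '' | affinity_backend                                                                               # none
```

On a real install the call would be `ompi_info | grep -i affinity | affinity_backend`.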


Is there some reason you didn’t compile OMPI on the AMD machine? I ask because 
there are some config switches in various areas that differ between AMD and 
Intel architectures.


> On Apr 17, 2015, at 11:16 AM, Tom Wurgler <twu...@goodyear.com> wrote:
> 
> Note where I said "1 hour 14 minutes" it should have read "1 hour 24 
> minutes"...
> 
> 
> 
> From: Tom Wurgler
> Sent: Friday, April 17, 2015 2:14 PM
> To: Open MPI Developers
> Subject: Re: [OMPI devel] Assigning processes to cores 1.4.2, 1.6.4 and 1.8.4
>  
> Ok, seems like I am making some progress here.  Thanks for the help.
> I turned HT off.
> Now I can run v 1.4.2, 1.6.4 and 1.8.4, all compiled with the same compiler 
> and run on the same machine.
> 1.4.2 runs this job in 59 minutes.   1.6.4 and 1.8.4 run the job in 1hr 24 
> minutes.
> 1.4.2 uses just --mca mpi_paffinity_alone 1 and the processes are bound:
>   PID COMMAND         CPUMASK     TOTAL [     N0     N1     N2     N3     N4     N5 ]
> 22232 prog1                 0    469.9M [ 469.9M     0      0      0      0      0  ]
> 22233 prog1                 1    479.0M [   4.0M 475.0M     0      0      0      0  ]
> 22234 prog1                 2    516.7M [ 516.7M     0      0      0      0      0  ]
> 22235 prog1                 3    485.4M [   8.0M 477.4M     0      0      0      0  ]
> 22236 prog1                 4    482.6M [ 482.6M     0      0      0      0      0  ]
> 22237 prog1                 5    486.6M [   6.0M 480.6M     0      0      0      0  ]
> 22238 prog1                 6    481.3M [ 481.3M     0      0      0      0      0  ]
> 22239 prog1                 7    419.4M [   8.0M 411.4M     0      0      0      0  ]
> 
> If I use 1.6.4 and 1.8.4 with --mca mpi_paffinity_alone 1, the run time is 
> now 1hr 14 minutes.  The process map now looks like:
> bash-4.3# numa-maps -n eagle
>   PID COMMAND         CPUMASK     TOTAL [     N0     N1     N2     N3     N4     N5 ]
> 12248 eagle                 0    163.3M [ 155.3M   8.0M     0      0      0      0  ]
> 12249 eagle                 2    161.6M [ 159.6M   2.0M     0      0      0      0  ]
> 12250 eagle                 4    164.3M [ 160.3M   4.0M     0      0      0      0  ]
> 12251 eagle                 6    160.4M [ 156.4M   4.0M     0      0      0      0  ]
> 12252 eagle                 8    160.6M [ 154.6M   6.0M     0      0      0      0  ]
> 12253 eagle                10    159.8M [ 151.8M   8.0M     0      0      0      0  ]
> 12254 eagle                12    160.9M [ 152.9M   8.0M     0      0      0      0  ]
> 12255 eagle                14    159.8M [ 157.8M   2.0M     0      0      0      0  ]
> 
> If I take off the --mca mpi_paffinity_alone 1, and instead use --bysocket 
> --bind-to-core (1.6.4)  or --map-by socket --bind-to core (1.8.4), the job 
> runs in 59 minutes and the process map looks like the 1.4.2 one above...looks 
> super!
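
The per-version flag spellings above can be captured in a small wrapper. This is only a sketch: the function name `binding_flags` is hypothetical, and it covers just the two series whose flags are quoted in this thread, returning an error for anything else instead of guessing.

```shell
# Hypothetical helper: print the "round-robin by socket, bind to core" flags
# for a given Open MPI version, using the spellings quoted in this thread.
binding_flags() {
    case "$1" in
        1.6.*) echo "--bysocket --bind-to-core" ;;
        1.8.*) echo "--map-by socket --bind-to core" ;;
        *)     echo "unknown version: $1" >&2; return 1 ;;
    esac
}

binding_flags 1.6.4   # prints: --bysocket --bind-to-core
binding_flags 1.8.4   # prints: --map-by socket --bind-to core
```

A launch script could then use `mpirun $(binding_flags 1.8.4) ...` (unquoted, so the flags split into separate words) with the remaining arguments unchanged.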
> 
> Now the issue:
> 
> If I move the same Open MPI install dirs to our cluster nodes, I can run 
> 1.6.4+ using the --mca mpi_paffinity_alone 1 option and the job runs (taking 
> longer etc).
> 
> If I then try the --bysocket --bind-to-core etc, I get the following error:
> 
> --------------------------------------------------------------------------
> An attempt to set processor affinity has failed - please check to
> ensure that your system supports such functionality. If so, then
> this is probably something that should be reported to the OMPI developers.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun was unable to start the specified application as it encountered an 
> error:
> 
> Error name: Input/output error
> Node: rdsargo36
> 
> when attempting to start process rank 0.
> --------------------------------------------------------------------------
> Error: Previous command failed (exitcode=1
> 
> Now the original runs were done on an Intel box (and this is where OpenMPI 
> was compiled).
> I am trying to run now on an AMD based cluster node.
> 
> So --mca mpi_paffinity_alone 1  works
>      --bysocket --bind-to-core doesn't.
> 
> Does this make sense to you folks?  If the AMD (running SuSE 11.1, BTW) 
> doesn't support paffinity, why does the --mca version run?  Is there some way 
> to check/set whether a system would support --bysocket etc?  Does it matter 
> which machine I compiled on?
> 
> And compare the following:
> 
> [test_lsf2]rds4020[1010]% /apps/share/openmpi/1.6.4.I1404211/bin/ompi_info | 
> grep -i affinity
>           MPI extensions: affinity example
>            MCA paffinity: hwloc (MCA v2.0, API v2.0, Component v1.6.4)
>            MCA maffinity: first_use (MCA v2.0, API v2.0, Component v1.6.4)
>            MCA maffinity: hwloc (MCA v2.0, API v2.0, Component v1.6.4)
> 
> [test_lsf2]rds4020[1010]% /apps/share/openmpi/1.4.2.I1404211/bin/ompi_info | 
> grep -i affinity
>            MCA paffinity: linux (MCA v2.0, API v2.0, Component v1.4.2)
>            MCA maffinity: first_use (MCA v2.0, API v2.0, Component v1.4.2)
>            MCA maffinity: libnuma (MCA v2.0, API v2.0, Component v1.4.2)
> 
> [test_lsf2]rds4020[1012]% /apps/share/openmpi/1.8.4.I1404211/bin/ompi_info | 
> grep -i affinity
> (no output)
> 
> Shouldn't the 1.8.4 version show something?
> 
> Thanks again for the help so far; I appreciate any comments/help on the above.
> tom
> From: devel <devel-boun...@open-mpi.org> on behalf of Ralph Castain 
> <r...@open-mpi.org>
> Sent: Friday, April 10, 2015 11:38 AM
> To: Open MPI Developers
> Subject: Re: [OMPI devel] Assigning processes to cores 1.4.2, 1.6.4 and 1.8.4
>  
> Your configure options look fine.
> 
> Getting 1 process assigned to each core (irrespective of HT on or off):
> 
> --map-by core --bind-to core
> 
> This will tight-pack the processes - i.e., they will be placed on each 
> successive core. If you want to balance the load across the allocation (if 
> the #procs < #cores in allocation):
> 
> --map-by node --bind-to core
> 
> HTH
> Ralph
> 
> 
>> On Apr 10, 2015, at 7:24 AM, Tom Wurgler <twu...@goodyear.com> wrote:
>> 
>> Thanks for the responses.  
>> 
>> The idea is to bind one process per processor.  The actual problem that 
>> prompted the investigation is that a job run with 1.4.2 finishes in 59 
>> minutes, while the same job with 1.6.4 and 1.8.4 takes 79 minutes on the 
>> same machine, same compiler etc.  In trying to track down the reason for 
>> the run time differences, I found that the behavior is different regarding 
>> the binding.  Hence the question.
>> 
>> I believe it is doing what we requested, but not what we want.  The 
>> bind-to-socket was just one attempt at making
>> it bind one per processor.  I tried about 15 different combinations of the 
>> mpirun args and none matched the behavior or the run time of 1.4.2, which 
>> is a huge concern for us.
>> 
>> I just checked this machine and hyperthreading is on.  I can change that and 
>> retest.
>> 
>> Are my configure options ok for the 1.6.4+ configuring?
>> And what mpirun options should I be using to get 1 process per processor?
>> 
>> This job was an 8 core test job, but the core count varies per type of job 
>> (and will be run on the big clusters, not this compile server).
>> 
>> The run time differences show up across all our clusters, Intel based, AMD 
>> based, various SuSE OS versions.
>> 
>> thanks
>> tom
>> 
>> From: devel <devel-boun...@open-mpi.org> on behalf of Ralph Castain 
>> <r...@open-mpi.org>
>> Sent: Friday, April 10, 2015 9:54 AM
>> To: Open MPI Developers
>> Subject: Re: [OMPI devel] Assigning processes to cores 1.4.2, 1.6.4 and 1.8.4
>>  
>> Actually, I believe from the cmd line that the questioner wanted each 
>> process to be bound to a single core.
>> 
>> From your output, I’m guessing you have hyperthreads enabled on your system 
>> - yes? In that case, the 1.4 series is likely to be binding each process to 
>> a single HT because it isn’t sophisticated enough to realize the difference 
>> between HT and core.
>> 
>> Later versions of OMPI do know the difference. When you tell OMPI to bind to 
>> core, it will bind you to -both- HTs of that core. Hence the output you 
>> showed here:
>> 
>>> here is the map using just --mca mpi_paffinity_alone 1
>>> 
>>>   PID COMMAND         CPUMASK     TOTAL [     N0     N1     N2     N3     N4     N5 ]
>>> 25846 prog1              0,16     60.6M [  60.6M     0      0      0      0      0  ]
>>> 25847 prog1              2,18     60.6M [  60.6M     0      0      0      0      0  ]
>>> 25848 prog1              4,20     60.6M [  60.6M     0      0      0      0      0  ]
>>> 25849 prog1              6,22     60.6M [  60.6M     0      0      0      0      0  ]
>>> 25850 prog1              8,24     60.6M [  60.6M     0      0      0      0      0  ]
>>> 25851 prog1             10,26     60.6M [  60.6M     0      0      0      0      0  ]
>>> 25852 prog1             12,28     60.6M [  60.6M     0      0      0      0      0  ]
>>> 25853 prog1             14,30     60.6M [  60.6M     0      0      0      0      0  ]
>> 
>> 
>> When you tell us bind-to socket, we bind you to every HT in that socket. 
>> Since you are running less than 8 processes, and we map-by core by default, 
>> all the processes are bound to the first socket. This is what you show in 
>> this output:
>> 
>>> We get the following process map (this output is with mpirun args 
>>> --bind-to-socket
>>> --mca mpi_paffinity_alone 1):
>>> 
>>>   PID COMMAND         CPUMASK     TOTAL [     N0     N1     N2     N3     N4     N5 ]
>>> 24176 prog1           0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30     60.2M [  60.2M     0      0      0      0      0  ]
>>> 24177 prog1           0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30     60.5M [  60.5M     0      0      0      0      0  ]
>>> 24178 prog1           0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30     60.5M [  60.5M     0      0      0      0      0  ]
>>> 24179 prog1           0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30     60.5M [  60.5M     0      0      0      0      0  ]
>>> 24180 prog1           0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30     60.5M [  60.5M     0      0      0      0      0  ]
>>> 24181 prog1           0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30     60.5M [  60.5M     0      0      0      0      0  ]
>>> 24182 prog1           0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30     60.5M [  60.5M     0      0      0      0      0  ]
>>> 24183 prog1           0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30     60.5M [  60.5M     0      0      0      0      0  ]
>> 
>> 
>> So it looks to me like OMPI is doing exactly what you requested. I admit the 
>> HT numbering in the cpumask is strange, but that’s the way your BIOS 
>> numbered them.
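
The difference between the default map-by core placement and a round-robin by socket can be sketched with a toy model. The assumptions are loud ones: 2 sockets of 8 cores, one rank per core, and an OS CPU numbering of `2*core + socket` (even ids on the first socket, odd on the second), which is only inferred from the cpumasks in this thread. The function name `cpu_for_rank` is hypothetical and nothing here queries real hardware.

```shell
# Toy model, not a query of real hardware.
# Assumed topology: 2 sockets, OS CPU id = 2*core + socket.
cpu_for_rank() {
    policy=$1 rank=$2 sockets=2
    case "$policy" in
        core)   socket=0; core=$rank ;;                                  # fill socket 0 first (valid for ranks 0-7)
        socket) socket=$((rank % sockets)); core=$((rank / sockets)) ;;  # alternate between sockets
        *)      return 1 ;;
    esac
    echo $((core * sockets + socket))
}

for policy in core socket; do
    line="map-by $policy:"
    for rank in 0 1 2 3 4 5 6 7; do
        line="$line $(cpu_for_rank "$policy" "$rank")"
    done
    echo "$line"
done
# map-by core: 0 2 4 6 8 10 12 14
# map-by socket: 0 1 2 3 4 5 6 7
```

Under this model, mapping by core keeps all 8 ranks on the first socket, while mapping by socket spreads them over both.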
>> 
>> HTH
>> Ralph
>> 
>> 
>>> On Apr 10, 2015, at 6:29 AM, Nick Papior Andersen <nickpap...@gmail.com> 
>>> wrote:
>>> 
>>> Bug, it should be "span,pe=2"
>>> 
>>> 2015-04-10 15:28 GMT+02:00 Nick Papior Andersen <nickpap...@gmail.com>:
>>> I guess you want process #1 to have core 0 and core 1 bound to it, process 
>>> #2 have core 2 and core 3 bound?
>>> 
>>> I can do this with (I do this with 1.8.4, I do not think it works with 
>>> 1.6.x):
>>> --map-by ppr:4:socket:span:pe=2
>>> ppr = processes per resource.
>>> socket = the resource
>>> span = load balance the processes
>>> pe = bind processing elements to each process
>>> 
>>> This should launch 8 processes (you have 2 sockets). Each process should 
>>> have 2 processing elements bound to it.
>>> You can check with --report-bindings to see the "bound" processes bindings.
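
The arithmetic behind that ppr layout can be checked with a short sketch. It assumes one processing element corresponds to one core, and the function name `ppr_fits` is hypothetical. Note that the correction earlier in this thread gives the modifier as "span,pe=2".

```shell
# Hypothetical helper: does a ppr layout fit the machine?
# procs = ppr * sockets; cores needed = procs * pe (assuming 1 PE = 1 core).
ppr_fits() {
    ppr=$1 pe=$2 sockets=$3 cores_per_socket=$4
    procs=$((ppr * sockets))
    needed=$((procs * pe))
    available=$((sockets * cores_per_socket))
    echo "procs=$procs cores_needed=$needed cores_available=$available"
    [ "$needed" -le "$available" ]   # exit status 0 only if the layout fits
}

# 4 processes per socket, 2 PEs each, on 2 sockets of 8 cores:
ppr_fits 4 2 2 8   # prints: procs=8 cores_needed=16 cores_available=16
```

For the 2-socket, 8-cores-per-socket machine described in this thread, the layout uses exactly all 16 cores.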
>>> 
>>> 2015-04-10 15:16 GMT+02:00  <twu...@goodyear.com>:
>>> 
>>> We can't seem to get "processor affinity" using 1.6.4 or newer OpenMPI.
>>> 
>>> Note this is a 2 socket machine with 8 cores per socket
>>> 
>>> We had compiled OpenMPI 1.4.2 with the following configure options:
>>> 
>>> ===========================================================================
>>> export CC=/apps/share/intel/v14.0.4.211/bin/icc
>>> export CXX=/apps/share/intel/v14.0.4.211/bin/icpc
>>> export FC=/apps/share/intel/v14.0.4.211/bin/ifort
>>> 
>>> version=1.4.2.I1404211
>>> 
>>> ./configure \
>>>     --prefix=/apps/share/openmpi/$version \
>>>     --disable-shared \
>>>     --enable-static \
>>>     --enable-shared=no \
>>>     --with-openib \
>>>     --with-libnuma=/usr \
>>>     --enable-mpirun-prefix-by-default \
>>>     --with-memory-manager=none \
>>>     --with-tm=/apps/share/TORQUE/current/Linux
>>> ===========================================================================
>>> 
>>> and then used this mpirun command (where we used 8 cores):
>>> 
>>> ===========================================================================
>>> /apps/share/openmpi/1.4.2.I1404211/bin/mpirun \
>>> --prefix /apps/share/openmpi/1.4.2.I1404211 \
>>> --mca mpi_paffinity_alone 1 \
>>> --mca btl openib,tcp,sm,self \
>>> --x LD_LIBRARY_PATH \
>>> {model args}
>>> ===========================================================================
>>> 
>>> And when we checked the process map, it looks like this:
>>> 
>>>   PID COMMAND         CPUMASK     TOTAL [     N0     N1     N2     N3     N4     N5 ]
>>> 22232 prog1                 0    469.9M [ 469.9M     0      0      0      0      0  ]
>>> 22233 prog1                 1    479.0M [   4.0M 475.0M     0      0      0      0  ]
>>> 22234 prog1                 2    516.7M [ 516.7M     0      0      0      0      0  ]
>>> 22235 prog1                 3    485.4M [   8.0M 477.4M     0      0      0      0  ]
>>> 22236 prog1                 4    482.6M [ 482.6M     0      0      0      0      0  ]
>>> 22237 prog1                 5    486.6M [   6.0M 480.6M     0      0      0      0  ]
>>> 22238 prog1                 6    481.3M [ 481.3M     0      0      0      0      0  ]
>>> 22239 prog1                 7    419.4M [   8.0M 411.4M     0      0      0      0  ]
>>> 
>>> Now with 1.6.4 and higher, we did the following:
>>> ===========================================================================
>>> export CC=/apps/share/intel/v14.0.4.211/bin/icc
>>> export CXX=/apps/share/intel/v14.0.4.211/bin/icpc
>>> export FC=/apps/share/intel/v14.0.4.211/bin/ifort
>>> 
>>> version=1.6.4.I1404211
>>> 
>>> ./configure \
>>>     --disable-vt \
>>>     --prefix=/apps/share/openmpi/$version \
>>>     --disable-shared \
>>>     --enable-static \
>>>     --with-verbs \
>>>     --enable-mpirun-prefix-by-default \
>>>     --with-memory-manager=none \
>>>     --with-hwloc \
>>>     --enable-mpi-ext \
>>>     --with-tm=/apps/share/TORQUE/current/Linux
>>> ===========================================================================
>>> 
>>> We've tried the same mpirun command, with -bind-to-core, with -bind-to-core 
>>> -bycore etc
>>> and I can't seem to get the right combination of args to get the same 
>>> behavior as 1.4.2.
>>> 
>>> We get the following process map (this output is with mpirun args 
>>> --bind-to-socket
>>> --mca mpi_paffinity_alone 1):
>>> 
>>>   PID COMMAND         CPUMASK     TOTAL [     N0     N1     N2     N3     N4     N5 ]
>>> 24176 prog1           0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30     60.2M [  60.2M     0      0      0      0      0  ]
>>> 24177 prog1           0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30     60.5M [  60.5M     0      0      0      0      0  ]
>>> 24178 prog1           0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30     60.5M [  60.5M     0      0      0      0      0  ]
>>> 24179 prog1           0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30     60.5M [  60.5M     0      0      0      0      0  ]
>>> 24180 prog1           0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30     60.5M [  60.5M     0      0      0      0      0  ]
>>> 24181 prog1           0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30     60.5M [  60.5M     0      0      0      0      0  ]
>>> 24182 prog1           0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30     60.5M [  60.5M     0      0      0      0      0  ]
>>> 24183 prog1           0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30     60.5M [  60.5M     0      0      0      0      0  ]
>>> 
>>> here is the map using just --mca mpi_paffinity_alone 1
>>> 
>>>   PID COMMAND         CPUMASK     TOTAL [     N0     N1     N2     N3     N4     N5 ]
>>> 25846 prog1              0,16     60.6M [  60.6M     0      0      0      0      0  ]
>>> 25847 prog1              2,18     60.6M [  60.6M     0      0      0      0      0  ]
>>> 25848 prog1              4,20     60.6M [  60.6M     0      0      0      0      0  ]
>>> 25849 prog1              6,22     60.6M [  60.6M     0      0      0      0      0  ]
>>> 25850 prog1              8,24     60.6M [  60.6M     0      0      0      0      0  ]
>>> 25851 prog1             10,26     60.6M [  60.6M     0      0      0      0      0  ]
>>> 25852 prog1             12,28     60.6M [  60.6M     0      0      0      0      0  ]
>>> 25853 prog1             14,30     60.6M [  60.6M     0      0      0      0      0  ]
>>> 
>>> I figure I am compiling incorrectly or using the wrong mpirun args.
>>> 
>>> Can someone tell me how to duplicate the behavior of 1.4.2 regarding 
>>> binding the processes to cores?
>>> 
>>> Any help appreciated..
>>> 
>>> thanks
>>> 
>>> tom
>>> _______________________________________________
>>> devel mailing list
>>> de...@open-mpi.org <mailto:de...@open-mpi.org>
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel 
>>> <http://www.open-mpi.org/mailman/listinfo.cgi/devel>
>>> Link to this post: 
>>> http://www.open-mpi.org/community/lists/devel/2015/04/17205.php 
>>> <http://www.open-mpi.org/community/lists/devel/2015/04/17205.php>
>>> 
>>> 
>>> 
>>> -- 
>>> Kind regards Nick
>>> 
>>> 
>>> 
>>> -- 
>>> Kind regards Nick
>>> _______________________________________________
>>> devel mailing list
>>> de...@open-mpi.org <mailto:de...@open-mpi.org>
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel 
>>> <http://www.open-mpi.org/mailman/listinfo.cgi/devel>
>>> Link to this post: 
>>> http://www.open-mpi.org/community/lists/devel/2015/04/17207.php 
>>> <http://www.open-mpi.org/community/lists/devel/2015/04/17207.php>
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org <mailto:de...@open-mpi.org>
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel 
>> <http://www.open-mpi.org/mailman/listinfo.cgi/devel>
>> Link to this post: 
>> http://www.open-mpi.org/community/lists/devel/2015/04/17209.php 
>> <http://www.open-mpi.org/community/lists/devel/2015/04/17209.php>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org <mailto:de...@open-mpi.org>
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel 
> <http://www.open-mpi.org/mailman/listinfo.cgi/devel>
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2015/04/17249.php 
> <http://www.open-mpi.org/community/lists/devel/2015/04/17249.php>
