On 21.08.2014 at 16:50, Reuti wrote:

> On 21.08.2014 at 16:00, Ralph Castain wrote:
> 
>> 
>> On Aug 21, 2014, at 6:54 AM, Reuti <re...@staff.uni-marburg.de> wrote:
>> 
>>> On 21.08.2014 at 15:45, Ralph Castain wrote:
>>> 
>>>> On Aug 21, 2014, at 2:51 AM, Reuti <re...@staff.uni-marburg.de> wrote:
>>>> 
>>>>> On 20.08.2014 at 23:16, Ralph Castain wrote:
>>>>> 
>>>>>> 
>>>>>> On Aug 20, 2014, at 11:16 AM, Reuti <re...@staff.uni-marburg.de> wrote:
>>>>>> 
>>>>>>> On 20.08.2014 at 19:05, Ralph Castain wrote:
>>>>>>> 
>>>>>>>>> <snip>
>>>>>>>>> Aha, this is quite interesting - how do you do this: by scanning 
>>>>>>>>> /proc/<pid>/status or the like? What happens if you don't find 
>>>>>>>>> enough free cores because they are already used up by other 
>>>>>>>>> applications?
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> Remember, when you use mpirun to launch, we launch our own daemons 
>>>>>>>> using the native launcher (e.g., qsub). So the external RM will bind 
>>>>>>>> our daemons to the specified cores on each node. We use hwloc to 
>>>>>>>> determine what cores our daemons are bound to, and then bind our own 
>>>>>>>> child processes to cores within that range.
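>>>>>>>> 
>>>>>>>> A quick way to see both views from the shell (a sketch, assuming 
>>>>>>>> hwloc's command-line tools are installed; <orted-pid> is a 
>>>>>>>> placeholder for the daemon's PID on that node):
>>>>>>>> 
>>>>>>>> $ hwloc-bind --pid <orted-pid> --get                 # the binding hwloc reports
>>>>>>>> $ grep Cpus_allowed_list /proc/<orted-pid>/status    # the kernel's view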
>>>>>>> 
>>>>>>> Thx for reminding me of this. Indeed, I mixed up two different aspects 
>>>>>>> in this discussion.
>>>>>>> 
>>>>>>> a) What will happen in case no binding was done by the RM (hence Open 
>>>>>>> MPI could use all cores) and two Open MPI jobs (or something completely 
>>>>>>> different besides one Open MPI job) are running on the same node (due 
>>>>>>> to the Tight Integration, with two different Open MPI directories in 
>>>>>>> /tmp and two `orted`s, one unique to each job)? Will the second Open 
>>>>>>> MPI job know which cores the first Open MPI job has already used up? 
>>>>>>> Or will both use the same set of cores, as "-bind-to none" can't be 
>>>>>>> set in the given `mpiexec` command because "-map-by 
>>>>>>> slot:pe=$OMP_NUM_THREADS" was used - which makes "-bind-to core" 
>>>>>>> mandatory and can't be switched off? I see the same cores being used 
>>>>>>> for both jobs.
>>>>>> 
>>>>>> Yeah, each mpirun executes completely independently of the other, so 
>>>>>> they have no idea what the other is doing. So the cores will be 
>>>>>> overloaded. Multi-pe requires bind-to core; otherwise there is no way 
>>>>>> to implement the request.
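>>>>>> 
>>>>>> To make the conflict concrete, a sketch with the options from this 
>>>>>> thread (two such jobs landing on one node will overlap on cores):
>>>>>> 
>>>>>> mpiexec -map-by slot:pe=$OMP_NUM_THREADS ./mpihello   # implies -bind-to core
>>>>>> mpiexec -bind-to none ./mpihello                      # no pe=N possible here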
>>>>> 
>>>>> Yep, and so it's not an option in a mixed cluster. Why would it hurt to 
>>>>> allow "-bind-to none" here?
>>>> 
>>>> Guess I'm confused here - what does pe=N mean if we bind-to none? If you 
>>>> are running on a mixed cluster and don't want binding, then just say 
>>>> bind-to none and leave the pe argument out entirely, as it wouldn't mean 
>>>> anything unless you are bound.
>>> 
>>> It would mean: divide the overall number of slots/cores in the 
>>> machinefile by N (i.e. $OMP_NUM_THREADS) - see the sketch after the list 
>>> below.
>>> 
>>> - Request made to the queuing system: I need 80 cores in total.
>>> - The machinefile will contain 80 cores
>>> - Open MPI will divide it by N, i.e. 8 here
>>> - Open MPI will start only 10 processes, one on each node
>>> - The application will use 8 threads per started MPI process
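>>> 
>>> A minimal sketch of that computation in the job script, assuming SGE's 
>>> $PE_HOSTFILE format (hostname, slot count, queue, processor range per 
>>> line):
>>> 
>>> NSLOTS=$(awk '{ sum += $2 } END { print sum }' "$PE_HOSTFILE")  # 80 here
>>> NP=$(( NSLOTS / OMP_NUM_THREADS ))                              # 80 / 8 = 10
>>> mpiexec -np $NP ./mpihello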
>> 
>> I see - so you were talking about the case where the user doesn't provide 
>> the -np N option
> 
> Yes. Even if -np is specified: AFAICS Open MPI fills up the given slots in 
> the machinefile from the beginning (the first nodes get all the processes, 
> the remaining nodes stay free). Distributing them round-robin would work 
> better for this case, as sketched below.
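> 
> A hedged sketch: Open MPI's mapper can already be told to place by node, 
> which gives that round-robin behavior:
> 
> mpiexec -map-by node -np 10 ./mpihello   # ranks cycle over the nodes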
> 
> 
>> and we need to compute the number of procs to start. Okay, the change you 
>> requested below will fix that one too. I can make that easily enough.
> 
> Therefore I wanted to start a discussion about it (at that time I wasn't 
> aware of the "-map-by slot:pe=N" option), as I have no final syntax which 
> would cover all cases. Someone may want the binding via "-map-by 
> slot:pe=N". How can this be specified while keeping an easy tight 
> integration for users who don't want any binding at all?
> 
> The boundary conditions are:
> 
> - the job is running inside a queuing system
> - the user requests the overall number of slots from the queuing system
> - hence the machinefile has entries for all slots

BTW: The fact that the queuing system is set up in such a way that the 
machinefile contains a multiple of $OMP_NUM_THREADS per node is a premise and 
can be taken as given here - otherwise generate an error. It's up to the 
admin of the queuing system to configure it that way.
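
A hypothetical check for this premise, again assuming SGE's $PE_HOSTFILE 
format with the slot count in column 2:

# abort if any node's slot count is not a multiple of OMP_NUM_THREADS
awk -v n="$OMP_NUM_THREADS" '$2 % n != 0 { print "slot count on "$1" not divisible by "n; exit 1 }' "$PE_HOSTFILE" || exit 1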

-- Reuti


> - the user sets OMP_NUM_THREADS
> 
> case 1) no interest in any binding, other jobs may exist on the nodes
> 
> case 2) user wants binding: i.e. $OMP_NUM_THREADS cores assigned to each MPI 
> process, maybe with "-map-by slot:pe=N"
> 
> In both cases only (overall number of slots) / ($OMP_NUM_THREADS) MPI 
> processes should be started, not (overall number of slots) processes AFAICS.
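> 
> A hedged sketch of the two invocations ($NP being the slot total divided 
> by $OMP_NUM_THREADS, as computed earlier):
> 
> mpiexec -bind-to none -np $NP ./mpihello                     # case 1
> mpiexec -map-by slot:pe=$OMP_NUM_THREADS -np $NP ./mpihello  # case 2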
> 
> -- Reuti
> 
> 
>>> -- Reuti
>>> 
>>> 
>>>>> 
>>>>> 
>>>>>>> Altering the machinefile instead: the processes are not bound to any 
>>>>>>> core, and the OS takes care of a proper assignment.
>>>>> 
>>>>> Here the ordinary user has to mangle the hostfile, which is not good 
>>>>> (but it allows several jobs per node, as the OS shifts the processes 
>>>>> around); a sketch of that mangling follows below. Could/should this be 
>>>>> put into the "gridengine" module in Open MPI, to divide the slot count 
>>>>> per node automatically when $OMP_NUM_THREADS is found, or to generate 
>>>>> an error if it's not divisible?
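>>>>> 
>>>>> The manual mangling amounts to something like this (a sketch, assuming 
>>>>> SGE's $PE_HOSTFILE format and Open MPI's "slots=" hostfile syntax; 
>>>>> "myhosts" is just a name chosen here):
>>>>> 
>>>>> awk -v n="$OMP_NUM_THREADS" '{ print $1" slots="$2/n }' "$PE_HOSTFILE" > myhosts
>>>>> mpiexec -machinefile myhosts ./mpihello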
>>>> 
>>>> Sure, that could be done - but it will only help if OMP_NUM_THREADS is 
>>>> set when someone spins off threads. So far as I know, that's only used 
>>>> for OpenMP - so we'd get a little help, but it wouldn't be full coverage.
>>>> 
>>>> 
>>>>> 
>>>>> ===
>>>>> 
>>>>>>>> If the cores we are bound to are the same on each node, then we will 
>>>>>>>> do this with no further instruction. However, if the cores are 
>>>>>>>> different on the individual nodes, then you need to add --hetero-nodes 
>>>>>>>> to your command line (as the nodes appear to be heterogeneous to us).
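>>>>>>>> 
>>>>>>>> For example (a sketch; the binding itself still comes from the RM):
>>>>>>>> 
>>>>>>>> mpirun --hetero-nodes -map-by slot:pe=2 ./mpihello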
>>>>>>> 
>>>>>>> b) Aha, so it's not only about different CPU types, but also about 
>>>>>>> the same CPU type with different allocations between the nodes? It's 
>>>>>>> not in the `mpiexec` man page of 1.8.1 though. I'll have a look at it.
>>>>> 
>>>>> I tried:
>>>>> 
>>>>> $ qsub -binding linear:2:0 -pe smp2 8 -masterq parallel@node01 -q parallel@node0[1-4] test_openmpi.sh
>>>>> Your job 247109 ("test_openmpi.sh") has been submitted
>>>>> $ qsub -binding linear:2:1 -pe smp2 8 -masterq parallel@node01 -q parallel@node0[1-4] test_openmpi.sh
>>>>> Your job 247110 ("test_openmpi.sh") has been submitted
>>>>> 
>>>>> 
>>>>> What I get on node03:
>>>>> 
>>>>> 
>>>>> 6733 ?        Sl     0:00  \_ sge_shepherd-247109 -bg
>>>>> 6734 ?        SNs    0:00  |   \_ /usr/sge/utilbin/lx24-amd64/qrsh_starter /var/spool/sge/node03/active_jobs/247109.1/1.node03
>>>>> 6741 ?        SN     0:00  |       \_ orted -mca orte_hetero_nodes 1 -mca ess env -mca orte_ess_jobid 1493303296 -mca orte_ess_vpid
>>>>> 6742 ?        RNl    0:31  |           \_ ./mpihello
>>>>> 6745 ?        Sl     0:00  \_ sge_shepherd-247110 -bg
>>>>> 6746 ?        SNs    0:00      \_ /usr/sge/utilbin/lx24-amd64/qrsh_starter /var/spool/sge/node03/active_jobs/247110.1/1.node03
>>>>> 6753 ?        SN     0:00          \_ orted -mca orte_hetero_nodes 1 -mca ess env -mca orte_ess_jobid 1506607104 -mca orte_ess_vpid
>>>>> 6754 ?        RNl    0:25              \_ ./mpihello
>>>>> 
>>>>> 
>>>>> reuti@node03:~> cat /proc/6741/status | grep Cpus_
>>>>> Cpus_allowed:     00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000003
>>>>> Cpus_allowed_list:        0-1
>>>>> reuti@node03:~> cat /proc/6753/status | grep Cpus_
>>>>> Cpus_allowed:     00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000030
>>>>> Cpus_allowed_list:        4-5
>>>>> 
>>>>> Hence, each "orted" got its own two cores assigned. But:
>>>>> 
>>>>> 
>>>>> reuti@node03:~> cat /proc/6742/status | grep Cpus_
>>>>> Cpus_allowed:     00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000003
>>>>> Cpus_allowed_list:        0-1
>>>>> reuti@node03:~> cat /proc/6754/status | grep Cpus_
>>>>> Cpus_allowed:     00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000003
>>>>> Cpus_allowed_list:        0-1
>>>>> 
>>>>> What I see here (and in `top` after pressing "1") is that only two 
>>>>> cores are used, and Open MPI assigns 0-1 to both jobs. Is the 
>>>>> information in "status" not the one Open MPI gets from hwloc?
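>>>>> 
>>>>> A possible cross-check (a sketch, assuming hwloc's command-line tools 
>>>>> are installed on the node):
>>>>> 
>>>>> hwloc-bind --pid 6742 --get   # should match Cpus_allowed in /proc
>>>>> hwloc-bind --pid 6754 --get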
>>>>> 
>>>>> -- Reuti
>>>>> 
>>>>> 
>>>>>> The man page is probably a little out-of-date in this area - but yes, 
>>>>>> --hetero-nodes is required for *any* difference in the way the nodes 
>>>>>> appear to us (cpus, slot assignments, etc.). The 1.9 series may remove 
>>>>>> that requirement - still looking at it.
>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>> So it is up to the RM to set the constraint - we just live within it.
>>>>>>> 
>>>>>>> Fine.
>>>>>>> 
>>>>>>> -- Reuti