On Aug 21, 2014, at 16:50, Reuti wrote:
> On Aug 21, 2014, at 16:00, Ralph Castain wrote:
>
>> On Aug 21, 2014, at 6:54 AM, Reuti <re...@staff.uni-marburg.de> wrote:
>>
>>> On Aug 21, 2014, at 15:45, Ralph Castain wrote:
>>>
>>>> On Aug 21, 2014, at 2:51 AM, Reuti <re...@staff.uni-marburg.de> wrote:
>>>>
>>>>> On Aug 20, 2014, at 23:16, Ralph Castain wrote:
>>>>>
>>>>>> On Aug 20, 2014, at 11:16 AM, Reuti <re...@staff.uni-marburg.de> wrote:
>>>>>>
>>>>>>> On Aug 20, 2014, at 19:05, Ralph Castain wrote:
>>>>>>>
>>>>>>>>> <snip>
>>>>>>>>> Aha, this is quite interesting - how do you do this: scanning /proc/<pid>/status or the like? What happens if you don't find enough free cores because they are already used up by other applications?
>>>>>>>>
>>>>>>>> Remember, when you use mpirun to launch, we launch our own daemons using the native launcher (e.g., qsub). So the external RM will bind our daemons to the specified cores on each node. We use hwloc to determine what cores our daemons are bound to, and then bind our own child processes to cores within that range.
>>>>>>>
>>>>>>> Thanks for reminding me of this. Indeed, I mixed up two different aspects in this discussion.
>>>>>>>
>>>>>>> a) What will happen in case no binding was done by the RM (hence Open MPI could use all cores) and two Open MPI jobs (or something completely different besides one Open MPI job) are running on the same node (due to the tight integration with two different Open MPI directories in /tmp and two `orted`, unique for each job)? Will the second Open MPI job know what the first Open MPI job has used up already? Or will both use the same set of cores, as "-bind-to none" can't be set in the given `mpiexec` command because "-map-by slot:pe=$OMP_NUM_THREADS" was used - which makes "-bind-to core" mandatory and can't be switched off? I see the same cores being used for both jobs.
>>>>>>
>>>>>> Yeah, each mpirun executes completely independently of the other, so they have no idea what the other is doing. So the cores will be overloaded. Multi-pe's require bind-to core, otherwise there is no way to implement the request.
>>>>>
>>>>> Yep, and so it's not an option in a mixed cluster. Why would it hurt to allow "-bind-to none" here?
>>>>
>>>> Guess I'm confused here - what does pe=N mean if we bind-to none?? If you are running on a mixed cluster and don't want binding, then just say bind-to none and leave the pe argument out entirely, as it wouldn't mean anything unless you are bound.
>>>
>>> It would mean: divide the overall number of slots/cores in the machinefile by N (i.e. $OMP_NUM_THREADS).
>>>
>>> - Request made to the queuing system: I need 80 cores in total.
>>> - The machinefile will contain 80 cores.
>>> - Open MPI will divide it by N, i.e. 8 here.
>>> - Open MPI will start only 10 processes, one on each node.
>>> - The application will use 8 threads per started MPI process.
>>
>> I see - so you were talking about the case where the user doesn't provide the -np N option
>
> Yes. Even if -np is specified: AFAICS Open MPI fills up the given slots in the machinefile from the beginning (the first nodes get all the processes, the remaining nodes stay free). Doing it in a round-robin way would work better for this case.
>
>> and we need to compute the number of procs to start. Okay, the change you requested below will fix that one too. I can make that easily enough.
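As an aside, a minimal sketch of how a job script can already do this division itself, assuming an SGE tight integration where $NSLOTS holds the overall slot count granted by the queuing system (the PE name "smp2" and the binary ./mpihello are simply taken from the test further down in this thread):

#!/bin/sh
#$ -pe smp2 80
# Sketch: derive the number of MPI ranks from the overall slot count,
# so that 80 granted slots with 8 threads each become 10 ranks.
export OMP_NUM_THREADS=8
NP=$(( NSLOTS / OMP_NUM_THREADS ))
mpiexec -np $NP ./mpihello

This only covers the division itself; the binding variants for the two cases are sketched further below.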
> Therefore I wanted to start a discussion about it (at that time I wasn't aware of the "-map-by slot:pe=N" option), as I have no final syntax which would cover all cases. Someone may want the binding done by "-map-by slot:pe=N". How can this be specified, while keeping an easy tight integration for users who don't want any binding at all?
>
> The boundary conditions are:
>
> - the job is running inside a queuing system
> - the user requests the overall amount of slots from the queuing system
> - hence the machinefile has entries for all slots

BTW: The fact that the queuing system is set up in such a way that the machinefile contains a multiple of $OMP_NUM_THREADS per node is a premise and can be seen as given here - otherwise an error should be generated (a minimal sketch of such a check is shown below). It's up to the admin of the queuing system to configure it in such a way.

-- Reuti
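As an illustration of that premise check, a rough sketch, assuming an SGE tight integration where each line of $PE_HOSTFILE has the form "hostname slots queue processor-range":

# Abort if any node's slot count is not a multiple of OMP_NUM_THREADS.
while read host slots rest; do
  if [ $(( slots % OMP_NUM_THREADS )) -ne 0 ]; then
    echo "error: $host offers $slots slots, not a multiple of OMP_NUM_THREADS=$OMP_NUM_THREADS" >&2
    exit 1
  fi
done < "$PE_HOSTFILE"

Run early in the job script, this turns a mis-configured PE into an immediate error instead of an oversubscribed run.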
> - the user sets OMP_NUM_THREADS
>
> case 1) no interest in any binding, other jobs may exist on the nodes
>
> case 2) user wants binding: i.e. $OMP_NUM_THREADS cores assigned to each MPI process, maybe with "-map-by slot:pe=N"
>
> In both cases only (overall amount of slots) / ($OMP_NUM_THREADS) MPI processes should be started, not (overall amount of slots) processes AFAICS.
>
> -- Reuti
>
>>> -- Reuti
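To make the two cases concrete, the corresponding `mpiexec` invocations could look roughly like this (a sketch only; $NP is the rank count computed as above, and whether case 1 can be combined with an automatic division of the slot count is exactly the open question here):

# case 1) no binding at all: other jobs may share the nodes,
#         the OS scheduler places the threads
mpiexec -np $NP --bind-to none ./mpihello

# case 2) bind each MPI process to $OMP_NUM_THREADS cores
mpiexec -np $NP -map-by slot:pe=$OMP_NUM_THREADS ./mpihello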
>>>>>>> Altering the machinefile instead: the processes are not bound to any core, and the OS takes care of a proper assignment.
>>>>>
>>>>> Here the ordinary user has to mangle the hostfile, which is not good (but it allows several jobs per node, as the OS shifts the processes around). Could/should it be put into the "gridengine" module in Open MPI, to divide the slot count per node automatically when $OMP_NUM_THREADS is found, or to generate an error if it's not divisible?
>>>>
>>>> Sure, that could be done - but it will only help if OMP_NUM_THREADS is set when someone spins off threads. So far as I know, that's only used for OpenMP - so we'd get a little help, but it wouldn't be full coverage.
>>>>
>>>>> ===
>>>>>
>>>>>>>> If the cores we are bound to are the same on each node, then we will do this with no further instruction. However, if the cores are different on the individual nodes, then you need to add --hetero-nodes to your command line (as the nodes appear to be heterogeneous to us).
>>>>>>>
>>>>>>> b) Aha, so it's not only about different CPU types, but also about the same CPU type with different allocations between the nodes? It's not in the `mpiexec` man page of 1.8.1 though. I'll have a look at it.
>>>>>
>>>>> I tried:
>>>>>
>>>>> $ qsub -binding linear:2:0 -pe smp2 8 -masterq parallel@node01 -q parallel@node0[1-4] test_openmpi.sh
>>>>> Your job 247109 ("test_openmpi.sh") has been submitted
>>>>> $ qsub -binding linear:2:1 -pe smp2 8 -masterq parallel@node01 -q parallel@node0[1-4] test_openmpi.sh
>>>>> Your job 247110 ("test_openmpi.sh") has been submitted
>>>>>
>>>>> Getting on node03:
>>>>>
>>>>> 6733 ?  Sl   0:00 \_ sge_shepherd-247109 -bg
>>>>> 6734 ?  SNs  0:00 |   \_ /usr/sge/utilbin/lx24-amd64/qrsh_starter /var/spool/sge/node03/active_jobs/247109.1/1.node03
>>>>> 6741 ?  SN   0:00 |       \_ orted -mca orte_hetero_nodes 1 -mca ess env -mca orte_ess_jobid 1493303296 -mca orte_ess_vpid
>>>>> 6742 ?  RNl  0:31 |           \_ ./mpihello
>>>>> 6745 ?  Sl   0:00 \_ sge_shepherd-247110 -bg
>>>>> 6746 ?  SNs  0:00     \_ /usr/sge/utilbin/lx24-amd64/qrsh_starter /var/spool/sge/node03/active_jobs/247110.1/1.node03
>>>>> 6753 ?  SN   0:00         \_ orted -mca orte_hetero_nodes 1 -mca ess env -mca orte_ess_jobid 1506607104 -mca orte_ess_vpid
>>>>> 6754 ?  RNl  0:25             \_ ./mpihello
>>>>>
>>>>> reuti@node03:~> cat /proc/6741/status | grep Cpus_
>>>>> Cpus_allowed: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000003
>>>>> Cpus_allowed_list: 0-1
>>>>> reuti@node03:~> cat /proc/6753/status | grep Cpus_
>>>>> Cpus_allowed: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000030
>>>>> Cpus_allowed_list: 4-5
>>>>>
>>>>> Hence, "orted" got two cores assigned for each of them. But:
>>>>>
>>>>> reuti@node03:~> cat /proc/6742/status | grep Cpus_
>>>>> Cpus_allowed: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000003
>>>>> Cpus_allowed_list: 0-1
>>>>> reuti@node03:~> cat /proc/6754/status | grep Cpus_
>>>>> Cpus_allowed: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000003
>>>>> Cpus_allowed_list: 0-1
>>>>>
>>>>> What I see here (and in `top` after pressing "1") is that only two cores are used, and Open MPI assigns 0-1 to both jobs. Is the information in "status" not the one Open MPI gets from hwloc?
>>>>>
>>>>> -- Reuti
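Two quick ways to check what binding the started processes actually received, as a sketch (the first uses mpirun's own --report-bindings report, the second repeats the /proc check above from inside each rank and assumes Linux):

# Let mpiexec report the binding it applied to each rank:
mpiexec -np $NP -map-by slot:pe=$OMP_NUM_THREADS --report-bindings ./mpihello

# Or ask the kernel from within each rank which cores it may run on:
mpiexec -np $NP sh -c 'echo "$(hostname): $(grep Cpus_allowed_list /proc/self/status)"'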
>>>>>> The man page is probably a little out-of-date in this area - but yes, --hetero-nodes is required for *any* difference in the way the nodes appear to us (cpus, slot assignments, etc.). The 1.9 series may remove that requirement - still looking at it.
>>>>>>
>>>>>>>> So it is up to the RM to set the constraint - we just live within it.
>>>>>>>
>>>>>>> Fine.
>>>>>>>
>>>>>>> -- Reuti