Hi,

> On 22.05.2017 at 02:33, <[email protected]> <[email protected]> wrote:
> 
> Hello Reuti,
> 
> I am back with our high-priority queue problem.
> 
> As you suggested, we ran new experiments with high-priority jobs (nice = 0)
> and low-priority jobs (nice = 19), instead of the older nice values of
> -10(-19)/0. Unfortunately we obtained the expected behaviour only when the
> high-priority jobs were single-slot (1 CPU slot) or serial jobs. For parallel
> high-priority jobs, independently of the way of parallelization (MPI, OpenMP,
> TCP/IP sockets), it does not work on the given computing node (2 x 14 = 28
> physical = 56 logical CPU cores), while it works nicely on an ordinary
> workstation (6 physical = 12 logical CPU cores).
> 
> Please see the results here:
> 
> http://physics.ujep.cz/~mmaly/high-priority-queue-SGE/
Using HyperThreading is often not the best option for HPC. Nevertheless it
shouldn't show this effect. One possible constraint in your case might be that
core binding is in effect. pmemd.MPI seems to be from Amber; which MPI does it
use (Intel MPI, Open MPI, MPICH, …)? Is it the same on both machines? If two
processes are essentially bound to one physical core, neither of them can be
moved somewhere else by the Linux scheduler in the kernel, and as both are
using the same core one might face exactly this drop in performance. To check
this, what is the output of the following for such a pmemd.MPI process:

$ grep allowed /proc/51545/status

(A small loop to check all pmemd.MPI ranks at once is sketched further below.)
Besides this, cgroups might also be involved in keeping the process on the
cores it is assigned to.

> I would be grateful for any helpful ideas, including suggestions of some
> other, maybe more suitable mailing lists (e.g. focusing on the different
> relevant CPU settings ...) where we might have a chance to solve this
> problem.
> 
> You also alternatively suggested this "suspend" solution:
> 
> $ qconf -sq high.q
> …
> subordinate_list low.q=1
> 
> One could also try higher values than 1 (it's numeric, not a flag; a value
> of 1 means that as soon as one slot is used in high.q on a particular node,
> the corresponding low.q is suspended on this node) to allow some overloading
> before it kicks in.
> 
> It is not so elegant, because all jobs in the low-priority queue are stopped
> on the given node even if just a small job is submitted to the high-priority
> queue on that node, if I understood it well. But such a situation will not
> be frequent, so it could be a solution for us.
> 
> #1
> Do you think that it might also work on our "crazy" cluster, where for
> example the nice values are pretty much ignored for parallel jobs? Or does
> this method have nothing in common with job priorities, and simply some
> secure and reliable STOP signals are sent to all jobs/threads belonging to
> the low-priority queue on that node?

The latter one. The STOP signal is sent to the process group. GAMESS usually
jumps out of the process tree AFAIR (at least on slave nodes). But the startup
can be routed through `qrsh -inherit …`. Does a:

$ ps -e f

(f without a leading -) show a process tree where all GAMESS processes are
still bound to the sge_shepherd?

> #2
> Of course we would like just to temporarily stop those low-priority jobs on
> the given node, only for the time when some high-priority jobs are running
> on that node. Does this "subordinate_list" solution provide this, and is it
> safe also in the case of parallel jobs or combined CPU/GPU jobs?

If parallel jobs get suspended, one might face timeouts. Nevertheless it's
worth trying with your applications.

-- Reuti

> Thanks in advance for comments!
> 
> Best wishes,
> 
> Marek
> 
> 
> ---------- Original e-mail ----------
> From: [email protected]
> To: Reuti <[email protected]>
> Date: 2. 5. 2017 16:21:58
> Subject: Re: [gridengine users] How to set properly the high priority queue ?
> 
> Dear Reuti,
> thank you a lot for your help!
> 
> We will first try your idea to use the 19/0 priorities instead of
> 0/-19(-10), and then eventually your other idea using the "subordinate_list".
> 
> Best wishes,
> 
> Marek
> 
> 
> ---------- Original e-mail ----------
> From: Reuti <[email protected]>
> To: [email protected]
> Date: 2. 5. 2017 13:55:03
> Subject: Re: [gridengine users] How to set properly the high priority queue ?
> 
> Hi,
> 
> > On 02.05.2017 at 03:14, [email protected] wrote:
> > 
> > Hi Reuti,
> > first of all, thanks a lot for the prompt reaction!
> > 
> > Please see my answers below.
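Regarding the `grep allowed` check above: to look at the allowed cores of all
pmemd.MPI ranks at once, something like this one-liner could be used (just a
sketch; it assumes pgrep is available on the node and that the ranks can be
matched by the command name pmemd.MPI, the PID 51545 above being only an
example):

$ for pid in $(pgrep -f pmemd.MPI); do grep -H Cpus_allowed_list /proc/$pid/status; done

If several ranks report the same single core (e.g. "0" instead of a range
like "0-55"), they are pinned on top of each other and the kernel scheduler
cannot spread them out, which would match the slowdown you observe.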
> > ---------- Original e-mail ----------
> > From: Reuti <[email protected]>
> > To: [email protected]
> > Date: 1. 5. 2017 22:34:34
> > Subject: Re: [gridengine users] How to set properly the high priority queue ?
> > 
> > > […]
> > > 
> > > What type of MPI: Open MPI, MPICH, Intel MPI, IBM Spectrum MPI,
> > > IBM/Platform MPI…?
> > 
> > mpicc -v shows "mpicc for MPICH2 version 1.4.1 ...."
> > on the main node and

Oh, that's some time old already. They are at 3.2 now. I'm not sure about the
SGE integration at that time. There were some issues until it became stable,
but it's too long ago to say. Did you start the application with
mpiexec.hydra, or was the `mpd` ring still necessary at that time?

> > mpicc -v
> > 
> > Using built-in specs.

Is the output different from the one on the main node?

> > COLLECT_GCC=/usr/x86_64-pc-linux-gnu/gcc-bin/4.8.3/x86_64-pc-linux-gnu-gcc
> > COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-pc-linux-gnu/4.8.3/lto-wrapper
> > Target: x86_64-pc-linux-gnu
> > Configured with: /var/tmp/portage/sys-devel/gcc-4.8.3/work/gcc-4.8.3/configure
> > --host=x86_64-pc-linux-gnu --build=x86_64-pc-linux-gnu --prefix=/usr
> > --bindir=/usr/x86_64-pc-linux-gnu/gcc-bin/4.8.3
> > --includedir=/usr/lib/gcc/x86_64-pc-linux-gnu/4.8.3/include
> > --datadir=/usr/share/gcc-data/x86_64-pc-linux-gnu/4.8.3
> > --mandir=/usr/share/gcc-data/x86_64-pc-linux-gnu/4.8.3/
> > […]
> > 
> > > How many independent jobs were on the node?
> > 
> > For example, in the case of the tests with SGE priority = -10 (and so
> > nice = -10) - the first two screenshots - there were:
> > 
> > NORMAL PRIORITY JOBS
> > 
> > #1
> > 2 x OpenMP job (shelx_charged_p), each requiring 12 slots -> 1200% CPU
> > usage (you can see that after overloading the node with the high-priority
> > 24-slot job, those jobs decreased their CPU usage to ca. 800%)
> > 
> > #2
> > 1 x most likely MPI job (dlmeso) requiring 12 slots
> > 
> > #3
> > 1 x most likely MPI job (lammps) requiring 8 slots
> > 
> > #4
> > One GPU job (pmemd.cuda) requiring just 1 CPU slot (the majority of this
> > calculation runs on the GPU)
> > 
> > #5
> > One MPI job (sander.MPI) requiring ca. 10 CPU slots
> > 
> > This keeps 55 of the 56 available CPU slots busy with normal-priority jobs.
> > 
> > HIGH PRIORITY JOBS
> > 
> > In the first test, the node loaded with the normal-priority jobs described
> > above was overloaded with a high-priority job submitted via the "high
> > priority queue" (SGE "priority" parameter set to -10):
> > 
> > 1 x MPI job (pmemd.MPI) requiring 24 slots
> > 
> > In the second test, in the same situation, the node was overloaded with
> > this high-priority job:
> > 
> > 1 x multithreaded job (gamess) requiring 24 slots (parallelized using
> > TCP/IP sockets and SystemV shared memory)
> > 
> > > > I would be grateful for any relevant comments/tips which could help us
> > > > to successfully solve our problem with the high priority queue.
> > > 
> > > I would say that these high-priority jobs fight for resources with the
> > > kernel processes having the same nice value. The behavior of the nice
> > > value is to be more "nice" to other jobs, i.e. a higher value means to
> > > be nicer.
> > > Essentially this means: normal jobs should get a 19 (yes, plus 19), and
> > > high-priority jobs a value of 0 (zero). Negative values are reserved for
> > > important kernel tasks, and no user process should use them.
> > 
> > OK, but as I mentioned: in the case of the ordinary workstation with 12
> > logical cores, if it was fully loaded with a normal-priority (nice = 0)
> > MPI job (sander.MPI requiring 12 CPU threads) and then overloaded with a
> > high-priority job (nice = -10) (pmemd.MPI requiring 12 CPU threads), it
> > worked perfectly as we wanted, i.e. almost all the CPU resources were
> > redirected to the high-priority job (pmemd.MPI).
> > 
> > I know that there is a big difference between the ordinary workstation and
> > the computing node, but I would assume at least some similar behaviour.

Nowadays I would say the difference is mainly an installed graphics card.
Otherwise it's the same, and features like cores, memory w/ECC, disk, SSD…
depend on the application.

> > So you think that if we use the SGE "priority" parameter (and so the
> > "nice" value) 19 for the normal queue and 0 for the high-priority queue,
> > we might get significantly better results than with the values 0 (for
> > normal) and -10 or -19 (for high priority), because then the high-priority
> > computing jobs will not fight for resources with important processes of
> > the operating system etc.?

Yes.

> > BTW, is there any way to change the SGE "priority" (and so the "nice")
> > value of already running jobs?

Not in the way you think, I fear. One can of course log in to a node and
change the values by hand to the new intended values; `renice` also accepts a
process group to ease the work.

The reprioritization in SGE works together with the functional policy, to
level the granted computing time and achieve the desired distribution of CPU
time according to the tickets:

`man sge_conf`: reprioritize

`man sched_conf`: reprioritize_interval

> > > Side note B: Using HT in a cluster is often not advisable, as the
> > > runtime of a job can't be predicted because it depends on the other
> > > processes on the CPU. There was some discussion here:
> > > 
> > > https://www.mail-archive.com/[email protected]//msg30863.html
> > > (the complete thread and all links)
> > 
> > Yes, I know, but we prefer to have more "slots" available (56 using HT,
> > rather than the 28 physical cores) to have the possibility of more jobs
> > running at one time, even if we know perfectly well that if the number of
> > computing threads significantly exceeds the number of physical cores, the
> > calculations slow down significantly.

Ok.

> > > Maybe one gets 130% of the CPU. Especially with MPI jobs this becomes a
> > > problem: all processes are doing the same thing at the same time and
> > > fight for the same resources inside a CPU. Having 2 independent jobs on
> > > a CPU might be more promising.
> > > 
> > > Side note C: In most cases one MPI job doesn't know anything about the
> > > other MPI job on a node. If they have automatic core binding enabled,
> > > each starts to count at core 0 and binds to the same cores.
> > 
> > OK, but the problem we are trying to solve is not the sharing of the CPU
> > resources between several jobs with the same priority, but the sharing of
> > the CPU resources between jobs with low and high priority.
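Coming back to the nice values and the `renice` hint quoted above: in
practice the nice value comes from the priority attribute of the queue, and
an already running job can only be adjusted by hand on the node. A minimal
sketch (the queue names low.q/high.q and the process-group id 12345 are only
placeholders for your actual setup):

$ qconf -rattr queue priority 19 low.q    # new jobs in low.q then start with nice 19
$ qconf -rattr queue priority 0 high.q    # jobs in high.q keep the default nice 0
$ renice -n 19 -g 12345                   # make an already running job's process group nicer

Note that without root privileges the nice value of a running process can
only be increased (made nicer), not lowered again.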
> > > It might be necessary to disable the automatic core binding and let the
> > > kernel scheduler do its best (unless you have a complete node for all
> > > tasks belonging to a job, which could of course span several nodes).
> > 
> > We do not have optimal inter-node connections, so each job uses just the
> > CPUs of one node.

Ok.

> > To be frank, I am definitely not an expert here, so I have no idea what
> > "to disable the automatic core binding" means, and of course I have no
> > idea how to do it.

This depends on the MPI library. For Open MPI it's a parameter to `mpiexec`:
"--bind-to none"; for Intel MPI it's an environment variable: "export
I_MPI_PIN=off". For MPICH it's the opposite: there is no automatic core
binding, and one would have to enable it explicitly.

> > If you think that this could significantly help us to implement our idea
> > of a "normal priority" and a "high priority" queue operating on the same
> > node, please write more details.
> > 
> > Also, if there is another strategy for running some "high priority"
> > (urgent) jobs preferentially on a node saturated with "low priority" jobs
> > (which could eventually even be suspended until the high-priority job is
> > done), I would be grateful if you mentioned it as well.

Slotwise preemption does not work so well with parallel jobs. In case the
complete low-priority queue can be halted, one could try to suspend the whole
low-priority queue by this means:

$ qconf -sq high.q
…
subordinate_list low.q=1

One could also try higher values than 1 (it's numeric, not a flag; a value of
1 means that as soon as one slot is used in high.q on a particular node, the
corresponding low.q is suspended on this node) to allow some overloading
before the suspension kicks in.

-- Reuti

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
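As a footnote to the subordinate_list setting quoted above: it can also be
changed non-interactively, without opening the whole queue definition in an
editor via `qconf -mq high.q`. A sketch, assuming the queues are really named
high.q and low.q:

$ qconf -rattr queue subordinate_list low.q=1 high.q   # replace the current subordinate_list of high.q
$ qconf -sq high.q | grep subordinate_list             # verify the new value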
