Hi Reuti,
first of all, thanks a lot for the prompt reply!
Please see my answers below.
---------- Original e-mail ----------
From: Reuti <[email protected]>
To: [email protected]
Date: 1. 5. 2017 22:34:34
Subject: Re: [gridengine users] How to set properly the high priority queue?
Hi,
On 01.05.2017, at 21:52, <[email protected]> wrote:
> Hello,
>
> We have a computer cluster at our faculty based on nodes
> equipped with two Intel Xeon(R) E5-2695 v3 processors (i.e. 2 x 14 = 28
> physical = 56 logical cores/node), where we use SGE, or more precisely
> OGS/GE (OGS/GE 2011.11p1), to run/distribute jobs.
>
> On one of these nodes we would like to create a "high priority
> queue" that should provide CPU resources preferentially to jobs
> submitted through this queue, if necessary by restricting/decreasing
> the CPU share of already running jobs that were submitted to this node
> earlier through the "ordinary queue".
>
> Until now we have only experimented with the SGE/OGE queue parameter
> "priority", which can be used to set the "nice" value for a given job.
> First we tested the value -10 (which appeared to be completely sufficient
> on an ordinary workstation with 12 logical CPU cores, tested there
> without SGE, just using the "nice" command) and later also -19.
>
> In a situation where the given node was nearly fully loaded (54-55 busy
> CPU slots out of the 56 available) with jobs submitted through the
> "ordinary queue", we submitted one parallel (24-slot) job through the
> "high priority queue", hoping to achieve an effect similar to what we had
> seen on the 12-logical-core workstation, i.e. that the high priority job
> would get nearly 24 x 100% CPU usage at the expense of the running jobs
> submitted through the "ordinary queue".
>
> We performed this test with a parallel MPI job (pmemd.MPI - Molecular
> Dynamics) and then another test with a GAMESS job (QM), where
> parallelization is accomplished using TCP/IP sockets and System V shared
> memory.
What type of MPI: Open MPI, MPICH, Intel MPI, IBM Spectrum MPI, IBM/Platform
MPI…?
mpicc -v shows "mpicc for MPICH2 version 1.4.1 ...."
on the main node and
mpicc -v
Using built-in specs.
COLLECT_GCC=/usr/x86_64-pc-linux-gnu/gcc-bin/4.8.3/x86_64-pc-linux-gnu-gcc
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-pc-linux-gnu/4.8.3/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with: /var/tmp/portage/sys-devel/gcc-4.8.3/work/gcc-4.8.3/
configure --host=x86_64-pc-linux-gnu --build=x86_64-pc-linux-gnu --prefix=/
usr --bindir=/usr/x86_64-pc-linux-gnu/gcc-bin/4.8.3 --includedir=/usr/lib/
gcc/x86_64-pc-linux-gnu/4.8.3/include --datadir=/usr/share/gcc-data/x86_64-
pc-linux-gnu/4.8.3 --mandir=/usr/share/gcc-data/x86_64-pc-linux-gnu/4.8.3/
man --infodir=/usr/share/gcc-data/x86_64-pc-linux-gnu/4.8.3/info --with-gxx-
include-dir=/usr/lib/gcc/x86_64-pc-linux-gnu/4.8.3/include/g++-v4 --with-
python-dir=/share/gcc-data/x86_64-pc-linux-gnu/4.8.3/python --enable-
languages=c,c++,fortran --enable-obsolete --enable-secureplt --disable-
werror --with-system-zlib --enable-nls --without-included-gettext --enable-
checking=release --with-bugurl=https://bugs.gentoo.org/ --with-pkgversion=
'Gentoo 4.8.3 p1.1, pie-0.5.9' --enable-libstdcxx-time --enable-shared --
enable-threads=posix --enable-__cxa_atexit --enable-clocale=gnu --enable-
multilib --with-multilib-list=m32,m64 --disable-altivec --disable-fixed-
point --enable-targets=all --disable-libgcj --enable-libgomp --disable-
libmudflap --disable-libssp --enable-lto --without-cloog --enable-
libsanitizer
Thread model: posix
gcc version 4.8.3 (Gentoo 4.8.3 p1.1, pie-0.5.9)
on the target computing node where the high priority queue is defined.
But there are also OpenMP jobs running. Moreover, as I mentioned before, the
actual installation of the QM software GAMESS is parallelized using TCP/IP
sockets and System V shared memory.
Naturally, we would like to have a "normal-priority/high-priority" solution
on the given node which works more or less independently of the type of the
jobs.
""""
"
> Unfortunately, neither test met our expectations.
> SGE successfully assigned the "nice" value -10 (and later -19) to the job
> submitted in the "high priority queue", but this was not properly
> reflected in the allocation of CPU resources for the high priority job.
> The result was quite different and unsatisfactory compared to our first
> preliminary experiments (without SGE, just using the "nice" command) on
> the ordinary workstation with 12 logical CPU cores.
>
> Please see the relevant screenshots here:
> http://physics.ujep.cz/~mmaly/SCREENS/
How many independent jobs were on the node?
For example, in the case of the tests with SGE priority = -10 (and thus
nice = -10) - the first two screenshots - there were:
NORMAL PRIORITY JOBS
#1
2 x OpenMP job (shelx_charged_p), each requiring 12 slots -> 1200% CPU usage
(you can see that after the node was overloaded with the high priority
24-slot job, these jobs dropped to ca 800% CPU usage)
#2
1 x most likely MPI job (dlmeso) requiring 12 slots
#3
1 x most likely MPI job (lammps) requiring 8 slots
#4
1 x GPU job (pmemd.cuda) requiring just 1 CPU slot (the majority of this
calculation runs on the GPU)
#5
1 x MPI job (sander.MPI) requiring ca 10 CPU slots
That makes 55 of the 56 available CPU slots busy with normal priority jobs.
HIGH PRIORITY JOBS
In the first test, the node loaded with the normal priority jobs described
above was overloaded with a high priority job submitted through the "high
priority queue" (SGE "priority" parameter set to -10):
1 x MPI job (pmemd.MPI) requiring 24 slots
In the second test, in the same situation, the node was overloaded with the
high priority job:
1 x multithreaded job (gamess) requiring 24 slots (parallelized using TCP/IP
sockets and System V shared memory)
""> I would be grateful for any relevant comments/tips which could help us
to successfully solve
> our problem with high priority queue.
I would say that these high priority jobs fight with the kernel processes
having the same nice value for resources. The behavior of the nice value is
to be more "nice" to other jobs, i.e. a higher value means to be nicer.
Essentially this means: normal jobs should get a 19 (yes, plus 19), and high
priority jobs a value of 0 (zero). Negative values are reserved for
important kernel tasks, and no user process should use them."
" "
OK, but as I mentioned, in the case of the ordinary workstation with 12
logical cores, when it was fully loaded with a normal priority (nice = 0)
MPI job (sander.MPI requiring 12 CPU threads) and then overloaded with a
high priority job (nice = -10) (pmemd.MPI requiring 12 CPU threads), it
worked exactly as we wanted, i.e. almost all the CPU resources were
redirected to the high priority job (pmemd.MPI).
I know that there is a big difference between the ordinary workstation and
the computing node, but I would expect at least somewhat similar behaviour.
So you think that if we use the SGE "priority" parameter (and thus the
"nice" value) 19 for the normal queue and 0 for the high priority queue, we
might get significantly better results than with the values 0 (normal) and
-10 or -19 (high priority), because then the high priority computing jobs
will not fight for resources with important operating system processes
etc.?
It would be nice if just this change solved our problem!
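Just to make sure I understand the suggestion correctly: the change itself
would be something like the following (only a sketch - "normal.q" and
"high.q" stand for our actual queue names):

  # be nice to everybody else: jobs from the normal queue get nice 19
  qconf -mattr queue priority 19 normal.q
  # jobs from the high priority queue run at the default nice 0
  qconf -mattr queue priority 0 high.q

(or the same values entered by hand via "qconf -mq normal.q" and
"qconf -mq high.q", and checked afterwards with
"qconf -sq normal.q | grep priority").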
BTW, is there any way to change the SGE "priority" value (and thus the
"nice" value) of already running jobs?
""
Side note A: as long as the number of active processes in the run queue of
the kernel is lower than the number of cores, the nice value has no effect.
I.e. having 8 cores and:
4 x nice 19
2 x nice 10
1 x nice 5
1 x nice 0
all will get 100%. The nice value only comes into play when there are more
processes than cores. This also means: 8 x nice 0 is essentially the same as
8 x nice 19, as there is no one to be nice to.
"
Yes, I am aware of this obvious fact.
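(And easy to reproduce, e.g. on an 8-core test box with something like

  # 8 CPU burners at nice 19 plus 8 at nice 0 - the nice 0 ones win
  for i in $(seq 8); do nice -n 19 sh -c 'while :; do :; done' & done
  for i in $(seq 8); do sh -c 'while :; do :; done' & done

where "top" then shows the nice 0 loops taking almost all of the CPU, while
with only 8 loops in total all of them would sit near 100% regardless of
their nice values.)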
"
""
Side note B: Using HT in a cluster is often not advisable, as the runtime of
a job can't be predicted as it depends on other processes on the CPU. There
was some discussion here:
https://www.mail-archive.com/[email protected]//msg30863.html (the
complete thread and all links)
" "
Yes, I know, but we prefer to have more "slots" available (56 with HT) than
physical cores (28), so that more jobs can run at the same time, even though
we know perfectly well that if the number of computing threads significantly
exceeds the number of physical cores, the calculations slow down
significantly.
""Maybe one get 130% of the CPU. Especially with MPI jobs this becomes a
problem: all processes are doing the same at the the same time and fight for
the same resources inside a CPU. Having 2 independent jobs on a CPU might be
more promising.
Side note C: In most of the cases one MPI job doesn't know anything about
the other MPI job on a node. If they have an automatic core binding enables,
each starts to count at core 0 and binds to the same cores. "
OK, but the problem we are trying to solve is not sharing the CPU resources
between several jobs with the same priority, but sharing the CPU resources
between jobs with low and high priority.
"It might be necessary to disable the automatic core binding and let the
kernel scheduler do its best (unless you have a complete node for all tasks
belonging to a job, which could of course spawn several nodes)."
""
We do not have optimal inter-node connections, so each job uses only the
CPUs of one node.
To be frank, I am definitely not an expert here, so I have no idea what
"disabling the automatic core binding" means and, of course, no idea how to
do it.
If you think that this could significantly help us to implement our idea of
a "normal priority" and a "high priority" queue operating on the same node,
please write more details.
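(Purely guessing here, after a quick look at the "taskset" manual page: I
suppose checking and removing the binding would mean something like

  # show the CPU affinity of one process of a running job (<pid> is a placeholder)
  taskset -cp <pid>
  # if it is pinned to a few cores only, widen it again to all 56 logical cores
  taskset -cp 0-55 <pid>

while for the MPI jobs themselves the binding options of the particular
mpirun/mpiexec would have to be switched off - but please correct me if that
is not what you meant.)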
Also, if there is some other strategy for running "high priority" (urgent)
jobs preferentially on a node saturated with "low priority" jobs (which
could eventually even be suspended until some of the high priority jobs are
done), I would be grateful if you mentioned it as well.
Thank you!
Best wishes,
Marek
"
-- Reuti
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users