Hi,

> On 02.05.2017 at 03:14, [email protected] wrote:
> 
> Hi Reuti,
> first of all thanks a lot for a prompt reaction !
> 
> Please see my answers below.
> 
> 
> ---------- Original e-mail ----------
> From: Reuti <[email protected]>
> To: [email protected]
> Date: 1. 5. 2017 22:34:34
> Subject: Re: [gridengine users] How to set properly the high priority queue?
> ""-----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> […]
> 
> What type of MPI: Open MPI, MPICH, Intel MPI, IBM Spectrum MPI, IBM/Platform
> MPI…?
> "
> 
> 
> """"
>  
> ""mpicc -v shows ""mpicc for MPICH2 version 1.4.1 ....""
>  on the main node and 

Oh, that's quite old already. They are at 3.2 now. I'm not sure about the state 
of the SGE integration at that time; there were some issues until it became 
stable, but it's too long ago to say. Did you start the application with 
mpiexec.hydra, or was the `mpd` ring still necessary at that time?


>  mpicc -v
> 
> Using built-in specs.

Is the output different from the one on the main node?


> COLLECT_GCC=/usr/x86_64-pc-linux-gnu/gcc-bin/4.8.3/x86_64-pc-linux-gnu-gcc
> 
> COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-pc-linux-gnu/4.8.3/lto-wrapper
> 
> Target: x86_64-pc-linux-gnu
> 
> Configured with: /var/tmp/portage/sys-devel/gcc-4.8.3/work/gcc-4.8.3/
> configure --host=x86_64-pc-linux-gnu --build=x86_64-pc-linux-gnu --prefix=/
> usr --bindir=/usr/x86_64-pc-linux-gnu/gcc-bin/4.8.3 --includedir=/usr/lib/
> gcc/x86_64-pc-linux-gnu/4.8.3/include --datadir=/usr/share/gcc-data/x86_64-
> pc-linux-gnu/4.8.3 --mandir=/usr/share/gcc-data/x86_64-pc-linux-gnu/4.8.3/
>> […]
> 
> How many independent jobs were on the node?
>  
> 
> For example, in the case of the tests with SGE priority = -10 (and so
> nice = -10) - the first two screenshots - there were:
> 
> NORMAL PRIORITY JOBS
> 
> #1
> 2 x OpenMP job (shelx_charged_p), each requiring 12 slots -> 1200% CPU usage
> (you can see that after overloading the node with the high priority 24-slot
> job, those jobs decreased their CPU usage to ca 800%)
> 
> #2
> 1 x most likely MPI job (dlmeso) requiring 12 slots
> 
> #3
> 1 x most likely MPI job (lammps) requiring 8 slots
> 
> #4
> One GPU job (pmemd.cuda) requiring just 1 CPU slot (the majority of this
> calculation runs on the GPU)
> 
> #5
> One MPI job (sander.MPI) requiring ca 10 CPU slots
> 
> This occupies 55 of the 56 available CPU slots with normal priority jobs.
> 
> HIGH PRIORITY JOBS
> 
> In the first test, the node loaded with the above-described normal priority
> jobs was overloaded with a high priority job submitted using the "high
> priority queue" (SGE "priority" parameter set to -10):
> 
> 1 x MPI job (pmemd.MPI) requiring 24 slots
> 
> In the second test, in the same situation the node was overloaded with this
> high priority job:
> 
> 1 x multithreaded job (gamess) requiring 24 slots (parallelized using TCP/IP
> sockets and SystemV shared memory)
> ""> I would be grateful for any relevant comments/tips which could help us
> to successfully solve
>> our problem with high priority queue.
> 
> I would say that these high priority jobs fight with the kernel processes 
> having the same nice value for resources. The behavior of the nice value is
> to be more "nice" to other jobs, i.e. a higher value means to be nicer. 
> 
> Essentially this means: normal jobs should get a 19 (yes, plus 19), and high
> priority jobs a value of 0 (zero). Negative values are reserved for
> important kernel tasks, and no user process should use them.
>  
> " "
> OK, but as I mentioned, in the case of ordinary workstation with 12 logical
> cores, if this was fully loaded with 
> 
> 
> normal priority (nice = 0) MPI job (sander. MPI requiring 12 CPU threads)
> and then overloaded 
> 
> with high  priority job (nice = -10) ( pmemd. MPI requiring 12 CPU threads)
> it perfectly worked as we wanted
> 
> i.e. almost all the CPU resources were redirected to high priority job
> (pmemd.MPI).
> 
> I know that there is a big difference between an ordinary workstation and
> a computing node, but I would assume at least some similar behaviour.

Nowadays I would say the difference is mainly an installed graphics card. 
Otherwise it's the same, and features like cores, memory w/ECC, disk, SSD… 
depend on the application.


> So you think that if we use the SGE "priority" parameter (and so the "nice"
> value) set to 19 for the normal queue and 0 for the high priority queue, we
> might get significantly better results than with values 0 (for normal) and
> -10 or -19 (for high priority), because the high priority computing jobs
> will then not fight for resources with important processes of the operating
> system etc.?

Yes.
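
As a sketch of how this could be set (assuming the queues are named normal.q 
and low.q/high.q - adjust to your actual names), the priority entry of the 
queue configuration is handed to the started jobs as their nice value:

$ qconf -mattr queue priority 19 normal.q
$ qconf -mattr queue priority 0 high.q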


> BTW is there any way to change the SGE "priority" value (and so the "nice"
> value) of already running jobs?

Not in the way you think, I fear. One can of course log in to a node and change 
the values by hand to the new intended values. `renice` also accepts a process 
group to ease the work.
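
For example (the process group ID 12345 is only a placeholder; `ps` can show 
the real one for the job's processes):

$ ps -eo pid,pgid,ni,comm | grep pmemd
$ renice -n 0 -g 12345

This changes the nice value of all processes in that process group in one go.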

The reprioritization in SGE works together with the functional policy: it 
levels the granted computing time to achieve the desired distribution of CPU 
time according to the tickets:

`man sge_conf`: reprioritize

`man sched_conf`: reprioritize_interval
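
To check the current settings (the grep is just a quick way to pick out the 
relevant lines):

$ qconf -sconf | grep reprioritize
$ qconf -ssconf | grep reprioritize_interval

reprioritize has to be enabled in the global configuration and 
reprioritize_interval set to a non-zero interval, otherwise no automatic 
renicing of running jobs takes place.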


> ""
> Side note B: Using HT in a cluster is often not advisable, as the runtime of
> a job can't be predicted as it depends on other processes on the CPU. There
> was some discussion here:
> 
> https://www.mail-archive.com/[email protected]//msg30863.html (the 
> complete thread and all links)
>  
> " "
> Yes I know, but we prefer to have more "slots" available (56 using HT, vs.
> 28 physical cores) to have the possibility of more jobs running at the same
> time, even though we know perfectly well that when the number of computing
> threads significantly exceeds the number of physical cores, the calculations
> slow down significantly.

Ok.

> 
> Maybe one gets 130% of the CPU. Especially with MPI jobs this becomes a
> problem: all processes are doing the same thing at the same time and fight
> for the same resources inside a CPU. Having 2 independent jobs on a CPU
> might be more promising.
> 
> Side note C: In most of the cases one MPI job doesn't know anything about
> the other MPI job on a node. If they have automatic core binding enabled,
> each starts to count at core 0 and binds to the same cores.
>  
> 
> OK, but the problem we are trying to solve is not sharing the CPU resources
> between several jobs with the same priority, but sharing the CPU resources
> between jobs with low and high priority.
> 
> 
> 
> "It might be necessary to disable the automatic core binding and let the 
> kernel scheduler do its best (unless you have a complete node for all tasks
> belonging to a job, which could of course spawn several nodes)."
>  
> ""
> We do not have an optimal inter-node connections so each job uses just CPUs
> on one node.

Ok.


> To be frank, I am definitely not an expert here, so I have no idea what
> "to disable the automatic core binding" means, and of course I have
> absolutely no idea how to do it.

This depends on the MPI library. For Open MPI it's a parameter to `mpiexec` 
"--bind-to none", for Intel MPI an environment variable "export I_MPI_PIN=off". 
For MPICH it's the opposite: there is no automatic core binding and one would 
have to enable it.
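
As a sketch of how this would look inside a job script (the binary ./a.out is 
only a placeholder):

# Open MPI: disable the automatic binding
mpiexec --bind-to none -np $NSLOTS ./a.out

# Intel MPI: disable the pinning via the environment
export I_MPI_PIN=off
mpiexec -np $NSLOTS ./a.out

With your MPICH2 1.4.1 nothing needs to be changed, as it does not bind 
automatically.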


> If you think that this could significantly help us to implement our idea of
> "normal priority" and "high priority" queues operating on the same node,
> please write more details.
> 
> Also, if there is another strategy to preferentially run some "high
> priority" (urgent) jobs on a node saturated with "low priority" jobs (which
> could eventually even be suspended until the high priority job is done), I
> would be grateful if you mentioned it as well.

The slotwise preemption does not work so well with parallel jobs. In case it is 
acceptable to halt the complete low priority queue on a node, one could suspend 
it by this means:

$ qconf -sq high.q
…
subordinate_list low.q=1

One could also try higher values than 1 (it's numeric, not a flag; a value of 1 
means that as soon as one slot is used in high.q on a particular node, the 
corresponding low.q is suspended on this node) to allow some overloading before 
it kicks in.
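
A sketch of how to set this (again assuming the queue names high.q and low.q; 
the threshold of 4 is just an example):

$ qconf -mattr queue subordinate_list low.q=4 high.q

With this, low.q on a node gets suspended only once 4 slots of high.q are in 
use on that node.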

-- Reuti


_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
