http://portal.rc.fas.harvard.edu/slurmmon/
That basically gives a graph of our sdiag statistics.
We have about 35 partitions in our environment, due to various owned
hardware as well as general-purpose queues. We also run both parallel
and embarrassingly parallel workloads, so we have people who want to
submit 10,000 jobs in one go and people who want to run >2048-core MPI
jobs. We have to support them all.
Previously we were running LSF, which didn't really have an issue with
a plethora of queues, and each queue was scheduled more or less
independently. Our users are used to that, so it has created a bit of
tension since we switched to SLURM.
We are trying to get to the point where we can retire our owned
partitions, but that won't happen anytime soon. We love all the options
and capabilities of SLURM; our big issue is similar to yours. We do
both HTC and HPC, and we have thousands of jobs in the queue. I think
we would take the hit of spinning through all the partitions in order
to make sure every partition is treated properly.
-Paul Edmon-
On 02/10/2014 11:12 AM, Alejandro Lucero Palau wrote:
Hi Paul,
What's the max cycle latency for the main scheduling cycle on your
system? You can get it with the sdiag command.
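If it helps, here is a rough Python sketch of how one might pull that
number out programmatically. It assumes sdiag is on the PATH and prints
its usual "Main schedule statistics (microseconds):" section with a
"Max cycle:" line; field names may differ between SLURM versions, so
treat it as a starting point:

    # Rough sketch (not part of SLURM): scrape the main scheduler's max
    # cycle time from sdiag output. Assumes the usual section layout;
    # adjust if your SLURM version formats it differently.
    import subprocess

    def main_sched_max_cycle_us():
        out = subprocess.run(["sdiag"], capture_output=True, text=True,
                             check=True).stdout
        in_main = False
        for line in out.splitlines():
            if line.startswith("Main schedule statistics"):
                in_main = True
            elif line.startswith("Backfilling stats"):
                break
            elif in_main and "Max cycle:" in line:
                return int(line.split(":", 1)[1])
        return None

    print(main_sched_max_cycle_us())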
I've been working on a different mechanism for going through the job
queue. It would be helpful for sites with a really high number of
queued jobs, so it probably makes more sense for HTC than HPC. It also
makes sense for sites using several partitions where users send jobs to
more than one partition. Instead of one general queue, this solution
would create one queue per partition, each holding only a configurable
number of the highest-priority jobs. The scheduler would then take the
highest-priority job from among the tops of those queues.
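To make the idea concrete, here is a toy Python sketch of the mechanism
I have in mind. It is only an illustration of the data flow, not actual
SLURM code; the job tuples and partition names are made up:

    # Toy illustration of per-partition queues: each partition keeps only
    # its `depth` highest-priority jobs, and the scheduler always picks
    # the highest-priority job among the heads of all partition queues.
    from collections import defaultdict

    def build_partition_queues(jobs, depth):
        # jobs: iterable of (job_id, partition, priority)
        queues = defaultdict(list)
        for job_id, partition, priority in jobs:
            queues[partition].append((priority, job_id))
        # keep only the `depth` top-priority jobs per partition
        return {p: sorted(q, reverse=True)[:depth] for p, q in queues.items()}

    def next_job(queues):
        heads = [(q[0], p) for p, q in queues.items() if q]
        if not heads:
            return None
        (priority, job_id), partition = max(heads)
        queues[partition].pop(0)
        return job_id, partition, priority

    jobs = [(101, "general", 900), (102, "general", 500),
            (103, "owned_a", 700), (104, "owned_a", 950)]
    queues = build_partition_queues(jobs, depth=2)
    print(next_job(queues))   # -> (104, 'owned_a', 950)

A job submitted to several partitions would simply appear near the top
of more than one per-partition queue, which is cheap compared with
walking the whole global queue.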
Right now the scheduler is not efficient for HTC sites with tens of
thousands or even hundreds of thousands of queued jobs. When users rely
heavily on dependencies and submit jobs to more than one partition,
there is a lot of work for the scheduler to do. Indeed, if you have
some special partition which is seldom used, the scheduler ends up
going through the whole queue, even if you try to minimize the problem
with scheduler parameters. Even if this is not costly per job, it can
lead to high latencies when there are tens of thousands of queued jobs.
We see it from time to time, and then SLURM can be unresponsive while
it is trying to schedule jobs.
SLURM was designed for HPC centers, where such a high number of queued
jobs is unlikely. But since SLURM is being used in other kinds of
centers, like genomics sites, it would be really useful to have another
way of working with queued jobs. Maybe this issue should be discussed
at the Slurm Users Meeting next September in Lugano.
On 02/10/2014 03:49 PM, Paul Edmon wrote:
How difficult would it be to put a switch into SLURM so that, instead
of considering the global priority chain, it would consider each
partition wholly independently, with respect to both the backfill and
main scheduling loops? In our environment we have many partitions. We
also have people submitting thousands of jobs to those partitions, and
the partitions are at different priorities. Since SLURM (even in
backfill) runs down the priority chain, higher-priority queues can
impact scheduling in lower-priority queues even if those queues do not
overlap in terms of hardware. It would be better in our case if SLURM
considered each partition as a wholly independent scheduling run and
did all of them, in both the backfill and main loops.
I know there is the bf_max_job_part option for the backfill loop, but
it would be better to just have each partition be independent, as that
way you don't get any cross-talk. Can this be done? It would be
incredibly helpful for our environment.
-Paul Edmon-