[slurm-dev] Re: Job submit plugin to improve backfill

2013-06-28 Thread Ryan Cox

An alternative that we use is to choose very low defaults for people:
PartitionName=Default DefaultTime=30:00 #plus other options 
DefMemPerCPU=512

The disadvantage to this approach is that it doesn't give an obvious 
error message at submit time.  However, it's not hard to figure out what 
happened when they hit the time limit or the error output says they went 
over their memory limit.


Ryan

On 06/28/2013 08:29 AM, Daniel M. Weeks wrote:

At CCNI, we use backfill scheduling on all our systems. However, we have
found that users typically do not specify a time limit for their job so
the scheduler assumes the maximum from QoS/user limits/partition
limits/etc. This really hurts backfilling since the scheduler remains
ignorant of short jobs.
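
To make the effect concrete, here is a toy Python model of the backfill decision. It is only a sketch: it ignores node counts, priorities and everything else a real backfill scheduler considers, and the function name and parameters are made up for illustration rather than taken from Slurm.

```python
from typing import Optional


def can_backfill(requested_minutes: Optional[int],
                 max_minutes: int,
                 window_minutes: int) -> bool:
    """Return True if the job is expected to finish inside the idle window.

    Hypothetical model: None means the user gave no time limit, so the
    scheduler has to plan for the maximum allowed run time.
    """
    if requested_minutes is None:
        assumed = max_minutes
    else:
        assumed = min(requested_minutes, max_minutes)
    return assumed <= window_minutes


# A 2-hour gap before the next reserved job, with a 24-hour maximum.
print(can_backfill(None, 24 * 60, 120))  # no limit given: assumed to need 24 hours
print(can_backfill(30, 24 * 60, 120))    # a 30-minute request fits the gap
```

A job that would really finish in 30 minutes is skipped in the first call only because the scheduler was never told how short it is.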

Attached is a small patch I wrote containing a job submit plugin and a
new error message. The plugin rejects a job submission when it is
missing a time limit and will provide the user with a clear and distinct
error.

I've just re-tested and the patch applies and builds cleanly on the
slurm-2.5, slurm-2.6, and master branches.

Please let me know if you find this useful, run across problems, or have
suggestions/improvements. Thanks.



--
Ryan Cox
Operations Director
Fulton Supercomputing Lab
Brigham Young University



[slurm-dev] Re: Job submit plugin to improve backfill

2013-06-28 Thread Daniel M. Weeks

Hi Ryan,

Thanks. We had considered this approach but went in a different
direction for a couple of reasons:

We have a good number of users who script job submissions and may blast
out up to several hundred jobs. A user might not realize their jobs are
getting cut off until many of them have run, which wastes resources.
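
For scripted submissions like these, one way to avoid both the rejection and the silent cutoff is to pass an explicit limit on every sbatch call. The Python sketch below only builds the command lines and does not run them; the helper names and the run_N.sh script names are hypothetical, and the days-hours:minutes:seconds form is one of the formats sbatch accepts for --time.

```python
from typing import List


def format_time_limit(minutes: int) -> str:
    """Format a number of minutes as D-HH:MM:SS for sbatch --time."""
    if minutes <= 0:
        raise ValueError("time limit must be a positive number of minutes")
    days, remainder = divmod(minutes, 24 * 60)
    hours, mins = divmod(remainder, 60)
    return f"{days}-{hours:02d}:{mins:02d}:00"


def sbatch_command(script: str, minutes: int) -> List[str]:
    """Build (but do not execute) an sbatch command with an explicit limit."""
    return ["sbatch", f"--time={format_time_limit(minutes)}", script]


# Hypothetical batch of three jobs, each limited to 45 minutes.
for index in range(3):
    print(" ".join(sbatch_command(f"run_{index}.sh", 45)))
```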

Also, we have many users that are relatively new to HPC/Slurm and work
from guides or tutorials that don't explain things very well. The
distinct error message at job submission rather than a related error
after a "failure" (from the user's perspective) keeps a lot of support
emails out of my inbox. Of course I'd like them to learn to use Slurm
better but they usually want to focus on their own research first.

- Dan



-- 
Daniel M. Weeks
Systems Programmer
Computational Center for Nanotechnology Innovations
Rensselaer Polytechnic Institute
Troy, NY 12180
518-276-4458


[slurm-dev] Re: Job submit plugin to improve backfill

2013-06-28 Thread Nikita Burtsev

Hello,

Why not enable this functionality by setting DefaultTime=0 in slurm.conf? That
would let us set it on a per-partition basis rather than through a job submit
plugin (unless I'm missing something obvious here).

Also, setting DefaultTime=0 currently (on 2.5.6 at least) gives the following
message:
# srun -N2 hostname
srun: error: Unable to create job step: Job/step already completing or completed


I suppose that is the way it should behave, but it seems rather illogical to be
able to set this at all.

-- 
Nikita Burtsev






[slurm-dev] Re: Job submit plugin to improve backfill

2013-06-28 Thread Eckert, Phil

Another route is to set DefaultTime for a partition to 0. The small patch
attached to this email then rejects a job when it has no time limit specified
and the partition's default_time is 0. I also modified the
ESLURM_INVALID_TIME_LIMIT message to note that the error might be caused by a
missing time limit.
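
The decision this approach implies can be sketched in a few lines of Python. This is only an illustrative model, not the attached patch or Slurm's own code: the function name, the exception, and the use of None for "no limit requested" are assumptions made for the example.

```python
from typing import Optional


class MissingTimeLimit(Exception):
    """Raised when a job has no time limit and the partition has no default."""


def resolve_time_limit(requested_minutes: Optional[int],
                       partition_default_minutes: int) -> int:
    """Return the time limit (in minutes) the job would be given.

    Hypothetical model: None means the user gave no time limit, and a
    partition default of 0 means "no default, the user must choose".
    """
    if requested_minutes is not None:
        return requested_minutes
    if partition_default_minutes == 0:
        raise MissingTimeLimit(
            "no time limit requested and the partition default is 0")
    return partition_default_minutes


print(resolve_time_limit(90, 0))     # an explicit request is used as-is
print(resolve_time_limit(None, 30))  # falls back to the partition default
try:
    resolve_time_limit(None, 0)
except MissingTimeLimit as err:
    print("rejected:", err)
```

Keeping the rule per partition means a site could still offer a low default on some partitions while forcing an explicit choice on others.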

Phil Eckert
LLNL





Attachment: spatch


[slurm-dev] Re: Job submit plugin to improve backfill

2013-06-29 Thread Bjørn-Helge Mevik

Many people have listed alternatives, so why not? :)

Another alternative, which we use, is the lua job submit plugin.
Then there is no need for a new plugin, and changes are easy to make:
just edit the lua script and restart slurm.

Here is our lua code for enforcing a time limit:

--  If walltime is missing: fail
--  (0xfffe is slurm's NO_VAL)
if job_desc.time_limit == 0xfffe then
   log_info("slurm_job_submit: job from uid %d with missing time: Denying.",
            job_desc.user_id)
   return 2051 -- Signal ESLURM_INVALID_TIME_LIMIT
end

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo