Re: [slurm-users] Help with preemtion based on licenses

2019-12-05 Thread Oytun Peksel
Hi all,

It took me a while, but I think I have achieved what I was trying to do. I had
to modify the cons_res plugin: I forked 19.05.2, modified the code, and
recompiled it. It works for me. You have to define licenses as generic
resources (gres) and set --gres-flags=disable-binding. The result is that Slurm
also releases all gres resources of jobs that are preempted by suspension.

I tried to modify cons_tres as well, but for some reason it does not work. I
will try to figure that out if I have time in the future.
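
The bookkeeping difference described above can be illustrated with a toy model.
This is only a sketch in Python, not code from Slurm or from the fork; the names
Job, TokenPool and release_on_suspend are made up for illustration, and resuming
suspended jobs is not modelled. With release_on_suspend=False a suspended job
keeps its tokens, so suspending it would not help a waiting job.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Job:
    name: str
    tokens: int            # license tokens requested as a gres count
    priority: int          # a higher value may preempt a lower one
    state: str = "PENDING"


@dataclass
class TokenPool:
    total: int
    release_on_suspend: bool = True   # False: suspended jobs keep their tokens
    jobs: List[Job] = field(default_factory=list)

    def in_use(self) -> int:
        # Which job states count as "holding tokens" depends on the policy.
        holding = {"RUNNING"} if self.release_on_suspend else {"RUNNING", "SUSPENDED"}
        return sum(j.tokens for j in self.jobs if j.state in holding)

    def free(self) -> int:
        return self.total - self.in_use()

    def _suspend_lower_priority(self, job: Job) -> None:
        victims = sorted(
            (j for j in self.jobs if j.state == "RUNNING" and j.priority < job.priority),
            key=lambda j: j.priority,
        )
        # Do not suspend anything if even all victims together would not be enough.
        if self.free() + sum(v.tokens for v in victims) < job.tokens:
            return
        for victim in victims:
            if job.tokens <= self.free():
                break
            victim.state = "SUSPENDED"

    def submit(self, job: Job) -> None:
        self.jobs.append(job)
        # Suspension only helps when it actually returns tokens to the pool.
        if job.tokens > self.free() and self.release_on_suspend:
            self._suspend_lower_priority(job)
        if job.tokens <= self.free():
            job.state = "RUNNING"
```

With 100 tokens, a running low-priority job holding 80 and a new high-priority
job asking for 60, the model suspends the low-priority job only when suspension
releases its tokens; otherwise the new job simply stays pending.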

You can find the modified version at: 

https://github.com/baytuni/slurm.git

Oytun Peksel
oytun.pek...@semcon.com
Mobile   +46739205917

When you communicate with us or otherwise interact with Semcon, we will process 
personal data that you provide to us or we collect about you, please read more 
in our Privacy Policy<https://semcon.com/data-privacy-policy/>.




Re: [slurm-users] Help with preemtion based on licenses

2019-11-06 Thread Oytun Peksel
Thank you all for your input.
Being a newbie at this, my impression from what you write is that for most
commercial software a suspend/release-license/reacquire mechanism is not
feasible.

(Answer to Mark)
What we are using here is an engineering package called Abaqus. In Abaqus you
can use token-based licenses, where the number of tokens depends on the number
of cores used (and some other things). It checks out the licenses from a flex
license server on submission, and if it gets suspended it releases them. Then
another instance can use the released tokens. If the suspended instance is
later resumed, it cannot proceed unless enough tokens are available.

I have had no real problems with this mechanism. It works pretty well as long
as I do not attempt to track licenses with Slurm.

I claim: since Slurm doesn't really integrate with license servers, and this is
pretty much left to the admin, it should not assume that no license is
releasable.

Another thing that puzzles me is:
AccountingStorageTRES=license/someSoftware

I would expect this to track the licenses defined either in slurm.conf or in
sacctmgr. But it does not.

When I do:

scontrol show job

it does not show any licenses in the output:

TRES=cpu=23,mem=23G,node=1,billing=23

And sacct --format=tres shows just the default trackable resources.


Oytun Peksel
oytun.pek...@semcon.com
Mobile   +46739205917





Re: [slurm-users] Help with preemtion based on licenses

2019-11-06 Thread Chris Samuel
On Wednesday, 6 November 2019 7:36:57 AM PST Oytun Peksel wrote:

> GPU part of the discussion is beyond my knowledge so I assumed it would be
> possible to release it.

If you simply suspend a job then the application does not exit, it will just 
get stopped and so will be holding various resources and file handles open - 
and that will include the GPU and the resources on it.

[...]
> After all software licenses might be the most expensive resource to utilize 
> where preemption might sometimes be inevitable.

I think the thing to remember with software licensing systems is that we are 
not the users or customers for that vendor, it's the ISV whose software you 
are using who is their customer.  So their aim is to try and ensure that the 
ISV sells as many licenses for their software as possible.

If you just suspend an application that has checked licenses out and then use
some other program to make the license server think it has died and release
them, then I suspect that when you unsuspend it, it will be very confused: it
will think it still has these licenses checked out, but the license server
won't. I suspect that would not lead to a happy program, user or license server.

So for both GPUs and licenses I suspect you really do want either cancel or 
requeue for this.

All the best,
Chris
-- 
  Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA






Re: [slurm-users] Help with preemtion based on licenses

2019-11-06 Thread Reuti


> On 06.11.2019 at 16:36, Oytun Peksel wrote:
> 
> Thanks for the information Mark.
> 
> I understand. GPU part of the discussion is beyond my knowledge so I assumed 
> it would be possible to release it.
> 
> But as for the licenses it is always possible to leave it to the system 
> admin. It is possible to take care of license release and reacquire using 
> scripts instead of assuming it is not possible. At least there should be an 
> easy configuration option to configure generic or trackable resources to be 
> releasable.

To add some further obstacles to Mark's notes:

In the interaction between any queuing system and the license tracking
mechanism inside each application, many things could certainly be improved. But
it starts with the constraint that, to my knowledge, no license daemon offers a
mechanism to "check and reserve/acquire a license if available" as an atomic
operation, so that the queuing system would know a license is available and
could schedule a job to use it. What might come close is to borrow a license
during a scheduling run and use it for an upcoming job. But even here the
limitations differ: some vendors allow a borrowed license to be released
prematurely, while others don't, and one has to wait for the specified
timeframe to elapse.

Then there is the application itself: when does it check for an available
license? Just as the application starts, periodically after a certain amount of
elapsed time, or for each iteration while it's running? Or will it hold the
license while it's running and only release it when it finishes? What will
happen if the application was suspended for some time? When it continues, it
might discover that there were X minutes without a license daemon response, and
so it might quit. If one is lucky, results achieved up to this point can still
be saved.

To make things worse: what type of license is used by a particular application?
One license per core/thread, per CPU, per job, per machine; or per machine per
user, or for each group on this machine?

One favorable case could be a job that consists of several instances of a
program, like a compiler when building a large application: the job could be
stopped exactly when no compiler instance is active, only the job script.

Sure, for some applications it might be possible to script this in some way. So
in my opinion the first goal for such a proposal would be to get this working
outside of any queuing system. Stop the application on a local machine with a
SIGSTOP and attempt to use the license with another instance of this
application, be it on the same or another machine. Often the state of the
license daemon can be checked, and stopping the application should make the
count of available licenses go up again in the license daemon's state output.
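
The experiment suggested above could be scripted roughly as follows. This is a
minimal Python sketch, not a tested recipe for any particular license manager:
count_free is a hypothetical callable that you would have to write yourself
(for example by parsing your license daemon's status output), and
settle_seconds is an assumed waiting time for the application to notice the
stop and hand licenses back.

```python
import os
import signal
import time
from typing import Callable, Tuple


def free_licenses_around_stop(
    pid: int,
    count_free: Callable[[], int],
    settle_seconds: float = 0.0,
) -> Tuple[int, int]:
    """Stop process `pid` and report free licenses before and while it is stopped.

    `count_free` is supplied by the caller; this sketch makes no assumption
    about how the license daemon is queried.
    """
    before = count_free()
    os.kill(pid, signal.SIGSTOP)      # same signal a suspend would deliver
    try:
        time.sleep(settle_seconds)    # give the license daemon time to react
        during = count_free()
    finally:
        os.kill(pid, signal.SIGCONT)  # always let the application continue
    return before, during
```

If the second number is larger than the first, the application released
licenses while stopped; whether it survives being continued afterwards still
has to be checked separately.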

-- Reuti






Re: [slurm-users] Help with preemtion based on licenses

2019-11-06 Thread Oytun Peksel
Thanks for the information, Mark.

I understand. The GPU part of the discussion is beyond my knowledge, so I
assumed it would be possible to release it.

But as for licenses, it is always possible to leave this to the system admin.
License release and reacquisition can be handled with scripts, rather than
assuming it is not possible. At the very least there should be an easy
configuration option to mark generic or trackable resources as releasable.

After all, software licenses might be the most expensive resource to utilize,
where preemption might sometimes be inevitable.

For now I have no better plan than to dig into the source code to find an easy
way to change this behavior.

Oytun Peksel
oytun.pek...@semcon.com
Mobile   +46739205917





Re: [slurm-users] Help with preemtion based on licenses

2019-11-06 Thread Mark Hahn

> This does not make sense to me. If gpu is my generic resource why would it
> not release the gpu resources if a job is suspended?


how would that be implemented?  how would the scheduler reach into 
the application and cause the license to be released and reacquired?
after all, the license server is otherwise oblivious to whether 
the job it has granted a license to has been suspended or resumed.

this applies to other gres as well - for instance GPUs, since there's
no mechanism to free up GPU resources allocated to a suspended process.

*that* is the problem - merely adding and subtracting is not.

regards, mark hahn.



Re: [slurm-users] Help with preemtion based on licenses

2019-11-06 Thread Oytun Peksel
OK, I found out that it is possible to preempt on licenses if you define the
license as a generic resource, such as:
GresTypes=license
NodeName=SomeNode Gres=license:someSoftware:100

and submit the jobs with --gres=license:someSoftware:20

But this does not work with PreemptMode=Suspend. It will requeue or cancel the
preempted job, but it won't suspend it. There is an interesting paragraph on
the Gres Scheduling page:

"Jobs will be allocated specific generic resources as needed to satisfy the
request. If the job is suspended, those resources do not become available for
use by other jobs."

This does not make sense to me. If GPU is my generic resource, why would it not
release the GPU resources when a job is suspended?



Oytun Peksel
oytun.pek...@semcon.com
Mobile   +46739205917






Re: [slurm-users] Help with preemtion based on licenses

2019-11-06 Thread Oytun Peksel
Yes, of course, no one would expect the resource manager to make job
applications release their licenses. But sometimes licenses are released
automatically, or the release can be done by scripts.

The desired behavior when using '--licenses someSoftware@someserver:x':
if there are not enough licenses, a running job should be
suspended/cancelled/requeued/checkpointed, and Slurm should assume that its
licenses are released.

In other words, just treat the license resource like any other resource such as
CPU and memory. Nothing else. Today a lack of licenses simply leaves the new
job pending, which bypasses the preemption mechanism.

The above behavior is observed with the select/cons_tres plugin and the license
defined as a TRES: AccountingStorageTRES=license/someSoftware



Oytun Peksel
oytun.pek...@semcon.com
Mobile   +46739205917





Re: [slurm-users] Help with preemtion based on licenses

2019-11-05 Thread Mark Hahn

> The limiting factor in our cluster is licenses and I want to have high and
> low priority jobs where submitting a high priority job will preempt (suspend)
> a low priority job if all the licenses are already in use.


But what are you expecting to happen?  that Slurm will somehow release
the license used by the suspended job, and then somehow reacquire the
license when it is resumed?  I've never heard of that kind of thing
even being offered by license managers, let alone that level of intimate
integration between schedulers and license managers.

At most, a scheduler may provide a callout to query the number of 
free licenses, and consider a job eligible to start if its declared 
usage fits (gres in Slurm terms, I think).
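
As a rough illustration of the "callout" idea above, the following Python
sketch decides which pending jobs fit into the currently free licenses. It is
not how Slurm implements anything; query_free stands in for whatever
site-specific callout reports the free count, and startable_jobs is a made-up
name. Note that this simple loop lets a smaller job start ahead of a larger one
that does not fit.

```python
from typing import Callable, List, Tuple


def startable_jobs(
    pending: List[Tuple[str, int]],
    query_free: Callable[[], int],
) -> List[str]:
    """Return the names of pending jobs whose declared license usage fits.

    `pending` holds (job name, declared licenses) pairs in queue order.
    `query_free` is a caller-supplied callout returning the free license count.
    """
    free = query_free()
    started = []
    for name, declared in pending:
        if declared <= free:
            started.append(name)
            free -= declared   # reserve the licenses for this job
    return started
```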


regards, mark hahn
--
operator may differ from spokesperson.  h...@mcmaster.ca



Re: [slurm-users] Help with preemtion based on licenses

2019-11-05 Thread Oytun Peksel
Hi,
Apparently the original email got lost here, so here it is. If anyone has any
idea how to do this, please comment.


On Thursday, June 20, 2019 at 3:20:41 AM UTC+2, Eric Wittmayer wrote:
Hi Slurm experts,
  I'm new to SLURM and could really use some help getting preemption working.

The limiting factor in our cluster is licenses and I want to have high and low 
priority jobs where submitting a high priority job will preempt (suspend) a low 
priority job if all the licenses are already in use.
Is this possible with SLURM currently?
If so can someone provide example configuration settings?

If it isn't currently possible, could this be a feature included in the current 
cons_tres work that is going on?
I've read through a bunch of the documentation and tried to do my due diligence 
but haven't found a definitive answer.

Thanks,
Eric W





Re: [slurm-users] Help with preemtion based on licenses

2019-11-04 Thread Oytun Peksel
Hi Eric,

Have you been able to find a solution to your problem? I am facing the same
issue right now.

BR
Oytun Peksel






[slurm-users] Help with preemtion based on licenses

2019-06-19 Thread Eric Wittmayer
Hi Slurm experts,
  I'm new to SLURM and could really use some help getting preemption working.

The limiting factor in our cluster is licenses and I want to have high and low
priority jobs where submitting a high priority job will preempt (suspend) a low
priority job if all the licenses are already in use.
Is this possible with SLURM currently?
If so, can someone provide example configuration settings?

If it isn't currently possible, could this be a feature included in the current
cons_tres work that is going on?
I've read through a bunch of the documentation and tried to do my due diligence
but haven't found a definitive answer.

Thanks,
Eric W