[slurm-users] BB plugin changes

2020-01-23 Thread subodhp
Dear all,

I want to make changes in Slurm's burst buffer (BB) generic plugin skeleton; please
tell me where I need to make the changes. Do I need to make the changes in the
burst_buffer_generic.c file, or in the example files which Slurm has provided?
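
(For context, a minimal sketch of how the generic plugin skeleton gets selected in
slurm.conf — the plugin name here is assumed to be the stock burst_buffer/generic,
and slurmctld would need a restart to load it:)

# slurm.conf: select the generic burst buffer plugin skeleton
# (assumed plugin name; restart slurmctld after changing this)
BurstBufferType=burst_buffer/generic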


-

Subodh Pandey






Re: [slurm-users] Can't get node out of drain state

2020-01-23 Thread Chris Samuel

On 23/1/20 7:09 pm, Dean Schulze wrote:

Pretty strange that having a Gres= property on a node that doesn't have 
a gpu would get it stuck in the drain state.


Slurm verifies that nodes have the capabilities you say they have, so
that if a node boots with less RAM than it should have, with a socket
hidden, or with a GPU that has failed across a reboot, you'll know about
it and won't blindly send jobs to it only to have them fail because the
node no longer meets their requirements.
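
(A quick sketch of the usual way to see which check tripped and to clear the drain
once the configuration and the hardware agree again; node01 is a placeholder name:)

# list drained nodes together with the reason Slurm recorded
sinfo -R
# inspect a single node's recorded reason in more detail
scontrol show node node01 | grep -i reason
# once slurm.conf/gres.conf match the real hardware, return the node to service
scontrol update NodeName=node01 State=RESUME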


All the best,
Chris
--
 Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA



Re: [slurm-users] Can't get node out of drain state

2020-01-23 Thread Dean Schulze
The problem turned out to be that I had Gres=gpu:gp100:1 on the NodeName
line for that node and it didn't have a gpu or a gres.conf.  Once I moved
that to the correct NodeName line in slurm.conf that node came out of the
drain state and became usable again.

Pretty strange that having a Gres= property on a node that doesn't have a
gpu would get it stuck in the drain state.
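
(Roughly what the change looked like — a sketch with placeholder node names and
sizes, not the real slurm.conf:)

# before: the GPU was declared on a node that has no GPU and no gres.conf
NodeName=node01 Gres=gpu:gp100:1 CPUs=16 RealMemory=64000 State=UNKNOWN
NodeName=node02                  CPUs=16 RealMemory=64000 State=UNKNOWN

# after: the Gres= entry moved to the node that really has the gp100
NodeName=node01                  CPUs=16 RealMemory=64000 State=UNKNOWN
NodeName=node02 Gres=gpu:gp100:1 CPUs=16 RealMemory=64000 State=UNKNOWN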



On Thu, Jan 23, 2020 at 2:34 PM Alex Chekholko  wrote:

> Hey Dean,
>
> Does 'scontrol show node <nodename>' give you any more detail? Also look at 'sinfo -R'.
>
> Make sure the relevant network ports are open:
>
> https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-firewall-for-slurm-daemons
>
> Also check that slurmd daemons on the compute nodes can talk to each other
> (not just to the master). e.g. bottom of
> https://slurm.schedmd.com/big_sys.html
>
> Regards,
> Alex
>
> On Thu, Jan 23, 2020 at 1:05 PM Dean Schulze 
> wrote:
>
>> I've tried the normal things with scontrol (
>> https://blog.redbranch.net/2015/12/26/resetting-drained-slurm-node/),
>> but I have a node that will not come out of the drain state.
>>
>> I've also done a hard reboot and tried again.  Are there any other
>> remedies?
>>
>> Thanks.
>>
>


[slurm-users] setting default resource limits with sacctmgr dump/load ?

2020-01-23 Thread Grigory Shamov
Hi All,

I have tried to use a script that manages Slurm accounts and users with
sacctmgr dump flat files. I am using Slurm 19.05.4 on CentOS 7.6.

Our accounting scheme is rather flat: there is one level of accounting
groups, and users belong to those groups. It looks like with sacctmgr
dump / load the user 'root' is created automatically, and its accounting
group 'root' is created implicitly. The dump does not explicitly list the
account 'root'; the top of the dump looks like this:

Cluster - 'my cluster':FairShare=1:QOS='normal':GrpTRES=cpu=400
Parent - 'root'
User - 'root':AdminLevel='Administrator':DefaultAccount='root':FairShare=1
Account - 'aaa-qqq':FairShare=1:GrpTRES=cpu=666
Account - 'bbb-rrr':FairShare=1
(more accounts to follow)

(then users to follow)

It mostly works, but I ran into an issue when I tried to specify default
limits. I used the first line of the dump, "Cluster - 'my
cluster':FairShare=1:QOS='normal':GrpTRES=cpu=400".

The documentation says "Anything included on this line will be the
defaults for all associations on this cluster. These options are as
follows…". I read "for all" as "for each and every association that does
not have anything else explicitly specified". On that reading, the
account 'bbb-rrr' should get cpu=400 by default, while 'aaa-qqq' gets
cpu=666 because it is explicitly set.

However, I have found that at least the GrpTRES limit gets set on the
implicit 'root' association as well. And because the 'root' association
is the parent of each and every accounting group, the supposedly default
per-association GrpTRES=cpu limit becomes a total limit for the entire
cluster, so that the users of 'aaa-qqq' and 'bbb-rrr' together cannot
exceed cpu=400. That is far more limiting than I'd expect.

Is it a bug or a feature? Is there a way to distinguish cluster-wide
total limits from default per-accounting-group limits with the sacctmgr
dump flat-file syntax?
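
For reference, this is roughly how I looked at where the limit landed after a
load (the exact format string is just my sketch):

# walk the loaded association tree and show each association's GrpTRES;
# the cpu=400 appears on the implicit 'root' association, i.e. as a
# cluster-wide total rather than a per-account default
sacctmgr show association tree format=cluster,account,user,grptres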

Thanks!  

--
Grigory Shamov
University of Manitoba




[slurm-users] Useful script: estimating how long until the next blocked job starts

2020-01-23 Thread Renfro, Michael
Hey, folks.

Some of my users submit job after job with no recognition of our 1000 CPU-day 
TRES limit, and thus their later jobs get blocked with the reason 
AssocGrpCPURunMinutesLimit.

I’ve written up a script [1], using Ole Holm Nielsen’s showuserlimits script [2],
that identifies a user’s smallest-resource blocked job and predicts when
that job might run at current resource-consumption rates. Non-root users
can query their own blocked jobs, and root can query anyone’s.

Example runs:

=

# guessblockedjobstart someusername
Next blocked job to run should be 551294, with 188160 CPU-minute(s) requested
- Limit for running and queued jobs is 1440000 CPU-minutes
- Running and pending jobs have 1364937 CPU-minutes remaining
- Leaving 75063 CPU-minutes available currently
- Smallest blocked job, 551294, requested 188160 CPU-minutes
  (14 CPU(s) on 1 node(s) for 13440 minute(s))
- Currently-running jobs release 7560 CPU-minutes per hour of elapsed time
Estimated time for job 551294 to enter queue is Fri Jan 24 07:14 CST 2020,
if resources are available

# guessblockedjobstart anotherusername
User anotherusername has no blocked jobs

=
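
Roughly, the estimate appears to be the remaining shortfall divided by the
release rate; a back-of-the-envelope check of the numbers in the first example
(my arithmetic, not the script's own output):

available  = 1440000 - 1364937 =  75063 CPU-minutes free under the limit
shortfall  =  188160 -   75063 = 113097 CPU-minutes still needed
wait       =  113097 /    7560 ≈ 15 hours at the current release rate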

Let me know if there are any questions or problems found. Thanks.

[1] https://gist.github.com/mikerenfro/4d21fee5cd6c82b16e30c46fb2bf3226
[2] https://github.com/OleHolmNielsen/Slurm_tools/tree/master/showuserlimits

-- 
Mike Renfro, PhD / HPC Systems Administrator, Information Technology Services
931 372-3601 / Tennessee Tech University



Re: [slurm-users] Can't get node out of drain state

2020-01-23 Thread Alex Chekholko
Hey Dean,

Does 'scontrol show node <nodename>' give you any more detail? Also look at 'sinfo -R'.

Make sure the relevant network ports are open:

https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-firewall-for-slurm-daemons

Also check that slurmd daemons on the compute nodes can talk to each other
(not just to the master). e.g. bottom of
https://slurm.schedmd.com/big_sys.html

Regards,
Alex

On Thu, Jan 23, 2020 at 1:05 PM Dean Schulze 
wrote:

> I've tried the normal things with scontrol (
> https://blog.redbranch.net/2015/12/26/resetting-drained-slurm-node/), but
> I have a node that will not come out of the drain state.
>
> I've also done a hard reboot and tried again.  Are there any other
> remedies?
>
> Thanks.
>


Re: [slurm-users] Issues with HA config and AllocNodes

2020-01-23 Thread Dave Sizer
Bumping this thread: the issue persists even after upgrading to 19.05.4.
Does anyone with an HA setup have any insight?
Does anyone have an HA setup that could provide some insight?

From: Dave Sizer 
Date: Thursday, December 19, 2019 at 9:44 AM
To: Slurm User Community List , Brian Andrus 

Subject: Re: [slurm-users] Issues with HA config and AllocNodes

So I’ve found some more info on this. It seems like the primary controller is
writing “none” as the AllocNodes value in the partition state file when it
shuts down. It does this even with the backup out of the picture, and it still
happens even when I switch the primary and backup controller nodes in the
config.

When the primary starts up, it ignores these “none” values and sets
AllocNodes=ALL on all partitions (what we want), but when the backup starts up,
it “honors” the “none” values and all partitions end up with AllocNodes=none.
Again, the slurm.conf on both nodes is the same, and this happens even when
swapping the primary/backup roles of the nodes. I am digging through the source
to try and find some hints.
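
(A quick way to see what AllocNodes each partition came up with after a
takeover — just a sketch of what I run on whichever controller is active:)

# show the partitions and the AllocNodes value each one loaded
scontrol show partition | grep -E 'PartitionName|AllocNodes'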

Does anyone have any ideas?

From: slurm-users  on behalf of Dave 
Sizer 
Reply-To: Slurm User Community List 
Date: Tuesday, December 17, 2019 at 1:05 PM
To: Brian Andrus , "slurm-us...@schedmd.com" 

Subject: Re: [slurm-users] Issues with HA config and AllocNodes


Thanks for the response.

I have confirmed that the slurm.conf files are the same and that StateSaveDir 
is working, we see logs like the following on the backup controller:
Recovered state of 9 partitions
Recovered JobId=124 Assoc=6
Recovered JobId=125 Assoc=6
Recovered JobId=126 Assoc=6
Recovered JobId=127 Assoc=6
Recovered JobId=128 Assoc=6

I do see the following error when the backup takes control, but I’m not sure if
it is related, since the controller continues to start up fine:

error: _shutdown_bu_thread:send/recv slurm-ctrl-02: Connection refused

We also see a lot of these messages on the backup while it is in standby mode,
but from what I’ve researched these may be unrelated as well:

error: Invalid RPC received 1002 while in standby mode

and similar messages with other RPC codes. We no longer see these once the 
backup controller has taken control.

I do agree with the idea that there is some issue with the saving/loading of 
partition state during takeover, I’m just a bit stumped on why it is happening 
and what to do to stop partitions being loaded with the AllocNodes=none config.



From: Brian Andrus 
Date: Tuesday, December 17, 2019 at 12:30 PM
To: Dave Sizer 
Subject: Re: [slurm-users] Issues with HA config and AllocNodes



Double-check that your slurm.conf files are the same and that both systems are
successfully using your state save directory.

Brian Andrus
On 12/17/2019 9:23 AM, Dave Sizer wrote:
Hello friends,

We are running slurm 19.05.1-2 with an HA setup consisting of one primary and 
one backup controller.  However, we are observing that when the backup takes 
over, for some reason AllocNodes is getting set to “none” on all of our 
partitions.  We can remedy this by manually setting AllocNodes=ALL on each 
partition, however this is not feasible in production, since any jobs launched 
just before the takeover still fail to submit (before the partitions can be 
manually updated).  For reference, the backup controller has the correct config 
if it is restarted AFTER the primary is taken down, so this issue seems 
isolated to the takeover flow.
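
(For reference, the manual remedy is just scontrol updates like the following;
'batch' is a placeholder partition name and the loop is only a sketch:)

# reset a single partition after the backup has taken over
scontrol update PartitionName=batch AllocNodes=ALL

# or sweep every partition in one go
for p in $(scontrol -o show partition | sed 's/^PartitionName=\([^ ]*\).*/\1/'); do
    scontrol update PartitionName="$p" AllocNodes=ALL
done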

Has anyone seen this issue before?  Or any hints for how I can debug this 
problem?

Thanks in advance!

Dave




[slurm-users] Can't get node out of drain state

2020-01-23 Thread Dean Schulze
I've tried the normal things with scontrol (
https://blog.redbranch.net/2015/12/26/resetting-drained-slurm-node/), but I
have a node that will not come out of the drain state.

I've also done a hard reboot and tried again.  Are there any other remedies?

Thanks.


Re: [slurm-users] Implementation of generic plugin

2020-01-23 Thread subodhp
Dear all,

I wish to know where I need to make the changes. Running

> srun --bb="capacity=1G access=striped type=scratch" a.out

executes, but then running the command below

> scontrol show burst

doesn't show anything.
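
(One thing worth confirming first, as a sketch: scontrol will only report burst
buffer state if slurmctld has actually loaded a burst buffer plugin.)

# check which burst buffer plugin, if any, slurmctld was started with
scontrol show config | grep -i burstbuffer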

Regards,
Subodh

On January 22, 2020 at 3:53 PM subodhp  wrote:

>  Dear all,
> 
>  I wish to have stage-in and stage-out functionality for my burst buffer; for
> this I have used the burst buffer generic plugin provided by Slurm. But I am a
> bit confused about how to add this plugin to my Slurm configuration file and
> call it at job-scheduling time.
> 
> 
>  Regards,
>  Subodh Pandey
> 


