Re: [slurm-users] How to request for the allocation of scratch .

2020-04-20 Thread navin srivastava
I attempted it again and it succeeded.
Thanks for your help.

On Thu, Apr 16, 2020 at 9:45 PM Ellestad, Erik wrote:

> That all seems fine to me.
>
> I would check your Slurm logs to determine why Slurm put your
> nodes into drain state.
>
> Erik
>
> ---
> Erik Ellestad
> Wynton Cluster SysAdmin
> UCSF
> --
> *From:* slurm-users  on behalf of
> navin srivastava 
> *Sent:* Wednesday, April 15, 2020 10:37 PM
> *To:* Slurm User Community List 
> *Subject:* Re: [slurm-users] How to request for the allocation of scratch
> .
>
> Thanks Erik.
>
> Last night I made the changes.
>
> I defined the following in slurm.conf on all the nodes as well as on the
> slurm server.
>
> TmpFS=/lscratch
>
>  NodeName=node[01-10]  CPUs=44  RealMemory=257380 Sockets=2
> CoresPerSocket=22 ThreadsPerCore=1 TmpDisk=160 State=UNKNOWN
> Feature=P4000 Gres=gpu:2
>
> These nodes have 1.6TB of local scratch. I did an scontrol reconfig on all
> the nodes, but after some time we saw all nodes go into drain state, so I
> reverted back to the old configuration.
>
> On all nodes jobs were running and the local scratch is 20-25% in use.
> We already have a cleanup script in crontab which cleans the scratch
> space regularly.
>
> Is anything wrong here?
>
>
> Regards
> Navin.
>
> On Thu, Apr 16, 2020 at 12:26 AM Ellestad, Erik wrote:
>
> The default value for TmpDisk is 0, so if you want local scratch
> available on a node, the amount of TmpDisk space must be defined in the
> node configuration in slurm.conf.
>
> example:
>
> NodeName=TestNode01 CPUs=8 Boards=1 SocketsPerBoard=2 CoresPerSocket=4
> ThreadsPerCore=1 RealMemory=24099 TmpDisk=15
>
> The configuration value for the node definition is in MB.
>
> https://slurm.schedmd.com/slurm.conf.html
> 
>
> *TmpDisk* Total size of temporary disk storage in *TmpFS* in megabytes
> (e.g. "16384"). *TmpFS* (for "Temporary File System") identifies the
> location which jobs should use for temporary storage. Note this does not
> indicate the amount of free space available to the user on the node, only
> the total file system size. The system administration should ensure this
> file system is purged as needed so that user jobs have access to most of
> this space. The Prolog and/or Epilog programs (specified in the
> configuration file) might be used to ensure the file system is kept clean.
> The default value is 0.
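>
> As an illustration, a minimal Epilog sketch (assuming jobs write to a
> per-job directory under the TmpFS location; the /lscratch path and per-job
> layout here are hypothetical site conventions, not Slurm defaults):
>
>     #!/bin/bash
>     # Epilog: runs on each node after a job completes.
>     # Remove that job's scratch directory so TmpFS space is reclaimed.
>     rm -rf "/lscratch/${SLURM_JOB_ID}"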
>
> When requesting --tmp with srun or sbatch, it can be done in various size
> formats:
>
> *--tmp*=<*size[units]*> Specify a minimum amount of temporary disk space
> per node. Default units are megabytes unless the SchedulerParameters
> configuration parameter includes the "default_gbytes" option for gigabytes.
> Different units can be specified using the suffix [K|M|G|T].
> https://slurm.schedmd.com/sbatch.html
> 
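>
> For example (illustrative values):
>
>     sbatch --tmp=10G job.sh    # 10 GB of local scratch per node
>     srun --tmp=500 hostname    # 500 MB in the default units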
>
>
>
> ---
> Erik Ellestad
> Wynton Cluster SysAdmin
> UCSF
> --
> *From:* slurm-users  on behalf of
> navin srivastava 
> *Sent:* Tuesday, April 14, 2020 11:19 PM
> *To:* Slurm User Community List 
> *Subject:* Re: [slurm-users] How to request for the allocation of scratch
> .
>
> Thank you Erik.
>
> Is it not mandatory to define the local scratch on all the compute nodes?
> Is defining it only on the slurm server enough?
> Also, should TmpDisk be defined in MB, or can it be defined in GB as well?
>
> While requesting --tmp, we can use the value in GB, right?
>
> Regards
> Navin.
>
>
>
> On Tue, Apr 14, 2020 at 11:04 PM Ellestad, Erik wrote:
>
> Have you defined the TmpDisk value for each node?
>
> As far as I know, local disk space is not a valid type for GRES.
>
> https://slurm.schedmd.com/gres.html
> 
>
> "Generic resource (GRES) scheduling is supported through a flexible plugin
> mechanism. Support is currently provided for Graphics Processing Units
> (GPUs), CUDA Multi-Process Service (MPS), and Intel® Many Integrated Core
> (MIC) processors."
>
> The only valid solution I've found for scratch is:
>
> In slurm.conf, define the location of local scratch globally via TmpFS.
>
> Then define the amount available per host via TmpDisk=xxx in the node
> definition.
>
> Then have jobs request it via srun/sbatch with --tmp=X.
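>
> Putting those together, a minimal sketch (the node parameters and sizes
> below are made-up examples):
>
>     # slurm.conf
>     TmpFS=/scratch
>     NodeName=node01 CPUs=8 RealMemory=24099 TmpDisk=512000  # in MB (500 GB)
>
>     # job submission: ask for 100 GB of local scratch
>     sbatch --tmp=100G job.sh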
>
> ---
> Erik Ellestad
> Wynton Cluster SysAdmin
> UCSF
>

Re: [slurm-users] Alternative to munge for use with slurm?

2020-04-20 Thread Brian Andrus

For CentOS/RHEL, it is in the OpenFusion repo:


http://repo.openfusion.net/centos7-x86_64/

just

    yum install http://repo.openfusion.net/centos7-x86_64/openfusion-release-0.7-1.of.el7.noarch.rpm


then

    yum install libjwt-devel


Brian Andrus


On 4/18/2020 2:27 PM, Daniel Letai wrote:


In v20.02 you can use JWT, as per https://slurm.schedmd.com/jwt.html
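
As a rough sketch of the slurm.conf side (going by that page; key generation
and placement are covered there):

    AuthAltTypes=auth/jwt
    # an HS256 key (jwt_hs256.key) goes in StateSaveLocation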


The only issue is getting libjwt for most RPM-based distros.

The current libjwt 'configure; make dist-all' doesn't work.

I had to cd into dist and run 'make rpm' to create the spec file, then run
rpmbuild -ba after placing the tar.gz file in the SOURCES dir of the
rpmbuild tree.
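
In shell form, that was roughly (paths assumed; adjust to your rpmbuild tree):

    cd libjwt/dist
    make rpm                                   # generates the spec file
    cp ../libjwt-*.tar.gz ~/rpmbuild/SOURCES/
    rpmbuild -ba libjwt.spec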



Possibly just installing libjwt manually is easier for image-based
clusters.


HTH.



On 17/04/2020 22:42, Dean Schulze wrote:
Is there an alternative to munge when running slurm?  Munge issues 
are a common problem in slurm, and munge doesn't give any useful 
information when a problem occurs.  An alternative that at least gave 
some useful diagnostics on failure would be a big improvement.


Thanks.


Re: [slurm-users] slurm-20.02.1-1 failed rpmbuild with error File not found

2020-04-20 Thread Ole Holm Nielsen
For the record: The Slurm developers have found it tricky to write a 
slurm.spec file that requires the mysql-devel package and still works 
in all environments; see https://bugs.schedmd.com/show_bug.cgi?id=6488


My recommendation[1] is therefore to explicitly require mysql when 
building Slurm RPMs:


export VER=20.02.1-1
rpmbuild -ta slurm-$VER.tar.bz2 --with mysql

This will catch the case where you forgot to install 
mysql-devel/mariadb-devel.
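
For example, on CentOS/RHEL:

    yum install mariadb-devel    # or mysql-devel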


/Ole

[1] 
https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#build-slurm-packages




Re: [slurm-users] slurm-20.02.1-1 failed rpmbuild with error File not found

2020-04-20 Thread Christian Anthon
That makes sense, though I would prefer that they made a choice for the 
DB in the spec file; it just makes things easier for people who aren't 
experts in building packages. And slurmdbd will work with both mysql and 
mariadb regardless of which it was built against.


Thanks for the update,

Christian.

On 20/04/2020 17.57, Ole Holm Nielsen wrote:
For the record: The Slurm developers have found it tricky to write a 
slurm.spec file that requires the mysql-devel package and still works 
in all environments; see https://bugs.schedmd.com/show_bug.cgi?id=6488


My recommendation[1] is therefore to explicitly require mysql when 
building Slurm RPMs:


export VER=20.02.1-1
rpmbuild -ta slurm-$VER.tar.bz2 --with mysql

This will catch the case where you forgot to install 
mysql-devel/mariadb-devel.


/Ole

[1] 
https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#build-slurm-packages






[slurm-users] pam_slurm_adopt seems not working properly under "configless" slurm mode

2020-04-20 Thread Haoyang Liu
Hello,

I am setting up the latest slurm-20.02-1 on my clusters and trying to configure 
"configless" slurm on the compute nodes.
After following the instructions at 
https://slurm.schedmd.com/configless_slurm.html, both slurmctld and slurmd 
work fine.
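
For reference, the pieces involved are roughly (per that page; the ctld
hostname below is a placeholder):

    # slurm.conf on the slurmctld host
    SlurmctldParameters=enable_configless

    # on compute nodes, point slurmd at the ctld instead of a local config
    slurmd --conf-server ctld-host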
The config files can be found at $SlurmdSpoolDir/conf-cache and 
/run/slurm/conf. However, when I try to ssh into some compute
node, say `comput6`,

$ ssh comput6

the prompt hangs for about one minute and finally returns 'No Slurm jobs 
found on node'. Previously the message was
'Access denied by pam_slurm_adopt: you have no active jobs on this node'.

The issue can be reproduced on CentOS 6 and 7. I've checked /var/log/secure and 
noticed the following output:

comput6 pam_slurm_adopt[43672]: error: s_p_parse_file: unable to status file 
/usr/local/slurm/etc/slurm.conf: No such file or directory, retrying in 1sec up 
to 60sec

It seems that pam_slurm_adopt is still trying to find the config file in the 
default directory even under "configless" mode.
Creating a symlink in /usr/local/slurm/etc seems to be a workaround, but that 
moves away from the "configless" approach.
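
Concretely, the symlink workaround (using the cached config location above):

    ln -s /run/slurm/conf/slurm.conf /usr/local/slurm/etc/slurm.conf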

Is there a better way to fix this?


Best regards,
Haoyang