Re: [slurm-users] [EXTERNAL] CentOS 7 CUDA 8.0 can't find plugin cons_tres

2020-04-16 Thread Sean Crosby
Hi Lisa,

cons_tres is part of Slurm 19.05 and higher. As you are using Slurm 18.08,
it won't be there. The select plugin for 18.08 is cons_res.
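
A minimal sketch of the matching config for 18.08 (the node line is taken from
your message below; the SelectTypeParameters value and the gres.conf device
paths are assumptions to adjust for your site):

  # slurm.conf on Slurm 18.08: cons_res, not cons_tres
  SelectType=select/cons_res
  SelectTypeParameters=CR_Core_Memory
  GresTypes=gpu
  NodeName=cs-datasci CPUs=24 RealMemory=385405 Sockets=2 CoresPerSocket=6 ThreadsPerCore=2 State=UNKNOWN Gres=gpu:4

  # gres.conf on the node (assuming the usual /dev/nvidia* device files)
  Name=gpu File=/dev/nvidia[0-3]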

Is there a reason why you're using an old Slurm?

Sean
--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of Melbourne, Victoria 3010 Australia



On Fri, 17 Apr 2020 at 05:00, Lisa Kay Weihl  wrote:

> I have a standalone server with 4 GeForce RTX 2080 Ti cards. The purpose is
> to serve as a compute server for data science jobs. My department chair wants
> a job scheduler on it. I have installed SLURM (18.08.9). That works just fine
> in a basic configuration. When I attempt to add GresTypes=gpu and then add
> Gres=gpu:4 to the end of the node description:
>
> NodeName=cs-datasci CPUs=24 RealMemory=385405 Sockets=2 CoresPerSocket=6
> ThreadsPerCore=2 State=UNKNOWN Gres=gpu:4
>
> and then try to restart slurmd, I get an error that it cannot find the
> plugin:
>
> slurmd: error: Couldn't find the specified plugin name for
> select/cons_tres looking at all files
>
> slurmd: error: cannot find select plugin for select/cons_tres
>
> slurmd: fatal: Can't find plugin for select/cons_tres
>
> The system was prebuilt by AdvancedHPC with CentOS 7 and CUDA 8.0.
>
> I usually keep notes when I'm installing things but in this case I wasn't
> jotting things down as I went. I think I started with the instructions on
> this page: https://slurm.schedmd.com/quickstart_admin.html and went with
> the usual ./configure, make, make install.
>
> I have a feeling something did not work and that I switched to the RPM
> packages based on some other web pages I saw, because if I do a yum list
> installed | grep slurm I see a lot of packages. The problem is that I was
> interrupted with other tasks and my memory was somewhat rusty when I came
> back to this.
>
> When I went looking for this error I saw there were some issues with the
> newest SLURM and CUDA 10.2, but I didn't think that should be an issue
> because I was at CUDA 8.0. Just in case, I backed down to SLURM 18.
>
> I'm willing to start all over if anyone thinks cleaning up and rebuilding
> will help. I do see libraries in /etc/lib64/slurm, but I also see 2 files in
> /usr/local/lib/slurm/src, so I'm not sure if that's left over from trying to
> install from source. All the daemons are in /usr/sbin and user commands are
> in /usr/bin.
>
> I'm a newbie at this and very frustrated. Can anyone help?
>
> ***
>
> Lisa Weihl, Systems Administrator
>
> Computer Science, Bowling Green State University
> Tel: (419) 372-0116 | Fax: (419) 372-8061
> lwe...@bgsu.edu
> www.bgsu.edu
>


[slurm-users] CentOS 7 CUDA 8.0 can't find plugin cons_tres

2020-04-16 Thread Lisa Kay Weihl
I have a standalone server with 4 GeForce RTX 2080 Ti cards. The purpose is to
serve as a compute server for data science jobs. My department chair wants a
job scheduler on it. I have installed SLURM (18.08.9). That works just fine in
a basic configuration. When I attempt to add GresTypes=gpu and then add
Gres=gpu:4 to the end of the node description:


NodeName=cs-datasci CPUs=24 RealMemory=385405 Sockets=2 CoresPerSocket=6 
ThreadsPerCore=2 State=UNKNOWN Gres=gpu:4

and then try to restart slurmd, I get an error that it cannot find the plugin:

slurmd: error: Couldn't find the specified plugin name for select/cons_tres 
looking at all files

slurmd: error: cannot find select plugin for select/cons_tres

slurmd: fatal: Can't find plugin for select/cons_tres

The system was prebuilt by AdvancedHPC with CentOS 7 and CUDA 8.0.

I usually keep notes when I'm installing things but in this case I wasn't 
jotting things down as I went. I think I started with the instructions on this 
page: https://slurm.schedmd.com/quickstart_admin.html and went with the usual 
./configure, make, make install.

I have a feeling something did not work and that I switched to the RPM packages
based on some other web pages I saw, because if I do a yum list installed |
grep slurm I see a lot of packages. The problem is that I was interrupted with
other tasks and my memory was somewhat rusty when I came back to this.

When I went looking for this error I saw there were some issues with the newest
SLURM and CUDA 10.2, but I didn't think that should be an issue because I was
at CUDA 8.0. Just in case, I backed down to SLURM 18.

I'm willing to start all over if anyone thinks cleaning up and rebuilding will
help. I do see libraries in /etc/lib64/slurm, but I also see 2 files in
/usr/local/lib/slurm/src, so I'm not sure if that's left over from trying to
install from source. All the daemons are in /usr/sbin and user commands are in
/usr/bin.
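
A few commands that might help pin down what is actually installed and which
plugins are present (the paths below are the usual defaults, not confirmed for
this system):

  rpm -qa | grep slurm                     # what the RPM packages provide
  ls /usr/lib64/slurm/select_*.so          # select plugins installed by the RPMs
  ls /usr/local/lib/slurm/ 2>/dev/null     # possible leftovers from the source build
  scontrol show config | grep -i Select    # what the running controller is using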

I'm a newbie at this and very frustrated. Can anyone help?

***

Lisa Weihl, Systems Administrator

Computer Science, Bowling Green State University
Tel: (419) 372-0116 | Fax: (419) 372-8061
lwe...@bgsu.edu
www.bgsu.edu


Re: [slurm-users] How to request for the allocation of scratch .

2020-04-16 Thread Ellestad, Erik
That all seems fine to me.

I would check your Slurm logs to determine why Slurm put your nodes into a
drain state.

Erik

---
Erik Ellestad
Wynton Cluster SysAdmin
UCSF

From: slurm-users on behalf of navin srivastava
Sent: Wednesday, April 15, 2020 10:37 PM
To: Slurm User Community List 
Subject: Re: [slurm-users] How to request for the allocation of scratch .

Thanks Erik.

Last night I made the changes.

I defined the following in slurm.conf on all the nodes as well as on the Slurm
server:

TmpFS=/lscratch

 NodeName=node[01-10]  CPUs=44  RealMemory=257380 Sockets=2 CoresPerSocket=22 
ThreadsPerCore=1 TmpDisk=160 State=UNKNOWN Feature=P4000 Gres=gpu:2

These nodes have 1.6 TB of local scratch. I did an scontrol reconfig on all the
nodes, but after some time we saw all nodes go into a drain state, so I
reverted to the old configuration.

Jobs were running on all nodes and the local scratch is 20-25% in use. We
already have a cleanup script in crontab that cleans the scratch space
regularly.

Is anything wrong here?


Regards
Navin.

On Thu, Apr 16, 2020 at 12:26 AM Ellestad, Erik <erik.elles...@ucsf.edu> wrote:
The default value for TmpDisk is 0, so if you want local scratch available on a 
node, the amount of TmpDisk space must be defined in the node configuration in 
slurm.conf.

example:

NodeName=TestNode01 CPUs=8 Boards=1 SocketsPerBoard=2 CoresPerSocket=4 
ThreadsPerCore=1 RealMemory=24099 TmpDisk=15

The configuration value for the node definition is in MB.

https://slurm.schedmd.com/slurm.conf.html


TmpDisk
Total size of temporary disk storage in TmpFS in megabytes (e.g. "16384"). 
TmpFS (for "Temporary File System") identifies the location which jobs should 
use for temporary storage. Note this does not indicate the amount of free space 
available to the user on the node, only the total file system size. The system 
administration should ensure this file system is purged as needed so that user 
jobs have access to most of this space. The Prolog and/or Epilog programs 
(specified in the configuration file) might be used to ensure the file system 
is kept clean. The default value is 0.
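
As a rough worked conversion of that unit (assuming about 1.6 TB of usable
local scratch, as mentioned elsewhere in this thread): 1.6 TB is roughly
1,600,000 MB, so the node line would carry something like TmpDisk=1600000,
whereas TmpDisk=160 advertises only 160 MB.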

When requesting --tmp with srun or sbatch, it can be done in various size 
formats:


--tmp=<size[units]>
Specify a minimum amount of temporary disk space per node. Default units are 
megabytes unless the SchedulerParameters configuration parameter includes the 
"default_gbytes" option for gigabytes. Different units can be specified using 
the suffix [K|M|G|T].

https://slurm.schedmd.com/sbatch.html
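
As a sketch of how a job might request and use the space (the /lscratch path,
file names, and sizes below are assumptions, not site-confirmed):

  #!/bin/bash
  #SBATCH --job-name=scratch-test
  #SBATCH --tmp=100G        # scheduling constraint: node must advertise >= 100G of TmpDisk
  #SBATCH --cpus-per-task=4

  # --tmp only affects scheduling; Slurm itself does not create or clean a
  # per-job directory, so a prolog/epilog (or the job script) has to manage it.
  SCRATCH=/lscratch/$SLURM_JOB_ID
  mkdir -p "$SCRATCH"
  cp input.dat "$SCRATCH"/
  cd "$SCRATCH"
  ./my_program input.dat
  rm -rf "$SCRATCH"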



---
Erik Ellestad
Wynton Cluster SysAdmin
UCSF

From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of navin
srivastava <navin.alt...@gmail.com>
Sent: Tuesday, April 14, 2020 11:19 PM
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: Re: [slurm-users] How to request for the allocation of scratch .

Thank you, Erik.

Is it not mandatory to define the local scratch on all the compute nodes? Is it
enough to define it only on the Slurm server?
Also, should TmpDisk be defined in MB, or can it be defined in GB as well?

While requesting --tmp, can we use the value in GB?

Regards
Navin.



On Tue, Apr 14, 2020 at 11:04 PM Ellestad, Erik <erik.elles...@ucsf.edu> wrote:
Have you defined the TmpDisk value for each node?

As far as I know, local disk space is not a valid type for GRES.

https://slurm.schedmd.com/gres.html

"Generic resource (GRES) scheduling is supported through a flexible plugin 
mechanism. Support is currently provided for Graphics Processing Units (GPUs), 
CUDA Multi-Process Service (MPS), and Intel® Many Integrated Core (MIC) 
processors."

The only valid solution I've found for scratch is to:

In slurm.conf, define the location of local scratch globally via TmpFS.

And then the amount per host is defined via TmpDisk=xxx.

Then jobs request the space from srun/sbatch via --tmp=X.
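
For example (a sketch, with made-up numbers), an interactive request for 50 GB
of local scratch could look like:

  srun --tmp=50G --pty bash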



---
Erik Ellestad
Wynton Cluster SysAdmin
UCSF

From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of navin
srivastava <navin.alt...@gmail.com>
Sent: Tuesday, April 14, 2020 7:32 AM
To: Slurm User 

[slurm-users] srun always uses node002 even using --nodelist=node001

2020-04-16 Thread Robert Kudyba
I'm using this TensorRT tutorial with MPS on Slurm 20.02 on Bright Cluster 8.2.

I'm trying to use srun to test this, but it always fails as it appears to be
trying all nodes. We only have 3 compute nodes. As I'm writing this, node002
and node003 are in use by other users, so I just want to use node001.

srun /home/mydir/mpsmovietest  --gres=gpu:1 --job-name=MPSMovieTest
--nodes=1 --nodelist=node001 -Z --output=mpstest.out
Tue Apr 14 16:45:10 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  On   | :3B:00.0         Off |                    0 |
| N/A   67C    P0   241W / 250W |  32167MiB / 32510MiB |    100%   E. Process |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0    428996      C   python3.6                                  32151MiB |
+-----------------------------------------------------------------------------+
Loading openmpi/cuda/64/3.1.4
  Loading requirement: hpcx/2.4.0 gcc5/5.5.0

Loading cm-ml-python3deps/3.2.3
  Loading requirement: python36

Loading tensorflow-py36-cuda10.1-gcc/1.15.2
  Loading requirement: openblas/dynamic/0.2.20 hdf5_18/1.8.20
keras-py36-cuda10.1-gcc/2.3.1 protobuf3-gcc/3.8.0 nccl2-cuda10.1-gcc/2.5.6
 RUNNING TensorRT.sample_movielens_mps #
/cm/shared/apps/tensorrt-cuda10.1-gcc/6.0.1.5/bin/sample_movielens_mps
-b 2 -p 2
[03/14/2020-16:45:10] [I] ../../../data/movielens/movielens_ratings.txt
[E] [TRT] CUDA initialization failure with error 999. Please check
your CUDA installation:
http://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html
[E] Could not create builder.
[03/14/2020-16:45:10] [03/14/2020-16:45:10]  FAILED
TensorRT.sample_movielens_mps #
/cm/shared/apps/tensorrt-cuda10.1-gcc/6.0.1.5/bin/sample_movielens_mps
-b 2 -p 2
srun: error: node002: task 0: Exited with exit code 1

So is my syntax wrong with srun? MPS is running:

$ ps -auwx|grep mps
root 108581  0.0  0.0  12780   812 ?Ssl  Mar23   0:54
/cm/local/apps/cuda-


When node002 is available, the program runs correctly, albeit with an error in
the log file:

srun /home/mydir/mpsmovietest  --gres=gpu:1 --job-name=MPSMovieTest
 --nodes=1 --nodelist=node001 -Z --output=mpstest.out
Thu Apr 16 10:08:52 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  On   | :3B:00.0         Off |                    0 |
| N/A   28C    P0    25W / 250W |     41MiB / 32510MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0    420596      C   nvidia-cuda-mps-server                       29MiB  |
+-----------------------------------------------------------------------------+
Warning: Failed writing log files to directory [/tmp/nvidia-log]. No logs
will be available.
An instance of this daemon is already running
Warning: Failed writing log files to directory [/tmp/nvidia-log]. No logs
will be available.
Loading openmpi/cuda/64/3.1.4
  Loading requirement: hpcx/2.4.0 gcc5/5.5.0

Loading cm-ml-python3deps/3.2.3
  Loading requirement: python36

Loading tensorflow-py36-cuda10.1-gcc/1.15.2
  Loading requirement: openblas/dynamic/0.2.20 hdf5_18/1.8.20
keras-py36-cuda10.1-gcc/2.3.1 protobuf3-gcc/3.8.0
nccl2-cuda10.1-gcc/2.5.6
 RUNNING TensorRT.sample_movielens_mps #
/cm/shared/apps/tensorrt-cuda10.1-gcc/6.0.1.5/bin/sample_movielens_mps -b 2
-p 2
[03/16/2020-10:08:52] [I] ../../../data/movielens/movielens_ratings.txt
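
One thing worth checking in the srun commands above: srun only parses options
that appear before the executable, and everything after the program name is
passed to the program as its arguments, so --nodelist=node001 may never reach
srun at all. A sketch of the same command with the options moved in front of
the program (everything else unchanged):

  srun --gres=gpu:1 --job-name=MPSMovieTest --nodes=1 --nodelist=node001 \
       -Z --output=mpstest.out /home/mydir/mpsmovietest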

Re: [slurm-users] Munge decode failing on new node

2020-04-16 Thread Chris Samuel

On 4/15/20 10:57 am, Dean Schulze wrote:


  error: Munge decode failed: Invalid credential
  ENCODED: Wed Dec 31 17:00:00 1969
  DECODED: Wed Dec 31 17:00:00 1969
  error: authentication: Invalid authentication credential


That's really interesting. I had one of these last week while on call; for us,
at least, it seemed to be a hardware error, as when we attempted to reboot it
the node failed completely and would no longer boot.


Worth checking whatever hardware logging capabilities your system has to
see if MCEs are being reported.
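
For example (commands assume a fairly standard Linux node with IPMI and the
mcelog service; adjust for your environment):

  sudo ipmitool sel list | tail                    # BMC system event log
  journalctl -k | grep -i 'mce\|machine check'     # kernel-reported machine checks
  sudo mcelog --client                             # query the mcelog daemon, if running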


All the best,
Chris
--
Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA



Re: [slurm-users] Munge decode failing on new node

2020-04-16 Thread Ole Holm Nielsen

You might want to check the Munge section in my Slurm Wiki page:
https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#munge-authentication-service
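
In particular, a quick end-to-end test of the credential with the standard
munge utilities (substitute the real hostname for newnode):

  munge -n | unmunge               # encode and decode locally
  munge -n | ssh newnode unmunge   # encode here, decode on the new node
  remunge                          # simple round-trip/benchmark test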

/Ole

On 15-04-2020 19:57, Dean Schulze wrote:
I've installed two new nodes onto my slurm cluster.  One node works, but 
the other one complains about an invalid credential for munge.  I've 
verified that the munge.key is the same as on all other nodes with


sudo cksum /etc/munge/munge.key

I recopied a munge.key from a node that works.  I've verified that munge 
uid and gid are the same on the nodes.  The time is in sync on all nodes.


Here is what is in the slurmd.log:

  error: Unable to register: Unable to contact slurm controller (connect 
failure)

  error: Munge decode failed: Invalid credential
  ENCODED: Wed Dec 31 17:00:00 1969
  DECODED: Wed Dec 31 17:00:00 1969
  error: authentication: Invalid authentication credential
  error: slurm_receive_msg_and_forward: Protocol authentication error
  error: service_connection: slurm_receive_msg: Protocol authentication 
error
  error: Unable to register: Unable to contact slurm controller (connect 
failure)


I've checked in the munged.log and all it says is

Invalid credential