[slurm-users] PyTorch with Slurm and MPS work-around --gres=gpu:1?

2020-04-03 Thread Robert Kudyba
Running Slurm 20.02 on CentOS 7.7 with Bright Cluster 8.2. I'm wondering
how the sbatch file below ends up sharing a GPU.

MPS is running on the head node:
ps -auwx|grep mps
root 108581  0.0  0.0  12780   812 ?Ssl  Mar23   0:27
/cm/local/apps/cuda-driver/libs/440.33.01/bin/nvidia-cuda-mps-control -d

The entire script is posted on Stack Overflow.

Here is the sbatch file contents:

#!/bin/sh
#SBATCH -N 1
#SBATCH -n 1
#SBATCH --job-name=sequentialBlur_alexnet_training_imagewoof_crossval
#SBATCH --nodelist=node003
module purge
module load gcc5 cuda10.1
module load openmpi/cuda/64
module load pytorch-py36-cuda10.1-gcc
module load ml-pythondeps-py36-cuda10.1-gcc
python3.6 SequentialBlur_untrained.py alexnet 100 imagewoof $1 | tee alex_100_imwoof_seq_longtrain_cv_$1.txt
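
For contrast, I would have expected the user to request the GPU explicitly, roughly like this (a sketch only; it assumes gres/gpu is defined for node003 in gres.conf):

#!/bin/sh
#SBATCH -N 1
#SBATCH -n 1
# ask Slurm to allocate one GPU; it should then set CUDA_VISIBLE_DEVICES for the job
#SBATCH --gres=gpu:1
#SBATCH --job-name=sequentialBlur_alexnet_training_imagewoof_crossval
#SBATCH --nodelist=node003
module purge
module load gcc5 cuda10.1
module load openmpi/cuda/64
module load pytorch-py36-cuda10.1-gcc
module load ml-pythondeps-py36-cuda10.1-gcc
python3.6 SequentialBlur_untrained.py alexnet 100 imagewoof $1 | tee alex_100_imwoof_seq_longtrain_cv_$1.txt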

From nvidia-smi on the compute node:
Processes
Process ID  : 320467
Type: C
Name: python3.6
Used GPU Memory : 2369 MiB
Process ID  : 320574
Type: C
Name: python3.6
Used GPU Memory : 2369 MiB

[node003 ~]# nvidia-smi -q -d compute

==NVSMI LOG==

Timestamp   : Fri Apr  3 15:27:49 2020
Driver Version  : 440.33.01
CUDA Version: 10.2

Attached GPUs   : 1
GPU 00000000:3B:00.0
Compute Mode: Default

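(As far as I understand, MPS is normally paired with the GPU in EXCLUSIVE_PROCESS compute mode rather than Default; if that's right, I assume it would be set on the compute node with something like the line below, though I haven't changed anything yet.)

# hypothetical: put GPU 0 on node003 into exclusive-process mode for use with MPS
nvidia-smi -i 0 -c EXCLUSIVE_PROCESS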

[~]# nvidia-smi
Fri Apr  3 15:28:49 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  On   | 00000000:3B:00.0 Off |                    0 |
| N/A   42C    P0    46W / 250W |   4750MiB / 32510MiB |     32%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0    320467      C   python3.6                                   2369MiB |
|    0    320574      C   python3.6                                   2369MiB |
+-----------------------------------------------------------------------------+

From htop:
320574 ouruser 20   0 12.2G 1538M  412M R 502.  0.8 14h45:59 python3.6 SequentialBlur_untrained.py alexnet 100 imagewoof 1
320467 ouruser 20   0 12.2G 1555M  412M D 390.  0.8 14h45:13 python3.6 SequentialBlur_untrained.py alexnet 100 imagewoof 0
320654 ouruser 20   0 12.2G 1555M  412M R 111.  0.8  3h00:56 python3.6 SequentialBlur_untrained.py alexnet 100 imagewoof 0
320656 ouruser 20   0 12.2G 1555M  412M R 111.  0.8  3h00:56 python3.6 SequentialBlur_untrained.py alexnet 100 imagewoof 0
320658 ouruser 20   0 12.2G 1538M  412M R 111.  0.8  3h00:54 python3.6 SequentialBlur_untrained.py alexnet 100 imagewoof 1
320660 ouruser 20   0 12.2G 1538M  412M R 111.  0.8  3h00:53 python3.6 SequentialBlur_untrained.py alexnet 100 imagewoof 1
320661 ouruser 20   0 12.2G 1538M  412M R 111.  0.8  3h00:54 python3.6 SequentialBlur_untrained.py alexnet 100 imagewoof 1
320655 ouruser 20   0 12.2G 1555M  412M R 55.8  0.8  3h00:56 python3.6 SequentialBlur_untrained.py alexnet 100 imagewoof 0
320657 ouruser 20   0 12.2G 1555M  412M R 55.8  0.8  3h00:56 python3.6 SequentialBlur_untrained.py alexnet 100 imagewoof 0
320659 ouruser 20   0 12.2G 1538M  412M R 55.8  0.8  3h00:53 python3.6 SequentialBlur_untrained.py alexnet 100 imagewoof 1

Is PyTorch somehow working around Slurm and not locking a GPU because the
user omitted --gres=gpu:1? And how can I tell whether MPS is really working?
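
The closest I have come to checking MPS directly is something like the commands below, run on the node where the control daemon is running (a sketch; I'm assuming the default MPS pipe directory, and the grep pattern is just what I would look for):

# ask the control daemon which MPS servers it has spawned (empty output = no clients have attached)
echo get_server_list | nvidia-cuda-mps-control

# when clients really go through MPS, nvidia-smi lists an nvidia-cuda-mps-server
# process and the client type changes from "C" to "M+C"
nvidia-smi | grep -E 'mps-server|M\+C'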


Re: [slurm-users] How to get the Average number of CPU cores used by jobs per day?

2020-04-03 Thread Alex Chekholko
Hey Sudeep,

Which flags to sreport have you tried?  Which information was missing?
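
For example, something along these lines is usually where I start for per-account core-hours (the account name and dates are placeholders; adjust to your accounting setup):

# CPU time charged to one account (e.g. a faculty group), reported in hours
sreport cluster AccountUtilizationByUser Accounts=group1 Start=2020-03-01 End=2020-04-01 -t Hours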

Regards,
Alex

On Thu, Apr 2, 2020 at 10:29 PM Sudeep Narayan Banerjee <snbaner...@iitgn.ac.in> wrote:

> Dear Steven: Yes, but I am unable to get the desired data. I am not sure
> which flags to use.
>
> Thanks & Regards,
> Sudeep Narayan Banerjee
>
> On 03/04/20 10:42 am, Steven Dick wrote:
>
> Have you looked at sreport?
>
> On Fri, Apr 3, 2020 at 1:09 AM Sudeep Narayan Banerjee wrote:
>
> How do I get the average number of CPU cores used by jobs per day for a
> particular group?
>
> By group I mean, say, faculty group1, group2, etc.; each group has a
> certain number of students.
>
> --
> Thanks & Regards,
> Sudeep Narayan Banerjee
> System Analyst | Scientist B
> Information System Technology Facility
> Academic Block 5 | Room 110
> Indian Institute of Technology Gandhinagar
> Palaj, Gujarat 382355 INDIA
>
>


Re: [slurm-users] Executing slurm command from Lua job_submit script?

2020-04-03 Thread CB
Hi Marcus,

The essence of the code looks like this.

In the job_submit.lua script, it executes an external script:

os.execute("/etc/slurm/test.sh".." "..job_desc.partition)

and the external test.sh runs the following command to get the partition
summary for further processing:

sinfo -h -p $1 -s

But this sinfo command returned no result.
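
In case it helps, the debugging version of the wrapper I'm trying next looks roughly like this (a sketch only; the log path and the full sinfo path are guesses, since job_submit.lua runs inside slurmctld and may not inherit the PATH or environment I expect):

#!/bin/bash
# /etc/slurm/test.sh -- hypothetical debugging version
# Log the environment slurmctld hands the script, then run sinfo by full path,
# capturing stderr so any failure is visible in the log.
{
    date
    echo "PATH=$PATH"
    echo "SLURM_CONF=$SLURM_CONF"
    /usr/bin/sinfo -h -p "$1" -s
} >> /tmp/job_submit_test.log 2>&1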

Regards,
Chansup

On Fri, Apr 3, 2020 at 1:28 AM Marcus Wagner 
wrote:

> Hi Chansup,
>
> could you provde a code snippet?
>
> Best
> Marcus
>
> Am 02.04.2020 um 19:43 schrieb CB:
> > Hi,
> >
> > I'm running Slurm 19.05.
> >
> > I'm trying to execute some Slurm commands from the Lua job_submit script
> > for a certain condition.
> > But I found that it's not executed and returns nothing.
> > For example, I tried to execute a "sinfo" command from an external shell
> > script but it didn't work.
> >
> > Does Slurm prohibit executing any Slurm command from the Lua job_submit
> > script?
> >
> > Thanks,
> > - Chansup
>
>
>