Re: [slurm-users] How to use a python virtualenv with srun?

Yann Bouteiller Sun, 17 Nov 2019 18:51:11 -0800

Hello Brian, thank you for your answer.

Actually, you are not allowed to install things in your home on computecanada, which is why you need to install everything in a virtualenv with pip install. Also, you have to create each virtualenv in $SLURM_TMPDIR, which is the local drive of the node, because everything else is slow, so I think I cannot share homes.

Actually, I succeeded at installing different virtualenvs on different nodes using a script for each worker that creates a local virtualenv, installs ray in it, and connects to the ray server running in the virtualenv of the head node (I mean the primary node, yes). I just call these scripts with srun. However, for some reason, the workers seem to connect fine to the server but are detected as dead after a while: https://groups.google.com/forum/#!topic/ray-dev/INB_zVS5PWY


Yann
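The per-worker script described above is not shown in the thread, so the following is only a minimal sketch of what such a script might look like. The file name `worker_setup.sh`, the helper names `venv_dir` and `start_worker`, and the `/tmp` fallback are assumptions made for illustration; the `module load`, `pip install ray`, and `ray start` invocations are copied from the original sbatch script. It uses the standard `python -m venv` to create the environment, which may differ from what was actually used.

```shell
#!/bin/bash
# worker_setup.sh (hypothetical name): meant to be launched on one worker node via srun.

# Where the node-local virtualenv lives. Falls back to /tmp when
# SLURM_TMPDIR is unset (an assumption, useful when testing outside a job).
venv_dir() {
    printf '%s/venv\n' "${SLURM_TMPDIR:-/tmp}"
}

# Create the virtualenv on this node, install ray in it, and join the head.
start_worker() {
    address=$1      # e.g. the value of $ip_head from the sbatch script
    password=$2     # e.g. the value of $redis_password
    venv=$(venv_dir)
    module load python/3.7.4
    python -m venv "$venv"
    . "$venv/bin/activate"
    pip install ray
    ray start --block --address="$address" --redis-password="$password"
}

# Run only when both arguments are supplied, so the file can also be sourced.
if [ "$#" -ge 2 ]; then
    start_worker "$1" "$2"
fi
```

From the sbatch script, each worker could then be started with something like `srun --nodes=1 --ntasks=1 -w "$node2" bash worker_setup.sh "$ip_head" "$redis_password" &`, so that the activation happens on the worker node itself rather than only on the primary node.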



Brian Andrus <toomuc...@gmail.com> wrote:

I suspect when you say "head node" you mean the primary node from the nodes you were allocated.
Normally, when you use pip as a user, it installs in your home directory. Are you certain all your nodes share the same homes? If they are merely synched, that would not be the same. Not actually sharing homes could be the cause.
Brian Andrus
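Whether the allocated nodes really share one home directory can be checked from inside the job. The sketch below is one possible way to do it, not something proposed in the thread: it drops a marker file in `$HOME` and asks one task per node whether the file is visible. The function names and the marker file name are hypothetical, and on a network filesystem with aggressive caching a freshly created file may take a moment to appear, so a "missing" answer is a hint rather than proof.

```shell
# Succeeds only if stdin is non-empty and every line is exactly "shared".
all_nodes_report_shared() {
    awk '$0 != "shared" { bad = 1 } END { if (NR == 0 || bad) exit 1 }'
}

# Create a marker file in $HOME on the node running the batch script, then
# ask one task per allocated node whether it can see that file.
check_shared_home() {
    marker="$HOME/.shared_home_check.$$"   # hypothetical marker name
    : > "$marker"
    srun --ntasks-per-node=1 sh -c \
        'if [ -e "$1" ]; then echo shared; else echo missing; fi' sh "$marker" \
        | all_nodes_report_shared
    status=$?
    rm -f "$marker"
    return "$status"
}
```

Inside the sbatch script, `check_shared_home && echo "homes are shared" || echo "homes differ"` would print the result before any `ray start` call is attempted.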


On 11/17/2019 11:24 AM, Yann Bouteiller wrote:
Hello,
I am trying to do this on computecanada, which is managed by slurm: https://ray.readthedocs.io/en/latest/deploying-on-slurm.html
However, on computecanada, you cannot install things on nodes before the job has started, and you can only install things in a python virtualenv once the job has started.
I can do:

```
module load python/3.7.4
source venv/bin/activate
pip install ray
```
in the bash script before calling everything else, but apparently this will only create-activate the virtualenv and install ray on the head node, but not on the remote nodes, so calling
```
srun --nodes=1 --ntasks=1 -w $node1 ray start --block --head --redis-port=6379 --redis-password=$redis_password & # Starting the head
```

will succeed, but later calling

```
for ((  i=1; i<=$worker_num; i++ ))
do
  node2=${nodes_array[$i]}
  srun --export=ALL --nodes=1 --ntasks=1 -w $node2 ray start --block --address=$ip_head --redis-pass$
  sleep 5
done

```

will produce the following error:

```
slurmstepd: error: execve(): ray: No such file or directory
srun: error: cdr768: task 0: Exited with exit code 2
srun: Terminating job step 31218604.3
[2]+ Exit 2 srun --export=ALL --nodes=1 --ntasks=1 -w $node2 ray start --block --address=$ip_head --redis-password=$redis_password
```
How can I tackle this issue, please? I am a beginner with slurm so I am not sure what the problem is here. Here is my whole sbatch script:
```
#!/bin/bash

#SBATCH --job-name=test
#SBATCH --cpus-per-task=5
#SBATCH --mem-per-cpu=1000M
#SBATCH --nodes=3
#SBATCH --tasks-per-node 1

worker_num=2 # Must be one less than the total number of nodes
nodes=$(scontrol show hostnames $SLURM_JOB_NODELIST) # Getting the node names
nodes_array=( $nodes )

module load python/3.7.4
source venv/bin/activate
pip install ray

node1=${nodes_array[0]}
ip_prefix=$(srun --nodes=1 --ntasks=1 -w $node1 hostname --ip-address) # Making address
suffix=':6379'
ip_head=$ip_prefix$suffix
redis_password=$(uuidgen)
export ip_head # Exporting for later access by trainer.py
srun --nodes=1 --ntasks=1 -w $node1 ray start --block --head --redis-port=6379 --redis-password=$redis_password & # Starting the head
sleep 5

for ((  i=1; i<=$worker_num; i++ ))
do
  node2=${nodes_array[$i]}
  srun --export=ALL --nodes=1 --ntasks=1 -w $node2 ray start --block --address=$ip_head --redis-password=$redis_password & # Starting the workers
  sleep 5
done
python -u trainer.py $redis_password 15 # Pass the total number of allocated CPUs
```

---
Regards,
Yann
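The `execve(): ray: No such file or directory` error means that the worker node could not find a `ray` executable when srun tried to launch it. One way to narrow this down, sketched here as an assumption rather than a confirmed diagnosis, is to ask every allocated node where (or whether) it can resolve the command. The variable `probe` and the function `probe_all_nodes` are hypothetical names; `uname -n` is used to print the node name.

```shell
# Shell snippet each node runs: prints "<node name>: <path to command or MISSING>".
probe='printf "%s: %s\n" "$(uname -n)" "$(command -v "$1" || echo MISSING)"'

# Run the probe once on every node of the current allocation.
probe_all_nodes() {
    srun --ntasks-per-node=1 bash -c "$probe" probe "$1"
}
```

Calling `probe_all_nodes ray` in the sbatch script right after `pip install ray` prints one line per node. Any node reporting `MISSING` does not see the installation made by the batch script, which would point to the virtualenv living on storage that is not visible from that node.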
