The nvidia kernel module needs to be loaded at machine boot. This
script works for me.
I use CentOS. You can add it to init.d.
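For anyone unsure what "add it to init.d" involves, here is a minimal sketch. It assumes the script was saved as a file named nvidia; that name and the helper function install_init_script are only illustrative, not something from this thread.

```shell
#!/bin/bash
# Sketch: copy an init script into a SysV init directory and make it executable.
# The file name "nvidia" and this helper are hypothetical.
install_init_script() {
    local src="$1"
    local dest_dir="${2:-/etc/init.d}"
    # install(1) copies the file and sets its mode in one step.
    install -m 755 "$src" "$dest_dir/$(basename "$src")"
}

# Typical use as root on CentOS, assuming the script was saved as ./nvidia:
#   install_init_script ./nvidia
#   chkconfig --add nvidia    # registers it using the "# chkconfig:" header
#   service nvidia start
```

The "# chkconfig: 345 80 20" header in the script tells chkconfig to enable it in runlevels 3, 4 and 5 with start priority 80 and stop priority 20.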
Regards
On 09/02/17 at 13:56, Christian Goll wrote:
Hello Daniel,
Do /dev/nvidia[0-1] exist on the machines?
If not, see
http://docs.nvidia.com/cuda/cuda-installation-guide-linux/
where there is a shell script that creates the device nodes for you. They are
not always created during startup, especially if there is no X on the system.
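A quick way to check for the nodes could look like the sketch below. The function name missing_nvidia_nodes and its optional directory argument (handy for trying it outside /dev) are made up for illustration; the GPU count has to be supplied by hand.

```shell
#!/bin/bash
# Sketch: print the expected NVIDIA device nodes that are missing.
# Arguments: number of GPUs, and optionally the device directory
# (hypothetical parameter, defaults to /dev).
missing_nvidia_nodes() {
    local count="$1"
    local dev_dir="${2:-/dev}"
    local i
    for ((i = 0; i < count; i++)); do
        [ -e "$dev_dir/nvidia$i" ] || echo "$dev_dir/nvidia$i"
    done
    # The control node is needed in addition to the per-GPU nodes.
    [ -e "$dev_dir/nvidiactl" ] || echo "$dev_dir/nvidiactl"
    return 0
}

# Example for a node with two GPUs, matching /dev/nvidia[0-1]:
#   missing_nvidia_nodes 2
```

If it prints nothing, all expected nodes are present.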
kind regards,
Christian
On 09.02.2017 at 12:50, Daniel Ruiz Molina wrote:
Hi,
In my GPU cluster, the slurmd daemon doesn't start correctly because, when
the daemon starts, it doesn't find the /dev/nvidia[0-1] devices (mapped in
gres.conf). To solve this problem, I have added the attribute
"ExecStartPre=@/usr/bin/nvidia-smi >/dev/null" to the service file, and now
the daemon starts correctly. However, could anybody copy-paste their
slurmd service file from a GPU cluster? I suppose there must be a better
solution than mine.
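One possible variant of this workaround, sketched under some assumptions: the unit is named slurmd.service, and the setting goes into a drop-in file instead of the packaged service file. The drop-in name nvidia.conf and the helper write_slurmd_dropin are hypothetical. Note that systemd does not interpret shell redirections, and a leading "@" makes the next token argv[0], so this sketch wraps the command in /bin/sh to get the ">/dev/null" behaviour.

```shell
#!/bin/bash
# Sketch: write a systemd drop-in that runs nvidia-smi before slurmd starts.
# The unit name slurmd.service and the file name nvidia.conf are assumptions.
write_slurmd_dropin() {
    local dropin_dir="${1:-/etc/systemd/system/slurmd.service.d}"
    mkdir -p "$dropin_dir"
    cat > "$dropin_dir/nvidia.conf" <<'EOF'
[Service]
# systemd does not interpret shell redirections, so wrap the command in a shell
ExecStartPre=/bin/sh -c '/usr/bin/nvidia-smi >/dev/null'
EOF
}

# As root:
#   write_slurmd_dropin
#   systemctl daemon-reload
#   systemctl restart slurmd
```

A drop-in survives package updates that replace the main service file, which is the main reason to prefer it over editing the unit in place.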
Thanks.
--
DAVID RAMIREZ
Senior HPC Consultant
Systems & Integrator Manager
********************************
SIE, HPCSIE & Ladon OS Proyect
C/ Marqués de Mondejar 29-31 2ª Planta
28028 Madrid-Spain
********************************
Phone: (+34)913611002
Mobile: (+34)661369483
Email: drami...@sie.es
Skype: dramirezsie
Twitter: @dramirezhpc @ladon_os
WWW.SIE.ES
WWW.LADONOS.ORG
--
This mail and its attached files are confidential and are exclusively
intended to their addressee. In case you may receive this mail not being
its addressee, we beg you to let us know the error by reply and to proceed
to delete it. The circulation by any mean of this mail could be penalised
in accordance with the Spanish legislation. The use of both the transmitter
and the addressee’s address with a commercial aim, or in order to be
incorporated to automated files, is not authorised.
#!/bin/bash
#
# Startup/shutdown script for nVidia CUDA
#
# chkconfig: 345 80 20
# description: Startup/shutdown script for nVidia CUDA
# Source function library.
. /etc/init.d/functions
DRIVER=nvidia
RETVAL=0
# Create /dev nodes for nvidia devices
function createnodes() {
    # Count the number of NVIDIA controllers found.
    N3D=`/sbin/lspci | grep -i NVIDIA | grep "3D controller" | wc -l`
    NVGA=`/sbin/lspci | grep -i NVIDIA | grep "VGA compatible controller" | wc -l`
    N=`expr $N3D + $NVGA - 1`
    for i in `seq 0 $N`; do
        # Skip nodes that already exist; mknod would fail on them.
        [ -e /dev/nvidia$i ] || mknod -m 666 /dev/nvidia$i c 195 $i
        RETVAL=$?
        [ "$RETVAL" = 0 ] || exit $RETVAL
    done
    [ -e /dev/nvidiactl ] || mknod -m 666 /dev/nvidiactl c 195 255
    RETVAL=$?
    [ "$RETVAL" = 0 ] || exit $RETVAL
}
# Remove /dev nodes for nvidia devices
function removenodes() {
    rm -f /dev/nvidia*
}
# Start daemon
function start() {
    echo -n $"Loading $DRIVER kernel module: "
    modprobe $DRIVER && success || failure
    RETVAL=$?
    echo
    [ "$RETVAL" = 0 ] || exit $RETVAL
    echo -n $"Initializing CUDA /dev entries: "
    createnodes && success || failure
    RETVAL=$?
    echo
    [ "$RETVAL" = 0 ] || exit $RETVAL
}
# Stop daemon
function stop() {
    echo -n $"Unloading $DRIVER kernel module: "
    rmmod -f $DRIVER && success || failure
    RETVAL=$?
    echo
    [ "$RETVAL" = 0 ] || exit $RETVAL
    echo -n $"Removing CUDA /dev entries: "
    removenodes && success || failure
    RETVAL=$?
    echo
    [ "$RETVAL" = 0 ] || exit $RETVAL
}
# See how we were called
case "$1" in
    start)
        start
        ;;
    stop)
        stop
        ;;
    restart)
        stop
        start
        ;;
    *)
        echo $"Usage: $0 {start|stop|restart}"
        RETVAL=1
        ;;
esac
exit $RETVAL