[slurm-dev] Re: Slurmd daemon doesn't start

David Ramírez Thu, 09 Feb 2017 06:15:03 -0800

The nvidia kernel module needs to be loaded at machine boot. This script works for me.


I use CentOS. You can add it to init.d.


Regards


On 09/02/17 at 13:56, Christian Goll wrote:

Hello Daniel,
do /dev/nvidia[0-1] exist on the machines?
If not, see
http://docs.nvidia.com/cuda/cuda-installation-guide-linux/
There is a shell script there that creates the device nodes for you. They are
not always created during startup, especially if there is no X on the system.
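
A quick way to answer that question on each node is a small check like the one below. This is only a sketch: the function name check_nodes is made up for illustration (it is not part of Slurm or the NVIDIA driver), and the paths passed to it should match whatever gres.conf lists.

```shell
#!/bin/bash
# Hypothetical helper, for illustration only.
# Prints every given path that is not a character device and
# returns non-zero if at least one of them is missing.
check_nodes() {
    local rc=0 node
    for node in "$@"; do
        if [ ! -c "$node" ]; then
            echo "missing: $node"
            rc=1
        fi
    done
    return $rc
}
```

For the two-GPU setup discussed here it could be called as `check_nodes /dev/nvidia0 /dev/nvidia1 /dev/nvidiactl`.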

kind regards,
Christian

On 09.02.2017 at 12:50, Daniel Ruiz Molina wrote:

Hi,

In my GPU cluster, the slurmd daemon doesn't start correctly because, when
the daemon starts, it doesn't find the /dev/nvidia[0-1] devices (mapped in
gres.conf). To solve this problem, I have added the attribute
"ExecStartPre=@/usr/bin/nvidia-smi >/dev/null" to the service file, and now
the daemon starts correctly. However, could anybody copy-paste their
slurmd service file from a GPU cluster? I suppose there must be a better
solution than mine.

Thanks.


--
DAVID RAMIREZ
Senior HPC Consultant
Systems & Integrator Manager
********************************
SIE, HPCSIE & Ladon OS Proyect
C/ Marqués de Mondejar 29-31 2ª Planta
28028 Madrid-Spain
********************************
Phone:  (+34)913611002
Mobile: (+34)661369483
Email: drami...@sie.es
Skype: dramirezsie
Twitter: @dramirezhpc @ladon_os
WWW.SIE.ES
WWW.LADONOS.ORG



#!/bin/bash
#
# Startup/shutdown script for nVidia CUDA
#
# chkconfig: 345 80 20
# description: Startup/shutdown script for nVidia CUDA

# Source function library.
. /etc/init.d/functions

DRIVER=nvidia
RETVAL=0

# Create /dev nodes for nvidia devices
function createnodes() {
    # Count the number of NVIDIA controllers found.
    N3D=$(/sbin/lspci | grep -i NVIDIA | grep -c "3D controller")
    NVGA=$(/sbin/lspci | grep -i NVIDIA | grep -c "VGA compatible controller")

    # Device minors run from 0 to N; seq prints nothing when N is -1 (no GPUs).
    N=$((N3D + NVGA - 1))
    for i in $(seq 0 $N); do
        # Skip nodes that already exist so mknod does not fail on them.
        [ -e /dev/nvidia$i ] || mknod -m 666 /dev/nvidia$i c 195 $i
        RETVAL=$?
        # Return instead of exit so the caller can still report the failure.
        [ "$RETVAL" = 0 ] || return $RETVAL
    done

    # Control device shared by all GPUs.
    [ -e /dev/nvidiactl ] || mknod -m 666 /dev/nvidiactl c 195 255
    RETVAL=$?
    [ "$RETVAL" = 0 ] || return $RETVAL
}

# Remove /dev nodes for nvidia devices
function removenodes() {
    rm -f /dev/nvidia*
}

# Start daemon
function start() {
    echo -n $"Loading $DRIVER kernel module: "
    modprobe $DRIVER && success || failure
    RETVAL=$?
    echo
    [ "$RETVAL" = 0 ] || exit $RETVAL

    echo -n $"Initializing CUDA /dev entries: "
    createnodes && success || failure
    RETVAL=$?
    echo
    [ "$RETVAL" = 0 ] || exit $RETVAL
}

# Stop daemon
function stop() {
    echo -n $"Unloading $DRIVER kernel module: "
    rmmod -f $DRIVER && success || failure
    RETVAL=$?
    echo
    [ "$RETVAL" = 0 ] || exit $RETVAL

    echo -n $"Removing CUDA /dev entries: "
    removenodes && success || failure
    RETVAL=$?
    echo
    [ "$RETVAL" = 0 ] || exit $RETVAL
}

# See how we were called
case "$1" in
    start)
        start
        ;;
    stop)
        stop
        ;;
    restart)
        stop
        start
        ;;
    *)
        echo $"Usage: $0 {start|stop|restart}"
        RETVAL=1
        ;;
esac

exit $RETVAL
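
The node-count arithmetic in createnodes can be tried out without touching /dev. The following is a minimal sketch, assuming lspci-style text on standard input; the function names count_nvidia and last_minor are made up here and do not appear in the script above. It folds the two separate controller counts into a single grep.

```shell
#!/bin/bash
# Hypothetical helpers, for illustration only.

# Count NVIDIA "3D controller" and "VGA compatible controller" lines
# in lspci-style text read from standard input.
count_nvidia() {
    grep -i nvidia | grep -c -E "3D controller|VGA compatible controller"
}

# Highest device minor for a given GPU count: nodes run from 0 to count-1.
last_minor() {
    echo $(( $1 - 1 ))
}
```

On a real machine this could be used as `last_minor "$(/sbin/lspci | count_nvidia)"`. With two GPUs the result is 1, which corresponds to /dev/nvidia0 and /dev/nvidia1.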
