On Mon, Jul 31, 2017 at 11:58 AM, Sean McGrath <[email protected]> wrote:
> We do check that the GPU drivers are working on nodes before launching a job
> on them.
>
> Our prolog calls another in-house script (we should move to NHC, to be
> honest) that does the following:
>
> if [[ -n $(lspci | grep GK110BGL) ]]; then # only applicable to boole nodes with GPUs installed
>     if [[ -z $(/home/support/apps/apps/NVIDIA_CUDA-8.0_Samples/bin/x86_64/linux/release/deviceQuery | grep 'Result = PASS') ]]; then # deviceQuery is slow
>         print_problem "GPU Drivers"
>     else
>         print_ok "GPU Drivers"
>     fi
> fi

I'm doing a slightly different test, but along the same lines. Since we run GPU-exclusive, if the test fails I can presume the card is either dead or in use. The process seems pretty fragile, though. I've had instances where the driver would hang, which caused the test command to hang, which in turn caused the prolog to hang and left a mess behind.
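
One way to keep a hung check from taking the prolog down with it might be to put a hard time limit on the test command. The following is only a sketch, not what either of us actually runs: `check_with_timeout` is a hypothetical helper name, the 30-second limit in the usage comment is an arbitrary assumption, and it relies on `timeout` from GNU coreutils being available on the node.

```shell
#!/bin/bash
# Hypothetical helper (not from either site's prolog): run a health-check
# command under a time limit so that a hung driver cannot hang the caller.
#
# Usage:   check_with_timeout <seconds> <command> [args...]
# Returns: 0 if the output contains 'Result = PASS'
#          1 if the command finished without reporting PASS
#          2 if the command had to be stopped by timeout(1)
check_with_timeout() {
    local limit=$1
    shift
    local output status

    # timeout(1) sends SIGTERM after $limit seconds and, because of -k 5,
    # SIGKILL five seconds later if the command is still running.
    output=$(timeout -k 5 "$limit" "$@" 2>&1)
    status=$?

    # 124 = stopped after SIGTERM, 137 = 128 + 9, i.e. SIGKILL was needed.
    if [[ $status -eq 124 || $status -eq 137 ]]; then
        return 2
    fi

    if [[ $output == *'Result = PASS'* ]]; then
        return 0
    fi
    return 1
}

# Example call from a prolog, using the deviceQuery path quoted above
# (the 30-second limit is an assumption, tune it for your nodes):
#
#   check_with_timeout 30 /home/support/apps/apps/NVIDIA_CUDA-8.0_Samples/bin/x86_64/linux/release/deviceQuery
#   case $? in
#       0) echo "GPU check passed" ;;
#       2) echo "GPU check timed out, driver may be hung" ;;
#       *) echo "GPU check failed" ;;
#   esac
```

Returning a separate code for the timeout case lets the prolog tell "driver hung" apart from "card dead or in use". One caveat: a process stuck in uninterruptible sleep inside the driver may not respond even to SIGKILL, so this probably won't cover every hang.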
