We're running multiple clusters on Bright 8.x with Scientific Linux 7 (and have 
run Scientific Linux 5 and 6 with Bright 5.0 and higher in the past, on many 
different pieces of hardware, without issues) and have never experienced this. 
But here are some things to test:


- some MPI implementations prefer mpiexec over mpirun; have you tried that 
instead?
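
For example (./your_app is just a placeholder for your own binary), the same 
job launched both ways:

# try both launchers with the same binary and task count
mpirun -np 40 ./your_app
mpiexec -n 40 ./your_app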

- if you log in to a node while a job is 'hanging', do you see the right number 
of processes spawned on each node? Is the node list of all nodes involved in 
the job passed to the hydra process on all nodes?
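
Something along these lines (node001 and hello are just placeholders for one of 
your allocated nodes and your binary) should show whether the hydra proxy and 
the MPI ranks actually get started everywhere, and what host list they were given:

# while the job 'hangs', repeat for each node in the allocation:
ssh node001 'ps -ef | grep -E "hydra|mpiexec" | grep -v grep'
ssh node001 'ps -ef | grep hello | grep -v grep'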

- which version of the Mellanox OFED stack are you using? One of our vendors 
recommended against OFED 4.6 due to issues, mostly related to IP over IB, but 
still: you might want to try 4.5 just to rule things out.
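
Assuming the Mellanox OFED stack is installed (ofed_info ships with it), running 
this on a compute node shows the version and the state of the FDR links:

# print the installed Mellanox OFED version
ofed_info -s
# check that the IB ports are Active and running at the expected rate
ibstat | grep -E "State|Rate"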

- what happens if you use openmpi (as supplied by Bright) together with a 
simple hello-world example? There's a good one at 
https://mpitutorial.com/tutorials/mpi-hello-world/ which I know works fine 
with the Bright-supplied openmpi.
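
Roughly like this (the exact openmpi module name/version differs per Bright 
release, so check 'module avail openmpi' first; save the hello-world source 
from that page as mpi_hello_world.c):

module avail openmpi
module load openmpi/gcc/64            # pick whichever version is listed
mpicc mpi_hello_world.c -o hello_ompi
# then run it from a jobfile like the one below, but with
# mpirun -np 40 ./hello_ompi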

- what happens if you test with openmpi and force it to use ethernet instead of 
infiniband? See https://www.open-mpi.org/faq/?category=tcp for how to force a 
specific interface with openmpi.
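
Something like this (eth0 is just a placeholder; use whatever interface name 
carries your 1Gbps network):

# restrict openmpi to TCP over a specific ethernet interface
mpirun --mca btl tcp,self --mca btl_tcp_if_include eth0 -np 40 ./hello_ompi

If the job suddenly runs fine over ethernet, the problem is most likely in the 
Infiniband/OFED layer rather than in Slurm itself.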


I've just successfully tested the above hello-world example, compiled with the 
Bright-supplied mvapich2/gcc/64/2.3b, using the jobfile below to run it over 
2 nodes with 20 cores each.


#!/bin/bash
#SBATCH -n 40
#SBATCH --exclusive
#SBATCH --partition=normal
#SBATCH --job-name=P80000.000_test
#SBATCH --time=2:00:00
#SBATCH --ntasks-per-node=20
#SBATCH --begin=now
#SBATCH --error=errors
#SBATCH --output=output
source /etc/profile.d/modules.sh
module load mvapich2/gcc/64/2.3b
mpiexec -n 40 ./hello
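
For completeness, the compile and submit steps boil down to something like this 
(the filenames are just examples):

module load mvapich2/gcc/64/2.3b
mpicc mpi_hello_world.c -o hello
sbatch jobfile.sh
squeue -u $USER      # and check the 'output' / 'errors' files afterwards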



Good luck!

--

Jan-Albert van Ree | Linux System Administrator | Digital Services
MARIN | T +31 317 49 35 48 | j.a.v....@marin.nl | www.marin.nl


________________________________
From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of Chris 
Woelkers - NOAA Federal <chris.woelk...@noaa.gov>
Sent: Tuesday, December 10, 2019 20:49
To: slurm-users@lists.schedmd.com
Subject: [slurm-users] Multi-node job failure

I have a 16-node HPC cluster that is in the process of being upgraded from 
CentOS 6 to 7. All nodes are diskless and connected via 1Gbps Ethernet and FDR 
Infiniband. I am using Bright Cluster Manager to manage it and their support 
has not found a solution to this problem.
For the most part the cluster is up and running with all nodes booting and able 
to communicate with each other via all interfaces on a basic level.
Test jobs submitted via sbatch are able to run on one node with no problem but 
will not run on multiple nodes. The jobs use mpirun, and mvapich2 is installed.
Any job trying to run on multiple nodes ends up timing out, as set via -t, with 
no output data written and no error messages in the slurm.err or slurm.out 
files. The job shows up in the squeue output and the nodes used show up as 
allocated in the sinfo output.

Thanks,

Chris Woelkers
IT Specialist
National Oceanic and Atmospheric Administration
Great Lakes Environmental Research Laboratory
4840 S State Rd | Ann Arbor, MI 48108
734-741-2446




