Re: [slurm-users] Multi-node job failure

Zacarias Benta Wed, 11 Dec 2019 02:08:36 -0800
I had a similar issue. Please check whether the home directory, or
wherever the data should be stored, is mounted on the nodes.
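
One way to run that check on every node in an allocation is a small script like the one below. It is only a sketch: the paths it inspects (the home directory and the current working directory) are assumptions, so substitute whatever directories your job actually reads and writes. For each path it reports the host name, whether the directory exists, whether it is writable, and which mount point contains it. On a diskless node, a mount point of "/" would suggest the path lives in the node image rather than on a shared filesystem.

```python
import os
import socket


def find_mount_point(path):
    """Walk up from path until a mount point is reached and return it."""
    path = os.path.realpath(path)
    while not os.path.ismount(path):
        path = os.path.dirname(path)
    return path


def describe(path):
    """Collect basic facts about one directory on the local node."""
    return {
        "path": path,
        "exists": os.path.isdir(path),
        "writable": os.access(path, os.W_OK),
        "mount_point": find_mount_point(path),
    }


if __name__ == "__main__":
    # Assumed paths: replace with the directories your job uses.
    paths = [os.path.expanduser("~"), os.getcwd()]
    host = socket.gethostname()
    for p in paths:
        print(host, describe(p))
```

Saved under a hypothetical name such as check_mounts.py, it could be launched once per node with something like "srun -N 2 --ntasks-per-node=1 python3 check_mounts.py" and the output compared across hosts.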
On Tue, 2019-12-10 at 14:49 -0500, Chris Woelkers - NOAA Federal wrote:
> I have a 16 node HPC that is in the process of being upgraded from
> CentOS 6 to 7. All nodes are diskless and connected via 1Gbps
> Ethernet and FDR Infiniband. I am using Bright Cluster Management to
> manage it and their support has not found a solution to this
> problem. For the most part the cluster is up and running with all
> nodes booting and able to communicate with each other via all
> interfaces on a basic level.
> Test jobs, submitted via sbatch, are able to run on one node with no
> problem but will not run on multiple nodes. The jobs are using mpirun
> and mvapich2 is installed.
> Any job trying to run on multiple nodes ends up timing out, as set
> via -t, with no output data written and no error messages in the
> slurm.err or slurm.out files. The job shows up in the squeue output
> and the nodes used show up as allocated in the sinfo output.
> 
> Thanks,
> 
> Chris Woelkers
> IT Specialist
> National Oceanic and Atmospheric Agency
> Great Lakes Environmental Research Laboratory
> 4840 S State Rd | Ann Arbor, MI 48108
> 734-741-2446
-- 
Cumprimentos / Best Regards,
Zacarias Benta
INCD @ LIP - UMinho