Re: [slurm-users] Multi-node job failure
On 11/12/19 8:05 am, Chris Woelkers - NOAA Federal wrote:

> Partial progress. The scientist who developed the model took a look at the output and found that instead of one model run executing in parallel, srun had launched multiple instances of the model, one per thread, which for this test was 110 threads.

This sounds like MVAPICH isn't built to support Slurm. From the Slurm MPI guide, you need to build it with the following to enable Slurm support (and of course add any other options you were using):

./configure --with-pmi=pmi2 --with-pm=slurm

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
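[Editor's note: a rough sketch of what that build and launch might look like; the version number, install prefix, and binary name are placeholders, not from the thread.]

```shell
# Build MVAPICH2 against Slurm's PMI2 so srun can launch ranks directly
# (version and --prefix below are hypothetical examples).
tar xf mvapich2-2.3.tar.gz
cd mvapich2-2.3
./configure --with-pmi=pmi2 --with-pm=slurm --prefix=/opt/mvapich2-slurm
make -j "$(nproc)"
make install

# Then launch through Slurm rather than mpirun/ssh:
srun --mpi=pmi2 -N 2 --ntasks-per-node=20 ./model
```

These commands assume a cluster environment and source tarball, so they are a sketch rather than something to paste verbatim.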
Re: [slurm-users] Multi-node job failure
Partial progress. The scientist who developed the model took a look at the output and found that instead of one model run executing in parallel, srun had launched multiple instances of the model, one per thread, which for this test was 110 threads. I have a feeling this just verified the same thing that the hello world test did.

Thanks,

Chris Woelkers
IT Specialist
National Oceanic and Atmospheric Administration
Great Lakes Environmental Research Laboratory
4840 S State Rd | Ann Arbor, MI 48108
734-741-2446
Re: [slurm-users] Multi-node job failure
I tried a simple thing: swapping out mpirun in the sbatch script for srun. Nothing more, nothing less. The model is now working on at least two nodes. I will have to test again on more, but this is progress.

Thanks,

Chris Woelkers
IT Specialist
National Oceanic and Atmospheric Administration
Great Lakes Environmental Research Laboratory
4840 S State Rd | Ann Arbor, MI 48108
734-741-2446
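[Editor's note: the mpirun-to-srun swap described in this thread amounts to a one-line change in the job script. The sketch below borrows the module name and node counts from elsewhere in the thread; the binary name is illustrative.]

```shell
#!/bin/bash
#SBATCH -N 2
#SBATCH --ntasks-per-node=20
source /etc/profile.d/modules.sh
module load mvapich2/gcc/64/2.3b

# Before: mpirun spawned remote ranks over ssh, outside Slurm's control.
# mpirun -n 40 ./model

# After: Slurm itself launches and tracks one task per allocated slot.
srun ./model
```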
Re: [slurm-users] Multi-node job failure
Thanks all for the ideas and possibilities. I will answer each in turn.

Paul: Neither of the switches in use, Ethernet and Infiniband, has any form of broadcast storm protection enabled.

Chris: I have passed on your question to the scientist who created the sbatch script. I will also look into other scripts that may make use of srun to find out whether the same thing occurs.

Jan-Albert: The mvapich2 package is provided by Bright and loaded as a module by the script before mpirun is executed.

Zacarias: The drive that the data and script live on is mounted on all the nodes at boot.

Thanks,

Chris Woelkers
IT Specialist
National Oceanic and Atmospheric Administration
Great Lakes Environmental Research Laboratory
4840 S State Rd | Ann Arbor, MI 48108
734-741-2446
Re: [slurm-users] Multi-node job failure
I had a similar issue; please check whether the home drive, or the place where the data should be stored, is mounted on the nodes.

--
Cumprimentos / Best Regards,
Zacarias Benta
INCD @ LIP - UMinho
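[Editor's note: one quick way to check that across a cluster like the 16-node one described here; the mount point /home is a guess, so substitute the real data path.]

```shell
# Run one task per node; any node missing the shared mount shows up
# immediately in the output.
srun -N 16 --ntasks-per-node=1 bash -c 'echo "$(hostname): $(df -h /home | tail -1)"'
```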
Re: [slurm-users] Multi-node job failure
OK, so Open MPI works fine. That means Slurm, OFED, and the hardware are fine. Which mvapich2 package are you using: a home-built one or one provided by Bright?

Regards,
--
Jan-Albert

Jan-Albert van Ree | Linux System Administrator | Digital Services
MARIN | T +31 317 49 35 48 | j.a.v@marin.nl | www.marin.nl
Re: [slurm-users] Multi-node job failure
Hi Chris,

On Tuesday, 10 December 2019 11:49:44 AM PST Chris Woelkers - NOAA Federal wrote:

> Test jobs, submitted via sbatch, are able to run on one node with no problem but will not run on multiple nodes. The jobs are using mpirun and mvapich2 is installed.

Is there a reason why you aren't using srun for launching these?

https://slurm.schedmd.com/mpi_guide.html

If you're using mpirun then (unless you've built mvapich2 with Slurm support) you'll be relying on ssh to launch tasks, and that could be what's broken for you. Running with srun will avoid that and allow Slurm to track your processes correctly.

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
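[Editor's note: a quick way to see what a given installation supports; srun's --mpi option is standard, though the plugin list depends on how Slurm was built.]

```shell
# List the MPI plugin types this Slurm installation was built with...
srun --mpi=list

# ...then launch with an explicit PMI type instead of relying on the default:
srun --mpi=pmi2 -N 2 --ntasks-per-node=20 ./model
```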
Re: [slurm-users] Multi-node job failure
Thanks for the reply and the things to try. Here are the answers to your questions/tests in order:

- I tried mpiexec and the same issue occurred.

- While the job is listed as running I checked all the nodes. None of them have processes spawned. I have no idea on the hydra process.

- I have version 4.7 of the OFED stack installed on all nodes.

- Using openmpi with the hello world example you listed gives output that seems to match what should normally be given. I upped the number of threads to 16, because 4 doesn't help much, and ran it again with four nodes of 4 threads each, and got the following, which looks like good output.

Hello world from processor bearnode14, rank 4 out of 16 processors
Hello world from processor bearnode14, rank 5 out of 16 processors
Hello world from processor bearnode14, rank 6 out of 16 processors
Hello world from processor bearnode15, rank 10 out of 16 processors
Hello world from processor bearnode15, rank 8 out of 16 processors
Hello world from processor bearnode16, rank 13 out of 16 processors
Hello world from processor bearnode15, rank 11 out of 16 processors
Hello world from processor bearnode13, rank 3 out of 16 processors
Hello world from processor bearnode14, rank 7 out of 16 processors
Hello world from processor bearnode15, rank 9 out of 16 processors
Hello world from processor bearnode16, rank 12 out of 16 processors
Hello world from processor bearnode16, rank 14 out of 16 processors
Hello world from processor bearnode16, rank 15 out of 16 processors
Hello world from processor bearnode13, rank 1 out of 16 processors
Hello world from processor bearnode13, rank 0 out of 16 processors
Hello world from processor bearnode13, rank 2 out of 16 processors

- I have not tested our test model with openmpi, as it was compiled with Intel compilers and expects Intel MPI. It might work, but for now I will hold that for later. I did test the hello world again using the Intel modules instead of the openmpi modules, and it still worked.
Thanks,

Chris Woelkers
IT Specialist
National Oceanic and Atmospheric Administration
Great Lakes Environmental Research Laboratory
4840 S State Rd | Ann Arbor, MI 48108
734-741-2446
Re: [slurm-users] Multi-node job failure
We're running multiple clusters using Bright 8.x with Scientific Linux 7 (and have run Scientific Linux releases 5 and 6 with Bright 5.0 and higher in the past, without issues, on many different pieces of hardware) and never experienced this. But some things to test:

- Some implementations prefer mpiexec over mpirun; have you tried that instead?

- If you log in to a node while a job is 'hanging', do you see that on each node the right number of processes is spawned? Is the node list of all nodes involved in the job passed to the hydra process on all nodes?

- Which version of the Mellanox OFED stack are you using? One of our vendors recommended against OFED 4.6 due to issues, mostly related to IP over IB, but still; you might want to try 4.5 just to rule things out.

- What happens if you use openmpi (as supplied by Bright) together with a simple hello world example? There's a good one at https://mpitutorial.com/tutorials/mpi-hello-world/ which I know to work fine with Bright-supplied openmpi.

- What happens if you test with openmpi and force it to use Ethernet instead of Infiniband? See https://www.open-mpi.org/faq/?category=tcp for info on forcing a specific interface with openmpi.

I've just successfully tested the above hello-world example, with the Bright-supplied mvapich2/gcc/64/2.3b to compile the code and the jobfile below to run it over 2 nodes, each with 20 cores.

#!/bin/bash
#SBATCH -n 40
#SBATCH --exclusive
#SBATCH --partition=normal
#SBATCH --job-name=P8.000_test
#SBATCH --time=2:00:00
#SBATCH --ntasks-per-node=20
#SBATCH --begin=now
#SBATCH --error=errors
#SBATCH --output=output
source /etc/profile.d/modules.sh
module load mvapich2/gcc/64/2.3b
mpiexec -n 40 ./hello

Good luck!
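[Editor's note: for the Ethernet-only test, the Open MPI invocation would look roughly like this; the interface name eth0 is a placeholder, and the Open MPI FAQ linked in the message covers the details.]

```shell
# Restrict Open MPI to plain TCP on a specific Ethernet interface,
# bypassing Infiniband entirely, to isolate fabric problems:
mpirun --mca btl tcp,self --mca btl_tcp_if_include eth0 -n 40 ./hello
```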
--
Jan-Albert van Ree

Jan-Albert van Ree | Linux System Administrator | Digital Services
MARIN | T +31 317 49 35 48 | j.a.v@marin.nl | www.marin.nl
[slurm-users] Multi-node job failure
I have a 16 node HPC that is in the process of being upgraded from CentOS 6 to 7. All nodes are diskless and connected via 1Gbps Ethernet and FDR Infiniband. I am using Bright Cluster Management to manage it, and their support has not found a solution to this problem.

For the most part the cluster is up and running, with all nodes booting and able to communicate with each other via all interfaces on a basic level. Test jobs, submitted via sbatch, are able to run on one node with no problem but will not run on multiple nodes. The jobs are using mpirun, and mvapich2 is installed.

Any job trying to run on multiple nodes ends up timing out, as set via -t, with no output data written and no error messages in the slurm.err or slurm.out files. The job shows up in the squeue output, and the nodes used show up as allocated in the sinfo output.

Thanks,

Chris Woelkers
IT Specialist
National Oceanic and Atmospheric Administration
Great Lakes Environmental Research Laboratory
4840 S State Rd | Ann Arbor, MI 48108
734-741-2446