Re: [slurm-users] Multi-node job failure

2019-12-12 Thread Chris Samuel

On 11/12/19 8:05 am, Chris Woelkers - NOAA Federal wrote:

Partial progress. The scientist who developed the model took a look at 
the output and found that instead of one model run executing in parallel, 
srun had launched multiple instances of the model, one per thread, which 
for this test meant 110 threads.


This sounds like MVAPICH isn't built to support Slurm; per the Slurm 
MPI guide, you need to build it with the following to enable Slurm support 
(and of course add any other options you were using):


./configure --with-pmi=pmi2 --with-pm=slurm
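For reference, a rebuild along those lines might look like this (the prefix, version, and model binary name are assumptions for illustration; only the two configure flags above come from the Slurm MPI guide):

```shell
# Rebuild MVAPICH2 with Slurm as the process manager and PMI2 support
# (paths and version are illustrative, not from this thread).
tar xf mvapich2-2.3b.tar.gz && cd mvapich2-2.3b
./configure --prefix=/opt/mvapich2-slurm --with-pmi=pmi2 --with-pm=slurm
make -j "$(nproc)" && make install

# Then launch with srun rather than mpirun/mpiexec:
srun --mpi=pmi2 -N 2 --ntasks-per-node=20 ./model
```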

All the best,
Chris
--
 Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA



Re: [slurm-users] Multi-node job failure

2019-12-11 Thread Chris Woelkers - NOAA Federal
Partial progress. The scientist who developed the model took a look at the
output and found that instead of one model run executing in parallel, srun
had launched multiple instances of the model, one per thread, which for this
test meant 110 threads.
I have a feeling this just verified the same thing that the hello-world test
did.
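That symptom is consistent with each srun task initializing MPI as a singleton: without a working PMI connection between srun and the MPI library, every copy believes it is rank 0 of a 1-rank job. A quick way to see which case you're in (the hello binary is the one from the tutorial mentioned elsewhere in the thread; flags are a sketch):

```shell
# With working Slurm/PMI integration, ranks 0..3 are each reported once.
# With broken integration, every task reports "rank 0 out of 1 processors".
srun -N 2 --ntasks-per-node=2 --mpi=pmi2 ./hello
```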

Thanks,

Chris Woelkers
IT Specialist
National Oceanic and Atmospheric Administration
Great Lakes Environmental Research Laboratory
4840 S State Rd | Ann Arbor, MI 48108
734-741-2446




Re: [slurm-users] Multi-node job failure

2019-12-11 Thread Chris Woelkers - NOAA Federal
I tried a simple thing: swapping out mpirun in the sbatch script for srun.
Nothing more, nothing less.
The model is now working on at least two nodes; I will have to test again
on more, but this is progress.
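A minimal version of that change, sketched against the jobfile posted elsewhere in the thread (the model binary name is an assumption):

```shell
#!/bin/bash
#SBATCH -N 2
#SBATCH --ntasks-per-node=20
#SBATCH --time=2:00:00
source /etc/profile.d/modules.sh
module load mvapich2/gcc/64/2.3b
# srun launches one task per allocated slot and wires up PMI itself,
# so no -n argument or ssh-based launcher is needed:
srun ./model
```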

Thanks,

Chris Woelkers




Re: [slurm-users] Multi-node job failure

2019-12-11 Thread Chris Woelkers - NOAA Federal
Thanks all for the ideas and possibilities. I will answer each in turn.

Paul: Neither of the switches in use, Ethernet and InfiniBand, has any
form of broadcast storm protection enabled.

Chris: I have passed your question on to the scientist who created the
sbatch script. I will also look into other scripts that may make use of
srun to find out whether the same thing occurs.

Jan-Albert: The mvapich2 package is provided by Bright and loaded as a
module by the script before mpirun is executed.

Zacarias: The drive that the data and script live on is mounted on
all the nodes at boot.

Thanks,

Chris Woelkers




Re: [slurm-users] Multi-node job failure

2019-12-11 Thread Zacarias Benta
I had a similar issue; please check whether the home drive, or the place
where the data should be stored, is mounted on the nodes.
-- 
Cumprimentos / Best Regards,
Zacarias Benta
INCD @ LIP - UMinho
  




Re: [slurm-users] Multi-node job failure

2019-12-10 Thread Ree, Jan-Albert van
OK, so OpenMPI works fine. That means Slurm, OFED, and the hardware are fine.

Which mvapich2 package are you using: a home-built one or one provided by 
Bright?


Regards,

--

Jan-Albert


Jan-Albert van Ree | Linux System Administrator | Digital Services
MARIN | T +31 317 49 35 48 | j.a.v@marin.nl | www.marin.nl







Re: [slurm-users] Multi-node job failure

2019-12-10 Thread Chris Samuel
Hi Chris,

On Tuesday, 10 December 2019 11:49:44 AM PST Chris Woelkers - NOAA Federal 
wrote:

> Test jobs, submitted via sbatch, are able to run on one node with no problem
> but will not run on multiple nodes. The jobs are using mpirun and mvapich2
> is installed.

Is there a reason why you aren't using srun for launching these?

https://slurm.schedmd.com/mpi_guide.html

If you're using mpirun (unless you've built mvapich2 with Slurm support), 
you'll be relying on ssh to launch tasks, and that could be what's broken 
for you.  Running with srun will avoid that and allow Slurm to track your 
processes correctly.
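One way to check which situation applies (the mpiname utility ships with MVAPICH2; the exact output wording is an assumption):

```shell
# List the MPI plugin types this Slurm installation can offer to srun:
srun --mpi=list

# Show the configure options MVAPICH2 was built with; look for
# --with-pm=slurm / --with-pmi=pmi2 in the output:
mpiname -a | grep -i -e slurm -e pmi
```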

All the best,
Chris






Re: [slurm-users] Multi-node job failure

2019-12-10 Thread Paul Kenyon


Re: [slurm-users] Multi-node job failure

2019-12-10 Thread Chris Woelkers - NOAA Federal
Thanks for the reply and the things to try. Here are the answers to your
questions/tests, in order:

- I tried mpiexec and the same issue occurred.
- While the job was listed as running I checked all the nodes. None of them
had processes spawned. I have no idea about the hydra process.
- I have version 4.7 of the OFED stack installed on all nodes.
- Using openmpi with the hello-world example you listed gives output that
seems to match what should normally be given. I upped the number of tasks
to 16, because 4 doesn't help much, and ran it again with four nodes of 4
tasks each, and got the following, which looks like good output:
Hello world from processor bearnode14, rank 4 out of 16 processors
Hello world from processor bearnode14, rank 5 out of 16 processors
Hello world from processor bearnode14, rank 6 out of 16 processors
Hello world from processor bearnode15, rank 10 out of 16 processors
Hello world from processor bearnode15, rank 8 out of 16 processors
Hello world from processor bearnode16, rank 13 out of 16 processors
Hello world from processor bearnode15, rank 11 out of 16 processors
Hello world from processor bearnode13, rank 3 out of 16 processors
Hello world from processor bearnode14, rank 7 out of 16 processors
Hello world from processor bearnode15, rank 9 out of 16 processors
Hello world from processor bearnode16, rank 12 out of 16 processors
Hello world from processor bearnode16, rank 14 out of 16 processors
Hello world from processor bearnode16, rank 15 out of 16 processors
Hello world from processor bearnode13, rank 1 out of 16 processors
Hello world from processor bearnode13, rank 0 out of 16 processors
Hello world from processor bearnode13, rank 2 out of 16 processors
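For eyeballing larger runs, a quick sanity check that every rank appeared exactly once can be done with awk (a generic sketch, not something from the thread; it assumes the tutorial's exact output format):

```shell
# Reads hello-world output on stdin; field 7 of each line is the rank
# ("Hello world from processor <node>, rank <r> out of <n> processors").
check_ranks() {
  awk '{ seen[$7]++ }
       END {
         for (r = 0; r < 16; r++)
           if (seen[r] != 1) { print "missing or duplicated rank " r; exit 1 }
         print "all 16 ranks present"
       }'
}
```

Usage would be along the lines of `srun -n 16 ./hello | check_ranks`.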
- I have not tested our test model with openmpi, as it was compiled with
Intel compilers and expects Intel MPI. It might work, but for now I will
hold that for later. I did test the hello world again using the Intel
modules instead of the openmpi modules, and it still worked.

Thanks,

Chris Woelkers



Re: [slurm-users] Multi-node job failure

2019-12-10 Thread Ree, Jan-Albert van
We're running multiple clusters using Bright 8.x with Scientific Linux 7 (and 
have run Scientific Linux releases 5 and 6 with Bright 5.0 and higher in the 
past without issues on many different pieces of hardware) and never experienced 
this. But some things to test:

- Some implementations prefer mpiexec over mpirun; have you tried that 
instead?

- If you log in to a node while a job is 'hanging', do you see that on each 
node the right number of processes is spawned? Is the node list of all nodes 
involved in the job passed to the hydra process on all nodes?

- Which version of the Mellanox OFED stack are you using? One of our vendors 
recommended against OFED 4.6 due to issues, mostly related to IP over IB; 
you might want to try 4.5 just to rule things out.

- What happens if you use openmpi (as supplied by Bright) together with a 
simple hello-world example? There's a good one at 
https://mpitutorial.com/tutorials/mpi-hello-world/ which I know works fine 
with the Bright-supplied openmpi.

- What happens if you test with openmpi and force it to use Ethernet instead 
of InfiniBand? See https://www.open-mpi.org/faq/?category=tcp for info on 
forcing a specific interface with openmpi.
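For that last test, the FAQ page boils down to MCA parameters along these lines (the interface name eth0 is an assumption; substitute the cluster's actual gigabit interface):

```shell
# Restrict OpenMPI to the TCP and self transports, and pin TCP traffic
# to the Ethernet interface so InfiniBand is ruled out entirely:
mpirun --mca btl tcp,self --mca btl_tcp_if_include eth0 -n 40 ./hello
```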


I've just successfully tested the above hello-world example, compiled with 
the Bright-supplied mvapich2/gcc/64/2.3b, with the jobfile below to run it 
over 2 nodes of 20 cores each.


#!/bin/bash
#SBATCH -n 40
#SBATCH --exclusive
#SBATCH --partition=normal
#SBATCH --job-name=P8.000_test
#SBATCH --time=2:00:00
#SBATCH --ntasks-per-node=20
#SBATCH --begin=now
#SBATCH --error=errors
#SBATCH --output=output
source /etc/profile.d/modules.sh
module load mvapich2/gcc/64/2.3b
mpiexec -n 40 ./hello



Good luck!

--

Jan-Albert van Ree








[slurm-users] Multi-node job failure

2019-12-10 Thread Chris Woelkers - NOAA Federal
I have a 16-node HPC that is in the process of being upgraded from CentOS 6
to 7. All nodes are diskless and connected via 1Gbps Ethernet and FDR
InfiniBand. I am using Bright Cluster Manager to manage it, and their
support has not found a solution to this problem.
For the most part the cluster is up and running, with all nodes booting and
able to communicate with each other via all interfaces on a basic level.
Test jobs, submitted via sbatch, are able to run on one node with no
problem but will not run on multiple nodes. The jobs use mpirun, and
mvapich2 is installed.
Any job trying to run on multiple nodes ends up timing out, as set via -t,
with no output data written and no error messages in the slurm.err or
slurm.out files. The job shows up in the squeue output and the nodes used
show up as allocated in the sinfo output.
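For jobs that hang like this, a few standard Slurm commands usually narrow things down (the job ID and node name here are placeholders):

```shell
# Where the job landed and why it is in its current state:
scontrol show job 12345 | grep -E 'JobState|NodeList|Reason'

# Whether any steps ever started, from the accounting database:
sacct -j 12345 --format=JobID,State,ExitCode,NodeList

# Whether MPI processes actually exist on an allocated node:
ssh bearnode13 'ps -ef | grep [m]pi'
```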

Thanks,

Chris Woelkers