Re: [slurm-users] Multi-node job failure

2019-12-10 Thread Ree, Jan-Albert van
OK so OpenMPI works fine. That means SLURM, OFED and hardware are fine.

Which mvapich2 package are you using, a home built one or one provided by 
Bright ?
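
A quick way to check is a minimal sketch like the following (assuming a
Bright-style environment-modules setup and an RPM-based node image; adjust
the names to whatever your site actually provides):

module avail mvapich2          # Bright-provided builds show up as modules
rpm -qa | grep -i mvapich      # was it installed from a vendor/Bright RPM?
which mpirun mpiexec           # which launcher is actually first on PATH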


Regards,

--

Jan-Albert


Jan-Albert van Ree | Linux System Administrator | Digital Services
MARIN | T +31 317 49 35 48 | j.a.v@marin.nl | 
www.marin.nl



From: slurm-users  on behalf of Chris 
Woelkers - NOAA Federal 
Sent: Wednesday, December 11, 2019 01:11
To: Slurm User Community List
Subject: Re: [slurm-users] Multi-node job failure

Thanks for the reply and the things to try. Here are the answers to your 
questions/tests in order:

- I tried mpiexec and the same issue occurred.
- While the job is listed as running I checked all the nodes. None of them have 
processes spawned. I have no idea on the hydra process.
- I have version 4.7 of the OFED stack installed on all nodes.
- Using openmpi with the hello world example you listed gives output that 
seems to match what should normally be given. I upped the number of threads to 
16, because 4 doesn't help much, and ran it again with four nodes of 4 threads 
each, and got the following, which looks like good output.
Hello world from processor bearnode14, rank 4 out of 16 processors
Hello world from processor bearnode14, rank 5 out of 16 processors
Hello world from processor bearnode14, rank 6 out of 16 processors
Hello world from processor bearnode15, rank 10 out of 16 processors
Hello world from processor bearnode15, rank 8 out of 16 processors
Hello world from processor bearnode16, rank 13 out of 16 processors
Hello world from processor bearnode15, rank 11 out of 16 processors
Hello world from processor bearnode13, rank 3 out of 16 processors
Hello world from processor bearnode14, rank 7 out of 16 processors
Hello world from processor bearnode15, rank 9 out of 16 processors
Hello world from processor bearnode16, rank 12 out of 16 processors
Hello world from processor bearnode16, rank 14 out of 16 processors
Hello world from processor bearnode16, rank 15 out of 16 processors
Hello world from processor bearnode13, rank 1 out of 16 processors
Hello world from processor bearnode13, rank 0 out of 16 processors
Hello world from processor bearnode13, rank 2 out of 16 processors
- I have not tested our test model with openmpi as it was compiled with Intel 
compilers and expects Intel MPI. It might work but for now I will hold that for 
later. I did test the hello world again using the Intel modules instead of the 
openmpi modules and it still worked.

Thanks,

Chris Woelkers




Re: [slurm-users] Need help with controller issues

2019-12-10 Thread William Brown
The latest MariaDB packaging is different: there is a third RPM needed, as
well as the client and development packages. I'm away from my desk, but the
info is on the MariaDB site.
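
A minimal sketch of that check (the package names below are assumptions and
vary per MariaDB release, so verify them against the MariaDB site):

rpm -qa | grep -i mariadb
# for building against MariaDB you generally want, besides the server:
# the client, the development headers (MariaDB-devel) and the shared
# libraries (MariaDB-shared) - presumably the "3rd RPM" mentioned above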

William

On Wed, 11 Dec 2019, 05:23 Chris Samuel,  wrote:

> On Tuesday, 10 December 2019 1:57:59 PM PST Dean Schulze wrote:
>
> > This bug report from a couple of years ago indicates a source code issue:
> >
> > https://bugs.schedmd.com/show_bug.cgi?id=3278
> >
> > This must have been fixed by now, though.
> >
> > I built using slurm-19.05.2.  Does anyone know if this has been fixed in
> > 19.05.4?
>
> I don't think this is a Slurm issue - have you checked that you have the
> MariaDB development package for your distro installed before trying to build
> Slurm? It will skip things it doesn't find, and that could explain what
> you're seeing.
>
> All the best,
> Chris
> --
>   Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA
>
>
>
>
>


Re: [slurm-users] Need help with controller issues

2019-12-10 Thread Chris Samuel
On Tuesday, 10 December 2019 1:57:59 PM PST Dean Schulze wrote:

> This bug report from a couple of years ago indicates a source code issue:
> 
> https://bugs.schedmd.com/show_bug.cgi?id=3278
> 
> This must have been fixed by now, though.
> 
> I built using slurm-19.05.2.  Does anyone know if this has been fixed in
> 19.05.4?

I don't think this is a Slurm issue - have you checked that you have the 
MariaDB development package for your distro installed before trying to build 
Slurm? It will skip things it doesn't find, and that could explain what you're 
seeing.
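
A minimal sketch of that check (package names are distro-dependent
assumptions: typically mariadb-devel on RHEL/CentOS, libmariadb-dev or
similar on Debian/Ubuntu):

rpm -q mariadb-devel || sudo yum install mariadb-devel   # RHEL/CentOS style
# rebuild Slurm, then confirm configure really found the MySQL/MariaDB headers
grep -i mysql config.log                                 # in the Slurm build dir
# after make install the plugin should exist:
ls /usr/lib/slurm/accounting_storage_mysql.so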

All the best,
Chris
-- 
  Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA






Re: [slurm-users] Multi-node job failure

2019-12-10 Thread Chris Samuel
Hi Chris,

On Tuesday, 10 December 2019 11:49:44 AM PST Chris Woelkers - NOAA Federal 
wrote:

> Test jobs, submitted via sbatch, are able to run on one node with no problem
> but will not run on multiple nodes. The jobs are using mpirun and mvapich2
> is installed.

Is there a reason why you aren't using srun for launching these?

https://slurm.schedmd.com/mpi_guide.html

If you're using mpirun (unless you've built mvapich2 with Slurm support) then 
you'll be relying on ssh to launch tasks, and that could be what's broken for 
you. Running with srun will avoid that and allow Slurm to track your processes 
correctly.
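
A minimal sketch of the srun-based launch (assuming your mvapich2 was built
with PMI2 support; srun --mpi=list shows which MPI plugins your Slurm
installation actually offers):

#!/bin/bash
#SBATCH -N 4
#SBATCH --ntasks-per-node=4
srun --mpi=pmi2 ./hello   # Slurm launches and tracks the tasks; no ssh involved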

All the best,
Chris
-- 
  Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA






Re: [slurm-users] Multi-node job failure

2019-12-10 Thread Paul Kenyon
Hi Chris,

Your issue sounds similar to a case I ran into once, where I could run jobs
on a few nodes, but once it spanned more than a handful it would fail.  In
that particular case, we figured out that it was due to broadcast storm
protection being enabled on the cluster switch.  When the first node's
slurmd started the job, it would send out a ton of ARP requests for each of
the other nodes so it could contact them.  That triggered the broadcast
storm protection several ARPs in, so a subset of the nodes couldn't be
reached and the job wouldn't start.  Disabling broadcast storm protection
in the switch solved the problems.
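
A rough way to check for this from the first node of a hung job (the bearnode
names below are just taken from the output earlier in this thread):

for n in bearnode13 bearnode14 bearnode15 bearnode16; do
    ping -c1 -W1 "$n" > /dev/null && echo "$n reachable" || echo "$n NOT reachable"
done
# a burst of failures right after job start would point at the switch dropping
# the ARP/broadcast traffic described above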

Hope it turns out to be this simple - if not, good luck!

Regards,

-Paul

--
Paul Kenyon
Advanced Clustering Technologies, Inc.
Main: 866-802-8222
Direct: 913-643-0306
pken...@advancedclustering.com


On Tue, Dec 10, 2019 at 6:13 PM Chris Woelkers - NOAA Federal <
chris.woelk...@noaa.gov> wrote:

> Thanks for the reply and the things to try. Here are the answers to your
> questions/tests in order:
>
> - I tried mpiexec and the same issue occurred.
> - While the job is listed as running I checked all the nodes. None of them
> have processes spawned. I have no idea on the hydra process.
> - I have version 4.7 of the OFED stack installed on all nodes.
> - Using openmpi with the hello world example you listed gives output
> that seems to match what should normally be given. I upped the number of
> threads to 16, because 4 doesn't help much, and ran it again with four
> nodes of 4 threads each, and got the following, which looks like good output.
> Hello world from processor bearnode14, rank 4 out of 16 processors
> Hello world from processor bearnode14, rank 5 out of 16 processors
> Hello world from processor bearnode14, rank 6 out of 16 processors
> Hello world from processor bearnode15, rank 10 out of 16 processors
> Hello world from processor bearnode15, rank 8 out of 16 processors
> Hello world from processor bearnode16, rank 13 out of 16 processors
> Hello world from processor bearnode15, rank 11 out of 16 processors
> Hello world from processor bearnode13, rank 3 out of 16 processors
> Hello world from processor bearnode14, rank 7 out of 16 processors
> Hello world from processor bearnode15, rank 9 out of 16 processors
> Hello world from processor bearnode16, rank 12 out of 16 processors
> Hello world from processor bearnode16, rank 14 out of 16 processors
> Hello world from processor bearnode16, rank 15 out of 16 processors
> Hello world from processor bearnode13, rank 1 out of 16 processors
> Hello world from processor bearnode13, rank 0 out of 16 processors
> Hello world from processor bearnode13, rank 2 out of 16 processors
> - I have not tested our test model with openmpi as it was compiled with
> Intel compilers and expects Intel MPI. It might work but for now I will
> hold that for later. I did test the hello world again using the Intel
> modules instead of the openmpi modules and it still worked.
>
> Thanks,
>
> Chris Woelkers
> IT Specialist
> National Oceanic and Atmospheric Administration
> Great Lakes Environmental Research Laboratory
> 4840 S State Rd | Ann Arbor, MI 48108
> 734-741-2446
>
>
> On Tue, Dec 10, 2019 at 4:36 PM Ree, Jan-Albert van 
> wrote:
>
>> We're running multiple clusters using Bright 8.x with Scientific Linux 7
>> (and have run Scientific Linux releases 5 and 6 with Bright 5.0 and higher
>> in the past without issues on many different pieces of hardware) and never
>> experienced this. But some things to test :
>>
>>
>> - some implementations prefer mpiexec over mpirun , have you tried that
>> instead ?
>>
>> - if you log in to a node while a job is 'hanging', do you see that on
>> each node the right number of processes is spawned ? Is the node list of
>> all nodes involved in the job passed to the hydra process on all nodes ?
>>
>> - which version of the Mellanox OFED stack are you using ? One of our
>> vendors recommended against OFED 4.6 due to issues, mostly related to IP
>> over IB but still ; you might want to try 4.5 just to rule things out.
>>
>> - what happens if you use openmpi (as supplied by Bright) together with a
>> simple hello world example ? There's a good one at
>> https://mpitutorial.com/tutorials/mpi-hello-world/ which I know to work
>> fine with Bright supplied openmpi
>>
>> - what happens if you test with openmpi and force it to use ethernet
>> instead of infiniband ? See https://www.open-mpi.org/faq/?category=tcp
>> for info to force a specific interface with openmpi
>>
>>
>> I've just successfully tested the above hello-world example with the
>> Bright supplied mvapich2/gcc/64/2.3b to compile the code , with the jobfile
>> below to run it over 2 nodes, each 20 cores.
>>
>>
>> #!/bin/bash
>> #SBATCH -n 40
>> #SBATCH --exclusive
>> #SBATCH --partition=normal
>> #SBATCH --job-name=P8.000_test
>> #SBATCH --time=2:00:00
>> #SBATCH --ntasks-per-node=20
>> #SBATCH --begin=now
>> #SBATCH --error=errors
>> #SBATCH --output=output

Re: [slurm-users] Multi-node job failure

2019-12-10 Thread Chris Woelkers - NOAA Federal
Thanks for the reply and the things to try. Here are the answers to your
questions/tests in order:

- I tried mpiexec and the same issue occurred.
- While the job is listed as running I checked all the nodes. None of them
have processes spawned. I have no idea on the hydra process.
- I have version 4.7 of the OFED stack installed on all nodes.
- Using openmpi with the hello world example you listed gives output
that seems to match what should normally be given. I upped the number of
threads to 16, because 4 doesn't help much, and ran it again with four
nodes of 4 threads each, and got the following, which looks like good output.
Hello world from processor bearnode14, rank 4 out of 16 processors
Hello world from processor bearnode14, rank 5 out of 16 processors
Hello world from processor bearnode14, rank 6 out of 16 processors
Hello world from processor bearnode15, rank 10 out of 16 processors
Hello world from processor bearnode15, rank 8 out of 16 processors
Hello world from processor bearnode16, rank 13 out of 16 processors
Hello world from processor bearnode15, rank 11 out of 16 processors
Hello world from processor bearnode13, rank 3 out of 16 processors
Hello world from processor bearnode14, rank 7 out of 16 processors
Hello world from processor bearnode15, rank 9 out of 16 processors
Hello world from processor bearnode16, rank 12 out of 16 processors
Hello world from processor bearnode16, rank 14 out of 16 processors
Hello world from processor bearnode16, rank 15 out of 16 processors
Hello world from processor bearnode13, rank 1 out of 16 processors
Hello world from processor bearnode13, rank 0 out of 16 processors
Hello world from processor bearnode13, rank 2 out of 16 processors
- I have not tested our test model with openmpi as it was compiled with
Intel compilers and expects Intel MPI. It might work but for now I will
hold that for later. I did test the hello world again using the Intel
modules instead of the openmpi modules and it still worked.

Thanks,

Chris Woelkers
IT Specialist
National Oceanic and Atmospheric Administration
Great Lakes Environmental Research Laboratory
4840 S State Rd | Ann Arbor, MI 48108
734-741-2446


On Tue, Dec 10, 2019 at 4:36 PM Ree, Jan-Albert van 
wrote:

> We're running multiple clusters using Bright 8.x with Scientific Linux 7
> (and have run Scientific Linux releases 5 and 6 with Bright 5.0 and higher
> in the past without issues on many different pieces of hardware) and never
> experienced this. But some things to test :
>
>
> - some implementations prefer mpiexec over mpirun , have you tried that
> instead ?
>
> - if you log in to a node while a job is 'hanging', do you see that on
> each node the right number of processes is spawned ? Is the node list of
> all nodes involved in the job passed to the hydra process on all nodes ?
>
> - which version of the Mellanox OFED stack are you using ? One of our
> vendors recommended against OFED 4.6 due to issues, mostly related to IP
> over IB but still ; you might want to try 4.5 just to rule things out.
>
> - what happens if you use openmpi (as supplied by Bright) together with a
> simple hello world example ? There's a good one at
> https://mpitutorial.com/tutorials/mpi-hello-world/ which I know to work
> fine with Bright supplied openmpi
>
> - what happens if you test with openmpi and force it to use ethernet
> instead of infiniband ? See https://www.open-mpi.org/faq/?category=tcp
> for info to force a specific interface with openmpi
>
>
> I've just successfully tested the above hello-world example with the
> Bright supplied mvapich2/gcc/64/2.3b to compile the code , with the jobfile
> below to run it over 2 nodes, each 20 cores.
>
>
> #!/bin/bash
> #SBATCH -n 40
> #SBATCH --exclusive
> #SBATCH --partition=normal
> #SBATCH --job-name=P8.000_test
> #SBATCH --time=2:00:00
> #SBATCH --ntasks-per-node=20
> #SBATCH --begin=now
> #SBATCH --error=errors
> #SBATCH --output=output
> source /etc/profile.d/modules.sh
> module load mvapich2/gcc/64/2.3b
> mpiexec -n 40 ./hello
>
>
>
> Good luck!
>
> --
>
> Jan-Albert van Ree
>
>
>
> Jan-Albert van Ree | Linux System Administrator | Digital Services
> MARIN | T +31 317 49 35 48 | j.a.v@marin.nl | www.marin.nl
>
>
> --
> *From:* slurm-users  on behalf of
> Chris Woelkers - NOAA Federal 
> *Sent:* Tuesday, December 10, 2019 20:49
> *To:* slurm-users@lists.schedmd.com
> *Subject:* [slurm-users] Multi-node job failure
>
> I have a 16 node HPC that is in the process of being upgraded from CentOS
> 6 to 7. All nodes are diskless and connected via 1Gbps Ethern

Re: [slurm-users] Need help with controller issues

2019-12-10 Thread Dean Schulze
There's a problem with the accounting_storage/mysql plugin:

$ sudo  slurmdbd -D -
slurmdbd: debug:  Log file re-opened
slurmdbd: pidfile not locked, assuming no running daemon
slurmdbd: debug3: Trying to load plugin /usr/lib/slurm/auth_munge.so
slurmdbd: debug:  Munge authentication plugin loaded
slurmdbd: debug3: Success.
slurmdbd: debug3: Trying to load plugin
/usr/lib/slurm/accounting_storage_mysql.so
slurmdbd: error: Couldn't find the specified plugin name for
accounting_storage/mysql looking at all files
slurmdbd: error: cannot find accounting_storage plugin for
accounting_storage/mysql
slurmdbd: error: cannot create accounting_storage context for
accounting_storage/mysql
slurmdbd: fatal: Unable to initialize accounting_storage/mysql accounting
storage plugin


This bug report from a couple of years ago indicates a source code issue:

https://bugs.schedmd.com/show_bug.cgi?id=3278

This must have been fixed by now, though.

I built using slurm-19.05.2.  Does anyone know if this has been fixed in
19.05.4?
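
One quick thing to confirm before chasing the bug report is whether the plugin
was built and installed at all; the path below is the one slurmdbd reports
above:

ls -l /usr/lib/slurm/ | grep accounting_storage
# if accounting_storage_mysql.so is missing, the build skipped MySQL support,
# which usually means the MariaDB/MySQL development headers were not present
# at configure time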



On Tue, Dec 10, 2019 at 2:05 PM Dean Schulze 
wrote:

> I'm trying to set up my first slurm installation following these
> instructions:
>
> https://github.com/nateGeorge/slurm_gpu_ubuntu
>
> I've had to deviate a little bit because I'm using virtual machines that
> don't have GPUs, so I don't have a gres.conf file and in
> /etc/slurm/slurm.conf I don't have an entry like Gres=gpu:2 on the last
> line.
>
> On my controller VM I get errors when trying to do simple commands:
>
> $ sinfo
> slurm_load_partitions: Unable to contact slurm controller (connect failure)
>
> $ sudo sacctmgr add cluster compute-cluster
> sacctmgr: error: slurm_persist_conn_open_without_init: failed to open
> persistent connection to localhost:6819: Connection refused
> sacctmgr: error: slurmdbd: Sending PersistInit msg: Connection refused
> sacctmgr: error: Problem talking to the database: Connection refused
>
>
> Something is supposed to be running on port 6819, but netstat shows
> nothing using that port.  What is supposed to be running on 6819?
>
> My database (Maria) is running.  I can connect to it with `sudo mysql -U
> root`.
>
> When I boot my controller which services are supposed to be running and on
> which ports?
>
> Thanks.
>
>


Re: [slurm-users] Multi-node job failure

2019-12-10 Thread Ree, Jan-Albert van
We're running multiple clusters using Bright 8.x with Scientific Linux 7 (and 
have run Scientific Linux releases 5 and 6 with Bright 5.0 and higher in the 
past without issues on many different pieces of hardware) and never experienced 
this. But some things to test :


- some implementations prefer mpiexec over mpirun , have you tried that instead 
?

- if you log in to a node while a job is 'hanging', do you see that on each 
node the right number of processes is spawned ? Is the node list of all nodes 
involved in the job passed to the hydra process on all nodes ?

- which version of the Mellanox OFED stack are you using ? One of our vendors 
recommended against OFED 4.6 due to issues, mostly related to IP over IB but 
still ; you might want to try 4.5 just to rule things out.

- what happens if you use openmpi (as supplied by Bright) together with a 
simple hello world example ? There's a good one at 
https://mpitutorial.com/tutorials/mpi-hello-world/ which I know to work fine 
with Bright supplied openmpi

- what happens if you test with openmpi and force it to use ethernet instead of 
infiniband ? See https://www.open-mpi.org/faq/?category=tcp for info to force a 
specific interface with openmpi
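
For those last two tests, a minimal sketch (the module and interface names are
assumptions for a Bright-style setup; substitute whatever module avail openmpi
and ip addr show on your nodes):

module load openmpi/gcc/64            # hypothetical module name, pick yours
mpicc mpi_hello_world.c -o hello      # the example from mpitutorial.com
# force the TCP BTL over a specific ethernet interface instead of infiniband:
mpirun --mca btl self,tcp --mca btl_tcp_if_include eth0 -np 16 ./hello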


I've just successfully tested the above hello-world example with the Bright 
supplied mvapich2/gcc/64/2.3b to compile the code , with the jobfile below to 
run it over 2 nodes, each 20 cores.


#!/bin/bash
#SBATCH -n 40
#SBATCH --exclusive
#SBATCH --partition=normal
#SBATCH --job-name=P8.000_test
#SBATCH --time=2:00:00
#SBATCH --ntasks-per-node=20
#SBATCH --begin=now
#SBATCH --error=errors
#SBATCH --output=output
source /etc/profile.d/modules.sh
module load mvapich2/gcc/64/2.3b
mpiexec -n 40 ./hello
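
(Usage sketch, assuming the script above is saved as hello.job:)

sbatch hello.job
squeue -u $USER        # watch it start on two nodes
cat output errors      # results end up in the files named above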



Good luck!

--

Jan-Albert van Ree


Jan-Albert van Ree | Linux System Administrator | Digital Services
MARIN | T +31 317 49 35 48 | j.a.v@marin.nl | 
www.marin.nl



From: slurm-users  on behalf of Chris 
Woelkers - NOAA Federal 
Sent: Tuesday, December 10, 2019 20:49
To: slurm-users@lists.schedmd.com
Subject: [slurm-users] Multi-node job failure

I have a 16 node HPC that is in the process of being upgraded from CentOS 6 to 
7. All nodes are diskless and connected via 1Gbps Ethernet and FDR Infiniband. 
I am using Bright Cluster Management to manage it and their support has not 
found a solution to this problem.
For the most part the cluster is up and running with all nodes booting and able 
to communicate with each other via all interfaces on a basic level.
Test jobs, submitted via sbatch, are able to run on one node with no problem 
but will not run on multiple nodes. The jobs are using mpirun and mvapich2 is 
installed.
Any job trying to run on multiple nodes ends up timing out, as set via -t, with 
no output data written and no error messages in the slurm.err or slurm.out 
files. The job shows up in the squeue output and the nodes used show up as 
allocated in the sinfo output.

Thanks,

Chris Woelkers
IT Specialist
National Oceanic and Atmospheric Administration
Great Lakes Environmental Research Laboratory
4840 S State Rd | Ann Arbor, MI 48108
734-741-2446







Re: [slurm-users] Need help with controller issues

2019-12-10 Thread Dean Schulze
$ systemctl status slurmdbd
● slurmdbd.service - Slurm DBD accounting daemon
   Loaded: loaded (/etc/systemd/system/slurmdbd.service; enabled; vendor
preset: enabled)
   Active: failed (Result: exit-code) since Tue 2019-12-10 13:33:28 MST;
40min ago
  Process: 787 ExecStart=/usr/sbin/slurmdbd $SLURMDBD_OPTIONS (code=exited,
status=0/SUCCESS)
 Main PID: 791 (code=exited, status=1/FAILURE)

Dec 10 13:33:28 ubuntu-controller.liqid.com systemd[1]: Starting Slurm DBD
accounting daemon...
Dec 10 13:33:28 ubuntu-controller.liqid.com systemd[1]: Started Slurm DBD
accounting daemon.
Dec 10 13:33:28 ubuntu-controller.liqid.com slurmdbd[791]: fatal: Unable to
initialize accounting_storage/mysql accounting storage plugin
Dec 10 13:33:28 ubuntu-controller.liqid.com systemd[1]: slurmdbd.service:
Main process exited, code=exited, status=1/FAILURE
Dec 10 13:33:28 ubuntu-controller.liqid.com systemd[1]: slurmdbd.service:
Failed with result 'exit-code'.
$ systemctl status slurmctld
● slurmctld.service - Slurm controller daemon
   Loaded: loaded (/etc/systemd/system/slurmctld.service; enabled; vendor
preset: enabled)
   Active: failed (Result: exit-code) since Tue 2019-12-10 13:33:28 MST;
41min ago
  Process: 788 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS
(code=exited, status=0/SUCCESS)
 Main PID: 796 (code=exited, status=1/FAILURE)

Dec 10 13:33:28 ubuntu-controller.liqid.com systemd[1]: Starting Slurm
controller daemon...
Dec 10 13:33:28 ubuntu-controller.liqid.com systemd[1]: Started Slurm
controller daemon.
Dec 10 13:33:28 ubuntu-controller.liqid.com slurmctld[796]: fatal: You are
running with a database but for some reason we have no TRES from it.  Th
Dec 10 13:33:28 ubuntu-controller.liqid.com systemd[1]: slurmctld.service:
Main process exited, code=exited, status=1/FAILURE
Dec 10 13:33:28 ubuntu-controller.liqid.com systemd[1]: slurmctld.service:
Failed with result 'exit-code'.
$

One issue is with a database plugin.  During database setup this command
failed:

sudo systemctl enable mysql

I did this instead

sudo systemctl enable mariadb.service

Maybe there is some config that has to be modified to use MariaDB instead of
MySQL?
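
No Slurm config change is needed for MariaDB as such: the plugin name stays
accounting_storage/mysql either way. For reference, a minimal slurmdbd.conf
sketch with placeholder host/credential values:

# /etc/slurm/slurmdbd.conf (placeholders, not site-specific values)
AuthType=auth/munge
DbdHost=localhost
StorageType=accounting_storage/mysql   # same plugin for MySQL and MariaDB
StorageHost=localhost
StorageUser=slurm
StoragePass=change_me
StorageLoc=slurm_acct_db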


On Tue, Dec 10, 2019 at 2:13 PM Renfro, Michael  wrote:

> What do you get from
>
> systemctl status slurmdbd
> systemctl status slurmctld
>
> I’m assuming at least slurmdbd isn’t running.
>
> > On Dec 10, 2019, at 3:05 PM, Dean Schulze 
> wrote:
> >
> > External Email Warning
> > This email originated from outside the university. Please use caution
> when opening attachments, clicking links, or responding to requests.
> > I'm trying to set up my first slurm installation following these
> instructions:
> >
> > https://github.com/nateGeorge/slurm_gpu_ubuntu
> >
> > I've had to deviate a little bit because I'm using virtual machines that
> don't have GPUs, so I don't have a gres.conf file and in
> /etc/slurm/slurm.conf I don't have an entry like Gres=gpu:2 on the last
> line.
> >
> > On my controller VM I get errors when trying to do simple commands:
> >
> > $ sinfo
> > slurm_load_partitions: Unable to contact slurm controller (connect
> failure)
> >
> > $ sudo sacctmgr add cluster compute-cluster
> > sacctmgr: error: slurm_persist_conn_open_without_init: failed to open
> persistent connection to localhost:6819: Connection refused
> > sacctmgr: error: slurmdbd: Sending PersistInit msg: Connection refused
> > sacctmgr: error: Problem talking to the database: Connection refused
> >
> >
> > Something is supposed to be running on port 6819, but netstat shows
> nothing using that port.  What is supposed to be running on 6819?
> >
> > My database (Maria) is running.  I can connect to it with `sudo mysql -U
> root`.
> >
> > When I boot my controller which services are supposed to be running and
> on which ports?
> >
> > Thanks.
> >
>
>


Re: [slurm-users] Need help with controller issues

2019-12-10 Thread Renfro, Michael
What do you get from

systemctl status slurmdbd
systemctl status slurmctld

I’m assuming at least slurmdbd isn’t running.

> On Dec 10, 2019, at 3:05 PM, Dean Schulze  wrote:
> 
> External Email Warning
> This email originated from outside the university. Please use caution when 
> opening attachments, clicking links, or responding to requests.
> I'm trying to set up my first slurm installation following these instructions:
> 
> https://github.com/nateGeorge/slurm_gpu_ubuntu
> 
> I've had to deviate a little bit because I'm using virtual machines that 
> don't have GPUs, so I don't have a gres.conf file and in 
> /etc/slurm/slurm.conf I don't have an entry like Gres=gpu:2 on the last line.
> 
> On my controller VM I get errors when trying to do simple commands:
> 
> $ sinfo
> slurm_load_partitions: Unable to contact slurm controller (connect failure)
> 
> $ sudo sacctmgr add cluster compute-cluster
> sacctmgr: error: slurm_persist_conn_open_without_init: failed to open 
> persistent connection to localhost:6819: Connection refused
> sacctmgr: error: slurmdbd: Sending PersistInit msg: Connection refused
> sacctmgr: error: Problem talking to the database: Connection refused
> 
> 
> Something is supposed to be running on port 6819, but netstat shows nothing 
> using that port.  What is supposed to be running on 6819?
> 
> My database (Maria) is running.  I can connect to it with `sudo mysql -U 
> root`.
> 
> When I boot my controller which services are supposed to be running and on 
> which ports?
> 
> Thanks.
> 



[slurm-users] Need help with controller issues

2019-12-10 Thread Dean Schulze
I'm trying to set up my first slurm installation following these
instructions:

https://github.com/nateGeorge/slurm_gpu_ubuntu

I've had to deviate a little bit because I'm using virtual machines that
don't have GPUs, so I don't have a gres.conf file and in
/etc/slurm/slurm.conf I don't have an entry like Gres=gpu:2 on the last
line.

On my controller VM I get errors when trying to do simple commands:

$ sinfo
slurm_load_partitions: Unable to contact slurm controller (connect failure)

$ sudo sacctmgr add cluster compute-cluster
sacctmgr: error: slurm_persist_conn_open_without_init: failed to open
persistent connection to localhost:6819: Connection refused
sacctmgr: error: slurmdbd: Sending PersistInit msg: Connection refused
sacctmgr: error: Problem talking to the database: Connection refused


Something is supposed to be running on port 6819, but netstat shows nothing
using that port.  What is supposed to be running on 6819?

My database (Maria) is running.  I can connect to it with `sudo mysql -U
root`.

When I boot my controller which services are supposed to be running and on
which ports?
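
For reference, the Slurm defaults are slurmctld on 6817, slurmd on 6818 and
slurmdbd on 6819, so nothing listening on 6819 suggests slurmdbd is not
running. A quick check:

ss -tlnp | grep -E ':(6817|6818|6819)'
systemctl status slurmctld slurmdbd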

Thanks.


[slurm-users] Multi-node job failure

2019-12-10 Thread Chris Woelkers - NOAA Federal
I have a 16 node HPC that is in the process of being upgraded from CentOS 6
to 7. All nodes are diskless and connected via 1Gbps Ethernet and FDR
Infiniband. I am using Bright Cluster Management to manage it and their
support has not found a solution to this problem.
For the most part the cluster is up and running with all nodes booting and
able to communicate with each other via all interfaces on a basic level.
Test jobs, submitted via sbatch, are able to run on one node with no
problem but will not run on multiple nodes. The jobs are using mpirun and
mvapich2 is installed.
Any job trying to run on multiple nodes ends up timing out, as set via -t,
with no output data written and no error messages in the slurm.err or
slurm.out files. The job shows up in the squeue output and the nodes used
show up as allocated in the sinfo output.
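
While one of the hung multi-node jobs is listed as running, a few things worth
capturing (<jobid> is a placeholder):

scontrol show job <jobid>                                # allocated nodes, state
sacct -j <jobid> --format=JobID,State,ExitCode,NodeList
# and on one of the allocated nodes:
pgrep -af mpirun; pgrep -af hydra                        # did the launcher start?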

Thanks,

Chris Woelkers
IT Specialist
National Oceanic and Atmospheric Administration
Great Lakes Environmental Research Laboratory
4840 S State Rd | Ann Arbor, MI 48108
734-741-2446


Re: [slurm-users] SLURM_TMPDIR

2019-12-10 Thread Juergen Salk
Hi Angelines,

we create a job specific scratch directory in the prolog script but
use the task_prolog script to set the environment variable.

In prolog:

scratch_dir=/your/path
/bin/mkdir -p ${scratch_dir}
/bin/chmod 700 ${scratch_dir}
/bin/chown ${SLURM_JOB_USER} ${scratch_dir}

In task_prolog:

scratch_dir=/your/path
echo "export TMPDIR=${scratch_dir}"

The fully qualified pathnames of the prolog and task prolog scripts 
need to be defined in slurm.conf, e.g.:

Prolog=/etc/slurm/prolog
TaskProlog=/etc/slurm/task_prolog

We also clean up job specific scratch directories in the epilog
script.
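
The epilog part is roughly the mirror image; a minimal sketch, using the same
placeholder path as above:

In epilog:

scratch_dir=/your/path
/bin/rm -rf ${scratch_dir}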

Best regards
Jürgen

-- 
Jürgen Salk
Scientific Software & Compute Services (SSCS)
Kommunikations- und Informationszentrum (kiz)
Universität Ulm
Telefon: +49 (0)731 50-22478
Telefax: +49 (0)731 50-22471


* Angelines  [191210 07:30]:
> Hi Roger
> 
> thanks for your answer but it doesn't work in our case and I don't
> understand why.
> 
> 
> 
> Angelines Alberto Morillas
> 
> Unidad de Arquitectura Informática
> Despacho: 22.1.32
> Telf.: +34 91 346 6119
> Fax:   +34 91 346 6537
> 
> skype: angelines.alberto
> 
> CIEMAT
> Avenida Complutense, 40
> 28040 MADRID
> 
> 
> El 5/12/19 a las 17:56, Roger Moye escribió:
> > Our prolog script just does this:
> > 
> > export SLURM_TMPDIR="/tmp/slurm/${SLURM_JOB_ID}"
> > 
> > This has worked for us.
> > 
> > -Roger
> > 
> > -Original Message-
> > From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf 
> > Of Angelines
> > Sent: Thursday, December 5, 2019 9:58 AM
> > To: slurm-users@lists.schedmd.com
> > Subject: [slurm-users] SLURM_TMPDIR
> > 
> > Hello,
> > 
> > I would like to change SLURM_TMPDIR, which by default is /tmp, to another
> > place.
> > 
> > Could you help me? Because I have tried with a prolog script, and on another
> > cluster with an older version this worked for me, but not now.
> > 
> > tmpFolder="/SCRATCH_LOCAL/$SLURM_JOB_USER/$SLURM_JOB_ID"
> > mkdir -p $tmpFolder
> > echo "export TMPDIR=$tmpFolder"
> > 
> > --
> > 
> > 
> > Angelines Alberto Morillas
> > 
> > 
> 
> 

-- 
GPG A997BA7A | 87FC DA31 5F00 C885 0DC3  E28F BD0D 4B33 A997 BA7A