[slurm-dev] Re: Error: Unable to contact slurm controller

2014-08-24 Thread Andrew Elwell
Hi Gerry,


 [2014-08-21T09:30:09.673] fatal: system has no usable batch compute nodes


We see this on our systems (running Slurm with ALPS/BASIL rather than native)
when slurmctld starts before the sdb has a list of batch nodes. It has
bitten us when we've set the nodes to interactive rather than batch, and
more regularly when we've restarted the sdb and slurmctld has started too
early in the boot process. (A quick 'service slurm restart' sorts that out, though.)

Andrew
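
One way to avoid the start-up race described above is to hold off starting slurmctld until the sdb reports at least one batch node. The following is only a rough sketch of that idea: `count_batch_nodes` is a hypothetical, site-specific probe that you would have to write yourself (it is not part of Slurm or ALPS), and the timeout values are arbitrary.

```python
import time
from typing import Callable


def wait_for_batch_nodes(
    count_batch_nodes: Callable[[], int],
    timeout_s: float = 300.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the probe reports at least one batch node, or give up.

    count_batch_nodes is a placeholder for whatever query your site can
    run against the sdb; it must return the number of batch nodes seen.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if count_batch_nodes() > 0:
            return True   # safe to start slurmctld now
        if time.monotonic() >= deadline:
            return False  # still no batch nodes; leave it to the admin
        sleep(interval_s)
```

If the function returns True, the boot script can go on to start slurmctld; if it returns False, the manual 'service slurm restart' mentioned above is still the fallback once the sdb is healthy.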


[slurm-dev] Re: Error: Unable to contact slurm controller

2014-08-21 Thread Gerry Creager - NOAA Affiliate
No, slurmctld isn't running. Now. It was when I started, but I suspect I
made at least one mod too many to slurm.conf. When I try to start
slurmctld, I get these in slurmctld.log:
[2014-08-21T09:30:09.626] debug2: No ApbasilTimeout configured (65534)
[2014-08-21T09:30:09.630] debug2: No ApbasilTimeout configured (65534)
[2014-08-21T09:30:09.673] fatal: system has no usable batch compute nodes


I've just made a mod to slurm.conf that makes sure there's a default
partition. I'd had named partitions in previously, but got some errors and
warnings when trying to get the partition naming right in #SBATCH, so I'd
gone back to the default config.

This appears to have started with a reboot several days ago. I'm now making
sure it's not something deeper causing a Gemini network problem.

Thanks, Trey!
gerry



[slurm-dev] Re: Error: Unable to contact slurm controller

2014-08-20 Thread jette


Try this:
http://slurm.schedmd.com/troubleshoot.html

Quoting Gerry Creager - NOAA Affiliate gerry.crea...@noaa.gov:


I'm trying to learn how to use and administer slurm on a new Cray system,
and started seeing this yesterday:
squeue
slurm_load_jobs error: Unable to contact slurm controller (connect failure)

I'm at a loss as to how to proceed.

Thanks, Gerry
--
Gerry Creager
NSSL/CIMMS
405.325.6371
++
“Big whorls have little whorls,
That feed on their velocity;
And little whorls have lesser whorls,
And so on to viscosity.”
Lewis Fry Richardson (1881-1953)



--
Morris Moe Jette
CTO, SchedMD LLC

Slurm User Group Meeting
September 23-24, Lugano, Switzerland
Find out more http://slurm.schedmd.com/slurm_ug_agenda.html


[slurm-dev] Re: Error: Unable to contact slurm controller

2014-08-20 Thread Gerry Creager - NOAA Affiliate
Moe,

Thanks. I've tried. I'm noting a pair of errors in the slurmctld.log file:

[2014-08-20T15:58:58.458] debug:  No DownNodes
[2014-08-20T15:58:58.458] fatal: No PartitionName information available!

So far, Google hasn't helped me much in this regard.

gerry
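
When slurmctld refuses to start, the line that matters is usually the last "fatal" entry, and it is easy to lose among the debug output. A small sketch for pulling those entries out of log text follows; the line format is inferred only from the excerpts quoted in this thread, so treat the regular expression as an assumption rather than a description of Slurm's log format in general.

```python
import re

# Matches lines such as "[2014-08-20T15:58:58.458] fatal: No PartitionName ..."
# The leading "[" is optional because pasted excerpts sometimes lose it.
LOG_LINE = re.compile(
    r"^\[?(?P<ts>[0-9T:.\-]+)\]\s+(?P<level>\w+):\s+(?P<msg>.*)$"
)


def fatal_messages(log_text):
    """Return (timestamp, message) for every 'fatal' line in the log text."""
    found = []
    for line in log_text.splitlines():
        match = LOG_LINE.match(line.strip())
        if match and match.group("level") == "fatal":
            found.append((match.group("ts"), match.group("msg")))
    return found


sample = """\
[2014-08-20T15:58:58.458] debug:  No DownNodes
[2014-08-20T15:58:58.458] fatal: No PartitionName information available!
"""
print(fatal_messages(sample))
```

To use it on a real system, read the contents of slurmctld.log (wherever your site keeps it) and pass the text to `fatal_messages`.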




[slurm-dev] Re: Error: Unable to contact slurm controller

2014-08-20 Thread Trey Dockendorf

What's your slurm.conf look like?  Do you have valid Nodes and Partitions 
defined?

For example:

egrep '^(PartitionName|NodeName)' /etc/slurm/slurm.conf

Sounds like an invalid slurm.conf is preventing slurmctld from starting.

- Trey

=

Trey Dockendorf 
Systems Analyst I 
Texas A&M University 
Academy for Advanced Telecommunications and Learning Technologies 
Phone: (979)458-2396 
Email: treyd...@tamu.edu 
Jabber: treyd...@tamu.edu


[slurm-dev] Re: Error: Unable to contact slurm controller

2014-08-20 Thread Gerry Creager - NOAA Affiliate
Hi, Trey

That's what I am intuiting, as well, but:

gerry@loki:~/software/wrf/NME/DART_Lanai/models/wrf/work egrep '^(PartitionName|NodeName)' /opt/slurm/default/etc/slurm.conf
NodeName=nid00[002-007,024-029,040-043,046-049,052-055,064-071,088-099,100-103,120-127,136-151,160-167,184-199,216-223,232-247,256-263,280-287] Sockets=4 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=65536
PartitionName=DEFAULT Shared=EXCLUSIVE State=UP DefaultTime=60 Nodes=nid00[002-007,024-029,040-043,046-049,052-055,064-071,088-099,100-103,120-127,136-151,160-167,184-199,216-223,232-247,256-263,280-287] MaxNodes=12

looks pretty normal.

gerry
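
With bracket expressions this long it is hard to tell by eye whether the partition's Nodes= list really matches the NodeName= definition. As a sanity check, here is a minimal sketch that expands both expressions and reports any node the partition names but the node definition lacks. It handles only the single "prefix[a-b,...]" form used in the config above and is not a full reimplementation of Slurm's hostlist syntax; on a live system `scontrol show hostnames` should do the expansion for you.

```python
import re


def expand_hostlist(expr):
    """Expand a simple 'prefix[a-b,c,...]' expression into host names."""
    match = re.fullmatch(r"([^\[\]]+)\[([0-9,\-]+)\]", expr)
    if not match:
        return [expr]  # a plain host name with no bracket group
    prefix, body = match.groups()
    hosts = []
    for part in body.split(","):
        if not part:
            continue  # tolerate a stray comma
        low, _, high = part.partition("-")
        high = high or low
        width = len(low)  # keep zero padding, e.g. 002 -> nid00002
        for n in range(int(low), int(high) + 1):
            hosts.append(f"{prefix}{n:0{width}d}")
    return hosts


def undefined_nodes(node_expr, partition_expr):
    """Nodes listed in a partition's Nodes= but absent from NodeName=."""
    defined = set(expand_hostlist(node_expr))
    return [h for h in expand_hostlist(partition_expr) if h not in defined]


NODES = ("nid00[002-007,024-029,040-043,046-049,052-055,064-071,088-099,"
         "100-103,120-127,136-151,160-167,184-199,216-223,232-247,"
         "256-263,280-287]")
print(len(expand_hostlist(NODES)))
print(undefined_nodes(NODES, NODES))
```

An empty list from `undefined_nodes` means the two expressions agree, which points away from a node-list typo as the cause of the failure.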





[slurm-dev] Re: Error: Unable to contact slurm controller

2014-08-20 Thread Trey Dockendorf

Is slurmctld running?  My guess is that you need at least one partition defined 
in addition to the DEFAULT partition.  Try creating a partition with any name, 
which will inherit everything from DEFAULT.

- Trey
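
If that guess is right, a slurm.conf whose only PartitionName line is DEFAULT defines no usable partition at all, because DEFAULT only supplies values for the partition lines that follow it. The sketch below checks for that condition. The partition name "workq" is made up for illustration, and the Nodes= list in the sample is shortened from the real one.

```python
def partition_names(conf_text):
    """Return PartitionName values from slurm.conf text, in file order."""
    names = []
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if line.lower().startswith("partitionname="):
            parts = line.split("=", 1)[1].split()
            if parts:
                names.append(parts[0])
    return names


def has_real_partition(conf_text):
    """True if at least one partition other than DEFAULT is defined."""
    return any(n.upper() != "DEFAULT" for n in partition_names(conf_text))


# Shortened version of the partition line from the config in this thread.
before = (
    "PartitionName=DEFAULT Shared=EXCLUSIVE State=UP DefaultTime=60 "
    "Nodes=nid00[002-007] MaxNodes=12\n"
)
# "workq" is a hypothetical name; the line picks up its settings from the
# DEFAULT line above it, and Default=YES makes it the default partition.
after = before + "PartitionName=workq Default=YES\n"

print(has_real_partition(before))  # False
print(has_real_partition(after))   # True
```

Adding a line like the last one in `after` to slurm.conf and restarting slurmctld is the quickest way to test the theory.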
