[slurm-dev] Re: Error: Unable to contact slurm controller
Hi Gerry,

[2014-08-21T09:30:09.673] fatal: system has no usable batch compute nodes

We see this on our systems (running Slurm + ALPS/BASIL rather than native) when slurmctld starts before the sdb has a list of batch nodes. It has bitten us when we've set the nodes to interactive rather than batch, and more regularly when we've restarted the sdb and slurmctld has started too early in the boot process. (A quick 'service slurm restart' sorts that out, though.)

Andrew
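As a rough sketch of how one might spot this condition before restarting, the shell snippet below pulls the most recent fatal line out of a slurmctld log and tests whether it is the "no usable batch compute nodes" message. The function names (last_fatal, needs_batch_nodes) and the demo file are hypothetical and not part of Slurm; the log lines are the ones quoted in this thread, and the real log path is site-specific.

```shell
#!/bin/sh
# Hypothetical helpers (not part of Slurm).

# Print the most recent "fatal:" line from the given log file.
last_fatal() {
    grep 'fatal:' "$1" | tail -n 1
}

# Succeed if the most recent fatal line is the "no batch nodes" error.
needs_batch_nodes() {
    last_fatal "$1" | grep -q 'no usable batch compute nodes'
}

# Demo against a throwaway file holding the log lines quoted in the thread.
demo_log=$(mktemp)
cat > "$demo_log" <<'EOF'
[2014-08-21T09:30:09.626] debug2: No ApbasilTimeout configured (65534)
[2014-08-21T09:30:09.630] debug2: No ApbasilTimeout configured (65534)
[2014-08-21T09:30:09.673] fatal: system has no usable batch compute nodes
EOF

if needs_batch_nodes "$demo_log"; then
    echo "slurmctld saw no batch nodes: check node modes, then restart slurm"
fi
rm -f "$demo_log"
```

On a real system the argument would be the site's slurmctld.log, and the restart itself would be the 'service slurm restart' mentioned above.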
[slurm-dev] Re: Error: Unable to contact slurm controller
No, slurmctld isn't running. Now. It was when I started, but I suspect I made at least one mod too many to slurm.conf. When I try to start slurmctld, I get these in slurmctld.log:

[2014-08-21T09:30:09.626] debug2: No ApbasilTimeout configured (65534)
[2014-08-21T09:30:09.630] debug2: No ApbasilTimeout configured (65534)
[2014-08-21T09:30:09.673] fatal: system has no usable batch compute nodes

I've just made a mod to slurm.conf that makes sure there's a default partition. I'd had named partitions in previously, but got some errors and warnings when trying to get the partition naming right in #SBATCH, so I'd gone back to the default config.

This appears to have started with a reboot several days ago. I'm now making sure it's not something deeper causing a Gemini network problem.

Thanks, Trey!
gerry

On Wed, Aug 20, 2014 at 10:11 PM, Trey Dockendorf treyd...@tamu.edu wrote:

Is slurmctld running? My guess is that you need at least one partition defined in addition to the DEFAULT partition. Try creating a partition with any name, which will inherit everything from DEFAULT.

- Trey
[slurm-dev] Re: Error: Unable to contact slurm controller
Try this:
http://slurm.schedmd.com/troubleshoot.html

Quoting Gerry Creager - NOAA Affiliate gerry.crea...@noaa.gov:

I'm trying to learn how to use and administer slurm on a new Cray system, and started seeing this yesterday:

squeue
slurm_load_jobs error: Unable to contact slurm controller (connect failure)

I'm at a loss as to how to proceed.

Thanks, Gerry
--
Gerry Creager
NSSL/CIMMS
405.325.6371
++
“Big whorls have little whorls,
That feed on their velocity;
And little whorls have lesser whorls,
And so on to viscosity.”
Lewis Fry Richardson (1881-1953)

--
Morris Moe Jette
CTO, SchedMD LLC

Slurm User Group Meeting
September 23-24, Lugano, Switzerland
Find out more http://slurm.schedmd.com/slurm_ug_agenda.html
[slurm-dev] Re: Error: Unable to contact slurm controller
Moe,

Thanks. I've tried. I'm noting a pair of errors in the slurmctld.log file:

[2014-08-20T15:58:58.458] debug: No DownNodes
[2014-08-20T15:58:58.458] fatal: No PartitionName information available!

So far, Google hasn't helped me much in this regard.

gerry

On Wed, Aug 20, 2014 at 11:39 AM, je...@schedmd.com wrote:

Try this: http://slurm.schedmd.com/troubleshoot.html
[slurm-dev] Re: Error: Unable to contact slurm controller
What's your slurm.conf look like? Do you have valid Nodes and Partitions defined? For example:

egrep '^(PartitionName|NodeName)' /etc/slurm/slurm.conf

Sounds like invalid slurm.conf is preventing slurmctld from starting.

- Trey

Trey Dockendorf
Systems Analyst I
Texas A&M University
Academy for Advanced Telecommunications and Learning Technologies
Phone: (979)458-2396
Email: treyd...@tamu.edu
Jabber: treyd...@tamu.edu

----- Original Message -----
From: Gerry Creager - NOAA Affiliate gerry.crea...@noaa.gov
To: slurm-dev slurm-dev@schedmd.com
Sent: Wednesday, August 20, 2014 4:09:25 PM
Subject: [slurm-dev] Re: Error: Unable to contact slurm controller

Moe,

Thanks. I've tried. I'm noting a pair of errors in the slurmctld.log file:

[2014-08-20T15:58:58.458] debug: No DownNodes
[2014-08-20T15:58:58.458] fatal: No PartitionName information available!

So far, Google hasn't helped me much in this regard.

gerry
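Since the config file does not live at /etc/slurm/slurm.conf on every site, the same check can be wrapped so the path is taken from an argument or from the SLURM_CONF environment variable, which the Slurm commands also honor. This is only a sketch: show_defs is a hypothetical name, and the sample config below is a shortened stand-in rather than anyone's real file.

```shell
#!/bin/sh
# Hypothetical wrapper around the egrep check suggested above.
# Path precedence: first argument, then $SLURM_CONF, then /etc/slurm/slurm.conf.
show_defs() {
    conf=${1:-${SLURM_CONF:-/etc/slurm/slurm.conf}}
    grep -E '^(PartitionName|NodeName)' "$conf"
}

# Demo with a shortened, made-up config so the sketch runs anywhere.
sample=$(mktemp)
cat > "$sample" <<'EOF'
# comment lines and other settings are ignored by the check
NodeName=nid00[002-007] Sockets=4 CoresPerSocket=4 ThreadsPerCore=1
PartitionName=DEFAULT Shared=EXCLUSIVE State=UP DefaultTime=60
EOF

show_defs "$sample"
rm -f "$sample"
```

With no argument and SLURM_CONF unset, show_defs falls back to the /etc/slurm/slurm.conf path used in the message above.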
[slurm-dev] Re: Error: Unable to contact slurm controller
Hi, Trey

That's what I am intuiting, as well, but:

gerry@loki:~/software/wrf/NME/DART_Lanai/models/wrf/work egrep '^(PartitionName|NodeName)' /opt/slurm/default/etc/slurm.conf
NodeName=nid00[002-007,024-029,040-043,046-049,052-055,064-071,088-099,100-103,120-127,136-151,160-167,184-199,216-223,232-247,256-263,280-287] Sockets=4 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=65536
PartitionName=DEFAULT Shared=EXCLUSIVE State=UP DefaultTime=60 Nodes=nid00[002-007,024-029,040-043,046-049,052-055,064-071,088-099,100-103,120-127,136-151,160-167,184-199,216-223,232-247,256-263,280-287] MaxNodes=12

looks pretty normal.

gerry

On Wed, Aug 20, 2014 at 4:25 PM, Trey Dockendorf treyd...@tamu.edu wrote:

What's your slurm.conf look like? Do you have valid Nodes and Partitions defined? For example:

egrep '^(PartitionName|NodeName)' /etc/slurm/slurm.conf

Sounds like invalid slurm.conf is preventing slurmctld from starting.

- Trey
[slurm-dev] Re: Error: Unable to contact slurm controller
Is slurmctld running? My guess is that you need at least one partition defined in addition to the DEFAULT partition. Try creating a partition with any name, which will inherit everything from DEFAULT.

- Trey

Trey Dockendorf
Systems Analyst I
Texas A&M University
Academy for Advanced Telecommunications and Learning Technologies
Phone: (979)458-2396
Email: treyd...@tamu.edu
Jabber: treyd...@tamu.edu

----- Original Message -----
From: Gerry Creager - NOAA Affiliate gerry.crea...@noaa.gov
To: slurm-dev slurm-dev@schedmd.com
Sent: Wednesday, August 20, 2014 4:40:40 PM
Subject: [slurm-dev] Re: Error: Unable to contact slurm controller

Hi, Trey

That's what I am intuiting, as well, but:

gerry@loki:~/software/wrf/NME/DART_Lanai/models/wrf/work egrep '^(PartitionName|NodeName)' /opt/slurm/default/etc/slurm.conf
NodeName=nid00[002-007,024-029,040-043,046-049,052-055,064-071,088-099,100-103,120-127,136-151,160-167,184-199,216-223,232-247,256-263,280-287] Sockets=4 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=65536
PartitionName=DEFAULT Shared=EXCLUSIVE State=UP DefaultTime=60 Nodes=nid00[002-007,024-029,040-043,046-049,052-055,064-071,088-099,100-103,120-127,136-151,160-167,184-199,216-223,232-247,256-263,280-287] MaxNodes=12

looks pretty normal.

gerry
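A minimal sketch of the suggestion above: in slurm.conf a PartitionName=DEFAULT line only sets defaults for the partition lines that follow it, so a config holding nothing but that line defines no actual partition. The snippet checks for at least one non-DEFAULT partition and then adds one. The helper name has_named_partition and the partition name workq are hypothetical, and the node list is shortened for the demo. The thread does not confirm that this alone clears the later "no usable batch compute nodes" error on a Cray/ALPS system.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the config defines at least one partition
# other than the DEFAULT template line. Assumes the keyword casing used in
# this thread (slurm.conf keywords themselves are case-insensitive).
has_named_partition() {
    grep -E '^PartitionName=' "$1" |
        grep -Eqv '^PartitionName=DEFAULT([[:space:]]|$)'
}

report() {
    if has_named_partition "$1"; then
        echo "named partition found"
    else
        echo "only DEFAULT is defined"
    fi
}

# Shortened stand-in for the config shown in the thread.
conf=$(mktemp)
cat > "$conf" <<'EOF'
NodeName=nid00[002-007] Sockets=4 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=65536
PartitionName=DEFAULT Shared=EXCLUSIVE State=UP DefaultTime=60 Nodes=nid00[002-007] MaxNodes=12
EOF
report "$conf"

# Add a named partition (hypothetical name) that inherits from DEFAULT.
echo 'PartitionName=workq Default=YES' >> "$conf"
report "$conf"
rm -f "$conf"
```

Default=YES marks the partition that jobs land in when none is requested; the name itself can be anything the site prefers.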