Re: [slurm-users] Munge decode failing on new node

2020-05-14 Thread dean.w.schulze
The problem turned out to be that the new node was on a different subnet from 
the other nodes.  Once our network admin opened ports 6817, 6818, and 6188 
between the subnets, the new node worked.
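
For anyone who hits the same "connect failure" line in slurmd.log, a port reachability check from the new node may be worth running early. The sketch below only extracts the configured ports from slurm.conf-style text. The documented defaults are 6817 for SlurmctldPort and 6818 for SlurmdPort. The config path, the host name ctld-host and the use of nc are assumptions and do not come from this thread.

```shell
# Print "SlurmctldPort SlurmdPort" from slurm.conf-style text on stdin,
# falling back to the documented defaults when a key is not set.
slurm_ports() {
    awk -F= '
        BEGIN { ctld = "6817"; d = "6818" }
        { sub(/#.*/, "") }                            # drop comments
        { key = tolower($1); gsub(/[ \t]/, "", key) } # normalise the key
        key == "slurmctldport" { ctld = $2 }
        key == "slurmdport"    { d = $2 }
        END { gsub(/[ \t]/, "", ctld); gsub(/[ \t]/, "", d); print ctld, d }
    '
}

# Usage on the new node (path and host name are placeholders):
#   set -- $(slurm_ports < /etc/slurm/slurm.conf)
#   nc -z -w 3 ctld-host "$1" && echo "slurmctld port $1 reachable"
# The controller must also be able to reach the SlurmdPort ($2) on the node.
```

If the port test fails while ssh works, a firewall or routing rule between the subnets is a likely suspect.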

 

Thanks for all the responses.

 



Re: [slurm-users] Munge decode failing on new node

2020-04-23 Thread Dean Schulze
I went through the exercise of making the other user's UID on the slurmctld
node the same as on the slurmd nodes, but that had no effect.  I still have 3
nodes that have connectivity and one node where slurmd cannot contact
slurmctld.  That node has ssh connectivity to and from the slurmctld node, but
no slurm communication.

It's time to reformat the drive and start over.




Re: [slurm-users] Munge decode failing on new node

2020-04-23 Thread Gennaro Oliva
Hi Dean,

On Wed, Apr 22, 2020 at 07:28:15PM -0600, dean.w.schu...@gmail.com wrote:
> Even for users other than slurm and munge?  It seems strange that 3 of
> 4 worker nodes work with the same UIDs/GIDs as the non-working nodes.

As in:

https://slurm.schedmd.com/quickstart_admin.html

Super Quick Start 1st step:

Make sure the clocks, users and groups (UIDs and GIDs) are synchronized
across the cluster.

This is true for the slurm user and the regular users running jobs.

The munge user doesn't need to be the same across the whole cluster:

https://bugs.schedmd.com/show_bug.cgi?id=4209

Best regards,
-- 
Gennaro Oliva



Re: [slurm-users] Munge decode failing on new node

2020-04-22 Thread dean.w.schulze
Even for users other than slurm and munge?  It seems strange that 3 of 4 worker 
nodes work with the same UIDs/GIDs as the non-working nodes.






Re: [slurm-users] Munge decode failing on new node

2020-04-22 Thread Christopher Samuel

On 4/22/20 12:56 PM, dean.w.schu...@gmail.com wrote:


> There is a third user account on all machines in the cluster that is the
> user account for using the cluster.  That account has uid 1000 on all four
> worker nodes, but on the controller it is 1001.  So that is probably why the
> question marks.


You need to have identical UIDs everywhere for this to work.

I would strongly suggest using something like LDAP to ensure that your 
users have identical representation everywhere.
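
Short of LDAP, a quick audit can show which account names map to different UIDs on different hosts. This is a minimal sketch that assumes passwordless ssh; the host names in the usage comment (ctld, node1, node2) are placeholders.

```shell
# Read "host name uid" triples on stdin and print every user name that
# maps to more than one UID across hosts.
uid_conflicts() {
    awk '{
             if ($2 in uid) { if (uid[$2] != $3) bad[$2] = 1 }
             else uid[$2] = $3
         }
         END { for (n in bad) print n }' | sort
}

# Collecting the triples (host names are placeholders):
#   for h in ctld node1 node2; do
#       ssh "$h" getent passwd | awk -F: -v h="$h" '{ print h, $1, $3 }'
#   done | uid_conflicts
```

An empty result means every name seen on more than one host has the same UID everywhere.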


All the best,
Chris
--
  Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA



Re: [slurm-users] Munge decode failing on new node

2020-04-22 Thread dean.w.schulze
There is a third user account on all machines in the cluster that is the
user account for using the cluster.  That account has uid 1000 on all four
worker nodes, but on the controller it is 1001.  So that is probably why the
question marks.

I doubt this is the issue, since the 3 of the 4 worker nodes that work have the
same uid mismatch for that user, and the slurm and munge users are not affected.










Re: [slurm-users] Munge decode failing on new node

2020-04-22 Thread dean.w.schulze
The uid and gid are the same for the slurm and munge users on each node.  The 
two new nodes, one of which can’t connect with the controller, have the same 
users and were created with the same sequence of steps.  The only exception is 
that the node that won’t connect has the software stack to compile slurm 
installed on it.  I’ll try removing these packages and see if that makes any 
difference.

 

I was wrong about the nodes not having ntp.  They are all running 
systemd-timesyncd.

 

I’ve found something interesting and inconsistent on the nodes that I’ll post 
in a new thread since this one is going nowhere.

 




Re: [slurm-users] Munge decode failing on new node

2020-04-20 Thread Chris Samuel
On Friday, 17 April 2020 2:22:00 PM PDT Dean Schulze wrote:

> Both work.  The only discrepancy is that the slurm controller output had
> these two lines:
> 
> UID:  ??? (1000)
> GID:  ??? (1000)
> 
> Like the controller doesn't know the username for UID 1000.

What does this say on the controller and the compute node?

getent passwd 1000

Are you using LDAP or the like to ensure that all nodes have the same user 
database?
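
To compare the answer on the controller and a compute node side by side, a small wrapper around getent can help. This is only a sketch; printing "???" for an unknown UID is an illustrative choice that mirrors the unmunge output quoted above.

```shell
# uid_name UID: print the user name that UID resolves to on this host,
# or "???" when no passwd entry (local or via NSS) matches.
uid_name() {
    name=$(getent passwd "$1" | cut -d: -f1)
    echo "${name:-???}"
}

# Usage ("node" is a placeholder host name):
#   uid_name 1000
#   ssh node getent passwd 1000
```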

All the best,
Chris
-- 
  Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA






Re: [slurm-users] Munge decode failing on new node

2020-04-19 Thread Brian Andrus

I see potentially 2 things you should likely do:

1. Run ntpd on your nodes. You can even have them sync with your master.
2. Sync your user data on the nodes too. Even if that is just ensuring
   /etc/passwd and /etc/group are the same on them all

While ntp is not required for slurm, the time sync is very important and 
ntp makes that a non-issue. Best practices and all.


#2 is something that is often overlooked, but obvious when you think 
about it.
I have seen folks add users by doing 'useradd' on each node, but that 
messes everything up if you installed a package or such that changed the 
next uid on any node.


The error below looks like you may have a different uid for the slurm 
user on the node. What uid is slurmd running as on the bad node vs a 
good node?


Brian Andrus





Re: [slurm-users] Munge decode failing on new node

2020-04-17 Thread Dean Schulze
Just noticed this.  On the problem node the munged.log file has an entry
every 1:40 (100 seconds):

2020-04-17 15:31:02 -0600 Info:  Invalid credential
2020-04-17 15:32:42 -0600 Info:  Invalid credential
2020-04-17 15:34:22 -0600 Info:  Invalid credential

This happens on the failed node and two other nodes that work.  Two nodes
that work (including the controller) don't have this message.
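
The spacing between such entries can be measured rather than eyeballed. The sketch below assumes GNU date (for -d) and the timestamp layout shown above. A steady interval would point to something retrying on a timer rather than sporadic failures, though the thread does not establish what that something is.

```shell
# Print the gap in seconds between consecutive "Invalid credential"
# entries read from munged.log-style text on stdin.
credential_gaps() {
    prev=
    grep 'Invalid credential' | while read -r day time zone rest; do
        now=$(date -d "$day $time $zone" +%s) || continue
        [ -n "$prev" ] && echo $(( now - prev ))
        prev=$now
    done
}

# Usage (the log path is an assumption; it varies by distribution):
#   sudo cat /var/log/munge/munged.log | credential_gaps
```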





Re: [slurm-users] Munge decode failing on new node

2020-04-17 Thread Dean Schulze
Both work.  The only discrepancy is that the slurm controller output had
these two lines:

UID:  ??? (1000)
GID:  ??? (1000)

Like the controller doesn't know the username for UID 1000.

But it returned success 0
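
The "???" fields can be picked out of unmunge output automatically, which is handy when checking several hosts. This is a minimal sketch that assumes the usual "FIELD: name (number)" layout of unmunge output.

```shell
# Read unmunge output on stdin and print the UID:/GID: fields whose
# numeric ID could not be resolved to a name on the decoding host.
unresolved_ids() {
    awk '($1 == "UID:" || $1 == "GID:") && $2 == "???" { print $1, $3 }'
}

# Usage ("node" is a placeholder host name):
#   echo foo | ssh node munge | unmunge | unresolved_ids
```

Output such as "UID: (1000)" means the credential decoded but the decoding host has no account with that UID.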



Re: [slurm-users] Munge decode failing on new node

2020-04-17 Thread Riebs, Andy
A couple of quick checks to see if the problem is munge:

1.   On the problem node, try
$ echo foo | munge | unmunge

2.   If (1) works, try this from the node running slurmctld, where "node" is 
the problem node's hostname
slurm-node$ echo foo | ssh node munge | unmunge
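
Both directions can be exercised in one go. The sketch below uses munge -n (encode with no payload) and relies on unmunge returning a non-zero status when decoding fails. The function name and its messages are illustrative, and passwordless ssh to the host is assumed.

```shell
# munge_roundtrip HOST: encode locally and decode on HOST, then encode on
# HOST and decode locally. Stops at the first direction that fails.
munge_roundtrip() {
    host=$1
    munge -n | ssh "$host" unmunge >/dev/null ||
        { echo "local -> $host: FAILED"; return 1; }
    ssh "$host" munge -n | unmunge >/dev/null ||
        { echo "$host -> local: FAILED"; return 1; }
    echo "munge OK in both directions with $host"
}

# Usage from the slurmctld host ("node" is a placeholder):
#   munge_roundtrip node
```

If both directions pass, as they did in this thread, the key and the munged daemons are probably fine and the network path between slurmd and slurmctld deserves a look.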



Re: [slurm-users] Munge decode failing on new node

2020-04-17 Thread Dean Schulze
There is no ntp service running on any of my nodes, and all but this one is
working.  I haven't heard that ntp is a requirement for slurm, just that
the time be synchronized across the cluster.  And it is.
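
A quick way to put a number on "in sync" is to compare epoch seconds between hosts. This is a rough sketch: the ssh round trip adds some error, and the 300-second figure in the comment is the default MUNGE credential lifetime, which a site may have changed.

```shell
# clock_skew A B: absolute difference in seconds between two epoch times.
clock_skew() {
    d=$(( $1 - $2 ))
    [ "$d" -lt 0 ] && d=$(( -d ))
    echo "$d"
}

# Usage ("node" is a placeholder host name):
#   clock_skew "$(date +%s)" "$(ssh node date +%s)"
# A result of a few seconds is harmless; values approaching the credential
# lifetime (300 s by default) would make munge reject credentials.
```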



Re: [slurm-users] Munge decode failing on new node

2020-04-16 Thread Chris Samuel

On 4/15/20 10:57 am, Dean Schulze wrote:


>  error: Munge decode failed: Invalid credential
>  ENCODED: Wed Dec 31 17:00:00 1969
>  DECODED: Wed Dec 31 17:00:00 1969
>  error: authentication: Invalid authentication credential


That's really interesting. I had one of these last week when on call. 
For us, at least, it seemed to be a hardware error: when we attempted to 
reboot the node, it failed completely and would no longer boot.


It's worth checking whatever hardware logging capabilities your system has to 
see if MCEs (machine check exceptions) are being reported.
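
On Linux the kernel log is one place to look. This is a sketch of a filter, assuming kernel messages are available through dmesg or journalctl. The pattern is deliberately broad and may match unrelated lines.

```shell
# Keep only kernel log lines on stdin that look like machine check reports.
mce_lines() {
    grep -i -E 'mce|machine check|hardware error'
}

# Usage (either source of kernel messages):
#   dmesg | mce_lines
#   journalctl -k | mce_lines
```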


All the best,
Chris
--
Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA



Re: [slurm-users] Munge decode failing on new node

2020-04-16 Thread Ole Holm Nielsen

You might want to check the Munge section in my Slurm Wiki page:
https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#munge-authentication-service

/Ole





Re: [slurm-users] Munge decode failing on new node

2020-04-15 Thread Dean Schulze
/etc/munge is 700
/etc/munge/munge.key is 400
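
The same permission check can be scripted for several nodes. This is a minimal sketch that assumes GNU stat (for -c); the expected modes are simply the ones reported above.

```shell
# check_mode PATH MODE: succeed only if PATH has exactly the octal mode MODE.
check_mode() {
    actual=$(stat -c '%a' "$1") || return 2
    [ "$actual" = "$2" ] || { echo "$1: mode $actual, expected $2"; return 1; }
}

# Usage as root on each node:
#   check_mode /etc/munge 700
#   check_mode /etc/munge/munge.key 400
#   pgrep -x munged >/dev/null || echo "munged is not running"
```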





Re: [slurm-users] Munge decode failing on new node

2020-04-15 Thread Carlos Fenoy
I’d check ntp as your encoding time seems odd to me
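
One reading of the odd time: "Wed Dec 31 17:00:00 1969" is Unix time 0 displayed in a UTC-7 zone. That most likely means the ENCODED and DECODED fields were never filled in because the decode failed, not that a clock is decades off. A small check, assuming GNU date and a Mountain Standard Time offset (written here as the POSIX zone string MST7):

```shell
# epoch_as_local TZSPEC: render Unix time 0 in the given time zone, in the
# same layout slurmd uses for the ENCODED/DECODED lines.
epoch_as_local() {
    LC_ALL=C TZ=$1 date -d @0 '+%a %b %d %H:%M:%S %Y'
}

# Usage:
#   epoch_as_local MST7
```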

-- 
--
Carles Fenoy


Re: [slurm-users] Munge decode failing on new node

2020-04-15 Thread Riebs, Andy
Two trivial things to check:

1.   Permissions on /etc/munge and /etc/munge/munge.key

2.   Is munged running on the problem node?

Andy



[slurm-users] Munge decode failing on new node

2020-04-15 Thread Dean Schulze
I've installed two new nodes onto my slurm cluster.  One node works, but
the other one complains about an invalid credential for munge.  I've
verified that the munge.key is the same as on all other nodes with

sudo cksum /etc/munge/munge.key

I recopied a munge.key from a node that works.  I've verified that munge
uid and gid are the same on the nodes.  The time is in sync on all nodes.
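
The key comparison can be scripted so that a mismatch stands out. The sketch below treats the first host listed as the reference. The host names are placeholders, and cksum is kept only because it is what was used above; a cryptographic hash such as sha256sum would be a stricter check.

```shell
# Read "host checksum" pairs on stdin and print every host whose checksum
# differs from the one on the first line (the reference host).
key_mismatches() {
    awk 'NR == 1 { ref = $2; next } $2 != ref { print $1 }'
}

# Collecting the pairs (host names are placeholders; reading the key needs root):
#   for h in ctld node1 node2; do
#       printf '%s %s\n' "$h" "$(ssh "$h" sudo cksum /etc/munge/munge.key | cut -d' ' -f1)"
#   done | key_mismatches
```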

Here is what is in the slurmd.log:

 error: Unable to register: Unable to contact slurm controller (connect failure)
 error: Munge decode failed: Invalid credential
 ENCODED: Wed Dec 31 17:00:00 1969
 DECODED: Wed Dec 31 17:00:00 1969
 error: authentication: Invalid authentication credential
 error: slurm_receive_msg_and_forward: Protocol authentication error
 error: service_connection: slurm_receive_msg: Protocol authentication error
 error: Unable to register: Unable to contact slurm controller (connect failure)

I've checked in the munged.log and all it says is

Invalid credential

Thanks for your help