Re: [slurm-users] Munge decode failing on new node
This problem turned out to be that the new node was on a different subnet than the other nodes. Once our network admin opened up ports 6817, 6818, and 6188 between the subnets, the new node worked.

Thanks for all the responses.

From: slurm-users On Behalf Of Riebs, Andy
Sent: Friday, April 17, 2020 1:58 PM
To: Slurm User Community List
Subject: Re: [slurm-users] Munge decode failing on new node

A couple of quick checks to see if the problem is munge:

1. On the problem node, try
   $ echo foo | munge | unmunge

2. If (1) works, try this from the node running slurmctld to the problem node
   slurm-node$ echo foo | ssh node munge | unmunge
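Since the cause reported here was blocked ports between subnets, a quick reachability probe is worth running before digging into munge. The following is a minimal sketch, assuming bash (for its /dev/tcp feature) and coreutils `timeout` are installed; the function name `check_port` and the host name in the example are hypothetical. 6817 and 6818 are the stock SlurmctldPort and SlurmdPort values, so substitute whatever your slurm.conf sets.

```shell
#!/usr/bin/env bash
# Sketch: report whether a TCP port on a host accepts connections.
# Assumes bash (/dev/tcp) and coreutils `timeout`; not part of Slurm.
check_port() {
    host=$1
    port=$2
    # Try to open a TCP connection; give up after 3 seconds.
    if timeout 3 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
        echo "$host:$port open"
    else
        echo "$host:$port blocked or closed"
        return 1
    fi
}

# Example (hypothetical host name "controller"), run from the problem node:
#   check_port controller 6817
# and from the controller toward the problem node:
#   check_port newnode 6818
```

A "blocked or closed" result only says the connection did not succeed; it cannot tell a firewall drop from a daemon that is not listening.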
Re: [slurm-users] Munge decode failing on new node
I went through the exercise of making the other user's UID the same on the slurmctld node as on the slurmd nodes, but that had no effect. I still have 3 nodes that have connectivity and one node where slurmd cannot contact slurmctld. That node has ssh connectivity to and from the slurmctld node, but no slurm communication.

It's time to reformat the drive and start over.

On Thu, Apr 23, 2020 at 12:34 AM Gennaro Oliva wrote:

> Hi Dean,
>
> On Wed, Apr 22, 2020 at 07:28:15PM -0600, dean.w.schu...@gmail.com wrote:
> > Even for users other than slurm and munge? It seems strange that 3 of
> > 4 worker nodes work with the same UIDs/GIDs as the non-working nodes.
>
> As in:
>
> https://slurm.schedmd.com/quickstart_admin.html
>
> Super Quick Start 1st step:
>
> Make sure the clocks, users and groups (UIDs and GIDs) are synchronized
> across the cluster.
>
> This is true for the slurm user and the regular users running jobs.
>
> The munge user doesn't need to be the same on all the cluster:
>
> https://bugs.schedmd.com/show_bug.cgi?id=4209
>
> Best regards,
> --
> Gennaro Oliva
Re: [slurm-users] Munge decode failing on new node
Hi Dean,

On Wed, Apr 22, 2020 at 07:28:15PM -0600, dean.w.schu...@gmail.com wrote:
> Even for users other than slurm and munge? It seems strange that 3 of
> 4 worker nodes work with the same UIDs/GIDs as the non-working nodes.

As in:

https://slurm.schedmd.com/quickstart_admin.html

Super Quick Start 1st step:

Make sure the clocks, users and groups (UIDs and GIDs) are synchronized across the cluster.

This is true for the slurm user and the regular users running jobs.

The munge user doesn't need to be the same on all the cluster:

https://bugs.schedmd.com/show_bug.cgi?id=4209

Best regards,
--
Gennaro Oliva
Re: [slurm-users] Munge decode failing on new node
Even for users other than slurm and munge? It seems strange that 3 of 4 worker nodes work with the same UIDs/GIDs as the non-working nodes.

-----Original Message-----
From: slurm-users On Behalf Of Christopher Samuel
Sent: Wednesday, April 22, 2020 2:27 PM
To: slurm-users@lists.schedmd.com
Subject: Re: [slurm-users] Munge decode failing on new node

On 4/22/20 12:56 PM, dean.w.schu...@gmail.com wrote:
> There is a third user account on all machines in the cluster that is
> the user account for using the cluster. That account has uid 1000 on
> all four worker nodes, but on the controller it is 1001. So that is
> probably why the question marks.

You need to have identical UIDs everywhere for this to work.

I would strongly suggest using something like LDAP to ensure that your users have identical representation everywhere.

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] Munge decode failing on new node
On 4/22/20 12:56 PM, dean.w.schu...@gmail.com wrote:
> There is a third user account on all machines in the cluster that is
> the user account for using the cluster. That account has uid 1000 on
> all four worker nodes, but on the controller it is 1001. So that is
> probably why the question marks.

You need to have identical UIDs everywhere for this to work.

I would strongly suggest using something like LDAP to ensure that your users have identical representation everywhere.

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
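The "identical UIDs everywhere" requirement can be checked mechanically. This is a minimal sketch, not anything shipped with Slurm: the helper name `uid_groups` and the node and user names in the example are hypothetical. It reads "node uid" pairs on stdin and groups the nodes by UID, so a single output line means the account is consistent.

```shell
# Sketch: group nodes by the UID they report for one account.
# Input: lines of "node uid". Output: one "uid N: nodes..." line per UID.
uid_groups() {
    sort -k2,2n -k1,1 | awk '
        NR == 1 || $2 != prev {
            if (NR > 1) printf "\n"
            printf "uid %s:", $2
            prev = $2
        }
        { printf " %s", $1 }
        END { if (NR > 0) printf "\n" }'
}

# Example (hypothetical node names and user name):
#   for n in controller node1 node2; do
#       printf '%s %s\n' "$n" "$(ssh "$n" id -u clusteruser)"
#   done | uid_groups
```

With the values described in this thread (1001 on the controller, 1000 on the workers) the output would have two lines, which is the mismatch being discussed.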
Re: [slurm-users] Munge decode failing on new node
There is a third user account on all machines in the cluster that is the user account for using the cluster. That account has uid 1000 on all four worker nodes, but on the controller it is 1001. So that is probably why the question marks.

I doubt this is the issue, since the 3 of the 4 nodes that work have the same uid mismatch for that user (and neither the slurm nor the munge user has a mismatch).

-----Original Message-----
From: slurm-users On Behalf Of Chris Samuel
Sent: Monday, April 20, 2020 12:03 AM
To: slurm-users@lists.schedmd.com
Subject: Re: [slurm-users] Munge decode failing on new node

On Friday, 17 April 2020 2:22:00 PM PDT Dean Schulze wrote:
> Both work. The only discrepancy is that the slurm controller output
> had these two lines:
>
> UID: ??? (1000)
> GID: ??? (1000)
>
> Like the controller doesn't know the username for UID 1000.

What does this say on the controller and the compute node?

getent passwd 1000

Are you using LDAP or the like to ensure that all nodes have the same user database?

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] Munge decode failing on new node
The uid and gid are the same for the slurm and munge users on each node. The two new nodes, one of which can’t connect with the controller, have the same users and were created with the same sequence of steps. The only exception is that the node that won’t connect has the software stack to compile slurm installed on it. I’ll try removing these packages and see if that makes any difference.

I was wrong about the nodes not having ntp. They are all running systemd-timesyncd.

I’ve found something interesting and inconsistent on the nodes that I’ll post in a new thread since this one is going nowhere.

From: slurm-users On Behalf Of Brian Andrus
Sent: Sunday, April 19, 2020 9:30 AM
To: slurm-users@lists.schedmd.com
Subject: Re: [slurm-users] Munge decode failing on new node

I see potentially 2 things you should likely do:

1. Run ntpd on your nodes. You can even have them sync with your master.
2. Sync your user data on the nodes too. Even if that is just ensuring /etc/passwd and /etc/group are the same on them all.

While ntp is not required for slurm, the time sync is very important and ntp makes that a non-issue. Best practices and all.

#2 is something that is often overlooked, but obvious when you think about it. I have seen folks add users by doing 'useradd' on each node, but that messes everything up if you installed a package or such that changed the next uid on any node.

The error below looks like you may have a different uid for the slurm user on the node. What uid is slurmd running as on the bad node vs a good node?

Brian Andrus

On 4/17/2020 2:38 PM, Dean Schulze wrote:

Just noticed this. On the problem node the munged.log file has an entry every 1:40:

2020-04-17 15:31:02 -0600 Info: Invalid credential
2020-04-17 15:32:42 -0600 Info: Invalid credential
2020-04-17 15:34:22 -0600 Info: Invalid credential

This happens on the failed node and two other nodes that work. Two nodes that work (including the controller) don't have this message.
Re: [slurm-users] Munge decode failing on new node
On Friday, 17 April 2020 2:22:00 PM PDT Dean Schulze wrote:
> Both work. The only discrepancy is that the slurm controller output had
> these two lines:
>
> UID: ??? (1000)
> GID: ??? (1000)
>
> Like the controller doesn't know the username for UID 1000.

What does this say on the controller and the compute node?

getent passwd 1000

Are you using LDAP or the like to ensure that all nodes have the same user database?

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
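The "???" entries are easy to miss in a full unmunge dump, so a small filter can help. This is a sketch under the assumption that unmunge prints its metadata as "KEY: name (number)" lines, as in the output quoted above; the helper name `unresolved_ids` is hypothetical.

```shell
# Sketch: read unmunge output on stdin and print the UID/GID entries whose
# name could not be resolved (shown by unmunge as "???").
unresolved_ids() {
    awk '($1 == "UID:" || $1 == "GID:") && $2 == "???" { print $1, $3 }'
}

# Example, on the controller ("node" is the problem node's host name):
#   ssh node munge -n | unmunge | unresolved_ids
# Any line printed names a numeric ID that this host cannot map to a name;
# `getent passwd <uid>` or `getent group <gid>` on both hosts shows why.
```

No output means every ID in the credential resolved on the decoding host.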
Re: [slurm-users] Munge decode failing on new node
I see potentially 2 things you should likely do:

1. Run ntpd on your nodes. You can even have them sync with your master.
2. Sync your user data on the nodes too. Even if that is just ensuring /etc/passwd and /etc/group are the same on them all.

While ntp is not required for slurm, the time sync is very important and ntp makes that a non-issue. Best practices and all.

#2 is something that is often overlooked, but obvious when you think about it. I have seen folks add users by doing 'useradd' on each node, but that messes everything up if you installed a package or such that changed the next uid on any node.

The error below looks like you may have a different uid for the slurm user on the node. What uid is slurmd running as on the bad node vs a good node?

Brian Andrus

On 4/17/2020 2:38 PM, Dean Schulze wrote:
> Just noticed this. On the problem node the munged.log file has an entry every 1:40:
>
> 2020-04-17 15:31:02 -0600 Info: Invalid credential
> 2020-04-17 15:32:42 -0600 Info: Invalid credential
> 2020-04-17 15:34:22 -0600 Info: Invalid credential
>
> This happens on the failed node and two other nodes that work. Two nodes that work (including the controller) don't have this message.
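The spacing of the munged.log entries quoted in this message can be measured rather than estimated by eye. The sketch below is illustrative only: the helper name `log_intervals` is hypothetical, it assumes the "DATE HH:MM:SS ..." layout shown in the excerpt, and it ignores date changes, so it is only valid for entries from the same day.

```shell
# Sketch: print the gap in seconds between consecutive log lines whose
# second field is an HH:MM:SS timestamp. Same-day entries only.
log_intervals() {
    awk '
        {
            split($2, t, ":")
            now = t[1] * 3600 + t[2] * 60 + t[3]
            if (NR > 1) print now - prev
            prev = now
        }'
}

# Example (assumed log path):
#   grep 'Invalid credential' /var/log/munge/munged.log | log_intervals
```

For the three entries quoted above this prints 100 twice, that is, exactly the 1:40 spacing. A perfectly regular gap would be consistent with some client retrying on a timer, though the log alone does not say which one.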
Re: [slurm-users] Munge decode failing on new node
Just noticed this. On the problem node the munged.log file has an entry every 1:40:

2020-04-17 15:31:02 -0600 Info: Invalid credential
2020-04-17 15:32:42 -0600 Info: Invalid credential
2020-04-17 15:34:22 -0600 Info: Invalid credential

This happens on the failed node and two other nodes that work. Two nodes that work (including the controller) don't have this message.

On Fri, Apr 17, 2020 at 2:00 PM Riebs, Andy wrote:

> A couple of quick checks to see if the problem is munge:
>
> 1. On the problem node, try
>    $ echo foo | munge | unmunge
>
> 2. If (1) works, try this from the node running slurmctld to the
>    problem node
>    slurm-node$ echo foo | ssh node munge | unmunge
Re: [slurm-users] Munge decode failing on new node
Both work. The only discrepancy is that the slurm controller output had these two lines:

UID: ??? (1000)
GID: ??? (1000)

Like the controller doesn't know the username for UID 1000. But it returned success (0).

On Fri, Apr 17, 2020 at 2:00 PM Riebs, Andy wrote:

> A couple of quick checks to see if the problem is munge:
>
> 1. On the problem node, try
>    $ echo foo | munge | unmunge
>
> 2. If (1) works, try this from the node running slurmctld to the
>    problem node
>    slurm-node$ echo foo | ssh node munge | unmunge
Re: [slurm-users] Munge decode failing on new node
A couple of quick checks to see if the problem is munge:

1. On the problem node, try
   $ echo foo | munge | unmunge

2. If (1) works, try this from the node running slurmctld to the problem node
   slurm-node$ echo foo | ssh node munge | unmunge

From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of Dean Schulze
Sent: Friday, April 17, 2020 3:40 PM
To: Slurm User Community List
Subject: Re: [slurm-users] Munge decode failing on new node

There is no ntp service running on any of my nodes, and all but this one are working. I haven't heard that ntp is a requirement for slurm, just that the time be synchronized across the cluster. And it is.
Re: [slurm-users] Munge decode failing on new node
There is no ntp service running on any of my nodes, and all but this one are working. I haven't heard that ntp is a requirement for slurm, just that the time be synchronized across the cluster. And it is.

On Wed, Apr 15, 2020 at 12:17 PM Carlos Fenoy wrote:

> I’d check ntp as your encoding time seems odd to me
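"The time is synchronized" is a claim that can be measured directly, with or without an ntp daemon. A minimal sketch, assuming passwordless ssh between nodes; the helper name `clock_skew` and the host name in the example are hypothetical.

```shell
# Sketch: absolute difference, in seconds, between two epoch timestamps.
clock_skew() {
    local_s=$1
    remote_s=$2
    diff=$((local_s - remote_s))
    if [ "$diff" -lt 0 ]; then
        diff=$((-diff))
    fi
    echo "$diff"
}

# Example (hypothetical host name "node1"); ssh latency adds a little noise:
#   clock_skew "$(date +%s)" "$(ssh node1 date +%s)"
```

A result of a second or two is mostly ssh round-trip time. What matters for munge is that the skew stays well inside the credential's time-to-live.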
Re: [slurm-users] Munge decode failing on new node
On 4/15/20 10:57 am, Dean Schulze wrote:
> error: Munge decode failed: Invalid credential
> ENCODED: Wed Dec 31 17:00:00 1969
> DECODED: Wed Dec 31 17:00:00 1969
> error: authentication: Invalid authentication credential

That's really interesting. I had one of these last week when on call; for us at least it seemed to be a hardware error, as the node failed completely when we attempted to reboot it and would no longer boot.

Worth checking whatever hardware logging capabilities your system has to see if MCEs are being reported.

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
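One way to act on the MCE suggestion on a Linux node is to scan the kernel log for machine-check messages. This is a rough sketch, not a hardware diagnostic: the helper name `count_mce` is hypothetical, and the pattern assumes kernel messages that contain the word "mce" or the phrase "machine check", so it can miss vendor-specific logging (BMC or IPMI event logs, for instance).

```shell
# Sketch: count kernel log lines on stdin that mention machine-check events.
# grep -c exits non-zero when the count is 0, so `|| true` keeps this usable
# under `set -e`.
count_mce() {
    grep -Eiwc 'mce|machine check' || true
}

# Example (either source of kernel messages works):
#   dmesg | count_mce
#   journalctl -k | count_mce
```

A non-zero count is a reason to look at the matching lines in full; zero does not prove the hardware is healthy.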
Re: [slurm-users] Munge decode failing on new node
You might want to check the Munge section in my Slurm Wiki page:
https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#munge-authentication-service

/Ole

On 15-04-2020 19:57, Dean Schulze wrote:
> I've installed two new nodes onto my slurm cluster. One node works, but the other one complains about an invalid credential for munge.
Re: [slurm-users] Munge decode failing on new node
/etc/munge is 700
/etc/munge/munge.key is 400

On Wed, Apr 15, 2020 at 12:11 PM Riebs, Andy wrote:

> Two trivial things to check:
>
> 1. Permissions on /etc/munge and /etc/munge/munge.key
>
> 2. Is munged running on the problem node?
>
> Andy
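The two permission values reported here can be checked the same way on every node. A minimal sketch, assuming GNU `stat` (the `-c '%a'` form is not available on BSD); the helper name `check_mode` is hypothetical, and 700 and 400 are simply the modes reported in this message.

```shell
# Sketch: compare a path's octal permission bits with an expected value.
# Assumes GNU stat. Returns 0 on match, 1 on mismatch, 2 if stat fails.
check_mode() {
    path=$1
    want=$2
    have=$(stat -c '%a' "$path") || return 2
    if [ "$have" = "$want" ]; then
        echo "$path ok ($have)"
    else
        echo "$path has mode $have, expected $want"
        return 1
    fi
}

# Example, run as root on each node:
#   check_mode /etc/munge 700
#   check_mode /etc/munge/munge.key 400
```

For the second item on the list, `pgrep -x munged` prints a PID when the daemon is running and prints nothing otherwise.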
Re: [slurm-users] Munge decode failing on new node
I’d check ntp as your encoding time seems odd to me

On Wed, 15 Apr 2020 at 19:59, Dean Schulze wrote:

> error: Munge decode failed: Invalid credential
> ENCODED: Wed Dec 31 17:00:00 1969
> DECODED: Wed Dec 31 17:00:00 1969

--
--
Carles Fenoy
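The odd-looking timestamp has a simple arithmetic explanation: "Wed Dec 31 17:00:00 1969" is Unix time zero rendered seven hours west of UTC, which is US Mountain Standard Time (other log lines in this thread show -0600, the same zone's daylight-time offset). The sketch below reproduces it, assuming GNU `date` (for `-d @0`); the helper name `epoch_zero_local` is hypothetical.

```shell
# Sketch: print Unix time 0 as seen from a given POSIX TZ string.
# "MST7" means a zone called MST that is 7 hours behind UTC.
epoch_zero_local() {
    LC_ALL=C TZ=$1 date -d @0 '+%a %b %d %H:%M:%S %Y'
}

epoch_zero_local MST7   # prints: Wed Dec 31 17:00:00 1969
epoch_zero_local UTC0   # prints: Thu Jan 01 00:00:00 1970
```

One reading, offered as an assumption rather than a confirmed diagnosis, is that the ENCODED and DECODED fields were never filled in because the decode failed, so the zero value says nothing about the node's clock.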
Re: [slurm-users] Munge decode failing on new node
Two trivial things to check:

1. Permissions on /etc/munge and /etc/munge/munge.key
2. Is munged running on the problem node?

Andy

From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of Dean Schulze
Sent: Wednesday, April 15, 2020 1:57 PM
To: Slurm User Community List
Subject: [slurm-users] Munge decode failing on new node

I've installed two new nodes onto my slurm cluster. One node works, but the other one complains about an invalid credential for munge. I've verified that the munge.key is the same as on all other nodes with

sudo cksum /etc/munge/munge.key

I recopied a munge.key from a node that works. I've verified that munge uid and gid are the same on the nodes. The time is in sync on all nodes.

Here is what is in the slurmd.log:

error: Unable to register: Unable to contact slurm controller (connect failure)
error: Munge decode failed: Invalid credential
ENCODED: Wed Dec 31 17:00:00 1969
DECODED: Wed Dec 31 17:00:00 1969
error: authentication: Invalid authentication credential
error: slurm_receive_msg_and_forward: Protocol authentication error
error: service_connection: slurm_receive_msg: Protocol authentication error
error: Unable to register: Unable to contact slurm controller (connect failure)

I've checked in the munged.log and all it says is

Invalid credential

Thanks for your help
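The `cksum` comparison in the original post can be collapsed into one number when there are more than a couple of nodes. A minimal sketch, assuming passwordless ssh and sudo on each node; the helper name `distinct_sums` and the node names in the example are hypothetical.

```shell
# Sketch: read `cksum` output lines ("CRC SIZE NAME") from several nodes on
# stdin and print how many distinct (CRC, SIZE) pairs were seen.
distinct_sums() {
    awk '{ print $1, $2 }' | sort -u | wc -l | tr -d ' '
}

# Example (hypothetical node names):
#   for n in controller node1 node2; do
#       ssh "$n" sudo cksum /etc/munge/munge.key
#   done | distinct_sums
```

A result of 1 means every node reported the same key; anything larger means at least one node has a different file.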