[xcat-user] statelite crashes with nfs server timeout

2018-07-12 Thread Jeff Berry
Good afternoon all,

I've got some CentOS 7.5 statelite nodes which seem to be booting properly, but 
after being up for less than a day, they crash with what look like NFS 
timeouts.  The server is up, and if I rpower reset the nodes, they come back up 
with no problem, but then they crash again overnight.

This may not be an xCAT problem at all; it may be an NFS issue, but I thought 
I'd toss it out here and see if it rang any bells for anyone.

Jeff Berry, MRC CBSU
--
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
___
xCAT-user mailing list
xCAT-user@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/xcat-user


Re: [xcat-user] statelite crashes with nfs server timeout

2018-07-13 Thread Javier Ron

Hello,


It sounds like an NFS thing; you could try different options for the mount.


https://www.centos.org/forums/viewtopic.php?t=8787

(CentOS forum thread: "NFS hard mounts vs soft mounts")

simon_matthews wrote: I think that the reason hard mounts are 
recommended is that this covers the case where the user's home directory is on 
an NFS server.
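
Before changing mount options, it may help to see what the nodes are using now. The following is only a minimal sketch in Python: it parses text in the /proc/mounts layout and reports whether each NFS mount is hard or soft, plus its timeo value. The sample lines, addresses and paths are made up for illustration and are not taken from the original poster's cluster.

```python
def nfs_mount_options(mounts_text):
    """Return {mount_point: set_of_options} for NFS entries in /proc/mounts-style text."""
    result = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts fields: device, mount point, fs type, options, dump, pass
        if len(fields) < 4 or fields[2] not in ("nfs", "nfs4"):
            continue
        result[fields[1]] = set(fields[3].split(","))
    return result


# Hypothetical sample; on a real node use open("/proc/mounts").read() instead.
SAMPLE = """\
10.0.0.1:/install/netboot/centos7.5 / nfs ro,relatime,vers=3,hard,proto=tcp,timeo=600,retrans=2 0 0
10.0.0.1:/nodedata /.statelite/persistent nfs rw,relatime,vers=3,soft,proto=tcp,timeo=14,retrans=3 0 0
tmpfs /run tmpfs rw,nosuid,nodev 0 0
"""

for mount_point, opts in nfs_mount_options(SAMPLE).items():
    kind = "soft" if "soft" in opts else "hard"
    timeo = next((o for o in opts if o.startswith("timeo=")), "timeo=?")
    print(mount_point, kind, timeo)
```

A soft mount returns an I/O error to the application once its retries are exhausted, while a hard mount keeps retrying, so knowing which one a node has can help when reading its "server not responding" messages.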






Re: [xcat-user] statelite crashes with nfs server timeout

2018-07-13 Thread David Johnson
I’m not sure, but this seems to have a whiff of a problem we had a while back, 
where we forgot to make appropriate holes in the firewall.  GPFS died after the 
node had been up successfully for a while.  It worked at first because the 
connection was initiated from the host to the GPFS servers, but traffic back to 
the node was blocked after the session was dropped from the firewall's table of 
active connections.
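
To make that failure mode concrete, here is a toy model in Python. It is not how any real firewall is implemented; the class name, host names and the 300-second idle timeout are all invented for illustration. It only shows why replies get through while the tracked session is fresh and are dropped once the entry has aged out.

```python
class ToyStatefulFirewall:
    """Toy model: inbound packets pass only if a tracked outbound flow is still fresh."""

    def __init__(self, idle_timeout):
        self.idle_timeout = idle_timeout  # seconds an idle entry survives (hypothetical)
        self.last_seen = {}               # (node, server) -> time of last packet

    def outbound(self, node, server, now):
        # Outbound traffic is always allowed and creates or refreshes the entry.
        self.last_seen[(node, server)] = now

    def inbound_allowed(self, server, node, now):
        seen = self.last_seen.get((node, server))
        if seen is None or now - seen > self.idle_timeout:
            # No entry, or the entry has expired: drop it and block the packet.
            self.last_seen.pop((node, server), None)
            return False
        self.last_seen[(node, server)] = now  # traffic keeps the entry alive
        return True


fw = ToyStatefulFirewall(idle_timeout=300)
fw.outbound("node01", "gpfs-server", now=0)
print(fw.inbound_allowed("gpfs-server", "node01", now=60))    # True
print(fw.inbound_allowed("gpfs-server", "node01", now=1000))  # False
```

An explicit rule allowing the servers to reach the node removes the dependence on the session table.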



Re: [xcat-user] statelite crashes with nfs server timeout

2018-07-15 Thread Song BJ Yang
Hi Jeff,
 
Did you enable kdump? The core dump file might help to find out the problem.
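
A quick way to tell whether a node is ready to capture a dump is to look at two kernel-provided files: /proc/cmdline (for a crashkernel= reservation) and /sys/kernel/kexec_crash_loaded (which reads 1 when a crash kernel is loaded). The Python sketch below takes the contents of those files as strings; the sample values are hypothetical.

```python
def kdump_status(cmdline_text, crash_loaded_text):
    """Summarise kdump readiness from /proc/cmdline and /sys/kernel/kexec_crash_loaded."""
    # Memory for the crash kernel is reserved via a crashkernel= boot argument.
    reserved = any(arg.startswith("crashkernel=") for arg in cmdline_text.split())
    # The sysfs file contains "1" once a crash kernel has been loaded.
    loaded = crash_loaded_text.strip() == "1"
    return {"memory_reserved": reserved, "crash_kernel_loaded": loaded}


# Hypothetical values; on a node, read the two files instead.
print(kdump_status("BOOT_IMAGE=vmlinuz ro crashkernel=256M console=tty0", "1\n"))
print(kdump_status("BOOT_IMAGE=vmlinuz ro console=tty0", "0\n"))
```

If either value is False, a crash will most likely leave no vmcore behind.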
--
YANG Song (杨嵩)
IBM China System Technology Laboratory
Tel: 86-10-82452903
Email: yang...@cn.ibm.com
Address: Building 28, ZhongGuanCun Software Park, No.8, Dong Bei Wang West Road, Haidian District, Beijing 100193, PRC
 
 


Re: [xcat-user] statelite crashes with nfs server timeout

2018-07-15 Thread Yuan Y Bai
 
kdump doc link: https://xcat-docs.readthedocs.io/en/latest/guides/admin-guides/manage_clusters/ppc64le/diskless/customize_image/enable_kdump.html?highlight=kdump
 
Best Regards
--
Yuan Bai (白媛)
CSTL HPC System Management Development
Tel: 86-10-82451401
E-mail: by...@cn.ibm.com
Address: IBM ZGC Campus, Ring Building 28, ZhongGuanCun Software Park, No.8 Dong Bei Wang West Road, Haidian District, Beijing, P.R. China 100193
 
 


Re: [xcat-user] statelite crashes with nfs server timeout

2018-07-16 Thread Jeff Berry
Thanks for the ideas, everyone.  I’ll look into kdump and double-check the 
firewalling.  (And have another look through the mount options ;-)


Regards,

Jeff Berry


Re: [xcat-user] statelite crashes with nfs server timeout

2018-07-16 Thread Jeff Berry
Hi all,

The problem is linked to the DHCP lease renewal.  When the lease times out, the 
renewal seems to fail, and then NFS starts to time out and ... bad things 
happen.

I’m not sure yet what the specific issue is, but I’m guessing it might have to 
do with the interface bonding.
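
For anyone wanting to line up crash times with lease times, here is a rough Python sketch that pulls the renew, rebind and expire timestamps out of dhclient-style lease text. It assumes the default dhclient lease format, where these times are written in UTC after a weekday number; the sample lease, interface name and address are invented, and the location of the lease file on a given node is not something this thread states.

```python
import re
from datetime import datetime, timezone

# Matches lines such as:  expire 1 2018/07/16 12:00:00;
LEASE_TIME = re.compile(
    r"^\s*(renew|rebind|expire)\s+\d+\s+(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2});",
    re.M,
)


def lease_times(lease_text):
    """Return the last renew/rebind/expire timestamps (UTC) found in the lease text."""
    times = {}
    for key, stamp in LEASE_TIME.findall(lease_text):
        parsed = datetime.strptime(stamp, "%Y/%m/%d %H:%M:%S")
        times[key] = parsed.replace(tzinfo=timezone.utc)
    return times


# Hypothetical lease entry for illustration only.
SAMPLE = """\
lease {
  interface "bond0";
  fixed-address 10.0.0.21;
  renew 1 2018/07/16 03:00:00;
  rebind 1 2018/07/16 09:00:00;
  expire 1 2018/07/16 12:00:00;
}
"""

times = lease_times(SAMPLE)
now = datetime(2018, 7, 16, 10, 0, 0, tzinfo=timezone.utc)
print("past renew time:", now > times["renew"])
print("seconds until expiry:", (times["expire"] - now).total_seconds())
```

A node that is well past its renew time without a newer lease entry appearing is a hint that renewals are not getting through.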

Again, thanks for the pointers,

Jeff, MRC CBSU



Re: [xcat-user] statelite crashes with nfs server timeout

2018-07-16 Thread david_johnson
After we had some (let’s say self-inflicted) issues with DHCP, I added hardeths 
to the list of postscripts.  I would still want to know why DHCP is failing.

  -- ddj
Dave Johnson



Re: [xcat-user] statelite crashes with nfs server timeout

2018-07-16 Thread Jeff Berry
I’ll take a look at the hardeths script.

But, yes, it would still be nice to know what’s going on ...

JB
MRC



Re: [xcat-user] statelite crashes with nfs server timeout

2018-07-18 Thread Jeff Berry
Hi all,

The problem was with setting up the bonded interface on the client.  If it was 
set up on the command line at deployment, it didn’t propagate properly into the 
OS.  My guess is that the OS then tried to send its DHCP requests over the 
bond, which was not actually broadcasting, so when the lease expired, so did 
the machine ...

A few tweaks to the image’s /.default/etc/sysconfig/network-scripts got the 
bond working properly and the nodes are now stable.
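
For anyone checking a similar setup, the Linux bonding driver reports its state in /proc/net/bonding/<bond name>. The Python sketch below parses that text into the bond's overall MII status, its active slave and each slave's status. The sample output and interface names are hypothetical and the parser covers only the few fields shown; it is an illustration, not the fix applied above.

```python
def bond_status(proc_text):
    """Parse /proc/net/bonding/<bond> text into overall and per-slave MII status."""
    status = {"bond_mii": None, "active_slave": None, "slaves": {}}
    current = None  # name of the slave section being read, if any
    for line in proc_text.splitlines():
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "Slave Interface":
            current = value
        elif key == "Currently Active Slave":
            status["active_slave"] = value
        elif key == "MII Status":
            if current is None:
                status["bond_mii"] = value        # status of the bond itself
            else:
                status["slaves"][current] = value  # status of one slave
    return status


# Hypothetical sample; on a node use open("/proc/net/bonding/bond0").read().
SAMPLE = """\
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

Bonding Mode: fault-tolerance (active-backup)
Currently Active Slave: eth0
MII Status: up
MII Polling Interval (ms): 100

Slave Interface: eth0
MII Status: up

Slave Interface: eth1
MII Status: down
"""

print(bond_status(SAMPLE))
```

A bond that is missing, has no slaves listed, or shows its MII status as down after boot would be consistent with the configuration not having carried over into the OS.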

Thanks again for the advice,

Jeff, MRC CBU