Re: [lustre-discuss] client fails to mount

2017-04-24 Thread Brett Lee
So, the LNet ping is not working, and LNet is running on IB. Have you moved down the stack toward the hardware, running an ibping from a rebooted client to the MGS? Brett -- Protect Yourself Against Cybercrime PDS Software Solutions LLC https://www.TrustPDS.com On Mon

Re: [lustre-discuss] client fails to mount

2017-04-24 Thread Raj
Yes, this is strange. Normally, I have seen a credits mismatch result in this scenario, but it doesn't look like that is the case here. You wouldn't want to have the MGS capture debug messages, as there will be a lot of data. I guess you already tried removing the Lustre drivers and adding them again? l

Re: [lustre-discuss] client fails to mount

2017-04-24 Thread Strikwerda, Ger
Hi Raj, When I do an lctl ping on an MGS server, I do not see any logs at all, not even when I do a successful ping from a working node. Is there a way to make the Lustre logging more verbose, to see more detail at the LNet level? It is very strange that a rebooted node is able to lctl ping compute nodes, but

Re: [lustre-discuss] client fails to mount

2017-04-24 Thread Raj
Ger, It looks like the default configuration of Lustre. Do you see any error messages on the MGS side while you are doing lctl ping from the rebooted clients? On Mon, Apr 24, 2017 at 12:27 PM Strikwerda, Ger wrote: > Hi Eli, > > Nothing can be mounted on the Lustre filesystems so the output is: > > [

Re: [lustre-discuss] client fails to mount

2017-04-24 Thread Strikwerda, Ger
Hi Eli, Nothing can be mounted on the Lustre filesystems so the output is: [root@pg-gpu01 ~]# lfs df /home/ger/ [root@pg-gpu01 ~]# Empty.. On Mon, Apr 24, 2017 at 7:24 PM, E.S. Rosenberg wrote: > > > On Mon, Apr 24, 2017 at 8:19 PM, Strikwerda, Ger > wrote: > >> Hallo Eli, >> >> Logfile/sy

Re: [lustre-discuss] client fails to mount

2017-04-24 Thread Strikwerda, Ger
Hallo Eli, Logfile/syslog on the client-side: Lustre: Lustre: Build Version: 2.5.3-RC1--PRISTINE-2.6.32-573.el6.x86_64 LNet: Added LNI 172.23.54.51@o2ib [8/256/0/180] LNetError: 2878:0:(o2iblnd_cb.c:2587:kiblnd_rejected()) 172.23.55.211@o2ib rejected: consumer defined fatal error On Mon, Apr
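The "Added LNI" line in the log above carries the tunables that LNet applied on this client. Below is a minimal Python sketch for pulling them out of a syslog line so they can be compared between a rebooted and a non-rebooted node. It assumes the bracketed values are, in order, peer credits, credits, peer buffer credits, and peer timeout; that ordering is an editorial reading of how LNet prints them and should be verified against your Lustre version. The names `LniTunables` and `parse_added_lni` are hypothetical helpers, not part of Lustre.

```python
import re
from typing import NamedTuple, Optional


class LniTunables(NamedTuple):
    """Tunables printed by LNet when a network interface is added.

    Field order is an assumption (see the note above the code).
    """
    nid: str
    peer_credits: int
    credits: int
    peer_buffer_credits: int
    peer_timeout: int


# Matches e.g. "LNet: Added LNI 172.23.54.51@o2ib [8/256/0/180]"
_LNI_RE = re.compile(r"Added LNI (\S+) \[(\d+)/(\d+)/(\d+)/(\d+)\]")


def parse_added_lni(line: str) -> Optional[LniTunables]:
    """Return the tunables from an 'Added LNI' log line, or None if absent."""
    match = _LNI_RE.search(line)
    if match is None:
        return None
    nid = match.group(1)
    values = [int(group) for group in match.groups()[1:]]
    return LniTunables(nid, *values)


if __name__ == "__main__":
    sample = "LNet: Added LNI 172.23.54.51@o2ib [8/256/0/180]"
    print(parse_added_lni(sample))
```

Running the same function over the syslog of a working node and comparing the two tuples is a quick way to see whether the nodes came up with different credit settings.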

Re: [lustre-discuss] client fails to mount

2017-04-24 Thread E.S. Rosenberg
On Mon, Apr 24, 2017 at 8:13 PM, Strikwerda, Ger wrote: > Hi Raj (and others), > > In which file should i state the credits/peer_credits stuff? > > Perhaps relevant config-files: > > [root@pg-gpu01 ~]# cd /etc/modprobe.d/ > > [root@pg-gpu01 modprobe.d]# ls > anaconda.conf blacklist-kvm.conf

Re: [lustre-discuss] client fails to mount

2017-04-24 Thread Strikwerda, Ger
Hi Raj (and others), In which file should I set the credits/peer_credits options? Possibly relevant config files: [root@pg-gpu01 ~]# cd /etc/modprobe.d/ [root@pg-gpu01 modprobe.d]# ls anaconda.conf blacklist-kvm.conf dist-alsa.conf dist-oss.conf ib_ipoib.conf lustre.conf openf

Re: [lustre-discuss] client fails to mount

2017-04-24 Thread Raj
It may be worth checking your LNet credits and peer_credits in /etc/modprobe.d. You can compare working hosts with non-working hosts. Thanks _Raj On Mon, Apr 24, 2017 at 10:10 AM Strikwerda, Ger wrote: > Hi Rick, > > Even without iptables rules and loading the correct modules afterwards, we
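The comparison suggested above can be sketched in Python roughly as follows. This is an illustration under assumptions, not a description of the poster's setup: it assumes the tunables are set through `options <module> key=value` lines in a modprobe configuration file, and the sample option values are invented for the example. `parse_module_options` and `diff_options` are hypothetical helper names.

```python
import shlex
from typing import Dict, Optional, Tuple


def parse_module_options(conf_text: str, module: str) -> Dict[str, str]:
    """Collect key=value pairs from 'options <module> ...' lines.

    Comments starting with '#' are ignored; quoted values are kept whole.
    """
    options: Dict[str, str] = {}
    for raw_line in conf_text.splitlines():
        tokens = shlex.split(raw_line, comments=True)
        if len(tokens) < 3 or tokens[0] != "options" or tokens[1] != module:
            continue
        for token in tokens[2:]:
            key, sep, value = token.partition("=")
            if sep:
                options[key] = value
    return options


def diff_options(
    a: Dict[str, str], b: Dict[str, str]
) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Return {key: (value_in_a, value_in_b)} for every key that differs."""
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(set(a) | set(b))
        if a.get(key) != b.get(key)
    }


if __name__ == "__main__":
    # Hypothetical file contents; real values come from each host's config.
    working = "options ko2iblnd peer_credits=128 credits=1024\n"
    rebooted = "# defaults\noptions ko2iblnd peer_credits=8 credits=256\n"
    print(diff_options(parse_module_options(working, "ko2iblnd"),
                       parse_module_options(rebooted, "ko2iblnd")))
```

An empty result means the two files set the same options for that module, which would point away from a credits mismatch in the configuration files themselves.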

Re: [lustre-discuss] client fails to mount

2017-04-24 Thread Strikwerda, Ger
Hi Rick, Even without iptables rules and loading the correct modules afterwards, we get the same results: [root@pg-gpu01 sysconfig]# iptables --list Chain INPUT (policy ACCEPT) target prot opt source destination Chain FORWARD (policy ACCEPT) target prot opt source

Re: [lustre-discuss] client fails to mount

2017-04-24 Thread Mohr Jr, Richard Frank (Rick Mohr)
This might be a long shot, but have you checked for possible firewall rules that might be causing the issue? I’m wondering if there is a chance that some rules were added after the nodes were up to allow Lustre access, and when a node got rebooted, it lost the rules. -- Rick Mohr Senior HPC Sy

Re: [lustre-discuss] client fails to mount

2017-04-24 Thread Strikwerda, Ger
Hi Russell, On a rebooted node: [root@pg-gpu01 ~]# ibhosts | wc -l 183 On a not-rebooted node: [root@pg-gpu02 ~]# ibhosts | wc -l 183 No difference, and all our Lustre storage nodes seem to be present: Ca : 0xf45214030062eb50 ports 2 "pg-ost01 HCA-1" Ca : 0xf45214030062eb30 ports 2 "p
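Equal line counts from `ibhosts | wc -l` show that both nodes see the same number of channel adapters, but not necessarily the same ones. A minimal Python sketch for comparing the saved output of two nodes by GUID is given below. It assumes each host line has the form shown above (`Ca : <guid> ports <n> "<description>"`); the second host in the sample and the helper names `parse_ibhosts` and `missing_hosts` are made up for illustration.

```python
import re
from typing import Dict

# Matches lines such as: Ca : 0xf45214030062eb50 ports 2 "pg-ost01 HCA-1"
_CA_RE = re.compile(r'^Ca\s*:\s*(0x[0-9a-fA-F]+)\s+ports\s+(\d+)\s+"([^"]*)"')


def parse_ibhosts(output: str) -> Dict[str, str]:
    """Map node GUID -> node description for every 'Ca' line in the output."""
    hosts: Dict[str, str] = {}
    for line in output.splitlines():
        match = _CA_RE.match(line.strip())
        if match:
            hosts[match.group(1).lower()] = match.group(3)
    return hosts


def missing_hosts(reference: Dict[str, str], candidate: Dict[str, str]) -> Dict[str, str]:
    """Return hosts present in the reference listing but absent from the candidate."""
    return {guid: name for guid, name in reference.items() if guid not in candidate}


if __name__ == "__main__":
    # The second entry is a hypothetical host used only for this example.
    working_node = (
        'Ca : 0xf45214030062eb50 ports 2 "pg-ost01 HCA-1"\n'
        'Ca : 0x0000000000000001 ports 1 "example-host HCA-1"\n'
    )
    rebooted_node = 'Ca : 0xf45214030062eb50 ports 2 "pg-ost01 HCA-1"\n'
    print(missing_hosts(parse_ibhosts(working_node), parse_ibhosts(rebooted_node)))
```

If the result is empty in both directions, the two nodes see the same set of adapters on the fabric.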

Re: [lustre-discuss] client fails to mount

2017-04-24 Thread Russell Dekema
I'm not sure this is likely to help either, but if you run the command 'ibhosts' on one of the non-working Lustre client nodes, do you see all of your Lustre servers in the printed list? -Rusty On Mon, Apr 24, 2017 at 10:39 AM, Russell Dekema wrote: > I can't rule it out, but it seems unlikely t

Re: [lustre-discuss] client fails to mount

2017-04-24 Thread Russell Dekema
I can't rule it out, but it seems unlikely to me that an out of date IB HCA firmware version would cause a problem like this, especially when everything was working before on that same version, and when IB communication over the device seems to be working in general (as shown by your pings over you

Re: [lustre-discuss] client fails to mount

2017-04-24 Thread Strikwerda, Ger
Hi Russell/*, If we run ibdiagnet we get errors/warnings about some (newer) nodes which happen to have a new firmware on the IB interface: Nodes Information -E- FW Check finished with errors -W- pg-gpu01/U1 - Node has FW version 2.32.5100 while the latest FW version, for the same device available

Re: [lustre-discuss] client fails to mount

2017-04-24 Thread Russell Dekema
Oh, ok, that seems to rule the subnet manager out. I mis-read your IP network numbers earlier and thought you had not tried regular IP-ping across your IPoIB interfaces, but, upon re-reading your initial message, it seems you have tried this and it does work, even between a client with non-working

Re: [lustre-discuss] client fails to mount

2017-04-24 Thread Strikwerda, Ger
Hi Russell, Thanks for the IB subnet clues: [root@pg-gpu01 ~]# ibv_devinfo hca_id: mlx4_0 transport: InfiniBand (0) fw_ver: 2.32.5100 node_guid: f452:1403:00f5:4620 sys_image_guid: f4

Re: [lustre-discuss] client fails to mount

2017-04-24 Thread Russell Dekema
At first glance, this sounds like your Infiniband subnet manager may be down or malfunctioning. In this case, nodes which were already up when the subnet manager was working will continue to be able to communicate over IB, but nodes which reboot after the SM goes down will not. You can test this t

[lustre-discuss] client fails to mount

2017-04-24 Thread Strikwerda, Ger
Hi everybody, Here at the University of Groningen we are now experiencing a strange Lustre error. If a client reboots, it fails to mount the Lustre storage. The client is not able to reach the MGS service. The storage and nodes communicate over IB and, until now, have done so without any problems. It loo