Re: [lustre-discuss] client fails to mount

2017-05-01 Thread Dilger, Andreas
Thanks for the follow-up. Often, once the problem is found, there is no 
update or conclusion to the email thread, and then there is no way for others 
to solve a similar problem in the same way. 

Cheers, Andreas

> On May 1, 2017, at 02:46, Strikwerda, Ger  wrote:
> 
> Hi all,
> 
> Our clients-failed-to-mount/lctl ping horror turned out to be a failing 
> subnet manager issue. We did not see an issue running 'sminfo', but on the IB 
> switch we could see that the subnet manager was unstable. This caused mayhem 
> on the IB/Lustre setup.
> 
> Thanks, everyone, for your help/advice/hints. Good to see how this active 
> community works! 
> 
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] client fails to mount

2017-05-01 Thread Strikwerda, Ger
The second option. We did not trust 'sminfo', so why not double-check on the
IB switch, or at least look at the logs of the IB switch to see what was
happening over there.
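
The cross-checking described above can be scripted. This is a minimal sketch: it queries several client-side views of the fabric (the tools are from the standard infiniband-diags package) and degrades gracefully where a tool is missing. Note that 'sminfo' alone reported a healthy subnet manager in this thread, so the switch's own management interface and logs still need to be checked separately.

```shell
#!/bin/bash
# Cross-check subnet-manager health from a client. 'sminfo' alone can report
# a master SM even while the SM on the switch is flapping, so gather several
# views of the fabric. All tools are from the infiniband-diags package; the
# switch-side log check must still be done via the switch's web management.
check_sm_sources() {
    for tool in sminfo ibstat saquery; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "== $tool =="
            "$tool" 2>&1 | head -n 20
        else
            echo "== $tool: not installed on this host =="
        fi
    done
}

check_sm_sources
```

Running this periodically and diffing the output can surface an unstable subnet manager that a single 'sminfo' call misses.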



On Mon, May 1, 2017 at 3:15 PM, E.S. Rosenberg 
wrote:

>
>
> On Mon, May 1, 2017 at 3:45 PM, Strikwerda, Ger 
> wrote:
>
>> Hi Eli,
>>
>> We have a 180+ node compute cluster, IB/10Gb connected, with the Lustre
>> storage also IB/10Gb connected. We have multiple IB switches, with the
>> master/core switch manageable via web management. This switch is a Mellanox
>> SX6036 FDR switch. One subnet manager is supposed to be running on this
>> switch, and using 'sminfo' on the clients we got info that the subnet
>> manager was alive. But when we looked via the web management, the subnet
>> manager was unstable. The reason why is unknown; it could be faulty
>> firmware. During the weekend the system was running fine.
>>
> Did anything specific make you look at the switch, or did you check there
> only after all other possibilities were ruled out?
>
>>
>>
>>
>>
>>
>>
>> On Mon, May 1, 2017 at 2:18 PM, E.S. Rosenberg <
>> esr+lus...@mail.hebrew.edu> wrote:
>>
>>>
>>>
>>> On Mon, May 1, 2017 at 11:46 AM, Strikwerda, Ger <
>>> g.j.c.strikwe...@rug.nl> wrote:
>>>
 Hi all,

 Our clients-failed-to-mount/lctl ping horror turned out to be a failing
 subnet manager issue. We did not see an issue running 'sminfo', but on the
 IB switch we could see that the subnet manager was unstable. This caused
 mayhem on the IB/Lustre setup.

>>> Can you describe a bit more of how you found this?
>>> You are running an SM on the switches?
>>> That way, if someone else runs into this, they will be able to check it
>>> too
>>>

 Thanks, everyone, for your help/advice/hints. Good to see how this
 active community works!

>>> Indeed.
>>> Eli
>>>




 On Tue, Apr 25, 2017 at 8:17 PM, E.S. Rosenberg <
 esr+lus...@mail.hebrew.edu> wrote:

>
>
> On Tue, Apr 25, 2017 at 7:41 PM, Oucharek, Doug S <
> doug.s.oucha...@intel.com> wrote:
>
>> That specific message happens when the “magic” u32 field at the start
>> of a message does not match what we are expecting.  We do check if the
>> message was transmitted as a different endian from us so when you see this
>> error, we assume that message has been corrupted or the sender is using an
>> invalid magic value.  I don’t believe this value has changed in the history
>> of the LND so this is more likely corruption of some sort.
>>
>
> OT: this information should probably be added to LU-2977 which
> specifically includes the question: What does "consumer defined fatal
> error" mean and why is this connection rejected?
>
>
>
>> Doug
>>
>> > On Apr 25, 2017, at 2:29 AM, Dilger, Andreas <
>> andreas.dil...@intel.com> wrote:
>> >
>> > I'm not an LNet expert, but I think the critical issue to focus on
>> is:
>> >
>> >  Lustre: Lustre: Build Version: 2.5.3-RC1--PRISTINE-2.6.32-573.el6.x86_64
>> >  LNet: Added LNI 172.23.54.51@o2ib [8/256/0/180]
>> >  LNetError: 2878:0:(o2iblnd_cb.c:2587:kiblnd_rejected()) 172.23.55.211@o2ib rejected: consumer defined fatal error
>> >
>> > This means that the LND didn't connect at startup time, but I don't
>> know what the cause is.
>> > The error that generates this message is
>> IB_CM_REJ_CONSUMER_DEFINED, but I don't know enough about IB to tell you
>> what that means.  Some of the later code is checking for mismatched 
>> Lustre
>> versions, but it doesn't even get that far.
>> >
>> > Cheers, Andreas
>> >
>> >> On Apr 25, 2017, at 02:21, Strikwerda, Ger <
>> g.j.c.strikwe...@rug.nl> wrote:
>> >>
>> >> Hi Raj,
>> >>
>> >> [root@pg-gpu01 ~]# lustre_rmmod
>> >>
>> >> [root@pg-gpu01 ~]# modprobe -v lustre
>> >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/net/lustre/libcfs.ko
>> >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/fs/lustre/lvfs.ko
>> >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/net/lustre/lnet.ko networks=o2ib(ib0)
>> >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/fs/lustre/obdclass.ko
>> >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/fs/lustre/ptlrpc.ko
>> >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/fs/lustre/fid.ko
>> >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/fs/lustre/mdc.ko
>> >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/fs/lustre/osc.ko
>> >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/fs/lustre/lov.ko
>> >> insmod /lib/modules/2.6.32-642.6.2.el

Re: [lustre-discuss] client fails to mount

2017-05-01 Thread E.S. Rosenberg
On Mon, May 1, 2017 at 3:45 PM, Strikwerda, Ger 
wrote:

> Hi Eli,
>
> We have a 180+ node compute cluster, IB/10Gb connected, with the Lustre
> storage also IB/10Gb connected. We have multiple IB switches, with the
> master/core switch manageable via web management. This switch is a Mellanox
> SX6036 FDR switch. One subnet manager is supposed to be running on this
> switch, and using 'sminfo' on the clients we got info that the subnet
> manager was alive. But when we looked via the web management, the subnet
> manager was unstable. The reason why is unknown; it could be faulty
> firmware. During the weekend the system was running fine.
>
Did anything specific make you look at the switch, or did you check there
only after all other possibilities were ruled out?

>
>
>
>
>
>
> On Mon, May 1, 2017 at 2:18 PM, E.S. Rosenberg  > wrote:
>
>>
>>
>> On Mon, May 1, 2017 at 11:46 AM, Strikwerda, Ger > > wrote:
>>
>>> Hi all,
>>>
>>> Our clients-failed-to-mount/lctl ping horror turned out to be a failing
>>> subnet manager issue. We did not see an issue running 'sminfo', but on the
>>> IB switch we could see that the subnet manager was unstable. This caused
>>> mayhem on the IB/Lustre setup.
>>>
>> Can you describe a bit more of how you found this?
>> You are running an SM on the switches?
>> That way, if someone else runs into this, they will be able to check it
>> too
>>
>>>
>>> Thanks, everyone, for your help/advice/hints. Good to see how this
>>> active community works!
>>>
>> Indeed.
>> Eli
>>
>>>
>>>
>>>
>>>
>>> On Tue, Apr 25, 2017 at 8:17 PM, E.S. Rosenberg <
>>> esr+lus...@mail.hebrew.edu> wrote:
>>>


 On Tue, Apr 25, 2017 at 7:41 PM, Oucharek, Doug S <
 doug.s.oucha...@intel.com> wrote:

> That specific message happens when the “magic” u32 field at the start
> of a message does not match what we are expecting.  We do check if the
> message was transmitted as a different endian from us so when you see this
> error, we assume that message has been corrupted or the sender is using an
> invalid magic value.  I don’t believe this value has changed in the history
> of the LND so this is more likely corruption of some sort.
>

 OT: this information should probably be added to LU-2977 which
 specifically includes the question: What does "consumer defined fatal
 error" mean and why is this connection rejected?



> Doug
>
> > On Apr 25, 2017, at 2:29 AM, Dilger, Andreas <
> andreas.dil...@intel.com> wrote:
> >
> > I'm not an LNet expert, but I think the critical issue to focus on
> is:
> >
> >  Lustre: Lustre: Build Version: 2.5.3-RC1--PRISTINE-2.6.32-573.el6.x86_64
> >  LNet: Added LNI 172.23.54.51@o2ib [8/256/0/180]
> >  LNetError: 2878:0:(o2iblnd_cb.c:2587:kiblnd_rejected()) 172.23.55.211@o2ib rejected: consumer defined fatal error
> >
> > This means that the LND didn't connect at startup time, but I don't
> know what the cause is.
> > The error that generates this message is IB_CM_REJ_CONSUMER_DEFINED,
> but I don't know enough about IB to tell you what that means.  Some of the
> later code is checking for mismatched Lustre versions, but it doesn't even
> get that far.
> >
> > Cheers, Andreas
> >
> >> On Apr 25, 2017, at 02:21, Strikwerda, Ger 
> wrote:
> >>
> >> Hi Raj,
> >>
> >> [root@pg-gpu01 ~]# lustre_rmmod
> >>
> >> [root@pg-gpu01 ~]# modprobe -v lustre
> >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/net/lustre/libcfs.ko
> >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/fs/lustre/lvfs.ko
> >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/net/lustre/lnet.ko networks=o2ib(ib0)
> >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/fs/lustre/obdclass.ko
> >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/fs/lustre/ptlrpc.ko
> >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/fs/lustre/fid.ko
> >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/fs/lustre/mdc.ko
> >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/fs/lustre/osc.ko
> >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/fs/lustre/lov.ko
> >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/fs/lustre/lustre.ko
> >>
> >> dmesg:
> >>
> >> LNet: HW CPU cores: 24, npartitions: 4
> >> alg: No test for crc32 (crc32-table)
> >> alg: No test for adler32 (adler32-zlib)
> >> alg: No test for crc32 (crc32-pclmul)
> >> Lustre: Lustre: Build Version: 2.5.3-RC1--PRISTINE-2.6.32-573.el6.x86_64
> >> LNet: Added LNI 

Re: [lustre-discuss] client fails to mount

2017-05-01 Thread Strikwerda, Ger
Hi Eli,

We have a 180+ node compute cluster, IB/10Gb connected, with the Lustre
storage also IB/10Gb connected. We have multiple IB switches, with the
master/core switch manageable via web management. This switch is a Mellanox
SX6036 FDR switch. One subnet manager is supposed to be running on this
switch, and using 'sminfo' on the clients we got info that the subnet manager
was alive. But when we looked via the web management, the subnet manager was
unstable. The reason why is unknown; it could be faulty firmware. During the
weekend the system was running fine.






On Mon, May 1, 2017 at 2:18 PM, E.S. Rosenberg 
wrote:

>
>
> On Mon, May 1, 2017 at 11:46 AM, Strikwerda, Ger 
> wrote:
>
>> Hi all,
>>
>> Our clients-failed-to-mount/lctl ping horror turned out to be a failing
>> subnet manager issue. We did not see an issue running 'sminfo', but on the
>> IB switch we could see that the subnet manager was unstable. This caused
>> mayhem on the IB/Lustre setup.
>>
> Can you describe a bit more of how you found this?
> You are running an SM on the switches?
> That way, if someone else runs into this, they will be able to check it
> too
>
>>
>> Thanks, everyone, for your help/advice/hints. Good to see how this active
>> community works!
>>
> Indeed.
> Eli
>
>>
>>
>>
>>
>> On Tue, Apr 25, 2017 at 8:17 PM, E.S. Rosenberg <
>> esr+lus...@mail.hebrew.edu> wrote:
>>
>>>
>>>
>>> On Tue, Apr 25, 2017 at 7:41 PM, Oucharek, Doug S <
>>> doug.s.oucha...@intel.com> wrote:
>>>
 That specific message happens when the “magic” u32 field at the start
 of a message does not match what we are expecting.  We do check if the
 message was transmitted as a different endian from us so when you see this
 error, we assume that message has been corrupted or the sender is using an
 invalid magic value.  I don’t believe this value has changed in the history
 of the LND so this is more likely corruption of some sort.

>>>
>>> OT: this information should probably be added to LU-2977 which
>>> specifically includes the question: What does "consumer defined fatal
>>> error" mean and why is this connection rejected?
>>>
>>>
>>>
 Doug

 > On Apr 25, 2017, at 2:29 AM, Dilger, Andreas <
 andreas.dil...@intel.com> wrote:
 >
 > I'm not an LNet expert, but I think the critical issue to focus on is:
 >
 >  Lustre: Lustre: Build Version: 2.5.3-RC1--PRISTINE-2.6.32-573.el6.x86_64
 >  LNet: Added LNI 172.23.54.51@o2ib [8/256/0/180]
 >  LNetError: 2878:0:(o2iblnd_cb.c:2587:kiblnd_rejected()) 172.23.55.211@o2ib rejected: consumer defined fatal error
 >
 > This means that the LND didn't connect at startup time, but I don't
 know what the cause is.
 > The error that generates this message is IB_CM_REJ_CONSUMER_DEFINED,
 but I don't know enough about IB to tell you what that means.  Some of the
 later code is checking for mismatched Lustre versions, but it doesn't even
 get that far.
 >
 > Cheers, Andreas
 >
 >> On Apr 25, 2017, at 02:21, Strikwerda, Ger 
 wrote:
 >>
 >> Hi Raj,
 >>
 >> [root@pg-gpu01 ~]# lustre_rmmod
 >>
 >> [root@pg-gpu01 ~]# modprobe -v lustre
 >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/net/lustre/libcfs.ko
 >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/fs/lustre/lvfs.ko
 >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/net/lustre/lnet.ko networks=o2ib(ib0)
 >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/fs/lustre/obdclass.ko
 >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/fs/lustre/ptlrpc.ko
 >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/fs/lustre/fid.ko
 >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/fs/lustre/mdc.ko
 >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/fs/lustre/osc.ko
 >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/fs/lustre/lov.ko
 >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/fs/lustre/lustre.ko
 >>
 >> dmesg:
 >>
 >> LNet: HW CPU cores: 24, npartitions: 4
 >> alg: No test for crc32 (crc32-table)
 >> alg: No test for adler32 (adler32-zlib)
 >> alg: No test for crc32 (crc32-pclmul)
 >> Lustre: Lustre: Build Version: 2.5.3-RC1--PRISTINE-2.6.32-573.el6.x86_64
 >> LNet: Added LNI 172.23.54.51@o2ib [8/256/0/180]
 >>
 >> But no luck,
 >>
 >> [root@pg-gpu01 ~]# lctl ping 172.23.55.211@o2ib
 >> failed to ping 172.23.55.211@o2ib: Input/output error
 >>
 >> [root@pg-gpu01 ~]# mount /home
 >> mount.lustre: mount 172.23.55.211@o2ib:172.23.55.212@o2ib:/pghome01
 at /home failed: Input/output error
 >> Is the 

Re: [lustre-discuss] client fails to mount

2017-05-01 Thread E.S. Rosenberg
On Mon, May 1, 2017 at 11:46 AM, Strikwerda, Ger 
wrote:

> Hi all,
>
> Our clients-failed-to-mount/lctl ping horror turned out to be a failing
> subnet manager issue. We did not see an issue running 'sminfo', but on the
> IB switch we could see that the subnet manager was unstable. This caused
> mayhem on the IB/Lustre setup.
>
Can you describe a bit more of how you found this?
You are running an SM on the switches?
That way, if someone else runs into this, they will be able to check it
too

>
> Thanks, everyone, for your help/advice/hints. Good to see how this active
> community works!
>
Indeed.
Eli

>
>
>
>
> On Tue, Apr 25, 2017 at 8:17 PM, E.S. Rosenberg <
> esr+lus...@mail.hebrew.edu> wrote:
>
>>
>>
>> On Tue, Apr 25, 2017 at 7:41 PM, Oucharek, Doug S <
>> doug.s.oucha...@intel.com> wrote:
>>
>>> That specific message happens when the “magic” u32 field at the start of
>>> a message does not match what we are expecting.  We do check if the message
>>> was transmitted as a different endian from us so when you see this error,
>>> we assume that message has been corrupted or the sender is using an invalid
>>> magic value.  I don’t believe this value has changed in the history of the
>>> LND so this is more likely corruption of some sort.
>>>
>>
>> OT: this information should probably be added to LU-2977 which
>> specifically includes the question: What does "consumer defined fatal
>> error" mean and why is this connection rejected?
>>
>>
>>
>>> Doug
>>>
>>> > On Apr 25, 2017, at 2:29 AM, Dilger, Andreas 
>>> wrote:
>>> >
>>> > I'm not an LNet expert, but I think the critical issue to focus on is:
>>> >
>>> >  Lustre: Lustre: Build Version: 2.5.3-RC1--PRISTINE-2.6.32-573.el6.x86_64
>>> >  LNet: Added LNI 172.23.54.51@o2ib [8/256/0/180]
>>> >  LNetError: 2878:0:(o2iblnd_cb.c:2587:kiblnd_rejected()) 172.23.55.211@o2ib rejected: consumer defined fatal error
>>> >
>>> > This means that the LND didn't connect at startup time, but I don't
>>> know what the cause is.
>>> > The error that generates this message is IB_CM_REJ_CONSUMER_DEFINED,
>>> but I don't know enough about IB to tell you what that means.  Some of the
>>> later code is checking for mismatched Lustre versions, but it doesn't even
>>> get that far.
>>> >
>>> > Cheers, Andreas
>>> >
>>> >> On Apr 25, 2017, at 02:21, Strikwerda, Ger 
>>> wrote:
>>> >>
>>> >> Hi Raj,
>>> >>
>>> >> [root@pg-gpu01 ~]# lustre_rmmod
>>> >>
>>> >> [root@pg-gpu01 ~]# modprobe -v lustre
>>> >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/net/lustre/libcfs.ko
>>> >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/fs/lustre/lvfs.ko
>>> >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/net/lustre/lnet.ko networks=o2ib(ib0)
>>> >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/fs/lustre/obdclass.ko
>>> >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/fs/lustre/ptlrpc.ko
>>> >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/fs/lustre/fid.ko
>>> >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/fs/lustre/mdc.ko
>>> >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/fs/lustre/osc.ko
>>> >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/fs/lustre/lov.ko
>>> >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/fs/lustre/lustre.ko
>>> >>
>>> >> dmesg:
>>> >>
>>> >> LNet: HW CPU cores: 24, npartitions: 4
>>> >> alg: No test for crc32 (crc32-table)
>>> >> alg: No test for adler32 (adler32-zlib)
>>> >> alg: No test for crc32 (crc32-pclmul)
>>> >> Lustre: Lustre: Build Version: 2.5.3-RC1--PRISTINE-2.6.32-573.el6.x86_64
>>> >> LNet: Added LNI 172.23.54.51@o2ib [8/256/0/180]
>>> >>
>>> >> But no luck,
>>> >>
>>> >> [root@pg-gpu01 ~]# lctl ping 172.23.55.211@o2ib
>>> >> failed to ping 172.23.55.211@o2ib: Input/output error
>>> >>
>>> >> [root@pg-gpu01 ~]# mount /home
>>> >> mount.lustre: mount 172.23.55.211@o2ib:172.23.55.212@o2ib:/pghome01
>>> at /home failed: Input/output error
>>> >> Is the MGS running?
>>> >>
>>> >>
>>> >>
>>> >>
>>> >>
>>> >>
>>> >> On Mon, Apr 24, 2017 at 7:53 PM, Raj  wrote:
>>> >> Yes, this is strange. Normally, I have seen a credits mismatch result in
>>> this scenario, but it doesn't look like that is the case here.
>>> >>
>>> >> You wouldn't want to put mgs into capture debug messages as there
>>> will be a lot of data.
>>> >>
>>> >> I guess you already tried removing the lustre drivers and adding it
>>> again ?
>>> >> lustre_rmmod
>>> >> modprobe -v lustre
>>> >>
>>> >> And check dmesg for any errors...
>>> >>
>>> >>
>>> >> On Mon, Apr 24, 2017 at 12:43 PM Strikwerda, Ger <
>>> g.j.c.strikwe...@rug.nl> wrote:
>>> >> Hi Raj,
>>> >>
>>> >> When i do a lctl ping on a MGS server i do not see any logs at all.
>>> Also 

Re: [lustre-discuss] client fails to mount

2017-05-01 Thread Strikwerda, Ger
Hi all,

Our clients-failed-to-mount/lctl ping horror turned out to be a failing
subnet manager issue. We did not see an issue running 'sminfo', but on the
IB switch we could see that the subnet manager was unstable. This caused
mayhem on the IB/Lustre setup.

Thanks, everyone, for your help/advice/hints. Good to see how this active
community works!




On Tue, Apr 25, 2017 at 8:17 PM, E.S. Rosenberg 
wrote:

>
>
> On Tue, Apr 25, 2017 at 7:41 PM, Oucharek, Doug S <
> doug.s.oucha...@intel.com> wrote:
>
>> That specific message happens when the “magic” u32 field at the start of
>> a message does not match what we are expecting.  We do check if the message
>> was transmitted as a different endian from us so when you see this error,
>> we assume that message has been corrupted or the sender is using an invalid
>> magic value.  I don’t believe this value has changed in the history of the
>> LND so this is more likely corruption of some sort.
>>
>
> OT: this information should probably be added to LU-2977 which
> specifically includes the question: What does "consumer defined fatal
> error" mean and why is this connection rejected?
>
>
>
>> Doug
>>
>> > On Apr 25, 2017, at 2:29 AM, Dilger, Andreas 
>> wrote:
>> >
>> > I'm not an LNet expert, but I think the critical issue to focus on is:
>> >
>> >  Lustre: Lustre: Build Version: 2.5.3-RC1--PRISTINE-2.6.32-573.el6.x86_64
>> >  LNet: Added LNI 172.23.54.51@o2ib [8/256/0/180]
>> >  LNetError: 2878:0:(o2iblnd_cb.c:2587:kiblnd_rejected()) 172.23.55.211@o2ib rejected: consumer defined fatal error
>> >
>> > This means that the LND didn't connect at startup time, but I don't
>> know what the cause is.
>> > The error that generates this message is IB_CM_REJ_CONSUMER_DEFINED,
>> but I don't know enough about IB to tell you what that means.  Some of the
>> later code is checking for mismatched Lustre versions, but it doesn't even
>> get that far.
>> >
>> > Cheers, Andreas
>> >
>> >> On Apr 25, 2017, at 02:21, Strikwerda, Ger 
>> wrote:
>> >>
>> >> Hi Raj,
>> >>
>> >> [root@pg-gpu01 ~]# lustre_rmmod
>> >>
>> >> [root@pg-gpu01 ~]# modprobe -v lustre
>> >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/net/lustre/libcfs.ko
>> >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/fs/lustre/lvfs.ko
>> >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/net/lustre/lnet.ko networks=o2ib(ib0)
>> >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/fs/lustre/obdclass.ko
>> >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/fs/lustre/ptlrpc.ko
>> >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/fs/lustre/fid.ko
>> >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/fs/lustre/mdc.ko
>> >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/fs/lustre/osc.ko
>> >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/fs/lustre/lov.ko
>> >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/fs/lustre/lustre.ko
>> >>
>> >> dmesg:
>> >>
>> >> LNet: HW CPU cores: 24, npartitions: 4
>> >> alg: No test for crc32 (crc32-table)
>> >> alg: No test for adler32 (adler32-zlib)
>> >> alg: No test for crc32 (crc32-pclmul)
>> >> Lustre: Lustre: Build Version: 2.5.3-RC1--PRISTINE-2.6.32-573.el6.x86_64
>> >> LNet: Added LNI 172.23.54.51@o2ib [8/256/0/180]
>> >>
>> >> But no luck,
>> >>
>> >> [root@pg-gpu01 ~]# lctl ping 172.23.55.211@o2ib
>> >> failed to ping 172.23.55.211@o2ib: Input/output error
>> >>
>> >> [root@pg-gpu01 ~]# mount /home
>> >> mount.lustre: mount 172.23.55.211@o2ib:172.23.55.212@o2ib:/pghome01
>> at /home failed: Input/output error
>> >> Is the MGS running?
>> >>
>> >>
>> >>
>> >>
>> >>
>> >>
>> >> On Mon, Apr 24, 2017 at 7:53 PM, Raj  wrote:
>> >> Yes, this is strange. Normally, I have seen a credits mismatch result in
>> this scenario, but it doesn't look like that is the case here.
>> >>
>> >> You wouldn't want to put mgs into capture debug messages as there will
>> be a lot of data.
>> >>
>> >> I guess you already tried removing the lustre drivers and adding it
>> again ?
>> >> lustre_rmmod
>> >> modprobe -v lustre
>> >>
>> >> And check dmesg for any errors...
>> >>
>> >>
>> >> On Mon, Apr 24, 2017 at 12:43 PM Strikwerda, Ger <
>> g.j.c.strikwe...@rug.nl> wrote:
>> >> Hi Raj,
>> >>
>> >> When I do an lctl ping on an MGS server I do not see any logs at all,
>> also not when I do a successful ping from a working node. Is there a way to
>> make the Lustre logging more verbose, to see more detail at the LNET level?
>> >>
>> >> It is very strange that a rebooted node is able to lctl ping compute
>> nodes, but fails to lctl ping metadata and storage nodes.
>> >>
>> >>
>> >>
>> >>
>> >> On Mon, Apr 24, 2017 at 7:35 PM, Raj  wrote:
>> >> Ger,
>> >> It looks like 

Re: [lustre-discuss] client fails to mount

2017-04-25 Thread E.S. Rosenberg
On Tue, Apr 25, 2017 at 7:41 PM, Oucharek, Doug S  wrote:

> That specific message happens when the “magic” u32 field at the start of a
> message does not match what we are expecting.  We do check if the message
> was transmitted as a different endian from us so when you see this error,
> we assume that message has been corrupted or the sender is using an invalid
> magic value.  I don’t believe this value has changed in the history of the
> LND so this is more likely corruption of some sort.
>

OT: this information should probably be added to LU-2977 which specifically
includes the question: What does "consumer defined fatal error" mean and
why is this connection rejected?



> Doug
>
> > On Apr 25, 2017, at 2:29 AM, Dilger, Andreas 
> wrote:
> >
> > I'm not an LNet expert, but I think the critical issue to focus on is:
> >
> >  Lustre: Lustre: Build Version: 2.5.3-RC1--PRISTINE-2.6.32-573.el6.x86_64
> >  LNet: Added LNI 172.23.54.51@o2ib [8/256/0/180]
> >  LNetError: 2878:0:(o2iblnd_cb.c:2587:kiblnd_rejected()) 172.23.55.211@o2ib rejected: consumer defined fatal error
> >
> > This means that the LND didn't connect at startup time, but I don't know
> what the cause is.
> > The error that generates this message is IB_CM_REJ_CONSUMER_DEFINED, but
> I don't know enough about IB to tell you what that means.  Some of the
> later code is checking for mismatched Lustre versions, but it doesn't even
> get that far.
> >
> > Cheers, Andreas
> >
> >> On Apr 25, 2017, at 02:21, Strikwerda, Ger 
> wrote:
> >>
> >> Hi Raj,
> >>
> >> [root@pg-gpu01 ~]# lustre_rmmod
> >>
> >> [root@pg-gpu01 ~]# modprobe -v lustre
> >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/net/lustre/libcfs.ko
> >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/fs/lustre/lvfs.ko
> >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/net/lustre/lnet.ko networks=o2ib(ib0)
> >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/fs/lustre/obdclass.ko
> >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/fs/lustre/ptlrpc.ko
> >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/fs/lustre/fid.ko
> >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/fs/lustre/mdc.ko
> >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/fs/lustre/osc.ko
> >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/fs/lustre/lov.ko
> >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/fs/lustre/lustre.ko
> >>
> >> dmesg:
> >>
> >> LNet: HW CPU cores: 24, npartitions: 4
> >> alg: No test for crc32 (crc32-table)
> >> alg: No test for adler32 (adler32-zlib)
> >> alg: No test for crc32 (crc32-pclmul)
> >> Lustre: Lustre: Build Version: 2.5.3-RC1--PRISTINE-2.6.32-573.el6.x86_64
> >> LNet: Added LNI 172.23.54.51@o2ib [8/256/0/180]
> >>
> >> But no luck,
> >>
> >> [root@pg-gpu01 ~]# lctl ping 172.23.55.211@o2ib
> >> failed to ping 172.23.55.211@o2ib: Input/output error
> >>
> >> [root@pg-gpu01 ~]# mount /home
> >> mount.lustre: mount 172.23.55.211@o2ib:172.23.55.212@o2ib:/pghome01 at
> /home failed: Input/output error
> >> Is the MGS running?
> >>
> >>
> >>
> >>
> >>
> >>
> >> On Mon, Apr 24, 2017 at 7:53 PM, Raj  wrote:
> >> Yes, this is strange. Normally, I have seen a credits mismatch result in
> this scenario, but it doesn't look like that is the case here.
> >>
> >> You wouldn't want to put mgs into capture debug messages as there will
> be a lot of data.
> >>
> >> I guess you already tried removing the lustre drivers and adding it
> again ?
> >> lustre_rmmod
> >> modprobe -v lustre
> >>
> >> And check dmesg for any errors...
> >>
> >>
> >> On Mon, Apr 24, 2017 at 12:43 PM Strikwerda, Ger <
> g.j.c.strikwe...@rug.nl> wrote:
> >> Hi Raj,
> >>
> >> When I do an lctl ping on an MGS server I do not see any logs at all,
> also not when I do a successful ping from a working node. Is there a way to
> make the Lustre logging more verbose, to see more detail at the LNET level?
> >>
> >> It is very strange that a rebooted node is able to lctl ping compute
> nodes, but fails to lctl ping metadata and storage nodes.
> >>
> >>
> >>
> >>
> >> On Mon, Apr 24, 2017 at 7:35 PM, Raj  wrote:
> >> Ger,
> >> It looks like default configuration of lustre.
> >>
> >> Do you see any error message on the MGS side while you are doing lctl
> ping from the rebooted clients?
> >> On Mon, Apr 24, 2017 at 12:27 PM Strikwerda, Ger <
> g.j.c.strikwe...@rug.nl> wrote:
> >> Hi Eli,
> >>
> >> Nothing can be mounted on the Lustre filesystems so the output is:
> >>
> >> [root@pg-gpu01 ~]# lfs df /home/ger/
> >> [root@pg-gpu01 ~]#
> >>
> >> Empty..
> >>
> >>
> >>
> >> On Mon, Apr 24, 2017 at 7:24 PM, E.S. Rosenberg 
> wrote:
> >>
> >>
> >> On Mon, Apr 24, 2017 at 8:19 PM, Strikwerda, Ger <
> 

Re: [lustre-discuss] client fails to mount

2017-04-25 Thread Oucharek, Doug S
That specific message happens when the “magic” u32 field at the start of a 
message does not match what we are expecting. We do check whether the message 
was transmitted with a different endianness from ours, so when you see this 
error, we assume the message has been corrupted or the sender is using an 
invalid magic value. I don’t believe this value has changed in the history of 
the LND, so this is more likely corruption of some sort.

Doug
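
The check described above can be sketched as follows. This is a minimal illustration, not the actual o2iblnd code, and the magic constant and function names here are illustrative only: the receiver accepts the expected magic, accepts its byte-swapped form (a peer with the opposite endianness), and otherwise treats the message as corrupt or carrying a bad magic.

```shell
#!/bin/bash
# Illustrative sketch of an LND-style magic check. MAGIC is a stand-in
# value, not necessarily the real o2iblnd constant.
MAGIC=$((0x0be91b91))

# Byte-swap a 32-bit value (what a peer with opposite endianness would send).
bswap32() {
    local v=$1
    echo $(( ((v & 0xff) << 24) | ((v & 0xff00) << 8) | ((v >> 8) & 0xff00) | ((v >> 24) & 0xff) ))
}

# Classify the first u32 of an incoming message.
classify_magic() {
    local got=$1
    if [ "$got" -eq "$MAGIC" ]; then
        echo same-endian
    elif [ "$got" -eq "$(bswap32 "$MAGIC")" ]; then
        echo opposite-endian
    else
        echo corrupt-or-bad-magic
    fi
}

classify_magic $((0x911be90b))   # prints "opposite-endian"
```

Any value that matches neither form falls into the third branch, which is the "corruption of some sort" case Doug describes.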

> On Apr 25, 2017, at 2:29 AM, Dilger, Andreas  wrote:
> 
> I'm not an LNet expert, but I think the critical issue to focus on is:
> 
>  Lustre: Lustre: Build Version: 2.5.3-RC1--PRISTINE-2.6.32-573.el6.x86_64
>  LNet: Added LNI 172.23.54.51@o2ib [8/256/0/180]
>  LNetError: 2878:0:(o2iblnd_cb.c:2587:kiblnd_rejected()) 172.23.55.211@o2ib 
> rejected: consumer defined fatal error
> 
> This means that the LND didn't connect at startup time, but I don't know what 
> the cause is.
> The error that generates this message is IB_CM_REJ_CONSUMER_DEFINED, but I 
> don't know enough about IB to tell you what that means.  Some of the later 
> code is checking for mismatched Lustre versions, but it doesn't even get that 
> far.
> 
> Cheers, Andreas
> 
>> On Apr 25, 2017, at 02:21, Strikwerda, Ger  wrote:
>> 
>> Hi Raj,
>> 
>> [root@pg-gpu01 ~]# lustre_rmmod
>> 
>> [root@pg-gpu01 ~]# modprobe -v lustre
>> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/net/lustre/libcfs.ko
>> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/fs/lustre/lvfs.ko
>> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/net/lustre/lnet.ko networks=o2ib(ib0)
>> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/fs/lustre/obdclass.ko
>> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/fs/lustre/ptlrpc.ko
>> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/fs/lustre/fid.ko
>> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/fs/lustre/mdc.ko
>> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/fs/lustre/osc.ko
>> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/fs/lustre/lov.ko
>> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/fs/lustre/lustre.ko
>> 
>> dmesg:
>> 
>> LNet: HW CPU cores: 24, npartitions: 4
>> alg: No test for crc32 (crc32-table)
>> alg: No test for adler32 (adler32-zlib)
>> alg: No test for crc32 (crc32-pclmul)
>> Lustre: Lustre: Build Version: 2.5.3-RC1--PRISTINE-2.6.32-573.el6.x86_64
>> LNet: Added LNI 172.23.54.51@o2ib [8/256/0/180]
>> 
>> But no luck,
>> 
>> [root@pg-gpu01 ~]# lctl ping 172.23.55.211@o2ib
>> failed to ping 172.23.55.211@o2ib: Input/output error
>> 
>> [root@pg-gpu01 ~]# mount /home
>> mount.lustre: mount 172.23.55.211@o2ib:172.23.55.212@o2ib:/pghome01 at /home 
>> failed: Input/output error
>> Is the MGS running?
>> 
>> 
>> 
>> 
>> 
>> 
>> On Mon, Apr 24, 2017 at 7:53 PM, Raj  wrote:
>> Yes, this is strange. Normally I have seen that a credits mismatch results 
>> in this scenario, but it doesn't look like that is the case here. 
>> 
>> You wouldn't want to have the MGS capture debug messages, as there will be a 
>> lot of data. 
>> 
>> I guess you already tried removing the Lustre drivers and adding them again? 
>> lustre_rmmod 
>> modprobe -v lustre
>> 
>> And check dmesg for any errors...
>> 
>> 
>> On Mon, Apr 24, 2017 at 12:43 PM Strikwerda, Ger  
>> wrote:
>> Hi Raj,
>> 
>> When I do an lctl ping on the MGS server I do not see any logs at all, not 
>> even when I do a successful ping from a working node. Is there a way to make 
>> the Lustre logging more verbose, to see more detail at the LNET level?
>> 
>> It is very strange that a rebooted node is able to lctl ping compute nodes, 
>> but fails to lctl ping the metadata and storage nodes. 
>> 
>> 
>> 
>> 
>> On Mon, Apr 24, 2017 at 7:35 PM, Raj  wrote:
>> Ger,
>> It looks like the default Lustre configuration. 
>> 
>> Do you see any error messages on the MGS side while you are doing an lctl 
>> ping from the rebooted clients? 
>> On Mon, Apr 24, 2017 at 12:27 PM Strikwerda, Ger  
>> wrote:
>> Hi Eli,
>> 
>> Nothing can be mounted on the Lustre filesystems so the output is:
>> 
>> [root@pg-gpu01 ~]# lfs df /home/ger/
>> [root@pg-gpu01 ~]# 
>> 
>> Empty..
>> 
>> 
>> 
>> On Mon, Apr 24, 2017 at 7:24 PM, E.S. Rosenberg  wrote:
>> 
>> 
>> On Mon, Apr 24, 2017 at 8:19 PM, Strikwerda, Ger  
>> wrote:
>> Hallo Eli,
>> 
>> Logfile/syslog on the client-side:
>> 
>> Lustre: Lustre: Build Version: 2.5.3-RC1--PRISTINE-2.6.32-573.el6.x86_64
>> LNet: Added LNI 172.23.54.51@o2ib [8/256/0/180]
>> LNetError: 2878:0:(o2iblnd_cb.c:2587:kiblnd_rejected()) 172.23.55.211@o2ib 
>> rejected: consumer defined fatal error
>> 
>> lctl df /path/to/some/file
>> 
>> gives 

Re: [lustre-discuss] client fails to mount

2017-04-25 Thread Dilger, Andreas
I'm not an LNet expert, but I think the critical issue to focus on is:

  Lustre: Lustre: Build Version: 2.5.3-RC1--PRISTINE-2.6.32-573.el6.x86_64
  LNet: Added LNI 172.23.54.51@o2ib [8/256/0/180]
  LNetError: 2878:0:(o2iblnd_cb.c:2587:kiblnd_rejected()) 172.23.55.211@o2ib 
rejected: consumer defined fatal error

This means that the LND didn't connect at startup time, but I don't know what 
the cause is.
The error that generates this message is IB_CM_REJ_CONSUMER_DEFINED, but I 
don't know enough about IB to tell you what that means.  Some of the later code 
is checking for mismatched Lustre versions, but it doesn't even get that far.
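For anyone wanting to dig further into the reject: one way to capture more
detail from the o2iblnd side is Lustre's kernel debug log. This is only a
sketch, using the usual 2.5-era lctl parameters, so check the names against
your version:

  lctl set_param debug=+net      # add network-level messages to the debug mask
  lctl ping 172.23.55.211@o2ib   # reproduce the failure
  lctl dk /tmp/lnet-debug.log    # dump the kernel debug buffer to a file
  grep -i reject /tmp/lnet-debug.log

The dump is verbose, but it usually shows which side generated the reject.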

Cheers, Andreas

> On Apr 25, 2017, at 02:21, Strikwerda, Ger  wrote:
> 
> Hi Raj,
> 
> [root@pg-gpu01 ~]# lustre_rmmod
> 
> [root@pg-gpu01 ~]# modprobe -v lustre
> insmod 
> /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/net/lustre/libcfs.ko
>  
> insmod 
> /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/fs/lustre/lvfs.ko 
> insmod 
> /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/net/lustre/lnet.ko 
> networks=o2ib(ib0)
> insmod 
> /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/fs/lustre/obdclass.ko
>  
> insmod 
> /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/fs/lustre/ptlrpc.ko
>  
> insmod 
> /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/fs/lustre/fid.ko 
> insmod 
> /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/fs/lustre/mdc.ko 
> insmod 
> /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/fs/lustre/osc.ko 
> insmod 
> /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/fs/lustre/lov.ko 
> insmod 
> /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/fs/lustre/lustre.ko
>  
> 
> dmesg:
> 
> LNet: HW CPU cores: 24, npartitions: 4
> alg: No test for crc32 (crc32-table)
> alg: No test for adler32 (adler32-zlib)
> alg: No test for crc32 (crc32-pclmul)
> Lustre: Lustre: Build Version: 2.5.3-RC1--PRISTINE-2.6.32-573.el6.x86_64
> LNet: Added LNI 172.23.54.51@o2ib [8/256/0/180]
> 
> But no luck,
> 
> [root@pg-gpu01 ~]# lctl ping 172.23.55.211@o2ib
> failed to ping 172.23.55.211@o2ib: Input/output error
> 
> [root@pg-gpu01 ~]# mount /home
> mount.lustre: mount 172.23.55.211@o2ib:172.23.55.212@o2ib:/pghome01 at /home 
> failed: Input/output error
> Is the MGS running?
> 
> 
> 
> 
> 
> 
> On Mon, Apr 24, 2017 at 7:53 PM, Raj  wrote:
> Yes, this is strange. Normally I have seen that a credits mismatch results 
> in this scenario, but it doesn't look like that is the case here. 
> 
> You wouldn't want to have the MGS capture debug messages, as there will be a 
> lot of data. 
> 
> I guess you already tried removing the Lustre drivers and adding them again? 
> lustre_rmmod 
> modprobe -v lustre
> 
> And check dmesg for any errors...
> 
> 
> On Mon, Apr 24, 2017 at 12:43 PM Strikwerda, Ger  
> wrote:
> Hi Raj,
> 
> When I do an lctl ping on the MGS server I do not see any logs at all, not 
> even when I do a successful ping from a working node. Is there a way to make 
> the Lustre logging more verbose, to see more detail at the LNET level?
> 
> It is very strange that a rebooted node is able to lctl ping compute nodes, 
> but fails to lctl ping the metadata and storage nodes. 
> 
> 
> 
> 
> On Mon, Apr 24, 2017 at 7:35 PM, Raj  wrote:
> Ger,
> It looks like the default Lustre configuration. 
> 
> Do you see any error messages on the MGS side while you are doing an lctl 
> ping from the rebooted clients? 
> On Mon, Apr 24, 2017 at 12:27 PM Strikwerda, Ger  
> wrote:
> Hi Eli,
> 
> Nothing can be mounted on the Lustre filesystems so the output is:
> 
> [root@pg-gpu01 ~]# lfs df /home/ger/
> [root@pg-gpu01 ~]# 
> 
> Empty..
> 
> 
> 
> On Mon, Apr 24, 2017 at 7:24 PM, E.S. Rosenberg  wrote:
> 
> 
> On Mon, Apr 24, 2017 at 8:19 PM, Strikwerda, Ger  
> wrote:
> Hallo Eli,
> 
> Logfile/syslog on the client-side:
> 
> Lustre: Lustre: Build Version: 2.5.3-RC1--PRISTINE-2.6.32-573.el6.x86_64
> LNet: Added LNI 172.23.54.51@o2ib [8/256/0/180]
> LNetError: 2878:0:(o2iblnd_cb.c:2587:kiblnd_rejected()) 172.23.55.211@o2ib 
> rejected: consumer defined fatal error
> 
> lctl df /path/to/some/file
> 
> gives nothing useful? (the second one will dump *a lot*)
> 
> 
> 
> 
> On Mon, Apr 24, 2017 at 7:16 PM, E.S. Rosenberg  
> wrote:
> 
> 
> On Mon, Apr 24, 2017 at 8:13 PM, Strikwerda, Ger  
> wrote:
> Hi Raj (and others),
> 
> In which file should i state the credits/peer_credits stuff? 
> 
> Perhaps relevant config-files:
> 
> [root@pg-gpu01 ~]# cd /etc/modprobe.d/
> 
> [root@pg-gpu01 modprobe.d]# ls
> anaconda.conf   blacklist-kvm.conf  dist-alsa.conf  dist-oss.conf 
>   ib_ipoib.conf  lustre.conf  openfwwf.conf
> blacklist.conf  blacklist-nouveau.conf  dist.conf   
> freeipmi-modalias.conf  ib_sdp.conf 

Re: [lustre-discuss] client fails to mount

2017-04-25 Thread Strikwerda, Ger
Hi Raj,

[root@pg-gpu01 ~]# lustre_rmmod

[root@pg-gpu01 ~]# modprobe -v lustre
insmod
/lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/net/lustre/libcfs.ko

insmod
/lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/fs/lustre/lvfs.ko

insmod
/lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/net/lustre/lnet.ko
networks=o2ib(ib0)
insmod
/lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/fs/lustre/obdclass.ko

insmod
/lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/fs/lustre/ptlrpc.ko

insmod
/lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/fs/lustre/fid.ko
insmod
/lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/fs/lustre/mdc.ko
insmod
/lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/fs/lustre/osc.ko
insmod
/lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/fs/lustre/lov.ko
insmod
/lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/fs/lustre/lustre.ko


dmesg:

LNet: HW CPU cores: 24, npartitions: 4
alg: No test for crc32 (crc32-table)
alg: No test for adler32 (adler32-zlib)
alg: No test for crc32 (crc32-pclmul)
Lustre: Lustre: Build Version: 2.5.3-RC1--PRISTINE-2.6.32-573.el6.x86_64
LNet: Added LNI 172.23.54.51@o2ib [8/256/0/180]

But no luck,

[root@pg-gpu01 ~]# lctl ping 172.23.55.211@o2ib
failed to ping 172.23.55.211@o2ib: Input/output error

[root@pg-gpu01 ~]# mount /home
mount.lustre: mount 172.23.55.211@o2ib:172.23.55.212@o2ib:/pghome01 at
/home failed: Input/output error
Is the MGS running?
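To separate a local LNet problem from a remote one, a quick sequence worth
trying here (a sketch; adjust NIDs to your own setup):

  lctl list_nids                 # confirm the local NID actually came up on o2ib
  lctl ping 172.23.54.51@o2ib    # ping our own NID first
  lctl ping 172.23.55.211@o2ib   # then the MGS

If the self-ping works but the MGS ping fails, the problem is on the path or
the far side rather than in the local module setup.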






On Mon, Apr 24, 2017 at 7:53 PM, Raj  wrote:

> Yes, this is strange. Normally I have seen that a credits mismatch results
> in this scenario, but it doesn't look like that is the case here.
>
> You wouldn't want to have the MGS capture debug messages, as there will be
> a lot of data.
>
> I guess you already tried removing the Lustre drivers and adding them
> again?
> lustre_rmmod
> modprobe -v lustre
>
> And check dmesg for any errors...
>
>
> On Mon, Apr 24, 2017 at 12:43 PM Strikwerda, Ger 
> wrote:
>
>> Hi Raj,
>>
>> When I do an lctl ping on the MGS server I do not see any logs at all, not
>> even when I do a successful ping from a working node. Is there a way to
>> make the Lustre logging more verbose, to see more detail at the LNET level?
>>
>> It is very strange that a rebooted node is able to lctl ping compute
>> nodes, but fails to lctl ping the metadata and storage nodes.
>>
>>
>>
>>
>> On Mon, Apr 24, 2017 at 7:35 PM, Raj  wrote:
>>
>>> Ger,
>>> It looks like the default Lustre configuration.
>>>
>>> Do you see any error messages on the MGS side while you are doing an lctl
>>> ping from the rebooted clients?
>>> On Mon, Apr 24, 2017 at 12:27 PM Strikwerda, Ger <
>>> g.j.c.strikwe...@rug.nl> wrote:
>>>
 Hi Eli,

 Nothing can be mounted on the Lustre filesystems so the output is:

 [root@pg-gpu01 ~]# lfs df /home/ger/
 [root@pg-gpu01 ~]#

 Empty..



 On Mon, Apr 24, 2017 at 7:24 PM, E.S. Rosenberg 
 wrote:

>
>
> On Mon, Apr 24, 2017 at 8:19 PM, Strikwerda, Ger <
> g.j.c.strikwe...@rug.nl> wrote:
>
>> Hallo Eli,
>>
>> Logfile/syslog on the client-side:
>>
>> Lustre: Lustre: Build Version: 2.5.3-RC1--PRISTINE-2.6.32-
>> 573.el6.x86_64
>> LNet: Added LNI 172.23.54.51@o2ib [8/256/0/180]
>> LNetError: 2878:0:(o2iblnd_cb.c:2587:kiblnd_rejected())
>> 172.23.55.211@o2ib rejected: consumer defined fatal error
>>
>
> lctl df /path/to/some/file
>
> gives nothing useful? (the second one will dump *a lot*)
>
>>
>>
>>
>>
>> On Mon, Apr 24, 2017 at 7:16 PM, E.S. Rosenberg <
>> esr+lus...@mail.hebrew.edu> wrote:
>>
>>>
>>>
>>> On Mon, Apr 24, 2017 at 8:13 PM, Strikwerda, Ger <
>>> g.j.c.strikwe...@rug.nl> wrote:
>>>
 Hi Raj (and others),

 In which file should i state the credits/peer_credits stuff?

 Perhaps relevant config-files:

 [root@pg-gpu01 ~]# cd /etc/modprobe.d/

 [root@pg-gpu01 modprobe.d]# ls
 anaconda.conf   blacklist-kvm.conf  dist-alsa.conf
 dist-oss.conf   ib_ipoib.conf  lustre.conf  openfwwf.conf
 blacklist.conf  blacklist-nouveau.conf  dist.conf
 freeipmi-modalias.conf  ib_sdp.confmlnx.conftruescale.conf

 [root@pg-gpu01 modprobe.d]# cat ./ib_ipoib.conf
 alias netdev-ib* ib_ipoib

 [root@pg-gpu01 modprobe.d]# cat ./mlnx.conf
 # Module parameters for MLNX_OFED kernel modules

 [root@pg-gpu01 modprobe.d]# cat ./lustre.conf
 options lnet networks=o2ib(ib0)

 Are there more Lustre/LNET options that could help in this
 situation?

>>>
>>> What about the logfiles?
>>> Any error messages in syslog? lctl debug options?
>>> Veel geluk,
>>> 

Re: [lustre-discuss] client fails to mount

2017-04-25 Thread Strikwerda, Ger
Hi Brett,

Yes, we can ibping from the rebooted client to the metadata-server:

[root@pg-gpu01 ~]# ibping -G 0xf45214030062ee91
Pong from pg-mds01.(none) (Lid 179): time 0.094 ms
Pong from pg-mds01.(none) (Lid 179): time 0.139 ms
Pong from pg-mds01.(none) (Lid 179): time 0.110 ms
Pong from pg-mds01.(none) (Lid 179): time 0.149 ms

But lctl ping fails at once, no timeouts or anything:

[root@pg-gpu01 ~]# lctl ping 172.23.55.211@o2ib
failed to ping 172.23.55.211@o2ib: Input/output error

We also see some differences in the MGC listings. We have two metadata
servers, and on pg-mds01 the MGS is mounted and running:

[root@pg-mds02 ~]# lctl dl | grep mgc
  4 UP mgc MGC172.23.55.211@o2ib 24ecba8d-1574-c649-47fc-c7bc944ce4af 5

[root@pg-mds01 ~]# lctl dl | grep mgc
  1 UP mgc MGC172.23.55.211@o2ib 0c7a07eb-a49a-189a-89b5-86e6ef805fc3 5

Any ideas/advice on the differing hex strings?
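Since ibping works while lctl ping fails, the subnet manager layer is also
worth a look. A sketch, assuming the usual infiniband-diags tools are
installed:

  sminfo         # which LID claims to be the master SM, and its state
  ibdiagnet      # full fabric sweep; reports SM problems and link errors

Note that sminfo reporting a master SM does not guarantee the SM is stable;
checking the SM state on the managed switch itself can tell a different
story.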







On Mon, Apr 24, 2017 at 11:20 PM, Brett Lee 
wrote:

> So, the LNet ping is not working, and LNet is running on IB.  Have you
> moved down the stack toward the hardware, running an ibping from a rebooted
> client to the MGS?
>
> Brett
> --
> Protect Yourself Against Cybercrime
> PDS Software Solutions LLC
> https://www.TrustPDS.com 
>
> On Mon, Apr 24, 2017 at 11:53 AM, Raj  wrote:
>
>> Yes, this is strange. Normally I have seen that a credits mismatch results
>> in this scenario, but it doesn't look like that is the case here.
>>
>> You wouldn't want to have the MGS capture debug messages, as there will be
>> a lot of data.
>>
>> I guess you already tried removing the Lustre drivers and adding them
>> again?
>> lustre_rmmod
>> modprobe -v lustre
>>
>> And check dmesg for any errors...
>>
>>
>> On Mon, Apr 24, 2017 at 12:43 PM Strikwerda, Ger 
>> wrote:
>>
>>> Hi Raj,
>>>
>>> When I do an lctl ping on the MGS server I do not see any logs at all, not
>>> even when I do a successful ping from a working node. Is there a way to
>>> make the Lustre logging more verbose, to see more detail at the LNET level?
>>>
>>> It is very strange that a rebooted node is able to lctl ping compute
>>> nodes, but fails to lctl ping the metadata and storage nodes.
>>>
>>>
>>>
>>>
>>> On Mon, Apr 24, 2017 at 7:35 PM, Raj  wrote:
>>>
 Ger,
 It looks like the default Lustre configuration.

 Do you see any error messages on the MGS side while you are doing an lctl
 ping from the rebooted clients?
 On Mon, Apr 24, 2017 at 12:27 PM Strikwerda, Ger <
 g.j.c.strikwe...@rug.nl> wrote:

> Hi Eli,
>
> Nothing can be mounted on the Lustre filesystems so the output is:
>
> [root@pg-gpu01 ~]# lfs df /home/ger/
> [root@pg-gpu01 ~]#
>
> Empty..
>
>
>
> On Mon, Apr 24, 2017 at 7:24 PM, E.S. Rosenberg 
> wrote:
>
>>
>>
>> On Mon, Apr 24, 2017 at 8:19 PM, Strikwerda, Ger <
>> g.j.c.strikwe...@rug.nl> wrote:
>>
>>> Hallo Eli,
>>>
>>> Logfile/syslog on the client-side:
>>>
>>> Lustre: Lustre: Build Version: 2.5.3-RC1--PRISTINE-2.6.32-573
>>> .el6.x86_64
>>> LNet: Added LNI 172.23.54.51@o2ib [8/256/0/180]
>>> LNetError: 2878:0:(o2iblnd_cb.c:2587:kiblnd_rejected())
>>> 172.23.55.211@o2ib rejected: consumer defined fatal error
>>>
>>
>> lctl df /path/to/some/file
>>
>> gives nothing useful? (the second one will dump *a lot*)
>>
>>>
>>>
>>>
>>>
>>> On Mon, Apr 24, 2017 at 7:16 PM, E.S. Rosenberg <
>>> esr+lus...@mail.hebrew.edu> wrote:
>>>


 On Mon, Apr 24, 2017 at 8:13 PM, Strikwerda, Ger <
 g.j.c.strikwe...@rug.nl> wrote:

> Hi Raj (and others),
>
> In which file should i state the credits/peer_credits stuff?
>
> Perhaps relevant config-files:
>
> [root@pg-gpu01 ~]# cd /etc/modprobe.d/
>
> [root@pg-gpu01 modprobe.d]# ls
> anaconda.conf   blacklist-kvm.conf  dist-alsa.conf
> dist-oss.conf   ib_ipoib.conf  lustre.conf  openfwwf.conf
> blacklist.conf  blacklist-nouveau.conf  dist.conf
> freeipmi-modalias.conf  ib_sdp.confmlnx.conftruescale.conf
>
> [root@pg-gpu01 modprobe.d]# cat ./ib_ipoib.conf
> alias netdev-ib* ib_ipoib
>
> [root@pg-gpu01 modprobe.d]# cat ./mlnx.conf
> # Module parameters for MLNX_OFED kernel modules
>
> [root@pg-gpu01 modprobe.d]# cat ./lustre.conf
> options lnet networks=o2ib(ib0)
>
> Are there more Lustre/LNET options that could help in this
> situation?
>

 What about the logfiles?
 Any error messages in syslog? lctl debug options?
 Veel geluk,
 Eli

>
>
>
>
> On Mon, Apr 24, 2017 at 7:02 PM, Raj 

Re: [lustre-discuss] client fails to mount

2017-04-24 Thread Brett Lee
So, the LNet ping is not working, and LNet is running on IB.  Have you
moved down the stack toward the hardware, running an ibping from a rebooted
client to the MGS?
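For reference, ibping needs a responder running on the far end; a minimal
sketch (the GUID is just an example from this thread, so substitute the one
ibstat reports for your own target port):

  # on the MGS/MDS node:
  ibping -S
  # on the rebooted client:
  ibping -G 0xf45214030062ee91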

Brett
--
Protect Yourself Against Cybercrime
PDS Software Solutions LLC
https://www.TrustPDS.com 

On Mon, Apr 24, 2017 at 11:53 AM, Raj  wrote:

> Yes, this is strange. Normally I have seen that a credits mismatch results
> in this scenario, but it doesn't look like that is the case here.
>
> You wouldn't want to have the MGS capture debug messages, as there will be
> a lot of data.
>
> I guess you already tried removing the Lustre drivers and adding them
> again?
> lustre_rmmod
> modprobe -v lustre
>
> And check dmesg for any errors...
>
>
> On Mon, Apr 24, 2017 at 12:43 PM Strikwerda, Ger 
> wrote:
>
>> Hi Raj,
>>
>> When I do an lctl ping on the MGS server I do not see any logs at all, not
>> even when I do a successful ping from a working node. Is there a way to
>> make the Lustre logging more verbose, to see more detail at the LNET level?
>>
>> It is very strange that a rebooted node is able to lctl ping compute
>> nodes, but fails to lctl ping the metadata and storage nodes.
>>
>>
>>
>>
>> On Mon, Apr 24, 2017 at 7:35 PM, Raj  wrote:
>>
>>> Ger,
>>> It looks like the default Lustre configuration.
>>>
>>> Do you see any error messages on the MGS side while you are doing an lctl
>>> ping from the rebooted clients?
>>> On Mon, Apr 24, 2017 at 12:27 PM Strikwerda, Ger <
>>> g.j.c.strikwe...@rug.nl> wrote:
>>>
 Hi Eli,

 Nothing can be mounted on the Lustre filesystems so the output is:

 [root@pg-gpu01 ~]# lfs df /home/ger/
 [root@pg-gpu01 ~]#

 Empty..



 On Mon, Apr 24, 2017 at 7:24 PM, E.S. Rosenberg 
 wrote:

>
>
> On Mon, Apr 24, 2017 at 8:19 PM, Strikwerda, Ger <
> g.j.c.strikwe...@rug.nl> wrote:
>
>> Hallo Eli,
>>
>> Logfile/syslog on the client-side:
>>
>> Lustre: Lustre: Build Version: 2.5.3-RC1--PRISTINE-2.6.32-
>> 573.el6.x86_64
>> LNet: Added LNI 172.23.54.51@o2ib [8/256/0/180]
>> LNetError: 2878:0:(o2iblnd_cb.c:2587:kiblnd_rejected())
>> 172.23.55.211@o2ib rejected: consumer defined fatal error
>>
>
> lctl df /path/to/some/file
>
> gives nothing useful? (the second one will dump *a lot*)
>
>>
>>
>>
>>
>> On Mon, Apr 24, 2017 at 7:16 PM, E.S. Rosenberg <
>> esr+lus...@mail.hebrew.edu> wrote:
>>
>>>
>>>
>>> On Mon, Apr 24, 2017 at 8:13 PM, Strikwerda, Ger <
>>> g.j.c.strikwe...@rug.nl> wrote:
>>>
 Hi Raj (and others),

 In which file should i state the credits/peer_credits stuff?

 Perhaps relevant config-files:

 [root@pg-gpu01 ~]# cd /etc/modprobe.d/

 [root@pg-gpu01 modprobe.d]# ls
 anaconda.conf   blacklist-kvm.conf  dist-alsa.conf
 dist-oss.conf   ib_ipoib.conf  lustre.conf  openfwwf.conf
 blacklist.conf  blacklist-nouveau.conf  dist.conf
 freeipmi-modalias.conf  ib_sdp.confmlnx.conftruescale.conf

 [root@pg-gpu01 modprobe.d]# cat ./ib_ipoib.conf
 alias netdev-ib* ib_ipoib

 [root@pg-gpu01 modprobe.d]# cat ./mlnx.conf
 # Module parameters for MLNX_OFED kernel modules

 [root@pg-gpu01 modprobe.d]# cat ./lustre.conf
 options lnet networks=o2ib(ib0)

 Are there more Lustre/LNET options that could help in this
 situation?

>>>
>>> What about the logfiles?
>>> Any error messages in syslog? lctl debug options?
>>> Veel geluk,
>>> Eli
>>>




 On Mon, Apr 24, 2017 at 7:02 PM, Raj  wrote:

> May be worth checking your lnet credits and peer_credits in
> /etc/modprobe.d ?
> You can compare between working hosts and non working hosts.
> Thanks
> _Raj
>
> On Mon, Apr 24, 2017 at 10:10 AM Strikwerda, Ger <
> g.j.c.strikwe...@rug.nl> wrote:
>
>> Hi Rick,
>>
>> Even without iptables rules and loading the correct modules
>> afterwards, we get the same results:
>>
>> [root@pg-gpu01 sysconfig]# iptables --list
>> Chain INPUT (policy ACCEPT)
>> target prot opt source   destination
>>
>> Chain FORWARD (policy ACCEPT)
>> target prot opt source   destination
>>
>> Chain OUTPUT (policy ACCEPT)
>> target prot opt source   destination
>>
>> Chain LOGDROP (0 references)
>> target prot opt source   destination
>> LOGall  --  anywhere anywhereLOG
>> level warning

Re: [lustre-discuss] client fails to mount

2017-04-24 Thread Raj
Yes, this is strange. Normally I have seen that a credits mismatch results in
this scenario, but it doesn't look like that is the case here.

You wouldn't want to have the MGS capture debug messages, as there will be a
lot of data.

I guess you already tried removing the Lustre drivers and adding them again?
lustre_rmmod
modprobe -v lustre

And check dmesg for any errors...
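If you do want to compare credits, they are ko2iblnd module options set next
to the existing lnet line in /etc/modprobe.d/lustre.conf. The values below
are examples only; whatever is used must match between clients and servers:

  options lnet networks=o2ib(ib0)
  options ko2iblnd credits=256 peer_credits=8

A mismatch between a rebooted client and the servers is the kind of thing
that can show up as connect rejects.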


On Mon, Apr 24, 2017 at 12:43 PM Strikwerda, Ger 
wrote:

> Hi Raj,
>
> When I do an lctl ping on the MGS server I do not see any logs at all, not
> even when I do a successful ping from a working node. Is there a way to
> make the Lustre logging more verbose, to see more detail at the LNET level?
>
> It is very strange that a rebooted node is able to lctl ping compute
> nodes, but fails to lctl ping the metadata and storage nodes.
>
>
>
>
> On Mon, Apr 24, 2017 at 7:35 PM, Raj  wrote:
>
>> Ger,
>> It looks like the default Lustre configuration.
>>
>> Do you see any error messages on the MGS side while you are doing an lctl
>> ping from the rebooted clients?
>> On Mon, Apr 24, 2017 at 12:27 PM Strikwerda, Ger 
>> wrote:
>>
>>> Hi Eli,
>>>
>>> Nothing can be mounted on the Lustre filesystems so the output is:
>>>
>>> [root@pg-gpu01 ~]# lfs df /home/ger/
>>> [root@pg-gpu01 ~]#
>>>
>>> Empty..
>>>
>>>
>>>
>>> On Mon, Apr 24, 2017 at 7:24 PM, E.S. Rosenberg 
>>> wrote:
>>>


 On Mon, Apr 24, 2017 at 8:19 PM, Strikwerda, Ger <
 g.j.c.strikwe...@rug.nl> wrote:

> Hallo Eli,
>
> Logfile/syslog on the client-side:
>
> Lustre: Lustre: Build Version:
> 2.5.3-RC1--PRISTINE-2.6.32-573.el6.x86_64
> LNet: Added LNI 172.23.54.51@o2ib [8/256/0/180]
> LNetError: 2878:0:(o2iblnd_cb.c:2587:kiblnd_rejected())
> 172.23.55.211@o2ib rejected: consumer defined fatal error
>

 lctl df /path/to/some/file

 gives nothing useful? (the second one will dump *a lot*)

>
>
>
>
> On Mon, Apr 24, 2017 at 7:16 PM, E.S. Rosenberg <
> esr+lus...@mail.hebrew.edu> wrote:
>
>>
>>
>> On Mon, Apr 24, 2017 at 8:13 PM, Strikwerda, Ger <
>> g.j.c.strikwe...@rug.nl> wrote:
>>
>>> Hi Raj (and others),
>>>
>>> In which file should i state the credits/peer_credits stuff?
>>>
>>> Perhaps relevant config-files:
>>>
>>> [root@pg-gpu01 ~]# cd /etc/modprobe.d/
>>>
>>> [root@pg-gpu01 modprobe.d]# ls
>>> anaconda.conf   blacklist-kvm.conf  dist-alsa.conf
>>> dist-oss.conf   ib_ipoib.conf  lustre.conf  openfwwf.conf
>>> blacklist.conf  blacklist-nouveau.conf  dist.conf
>>> freeipmi-modalias.conf  ib_sdp.confmlnx.conftruescale.conf
>>>
>>> [root@pg-gpu01 modprobe.d]# cat ./ib_ipoib.conf
>>> alias netdev-ib* ib_ipoib
>>>
>>> [root@pg-gpu01 modprobe.d]# cat ./mlnx.conf
>>> # Module parameters for MLNX_OFED kernel modules
>>>
>>> [root@pg-gpu01 modprobe.d]# cat ./lustre.conf
>>> options lnet networks=o2ib(ib0)
>>>
>>> Are there more Lustre/LNET options that could help in this situation?
>>>
>>
>> What about the logfiles?
>> Any error messages in syslog? lctl debug options?
>> Veel geluk,
>> Eli
>>
>>>
>>>
>>>
>>>
>>> On Mon, Apr 24, 2017 at 7:02 PM, Raj  wrote:
>>>
 May be worth checking your lnet credits and peer_credits in
 /etc/modprobe.d ?
 You can compare between working hosts and non working hosts.
 Thanks
 _Raj

 On Mon, Apr 24, 2017 at 10:10 AM Strikwerda, Ger <
 g.j.c.strikwe...@rug.nl> wrote:

> Hi Rick,
>
> Even without iptables rules and loading the correct modules
> afterwards, we get the same results:
>
> [root@pg-gpu01 sysconfig]# iptables --list
> Chain INPUT (policy ACCEPT)
> target prot opt source   destination
>
> Chain FORWARD (policy ACCEPT)
> target prot opt source   destination
>
> Chain OUTPUT (policy ACCEPT)
> target prot opt source   destination
>
> Chain LOGDROP (0 references)
> target prot opt source   destination
> LOGall  --  anywhere anywhereLOG
> level warning
> DROP   all  --  anywhere anywhere
>
> [root@pg-gpu01 sysconfig]# modprobe lnet
>
> [root@pg-gpu01 sysconfig]# modprobe lustre
>
> [root@pg-gpu01 sysconfig]# lctl ping 172.23.55.211@o2ib
>
> failed to ping 172.23.55.211@o2ib: Input/output error
>
>
>
>
>
>
>
> On Mon, Apr 24, 2017 at 4:59 PM, Mohr Jr, Richard Frank (Rick
> Mohr)  wrote:
>

Re: [lustre-discuss] client fails to mount

2017-04-24 Thread Strikwerda, Ger
Hi Raj,

When I do an lctl ping on the MGS server I do not see any logs at all, not
even when I do a successful ping from a working node. Is there a way to make
the Lustre logging more verbose, to see more detail at the LNET level?

It is very strange that a rebooted node is able to lctl ping compute nodes,
but fails to lctl ping the metadata and storage nodes.
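On the verbosity question, a couple of 2.5-era knobs that may help; this is a
sketch from memory, so double-check the paths against your version:

  echo +net > /proc/sys/lnet/debug   # widen the debug mask to network events
  cat /proc/sys/lnet/nis             # local network interfaces and credits
  cat /proc/sys/lnet/peers           # per-peer state and credit counts

The peers file in particular shows whether the client ever established state
for the MDS/OSS NIDs.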




On Mon, Apr 24, 2017 at 7:35 PM, Raj  wrote:

> Ger,
> It looks like the default Lustre configuration.
>
> Do you see any error messages on the MGS side while you are doing an lctl ping
> from the rebooted clients?
> On Mon, Apr 24, 2017 at 12:27 PM Strikwerda, Ger 
> wrote:
>
>> Hi Eli,
>>
>> Nothing can be mounted on the Lustre filesystems so the output is:
>>
>> [root@pg-gpu01 ~]# lfs df /home/ger/
>> [root@pg-gpu01 ~]#
>>
>> Empty..
>>
>>
>>
>> On Mon, Apr 24, 2017 at 7:24 PM, E.S. Rosenberg 
>> wrote:
>>
>>>
>>>
>>> On Mon, Apr 24, 2017 at 8:19 PM, Strikwerda, Ger <
>>> g.j.c.strikwe...@rug.nl> wrote:
>>>
 Hallo Eli,

 Logfile/syslog on the client-side:

 Lustre: Lustre: Build Version: 2.5.3-RC1--PRISTINE-2.6.32-
 573.el6.x86_64
 LNet: Added LNI 172.23.54.51@o2ib [8/256/0/180]
 LNetError: 2878:0:(o2iblnd_cb.c:2587:kiblnd_rejected())
 172.23.55.211@o2ib rejected: consumer defined fatal error

>>>
>>> lctl df /path/to/some/file
>>>
>>> gives nothing useful? (the second one will dump *a lot*)
>>>




 On Mon, Apr 24, 2017 at 7:16 PM, E.S. Rosenberg <
 esr+lus...@mail.hebrew.edu> wrote:

>
>
> On Mon, Apr 24, 2017 at 8:13 PM, Strikwerda, Ger <
> g.j.c.strikwe...@rug.nl> wrote:
>
>> Hi Raj (and others),
>>
>> In which file should i state the credits/peer_credits stuff?
>>
>> Perhaps relevant config-files:
>>
>> [root@pg-gpu01 ~]# cd /etc/modprobe.d/
>>
>> [root@pg-gpu01 modprobe.d]# ls
>> anaconda.conf   blacklist-kvm.conf  dist-alsa.conf
>> dist-oss.conf   ib_ipoib.conf  lustre.conf  openfwwf.conf
>> blacklist.conf  blacklist-nouveau.conf  dist.conf
>> freeipmi-modalias.conf  ib_sdp.confmlnx.conftruescale.conf
>>
>> [root@pg-gpu01 modprobe.d]# cat ./ib_ipoib.conf
>> alias netdev-ib* ib_ipoib
>>
>> [root@pg-gpu01 modprobe.d]# cat ./mlnx.conf
>> # Module parameters for MLNX_OFED kernel modules
>>
>> [root@pg-gpu01 modprobe.d]# cat ./lustre.conf
>> options lnet networks=o2ib(ib0)
>>
>> Are there more Lustre/LNET options that could help in this situation?
>>
>
> What about the logfiles?
> Any error messages in syslog? lctl debug options?
> Veel geluk,
> Eli
>
>>
>>
>>
>>
>> On Mon, Apr 24, 2017 at 7:02 PM, Raj  wrote:
>>
>>> May be worth checking your lnet credits and peer_credits in
>>> /etc/modprobe.d ?
>>> You can compare between working hosts and non working hosts.
>>> Thanks
>>> _Raj
>>>
>>> On Mon, Apr 24, 2017 at 10:10 AM Strikwerda, Ger <
>>> g.j.c.strikwe...@rug.nl> wrote:
>>>
 Hi Rick,

 Even without iptables rules and loading the correct modules
 afterwards, we get the same results:

 [root@pg-gpu01 sysconfig]# iptables --list
 Chain INPUT (policy ACCEPT)
 target prot opt source   destination

 Chain FORWARD (policy ACCEPT)
 target prot opt source   destination

 Chain OUTPUT (policy ACCEPT)
 target prot opt source   destination

 Chain LOGDROP (0 references)
 target prot opt source   destination
 LOGall  --  anywhere anywhereLOG
 level warning
 DROP   all  --  anywhere anywhere

 [root@pg-gpu01 sysconfig]# modprobe lnet

 [root@pg-gpu01 sysconfig]# modprobe lustre

 [root@pg-gpu01 sysconfig]# lctl ping 172.23.55.211@o2ib

 failed to ping 172.23.55.211@o2ib: Input/output error







 On Mon, Apr 24, 2017 at 4:59 PM, Mohr Jr, Richard Frank (Rick Mohr)
  wrote:

> This might be a long shot, but have you checked for possible
> firewall rules that might be causing the issue?  I’m wondering if 
> there is
> a chance that some rules were added after the nodes were up to allow 
> Lustre
> access, and when a node got rebooted, it lost the rules.
>
> --
> Rick Mohr
> Senior HPC System Administrator
> National Institute for Computational Sciences
> http://www.nics.tennessee.edu
>
>
> > On Apr 24, 2017, at 10:19 AM, Strikwerda, Ger <
> 

Re: [lustre-discuss] client fails to mount

2017-04-24 Thread Raj
Ger,
It looks like the default Lustre configuration.

Do you see any error messages on the MGS side while you are doing an lctl ping
from the rebooted clients?
On Mon, Apr 24, 2017 at 12:27 PM Strikwerda, Ger 
wrote:

> Hi Eli,
>
> Nothing can be mounted on the Lustre filesystems so the output is:
>
> [root@pg-gpu01 ~]# lfs df /home/ger/
> [root@pg-gpu01 ~]#
>
> Empty..
>
>
>
> On Mon, Apr 24, 2017 at 7:24 PM, E.S. Rosenberg  wrote:
>
>>
>>
>> On Mon, Apr 24, 2017 at 8:19 PM, Strikwerda, Ger > > wrote:
>>
>>> Hallo Eli,
>>>
>>> Logfile/syslog on the client-side:
>>>
>>> Lustre: Lustre: Build Version: 2.5.3-RC1--PRISTINE-2.6.32-573.el6.x86_64
>>> LNet: Added LNI 172.23.54.51@o2ib [8/256/0/180]
>>> LNetError: 2878:0:(o2iblnd_cb.c:2587:kiblnd_rejected())
>>> 172.23.55.211@o2ib rejected: consumer defined fatal error
>>>
>>
>> lctl df /path/to/some/file
>>
>> gives nothing useful? (the second one will dump *a lot*)
>>
>>>
>>>
>>>
>>>
>>> On Mon, Apr 24, 2017 at 7:16 PM, E.S. Rosenberg <
>>> esr+lus...@mail.hebrew.edu> wrote:
>>>


 On Mon, Apr 24, 2017 at 8:13 PM, Strikwerda, Ger <
 g.j.c.strikwe...@rug.nl> wrote:

> Hi Raj (and others),
>
> In which file should i state the credits/peer_credits stuff?
>
> Perhaps relevant config-files:
>
> [root@pg-gpu01 ~]# cd /etc/modprobe.d/
>
> [root@pg-gpu01 modprobe.d]# ls
> anaconda.conf   blacklist-kvm.conf  dist-alsa.conf
> dist-oss.conf   ib_ipoib.conf  lustre.conf  openfwwf.conf
> blacklist.conf  blacklist-nouveau.conf  dist.conf
> freeipmi-modalias.conf  ib_sdp.confmlnx.conftruescale.conf
>
> [root@pg-gpu01 modprobe.d]# cat ./ib_ipoib.conf
> alias netdev-ib* ib_ipoib
>
> [root@pg-gpu01 modprobe.d]# cat ./mlnx.conf
> # Module parameters for MLNX_OFED kernel modules
>
> [root@pg-gpu01 modprobe.d]# cat ./lustre.conf
> options lnet networks=o2ib(ib0)
>
> Are there more Lustre/LNET options that could help in this situation?
>

 What about the logfiles?
 Any error messages in syslog? lctl debug options?
 Veel geluk,
 Eli

>
>
>
>
> On Mon, Apr 24, 2017 at 7:02 PM, Raj  wrote:
>
>> May be worth checking your lnet credits and peer_credits in
>> /etc/modprobe.d ?
>> You can compare between working hosts and non working hosts.
>> Thanks
>> _Raj
>>
>> On Mon, Apr 24, 2017 at 10:10 AM Strikwerda, Ger <
>> g.j.c.strikwe...@rug.nl> wrote:
>>
>>> Hi Rick,
>>>
>>> Even without iptables rules and loading the correct modules
>>> afterwards, we get the same results:
>>>
>>> [root@pg-gpu01 sysconfig]# iptables --list
>>> Chain INPUT (policy ACCEPT)
>>> target prot opt source   destination
>>>
>>> Chain FORWARD (policy ACCEPT)
>>> target prot opt source   destination
>>>
>>> Chain OUTPUT (policy ACCEPT)
>>> target prot opt source   destination
>>>
>>> Chain LOGDROP (0 references)
>>> target prot opt source   destination
>>> LOGall  --  anywhere anywhereLOG
>>> level warning
>>> DROP   all  --  anywhere anywhere
>>>
>>> [root@pg-gpu01 sysconfig]# modprobe lnet
>>>
>>> [root@pg-gpu01 sysconfig]# modprobe lustre
>>>
>>> [root@pg-gpu01 sysconfig]# lctl ping 172.23.55.211@o2ib
>>>
>>> failed to ping 172.23.55.211@o2ib: Input/output error
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Mon, Apr 24, 2017 at 4:59 PM, Mohr Jr, Richard Frank (Rick Mohr)
>>>  wrote:
>>>
 This might be a long shot, but have you checked for possible
 firewall rules that might be causing the issue?  I’m wondering if 
 there is
 a chance that some rules were added after the nodes were up to allow 
 Lustre
 access, and when a node got rebooted, it lost the rules.

 --
 Rick Mohr
 Senior HPC System Administrator
 National Institute for Computational Sciences
 http://www.nics.tennessee.edu


 > On Apr 24, 2017, at 10:19 AM, Strikwerda, Ger <
 g.j.c.strikwe...@rug.nl> wrote:
 >
 > Hi Russell,
 >
 > Thanks for the IB subnet clues:
 >
 > [root@pg-gpu01 ~]# ibv_devinfo
 > hca_id: mlx4_0
 > transport:  InfiniBand (0)
 > fw_ver: 2.32.5100
 > node_guid:  f452:1403:00f5:4620
 > sys_image_guid: f452:1403:00f5:4623
 > vendor_id:  0x02c9
 > vendor_part_id: 4099
 >  

Re: [lustre-discuss] client fails to mount

2017-04-24 Thread Strikwerda, Ger
Hi Eli,

None of the Lustre filesystems can be mounted, so the output is:

[root@pg-gpu01 ~]# lfs df /home/ger/
[root@pg-gpu01 ~]#

Empty.



On Mon, Apr 24, 2017 at 7:24 PM, E.S. Rosenberg  wrote:

>
>
> On Mon, Apr 24, 2017 at 8:19 PM, Strikwerda, Ger 
> wrote:
>
>> Hello Eli,
>>
>> Logfile/syslog on the client-side:
>>
>> Lustre: Lustre: Build Version: 2.5.3-RC1--PRISTINE-2.6.32-573.el6.x86_64
>> LNet: Added LNI 172.23.54.51@o2ib [8/256/0/180]
>> LNetError: 2878:0:(o2iblnd_cb.c:2587:kiblnd_rejected())
>> 172.23.55.211@o2ib rejected: consumer defined fatal error
>>
>
> lctl df /path/to/some/file
>
> gives nothing useful? (the second one will dump *a lot*)
>
>>
>>
>>
>>
>> On Mon, Apr 24, 2017 at 7:16 PM, E.S. Rosenberg <
>> esr+lus...@mail.hebrew.edu> wrote:
>>
>>>
>>>
>>> On Mon, Apr 24, 2017 at 8:13 PM, Strikwerda, Ger <
>>> g.j.c.strikwe...@rug.nl> wrote:
>>>
 Hi Raj (and others),

 In which file should I state the credits/peer_credits stuff?

 Perhaps relevant config-files:

 [root@pg-gpu01 ~]# cd /etc/modprobe.d/

 [root@pg-gpu01 modprobe.d]# ls
 anaconda.conf   blacklist-kvm.conf  dist-alsa.conf
 dist-oss.conf   ib_ipoib.conf  lustre.conf  openfwwf.conf
 blacklist.conf  blacklist-nouveau.conf  dist.conf
 freeipmi-modalias.conf  ib_sdp.conf  mlnx.conf  truescale.conf

 [root@pg-gpu01 modprobe.d]# cat ./ib_ipoib.conf
 alias netdev-ib* ib_ipoib

 [root@pg-gpu01 modprobe.d]# cat ./mlnx.conf
 # Module parameters for MLNX_OFED kernel modules

 [root@pg-gpu01 modprobe.d]# cat ./lustre.conf
 options lnet networks=o2ib(ib0)

 Are there more Lustre/LNET options that could help in this situation?

>>>
>>> What about the logfiles?
>>> Any error messages in syslog? lctl debug options?
>>> Good luck,
>>> Eli
>>>




 On Mon, Apr 24, 2017 at 7:02 PM, Raj  wrote:

> May be worth checking your lnet credits and peer_credits in
> /etc/modprobe.d ?
> You can compare between working hosts and non working hosts.
> Thanks
> _Raj
>
> On Mon, Apr 24, 2017 at 10:10 AM Strikwerda, Ger <
> g.j.c.strikwe...@rug.nl> wrote:
>
>> Hi Rick,
>>
>> Even without iptables rules and loading the correct modules
>> afterwards, we get the same results:
>>
>> [root@pg-gpu01 sysconfig]# iptables --list
>> Chain INPUT (policy ACCEPT)
>> target prot opt source   destination
>>
>> Chain FORWARD (policy ACCEPT)
>> target prot opt source   destination
>>
>> Chain OUTPUT (policy ACCEPT)
>> target prot opt source   destination
>>
>> Chain LOGDROP (0 references)
>> target prot opt source   destination
>> LOG    all  --  anywhere             anywhere            LOG level warning
>> DROP   all  --  anywhere anywhere
>>
>> [root@pg-gpu01 sysconfig]# modprobe lnet
>>
>> [root@pg-gpu01 sysconfig]# modprobe lustre
>>
>> [root@pg-gpu01 sysconfig]# lctl ping 172.23.55.211@o2ib
>>
>> failed to ping 172.23.55.211@o2ib: Input/output error
>>
>>
>>
>>
>>
>>
>>
>> On Mon, Apr 24, 2017 at 4:59 PM, Mohr Jr, Richard Frank (Rick Mohr) <
>> rm...@utk.edu> wrote:
>>
>>> This might be a long shot, but have you checked for possible
>>> firewall rules that might be causing the issue?  I’m wondering if there 
>>> is
>>> a chance that some rules were added after the nodes were up to allow 
>>> Lustre
>>> access, and when a node got rebooted, it lost the rules.
>>>
>>> --
>>> Rick Mohr
>>> Senior HPC System Administrator
>>> National Institute for Computational Sciences
>>> http://www.nics.tennessee.edu
>>>
>>>
>>> > On Apr 24, 2017, at 10:19 AM, Strikwerda, Ger <
>>> g.j.c.strikwe...@rug.nl> wrote:
>>> >
>>> > Hi Russell,
>>> >
>>> > Thanks for the IB subnet clues:
>>> >
>>> > [root@pg-gpu01 ~]# ibv_devinfo
>>> > hca_id: mlx4_0
>>> > transport:  InfiniBand (0)
>>> > fw_ver: 2.32.5100
>>> > node_guid:  f452:1403:00f5:4620
>>> > sys_image_guid: f452:1403:00f5:4623
>>> > vendor_id:  0x02c9
>>> > vendor_part_id: 4099
>>> > hw_ver: 0x1
>>> > board_id:   MT_1100120019
>>> > phys_port_cnt:  1
>>> > port:   1
>>> > state:  PORT_ACTIVE (4)
>>> > max_mtu:4096 (5)
>>> > active_mtu: 4096 (5)

Re: [lustre-discuss] client fails to mount

2017-04-24 Thread Strikwerda, Ger
Hello Eli,

Logfile/syslog on the client-side:

Lustre: Lustre: Build Version: 2.5.3-RC1--PRISTINE-2.6.32-573.el6.x86_64
LNet: Added LNI 172.23.54.51@o2ib [8/256/0/180]
LNetError: 2878:0:(o2iblnd_cb.c:2587:kiblnd_rejected()) 172.23.55.211@o2ib
rejected: consumer defined fatal error
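
For anyone searching the archive later: the peers that are rejecting the connection can be pulled straight out of a captured log. A minimal sketch, using the two lines above as sample input (the /tmp path is just a placeholder for wherever the syslog capture lives):

```shell
# Extract the unique peer NIDs that rejected an o2iblnd connection
# from captured kernel log lines (sample taken from the log above).
cat <<'EOF' > /tmp/lnet.log
LNet: Added LNI 172.23.54.51@o2ib [8/256/0/180]
LNetError: 2878:0:(o2iblnd_cb.c:2587:kiblnd_rejected()) 172.23.55.211@o2ib rejected: consumer defined fatal error
EOF
grep -o '[0-9.]*@o2ib rejected' /tmp/lnet.log | sed 's/ rejected//' | sort -u
```

Every NID this prints is a server whose o2iblnd side refused the connection attempt, which narrows down where to look next.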




On Mon, Apr 24, 2017 at 7:16 PM, E.S. Rosenberg 
wrote:

>
>
> On Mon, Apr 24, 2017 at 8:13 PM, Strikwerda, Ger 
> wrote:
>
>> Hi Raj (and others),
>>
>> In which file should I state the credits/peer_credits stuff?
>>
>> Perhaps relevant config-files:
>>
>> [root@pg-gpu01 ~]# cd /etc/modprobe.d/
>>
>> [root@pg-gpu01 modprobe.d]# ls
>> anaconda.conf   blacklist-kvm.conf  dist-alsa.conf
>> dist-oss.conf   ib_ipoib.conf  lustre.conf  openfwwf.conf
>> blacklist.conf  blacklist-nouveau.conf  dist.conf
>> freeipmi-modalias.conf  ib_sdp.conf  mlnx.conf  truescale.conf
>>
>> [root@pg-gpu01 modprobe.d]# cat ./ib_ipoib.conf
>> alias netdev-ib* ib_ipoib
>>
>> [root@pg-gpu01 modprobe.d]# cat ./mlnx.conf
>> # Module parameters for MLNX_OFED kernel modules
>>
>> [root@pg-gpu01 modprobe.d]# cat ./lustre.conf
>> options lnet networks=o2ib(ib0)
>>
>> Are there more Lustre/LNET options that could help in this situation?
>>
>
> What about the logfiles?
> Any error messages in syslog? lctl debug options?
> Good luck,
> Eli
>
>>
>>
>>
>>
>> On Mon, Apr 24, 2017 at 7:02 PM, Raj  wrote:
>>
>>> May be worth checking your lnet credits and peer_credits in
>>> /etc/modprobe.d ?
>>> You can compare between working hosts and non working hosts.
>>> Thanks
>>> _Raj
>>>
>>> On Mon, Apr 24, 2017 at 10:10 AM Strikwerda, Ger <
>>> g.j.c.strikwe...@rug.nl> wrote:
>>>
 Hi Rick,

 Even without iptables rules and loading the correct modules afterwards,
 we get the same results:

 [root@pg-gpu01 sysconfig]# iptables --list
 Chain INPUT (policy ACCEPT)
 target prot opt source   destination

 Chain FORWARD (policy ACCEPT)
 target prot opt source   destination

 Chain OUTPUT (policy ACCEPT)
 target prot opt source   destination

 Chain LOGDROP (0 references)
 target prot opt source   destination
 LOG    all  --  anywhere             anywhere            LOG level warning
 DROP   all  --  anywhere anywhere

 [root@pg-gpu01 sysconfig]# modprobe lnet

 [root@pg-gpu01 sysconfig]# modprobe lustre

 [root@pg-gpu01 sysconfig]# lctl ping 172.23.55.211@o2ib

 failed to ping 172.23.55.211@o2ib: Input/output error







 On Mon, Apr 24, 2017 at 4:59 PM, Mohr Jr, Richard Frank (Rick Mohr) <
 rm...@utk.edu> wrote:

> This might be a long shot, but have you checked for possible firewall
> rules that might be causing the issue?  I’m wondering if there is a chance
> that some rules were added after the nodes were up to allow Lustre access,
> and when a node got rebooted, it lost the rules.
>
> --
> Rick Mohr
> Senior HPC System Administrator
> National Institute for Computational Sciences
> http://www.nics.tennessee.edu
>
>
> > On Apr 24, 2017, at 10:19 AM, Strikwerda, Ger <
> g.j.c.strikwe...@rug.nl> wrote:
> >
> > Hi Russell,
> >
> > Thanks for the IB subnet clues:
> >
> > [root@pg-gpu01 ~]# ibv_devinfo
> > hca_id: mlx4_0
> > transport:  InfiniBand (0)
> > fw_ver: 2.32.5100
> > node_guid:  f452:1403:00f5:4620
> > sys_image_guid: f452:1403:00f5:4623
> > vendor_id:  0x02c9
> > vendor_part_id: 4099
> > hw_ver: 0x1
> > board_id:   MT_1100120019
> > phys_port_cnt:  1
> > port:   1
> > state:  PORT_ACTIVE (4)
> > max_mtu:4096 (5)
> > active_mtu: 4096 (5)
> > sm_lid: 1
> > port_lid:   185
> > port_lmc:   0x00
> > link_layer: InfiniBand
> >
> > [root@pg-gpu01 ~]# sminfo
> > sminfo: sm lid 1 sm guid 0xf452140300f62320, activity count 80878098
> priority 0 state 3 SMINFO_MASTER
> >
> > Looks like the rebooted node is able to connect/contact IB/IB
> subnetmanager
> >
> >
> >
> >
> > On Mon, Apr 24, 2017 at 4:14 PM, Russell Dekema 
> wrote:
> > At first glance, this sounds like your Infiniband subnet manager may
> > be down 

Re: [lustre-discuss] client fails to mount

2017-04-24 Thread E.S. Rosenberg
On Mon, Apr 24, 2017 at 8:13 PM, Strikwerda, Ger 
wrote:

> Hi Raj (and others),
>
> In which file should I state the credits/peer_credits stuff?
>
> Perhaps relevant config-files:
>
> [root@pg-gpu01 ~]# cd /etc/modprobe.d/
>
> [root@pg-gpu01 modprobe.d]# ls
> anaconda.conf   blacklist-kvm.conf  dist-alsa.conf
> dist-oss.conf   ib_ipoib.conf  lustre.conf  openfwwf.conf
> blacklist.conf  blacklist-nouveau.conf  dist.conf
> freeipmi-modalias.conf  ib_sdp.conf  mlnx.conf  truescale.conf
>
> [root@pg-gpu01 modprobe.d]# cat ./ib_ipoib.conf
> alias netdev-ib* ib_ipoib
>
> [root@pg-gpu01 modprobe.d]# cat ./mlnx.conf
> # Module parameters for MLNX_OFED kernel modules
>
> [root@pg-gpu01 modprobe.d]# cat ./lustre.conf
> options lnet networks=o2ib(ib0)
>
> Are there more Lustre/LNET options that could help in this situation?
>

What about the logfiles?
Any error messages in syslog? lctl debug options?
Good luck,
Eli

>
>
>
>
> On Mon, Apr 24, 2017 at 7:02 PM, Raj  wrote:
>
>> May be worth checking your lnet credits and peer_credits in
>> /etc/modprobe.d ?
>> You can compare between working hosts and non working hosts.
>> Thanks
>> _Raj
>>
>> On Mon, Apr 24, 2017 at 10:10 AM Strikwerda, Ger 
>> wrote:
>>
>>> Hi Rick,
>>>
>>> Even without iptables rules and loading the correct modules afterwards,
>>> we get the same results:
>>>
>>> [root@pg-gpu01 sysconfig]# iptables --list
>>> Chain INPUT (policy ACCEPT)
>>> target prot opt source   destination
>>>
>>> Chain FORWARD (policy ACCEPT)
>>> target prot opt source   destination
>>>
>>> Chain OUTPUT (policy ACCEPT)
>>> target prot opt source   destination
>>>
>>> Chain LOGDROP (0 references)
>>> target prot opt source   destination
>>> LOG    all  --  anywhere             anywhere            LOG level warning
>>> DROP   all  --  anywhere anywhere
>>>
>>> [root@pg-gpu01 sysconfig]# modprobe lnet
>>>
>>> [root@pg-gpu01 sysconfig]# modprobe lustre
>>>
>>> [root@pg-gpu01 sysconfig]# lctl ping 172.23.55.211@o2ib
>>>
>>> failed to ping 172.23.55.211@o2ib: Input/output error
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Mon, Apr 24, 2017 at 4:59 PM, Mohr Jr, Richard Frank (Rick Mohr) <
>>> rm...@utk.edu> wrote:
>>>
 This might be a long shot, but have you checked for possible firewall
 rules that might be causing the issue?  I’m wondering if there is a chance
 that some rules were added after the nodes were up to allow Lustre access,
 and when a node got rebooted, it lost the rules.

 --
 Rick Mohr
 Senior HPC System Administrator
 National Institute for Computational Sciences
 http://www.nics.tennessee.edu


 > On Apr 24, 2017, at 10:19 AM, Strikwerda, Ger <
 g.j.c.strikwe...@rug.nl> wrote:
 >
 > Hi Russell,
 >
 > Thanks for the IB subnet clues:
 >
 > [root@pg-gpu01 ~]# ibv_devinfo
 > hca_id: mlx4_0
 > transport:  InfiniBand (0)
 > fw_ver: 2.32.5100
 > node_guid:  f452:1403:00f5:4620
 > sys_image_guid: f452:1403:00f5:4623
 > vendor_id:  0x02c9
 > vendor_part_id: 4099
 > hw_ver: 0x1
 > board_id:   MT_1100120019
 > phys_port_cnt:  1
 > port:   1
 > state:  PORT_ACTIVE (4)
 > max_mtu:4096 (5)
 > active_mtu: 4096 (5)
 > sm_lid: 1
 > port_lid:   185
 > port_lmc:   0x00
 > link_layer: InfiniBand
 >
 > [root@pg-gpu01 ~]# sminfo
 > sminfo: sm lid 1 sm guid 0xf452140300f62320, activity count 80878098
 priority 0 state 3 SMINFO_MASTER
 >
 > Looks like the rebooted node is able to connect/contact IB/IB
 subnetmanager
 >
 >
 >
 >
 > On Mon, Apr 24, 2017 at 4:14 PM, Russell Dekema 
 wrote:
 > At first glance, this sounds like your Infiniband subnet manager may
 > be down or malfunctioning. In this case, nodes which were already up
 > when the subnet manager was working will continue to be able to
 > communicate over IB, but nodes which reboot after the SM goes down
 > will not.
 >
 > You can test this theory by running the 'ibv_devinfo' command on one
 > of your rebooted nodes. If the relevant IB port is in state PORT_INIT,
 > this confirms there is a problem with your subnet manager.
 >
 > Sincerely,
 > Rusty Dekema
 >
 >
 >
 >

Re: [lustre-discuss] client fails to mount

2017-04-24 Thread Strikwerda, Ger
Hi Raj (and others),

In which file should I state the credits/peer_credits stuff?

Perhaps relevant config-files:

[root@pg-gpu01 ~]# cd /etc/modprobe.d/

[root@pg-gpu01 modprobe.d]# ls
anaconda.conf   blacklist-kvm.conf  dist-alsa.conf
dist-oss.conf   ib_ipoib.conf  lustre.conf  openfwwf.conf
blacklist.conf  blacklist-nouveau.conf  dist.conf
freeipmi-modalias.conf  ib_sdp.conf  mlnx.conf  truescale.conf

[root@pg-gpu01 modprobe.d]# cat ./ib_ipoib.conf
alias netdev-ib* ib_ipoib

[root@pg-gpu01 modprobe.d]# cat ./mlnx.conf
# Module parameters for MLNX_OFED kernel modules

[root@pg-gpu01 modprobe.d]# cat ./lustre.conf
options lnet networks=o2ib(ib0)

Are there more Lustre/LNET options that could help in this situation?
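
If it helps: the credit tuning Raj mentions would go into the same lustre.conf, as options for the ko2iblnd module. A sketch with illustrative values only (the `[8/256/0/180]` bracket in the `Added LNI` log line reflects the credit/timeout settings currently in effect); copy the numbers from a client that still mounts rather than using these:

```
# /etc/modprobe.d/lustre.conf -- illustrative values, not a recommendation;
# match them to a known-good client before applying.
options lnet networks=o2ib(ib0)
options ko2iblnd peer_credits=8 credits=256
```

The modules need to be reloaded (or the node rebooted) for changed module options to take effect.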




On Mon, Apr 24, 2017 at 7:02 PM, Raj  wrote:

> May be worth checking your lnet credits and peer_credits in
> /etc/modprobe.d ?
> You can compare between working hosts and non working hosts.
> Thanks
> _Raj
>
> On Mon, Apr 24, 2017 at 10:10 AM Strikwerda, Ger 
> wrote:
>
>> Hi Rick,
>>
>> Even without iptables rules and loading the correct modules afterwards,
>> we get the same results:
>>
>> [root@pg-gpu01 sysconfig]# iptables --list
>> Chain INPUT (policy ACCEPT)
>> target prot opt source   destination
>>
>> Chain FORWARD (policy ACCEPT)
>> target prot opt source   destination
>>
>> Chain OUTPUT (policy ACCEPT)
>> target prot opt source   destination
>>
>> Chain LOGDROP (0 references)
>> target prot opt source   destination
>> LOG    all  --  anywhere             anywhere            LOG level warning
>> DROP   all  --  anywhere anywhere
>>
>> [root@pg-gpu01 sysconfig]# modprobe lnet
>>
>> [root@pg-gpu01 sysconfig]# modprobe lustre
>>
>> [root@pg-gpu01 sysconfig]# lctl ping 172.23.55.211@o2ib
>>
>> failed to ping 172.23.55.211@o2ib: Input/output error
>>
>>
>>
>>
>>
>>
>>
>> On Mon, Apr 24, 2017 at 4:59 PM, Mohr Jr, Richard Frank (Rick Mohr) <
>> rm...@utk.edu> wrote:
>>
>>> This might be a long shot, but have you checked for possible firewall
>>> rules that might be causing the issue?  I’m wondering if there is a chance
>>> that some rules were added after the nodes were up to allow Lustre access,
>>> and when a node got rebooted, it lost the rules.
>>>
>>> --
>>> Rick Mohr
>>> Senior HPC System Administrator
>>> National Institute for Computational Sciences
>>> http://www.nics.tennessee.edu
>>>
>>>
>>> > On Apr 24, 2017, at 10:19 AM, Strikwerda, Ger 
>>> wrote:
>>> >
>>> > Hi Russell,
>>> >
>>> > Thanks for the IB subnet clues:
>>> >
>>> > [root@pg-gpu01 ~]# ibv_devinfo
>>> > hca_id: mlx4_0
>>> > transport:  InfiniBand (0)
>>> > fw_ver: 2.32.5100
>>> > node_guid:  f452:1403:00f5:4620
>>> > sys_image_guid: f452:1403:00f5:4623
>>> > vendor_id:  0x02c9
>>> > vendor_part_id: 4099
>>> > hw_ver: 0x1
>>> > board_id:   MT_1100120019
>>> > phys_port_cnt:  1
>>> > port:   1
>>> > state:  PORT_ACTIVE (4)
>>> > max_mtu:4096 (5)
>>> > active_mtu: 4096 (5)
>>> > sm_lid: 1
>>> > port_lid:   185
>>> > port_lmc:   0x00
>>> > link_layer: InfiniBand
>>> >
>>> > [root@pg-gpu01 ~]# sminfo
>>> > sminfo: sm lid 1 sm guid 0xf452140300f62320, activity count 80878098
>>> priority 0 state 3 SMINFO_MASTER
>>> >
>>> > Looks like the rebooted node is able to connect/contact IB/IB
>>> subnetmanager
>>> >
>>> >
>>> >
>>> >
>>> > On Mon, Apr 24, 2017 at 4:14 PM, Russell Dekema 
>>> wrote:
>>> > At first glance, this sounds like your Infiniband subnet manager may
>>> > be down or malfunctioning. In this case, nodes which were already up
>>> > when the subnet manager was working will continue to be able to
>>> > communicate over IB, but nodes which reboot after the SM goes down
>>> > will not.
>>> >
>>> > You can test this theory by running the 'ibv_devinfo' command on one
>>> > of your rebooted nodes. If the relevant IB port is in state PORT_INIT,
>>> > this confirms there is a problem with your subnet manager.
>>> >
>>> > Sincerely,
>>> > Rusty Dekema
>>> >
>>> >
>>> >
>>> >
>>> > On Mon, Apr 24, 2017 at 9:57 AM, Strikwerda, Ger
>>> >  wrote:
>>> > > Hi everybody,
>>> > >
>>> > > Here at the university of Groningen we are now experiencing a
>>> strange Lustre
>>> > > error. If a client reboots, it fails to mount the Lustre storage.
>>> The client
>>> > > is not able to reach the MSG 

Re: [lustre-discuss] client fails to mount

2017-04-24 Thread Mohr Jr, Richard Frank (Rick Mohr)
This might be a long shot, but have you checked for possible firewall rules 
that might be causing the issue?  I’m wondering if there is a chance that some 
rules were added after the nodes were up to allow Lustre access, and when a 
node got rebooted, it lost the rules.
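
Worth noting: LNet traffic on o2ib NIDs does not traverse iptables, so firewall rules could only break TCP NIDs; for those, the LNet acceptor listens on TCP port 988. A small sketch that prints (rather than applies) the rules such a setup would need, so they can be reviewed, applied, and then persisted (e.g. with "service iptables save" on EL6) so a reboot does not lose them:

```shell
# Print the iptables commands a TCP-LNet client would need.
# Review the output, then pipe it to sh to apply.
lnet_port=988   # default LNet acceptor port
for chain in INPUT OUTPUT; do
  printf 'iptables -I %s -p tcp --dport %s -j ACCEPT\n' "$chain" "$lnet_port"
done
```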

--
Rick Mohr
Senior HPC System Administrator
National Institute for Computational Sciences
http://www.nics.tennessee.edu


> On Apr 24, 2017, at 10:19 AM, Strikwerda, Ger  wrote:
> 
> Hi Russell,
> 
> Thanks for the IB subnet clues:
> 
> [root@pg-gpu01 ~]# ibv_devinfo
> hca_id: mlx4_0
> transport:  InfiniBand (0)
> fw_ver: 2.32.5100
> node_guid:  f452:1403:00f5:4620
> sys_image_guid: f452:1403:00f5:4623
> vendor_id:  0x02c9
> vendor_part_id: 4099
> hw_ver: 0x1
> board_id:   MT_1100120019
> phys_port_cnt:  1
> port:   1
> state:  PORT_ACTIVE (4)
> max_mtu:4096 (5)
> active_mtu: 4096 (5)
> sm_lid: 1
> port_lid:   185
> port_lmc:   0x00
> link_layer: InfiniBand
> 
> [root@pg-gpu01 ~]# sminfo 
> sminfo: sm lid 1 sm guid 0xf452140300f62320, activity count 80878098 priority 
> 0 state 3 SMINFO_MASTER
> 
> Looks like the rebooted node is able to connect/contact IB/IB subnetmanager
> 
> 
> 
> 
> On Mon, Apr 24, 2017 at 4:14 PM, Russell Dekema  wrote:
> At first glance, this sounds like your Infiniband subnet manager may
> be down or malfunctioning. In this case, nodes which were already up
> when the subnet manager was working will continue to be able to
> communicate over IB, but nodes which reboot after the SM goes down
> will not.
> 
> You can test this theory by running the 'ibv_devinfo' command on one
> of your rebooted nodes. If the relevant IB port is in state PORT_INIT,
> this confirms there is a problem with your subnet manager.
> 
> Sincerely,
> Rusty Dekema
> 
> 
> 
> 
> On Mon, Apr 24, 2017 at 9:57 AM, Strikwerda, Ger
>  wrote:
> > Hi everybody,
> >
> > Here at the university of Groningen we are now experiencing a strange Lustre
> > error. If a client reboots, it fails to mount the Lustre storage. The client
> > is not able to reach the MGS service. The storage and nodes are
> > communicating over IB, until now without any problems. It looks like an
> > issue inside LNET. Clients cannot LNET ping/connect the metadata and/or
> > storage. But the clients are able to LNET ping each other. Clients which
> > have not been rebooted are working fine and have their mounts on our Lustre
> > filesystem.
> >
> > Lustre client log:
> >
> > Lustre: Lustre: Build Version: 2.5.3-RC1--PRISTINE-2.6.32-573.el6.x86_64
> > LNet: Added LNI 172.23.54.51@o2ib [8/256/0/180]
> >
> > LustreError: 15c-8: MGC172.23.55.211@o2ib: The configuration from log
> > 'pgdata01-client' failed (-5). This may be the result of communication
> > errors between this node and the MGS, a bad configuration, or other errors.
> > See the syslog for more information.
> > LustreError: 3812:0:(llite_lib.c:1046:ll_fill_super()) Unable to process
> > log: -5
> > Lustre: Unmounted pgdata01-client
> > LustreError: 3812:0:(obd_mount.c:1325:lustre_fill_super()) Unable to mount
> > (-5)
> > LNetError: 2882:0:(o2iblnd_cb.c:2587:kiblnd_rejected()) 172.23.55.212@o2ib
> > rejected: consumer defined fatal error
> > LNetError: 2882:0:(o2iblnd_cb.c:2587:kiblnd_rejected()) Skipped 1 previous
> > similar message
> > Lustre: 3765:0:(client.c:1918:ptlrpc_expire_one_request()) @@@ Request sent
> > has failed due to network error: [sent 1492789626/real 1492789626]
> > req@88105af2cc00 x1565303228072004/t0(0)
> > o250->MGC172.23.55.211@o2ib@172.23.55.212@o2ib:26/25 lens 400/544 e 0 to 1
> > dl 1492789631 ref 1 fl Rpc:XN/0/ rc 0/-1
> > Lustre: 3765:0:(client.c:1918:ptlrpc_expire_one_request()) Skipped 1
> > previous similar message
> > LustreError: 3826:0:(client.c:1083:ptlrpc_import_delay_req()) @@@ send limit
> > expired   req@882041ffc000 x1565303228071996/t0(0)
> > o101->MGC172.23.55.211@o2ib@172.23.55.211@o2ib:26/25 lens 328/344 e 0 to 0
> > dl 0 ref 2 fl Rpc:W/0/ rc 0/-1
> > LustreError: 3826:0:(client.c:1083:ptlrpc_import_delay_req()) Skipped 2
> > previous similar messages
> > LustreError: 15c-8: MGC172.23.55.211@o2ib: The configuration from log
> > 'pghome01-client' failed (-5). This may be the result of communication
> > errors between this node and the MGS, a bad configuration, or other errors.
> > See the syslog for more information.
> > LustreError: 

Re: [lustre-discuss] client fails to mount

2017-04-24 Thread Strikwerda, Ger
Hi Russell,

On a rebooted node:

[root@pg-gpu01 ~]# ibhosts | wc -l
183

On a not-rebooted node:

[root@pg-gpu02 ~]# ibhosts | wc -l
183

No difference, and all our Lustre storage nodes seem to be present:

Ca  : 0xf45214030062eb50 ports 2 "pg-ost01 HCA-1"
Ca  : 0xf45214030062eb30 ports 2 "pg-mds02 HCA-1"
Ca  : 0xf45214030062ee60 ports 2 "pg-ost03 HCA-1"
Ca  : 0xf45214030062ee90 ports 2 "pg-mds01 HCA-1"
Ca  : 0xf45214030062eb00 ports 2 "pg-ost04 HCA-1"
Ca  : 0xe41d2d030001ce80 ports 2 "pg-ost02 HCA-1"
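
The eyeball check can also be scripted. The heredoc below replays the six server lines from the listing so the filter can be sanity-checked anywhere; on a live node you would pipe `ibhosts` output in directly:

```shell
# Count the Lustre server HCAs (pg-ost*/pg-mds*) in ibhosts output.
# Sample input copied from the listing above; on a client, run:
#   ibhosts | grep -cE '"pg-(ost|mds)[0-9]+'
cat <<'EOF' > /tmp/ibhosts.out
Ca  : 0xf45214030062eb50 ports 2 "pg-ost01 HCA-1"
Ca  : 0xf45214030062eb30 ports 2 "pg-mds02 HCA-1"
Ca  : 0xf45214030062ee60 ports 2 "pg-ost03 HCA-1"
Ca  : 0xf45214030062ee90 ports 2 "pg-mds01 HCA-1"
Ca  : 0xf45214030062eb00 ports 2 "pg-ost04 HCA-1"
Ca  : 0xe41d2d030001ce80 ports 2 "pg-ost02 HCA-1"
EOF
grep -cE '"pg-(ost|mds)[0-9]+' /tmp/ibhosts.out   # prints 6
```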





On Mon, Apr 24, 2017 at 4:46 PM, Russell Dekema  wrote:

> I'm not sure this is likely to help either, but if you run the command
> 'ibhosts' on one of the non-working Lustre client nodes, do you see
> all of your Lustre servers in the printed list?
>
> -Rusty
>
> On Mon, Apr 24, 2017 at 10:39 AM, Russell Dekema 
> wrote:
> > I can't rule it out, but it seems unlikely to me that an out of date
> > IB HCA firmware version would cause a problem like this, especially
> > when everything was working before on that same version, and when IB
> > communication over the device seems to be working in general (as shown
> > by your pings over your IPoIB interfaces).
> >
> > If you decide to update your HCA (or switch, in the case of
> > switch-f0cc8e/U1) firmware, you may want to check with the vendor
> > before doing so. In the past, they have sometimes told me that the
> > "latest FW version available for this device" reported by ibdiagnet is
> > incorrect and should be ignored. Of course, in other cases, newer
> > firmware versions were in fact available and they did recommend
> > upgrading.
> >
> > Sincerely,
> > Rusty D.
> >
> >
> >
> > On Mon, Apr 24, 2017 at 10:32 AM, Strikwerda, Ger
> >  wrote:
> >> Hi Russell/*,
> >>
> >> If we run ibdiagnet we get errors/warnings about some (newer) nodes
> which
> >> happen to have a new firmware on the IB interface:
> >>
> >> Nodes Information
> >> -E- FW Check finished with errors
> >> -W- pg-gpu01/U1 - Node has FW version 2.32.5100 while the latest FW
> version,
> >> for the same device available on this fabric is 2.36.5150
> >> -W- pg-gpu05/U1 - Node has FW version 2.35.5100 while the latest FW
> version,
> >> for the same device available on this fabric is 2.36.5150
> >> -W- pg-gpu06/U1 - Node has FW version 2.35.5100 while the latest FW
> version,
> >> for the same device available on this fabric is 2.36.5150
> >> -W- pg-gpu02/U1 - Node has FW version 2.32.5100 while the latest FW
> version,
> >> for the same device available on this fabric is 2.36.5150
> >> -W- pg-gpu03/U1 - Node has FW version 2.32.5100 while the latest FW
> version,
> >> for the same device available on this fabric is 2.36.5150
> >> -W- pg-gpu04/U1 - Node has FW version 2.32.5100 while the latest FW
> version,
> >> for the same device available on this fabric is 2.36.5150
> >> -W- switch-f0cc8e/U1 - Node has FW version 9.2.7300 while the latest FW
> >> version, for the same device available on this fabric is 9.2.8000
> >> -W- pg-memory02/U1 - Node has FW version 2.32.5100 while the latest FW
> >> version, for the same device available on this fabric is 2.36.5150
> >> -W- pg-memory01/U1 - Node has FW version 2.32.5100 while the latest FW
> >> version, for the same device available on this fabric is 2.36.5150
> >> -W- pg-memory03/U1 - Node has FW version 2.32.5100 while the latest FW
> >> version, for the same device available on this fabric is 2.36.5150
> >> -W- pg-memory04/U1 - Node has FW version 2.32.5100 while the latest FW
> >> version, for the same device available on this fabric is 2.36.5150
> >> -W- pg-memory06/U1 - Node has FW version 2.32.5100 while the latest FW
> >> version, for the same device available on this fabric is 2.36.5150
> >> -W- pg-memory05/U1 - Node has FW version 2.32.5100 while the latest FW
> >> version, for the same device available on this fabric is 2.36.5150
> >> -W- pg-node163/U1 - Node has FW version 2.36.5000 while the latest FW
> >> version, for the same device available on this fabric is 2.36.5150
> >> -W- pg-node164/U1 - Node has FW version 2.36.5000 while the latest FW
> >> version, for the same device available on this fabric is 2.36.5150
> >> -W- pg-node001/U1 - Node has FW version 2.33.5100 while the latest FW
> >> version, for the same device available on this fabric is 2.36.5150
> >> -W- pg-interactive/U1 - Node has FW version 2.32.5100 while the latest
> FW
> >> version, for the same device available on this fabric is 2.36.5150
> >> -W- peregrine/U1 - Node has FW version 2.32.5100 while the latest FW
> >> version, for the same device available on this fabric is 2.36.5150
> >> -W- pg-node004/U1 - Node has FW version 2.32.5100 while the latest FW
> >> version, for the same device available on this fabric is 2.36.5150
> >> -W- pg-node005/U1 - Node has FW version 2.32.5100 while the latest FW
> >> version, for the same device available on this fabric is 2.36.5150
> >> -W- 

Re: [lustre-discuss] client fails to mount

2017-04-24 Thread Russell Dekema
I'm not sure this is likely to help either, but if you run the command
'ibhosts' on one of the non-working Lustre client nodes, do you see
all of your Lustre servers in the printed list?

-Rusty

On Mon, Apr 24, 2017 at 10:39 AM, Russell Dekema  wrote:
> I can't rule it out, but it seems unlikely to me that an out of date
> IB HCA firmware version would cause a problem like this, especially
> when everything was working before on that same version, and when IB
> communication over the device seems to be working in general (as shown
> by your pings over your IPoIB interfaces).
>
> If you decide to update your HCA (or switch, in the case of
> switch-f0cc8e/U1) firmware, you may want to check with the vendor
> before doing so. In the past, they have sometimes told me that the
> "latest FW version available for this device" reported by ibdiagnet is
> incorrect and should be ignored. Of course, in other cases, newer
> firmware versions were in fact available and they did recommend
> upgrading.
>
> Sincerely,
> Rusty D.
>
>
>
> On Mon, Apr 24, 2017 at 10:32 AM, Strikwerda, Ger
>  wrote:
>> Hi Russell/*,
>>
>> If we run ibdiagnet we get errors/warnings about some (newer) nodes which
>> happen to have a new firmware on the IB interface:
>>
>> Nodes Information
>> -E- FW Check finished with errors
>> -W- pg-gpu01/U1 - Node has FW version 2.32.5100 while the latest FW version,
>> for the same device available on this fabric is 2.36.5150
>> -W- pg-gpu05/U1 - Node has FW version 2.35.5100 while the latest FW version,
>> for the same device available on this fabric is 2.36.5150
>> -W- pg-gpu06/U1 - Node has FW version 2.35.5100 while the latest FW version,
>> for the same device available on this fabric is 2.36.5150
>> -W- pg-gpu02/U1 - Node has FW version 2.32.5100 while the latest FW version,
>> for the same device available on this fabric is 2.36.5150
>> -W- pg-gpu03/U1 - Node has FW version 2.32.5100 while the latest FW version,
>> for the same device available on this fabric is 2.36.5150
>> -W- pg-gpu04/U1 - Node has FW version 2.32.5100 while the latest FW version,
>> for the same device available on this fabric is 2.36.5150
>> -W- switch-f0cc8e/U1 - Node has FW version 9.2.7300 while the latest FW
>> version, for the same device available on this fabric is 9.2.8000
>> -W- pg-memory02/U1 - Node has FW version 2.32.5100 while the latest FW
>> version, for the same device available on this fabric is 2.36.5150
>> -W- pg-memory01/U1 - Node has FW version 2.32.5100 while the latest FW
>> version, for the same device available on this fabric is 2.36.5150
>> -W- pg-memory03/U1 - Node has FW version 2.32.5100 while the latest FW
>> version, for the same device available on this fabric is 2.36.5150
>> -W- pg-memory04/U1 - Node has FW version 2.32.5100 while the latest FW
>> version, for the same device available on this fabric is 2.36.5150
>> -W- pg-memory06/U1 - Node has FW version 2.32.5100 while the latest FW
>> version, for the same device available on this fabric is 2.36.5150
>> -W- pg-memory05/U1 - Node has FW version 2.32.5100 while the latest FW
>> version, for the same device available on this fabric is 2.36.5150
>> -W- pg-node163/U1 - Node has FW version 2.36.5000 while the latest FW
>> version, for the same device available on this fabric is 2.36.5150
>> -W- pg-node164/U1 - Node has FW version 2.36.5000 while the latest FW
>> version, for the same device available on this fabric is 2.36.5150
>> -W- pg-node001/U1 - Node has FW version 2.33.5100 while the latest FW
>> version, for the same device available on this fabric is 2.36.5150
>> -W- pg-interactive/U1 - Node has FW version 2.32.5100 while the latest FW
>> version, for the same device available on this fabric is 2.36.5150
>> -W- peregrine/U1 - Node has FW version 2.32.5100 while the latest FW
>> version, for the same device available on this fabric is 2.36.5150
>> -W- pg-node004/U1 - Node has FW version 2.32.5100 while the latest FW
>> version, for the same device available on this fabric is 2.36.5150
>> -W- pg-node005/U1 - Node has FW version 2.32.5100 while the latest FW
>> version, for the same device available on this fabric is 2.36.5150
>> -W- pg-node006/U1 - Node has FW version 2.32.5100 while the latest FW
>> version, for the same device available on this fabric is 2.36.5150
>> -W- pg-node007/U1 - Node has FW version 2.32.5100 while the latest FW
>> version, for the same device available on this fabric is 2.36.5150
>> -W- pg-node008/U1 - Node has FW version 2.32.5100 while the latest FW
>> version, for the same device available on this fabric is 2.36.5150
>> -W- pg-node009/U1 - Node has FW version 2.32.5100 while the latest FW
>> version, for the same device available on this fabric is 2.36.5150
>> -W- pg-node010/U1 - Node has FW version 2.32.5100 while the latest FW
>> version, for the same device available on this fabric is 2.36.5150
>> -W- pg-node011/U1 - Node has FW version 2.32.5100 while the latest 

Re: [lustre-discuss] client fails to mount

2017-04-24 Thread Russell Dekema
I can't rule it out, but it seems unlikely to me that an out of date
IB HCA firmware version would cause a problem like this, especially
when everything was working before on that same version, and when IB
communication over the device seems to be working in general (as shown
by your pings over your IPoIB interfaces).

If you decide to update your HCA (or switch, in the case of
switch-f0cc8e/U1) firmware, you may want to check with the vendor
before doing so. In the past, they have sometimes told me that the
"latest FW version available for this device" reported by ibdiagnet is
incorrect and should be ignored. Of course, in other cases, newer
firmware versions were in fact available and they did recommend
upgrading.

Sincerely,
Rusty D.




Re: [lustre-discuss] client fails to mount

2017-04-24 Thread Strikwerda, Ger
Hi Russell/*,

When we run ibdiagnet, we get errors/warnings about some (newer) nodes that
happen to have newer firmware on their IB interfaces:

Nodes Information
-E- FW Check finished with errors
-W- pg-gpu01/U1 - Node has FW version 2.32.5100 while the latest FW
version, for the same device available on this fabric is 2.36.5150
-W- pg-gpu05/U1 - Node has FW version 2.35.5100 while the latest FW
version, for the same device available on this fabric is 2.36.5150
-W- pg-gpu06/U1 - Node has FW version 2.35.5100 while the latest FW
version, for the same device available on this fabric is 2.36.5150
-W- pg-gpu02/U1 - Node has FW version 2.32.5100 while the latest FW
version, for the same device available on this fabric is 2.36.5150
-W- pg-gpu03/U1 - Node has FW version 2.32.5100 while the latest FW
version, for the same device available on this fabric is 2.36.5150
-W- pg-gpu04/U1 - Node has FW version 2.32.5100 while the latest FW
version, for the same device available on this fabric is 2.36.5150
-W- switch-f0cc8e/U1 - Node has FW version 9.2.7300 while the latest FW
version, for the same device available on this fabric is 9.2.8000
-W- pg-memory02/U1 - Node has FW version 2.32.5100 while the latest FW
version, for the same device available on this fabric is 2.36.5150
-W- pg-memory01/U1 - Node has FW version 2.32.5100 while the latest FW
version, for the same device available on this fabric is 2.36.5150
-W- pg-memory03/U1 - Node has FW version 2.32.5100 while the latest FW
version, for the same device available on this fabric is 2.36.5150
-W- pg-memory04/U1 - Node has FW version 2.32.5100 while the latest FW
version, for the same device available on this fabric is 2.36.5150
-W- pg-memory06/U1 - Node has FW version 2.32.5100 while the latest FW
version, for the same device available on this fabric is 2.36.5150
-W- pg-memory05/U1 - Node has FW version 2.32.5100 while the latest FW
version, for the same device available on this fabric is 2.36.5150
-W- pg-node163/U1 - Node has FW version 2.36.5000 while the latest FW
version, for the same device available on this fabric is 2.36.5150
-W- pg-node164/U1 - Node has FW version 2.36.5000 while the latest FW
version, for the same device available on this fabric is 2.36.5150
-W- pg-node001/U1 - Node has FW version 2.33.5100 while the latest FW
version, for the same device available on this fabric is 2.36.5150
-W- pg-interactive/U1 - Node has FW version 2.32.5100 while the latest FW
version, for the same device available on this fabric is 2.36.5150
-W- peregrine/U1 - Node has FW version 2.32.5100 while the latest FW
version, for the same device available on this fabric is 2.36.5150
-W- pg-node004/U1 - Node has FW version 2.32.5100 while the latest FW
version, for the same device available on this fabric is 2.36.5150
-W- pg-node005/U1 - Node has FW version 2.32.5100 while the latest FW
version, for the same device available on this fabric is 2.36.5150
-W- pg-node006/U1 - Node has FW version 2.32.5100 while the latest FW
version, for the same device available on this fabric is 2.36.5150
-W- pg-node007/U1 - Node has FW version 2.32.5100 while the latest FW
version, for the same device available on this fabric is 2.36.5150
-W- pg-node008/U1 - Node has FW version 2.32.5100 while the latest FW
version, for the same device available on this fabric is 2.36.5150
-W- pg-node009/U1 - Node has FW version 2.32.5100 while the latest FW
version, for the same device available on this fabric is 2.36.5150
-W- pg-node010/U1 - Node has FW version 2.32.5100 while the latest FW
version, for the same device available on this fabric is 2.36.5150
-W- pg-node011/U1 - Node has FW version 2.32.5100 while the latest FW
version, for the same device available on this fabric is 2.36.5150
-W- pg-node012/U1 - Node has FW version 2.32.5100 while the latest FW
version, for the same device available on this fabric is 2.36.5150
-W- pg-node013/U1 - Node has FW version 2.32.5100 while the latest FW
version, for the same device available on this fabric is 2.36.5150
-W- pg-node014/U1 - Node has FW version 2.32.5100 while the latest FW
version, for the same device available on this fabric is 2.36.5150
-W- pg-node015/U1 - Node has FW version 2.32.5100 while the latest FW
version, for the same device available on this fabric is 2.36.5150

Could that be an issue?
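One way to make a warning list like this easier to act on is to tally the nodes per firmware version. A minimal sketch in shell; $log inlines a few of the warning lines above and stands in for a real capture, such as `ibdiagnet` output redirected to a file:

```shell
# Hypothetical sketch: tally nodes per out-of-date FW version from saved
# ibdiagnet output. $log stands in for a real capture, e.g. the result of
# `ibdiagnet > ibdiag.out` on a fabric node.
log='-W- pg-gpu01/U1 - Node has FW version 2.32.5100 while the latest FW
-W- pg-gpu05/U1 - Node has FW version 2.35.5100 while the latest FW
-W- pg-node001/U1 - Node has FW version 2.33.5100 while the latest FW
-W- pg-node004/U1 - Node has FW version 2.32.5100 while the latest FW'
# Field 8 of each warning line is the node's firmware version.
printf '%s\n' "$log" |
  awk '/Node has FW version/ {count[$8]++}
       END {for (v in count) print v, count[v]}' | sort
```

Sorted output gives one line per firmware version with a node count, which makes the spread across the fabric easy to see at a glance.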





Re: [lustre-discuss] client fails to mount

2017-04-24 Thread Russell Dekema
Oh, ok, that seems to rule the subnet manager out.

I mis-read your IP network numbers earlier and thought you had not
tried regular IP-ping across your IPoIB interfaces, but, upon
re-reading your initial message, it seems you have tried this and it
does work, even between a client with non-working Lustre and your
MGS/MDS.

In this case, I have no further suggestions.

Best of luck,
Rusty D.


Re: [lustre-discuss] client fails to mount

2017-04-24 Thread Strikwerda, Ger
Hi Russell,

Thanks for the IB subnet clues:

[root@pg-gpu01 ~]# ibv_devinfo
hca_id: mlx4_0
transport:  InfiniBand (0)
fw_ver: 2.32.5100
node_guid:  f452:1403:00f5:4620
sys_image_guid: f452:1403:00f5:4623
vendor_id:  0x02c9
vendor_part_id: 4099
hw_ver: 0x1
board_id:   MT_1100120019
phys_port_cnt:  1
port:   1
state:  PORT_ACTIVE (4)
max_mtu:4096 (5)
active_mtu: 4096 (5)
sm_lid: 1
port_lid:   185
port_lmc:   0x00
link_layer: InfiniBand

[root@pg-gpu01 ~]# sminfo
sminfo: sm lid 1 sm guid 0xf452140300f62320, activity count 80878098
priority 0 state 3 SMINFO_MASTER

Looks like the rebooted node is able to contact the IB fabric and the IB subnet manager.
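A quick way to turn this sminfo check into a script is to look for the MASTER state in its output. A minimal sketch; $out inlines the sminfo line captured above (on a live node you would use out=$(sminfo)). Note that, as this thread shows, sminfo can report a master SM that is in fact unstable, so this is only a first-pass test:

```shell
# Hypothetical sketch: check that sminfo reports a MASTER subnet manager.
# $out inlines the captured output from this thread; on a live node you
# would capture it with: out=$(sminfo)
out='sminfo: sm lid 1 sm guid 0xf452140300f62320, activity count 80878098 priority 0 state 3 SMINFO_MASTER'
case "$out" in
  *SMINFO_MASTER*) echo "SM reports MASTER state" ;;
  *)               echo "no MASTER SM visible - check the fabric" ;;
esac
```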





Re: [lustre-discuss] client fails to mount

2017-04-24 Thread Russell Dekema
At first glance, this sounds like your InfiniBand subnet manager may
be down or malfunctioning. In this case, nodes which were already up
when the subnet manager was working will continue to be able to
communicate over IB, but nodes which reboot after the SM goes down
will not.

You can test this theory by running the 'ibv_devinfo' command on one
of your rebooted nodes. If the relevant IB port is in state PORT_INIT,
this confirms there is a problem with your subnet manager.
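The PORT_INIT check described above can be scripted as a small classifier. A minimal sketch; $sample is hypothetical ibv_devinfo-style output for a port stuck in INIT, and on a live node you would parse `ibv_devinfo` output directly:

```shell
# Hypothetical sketch: classify an IB port from ibv_devinfo-style output.
# $sample mimics a node whose port never left INIT; on a real node,
# replace the variable with actual `ibv_devinfo` output.
sample='        port:   1
            state:  PORT_INIT (2)
            sm_lid: 0'
# The second field of the "state:" line is the port state name.
state=$(printf '%s\n' "$sample" | awk '/state:/ {print $2}')
if [ "$state" = "PORT_ACTIVE" ]; then
  echo "port active - fabric looks configured"
else
  echo "port in $state - suspect the subnet manager"
fi
```

On the rebooted node in this thread the port was already PORT_ACTIVE, so this particular check passed even though the SM turned out to be misbehaving.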

Sincerely,
Rusty Dekema





[lustre-discuss] client fails to mount

2017-04-24 Thread Strikwerda, Ger
 Hi everybody,

Here at the University of Groningen we are now experiencing a strange
Lustre error. If a client reboots, it fails to mount the Lustre storage.
The client is not able to reach the MGS service. The storage and nodes
communicate over IB and until now have done so without any problems. It
looks like an issue inside LNET. Clients cannot LNET ping/connect to the
metadata and/or storage servers, but the clients are able to LNET ping each
other. Clients that have not been rebooted are working fine and have their
mounts on our Lustre filesystem.

Lustre client log:

Lustre: Lustre: Build Version: 2.5.3-RC1--PRISTINE-2.6.32-573.el6.x86_64
LNet: Added LNI 172.23.54.51@o2ib [8/256/0/180]

LustreError: 15c-8: MGC172.23.55.211@o2ib: The configuration from log
'pgdata01-client' failed (-5). This may be the result of communication
errors between this node and the MGS, a bad configuration, or other errors.
See the syslog for more information.
LustreError: 3812:0:(llite_lib.c:1046:ll_fill_super()) Unable to process
log: -5
Lustre: Unmounted pgdata01-client
LustreError: 3812:0:(obd_mount.c:1325:lustre_fill_super()) Unable to mount
(-5)
LNetError: 2882:0:(o2iblnd_cb.c:2587:kiblnd_rejected()) 172.23.55.212@o2ib
rejected: consumer defined fatal error
LNetError: 2882:0:(o2iblnd_cb.c:2587:kiblnd_rejected()) Skipped 1 previous
similar message
Lustre: 3765:0:(client.c:1918:ptlrpc_expire_one_request()) @@@ Request sent
has failed due to network error: [sent 1492789626/real 1492789626]
req@88105af2cc00 x1565303228072004/t0(0) o250->MGC172.23.55.211@o2ib
@172.23.55.212@o2ib:26/25 lens 400/544 e 0 to 1 dl 1492789631 ref 1 fl
Rpc:XN/0/ rc 0/-1
Lustre: 3765:0:(client.c:1918:ptlrpc_expire_one_request()) Skipped 1
previous similar message
LustreError: 3826:0:(client.c:1083:ptlrpc_import_delay_req()) @@@ send
limit expired   req@882041ffc000 x1565303228071996/t0(0)
o101->MGC172.23.55.211@o2ib@172.23.55.211@o2ib:26/25 lens 328/344 e 0 to 0
dl 0 ref 2 fl Rpc:W/0/ rc 0/-1
LustreError: 3826:0:(client.c:1083:ptlrpc_import_delay_req()) Skipped 2
previous similar messages
LustreError: 15c-8: MGC172.23.55.211@o2ib: The configuration from log
'pghome01-client' failed (-5). This may be the result of communication
errors between this node and the MGS, a bad configuration, or other errors.
See the syslog for more information.
LustreError: 3826:0:(llite_lib.c:1046:ll_fill_super()) Unable to process
log: -5

LNetError: 2882:0:(o2iblnd_cb.c:2587:kiblnd_rejected()) 172.23.55.212@o2ib
rejected: consumer defined fatal error
LNetError: 2882:0:(o2iblnd_cb.c:2587:kiblnd_rejected()) Skipped 1 previous
similar message
LNet: 3755:0:(o2iblnd_cb.c:475:kiblnd_rx_complete()) Rx from
172.23.55.211@o2ib failed: 5
LNetError: 2882:0:(o2iblnd_cb.c:2587:kiblnd_rejected()) 172.23.55.211@o2ib
rejected: consumer defined fatal error
LNetError: 2882:0:(o2iblnd_cb.c:2587:kiblnd_rejected()) Skipped 1 previous
similar message
LNet: 2882:0:(o2iblnd_cb.c:2072:kiblnd_peer_connect_failed()) Deleting
messages for 172.23.55.211@o2ib: connection failed
LNet: 2882:0:(o2iblnd_cb.c:2072:kiblnd_peer_connect_failed()) Deleting
messages for 172.23.55.212@o2ib: connection failed
LNet: 3754:0:(o2iblnd_cb.c:475:kiblnd_rx_complete()) Rx from
172.23.55.212@o2ib failed: 5
LNet: 3754:0:(o2iblnd_cb.c:475:kiblnd_rx_complete()) Skipped 17 previous
similar messages
LNet: 2882:0:(o2iblnd_cb.c:2072:kiblnd_peer_connect_failed()) Deleting
messages for 172.23.55.211@o2ib: connection failed
LNet: 3754:0:(o2iblnd_cb.c:475:kiblnd_rx_complete()) Rx from
172.23.55.212@o2ib failed: 5
LNet: 2882:0:(o2iblnd_cb.c:2072:kiblnd_peer_connect_failed()) Deleting
messages for 172.23.55.212@o2ib: connection failed

LNET ping of a metadata-node:

[root@pg-gpu01 ~]# lctl ping 172.23.55.211@o2ib
failed to ping 172.23.55.211@o2ib: Input/output error

LNET ping of the number 2 metadata-node:

[root@pg-gpu01 ~]# lctl ping 172.23.55.212@o2ib
failed to ping 172.23.55.212@o2ib: Input/output error

LNET ping of a random compute-node:

[root@pg-gpu01 ~]# lctl ping 172.23.52.5@o2ib
12345-0@lo
12345-172.23.52.5@o2ib

LNET to OST01:

[root@pg-gpu01 ~]# lctl ping 172.23.55.201@o2ib
failed to ping 172.23.55.201@o2ib: Input/output error

LNET to OST02:

[root@pg-gpu01 ~]# lctl ping 172.23.55.202@o2ib
failed to ping 172.23.55.202@o2ib: Input/output error

'normal' pings (at the IP level) work fine:

[root@pg-gpu01 ~]# ping 172.23.55.201
PING 172.23.55.201 (172.23.55.201) 56(84) bytes of data.
64 bytes from 172.23.55.201: icmp_seq=1 ttl=64 time=0.741 ms

[root@pg-gpu01 ~]# ping 172.23.55.202
PING 172.23.55.202 (172.23.55.202) 56(84) bytes of data.
64 bytes from 172.23.55.202: icmp_seq=1 ttl=64 time=0.704 ms
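The contrast above (LNET pings to the servers fail while plain IP pings succeed) can be checked for every server NID in one sweep. A minimal sketch; the NIDs are the ones from this thread, and `lctl` is stubbed out so the fragment runs anywhere. On a real client, remove the stub to use the actual binary:

```shell
# `lctl` is stubbed here (always failing) so the sketch is self-contained;
# delete this function on a real Lustre client to use the actual binary.
lctl() { return 1; }

# Sweep the server NIDs from this thread and report each one's status.
for nid in 172.23.55.211@o2ib 172.23.55.212@o2ib \
           172.23.55.201@o2ib 172.23.55.202@o2ib; do
  if lctl ping "$nid" >/dev/null 2>&1; then
    echo "$nid OK"
  else
    echo "$nid FAILED"
  fi
done
```

If every server NID fails while other client NIDs still answer, the fault usually lies in fabric state rather than in Lustre itself, consistent with the subnet-manager problem eventually found in this thread.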

lctl on a rebooted node:

[root@pg-gpu01 ~]# lctl dl

lctl on a not rebooted node:

[root@pg-node005 ~]# lctl dl
  0 UP mgc MGC172.23.55.211@o2ib 94bd1c8a-512f-b920-9a4e-a6aced3d386d 5
  1 UP lov pgtemp01-clilov-88206906d400
281c441f-8aa3-ab56-8812-e459d308f47c 4
  2 UP lmv pgtemp01-clilmv-88206906d400