[lustre-discuss] Multiple IB Interfaces

2021-03-09 Thread Ms. Megan Larko via lustre-discuss
Greetings Alastair,

Bonding is supported on InfiniBand, but I believe that it is only
active/passive.
I think what you might be looking for, WRT avoiding data travelling through
the inter-CPU link, is CPU "affinity", AKA CPU "pinning".
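
As a minimal sketch (the HCA names mlx5_0/mlx5_1 and the ib_write_bw example
are assumptions; adjust for your hardware), you can check which socket each
HCA is local to and then pin a process to that socket:

# which NUMA node each HCA hangs off (-1 means no affinity information)
cat /sys/class/infiniband/mlx5_0/device/numa_node
cat /sys/class/infiniband/mlx5_1/device/numa_node

# run a test (or any process) pinned to the socket local to mlx5_0,
# assuming that HCA reported NUMA node 0
numactl --cpunodebind=0 --membind=0 ib_write_bw -d mlx5_0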

Cheers,
megan

WRT = "with regards to"
AKA = "also known as"


[lustre-discuss] o2ib nid connections timeout until an snmp ping

2021-03-09 Thread Christian Kuntz via lustre-discuss
Hello all,

Requisite preamble: This is debian 10.7 with lustre 2.13.0 (compiled by
yours truly).

We've been observing some odd behavior recently with o2ib NIDs. Everything is
connected over the same switch (cards and switch are all Mellanox), and each
machine has a single network card connected in a bond up to the switch.
Whenever a 'new' machine connects to the others over LNet, `lctl ping` and
other operations fail against some subset of the existing hosts. Curiously,
after an SNMP ping is issued, all o2ib operations succeed and things
stabilize.

We've tested the stability with the ib_* suite of tools and the fabric
itself appears stable. So far we've not attempted to reproduce the behavior
with tcp NIDs, but we never encountered this issue over approximately one
year of using Lustre over tcp NIDs.
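
If it helps narrow things down, here is a sketch of the state I'd compare on
an affected pair of hosts before and after the SNMP ping (the peer NID is
taken from the logs below; the bond interface name bond0 is an assumption):

# LNet's view of the peer and of local NI health
lctl ping 10.100.101.32@o2ib1
lnetctl ping 10.100.101.32@o2ib1
lnetctl net show -v
lnetctl peer show -v

# IPoIB neighbour state on the bond carrying the o2ib NID -- an IP-level
# ping populates this, so a missing or incomplete entry would be telling
ip neigh show dev bond0 | grep 10.100.101.32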

Here are the relevant dmesg portions:
[72768.234745] LNetError:
16138:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.100.101.210@o2ib1
added to recovery queue. Health = 0
[72792.235556] LNetError:
16138:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.100.101.210@o2ib1
added to recovery queue. Health = 0
[72829.229280] LNetError:
16112:0:(lib-move.c:3043:lnet_resend_pending_msgs_locked()) Error sending
GET to 12345-10.100.101.32@o2ib1: -125
[72829.231426] LNetError:
16112:0:(lib-move.c:3043:lnet_resend_pending_msgs_locked()) Skipped 1
previous similar message
[72966.226069] LNetError:
16112:0:(lib-move.c:3043:lnet_resend_pending_msgs_locked()) Error sending
GET to 12345-10.100.101.32@o2ib1: -125
[72966.228366] LNetError:
16112:0:(lib-move.c:3043:lnet_resend_pending_msgs_locked()) Skipped 3
previous similar messages
[73006.226876] LNetError:
16138:0:(o2iblnd_cb.c:3351:kiblnd_check_txs_locked()) Timed out tx:
active_txs, 1 seconds
[73006.228085] LNetError:
16138:0:(o2iblnd_cb.c:3351:kiblnd_check_txs_locked()) Skipped 31 previous
similar messages
[73006.229140] LNetError: 16138:0:(o2iblnd_cb.c:3426:kiblnd_check_conns())
Timed out RDMA with 10.100.101.32@o2ib1 (7): c: 6, oc: 0, rc: 8
[73006.231045] LNetError: 16138:0:(o2iblnd_cb.c:3426:kiblnd_check_conns())
Skipped 31 previous similar messages
[73016.243190] LNet: 16138:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Timed
out tx for 10.100.101.36@o2ib1: 9 seconds
[73016.243195] LNet: 16138:0:(o2iblnd_cb.c:3397:kiblnd_check_conns())
Skipped 60 previous similar messages
[73032.243722] LNetError:
16138:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.100.101.210@o2ib1
added to recovery queue. Health = 0
[73261.244179] LNetError:
16112:0:(lib-move.c:3043:lnet_resend_pending_msgs_locked()) Error sending
GET to 12345-10.100.101.32@o2ib1: -125
[73261.246265] LNetError:
16112:0:(lib-move.c:3043:lnet_resend_pending_msgs_locked()) Skipped 11
previous similar messages

Is this normal/known behavior with 2.13, or have I missed some portion of
the o2ib net setup?

Please let me know if further information is needed.

Cheers, and thanks for your time,
Christian


Re: [lustre-discuss] MDT hanging

2021-03-09 Thread Simon Guilbault via lustre-discuss
Hi,
One failure mode that the ZFS pacemaker resource does not seem to pick up
is when MMP fails due to some problem with the SAS bus. We added this short
script, running as a systemd daemon, to do a failover when that happens. The
other check in the script uses NHC, mostly to check whether the IB port is
up.

If either of those two checks fails, it will try to umount the resource
cleanly instead of doing a power cycle right away; if the umount fails,
pacemaker will issue the STONITH after the umount timeout.

#!/bin/bash
while true; do
  if /usr/sbin/crm_mon -1 | grep Online | grep $HOSTNAME > /dev/null ; then
    # node is online, do the checks
    # check the status of NHC
    if ! /usr/bin/nice -n -5 /usr/sbin/nhc -t 600; then
      echo "NHC failed, failover to the other node"
      /usr/sbin/pcs node standby $HOSTNAME
    fi
    # check if the MMP of ZFS is not stalled
    if /usr/sbin/zpool status | grep "The pool is suspended because multihost writes failed or were delayed" ; then
      echo "ZFS could not write the MMP, failover to the other node"
      /usr/sbin/pcs node standby $HOSTNAME
    fi
  fi
  sleep 60
done
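
A minimal sketch of a systemd unit that could keep such a loop running (the
script path and unit name here are assumptions; adjust to wherever the
script is installed):

# /etc/systemd/system/lustre-ha-watchdog.service
[Unit]
Description=Extra HA checks for Lustre servers (NHC + ZFS MMP)
After=pacemaker.service

[Service]
Type=simple
ExecStart=/usr/local/sbin/lustre-ha-watchdog.sh
Restart=always
RestartSec=30

[Install]
WantedBy=multi-user.target

Enable it with "systemctl daemon-reload && systemctl enable --now
lustre-ha-watchdog.service".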


On Tue, Mar 9, 2021 at 1:15 PM Christopher Mountford via lustre-discuss <
lustre-discuss@lists.lustre.org> wrote:

> Hi,
>
> We've had a couple of MDT hangs on 2 of our lustre filesystems after
> updating to 2.12.6 (though I'm sure I've seen this exact behaviour on
> previous versions).

[lustre-discuss] MDT hanging

2021-03-09 Thread Christopher Mountford via lustre-discuss
Hi,

We've had a couple of MDT hangs on 2 of our lustre filesystems after updating 
to 2.12.6 (though I'm sure I've seen this exact behaviour on previous versions).

The symptoms are a gradually increasing load on the affected MDS, processes 
doing I/O on the filesystem blocking indefinitely, and messages on the 
client similar to:

Mar  9 15:37:22 spectre09 kernel: Lustre: 
25309:0:(client.c:2146:ptlrpc_expire_one_request()) @@@ Request sent has timed 
out for slow reply: [sent 1615303641/real 
1615303641]  req@972dbe51bf00 x1692620480891456/t0(0) 
o44->ahome3-MDT0001-mdc-9718e3be@10.143.254.212@o2ib:12/10 lens 448/440 
e 2 to 1 dl 1615304242 ref 2 fl Rpc:X/0/ rc 0/-1
Mar  9 15:37:22 spectre09 kernel: Lustre: ahome3-MDT0001-mdc-9718e3be: 
Connection to ahome3-MDT0001 (at 10.143.254.212@o2ib) was lost; in progress 
operations using this service will wait for recovery to complete
Mar  9 15:37:22 spectre09 kernel: Lustre: ahome3-MDT0001-mdc-9718e3be: 
Connection restored to 10.143.254.212@o2ib (at 10.143.254.212@o2ib)

Warnings of hung mdt_io tasks on the MDS and lustre debug log dumps to /tmp.

Rebooting the affected MDS cleared the problem and everything recovered.



Looking at the MDS system logs, the first sign of trouble appears to be:

Mar  9 15:24:11 amds01b kernel: VERIFY3(dr->dr_dbuf->db_level == level) failed 
(0 == 18446744073709551615)
Mar  9 15:24:11 amds01b kernel: PANIC at dbuf.c:3391:dbuf_sync_list()
Mar  9 15:24:11 amds01b kernel: Showing stack for process 18137
Mar  9 15:24:11 amds01b kernel: CPU: 3 PID: 18137 Comm: dp_sync_taskq Tainted: 
P   OE     3.10.0-1160.2.1.el7_lustre.x86_64 #1
Mar  9 15:24:11 amds01b kernel: Hardware name: HPE ProLiant DL360 
Gen10/ProLiant DL360 Gen10, BIOS U32 07/16/2020
Mar  9 15:24:11 amds01b kernel: Call Trace:
Mar  9 15:24:11 amds01b kernel: [] dump_stack+0x19/0x1b
Mar  9 15:24:11 amds01b kernel: [] spl_dumpstack+0x44/0x50 
[spl]
Mar  9 15:24:11 amds01b kernel: [] spl_panic+0xc9/0x110 [spl]
Mar  9 15:24:11 amds01b kernel: [] ? tracing_is_on+0x15/0x30
Mar  9 15:24:11 amds01b kernel: [] ? 
tracing_record_cmdline+0x1d/0x120
Mar  9 15:24:11 amds01b kernel: [] ? spl_kmem_free+0x35/0x40 
[spl]
Mar  9 15:24:11 amds01b kernel: [] ? update_curr+0x14c/0x1e0
Mar  9 15:24:11 amds01b kernel: [] ? 
account_entity_dequeue+0xae/0xd0
Mar  9 15:24:11 amds01b kernel: [] dbuf_sync_list+0x7b/0xd0 
[zfs]
Mar  9 15:24:11 amds01b kernel: [] dnode_sync+0x370/0x890 
[zfs]
Mar  9 15:24:11 amds01b kernel: [] 
sync_dnodes_task+0x61/0x150 [zfs]
Mar  9 15:24:11 amds01b kernel: [] taskq_thread+0x2ac/0x4f0 
[spl]
Mar  9 15:24:11 amds01b kernel: [] ? wake_up_state+0x20/0x20
Mar  9 15:24:11 amds01b kernel: [] ? 
taskq_thread_spawn+0x60/0x60 [spl]
Mar  9 15:24:11 amds01b kernel: [] kthread+0xd1/0xe0
Mar  9 15:24:11 amds01b kernel: [] ? 
insert_kthread_work+0x40/0x40
Mar  9 15:24:11 amds01b kernel: [] 
ret_from_fork_nospec_begin+0x7/0x21
Mar  9 15:24:11 amds01b kernel: [] ? 
insert_kthread_work+0x40/0x40




My read of this is that ZFS failed whilst syncing cached data out to disk and 
panicked (I guess this panic is internal to ZFS as the system remained up and 
otherwise responsive - no kernel panic triggered). Does this seem correct?

The pacemaker ZFS resource did not pick up the failure; it relies on 'zpool 
list -H -o health'. Can anyone think of a way we could detect this sort of 
problem and trigger an automated reset of the affected server? Unfortunately 
I'd rebooted the server before I spotted the log entry. Next time I'll run 
some zfs commands to see what they return before rebooting.
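
One idea (a rough, untested sketch on my part) would be to key off the
message that zpool status prints when MMP suspends the pool, running
something like this periodically from cron or a systemd timer:

# crude watchdog for a pool suspension that the health property seems to miss
if /usr/sbin/zpool status | grep -q "The pool is suspended"; then
    logger -t zfs-watch "pool suspended (MMP/IO failure), requesting failover"
    /usr/sbin/pcs node standby "$HOSTNAME"
fi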

Any advice on what additional steps to take? I guess this is probably more a 
ZFS issue than a Lustre one.

The MDSes are based on HPE DL360s connected to D3700 JBODs; the MDTs are on 
ZFS. CentOS 7.9 with the Lustre-patched kernel, zfs 0.7.13, lustre 2.12.6, 
kernel 3.10.0-1160.2.1.el7_lustre.x86_64.

Kind Regards,
Christopher.


[lustre-discuss] Multiple IB interfaces

2021-03-09 Thread Alastair Basden via lustre-discuss

Hi,

We are installing some new Lustre servers with 2 InfiniBand cards, one 
attached to each CPU socket.  Storage is NVMe, again with some drives 
attached to each socket.


We want to ensure that data to/from each drive uses the appropriate IB 
card and doesn't need to travel through the inter-CPU link.  Data being 
written is fairly easy, I think: we just set that OST to the appropriate IP 
address.  However, data being read may well go out of the other NIC, if I 
understand correctly.


What setup do we need for this?
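
As a sketch of the kind of configuration I have in mind (the interface names
and o2ib numbering are assumptions, and the bracketed suffix is, if I've read
the manual correctly, LNet's CPU-partition (CPT) binding):

# /etc/modprobe.d/lustre.conf
# create one CPU partition per NUMA node (i.e. per socket)
options libcfs cpu_pattern="N"
# bind each o2ib network to the partition local to its HCA
options lnet networks="o2ib0(ib0)[0],o2ib1(ib1)[1]"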

I think probably not bonding, as that will presumably not tie 
NIC interfaces to CPUs.  But I also see a note in the Lustre manual:


"""If the server has multiple interfaces on the same subnet, the Linux 
kernel will send all traffic using the first configured interface. This is 
a limitation of Linux, not Lustre. In this case, network interface bonding 
should be used. For more information about network interface bonding, see 
Chapter 7, Setting Up Network Interface Bonding."""


(plus, no idea if bonding is supported on InfiniBand).

Thanks,
Alastair.