[lustre-discuss] Multiple IB Interfaces
Greetings Alastair,

Bonding is supported on InfiniBand, but I believe it is only active/passive. I think what you might be looking for, with regard to avoiding data travelling through the inter-CPU link, is CPU "affinity", also known as CPU "pinning".

Cheers,
megan
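To make the pinning concrete: LNet lets you bind a network interface to specific CPU partitions (CPTs) in the module options. A minimal sketch, not tested here, assuming ib0/ib1 are the interface names and that CPTs 0 and 1 end up mapping onto the two sockets:

    # /etc/modprobe.d/lustre.conf -- sketch; ib0/ib1 and the CPT layout are assumptions
    # Split the CPUs into two partitions (roughly one per socket) ...
    options libcfs cpu_npartitions=2
    # ... then bind each o2ib network's interface to the partition on its socket
    options lnet networks="o2ib0(ib0)[0],o2ib1(ib1)[1]"

The bracketed list after each interface is the set of CPTs allowed to handle that NI's traffic; it's worth verifying the actual CPT-to-socket mapping on the machine before relying on this.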
[lustre-discuss] o2ib nid connections timeout until an snmp ping
Hello all,

Requisite preamble: this is Debian 10.7 with Lustre 2.13.0 (compiled by yours truly).

We've been observing some odd behavior recently with o2ib NIDs. Everything is connected over the same switch (cards and switch are all Mellanox), and each machine has a single network card connected in a bond up to the switch. Whenever a 'new' machine connects to the others over LNet, `lctl ping` and other operations will fail to some set of the existing hosts. Curiously, after an SNMP ping is issued, all o2ib operations succeed and things stabilize. We've tested the stability with the ib_ suite of tools and the fabric itself appears stable. As of yet we've not attempted to duplicate the behavior with tcp NIDs, but we haven't encountered this issue over approximately one year of using Lustre over tcp NIDs.

Here are the relevant dmesg portions:

[72768.234745] LNetError: 16138:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.100.101.210@o2ib1 added to recovery queue. Health = 0
[72792.235556] LNetError: 16138:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.100.101.210@o2ib1 added to recovery queue. Health = 0
[72829.229280] LNetError: 16112:0:(lib-move.c:3043:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.100.101.32@o2ib1: -125
[72829.231426] LNetError: 16112:0:(lib-move.c:3043:lnet_resend_pending_msgs_locked()) Skipped 1 previous similar message
[72966.226069] LNetError: 16112:0:(lib-move.c:3043:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.100.101.32@o2ib1: -125
[72966.228366] LNetError: 16112:0:(lib-move.c:3043:lnet_resend_pending_msgs_locked()) Skipped 3 previous similar messages
[73006.226876] LNetError: 16138:0:(o2iblnd_cb.c:3351:kiblnd_check_txs_locked()) Timed out tx: active_txs, 1 seconds
[73006.228085] LNetError: 16138:0:(o2iblnd_cb.c:3351:kiblnd_check_txs_locked()) Skipped 31 previous similar messages
[73006.229140] LNetError: 16138:0:(o2iblnd_cb.c:3426:kiblnd_check_conns()) Timed out RDMA with 10.100.101.32@o2ib1 (7): c: 6, oc: 0, rc: 8
[73006.231045] LNetError: 16138:0:(o2iblnd_cb.c:3426:kiblnd_check_conns()) Skipped 31 previous similar messages
[73016.243190] LNet: 16138:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Timed out tx for 10.100.101.36@o2ib1: 9 seconds
[73016.243195] LNet: 16138:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Skipped 60 previous similar messages
[73032.243722] LNetError: 16138:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.100.101.210@o2ib1 added to recovery queue. Health = 0
[73261.244179] LNetError: 16112:0:(lib-move.c:3043:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.100.101.32@o2ib1: -125
[73261.246265] LNetError: 16112:0:(lib-move.c:3043:lnet_resend_pending_msgs_locked()) Skipped 11 previous similar messages

Is this normal/known behavior with 2.13, or have I missed some portion of o2ib net setup? Please let me know if further information is needed.

Cheers, and thanks for your time,
Christian
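For anyone chasing something similar, roughly what we poke at while a host is in the bad state; a sketch rather than an exact transcript, since lnetctl output details vary between Lustre versions:

    # per-NI status, including the health values the recovery queue acts on
    lnetctl net show -v
    # per-peer state for the hosts that lctl ping fails against
    lnetctl peer show -v
    # LNet-level ping to one of the failing NIDs
    lctl ping 10.100.101.32@o2ib1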
Re: [lustre-discuss] MDT hanging
Hi,

One of the failures that the ZFS pacemaker resource does not seem to pick up is when MMP fails due to some problem with the SAS bus. We added this short script, running as a systemd daemon, to do a failover when this happens. The other check in this script uses NHC, mostly to check whether the IB port is up. If either of those 2 checks fails, it will try to umount the resource cleanly instead of doing a power cycle right away; if it fails to umount, pacemaker will issue the STONITH after the umount timeout.

#!/bin/bash
while true; do
    if /usr/sbin/crm_mon -1 | grep Online | grep $HOSTNAME > /dev/null; then
        # node is online, do the checks
        # check the status of NHC
        if ! /usr/bin/nice -n -5 /usr/sbin/nhc -t 600; then
            echo "NHC failed, failover to the other node"
            /usr/sbin/pcs node standby $HOSTNAME
        fi
        # check if the MMP of ZFS is not stalled
        if /usr/sbin/zpool status | grep "The pool is suspended because multihost writes failed or were delayed"; then
            echo "ZFS could not write the MMP, failover to the other node"
            /usr/sbin/pcs node standby $HOSTNAME
        fi
    fi
    sleep 60
done

On Tue, Mar 9, 2021 at 1:15 PM Christopher Mountford via lustre-discuss <lustre-discuss@lists.lustre.org> wrote:

> Hi,
>
> We've had a couple of MDT hangs on 2 of our Lustre filesystems after
> updating to 2.12.6 (though I'm sure I've seen this exact behaviour on
> previous versions).
>
> [...]
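In case it's useful, a minimal sketch of the kind of systemd unit such a watchdog could run under; the unit name and script path here are made up for illustration, not what we actually ship:

    [Unit]
    Description=HA watchdog: NHC and ZFS MMP checks
    After=pacemaker.service

    [Service]
    # Path is illustrative; point ExecStart at wherever the script is installed
    ExecStart=/usr/local/sbin/ha-watchdog.sh
    Restart=always
    RestartSec=10

    [Install]
    WantedBy=multi-user.target

It would then be enabled with something like `systemctl enable --now ha-watchdog.service`.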
[lustre-discuss] MDT hanging
Hi,

We've had a couple of MDT hangs on 2 of our Lustre filesystems after updating to 2.12.6 (though I'm sure I've seen this exact behaviour on previous versions).

The symptoms are a gradually increasing load on the affected MDS, and processes doing I/O on the filesystem blocking indefinitely, with messages on the client similar to:

Mar 9 15:37:22 spectre09 kernel: Lustre: 25309:0:(client.c:2146:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1615303641/real 1615303641] req@972dbe51bf00 x1692620480891456/t0(0) o44->ahome3-MDT0001-mdc-9718e3be@10.143.254.212@o2ib:12/10 lens 448/440 e 2 to 1 dl 1615304242 ref 2 fl Rpc:X/0/ rc 0/-1
Mar 9 15:37:22 spectre09 kernel: Lustre: ahome3-MDT0001-mdc-9718e3be: Connection to ahome3-MDT0001 (at 10.143.254.212@o2ib) was lost; in progress operations using this service will wait for recovery to complete
Mar 9 15:37:22 spectre09 kernel: Lustre: ahome3-MDT0001-mdc-9718e3be: Connection restored to 10.143.254.212@o2ib (at 10.143.254.212@o2ib)

There are also warnings of hung mdt_io tasks on the MDS, and Lustre debug logs dumped to /tmp. Rebooting the affected MDS cleared the problem and everything recovered.

Looking at the MDS system logs, the first sign of trouble appears to be:

Mar 9 15:24:11 amds01b kernel: VERIFY3(dr->dr_dbuf->db_level == level) failed (0 == 18446744073709551615)
Mar 9 15:24:11 amds01b kernel: PANIC at dbuf.c:3391:dbuf_sync_list()
Mar 9 15:24:11 amds01b kernel: Showing stack for process 18137
Mar 9 15:24:11 amds01b kernel: CPU: 3 PID: 18137 Comm: dp_sync_taskq Tainted: P OE 3.10.0-1160.2.1.el7_lustre.x86_64 #1
Mar 9 15:24:11 amds01b kernel: Hardware name: HPE ProLiant DL360 Gen10/ProLiant DL360 Gen10, BIOS U32 07/16/2020
Mar 9 15:24:11 amds01b kernel: Call Trace:
Mar 9 15:24:11 amds01b kernel: [] dump_stack+0x19/0x1b
Mar 9 15:24:11 amds01b kernel: [] spl_dumpstack+0x44/0x50 [spl]
Mar 9 15:24:11 amds01b kernel: [] spl_panic+0xc9/0x110 [spl]
Mar 9 15:24:11 amds01b kernel: [] ? tracing_is_on+0x15/0x30
Mar 9 15:24:11 amds01b kernel: [] ? tracing_record_cmdline+0x1d/0x120
Mar 9 15:24:11 amds01b kernel: [] ? spl_kmem_free+0x35/0x40 [spl]
Mar 9 15:24:11 amds01b kernel: [] ? update_curr+0x14c/0x1e0
Mar 9 15:24:11 amds01b kernel: [] ? account_entity_dequeue+0xae/0xd0
Mar 9 15:24:11 amds01b kernel: [] dbuf_sync_list+0x7b/0xd0 [zfs]
Mar 9 15:24:11 amds01b kernel: [] dnode_sync+0x370/0x890 [zfs]
Mar 9 15:24:11 amds01b kernel: [] sync_dnodes_task+0x61/0x150 [zfs]
Mar 9 15:24:11 amds01b kernel: [] taskq_thread+0x2ac/0x4f0 [spl]
Mar 9 15:24:11 amds01b kernel: [] ? wake_up_state+0x20/0x20
Mar 9 15:24:11 amds01b kernel: [] ? taskq_thread_spawn+0x60/0x60 [spl]
Mar 9 15:24:11 amds01b kernel: [] kthread+0xd1/0xe0
Mar 9 15:24:11 amds01b kernel: [] ? insert_kthread_work+0x40/0x40
Mar 9 15:24:11 amds01b kernel: [] ret_from_fork_nospec_begin+0x7/0x21
Mar 9 15:24:11 amds01b kernel: [] ? insert_kthread_work+0x40/0x40

My read of this is that ZFS failed whilst syncing cached data out to disk and panicked (I guess this panic is internal to ZFS, as the system remained up and otherwise responsive; no kernel panic was triggered). Does this seem correct?

The pacemaker ZFS resource did not pick up the failure; it relies on 'zpool list -H -o health'. Can anyone think of a way we could detect this sort of problem, to trigger an automated reset of the affected server? Unfortunately I'd rebooted the server before I spotted the log entry. Next time I'll run some zfs commands to see what they return before rebooting.

Any advice on what additional steps to take? I guess this is probably more a ZFS than a Lustre issue.

The MDS are based on HPE DL360s, connected to D3700 JBODs. The MDTs are on ZFS; CentOS 7.9, ZFS 0.7.13, Lustre 2.12.6, kernel 3.10.0-1160.2.1.el7_lustre.x86_64.

Kind Regards,
Christopher.
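One avenue worth sketching (untested, and the pcs call assumes a pacemaker/pcs setup like ours): the SPL panic does land in the kernel ring buffer, so a periodic check could grep for it in addition to polling pool health:

#!/bin/bash
# Sketch only: detect an SPL/ZFS panic (e.g. the dbuf.c VERIFY3 above)
# in the kernel log and put this node into standby so pacemaker fails over.
if dmesg | grep -q "PANIC at"; then
    echo "SPL/ZFS PANIC detected, requesting failover"
    /usr/sbin/pcs node standby "$HOSTNAME"
fi
# 'zpool status -x' may also report suspended/faulted pools that
# 'zpool list -H -o health' misses; worth capturing before any reboot.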
[lustre-discuss] Multiple IB interfaces
Hi,

We are installing some new Lustre servers with 2 InfiniBand cards, 1 attached to each CPU socket. Storage is NVMe; again, some drives are attached to each socket. We want to ensure that data to/from each drive uses the appropriate IB card and doesn't need to travel through the inter-CPU link.

Data being written is fairly easy, I think: we just set that OST to the appropriate IP address. However, data being read may well go out the other NIC, if I understand correctly. What setup do we need for this? Probably not bonding, I think, as that will presumably not tie NIC interfaces to CPUs. But I also see a note in the Lustre manual:

"""If the server has multiple interfaces on the same subnet, the Linux kernel will send all traffic using the first configured interface. This is a limitation of Linux, not Lustre. In this case, network interface bonding should be used. For more information about network interface bonding, see Chapter 7, Setting Up Network Interface Bonding."""

(Plus, I have no idea whether bonding is supported on InfiniBand.)

Thanks,
Alastair.
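One way to sidestep the single-subnet caveat, sketched here with assumed interface names ib0/ib1: declare each card as its own LNet network, so each OST's NID is tied to exactly one interface; or, on Lustre 2.10 and later, let LNet Multi-Rail drive both interfaces without bonding:

    # Option 1 (modprobe config): one LNet network per interface,
    # giving each card a distinct NID (e.g. @o2ib0 vs @o2ib1)
    options lnet networks="o2ib0(ib0),o2ib1(ib1)"

    # Option 2 (Lustre 2.10+, runtime): Multi-Rail with both
    # interfaces on a single network
    lnetctl net add --net o2ib --if ib0,ib1

Either would need to be paired with CPU-partition (CPT) binding, as megan's reply suggests, to keep the handling threads on the socket that owns each card.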