Howdy all! Long and short of it is: when multiple clients write long (100+ GB) contiguous streams to a file (a different file per client) over an SMB export of the Lustre client mount, all but one stream locks up, and the clients writing those streams momentarily lose their connections to the MDS/OSS. Eviction appears to be caused either by spurious connection resets or by lock callbacks hitting their timeouts. Apologies if anything is unclear or a detail is off; I've been staring at this problem for a while now and have experienced mild acute cognitive decline.
I'm on Lustre 2.14.0, Debian buster, kernel 4.19.208-3, ZFS 2.0.3, using upstream OFED. The cluster is 4 nodes total, in two pairs. Each pair is set up for internal HA, and the cluster overall is controlled by Pacemaker. The first pair has the MGT, MDT0, and OST0; the second pair has OST1. All 4 nodes run the Lustre client and export it over SMB, with clients connecting via multichannel. Network topology is fairly simple, and looks like this on node 1:

```
net:
    - net type: lo
      local NI(s):
        - nid: 0@lo
          status: up
    - net type: tcp1
      local NI(s):
        - nid: 192.168.68.121@tcp1
          status: up
          interfaces:
              0: lacpbond0.1
    - net type: tcp2
      local NI(s):
        - nid: 192.168.98.121@tcp2
          status: up
          interfaces:
              0: lacpbond0.2
```

The other nodes share this config. Peers look like:

```
peer:
    - primary nid: 192.168.68.122@tcp1
      Multi-Rail: True
      peer ni:
        - nid: 192.168.68.122@tcp1
          state: NA
        - nid: 192.168.98.122@tcp2
          state: NA
    - primary nid: 192.168.68.132@tcp1
      Multi-Rail: True
      peer ni:
        - nid: 192.168.68.132@tcp1
          state: NA
        - nid: 192.168.98.132@tcp2
          state: NA
    - primary nid: 0@lo
      Multi-Rail: False
      peer ni:
        - nid: 0@lo
          state: NA
    - primary nid: 192.168.68.131@tcp1
      Multi-Rail: True
      peer ni:
        - nid: 192.168.68.131@tcp1
          state: NA
        - nid: 192.168.98.131@tcp2
          state: NA
    - primary nid: 192.168.98.121@tcp2
      Multi-Rail: False
      peer ni:
        - nid: 192.168.98.121@tcp2
          state: NA
    - primary nid: 192.168.68.121@tcp1
      Multi-Rail: False
      peer ni:
        - nid: 192.168.68.121@tcp1
          state: NA
```

121 is pair-1-node-1, 122 is pair-1-node-2, 131 is pair-2-node-1, 132 is pair-2-node-2. At the time of testing, 121 holds the MGT, MDT0, and OST0, and 131 holds OST1. Of interest, perhaps: there are more IP addresses resident on those bond VLANs, but only the one IP per subnet shows in `lnetctl net show`. Each additional address is a VIP, one per subnet, used for SMB multichannel traffic.
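Since those VIPs will come up again below: LNet is currently bound per-interface, as the `interfaces:` lines above show. A sketch of the alternative I've been considering, selecting by address via `ip2nets` so the SMB VIPs on the same VLANs can't match an LNet NI. The address lists here are illustrative, not my verified config, and note that `networks` and `ip2nets` are mutually exclusive:

```
# /etc/modprobe.d/lustre.conf -- sketch only
# Current style: bind each LNet network to its bond VLAN interface
# (this matches the `lnetctl net show` output above).
options lnet networks="tcp1(lacpbond0.1),tcp2(lacpbond0.2)"

# Alternative sketch: select networks by local address instead, so the
# SMB multichannel VIPs never qualify as LNet NIs. Addresses are
# illustrative; remove the `networks` line above before using this.
# options lnet ip2nets="tcp1 192.168.68.[121,122,131,132]; tcp2 192.168.98.[121,122,131,132]"
```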
Observed errors look like this in the log from the client that gets the axe:

````
00000400:00000100:11.0:1663299328.923153:0:5195:0:(lib-move.c:3987:lnet_parse_put()) Dropping PUT from 12345-192.168.68.121@tcp1 portal 28 match 1743801431915008 offset 0 length 520: 4
00000100:02020000:8.0:1663357306.033900:0:5313:0:(client.c:1371:ptlrpc_check_status()) 11-0: dfsone-OST0000-osc-ffff9f85fcbe7800: operation ldlm_enqueue to node 192.168.68.121@tcp1 failed: rc = -107
00000100:00080000:8.0:1663357306.035405:0:5313:0:(recover.c:237:ptlrpc_request_handle_notconn()) import dfsone-OST0000-osc-ffff9f85fcbe7800 of dfsone-OST0000_UUID@192.168.68.121@tcp1 abruptly disconnected: reconnecting
00000100:00080000:8.0:1663357306.035407:0:5313:0:(import.c:86:import_set_state_nolock()) 0000000090940d7d dfsone-OST0000_UUID: changing import state from FULL to DISCONN
00000100:02000400:8.0:1663357306.035409:0:5313:0:(import.c:200:ptlrpc_set_import_discon()) dfsone-OST0000-osc-ffff9f85fcbe7800: Connection to dfsone-OST0000 (at 192.168.68.121@tcp1) was lost; in progress operations using this service will wait for recovery to complete
00000100:00080000:8.0:1663357306.035412:0:5313:0:(import.c:86:import_set_state_nolock()) 0000000090940d7d dfsone-OST0000_UUID: changing import state from DISCONN to CONNECTING
00000100:00080000:8.0:1663357306.035414:0:5313:0:(import.c:544:import_select_connection()) dfsone-OST0000-osc-ffff9f85fcbe7800: connect to NID 192.168.68.121@tcp1 last attempt 285795
00000100:00080000:8.0:1663357306.035415:0:5313:0:(import.c:625:import_select_connection()) dfsone-OST0000-osc-ffff9f85fcbe7800: import 0000000090940d7d using connection 192.168.68.121@tcp1/192.168.68.121@tcp1
00000100:00080000:13.0:1663357306.036858:0:5304:0:(import.c:1085:ptlrpc_connect_interpret()) dfsone-OST0000-osc-ffff9f85fcbe7800: connect to target with instance 22
00000008:00020000:13.0:1663357306.036861:0:5304:0:(osc_request.c:1042:osc_init_grant()) dfsone-OST0000-osc-ffff9f85fcbe7800: granted 3407872 but already consumed 247070720
00000080:00080000:13.0:1663357306.038241:0:5304:0:(lcommon_misc.c:75:cl_init_ea_size()) updating def/max_easize: 72/65536
00000100:00080000:13.0:1663357306.038244:0:5304:0:(import.c:1298:ptlrpc_connect_interpret()) @@@ dfsone-OST0000-osc-ffff9f85fcbe7800: evicting (reconnect/recover flags not set: 4) req@0000000023e1589b x1743840866079616/t0(0) o8->dfsone-OST0000-osc-ffff9f85fcbe7800@192.168.68.121@tcp1:28/4 lens 520/416 e 0 to 0 dl 1663357312 ref 1 fl Interpret:RNQU/0/0 rc 0/0 job:''
00000100:00080000:13.0:1663357306.038250:0:5304:0:(import.c:86:import_set_state_nolock()) 0000000090940d7d dfsone-OST0000_UUID: changing import state from CONNECTING to EVICTED
00000100:02020000:13.0:1663357306.038251:0:5304:0:(import.c:1580:ptlrpc_import_recovery_state_machine()) 167-0: dfsone-OST0000-osc-ffff9f85fcbe7800: This client was evicted by dfsone-OST0000; in progress operations using this service will fail.
00000100:00080000:13.0:1663357306.039600:0:5304:0:(import.c:1585:ptlrpc_import_recovery_state_machine()) evicted from dfsone-OST0000_UUID@192.168.68.121@tcp1; invalidating
````

Followed by a LOT of dirty page discards.
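As an aside, the volume of discarded dirty pages (and the "granted 3407872 but already consumed 247070720" line above) makes me wonder about shrinking the client-side write cache so less data sits behind any one lock. A sketch of the knobs I've been looking at; the values are placeholders, not recommendations:

```
# Per-OSC cap on dirty data before writers are throttled (placeholder value)
lctl set_param osc.dfsone-OST*.max_dirty_mb=512

# Client-wide cap on cached pages in llite (placeholder value)
lctl set_param llite.*.max_cached_mb=16384

# Worth watching around a reconnect, given the grant message above
lctl get_param osc.dfsone-OST*.cur_grant_bytes osc.dfsone-OST*.cur_dirty_bytes
```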
The debug logs from OST0000 at the time show very little:

````
00010000:00020000:7.0:1663357306.009983:0:25912:0:(ldlm_lockd.c:260:expired_lock_main()) ### lock callback timer expired after 101s: evicting client at 192.168.68.132@tcp1 ns: filter-dfsone-OST0000_UUID lock: 000000005efd3160/0xe7b2f62b49d87750 lrc: 3/0,0 mode: PR/PR res: [0x309f8:0x0:0x0].0x0 rrc: 109 type: EXT [0->18446744073709551615] (req 2072186880->2072219647) gid 0 flags: 0x60000400020020 nid: 192.168.68.132@tcp1 remote: 0x7090539ead802aa5 expref: 2484 pid: 31523 timeout: 344608 lvb_type: 0
00000020:00080000:7.0:1663357306.012934:0:25912:0:(genops.c:1602:class_fail_export()) disconnecting export 000000003df4eab0/af951890-1dbc-4df6-9745-dfe62f31e4e6
00000020:00000080:7.0:1663357306.012938:0:25912:0:(genops.c:1443:class_disconnect()) disconnect: cookie 0xe7b2f62b147efd01
00000100:00080000:7.0:1663357306.012941:0:25912:0:(import.c:86:import_set_state_nolock()) 00000000247857f4 <90><9F>O1S<95><FF><FF>: changing import state from FULL to CLOSED
00000100:00020000:8.0:1663357306.012981:0:10171:0:(client.c:1266:ptlrpc_import_delay_req()) @@@ IMP_CLOSED req@00000000fcbfa5de x1743802516165888/t0(0) o105->dfsone-OST0000@192.168.68.132@tcp1:15/16 lens 392/224 e 0 to 0 dl 0 ref 1 fl Rpc:QU/0/ffffffff rc 0/-1 job:''
00010000:00020000:9.0:1663357306.012999:0:31532:0:(ldlm_lib.c:3506:target_bulk_io()) @@@ Eviction on bulk READ req@00000000fa939054 x1743840866079296/t0(0) o3->af951890-1dbc-4df6-9745-dfe62f31e4e6@192.168.68.132@tcp1:202/0 lens 488/440 e 0 to 0 dl 1663357312 ref 1 fl Interpret:/0/0 rc 0/0 job:''
00000020:00080000:7.0:1663357306.013001:0:25912:0:(genops.c:1619:class_fail_export()) disconnected export 000000003df4eab0/af951890-1dbc-4df6-9745-dfe62f31e4e6
00010000:00020000:7.0:1663357306.013005:0:25912:0:(ldlm_lockd.c:260:expired_lock_main()) ### lock callback timer expired after 101s: evicting client at 192.168.68.132@tcp1 ns: filter-dfsone-OST0000_UUID lock: 00000000594515f1/0xe7b2f62b48a04d27 lrc: 3/0,0 mode: PR/PR res: [0x308e5:0x0:0x0].0x0 rrc: 2023 type: EXT [0->4106174463] (req 4105555968->4105621503) gid 0 flags: 0x60000400020020 nid: 192.168.68.132@tcp1 remote: 0x7090539ead103d4b expref: 2484 pid: 32689 timeout: 344608 lvb_type: 0
00000020:00080000:7.0:1663357306.013008:0:25912:0:(genops.c:1597:class_fail_export()) disconnecting dead export 000000003df4eab0/af951890-1dbc-4df6-9745-dfe62f31e4e6; skipping
00010000:00020000:7.0:1663357306.013010:0:25912:0:(ldlm_lockd.c:260:expired_lock_main()) ### lock callback timer expired after 101s: evicting client at 192.168.68.132@tcp1 ns: filter-dfsone-OST0000_UUID lock: 000000008ffe558b/0xe7b2f62b4888ebef lrc: 3/0,0 mode: PR/PR res: [0x308e5:0x0:0x0].0x0 rrc: 2023 type: EXT [0->2483986431] (req 2483920896->2483986431) gid 0 flags: 0x60000400020020 nid: 192.168.68.132@tcp1 remote: 0x7090539ead09565e expref: 2484 pid: 7009 timeout: 344608 lvb_type: 0
````

All the while, SMB on the client is unhappy:

```
[285788.856940] INFO: task smbd:11634 blocked for more than 120 seconds.
[285788.857717]       Tainted: P           O      4.19.0-21-amd64 #1 Debian 4.19.208-3
[285788.858468] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[285788.859160] smbd            D    0 11634  30610 0x00000000
[285788.859161] Call Trace:
[285788.859168]  __schedule+0x29f/0x840
[285788.859170]  schedule+0x28/0x80
[285788.859171]  io_schedule+0x12/0x40
[285788.859176]  wait_on_page_bit+0x114/0x150
[285788.859178]  ? page_cache_tree_insert+0xe0/0xe0
[285788.859230]  cl_page_assume+0x68/0x230 [obdclass]
[285788.859253]  ll_write_begin+0x365/0xf30 [lustre]
[285788.859256]  generic_perform_write+0xf4/0x1b0
[285788.859258]  ? file_update_time+0xed/0x130
[285788.859259]  __generic_file_write_iter+0xfe/0x1c0
[285788.859269]  vvp_io_write_start+0x502/0xb80 [lustre]
[285788.859283]  cl_io_start+0x59/0x110 [obdclass]
[285788.859296]  cl_io_loop+0x9a/0x1e0 [obdclass]
[285788.859304]  ll_file_io_generic+0x37f/0xb30 [lustre]
[285788.859312]  ll_file_write_iter+0x4fc/0x620 [lustre]
[285788.859315]  new_sync_write+0xfb/0x160
[285788.859316]  vfs_write+0xa5/0x1a0
[285788.859318]  ksys_pwrite64+0x61/0xa0
[285788.859322]  do_syscall_64+0x53/0x110
[285788.859324]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[285788.859326] RIP: 0033:0x7f926ec7eedf
[285788.859328] Code: Bad RIP value.
[285788.859329] RSP: 002b:00007f9261b75b50 EFLAGS: 00000293 ORIG_RAX: 0000000000000012
[285788.859331] RAX: ffffffffffffffda RBX: 000000000000003b RCX: 00007f926ec7eedf
[285788.859331] RDX: 0000000000001000 RSI: 000055b55876f6f0 RDI: 000000000000003b
[285788.859332] RBP: 000055b55876f6f0 R08: 0000000000000000 R09: 00000000ffffffff
[285788.859332] R10: 000000000cf51ab0 R11: 0000000000000293 R12: 0000000000001000
[285788.859333] R13: 000000000cf51ab0 R14: 00007f926dde4c00 R15: 000055b558734430
```

Possibly related: there are a number of reconnects in the logs that get labeled as network errors, seemingly triggered by IPs that are resident on the nodes but not part of the cluster configuration (presumably the SMB VIPs mentioned above):

```
00000800:02000400:10.0:1663252872.643150:0:5192:0:(socklnd_cb.c:614:ksocknal_process_transmit()) Host 192.168.98.137 reset our connection while we were sending data; it may have rebooted.
00000100:00000400:8.0:1663252872.643164:0:5321:0:(client.c:2282:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1663252872/real 1663252872] req@0000000051b5e1ac x1743829801974592/t0(0) o2->dfsone-OST0001-osc-ffff9f85fcbe7800@192.168.68.131@tcp1:28/4 lens 440/432 e 0 to 1 dl 1663252879 ref 1 fl Rpc:eXQr/0/ffffffff rc 0/-1 job:''
00000100:00080000:8.0:1663252872.643174:0:5321:0:(import.c:86:import_set_state_nolock()) 00000000f6d77c17 dfsone-OST0001_UUID: changing import state from FULL to DISCONN
00000100:02000400:8.0:1663252872.643175:0:5321:0:(import.c:200:ptlrpc_set_import_discon()) dfsone-OST0001-osc-ffff9f85fcbe7800: Connection to dfsone-OST0001 (at 192.168.68.131@tcp1) was lost; in progress operations using this service will wait for recovery to complete
00000100:00080000:8.0:1663252872.643178:0:5321:0:(import.c:444:ptlrpc_pinger_force()) dfsone-OST0001_UUID: waking up pinger s:DISCONN
00000100:00020000:6.0:1663252872.643180:0:5174:0:(events.c:208:client_bulk_callback()) event type 1, status -5, desc 00000000943f7085
00000100:00080000:9.0:1663252872.643186:0:26650:0:(pinger.c:249:ptlrpc_pinger_process_import()) af951890-1dbc-4df6-9745-dfe62f31e4e6->dfsone-OST0001_UUID: level DISCONN/3 force 1 force_next 0 deactive 0 pingable 1 suppress 0
00000100:00080000:9.0:1663252872.643188:0:26650:0:(recover.c:58:ptlrpc_initiate_recovery()) dfsone-OST0001_UUID: starting recovery
00000100:00080000:9.0:1663252872.643189:0:26650:0:(import.c:86:import_set_state_nolock()) 00000000f6d77c17 dfsone-OST0001_UUID: changing import state from DISCONN to CONNECTING
00000100:00080000:9.0:1663252872.643191:0:26650:0:(import.c:544:import_select_connection()) dfsone-OST0001-osc-ffff9f85fcbe7800: connect to NID 192.168.68.131@tcp1 last attempt 9075
00000100:00080000:9.0:1663252872.643193:0:26650:0:(import.c:625:import_select_connection()) dfsone-OST0001-osc-ffff9f85fcbe7800: import 00000000f6d77c17 using connection 192.168.68.131@tcp1/192.168.68.131@tcp1
00000100:00000400:7.0:1663252872.643216:0:5309:0:(client.c:2282:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1663252871/real 1663252871] req@000000001865eefa x1743829801842240/t0(0) o4->dfsone-OST0001-osc-ffff9f85fcbe7800@192.168.68.131@tcp1:6/4 lens 488/448 e 0 to 1 dl 1663252878 ref 2 fl Rpc:eXQr/0/ffffffff rc 0/-1 job:''
00000100:00080000:7.0:1663252872.643222:0:5309:0:(import.c:234:ptlrpc_set_import_discon()) osc: import 00000000f6d77c17 already not connected (conn 12, was 11): CONNECTING
00000800:02000400:12.0:1663252872.643226:0:5193:0:(socklnd_cb.c:614:ksocknal_process_transmit()) Host 192.168.68.137 reset our connection while we were sending data; it may have rebooted.
```

Potentially of note are the LNet statistics from 132:

```
statistics:
    msgs_alloc: 8
    msgs_max: 1657
    rst_alloc: 0
    errors: 0
    send_count: 485076548
    resend_count: 7
    response_timeout_count: 0
    local_interrupt_count: 0
    local_dropped_count: 0
    local_aborted_count: 0
    local_no_route_count: 0
    local_timeout_count: 3
    local_error_count: 25
    remote_dropped_count: 27
    remote_error_count: 850
    remote_timeout_count: 0
    network_timeout_count: 0
    recv_count: 484607083
    route_count: 0
    drop_count: 52
    send_length: 135618914369236
    recv_length: 640430677792
    route_length: 0
    drop_length: 25056
```

I'm wondering if someone with fresh[er] eyes and a better knowledge of Lustre's internals might be able to tell me where things are breaking. Is this a known issue that may be resolved in a future release, a problem with my cluster's network config, or something I can resolve with tuning parameters to accommodate the larger-than-usual transfers? Modifying client behavior is a possibility, but would require engaging with the vendor. Cheers, and thanks for your time! Christian
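P.S. On the tuning-parameters question, this is the sketch I've been contemplating on the timeout/resiliency side, in case anyone can confirm or veto it. Values are guesses, not tested recommendations:

```
# Raise the adaptive-timeout ceiling (on all servers and clients), which
# should also stretch the ldlm lock callback timeout that's evicting us
# at 101s. 600 is a placeholder.
lctl set_param at_max=600

# LNet health (2.12+): retry transient send failures, like the connection
# resets above, instead of surfacing them to ptlrpc immediately.
lnetctl set retry_count 3
lnetctl set transaction_timeout 100
lnetctl set health_sensitivity 1
```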