Please don't reply to lustre-devel. Instead, comment in Bugzilla by using the following link: https://bugzilla.lustre.org/show_bug.cgi?id=11308
The patch tested out just fine. One question I have: when a client gets a NAK back from the server, it doesn't seem to return EIO up the stack. Instead, the reply seems to get dropped on the floor, and we let the ptlrpc-level timeout hit before the RPC is resent. With longer timeouts (300s), this could make for a long time between us seeing the NAK and actually resending the RPC. What is the expected behavior here?

Example syslog traffic when "lctl --net ptl del_peer" was run on an OST (nid00028) while a dd was running on nid00007:

Dec 19 17:14:32 nid00028 kernel: Lustre: 3737:0:(ptllnd_rx_buf.c:569:kptllnd_rx_parse()) NAK [EMAIL PROTECTED]: no connection; peer must reconnect
Dec 19 17:14:32 nid00007 kernel: Lustre: 4015:0:(ptllnd_rx_buf.c:539:kptllnd_rx_parse()) NAK from [EMAIL PROTECTED] (ptlid:9-28)
Dec 19 17:14:32 nid00007 kernel: Lustre: 4016:0:(router.c:184:lnet_notify()) Upcall: NID [EMAIL PROTECTED] is dead
Dec 19 17:14:32 nid00007 kernel: Lustre: 4:0:(linux-debug.c:96:libcfs_run_upcall()) Invoked portals upcall /usr/lib/lustre/lnet_upcall ROUTER_NOTIFY,[EMAIL PROTECTED],down,1166570069
Dec 19 17:14:43 nid00028 kernel: Lustre: 4890:0:(ldlm_lib.c:489:target_handle_reconnect()) ost_svc: 93f03b41-ebd5-4daa-8f75-eb981390f46e reconnecting
Dec 19 17:14:43 nid00003 kernel: LustreError: 15069:0:(client.c:955:ptlrpc_expire_one_request()) @@@ timeout (sent at 1166570064, 15s ago) [out 1166570064.782602, in 0.000000] [EMAIL PROTECTED] x2292/t0 o400->[EMAIL PROTECTED]:28 lens 64/64 ref 1 fl Rpc:N/0/0 rc 0/0
Dec 19 17:14:43 nid00028 kernel: Lustre: 4891:0:(filter.c:2985:filter_set_info_async()) ost_svc: received MDS connection from [EMAIL PROTECTED]
Dec 19 17:14:43 nid00003 kernel: LustreError: Connection to service ost_svc via nid [EMAIL PROTECTED] was lost; in progress operations using this service will wait for recovery to complete.
Dec 19 17:14:43 nid00028 kernel: Lustre: 4891:0:(filter.c:2985:filter_set_info_async()) previously skipped 3 similar messages
Dec 19 17:14:43 nid00003 kernel: Lustre: OSC_eelc0-0c0s0n3_ost_svc_mds_svc: Connection restored to service ost_svc using nid [EMAIL PROTECTED]
Dec 19 17:14:43 nid00003 kernel: Lustre: 15101:0:(mds_lov.c:530:__mds_lov_syncronize()) MDS mds_svc: ost_svc_UUID now active, resetting orphans
Dec 19 17:14:43 nid00003 kernel: Lustre: 15101:0:(mds_lov.c:530:__mds_lov_syncronize()) previously skipped 2 similar messages
Dec 19 17:14:43 nid00028 kernel: Lustre: 4892:0:(recov_thread.c:580:llog_repl_connect()) llcd 0000010072c03000:00000100711da9c0 not empty
Dec 19 17:14:43 nid00028 kernel: Lustre: 4893:0:(filter.c:2364:filter_destroy_precreated()) ost_svc: deleting orphan objects from 886857 to 886981
Dec 19 17:14:43 nid00028 kernel: Lustre: 4893:0:(filter.c:2364:filter_destroy_precreated()) previously skipped 3 similar messages
Dec 19 17:14:43 nid00028 kernel: Lustre: 4989:0:(llog_cat.c:352:llog_cat_process_cb()) processing log 0x11050006:37fdedf6 at index 57 of catalog 0x11050002
Dec 19 17:14:43 nid00028 kernel: Lustre: 4989:0:(llog_cat.c:352:llog_cat_process_cb()) previously skipped 1 similar messages
Dec 19 17:14:43 nid00028 kernel: Lustre: 4989:0:(filter_log.c:227:filter_recov_log_mds_ost_cb()) fetch generation log, send cookie
Dec 19 17:14:43 nid00028 kernel: Lustre: 4989:0:(llog.c:294:llog_process()) recovery from log: 0x11050004:c5056065 stopped
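
To put a rough number on the timeout concern raised in the question, here is a back-of-the-envelope sketch in Python. It is not Lustre code: the function name, its parameters, and the sample values are all hypothetical, and it assumes a simplified model in which a NAKed RPC is resent only when its timeout, counted from the send time, expires.

```python
def resend_delay_after_nak(sent_at, nak_at, timeout, resend_on_nak=False):
    """Seconds between seeing a NAK and resending the RPC.

    Simplified, hypothetical model (not the ptlrpc implementation):
    - if resend_on_nak is False, the NAK is ignored and the RPC is
      resent only when the timeout (measured from sent_at) expires;
    - if resend_on_nak is True, the RPC is resent as soon as the NAK
      is seen.
    """
    if timeout < 0:
        raise ValueError("timeout must be non-negative")
    if nak_at < sent_at:
        raise ValueError("NAK cannot arrive before the RPC was sent")
    if resend_on_nak:
        return 0
    expiry = sent_at + timeout
    # If the NAK shows up after the timeout already fired, no extra wait.
    return max(0, expiry - nak_at)


if __name__ == "__main__":
    # Hypothetical values: RPC sent at t=0, NAK seen one second later.
    for timeout in (15, 300):
        waited = resend_delay_after_nak(sent_at=0, nak_at=1, timeout=timeout)
        print(f"timeout={timeout}s: idle for {waited}s after the NAK")
```

Under this model, a NAK seen one second after the send leaves the RPC idle for 14 seconds with a 15s timeout and for 299 seconds with a 300s timeout, whereas acting on the NAK directly would cut that wait to zero.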
