Hello, FYI: over the last few weeks we had stalling Lustre mounts in conjunction with automount. This is a short summary in case you are using automount + Lustre.
When Lustre gets automounted OK, you will see messages as in 1) below.

A user can stall the Lustre mount by using an incomplete filename. Example file: /lustre_automount/myfile.dat. When Lustre is *NOT* mounted, a user can stall the client mount for at least 100s with 'ls /lustre_automount/myfile' (no asterisk after myfile!). Error messages as in 2) will pop up, with the 'lnet_try_match_md()' sequence. After that you will see messages of type 3), which may indicate a network problem (hm, well, they did to us ...). After 100s the user gets back 'ls: cannot access /lustre_automount/myfile.dat: No such file or directory'. After that it looks as if Lustre is mounted, but a simple 'ls /lustre_automount/' in a second shell will not return anything and produces the same message sequence as above.

Attention: when several 'ill-formed' ls commands are sent at once, the Lustre mount freezes completely and forever on that client. In our case this happened because the command sequence was driven by scripts running in parallel. You have to 'umount -f /lustre_automount/' or even 'lustre_rmmod' to recover. If the umount works correctly, it looks like 4).

Because a lot of messages appear between 1), 2) and 3), we were misled and searched for the error in the wrong places, especially the MDS/MGS hardware; additionally, due to 2), we replaced nearly all network components we could get our hands on.

Unfortunately, doing the same ill-formed ls command over an NFS automount will not result in a stalled system but will return the 'cannot access' message at once.

Examples of what does work correctly when Lustre is not mounted:

a) ls /lustre_automount/myfi*
b) find /lustre_automount -iname 'myfi*' (optionally: -maxdepth 1)
c) lfs find /lustre_automount --name 'myfile*' --maxdepth 1 (returns the file)
d) lfs find /lustre_automount --name 'myfile' --maxdepth 1 (does not return anything, but will not freeze the system)
.....
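Since our freeze was triggered by parallel scripts, one possible mitigation for scripted access (just a sketch, not something we have battle-tested; the 'lustre_mounted' helper below is made up for illustration, not part of any Lustre tooling) is to check /proc/mounts before touching exact filenames on the automount point:

```shell
#!/bin/sh
# Sketch of a guard for scripts: only stat exact filenames when the
# Lustre client mount is really present. 'lustre_mounted' is a
# hypothetical helper, not part of Lustre or autofs.
lustre_mounted() {
    # In /proc/mounts, field 2 is the mount point, field 3 the fs type.
    awk -v mp="$1" '$2 == mp && $3 == "lustre" { found = 1 } END { exit !found }' /proc/mounts
}

if lustre_mounted /lustre_automount; then
    ls /lustre_automount/myfile.dat
else
    echo "Lustre not mounted; not touching /lustre_automount" >&2
fi
```

This does not help interactive mistyping, of course, but it keeps batch jobs from hammering an unmounted automount point with exact-name lookups.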
Another 'ill-formed' command is 'gunzip -c /lustre_automount/myfile > /tmp/test' instead of 'gunzip -c /lustre_automount/myfile.gz > /tmp/test'.

The solution seems to be not to use autofs + Lustre if the above cannot be ruled out for sure, including mistyping. Or to tar and feather the user .... that's what we did .... ;-)

Hairless by now,
Heiko

################################################################
Gentoo x86_64 GNU/Linux
lustre: 1.6.6
vanilla-kernel 2.6.22.19
autofs 5.0.3-r6
mount 2.14.2
################################################################

Client syslog. Automount timing 60s + 120s WAIT, just for testing. The same holds true for timeouts of 600s.

1) Mounting OK:

Nov 19 17:29:58 quadcore2 automount[21803]: attempting to mount entry /lustre_automount
Nov 19 17:29:58 quadcore2 Lustre: fs_lustre-OST0006-osc-ffff8101c918b800.osc: set parameter active=0
Nov 19 17:29:58 quadcore2 Lustre: Skipped 16 previous similar messages
Nov 19 17:29:58 quadcore2 LustreError: 24764:0:(lov_obd.c:316:lov_connect_obd()) not connecting OSC fs_lustre-OST0006_UUID; administratively disabled
Nov 19 17:29:58 quadcore2 LustreError: 24764:0:(lov_obd.c:316:lov_connect_obd()) Skipped 13 previous similar messages
Nov 19 17:29:58 quadcore2 Lustre: Client fs_lustre-client has started
Nov 19 17:29:58 quadcore2 automount[21803]: mount(generic): mounted m...@tcp0:m...@tcp0:/fs_lustre type lustre on /lustre_automount
Nov 19 17:29:58 quadcore2 automount[21803]: mounted /lustre_automount

2) Mounting failed:

Nov 19 17:43:09 quadcore2 automount[21803]: attempting to mount entry /lustre_automount
Nov 19 17:43:09 quadcore2 Lustre: Client fs_lustre-client has started
Nov 19 17:43:09 quadcore2 automount[21803]: mount(generic): mounted m...@tcp0:m...@tcp0:/fs_lustre type lustre on /lustre_automount
Nov 19 17:43:09 quadcore2 automount[21803]: mounted /lustre_automount
Nov 19 17:43:10 quadcore2 LustreError: 25321:0:(lib-move.c:111:lnet_try_match_md()) Matching packet from 12345-192.168.16....@tcp, match 776 length 1336 too big: 1272 left, 1272 allowed
Nov 19 17:43:16 quadcore2 automount[21803]: 1 remaining in /home

3) The possible network problem message:

Nov 19 17:44:50 quadcore2 Lustre: Request x776 sent from fs_lustre-MDT0000-mdc-ffff8101aac5f400 to NID 192.168.16....@tcp 100s ago has timed out (limit 100s).
Nov 19 17:44:50 quadcore2 Lustre: fs_lustre-MDT0000-mdc-ffff8101aac5f400: Connection to service fs_lustre-MDT0000 via nid 192.168.16....@tcp was lost; in progress operations using this service will wait for recovery to complete.
Nov 19 17:44:50 quadcore2 LustreError: 25692:0:(mdc_locks.c:598:mdc_enqueue()) ldlm_cli_enqueue: -4
Nov 19 17:44:50 quadcore2 Lustre: fs_lustre-MDT0000-mdc-ffff8101aac5f400: Connection restored to service fs_lustre-MDT0000 using nid 192.168.16....@tcp.

4) Umount OK:

Nov 19 17:45:37 quadcore2 automount[21803]: expiring path /lustre_automount
Nov 19 17:45:37 quadcore2 automount[21803]: unmounting dir = /lustre_automount
Nov 19 17:45:37 quadcore2 LustreError: 25717:0:(ldlm_request.c:996:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway
Nov 19 17:45:37 quadcore2 LustreError: 25717:0:(ldlm_request.c:996:ldlm_cli_cancel_req()) Skipped 2 previous similar messages
Nov 19 17:45:37 quadcore2 LustreError: 25717:0:(ldlm_request.c:1605:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -108
Nov 19 17:45:37 quadcore2 LustreError: 25717:0:(ldlm_request.c:1605:ldlm_cli_cancel_list()) Skipped 2 previous similar messages
Nov 19 17:45:37 quadcore2 LustreError: 25298:0:(connection.c:155:ptlrpc_put_connection()) NULL connection
Nov 19 17:45:37 quadcore2 LustreError: 25298:0:(connection.c:155:ptlrpc_put_connection()) Skipped 13 previous similar messages
Nov 19 17:45:37 quadcore2 Lustre: client ffff8101aac5f400 umount complete
Nov 19 17:45:37 quadcore2 automount[21803]: expired /lustre_automount

_______________________________________________
Lustre-discuss mailing list
[email protected]
http://lists.lustre.org/mailman/listinfo/lustre-discuss
