Hello, FYI: over the last few weeks we had stalling Lustre mounts in conjunction with automount. This is a short summary in case you are using automount + Lustre.
When Lustre gets automounted OK, you will see messages as in 1) below.

A user can stall the Lustre mount by using an incomplete filename. Example file: /lustre_automount/myfile.dat. When Lustre is *NOT* mounted, a user can stall the client mount for at least 100s with 'ls /lustre_automount/myfile' (no asterisk after myfile!). Error messages as in 2) will pop up, with the 'lnet_try_match_md()' sequence. After that you will see messages of type 3), which may indicate a network problem (hm, well, they did to us ...). After 100s the user gets back 'ls: cannot access /lustre_automount/myfile.dat: No such file or directory'. After that it looks as if Lustre is mounted, but a simple 'ls /lustre_automount/' in a second shell will not return anything and produces the same message sequence as above.

Attention: when several 'ill-formed' ls commands are sent at once, the Lustre mount freezes completely and forever on that client. In our case this happened because the command sequence was driven by scripts running in parallel. You have to 'umount -f /lustre_automount/' or even 'lustre_rmmod' to recover. If the umount works correctly, it looks like 4).

Because a lot of messages appear between 1), 2) and 3), we were misled and searched for the error in the wrong places, especially the MDS/MGS hardware; additionally, due to 2), we replaced nearly all network components we could get our hands on.

Unfortunately, doing the same ill-formed ls command over an NFS automount will not result in a stalled system but will return the 'cannot access' message at once.

Examples of what does work correctly when Lustre is not mounted:

a) ls /lustre_automount/myfi*
b) find /lustre_automount -iname 'myfi*' (optionally: -maxdepth 1)
c) lfs find /lustre_automount --name 'myfile*' --maxdepth 1 (returns the file)
d) lfs find /lustre_automount --name 'myfile' --maxdepth 1 (does not return anything, but will not freeze the system)
.....
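Since our freeze was triggered by parallel scripts, one possible mitigation for scripted access (just a sketch, not something we have battle-tested; the 'lustre_mounted' helper below is made up for illustration, not part of any Lustre tooling) is to check /proc/mounts before touching exact filenames on the automount point:

```shell
#!/bin/sh
# Sketch of a guard for scripts: only stat exact filenames when the
# Lustre client mount is really present. 'lustre_mounted' is a
# hypothetical helper, not part of Lustre or autofs.
lustre_mounted() {
    # In /proc/mounts, field 2 is the mount point, field 3 the fs type.
    awk -v mp="$1" '$2 == mp && $3 == "lustre" { found = 1 } END { exit !found }' /proc/mounts
}

if lustre_mounted /lustre_automount; then
    ls /lustre_automount/myfile.dat
else
    echo "Lustre not mounted; not touching /lustre_automount" >&2
fi
```

This does not help interactive mistyping, of course, but it keeps batch jobs from hammering an unmounted automount point with exact-name lookups.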
Another 'ill-formed' command is 'gunzip -c /lustre_automount/myfile > /tmp/test' instead of 'gunzip -c /lustre_automount/myfile.gz > /tmp/test'.

The solution seems to be not to use autofs + Lustre if the above cannot be ruled out for sure, including mistyping. Or to tar and feather the user .... that's what we did .... ;-)

Hairless by now,
Heiko

################################################################
Gentoo x86_64 GNU/Linux
lustre: 1.6.6
vanilla-kernel 2.6.22.19
autofs 5.0.3-r6
mount 2.14.2
################################################################

Client syslog. Automount timing 60s + 120s WAIT, just for testing. The same holds true for timeouts of 600s.

1) Mounting OK:

Nov 19 17:29:58 quadcore2 automount[21803]: attempting to mount entry /lustre_automount
Nov 19 17:29:58 quadcore2 Lustre: fs_lustre-OST0006-osc-ffff8101c918b800.osc: set parameter active=0
Nov 19 17:29:58 quadcore2 Lustre: Skipped 16 previous similar messages
Nov 19 17:29:58 quadcore2 LustreError: 24764:0:(lov_obd.c:316:lov_connect_obd()) not connecting OSC fs_lustre-OST0006_UUID; administratively disabled
Nov 19 17:29:58 quadcore2 LustreError: 24764:0:(lov_obd.c:316:lov_connect_obd()) Skipped 13 previous similar messages
Nov 19 17:29:58 quadcore2 Lustre: Client fs_lustre-client has started
Nov 19 17:29:58 quadcore2 automount[21803]: mount(generic): mounted m...@tcp0:m...@tcp0:/fs_lustre type lustre on /lustre_automount
Nov 19 17:29:58 quadcore2 automount[21803]: mounted /lustre_automount

2) Mounting failed:

Nov 19 17:43:09 quadcore2 automount[21803]: attempting to mount entry /lustre_automount
Nov 19 17:43:09 quadcore2 Lustre: Client fs_lustre-client has started
Nov 19 17:43:09 quadcore2 automount[21803]: mount(generic): mounted m...@tcp0:m...@tcp0:/fs_lustre type lustre on /lustre_automount
Nov 19 17:43:09 quadcore2 automount[21803]: mounted /lustre_automount
Nov 19 17:43:10 quadcore2 LustreError: 25321:0:(lib-move.c:111:lnet_try_match_md()) Matching packet from 12345-192.168.16....@tcp, match 776 length 1336 too big: 1272 left, 1272 allowed
Nov 19 17:43:16 quadcore2 automount[21803]: 1 remaining in /home

3) The possible network problem message:

Nov 19 17:44:50 quadcore2 Lustre: Request x776 sent from fs_lustre-MDT0000-mdc-ffff8101aac5f400 to NID 192.168.16....@tcp 100s ago has timed out (limit 100s).
Nov 19 17:44:50 quadcore2 Lustre: fs_lustre-MDT0000-mdc-ffff8101aac5f400: Connection to service fs_lustre-MDT0000 via nid 192.168.16....@tcp was lost; in progress operations using this service will wait for recovery to complete.
Nov 19 17:44:50 quadcore2 LustreError: 25692:0:(mdc_locks.c:598:mdc_enqueue()) ldlm_cli_enqueue: -4
Nov 19 17:44:50 quadcore2 Lustre: fs_lustre-MDT0000-mdc-ffff8101aac5f400: Connection restored to service fs_lustre-MDT0000 using nid 192.168.16....@tcp.

4) Umount OK:

Nov 19 17:45:37 quadcore2 automount[21803]: expiring path /lustre_automount
Nov 19 17:45:37 quadcore2 automount[21803]: unmounting dir = /lustre_automount
Nov 19 17:45:37 quadcore2 LustreError: 25717:0:(ldlm_request.c:996:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway
Nov 19 17:45:37 quadcore2 LustreError: 25717:0:(ldlm_request.c:996:ldlm_cli_cancel_req()) Skipped 2 previous similar messages
Nov 19 17:45:37 quadcore2 LustreError: 25717:0:(ldlm_request.c:1605:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -108
Nov 19 17:45:37 quadcore2 LustreError: 25717:0:(ldlm_request.c:1605:ldlm_cli_cancel_list()) Skipped 2 previous similar messages
Nov 19 17:45:37 quadcore2 LustreError: 25298:0:(connection.c:155:ptlrpc_put_connection()) NULL connection
Nov 19 17:45:37 quadcore2 LustreError: 25298:0:(connection.c:155:ptlrpc_put_connection()) Skipped 13 previous similar messages
Nov 19 17:45:37 quadcore2 Lustre: client ffff8101aac5f400 umount complete
Nov 19 17:45:37 quadcore2 automount[21803]: expired /lustre_automount

_______________________________________________
Lustre-discuss mailing list
[email protected]
http://lists.lustre.org/mailman/listinfo/lustre-discuss
