We are seeing a problem on one of our larger clusters, which is still running 1.4.2 from HP. We can reliably reproduce an EIO failure with 32 clients; we tried with 8 clients, but it did not happen.
Our config: 1k IA64 nodes, Quadrics Elan4 interconnect, 1 MDS, 32 OSSes, 64 OSTs in active-active mode.

We see this on one or more clients (dmesg):

  LustreError: Connection to service pair12_Xdg1 via nid 0:953 was lost; in progress operations using this service will wait for recovery to complete.
  Lustre: 11646:0:(import.c:136:ptlrpc_set_import_discon()) OSC_m568_pair12_Xdg1_MNT_client: connection lost to [EMAIL PROTECTED]
  LustreError: 11646:0:(ldlm_request.c:69:ldlm_expired_completion_wait()) ### lock timed out, entering recovery for [EMAIL PROTECTED] ns: OSC_m568_pair12_Xdg1_MNT_client lock: e000000144d2fe80/0xbf806ddd959ef0c3 lrc: 4/1,0 mode: --/PR res: 16436509/0 rrc: 3 type: EXT [45088768->47185919] (req 45088768->47185919) flags: 0 remote: 0xbb88faef1d525d0e expref: -99 pid: 11646
  Lustre: 1218:0:(import.c:308:import_select_connection()) OSC_m568_pair12_Xdg1_MNT_client: Using connection NID_953_UUID
  LustreError: This client was evicted by pair12_Xdg1; in progress operations using this service will be reattempted.
  LustreError: 11662:0:(ldlm_resource.c:358:ldlm_namespace_cleanup()) Namespace OSC_m568_pair12_Xdg1_MNT_client resource refcount 2 after lock cleanup
  Lustre: Connection restored to service pair12_Xdg1 using nid 0:953.
  LustreError: 11646:0:(lov_request.c:166:lov_update_enqueue_set()) error: enqueue objid 0x5cb456c subobj 0xfacd1d on OST idx 12: rc = -5
  Lustre: 11662:0:(import.c:687:ptlrpc_import_recovery_state_machine()) OSC_m568_pair12_Xdg1_MNT_client: connection restored to [EMAIL PROTECTED]

And one message on the OST:

  LustreError: 1557:0:(ldlm_lockd.c:198:waiting_locks_callback()) ### lock callback timer expired: evicting client [EMAIL PROTECTED] nid 0:568 ns: filter-pair12_Xdg1_UUID lock: e000000122b96380/0xbb88faef1d4ed1d1 lrc: 1/0,0 mode: PW/PW res: 16436509/0 rrc: 60 type: EXT [44040192->45391871] (req 44040192->44048383) flags: 20 remote: 0xbf806ddd9542580e expref: 13 pid: 1550

The user application, as I understand it, has two phases. First it generates a large sparse file; then, after all the nodes have filled in the portion of the file they are responsible for, they 'compress' the file by moving the chunks, which are spread out everywhere, into a new non-sparse file (in parallel). It is during this second phase that we see the EIO errors. The application has been modified to retry the write when a buffered write fails, and the retry normally succeeds the second time (a rough sketch of the retry logic is at the end of this mail).

Any thoughts on why this would happen? It seems that because the client gets evicted, its lock becomes invalid and EIO is returned to the user, but the client then simply re-acquires a lock for the subsequent IO. I also don't have an explanation for why the client gets evicted. There is a high IO load during this time, but it only seems to be using about 2/3 of the available peak IO.

Evan
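
P.S. For reference, the application's retry looks roughly like the sketch below. This is not the actual code; the names (retry_pwrite, MAX_RETRIES) and the pwrite-based IO path are just my illustration of the behavior described above.

    #define _XOPEN_SOURCE 500
    #include <errno.h>
    #include <unistd.h>
    #include <sys/types.h>

    #define MAX_RETRIES 3   /* illustrative; the real limit is application-specific */

    /* Write 'count' bytes at 'offset', retrying when the write fails with EIO
     * (e.g. after an eviction has invalidated the client's locks).  Retrying a
     * pwrite at the same offset is idempotent, so the whole buffer is simply
     * written again. */
    static ssize_t retry_pwrite(int fd, const void *buf, size_t count, off_t offset)
    {
        ssize_t rc = -1;

        for (int attempt = 0; attempt < MAX_RETRIES; attempt++) {
            rc = pwrite(fd, buf, count, offset);
            if (rc == (ssize_t)count)
                return rc;              /* full write succeeded */
            if (rc < 0 && errno != EIO)
                return rc;              /* some other error; give up immediately */
            /* EIO or short write: by now the client has usually reconnected
             * and re-acquired its locks, so just try again. */
        }
        return rc;
    }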
