Guys, we are seeing a problem on one of our larger clusters, which is
still running 1.4.2 from HP. We can reliably reproduce an EIO problem
with 32 clients; we attempted it with 8 clients, but it did not happen.

Our config:
1k IA64 nodes, quadrics4 interconnect.
1 MDS, 32 OSSes, 64 OSTs in active-active mode.

We see this on one or more clients (dmesg):

LustreError: Connection to service pair12_Xdg1 via nid 0:953 was lost;
in progress operations using this service will wait for recovery to
complete.
Lustre: 11646:0:(import.c:136:ptlrpc_set_import_discon())
OSC_m568_pair12_Xdg1_MNT_client: connection lost to
[EMAIL PROTECTED]
LustreError: 11646:0:(ldlm_request.c:69:ldlm_expired_completion_wait())
### lock timed out, entering recovery for [EMAIL PROTECTED]
ns: OSC_m568_pair12_Xdg1_MNT_client lock: e000000144d2fe80/0xbf806ddd959ef0c3
lrc: 4/1,0 mode: --/PR res: 16436509/0 rrc: 3 type: EXT [45088768->47185919]
(req 45088768->47185919) flags: 0 remote: 0xbb88faef1d525d0e expref: -99
pid: 11646
Lustre: 1218:0:(import.c:308:import_select_connection())
OSC_m568_pair12_Xdg1_MNT_client: Using connection NID_953_UUID
LustreError: This client was evicted by pair12_Xdg1; in progress
operations using this service will be reattempted.
LustreError: 11662:0:(ldlm_resource.c:358:ldlm_namespace_cleanup())
Namespace OSC_m568_pair12_Xdg1_MNT_client resource refcount 2 after lock
cleanup
Lustre: Connection restored to service pair12_Xdg1 using nid 0:953.
LustreError: 11646:0:(lov_request.c:166:lov_update_enqueue_set()) error:
enqueue objid 0x5cb456c subobj 0xfacd1d on OST idx 12: rc = -5
Lustre: 11662:0:(import.c:687:ptlrpc_import_recovery_state_machine())
OSC_m568_pair12_Xdg1_MNT_client: connection restored to
[EMAIL PROTECTED]


And one message on the OST:

LustreError: 1557:0:(ldlm_lockd.c:198:waiting_locks_callback()) ### lock
callback timer expired: evicting client
[EMAIL PROTECTED] nid 0:568  ns:
filter-pair12_Xdg1_UUID lock: e000000122b96380/0xbb88faef1d4ed1d1 lrc:
1/0,0 mode: PW/PW res: 16436509/0 rrc: 60 type: EXT [44040192->45391871]
(req 44040192->44048383) flags: 20 remote: 0xbf806ddd9542580e expref: 13
pid: 1550


The user application, as I understand it:

It has two phases. In the first, it generates a large sparse file, with
each node filling in the portion of the file it is responsible for. In
the second, once all nodes have finished, it 'compresses' the file by
moving the chunks, which are spread out all over, into a new non-sparse
file (in parallel). It is during this second phase that we are seeing
the EIO errors. The application has been modified to retry the write IO
when the buffered write fails, and the retry normally succeeds on the
second attempt.
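
Roughly what that retry looks like at the application level (this is
just my own illustrative sketch in C, not the actual application code;
the function name and the retry-once policy are made up):

    #include <errno.h>
    #include <sys/types.h>
    #include <unistd.h>

    /*
     * Hypothetical sketch of the workaround: if a write comes back
     * with EIO (as it does after the client has been evicted), retry
     * the same write once.  In practice the second attempt succeeds,
     * presumably because the client has reconnected and re-acquired
     * its locks by then.
     */
    static ssize_t write_with_retry(int fd, const void *buf,
                                    size_t count, off_t offset)
    {
            ssize_t rc = pwrite(fd, buf, count, offset);

            if (rc < 0 && errno == EIO)
                    rc = pwrite(fd, buf, count, offset); /* retry once */

            return rc;
    }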

Any thoughts on why this would happen?

It seems that because the client gets evicted, its lock becomes invalid
and an EIO is returned to the user, but the client then simply
re-acquires a lock for the succeeding IO. I also don't have an
explanation for why the client gets evicted in the first place. There is
a high IO load during this time, but it only seems to be using about 2/3
of the available peak IO.

Evan