Please don't reply to lustre-devel. Instead, comment in Bugzilla by using the
following link:
https://bugzilla.lustre.org/show_bug.cgi?id=11277
That's what happens here:
1. The client does some operation which takes a long time on the server (the
server thread is stuck trying to take a lock).
2. The client times out and resends the request.
3. The resent request arrives at the server; we look through the list of locks,
find the lock originally requested in (1), take that lock and its cookie, decide
that this is the lock we already sent to the client, and start the resend
preparations (this passes the lustre_handle_is_used() check in
mds_getattr_lock()); see the sketch after this list.
4. At this point the thread serving the request from (1) finally gets the lock
it was waiting for and decides to return it to the client: it rewrites the
remote handle and returns ELDLM_LOCK_REPLACED from mds_intent_policy(). This
makes ldlm_lock_enqueue() drop the existing lock from (1) (and its cookie).
5. The thread serving the resent request tries to look up the lock from the
cookie (now invalid), fails, and trips the assertion.
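
To make the race window concrete, here is a rough sketch of the lookup done in
step 3. This is not the actual mds/handler.c code: the helper name, the list
being walked and how it is passed in are my approximations; only the match on
the remote handle cookie reflects what really happens.

    /* Rough sketch of the resend handle fixup from step 3 (not the real
     * fixup_handle_for_resent_req() code). */
    static struct ldlm_lock *
    sketch_find_lock_for_resend(struct list_head *held_locks,
                                const struct lustre_handle *remote)
    {
            struct ldlm_lock *lock;

            /* Match only on the remote handle carried by the resent
             * request.  Nothing here checks whether the lock has actually
             * been granted, so the still-pending lock from step (1) is
             * returned, and step (4) can replace and free it before step
             * (5) dereferences its now-stale cookie. */
            list_for_each_entry(lock, held_locks, l_export_chain)
                    if (lock->l_remote_handle.cookie == remote->cookie)
                            return lock;

            return NULL;
    }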
The easy fix I can think of is to add an extra check in
fixup_handle_for_resent_req() that the lock we are about to pick is actually
granted (req_mode == granted_mode). The drawback is that the entire lock grant
is obviously redone for the second request, but that should be enough as a
quick fix.
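
In terms of the sketch above, the extra check would look roughly like this
(sketch only; the real change belongs in fixup_handle_for_resent_req(), and
l_req_mode/l_granted_mode are the struct ldlm_lock fields I have in mind):

    list_for_each_entry(lock, held_locks, l_export_chain) {
            if (lock->l_remote_handle.cookie != remote->cookie)
                    continue;
            /* Extra check: only reuse a lock that has actually been
             * granted.  A still-pending lock can be replaced and freed
             * under us, as in step (4) above. */
            if (lock->l_req_mode != lock->l_granted_mode)
                    continue;
            return lock;
    }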
Shadow: please start with a recovery-small.sh test that replicates the issue
(you will need to place OBD_FAIL/OBD_RACE fail points strategically in
mds/handler.c), then try the solution I described above.
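
As a starting point for the fail point, something along these lines could be
placed in the step (1) code path; OBD_FAIL_MDS_RESEND_RACE is a hypothetical
fail_loc id that would have to be added to obd_support.h, and finding the exact
spot is the strategic part:

    /* In the thread serving the original request (1), just before the
     * lock is granted and the remote handle is rewritten: stall here so
     * the resent request can pass the lustre_handle_is_used() check and
     * pick up the soon-to-be-stale cookie first. */
    OBD_RACE(OBD_FAIL_MDS_RESEND_RACE);     /* hypothetical fail_loc id */

The recovery-small.sh side would then set this fail_loc on the MDS and issue an
operation that hangs long enough for the client to time out and resend.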