On Mon, Feb 12 2018, Patrick Farrell wrote: > Neil, > > I didn't get anything after 8/19 in this series. Is this just me? (I'd keep > waiting, except I also found a few things in this patch.)
Not just you. My fault. They are appearing now.
>
> Minor:
> The line XXX ALLOCATE is out of date and could go. (It refers to a
> mix of things you eliminated and things that were already gone.)
What does the line even mean? Some comment about stack usage?
I think we have a look that looks for large stack frames. I wonder how
to run it...
>
> Less minor:
> You remove use of the imp_lock when reading the connection count. While
> that'll work on x86, it's probably wrong on some architecture to read that
> without taking the lock...?
It was my understanding that on all architectures which Linux support, a
32bit aligned read is atomic wrt any 32bit write. I have trouble imagining how
it could be otherwise.
I probably should have highlighted the removal of the spinlock in the
patch description though - it was intentional.
>
> Bug:
> The existing code uses the imp_conn_cnt from *before* the wait, rather
> than after. I think that's quite important. So you'll want to read
> it out before the wait. I think the main reason we'd hit the timeout
> is a disconnect, which should cause a reconnect, so it's very
> important to use the value from *before* the wait. (See comment on
> ptlrpc_set_import_discon for more of an explanation. Basically it's
> tracking a connection 'epoch', if it's changed, someone else already
> went through the reconnect code for this 'connection epoch' and we
> shouldn't start that process.)
>
That wasn't intentional though - thanks for catching!
Looking at ptlrpc_set_import_discon(), which is where the number
eventually gets used, it is only used to compare with the new value of
imp->imp_conn_cnt.
This would fix both (assuming the locking issue needs fixing).
Thanks,
NeilBrown
diff --git a/drivers/staging/lustre/lustre/ldlm/ldlm_request.c
b/drivers/staging/lustre/lustre/ldlm/ldlm_request.c
index f1233d844bbd..c3c9186b74ce 100644
--- a/drivers/staging/lustre/lustre/ldlm/ldlm_request.c
+++ b/drivers/staging/lustre/lustre/ldlm/ldlm_request.c
@@ -103,7 +103,7 @@ static int ldlm_request_bufsize(int count, int type)
return sizeof(struct ldlm_request) + avail;
}
-static void ldlm_expired_completion_wait(struct ldlm_lock *lock, struct
obd_import *imp2)
+static void ldlm_expired_completion_wait(struct ldlm_lock *lock, __u32
conn_cnt)
{
struct obd_import *imp;
struct obd_device *obd;
@@ -129,7 +129,7 @@ static void ldlm_expired_completion_wait(struct ldlm_lock
*lock, struct obd_impo
obd = lock->l_conn_export->exp_obd;
imp = obd->u.cli.cl_import;
- ptlrpc_fail_import(imp, imp2 ? imp2->imp_conn_cnt : 0);
+ ptlrpc_fail_import(imp, conn_cnt);
LDLM_ERROR(lock,
"lock timed out (enqueued at %lld, %llds ago), entering
recovery for %s@%s",
(s64)lock->l_last_activity,
@@ -241,6 +241,7 @@ int ldlm_completion_ast(struct ldlm_lock *lock, __u64
flags, void *data)
struct obd_device *obd;
struct obd_import *imp = NULL;
__u32 timeout;
+ __u32 conn_cnt = 0;
int rc = 0;
if (flags == LDLM_FL_WAIT_NOREPROC) {
@@ -268,6 +269,11 @@ int ldlm_completion_ast(struct ldlm_lock *lock, __u64
flags, void *data)
lock->l_last_activity = ktime_get_real_seconds();
+ if (imp) {
+ spin_lock(&imp->imp_lock);
+ conn_cnt = imp->imp_conn_cnt;
+ spin_unlock(&imp->imp_lock);
+ }
if (OBD_FAIL_CHECK_RESET(OBD_FAIL_LDLM_INTR_CP_AST,
OBD_FAIL_LDLM_CP_BL_RACE | OBD_FAIL_ONCE)) {
ldlm_set_fail_loc(lock);
@@ -280,7 +286,7 @@ int ldlm_completion_ast(struct ldlm_lock *lock, __u64
flags, void *data)
is_granted_or_cancelled(lock),
timeout * HZ);
if (rc == 0)
- ldlm_expired_completion_wait(lock, imp);
+ ldlm_expired_completion_wait(lock, conn_cnt);
}
/* Now wait abortable */
if (rc == 0)
signature.asc
Description: PGP signature

