This patch tries to fix two problems:

problem 1):
This is a case of recovery + migration: a recovery happens while node I
is in the middle of umount. Node I is the recovery master. Say lockres A
was mastered by the dead node and needs to be recovered. Node I (the reco
master) and node II both hold a reference on lockres A, so lockres A is
being recovered from node II to node I, with the RECOVERING flag set.
Meanwhile the umount on node I is still going on, and it happens to be
migrating lockres A to node II. Since recovery has not finished yet
(RECOVERING is still set), node II responds with -EFAULT to kill node I.
Node I then kills itself (BUG()).

There is a check for recovery (on RECOVERING), but res->spinlock and
dlm->spinlock are dropped after it, so the check does not help much.
Since we have to drop all spinlocks while sending the migrate lockres
(DLM_MIG_LOCKRES_MSG) message, we have to deal with the above case.

problem 2):
In the same context as problem 1), -ENOMEM from the target node can
trigger an incorrect BUG() on the requester of "migrate lockres".

The fix is to retry the migration (for umount) when the target node
returns -EFAULT or -ENOMEM.
Although these are two separate problems, they are fixed the same way,
so I combined them.

Signed-off-by: Wengang Wang <[email protected]>
---
 fs/ocfs2/dlm/dlmrecovery.c |   28 ++++++++++++++++++++++------
 1 files changed, 22 insertions(+), 6 deletions(-)

diff --git a/fs/ocfs2/dlm/dlmrecovery.c b/fs/ocfs2/dlm/dlmrecovery.c
index aaaffbc..b7dd03f 100644
--- a/fs/ocfs2/dlm/dlmrecovery.c
+++ b/fs/ocfs2/dlm/dlmrecovery.c
@@ -1122,6 +1122,8 @@ static int dlm_send_mig_lockres_msg(struct dlm_ctxt *dlm,
 		     orig_flags & DLM_MRES_MIGRATION ? "migration" : "recovery",
 		     send_to);
 
+#define WAIT_FOR_NOMEM_MS 30
+resend:
 	/* send it */
 	ret = o2net_send_message(DLM_MIG_LOCKRES_MSG, dlm->key, mres,
 				 sz, send_to, &status);
@@ -1132,16 +1134,30 @@ static int dlm_send_mig_lockres_msg(struct dlm_ctxt *dlm,
 		     "0x%x) to node %u\n", ret, DLM_MIG_LOCKRES_MSG,
 		     dlm->key, send_to);
 	} else {
-		/* might get an -ENOMEM back here */
-		ret = status;
 		if (ret < 0) {
 			mlog_errno(ret);
 
-			if (ret == -EFAULT) {
-				mlog(ML_ERROR, "node %u told me to kill "
-				     "myself!\n", send_to);
-				BUG();
+			/*
+			 * -ENOMEM or -EFAULT here.
+			 * -EFAULT means lockres is in recovery.
+			 * we should retry in both the two cases.
+			 */
+			ret = status;
+			if (ret == -ENOMEM) {
+				mlog(ML_NOTICE, "node %u no memory\n",
+				     send_to);
+				if (dlm_in_recovery(dlm)) {
+					dlm_wait_for_recovery(dlm);
+				} else {
+					msleep(WAIT_FOR_NOMEM_MS);
+				}
+			} else {
+				BUG_ON(ret != -EFAULT);
+				mlog(ML_NOTICE, "node %u in recovery\n",
+				     send_to);
+				dlm_wait_for_recovery(dlm);
 			}
+			goto resend;
 		}
 	}
 
-- 
1.7.2.2

_______________________________________________
Ocfs2-devel mailing list
[email protected]
http://oss.oracle.com/mailman/listinfo/ocfs2-devel
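
For illustration only, and not part of the patch: the retry policy
described in the commit message (resend when the target answers -ENOMEM
or -EFAULT, treat anything else as a bug) can be sketched in user space
roughly as below. Everything here is hypothetical: struct fake_target,
fake_send(), fake_backoff() and send_with_retry() are made-up stand-ins
for the target node, o2net_send_message(), the wait/sleep step and the
resend loop; none of them are real o2dlm interfaces.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical scripted target node: answers with the queued negative
 * statuses in order, then accepts the message (status 0). */
struct fake_target {
	const int *replies;	/* statuses to return, in order */
	size_t nreplies;	/* number of entries in replies */
	size_t next;		/* index of the next reply to hand out */
	unsigned int waits;	/* how often the sender backed off */
};

/* Hypothetical stand-in for sending the message and reading the
 * status reported by the target. */
static int fake_send(struct fake_target *t)
{
	if (t->next < t->nreplies)
		return t->replies[t->next++];
	return 0;
}

/* Hypothetical stand-in for "wait for recovery" or "sleep a bit";
 * here it only counts how many times it was called. */
static void fake_backoff(struct fake_target *t)
{
	t->waits++;
}

/* Resend until the target accepts; returns the number of attempts.
 * Only -ENOMEM and -EFAULT are considered retryable, anything else
 * trips the assertion (playing the role of BUG_ON in this sketch). */
static unsigned int send_with_retry(struct fake_target *t)
{
	unsigned int attempts = 0;

	for (;;) {
		int status = fake_send(t);

		attempts++;
		if (status >= 0)
			return attempts;

		assert(status == -ENOMEM || status == -EFAULT);
		fake_backoff(t);
	}
}
```

With a target scripted to answer -EFAULT, -ENOMEM, -EFAULT and then
accept, the sketch backs off three times and succeeds on the fourth
attempt. Note that the sketch tests the status reported by the target
on every attempt, since that is where -ENOMEM and -EFAULT come from.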
