Hi,

>> After looking more closely, this is a subtle form of conversion deadlock,
>> and this exact case is described in the comment here:

Thanks. We will withdraw this patch for now, since we need to look into this further.

-- owa

-----Original Message-----
From: David Teigland [mailto:teigl...@redhat.com] 
Sent: Thursday, August 10, 2017 3:48 AM
To: owa tsutomu(大輪 勤 TMC ○SSDジ□ES技○ES五)
Cc: cluster-devel@redhat.com; miyauchi tadashi(宮内 忠志 TOPS (SW開)[基本])
Subject: Re: [Cluster-devel] [PATCH 13/17] dlm: fix _can_be_granted() for lock 
at the head of covert queue.

On Wed, Aug 09, 2017 at 11:41:44AM -0500, David Teigland wrote:
> On Wed, Aug 09, 2017 at 05:51:37AM +0000, tsutomu....@toshiba.co.jp wrote:
> > If there is a lock resource conflict across multiple nodes, a lock on the
> > convert queue may never be granted.
> > 
> > Example:
> > grant queue:
> >     node0 grmode NL / rqmode IV
> >     node1 grmode NL / rqmode IV
> > 
> > convert queue:
> >     node2 grmode NL / rqmode EX
> >     node3 grmode PR / rqmode EX
> > 
> > wait queue:
> >     node4 grmode IV / rqmode PR
> >     node5 grmode IV / rqmode PR
> > 
> > When node0's lock conversion (PR -> NL) completes, node2's lock should
> > become grantable. However, _can_be_granted() returns 0 because the
> > grmode of node3's lock on the convert queue is PR.
> > 
> > When checking the lock at the head of the convert queue, skip the
> > queue_conflict() check against the convert queue.
> 
> This example doesn't look right.  node2's NL->EX cannot be granted because
> it conflicts with the PR lock held by node3.  (The grmode is still valid
> when a lock is on the convert queue.)

After looking more closely, this is a subtle form of conversion deadlock,
and this exact case is described in the comment here:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/dlm/lock.c#n2218

This should be handled by the dlm canceling one of the converting locks
(returning it to the grant queue with IV rqmode) and returning -EDEADLK to
the application.  There is a FIXME in the code highlighting a case you
could be hitting:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/dlm/lock.c#n2504

If you are running into that FIXME, you should see these log messages:

        if (deadlk) {
                log_print("WARN: pending deadlock %x node %d %s",
                          lkb->lkb_id, lkb->lkb_nodeid, r->res_name);
                dlm_dump_rsb(r);
                continue;
        }



