On 6/8/19 1:16 PM, Andreas Gruenbacher wrote:
Hi Ross,

On Fri, 7 Jun 2019 at 18:21, Ross Lagerwall <ross.lagerw...@citrix.com> wrote:
On 5/7/19 9:32 PM, Andreas Gruenbacher wrote:
Since commit 64bc06bb32ee ("gfs2: iomap buffered write support"), gfs2 is doing
buffered writes by starting a transaction in iomap_begin, writing a range of
pages, and ending that transaction in iomap_end.  This approach suffers from
two problems:

    (1) Any allocations necessary for the write are done in iomap_begin, so when
    the data aren't journaled, there is no need for keeping the transaction open
    until iomap_end.

    (2) Transactions keep the gfs2 log flush lock held.  When
    iomap_file_buffered_write calls balance_dirty_pages, this can end up calling
    gfs2_write_inode, which will try to flush the log.  This requires taking the
    log flush lock which is already held, resulting in a deadlock.

Fix both of these issues by not keeping transactions open from iomap_begin to
iomap_end.  Instead, start a small transaction in page_prepare and end it in
page_done when necessary.

Unfortunately, this patch broke growing gfs2 filesystems. It is easy to
reproduce:

$ mkfs.gfs2 -t xxx:yyy /dev/xvdb  4369065
$ mount /dev/xvdb /mnt
$ gfs2_grow /mnt (doesn't finish)
FS: Mount point:             /mnt
FS: Device:                  /dev/xvdb
FS: Size:                    4369062 (0x42aaa6)
DEV: Length:                 13107200 (0xc80000)
The file system will grow by 34133MB.

Looking at the kernel log, I see it hits the following assertion and
then hangs trying to withdraw the filesystem (which is a separate
problem, presumably):

gfs2: fsid=xxx:yyy.0: fatal: assertion "(nbuf <= tr->tr_blocks) &&
(tr->tr_num_revoke <= tr->tr_revokes)" failed
     function = gfs2_trans_end, file = fs/gfs2/trans.c, line = 117
gfs2: fsid=xxx:yyy.0: about to withdraw this file system

Rearranging the code so that it prints information about the transaction
before the failed withdrawal attempt shows:
gfs2: fsid=xxx:yyy.0: Transaction created at:
iomap_write_begin.constprop.45+0xbc/0x380
gfs2: fsid=xxx:yyy.0: blocks=1 revokes=0 reserved=8 touched=1
gfs2: fsid=xxx:yyy.0: Buf 1/0 Databuf 1/0 Revoke 0/0

Reverting this commit fixes the issue. Tested with git master as of
today (16d72dd4891fe).

thanks for the error report. This turns out to be a rounding error in
gfs2_iomap_page_prepare; the attached patch should help.

Yes, that fixes the problem. Thanks for the quick response.

--
Ross Lagerwall

Reply via email to