Re: [Cluster-devel] GFS2 filesystem corruption when mkfs block size (4k) does not match blockdevice blocksize (512 bytes)
Hi Edwin,

On Mon, 25 Mar 2019 at 20:07, Edwin Török wrote:
> On 20/03/2019 15:54, Edwin Török wrote:
> > Hello,
> >
> > The patch below to xfstests reliably reproduces the filesystem corruption
> > on a 4.19 kernel.
>
> I've also reproduced this on CentOS 7.6:
>
> [ 2565.093062] Buffer I/O error on dev dm-5, logical block 123761, lost async page write
> [ 2565.307581] GFS2: fsid=dm-5.0: fatal: invalid metadata block
>                GFS2: fsid=dm-5.0:   bh = 123761 (magic number)
>                GFS2: fsid=dm-5.0:   function = gfs2_meta_indirect_buffer, file = fs/gfs2/meta_io.c, line = 428
> [ 2565.307641] GFS2: fsid=dm-5.0: about to withdraw this file system
> [ 2565.307706] GFS2: fsid=dm-5.0: withdrawn
> [ 2565.307737] CPU: 0 PID: 61453 Comm: ls Kdump: loaded Not tainted 3.10.0-957.10.1.el7.x86_64 #1
> [ 2565.307772] Hardware name: Xen HVM domU, BIOS 4.7.6-6.3 08/24/2018
>
> See below for the script that I used to do it; it is 100% reproducible and
> takes only a few minutes.
>
> The problem here seems to be that the metadata block encountered some I/O
> error and an incomplete block was written (writes are atomic only at the
> 512-byte level, not 4k), and later on GFS2 tried to read back the now
> corrupt metadata block.
>
> I've also been experimenting with a patch to make GFS2 withdraw sooner,
> before it actually tries to read back the corrupted block, but then of
> course it still fails when mounting the filesystem again, when it detects
> the corrupt metadata block.
>
> I would've thought that a journal replay on mount would fix this, i.e. try
> to write out the metadata block again and succeed, but I'm not convinced
> the journal still contains the metadata block entry, because the I/O error
> here is all asynchronous.

When the filesystem is withdrawn and unmounted, and then either re-mounted
or fsck'ed, it's pretty clear that the journal needs to be replayed.
If that doesn't happen, and it didn't at least in my local testing, it's no
surprise that partially written blocks will lead to all kinds of problems.
This is an area that Bob has spent a lot of time looking into recently. He
has found and fixed several bugs, but things are still not working entirely
correctly. I'll try to help Bob with these patches so that we can get at
least the initial fixes out the door as soon as possible.

> While looking at this I also discovered that gfs2_write_jdata_pagevec has
> some code copy/pasted from page-writeback.c that is missing
> https://github.com/torvalds/linux/commit/3fa750dcf29e8606e3969d13d8e188cc1c0f511d#diff-f9a62819096a044efc553c77d4f6fdbb,
> but that code doesn't seem to be used by default.
>
> Do you have some thoughts/suggestions on how to proceed here?
>
> diff --git i/fs/gfs2/meta_io.c w/fs/gfs2/meta_io.c
> index 868caa0..39c1030 100644
> --- i/fs/gfs2/meta_io.c
> +++ w/fs/gfs2/meta_io.c
> @@ -33,6 +33,17 @@
>  #include "util.h"
>  #include "trace_gfs2.h"
>
> +static void gfs2_async_end_io(struct buffer_head *bh, int uptodate)
> +{
> +	end_buffer_async_write(bh, uptodate);
> +	if (!uptodate) {
> +		struct address_space *mapping = bh->b_page->mapping;
> +		struct gfs2_sbd *sdp = gfs2_mapping2sbd(mapping);
> +
> +		gfs2_io_error_bh_wd(sdp, bh);
> +		gfs2_lm_withdraw(sdp, NULL);
> +	}
> +}
> +
>  static int gfs2_aspace_writepage(struct page *page, struct writeback_control *wbc)
>  {
>  	struct buffer_head *bh, *head;
> @@ -63,6 +74,7 @@ static int gfs2_aspace_writepage(struct page *page, struct writeback_control *wb
>  	}
>  	if (test_clear_buffer_dirty(bh)) {
>  		mark_buffer_async_write(bh);
> +		bh->b_end_io = gfs2_async_end_io;
>  	} else {
>  		unlock_buffer(bh);
>  	}
>
> The script to reproduce the issue on CentOS:
>
> #!/bin/sh
> set -eu
> SIZE=1G
> # XFSTESTS=https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git/
> XFSTESTS=https://github.com/edwintorok/xfstests.git
>
> sudo yum install -y epel-release
> sudo yum install -y acl attr automake bc dump e2fsprogs fio gawk gcc indent \
>     libtool make psmisc quota sed xfsdump xfsprogs libacl-devel libattr-devel \
>     libaio-devel libuuid-devel xfsprogs-devel python sqlite git gfs2-utils
> test -d xfstests-dev || git clone "${XFSTESTS}" xfstests-dev && make -C xfstests-dev -j8
> cd xfstests-dev
>
> SCRATCH_FILE=scratch-file-$(date +%s)
> truncate --size "${SIZE}" "${SCRATCH_FILE}"
> SCRATCH_DEV=$(losetup --show -f "${SCRATCH_FILE}")
>
> TEST_FILE=test-file-$(date +%s)
> truncate --size "${SIZE}" "${TEST_FILE}"
> TEST_DEV=$(losetup --show -f "${TEST_FILE}")
> TEST_DIR=/mnt/xfstests-test
> SCRATCH_MNT=/mnt/xfstests-scratch
> mkdir -p "${TEST_DIR}" "${SCRATCH_MNT}"
>
> cat > local.config <<EOF
> export SCRATCH_DEV=$SCRATCH_DEV
> export TEST_DEV=$TEST_DEV
> export TEST_DIR=$TEST_DIR
> export SCRATCH_MNT=$SCRATCH_MNT
> EOF
>
> umount ${TEST_DIR} || true
> wipefs -a -f ${TEST_DEV}
> mk
Re: [Cluster-devel] [PATCH 1/2] gfs2: Fix occasional glock use-after-free
Hi Ross,

----- Original Message -----
> 6. gfs2_log_flush() continues and calls revoke_lo_after_commit() and
>    uses the freed glock (stack trace above).
>
> Should evict_inode call gfs2_ail_flush() or something so that the revoke
> is written before it drops its reference to the glock?
>
> Or is there something else that is meant to keep the glock around if the
> inode is evicted while there is a linked gfs2_bufdata sitting on some
> AIL list?
>
> Let me know if any more specific information is needed since I now have
> a test setup that can usually reproduce this within 10 minutes.

Very interesting. It's not unusual for glocks to outlive their inodes, but
I'm not sure what the right answer is in this case. Either the revoke
should take an additional reference to the glock and not let go until the
revoke is written by some log flush, or else the evict needs to do the log
flush to make sure the revoke is committed. But we've had issues with
evict in the past, so we need to be careful about how we fix it.

Andreas and I will look into the best way to fix it. Thanks again for
your help.

Regards,

Bob Peterson
Red Hat File Systems
Re: [Cluster-devel] [PATCH 1/2] gfs2: Fix occasional glock use-after-free
On 1/31/19 5:18 PM, Andreas Gruenbacher wrote:
> Hi Ross,
>
> On Thu, 31 Jan 2019 at 11:56, Ross Lagerwall wrote:
>> Each gfs2_bufdata stores a reference to a glock but the reference count
>> isn't incremented. This causes an occasional use-after-free of the
>> glock. Fix by taking a reference on the glock during allocation and
>> dropping it when freeing.
>>
>> Found by KASAN:
>>
>> BUG: KASAN: use-after-free in revoke_lo_after_commit+0x8e/0xe0 [gfs2]
>> Write of size 4 at addr 88801aff6134 by task kworker/0:2H/20371
>>
>> CPU: 0 PID: 20371 Comm: kworker/0:2H Tainted: G O 4.19.0+0 #1
>> Hardware name: Dell Inc. PowerEdge R805/0D456H, BIOS 4.2.1 04/14/2010
>> Workqueue: glock_workqueue glock_work_func [gfs2]
>> Call Trace:
>>  dump_stack+0x71/0xab
>>  print_address_description+0x6a/0x270
>>  kasan_report+0x258/0x380
>>  ? revoke_lo_after_commit+0x8e/0xe0 [gfs2]
>>  revoke_lo_after_commit+0x8e/0xe0 [gfs2]
>>  gfs2_log_flush+0x511/0xa70 [gfs2]
>>  ? gfs2_log_shutdown+0x1f0/0x1f0 [gfs2]
>>  ? __brelse+0x48/0x50
>>  ? gfs2_log_commit+0x4de/0x6e0 [gfs2]
>>  ? gfs2_trans_end+0x18d/0x340 [gfs2]
>>  gfs2_ail_empty_gl+0x1ab/0x1c0 [gfs2]
>>  ? inode_go_dump+0xe0/0xe0 [gfs2]
>>  ? inode_go_sync+0xe4/0x220 [gfs2]
>>  inode_go_sync+0xe4/0x220 [gfs2]
>>  do_xmote+0x12b/0x290 [gfs2]
>>  glock_work_func+0x6f/0x160 [gfs2]
>>  process_one_work+0x461/0x790
>>  worker_thread+0x69/0x6b0
>>  ? process_one_work+0x790/0x790
>>  kthread+0x1ae/0x1d0
>>  ? kthread_create_worker_on_cpu+0xc0/0xc0
>>  ret_from_fork+0x22/0x40
>
> Thanks for tracking this down, very interesting.
>
> The consistency model here is that every buffer head that a struct
> gfs2_bufdata object is attached to is protected by a glock. Before a
> glock can be released, all the buffers under that glock have to be
> flushed out and released; this is what allows another node to access
> the same on-disk location without causing inconsistencies. When there
> is a bufdata object that points to a glock that has already been freed,
> this consistency model is broken.
> Taking an additional refcount as this patch does may make the
> use-after-free go away, but it doesn't fix the underlying problem. So I
> think we'll need a different fix here.
>
> Did you observe this problem in a real-world scenario, or with KASAN
> only? It might be that we're looking at a small race that is unlikely
> to trigger in the field. In any case, I think we need to understand
> better what's actually going on.

I finally got time to look into this further. The following is what
happens, as far as I can tell, though my understanding is a bit limited
at this point:

1. A file is opened, truncated and closed, which results in a metadata
   write, so a gfs2_bufdata object is created and associated with the
   inode glock.
2. The gfs2_bufdata is written out and placed on an AIL list.
3. The VFS layer calls gfs2_evict_inode() and the inode is evicted,
   which drops a reference on the glock. It takes case 3 (i_nlink > 0),
   so no log flush or ail_flush happens.
4. The gfs2_bufdata object is however still on one of the AIL lists, and
   the next gfs2_log_flush() begins to write a revoke entry.
5. At about the same time, the glock is freed. At the point of freeing,
   gl_revokes is 1 (there should probably be a GLOCK_BUG_ON for this).
6. gfs2_log_flush() continues, calls revoke_lo_after_commit() and uses
   the freed glock (stack trace above).

Should evict_inode call gfs2_ail_flush() or something so that the revoke
is written before it drops its reference to the glock?

Or is there something else that is meant to keep the glock around if the
inode is evicted while there is a linked gfs2_bufdata sitting on some
AIL list?

Let me know if any more specific information is needed since I now have a
test setup that can usually reproduce this within 10 minutes.

Thanks,
--
Ross Lagerwall