Re: [Cluster-devel] GFS2 filesystem corruption when mkfs block size (4k) does not match blockdevice blocksize (512 bytes)
Hi Edwin,

On Wed, 27 Mar 2019 at 01:13, Andreas Gruenbacher wrote:
> Hi Edwin,
>
> On Mon, 25 Mar 2019 at 20:07, Edwin Török wrote:
> > On 20/03/2019 15:54, Edwin Török wrote:
> > > Hello,
> > >
> > > The patch below to xfstests reliably reproduces the filesystem
> > > corruption on a 4.19 kernel.
> >
> > I've also reproduced this on CentOS 7.6:
> > [ 2565.093062] Buffer I/O error on dev dm-5, logical block 123761, lost async page write
> > [ 2565.307581] GFS2: fsid=dm-5.0: fatal: invalid metadata block
> >   GFS2: fsid=dm-5.0:   bh = 123761 (magic number)
> >   GFS2: fsid=dm-5.0:   function = gfs2_meta_indirect_buffer, file = fs/gfs2/meta_io.c, line = 428
> > [ 2565.307641] GFS2: fsid=dm-5.0: about to withdraw this file system
> > [ 2565.307706] GFS2: fsid=dm-5.0: withdrawn
> > [ 2565.307737] CPU: 0 PID: 61453 Comm: ls Kdump: loaded Not tainted 3.10.0-957.10.1.el7.x86_64 #1
> > [ 2565.307772] Hardware name: Xen HVM domU, BIOS 4.7.6-6.3 08/24/2018
> >
> > See below for the script that I used to do it; it is 100% reproducible
> > and takes only a few minutes.
> >
> > The problem here seems to be that the metadata block encountered an
> > I/O error that left an incomplete block on disk (writes are atomic
> > only at the 512-byte level, not 4k), and later on GFS2 tried to read
> > back the now-corrupt metadata block.
> >
> > I've also been experimenting with a patch to make GFS2 withdraw
> > sooner, before it actually tries to read back the corrupted block,
> > but then of course it still fails when mounting the filesystem again,
> > when it detects the corrupt metadata block. I would have thought that
> > a journal replay on mount would fix this, i.e. try to write out the
> > metadata block again and succeed, but I'm not convinced the journal
> > still contains the metadata block entry, because the I/O error here
> > is entirely asynchronous.
>
> When the filesystem is withdrawn and unmounted, and then either
> re-mounted or fsck'ed, it's pretty clear that the journal needs to be
> replayed. If that doesn't happen, and it didn't at least in my local
> testing, it's no surprise that partially written blocks will lead to
> all kinds of problems. This is an area that Bob has spent a lot of
> time looking into recently. He has found and fixed several bugs, but
> things are still not working entirely correctly. I'll try to help Bob
> with these patches so that we can get at least the initial fixes out
> the door as soon as possible.

Bob has been through several iterations of his recovery patch queue in
the meantime, squashing one bug after the other. This process seems to
be slowly coming to an end, so you may want to try out the patches Bob
has posted to cluster-devel today. A version of that queue, with two
more cosmetic cleanups, can be found on the for-next.recovery7f branch
of the gfs2 repository.

> > While looking at this I also discovered that gfs2_write_jdata_pagevec
> > has some code copy/pasted from page-writeback.c that is missing
> > https://github.com/torvalds/linux/commit/3fa750dcf29e8606e3969d13d8e188cc1c0f511d#diff-f9a62819096a044efc553c77d4f6fdbb,
> > but that code path doesn't seem to be used by default.
> >
> > Do you have some thoughts/suggestions on how to proceed here?
> >
> > diff --git i/fs/gfs2/meta_io.c w/fs/gfs2/meta_io.c
> > index 868caa0..39c1030 100644
> > --- i/fs/gfs2/meta_io.c
> > +++ w/fs/gfs2/meta_io.c
> > @@ -33,6 +33,17 @@
> >  #include "util.h"
> >  #include "trace_gfs2.h"
> >
> > +static void gfs2_async_end_io(struct buffer_head *bh, int uptodate)
> > +{
> > +	end_buffer_async_write(bh, uptodate);
> > +	if (!uptodate) {
> > +		struct address_space *mapping = bh->b_page->mapping;
> > +		struct gfs2_sbd *sdp = gfs2_mapping2sbd(mapping);
> > +		gfs2_io_error_bh_wd(sdp, bh);
> > +		gfs2_lm_withdraw(sdp, NULL);
> > +	}
> > +}
> > +
> >  static int gfs2_aspace_writepage(struct page *page, struct writeback_control *wbc)
> >  {
> >  	struct buffer_head *bh, *head;
> > @@ -63,6 +74,7 @@ static int gfs2_aspace_writepage(struct page *page, struct writeback_control *wb
> >  	}
> >  	if (test_clear_buffer_dirty(bh)) {
> >  		mark_buffer_async_write(bh);
> > +		bh->b_end_io = gfs2_async_end_io;
> >  	} else {
> >  		unlock_buffer(bh);
> >  	}
> >
> > The script to reproduce the issue on CentOS:
> > #!/bin/sh
> > set -eu
> > SIZE=1G
> > # XFSTESTS=https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git/
> > XFSTESTS=https://github.com/edwintorok/xfstests.git
> >
> > sudo yum install -y epel-release
> > sudo yum install -y acl attr automake bc dump e2fsprogs fio gawk gcc indent \
> >     libtool make psmisc quota sed xfsdump xfsprogs libacl-devel libattr-devel \
> >     libaio-devel libuuid-devel xfsprogs-devel python sqlite git gfs2-utils
> > test -d xfstests-dev || git clone "${XFSTESTS}"
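[Editor's note: the torn-write failure mode described above (a 4k filesystem block backed by 512-byte device sectors, of which only a prefix reaches stable storage) can be simulated in userspace on a plain file. The following sketch is illustrative only, not GFS2 code; the 3-of-8-sectors split is an arbitrary example.]

```shell
#!/bin/sh
set -eu
img=$(mktemp)
# One full 4096-byte "metadata block" of 'A' bytes: the good on-disk state.
yes A | tr -d '\n' | head -c 4096 > "$img"
# Rewrite the block with 'B' bytes, but only the first 3 of its 8
# 512-byte sectors land before the simulated connection failure.
yes B | tr -d '\n' | head -c 1536 | dd of="$img" bs=512 conv=notrunc status=none
# The block now mixes new and old contents; a checksum or magic-number
# check over the whole 4k block would fail, exactly as in the oops above.
first=$(head -c 1 "$img")
last=$(tail -c 1 "$img")
echo "first=$first last=$last"
rm -f "$img"
```

With matching filesystem and device block sizes, the write either lands whole or not at all, which is why the mismatch matters.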
Re: [Cluster-devel] GFS2 filesystem corruption when mkfs block size (4k) does not match blockdevice blocksize (512 bytes)
Hi Edwin,

On Mon, 25 Mar 2019 at 20:07, Edwin Török wrote:
> On 20/03/2019 15:54, Edwin Török wrote:
> > Hello,
> >
> > The patch below to xfstests reliably reproduces the filesystem
> > corruption on a 4.19 kernel.
>
> I've also reproduced this on CentOS 7.6:
> [ 2565.093062] Buffer I/O error on dev dm-5, logical block 123761, lost async page write
> [ 2565.307581] GFS2: fsid=dm-5.0: fatal: invalid metadata block
>   GFS2: fsid=dm-5.0:   bh = 123761 (magic number)
>   GFS2: fsid=dm-5.0:   function = gfs2_meta_indirect_buffer, file = fs/gfs2/meta_io.c, line = 428
> [ 2565.307641] GFS2: fsid=dm-5.0: about to withdraw this file system
> [ 2565.307706] GFS2: fsid=dm-5.0: withdrawn
> [ 2565.307737] CPU: 0 PID: 61453 Comm: ls Kdump: loaded Not tainted 3.10.0-957.10.1.el7.x86_64 #1
> [ 2565.307772] Hardware name: Xen HVM domU, BIOS 4.7.6-6.3 08/24/2018
>
> See below for the script that I used to do it; it is 100% reproducible
> and takes only a few minutes.
>
> The problem here seems to be that the metadata block encountered an
> I/O error that left an incomplete block on disk (writes are atomic
> only at the 512-byte level, not 4k), and later on GFS2 tried to read
> back the now-corrupt metadata block.
>
> I've also been experimenting with a patch to make GFS2 withdraw
> sooner, before it actually tries to read back the corrupted block, but
> then of course it still fails when mounting the filesystem again, when
> it detects the corrupt metadata block. I would have thought that a
> journal replay on mount would fix this, i.e. try to write out the
> metadata block again and succeed, but I'm not convinced the journal
> still contains the metadata block entry, because the I/O error here is
> entirely asynchronous.

When the filesystem is withdrawn and unmounted, and then either
re-mounted or fsck'ed, it's pretty clear that the journal needs to be
replayed.
If that doesn't happen, and it didn't at least in my local testing, it's
no surprise that partially written blocks will lead to all kinds of
problems. This is an area that Bob has spent a lot of time looking into
recently. He has found and fixed several bugs, but things are still not
working entirely correctly. I'll try to help Bob with these patches so
that we can get at least the initial fixes out the door as soon as
possible.

> While looking at this I also discovered that gfs2_write_jdata_pagevec
> has some code copy/pasted from page-writeback.c that is missing
> https://github.com/torvalds/linux/commit/3fa750dcf29e8606e3969d13d8e188cc1c0f511d#diff-f9a62819096a044efc553c77d4f6fdbb,
> but that code path doesn't seem to be used by default.
>
> Do you have some thoughts/suggestions on how to proceed here?
>
> diff --git i/fs/gfs2/meta_io.c w/fs/gfs2/meta_io.c
> index 868caa0..39c1030 100644
> --- i/fs/gfs2/meta_io.c
> +++ w/fs/gfs2/meta_io.c
> @@ -33,6 +33,17 @@
>  #include "util.h"
>  #include "trace_gfs2.h"
>
> +static void gfs2_async_end_io(struct buffer_head *bh, int uptodate)
> +{
> +	end_buffer_async_write(bh, uptodate);
> +	if (!uptodate) {
> +		struct address_space *mapping = bh->b_page->mapping;
> +		struct gfs2_sbd *sdp = gfs2_mapping2sbd(mapping);
> +		gfs2_io_error_bh_wd(sdp, bh);
> +		gfs2_lm_withdraw(sdp, NULL);
> +	}
> +}
> +
>  static int gfs2_aspace_writepage(struct page *page, struct writeback_control *wbc)
>  {
>  	struct buffer_head *bh, *head;
> @@ -63,6 +74,7 @@ static int gfs2_aspace_writepage(struct page *page, struct writeback_control *wb
>  	}
>  	if (test_clear_buffer_dirty(bh)) {
>  		mark_buffer_async_write(bh);
> +		bh->b_end_io = gfs2_async_end_io;
>  	} else {
>  		unlock_buffer(bh);
>  	}
>
> The script to reproduce the issue on CentOS:
> #!/bin/sh
> set -eu
> SIZE=1G
> # XFSTESTS=https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git/
> XFSTESTS=https://github.com/edwintorok/xfstests.git
>
> sudo yum install -y epel-release
> sudo yum install -y acl attr automake bc dump e2fsprogs fio gawk gcc indent \
>     libtool make psmisc quota sed xfsdump xfsprogs libacl-devel libattr-devel \
>     libaio-devel libuuid-devel xfsprogs-devel python sqlite git gfs2-utils
> test -d xfstests-dev || git clone "${XFSTESTS}" xfstests-dev && make -C xfstests-dev -j8
> cd xfstests-dev
>
> SCRATCH_FILE=scratch-file-$(date +%s)
> truncate --size "${SIZE}" "${SCRATCH_FILE}"
> SCRATCH_DEV=$(losetup --show -f "${SCRATCH_FILE}")
>
> TEST_FILE=test-file-$(date +%s)
> truncate --size "${SIZE}" "${TEST_FILE}"
> TEST_DEV=$(losetup --show -f "${TEST_FILE}")
> TEST_DIR=/mnt/xfstests-test
> SCRATCH_MNT=/mnt/xfstests-scratch
> mkdir -p "${TEST_DIR}" "${SCRATCH_MNT}"
>
> cat <<EOF
> export SCRATCH_DEV=$SCRATCH_DEV
> export TEST_DEV=$TEST_DEV
> export TEST_DIR=$TEST_DIR
> export SCRATCH_MNT=$SCRATCH_MNT
> EOF
>
> umount ${TEST_DIR} || true
> wipefs -a -f ${TEST_DEV}
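[Editor's note: the "bh = 123761" in the withdraw message is a filesystem block number, not a device sector. Back-of-envelope arithmetic, under the assumptions stated in the thread (4096-byte GFS2 blocks on a 512-byte-sector device, and ignoring any partition offset), shows how many independently-atomic device sectors that one block spans:]

```shell
#!/bin/sh
set -eu
FS_BLOCK_SIZE=4096
SECTOR_SIZE=512
BH=123761                                          # block number from the withdraw message
SECTORS_PER_BLOCK=$((FS_BLOCK_SIZE / SECTOR_SIZE)) # 8 sectors per fs block
FIRST_SECTOR=$((BH * SECTORS_PER_BLOCK))           # first 512-byte sector of the block
echo "sectors_per_block=$SECTORS_PER_BLOCK first_sector=$FIRST_SECTOR"
```

Any subset of those 8 sectors can survive a failed asynchronous write, which is why the block can come back with a valid-looking prefix but a bad magic number.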
Re: [Cluster-devel] GFS2 filesystem corruption when mkfs block size (4k) does not match blockdevice blocksize (512 bytes)
On 20/03/2019 15:54, Edwin Török wrote:
> Hello,
>
> The patch below to xfstests reliably reproduces the filesystem
> corruption on a 4.19 kernel.

I've also reproduced this on CentOS 7.6:

[ 2565.093062] Buffer I/O error on dev dm-5, logical block 123761, lost async page write
[ 2565.307581] GFS2: fsid=dm-5.0: fatal: invalid metadata block
  GFS2: fsid=dm-5.0:   bh = 123761 (magic number)
  GFS2: fsid=dm-5.0:   function = gfs2_meta_indirect_buffer, file = fs/gfs2/meta_io.c, line = 428
[ 2565.307641] GFS2: fsid=dm-5.0: about to withdraw this file system
[ 2565.307706] GFS2: fsid=dm-5.0: withdrawn
[ 2565.307737] CPU: 0 PID: 61453 Comm: ls Kdump: loaded Not tainted 3.10.0-957.10.1.el7.x86_64 #1
[ 2565.307772] Hardware name: Xen HVM domU, BIOS 4.7.6-6.3 08/24/2018

See below for the script that I used to do it; it is 100% reproducible
and takes only a few minutes.

The problem here seems to be that the metadata block encountered an I/O
error that left an incomplete block on disk (writes are atomic only at
the 512-byte level, not 4k), and later on GFS2 tried to read back the
now-corrupt metadata block.

I've also been experimenting with a patch to make GFS2 withdraw sooner,
before it actually tries to read back the corrupted block, but then of
course it still fails when mounting the filesystem again, when it
detects the corrupt metadata block. I would have thought that a journal
replay on mount would fix this, i.e. try to write out the metadata block
again and succeed, but I'm not convinced the journal still contains the
metadata block entry, because the I/O error here is entirely
asynchronous.

While looking at this I also discovered that gfs2_write_jdata_pagevec
has some code copy/pasted from page-writeback.c that is missing
https://github.com/torvalds/linux/commit/3fa750dcf29e8606e3969d13d8e188cc1c0f511d#diff-f9a62819096a044efc553c77d4f6fdbb,
but that code path doesn't seem to be used by default.

Do you have some thoughts/suggestions on how to proceed here?
diff --git i/fs/gfs2/meta_io.c w/fs/gfs2/meta_io.c
index 868caa0..39c1030 100644
--- i/fs/gfs2/meta_io.c
+++ w/fs/gfs2/meta_io.c
@@ -33,6 +33,17 @@
 #include "util.h"
 #include "trace_gfs2.h"

+static void gfs2_async_end_io(struct buffer_head *bh, int uptodate)
+{
+	end_buffer_async_write(bh, uptodate);
+	if (!uptodate) {
+		struct address_space *mapping = bh->b_page->mapping;
+		struct gfs2_sbd *sdp = gfs2_mapping2sbd(mapping);
+		gfs2_io_error_bh_wd(sdp, bh);
+		gfs2_lm_withdraw(sdp, NULL);
+	}
+}
+
 static int gfs2_aspace_writepage(struct page *page, struct writeback_control *wbc)
 {
 	struct buffer_head *bh, *head;
@@ -63,6 +74,7 @@ static int gfs2_aspace_writepage(struct page *page, struct writeback_control *wb
 	}
 	if (test_clear_buffer_dirty(bh)) {
 		mark_buffer_async_write(bh);
+		bh->b_end_io = gfs2_async_end_io;
 	} else {
 		unlock_buffer(bh);
 	}

The script to reproduce the issue on CentOS:

#!/bin/sh
set -eu
SIZE=1G
# XFSTESTS=https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git/
XFSTESTS=https://github.com/edwintorok/xfstests.git

sudo yum install -y epel-release
sudo yum install -y acl attr automake bc dump e2fsprogs fio gawk gcc indent \
    libtool make psmisc quota sed xfsdump xfsprogs libacl-devel libattr-devel \
    libaio-devel libuuid-devel xfsprogs-devel python sqlite git gfs2-utils
test -d xfstests-dev || git clone "${XFSTESTS}" xfstests-dev && make -C xfstests-dev -j8
cd xfstests-dev

SCRATCH_FILE=scratch-file-$(date +%s)
truncate --size "${SIZE}" "${SCRATCH_FILE}"
SCRATCH_DEV=$(losetup --show -f "${SCRATCH_FILE}")

TEST_FILE=test-file-$(date +%s)
truncate --size "${SIZE}" "${TEST_FILE}"
TEST_DEV=$(losetup --show -f "${TEST_FILE}")
TEST_DIR=/mnt/xfstests-test
SCRATCH_MNT=/mnt/xfstests-scratch
mkdir -p "${TEST_DIR}" "${SCRATCH_MNT}"

cat <<EOF

> In our lab we've found that if multiple iSCSI connection errors are
> detected (without completely losing the iSCSI connection) then the
> GFS2 filesystem becomes corrupt.
> The connection errors occurred very likely due to network congestion
> (the iSCSI target is running on a 10Gbps network, but the test VM had
> only a 1Gbps network connection to it). Nevertheless, networking
> issues shouldn't cause corruption.
>
> Even after rebooting the host and remounting, the filesystem
> immediately withdrew itself when
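[Editor's note: the unprivileged part of the reproduction setup above (creating the sparse 1G backing files) can be exercised on its own; the losetup/mkfs/mount steps that follow in the real script need root and are omitted here, and mktemp names stand in for the date-based ones.]

```shell
#!/bin/sh
set -eu
SIZE=1G
SCRATCH_FILE=$(mktemp)
TEST_FILE=$(mktemp)
# Sparse files: they occupy ~0 bytes on disk but report the full size,
# which is all losetup needs to expose them as 1G block devices.
truncate --size "${SIZE}" "${SCRATCH_FILE}"
truncate --size "${SIZE}" "${TEST_FILE}"
scratch_bytes=$(stat -c %s "${SCRATCH_FILE}")
echo "scratch_bytes=$scratch_bytes"
rm -f "${SCRATCH_FILE}" "${TEST_FILE}"
```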
[Cluster-devel] GFS2 filesystem corruption when mkfs block size (4k) does not match blockdevice blocksize (512 bytes)
Hello,

The patch below to xfstests reliably reproduces the filesystem
corruption on a 4.19 kernel.

In our lab we've found that if multiple iSCSI connection errors are
detected (without completely losing the iSCSI connection) then the GFS2
filesystem becomes corrupt.
The connection errors occurred very likely due to network congestion
(the iSCSI target is running on a 10Gbps network, but the test VM had
only a 1Gbps network connection to it). Nevertheless, networking issues
shouldn't cause corruption.

Even after rebooting the host and remounting, the filesystem immediately
withdrew itself when stat()ing files. Sometimes fsck.gfs2 is able to fix
it, but not always.

Tim Smith has found that using matching block sizes for GFS2 and the
underlying block device avoids the problem (running the testcase below
in a loop for 30 minutes no longer reproduces the issue). This discovery
was made after careful reading of the fsck.gfs2 manpage, which states:
"The first step to ensuring a healthy file system is the selection of
reliable hardware (i.e. storage systems that will write complete blocks
- even in the event of power failure)."

By default GFS2 uses a block size of 4k, and the devices in our lab
typically have a block size of 512 bytes, so there are no atomicity
guarantees on a 4k write: you may indeed get an incomplete write of 512,
1024, etc. bytes.

We'd like your opinion on two questions:

1. We believe that mkfs.gfs2 in gfs2-utils should be updated to query
the underlying block device for its block size and use that by default
instead of 4k, and to warn on a mismatch if you manually specify a
block size. The mkfs.gfs2 manpage should be updated to state this as
well. Is this the right approach, i.e. would such a patch be acceptable
upstream if we prepare it?

2. Can the code be made more robust to cope with the incomplete writes?
(E.g. can it throw away the bad journal entries, or replay the journal
in such a way that it treats the underlying block device's block size
as the unit of atomicity instead of the GFS2 block size?) We're aware
of your hard work on withdraw corruption
(https://www.redhat.com/archives/cluster-devel/2019-February/msg00077.html);
would these patches help with this issue?

Ideally we'd like to fix both issues: use matching block sizes for
newly created filesystems, and fix the possibility of corruption for
the benefit of those who have already created their GFS2 filesystems
with the default block size.

The initial withdraw looks like this:

Mar 20 15:10:29 localhost kernel: [13654.281665] Buffer I/O error on dev dm-4, logical block 431236, lost async page write
Mar 20 15:10:29 localhost kernel: [13654.281795] gfs2: fsid=dm-4.0: fatal: invalid metadata block
Mar 20 15:10:29 localhost kernel: [13654.281795]   bh = 431236 (magic number)
Mar 20 15:10:29 localhost kernel: [13654.281795]   function = gfs2_meta_indirect_buffer, file = fs/gfs2/meta_io.c, line = 419
Mar 20 15:10:29 localhost kernel: [13654.281834] gfs2: fsid=dm-4.0: about to withdraw this file system
Mar 20 15:10:29 localhost kernel: [13654.281884] gfs2: fsid=dm-4.0: withdrawn
Mar 20 15:10:29 localhost kernel: [13654.281894] CPU: 6 PID: 20880 Comm: xfs_io Tainted: G O 4.19.0+1 #1
Mar 20 15:10:29 localhost kernel: [13654.281896] Hardware name: Dell Inc. PowerEdge R415/0GXH08, BIOS 1.2.5 03/16/2011
Mar 20 15:10:29 localhost kernel: [13654.281897] Call Trace:
Mar 20 15:10:29 localhost kernel: [13654.281916]  dump_stack+0x5a/0x73
Mar 20 15:10:29 localhost kernel: [13654.281941]  gfs2_lm_withdraw+0x125/0x1a0 [gfs2]
Mar 20 15:10:29 localhost kernel: [13654.281956]  ? gfs2_meta_read+0x22d/0x2d0 [gfs2]
Mar 20 15:10:29 localhost kernel: [13654.281969]  gfs2_meta_check_ii+0x2b/0x40 [gfs2]
Mar 20 15:10:29 localhost kernel: [13654.281981]  gfs2_meta_indirect_buffer+0x119/0x130 [gfs2]
Mar 20 15:10:29 localhost kernel: [13654.281993]  gfs2_dirty_inode+0x107/0x230 [gfs2]
Mar 20 15:10:29 localhost kernel: [13654.282006]  ? gfs2_dirty_inode+0x1bf/0x230 [gfs2]
Mar 20 15:10:29 localhost kernel: [13654.282012]  __mark_inode_dirty+0x17c/0x360
Mar 20 15:10:29 localhost kernel: [13654.282024]  gfs2_write_inode+0x11f/0x160 [gfs2]
Mar 20 15:10:29 localhost kernel: [13654.282028]  __writeback_single_inode+0x26d/0x340
Mar 20 15:10:29 localhost kernel: [13654.282031]  writeback_single_inode+0xb7/0x130
Mar 20 15:10:29 localhost kernel: [13654.282033]  sync_inode_metadata+0x3d/0x60
Mar 20 15:10:29 localhost kernel: [13654.282045]  gfs2_fsync+0xad/0x110 [gfs2]
Mar 20 15:10:29 localhost kernel: [13654.282049]  do_fsync+0x38/0x60
Mar 20 15:10:29 localhost kernel: [13654.282053]  __x64_sys_fsync+0x10/0x20
Mar 20 15:10:29 localhost kernel: [13654.282057]  do_syscall_64+0x4e/0x100
Mar 20 15:10:29 localhost kernel: [13654.282062]  entry_SYSCALL_64_after_hwfr

After the remount:

Mar 20 15:10:30 localhost kernel: [13655.529526] gfs2: fsid=dm-4.0: first mount done, others may mount
Mar 20 15:10:30 localhost kernel: [13655.581457] gfs2:
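[Editor's note: question 1 above proposes that mkfs.gfs2 compare its block size against the device's. A minimal sketch of that policy check follows; it is not the gfs2-utils implementation. A real version would obtain the device's logical sector size via the BLKSSZGET ioctl (or `blockdev --getss`); here the sizes are passed in directly so the logic is testable without a block device.]

```shell
#!/bin/sh
set -eu
check_bs() {
    # $1 = requested filesystem block size, $2 = device logical sector size
    if [ "$1" -ne "$2" ]; then
        echo "warning: fs block size $1 does not match device sector size $2"
    else
        echo "ok"
    fi
}
# Today's default: 4k fs blocks on a 512-byte-sector device -> mismatch.
result=$(check_bs 4096 512)
echo "$result"
```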