Re: [Cluster-devel] GFS2 filesystem corruption when mkfs block size (4k) does not match blockdevice blocksize (512 bytes)

2019-05-23 Thread Andreas Gruenbacher
Hi Edwin,

On Wed, 27 Mar 2019 at 01:13, Andreas Gruenbacher  wrote:
> Hi Edwin,
>
> On Mon, 25 Mar 2019 at 20:07, Edwin Török  wrote:
> > On 20/03/2019 15:54, Edwin Török wrote:
> > > Hello,
> > >
> > > The patch below to xfstests reliably reproduces the filesystem corruption 
> > > on a 4.19 kernel.
> >
> > I've also reproduced this on CentOS 7.6:
> > [ 2565.093062] Buffer I/O error on dev dm-5, logical block 123761, lost 
> > async page write
> > [ 2565.307581] GFS2: fsid=dm-5.0: fatal: invalid metadata block
> > GFS2: fsid=dm-5.0:   bh = 123761 (magic number)
> > GFS2: fsid=dm-5.0:   function = gfs2_meta_indirect_buffer, file = 
> > fs/gfs2/meta_io.c, line = 428
> > [ 2565.307641] GFS2: fsid=dm-5.0: about to withdraw this file system
> > [ 2565.307706] GFS2: fsid=dm-5.0: withdrawn
> > [ 2565.307737] CPU: 0 PID: 61453 Comm: ls Kdump: loaded Not tainted 
> > 3.10.0-957.10.1.el7.x86_64 #1
> > [ 2565.307772] Hardware name: Xen HVM domU, BIOS 4.7.6-6.3 08/24/2018
> >
> > See below for the script that I used to do it, 100% reproducible and takes 
> > only a few minutes to reproduce.
> >
> > The problem here seems to be that the metadata block encountered some I/O 
> > error, wrote an incomplete block
> > (writes are atomic only at 512 byte level, not 4k), and later on GFS2 tried 
> > to read back the now corrupt metadata block.
> >
> > I've also been experimenting with a patch to make GFS2 withdraw sooner, 
> > before it actually tries to read back the corrupted block,
> > but then of course it still fails when mounting the filesystem again, when 
> > it detects the corrupt metadata block.
> > I would've thought that a journal replay on mount would fix this, i.e. try 
> > to write out the metadata block again and succeed,
> > but I'm not convinced the journal still contains the metadata block entry 
> > because the I/O error here is all asynchronous.
>
> When the filesystem is withdrawn and unmounted, and then either
> re-mounted or fsck'ed, it's pretty clear that the journal needs to be
> replayed. If that doesn't happen, and it didn't at least in my local
> testing, it's no surprise that partially written blocks will lead to
> all kinds of problems. This is an area that Bob has spent a lot of
> time looking into recently. He has found and fixed several bugs, but
> things are still not working entirely correctly. I'll try to help Bob
> with these patches so that we can get at least the initial fixes out
> the door as soon as possible.

Bob has been through several iterations of his recovery patch queue in
the meantime, squashing one bug after the other. This process seems to
be coming to an end slowly, so you may want to try out the patches Bob
has posted to cluster-devel today. A version of that, with two more
cosmetic cleanups, can be found on the for-next.recovery7f branch of
the gfs2 repository.

> > While looking at this I also discovered that gfs2_write_jdata_pagevec has 
> > some copy/pasted code from page-writeback.c that is missing
> > https://github.com/torvalds/linux/commit/3fa750dcf29e8606e3969d13d8e188cc1c0f511d#diff-f9a62819096a044efc553c77d4f6fdbb,
> >  but that doesn't seem to be used by default.
> >
> >
> > Do you have some thoughts/suggestions on how to proceed here?
> >
> > diff --git i/fs/gfs2/meta_io.c w/fs/gfs2/meta_io.c
> > index 868caa0..39c1030 100644
> > --- i/fs/gfs2/meta_io.c
> > +++ w/fs/gfs2/meta_io.c
> > @@ -33,6 +33,17 @@
> >  #include "util.h"
> >  #include "trace_gfs2.h"
> >
> > +static void gfs2_async_end_io(struct buffer_head *bh, int uptodate)
> > +{
> > +   end_buffer_async_write(bh, uptodate);
> > +   if (!uptodate) {
> > +   struct address_space *mapping = bh->b_page->mapping;
> > +   struct gfs2_sbd *sdp = gfs2_mapping2sbd(mapping);
> > +   gfs2_io_error_bh_wd(sdp, bh);
> > +   gfs2_lm_withdraw(sdp, NULL);
> > +   }
> > +}
> > +
> >  static int gfs2_aspace_writepage(struct page *page, struct 
> > writeback_control *wbc)
> >  {
> > struct buffer_head *bh, *head;
> > @@ -63,6 +74,7 @@ static int gfs2_aspace_writepage(struct page *page, 
> > struct writeback_control *wb
> > }
> > if (test_clear_buffer_dirty(bh)) {
> > mark_buffer_async_write(bh);
> > +   bh->b_end_io = gfs2_async_end_io;
> > } else {
> > unlock_buffer(bh);
> > }
> >
> > The script to reproduce the issue on CentOS:
> > #/bin/sh
> > set -eu
> > SIZE=1G
> > # XFSTESTS=https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git/
> > XFSTESTS=https://github.com/edwintorok/xfstests.git
> >
> > sudo yum install -y epel-release
> > sudo yum install -y acl attr automake bc dump e2fsprogs fio gawk gcc indent 
> > libtool make psmisc quota sed xfsdump xfsprogs libacl-devel libattr-devel 
> > libaio-devel libuuid-devel xfsprogs-devel python sqlite git gfs2-utils
> > test -d xfstests-dev || git clone "${XFSTESTS}" 

Re: [Cluster-devel] GFS2 filesystem corruption when mkfs block size (4k) does not match blockdevice blocksize (512 bytes)

2019-03-26 Thread Andreas Gruenbacher
Hi Edwin,

On Mon, 25 Mar 2019 at 20:07, Edwin Török  wrote:
> On 20/03/2019 15:54, Edwin Török wrote:
> > Hello,
> >
> > The patch below to xfstests reliably reproduces the filesystem corruption 
> > on a 4.19 kernel.
>
> I've also reproduced this on CentOS 7.6:
> [ 2565.093062] Buffer I/O error on dev dm-5, logical block 123761, lost async 
> page write
> [ 2565.307581] GFS2: fsid=dm-5.0: fatal: invalid metadata block
> GFS2: fsid=dm-5.0:   bh = 123761 (magic number)
> GFS2: fsid=dm-5.0:   function = gfs2_meta_indirect_buffer, file = 
> fs/gfs2/meta_io.c, line = 428
> [ 2565.307641] GFS2: fsid=dm-5.0: about to withdraw this file system
> [ 2565.307706] GFS2: fsid=dm-5.0: withdrawn
> [ 2565.307737] CPU: 0 PID: 61453 Comm: ls Kdump: loaded Not tainted 
> 3.10.0-957.10.1.el7.x86_64 #1
> [ 2565.307772] Hardware name: Xen HVM domU, BIOS 4.7.6-6.3 08/24/2018
>
> See below for the script that I used to do it, 100% reproducible and takes 
> only a few minutes to reproduce.
>
> The problem here seems to be that the metadata block encountered some I/O 
> error, wrote an incomplete block
> (writes are atomic only at 512 byte level, not 4k), and later on GFS2 tried 
> to read back the now corrupt metadata block.
>
> I've also been experimenting with a patch to make GFS2 withdraw sooner, 
> before it actually tries to read back the corrupted block,
> but then of course it still fails when mounting the filesystem again, when it 
> detects the corrupt metadata block.
> I would've thought that a journal replay on mount would fix this, i.e. try to 
> write out the metadata block again and succeed,
> but I'm not convinced the journal still contains the metadata block entry 
> because the I/O error here is all asynchronous.

When the filesystem is withdrawn and unmounted, and then either
re-mounted or fsck'ed, it's pretty clear that the journal needs to be
replayed. If that doesn't happen, and it didn't at least in my local
testing, it's no surprise that partially written blocks will lead to
all kinds of problems. This is an area that Bob has spent a lot of
time looking into recently. He has found and fixed several bugs, but
things are still not working entirely correctly. I'll try to help Bob
with these patches so that we can get at least the initial fixes out
the door as soon as possible.
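
For reference, a minimal sketch of the recovery sequence being discussed here, assuming a single-node test setup (the device path and mount point are examples only):

# after a withdraw: unmount, let fsck.gfs2 replay the journal, then remount
umount /mnt/xfstests-scratch || true
fsck.gfs2 -y /dev/dm-5
mount -t gfs2 /dev/dm-5 /mnt/xfstests-scratch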

> While looking at this I also discovered that gfs2_write_jdata_pagevec has 
> some copy/pasted code from page-writeback.c that is missing
> https://github.com/torvalds/linux/commit/3fa750dcf29e8606e3969d13d8e188cc1c0f511d#diff-f9a62819096a044efc553c77d4f6fdbb,
>  but that doesn't seem to be used by default.
>
>
> Do you have some thoughts/suggestions on how to proceed here?
>
> diff --git i/fs/gfs2/meta_io.c w/fs/gfs2/meta_io.c
> index 868caa0..39c1030 100644
> --- i/fs/gfs2/meta_io.c
> +++ w/fs/gfs2/meta_io.c
> @@ -33,6 +33,17 @@
>  #include "util.h"
>  #include "trace_gfs2.h"
>
> +static void gfs2_async_end_io(struct buffer_head *bh, int uptodate)
> +{
> +   end_buffer_async_write(bh, uptodate);
> +   if (!uptodate) {
> +   struct address_space *mapping = bh->b_page->mapping;
> +   struct gfs2_sbd *sdp = gfs2_mapping2sbd(mapping);
> +   gfs2_io_error_bh_wd(sdp, bh);
> +   gfs2_lm_withdraw(sdp, NULL);
> +   }
> +}
> +
>  static int gfs2_aspace_writepage(struct page *page, struct writeback_control 
> *wbc)
>  {
> struct buffer_head *bh, *head;
> @@ -63,6 +74,7 @@ static int gfs2_aspace_writepage(struct page *page, struct 
> writeback_control *wb
> }
> if (test_clear_buffer_dirty(bh)) {
> mark_buffer_async_write(bh);
> +   bh->b_end_io = gfs2_async_end_io;
> } else {
> unlock_buffer(bh);
> }
>
> The script to reproduce the issue on CentOS:
> #/bin/sh
> set -eu
> SIZE=1G
> # XFSTESTS=https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git/
> XFSTESTS=https://github.com/edwintorok/xfstests.git
>
> sudo yum install -y epel-release
> sudo yum install -y acl attr automake bc dump e2fsprogs fio gawk gcc indent 
> libtool make psmisc quota sed xfsdump xfsprogs libacl-devel libattr-devel 
> libaio-devel libuuid-devel xfsprogs-devel python sqlite git gfs2-utils
> test -d xfstests-dev || git clone "${XFSTESTS}" xfstests-dev && make -C 
> xfstests-dev -j8
> cd xfstests-dev
>
> SCRATCH_FILE=scratch-file-$(date +%s)
> truncate --size "${SIZE}" "${SCRATCH_FILE}"
> SCRATCH_DEV=$(losetup --show -f "${SCRATCH_FILE}")
>
> TEST_FILE=test-file-$(date +%s)
> truncate --size "${SIZE}" "${TEST_FILE}"
> TEST_DEV=$(losetup --show -f "${TEST_FILE}")
> TEST_DIR=/mnt/xfstests-test
> SCRATCH_MNT=/mnt/xfstests-scratch
> mkdir -p "${TEST_DIR}" "${SCRATCH_MNT}"
>
> cat > local.config <<EOF
> export SCRATCH_DEV=$SCRATCH_DEV
> export TEST_DEV=$TEST_DEV
> export TEST_DIR=$TEST_DIR
> export SCRATCH_MNT=$SCRATCH_MNT
> EOF
>
> umount ${TEST_DIR} || true
> wipefs -a -f ${TEST_DEV}
> 

Re: [Cluster-devel] GFS2 filesystem corruption when mkfs block size (4k) does not match blockdevice blocksize (512 bytes)

2019-03-25 Thread Edwin Török
On 20/03/2019 15:54, Edwin Török wrote:
> Hello,
> 
> The patch below to xfstests reliably reproduces the filesystem corruption on 
> a 4.19 kernel.

I've also reproduced this on CentOS 7.6:
[ 2565.093062] Buffer I/O error on dev dm-5, logical block 123761, lost async page write
[ 2565.307581] GFS2: fsid=dm-5.0: fatal: invalid metadata block
GFS2: fsid=dm-5.0:   bh = 123761 (magic number)
GFS2: fsid=dm-5.0:   function = gfs2_meta_indirect_buffer, file = fs/gfs2/meta_io.c, line = 428
[ 2565.307641] GFS2: fsid=dm-5.0: about to withdraw this file system
[ 2565.307706] GFS2: fsid=dm-5.0: withdrawn
[ 2565.307737] CPU: 0 PID: 61453 Comm: ls Kdump: loaded Not tainted 3.10.0-957.10.1.el7.x86_64 #1
[ 2565.307772] Hardware name: Xen HVM domU, BIOS 4.7.6-6.3 08/24/2018

See below for the script that I used to do it; it is 100% reproducible and takes only a few minutes.

The problem here seems to be that a metadata block write hit an I/O error and left an incomplete block on disk (writes are atomic only at the 512-byte level, not at 4k), and later GFS2 tried to read back the now-corrupt metadata block.
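
To illustrate the failure mode outside of GFS2, here is a userspace-only sketch on a scratch file (file name and sizes are made up for the illustration):

# a 4k "metadata block" is eight 512-byte sectors; with only 512-byte write
# atomicity, a failed 4k write can leave a mix of new and old sectors behind,
# which is what the magic-number check then trips over on the next read
dd if=/dev/zero bs=512 count=8 2>/dev/null | tr '\0' 'A' > block.img    # old block contents
dd if=/dev/zero bs=512 count=3 2>/dev/null | tr '\0' 'B' \
    | dd of=block.img bs=512 conv=notrunc 2>/dev/null                   # only 3 of 8 sectors land
od -c block.img                                                         # 1536 bytes of 'B', the rest still 'A'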

I've also been experimenting with a patch to make GFS2 withdraw sooner, before 
it actually tries to read back the corrupted block,
but then of course it still fails when mounting the filesystem again, when it 
detects the corrupt metadata block.
I would've thought that a journal replay on mount would fix this, i.e. try to 
write out the metadata block again and succeed,
but I'm not convinced the journal still contains the metadata block entry 
because the I/O error here is all asynchronous.

While looking at this I also discovered that gfs2_write_jdata_pagevec has some 
copy/pasted code from page-writeback.c that is missing
https://github.com/torvalds/linux/commit/3fa750dcf29e8606e3969d13d8e188cc1c0f511d#diff-f9a62819096a044efc553c77d4f6fdbb,
 but that doesn't seem to be used by default.


Do you have some thoughts/suggestions on how to proceed here?

diff --git i/fs/gfs2/meta_io.c w/fs/gfs2/meta_io.c
index 868caa0..39c1030 100644
--- i/fs/gfs2/meta_io.c
+++ w/fs/gfs2/meta_io.c
@@ -33,6 +33,17 @@
 #include "util.h"
 #include "trace_gfs2.h"
 
+static void gfs2_async_end_io(struct buffer_head *bh, int uptodate)
+{
+   end_buffer_async_write(bh, uptodate);
+   if (!uptodate) {
+   struct address_space *mapping = bh->b_page->mapping;
+   struct gfs2_sbd *sdp = gfs2_mapping2sbd(mapping);
+   gfs2_io_error_bh_wd(sdp, bh);
+   gfs2_lm_withdraw(sdp, NULL);
+   }
+}
+
 static int gfs2_aspace_writepage(struct page *page, struct writeback_control *wbc)
 {
struct buffer_head *bh, *head;
@@ -63,6 +74,7 @@ static int gfs2_aspace_writepage(struct page *page, struct writeback_control *wb
}
if (test_clear_buffer_dirty(bh)) {
mark_buffer_async_write(bh);
+   bh->b_end_io = gfs2_async_end_io;
} else {
unlock_buffer(bh);
}

The script to reproduce the issue on CentOS:
#!/bin/sh
set -eu
SIZE=1G
# XFSTESTS=https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git/
XFSTESTS=https://github.com/edwintorok/xfstests.git

sudo yum install -y epel-release
sudo yum install -y acl attr automake bc dump e2fsprogs fio gawk gcc indent libtool make psmisc quota sed xfsdump xfsprogs libacl-devel libattr-devel libaio-devel libuuid-devel xfsprogs-devel python sqlite git gfs2-utils
test -d xfstests-dev || git clone "${XFSTESTS}" xfstests-dev && make -C xfstests-dev -j8
cd xfstests-dev

SCRATCH_FILE=scratch-file-$(date +%s)
truncate --size "${SIZE}" "${SCRATCH_FILE}"
SCRATCH_DEV=$(losetup --show -f "${SCRATCH_FILE}")

TEST_FILE=test-file-$(date +%s)
truncate --size "${SIZE}" "${TEST_FILE}"
TEST_DEV=$(losetup --show -f "${TEST_FILE}")
TEST_DIR=/mnt/xfstests-test
SCRATCH_MNT=/mnt/xfstests-scratch
mkdir -p "${TEST_DIR}" "${SCRATCH_MNT}"

cat > local.config <<EOF  # assuming the standard xfstests local.config as the target
export SCRATCH_DEV=$SCRATCH_DEV
export TEST_DEV=$TEST_DEV
export TEST_DIR=$TEST_DIR
export SCRATCH_MNT=$SCRATCH_MNT
EOF
> In our lab we've found that if multiple iSCSI connection errors are
> detected (without completely losing the iSCSI connection) then the GFS2
> filesystem becomes corrupt.
> The connection errors occurred very likely due to network congestion (the 
> iSCSI target is running on a 10Gbps network,
> but the test VM had only a 1Gbps network connection to it). Nevertheless 
> networking issues shouldn't cause corruption.
> 
> Even after rebooting the host and remounting the filesystem has immediately 
> withdrawn itself
> when 

[Cluster-devel] GFS2 filesystem corruption when mkfs block size (4k) does not match blockdevice blocksize (512 bytes)

2019-03-20 Thread Edwin Török
Hello,

The patch below to xfstests reliably reproduces the filesystem corruption on a 
4.19 kernel.

In our lab we've found that if multiple iSCSI connection errors are
detected (without completely losing the iSCSI connection), the GFS2
filesystem becomes corrupt.
The connection errors most likely occurred due to network congestion (the iSCSI target runs on a 10Gbps network, but the test VM had only a 1Gbps connection to it). Nevertheless, networking issues shouldn't cause corruption.

Even after rebooting the host and remounting, the filesystem has immediately withdrawn itself when stat'ing files. Sometimes fsck.gfs2 is able to fix it, but not always.

Tim Smith has found that using matching block sizes for GFS2 and the underlying block device avoids the problem (running the test case below in a loop for 30 minutes no longer reproduces the issue).
This discovery was made after careful reading of the fsck.gfs2 manpage, which states:
"The first step to ensuring a healthy file system is the selection of reliable hardware (i.e. storage systems that will write complete blocks - even in the event of power failure)."

By default GFS2 uses a block size of 4k, while the devices in our lab typically have a block size of 512 bytes, so there are no atomicity guarantees on a 4k write: you may indeed get an incomplete write of 512, 1024, etc. bytes.
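
Concretely, matching the block sizes as described above comes down to something like the following (the device path is only an example, and lock_nolock is for single-node testing):

blockdev --getss /dev/dm-4     # logical sector size, e.g. 512
blockdev --getpbsz /dev/dm-4   # physical sector size
mkfs.gfs2 -p lock_nolock -b 512 /dev/dm-4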

We'd like your opinion on two questions:
1. We believe that mkfs.gfs2 in gfs2-utils should be updated to query the underlying block device for its block size and use that by default instead of 4k, and to warn about a mismatch if a block size is specified manually (a rough sketch of the intended check follows after question 2). The mkfs.gfs2 manpage should also be updated to state this.
Is this the right approach, i.e. would such a patch be acceptable upstream if we prepare it?

2. Can the code be made more robust so that it copes with incomplete writes? (E.g. can it throw away the bad journal entries, or replay the journal in such a way that it treats the underlying block device's block size as the unit of atomicity instead of the GFS2 block size?)
We're aware of your hard work on withdraw corruption (https://www.redhat.com/archives/cluster-devel/2019-February/msg00077.html); would those patches help with this issue?
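
As a rough sketch of the check we have in mind for question 1 (shell is used here only to illustrate the logic; the actual change would query the sizes inside mkfs.gfs2 itself, e.g. via the BLKSSZGET/BLKPBSZGET ioctls):

#!/bin/sh
# default the block size to the device's physical block size and warn
# when a user-specified value is larger than the atomic write unit
DEV=${1:?usage: $0 <blockdev> [bsize]}
PHYS=$(blockdev --getpbsz "$DEV")   # physical block size, e.g. 512 or 4096
BSIZE=${2:-$PHYS}                   # use the device's value instead of a fixed 4k
if [ "$BSIZE" -gt "$PHYS" ]; then
    echo "warning: block size $BSIZE exceeds the device's physical block size $PHYS;" >&2
    echo "partial (torn) writes become possible on I/O errors or power loss" >&2
fi
echo "would run: mkfs.gfs2 -b $BSIZE $DEV"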

Ideally we'd like to fix both issues (use matching block sizes for newly created filesystems, and fix the possibility of corruption for the benefit of those who have already created their GFS2 filesystem with the default block size).


The initial withdraw looks like this:
Mar 20 15:10:29 localhost kernel: [13654.281665] Buffer I/O error on dev dm-4, logical block 431236, lost async page write
Mar 20 15:10:29 localhost kernel: [13654.281795] gfs2: fsid=dm-4.0: fatal: invalid metadata block
Mar 20 15:10:29 localhost kernel: [13654.281795]   bh = 431236 (magic number)
Mar 20 15:10:29 localhost kernel: [13654.281795]   function = gfs2_meta_indirect_buffer, file = fs/gfs2/meta_io.c, line = 419
Mar 20 15:10:29 localhost kernel: [13654.281834] gfs2: fsid=dm-4.0: about to withdraw this file system
Mar 20 15:10:29 localhost kernel: [13654.281884] gfs2: fsid=dm-4.0: withdrawn
Mar 20 15:10:29 localhost kernel: [13654.281894] CPU: 6 PID: 20880 Comm: xfs_io Tainted: G   O  4.19.0+1 #1
Mar 20 15:10:29 localhost kernel: [13654.281896] Hardware name: Dell Inc. PowerEdge R415/0GXH08, BIOS 1.2.5 03/16/2011
Mar 20 15:10:29 localhost kernel: [13654.281897] Call Trace:
Mar 20 15:10:29 localhost kernel: [13654.281916]  dump_stack+0x5a/0x73
Mar 20 15:10:29 localhost kernel: [13654.281941]  gfs2_lm_withdraw+0x125/0x1a0 [gfs2]
Mar 20 15:10:29 localhost kernel: [13654.281956]  ? gfs2_meta_read+0x22d/0x2d0 [gfs2]
Mar 20 15:10:29 localhost kernel: [13654.281969]  gfs2_meta_check_ii+0x2b/0x40 [gfs2]
Mar 20 15:10:29 localhost kernel: [13654.281981]  gfs2_meta_indirect_buffer+0x119/0x130 [gfs2]
Mar 20 15:10:29 localhost kernel: [13654.281993]  gfs2_dirty_inode+0x107/0x230 [gfs2]
Mar 20 15:10:29 localhost kernel: [13654.282006]  ? gfs2_dirty_inode+0x1bf/0x230 [gfs2]
Mar 20 15:10:29 localhost kernel: [13654.282012]  __mark_inode_dirty+0x17c/0x360
Mar 20 15:10:29 localhost kernel: [13654.282024]  gfs2_write_inode+0x11f/0x160 [gfs2]
Mar 20 15:10:29 localhost kernel: [13654.282028]  __writeback_single_inode+0x26d/0x340
Mar 20 15:10:29 localhost kernel: [13654.282031]  writeback_single_inode+0xb7/0x130
Mar 20 15:10:29 localhost kernel: [13654.282033]  sync_inode_metadata+0x3d/0x60
Mar 20 15:10:29 localhost kernel: [13654.282045]  gfs2_fsync+0xad/0x110 [gfs2]
Mar 20 15:10:29 localhost kernel: [13654.282049]  do_fsync+0x38/0x60
Mar 20 15:10:29 localhost kernel: [13654.282053]  __x64_sys_fsync+0x10/0x20
Mar 20 15:10:29 localhost kernel: [13654.282057]  do_syscall_64+0x4e/0x100
Mar 20 15:10:29 localhost kernel: [13654.282062]  entry_SYSCALL_64_after_hwfr

After the remount:
Mar 20 15:10:30 localhost kernel: [13655.529526] gfs2: fsid=dm-4.0: first mount done, others may mount
Mar 20 15:10:30 localhost kernel: [13655.581457] gfs2: