On Thu, Feb 14, 2019 at 01:22:49AM +0000, Filipe Manana wrote:
> On Wed, Feb 13, 2019 at 6:14 PM Filipe Manana <fdman...@gmail.com> wrote:
> > On Wed, Feb 13, 2019 at 5:36 PM Filipe Manana <fdman...@gmail.com> wrote:
[...]
> > > Tried it today and I got it reproduced (different vm, but still debian
> > > and kernel built from source).
> > > Not sure what was different last time. Yes, I had compression enabled.
> > >
> > > I'll look into it.
> >
> > So the problem is caused by hole punching. The script can be reduced
> > to the following:
> >
> > https://friendpaste.com/22t4OdktHQTl0aMGxckc86
> >
> > file size: 384K am
> > digests after file creation:   7c8349cc657fbe61af53fbc5cfacae6e9a402e83  am
> > digests after file creation 2: 7c8349cc657fbe61af53fbc5cfacae6e9a402e83  am
> > 262144 total bytes deduped in this operation
> > digests after dedupe:          7c8349cc657fbe61af53fbc5cfacae6e9a402e83  am
> > digests after dedupe 2:        7c8349cc657fbe61af53fbc5cfacae6e9a402e83  am
> > am: 24 KiB (24576 bytes) converted to sparse holes.
> > digests after hole punching:   7c8349cc657fbe61af53fbc5cfacae6e9a402e83  am
> > digests after hole punching 2: 5a357b64f4004ea38dbc7058c64a5678668420da  am
> >
> > So hole punching is screwing things up, and we can only see the bug after
> > dropping the page cache.
> > I'll send a fix likely tomorrow.
> 
> So it turns out it's a problem in the code that reads compressed extents,
> a variant of a bug I found back in 2015:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=005efedf2c7d0a270ffbe28d8997b03844f3e3e7
> 
> The following one liner fixes it:
> https://friendpaste.com/22t4OdktHQTl0aMGxcWLj3
> 
> While you test it there (if you want/can), I'll write a change log and
> a proper test case for fstests and submit them later.

Works here (and produces the correct sha1sum, which turns out to be
dae78e303edfb8b8ad64ecae01dc1bf233770cfd).

Nice work!

> Thanks!
> >
> > >
> > > >
> > > > > > > >
> > > > > > > > The behavior is slightly different on current kernels (4.20.7, 4.14.96)
> > > > > > > > which makes the problem a bit more difficult to detect.
> > > > > > > >
> > > > > > > >         # repro-hole-corruption-test
> > > > > > > >         i: 91, status: 0, bytes_deduped: 131072
> > > > > > > >         i: 92, status: 0, bytes_deduped: 131072
> > > > > > > >         i: 93, status: 0, bytes_deduped: 131072
> > > > > > > >         i: 94, status: 0, bytes_deduped: 131072
> > > > > > > >         i: 95, status: 0, bytes_deduped: 131072
> > > > > > > >         i: 96, status: 0, bytes_deduped: 131072
> > > > > > > >         i: 97, status: 0, bytes_deduped: 131072
> > > > > > > >         i: 98, status: 0, bytes_deduped: 131072
> > > > > > > >         i: 99, status: 0, bytes_deduped: 131072
> > > > > > > >         13107200 total bytes deduped in this operation
> > > > > > > >         am: 4.8 MiB (4964352 bytes) converted to sparse holes.
> > > > > > > >         94a8acd3e1f6e14272f3262a8aa73ab6b25c9ce8 am
> > > > > > > >         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > > >         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > > >         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > > >         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > > >         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > > >         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > > >
> > > > > > > > The sha1sum seems stable after the first drop_caches--until a second
> > > > > > > > process tries to read the test file:
> > > > > > > >
> > > > > > > >         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > > >         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > > >         # cat am > /dev/null              (in another shell)
> > > > > > > >         19294e695272c42edb89ceee24bb08c13473140a am
> > > > > > > >         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > > >
> > > > > > > > On Wed, Aug 22, 2018 at 11:11:25PM -0400, Zygo Blaxell wrote:
> > > > > > > > > This is a repro script for a btrfs bug that causes corrupted data reads
> > > > > > > > > when reading a mix of compressed extents and holes.  The bug is
> > > > > > > > > reproducible on at least kernels v4.1..v4.18.
> > > > > > > > >
> > > > > > > > > Some more observations and background follow, but first here is the
> > > > > > > > > script and some sample output:
> > > > > > > > >
> > > > > > > > >       root@rescue:/test# cat repro-hole-corruption-test
> > > > > > > > >       #!/bin/bash
> > > > > > > > >
> > > > > > > > >       # Write a 4096 byte block of something
> > > > > > > > >       block () { head -c 4096 /dev/zero | tr '\0' "\\$1"; }
> > > > > > > > >
> > > > > > > > >       # Here is some test data with holes in it:
> > > > > > > > >       for y in $(seq 0 100); do
> > > > > > > > >               for x in 0 1; do
> > > > > > > > >                       block 0;
> > > > > > > > >                       block 21;
> > > > > > > > >                       block 0;
> > > > > > > > >                       block 22;
> > > > > > > > >                       block 0;
> > > > > > > > >                       block 0;
> > > > > > > > >                       block 43;
> > > > > > > > >                       block 44;
> > > > > > > > >                       block 0;
> > > > > > > > >                       block 0;
> > > > > > > > >                       block 61;
> > > > > > > > >                       block 62;
> > > > > > > > >                       block 63;
> > > > > > > > >                       block 64;
> > > > > > > > >                       block 65;
> > > > > > > > >                       block 66;
> > > > > > > > >               done
> > > > > > > > >       done > am
> > > > > > > > >       sync
> > > > > > > > >
> > > > > > > > >       # Now replace those 101 distinct extents with 101 references to the first extent
> > > > > > > > >       btrfs-extent-same 131072 $(for x in $(seq 0 100); do echo am $((x * 131072)); done) 2>&1 | tail
> > > > > > > > >
> > > > > > > > >       # Punch holes into the extent refs
> > > > > > > > >       fallocate -v -d am
> > > > > > > > >
> > > > > > > > >       # Do some other stuff on the machine while this runs, and watch the sha1sums change!
> > > > > > > > >       while :; do echo $(sha1sum am); sysctl -q vm.drop_caches={1,2,3}; sleep 1; done
> > > > > > > > >
> > > > > > > > >       root@rescue:/test# ./repro-hole-corruption-test
> > > > > > > > >       i: 91, status: 0, bytes_deduped: 131072
> > > > > > > > >       i: 92, status: 0, bytes_deduped: 131072
> > > > > > > > >       i: 93, status: 0, bytes_deduped: 131072
> > > > > > > > >       i: 94, status: 0, bytes_deduped: 131072
> > > > > > > > >       i: 95, status: 0, bytes_deduped: 131072
> > > > > > > > >       i: 96, status: 0, bytes_deduped: 131072
> > > > > > > > >       i: 97, status: 0, bytes_deduped: 131072
> > > > > > > > >       i: 98, status: 0, bytes_deduped: 131072
> > > > > > > > >       i: 99, status: 0, bytes_deduped: 131072
> > > > > > > > >       13107200 total bytes deduped in this operation
> > > > > > > > >       am: 4.8 MiB (4964352 bytes) converted to sparse holes.
> > > > > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > > > >       072a152355788c767b97e4e4c0e4567720988b84 am
> > > > > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > > > >       bf00d862c6ad436a1be2be606a8ab88d22166b89 am
> > > > > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > > > >       0d44cdf030fb149e103cfdc164da3da2b7474c17 am
> > > > > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > > > >       60831f0e7ffe4b49722612c18685c09f4583b1df am
> > > > > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > > > >       a19662b294a3ccdf35dbb18fdd72c62018526d7d am
> > > > > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > > > >       ^C
> > > > > > > > >
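
As a sanity check on the sample output above, the byte counts can be derived from the layout of the script itself. This is only a sketch: the variable names are made up for illustration, and it assumes every "block 0" call ends up as a hole once fallocate -d has run.

```bash
#!/bin/bash
# Layout taken from the repro script: 4096-byte blocks, 16 block calls per
# inner pass (6 of them "block 0"), 2 inner passes per outer iteration,
# and 101 outer iterations (seq 0 100).
block_size=4096
blocks_per_pass=16
zero_blocks_per_pass=6
passes=2
iterations=101

# Bytes written per outer iteration; matches the 131072 dedupe length.
extent_bytes=$((block_size * blocks_per_pass * passes))
# Total size of the test file.
file_bytes=$((extent_bytes * iterations))
# Bytes of zero-filled blocks that hole digging can convert.
hole_bytes=$((block_size * zero_blocks_per_pass * passes * iterations))

echo "bytes per outer iteration: $extent_bytes"
echo "file size:                 $file_bytes"
echo "zero-filled bytes:         $hole_bytes"
```

The last figure comes out to 4964352, the same number fallocate reports as converted to sparse holes.
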
> > > > > > > > > Corruption occurs most often when there is a sequence like this in a file:
> > > > > > > > >
> > > > > > > > >       ref 1: hole
> > > > > > > > >       ref 2: extent A, offset 0
> > > > > > > > >       ref 3: hole
> > > > > > > > >       ref 4: extent A, offset 8192
> > > > > > > > >
> > > > > > > > > This scenario typically arises due to hole-punching or deduplication.
> > > > > > > > > Hole-punching replaces one extent ref with two references to the same
> > > > > > > > > extent with a hole between them, so:
> > > > > > > > >
> > > > > > > > >       ref 1:  extent A, offset 0, length 16384
> > > > > > > > >
> > > > > > > > > becomes:
> > > > > > > > >
> > > > > > > > >       ref 1:  extent A, offset 0, length 4096
> > > > > > > > >       ref 2:  hole, length 8192
> > > > > > > > >       ref 3:  extent A, offset 12288, length 4096
> > > > > > > > >
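
The offset arithmetic in that split can be written out as a toy model. This is a sketch only: punch_hole is a hypothetical helper, not a btrfs interface, and it assumes the hole lies strictly inside the ref.

```bash
#!/bin/bash
# Toy model: print the refs left after punching a hole inside one extent ref.
# usage: punch_hole EXTENT REF_OFFSET REF_LENGTH HOLE_START HOLE_LENGTH
# HOLE_START is relative to the start of the ref.
punch_hole () {
        local extent=$1 off=$2 len=$3 hstart=$4 hlen=$5
        # The tail ref resumes right after the hole, inside the same extent.
        local tail_off=$((off + hstart + hlen))
        local tail_len=$((len - hstart - hlen))
        echo "extent $extent, offset $off, length $hstart"
        echo "hole, length $hlen"
        echo "extent $extent, offset $tail_off, length $tail_len"
}

# The example from the text: a 16384-byte ref with an 8192-byte hole at 4096.
punch_hole A 0 16384 4096 8192
```

Both surviving refs still name extent A, which is the property that matters for this bug.
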
> > > > > > > > > Deduplication replaces two distinct extent refs surrounding a hole with
> > > > > > > > > two references to one of the duplicate extents, turning this:
> > > > > > > > >
> > > > > > > > >       ref 1:  extent A, offset 0, length 4096
> > > > > > > > >       ref 2:  hole, length 8192
> > > > > > > > >       ref 3:  extent B, offset 0, length 4096
> > > > > > > > >
> > > > > > > > > into this:
> > > > > > > > >
> > > > > > > > >       ref 1:  extent A, offset 0, length 4096
> > > > > > > > >       ref 2:  hole, length 8192
> > > > > > > > >       ref 3:  extent A, offset 0, length 4096
> > > > > > > > >
> > > > > > > > > Compression is required (zlib, zstd, or lzo) for corruption to occur.
> > > > > > > > > I am not able to reproduce the issue with an uncompressed extent nor
> > > > > > > > > have I observed any such corruption in the wild.
> > > > > > > > >
> > > > > > > > > The presence or absence of the no-holes filesystem feature has no effect.
> > > > > > > > >
> > > > > > > > > Ordinary writes can lead to pairs of extent references to the same extent
> > > > > > > > > separated by a reference to a different extent; however, in this case
> > > > > > > > > there is data to be read from a real extent, instead of pages that have
> > > > > > > > > to be zero filled from a hole.  If ordinary non-hole writes could trigger
> > > > > > > > > this bug, every page-oriented database engine would be crashing all the
> > > > > > > > > time on btrfs with compression enabled, and it's unlikely that would not
> > > > > > > > > have been noticed between 2015 and now.  An ordinary write that splits
> > > > > > > > > an extent ref would look like this:
> > > > > > > > >
> > > > > > > > >       ref 1:  extent A, offset 0, length 4096
> > > > > > > > >       ref 2:  extent C, offset 0, length 8192
> > > > > > > > >       ref 3:  extent A, offset 12288, length 4096
> > > > > > > > >
> > > > > > > > > Sparse writes can lead to pairs of extent references surrounding a hole;
> > > > > > > > > however, in this case the extent references will point to different
> > > > > > > > > extents, avoiding the bug.  If a sparse write could trigger the bug,
> > > > > > > > > the rsync -S option and qemu/kvm 'raw' disk image files (among many
> > > > > > > > > other tools that produce sparse files) would be unusable, and it's
> > > > > > > > > unlikely that would not have been noticed between 2015 and now either.
> > > > > > > > > Sparse writes look like this:
> > > > > > > > >
> > > > > > > > >       ref 1:  extent A, offset 0, length 4096
> > > > > > > > >       ref 2:  hole, length 8192
> > > > > > > > >       ref 3:  extent B, offset 0, length 4096
> > > > > > > > >
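
The four layouts described so far (hole punch, dedupe, ordinary write, sparse write) differ in one respect: whether a hole sits directly between two refs to the same extent. A minimal sketch of that check follows. The function names are hypothetical and the model ignores offsets and lengths entirely, so it illustrates the pattern and says nothing about whether a given file will actually hit the bug.

```bash
#!/bin/bash
# Reads one ref per line on stdin: an extent name, or the word "hole".
# Returns 0 if some hole sits directly between two refs to the same extent.
has_risky_pattern () {
        local prev2="" prev1="" cur
        while read -r cur; do
                if [ "$prev1" = hole ] && [ -n "$prev2" ] &&
                   [ "$prev2" != hole ] && [ "$prev2" = "$cur" ]; then
                        return 0
                fi
                prev2=$prev1
                prev1=$cur
        done
        return 1
}

# usage: check LABEL REF...
check () {
        local label=$1; shift
        if printf '%s\n' "$@" | has_risky_pattern; then
                echo "$label: matches"
        else
                echo "$label: no match"
        fi
}

check "hole punch"     A hole A
check "dedupe"         A hole A
check "ordinary write" A C A
check "sparse write"   A hole B
```

Only the first two layouts match, which agrees with the observation that ordinary and sparse writes do not trigger the corruption.
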
> > > > > > > > > The pattern or timing of read() calls seems to be relevant.  It is very
> > > > > > > > > hard to see the corruption when reading files with 'hd', but 'cat | hd'
> > > > > > > > > will see the corruption just fine.  Similar problems exist with 'cmp'
> > > > > > > > > but not 'sha1sum'.  Two processes reading the same file at the same time
> > > > > > > > > seem to trigger the corruption very frequently.
> > > > > > > > >
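
For anyone who wants to script the two-reader case, here is a rough sketch. concurrent_digests is a made-up helper name. It starts one direct reader and one reader behind a cat pipe at the same time and prints both digests. It does not drop caches (that needs root and can be added around the call), and two matching digests do not prove the data is correct, since both readers may see the same bad pages.

```bash
#!/bin/bash
# usage: concurrent_digests FILE
# Starts two readers on FILE at once (one direct, one through a cat pipe)
# and prints both SHA-1 digests, one per line.
concurrent_digests () {
        local file=$1 tmp
        tmp=$(mktemp -d) || return 1
        sha1sum < "$file" > "$tmp/direct" &
        cat "$file" | sha1sum > "$tmp/piped" &
        wait
        # Keep only the digest column from each result.
        cut -d' ' -f1 "$tmp/direct" "$tmp/piped"
        rm -rf "$tmp"
}

# Example use against the test file from the repro script:
#   digests=$(concurrent_digests am)
#   [ "$(printf '%s\n' "$digests" | sort -u | wc -l)" -eq 1 ] || echo "readers disagree"
```
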
> > > > > > > > > Some patterns of holes and data produce corruption faster than others.
> > > > > > > > > The pattern generated by the script above is based on instances of
> > > > > > > > > corruption I've found in the wild, and has a much better repro rate than
> > > > > > > > > random holes.
> > > > > > > > >
> > > > > > > > > The corruption occurs during reads, after csum verification and before
> > > > > > > > > decompression, so btrfs detects no csum failures.  The data on disk
> > > > > > > > > seems to be OK and could be read correctly once the kernel bug is fixed.
> > > > > > > > > Repeated reads do eventually return correct data, but there is no way
> > > > > > > > > for userspace to distinguish between corrupt and correct data reliably.
> > > > > > > > >
> > > > > > > > > The corrupted data is usually data replaced by a hole or a copy of other
> > > > > > > > > blocks in the same extent.
> > > > > > > > >
> > > > > > > > > The behavior is similar to some earlier bugs related to holes and
> > > > > > > > > compressed data in btrfs, but it's new and not fixed yet--hence,
> > > > > > > > > "2018 edition."
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Filipe David Manana,
> > > > > > >
> > > > > > > “Whether you think you can, or you think you can't — you're 
> > > > > > > right.”
> > > > > > >
