On Tue, Mar 6, 2018 at 8:15 AM, Qu Wenruo <w...@suse.com> wrote:
> There have been some btrfs corruption reports on the mailing list for a while,
There have been for years (well, since ever) many reports of different
types of corruptions. Which kind of corruption are you referring to?

> although such corruption is pretty rare and almost impossible to
> reproduce, with dm-log-writes we found it's highly related to v1 space
> cache.
>
> Unlike journal based filesystems, btrfs completely relies on metadata CoW
> to protect itself from power loss.
> This needs the extent allocator to avoid allocating extents over existing
> extent ranges.
> Btrfs also uses the free space cache to speed up such allocations.
>
> However there is a problem: the v1 space cache is not protected by data CoW,
> and can be corrupted during power loss.
> So btrfs does extra checks on the free space cache, verifying its in-file
> csum, generation, and the free space recorded in the cache against the
> extent tree.
>
> The problem is, under heavy concurrency, the v1 space cache can be corrupted
> even under normal operations without power loss.

How?

> And we believe a corrupted space cache can break btrfs metadata CoW and
> lead to the rare corruption at the next power loss.

Which kind of corruption?

>
> The most obvious symptom will be a difference in free space:
>
> This will be caught by the kernel, but such a check is quite weak, and if
> the net free space change is 0 in one transaction, the corrupted
> cache can be loaded by the kernel.

How can that happen? The only case I'm aware of, explained below, always
leads to a difference (the space cache has less free space than what we
actually have if we check the extent tree).

>
> In this case, btrfs check would report things like:
> ------
> block group 298844160 has wrong amount of free space
> failed to load free space cache for block group 298844160
> ------

This is normal, but not very common, due to tiny races that exist between
committing a transaction (writing the free space caches) and running
delalloc for example (since reserving an extent while running delalloc
doesn't join/start a transaction).
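To make the "weak check" point above concrete, here is a toy simulation (not btrfs code; all names are made up): if the only thing compared is the total amount of free space per block group, two corruptions in the cached ranges that offset each other leave the total unchanged, so the cache still passes.

```python
# Toy model of the v1 space cache free-space-amount check discussed above.
# Illustrative only -- this is not how btrfs represents free space ranges.

def cache_passes_free_space_check(extent_tree_free, cached_free):
    """The weak check: compare only total free bytes for a block group."""
    return sum(extent_tree_free) == sum(cached_free)

# Lengths (bytes) of the actual free ranges according to the extent tree:
actual = [4096, 8192, 16384]

# A corrupted cache that wrongly marks one 4K range as used and a
# different 4K range as free: individual ranges are wrong, but the net
# free space change is 0, so the totals still match.
corrupted = [8192, 8192, 12288]

print(cache_passes_free_space_check(actual, actual))     # True
print(cache_passes_free_space_check(actual, corrupted))  # True: undetected
```

Of course, when the corruption does change the total (the case described above), the amounts differ and the kernel refuses to load the cache.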
>
> But considering the test case is using notreelog, btrfs won't do the
> sub-transaction commits which don't increase the generation, so each
> transaction should be consistent, and nothing should be reported at all.
>
> Furthermore, we can even find corrupted file extents like:
> ------
> root 5 inode 261 errors 100, file extent discount
> Found file extent holes:
>       start: 962560, len: 32768
> ERROR: errors found in fs roots

Why do you think that's a corruption? Does it cause data loss or any user
visible issue?

Having file extent holes not inserted happens when mixing buffered and
direct IO writes to a file (and fsstress does that), for example:

  create file
  buffered write at offset 0, length 64K
  direct IO write at offset 64K, length 4K
  transaction commit
  power loss

After this we got a missing 64K hole extent at offset 0 (at
btrfs_file_write_iter we only add hole extents if the start offset is
greater than the current i_size).

But this does not cause any problem for the user or the fs itself, and
it's supposed to be like that in the NO_HOLES mode, which one day
(probably) will be the default mode.
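The check that produces the "file extent discount" message above can be sketched like this (illustrative only; names and layout are made up, not btrfs-check code): walk the file extent items in offset order and report any range up to i_size that no item covers.

```python
# Sketch of a "file extent discount" style check: find ranges of a file
# that no file extent item (regular, prealloc, or hole) covers.
# Hypothetical model, not btrfs-check code.

def find_holes(extents, i_size):
    """Return [(start, len)] gaps not covered by any file extent item.

    extents: list of (offset, length) items, sorted by offset.
    """
    holes, pos = [], 0
    for off, length in extents:
        if off > pos:
            holes.append((pos, off - pos))
        pos = max(pos, off + length)
    if pos < i_size:
        holes.append((pos, i_size - pos))
    return holes

# After "buffered write at 0..64K, direct IO write at 64K..68K, commit,
# power loss", only the direct IO extent item is present and no hole
# extent item was inserted for the first 64K:
extents = [(65536, 4096)]
print(find_holes(extents, i_size=69632))  # [(0, 65536)]
```

With NO_HOLES the uncovered range is simply implicit and nothing is reported, which is why the missing item is harmless.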
> ------
>
> Signed-off-by: Qu Wenruo <w...@suse.com>
> ---
>  common/dmlogwrites  |  72 +++++++++++++++++++++++++++
>  tests/btrfs/159     | 141 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>  tests/btrfs/159.out |   2 +
>  tests/btrfs/group   |   1 +
>  4 files changed, 216 insertions(+)
>  create mode 100755 tests/btrfs/159
>  create mode 100644 tests/btrfs/159.out
>
> diff --git a/common/dmlogwrites b/common/dmlogwrites
> index 467b872e..54e7e242 100644
> --- a/common/dmlogwrites
> +++ b/common/dmlogwrites
> @@ -126,3 +126,75 @@ _log_writes_cleanup()
>       $UDEV_SETTLE_PROG >/dev/null 2>&1
>       _log_writes_remove
>  }
> +
> +# Convert log writes mark to entry number
> +# Result entry number is output to stdout, could be empty if not found
> +_log_writes_mark_to_entry_number()
> +{
> +     local _mark=$1
> +     local ret
> +
> +     [ -z "$_mark" ] && _fail \
> +             "mark must be given for _log_writes_mark_to_entry_number"
> +
> +     ret=$($here/src/log-writes/replay-log --find --log $LOGWRITES_DEV \
> +             --end-mark $_mark 2> /dev/null)
> +     [ -z "$ret" ] && return
> +     ret=$(echo "$ret" | cut -f1 -d\@)
> +     echo "mark $_mark has entry number $ret" >> $seqres.full
> +     echo "$ret"
> +}
> +
> +# Find next fua write entry number
> +# Result entry number is output to stdout, could be empty if not found
> +_log_writes_find_next_fua()
> +{
> +     local _start_entry=$1
> +     local ret
> +
> +     if [ -z "$_start_entry" ]; then
> +             ret=$($here/src/log-writes/replay-log --find --log $LOGWRITES_DEV \
> +                     --next-fua 2> /dev/null)
> +     else
> +             ret=$($here/src/log-writes/replay-log --find --log $LOGWRITES_DEV \
> +                     --next-fua --start-entry $_start_entry 2> /dev/null)
> +     fi
> +     [ -z "$ret" ] && return
> +
> +     ret=$(echo "$ret" | cut -f1 -d\@)
> +     echo "next fua is entry number $ret" >> $seqres.full
> +     echo "$ret"
> +}
> +
> +# Replay log range to specified entry
> +# $1: End entry. The last entry will *NOT* be replayed
> +# $2: Start entry. If not specified, start from the first entry.
> +_log_writes_replay_log_entry_range()
> +{
> +     local _end=$1
> +     local _start=$2
> +
> +     [ -z "$_end" ] && _fail \
> +     "end entry must be specified for _log_writes_replay_log_entry_range"
> +
> +     if [[ "$_start" && "$_start" -gt "$_end" ]]; then
> +             _fail \
> +     "wrong parameter order for _log_writes_replay_log_entry_range: start=$_start end=$_end"
> +     fi
> +
> +     # Original replay-log won't replay the last entry. So increase entry
> +     # number here to ensure the end entry to be replayed
> +     if [ -z "$_start" ]; then
> +             echo "=== replay to $_end ===" >> $seqres.full
> +             $here/src/log-writes/replay-log --log $LOGWRITES_DEV \
> +                     --replay $SCRATCH_DEV --limit $_end -v \
> +                     >> $seqres.full 2>&1
> +     else
> +             echo "=== replay from $_start to $_end ===" >> $seqres.full
> +             $here/src/log-writes/replay-log --log $LOGWRITES_DEV \
> +                     --replay $SCRATCH_DEV --start-entry $_start \
> +                     --limit $(($_end - $_start)) -v \
> +                     >> $seqres.full 2>&1
> +     fi
> +     [ $? -ne 0 ] && _fail "replay failed"
> +}
> diff --git a/tests/btrfs/159 b/tests/btrfs/159
> new file mode 100755
> index 00000000..5113d526
> --- /dev/null
> +++ b/tests/btrfs/159
> @@ -0,0 +1,141 @@
> +#! /bin/bash
> +# FS QA Test 159
> +#
> +# Test case for btrfs v1 space cache corruption
> +#
> +# Btrfs has some unexpected corruption after power loss, during the
> +# investigation, the problem leads to v1 space cache corruption, which
> +# could be loaded by kernel without detection.
> +#
> +# Kernel has 3 detection for corrupted space cache, checksum, generation
> +# and free space, however the last one is a weak one. If in the one
> +# transaction no net space change it can be loaded by kernel and break up
> +# metadata CoW, leading serious corruption.
> +#
> +#-----------------------------------------------------------------------
> +# Copyright (c) SUSE.  All Rights Reserved.
> +#
> +# This program is free software; you can redistribute it and/or
> +# modify it under the terms of the GNU General Public License as
> +# published by the Free Software Foundation.
> +#
> +# This program is distributed in the hope that it would be useful,
> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +# GNU General Public License for more details.
> +#
> +# You should have received a copy of the GNU General Public License
> +# along with this program; if not, write the Free Software Foundation,
> +# Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
> +#-----------------------------------------------------------------------
> +#
> +
> +seq=`basename $0`
> +seqres=$RESULT_DIR/$seq
> +echo "QA output created by $seq"
> +
> +here=`pwd`
> +tmp=/tmp/$$
> +status=1     # failure is the default!
> +trap "_cleanup; exit \$status" 0 1 2 3 15
> +
> +_cleanup()
> +{
> +     cd /
> +     rm -f $tmp.*
> +     _log_writes_cleanup
> +}
> +
> +# get standard environment, filters and checks
> +. ./common/rc
> +. ./common/dmlogwrites
> +. ./common/filter
> +
> +# remove previous $seqres.full before test
> +rm -f $seqres.full
> +
> +# Small workload is enough to trigger the problem
> +workload=$(( 512 * $LOAD_FACTOR ))
> +nr_threads=$(($($here/src/feature -o) * $LOAD_FACTOR))
> +
> +# Modify as appropriate.
> +_supported_fs btrfs
> +_supported_os Linux
> +_require_log_writes
> +_require_scratch
> +
> +_log_writes_init
> +_log_writes_mkfs >> $seqres.full 2>&1
> +
> +# Here we don't want to create space cache yet
> +_log_writes_mount -o nospace_cache
> +
> +# The SINGLE profile data chunk created by mkfs is too small (8M) so btrfs
> +# won't create space cache for it.
> +# We need a data chunk large enough so btrfs create space cache for it and
> +# store the cache in itself.
> +_run_btrfs_util_prog balance start --full-balance "$SCRATCH_MNT"
> +
> +_log_writes_unmount
> +_log_writes_mark mkfs
> +
> +# Disable tree log, so btrfs will always do full transaction commit
> +# It's OK to use tree log, but it will cause extra super block updates
> +# which don't increase the generation.
> +_log_writes_mount -o space_cache=v1,notreelog
> +run_check $FSSTRESS_PROG -w -n $workload -p $nr_threads -d $SCRATCH_MNT \
> +     $FSSTRESS_AVOID > /dev/null 2>&1
> +_log_writes_unmount
> +
> +_log_writes_remove
> +# Here we manually iterate through the log entries, as we need extra checks
> +# on the output of btrfs check. And we could have a detailed replay log in
> +# $seqres.full
> +prev=$(_log_writes_mark_to_entry_number mkfs)
> +if [ -z "$prev" ]; then
> +     _fail "failed to locate log writes mark 'mkfs'"
> +fi
> +
> +# Here we only check the result *AFTER* FUA writes, and no tree log, so
> +# everything should be valid.
> +cur=$(_log_writes_find_next_fua $prev)
> +[ -z "$cur" ] && _fail "failed to locate next fua write"
> +
> +# log-replay will not replay the last item, so manually increase it by one
> +cur=$(($cur + 1))
> +
> +_log_writes_replay_log_entry_range $prev >> $seqres.full 2>&1
> +while [ ! -z "$cur" ]; do
> +     _log_writes_replay_log_entry_range $cur $prev
> +     # Catch the btrfs check output into a temp file, as we need to
> +     # grep the output to find the cache corruption
> +     $BTRFS_UTIL_PROG check --check-data-csum $SCRATCH_DEV &> $tmp.fsck
> +
> +     # A cache that passed the generation, csum and free space checks but
> +     # is corrupted will be reported as an error
> +     if [ $? -ne 0 ]; then
> +             cat $tmp.fsck >> $seqres.full
> +             _fail "btrfs check found corruption"
> +     fi
> +
> +     # Mount options have ruled out any possible factors affecting the
> +     # space cache. And we are at the FUA writes, so no generation related
> +     # problem should happen anyway
> +     if grep -q -e 'failed to load free space cache' $tmp.fsck; then
> +             cat $tmp.fsck >> $seqres.full
> +             _fail "btrfs check found invalid space cache"
> +     fi
> +
> +     prev=$cur
> +     cur=$(_log_writes_find_next_fua $prev)
> +     [ -z "$cur" ] && break
> +
> +     # Same as above
> +     cur=$(($cur + 1))
> +done
> +
> +echo "Silence is golden"
> +
> +# success, all done
> +status=0
> +exit
> diff --git a/tests/btrfs/159.out b/tests/btrfs/159.out
> new file mode 100644
> index 00000000..e569e60c
> --- /dev/null
> +++ b/tests/btrfs/159.out
> @@ -0,0 +1,2 @@
> +QA output created by 159
> +Silence is golden
> diff --git a/tests/btrfs/group b/tests/btrfs/group
> index 8007e07e..bc83db94 100644
> --- a/tests/btrfs/group
> +++ b/tests/btrfs/group
> @@ -161,3 +161,4 @@
>  156 auto quick trim
>  157 auto quick raid
>  158 auto quick raid scrub
> +159 auto
> --
> 2.15.1
>
> --
> To unsubscribe from this list: send the line "unsubscribe fstests" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html

-- 
Filipe David Manana,

"Whether you think you can, or you think you can't — you're right."