(continuing from https://www.spinics.net/lists/linux-btrfs/msg59251.html)

I'm still poking at this bug, and found out some more about it.
Recall that this bug seems to have two parts which together cause data
corruption:

The "write half" of the bug writes a questionable extent structure in
the filesystem once every few hundred thousand files: a compressed inline
extent followed by other non-inline extents.  This happens when a write
occurs at the beginning of a file, followed by a seek past the end of
the first page of the file, followed by another write.
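
A minimal sketch of that write pattern on a single file, assuming a
4096-byte page.  The helper name make_pattern_file and the sizes and
offsets are illustrative assumptions, not part of the repro script.  Given
the rate quoted above, one such file is not expected to trigger the bug
by itself.

```shell
#!/bin/sh
# Hypothetical helper (assumed name): write at offset 0, seek past the
# first 4096-byte page, then write again, leaving a hole in between.
make_pattern_file () {
        # First write: 1000 bytes of compressible data at the start of the file.
        head -c 1000 /dev/zero | tr '\0' 'a' > "$1" || return 1
        # Second write: 1000 bytes at offset 8192, without truncating the file.
        head -c 1000 /dev/zero | tr '\0' 'b' |
                dd of="$1" bs=1 seek=8192 conv=notrunc 2>/dev/null || return 1
}
```

Running "make_pattern_file /try/testfile" and then "filefrag -v /try/testfile"
on a btrfs mount shows the resulting extent layout.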

The "read half" of the bug reads this structure inconsistently (data
between the inline extent and the following extent is random garbage,
different each time the file is read).

After having no success attacking the read half of the bug with a patch,
I tried to bisect to see where the bug was introduced.

The "write half" of the bug seems to appear first somewhere between v3.8
and v3.9.  I have not been able to reproduce it with v3.8.13, v3.7.10, or
v3.6.11.  I can reproduce it in v3.9.11, v3.12.64, and v3.18.13..v4.7.5.

The "read half" of the bug is more interesting.  All kernels I've tested
that have the write half of the bug have the read half as well, but
versions 3.6..3.9 have many more instances of a separate non-repeatable
read corruption (one that does not require the "write half" bug to occur).
These additional bugs were not anticipated by my bisection test case,
so my bisect went in the wrong direction and I have not yet covered the
right kernels to understand where these bugs were introduced.

The good news is that whatever went wrong around 3.6..3.9 seems to have
been fixed by v3.12--that kernel has the same behavior as v4.7.5 for
data corruption on reads.


This is my current repro script.  Run it in a shell loop until corruption
occurs, e.g. "while repro; do date; done".  Adjust the "result" function
to taste (e.g. write it to a file, use your own email address, etc).

#!/bin/sh
set -x

result () {
        echo "$@" "$(cat /proc/version)" | mail -s "$(echo "$@" | head -1) $(uname -r)" results@localhost
}

umount /try
mkdir -p /try
for blk in /dev/vdc /dev/sdc; do
        < "$blk" || continue
        mkfs.btrfs -dsingle -mdup -O ^extref,^skinny-metadata,^no-holes -f "$blk" || exit 1
        mount -ocompress-force,flushoncommit,max_inline=4096,noatime "$blk" /try || exit 1
        cd /try || exit 1
        break
done

# Must be on btrfs
btrfs sub list . || exit 1

y=/usr; for x in $(seq 0 9); do rsync -axHSWI "$y/." "$x"; y="$x"; done &
y=/usr; for x in $(seq 10 19); do rsync -axHSWI "$y/." "$x"; y="$x"; done &
y=/usr; for x in $(seq 20 29); do rsync -axHSWI "$y/." "$x"; y="$x"; done &
y=/usr; for x in $(seq 30 39); do rsync -axHSWI "$y/." "$x"; y="$x"; done &

wait

touch list

find -type f -size +4097c -exec sh -c 'for x; do if filefrag -v "$x" | sed -n "4p" | grep -q "inline"; then echo "$x" >> list; fi; done' -- {} +

if [ -s list ]; then
        while read -r x; do
                ls -l "$x"
                filefrag -v "$x"
                sum="$(sha1sum "$x")"
                for y in $(seq 0 99); do
                        sysctl vm.drop_caches=1
                        sum2="$(sha1sum "$x")"
                        if [ "$sum" != "$sum2" ]; then
                                result "$x sum1 $sum sum2 $sum2"
                                exit 1
                        fi
                done
        done < list
        result "No inconsistent reads, $(wc -l < list) inlines"
else
        result "No inline extents"
fi

for x in *9/.; do
        if ! diff -r /usr/. "$x"; then
                result "Differences found in $x"
                # We are looking for corrupted inline extents.
                # Other corruption is interesting but it's not our bug.
                exit 0
        fi
done

result "No corruption found"
exit 0
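
As one way of adjusting the "result" function, here is a sketch of a
variant that appends to a log file instead of sending mail.  The
RESULT_LOG variable and its default path are assumptions for
illustration.  The log should live outside /try, since that filesystem
is recreated on every run.

```shell
#!/bin/sh
# Hypothetical variant of result(): log to a file instead of mailing.
# RESULT_LOG is an assumed name and is not used by the original script.
RESULT_LOG="${RESULT_LOG:-/var/tmp/repro-results.log}"

result () {
        # One line per result: UTC timestamp, kernel release, message.
        printf '%s %s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$(uname -r)" "$*" >> "$RESULT_LOG"
}
```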
