On 2019/3/20 1:47 PM, Anand Jain wrote:
>
>
>>>>> A tree based integrity verification
>>>>> is important for all data, which is missing.
>>>>> Fix:
>>>>> In this RFC patch it proposes to use the same disk from which the
>>>>> metadata is read to read the data.
>>>>
>>>> The obvious problem I found is, the idea only works for RAID1/10.
>>>>
>>>> For striped profiles it makes no sense, or even has a worse chance
>>>> of getting stale data.
>>>>
>>>>
>>>> To me, the idea of using a possibly better mirror makes some sense,
>>>> but is very profile limited.
>>>
>>> Yep. This problem and fix are only for the mirror based profiles
>>> such as raid1/raid10.
>>
>> Then the current implementation lacks such a check.
>>
>> Furthermore, data and metadata can lie in different chunks and have
>> different chunk types.
>
> Right. Current tests for this RFC were only for raid1.
>
> But the final patch can fix that.
>
> In fact the current patch works for all the cases except for the case
> where replace is running with a mix of metadata:raid1 and data:raid56.
>
> We need some cleanups in mirror_num; basically we need to bring it
> under a #define and handle it accordingly in __btrfs_map_block().
>
>>>> Another idea I got inspired by is to make it more generic, so
>>>> that a bad/stale device gets a lower priority.
>>>
>>> When it comes to reading junk data, it's not about the priority,
>>> it's about eliminating it. When the problem is only a few blocks,
>>> I am against marking the whole disk as bad.
>>>
>>>> Although it suffers the same problem as I described.
>>>>
>>>> To make the point short, the use case looks very limited.
>>>
>>> It applies to raid1/raid10 with nodatacow (which implies nodatasum).
>>> In my understanding that's not rare.
>>>
>>> Any comments on the fix offered here?
>>
>> The implementation question is: is eb->read_mirror reliable?
>>
>> E.g. what if the data and the eb are in different chunks, and the
>> staleness happens in the chunk of the eb but not in the data chunk?
>
>
> eb and regular data are indeed always in different chunks. But an eb
> can never be stale, as there is a parent transid which is verified
> against the read eb. However, we do not have such a check for the data
> (this is the core of the issue here), and so we return the junk data
> silently.
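(For readers of the archive, the asymmetry described above can be sketched as a toy shell model. This is purely illustrative: the function names and transid values below are made up, and the real verification happens in the kernel's metadata read path, not in user space.)

```shell
#!/bin/sh
# Toy model of the asymmetry: the parent node records the transid it
# expects the child eb to carry, so a stale mirror copy of metadata is
# detected on read. nodatacow+nodatasum data has no such anchor.
# (Hypothetical function names and values, for illustration only.)

verify_eb() {
    expected=$1   # transid the parent node says this eb should carry
    found=$2      # transid found in the eb that was actually read
    if [ "$expected" -ne "$found" ]; then
        echo "parent transid verify failed: wanted $expected found $found"
        return 1
    fi
    echo "eb ok"
}

read_nodatasum_data() {
    # No transid and no csum to compare against: whatever the chosen
    # mirror returns is handed to the caller as-is.
    echo "data accepted unverified"
}

verify_eb 7 7               # up-to-date mirror: accepted
verify_eb 7 6 || true       # stale mirror: detected and rejected
read_nodatasum_data         # stale nodatacow data: returned silently
```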
Yes, that's the basis of your patch.

And I totally agree with this part, since it's not only the transid:
the eb's content is also checked, almost bit by bit.

So the idea is OK, for mirror based profiles.

> Also, any idea why the generation number for the extent data is not
> incremented [2] when the -o nodatacow and notrunc options are used?
> Is it a bug? The dump-tree output was taken with the script below [1].
> (This corruption is seen whether or not the generation number is
> incremented, but as another way to fix the corruption we could verify
> the inode EXTENT_DATA generation from the same disk from which the
> data is read.)

For the generation part, it's the generation when the data was written
to disk.

A truncation/nocow overwrite shouldn't really change the generation of
existing file extents.

So I'm afraid you can't use that generation to do the check.

Thanks,
Qu

>
> [1]
> umount /btrfs; mkfs.btrfs -fq -dsingle -msingle /dev/sdb && \
> mount -o notreelog,max_inline=0,nodatasum /dev/sdb /btrfs && \
> echo 1st write: && \
> dd status=none if=/dev/urandom of=/btrfs/anand bs=4096 count=1 conv=fsync,notrunc && sync && \
> btrfs in dump-tree /dev/sdb | egrep -A7 "257 INODE_ITEM 0\) item" && \
> echo --- && \
> btrfs in dump-tree /dev/sdb | grep -A4 "257 EXTENT_DATA" && \
> echo 2nd write: && \
> dd status=none if=/dev/urandom of=/btrfs/anand bs=4096 count=1 conv=fsync,notrunc && sync && \
> btrfs in dump-tree /dev/sdb | egrep -A7 "257 INODE_ITEM 0\) item" && \
> echo --- && \
> btrfs in dump-tree /dev/sdb | grep -A4 "257 EXTENT_DATA"
>
>
> 1st write:
>         item 4 key (257 INODE_ITEM 0) itemoff 15881 itemsize 160
>                 generation 6 transid 6 size 4096 nbytes 4096
>                 block group 0 mode 100644 links 1 uid 0 gid 0 rdev 0
>                 sequence 1 flags 0x3(NODATASUM|NODATACOW)
>                 atime 1553058460.163985452 (2019-03-20 13:07:40)
>                 ctime 1553058460.163985452 (2019-03-20 13:07:40)
>                 mtime 1553058460.163985452 (2019-03-20 13:07:40)
>                 otime 1553058460.163985452 (2019-03-20 13:07:40)
> ---
>         item 6 key (257 EXTENT_DATA 0) itemoff 15813 itemsize 53
>                 generation 6 type 1 (regular)
>                 extent data disk byte 13631488 nr 4096
>                 extent data offset 0 nr 4096 ram 4096
>                 extent compression 0 (none)
> 2nd write:
>         item 4 key (257 INODE_ITEM 0) itemoff 15881 itemsize 160
>                 generation 6 transid 7 size 4096 nbytes 4096
>                 block group 0 mode 100644 links 1 uid 0 gid 0 rdev 0
>                 sequence 2 flags 0x3(NODATASUM|NODATACOW)
>                 atime 1553058460.163985452 (2019-03-20 13:07:40)
>                 ctime 1553058460.189985450 (2019-03-20 13:07:40)
>                 mtime 1553058460.189985450 (2019-03-20 13:07:40)
>                 otime 1553058460.163985452 (2019-03-20 13:07:40)
> ---
>         item 6 key (257 EXTENT_DATA 0) itemoff 15813 itemsize 53
>                 generation 6 type 1 (regular)        <----- [2]
>                 extent data disk byte 13631488 nr 4096
>                 extent data offset 0 nr 4096 ram 4096
>                 extent compression 0 (none)
>
>
> Thanks, Anand
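(If anyone wants to script the comparison of the two dump-tree snippets, the generation field can be pulled out with something like the sketch below. It is fed from here-docs containing the item dumps quoted above, so it runs without a device; on a live filesystem the input would come from `btrfs inspect-internal dump-tree /dev/sdb` instead.)

```shell
#!/bin/sh
# Extract the EXTENT_DATA generation from dump-tree output. Both
# snippets quoted above report generation 6, matching the observation
# that the nocow overwrite did not bump the file extent's generation.
extent_generation() {
    grep -A1 "257 EXTENT_DATA" | awk '/generation/ { print $2 }'
}

gen_1st=$(extent_generation <<'EOF'
        item 6 key (257 EXTENT_DATA 0) itemoff 15813 itemsize 53
                generation 6 type 1 (regular)
EOF
)

gen_2nd=$(extent_generation <<'EOF'
        item 6 key (257 EXTENT_DATA 0) itemoff 15813 itemsize 53
                generation 6 type 1 (regular)
EOF
)

echo "1st write generation: $gen_1st"
echo "2nd write generation: $gen_2nd"
```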