On Wed, Mar 20, 2019 at 09:54:16PM +0800, Anand Jain wrote:
>
> On 3/20/19 2:27 PM, Qu Wenruo wrote:
> >
> > On 2019/3/20 1:47 PM, Anand Jain wrote:
> > > > > > > A tree-based integrity verification is important for all data,
> > > > > > > but it is missing.
> > > > > > > Fix:
> > > > > > > In this RFC patch it proposes to use the same disk from which
> > > > > > > the metadata is read to read the data.
> > > > > >
> > > > > > The obvious problem I found is, the idea only works for RAID1/10.
> > > > > >
> > > > > > For striped profiles it makes no sense, and may even have a worse
> > > > > > chance of getting stale data.
> > > > > >
> > > > > > To me, the idea of using a possibly better mirror makes some
> > > > > > sense, but it is very profile limited.
> > > > >
> > > > > Yep. This problem and fix are only for the mirror-based profiles
> > > > > such as raid1/raid10.
> > > >
> > > > Then the current implementation lacks such a check.
> > > >
> > > > Furthermore, data and metadata can lie in different chunks and have
> > > > different chunk types.
> > >
> > > Right. Current tests for this RFC were only for raid1.
> > >
> > > But the final patch can fix that.
> > >
> > > In fact the current patch works for all the cases except when a
> > > replace is running and for a mix of metadata:raid1 and data:raid56.
> > >
> > > We need some cleanups in mirror_num; basically we need to bring it
> > > under a #define and handle it accordingly in __btrfs_map_block().
> >
> > Wait a minute.
> >
> > There is a hidden pitfall from the very beginning.
> >
> > Consider such a chunk layout:
> > Chunk A Type DATA|RAID1
> >   Stripe 1: Dev 1
> >   Stripe 2: Dev 2
> >
> > Chunk B Type METADATA|RAID1
> >   Stripe 1: Dev 2
> >   Stripe 2: Dev 1
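To make the per-chunk mapping point concrete: the stripe-index-to-devid
mapping is stored per chunk item, and nothing forces a DATA chunk and a
METADATA chunk to order their stripes the same way. A minimal way to see
this from userspace (the device path is only an example, and the layout you
get depends on the filesystem):

  # List each chunk's profile and which devid backs each stripe index.
  # If the same stripe index maps to devid 2 in a METADATA|RAID1 chunk but
  # devid 1 in a DATA|RAID1 chunk, a mirror number carried over from the
  # metadata read does not pick the same device for the data read.
  btrfs inspect-internal dump-tree -t chunk /dev/sdb | \
      grep -E "CHUNK_ITEM|type [A-Z]|stripe [0-9]+ devid"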
> Right this fix can't solve this case.
> So one down. The other choice I had is to

I think it's more than one down. I think *only* a handful of scenarios can
work. What happens if there are 4 disks?

Chunk A Type DATA|RAID1
  Stripe 1: Dev 1
  Stripe 2: Dev 2

Chunk B Type METADATA|RAID1
  Stripe 1: Dev 3
  Stripe 2: Dev 4

How do you know if dev 2 lost some data writes, but came back in time to
update the superblock? Which disk do you plan to read data from if Dev 3
fails a generation check?

> Quarantine the whole disk by comparing the dev1 and dev2 generation # in
> the superblock during mount. or
> Quarantine whole chunk(s) by comparing the dev1 and dev2 generation # in
> the chunk tree during mount. or
> Always read the eb from both dev1 and dev2; if one fails, then fail its
> corresponding data block as well, if any. or
> During mount compare the dev1 and dev2 generation #, and record the range
> of generation numbers missing on the disk. But this needs on-disk format
> changes. (But nodatacow, notrunc reg data extent writes must change gen#.) or
> Similar to btree eb read pages, nest the extent data read as well.
> (But nodatacow, notrunc reg data extent writes must change gen#.)
>
> Thanks, Anand
> >
> > Then when we find stale metadata in chunk B mirror 1, caused by dev 2,
> > your patch considers device 2 stale and tries to use mirror num 2 to
> > read from the data chunk.
> >
> > However in the data chunk, mirror num 2 still means device 2, and we
> > get stale data.
> >
> > So the eb->mirror_num can still map to a bad/stale device, due to the
> > flexibility provided by btrfs per-chunk mapping.
> >
> > Thanks,
> > Qu
> >
> > > > > > Another idea inspired by this one is to make it more generic,
> > > > > > so that a bad/stale device gets a lower priority.
> > > > >
> > > > > When it comes to reading junk data, it's not about priority, it's
> > > > > about elimination. When the problem is only a few blocks, I am
> > > > > against marking the whole disk as bad.
> > > > >
> > > > > > Although it suffers the same problem as I described.
> > > > > >
> > > > > > To make the point short, the use case looks very limited.
> > > > >
> > > > > It applies to raid1/raid10 with nodatacow (which implies
> > > > > nodatasum). In my understanding that's not rare.
> > > > >
> > > > > Any comments on the fix offered here?
> > > >
> > > > On the implementation side: is eb->read_mirror reliable?
> > > >
> > > > E.g. what if the data and the eb are in different chunks, and the
> > > > staleness happens in the chunk of the eb but not in the data chunk?
> > >
> > > The eb and regular data are indeed always in different chunks. But the
> > > eb can never be stale, as there is the parent transid which is verified
> > > against the eb that was read. However we do not have such a check for
> > > the data (this is the core of the issue here), and so we return the
> > > junk data silently.
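As a point of reference for the first alternative above (comparing
per-device superblock generations at mount time), the comparison can at
least be approximated from userspace. A minimal sketch, assuming a
two-device RAID1 with example device paths, and keeping in mind the caveat
raised earlier that a device can catch up on superblock updates while still
holding stale data blocks:

  # Print the superblock generation recorded on each member device. A
  # device that missed whole transactions will usually report an older
  # generation here, but this check alone cannot prove that its data
  # blocks are current.
  for dev in /dev/sdb /dev/sdc; do
      gen=$(btrfs inspect-internal dump-super "$dev" | awk '/^generation/ {print $2}')
      echo "$dev: superblock generation $gen"
  done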
> > > Also, any idea why the generation number for the extent data is not
> > > incremented [2] when -o nodatacow and the notrunc option are used? Is
> > > it a bug? The dump-tree output is taken with the script below [1].
> > > (This corruption is seen whether or not the generation number is
> > > incremented, but as another way to fix the corruption we could verify
> > > the inode EXTENT_DATA generation from the same disk from which the
> > > data is read.)
> > >
> > > [1]
> > > umount /btrfs; mkfs.btrfs -fq -dsingle -msingle /dev/sdb && \
> > > mount -o notreelog,max_inline=0,nodatasum /dev/sdb /btrfs && \
> > > echo 1st write: && \
> > > dd status=none if=/dev/urandom of=/btrfs/anand bs=4096 count=1 conv=fsync,notrunc && sync && \
> > > btrfs in dump-tree /dev/sdb | egrep -A7 "257 INODE_ITEM 0\) item" && \
> > > echo --- && \
> > > btrfs in dump-tree /dev/sdb | grep -A4 "257 EXTENT_DATA" && \
> > > echo 2nd write: && \
> > > dd status=none if=/dev/urandom of=/btrfs/anand bs=4096 count=1 conv=fsync,notrunc && sync && \
> > > btrfs in dump-tree /dev/sdb | egrep -A7 "257 INODE_ITEM 0\) item" && \
> > > echo --- && \
> > > btrfs in dump-tree /dev/sdb | grep -A4 "257 EXTENT_DATA"
> > >
> > > 1st write:
> > >     item 4 key (257 INODE_ITEM 0) itemoff 15881 itemsize 160
> > >         generation 6 transid 6 size 4096 nbytes 4096
> > >         block group 0 mode 100644 links 1 uid 0 gid 0 rdev 0
> > >         sequence 1 flags 0x3(NODATASUM|NODATACOW)
> > >         atime 1553058460.163985452 (2019-03-20 13:07:40)
> > >         ctime 1553058460.163985452 (2019-03-20 13:07:40)
> > >         mtime 1553058460.163985452 (2019-03-20 13:07:40)
> > >         otime 1553058460.163985452 (2019-03-20 13:07:40)
> > > ---
> > >     item 6 key (257 EXTENT_DATA 0) itemoff 15813 itemsize 53
> > >         generation 6 type 1 (regular)
> > >         extent data disk byte 13631488 nr 4096
> > >         extent data offset 0 nr 4096 ram 4096
> > >         extent compression 0 (none)
> > > 2nd write:
> > >     item 4 key (257 INODE_ITEM 0) itemoff 15881 itemsize 160
> > >         generation 6 transid 7 size 4096 nbytes 4096
> > >         block group 0 mode 100644 links 1 uid 0 gid 0 rdev 0
> > >         sequence 2 flags 0x3(NODATASUM|NODATACOW)
> > >         atime 1553058460.163985452 (2019-03-20 13:07:40)
> > >         ctime 1553058460.189985450 (2019-03-20 13:07:40)
> > >         mtime 1553058460.189985450 (2019-03-20 13:07:40)
> > >         otime 1553058460.163985452 (2019-03-20 13:07:40)
> > > ---
> > >     item 6 key (257 EXTENT_DATA 0) itemoff 15813 itemsize 53
> > >         generation 6 type 1 (regular)        <----- [2]
> > >         extent data disk byte 13631488 nr 4096
> > >         extent data offset 0 nr 4096 ram 4096
> > >         extent compression 0 (none)
> > >
> > > Thanks, Anand
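If it helps narrow [2] down, a small, untested extension of the script
above contrasts the in-place nocow overwrite with an overwrite that
truncates first: without conv=notrunc, dd truncates the file, the old
extent is dropped, and the freshly inserted EXTENT_DATA item should then
carry the newer transaction's generation.

  echo "3rd write, with truncation:"
  dd status=none if=/dev/urandom of=/btrfs/anand bs=4096 count=1 conv=fsync && sync
  btrfs in dump-tree /dev/sdb | grep -A4 "257 EXTENT_DATA"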