On 2019/3/19 下午7:35, Anand Jain wrote: > In case of file regular extent (non inline), the metadata and data are > read from two different IO operations. When we read the metadata using > the btree each extent block gets verified with the expected transid as > per its parent. So suppose if any of the block is stale it gets reported > and corrected by reading from the mirror (consider raid1). > > This also means the data extent is not yet read or verified and this > happens when application tries to read the file. > When application tries to read the file btrfs_readpage(), it shall first > read file extent map as provided by the metadata in the above operation > (which is assured to be consistent). > And when file data has to be read using the above filemap as it again > has the choice of using either of the mirrors, suppose if the IO is > routed to the disk on which we haven't verified the metadata on it and > if there is write hole on that disk, then we shall be reading the junk > data without being reported anywhere. > > This is reproducible with the test case [1] in this email thread. > [1] > fstests: btrfs test read from disks of different generation > > As we have data csum most of these get caught and reported as csum > error.
So far so good. > But csum verification is a point in verification and its not a > tree based transid verification. Which means if there is a stale data > with matching csum we may return a junk data silently. Then the normal idea is to use stronger but slower csum in the first place, to avoid the csum match case. > This problem is > easily reproducible when csum is disabled but not impossible to achieve > when csum is not disabled as well. Under this case, it's the user to be blamed for the decision to disable the csum in the first place. > A tree based integrity verification > is important for all data, which is missing. > > Fix: > In this RFC patch it proposes to use same disk from with the metadata > is read to read the data. The obvious problem I found is, the idea only works for RAID1/10. For striped profile it makes no sense, or even have a worse chance to get stale data. To me, the idea of using possible better mirror makes some sense, but very profile limited. Another idea I get inspired from the idea is, make it more generic so that bad/stale device get a lower priority. Although it suffers the same problem as I described. To make the point short, the use case looks very limited. Thanks, Qu > Which means the feature such as readmirror is > less effective (which is fine). But if the good data fails (media error) > as usual we shall read the mirrored data without being aware that this > data may be stale. To fix that, we have to invalidate the corresponding > metadata and re-read the metadata from other disk (this part is not > implemented in this patch). > > The other better solution is - a tree based verification, which means > we have to verify the data-extent's metadata in the extent tree when its > read - but there is a problem we don't update the generation number in > the extent tree for all types of write, for example consider overwrite > with notrunc option on a filesystem mounted with nodatacow option [1], > which I wonder if its a bug or if its like that for a reason? as because > there is no cow? > > [1] The EXTENT_DATA generation is same before and after dd with notrunc > > mkfs.btrfs -fq /dev/sdb && mount -o notreelog,nodatacow /dev/sdb /btrfs > dd if=/dev/urandom of=/btrfs/tf1 bs=4096 count=1 conv=fsync,notrunc && > sync > btrfs in dump-tree /dev/sdb | grep -A4 "257 EXTENT_DATA 0" > btrfs in dump-tree /dev/sdb | egrep -A7 "257 INODE_ITEM 0\) itemoff" > dd if=/dev/urandom of=/btrfs/tf1 bs=4096 count=1 conv=fsync,notrunc && > sync > btrfs in dump-tree /dev/sdb | grep -A4 "257 EXTENT_DATA 0" > btrfs in dump-tree /dev/sdb | egrep -A7 "257 INODE_ITEM 0\) itemoff" > > --------- > item 6 key (257 EXTENT_DATA 0) itemoff 15819 itemsize 53 > generation 6 type 1 (regular) > extent data disk byte 13631488 nr 4096 > extent data offset 0 nr 4096 ram 4096 > extent compression 0 (none) > item 4 key (257 INODE_ITEM 0) itemoff 15885 itemsize 160 > generation 6 transid 6 size 4096 nbytes 4096 > block group 0 mode 100644 links 1 uid 0 gid 0 rdev 0 > sequence 66785 flags 0x3(NODATASUM|NODATACOW) > atime 1552993569.276079763 (2019-03-19 19:06:09) > ctime 1552993569.276079763 (2019-03-19 19:06:09) > mtime 1552993569.276079763 (2019-03-19 19:06:09) > otime 1552993569.276079763 (2019-03-19 19:06:09) > 1+0 records in > 1+0 records out > 4096 bytes (4.1 kB) copied, 0.00302717 s, 1.4 MB/s > item 6 key (257 EXTENT_DATA 0) itemoff 15819 itemsize 53 > generation 6 type 1 (regular) > extent data disk byte 13631488 nr 4096 > extent data offset 0 nr 4096 ram 4096 > extent compression 0 (none) > item 4 key (257 INODE_ITEM 0) itemoff 15885 itemsize 160 > generation 6 transid 7 size 4096 nbytes 4096 > block group 0 mode 100644 links 1 uid 0 gid 0 rdev 0 > sequence 66786 flags 0x3(NODATASUM|NODATACOW) > atime 1552993569.276079763 (2019-03-19 19:06:09) > ctime 1552993569.302079763 (2019-03-19 19:06:09) > mtime 1552993569.302079763 (2019-03-19 19:06:09) > otime 1552993569.276079763 (2019-03-19 19:06:09) > ------- > > Comments appreciated. > > > Signed-off-by: Anand Jain <anand.j...@oracle.com> > --- > fs/btrfs/extent_io.c | 41 ++++++++++++++++++++++++++++++++++------- > 1 file changed, 34 insertions(+), 7 deletions(-) > > diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c > index 31892052a24f..f1cc21e8dff7 100644 > --- a/fs/btrfs/extent_io.c > +++ b/fs/btrfs/extent_io.c > @@ -3079,7 +3079,7 @@ static inline void __do_contiguous_readpages(struct > extent_io_tree *tree, > struct extent_map **em_cached, > struct bio **bio, > unsigned long *bio_flags, > - u64 *prev_em_start) > + u64 *prev_em_start, int mirror) > { > struct inode *inode; > struct btrfs_ordered_extent *ordered; > @@ -3099,7 +3099,7 @@ static inline void __do_contiguous_readpages(struct > extent_io_tree *tree, > > for (index = 0; index < nr_pages; index++) { > __do_readpage(tree, pages[index], btrfs_get_extent, em_cached, > - bio, 0, bio_flags, REQ_RAHEAD, prev_em_start); > + bio, mirror, bio_flags, REQ_RAHEAD, > prev_em_start); > put_page(pages[index]); > } > } > @@ -3109,7 +3109,7 @@ static void __extent_readpages(struct extent_io_tree > *tree, > int nr_pages, > struct extent_map **em_cached, > struct bio **bio, unsigned long *bio_flags, > - u64 *prev_em_start) > + u64 *prev_em_start, int mirror) > { > u64 start = 0; > u64 end = 0; > @@ -3130,7 +3130,7 @@ static void __extent_readpages(struct extent_io_tree > *tree, > index - first_index, start, > end, em_cached, > bio, bio_flags, > - prev_em_start); > + prev_em_start, mirror); > start = page_start; > end = start + PAGE_SIZE - 1; > first_index = index; > @@ -3141,7 +3141,7 @@ static void __extent_readpages(struct extent_io_tree > *tree, > __do_contiguous_readpages(tree, &pages[first_index], > index - first_index, start, > end, em_cached, bio, > - bio_flags, prev_em_start); > + bio_flags, prev_em_start, mirror); > } > > static int __extent_read_full_page(struct extent_io_tree *tree, > @@ -4093,6 +4093,27 @@ int extent_writepages(struct address_space *mapping, > return ret; > } > > +static int btrfs_get_read_mirror(struct inode *inode) > +{ > + int ret; > + struct btrfs_path *path; > + > + path = btrfs_alloc_path(); > + if (!path) > + return -ENOMEM; > + > + ret = btrfs_lookup_file_extent(NULL, BTRFS_I(inode)->root, path, > + btrfs_ino(BTRFS_I(inode)), 0, 0); > + if (!ret) { > + struct extent_buffer *eb = path->nodes[0]; > + > + ret = eb->read_mirror; > + } > + > + btrfs_free_path(path); > + return ret; > +} > + > int extent_readpages(struct address_space *mapping, struct list_head *pages, > unsigned nr_pages) > { > @@ -4103,6 +4124,11 @@ int extent_readpages(struct address_space *mapping, > struct list_head *pages, > struct extent_io_tree *tree = &BTRFS_I(mapping->host)->io_tree; > int nr = 0; > u64 prev_em_start = (u64)-1; > + int mirror; > + > + mirror = btrfs_get_read_mirror(mapping->host); > + if (mirror < 0) > + return mirror; > > while (!list_empty(pages)) { > for (nr = 0; nr < ARRAY_SIZE(pagepool) && !list_empty(pages);) { > @@ -4120,14 +4146,15 @@ int extent_readpages(struct address_space *mapping, > struct list_head *pages, > } > > __extent_readpages(tree, pagepool, nr, &em_cached, &bio, > - &bio_flags, &prev_em_start); > + &bio_flags, &prev_em_start, mirror); > + > } > > if (em_cached) > free_extent_map(em_cached); > > if (bio) > - return submit_one_bio(bio, 0, bio_flags); > + return submit_one_bio(bio, mirror, bio_flags); > return 0; > } > >
signature.asc
Description: OpenPGP digital signature