On Mon, Feb 22, 2016 at 9:46 AM, Alex Lyakas <[email protected]> wrote: > Thank you, Filipe, for your review. > > On Mon, Feb 22, 2016 at 3:05 AM, Filipe Manana <[email protected]> wrote: >> On Sun, Feb 21, 2016 at 3:36 PM, Alex Lyakas <[email protected]> wrote: >>> csum_dirty_buffer was issuing a warning in case the extent buffer >>> did not look alright, but was still returning success. >>> Let's return error in this case, and also add two additional sanity >>> checks on the extent buffer header. >>> >>> We had btrfs metadata corruption, and after looking at the logs we saw >>> that WARN_ON(found_start != start) has been triggered. We are still >>> investigating >> >> There's a warning for WARN_ON(found_start != start || !PageUptodate(page)) >> >> Are you sure it triggered only because of found_start != start and not >> because of !PageUptodate(page) (or both)? > The problem initially happened on kernel 3.8.13. In this kernel, the > code looks like this: > found_start = btrfs_header_bytenr(eb); > if (found_start != start) { > WARN_ON(1); > return 0; > } > if (!PageUptodate(page)) { > WARN_ON(1); > return 0; > } > (You can see it on > http://lxr.free-electrons.com/source/fs/btrfs/disk-io.c?v=3.8#L420) > The WARN_ON that we hit was on the found_start comparison.
Ok, I see now that one of those useless cleanup patches merged both conditions into a single if some time ago. > >> >>> which component trashed the cache page which belonged to btrfs. But btrfs >>> only issued a warning, and as a result, the corrupted metadata block went to >>> disk. >>> >>> I think we should return an error in such case that the extent buffer >>> doesn't look alright. >> >> I think so too. >> >>> The caller up the chain may BUG_ON on this, for example flush_epd_write_bio >>> will, >>> but it is better than to have a silent metadata corruption on disk. >>> >>> Note: this patch has been properly tested on 3.18 kernel only. >>> >>> Signed-off-by: Alex Lyakas <[email protected]> >>> --- >>> fs/btrfs/disk-io.c | 14 ++++++++++++-- >>> 1 file changed, 12 insertions(+), 2 deletions(-) >>> >>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c >>> index 4545e2e..701e706 100644 >>> --- a/fs/btrfs/disk-io.c >>> +++ b/fs/btrfs/disk-io.c >>> @@ -508,22 +508,32 @@ static int csum_dirty_buffer(struct btrfs_fs_info >>> *fs_info, struct page *page) >>> { >>> u64 start = page_offset(page); >>> u64 found_start; >>> struct extent_buffer *eb; >>> >>> eb = (struct extent_buffer *)page->private; >>> if (page != eb->pages[0]) >>> return 0; >>> found_start = btrfs_header_bytenr(eb); >>> if (WARN_ON(found_start != start || !PageUptodate(page))) >>> - return 0; >>> - csum_tree_block(fs_info, eb, 0); >>> + return -EUCLEAN; >>> +#ifdef CONFIG_BTRFS_ASSERT >> >> A bit odd to surround these with CONFIG_BTRFS_ASSERT if we don't do >> assertions. >> I would remove this #ifdef ... #endif or do the memcmp calls inside ASSERT(). > Agreed. > >> >>> + if (WARN_ON(memcmp_extent_buffer(eb, fs_info->fsid, >>> + (unsigned long)btrfs_header_fsid(), BTRFS_FSID_SIZE))) >>> + return -EUCLEAN; >>> + if (WARN_ON(memcmp_extent_buffer(eb, fs_info->fsid, >>> + (unsigned long)btrfs_header_chunk_tree_uuid(eb), >>> + BTRFS_FSID_SIZE))) >> >> This second comparison doesn't seem correct. Second argument to >> memcmp_extent_buffer should be fs_info->chunk_tree_uuid, which >> shouldn't be the same as the fsid (take a look at utils.c:make_btrfs() >> in the tools, both uuids are generated by different calls to >> uuid_generate()) - did you make your tests only before adding this >> comparison?. Also you should use BTRFS_UUID_SIZE instead of >> BTRFS_FSID_SIZE (even if both have the same value). > Obviously, you are right. In the 3.18-based code that I fixed locally > here, the fix looks like this: > > if (found_start != start) { > ZBTRFS_WARN(1, "FS[%s]: header_bytenr(eb)(%llu) != > page->index<<PAGE_CACHE_SHIFT(%llu)", root->fs_info->sb->s_id, > found_start, start); > return -EUCLEAN; > } > if (!PageUptodate(page)) { > ZBTRFS_WARN(1, "FS[%s]: eb bytenr=%llu page->index(%llu) > !PageUptodate", root->fs_info->sb->s_id, start, (u64)page->index); > return -EUCLEAN; > } > if (memcmp_extent_buffer(eb, root->fs_info->fsid, (unsigned > long)btrfs_header_fsid(), BTRFS_FSID_SIZE)) { > u8 hdr_fsid[BTRFS_FSID_SIZE] = {0}; > > read_extent_buffer(eb, hdr_fsid, (unsigned > long)btrfs_header_fsid(), BTRFS_FSID_SIZE); > ZBTRFS_WARN(1, "FS[%s]: eb bytenr=%llu header->fsid["PRIX128"] > != fs_info->fsid["PRIX128"]", root->fs_info->sb->s_id, start, > PRI_UUID(hdr_fsid), PRI_UUID(root->fs_info->fsid)); > return -EUCLEAN; > } > if (memcmp_extent_buffer(eb, root->fs_info->chunk_tree_uuid, > (unsigned long)btrfs_header_chunk_tree_uuid(eb), BTRFS_UUID_SIZE)) { > u8 hdr_ch_uuid[BTRFS_UUID_SIZE] = {0}; > > read_extent_buffer(eb, hdr_ch_uuid, (unsigned > long)btrfs_header_chunk_tree_uuid(eb), BTRFS_UUID_SIZE); > ZBTRFS_WARN(1, "FS[%s]: eb bytenr=%llu > header->chunk_tree_uuid["PRIX128"] != > fs_info->chunk_tree_uuid["PRIX128"]", root->fs_info->sb->s_id, start, > PRI_UUID(hdr_ch_uuid), > PRI_UUID(root->fs_info->chunk_tree_uuid)); > return -EUCLEAN; > } > if (csum_tree_block(root, eb, 0) != 0) { > ZBTRFS_WARN(1, "FS[%s]: eb bytenr=%llu csum_tree_block() > failure", root->fs_info->sb->s_id, start); > return -EUCLEAN; > } > > ZBTRFS_WARN, PRIX128/PRI_UUID are some custom macros that I added to > properly test the fix: > #define ZBTRFS_WARN(condition, format, ...) WARN(condition, format, > ##__VA_ARGS__) > #define PRIX128 > "%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X" > #define PRI_UUID(uuid) ((u8*)(uuid))[0], ((u8*)(uuid))[1], > ((u8*)(uuid))[2], ((u8*)(uuid))[3], \ > ((u8*)(uuid))[4], ((u8*)(uuid))[5], > ((u8*)(uuid))[6], ((u8*)(uuid))[7], \ > ((u8*)(uuid))[8], ((u8*)(uuid))[9], > ((u8*)(uuid))[10], ((u8*)(uuid))[11], \ > ((u8*)(uuid))[12], ((u8*)(uuid))[13], > ((u8*)(uuid))[14], ((u8*)(uuid))[15] > > > Then I just pulled the latest Linux tree and fixed it there, and > obviously did not check it properly. > >> >>> + return -EUCLEAN; >>> +#endif >>> + if (csum_tree_block(fs_info, eb, 0)) >>> + return -EUCLEAN; >> >> I would just return the real error from csum_tree_block() - currently >> it returns 1 for all possible failures instead of its real possible >> failures: -ENOMEM or -EINVAL. > Returning a positive errno is a bit risky, no? Somebody might only > check ret<0, for example, and will miss the error (like > flush_epd_write_bio). > That's why I preferred to return a negative error, because I saw that > csum_tree_block returns 1. Right, what I meant before but didn't phrase it properly, was to make csum_tree_block() returns a negative errno instead of 1 on failure. And then just make csum_dirty_buffer() return whatever csum_tree_block() returns as an error. thanks > > Thanks, > Alex. > > >> >> Thanks. >> >>> return 0; >>> } >>> >>> static int check_tree_block_fsid(struct btrfs_fs_info *fs_info, >>> struct extent_buffer *eb) >>> { >>> struct btrfs_fs_devices *fs_devices = fs_info->fs_devices; >>> u8 fsid[BTRFS_UUID_SIZE]; >>> int ret = 1; >>> >>> -- >>> 1.9.1 >>> -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to [email protected] More majordomo info at http://vger.kernel.org/majordomo-info.html
