Thank you, Filipe, for your review. On Mon, Feb 22, 2016 at 3:05 AM, Filipe Manana <fdman...@kernel.org> wrote: > On Sun, Feb 21, 2016 at 3:36 PM, Alex Lyakas <a...@zadarastorage.com> wrote: >> csum_dirty_buffer was issuing a warning in case the extent buffer >> did not look alright, but was still returning success. >> Let's return error in this case, and also add two additional sanity >> checks on the extent buffer header. >> >> We had btrfs metadata corruption, and after looking at the logs we saw >> that WARN_ON(found_start != start) has been triggered. We are still >> investigating > > There's a warning for WARN_ON(found_start != start || !PageUptodate(page)) > > Are you sure it triggered only because of found_start != start and not > because of !PageUptodate(page) (or both)? The problem initially happened on kernel 3.8.13. In this kernel, the code looks like this: found_start = btrfs_header_bytenr(eb); if (found_start != start) { WARN_ON(1); return 0; } if (!PageUptodate(page)) { WARN_ON(1); return 0; } (You can see it on http://lxr.free-electrons.com/source/fs/btrfs/disk-io.c?v=3.8#L420) The WARN_ON that we hit was on the found_start comparison.
> >> which component trashed the cache page which belonged to btrfs. But btrfs >> only issued a warning, and as a result, the corrupted metadata block went to >> disk. >> >> I think we should return an error in such case that the extent buffer >> doesn't look alright. > > I think so too. > >> The caller up the chain may BUG_ON on this, for example flush_epd_write_bio >> will, >> but it is better than to have a silent metadata corruption on disk. >> >> Note: this patch has been properly tested on 3.18 kernel only. >> >> Signed-off-by: Alex Lyakas <a...@zadarastorage.com> >> --- >> fs/btrfs/disk-io.c | 14 ++++++++++++-- >> 1 file changed, 12 insertions(+), 2 deletions(-) >> >> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c >> index 4545e2e..701e706 100644 >> --- a/fs/btrfs/disk-io.c >> +++ b/fs/btrfs/disk-io.c >> @@ -508,22 +508,32 @@ static int csum_dirty_buffer(struct btrfs_fs_info >> *fs_info, struct page *page) >> { >> u64 start = page_offset(page); >> u64 found_start; >> struct extent_buffer *eb; >> >> eb = (struct extent_buffer *)page->private; >> if (page != eb->pages[0]) >> return 0; >> found_start = btrfs_header_bytenr(eb); >> if (WARN_ON(found_start != start || !PageUptodate(page))) >> - return 0; >> - csum_tree_block(fs_info, eb, 0); >> + return -EUCLEAN; >> +#ifdef CONFIG_BTRFS_ASSERT > > A bit odd to surround these with CONFIG_BTRFS_ASSERT if we don't do > assertions. > I would remove this #ifdef ... #endif or do the memcmp calls inside ASSERT(). Agreed. > >> + if (WARN_ON(memcmp_extent_buffer(eb, fs_info->fsid, >> + (unsigned long)btrfs_header_fsid(), BTRFS_FSID_SIZE))) >> + return -EUCLEAN; >> + if (WARN_ON(memcmp_extent_buffer(eb, fs_info->fsid, >> + (unsigned long)btrfs_header_chunk_tree_uuid(eb), >> + BTRFS_FSID_SIZE))) > > This second comparison doesn't seem correct. Second argument to > memcmp_extent_buffer should be fs_info->chunk_tree_uuid, which > shouldn't be the same as the fsid (take a look at utils.c:make_btrfs() > in the tools, both uuids are generated by different calls to > uuid_generate()) - did you make your tests only before adding this > comparison?. Also you should use BTRFS_UUID_SIZE instead of > BTRFS_FSID_SIZE (even if both have the same value). Obviously, you are right. In the 3.18-based code that I fixed locally here, the fix looks like this: if (found_start != start) { ZBTRFS_WARN(1, "FS[%s]: header_bytenr(eb)(%llu) != page->index<<PAGE_CACHE_SHIFT(%llu)", root->fs_info->sb->s_id, found_start, start); return -EUCLEAN; } if (!PageUptodate(page)) { ZBTRFS_WARN(1, "FS[%s]: eb bytenr=%llu page->index(%llu) !PageUptodate", root->fs_info->sb->s_id, start, (u64)page->index); return -EUCLEAN; } if (memcmp_extent_buffer(eb, root->fs_info->fsid, (unsigned long)btrfs_header_fsid(), BTRFS_FSID_SIZE)) { u8 hdr_fsid[BTRFS_FSID_SIZE] = {0}; read_extent_buffer(eb, hdr_fsid, (unsigned long)btrfs_header_fsid(), BTRFS_FSID_SIZE); ZBTRFS_WARN(1, "FS[%s]: eb bytenr=%llu header->fsid["PRIX128"] != fs_info->fsid["PRIX128"]", root->fs_info->sb->s_id, start, PRI_UUID(hdr_fsid), PRI_UUID(root->fs_info->fsid)); return -EUCLEAN; } if (memcmp_extent_buffer(eb, root->fs_info->chunk_tree_uuid, (unsigned long)btrfs_header_chunk_tree_uuid(eb), BTRFS_UUID_SIZE)) { u8 hdr_ch_uuid[BTRFS_UUID_SIZE] = {0}; read_extent_buffer(eb, hdr_ch_uuid, (unsigned long)btrfs_header_chunk_tree_uuid(eb), BTRFS_UUID_SIZE); ZBTRFS_WARN(1, "FS[%s]: eb bytenr=%llu header->chunk_tree_uuid["PRIX128"] != fs_info->chunk_tree_uuid["PRIX128"]", root->fs_info->sb->s_id, start, PRI_UUID(hdr_ch_uuid), PRI_UUID(root->fs_info->chunk_tree_uuid)); return -EUCLEAN; } if (csum_tree_block(root, eb, 0) != 0) { ZBTRFS_WARN(1, "FS[%s]: eb bytenr=%llu csum_tree_block() failure", root->fs_info->sb->s_id, start); return -EUCLEAN; } ZBTRFS_WARN, PRIX128/PRI_UUID are some custom macros that I added to properly test the fix: #define ZBTRFS_WARN(condition, format, ...) WARN(condition, format, ##__VA_ARGS__) #define PRIX128 "%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X" #define PRI_UUID(uuid) ((u8*)(uuid))[0], ((u8*)(uuid))[1], ((u8*)(uuid))[2], ((u8*)(uuid))[3], \ ((u8*)(uuid))[4], ((u8*)(uuid))[5], ((u8*)(uuid))[6], ((u8*)(uuid))[7], \ ((u8*)(uuid))[8], ((u8*)(uuid))[9], ((u8*)(uuid))[10], ((u8*)(uuid))[11], \ ((u8*)(uuid))[12], ((u8*)(uuid))[13], ((u8*)(uuid))[14], ((u8*)(uuid))[15] Then I just pulled the latest Linux tree and fixed it there, and obviously did not check it properly. > >> + return -EUCLEAN; >> +#endif >> + if (csum_tree_block(fs_info, eb, 0)) >> + return -EUCLEAN; > > I would just return the real error from csum_tree_block() - currently > it returns 1 for all possible failures instead of its real possible > failures: -ENOMEM or -EINVAL. Returning a positive errno is a bit risky, no? Somebody might only check ret<0, for example, and will miss the error (like flush_epd_write_bio). That's why I preferred to return a negative error, because I saw that csum_tree_block returns 1. Thanks, Alex. > > Thanks. > >> return 0; >> } >> >> static int check_tree_block_fsid(struct btrfs_fs_info *fs_info, >> struct extent_buffer *eb) >> { >> struct btrfs_fs_devices *fs_devices = fs_info->fs_devices; >> u8 fsid[BTRFS_UUID_SIZE]; >> int ret = 1; >> >> -- >> 1.9.1 >> -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html