Re: [RFC][PATCH] btrfs-progs: inspect: new subcommand to dump chunks
On Wed, Jun 22, 2016 at 07:26:46PM +0200, David Sterba wrote:
> From: David Sterba
>
> Hi,
>
> the chunk dump is a useful thing, for debugging or balance filters.
>
> Example output:
>
> Chunks on device id: 1
> PNumber  Type            PStart    Length    PEnd      Age  LStart     Usage
> ----------------------------------------------------------------------------
>       0  System/RAID1    1.00MiB   32.00MiB  33.00MiB   47  1.40TiB     0.06
>       1  Metadata/RAID1  33.00MiB  1.00GiB   1.03GiB    31  1.36TiB    64.03
>       2  Metadata/RAID1  1.03GiB   32.00MiB  1.06GiB    36  1.36TiB    77.28
>       3  Data/single     1.06GiB   1.00GiB   2.06GiB    12  422.30GiB  78.90
>       4  Data/single     2.06GiB   1.00GiB   3.06GiB    11  420.30GiB  78.47
> ...
>
> (full output is at http://susepaste.org/view/raw/089a9877)
>
> This patch does the basic output, no filtering or sorting besides physical
> and logical offset. There are possibilities to do more or enhance the
> output (eg. starting with the logical chunk and listing the related
> physical chunks together). Or filter by type/profile, or understand the
> balance filters format.
>
> As it is now, it's a per-device dump of the physical layout, which was the
> original idea.
>
> Printing 'usage' is not default as it's quite slow, it uses the search
> ioctl and probably not in the best way, or there's some other issue in the
> implementation.
>
> I'll add the patch to the devel branch but will not add it for any
> particular release yet, I'd like some feedback first. Thanks.
>
> ---
> New command 'btrfs inspect-internal dump-chunks' will dump the layout of
> chunks as stored on the devices. This corresponds to the physical
> layout, sorted by the physical offset. The block group usage can be
> shown as well, but the search is too slow so it's off by default.
>
> If the physical offset sorting is selected, the empty space between
> chunks is also shown.
>
> Signed-off-by: David Sterba
> ---
>  cmds-inspect.c | 364 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 364 insertions(+)
>
> diff --git a/cmds-inspect.c b/cmds-inspect.c
> index dd7b9dd278f2..4eace63d8517 100644
> --- a/cmds-inspect.c
> +++ b/cmds-inspect.c
> @@ -623,6 +623,368 @@ static int cmd_inspect_min_dev_size(int argc, char **argv)
>  	return !!ret;
>  }
>
> +static const char * const cmd_dump_chunks_usage[] = {
> +	"btrfs inspect-internal chunk-stats [options] ",
> +	"Show chunks (block groups) layout",
> +	"Show chunks (block groups) layout for all devices",
> +	"",
> +	HELPINFO_UNITS_LONG,
> +	"--sort=MODE        sort by the physical or logical chunk start",
> +	"                   MODE is one of pstart or lstart (default: pstart)",
> +	"--usage            show usage per block group, note this can be slow",
> +	NULL
> +};
> +
> +enum {
> +	CHUNK_SORT_PSTART,
> +	CHUNK_SORT_LSTART,
> +	CHUNK_SORT_DEFAULT = CHUNK_SORT_PSTART
> +};
> +
> +struct dump_chunks_entry {
> +	u64 devid;
> +	u64 start;
> +	u64 lstart;
> +	u64 length;
> +	u64 flags;
> +	u64 age;
> +	u64 used;
> +	u32 pnumber;
> +};
> +
> +struct dump_chunks_ctx {
> +	unsigned length;
> +	unsigned size;
> +	struct dump_chunks_entry *stats;
> +};
> +
> +int cmp_cse_devid_start(const void *va, const void *vb)
> +{
> +	const struct dump_chunks_entry *a = va;
> +	const struct dump_chunks_entry *b = vb;
> +
> +	if (a->devid < b->devid)
> +		return -1;
> +	if (a->devid > b->devid)
> +		return 1;
> +
> +	if (a->start < b->start)
> +		return -1;
> +	if (a->start == b->start) {
> +		error(
> +"chunks start on same offset in the same device: devid %llu start %llu",
> +			(unsigned long long)a->devid,
> +			(unsigned long long)a->start);
> +		return 0;
> +	}
> +	return 1;
> +}
> +
> +int cmp_cse_devid_lstart(const void *va, const void *vb)
> +{
> +	const struct dump_chunks_entry *a = va;
> +	const struct dump_chunks_entry *b = vb;
> +
> +	if (a->devid < b->devid)
> +		return -1;
> +	if (a->devid > b->devid)
> +		return 1;
> +
> +	if (a->lstart < b->lstart)
> +		return -1;
> +	if (a->lstart == b->lstart) {
> +		error(
> +"chunks logically start on same offset in the same device: devid %llu start %llu",
> +			(unsigned long long)a->devid,
> +			(unsigned long long)a->lstart);
> +		return 0;
> +	}
> +	return 1;
> +}
> +
> +void print_dump_chunks(struct dump_chunks_ctx *ctx, unsigned sort_mode,
> +		unsigned unit_mode, int with_usage)
> +{
> +	u64 devid;
> +	struct dump_chunks_entry e;
> +	int i;
> +	int
[PATCH] Btrfs: cleanup BUG_ON in merge_bio
One can use btrfs-corrupt-block to hit the BUG_ON() in merge_bio(), so this patch aims to stop anyone from being able to panic the whole system via their btrfs. Since the error in merge_bio() can only come from __btrfs_map_block() when the chunk tree mapping contains something insane, and __btrfs_map_block() has already printed the reason, we can just return errors from merge_bio().

Signed-off-by: Liu Bo
---
 fs/btrfs/extent_io.c | 1 -
 fs/btrfs/inode.c     | 8 ++++++--
 2 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index aa44d3e..a8661fb 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2756,7 +2756,6 @@ static int merge_bio(int rw, struct extent_io_tree *tree, struct page *page,
 	if (tree->ops && tree->ops->merge_bio_hook)
 		ret = tree->ops->merge_bio_hook(rw, page, offset, size, bio,
 						bio_flags);
-	BUG_ON(ret < 0);
 	return ret;
 }

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 8b1212e..d203c06 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1822,6 +1822,10 @@ static void btrfs_clear_bit_hook(struct inode *inode,
 /*
  * extent_io.c merge_bio_hook, this must check the chunk tree to make sure
  * we don't create bios that span stripes or chunks
+ *
+ * return 1 if page cannot be merged to bio
+ * return 0 if page can be merged to bio
+ * return error otherwise
  */
 int btrfs_merge_bio_hook(int rw, struct page *page, unsigned long offset,
 			 size_t size, struct bio *bio,
@@ -1840,8 +1844,8 @@ int btrfs_merge_bio_hook(int rw, struct page *page, unsigned long offset,
 	map_length = length;
 	ret = btrfs_map_block(root->fs_info, rw, logical, &map_length, NULL, 0);
-	/* Will always return 0 with map_multi == NULL */
-	BUG_ON(ret < 0);
+	if (ret < 0)
+		return ret;
 	if (map_length < length + size)
 		return 1;
 	return 0;
--
2.5.5
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] Btrfs: error out if generic_bin_search get invalid arguments
With btrfs-corrupt-block, one can set a btree node/leaf's fields. If we assign a negative value to a node/leaf's nritems, we can get various hangs, eg. if extent_root's nritems is -2ULL, then we get stuck in btrfs_read_block_groups() because it has a while loop and btrfs_search_slot() on extent_root will always return the first child.

This lets us know what's happening and returns -EINVAL to callers instead of returning the first item.

Signed-off-by: Liu Bo
---
 fs/btrfs/ctree.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index c49a500..915d224 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -1770,6 +1770,14 @@ static noinline int generic_bin_search(struct extent_buffer *eb,
 	unsigned long map_len = 0;
 	int err;

+	if (low > high) {
+		btrfs_err(eb->fs_info,
+			  "%s: low (%d) > high (%d) eb %llu owner %llu level %d",
+			  __func__, low, high, eb->start,
+			  btrfs_header_owner(eb), btrfs_header_level(eb));
+		return -EINVAL;
+	}
+
 	while (low < high) {
 		mid = (low + high) / 2;
 		offset = p + mid * item_size;
--
2.5.5
[PATCH] Btrfs: check inconsistence between chunk and block group
With btrfs-corrupt-block, one can drop one chunk item, and mounting will end up with a panic in btrfs_full_stripe_len(). This doesn't remove the BUG_ON, but instead checks for the inconsistency a bit earlier, when we find the block group item.

Signed-off-by: Liu Bo
---
 fs/btrfs/extent-tree.c | 17 ++++++++++++++++-
 1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 689d25a..d860291 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -9662,7 +9662,22 @@ static int find_first_block_group(struct btrfs_root *root,
 		if (found_key.objectid >= key->objectid &&
 		    found_key.type == BTRFS_BLOCK_GROUP_ITEM_KEY) {
-			ret = 0;
+			struct extent_map_tree *em_tree;
+			struct extent_map *em;
+
+			em_tree = &root->fs_info->mapping_tree.map_tree;
+			read_lock(&em_tree->lock);
+			em = lookup_extent_mapping(em_tree, found_key.objectid,
+						   found_key.offset);
+			read_unlock(&em_tree->lock);
+			if (!em) {
+				btrfs_err(root->fs_info,
+			"logical %llu len %llu found bg but no related chunk",
+					  found_key.objectid, found_key.offset);
+				ret = -ENOENT;
+			} else {
+				ret = 0;
+			}
 			goto out;
 		}
 		path->slots[0]++;
--
2.5.5
[PATCH] Btrfs: fix BUG_ON in btrfs_submit_compressed_write
This is similar to btrfs_submit_compressed_read(): if we fail after the bio is allocated, then we can use bio_endio() and errors are saved in bio->bi_error. But please note that we don't return errors to the caller, because the caller assumes it won't need to call endio to clean up on error.

Signed-off-by: Liu Bo
---
 fs/btrfs/compression.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 658c39b..7a4d9c8 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -402,7 +402,10 @@ int btrfs_submit_compressed_write(struct inode *inode, u64 start,
 		}

 		ret = btrfs_map_bio(root, WRITE, bio, 0, 1);
-		BUG_ON(ret); /* -ENOMEM */
+		if (ret) {
+			bio->bi_error = ret;
+			bio_endio(bio);
+		}

 		bio_put(bio);

@@ -432,7 +435,10 @@ int btrfs_submit_compressed_write(struct inode *inode, u64 start,
 	}

 	ret = btrfs_map_bio(root, WRITE, bio, 0, 1);
-	BUG_ON(ret); /* -ENOMEM */
+	if (ret) {
+		bio->bi_error = ret;
+		bio_endio(bio);
+	}

 	bio_put(bio);
 	return 0;
--
2.5.5
Re: [RFC][PATCH] btrfs-progs: inspect: new subcommand to dump chunks
At 06/23/2016 01:26 AM, David Sterba wrote:
> From: David Sterba
>
> Hi,
>
> the chunk dump is a useful thing, for debugging or balance filters.

Yes, quite useful. I won't need to manually check btrfs-debug-tree output to figure out the dev-extent layout any more.

But according to the output, it's more like a dev-extent dump, not a chunk dump. It's better to make things clearer for objects at different logical levels.

> Example output:
>
> Chunks on device id: 1
> PNumber  Type            PStart    Length    PEnd      Age  LStart     Usage
> ----------------------------------------------------------------------------
>       0  System/RAID1    1.00MiB   32.00MiB  33.00MiB   47  1.40TiB     0.06

[Mixed chunk/dev-extent output]

For "PStart/Length/PEnd/LStart" that's OK. (Although I prefer "PLength" to "Length", as it's a little confusing between the chunk's logical length and the dev-extent's physical length.) Since "PStart" = key.offset, "Length" = the dev-extent length, "PEnd" = PStart + Length and "LStart" = the dev-extent's chunk_offset, these values are all straightforward.

But "Usage" is completely a chunk/block-group level thing, so personally speaking it's better not to output per-chunk usage in a dev-extent dump. If there were a real "chunk dump", that would be the best place to output per-chunk "usage". For dev-extent level usage, it should be the usage of a device, not per chunk.

["Age" calculation]

Another concern is about "age": since there is no generation in the chunk item nor the dev extent, you're using the logical bytenr as a hint. This method implies that btrfs always allocates new chunks at increasing logical bytenrs and never reuses lower logical bytenrs. Such behavior provides the basis for a lot of other btrfs functions, like balance/qgroup rescan, and IMHO it won't be changed any time soon.

But I'm still a little concerned about relying on such an assumption, since "age" here is more like a sorting index, not the generation at which the chunk was created. It may not really reflect how long one chunk has lived. It may be completely OK for most people, but at least it's not as straightforward as a generation.

[Support for offline btrfs]

BTW, it's also a good idea to support not only a mounted fs, but also an offline one. For an offline fs, there's no need to bother with the tree search ioctl, just straightforward search_slot() calls. (That's why I like offline btrfs tools.)

Thanks,
Qu
Re: [RFC][PATCH] btrfs-progs: inspect: new subcommand to dump chunks
On 06/22/2016 07:26 PM, David Sterba wrote:
> From: David Sterba
>
> Hi,
>
> the chunk dump is a useful thing, for debugging or balance filters.
>
> Example output:
>
> Chunks on device id: 1
> PNumber  Type            PStart    Length    PEnd      Age  LStart     Usage
> ----------------------------------------------------------------------------
>       0  System/RAID1    1.00MiB   32.00MiB  33.00MiB   47  1.40TiB     0.06
>       1  Metadata/RAID1  33.00MiB  1.00GiB   1.03GiB    31  1.36TiB    64.03
>       2  Metadata/RAID1  1.03GiB   32.00MiB  1.06GiB    36  1.36TiB    77.28
>       3  Data/single     1.06GiB   1.00GiB   2.06GiB    12  422.30GiB  78.90
>       4  Data/single     2.06GiB   1.00GiB   3.06GiB    11  420.30GiB  78.47
> ...

On RAID0, it looks funny :)

# ./btrfs inspect-internal dump-chunks /mnt/extents/
Chunks on device id: 1
PNumber  Type            PStart     Length     PEnd       Age  LStart
      0  Metadata/RAID0  1.00MiB    256.00MiB  257.00MiB    1  13.36GiB
      .  empty           .          16.00EiB   .            .  .
      1  System/RAID0    129.00MiB  64.00MiB   193.00MiB    2  13.61GiB
      .  empty           .          16.00EiB   .            .  .
      2  Data/RAID0      161.00MiB  2.00GiB    2.16GiB      3  15.68GiB
      .  empty           .          115.00MiB  .            .  .
      3  Data/RAID0      2.27GiB    2.00GiB    4.27GiB      0  11.36GiB
Chunks on device id: 2
PNumber  Type            PStart    Length     PEnd     Age  LStart
      0  Data/RAID0      1.00MiB   2.00GiB    2.00GiB    0  11.36GiB
      .  empty           .         16.00EiB   .          .  .
      1  Metadata/RAID0  1.00GiB   256.00MiB  1.25GiB    1  13.36GiB
      .  empty           .         16.00EiB   .          .  .
      2  System/RAID0    1.13GiB   64.00MiB   1.19GiB    2  13.61GiB
      .  empty           .         16.00EiB   .          .  .
      3  Data/RAID0      1.16GiB   2.00GiB    3.16GiB    3  15.68GiB

For the correct sizes, you would either need to do some math, like "if RAID0 then divide by 2", but I guess this gets horrible really soon with RAID5 etc... Or, if you want to print everything in physical order, why not just dump the device tree with the DEV_EXTENT items, which have the exact length on disk, and fill in the missing information using the reference to the chunk they belong to?
https://github.com/knorrie/python-btrfs/commit/be85c2a0e2774909b532326e99e0c23ab8467263

Sorry, couldn't resist /:D\

--
Hans van Kranenburg - System / Network Engineer
T +31 (0)10 2760434 | hans.van.kranenb...@mendix.com | www.mendix.com
Re: btrfs: page allocation failure: order:1, mode:0x2204020
On 06/20/2016 02:49 PM, David Sterba wrote:
> On Sat, Jun 18, 2016 at 08:47:55PM +0200, Hans van Kranenburg wrote:
>> Last night, one of my btrfs filesystems went read-only after a memory
>> allocation failure (logging attached).
>
> According to the logs, the allocation itself happens out of btrfs so we
> can't do much here. More specifically, when creating a new subvolume and
> requesting an anonymous block device (via get_anon_bdev), there's a call
> to request a free id for it. This could ask for "order-1", ie. 8kb of
> contiguous memory. And it failed. This depends on memory fragmentation,
> ie. how the pages are allocated and freed over time. Tweaking
> vm.min_free_kbytes could help but it's not designed to prevent memory
> allocations in such a scenario.
>
> The id range structures themselves do not need more than a 4k page, but
> the slab cache tries to provide more objects per slab. I see this on my
> box right now, excerpt from /proc/slabinfo:
>
>   idr_layer_cache  474  486  2096  3  2
>
> where 3 == objperslab and 2 == pages per slab, which corresponds to the
> 8kb. This seems to depend on internal slab cache logic, and nothing I'd
> like to go changing right now.

Aha, interesting. The output of "sort -nk 6 < /proc/slabinfo" shows me that most of them use the 4k.

[...]

>> I've seen this happen once before somewhere else, also during snapshot
>> creation, also with a 4.5.x kernel.

Speaking of the devil... Yesterday another one crashed, and this one is using Debian 3.16.7-ckt25-2. Errors attached. There's a bug report at Debian, in which it is suggested to increase the value of vm.min_free_kbytes: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=666021

[...]

> [2363000.815554] Node 0 Normal: 2424*4kB (U) 0*8kB 0*16kB 0*32kB 0*64kB
> 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 9696kB
>
> Just to confirm the fragmentation, there are no page orders higher than
> 0 (ie. 8k and up).
>
> So, technically not a btrfs bug but we could get affected by it badly.

Ah, so the list is what's still free, and there's no 8kB block available.
Now this also makes perfect sense to me:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=666021#90

So while increasing min_free_kbytes will not fix the fragmentation issue, it is at least a workaround: it just throws some more dedicated space at the problem and decreases the chance of this happening.

This one was even more spectacular, since it ends in oopsing and the whole box going down a bit later. I see no line with "Normal" in here, by the way...

btrfs: page allocation failure: order:1, mode:0x204020
CPU: 0 PID: 9475 Comm: btrfs Not tainted 3.16.0-4-amd64 #1 Debian 3.16.7-ckt25-2
 8150e835 00204020 8800052f7838 81142d3f
 8802 0041 8800ef8a5c00
Call Trace:
 [] ? dump_stack+0x5d/0x78
 [] ? warn_alloc_failed+0xdf/0x130
 [] ? __alloc_pages_nodemask+0x8ca/0xb30
 [] ? kmem_getpages+0x5b/0x110
 [] ? fallback_alloc+0x1cf/0x210
 [] ? kmem_cache_alloc+0x1f0/0x450
 [] ? ida_pre_get+0x60/0xe0
 [] ? get_anon_bdev+0x62/0xe0
 [] ? btrfs_init_fs_root+0x101/0x190 [btrfs]
 [] ? btrfs_get_fs_root+0xb6/0x270 [btrfs]
 [] ? create_pending_snapshot+0x6f6/0x960 [btrfs]
 [] ? create_pending_snapshots+0x89/0xb0 [btrfs]
 [] ? btrfs_commit_transaction+0x40c/0xa10 [btrfs]
 [] ? start_transaction+0x90/0x570 [btrfs]
 [] ? btrfs_mksubvol.isra.29+0x54a/0x5a0 [btrfs]
 [] ? prepare_to_wait_event+0xf0/0xf0
 [] ? btrfs_ioctl_snap_create_transid+0x191/0x1a0 [btrfs]
 [] ? btrfs_ioctl_snap_create_v2+0xee/0x140 [btrfs]
 [] ? btrfs_ioctl+0xcc4/0x2b50 [btrfs]
 [] ? handle_mm_fault+0x63c/0x11c0
 [] ? mmap_region+0x15f/0x650
 [] ? __do_page_fault+0x1d1/0x4f0
 [] ? do_mmap_pgoff+0x2e9/0x3b0
 [] ? do_vfs_ioctl+0x2cf/0x4b0
 [] ? SyS_ioctl+0x81/0xa0
 [] ? page_fault+0x28/0x30
 [] ? system_call_fast_compare_end+0x10/0x15
Mem-Info:
Node 0 DMA per-cpu:
CPU    0: hi:    0, btch:   1 usd:   0
Node 0 DMA32 per-cpu:
CPU    0: hi:  186, btch:  31 usd: 135
active_anon:58342 inactive_anon:20518 isolated_anon:0
active_file:371028 inactive_file:364279 isolated_file:0
unevictable:0 dirty:25166 writeback:0 unstable:0
free:85089 slab_reclaimable:46135 slab_unreclaimable:9219
mapped:5560 shmem:20579 pagetables:1492 bounce:0
free_cma:0
Node 0 DMA free:14984kB min:32kB low:40kB high:48kB active_anon:344kB inactive_anon:160kB active_file:40kB inactive_file:36kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15996kB managed:15912kB mlocked:0kB dirty:16kB writeback:0kB mapped:32kB shmem:160kB slab_reclaimable:84kB slab_unreclaimable:60kB kernel_stack:64kB pagetables:16kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:11 all_unreclaimable? no
lowmem_reserve[]: 0 3743 3743 3743
Node 0 DMA32 free:325372kB min:7808kB low:9760kB high:11712kB active_anon:233024kB inactive_anon:81912kB active_file:1484072kB inactive_file:1457080kB unevictable:0kB isolated(anon):0kB
Re: [RFC][PATCH] btrfs-progs: inspect: new subcommand to dump chunks
On 06/22/2016 07:26 PM, David Sterba wrote:
> From: David Sterba
>
> Hi,
>
> the chunk dump is a useful thing, for debugging or balance filters.

Nice!

> Example output:
>
> Chunks on device id: 1
> PNumber  Type            PStart    Length    PEnd      Age  LStart     Usage
> ----------------------------------------------------------------------------
>       0  System/RAID1    1.00MiB   32.00MiB  33.00MiB   47  1.40TiB     0.06
>       1  Metadata/RAID1  33.00MiB  1.00GiB   1.03GiB    31  1.36TiB    64.03
>       2  Metadata/RAID1  1.03GiB   32.00MiB  1.06GiB    36  1.36TiB    77.28
>       3  Data/single     1.06GiB   1.00GiB   2.06GiB    12  422.30GiB  78.90
>       4  Data/single     2.06GiB   1.00GiB   3.06GiB    11  420.30GiB  78.47
> ...
>
> (full output is at http://susepaste.org/view/raw/089a9877)
>
> This patch does the basic output, no filtering or sorting besides physical
> and logical offset. There are possibilities to do more or enhance the
> output (eg. starting with the logical chunk and listing the related
> physical chunks together). Or filter by type/profile, or understand the
> balance filters format.
>
> As it is now, it's a per-device dump of the physical layout, which was the
> original idea.
>
> Printing 'usage' is not default as it's quite slow, it uses the search
> ioctl and probably not in the best way, or there's some other issue in the
> implementation.

Interesting. So after reading this, I wrote a little test to test some scenarios:

https://github.com/knorrie/python-btrfs/commit/1ca99880dfa0e14b148f3d9e2b6b381b781eb52d

It's very clear that the most optimal way of doing this search is to have nr_items=1 and, if possible, specify the length in offset. I have no idea yet what happens inside the kernel when nr_items is not 1, in combination with a large extent tree. I'll try to follow the code and see if I can find what's making this big difference.

Also, the filesystem with the 1200-second search result is Debian 3.16.7-ckt25-2; I haven't looked yet whether there's different behaviour in search handling between that kernel and e.g. 4.5.

After loading a huge amount of data in memory, the search with nr_items > 1 only takes 4.6 seconds to come up with the block group item itself, and another 4.6 seconds to come up with 0 additional results, only using cpu time. It seems that searching in the empty space between (vaddr BLOCK_GROUP_ITEM length+1) and (vaddr BLOCK_GROUP_ITEM ULLONG_MAX) is really expensive, while there's absolutely nothing to find.

I ran into this two days ago, caused by doing an unnecessary extra search after finding the block group item already:

https://github.com/knorrie/python-btrfs/commit/5a69d9cf0477515ce2d6e50f1741276a49b33af8

> I'll add the patch to the devel branch but will not add it for any
> particular release yet, I'd like some feedback first. Thanks.
>
> ---
> New command 'btrfs inspect-internal dump-chunks' will dump the layout of
> chunks as stored on the devices. This corresponds to the physical
> layout, sorted by the physical offset. The block group usage can be
> shown as well, but the search is too slow so it's off by default.
>
> If the physical offset sorting is selected, the empty space between
> chunks is also shown.
>
> Signed-off-by: David Sterba
> ---
>  cmds-inspect.c | 364 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 364 insertions(+)

[...]

> +}
> +
> +static u64 fill_usage(int fd, u64 lstart)
> +{
> +	struct btrfs_ioctl_search_args args;
> +	struct btrfs_ioctl_search_key *sk = &args.key;
> +	struct btrfs_ioctl_search_header sh;
> +	struct btrfs_block_group_item *item;
> +	int ret;
> +
> +	memset(&args, 0, sizeof(args));
> +	sk->tree_id = BTRFS_EXTENT_TREE_OBJECTID;
> +	sk->min_objectid = lstart;
> +	sk->max_objectid = lstart;
> +	sk->min_type = BTRFS_BLOCK_GROUP_ITEM_KEY;
> +	sk->max_type = BTRFS_BLOCK_GROUP_ITEM_KEY;
> +	sk->min_offset = 0;
> +	sk->max_offset = (u64)-1;

The chunk length is already known here, so it can go in offset.

> +	sk->max_transid = (u64)-1;
> +
> +	sk->nr_items = 4096;

Or set it to 1, because we already know there can only be one result.
Re: [PATCH v2 2/2] btrfs: wait for bdev put
On Wed, Jun 22, 2016 at 6:18 AM, Anand Jain wrote:
> Thanks for the review Chris.
>
> On 06/21/2016 09:00 PM, Chris Mason wrote:
>> On 06/21/2016 06:24 AM, Anand Jain wrote:
>>> From: Anand Jain
>>>
>>> Further to the commit bc178622d40d87e75abc131007342429c9b03351
>>>   btrfs: use rcu_barrier() to wait for bdev puts at unmount
>>>
>>> This patch implements a method to time wait on the __free_device()
>>> which actually does the bdev put. This is needed as user space
>>> running 'btrfs fi show -d' immediately after the replace and unmount
>>> is still reading older information from the device.
>>
>> Thanks for working on this Anand. Since it looks like blkdev_put can
>> deadlock against us, can we please switch to making sure we fully
>> flush the outstanding IO? It's probably enough to do a sync_blockdev()
>> call before we allow the unmount to finish, but we can toss in an
>> invalidate_bdev for good measure.
>
> # git diff
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 604daf315669..e0ad29d6fe9a 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -870,6 +870,11 @@ static void btrfs_close_one_device(struct btrfs_device *device)
>  	if (device->missing)
>  		fs_devices->missing_devices--;
>
> +	if (device->bdev && device->writeable) {
> +		sync_blockdev(device->bdev);
> +		invalidate_bdev(device->bdev);
> +	}
> +
>  	new_device = btrfs_alloc_device(NULL, &device->devid,
>  					device->uuid);
>  	BUG_ON(IS_ERR(new_device)); /* -ENOMEM */
> ---
>
> However, theoretically there might still be a problem - at the end of
> unmount, if the device's exclusive open is not actually closed, then
> there might be a race with another program which is trying to open the
> device in exclusive mode. For example: "unmount /btrfs; fsck /dev/X" -
> here fsck might fail to open the device if it wins the race.

True, but at least we know he'll have up-to-date buffers if he does manage to open the device. With the generic code, the blkdev_put happens after the super is gone, so I'm not sure we can completely fix this from inside our callback.
-chris
Re: Adventures in btrfs raid5 disk recovery
On Wed, Jun 22, 2016 at 11:14:30AM -0600, Chris Murphy wrote:
> > Before deploying raid5, I tested these by intentionally corrupting
> > one disk in an otherwise healthy raid5 array and watching the result.
>
> It's difficult to reproduce if no one understands how you
> intentionally corrupted that disk. Literal reading, you corrupted the
> entire disk, but that's impractical. The fs is expected to behave
> differently depending on what's been corrupted and how much.

The first round of testing I did (a year ago, when deciding whether btrfs raid5 was mature enough to start using) was:

 - Create a 5-disk RAID5.

 - Put some known data on it until it's full (i.e. random test patterns).
   At the time I didn't do any tests involving compressible data, which I
   now realize was a serious gap in my test coverage.

 - Pick 1000 random blocks (excluding superblocks) on one of the disks and
   write random data to them.

 - Read and verify the data through the filesystem, do scrub, etc.
   Exercise all the btrfs features related to error reporting and recovery.

I expected scrub and dev stat to report accurate corruption counts (except for the 1-in-4-billion case where a bad csum matches by random chance), and I expected all the data to be reconstructed, since only one drive was corrupted (assuming there are no unplanned disk failures during the test, obviously) and the corruption occurred while the filesystem was offline, so there was no possibility of the RAID write hole.

My results from that testing were that everything worked, except for the mostly-harmless quirk where scrub counts errors on random disks instead of the disk where the errors occur.

> I don't often use the -Bd options, so I haven't tested it thoroughly,
> but what you're describing sounds like a bug in user space tools. I've
> found it reflects the same information as btrfs dev stats, and dev
> stats have been reliable in my testing.

Don't the user space tools just read what the kernel tells them?
I don't know how *not* to produce this behavior on btrfs raid5 or raid6. It should show up on any btrfs raid56 system.

> > A different thing happens if there is a crash. In that case, scrub cannot
> > repair the errors. Every btrfs raid5 filesystem I've deployed so far
> > behaves this way when disks turn bad. I had assumed it was a software bug
> > in the comparatively new raid5 support that would get fixed eventually.
>
> This is really annoyingly vague. You don't give a complete recipe for
> reproducing this sequence. Here's what I'm understanding and what I'm
> missing:
>
> 1. The intentional corruption, extent of which is undefined, is still
> present.

No intentional corruption here (quote: "A different thing happens if there is a crash..."). Now we are talking about the baseline behavior when there is a crash on a btrfs raid5 array, especially crashes triggered by a disk-level failure (e.g. a watchdog timeout because a disk or controller has hung), but also ordinary power failures or other crashes triggered by external causes.

> 2. A drive is bad, but that doesn't tell us if it's totally dead, or
> only intermittently spitting out spurious information.

The most common drive-initiated reboot case is that one drive temporarily locks up and triggers the host to perform a watchdog reset. The reset is successful and the filesystem can be mounted again with all drives present; however, a small amount of raid5 data appears to be corrupted each time. The raid1 metadata passes all the integrity checks I can throw at it: btrfs check, scrub, balance, walking the filesystem with find -type f -exec cat ..., comparing with the last backup, etc. Usually when I detect this case, I delete any corrupted data, delete the disk that triggers the lockups, and have no further problems with that array.

> 3. Is the volume remounted degraded or is the bad drive still being
> used by Btrfs? Because Btrfs has no concept (patches pending) of drive
> faulty state like md, let alone an automatic change to that faulty
> state. It just keeps on trying to read or write to bad drives, even if
> they're physically removed.

In the baseline case the filesystem has all drives present after remount. It could be as simple as power-cycling the host while writes are active.

> 4. You've initiated a scrub, and the corruption in 1 is not fixed.

In this pattern, btrfs may find both correctable and uncorrectable corrupted data, usually on one of the drives. Scrub fixes the correctable corruption, but fails on the uncorrectable.

> OK so what am I missing?

Nothing yet. The above is the "normal" btrfs raid5 crash experience with a non-degraded raid5 array. A few megabytes of corrupt extents can easily be restored from backups or deleted, and everything's fine after that.

In my *current* failure case, I'm experiencing *additional* issues. In total: all of the above, plus: BUG_ON()s as described in the first mail on
Re: Confusing output from fi us/df
On Mon, Jun 20, 2016 at 5:30 PM, Marc Grondin wrote:
> Metadata,single: Size:3.00GiB, Used:1.53GiB

Read this as: 3GiB of space on the device is reserved for the metadata block group, and 1.53GiB of that is being used. The reservation means that this space can't be used for other block group types, such as system or data. So in some sense it is "used", but strictly speaking it's just a reservation.

--
Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC][PATCH] btrfs-progs: inspect: new subcommand to dump chunks
From: David Sterba

Hi,

the chunk dump is a useful thing, for debugging or balance filters.

Example output:

Chunks on device id: 1
PNumber  Type            PStart    Length    PEnd      Age  LStart     Usage
-----------------------------------------------------------------------------
0        System/RAID1    1.00MiB   32.00MiB  33.00MiB  47   1.40TiB    0.06
1        Metadata/RAID1  33.00MiB  1.00GiB   1.03GiB   31   1.36TiB    64.03
2        Metadata/RAID1  1.03GiB   32.00MiB  1.06GiB   36   1.36TiB    77.28
3        Data/single     1.06GiB   1.00GiB   2.06GiB   12   422.30GiB  78.90
4        Data/single     2.06GiB   1.00GiB   3.06GiB   11   420.30GiB  78.47
...

(full output is at http://susepaste.org/view/raw/089a9877)

This patch does the basic output, no filtering or sorting besides physical and logical offset. There are possibilities to do more or enhance the output (e.g. starting with the logical chunk and listing the related physical chunks together). Or filter by type/profile, or understand the balance filters format.

As it is now, it's a per-device dump of the physical layout, which was the original idea.

Printing 'usage' is not default as it's quite slow; it uses the search ioctl, and probably not in the best way, or there's some other issue in the implementation.

I'll add the patch to the devel branch but will not add it for any particular release yet, I'd like some feedback first. Thanks.

---

New command 'btrfs inspect-internal dump-chunks' will dump the layout of chunks as stored on the devices. This corresponds to the physical layout, sorted by the physical offset. The block group usage can be shown as well, but the search is too slow so it's off by default.

If the physical offset sorting is selected, the empty space between chunks is also shown.
Signed-off-by: David Sterba
---
 cmds-inspect.c | 364 +
 1 file changed, 364 insertions(+)

diff --git a/cmds-inspect.c b/cmds-inspect.c
index dd7b9dd278f2..4eace63d8517 100644
--- a/cmds-inspect.c
+++ b/cmds-inspect.c
@@ -623,6 +623,368 @@ static int cmd_inspect_min_dev_size(int argc, char **argv)
 	return !!ret;
 }
 
+static const char * const cmd_dump_chunks_usage[] = {
+	"btrfs inspect-internal chunk-stats [options] ",
+	"Show chunks (block groups) layout",
+	"Show chunks (block groups) layout for all devices",
+	"",
+	HELPINFO_UNITS_LONG,
+	"--sort=MODE    sort by the physical or logical chunk start",
+	"               MODE is one of pstart or lstart (default: pstart)",
+	"--usage        show usage per block group, note this can be slow",
+	NULL
+};
+
+enum {
+	CHUNK_SORT_PSTART,
+	CHUNK_SORT_LSTART,
+	CHUNK_SORT_DEFAULT = CHUNK_SORT_PSTART
+};
+
+struct dump_chunks_entry {
+	u64 devid;
+	u64 start;
+	u64 lstart;
+	u64 length;
+	u64 flags;
+	u64 age;
+	u64 used;
+	u32 pnumber;
+};
+
+struct dump_chunks_ctx {
+	unsigned length;
+	unsigned size;
+	struct dump_chunks_entry *stats;
+};
+
+int cmp_cse_devid_start(const void *va, const void *vb)
+{
+	const struct dump_chunks_entry *a = va;
+	const struct dump_chunks_entry *b = vb;
+
+	if (a->devid < b->devid)
+		return -1;
+	if (a->devid > b->devid)
+		return 1;
+
+	if (a->start < b->start)
+		return -1;
+	if (a->start == b->start) {
+		error(
+"chunks start on same offset in the same device: devid %llu start %llu",
+			(unsigned long long)a->devid,
+			(unsigned long long)a->start);
+		return 0;
+	}
+	return 1;
+}
+
+int cmp_cse_devid_lstart(const void *va, const void *vb)
+{
+	const struct dump_chunks_entry *a = va;
+	const struct dump_chunks_entry *b = vb;
+
+	if (a->devid < b->devid)
+		return -1;
+	if (a->devid > b->devid)
+		return 1;
+
+	if (a->lstart < b->lstart)
+		return -1;
+	if (a->lstart == b->lstart) {
+		error(
+"chunks logically start on same offset in the same device: devid %llu start %llu",
+			(unsigned long long)a->devid,
+			(unsigned long long)a->lstart);
+		return 0;
+	}
+	return 1;
+}
+
+void print_dump_chunks(struct dump_chunks_ctx *ctx, unsigned sort_mode,
+		unsigned unit_mode, int with_usage)
+{
+	u64 devid;
+	struct dump_chunks_entry e;
+	int i;
+	int chidx;
+	u64 lastend = 0;
+	u64 age;
+
+	/*
+	 * Chunks are sorted logically as found by the ioctl, we need to sort
+	 * them once to find the physical ordering. This is the default mode.
+	 */
+	qsort(ctx->stats,
Re: Adventures in btrfs raid5 disk recovery
On Mon, Jun 20, 2016 at 7:55 PM, Zygo Blaxell wrote:
> On Mon, Jun 20, 2016 at 03:27:03PM -0600, Chris Murphy wrote:
>> On Mon, Jun 20, 2016 at 2:40 PM, Zygo Blaxell wrote:
>> > On Mon, Jun 20, 2016 at 01:30:11PM -0600, Chris Murphy wrote:
>> >> For me the critical question is what does "some corrupted sectors" mean?
>> >
>> > On other raid5 arrays, I would observe a small amount of corruption every
>> > time there was a system crash (some of which were triggered by disk
>> > failures, some not).
>>
>> What test are you using to determine there is corruption, and how much
>> data is corrupted? Is this on every disk? Non-deterministically fewer
>> than all disks? Have you identified this as a torn write or
>> misdirected write or is it just garbage at some sectors? And what's
>> the size? Partial sector? Partial md chunk (or fs block)?
>
> In earlier cases, scrub, read(), and btrfs dev stat all reported the
> incidents differently. Scrub would attribute errors randomly to disks
> (error counts spread randomly across all the disks in the 'btrfs scrub
> status -d' output). 'dev stat' would correctly increment counts on only
> those disks which had individually had an event (e.g. media error or
> SATA bus reset).
>
> Before deploying raid5, I tested these by intentionally corrupting
> one disk in an otherwise healthy raid5 array and watching the result.

It's difficult to reproduce if no one understands how you intentionally corrupted that disk. On a literal reading, you corrupted the entire disk, but that's impractical. The fs is expected to behave differently depending on what's been corrupted and how much.

> When scrub identified an inode and offset in the kernel log, the csum
> failure log message matched the offsets producing EIO on read(), but
> the statistics reported by scrub about which disk had been corrupted
> were mostly wrong. In such cases a scrub could repair the data.
I don't often use the -Bd options, so I haven't tested them thoroughly, but what you're describing sounds like a bug in the user space tools. I've found they reflect the same information as btrfs dev stats, and dev stats have been reliable in my testing.

> A different thing happens if there is a crash. In that case, scrub cannot
> repair the errors. Every btrfs raid5 filesystem I've deployed so far
> behaves this way when disks turn bad. I had assumed it was a software bug
> in the comparatively new raid5 support that would get fixed eventually.

This is really annoyingly vague. You don't give a complete recipe for reproducing this sequence. Here's what I'm understanding and what I'm missing:

1. The intentional corruption, extent of which is undefined, is still present.

2. A drive is bad, but that doesn't tell us if it's totally dead, or only intermittently spitting out spurious information.

3. Is the volume remounted degraded or is the bad drive still being used by Btrfs? Because Btrfs has no concept (patches pending) of a drive faulty state like md, let alone an automatic change to that faulty state. It just keeps on trying to read or write to bad drives, even if they're physically removed.

4. You've initiated a scrub, and the corruption in 1 is not fixed.

OK so what am I missing? Because it sounds to me like you have two copies of data that are gone. For raid5 that's data loss; scrub can't fix things. Corruption is missing data. The bad drive is missing data.

What values do you get for

    smartctl -l scterc /dev/sdX
    cat /sys/block/sdX/device/timeout

>> This is on Btrfs? This isn't supposed to be possible. Even a literal
>> overwrite of a file is not an overwrite on Btrfs unless the file is
>> nodatacow. Data extents get written, then the metadata is updated to
>> point to those new blocks. There should be flush or fua requests to
>> make sure the order is such that the fs points to either the old or
>> new file, in either case uncorrupted.
>> That's why I'm curious about the
>> nature of this corruption. It sounds like your hardware is not exactly
>> honoring flush requests.
>
> That's true when all the writes are ordered within a single device, but
> possibly not when writes must be synchronized across multiple devices.

I think that's a big problem: the fs cannot be consistent if the super block points to any tree whose metadata or data isn't on stable media. But if you think it's happening, you might benefit from integrity checking; maybe try just the metadata one for starters, which is the check_int mount option (it must be compiled in first for that mount option to work).

https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/fs/btrfs/check-integrity.c?id=refs/tags/v4.6.2

> It could be possible to corrupt existing data if:
>
> - btrfs puts multiple extents in the same RAID5 stripe
>
> - writes two of the extents in different transactions
>
> - the array is degraded so the RAID5 write hole applies
>
Re: btrfs send/receive error
Hello Lubos!

I have the same problems with send/receive, and have had them for several weeks or months now. Not sure when they started. These send/receive problems are known and there are several bug reports all over the internet and also on this list. Not sure if or when this gets fixed. It's broken atm.

Michael

On 2016-06-22 14:01, Lubos Kolouch wrote:
> Hello,
>
> I am having a strange issue with btrfs send/receive:
>
> # btrfs send -p /mnt/a/_btrbk_snap/debian.20160621_1 /mnt/a/_btrbk_snap/debian.20160622 | btrfs receive /mnt/ext/backups/vaclfw/
>
> At subvol /mnt/a/_btrbk_snap/debian.20160622
> At snapshot debian.20160622
> ERROR: unlink var/lib/samba/private/msg.sock/7992 failed. No such file or directory
>
> But:
>
> ls -la /mnt/a/_btrbk_snap/debian.20160621_1/var/lib/samba/private/msg.sock/
> 2330 2388 2394 29375 7992
>
> ls -la /mnt/ext/backups/vaclfw/debian.20160621_1/var/lib/samba/private/msg.sock/
> 2330 2388 2394 29375 7992
>
> uname -a
> Linux vaclfw 4.6.0-1-amd64 #1 SMP Debian 4.6.2-1 (2016-06-15) x86_64 GNU/Linux
>
> # btrfs version
> btrfs-progs v4.5.2
>
> What could be wrong?
>
> Thanks
>
> Lubos
Re: [Y2038] [PATCH v2 00/24] Delete CURRENT_TIME and CURRENT_TIME_SEC macros
On Sunday, June 19, 2016 5:26:59 PM CEST Deepa Dinamani wrote:
> The series is aimed at getting rid of the CURRENT_TIME and CURRENT_TIME_SEC macros.
> The macros are not y2038 safe. There is no plan to transition them into being
> y2038 safe. The ktime_get_* APIs can be used in their place, and these are y2038 safe.
>
> Thanks to Arnd Bergmann for all the guidance and discussions.
>
> Patches 2-4 were mostly generated using coccinelle scripts.
>
> All filesystem timestamps use current_fs_time() for the right granularity as
> mentioned in the respective commit texts of the patches. This has a changed
> signature, renamed to current_time() and moved to fs/inode.c.
>
> This series also serves as a preparatory series to transition vfs to 64 bit
> timestamps as outlined here: https://lkml.org/lkml/2016/2/12/104 .
>
> As per Linus's suggestion in https://lkml.org/lkml/2016/5/24/663 , all the
> inode timestamp changes have been squashed into a single patch. Also,
> current_time() is now used as the single generic vfs filesystem timestamp API.
> It also takes struct inode* as its argument instead of struct super_block*.
> Posting all patches together in a bigger series so that the big picture is
> clear.
>
> As per the suggestion in https://lwn.net/Articles/672598/, the CURRENT_TIME macro
> bug fixes are being handled in a series separate from transitioning vfs to use.

I've looked in detail at all the patches in this version now, and while overall everything is fine, I found that two patches cannot be part of the series because of the dependency on the patch that John already took (adding time64_to_tm). But I think that's ok, because we just need to change over all the users of CURRENT_TIME and CURRENT_TIME_SEC that assign to inode timestamps in order to prepare for the type change; the other ones can be changed later.
I also found a few things that could be done differently to make the later conversion slightly easier, but it's also possible that I missed part of your bigger plan for those files, and none of them seem important.

Arnd
btrfs send/receive error
Hello,

I am having a strange issue with btrfs send/receive:

# btrfs send -p /mnt/a/_btrbk_snap/debian.20160621_1 /mnt/a/_btrbk_snap/debian.20160622 | btrfs receive /mnt/ext/backups/vaclfw/

At subvol /mnt/a/_btrbk_snap/debian.20160622
At snapshot debian.20160622
ERROR: unlink var/lib/samba/private/msg.sock/7992 failed. No such file or directory

But:

ls -la /mnt/a/_btrbk_snap/debian.20160621_1/var/lib/samba/private/msg.sock/
2330 2388 2394 29375 7992

ls -la /mnt/ext/backups/vaclfw/debian.20160621_1/var/lib/samba/private/msg.sock/
2330 2388 2394 29375 7992

uname -a
Linux vaclfw 4.6.0-1-amd64 #1 SMP Debian 4.6.2-1 (2016-06-15) x86_64 GNU/Linux

# btrfs version
btrfs-progs v4.5.2

What could be wrong?

Thanks

Lubos
Re: [PATCH v2 5/6] fstests: btrfs: test RAID1 device reappear and balanced
On 06/21/2016 09:31 PM, Eryu Guan wrote:

On Wed, Jun 15, 2016 at 04:48:47PM +0800, Anand Jain wrote:

From: Anand Jain

The test does the following:
  Initialize a RAID1 with some data
  Re-mount RAID1 degraded with _dev1_ and write up to half of the FS capacity

If test devices are big enough, this test consumes much longer test time. I tested with a 15G scratch dev pool and this test ran ~200s on my 4vcpu 8G memory test vm.

Right. Isn't that a good design, so that it gets tested differently on different HW configs? However, the test time can be reduced by using a smaller vdisk.

Thanks, Anand

Is it possible to limit the file size or the device size used, so it won't grow with device size? I'm thinking about something like _scratch_mkfs_sized, but that doesn't work for dev pool.

  Save md5sum checkpoint1
  Re-mount healthy RAID1
  Let balance re-silver.
  Save md5sum checkpoint2
  Re-mount RAID1 degraded with _dev2_
  Save md5sum checkpoint3
  Verify if all three md5sums match

Signed-off-by: Anand Jain
---
v2: add tmp= and its rm
    add comments to why _reload_btrfs_ko is used
    add missing put and test_mount at notrun exit
    use echo instead of _fail when checkpoints are checked
    .out updated to remove Silence..

 tests/btrfs/123     | 169
 tests/btrfs/123.out |   7 +++
 tests/btrfs/group   |   1 +
 3 files changed, 177 insertions(+)
 create mode 100755 tests/btrfs/123
 create mode 100644 tests/btrfs/123.out

diff --git a/tests/btrfs/123 b/tests/btrfs/123
new file mode 100755
index ..33decfd1c434
--- /dev/null
+++ b/tests/btrfs/123
@@ -0,0 +1,169 @@
+#! /bin/bash
+# FS QA Test 123
+#
+# This test verifies the RAID1 reconstruction on the reappeared
+# device, by using the following steps:
+# Initialize a RAID1 with some data
+#
+# Re-mount RAID1 degraded with dev2 missing and write up to
+# half of the FS capacity.
+# Save md5sum checkpoint1
+#
+# Re-mount healthy RAID1
+#
+# Let balance re-silver.
+# Save md5sum checkpoint2
+#
+# Re-mount RAID1 degraded with dev1 missing
+# Save md5sum checkpoint3
+#
+# Verify if all three checkpoints match
+#
+#-
+# Copyright (c) 2016 Oracle. All Rights Reserved.
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
+#-
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1	# failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+	cd /
+	rm -f $tmp.*
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+
+# remove previous $seqres.full before test
+rm -f $seqres.full
+
+# real QA test starts here
+
+_supported_fs btrfs
+_supported_os Linux
+_require_scratch_nocheck

Why not check the filesystem after the test? A comment would be good if there's a good reason.
Patch 6 needs it as well :)

Thanks, Eryu

+_require_scratch_dev_pool 2
+
+# the mounted test dir prevents btrfs unload, we need to unmount
+_test_unmount
+_require_btrfs_loadable
+
+_scratch_dev_pool_get 2
+
+dev1=`echo $SCRATCH_DEV_POOL | awk '{print $1}'`
+dev2=`echo $SCRATCH_DEV_POOL | awk '{print $2}'`
+
+dev1_sz=`blockdev --getsize64 $dev1`
+dev2_sz=`blockdev --getsize64 $dev2`
+# get min of both
+max_fs_sz=`echo -e "$dev1_sz\n$dev2_sz" | sort | head -1`
+max_fs_sz=$(( max_fs_sz/2 ))
+if [ $max_fs_sz -gt 100 ]; then
+	bs="1M"
+	count=$(( max_fs_sz/100 ))
+else
+	max_fs_sz=$(( max_fs_sz*2 ))
+	_scratch_dev_pool_put
+	_test_mount
+	_notrun "Smallest dev size $max_fs_sz, Need at least 2M"
+fi
+
+echo >> $seqres.full
+echo "max_fs_sz=$max_fs_sz count=$count" >> $seqres.full
+echo "-Initialize -" >> $seqres.full
+_scratch_pool_mkfs "-mraid1 -draid1" >> $seqres.full 2>&1
+_scratch_mount >> $seqres.full 2>&1
+_run_btrfs_util_prog filesystem show
+dd if=/dev/zero of="$SCRATCH_MNT"/tf1 bs=$bs count=1 \
+	>>$seqres.full 2>&1
+count=$(( count-- ))
+echo "unmount" >> $seqres.full
+echo "clean btrfs ko" >> $seqres.full
+_scratch_unmount
+
+# un-scan the btrfs devices
+_reload_btrfs_ko
+
+
+echo >> $seqres.full
+echo "-Write degraded mount fill upto
Re: [PATCH v2 1/6] fstests: btrfs: add functions to set and reset required number of SCRATCH_DEV_POOL
On 06/21/2016 09:20 PM, Eryu Guan wrote:

On Wed, Jun 15, 2016 at 04:46:03PM +0800, Anand Jain wrote:

From: Anand Jain

This patch provides the functions
  _scratch_dev_pool_get()
  _scratch_dev_pool_put()
which help to set/reset SCRATCH_DEV_POOL with the required number of devices. SCRATCH_DEV_POOL_SAVED will hold all the devices.

Usage:
  _scratch_dev_pool_get()
  :: do stuff
  _scratch_dev_pool_put()

Signed-off-by: Anand Jain

I think the helpers should be introduced when they are first used, so that we know why they're introduced, and know how they're used with clear examples. So patches 1, 2 and 4 can be merged as one patch, patch 3 updates btrfs/027, patches 4 and 5 can be merged as one patch, then comes patch 6. What do you think?

Hm. I won't bother much on that; will do if needed. (Just a note: on the other hand, a new helper function gets noticed more readily if it comes in a separate patch.)

---
v2: Error message and comment section updates.

 common/rc | 55 +++
 1 file changed, 55 insertions(+)

diff --git a/common/rc b/common/rc
index 51092a0644f0..31c46ba1226e 100644
--- a/common/rc
+++ b/common/rc
@@ -802,6 +802,61 @@ _scratch_mkfs()
 	esac
 }
 
+#
+# Generally test cases will have..
+#	_require_scratch_dev_pool X
+# to make sure there are enough scratch devices, including
+# the replace-target and spare device. Now arg1 here is the
+# number of scratch devices required by a test case, excluding
+# the replace-target and spare device. So this function will
+# set SCRATCH_DEV_POOL to the specified number of devices.
+#
+# Usage:
+#  _scratch_dev_pool_get()
+#  :: do stuff
+#
+#  _scratch_dev_pool_put()
+#
+_scratch_dev_pool_get()
+{
+	if [ $# != 1 ]; then

"-ne", "-eq" etc. are used for comparing integers; "!=", "==" are for strings. And I think $#, $? are integers here, so "-ne" would be better.

Thanks for the catch. Fixed in the next rev.
- Anand

Thanks, Eryu

+		_fail "Usage: _scratch_dev_pool_get ndevs"
+	fi
+
+	local test_ndevs=$1
+	local config_ndevs=`echo $SCRATCH_DEV_POOL | wc -w`
+	local devs[]="( $SCRATCH_DEV_POOL )"
+
+	typeset -p config_ndevs >/dev/null 2>&1
+	if [ $? != 0 ]; then
+		_fail "Bug: cant find SCRATCH_DEV_POOL ndevs"
+	fi
+
+	if [ $config_ndevs -lt $test_ndevs ]; then
+		_notrun "Need at least test requested number of ndevs $test_ndevs"
+	fi
+
+	SCRATCH_DEV_POOL_SAVED=${SCRATCH_DEV_POOL}
+	export SCRATCH_DEV_POOL_SAVED
+	SCRATCH_DEV_POOL=${devs[@]:0:$test_ndevs}
+	export SCRATCH_DEV_POOL
+}
+
+_scratch_dev_pool_put()
+{
+	typeset -p SCRATCH_DEV_POOL_SAVED >/dev/null 2>&1
+	if [ $? != 0 ]; then
+		_fail "Bug: unset val, must call _scratch_dev_pool_get before _scratch_dev_pool_put"
+	fi
+
+	if [ -z "$SCRATCH_DEV_POOL_SAVED" ]; then
+		_fail "Bug: str empty, must call _scratch_dev_pool_get before _scratch_dev_pool_put"
+	fi
+
+	export SCRATCH_DEV_POOL=$SCRATCH_DEV_POOL_SAVED
+	export SCRATCH_DEV_POOL_SAVED=""
+}
+
 _scratch_pool_mkfs()
 {
 	case $FSTYP in
--
2.7.0
--
To unsubscribe from this list: send the line "unsubscribe fstests" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 2/2] btrfs: wait for bdev put
Thanks for the review, Chris.

On 06/21/2016 09:00 PM, Chris Mason wrote:

On 06/21/2016 06:24 AM, Anand Jain wrote:

From: Anand Jain

Further to the commit bc178622d40d87e75abc131007342429c9b03351
  btrfs: use rcu_barrier() to wait for bdev puts at unmount
this patch implements a method to time-wait on the __free_device() which actually does the bdev put. This is needed as user space running 'btrfs fi show -d' immediately after the replace and unmount is still reading older information from the device.

Thanks for working on this, Anand. Since it looks like blkdev_put can deadlock against us, can we please switch to making sure we fully flush the outstanding IO? It's probably enough to do a sync_blockdev() call before we allow the unmount to finish, but we can toss in an invalidate_bdev for good measure.

# git diff
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 604daf315669..e0ad29d6fe9a 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -870,6 +870,11 @@ static void btrfs_close_one_device(struct btrfs_device *device)
 	if (device->missing)
 		fs_devices->missing_devices--;
 
+	if (device->bdev && device->writeable) {
+		sync_blockdev(device->bdev);
+		invalidate_bdev(device->bdev);
+	}
+
 	new_device = btrfs_alloc_device(NULL, &device->devid,
 					device->uuid);
 	BUG_ON(IS_ERR(new_device)); /* -ENOMEM */

However, theoretically there might still be a problem: at the end of unmount, if the device's exclusive open is not actually closed, then there might be a race with another program which is trying to open the device in exclusive mode. For example:

  unmount /btrfs; fsck /dev/X

and here fsck might fail to open the device if it wins the race.

Then we can get rid of the mdelay loop completely, which seems pretty error-prone to me.

Yes, the code got a little complex here (and also when the sysfs fixes were written), as struct btrfs_device is getting migrated to a new struct btrfs_device at unmount, which I don't think was necessary?

Thanks, Anand

Thanks!
-chris
Re: btrfs multi device handling
Hi, any inputs on this would be appreciated.
Re: [PATCH] btrfs: fix disk_i_size update bug when fallocate() fails
Hello,

On 06/22/2016 02:24 PM, Christoph Hellwig wrote:

On Wed, Jun 22, 2016 at 09:57:01AM +0800, Wang Xiaoguang wrote:

And the below test case can reveal this bug:

Please wire this up for xfstests, thanks!

OK.

Regards,
Xiaoguang Wang
Re: [PATCH] btrfs: fix disk_i_size update bug when fallocate() fails
On Wed, Jun 22, 2016 at 09:57:01AM +0800, Wang Xiaoguang wrote:
> And the below test case can reveal this bug:

Please wire this up for xfstests, thanks!