Re: [PATCH 00/17] fs, btrfs refcount conversions
At 03/03/2017 04:55 PM, Elena Reshetova wrote:

Now that the new refcount_t type and API are finally merged (see include/linux/refcount.h), the following patches convert various refcounters in the btrfs filesystem from atomic_t to refcount_t. By doing this we prevent intentional or accidental underflows or overflows that can lead to use-after-free vulnerabilities.

The patches below are fully independent and can be cherry-picked separately. Since we convert all kernel subsystems in the same fashion, resulting in about 300 patches, we have to group them for sending, at least in some fashion, to be manageable. Please excuse the long cc list.

These patches have been tested with xfstests by running the btrfs-related tests. btrfs debug was enabled, which warns on refcount errors, too. No output related to refcount errors was produced. However, the following errors occurred during the run:

* tests btrfs/078, btrfs/114, btrfs/115: no errors anywhere in dmesg, but the process hangs. They all seem to be around qgroup; sometimes an error such as "qgroup scan failed -4" is visible before it blocks, but not always.

How reproducible is the hang? I also see the -EINTR output, but that seems to be by design for btrfs/11[45]. btrfs/078 is unrelated to qgroup, and all three of these tests pass in my test environment, which is v4.11-rc1 with your patches applied. I ran these 3 tests in a row with the default and space_cache=v2 mount options, 5 times for each mount option, with no hang at all. It would help a lot if more info could be provided, from the blocked process backtrace to the test mount options and base commit.

Thanks,
Qu

* test btrfs/104: dmesg has additional error output: BTRFS warning (device vdc): qgroup 258 reserved space underflow, have: 0, to free: 4096

I tried looking at the code for what causes the failure, but could not figure it out. It doesn't seem to be related to any refcount changes, at least IMO.

The above test failures are hard for me to understand and interpret, but they don't seem to relate to the refcount conversions.
Elena Reshetova (17):
  fs, btrfs: convert btrfs_bio.refs from atomic_t to refcount_t
  fs, btrfs: convert btrfs_transaction.use_count from atomic_t to refcount_t
  fs, btrfs: convert extent_map.refs from atomic_t to refcount_t
  fs, btrfs: convert btrfs_ordered_extent.refs from atomic_t to refcount_t
  fs, btrfs: convert btrfs_caching_control.count from atomic_t to refcount_t
  fs, btrfs: convert btrfs_delayed_ref_node.refs from atomic_t to refcount_t
  fs, btrfs: convert btrfs_delayed_node.refs from atomic_t to refcount_t
  fs, btrfs: convert btrfs_delayed_item.refs from atomic_t to refcount_t
  fs, btrfs: convert btrfs_root.refs from atomic_t to refcount_t
  fs, btrfs: convert extent_state.refs from atomic_t to refcount_t
  fs, btrfs: convert compressed_bio.pending_bios from atomic_t to refcount_t
  fs, btrfs: convert scrub_recover.refs from atomic_t to refcount_t
  fs, btrfs: convert scrub_page.refs from atomic_t to refcount_t
  fs, btrfs: convert scrub_block.refs from atomic_t to refcount_t
  fs, btrfs: convert scrub_parity.refs from atomic_t to refcount_t
  fs, btrfs: convert scrub_ctx.refs from atomic_t to refcount_t
  fs, btrfs: convert btrfs_raid_bio.refs from atomic_t to refcount_t

 fs/btrfs/backref.c | 2 +-
 fs/btrfs/compression.c | 18 -
 fs/btrfs/ctree.h | 5 +++--
 fs/btrfs/delayed-inode.c | 46 ++--
 fs/btrfs/delayed-inode.h | 5 +++--
 fs/btrfs/delayed-ref.c | 8
 fs/btrfs/delayed-ref.h | 8 +---
 fs/btrfs/disk-io.c | 6 +++---
 fs/btrfs/disk-io.h | 4 ++--
 fs/btrfs/extent-tree.c | 20 +--
 fs/btrfs/extent_io.c | 18 -
 fs/btrfs/extent_io.h | 3 ++-
 fs/btrfs/extent_map.c | 10 +-
 fs/btrfs/extent_map.h | 3 ++-
 fs/btrfs/ordered-data.c | 20 +--
 fs/btrfs/ordered-data.h | 2 +-
 fs/btrfs/raid56.c | 19 +-
 fs/btrfs/scrub.c | 42
 fs/btrfs/transaction.c | 20 +--
 fs/btrfs/transaction.h | 3 ++-
 fs/btrfs/tree-log.c | 2 +-
 fs/btrfs/volumes.c | 10 +-
 fs/btrfs/volumes.h | 2 +-
 include/trace/events/btrfs.h | 4 ++--
 24 files changed, 143 insertions(+), 137 deletions(-)
Re: [PATCH v3] btrfs: relocation: Enhance kernel error output for relocation
Any comment on this patch?

Thanks,
Qu

At 02/15/2017 09:39 AM, Qu Wenruo wrote:

When balance (relocation) fails, btrfs-progs will report something like:

ERROR: error during balancing '/mnt/scratch': Input/output error
There may be more info in syslog - try dmesg | tail

However, the kernel can't provide much useful info in many cases to locate the problem. This patch adds error messages in relocation to help users and developers locate the problem.

Signed-off-by: Qu Wenruo
---
v2: Fix typo where 'err' and 'ret' are used wrongly.
v3: Add space between error number and parenthesis of error string.
    Fix grammar errors.
---
 fs/btrfs/relocation.c | 61 +++
 1 file changed, 57 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 379711048fb0..6de5800e17bc 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -4011,6 +4011,8 @@ static noinline_for_stack int relocate_block_group(struct reloc_control *rc)
 				     rc->block_rsv, rc->block_rsv->size,
 				     BTRFS_RESERVE_FLUSH_ALL);
 		if (ret) {
+			btrfs_err(fs_info, "failed to reserve space: %d (%s)",
+				  ret, btrfs_decode_error(ret));
 			err = ret;
 			break;
 		}
@@ -4019,6 +4021,9 @@ static noinline_for_stack int relocate_block_group(struct reloc_control *rc)
 		if (IS_ERR(trans)) {
 			err = PTR_ERR(trans);
 			trans = NULL;
+			btrfs_err(fs_info,
+				  "failed to start transaction: %d (%s)",
+				  err, btrfs_decode_error(err));
 			break;
 		}
 restart:
@@ -4028,8 +4033,12 @@ static noinline_for_stack int relocate_block_group(struct reloc_control *rc)
 		}
 		ret = find_next_extent(rc, path, );
-		if (ret < 0)
+		if (ret < 0) {
+			btrfs_err(fs_info,
+				  "failed to find next extent: %d (%s)",
+				  ret, btrfs_decode_error(ret));
 			err = ret;
+		}
 		if (ret != 0)
 			break;
@@ -4081,9 +4090,17 @@ static noinline_for_stack int relocate_block_group(struct reloc_control *rc)
 		if (flags & BTRFS_EXTENT_FLAG_TREE_BLOCK) {
 			ret = add_tree_block(rc, , path, );
+			if (ret < 0)
+				btrfs_err(fs_info,
+					  "failed to record tree block: %d (%s)",
+					  ret, btrfs_decode_error(ret));
 		} else if (rc->stage == UPDATE_DATA_PTRS &&
 			   (flags & BTRFS_EXTENT_FLAG_DATA)) {
 			ret = add_data_references(rc, , path, );
+			if (ret < 0)
+				btrfs_err(fs_info,
+					  "failed to record data extent: %d (%s)",
+					  ret, btrfs_decode_error(ret));
 		} else {
 			btrfs_release_path(path);
 			ret = 0;
@@ -4103,6 +4120,9 @@ static noinline_for_stack int relocate_block_group(struct reloc_control *rc)
 			rc->backref_cache.last_trans = trans->transid - 1;
 			if (ret != -EAGAIN) {
+				btrfs_err(fs_info,
+					  "failed to relocate tree blocks: %d (%s)",
+					  ret, btrfs_decode_error(ret));
 				err = ret;
 				break;
 			}
@@ -4121,6 +4141,9 @@ static noinline_for_stack int relocate_block_group(struct reloc_control *rc)
 			ret = relocate_data_extent(rc->data_inode, ,
 						   >cluster);
 			if (ret < 0) {
+				btrfs_err(fs_info,
+					  "failed to relocate data extent: %d (%s)",
+					  ret, btrfs_decode_error(ret));
 				err = ret;
 				break;
 			}
@@ -4147,8 +4170,12 @@ static noinline_for_stack int relocate_block_group(struct reloc_control *rc)
 	if (!err) {
 		ret = relocate_file_extent_cluster(rc->data_inode,
 						   >cluster);
-		if (ret < 0)
+		if (ret < 0) {
+			btrfs_err(fs_info,
+				  "failed to relocate file extent
[PATCH v6 2/2] btrfs: Handle delalloc error correctly to avoid ordered extent hang
[BUG]
If run_delalloc_range() returns an error and some ordered extents have already been created, btrfs will hang with the following backtrace:

Call Trace:
 __schedule+0x2d4/0xae0
 schedule+0x3d/0x90
 btrfs_start_ordered_extent+0x160/0x200 [btrfs]
 ? wake_atomic_t_function+0x60/0x60
 btrfs_run_ordered_extent_work+0x25/0x40 [btrfs]
 btrfs_scrubparity_helper+0x1c1/0x620 [btrfs]
 btrfs_flush_delalloc_helper+0xe/0x10 [btrfs]
 process_one_work+0x2af/0x720
 ? process_one_work+0x22b/0x720
 worker_thread+0x4b/0x4f0
 kthread+0x10f/0x150
 ? process_one_work+0x720/0x720
 ? kthread_create_on_node+0x40/0x40
 ret_from_fork+0x2e/0x40

[CAUSE]
|<---------------- delalloc range ---------------->|
| OE 1 | OE 2 | ... | OE n |
|<>|           |<-------- cleanup range -------->|
 ||
 \_=> First page handled by end_extent_writepage() in __extent_writepage()

The problem is caused by the error handler of run_delalloc_range(), which doesn't handle already-created ordered extents, leaving them waiting on btrfs_finish_ordered_io() to finish. However, after run_delalloc_range() returns an error, __extent_writepage() won't submit a bio, so btrfs_writepage_end_io_hook() won't be triggered except for the first page, and btrfs_finish_ordered_io() won't be triggered for the created ordered extents either. So OE 2~n will hang forever, and if OE 1 is larger than one page, it will also hang.

[FIX]
Introduce a btrfs_cleanup_ordered_extents() function to clean up the created ordered extents and finish them manually. The function is based on the existing btrfs_endio_direct_write_update_ordered() function, modified to act just like btrfs_writepage_end_io_hook() but handling a specified range rather than a single page. After the fix, delalloc errors are handled like:

|<---------------- delalloc range ---------------->|
| OE 1 | OE 2 | ... | OE n |
|<>|<--------- cleanup --------->|<-- old error handler -->|
 ||       ||
 ||       \_=> Cleaned up by btrfs_cleanup_ordered_extents()
 \_=> First page handled by end_extent_writepage() in __extent_writepage()

Signed-off-by: Qu Wenruo
Signed-off-by: Filipe Manana
---
changelog:
v2: Add BTRFS_ORDERED_SKIP_METADATA flag to avoid double-reducing outstanding extents, which is already done by extent_clear_unlock_delalloc() with the EXTENT_DO_ACCOUNTING control bit.
v3: Skip the first page to avoid underflowing ordered->bytes_left.
    Fix the range passed in cow_file_range() which didn't cover the whole extent.
    Expand the extent_clear_unlock_delalloc() range to allow handling metadata release.
v4: Don't use an extra bit to skip metadata freeing for ordered extents; instead only handle the btrfs_reloc_clone_csums() error just before processing the next extent. This makes error handling much easier for run_delalloc_nocow().
v5: Various grammar and comment fixes suggested by Filipe Manana.
    Enhanced the commit message to focus on the generic error-handler bug, pointed out by Filipe Manana.
    Added the missing wq/func user in __endio_write_update_ordered(), pointed out by Filipe Manana.
    Enhanced the commit message with ASCII art to explain the bug more easily.
    Fixed a bug which can lead to corrupted extent allocation, exposed by Filipe Manana.
v6: Split the metadata underflow fix into another patch.
    Use the better comment and btrfs_cleanup_ordered_extents() implementation from Filipe Manana.
---
 fs/btrfs/inode.c | 62 ++--
 1 file changed, 47 insertions(+), 15 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 1d83d504f2e5..9b03eb9b27d0 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -115,6 +115,30 @@ static struct extent_map *create_io_em(struct inode *inode, u64 start, u64 len,
 				       u64 ram_bytes, int compress_type,
 				       int type);
+static void __endio_write_update_ordered(struct inode *inode,
+					 u64 offset, u64 bytes,
+					 bool uptodate);
+
+/*
+ * Cleanup all submitted ordered extents in specified range to handle errors
+ * from the fill_delalloc() callback.
+ *
+ * NOTE: caller must ensure that when an error happens, it can not call
+ * extent_clear_unlock_delalloc() to clear both the bits EXTENT_DO_ACCOUNTING
+ * and EXTENT_DELALLOC simultaneously, because that causes the reserved metadata
+ * to be released, which we want to happen only when finishing the ordered
+ * extent (btrfs_finish_ordered_io()). Also note that the caller of the
+ * fill_delalloc() callback already does proper cleanup for the first page of
+ * the range, that is, it invokes the callback writepage_end_io_hook() for the
+ * range of the first page.
+ */
+static inline void btrfs_cleanup_ordered_extents(struct inode *inode,
+						 u64 offset, u64 bytes)
+{
+	return __endio_write_update_ordered(inode, offset + PAGE_SIZE,
[PATCH v6 1/2] btrfs: Fix metadata underflow caused by btrfs_reloc_clone_csum error
[BUG]
When btrfs_reloc_clone_csum() reports an error, it can underflow metadata and lead to a kernel assertion on outstanding extents in run_delalloc_nocow() and cow_file_range().

BTRFS info (device vdb5): relocating block group 12582912 flags data
BTRFS info (device vdb5): found 1 extents
assertion failed: inode->outstanding_extents >= num_extents, file: fs/btrfs//extent-tree.c, line: 5858

Currently, due to another bug blocking ordered extents, the bug is only reproducible under a certain block group layout and using error injection:

a) Create one data block group with one 4K extent in it.
   (To avoid the bug that hangs btrfs due to an ordered extent which never finishes)
b) Make btrfs_reloc_clone_csum() always fail.
c) Relocate that block group.

[CAUSE]
run_delalloc_nocow() and cow_file_range() handle errors from btrfs_reloc_clone_csum() wrongly. (The ASCII chart shows a more generic case of this bug than the one mentioned above.)

|<------------------ delalloc range ------------------>|
| OE 1 | OE 2 | ... | OE n |
           |<--------------- cleanup range --------------->|
|<--------- btrfs_finish_ordered_io() range --------->|

So the error handler, which calls extent_clear_unlock_delalloc() with the EXTENT_DELALLOC and EXTENT_DO_ACCOUNTING bits, and btrfs_finish_ordered_io() will both cover OE n and free its metadata, causing a metadata underflow.

[FIX]
The fix is to ensure that, after calling btrfs_add_ordered_extent(), we only call the error handler after increasing the iteration offset, so that the cleanup range won't cover any created ordered extent:

|<------------------ delalloc range ------------------>|
| OE 1 | OE 2 | ... | OE n |
|<-- btrfs_finish_ordered_io() range -->|<-- cleanup range -->|

Signed-off-by: Qu Wenruo
---
changelog:
v6: New, split from the v5 patch, as this is a separate bug.
---
 fs/btrfs/inode.c | 51 +++
 1 file changed, 39 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index b2bc07aad1ae..1d83d504f2e5 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -998,15 +998,24 @@ static noinline int cow_file_range(struct inode *inode,
 		    BTRFS_DATA_RELOC_TREE_OBJECTID) {
 			ret = btrfs_reloc_clone_csums(inode, start,
 						      cur_alloc_size);
+			/*
+			 * Only drop cache here, and process as normal.
+			 *
+			 * We must not allow extent_clear_unlock_delalloc()
+			 * at out_unlock label to free meta of this ordered
+			 * extent, as its meta should be freed by
+			 * btrfs_finish_ordered_io().
+			 *
+			 * So we must continue until @start is increased to
+			 * skip current ordered extent.
+			 */
 			if (ret)
-				goto out_drop_extent_cache;
+				btrfs_drop_extent_cache(BTRFS_I(inode), start,
+						start + ram_size - 1, 0);
 		}
 		btrfs_dec_block_group_reservations(fs_info, ins.objectid);
-		if (disk_num_bytes < cur_alloc_size)
-			break;
-
 		/* we're not doing compressed IO, don't unlock the first
 		 * page (which the caller expects to stay locked), don't
 		 * clear any dirty bits and don't set any writeback bits
@@ -1022,10 +1031,21 @@ static noinline int cow_file_range(struct inode *inode,
 					     delalloc_end, locked_page,
 					     EXTENT_LOCKED | EXTENT_DELALLOC,
 					     op);
-		disk_num_bytes -= cur_alloc_size;
+		if (disk_num_bytes > cur_alloc_size)
+			disk_num_bytes = 0;
+		else
+			disk_num_bytes -= cur_alloc_size;
 		num_bytes -= cur_alloc_size;
 		alloc_hint = ins.objectid + ins.offset;
 		start += cur_alloc_size;
+
+		/*
+		 * btrfs_reloc_clone_csums() error, since start is increased
+		 * extent_clear_unlock_delalloc() at out_unlock label won't
+		 * free metadata of current ordered extent, we're OK to exit.
+		 */
+		if (ret)
+			goto out_unlock;
 	}
 out:
 	return ret;
@@ -1414,15 +1434,14 @@ static noinline int run_delalloc_nocow(struct inode *inode,
 			BUG_ON(ret); /* -ENOMEM */
 
 			if (root->root_key.objectid ==
-			    BTRFS_DATA_RELOC_TREE_OBJECTID) {
+
Re: Why do BTRFS (still) forgets what device to write to?
waxhead posted on Sun, 05 Mar 2017 17:26:36 +0100 as excerpted:

> I am doing some tests on BTRFS with both data and metadata in raid1.
>
> uname -a
> Linux daffy 4.9.0-1-amd64 #1 SMP Debian 4.9.6-3 (2017-01-28) x86_64 GNU/Linux
>
> btrfs --version
> btrfs-progs v4.7.3
>
> 01. mkfs.btrfs /dev/sd[fgh]1
> 02. mount /dev/sdf1 /btrfs_test/
> 03. btrfs balance start -dconvert=raid1 /btrfs_test/
> 04. copied a lot of 3-4MB files to it (about 40GB)...
> 05. Started to compress some of the files to create one larger file...
> 06. Pulled the (sata) plug on one of the drives... (sdf1)
> 07. dmesg shows that the kernel is rejecting I/O to offline device + [sdf] killing request
> 08. BTRFS error (device sdf1) bdev /dev/sdf1 errs: wr 0, rd 1, flush 0, corrupt 0, gen 0
> 09. the previous line repeats - increasing rd count
> 10. Reconnecting the sdf1 drive again makes it show up as sdi1
> 11. btrfs fi sh /btrfs_test shows sdi1 as the correct device id (1).
> 12. Yet dmesg shows tons of errors like this: BTRFS error (device sdf1): bdev /dev/sdi1 errs wr 37182, rd 39851, flush 1, corrupt 0, gen 0
> 13. and the above line repeats, increasing wr and rd errors.
> 14. BTRFS never seems to "get in tune again" while the filesystem is mounted.
>
> The conclusion appears to be that the device ID is back again in the btrfs pool, so why does btrfs still try to write to the wrong device (or does it?!).

The base problem is that btrfs doesn't (yet) have any concept of a device disconnecting and reconnecting "live", only after unmount/remount. When a device drops out, btrfs will continue to attempt to write to it. Things will continue normally on all other devices, and only after some time will btrfs actually finally give up on the device. (I /believe/ it's after the level of dirty memory exceeds some safety threshold, with the unwritten writes taking up a larger and larger part of dirty memory until something gives.
However, I'm not a dev, just a user and list regular, and this is just my supposition filling in the blanks, so don't take it for gospel unless you get confirmation either directly from the code or from an actual dev.)

If the outage is short enough for the kernel to bring back the device as the same device node, great, btrfs can and does resume writing to it. However, once the outage is long enough that the kernel brings back the physical device as a different device node, yes, btrfs filesystem show will show the device back at its normal ID, but that information isn't properly communicated to the "live" still-mounted filesystem, and it continues to attempt writing to the old device node.

There are plans for, and even patches that introduce limited support for, live detection and automatic (re)integration of a new or reintroduced device, but those patches are part of a long-term development project and, last I read, weren't in a state where they applied cleanly to current kernels, as they haven't been kept current and have gone stale.

Of course it should be kept in mind that btrfs is still under heavy development, and while stabilizing, it isn't considered, certainly not by its devs, to be anywhere near feature-complete and stabilized, even, at times such as this, for features that are generally considered as reasonably stable and mature as btrfs itself is -- that is, still stabilizING, not yet fully stable and mature. Keep backups and be prepared to use them if you value your data, because you may indeed be calling on them! In that state, it's only to be expected that there will still be some incomplete features such as this, where manual intervention may be required that wouldn't be in more complete/stable/mature solutions. Basically, it comes with the territory.

> The good thing here is that BTRFS does still work fine after an unmount and mount again. Running a scrub on the filesystem cleans up tons of errors, but no uncorrectable errors.

Correct.
An unmount will leave all that data unwritten to the device it still considers missing, so of course those checksums aren't going to match. On remount, btrfs sees the device again, and should, and AFAIK consistently does, note the difference in commit generations, pulling from the updated device where they differ. A scrub can then be used to bring the outdated device back in sync.

But be sure to do that scrub as soon as possible. Should further instability continue to drop out devices, or further not entirely graceful unmounts/shutdowns occur, the damage may get worse and not be entirely repairable, certainly not with only a simple scrub.

> However it says total bytes scrubbed 94.21GB with 75 errors ... and further down it says corrected errors: 72, uncorrectable errors: 0, unverified errors: 0
>
> Why 75 vs 72 errors?! did it correct all or not?

From my own experience (and I actually deliberately ran btrfs raid1 with a failing device for awhile to test this sort of thing, btrfs' checksumming
Re: [PATCH v2] btrfs-progs: report I/O errors when closing the filesystem
At 03/04/2017 01:02 AM, Omar Sandoval wrote:

From: Omar Sandoval

If the final fsync() on the Btrfs device fails, we just swallow the error and don't alert the user in any way. This was uncovered by xfstest generic/405, which checks that mkfs fails when it encounters EIO.

Signed-off-by: Omar Sandoval

Please ignore my comment on the v1 patch, I didn't see the v2 one.

Looks good to me.

Reviewed-by: Qu Wenruo

Thanks,
Qu

---
Changes from v1: return -errno instead of -1

 disk-io.c | 4 ++--
 volumes.c | 9 +++--
 2 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/disk-io.c b/disk-io.c
index 2a94d4fc..48fb3c6b 100644
--- a/disk-io.c
+++ b/disk-io.c
@@ -1817,10 +1817,10 @@ int close_ctree_fs_info(struct btrfs_fs_info *fs_info)
 	free_fs_roots_tree(_info->fs_root_tree);
 	btrfs_release_all_roots(fs_info);
-	btrfs_close_devices(fs_info->fs_devices);
+	ret = btrfs_close_devices(fs_info->fs_devices);
 	btrfs_cleanup_all_caches(fs_info);
 	btrfs_free_fs_info(fs_info);
-	return 0;
+	return ret;
 }
 
 int clean_tree_block(struct btrfs_trans_handle *trans, struct btrfs_root *root,
diff --git a/volumes.c b/volumes.c
index a0a85edd..f7d82d51 100644
--- a/volumes.c
+++ b/volumes.c
@@ -160,6 +160,7 @@ int btrfs_close_devices(struct btrfs_fs_devices *fs_devices)
 {
 	struct btrfs_fs_devices *seed_devices;
 	struct btrfs_device *device;
+	int ret = 0;
 again:
 	if (!fs_devices)
@@ -168,7 +169,11 @@ again:
 	device = list_entry(fs_devices->devices.next,
 			    struct btrfs_device, dev_list);
 	if (device->fd != -1) {
-		fsync(device->fd);
+		if (fsync(device->fd) == -1) {
+			warning("fsync on device %llu failed: %s",
+				device->devid, strerror(errno));
+			ret = -errno;
+		}
 		if (posix_fadvise(device->fd, 0, 0, POSIX_FADV_DONTNEED))
 			fprintf(stderr, "Warning, could not drop caches\n");
 		close(device->fd);
@@ -197,7 +202,7 @@ again:
 		free(fs_devices);
 	}
-	return 0;
+	return ret;
 }
 
 void btrfs_close_all_devices(void)
Re: [PATCH] btrfs-progs: report I/O errors when closing the filesystem
At 03/04/2017 12:59 AM, Omar Sandoval wrote:

From: Omar Sandoval

If the final fsync() on the Btrfs device fails, we just swallow the error and don't alert the user in any way. This was uncovered by xfstest generic/405, which checks that mkfs fails when it encounters EIO.

Signed-off-by: Omar Sandoval
---
 disk-io.c | 4 ++--
 volumes.c | 9 +++--
 2 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/disk-io.c b/disk-io.c
index 2a94d4fc..48fb3c6b 100644
--- a/disk-io.c
+++ b/disk-io.c
@@ -1817,10 +1817,10 @@ int close_ctree_fs_info(struct btrfs_fs_info *fs_info)
 	free_fs_roots_tree(_info->fs_root_tree);
 	btrfs_release_all_roots(fs_info);
-	btrfs_close_devices(fs_info->fs_devices);
+	ret = btrfs_close_devices(fs_info->fs_devices);
 	btrfs_cleanup_all_caches(fs_info);
 	btrfs_free_fs_info(fs_info);
-	return 0;
+	return ret;
 }
 
 int clean_tree_block(struct btrfs_trans_handle *trans, struct btrfs_root *root,
diff --git a/volumes.c b/volumes.c
index a0a85edd..89335d35 100644
--- a/volumes.c
+++ b/volumes.c
@@ -160,6 +160,7 @@ int btrfs_close_devices(struct btrfs_fs_devices *fs_devices)
 {
 	struct btrfs_fs_devices *seed_devices;
 	struct btrfs_device *device;
+	int ret = 0;
 again:
 	if (!fs_devices)
@@ -168,7 +169,11 @@ again:
 	device = list_entry(fs_devices->devices.next,
 			    struct btrfs_device, dev_list);
 	if (device->fd != -1) {
-		fsync(device->fd);
+		if (fsync(device->fd) == -1) {
+			warning("fsync on device %llu failed: %s",
+				device->devid, strerror(errno));
+			ret = -1;

-1 is -EINVAL, better using -errno here.

Apart from that, it looks good.

Thanks,
Qu

+		}
 		if (posix_fadvise(device->fd, 0, 0, POSIX_FADV_DONTNEED))
 			fprintf(stderr, "Warning, could not drop caches\n");
 		close(device->fd);
@@ -197,7 +202,7 @@ again:
 		free(fs_devices);
 	}
-	return 0;
+	return ret;
 }
 
 void btrfs_close_all_devices(void)
Re: [PATCH 00/17] fs, btrfs refcount conversions
At 03/03/2017 04:55 PM, Elena Reshetova wrote:

Now that the new refcount_t type and API are finally merged (see include/linux/refcount.h), the following patches convert various refcounters in the btrfs filesystem from atomic_t to refcount_t. By doing this we prevent intentional or accidental underflows or overflows that can lead to use-after-free vulnerabilities.

The patches below are fully independent and can be cherry-picked separately. Since we convert all kernel subsystems in the same fashion, resulting in about 300 patches, we have to group them for sending, at least in some fashion, to be manageable. Please excuse the long cc list.

These patches have been tested with xfstests by running the btrfs-related tests. btrfs debug was enabled, which warns on refcount errors, too. No output related to refcount errors was produced. However, the following errors occurred during the run:

* tests btrfs/078, btrfs/114, btrfs/115: no errors anywhere in dmesg, but the process hangs. They all seem to be around qgroup; sometimes an error such as "qgroup scan failed -4" is visible before it blocks, but not always.

-EINTR? That's strange. Any blocked process backtrace?

* test btrfs/104: dmesg has additional error output: BTRFS warning (device vdc): qgroup 258 reserved space underflow, have: 0, to free: 4096

A known issue; a fix has already been sent to the mailing list but is not yet merged: https://patchwork.kernel.org/patch/9592765/

Thanks,
Qu

I tried looking at the code for what causes the failure, but could not figure it out. It doesn't seem to be related to any refcount changes, at least IMO.

The above test failures are hard for me to understand and interpret, but they don't seem to relate to the refcount conversions.
Elena Reshetova (17):
  fs, btrfs: convert btrfs_bio.refs from atomic_t to refcount_t
  fs, btrfs: convert btrfs_transaction.use_count from atomic_t to refcount_t
  fs, btrfs: convert extent_map.refs from atomic_t to refcount_t
  fs, btrfs: convert btrfs_ordered_extent.refs from atomic_t to refcount_t
  fs, btrfs: convert btrfs_caching_control.count from atomic_t to refcount_t
  fs, btrfs: convert btrfs_delayed_ref_node.refs from atomic_t to refcount_t
  fs, btrfs: convert btrfs_delayed_node.refs from atomic_t to refcount_t
  fs, btrfs: convert btrfs_delayed_item.refs from atomic_t to refcount_t
  fs, btrfs: convert btrfs_root.refs from atomic_t to refcount_t
  fs, btrfs: convert extent_state.refs from atomic_t to refcount_t
  fs, btrfs: convert compressed_bio.pending_bios from atomic_t to refcount_t
  fs, btrfs: convert scrub_recover.refs from atomic_t to refcount_t
  fs, btrfs: convert scrub_page.refs from atomic_t to refcount_t
  fs, btrfs: convert scrub_block.refs from atomic_t to refcount_t
  fs, btrfs: convert scrub_parity.refs from atomic_t to refcount_t
  fs, btrfs: convert scrub_ctx.refs from atomic_t to refcount_t
  fs, btrfs: convert btrfs_raid_bio.refs from atomic_t to refcount_t

 fs/btrfs/backref.c | 2 +-
 fs/btrfs/compression.c | 18 -
 fs/btrfs/ctree.h | 5 +++--
 fs/btrfs/delayed-inode.c | 46 ++--
 fs/btrfs/delayed-inode.h | 5 +++--
 fs/btrfs/delayed-ref.c | 8
 fs/btrfs/delayed-ref.h | 8 +---
 fs/btrfs/disk-io.c | 6 +++---
 fs/btrfs/disk-io.h | 4 ++--
 fs/btrfs/extent-tree.c | 20 +--
 fs/btrfs/extent_io.c | 18 -
 fs/btrfs/extent_io.h | 3 ++-
 fs/btrfs/extent_map.c | 10 +-
 fs/btrfs/extent_map.h | 3 ++-
 fs/btrfs/ordered-data.c | 20 +--
 fs/btrfs/ordered-data.h | 2 +-
 fs/btrfs/raid56.c | 19 +-
 fs/btrfs/scrub.c | 42
 fs/btrfs/transaction.c | 20 +--
 fs/btrfs/transaction.h | 3 ++-
 fs/btrfs/tree-log.c | 2 +-
 fs/btrfs/volumes.c | 10 +-
 fs/btrfs/volumes.h | 2 +-
 include/trace/events/btrfs.h | 4 ++--
 24 files changed, 143 insertions(+), 137 deletions(-)
Re: [4.7.2] btrfs_run_delayed_refs:2963: errno=-17 Object already exists
On Wednesday, 1 March 2017 19:14:07 CET you wrote:
> In any case, I started btrfs-check on the device itself.

*Sigh*, I had to restart it, because I forgot to redirect to a file and quite frankly wasn't expecting this flood of output, but here's a summary of the output after about 2 days:

% wc -l btrfs_check_output_20170303.log
1360252 btrfs_check_output_20170303.log

% grep -v "backref lost" btrfs_check_output_20170303.log | grep -v "check \(leaf\|node\) failed" | grep -v "lost its parent" | grep -v "referencer count"
checking extents
ERROR: block group[3879328546816 1073741824] used 1072840704 but extent items used 1129164800
ERROR: block group[4163870130176 1073741824] used 1072259072 but extent items used 0
ERROR: block group[4223999672320 1073741824] used 1073664000 but extent items used 1074188288
ERROR: block group[4278760505344 1073741824] used 1073377280 but extent items used 1073901568
ERROR: block group[4406535782400 1073741824] used 1073627136 but extent items used 0
ERROR: extent [3830028140544 4096] referencer bytenr mismatch, wanted: 3830028140544, have: 3826183847936

% tail btrfs_check_output_20170303.log
ERROR: data extent[3924538445824 4096] backref lost
ERROR: data extent[3933464903680 4096] backref lost
ERROR: data extent[3924538531840 4096] backref lost
ERROR: data extent[3839131480064 4096] backref lost
ERROR: data extent[3834701750272 4096] backref lost
ERROR: data extent[3873087918080 4096] backref lost
ERROR: data extent[3873072283648 4096] backref lost
ERROR: data extent[3873088090112 8192] backref lost
ERROR: data extent[3873072287744 4096] backref lost
ERROR: data extent[3856294449152 4096] backref lost

That's right, slowly approaching 1.5 million lines of btrfs-check output! That's *way* more than the last time I ran it, when this error happened a few weeks ago.
Greetings
-- 
Marc Joliet
--
"People who think they know everything really annoy those of us who know we don't" - Bjarne Stroustrup
Re: BTRFS critical: corrupt leaf, slot offset bad; then read-only
Hello Hans,

On 24.02.2017 at 01:26, Hans van Kranenburg wrote:

Once that is done, I would like to go over the "btrfs recovery" thread and see if it can be applied to my case as well. I will certainly need your help when that time comes...

> We can take a stab at it.

I upgraded btrfs-tools to 4.8.1, as 4.4 didn't have btrfs inspect-internal dump-tree. But I cannot find anything about 5242107641856 in the dump-tree output. What does that mean?

Thanks,
Lukas
Re: raid1 degraded mount still produce single chunks, writeable mount not allowed
[ ... on the difference between number of devices and length of a chunk-stripe ... ]

> Note: possibilities get even more interesting with a 4-device volume with 'raid1' profile chunks, and similar cases involving other profiles than 'raid1'.

Consider for example a 4-device volume with 2 devices abruptly missing: if 2-length 'raid1' chunk-stripes have been uniformly laid across the devices, then some chunk-stripes will be completely missing (where both chunks in the stripe were on the 2 missing devices), some will be 1-length, and some will be 2-length.

What to do when devices are missing? One possibility is to simply require mounting with the 'degraded' option, by default read-only but allowing read-write, simply as a way to ensure the sysadmin knows that some metadata/data *may* not be redundant or *may* even be unavailable (if the chunk-stripe length is less than the minimum needed to reconstruct the data). Then attempts to read unavailable metadata or data would return an error, like a checksum violation without redundancy, dynamically (when the application or 'balance' or 'scrub' attempts to read the unavailable data).
Re: raid1 degraded mount still produce single chunks, writeable mount not allowed
>> What makes me think that "unmirrored" 'raid1' profile chunks >> are "not a thing" is that it is impossible to remove >> explicitly a member device from a 'raid1' profile volume: >> first one has to 'convert' to 'single', and then the 'remove' >> copies back to the remaining devices the 'single' chunks that >> are on the explicitly 'remove'd device. Which to me seems >> absurd. > It is, there should be a way to do this as a single operation. > [ ... ] The reason this is currently the case though is a > simple one, 'btrfs device delete' is just a special instance > of balance [ ... ] does no profile conversion, but having > that as an option would actually be _very_ useful from a data > safety perspective. That seems to me an even more "confused" opinion: because removing a device to make it "missing" and removing it permanently should be very different operations. Consider the common case of a 3-member volume with a 'raid1' target profile: if the sysadm thinks that a drive should be replaced, the goal is to take it out *without* converting every chunk to 'single', because with 2-out-of-3 devices a third of the chunks will still be fully mirrored. Also, removing the device to be replaced should really not be the same thing as balancing the chunks, if there is space, to be 'raid1' across the remaining drives, because that is a completely different operation. >> Going further in my speculation, I suspect that at the core of >> the Btrfs multidevice design there is a persistent "confusion" >> (to use a euphemism) between volumes having a profile, and >> merely chunks having a profile. > There generally is. The profile is entirely a property of the > chunks (each chunk literally has a bit of metadata that says > what profile it is), not the volume. There's some metadata in > the volume somewhere that says what profile to use for new > chunks of each type (I think), That's the "target" profile for the volume.
> but that doesn't dictate what chunk profiles there are on the > volume. [ ... ] But as that's the case, the current Btrfs logic for determining whether a volume is degraded or not is quite "confused" indeed. Because suppose there is again the simple case of a 3-device volume, where all existing chunks have the 'raid1' profile, the volume's target profile is also 'raid1', and one device has gone offline: the volume cannot be said to be "degraded" unless a full examination of all chunks is made. Because it can well happen that in fact *none* of the chunks was mirrored to that device, for example, however unlikely. And vice versa. Even with 3 devices some chunks may be temporarily "unmirrored" (even if for brief times, hopefully). With uniform placement the average case is that a third of the chunks will be fully mirrored across the two remaining devices and two thirds will be "unmirrored". Now consider re-adding the third device: at that point the volume has got back all 3 devices, so it is not "degraded", but two thirds of the chunks in the volume will still be "unmirrored", even if eventually they will be mirrored on the newly added device. Note: possibilities get even more interesting with a 4-device volume with 'raid1' profile chunks, and similar cases involving other profiles than 'raid1'. Therefore the current Btrfs logic for deciding whether a volume is "degraded" seems simply "confused" to me, because whether there are missing devices and whether some chunks are "unmirrored" are not quite the same thing. The same applies to the current logic whereby, in a 2-device volume with a device missing, new chunks are created with the 'single' profile instead of as "unmirrored" 'raid1' profile chunks: another example of "confusion" between number of devices and chunk profile.
Note: the best that can be said is that a volume has both a "target chunk profile" (one each for data, metadata, and system chunks) and a target number of member devices, and that a volume with a number of devices below the target *might* be degraded, and that whether a volume is in fact degraded is not either/or, but given by the percentage of chunks or stripes that are degraded. This is especially made clear by the 'raid1' case, where the chunk-stripe length is always 2, but the number of target devices can be greater than 2. Management of devices and management of stripes are, in Btrfs, unlike conventional RAID like Linux MD, rather different operations needing rather different, if related, logic. My impression is that because of "confusion" between the number of devices in a volume and the status of chunk profiles there are some "surprising" behaviors in Btrfs, and that it will take quite a bit to fix, most importantly for the Btrfs developer team to clear up among themselves the semantics attaching to both. After 10 years of development that seems the right thing to do :-).
assertion failed: page_ops & PAGE_LOCK
After commenting out the assertion that Liu Bo pointed out was bogus, my trinity runs last a little longer. This is a new one, I think:

assertion failed: page_ops & PAGE_LOCK, file: fs/btrfs/extent_io.c, line: 1716
[ cut here ]
kernel BUG at fs/btrfs/ctree.h:3423!
invalid opcode: [#1] PREEMPT SMP DEBUG_PAGEALLOC
CPU: 1 PID: 32625 Comm: trinity-c40 Not tainted 4.10.0-think+ #3
task: 8804c6b95440 task.stack: c9ba8000
RIP: 0010:assfail.constprop.68+0x1c/0x1e
RSP: 0018:c9bab6c0 EFLAGS: 00010282
RAX: 004e RBX: 00a3 RCX: 0001
RDX: RSI: 81ee593a RDI:
RBP: c9bab6c0 R08: R09: 0001
R10: 0001 R11: R12: c9bab790
R13: R14: 01f8 R15: ea000b2026c0
FS: 7f102a0e4b40() GS:880507a0() knlGS:
CS: 0010 DS: ES: CR0: 80050033
CR2: 7f1029f21000 CR3: 0004c6bde000 CR4: 001406e0
DR0: 7f695d948000 DR1: DR2: DR3:
DR6: 0ff0 DR7: 0600
Call Trace:
 __process_pages_contig+0x4b4/0x4d0
 __unlock_for_delalloc.isra.43+0x32/0x50
 find_lock_delalloc_range+0x15e/0x210
 writepage_delalloc.isra.51+0x91/0x1a0
 __extent_writepage+0x10d/0x470
 extent_write_cache_pages.constprop.57+0x2d3/0x430
 extent_writepages+0x5d/0x90
 ? btrfs_releasepage+0x20/0x20
 btrfs_writepages+0x28/0x30
 do_writepages+0x21/0x30
 __filemap_fdatawrite_range+0xc6/0x100
 filemap_fdatawrite_range+0x13/0x20
 btrfs_fdatawrite_range+0x20/0x50
 start_ordered_ops+0x19/0x30
 btrfs_sync_file+0x99/0x540
 ? debug_smp_processor_id+0x17/0x20
 ? get_lock_stats+0x19/0x50
 ? debug_smp_processor_id+0x17/0x20
 ? get_lock_stats+0x19/0x50
 vfs_fsync_range+0x4b/0xb0
 btrfs_file_write_iter+0x412/0x570
 ? rcu_sync_lockdep_assert+0x2f/0x60
 __do_readv_writev+0x2ea/0x380
 do_readv_writev+0x7c/0xc0
 ? debug_smp_processor_id+0x17/0x20
 ? get_lock_stats+0x19/0x50
 ? __context_tracking_exit.part.5+0x82/0x1e0
 vfs_writev+0x3f/0x50
 do_pwritev+0xb5/0xd0
 SyS_pwritev2+0x17/0x30
 do_syscall_64+0x66/0x1d0
 entry_SYSCALL64_slow_path+0x25/0x25
RIP: 0033:0x7f1029a0e0f9
RSP: 002b:7ffe59e72c48 EFLAGS: 0246 ORIG_RAX: 0148
RAX: ffda RBX: 0148 RCX: 7f1029a0e0f9
RDX: 00f2 RSI: 562b5ef7de50 RDI: 0185
RBP: 7f1029fc5000 R08: 08002180528a R09: 0007
R10: baba R11: 0246 R12: 0002
R13: 7f1029fc5048 R14: 7f102a0e4ad8 R15: 7f1029fc5000
Why does BTRFS (still) forget what device to write to?
I am doing some tests on BTRFS with both data and metadata in raid1.

uname -a
Linux daffy 4.9.0-1-amd64 #1 SMP Debian 4.9.6-3 (2017-01-28) x86_64 GNU/Linux
btrfs --version
btrfs-progs v4.7.3

01. mkfs.btrfs /dev/sd[fgh]1
02. mount /dev/sdf1 /btrfs_test/
03. btrfs balance start -dconvert=raid1 /btrfs_test/
04. Copied a lot of 3-4MB files to it (about 40GB)...
05. Started to compress some of the files to create one larger file...
06. Pulled the (sata) plug on one of the drives... (sdf1)
07. dmesg shows that the kernel is rejecting I/O to offline device + [sdf] killing request
08. BTRFS error (device sdf1) bdev /dev/sdf1 errs: wr 0, rd 1, flush 0, corrupt 0, gen 0
09. The previous line repeats, increasing the rd count.
10. Reconnecting the sdf1 drive again makes it show up as sdi1.
11. btrfs fi sh /btrfs_test shows sdi1 as the correct device id (1).
12. Yet dmesg shows tons of errors like this: BTRFS error (device sdf1): bdev /dev/sdi1 errs: wr 37182, rd 39851, flush 1, corrupt 0, gen 0
13. And the above line repeats, increasing the wr and rd errors.
14. BTRFS never seems to "get in tune again" while the filesystem is mounted.

The conclusion appears to be that the device ID is back again in the btrfs pool, so why does btrfs still try to write to the wrong device (or does it?!). The good thing here is that BTRFS does still work fine after an unmount and mount again. Running a scrub on the filesystem cleans up tons of errors, but no uncorrectable errors. However it says total bytes scrubbed 94.21GB with 75 errors, and further down it says corrected errors: 72, uncorrectable errors: 0, unverified errors: 0. Why 75 vs 72 errors?! Did it correct all or not? I have recently lost 1x 5-device BTRFS filesystem as well as 2x 3-device BTRFS filesystems set up in RAID1 (both data and metadata) by toying around with them. The 2x filesystems I lost were using all bad disks (all 3 of them), but the one mentioned here uses good (but old) 400GB drives, just for the record.
By lost I mean that mount does not recognize the filesystem, but BTRFS fi sh does show that all devices are present. I did not make notes for those filesystems, but it appears that RAID1 is a bit fragile. I don't need to recover anything. This is just a "toy system" for playing around with btrfs and doing some tests.
Re: [PATCH 0/8 v2] Non-blocking AIO
On 03/01/2017 01:36 AM, Goldwyn Rodrigues wrote: This series adds a nonblocking feature to asynchronous I/O writes. io_submit() can be delayed because of a number of reasons:

- Block allocation for files
- Data writebacks for direct I/O
- Sleeping while waiting to acquire i_rwsem
- Congested block device

We've been hit by a few of these, so this change is very welcome.

The goal of the patch series is to return -EAGAIN/-EWOULDBLOCK if any of these conditions are met. This way userspace can push most of the write()s to the kernel to the best of its ability to complete, and if it returns -EAGAIN, can defer it to another thread.

Is it not possible to push the iocb to a workqueue? This will allow existing userspace to work with the new functionality, unchanged. Any userspace implementation would have to do the same thing, so it's not like we're saving anything by pushing it there.

In order to enable this, IOCB_FLAG_NOWAIT is introduced in uapi/linux/aio_abi.h, which translates to IOCB_NOWAIT for struct iocb, BIO_NOWAIT for bio, and IOMAP_NOWAIT for iomap. This feature is provided for direct I/O of asynchronous I/O only. I have tested it against xfs, ext4, and btrfs.

Changes since v1:
+ Forward-ported from 4.9.10
+ Changed name from _NONBLOCKING to *_NOWAIT
+ filemap_range_has_page call moved closer to (just before) calling filemap_write_and_wait_range()
+ BIO_NOWAIT limited to get_request()
+ XFS fixes
  - included reflink
  - use of xfs_ilock_nowait() instead of a XFS_IOLOCK_NONBLOCKING flag
  - Translate the flag through IOMAP_NOWAIT (iomap) to check for block allocation for the file
+ ext4 coding style