[PATCH 2/2] btrfs: submit a normal bio for the mirror dio read
Similarly, do not set REQ_FAILFAST_DEV on the retry bio for the mirror read in the direct IO path. The rationale is the same as in patch 1/2 ("btrfs: submit a normal bio for the mirror read"): the second mirror is the last copy of the data, so the block layer should try just as hard to complete the read instead of failing the command on the first attempt.
Signed-off-by: Anand Jain
---
 fs/btrfs/inode.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index b93fe05a39c7..9a539cf82e51 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8014,9 +8014,6 @@ static blk_status_t dio_read_error(struct inode *inode, struct bio *failed_bio,
 	}
 
 	segs = bio_segments(failed_bio);
-	if (segs > 1 ||
-	    (failed_bio->bi_io_vec->bv_len > btrfs_inode_sectorsize(inode)))
-		read_mode |= REQ_FAILFAST_DEV;
 
 	isector = start - btrfs_io_bio(failed_bio)->logical;
 	isector >>= inode->i_sb->s_blocksize_bits;
-- 
2.15.0
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/2] btrfs: submit a normal bio for the mirror read
When the first mirror fails to read, we submit the bio again to read from the 2nd mirror. However, we also set the flag REQ_FAILFAST_DEV on that bio, which tells the underlying block device driver not to perform its default IO retries (5 attempts in sd), so the command fails and is returned at the first failed attempt. In fact it should be the other way around: since the 2nd mirror is the last copy of the data, the block layer should try equally hard to make this read command succeed. With or without REQ_FAILFAST_DEV there is no performance benefit when the command succeeds on the first attempt. The only benefit of REQ_FAILFAST_DEV is that when both copies fail to read, EIO is reported to the application roughly 40% faster (it saves 4 retries). But once the first mirror has failed, what matters more is making a successful read from the 2nd mirror. Moreover, in the DUP case both copies are on the same disk, so this retries 5 times on 2 different LBAs/sectors of the same disk, which is probably not a good idea if your test case involves pulling out the disk. As of now we do not have a way to tell the block layer how critical a read is, so that the block layer could adapt the retry behavior dynamically.
Signed-off-by: Anand Jain
---
 fs/btrfs/extent_io.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index d8eb457f2a70..29c03fb4d5af 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2388,9 +2388,6 @@ static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset,
 		return -EIO;
 	}
 
-	if (failed_bio->bi_vcnt > 1)
-		read_mode |= REQ_FAILFAST_DEV;
-
 	phy_offset >>= inode->i_sb->s_blocksize_bits;
 	bio = btrfs_create_repair_bio(inode, failed_bio, failrec, page,
 				      start - page_offset(page),
-- 
2.15.0
[PATCH] fstests: btrfs: Add test case to check if btrfs can handle full fs trim correctly
Ancient commit f4c697e6406d ("btrfs: return EINVAL if start > total_bytes in fitrim ioctl") introduced a regression where btrfs may fail to trim any free space in existing block groups. It is caused by confusing btrfs_super_block->total_bytes with the btrfs logical address space. Unlike a physical address, any aligned bytenr in the range [0, U64_MAX) is valid in the btrfs logical address space, and it is the chunk mapping mechanism of btrfs that handles the logical<->physical mapping.

The test case crafts a btrfs filesystem with the following features:
0) Single data/meta profile
   Makes trimmed-bytes reporting and chunk allocation more predictable.
1) All chunks start beyond super_block->total_bytes (1G)
   Achieved by relocating these chunks several times.
2) Unallocated space is less than 50% of the whole fs.
3) Fragmented data chunks
   Data chunks are full of fragments; 50% of the data chunks is free space.

So in theory fstrim should be able to trim over 50% of the fs (after the fix, 64% of the fs can be trimmed), while the regression makes btrfs only able to trim unallocated space, which is less than 50% of the total (without the fix, only 31%).

Fixed by the patch named "btrfs: Ensure btrfs_trim_fs can trim the whole fs".

Signed-off-by: Qu Wenruo
---
 tests/btrfs/155     | 120
 tests/btrfs/155.out |   2 +
 tests/btrfs/group   |   1 +
 3 files changed, 123 insertions(+)
 create mode 100755 tests/btrfs/155
 create mode 100644 tests/btrfs/155.out

diff --git a/tests/btrfs/155 b/tests/btrfs/155
new file mode 100755
index ..6918f093
--- /dev/null
+++ b/tests/btrfs/155
@@ -0,0 +1,120 @@
+#! /bin/bash
+# FS QA Test 155
+#
+# Check if btrfs can correctly trim free space in block groups
+#
+# An ancient regression prevents btrfs from trimming free space inside
+# existing block groups, if the bytenr of a block group starts beyond
+# btrfs_super_block->total_bytes.
+# However all bytenrs in btrfs are in the btrfs logical address space,
+# where any bytenr in range [0, U64_MAX] is valid.
+#
+# Fixed by patch named "btrfs: Ensure btrfs_trim_fs can trim the whole fs".
+#
+#---
+# Copyright (c) 2017 SUSE Linux Products GmbH.  All Rights Reserved.
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
+#---
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1	# failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+	cd /
+	rm -f $tmp.*
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+
+# remove previous $seqres.full before test
+rm -f $seqres.full
+
+# real QA test starts here
+
+# Modify as appropriate.
+_supported_fs btrfs
+_supported_os Linux
+_require_scratch
+_require_fstrim
+
+# 1G fs size
+fs_size=$((1024 * 1024 * 1024))
+
+# Use small files to fill half of the fs
+file_size=$(( 1024 * 1024 ))
+nr_files=$(( $fs_size / $file_size / 2))
+
+# Force to use single data and meta profile.
+# Since the test relies on fstrim output, which will differ for different
+# profiles
+_scratch_mkfs -b $fs_size -m single -d single > /dev/null
+_scratch_mount
+
+_require_batched_discard "$SCRATCH_MNT"
+
+for n in $(seq -w 0 $(( $nr_files - 1))); do
+	$XFS_IO_PROG -f -c "pwrite 0 $file_size" "$SCRATCH_MNT/file_$n" \
+		> /dev/null
+done
+
+# Flush all buffer data into disk, to trigger chunk allocation
+sync
+
+# Now we have taken at least 50% of the filesystem, relocate all chunks twice
+# so all chunks will start after 1G in logical space.
+# (Btrfs chunk allocation will not rewind to reuse lower space)
+_run_btrfs_util_prog balance start --full-balance "$SCRATCH_MNT"
+
+# To avoid possible false ENOSPC alert on v4.15-rc1, seems to be a
+# reserved space related bug (maybe related to outstanding space rework?),
+# but that's another story.
+sync
+
+_run_btrfs_util_prog balance start --full-balance "$SCRATCH_MNT"
+
+# Now remove half of the files to make some holes for later trim.
+# While still keeping the chunk space fragmented, so no chunk will be freed
+rm $SCRATCH_MNT/file_*[13579] -f
+
+# Make sure space is freed
+sync
+
+trimmed=$($FSTRIM_PROG -v "$SCRATCH_MNT" | _filter_fstrim)
+echo
[PATCH 0/4] define BTRFS_DEV_STATE
As of now, device properties and states are represented as individual int variables. Patches in the ML, such as the device failed state, need this cleanup as well.

Anand Jain (4):
  btrfs: cleanup device states define BTRFS_DEV_STATE_WRITEABLE
  btrfs: cleanup device states define BTRFS_DEV_STATE_IN_FS_METADATA
  btrfs: cleanup device states define BTRFS_DEV_STATE_MISSING
  btrfs: cleanup device states define BTRFS_DEV_STATE_CAN_DISCARD

 fs/btrfs/dev-replace.c |   3 +-
 fs/btrfs/disk-io.c     |  23 ++---
 fs/btrfs/extent-tree.c |   5 +-
 fs/btrfs/extent_io.c   |   3 +-
 fs/btrfs/ioctl.c       |   2 +-
 fs/btrfs/scrub.c       |  13 +++--
 fs/btrfs/super.c       |   7 +--
 fs/btrfs/volumes.c     | 129 -
 fs/btrfs/volumes.h     |  10 ++--
 9 files changed, 118 insertions(+), 77 deletions(-)

-- 
2.15.0
[PATCH 2/4] btrfs: cleanup device states define BTRFS_DEV_STATE_IN_FS_METADATA
Currently, device state is managed by individual int variables such as struct btrfs_device::in_fs_metadata. Instead of that, declare the device state BTRFS_DEV_STATE_IN_FS_METADATA and use the bit operations.

Signed-off-by: Anand Jain
---
 fs/btrfs/disk-io.c | 21 ++---
 fs/btrfs/scrub.c   |  3 ++-
 fs/btrfs/super.c   |  5 +++--
 fs/btrfs/volumes.c | 29 +
 fs/btrfs/volumes.h |  2 +-
 5 files changed, 37 insertions(+), 23 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 0d361b6713e1..ab1a514e5c8d 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3563,8 +3563,10 @@ static int barrier_all_devices(struct btrfs_fs_info *info)
 			continue;
 		if (!dev->bdev)
 			continue;
-		if (!dev->in_fs_metadata ||
-		    !test_bit(BTRFS_DEV_STATE_WRITEABLE, &dev->dev_state))
+		if (!test_bit(BTRFS_DEV_STATE_IN_FS_METADATA,
+			      &dev->dev_state) ||
+		    !test_bit(BTRFS_DEV_STATE_WRITEABLE,
+			      &dev->dev_state))
 			continue;
 
 		write_dev_flush(dev);
@@ -3579,8 +3581,10 @@ static int barrier_all_devices(struct btrfs_fs_info *info)
 			errors_wait++;
 			continue;
 		}
-		if (!dev->in_fs_metadata ||
-		    !test_bit(BTRFS_DEV_STATE_WRITEABLE, &dev->dev_state))
+		if (!test_bit(BTRFS_DEV_STATE_IN_FS_METADATA,
+			      &dev->dev_state) ||
+		    !test_bit(BTRFS_DEV_STATE_WRITEABLE,
+			      &dev->dev_state))
 			continue;
 
 		ret = wait_dev_flush(dev);
@@ -3677,7 +3681,8 @@ int write_all_supers(struct btrfs_fs_info *fs_info, int max_mirrors)
 			total_errors++;
 			continue;
 		}
-		if (!dev->in_fs_metadata ||
+		if (!test_bit(BTRFS_DEV_STATE_IN_FS_METADATA,
+			      &dev->dev_state) ||
 		    !test_bit(BTRFS_DEV_STATE_WRITEABLE, &dev->dev_state))
 			continue;
 
@@ -3717,8 +3722,10 @@ int write_all_supers(struct btrfs_fs_info *fs_info, int max_mirrors)
 	list_for_each_entry_rcu(dev, head, dev_list) {
 		if (!dev->bdev)
 			continue;
-		if (!dev->in_fs_metadata ||
-		    !test_bit(BTRFS_DEV_STATE_WRITEABLE, &dev->dev_state))
+		if (!test_bit(BTRFS_DEV_STATE_IN_FS_METADATA,
+			      &dev->dev_state) ||
+		    !test_bit(BTRFS_DEV_STATE_WRITEABLE,
+			      &dev->dev_state))
 			continue;
 
 		ret = wait_dev_supers(dev, max_mirrors);
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index e027e0de66a5..060fa93731e5 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -4129,7 +4129,8 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 devid, u64 start,
 	}
 
 	mutex_lock(&fs_info->scrub_lock);
-	if (!dev->in_fs_metadata || dev->is_tgtdev_for_dev_replace) {
+	if (!test_bit(BTRFS_DEV_STATE_IN_FS_METADATA, &dev->dev_state) ||
+	    dev->is_tgtdev_for_dev_replace) {
 		mutex_unlock(&fs_info->scrub_lock);
 		mutex_unlock(&fs_info->fs_devices->device_list_mutex);
 		return -EIO;
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 161694b66038..9089aad2f3aa 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -1945,8 +1945,9 @@ static int btrfs_calc_avail_data_space(struct btrfs_fs_info *fs_info,
 
 	rcu_read_lock();
 	list_for_each_entry_rcu(device, &fs_devices->devices, dev_list) {
-		if (!device->in_fs_metadata || !device->bdev ||
-		    device->is_tgtdev_for_dev_replace)
+		if (!test_bit(BTRFS_DEV_STATE_IN_FS_METADATA,
+			      &device->dev_state) ||
+		    !device->bdev || device->is_tgtdev_for_dev_replace)
 			continue;
 
 		if (i >= nr_devices)
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 0f5be1808c6e..13a6ae80bee1 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -652,7 +652,7 @@ static int btrfs_open_one_device(struct btrfs_fs_devices *fs_devices,
 		fs_devices->rotating = 1;
 
 	device->bdev = bdev;
-	device->in_fs_metadata = 0;
+	clear_bit(BTRFS_DEV_STATE_IN_FS_METADATA, &device->dev_state);
 	device->mode = flags;
 
 	fs_devices->open_devices++;
@@ -857,7 +857,8 @@ void btrfs_close_extra_devices(struct btrfs_fs_devices *fs_devices, int step)
again:
	/* This is the
[PATCH 4/4] btrfs: cleanup device states define BTRFS_DEV_STATE_CAN_DISCARD
Currently, device state is managed by individual int variables such as struct btrfs_device::can_discard. Instead of that, declare btrfs_device::dev_state BTRFS_DEV_STATE_CAN_DISCARD and use the bit operations.

Signed-off-by: Anand Jain
---
 fs/btrfs/extent-tree.c | 3 ++-
 fs/btrfs/volumes.c     | 6 +++---
 fs/btrfs/volumes.h     | 2 +-
 3 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index f81d928754e1..ee79f7cdc543 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2155,7 +2155,8 @@ int btrfs_discard_extent(struct btrfs_fs_info *fs_info, u64 bytenr,
 		for (i = 0; i < bbio->num_stripes; i++, stripe++) {
 			u64 bytes;
 
-			if (!stripe->dev->can_discard)
+			if (!test_bit(BTRFS_DEV_STATE_CAN_DISCARD,
+				      &stripe->dev->dev_state))
 				continue;
 
 			ret = btrfs_issue_discard(stripe->dev->bdev,
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 75839e07ce10..a9c6486f06f4 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -647,7 +647,7 @@ static int btrfs_open_one_device(struct btrfs_fs_devices *fs_devices,
 
 	q = bdev_get_queue(bdev);
 	if (blk_queue_discard(q))
-		device->can_discard = 1;
+		set_bit(BTRFS_DEV_STATE_CAN_DISCARD, &device->dev_state);
 	if (!blk_queue_nonrot(q))
 		fs_devices->rotating = 1;
 
@@ -2401,7 +2401,7 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
 
 	q = bdev_get_queue(bdev);
 	if (blk_queue_discard(q))
-		device->can_discard = 1;
+		set_bit(BTRFS_DEV_STATE_CAN_DISCARD, &device->dev_state);
 	set_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state);
 	device->generation = trans->transid;
 	device->io_width = fs_info->sectorsize;
@@ -2601,7 +2601,7 @@ int btrfs_init_dev_replace_tgtdev(struct btrfs_fs_info *fs_info,
 
 	q = bdev_get_queue(bdev);
 	if (blk_queue_discard(q))
-		device->can_discard = 1;
+		set_bit(BTRFS_DEV_STATE_CAN_DISCARD, &device->dev_state);
 	mutex_lock(&fs_info->fs_devices->device_list_mutex);
 	set_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state);
 	device->generation = 0;
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 2fbff6902c8d..85e4b2dcc071 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -50,6 +50,7 @@ struct btrfs_pending_bios {
 #define BTRFS_DEV_STATE_WRITEABLE	(1UL << 1)
 #define BTRFS_DEV_STATE_IN_FS_METADATA	(1UL << 2)
 #define BTRFS_DEV_STATE_MISSING		(1UL << 3)
+#define BTRFS_DEV_STATE_CAN_DISCARD	(1UL << 4)
 
 struct btrfs_device {
 	struct list_head dev_list;
@@ -74,7 +75,6 @@ struct btrfs_device {
 	fmode_t mode;
 
 	unsigned long dev_state;
-	int can_discard;
 	int is_tgtdev_for_dev_replace;
 	blk_status_t last_flush_error;
 	int flush_bio_sent;
-- 
2.15.0
[PATCH 3/4] btrfs: cleanup device states define BTRFS_DEV_STATE_MISSING
Currently, device state is managed by individual int variables such as struct btrfs_device::missing. Instead of that, declare btrfs_device::dev_state BTRFS_DEV_STATE_MISSING and use the bit operations.

Signed-off-by: Anand Jain
---
 fs/btrfs/dev-replace.c |  3 ++-
 fs/btrfs/disk-io.c     |  4 ++--
 fs/btrfs/scrub.c       |  7 ---
 fs/btrfs/super.c       |  2 +-
 fs/btrfs/volumes.c     | 33 +++++--
 fs/btrfs/volumes.h     |  2 +-
 6 files changed, 29 insertions(+), 22 deletions(-)

diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index 4b6ceb38cb5f..559db7667f38 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -306,7 +306,8 @@ void btrfs_after_dev_replace_commit(struct btrfs_fs_info *fs_info)
 
 static inline char *dev_missing_or_rcu_str(struct btrfs_device *device)
 {
-	return device->missing ? "<missing disk>" : rcu_str_deref(device->name);
+	return test_bit(BTRFS_DEV_STATE_MISSING, &device->dev_state) ?
+		"<missing disk>" : rcu_str_deref(device->name);
 }
 
 int btrfs_dev_replace_start(struct btrfs_fs_info *fs_info,
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index ab1a514e5c8d..ac1079fc9382 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3559,7 +3559,7 @@ static int barrier_all_devices(struct btrfs_fs_info *info)
 	/* send down all the barriers */
 	head = &info->fs_devices->devices;
 	list_for_each_entry_rcu(dev, head, dev_list) {
-		if (dev->missing)
+		if (test_bit(BTRFS_DEV_STATE_MISSING, &dev->dev_state))
 			continue;
 		if (!dev->bdev)
 			continue;
@@ -3575,7 +3575,7 @@ static int barrier_all_devices(struct btrfs_fs_info *info)
 
 	/* wait for all the barriers */
 	list_for_each_entry_rcu(dev, head, dev_list) {
-		if (dev->missing)
+		if (test_bit(BTRFS_DEV_STATE_MISSING, &dev->dev_state))
 			continue;
 		if (!dev->bdev) {
 			errors_wait++;
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 060fa93731e5..666c00cbee03 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -2535,7 +2535,7 @@ static int scrub_pages(struct scrub_ctx *sctx, u64 logical, u64 len,
 	}
 
 	WARN_ON(sblock->page_count == 0);
-	if (dev->missing) {
+	if (test_bit(BTRFS_DEV_STATE_MISSING, &dev->dev_state)) {
 		/*
 		 * This case should only be hit for RAID 5/6 device replace. See
 		 * the comment in scrub_missing_raid56_pages() for details.
@@ -2870,7 +2870,7 @@ static int scrub_extent_for_parity(struct scrub_parity *sparity,
 	u8 csum[BTRFS_CSUM_SIZE];
 	u32 blocksize;
 
-	if (dev->missing) {
+	if (test_bit(BTRFS_DEV_STATE_MISSING, &dev->dev_state)) {
 		scrub_parity_mark_sectors_error(sparity, logical, len);
 		return 0;
 	}
@@ -4112,7 +4112,8 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 devid, u64 start,
 
 	mutex_lock(&fs_info->fs_devices->device_list_mutex);
 	dev = btrfs_find_device(fs_info, devid, NULL, NULL);
-	if (!dev || (dev->missing && !is_dev_replace)) {
+	if (!dev || (test_bit(BTRFS_DEV_STATE_MISSING, &dev->dev_state) &&
+		     !is_dev_replace)) {
 		mutex_unlock(&fs_info->fs_devices->device_list_mutex);
 		return -ENODEV;
 	}
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 9089aad2f3aa..4b70ae6f0245 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -2243,7 +2243,7 @@ static int btrfs_show_devname(struct seq_file *m, struct dentry *root)
 	while (cur_devices) {
 		head = &cur_devices->devices;
 		list_for_each_entry(dev, head, dev_list) {
-			if (dev->missing)
+			if (test_bit(BTRFS_DEV_STATE_MISSING, &dev->dev_state))
 				continue;
 			if (!dev->name)
 				continue;
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 13a6ae80bee1..75839e07ce10 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -773,9 +773,9 @@ static noinline int device_list_add(const char *path,
 			return -ENOMEM;
 		rcu_string_free(device->name);
 		rcu_assign_pointer(device->name, name);
-		if (device->missing) {
+		if (test_bit(BTRFS_DEV_STATE_MISSING, &device->dev_state)) {
 			fs_devices->missing_devices--;
-			device->missing = 0;
+			clear_bit(BTRFS_DEV_STATE_MISSING, &device->dev_state);
 		}
 	}
 
@@ -957,7 +957,7 @@ static void btrfs_prepare_close_one_device(struct btrfs_device *device)
 		fs_devices->rw_devices--;
 	}
 
-	if (device->missing)
+	if (test_bit(BTRFS_DEV_STATE_MISSING, &device->dev_state))
[PATCH 1/4] btrfs: cleanup device states define BTRFS_DEV_STATE_WRITEABLE
Currently, device state is managed by individual int variables such as struct btrfs_device::writeable. Instead of that, declare the device state BTRFS_DEV_STATE_WRITEABLE and use the bit operations.

Signed-off-by: Anand Jain
---
 fs/btrfs/disk-io.c     | 12 ++
 fs/btrfs/extent-tree.c |  2 +-
 fs/btrfs/extent_io.c   |  3 ++-
 fs/btrfs/ioctl.c       |  2 +-
 fs/btrfs/scrub.c       |  3 ++-
 fs/btrfs/volumes.c     | 63 +-
 fs/btrfs/volumes.h     |  4 +++-
 7 files changed, 54 insertions(+), 35 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index dfdab849037b..0d361b6713e1 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3563,7 +3563,8 @@ static int barrier_all_devices(struct btrfs_fs_info *info)
 			continue;
 		if (!dev->bdev)
 			continue;
-		if (!dev->in_fs_metadata || !dev->writeable)
+		if (!dev->in_fs_metadata ||
+		    !test_bit(BTRFS_DEV_STATE_WRITEABLE, &dev->dev_state))
 			continue;
 
 		write_dev_flush(dev);
@@ -3578,7 +3579,8 @@ static int barrier_all_devices(struct btrfs_fs_info *info)
 			errors_wait++;
 			continue;
 		}
-		if (!dev->in_fs_metadata || !dev->writeable)
+		if (!dev->in_fs_metadata ||
+		    !test_bit(BTRFS_DEV_STATE_WRITEABLE, &dev->dev_state))
 			continue;
 
 		ret = wait_dev_flush(dev);
@@ -3675,7 +3677,8 @@ int write_all_supers(struct btrfs_fs_info *fs_info, int max_mirrors)
 			total_errors++;
 			continue;
 		}
-		if (!dev->in_fs_metadata || !dev->writeable)
+		if (!dev->in_fs_metadata ||
+		    !test_bit(BTRFS_DEV_STATE_WRITEABLE, &dev->dev_state))
 			continue;
 
 		btrfs_set_stack_device_generation(dev_item, 0);
@@ -3714,7 +3717,8 @@ int write_all_supers(struct btrfs_fs_info *fs_info, int max_mirrors)
 	list_for_each_entry_rcu(dev, head, dev_list) {
 		if (!dev->bdev)
 			continue;
-		if (!dev->in_fs_metadata || !dev->writeable)
+		if (!dev->in_fs_metadata ||
+		    !test_bit(BTRFS_DEV_STATE_WRITEABLE, &dev->dev_state))
 			continue;
 
 		ret = wait_dev_supers(dev, max_mirrors);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index e2d7e86b51d1..f81d928754e1 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -10935,7 +10935,7 @@ static int btrfs_trim_free_extents(struct btrfs_device *device,
 	*trimmed = 0;
 
 	/* Not writeable = nothing to do. */
-	if (!device->writeable)
+	if (!test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state))
 		return 0;
 
 	/* No free space = nothing to do. */
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 7fa50e12f18e..f51c797847c7 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2028,7 +2028,8 @@ int repair_io_failure(struct btrfs_fs_info *fs_info, u64 ino, u64 start,
 	bio->bi_iter.bi_sector = sector;
 	dev = bbio->stripes[bbio->mirror_num - 1].dev;
 	btrfs_put_bbio(bbio);
-	if (!dev || !dev->bdev || !dev->writeable) {
+	if (!dev || !dev->bdev ||
+	    !test_bit(BTRFS_DEV_STATE_WRITEABLE, &dev->dev_state)) {
 		btrfs_bio_counter_dec(fs_info);
 		bio_put(bio);
 		return -EIO;
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 6c7a49faf4e0..8a74c83503d6 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -1518,7 +1518,7 @@ static noinline int btrfs_ioctl_resize(struct file *file,
 		goto out_free;
 	}
 
-	if (!device->writeable) {
+	if (!test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state)) {
 		btrfs_info(fs_info,
 			   "resizer unable to apply on readonly device %llu",
 		       devid);
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index e3f6c49e5c4d..e027e0de66a5 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -4117,7 +4117,8 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 devid, u64 start,
 		return -ENODEV;
 	}
 
-	if (!is_dev_replace && !readonly && !dev->writeable) {
+	if (!is_dev_replace && !readonly &&
+	    !test_bit(BTRFS_DEV_STATE_WRITEABLE, &dev->dev_state)) {
 		mutex_unlock(&fs_info->fs_devices->device_list_mutex);
 		rcu_read_lock();
 		name = rcu_dereference(dev->name);
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 397b335d108c..0f5be1808c6e 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -634,10 +634,15 @@ static int btrfs_open_one_device(struct btrfs_fs_devices
Re: [PATCH 0/7] retry write on error
On 11/29/2017 07:41 AM, p...@btrfs.list.sabi.co.uk wrote:
>>> If the underlying protocol doesn't support retry and there are some
>>> transient errors happening somewhere in our IO stack, we'd like to
>>> give an extra chance for IO.
>>
>> A limited number of retries may make sense, though I saw some long
>> stalls after retries on bad disks.
>
> Indeed! One of the major issues in actual storage administration is to
> find ways to reliably disable most retries, or to shorten them, both at
> the block device level and the device level, because in almost all
> cases where storage reliability matters what is important is simply
> swapping out the failing device immediately and then examining and
> possibly refreshing it offline. To the point that many device
> manufacturers deliberately cripple retry shortening or disabling
> options in cheaper products to force long stalls, so that people who
> care about reliability more than price will buy the more expensive
> version that can disable or shorten retries.
>
>> Seems preferable to avoid issuing retries when the underlying
>> transport layer(s) has already done so, but I am not sure there is a
>> way to know that at the fs level.
>
> Indeed, and to use a euphemism, a third layer of retries at the
> filesystem level is currently a thoroughly imbecilic idea :-), as
> whether retries are worth doing is not a filesystem-dependent issue
> (but then plugging is done at the block IO level when it is entirely
> device dependent whether it is worth doing, so there is famous
> precedent). There are excellent reasons why error recovery has in
> general not been done at the filesystem level for around 20 years,
> which do not need repeating every time. However one of them is that
> where it makes sense device firmware does retries, and the block device
> layer does retries too, which is often a bad idea, and where it is not,
> the block IO level should do that, not the filesystem.
> A large part of the above discussion would not be needed if Linux
> kernel "developers" exposed a clear notion of hardware device and block
> device state machine and related semantics, or even knew that it were
> desirable, but that's an idea that is only 50 years old, so may not
> have yet reached popularity :-).

I agree with Ed and Peter; a similar opinion was posted here [1].

[1] https://www.spinics.net/lists/linux-btrfs/msg70240.html

Thanks, Anand
[PATCH v3 2/3] btrfs-progs: fi defrag: do not exit if defrag range ioctl is unsupported
If the defrag range ioctl is unsupported, defrag exits immediately. Since the caller can handle the error, let cmd_filesystem_defrag() close the file, break the loop and return the error instead of calling exit(1).

Suggested-by: David Sterba
Signed-off-by: Su Yue
---
Changelog:
v2: Separate the patch from commit 6e991b9161fa ("btrfs-progs: fi defrag: clean up duplicate code if find errors").
v3: Call close_file_or_dir() before breaking the loop.
---
 cmds-filesystem.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/cmds-filesystem.c b/cmds-filesystem.c
index 17d399d58adf..232d4e88e0c0 100644
--- a/cmds-filesystem.c
+++ b/cmds-filesystem.c
@@ -1049,8 +1049,10 @@ static int cmd_filesystem_defrag(int argc, char **argv)
 		if (recursive && S_ISDIR(st.st_mode)) {
 			ret = nftw(argv[i], defrag_callback, 10,
 					FTW_MOUNT | FTW_PHYS);
-			if (ret == ENOTTY)
-				exit(1);
+			if (ret == ENOTTY) {
+				close_file_or_dir(fd, dirstream);
+				break;
+			}
 			/* errors are handled in the callback */
 			ret = 0;
 		} else {
-- 
2.15.0
Re: [PATCH v2 3/3] btrfs-progs: fi defrag: extend -c to drop nocompress flag on files
On 11/29/2017 12:07 AM, David Sterba wrote: On Tue, Nov 28, 2017 at 05:14:50PM +0800, Su Yue wrote: Now, files which have nocompress flag also will be defraged with compression. However, nocompress flag is still existed and have to be cleared manually. So add an option '--clear-nocompress' to extend -c to drop nocompress flag after defragement. Suggested-by: David SterbaSuggested-by: Anand Jain Do you have the pointer to the discussion? The whole idea sounds familiar and seeing my name here means I must have been involved, but I have only vague memories. Here is the link: https://www.spinics.net/lists/linux-btrfs/msg67529.html Signed-off-by: Su Yue --- Changelog: v2: Remove change about indentation of defrag_callback(). --- cmds-filesystem.c | 65 ++- 1 file changed, 64 insertions(+), 1 deletion(-) diff --git a/cmds-filesystem.c b/cmds-filesystem.c index 3931333f76c6..84242814798e 100644 --- a/cmds-filesystem.c +++ b/cmds-filesystem.c @@ -27,6 +27,7 @@ #include #include #include +#include #include "kerncompat.h" #include "ctree.h" @@ -867,6 +868,8 @@ static const char * const cmd_filesystem_defrag_usage[] = { "-l len defragment only up to len bytes", "-t size target extent size hint (default: 32M)", "", + "--compress-forceclear nocompress flag on files after defragment, only work with option -c", This is probably --clear-nocompress . Oh... I forgot to change the usage info here. The "force compress" semantics is up to kernel, so I think naming the options 'clear-nocompress' is the right thing, as it's what it really does. + "", "Warning: most Linux kernels will break up the ref-links of COW data", "(e.g., files copied with 'cp --reflink', snapshots) which may cause", "considerable increase of space usage. 
See btrfs-filesystem(8) for", @@ -874,9 +877,39 @@ static const char * const cmd_filesystem_defrag_usage[] = { NULL }; +static int clear_nocompress_flag(int fd) +{ + unsigned int flags; + int ret = 0; + + ret = ioctl(fd, FS_IOC_GETFLAGS, ); + if (ret < 0) { + ret = -errno; + error("failed to get flags: %s", strerror(-ret)); + goto out; + } + + if (!(flags & FS_NOCOMP_FL)) { + ret = 0; + goto out; + } + flags &= ~FS_NOCOMP_FL; + ret = ioctl(fd, FS_IOC_SETFLAGS, ); This is inherently racy, but the inode flags do not change that often so concurrent change is unlikely. I don't quite understand what you mean. Could you explain some details? Thanks, Su + if (ret < 0) { + ret = -errno; + error("failed to set flags: %s", strerror(-ret)); + goto out; + } + + ret = 0; +out: + return ret; +} + static struct btrfs_ioctl_defrag_range_args defrag_global_range; static int defrag_global_verbose; static int defrag_global_errors; +static int defrag_global_clear_nocompress; static int defrag_callback(const char *fpath, const struct stat *sb, int typeflag, struct FTW *ftwbuf) { @@ -904,6 +937,14 @@ static int defrag_callback(const char *fpath, const struct stat *sb, err = errno; goto error; } + + if (defrag_global_clear_nocompress) { + ret = clear_nocompress_flag(fd); + if (ret) { + err = -ret; + goto error; + } + } } return 0; @@ -926,6 +967,12 @@ static int cmd_filesystem_defrag(int argc, char **argv) int compress_type = BTRFS_COMPRESS_NONE; DIR *dirstream; + enum { GETOPT_VAL_CLEAR_NOCOMPRESS = 257}; + static const struct option long_options[] = { + { "clear-nocompress", no_argument, NULL, + GETOPT_VAL_CLEAR_NOCOMPRESS}, + { NULL, 0, NULL, 0} + }; /* * Kernel has a different default (256K) that is supposed to be safe, * but it does not defragment very well. 
The 32M will likely lead to @@ -937,8 +984,10 @@ static int cmd_filesystem_defrag(int argc, char **argv) defrag_global_errors = 0; defrag_global_verbose = 0; defrag_global_errors = 0; + defrag_global_clear_nocompress = 0; while(1) { - int c = getopt(argc, argv, "vrc::fs:l:t:"); + int c = getopt_long(argc, argv, "vrc::fs:l:t:", long_options, + NULL); if (c < 0) break; @@ -972,6 +1021,9 @@ static int cmd_filesystem_defrag(int argc, char **argv) case 'r': recursive
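The FS_IOC_GETFLAGS/FS_IOC_SETFLAGS pair discussed above is a classic lost-update race: both ioctls operate on the whole per-inode flag word, so two concurrent read-modify-write sequences can silently undo each other. A toy illustration in plain Python (threads on an integer, not btrfs code; the flag values mirror linux/fs.h, the barrier artificially widens the race window):

```python
import threading

FS_COMPR_FL = 0x004    # values as in linux/fs.h
FS_NOCOMP_FL = 0x400

flags = FS_NOCOMP_FL | FS_COMPR_FL   # inode starts with both bits set
barrier = threading.Barrier(2)

def clear_bit(bit):
    global flags
    snapshot = flags          # "FS_IOC_GETFLAGS": read the whole word
    barrier.wait()            # both threads are now inside the race window
    flags = snapshot & ~bit   # "FS_IOC_SETFLAGS": write the whole word back

t1 = threading.Thread(target=clear_bit, args=(FS_NOCOMP_FL,))
t2 = threading.Thread(target=clear_bit, args=(FS_COMPR_FL,))
t1.start(); t2.start(); t1.join(); t2.join()

# One update is lost: exactly one bit survives instead of zero.
assert flags in (FS_NOCOMP_FL, FS_COMPR_FL)
```

This is why the race is "inherent" to the interface: without a compare-and-swap style ioctl there is no way to clear one bit atomically, so the best the tool can do is accept that concurrent flag changes are rare.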
Re: Read before you deploy btrfs + zstd
> On Nov 28, 2017, at 3:49 PM, David Sterba wrote: > > On Tue, Nov 28, 2017 at 09:31:57PM +, Nick Terrell wrote: >> >>> On Nov 21, 2017, at 8:22 AM, David Sterba wrote: >>> >>> On Wed, Nov 15, 2017 at 08:09:15PM +, Nick Terrell wrote: On 11/15/17, 6:41 AM, "David Sterba" wrote: > The branch is now in a state that can be tested. Turns out the memory > requirements are too much for grub, so the boot fails with "not enough > memory". The calculated value > > ZSTD_BTRFS_MAX_INPUT: 131072 > ZSTD_DStreamWorkspaceBound with ZSTD_BTRFS_MAX_INPUT: 549424 > > This is not something I could fix easily, we'd probably need a tuned > version of ZSTD for grub constraints. Adding Nick to CC. If I understand the grub code correctly, we only need to read, and we have the entire input and output buffer in one segment. In that case you can use ZSTD_initDCtx(), and ZSTD_decompressDCtx(). ZSTD_DCtxWorkspaceBound() is only 155984. See decompress_single() in https://patchwork.kernel.org/patch/9997909/ for an example. >>> >>> Does not help, still ENOMEM. >> >> It looks like XZ had the same issue, and they make the decompression >> context a static object (grep for GRUB_EMBED_DECOMPRESSOR). We could >> potentially do the same and statically allocate the workspace: >> >> ``` >> /* Could also be size_t */ >> #define BTRFS_ZSTD_WORKSPACE_SIZE_U64 (155984 / sizeof(uint64_t)) >> static uint64_t workspace[BTRFS_ZSTD_WORKSPACE_SIZE_U64]; >> >> /* ... */ >> >> assert(sizeof(workspace) >= ZSTD_DCtxWorkspaceBound()); >> ``` > > Interesting, thanks for the tip, I'll try it next. > > I've meanwhile tried to tweak the numbers, the maximum block for zstd, > that squeezed the DCtx somewhere under 48k, with block size 8k. Still > enomem. Sadly the block size has to stay 128 KiB, since that is the block size that is used for compression. > I've tried to add some debugging prints to see what numbers get actually > passed to the allocator, but did not see anything printed. 
I'm sure > there is a more intelligent way to test the grub changes. So far each > test loop takes quite some time, as I build the rpm package, test it in > a VM and have to recreate the environment each time. Is the code in github.com/kdave/grub in the btrfs-zstd branch up to date? btrfs.c:1230 looks like it will error for zstd compression, but not with ENOMEM. btrfs.c:1272 [2] looks like a typo. Maybe the second is causing the compressed block to be interpreted as uncompressed, and causes a too-large allocation? [1] https://github.com/kdave/grub/blob/btrfs-zstd/grub-core/fs/btrfs.c#L1230 [2] https://github.com/kdave/grub/blob/btrfs-zstd/grub-core/fs/btrfs.c#L1272 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/7] Btrfs: retry write for non-raid56
On Wed, Nov 22, 2017 at 04:41:10PM +0200, Nikolay Borisov wrote: > > > On 22.11.2017 02:35, Liu Bo wrote: > > If the underlying protocol doesn't support retry and there are some > > transient errors happening somewhere in our IO stack, we'd like to > > give an extra chance for IO. > > > > In btrfs, read retry is handled in bio_readpage_error() with the retry > > unit being page size, for write retry however, we're going to do it > > in a different way, as a write may consist of several writes onto > > different stripes, retry write needs to be done right after the IO on > > each stripe completes and reaches endio. > > > > As write errors are really corner cases, performance impact is not > > considered. > > > > This adds code to retry write on errors _sector by sector_ in order to > > find out which sector is really bad so that we can mark it somewhere. > > And since endio is running in interrupt context, the real retry work > > is scheduled in system_unbound_wq in order to get a normal context. > > > > Signed-off-by: Liu Bo> > --- > > fs/btrfs/volumes.c | 162 > > + > > fs/btrfs/volumes.h | 3 + > > 2 files changed, 142 insertions(+), 23 deletions(-) > > > > diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c > > index 853f9ce..c11db0b 100644 > > --- a/fs/btrfs/volumes.c > > +++ b/fs/btrfs/volumes.c > > @@ -6023,34 +6023,150 @@ static void __btrfs_end_bio(struct bio *bio) > > } > > } > > > > -static void btrfs_end_bio(struct bio *bio) > > +static inline struct btrfs_device *get_device_from_bio(struct bio *bio) > > { > > struct btrfs_bio *bbio = bio->bi_private; > > - int is_orig_bio = 0; > > + unsigned int stripe_index = btrfs_io_bio(bio)->stripe_index; > > + > > + BUG_ON(stripe_index >= bbio->num_stripes); > > + return bbio->stripes[stripe_index].dev; > > +} > > + > > +/* > > + * return 1 if every sector retry returns successful. > > + * return 0 if one or more sector retries fails. 
> > + */ > > +int btrfs_narrow_write_error(struct bio *bio, struct btrfs_device *dev) > > +{ > > + struct btrfs_io_bio *io_bio = btrfs_io_bio(bio); > > + u64 sectors_to_write; > > + u64 offset; > > + u64 orig; > > + u64 unit; > > + u64 block_sectors; > > + int ok = 1; > > + struct bio *wbio; > > + > > + /* offset and unit are bytes aligned, not 512-bytes aligned. */ > > + sectors_to_write = io_bio->iter.bi_size >> 9; > > + orig = io_bio->iter.bi_sector; > > + offset = 0; > > + block_sectors = bdev_logical_block_size(dev->bdev) >> 9; > > + unit = block_sectors; > > + ASSERT(unit == 1); > > + > > + while (1) { > > + if (!sectors_to_write) > > + break; > > + /* > > +* LIUBO: I don't think unit > sectors_to_write could > > +* happen, sectors_to_write should be aligned to PAGE_SIZE > > +* which is > unit. Just in case. > > +*/ > > + if (unit > sectors_to_write) { > > + WARN_ONCE(1, "unit %llu > sectors_to_write (%llu)\n", > > unit, sectors_to_write); > > + unit = sectors_to_write; > > + } > > + > > + /* write @unit bytes at @offset */ > > + /* this would never fail, check btrfs_bio_clone(). */ > > + wbio = btrfs_bio_clone(bio); > > + wbio->bi_opf = REQ_OP_WRITE; > > + wbio->bi_iter = io_bio->iter; > > + > > + bio_trim(wbio, offset, unit); > > + bio_copy_dev(wbio, bio); > > + > > + /* submit in sync way */ > > + /* > > +* LIUBO: There is an issue, if this bio is quite > > +* large, say 1M or 2M, and sector size is just 512, > > +* then this may take a while. > > +* > > +* May need to schedule the job to workqueue. > > +*/ > > + if (submit_bio_wait(wbio) < 0) { > > + ok = 0 && ok; > > + /* > > +* This is not correct if badblocks is enabled > > +* as we need to record every bad sector by > > +* trying sectors one by one. 
> > +*/ > > + break; > > + } > > + > > + bio_put(wbio); > > + offset += unit; > > + sectors_to_write -= unit; > > + unit = block_sectors; > > + } > > + return ok; > > +} > > + > > +void btrfs_record_bio_error(struct bio *bio, struct btrfs_device *dev) > > +{ > > + if (bio->bi_status == BLK_STS_IOERR || > > + bio->bi_status == BLK_STS_TARGET) { > > + if (dev->bdev) { > > + if (bio_data_dir(bio) == WRITE) > > + btrfs_dev_stat_inc(dev, > > + BTRFS_DEV_STAT_WRITE_ERRS); > > + else > > + btrfs_dev_stat_inc(dev, > > +
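The narrowing loop in the proposed btrfs_narrow_write_error() can be sketched in plain Python to make the control flow visible (hypothetical names; a fake submit callback stands in for cloning, trimming, and submit_bio_wait() on the bio; unlike the posted patch, this variant keeps going after a failure so it records every bad unit, which is what the badblocks comment in the review asks for):

```python
def narrow_write_error(start_sector, nr_sectors, block_sectors, submit_write):
    """Retry a failed write one logical-block-sized unit at a time.

    submit_write(sector, nr) stands in for building a trimmed clone of
    the failed bio and submitting it synchronously.  Returns (ok, bad),
    where bad lists the first sector of every unit that failed again.
    """
    bad = []
    offset = 0
    while nr_sectors:
        unit = min(block_sectors, nr_sectors)   # clamp the trailing unit
        if not submit_write(start_sector + offset, unit):
            # with badblocks support these sectors would be recorded;
            # the posted patch simply breaks out of the loop here instead
            bad.append(start_sector + offset)
        offset += unit
        nr_sectors -= unit
    return (not bad, bad)

# A fake device where sector 10 always fails: retrying an 8-sector write
# unit by unit narrows the error down to the single bad sector.
bad_media = {10}
ok, bad = narrow_write_error(
    8, 8, 1, lambda s, n: not (set(range(s, s + n)) & bad_media))
assert (ok, bad) == (False, [10])
```

The per-unit synchronous submit is also why the review note about large bios matters: a 2M bio at 512-byte granularity means thousands of serialized submissions, hence the suggestion to push the work to a workqueue.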
Re: [PATCH] btrfs: handle dynamically reappearing missing device
On Fri, Nov 17, 2017 at 07:53:29PM +0800, Anand Jain wrote: > > > On 11/17/2017 07:28 AM, Liu Bo wrote: > > On Sun, Nov 12, 2017 at 06:56:50PM +0800, Anand Jain wrote: > > > If the device is not present at the time of (-o degrade) mount > > > the mount context will create a dummy missing struct btrfs_device. > > > Later this device may reappear after the FS is mounted. > > > This commit log doesn't explain what would happen when the missing > > device re-appears. > > Hm. You mean in the current design without this patch? > Just nothing; it gives a false impression to the user by > 'btrfs dev ready' and 'btrfs fi show' that the missing device > has been included. Will update change log. > > > > So this > > > patch handles that case by going through the open_device steps > > > > which this device missed and finally adds to the device alloc list. > > > > > > So now with this patch, to bring back the missing device user can run, > > > > > > btrfs dev scan > > > > Most common distros already have some udev rules of btrfs to process a > > reappeared disk. > > Right. Do you see anything that will break ? > Planning to add comments [1] in v2 for more clarity around this. 
> > > > > > > Signed-off-by: Anand Jain> > > --- > > > This patch needs: > > > [PATCH 0/4] factor __btrfs_open_devices() > > > > > > fs/btrfs/volumes.c | 43 +-- > > > 1 file changed, 41 insertions(+), 2 deletions(-) > > > > > > diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c > > > index d24e966ee29f..e7dd996831f2 100644 > > > --- a/fs/btrfs/volumes.c > > > +++ b/fs/btrfs/volumes.c > > > @@ -769,8 +769,47 @@ static noinline int device_list_add(const char *path, > > > rcu_string_free(device->name); > > > rcu_assign_pointer(device->name, name); > > > if (device->missing) { > > > - fs_devices->missing_devices--; > > > - device->missing = 0; > > > + int ret; > > > + struct btrfs_fs_info *fs_info = fs_devices->fs_info; > > > + fmode_t fmode = FMODE_READ | FMODE_WRITE | FMODE_EXCL; > > > + > > > + if (btrfs_super_flags(disk_super) & > > > BTRFS_SUPER_FLAG_SEEDING) > > > + fmode &= ~FMODE_WRITE; > > > + > > > + /* > > > + * Missing can be set only when FS is mounted. > > > + * So here its always fs_devices->opened > 0 and most > > > + * of the struct device members are already updated by > > > + * the mount process even if this device was missing, so > > > + * now follow the normal open device procedure for this > > > + * device. The scrub will take care of filling the > > > + * missing stripes for raid56 and balance for raid1 and > > > + * raid10. > > > + */ > > > > > > The idea looks good to me. > > > > What if users skip balance/scrub given both could take a while? > > > > Then btrfs takes the disks back and reads content from it, and it's OK > > for raid1/10 as they may have a good copy, > > Yes, except for the split brain situation for which patches are > in the ML, review comments appreciated. > > > but in case of raid56, it > > might hit the reconstruction bug if two disks are missing and added > > back, which I tried to address recently. > > > At least ->writable should not be set until balance/scrub completes > > the re-sync job. > > Hm. Why ? 
> (Sorry for the late reply.) On second thought, write is OK since parity will be calculated along writes. Will check the v2. Thanks, -liubo
Re: Read before you deploy btrfs + zstd
On Tue, Nov 28, 2017 at 09:31:57PM +, Nick Terrell wrote: > > > On Nov 21, 2017, at 8:22 AM, David Sterba wrote: > > > > On Wed, Nov 15, 2017 at 08:09:15PM +, Nick Terrell wrote: > >> On 11/15/17, 6:41 AM, "David Sterba" wrote: > >>> The branch is now in a state that can be tested. Turns out the memory > >>> requirements are too much for grub, so the boot fails with "not enough > >>> memory". The calculated value > >>> > >>> ZSTD_BTRFS_MAX_INPUT: 131072 > >>> ZSTD_DStreamWorkspaceBound with ZSTD_BTRFS_MAX_INPUT: 549424 > >>> > >>> This is not something I could fix easily, we'd probably need a tuned > >>> version of ZSTD for grub constraints. Adding Nick to CC. > >> > >> If I understand the grub code correctly, we only need to read, and we have > >> the entire input and output buffer in one segment. In that case you can use > >> ZSTD_initDCtx(), and ZSTD_decompressDCtx(). ZSTD_DCtxWorkspaceBound() is > >> only 155984. See decompress_single() in > >> https://patchwork.kernel.org/patch/9997909/ for an example. > > > > Does not help, still ENOMEM. > > It looks like XZ had the same issue, and they make the decompression > context a static object (grep for GRUB_EMBED_DECOMPRESSOR). We could > potentially do the same and statically allocate the workspace: > > ``` > /* Could also be size_t */ > #define BTRFS_ZSTD_WORKSPACE_SIZE_U64 (155984 / sizeof(uint64_t)) > static uint64_t workspace[BTRFS_ZSTD_WORKSPACE_SIZE_U64]; > > /* ... */ > > assert(sizeof(workspace) >= ZSTD_DCtxWorkspaceBound()); > ``` Interesting, thanks for the tip, I'll try it next. I've meanwhile tried to tweak the numbers, the maximum block for zstd, that squeezed the DCtx somewhere under 48k, with block size 8k. Still enomem. I've tried to add some debugging prints to see what numbers get actually passed to the allocator, but did not see anything printed. I'm sure there is a more intelligent way to test the grub changes. 
So far each test loop takes quite some time, as I build the rpm package, test it in a VM and have to recreate the environment each time.
Re: [PATCH 0/7] retry write on error
>>> If the underlying protocol doesn't support retry and there >>> are some transient errors happening somewhere in our IO >>> stack, we'd like to give an extra chance for IO. >> A limited number of retries may make sense, though I saw some >> long stalls after retries on bad disks. Indeed! One of the major issues in actual storage administration is to find ways to reliably disable most retries, or to shorten them, both at the block device level and the device level, because in almost all cases where storage reliability matters what is important is simply swapping out the failing device immediately and then examining and possibly refreshing it offline. To the point that many device manufacturers deliberately cripple in cheaper products retry shortening or disabling options to force long stalls, so that people who care about reliability more than price will buy the more expensive version that can disable or shorten retries. > Seems preferable to avoid issuing retries when the underlying > transport layer(s) has already done so, but I am not sure > there is a way to know that at the fs level. Indeed, and to use a euphemism, a third layer of retries at the filesystem level is currently a thoroughly imbecilic idea :-), as whether retries are worth doing is not a filesystem dependent issue (but then plugging is done at the block io level when it is entirely device dependent whether it is worth doing, so there is famous precedent). There are excellent reasons why error recovery is in general not done at the filesystem level since around 20 years ago, which do not need repeating every time. However one of them is that where it makes sense device firmware does retries, and the block device layer does retries too, which is often a bad idea, and where it is not, the block io level should do that, not the filesystem. 
A large part of the above discussion would not be needed if Linux kernel "developers" exposed a clear notion of hardware device and block device state machine and related semantics, or even knew that it were desirable, but that's an idea that is only 50 years old, so may not have yet reached popularity :-).
Re: [PATCH 0/7] retry write on error
On 11/28/2017 12:22 PM, David Sterba wrote: > On Tue, Nov 21, 2017 at 05:35:51PM -0700, Liu Bo wrote: >> If the underlying protocol doesn't support retry and there are some >> transient errors happening somewhere in our IO stack, we'd like to >> give an extra chance for IO. Or sometimes you see btrfs reporting >> 'wrr 1 flush 0 read 0 blabla' but the disk drive is 100% good, this >> retry may help a bit. > > A limited number of retries may make sense, though I saw some long > stalls after retries on bad disks. Tracking the retries would be a good > addition to the dev stats, ie. a soft error but still worth reporting. Seems preferable to avoid issuing retries when the underlying transport layer(s) has already done so, but I am not sure there is a way to know that at the fs level. Likewise for cases where a retry cannot succeed (transport failures? media errors?). It can get tricky to know how long to delay between retries and when it is appropriate to stop retrying. Ed >> In btrfs, read retry is handled in bio_readpage_error() with the retry >> unit being page size, for write retry however, we're going to do it in >> a different way, as a write may consist of several writes onto >> different stripes, retry write needs to be done right after the IO on >> each stripe completes and arrives at endio. >> >> Patch 1-3 are the implementation of retry write on error for >> non-raid56 profile. Patch 4-6 are for raid56 profile. Both raid56 >> and non-raid56 share one retry function helper. >> >> Patch 3 does retry sector by sector, but since this patch set doesn't >> include badblocks support, patch 7 changes it back to retry the whole >> bio. (I didn't fold patch 7 to patch 3 in the hope of just reverting >> patch 7 once badblocks support is done, but I'm open to it.) > > What does 'badblocks' refer to? I know about the badblocks utility that > finds and reports bad blocks, possibly ext2 understands that and avoids > allocating them. Btrfs does not have such support. 
Re: How about adding an ioctl to convert a directory to a subvolume?
2017-11-28 21:48 GMT+03:00 David Sterba: > On Mon, Nov 27, 2017 at 05:41:56PM +0800, Lu Fengqi wrote: >> As we all know, under certain circumstances, it is more appropriate to >> create some subvolumes rather than keep everything in the same >> subvolume. As the conditions of demand change, the user may need to >> convert a previous directory to a subvolume. For this reason, how about >> adding an ioctl to convert a directory to a subvolume? > > I'd say too difficult to get everything right in kernel. This is > possible to be done in userspace, with existing tools. > > The problem is that the conversion cannot be done atomically in most > cases, so even if it's just one ioctl call, there are several possible > intermediate states that would exist during the call. Reporting where > did the ioctl fail would need some extended error code semantics. > >> Users can convert by the scripts mentioned in this >> thread(https://www.spinics.net/lists/linux-btrfs/msg33252.html), but is >> it easier to use the off-the-shelf btrfs subcommand? > > Adding a subcommand would work, though I'd rather avoid reimplementing > 'cp -ax' or 'rsync -ax'. We want to copy the files preserving all > attributes, with reflink, and be able to identify partially synced > files, and not cross the mountpoints or subvolumes. > > The middle step with snapshotting the containing subvolume before > syncing the data is also a valid option, but not always necessary. > >> After an initial consideration, our implementation is broadly divided >> into the following steps: >> 1. Freeze the filesystem or set the subvolume above the source directory >> to read-only; > > Freezing the filesystem will freeze all IO, so this would not work, but > I understand what you mean. The file data are synced before the snapshot > is taken, but nothing prevents applications from continuing to write data. > > Open and live files are a problem and I don't see a nice solution here. > >> 2. 
Perform a pre-check, for example, check if a cross-device link >> would be created during the conversion; > > Cross-device links are not a problem as long as we use 'cp' ie. the > manual creation of files in the target. > >> 3. Perform conversion, such as creating a new subvolume and moving the >> contents of the source directory; >> 4. Thaw the filesystem or restore the subvolume writable property. >> >> In fact, I am not so sure whether this use of freeze is appropriate >> because the source directory the user needs to convert may be located >> at / or /home and this pre-check and conversion process may take a long >> time, which can lead to some shell and graphical applications being suspended. > > I think the closest operation is a read-only remount, which is not > always possible due to open files and can otherwise be considered quite an > intrusive operation to the whole system. And the root filesystem cannot > be easily remounted read-only in the systemd days anyway. My 2c: since we're talking about a 'fast' conversion of a dir to a subvolume (i.e. I like the idea of the ioctl call being fast), it can be done like this (sorry if I misunderstood something and this is a rant, or I'm crazy..): To make the idea more clear, for userspace that can look like: 1. Create a snapshot of the parent subvol of that dir 2. Clean up all data in the snapshot, except the content of the dir 3. Move the content of that dir to the snapshot root 4. Replace the dir with that snapshot/subvol i.e. no copy, no cp, only rename() and garbage collecting. In the kernel that in "theory" would look like: 1. Copy the subvol root inode 2. Replace the root inode with the target dir inode 3. Replace the target dir in the old subvol with the new subvol 4. 
GC the old dir content from the parent subvol, GC all useless content around the dir in the new subvol. That may be the fastest way for the user, but it will not solve the problems with open files etc.; still, it should be fast from the user's point of view, and all the other stuff can simply be cleaned up in the background. Thanks -- Have a nice day, Timofey.
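Timofey's four userspace steps can be modelled on plain nested dictionaries to make the flow concrete (a toy model only: deepcopy() stands in for the extent-sharing snapshot, and none of the open-file or atomicity issues discussed above exist here):

```python
import copy

def dir_to_subvol(parent_subvol, dirname):
    """Toy model of the snapshot-based flow.  parent_subvol is a dict
    tree where nested dicts are directories.  Returns a new parent tree
    with dirname replaced by a ('subvol', ...) marker whose root holds
    the old directory contents."""
    # 1. "snapshot" the parent subvolume (btrfs would share extents;
    #    here we just deep-copy the whole tree)
    snap = copy.deepcopy(parent_subvol)
    # 2-3. drop everything except the target dir's content and lift it
    #      to the snapshot root
    snap = snap[dirname]
    # 4. splice the new subvolume in where the directory was
    new_parent = dict(parent_subvol)
    new_parent[dirname] = ('subvol', snap)
    return new_parent

tree = {'etc': {'fstab': 'data'}, 'home': {'user': {'f': 'x'}}}
out = dir_to_subvol(tree, 'home')
assert out['home'] == ('subvol', {'user': {'f': 'x'}})
assert out['etc'] is tree['etc']   # sibling directories are untouched
```

The point of the model is the cost profile: nothing outside the target directory is copied, so the visible work is one snapshot, one prune, and one rename, with the garbage collection of the old contents deferred to the background.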
Re: [PATCH v7 1/5] add infrastructure for tagging functions as error injectable
On Tue, Nov 28, 2017 at 11:58:41AM -0700, Jonathan Corbet wrote: > On Wed, 22 Nov 2017 16:23:30 -0500 > Josef Bacik wrote: > > > From: Josef Bacik > > > > Using BPF we can override kprob'ed functions and return arbitrary > > values. Obviously this can be a bit unsafe, so make this feature opt-in > > for functions. Simply tag a function with KPROBE_ERROR_INJECT_SYMBOL in > > order to give BPF access to that function for error injection purposes. > > > > Signed-off-by: Josef Bacik > > Acked-by: Ingo Molnar > > --- > > arch/x86/include/asm/asm.h| 6 ++ > > include/asm-generic/vmlinux.lds.h | 10 +++ > > include/linux/bpf.h | 11 +++ > > include/linux/kprobes.h | 1 + > > include/linux/module.h| 5 ++ > > kernel/kprobes.c | 163 > > ++ > > kernel/module.c | 6 +- > > 7 files changed, 201 insertions(+), 1 deletion(-) > > > > diff --git a/arch/x86/include/asm/asm.h b/arch/x86/include/asm/asm.h > > index b0dc91f4bedc..340f4cc43255 100644 > > --- a/arch/x86/include/asm/asm.h > > +++ b/arch/x86/include/asm/asm.h > > @@ -85,6 +85,12 @@ > > _ASM_PTR (entry); \ > > .popsection > > > > +# define _ASM_KPROBE_ERROR_INJECT(entry) \ > > + .pushsection "_kprobe_error_inject_list","aw" ; \ > > + _ASM_ALIGN ;\ > > + _ASM_PTR (entry); \ > > + .popseciton > > So this stuff is not my area of greatest expertise, but I do have to wonder > how ".popseciton" can work ... ? > Well fuck, do you want me to send an increment Daniel/Alexei or resend this patch fixed? Thanks, Josef
Re: How about adding an ioctl to convert a directory to a subvolume?
On 2017-11-28 13:48, David Sterba wrote: On Mon, Nov 27, 2017 at 05:41:56PM +0800, Lu Fengqi wrote: As we all know, under certain circumstances, it is more appropriate to create some subvolumes rather than keep everything in the same subvolume. As the condition of demand change, the user may need to convert a previous directory to a subvolume. For this reason,how about adding an ioctl to convert a directory to a subvolume? I'd say too difficult to get everything right in kernel. This is possible to be done in userspace, with existing tools. The problem is that the conversion cannot be done atomically in most cases, so even if it's just one ioctl call, there are several possible intermediate states that would exist during the call. Reporting where did the ioctl fail would need some extended error code semantics. I think you mean it can't be done atomically in an inexpensive manner without significant work on the kernel side. It should in theory be possible to do it atomically by watching for and mirroring changes from the source directory to the new subvolume. Such an approach is however expensive, and is not guaranteed to ever finish if the source directory is under active usage. The only issue is updating open file descriptors to point to the new files. In short, the flow I'm thinking of is: 1. Create the subvolume in a temporary state that would get cleaned up by garbage collection if the FS got remounted. 2. Start watching the directory structure in the source directory for changes, recursively, and mirror all such changes to the subvolume as they happen. 3. For each file in the source directory and sub-directories, create a file in the new subvolume using a reflink, and add start watching that file for changes. Reflink any detected updates to the temporary file. Beyond this point, there are two possible methods to finish things: A: 4. Freeze all userspace I/O to the filesystem. 5. 
Update the dentry for the source directory to point to the new subvolume, remove the subvolume's 'temporary' status, and force a commit. 6. Update all open file descriptors to point to the files in the new subvolume. 7. Thaw all userspace I/O to the filesystem. 8. Garbage collect the source directory and it's contents. or: B: 4. Update the dentry for the source directory to point to the new subvolume, remove the subvolume's temporary status, and force a commit. 5. Keep the old data around until there are no references to it, and indirect opens on files that were already open prior to step 4 to point to the old file, while keeping watches around for the old files that were open. Prior to step A5 or B4, this can be atomically rolled back by simply nuking the temporary subvolume and removing the watches. After those steps, it's fully complete as far as the on-device state is concerned. Method A of completing the conversion has less overall impact on long-term operation of the system, but may require significant changes to the VFS API to be viable (I don't know if some of the overlay stuff could be used or not). Method B will continue to negatively impact performance until all the files that were open are closed, but shouldn't require as much legwork on the kernel side. In both cases, it ends up working similarly to the device replace operation, or LVM's pvmove operation, both of which can be made atomic. For what it's worth, I am of the opinion that this would be nice to have not so that stuff could be converted on-line, but so that you can convert a directory more easily off-line. 
Right now, it's a serious pain in the arse to do such a conversion (the Python code I linked is simple because it's using high-level operations out of the shutil standard module, and/or the reflink module on PyPI), largely because it is decidedly non-trivial to actually copy all the data about a file (and both `cp -ax` and `rsync -ax` miss information other than reflinks, most notably file attributes normally set by `chattr`). I personally care less about the atomicity of the operation than the fact that it actually preserves _everything_ (with the likely exception of EVM and IMA xattrs, but _nothing_ should be preserving those). IOW, I would be perfectly fine with something that does this in the kernel but returns -EWHATEVER if there are open files below the source directory and blocks modification to them until the switch is done. Users can convert by the scripts mentioned in this thread(https://www.spinics.net/lists/linux-btrfs/msg33252.html), but is it easier to use the off-the-shelf btrfs subcommand? Adding a subcommand would work, though I'd rather avoid reimplementing 'cp -ax' or 'rsync -ax'. We want to copy the files preserving all attributes, with reflink, and be able to identify partially synced files, and not cross the mountpoints or subvolumes. The middle step with snapshotting the containing subvolume before syncing the data is also a
Re: About 'key type for persistent [...] items'
On Tue, Nov 28, 2017 at 08:28:58PM +0100, Hans van Kranenburg wrote: > >> E.g. if I wanted to (just a random idea) add per device statistics, and > >> use this, I'd need to use the key also with objectid 1, 2, 3, etc... if > >> I have multiple devices. That's already a no go if there's anyone in any > >> other tree that is doing anything with any objectid in the range of > >> valid device numbers. > > > > In that case there would be a new objectid PER_DEVICE_OBJECTID, with the > > persistent key, and all the device ids can go to the offset field. The > > objectid field should not be dynamic, by design. > > Ok, didn't think of that yet, clear. > > So... just a random idea... You're not the first one to have that idea :) > How cool would it be to have the block group items being done this > way... PER_CHUNK_OBJECTID. > > Imagine the time to mount a large btrfs filesystem going down from > minutes or tens of minutes to just a few seconds... Exactly for that reason, but in such case we're allowed to grab a new key type and set objectid to a value that would group all the bg items, maybe also moving them to the beginning of the key space. The slightly tricky part would be the backward compatibility when we'd have to keep the existing bg and new-bg in sync, but there are no fundamental obstacles.
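The scheme discussed here — one shared key type, disambiguated by a reserved objectid, with the offset left free for per-item data — amounts to a (type, objectid) → structure registry, which is effectively what print-tree's switch statements implement. A small sketch (the constants are the on-disk values from the kernel headers; the dev stats entry matches the existing layout, objectid 0 with the devid in the key offset):

```python
# On-disk values from the btrfs kernel headers
BTRFS_TEMPORARY_ITEM_KEY = 248
BTRFS_PERSISTENT_ITEM_KEY = 249
BTRFS_DEV_STATS_OBJECTID = 0
BTRFS_BALANCE_OBJECTID = (1 << 64) - 4   # -4 as an unsigned 64-bit value

# (type, objectid) -> structure stored in the item
ITEM_STRUCTS = {
    (BTRFS_PERSISTENT_ITEM_KEY, BTRFS_DEV_STATS_OBJECTID):
        'btrfs_dev_stats_item',
    (BTRFS_TEMPORARY_ITEM_KEY, BTRFS_BALANCE_OBJECTID):
        'btrfs_balance_item',
}

def struct_for_key(objectid, key_type):
    """For the shared key types, the objectid picks the structure; the
    key offset stays free for per-item data (e.g. the devid for the
    per-device stats)."""
    return ITEM_STRUCTS.get((key_type, objectid), '<unknown>')

# Device statistics for devid 1 live at key (0, 249, 1):
assert struct_for_key(BTRFS_DEV_STATS_OBJECTID,
                      BTRFS_PERSISTENT_ITEM_KEY) == 'btrfs_dev_stats_item'
```

This also shows why the objectids must be globally registered rather than dynamic: an unregistered (type, objectid) pair resolves to nothing, so two users picking the same objectid for different structures would collide.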
Re: About 'key type for persistent [...] items'
Ok, just want to add one more thing :)

On 11/28/2017 08:12 PM, David Sterba wrote:
> On Tue, Nov 28, 2017 at 07:00:28PM +0100, Hans van Kranenburg wrote:
>> On 11/28/2017 06:34 PM, David Sterba wrote:
>>> On Fri, Nov 24, 2017 at 08:16:05PM +0100, Hans van Kranenburg wrote:
>>>> Last week, when implementing the automatic classifier to dynamically
>>>> create tree item data objects by key type in python-btrfs, I ran into
>>>> the following commits in btrfs-progs:
>>>>
>>>> commit 8609c8bad68528f668d9ce564b868aa4828107a0
>>>> btrfs-progs: print-tree: factor out temporary_item dump
>>>> and
>>>> commit a4b65f00d53deb1b495728dd58253af44fcf70df
>>>> btrfs-progs: print-tree: factor out persistent_item dump
>>>>
>>>> ...which are related to kernel...
>>>>
>>>> commit 50c2d5abe64c1726b48d292a2ab04f60e8238933
>>>> btrfs: introduce key type for persistent permanent items
>>>> and
>>>> commit 0bbbccb17fea86818e1a058faf5903aefd20b31a
>>>> btrfs: introduce key type for persistent temporary items
>>>>
>>>> Afaics the goal is to overload types because there can be only 256 in
>>>> total. However, I'm missing the design decisions behind the
>>>> implementation of it. It's not in the commit messages, and it hasn't
>>>> been on the mailing list.
>>>
>>> The reason is to avoid wasting key types but still allow to store new
>>> types of data to the btrees, if they were not part of the on-disk
>>> format.
>>>
>>> I'm not sure if this has been discussed or mentioned under some patches
>>> or maybe unrelated patches. I do remember that I discussed that with
>>> Chris in private on IRC and have the logs, dated 2015-09-02.
>>>
>>> At that time the balance item and dev stats item were introduced, maybe
>>> also the qgroup status item type. This had me alarmed enough to
>>> reconsider how the keys are allocated.
>>>
>>>> Before, there was a 1:1 mapping from key types to data structures.
>>>> Now, with the new PERSISTENT_ITEM_KEY and TEMPORARY_ITEM_KEY, it seems
>>>> items which use this type can be using any data structure they want,
>>>> so it's some kind of YOLO_ITEM_KEY.
>>>
>>> In some sense it is, so it's key+objectid to determine the structure.
>>>
>>>> The print-tree code in progs 8609c8b and a4b65f0 seems incomplete. For
>>>> example, for the PERSISTENT_ITEM_KEY, there's a switch (objectid) with
>>>> case BTRFS_DEV_STATS_OBJECTID.
>>>>
>>>> However, BTRFS_DEV_STATS_OBJECTID is just the value 0. So, that means
>>>> that if I want to have another tree where BTRFS_MOUTON_OBJECTID is
>>>> also 0, and I'm storing a btrfs_kebab_item struct indexed at
>>>> (BTRFS_MOUTON_OBJECTID, PERSISTENT_ITEM_KEY, 31337), then print_tree.c
>>>> will try to parse the data by calling print_dev_stats?
>>>
>>> As answered by Qu, you can't use 0 for BTRFS_MOUTON_OBJECTID in that
>>> case.
>>
>> (I'm just thinking out loud here, if you think I'm wasting your time
>> just say.)
>>
>> Yes, so the objectid numbers have to be "registered" / "reserved" in the
>> documentation, and they have to be unique over all trees.
>
> Right.
>
>> Maybe the information I was looking for is... in what cases should or
>> shouldn't this be used? Because that limits the possible usage quite a
>> bit. Or is it only for very very specific things.
>
> The keys are not free for use, they need to have a defined meaning and
> are sort of part of the on-disk format. There must be some usecase and
> reasoning why it's necessary to be done that way, and not in another.
> Like xattr.
>
> I don't have an example now, but I could imagine some per-tree
> information that can be tracked and updated at commit time.
>
>> E.g. if I wanted to (just a random idea) add per device statistics, and
>> use this, I'd need to use the key also with objectid 1, 2, 3, etc... if
>> I have multiple devices. That's already a no go if there's anyone in any
>> other tree that is doing anything with any objectid in the range of
>> valid device numbers.
>
> In that case there would be a new objectid PER_DEVICE_OBJECTID, with the
> persistent key, and all the device ids can go to the offset field. The
> objectid field should not be dynamic, by design.

Ok, didn't think of that yet, clear.

So... just a random idea...

How cool would it be to have the block group items being done this
way... PER_CHUNK_OBJECTID.

Imagine the time to mount a large btrfs filesystem going down from
minutes or tens of minutes to just a few seconds... \:D/

>>>> What's the idea behind that? Instead of having the key type field
>>>> define the struct and meaning, we now suddenly need the tuple (tree,
>>>> objectid, type), and we need all three to determine what's inside the
>>>> item data? So, the code in print_tree.c would also need to know about
>>>> the tree number and pass that into the different functions.
>>>
>>> No, all key types, even the persistent/temporary are independent of the
>>> tree type. So it's only type <-> structure mapping, besides
>>> persistent/temporary types.
>>
>> Yeah, I wasn't explicit
Re: [PATCH 0/7] retry write on error
On Tue, Nov 21, 2017 at 05:35:51PM -0700, Liu Bo wrote:
> If the underlying protocol doesn't support retry and there are some
> transient errors happening somewhere in our IO stack, we'd like to
> give an extra chance for IO. Or sometimes you see btrfs reporting
> 'wrr 1 flush 0 read 0 blabla' but the disk drive is 100% good, this
> retry may help a bit.

A limited number of retries may make sense, though I saw some long
stalls after retries on bad disks. Tracking the retries would be a good
addition to the dev stats, ie. a soft error but still worth reporting.

> In btrfs, read retry is handled in bio_readpage_error() with the retry
> unit being page size, for write retry however, we're going to do it in
> a different way, as a write may consist of several writes onto
> different stripes, retry write needs to be done right after the IO on
> each stripe completes and arrives at endio.
>
> Patch 1-3 are the implementation of retry write on error for
> non-raid56 profile. Patch 4-6 are for raid56 profile. Both raid56
> and non-raid56 share one retry function helper.
>
> Patch 3 does retry sector by sector, but since this patch set doesn't
> include badblocks support, patch 7 changes it back to retry the whole
> bio. (I didn't fold patch 7 into patch 3 in the hope of just reverting
> patch 7 once badblocks support is done, but I'm open to it.)

What does 'badblocks' refer to? I know about the badblocks utility that
finds and reports bad blocks; possibly ext2 understands that and avoids
allocating them. Btrfs does not have such support.
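The "limited number of retries plus dev-stats accounting" idea suggested above can be sketched generically. This is an illustration of bounded retry with soft-error counting, not btrfs code; the helper and counter names are invented:

```python
# Sketch of bounded retry with soft-error accounting: every failed
# attempt is counted (the "track it in dev stats" idea), and the error
# is surfaced to the caller only once the retry budget is exhausted.
def submit_with_retry(submit_io, max_retries, stats):
    for attempt in range(max_retries + 1):
        try:
            return submit_io()
        except IOError:
            stats["soft_errors"] += 1  # retried, possibly transient
            if attempt == max_retries:
                raise                  # hard error: budget exhausted

stats = {"soft_errors": 0}
attempts = []

def flaky_write():
    attempts.append(1)
    if len(attempts) < 3:              # first two attempts fail
        raise IOError("transient write error")
    return "ok"

print(submit_with_retry(flaky_write, max_retries=3, stats=stats))  # ok
print(stats)  # {'soft_errors': 2}
```

The counter separates "needed a retry" (soft, worth reporting) from "failed after all retries" (hard, propagated), which matches the distinction made in the reply.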
Re: About 'key type for persistent [...] items'
On Tue, Nov 28, 2017 at 07:00:28PM +0100, Hans van Kranenburg wrote:
> On 11/28/2017 06:34 PM, David Sterba wrote:
> > On Fri, Nov 24, 2017 at 08:16:05PM +0100, Hans van Kranenburg wrote:
> >> Last week, when implementing the automatic classifier to dynamically
> >> create tree item data objects by key type in python-btrfs, I ran into
> >> the following commits in btrfs-progs:
> >>
> >> commit 8609c8bad68528f668d9ce564b868aa4828107a0
> >> btrfs-progs: print-tree: factor out temporary_item dump
> >> and
> >> commit a4b65f00d53deb1b495728dd58253af44fcf70df
> >> btrfs-progs: print-tree: factor out persistent_item dump
> >>
> >> ...which are related to kernel...
> >>
> >> commit 50c2d5abe64c1726b48d292a2ab04f60e8238933
> >> btrfs: introduce key type for persistent permanent items
> >> and
> >> commit 0bbbccb17fea86818e1a058faf5903aefd20b31a
> >> btrfs: introduce key type for persistent temporary items
> >>
> >> Afaics the goal is to overload types because there can be only 256 in
> >> total. However, I'm missing the design decisions behind the
> >> implementation of it. It's not in the commit messages, and it hasn't
> >> been on the mailing list.
> >
> > The reason is to avoid wasting key types but still allow to store new
> > types of data to the btrees, if they were not part of the on-disk
> > format.
> >
> > I'm not sure if this has been discussed or mentioned under some patches
> > or maybe unrelated patches. I do remember that I discussed that with
> > Chris in private on IRC and have the logs, dated 2015-09-02.
> >
> > At that time the balance item and dev stats item were introduced, maybe
> > also the qgroup status item type. This had me alarmed enough to
> > reconsider how the keys are allocated.
> >
> >> Before, there was a 1:1 mapping from key types to data structures.
> >> Now, with the new PERSISTENT_ITEM_KEY and TEMPORARY_ITEM_KEY, it seems
> >> items which use this type can be using any data structure they want,
> >> so it's some kind of YOLO_ITEM_KEY.
> >
> > In some sense it is, so it's key+objectid to determine the structure.
> >
> >> The print-tree code in progs 8609c8b and a4b65f0 seems incomplete. For
> >> example, for the PERSISTENT_ITEM_KEY, there's a switch (objectid) with
> >> case BTRFS_DEV_STATS_OBJECTID.
> >>
> >> However, BTRFS_DEV_STATS_OBJECTID is just the value 0. So, that means
> >> that if I want to have another tree where BTRFS_MOUTON_OBJECTID is
> >> also 0, and I'm storing a btrfs_kebab_item struct indexed at
> >> (BTRFS_MOUTON_OBJECTID, PERSISTENT_ITEM_KEY, 31337), then print_tree.c
> >> will try to parse the data by calling print_dev_stats?
> >
> > As answered by Qu, you can't use 0 for BTRFS_MOUTON_OBJECTID in that
> > case.
>
> (I'm just thinking out loud here, if you think I'm wasting your time
> just say.)
>
> Yes, so the objectid numbers have to be "registered" / "reserved" in the
> documentation, and they have to be unique over all trees.

Right.

> Maybe the information I was looking for is... in what cases should or
> shouldn't this be used? Because that limits the possible usage quite a
> bit. Or is it only for very very specific things.

The keys are not free for use, they need to have a defined meaning and
are sort of part of the on-disk format. There must be some usecase and
reasoning why it's necessary to be done that way, and not in another.
Like xattr.

I don't have an example now, but I could imagine some per-tree
information that can be tracked and updated at commit time.

> E.g. if I wanted to (just a random idea) add per device statistics, and
> use this, I'd need to use the key also with objectid 1, 2, 3, etc... if
> I have multiple devices. That's already a no go if there's anyone in any
> other tree that is doing anything with any objectid in the range of
> valid device numbers.

In that case there would be a new objectid PER_DEVICE_OBJECTID, with the
persistent key, and all the device ids can go to the offset field. The
objectid field should not be dynamic, by design.

> >> What's the idea behind that? Instead of having the key type field
> >> define the struct and meaning, we now suddenly need the tuple (tree,
> >> objectid, type), and we need all three to determine what's inside the
> >> item data? So, the code in print_tree.c would also need to know about
> >> the tree number and pass that into the different functions.
> >
> > No, all key types, even the persistent/temporary are independent of the
> > tree type. So it's only type <-> structure mapping, besides
> > persistent/temporary types.
>
> Yeah, I wasn't explicit about that, I meant only for the
> persistent/temporary case yes.

So for this case it's type + objectid, the tree independence stays.
Re: [PATCH 1/5] btrfs-progs: Add location check when process share_data_ref item
On Fri, Nov 24, 2017 at 06:41:28PM +0800, Gu Jinxiang wrote:
> The following test failed because a shared_data_ref is added into
> extent_cache when dealing with a root tree node.

The whole function run_next_block would need to be revisited with
respect to error handling and sanity checks. Adding them only when a
fuzzed image hits the particular code path is not IMHO the best
approach. If there's some fuzzed test case, we should try to find all
similar missing checks and fix them before moving to another type.
Addressing only the failed tests gives a false sense of fixing, there
are usually more similar bugs.
Re: [PATCH v7 1/5] add infrastructure for tagging functions as error injectable
On Wed, 22 Nov 2017 16:23:30 -0500 Josef Bacik wrote:
> From: Josef Bacik
>
> Using BPF we can override kprob'ed functions and return arbitrary
> values. Obviously this can be a bit unsafe, so make this feature opt-in
> for functions. Simply tag a function with KPROBE_ERROR_INJECT_SYMBOL in
> order to give BPF access to that function for error injection purposes.
>
> Signed-off-by: Josef Bacik
> Acked-by: Ingo Molnar
> ---
>  arch/x86/include/asm/asm.h        |   6 ++
>  include/asm-generic/vmlinux.lds.h |  10 +++
>  include/linux/bpf.h               |  11 +++
>  include/linux/kprobes.h           |   1 +
>  include/linux/module.h            |   5 ++
>  kernel/kprobes.c                  | 163 ++
>  kernel/module.c                   |   6 +-
>  7 files changed, 201 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/include/asm/asm.h b/arch/x86/include/asm/asm.h
> index b0dc91f4bedc..340f4cc43255 100644
> --- a/arch/x86/include/asm/asm.h
> +++ b/arch/x86/include/asm/asm.h
> @@ -85,6 +85,12 @@
>  	_ASM_PTR (entry); \
>  	.popsection
>
> +# define _ASM_KPROBE_ERROR_INJECT(entry) \
> +	.pushsection "_kprobe_error_inject_list","aw" ; \
> +	_ASM_ALIGN ; \
> +	_ASM_PTR (entry); \
> +	.popseciton

So this stuff is not my area of greatest expertise, but I do have to
wonder how ".popseciton" can work ... ?

Thanks,

jon
Re: How about adding an ioctl to convert a directory to a subvolume?
On Mon, Nov 27, 2017 at 05:41:56PM +0800, Lu Fengqi wrote:
> As we all know, under certain circumstances, it is more appropriate to
> create some subvolumes rather than keep everything in the same
> subvolume. As the conditions of demand change, the user may need to
> convert a previous directory to a subvolume. For this reason, how about
> adding an ioctl to convert a directory to a subvolume?

I'd say too difficult to get everything right in the kernel. This is
possible to be done in userspace, with existing tools.

The problem is that the conversion cannot be done atomically in most
cases, so even if it's just one ioctl call, there are several possible
intermediate states that would exist during the call. Reporting where
the ioctl failed would need some extended error code semantics.

> Users can convert by the scripts mentioned in this
> thread (https://www.spinics.net/lists/linux-btrfs/msg33252.html), but is
> it easier to use the off-the-shelf btrfs subcommand?

Adding a subcommand would work, though I'd rather avoid reimplementing
'cp -ax' or 'rsync -ax'. We want to copy the files preserving all
attributes, with reflink, and be able to identify partially synced
files, and not cross the mountpoints or subvolumes.

The middle step with snapshotting the containing subvolume before
syncing the data is also a valid option, but not always necessary.

> After an initial consideration, our implementation is broadly divided
> into the following steps:
> 1. Freeze the filesystem or set the subvolume above the source directory
> to read-only;

Freezing the filesystem will freeze all IO, so this would not work, but
I understand what you mean. The file data are synced before the snapshot
is taken, but nothing prevents applications from continuing to write
data. Open and live files are a problem and I don't see a nice solution
here.

> 2. Perform a pre-check, for example, check if a cross-device link
> creation happens during the conversion;

Cross-device links are not a problem as long as we use 'cp', ie. the
manual creation of files in the target.

> 3. Perform conversion, such as creating a new subvolume and moving the
> contents of the source directory;
> 4. Thaw the filesystem or restore the subvolume writable property.
>
> In fact, I am not so sure whether this use of freeze is appropriate
> because the source directory the user needs to convert may be located
> at / or /home and this pre-check and conversion process may take a long
> time, which can lead to some shell and graphical applications being
> suspended.

I think the closest operation is a read-only remount, which is not
always possible due to open files and can otherwise be considered quite
an intrusive operation for the whole system. And the root filesystem
cannot be easily remounted read-only in the systemd days anyway.
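The userspace "copy with reflink, then swap" approach discussed above can be sketched as a planned command sequence. Everything here is illustrative: the helper is not an existing tool, it only *builds* the commands (so the atomicity caveats from the thread still apply to whoever runs them), and the temporary names are made up:

```python
# Sketch of a userspace directory-to-subvolume conversion plan, based on
# the 'cp -ax --reflink' approach mentioned in the thread. This is an
# assumption-laden illustration, not btrfs-progs code.

def plan_conversion(directory):
    """Return shell commands (as argv lists) to convert `directory`.

    The copy goes into a temporary sibling subvolume, then the original
    is swapped out, so a failure before the final renames leaves the
    original directory untouched. Open/live files are still a problem,
    exactly as noted in the reply.
    """
    tmp = directory + ".subvol-tmp"
    old = directory + ".old"
    return [
        # 1. create the target subvolume next to the directory
        ["btrfs", "subvolume", "create", tmp],
        # 2. copy contents, preserving attributes, staying on one fs,
        #    sharing data extents via reflink
        ["cp", "-ax", "--reflink=always", directory + "/.", tmp],
        # 3. swap: keep the original around until the copy is verified
        ["mv", directory, old],
        ["mv", tmp, directory],
        # 4. only after verification: drop the old copy
        ["rm", "-rf", old],
    ]

for cmd in plan_conversion("/srv/data"):
    print(" ".join(cmd))
```

Identifying partially synced files, which the reply calls out as a requirement, is exactly what this naive plan does not handle.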
Re: About 'key type for persistent [...] items'
On 11/28/2017 06:34 PM, David Sterba wrote:
> On Fri, Nov 24, 2017 at 08:16:05PM +0100, Hans van Kranenburg wrote:
>> Last week, when implementing the automatic classifier to dynamically
>> create tree item data objects by key type in python-btrfs, I ran into
>> the following commits in btrfs-progs:
>>
>> commit 8609c8bad68528f668d9ce564b868aa4828107a0
>> btrfs-progs: print-tree: factor out temporary_item dump
>> and
>> commit a4b65f00d53deb1b495728dd58253af44fcf70df
>> btrfs-progs: print-tree: factor out persistent_item dump
>>
>> ...which are related to kernel...
>>
>> commit 50c2d5abe64c1726b48d292a2ab04f60e8238933
>> btrfs: introduce key type for persistent permanent items
>> and
>> commit 0bbbccb17fea86818e1a058faf5903aefd20b31a
>> btrfs: introduce key type for persistent temporary items
>>
>> Afaics the goal is to overload types because there can be only 256 in
>> total. However, I'm missing the design decisions behind the
>> implementation of it. It's not in the commit messages, and it hasn't
>> been on the mailing list.
>
> The reason is to avoid wasting key types but still allow to store new
> types of data to the btrees, if they were not part of the on-disk
> format.
>
> I'm not sure if this has been discussed or mentioned under some patches
> or maybe unrelated patches. I do remember that I discussed that with
> Chris in private on IRC and have the logs, dated 2015-09-02.
>
> At that time the balance item and dev stats item were introduced, maybe
> also the qgroup status item type. This had me alarmed enough to
> reconsider how the keys are allocated.
>
>> Before, there was a 1:1 mapping from key types to data structures.
>> Now, with the new PERSISTENT_ITEM_KEY and TEMPORARY_ITEM_KEY, it seems
>> items which use this type can be using any data structure they want,
>> so it's some kind of YOLO_ITEM_KEY.
>
> In some sense it is, so it's key+objectid to determine the structure.
>
>> The print-tree code in progs 8609c8b and a4b65f0 seems incomplete. For
>> example, for the PERSISTENT_ITEM_KEY, there's a switch (objectid) with
>> case BTRFS_DEV_STATS_OBJECTID.
>>
>> However, BTRFS_DEV_STATS_OBJECTID is just the value 0. So, that means
>> that if I want to have another tree where BTRFS_MOUTON_OBJECTID is
>> also 0, and I'm storing a btrfs_kebab_item struct indexed at
>> (BTRFS_MOUTON_OBJECTID, PERSISTENT_ITEM_KEY, 31337), then print_tree.c
>> will try to parse the data by calling print_dev_stats?
>
> As answered by Qu, you can't use 0 for BTRFS_MOUTON_OBJECTID in that
> case.

(I'm just thinking out loud here, if you think I'm wasting your time
just say.)

Yes, so the objectid numbers have to be "registered" / "reserved" in the
documentation, and they have to be unique over all trees.

Maybe the information I was looking for is... in what cases should or
shouldn't this be used? Because that limits the possible usage quite a
bit. Or is it only for very very specific things.

E.g. if I wanted to (just a random idea) add per device statistics, and
use this, I'd need to use the key also with objectid 1, 2, 3, etc... if
I have multiple devices. That's already a no go if there's anyone in any
other tree that is doing anything with any objectid in the range of
valid device numbers.

>> What's the idea behind that? Instead of having the key type field
>> define the struct and meaning, we now suddenly need the tuple (tree,
>> objectid, type), and we need all three to determine what's inside the
>> item data? So, the code in print_tree.c would also need to know about
>> the tree number and pass that into the different functions.
>
> No, all key types, even the persistent/temporary are independent of the
> tree type. So it's only type <-> structure mapping, besides
> persistent/temporary types.

Yeah, I wasn't explicit about that, I meant only for the
persistent/temporary case yes.

-- 
Hans van Kranenburg
Re: About 'key type for persistent [...] items'
On Sat, Nov 25, 2017 at 09:12:35AM +0800, Qu Wenruo wrote:
> On 2017年11月25日 03:16, Hans van Kranenburg wrote:
> > Last week, when implementing the automatic classifier to dynamically
> > create tree item data objects by key type in python-btrfs, I ran into
> > the following commits in btrfs-progs:
> >
> > commit 8609c8bad68528f668d9ce564b868aa4828107a0
> > btrfs-progs: print-tree: factor out temporary_item dump
> > and
> > commit a4b65f00d53deb1b495728dd58253af44fcf70df
> > btrfs-progs: print-tree: factor out persistent_item dump
> >
> > ...which are related to kernel...
> >
> > commit 50c2d5abe64c1726b48d292a2ab04f60e8238933
> > btrfs: introduce key type for persistent permanent items
> > and
> > commit 0bbbccb17fea86818e1a058faf5903aefd20b31a
> > btrfs: introduce key type for persistent temporary items
> >
> > Afaics the goal is to overload types because there can be only 256 in
> > total.
>
> Personally speaking, to overload types, we can easily make different
> meanings of type based on tree owner.

A key type tied to a tree would cause some ambiguity, while now we
always know what's in the item if we eg. read a random metadata block.

We're now at ~40 key types, scattered in the 256 range, so further
extensions are possible should we need a particular number due to the
ordering.
Re: About 'key type for persistent [...] items'
On Fri, Nov 24, 2017 at 08:16:05PM +0100, Hans van Kranenburg wrote:
> Last week, when implementing the automatic classifier to dynamically
> create tree item data objects by key type in python-btrfs, I ran into
> the following commits in btrfs-progs:
>
> commit 8609c8bad68528f668d9ce564b868aa4828107a0
> btrfs-progs: print-tree: factor out temporary_item dump
> and
> commit a4b65f00d53deb1b495728dd58253af44fcf70df
> btrfs-progs: print-tree: factor out persistent_item dump
>
> ...which are related to kernel...
>
> commit 50c2d5abe64c1726b48d292a2ab04f60e8238933
> btrfs: introduce key type for persistent permanent items
> and
> commit 0bbbccb17fea86818e1a058faf5903aefd20b31a
> btrfs: introduce key type for persistent temporary items
>
> Afaics the goal is to overload types because there can be only 256 in
> total. However, I'm missing the design decisions behind the
> implementation of it. It's not in the commit messages, and it hasn't
> been on the mailing list.

The reason is to avoid wasting key types but still allow to store new
types of data to the btrees, if they were not part of the on-disk
format.

I'm not sure if this has been discussed or mentioned under some patches
or maybe unrelated patches. I do remember that I discussed that with
Chris in private on IRC and have the logs, dated 2015-09-02.

At that time the balance item and dev stats item were introduced, maybe
also the qgroup status item type. This had me alarmed enough to
reconsider how the keys are allocated.

> Before, there was a 1:1 mapping from key types to data structures.
> Now, with the new PERSISTENT_ITEM_KEY and TEMPORARY_ITEM_KEY, it seems
> items which use this type can be using any data structure they want,
> so it's some kind of YOLO_ITEM_KEY.

In some sense it is, so it's key+objectid to determine the structure.

> The print-tree code in progs 8609c8b and a4b65f0 seems incomplete. For
> example, for the PERSISTENT_ITEM_KEY, there's a switch (objectid) with
> case BTRFS_DEV_STATS_OBJECTID.
>
> However, BTRFS_DEV_STATS_OBJECTID is just the value 0. So, that means
> that if I want to have another tree where BTRFS_MOUTON_OBJECTID is
> also 0, and I'm storing a btrfs_kebab_item struct indexed at
> (BTRFS_MOUTON_OBJECTID, PERSISTENT_ITEM_KEY, 31337), then print_tree.c
> will try to parse the data by calling print_dev_stats?

As answered by Qu, you can't use 0 for BTRFS_MOUTON_OBJECTID in that
case.

> What's the idea behind that? Instead of having the key type field
> define the struct and meaning, we now suddenly need the tuple (tree,
> objectid, type), and we need all three to determine what's inside the
> item data? So, the code in print_tree.c would also need to know about
> the tree number and pass that into the different functions.

No, all key types, even the persistent/temporary ones, are independent
of the tree type. So it's only the type <-> structure mapping, besides
persistent/temporary types.
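The dispatch rule that comes out of this exchange — type alone picks the structure, except for the overloaded persistent/temporary types where the globally registered objectid is needed too — can be modeled in a few lines. This mirrors what the print-tree switch does; the structure name string is an illustrative stand-in, not a real parser:

```python
# Minimal model of the (key type, objectid) -> structure dispatch for
# the overloaded key types. The key type values match the on-disk
# format; the mapped names are illustrative.
PERSISTENT_ITEM_KEY = 249
DEV_STATS_OBJECTID = 0

# For ordinary key types the type alone picks the structure...
STRUCT_BY_TYPE = {
    # ... one entry per ordinary key type ...
}
# ...but for PERSISTENT_ITEM_KEY the objectid is needed as well, and the
# objectid values must be registered globally: the same number may not
# be reused with a different meaning in another tree.
PERSISTENT_STRUCT_BY_OBJECTID = {
    DEV_STATS_OBJECTID: "btrfs_dev_stats_item",
}

def classify(objectid, key_type):
    """Return the name of the structure stored under this key, or None."""
    if key_type == PERSISTENT_ITEM_KEY:
        return PERSISTENT_STRUCT_BY_OBJECTID.get(objectid)
    return STRUCT_BY_TYPE.get(key_type)

print(classify(DEV_STATS_OBJECTID, PERSISTENT_ITEM_KEY))
```

Note the tree id never appears in the lookup, which is exactly the tree independence stated in the reply.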
Re: [PATCH v7 0/4] Add the ability to do BPF directed error injection
On 11/22/2017 10:23 PM, Josef Bacik wrote:
> This is hopefully the final version, I've addressed the comment by Ingo
> and added his Acks.
>
> v6->v7:
> - moved the opt-in macro to bpf.h out of kprobes.h.
>
> v5->v6:
> - add BPF_ALLOW_ERROR_INJECTION() tagging for functions that will
>   support this feature. This way only functions that opt-in will be
>   allowed to be overridden.
> - added a btrfs patch to allow error injection for open_ctree() so that
>   the bpf sample actually works.
>
> v4->v5:
> - disallow kprobe_override programs from being put in the prog map
>   array so we don't tail call into something we didn't check. This
>   allows us to make the normal path still fast without a bunch of
>   percpu operations.
>
> v3->v4:
> - fix a build error found by kbuild test bot (I didn't wait long enough
>   apparently.)
> - Added a warning message as per Daniel's suggestion.
>
> v2->v3:
> - added a ->kprobe_override flag to bpf_prog.
> - added some sanity checks to disallow attaching bpf progs that have
>   ->kprobe_override set that aren't for ftrace kprobes.
> - added the trace_kprobe_ftrace helper to check if the trace_event_call
>   is an ftrace kprobe.
> - renamed bpf_kprobe_state to bpf_kprobe_override, fixed it so we only
>   read this value in the kprobe path, and thus only write to it if
>   we're overriding or clearing the override.
>
> v1->v2:
> - moved things around to make sure that bpf_override_return could
>   really only be used for an ftrace kprobe.
> - killed the special return values from trace_call_bpf.
> - renamed pc_modified to bpf_kprobe_state so bpf_override_return could
>   tell if it was being called from an ftrace kprobe context.
> - reworked the logic in kprobe_perf_func to take advantage of
>   bpf_kprobe_state.
> - updated the test as per Alexei's review.
>
> - Original message -
>
> A lot of our error paths are not well tested because we have no good
> way of injecting errors generically. Some subsystems (block, memory)
> have ways to inject errors, but they are random so it's hard to get
> reproducible results.
>
> With BPF we can add determinism to our error injection. We can use
> kprobes and other things to verify we are injecting errors at the exact
> case we are trying to test. This patch gives us the tool to actually do
> the error injection part. It is very simple, we just set the return
> value of the pt_regs we're given to whatever we provide, and then
> override the PC with a dummy function that simply returns.
>
> Right now this only works on x86, but it would be simple enough to
> expand to other architectures. Thanks,

Ok, given the remaining feedback from Ingo was addressed and therefore
the series acked, I've applied it to the bpf-next tree, thanks Josef.
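The opt-in design described above (only functions explicitly tagged with BPF_ALLOW_ERROR_INJECTION may have their return value overridden) can be illustrated with a small language-neutral model. The decorator and registry here are inventions for illustration only, not the kernel API:

```python
# Toy model of opt-in error injection: only functions that explicitly
# opted in (the analogue of the BPF_ALLOW_ERROR_INJECTION tagging) can
# have their return value forced by the injection machinery.
_injectable = set()   # names of functions that opted in
_overrides = {}       # active overrides: function name -> forced value

def allow_error_injection(fn):
    """Tag `fn` as eligible for injection (kernel: opt-in macro)."""
    _injectable.add(fn.__name__)
    def wrapper(*args, **kwargs):
        if fn.__name__ in _overrides:
            return _overrides[fn.__name__]  # injected return value
        return fn(*args, **kwargs)
    return wrapper

def override_return(name, value):
    """Force `name` to return `value`; refuse non-opted-in targets."""
    if name not in _injectable:
        raise PermissionError(name + " did not opt in to error injection")
    _overrides[name] = value

@allow_error_injection
def open_ctree():
    return 0  # success path

def not_injectable():
    return 0

override_return("open_ctree", -12)  # deterministically simulate -ENOMEM
print(open_ctree())                 # -12 instead of 0
```

The point of the model is the determinism: the error fires at exactly the chosen call site, every time, unlike the random block/memory injectors mentioned in the cover letter.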
Re: [PATCH] Btrfs: incremental send, fix wrong unlink path after renaming file
On Fri, Nov 17, 2017 at 02:48:30PM +0000, fdman...@kernel.org wrote:
> From: Filipe Manana
>
> Under some circumstances, an incremental send operation can issue wrong
> paths for unlink commands related to files that have multiple hard links
> and some (or all) of those links were renamed between the parent and send
> snapshots. Consider the following example:
>
> Parent snapshot
>
> .            (ino 256)
> | a/         (ino 257)
> | | b/       (ino 259)
> | | | c/     (ino 260)
> | | | f2     (ino 261)
> | |
> | | f2l1     (ino 261)
> |
> | d/         (ino 262)
> | | f1l1_2   (ino 258)
> | | f2l2     (ino 261)
> | | f1_2     (ino 258)
>
> Send snapshot
>
> .            (ino 256)
> | a/         (ino 257)
> | | f2l1/    (ino 263)
> | | b2/      (ino 259)
> | | c/       (ino 260)
> | | | d3     (ino 262)
> | | | f1l1_2 (ino 258)
> | | | f2l2_2 (ino 261)
> | | | f1_2   (ino 258)
> | |
> | | f2       (ino 261)
> | | f1l2     (ino 258)
> |
> | d          (ino 261)
>
> When computing the incremental send stream the following steps happen:
>
> 1) When processing inode 261, a rename operation is issued that renames
>    inode 262, which currently has a path of "d", to an orphan name of
>    "o262-7-0". This is done because in the send snapshot, inode 261 has
>    one of its hard links with a path of "d" as well.
>
> 2) Two link operations are issued that create the new hard links for
>    inode 261, whose names are "d" and "f2l2_2", at paths "/" and
>    "o262-7-0/" respectively.
>
> 3) Still while processing inode 261, unlink operations are issued to
>    remove the old hard links of inode 261, with names "f2l1" and "f2l2",
>    at paths "a/" and "d/". However path "d/" does not correspond anymore
>    to the directory inode 262 but corresponds instead to a hard link of
>    inode 261 (link command issued in the previous step). This makes the
>    receiver fail with an ENOTDIR error when attempting the unlink
>    operation.
>
> The problem happens because before sending the unlink operation, we failed
> to detect that inode 262 was one of the ancestors for inode 261 in the
> parent snapshot, and therefore we didn't recompute the path for inode 262
> before issuing the unlink operation for the link named "f2l2" of inode
> 261. The detection failed because the function "is_ancestor()" only
> follows the first hard link it finds for an inode instead of all of its
> hard links (as it was originally created for being used with directories
> only, for which only one hard link exists). So fix this by making
> "is_ancestor()" follow all hard links of the input inode.
>
> A test case for fstests follows soon.
>
> Signed-off-by: Filipe Manana

FYI, added to the rc2 pull.
Re: [PATCH] btrfs-progs: mkfs: fix verbose value
On Fri, Nov 24, 2017 at 02:24:17PM +0900, Misono, Tomohiro wrote:
> The value of 'verbose' is either 1 (default) or 0 (-q),
> so "verbose >= 2" will never be true.
>
> After fixing this, we get something like:
>   adding device /dev/sde id 2
>   adding device /dev/sdf id 3
> during mkfs when multiple devices are used.

I think the original idea was to allow different levels of verbosity,
and the "adding device" messages do not bring much value to be printed
by default. The verbose/quiet options will be handled globally, so this
code will get cleaned up at that time.
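The "different levels of verbosity" idea referred to in the reply can be modeled simply; the level mapping and helper name below are illustrative, not the eventual global option handling in btrfs-progs:

```python
# Sketch of leveled verbosity: -q -> 0, default -> 1, -v -> 2.
# A message prints only when the configured level reaches its threshold,
# so "adding device" chatter can be gated behind level >= 2 while the
# default output stays quiet.
verbose_level = 1  # default: no -q, no -v

def pr_verbose(required_level, msg, out):
    """Record msg only if the configured verbosity is high enough."""
    if verbose_level >= required_level:
        out.append(msg)

log = []
pr_verbose(1, "mkfs.btrfs: creating filesystem", log)
pr_verbose(2, "adding device /dev/sde id 2", log)  # hidden by default
print(log)
```

With this scheme the original `verbose >= 2` check becomes reachable once `-v` can actually raise the level past 1.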
Re: [PATCH v2 1/3] btrfs-progs: fi defrag: clean up duplicate code if find errors
On Tue, Nov 28, 2017 at 05:14:48PM +0800, Su Yue wrote:
> In function cmd_filesystem_defrag(), the lines of code for error
> handling are duplicated and hard to extend further.
>
> Create a jump label for errors.
>
> Signed-off-by: Su Yue

Applied, thanks.
Re: [PATCH v2 3/3] btrfs-progs: fi defrag: extend -c to drop nocompress flag on files
On Tue, Nov 28, 2017 at 05:14:50PM +0800, Su Yue wrote:
> Now, files which have the nocompress flag will also be defragmented
> with compression. However, the nocompress flag still exists and has
> to be cleared manually.
>
> So add an option '--clear-nocompress' to extend -c to drop the
> nocompress flag after defragmentation.
>
> Suggested-by: David Sterba
> Suggested-by: Anand Jain

Do you have the pointer to the discussion? The whole idea sounds
familiar and seeing my name here means I must have been involved, but I
have only vague memories.

> Signed-off-by: Su Yue
> ---
> Changelog:
> v2: Remove change about indentation of defrag_callback().
>
> ---
>  cmds-filesystem.c | 65 ++-
>  1 file changed, 64 insertions(+), 1 deletion(-)
>
> diff --git a/cmds-filesystem.c b/cmds-filesystem.c
> index 3931333f76c6..84242814798e 100644
> --- a/cmds-filesystem.c
> +++ b/cmds-filesystem.c
> @@ -27,6 +27,7 @@
>  #include
>  #include
>  #include
> +#include
>
>  #include "kerncompat.h"
>  #include "ctree.h"
> @@ -867,6 +868,8 @@ static const char * const cmd_filesystem_defrag_usage[] = {
>  	"-l len   defragment only up to len bytes",
>  	"-t size  target extent size hint (default: 32M)",
>  	"",
> +	"--compress-force	clear nocompress flag on files after defragment,
> +			only work with option -c",

This is probably --clear-nocompress. The "force compress" semantics is
up to the kernel, so I think naming the option 'clear-nocompress' is the
right thing, as that's what it really does.

> +	"",
>  	"Warning: most Linux kernels will break up the ref-links of COW data",
>  	"(e.g., files copied with 'cp --reflink', snapshots) which may cause",
>  	"considerable increase of space usage. See btrfs-filesystem(8) for",
> @@ -874,9 +877,39 @@ static const char * const cmd_filesystem_defrag_usage[] = {
>  	NULL
>  };
>
> +static int clear_nocompress_flag(int fd)
> +{
> +	unsigned int flags;
> +	int ret = 0;
> +
> +	ret = ioctl(fd, FS_IOC_GETFLAGS, &flags);
> +	if (ret < 0) {
> +		ret = -errno;
> +		error("failed to get flags: %s", strerror(-ret));
> +		goto out;
> +	}
> +
> +	if (!(flags & FS_NOCOMP_FL)) {
> +		ret = 0;
> +		goto out;
> +	}
> +	flags &= ~FS_NOCOMP_FL;
> +	ret = ioctl(fd, FS_IOC_SETFLAGS, &flags);

This is inherently racy, but the inode flags do not change that often so
a concurrent change is unlikely.

> +	if (ret < 0) {
> +		ret = -errno;
> +		error("failed to set flags: %s", strerror(-ret));
> +		goto out;
> +	}
> +
> +	ret = 0;
> +out:
> +	return ret;
> +}
> +
>  static struct btrfs_ioctl_defrag_range_args defrag_global_range;
>  static int defrag_global_verbose;
>  static int defrag_global_errors;
> +static int defrag_global_clear_nocompress;
>  static int defrag_callback(const char *fpath, const struct stat *sb,
>  		int typeflag, struct FTW *ftwbuf)
>  {
> @@ -904,6 +937,14 @@ static int defrag_callback(const char *fpath, const struct stat *sb,
>  		err = errno;
>  		goto error;
>  	}
> +
> +	if (defrag_global_clear_nocompress) {
> +		ret = clear_nocompress_flag(fd);
> +		if (ret) {
> +			err = -ret;
> +			goto error;
> +		}
> +	}
>  	}
>  	return 0;
>
> @@ -926,6 +967,12 @@ static int cmd_filesystem_defrag(int argc, char **argv)
>  	int compress_type = BTRFS_COMPRESS_NONE;
>  	DIR *dirstream;
>
> +	enum { GETOPT_VAL_CLEAR_NOCOMPRESS = 257 };
> +	static const struct option long_options[] = {
> +		{ "clear-nocompress", no_argument, NULL,
> +			GETOPT_VAL_CLEAR_NOCOMPRESS },
> +		{ NULL, 0, NULL, 0 }
> +	};
>  	/*
>  	 * Kernel has a different default (256K) that is supposed to be safe,
>  	 * but it does not defragment very well. The 32M will likely lead to
> @@ -937,8 +984,10 @@ static int cmd_filesystem_defrag(int argc, char **argv)
>  	defrag_global_errors = 0;
>  	defrag_global_verbose = 0;
>  	defrag_global_errors = 0;
> +	defrag_global_clear_nocompress = 0;
>  	while(1) {
> -		int c = getopt(argc, argv, "vrc::fs:l:t:");
> +		int c = getopt_long(argc, argv, "vrc::fs:l:t:", long_options,
> +				NULL);
>  		if (c < 0)
>  			break;
>
> @@ -972,6 +1021,9 @@ static int cmd_filesystem_defrag(int argc, char **argv)
>  		case 'r':
>  			recursive = 1;
>  			break;
> +		case GETOPT_VAL_CLEAR_NOCOMPRESS:
> +			defrag_global_clear_nocompress = 1;
> +
Re: [PATCH v2 2/3] btrfs-progs: fi defrag: do not exit if defrag range ioctl is unsupported
On Tue, Nov 28, 2017 at 05:14:49PM +0800, Su Yue wrote: > If ioctl of defrag range is unsupported, defrag will exit > immediately. > > Since caller can handle the error, let cmd_filesystem_defrag() > break the loop and return error instead of calling exit(1). > > Suggested-by: David Sterba> Signed-off-by: Su Yue > --- > Changelog: > v2: Separate the patch from commit 6e991b9161fa ("btrfs-progs: fi > defrag: clean up duplicate code if find errors"). > --- > cmds-filesystem.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/cmds-filesystem.c b/cmds-filesystem.c > index 17d399d58adf..3931333f76c6 100644 > --- a/cmds-filesystem.c > +++ b/cmds-filesystem.c > @@ -1050,7 +1050,7 @@ static int cmd_filesystem_defrag(int argc, char **argv) > ret = nftw(argv[i], defrag_callback, 10, > FTW_MOUNT | FTW_PHYS); > if (ret == ENOTTY) > - exit(1); > + break; And we still need to call close_file_or_dir. > /* errors are handled in the callback */ > ret = 0; > } else { > -- > 2.15.0
Re: [PATCH 0/7] btrfs-progs: mkfs: Reword --rootdir
On Fri, Oct 20, 2017 at 09:59:00AM +0800, Qu Wenruo wrote: > Qu Wenruo (7): > btrfs-progs: mkfs: Don't use custom chunk allocator for rootdir > btrfs-progs: mkfs/rootdir: Use over-reserve method to make size > estimate easier > btrfs-progs: mkfs: Only zero out the first 1M for rootdir > btrfs-progs: mkfs/rootdir: Introduce function to get end position of > last device extent > btrfs-progs: mkfs/rootdir: Shrink fs for rootdir option > btrfs-progs: mkfs: Update allocation info before verbose output > btrfs-progs: mkfs: Separate shrink from rootdir

I've merged independent patches or what applied almost cleanly on current devel. Please refresh and resend patches 1, 2, 5 and 7, thanks.
Re: [PATCH 6/7] btrfs-progs: mkfs: Update allocation info before verbose output
On Fri, Oct 20, 2017 at 09:59:06AM +0800, Qu Wenruo wrote: > Since new --rootdir can allocate chunk, it will modify the chunk > allocation result. > > This patch will update allocation info before verbose output to reflect > such info. > > Signed-off-by: Qu Wenruo

Applied, thanks.
Re: [PATCH 4/7] btrfs-progs: mkfs/rootdir: Introduce function to get end position of last device extent
On Fri, Oct 20, 2017 at 09:59:04AM +0800, Qu Wenruo wrote: > Useful for later 'mkfs.btrfs --rootdir' shrink support. > > Signed-off-by: Qu Wenruo

Applied, thanks.
Re: [PATCH 3/7] btrfs-progs: mkfs: Only zero out the first 1M for rootdir
On Fri, Oct 20, 2017 at 09:59:03AM +0800, Qu Wenruo wrote: > It's a waste of IO to fill the whole image before creating btrfs on it, > just wiping the first 1M, and then write 1 byte to the last position to > create a sparse file. > > Signed-off-by: Qu Wenruo

Applied, thanks.
Re: [PATCH v4 4/4] btrfs-progs: test/mkfs: Test if the minimal device size is valid
On Wed, Nov 01, 2017 at 09:30:43AM +0800, Qu Wenruo wrote: > New test case to test if the minimal device size given by "mkfs.btrfs" > failure case is valid. > > Signed-off-by: Qu Wenruo

Applied, thanks.
Re: splat on 4.15-rc1: invalid ram_bytes for uncompressed inline extent
On Tue, Nov 28, 2017 at 02:16:15AM +0100, Adam Borowski wrote: > On Tue, Nov 28, 2017 at 08:51:07AM +0800, Qu Wenruo wrote: > > On 2017年11月27日 22:22, David Sterba wrote: > > > On Mon, Nov 27, 2017 at 02:23:49PM +0100, Adam Borowski wrote: > > >> On 4.15-rc1, I get the following failure: > > >> > > >> BTRFS critical (device sda1): corrupt leaf: root=1 block=3820662898688 > > >> slot=43 ino=35691 file_offset=0, invalid ram_bytes for uncompressed > > >> inline > > >> extent, have 134 expect 281474976710677 > > > > > > By a quick look at suspiciously large number > > > > > > hex(281474976710677) = 0x1000000000015 > > > > > > may be a bitflip, but 0x15 does not match 134, so there could be > > > something else involved in the corruption. > > > > That's a known bug, fixed by that patch which is not merged yet. > > > > https://patchwork.kernel.org/patch/10047489/ > > This helped, thanks! Patch is going to 4.15-rc2.
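The decomposition behind the bitflip remark can be checked with plain shell arithmetic (a quick sketch, not part of the original thread):

```shell
v=281474976710677        # the on-disk ram_bytes from the splat
printf '0x%x\n' "$v"     # prints 0x1000000000015
echo $(( v >> 48 ))      # the high part is exactly bit 48: prints 1
echo $(( v & 0xffff ))   # low bits: prints 21 (0x15); expected 134 is 0x86
```

So a single flipped bit on top of 134 cannot produce this value, matching the conclusion that something else is involved.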
Re: swapon: swapfile has holes
On 2017-11-28 07:31, Cristian wrote: Hello, Report in: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1723449 dmesg: [ 3.782191] swapon: swapfile has holes

As mentioned in the linked bug report, this is a known issue. To use a swap file on BTRFS, you need to:

1. Create an empty file.
2. Mark it NOCOW with `chattr +C`.
3. Use fallocate to pre-allocate to the appropriate size.
4. Hook it up to a loop device.
5. Use the loop device for swap.

This is because the kernel needs a consistent block mapping for a swap file, since it bypasses the filesystem and accesses the block device directly. However, since BTRFS has copy-on-write semantics, blocks move around on each write, which violates the requirement for a consistent block mapping, so you can't use a file on a BTRFS filesystem directly as a swap file. Of the above steps, you technically don't need to mark the file NOCOW or pre-allocate it, but those are good practice to remove random latency.
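The five steps above can be sketched as a shell session (a hedged example: the mount point and swap size are assumptions, and it must be run as root on the btrfs filesystem in question):

```shell
truncate -s 0 /mnt/btrfs/swapfile      # 1. create an empty file
chattr +C /mnt/btrfs/swapfile          # 2. mark it NOCOW while it is still empty
fallocate -l 4G /mnt/btrfs/swapfile    # 3. pre-allocate the full size
chmod 600 /mnt/btrfs/swapfile
loopdev=$(losetup --find --show /mnt/btrfs/swapfile)  # 4. hook it to a loop device
mkswap "$loopdev"                      # 5. use the loop device for swap
swapon "$loopdev"
```

Note the ordering: the NOCOW attribute only takes effect on an empty file, which is why step 2 has to come before the fallocate.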
Re: [PATCH v4 1/4] btrfs-progs: mkfs: Enhance minimal device size calculation to fix mkfs failure on small file
On Tue, Nov 28, 2017 at 08:40:47AM +0800, Qu Wenruo wrote: > > > On 2017年11月28日 07:21, David Sterba wrote: > > On Wed, Nov 01, 2017 at 09:30:40AM +0800, Qu Wenruo wrote: > >> +u64 btrfs_min_dev_size(u32 nodesize, int mixed, u64 meta_profile, > >> + u64 data_profile) > >> +{ > >> + u64 reserved = 0; > >> + u64 meta_size; > >> + u64 data_size; > >> + > >> + if (mixed) > >> + return 2 * (BTRFS_MKFS_SYSTEM_GROUP_SIZE + > >> + btrfs_min_global_blk_rsv_size(nodesize)); > >> + > >> + /* > >> + * minimal size calculation is complex due to several factors: > >> + * 1) Temporary chunk resue > >> + *If specified chunk profile is SINGLE, we can resue > >> + *temporary chunks, no need to alloc new chunks. > >> + * > >> + * 2) Different minimal chunk size for different profile > >> + *For initial sys chunk, chunk size is fixed to 4M. > >> + *For single profile, minimal chunk size is 8M for all. > >> + *For other profiles, minimal chunk and stripe size > >> + *differs from 8M to 64M. > >> + * > >> + * To calculate it a little easier, here we assume we don't > >> + * reuse any temporary chunk, and calculate the size all > >> + * by ourselves. > >> + * > >> + * Temporary chunks sizes are always fixed: > >> + * One initial sys chunk, one SINGLE meta, and one SINGLE data. > >> + * The latter two are all 8M, accroding to @calc_size of > >> + * btrfs_alloc_chunk(). > >> + */ > >> + reserved += BTRFS_MKFS_SYSTEM_GROUP_SIZE + SZ_8M * 2; > >> + > >> + /* > >> + * For real chunks, we need to refer different sizes: > >> + * For SINGLE, it's still fixed to 8M (@calc_size). > >> + * For other profiles, refer to max(@min_stripe_size, @calc_size). > >> + * > >> + * And use the stripe size to calculate its physical used space. 
> >> + */ > >> + if (meta_profile & BTRFS_BLOCK_GROUP_PROFILE_MASK) > >> + meta_size = SZ_8M + SZ_32M; > >> + else > >> + meta_size = SZ_8M + SZ_8M; > >> + /* Only metadata put 2 stripes into one disk */ > > > > 'mkfs.btrfs -d dup' also works, so I think this should be handled here > > the same way as metadata, right? > > Just a few lines under. > > > >> + if (meta_profile & BTRFS_BLOCK_GROUP_DUP) > >> + meta_size *= 2; > >> + reserved += meta_size; > >> + > >> + if (data_profile & BTRFS_BLOCK_GROUP_PROFILE_MASK) > >> + data_size = SZ_64M; > >> + else > >> + data_size = SZ_8M; > >> + if (data_profile & BTRFS_BLOCK_GROUP_DUP) > >> + data_size *= 2; > Here data DUP is also handled. > > So the problem is about the word "Only". Yeah, it's confusing when comment does not match the code because it's not clear what's the actual intention. > Do I need to re-submit the whole patchset or just a separate patch to > update the comment? No, I'll update it myself as it is fairly trivial change.
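Plugging the constants from the quoted comments into the same arithmetic gives a rough minimum for the default non-mixed SINGLE metadata / SINGLE data case (a sketch: the 4M initial system chunk and the 8M @calc_size are taken from the patch comments; a DUP profile doubles the corresponding term, as discussed above):

```shell
M=$(( 1024 * 1024 ))
sys=$(( 4 * M ))             # initial system chunk (BTRFS_MKFS_SYSTEM_GROUP_SIZE)
temp=$(( 2 * 8 * M ))        # temporary SINGLE metadata + SINGLE data chunks
meta=$(( 8 * M + 8 * M ))    # real metadata chunks, SINGLE profile
data=$(( 8 * M ))            # real data chunk, SINGLE profile
echo "$(( (sys + temp + meta + data) / M ))M"   # prints 44M
```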
swapon: swapfile has holes
Hello, Report in: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1723449 dmesg: [ 3.782191] swapon: swapfile has holes Regards, -- Cristian
[PATCH v2 3/3] btrfs-progs: fi defrag: extend -c to drop nocompress flag on files
Now, files which have the nocompress flag will also be defragged with compression. However, the nocompress flag still exists afterwards and has to be cleared manually. So add an option '--clear-nocompress' to extend -c to drop the nocompress flag after defragmentation.

Suggested-by: David Sterba
Suggested-by: Anand Jain
Signed-off-by: Su Yue

--- Changelog: v2: Remove change about indentation of defrag_callback(). --- cmds-filesystem.c | 65 ++- 1 file changed, 64 insertions(+), 1 deletion(-) diff --git a/cmds-filesystem.c b/cmds-filesystem.c index 3931333f76c6..84242814798e 100644 --- a/cmds-filesystem.c +++ b/cmds-filesystem.c @@ -27,6 +27,7 @@ #include #include #include +#include #include "kerncompat.h" #include "ctree.h" @@ -867,6 +868,8 @@ static const char * const cmd_filesystem_defrag_usage[] = { "-l len defragment only up to len bytes", "-t size target extent size hint (default: 32M)", "", + "--compress-forceclear nocompress flag on files after defragment, only work with option -c", + "", "Warning: most Linux kernels will break up the ref-links of COW data", "(e.g., files copied with 'cp --reflink', snapshots) which may cause", "considerable increase of space usage. 
See btrfs-filesystem(8) for", @@ -874,9 +877,39 @@ static const char * const cmd_filesystem_defrag_usage[] = { NULL }; +static int clear_nocompress_flag(int fd) +{ + unsigned int flags; + int ret = 0; + + ret = ioctl(fd, FS_IOC_GETFLAGS, ); + if (ret < 0) { + ret = -errno; + error("failed to get flags: %s", strerror(-ret)); + goto out; + } + + if (!(flags & FS_NOCOMP_FL)) { + ret = 0; + goto out; + } + flags &= ~FS_NOCOMP_FL; + ret = ioctl(fd, FS_IOC_SETFLAGS, ); + if (ret < 0) { + ret = -errno; + error("failed to set flags: %s", strerror(-ret)); + goto out; + } + + ret = 0; +out: + return ret; +} + static struct btrfs_ioctl_defrag_range_args defrag_global_range; static int defrag_global_verbose; static int defrag_global_errors; +static int defrag_global_clear_nocompress; static int defrag_callback(const char *fpath, const struct stat *sb, int typeflag, struct FTW *ftwbuf) { @@ -904,6 +937,14 @@ static int defrag_callback(const char *fpath, const struct stat *sb, err = errno; goto error; } + + if (defrag_global_clear_nocompress) { + ret = clear_nocompress_flag(fd); + if (ret) { + err = -ret; + goto error; + } + } } return 0; @@ -926,6 +967,12 @@ static int cmd_filesystem_defrag(int argc, char **argv) int compress_type = BTRFS_COMPRESS_NONE; DIR *dirstream; + enum { GETOPT_VAL_CLEAR_NOCOMPRESS = 257}; + static const struct option long_options[] = { + { "clear-nocompress", no_argument, NULL, + GETOPT_VAL_CLEAR_NOCOMPRESS}, + { NULL, 0, NULL, 0} + }; /* * Kernel has a different default (256K) that is supposed to be safe, * but it does not defragment very well. 
The 32M will likely lead to @@ -937,8 +984,10 @@ static int cmd_filesystem_defrag(int argc, char **argv) defrag_global_errors = 0; defrag_global_verbose = 0; defrag_global_errors = 0; + defrag_global_clear_nocompress = 0; while(1) { - int c = getopt(argc, argv, "vrc::fs:l:t:"); + int c = getopt_long(argc, argv, "vrc::fs:l:t:", long_options, + NULL); if (c < 0) break; @@ -972,6 +1021,9 @@ static int cmd_filesystem_defrag(int argc, char **argv) case 'r': recursive = 1; break; + case GETOPT_VAL_CLEAR_NOCOMPRESS: + defrag_global_clear_nocompress = 1; + break; default: usage(cmd_filesystem_defrag_usage); } @@ -987,6 +1039,8 @@ static int cmd_filesystem_defrag(int argc, char **argv) if (compress_type) { defrag_global_range.flags |= BTRFS_DEFRAG_RANGE_COMPRESS; defrag_global_range.compress_type = compress_type; + } else if (defrag_global_clear_nocompress) { + warning("Option --clear-nocompress only works for -c"); } if (flush) defrag_global_range.flags |= BTRFS_DEFRAG_RANGE_START_IO; @@ -1071,6 +1125,15 @@ static int
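Assuming the option ends up named --clear-nocompress as suggested in the review, a session might look like this (a hedged sketch: the path is hypothetical; the chattr attribute 'm' corresponds to FS_NOCOMP_FL, and the option only takes effect together with -c):

```shell
chattr +m /mnt/btrfs/file     # set the nocompress flag on a file
lsattr /mnt/btrfs/file        # 'm' is shown as set
btrfs filesystem defragment -c --clear-nocompress /mnt/btrfs/file
lsattr /mnt/btrfs/file        # 'm' is now cleared along with the defrag
```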
[PATCH v2 2/3] btrfs-progs: fi defrag: do not exit if defrag range ioctl is unsupported
If the defrag range ioctl is unsupported, defrag will exit immediately. Since the caller can handle the error, let cmd_filesystem_defrag() break the loop and return an error instead of calling exit(1).

Suggested-by: David Sterba
Signed-off-by: Su Yue

--- Changelog: v2: Separate the patch from commit 6e991b9161fa ("btrfs-progs: fi defrag: clean up duplicate code if find errors"). --- cmds-filesystem.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/cmds-filesystem.c b/cmds-filesystem.c index 17d399d58adf..3931333f76c6 100644 --- a/cmds-filesystem.c +++ b/cmds-filesystem.c @@ -1050,7 +1050,7 @@ static int cmd_filesystem_defrag(int argc, char **argv) ret = nftw(argv[i], defrag_callback, 10, FTW_MOUNT | FTW_PHYS); if (ret == ENOTTY) - exit(1); + break; /* errors are handled in the callback */ ret = 0; } else { -- 2.15.0
[PATCH v2 1/3] btrfs-progs: fi defrag: clean up duplicate code if find errors
In function cmd_filesystem_defrag(), lines of code for error handling are duplicate and hard to expand in further. Create a jump label for errors. Signed-off-by: Su Yue--- Changelog: v2: Record -errno to return value if open() and fstat() failed. Move change "do no exit if defrag range ioctl is unsupported" to another patch. Suggested by David. --- cmds-filesystem.c | 44 1 file changed, 24 insertions(+), 20 deletions(-) diff --git a/cmds-filesystem.c b/cmds-filesystem.c index 7728430f16a1..17d399d58adf 100644 --- a/cmds-filesystem.c +++ b/cmds-filesystem.c @@ -1029,23 +1029,22 @@ static int cmd_filesystem_defrag(int argc, char **argv) if (fd < 0) { error("cannot open %s: %s", argv[i], strerror(errno)); - defrag_global_errors++; - close_file_or_dir(fd, dirstream); - continue; + ret = -errno; + goto next; } - if (fstat(fd, )) { + + ret = fstat(fd, ); + if (ret) { error("failed to stat %s: %s", argv[i], strerror(errno)); - defrag_global_errors++; - close_file_or_dir(fd, dirstream); - continue; + ret = -errno; + goto next; } if (!(S_ISDIR(st.st_mode) || S_ISREG(st.st_mode))) { error("%s is not a directory or a regular file", argv[i]); - defrag_global_errors++; - close_file_or_dir(fd, dirstream); - continue; + ret = -EINVAL; + goto next; } if (recursive && S_ISDIR(st.st_mode)) { ret = nftw(argv[i], defrag_callback, 10, @@ -1060,20 +1059,25 @@ static int cmd_filesystem_defrag(int argc, char **argv) ret = ioctl(fd, BTRFS_IOC_DEFRAG_RANGE, _global_range); defrag_err = errno; - } - close_file_or_dir(fd, dirstream); - if (ret && defrag_err == ENOTTY) { - error( + if (ret && defrag_err == ENOTTY) { + error( "defrag range ioctl not supported in this kernel version, 2.6.33 and newer is required"); - defrag_global_errors++; - break; + defrag_global_errors++; + close_file_or_dir(fd, dirstream); + break; + } + if (ret) { + error("defrag failed on %s: %s", argv[i], + strerror(defrag_err)); + goto next; + } } - if (ret) { - error("defrag failed on %s: %s", argv[i], - 
strerror(defrag_err)); +next: + if (ret) defrag_global_errors++; - } + close_file_or_dir(fd, dirstream); } + if (defrag_global_errors) fprintf(stderr, "total %d failures\n", defrag_global_errors); -- 2.15.0
Re: [PATCH 3/3] btrfs-progs: check: exclude extents of tree blocks and extent items
On 11/28/2017 04:25 PM, Qu Wenruo wrote: On 2017年11月28日 15:47, Su Yue wrote: On 11/28/2017 12:05 PM, Qu Wenruo wrote: On 2017年11月28日 10:38, Su Yue wrote: Thanks for review. On 11/27/2017 06:37 PM, Qu Wenruo wrote: On 2017年11月27日 11:13, Su Yue wrote: Commit d17d6663c99c ("btrfs-progs: lowmem check: Fix regression which screws up extent allocator") removes pin_metadata_blocks() from lowmem repair. So we have to find another way to exclude extents which should be occupied by tree blocks. First thing first, any tree block which is referred by will not be freed. Unless some special case, like extent tree initialization which clears the whole extent tree so we can't check if one tree block should be freed using extent tree, there is no need to explicitly pin existing tree blocks. Their creation/freeing is already handled well. If I understand the logic of extents correctly: The extents in free space cache are handled in the way like cache_block_group() in extent-tree.c. It traverses all extents items then marks all holes free in free space cache. Yes, in a normal situation, extents are already handled well. But original and lowmem check try to repair corrupted extent tree. Suppose a situation: 1) Let an extent item of one tree block is corrupted or missed. 2) The correct extent in free space cache will be marked as free. 3) Next CoW may allocate the "used" extents from free space and insert an extent item. 4) Lowmem repair increases refs of the extent item and causes a wrong extent item. OR 3) Lowmem repair inserts the extent item firstly. 4) CoW may allocate the "used" extents from free space. BUG_ON failure of inserting the extent item. OK, this explains things, but still not optimal. Such behavior should not happen if nothing is exposed. Pinning down all tree blocks is never a light operation and should be avoided if possible. Agreed. It would be nice to do it when starting a new transaction to modify the fs. 
Now, there is only one transaction while checking chunks and extents because of previous wrong call of pin_metadata_blocks(). It's meaningless to put it at start of new transaction now. Because you start that transaction under all condition. Normally, transaction should only be started if you're sure you will modify the filesystem. Start them unconditionally is not a preferred behavior. I will split the transaction into many minors. Then lowmem repair will speed up if no error in chunks and extents. But it is just a little optimization. Compared to iterating all tree blocks, it's a big optimization. Just think of filesystem with several TB metadata. Please explain the reason why you need to pin extents first. What I do in the patch is like origin mode's. Pining extents first ensures every extents(corrupted and uncorrupted) will not be rellocated. Only extent tree reinit will pin down all metadata block in original mode while normal repair won't. So you're still doing something which is not done in original mode, which either needs to be explained in detail or fix original mode first. You are right. I did miss details of how original mode frees extent records. After come thinking, pining extents is still necessary in lowmem repair. Original mode has extent cache to record all extents and free normal extents while traversing tree blocks. After traversal, only corrupted extent records exist in the cache. At this moment, it pins them and start transactions to repair. Lowmem mode does not store anything like extent records. It begins transactions(in further patch) in traversal. We can not guarantee extents allocated from free space is occupied by one tree block which has not been traversed. Although pining extents is heavy and painfully work, it seems inevitable. Otherwise, I have no better idea than pining all extents. Firstly, only missing backref in extent tree will cause problem. 
Even if we have a corrupted backref or an extra backref which doesn't have any tree block/data referring to it, it won't cause a big problem. So for lowmem mode, we can do it in a different way, just like seed device: we allocate new meta chunks, and only use new meta chunks for our tree operations. This will avoid any extra tree iteration, and is easier to implement AFAIK. (Just mark all existing block groups full, to force chunk allocation) Seems feasible! I will try to implement it. Thanks, Su Thanks, Qu Thanks, Su Thanks, Su Thanks, Qu Modify pin_down_tree_blocks() only for code reuse. So the behavior of pin_metadata_blocks(), which works with option 'init-extent-tree', is not influenced. Introduce exclude_blocks_and_extent_items() to mark extents of all tree blocks dirty in fs_info->excluded_extents. It also calls exclude_extent_items() to exclude extent items in the extent tree. Signed-off-by: Su Yue --- cmds-check.c | 122 ++- 1 file changed, 112
Re: How about adding an ioctl to convert a directory to a subvolume?
On 2017年11月28日 16:29, Lu Fengqi wrote: > On Mon, Nov 27, 2017 at 06:13:10PM +0800, Qu Wenruo wrote: >> >> >> On 2017年11月27日 17:41, Lu Fengqi wrote: >>> Hi all, >>> >>> As we all know, under certain circumstances, it is more appropriate to >>> create some subvolumes rather than keep everything in the same >>> subvolume. As the condition of demand change, the user may need to >>> convert a previous directory to a subvolume. For this reason,how about >>> adding an ioctl to convert a directory to a subvolume? >> >> The idea seems interesting. >> >> However in my opinion, this can be done quite easily in (mostly) user >> space, thanks to btrfs support of relink. >> >> The method from Hugo or Chris is quite good, maybe it can be enhanced a >> little. >> >> Use the following layout as an example: >> >> root_subv >> |- subvolume_1 >> | |- dir_1 >> | | |- file_1 >> | | |- file_2 >> | |- dir_2 >> | |- file_3 >> |- subvolume_2 >> >> If we want to convert dir_1 into subvolume, we can do it like: >> >> 1) Create a temporary readonly snapshot of parent subvolume containing >> the desired dir >> # btrfs sub snapshot -r root_subv/subvolume_1 \ >> root_subv/tmp_snapshot_1 >> >> 2) Create a new subvolume, as destination. >> # btrfs sub create root_subv/tmp_dest/ >> >> 3) Copy the content and sync the fs >> Use of reflink is necessary. >> # cp -r --reflink=always root_subv/tmp_snapshot_1/dir_1 \ >> root_subv/tmp_dest >> # btrfs sync root_subv/tmp_dest >> >> 4) Delete temporary readonly snapshot >> # btrfs subvolume delete root_subv/tmp_snapshot_1 >> >> 5) Remove the source dir >> # rm -rf root_subv/subvolume_1/dir_1 >> >> 5) Create a final destination snapshot of "root_subv/temporary_dest" >> # btrfs subvolume snapshot root_subv/tmp_dest \ >> root_subv/subvolume_1/dir_1 >> >> 6) Remove the temporary destination >> # btrfs subvolume delete root_subv/tmp_dest >> >> >> The main challenge is in step 3). >> In fact above method can only handle normal dir/files. 
>> If there is another subvolume inside the desired dir, current "cp -r" is >> a bad idea. >> We need to skip subvolume dir, and create snapshot for it. >> >> But it's quite easy to write a user space program to handle it. >> Maybe using "find" command can already handle it well. >> >> Anyway, doing it in user space is already possible and much easier than >> doing it in kernel. >> >>> >>> Users can convert by the scripts mentioned in this >>> thread(https://www.spinics.net/lists/linux-btrfs/msg33252.html), but is >>> it easier to use the off-the-shelf btrfs subcommand? >> >> If you just want to integrate the functionality into btrfs-progs, maybe >> it's possible. >> >> But if you insist in providing a new ioctl for this, I highly doubt if >> the extra hassle is worthy. >> > > Thanks for getting back to me. > > The enhanced approach you provide is pretty good, and I agree that it > meets the needs of some users. > >>> >>> After an initial consideration, our implementation is broadly divided >>> into the following steps: >>> 1. Freeze the filesystem or set the subvolume above the source directory >>> to read-only; >> >> Not really need to freeze the whole fs. >> Just create a readonly snapshot of the parent subvolume which contains >> the dir. >> That's how snapshot is designed for. >> > > I still worry about the following problem. Although tmp_snapshot_1 is > read-only, the source directory dir_1 is still writable. Also, step 5) > will delete dir_1, which may result in the loss of newly written data > during the conversion. Just a problem of timing. Output message after or before creation of read-only snapshot to info user will be good enough. Especially considering btrfs snapshot creation won't flush data, which already means some buffered data will not occur in snapshot. 
Thanks, Qu > > I can not think of any method else for now, except I set the subvolume_1 > read-only(obviously this is a bad idea). Could you provide some > suggestions? > >>> 2. Perform a pre-check, for example, check if a cross-device link >>> creation during the conversion; >> >> This can be done in-the-fly. >> As the check is so easy (only needs to check if the inode number is 256). >> We only need a mid-order iteration of the source dir (in temporary >> snapshot), and for normal file, use reflink. >> For subvolume dir, create a snapshot for it. >> >> And for such iteration, a python script less than 100 lines would be >> sufficient. > > Just to clarify, the check is to confirm if there is a hard link across > dir_1.> The dir_1 can't be converted to subvolume if there is such a > cross-device link that can not be created. > >> >> Thanks, >> Qu >> >>> 3. Perform conversion, such as creating a new subvolume and moving the >>> contents of the source directory; >>> 4. Thaw the filesystem or restore the subvolume writable
Re: How about adding an ioctl to convert a directory to a subvolume?
On Mon, Nov 27, 2017 at 06:13:10PM +0800, Qu Wenruo wrote: > > >On 2017年11月27日 17:41, Lu Fengqi wrote: >> Hi all, >> >> As we all know, under certain circumstances, it is more appropriate to >> create some subvolumes rather than keep everything in the same >> subvolume. As the condition of demand change, the user may need to >> convert a previous directory to a subvolume. For this reason,how about >> adding an ioctl to convert a directory to a subvolume? > >The idea seems interesting. > >However in my opinion, this can be done quite easily in (mostly) user >space, thanks to btrfs support of relink. > >The method from Hugo or Chris is quite good, maybe it can be enhanced a >little. > >Use the following layout as an example: > >root_subv >|- subvolume_1 >| |- dir_1 >| | |- file_1 >| | |- file_2 >| |- dir_2 >| |- file_3 >|- subvolume_2 > >If we want to convert dir_1 into subvolume, we can do it like: > >1) Create a temporary readonly snapshot of parent subvolume containing > the desired dir > # btrfs sub snapshot -r root_subv/subvolume_1 \ > root_subv/tmp_snapshot_1 > >2) Create a new subvolume, as destination. > # btrfs sub create root_subv/tmp_dest/ > >3) Copy the content and sync the fs > Use of reflink is necessary. > # cp -r --reflink=always root_subv/tmp_snapshot_1/dir_1 \ > root_subv/tmp_dest > # btrfs sync root_subv/tmp_dest > >4) Delete temporary readonly snapshot > # btrfs subvolume delete root_subv/tmp_snapshot_1 > >5) Remove the source dir > # rm -rf root_subv/subvolume_1/dir_1 > >5) Create a final destination snapshot of "root_subv/temporary_dest" > # btrfs subvolume snapshot root_subv/tmp_dest \ > root_subv/subvolume_1/dir_1 > >6) Remove the temporary destination > # btrfs subvolume delete root_subv/tmp_dest > > >The main challenge is in step 3). >In fact above method can only handle normal dir/files. >If there is another subvolume inside the desired dir, current "cp -r" is >a bad idea. >We need to skip subvolume dir, and create snapshot for it. 
>
>But it's quite easy to write a user space program to handle it.
>Maybe the "find" command can already handle it well.
>
>Anyway, doing it in user space is already possible and much easier than
>doing it in the kernel.
>
>>
>> Users can convert with the scripts mentioned in this
>> thread (https://www.spinics.net/lists/linux-btrfs/msg33252.html), but
>> wouldn't an off-the-shelf btrfs subcommand be easier to use?
>
>If you just want to integrate the functionality into btrfs-progs, maybe
>it's possible.
>
>But if you insist on providing a new ioctl for this, I highly doubt
>the extra hassle is worth it.
>

Thanks for getting back to me. The enhanced approach you provide is
pretty good, and I agree that it meets the needs of some users.

>>
>> After an initial consideration, our implementation is broadly divided
>> into the following steps:
>> 1. Freeze the filesystem or set the subvolume above the source
>>    directory to read-only;
>
>No real need to freeze the whole fs.
>Just create a read-only snapshot of the parent subvolume which contains
>the dir.
>That's what snapshots are designed for.
>

I still worry about the following problem. Although tmp_snapshot_1 is
read-only, the source directory dir_1 is still writable. Also, step 5)
will delete dir_1, which may result in the loss of newly written data
during the conversion. I cannot think of any other method for now,
except setting subvolume_1 read-only (obviously a bad idea). Could you
provide some suggestions?

>> 2. Perform a pre-check, for example, check whether a cross-device
>>    link would be created during the conversion;
>
>This can be done on the fly.
>The check itself is easy (it only needs to check whether the inode
>number is 256).
>We only need one ordered traversal of the source dir (in the temporary
>snapshot): for a normal file, use reflink;
>for a subvolume dir, create a snapshot of it.
>
>And for such an iteration, a python script of less than 100 lines would
>be sufficient.
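Qu's suggested iteration can be sketched as below. This is a hypothetical minimal version (the function name is mine): it shells out to `cp --reflink` and `btrfs subvolume snapshot`, uses the inode-256 test mentioned above to detect nested subvolume roots, and takes an injectable runner so the traversal logic can be exercised without a btrfs filesystem.

```python
import os
import subprocess

def copy_subvol_aware(src, dst, run=subprocess.check_call):
    """Recreate `src` under `dst`: plain directories are re-created,
    regular files are reflinked, and nested subvolumes (a btrfs
    subvolume root always has inode number 256) become snapshots."""
    os.makedirs(dst, exist_ok=True)
    for entry in sorted(os.scandir(src), key=lambda e: e.name):
        target = os.path.join(dst, entry.name)
        if entry.is_dir(follow_symlinks=False):
            if entry.stat(follow_symlinks=False).st_ino == 256:
                # nested subvolume: snapshot it instead of copying
                run(["btrfs", "subvolume", "snapshot",
                     entry.path, target])
            else:
                copy_subvol_aware(entry.path, target, run)
        else:
            # regular file (or symlink): reflink/copy it
            run(["cp", "-a", "--reflink=always", entry.path, target])
```

`cp -a` rather than `cp -r` is my choice here, to preserve ownership and timestamps of individual files; on a non-btrfs source the `--reflink=always` copy would simply fail, which is the safe direction.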
Just to clarify, the check is to confirm whether any hard link crosses
the dir_1 boundary. dir_1 can't be converted to a subvolume if such a
link exists, because the link could not be re-created across the new
subvolume boundary (it would be a cross-device link).

>
>Thanks,
>Qu
>
>> 3. Perform the conversion, such as creating a new subvolume and
>>    moving the contents of the source directory;
>> 4. Thaw the filesystem or restore the subvolume's writable property.
>>
>> In fact, I am not so sure whether this use of freeze is appropriate,
>> because the source directory the user needs to convert may be located
>> at / or /home, and the pre-check and conversion process may take a
>> long time, which can leave some shells and graphical applications
>> suspended.
>>
>> Please give your comments if any.
>>

--
Thanks,
Lu
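The pre-check Lu describes can be approximated from user space by comparing each inode's link count with the number of links actually found under the directory: any inode with fewer links inside than `st_nlink` also has a link outside, so the conversion would need a cross-boundary link. A pure-stdlib sketch (the function name is mine):

```python
import os
from collections import Counter

def external_hardlinks(dir_path):
    """Return (st_dev, st_ino) keys of files under `dir_path` that also
    have hard links outside it, i.e. files that would block converting
    `dir_path` into a subvolume."""
    inside = Counter()   # links to each inode seen under dir_path
    nlink = {}           # total link count of each such inode
    for root, _dirs, files in os.walk(dir_path):
        for name in files:
            st = os.lstat(os.path.join(root, name))
            if st.st_nlink > 1:
                key = (st.st_dev, st.st_ino)
                inside[key] += 1
                nlink[key] = st.st_nlink
    # fewer links found inside than the inode has in total
    # => at least one link lives outside dir_path
    return [key for key in inside if inside[key] < nlink[key]]
```

Note this is inherently racy against concurrent writers, which is exactly the read-only-source problem discussed above.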
Re: [PATCH 3/3] btrfs-progs: check: exclude extents of tree blocks and extent items
On 2017年11月28日 15:47, Su Yue wrote:
>
>
> On 11/28/2017 12:05 PM, Qu Wenruo wrote:
>>
>>
>> On 2017年11月28日 10:38, Su Yue wrote:
>>> Thanks for the review.
>>>
>>> On 11/27/2017 06:37 PM, Qu Wenruo wrote:
>>>> On 2017年11月27日 11:13, Su Yue wrote:
>>>>> Commit d17d6663c99c ("btrfs-progs: lowmem check: Fix regression
>>>>> which screws up extent allocator") removes pin_metadata_blocks()
>>>>> from lowmem repair.
>>>>> So we have to find another way to exclude extents which should be
>>>>> occupied by tree blocks.
>>>>
>>>> First things first: any tree block which is still referred to will
>>>> not be freed.
>>>>
>>>> Except in some special cases, like extent tree initialization, which
>>>> clears the whole extent tree so we can't use it to check whether a
>>>> tree block should be freed, there is no need to explicitly pin
>>>> existing tree blocks.
>>>> Their creation/freeing is already handled well.
>>>
>>> If I understand the logic of extents correctly:
>>>
>>> The extents in the free space cache are handled the same way as in
>>> cache_block_group() in extent-tree.c:
>>> it traverses all extent items, then marks all holes as free in the
>>> free space cache.
>>>
>>> Yes, in a normal situation, extents are already handled well.
>>> But both original and lowmem check try to repair a corrupted extent
>>> tree.
>>>
>>> Suppose this situation:
>>> 1) An extent item of one tree block is corrupted or missing.
>>> 2) The corresponding extent in the free space cache will be marked
>>>    as free.
>>> 3) The next CoW may allocate the "used" extent from free space
>>>    and insert an extent item.
>>> 4) Lowmem repair increases the refs of the extent item and
>>>    produces a wrong extent item.
>>> OR
>>> 3) Lowmem repair inserts the extent item first.
>>> 4) CoW may allocate the "used" extent from free space.
>>>    BUG_ON failure when inserting the extent item.
>>
>> OK, this explains things, but it's still not optimal.
>>
>> Such behavior should not happen if nothing is exposed.
>> Pinning down all tree blocks is never a light operation and should be
>> avoided if possible.
>>
>
> Agreed.
>
>> It would be nice to do it when starting a new transaction to modify
>> the fs.
>>
>
> Now there is only one transaction while checking chunks and extents,
> because of the previous wrong call of pin_metadata_blocks().
> It's meaningless to put it at the start of a new transaction now.

Because you start that transaction under all conditions.

Normally, a transaction should only be started if you're sure you will
modify the filesystem.
Starting one unconditionally is not a preferred behavior.

>
> I will split the transaction into many minor ones. Then lowmem repair
> will speed up if there is no error in chunks and extents. But it is
> just a little optimization.

Compared to iterating all tree blocks, it's a big optimization.
Just think of a filesystem with several TB of metadata.

>
>>>> Please explain the reason why you need to pin extents first.
>>>
>>> What I do in the patch is like original mode's.
>>> Pinning extents first ensures every extent (corrupted and
>>> uncorrupted) will not be reallocated.
>>
>> Only extent tree reinit pins down all metadata blocks in original
>> mode; normal repair doesn't.
>>
>> So you're still doing something which is not done in original mode,
>> which either needs to be explained in detail, or original mode should
>> be fixed first.
>>
>
> You are right. I did miss the details of how original mode frees
> extent records.
>
> After some thinking, pinning extents is still necessary in lowmem
> repair.
>
> Original mode has an extent cache to record all extents, and it frees
> normal extent records while traversing tree blocks.
> After the traversal, only corrupted extent records remain in the
> cache.
> At that moment, it pins them and starts transactions to repair.
>
> Lowmem mode does not store anything like extent records. It begins
> transactions (in a further patch) during the traversal. We cannot
> guarantee that an extent allocated from free space is not occupied by
> a tree block which has not been traversed yet.
>
> Although pinning extents is heavy and painful work, it seems
> inevitable. Otherwise, I have no better idea than pinning all
> extents.

First, only a missing backref in the extent tree will cause problems.
Even if we have a corrupted backref, or an extra backref with no tree
block/data referring to it, it won't cause a big problem.

So for lowmem mode we can do it in a different way, just like a seed
device: allocate new metadata chunks, and use only the new metadata
chunks for our tree operations.

This will avoid any extra tree iteration, and it is easier to implement
AFAIK.
(Just mark all existing block groups full, to force chunk allocation.)

Thanks,
Qu

>
> Thanks,
> Su
>
>>>
>>> Thanks,
>>> Su
>>>
>>>> Thanks,
>>>> Qu
>>>>
>>>>> Modify pin_down_tree_blocks() only for code reuse.
>>>>> So the behavior of pin_metadata_blocks(), which works with option
>>>>> 'init-extent-tree', is not influenced.
>>>>>
>>>>> Introduce exclude_blocks_and_extent_items() to mark extents of all
>>>>> tree blocks