FAQ / encryption / error handling?
Hi all, The FAQ has a couple of sections on encryption (general and dm-crypt) One thing that isn't explained there: if you create multiple encrypted volumes (e.g. using dm-crypt) and use Btrfs to combine them into RAID1, how does error recovery work when a read operation returns corrupted data? Without encryption, reading from one disk would give a checksum mismatch and Btrfs would read from the other disk to (hopefully) get a good copy of the data. With this encryption scenario, the failure would potentially be detected in the decryption layer code and instead of returning bad data to Btrfs, it would return some error code. In that case, will Btrfs attempt to read from the other volume and allow the application to proceed as if nothing was wrong? Regards, Daniel -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
How about adding an ioctl to convert a directory to a subvolume?
Hi all, As we all know, under certain circumstances, it is more appropriate to create some subvolumes rather than keep everything in the same subvolume. As the condition of demand change, the user may need to convert a previous directory to a subvolume. For this reason,how about adding an ioctl to convert a directory to a subvolume? Users can convert by the scripts mentioned in this thread(https://www.spinics.net/lists/linux-btrfs/msg33252.html), but is it easier to use the off-the-shelf btrfs subcommand? After an initial consideration, our implementation is broadly divided into the following steps: 1. Freeze the filesystem or set the subvolume above the source directory to read-only; 2. Perform a pre-check, for example, check if a cross-device link creation during the conversion; 3. Perform conversion, such as creating a new subvolume and moving the contents of the source directory; 4. Thaw the filesystem or restore the subvolume writable property. In fact, I am not so sure whether this use of freeze is appropriate because the source directory the user needs to convert may be located at / or /home and this pre-check and conversion process may take a long time, which can lead to some shell and graphical application suspended. Please give your comments if any. -- Thanks, Lu -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: FAQ / encryption / error handling?
On Mon, 27 Nov 2017 09:06:12 +0100 Daniel Pocock wrote: > Hi all, > > The FAQ has a couple of sections on encryption (general and dm-crypt) > > One thing that isn't explained there: if you create multiple encrypted > volumes (e.g. using dm-crypt) and use Btrfs to combine them into > RAID1, how does error recovery work when a read operation returns > corrupted data? > > Without encryption, reading from one disk would give a checksum > mismatch and Btrfs would read from the other disk to (hopefully) get > a good copy of the data. > > With this encryption scenario, the failure would potentially be > detected in the decryption layer code and instead of returning bad > data to Btrfs, it would return some error code. In that case, will > Btrfs attempt to read from the other volume and allow the application > to proceed as if nothing was wrong? > > Regards, > > Daniel Default (aes-xts-plain64) dm-crypt setup can't verify integrity of encrypted block and in case of silent corruption will decrypt it to garbage which btrfs will catch. In case of AEAD encryption (dm-crypt plus dm-integrity) it can verify integrity itself but I'm not sure right now which exact error it returns to upper layer as I didn't used it yet. I use btrfs raid1 on top of LVM on top of dm-crypt devices and it handled bad blocks on physical devices normally (there was a burst of about 900 reallocates on one device which btrfs caught and fixed). -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: How about adding an ioctl to convert a directory to a subvolume?
On 2017年11月27日 17:41, Lu Fengqi wrote: > Hi all, > > As we all know, under certain circumstances, it is more appropriate to > create some subvolumes rather than keep everything in the same > subvolume. As the condition of demand change, the user may need to > convert a previous directory to a subvolume. For this reason,how about > adding an ioctl to convert a directory to a subvolume? The idea seems interesting. However in my opinion, this can be done quite easily in (mostly) user space, thanks to btrfs support of relink. The method from Hugo or Chris is quite good, maybe it can be enhanced a little. Use the following layout as an example: root_subv |- subvolume_1 | |- dir_1 | | |- file_1 | | |- file_2 | |- dir_2 | |- file_3 |- subvolume_2 If we want to convert dir_1 into subvolume, we can do it like: 1) Create a temporary readonly snapshot of parent subvolume containing the desired dir # btrfs sub snapshot -r root_subv/subvolume_1 \ root_subv/tmp_snapshot_1 2) Create a new subvolume, as destination. # btrfs sub create root_subv/tmp_dest/ 3) Copy the content and sync the fs Use of reflink is necessary. # cp -r --reflink=always root_subv/tmp_snapshot_1/dir_1 \ root_subv/tmp_dest # btrfs sync root_subv/tmp_dest 4) Delete temporary readonly snapshot # btrfs subvolume delete root_subv/tmp_snapshot_1 5) Remove the source dir # rm -rf root_subv/subvolume_1/dir_1 5) Create a final destination snapshot of "root_subv/temporary_dest" # btrfs subvolume snapshot root_subv/tmp_dest \ root_subv/subvolume_1/dir_1 6) Remove the temporary destination # btrfs subvolume delete root_subv/tmp_dest The main challenge is in step 3). In fact above method can only handle normal dir/files. If there is another subvolume inside the desired dir, current "cp -r" is a bad idea. We need to skip subvolume dir, and create snapshot for it. But it's quite easy to write a user space program to handle it. Maybe using "find" command can already handle it well. Anyway, doing it in user space is already possible and much easier than doing it in kernel. > > Users can convert by the scripts mentioned in this > thread(https://www.spinics.net/lists/linux-btrfs/msg33252.html), but is > it easier to use the off-the-shelf btrfs subcommand? If you just want to integrate the functionality into btrfs-progs, maybe it's possible. But if you insist in providing a new ioctl for this, I highly doubt if the extra hassle is worthy. > > After an initial consideration, our implementation is broadly divided > into the following steps: > 1. Freeze the filesystem or set the subvolume above the source directory > to read-only; Not really need to freeze the whole fs. Just create a readonly snapshot of the parent subvolume which contains the dir. That's how snapshot is designed for. > 2. Perform a pre-check, for example, check if a cross-device link > creation during the conversion; This can be done in-the-fly. As the check is so easy (only needs to check if the inode number is 256). We only need a mid-order iteration of the source dir (in temporary snapshot), and for normal file, use reflink. For subvolume dir, create a snapshot for it. And for such iteration, a python script less than 100 lines would be sufficient. Thanks, Qu > 3. Perform conversion, such as creating a new subvolume and moving the > contents of the source directory; > 4. Thaw the filesystem or restore the subvolume writable property. > > In fact, I am not so sure whether this use of freeze is appropriate > because the source directory the user needs to convert may be located > at / or /home and this pre-check and conversion process may take a long > time, which can lead to some shell and graphical application suspended. > > Please give your comments if any. > signature.asc Description: OpenPGP digital signature
Re: [PATCH 1/3] btrfs-progs: check: release path in repair_extent_data_item()
On 2017年11月27日 11:13, Su Yue wrote: > In repair_extent_data_item(), path is not be released if some > errors occurs which causes extent buffer leak. > > So release path in end of the function. > > Signed-off-by: Su Yue > --- > cmds-check.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/cmds-check.c b/cmds-check.c > index e746ee7b281d..3a72f8e3877d 100644 > --- a/cmds-check.c > +++ b/cmds-check.c > @@ -11974,7 +11974,6 @@ static int repair_extent_data_item(struct > btrfs_trans_handle *trans, > btrfs_mark_buffer_dirty(eb); > ret = btrfs_update_block_group(trans, extent_root, disk_bytenr, > num_bytes, 1, 0); > - btrfs_release_path(&path); Here btrfs_release_path() should not be removed. Lines below this, btrfs_inc_extent_ref() is called. In which we will alloc a new path and modify extent tree. Thanks to the fact that there is no tree locking implemented in btrfs-progs, it won't cause dead lock, but it's a good habit to make path allocation deadlock free. So please keep the path release here. (And if you find some code holding conflicting path on the same tree, feel free to fix them, even it won't cause real problem in btrfs right now) > } > > if (nrefs->full_backref[0]) > @@ -11998,6 +11997,7 @@ static int repair_extent_data_item(struct > btrfs_trans_handle *trans, > > err &= ~BACKREF_MISSING; > out: > + btrfs_release_path(&path); On the other hand, an empty path can be released as many times as you wish. So double releasing the path in out branch is not a problem. Thanks, Q > if (ret) > error("can't repair root %llu extent data item[%llu %llu]", > root->objectid, disk_bytenr, num_bytes); > signature.asc Description: OpenPGP digital signature
Re: [PATCH 2/3] btrfs-progs: check: record returned errors after walk_down_tree_v2()
On 2017年11月27日 11:13, Su Yue wrote: > In lowmem mode with '--repair', check_chunks_and_extents_v2() > will fix accounting in block groups and clear the error > bit BG_ACCOUNTING_ERROR. > However, return value of check_btrfs_root() is 0 either 1 instead of > error bits. > > If extent tree is on error, lowmem repair always prints error and > returns nonzero value even the filesystem is fine after repair. > > So let @err contains bits after walk_down_tree_v2(). > > Signed-off-by: Su Yue > --- > cmds-check.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/cmds-check.c b/cmds-check.c > index 3a72f8e3877d..c7b570bef9c3 100644 > --- a/cmds-check.c > +++ b/cmds-check.c > @@ -6607,7 +6607,7 @@ static int check_btrfs_root(struct btrfs_trans_handle > *trans, > ret = walk_down_tree_v2(trans, root, &path, &level, &nrefs, > ext_ref, check_all); > > - err |= !!ret; > + err |= ret; Since ret from walk_down_tree_v2() can be plus, minus and 0, using bit OR here is quite confusing. For example, if ret is -1, it will makes all bits of err to 1. Please don't mix bit operation with possible minus value. (And feel free to fix similar problems) Thanks, Qu > > /* if ret is negative, walk shall stop */ > if (ret < 0) { > signature.asc Description: OpenPGP digital signature
Re: [PATCH 3/3] btrfs-progs: check: exclude extents of tree blocks and extent items
On 2017年11月27日 11:13, Su Yue wrote: > Commit d17d6663c99c ("btrfs-progs: lowmem check: Fix regression which > screws up extent allocator") removes pin_metadata_blocks() from > lowmem repair. > So we have to find another way to exclude extents which should be > occupied by tree blocks. First thing first, any tree block which is referred by will not be freed. Unless some special case, like extent tree initialization which clears the whole extent tree so we can't check if one tree block should be freed using extent tree, there is no need to explicitly pin existing tree blocks. Their creation/freeing is already handled well. Please explain the reason why you need to pin extents first. Thanks, Qu > > Modify pin_down_tree_blocks() only for code reuse. > So behavior of pin_metadata_blocks() which works with option > 'init-extent-tree' is not influenced. > > Introduce exclude_blocks_and_extent_items() to mark extents of all tree > blocks dirty in fs_info->excluded_extents. It also calls > exclude_extent_items() to exclude extent_items in extent tree. > > Signed-off-by: Su Yue > --- > cmds-check.c | 122 > ++- > 1 file changed, 112 insertions(+), 10 deletions(-) > > diff --git a/cmds-check.c b/cmds-check.c > index c7b570bef9c3..f39285c5a3c2 100644 > --- a/cmds-check.c > +++ b/cmds-check.c > @@ -13351,6 +13351,7 @@ out: > return err; > } > > +static int exclude_blocks_and_extent_items(struct btrfs_fs_info *fs_info); > /* > * Low memory usage version check_chunks_and_extents. > */ > @@ -13363,12 +13364,22 @@ static int check_chunks_and_extents_v2(struct > btrfs_fs_info *fs_info) > struct btrfs_root *root1; > struct btrfs_root *root; > struct btrfs_root *cur_root; > + struct extent_io_tree excluded_extents; > int err = 0; > int ret; > > root = fs_info->fs_root; > > if (repair) { > + extent_io_tree_init(&excluded_extents); > + fs_info->excluded_extents = &excluded_extents; > + ret = exclude_blocks_and_extent_items(fs_info); > + if (ret) { > + error("failed to exclude tree blocks and extent items"); > + return ret; > + } > + reset_cached_block_groups(fs_info); > + > trans = btrfs_start_transaction(fs_info->extent_root, 1); > if (IS_ERR(trans)) { > error("failed to start transaction before check"); > @@ -13437,6 +13448,8 @@ out: > err |= ret; > else > err &= ~BG_ACCOUNTING_ERROR; > + extent_io_tree_cleanup(&excluded_extents); > + fs_info->excluded_extents = NULL; > } > > if (trans) > @@ -13534,40 +13547,106 @@ init: > return 0; > } > > -static int pin_down_tree_blocks(struct btrfs_fs_info *fs_info, > - struct extent_buffer *eb, int tree_root) > +static int exclude_extent_items(struct btrfs_fs_info *fs_info, > + struct extent_io_tree *tree) > +{ > + struct btrfs_root *root = fs_info->extent_root; > + struct btrfs_key key; > + struct btrfs_path path; > + struct extent_buffer *eb; > + int slot; > + int ret; > + u64 start; > + u64 end; > + > + ASSERT(tree); > + btrfs_init_path(&path); > + key.objectid = 0; > + key.type = 0; > + key.offset = 0; > + > + ret = btrfs_search_slot(NULL, root, &key, &path, 0, 0); > + if (ret < 0) > + goto out; > + > + while (1) { > + eb = path.nodes[0]; > + slot = path.slots[0]; > + btrfs_item_key_to_cpu(eb, &key, slot); > + if (key.type != BTRFS_EXTENT_ITEM_KEY && > + key.type != BTRFS_METADATA_ITEM_KEY) > + goto next; > + start = key.objectid; > + if (key.type == BTRFS_EXTENT_ITEM_KEY) > + end = start + key.offset; > + else > + end = start + fs_info->nodesize; > + > + set_extent_dirty(tree, start, end - 1); > +next: > + ret = btrfs_next_item(root, &path); > + if (ret > 0) { > + ret = 0; > + goto out; > + } > + if (ret < 0) > + goto out; > + } > +out: > + if (ret) > + error("failed to exclude extent items"); > + btrfs_release_path(&path); > + return ret; > +} > + > +static int traverse_tree_blocks(struct btrfs_fs_info *fs_info, > + struct extent_buffer *eb, int tree_root, > + int pin) > { > struct extent_buffer *tmp; > struct btrfs_root_item *ri; > struct btrfs_key key; > + struct extent_io_tree *tree; > u64 bytenr; > int level = btrfs_header_level(eb); > int nritems; >
Re: [PATCH 4/4] btrfs: factor __btrfs_open_devices() to create btrfs_open_one_device()
Hi Lakshmipathi, Oops I can see the same. I am trying to narrow down. Thanks, Anand On 11/27/2017 02:47 PM, Lakshmipathi.G wrote: Hi Anand, With this patch applied, btrfs-progs/misc-test/021 error out. Is this same for you? Without this patch: https://asciinema.org/a/RJmE5469mHlL3S1BIOCifWVn6 With this patch:https://asciinema.org/a/1h5UX6DIFNsvvMXgLo4GiEgdE thanks! Cheers, Lakshmipathi.G http://www.giis.co.in http://www.webminal.org On Thu, Nov 9, 2017 at 9:15 PM, Anand Jain wrote: No functional changes, create btrfs_open_one_device() from __btrfs_open_devices(). This is a preparatory work to add dynamic device scan. Signed-off-by: Anand Jain --- fs/btrfs/volumes.c | 126 + 1 file changed, 69 insertions(+), 57 deletions(-) diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index 0857b580014d..d24e966ee29f 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -601,6 +601,73 @@ void btrfs_free_stale_device(struct btrfs_device *cur_dev) } } +static int btrfs_open_one_device(struct btrfs_fs_devices *fs_devices, + struct btrfs_device *device, fmode_t flags, + void *holder) +{ + struct request_queue *q; + struct block_device *bdev; + struct buffer_head *bh; + struct btrfs_super_block *disk_super; + u64 devid; + int ret; + + if (device->bdev) + return -EINVAL; + if (!device->name) + return -EINVAL; + + ret = btrfs_get_bdev_and_sb(device->name->str, flags, holder, 1, + &bdev, &bh); + if (ret) + return ret; + + disk_super = (struct btrfs_super_block *)bh->b_data; + devid = btrfs_stack_device_id(&disk_super->dev_item); + if (devid != device->devid) + goto error_brelse; + + if (memcmp(device->uuid, disk_super->dev_item.uuid, + BTRFS_UUID_SIZE)) + goto error_brelse; + + device->generation = btrfs_super_generation(disk_super); + + if (btrfs_super_flags(disk_super) & BTRFS_SUPER_FLAG_SEEDING) { + device->writeable = 0; + fs_devices->seeding = 1; + } else { + device->writeable = !bdev_read_only(bdev); + } + + q = bdev_get_queue(bdev); + if (blk_queue_discard(q)) + device->can_discard = 1; + if (!blk_queue_nonrot(q)) + fs_devices->rotating = 1; + + device->bdev = bdev; + device->in_fs_metadata = 0; + device->mode = flags; + + fs_devices->open_devices++; + if (device->writeable && + device->devid != BTRFS_DEV_REPLACE_DEVID) { + fs_devices->rw_devices++; + list_add(&device->dev_alloc_list, +&fs_devices->alloc_list); + } + brelse(bh); + + return 0; + +error_brelse: + brelse(bh); + blkdev_put(bdev, flags); + + return -EINVAL; +} + /* * Add new device to list of registered devices * @@ -978,69 +1045,14 @@ static int __btrfs_open_devices(struct btrfs_fs_devices *fs_devices, flags |= FMODE_EXCL; list_for_each_entry(device, head, dev_list) { - struct request_queue *q; - struct block_device *bdev; - struct buffer_head *bh; - struct btrfs_super_block *disk_super; - u64 devid; - - if (device->bdev) - continue; - if (!device->name) - continue; - /* Just open everything we can; ignore failures here */ - if (btrfs_get_bdev_and_sb(device->name->str, flags, holder, 1, - &bdev, &bh)) + ret = btrfs_open_one_device(fs_devices, device, flags, holder); + if (ret) continue; - disk_super = (struct btrfs_super_block *)bh->b_data; - devid = btrfs_stack_device_id(&disk_super->dev_item); - if (devid != device->devid) - goto error_brelse; - - if (memcmp(device->uuid, disk_super->dev_item.uuid, - BTRFS_UUID_SIZE)) - goto error_brelse; - - device->generation = btrfs_super_generation(disk_super); - - if (btrfs_super_flags(disk_super) & BTRFS_SUPER_FLAG_SEEDING) { - device->writeable = 0; - fs_devices->seeding = 1; - } else { - device->writeable = !bdev_read_only(bdev); - } - - q = bdev_get_queue(bdev); - if (blk_queue_discard(q)) - device->can_discard = 1; - if (!blk_queue_nonrot(q)) - fs_devices->rotating = 1; - - device->bdev = bdev; -
Re: How about adding an ioctl to convert a directory to a subvolume?
On 2017-11-27 05:13, Qu Wenruo wrote: On 2017年11月27日 17:41, Lu Fengqi wrote: Hi all, As we all know, under certain circumstances, it is more appropriate to create some subvolumes rather than keep everything in the same subvolume. As the condition of demand change, the user may need to convert a previous directory to a subvolume. For this reason,how about adding an ioctl to convert a directory to a subvolume? The idea seems interesting. However in my opinion, this can be done quite easily in (mostly) user space, thanks to btrfs support of relink. The method from Hugo or Chris is quite good, maybe it can be enhanced a little. Use the following layout as an example: root_subv |- subvolume_1 | |- dir_1 | | |- file_1 | | |- file_2 | |- dir_2 | |- file_3 |- subvolume_2 If we want to convert dir_1 into subvolume, we can do it like: 1) Create a temporary readonly snapshot of parent subvolume containing the desired dir # btrfs sub snapshot -r root_subv/subvolume_1 \ root_subv/tmp_snapshot_1 2) Create a new subvolume, as destination. # btrfs sub create root_subv/tmp_dest/ 3) Copy the content and sync the fs Use of reflink is necessary. # cp -r --reflink=always root_subv/tmp_snapshot_1/dir_1 \ root_subv/tmp_dest # btrfs sync root_subv/tmp_dest 4) Delete temporary readonly snapshot # btrfs subvolume delete root_subv/tmp_snapshot_1 5) Remove the source dir # rm -rf root_subv/subvolume_1/dir_1 5) Create a final destination snapshot of "root_subv/temporary_dest" # btrfs subvolume snapshot root_subv/tmp_dest \ root_subv/subvolume_1/dir_1 6) Remove the temporary destination # btrfs subvolume delete root_subv/tmp_dest The main challenge is in step 3). In fact above method can only handle normal dir/files. If there is another subvolume inside the desired dir, current "cp -r" is a bad idea. We need to skip subvolume dir, and create snapshot for it. But it's quite easy to write a user space program to handle it. Maybe using "find" command can already handle it well. Anyway, doing it in user space is already possible and much easier than doing it in kernel. Users can convert by the scripts mentioned in this thread(https://www.spinics.net/lists/linux-btrfs/msg33252.html), but is it easier to use the off-the-shelf btrfs subcommand? If you just want to integrate the functionality into btrfs-progs, maybe it's possible. But if you insist in providing a new ioctl for this, I highly doubt if the extra hassle is worthy. After an initial consideration, our implementation is broadly divided into the following steps: 1. Freeze the filesystem or set the subvolume above the source directory to read-only; Not really need to freeze the whole fs. Just create a readonly snapshot of the parent subvolume which contains the dir. That's how snapshot is designed for. 2. Perform a pre-check, for example, check if a cross-device link creation during the conversion; This can be done in-the-fly. As the check is so easy (only needs to check if the inode number is 256). We only need a mid-order iteration of the source dir (in temporary snapshot), and for normal file, use reflink. For subvolume dir, create a snapshot for it. And for such iteration, a python script less than 100 lines would be sufficient. On that note, see the function convert_dir_to_subv() in: https://github.com/Ferroin/btrfs-subv-backup/blob/master/btrfs-subv-backup.py For an example of how to do it in Python (albeit with some extra code to handle the case of not having the reflink module from PyPI, and without anything to prevent the source from being modified). It would still be nice to be able to do this atomically though, or at least get cross-rename support in BTRFS, which would allow the final rename to replace the source with a subvolume to be atomic (assuming of course you could cross-rename a directory and subvolume). Thanks, Qu 3. Perform conversion, such as creating a new subvolume and moving the contents of the source directory; 4. Thaw the filesystem or restore the subvolume writable property. In fact, I am not so sure whether this use of freeze is appropriate because the source directory the user needs to convert may be located at / or /home and this pre-check and conversion process may take a long time, which can lead to some shell and graphical application suspended. Please give your comments if any. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: FAQ / encryption / error handling?
On 2017-11-27 05:06, Dmitrii Tcvetkov wrote: On Mon, 27 Nov 2017 09:06:12 +0100 Daniel Pocock wrote: Hi all, The FAQ has a couple of sections on encryption (general and dm-crypt) One thing that isn't explained there: if you create multiple encrypted volumes (e.g. using dm-crypt) and use Btrfs to combine them into RAID1, how does error recovery work when a read operation returns corrupted data? Without encryption, reading from one disk would give a checksum mismatch and Btrfs would read from the other disk to (hopefully) get a good copy of the data. With this encryption scenario, the failure would potentially be detected in the decryption layer code and instead of returning bad data to Btrfs, it would return some error code. In that case, will Btrfs attempt to read from the other volume and allow the application to proceed as if nothing was wrong? Regards, Daniel Default (aes-xts-plain64) dm-crypt setup can't verify integrity of encrypted block and in case of silent corruption will decrypt it to garbage which btrfs will catch. In case of AEAD encryption (dm-crypt plus dm-integrity) it can verify integrity itself but I'm not sure right now which exact error it returns to upper layer as I didn't used it yet. The exact error shouldn't matter, provided that BTRFS perceives it as a read error from the 'device' (in reality the virtual DM device). Provided that condition is met, the error is handled pretty much the same regardless of the exact error code. I use btrfs raid1 on top of LVM on top of dm-crypt devices and it handled bad blocks on physical devices normally (there was a burst of about 900 reallocates on one device which btrfs caught and fixed).Same here, and I've also tested it on top of dm-integrity, where BTRFS will correctly handle errors passed up from dm-integrity failing to verify blocks. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: How about adding an ioctl to convert a directory to a subvolume?
On 2017年11月27日 21:02, Austin S. Hemmelgarn wrote: > On 2017-11-27 05:13, Qu Wenruo wrote: >> >> >> On 2017年11月27日 17:41, Lu Fengqi wrote: >>> Hi all, >>> >>> As we all know, under certain circumstances, it is more appropriate to >>> create some subvolumes rather than keep everything in the same >>> subvolume. As the condition of demand change, the user may need to >>> convert a previous directory to a subvolume. For this reason,how about >>> adding an ioctl to convert a directory to a subvolume? >> >> The idea seems interesting. >> >> However in my opinion, this can be done quite easily in (mostly) user >> space, thanks to btrfs support of relink. >> >> The method from Hugo or Chris is quite good, maybe it can be enhanced a >> little. >> >> Use the following layout as an example: >> >> root_subv >> |- subvolume_1 >> | |- dir_1 >> | | |- file_1 >> | | |- file_2 >> | |- dir_2 >> | |- file_3 >> |- subvolume_2 >> >> If we want to convert dir_1 into subvolume, we can do it like: >> >> 1) Create a temporary readonly snapshot of parent subvolume containing >> the desired dir >> # btrfs sub snapshot -r root_subv/subvolume_1 \ >> root_subv/tmp_snapshot_1 >> >> 2) Create a new subvolume, as destination. >> # btrfs sub create root_subv/tmp_dest/ >> >> 3) Copy the content and sync the fs >> Use of reflink is necessary. >> # cp -r --reflink=always root_subv/tmp_snapshot_1/dir_1 \ >> root_subv/tmp_dest >> # btrfs sync root_subv/tmp_dest >> >> 4) Delete temporary readonly snapshot >> # btrfs subvolume delete root_subv/tmp_snapshot_1 >> >> 5) Remove the source dir >> # rm -rf root_subv/subvolume_1/dir_1 >> >> 5) Create a final destination snapshot of "root_subv/temporary_dest" >> # btrfs subvolume snapshot root_subv/tmp_dest \ >> root_subv/subvolume_1/dir_1 >> >> 6) Remove the temporary destination >> # btrfs subvolume delete root_subv/tmp_dest >> >> >> The main challenge is in step 3). >> In fact above method can only handle normal dir/files. >> If there is another subvolume inside the desired dir, current "cp -r" is >> a bad idea. >> We need to skip subvolume dir, and create snapshot for it. >> >> But it's quite easy to write a user space program to handle it. >> Maybe using "find" command can already handle it well. >> >> Anyway, doing it in user space is already possible and much easier than >> doing it in kernel. >> >>> >>> Users can convert by the scripts mentioned in this >>> thread(https://www.spinics.net/lists/linux-btrfs/msg33252.html), but is >>> it easier to use the off-the-shelf btrfs subcommand? >> >> If you just want to integrate the functionality into btrfs-progs, maybe >> it's possible. >> >> But if you insist in providing a new ioctl for this, I highly doubt if >> the extra hassle is worthy. >> >>> >>> After an initial consideration, our implementation is broadly divided >>> into the following steps: >>> 1. Freeze the filesystem or set the subvolume above the source directory >>> to read-only; >> >> Not really need to freeze the whole fs. >> Just create a readonly snapshot of the parent subvolume which contains >> the dir. >> That's how snapshot is designed for. >> >>> 2. Perform a pre-check, for example, check if a cross-device link >>> creation during the conversion; >> >> This can be done in-the-fly. >> As the check is so easy (only needs to check if the inode number is 256). >> We only need a mid-order iteration of the source dir (in temporary >> snapshot), and for normal file, use reflink. >> For subvolume dir, create a snapshot for it. >> >> And for such iteration, a python script less than 100 lines would be >> sufficient. > On that note, see the function convert_dir_to_subv() in: > https://github.com/Ferroin/btrfs-subv-backup/blob/master/btrfs-subv-backup.py > > > For an example of how to do it in Python (albeit with some extra code to > handle the case of not having the reflink module from PyPI, and without > anything to prevent the source from being modified). > > It would still be nice to be able to do this atomically though, or at > least get cross-rename support in BTRFS, which would allow the final > rename to replace the source with a subvolume to be atomic (assuming of > course you could cross-rename a directory and subvolume). The problem behind cross-rename is, btrfs doesn't follow the one-inode-one-tree organization used by most filesystems. This prevents inode from being referred outside of its subvolume. And since btrfs uses one-subvolume-one-tree solution, which greatly simplify the snapshot implementation, it's pretty hard or almost impossible to do real rename-across-subvolume. But at least we can reflink, reducing huge amount of data IO, making us only need to handle inode creation/link. (Although such one-subvolume-one-tree also makes metadata concurrency very low, further slowing down the metadata operation) Thanks, Qu >> >> Thanks, >> Qu >> >>> 3. Perform conversion, such as creating a new subvolu
splat on 4.15-rc1: invalid ram_bytes for uncompressed inline extent
Hi! On 4.15-rc1, I get the following failure: BTRFS critical (device sda1): corrupt leaf: root=1 block=3820662898688 slot=43 ino=35691 file_offset=0, invalid ram_bytes for uncompressed inline extent, have 134 expect 281474976710677 Repeatable every boot attempt. 4.14 and earlier boot fine; btrfs check (progs 4.13.3) doesn't find any badness either. [ 11.347451] BTRFS info (device sda1): use lzo compression [ 11.352914] BTRFS info (device sda1): using free space tree [] Activating lvm and md swap...[ ok done. [] Checking file systems...fsck from util-l [ 11.660352] BTRFS critical (device sda1): corrupt leaf: root=1 block=3820662898688 slot=43 ino=35691 file_offset=0, invalid ram_bytes for uncompressed inline extent, have 134 expect 281474976710677 inux 2.30.2 [ 11.678550] BTRFS info (device sda1): leaf 3820662898688 total ptrs 103 free space 4350 [ 11.687767] item 0 key (35663 12 32909) itemoff 16263 itemsize 20 [ 11.695021] item 1 key (35663 108 0) itemoff 15751 itemsize 512 [ 11.701274] inline extent data size 509 ok [ 11.705704] item 2 key (35664 1 0) itemoff 15591 itemsize 160 [ 11.713034] inode generation 1292 size 509 mode 100644 [ 11.718811] item 3 key (35664 12 32909) itemoff 15571 itemsize 20 [ 11.725113] item 4 key (35664 108 0) itemoff 15059 itemsize 512 [ 11.732168] inline extent data size 509 done. [ 11.736280] item 5 key (35665 1 0) itemoff 14899 itemsize 160 [ 11.742681] inode generation 1292 size 457 mode 100644 [ 11.748275] item 6 key (35665 12 32909) itemoff 14879 itemsize 20 [ 11.754584] item 7 key (35665 108 0) itemoff 14411 itemsize 468 [ 11.760674] inline extent data size 457 [ 11.764780] item 8 key (35666 1 0) itemoff 14251 itemsize 160 [ 11.770711] inode generation 1292 size 533 mode 100644 [] [ 11.776145] item 9 key (35666 12 32909) itemoff 14231 itemsize 20 Cleaning up temp [ 11.783069] item 10 key (35666 108 0) itemoff 13697 itemsize 534 orary files... [ 11.790666] inline extent data size 533 [ 11.795980] item 11 key (35668 1 0) itemoff 13537 itemsize 160 [ 11.801989] inode generation 1292 size 319 mode 100644 /tmp [ 11.807413] item 12 key (35668 12 32909) itemoff 13517 itemsize 20 [ 11.814250] item 13 key (35668 108 0) itemoff 13247 itemsize 270 [ 11.820512] inline extent data size 319 [ 11.825539] item 14 key (35669 1 0) itemoff 13087 itemsize 160 [ 11.831577] inode generation 1292 size 375 mode 100644 [ 11.837149] item 15 key (35669 12 32909) itemoff 13067 itemsize 20 [ 11.843873] item 16 key (35669 108 0) itemoff 12783 itemsize 284 ok [ 11.850098] inline extent data size 375 [ 11.855579] item 17 key (35670 1 0) itemoff 12623 itemsize 160 [ 11.862861] inode generation 1292 size 168 mode 100644 . [ 11.869194] item 18 key (35670 12 33512) itemoff 12588 itemsize 35 [ 11.876467] item 19 key (35670 108 0) itemoff 12399 itemsize 189 [ 11.883564] inline extent data size 168 [ 11.888551] item 20 key (35676 1 0) itemoff 12239 itemsize 160 [ 11.895421] inode generation 1292 size 512 mode 100600 [ 11.901719] item 21 key (35676 12 32911) itemoff 12218 itemsize 21 [ 11.909045] item 22 key (35676 108 0) itemoff 11685 itemsize 533 [ 11.916136] inline extent data size 512 [ 11.921125] item 23 key (35685 1 0) itemoff 11525 itemsize 160 [ 11.928047] inode generation 1292 size 32128 mode 100644 [ 11.934553] item 24 key (35685 12 32783) itemoff 11508 itemsize 17 [] Mounting [ 11.941874] item 25 key (35685 108 0) itemoff 11455 itemsize 53 local filesystem [ 11.949377] extent data disk bytenr 3757990555648 nr 4096 s... [ 11.956471] extent data offset 0 nr 4096 ram 4096 [ 11.962383] item 26 key (35685 108 4096) itemoff 11402 itemsize 53 [ 11.969704] extent data disk bytenr 3755041128448 nr 4096 [ 11.976324] extent data offset 4096 nr 24576 ram 32768 [ 11.982686] item 27 key (35685 108 28672) itemoff 11349 itemsize 53 [ 11.990140] extent data disk bytenr 3749090922496 nr 4096 [ 11.996786] extent data offset 0 nr 4096 ram 4096 [ 12.002732] item 28 key (35686 1 0) itemoff 11189 itemsize 160 [ 12.009755] inode generation 1292 size 5023 mode 100644 [ 12.016204] item 29 key (35686 12 32783) itemoff 11165 itemsize 24 [ 12.023576] item 30 key (35686 108 0) itemoff 2 itemsize 53 [ 12.030665] extent data disk bytenr 3651995594752 nr 4096 [ 12.037298] extent data offset 0 nr 8192 ram 8192 [ 12.043184] item 31 key (35687 1 0) itemoff 10952 itemsize 160 [ 12.050181] inode generation 1292 size 293168 mode 100664 [ 12.056793] item 32 key (35687 12 32783) itemoff 10935 itemsize 17 [ 12.064082] item 33 key (35687 108 0) itemoff 10882 itemsize 53 [ 12.071154] extent data disk bytenr 3752
Re: How about adding an ioctl to convert a directory to a subvolume?
On 2017-11-27 08:17, Qu Wenruo wrote: On 2017年11月27日 21:02, Austin S. Hemmelgarn wrote: On 2017-11-27 05:13, Qu Wenruo wrote: On 2017年11月27日 17:41, Lu Fengqi wrote: Hi all, As we all know, under certain circumstances, it is more appropriate to create some subvolumes rather than keep everything in the same subvolume. As the condition of demand change, the user may need to convert a previous directory to a subvolume. For this reason,how about adding an ioctl to convert a directory to a subvolume? The idea seems interesting. However in my opinion, this can be done quite easily in (mostly) user space, thanks to btrfs support of relink. The method from Hugo or Chris is quite good, maybe it can be enhanced a little. Use the following layout as an example: root_subv |- subvolume_1 | |- dir_1 | | |- file_1 | | |- file_2 | |- dir_2 | |- file_3 |- subvolume_2 If we want to convert dir_1 into subvolume, we can do it like: 1) Create a temporary readonly snapshot of parent subvolume containing the desired dir # btrfs sub snapshot -r root_subv/subvolume_1 \ root_subv/tmp_snapshot_1 2) Create a new subvolume, as destination. # btrfs sub create root_subv/tmp_dest/ 3) Copy the content and sync the fs Use of reflink is necessary. # cp -r --reflink=always root_subv/tmp_snapshot_1/dir_1 \ root_subv/tmp_dest # btrfs sync root_subv/tmp_dest 4) Delete temporary readonly snapshot # btrfs subvolume delete root_subv/tmp_snapshot_1 5) Remove the source dir # rm -rf root_subv/subvolume_1/dir_1 5) Create a final destination snapshot of "root_subv/temporary_dest" # btrfs subvolume snapshot root_subv/tmp_dest \ root_subv/subvolume_1/dir_1 6) Remove the temporary destination # btrfs subvolume delete root_subv/tmp_dest The main challenge is in step 3). In fact above method can only handle normal dir/files. If there is another subvolume inside the desired dir, current "cp -r" is a bad idea. We need to skip subvolume dir, and create snapshot for it. But it's quite easy to write a user space program to handle it. Maybe using "find" command can already handle it well. Anyway, doing it in user space is already possible and much easier than doing it in kernel. Users can convert by the scripts mentioned in this thread(https://www.spinics.net/lists/linux-btrfs/msg33252.html), but is it easier to use the off-the-shelf btrfs subcommand? If you just want to integrate the functionality into btrfs-progs, maybe it's possible. But if you insist in providing a new ioctl for this, I highly doubt if the extra hassle is worthy. After an initial consideration, our implementation is broadly divided into the following steps: 1. Freeze the filesystem or set the subvolume above the source directory to read-only; Not really need to freeze the whole fs. Just create a readonly snapshot of the parent subvolume which contains the dir. That's how snapshot is designed for. 2. Perform a pre-check, for example, check if a cross-device link creation during the conversion; This can be done in-the-fly. As the check is so easy (only needs to check if the inode number is 256). We only need a mid-order iteration of the source dir (in temporary snapshot), and for normal file, use reflink. For subvolume dir, create a snapshot for it. And for such iteration, a python script less than 100 lines would be sufficient. On that note, see the function convert_dir_to_subv() in: https://github.com/Ferroin/btrfs-subv-backup/blob/master/btrfs-subv-backup.py For an example of how to do it in Python (albeit with some extra code to handle the case of not having the reflink module from PyPI, and without anything to prevent the source from being modified). It would still be nice to be able to do this atomically though, or at least get cross-rename support in BTRFS, which would allow the final rename to replace the source with a subvolume to be atomic (assuming of course you could cross-rename a directory and subvolume). The problem behind cross-rename is, btrfs doesn't follow the one-inode-one-tree organization used by most filesystems. This prevents inode from being referred outside of its subvolume. And since btrfs uses one-subvolume-one-tree solution, which greatly simplify the snapshot implementation, it's pretty hard or almost impossible to do real rename-across-subvolume. I seriously doubt that that matters in almost all real-world use cases. Everything I've seen that uses cross-rename does it with a temporary file in the same directory as the target file, using it to avoid the non-atomic nature of creating a backup and replacing a file without needing extra I/O (yes, reflinks help here, but still aren't perfect). Just supporting it within a subvolume and returning whatever errno gets returned for trying to call rename(2) across filesystem boundaries should be more than sufficient for most use cases, even if it doesn't work with what I had suggested
[PATCH] btrfs: ignore return from btrfs_open_one_device()
Test case btrfs-progs test-misc/012 can recreate the same fsid with different number of struct btrfs_fs_devices::total_devices. And the previous device which is in the kernel device list is stale now. But as we don't clean the kernel device list (unless same device is scanned with different fsid), so in the mount context it really ended up reading the device to find zeroed SB. And thus return the fail to mount. The long term fix for this should be to refresh the kernel device list. Though the patch btrfs: factor __btrfs_open_devices() to create btrfs_open_one_device() did tried to skip the failed open, but it forgot to reset the ret value or not to assign, thus error went up the stack in the mount context. Signed-off-by: Anand Jain Fixes: btrfs: factor __btrfs_open_devices() to create btrfs_open_one_device() --- Hi David, My bad. The above patch introduced a regression. Can you kindly squash this patch to it. Thanks, Anand fs/btrfs/volumes.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index 5bd73edc2602..a3fa2dc39881 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -1047,8 +1047,7 @@ static int __btrfs_open_devices(struct btrfs_fs_devices *fs_devices, list_for_each_entry(device, head, dev_list) { /* Just open everything we can; ignore failures here */ - ret = btrfs_open_one_device(fs_devices, device, flags, holder); - if (ret) + if (btrfs_open_one_device(fs_devices, device, flags, holder)) continue; if (!latest_dev || -- 2.15.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: splat on 4.15-rc1: invalid ram_bytes for uncompressed inline extent
On Mon, Nov 27, 2017 at 02:23:49PM +0100, Adam Borowski wrote: > Hi! > On 4.15-rc1, I get the following failure: > > BTRFS critical (device sda1): corrupt leaf: root=1 block=3820662898688 > slot=43 ino=35691 file_offset=0, invalid ram_bytes for uncompressed inline > extent, have 134 expect 281474976710677 By a quick look at suspiciously large number hex(281474976710677) = 0x10015 may be a bitflip, but 0x15 does not match 134, so there could be something else involved in the corruption. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: splat on 4.15-rc1: invalid ram_bytes for uncompressed inline extent
On 27.11.2017 15:23, Adam Borowski wrote: > Hi! > On 4.15-rc1, I get the following failure: > > BTRFS critical (device sda1): corrupt leaf: root=1 block=3820662898688 > slot=43 ino=35691 file_offset=0, invalid ram_bytes for uncompressed inline > extent, have 134 expect 281474976710677 > > Repeatable every boot attempt. 4.14 and earlier boot fine; btrfs check > (progs 4.13.3) doesn't find any badness either. The reason why you hit this on 4.15 is because this error is coming from the tree checker. So how big is the file which inode 35691 represents? So the check which fails is: if (item_size != BTRFS_FILE_EXTENT_INLINE_DATA_START + btrfs_file_extent_ram_bytes(leaf, fi)) { file_extent_err(root, leaf, slot, "invalid ram_bytes for uncompressed inline extent, have %u expect %llu", item_size, BTRFS_FILE_EXTENT_INLINE_DATA_START + btrfs_file_extent_ram_bytes(leaf, fi)); return -EUCLEAN; } BTRFS_FILE_EXTENT_INLINE_DATA_START is 21 based on pahole's output: struct btrfs_file_extent_item { __le64 generation; /* 0 8 */ __le64 ram_bytes;/* 8 8 */ __u8 compression; /*16 1 */ __u8 encryption; /*17 1 */ __le16 other_encoding; /*18 2 */ __u8 type; /*20 1 */ __le64 disk_bytenr; /*21 8 */ __le64 disk_num_bytes; /*29 8 */ __le64 offset; /*37 8 */ __le64 num_bytes;/*45 8 */ /* size: 53, cachelines: 1, members: 10 */ /* last cacheline: 53 bytes */ }; The inline extent size is 113 so 21 + 113 should is 134 which equals to what we expect. However, the printing code uses btrfs_file_extent_inline_len and the tree-checker code uses directly the ram bytes. And this function does the correct thing according to whether the file is compressed or not. So Qu, perhaps the code needs to be changed or is this a genuine bug ? > > [ 11.347451] BTRFS info (device sda1): use lzo compression > [ 11.352914] BTRFS info (device sda1): using free space tree > [] Activating lvm and md swap...[ ok done. > [] Checking file systems...fsck from util-l > [ 11.660352] BTRFS critical (device sda1): corrupt leaf: root=1 > block=3820662898688 slot=43 ino=35691 file_offset=0, invalid ram_bytes for > uncompressed inline extent, have 134 expect 281474976710677 > inux 2.30.2 > [ 11.678550] BTRFS info (device sda1): leaf 3820662898688 total ptrs 103 > free space 4350 > [ 11.687767]item 0 key (35663 12 32909) itemoff 16263 itemsize 20 > [ 11.695021]item 1 key (35663 108 0) itemoff 15751 itemsize 512 > [ 11.701274]inline extent data size 509 > ok > [ 11.705704]item 2 key (35664 1 0) itemoff 15591 itemsize 160 > [ 11.713034]inode generation 1292 size 509 mode 100644 > [ 11.718811]item 3 key (35664 12 32909) itemoff 15571 itemsize 20 > [ 11.725113]item 4 key (35664 108 0) itemoff 15059 itemsize 512 > [ 11.732168]inline extent data size 509 > done. > [ 11.736280]item 5 key (35665 1 0) itemoff 14899 itemsize 160 > [ 11.742681]inode generation 1292 size 457 mode 100644 > [ 11.748275]item 6 key (35665 12 32909) itemoff 14879 itemsize 20 > [ 11.754584]item 7 key (35665 108 0) itemoff 14411 itemsize 468 > [ 11.760674]inline extent data size 457 > [ 11.764780]item 8 key (35666 1 0) itemoff 14251 itemsize 160 > [ 11.770711]inode generation 1292 size 533 mode 100644 > [] > [ 11.776145]item 9 key (35666 12 32909) itemoff 14231 itemsize 20 > Cleaning up temp > [ 11.783069]item 10 key (35666 108 0) itemoff 13697 itemsize 534 > orary files... > [ 11.790666]inline extent data size 533 > [ 11.795980]item 11 key (35668 1 0) itemoff 13537 itemsize 160 > [ 11.801989]inode generation 1292 size 319 mode 100644 > /tmp > [ 11.807413]item 12 key (35668 12 32909) itemoff 13517 itemsize 20 > [ 11.814250]item 13 key (35668 108 0) itemoff 13247 itemsize 270 > [ 11.820512]inline extent data size 319 > [ 11.825539]item 14 key (35669 1 0) itemoff 13087 itemsize 160 > [ 11.831577]inode generation 1292 size 375 mode 100644 > [ 11.837149]item 15 key (35669 12 32909) itemoff 1
Re: [PATCH] btrfs-progs: dump_tree: remove superfluous _TREE
On Sat, Nov 25, 2017 at 09:03:26AM +0800, Qu Wenruo wrote: > On 2017年11月25日 04:26, Hans van Kranenburg wrote: > > -# btrfs inspect-internal dump-tree -t fs /dev/block/device > > ERROR: unrecognized tree id: fs > > > > Without this fix I can't dump-tree fs, but I can dump-tree fs_tree and > > also fs_tree_tree, which is a bit silly. > > --- > > cmds-inspect-dump-tree.c | 2 +- > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > diff --git a/cmds-inspect-dump-tree.c b/cmds-inspect-dump-tree.c > > index 4e72c08a..df44bb63 100644 > > --- a/cmds-inspect-dump-tree.c > > +++ b/cmds-inspect-dump-tree.c > > @@ -143,7 +143,7 @@ static u64 treeid_from_string(const char *str, const > > char **end) > > { "CHUNK", BTRFS_CHUNK_TREE_OBJECTID }, > > { "DEVICE", BTRFS_DEV_TREE_OBJECTID }, > > { "DEV", BTRFS_DEV_TREE_OBJECTID }, > > - { "FS_TREE", BTRFS_FS_TREE_OBJECTID }, > > + { "FS", BTRFS_FS_TREE_OBJECTID }, > > Considering no other name has the _tree suffix, this looks good to me. > > Reviewed-by: Qu Wenruo Applied, thanks. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] Btrfs: fix list_add corruption and soft lockups in fsync
On Tue, Nov 21, 2017 at 02:35:40PM -0700, Liu Bo wrote: > Xfstests btrfs/146 revealed this corruption, > > [ 58.138831] Buffer I/O error on dev dm-0, logical block 2621424, async > page read > [ 58.151233] BTRFS error (device sdf): bdev /dev/mapper/error-test errs: wr > 1, rd 0, flush 0, corrupt 0, gen 0 > [ 58.152403] list_add corruption. prev->next should be next > (88005e6775d8), but was c9000189be88. (prev=c9000189be88). > [ 58.153518] [ cut here ] > [ 58.153892] WARNING: CPU: 1 PID: 1287 at lib/list_debug.c:31 > __list_add_valid+0x169/0x1f0 > ... > [ 58.157379] RIP: 0010:__list_add_valid+0x169/0x1f0 > ... > [ 58.161956] Call Trace: > [ 58.162264] btrfs_log_inode_parent+0x5bd/0xfb0 [btrfs] > [ 58.163583] btrfs_log_dentry_safe+0x60/0x80 [btrfs] > [ 58.164003] btrfs_sync_file+0x4c2/0x6f0 [btrfs] > [ 58.164393] vfs_fsync_range+0x5f/0xd0 > [ 58.164898] do_fsync+0x5a/0x90 > [ 58.165170] SyS_fsync+0x10/0x20 > [ 58.165395] entry_SYSCALL_64_fastpath+0x1f/0xbe > ... > > It turns out that we could record btrfs_log_ctx:io_err in > log_one_extents when IO fails, but make log_one_extents() return '0' > instead of -EIO, so the IO error is not acknowledged by the callers, > i.e. btrfs_log_inode_parent(), which would remove btrfs_log_ctx:list > from list head 'root->log_ctxs'. Since btrfs_log_ctx is allocated > from stack memory, it'd get freed with a object alive on the > list. then a future list_add will throw the above warning. > > This returns the correct error in the above case. > > Jeff also reported this while testing against his fsync error > patch set[1]. > > [1]: https://www.spinics.net/lists/linux-btrfs/msg65308.html > "btrfs list corruption and soft lockups while testing writeback error > handling" > > Fixes: 8407f553268a4611f254 ("Btrfs: fix data corruption after fast fsync and > writeback error") > Signed-off-by: Liu Bo Reviewed-by: David Sterba -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] btrfs: extent-tree: Make btrfs_inode_rsv_refill function static
On Fri, Nov 17, 2017 at 03:14:19PM +0800, Qu Wenruo wrote: > This function is no longer used outside of extent-tree.c. > Make it static. > > Signed-off-by: Qu Wenruo Reviewed-by: David Sterba -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] btrfs: Remove redundant FLAG_VACANCY
On Thu, Nov 23, 2017 at 10:51:43AM +0200, Nikolay Borisov wrote: > Commit 9036c10208e1 ("Btrfs: update hole handling v2") added the FLAG_VACANCY > to denote holes, however there was already a consistent way of flagging > extents which represent hole - ->block_start = EXTENT_MAP_HOLE. And also > the only place where this flag is checked is in the fiemap code, but the > block_start value is also checked and every other place in the filesystem > detects holes by using block_start value's. So remove the extra flag. This > survived a full xfstest run > > Signed-off-by: Nikolay Borisov Reviewed-by: David Sterba -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 09/11] Btrfs: kill the btree_inode
On Wed, Nov 22, 2017 at 04:16:04PM -0500, Josef Bacik wrote: > From: Josef Bacik > @@ -4802,8 +4885,8 @@ struct extent_buffer *btrfs_clone_extent_buffer(struct > extent_buffer *src) > return new; > } > > -struct extent_buffer *__alloc_dummy_extent_buffer(struct btrfs_fs_info > *fs_info, > - u64 start, unsigned long len) > +struct extent_buffer *alloc_dummy_extent_buffer(struct btrfs_eb_info > *eb_info, > + u64 start, unsigned long len) The __alloc_dummy_extent_buffer takes the length parameter because it's used in tests that need to pass different values. I've removed nodesize from alloc_dummy_extent_buffer and the callchain because we know that it's always going to be fs_info->nodesize. Reintroducing it does not look like a good idea. > { > struct extent_buffer *eb; > unsigned long num_pages; > @@ -160,13 +162,25 @@ struct extent_state { > #endif > }; > > +struct btrfs_eb_info { > + struct btrfs_fs_info *fs_info; > + struct extent_io_tree io_tree; > + struct extent_io_tree io_failure_tree; > + > + /* Extent buffer radix tree */ > + spinlock_t buffer_lock; > + struct radix_tree_root buffer_radix; > + struct list_lru lru_list; > + pgoff_t writeback_index; > +}; > + > #define INLINE_EXTENT_BUFFER_PAGES 16 > #define MAX_INLINE_EXTENT_BUFFER_SIZE (INLINE_EXTENT_BUFFER_PAGES * > PAGE_SIZE) > struct extent_buffer { > u64 start; > unsigned long len; > unsigned long bflags; > - struct btrfs_fs_info *fs_info; > + struct btrfs_eb_info *eb_info; This single change increases the patch size just because all the callers need to be updated. I suggest to keep fs_info in extent_buffer, we're not going to lose much in terms of memory: currently there are 14 eb objects in a 4k slab page, with the additional fs_info it's still 14, 280 * 14 = 3920, unused 176 bytes 288 * 14 = 4032, unused 64 bytes > spinlock_t refs_lock; > atomic_t refs; > atomic_t io_pages; -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] btrfs-progs: lowmem check: Reword an unclear error message about file extent gap
On Fri, Nov 10, 2017 at 05:47:10PM +0800, Lu Fengqi wrote: > This error occurs when no_holes is not set, but there is a gap > before the file extent. > > Signed-off-by: Lu Fengqi Applied, thanks. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 1/3] btrfs-progs: test/fsck: Introduce test images containing tree reloc tree
On Fri, Nov 10, 2017 at 09:34:17AM +0800, Qu Wenruo wrote: > Tree reloc tree is a special tree with very short life spawn. > It acts as a special snapshot for any tree, with related nodes/leaves or > EXTENT_DATA modified to point to new position. > > Considering the short life spawn and its specialness, it should be quite > reasonable to keep them as both corner case for fsck and educational > dump for anyone interested in relocation. > > Signed-off-by: Qu Wenruo 1-3 applied, thanks. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] btrfs-progs: fi defrag: clean up duplicate code if find errors
On Tue, Nov 07, 2017 at 10:24:30AM +0800, Su Yue wrote: > In function cmd_filesystem_defrag(), lines of code for error handling > are duplicate and hard to expand in further. > > Create a jump label for errors. > > Signed-off-by: Su Yue > --- > cmds-filesystem.c | 46 +- > 1 file changed, 25 insertions(+), 21 deletions(-) > > diff --git a/cmds-filesystem.c b/cmds-filesystem.c > index 7728430f16a1..0893a44f28fe 100644 > --- a/cmds-filesystem.c > +++ b/cmds-filesystem.c > @@ -1029,29 +1029,27 @@ static int cmd_filesystem_defrag(int argc, char > **argv) > if (fd < 0) { > error("cannot open %s: %s", argv[i], > strerror(errno)); > - defrag_global_errors++; > - close_file_or_dir(fd, dirstream); > - continue; > + ret = fd; The fd will be -1, we might need to save the last error, though it's not directly used (yet). > + goto next; > } > - if (fstat(fd, &st)) { > + > + ret = fstat(fd, &st); > + if (ret) { > error("failed to stat %s: %s", > argv[i], strerror(errno)); > - defrag_global_errors++; > - close_file_or_dir(fd, dirstream); > - continue; > + goto next; Similar here. > } > if (!(S_ISDIR(st.st_mode) || S_ISREG(st.st_mode))) { > error("%s is not a directory or a regular file", > argv[i]); > - defrag_global_errors++; > - close_file_or_dir(fd, dirstream); > - continue; > + ret = -EINVAL; So it's consistent with that one. > + goto next; > } > if (recursive && S_ISDIR(st.st_mode)) { > ret = nftw(argv[i], defrag_callback, 10, > FTW_MOUNT | FTW_PHYS); > if (ret == ENOTTY) > - exit(1); > + break; Please split this change, it changes the logic and should be properly described in the changelog. Otherwise ok. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] btrfs-progs: mkfs: check the status of file at mkfs
On Fri, Nov 24, 2017 at 02:21:15PM +0900, Misono, Tomohiro wrote: > Currently, only the status of block devices is checked at mkfs, > but we should also check for regular files whether they are already > formatted or mounted to prevent overwrite accidentally. > > Device status is checked by test_dev_for_mkfs(). > The part which is not related to block device is split from this > and used for both block device and regular file. > > Signed-off-by: Tomohiro Misono Applied, thanks. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 1/4] btrfs-progs: mkfs: Enhance minimal device size calculation to fix mkfs failure on small file
On Wed, Nov 01, 2017 at 09:30:40AM +0800, Qu Wenruo wrote: > +u64 btrfs_min_dev_size(u32 nodesize, int mixed, u64 meta_profile, > +u64 data_profile) > +{ > + u64 reserved = 0; > + u64 meta_size; > + u64 data_size; > + > + if (mixed) > + return 2 * (BTRFS_MKFS_SYSTEM_GROUP_SIZE + > + btrfs_min_global_blk_rsv_size(nodesize)); > + > + /* > + * minimal size calculation is complex due to several factors: > + * 1) Temporary chunk resue > + *If specified chunk profile is SINGLE, we can resue > + *temporary chunks, no need to alloc new chunks. > + * > + * 2) Different minimal chunk size for different profile > + *For initial sys chunk, chunk size is fixed to 4M. > + *For single profile, minimal chunk size is 8M for all. > + *For other profiles, minimal chunk and stripe size > + *differs from 8M to 64M. > + * > + * To calculate it a little easier, here we assume we don't > + * reuse any temporary chunk, and calculate the size all > + * by ourselves. > + * > + * Temporary chunks sizes are always fixed: > + * One initial sys chunk, one SINGLE meta, and one SINGLE data. > + * The latter two are all 8M, accroding to @calc_size of > + * btrfs_alloc_chunk(). > + */ > + reserved += BTRFS_MKFS_SYSTEM_GROUP_SIZE + SZ_8M * 2; > + > + /* > + * For real chunks, we need to refer different sizes: > + * For SINGLE, it's still fixed to 8M (@calc_size). > + * For other profiles, refer to max(@min_stripe_size, @calc_size). > + * > + * And use the stripe size to calculate its physical used space. > + */ > + if (meta_profile & BTRFS_BLOCK_GROUP_PROFILE_MASK) > + meta_size = SZ_8M + SZ_32M; > + else > + meta_size = SZ_8M + SZ_8M; > + /* Only metadata put 2 stripes into one disk */ 'mkfs.btrfs -d dup' also works, so I think this should be handled here the same way as metadata, right? > + if (meta_profile & BTRFS_BLOCK_GROUP_DUP) > + meta_size *= 2; > + reserved += meta_size; > + > + if (data_profile & BTRFS_BLOCK_GROUP_PROFILE_MASK) > + data_size = SZ_64M; > + else > + data_size = SZ_8M; > + if (data_profile & BTRFS_BLOCK_GROUP_DUP) > + data_size *= 2; > + reserved += data_size; > + return reserved; > +} -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 2/4] btrfs-progs: test/common: Introduce run_mustfail_stdout
On Wed, Nov 01, 2017 at 09:30:41AM +0800, Qu Wenruo wrote: > For later test case which needs info from stderr. > > Signed-off-by: Qu Wenruo Applied, thanks. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 3/4] btrfs-progs: test/common: Enhance prepare_test_dev to reset device size
On Wed, Nov 01, 2017 at 09:30:42AM +0800, Qu Wenruo wrote: > So prepare_test_dev() can be called several times in one test case, to > test different device sizes. > > Signed-off-by: Qu Wenruo Applied, thanks. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: splat on 4.15-rc1: invalid ram_bytes for uncompressed inline extent
On 2017年11月27日 23:29, Nikolay Borisov wrote: > > > On 27.11.2017 15:23, Adam Borowski wrote: >> Hi! >> On 4.15-rc1, I get the following failure: >> >> BTRFS critical (device sda1): corrupt leaf: root=1 block=3820662898688 >> slot=43 ino=35691 file_offset=0, invalid ram_bytes for uncompressed inline >> extent, have 134 expect 281474976710677 >> >> Repeatable every boot attempt. 4.14 and earlier boot fine; btrfs check >> (progs 4.13.3) doesn't find any badness either. > > The reason why you hit this on 4.15 is because this error is coming from the > tree checker. > So how big is the file which inode 35691 represents? The problem is, the reporter is using CONFIG_BTRFS_FS_CHECK_INTEGRITY, which will call btrfs_check_leaf() every time btrfs_mark_buffer_dirty(). However just as explained in the fixing patch: https://patchwork.kernel.org/patch/10047489/ The timing of call btrfs_mark_buffer_dirty() is not right, one of the most obvious place is for EXTENT_DATA. So please apply the above patch to see if it solves the problem. Thanks, Qu > > So the check which fails is: > > if (item_size != BTRFS_FILE_EXTENT_INLINE_DATA_START + > btrfs_file_extent_ram_bytes(leaf, fi)) { > > file_extent_err(root, leaf, slot, > > "invalid ram_bytes for uncompressed inline extent, have %u expect > %llu", > item_size, > BTRFS_FILE_EXTENT_INLINE_DATA_START + > btrfs_file_extent_ram_bytes(leaf, fi)); > > return -EUCLEAN; > > } > > BTRFS_FILE_EXTENT_INLINE_DATA_START is 21 based on pahole's output: > > struct btrfs_file_extent_item { > __le64 generation; /* 0 8 */ > __le64 ram_bytes;/* 8 8 */ > __u8 compression; /*16 1 */ > __u8 encryption; /*17 1 */ > __le16 other_encoding; /*18 2 */ > __u8 type; /*20 1 */ > __le64 disk_bytenr; /*21 8 */ > __le64 disk_num_bytes; /*29 8 */ > __le64 offset; /*37 8 */ > __le64 num_bytes;/*45 8 */ > > /* size: 53, cachelines: 1, members: 10 */ > /* last cacheline: 53 bytes */ > }; > > The inline extent size is 113 so 21 + 113 should is 134 which equals to what > we expect. However, > the printing code uses btrfs_file_extent_inline_len and the tree-checker code > uses directly the > ram bytes. And this function does the correct thing according to whether the > file is compressed or not. > So Qu, perhaps the code needs to be changed or is this a genuine bug ? > >> >> [ 11.347451] BTRFS info (device sda1): use lzo compression >> [ 11.352914] BTRFS info (device sda1): using free space tree >> [] Activating lvm and md swap...[ ok done. >> [] Checking file systems...fsck from util-l >> [ 11.660352] BTRFS critical (device sda1): corrupt leaf: root=1 >> block=3820662898688 slot=43 ino=35691 file_offset=0, invalid ram_bytes for >> uncompressed inline extent, have 134 expect 281474976710677 >> inux 2.30.2 >> [ 11.678550] BTRFS info (device sda1): leaf 3820662898688 total ptrs 103 >> free space 4350 >> [ 11.687767] item 0 key (35663 12 32909) itemoff 16263 itemsize 20 >> [ 11.695021] item 1 key (35663 108 0) itemoff 15751 itemsize 512 >> [ 11.701274] inline extent data size 509 >> ok >> [ 11.705704] item 2 key (35664 1 0) itemoff 15591 itemsize 160 >> [ 11.713034] inode generation 1292 size 509 mode 100644 >> [ 11.718811] item 3 key (35664 12 32909) itemoff 15571 itemsize 20 >> [ 11.725113] item 4 key (35664 108 0) itemoff 15059 itemsize 512 >> [ 11.732168] inline extent data size 509 >> done. >> [ 11.736280] item 5 key (35665 1 0) itemoff 14899 itemsize 160 >> [ 11.742681] inode generation 1292 size 457 mode 100644 >> [ 11.748275] item 6 key (35665 12 32909) itemoff 14879 itemsize 20 >> [ 11.754584] item 7 key (35665 108 0) itemoff 14411 itemsize 468 >> [ 11.760674] inline extent data size 457 >> [ 11.764780] item 8 key (35666 1 0) itemoff 14251 itemsize 160 >> [ 11.770711] inode generation 1292 size 533 mode 100644 >> [] >> [ 11.776145] item 9 key (35666 12 32909) itemoff 14231 itemsize 20 >> Cleaning up temp >> [ 11.783069] item 10 key (35666 108 0) itemoff 13697 itemsize 534 >> orary files... >> [ 11.790666] inline extent data size 533 >> [
Re: [PATCH v4 1/4] btrfs-progs: mkfs: Enhance minimal device size calculation to fix mkfs failure on small file
On 2017年11月28日 07:21, David Sterba wrote: > On Wed, Nov 01, 2017 at 09:30:40AM +0800, Qu Wenruo wrote: >> +u64 btrfs_min_dev_size(u32 nodesize, int mixed, u64 meta_profile, >> + u64 data_profile) >> +{ >> +u64 reserved = 0; >> +u64 meta_size; >> +u64 data_size; >> + >> +if (mixed) >> +return 2 * (BTRFS_MKFS_SYSTEM_GROUP_SIZE + >> +btrfs_min_global_blk_rsv_size(nodesize)); >> + >> +/* >> + * minimal size calculation is complex due to several factors: >> + * 1) Temporary chunk resue >> + *If specified chunk profile is SINGLE, we can resue >> + *temporary chunks, no need to alloc new chunks. >> + * >> + * 2) Different minimal chunk size for different profile >> + *For initial sys chunk, chunk size is fixed to 4M. >> + *For single profile, minimal chunk size is 8M for all. >> + *For other profiles, minimal chunk and stripe size >> + *differs from 8M to 64M. >> + * >> + * To calculate it a little easier, here we assume we don't >> + * reuse any temporary chunk, and calculate the size all >> + * by ourselves. >> + * >> + * Temporary chunks sizes are always fixed: >> + * One initial sys chunk, one SINGLE meta, and one SINGLE data. >> + * The latter two are all 8M, accroding to @calc_size of >> + * btrfs_alloc_chunk(). >> + */ >> +reserved += BTRFS_MKFS_SYSTEM_GROUP_SIZE + SZ_8M * 2; >> + >> +/* >> + * For real chunks, we need to refer different sizes: >> + * For SINGLE, it's still fixed to 8M (@calc_size). >> + * For other profiles, refer to max(@min_stripe_size, @calc_size). >> + * >> + * And use the stripe size to calculate its physical used space. >> + */ >> +if (meta_profile & BTRFS_BLOCK_GROUP_PROFILE_MASK) >> +meta_size = SZ_8M + SZ_32M; >> +else >> +meta_size = SZ_8M + SZ_8M; >> +/* Only metadata put 2 stripes into one disk */ > > 'mkfs.btrfs -d dup' also works, so I think this should be handled here > the same way as metadata, right? Just a few lines under. > >> +if (meta_profile & BTRFS_BLOCK_GROUP_DUP) >> +meta_size *= 2; >> +reserved += meta_size; >> + >> +if (data_profile & BTRFS_BLOCK_GROUP_PROFILE_MASK) >> +data_size = SZ_64M; >> +else >> +data_size = SZ_8M; >> +if (data_profile & BTRFS_BLOCK_GROUP_DUP) >> +data_size *= 2; Here data DUP is also handled. So the problem is about the word "Only". Do I need to re-submit the whole patchset or just a separate patch to update the comment? Thanks, Qu >> +reserved += data_size; >> +return reserved; >> +} > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > signature.asc Description: OpenPGP digital signature
Re: splat on 4.15-rc1: invalid ram_bytes for uncompressed inline extent
On 2017年11月27日 22:22, David Sterba wrote: > On Mon, Nov 27, 2017 at 02:23:49PM +0100, Adam Borowski wrote: >> Hi! >> On 4.15-rc1, I get the following failure: >> >> BTRFS critical (device sda1): corrupt leaf: root=1 block=3820662898688 >> slot=43 ino=35691 file_offset=0, invalid ram_bytes for uncompressed inline >> extent, have 134 expect 281474976710677 > > By a quick look at suspiciously large number > > hex(281474976710677) = 0x10015 > > may be a bitflip, but 0x15 does not match 134, so there could be > something else involved in the corruption. That's a known bug, fixed by that patch which is not merged yet. https://patchwork.kernel.org/patch/10047489/ Thanks, Qu > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > signature.asc Description: OpenPGP digital signature
Re: splat on 4.15-rc1: invalid ram_bytes for uncompressed inline extent
On Tue, Nov 28, 2017 at 08:51:07AM +0800, Qu Wenruo wrote: > On 2017年11月27日 22:22, David Sterba wrote: > > On Mon, Nov 27, 2017 at 02:23:49PM +0100, Adam Borowski wrote: > >> On 4.15-rc1, I get the following failure: > >> > >> BTRFS critical (device sda1): corrupt leaf: root=1 block=3820662898688 > >> slot=43 ino=35691 file_offset=0, invalid ram_bytes for uncompressed inline > >> extent, have 134 expect 281474976710677 > > > > By a quick look at suspiciously large number > > > > hex(281474976710677) = 0x10015 > > > > may be a bitflip, but 0x15 does not match 134, so there could be > > something else involved in the corruption. > > That's a known bug, fixed by that patch which is not merged yet. > > https://patchwork.kernel.org/patch/10047489/ This helped, thanks! 喵! -- ⢀⣴⠾⠻⢶⣦⠀ Mozilla's Hippocritical Oath: "Keep trackers off your trail" ⣾⠁⢰⠒⠀⣿⡁ blah blah evading "tracking technology" blah blah ⢿⡄⠘⠷⠚⠋⠀ "https://click.e.mozilla.org/?qs=e7bb0dcf14b1013fca3820..."; ⠈⠳⣄ (same for all links) -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/3] btrfs-progs: check: exclude extents of tree blocks and extent items
Thanks for review. On 11/27/2017 06:37 PM, Qu Wenruo wrote: On 2017年11月27日 11:13, Su Yue wrote: Commit d17d6663c99c ("btrfs-progs: lowmem check: Fix regression which screws up extent allocator") removes pin_metadata_blocks() from lowmem repair. So we have to find another way to exclude extents which should be occupied by tree blocks. First thing first, any tree block which is referred by will not be freed. Unless some special case, like extent tree initialization which clears the whole extent tree so we can't check if one tree block should be freed using extent tree, there is no need to explicitly pin existing tree blocks. Their creation/freeing is already handled well. If I understand the logic of extents correctly: The extents in free space cache are handled in the way like cache_block_group() in extent-tree.c. It traverses all extents items then marks all holes free in free space cache. Yes, in a normal situation, extents are already handled well. But original and lowmem check try to repair corrupted extent tree. Suppose a situation: 1) Let an extent item of one tree block is corrupted or missed. 2) The correct extent in free space cache will be marked as free. 3) Next CoW may allocate the "used" extents from free space and insert an extent item. 4) Lowmem repair increases refs of the extent item and causes a wrong extent item. OR 3) Lowmem repair inserts the extent item firstly. 4) CoW may allocate the "used" extents from free space. BUG_ON failure of inserting the extent item. Please explain the reason why you need to pin extents first. What I do in the patch is like origin mode's. Pining extents first ensures every extents(corrupted and uncorrupted) will not be rellocated. Thanks, Su Thanks, Qu Modify pin_down_tree_blocks() only for code reuse. So behavior of pin_metadata_blocks() which works with option 'init-extent-tree' is not influenced. Introduce exclude_blocks_and_extent_items() to mark extents of all tree blocks dirty in fs_info->excluded_extents. It also calls exclude_extent_items() to exclude extent_items in extent tree. Signed-off-by: Su Yue --- cmds-check.c | 122 ++- 1 file changed, 112 insertions(+), 10 deletions(-) diff --git a/cmds-check.c b/cmds-check.c index c7b570bef9c3..f39285c5a3c2 100644 --- a/cmds-check.c +++ b/cmds-check.c @@ -13351,6 +13351,7 @@ out: return err; } +static int exclude_blocks_and_extent_items(struct btrfs_fs_info *fs_info); /* * Low memory usage version check_chunks_and_extents. */ @@ -13363,12 +13364,22 @@ static int check_chunks_and_extents_v2(struct btrfs_fs_info *fs_info) struct btrfs_root *root1; struct btrfs_root *root; struct btrfs_root *cur_root; + struct extent_io_tree excluded_extents; int err = 0; int ret; root = fs_info->fs_root; if (repair) { + extent_io_tree_init(&excluded_extents); + fs_info->excluded_extents = &excluded_extents; + ret = exclude_blocks_and_extent_items(fs_info); + if (ret) { + error("failed to exclude tree blocks and extent items"); + return ret; + } + reset_cached_block_groups(fs_info); + trans = btrfs_start_transaction(fs_info->extent_root, 1); if (IS_ERR(trans)) { error("failed to start transaction before check"); @@ -13437,6 +13448,8 @@ out: err |= ret; else err &= ~BG_ACCOUNTING_ERROR; + extent_io_tree_cleanup(&excluded_extents); + fs_info->excluded_extents = NULL; } if (trans) @@ -13534,40 +13547,106 @@ init: return 0; } -static int pin_down_tree_blocks(struct btrfs_fs_info *fs_info, - struct extent_buffer *eb, int tree_root) +static int exclude_extent_items(struct btrfs_fs_info *fs_info, + struct extent_io_tree *tree) +{ + struct btrfs_root *root = fs_info->extent_root; + struct btrfs_key key; + struct btrfs_path path; + struct extent_buffer *eb; + int slot; + int ret; + u64 start; + u64 end; + + ASSERT(tree); + btrfs_init_path(&path); + key.objectid = 0; + key.type = 0; + key.offset = 0; + + ret = btrfs_search_slot(NULL, root, &key, &path, 0, 0); + if (ret < 0) + goto out; + + while (1) { + eb = path.nodes[0]; + slot = path.slots[0]; + btrfs_item_key_to_cpu(eb, &key, slot); + if (key.type != BTRFS_EXTENT_ITEM_KEY && + key.type != BTRFS_METADATA_ITEM_KEY) + goto next; + start = key.objectid; + if (key.type == BTRFS_EXTENT_ITEM_KEY) + end =
[PATCH] btrfs: add helper for device path or missing
This patch creates a helper function to get either the rcu device path or missing. Signed-off-by: Anand Jain --- fs/btrfs/dev-replace.c | 22 +++--- 1 file changed, 11 insertions(+), 11 deletions(-) diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c index 7c655f9a7a50..4b6ceb38cb5f 100644 --- a/fs/btrfs/dev-replace.c +++ b/fs/btrfs/dev-replace.c @@ -304,6 +304,11 @@ void btrfs_after_dev_replace_commit(struct btrfs_fs_info *fs_info) dev_replace->cursor_left_last_write_of_item; } +static inline char *dev_missing_or_rcu_str(struct btrfs_device *device) +{ + return device->missing ? "" : rcu_str_deref(device->name); +} + int btrfs_dev_replace_start(struct btrfs_fs_info *fs_info, const char *tgtdev_name, u64 srcdevid, const char *srcdev_name, int read_src) @@ -363,8 +368,7 @@ int btrfs_dev_replace_start(struct btrfs_fs_info *fs_info, btrfs_info_in_rcu(fs_info, "dev_replace from %s (devid %llu) to %s started", - src_device->missing ? "" : - rcu_str_deref(src_device->name), + dev_missing_or_rcu_str(src_device), src_device->devid, rcu_str_deref(tgt_device->name)); @@ -538,8 +542,7 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info, } else { btrfs_err_in_rcu(fs_info, "btrfs_scrub_dev(%s, %llu, %s) failed %d", -src_device->missing ? "" : -rcu_str_deref(src_device->name), +dev_missing_or_rcu_str(src_device), src_device->devid, rcu_str_deref(tgt_device->name), scrub_ret); btrfs_dev_replace_unlock(dev_replace, 1); @@ -557,8 +560,7 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info, btrfs_info_in_rcu(fs_info, "dev_replace from %s (devid %llu) to %s finished", - src_device->missing ? "" : - rcu_str_deref(src_device->name), + dev_missing_or_rcu_str(src_device), src_device->devid, rcu_str_deref(tgt_device->name)); tgt_device->is_tgtdev_for_dev_replace = 0; @@ -814,12 +816,10 @@ static int btrfs_dev_replace_kthread(void *data) progress = btrfs_dev_replace_progress(fs_info); progress = div_u64(progress, 10); btrfs_info_in_rcu(fs_info, - "continuing dev_replace from %s (devid %llu) to %s @%u%%", - dev_replace->srcdev->missing ? "" - : rcu_str_deref(dev_replace->srcdev->name), + "continuing dev_replace from %s (devid %llu) to target %s @%u%%", + dev_missing_or_rcu_str(dev_replace->srcdev), dev_replace->srcdev->devid, - dev_replace->tgtdev ? rcu_str_deref(dev_replace->tgtdev->name) - : "", + dev_missing_or_rcu_str(dev_replace->tgtdev), (unsigned int)progress); btrfs_dev_replace_continue_on_mount(fs_info); -- 2.15.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/3] btrfs-progs: check: exclude extents of tree blocks and extent items
On 2017年11月28日 10:38, Su Yue wrote: > Thanks for review. > > On 11/27/2017 06:37 PM, Qu Wenruo wrote: >> >> >> On 2017年11月27日 11:13, Su Yue wrote: >>> Commit d17d6663c99c ("btrfs-progs: lowmem check: Fix regression which >>> screws up extent allocator") removes pin_metadata_blocks() from >>> lowmem repair. >>> So we have to find another way to exclude extents which should be >>> occupied by tree blocks. >> >> First thing first, any tree block which is referred by will not be freed. >> >> Unless some special case, like extent tree initialization which clears >> the whole extent tree so we can't check if one tree block should be >> freed using extent tree, there is no need to explicitly pin existing >> tree blocks. >> >> Their creation/freeing is already handled well. >> > If I understand the logic of extents correctly: > > The extents in free space cache are handled in the way like > cache_block_group() in extent-tree.c. > It traverses all extents items then marks all holes free in free space > cache. > > Yes, in a normal situation, extents are already handled well. > But original and lowmem check try to repair corrupted extent tree. > > Suppose a situation: > 1) Let an extent item of one tree block is corrupted or missed. > 2) The correct extent in free space cache will be marked as free. > 3) Next CoW may allocate the "used" extents from free space > and insert an extent item. > 4) Lowmem repair increases refs of the extent item and > causes a wrong extent item. > OR > 3) Lowmem repair inserts the extent item firstly. > 4) CoW may allocate the "used" extents from free space. > BUG_ON failure of inserting the extent item. OK, this explains things, but still not optimal. Such behavior should not happen if nothing is exposed. Pinning down all tree blocks is never a light operation and should be avoided if possible. It would be nice to do it when starting a new transaction to modify the fs. > >> Please explain the reason why you need to pin extents first. >> > What I do in the patch is like origin mode's. > Pining extents first ensures every extents(corrupted and uncorrupted) > will not be rellocated. Only extent tree reinit will pin down all metadata block in original mode while normal repair won't. So you're still doing something which is not done in original mode, which either needs to be explained in detail or fix original mode first. > > Thanks, > Su > >> Thanks, >> Qu >> >>> >>> Modify pin_down_tree_blocks() only for code reuse. >>> So behavior of pin_metadata_blocks() which works with option >>> 'init-extent-tree' is not influenced. >>> >>> Introduce exclude_blocks_and_extent_items() to mark extents of all tree >>> blocks dirty in fs_info->excluded_extents. It also calls >>> exclude_extent_items() to exclude extent_items in extent tree. >>> >>> Signed-off-by: Su Yue >>> --- >>> cmds-check.c | 122 >>> ++- >>> 1 file changed, 112 insertions(+), 10 deletions(-) >>> >>> diff --git a/cmds-check.c b/cmds-check.c >>> index c7b570bef9c3..f39285c5a3c2 100644 >>> --- a/cmds-check.c >>> +++ b/cmds-check.c >>> @@ -13351,6 +13351,7 @@ out: >>> return err; >>> } >>> +static int exclude_blocks_and_extent_items(struct btrfs_fs_info >>> *fs_info); >>> /* >>> * Low memory usage version check_chunks_and_extents. >>> */ >>> @@ -13363,12 +13364,22 @@ static int >>> check_chunks_and_extents_v2(struct btrfs_fs_info *fs_info) >>> struct btrfs_root *root1; >>> struct btrfs_root *root; >>> struct btrfs_root *cur_root; >>> + struct extent_io_tree excluded_extents; >>> int err =; >>> int ret; >>> root =s_info->fs_root; >>> if (repair) { >>> + extent_io_tree_init(&excluded_extents); >>> + fs_info->excluded_extents =excluded_extents; >>> + ret =xclude_blocks_and_extent_items(fs_info); >>> + if (ret) { >>> + error("failed to exclude tree blocks and extent items"); >>> + return ret; >>> + } >>> + reset_cached_block_groups(fs_info); >>> + >>> trans =trfs_start_transaction(fs_info->extent_root, 1); >>> if (IS_ERR(trans)) { >>> error("failed to start transaction before check"); >>> @@ -13437,6 +13448,8 @@ out: >>> err |=et; >>> else >>> err &=BG_ACCOUNTING_ERROR; >>> + extent_io_tree_cleanup(&excluded_extents); >>> + fs_info->excluded_extents =ULL; >>> } >>> if (trans) >>> @@ -13534,40 +13547,106 @@ init: >>> return 0; >>> } >>> -static int pin_down_tree_blocks(struct btrfs_fs_info *fs_info, >>> - struct extent_buffer *eb, int tree_root) >>> +static int exclude_extent_items(struct btrfs_fs_info *fs_info, >>> + struct extent_io_tree *tree) >>> +{ >>> + struct btrfs_root *root =s_info->extent_root; >>> + struct btrfs_key key; >>> + struct btrfs_path path; >>> + struct extent_b
Re: splat on 4.15-rc1: invalid ram_bytes for uncompressed inline extent
On 28.11.2017 02:37, Qu Wenruo wrote: > > > On 2017年11月27日 23:29, Nikolay Borisov wrote: >> >> >> On 27.11.2017 15:23, Adam Borowski wrote: >>> Hi! >>> On 4.15-rc1, I get the following failure: >>> >>> BTRFS critical (device sda1): corrupt leaf: root=1 block=3820662898688 >>> slot=43 ino=35691 file_offset=0, invalid ram_bytes for uncompressed inline >>> extent, have 134 expect 281474976710677 >>> >>> Repeatable every boot attempt. 4.14 and earlier boot fine; btrfs check >>> (progs 4.13.3) doesn't find any badness either. >> >> The reason why you hit this on 4.15 is because this error is coming from the >> tree checker. >> So how big is the file which inode 35691 represents? > > The problem is, the reporter is using CONFIG_BTRFS_FS_CHECK_INTEGRITY, > which will call btrfs_check_leaf() every time btrfs_mark_buffer_dirty(). > > However just as explained in the fixing patch: > https://patchwork.kernel.org/patch/10047489/ > > The timing of call btrfs_mark_buffer_dirty() is not right, one of the > most obvious place is for EXTENT_DATA. Right, that was one of the things which I suspected. So in this case, David, do we want to push those fixes to 4.15-rc in some of the upcoming pull reqs or are we gonna have FS_INTEGRITY left broken for a release (imo not a good thing) ? > > So please apply the above patch to see if it solves the problem. > > Thanks, > Qu >> >> So the check which fails is: >> >> if (item_size != BTRFS_FILE_EXTENT_INLINE_DATA_START + >> btrfs_file_extent_ram_bytes(leaf, fi)) { >> >> file_extent_err(root, leaf, slot, >> >> "invalid ram_bytes for uncompressed inline extent, have %u expect >> %llu", >> item_size, >> BTRFS_FILE_EXTENT_INLINE_DATA_START + >> btrfs_file_extent_ram_bytes(leaf, fi)); >> >> return -EUCLEAN; >> >> } >> >> BTRFS_FILE_EXTENT_INLINE_DATA_START is 21 based on pahole's output: >> >> struct btrfs_file_extent_item { >> __le64 generation; /* 0 8 */ >> __le64 ram_bytes;/* 8 8 */ >> __u8 compression; /*16 1 */ >> __u8 encryption; /*17 1 */ >> __le16 other_encoding; /*18 2 */ >> __u8 type; /*20 1 */ >> __le64 disk_bytenr; /*21 8 */ >> __le64 disk_num_bytes; /*29 8 */ >> __le64 offset; /*37 8 */ >> __le64 num_bytes;/*45 8 */ >> >> /* size: 53, cachelines: 1, members: 10 */ >> /* last cacheline: 53 bytes */ >> }; >> >> The inline extent size is 113 so 21 + 113 should is 134 which equals to what >> we expect. However, >> the printing code uses btrfs_file_extent_inline_len and the tree-checker >> code uses directly the >> ram bytes. And this function does the correct thing according to whether the >> file is compressed or not. >> So Qu, perhaps the code needs to be changed or is this a genuine bug ? >> >>> >>> [ 11.347451] BTRFS info (device sda1): use lzo compression >>> [ 11.352914] BTRFS info (device sda1): using free space tree >>> [] Activating lvm and md swap...[ ok done. >>> [] Checking file systems...fsck from util-l >>> [ 11.660352] BTRFS critical (device sda1): corrupt leaf: root=1 >>> block=3820662898688 slot=43 ino=35691 file_offset=0, invalid ram_bytes for >>> uncompressed inline extent, have 134 expect 281474976710677 >>> inux 2.30.2 >>> [ 11.678550] BTRFS info (device sda1): leaf 3820662898688 total ptrs 103 >>> free space 4350 >>> [ 11.687767] item 0 key (35663 12 32909) itemoff 16263 itemsize 20 >>> [ 11.695021] item 1 key (35663 108 0) itemoff 15751 itemsize 512 >>> [ 11.701274] inline extent data size 509 >>> ok >>> [ 11.705704] item 2 key (35664 1 0) itemoff 15591 itemsize 160 >>> [ 11.713034] inode generation 1292 size 509 mode 100644 >>> [ 11.718811] item 3 key (35664 12 32909) itemoff 15571 itemsize 20 >>> [ 11.725113] item 4 key (35664 108 0) itemoff 15059 itemsize 512 >>> [ 11.732168] inline extent data size 509 >>> done. >>> [ 11.736280] item 5 key (35665 1 0) itemoff 14899 itemsize 160 >>> [ 11.742681] inode generation 1292 size 457 mode 100644 >>> [ 11.748275] item 6 key (35665 12 32909) itemoff 14879 itemsize 20 >>> [ 11.754584] item 7 key (35665 108 0) itemoff 14411 itemsize 468 >>> [ 11.760674] inline extent data size 457 >>> [ 11.764780] item 8 key (35666 1 0) itemof
Re: splat on 4.15-rc1: invalid ram_bytes for uncompressed inline extent
On 2017年11月28日 14:59, Nikolay Borisov wrote: > > > On 28.11.2017 02:37, Qu Wenruo wrote: >> >> >> On 2017年11月27日 23:29, Nikolay Borisov wrote: >>> >>> >>> On 27.11.2017 15:23, Adam Borowski wrote: Hi! On 4.15-rc1, I get the following failure: BTRFS critical (device sda1): corrupt leaf: root=1 block=3820662898688 slot=43 ino=35691 file_offset=0, invalid ram_bytes for uncompressed inline extent, have 134 expect 281474976710677 Repeatable every boot attempt. 4.14 and earlier boot fine; btrfs check (progs 4.13.3) doesn't find any badness either. >>> >>> The reason why you hit this on 4.15 is because this error is coming from >>> the tree checker. >>> So how big is the file which inode 35691 represents? >> >> The problem is, the reporter is using CONFIG_BTRFS_FS_CHECK_INTEGRITY, >> which will call btrfs_check_leaf() every time btrfs_mark_buffer_dirty(). >> >> However just as explained in the fixing patch: >> https://patchwork.kernel.org/patch/10047489/ >> >> The timing of call btrfs_mark_buffer_dirty() is not right, one of the >> most obvious place is for EXTENT_DATA. > > Right, that was one of the things which I suspected. So in this case, > David, do we want to push those fixes to 4.15-rc in some of the upcoming > pull reqs or are we gonna have FS_INTEGRITY left broken for a release > (imo not a good thing) ? BTW, compression level wild memory access fix is even more important. Just try to mount with "-o compress" then kernel will crash. Thanks, Qu > > >> >> So please apply the above patch to see if it solves the problem. >> >> Thanks, >> Qu >>> >>> So the check which fails is: >>> >>> if (item_size != BTRFS_FILE_EXTENT_INLINE_DATA_START + >>> btrfs_file_extent_ram_bytes(leaf, fi)) { >>> >>> file_extent_err(root, leaf, slot, >>> >>> "invalid ram_bytes for uncompressed inline extent, have %u expect >>> %llu", >>> item_size, >>> BTRFS_FILE_EXTENT_INLINE_DATA_START + >>> btrfs_file_extent_ram_bytes(leaf, fi)); >>> >>> return -EUCLEAN; >>> >>> } >>> >>> BTRFS_FILE_EXTENT_INLINE_DATA_START is 21 based on pahole's output: >>> >>> struct btrfs_file_extent_item { >>> __le64 generation; /* 0 8 */ >>> __le64 ram_bytes;/* 8 8 */ >>> __u8 compression; /*16 1 */ >>> __u8 encryption; /*17 1 */ >>> __le16 other_encoding; /*18 2 */ >>> __u8 type; /*20 1 */ >>> __le64 disk_bytenr; /*21 8 */ >>> __le64 disk_num_bytes; /*29 8 */ >>> __le64 offset; /*37 8 */ >>> __le64 num_bytes;/*45 8 */ >>> >>> /* size: 53, cachelines: 1, members: 10 */ >>> /* last cacheline: 53 bytes */ >>> }; >>> >>> The inline extent size is 113 so 21 + 113 should is 134 which equals to >>> what we expect. However, >>> the printing code uses btrfs_file_extent_inline_len and the tree-checker >>> code uses directly the >>> ram bytes. And this function does the correct thing according to whether >>> the file is compressed or not. >>> So Qu, perhaps the code needs to be changed or is this a genuine bug ? >>> [ 11.347451] BTRFS info (device sda1): use lzo compression [ 11.352914] BTRFS info (device sda1): using free space tree [] Activating lvm and md swap...[ ok done. [] Checking file systems...fsck from util-l [ 11.660352] BTRFS critical (device sda1): corrupt leaf: root=1 block=3820662898688 slot=43 ino=35691 file_offset=0, invalid ram_bytes for uncompressed inline extent, have 134 expect 281474976710677 inux 2.30.2 [ 11.678550] BTRFS info (device sda1): leaf 3820662898688 total ptrs 103 free space 4350 [ 11.687767] item 0 key (35663 12 32909) itemoff 16263 itemsize 20 [ 11.695021] item 1 key (35663 108 0) itemoff 15751 itemsize 512 [ 11.701274] inline extent data size 509 ok [ 11.705704] item 2 key (35664 1 0) itemoff 15591 itemsize 160 [ 11.713034] inode generation 1292 size 509 mode 100644 [ 11.718811] item 3 key (35664 12 32909) itemoff 15571 itemsize 20 [ 11.725113] item 4 key (35664 108 0) itemoff 15059 itemsize 512 [ 11.732168] inline extent data size 509 done. [ 11.736280] item 5 key (35665 1 0) itemoff 14899 itemsize 160 [ 11.742681] inode generation 1292 size
Re: splat on 4.15-rc1: invalid ram_bytes for uncompressed inline extent
On 28.11.2017 09:03, Qu Wenruo wrote: > > > On 2017年11月28日 14:59, Nikolay Borisov wrote: >> >> >> On 28.11.2017 02:37, Qu Wenruo wrote: >>> >>> >>> On 2017年11月27日 23:29, Nikolay Borisov wrote: On 27.11.2017 15:23, Adam Borowski wrote: > Hi! > On 4.15-rc1, I get the following failure: > > BTRFS critical (device sda1): corrupt leaf: root=1 block=3820662898688 > slot=43 ino=35691 file_offset=0, invalid ram_bytes for uncompressed inline > extent, have 134 expect 281474976710677 > > Repeatable every boot attempt. 4.14 and earlier boot fine; btrfs check > (progs 4.13.3) doesn't find any badness either. The reason why you hit this on 4.15 is because this error is coming from the tree checker. So how big is the file which inode 35691 represents? >>> >>> The problem is, the reporter is using CONFIG_BTRFS_FS_CHECK_INTEGRITY, >>> which will call btrfs_check_leaf() every time btrfs_mark_buffer_dirty(). >>> >>> However just as explained in the fixing patch: >>> https://patchwork.kernel.org/patch/10047489/ >>> >>> The timing of call btrfs_mark_buffer_dirty() is not right, one of the >>> most obvious place is for EXTENT_DATA. >> >> Right, that was one of the things which I suspected. So in this case, >> David, do we want to push those fixes to 4.15-rc in some of the upcoming >> pull reqs or are we gonna have FS_INTEGRITY left broken for a release >> (imo not a good thing) ? > > BTW, compression level wild memory access fix is even more important. > > Just try to mount with "-o compress" then kernel will crash. That one is already queued in David's misc-4.15 branch ergo it will go in a pull req this release. > > Thanks, > Qu > >> >> >>> >>> So please apply the above patch to see if it solves the problem. >>> >>> Thanks, >>> Qu So the check which fails is: if (item_size != BTRFS_FILE_EXTENT_INLINE_DATA_START + btrfs_file_extent_ram_bytes(leaf, fi)) { file_extent_err(root, leaf, slot, "invalid ram_bytes for uncompressed inline extent, have %u expect %llu", item_size, BTRFS_FILE_EXTENT_INLINE_DATA_START + btrfs_file_extent_ram_bytes(leaf, fi)); return -EUCLEAN; } BTRFS_FILE_EXTENT_INLINE_DATA_START is 21 based on pahole's output: struct btrfs_file_extent_item { __le64 generation; /* 0 8 */ __le64 ram_bytes;/* 8 8 */ __u8 compression; /*16 1 */ __u8 encryption; /*17 1 */ __le16 other_encoding; /*18 2 */ __u8 type; /*20 1 */ __le64 disk_bytenr; /*21 8 */ __le64 disk_num_bytes; /*29 8 */ __le64 offset; /*37 8 */ __le64 num_bytes;/*45 8 */ /* size: 53, cachelines: 1, members: 10 */ /* last cacheline: 53 bytes */ }; The inline extent size is 113 so 21 + 113 should is 134 which equals to what we expect. However, the printing code uses btrfs_file_extent_inline_len and the tree-checker code uses directly the ram bytes. And this function does the correct thing according to whether the file is compressed or not. So Qu, perhaps the code needs to be changed or is this a genuine bug ? > > [ 11.347451] BTRFS info (device sda1): use lzo compression > [ 11.352914] BTRFS info (device sda1): using free space tree > [] Activating lvm and md swap...[ ok done. > [] Checking file systems...fsck from util-l > [ 11.660352] BTRFS critical (device sda1): corrupt leaf: root=1 > block=3820662898688 slot=43 ino=35691 file_offset=0, invalid ram_bytes > for uncompressed inline extent, have 134 expect 281474976710677 > inux 2.30.2 > [ 11.678550] BTRFS info (device sda1): leaf 3820662898688 total ptrs > 103 free space 4350 > [ 11.687767]item 0 key (35663 12 32909) itemoff 16263 itemsize 20 > [ 11.695021]item 1 key (35663 108 0) itemoff 15751 itemsize 512 > [ 11.701274]inline extent data size 509 > ok > [ 11.705704]item 2 key (35664 1 0) itemoff 15591 itemsize 160 > [ 11.713034]inode generation 1292 size 509 mode 100644 > [ 11.718811]item 3 key (35664 12 32909) itemoff 15571 itemsize 20 > [ 11.725113]item
[PATCH v2 2/2] btrfs: Ensure btrfs_trim_fs can trim the whole fs
[BUG] fstrim on some btrfs only trims the unallocated space, not trimming any space in existing block groups. [CAUSE] Before fstrim_range passed to btrfs_trim_fs(), it get truncated to range [0, super->total_bytes). So later btrfs_trim_fs() will only be able to trim block groups in range [0, super->total_bytes). While for btrfs, any bytenr aligned to sector size is valid, since btrfs use its logical address space, there is nothing limiting the location where we put block groups. For btrfs with routine balance, it's quite easy to relocate all block groups and bytenr of block groups will start beyond super->total_bytes. In that case, btrfs will not trim existing block groups. [FIX] Just remove the truncation in btrfs_ioctl_fitrim(), so btrfs_trim_fs() can get the unmodified range, which is normally set to [0, U64_MAX]. Reported-by: Chris Murphy Fixes: f4c697e6406d ("btrfs: return EINVAL if start > total_bytes in fitrim ioctl") Cc: # v4.0+ Signed-off-by: Qu Wenruo --- changelog: v2: Locate the root cause in btrfs_ioctl_fitrim(), remove the truncation so we can still allow user to trim custom range. --- fs/btrfs/ioctl.c | 11 +++ 1 file changed, 7 insertions(+), 4 deletions(-) diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index fd172a93d11a..017fda31400d 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -365,7 +365,6 @@ static noinline int btrfs_ioctl_fitrim(struct file *file, void __user *arg) struct fstrim_range range; u64 minlen = ULLONG_MAX; u64 num_devices = 0; - u64 total_bytes = btrfs_super_total_bytes(fs_info->super_copy); int ret; if (!capable(CAP_SYS_ADMIN)) @@ -389,11 +388,15 @@ static noinline int btrfs_ioctl_fitrim(struct file *file, void __user *arg) return -EOPNOTSUPP; if (copy_from_user(&range, arg, sizeof(range))) return -EFAULT; - if (range.start > total_bytes || - range.len < fs_info->sb->s_blocksize) + + /* +* NOTE: Don't truncate the range using super->total_bytes. +* Bytenr of btrfs block group is in btrfs logical address space, +* which can be any sector size aligned bytenr in [0, U64_MAX]. +*/ + if (range.len < fs_info->sb->s_blocksize) return -EINVAL; - range.len = min(range.len, total_bytes - range.start); range.minlen = max(range.minlen, minlen); ret = btrfs_trim_fs(fs_info, &range); if (ret < 0) -- 2.15.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH RESEND 1/2] btrfs: Enhance btrfs_trim_fs function to handle error better
Function btrfs_trim_fs() doesn't handle errors in a consistent way, if error happens when trimming existing block groups, it will skip the remaining blocks and continue to trim unallocated space for each device. And the return value will only reflect the final error from device trimming. This patch will fix such behavior by: 1) Recording first error from block group or device trimming So return value will also reflect any error found when trimming. Make developer more aware of the problem. 2) Outputting btrfs warning message for each trimming failure Any error for block group or device trimming will cause btrfs warning kernel message. 3) Continuing trimming if we can If we failed to trim one block group or device, we could still try next block group or device. Such behavior can avoid confusion for case like failure to trim the first block group and then only unallocated space is trimmed. Reported-by: Chris Murphy Signed-off-by: Qu Wenruo --- fs/btrfs/extent-tree.c | 59 -- 1 file changed, 43 insertions(+), 16 deletions(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 673ac4e01dd0..f830aa91ac3d 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -10948,6 +10948,16 @@ static int btrfs_trim_free_extents(struct btrfs_device *device, return ret; } +/* + * Trim the whole fs, by: + * 1) Trimming free space in each block group + * 2) Trimming unallocated space in each device + * + * Will try to continue trimming even if we failed to trim one block group or + * device. + * The return value will be the error return value of the first error. + * Or 0 if nothing wrong happened. + */ int btrfs_trim_fs(struct btrfs_fs_info *fs_info, struct fstrim_range *range) { struct btrfs_block_group_cache *cache = NULL; @@ -10958,6 +10968,8 @@ int btrfs_trim_fs(struct btrfs_fs_info *fs_info, struct fstrim_range *range) u64 end; u64 trimmed = 0; u64 total_bytes = btrfs_super_total_bytes(fs_info->super_copy); + int bg_ret = 0; + int dev_ret = 0; int ret = 0; /* @@ -10968,7 +10980,7 @@ int btrfs_trim_fs(struct btrfs_fs_info *fs_info, struct fstrim_range *range) else cache = btrfs_lookup_block_group(fs_info, range->start); - while (cache) { + for (; cache; cache = next_block_group(fs_info, cache)) { if (cache->key.objectid >= (range->start + range->len)) { btrfs_put_block_group(cache); break; @@ -10982,29 +10994,36 @@ int btrfs_trim_fs(struct btrfs_fs_info *fs_info, struct fstrim_range *range) if (!block_group_cache_done(cache)) { ret = cache_block_group(cache, 0); if (ret) { - btrfs_put_block_group(cache); - break; + btrfs_warn_rl(fs_info, + "failed to cache block group %llu ret %d", + cache->key.objectid, ret); + if (!bg_ret) + bg_ret = ret; + continue; } ret = wait_block_group_cache_done(cache); if (ret) { - btrfs_put_block_group(cache); - break; + btrfs_warn_rl(fs_info, + "failed to wait cache for block group %llu ret %d", + cache->key.objectid, ret); + if (!bg_ret) + bg_ret = ret; + continue; } } - ret = btrfs_trim_block_group(cache, -&group_trimmed, -start, -end, -range->minlen); + ret = btrfs_trim_block_group(cache, &group_trimmed, + start, end, range->minlen); trimmed += group_trimmed; if (ret) { - btrfs_put_block_group(cache); - break; + btrfs_warn_rl(fs_info, + "failed to trim block group %llu ret %d", + cache->key.objectid, ret); + if (!bg_ret) + bg_ret = ret; +
[PATCH v2.1 2/2] btrfs: Ensure btrfs_trim_fs can trim the whole fs
[BUG] fstrim on some btrfs only trims the unallocated space, not trimming any space in existing block groups. [CAUSE] Before fstrim_range passed to btrfs_trim_fs(), it get truncated to range [0, super->total_bytes). So later btrfs_trim_fs() will only be able to trim block groups in range [0, super->total_bytes). While for btrfs, any bytenr aligned to sector size is valid, since btrfs use its logical address space, there is nothing limiting the location where we put block groups. For btrfs with routine balance, it's quite easy to relocate all block groups and bytenr of block groups will start beyond super->total_bytes. In that case, btrfs will not trim existing block groups. [FIX] Just remove the truncation in btrfs_ioctl_fitrim(), so btrfs_trim_fs() can get the unmodified range, which is normally set to [0, U64_MAX]. Reported-by: Chris Murphy Fixes: f4c697e6406d ("btrfs: return EINVAL if start > total_bytes in fitrim ioctl") Cc: # v4.0+ Signed-off-by: Qu Wenruo --- changelog: v2: Locate the root cause in btrfs_ioctl_fitrim(), remove the truncation so we can still allow user to trim custom range. v2.1: Include the missing change in btrfs_trim_fs() and update the commit message to reflect this. --- fs/btrfs/extent-tree.c | 9 + fs/btrfs/ioctl.c | 11 +++ 2 files changed, 8 insertions(+), 12 deletions(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index f830aa91ac3d..f74958e11008 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -10972,14 +10972,7 @@ int btrfs_trim_fs(struct btrfs_fs_info *fs_info, struct fstrim_range *range) int dev_ret = 0; int ret = 0; - /* -* try to trim all FS space, our block group may start from non-zero. -*/ - if (range->len == total_bytes) - cache = btrfs_lookup_first_block_group(fs_info, range->start); - else - cache = btrfs_lookup_block_group(fs_info, range->start); - + cache = btrfs_lookup_first_block_group(fs_info, range->start); for (; cache; cache = next_block_group(fs_info, cache)) { if (cache->key.objectid >= (range->start + range->len)) { btrfs_put_block_group(cache); diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index fd172a93d11a..017fda31400d 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -365,7 +365,6 @@ static noinline int btrfs_ioctl_fitrim(struct file *file, void __user *arg) struct fstrim_range range; u64 minlen = ULLONG_MAX; u64 num_devices = 0; - u64 total_bytes = btrfs_super_total_bytes(fs_info->super_copy); int ret; if (!capable(CAP_SYS_ADMIN)) @@ -389,11 +388,15 @@ static noinline int btrfs_ioctl_fitrim(struct file *file, void __user *arg) return -EOPNOTSUPP; if (copy_from_user(&range, arg, sizeof(range))) return -EFAULT; - if (range.start > total_bytes || - range.len < fs_info->sb->s_blocksize) + + /* +* NOTE: Don't truncate the range using super->total_bytes. +* Bytenr of btrfs block group is in btrfs logical address space, +* which can be any sector size aligned bytenr in [0, U64_MAX]. +*/ + if (range.len < fs_info->sb->s_blocksize) return -EINVAL; - range.len = min(range.len, total_bytes - range.start); range.minlen = max(range.minlen, minlen); ret = btrfs_trim_fs(fs_info, &range); if (ret < 0) -- 2.15.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/3] btrfs-progs: check: exclude extents of tree blocks and extent items
On 11/28/2017 12:05 PM, Qu Wenruo wrote: On 2017年11月28日 10:38, Su Yue wrote: Thanks for review. On 11/27/2017 06:37 PM, Qu Wenruo wrote: On 2017年11月27日 11:13, Su Yue wrote: Commit d17d6663c99c ("btrfs-progs: lowmem check: Fix regression which screws up extent allocator") removes pin_metadata_blocks() from lowmem repair. So we have to find another way to exclude extents which should be occupied by tree blocks. First thing first, any tree block which is referred by will not be freed. Unless some special case, like extent tree initialization which clears the whole extent tree so we can't check if one tree block should be freed using extent tree, there is no need to explicitly pin existing tree blocks. Their creation/freeing is already handled well. If I understand the logic of extents correctly: The extents in free space cache are handled in the way like cache_block_group() in extent-tree.c. It traverses all extents items then marks all holes free in free space cache. Yes, in a normal situation, extents are already handled well. But original and lowmem check try to repair corrupted extent tree. Suppose a situation: 1) Let an extent item of one tree block is corrupted or missed. 2) The correct extent in free space cache will be marked as free. 3) Next CoW may allocate the "used" extents from free space and insert an extent item. 4) Lowmem repair increases refs of the extent item and causes a wrong extent item. OR 3) Lowmem repair inserts the extent item firstly. 4) CoW may allocate the "used" extents from free space. BUG_ON failure of inserting the extent item. OK, this explains things, but still not optimal. Such behavior should not happen if nothing is exposed. Pinning down all tree blocks is never a light operation and should be avoided if possible. Agreed. It would be nice to do it when starting a new transaction to modify the fs. Now, there is only one transaction while checking chunks and extents because of previous wrong call of pin_metadata_blocks(). It's meaningless to put it at start of new transaction now. I will split the transaction into many minors. Then lowmem repair will speed up if no error in chunks and extents. But it is just a little optimization. Please explain the reason why you need to pin extents first. What I do in the patch is like origin mode's. Pining extents first ensures every extents(corrupted and uncorrupted) will not be rellocated. Only extent tree reinit will pin down all metadata block in original mode while normal repair won't. So you're still doing something which is not done in original mode, which either needs to be explained in detail or fix original mode first. You are right. I did miss details of how original mode frees extent records. After come thinking, pining extents is still necessary in lowmem repair. Original mode has extent cache to record all extents and free normal extents while traversing tree blocks. After traversal, only corrupted extent records exist in the cache. At this moment, it pins them and start transactions to repair. Lowmem mode does not store anything like extent records. It begins transactions(in further patch) in traversal. We can not guarantee extents allocated from free space is occupied by one tree block which has not been traversed. Although pining extents is heavy and painfully work, it seems inevitable. Otherwise, I have no better idea than pining all extents. Thanks, Su Thanks, Su Thanks, Qu Modify pin_down_tree_blocks() only for code reuse. So behavior of pin_metadata_blocks() which works with option 'init-extent-tree' is not influenced. Introduce exclude_blocks_and_extent_items() to mark extents of all tree blocks dirty in fs_info->excluded_extents. It also calls exclude_extent_items() to exclude extent_items in extent tree. Signed-off-by: Su Yue --- cmds-check.c | 122 ++- 1 file changed, 112 insertions(+), 10 deletions(-) diff --git a/cmds-check.c b/cmds-check.c index c7b570bef9c3..f39285c5a3c2 100644 --- a/cmds-check.c +++ b/cmds-check.c @@ -13351,6 +13351,7 @@ out: return err; } +static int exclude_blocks_and_extent_items(struct btrfs_fs_info *fs_info); /* * Low memory usage version check_chunks_and_extents. */ @@ -13363,12 +13364,22 @@ static int check_chunks_and_extents_v2(struct btrfs_fs_info *fs_info) struct btrfs_root *root1; struct btrfs_root *root; struct btrfs_root *cur_root; + struct extent_io_tree excluded_extents; int err =; int ret; root =s_info->fs_root; if (repair) { + extent_io_tree_init(&excluded_extents); + fs_info->excluded_extents =excluded_extents; + ret =xclude_blocks_and_extent_items(fs_info); + if (ret) { + error("failed to exclude tree blocks and extent items"); + return ret; + } + reset_cached_block_groups(fs_info); + tra