Re: btrfs device delete /dev/sdc1 /mnt/raid1 user experience
Sorry I unsubscribed from linux-btrfs@vger.kernel.org since the traffic was a bit too high for me. On Tue, 7 Jun 2016, at 11:42 AM, Chris Murphy wrote: > Your command turned this from a 3 drive volume into a 2 drive volume, > it removed the drive you asked to be removed. I actually had 2 drives to begin with, but it wouldn't allow me to delete the drive without adding a new one, so I had 3 drives. I think I've confused you by my talk of "RAID-1 over 3 drives". I am not sure why RAIDX terminology is so confusing. All I want is for the data on one drive to be mirrored across them all. So if I add X drives, I want to X exact copies. But IIUC this changes the RAID level and everyone gets confused what I am talking about. > You're definitely missing something but once you understand how it > actually works, it'll be interesting if you have suggestions on how > the existing documentation confused you into making this mistake. Summary: I issued the remove command when I should have just removed it physically IIUC. If I was to do this again, I would unmount the raid1 mount. Take a disk physically out and then add the new one. So yes, it would be degraded but then as you mention, the newly drive will sort it out. One would hope. Thanks, -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: btrfs device delete /dev/sdc1 /mnt/raid1 user experience
On Mon, Jun 6, 2016 at 7:19 PM, Kai Hendrywrote: > On Mon, 6 Jun 2016, at 10:16 PM, Austin S. Hemmelgarn wrote: >> Based on the fact that you appear to want to carry a disk to copy data >> more quickly than over then internet, then what you've already done plus >> this is the correct way to do it. > > The trouble is the way I ended up doing it: > 1) Replacing it with a bigger disk > 2) Deleting the disk > 3) btrfs send /mnt/raid1/snapshot | btrfs receive /mnt/onetb/ > > This took almost **an entire day** over USB3 spinning disks ! > > My expectation was that I could quickly remove a mirrored disk, replace > that disk and hand carry the disk I removed. btrfs shouldn't remove data > from the disk I removed. Had this been a two drive raid1, all you need to do is take one drive with you. No commands necessary. Each drive can be separately mounted with -o degraded, and you can re-establish two copy replication with 'btrfs dev add' with a new drive and 'btrfs dev del missing' to get rid of the phantom missing drive. You'd do that locally and then also at your destination. A gotcha with this technique though is often single chunks get created during the -o degraded read-write mount. After 'dev del missing' completes the replication process, you need to check 'fi df' or 'fi us' for single chunks. If they exist, get rid of them with a filtered balance. '-dconvert=raid1,soft -mconvert=raid1,soft' should work I think. If there are single chunks created and not converted, later if you need to do a degraded mount it will fail. Usually it can be made to work with -o ro,degraded in such a case, but it means no more read-write for the file system, you'd have to recreate it (at this point until it's fixed). -- Chris Murphy -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: btrfs device delete /dev/sdc1 /mnt/raid1 user experience
On Sun, Jun 5, 2016 at 11:44 PM, Kai Hendrywrote: > Hi there, > > > I planned to remove one of my disks, so that I can take it from > Singapore to the UK and then re-establish another remote RAID1 store. > > delete is an alias of remove, so I added a new disk (devid 3) and > proceeded to run: > btrfs device delete /dev/sdc1 /mnt/raid1 (devid 1) Wrong command. This means you don't want sdc1 used in this volume anymore and in the process it is invalidated because to remove it from Btrfs means to remove its signature. > > > nuc:~$ uname -a > Linux nuc 4.5.4-1-ARCH #1 SMP PREEMPT Wed May 11 22:21:28 CEST 2016 > x86_64 GNU/Linux > nuc:~$ btrfs --version > btrfs-progs v4.5.3 > nuc:~$ sudo btrfs fi show /mnt/raid1/ > Label: 'extraid1' uuid: 5cab2a4a-e282-4931-b178-bec4c73cdf77 > Total devices 2 FS bytes used 776.56GiB > devid2 size 931.48GiB used 778.03GiB path /dev/sdb1 > devid3 size 1.82TiB used 778.03GiB path /dev/sdd1 OK I'm confused. You had a three drive Btrfs raid-1 and you expected to take 1 drive on a trip to establish that data elsewhere? That's not possible. The minimum number of drives for degraded mount of a 3 drive Btrfs raid1 is 2 drives. There aren't three copies of the data with a three drive raid1. There are two copies only spread across three drives. Your command turned this from a 3 drive volume into a 2 drive volume, it removed the drive you asked to be removed. > First off, I was expecting btrfs to release the drive pretty much > immediately. The command took about half a day to complete. I watched > `btrfs fi show` to see size of devid 1 (the one I am trying to remove) > to be zero and to see used space slowly go down whilst used space of > devid 3 (the new disk) slowly go up. Expected behavior, it was told you no longer wanted a 3 drive raid1, so the data on the removed drive was being replicated onto the other two drives to maintain two copies of your data on those two drives. > Secondly and most importantly my /dev/sdc1 can't be mounted now anymore. > Why? The data that was on that drive is still there, just the magic was invalidated as the final step in the operation. But that one drive doesn't contain all of your data anyway so by itself the one drive won't mount degraded, too many devices are missing. > > sudo mount -t btrfs /dev/sdc1 /mnt/test/ > mount: wrong fs type, bad option, bad superblock on /dev/sdc1, >missing codepage or helper program, or other error > >In some cases useful info is found in syslog - try >dmesg | tail or so. > > There is nothing in dmesg nor my journal. I wasn't expecting my drive to > be rendered useless on removing or am I missing something? You're definitely missing something but once you understand how it actually works, it'll be interesting if you have suggestions on how the existing documentation confused you into making this mistake. > I'm still keen to take a TB on a flight with me the day after tomorrow. > What is the recommended course of action? Recreate a mkfs.btrfs on > /dev/sdc1 and send data to it from /mnt/raid1? You can do that. The 2 disk raid1 can be used as the send, and the new fs on the removed 3rd drive can be used as the receive side. That would be per subvolume however. It would be faster to use a btrfs seed device for this. But as I went through the steps, I got confused how to remove the two raid1s. Both are read only in this case, so I don't know what happens, it'd require testing and the seed > sprout stuff isn't as well tested. But at least it's relatively safe since the original is flagged read only. > Still I hope the experience could be improved to remove a disk sanely. Yeah but we'll have to understand at what point you're confused. It seems like maybe you're not aware that three drive raid1 still means two copies of data, not three. Taking one drive would never have worked anyway. -- Chris Murphy -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: btrfs device delete /dev/sdc1 /mnt/raid1 user experience
On Mon, 6 Jun 2016, at 10:16 PM, Austin S. Hemmelgarn wrote: > Based on the fact that you appear to want to carry a disk to copy data > more quickly than over then internet, then what you've already done plus > this is the correct way to do it. The trouble is the way I ended up doing it: 1) Replacing it with a bigger disk 2) Deleting the disk 3) btrfs send /mnt/raid1/snapshot | btrfs receive /mnt/onetb/ This took almost **an entire day** over USB3 spinning disks ! My expectation was that I could quickly remove a mirrored disk, replace that disk and hand carry the disk I removed. btrfs shouldn't remove data from the disk I removed. Kind regards, -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v10 09/21] btrfs: dedupe: Inband in-memory only de-duplication implement
At 06/07/2016 03:54 AM, Mark Fasheh wrote: On Sat, Jun 04, 2016 at 06:26:39PM +0800, Qu Wenruo wrote: On 06/03/2016 10:27 PM, Josef Bacik wrote: On 06/01/2016 09:12 PM, Qu Wenruo wrote: At 06/02/2016 06:08 AM, Mark Fasheh wrote: On Fri, Apr 01, 2016 at 02:35:00PM +0800, Qu Wenruo wrote: Core implement for inband de-duplication. It reuse the async_cow_start() facility to do the calculate dedupe hash. And use dedupe hash to do inband de-duplication at extent level. The work flow is as below: 1) Run delalloc range for an inode 2) Calculate hash for the delalloc range at the unit of dedupe_bs 3) For hash match(duplicated) case, just increase source extent ref and insert file extent. For hash mismatch case, go through the normal cow_file_range() fallback, and add hash into dedupe_tree. Compress for hash miss case is not supported yet. Current implement restore all dedupe hash in memory rb-tree, with LRU behavior to control the limit. Signed-off-by: Wang XiaoguangSigned-off-by: Qu Wenruo --- fs/btrfs/extent-tree.c | 18 fs/btrfs/inode.c | 235 ++--- fs/btrfs/relocation.c | 16 3 files changed, 236 insertions(+), 33 deletions(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 53e1297..dabd721 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -37,6 +37,7 @@ #include "math.h" #include "sysfs.h" #include "qgroup.h" +#include "dedupe.h" #undef SCRAMBLE_DELAYED_REFS @@ -2399,6 +2400,8 @@ static int run_one_delayed_ref(struct btrfs_trans_handle *trans, if (btrfs_delayed_ref_is_head(node)) { struct btrfs_delayed_ref_head *head; +struct btrfs_fs_info *fs_info = root->fs_info; + /* * we've hit the end of the chain and we were supposed * to insert this extent into the tree. But, it got @@ -2413,6 +2416,15 @@ static int run_one_delayed_ref(struct btrfs_trans_handle *trans, btrfs_pin_extent(root, node->bytenr, node->num_bytes, 1); if (head->is_data) { +/* + * If insert_reserved is given, it means + * a new extent is revered, then deleted + * in one tran, and inc/dec get merged to 0. + * + * In this case, we need to remove its dedup + * hash. + */ +btrfs_dedupe_del(trans, fs_info, node->bytenr); ret = btrfs_del_csums(trans, root, node->bytenr, node->num_bytes); @@ -6713,6 +6725,12 @@ static int __btrfs_free_extent(struct btrfs_trans_handle *trans, btrfs_release_path(path); if (is_data) { +ret = btrfs_dedupe_del(trans, info, bytenr); +if (ret < 0) { +btrfs_abort_transaction(trans, extent_root, +ret); I don't see why an error here should lead to a readonly fs. --Mark Because such deletion error can lead to corruption. For example, extent A is already in hash pool. And when freeing extent A, we need to delete its hash, of course. But if such deletion fails, which means the hash is still in the pool, even the extent A no longer exists in extent tree. Except if we're in in-memory mode only it doesn't matter, so don't abort if we're in in-memory mode. Thanks, Josef If we can't ensure a hash is delete along with the extent, we will screw up the whole fs, as new write can points to non-exist extent. Although you're right with in-memory mode here, we won't abort trans, as inmem_del_hash() won't return error code. It will always return 0. Until a third party comes along and changes it to return an error code and neither you or I are there to remind them to fix this check (or have simply forgotten). So still, no need to change anyway. Personally I'd call this 'defensive coding' and do a check for in-memory only before our abort_trans(). This would have no effect on our running code but avoids the problem I stated above. Alternatively, you could clearly comment the exception. I don't like leaving it as-is for the reason I stated above. Thanks, --Mark The whole 'defensive coding' is here just because the V10 patchset comes with full function, including 2 backends and other things later. A lot of code like this is here because we know what will be added later. So this is true that it looks ridiculous until one knows there is an on-disk backend to be added. I'm OK to move the check to btrfs_dedupe_del(), but this makes me curious about the correct coding style for adding new function. If we have a clear view of the future functions , should we leave such interfaces for them? Or add them when adding the new functions? And what level of integration should be done inside btrfs codes? Should any caller of an exported btrfs
Re: [PATCH v10 09/21] btrfs: dedupe: Inband in-memory only de-duplication implement
On Sat, Jun 04, 2016 at 06:26:39PM +0800, Qu Wenruo wrote: > > > On 06/03/2016 10:27 PM, Josef Bacik wrote: > >On 06/01/2016 09:12 PM, Qu Wenruo wrote: > >> > >> > >>At 06/02/2016 06:08 AM, Mark Fasheh wrote: > >>>On Fri, Apr 01, 2016 at 02:35:00PM +0800, Qu Wenruo wrote: > Core implement for inband de-duplication. > It reuse the async_cow_start() facility to do the calculate dedupe > hash. > And use dedupe hash to do inband de-duplication at extent level. > > The work flow is as below: > 1) Run delalloc range for an inode > 2) Calculate hash for the delalloc range at the unit of dedupe_bs > 3) For hash match(duplicated) case, just increase source extent ref > and insert file extent. > For hash mismatch case, go through the normal cow_file_range() > fallback, and add hash into dedupe_tree. > Compress for hash miss case is not supported yet. > > Current implement restore all dedupe hash in memory rb-tree, with LRU > behavior to control the limit. > > Signed-off-by: Wang Xiaoguang> Signed-off-by: Qu Wenruo > --- > fs/btrfs/extent-tree.c | 18 > fs/btrfs/inode.c | 235 > ++--- > fs/btrfs/relocation.c | 16 > 3 files changed, 236 insertions(+), 33 deletions(-) > > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c > index 53e1297..dabd721 100644 > --- a/fs/btrfs/extent-tree.c > +++ b/fs/btrfs/extent-tree.c > @@ -37,6 +37,7 @@ > #include "math.h" > #include "sysfs.h" > #include "qgroup.h" > +#include "dedupe.h" > > #undef SCRAMBLE_DELAYED_REFS > > @@ -2399,6 +2400,8 @@ static int run_one_delayed_ref(struct > btrfs_trans_handle *trans, > > if (btrfs_delayed_ref_is_head(node)) { > struct btrfs_delayed_ref_head *head; > +struct btrfs_fs_info *fs_info = root->fs_info; > + > /* > * we've hit the end of the chain and we were supposed > * to insert this extent into the tree. But, it got > @@ -2413,6 +2416,15 @@ static int run_one_delayed_ref(struct > btrfs_trans_handle *trans, > btrfs_pin_extent(root, node->bytenr, > node->num_bytes, 1); > if (head->is_data) { > +/* > + * If insert_reserved is given, it means > + * a new extent is revered, then deleted > + * in one tran, and inc/dec get merged to 0. > + * > + * In this case, we need to remove its dedup > + * hash. > + */ > +btrfs_dedupe_del(trans, fs_info, node->bytenr); > ret = btrfs_del_csums(trans, root, > node->bytenr, > node->num_bytes); > @@ -6713,6 +6725,12 @@ static int __btrfs_free_extent(struct > btrfs_trans_handle *trans, > btrfs_release_path(path); > > if (is_data) { > +ret = btrfs_dedupe_del(trans, info, bytenr); > +if (ret < 0) { > +btrfs_abort_transaction(trans, extent_root, > +ret); > >>> > >>>I don't see why an error here should lead to a readonly fs. > >>>--Mark > >>> > >> > >>Because such deletion error can lead to corruption. > >> > >>For example, extent A is already in hash pool. > >>And when freeing extent A, we need to delete its hash, of course. > >> > >>But if such deletion fails, which means the hash is still in the pool, > >>even the extent A no longer exists in extent tree. > > > >Except if we're in in-memory mode only it doesn't matter, so don't abort > >if we're in in-memory mode. Thanks, > > > >Josef > > > > If we can't ensure a hash is delete along with the extent, we will > screw up the whole fs, as new write can points to non-exist extent. > > Although you're right with in-memory mode here, we won't abort > trans, as inmem_del_hash() won't return error code. It will always > return 0. Until a third party comes along and changes it to return an error code and neither you or I are there to remind them to fix this check (or have simply forgotten). > So still, no need to change anyway. Personally I'd call this 'defensive coding' and do a check for in-memory only before our abort_trans(). This would have no effect on our running code but avoids the problem I stated above. Alternatively, you could clearly comment the exception. I don't like leaving it as-is for the reason I stated above. Thanks, --Mark -- Mark Fasheh -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at
[PATCH v2] Btrfs: check if extent buffer is aligned to sectorsize
Thanks to fuzz testing, we can pass an invalid bytenr to extent buffer via alloc_extent_buffer(). An unaligned eb can have more pages than it should have, which ends up extent buffer's leak or some corrupted content in extent buffer. This adds a warning to let us quickly know what was happening. Now that alloc_extent_buffer() no more returns NULL, this changes its caller and callers of its caller to match with the new error handling. Signed-off-by: Liu Bo--- v2: - Add more fine-grained error handling to alloc_extent_buffer() and its callers - Use btrfs_err and -EINVAL instead of WARN_ONCE(). fs/btrfs/ctree.c | 2 ++ fs/btrfs/disk-io.c | 8 fs/btrfs/extent-tree.c | 10 ++ fs/btrfs/extent_io.c | 15 --- fs/btrfs/tree-log.c| 4 ++-- fs/btrfs/volumes.c | 4 ++-- 6 files changed, 28 insertions(+), 15 deletions(-) diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c index 427c36b..24c9fb2 100644 --- a/fs/btrfs/ctree.c +++ b/fs/btrfs/ctree.c @@ -2510,6 +2510,8 @@ read_block_for_search(struct btrfs_trans_handle *trans, if (!btrfs_buffer_uptodate(tmp, 0, 0)) ret = -EIO; free_extent_buffer(tmp); + } else { + ret = PTR_ERR(tmp); } return ret; } diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index ea78d77..c2a9cec 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -1098,7 +1098,7 @@ void readahead_tree_block(struct btrfs_root *root, u64 bytenr) struct inode *btree_inode = root->fs_info->btree_inode; buf = btrfs_find_create_tree_block(root, bytenr); - if (!buf) + if (IS_ERR(buf)) return; read_extent_buffer_pages(_I(btree_inode)->io_tree, buf, 0, WAIT_NONE, btree_get_extent, 0); @@ -1114,7 +1114,7 @@ int reada_tree_block_flagged(struct btrfs_root *root, u64 bytenr, int ret; buf = btrfs_find_create_tree_block(root, bytenr); - if (!buf) + if (IS_ERR(buf)) return 0; set_bit(EXTENT_BUFFER_READAHEAD, >bflags); @@ -1171,8 +1171,8 @@ struct extent_buffer *read_tree_block(struct btrfs_root *root, u64 bytenr, int ret; buf = btrfs_find_create_tree_block(root, bytenr); - if (!buf) - return ERR_PTR(-ENOMEM); + if (IS_ERR(buf)) + return buf; ret = btree_read_extent_buffer_pages(root, buf, 0, parent_transid); if (ret) { diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index a400951..d63 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -8010,8 +8010,9 @@ btrfs_init_new_buffer(struct btrfs_trans_handle *trans, struct btrfs_root *root, struct extent_buffer *buf; buf = btrfs_find_create_tree_block(root, bytenr); - if (!buf) - return ERR_PTR(-ENOMEM); + if (IS_ERR(buf)) + return buf; + btrfs_set_header_generation(buf, trans->transid); btrfs_set_buffer_lockdep_class(root->root_key.objectid, buf, level); btrfs_tree_lock(buf); @@ -8653,8 +8654,9 @@ static noinline int do_walk_down(struct btrfs_trans_handle *trans, next = btrfs_find_tree_block(root->fs_info, bytenr); if (!next) { next = btrfs_find_create_tree_block(root, bytenr); - if (!next) - return -ENOMEM; + if (IS_ERR(next)) + return PTR_ERR(next); + btrfs_set_buffer_lockdep_class(root->root_key.objectid, next, level - 1); reada = 1; diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index 6f38c2c..91c4d23 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -4883,18 +4883,25 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info, int uptodate = 1; int ret; + if (!IS_ALIGNED(start, fs_info->tree_root->sectorsize)) { + btrfs_err(fs_info, "bad tree block start %llu", start); + return ERR_PTR(-EINVAL); + } + eb = find_extent_buffer(fs_info, start); if (eb) return eb; eb = __alloc_extent_buffer(fs_info, start, len); if (!eb) - return NULL; + return ERR_PTR(-ENOMEM); for (i = 0; i < num_pages; i++, index++) { p = find_or_create_page(mapping, index, GFP_NOFS|__GFP_NOFAIL); - if (!p) + if (!p) { + exists = ERR_PTR(-ENOMEM); goto free_eb; + } spin_lock(>private_lock); if (PagePrivate(p)) { @@ -4939,8 +4946,10 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info, set_bit(EXTENT_BUFFER_UPTODATE, >bflags); again: ret =
Re: [PATCH 28/45] target: use bio op accessors
On 06/06/2016 05:40 PM, Mike Christie wrote: On 06/06/2016 01:46 AM, Hannes Reinecke wrote: On 06/05/2016 09:32 PM, mchri...@redhat.com wrote: From: Mike ChristieSeparate the op from the rq_flag_bits and have the target layer set/get the bio using bio_set_op_attrs/bio_op. Signed-off-by: Mike Christie --- drivers/target/target_core_iblock.c | 29 ++--- drivers/target/target_core_pscsi.c | 2 +- 2 files changed, 15 insertions(+), 16 deletions(-) diff --git a/drivers/target/target_core_iblock.c b/drivers/target/target_core_iblock.c index c25109c..22af12f 100644 --- a/drivers/target/target_core_iblock.c +++ b/drivers/target/target_core_iblock.c [ .. ] @@ -689,18 +690,15 @@ iblock_execute_rw(struct se_cmd *cmd, struct scatterlist *sgl, u32 sgl_nents, * Force writethrough using WRITE_FUA if a volatile write cache * is not enabled, or if initiator set the Force Unit Access bit. */ + op = REQ_OP_WRITE; if (test_bit(QUEUE_FLAG_FUA, >queue_flags)) { if (cmd->se_cmd_flags & SCF_FUA) - rw = WRITE_FUA; + op_flags = WRITE_FUA; else if (!test_bit(QUEUE_FLAG_WC, >queue_flags)) - rw = WRITE_FUA; - else - rw = WRITE; - } else { - rw = WRITE; + op_flags = WRITE_FUA; } Wrong intendation. It should be ok. That line is for the QUEUE_FLAG_WC test. We end up with: op = REQ_OP_WRITE; if (test_bit(QUEUE_FLAG_FUA, >queue_flags)) { if (cmd->se_cmd_flags & SCF_FUA) op_flags = WRITE_FUA; else if (!test_bit(QUEUE_FLAG_WC, >queue_flags)) op_flags = WRITE_FUA; } Indeed, you are correct. Reviewed-by: Hannes Reinecke Cheers, Hannes -- Dr. Hannes Reinecke zSeries & Storage h...@suse.de +49 911 74053 688 SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg) -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 28/45] target: use bio op accessors
On 06/06/2016 01:46 AM, Hannes Reinecke wrote: > On 06/05/2016 09:32 PM, mchri...@redhat.com wrote: >> From: Mike Christie>> >> Separate the op from the rq_flag_bits and have the target layer >> set/get the bio using bio_set_op_attrs/bio_op. >> >> Signed-off-by: Mike Christie >> --- >> drivers/target/target_core_iblock.c | 29 ++--- >> drivers/target/target_core_pscsi.c | 2 +- >> 2 files changed, 15 insertions(+), 16 deletions(-) >> >> diff --git a/drivers/target/target_core_iblock.c >> b/drivers/target/target_core_iblock.c >> index c25109c..22af12f 100644 >> --- a/drivers/target/target_core_iblock.c >> +++ b/drivers/target/target_core_iblock.c > [ .. ] >> @@ -689,18 +690,15 @@ iblock_execute_rw(struct se_cmd *cmd, struct >> scatterlist *sgl, u32 sgl_nents, >> * Force writethrough using WRITE_FUA if a volatile write cache >> * is not enabled, or if initiator set the Force Unit Access >> bit. >> */ >> +op = REQ_OP_WRITE; >> if (test_bit(QUEUE_FLAG_FUA, >queue_flags)) { >> if (cmd->se_cmd_flags & SCF_FUA) >> -rw = WRITE_FUA; >> +op_flags = WRITE_FUA; >> else if (!test_bit(QUEUE_FLAG_WC, >queue_flags)) >> -rw = WRITE_FUA; >> -else >> -rw = WRITE; >> -} else { >> -rw = WRITE; >> +op_flags = WRITE_FUA; >> } > Wrong intendation. It should be ok. That line is for the QUEUE_FLAG_WC test. We end up with: op = REQ_OP_WRITE; if (test_bit(QUEUE_FLAG_FUA, >queue_flags)) { if (cmd->se_cmd_flags & SCF_FUA) op_flags = WRITE_FUA; else if (!test_bit(QUEUE_FLAG_WC, >queue_flags)) op_flags = WRITE_FUA; } -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PULL] Self-tests updates for non-4k pages, for 4.7-rc3
Hi, this patchset enhances the self-tests to work on non-4k architectures (ppc64, arm64). The diffstat may appear big, but there are lots of call sites that just get another parameter (nodesize), so we now also test all valid combinations of pagesize and nodesize for a given arch. The patchset is part of preparatory work of the subpage-blocksize patchset, and it's updating only non-production code so it's IMO safe to merge in an rc. I have tested it on x86_64 to verify that it does not break the default usecase. The following changes since commit 56244ef151c3cd11f505020ab0b3f45454363bcc: Btrfs: fix handling of faults from btrfs_copy_from_user (2016-05-26 13:23:59 -0700) are available in the git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git for-chris-4.7-20160606 for you to fetch changes up to 34b3e6c92af1fa3f7067e4fa05ffa9d8bd41c96c: Btrfs: self-tests: Fix extent buffer bitmap test fail on BE system (2016-06-06 17:17:12 +0200) Feifei Xu (8): Btrfs: test_check_exists: Fix infinite loop when searching for free space entries Btrfs: Fix integer overflow when calculating bytes_per_bitmap Btrfs: self-tests: Support non-4k page size Btrfs: self-tests: Execute page straddling test only when nodesize < PAGE_SIZE Btrfs: self-tests: Support testing all possible sectorsizes and nodesizes Btrfs: self-tests: Use macros instead of constants and add missing newline Btrfs: self-tests: Fix test_bitmaps fail on 64k sectorsize Btrfs: self-tests: Fix extent buffer bitmap test fail on BE system fs/btrfs/ctree.c | 6 +- fs/btrfs/disk-io.c | 9 +- fs/btrfs/disk-io.h | 2 +- fs/btrfs/extent_io.c | 10 +- fs/btrfs/extent_io.h | 4 +- fs/btrfs/free-space-cache.c| 18 +- fs/btrfs/super.c | 52 +++-- fs/btrfs/tests/btrfs-tests.c | 6 +- fs/btrfs/tests/btrfs-tests.h | 27 +-- fs/btrfs/tests/extent-buffer-tests.c | 13 +- fs/btrfs/tests/extent-io-tests.c | 86 ++--- fs/btrfs/tests/free-space-tests.c | 76 +--- fs/btrfs/tests/free-space-tree-tests.c | 30 +-- fs/btrfs/tests/inode-tests.c | 344 ++--- fs/btrfs/tests/qgroup-tests.c | 111 ++- 15 files changed, 454 insertions(+), 340 deletions(-) -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: What is the worst case scenario of splitting an extent-tree's node or leaf?
On Mon, Jun 06, 2016 at 08:17:15PM +0800, Kaho Ng wrote: > Hi all, > > I have questions on how BTRFS handles the scenario that a split pf > extent-tree's node/leaf incurs another splits in the tree's > node/leaf. Is that scenario bounded? If it is, how could I calculate > the number that the split starts looping. For reference for anyone replying, we've already had this conversation as far as I could manage it on IRC a few days ago: http://logs.tvrrug.org.uk/logs/%23btrfs/2016-05-29.html#2016-05-29T10:13:53 Hugo. -- Hugo Mills | In one respect at least, the Martians are a happy hugo@... carfax.org.uk | people: they have no lawyers http://carfax.org.uk/ | PGP: E2AB1DE4 | John Carter, A Princess of Mars signature.asc Description: Digital signature
Re: [PATCH] btrfs: chunk_width_limit mount option
On Mon, Jun 06, 2016 at 09:43:19AM -0400, Andrew Armenia wrote: > On Mon, Jun 6, 2016 at 5:17 AM, David Sterbawrote: > > On Thu, Jun 02, 2016 at 09:50:15PM -0400, Andrew Armenia wrote: > >> This patch adds mount option 'chunk_width_limit=X', which when set forces > >> the chunk allocator to use only up to X devices when allocating a chunk. > >> This may help reduce the seek penalties seen in filesystems with large > >> numbers of devices. > > > > There is a use for such limit but passing it as a mount is not the right > > way to do it. Changing the value requires remount, different raid levels > > might need different values, it's not persistent etc. > > Thanks for the feedback. I agree that it's less than an ideal way to > do it. I guess I was a bit quick to release proof-of-concept code. > I'll try to do some more work on the configuration side of things and > resubmit. Last time something like this came up in discussion, the consensus was to put the configuration in xattrs in the btrfs namespace. There should be one with a name to indicate "global" configuration, applied to the top of subvol 5. If the feature allows configuration on a per-subvol or per-object basis, then there should also be a name for the relevant xattr (also in the btrfs namespace) that can be created on each object as required. Hugo. -- Hugo Mills | Klytus, I'm bored. What plaything can you offer me hugo@... carfax.org.uk | today? http://carfax.org.uk/ | PGP: E2AB1DE4 | Ming the Merciless, Flash Gordon signature.asc Description: Digital signature
Re: btrfs device delete /dev/sdc1 /mnt/raid1 user experience
On 2016-06-06 01:44, Kai Hendry wrote: Hi there, I planned to remove one of my disks, so that I can take it from Singapore to the UK and then re-establish another remote RAID1 store. delete is an alias of remove, so I added a new disk (devid 3) and proceeded to run: btrfs device delete /dev/sdc1 /mnt/raid1 (devid 1) nuc:~$ uname -a Linux nuc 4.5.4-1-ARCH #1 SMP PREEMPT Wed May 11 22:21:28 CEST 2016 x86_64 GNU/Linux nuc:~$ btrfs --version btrfs-progs v4.5.3 nuc:~$ sudo btrfs fi show /mnt/raid1/ Label: 'extraid1' uuid: 5cab2a4a-e282-4931-b178-bec4c73cdf77 Total devices 2 FS bytes used 776.56GiB devid2 size 931.48GiB used 778.03GiB path /dev/sdb1 devid3 size 1.82TiB used 778.03GiB path /dev/sdd1 nuc:~$ sudo btrfs fi df /mnt/raid1/ Data, RAID1: total=775.00GiB, used=774.39GiB System, RAID1: total=32.00MiB, used=144.00KiB Metadata, RAID1: total=3.00GiB, used=2.17GiB GlobalReserve, single: total=512.00MiB, used=0.00B First off, I was expecting btrfs to release the drive pretty much immediately. The command took about half a day to complete. I watched `btrfs fi show` to see size of devid 1 (the one I am trying to remove) to be zero and to see used space slowly go down whilst used space of devid 3 (the new disk) slowly go up. If you're using multiple devices, BTRFS has to move any data off of the device being removed onto other devices in the FS before it can remove it. Not doing so runs the risk of another disk failing before the balance that would otherwise be needed afterwards to maintain replication profiles completes, thus causing a loss of data. Secondly and most importantly my /dev/sdc1 can't be mounted now anymore. Why? Deleting the device from the array removes the data from it (as mentioned above), and wipes all BTRFS specific signatures as well. sudo mount -t btrfs /dev/sdc1 /mnt/test/ mount: wrong fs type, bad option, bad superblock on /dev/sdc1, missing codepage or helper program, or other error In some cases useful info is found in syslog - try dmesg | tail or so. There is nothing in dmesg nor my journal. I wasn't expecting my drive to be rendered useless on removing or am I missing something? If your getting such a message and there's nothing in dmesg, then it's because there's no filesystem of the requested type on the device you specified. nuc:~$ sudo fdisk -l /dev/sdc Disk /dev/sdc: 931.5 GiB, 1000204885504 bytes, 1953525167 sectors Units: sectors of 1 * 512 = 512 bytes Sector size (logical/physical): 512 bytes / 4096 bytes I/O size (minimum/optimal): 4096 bytes / 33553920 bytes Disklabel type: gpt Disk identifier: 19938642-3B10-4220-BF99-3E12AF1D1CF6 Device StartEndSectors Size Type /dev/sdc1 2048 1953525133 1953523086 931.5G Linux filesystem On #btrfs IRC channel I'm told: hendry: breaking multi-disk filesystems in half is not a recommended way to do "normal operations" :D Which is exactly correct. Trying to split off a single device the way you want is extremely dangerous and risks breaking the source filesystem as well as not having a complete copy of the data on the receiving end. I'm still keen to take a TB on a flight with me the day after tomorrow. What is the recommended course of action? Recreate a mkfs.btrfs on /dev/sdc1 and send data to it from /mnt/raid1? Based on the fact that you appear to want to carry a disk to copy data more quickly than over then internet, then what you've already done plus this is the correct way to do it. Still I hope the experience could be improved to remove a disk sanely. It did exactly what you told it to, it removed the device from the array. The misunderstanding here is as to what that means. Removing a disk from a BTRFS filesystem means that you can't mount that disk anymore, and for all intents and purposes, it is now blank media (in practice, it only wipes the superblocks and some of the metadata, but that's irrelevant to what your asking, as you can't recover the filesystem from what's left). -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Transaction aborted in btrfs_rename2
On 6/6/16 7:47 AM, Adam Borowski wrote: > Hi! > I just got this thrice, in 4.7-rc1 and 4.7-rc2: > > [ 1836.672368] [ cut here ] > [ 1836.672382] WARNING: CPU: 1 PID: 16348 at fs/btrfs/inode.c:9820 > btrfs_rename2+0xcd2/0x2a50 > [ 1836.672385] BTRFS: Transaction aborted (error -2) > [ 1836.672387] Modules linked in: nvidia(PO) usb_storage > [ 1836.672396] CPU: 1 PID: 16348 Comm: gcc-6 Tainted: P O > 4.7.0-rc2-debug+ #3 > [ 1836.672399] Hardware name: System manufacturer System Product Name/M4A77T, > BIOS 240105/18/2011 > [ 1836.672402] 81f8b504 880062c47c78 8165be6d > 0007 > [ 1836.672407] 880062c47cd0 880062c47cc0 > 81110c1c > [ 1836.672411] 880062c47d20 265c814e8642 > 00a25ade > [ 1836.672415] Call Trace: > [ 1836.672423] [] dump_stack+0x4e/0x71 > [ 1836.672429] [] __warn+0x10c/0x150 > [ 1836.672433] [] warn_slowpath_fmt+0x4a/0x50 > [ 1836.672437] [] btrfs_rename2+0xcd2/0x2a50 > [ 1836.672443] [] ? btrfs_permission+0x5b/0xc0 > [ 1836.672448] [] ? down_write+0x18/0x60 > [ 1836.672453] [] vfs_rename+0x7cc/0xc30 > [ 1836.672457] [] SyS_rename+0x32b/0x420 > [ 1836.672461] [] entry_SYSCALL_64_fastpath+0x17/0x93 > [ 1836.672464] ---[ end trace 6405b6e3d0e6c945 ]--- > [ 1836.672468] BTRFS warning (device sda1): btrfs_rename:9820: Aborting > unused transaction(No such entry). > [ 1836.675505] BTRFS warning (device sda1): btrfs_rename:9820: Aborting > unused transaction(No such entry). > > [ 1837.935238] BTRFS warning (device sda1): btrfs_rename:9820: Aborting > unused transaction(No such entry). > [ 1837.937602] BTRFS: error (device sda1) in btrfs_rename:9820: errno=-2 No > such entry > [ 1837.937607] BTRFS info (device sda1): forced readonly > [ 1838.086754] BTRFS warning (device sda1): Skipping commit of aborted > transaction. > [ 1838.086762] BTRFS: error (device sda1) in cleanup_transaction:1857: > errno=-2 No such entry > [ 1838.086782] BTRFS info (device sda1): delayed_refs has NO entry > > Didn't trigger during a week of other work, yet a kernel compile triggers > this reliably. > > Filesystem appears consistent (btrfs check, scrub). > Mount options: noatime,compress=lzo,ssd,space_cache. > Oh, interesting. We're seeing this on our 4.4-based kernels as well but only on arm64. That it's triggering on x86_64 is a good data point. I'm hunting this one today. -Jeff -- Jeff Mahoney SUSE Labs signature.asc Description: OpenPGP digital signature
Re: [PATCH] btrfs: chunk_width_limit mount option
On Mon, Jun 6, 2016 at 5:17 AM, David Sterbawrote: > On Thu, Jun 02, 2016 at 09:50:15PM -0400, Andrew Armenia wrote: >> This patch adds mount option 'chunk_width_limit=X', which when set forces >> the chunk allocator to use only up to X devices when allocating a chunk. >> This may help reduce the seek penalties seen in filesystems with large >> numbers of devices. > > There is a use for such limit but passing it as a mount is not the right > way to do it. Changing the value requires remount, different raid levels > might need different values, it's not persistent etc. Thanks for the feedback. I agree that it's less than an ideal way to do it. I guess I was a bit quick to release proof-of-concept code. I'll try to do some more work on the configuration side of things and resubmit. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Recommended why to use btrfs for production?
On 2016-06-05 22:40, James Johnston wrote: On 06/06/2016 at 01:47, Chris Murphy wrote: On Sun, Jun 5, 2016 at 4:45 AM, Mladen Milinkovicwrote: On 06/03/2016 04:05 PM, Chris Murphy wrote: Make certain the kernel command timer value is greater than the driver error recovery timeout. The former is found in sysfs, per block device, the latter can be get and set with smartctl. Wrong configuration is common (it's actually the default) when using consumer drives, and inevitably leads to problems, even the loss of the entire array. It really is a terrible default. Since it's first time i've heard of this I did some googling. Here's some nice article about these timeouts: http://strugglers.net/~andy/blog/2015/11/09/linux-software-raid-and-drive- timeouts/comment-page-1/ And some udev rules that should apply this automatically: http://comments.gmane.org/gmane.linux.raid/48193 Yes it's a constant problem that pops up on the linux-raid list. Sometimes the list is quiet on this issue but it really seems like it's once a week. From last week... http://www.spinics.net/lists/raid/msg52447.html It seems like it would be useful if the distributions or the kernel could automatically set the kernel timeout to an appropriate value. If the TLER can be indeed be queried via smartctl, then it would be easy to automatically read it, and then calculate a suitable timeout. A RAID-oriented drive would end up leaving the current 30 seconds, while if it can't successfully query for TLER or the drive just doesn't support it, then assume a consumer drive and set timeout for 180 seconds. That way, zero user configuration would be needed in the common case. Or is it not that simple? Strictly speaking, it's policy, and therefore shouldn't be in the kernel. It's not hard to write a script to handle this though, both hdparm and smartctl can set the SCT ERC value, and will report an error if it fails, so you can try and set the value as you want (I personally would go with 10 seconds instead of 7), and if that fails, bump the kernel command timout. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Recommended why to use btrfs for production?
On 2016-06-03 21:48, Chris Murphy wrote: On Fri, Jun 3, 2016 at 6:48 PM, Nicholas D Steeveswrote: On 3 June 2016 at 11:33, Austin S. Hemmelgarn wrote: On 2016-06-03 10:11, Martin wrote: Make certain the kernel command timer value is greater than the driver error recovery timeout. The former is found in sysfs, per block device, the latter can be get and set with smartctl. Wrong configuration is common (it's actually the default) when using consumer drives, and inevitably leads to problems, even the loss of the entire array. It really is a terrible default. Are nearline SAS drives considered consumer drives? If it's a SAS drive, then no, especially when you start talking about things marketed as 'nearline'. Additionally, SCT ERC is entirely a SATA thing, I forget what the equivalent in SCSI (and by extension SAS) terms is, but I'm pretty sure that the kernel handles things differently there. For the purposes of BTRFS RAID1: For drives that ship with SCT ERC of 7sec, is the default kernel command timeout of 30sec appropriate, or should it be reduced? It's fine. But it depends on your use case, if it can tolerate a rare 7 second < 30 second hang, and you're prepared to start investigating the cause then I'd leave it alone. If the use case prefers resetting the drive when it stops responding, then you'd go with something shorter. I'm fairly certain SAS's command queue doesn't get obliterated with such a link reset, just the hung command; where SATA drives all information in the queue is lost. So resets on SATA are a much bigger penalty if I have the correct understanding. There's also more involved otherwise with a ATA link reset because AHCI controllers aren't MP safe, so there's a global lock that has to be held while talking to them. Because of this, a link reset on an ATA drive (be it SATA or PATA) will cause performance degradation for all other devices on that controller as well until the reset is complete. For SATA drives that do not support SC TERC, is it true that 120sec is a sane value? I forget where I got this value of 120sec; It's a good question. It's not well documented, is not defined in the SATA spec, so it's probably make/model specific. The linux-raid@ list probably has the most information on this just because their users get nailed by this problem often. And the recommendation does seem to vary around 120 to 180. That is of course a maximum. The drive could give up much sooner. But what you don't want is for the drive to be in recovery for a bad sector, and the command timer does a link reset, losing all of what the drive was doing: all of which is replaceable except really one thing which is what sector was having the problem. And right now there's no report of the drive for slow sectors. It only reports failed reads, and it's that failed read error that includes the sector, so that the raid mechanism can figure out what data is missing, recongistruct from mirror or parity, and then fix the bad sector by writing to it. FWIW, I usually go with 150 on the Seagate 'Desktop' drives I use. I've seen some cheap Hitachi and Toshiba disks that need it as high as 300 though to work right. it might have been this list, it might have been an mdadm bug report. Also, in terms of tuning, I've been unable to find whether the ideal kernel timeout value changes depending on RAID type...is that a factor in selecting a sane kernel timeout value? No. It's strictly a value to make certain you get read errors from the drive rather than link resets. You have to factor in how the controller handles things too. SOme of them will retry just like a desktop drive, and you need to account for that. And that's why I think it's a bad default, because it totally thwarts attempts by manufacturers to recover marginal sectors, even in the single disk case. That's debatable, by attempting to recover the bad sector, they're slowing down the whole system. The likelihood of recovering a bad sectors functionally falls off linearly the longer you try, and not having the ability to choose when to report an error is the bigger issue here. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: btrfs
On 2016-06-03 21:51, Christoph Anton Mitterer wrote: On Fri, 2016-06-03 at 15:50 -0400, Austin S Hemmelgarn wrote: There's no point in trying to do higher parity levels if we can't get regular parity working correctly. Given the current state of things, it might be better to break even and just rewrite the whole parity raid thing from scratch, but I doubt that anybody is willing to do that. Well... as I've said, things are pretty worrying. Obviously I cannot really judge, since I'm not into btrfs' development... maybe there's a lack of manpower? Since btrfs seems to be a very important part (i.e. next-gen fs), wouldn't it be possible to either get some additional funding by the Linux Foundation, or possible that some of the core developers make an open call for funding by companies? Having some additional people, perhaps working fulltime on it, may be a big help. As for the RAID... given how many time/effort is spent now into 5/6,.. it really seems that one should have considered multi-parity from the beginning on. Kinda feels like either, with multi-parity this whole instability phase would start again, or it will simply never happen. New features will always cause some instability, period, there is no way to avoid that. - Serious show-stoppers and security deficiencies like the UUID collision corruptions/attacks that have been extensively discussed earlier, are still open The UUID issue is not a BTRFS specific one, it just happens to be easier to cause issues with it on BTRFS uhm this had been discussed extensively before, as I've said... AFAICS btrfs is the only system we have, that can possibly cause data corruption or even security breach by UUID collisions. I wouldn't know that other fs, or LVM are affected, these just continue to use those devices already "online"... and I think lvm refuses to activate VGs, if conflicting UUIDs are found. If you are mounting by UUID, it is entirely non-deterministic which filesystem with that UUID will be mounted (because device enumeration is non-deterministic). As far as LVM, it refuses activating VG's, but it can still have issues if you have LV's with the same UUID (which can be done pretty trivially), and the fact that it refuses to activate them technically constitutes a DoS attack (because you can't use the resources). There is no way to solve it sanely given the requirement that userspace not be broken. No this is not true. Back when this was discussed, I and others described how it could/should be done,... respectively how userspace/kernel should behave, in short: - continue using those devices that are already active This is easy, but only works for mounted filesystems. - refusing to (auto)assemble by UUID, if there are conflicts or requiring to specify the devices (with some --override-yes-i-know- what-i-do option option or so) - in case of assembling/rebuilding/similar... never doing this automatically These two allow anyone with the ability to plug in a USB device to DoS the system. I think there were some more corner cases, I basically had them all discussed in the thread back then (search for "attacking btrfs filesystems via UUID collisions?" and IIRC some different titled parent or child threads). Properly fixing this would likely make us more dependent on hardware configuration than even mounting by device name. Sure, if there are colliding UUIDs, and one still wants to mount (by using some --override-yes-i-know-what-i-do option),.. it would need to be by specifying the device name... But where's the problem? This would anyway only happen if someone either attacks or someone made a clone, and it's far better to refuse automatic assembly in cases where accidental corruption can happen or where attacks may be possible, requiring the user/admin to manually take action, than having corruption or security breach. Refusing automatic assembly does not prevent the attack, it simply converts it from a data corruption attack to a DoS attack. Imagine the simple case: degraded RAID1 on a PC; if btrfs would do some auto-rebuild based on UUID, then if an attacker knows that he'd just need to plug in a USB disk with a fitting UUID...and easily gets a copy of everything on disk, gpg keys, ssh keys, etc. If the attacker has physical access to the machine, it's irrelevant even with such protection, as there are all kinds of other things that could be done to get data off of the disk (especially if the system has thunderbolt ports or USB C ports). If the user has any unsecured encryption or authentication tokens on the system, they're screwed anyway though. - a number of important core features not fully working in many situations (e.g. the issues with defrag, not being ref-link aware,... an I vaguely remember similar things with compression). OK, how then should defrag handle reflinks? Preserving them prevents it from being able to completely defragment data. Didn't that even work in the past and had
What is the worst case scenario of splitting an extent-tree's node or leaf?
Hi all, I have questions on how BTRFS handles the scenario that a split pf extent-tree's node/leaf incurs another splits in the tree's node/leaf. Is that scenario bounded? If it is, how could I calculate the number that the split starts looping. Regards, Kaho Ng -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: RAID1 vs RAID10 and best way to set up 6 disks
On 2016-06-05 16:31, Christoph Anton Mitterer wrote: On Sun, 2016-06-05 at 09:36 -0600, Chris Murphy wrote: That's ridiculous. It isn't incorrect to refer to only 2 copies as raid1. No, if there are only two devices then not. But obviously we're talking about how btrfs does RAID1, in which even with n>2 devices there are only 2 copies - that's incorrect. Go read the original standards that defined the term RAID (assuming you can find a openly accessible copy), it defines RAID1 as a mirrored _pair_ of disks. This is how every hardware RAID controller I've ever seen implements RAID1, and in fact how most software RAID implementations (including the fake raid in some motherboards) implements it with the sole exception of Linux's MD-RAID and it's direct derivatives (which includes LVM/DM based RAID, as well as BSD's GEOM framework). You have to explicitly ask both mdadm Aha, and which option would that be? Specifying more than two disks. The request is more correctly an implicit one, but the fact that it's implied by a now largely obsolete piece of software does not mean that BTRFS should have the same implications. and lvcreate for the number of copies you want, it doesn't automatically happen. I've said that before, but at least it allows you to use the full number of disks, so we're again back to that it's closer to the original and common meaning of RAID1 than what btrfs does. /me inserts reflink to the first part of my reply. The man page for mkfs.btrfs is very clear you only get two copies. I haven't denied that... but one shouldn't use terms that are commonly understood in a different mannor and require people to read all the small printed. One could also have changed it's RAID0 with RAID1, and I guess people wouldn't be too delighted if the excuse was "well it's in the manpage". You can leave the hyperbolic theoreticals out of this, they really do detract from your argument. Well I'd say, for btrfs: do away with the term "RAID" at all, use e.g.: linear = just a bunch of devices put together, no striping basically what MD's linear is Except this isn't really how Btrfs single works. The difference between mdadm linear and Btrfs single is more different in behavior than the difference between mdadm raid1 and btrfs raid1. So you're proposing tolerating a bigger difference, while criticizing a smaller one. *shrug* What's the big difference? Would you care to explain? But I'm happy with "single" either, it just doesn't really tell that there is no striping, I mean "single" points more towards "we have no resilience but only 1 copy", whether this is striped or not. On this point I actually do kind of agree with you, but Chris is also correct here, BTRFS single mode is just as different from MD linear mode as BTRFS raid1 is from MD RAID1, if not more so. If a metaphor is going to be used for a technical thing, it would be mirrors or mirroring. Mirror would mean exactly two (the original and the mirror). See lvcreate --mirrors. Also, the lvm mirror segment type is legacy, having been replaced with raid1 (man lvcreate uses the term raid1, not RAID1 or RAID-1). So I'm not a big fan of this term. Admittedly, I didn't like the "mirror(s)" either... I was just trying to show that different names could be used that are already a bit better. striped = basically what RAID0 is lvcreate uses only striped, not raid0. mdadm uses only RAID0, not striped. Since striping is also employed with RAIDs 4, 5, 6, 7, it seems ambiguous even though without further qualification whether parity exists, it's considered to mean non-parity striping. The ambiguity is probably less of a problem than the contradiction that is RAID0. Mhh,.. well or one makes schema names that contain all possible properties of a "RAID", something like: replicasN-parityN-[not]striped SINGLE would be something like "replicas1-parity0-notstriped". RAID5 would be something like "replicas0-parity1-striped". It's worth pointing out that both programmers and sysadmins are still lazy typists, so it would more likely end up being: rep1-par0-strip0 rep0-par1-stripN (with N being the number of desired stripes). Having a number to indicate the striping is actually useful (there are legitimate cases for not striping across everything we can, and we need some way to represent stripes that weren't allocated at full width for some reason). Such a scheme was actually proposed back when the higher order parity patches were being discussed. Like those patches, it was decided to wait until we had basic feature completeness before trying to tackle that. And just mention in the manpage, which of these names comes closest to what people understand by RAID level i. It already does this. What version of btrfs-progs are you basing your criticism on that there's some inconsistency, deficiency, or ambiguity when it comes to these raid levels? Well first, the terminology thing is the least serious
Transaction aborted in btrfs_rename2
Hi! I just got this thrice, in 4.7-rc1 and 4.7-rc2: [ 1836.672368] [ cut here ] [ 1836.672382] WARNING: CPU: 1 PID: 16348 at fs/btrfs/inode.c:9820 btrfs_rename2+0xcd2/0x2a50 [ 1836.672385] BTRFS: Transaction aborted (error -2) [ 1836.672387] Modules linked in: nvidia(PO) usb_storage [ 1836.672396] CPU: 1 PID: 16348 Comm: gcc-6 Tainted: P O 4.7.0-rc2-debug+ #3 [ 1836.672399] Hardware name: System manufacturer System Product Name/M4A77T, BIOS 240105/18/2011 [ 1836.672402] 81f8b504 880062c47c78 8165be6d 0007 [ 1836.672407] 880062c47cd0 880062c47cc0 81110c1c [ 1836.672411] 880062c47d20 265c814e8642 00a25ade [ 1836.672415] Call Trace: [ 1836.672423] [] dump_stack+0x4e/0x71 [ 1836.672429] [] __warn+0x10c/0x150 [ 1836.672433] [] warn_slowpath_fmt+0x4a/0x50 [ 1836.672437] [] btrfs_rename2+0xcd2/0x2a50 [ 1836.672443] [] ? btrfs_permission+0x5b/0xc0 [ 1836.672448] [] ? down_write+0x18/0x60 [ 1836.672453] [] vfs_rename+0x7cc/0xc30 [ 1836.672457] [] SyS_rename+0x32b/0x420 [ 1836.672461] [] entry_SYSCALL_64_fastpath+0x17/0x93 [ 1836.672464] ---[ end trace 6405b6e3d0e6c945 ]--- [ 1836.672468] BTRFS warning (device sda1): btrfs_rename:9820: Aborting unused transaction(No such entry). [ 1836.675505] BTRFS warning (device sda1): btrfs_rename:9820: Aborting unused transaction(No such entry). [ 1837.935238] BTRFS warning (device sda1): btrfs_rename:9820: Aborting unused transaction(No such entry). [ 1837.937602] BTRFS: error (device sda1) in btrfs_rename:9820: errno=-2 No such entry [ 1837.937607] BTRFS info (device sda1): forced readonly [ 1838.086754] BTRFS warning (device sda1): Skipping commit of aborted transaction. [ 1838.086762] BTRFS: error (device sda1) in cleanup_transaction:1857: errno=-2 No such entry [ 1838.086782] BTRFS info (device sda1): delayed_refs has NO entry Didn't trigger during a week of other work, yet a kernel compile triggers this reliably. Filesystem appears consistent (btrfs check, scrub). Mount options: noatime,compress=lzo,ssd,space_cache. -- An imaginary friend squared is a real enemy. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] btrfs: chunk_width_limit mount option
On Thu, Jun 02, 2016 at 09:50:15PM -0400, Andrew Armenia wrote: > This patch adds mount option 'chunk_width_limit=X', which when set forces > the chunk allocator to use only up to X devices when allocating a chunk. > This may help reduce the seek penalties seen in filesystems with large > numbers of devices. There is a use for such limit but passing it as a mount is not the right way to do it. Changing the value requires remount, different raid levels might need different values, it's not persistent etc. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 2/2] btrfs-progs: convert-tests: Add test case for backup superblock migration
On Fri, Jun 03, 2016 at 10:34:26AM +0800, Qu Wenruo wrote: > New convert introduced simpler chunk/extent allocation algorithm, at the > cost of complex backup superblock migration codes. > > Use specially built ext2 images to test if btrfs-convert can convert and > rollback images without problem. > > All these special ext2 image have blocks/holes across 2nd btrfs backup > superblock. > > The naming of test image is like the following: > |<--superblock migration range->| > 64M 64M + 64K > |-Data--|-Data--|/Hole//|-Data--|/Hole//|-Data--|--Data--| = drdhdhdrd > > These test cases should check all typical layouts and make sure new > convert works. > > Signed-off-by: Qu WenruoApplied, thanks. > + extracted=$(extract_image "$src") > + run_check $SUDO_HELPER e2fsck -n -f $extracted > + run_check $SUDO_HELPER $TOP/btrfs-convert $extracted > + run_check $SUDO_HELPER $TOP/btrfs check $extracted > + run_check $SUDO_HELPER $TOP/btrfs-show-super $extracted > + > + run_check $SUDO_HELPER mount -o loop $extracted $TEST_MNT > + run_check $SUDO_HELPER e2fsck -n -f $TEST_MNT/ext2_saved/image > + run_check $SUDO_HELPER umount $TEST_MNT > + > + run_check $SUDO_HELPER $TOP/btrfs check $extracted > + run_check $SUDO_HELPER $TOP/btrfs-convert -r $extracted > + run_check $SUDO_HELPER e2fsck -n -f $extracted Most of the SUDO_HELPER calls are not necessary, and the tests failed when run under user (on NFS where root cannot do everything), so I did some additional changes. The separate convert dir should be now ready for adding more images easily. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 1/2] btrfs-progs: convert-tests: Add support for custom test scripts
On Fri, Jun 03, 2016 at 10:34:25AM +0800, Qu Wenruo wrote: > Add support for custom convert test scripts, just like fsck tests. > > Instead of generic convert tests, we need more specifically created images > for new convert tests. > > This patch provide the needed infrastructure for later convert test > cases. > > Signed-off-by: Qu WenruoApplied, thanks. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 2/2] Btrfs: add valid checks for chunk loading
On Fri, Jun 03, 2016 at 12:05:15PM -0700, Liu Bo wrote: > To prevent fuzz filesystem images from panic the whole system, > we need various validation checks to refuse to mount such an image > if btrfs finds any invalid value during loading chunks, including > both sys_array and regular chunks. > > Note that these checks may not be sufficient to cover all corner cases, > feel free to add more checks. > > Reported-by: Vegard Nossum> Reported-by: Quentin Casasnovas > Signed-off-by: Liu Bo Reviewed-by: David Sterba (switched the printk to btrfs_*) -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/4] btrfs: return all mirror whether need_raid_map set or not
At 06/06/2016 04:21 PM, David Sterba wrote: On Tue, Dec 15, 2015 at 07:09:48PM +0800, Zhao Lei wrote: __btrfs_map_block() should return all mirror on WRITE, REQ_GET_READ_MIRRORS, and RECOVERY case, whether need_raid_map set or not. need_raid_map only used to control is to set bbio->raid_map. Current code works right becuase there is only one caller can trigger above bug, which is readahead, and this function happened to bypass on less mirror. But after we fixed __btrfs_map_block(), readahead will be really works, and exit with warning at another bug. This patchset fixed __btrfs_map_block(), and disable raid56 readahead temp temporary, (actually, it already disable by this bug), and I'll fix raid56 readahead next. Zhao Lei (4): btrfs: Disable raid56 readahead btrfs: return all mirror whether need_raid_map set or not btrfs: Small cleanup for get index_srcdev loop btrfs: Use direct way to determine raid56 write/recover mode Are these patches still relevant? If yes, please update and resed, thanks. Hi David, Zhao doesn't work on btrfs any more. So I'll re-check these patches and see if they are still needed. If so, I'll rebase (some of) them and resend. Thanks, Qu -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 1/2] Btrfs: add more valid checks for superblock
On Fri, Jun 03, 2016 at 12:05:14PM -0700, Liu Bo wrote: > This adds valid checks for super_total_bytes, super_bytes_used and > super_stripesize, super_num_devices. > > Reported-by: Vegard Nossum> Reported-by: Quentin Casasnovas > Signed-off-by: Liu Bo Reviewed-by: David Sterba I'll switch the printk(KERN_ to the btrfs_* helpers and do minor tweaks in the messages, but otherwise this version looks good to me. I don't think we need the helper for stripe value checks as I proposed. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/4] btrfs: return all mirror whether need_raid_map set or not
On Tue, Dec 15, 2015 at 07:09:48PM +0800, Zhao Lei wrote: > __btrfs_map_block() should return all mirror on WRITE, > REQ_GET_READ_MIRRORS, and RECOVERY case, whether need_raid_map set > or not. > > need_raid_map only used to control is to set bbio->raid_map. > > Current code works right becuase there is only one caller can > trigger above bug, which is readahead, and this function happened > to bypass on less mirror. > > But after we fixed __btrfs_map_block(), readahead will be really > works, and exit with warning at another bug. > This patchset fixed __btrfs_map_block(), and disable raid56 > readahead temp temporary, (actually, it already disable by this bug), > and I'll fix raid56 readahead next. > > Zhao Lei (4): > btrfs: Disable raid56 readahead > btrfs: return all mirror whether need_raid_map set or not > btrfs: Small cleanup for get index_srcdev loop > btrfs: Use direct way to determine raid56 write/recover mode Are these patches still relevant? If yes, please update and resed, thanks. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Btrfs: clear uptodate flags of pages in sys_array eb
On Fri, Jun 03, 2016 at 06:16:18PM -0700, Liu Bo wrote: > On Fri, Jun 03, 2016 at 05:41:42PM -0700, Liu Bo wrote: > > We set uptodate flag to pages in the temporary sys_array eb, > > but do not clear the flag after free eb. As the special > > btree inode may still hold a reference on those pages, the > > uptodate flag can remain alive in them. > > > > If btrfs_super_chunk_root has been intentionally changed to the > > offset of this sys_array eb, reading chunk_root will read content > > of sys_array and it will pass our beautiful checks in > > s/pass/skip/ > > My mistake, sorry. Updated in the patch. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 42/45] block, fs, drivers: remove REQ_OP compat defs and related code
On 06/05/2016 09:32 PM, mchri...@redhat.com wrote: > From: Mike Christie> > This patch drops the compat definition of req_op where it matches > the rq_flag_bits definitions, and drops the related old and compat > code that allowed users to set either the op or flags for the operation. > > We also then store the operation in the bi_rw/cmd_flags field similar > to how we used to store the bio ioprio where it sat in the upper bits > of the field. > > Signed-off-by: Mike Christie > --- > drivers/scsi/sd.c | 2 +- > include/linux/bio.h | 3 --- > include/linux/blk_types.h | 52 > + > include/linux/blkdev.h | 14 > include/linux/fs.h | 37 +--- > include/trace/events/f2fs.h | 1 - > 6 files changed, 46 insertions(+), 63 deletions(-) > Reviewed-by: Hannes Reinecke Cheers, Hannes -- Dr. Hannes ReineckeTeamlead Storage & Networking h...@suse.de +49 911 74053 688 SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton HRB 21284 (AG Nürnberg) -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 41/45] block, drivers, fs: shrink bi_rw from long to int
On 06/05/2016 09:32 PM, mchri...@redhat.com wrote: > From: Mike Christie> > We don't need bi_rw to be so large on 64 bit archs, so > reduce it to unsigned int. > > Signed-off-by: Mike Christie > --- > block/blk-core.c | 2 +- > drivers/md/dm-flakey.c | 2 +- > drivers/md/raid5.c | 6 +++--- > fs/btrfs/check-integrity.c | 4 ++-- > fs/btrfs/inode.c | 2 +- > include/linux/blk_types.h | 2 +- > 6 files changed, 9 insertions(+), 9 deletions(-) > Reviewed-by: Hannes Reinecke Cheers, Hannes -- Dr. Hannes ReineckeTeamlead Storage & Networking h...@suse.de +49 911 74053 688 SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton HRB 21284 (AG Nürnberg) -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 40/45] block: move bio io prio to a new field
On 06/05/2016 09:32 PM, mchri...@redhat.com wrote: > From: Mike Christie> > In the next patch, we move drop the compat code and make > the op a separate value that is hidden in bi_rw. To give > the op and rq bits flags room to grow this moves prio to > its own field. > > Signed-off-by: Mike Christie > --- > include/linux/bio.h | 14 ++ > include/linux/blk_types.h | 5 ++--- > 2 files changed, 4 insertions(+), 15 deletions(-) > > diff --git a/include/linux/bio.h b/include/linux/bio.h > index 4568647..35108c2 100644 > --- a/include/linux/bio.h > +++ b/include/linux/bio.h > @@ -47,18 +47,8 @@ > #define bio_op(bio) (op_from_rq_bits((bio)->bi_rw)) > #define bio_set_op_attrs(bio, op, flags) ((bio)->bi_rw |= (op | flags)) > > -/* > - * upper 16 bits of bi_rw define the io priority of this bio > - */ > -#define BIO_PRIO_SHIFT (8 * sizeof(unsigned long) - IOPRIO_BITS) > -#define bio_prio(bio)((bio)->bi_rw >> BIO_PRIO_SHIFT) > -#define bio_prio_valid(bio) ioprio_valid(bio_prio(bio)) > - > -#define bio_set_prio(bio, prio) do {\ > - WARN_ON(prio >= (1 << IOPRIO_BITS));\ > - (bio)->bi_rw &= ((1UL << BIO_PRIO_SHIFT) - 1); \ > - (bio)->bi_rw |= ((unsigned long) (prio) << BIO_PRIO_SHIFT); \ > -} while (0) > +#define bio_prio(bio)(bio)->bi_ioprio > +#define bio_set_prio(bio, prio) ((bio)->bi_ioprio = prio) > > /* > * various member access, note that bio_data should of course not be used > diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h > index 6e60baa..2738413 100644 > --- a/include/linux/blk_types.h > +++ b/include/linux/blk_types.h > @@ -48,9 +48,8 @@ struct bio { > struct block_device *bi_bdev; > unsigned intbi_flags; /* status, command, etc */ > int bi_error; > - unsigned long bi_rw; /* bottom bits READ/WRITE, > - * top bits priority > - */ > + unsigned long bi_rw; /* READ/WRITE */ > + unsigned short bi_ioprio; > > struct bvec_iterbi_iter; > > Reviewed-by: Hannes Reinecke Cheers, Hannes -- Dr. Hannes ReineckeTeamlead Storage & Networking h...@suse.de +49 911 74053 688 SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton HRB 21284 (AG Nürnberg) -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 37/45] drivers: use req op accessor
On 06/05/2016 09:32 PM, mchri...@redhat.com wrote: > From: Mike Christie> > The req operation REQ_OP is separated from the rq_flag_bits > definition. This converts the block layer drivers to > use req_op to get the op from the request struct. > > Signed-off-by: Mike Christie > --- > drivers/block/loop.c | 6 +++--- > drivers/block/mtip32xx/mtip32xx.c | 2 +- > drivers/block/nbd.c | 2 +- > drivers/block/rbd.c | 4 ++-- > drivers/block/xen-blkfront.c | 8 +--- > drivers/ide/ide-floppy.c | 2 +- > drivers/md/dm.c | 2 +- > drivers/mmc/card/block.c | 7 +++ > drivers/mmc/card/queue.c | 6 ++ > drivers/mmc/card/queue.h | 5 - > drivers/mtd/mtd_blkdevs.c | 2 +- > drivers/nvme/host/core.c | 2 +- > drivers/nvme/host/nvme.h | 4 ++-- > drivers/scsi/sd.c | 25 - > 14 files changed, 43 insertions(+), 34 deletions(-) > Reviewed-by: Hannes Reinecke Cheers, Hannes -- Dr. Hannes ReineckeTeamlead Storage & Networking h...@suse.de +49 911 74053 688 SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton HRB 21284 (AG Nürnberg) -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 28/45] target: use bio op accessors
On 06/05/2016 09:32 PM, mchri...@redhat.com wrote: > From: Mike Christie> > Separate the op from the rq_flag_bits and have the target layer > set/get the bio using bio_set_op_attrs/bio_op. > > Signed-off-by: Mike Christie > --- > drivers/target/target_core_iblock.c | 29 ++--- > drivers/target/target_core_pscsi.c | 2 +- > 2 files changed, 15 insertions(+), 16 deletions(-) > > diff --git a/drivers/target/target_core_iblock.c > b/drivers/target/target_core_iblock.c > index c25109c..22af12f 100644 > --- a/drivers/target/target_core_iblock.c > +++ b/drivers/target/target_core_iblock.c [ .. ] > @@ -689,18 +690,15 @@ iblock_execute_rw(struct se_cmd *cmd, struct > scatterlist *sgl, u32 sgl_nents, >* Force writethrough using WRITE_FUA if a volatile write cache >* is not enabled, or if initiator set the Force Unit Access > bit. >*/ > + op = REQ_OP_WRITE; > if (test_bit(QUEUE_FLAG_FUA, >queue_flags)) { > if (cmd->se_cmd_flags & SCF_FUA) > - rw = WRITE_FUA; > + op_flags = WRITE_FUA; > else if (!test_bit(QUEUE_FLAG_WC, >queue_flags)) > - rw = WRITE_FUA; > - else > - rw = WRITE; > - } else { > - rw = WRITE; > + op_flags = WRITE_FUA; > } Wrong intendation. > } else { > - rw = READ; > + op = REQ_OP_READ; > } > > ibr = kzalloc(sizeof(struct iblock_req), GFP_KERNEL); > @@ -714,7 +712,7 @@ iblock_execute_rw(struct se_cmd *cmd, struct scatterlist > *sgl, u32 sgl_nents, > return 0; > } > > - bio = iblock_get_bio(cmd, block_lba, sgl_nents, rw); > + bio = iblock_get_bio(cmd, block_lba, sgl_nents, op, op_flags); > if (!bio) > goto fail_free_ibr; > > @@ -738,7 +736,8 @@ iblock_execute_rw(struct se_cmd *cmd, struct scatterlist > *sgl, u32 sgl_nents, > bio_cnt = 0; > } > > - bio = iblock_get_bio(cmd, block_lba, sg_num, rw); > + bio = iblock_get_bio(cmd, block_lba, sg_num, op, > + op_flags); > if (!bio) > goto fail_put_bios; > > diff --git a/drivers/target/target_core_pscsi.c > b/drivers/target/target_core_pscsi.c > index de18790..81564c8 100644 > --- a/drivers/target/target_core_pscsi.c > +++ b/drivers/target/target_core_pscsi.c > @@ -922,7 +922,7 @@ pscsi_map_sg(struct se_cmd *cmd, struct scatterlist *sgl, > u32 sgl_nents, > goto fail; > > if (rw) > - bio->bi_rw |= REQ_WRITE; > + bio_set_op_attrs(bio, REQ_OP_WRITE, 0); > > pr_debug("PSCSI: Allocated bio: %p," > " dir: %s nr_vecs: %d\n", bio, > Cheers, Hannes -- Dr. Hannes ReineckeTeamlead Storage & Networking h...@suse.de +49 911 74053 688 SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton HRB 21284 (AG Nürnberg) -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 25/45] bcache: use bio op accessors
On 06/05/2016 09:32 PM, mchri...@redhat.com wrote: > From: Mike Christie> > Separate the op from the rq_flag_bits and have bcache > set/get the bio using bio_set_op_attrs/bio_op. > > Signed-off-by: Mike Christie > --- > drivers/md/bcache/btree.c | 4 ++-- > drivers/md/bcache/debug.c | 4 ++-- > drivers/md/bcache/journal.c | 7 --- > drivers/md/bcache/movinggc.c | 2 +- > drivers/md/bcache/request.c | 14 +++--- > drivers/md/bcache/super.c | 24 +--- > drivers/md/bcache/writeback.c | 4 ++-- > 7 files changed, 31 insertions(+), 28 deletions(-) > Reviewed-by: Hannes Reinecke Cheers, Hannes -- Dr. Hannes ReineckeTeamlead Storage & Networking h...@suse.de +49 911 74053 688 SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton HRB 21284 (AG Nürnberg) -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 24/45] dm: use bio op accessors
On 06/05/2016 09:32 PM, mchri...@redhat.com wrote: > From: Mike Christie> > Separate the op from the rq_flag_bits and have dm > set/get the bio using bio_set_op_attrs/bio_op. > > Signed-off-by: Mike Christie > Reviewed-by: Hannes Reinecke Cheers, Hannes -- Dr. Hannes ReineckeTeamlead Storage & Networking h...@suse.de +49 911 74053 688 SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton HRB 21284 (AG Nürnberg) -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 09/45] block discard: use bio set op accessor
On 06/05/2016 09:31 PM, mchri...@redhat.com wrote: > From: Mike Christie> > This converts the block issue discard helper and users to use > the bio_set_op_attrs accessor and only pass in the operation flags > like REQ_SEQURE. > > Signed-off-by: Mike Christie > --- > block/blk-lib.c| 13 +++-- > drivers/md/dm-thin.c | 2 +- > include/linux/blkdev.h | 3 ++- > 3 files changed, 10 insertions(+), 8 deletions(-) > Reviewed-by: Hannes Reinecke Cheers, Hannes -- Dr. Hannes ReineckeTeamlead Storage & Networking h...@suse.de +49 911 74053 688 SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton HRB 21284 (AG Nürnberg) -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 08/45] block, fs, mm, drivers: use bio set/get op accessors
On 06/05/2016 09:31 PM, mchri...@redhat.com wrote: > From: Mike Christie> > This patch converts the simple bi_rw use cases in the block, > drivers, mm and fs code to set/get the bio operation using > bio_set_op_attrs/bio_op > > These should be simple one or two liner cases, so I just did them > in one patch. The next patches handle the more complicated > cases in a module per patch. > > Signed-off-by: Mike Christie > --- > > v5: > 1. Add missed crypto call. > 2. Change nfs bi_rw check to bi_op. > > block/bio.c | 13 ++--- > block/blk-core.c| 6 +++--- > block/blk-flush.c | 2 +- > block/blk-lib.c | 4 ++-- > block/blk-map.c | 2 +- > block/blk-merge.c | 12 ++-- > drivers/block/brd.c | 2 +- > drivers/block/floppy.c | 2 +- > drivers/block/pktcdvd.c | 4 ++-- > drivers/block/rsxx/dma.c| 2 +- > drivers/block/zram/zram_drv.c | 2 +- > drivers/lightnvm/rrpc.c | 6 +++--- > drivers/scsi/osd/osd_initiator.c| 8 > drivers/staging/lustre/lustre/llite/lloop.c | 6 +++--- > fs/crypto/crypto.c | 2 +- > fs/exofs/ore.c | 2 +- > fs/ext4/page-io.c | 6 +++--- > fs/ext4/readpage.c | 2 +- > fs/jfs/jfs_logmgr.c | 4 ++-- > fs/jfs/jfs_metapage.c | 4 ++-- > fs/logfs/dev_bdev.c | 12 ++-- > fs/nfs/blocklayout/blocklayout.c| 4 ++-- > include/linux/bio.h | 15 ++- > mm/page_io.c| 4 ++-- > 24 files changed, 65 insertions(+), 61 deletions(-) > [ .. ] > diff --git a/include/linux/bio.h b/include/linux/bio.h > index 09c5308..4568647 100644 > --- a/include/linux/bio.h > +++ b/include/linux/bio.h > @@ -109,18 +109,23 @@ static inline bool bio_has_data(struct bio *bio) > { > if (bio && > bio->bi_iter.bi_size && > - !(bio->bi_rw & REQ_DISCARD)) > + bio_op(bio) != REQ_OP_DISCARD) > return true; > > return false; > } > > +static inline bool bio_no_advance_iter(struct bio *bio) > +{ > + return bio_op(bio) == REQ_OP_DISCARD || bio_op(bio) == > REQ_OP_WRITE_SAME; > +} > + > static inline bool bio_is_rw(struct bio *bio) > { > if (!bio_has_data(bio)) > return false; > > - if (bio->bi_rw & BIO_NO_ADVANCE_ITER_MASK) > + if (bio_no_advance_iter(bio)) > return false; > > return true; > @@ -228,7 +233,7 @@ static inline void bio_advance_iter(struct bio *bio, > struct bvec_iter *iter, > { > iter->bi_sector += bytes >> 9; > > - if (bio->bi_rw & BIO_NO_ADVANCE_ITER_MASK) > + if (bio_no_advance_iter(bio)) > iter->bi_size -= bytes; > else > bvec_iter_advance(bio->bi_io_vec, iter, bytes); Hmm. Can't you drop 'BIO_NO_ADVANCE_ITER_MASK' after this patch? > @@ -256,10 +261,10 @@ static inline unsigned bio_segments(struct bio *bio) >* differently: >*/ > > - if (bio->bi_rw & REQ_DISCARD) > + if (bio_op(bio) == REQ_OP_DISCARD) > return 1; > > - if (bio->bi_rw & REQ_WRITE_SAME) > + if (bio_op(bio) == REQ_OP_WRITE_SAME) > return 1; > > bio_for_each_segment(bv, bio, iter) > diff --git a/mm/page_io.c b/mm/page_io.c > index 5a5fd66..dcc5d37 100644 > --- a/mm/page_io.c > +++ b/mm/page_io.c > @@ -317,7 +317,7 @@ int __swap_writepage(struct page *page, struct > writeback_control *wbc, > ret = -ENOMEM; > goto out; > } > - bio->bi_rw = WRITE; > + bio_set_op_attrs(bio, REQ_OP_WRITE, 0); > if (wbc->sync_mode == WB_SYNC_ALL) > bio->bi_rw |= REQ_SYNC; > count_vm_event(PSWPOUT); > @@ -370,7 +370,7 @@ int swap_readpage(struct page *page) > ret = -ENOMEM; > goto out; > } > - bio->bi_rw = READ; > + bio_set_op_attrs(bio, REQ_OP_READ, 0); > count_vm_event(PSWPIN); > submit_bio(bio); > out: > Reviewed-by: Hannes Reinecke Cheers, Hannes -- Dr. Hannes ReineckeTeamlead Storage & Networking h...@suse.de +49 911 74053 688 SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton HRB 21284 (AG Nürnberg) -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 07/45] bcache: use op_is_write instead of checking for REQ_WRITE
On 06/05/2016 09:31 PM, mchri...@redhat.com wrote: > From: Mike Christie> > We currently set REQ_WRITE/WRITE for all non READ IOs > like discard, flush, writesame, etc. In the next patches where we > no longer set up the op as a bitmap, we will not be able to > detect a operation direction like writesame by testing if REQ_WRITE is > set. > > This has bcache use the op_is_write helper which will do the right > thing. > > Signed-off-by: Mike Christie > --- > drivers/md/bcache/io.c | 2 +- > drivers/md/bcache/request.c | 6 +++--- > 2 files changed, 4 insertions(+), 4 deletions(-) > (Could probably folded together with the two previous patches) Reviewed-by: Hannes Reinecke Cheers, Hannes -- Dr. Hannes ReineckeTeamlead Storage & Networking h...@suse.de +49 911 74053 688 SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton HRB 21284 (AG Nürnberg) -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 06/45] dm: use op_is_write instead of checking for REQ_WRITE
On 06/05/2016 09:31 PM, mchri...@redhat.com wrote: > From: Mike Christie> > We currently set REQ_WRITE/WRITE for all non READ IOs > like discard, flush, writesame, etc. In the next patches where we > no longer set up the op as a bitmap, we will not be able to > detect a operation direction like writesame by testing if REQ_WRITE is > set. > > This has dm use the op_is_write helper which will do the right > thing. > > Signed-off-by: Mike Christie > --- (Could probably be folded into the previous patch) Reviewed-by: Hannes Reinecke Cheers, Hannes -- Dr. Hannes ReineckeTeamlead Storage & Networking h...@suse.de +49 911 74053 688 SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton HRB 21284 (AG Nürnberg) -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 05/45] block, drivers, cgroup: use op_is_write helper instead of checking for REQ_WRITE
On 06/05/2016 09:31 PM, mchri...@redhat.com wrote: > From: Mike Christie> > We currently set REQ_WRITE/WRITE for all non READ IOs > like discard, flush, writesame, etc. In the next patches where we > no longer set up the op as a bitmap, we will not be able to > detect a operation direction like writesame by testing if REQ_WRITE is > set. > > This patch converts the drivers and cgroup to use the > op_is_write helper. This should just cover the simple > cases. I did dm, md and bcache in their own patches > because they were more involved. > > Signed-off-by: Mike Christie > --- Reviewed-by: Hannes Reinecke Cheers, Hannes -- Dr. Hannes ReineckeTeamlead Storage & Networking h...@suse.de +49 911 74053 688 SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton HRB 21284 (AG Nürnberg) -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 02/45] block: add REQ_OP definitions and helpers
On 06/05/2016 09:31 PM, mchri...@redhat.com wrote: > From: Mike Christie> > The following patches separate the operation (WRITE, READ, DISCARD, > etc) from the rq_flag_bits flags. This patch adds definitions for > request/bio operations (REQ_OPs) and adds request/bio accessors to > get/set the op. > > In this patch the REQ_OPs match the REQ rq_flag_bits ones > for compat reasons while all the code is converted to use the > op accessors in the set. In the last patches the op will become a > number and the accessors and helpers in this patch will be dropped > or updated. > > Signed-off-by: Mike Christie > --- Reviewed-by: Hannes Reinecke Cheers, Hannes -- Dr. Hannes ReineckeTeamlead Storage & Networking h...@suse.de +49 911 74053 688 SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton HRB 21284 (AG Nürnberg) -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 01/45] block/fs/drivers: remove rw argument from submit_bio
On 06/05/2016 09:31 PM, mchri...@redhat.com wrote: > From: Mike Christie> > This has callers of submit_bio/submit_bio_wait set the bio->bi_rw > instead of passing it in. This makes that use the same as > generic_make_request and how we set the other bio fields. > > Signed-off-by: Mike Christie > --- Reviewed-by: Hannes Reinecke Cheers, Hannes -- Dr. Hannes ReineckeTeamlead Storage & Networking h...@suse.de +49 911 74053 688 SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton HRB 21284 (AG Nürnberg) -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html