Re: btrfs device delete /dev/sdc1 /mnt/raid1 user experience

2016-06-06 Thread Kai Hendry
Sorry I unsubscribed from linux-btrfs@vger.kernel.org since the traffic
was a bit too high for me.

On Tue, 7 Jun 2016, at 11:42 AM, Chris Murphy wrote:
> Your command turned this from a 3 drive volume into a 2 drive volume,
> it removed the drive you asked to be removed.

I actually had 2 drives to begin with, but it wouldn't allow me to
delete the drive without adding a new one, so I had 3 drives.

I think I've confused you by my talk of "RAID-1 over 3 drives". I am not
sure why RAIDX terminology is so confusing. All I want is for the data
on one drive to be mirrored across them all. So if I add X drives, I
want to X exact copies. But IIUC this changes the RAID level and
everyone gets confused what I am talking about.


> You're definitely missing something but once you understand how it
> actually works, it'll be interesting if you have suggestions on how
> the existing documentation confused you into making this mistake.


Summary: I issued the remove command when I should have just removed it
physically IIUC.



If I was to do this again, I would unmount the raid1 mount. Take a disk
physically out and then add the new one. So yes, it would be degraded
but then as you mention, the newly drive will sort it out. One would
hope.

Thanks,
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs device delete /dev/sdc1 /mnt/raid1 user experience

2016-06-06 Thread Chris Murphy
On Mon, Jun 6, 2016 at 7:19 PM, Kai Hendry  wrote:
> On Mon, 6 Jun 2016, at 10:16 PM, Austin S. Hemmelgarn wrote:
>> Based on the fact that you appear to want to carry a disk to copy data
>> more quickly than over then internet, then what you've already done plus
>> this is the correct way to do it.
>
> The trouble is the way I ended up doing it:
> 1) Replacing it with a bigger disk
> 2) Deleting the disk
> 3) btrfs send /mnt/raid1/snapshot | btrfs receive /mnt/onetb/
>
> This took almost **an entire day** over USB3 spinning disks  !
>
> My expectation was that I could quickly remove a mirrored disk, replace
> that disk and hand carry the disk I removed. btrfs shouldn't remove data
> from the disk I removed.

Had this been a two drive raid1, all you need to do is take one drive
with you. No commands necessary. Each drive can be separately mounted
with -o degraded, and you can re-establish two copy replication with
'btrfs dev add' with a new drive and 'btrfs dev del missing' to get
rid of the phantom missing drive. You'd do that locally and then also
at your destination.

A gotcha with this technique though is often single chunks get created
during the -o degraded read-write mount. After 'dev del missing'
completes the replication process, you need to check 'fi df' or 'fi
us' for single chunks. If they exist, get rid of them with a filtered
balance. '-dconvert=raid1,soft -mconvert=raid1,soft' should work I
think. If there are single chunks created and not converted, later if
you need to do a degraded mount it will fail. Usually it can be made
to work with -o ro,degraded in such a case, but it means no more
read-write for the file system, you'd have to recreate it (at this
point until it's fixed).


-- 
Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs device delete /dev/sdc1 /mnt/raid1 user experience

2016-06-06 Thread Chris Murphy
On Sun, Jun 5, 2016 at 11:44 PM, Kai Hendry  wrote:
> Hi there,
>
>
> I planned to remove one of my disks, so that I can take it from
> Singapore to the UK and then re-establish another remote RAID1 store.
>
> delete is an alias of remove, so I added a new disk (devid 3) and
> proceeded to run:
> btrfs device delete /dev/sdc1 /mnt/raid1 (devid 1)

Wrong command. This means you don't want sdc1 used in this volume
anymore and in the process it is invalidated because to remove it from
Btrfs means to remove its signature.





>
>
> nuc:~$ uname -a
> Linux nuc 4.5.4-1-ARCH #1 SMP PREEMPT Wed May 11 22:21:28 CEST 2016
> x86_64 GNU/Linux
> nuc:~$   btrfs --version
> btrfs-progs v4.5.3
> nuc:~$ sudo btrfs fi show /mnt/raid1/
> Label: 'extraid1'  uuid: 5cab2a4a-e282-4931-b178-bec4c73cdf77
> Total devices 2 FS bytes used 776.56GiB
> devid2 size 931.48GiB used 778.03GiB path /dev/sdb1
> devid3 size 1.82TiB used 778.03GiB path /dev/sdd1

OK I'm confused. You had a three drive Btrfs raid-1 and you expected
to take 1 drive on a trip to establish that data elsewhere? That's not
possible. The minimum number of drives for degraded mount of a 3 drive
Btrfs raid1 is 2 drives. There aren't three copies of the data with a
three drive raid1. There are two copies only spread across three
drives.

Your command turned this from a 3 drive volume into a 2 drive volume,
it removed the drive you asked to be removed.



> First off, I was expecting btrfs to release the drive pretty much
> immediately. The command took about half a day to complete. I watched
> `btrfs fi show` to see size of devid 1 (the one I am trying to remove)
> to be zero and to see used space slowly go down whilst used space of
> devid 3 (the new disk) slowly go up.

Expected behavior, it was told you no longer wanted a 3 drive raid1,
so the data on the removed drive was being replicated onto the other
two drives to maintain two copies of your data on those two drives.


> Secondly and most importantly my /dev/sdc1 can't be mounted now anymore.
> Why?

The data that was on that drive is still there, just the magic was
invalidated as the final step in the operation. But that one drive
doesn't contain all of your data anyway so by itself the one drive
won't mount degraded, too many devices are missing.


>
> sudo mount -t btrfs /dev/sdc1 /mnt/test/
> mount: wrong fs type, bad option, bad superblock on /dev/sdc1,
>missing codepage or helper program, or other error
>
>In some cases useful info is found in syslog - try
>dmesg | tail or so.
>
> There is nothing in dmesg nor my journal. I wasn't expecting my drive to
> be rendered useless on removing or am I missing something?

You're definitely missing something but once you understand how it
actually works, it'll be interesting if you have suggestions on how
the existing documentation confused you into making this mistake.


> I'm still keen to take a TB on a flight with me the day after tomorrow.
> What is the recommended course of action? Recreate a mkfs.btrfs on
> /dev/sdc1 and send data to it from /mnt/raid1?

You can do that. The 2 disk raid1 can be used as the send, and the new
fs on the removed 3rd drive can be used as the receive side. That
would be per subvolume however.

It would be faster to use a btrfs seed device for this. But as I went
through the steps, I got confused how to remove the two raid1s. Both
are read only in this case, so I don't know what happens, it'd require
testing and the seed > sprout stuff isn't as well tested. But at least
it's relatively safe since the original is flagged read only.


> Still I hope the experience could be improved to remove a disk sanely.

Yeah but we'll have to understand at what point you're confused. It
seems like maybe you're not aware that  three drive raid1 still means
two copies of data, not three. Taking one drive would never have
worked anyway.

-- 
Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs device delete /dev/sdc1 /mnt/raid1 user experience

2016-06-06 Thread Kai Hendry
On Mon, 6 Jun 2016, at 10:16 PM, Austin S. Hemmelgarn wrote:
> Based on the fact that you appear to want to carry a disk to copy data 
> more quickly than over then internet, then what you've already done plus 
> this is the correct way to do it.

The trouble is the way I ended up doing it:
1) Replacing it with a bigger disk
2) Deleting the disk
3) btrfs send /mnt/raid1/snapshot | btrfs receive /mnt/onetb/

This took almost **an entire day** over USB3 spinning disks  !

My expectation was that I could quickly remove a mirrored disk, replace
that disk and hand carry the disk I removed. btrfs shouldn't remove data
from the disk I removed.

Kind regards,
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v10 09/21] btrfs: dedupe: Inband in-memory only de-duplication implement

2016-06-06 Thread Qu Wenruo



At 06/07/2016 03:54 AM, Mark Fasheh wrote:

On Sat, Jun 04, 2016 at 06:26:39PM +0800, Qu Wenruo wrote:



On 06/03/2016 10:27 PM, Josef Bacik wrote:

On 06/01/2016 09:12 PM, Qu Wenruo wrote:



At 06/02/2016 06:08 AM, Mark Fasheh wrote:

On Fri, Apr 01, 2016 at 02:35:00PM +0800, Qu Wenruo wrote:

Core implement for inband de-duplication.
It reuse the async_cow_start() facility to do the calculate dedupe
hash.
And use dedupe hash to do inband de-duplication at extent level.

The work flow is as below:
1) Run delalloc range for an inode
2) Calculate hash for the delalloc range at the unit of dedupe_bs
3) For hash match(duplicated) case, just increase source extent ref
  and insert file extent.
  For hash mismatch case, go through the normal cow_file_range()
  fallback, and add hash into dedupe_tree.
  Compress for hash miss case is not supported yet.

Current implement restore all dedupe hash in memory rb-tree, with LRU
behavior to control the limit.

Signed-off-by: Wang Xiaoguang 
Signed-off-by: Qu Wenruo 
---
fs/btrfs/extent-tree.c |  18 
fs/btrfs/inode.c   | 235
++---
fs/btrfs/relocation.c  |  16 
3 files changed, 236 insertions(+), 33 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 53e1297..dabd721 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -37,6 +37,7 @@
#include "math.h"
#include "sysfs.h"
#include "qgroup.h"
+#include "dedupe.h"

#undef SCRAMBLE_DELAYED_REFS

@@ -2399,6 +2400,8 @@ static int run_one_delayed_ref(struct
btrfs_trans_handle *trans,

if (btrfs_delayed_ref_is_head(node)) {
struct btrfs_delayed_ref_head *head;
+struct btrfs_fs_info *fs_info = root->fs_info;
+
/*
 * we've hit the end of the chain and we were supposed
 * to insert this extent into the tree.  But, it got
@@ -2413,6 +2416,15 @@ static int run_one_delayed_ref(struct
btrfs_trans_handle *trans,
btrfs_pin_extent(root, node->bytenr,
 node->num_bytes, 1);
if (head->is_data) {
+/*
+ * If insert_reserved is given, it means
+ * a new extent is revered, then deleted
+ * in one tran, and inc/dec get merged to 0.
+ *
+ * In this case, we need to remove its dedup
+ * hash.
+ */
+btrfs_dedupe_del(trans, fs_info, node->bytenr);
ret = btrfs_del_csums(trans, root,
  node->bytenr,
  node->num_bytes);
@@ -6713,6 +6725,12 @@ static int __btrfs_free_extent(struct
btrfs_trans_handle *trans,
btrfs_release_path(path);

if (is_data) {
+ret = btrfs_dedupe_del(trans, info, bytenr);
+if (ret < 0) {
+btrfs_abort_transaction(trans, extent_root,
+ret);


I don't see why an error here should lead to a readonly fs.
   --Mark



Because such deletion error can lead to corruption.

For example, extent A is already in hash pool.
And when freeing extent A, we need to delete its hash, of course.

But if such deletion fails, which means the hash is still in the pool,
even the extent A no longer exists in extent tree.


Except if we're in in-memory mode only it doesn't matter, so don't abort
if we're in in-memory mode.  Thanks,

Josef



If we can't ensure a hash is delete along with the extent, we will
screw up the whole fs, as new write can points to non-exist extent.

Although you're right with in-memory mode here, we won't abort
trans, as inmem_del_hash() won't return error code. It will always
return 0.


Until a third party comes along and changes it to return an error code and
neither you or I are there to remind them to fix this check (or have simply
forgotten).



So still, no need to change anyway.


Personally I'd call this 'defensive coding' and do a check for in-memory
only before our abort_trans().  This would have no effect on our running
code but avoids the problem I stated above.  Alternatively, you could
clearly comment the exception. I don't like leaving it as-is for the reason
I stated above.

Thanks,
--Mark



The whole 'defensive coding' is here just because the V10 patchset comes 
with full function, including 2 backends and other things later.


A lot of code like this is here because we know what will be added later.
So this is true that it looks ridiculous until one knows there is an 
on-disk backend to be added.


I'm OK to move the check to btrfs_dedupe_del(), but this makes me 
curious about the correct coding style for adding new function.



If we have a clear view of the future functions , should we leave such 
interfaces for them?

Or add them when adding the new functions?

And what level of integration should be done inside btrfs codes?
Should any caller of an exported btrfs 

Re: [PATCH v10 09/21] btrfs: dedupe: Inband in-memory only de-duplication implement

2016-06-06 Thread Mark Fasheh
On Sat, Jun 04, 2016 at 06:26:39PM +0800, Qu Wenruo wrote:
> 
> 
> On 06/03/2016 10:27 PM, Josef Bacik wrote:
> >On 06/01/2016 09:12 PM, Qu Wenruo wrote:
> >>
> >>
> >>At 06/02/2016 06:08 AM, Mark Fasheh wrote:
> >>>On Fri, Apr 01, 2016 at 02:35:00PM +0800, Qu Wenruo wrote:
> Core implement for inband de-duplication.
> It reuse the async_cow_start() facility to do the calculate dedupe
> hash.
> And use dedupe hash to do inband de-duplication at extent level.
> 
> The work flow is as below:
> 1) Run delalloc range for an inode
> 2) Calculate hash for the delalloc range at the unit of dedupe_bs
> 3) For hash match(duplicated) case, just increase source extent ref
>    and insert file extent.
>    For hash mismatch case, go through the normal cow_file_range()
>    fallback, and add hash into dedupe_tree.
>    Compress for hash miss case is not supported yet.
> 
> Current implement restore all dedupe hash in memory rb-tree, with LRU
> behavior to control the limit.
> 
> Signed-off-by: Wang Xiaoguang 
> Signed-off-by: Qu Wenruo 
> ---
>  fs/btrfs/extent-tree.c |  18 
>  fs/btrfs/inode.c   | 235
> ++---
>  fs/btrfs/relocation.c  |  16 
>  3 files changed, 236 insertions(+), 33 deletions(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 53e1297..dabd721 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -37,6 +37,7 @@
>  #include "math.h"
>  #include "sysfs.h"
>  #include "qgroup.h"
> +#include "dedupe.h"
> 
>  #undef SCRAMBLE_DELAYED_REFS
> 
> @@ -2399,6 +2400,8 @@ static int run_one_delayed_ref(struct
> btrfs_trans_handle *trans,
> 
>  if (btrfs_delayed_ref_is_head(node)) {
>  struct btrfs_delayed_ref_head *head;
> +struct btrfs_fs_info *fs_info = root->fs_info;
> +
>  /*
>   * we've hit the end of the chain and we were supposed
>   * to insert this extent into the tree.  But, it got
> @@ -2413,6 +2416,15 @@ static int run_one_delayed_ref(struct
> btrfs_trans_handle *trans,
>  btrfs_pin_extent(root, node->bytenr,
>   node->num_bytes, 1);
>  if (head->is_data) {
> +/*
> + * If insert_reserved is given, it means
> + * a new extent is revered, then deleted
> + * in one tran, and inc/dec get merged to 0.
> + *
> + * In this case, we need to remove its dedup
> + * hash.
> + */
> +btrfs_dedupe_del(trans, fs_info, node->bytenr);
>  ret = btrfs_del_csums(trans, root,
>    node->bytenr,
>    node->num_bytes);
> @@ -6713,6 +6725,12 @@ static int __btrfs_free_extent(struct
> btrfs_trans_handle *trans,
>  btrfs_release_path(path);
> 
>  if (is_data) {
> +ret = btrfs_dedupe_del(trans, info, bytenr);
> +if (ret < 0) {
> +btrfs_abort_transaction(trans, extent_root,
> +ret);
> >>>
> >>>I don't see why an error here should lead to a readonly fs.
> >>>--Mark
> >>>
> >>
> >>Because such deletion error can lead to corruption.
> >>
> >>For example, extent A is already in hash pool.
> >>And when freeing extent A, we need to delete its hash, of course.
> >>
> >>But if such deletion fails, which means the hash is still in the pool,
> >>even the extent A no longer exists in extent tree.
> >
> >Except if we're in in-memory mode only it doesn't matter, so don't abort
> >if we're in in-memory mode.  Thanks,
> >
> >Josef
> >
> 
> If we can't ensure a hash is delete along with the extent, we will
> screw up the whole fs, as new write can points to non-exist extent.
> 
> Although you're right with in-memory mode here, we won't abort
> trans, as inmem_del_hash() won't return error code. It will always
> return 0.

Until a third party comes along and changes it to return an error code and
neither you or I are there to remind them to fix this check (or have simply
forgotten).


> So still, no need to change anyway.

Personally I'd call this 'defensive coding' and do a check for in-memory
only before our abort_trans().  This would have no effect on our running
code but avoids the problem I stated above.  Alternatively, you could
clearly comment the exception. I don't like leaving it as-is for the reason 
I stated above.

Thanks,
--Mark

--
Mark Fasheh
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  

[PATCH v2] Btrfs: check if extent buffer is aligned to sectorsize

2016-06-06 Thread Liu Bo
Thanks to fuzz testing, we can pass an invalid bytenr to extent buffer
via alloc_extent_buffer().  An unaligned eb can have more pages than it
should have, which ends up extent buffer's leak or some corrupted content
in extent buffer.

This adds a warning to let us quickly know what was happening.

Now that alloc_extent_buffer() no more returns NULL, this changes its
caller and callers of its caller to match with the new error
handling.

Signed-off-by: Liu Bo 
---
v2:
  - Add more fine-grained error handling to alloc_extent_buffer() and
its callers
  - Use btrfs_err and -EINVAL instead of WARN_ONCE().

 fs/btrfs/ctree.c   |  2 ++
 fs/btrfs/disk-io.c |  8 
 fs/btrfs/extent-tree.c | 10 ++
 fs/btrfs/extent_io.c   | 15 ---
 fs/btrfs/tree-log.c|  4 ++--
 fs/btrfs/volumes.c |  4 ++--
 6 files changed, 28 insertions(+), 15 deletions(-)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 427c36b..24c9fb2 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -2510,6 +2510,8 @@ read_block_for_search(struct btrfs_trans_handle *trans,
if (!btrfs_buffer_uptodate(tmp, 0, 0))
ret = -EIO;
free_extent_buffer(tmp);
+   } else {
+   ret = PTR_ERR(tmp);
}
return ret;
 }
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index ea78d77..c2a9cec 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1098,7 +1098,7 @@ void readahead_tree_block(struct btrfs_root *root, u64 
bytenr)
struct inode *btree_inode = root->fs_info->btree_inode;
 
buf = btrfs_find_create_tree_block(root, bytenr);
-   if (!buf)
+   if (IS_ERR(buf))
return;
read_extent_buffer_pages(_I(btree_inode)->io_tree,
 buf, 0, WAIT_NONE, btree_get_extent, 0);
@@ -1114,7 +1114,7 @@ int reada_tree_block_flagged(struct btrfs_root *root, u64 
bytenr,
int ret;
 
buf = btrfs_find_create_tree_block(root, bytenr);
-   if (!buf)
+   if (IS_ERR(buf))
return 0;
 
set_bit(EXTENT_BUFFER_READAHEAD, >bflags);
@@ -1171,8 +1171,8 @@ struct extent_buffer *read_tree_block(struct btrfs_root 
*root, u64 bytenr,
int ret;
 
buf = btrfs_find_create_tree_block(root, bytenr);
-   if (!buf)
-   return ERR_PTR(-ENOMEM);
+   if (IS_ERR(buf))
+   return buf;
 
ret = btree_read_extent_buffer_pages(root, buf, 0, parent_transid);
if (ret) {
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index a400951..d63 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -8010,8 +8010,9 @@ btrfs_init_new_buffer(struct btrfs_trans_handle *trans, 
struct btrfs_root *root,
struct extent_buffer *buf;
 
buf = btrfs_find_create_tree_block(root, bytenr);
-   if (!buf)
-   return ERR_PTR(-ENOMEM);
+   if (IS_ERR(buf))
+   return buf;
+
btrfs_set_header_generation(buf, trans->transid);
btrfs_set_buffer_lockdep_class(root->root_key.objectid, buf, level);
btrfs_tree_lock(buf);
@@ -8653,8 +8654,9 @@ static noinline int do_walk_down(struct 
btrfs_trans_handle *trans,
next = btrfs_find_tree_block(root->fs_info, bytenr);
if (!next) {
next = btrfs_find_create_tree_block(root, bytenr);
-   if (!next)
-   return -ENOMEM;
+   if (IS_ERR(next))
+   return PTR_ERR(next);
+
btrfs_set_buffer_lockdep_class(root->root_key.objectid, next,
   level - 1);
reada = 1;
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 6f38c2c..91c4d23 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4883,18 +4883,25 @@ struct extent_buffer *alloc_extent_buffer(struct 
btrfs_fs_info *fs_info,
int uptodate = 1;
int ret;
 
+   if (!IS_ALIGNED(start, fs_info->tree_root->sectorsize)) {
+   btrfs_err(fs_info, "bad tree block start %llu", start);
+   return ERR_PTR(-EINVAL);
+   }
+
eb = find_extent_buffer(fs_info, start);
if (eb)
return eb;
 
eb = __alloc_extent_buffer(fs_info, start, len);
if (!eb)
-   return NULL;
+   return ERR_PTR(-ENOMEM);
 
for (i = 0; i < num_pages; i++, index++) {
p = find_or_create_page(mapping, index, GFP_NOFS|__GFP_NOFAIL);
-   if (!p)
+   if (!p) {
+   exists = ERR_PTR(-ENOMEM);
goto free_eb;
+   }
 
spin_lock(>private_lock);
if (PagePrivate(p)) {
@@ -4939,8 +4946,10 @@ struct extent_buffer *alloc_extent_buffer(struct 
btrfs_fs_info *fs_info,
set_bit(EXTENT_BUFFER_UPTODATE, >bflags);
 again:
ret = 

Re: [PATCH 28/45] target: use bio op accessors

2016-06-06 Thread Hannes Reinecke

On 06/06/2016 05:40 PM, Mike Christie wrote:

On 06/06/2016 01:46 AM, Hannes Reinecke wrote:

On 06/05/2016 09:32 PM, mchri...@redhat.com wrote:

From: Mike Christie 

Separate the op from the rq_flag_bits and have the target layer
set/get the bio using bio_set_op_attrs/bio_op.

Signed-off-by: Mike Christie 
---
  drivers/target/target_core_iblock.c | 29 ++---
  drivers/target/target_core_pscsi.c  |  2 +-
  2 files changed, 15 insertions(+), 16 deletions(-)

diff --git a/drivers/target/target_core_iblock.c 
b/drivers/target/target_core_iblock.c
index c25109c..22af12f 100644
--- a/drivers/target/target_core_iblock.c
+++ b/drivers/target/target_core_iblock.c

[ .. ]

@@ -689,18 +690,15 @@ iblock_execute_rw(struct se_cmd *cmd, struct scatterlist 
*sgl, u32 sgl_nents,
 * Force writethrough using WRITE_FUA if a volatile write cache
 * is not enabled, or if initiator set the Force Unit Access 
bit.
 */
+   op = REQ_OP_WRITE;
if (test_bit(QUEUE_FLAG_FUA, >queue_flags)) {
if (cmd->se_cmd_flags & SCF_FUA)
-   rw = WRITE_FUA;
+   op_flags = WRITE_FUA;
else if (!test_bit(QUEUE_FLAG_WC, >queue_flags))
-   rw = WRITE_FUA;
-   else
-   rw = WRITE;
-   } else {
-   rw = WRITE;
+   op_flags = WRITE_FUA;
}

Wrong intendation.


It should be ok. That line is for the QUEUE_FLAG_WC test. We end up with:

op = REQ_OP_WRITE;
if (test_bit(QUEUE_FLAG_FUA, >queue_flags)) {
if (cmd->se_cmd_flags & SCF_FUA)
op_flags = WRITE_FUA;
else if (!test_bit(QUEUE_FLAG_WC, >queue_flags))
op_flags = WRITE_FUA;
}





Indeed, you are correct.

Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
--
Dr. Hannes Reinecke   zSeries & Storage
h...@suse.de  +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 28/45] target: use bio op accessors

2016-06-06 Thread Mike Christie
On 06/06/2016 01:46 AM, Hannes Reinecke wrote:
> On 06/05/2016 09:32 PM, mchri...@redhat.com wrote:
>> From: Mike Christie 
>>
>> Separate the op from the rq_flag_bits and have the target layer
>> set/get the bio using bio_set_op_attrs/bio_op.
>>
>> Signed-off-by: Mike Christie 
>> ---
>>  drivers/target/target_core_iblock.c | 29 ++---
>>  drivers/target/target_core_pscsi.c  |  2 +-
>>  2 files changed, 15 insertions(+), 16 deletions(-)
>>
>> diff --git a/drivers/target/target_core_iblock.c 
>> b/drivers/target/target_core_iblock.c
>> index c25109c..22af12f 100644
>> --- a/drivers/target/target_core_iblock.c
>> +++ b/drivers/target/target_core_iblock.c
> [ .. ]
>> @@ -689,18 +690,15 @@ iblock_execute_rw(struct se_cmd *cmd, struct 
>> scatterlist *sgl, u32 sgl_nents,
>>   * Force writethrough using WRITE_FUA if a volatile write cache
>>   * is not enabled, or if initiator set the Force Unit Access 
>> bit.
>>   */
>> +op = REQ_OP_WRITE;
>>  if (test_bit(QUEUE_FLAG_FUA, >queue_flags)) {
>>  if (cmd->se_cmd_flags & SCF_FUA)
>> -rw = WRITE_FUA;
>> +op_flags = WRITE_FUA;
>>  else if (!test_bit(QUEUE_FLAG_WC, >queue_flags))
>> -rw = WRITE_FUA;
>> -else
>> -rw = WRITE;
>> -} else {
>> -rw = WRITE;
>> +op_flags = WRITE_FUA;
>>  }
> Wrong intendation.

It should be ok. That line is for the QUEUE_FLAG_WC test. We end up with:

op = REQ_OP_WRITE;
if (test_bit(QUEUE_FLAG_FUA, >queue_flags)) {
if (cmd->se_cmd_flags & SCF_FUA)
op_flags = WRITE_FUA;
else if (!test_bit(QUEUE_FLAG_WC, >queue_flags))
op_flags = WRITE_FUA;
}




--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PULL] Self-tests updates for non-4k pages, for 4.7-rc3

2016-06-06 Thread David Sterba
Hi,

this patchset enhances the self-tests to work on non-4k architectures (ppc64,
arm64). The diffstat may appear big, but there are lots of call sites that just
get another parameter (nodesize), so we now also test all valid combinations of
pagesize and nodesize for a given arch.

The patchset is part of preparatory work of the subpage-blocksize patchset, and
it's updating only non-production code so it's IMO safe to merge in an rc. I
have tested it on x86_64 to verify that it does not break the default usecase.


The following changes since commit 56244ef151c3cd11f505020ab0b3f45454363bcc:

  Btrfs: fix handling of faults from btrfs_copy_from_user (2016-05-26 13:23:59 
-0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git 
for-chris-4.7-20160606

for you to fetch changes up to 34b3e6c92af1fa3f7067e4fa05ffa9d8bd41c96c:

  Btrfs: self-tests: Fix extent buffer bitmap test fail on BE system 
(2016-06-06 17:17:12 +0200)


Feifei Xu (8):
  Btrfs: test_check_exists: Fix infinite loop when searching for free space 
entries
  Btrfs: Fix integer overflow when calculating bytes_per_bitmap
  Btrfs: self-tests: Support non-4k page size
  Btrfs: self-tests: Execute page straddling test only when nodesize < 
PAGE_SIZE
  Btrfs: self-tests: Support testing all possible sectorsizes and nodesizes
  Btrfs: self-tests: Use macros instead of constants and add missing newline
  Btrfs: self-tests: Fix test_bitmaps fail on 64k sectorsize
  Btrfs: self-tests: Fix extent buffer bitmap test fail on BE system

 fs/btrfs/ctree.c   |   6 +-
 fs/btrfs/disk-io.c |   9 +-
 fs/btrfs/disk-io.h |   2 +-
 fs/btrfs/extent_io.c   |  10 +-
 fs/btrfs/extent_io.h   |   4 +-
 fs/btrfs/free-space-cache.c|  18 +-
 fs/btrfs/super.c   |  52 +++--
 fs/btrfs/tests/btrfs-tests.c   |   6 +-
 fs/btrfs/tests/btrfs-tests.h   |  27 +--
 fs/btrfs/tests/extent-buffer-tests.c   |  13 +-
 fs/btrfs/tests/extent-io-tests.c   |  86 ++---
 fs/btrfs/tests/free-space-tests.c  |  76 +---
 fs/btrfs/tests/free-space-tree-tests.c |  30 +--
 fs/btrfs/tests/inode-tests.c   | 344 ++---
 fs/btrfs/tests/qgroup-tests.c  | 111 ++-
 15 files changed, 454 insertions(+), 340 deletions(-)
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: What is the worst case scenario of splitting an extent-tree's node or leaf?

2016-06-06 Thread Hugo Mills
On Mon, Jun 06, 2016 at 08:17:15PM +0800, Kaho Ng wrote:
> Hi all,
> 
> I have questions on how BTRFS handles the scenario that a split pf
> extent-tree's node/leaf  incurs another splits in the tree's
> node/leaf. Is that scenario bounded? If it is, how could I calculate
> the number that the split starts looping.

   For reference for anyone replying, we've already had this
conversation as far as I could manage it on IRC a few days ago:

http://logs.tvrrug.org.uk/logs/%23btrfs/2016-05-29.html#2016-05-29T10:13:53

   Hugo.

-- 
Hugo Mills | In one respect at least, the Martians are a happy
hugo@... carfax.org.uk | people: they have no lawyers
http://carfax.org.uk/  |
PGP: E2AB1DE4  |   John Carter, A Princess of Mars


signature.asc
Description: Digital signature


Re: [PATCH] btrfs: chunk_width_limit mount option

2016-06-06 Thread Hugo Mills
On Mon, Jun 06, 2016 at 09:43:19AM -0400, Andrew Armenia wrote:
> On Mon, Jun 6, 2016 at 5:17 AM, David Sterba  wrote:
> > On Thu, Jun 02, 2016 at 09:50:15PM -0400, Andrew Armenia wrote:
> >> This patch adds mount option 'chunk_width_limit=X', which when set forces
> >> the chunk allocator to use only up to X devices when allocating a chunk.
> >> This may help reduce the seek penalties seen in filesystems with large
> >> numbers of devices.
> >
> > There is a use for such limit but passing it as a mount is not the right
> > way to do it. Changing the value requires remount, different raid levels
> > might need different values, it's not persistent etc.
> 
> Thanks for the feedback. I agree that it's less than an ideal way to
> do it. I guess I was a bit quick to release proof-of-concept code.
> I'll try to do some more work on the configuration side of things and
> resubmit.

   Last time something like this came up in discussion, the consensus
was to put the configuration in xattrs in the btrfs namespace. There
should be one with a name to indicate "global" configuration, applied
to the top of subvol 5. If the feature allows configuration on a
per-subvol or per-object basis, then there should also be a name for
the relevant xattr (also in the btrfs namespace) that can be created
on each object as required.

   Hugo.

-- 
Hugo Mills | Klytus, I'm bored. What plaything can you offer me
hugo@... carfax.org.uk | today?
http://carfax.org.uk/  |
PGP: E2AB1DE4  |  Ming the Merciless, Flash Gordon


signature.asc
Description: Digital signature


Re: btrfs device delete /dev/sdc1 /mnt/raid1 user experience

2016-06-06 Thread Austin S. Hemmelgarn

On 2016-06-06 01:44, Kai Hendry wrote:

Hi there,


I planned to remove one of my disks, so that I can take it from
Singapore to the UK and then re-establish another remote RAID1 store.

delete is an alias of remove, so I added a new disk (devid 3) and
proceeded to run:
btrfs device delete /dev/sdc1 /mnt/raid1 (devid 1)


nuc:~$ uname -a
Linux nuc 4.5.4-1-ARCH #1 SMP PREEMPT Wed May 11 22:21:28 CEST 2016
x86_64 GNU/Linux
nuc:~$   btrfs --version
btrfs-progs v4.5.3
nuc:~$ sudo btrfs fi show /mnt/raid1/
Label: 'extraid1'  uuid: 5cab2a4a-e282-4931-b178-bec4c73cdf77
Total devices 2 FS bytes used 776.56GiB
devid2 size 931.48GiB used 778.03GiB path /dev/sdb1
devid3 size 1.82TiB used 778.03GiB path /dev/sdd1

nuc:~$ sudo btrfs fi df /mnt/raid1/
Data, RAID1: total=775.00GiB, used=774.39GiB
System, RAID1: total=32.00MiB, used=144.00KiB
Metadata, RAID1: total=3.00GiB, used=2.17GiB
GlobalReserve, single: total=512.00MiB, used=0.00B


First off, I was expecting btrfs to release the drive pretty much
immediately. The command took about half a day to complete. I watched
`btrfs fi show` to see size of devid 1 (the one I am trying to remove)
to be zero and to see used space slowly go down whilst used space of
devid 3 (the new disk) slowly go up.
If you're using multiple devices, BTRFS has to move any data off of the 
device being removed onto other devices in the FS before it can remove 
it.  Not doing so runs the risk of another disk failing before the 
balance that would otherwise be needed afterwards to maintain 
replication profiles completes, thus causing a loss of data.


Secondly and most importantly my /dev/sdc1 can't be mounted now anymore.
Why?
Deleting the device from the array removes the data from it (as 
mentioned above), and wipes all BTRFS specific signatures as well.


sudo mount -t btrfs /dev/sdc1 /mnt/test/
mount: wrong fs type, bad option, bad superblock on /dev/sdc1,
   missing codepage or helper program, or other error

   In some cases useful info is found in syslog - try
   dmesg | tail or so.

There is nothing in dmesg nor my journal. I wasn't expecting my drive to
be rendered useless on removing or am I missing something?
If your getting such a message and there's nothing in dmesg, then it's 
because there's no filesystem of the requested type on the device you 
specified.


nuc:~$ sudo fdisk -l /dev/sdc
Disk /dev/sdc: 931.5 GiB, 1000204885504 bytes, 1953525167 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 33553920 bytes
Disklabel type: gpt
Disk identifier: 19938642-3B10-4220-BF99-3E12AF1D1CF6

Device StartEndSectors   Size Type
/dev/sdc1   2048 1953525133 1953523086 931.5G Linux filesystem

On #btrfs IRC channel I'm told:

 hendry: breaking multi-disk filesystems in half is not a
recommended way to do "normal operations" :D
Which is exactly correct.  Trying to split off a single device the way 
you want is extremely dangerous and risks breaking the source filesystem 
as well as not having a complete copy of the data on the receiving end.


I'm still keen to take a TB on a flight with me the day after tomorrow.
What is the recommended course of action? Recreate a mkfs.btrfs on
/dev/sdc1 and send data to it from /mnt/raid1?
Based on the fact that you appear to want to carry a disk to copy data 
more quickly than over then internet, then what you've already done plus 
this is the correct way to do it.


Still I hope the experience could be improved to remove a disk sanely.
It did exactly what you told it to, it removed the device from the 
array.  The misunderstanding here is as to what that means.  Removing a 
disk from a BTRFS filesystem means that you can't mount that disk 
anymore, and for all intents and purposes, it is now blank media (in 
practice, it only wipes the superblocks and some of the metadata, but 
that's irrelevant to what your asking, as you can't recover the 
filesystem from what's left).

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Transaction aborted in btrfs_rename2

2016-06-06 Thread Jeff Mahoney
On 6/6/16 7:47 AM, Adam Borowski wrote:
> Hi!
> I just got this thrice, in 4.7-rc1 and 4.7-rc2:
> 
> [ 1836.672368] [ cut here ]
> [ 1836.672382] WARNING: CPU: 1 PID: 16348 at fs/btrfs/inode.c:9820 
> btrfs_rename2+0xcd2/0x2a50
> [ 1836.672385] BTRFS: Transaction aborted (error -2)
> [ 1836.672387] Modules linked in: nvidia(PO) usb_storage
> [ 1836.672396] CPU: 1 PID: 16348 Comm: gcc-6 Tainted: P   O
> 4.7.0-rc2-debug+ #3
> [ 1836.672399] Hardware name: System manufacturer System Product Name/M4A77T, 
> BIOS 240105/18/2011
> [ 1836.672402]  81f8b504 880062c47c78 8165be6d 
> 0007
> [ 1836.672407]  880062c47cd0  880062c47cc0 
> 81110c1c
> [ 1836.672411]  880062c47d20 265c814e8642  
> 00a25ade
> [ 1836.672415] Call Trace:
> [ 1836.672423]  [] dump_stack+0x4e/0x71
> [ 1836.672429]  [] __warn+0x10c/0x150
> [ 1836.672433]  [] warn_slowpath_fmt+0x4a/0x50
> [ 1836.672437]  [] btrfs_rename2+0xcd2/0x2a50
> [ 1836.672443]  [] ? btrfs_permission+0x5b/0xc0
> [ 1836.672448]  [] ? down_write+0x18/0x60
> [ 1836.672453]  [] vfs_rename+0x7cc/0xc30
> [ 1836.672457]  [] SyS_rename+0x32b/0x420
> [ 1836.672461]  [] entry_SYSCALL_64_fastpath+0x17/0x93
> [ 1836.672464] ---[ end trace 6405b6e3d0e6c945 ]---
> [ 1836.672468] BTRFS warning (device sda1): btrfs_rename:9820: Aborting 
> unused transaction(No such entry).
> [ 1836.675505] BTRFS warning (device sda1): btrfs_rename:9820: Aborting 
> unused transaction(No such entry).
> 
> [ 1837.935238] BTRFS warning (device sda1): btrfs_rename:9820: Aborting 
> unused transaction(No such entry).
> [ 1837.937602] BTRFS: error (device sda1) in btrfs_rename:9820: errno=-2 No 
> such entry
> [ 1837.937607] BTRFS info (device sda1): forced readonly
> [ 1838.086754] BTRFS warning (device sda1): Skipping commit of aborted 
> transaction.
> [ 1838.086762] BTRFS: error (device sda1) in cleanup_transaction:1857: 
> errno=-2 No such entry
> [ 1838.086782] BTRFS info (device sda1): delayed_refs has NO entry
> 
> Didn't trigger during a week of other work, yet a kernel compile triggers
> this reliably.
> 
> Filesystem appears consistent (btrfs check, scrub).
> Mount options: noatime,compress=lzo,ssd,space_cache.
> 

Oh, interesting.  We're seeing this on our 4.4-based kernels as well but
only on arm64.  That it's triggering on x86_64 is a good data point.
I'm hunting this one today.

-Jeff

-- 
Jeff Mahoney
SUSE Labs



signature.asc
Description: OpenPGP digital signature


Re: [PATCH] btrfs: chunk_width_limit mount option

2016-06-06 Thread Andrew Armenia
On Mon, Jun 6, 2016 at 5:17 AM, David Sterba  wrote:
> On Thu, Jun 02, 2016 at 09:50:15PM -0400, Andrew Armenia wrote:
>> This patch adds mount option 'chunk_width_limit=X', which when set forces
>> the chunk allocator to use only up to X devices when allocating a chunk.
>> This may help reduce the seek penalties seen in filesystems with large
>> numbers of devices.
>
> There is a use for such limit but passing it as a mount is not the right
> way to do it. Changing the value requires remount, different raid levels
> might need different values, it's not persistent etc.

Thanks for the feedback. I agree that it's less than an ideal way to
do it. I guess I was a bit quick to release proof-of-concept code.
I'll try to do some more work on the configuration side of things and
resubmit.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Recommended why to use btrfs for production?

2016-06-06 Thread Austin S. Hemmelgarn

On 2016-06-05 22:40, James Johnston wrote:

On 06/06/2016 at 01:47, Chris Murphy wrote:

On Sun, Jun 5, 2016 at 4:45 AM, Mladen Milinkovic  wrote:

On 06/03/2016 04:05 PM, Chris Murphy wrote:

Make certain the kernel command timer value is greater than the driver
error recovery timeout. The former is found in sysfs, per block
device, the latter can be get and set with smartctl. Wrong
configuration is common (it's actually the default) when using
consumer drives, and inevitably leads to problems, even the loss of
the entire array. It really is a terrible default.


Since it's first time i've heard of this I did some googling.

Here's some nice article about these timeouts:
http://strugglers.net/~andy/blog/2015/11/09/linux-software-raid-and-drive-

timeouts/comment-page-1/


And some udev rules that should apply this automatically:
http://comments.gmane.org/gmane.linux.raid/48193


Yes it's a constant problem that pops up on the linux-raid list.
Sometimes the list is quiet on this issue but it really seems like
it's once a week. From last week...

http://www.spinics.net/lists/raid/msg52447.html


It seems like it would be useful if the distributions or the kernel could
automatically set the kernel timeout to an appropriate value.  If the TLER can 
be
indeed be queried via smartctl, then it would be easy to automatically read it,
and then calculate a suitable timeout.  A RAID-oriented drive would end up 
leaving
the current 30 seconds, while if it can't successfully query for TLER or the 
drive
just doesn't support it, then assume a consumer drive and set timeout for 180
seconds.

That way, zero user configuration would be needed in the common case.  Or is it
not that simple?
Strictly speaking, it's policy, and therefore shouldn't be in the 
kernel.  It's not hard to write a script to handle this though, both 
hdparm and smartctl can set the SCT ERC value, and will report an error 
if it fails, so you can try and set the value as you want (I personally 
would go with 10 seconds instead of 7), and if that fails, bump the 
kernel command timout.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Recommended why to use btrfs for production?

2016-06-06 Thread Austin S. Hemmelgarn

On 2016-06-03 21:48, Chris Murphy wrote:

On Fri, Jun 3, 2016 at 6:48 PM, Nicholas D Steeves  wrote:

On 3 June 2016 at 11:33, Austin S. Hemmelgarn  wrote:

On 2016-06-03 10:11, Martin wrote:


Make certain the kernel command timer value is greater than the driver
error recovery timeout. The former is found in sysfs, per block
device, the latter can be get and set with smartctl. Wrong
configuration is common (it's actually the default) when using
consumer drives, and inevitably leads to problems, even the loss of
the entire array. It really is a terrible default.



Are nearline SAS drives considered consumer drives?


If it's a SAS drive, then no, especially when you start talking about things
marketed as 'nearline'.  Additionally, SCT ERC is entirely a SATA thing, I
forget what the equivalent in SCSI (and by extension SAS) terms is, but I'm
pretty sure that the kernel handles things differently there.


For the purposes of BTRFS RAID1: For drives that ship with SCT ERC of
7sec, is the default kernel command timeout of 30sec appropriate, or
should it be reduced?


It's fine. But it depends on your use case, if it can tolerate a rare

7 second < 30 second hang, and you're prepared to start

investigating the cause then I'd leave it alone. If the use case
prefers resetting the drive when it stops responding, then you'd go
with something shorter.

I'm fairly certain SAS's command queue doesn't get obliterated with
such a link reset, just the hung command; where SATA drives all
information in the queue is lost. So resets on SATA are a much bigger
penalty if I have the correct understanding.
There's also more involved otherwise with a ATA link reset because AHCI 
controllers aren't MP safe, so there's a global lock that has to be held 
while talking to them.  Because of this, a link reset on an ATA drive 
(be it SATA or PATA) will cause performance degradation for all other 
devices on that controller as well until the reset is complete.




 For SATA drives that do not support SC TERC, is
it true that 120sec is a sane value?  I forget where I got this value
of 120sec;


It's a good question. It's not well documented, is not defined in the
SATA spec, so it's probably make/model specific. The linux-raid@ list
probably has the most information on this just because their users get
nailed by this problem often. And the recommendation does seem to vary
around 120 to 180. That is of course a maximum. The drive could give
up much sooner. But what you don't want is for the drive to be in
recovery for a bad sector, and the command timer does a link reset,
losing all of what the drive was doing: all of which is replaceable
except really one thing which is what sector was having the problem.
And right now there's no report of the drive for slow sectors. It only
reports failed reads, and it's that failed read error that includes
the sector, so that the raid mechanism can figure out what data is
missing, recongistruct from mirror or parity, and then fix the bad
sector by writing to it.
FWIW, I usually go with 150 on the Seagate 'Desktop' drives I use.  I've 
seen some cheap Hitachi and Toshiba disks that need it as high as 300 
though to work right.



it might have been this list, it might have been an mdadm
bug report.  Also, in terms of tuning, I've been unable to find
whether the ideal kernel timeout value changes depending on RAID
type...is that a factor in selecting a sane kernel timeout value?


No. It's strictly a value to make certain you get read errors from the
drive rather than link resets.
You have to factor in how the controller handles things too.  SOme of 
them will retry just like a desktop drive, and you need to account for that.


And that's why I think it's a bad default, because it totally thwarts
attempts by manufacturers to recover marginal sectors, even in the
single disk case.
That's debatable, by attempting to recover the bad sector, they're 
slowing down the whole system.  The likelihood of recovering a bad 
sectors functionally falls off linearly the longer you try, and not 
having the ability to choose when to report an error is the bigger issue 
here.


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs

2016-06-06 Thread Austin S. Hemmelgarn

On 2016-06-03 21:51, Christoph Anton Mitterer wrote:

On Fri, 2016-06-03 at 15:50 -0400, Austin S Hemmelgarn wrote:

There's no point in trying to do higher parity levels if we can't get
regular parity working correctly.  Given the current state of things,
it might be better to break even and just rewrite the whole parity
raid thing from scratch, but I doubt that anybody is willing to do
that.


Well... as I've said, things are pretty worrying. Obviously I cannot
really judge, since I'm not into btrfs' development... maybe there's a
lack of manpower? Since btrfs seems to be a very important part (i.e.
next-gen fs), wouldn't it be possible to either get some additional
funding by the Linux Foundation, or possible that some of the core
developers make an open call for funding by companies?
Having some additional people, perhaps working fulltime on it, may be a
big help.

As for the RAID... given how many time/effort is spent now into 5/6,..
it really seems that one should have considered multi-parity from the
beginning on.
Kinda feels like either, with multi-parity this whole instability phase
would start again, or it will simply never happen.
New features will always cause some instability, period, there is no way 
to avoid that.




- Serious show-stoppers and security deficiencies like the UUID
  collision corruptions/attacks that have been extensively
discussed
  earlier, are still open

The UUID issue is not a BTRFS specific one, it just happens to be
easier
to cause issues with it on BTRFS


uhm this had been discussed extensively before, as I've said... AFAICS
btrfs is the only system we have, that can possibly cause data
corruption or even security breach by UUID collisions.
I wouldn't know that other fs, or LVM are affected, these just continue
to use those devices already "online"... and I think lvm refuses to
activate VGs, if conflicting UUIDs are found.
If you are mounting by UUID, it is entirely non-deterministic which 
filesystem with that UUID will be mounted (because device enumeration is 
non-deterministic).  As far as LVM, it refuses activating VG's, but it 
can still have issues if you have LV's with the same UUID (which can be 
done pretty trivially), and the fact that it refuses to activate them 
technically constitutes a DoS attack (because you can't use the resources).




There is no way to solve it sanely given the requirement that
userspace
not be broken.

No this is not true. Back when this was discussed, I and others
described how it could/should be done,... respectively how
userspace/kernel should behave, in short:
- continue using those devices that are already active

This is easy, but only works for mounted filesystems.

- refusing to (auto)assemble by UUID, if there are conflicts
  or requiring to specify the devices (with some --override-yes-i-know-
  what-i-do option option or so)
- in case of assembling/rebuilding/similar... never doing this
  automatically
These two allow anyone with the ability to plug in a USB device to DoS 
the system.


I think there were some more corner cases, I basically had them all
discussed in the thread back then (search for "attacking btrfs
filesystems via UUID collisions?" and IIRC some different titled parent
or child threads).



  Properly fixing this would likely make us more dependent
on hardware configuration than even mounting by device name.

Sure, if there are colliding UUIDs, and one still wants to mount (by
using some --override-yes-i-know-what-i-do option),.. it would need to
be by specifying the device name...
But where's the problem?
This would anyway only happen if someone either attacks or someone made
a clone, and it's far better to refuse automatic assembly in cases
where accidental corruption can happen or where attacks may be
possible, requiring the user/admin to manually take action, than having
corruption or security breach.
Refusing automatic assembly does not prevent the attack, it simply 
converts it from a data corruption attack to a DoS attack.


Imagine the simple case: degraded RAID1 on a PC; if btrfs would do some
auto-rebuild based on UUID, then if an attacker knows that he'd just
need to plug in a USB disk with a fitting UUID...and easily gets a copy
of everything on disk, gpg keys, ssh keys, etc.
If the attacker has physical access to the machine, it's irrelevant even 
with such protection, as there are all kinds of other things that could 
be done to get data off of the disk (especially if the system has 
thunderbolt ports or USB C ports).  If the user has any unsecured 
encryption or authentication tokens on the system, they're screwed 
anyway though.



- a number of important core features not fully working in many
  situations (e.g. the issues with defrag, not being ref-link
aware,...
  an I vaguely remember similar things with compression).

OK, how then should defrag handle reflinks?  Preserving them prevents
it
from being able to completely defragment data.

Didn't that even work in the past and had 

What is the worst case scenario of splitting an extent-tree's node or leaf?

2016-06-06 Thread Kaho Ng
Hi all,

I have questions on how BTRFS handles the scenario that a split pf
extent-tree's node/leaf  incurs another splits in the tree's
node/leaf. Is that scenario bounded? If it is, how could I calculate
the number that the split starts looping.

Regards,
Kaho Ng
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RAID1 vs RAID10 and best way to set up 6 disks

2016-06-06 Thread Austin S. Hemmelgarn

On 2016-06-05 16:31, Christoph Anton Mitterer wrote:

On Sun, 2016-06-05 at 09:36 -0600, Chris Murphy wrote:

That's ridiculous. It isn't incorrect to refer to only 2 copies as
raid1.

No, if there are only two devices then not.
But obviously we're talking about how btrfs does RAID1, in which even
with n>2 devices there are only 2 copies - that's incorrect.
Go read the original standards that defined the term RAID (assuming you 
can find a openly accessible copy), it defines RAID1 as a mirrored 
_pair_ of disks.  This is how every hardware RAID controller I've ever 
seen implements RAID1, and in fact how most software RAID 
implementations (including the fake raid in some motherboards) 
implements it with the sole exception of Linux's MD-RAID and it's direct 
derivatives (which includes LVM/DM based RAID, as well as BSD's GEOM 
framework).




 You have to explicitly ask both mdadm

Aha, and which option would that be?
Specifying more than two disks.  The request is more correctly an 
implicit one, but the fact that it's implied by a now largely obsolete 
piece of software does not mean that BTRFS should have the same 
implications.



 and lvcreate for the
number of copies you want, it doesn't automatically happen.

I've said that before, but at least it allows you to use the full
number of disks, so we're again back to that it's closer to the
original and common meaning of RAID1 than what btrfs does.

/me inserts reflink to the first part of my reply.




 The man
page for mkfs.btrfs is very clear you only get two copies.


I haven't denied that... but one shouldn't use terms that are commonly
understood in a different mannor and require people to read all the
small printed.
One could also have changed it's RAID0 with RAID1, and I guess people
wouldn't be too delighted if the excuse was "well it's in the manpage".
You can leave the hyperbolic theoreticals out of this, they really do 
detract from your argument.






Well I'd say, for btrfs: do away with the term "RAID" at all, use
e.g.:

linear = just a bunch of devices put together, no striping
 basically what MD's linear is

Except this isn't really how Btrfs single works. The difference
between mdadm linear and Btrfs single is more different in behavior
than the difference between mdadm raid1 and btrfs raid1. So you're
proposing tolerating a bigger difference, while criticizing a smaller
one. *shrug*


What's the big difference? Would you care to explain? But I'm happy
with "single" either, it just doesn't really tell that there is no
striping, I mean "single" points more towards "we have no resilience
but only 1 copy", whether this is striped or not.
On this point I actually do kind of agree with you, but Chris is also 
correct here, BTRFS single mode is just as different from MD linear mode 
as BTRFS raid1 is from MD RAID1, if not more so.





If a metaphor is going to be used for a technical thing, it would be
mirrors or mirroring. Mirror would mean exactly two (the original and
the mirror). See lvcreate --mirrors. Also, the lvm mirror segment
type
is legacy, having been replaced with raid1 (man lvcreate uses the
term
raid1, not RAID1 or RAID-1). So I'm not a big fan of this term.


Admittedly, I didn't like the "mirror(s)" either... I was just trying
to show that different names could be used that are already a bit
better.



striped = basically what RAID0 is


lvcreate uses only striped, not raid0. mdadm uses only RAID0, not
striped. Since striping is also employed with RAIDs 4, 5, 6, 7, it
seems ambiguous even though without further qualification whether
parity exists, it's considered to mean non-parity striping. The
ambiguity is probably less of a problem than the contradiction that
is
RAID0.


Mhh,.. well or one makes schema names that contain all possible
properties of a "RAID", something like:
replicasN-parityN-[not]striped

SINGLE would be something like "replicas1-parity0-notstriped".
RAID5 would be something like "replicas0-parity1-striped".
It's worth pointing out that both programmers and sysadmins are still 
lazy typists, so it would more likely end up being:

rep1-par0-strip0
rep0-par1-stripN (with N being the number of desired stripes).

Having a number to indicate the striping is actually useful (there are 
legitimate cases for not striping across everything we can, and we need 
some way to represent stripes that weren't allocated at full width for 
some reason).


Such a scheme was actually proposed back when the higher order parity 
patches were being discussed.  Like those patches, it was decided to 
wait until we had basic feature completeness before trying to tackle that.




And just mention in the manpage, which of these names comes closest
to
what people understand by RAID level i.


It already does this. What version of btrfs-progs are you basing your
criticism on that there's some inconsistency, deficiency, or
ambiguity
when it comes to these raid levels?


Well first, the terminology thing is the least serious 

Transaction aborted in btrfs_rename2

2016-06-06 Thread Adam Borowski
Hi!
I just got this thrice, in 4.7-rc1 and 4.7-rc2:

[ 1836.672368] [ cut here ]
[ 1836.672382] WARNING: CPU: 1 PID: 16348 at fs/btrfs/inode.c:9820 
btrfs_rename2+0xcd2/0x2a50
[ 1836.672385] BTRFS: Transaction aborted (error -2)
[ 1836.672387] Modules linked in: nvidia(PO) usb_storage
[ 1836.672396] CPU: 1 PID: 16348 Comm: gcc-6 Tainted: P   O
4.7.0-rc2-debug+ #3
[ 1836.672399] Hardware name: System manufacturer System Product Name/M4A77T, 
BIOS 240105/18/2011
[ 1836.672402]  81f8b504 880062c47c78 8165be6d 
0007
[ 1836.672407]  880062c47cd0  880062c47cc0 
81110c1c
[ 1836.672411]  880062c47d20 265c814e8642  
00a25ade
[ 1836.672415] Call Trace:
[ 1836.672423]  [] dump_stack+0x4e/0x71
[ 1836.672429]  [] __warn+0x10c/0x150
[ 1836.672433]  [] warn_slowpath_fmt+0x4a/0x50
[ 1836.672437]  [] btrfs_rename2+0xcd2/0x2a50
[ 1836.672443]  [] ? btrfs_permission+0x5b/0xc0
[ 1836.672448]  [] ? down_write+0x18/0x60
[ 1836.672453]  [] vfs_rename+0x7cc/0xc30
[ 1836.672457]  [] SyS_rename+0x32b/0x420
[ 1836.672461]  [] entry_SYSCALL_64_fastpath+0x17/0x93
[ 1836.672464] ---[ end trace 6405b6e3d0e6c945 ]---
[ 1836.672468] BTRFS warning (device sda1): btrfs_rename:9820: Aborting unused 
transaction(No such entry).
[ 1836.675505] BTRFS warning (device sda1): btrfs_rename:9820: Aborting unused 
transaction(No such entry).

[ 1837.935238] BTRFS warning (device sda1): btrfs_rename:9820: Aborting unused 
transaction(No such entry).
[ 1837.937602] BTRFS: error (device sda1) in btrfs_rename:9820: errno=-2 No 
such entry
[ 1837.937607] BTRFS info (device sda1): forced readonly
[ 1838.086754] BTRFS warning (device sda1): Skipping commit of aborted 
transaction.
[ 1838.086762] BTRFS: error (device sda1) in cleanup_transaction:1857: errno=-2 
No such entry
[ 1838.086782] BTRFS info (device sda1): delayed_refs has NO entry

Didn't trigger during a week of other work, yet a kernel compile triggers
this reliably.

Filesystem appears consistent (btrfs check, scrub).
Mount options: noatime,compress=lzo,ssd,space_cache.

-- 
An imaginary friend squared is a real enemy.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs: chunk_width_limit mount option

2016-06-06 Thread David Sterba
On Thu, Jun 02, 2016 at 09:50:15PM -0400, Andrew Armenia wrote:
> This patch adds mount option 'chunk_width_limit=X', which when set forces
> the chunk allocator to use only up to X devices when allocating a chunk.
> This may help reduce the seek penalties seen in filesystems with large
> numbers of devices.

There is a use for such limit but passing it as a mount is not the right
way to do it. Changing the value requires remount, different raid levels
might need different values, it's not persistent etc.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 2/2] btrfs-progs: convert-tests: Add test case for backup superblock migration

2016-06-06 Thread David Sterba
On Fri, Jun 03, 2016 at 10:34:26AM +0800, Qu Wenruo wrote:
> New convert introduced simpler chunk/extent allocation algorithm, at the
> cost of complex backup superblock migration codes.
> 
> Use specially built ext2 images to test if btrfs-convert can convert and
> rollback images without problem.
> 
> All these special ext2 image have blocks/holes across 2nd btrfs backup
> superblock.
> 
> The naming of test image is like the following:
>   |<--superblock migration range->|
>   64M 64M + 64K
> |-Data--|-Data--|/Hole//|-Data--|/Hole//|-Data--|--Data--| = drdhdhdrd
> 
> These test cases should check all typical layouts and make sure new
> convert works.
> 
> Signed-off-by: Qu Wenruo 

Applied, thanks.

> + extracted=$(extract_image "$src")
> + run_check $SUDO_HELPER e2fsck -n -f $extracted
> + run_check $SUDO_HELPER $TOP/btrfs-convert $extracted
> + run_check $SUDO_HELPER $TOP/btrfs check $extracted
> + run_check $SUDO_HELPER $TOP/btrfs-show-super $extracted
> +
> + run_check $SUDO_HELPER mount -o loop $extracted $TEST_MNT
> + run_check $SUDO_HELPER e2fsck -n -f $TEST_MNT/ext2_saved/image
> + run_check $SUDO_HELPER umount $TEST_MNT
> +
> + run_check $SUDO_HELPER $TOP/btrfs check $extracted
> + run_check $SUDO_HELPER $TOP/btrfs-convert -r $extracted
> + run_check $SUDO_HELPER e2fsck -n -f $extracted

Most of the SUDO_HELPER calls are not necessary, and the tests failed
when run under user (on NFS where root cannot do everything), so I did
some additional changes. The separate convert dir should be now ready
for adding more images easily.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 1/2] btrfs-progs: convert-tests: Add support for custom test scripts

2016-06-06 Thread David Sterba
On Fri, Jun 03, 2016 at 10:34:25AM +0800, Qu Wenruo wrote:
> Add support for custom convert test scripts, just like fsck tests.
> 
> Instead of generic convert tests, we need more specifically created images
> for new convert tests.
> 
> This patch provide the needed infrastructure for later convert test
> cases.
> 
> Signed-off-by: Qu Wenruo 

Applied, thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 2/2] Btrfs: add valid checks for chunk loading

2016-06-06 Thread David Sterba
On Fri, Jun 03, 2016 at 12:05:15PM -0700, Liu Bo wrote:
> To prevent fuzz filesystem images from panic the whole system,
> we need various validation checks to refuse to mount such an image
> if btrfs finds any invalid value during loading chunks, including
> both sys_array and regular chunks.
> 
> Note that these checks may not be sufficient to cover all corner cases,
> feel free to add more checks.
> 
> Reported-by: Vegard Nossum 
> Reported-by: Quentin Casasnovas 
> Signed-off-by: Liu Bo 

Reviewed-by: David Sterba 

(switched the printk to btrfs_*)
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/4] btrfs: return all mirror whether need_raid_map set or not

2016-06-06 Thread Qu Wenruo



At 06/06/2016 04:21 PM, David Sterba wrote:

On Tue, Dec 15, 2015 at 07:09:48PM +0800, Zhao Lei wrote:

__btrfs_map_block() should return all mirror on WRITE,
REQ_GET_READ_MIRRORS, and RECOVERY case, whether need_raid_map set
or not.

need_raid_map only used to control is to set bbio->raid_map.

Current code works right becuase there is only one caller can
trigger above bug, which is readahead, and this function happened
to bypass on less mirror.

But after we fixed __btrfs_map_block(), readahead will be really
works, and exit with warning at another bug.
This patchset fixed __btrfs_map_block(), and disable raid56
readahead temp temporary, (actually, it already disable by this bug),
and I'll fix raid56 readahead next.

Zhao Lei (4):
  btrfs: Disable raid56 readahead
  btrfs: return all mirror whether need_raid_map set or not
  btrfs: Small cleanup for get index_srcdev loop
  btrfs: Use direct way to determine raid56 write/recover mode


Are these patches still relevant? If yes, please update and resed,
thanks.


Hi David,

Zhao doesn't work on btrfs any more.

So I'll re-check these patches and see if they are still needed.
If so, I'll rebase (some of) them and resend.

Thanks,
Qu

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html





--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 1/2] Btrfs: add more valid checks for superblock

2016-06-06 Thread David Sterba
On Fri, Jun 03, 2016 at 12:05:14PM -0700, Liu Bo wrote:
> This adds valid checks for super_total_bytes, super_bytes_used and
> super_stripesize, super_num_devices.
> 
> Reported-by: Vegard Nossum 
> Reported-by: Quentin Casasnovas 
> Signed-off-by: Liu Bo 

Reviewed-by: David Sterba 

I'll switch the printk(KERN_ to the btrfs_* helpers and do minor tweaks
in the messages, but otherwise this version looks good to me. I don't
think we need the helper for stripe value checks as I proposed.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/4] btrfs: return all mirror whether need_raid_map set or not

2016-06-06 Thread David Sterba
On Tue, Dec 15, 2015 at 07:09:48PM +0800, Zhao Lei wrote:
> __btrfs_map_block() should return all mirror on WRITE,
> REQ_GET_READ_MIRRORS, and RECOVERY case, whether need_raid_map set
> or not.
> 
> need_raid_map only used to control is to set bbio->raid_map.
> 
> Current code works right becuase there is only one caller can
> trigger above bug, which is readahead, and this function happened
> to bypass on less mirror.
> 
> But after we fixed __btrfs_map_block(), readahead will be really
> works, and exit with warning at another bug.
> This patchset fixed __btrfs_map_block(), and disable raid56
> readahead temp temporary, (actually, it already disable by this bug),
> and I'll fix raid56 readahead next.
> 
> Zhao Lei (4):
>   btrfs: Disable raid56 readahead
>   btrfs: return all mirror whether need_raid_map set or not
>   btrfs: Small cleanup for get index_srcdev loop
>   btrfs: Use direct way to determine raid56 write/recover mode

Are these patches still relevant? If yes, please update and resed,
thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Btrfs: clear uptodate flags of pages in sys_array eb

2016-06-06 Thread David Sterba
On Fri, Jun 03, 2016 at 06:16:18PM -0700, Liu Bo wrote:
> On Fri, Jun 03, 2016 at 05:41:42PM -0700, Liu Bo wrote:
> > We set uptodate flag to pages in the temporary sys_array eb,
> > but do not clear the flag after free eb.  As the special
> > btree inode may still hold a reference on those pages, the
> > uptodate flag can remain alive in them.
> > 
> > If btrfs_super_chunk_root has been intentionally changed to the
> > offset of this sys_array eb, reading chunk_root will read content
> > of sys_array and it will pass our beautiful checks in
> 
> s/pass/skip/
> 
> My mistake, sorry.

Updated in the patch.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 42/45] block, fs, drivers: remove REQ_OP compat defs and related code

2016-06-06 Thread Hannes Reinecke
On 06/05/2016 09:32 PM, mchri...@redhat.com wrote:
> From: Mike Christie 
> 
> This patch drops the compat definition of req_op where it matches
> the rq_flag_bits definitions, and drops the related old and compat
> code that allowed users to set either the op or flags for the operation.
> 
> We also then store the operation in the bi_rw/cmd_flags field similar
> to how we used to store the bio ioprio where it sat in the upper bits
> of the field.
> 
> Signed-off-by: Mike Christie 
> ---
>  drivers/scsi/sd.c   |  2 +-
>  include/linux/bio.h |  3 ---
>  include/linux/blk_types.h   | 52 
> +
>  include/linux/blkdev.h  | 14 
>  include/linux/fs.h  | 37 +---
>  include/trace/events/f2fs.h |  1 -
>  6 files changed, 46 insertions(+), 63 deletions(-)
> 
Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
-- 
Dr. Hannes ReineckeTeamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 41/45] block, drivers, fs: shrink bi_rw from long to int

2016-06-06 Thread Hannes Reinecke
On 06/05/2016 09:32 PM, mchri...@redhat.com wrote:
> From: Mike Christie 
> 
> We don't need bi_rw to be so large on 64 bit archs, so
> reduce it to unsigned int.
> 
> Signed-off-by: Mike Christie 
> ---
>  block/blk-core.c   | 2 +-
>  drivers/md/dm-flakey.c | 2 +-
>  drivers/md/raid5.c | 6 +++---
>  fs/btrfs/check-integrity.c | 4 ++--
>  fs/btrfs/inode.c   | 2 +-
>  include/linux/blk_types.h  | 2 +-
>  6 files changed, 9 insertions(+), 9 deletions(-)
> 
Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
-- 
Dr. Hannes ReineckeTeamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 40/45] block: move bio io prio to a new field

2016-06-06 Thread Hannes Reinecke
On 06/05/2016 09:32 PM, mchri...@redhat.com wrote:
> From: Mike Christie 
> 
> In the next patch, we move drop the compat code and make
> the op a separate value that is hidden in bi_rw. To give
> the op and rq bits flags room to grow this moves prio to
> its own field.
> 
> Signed-off-by: Mike Christie 
> ---
>  include/linux/bio.h   | 14 ++
>  include/linux/blk_types.h |  5 ++---
>  2 files changed, 4 insertions(+), 15 deletions(-)
> 
> diff --git a/include/linux/bio.h b/include/linux/bio.h
> index 4568647..35108c2 100644
> --- a/include/linux/bio.h
> +++ b/include/linux/bio.h
> @@ -47,18 +47,8 @@
>  #define bio_op(bio)  (op_from_rq_bits((bio)->bi_rw))
>  #define bio_set_op_attrs(bio, op, flags) ((bio)->bi_rw |= (op | flags))
>  
> -/*
> - * upper 16 bits of bi_rw define the io priority of this bio
> - */
> -#define BIO_PRIO_SHIFT   (8 * sizeof(unsigned long) - IOPRIO_BITS)
> -#define bio_prio(bio)((bio)->bi_rw >> BIO_PRIO_SHIFT)
> -#define bio_prio_valid(bio)  ioprio_valid(bio_prio(bio))
> -
> -#define bio_set_prio(bio, prio)  do {\
> - WARN_ON(prio >= (1 << IOPRIO_BITS));\
> - (bio)->bi_rw &= ((1UL << BIO_PRIO_SHIFT) - 1);  \
> - (bio)->bi_rw |= ((unsigned long) (prio) << BIO_PRIO_SHIFT); \
> -} while (0)
> +#define bio_prio(bio)(bio)->bi_ioprio
> +#define bio_set_prio(bio, prio)  ((bio)->bi_ioprio = prio)
>  
>  /*
>   * various member access, note that bio_data should of course not be used
> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> index 6e60baa..2738413 100644
> --- a/include/linux/blk_types.h
> +++ b/include/linux/blk_types.h
> @@ -48,9 +48,8 @@ struct bio {
>   struct block_device *bi_bdev;
>   unsigned intbi_flags;   /* status, command, etc */
>   int bi_error;
> - unsigned long   bi_rw;  /* bottom bits READ/WRITE,
> -  * top bits priority
> -  */
> + unsigned long   bi_rw;  /* READ/WRITE */
> + unsigned short  bi_ioprio;
>  
>   struct bvec_iterbi_iter;
>  
> 
Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
-- 
Dr. Hannes ReineckeTeamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 37/45] drivers: use req op accessor

2016-06-06 Thread Hannes Reinecke
On 06/05/2016 09:32 PM, mchri...@redhat.com wrote:
> From: Mike Christie 
> 
> The req operation REQ_OP is separated from the rq_flag_bits
> definition. This converts the block layer drivers to
> use req_op to get the op from the request struct.
> 
> Signed-off-by: Mike Christie 
> ---
>  drivers/block/loop.c  |  6 +++---
>  drivers/block/mtip32xx/mtip32xx.c |  2 +-
>  drivers/block/nbd.c   |  2 +-
>  drivers/block/rbd.c   |  4 ++--
>  drivers/block/xen-blkfront.c  |  8 +---
>  drivers/ide/ide-floppy.c  |  2 +-
>  drivers/md/dm.c   |  2 +-
>  drivers/mmc/card/block.c  |  7 +++
>  drivers/mmc/card/queue.c  |  6 ++
>  drivers/mmc/card/queue.h  |  5 -
>  drivers/mtd/mtd_blkdevs.c |  2 +-
>  drivers/nvme/host/core.c  |  2 +-
>  drivers/nvme/host/nvme.h  |  4 ++--
>  drivers/scsi/sd.c | 25 -
>  14 files changed, 43 insertions(+), 34 deletions(-)
> 
Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
-- 
Dr. Hannes ReineckeTeamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 28/45] target: use bio op accessors

2016-06-06 Thread Hannes Reinecke
On 06/05/2016 09:32 PM, mchri...@redhat.com wrote:
> From: Mike Christie 
> 
> Separate the op from the rq_flag_bits and have the target layer
> set/get the bio using bio_set_op_attrs/bio_op.
> 
> Signed-off-by: Mike Christie 
> ---
>  drivers/target/target_core_iblock.c | 29 ++---
>  drivers/target/target_core_pscsi.c  |  2 +-
>  2 files changed, 15 insertions(+), 16 deletions(-)
> 
> diff --git a/drivers/target/target_core_iblock.c 
> b/drivers/target/target_core_iblock.c
> index c25109c..22af12f 100644
> --- a/drivers/target/target_core_iblock.c
> +++ b/drivers/target/target_core_iblock.c
[ .. ]
> @@ -689,18 +690,15 @@ iblock_execute_rw(struct se_cmd *cmd, struct 
> scatterlist *sgl, u32 sgl_nents,
>* Force writethrough using WRITE_FUA if a volatile write cache
>* is not enabled, or if initiator set the Force Unit Access 
> bit.
>*/
> + op = REQ_OP_WRITE;
>   if (test_bit(QUEUE_FLAG_FUA, >queue_flags)) {
>   if (cmd->se_cmd_flags & SCF_FUA)
> - rw = WRITE_FUA;
> + op_flags = WRITE_FUA;
>   else if (!test_bit(QUEUE_FLAG_WC, >queue_flags))
> - rw = WRITE_FUA;
> - else
> - rw = WRITE;
> - } else {
> - rw = WRITE;
> + op_flags = WRITE_FUA;
>   }
Wrong intendation.

>   } else {
> - rw = READ;
> + op = REQ_OP_READ;
>   }
>  
>   ibr = kzalloc(sizeof(struct iblock_req), GFP_KERNEL);
> @@ -714,7 +712,7 @@ iblock_execute_rw(struct se_cmd *cmd, struct scatterlist 
> *sgl, u32 sgl_nents,
>   return 0;
>   }
>  
> - bio = iblock_get_bio(cmd, block_lba, sgl_nents, rw);
> + bio = iblock_get_bio(cmd, block_lba, sgl_nents, op, op_flags);
>   if (!bio)
>   goto fail_free_ibr;
>  
> @@ -738,7 +736,8 @@ iblock_execute_rw(struct se_cmd *cmd, struct scatterlist 
> *sgl, u32 sgl_nents,
>   bio_cnt = 0;
>   }
>  
> - bio = iblock_get_bio(cmd, block_lba, sg_num, rw);
> + bio = iblock_get_bio(cmd, block_lba, sg_num, op,
> +  op_flags);
>   if (!bio)
>   goto fail_put_bios;
>  
> diff --git a/drivers/target/target_core_pscsi.c 
> b/drivers/target/target_core_pscsi.c
> index de18790..81564c8 100644
> --- a/drivers/target/target_core_pscsi.c
> +++ b/drivers/target/target_core_pscsi.c
> @@ -922,7 +922,7 @@ pscsi_map_sg(struct se_cmd *cmd, struct scatterlist *sgl, 
> u32 sgl_nents,
>   goto fail;
>  
>   if (rw)
> - bio->bi_rw |= REQ_WRITE;
> + bio_set_op_attrs(bio, REQ_OP_WRITE, 0);
>  
>   pr_debug("PSCSI: Allocated bio: %p,"
>   " dir: %s nr_vecs: %d\n", bio,
> 
Cheers,

Hannes
-- 
Dr. Hannes ReineckeTeamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 25/45] bcache: use bio op accessors

2016-06-06 Thread Hannes Reinecke
On 06/05/2016 09:32 PM, mchri...@redhat.com wrote:
> From: Mike Christie 
> 
> Separate the op from the rq_flag_bits and have bcache
> set/get the bio using bio_set_op_attrs/bio_op.
> 
> Signed-off-by: Mike Christie 
> ---
>  drivers/md/bcache/btree.c |  4 ++--
>  drivers/md/bcache/debug.c |  4 ++--
>  drivers/md/bcache/journal.c   |  7 ---
>  drivers/md/bcache/movinggc.c  |  2 +-
>  drivers/md/bcache/request.c   | 14 +++---
>  drivers/md/bcache/super.c | 24 +---
>  drivers/md/bcache/writeback.c |  4 ++--
>  7 files changed, 31 insertions(+), 28 deletions(-)
> 
Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
-- 
Dr. Hannes ReineckeTeamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 24/45] dm: use bio op accessors

2016-06-06 Thread Hannes Reinecke
On 06/05/2016 09:32 PM, mchri...@redhat.com wrote:
> From: Mike Christie 
> 
> Separate the op from the rq_flag_bits and have dm
> set/get the bio using bio_set_op_attrs/bio_op.
> 
> Signed-off-by: Mike Christie 
> 
Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
-- 
Dr. Hannes ReineckeTeamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 09/45] block discard: use bio set op accessor

2016-06-06 Thread Hannes Reinecke
On 06/05/2016 09:31 PM, mchri...@redhat.com wrote:
> From: Mike Christie 
> 
> This converts the block issue discard helper and users to use
> the bio_set_op_attrs accessor and only pass in the operation flags
> like REQ_SEQURE.
> 
> Signed-off-by: Mike Christie 
> ---
>  block/blk-lib.c| 13 +++--
>  drivers/md/dm-thin.c   |  2 +-
>  include/linux/blkdev.h |  3 ++-
>  3 files changed, 10 insertions(+), 8 deletions(-)
> 
Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
-- 
Dr. Hannes ReineckeTeamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 08/45] block, fs, mm, drivers: use bio set/get op accessors

2016-06-06 Thread Hannes Reinecke
On 06/05/2016 09:31 PM, mchri...@redhat.com wrote:
> From: Mike Christie 
> 
> This patch converts the simple bi_rw use cases in the block,
> drivers, mm and fs code to set/get the bio operation using
> bio_set_op_attrs/bio_op
> 
> These should be simple one or two liner cases, so I just did them
> in one patch. The next patches handle the more complicated
> cases in a module per patch.
> 
> Signed-off-by: Mike Christie 
> ---
> 
> v5:
> 1. Add missed crypto call.
> 2. Change nfs bi_rw check to bi_op.
> 
>  block/bio.c | 13 ++---
>  block/blk-core.c|  6 +++---
>  block/blk-flush.c   |  2 +-
>  block/blk-lib.c |  4 ++--
>  block/blk-map.c |  2 +-
>  block/blk-merge.c   | 12 ++--
>  drivers/block/brd.c |  2 +-
>  drivers/block/floppy.c  |  2 +-
>  drivers/block/pktcdvd.c |  4 ++--
>  drivers/block/rsxx/dma.c|  2 +-
>  drivers/block/zram/zram_drv.c   |  2 +-
>  drivers/lightnvm/rrpc.c |  6 +++---
>  drivers/scsi/osd/osd_initiator.c|  8 
>  drivers/staging/lustre/lustre/llite/lloop.c |  6 +++---
>  fs/crypto/crypto.c  |  2 +-
>  fs/exofs/ore.c  |  2 +-
>  fs/ext4/page-io.c   |  6 +++---
>  fs/ext4/readpage.c  |  2 +-
>  fs/jfs/jfs_logmgr.c |  4 ++--
>  fs/jfs/jfs_metapage.c   |  4 ++--
>  fs/logfs/dev_bdev.c | 12 ++--
>  fs/nfs/blocklayout/blocklayout.c|  4 ++--
>  include/linux/bio.h | 15 ++-
>  mm/page_io.c|  4 ++--
>  24 files changed, 65 insertions(+), 61 deletions(-)
> 
[ .. ]
> diff --git a/include/linux/bio.h b/include/linux/bio.h
> index 09c5308..4568647 100644
> --- a/include/linux/bio.h
> +++ b/include/linux/bio.h
> @@ -109,18 +109,23 @@ static inline bool bio_has_data(struct bio *bio)
>  {
>   if (bio &&
>   bio->bi_iter.bi_size &&
> - !(bio->bi_rw & REQ_DISCARD))
> + bio_op(bio) != REQ_OP_DISCARD)
>   return true;
>  
>   return false;
>  }
>  
> +static inline bool bio_no_advance_iter(struct bio *bio)
> +{
> + return bio_op(bio) == REQ_OP_DISCARD || bio_op(bio) == 
> REQ_OP_WRITE_SAME;
> +}
> +
>  static inline bool bio_is_rw(struct bio *bio)
>  {
>   if (!bio_has_data(bio))
>   return false;
>  
> - if (bio->bi_rw & BIO_NO_ADVANCE_ITER_MASK)
> + if (bio_no_advance_iter(bio))
>   return false;
>  
>   return true;
> @@ -228,7 +233,7 @@ static inline void bio_advance_iter(struct bio *bio, 
> struct bvec_iter *iter,
>  {
>   iter->bi_sector += bytes >> 9;
>  
> - if (bio->bi_rw & BIO_NO_ADVANCE_ITER_MASK)
> + if (bio_no_advance_iter(bio))
>   iter->bi_size -= bytes;
>   else
>   bvec_iter_advance(bio->bi_io_vec, iter, bytes);
Hmm. Can't you drop 'BIO_NO_ADVANCE_ITER_MASK' after this patch?

> @@ -256,10 +261,10 @@ static inline unsigned bio_segments(struct bio *bio)
>* differently:
>*/
>  
> - if (bio->bi_rw & REQ_DISCARD)
> + if (bio_op(bio) == REQ_OP_DISCARD)
>   return 1;
>  
> - if (bio->bi_rw & REQ_WRITE_SAME)
> + if (bio_op(bio) == REQ_OP_WRITE_SAME)
>   return 1;
>  
>   bio_for_each_segment(bv, bio, iter)
> diff --git a/mm/page_io.c b/mm/page_io.c
> index 5a5fd66..dcc5d37 100644
> --- a/mm/page_io.c
> +++ b/mm/page_io.c
> @@ -317,7 +317,7 @@ int __swap_writepage(struct page *page, struct 
> writeback_control *wbc,
>   ret = -ENOMEM;
>   goto out;
>   }
> - bio->bi_rw = WRITE;
> + bio_set_op_attrs(bio, REQ_OP_WRITE, 0);
>   if (wbc->sync_mode == WB_SYNC_ALL)
>   bio->bi_rw |= REQ_SYNC;
>   count_vm_event(PSWPOUT);
> @@ -370,7 +370,7 @@ int swap_readpage(struct page *page)
>   ret = -ENOMEM;
>   goto out;
>   }
> - bio->bi_rw = READ;
> + bio_set_op_attrs(bio, REQ_OP_READ, 0);
>   count_vm_event(PSWPIN);
>   submit_bio(bio);
>  out:
> 
Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
-- 
Dr. Hannes ReineckeTeamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 07/45] bcache: use op_is_write instead of checking for REQ_WRITE

2016-06-06 Thread Hannes Reinecke
On 06/05/2016 09:31 PM, mchri...@redhat.com wrote:
> From: Mike Christie 
> 
> We currently set REQ_WRITE/WRITE for all non READ IOs
> like discard, flush, writesame, etc. In the next patches where we
> no longer set up the op as a bitmap, we will not be able to
> detect a operation direction like writesame by testing if REQ_WRITE is
> set.
> 
> This has bcache use the op_is_write helper which will do the right
> thing.
> 
> Signed-off-by: Mike Christie 
> ---
>  drivers/md/bcache/io.c  | 2 +-
>  drivers/md/bcache/request.c | 6 +++---
>  2 files changed, 4 insertions(+), 4 deletions(-)
> 
(Could probably folded together with the two previous patches)

Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
-- 
Dr. Hannes ReineckeTeamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 06/45] dm: use op_is_write instead of checking for REQ_WRITE

2016-06-06 Thread Hannes Reinecke
On 06/05/2016 09:31 PM, mchri...@redhat.com wrote:
> From: Mike Christie 
> 
> We currently set REQ_WRITE/WRITE for all non READ IOs
> like discard, flush, writesame, etc. In the next patches where we
> no longer set up the op as a bitmap, we will not be able to
> detect a operation direction like writesame by testing if REQ_WRITE is
> set.
> 
> This has dm use the op_is_write helper which will do the right
> thing.
> 
> Signed-off-by: Mike Christie 
> ---

(Could probably be folded into the previous patch)

Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
-- 
Dr. Hannes ReineckeTeamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 05/45] block, drivers, cgroup: use op_is_write helper instead of checking for REQ_WRITE

2016-06-06 Thread Hannes Reinecke
On 06/05/2016 09:31 PM, mchri...@redhat.com wrote:
> From: Mike Christie 
> 
> We currently set REQ_WRITE/WRITE for all non READ IOs
> like discard, flush, writesame, etc. In the next patches where we
> no longer set up the op as a bitmap, we will not be able to
> detect a operation direction like writesame by testing if REQ_WRITE is
> set.
> 
> This patch converts the drivers and cgroup to use the
> op_is_write helper. This should just cover the simple
> cases. I did dm, md and bcache in their own patches
> because they were more involved.
> 
> Signed-off-by: Mike Christie 
> ---
Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
-- 
Dr. Hannes ReineckeTeamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 02/45] block: add REQ_OP definitions and helpers

2016-06-06 Thread Hannes Reinecke
On 06/05/2016 09:31 PM, mchri...@redhat.com wrote:
> From: Mike Christie 
> 
> The following patches separate the operation (WRITE, READ, DISCARD,
> etc) from the rq_flag_bits flags. This patch adds definitions for
> request/bio operations (REQ_OPs) and adds request/bio accessors to
> get/set the op.
> 
> In this patch the REQ_OPs match the REQ rq_flag_bits ones
> for compat reasons while all the code is converted to use the
> op accessors in the set. In the last patches the op will become a
> number and the accessors and helpers in this patch will be dropped
> or updated.
> 
> Signed-off-by: Mike Christie 
> ---
Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
-- 
Dr. Hannes ReineckeTeamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 01/45] block/fs/drivers: remove rw argument from submit_bio

2016-06-06 Thread Hannes Reinecke
On 06/05/2016 09:31 PM, mchri...@redhat.com wrote:
> From: Mike Christie 
> 
> This has callers of submit_bio/submit_bio_wait set the bio->bi_rw
> instead of passing it in. This makes that use the same as
> generic_make_request and how we set the other bio fields.
> 
> Signed-off-by: Mike Christie 
> ---
Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
-- 
Dr. Hannes ReineckeTeamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html