Re: [PATCH 2/2] btrfs: drop devices declare in btrfs_init_new_device()

2018-07-02 Thread Nikolay Borisov



On  3.07.2018 08:14, Anand Jain wrote:
> There is only one use of the declared devices variable, so use its
> value directly instead.
> 
> Signed-off-by: Anand Jain 

Reviewed-by: Nikolay Borisov 

> ---
>  fs/btrfs/volumes.c | 5 +
>  1 file changed, 1 insertion(+), 4 deletions(-)
> 
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index f7fa0ea26e9c..124bd8728c37 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -2406,7 +2406,6 @@ int btrfs_init_new_device(struct btrfs_fs_info 
> *fs_info, const char *device_path
>   struct btrfs_trans_handle *trans;
>   struct btrfs_device *device;
>   struct block_device *bdev;
> - struct list_head *devices;
>   struct super_block *sb = fs_info->sb;
>   struct rcu_string *name;
>   struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
> @@ -2431,10 +2430,8 @@ int btrfs_init_new_device(struct btrfs_fs_info 
> *fs_info, const char *device_path
>  
>   filemap_write_and_wait(bdev->bd_inode->i_mapping);
>  
> - devices = _devices->devices;
> -
>   mutex_lock(_devices->device_list_mutex);
> - list_for_each_entry(device, devices, dev_list) {
> + list_for_each_entry(device, _devices->devices, dev_list) {
>   if (device->bdev == bdev) {
>   ret = -EEXIST;
>   mutex_unlock(
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] btrfs: fix missing superblock update in the device delete commit transaction

2018-07-02 Thread Nikolay Borisov



On  3.07.2018 08:12, Anand Jain wrote:
> When a device is deleted, the btrfs_super_block::number_devices is
> reduced by 1, but we do that after the commit transaction, so the
> change does not make it to disk and has to wait for the next commit
> transaction, whenever that happens.
> 
> This can be easily demonstrated using the following test case where I
> use the btrfs device ready cli to read the disk and report.
> 
>   mkfs.btrfs -fq -dsingle -msingle $dev1 $dev2
>   mount $dev1 /btrfs
>   btrfs dev del $dev2 /btrfs
>   btrfs dev ready $dev1; echo RESULT=$? <-- 1
>Without this patch RESULT returns 1, indicating not ready!
> 
>   Testing with a seed device:
> 
>   mkfs.btrfs -fq $dev1
>   btrfstune -S1 $dev1
>   mount $dev1 /btrfs
>   btrfs dev add -f $dev2 /btrfs
>   umount /btrfs
>   mount $dev2 /btrfs
>   btrfs dev del $dev1 /btrfs
>   btrfs dev ready $dev2; echo RESULT=$? <-- 1

Turn those into fully fledged xfstests.

> 
>   Fix this by bringing the num_devices update within the
>   current commit transaction.
> 
> Signed-off-by: Anand Jain 
> ---
>  fs/btrfs/volumes.c | 38 ++
>  1 file changed, 18 insertions(+), 20 deletions(-)
> 
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 6b807b166ca3..18cd73703951 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -1823,46 +1823,32 @@ static void update_dev_time(const char *path_name)
>  }
>  
>  static int btrfs_rm_dev_item(struct btrfs_fs_info *fs_info,
> -  struct btrfs_device *device)
> +  struct btrfs_device *device,
> +  struct btrfs_trans_handle *trans)
>  {

While doing this, refactor the function to take the transaction as the first
argument and remove the fs_info arg. fs_info can in turn be referenced
from the transaction handle.
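
A minimal sketch of that shape (illustrative only, not the actual patch; it
assumes trans->fs_info is valid, as it is in current kernels):

    static int btrfs_rm_dev_item(struct btrfs_trans_handle *trans,
                                 struct btrfs_device *device)
    {
            struct btrfs_fs_info *fs_info = trans->fs_info;
            struct btrfs_root *root = fs_info->chunk_root;
            struct btrfs_path *path;
            struct btrfs_key key;
            int ret;

            path = btrfs_alloc_path();
            if (!path)
                    return -ENOMEM;

            key.objectid = BTRFS_DEV_ITEMS_OBJECTID;
            key.type = BTRFS_DEV_ITEM_KEY;
            key.offset = device->devid;

            /* Delete the dev item within the caller's transaction */
            ret = btrfs_search_slot(trans, root, &key, path, -1, 1);
            if (ret > 0) {
                    ret = -ENOENT;
                    goto out;
            }
            if (ret < 0)
                    goto out;

            ret = btrfs_del_item(trans, root, path);
    out:
            btrfs_free_path(path);
            return ret;
    }
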
>   struct btrfs_root *root = fs_info->chunk_root;
>   int ret;
>   struct btrfs_path *path;
>   struct btrfs_key key;
> - struct btrfs_trans_handle *trans;
>  
>   path = btrfs_alloc_path();
>   if (!path)
>   return -ENOMEM;
>  
> - trans = btrfs_start_transaction(root, 0);
> - if (IS_ERR(trans)) {
> - btrfs_free_path(path);
> - return PTR_ERR(trans);
> - }
>   key.objectid = BTRFS_DEV_ITEMS_OBJECTID;
>   key.type = BTRFS_DEV_ITEM_KEY;
>   key.offset = device->devid;
>  
>   ret = btrfs_search_slot(trans, root, , path, -1, 1);
> - if (ret) {
> - if (ret > 0)
> - ret = -ENOENT;
> - btrfs_abort_transaction(trans, ret);
> - btrfs_end_transaction(trans);
> + if (ret > 0) {
> + ret = -ENOENT;
>   goto out;
>   }
>  
>   ret = btrfs_del_item(trans, root, path);
> - if (ret) {
> - btrfs_abort_transaction(trans, ret);
> - btrfs_end_transaction(trans);
> - }
>  
>  out:
>   btrfs_free_path(path);
> - if (!ret)
> - ret = btrfs_commit_transaction(trans);
>   return ret;
>  }
>  
> @@ -1946,7 +1932,9 @@ int btrfs_rm_device(struct btrfs_fs_info *fs_info, 
> const char *device_path,
>   u64 devid)
>  {
>   struct btrfs_device *device;
> + struct btrfs_trans_handle *trans;
>   struct btrfs_fs_devices *cur_devices;
> + struct btrfs_root *root = fs_info->dev_root;
>   struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
>   u64 num_devices;
>   int ret = 0;
> @@ -1994,14 +1982,23 @@ int btrfs_rm_device(struct btrfs_fs_info *fs_info, 
> const char *device_path,
>   if (ret)
>   goto error_undo;
>  
> + trans = btrfs_start_transaction(root, 0);
> + if (IS_ERR(trans)) {
> + ret = PTR_ERR(trans);
> + goto error_undo;
> + }
> +
>   /*
>* TODO: the superblock still includes this device in its num_devices
>* counter although write_all_supers() is not locked out. This
>* could give a filesystem state which requires a degraded mount.
>*/
Doesn't introducing the transaction fix this TODO item, so that it can be
removed?
> - ret = btrfs_rm_dev_item(fs_info, device);
> - if (ret)
> + ret = btrfs_rm_dev_item(fs_info, device, trans);
> + if (ret) {
> + btrfs_abort_transaction(trans, ret);
> + btrfs_end_transaction(trans);
>   goto error_undo;
> + }
>  
>   clear_bit(BTRFS_DEV_STATE_IN_FS_METADATA, >dev_state);
>   btrfs_scrub_cancel_dev(fs_info, device);
> @@ -2070,6 +2067,7 @@ int btrfs_rm_device(struct btrfs_fs_info *fs_info, 
> const char *device_path,
>   free_fs_devices(cur_devices);
>   }
>  
> + ret = btrfs_commit_transaction(trans);
>  out:
>   mutex_unlock(_mutex);
>   return ret;
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: how to best segment a big block device in resizeable btrfs filesystems?

2018-07-02 Thread Marc MERLIN
On Tue, Jul 03, 2018 at 04:26:37AM +, Paul Jones wrote:
> I don't have any experience with this, but since it's the internet let me 
> tell you how I'd do it anyway 

That's the spirit :)

> raid5
> dm-crypt
> lvm (using thin provisioning + cache)
> btrfs
> 
> The cache mode on lvm requires you to set up all your volumes first, then
> add caching to those volumes last. If you need to modify the volume then
> you have to remove the cache, make your changes, then re-add the cache. It
> sounds like a pain, but having the cache separate from the data is quite
> handy.

I'm ok enough with that.

> Given you are running a backup server I don't think the cache would
> really do much unless you enable writeback mode. If you can split up your
> filesystem a bit to the point that btrfs check doesn't OOM that will
> seriously help performance as well. Rsync might be feasible again.

I'm a bit wary of write caching with the issues I've had. I may do
write-through, but not writeback :)

But caching helps indeed for my older filesystems that are still backed up
via rsync because the source fs is ext4 and not btrfs.

Thanks for the suggestions
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems 
   what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs: Add chunk type check in read a chunk

2018-07-02 Thread Qu Wenruo


On 2018-07-03 12:27, Gu Jinxiang wrote:
> Reported in https://bugzilla.kernel.org/show_bug.cgi?id=199839,
> which has an invalid chunk and does not return an error promptly.
> 
> Add a chunk type check in btrfs_check_chunk_valid()
> so that the error is returned earlier.
> 
> Signed-off-by: Gu Jinxiang 
> ---
>  fs/btrfs/volumes.c | 6 ++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index e034ad9e23b4..67732c910c3d 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -6439,6 +6439,12 @@ static int btrfs_check_chunk_valid(struct 
> btrfs_fs_info *fs_info,
> btrfs_chunk_type(leaf, chunk));
>   return -EIO;
>   }
> +
> + if (!(type & BTRFS_BLOCK_GROUP_TYPE_MASK)) {
> + btrfs_err(fs_info, "missing chunk type flag: %llu", type);
> + return -EIO;
> + }

This is not good enough to catch cases like DATA | METADATA | SYSTEM.

Your patch reminds me that my block group type check is not good enough
either. We should only allow either a single bit set in TYPE_MASK, or
DATA | METADATA for mixed block groups.
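
A sketch of such a stricter check (illustrative only, not the actual kernel
patch; is_power_of_2() is the helper from <linux/log2.h>):

    u64 bg_type = type & BTRFS_BLOCK_GROUP_TYPE_MASK;

    /* Accept exactly one type bit, or DATA | METADATA for mixed block groups */
    if (!bg_type ||
        (!is_power_of_2(bg_type) &&
         bg_type != (BTRFS_BLOCK_GROUP_DATA | BTRFS_BLOCK_GROUP_METADATA))) {
            btrfs_err(fs_info, "invalid chunk type 0x%llx requested", type);
            return -EIO;
    }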

Thanks,
Qu

> +
>   if ((type & BTRFS_BLOCK_GROUP_RAID10 && sub_stripes != 2) ||
>   (type & BTRFS_BLOCK_GROUP_RAID1 && num_stripes < 1) ||
>   (type & BTRFS_BLOCK_GROUP_RAID5 && num_stripes < 2) ||
> 





[PATCH 2/2] btrfs: drop devices declare in btrfs_init_new_device()

2018-07-02 Thread Anand Jain
There is only one use of the declared devices variable, so use its
value directly instead.

Signed-off-by: Anand Jain 
---
 fs/btrfs/volumes.c | 5 +
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index f7fa0ea26e9c..124bd8728c37 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -2406,7 +2406,6 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, 
const char *device_path
struct btrfs_trans_handle *trans;
struct btrfs_device *device;
struct block_device *bdev;
-   struct list_head *devices;
struct super_block *sb = fs_info->sb;
struct rcu_string *name;
struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
@@ -2431,10 +2430,8 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, 
const char *device_path
 
filemap_write_and_wait(bdev->bd_inode->i_mapping);
 
-   devices = _devices->devices;
-
mutex_lock(_devices->device_list_mutex);
-   list_for_each_entry(device, devices, dev_list) {
+   list_for_each_entry(device, _devices->devices, dev_list) {
if (device->bdev == bdev) {
ret = -EEXIST;
mutex_unlock(
-- 
2.15.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/2] btrfs: declare fs_devices in btrfs_init_new_device()

2018-07-02 Thread Anand Jain
There are many instances of the %fs_info->fs_devices pointer
dereference, so declare a %fs_devices pointer instead.

Signed-off-by: Anand Jain 
---
 fs/btrfs/volumes.c | 42 +-
 1 file changed, 21 insertions(+), 21 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 18cd73703951..f7fa0ea26e9c 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -2409,12 +2409,13 @@ int btrfs_init_new_device(struct btrfs_fs_info 
*fs_info, const char *device_path
struct list_head *devices;
struct super_block *sb = fs_info->sb;
struct rcu_string *name;
+   struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
u64 tmp;
int seeding_dev = 0;
int ret = 0;
bool unlocked = false;
 
-   if (sb_rdonly(sb) && !fs_info->fs_devices->seeding)
+   if (sb_rdonly(sb) && !fs_devices->seeding)
return -EROFS;
 
bdev = blkdev_get_by_path(device_path, FMODE_WRITE | FMODE_EXCL,
@@ -2422,7 +2423,7 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, 
const char *device_path
if (IS_ERR(bdev))
return PTR_ERR(bdev);
 
-   if (fs_info->fs_devices->seeding) {
+   if (fs_devices->seeding) {
seeding_dev = 1;
down_write(>s_umount);
mutex_lock(_mutex);
@@ -2430,18 +2431,18 @@ int btrfs_init_new_device(struct btrfs_fs_info 
*fs_info, const char *device_path
 
filemap_write_and_wait(bdev->bd_inode->i_mapping);
 
-   devices = _info->fs_devices->devices;
+   devices = _devices->devices;
 
-   mutex_lock(_info->fs_devices->device_list_mutex);
+   mutex_lock(_devices->device_list_mutex);
list_for_each_entry(device, devices, dev_list) {
if (device->bdev == bdev) {
ret = -EEXIST;
mutex_unlock(
-   _info->fs_devices->device_list_mutex);
+   _devices->device_list_mutex);
goto error;
}
}
-   mutex_unlock(_info->fs_devices->device_list_mutex);
+   mutex_unlock(_devices->device_list_mutex);
 
device = btrfs_alloc_device(fs_info, NULL, NULL);
if (IS_ERR(device)) {
@@ -2490,23 +2491,22 @@ int btrfs_init_new_device(struct btrfs_fs_info 
*fs_info, const char *device_path
}
}
 
-   device->fs_devices = fs_info->fs_devices;
+   device->fs_devices = fs_devices;
 
-   mutex_lock(_info->fs_devices->device_list_mutex);
+   mutex_lock(_devices->device_list_mutex);
mutex_lock(_info->chunk_mutex);
-   list_add_rcu(>dev_list, _info->fs_devices->devices);
-   list_add(>dev_alloc_list,
-_info->fs_devices->alloc_list);
-   fs_info->fs_devices->num_devices++;
-   fs_info->fs_devices->open_devices++;
-   fs_info->fs_devices->rw_devices++;
-   fs_info->fs_devices->total_devices++;
-   fs_info->fs_devices->total_rw_bytes += device->total_bytes;
+   list_add_rcu(>dev_list, _devices->devices);
+   list_add(>dev_alloc_list, _devices->alloc_list);
+   fs_devices->num_devices++;
+   fs_devices->open_devices++;
+   fs_devices->rw_devices++;
+   fs_devices->total_devices++;
+   fs_devices->total_rw_bytes += device->total_bytes;
 
atomic64_add(device->total_bytes, _info->free_chunk_space);
 
if (!blk_queue_nonrot(q))
-   fs_info->fs_devices->rotating = 1;
+   fs_devices->rotating = 1;
 
tmp = btrfs_super_total_bytes(fs_info->super_copy);
btrfs_set_super_total_bytes(fs_info->super_copy,
@@ -2516,7 +2516,7 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, 
const char *device_path
btrfs_set_super_num_devices(fs_info->super_copy, tmp + 1);
 
/* add sysfs device entry */
-   btrfs_sysfs_add_device_link(fs_info->fs_devices, device);
+   btrfs_sysfs_add_device_link(fs_devices, device);
 
/*
 * we've got more storage, clear any full flags on the space
@@ -2525,7 +2525,7 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, 
const char *device_path
btrfs_clear_space_info_full(fs_info);
 
mutex_unlock(_info->chunk_mutex);
-   mutex_unlock(_info->fs_devices->device_list_mutex);
+   mutex_unlock(_devices->device_list_mutex);
 
if (seeding_dev) {
mutex_lock(_info->chunk_mutex);
@@ -2557,7 +2557,7 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, 
const char *device_path
 */
snprintf(fsid_buf, BTRFS_UUID_UNPARSED_SIZE, "%pU",
fs_info->fsid);
-   if (kobject_rename(_info->fs_devices->fsid_kobj, fsid_buf))
+   if (kobject_rename(_devices->fsid_kobj, fsid_buf))
btrfs_warn(fs_info,
   "sysfs: failed to 

[PATCH RESEND 1/2] btrfs: fix parent in memory total_devices after seed delete

2018-07-02 Thread Anand Jain
When deleting a seed device, the %cur_devices (seed) and the
%fs_devices (parent) are different. Since the parent
fs_devices::total_devices also maintains the total number of devices,
including the seed device, decrement its in-memory value on a
successful seed delete. We are already updating the corresponding on-disk
btrfs_super_block::number_devices value.

Signed-off-by: Anand Jain 
---
 fs/btrfs/volumes.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 5bd6f3a40f9c..6b807b166ca3 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -2027,6 +2027,9 @@ int btrfs_rm_device(struct btrfs_fs_info *fs_info, const 
char *device_path,
 
cur_devices->num_devices--;
cur_devices->total_devices--;
+   /* Update total_devices for the parent fs_devices if it's a seed */
+   if (cur_devices != fs_devices)
+   fs_devices->total_devices--;
 
if (test_bit(BTRFS_DEV_STATE_MISSING, >dev_state))
cur_devices->missing_devices--;
-- 
2.15.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/2] btrfs: fix missing superblock update in the device delete commit transaction

2018-07-02 Thread Anand Jain
When a device is deleted, the btrfs_super_block::number_devices is
reduced by 1, but we do that after the commit transaction, so the
change does not make it to disk and has to wait for the next commit
transaction, whenever that happens.

This can be easily demonstrated using the following test case where I
use the btrfs device ready cli to read the disk and report.

  mkfs.btrfs -fq -dsingle -msingle $dev1 $dev2
  mount $dev1 /btrfs
  btrfs dev del $dev2 /btrfs
  btrfs dev ready $dev1; echo RESULT=$? <-- 1
   Without this patch RESULT returns 1, indicating not ready!

  Testing with a seed device:

  mkfs.btrfs -fq $dev1
  btrfstune -S1 $dev1
  mount $dev1 /btrfs
  btrfs dev add -f $dev2 /btrfs
  umount /btrfs
  mount $dev2 /btrfs
  btrfs dev del $dev1 /btrfs
  btrfs dev ready $dev2; echo RESULT=$? <-- 1

  Fix this by bringing the num_devices update within the
  current commit transaction.

Signed-off-by: Anand Jain 
---
 fs/btrfs/volumes.c | 38 ++
 1 file changed, 18 insertions(+), 20 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 6b807b166ca3..18cd73703951 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1823,46 +1823,32 @@ static void update_dev_time(const char *path_name)
 }
 
 static int btrfs_rm_dev_item(struct btrfs_fs_info *fs_info,
-struct btrfs_device *device)
+struct btrfs_device *device,
+struct btrfs_trans_handle *trans)
 {
struct btrfs_root *root = fs_info->chunk_root;
int ret;
struct btrfs_path *path;
struct btrfs_key key;
-   struct btrfs_trans_handle *trans;
 
path = btrfs_alloc_path();
if (!path)
return -ENOMEM;
 
-   trans = btrfs_start_transaction(root, 0);
-   if (IS_ERR(trans)) {
-   btrfs_free_path(path);
-   return PTR_ERR(trans);
-   }
key.objectid = BTRFS_DEV_ITEMS_OBJECTID;
key.type = BTRFS_DEV_ITEM_KEY;
key.offset = device->devid;
 
ret = btrfs_search_slot(trans, root, , path, -1, 1);
-   if (ret) {
-   if (ret > 0)
-   ret = -ENOENT;
-   btrfs_abort_transaction(trans, ret);
-   btrfs_end_transaction(trans);
+   if (ret > 0) {
+   ret = -ENOENT;
goto out;
}
 
ret = btrfs_del_item(trans, root, path);
-   if (ret) {
-   btrfs_abort_transaction(trans, ret);
-   btrfs_end_transaction(trans);
-   }
 
 out:
btrfs_free_path(path);
-   if (!ret)
-   ret = btrfs_commit_transaction(trans);
return ret;
 }
 
@@ -1946,7 +1932,9 @@ int btrfs_rm_device(struct btrfs_fs_info *fs_info, const 
char *device_path,
u64 devid)
 {
struct btrfs_device *device;
+   struct btrfs_trans_handle *trans;
struct btrfs_fs_devices *cur_devices;
+   struct btrfs_root *root = fs_info->dev_root;
struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
u64 num_devices;
int ret = 0;
@@ -1994,14 +1982,23 @@ int btrfs_rm_device(struct btrfs_fs_info *fs_info, 
const char *device_path,
if (ret)
goto error_undo;
 
+   trans = btrfs_start_transaction(root, 0);
+   if (IS_ERR(trans)) {
+   ret = PTR_ERR(trans);
+   goto error_undo;
+   }
+
/*
 * TODO: the superblock still includes this device in its num_devices
 * counter although write_all_supers() is not locked out. This
 * could give a filesystem state which requires a degraded mount.
 */
-   ret = btrfs_rm_dev_item(fs_info, device);
-   if (ret)
+   ret = btrfs_rm_dev_item(fs_info, device, trans);
+   if (ret) {
+   btrfs_abort_transaction(trans, ret);
+   btrfs_end_transaction(trans);
goto error_undo;
+   }
 
clear_bit(BTRFS_DEV_STATE_IN_FS_METADATA, >dev_state);
btrfs_scrub_cancel_dev(fs_info, device);
@@ -2070,6 +2067,7 @@ int btrfs_rm_device(struct btrfs_fs_info *fs_info, const 
char *device_path,
free_fs_devices(cur_devices);
}
 
+   ret = btrfs_commit_transaction(trans);
 out:
mutex_unlock(_mutex);
return ret;
-- 
2.15.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] btrfs: Add chunk type check in read a chunk

2018-07-02 Thread Gu Jinxiang
Reported in https://bugzilla.kernel.org/show_bug.cgi?id=199839,
which has an invalid chunk and does not return an error promptly.

Add a chunk type check in btrfs_check_chunk_valid()
so that the error is returned earlier.

Signed-off-by: Gu Jinxiang 
---
 fs/btrfs/volumes.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index e034ad9e23b4..67732c910c3d 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6439,6 +6439,12 @@ static int btrfs_check_chunk_valid(struct btrfs_fs_info 
*fs_info,
  btrfs_chunk_type(leaf, chunk));
return -EIO;
}
+
+   if (!(type & BTRFS_BLOCK_GROUP_TYPE_MASK)) {
+   btrfs_err(fs_info, "missing chunk type flag: %llu", type);
+   return -EIO;
+   }
+
if ((type & BTRFS_BLOCK_GROUP_RAID10 && sub_stripes != 2) ||
(type & BTRFS_BLOCK_GROUP_RAID1 && num_stripes < 1) ||
(type & BTRFS_BLOCK_GROUP_RAID5 && num_stripes < 2) ||
-- 
2.17.1



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: how to best segment a big block device in resizeable btrfs filesystems?

2018-07-02 Thread Paul Jones

> -Original Message-
> From: Marc MERLIN 
> Sent: Tuesday, 3 July 2018 2:07 PM
> To: Paul Jones 
> Cc: linux-btrfs@vger.kernel.org
> Subject: Re: how to best segment a big block device in resizeable btrfs
> filesystems?
> 
> On Tue, Jul 03, 2018 at 12:51:30AM +, Paul Jones wrote:
> > You could combine bcache and lvm if you are happy to use dm-cache
> instead (which lvm uses).
> > I use it myself (but without thin provisioning) and it works well.
> 
> Interesting point. So, I used to use lvm and then lvm2 many years ago until I
> got tired of its performance, especially as soon as I took even a single
> snapshot.
> But that was a long time ago now, just saying that I'm a bit rusty on LVM
> itself.
> 
> That being said, if I have
> raid5
> dm-cache
> dm-crypt
> dm-thin
> 
> That's still 4 block layers under btrfs.
> Am I any better off using dm-cache instead of bcache, my understanding is
> that it only replaces one block layer with another one and one codebase with
> another.

True, I didn't think of it like that.

> Mmmh, a bit of reading shows that dm-cache is now used as lvmcache, which
> might change things, or not.
> I'll admit that setting up and maintaining bcache is a bit of a pain, I only 
> used it
> at the time because it seemed more ready then, but we're a few years later
> now.
> 
> So, what do you recommend nowadays, assuming you've used both?
> (given that it's literally going to take days to recreate my array, I'd 
> rather do it
> once and the right way the first time :) )

I don't have any experience with this, but since it's the internet let me tell 
you how I'd do it anyway 
raid5
dm-crypt
lvm (using thin provisioning + cache)
btrfs

The cache mode on lvm requires you to set up all your volumes first, then add 
caching to those volumes last. If you need to modify the volume then you have 
to remove the cache, make your changes, then re-add the cache. It sounds like a 
pain, but having the cache separate from the data is quite handy.
Given you are running a backup server I don't think the cache would really do 
much unless you enable writeback mode. If you can split up your filesystem a 
bit to the point that btrfs check doesn't OOM that will seriously help 
performance as well. Rsync might be feasible again.

Paul.


Re: how to best segment a big block device in resizeable btrfs filesystems?

2018-07-02 Thread Andrei Borzenkov
02.07.2018 21:35, Austin S. Hemmelgarn wrote:
> them (trimming blocks on BTRFS gets rid of old root trees, so it's a bit
> dangerous to do it while writes are happening).

Could you please elaborate? Do you mean btrfs can trim data before new
writes are actually committed to disk?
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: how to best segment a big block device in resizeable btrfs filesystems?

2018-07-02 Thread Andrei Borzenkov
03.07.2018 04:37, Qu Wenruo wrote:
> 
> BTW, IMHO the bcache is not really helping for backup system, which is
> more write oriented.
> 

There is new writecache target which may help in this case.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: So, does btrfs check lowmem take days? weeks?

2018-07-02 Thread Marc MERLIN
On Mon, Jul 02, 2018 at 06:31:43PM -0600, Chris Murphy wrote:
> So the idea behind journaled file systems is that journal replay
> enabled mount time "repair" that's faster than an fsck. Already Btrfs
> use cases with big, but not huge, file systems makes btrfs check a
> problem. Either running out of memory or it takes too long. So already
> it isn't scaling as well as ext4 or XFS in this regard.
> 
> So what's the future hold? It seems like the goal is that the problems
> must be avoided in the first place rather than to repair them after
> the fact.
> 
> Are the problems Marc is running into understood well enough that
> there can eventually be a fix, maybe even an on-disk format change,
> that prevents such problems from happening in the first place?
> 
> Or does it make sense for him to be running with btrfs debug or some
> subset of btrfs integrity checking mask to try to catch the problems
> in the act of them happening?

Those are all good questions.
To be fair, I cannot claim that btrfs was at fault for whatever filesystem
damage I ended up with. It's very possible that it happened due to a flaky
Sata card that kicked drives off the bus when it shouldn't have.
Sure in theory a journaling filesystem can recover from unexpected power
loss and drives dropping off at bad times, but I'm going to guess that
btrfs' complexity also means that it has data structures (extent tree?) that
need to be updated completely "or else".

I'm obviously ok with a filesystem check being necessary to recover in cases
like this; after all, I still occasionally have to run e2fsck on ext4 too, but
I'm a lot less thrilled with the btrfs situation where basically the repair
tools can either completely crash your kernel, or take days and then either
get stuck in an infinite loop or hit an algorithm that can't scale if you
have too many hardlinks/snapshots.

It sounds like there may not be a fix to this problem with the filesystem's
design, outside of "do not get there, or else".
It would even be useful for btrfs tools to start computing heuristics and
output warnings like "you have more than 100 snapshots on this filesystem,
this is not recommended, please read http://url/".

Qu, Su, does that sound both reasonable and doable?

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems 
   what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: how to best segment a big block device in resizeable btrfs filesystems?

2018-07-02 Thread Marc MERLIN
On Tue, Jul 03, 2018 at 09:37:47AM +0800, Qu Wenruo wrote:
> > If I do this, I would have
> > software raid 5 < dmcrypt < bcache < lvm < btrfs
> > That's a lot of layers, and that's also starting to make me nervous :)
> 
> If you could keep the number of snapshots to a minimum (fewer than 10) for
> each btrfs (and the number of send sources below 5), one big btrfs
> may work in that case.
 
Well, we kind of discussed this already. If btrfs falls over when you reach
100 snapshots or so, and it sure seems to in my case, I won't be much better
off.
Having btrfs check --repair fail because 32GB of RAM is not enough, and it's
unable to use swap, is a big deal in my case. You also confirmed that btrfs
check lowmem does not scale to filesystems like mine, so this translates
into "if regular btrfs check repair can't fit in 32GB, I am completely out
of luck if anything happens to the filesystem"

You're correct that I could tweak my backups and snapshot rotation to get
from 250 or so down to 100, but it seems that I'll just be hoping to avoid
the problem by being just under the limit, until I'm not, again, and it'll
be too late to do anything about it the next time I'm in trouble again, putting me
back right in the same spot I'm in now.
Is all this fair to say, or did I misunderstand?

> BTW, IMHO the bcache is not really helping for backup system, which is
> more write oriented.

That's a good point. So, what I didn't explain is that I still have some old
filesystems that do get backed up with rsync instead of btrfs send (going
into the same filesystem, but not the same subvolume).
Because rsync is so painfully slow when it needs to scan both sides before
it'll even start doing any work, bcache helps there.

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems 
   what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: how to best segment a big block device in resizeable btrfs filesystems?

2018-07-02 Thread Marc MERLIN
On Tue, Jul 03, 2018 at 12:51:30AM +, Paul Jones wrote:
> You could combine bcache and lvm if you are happy to use dm-cache instead 
> (which lvm uses).
> I use it myself (but without thin provisioning) and it works well.

Interesting point. So, I used to use lvm and then lvm2 many years ago until
I got tired of its performance, especially as soon as I took even a
single snapshot.
But that was a long time ago now, just saying that I'm a bit rusty on LVM
itself.

That being said, if I have
raid5
dm-cache
dm-crypt
dm-thin

That's still 4 block layers under btrfs.
Am I any better off using dm-cache instead of bcache, my understanding is
that it only replaces one block layer with another one and one codebase with
another.

Mmmh, a bit of reading shows that dm-cache is now used as lvmcache, which
might change things, or not.
I'll admit that setting up and maintaining bcache is a bit of a pain, I only
used it at the time because it seemed more ready then, but we're a few years
later now.

So, what do you recommend nowadays, assuming you've used both?
(given that it's literally going to take days to recreate my array, I'd
rather do it once and the right way the first time :) )

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems 
   what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: how to best segment a big block device in resizeable btrfs filesystems?

2018-07-02 Thread Qu Wenruo



On 2018-07-02 23:18, Marc MERLIN wrote:
> Hi Qu,
> 
> I'll split this part into a new thread:
> 
>> 2) Don't keep unrelated snapshots in one btrfs.
>>    I totally understand that maintaining different btrfs filesystems would
>>    hugely add maintenance pressure, but as explained, all snapshots share
>>    one fragile extent tree.
> 
> Yes, I understand that this is what I should do given what you
> explained.
> My main problem is knowing how to segment things so I don't end up with
> filesystems that are full while others are almost empty :)
> 
> Am I supposed to put LVM thin volumes underneath so that I can share
> the same single 10TB raid5?
> 
> If I do this, I would have
> software raid 5 < dmcrypt < bcache < lvm < btrfs
> That's a lot of layers, and that's also starting to make me nervous :)

If you could keep the number of snapshots to a minimum (fewer than 10) for
each btrfs (and the number of send sources below 5), one big btrfs
may work in that case.

BTW, IMHO the bcache is not really helping for backup system, which is
more write oriented.

Thanks,
Qu

> 
> Is there any other way that does not involve me creating smaller block
> devices for multiple btrfs filesystems and hope that they are the right
> size because I won't be able to change it later?
> 
> Thanks,
> Marc
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 1/2] btrfs-progs: Fix wrong optind re-initialization to allow mixed option and non-option

2018-07-02 Thread Qu Wenruo


On 2018-07-03 06:12, David Sterba wrote:
> On Wed, Jun 20, 2018 at 08:38:38AM +0800, Qu Wenruo wrote:
>> In function handle_global_options(), we reset @optind to 1.
>> However according to man page of getopt(3) NOTES section, if we need to
>> rescan options later, @optind should be reset to 0 to initialize the
>> internal variables correctly.
>>
>> This explains the reason why in cmd_check(), getopt_long() doesn't
>> handle the following command correctly:
>> "btrfs check /dev/data/btrfs --check-data-csum"
>>
>> While mkfs.btrfs handles mixed non-option and option correctly:
>> "mkfs.btrfs -f /dev/data/disk1 --data raid1 /dev/data/disk2"
>>
>> Cc: Paul Jones 
>> Cc: Hugo Mills 
>> Fixes: 010ceab56e06 ("btrfs-progs: rework option parser to use getopt for 
>> global options")
>> Signed-off-by: Qu Wenruo 
>> ---
>> changelog:
>> v2:
>>   Instead of resetting @optind at handle_global_options(), reset @optind
>>   before each later getopt() call, since there are cases that use @optind
>>   and expect it to start from 1.
> 
> So it's not possible to set it once before the callbacks are called in
> btrfs.c:main() due to the exceptions that need it to be set to 1. Ok,
> can you please send a separate patch that makes the optind = 1 explicit?

It's already done in handle_global_options(), so every subcommand is
getting @optind set to 1 already.
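
For reference, a minimal userspace sketch of the getopt(3) behaviour being
discussed (illustrative only, not btrfs-progs code): resetting @optind to 0
makes glibc fully reinitialize its internal state before a rescan, so options
placed after non-option arguments are still picked up.

    #include <getopt.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
            static const struct option longopts[] = {
                    { "check-data-csum", no_argument, NULL, 'c' },
                    { NULL, 0, NULL, 0 }
            };
            int c;

            /* ... an earlier getopt pass (e.g. global options) runs here ... */

            optind = 0;     /* full reinitialization, not just a restart */
            while ((c = getopt_long(argc, argv, "c", longopts, NULL)) != -1)
                    printf("option: %c\n", c);

            return 0;
    }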

Thanks,
Qu

> Even if it might be redundant in the context, it's a way to document the
> expected behaviour of getopt, and we would not have to wonder whether the
> optind setting is missing before the getopt calls.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 





Re: [PATCH v2] btrfs: tree-checker: Verify block_group_item

2018-07-02 Thread Qu Wenruo


On 2018-07-02 22:48, David Sterba wrote:
> On Mon, Jul 02, 2018 at 04:34:14PM +0800, Qu Wenruo wrote:
>> As reported in https://bugzilla.kernel.org/show_bug.cgi?id=199849,
>> a crafted image with invalid block group items could make the free space
>> cache code panic.
>>
>> We can detect such an invalid block group item early by checking:
>> 1) Size (key)
>>    We have an upper limit on block group size (10G)
> 
>> +if (key->offset > 10ULL * SZ_1G) {
>> +block_group_err(fs_info, leaf, slot,
>> +"invalid block group size, have %llu expect (0, %llu)",
>> +key->offset, 10ULL * SZ_1G);
> 
> Can you please make this magic constant a define and explain in a
> comment how it's calculated? Thanks.

Of course, it's used in btrfs_alloc_chunk(), and I'll make it a constant
define.
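
One possible shape for that (illustrative only; the macro name below is a
placeholder, not necessarily what will be merged, and the 10G value simply
mirrors the limit used by btrfs_alloc_chunk() for data chunks):

    /* Upper bound on a block group item's size, matching chunk allocation */
    #define BTRFS_MAX_BLOCK_GROUP_SIZE      (10ULL * SZ_1G)

    if (key->offset > BTRFS_MAX_BLOCK_GROUP_SIZE) {
            block_group_err(fs_info, leaf, slot,
                    "invalid block group size, have %llu expect (0, %llu)",
                    key->offset, BTRFS_MAX_BLOCK_GROUP_SIZE);
            return -EUCLEAN;        /* tree-checker convention for corruption */
    }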

Thanks,
Qu

> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 





RE: how to best segment a big block device in resizeable btrfs filesystems?

2018-07-02 Thread Paul Jones
> -Original Message-
> From: linux-btrfs-ow...@vger.kernel.org On Behalf Of Marc MERLIN
> Sent: Tuesday, 3 July 2018 1:19 AM
> To: Qu Wenruo 
> Cc: Su Yue ; linux-btrfs@vger.kernel.org
> Subject: Re: how to best segment a big block device in resizeable btrfs
> filesystems?
> 
> Hi Qu,
> 
> I'll split this part into a new thread:
> 
> > 2) Don't keep unrelated snapshots in one btrfs.
> >    I totally understand that maintaining different btrfs filesystems would
> >    hugely add maintenance pressure, but as explained, all snapshots share
> >    one fragile extent tree.
> 
> Yes, I understand that this is what I should do given what you explained.
> My main problem is knowing how to segment things so I don't end up with
> filesystems that are full while others are almost empty :)
> 
> Am I supposed to put LVM thin volumes underneath so that I can share the
> same single 10TB raid5?
> 
> If I do this, I would have
> software raid 5 < dmcrypt < bcache < lvm < btrfs
> That's a lot of layers, and that's also starting to make me nervous :)

You could combine bcache and lvm if you are happy to use dm-cache instead 
(which lvm uses).
I use it myself (but without thin provisioning) and it works well.


> 
> Is there any other way that does not involve me creating smaller block
> devices for multiple btrfs filesystems and hope that they are the right size
> because I won't be able to change it later?
> 
> Thanks,
> Marc
> --
> "A mouse is a device used to point at the xterm you want to type in" - A.S.R.
> Microsoft is to operating systems 
>    what McDonalds is to gourmet 
> cooking
> Home page: http://marc.merlins.org/   | PGP 
> 7F55D5F27AAF9D08
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the
> body of a message to majord...@vger.kernel.org More majordomo info at
> http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: So, does btrfs check lowmem take days? weeks?

2018-07-02 Thread Chris Murphy
On Mon, Jul 2, 2018 at 8:42 AM, Qu Wenruo  wrote:
>
>
> On 2018-07-02 22:05, Marc MERLIN wrote:
>> On Mon, Jul 02, 2018 at 02:22:20PM +0800, Su Yue wrote:
>>>> Ok, that's 29MB, so it doesn't fit on pastebin:
>>>> http://marc.merlins.org/tmp/dshelf2_inspect.txt

>>> Sorry Marc. After offline communication with Qu, both
>>> of us think the filesystem is hard to repair.
>>> The filesystem is too large to debug step by step.
>>> Every check and debug iteration is too expensive,
>>> and it has already cost several days.
>>>
>>> Sadly, I am afraid that you have to recreate the filesystem
>>> and back up your data again. :(
>>>
>>> Sorry again, and thanks for your reports and patience.
>>
>> I appreciate your help. Honestly I only wanted to help you find why the
>> tools aren't working. Fixing filesystems by hand (and remotely via Email
>> on top of that), is way too time consuming like you said.
>>
>> Is the btrfs design flawed in a way that repair tools just cannot repair
>> on their own?
>
> In short, and for your case, yes: you can consider the repair tools
> garbage and should not use them on any production system.

So the idea behind journaled file systems is that journal replay
enabled mount time "repair" that's faster than an fsck. Already Btrfs
use cases with big, but not huge, file systems makes btrfs check a
problem. Either running out of memory or it takes too long. So already
it isn't scaling as well as ext4 or XFS in this regard.

So what's the future hold? It seems like the goal is that the problems
must be avoided in the first place rather than to repair them after
the fact.

Are the problems Marc is running into understood well enough that
there can eventually be a fix, maybe even an on-disk format change,
that prevents such problems from happening in the first place?

Or does it make sense for him to be running with btrfs debug or some
subset of btrfs integrity checking mask to try to catch the problems
in the act of them happening?



-- 
Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] Btrfs: make sure there is always room for generation number

2018-07-02 Thread Zhihui Zhang
io_ctl_set_generation() assumes that the generation number shares
the same page with inline CRCs. Let's make sure this is always true.

Signed-off-by: Zhihui Zhang 
---
 fs/btrfs/free-space-cache.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index d5f80cb300be..9aa0fbf16515 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -300,9 +300,9 @@ static int io_ctl_init(struct btrfs_io_ctl *io_ctl, struct 
inode *inode,
if (btrfs_ino(BTRFS_I(inode)) != BTRFS_FREE_INO_OBJECTID)
check_crcs = 1;
 
-   /* Make sure we can fit our crcs into the first page */
+   /* Make sure we can fit our crcs and generation # into the first page */
if (write && check_crcs &&
-   (num_pages * sizeof(u32)) >= PAGE_SIZE)
+   (num_pages * sizeof(u32) + sizeof(u64)) > PAGE_SIZE)
return -ENOSPC;
 
memset(io_ctl, 0, sizeof(struct btrfs_io_ctl));
-- 
2.17.1
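
For context, a quick back-of-the-envelope check of the bound above
(illustrative, assuming 4 KiB pages): the first page has to hold one u64
generation plus one u32 CRC per cache page, so at most (4096 - 8) / 4 = 1022
pages can be covered by inline CRCs.

    #include <stdint.h>

    /* Mirrors the patched condition: CRCs plus generation must fit in page 1 */
    static int crcs_and_gen_fit(unsigned long num_pages, unsigned long page_size)
    {
            return num_pages * sizeof(uint32_t) + sizeof(uint64_t) <= page_size;
    }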

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/6] btrfs-progs: Fixes inline ram_bytes related bugs

2018-07-02 Thread David Sterba
On Thu, Jun 07, 2018 at 10:37:16PM -0600, Steve Leung wrote:
> On 06/06/2018 01:27 AM, Qu Wenruo wrote:
> > The patchset can be fetched from github (*):
> > https://github.com/adam900710/btrfs-progs/tree/inline_ram_bytes
> > 
> > It's based on David's devel branch, whose HEAD is:
> > commit 0d1c5812e28e286648781c7b35b542311cc01aa4 (david/devel)
> > Author: Matthias Benkard 
> > Date:   Wed Apr 25 16:34:54 2018 +0200
> > 
> >  btrfs-progs: mkfs: traverse_directory: Reset error code on continue
> > 
> > Reported-by Steve Leung , whose old btrfs (at least the
> > offending inodes are from 2014) has inline uncompressed extents with
> > ram_bytes mismatching the item size.
> 
> Took a while to run, but:
> 
> Tested-by: Steve Leung 
> 
> Verifying everything against backups now, but things look good so far.

Thank you very much for testing!
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/6] btrfs-progs: Fixes inline ram_bytes related bugs

2018-07-02 Thread David Sterba
On Wed, Jun 06, 2018 at 03:27:11PM +0800, Qu Wenruo wrote:
> The patchset can be fetched from github (*):
> https://github.com/adam900710/btrfs-progs/tree/inline_ram_bytes
> 
> It's based on David's devel branch, whose HEAD is:
> commit 0d1c5812e28e286648781c7b35b542311cc01aa4 (david/devel)
> Author: Matthias Benkard 
> Date:   Wed Apr 25 16:34:54 2018 +0200
> 
> btrfs-progs: mkfs: traverse_directory: Reset error code on continue
> 
> Reported-by Steve Leung , whose old btrfs (at least the
> offending inodes are from 2014) has inline uncompressed extents with
> ram_bytes mismatching the item size.
> 
> Latest kernel tree check catches this bug, while we failed to detect by
> dump-tree.
> 
> It turns out that btrfs-progs is doing something evil to avoid reading
> ram_bytes from inline uncompressed extent.
> 
> 
> So this patchset will address all such ram_bytes related problems.

Thanks, series applied.

> The 1st patch is a not-so-relative fix for restore, which is using
> ram_bytes for decompress. Although thanks to the compression header, we
> won't read out-of-boundary, but fixing it is never a bad thing.
> 
> The 2nd patch will get rid of the evil btrfs_file_extent_inline_len(),
> which hides raw ram_bytes from us and has been fooling us for a long,
> long time.
> 
> The 3rd~5th patches introduce check/repair function for both original
> and lowmem mode (although lowmem mode can detect it even before this patch).
> 
> The last one is the test case for it as usual.
> 
> *: Or should I just migrate to gitlab after M$ acquired github?

About that, I have a gitlab account and auto-synced the btrfs-progs
repository from github, so I guess it can be used to fork the code.  The
issues are disabled since that's kind of tied to github, but mail bug reports
work, and technically we don't have to stick to just one issue tracker,
though it's a bit of work on our side to update the status if the reports
are duplicated.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 1/2] btrfs-progs: Fix wrong optind re-initialization to allow mixed option and non-option

2018-07-02 Thread David Sterba
On Wed, Jun 20, 2018 at 08:38:38AM +0800, Qu Wenruo wrote:
> In function handle_global_options(), we reset @optind to 1.
> However according to man page of getopt(3) NOTES section, if we need to
> rescan options later, @optind should be reset to 0 to initialize the
> internal variables correctly.
> 
> This explains the reason why in cmd_check(), getopt_long() doesn't
> handle the following command correctly:
> "btrfs check /dev/data/btrfs --check-data-csum"
> 
> While mkfs.btrfs handles mixed non-option and option correctly:
> "mkfs.btrfs -f /dev/data/disk1 --data raid1 /dev/data/disk2"
> 
> Cc: Paul Jones 
> Cc: Hugo Mills 
> Fixes: 010ceab56e06 ("btrfs-progs: rework option parser to use getopt for 
> global options")
> Signed-off-by: Qu Wenruo 
> ---
> changelog:
> v2:
>   Instead of resetting @optind at handle_global_options(), reset @optind
>   before each later getopt() call, since there are cases that use @optind
>   and expect it to start from 1.

1 and 2 applied, thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs-progs: free-space-cache: Don't panic when free space cache is corrupted

2018-07-02 Thread David Sterba
On Sun, Jul 01, 2018 at 10:45:26AM +0800, Qu Wenruo wrote:
> In btrfs_add_free_space(), if the free space to be added is already
> there, we trigger an ASSERT(), which is just another BUG_ON().
> 
> Let's remove such BUG_ON()s entirely.
> 
> Reported-by: Lewis Diamond 
> Signed-off-by: Qu Wenruo 

Applied, thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs-progs: check: Fix wrong root parameter of btrfs_next_leaf call

2018-07-02 Thread David Sterba
On Mon, Jun 18, 2018 at 02:10:39PM +0300, Nikolay Borisov wrote:
> The first thing that check_chunks_and_extents does is to iterate all
> the root items in the root tree and link them to either the "normal_list"
> or "dropping_trees" list. If a leaf has to be crossed during this
> operation btrfs_next_leaf is called to do that. However, currently it's
> called with a wrong argument for its 'root' parameter. Since we are
> iterating the root tree, the passed root should be fs_info->tree_root,
> whereas right now we are passing the local variable 'root' which is
> assigned to the fs_tree. As it stands, this bug is actually benign since
> the passed root is only passed to reada_for_search, where it's used to
> reference the fs_info. Nevertheless the code is wrong and at the very least
> misleading, so fix it by passing the correct root.
> 
> Signed-off-by: Nikolay Borisov 

Applied, thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs-progs: Check factor out root parsing from check_chunks_and_extents

2018-07-02 Thread David Sterba
On Mon, Jun 18, 2018 at 02:10:58PM +0300, Nikolay Borisov wrote:
> check_chunks_and_extents does quite a number of distinct things. The
> first of those is going through all root items in the root tree and
> classifying every root depending on whether it has a dropping operation
> in progress or not. Let's factor out this code and move the variables
> specific to it into a separate function to clean up check_chunks_and_extents
> a bit. Incidentally, this patch fixes some reference leaks, since on
> error conditions in the loop the code does "goto out" but the path is
> not really released at that label. Having this code extracted into a
> separate function which always releases the path avoids this problem
> entirely.
> 
> Signed-off-by: Nikolay Borisov 

Applied, thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 1/2] btrfs-progs: Fix wrong optind re-initialization to allow mixed option and non-option

2018-07-02 Thread David Sterba
On Wed, Jun 20, 2018 at 08:38:38AM +0800, Qu Wenruo wrote:
> In function handle_global_options(), we reset @optind to 1.
> However according to man page of getopt(3) NOTES section, if we need to
> rescan options later, @optind should be reset to 0 to initialize the
> internal variables correctly.
> 
> This explains the reason why in cmd_check(), getopt_long() doesn't
> handle the following command correctly:
> "btrfs check /dev/data/btrfs --check-data-csum"
> 
> While mkfs.btrfs handles mixed non-option and option correctly:
> "mkfs.btrfs -f /dev/data/disk1 --data raid1 /dev/data/disk2"
> 
> Cc: Paul Jones 
> Cc: Hugo Mills 
> Fixes: 010ceab56e06 ("btrfs-progs: rework option parser to use getopt for 
> global options")
> Signed-off-by: Qu Wenruo 
> ---
> changelog:
> v2:
>   Instead of resetting @optind at handle_global_options(), reset @optind
>   before each later getopt() call, since there are cases that use @optind
>   and expect it to start from 1.

So it's not possible to set it once before the callbacks are called in
btrfs.c:main() due to the exceptions that need it to be set to 1. Ok,
can you please send a separate patch that makes the optind = 1 explicit?
Even if it might be redundant in the context, it's a way to document the
expected behaviour of getopt, and we would not have to wonder whether the
optind setting is missing before the getopt calls.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: how to best segment a big block device in resizeable btrfs filesystems?

2018-07-02 Thread Marc MERLIN
On Mon, Jul 02, 2018 at 02:35:19PM -0400, Austin S. Hemmelgarn wrote:
> >I kind of linked the thin provisioning idea because it's hands off,
> >which is appealing. Any reason against it?
> No, not currently, except that it adds a whole lot more stuff between 
> BTRFS and whatever layer is below it.  That increase in what's being 
> done adds some overhead (it's noticeable on 7200 RPM consumer SATA 
> drives, but not on decent consumer SATA SSD's).
> 
> There used to be issues running BTRFS on top of LVM thin targets which 
> had zero mode turned off, but AFAIK, all of those problems were fixed 
> long ago (before 4.0).

I see, thanks for the heads up.

> >Does LVM do built in raid5 now? Is it as good/trustworthy as mdadm
> >radi5?
> Actually, it uses MD's RAID5 implementation as a back-end.  Same for 
> RAID6, and optionally for RAID0, RAID1, and RAID10.
 
Ok, that makes me feel a bit better :)

> >But yeah, if it's incompatible with thin provisioning, it's not that
> >useful.
> It's technically not incompatible, just a bit of a pain.  Last time I 
> tried to use it, you had to jump through hoops to repair a damaged RAID 
> volume that was serving as an underlying volume in a thin pool, and it 
> required keeping the thin pool offline for the entire duration of the 
> rebuild.

Argh, not good :( / thanks for the heads up.

> If you do go with thin provisioning, I would encourage you to make 
> certain to call fstrim on the BTRFS volumes on a semi regular basis so 
> that the thin pool doesn't get filled up with old unused blocks, 

That's a very good point/reminder, thanks for that. I guess it's like
running on an ssd :)

> preferably when you are 100% certain that there are no ongoing writes on 
> them (trimming blocks on BTRFS gets rid of old root trees, so it's a bit 
> dangerous to do it while writes are happening).
 
Argh, that will be harder, but I'll try.

Given what you said, it sounds like I'll still be best off with separate
layers to avoid the rebuild problem you mentioned.
So it'll be
swraid5 / dmcrypt / bcache / lvm dm thin / btrfs

Hopefully that will work well enough.

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems 
   what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: how to best segment a big block device in resizeable btrfs filesystems?

2018-07-02 Thread Austin S. Hemmelgarn

On 2018-07-02 13:34, Marc MERLIN wrote:

On Mon, Jul 02, 2018 at 12:59:02PM -0400, Austin S. Hemmelgarn wrote:

Am I supposed to put LVM thin volumes underneath so that I can share
the same single 10TB raid5?


Actually, because of the online resize ability in BTRFS, you don't
technically _need_ to use thin provisioning here.  It makes the maintenance
a bit easier, but it also adds a much more complicated layer of indirection
than just doing regular volumes.


You're right that I can use btrfs resize, but then I still need an LVM
device underneath, correct?
So, if I have 10 backup targets, I need 10 LVM LVs, I give them 10%
each of the full size available (as a guess), and then I'd have to
- btrfs resize down one that's bigger than I need
- LVM shrink the LV
- LVM grow the other LV
- LVM resize up the other btrfs

and I think LVM resize and btrfs resize are not linked so I have to do
them separately and hope to type the right numbers each time, correct?
(or is that easier now?)

I kind of linked the thin provisioning idea because it's hands off,
which is appealing. Any reason against it?
No, not currently, except that it adds a whole lot more stuff between 
BTRFS and whatever layer is below it.  That increase in what's being 
done adds some overhead (it's noticeable on 7200 RPM consumer SATA 
drives, but not on decent consumer SATA SSD's).


There used to be issues running BTRFS on top of LVM thin targets which 
had zero mode turned off, but AFAIK, all of those problems were fixed 
long ago (before 4.0).



You could (in theory) merge the LVM and software RAID5 layers, though that
may make handling of the RAID5 layer a bit complicated if you choose to use
thin provisioning (for some reason, LVM is unable to do on-line checks and
rebuilds of RAID arrays that are acting as thin pool data or metadata).
  
Does LVM do built in raid5 now? Is it as good/trustworthy as mdadm

radi5?
Actually, it uses MD's RAID5 implementation as a back-end.  Same for 
RAID6, and optionally for RAID0, RAID1, and RAID10.



But yeah, if it's incompatible with thin provisioning, it's not that
useful.
It's technically not incompatible, just a bit of a pain.  Last time I 
tried to use it, you had to jump through hoops to repair a damaged RAID 
volume that was serving as an underlying volume in a thin pool, and it 
required keeping the thin pool offline for the entire duration of the 
rebuild.



Alternatively, you could increase your array size, remove the software RAID
layer, and switch to using BTRFS in raid10 mode so that you could eliminate
one of the layers, though that would probably reduce the effectiveness of
bcache (you might want to get a bigger cache device if you do this).


Sadly that won't work. I have more data than will fit on raid10

Thanks for your suggestions though.
Still need to read up on whether I should do thin provisioning, or not.
If you do go with thin provisioning, I would encourage you to make 
certain to call fstrim on the BTRFS volumes on a semi regular basis so 
that the thin pool doesn't get filled up with old unused blocks, 
preferably when you are 100% certain that there are no ongoing writes on 
them (trimming blocks on BTRFS gets rid of old root trees, so it's a bit 
dangerous to do it while writes are happening).
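
Something as simple as the following from cron or a systemd timer, run
during a window you know is quiet, should be enough (the mount points
below are made up, adjust them to however you lay out the backup volumes):

  # return freed extents to the thin pool on each backup filesystem
  for m in /backup/host1 /backup/host2 /backup/host3; do
      fstrim -v "$m"
  done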



Re: So, does btrfs check lowmem take days? weeks?

2018-07-02 Thread Marc MERLIN
On Mon, Jul 02, 2018 at 10:33:09PM +0500, Roman Mamedov wrote:
> On Mon, 2 Jul 2018 08:19:03 -0700
> Marc MERLIN  wrote:
> 
> > I actually have fewer snapshots than this per filesystem, but I backup
> > more than 10 filesystems.
> > If I used as many snapshots as you recommend, that would already be 230
> > snapshots for 10 filesystems :)
> 
> (...once again me with my rsync :)
> 
> If you didn't use send/receive, you wouldn't be required to keep a separate
> snapshot trail per filesystem backed up, one trail of snapshots for the entire
> backup server would be enough. Rsync everything to subdirs within one
> subvolume, then do timed or event-based snapshots of it. You only need more
> than one trail if you want different retention policies for different datasets
> (e.g. in my case I have 91 and 31 days).

This is exactly how I used to do backups before btrfs.
I did 

cp -al backup.olddate backup.newdate
rsync -avSH src/ backup.newdate/

You don't even need snapshots or btrfs anymore.
Also, sorry to say, but I have different data retention needs for
different backups. Some need to rotate more quickly than others, but if
you're using rsync, the method I gave above works fine at any rotation
interval you need.
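
Expiring is then just deleting the oldest hardlinked trees per host on
whatever schedule that host needs, roughly like this (paths and
retention numbers are made up):

  find /backup/host1 -maxdepth 1 -name 'backup.*' -mtime +31 -exec rm -rf {} +
  find /backup/host2 -maxdepth 1 -name 'backup.*' -mtime +7 -exec rm -rf {} +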

It is almost as efficient as btrfs on space, but as I said, the time
penalty on all those stats for many files was what killed it for me.
If I go back to rsync backups (and I'm really unlikely to), then I'd
also go back to ext4. There would be no point in dealing with the
complexity and fragility of btrfs anymore.

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems 
   what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/   | PGP 7F55D5F27AAF9D08


Re: how to best segment a big block device in resizeable btrfs filesystems?

2018-07-02 Thread Marc MERLIN
On Mon, Jul 02, 2018 at 12:59:02PM -0400, Austin S. Hemmelgarn wrote:
> > Am I supposed to put LVM thin volumes underneath so that I can share
> > the same single 10TB raid5?
>
> Actually, because of the online resize ability in BTRFS, you don't
> technically _need_ to use thin provisioning here.  It makes the maintenance
> a bit easier, but it also adds a much more complicated layer of indirection
> than just doing regular volumes.

You're right that I can use btrfs resize, but then I still need an LVM
device underneath, correct?
So, if I have 10 backup targets, I need 10 LVM LVs, I give them 10%
each of the full size available (as a guess), and then I'd have to 
- btrfs resize down one that's bigger than I need
- LVM shrink the LV
- LVM grow the other LV
- LVM resize up the other btrfs

and I think LVM resize and btrfs resize are not linked so I have to do
them separately and hope to type the right numbers each time, correct?
(or is that easier now?)
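
Spelled out, I'd expect every rebalancing to look roughly like this
(vg0/backup1/backup2 and the sizes are made-up examples, which is
exactly where the typo risk comes in):

  btrfs filesystem resize -200g /mnt/btrfs_backup1   # shrink the fs first
  lvreduce -L -200g /dev/vg0/backup1                 # then shrink the LV to match
  lvextend -L +200g /dev/vg0/backup2                 # grow the other LV
  btrfs filesystem resize max /mnt/btrfs_backup2     # grow the other fs to fill it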

I kind of liked the thin provisioning idea because it's hands off,
which is appealing. Any reason against it?

> You could (in theory) merge the LVM and software RAID5 layers, though that
> may make handling of the RAID5 layer a bit complicated if you choose to use
> thin provisioning (for some reason, LVM is unable to do on-line checks and
> rebuilds of RAID arrays that are acting as thin pool data or metadata).
 
Does LVM do built in raid5 now? Is it as good/trustworthy as mdadm
raid5?
But yeah, if it's incompatible with thin provisioning, it's not that
useful.

> Alternatively, you could increase your array size, remove the software RAID
> layer, and switch to using BTRFS in raid10 mode so that you could eliminate
> one of the layers, though that would probably reduce the effectiveness of
> bcache (you might want to get a bigger cache device if you do this).

Sadly that won't work. I have more data than will fit on raid10

Thanks for your suggestions though.
Still need to read up on whether I should do thin provisioning, or not.

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems 
   what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/   | PGP 7F55D5F27AAF9D08


Re: So, does btrfs check lowmem take days? weeks?

2018-07-02 Thread Roman Mamedov
On Mon, 2 Jul 2018 08:19:03 -0700
Marc MERLIN  wrote:

> I actually have fewer snapshots than this per filesystem, but I backup
> more than 10 filesystems.
> If I used as many snapshots as you recommend, that would already be 230
> snapshots for 10 filesystems :)

(...once again me with my rsync :)

If you didn't use send/receive, you wouldn't be required to keep a separate
snapshot trail per filesystem backed up, one trail of snapshots for the entire
backup server would be enough. Rsync everything to subdirs within one
subvolume, then do timed or event-based snapshots of it. You only need more
than one trail if you want different retention policies for different datasets
(e.g. in my case I have 91 and 31 days).
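
In other words, something along these lines, with one snapshot trail
covering everything (the paths are just an example):

  rsync -aHS --delete host1:/ /backup/current/host1/
  rsync -aHS --delete host2:/ /backup/current/host2/
  btrfs subvolume snapshot -r /backup/current /backup/snapshots/$(date +%Y-%m-%d)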

-- 
With respect,
Roman


Re: So, does btrfs check lowmem take days? weeks?

2018-07-02 Thread Austin S. Hemmelgarn

On 2018-07-02 11:19, Marc MERLIN wrote:

Hi Qu,

thanks for the detailed and honest answer.
A few comments inline.

On Mon, Jul 02, 2018 at 10:42:40PM +0800, Qu Wenruo wrote:

For full, it depends. (but for most real world cases, it's still flawed)
We have small and crafted images as test cases, which btrfs check can
repair without problem at all.
But such images are *SMALL*, and only have *ONE* type of corruption,
which can't represent real world cases at all.
  
right, they're just unittest images, I understand.



1) Too large fs (especially too many snapshots)
The use case (too many snapshots and shared extents, a lot of extents
get shared over 1000 times) is in fact a super large challenge for
lowmem mode check/repair.
It needs O(n^2) or even O(n^3) to check each backref, which hugely
slows the progress and makes it hard to locate the real bug.
  
So, the non lowmem version would work better, but it's a problem if it

doesn't fit in RAM.
I've always considered it a grave bug that btrfs check repair can use so
much kernel memory that it will crash the entire system. This should not
be possible.
While it won't help me here, can btrfs check be improved not to suck all
the kernel memory, and ideally even allow using swap space if the RAM is
not enough?

Is btrfs check regular mode still being maintained? I think it's still
better than lowmem, correct?


2) Corruption in extent tree and our objective is to mount RW
Extent tree is almost useless if we just want to read data.
But when we do any write, we need it, and if it goes wrong even a
tiny bit, your fs could be damaged really badly.

For other corruption, like some fs tree corruption, we could do
something to discard some corrupted files, but if it's extent tree,
we either mount RO and grab anything we have, or hope the
almost-never-working --init-extent-tree can work (that's mostly a
miracle).
  
I understand that it's the weak point of btrfs, thanks for explaining.



1) Don't keep too many snapshots.
Really, this is the core.
For send/receive backup, IIRC it only needs the parent subvolume
exists, there is no need to keep the whole history of all those
snapshots.


You are correct on history. The reason I keep history is because I may
want to recover a file from last week or 2 weeks ago after I finally
notice that it's gone.
I have terabytes of space on the backup server, so it's easier to keep
history there than on the client which may not have enough space to keep
a month's worth of history.
As you know, back when we did tape backups, we also kept history of at
least several weeks (usually several months, but that's too much for
btrfs snapshots).
Bit of a case-study here, but it may be of interest.  We do something 
kind of similar where I work for our internal file servers.  We've got 
daily snapshots of the whole server kept on the server itself for 7 days 
(we usually see less than 5% of the total amount of data in changes on 
weekdays, and essentially 0 on weekends, so the snapshots rarely take up 
more than about 25% of the size of the live data), and then we 
additionally do daily backups which we retain for 6 months.  I've 
written up a short (albeit rather system-specific) script for recovering 
old versions of a file that first scans the snapshots, and then pulls it 
out of the backups if it's not there.  I've found this works remarkably 
well for our use case (almost all the data on the file server follows a 
WORM access pattern with most of the files being between 100kB and 100MB 
in size).
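
The gist of that script is nothing fancier than the following (the
snapshot layout and names are specific to our setup, so treat it as a
sketch):

  #!/bin/sh
  # look for the requested path in the on-server snapshots, newest first
  f="$1"
  for snap in $(ls -1d /srv/.snapshots/daily-* | sort -r); do
      if [ -e "$snap/$f" ]; then
          echo "restoring $f from $snap"
          cp -a "$snap/$f" . && exit 0
      fi
  done
  # not in any snapshot, so it has to come out of the 6-month backup set
  echo "$f not found in snapshots, recover it from backups instead" >&2
  exit 1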


We actually did try moving it all over to BTRFS for a while before we 
finally ended up with the setup we currently have, but aside from the 
whole issue with massive numbers of snapshots, we found that for us at 
least, Amanda actually outperforms BTRFS send/receive for everything 
except full backups and uses less storage space (though that last bit is 
largely because we use really aggressive compression).




Re: how to best segment a big block device in resizeable btrfs filesystems?

2018-07-02 Thread Austin S. Hemmelgarn

On 2018-07-02 11:18, Marc MERLIN wrote:

Hi Qu,

I'll split this part into a new thread:


2) Don't keep unrelated snapshots in one btrfs.
I totally understand that maintaining different btrfs filesystems would hugely add
maintenance pressure, but as explained, all snapshots share one
fragile extent tree.


Yes, I understand that this is what I should do given what you
explained.
My main problem is knowing how to segment things so I don't end up with
filesystems that are full while others are almost empty :)

Am I supposed to put LVM thin volumes underneath so that I can share
the same single 10TB raid5?
Actually, because of the online resize ability in BTRFS, you don't 
technically _need_ to use thin provisioning here.  It makes the 
maintenance a bit easier, but it also adds a much more complicated layer 
of indirection than just doing regular volumes.


If I do this, I would have
software raid 5 < dmcrypt < bcache < lvm < btrfs
That's a lot of layers, and that's also starting to make me nervous :)

Is there any other way that does not involve me creating smaller block
devices for multiple btrfs filesystems and hope that they are the right
size because I won't be able to change it later?
You could (in theory) merge the LVM and software RAID5 layers, though 
that may make handling of the RAID5 layer a bit complicated if you 
choose to use thin provisioning (for some reason, LVM is unable to do 
on-line checks and rebuilds of RAID arrays that are acting as thin pool 
data or metadata).


Alternatively, you could increase your array size, remove the software 
RAID layer, and switch to using BTRFS in raid10 mode so that you could 
eliminate one of the layers, though that would probably reduce the 
effectiveness of bcache (you might want to get a bigger cache device if 
you do this).



Re: [PATCH] btrfs: relocation: Only remove reloc rb_trees if reloc control has been initialized

2018-07-02 Thread David Sterba
On Mon, Jul 02, 2018 at 02:25:38PM +0800, Qu Wenruo wrote:
> Reported in https://bugzilla.kernel.org/show_bug.cgi?id=199833, where an
> invalid tree reloc tree can cause kernel NULL pointer dereference when
> btrfs does some cleanup for reloc roots.
> 
> It turns out that fs_info->reloc_ctl can be NULL in
> btrfs_recover_relocation() as we allocate relocation control after all
> reloc roots are verified.
> So when we hit the out: tag, we haven't called set_reloc_control(), thus
> fs_info->reloc_ctl is still NULL.
> 
> Reported-by: Xu Wen 
> Signed-off-by: Qu Wenruo 

Thanks for the fix, patch added to the queue. I've added the fuzzed
image from bugzilla to btrfs-progs.


Re: [PATCH] btrfs: qgroups: Move transaction managed inside btrfs_quota_enable

2018-07-02 Thread David Sterba
On Mon, Jul 02, 2018 at 02:00:34PM +0300, Nikolay Borisov wrote:
> Commit 5d23515be669 ("btrfs: Move qgroup rescan on quota enable to
> btrfs_quota_enable") not only resulted in an easier to follow code but
> it also introduced a subtle bug. It changed the timing when the initial
> transaction rescan was happening - before the commit it would happen
> after transaction commit had occurred but after the commit it might happen
> before the transaction was committed. This results in failure to
> correctly rescan the quota since there could be data which is still not
> committed on disk.
> 
> This patch aims to fix this by moving the transaction creation/commit
> inside btrfs_quota_enable, which allows scheduling the quota commit
> after the transaction has been committed.
> 
> Fixes: 5d23515be669 ("btrfs: Move qgroup rescan on quota enable to 
> btrfs_quota_enable")
> Link: https://marc.info/?l=linux-btrfs&m=152999289017582

Please use https://lkml.kernel.org/r/

> Reported-by: Misono Tomohiro 
> Reviewed-by: Misono Tomohiro 
> Signed-off-by: Nikolay Borisov 
> ---
>  fs/btrfs/ioctl.c  | 15 ++-
>  fs/btrfs/qgroup.c | 38 +++---
>  fs/btrfs/qgroup.h |  6 ++
>  3 files changed, 35 insertions(+), 24 deletions(-)
> 
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index a399750b9e41..316fb1af15e2 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -5135,9 +5135,7 @@ static long btrfs_ioctl_quota_ctl(struct file *file, 
> void __user *arg)
>   struct inode *inode = file_inode(file);
>   struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
>   struct btrfs_ioctl_quota_ctl_args *sa;
> - struct btrfs_trans_handle *trans = NULL;
>   int ret;
> - int err;
>  
>   if (!capable(CAP_SYS_ADMIN))
>   return -EPERM;
> @@ -5153,28 +5151,19 @@ static long btrfs_ioctl_quota_ctl(struct file *file, 
> void __user *arg)
>   }
>  
>   down_write(&fs_info->subvol_sem);
> - trans = btrfs_start_transaction(fs_info->tree_root, 2);
> - if (IS_ERR(trans)) {
> - ret = PTR_ERR(trans);
> - goto out;
> - }
>  
>   switch (sa->cmd) {
>   case BTRFS_QUOTA_CTL_ENABLE:
> - ret = btrfs_quota_enable(trans, fs_info);
> + ret = btrfs_quota_enable(fs_info);
>   break;
>   case BTRFS_QUOTA_CTL_DISABLE:
> - ret = btrfs_quota_disable(trans, fs_info);
> + ret = btrfs_quota_disable(fs_info);
>   break;
>   default:
>   ret = -EINVAL;
>   break;
>   }
>  
> - err = btrfs_commit_transaction(trans);
> - if (err && !ret)
> - ret = err;
> -out:
>   kfree(sa);
>   up_write(&fs_info->subvol_sem);
>  drop_write:
> diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
> index c25dc47210a3..1012c7138633 100644
> --- a/fs/btrfs/qgroup.c
> +++ b/fs/btrfs/qgroup.c
> @@ -875,8 +875,7 @@ static int btrfs_clean_quota_tree(struct 
> btrfs_trans_handle *trans,
>   return ret;
>  }
>  
> -int btrfs_quota_enable(struct btrfs_trans_handle *trans,
> -struct btrfs_fs_info *fs_info)
> +int btrfs_quota_enable(struct btrfs_fs_info *fs_info)
>  {
>   struct btrfs_root *quota_root;
>   struct btrfs_root *tree_root = fs_info->tree_root;
> @@ -886,6 +885,7 @@ int btrfs_quota_enable(struct btrfs_trans_handle *trans,
>   struct btrfs_key key;
>   struct btrfs_key found_key;
>   struct btrfs_qgroup *qgroup = NULL;
> + struct btrfs_trans_handle *trans = NULL;
>   int ret = 0;
>   int slot;
>  
> @@ -893,6 +893,12 @@ int btrfs_quota_enable(struct btrfs_trans_handle *trans,
>   if (fs_info->quota_root)
>   goto out;
>  
> + trans = btrfs_start_transaction(tree_root, 2);

Please document what transaction items are requested.

> + if (IS_ERR(trans)) {
> + ret = PTR_ERR(trans);
> + goto out;
> + }
> +
>   fs_info->qgroup_ulist = ulist_alloc(GFP_KERNEL);
>   if (!fs_info->qgroup_ulist) {
>   ret = -ENOMEM;
> @@ -987,6 +993,11 @@ int btrfs_quota_enable(struct btrfs_trans_handle *trans,
>   fs_info->quota_root = quota_root;
>   set_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags);
>   spin_unlock(&fs_info->qgroup_lock);
> +
> + ret = btrfs_commit_transaction(trans);
> + if (ret)
> + goto out_free_path;
> +
>   ret = qgroup_rescan_init(fs_info, 0, 1);
>   if (!ret) {
>   qgroup_rescan_zero_tracking(fs_info);
> @@ -1011,15 +1022,22 @@ int btrfs_quota_enable(struct btrfs_trans_handle 
> *trans,
>   return ret;
>  }
>  
> -int btrfs_quota_disable(struct btrfs_trans_handle *trans,
> - struct btrfs_fs_info *fs_info)
> +int btrfs_quota_disable(struct btrfs_fs_info *fs_info)
>  {
>   struct btrfs_root *quota_root;
> + struct btrfs_trans_handle *trans = NULL;
>   int ret = 0;
>  
>   mutex_lock(&fs_info->qgroup_ioctl_lock);
>  

Re: So, does btrfs check lowmem take days? weeks?

2018-07-02 Thread Marc MERLIN
Hi Qu,

thanks for the detailed and honest answer.
A few comments inline.

On Mon, Jul 02, 2018 at 10:42:40PM +0800, Qu Wenruo wrote:
> For full, it depends. (but for most real world cases, it's still flawed)
> We have small and crafted images as test cases, which btrfs check can
> repair without problem at all.
> But such images are *SMALL*, and only have *ONE* type of corruption,
> which can't represent real world cases at all.
 
right, they're just unittest images, I understand.

> 1) Too large fs (especially too many snapshots)
>The use case (too many snapshots and shared extents, a lot of extents
>get shared over 1000 times) is in fact a super large challenge for
>lowmem mode check/repair.
>It needs O(n^2) or even O(n^3) to check each backref, which hugely
>slows the progress and makes it hard to locate the real bug.
 
So, the non lowmem version would work better, but it's a problem if it
doesn't fit in RAM.
I've always considered it a grave bug that btrfs check repair can use so
much kernel memory that it will crash the entire system. This should not
be possible.
While it won't help me here, can btrfs check be improved not to suck all
the kernel memory, and ideally even allow using swap space if the RAM is
not enough?

Is btrfs check regular mode still being maintained? I think it's still
better than lowmem, correct?

> 2) Corruption in extent tree and our objective is to mount RW
>Extent tree is almost useless if we just want to read data.
>But when we do any write, we need it, and if it goes wrong even a
>tiny bit, your fs could be damaged really badly.
> 
>For other corruption, like some fs tree corruption, we could do
>something to discard some corrupted files, but if it's extent tree,
>we either mount RO and grab anything we have, or hope the
>almost-never-working --init-extent-tree can work (that's mostly a
>miracle).
 
I understand that it's the weak point of btrfs, thanks for explaining.

> 1) Don't keep too many snapshots.
>Really, this is the core.
>For send/receive backup, IIRC it only needs the parent subvolume
>exists, there is no need to keep the whole history of all those
>snapshots.

You are correct on history. The reason I keep history is because I may
want to recover a file from last week or 2 weeks ago after I finally
notice that it's gone. 
I have terabytes of space on the backup server, so it's easier to keep
history there than on the client which may not have enough space to keep
a month's worth of history.
As you know, back when we did tape backups, we also kept history of at
least several weeks (usually several months, but that's too much for
btrfs snapshots).

>Keep the number of snapshots to minimal does greatly improve the
>possibility (both manual patch or check repair) of a successful
>repair.
>Normally I would suggest 4 hourly snapshots, 7 daily snapshots, 12
>monthly snapshots.

I actually have fewer snapshots than this per filesystem, but I backup
more than 10 filesystems.
If I used as many snapshots as you recommend, that would already be 230
snapshots for 10 filesystems :)

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems 
   what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/   | PGP 7F55D5F27AAF9D08


Re: how to best segment a big block device in resizeable btrfs filesystems?

2018-07-02 Thread Marc MERLIN
Hi Qu,

I'll split this part into a new thread:

> 2) Don't keep unrelated snapshots in one btrfs.
>I totally understand that maintaining different btrfs filesystems would hugely add
>maintenance pressure, but as explained, all snapshots share one
>fragile extent tree.

Yes, I understand that this is what I should do given what you
explained.
My main problem is knowing how to segment things so I don't end up with
filesystems that are full while others are almost empty :)

Am I supposed to put LVM thin volumes underneath so that I can share
the same single 10TB raid5?

If I do this, I would have
software raid 5 < dmcrypt < bcache < lvm < btrfs
That's a lot of layers, and that's also starting to make me nervous :)
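
Written out, building that stack from scratch would be something like
this (device names and sizes are made up, it's just to show how many
moving parts there are):

  mdadm --create /dev/md0 --level=5 --raid-devices=4 /dev/sd[b-e]
  cryptsetup luksFormat /dev/md0
  cryptsetup open /dev/md0 md0crypt
  make-bcache -C /dev/nvme0n1 -B /dev/mapper/md0crypt
  pvcreate /dev/bcache0
  vgcreate vg0 /dev/bcache0
  lvcreate -L 1T -n backup1 vg0
  mkfs.btrfs /dev/vg0/backup1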

Is there any other way that does not involve me creating smaller block
devices for multiple btrfs filesystems and hope that they are the right
size because I won't be able to change it later?

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems 
   what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/   | PGP 7F55D5F27AAF9D08


Re: [PATCH v2] btrfs: tree-checker: Verify block_group_item

2018-07-02 Thread David Sterba
On Mon, Jul 02, 2018 at 04:34:14PM +0800, Qu Wenruo wrote:
> As reported in https://bugzilla.kernel.org/show_bug.cgi?id=199849,
> a crafted image with invalid block group items could make the free space
> cache code panic.
> 
> We can detect such an invalid block group item early by checking:
> 1) Size (key)
>We have an upper limit on block group item size (10G)

> + if (key->offset > 10ULL * SZ_1G) {
> + block_group_err(fs_info, leaf, slot,
> + "invalid block group size, have %llu expect (0, %llu)",
> + key->offset, 10ULL * SZ_1G);

Can you please make this magic constant a define and explain in a
comment how it's calculated? Thanks.


Re: So, does btrfs check lowmem take days? weeks?

2018-07-02 Thread Qu Wenruo



On 2018-07-02 22:05, Marc MERLIN wrote:
> On Mon, Jul 02, 2018 at 02:22:20PM +0800, Su Yue wrote:
>>> Ok, that's 29MB, so it doesn't fit on pastebin:
>>> http://marc.merlins.org/tmp/dshelf2_inspect.txt
>>>
>> Sorry Marc. After offline communication with Qu, both
>> of us think the filesystem is hard to repair.
>> The filesystem is too large to debug step by step.
>> Every round of check and debug is too expensive.
>> And it already costs several days.
>>
>> Sadly, I am afraid that you have to recreate the filesystem
>> and re-back up your data. :(
>>
>> Sorry again and thanks for your reports and patience.
> 
> I appreciate your help. Honestly I only wanted to help you find why the
> tools aren't working. Fixing filesystems by hand (and remotely via Email
> on top of that), is way too time consuming like you said.
> 
> Is the btrfs design flawed in a way that repair tools just cannot repair
> on their own? 

In short, for your case, yes: you can consider the repair tools
garbage and should not use them on any production system.

For full, it depends. (but for most real world cases, it's still flawed)
We have small and crafted images as test cases, which btrfs check can
repair without problem at all.
But such images are *SMALL*, and only have *ONE* type of corruption,
which can't represent real world cases at all.

> I understand that data can be lost, but I don't understand how the tools
> just either keep crashing for me, go in infinite loops, or otherwise
> fail to give me back a stable filesystem, even if some data is missing
> after that.

There are several reasons here that repair tool can't help much:

1) Too large fs (especially too many snapshots)
   The use case (too many snapshots and shared extents, a lot of extents
   get shared over 1000 times) is in fact a super large challenge for
   lowmem mode check/repair.
   It needs O(n^2) or even O(n^3) to check each backref, which hugely
   slows the progress and makes it hard to locate the real bug.

2) Corruption in extent tree and our objective is to mount RW
   Extent tree is almost useless if we just want to read data.
   But when we do any write, we need it, and if it goes wrong even a
   tiny bit, your fs could be damaged really badly.

   For other corruption, like some fs tree corruption, we could do
   something to discard some corrupted files, but if it's extent tree,
   we either mount RO and grab anything we have, or hope the
   almost-never-working --init-extent-tree can work (that's mostly a
   miracle).

So, I feel very sorry that we can't provide enough help for your case.

But still, we hope to provide some tips on next build if you still want
to choose btrfs.

1) Don't keep too many snapshots.
   Really, this is the core.
   For send/receive backup, IIRC it only needs the parent subvolume
   exists, there is no need to keep the whole history of all those
   snapshots.
   Keep the number of snapshots to minimal does greatly improve the
   possibility (both manual patch or check repair) of a successful
   repair.
   Normally I would suggest 4 hourly snapshots, 7 daily snapshots, 12
   monthly snapshots.

2) Don't keep unrelated snapshots in one btrfs.
   I totally understand that maintaining different btrfs filesystems would hugely add
   maintenance pressure, but as explained, all snapshots share one
   fragile extent tree.
   If we limit the fragile extent tree from each other fs, it's less
   possible a single extent tree corruption to take down the whole fs.

Thanks,
Qu

> 
> Thanks,
> Marc
> 


Re: So, does btrfs check lowmem take days? weeks?

2018-07-02 Thread Marc MERLIN
On Mon, Jul 02, 2018 at 02:22:20PM +0800, Su Yue wrote:
> > Ok, that's 29MB, so it doesn't fit on pastebin:
> > http://marc.merlins.org/tmp/dshelf2_inspect.txt
> > 
> Sorry Marc. After offline communication with Qu, both
> of us think the filesystem is hard to repair.
> The filesystem is too large to debug step by step.
> Every round of check and debug is too expensive.
> And it already costs several days.
> 
> Sadly, I am afraid that you have to recreate the filesystem
> and re-back up your data. :(
> 
> Sorry again and thanks for your reports and patience.

I appreciate your help. Honestly I only wanted to help you find why the
tools aren't working. Fixing filesystems by hand (and remotely via Email
on top of that), is way too time consuming like you said.

Is the btrfs design flawed in a way that repair tools just cannot repair
on their own? 
I understand that data can be lost, but I don't understand how the tools
just either keep crashing for me, go in infinite loops, or otherwise
fail to give me back a stable filesystem, even if some data is missing
after that.

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems 
   what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/   | PGP 7F55D5F27AAF9D08


Re: [PATCH RFC 0/3] btrfs-progs: lowmem: delay before lowmem repair

2018-07-02 Thread Su Yue



> Sent: Monday, July 02, 2018 at 5:43 PM
> From: "Nikolay Borisov" 
> To: "Su Yue" , linux-btrfs@vger.kernel.org
> Subject: Re: [PATCH RFC 0/3] btrfs-progs: lowmem: delay before lowmem repair
>
> 
> 
> On  2.07.2018 12:28, Su Yue wrote:
> > Since lowmem repair is dangerous, it should remind the user more obviously.
> > The patchset adds a 10 second delay like btrfs balance and adds an option
> > '--force-repair-lowmem' to skip the delay.
> 
> IMO this is the wrong way to approach a dangerous option. If it's so
> dangerous it needs to be written in the documentation explicitly this is
> so. If someone wants to use lowmem then they should explicitly set
> --mode lowmem. So I'm inclined to NACK this patch.
> 
OK. After some consideration, I think the patchset is indeed inappropriate.
Dropping it is fine.

Thanks,
Su
> > 
> > ---
> > I don't know whether it's a good idea to add the delay, or whether the option only for
> > lowmem repair is acceptable, so I made it an RFC.
> > 
> > Su Yue (3):
> >   btrfs-progs: lowmem: delay before lowmem repair starts
> >   btrfs-progs: lowmem: force to start without delay with option
> > '--force-repair-lowmem'
> >   btrfs-progs: tests: append '--force-repair-lowmem' if lowmem repair is
> > enabled
> > 
> >  check/main.c   | 43 ---
> >  tests/common.local |  1 +
> >  2 files changed, 37 insertions(+), 7 deletions(-)
> > 


Re: [PATCH RFC 0/3] btrfs-progs: lowmem: delay before lowmem repair

2018-07-02 Thread Su Yue



---
Su


> Sent: Monday, July 02, 2018 at 7:09 PM
> From: "Nikolay Borisov" 
> To: "David Disseldorp" 
> Cc: "Su Yue" , linux-btrfs@vger.kernel.org
> Subject: Re: [PATCH RFC 0/3] btrfs-progs: lowmem: delay before lowmem repair
>
> 
> 
> On  2.07.2018 14:07, David Disseldorp wrote:
> > On Mon, 2 Jul 2018 12:43:20 +0300, Nikolay Borisov wrote:
> > 
> >> On  2.07.2018 12:28, Su Yue wrote:
> >>> Since lowmem repair is dangerous, it should remind the user more obviously.
> >>> The patchset adds a 10 second delay like btrfs balance and adds an option
> >>> '--force-repair-lowmem' to skip the delay.  
> >>
> >> IMO this is the wrong way to approach a dangerous option. If it's so
> >> dangerous it needs to be written in the documentation explicitly this is
> >> so. If someone wants to use lowmem then they should explicitly set
> >> --mode lowmem. So I'm inclined to NACK this patch.
> > 
> > AFAICT it's already documented as "experimental" in the manpage, but the
> > usage flag appears to have been dropped as part of refactoring for
> > 87c1bd13c1fca430c3dbf0da62e9aa33bde609c8 . If nobody's working on a fix,
> > and lowmem removal isn't an option, then please consider adding the usage
> > flag back, e.g.
> 
> lowmem seems to be here to stay and it seems to be more useful than
> original mode. So your patch is acceptable.
> 
Agreed.

Su
> > --- a/check/main.c
> > +++ b/check/main.c
> > @@ -9386,7 +9386,7 @@ const char * const cmd_check_usage[] = {
> > "original - read inodes and extents to 
> > memory (requires",
> > "   more memory, does less IO)",
> > "lowmem   - try to use less memory but 
> > read blocks again",
> > -   "   when needed",
> > +   "   when needed (experimental)",
> > "--check-data-csum   verify checksums of data blocks",
> > "-Q|--qgroup-report  print a report on qgroup consistency",
> > "-E|--subvol-extents ",
> > 
> > Cheers, David
> > 


Re: [PATCH 4/4] btrfs: Remove unnecessary locking code in qgroup_rescan_leaf

2018-07-02 Thread Nikolay Borisov



On 27.06.2018 16:38, Nikolay Borisov wrote:
> In qgroup_rescan_leaf a copy is made of the target leaf by calling
> btrfs_clone_extent_buffer. The latter allocates a new buffer and
> attaches a new set of pages and copies the content of the source
> buffer. The new scratch buffer is only used to iterate its items, it's
> not published anywhere and cannot be accessed by a third party. Hence,
> it's not necessary to perform any locking on it whatsoever. Furthermore,
> remove the extra extent_buffer_get call since the new buffer is always
> allocated with a reference count of 1 which is sufficient here.
> No functional changes.
> 
> Signed-off-by: Nikolay Borisov 

I'm going to be sending this patch as part of a larger series so you can
ignore it for now.

> ---
>  fs/btrfs/qgroup.c | 7 +--
>  1 file changed, 1 insertion(+), 6 deletions(-)
> 
> diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
> index 1874a6d2e6f5..3b57dc247fa2 100644
> --- a/fs/btrfs/qgroup.c
> +++ b/fs/btrfs/qgroup.c
> @@ -2647,9 +2647,6 @@ qgroup_rescan_leaf(struct btrfs_fs_info *fs_info, 
> struct btrfs_path *path,
>   mutex_unlock(&fs_info->qgroup_rescan_lock);
>   goto out;
>   }
> - extent_buffer_get(scratch_leaf);
> - btrfs_tree_read_lock(scratch_leaf);
> - btrfs_set_lock_blocking_rw(scratch_leaf, BTRFS_READ_LOCK);
>   slot = path->slots[0];
>   btrfs_release_path(path);
>   mutex_unlock(&fs_info->qgroup_rescan_lock);
> @@ -2675,10 +2672,8 @@ qgroup_rescan_leaf(struct btrfs_fs_info *fs_info, 
> struct btrfs_path *path,
>   goto out;
>   }
>  out:
> - if (scratch_leaf) {
> - btrfs_tree_read_unlock_blocking(scratch_leaf);
> + if (scratch_leaf)
>   free_extent_buffer(scratch_leaf);
> - }
>  
>   if (done && !ret)
>   ret = 1;
> 


Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files

2018-07-02 Thread Austin S. Hemmelgarn

On 2018-06-30 02:33, Duncan wrote:

Austin S. Hemmelgarn posted on Fri, 29 Jun 2018 14:31:04 -0400 as
excerpted:


On 2018-06-29 13:58, james harvey wrote:

On Fri, Jun 29, 2018 at 1:09 PM, Austin S. Hemmelgarn
 wrote:

On 2018-06-29 11:15, james harvey wrote:


On Thu, Jun 28, 2018 at 6:27 PM, Chris Murphy

wrote:


And an open question I have about scrub is whether it only ever is
checking csums, meaning nodatacow files are never scrubbed, or if
the copies are at least compared to each other?



Scrub never looks at nodatacow files.  It does not compare the copies
to each other.

Qu submitted a patch to make check compare the copies:
https://patchwork.kernel.org/patch/10434509/

This hasn't been added to btrfs-progs git yet.

IMO, I think the offline check should look at nodatacow copies like
this, but I still think this also needs to be added to scrub.  In the
patch thread, I discuss my reasons why.  In brief: online scanning;
this goes along with user's expectation of scrub ensuring mirrored
data integrity; and recommendations to setup scrub on periodic basis
to me means it's the place to put it.


That said, it can't sanely fix things if there is a mismatch. At
least,
not unless BTRFS gets proper generational tracking to handle
temporarily missing devices.  As of right now, sanely fixing things
requires significant manual intervention, as you have to bypass the
device read selection algorithm to be able to look at the state of the
individual copies so that you can pick one to use and forcibly rewrite
the whole file by hand.


Absolutely.  User would need to use manual intervention as you
describe, or restore the single file(s) from backup.  But, it's a good
opportunity to tell the user they had partial data corruption, even if
it can't be auto-fixed.  Otherwise they get intermittent data
corruption, depending on which copies are read.



The thing is though, as things stand right now, you need to manually
edit the data on-disk directly or restore the file from a backup to fix
the file.  While it's technically true that you can manually repair this
type of thing, both of the cases for doing it without those patches I
mentioned, it's functionally impossible for a regular user to do it
without potentially losing some data.


[Usual backups rant, user vs. admin variant, nowcow/tmpfs edition.
Regulars can skip as the rest is already predicted from past posts, for
them. =;^]

"Regular user"?

"Regular users" don't need to bother with this level of detail.  They
simply get their "admin" to do it, even if that "admin" is their kid, or
the kid from next door that's good with computers, or the geek squad (aka
nsa-agent-squad) guy/gal, doing it... or telling them to install "a real
OS", meaning whatever MS/Apple/Google something that they know how to
deal with.

If the "user" is dealing with setting nocow, choosing btrfs in the first
place, etc, then they're _not_ a "regular user" by definition, they're
already an admin.

I'd argue that that's not always true.  'Regular users' also blindly 
follow advice they find online about how to make their system run 
better, and quite often don't keep backups.


And as any admin learns rather quickly, the value of data is defined by
the number of backups it's worth having of that data.

Which means it's not a problem.  Either the data had a backup and it's
(reasonably) trivial to restore the data from that backup, or the data
was defined by lack of having that backup as of only trivial value, so
low as to not be worth the time/trouble/resources necessary to make that
backup in the first place.

Which of course means what was defined as of most value, either the data
of there was a backup, or the time/trouble/resources that would have gone
into creating it if not, is *always* saved.

(And of course the same goes for "I had a backup, but it's old", except
in this case it's the value of the data delta between the backup and
current.  As soon as it's worth more than the time/trouble/hassle of
updating the backup, it will by definition be updated.  Not having a
newer backup available thus simply means the value of the data that
changed between the last backup and current was simply not enough to
justify updating the backup, and again, what was of most value is
*always* saved, either the data, or the time that would have otherwise
gone into making the newer backup.)

Because while a "regular user" may not know it because it's not his /job/
to know it, if there's anything an admin knows *well* it's that the
working copy of data **WILL** be damaged.  It's not a matter of if, but
of when, and of whether it'll be a fat-finger mistake, or a hardware or
software failure, or wetware (theft, ransomware, etc), or wetware (flood,
fire and the water that put it out damage, etc), tho none of that
actually matters after all, because in the end, the only thing that
matters was how the value of that data was defined by the number of
backups made of it, and how quickly and conveniently at 

Re: unsolvable technical issues?

2018-07-02 Thread Austin S. Hemmelgarn

On 2018-06-30 01:32, Andrei Borzenkov wrote:

On 30.06.2018 06:22, Duncan wrote:

Austin S. Hemmelgarn posted on Mon, 25 Jun 2018 07:26:41 -0400 as
excerpted:


On 2018-06-24 16:22, Goffredo Baroncelli wrote:

On 06/23/2018 07:11 AM, Duncan wrote:

waxhead posted on Fri, 22 Jun 2018 01:13:31 +0200 as excerpted:


According to this:

https://stratis-storage.github.io/StratisSoftwareDesign.pdf Page 4 ,
section 1.2

It claims that BTRFS still have significant technical issues that may
never be resolved.


I can speculate a bit.

1) When I see btrfs "technical issue that may never be resolved", the
#1 first thing I think of, that AFAIK there are _definitely_ no plans
to resolve, because it's very deeply woven into the btrfs core by now,
is...

[1)] Filesystem UUID Identification.  Btrfs takes the UU bit of
Universally Unique quite literally, assuming they really *are*
unique, at least on that system[.]  Because
btrfs uses this supposedly unique ID to ID devices that belong to the
filesystem, it can get *very* mixed up, with results possibly
including dataloss, if it sees devices that don't actually belong to a
filesystem with the same UUID as a mounted filesystem.


As partial workaround you can disable udev btrfs rules and then do a
"btrfs dev scan" manually only for the device which you need.



You don't even need `btrfs dev scan` if you just specify the exact set
of devices in the mount options.  The `device=` mount option tells the
kernel to check that device during the mount process.

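For example (device names made up), either directly or via fstab:

  mount -o device=/dev/sdb2,device=/dev/sdc2 /dev/sdb2 /mnt/data
  # fstab equivalent:
  # UUID=<fs-uuid>  /mnt/data  btrfs  device=/dev/sdb2,device=/dev/sdc2  0 0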

Not that lvm does any better in this regard[1], but has btrfs ever solved
the bug where only one device= in the kernel commandline's rootflags=
would take effect, effectively forcing initr* on people (like me) who
would otherwise not need them and prefer to do without them, if they're
using a multi-device btrfs as root?



This requires in-kernel device scanning; I doubt we will ever see it.


Not to mention the fact that as kernel people will tell you, device
enumeration isn't guaranteed to be in the same order every boot, so
device=/dev/* can't be relied upon and shouldn't be used -- but of course
device=LABEL= and device=UUID= and similar won't work without userspace,
basically udev (if they work at all, IDK if they actually do).

Tho in practice from what I've seen, device enumeration order tends to be
dependable /enough/ for at least those without enterprise-level numbers
of devices to enumerate.


Just boot with USB stick/eSATA drive plugged in, there are good chances
it changes device order.
It really depends on your particular hardware.  If your USB controllers 
are at lower PCI addresses than your primary disk controllers, then yes, 
this will cause an issue.  Same for whatever SATA controller your eSATA 
port is on (or stupid systems where the eSATA port is port 0 on the main 
SATA controller).


Notably, most Intel systems I've seen have the SATA controllers in the 
chipset enumerate after the USB controllers, and the whole chipset 
enumerates after add-in cards (so they almost always have this issue), 
while most AMD systems I've seen demonstrate the exact opposite 
behavior, they enumerate the SATA controller from the chipset before the 
USB controllers, and then enumerate the chipset before all the add-in 
cards (so they almost never have this issue).


That said, one of the constraints for them remaining consistent is that 
you don't change hardware.



  True, it /does/ change from time to time with a
new kernel, but anybody sane keeps a tested-dependable old kernel around
to boot to until they know the new one works as expected, and that sort
of change is seldom enough that users can boot to the old kernel and
adjust their settings for the new one as necessary when it does happen.
So as "don't do it that way because it's not reliable" as it might indeed
be in theory, in practice, just using an ordered /dev/* in kernel
commandlines does tend to "just work"... provided one is ready for the
occasion when that device parameter might need a bit of adjustment, of
course.


...


---
[1] LVM is userspace code on top of the kernelspace devicemapper, and
therefore requires an initr* if root is on lvm, regardless.  So btrfs
actually does a bit better here, only requiring it for multi-device btrfs.





Re: unsolvable technical issues?

2018-07-02 Thread Austin S. Hemmelgarn

On 2018-06-29 23:22, Duncan wrote:

Austin S. Hemmelgarn posted on Mon, 25 Jun 2018 07:26:41 -0400 as
excerpted:


On 2018-06-24 16:22, Goffredo Baroncelli wrote:

On 06/23/2018 07:11 AM, Duncan wrote:

waxhead posted on Fri, 22 Jun 2018 01:13:31 +0200 as excerpted:


According to this:

https://stratis-storage.github.io/StratisSoftwareDesign.pdf Page 4 ,
section 1.2

It claims that BTRFS still have significant technical issues that may
never be resolved.


I can speculate a bit.

1) When I see btrfs "technical issue that may never be resolved", the
#1 first thing I think of, that AFAIK there are _definitely_ no plans
to resolve, because it's very deeply woven into the btrfs core by now,
is...

[1)] Filesystem UUID Identification.  Btrfs takes the UU bit of
Universally Unique quite literally, assuming they really *are*
unique, at least on that system[.]  Because
btrfs uses this supposedly unique ID to ID devices that belong to the
filesystem, it can get *very* mixed up, with results possibly
including dataloss, if it sees devices that don't actually belong to a
filesystem with the same UUID as a mounted filesystem.


As partial workaround you can disable udev btrfs rules and then do a
"btrfs dev scan" manually only for the device which you need.



You don't even need `btrfs dev scan` if you just specify the exact set
of devices in the mount options.  The `device=` mount option tells the
kernel to check that device during the mount process.


Not that lvm does any better in this regard[1], but has btrfs ever solved
the bug where only one device= in the kernel commandline's rootflags=
would take effect, effectively forcing initr* on people (like me) who
would otherwise not need them and prefer to do without them, if they're
using a multi-device btrfs as root?

I haven't tested this recently myself, so I don't know.


Not to mention the fact that as kernel people will tell you, device
enumeration isn't guaranteed to be in the same order every boot, so
device=/dev/* can't be relied upon and shouldn't be used -- but of course
device=LABEL= and device=UUID= and similar won't work without userspace,
basically udev (if they work at all, IDK if they actually do).
They aren't guaranteed to be stable, but they functionally are provided 
you don't modify hardware in any way and your disks can't be enumerated 
asynchronously without some form of ordered identification (IOW, you're 
using just one SATA or SCSI controller for all your disks).


That said, the required component for the LABEL= and UUID= syntax is not 
udev, it's blkid.  blkid can use udev to avoid having to read 
everything, but it's not mandatory.


Tho in practice from what I've seen, device enumeration order tends to be
dependable /enough/ for at least those without enterprise-level numbers
of devices to enumerate.  True, it /does/ change from time to time with a
new kernel, but anybody sane keeps a tested-dependable old kernel around
to boot to until they know the new one works as expected, and that sort
of change is seldom enough that users can boot to the old kernel and
adjust their settings for the new one as necessary when it does happen.
So as "don't do it that way because it's not reliable" as it might indeed
be in theory, in practice, just using an ordered /dev/* in kernel
commandlines does tend to "just work"... provided one is ready for the
occasion when that device parameter might need a bit of adjustment, of
course.


Also, while LVM does have 'issues' with cloned PV's, it fails safe (by
refusing to work on VG's that have duplicate PV's), while BTRFS fails
very unsafely (by randomly corrupting data).


And IMO that "failing unsafe" is both serious and common enough that it
easily justifies adding the point to a list of this sort, thus my putting
it #1.
Agreed.  My point wasn't that BTRFS is doing things correctly, just that 
LVM is not a saint in this respect either (it's just more saintly than 
we are).



2) Subvolume and (more technically) reflink-aware defrag.

It was there for a couple kernel versions some time ago, but
"impossibly" slow, so it was disabled until such time as btrfs could
be made to scale rather better in this regard.



I still contend that the biggest issue WRT reflink-aware defrag was that
it was not optional.  The only way to get the old defrag behavior was to
boot a kernel that didn't have reflink-aware defrag support.  IOW,
_everyone_ had to deal with the performance issues, not just the people
who wanted to use reflink-aware defrag.


Absolutely.

Which of course suggests making it optional, with a suitable warning as
to the speed implications with lots of snapshots/reflinks, when it does
get enabled again (and as David mentions elsewhere, there's apparently
some work going into the idea once again, which potentially moves it from
the 3-5 year range, at best, back to a 1/2-2-year range, time will tell).


3) N-way-mirroring.


[...]
This is not an issue, but a not implemented feature

If you're looking at 

Re: [PATCH RFC 0/3] btrfs-progs: lowmem: delay before lowmem repair

2018-07-02 Thread David Disseldorp
On Mon, 2 Jul 2018 12:43:20 +0300, Nikolay Borisov wrote:

> On  2.07.2018 12:28, Su Yue wrote:
> > Since lowmem repair is dangerous, it should remind the user more obviously.
> > The patchset adds a 10 second delay like btrfs balance and adds an option
> > '--force-repair-lowmem' to skip the delay.  
> 
> IMO this is the wrong way to approach a dangerous option. If it's so
> dangerous it needs to be written in the documentation explicitly this is
> so. If someone wants to use lowmem then they should explicitly set
> --mode lowmem. So I'm inclined to NACK this patch.

AFAICT it's already documented as "experimental" in the manpage, but the
usage flag appears to have been dropped as part of refactoring for
87c1bd13c1fca430c3dbf0da62e9aa33bde609c8 . If nobody's working on a fix,
and lowmem removal isn't an option, then please consider adding the usage
flag back, e.g.
--- a/check/main.c
+++ b/check/main.c
@@ -9386,7 +9386,7 @@ const char * const cmd_check_usage[] = {
"original - read inodes and extents to 
memory (requires",
"   more memory, does less IO)",
"lowmem   - try to use less memory but read 
blocks again",
-   "   when needed",
+   "   when needed (experimental)",
"--check-data-csum   verify checksums of data blocks",
"-Q|--qgroup-report  print a report on qgroup consistency",
"-E|--subvol-extents ",

Cheers, David


Re: [PATCH RFC 0/3] btrfs-progs: lowmem: delay before lowmem repair

2018-07-02 Thread Nikolay Borisov



On  2.07.2018 14:07, David Disseldorp wrote:
> On Mon, 2 Jul 2018 12:43:20 +0300, Nikolay Borisov wrote:
> 
>> On  2.07.2018 12:28, Su Yue wrote:
>>> Since lowmem repair is dangerous, it should remind the user more obviously.
>>> The patchset adds a 10 second delay like btrfs balance and adds an option
>>> '--force-repair-lowmem' to skip the delay.  
>>
>> IMO this is the wrong way to approach a dangerous option. If it's so
>> dangerous it needs to be written in the documentation explicitly this is
>> so. If someone wants to use lowmem then they should explicitly set
>> --mode lowmem. So I'm inclined to NACK this patch.
> 
> AFAICT it's already documented as "experimental" in the manpage, but the
> usage flag appears to have been dropped as part of refactoring for
> 87c1bd13c1fca430c3dbf0da62e9aa33bde609c8 . If nobody's working on a fix,
> and lowmem removal isn't an option, then please consider adding the usage
> flag back, e.g.

lowmem seems to be here to stay and it seems to be more useful than
original mode. So your patch is acceptable.

> --- a/check/main.c
> +++ b/check/main.c
> @@ -9386,7 +9386,7 @@ const char * const cmd_check_usage[] = {
> "original - read inodes and extents to 
> memory (requires",
> "   more memory, does less IO)",
> "lowmem   - try to use less memory but 
> read blocks again",
> -   "   when needed",
> +   "   when needed (experimental)",
> "--check-data-csum   verify checksums of data blocks",
> "-Q|--qgroup-report  print a report on qgroup consistency",
> "-E|--subvol-extents ",
> 
> Cheers, David
> 


Re: [PATCH] btrfs: qgroups: Move transaction managed inside btrfs_quota_enable

2018-07-02 Thread Nikolay Borisov



On  2.07.2018 14:00, Nikolay Borisov wrote:
> Commit 5d23515be669 ("btrfs: Move qgroup rescan on quota enable to
> btrfs_quota_enable") not only resulted in an easier to follow code but
> it also introduced a subtle bug. It changed the timing when the initial
> transaction rescan was happening - before the commit it would happen
> after transaction commit had occurred but after the commit it might happen
> before the transaction was committed. This results in failure to
> correctly rescan the quota since there could be data which is still not
> committed on disk.
> 
> This patch aims to fix this by moving the transaction creation/commit
> inside btrfs_quota_enable, which allows scheduling the quota commit
> after the transaction has been committed.
> 
> Fixes: 5d23515be669 ("btrfs: Move qgroup rescan on quota enable to 
> btrfs_quota_enable")
> Link: https://marc.info/?l=linux-btrfs&m=152999289017582
> Reported-by: Misono Tomohiro 
> Reviewed-by: Misono Tomohiro 
> Signed-off-by: Nikolay Borisov 

David,

This is a fix for an issue which Misono is seeing. So far neither I nor
Qu were able to reproduce it. Imho it should be material for this RC.


[PATCH] btrfs: qgroups: Move transaction managed inside btrfs_quota_enable

2018-07-02 Thread Nikolay Borisov
Commit 5d23515be669 ("btrfs: Move qgroup rescan on quota enable to
btrfs_quota_enable") not only resulted in an easier to follow code but
it also introduced a subtle bug. It changed the timing when the initial
transaction rescan was happening - before the commit it would happen
after transaction commit had occurred but after the commit it might happen
before the transaction was committed. This results in failure to
correctly rescan the quota since there could be data which is still not
committed on disk.

This patch aims to fix this by moving the transaction creation/commit
inside btrfs_quota_enable, which allows scheduling the quota commit
after the transaction has been committed.

Fixes: 5d23515be669 ("btrfs: Move qgroup rescan on quota enable to 
btrfs_quota_enable")
Link: https://marc.info/?l=linux-btrfs&m=152999289017582
Reported-by: Misono Tomohiro 
Reviewed-by: Misono Tomohiro 
Signed-off-by: Nikolay Borisov 
---
 fs/btrfs/ioctl.c  | 15 ++-
 fs/btrfs/qgroup.c | 38 +++---
 fs/btrfs/qgroup.h |  6 ++
 3 files changed, 35 insertions(+), 24 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index a399750b9e41..316fb1af15e2 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -5135,9 +5135,7 @@ static long btrfs_ioctl_quota_ctl(struct file *file, void 
__user *arg)
struct inode *inode = file_inode(file);
struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
struct btrfs_ioctl_quota_ctl_args *sa;
-   struct btrfs_trans_handle *trans = NULL;
int ret;
-   int err;
 
if (!capable(CAP_SYS_ADMIN))
return -EPERM;
@@ -5153,28 +5151,19 @@ static long btrfs_ioctl_quota_ctl(struct file *file, 
void __user *arg)
}
 
	down_write(&fs_info->subvol_sem);
-   trans = btrfs_start_transaction(fs_info->tree_root, 2);
-   if (IS_ERR(trans)) {
-   ret = PTR_ERR(trans);
-   goto out;
-   }
 
switch (sa->cmd) {
case BTRFS_QUOTA_CTL_ENABLE:
-   ret = btrfs_quota_enable(trans, fs_info);
+   ret = btrfs_quota_enable(fs_info);
break;
case BTRFS_QUOTA_CTL_DISABLE:
-   ret = btrfs_quota_disable(trans, fs_info);
+   ret = btrfs_quota_disable(fs_info);
break;
default:
ret = -EINVAL;
break;
}
 
-   err = btrfs_commit_transaction(trans);
-   if (err && !ret)
-   ret = err;
-out:
kfree(sa);
	up_write(&fs_info->subvol_sem);
 drop_write:
diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index c25dc47210a3..1012c7138633 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -875,8 +875,7 @@ static int btrfs_clean_quota_tree(struct btrfs_trans_handle 
*trans,
return ret;
 }
 
-int btrfs_quota_enable(struct btrfs_trans_handle *trans,
-  struct btrfs_fs_info *fs_info)
+int btrfs_quota_enable(struct btrfs_fs_info *fs_info)
 {
struct btrfs_root *quota_root;
struct btrfs_root *tree_root = fs_info->tree_root;
@@ -886,6 +885,7 @@ int btrfs_quota_enable(struct btrfs_trans_handle *trans,
struct btrfs_key key;
struct btrfs_key found_key;
struct btrfs_qgroup *qgroup = NULL;
+   struct btrfs_trans_handle *trans = NULL;
int ret = 0;
int slot;
 
@@ -893,6 +893,12 @@ int btrfs_quota_enable(struct btrfs_trans_handle *trans,
if (fs_info->quota_root)
goto out;
 
+   trans = btrfs_start_transaction(tree_root, 2);
+   if (IS_ERR(trans)) {
+   ret = PTR_ERR(trans);
+   goto out;
+   }
+
fs_info->qgroup_ulist = ulist_alloc(GFP_KERNEL);
if (!fs_info->qgroup_ulist) {
ret = -ENOMEM;
@@ -987,6 +993,11 @@ int btrfs_quota_enable(struct btrfs_trans_handle *trans,
fs_info->quota_root = quota_root;
	set_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags);
	spin_unlock(&fs_info->qgroup_lock);
+
+   ret = btrfs_commit_transaction(trans);
+   if (ret)
+   goto out_free_path;
+
ret = qgroup_rescan_init(fs_info, 0, 1);
if (!ret) {
qgroup_rescan_zero_tracking(fs_info);
@@ -1011,15 +1022,22 @@ int btrfs_quota_enable(struct btrfs_trans_handle *trans,
return ret;
 }
 
-int btrfs_quota_disable(struct btrfs_trans_handle *trans,
-   struct btrfs_fs_info *fs_info)
+int btrfs_quota_disable(struct btrfs_fs_info *fs_info)
 {
struct btrfs_root *quota_root;
+   struct btrfs_trans_handle *trans = NULL;
int ret = 0;
 
	mutex_lock(&fs_info->qgroup_ioctl_lock);
if (!fs_info->quota_root)
goto out;
+
+   trans = btrfs_start_transaction(fs_info->tree_root, 2);
+   if (IS_ERR(trans)) {
+   ret = PTR_ERR(trans);
+   goto out;
+   }
+
	clear_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags);

Re: [PATCH 1/4] btrfs: Refactor loop in btrfs_release_extent_buffer_page

2018-07-02 Thread Nikolay Borisov



On 29.06.2018 15:35, David Sterba wrote:
> On Wed, Jun 27, 2018 at 04:38:22PM +0300, Nikolay Borisov wrote:
>> The purpose of the function is to free all the pages comprising an
>> extent buffer. This can be achieved with a simple for loop rather than
>> the slightly more involved 'do {} while' construct. So rewrite the
>> loop using a 'for' construct. Additionally we can never have an
>> extent_buffer with 0 pages so remove the check for index == 0. No
>> functional changes.
> 
> The reversed loop makes sense, the first page is special and used for
> locking the whole extent buffer's pages, as can be seen eg.
> 897ca6e9b4fef86d5dfb6b31 or 4f2de97acee6532b36dd6e99 . Also see current
> alloc_extent_buffer for the locking and managing the private bit.

Turns out page[0] is not special and it's not used for any special
locking whatsoever. In csum_dirty_buffer this page is used so that when
we submit a multipage eb we only perform the csum calculation once, i.e.
when the first page (page[0]) is submitted.
btrfs_release_extent_buffer_page OTOH no longer takes an index but just
an eb and iterates over all of its pages.

The fact that page[0] is unlocked last in alloc_extent_buffer seems to be
an artefact of the code refactoring rather than deliberate behavior.
Furthermore I've been running tests with sequential unlocking of the
pages (and complete removal of the SetPageChecked/ClearPageChecked calls
from alloc_extent_buffer) and didn't observe any problems whatsoever on
both 4G and 1G configurations (the latter is more likely to trigger
premature page eviction and hence any lurking races between
alloc_extent_buffer and btree_releasepage). I will be sending a patch
after I do a bit more testing but so far the results are good.
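
For reference, the shape I'm testing is roughly this (just a sketch, with
num_extent_pages() cached in a local as you suggest below; the per-page
teardown is unchanged and hidden behind a made-up placeholder helper here):

	static void btrfs_release_extent_buffer_page(struct extent_buffer *eb)
	{
		int i;
		int num_pages = num_extent_pages(eb->start, eb->len);
		int mapped = !test_bit(EXTENT_BUFFER_DUMMY, &eb->bflags);

		BUG_ON(extent_buffer_under_io(eb));

		for (i = 0; i < num_pages; i++) {
			struct page *page = eb->pages[i];

			if (!page)
				continue;

			/* placeholder for the existing per-page teardown */
			release_one_eb_page(eb, page, mapped);
		}
	}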

> 
> So you can still rewrite it as a for loop but would have to preserve the
> logic or provide the reason that it's correct to iterate from 0 to
> num_pages. There are some subltle races regarding pages and extents and
> their presence in various trees, so I'd rather be careful here.
> 
>> Signed-off-by: Nikolay Borisov 
>> ---
>>  fs/btrfs/extent_io.c | 13 -
>>  1 file changed, 4 insertions(+), 9 deletions(-)
>>
>> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
>> index cce6087d6880..4180a3b7e725 100644
>> --- a/fs/btrfs/extent_io.c
>> +++ b/fs/btrfs/extent_io.c
>> @@ -4641,19 +4641,14 @@ int extent_buffer_under_io(struct extent_buffer *eb)
>>   */
>>  static void btrfs_release_extent_buffer_page(struct extent_buffer *eb)
>>  {
>> -unsigned long index;
>> -struct page *page;
>> +int i;
>>  int mapped = !test_bit(EXTENT_BUFFER_DUMMY, >bflags);
>>  
>>  BUG_ON(extent_buffer_under_io(eb));
>>  
>> -index = num_extent_pages(eb->start, eb->len);
>> -if (index == 0)
>> -return;
> 
> This check does seem to be redundant.
> 
>> +for (i = 0; i < num_extent_pages(eb->start, eb->len); i++) {
> 
> I think that the num_extent_pages(...) would be better put to a
> temporary variable so it's not evaluated each time the loop termination
> condition is checked.
> 
>> +struct page *page = eb->pages[i];
>>  
>> -do {
>> -index--;
>> -page = eb->pages[index];
>>  if (!page)
>>  continue;
>>  if (mapped)
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC 0/3] btrfs-progs: lowmem: delay before lowmem repair

2018-07-02 Thread Nikolay Borisov



On  2.07.2018 12:28, Su Yue wrote:
> Since lowmem repair is dangerous, it should remind the user more obviously.
> The patchset adds a 10 second delay, like btrfs balance, and adds an option
> '--force-repair-lowmem' to skip the delay.

IMO this is the wrong way to approach a dangerous option. If it's so
dangerous then the documentation needs to state that explicitly. If
someone wants to use lowmem then they should explicitly set
--mode lowmem. So I'm inclined to NACK this patch.

> 
> ---
> I don't know whether it's a good idea to add the delay, or whether adding
> the option only for lowmem repair is acceptable, so I'm making it an RFC.
> 
> Su Yue (3):
>   btrfs-progs: lowmem: delay before lowmem repair starts
>   btrfs-progs: lowmem: force to start without delay with option
> '--force-repair-lowmem'
>   btrfs-progs: tests: append '--force-repair-lowmem' if lowmem repair is
> enabled
> 
>  check/main.c   | 43 ---
>  tests/common.local |  1 +
>  2 files changed, 37 insertions(+), 7 deletions(-)
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH RFC 0/3] btrfs-progs: lowmem: delay before lowmem repair

2018-07-02 Thread Su Yue
Since lowmem repair is dangerous, it should remind the user more obviously.
The patchset adds a 10 second delay, like btrfs balance, and adds an option
'--force-repair-lowmem' to skip the delay.

---
I don't know whether it's a good idea to add the delay, or whether adding
the option only for lowmem repair is acceptable, so I'm making it an RFC.

Su Yue (3):
  btrfs-progs: lowmem: delay before lowmem repair starts
  btrfs-progs: lowmem: force to start without delay with option
'--force-repair-lowmem'
  btrfs-progs: tests: append '--force-repair-lowmem' if lowmem repair is
enabled

 check/main.c   | 43 ---
 tests/common.local |  1 +
 2 files changed, 37 insertions(+), 7 deletions(-)

-- 
2.17.1



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH RFC 1/3] btrfs-progs: lowmem: delay before lowmem repair starts

2018-07-02 Thread Su Yue
Since lowmem mode repair is so dangerous, delay 10 seconds before
starting.

Signed-off-by: Su Yue 
---
 check/main.c | 28 ++--
 1 file changed, 22 insertions(+), 6 deletions(-)

diff --git a/check/main.c b/check/main.c
index 3190b5d4f293..b9997460162f 100644
--- a/check/main.c
+++ b/check/main.c
@@ -9553,12 +9553,6 @@ int cmd_check(int argc, char **argv)
exit(1);
}
 
-   /*
-* experimental and dangerous
-*/
-   if (repair && check_mode == CHECK_MODE_LOWMEM)
-   warning("low-memory mode repair support is only partial");
-
radix_tree_init();
	cache_tree_init(&root_cache);
 
@@ -9600,6 +9594,28 @@ int cmd_check(int argc, char **argv)
if (repair)
ctree_flags |= OPEN_CTREE_PARTIAL;
 
+   /*
+* experimental and dangerous
+*/
+   if (repair && check_mode == CHECK_MODE_LOWMEM) {
+   int delay = 10;
+
+   printf("WARNING:\n\n");
+   printf("\tLow-memory mode repair support is only partial.\n");
+   printf("\tIt's experimental and very dangerous.\n");
+   printf("\tIt may run slow or crash unexpectedly.\n");
+   printf("\tPlease backup device before running low-memory mode 
repair.\n");
+   printf("\tThe operation will start in %d seconds.\n", delay);
+   printf("\tUse Ctrl-C to stop it.\n");
+
+   while (delay) {
+   printf("%2d", delay--);
+   fflush(stdout);
+   sleep(1);
+   }
+   printf("\nStarting to check and repair filesystem.\n");
+   }
+
info = open_ctree_fs_info(argv[optind], bytenr, tree_root_bytenr,
  chunk_root_bytenr, ctree_flags);
if (!info) {
-- 
2.17.1



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH RFC 3/3] btrfs-progs: tests: append '--force-repair-lowmem' if lowmem repair is enabled

2018-07-02 Thread Su Yue
Since commit 62785de7c107 ("btrfs-progs: lowmem: force to start
without delay with option '--force-repair-lowmem'"), repair in
lowmem mode is delayed.
For tests which repair in lowmem mode, append '--force-repair-lowmem'
so they start without delay.

Signed-off-by: Su Yue 
---
 tests/common.local | 1 +
 1 file changed, 1 insertion(+)

diff --git a/tests/common.local b/tests/common.local
index f5e96f5b73ac..4facc5482ef7 100644
--- a/tests/common.local
+++ b/tests/common.local
@@ -29,6 +29,7 @@ _skip_spec()
   echo "$@" | grep -q -- '--repair'; then
dir="$(dirname ${@: -1})"
if [ -f ${dir}/${beacon} ]; then
+   TEST_ARGS_CHECK+=" --force-repair-lowmem"
return 1;
fi
return 0;
-- 
2.17.1



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH RFC 2/3] btrfs-progs: lowmem: force to start without delay with option '--force-repair-lowmem'

2018-07-02 Thread Su Yue
Add an option '--force-repair-lowmem' to start the check without any delay.

Signed-off-by: Su Yue 
---
 check/main.c | 17 +++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/check/main.c b/check/main.c
index b9997460162f..3c9dc242f3db 100644
--- a/check/main.c
+++ b/check/main.c
@@ -9377,6 +9377,7 @@ const char * const cmd_check_usage[] = {
"-s|--super  use this superblock copy",
"-b|--backup use the first valid backup root copy",
"--force skip mount checks, repair is not possible",
+   "--force-repair-lowmem   start to check and repair in lowmem mode 
without delay",
"--repairtry to repair the filesystem",
"--readonly  run in read-only mode (default)",
"--init-csum-treecreate a new CRC tree",
@@ -9419,6 +9420,7 @@ int cmd_check(int argc, char **argv)
int qgroup_report_ret;
unsigned ctree_flags = OPEN_CTREE_EXCLUSIVE;
int force = 0;
+   int force_repair_lowmem = 0;
 
while(1) {
int c;
@@ -9426,7 +9428,7 @@ int cmd_check(int argc, char **argv)
GETOPT_VAL_INIT_EXTENT, GETOPT_VAL_CHECK_CSUM,
GETOPT_VAL_READONLY, GETOPT_VAL_CHUNK_TREE,
GETOPT_VAL_MODE, GETOPT_VAL_CLEAR_SPACE_CACHE,
-   GETOPT_VAL_FORCE };
+   GETOPT_VAL_FORCE, GETOPT_VAL_FORCE_REPAIR_LOWMEM };
static const struct option long_options[] = {
{ "super", required_argument, NULL, 's' },
{ "repair", no_argument, NULL, GETOPT_VAL_REPAIR },
@@ -9449,6 +9451,8 @@ int cmd_check(int argc, char **argv)
{ "clear-space-cache", required_argument, NULL,
GETOPT_VAL_CLEAR_SPACE_CACHE},
{ "force", no_argument, NULL, GETOPT_VAL_FORCE },
+   {"force-repair-lowmem", no_argument, NULL,
+   GETOPT_VAL_FORCE_REPAIR_LOWMEM},
{ NULL, 0, NULL, 0}
};
 
@@ -9536,6 +9540,9 @@ int cmd_check(int argc, char **argv)
case GETOPT_VAL_FORCE:
force = 1;
break;
+   case GETOPT_VAL_FORCE_REPAIR_LOWMEM:
+   force_repair_lowmem = 1;
+   break;
}
}
 
@@ -9552,6 +9559,11 @@ int cmd_check(int argc, char **argv)
error("repair options are not compatible with --readonly");
exit(1);
}
+   if (force_repair_lowmem &&
+   (!repair || check_mode != CHECK_MODE_LOWMEM)) {
+   error("--force-repair-lowmem only works with --mode=lowmem and 
--repair");
+   exit(1);
+   }
 
radix_tree_init();
	cache_tree_init(&root_cache);
@@ -9597,7 +9609,7 @@ int cmd_check(int argc, char **argv)
/*
 * experimental and dangerous
 */
-   if (repair && check_mode == CHECK_MODE_LOWMEM) {
+   if (repair && check_mode == CHECK_MODE_LOWMEM && !force_repair_lowmem) {
int delay = 10;
 
printf("WARNING:\n\n");
@@ -9605,6 +9617,7 @@ int cmd_check(int argc, char **argv)
printf("\tIt's experimental and very dangerous.\n");
printf("\tIt may run slow or crash unexpectedly.\n");
printf("\tPlease backup device before running low-memory mode 
repair.\n");
+	printf("\tUse the '--force-repair-lowmem' option to skip this warning.\n");
printf("\tThe operation will start in %d seconds.\n", delay);
printf("\tUse Ctrl-C to stop it.\n");
 
-- 
2.17.1



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] btrfs: tree-checker: Verify block_group_item

2018-07-02 Thread Nikolay Borisov



On  2.07.2018 11:47, Qu Wenruo wrote:
> 
> 
> On 2018年07月02日 16:37, Nikolay Borisov wrote:
>>
>>
>> On  2.07.2018 11:34, Qu Wenruo wrote:
>>> +   if (hweight64(flags & BTRFS_BLOCK_GROUP_PROFILE_MASK) > 1) {
>>> +   block_group_err(fs_info, leaf, slot,
>>> +"invalid profile flags, have 0x%llx (%lu bits set) expect no more than 1 
>>> bit set",
>>> +   flags & BTRFS_BLOCK_GROUP_PROFILE_MASK,
>>> +   hweight64(flags & BTRFS_BLOCK_GROUP_PROFILE_MASK));
>>
>> Shouldn't this also be a hweight64() != 1 check ?
> 
> What about SINGLE profile on disk?
> It doesn't have any profile bit set.

Agh, it's implied and doesn't even have a define. Then it's fine I guess,
huh, so many "special cases" ...
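
To spell out why "> 1" rather than "!= 1" is the right check for the
profile bits, here is a tiny stand-alone demo (using __builtin_popcountll
in place of the kernel's hweight64; the bit values and the trimmed mask
are illustrative, not copied from the real headers):

	#include <stdio.h>
	#include <stdint.h>

	#define BG_RAID0	(1ULL << 3)	/* illustrative values only */
	#define BG_RAID1	(1ULL << 4)
	#define BG_PROFILE_MASK	(BG_RAID0 | BG_RAID1)

	static int profile_bits(uint64_t flags)
	{
		return __builtin_popcountll(flags & BG_PROFILE_MASK);
	}

	int main(void)
	{
		uint64_t single = 0;			/* SINGLE: no profile bit at all */
		uint64_t raid1 = BG_RAID1;		/* exactly one profile bit */
		uint64_t bogus = BG_RAID0 | BG_RAID1;	/* corrupted: two bits set */

		printf("single: %s\n", profile_bits(single) > 1 ? "reject" : "accept");
		printf("raid1:  %s\n", profile_bits(raid1) > 1 ? "reject" : "accept");
		printf("bogus:  %s\n", profile_bits(bogus) > 1 ? "reject" : "accept");
		/* "!= 1" would wrongly reject the perfectly valid SINGLE case */
		return 0;
	}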

Reviewed-by: Nikolay Borisov 


> 
> Thanks,
> Qu
> 
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>> the body of a message to majord...@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] btrfs: tree-checker: Verify block_group_item

2018-07-02 Thread Qu Wenruo



On 2018年07月02日 16:37, Nikolay Borisov wrote:
> 
> 
> On  2.07.2018 11:34, Qu Wenruo wrote:
>> +if (hweight64(flags & BTRFS_BLOCK_GROUP_PROFILE_MASK) > 1) {
>> +block_group_err(fs_info, leaf, slot,
>> +"invalid profile flags, have 0x%llx (%lu bits set) expect no more than 1 
>> bit set",
>> +flags & BTRFS_BLOCK_GROUP_PROFILE_MASK,
>> +hweight64(flags & BTRFS_BLOCK_GROUP_PROFILE_MASK));
> 
> Shouldn't this also be a hweight64() != 1 check ?

What about SINGLE profile on disk?
It doesn't have any profile bit set.

Thanks,
Qu

> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] btrfs: tree-checker: Verify block_group_item

2018-07-02 Thread Nikolay Borisov



On  2.07.2018 11:34, Qu Wenruo wrote:
> + if (hweight64(flags & BTRFS_BLOCK_GROUP_PROFILE_MASK) > 1) {
> + block_group_err(fs_info, leaf, slot,
> +"invalid profile flags, have 0x%llx (%lu bits set) expect no more than 1 bit 
> set",
> + flags & BTRFS_BLOCK_GROUP_PROFILE_MASK,
> + hweight64(flags & BTRFS_BLOCK_GROUP_PROFILE_MASK));

Shouldn't this also be a hweight64() != 1 check ?
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs: Do extra chunk block group mapping check at mount

2018-07-02 Thread Nikolay Borisov



On  2.07.2018 11:29, Qu Wenruo wrote:
> As reported in https://bugzilla.kernel.org/show_bug.cgi?id=199837, a
> crafted btrfs image with an incorrect chunk<->block group mapping could
> lead to a lot of unexpected behavior.
> 
> Although the crafted image can be caught by the block group item checker
> added in "[PATCH] btrfs: tree-checker: Verify block_group_item", if one
> crafts a valid enough block group item which passes the above check but
> still mismatches the existing chunk, it could cause a lot of undefined
> behavior.
> 
> This patch will add extra chunk<->block group mapping check at mount
> time, so such problem can be detected early and cause no harm.
> 
> Reported-by: Xu Wen 
> Signed-off-by: Qu Wenruo 
> ---
>  fs/btrfs/extent-tree.c | 38 ++
>  1 file changed, 38 insertions(+)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 3d9fe58c0080..7a11dc76cd6d 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -10003,6 +10003,31 @@ btrfs_create_block_group_cache(struct btrfs_fs_info 
> *fs_info,
>   return cache;
>  }
>  
> +static int verify_chunk_block_group_mapping(struct btrfs_fs_info *fs_info,
> + u64 start, u64 len)
> +{
> + struct btrfs_mapping_tree *map_tree = _info->mapping_tree;
> + struct extent_map *em;
> + int ret;
> +
> + read_lock(_tree->map_tree.lock);
> + em = lookup_extent_mapping(_tree->map_tree, start, len);
> + read_unlock(_tree->map_tree.lock);
> +
> + if (!em) {
> + ret = -ENOENT;
> + goto out;
> + }
> + if (em->start != start || em->len != len) {
> + ret = -ENOENT;
> + goto out;
> + }
> + ret = 0;
> +out:
> + free_extent_map(em);
> + return ret;
> +}
> +
>  int btrfs_read_block_groups(struct btrfs_fs_info *info)
>  {
>   struct btrfs_path *path;
> @@ -10045,6 +10070,19 @@ int btrfs_read_block_groups(struct btrfs_fs_info 
> *info)
>   leaf = path->nodes[0];
>   btrfs_item_key_to_cpu(leaf, _key, path->slots[0]);
>  
> + /*
> +  * Since block group and chunks are 1:1 mapped and are core
> +  * infrastructure of btrfs, we need to do extra verification.
> +  */
nit: I'd rephrase that to just document the invariant that bg and chunks
must have 1:1 mapping.
> + ret = verify_chunk_block_group_mapping(info,
> + found_key.objectid, found_key.offset);
> + if (ret < 0) {
> + btrfs_err_rl(info,
> + "block group start=%llu len=%llu doesn't have matched chunk",
> +  found_key.objectid, found_key.offset);
> + goto error;
> + }
> +
>   cache = btrfs_create_block_group_cache(info, found_key.objectid,
>  found_key.offset);
>   if (!cache) {
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2] btrfs: tree-checker: Verify block_group_item

2018-07-02 Thread Qu Wenruo
As reported in https://bugzilla.kernel.org/show_bug.cgi?id=199849,
a crafted image with invalid block group items could make the free space
cache code panic.

We can detect such invalid block group items early by checking:
1) Size (key)
   We have an upper limit on block group size (10G)
2) Chunk objectid
3) Type
   Exactly 1 bit set for type and no more than 1 bit set for profile
4) Used space
   No more than block group size.

This should allow btrfs to detect and refuse to mount the crafted image.

Reported-by: Xu Wen 
Signed-off-by: Qu Wenruo 
---
changelog:
v2:
  Format change for if() code block, added one new line for each block.
  Use "hweight64() > 1" to detect profile error, replacing the bloated
  !(flags & MASK == 0 || hweight64() == 1) condition.
---
 fs/btrfs/tree-checker.c | 92 +
 1 file changed, 92 insertions(+)

diff --git a/fs/btrfs/tree-checker.c b/fs/btrfs/tree-checker.c
index 8d40e7dd8c30..d4f872682668 100644
--- a/fs/btrfs/tree-checker.c
+++ b/fs/btrfs/tree-checker.c
@@ -353,6 +353,95 @@ static int check_dir_item(struct btrfs_fs_info *fs_info,
return 0;
 }
 
+__printf(4, 5)
+__cold
+static void block_group_err(const struct btrfs_fs_info *fs_info,
+   const struct extent_buffer *eb, int slot,
+   const char *fmt, ...)
+{
+   struct btrfs_key key;
+   struct va_format vaf;
+   va_list args;
+
+	btrfs_item_key_to_cpu(eb, &key, slot);
+   va_start(args, fmt);
+
+   vaf.fmt = fmt;
+	vaf.va = &args;
+
+   btrfs_crit(fs_info,
+   "corrupt %s: root=%llu block=%llu slot=%d bg_start=%llu bg_len=%llu, 
%pV",
+   btrfs_header_level(eb) == 0 ? "leaf" : "node",
+   btrfs_header_owner(eb), btrfs_header_bytenr(eb), slot,
+	key.objectid, key.offset, &vaf);
+   va_end(args);
+}
+
+static int check_block_group_item(struct btrfs_fs_info *fs_info,
+ struct extent_buffer *leaf,
+ struct btrfs_key *key, int slot)
+{
+   struct btrfs_block_group_item bgi;
+   u32 item_size = btrfs_item_size_nr(leaf, slot);
+   u64 flags;
+
+	/*
+	 * Here we don't really care about unalignment since the extent
+	 * allocator can handle it.
+	 * We care more about the size, as if one block group is larger than
+	 * the maximum size, it must be some obvious corruption.
+	 */
+   if (key->offset > 10ULL * SZ_1G) {
+   block_group_err(fs_info, leaf, slot,
+   "invalid block group size, have %llu expect (0, %llu)",
+   key->offset, 10ULL * SZ_1G);
+   return -EUCLEAN;
+   }
+
+   if (item_size != sizeof(bgi)) {
+   block_group_err(fs_info, leaf, slot,
+   "invalid item size, have %u expect %lu",
+   item_size, sizeof(bgi));
+   return -EUCLEAN;
+   }
+
+	read_extent_buffer(leaf, &bgi, btrfs_item_ptr_offset(leaf, slot),
+			   sizeof(bgi));
+	if (btrfs_block_group_chunk_objectid(&bgi) !=
+   BTRFS_FIRST_CHUNK_TREE_OBJECTID) {
+   block_group_err(fs_info, leaf, slot,
+   "invalid block group chunk objectid, have %llu expect %llu",
+			btrfs_block_group_chunk_objectid(&bgi),
+   BTRFS_FIRST_CHUNK_TREE_OBJECTID);
+   return -EUCLEAN;
+   }
+
+	if (btrfs_block_group_used(&bgi) > key->offset) {
+   block_group_err(fs_info, leaf, slot,
+   "invalid block group used, have %llu expect [0, %llu)",
+			btrfs_block_group_used(&bgi), key->offset);
+   return -EUCLEAN;
+   }
+
+	flags = btrfs_block_group_flags(&bgi);
+   if (hweight64(flags & BTRFS_BLOCK_GROUP_PROFILE_MASK) > 1) {
+   block_group_err(fs_info, leaf, slot,
+"invalid profile flags, have 0x%llx (%lu bits set) expect no more than 1 bit 
set",
+   flags & BTRFS_BLOCK_GROUP_PROFILE_MASK,
+   hweight64(flags & BTRFS_BLOCK_GROUP_PROFILE_MASK));
+   return -EUCLEAN;
+   }
+
+   if (hweight64(flags & BTRFS_BLOCK_GROUP_TYPE_MASK) != 1) {
+   block_group_err(fs_info, leaf, slot,
+"invalid type flags, have 0x%llx (%lu bits set) expect exactly 1 bit set",
+   flags & BTRFS_BLOCK_GROUP_TYPE_MASK,
+   hweight64(flags & BTRFS_BLOCK_GROUP_TYPE_MASK));
+   return -EUCLEAN;
+   }
+   return 0;
+}
+
 /*
  * Common point to switch the item-specific validation.
  */
@@ -374,6 +463,9 @@ static int check_leaf_item(struct btrfs_fs_info *fs_info,
case BTRFS_XATTR_ITEM_KEY:
ret = check_dir_item(fs_info, leaf, key, slot);
break;
+   case BTRFS_BLOCK_GROUP_ITEM_KEY:
+   ret = 

[PATCH] btrfs: Do extra chunk block group mapping check at mount

2018-07-02 Thread Qu Wenruo
As reported in https://bugzilla.kernel.org/show_bug.cgi?id=199837, a
crafted btrfs image with an incorrect chunk<->block group mapping could
lead to a lot of unexpected behavior.

Although the crafted image can be caught by the block group item checker
added in "[PATCH] btrfs: tree-checker: Verify block_group_item", if one
crafts a valid enough block group item which passes the above check but
still mismatches the existing chunk, it could cause a lot of undefined
behavior.

This patch will add extra chunk<->block group mapping check at mount
time, so such problem can be detected early and cause no harm.

Reported-by: Xu Wen 
Signed-off-by: Qu Wenruo 
---
 fs/btrfs/extent-tree.c | 38 ++
 1 file changed, 38 insertions(+)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 3d9fe58c0080..7a11dc76cd6d 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -10003,6 +10003,31 @@ btrfs_create_block_group_cache(struct btrfs_fs_info 
*fs_info,
return cache;
 }
 
+static int verify_chunk_block_group_mapping(struct btrfs_fs_info *fs_info,
+   u64 start, u64 len)
+{
+	struct btrfs_mapping_tree *map_tree = &fs_info->mapping_tree;
+   struct extent_map *em;
+   int ret;
+
+	read_lock(&map_tree->map_tree.lock);
+	em = lookup_extent_mapping(&map_tree->map_tree, start, len);
+	read_unlock(&map_tree->map_tree.lock);
+
+   if (!em) {
+   ret = -ENOENT;
+   goto out;
+   }
+   if (em->start != start || em->len != len) {
+   ret = -ENOENT;
+   goto out;
+   }
+   ret = 0;
+out:
+   free_extent_map(em);
+   return ret;
+}
+
 int btrfs_read_block_groups(struct btrfs_fs_info *info)
 {
struct btrfs_path *path;
@@ -10045,6 +10070,19 @@ int btrfs_read_block_groups(struct btrfs_fs_info *info)
leaf = path->nodes[0];
	btrfs_item_key_to_cpu(leaf, &found_key, path->slots[0]);
 
+   /*
+* Since block group and chunks are 1:1 mapped and are core
+* infrastructure of btrfs, we need to do extra verification.
+*/
+   ret = verify_chunk_block_group_mapping(info,
+   found_key.objectid, found_key.offset);
+   if (ret < 0) {
+   btrfs_err_rl(info,
+   "block group start=%llu len=%llu doesn't have matched chunk",
+found_key.objectid, found_key.offset);
+   goto error;
+   }
+
cache = btrfs_create_block_group_cache(info, found_key.objectid,
   found_key.offset);
if (!cache) {
-- 
2.18.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Enabling quota may not correctly rescan on 4.17

2018-07-02 Thread Misono Tomohiro
> Misono,
> 
> Can you please try the attached patch?
> 

I tried and it works (on 4.18.0-rc3).

Committing the transaction before starting the rescan worker is
what btrfs_qgroup_rescan() does, so it looks fine.

(though I'm not sure why you don't see the problem on your machine.)

Reviewed-by: Misono Tomohiro 
Thanks.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs: relocation: Only remove reloc rb_trees if reloc control has been initialized

2018-07-02 Thread Qu Wenruo



On 2018年07月02日 16:01, Nikolay Borisov wrote:
> 
> 
> On  2.07.2018 10:53, Qu Wenruo wrote:
>>
>>
>> On 2018年07月02日 15:31, Nikolay Borisov wrote:
>>>
>>>
>>> On  2.07.2018 09:25, Qu Wenruo wrote:
 Reported in https://bugzilla.kernel.org/show_bug.cgi?id=199833, where an
 invalid tree reloc tree can cause kernel NULL pointer dereference when
 btrfs does some cleanup for reloc roots.

 It turns out that fs_info->reloc_ctl can be NULL in
 btrfs_recover_relocation() as we allocate relocation control after all
 reloc roots are verified.
 So when we hit out: tag, we haven't call set_reloc_control() thus
 fs_info->reloc_ctl is still NULL.

 Reported-by: Xu Wen 
 Signed-off-by: Qu Wenruo 
 ---
  fs/btrfs/relocation.c | 23 ---
  1 file changed, 12 insertions(+), 11 deletions(-)

 diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
 index 879b76fa881a..be94c65bb4d2 100644
 --- a/fs/btrfs/relocation.c
 +++ b/fs/btrfs/relocation.c
 @@ -1321,18 +1321,19 @@ static void __del_reloc_root(struct btrfs_root 
 *root)
struct mapping_node *node = NULL;
struct reloc_control *rc = fs_info->reloc_ctl;
  > -   spin_lock(>reloc_root_tree.lock);
 -  rb_node = tree_search(>reloc_root_tree.rb_root,
 -root->node->start);
 -  if (rb_node) {
 -  node = rb_entry(rb_node, struct mapping_node, rb_node);
 -  rb_erase(>rb_node, >reloc_root_tree.rb_root);
>>>
>>> Just do  if (!rc)
>>> return;
>>>
>>> The function is simple enough, no need to indent multiple lines.
>>
>> You missed serval lines below, we still have:
>> ---
>> spin_lock(_info->trans_lock);
>> list_del_init(>root_list);
>> spin_unlock(_info->trans_lock);
>> kfree(node);
>> ---
>>
>> Which still needs to be called even rc is not initialized.
> 
> But then isn't the function buggy even with your patch because if node
> is not initialised then we exit at if (!node) return.

That means node->data isn't initialized, nor is the root's root_list.

When @rc is not initialized, the patch only needs to call list_del_init()
(and the rest of the unconditional cleanup).
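
In other words, the intended structure is roughly the following (a sketch
only, not the literal patch; find_and_erase_mapping_node() is a made-up
placeholder for the rb tree lookup/erase done inline in the real code):

	static void __del_reloc_root(struct btrfs_root *root)
	{
		struct btrfs_fs_info *fs_info = root->fs_info;
		struct reloc_control *rc = fs_info->reloc_ctl;
		struct mapping_node *node = NULL;

		if (rc) {
			/* rb tree bookkeeping only exists once reloc_ctl is set up */
			spin_lock(&rc->reloc_root_tree.lock);
			node = find_and_erase_mapping_node(rc, root); /* placeholder */
			spin_unlock(&rc->reloc_root_tree.lock);
			if (!node)
				return;
		}

		/* this part must still run even when rc == NULL */
		spin_lock(&fs_info->trans_lock);
		list_del_init(&root->root_list);
		spin_unlock(&fs_info->trans_lock);
		kfree(node);
	}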

Thanks,
Qu

>>
>> Thanks,
>> Qu
>>
 +  if (rc) {
 +  spin_lock(>reloc_root_tree.lock);
 +  rb_node = tree_search(>reloc_root_tree.rb_root,
 +root->node->start);
 +  if (rb_node) {
 +  node = rb_entry(rb_node, struct mapping_node, rb_node);
 +  rb_erase(>rb_node, >reloc_root_tree.rb_root);
 +  }
 +  spin_unlock(>reloc_root_tree.lock);
 +  if (!node)
 +  return;
 +  BUG_ON((struct btrfs_root *)node->data != root);
}
 -  spin_unlock(>reloc_root_tree.lock);
 -
 -  if (!node)
 -  return;
 -  BUG_ON((struct btrfs_root *)node->data != root);
  
spin_lock(_info->trans_lock);
list_del_init(>root_list);

>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>>> the body of a message to majord...@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
>>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs: relocation: Only remove reloc rb_trees if reloc control has been initialized

2018-07-02 Thread Nikolay Borisov



On  2.07.2018 11:01, Nikolay Borisov wrote:
> 
> 
> On  2.07.2018 10:53, Qu Wenruo wrote:
>>
>>
>> On 2018年07月02日 15:31, Nikolay Borisov wrote:
>>>
>>>
>>> On  2.07.2018 09:25, Qu Wenruo wrote:
 Reported in https://bugzilla.kernel.org/show_bug.cgi?id=199833, where an
 invalid tree reloc tree can cause kernel NULL pointer dereference when
 btrfs does some cleanup for reloc roots.

 It turns out that fs_info->reloc_ctl can be NULL in
 btrfs_recover_relocation() as we allocate relocation control after all
 reloc roots are verified.
 So when we hit out: tag, we haven't call set_reloc_control() thus
 fs_info->reloc_ctl is still NULL.

 Reported-by: Xu Wen 
 Signed-off-by: Qu Wenruo 
 ---
  fs/btrfs/relocation.c | 23 ---
  1 file changed, 12 insertions(+), 11 deletions(-)

 diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
 index 879b76fa881a..be94c65bb4d2 100644
 --- a/fs/btrfs/relocation.c
 +++ b/fs/btrfs/relocation.c
 @@ -1321,18 +1321,19 @@ static void __del_reloc_root(struct btrfs_root 
 *root)
struct mapping_node *node = NULL;
struct reloc_control *rc = fs_info->reloc_ctl;
  > -   spin_lock(>reloc_root_tree.lock);
 -  rb_node = tree_search(>reloc_root_tree.rb_root,
 -root->node->start);
 -  if (rb_node) {
 -  node = rb_entry(rb_node, struct mapping_node, rb_node);
 -  rb_erase(>rb_node, >reloc_root_tree.rb_root);
>>>
>>> Just do  if (!rc)
>>> return;
>>>
>>> The function is simple enough, no need to indent multiple lines.
>>
>> You missed serval lines below, we still have:
>> ---
>> spin_lock(_info->trans_lock);
>> list_del_init(>root_list);
>> spin_unlock(_info->trans_lock);
>> kfree(node);
>> ---
>>
>> Which still needs to be called even rc is not initialized.
> 
> But then isn't the function buggy even with your patch because if node
> is not initialised then we exit at if (!node) return.

Ah, so you've moved the if (!node) check inside the if (rc) branch. Then
it will work as you said. Fair enough.

>>


>> Thanks,
>> Qu
>>
 +  if (rc) {
 +  spin_lock(>reloc_root_tree.lock);
 +  rb_node = tree_search(>reloc_root_tree.rb_root,
 +root->node->start);
 +  if (rb_node) {
 +  node = rb_entry(rb_node, struct mapping_node, rb_node);
 +  rb_erase(>rb_node, >reloc_root_tree.rb_root);
 +  }
 +  spin_unlock(>reloc_root_tree.lock);
 +  if (!node)
 +  return;
 +  BUG_ON((struct btrfs_root *)node->data != root);
}
 -  spin_unlock(>reloc_root_tree.lock);
 -
 -  if (!node)
 -  return;
 -  BUG_ON((struct btrfs_root *)node->data != root);
  
spin_lock(_info->trans_lock);
list_del_init(>root_list);

>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>>> the body of a message to majord...@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
>>
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs: relocation: Only remove reloc rb_trees if reloc control has been initialized

2018-07-02 Thread Nikolay Borisov



On  2.07.2018 10:53, Qu Wenruo wrote:
> 
> 
> On 2018年07月02日 15:31, Nikolay Borisov wrote:
>>
>>
>> On  2.07.2018 09:25, Qu Wenruo wrote:
>>> Reported in https://bugzilla.kernel.org/show_bug.cgi?id=199833, where an
>>> invalid tree reloc tree can cause kernel NULL pointer dereference when
>>> btrfs does some cleanup for reloc roots.
>>>
>>> It turns out that fs_info->reloc_ctl can be NULL in
>>> btrfs_recover_relocation() as we allocate relocation control after all
>>> reloc roots are verified.
>>> So when we hit out: tag, we haven't call set_reloc_control() thus
>>> fs_info->reloc_ctl is still NULL.
>>>
>>> Reported-by: Xu Wen 
>>> Signed-off-by: Qu Wenruo 
>>> ---
>>>  fs/btrfs/relocation.c | 23 ---
>>>  1 file changed, 12 insertions(+), 11 deletions(-)
>>>
>>> diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
>>> index 879b76fa881a..be94c65bb4d2 100644
>>> --- a/fs/btrfs/relocation.c
>>> +++ b/fs/btrfs/relocation.c
>>> @@ -1321,18 +1321,19 @@ static void __del_reloc_root(struct btrfs_root 
>>> *root)
>>> struct mapping_node *node = NULL;
>>> struct reloc_control *rc = fs_info->reloc_ctl;
>>>  > -spin_lock(>reloc_root_tree.lock);
>>> -   rb_node = tree_search(>reloc_root_tree.rb_root,
>>> - root->node->start);
>>> -   if (rb_node) {
>>> -   node = rb_entry(rb_node, struct mapping_node, rb_node);
>>> -   rb_erase(>rb_node, >reloc_root_tree.rb_root);
>>
>> Just do  if (!rc)
>> return;
>>
>> The function is simple enough, no need to indent multiple lines.
> 
> You missed several lines below; we still have:
> ---
> 	spin_lock(&fs_info->trans_lock);
> 	list_del_init(&root->root_list);
> 	spin_unlock(&fs_info->trans_lock);
> 	kfree(node);
> ---
> 
> Which still needs to be called even if rc is not initialized.

But then isn't the function buggy even with your patch, because if node
is not initialised then we exit at the 'if (!node) return;'?
> 
> Thanks,
> Qu
> 
>>> +   if (rc) {
>>> +   spin_lock(>reloc_root_tree.lock);
>>> +   rb_node = tree_search(>reloc_root_tree.rb_root,
>>> + root->node->start);
>>> +   if (rb_node) {
>>> +   node = rb_entry(rb_node, struct mapping_node, rb_node);
>>> +   rb_erase(>rb_node, >reloc_root_tree.rb_root);
>>> +   }
>>> +   spin_unlock(>reloc_root_tree.lock);
>>> +   if (!node)
>>> +   return;
>>> +   BUG_ON((struct btrfs_root *)node->data != root);
>>> }
>>> -   spin_unlock(>reloc_root_tree.lock);
>>> -
>>> -   if (!node)
>>> -   return;
>>> -   BUG_ON((struct btrfs_root *)node->data != root);
>>>  
>>> spin_lock(_info->trans_lock);
>>> list_del_init(>root_list);
>>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>> the body of a message to majord...@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs: relocation: Only remove reloc rb_trees if reloc control has been initialized

2018-07-02 Thread Qu Wenruo



On 2018年07月02日 15:31, Nikolay Borisov wrote:
> 
> 
> On  2.07.2018 09:25, Qu Wenruo wrote:
>> Reported in https://bugzilla.kernel.org/show_bug.cgi?id=199833, where an
>> invalid tree reloc tree can cause kernel NULL pointer dereference when
>> btrfs does some cleanup for reloc roots.
>>
>> It turns out that fs_info->reloc_ctl can be NULL in
>> btrfs_recover_relocation() as we allocate relocation control after all
>> reloc roots are verified.
>> So when we hit out: tag, we haven't call set_reloc_control() thus
>> fs_info->reloc_ctl is still NULL.
>>
>> Reported-by: Xu Wen 
>> Signed-off-by: Qu Wenruo 
>> ---
>>  fs/btrfs/relocation.c | 23 ---
>>  1 file changed, 12 insertions(+), 11 deletions(-)
>>
>> diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
>> index 879b76fa881a..be94c65bb4d2 100644
>> --- a/fs/btrfs/relocation.c
>> +++ b/fs/btrfs/relocation.c
>> @@ -1321,18 +1321,19 @@ static void __del_reloc_root(struct btrfs_root *root)
>>  struct mapping_node *node = NULL;
>>  struct reloc_control *rc = fs_info->reloc_ctl;
>>  > - spin_lock(>reloc_root_tree.lock);
>> -rb_node = tree_search(>reloc_root_tree.rb_root,
>> -  root->node->start);
>> -if (rb_node) {
>> -node = rb_entry(rb_node, struct mapping_node, rb_node);
>> -rb_erase(>rb_node, >reloc_root_tree.rb_root);
> 
> Just do  if (!rc)
> return;
> 
> The function is simple enough, no need to indent multiple lines.

You missed several lines below; we still have:
---
	spin_lock(&fs_info->trans_lock);
	list_del_init(&root->root_list);
	spin_unlock(&fs_info->trans_lock);
	kfree(node);
---

Which still needs to be called even if rc is not initialized.

Thanks,
Qu

>> +if (rc) {
>> +spin_lock(>reloc_root_tree.lock);
>> +rb_node = tree_search(>reloc_root_tree.rb_root,
>> +  root->node->start);
>> +if (rb_node) {
>> +node = rb_entry(rb_node, struct mapping_node, rb_node);
>> +rb_erase(>rb_node, >reloc_root_tree.rb_root);
>> +}
>> +spin_unlock(>reloc_root_tree.lock);
>> +if (!node)
>> +return;
>> +BUG_ON((struct btrfs_root *)node->data != root);
>>  }
>> -spin_unlock(>reloc_root_tree.lock);
>> -
>> -if (!node)
>> -return;
>> -BUG_ON((struct btrfs_root *)node->data != root);
>>  
>>  spin_lock(_info->trans_lock);
>>  list_del_init(>root_list);
>>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs: use correct compare function of dirty_metadata_bytes

2018-07-02 Thread Nikolay Borisov



On  2.07.2018 10:44, Ethan Lien wrote:
> We use a customized, nodesize-based batch value to update dirty_metadata_bytes.
> We should also use the batch version of the compare function, or we will easily
> hit the fast path and get a wrong result from percpu_counter_compare().
> 
> Signed-off-by: Ethan Lien 

Reviewed-by: Nikolay Borisov 

Also this needs the following tag: 

Fixes: e2d845211eda ("Btrfs: use percpu counter for dirty metadata count")

And perhaps stable from 4.4 onwards. But David will take care
of that. So the patch is good. 

> ---
>  fs/btrfs/disk-io.c | 10 ++
>  1 file changed, 6 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 205092dc9390..dfed08e70ec1 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -961,8 +961,9 @@ static int btree_writepages(struct address_space *mapping,
>  
>   fs_info = BTRFS_I(mapping->host)->root->fs_info;
>   /* this is a bit racy, but that's ok */
> - ret = percpu_counter_compare(_info->dirty_metadata_bytes,
> -  BTRFS_DIRTY_METADATA_THRESH);
> + ret = __percpu_counter_compare(_info->dirty_metadata_bytes,
> +  BTRFS_DIRTY_METADATA_THRESH,
> +  fs_info->dirty_metadata_batch);
>   if (ret < 0)
>   return 0;
>   }
> @@ -4150,8 +4151,9 @@ static void __btrfs_btree_balance_dirty(struct 
> btrfs_fs_info *fs_info,
>   if (flush_delayed)
>   btrfs_balance_delayed_items(fs_info);
>  
> - ret = percpu_counter_compare(_info->dirty_metadata_bytes,
> -  BTRFS_DIRTY_METADATA_THRESH);
> + ret = __percpu_counter_compare(_info->dirty_metadata_bytes,
> +  BTRFS_DIRTY_METADATA_THRESH,
> +  fs_info->dirty_metadata_batch);
>   if (ret > 0) {
>   
> balance_dirty_pages_ratelimited(fs_info->btree_inode->i_mapping);
>   }
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] btrfs: use correct compare function of dirty_metadata_bytes

2018-07-02 Thread Ethan Lien
We use a customized, nodesize-based batch value to update dirty_metadata_bytes.
We should also use the batch version of the compare function, or we will easily
hit the fast path and get a wrong result from percpu_counter_compare().
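
For context, the reason the batch matters (this paraphrases the logic in
lib/percpu_counter.c from memory, so treat the details as approximate):

	/*
	 * Sketch of __percpu_counter_compare(): the approximate, lock-free
	 * value is only trusted when it is further away from rhs than the
	 * worst-case per-cpu drift, which is batch * num_online_cpus().
	 */
	static int compare_sketch(struct percpu_counter *fbc, s64 rhs, s32 batch)
	{
		s64 count = percpu_counter_read(fbc);	/* approximate */

		if (abs(count - rhs) > (s64)batch * num_online_cpus())
			return count > rhs ? 1 : -1;	/* fast path is safe */

		/* otherwise fall back to the exact, summed value */
		count = percpu_counter_sum(fbc);
		if (count > rhs)
			return 1;
		else if (count < rhs)
			return -1;
		return 0;
	}

Since btrfs updates dirty_metadata_bytes with the typically much larger
fs_info->dirty_metadata_batch, comparing with the default batch makes the
fast path trust the approximate value long before it is actually safe to
do so.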

Signed-off-by: Ethan Lien 
---
 fs/btrfs/disk-io.c | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 205092dc9390..dfed08e70ec1 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -961,8 +961,9 @@ static int btree_writepages(struct address_space *mapping,
 
fs_info = BTRFS_I(mapping->host)->root->fs_info;
/* this is a bit racy, but that's ok */
-		ret = percpu_counter_compare(&fs_info->dirty_metadata_bytes,
-					     BTRFS_DIRTY_METADATA_THRESH);
+		ret = __percpu_counter_compare(&fs_info->dirty_metadata_bytes,
+					       BTRFS_DIRTY_METADATA_THRESH,
+					       fs_info->dirty_metadata_batch);
if (ret < 0)
return 0;
}
@@ -4150,8 +4151,9 @@ static void __btrfs_btree_balance_dirty(struct 
btrfs_fs_info *fs_info,
if (flush_delayed)
btrfs_balance_delayed_items(fs_info);
 
-	ret = percpu_counter_compare(&fs_info->dirty_metadata_bytes,
-				     BTRFS_DIRTY_METADATA_THRESH);
+	ret = __percpu_counter_compare(&fs_info->dirty_metadata_bytes,
+				       BTRFS_DIRTY_METADATA_THRESH,
+				       fs_info->dirty_metadata_batch);
if (ret > 0) {

balance_dirty_pages_ratelimited(fs_info->btree_inode->i_mapping);
}
-- 
2.17.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs-progs: free-space-cache: Don't panic when free space cache is corrupted

2018-07-02 Thread Nikolay Borisov



On  1.07.2018 05:45, Qu Wenruo wrote:
> In btrfs_add_free_space(), if the free space to be added is already
> there, we trigger ASSERT() which is just another BUG_ON().
> 
> Let's remove such BUG_ON() altogether.
> 
> Reported-by: Lewis Diamond 
> Signed-off-by: Qu Wenruo 

Reviewed-by: Nikolay Borisov 

> ---
>  free-space-cache.c | 4 +---
>  1 file changed, 1 insertion(+), 3 deletions(-)
> 
> diff --git a/free-space-cache.c b/free-space-cache.c
> index 9b83a71ca59a..2ef2d307cc5d 100644
> --- a/free-space-cache.c
> +++ b/free-space-cache.c
> @@ -838,10 +838,8 @@ int btrfs_add_free_space(struct btrfs_free_space_ctl 
> *ctl, u64 offset,
>   try_merge_free_space(ctl, info);
>  
>   ret = link_free_space(ctl, info);
> - if (ret) {
> + if (ret)
>   printk(KERN_CRIT "btrfs: unable to add free space :%d\n", ret);
> - BUG_ON(ret == -EEXIST);
> - }
>  
>   return ret;
>  }
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs: tree-checker: Verify block_group_item

2018-07-02 Thread Qu Wenruo



On 2018年07月02日 15:28, Nikolay Borisov wrote:
> 
> 
> On  2.07.2018 08:53, Qu Wenruo wrote:
>> As reported in https://bugzilla.kernel.org/show_bug.cgi?id=199849,
>> a crafted image with invalid block group items could make the free space
>> cache code panic.
>>
>> We can detect such invalid block group items early by checking:
>> 1) Size (key)
>>We have an upper limit on block group size (10G)
>> 2) Chunk objectid
>> 3) Type
>>Exactly 1 bit set for type and no more than 1 bit set for profile
>> 4) Used space
>>No more than block group size.
>>
>> This should allow btrfs to detect and refuse to mount the crafted image.
>>
>> Reported-by: Xu Wen 
>> Signed-off-by: Qu Wenruo 
>> ---
>>  fs/btrfs/tree-checker.c | 88 +
>>  1 file changed, 88 insertions(+)
>>
>> diff --git a/fs/btrfs/tree-checker.c b/fs/btrfs/tree-checker.c
>> index 8d40e7dd8c30..a42187ba50d7 100644
>> --- a/fs/btrfs/tree-checker.c
>> +++ b/fs/btrfs/tree-checker.c
>> @@ -353,6 +353,91 @@ static int check_dir_item(struct btrfs_fs_info *fs_info,
>>  return 0;
>>  }
>>  
>> +__printf(4, 5)
>> +__cold
>> +static void block_group_err(const struct btrfs_fs_info *fs_info,
>> +const struct extent_buffer *eb, int slot,
>> +const char *fmt, ...)
>> +{
>> +struct btrfs_key key;
>> +struct va_format vaf;
>> +va_list args;
>> +
>> +btrfs_item_key_to_cpu(eb, , slot);
>> +va_start(args, fmt);
>> +
>> +vaf.fmt = fmt;
>> +vaf.va = 
>> +
>> +btrfs_crit(fs_info,
>> +"corrupt %s: root=%llu block=%llu slot=%d bg_start=%llu bg_len=%llu, 
>> %pV",
>> +btrfs_header_level(eb) == 0 ? "leaf" : "node",
>> +btrfs_header_owner(eb), btrfs_header_bytenr(eb), slot,
>> +key.objectid, key.offset, );
>> +va_end(args);
>> +}
>> +
>> +static int check_block_group_item(struct btrfs_fs_info *fs_info,
>> +  struct extent_buffer *leaf,
> Seems it's not mandatory that this extent buffer points to a leaf, it
> might very well point to an interim node (judging from the
> btrfs_header_level() check in block_group_err). I'd suggest you use the
> more neutral - eb .

Nope, it's guaranteed to be a leaf.

The only caller is check_leaf(), whose name speaks for itself.

>> +  struct btrfs_key *key, int slot)
>> +{
>> +struct btrfs_block_group_item bgi;
>> +u32 item_size = btrfs_item_size_nr(leaf, slot);
>> +u64 flags;
>> +
>> +/*
>> + * Here we don't really care about unalignment since extent allocator
>> + * can handle it.
>> + * We care more about the size, as if one block group is larger than
>> + * maximum size, it's must be some obvious corruption
>> + */
>> +if (key->offset > 10ULL * SZ_1G) {
>> +block_group_err(fs_info, leaf, slot,
>> +"invalid block group size, have %llu expect (0, %llu)",
>> +key->offset, 10ULL * SZ_1G);
>> +return -EUCLEAN;
>> +}
> 
> Put an empty line after each if to distinguish each part more easily.
> 
>> +if (item_size != sizeof(bgi)) {
>> +block_group_err(fs_info, leaf, slot,
>> +"invalid item size, have %u expect %lu",
>> +item_size, sizeof(bgi));
>> +return -EUCLEAN;
>> +}
>> +read_extent_buffer(leaf, , btrfs_item_ptr_offset(leaf, slot),
>> +   sizeof(bgi));
>> +if (btrfs_block_group_chunk_objectid() !=
>> +BTRFS_FIRST_CHUNK_TREE_OBJECTID) {
>> +block_group_err(fs_info, leaf, slot,
>> +"invalid block group chunk objectid, have %llu expect %llu",
>> +btrfs_block_group_chunk_objectid(),
>> +BTRFS_FIRST_CHUNK_TREE_OBJECTID);
>> +return -EUCLEAN;
>> +}
>> +if (btrfs_block_group_used() > key->offset) {
>> +block_group_err(fs_info, leaf, slot,
>> +"invalid block group used, have %llu expect [0, %llu)",
>> +btrfs_block_group_used(), key->offset);
>> +return -EUCLEAN;
>> +}
>> +flags = btrfs_block_group_flags();
>> +if (!((flags & BTRFS_BLOCK_GROUP_PROFILE_MASK) == 0 ||
>> +  hweight64(flags & BTRFS_BLOCK_GROUP_PROFILE_MASK) == 1)) {
> 
> Can you make this condition a bit more stupid like:
> 
> if ((flags & BTRFS_BLOCK_GROUP_PROFILE_MASK) == 0 ||
   !=
> hweight64(flags & BTRFS_BLOCK_GROUP_PROFILE_MASK) > 1)

In fact, "hweight64() > 1" is good enough, as all-zero and exactly one
bit set both pass the above check.

I'll use hweight64() only.

> 
> It's easy to miss the ! right before the two (( and it causes a mild
> head scratch :)

Yeah, hweight64() > 1 solves all.

Thanks,
Qu

> 
>> +block_group_err(fs_info, leaf, slot,
>> +"invalid profile flags, have 0x%llx (%lu bits 

Re: [PATCH] btrfs: relocation: Only remove reloc rb_trees if reloc control has been initialized

2018-07-02 Thread Nikolay Borisov



On  2.07.2018 09:25, Qu Wenruo wrote:
> As reported in https://bugzilla.kernel.org/show_bug.cgi?id=199833, an
> invalid tree reloc tree can cause a kernel NULL pointer dereference when
> btrfs does some cleanup for reloc roots.
> 
> It turns out that fs_info->reloc_ctl can be NULL in
> btrfs_recover_relocation() as we allocate the relocation control after all
> reloc roots are verified.
> So when we hit the out: tag, we haven't called set_reloc_control() and thus
> fs_info->reloc_ctl is still NULL.
> 
> Reported-by: Xu Wen 
> Signed-off-by: Qu Wenruo 
> ---
>  fs/btrfs/relocation.c | 23 ---
>  1 file changed, 12 insertions(+), 11 deletions(-)
> 
> diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
> index 879b76fa881a..be94c65bb4d2 100644
> --- a/fs/btrfs/relocation.c
> +++ b/fs/btrfs/relocation.c
> @@ -1321,18 +1321,19 @@ static void __del_reloc_root(struct btrfs_root *root)
>   struct mapping_node *node = NULL;
>   struct reloc_control *rc = fs_info->reloc_ctl;
>  > -  spin_lock(>reloc_root_tree.lock);
> - rb_node = tree_search(>reloc_root_tree.rb_root,
> -   root->node->start);
> - if (rb_node) {
> - node = rb_entry(rb_node, struct mapping_node, rb_node);
> - rb_erase(>rb_node, >reloc_root_tree.rb_root);

Just do  if (!rc)
return;

The function is simple enough, no need to indent multiple lines.
> + if (rc) {
> + spin_lock(>reloc_root_tree.lock);
> + rb_node = tree_search(>reloc_root_tree.rb_root,
> +   root->node->start);
> + if (rb_node) {
> + node = rb_entry(rb_node, struct mapping_node, rb_node);
> + rb_erase(>rb_node, >reloc_root_tree.rb_root);
> + }
> + spin_unlock(>reloc_root_tree.lock);
> + if (!node)
> + return;
> + BUG_ON((struct btrfs_root *)node->data != root);
>   }
> - spin_unlock(>reloc_root_tree.lock);
> -
> - if (!node)
> - return;
> - BUG_ON((struct btrfs_root *)node->data != root);
>  
>   spin_lock(_info->trans_lock);
>   list_del_init(>root_list);
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs: tree-checker: Verify block_group_item

2018-07-02 Thread Nikolay Borisov



On  2.07.2018 08:53, Qu Wenruo wrote:
> As reported in https://bugzilla.kernel.org/show_bug.cgi?id=199849,
> a crafted image with invalid block group items could make the free space
> cache code panic.
> 
> We can detect such invalid block group items early by checking:
> 1) Size (key)
>We have an upper limit on block group size (10G)
> 2) Chunk objectid
> 3) Type
>Exactly 1 bit set for type and no more than 1 bit set for profile
> 4) Used space
>No more than block group size.
> 
> This should allow btrfs to detect and refuse to mount the crafted image.
> 
> Reported-by: Xu Wen 
> Signed-off-by: Qu Wenruo 
> ---
>  fs/btrfs/tree-checker.c | 88 +
>  1 file changed, 88 insertions(+)
> 
> diff --git a/fs/btrfs/tree-checker.c b/fs/btrfs/tree-checker.c
> index 8d40e7dd8c30..a42187ba50d7 100644
> --- a/fs/btrfs/tree-checker.c
> +++ b/fs/btrfs/tree-checker.c
> @@ -353,6 +353,91 @@ static int check_dir_item(struct btrfs_fs_info *fs_info,
>   return 0;
>  }
>  
> +__printf(4, 5)
> +__cold
> +static void block_group_err(const struct btrfs_fs_info *fs_info,
> + const struct extent_buffer *eb, int slot,
> + const char *fmt, ...)
> +{
> + struct btrfs_key key;
> + struct va_format vaf;
> + va_list args;
> +
> + btrfs_item_key_to_cpu(eb, , slot);
> + va_start(args, fmt);
> +
> + vaf.fmt = fmt;
> + vaf.va = 
> +
> + btrfs_crit(fs_info,
> + "corrupt %s: root=%llu block=%llu slot=%d bg_start=%llu bg_len=%llu, 
> %pV",
> + btrfs_header_level(eb) == 0 ? "leaf" : "node",
> + btrfs_header_owner(eb), btrfs_header_bytenr(eb), slot,
> + key.objectid, key.offset, );
> + va_end(args);
> +}
> +
> +static int check_block_group_item(struct btrfs_fs_info *fs_info,
> +   struct extent_buffer *leaf,
Seems it's not mandatory that this extent buffer points to a leaf, it
might very well point to an interim node (judging from the
btrfs_header_level() check in block_group_err). I'd suggest you use the
more neutral - eb .
> +   struct btrfs_key *key, int slot)
> +{
> + struct btrfs_block_group_item bgi;
> + u32 item_size = btrfs_item_size_nr(leaf, slot);
> + u64 flags;
> +
> + /*
> +  * Here we don't really care about unalignment since extent allocator
> +  * can handle it.
> +  * We care more about the size, as if one block group is larger than
> +  * maximum size, it's must be some obvious corruption
> +  */
> + if (key->offset > 10ULL * SZ_1G) {
> + block_group_err(fs_info, leaf, slot,
> + "invalid block group size, have %llu expect (0, %llu)",
> + key->offset, 10ULL * SZ_1G);
> + return -EUCLEAN;
> + }

Put an empty line after each if to distinguish each part more easily.

> + if (item_size != sizeof(bgi)) {
> + block_group_err(fs_info, leaf, slot,
> + "invalid item size, have %u expect %lu",
> + item_size, sizeof(bgi));
> + return -EUCLEAN;
> + }
> + read_extent_buffer(leaf, , btrfs_item_ptr_offset(leaf, slot),
> +sizeof(bgi));
> + if (btrfs_block_group_chunk_objectid() !=
> + BTRFS_FIRST_CHUNK_TREE_OBJECTID) {
> + block_group_err(fs_info, leaf, slot,
> + "invalid block group chunk objectid, have %llu expect %llu",
> + btrfs_block_group_chunk_objectid(),
> + BTRFS_FIRST_CHUNK_TREE_OBJECTID);
> + return -EUCLEAN;
> + }
> + if (btrfs_block_group_used() > key->offset) {
> + block_group_err(fs_info, leaf, slot,
> + "invalid block group used, have %llu expect [0, %llu)",
> + btrfs_block_group_used(), key->offset);
> + return -EUCLEAN;
> + }
> + flags = btrfs_block_group_flags();
> + if (!((flags & BTRFS_BLOCK_GROUP_PROFILE_MASK) == 0 ||
> +   hweight64(flags & BTRFS_BLOCK_GROUP_PROFILE_MASK) == 1)) {

Can you make this condition a bit more stupid like:

if ((flags & BTRFS_BLOCK_GROUP_PROFILE_MASK) == 0 ||
hweight64(flags & BTRFS_BLOCK_GROUP_PROFILE_MASK) > 1)

It's easy to miss the ! right before the two (( and it causes a mild
head scratch :)

> + block_group_err(fs_info, leaf, slot,
> +"invalid profile flags, have 0x%llx (%lu bits set) expect no more than 1 bit 
> set",
> + flags & BTRFS_BLOCK_GROUP_PROFILE_MASK,
> + hweight64(flags & BTRFS_BLOCK_GROUP_PROFILE_MASK));
> + return -EUCLEAN;
> + }
> + if (hweight64(flags & BTRFS_BLOCK_GROUP_TYPE_MASK) != 1) {
> + block_group_err(fs_info, leaf, slot,
> +"invalid type flags, have 0x%llx (%lu bits set) expect exactly 1 bit set",
> + flags 

[PATCH] btrfs: relocation: Only remove reloc rb_trees if reloc control has been initialized

2018-07-02 Thread Qu Wenruo
As reported in https://bugzilla.kernel.org/show_bug.cgi?id=199833, an
invalid tree reloc tree can cause a kernel NULL pointer dereference when
btrfs does some cleanup for reloc roots.

It turns out that fs_info->reloc_ctl can be NULL in
btrfs_recover_relocation() as we allocate the relocation control after all
reloc roots are verified.
So when we hit the out: tag, we haven't called set_reloc_control() and thus
fs_info->reloc_ctl is still NULL.

Reported-by: Xu Wen 
Signed-off-by: Qu Wenruo 
---
 fs/btrfs/relocation.c | 23 ---
 1 file changed, 12 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 879b76fa881a..be94c65bb4d2 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -1321,18 +1321,19 @@ static void __del_reloc_root(struct btrfs_root *root)
struct mapping_node *node = NULL;
struct reloc_control *rc = fs_info->reloc_ctl;
 
-   spin_lock(&rc->reloc_root_tree.lock);
-   rb_node = tree_search(&rc->reloc_root_tree.rb_root,
- root->node->start);
-   if (rb_node) {
-   node = rb_entry(rb_node, struct mapping_node, rb_node);
-   rb_erase(&node->rb_node, &rc->reloc_root_tree.rb_root);
+   if (rc) {
+   spin_lock(&rc->reloc_root_tree.lock);
+   rb_node = tree_search(&rc->reloc_root_tree.rb_root,
+ root->node->start);
+   if (rb_node) {
+   node = rb_entry(rb_node, struct mapping_node, rb_node);
+   rb_erase(&node->rb_node, &rc->reloc_root_tree.rb_root);
+   }
+   spin_unlock(&rc->reloc_root_tree.lock);
+   if (!node)
+   return;
+   BUG_ON((struct btrfs_root *)node->data != root);
}
-   spin_unlock(&rc->reloc_root_tree.lock);
-
-   if (!node)
-   return;
-   BUG_ON((struct btrfs_root *)node->data != root);

spin_lock(&fs_info->trans_lock);
list_del_init(&root->root_list);
-- 
2.18.0



Re: [PATCH] btrfs: tree-checker: Verify block_group_item

2018-07-02 Thread kbuild test robot
Hi Qu,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on btrfs/next]
[also build test WARNING on v4.18-rc3 next-20180629]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:
https://github.com/0day-ci/linux/commits/Qu-Wenruo/btrfs-tree-checker-Verify-block_group_item/20180702-135502
base:   https://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git next
config: i386-randconfig-x016-201826 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
reproduce:
# save the attached .config to linux build tree
make ARCH=i386 

All warnings (new ones prefixed by >>):

   fs/btrfs/tree-checker.c: In function 'check_block_group_item':
>> fs/btrfs/tree-checker.c:402:41: warning: format '%lu' expects argument of type 'long unsigned int', but argument 6 has type 'unsigned int' [-Wformat=]
        "invalid item size, have %u expect %lu",
                                           ~~^
                                           %u

vim +402 fs/btrfs/tree-checker.c

   379  
   380  static int check_block_group_item(struct btrfs_fs_info *fs_info,
   381                                    struct extent_buffer *leaf,
   382                                    struct btrfs_key *key, int slot)
   383  {
   384          struct btrfs_block_group_item bgi;
   385          u32 item_size = btrfs_item_size_nr(leaf, slot);
   386          u64 flags;
   387  
   388          /*
   389           * Here we don't really care about unalignment since extent allocator
   390           * can handle it.
   391           * We care more about the size, as if one block group is larger than
   392           * maximum size, it's must be some obvious corruption
   393           */
   394          if (key->offset > 10ULL * SZ_1G) {
   395                  block_group_err(fs_info, leaf, slot,
   396                          "invalid block group size, have %llu expect (0, %llu)",
   397                          key->offset, 10ULL * SZ_1G);
   398                  return -EUCLEAN;
   399          }
   400          if (item_size != sizeof(bgi)) {
   401                  block_group_err(fs_info, leaf, slot,
  > 402                          "invalid item size, have %u expect %lu",
   403                          item_size, sizeof(bgi));
   404                  return -EUCLEAN;
   405          }
   406          read_extent_buffer(leaf, &bgi, btrfs_item_ptr_offset(leaf, slot),
   407                             sizeof(bgi));
   408          if (btrfs_block_group_chunk_objectid(&bgi) !=
   409              BTRFS_FIRST_CHUNK_TREE_OBJECTID) {
   410                  block_group_err(fs_info, leaf, slot,
   411                  "invalid block group chunk objectid, have %llu expect %llu",
   412                          btrfs_block_group_chunk_objectid(&bgi),
   413                          BTRFS_FIRST_CHUNK_TREE_OBJECTID);
   414                  return -EUCLEAN;
   415          }
   416          if (btrfs_block_group_used(&bgi) > key->offset) {
   417                  block_group_err(fs_info, leaf, slot,
   418                          "invalid block group used, have %llu expect [0, %llu)",
   419                          btrfs_block_group_used(&bgi), key->offset);
   420                  return -EUCLEAN;
   421          }
   422          flags = btrfs_block_group_flags(&bgi);
   423          if (!((flags & BTRFS_BLOCK_GROUP_PROFILE_MASK) == 0 ||
   424                hweight64(flags & BTRFS_BLOCK_GROUP_PROFILE_MASK) == 1)) {
   425                  block_group_err(fs_info, leaf, slot,
   426          "invalid profile flags, have 0x%llx (%lu bits set) expect no more than 1 bit set",
   427                          flags & BTRFS_BLOCK_GROUP_PROFILE_MASK,
   428                          hweight64(flags & BTRFS_BLOCK_GROUP_PROFILE_MASK));
   429                  return -EUCLEAN;
   430          }
   431          if (hweight64(flags & BTRFS_BLOCK_GROUP_TYPE_MASK) != 1) {
   432                  block_group_err(fs_info, leaf, slot,
   433          "invalid type flags, have 0x%llx (%lu bits set) expect exactly 1 bit set",
   434                          flags & BTRFS_BLOCK_GROUP_TYPE_MASK,
   435                          hweight64(flags & BTRFS_BLOCK_GROUP_TYPE_MASK));
   436                  return -EUCLEAN;
   437          }
   438          return 0;
   439  }
   440  
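
The warning boils down to sizeof() having type size_t, which is 'unsigned
int' on i386 but 'unsigned long' on x86_64, so %lu only matches on 64-bit
builds. A minimal standalone illustration (my own snippet with a stand-in
struct, not btrfs code; the usual cure is the %zu specifier, though whether
that is the exact change folded into the patch is up to the author):

#include <stdio.h>

/* stand-in struct, roughly the shape of btrfs_block_group_item (3 x u64) */
struct bgi_like { unsigned long long used, chunk_objectid, flags; };

int main(void)
{
	unsigned int item_size = 13;

	/*
	 * sizeof() yields a size_t: 'unsigned int' on i386, 'unsigned long'
	 * on x86_64.  Printing it with %lu therefore warns on 32-bit builds
	 * (the -Wformat report above); %zu is correct on both.
	 */
	printf("invalid item size, have %u expect %zu\n",
	       item_size, sizeof(struct bgi_like));
	return 0;
}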

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation




Re: So, does btrfs check lowmem take days? weeks?

2018-07-02 Thread Su Yue




On 07/02/2018 11:22 AM, Marc MERLIN wrote:
> On Mon, Jul 02, 2018 at 10:02:33AM +0800, Su Yue wrote:
>> Could you try the following dumps? They shouldn't cost much time.
>>
>> #btrfs inspect dump-tree -t 21872  | grep -C 50 "374857 EXTENT_DATA "
>>
>> #btrfs inspect dump-tree -t 22911  | grep -C 50 "374857 EXTENT_DATA "
>
> Ok, that's 29MB, so it doesn't fit on pastebin:
> http://marc.merlins.org/tmp/dshelf2_inspect.txt


Sorry, Marc. After offline communication with Qu, both of us
think the filesystem is hard to repair. The filesystem is too
large to debug step by step, and every check-and-debug cycle
is too expensive; it has already cost several days.

Sadly, I am afraid you will have to recreate the filesystem
and restore your data from backup. :(

Sorry again, and thanks for your reports and patience.

Su

> Marc



