Re: [PATCH] Btrfs : send, truncate first to enhance many small files

2017-12-03 Thread Qu Wenruo


On 2017-12-04 15:02, robbieko wrote:
> From: Robbie Ko 
> 
> The commands generated by send contain the following step:
> 1. mkfile o1851-19-0
> 2. rename o1851-19-0 -> alsa-driver/alsa-kernel/isa/es1688/es1688.c
> 3. set_xattr alsa-driver/alsa-kernel/isa/es1688/es1688.c - name=user.xattr 
> data_len=4 data=test
> 4. write alsa-driver/alsa-kernel/isa/es1688/es1688.c - offset=0, len=10458
> 5. truncate alsa-driver/alsa-kernel/isa/es1688/es1688.c size=10458
> 6. chown alsa-driver/alsa-kernel/isa/es1688/es1688.c - uid=1024, gid=100
> 7. chmod alsa-driver/alsa-kernel/isa/es1688/es1688.c - mode=0644
> 8. utimes alsa-driver/alsa-kernel/isa/es1688/es1688.c
> 
> After writing the file content, send truncates the file to the correct size.
> Btrfs truncate flushes the last page if the size is not aligned to sectorsize,
> which makes the receive process wait until the flush finishes.
> To avoid waiting for that flush, this patch changes the order so
> that the truncate command is sent before the write command.

Personally speaking, it's better to optimize the receive side.

For example, at receive side, if we already know that the file size is
not changed at all, then just skip the truncate command.

In your send dump, step 5 is not needed at all, and can be skipped to
speed up the receive procedure.
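
The receive-side idea above can be sketched in userspace C. This is only an illustration under assumptions: apply_truncate_if_needed() is a hypothetical helper, not a function from btrfs-progs, and it uses plain stat()/truncate() rather than the real receive code path.

```c
#include <assert.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical receive-side helper (not btrfs-progs code): apply a
 * "truncate" command only when the current file size differs from the
 * requested size.
 *
 * Returns 1 if the truncate was skipped, 0 if it was performed,
 * and -1 on error.
 */
static int apply_truncate_if_needed(const char *path, off_t size)
{
	struct stat st;

	assert(path != NULL);
	if (stat(path, &st) < 0)
		return -1;
	if (st.st_size == size)
		return 1;	/* size already correct, no truncate issued */
	return truncate(path, size) < 0 ? -1 : 0;
}
```

With the send dump above, step 4 already leaves the file at 10458 bytes, so a check of this kind would skip step 5.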

Thanks,
Qu

> 
> Overall performance improves by 102 percent when sending 79 small files.
> original: 32m45.311s
> patch: 16m8.387s
> 
> Signed-off-by: Robbie Ko 
> ---
>  fs/btrfs/send.c | 13 +++++++++----
>  1 file changed, 9 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
> index 20d3300..7ae2347 100644
> --- a/fs/btrfs/send.c
> +++ b/fs/btrfs/send.c
> @@ -5857,10 +5857,6 @@ static int finish_inode_if_needed(struct send_ctx *sctx, int at_end)
>   goto out;
>   }
>   }
> - ret = send_truncate(sctx, sctx->cur_ino, sctx->cur_inode_gen,
> - sctx->cur_inode_size);
> - if (ret < 0)
> - goto out;
>   }
>  
>   if (need_chown) {
> @@ -6044,6 +6040,15 @@ static int changed_inode(struct send_ctx *sctx,
>   sctx->left_path->nodes[0], left_ii);
>   }
>   }
> + if (result == BTRFS_COMPARE_TREE_NEW ||
> + result == BTRFS_COMPARE_TREE_CHANGED) {
> + if (S_ISREG(sctx->cur_inode_mode)) {
> + ret = send_truncate(sctx, sctx->cur_ino, sctx->cur_inode_gen,
> + sctx->cur_inode_size);
> + if (ret < 0)
> + goto out;
> + }
> + }
>  
>  out:
>   return ret;
> 





[PATCH] btrfs: fix inconsistency during missing device rejoin

2017-12-03 Thread Anand Jain
When a device is missing, it is not necessarily true that btrfs_device::name
is NULL or that the path is different when the device reappears. A device can
go missing after it has been scanned, in which case neither
btrfs_device::name == NULL nor btrfs_device::name != reappear_dev_path
is true. So also check for BTRFS_DEV_STATE_MISSING in
btrfs_device::dev_state. Thanks.
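
A minimal userspace sketch of the resulting condition, under assumptions: struct mock_dev, MOCK_DEV_STATE_MISSING and mock_test_bit() are hypothetical stand-ins for the kernel types and helpers, and the flag is treated as a bit index.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-ins; not the real struct btrfs_device. */
enum { MOCK_DEV_STATE_MISSING = 2 };	/* bit index, chosen for illustration */

struct mock_dev {
	const char *name;		/* NULL when no path is recorded */
	unsigned long dev_state;	/* one bit per state */
};

/* Non-atomic imitation of a test_bit()-style helper. */
static bool mock_test_bit(int nr, const unsigned long *addr)
{
	return (*addr >> nr) & 1UL;
}

/*
 * Shape of the patched else-if condition: take the update path when the
 * name is unset, the path changed, or the device is flagged missing.
 */
static bool needs_path_update(const struct mock_dev *dev, const char *path)
{
	assert(dev != NULL && path != NULL);
	return !dev->name || strcmp(dev->name, path) != 0 ||
	       mock_test_bit(MOCK_DEV_STATE_MISSING, &dev->dev_state);
}
```

The third operand is what the patch adds: a device that reappears under the same path is still handled if it was marked missing.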

Signed-off-by: Anand Jain 
---
 fs/btrfs/volumes.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 59a8785a2e9e..ac0c4eb5107f 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -715,7 +715,8 @@ static noinline int device_list_add(const char *path,
 
ret = 1;
device->fs_devices = fs_devices;
-   } else if (!device->name || strcmp(device->name->str, path)) {
+   } else if (!device->name || strcmp(device->name->str, path) ||
+   test_bit(BTRFS_DEV_STATE_MISSING, &device->dev_state)) {
/*
 * When FS is already mounted.
 * 1. If you are here and if the device->name is NULL that
-- 
2.15.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v3] btrfs: handle dynamically reappearing missing device

2017-12-03 Thread Anand Jain
If a device is not present at the time of a degraded (-o degraded) mount,
the mount context creates a dummy missing struct btrfs_device.
The device may reappear after the FS is mounted, at which point
it is included in the device list but has missed the
open_device part. This patch handles that case by going
through the open_device steps that the device missed and finally
adding it to the device alloc list.

With this patch, to bring back the missing device the user can run:

   btrfs dev scan 

Without this kernel patch, 'btrfs fi show' and 'btrfs
dev ready' report that the missing device has reappeared
successfully, but in the kernel FS layer it has not.

Signed-off-by: Anand Jain 
---
This patch needs:
 [PATCH 0/4]  factor __btrfs_open_devices()

v2:
Add more comments.
Add more change log.
Add to check if device missing is set, to handle the case
dev open fail and user will rerun the dev scan

v3:
Reword comments in the code.
The device missing check added in v2, is sent as a separate patch
  [patch] btrfs: fix inconsistency during missing device rejoin

 fs/btrfs/volumes.c | 57 --
 1 file changed, 55 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index ac0c4eb5107f..04164337ac69 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -760,8 +760,61 @@ static noinline int device_list_add(const char *path,
rcu_string_free(device->name);
rcu_assign_pointer(device->name, name);
if (test_bit(BTRFS_DEV_STATE_MISSING, &device->dev_state)) {
-   fs_devices->missing_devices--;
-   clear_bit(BTRFS_DEV_STATE_MISSING, &device->dev_state);
+   int ret;
+   struct btrfs_fs_info *fs_info = fs_devices->fs_info;
+   fmode_t fmode = FMODE_READ | FMODE_WRITE | FMODE_EXCL;
+
+   if (btrfs_super_flags(disk_super) &
+   BTRFS_SUPER_FLAG_SEEDING)
+   fmode &= ~FMODE_WRITE;
+
+   /*
+* Missing can be set only when FS is mounted.
+* So here its always fs_devices->opened > 0 and most
+* of the struct device members are already updated by
+* the mount process even if this device was missing, so
+* now follow the normal open device procedure for this
+* device. The scrub will take care of filling the
+* missing stripes for raid56 and balance for raid1 and
+* raid10.
+*/
+   ASSERT(fs_devices->opened);
+   mutex_lock(&fs_devices->device_list_mutex);
+   mutex_lock(&fs_info->chunk_mutex);
+   /*
+* As of now do not fail the dev scan thread for the
+* reason that btrfs_open_one_device() fails and keep
+* the legacy dev scan requisites as it is.
+* And reset missing only if open is successful, as
+* user can rerun dev scan after fixing the device
+* for which the device open (below) failed.
+*/
+   ret = btrfs_open_one_device(fs_devices, device, fmode,
+   fs_info->bdev_holder);
+   if (!ret) {
+   fs_devices->missing_devices--;
+   clear_bit(BTRFS_DEV_STATE_MISSING,
+   &device->dev_state);
+   btrfs_clear_opt(fs_info->mount_opt, DEGRADED);
+   btrfs_warn(fs_info,
+   "BTRFS: device %s devid %llu joined\n",
+   path, devid);
+   }
+
+   if (test_bit(BTRFS_DEV_STATE_WRITEABLE,
+   &device->dev_state) &&
+   !test_bit(BTRFS_DEV_STATE_REPLACE_TGT,
+   &device->dev_state)) {
+   fs_devices->total_rw_bytes +=
+   device->total_bytes;
+   atomic64_add(device->total_bytes -
+   device->bytes_used,
+   &fs_info->free_chunk_space);
+   }
+   set_bit(BTRFS_DEV_STATE_IN_FS_METADATA,
+   &device->dev_state);
+   mutex_unlock(&fs_info->chunk_mutex);
+   

Re: [PATCH v2] btrfs: handle dynamically reappearing missing device

2017-12-03 Thread Anand Jain




[..] would an explicit user intervention be better than an automatic one?


 What user intervention steps do you have in mind?
 Just curious. Please remember that downtime is not an option for
 recovery in this context, which means the FS should stay available for
 applications to read and write.

 -o remount was one option which I had skipped for some time because,
 theoretically, it can't solve the problem (bringing back the missing
 device) either; I have now verified that it doesn't.

 There is no record of what the original idea was for performing this
 basic volume manager step.

Thanks, Anand



[PATCH] Btrfs : send, truncate first to enhance many small files

2017-12-03 Thread robbieko
From: Robbie Ko 

The commands generated by send contain the following step:
1. mkfile o1851-19-0
2. rename o1851-19-0 -> alsa-driver/alsa-kernel/isa/es1688/es1688.c
3. set_xattr alsa-driver/alsa-kernel/isa/es1688/es1688.c - name=user.xattr 
data_len=4 data=test
4. write alsa-driver/alsa-kernel/isa/es1688/es1688.c - offset=0, len=10458
5. truncate alsa-driver/alsa-kernel/isa/es1688/es1688.c size=10458
6. chown alsa-driver/alsa-kernel/isa/es1688/es1688.c - uid=1024, gid=100
7. chmod alsa-driver/alsa-kernel/isa/es1688/es1688.c - mode=0644
8. utimes alsa-driver/alsa-kernel/isa/es1688/es1688.c

After writing the file content, send truncates the file to the correct size.
Btrfs truncate flushes the last page if the size is not aligned to sectorsize,
which makes the receive process wait until the flush finishes.
To avoid waiting for that flush, this patch changes the order so
that the truncate command is sent before the write command.

Overall performance improves by 102 percent when sending 79 small files.
original: 32m45.311s
patch: 16m8.387s

Signed-off-by: Robbie Ko 
---
 fs/btrfs/send.c | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
index 20d3300..7ae2347 100644
--- a/fs/btrfs/send.c
+++ b/fs/btrfs/send.c
@@ -5857,10 +5857,6 @@ static int finish_inode_if_needed(struct send_ctx *sctx, int at_end)
goto out;
}
}
-   ret = send_truncate(sctx, sctx->cur_ino, sctx->cur_inode_gen,
-   sctx->cur_inode_size);
-   if (ret < 0)
-   goto out;
}
 
if (need_chown) {
@@ -6044,6 +6040,15 @@ static int changed_inode(struct send_ctx *sctx,
sctx->left_path->nodes[0], left_ii);
}
}
+   if (result == BTRFS_COMPARE_TREE_NEW ||
+   result == BTRFS_COMPARE_TREE_CHANGED) {
+   if (S_ISREG(sctx->cur_inode_mode)) {
+   ret = send_truncate(sctx, sctx->cur_ino, sctx->cur_inode_gen,
+   sctx->cur_inode_size);
+   if (ret < 0)
+   goto out;
+   }
+   }
 
 out:
return ret;
-- 
1.9.1



Re: exclusive subvolume space missing

2017-12-03 Thread Chris Murphy
On Sun, Dec 3, 2017 at 3:47 AM, Adam Borowski  wrote:

> I'd say that the only good use for nocow is "I wish I had placed this file
> on a non-btrfs, but it'd be too much hassle to repartition".
>
> If you snapshot nocow at all, you get the worst of both worlds.

I think it's better to have the option than not have it, but for
regular Joe user I think it's a problem. And that's why I'm not such a
big fan of systemd-journald using chattr +C on journals when on Btrfs,
by default. I wouldn't mind it if systemd also made /var/log/journal/
a subvolume, just like it automatically creates /var/lib/machines as
a subvolume. That way by default /var/log/journal would be immune to
snapshots.

Or alternatively a rework of how journals are written to be more COW friendly.


-- 
Chris Murphy


Re: exclusive subvolume space missing

2017-12-03 Thread Chris Murphy
On Fri, Dec 1, 2017 at 5:53 PM, Tomasz Pala  wrote:

> #  btrfs fi usage /
> Overall:
>     Device size:             128.00GiB
>     Device allocated:        117.19GiB
>     Device unallocated:       10.81GiB
>     Device missing:              0.00B
>     Used:                    103.56GiB
>     Free (estimated):         11.19GiB  (min: 11.14GiB)
>     Data ratio:                   1.98
>     Metadata ratio:               2.00
>     Global reserve:          146.08MiB  (used: 0.00B)
>
> Data,single: Size:1.19GiB, Used:1.18GiB
>/dev/sda2   1.07GiB
>/dev/sdb2 132.00MiB

This is asking for trouble. Two devices have single copy data chunks,
if those drives die, you lose that data. But the metadata referring to
those files will survive and Btrfs will keep complaining about them at
every scrub until they're all deleted - there is no command that makes
this easy. You'd have to scrape scrub output, which includes paths to
the missing files, and script something to delete them all.

You should convert this with something like 'btrfs balance start
-dconvert=raid1,soft '



-- 
Chris Murphy


[PATCH v2 4/5] btrfs: cleanup device states define BTRFS_DEV_STATE_REPLACE_TGT

2017-12-03 Thread Anand Jain
Currently device state is managed by individual int
variables such as struct btrfs_device::is_tgtdev_for_dev_replace.
Instead, declare the btrfs_device::dev_state flag
BTRFS_DEV_STATE_REPLACE_TGT and use the bit operations.

Signed-off-by: Anand Jain 
---
 fs/btrfs/dev-replace.c |  5 +++--
 fs/btrfs/extent-tree.c |  3 ++-
 fs/btrfs/ioctl.c   |  2 +-
 fs/btrfs/scrub.c   |  2 +-
 fs/btrfs/super.c   |  5 +++--
 fs/btrfs/volumes.c | 39 ++-
 fs/btrfs/volumes.h |  2 +-
 7 files changed, 33 insertions(+), 25 deletions(-)

diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index 559db7667f38..12fd8a203735 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -172,7 +172,8 @@ int btrfs_init_dev_replace(struct btrfs_fs_info *fs_info)
dev_replace->tgtdev->commit_bytes_used =
dev_replace->srcdev->commit_bytes_used;
}
-   dev_replace->tgtdev->is_tgtdev_for_dev_replace = 1;
+   set_bit(BTRFS_DEV_STATE_REPLACE_TGT,
+   &dev_replace->tgtdev->dev_state);
btrfs_init_dev_replace_tgtdev_for_resume(fs_info,
dev_replace->tgtdev);
}
@@ -564,7 +565,7 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
  dev_missing_or_rcu_str(src_device),
  src_device->devid,
  rcu_str_deref(tgt_device->name));
-   tgt_device->is_tgtdev_for_dev_replace = 0;
+   clear_bit(BTRFS_DEV_STATE_REPLACE_TGT, &tgt_device->dev_state);
tgt_device->devid = src_device->devid;
src_device->devid = BTRFS_DEV_REPLACE_DEVID;
memcpy(uuid_tmp, tgt_device->uuid, sizeof(uuid_tmp));
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 2cd323d184a0..1e65d5d54a8a 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -9692,7 +9692,8 @@ int btrfs_can_relocate(struct btrfs_fs_info *fs_info, u64 bytenr)
 * space to fit our block group in.
 */
if (device->total_bytes > device->bytes_used + min_free &&
-   !device->is_tgtdev_for_dev_replace) {
+   !test_bit(BTRFS_DEV_STATE_REPLACE_TGT,
+   &device->dev_state)) {
ret = find_free_dev_extent(trans, device, min_free,
   &dev_offset, NULL);
if (!ret)
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index e59004a17166..953563138020 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -1528,7 +1528,7 @@ static noinline int btrfs_ioctl_resize(struct file *file,
}
}
 
-   if (device->is_tgtdev_for_dev_replace) {
+   if (test_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state)) {
ret = -EPERM;
goto out_free;
}
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index b6de017066b3..b5a33db38874 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -4131,7 +4131,7 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 devid, u64 start,
 
mutex_lock(&fs_info->scrub_lock);
if (!test_bit(BTRFS_DEV_STATE_IN_FS_METADATA, &dev->dev_state) ||
-   dev->is_tgtdev_for_dev_replace) {
+   test_bit(BTRFS_DEV_STATE_REPLACE_TGT, &dev->dev_state)) {
mutex_unlock(&fs_info->scrub_lock);
mutex_unlock(&fs_info->fs_devices->device_list_mutex);
return -EIO;
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 6bae2e046257..b16e3fbd5895 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -1973,8 +1973,9 @@ static int btrfs_calc_avail_data_space(struct btrfs_fs_info *fs_info,
rcu_read_lock();
list_for_each_entry_rcu(device, &fs_devices->devices, dev_list) {
if (!test_bit(BTRFS_DEV_STATE_IN_FS_METADATA,
-   &device->dev_state) ||
-   !device->bdev || device->is_tgtdev_for_dev_replace)
+   &device->dev_state) || !device->bdev ||
+   test_bit(BTRFS_DEV_STATE_REPLACE_TGT,
+   &device->dev_state))
continue;
 
if (i >= nr_devices)
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index c6f7f4935dc4..37b1aed14353 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -845,8 +845,8 @@ void btrfs_close_extra_devices(struct btrfs_fs_devices *fs_devices, int step)
list_for_each_entry_safe(device, next, &fs_devices->devices, dev_list) {
if (test_bit(BTRFS_DEV_STATE_IN_FS_METADATA,
&device->dev_state)) {
-   if 

[PATCH v2 3/5] btrfs: cleanup device states define BTRFS_DEV_STATE_MISSING

2017-12-03 Thread Anand Jain
Currently device state is managed by individual int
variables such as struct btrfs_device::missing. Instead,
declare the btrfs_device::dev_state flag BTRFS_DEV_STATE_MISSING and use
the bit operations.

Signed-off-by: Anand Jain 
Reviewed-by: Nikolay Borisov 
---
 fs/btrfs/dev-replace.c |  3 ++-
 fs/btrfs/disk-io.c |  4 ++--
 fs/btrfs/scrub.c   |  7 ---
 fs/btrfs/super.c   |  2 +-
 fs/btrfs/volumes.c | 34 --
 fs/btrfs/volumes.h |  2 +-
 6 files changed, 30 insertions(+), 22 deletions(-)

diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index 4b6ceb38cb5f..559db7667f38 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -306,7 +306,8 @@ void btrfs_after_dev_replace_commit(struct btrfs_fs_info *fs_info)
 
 static inline char *dev_missing_or_rcu_str(struct btrfs_device *device)
 {
-   return device->missing ? "<missing_disk>" : rcu_str_deref(device->name);
+   return test_bit(BTRFS_DEV_STATE_MISSING, &device->dev_state) ?
+   "<missing_disk>" : rcu_str_deref(device->name);
 }
 
 int btrfs_dev_replace_start(struct btrfs_fs_info *fs_info,
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 634e8eb51cc8..890e3a6a2f3e 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3399,7 +3399,7 @@ static int barrier_all_devices(struct btrfs_fs_info *info)
/* send down all the barriers */
head = &info->fs_devices->devices;
list_for_each_entry_rcu(dev, head, dev_list) {
-   if (dev->missing)
+   if (test_bit(BTRFS_DEV_STATE_MISSING, &dev->dev_state))
continue;
if (!dev->bdev)
continue;
@@ -3415,7 +3415,7 @@ static int barrier_all_devices(struct btrfs_fs_info *info)
 
/* wait for all the barriers */
list_for_each_entry_rcu(dev, head, dev_list) {
-   if (dev->missing)
+   if (test_bit(BTRFS_DEV_STATE_MISSING, &dev->dev_state))
continue;
if (!dev->bdev) {
errors_wait++;
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index c4705de2ec26..b6de017066b3 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -2535,7 +2535,7 @@ static int scrub_pages(struct scrub_ctx *sctx, u64 logical, u64 len,
}
 
WARN_ON(sblock->page_count == 0);
-   if (dev->missing) {
+   if (test_bit(BTRFS_DEV_STATE_MISSING, &dev->dev_state)) {
/*
 * This case should only be hit for RAID 5/6 device replace. See
 * the comment in scrub_missing_raid56_pages() for details.
@@ -2870,7 +2870,7 @@ static int scrub_extent_for_parity(struct scrub_parity *sparity,
u8 csum[BTRFS_CSUM_SIZE];
u32 blocksize;
 
-   if (dev->missing) {
+   if (test_bit(BTRFS_DEV_STATE_MISSING, &dev->dev_state)) {
scrub_parity_mark_sectors_error(sparity, logical, len);
return 0;
}
@@ -4112,7 +4112,8 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 devid, u64 start,
 
mutex_lock(&fs_info->fs_devices->device_list_mutex);
dev = btrfs_find_device(fs_info, devid, NULL, NULL);
-   if (!dev || (dev->missing && !is_dev_replace)) {
+   if (!dev || (test_bit(BTRFS_DEV_STATE_MISSING, &dev->dev_state) &&
+   !is_dev_replace)) {
mutex_unlock(&fs_info->fs_devices->device_list_mutex);
return -ENODEV;
}
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index f0906fbfa731..6bae2e046257 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -2270,7 +2270,7 @@ static int btrfs_show_devname(struct seq_file *m, struct dentry *root)
while (cur_devices) {
head = &cur_devices->devices;
list_for_each_entry(dev, head, dev_list) {
-   if (dev->missing)
+   if (test_bit(BTRFS_DEV_STATE_MISSING, &dev->dev_state))
continue;
if (!dev->name)
continue;
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 7100c877748d..c6f7f4935dc4 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -758,9 +758,9 @@ static noinline int device_list_add(const char *path,
return -ENOMEM;
rcu_string_free(device->name);
rcu_assign_pointer(device->name, name);
-   if (device->missing) {
+   if (test_bit(BTRFS_DEV_STATE_MISSING, &device->dev_state)) {
fs_devices->missing_devices--;
-   device->missing = 0;
+   clear_bit(BTRFS_DEV_STATE_MISSING, &device->dev_state);
}
}
 
@@ -944,7 +944,7 @@ static void btrfs_prepare_close_one_device(struct btrfs_device *device)
fs_devices->rw_devices--;
}
 
-   if (device->missing)
+   

[PATCH v2 5/5] btrfs: cleanup device states define BTRFS_DEV_STATE_FLUSH_SENT

2017-12-03 Thread Anand Jain
Currently device state is managed by individual int
variables such as struct btrfs_device::flush_bio_sent.
Instead, declare the btrfs_device::dev_state flag
BTRFS_DEV_STATE_FLUSH_SENT and use the bit operations.

Signed-off-by: Anand Jain 
---
 fs/btrfs/disk-io.c | 6 +++---
 fs/btrfs/volumes.h | 1 +
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 890e3a6a2f3e..9b20c1f3563b 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3359,7 +3359,7 @@ static void write_dev_flush(struct btrfs_device *device)
bio->bi_private = &device->flush_wait;
 
btrfsic_submit_bio(bio);
-   device->flush_bio_sent = 1;
+   set_bit(BTRFS_DEV_STATE_FLUSH_SENT, &device->dev_state);
 }
 
 /*
@@ -3369,10 +3369,10 @@ static blk_status_t wait_dev_flush(struct btrfs_device *device)
 {
struct bio *bio = device->flush_bio;
 
-   if (!device->flush_bio_sent)
+   if (!test_bit(BTRFS_DEV_STATE_FLUSH_SENT, &device->dev_state))
return BLK_STS_OK;
 
-   device->flush_bio_sent = 0;
+   clear_bit(BTRFS_DEV_STATE_FLUSH_SENT, &device->dev_state);
wait_for_completion_io(&device->flush_wait);
 
return bio->bi_status;
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 4096350c2cea..7acfd61611aa 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -51,6 +51,7 @@ struct btrfs_pending_bios {
 #define BTRFS_DEV_STATE_IN_FS_METADATA (1UL << 1)
 #define BTRFS_DEV_STATE_MISSING(1UL << 2)
 #define BTRFS_DEV_STATE_REPLACE_TGT(1UL << 3)
+#define BTRFS_DEV_STATE_FLUSH_SENT (1UL << 4)
 
 struct btrfs_device {
struct list_head dev_list;
-- 
2.15.0



[PATCH v2 1/5] btrfs: cleanup device states define BTRFS_DEV_STATE_WRITEABLE

2017-12-03 Thread Anand Jain
Currently device state is managed by individual int
variables such as struct btrfs_device::writeable. Instead,
declare the device state BTRFS_DEV_STATE_WRITEABLE and use the
bit operations.

Signed-off-by: Anand Jain 
---
v2: Remove an unrelated change.
Start btrfs_device::dev_state position from bit 0.
 fs/btrfs/disk-io.c | 12 ++
 fs/btrfs/extent-tree.c |  2 +-
 fs/btrfs/extent_io.c   |  3 ++-
 fs/btrfs/ioctl.c   |  2 +-
 fs/btrfs/scrub.c   |  3 ++-
 fs/btrfs/volumes.c | 60 +-
 fs/btrfs/volumes.h |  4 +++-
 7 files changed, 52 insertions(+), 34 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 10a2a579cc7f..56198cb02b35 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3403,7 +3403,8 @@ static int barrier_all_devices(struct btrfs_fs_info *info)
continue;
if (!dev->bdev)
continue;
-   if (!dev->in_fs_metadata || !dev->writeable)
+   if (!dev->in_fs_metadata ||
+   !test_bit(BTRFS_DEV_STATE_WRITEABLE, &dev->dev_state))
continue;
 
write_dev_flush(dev);
@@ -3418,7 +3419,8 @@ static int barrier_all_devices(struct btrfs_fs_info *info)
errors_wait++;
continue;
}
-   if (!dev->in_fs_metadata || !dev->writeable)
+   if (!dev->in_fs_metadata ||
+   !test_bit(BTRFS_DEV_STATE_WRITEABLE, &dev->dev_state))
continue;
 
ret = wait_dev_flush(dev);
@@ -3515,7 +3517,8 @@ int write_all_supers(struct btrfs_fs_info *fs_info, int max_mirrors)
total_errors++;
continue;
}
-   if (!dev->in_fs_metadata || !dev->writeable)
+   if (!dev->in_fs_metadata ||
+   !test_bit(BTRFS_DEV_STATE_WRITEABLE, &dev->dev_state))
continue;
 
btrfs_set_stack_device_generation(dev_item, 0);
@@ -3554,7 +3557,8 @@ int write_all_supers(struct btrfs_fs_info *fs_info, int max_mirrors)
list_for_each_entry_rcu(dev, head, dev_list) {
if (!dev->bdev)
continue;
-   if (!dev->in_fs_metadata || !dev->writeable)
+   if (!dev->in_fs_metadata ||
+   !test_bit(BTRFS_DEV_STATE_WRITEABLE, &dev->dev_state))
continue;
 
ret = wait_dev_supers(dev, max_mirrors);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 15c01014e5e1..2cd323d184a0 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -10877,7 +10877,7 @@ static int btrfs_trim_free_extents(struct btrfs_device *device,
*trimmed = 0;
 
/* Not writeable = nothing to do. */
-   if (!device->writeable)
+   if (!test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state))
return 0;
 
/* No free space = nothing to do. */
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 012d63870b99..25682c5a0dd5 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2027,7 +2027,8 @@ int repair_io_failure(struct btrfs_fs_info *fs_info, u64 ino, u64 start,
bio->bi_iter.bi_sector = sector;
dev = bbio->stripes[bbio->mirror_num - 1].dev;
btrfs_put_bbio(bbio);
-   if (!dev || !dev->bdev || !dev->writeable) {
+   if (!dev || !dev->bdev ||
+   !test_bit(BTRFS_DEV_STATE_WRITEABLE, &dev->dev_state)) {
btrfs_bio_counter_dec(fs_info);
bio_put(bio);
return -EIO;
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index d748ad1c3620..e59004a17166 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -1503,7 +1503,7 @@ static noinline int btrfs_ioctl_resize(struct file *file,
goto out_free;
}
 
-   if (!device->writeable) {
+   if (!test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state)) {
btrfs_info(fs_info,
   "resizer unable to apply on readonly device %llu",
   devid);
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index b2f871d80982..fa70ff9b7762 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -4117,7 +4117,8 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 devid, u64 start,
return -ENODEV;
}
 
-   if (!is_dev_replace && !readonly && !dev->writeable) {
+   if (!is_dev_replace && !readonly &&
+   !test_bit(BTRFS_DEV_STATE_WRITEABLE, &dev->dev_state)) {
mutex_unlock(&fs_info->fs_devices->device_list_mutex);
rcu_read_lock();
name = rcu_dereference(dev->name);
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index c19a49167966..9d14d83ab8dc 100644
--- a/fs/btrfs/volumes.c
+++ 

[PATCH v2 0/5] define BTRFS_DEV_STATE

2017-12-03 Thread Anand Jain
As of now, device properties and states are represented as individual int
variables. Clean that up by using bit operations on a single state field.
Patches on the ML, such as the device failed state, need this cleanup as well.

V2:
 Accepts all comments from Nikolay.
 Drops can_discard.
 Adds BTRFS_DEV_STATE_REPLACE_TGT and BTRFS_DEV_STATE_FLUSH_SENT
patches.

Anand Jain (5):
  btrfs: cleanup device states define BTRFS_DEV_STATE_WRITEABLE
  btrfs: cleanup device states define BTRFS_DEV_STATE_IN_FS_METADATA
  btrfs: cleanup device states define BTRFS_DEV_STATE_MISSING
  btrfs: cleanup device states define BTRFS_DEV_STATE_REPLACE_TGT
  btrfs: cleanup device states define BTRFS_DEV_STATE_FLUSH_SENT

 fs/btrfs/dev-replace.c |   8 ++-
 fs/btrfs/disk-io.c |  29 ++---
 fs/btrfs/extent-tree.c |   5 +-
 fs/btrfs/extent_io.c   |   3 +-
 fs/btrfs/ioctl.c   |   4 +-
 fs/btrfs/scrub.c   |  13 +++--
 fs/btrfs/super.c   |   8 ++-
 fs/btrfs/volumes.c | 156 -
 fs/btrfs/volumes.h |  11 ++--
 9 files changed, 143 insertions(+), 94 deletions(-)

-- 
2.15.0



[PATCH v2 2/5] btrfs: cleanup device states define BTRFS_DEV_STATE_IN_FS_METADATA

2017-12-03 Thread Anand Jain
Currently device state is managed by individual int
variables such as struct btrfs_device::in_fs_metadata. Instead,
declare the device state BTRFS_DEV_STATE_IN_FS_METADATA and use
the bit operations.

Signed-off-by: Anand Jain 
Reviewed-by: Nikolay Borisov 
---
 fs/btrfs/disk-io.c | 21 ++---
 fs/btrfs/scrub.c   |  3 ++-
 fs/btrfs/super.c   |  5 +++--
 fs/btrfs/volumes.c | 29 +
 fs/btrfs/volumes.h |  2 +-
 5 files changed, 37 insertions(+), 23 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 56198cb02b35..634e8eb51cc8 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3403,8 +3403,10 @@ static int barrier_all_devices(struct btrfs_fs_info *info)
continue;
if (!dev->bdev)
continue;
-   if (!dev->in_fs_metadata ||
-   !test_bit(BTRFS_DEV_STATE_WRITEABLE, &dev->dev_state))
+   if (!test_bit(BTRFS_DEV_STATE_IN_FS_METADATA,
+   &dev->dev_state) ||
+   !test_bit(BTRFS_DEV_STATE_WRITEABLE,
+   &dev->dev_state))
continue;
 
write_dev_flush(dev);
@@ -3419,8 +3421,10 @@ static int barrier_all_devices(struct btrfs_fs_info *info)
errors_wait++;
continue;
}
-   if (!dev->in_fs_metadata ||
-   !test_bit(BTRFS_DEV_STATE_WRITEABLE, &dev->dev_state))
+   if (!test_bit(BTRFS_DEV_STATE_IN_FS_METADATA,
+   &dev->dev_state) ||
+   !test_bit(BTRFS_DEV_STATE_WRITEABLE,
+   &dev->dev_state))
continue;
 
ret = wait_dev_flush(dev);
@@ -3517,7 +3521,8 @@ int write_all_supers(struct btrfs_fs_info *fs_info, int max_mirrors)
total_errors++;
continue;
}
-   if (!dev->in_fs_metadata ||
+   if (!test_bit(BTRFS_DEV_STATE_IN_FS_METADATA,
+   &dev->dev_state) ||
!test_bit(BTRFS_DEV_STATE_WRITEABLE, &dev->dev_state))
continue;
 
@@ -3557,8 +3562,10 @@ int write_all_supers(struct btrfs_fs_info *fs_info, int max_mirrors)
list_for_each_entry_rcu(dev, head, dev_list) {
if (!dev->bdev)
continue;
-   if (!dev->in_fs_metadata ||
-   !test_bit(BTRFS_DEV_STATE_WRITEABLE, &dev->dev_state))
+   if (!test_bit(BTRFS_DEV_STATE_IN_FS_METADATA,
+   &dev->dev_state) ||
+   !test_bit(BTRFS_DEV_STATE_WRITEABLE,
+   &dev->dev_state))
continue;
 
ret = wait_dev_supers(dev, max_mirrors);
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index fa70ff9b7762..c4705de2ec26 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -4129,7 +4129,8 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 devid, u64 start,
}
 
mutex_lock(&fs_info->scrub_lock);
-   if (!dev->in_fs_metadata || dev->is_tgtdev_for_dev_replace) {
+   if (!test_bit(BTRFS_DEV_STATE_IN_FS_METADATA, &dev->dev_state) ||
+   dev->is_tgtdev_for_dev_replace) {
mutex_unlock(&fs_info->scrub_lock);
mutex_unlock(&fs_info->fs_devices->device_list_mutex);
return -EIO;
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 3a4dce153645..f0906fbfa731 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -1972,8 +1972,9 @@ static int btrfs_calc_avail_data_space(struct btrfs_fs_info *fs_info,
 
rcu_read_lock();
list_for_each_entry_rcu(device, &fs_devices->devices, dev_list) {
-   if (!device->in_fs_metadata || !device->bdev ||
-   device->is_tgtdev_for_dev_replace)
+   if (!test_bit(BTRFS_DEV_STATE_IN_FS_METADATA,
+   &device->dev_state) ||
+   !device->bdev || device->is_tgtdev_for_dev_replace)
continue;
 
if (i >= nr_devices)
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 9d14d83ab8dc..7100c877748d 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -636,7 +636,7 @@ static int btrfs_open_one_device(struct btrfs_fs_devices *fs_devices,
fs_devices->rotating = 1;
 
device->bdev = bdev;
-   device->in_fs_metadata = 0;
+   clear_bit(BTRFS_DEV_STATE_IN_FS_METADATA, &device->dev_state);
device->mode = flags;
 
fs_devices->open_devices++;
@@ -843,7 +843,8 @@ void btrfs_close_extra_devices(struct btrfs_fs_devices 

Re: [PATCH 5/5] btrfs: Greatly simplify btrfs_read_dev_super

2017-12-03 Thread Anand Jain



On 12/03/2017 05:43 PM, Nikolay Borisov wrote:
>
>
> On  2.12.2017 01:23, Anand Jain wrote:
>>
>>
>> On 12/01/2017 05:19 PM, Nikolay Borisov wrote:
>>> Currently this function executes the inner loop at most once due to the
>>> i = 0; i < 1 condition. Furthermore, the btrfs_super_generation(super) >
>>> transid check in the if condition is never evaluated because latest is
>>> always NULL, so the first part of the condition always triggers. The
>>> gist of btrfs_read_dev_super is really to read the first superblock.
>>>
>>> Signed-off-by: Nikolay Borisov 
>>> ---
>>>   fs/btrfs/disk-io.c | 27 ---
>>>   1 file changed, 4 insertions(+), 23 deletions(-)
>>>
>>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>>> index 82c96607fc46..6d5f632fd1e7 100644
>>> --- a/fs/btrfs/disk-io.c
>>> +++ b/fs/btrfs/disk-io.c
>>> @@ -3170,37 +3170,18 @@ int btrfs_read_dev_one_super(struct
>>> block_device *bdev, int copy_num,
>>>   struct buffer_head *btrfs_read_dev_super(struct block_device *bdev)
>>>   {
>>>   struct buffer_head *bh;
>>> -    struct buffer_head *latest = NULL;
>>> -    struct btrfs_super_block *super;
>>> -    int i;
>>> -    u64 transid = 0;
>>> -    int ret = -EINVAL;
>>> +    int ret;
>>>     /* we would like to check all the supers, but that would make
>>>    * a btrfs mount succeed after a mkfs from a different FS.
>>>    * So, we need to add a special mount option to scan for
>>>    * later supers, using BTRFS_SUPER_MIRROR_MAX instead
>>>    */
>>
>>  We need below loop to support the above comment at some point,
>
> And when is that, since I don't see anyone working on it. Furthermore
> what is it that we are losing in terms of functionality by not
> supporting the comment?

 As of now, if the primary SB is corrupted we don't recover from it
 automatically, and external tools are broken.

> It seems this code was just slapped here without
> any vision how/when to implement it?

 I have it on my todo list. I am OK if you want to fix it as needed.

> Furthermore, you seem to be aware of what the comment is talking about,
> I have to admit I'm not. Is the idea that if another filesystem does
> mkfs and doesn't overwrite ALL superblock copies that btrfs writes (at
> 64KiB, 64MiB, 256GiB and 1PiB) then it's possible for this code to
> erroneously detect btrfs when in fact there is a different fs?

 Right.

 IMO the above comment is wrong as well, and it is the reason the
 for-loop was made to read only the primary SB.

 If we have a feature to maintain backup SBs, then that feature
 is only complete when we automatically recover from the
 backup SB.
 If a user overwrites the btrfs primary SB and still mounts with
 -t btrfs, then it's a user-end problem; we should be
 able to recover from the backup SB. So IMO looping through the other
 SBs is fine.

Thanks, Anand

> I don't understand what problem *should* be solved here...
>
>>  instead of removing I would prefer to fix as per above comments.
>>
>> Thanks, Anand
>>
>>
>>> -    for (i = 0; i < 1; i++) {
>>> -    ret = btrfs_read_dev_one_super(bdev, i, &bh);
>>> -    if (ret)
>>> -    continue;
>>> -
>>> -    super = (struct btrfs_super_block *)bh->b_data;
>>> -
>>> -    if (!latest || btrfs_super_generation(super) > transid) {
>>> -    brelse(latest);
>>> -    latest = bh;
>>> -    transid = btrfs_super_generation(super);
>>> -    } else {
>>> -    brelse(bh);
>>> -    }
>>> -    }
>>> -
>>> -    if (!latest)
>>> +    ret = btrfs_read_dev_one_super(bdev, 0, &bh);
>>> +    if (ret)
>>>   return ERR_PTR(ret);
>>>   -    return latest;
>>> +    return bh;
>>>   }
>>>     /*
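
The recovery idea discussed above, looping over the backup superblocks and using the best one, comes down to picking the valid copy with the highest generation, which is what the removed loop would do if it ran more than once. A minimal userspace sketch, assuming a simplified `struct sb_copy` and a hypothetical `pick_latest_sb()` helper (neither exists in the kernel, and real validation involves magic and checksum checks that are not modeled here):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for one superblock copy read from disk. */
struct sb_copy {
	int valid;            /* nonzero if the copy passed validation */
	uint64_t generation;  /* transaction generation stored in the copy */
};

/*
 * Return the index of the valid copy with the highest generation,
 * or -1 if no copy is valid.
 */
int pick_latest_sb(const struct sb_copy *copies, size_t n)
{
	int best = -1;
	size_t i;

	assert(n == 0 || copies != NULL);
	for (i = 0; i < n; i++) {
		if (!copies[i].valid)
			continue;
		if (best < 0 || copies[i].generation > copies[best].generation)
			best = (int)i;
	}
	return best;
}
```

With a corrupted primary copy and two valid backups, the helper returns the index of the backup with the newer generation, which is the automatic recovery behaviour asked for in the reply.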




--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html




Re: exclusive subvolume space missing

2017-12-03 Thread Qu Wenruo


On 2017-12-02 17:33, Tomasz Pala wrote:
> OK, I seriously need to address that, as during the night I lost
> 3 GB again:
> 
> On Sat, Dec 02, 2017 at 10:35:12 +0800, Qu Wenruo wrote:
> 
>>> #  btrfs fi sh /
>>> Label: none  uuid: 17a3de25-6e26-4b0b-9665-ac267f6f6c4a
>>> Total devices 2 FS bytes used 44.10GiB
>     Total devices 2 FS bytes used 47.28GiB
>
>>> #  btrfs fi usage /
>>> Overall:
>>> Used: 88.19GiB
>     Used: 94.58GiB
>>> Free (estimated): 18.75GiB  (min: 18.75GiB)
>     Free (estimated): 15.56GiB  (min: 15.56GiB)
>>>
>>> #  btrfs dev usage /
> - output not changed
>
>>> #  btrfs fi df /
>>> Data, RAID1: total=51.97GiB, used=43.22GiB
>     Data, RAID1: total=51.97GiB, used=46.42GiB
>>> System, RAID1: total=32.00MiB, used=16.00KiB
>>> Metadata, RAID1: total=2.00GiB, used=895.69MiB
>>> GlobalReserve, single: total=131.14MiB, used=0.00B
>     GlobalReserve, single: total=135.50MiB, used=0.00B
>>>
>>> # df
>>> /dev/sda2        64G   45G   19G  71% /
>     /dev/sda2        64G   48G   16G  76% /
>>> However the difference is on active root fs:
>>>
>>> -0/291        24.29GiB      9.77GiB
>>> +0/291        15.99GiB     76.00MiB
>      0/291        19.19GiB      3.28GiB
>>
>> Since you have already showed the size of the snapshots, which hardly
>> goes beyond 1G, it may be possible that extent booking is the cause.
>>
>> And considering it's all exclusive, defrag may help in this case.
> 
> I'm going to try defrag here, but have a bunch of questions before;
> as defrag would break CoW, I don't want to defrag files that span
> multiple snapshots, unless they have huge overhead:
> 1. is there any switch resulting in 'defrag only exclusive data'?

IIRC, no.

> 2. is there any switch resulting in 'defrag only extents fragmented more
>    than X' or 'defrag only fragments that would be possibly freed'?

No to that as well.

> 3. I guess there aren't, so how could I accomplish my target, i.e.
>    reclaiming space that was lost due to fragmentation, without breaking
>    snapshotted CoW where it would be not only pointless, but actually harmful?

What about using an older kernel, like v4.13?

> 4. How can I prevent this from happening again? All the files that are
>    written constantly (stats collector here, PostgreSQL database and
>    logs on other machines) are marked with nocow (+C); maybe some new
>    attribute to mark file as autodefrag? +t?

Unfortunately, nocow only works if there is no other subvolume/inode
referring to the data.

That is to say, if you're using snapshots, then NOCOW won't help as much
as you expect, although it is still much better than normal data CoW.
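
For reference, the nocow (+C) attribute mentioned in the question can also be set programmatically. A minimal sketch, assuming a hypothetical `set_nocow()` helper built on the generic Linux `FS_IOC_GETFLAGS`/`FS_IOC_SETFLAGS` ioctls and the `FS_NOCOW_FL` flag from `<linux/fs.h>`; on btrfs the flag is normally expected to be set on a new or empty file, and the caveat above about snapshots still applies:

```c
#include <errno.h>
#include <linux/fs.h>
#include <sys/ioctl.h>

/*
 * Hypothetical helper: set the NOCOW inode flag (what `chattr +C` sets)
 * on an open file descriptor. Returns 0 on success, -1 with errno set.
 */
int set_nocow(int fd)
{
	int flags = 0;

	/* Read the current inode flags so the other bits are preserved. */
	if (ioctl(fd, FS_IOC_GETFLAGS, &flags) < 0)
		return -1;

	flags |= FS_NOCOW_FL;

	return ioctl(fd, FS_IOC_SETFLAGS, &flags) < 0 ? -1 : 0;
}
```

Filesystems that do not support inode flags reject the ioctl, so a caller should treat a -1 return as "attribute not applied" rather than as a fatal error.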

> 
> For example, the largest file from stats collector:
>  Total   Exclusive  Set shared  Filename
>  432.00KiB   176.00KiB   256.00KiB  load/load.rrd
> 
> but most of them have 'Set shared'==0.
> 
> 5. The stats collector has been running from the beginning; according to the
> quota output it was not an issue until something happened. If the problem
> was triggered by (guessing) a low-space condition, and it results in even
> more space lost, there is a dangerous positive feedback loop, as it makes
> any filesystem unstable ("once you run out of space, you won't recover").
> Does it mean btrfs is simply not suitable (yet?) for a frequent-update usage
> pattern, like RRD files?

It's hard to say what the cause is.

But in my understanding, btrfs is not suitable for such a conflicting
situation, where you want snapshots of files that get frequent partial
updates.

IIRC, btrfs is better suited to use cases where updates are either
infrequent or replace the whole file, not just part of it.

So btrfs is good for root filesystem directories like /etc and /usr (and
/bin and /lib, which point to /usr/bin and /usr/lib), but not for /var
or /run.

> 
> 6. Or maybe some extra steps should be taken just before taking a snapshot?
> I guess 'defrag exclusive' would be perfect here - reclaiming space
> before it gets locked inside a snapshot.

Yes, this sounds perfectly reasonable.

Thanks,
Qu

> Rationale behind this is obvious: since the snapshot-aware defrag was
> removed, allow to defrag snapshot exclusive data only.
> This would of course result in partial file defragmentation, but that
> should be enough for pathological cases like mine.





[PATCH v2] Btrfs: heuristic replace heap sort with radix sort

2017-12-03 Thread Timofey Titovets
The slowest part of the heuristic for now is the kernel heap sort().
It can take up to 55% of the runtime when sorting bucket items.

As sorting is always called on most data sets to get the correct
byte_core_set_size, the only way to speed up the heuristic is to
speed up the sort on the bucket.

Add a general radix_sort function.
Radix sort requires 2 buffers: one the full size of the input array
and one to store counters (jump addresses).

That increases memory usage per heuristic workspace by 1KiB:
8KiB + 1KiB -> 8KiB + 2KiB

This is an LSD radix sort; I use 4 bits as the base for calculating,
to keep the counters array acceptably small (16 elements * 8 bytes).

This radix sort implementation has several points to adjust;
I added them to make the radix sort generally usable in the kernel,
like heap sort, if needed.

Performance tested in userspace copy of heuristic code,
throughput:
- average <-> random data: ~3500 MiB/s - heap  sort
- average <-> random data: ~6000 MiB/s - radix sort
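
The LSD radix sort described above can be sketched in userspace as follows. This is an illustration under stated assumptions, not the code from the patch: it sorts plain `uint32_t` values instead of `struct bucket_item`, the name `radix_sort_desc_u32()` is hypothetical, and it always runs all eight 4-bit passes rather than capping the passes at the maximum value as the patch does. It keeps the two ideas from the description: a 16-entry counters array and a reversed digit so the result is in descending order.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define RADIX_BASE     4
#define COUNTERS_SIZE  (1 << RADIX_BASE)

/* Reversed 4-bit digit at `shift`, so that larger values sort first. */
static unsigned rev_digit(uint32_t v, int shift)
{
	return (COUNTERS_SIZE - 1) - ((v >> shift) & (COUNTERS_SIZE - 1));
}

/*
 * Hypothetical userspace sketch: sort `a` (n elements) in descending
 * order, using `tmp` (also n elements) as the second buffer.
 */
void radix_sort_desc_u32(uint32_t *a, uint32_t *tmp, size_t n)
{
	int shift;

	assert(n == 0 || (a != NULL && tmp != NULL));
	if (n == 0)
		return;

	for (shift = 0; shift < 32; shift += RADIX_BASE) {
		size_t counters[COUNTERS_SIZE] = {0};
		size_t i;

		/* Histogram of the current digit. */
		for (i = 0; i < n; i++)
			counters[rev_digit(a[i], shift)]++;

		/* Prefix sums: counters[d] is one past the last slot of digit d. */
		for (i = 1; i < COUNTERS_SIZE; i++)
			counters[i] += counters[i - 1];

		/* Walk backwards so equal digits keep their order (stable pass). */
		for (i = n; i-- > 0; ) {
			unsigned d = rev_digit(a[i], shift);

			tmp[--counters[d]] = a[i];
		}
		memcpy(a, tmp, n * sizeof(*a));
	}
}
```

Each pass is linear in the array size and needs no comparisons, which is where the gain over the comparison-based heap sort is expected to come from; the price is the second full-size buffer mentioned in the description.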

Changes:
  v1 -> v2:
- Tested on Big Endian
- Drop most of multiply operations
- Separately allocate sort buffer

Signed-off-by: Timofey Titovets 
---
 fs/btrfs/compression.c | 147 ++---
 1 file changed, 140 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index ae016699d13e..19b52982deda 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -33,7 +33,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include "ctree.h"
 #include "disk-io.h"
@@ -752,6 +751,8 @@ struct heuristic_ws {
u32 sample_size;
/* Buckets store counters for each byte value */
struct bucket_item *bucket;
+   /* Sorting buffer */
+   struct bucket_item *bucket_b;
struct list_head list;
 };
 
@@ -763,6 +764,7 @@ static void free_heuristic_ws(struct list_head *ws)
 
kvfree(workspace->sample);
kfree(workspace->bucket);
+   kfree(workspace->bucket_b);
kfree(workspace);
 }
 
@@ -782,6 +784,10 @@ static struct list_head *alloc_heuristic_ws(void)
if (!ws->bucket)
goto fail;
 
+   ws->bucket_b = kcalloc(BUCKET_SIZE, sizeof(*ws->bucket_b), GFP_KERNEL);
+   if (!ws->bucket_b)
+   goto fail;
+
INIT_LIST_HEAD(&ws->list);
return &ws->list;
 fail:
@@ -1278,13 +1284,136 @@ static u32 shannon_entropy(struct heuristic_ws *ws)
return entropy_sum * 100 / entropy_max;
 }
 
-/* Compare buckets by size, ascending */
-static int bucket_comp_rev(const void *lv, const void *rv)
+#define RADIX_BASE 4
+#define COUNTERS_SIZE (1 << RADIX_BASE)
+
+static inline uint8_t get4bits(uint64_t num, int shift) {
+   uint8_t low4bits;
+   num = num >> shift;
+   /* Reverse order */
+   low4bits = (COUNTERS_SIZE - 1) - (num % COUNTERS_SIZE);
+   return low4bits;
+}
+
+static inline void copy_cell(void *dst, int dest_i, void *src, int src_i)
 {
-   const struct bucket_item *l = (const struct bucket_item *)lv;
-   const struct bucket_item *r = (const struct bucket_item *)rv;
+   struct bucket_item *dstv = (struct bucket_item *) dst;
+   struct bucket_item *srcv = (struct bucket_item *) src;
+   dstv[dest_i] = srcv[src_i];
+}
 
-   return r->count - l->count;
+static inline uint64_t get_num(const void *a, int i)
+{
+   struct bucket_item *av = (struct bucket_item *) a;
+   return av[i].count;
+}
+
+/*
+ * Use 4 bits as radix base
+ * Use 16 uint64_t counters for calculating new position in buf array
+ *
+ * @array - array that will be sorted
+ * @array_buf - buffer array to store sorting results
+ *  must be equal in size to @array
+ * @num   - array size
+ * @max_cell  - Link to element with maximum possible value
+ *  that can be used to cap radix sort iterations
+ *  if we know maximum value before call sort
+ * @get_num   - function to extract number from array
+ * @copy_cell - function to copy data from array to array_buf
+ *  and vice versa
+ * @get4bits  - function to get 4 bits from number at specified offset
+ */
+
+static void radix_sort(void *array, void *array_buf,
+  int num,
+  const void *max_cell,
+  uint64_t (*get_num)(const void *, int i),
+  void (*copy_cell)(void *dest, int dest_i,
+void* src, int src_i),
+  uint8_t (*get4bits)(uint64_t num, int shift))
+{
+   u64 max_num;
+   uint64_t buf_num;
+   uint64_t counters[COUNTERS_SIZE];
+   uint64_t new_addr;
+   int i;
+   int addr;
+   int bitlen;
+   int shift;
+
+   /*
+* Try to avoid useless loop iterations
+* For small numbers stored in big counters
+* example: 48 33 4 ... in 64bit array
+*/
+   if (!max_cell) {
+   max_num = get_num(array, 0);
+   for (i = 1; i < num; i++) {
+ 

Re: [PATCH 0/7] retry write on error

2017-12-03 Thread Peter Grandi
> [ ... ] btrfs incorporates disk management which is actually a
> version of md layer, [ ... ]

As far as I know Btrfs has no disk management, and was wisely
designed without any, just like MD: Btrfs volumes and MD sets
can be composed from "block devices", not disks, and block
devices are quite high level abstractions, as they closely mimic
the semantics of a UNIX file, not a physical device.


Re: again "out of space" and remount read only, with 4.14

2017-12-03 Thread Martin Raiber
On 26.11.2017 at 17:02, Tomasz Chmielewski wrote:
> On 2017-11-27 00:37, Martin Raiber wrote:
>> On 26.11.2017 08:46 Tomasz Chmielewski wrote:
>>> Got this one on a 4.14-rc7 filesystem with some 400 GB left:
>> I guess it is too late now, but I guess the "btrfs fi usage" output of
>> the file system (especially after it went ro) would be useful.
> It was more or less similar as it went ro:
>
> # btrfs fi usage /srv
> Overall:
>     Device size:   5.25TiB
>     Device allocated:  4.45TiB
>     Device unallocated:  823.97GiB
>     Device missing:  0.00B
>     Used:  4.33TiB
>     Free (estimated):    471.91GiB  (min: 471.91GiB)
>     Data ratio:   2.00
>     Metadata ratio:   2.00
>     Global reserve:  512.00MiB  (used: 0.00B)
>
> Unallocated:
>    /dev/sda4 411.99GiB
>    /dev/sdb4 411.99GiB

I wanted to check whether it is the same issue I have, e.g. with 4.14.1
and space_cache=v2:

[153245.341823] BTRFS: error (device loop0) in
btrfs_run_delayed_refs:3089: errno=-28 No space left
[153245.341845] BTRFS: error (device loop0) in btrfs_drop_snapshot:9317:
errno=-28 No space left
[153245.341848] BTRFS info (device loop0): forced readonly
[153245.341972] BTRFS warning (device loop0): Skipping commit of aborted
transaction.
[153245.341975] BTRFS: error (device loop0) in cleanup_transaction:1873:
errno=-28 No space left
# btrfs fi usage /media/backup
Overall:
    Device size:  49.60TiB
    Device allocated: 38.10TiB
    Device unallocated:   11.50TiB
    Device missing:  0.00B
    Used: 36.98TiB
    Free (estimated): 12.59TiB  (min: 12.59TiB)
    Data ratio:   1.00
    Metadata ratio:   1.00
    Global reserve:    2.00GiB  (used: 1.99GiB)

Data,single: Size:37.70TiB, Used:36.61TiB
   /dev/loop0 37.70TiB

Metadata,single: Size:411.01GiB, Used:380.98GiB
   /dev/loop0    411.01GiB

System,single: Size:36.00MiB, Used:4.00MiB
   /dev/loop0 36.00MiB

Unallocated:
   /dev/loop0 11.50TiB

Note that the global reserve is at its maximum. I already increased it in
the code to 2G, and that seems to make this issue appear less often.
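
As a side check on the figures above: the "Free (estimated)" value is consistent with unallocated space plus the slack inside the allocated data chunks, divided by the data ratio. The sketch below encodes that as an assumed model (the helper name `free_estimated_tib()` is hypothetical, and the formula is inferred from the numbers in this output, not quoted from btrfs-progs):

```c
#include <assert.h>

/*
 * Assumed model, not taken from the thread: estimated free space is the
 * unallocated space plus the unused part of the allocated data chunks,
 * both divided by the data ratio. All sizes are in TiB.
 */
double free_estimated_tib(double unallocated, double data_size,
			  double data_used, double data_ratio)
{
	assert(data_ratio > 0.0);
	return (unallocated + (data_size - data_used)) / data_ratio;
}
```

With 11.50TiB unallocated, 37.70TiB of data chunks of which 36.61TiB are used, and a ratio of 1.00, the model gives about 12.59TiB, matching the output. So the filesystem went read-only with plenty of estimated free space, which points at the exhausted global reserve rather than at raw capacity.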

Regards,
Martin Raiber




Re: exclusive subvolume space missing

2017-12-03 Thread Adam Borowski
On Sun, Dec 03, 2017 at 01:45:45AM +, Duncan wrote:
> Tomasz Pala posted on Sat, 02 Dec 2017 18:18:19 +0100 as excerpted:
> >> I got ~500 small files (100-500 kB) updated partially in regular
> >> intervals:
> >> 
> >> # du -Lc **/*.rrd | tail -n1
> >> 105M    total
> 
> FWIW, I've no idea what rrd files, or rrdcached (from the grandparent post)
> are (other than that a quick google suggests that it's...
> round-robin-database...

Basically: preallocate a file, whose size doesn't change from then on.
Every few minutes, write several bytes into the file, slowly advancing.

This is indeed the worst possible case for btrfs, and nocow doesn't help in
the slightest, as the database doesn't wrap around before a typical snapshot
interval.
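
The access pattern described above can be reproduced roughly as follows. This is a sketch of the pattern only, not of rrdtool's actual file format; the helper name `rrd_like_update()` and its parameters are hypothetical, and the file is assumed to have been sized once up front (for example with `ftruncate()`).

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical sketch of an RRD-like updater: the file is sized once,
 * then each update overwrites `len` bytes at an advancing slot that
 * wraps around at the end. Returns 0 on success, -1 on error.
 */
int rrd_like_update(int fd, off_t file_size, off_t *cursor,
		    const void *sample, size_t len)
{
	assert(len > 0 && (off_t)len <= file_size);

	if (*cursor + (off_t)len > file_size)
		*cursor = 0;	/* wrap around: the file never grows */

	if (pwrite(fd, sample, len, *cursor) != (ssize_t)len)
		return -1;

	*cursor += (off_t)len;
	return 0;
}
```

Every call touches only a few bytes, so if a snapshot is taken between two calls, the next small write lands in a block that is shared with the snapshot. That is the situation in which, as noted above, nocow no longer helps.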

> Meanwhile, /because/ nocow has these complexities along with others (nocow
> automatically turns off data checksumming and compression for the files
> too), and the fact that they nullify some of the big reasons people might
> choose btrfs in the first place, I actually don't recommend setting
> nocow in the first place -- if usage is such that a file needs nocow,
> my thinking is that btrfs isn't a particularly good hosting choice for
> that file in the first place, a more traditional rewrite-in-place
> filesystem is likely to be a better fit.

I'd say that the only good use for nocow is "I wish I had placed this file
on a non-btrfs filesystem, but it'd be too much hassle to repartition".

If you snapshot nocow at all, you get the worst of both worlds.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ Mozilla's Hippocritic Oath: "Keep trackers off your trail"
⣾⠁⢰⠒⠀⣿⡁ blah blah evading "tracking technology" blah blah
⢿⡄⠘⠷⠚⠋⠀ "https://click.e.mozilla.org/?qs=e7bb0dcf14b1013fca3820...;
⠈⠳⣄ (same for all links)


Re: [PATCH 5/5] btrfs: Greatly simplify btrfs_read_dev_super

2017-12-03 Thread Nikolay Borisov


On  2.12.2017 01:23, Anand Jain wrote:
> 
> 
> On 12/01/2017 05:19 PM, Nikolay Borisov wrote:
>> Currently this function executes the inner loop at most once due to the
>> i = 0; i < 1 condition. Furthermore, the btrfs_super_generation(super) >
>> transid check in the if condition is never evaluated because latest is
>> always NULL, so the first part of the condition always triggers. The
>> gist of btrfs_read_dev_super is really to read the first superblock.
>>
>> Signed-off-by: Nikolay Borisov 
>> ---
>>   fs/btrfs/disk-io.c | 27 ---
>>   1 file changed, 4 insertions(+), 23 deletions(-)
>>
>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>> index 82c96607fc46..6d5f632fd1e7 100644
>> --- a/fs/btrfs/disk-io.c
>> +++ b/fs/btrfs/disk-io.c
>> @@ -3170,37 +3170,18 @@ int btrfs_read_dev_one_super(struct
>> block_device *bdev, int copy_num,
>>   struct buffer_head *btrfs_read_dev_super(struct block_device *bdev)
>>   {
>>   struct buffer_head *bh;
>> -    struct buffer_head *latest = NULL;
>> -    struct btrfs_super_block *super;
>> -    int i;
>> -    u64 transid = 0;
>> -    int ret = -EINVAL;
>> +    int ret;
>>     /* we would like to check all the supers, but that would make
>>    * a btrfs mount succeed after a mkfs from a different FS.
>>    * So, we need to add a special mount option to scan for
>>    * later supers, using BTRFS_SUPER_MIRROR_MAX instead
>>    */
> 
>  We need below loop to support the above comment at some point,

And when is that, since I don't see anyone working on it. Furthermore
what is it that we are losing in terms of functionality by not
supporting the comment? It seems this code was just slapped here without
any vision how/when to implement it?

Furthermore, you seem to be aware of what the comment is talking about,
I have to admit I'm not. Is the idea that if another filesystem does
mkfs and doesn't overwrite ALL superblock copies that btrfs writes (at
64KiB, 64MiB, 256GiB and 1PiB) then it's possible for this code to
erroneously detect btrfs when in fact there is a different fs?

I don't understand what problem *should* be solved here...
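
For reference, the four offsets listed above (64KiB, 64MiB, 256GiB and 1PiB) follow a simple shift pattern: the primary copy sits at 64KiB, and each further copy is 16KiB shifted left by 12 bits per mirror index. A minimal sketch, assuming that pattern; the helper name `sb_mirror_offset()` and the macro names are hypothetical and only modeled on the kernel's superblock offset helper:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed constants for this sketch (names are not taken from the patch). */
#define SB_PRIMARY_OFFSET  (UINT64_C(64) * 1024)   /* 64KiB */
#define SB_MIRROR_BASE     (UINT64_C(16) * 1024)   /* 16KiB */
#define SB_MIRROR_SHIFT    12

/* Hypothetical helper: byte offset of superblock copy `mirror` (0..3). */
uint64_t sb_mirror_offset(int mirror)
{
	assert(mirror >= 0 && mirror <= 3);

	if (mirror == 0)
		return SB_PRIMARY_OFFSET;

	return SB_MIRROR_BASE << (SB_MIRROR_SHIFT * mirror);
}
```

A mkfs from another filesystem that only rewrites the start of the device would clobber the copy at 64KiB but could leave the ones at 64MiB and beyond intact, which is the misdetection scenario the question describes.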

>  instead of removing I would prefer to fix as per above comments.
> 
> Thanks, Anand
> 
> 
>> -    for (i = 0; i < 1; i++) {
>> -    ret = btrfs_read_dev_one_super(bdev, i, &bh);
>> -    if (ret)
>> -    continue;
>> -
>> -    super = (struct btrfs_super_block *)bh->b_data;
>> -
>> -    if (!latest || btrfs_super_generation(super) > transid) {
>> -    brelse(latest);
>> -    latest = bh;
>> -    transid = btrfs_super_generation(super);
>> -    } else {
>> -    brelse(bh);
>> -    }
>> -    }
>> -
>> -    if (!latest)
>> +    ret = btrfs_read_dev_one_super(bdev, 0, &bh);
>> +    if (ret)
>>   return ERR_PTR(ret);
>>   -    return latest;
>> +    return bh;
>>   }
>>     /*
>>
> 