Re: [PATCH] btrfs: handle dynamically reappearing missing device
Hi Anand,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on btrfs/next]
[also build test ERROR on v4.14 next-20171114]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url: https://github.com/0day-ci/linux/commits/Anand-Jain/btrfs-handle-dynamically-reappearing-missing-device/20171115-143047
base: https://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git next
config: sparc64-allyesconfig (attached as .config)
compiler: sparc64-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=sparc64

All errors (new ones prefixed by >>):

   fs/btrfs/volumes.c: In function 'device_list_add':
>> fs/btrfs/volumes.c:732:10: error: implicit declaration of function 'btrfs_open_one_device'; did you mean 'btrfs_scan_one_device'? [-Werror=implicit-function-declaration]
      ret = btrfs_open_one_device(fs_devices, device, fmode,
            ^
            btrfs_scan_one_device
   cc1: some warnings being treated as errors

vim +732 fs/btrfs/volumes.c

   610
   611  /*
   612   * Add new device to list of registered devices
   613   *
   614   * Returns:
   615   * 1   - first time device is seen
   616   * 0   - device already known
   617   * < 0 - error
   618   */
   619  static noinline int device_list_add(const char *path,
   620                             struct btrfs_super_block *disk_super,
   621                             u64 devid, struct btrfs_fs_devices **fs_devices_ret)
   622  {
   623          struct btrfs_device *device;
   624          struct btrfs_fs_devices *fs_devices;
   625          struct rcu_string *name;
   626          int ret = 0;
   627          u64 found_transid = btrfs_super_generation(disk_super);
   628
   629          fs_devices = find_fsid(disk_super->fsid);
   630          if (!fs_devices) {
   631                  fs_devices = alloc_fs_devices(disk_super->fsid);
   632                  if (IS_ERR(fs_devices))
   633                          return PTR_ERR(fs_devices);
   634
   635                  list_add(&fs_devices->list, &fs_uuids);
   636
   637                  device = NULL;
   638          } else {
   639                  device = __find_device(&fs_devices->devices, devid,
   640                                         disk_super->dev_item.uuid);
   641          }
   642
   643          if (!device) {
   644                  if (fs_devices->opened)
   645                          return -EBUSY;
   646
   647                  device = btrfs_alloc_device(NULL, &devid,
   648                                              disk_super->dev_item.uuid);
   649                  if (IS_ERR(device)) {
   650                          /* we can safely leave the fs_devices entry around */
   651                          return PTR_ERR(device);
   652                  }
   653
   654                  name = rcu_string_strdup(path, GFP_NOFS);
   655                  if (!name) {
   656                          kfree(device);
   657                          return -ENOMEM;
   658                  }
   659                  rcu_assign_pointer(device->name, name);
   660
   661                  mutex_lock(&fs_devices->device_list_mutex);
   662                  list_add_rcu(&device->dev_list, &fs_devices->devices);
   663                  fs_devices->num_devices++;
   664                  mutex_unlock(&fs_devices->device_list_mutex);
   665
   666                  ret = 1;
   667                  device->fs_devices = fs_devices;
   668          } else if (!device->name || strcmp(device->name->str, path)) {
   669                  /*
   670                   * When FS is already mounted.
   671                   * 1. If you are here and if the device->name is NULL that
   672                   *    means this device was missing at time of FS mount.
   673                   * 2. If you are here and if the device->name is different
   674                   *    from 'path' that means either
   675                   *      a. The same device disappeared and reappeared with
   676                   *         different name. or
   677                   *      b. The missing-disk-which-was-replaced, has
   678                   *         reappeared now.
   679                   *
   680                   * We must allow 1 and 2a above. But 2b would be a spurious
   681                   * and unintentional.
   682                   *
   683                   * Further in case of 1 and 2a above, the disk at 'path'
   684                   * would have missed some transaction when it was away and
   685                   * in case of 2a the stale bdev has to be upd
Re: Tiered storage?
As a regular BTRFS user I can tell you that there is no such thing as hot data tracking yet. Some people seem to use bcache together with btrfs and come asking for help on the mailing list.

Raid5/6 have received a few fixes recently, and it *may* soon be worth trying out raid5/6 for data, but keeping metadata in raid1/10 (I would rather lose a file or two than the entire filesystem). I had plans to run some tests on this a while ago, but forgot about it. As all good citizens should, remember to have good backups. Last time I tested raid5/6 I ran into issues easily. For what it's worth, raid1/10 seems pretty rock solid as long as you have sufficient disks (hint: you need more than two for raid1 if you want to stay safe).

As for dedupe, there is (to my knowledge) nothing fully automatic yet. You have to run a program to scan your filesystem, but all the deduplication is done in the kernel. duperemove seemed to work quite well when I tested it, but there may be some performance implications.

Roy Sigurd Karlsbakk wrote:
> Hi all
>
> I've been following this project on and off for quite a few years, and I
> wonder if anyone has looked into tiered storage on it. With tiered storage,
> I mean hot data lying on fast storage and cold data on slow storage. I'm
> not talking about caching (where you just keep a copy of the hot data on
> the fast storage).
>
> And btw, how far is raid[56] and block-level dedup from something useful
> in production?
>
> Vennlig hilsen
>
> roy
> --
> Roy Sigurd Karlsbakk
> (+47) 98013356
> http://blogg.karlsbakk.net/
> GPG Public key: http://karlsbakk.net/roysigurdkarlsbakk.pubkey.txt
> --
> Hið góða skaltu í stein höggva, hið illa í snjó rita.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] btrfs/154: test for device dynamic rescan
On Wed, Nov 15, 2017 at 11:05:15AM +0800, Anand Jain wrote: > Make sure missing device is included in the alloc list when it is > scanned on a mounted FS. > > This test case needs btrfs kernel patch which is in the ML > [PATCH] btrfs: handle dynamically reappearing missing device > Without the kernel patch, the test will run, but reports as > failed, as the device scanned won't appear in the alloc_list. > > Signed-off-by: Anand Jain> --- > v2: Fixed review comments. > tests/btrfs/154 | 186 > > tests/btrfs/154.out | 10 +++ > tests/btrfs/group | 1 + > 3 files changed, 197 insertions(+) > create mode 100755 tests/btrfs/154 > create mode 100644 tests/btrfs/154.out > > diff --git a/tests/btrfs/154 b/tests/btrfs/154 > new file mode 100755 > index ..73a185157389 > --- /dev/null > +++ b/tests/btrfs/154 > @@ -0,0 +1,186 @@ > +#! /bin/bash > +# FS QA Test 154 > +# > +# Test for reappearing missing device functionality. > +# This test will fail without the btrfs kernel patch > +# [PATCH] btrfs: handle dynamically reappearing missing device > +# > +#- > +# Copyright (c) 2017 Oracle. All Rights Reserved. > +# Author: Anand Jain > +# > +# This program is free software; you can redistribute it and/or > +# modify it under the terms of the GNU General Public License as > +# published by the Free Software Foundation. > +# > +# This program is distributed in the hope that it would be useful, > +# but WITHOUT ANY WARRANTY; without even the implied warranty of > +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the > +# GNU General Public License for more details. > +# > +# You should have received a copy of the GNU General Public License > +# along with this program; if not, write the Free Software Foundation, > +# Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA > +#- > +# > + > +seq=`basename $0` > +seqres=$RESULT_DIR/$seq > +echo "QA output created by $seq" > + > +here=`pwd` > +tmp=/tmp/$$ > +status=1 # failure is the default! 
> +trap "_cleanup; exit \$status" 0 1 2 3 15 > + > +_cleanup() > +{ > + cd / > + rm -f $tmp.* > +} > + > +# get standard environment, filters and checks > +. ./common/rc > +. ./common/filter > +. ./common/module > + > +# remove previous $seqres.full before test > +rm -f $seqres.full > + > +# real QA test starts here > + > +_supported_fs btrfs > +_supported_os Linux > +_require_scratch_dev_pool 2 > +_require_loadable_fs_module "btrfs" > + > +_scratch_dev_pool_get 2 > + > +DEV1=`echo $SCRATCH_DEV_POOL | awk '{print $1}'` > +DEV2=`echo $SCRATCH_DEV_POOL | awk '{print $2}'` > + > +echo DEV1=$DEV1 >> $seqres.full > +echo DEV2=$DEV2 >> $seqres.full > + > +# Balance won't be successful if filled too much > +DEV1_SZ=`blockdev --getsize64 $DEV1` > +DEV2_SZ=`blockdev --getsize64 $DEV2` > + > +# get min > +MAX_FS_SZ=`echo -e "$DEV1_SZ\n$DEV2_SZ" | sort | head -1` > +# Need disks with more than 2G > +if [ $MAX_FS_SZ -lt 20 ]; then > + _scratch_dev_pool_put > + _notrun "Smallest dev size $MAX_FS_SZ, Need at least 2G" > +fi > + > +MAX_FS_SZ=1 > +bs="1M" > +COUNT=$(($MAX_FS_SZ / 100)) > +CHECKPOINT1=0 > +CHECKPOINT2=0 > + > +setup() > +{ > + echo >> $seqres.full > + echo "MAX_FS_SZ=$MAX_FS_SZ COUNT=$COUNT" >> $seqres.full > + echo "setup" > + echo "-setup-" >> $seqres.full > + _scratch_pool_mkfs "-mraid1 -draid1" >> $seqres.full 2>&1 > + _scratch_mount >> $seqres.full 2>&1 > + dd if=/dev/urandom of="$SCRATCH_MNT"/tf bs=$bs count=1 \ > + >>$seqres.full 2>&1 > + _run_btrfs_util_prog filesystem show -m ${SCRATCH_MNT} > + _run_btrfs_util_prog filesystem df $SCRATCH_MNT > + COUNT=$(( $COUNT - 1 )) > + echo "unmount" >> $seqres.full > + _scratch_unmount > +} > + > +degrade_mount_write() > +{ > + echo >> $seqres.full > + echo "--degraded mount: max_fs_sz $max_fs_sz bytes--" >> $seqres.full > + echo > + echo "degraded mount" > + > + echo "clean btrfs ko" >> $seqres.full > + # un-scan the btrfs devices > + _reload_fs_module "btrfs" > + _mount -o degraded $DEV1 $SCRATCH_MNT >>$seqres.full 
2>&1 > + cnt=$(( $COUNT/10 )) > + dd if=/dev/urandom of="$SCRATCH_MNT"/tf1 bs=$bs count=$cnt \ > + >>$seqres.full 2>&1 > + COUNT=$(( $COUNT - $cnt )) > + _run_btrfs_util_prog filesystem show -m $SCRATCH_MNT > + _run_btrfs_util_prog filesystem df $SCRATCH_MNT > + CHECKPOINT1=`md5sum $SCRATCH_MNT/tf1` > + echo $SCRATCH_MNT/tf1:$CHECKPOINT1 >> $seqres.full > +} > + > +scan_missing_dev_and_write() > +{ > + echo >> $seqres.full > + echo "--scan missing $DEV2--" >> $seqres.full > + echo > + echo "scan missing
Re: [PATCH] btrfs/154: test for device dynamic rescan
On 11/14/2017 08:12 PM, Eryu Guan wrote: On Mon, Nov 13, 2017 at 10:25:41AM +0800, Anand Jain wrote: Make sure missing device is included in the alloc list when it is scanned on a mounted FS. This test case needs btrfs kernel patch which is in the ML [PATCH] btrfs: handle dynamically reappearing missing device Without the kernel patch, the test will run, but reports as failed, as the device scanned won't appear in the alloc_list. Signed-off-by: Anand Jain

Tested without the fix and the test failed as expected; the test passed after applying the fix. Some minor nits below.

--- tests/btrfs/154 | 188 tests/btrfs/154.out | 10 +++ tests/btrfs/group | 1 + 3 files changed, 199 insertions(+) create mode 100755 tests/btrfs/154 create mode 100644 tests/btrfs/154.out diff --git a/tests/btrfs/154 b/tests/btrfs/154 new file mode 100755 index ..8b06fc4d9347 --- /dev/null +++ b/tests/btrfs/154 @@ -0,0 +1,188 @@ +#! /bin/bash +# FS QA Test 154 +# +# Test for reappearing missing device functionality. +# This test will fail without the btrfs kernel patch +# [PATCH] btrfs: handle dynamically reappearing missing device +# +#- +# Copyright (c) 2017 Oracle. All Rights Reserved. +# Author: Anand Jain +# +# This program is free software; you can redistribute it and/or +# modify it under the terms of the GNU General Public License as +# published by the Free Software Foundation. +# +# This program is distributed in the hope that it would be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. +# +# You should have received a copy of the GNU General Public License +# along with this program; if not, write the Free Software Foundation, +# Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA +#- +# + +seq=`basename $0` +seqres=$RESULT_DIR/$seq +echo "QA output created by $seq" + +here=`pwd` +tmp=/tmp/$$ +status=1 # failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15 + +_cleanup() +{ + cd / + rm -f $tmp.* +} + +# get standard environment, filters and checks +. ./common/rc +. ./common/filter +. ./common/module + +# remove previous $seqres.full before test +rm -f $seqres.full + +# real QA test starts here + +_supported_fs btrfs +_supported_os Linux +_require_scratch_dev_pool 2 +_test_unmount This is not needed now, _require_loadable_fs_module will umount & mount test dev as necessary. Right will fix it. +_require_loadable_fs_module "btrfs" + +_scratch_dev_pool_get 2 + +DEV1=`echo $SCRATCH_DEV_POOL | awk '{print $1}'` +DEV2=`echo $SCRATCH_DEV_POOL | awk '{print $2}'` + +echo DEV1=$DEV1 >> $seqres.full +echo DEV2=$DEV2 >> $seqres.full + +# Balance won't be successful if filled too much +DEV1_SZ=`blockdev --getsize64 $DEV1` +DEV2_SZ=`blockdev --getsize64 $DEV2` + +# get min +MAX_FS_SZ=`echo -e "$DEV1_SZ\n$DEV2_SZ" | sort | head -1` +# Need disks with more than 2G +if [ $MAX_FS_SZ -lt 20 ]; then + _scratch_dev_pool_put + _test_mount Then no need to _test_mount. Fixed this in v2. 
+ _notrun "Smallest dev size $MAX_FS_SZ, Need at least 2G" +fi + +MAX_FS_SZ=1 +bs="1M" +COUNT=$(($MAX_FS_SZ / 100)) +CHECKPOINT1=0 +CHECKPOINT2=0 + +setup() +{ + echo >> $seqres.full + echo "MAX_FS_SZ=$MAX_FS_SZ COUNT=$COUNT" >> $seqres.full + echo "setup" + echo "-setup-" >> $seqres.full + _scratch_pool_mkfs "-mraid1 -draid1" >> $seqres.full 2>&1 + _scratch_mount >> $seqres.full 2>&1 + dd if=/dev/urandom of="$SCRATCH_MNT"/tf bs=$bs count=1 \ + >>$seqres.full 2>&1 + _run_btrfs_util_prog filesystem show -m ${SCRATCH_MNT} + _run_btrfs_util_prog filesystem df $SCRATCH_MNT + COUNT=$(( $COUNT - 1 )) + echo "unmount" >> $seqres.full + _scratch_unmount +} + +degrade_mount_write() +{ + echo >> $seqres.full + echo "--degraded mount: max_fs_sz $max_fs_sz bytes--" >> $seqres.full + echo + echo "degraded mount" + + echo "clean btrfs ko" >> $seqres.full + # un-scan the btrfs devices + _reload_fs_module "btrfs" + _mount -o degraded $DEV1 $SCRATCH_MNT >>$seqres.full 2>&1 + cnt=$(( $COUNT/10 )) + dd if=/dev/urandom of="$SCRATCH_MNT"/tf1 bs=$bs count=$cnt \ + >>$seqres.full 2>&1 + COUNT=$(( $COUNT - $cnt )) + _run_btrfs_util_prog filesystem show -m $SCRATCH_MNT + _run_btrfs_util_prog filesystem df $SCRATCH_MNT + CHECKPOINT1=`md5sum $SCRATCH_MNT/tf1` + echo $SCRATCH_MNT/tf1:$CHECKPOINT1 >> $seqres.full 2>&1 "2>&1" not needed.
[PATCH v2] btrfs/154: test for device dynamic rescan
Make sure missing device is included in the alloc list when it is scanned on a mounted FS. This test case needs btrfs kernel patch which is in the ML [PATCH] btrfs: handle dynamically reappearing missing device Without the kernel patch, the test will run, but reports as failed, as the device scanned won't appear in the alloc_list. Signed-off-by: Anand Jain--- v2: Fixed review comments. tests/btrfs/154 | 186 tests/btrfs/154.out | 10 +++ tests/btrfs/group | 1 + 3 files changed, 197 insertions(+) create mode 100755 tests/btrfs/154 create mode 100644 tests/btrfs/154.out diff --git a/tests/btrfs/154 b/tests/btrfs/154 new file mode 100755 index ..73a185157389 --- /dev/null +++ b/tests/btrfs/154 @@ -0,0 +1,186 @@ +#! /bin/bash +# FS QA Test 154 +# +# Test for reappearing missing device functionality. +# This test will fail without the btrfs kernel patch +# [PATCH] btrfs: handle dynamically reappearing missing device +# +#- +# Copyright (c) 2017 Oracle. All Rights Reserved. +# Author: Anand Jain +# +# This program is free software; you can redistribute it and/or +# modify it under the terms of the GNU General Public License as +# published by the Free Software Foundation. +# +# This program is distributed in the hope that it would be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. +# +# You should have received a copy of the GNU General Public License +# along with this program; if not, write the Free Software Foundation, +# Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA +#- +# + +seq=`basename $0` +seqres=$RESULT_DIR/$seq +echo "QA output created by $seq" + +here=`pwd` +tmp=/tmp/$$ +status=1 # failure is the default! +trap "_cleanup; exit \$status" 0 1 2 3 15 + +_cleanup() +{ + cd / + rm -f $tmp.* +} + +# get standard environment, filters and checks +. ./common/rc +. ./common/filter +. 
./common/module + +# remove previous $seqres.full before test +rm -f $seqres.full + +# real QA test starts here + +_supported_fs btrfs +_supported_os Linux +_require_scratch_dev_pool 2 +_require_loadable_fs_module "btrfs" + +_scratch_dev_pool_get 2 + +DEV1=`echo $SCRATCH_DEV_POOL | awk '{print $1}'` +DEV2=`echo $SCRATCH_DEV_POOL | awk '{print $2}'` + +echo DEV1=$DEV1 >> $seqres.full +echo DEV2=$DEV2 >> $seqres.full + +# Balance won't be successful if filled too much +DEV1_SZ=`blockdev --getsize64 $DEV1` +DEV2_SZ=`blockdev --getsize64 $DEV2` + +# get min +MAX_FS_SZ=`echo -e "$DEV1_SZ\n$DEV2_SZ" | sort | head -1` +# Need disks with more than 2G +if [ $MAX_FS_SZ -lt 20 ]; then + _scratch_dev_pool_put + _notrun "Smallest dev size $MAX_FS_SZ, Need at least 2G" +fi + +MAX_FS_SZ=1 +bs="1M" +COUNT=$(($MAX_FS_SZ / 100)) +CHECKPOINT1=0 +CHECKPOINT2=0 + +setup() +{ + echo >> $seqres.full + echo "MAX_FS_SZ=$MAX_FS_SZ COUNT=$COUNT" >> $seqres.full + echo "setup" + echo "-setup-" >> $seqres.full + _scratch_pool_mkfs "-mraid1 -draid1" >> $seqres.full 2>&1 + _scratch_mount >> $seqres.full 2>&1 + dd if=/dev/urandom of="$SCRATCH_MNT"/tf bs=$bs count=1 \ + >>$seqres.full 2>&1 + _run_btrfs_util_prog filesystem show -m ${SCRATCH_MNT} + _run_btrfs_util_prog filesystem df $SCRATCH_MNT + COUNT=$(( $COUNT - 1 )) + echo "unmount" >> $seqres.full + _scratch_unmount +} + +degrade_mount_write() +{ + echo >> $seqres.full + echo "--degraded mount: max_fs_sz $max_fs_sz bytes--" >> $seqres.full + echo + echo "degraded mount" + + echo "clean btrfs ko" >> $seqres.full + # un-scan the btrfs devices + _reload_fs_module "btrfs" + _mount -o degraded $DEV1 $SCRATCH_MNT >>$seqres.full 2>&1 + cnt=$(( $COUNT/10 )) + dd if=/dev/urandom of="$SCRATCH_MNT"/tf1 bs=$bs count=$cnt \ + >>$seqres.full 2>&1 + COUNT=$(( $COUNT - $cnt )) + _run_btrfs_util_prog filesystem show -m $SCRATCH_MNT + _run_btrfs_util_prog filesystem df $SCRATCH_MNT + CHECKPOINT1=`md5sum $SCRATCH_MNT/tf1` + echo $SCRATCH_MNT/tf1:$CHECKPOINT1 >> 
$seqres.full +} + +scan_missing_dev_and_write() +{ + echo >> $seqres.full + echo "--scan missing $DEV2--" >> $seqres.full + echo + echo "scan missing dev and write" + + _run_btrfs_util_prog device scan $DEV2 + + echo >> $seqres.full + + _run_btrfs_util_prog filesystem show -m ${SCRATCH_MNT} + _run_btrfs_util_prog filesystem df ${SCRATCH_MNT} + + dd if=/dev/urandom of="$SCRATCH_MNT"/tf2 bs=$bs
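A side note on the min-of-two-sizes step in the test above: it uses `sort | head -1`, but byte counts with different digit lengths compare incorrectly under lexical sort, so `sort -n` is the safe spelling. A minimal sketch (the device sizes are hypothetical, standing in for `blockdev --getsize64` output):

```shell
# hypothetical sizes a 2TB and a 6TB drive might report via `blockdev --getsize64`
DEV1_SZ=2000398934016
DEV2_SZ=6001175126016

# numeric sort guarantees the true minimum even when digit counts differ
MIN_SZ=$(printf '%s\n%s\n' "$DEV1_SZ" "$DEV2_SZ" | sort -n | head -1)
echo "smallest device: $MIN_SZ bytes"
```

With equal digit counts, plain `sort` happens to agree; compare e.g. a 999GB size against a 6TB one and it would not.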
Re: A partially failing disk in raid0 needs replacement
On Tue, Nov 14, 2017 at 1:36 AM, Klaus Agnoletti wrote:
> Btrfs v3.17

Unrelated to the problem but this is pretty old.

> Linux box 3.16.0-4-amd64 #1 SMP Debian 3.16.43-2+deb8u5 (2017-09-19) x86_64 GNU/Linux

Also pretty old kernel.

> klaus@box:~$ sudo btrfs --version
> Btrfs v3.17
> klaus@box:~$ sudo btrfs fi df /mnt
> Data, RAID0: total=5.34TiB, used=5.14TiB
> System, RAID0: total=96.00MiB, used=384.00KiB
> Metadata, RAID0: total=7.22GiB, used=5.82GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B

The central two problems: failing hardware, and no copies of metadata. By default, mkfs.btrfs does -draid0 -mraid1 for multiple device volumes. Explicitly making metadata raid0 basically means it's a disposable file system the instant there's a problem.

What do you get for smartctl -l scterc /dev/

If you're lucky, this is really short. If it is something like 7 seconds, there's a chance the data in this sector can be recovered with a longer recovery time set by the drive *and* also setting the kernel's SCSI command timer to a value higher than 30 seconds (to match whatever you pick for the drive's error timeout). I'd pull something out of my ass like 60 seconds, or hell why not 120 seconds, for both. Maybe then there won't be a UNC error and you can quickly catch up your backups at the least.

But before trying device removal again, assuming changing the error timeout to be higher is possible, the first thing I'd do is convert metadata to raid1. Then remove the bad device.

--
Chris Murphy
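The advice above pairs two knobs that use different units: the drive's SCT ERC limit (set in tenths of a second via smartctl) and the kernel's SCSI command timer (whole seconds, via sysfs). A hedged sketch of keeping them consistent — the device name and the chosen values are illustrative, not a recommendation:

```shell
# Drive side: cap error recovery at 7.0s for reads and writes
# (argument is in tenths of a second; needs SCT ERC support):
#   smartctl -l scterc,70,70 /dev/sdX
# Kernel side: make the SCSI layer wait longer than the drive's limit:
#   echo 120 > /sys/block/sdX/device/timeout
# The hardware-touching commands stay as comments; the unit check below runs anywhere.

erc_deciseconds=70   # what we would tell the drive (7.0 s)
kernel_timeout=120   # what we would tell the kernel (seconds)

# convert deciseconds to seconds, rounding up
erc_seconds=$(( (erc_deciseconds + 9) / 10 ))
echo "drive gives up after ${erc_seconds}s, kernel waits ${kernel_timeout}s"

# the kernel must outwait the drive, or it resets the link mid-recovery
if [ "$kernel_timeout" -gt "$erc_seconds" ]; then
    echo "timeouts are consistent"
fi
```

The direction of the inequality is the whole point: a 30s kernel timer combined with a drive that retries for minutes means the kernel gives up first and the sector is never recovered.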
Re: A partially failing disk in raid0 needs replacement
On Tue, Nov 14, 2017 at 5:48 AM, Roman Mamedov wrote:
> On Tue, 14 Nov 2017 10:36:22 +0200
> Klaus Agnoletti wrote:
>
>> Obviously, I want /dev/sdd emptied and deleted from the raid.
>
> * Unmount the RAID0 FS
> * copy the bad drive using `dd_rescue`[1] into a file on the 6TB drive
>   (noting how much of it is actually unreadable -- chances are it's mostly intact)

This almost certainly will not work now; the delete command has copied metadata to the 6TB drive, so it would have to be removed first to remove that metadata, and Btrfs's record of that member device to avoid it being considered missing, and also any chunks successfully copied over.

--
Chris Murphy
Re: A partially failing disk in raid0 needs replacement
On Tue, Nov 14, 2017 at 5:38 AM, Adam Borowski wrote:
> On Tue, Nov 14, 2017 at 10:36:22AM +0200, Klaus Agnoletti wrote:
>> I used to have 3x2TB in a btrfs in raid0. A few weeks ago, one of the
>> 2TB disks started giving me I/O errors in dmesg like this:
>>
>> [388659.188988] Add. Sense: Unrecovered read error - auto reallocate failed
>
> Alas, chances to recover anything are pretty slim. That's RAID0 metadata
> for you.
>
> On the other hand, losing any non-trivial file while being able to gape at
> intact metadata isn't that much better, thus -mraid0 isn't completely
> unreasonable.

I don't know the statistics on UNC read error vs total drive failure. If I thought that total drive failure was 2x or more likely than a single UNC then maybe raid0 is reasonable. But it's a 64KB block size for raid0. I think metadata raid0 probably doesn't offer that much performance improvement over raid1, and if it did, that's a case for raid10 metadata.

In the UNC case, chances are it hits a data extent of a single file, in which case Btrfs can handle this fine; you just lose that one file. And if it hits the smaller target of metadata, it's fine if metadata is raid1 or raid10.

In a previous email in the archives, I did a test where I intentionally formatted one member drive of a Btrfs data raid0, metadata raid1, and it was totally recoverable, with a bunch of scary messages, and sometimes a file was corrupted. So it actually is pretty darn resilient when there is a copy of metadata. (I did not try DUP.)

--
Chris Murphy
Tiered storage?
Hi all

I've been following this project on and off for quite a few years, and I wonder if anyone has looked into tiered storage on it. With tiered storage, I mean hot data lying on fast storage and cold data on slow storage. I'm not talking about caching (where you just keep a copy of the hot data on the fast storage).

And btw, how far is raid[56] and block-level dedup from something useful in production?

Vennlig hilsen

roy
--
Roy Sigurd Karlsbakk
(+47) 98013356
http://blogg.karlsbakk.net/
GPG Public key: http://karlsbakk.net/roysigurdkarlsbakk.pubkey.txt
--
Hið góða skaltu í stein höggva, hið illa í snjó rita.
[PATCH 10/10] btrfs: rework end io for extent buffer reads
From: Josef Bacik

Now that the only thing that keeps eb's alive is io_pages and its refcount we need to hold the eb ref for the entire end io call so we don't get it removed out from underneath us. Also the hooks make no sense for us now, so rework this to be cleaner. Signed-off-by: Josef Bacik --- fs/btrfs/disk-io.c | 63 fs/btrfs/disk-io.h | 1 + fs/btrfs/extent_io.c | 67 +++- 3 files changed, 41 insertions(+), 90 deletions(-) diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 7ccb6d839126..459491d662a0 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -755,33 +755,13 @@ static int check_node(struct btrfs_root *root, struct extent_buffer *node) return ret; } -static int btree_readpage_end_io_hook(struct btrfs_io_bio *io_bio, - u64 phy_offset, struct page *page, - u64 start, u64 end, int mirror) +int btrfs_extent_buffer_end_read(struct extent_buffer *eb, int mirror) { + struct btrfs_fs_info *fs_info = eb->eb_info->fs_info; + struct btrfs_root *root = fs_info->tree_root; u64 found_start; int found_level; - struct extent_buffer *eb; - struct btrfs_root *root; - struct btrfs_fs_info *fs_info; int ret = 0; - int reads_done; - - if (!page->private) - goto out; - - eb = (struct extent_buffer *)page->private; - - /* the pending IO might have been the only thing that kept this buffer -* in memory.
Make sure we have a ref for all this other checks -*/ - extent_buffer_get(eb); - fs_info = eb->eb_info->fs_info; - root = fs_info->tree_root; - - reads_done = atomic_dec_and_test(&eb->io_pages); - if (!reads_done) - goto err; eb->read_mirror = mirror; if (test_bit(EXTENT_BUFFER_READ_ERR, &eb->bflags)) { @@ -833,45 +813,14 @@ static int btree_readpage_end_io_hook(struct btrfs_io_bio *io_bio, if (!ret) set_extent_buffer_uptodate(eb); err: - if (reads_done && - test_and_clear_bit(EXTENT_BUFFER_READAHEAD, &eb->bflags)) + if (test_and_clear_bit(EXTENT_BUFFER_READAHEAD, &eb->bflags)) btree_readahead_hook(eb, ret); - if (ret) { - /* -* our io error hook is going to dec the io pages -* again, we have to make sure it has something -* to decrement. -* -* TODO: Kill this, we've re-arranged how this works now so we -* don't need to do this io_pages dance. -*/ - atomic_inc(&eb->io_pages); + if (ret) clear_extent_buffer_uptodate(eb); - } - if (reads_done) { - clear_bit(EXTENT_BUFFER_READING, &eb->bflags); - smp_mb__after_atomic(); - wake_up_bit(&eb->bflags, EXTENT_BUFFER_READING); - } - free_extent_buffer(eb); -out: return ret; } -static int btree_io_failed_hook(struct page *page, int failed_mirror) -{ - struct extent_buffer *eb; - - eb = (struct extent_buffer *)page->private; - set_bit(EXTENT_BUFFER_READ_ERR, &eb->bflags); - eb->read_mirror = failed_mirror; - atomic_dec(&eb->io_pages); - if (test_and_clear_bit(EXTENT_BUFFER_READAHEAD, &eb->bflags)) - btree_readahead_hook(eb, -EIO); - return -EIO;/* we fixed nothing */ -} - static void end_workqueue_bio(struct bio *bio) { struct btrfs_end_io_wq *end_io_wq = bio->bi_private; @@ -4553,9 +4502,7 @@ static int btree_merge_bio_hook(struct page *page, unsigned long offset, static const struct extent_io_ops btree_extent_io_ops = { /* mandatory callbacks */ .submit_bio_hook = btree_submit_bio_hook, - .readpage_end_io_hook = btree_readpage_end_io_hook, .merge_bio_hook = btree_merge_bio_hook, - .readpage_io_failed_hook = btree_io_failed_hook, .set_range_writeback =
btrfs_set_range_writeback, .tree_fs_info = btree_fs_info, diff --git a/fs/btrfs/disk-io.h b/fs/btrfs/disk-io.h index 7f7c35d6347a..e1f4fef91547 100644 --- a/fs/btrfs/disk-io.h +++ b/fs/btrfs/disk-io.h @@ -152,6 +152,7 @@ int btree_lock_page_hook(struct page *page, void *data, int btrfs_get_num_tolerated_disk_barrier_failures(u64 flags); int __init btrfs_end_io_wq_init(void); void btrfs_end_io_wq_exit(void); +int btrfs_extent_buffer_end_read(struct extent_buffer *eb, int mirror); #ifdef CONFIG_DEBUG_LOCK_ALLOC void btrfs_init_lockdep(void); diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index 2077bd6ad1b3..1e5affee0f7e 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -20,6 +20,7 @@ #include "locking.h" #include "rcu-string.h" #include "backref.h" +#include "disk-io.h" static struct kmem_cache
[PATCH 03/10] lib: add a batch size to fprop_global
From: Josef Bacik

The flexible proportion stuff has been used to track how many pages we are writing out over a period of time, so counts everything in single increments. If we wanted to use another base value we need to be able to adjust the batch size to fit the units we'll be using for the proportions. Signed-off-by: Josef Bacik --- include/linux/flex_proportions.h | 4 +++- lib/flex_proportions.c | 11 +-- 2 files changed, 8 insertions(+), 7 deletions(-) diff --git a/include/linux/flex_proportions.h b/include/linux/flex_proportions.h index 0d348e011a6e..853f4305d1b2 100644 --- a/include/linux/flex_proportions.h +++ b/include/linux/flex_proportions.h @@ -20,7 +20,7 @@ */ #define FPROP_FRAC_SHIFT 10 #define FPROP_FRAC_BASE (1UL << FPROP_FRAC_SHIFT) - +#define FPROP_BATCH_SIZE (8*(1+ilog2(nr_cpu_ids))) /* * Global proportion definitions */ @@ -31,6 +31,8 @@ struct fprop_global { unsigned int period; /* Synchronization with period transitions */ seqcount_t sequence; + /* batch size */ + s32 batch_size; }; int fprop_global_init(struct fprop_global *p, gfp_t gfp); diff --git a/lib/flex_proportions.c b/lib/flex_proportions.c index 2cc1f94e03a1..5552523b663a 100644 --- a/lib/flex_proportions.c +++ b/lib/flex_proportions.c @@ -44,6 +44,7 @@ int fprop_global_init(struct fprop_global *p, gfp_t gfp) if (err) return err; seqcount_init(&p->sequence); + p->batch_size = FPROP_BATCH_SIZE; return 0; } @@ -166,8 +167,6 @@ void fprop_fraction_single(struct fprop_global *p, /* * PERCPU */ -#define PROP_BATCH (8*(1+ilog2(nr_cpu_ids))) - int fprop_local_init_percpu(struct fprop_local_percpu *pl, gfp_t gfp) { int err; @@ -204,11 +203,11 @@ static void fprop_reflect_period_percpu(struct fprop_global *p, if (period - pl->period < BITS_PER_LONG) { s64 val = percpu_counter_read(&pl->events); - if (val < (nr_cpu_ids * PROP_BATCH)) + if (val < (nr_cpu_ids * p->batch_size)) val = percpu_counter_sum(&pl->events); percpu_counter_add_batch(&pl->events, - -val + (val >> (period-pl->period)), PROP_BATCH); +
-val + (val >> (period-pl->period)), p->batch_size); } else percpu_counter_set(&pl->events, 0); pl->period = period; } @@ -219,7 +218,7 @@ static void fprop_reflect_period_percpu(struct fprop_global *p, void __fprop_inc_percpu(struct fprop_global *p, struct fprop_local_percpu *pl) { fprop_reflect_period_percpu(p, pl); - percpu_counter_add_batch(&pl->events, 1, PROP_BATCH); + percpu_counter_add_batch(&pl->events, 1, p->batch_size); percpu_counter_add(&p->events, 1); } @@ -267,6 +266,6 @@ void __fprop_inc_percpu_max(struct fprop_global *p, return; } else fprop_reflect_period_percpu(p, pl); - percpu_counter_add_batch(&pl->events, 1, PROP_BATCH); + percpu_counter_add_batch(&pl->events, 1, p->batch_size); percpu_counter_add(&p->events, 1); } -- 2.7.5
[PATCH 09/10] Btrfs: kill the btree_inode
From: Josef Bacik

In order to more efficiently support sub-page blocksizes we need to stop allocating pages from pagecache for our metadata. Instead switch to using the account_metadata* counters for making sure we are keeping the system aware of how much dirty metadata we have, and use the ->free_cached_objects super operation in order to handle freeing up extent buffers. This greatly simplifies how we deal with extent buffers as now we no longer have to tie the page cache reclamation stuff to the extent buffer stuff. This will also allow us to simply kmalloc() our data for sub-page blocksizes. Signed-off-by: Josef Bacik --- fs/btrfs/btrfs_inode.h | 1 - fs/btrfs/ctree.c | 18 +- fs/btrfs/ctree.h | 17 +- fs/btrfs/dir-item.c| 2 +- fs/btrfs/disk-io.c | 385 -- fs/btrfs/extent-tree.c | 14 +- fs/btrfs/extent_io.c | 919 ++--- fs/btrfs/extent_io.h | 51 +- fs/btrfs/inode.c | 6 +- fs/btrfs/print-tree.c | 13 +- fs/btrfs/reada.c | 2 +- fs/btrfs/root-tree.c | 2 +- fs/btrfs/super.c | 31 +- fs/btrfs/tests/btrfs-tests.c | 36 +- fs/btrfs/tests/extent-buffer-tests.c | 3 +- fs/btrfs/tests/extent-io-tests.c | 4 +- fs/btrfs/tests/free-space-tree-tests.c | 3 +- fs/btrfs/tests/inode-tests.c | 4 +- fs/btrfs/tests/qgroup-tests.c | 3 +- fs/btrfs/transaction.c | 13 +- 20 files changed, 757 insertions(+), 770 deletions(-) diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h index f9c6887a8b6c..24582650622d 100644 --- a/fs/btrfs/btrfs_inode.h +++ b/fs/btrfs/btrfs_inode.h @@ -241,7 +241,6 @@ static inline u64 btrfs_ino(const struct btrfs_inode *inode) u64 ino = inode->location.objectid; /* -* !ino: btree_inode * type == BTRFS_ROOT_ITEM_KEY: subvol dir */ if (!ino || inode->location.type == BTRFS_ROOT_ITEM_KEY) diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c index 531e0a8645b0..3c6610b5d0d3 100644 --- a/fs/btrfs/ctree.c +++ b/fs/btrfs/ctree.c @@ -1361,7 +1361,8 @@ tree_mod_log_rewind(struct btrfs_fs_info *fs_info, struct btrfs_path *path, if (tm->op == MOD_LOG_KEY_REMOVE_WHILE_FREEING)
{ BUG_ON(tm->slot != 0); - eb_rewin = alloc_dummy_extent_buffer(fs_info, eb->start); + eb_rewin = alloc_dummy_extent_buffer(fs_info->eb_info, +eb->start, eb->len); if (!eb_rewin) { btrfs_tree_read_unlock_blocking(eb); free_extent_buffer(eb); @@ -1444,7 +1445,8 @@ get_old_root(struct btrfs_root *root, u64 time_seq) } else if (old_root) { btrfs_tree_read_unlock(eb_root); free_extent_buffer(eb_root); - eb = alloc_dummy_extent_buffer(fs_info, logical); + eb = alloc_dummy_extent_buffer(root->fs_info->eb_info, logical, + root->fs_info->nodesize); } else { btrfs_set_lock_blocking_rw(eb_root, BTRFS_READ_LOCK); eb = btrfs_clone_extent_buffer(eb_root); @@ -1675,7 +1677,7 @@ int btrfs_realloc_node(struct btrfs_trans_handle *trans, continue; } - cur = find_extent_buffer(fs_info, blocknr); + cur = find_extent_buffer(fs_info->eb_info, blocknr); if (cur) uptodate = btrfs_buffer_uptodate(cur, gen, 0); else @@ -1748,7 +1750,7 @@ static noinline int generic_bin_search(struct extent_buffer *eb, int err; if (low > high) { - btrfs_err(eb->fs_info, + btrfs_err(eb->eb_info->fs_info, "%s: low (%d) > high (%d) eb %llu owner %llu level %d", __func__, low, high, eb->start, btrfs_header_owner(eb), btrfs_header_level(eb)); @@ -2260,7 +2262,7 @@ static void reada_for_search(struct btrfs_fs_info *fs_info, search = btrfs_node_blockptr(node, slot); blocksize = fs_info->nodesize; - eb = find_extent_buffer(fs_info, search); + eb = find_extent_buffer(fs_info->eb_info, search); if (eb) { free_extent_buffer(eb); return; @@ -2319,7 +2321,7 @@ static noinline void reada_for_balance(struct btrfs_fs_info *fs_info, if (slot > 0) { block1 = btrfs_node_blockptr(parent, slot - 1); gen = btrfs_node_ptr_generation(parent, slot - 1); - eb = find_extent_buffer(fs_info, block1); + eb = find_extent_buffer(fs_info->eb_info,
[PATCH 07/10] writeback: introduce super_operations->write_metadata
From: Josef BacikNow that we have metadata counters in the VM, we need to provide a way to kick writeback on dirty metadata. Introduce super_operations->write_metadata. This allows file systems to deal with writing back any dirty metadata we need based on the writeback needs of the system. Since there is no inode to key off of we need a list in the bdi for dirty super blocks to be added. From there we can find any dirty sb's on the bdi we are currently doing writeback on and call into their ->write_metadata callback. Signed-off-by: Josef Bacik Reviewed-by: Jan Kara Reviewed-by: Tejun Heo --- fs/fs-writeback.c| 72 fs/super.c | 6 include/linux/backing-dev-defs.h | 2 ++ include/linux/fs.h | 4 +++ mm/backing-dev.c | 2 ++ 5 files changed, 80 insertions(+), 6 deletions(-) diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c index 987448ed7698..fba703dff678 100644 --- a/fs/fs-writeback.c +++ b/fs/fs-writeback.c @@ -1479,6 +1479,31 @@ static long writeback_chunk_size(struct bdi_writeback *wb, return pages; } +static long writeback_sb_metadata(struct super_block *sb, + struct bdi_writeback *wb, + struct wb_writeback_work *work) +{ + struct writeback_control wbc = { + .sync_mode = work->sync_mode, + .tagged_writepages = work->tagged_writepages, + .for_kupdate= work->for_kupdate, + .for_background = work->for_background, + .for_sync = work->for_sync, + .range_cyclic = work->range_cyclic, + .range_start= 0, + .range_end = LLONG_MAX, + }; + long write_chunk; + + write_chunk = writeback_chunk_size(wb, work); + wbc.nr_to_write = write_chunk; + sb->s_op->write_metadata(sb, ); + work->nr_pages -= write_chunk - wbc.nr_to_write; + + return write_chunk - wbc.nr_to_write; +} + + /* * Write a portion of b_io inodes which belong to @sb. 
* @@ -1505,6 +1530,7 @@ static long writeback_sb_inodes(struct super_block *sb, unsigned long start_time = jiffies; long write_chunk; long wrote = 0; /* count both pages and inodes */ + bool done = false; while (!list_empty(>b_io)) { struct inode *inode = wb_inode(wb->b_io.prev); @@ -1621,12 +1647,18 @@ static long writeback_sb_inodes(struct super_block *sb, * background threshold and other termination conditions. */ if (wrote) { - if (time_is_before_jiffies(start_time + HZ / 10UL)) - break; - if (work->nr_pages <= 0) + if (time_is_before_jiffies(start_time + HZ / 10UL) || + work->nr_pages <= 0) { + done = true; break; + } } } + if (!done && sb->s_op->write_metadata) { + spin_unlock(>list_lock); + wrote += writeback_sb_metadata(sb, wb, work); + spin_lock(>list_lock); + } return wrote; } @@ -1635,6 +1667,7 @@ static long __writeback_inodes_wb(struct bdi_writeback *wb, { unsigned long start_time = jiffies; long wrote = 0; + bool done = false; while (!list_empty(>b_io)) { struct inode *inode = wb_inode(wb->b_io.prev); @@ -1654,12 +1687,39 @@ static long __writeback_inodes_wb(struct bdi_writeback *wb, /* refer to the same tests at the end of writeback_sb_inodes */ if (wrote) { - if (time_is_before_jiffies(start_time + HZ / 10UL)) - break; - if (work->nr_pages <= 0) + if (time_is_before_jiffies(start_time + HZ / 10UL) || + work->nr_pages <= 0) { + done = true; break; + } } } + + if (!done && wb_stat(wb, WB_METADATA_DIRTY_BYTES)) { + LIST_HEAD(list); + + spin_unlock(>list_lock); + spin_lock(>bdi->sb_list_lock); + list_splice_init(>bdi->dirty_sb_list, ); + while (!list_empty()) { + struct super_block *sb; + + sb = list_first_entry(, struct super_block, + s_bdi_dirty_list); + list_move_tail(>s_bdi_dirty_list, + >bdi->dirty_sb_list); + if
[PATCH 06/10] writeback: add counters for metadata usage
From: Josef BacikBtrfs has no bounds except memory on the amount of dirty memory that we have in use for metadata. Historically we have used a special inode so we could take advantage of the balance_dirty_pages throttling that comes with using pagecache. However as we'd like to support different blocksizes it would be nice to not have to rely on pagecache, but still get the balance_dirty_pages throttling without having to do it ourselves. So introduce *METADATA_DIRTY_BYTES and *METADATA_WRITEBACK_BYTES. These are zone and bdi_writeback counters to keep track of how many bytes we have in flight for METADATA. We need to count in bytes as blocksizes could be percentages of pagesize. We simply convert the bytes to number of pages where it is needed for the throttling. Also introduce NR_METADATA_BYTES so we can keep track of the total amount of pages used for metadata on the system. This is also needed so things like dirty throttling know that this is dirtyable memory as well and easily reclaimed. 
Signed-off-by: Josef Bacik Reviewed-by: Jan Kara --- drivers/base/node.c | 8 +++ fs/fs-writeback.c| 2 + fs/proc/meminfo.c| 8 +++ include/linux/backing-dev-defs.h | 2 + include/linux/mm.h | 9 +++ include/linux/mmzone.h | 3 + include/trace/events/writeback.h | 13 +++- mm/backing-dev.c | 4 ++ mm/page-writeback.c | 141 +++ mm/page_alloc.c | 20 -- mm/util.c| 1 + mm/vmscan.c | 19 +- mm/vmstat.c | 3 + 13 files changed, 211 insertions(+), 22 deletions(-) diff --git a/drivers/base/node.c b/drivers/base/node.c index 3855902f2c5b..a39cecc8957a 100644 --- a/drivers/base/node.c +++ b/drivers/base/node.c @@ -51,6 +51,8 @@ static DEVICE_ATTR(cpumap, S_IRUGO, node_read_cpumask, NULL); static DEVICE_ATTR(cpulist, S_IRUGO, node_read_cpulist, NULL); #define K(x) ((x) << (PAGE_SHIFT - 10)) +#define BtoK(x) ((x) >> 10) + static ssize_t node_read_meminfo(struct device *dev, struct device_attribute *attr, char *buf) { @@ -99,7 +101,10 @@ static ssize_t node_read_meminfo(struct device *dev, #endif n += sprintf(buf + n, "Node %d Dirty: %8lu kB\n" + "Node %d MetadataDirty: %8lu kB\n" "Node %d Writeback: %8lu kB\n" + "Node %d MetaWriteback: %8lu kB\n" + "Node %d Metadata: %8lu kB\n" "Node %d FilePages: %8lu kB\n" "Node %d Mapped: %8lu kB\n" "Node %d AnonPages: %8lu kB\n" @@ -119,8 +124,11 @@ static ssize_t node_read_meminfo(struct device *dev, #endif , nid, K(node_page_state(pgdat, NR_FILE_DIRTY)), + nid, BtoK(node_page_state(pgdat, NR_METADATA_DIRTY_BYTES)), nid, K(node_page_state(pgdat, NR_WRITEBACK)), + nid, BtoK(node_page_state(pgdat, NR_METADATA_WRITEBACK_BYTES)), nid, K(node_page_state(pgdat, NR_FILE_PAGES)), + nid, BtoK(node_page_state(pgdat, NR_METADATA_BYTES)), nid, K(node_page_state(pgdat, NR_FILE_MAPPED)), nid, K(node_page_state(pgdat, NR_ANON_MAPPED)), nid, K(i.sharedram), diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c index 245c430a2e41..987448ed7698 100644 --- a/fs/fs-writeback.c +++ b/fs/fs-writeback.c @@ -1814,6 +1814,7 @@ static struct wb_writeback_work 
*get_next_work_item(struct bdi_writeback *wb) return work; } +#define BtoP(x) ((x) >> PAGE_SHIFT) /* * Add in the number of potentially dirty inodes, because each inode * write can dirty pagecache in the underlying blockdev. @@ -1822,6 +1823,7 @@ static unsigned long get_nr_dirty_pages(void) { return global_node_page_state(NR_FILE_DIRTY) + global_node_page_state(NR_UNSTABLE_NFS) + + BtoP(global_node_page_state(NR_METADATA_DIRTY_BYTES)) + get_nr_dirty_inodes(); } diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c index cdd979724c74..fa1fd24a4d99 100644 --- a/fs/proc/meminfo.c +++ b/fs/proc/meminfo.c @@ -42,6 +42,8 @@ static void show_val_kb(struct seq_file *m, const char *s, unsigned long num) seq_write(m, " kB\n", 4); } +#define BtoP(x) ((x) >> PAGE_SHIFT) + static int meminfo_proc_show(struct seq_file *m, void *v) { struct sysinfo i; @@ -71,6 +73,8 @@ static int meminfo_proc_show(struct seq_file *m, void *v) show_val_kb(m, "Buffers:", i.bufferram); show_val_kb(m, "Cached: ", cached); show_val_kb(m, "SwapCached: ", total_swapcache_pages()); +
[PATCH 08/10] export radix_tree_iter_tag_set
From: Josef Bacik

We use this in btrfs for metadata writeback.

Acked-by: Matthew Wilcox
Signed-off-by: Josef Bacik
---
 lib/radix-tree.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 8b1feca1230a..0c1cde9fcb69 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -1459,6 +1459,7 @@ void radix_tree_iter_tag_set(struct radix_tree_root *root,
 {
 	node_tag_set(root, iter->node, tag, iter_offset(iter));
 }
+EXPORT_SYMBOL(radix_tree_iter_tag_set);
 
 static void node_tag_clear(struct radix_tree_root *root,
 			   struct radix_tree_node *node,
-- 
2.7.5
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 05/10] writeback: convert the flexible prop stuff to bytes
From: Josef BacikThe flexible proportions were all page based, but now that we are doing metadata writeout that can be smaller or larger than page size we need to account for this in bytes instead of number of pages. Signed-off-by: Josef Bacik --- mm/backing-dev.c| 2 +- mm/page-writeback.c | 19 --- 2 files changed, 13 insertions(+), 8 deletions(-) diff --git a/mm/backing-dev.c b/mm/backing-dev.c index 62a332a91b38..e0d7c62dc0ad 100644 --- a/mm/backing-dev.c +++ b/mm/backing-dev.c @@ -832,7 +832,7 @@ static int bdi_init(struct backing_dev_info *bdi) kref_init(>refcnt); bdi->min_ratio = 0; bdi->max_ratio = 100; - bdi->max_prop_frac = FPROP_FRAC_BASE; + bdi->max_prop_frac = FPROP_FRAC_BASE * PAGE_SIZE; INIT_LIST_HEAD(>bdi_list); INIT_LIST_HEAD(>wb_list); init_waitqueue_head(>wb_waitq); diff --git a/mm/page-writeback.c b/mm/page-writeback.c index e4563645749a..c491dee711a8 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -574,11 +574,11 @@ static unsigned long wp_next_time(unsigned long cur_time) return cur_time; } -static void wb_domain_writeout_inc(struct wb_domain *dom, +static void wb_domain_writeout_add(struct wb_domain *dom, struct fprop_local_percpu *completions, - unsigned int max_prop_frac) + long bytes, unsigned int max_prop_frac) { - __fprop_inc_percpu_max(>completions, completions, + __fprop_add_percpu_max(>completions, completions, bytes, max_prop_frac); /* First event after period switching was turned off? 
*/ if (unlikely(!dom->period_time)) { @@ -602,12 +602,12 @@ static inline void __wb_writeout_add(struct bdi_writeback *wb, long bytes) struct wb_domain *cgdom; __add_wb_stat(wb, WB_WRITTEN_BYTES, bytes); - wb_domain_writeout_inc(_wb_domain, >completions, + wb_domain_writeout_add(_wb_domain, >completions, bytes, wb->bdi->max_prop_frac); cgdom = mem_cgroup_wb_domain(wb); if (cgdom) - wb_domain_writeout_inc(cgdom, wb_memcg_completions(wb), + wb_domain_writeout_add(cgdom, wb_memcg_completions(wb), bytes, wb->bdi->max_prop_frac); } @@ -646,6 +646,7 @@ static void writeout_period(unsigned long t) int wb_domain_init(struct wb_domain *dom, gfp_t gfp) { + int ret; memset(dom, 0, sizeof(*dom)); spin_lock_init(>lock); @@ -655,7 +656,10 @@ int wb_domain_init(struct wb_domain *dom, gfp_t gfp) dom->dirty_limit_tstamp = jiffies; - return fprop_global_init(>completions, gfp); + ret = fprop_global_init(>completions, gfp); + if (!ret) + dom->completions.batch_size *= PAGE_SIZE; + return ret; } #ifdef CONFIG_CGROUP_WRITEBACK @@ -706,7 +710,8 @@ int bdi_set_max_ratio(struct backing_dev_info *bdi, unsigned max_ratio) ret = -EINVAL; } else { bdi->max_ratio = max_ratio; - bdi->max_prop_frac = (FPROP_FRAC_BASE * max_ratio) / 100; + bdi->max_prop_frac = ((FPROP_FRAC_BASE * max_ratio) / 100) * + PAGE_SIZE; } spin_unlock_bh(_lock); -- 2.7.5 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 01/10] remove mapping from balance_dirty_pages*()
From: Josef BacikThe only reason we pass in the mapping is to get the inode in order to see if writeback cgroups is enabled, and even then it only checks the bdi and a super block flag. balance_dirty_pages() doesn't even use the mapping. Since balance_dirty_pages*() works on a bdi level, just pass in the bdi and super block directly so we can avoid using mapping. This will allow us to still use balance_dirty_pages for dirty metadata pages that are not backed by an address_mapping. Signed-off-by: Josef Bacik Reviewed-by: Jan Kara --- drivers/mtd/devices/block2mtd.c | 12 fs/btrfs/disk-io.c | 3 ++- fs/btrfs/file.c | 3 ++- fs/btrfs/ioctl.c| 3 ++- fs/btrfs/relocation.c | 3 ++- fs/buffer.c | 3 ++- fs/iomap.c | 6 -- fs/ntfs/attrib.c| 11 --- fs/ntfs/file.c | 4 ++-- include/linux/backing-dev.h | 29 +++-- include/linux/writeback.h | 4 +++- mm/filemap.c| 4 +++- mm/memory.c | 5 - mm/page-writeback.c | 15 +++ 14 files changed, 72 insertions(+), 33 deletions(-) diff --git a/drivers/mtd/devices/block2mtd.c b/drivers/mtd/devices/block2mtd.c index 7c887f111a7d..7892d0b9fcb0 100644 --- a/drivers/mtd/devices/block2mtd.c +++ b/drivers/mtd/devices/block2mtd.c @@ -52,7 +52,8 @@ static struct page *page_read(struct address_space *mapping, int index) /* erase a specified part of the device */ static int _block2mtd_erase(struct block2mtd_dev *dev, loff_t to, size_t len) { - struct address_space *mapping = dev->blkdev->bd_inode->i_mapping; + struct inode *inode = dev->blkdev->bd_inode; + struct address_space *mapping = inode->i_mapping; struct page *page; int index = to >> PAGE_SHIFT; // page index int pages = len >> PAGE_SHIFT; @@ -71,7 +72,8 @@ static int _block2mtd_erase(struct block2mtd_dev *dev, loff_t to, size_t len) memset(page_address(page), 0xff, PAGE_SIZE); set_page_dirty(page); unlock_page(page); - balance_dirty_pages_ratelimited(mapping); + balance_dirty_pages_ratelimited(inode_to_bdi(inode), + inode->i_sb); break; } @@ -141,7 +143,8 @@ static int _block2mtd_write(struct 
block2mtd_dev *dev, const u_char *buf, loff_t to, size_t len, size_t *retlen) { struct page *page; - struct address_space *mapping = dev->blkdev->bd_inode->i_mapping; + struct inode *inode = dev->blkdev->bd_inode; + struct address_space *mapping = inode->i_mapping; int index = to >> PAGE_SHIFT; // page index int offset = to & ~PAGE_MASK; // page offset int cpylen; @@ -162,7 +165,8 @@ static int _block2mtd_write(struct block2mtd_dev *dev, const u_char *buf, memcpy(page_address(page) + offset, buf, cpylen); set_page_dirty(page); unlock_page(page); - balance_dirty_pages_ratelimited(mapping); + balance_dirty_pages_ratelimited(inode_to_bdi(inode), + inode->i_sb); } put_page(page); diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 689b9913ccb5..8b6df7688d52 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -4150,7 +4150,8 @@ static void __btrfs_btree_balance_dirty(struct btrfs_fs_info *fs_info, ret = percpu_counter_compare(_info->dirty_metadata_bytes, BTRFS_DIRTY_METADATA_THRESH); if (ret > 0) { - balance_dirty_pages_ratelimited(fs_info->btree_inode->i_mapping); + balance_dirty_pages_ratelimited(fs_info->sb->s_bdi, + fs_info->sb); } } diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c index ab1c38f2dd8c..4bc6cd6509be 100644 --- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -1779,7 +1779,8 @@ static noinline ssize_t __btrfs_buffered_write(struct file *file, cond_resched(); - balance_dirty_pages_ratelimited(inode->i_mapping); + balance_dirty_pages_ratelimited(inode_to_bdi(inode), + inode->i_sb); if (dirty_pages < (fs_info->nodesize >> PAGE_SHIFT) + 1) btrfs_btree_balance_dirty(fs_info); diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index 6a07d4e12fd2..ec92fb5e2b51 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -1368,7 +1368,8 @@ int
[PATCH 04/10] lib: add a __fprop_add_percpu_max
From: Josef BacikThis helper allows us to add an arbitrary amount to the fprop structures. Signed-off-by: Josef Bacik --- include/linux/flex_proportions.h | 11 +-- lib/flex_proportions.c | 9 + 2 files changed, 14 insertions(+), 6 deletions(-) diff --git a/include/linux/flex_proportions.h b/include/linux/flex_proportions.h index 853f4305d1b2..2d1a87331e5d 100644 --- a/include/linux/flex_proportions.h +++ b/include/linux/flex_proportions.h @@ -85,8 +85,8 @@ struct fprop_local_percpu { int fprop_local_init_percpu(struct fprop_local_percpu *pl, gfp_t gfp); void fprop_local_destroy_percpu(struct fprop_local_percpu *pl); void __fprop_inc_percpu(struct fprop_global *p, struct fprop_local_percpu *pl); -void __fprop_inc_percpu_max(struct fprop_global *p, struct fprop_local_percpu *pl, - int max_frac); +void __fprop_add_percpu_max(struct fprop_global *p, struct fprop_local_percpu *pl, + unsigned long nr, int max_frac); void fprop_fraction_percpu(struct fprop_global *p, struct fprop_local_percpu *pl, unsigned long *numerator, unsigned long *denominator); @@ -101,4 +101,11 @@ void fprop_inc_percpu(struct fprop_global *p, struct fprop_local_percpu *pl) local_irq_restore(flags); } +static inline +void __fprop_inc_percpu_max(struct fprop_global *p, + struct fprop_local_percpu *pl, int max_frac) +{ + __fprop_add_percpu_max(p, pl, 1, max_frac); +} + #endif diff --git a/lib/flex_proportions.c b/lib/flex_proportions.c index 5552523b663a..2190180a81fe 100644 --- a/lib/flex_proportions.c +++ b/lib/flex_proportions.c @@ -254,8 +254,9 @@ void fprop_fraction_percpu(struct fprop_global *p, * Like __fprop_inc_percpu() except that event is counted only if the given * type has fraction smaller than @max_frac/FPROP_FRAC_BASE */ -void __fprop_inc_percpu_max(struct fprop_global *p, - struct fprop_local_percpu *pl, int max_frac) +void __fprop_add_percpu_max(struct fprop_global *p, + struct fprop_local_percpu *pl, unsigned long nr, + int max_frac) { if (unlikely(max_frac < FPROP_FRAC_BASE)) { 
unsigned long numerator, denominator; @@ -266,6 +267,6 @@ void __fprop_inc_percpu_max(struct fprop_global *p, return; } else fprop_reflect_period_percpu(p, pl); - percpu_counter_add_batch(>events, 1, p->batch_size); - percpu_counter_add(>events, 1); + percpu_counter_add_batch(>events, nr, p->batch_size); + percpu_counter_add(>events, nr); } -- 2.7.5 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 02/10] writeback: convert WB_WRITTEN/WB_DIRTIED counters to bytes
From: Josef BacikThese are counters that constantly go up in order to do bandwidth calculations. It isn't important what the units are in, as long as they are consistent between the two of them, so convert them to count bytes written/dirtied, and allow the metadata accounting stuff to change the counters as well. Signed-off-by: Josef Bacik Acked-by: Tejun Heo --- fs/fuse/file.c | 4 ++-- include/linux/backing-dev-defs.h | 4 ++-- include/linux/backing-dev.h | 2 +- mm/backing-dev.c | 9 + mm/page-writeback.c | 20 ++-- 5 files changed, 20 insertions(+), 19 deletions(-) diff --git a/fs/fuse/file.c b/fs/fuse/file.c index cb7dff5c45d7..67e7c4fac28d 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -1471,7 +1471,7 @@ static void fuse_writepage_finish(struct fuse_conn *fc, struct fuse_req *req) for (i = 0; i < req->num_pages; i++) { dec_wb_stat(>wb, WB_WRITEBACK); dec_node_page_state(req->pages[i], NR_WRITEBACK_TEMP); - wb_writeout_inc(>wb); + wb_writeout_add(>wb, PAGE_SIZE); } wake_up(>page_waitq); } @@ -1776,7 +1776,7 @@ static bool fuse_writepage_in_flight(struct fuse_req *new_req, dec_wb_stat(>wb, WB_WRITEBACK); dec_node_page_state(page, NR_WRITEBACK_TEMP); - wb_writeout_inc(>wb); + wb_writeout_add(>wb, PAGE_SIZE); fuse_writepage_free(fc, new_req); fuse_request_free(new_req); goto out; diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h index 866c433e7d32..ded45ac2cec7 100644 --- a/include/linux/backing-dev-defs.h +++ b/include/linux/backing-dev-defs.h @@ -36,8 +36,8 @@ typedef int (congested_fn)(void *, int); enum wb_stat_item { WB_RECLAIMABLE, WB_WRITEBACK, - WB_DIRTIED, - WB_WRITTEN, + WB_DIRTIED_BYTES, + WB_WRITTEN_BYTES, NR_WB_STAT_ITEMS }; diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h index 14e266d12620..39b8dc486ea7 100644 --- a/include/linux/backing-dev.h +++ b/include/linux/backing-dev.h @@ -89,7 +89,7 @@ static inline s64 wb_stat_sum(struct bdi_writeback *wb, enum wb_stat_item item) return 
percpu_counter_sum_positive(>stat[item]); } -extern void wb_writeout_inc(struct bdi_writeback *wb); +extern void wb_writeout_add(struct bdi_writeback *wb, long bytes); /* * maximal error of a stat counter. diff --git a/mm/backing-dev.c b/mm/backing-dev.c index e19606bb41a0..62a332a91b38 100644 --- a/mm/backing-dev.c +++ b/mm/backing-dev.c @@ -68,14 +68,15 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v) wb_thresh = wb_calc_thresh(wb, dirty_thresh); #define K(x) ((x) << (PAGE_SHIFT - 10)) +#define BtoK(x) ((x) >> 10) seq_printf(m, "BdiWriteback: %10lu kB\n" "BdiReclaimable: %10lu kB\n" "BdiDirtyThresh: %10lu kB\n" "DirtyThresh:%10lu kB\n" "BackgroundThresh: %10lu kB\n" - "BdiDirtied: %10lu kB\n" - "BdiWritten: %10lu kB\n" + "BdiDirtiedBytes:%10lu kB\n" + "BdiWrittenBytes:%10lu kB\n" "BdiWriteBandwidth: %10lu kBps\n" "b_dirty:%10lu\n" "b_io: %10lu\n" @@ -88,8 +89,8 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v) K(wb_thresh), K(dirty_thresh), K(background_thresh), - (unsigned long) K(wb_stat(wb, WB_DIRTIED)), - (unsigned long) K(wb_stat(wb, WB_WRITTEN)), + (unsigned long) BtoK(wb_stat(wb, WB_DIRTIED_BYTES)), + (unsigned long) BtoK(wb_stat(wb, WB_WRITTEN_BYTES)), (unsigned long) K(wb->write_bandwidth), nr_dirty, nr_io, diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 1a47d4296750..e4563645749a 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -597,11 +597,11 @@ static void wb_domain_writeout_inc(struct wb_domain *dom, * Increment @wb's writeout completion count and the global writeout * completion count. Called from test_clear_page_writeback(). 
*/ -static inline void __wb_writeout_inc(struct bdi_writeback *wb) +static inline void __wb_writeout_add(struct bdi_writeback *wb, long bytes) { struct wb_domain *cgdom; - inc_wb_stat(wb, WB_WRITTEN); + __add_wb_stat(wb, WB_WRITTEN_BYTES, bytes); wb_domain_writeout_inc(_wb_domain, >completions, wb->bdi->max_prop_frac); @@ -611,15 +611,15 @@ static inline void __wb_writeout_inc(struct bdi_writeback *wb)
Re: Need help with incremental backup strategy (snapshots, defragmenting & performance)
On Tue, Nov 14, 2017 at 3:50 AM, Roman Mamedov wrote:
>
> On Mon, 13 Nov 2017 22:39:44 -0500
> Dave wrote:
>
> > I have my live system on one block device and a backup snapshot of it
> > on another block device. I am keeping them in sync with hourly rsync
> > transfers.
> >
> > Here's how this system works in a little more detail:
> >
> > 1. I establish the baseline by sending a full snapshot to the backup
> > block device using btrfs send-receive.
> > 2. Next, on the backup device I immediately create a rw copy of that
> > baseline snapshot.
> > 3. I delete the source snapshot to keep the live filesystem free of
> > all snapshots (so it can be optimally defragmented, etc.)
> > 4. hourly, I take a snapshot of the live system, rsync all changes to
> > the backup block device, and then delete the source snapshot. This
> > hourly process takes less than a minute currently. (My test system has
> > only moderate usage.)
> > 5. hourly, following the above step, I use snapper to take a snapshot
> > of the backup subvolume to create/preserve a history of changes. For
> > example, I can find the version of a file 30 hours prior.
>
> Sounds a bit complex, I still don't get why you need all these snapshot
> creations and deletions, and even still using btrfs send-receive.

Hopefully, my comments below will explain my reasons.

> Here is my scheme:
>
> /mnt/dst <- mounted backup storage volume
> /mnt/dst/backup <- a subvolume
> /mnt/dst/backup/host1/ <- rsync destination for host1, regular directory
> /mnt/dst/backup/host2/ <- rsync destination for host2, regular directory
> /mnt/dst/backup/host3/ <- rsync destination for host3, regular directory
> etc.
>
> /mnt/dst/backup/host1/bin/
> /mnt/dst/backup/host1/etc/
> /mnt/dst/backup/host1/home/
> ...
> Self explanatory. All regular directories, not subvolumes.
>
> Snapshots:
> /mnt/dst/snaps/backup <- a regular directory
> /mnt/dst/snaps/backup/2017-11-14T12:00/ <- snapshot 1 of /mnt/dst/backup
> /mnt/dst/snaps/backup/2017-11-14T13:00/ <- snapshot 2 of /mnt/dst/backup
> /mnt/dst/snaps/backup/2017-11-14T14:00/ <- snapshot 3 of /mnt/dst/backup
>
> Accessing historic data:
> /mnt/dst/snaps/backup/2017-11-14T12:00/host1/bin/bash
> ...
> /bin/bash for host1 as of 2017-11-14 12:00 (time on the backup system).
>
> No need for btrfs send-receive, only plain rsync is used, directly from
> hostX:/ to /mnt/dst/backup/host1/;

I prefer to start with a BTRFS snapshot at the backup destination. I
think that's the most "accurate" starting point.

> No need to create or delete snapshots during the actual backup process;

Then you can't guarantee consistency of the backed up information.

> A single common timeline is kept for all hosts to be backed up, snapshot
> count not multiplied by the number of hosts (in my case the backup
> location is multi-purpose, so I somewhat care about total number of
> snapshots there as well);
>
> Also, all of this works even with source hosts which do not use Btrfs.

That's not a concern for me because I prefer to use BTRFS everywhere.
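The hourly cycle described in steps 4 and 5 above can be sketched as a short script. All paths, the config name, and the rsync flags are illustrative, not taken from the thread; `run` only echoes each command so the plan can be reviewed before anything is executed for real.

```shell
#!/bin/sh
# Sketch of one hourly backup cycle from the scheme above.
# 'run' echoes the command instead of executing it (dry run).
run() { echo "+ $*"; }

LIVE=/mnt/live            # live btrfs filesystem (kept snapshot-free)
SNAP=/mnt/live/.hourly    # throwaway read-only snapshot of the live fs
DST=/mnt/backup/baseline  # rw copy of the baseline on the backup device

# step 4: snapshot the live system, rsync the changes, drop the snapshot
run btrfs subvolume snapshot -r "$LIVE" "$SNAP"
run rsync -aHAX --delete "$SNAP/" "$DST/"
run btrfs subvolume delete "$SNAP"

# step 5: let snapper preserve the hourly history on the backup side
run snapper -c backup create --description hourly
```

Rsyncing from a read-only snapshot rather than the live tree is what gives the consistency guarantee argued for in the reply above.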
Re: how to repair or access broken btrfs?
On 14.11.2017 at 18:45, Andrei Borzenkov wrote:
> 14.11.2017 12:56, Stefan Priebe - Profihost AG wrote:
>> Hello,
>>
>> after a controller firmware bug / failure i've a broken btrfs.
>>
>> # parent transid verify failed on 181846016 wanted 143404 found 143399
>>
>> running repair, fsck or zero-log always results in the same failure
>> message:
>> extent-tree.c:2725: alloc_reserved_tree_block: BUG_ON `ret` triggered,
>> value -1
>> .. stack trace ..
>>
>> Is there a chance to get at least a single file out of the broken fs?
>
> Did you try "btrfs restore"?

Great, that worked for that file. Still wondering why a repair is not
possible.

Greets,
Stefan
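For readers hitting the same situation: `btrfs restore` reads an unmountable filesystem without writing to it and copies files out. A hedged sketch of pulling a single file, as suggested in the thread — the device name, destination, and regex are examples, and `run` only echoes the commands (check `btrfs restore --help` on your version before running anything):

```shell
#!/bin/sh
# Illustrative only: 'run' echoes the commands rather than executing them.
run() { echo "+ $*"; }

DEV=/dev/sdX1            # the broken filesystem (example device)
OUT=/mnt/recovery        # destination on a healthy filesystem

# -D (dry run) lists what would be restored without writing anything
run btrfs restore -D "$DEV" "$OUT"

# restore one file; --path-regex takes an anchored regex that must match
# every directory level, hence the nested (|...) groups
run btrfs restore -i --path-regex '^/(|etc(|/fstab))$' "$DEV" "$OUT"
```

Unlike `btrfs check --repair`, this never modifies the damaged filesystem, which is why it can succeed where repair hits a BUG_ON.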
Re: Read before you deploy btrfs + zstd
David Sterba - 14.11.17, 19:49:
> On Tue, Nov 14, 2017 at 08:34:37AM +0100, Martin Steigerwald wrote:
> > Hello David.
> >
> > David Sterba - 13.11.17, 23:50:
> > > while 4.14 is still fresh, let me address some concerns I've seen on
> > > linux forums already.
> > >
> > > The newly added ZSTD support is a feature that has broader impact than
> > > just the runtime compression. The btrfs-progs understand filesystem
> > > with ZSTD since 4.13. The remaining key part is the bootloader.
> > >
> > > Up to now, there are no bootloaders supporting ZSTD. This could lead
> > > to an unmountable filesystem if the critical files under /boot get
> > > accidentally or intentionally compressed by ZSTD.
> >
> > But otherwise ZSTD is safe to use? Are you aware of any other issues?
>
> No issues from my own testing or reported by other users.

Thanks to you and the others. I think I try this soon.

Thanks,
-- 
Martin
Re: Read before you deploy btrfs + zstd
On Mon, Nov 13, 2017 at 11:50:46PM +0100, David Sterba wrote:
> Up to now, there are no bootloaders supporting ZSTD.

I've tried to implement the support to GRUB, still incomplete and hacky
but most of the code is there. The ZSTD implementation is copied from
kernel. The allocators need to be properly set up, as it needs to use
grub_malloc/grub_free for the workspace that's called from some ZSTD_*
functions.

https://github.com/kdave/grub/tree/btrfs-zstd
Re: Read before you deploy btrfs + zstd
On Tue, Nov 14, 2017 at 08:34:37AM +0100, Martin Steigerwald wrote:
> Hello David.
>
> David Sterba - 13.11.17, 23:50:
> > while 4.14 is still fresh, let me address some concerns I've seen on
> > linux forums already.
> >
> > The newly added ZSTD support is a feature that has broader impact than
> > just the runtime compression. The btrfs-progs understand filesystem
> > with ZSTD since 4.13. The remaining key part is the bootloader.
> >
> > Up to now, there are no bootloaders supporting ZSTD. This could lead
> > to an unmountable filesystem if the critical files under /boot get
> > accidentally or intentionally compressed by ZSTD.
>
> But otherwise ZSTD is safe to use? Are you aware of any other issues?

No issues from my own testing or reported by other users.
Re: how to repair or access broken btrfs?
14.11.2017 12:56, Stefan Priebe - Profihost AG wrote:
> Hello,
>
> after a controller firmware bug / failure i've a broken btrfs.
>
> # parent transid verify failed on 181846016 wanted 143404 found 143399
>
> running repair, fsck or zero-log always results in the same failure
> message:
> extent-tree.c:2725: alloc_reserved_tree_block: BUG_ON `ret` triggered,
> value -1
> .. stack trace ..
>
> Is there a chance to get at least a single file out of the broken fs?

Did you try "btrfs restore"?
Re: A partially failing disk in raid0 needs replacement
Hi Roman,

If you look at the 'show' command, the failing disk is sorta out of the
fs, so maybe removing the 6TB disk again will divide the data already on
the 6TB disk (which isn't more than 300something gigs) to the 2
well-functioning disks.

Still, as putting the dd-image of the 2TB disk on the temporary disk is
only temporary, I do need one more 2TB+ disk attached to create a more
permanent btrfs with the 6TB disk (which is what I eventually want). And
for that I need some more harddisk power cables/splitters. And another
disk. But that still seems to be the best option, so I will do that once
I have those things sorted out.

Thanks for your creative suggestion :)

/klaus

On Tue, Nov 14, 2017 at 4:44 PM, Roman Mamedov wrote:
> On Tue, 14 Nov 2017 15:09:52 +0100
> Klaus Agnoletti wrote:
>
>> Hi Roman
>>
>> I almost understand :-) - however, I need a bit more information:
>>
>> How do I copy the image file to the 6TB without screwing the existing
>> btrfs up when the fs is not mounted? Should I remove it from the raid
>> again?
>
> Oh, you already added it to your FS, that's so unfortunate. For my
> scenario I assumed you have a spare 6TB (or any 2TB+) disk you can use
> as temporary space.
>
> You could try removing it, but with one of the existing member drives
> malfunctioning, I wonder if trying any operation on that FS will cause
> further damage. For example if you remove the 6TB one, how do you
> prevent Btrfs from using the bad 2TB drive as destination to relocate
> data from the 6TB drive. Or use it for one of the metadata mirrors,
> which will fail to write properly, leading into transid failures later,
> etc.
>
> --
> With respect,
> Roman

-- 
Klaus Agnoletti
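The two options weighed in this thread — shrinking the array back down versus swapping the failing member out — map onto two btrfs commands. A hedged sketch with made-up device names; `run` only echoes the commands, and on a real system you would check `btrfs replace status` and the kernel log before trusting the result:

```shell
#!/bin/sh
# Illustrative only: 'run' echoes the commands rather than executing them.
run() { echo "+ $*"; }

MNT=/mnt/pool        # the mounted filesystem (example)
BAD=/dev/sdb         # failing member (example)
NEW=/dev/sdd         # fresh replacement disk (example)
EXTRA=/dev/sdc       # the recently added disk (example)

# Option 1: shrink the array again, relocating its data to the
# remaining members (the risk discussed above: relocation may write
# to the failing drive)
run btrfs device remove "$EXTRA" "$MNT"

# Option 2: copy the failing device onto a new one in place; -r avoids
# reading the source device when another copy of the data exists
run btrfs replace start -r "$BAD" "$NEW" "$MNT"
run btrfs replace status "$MNT"
```

For raid0 there is no redundant copy to fall back on, so `replace` still has to read the failing disk for the stripes it holds; it just avoids rewriting them through the bad drive.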
Btrfs progs pre-release 4.14-rc1
Hi, a pre-release has been tagged. Changes: * build: libzstd now required by default * check: more lowmem mode repair enhancements * subvol set-default: also accept path * prop set: compression accepts no/none, same as "" * filesystem usage: enable for filesystem on top of a seed device * rescue: new command fix-device-size * other * new tests * cleanups and refactoring * doc updates ETA for 4.14 is in +2 days (2017-11-16). Mailinglist patch backlog has grown again, I'll have to do more minor releases to get the features and fixes merged. No concrete plans for now, some patchsets are almost ready so they'll probably go first. Tarballs: https://www.kernel.org/pub/linux/kernel/people/kdave/btrfs-progs/ Git: git://git.kernel.org/pub/scm/linux/kernel/git/kdave/btrfs-progs.git Shortlog: Baruch Siach (1): btrfs-progs: convert: add missing types header Benjamin Peterson (1): btrfs-progs: docs: correct grammar David Sterba (24): btrfs-progs: help: print multiple syntax schemas on separate lines btrfs-progs: prop: also allow "none" to disable compression btrfs-progs: docs: update btrfs-properties btrfs-progs: image: move metadump definitions to own header btrfs-progs: build: use variables for btrfs-image images btrfs-progs: image: start a new header for sanitization functions btrfs-progs: image: introduce symbolic names for the sanitization modes btrfs-progs: image: pass rb_root to find_collisions btrfs-progs: image: drop unused parameter from sanitize_xattr btrfs-progs: image: pass sanitize mode and name tree separately to sanitize_inode_ref btrfs-progs: image: pass sanitize mode and name tree separately to sanitize_dir_item btrfs-progs: image: pass sanitize mode and name tree separately to sanitize_name btrfs-progs: image: move sanitization to new file btrfs-progs: don't use __u8 for fsid buffers btrfs-progs: tests: don't pass size to prepare_test_dev if not necessary btrfs-progs: tests: extend fsck/028 to test fix-device-size and mount btrfs-progs: docs: update mount 
options btrfs-progs: docs: add impact of atime/noatime btrfs-progs: docs: add note about mount option applicability btrfs-progs: build: require libzstd support by default btrfs-progs: build: mention library dependency for reiserfs btrfs-progs: docs: move the rescue fix-device-size command and update btrfs-progs: update CHANGES for v4.14 Btrfs progs v4.14-rc1 Lakshmipathi.G (1): btrfs-progs: tests/common: Display warning only after searching for btrfs kernel module Liu Bo (1): btrfs-progs: do not add stale device into fs_devices Lu Fengqi (7): btrfs-progs: qgroup: fix qgroup show sort by multi items btrfs-progs: test: Add test image for lowmem mode file extent interrupt btrfs-progs: lowmem check: Output more detailed information about file extent interrupt btrfs-progs: lowmem check: Fix false alert about referencer count mismatch btrfs-progs: test: Add test image for lowmem mode referencer count mismatch false alert btrfs-progs: qgroup: cleanup the redundant function add_qgroup btrfs-progs: qgroup: split update_qgroup to reduce arguments Misono, Tomohiro (6): btrfs-progs: subvol: change set-default to also accept path btrfs-progs: test: add new cli-test for subvol get/set-default btrfs-progs: fi: move dev_to_fsid to cmds-fi-usage for later use btrfs-progs: fi: enable fi usage for filesystem on top of seed device btrfs-progs: device: add description of alias to help message btrfs-progs: doc: add description of missing and example, of device remove Pavel Kretov (1): btrfs-progs: defrag: add a brief warning about ref-link breakage Qu Wenruo (14): btrfs-progs: tests: Allow check test to repair in lowmem mode for certain errors btrfs-progs: mkfs: avoid BUG_ON for chunk allocation when ENOSPC happens btrfs-progs: mkfs: avoid positive return value from cleanup_temp_chunks btrfs-progs: mkfs: fix overwritten return value for mkfs btrfs-progs: mkfs: error out gracefully for --rootdir btrfs-progs: convert: Open the fs readonly for rollback btrfs-progs: mkfs: refactor 
test_minimum_size to use the calculated minimal size btrfs-progs: rescue: Fix zero-log mounted branch btrfs-progs: Introduce function to fix unaligned device size btrfs-progs: Introduce function to fix super block total bytes btrfs-progs: rescue: Introduce fix-device-size btrfs-progs: check: Also check and repair unaligned/mismatch device and super sizes btrfs-progs: tests/fsck: Add test case image for 'rescue fix-dev-size' btrfs-progs: print-tree: Print offset as tree objectid for ROOT_ITEM Satoru Takeuchi (1): btrfs-progs: allow "no" to disable compression for convenience Su Yue (28): btrfs-progs:
Re: [GIT PULL] Btrfs changes for 4.15
On Tue, Nov 14, 2017 at 07:39:11AM +0800, Qu Wenruo wrote: > > - extend mount options to specify zlib compression level, -o compress=zlib:9 > > However the support for it has a big problem, it will cause wild memory > access for the "-o compress" mount option. > > Kernel ASAN can detect it easily and we already have a user report about > it. Btrfs/026 could also easily trigger it. > > The fixing patch was submitted some days ago: > https://patchwork.kernel.org/patch/10042553/ > > And the default compression level when not specified is zero, which > means no compression but direct memory copy. This fix will go in the next pull request. Thanks.
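For reference, the new mount-option syntax being discussed looks like this from the command line (device and mount point are placeholders; a sketch, not output from a tested setup):

```shell
# Explicit zlib compression level (1-9) on a btrfs mount
mount -o compress=zlib:9 /dev/sdX /mnt

# The option can also be changed on an already-mounted filesystem
mount -o remount,compress=zlib:9 /mnt
```

Only newly written data is affected by a change of compression option; existing extents keep whatever compression they were written with.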
Re: A partially failing disk in raid0 needs replacement
On Tue, 14 Nov 2017 15:09:52 +0100 Klaus Agnoletti wrote: > Hi Roman > > I almost understand :-) - however, I need a bit more information: > > How do I copy the image file to the 6TB without screwing the existing > btrfs up when the fs is not mounted? Should I remove it from the raid > again? Oh, you already added it to your FS, that's so unfortunate. For my scenario I assumed you have a spare 6TB (or any 2TB+) disk you can use as temporary space. You could try removing it, but with one of the existing member drives malfunctioning, I wonder if trying any operation on that FS will cause further damage. For example if you remove the 6TB one, how do you prevent Btrfs from using the bad 2TB drive as destination to relocate data from the 6TB drive. Or use it for one of the metadata mirrors, which will fail to write properly, leading into transid failures later, etc. -- With respect, Roman
Re: A partially failing disk in raid0 needs replacement
On Tue, 14 Nov 2017 17:48:56 +0500, Roman Mamedov wrote: > [1] Note that "ddrescue" and "dd_rescue" are two different programs > for the same purpose, one may work better than the other. I don't > remember which. :) One is a perl implementation and is the one working worse. ;-) -- Regards, Kai Replies to list-only preferred.
Re: A partially failing disk in raid0 needs replacement
Hi Austin Good points. Thanks a lot. /klaus On Tue, Nov 14, 2017 at 2:14 PM, Austin S. Hemmelgarnwrote: > On 2017-11-14 03:36, Klaus Agnoletti wrote: >> >> Hi list >> >> I used to have 3x2TB in a btrfs in raid0. A few weeks ago, one of the >> 2TB disks started giving me I/O errors in dmesg like this: >> >> [388659.173819] ata5.00: exception Emask 0x0 SAct 0x7fff SErr 0x0 >> action 0x0 >> [388659.175589] ata5.00: irq_stat 0x4008 >> [388659.177312] ata5.00: failed command: READ FPDMA QUEUED >> [388659.179045] ata5.00: cmd 60/20:60:80:96:95/00:00:c4:00:00/40 tag >> 12 ncq 1638 >> 4 in >> res 51/40:1c:84:96:95/00:00:c4:00:00/40 Emask 0x409 (media >> error) >> [388659.182552] ata5.00: status: { DRDY ERR } >> [388659.184303] ata5.00: error: { UNC } >> [388659.188899] ata5.00: configured for UDMA/133 >> [388659.188956] sd 4:0:0:0: [sdd] Unhandled sense code >> [388659.188960] sd 4:0:0:0: [sdd] >> [388659.188962] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE >> [388659.188965] sd 4:0:0:0: [sdd] >> [388659.188967] Sense Key : Medium Error [current] [descriptor] >> [388659.188970] Descriptor sense data with sense descriptors (in hex): >> [388659.188972] 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00 >> [388659.188981] c4 95 96 84 >> [388659.188985] sd 4:0:0:0: [sdd] >> [388659.188988] Add. Sense: Unrecovered read error - auto reallocate >> failed >> [388659.188991] sd 4:0:0:0: [sdd] CDB: >> [388659.188992] Read(10): 28 00 c4 95 96 80 00 00 20 00 >> [388659.189000] end_request: I/O error, dev sdd, sector 3298137732 >> [388659.190740] BTRFS: bdev /dev/sdd errs: wr 0, rd 3120, flush 0, >> corrupt 0, ge >> n 0 >> [388659.192556] ata5: EH complete > > Just some background, but this error is usually indicative of either media > degradation from long-term usage, or a head crash. 
>> >> >> At the same time, I started getting mails from smartd: >> >> Device: /dev/sdd [SAT], 2 Currently unreadable (pending) sectors >> Device info: >> Hitachi HDS723020BLA642, S/N:MN1220F30MNHUD, WWN:5-000cca-369c8f00b, >> FW:MN6OA580, 2.00 TB >> >> For details see host's SYSLOG. > > And this correlates with the above errors (although the current pending > sectors being non-zero is less specific than the above). >> >> >> To fix it, it ended up with me adding a new 6TB disk and trying to >> delete the failing 2TB disks. >> >> That didn't go so well; apparently, the delete command aborts when >> ever it encounters I/O errors. So now my raid0 looks like this: > > I'm not going to comment on how to fix the current situation, as what has > been stated in other people's replies pretty well covers that. > > I would however like to mention two things for future reference: > > 1. The delete command handles I/O errors just fine, provided that there is > some form of redundancy in the filesystem. In your case, if this had been a > raid1 array instead of raid0, then the delete command would have just fallen > back to the other copy of the data when it hit an I/O error instead of > dying. Just like a regular RAID0 array (be it LVM, MD, or hardware), you > can't lose a device in a BTRFS raid0 array without losing the array. > > 2. While it would not have helped in this case, the preferred method when > replacing a device is to use the `btrfs replace` command. It's a lot more > efficient than add+delete (and exponentially more efficient than > delete+add), and also a bit safer (in both cases because it needs to move > less data). The only down-side to it is that you may need a couple of > resize commands around it. 
> >> >> klaus@box:~$ sudo btrfs fi show >> [sudo] password for klaus: >> Label: none uuid: 5db5f82c-2571-4e62-a6da-50da0867888a >> Total devices 4 FS bytes used 5.14TiB >> devid1 size 1.82TiB used 1.78TiB path /dev/sde >> devid2 size 1.82TiB used 1.78TiB path /dev/sdf >> devid3 size 0.00B used 1.49TiB path /dev/sdd >> devid4 size 5.46TiB used 305.21GiB path /dev/sdb >> >> Btrfs v3.17 >> >> Obviously, I want /dev/sdd emptied and deleted from the raid. >> >> So how do I do that? >> >> I thought of three possibilities myself. I am sure there are more, >> given that I am in no way a btrfs expert: >> >> 1)Try to force a deletion of /dev/sdd where btrfs copies all intact >> data to the other disks >> 2) Somehow re-balances the raid so that sdd is emptied, and then deleted >> 3) converting into a raid1, physically removing the failing disk, >> simulating a hard error, starting the raid degraded, and converting it >> back to raid0 again. >> >> How do you guys think I should go about this? Given that it's a raid0 >> for a reason, it's not the end of the world losing all data, but I'd >> really prefer losing as little as possible, obviously. >> >> FYI, I tried doing some scrubbing and balancing. There's traces of >> that in the syslog and dmesg I've attached. It's being used as >> firewall
Re: A partially failing disk in raid0 needs replacement
Hi Roman I almost understand :-) - however, I need a bit more information: How do I copy the image file to the 6TB without screwing the existing btrfs up when the fs is not mounted? Should I remove it from the raid again? Also, as you might have noticed, I have a bit of an issue with the entire space of the 6TB disk being added to the btrfs when I added the disk. There's something kinda basic about using btrfs that I haven't really understood yet. Maybe you - or someone else - can point me in the right direction in terms of documentation. Thanks /klaus On Tue, Nov 14, 2017 at 1:48 PM, Roman Mamedov wrote: > On Tue, 14 Nov 2017 10:36:22 +0200 > Klaus Agnoletti wrote: > >> Obviously, I want /dev/sdd emptied and deleted from the raid. > > * Unmount the RAID0 FS > > * copy the bad drive using `dd_rescue`[1] into a file on the 6TB drive > (noting how much of it is actually unreadable -- chances are it's mostly > intact) > > * physically remove the bad drive (have a powerdown or reboot for this to be > sure Btrfs didn't remember it somewhere) > > * set up a loop device from the dd_rescue'd 2TB file > > * run `btrfs device scan` > > * mount the RAID0 filesystem > > * run the delete command on the loop device, it will not encounter I/O > errors anymore. > > > [1] Note that "ddrescue" and "dd_rescue" are two different programs for the > same purpose, one may work better than the other. I don't remember which. :) > > -- > With respect, > Roman -- Klaus Agnoletti
Re: A partially failing disk in raid0 needs replacement
On 2017-11-14 03:36, Klaus Agnoletti wrote: Hi list I used to have 3x2TB in a btrfs in raid0. A few weeks ago, one of the 2TB disks started giving me I/O errors in dmesg like this: [388659.173819] ata5.00: exception Emask 0x0 SAct 0x7fff SErr 0x0 action 0x0 [388659.175589] ata5.00: irq_stat 0x4008 [388659.177312] ata5.00: failed command: READ FPDMA QUEUED [388659.179045] ata5.00: cmd 60/20:60:80:96:95/00:00:c4:00:00/40 tag 12 ncq 1638 4 in res 51/40:1c:84:96:95/00:00:c4:00:00/40 Emask 0x409 (media error) [388659.182552] ata5.00: status: { DRDY ERR } [388659.184303] ata5.00: error: { UNC } [388659.188899] ata5.00: configured for UDMA/133 [388659.188956] sd 4:0:0:0: [sdd] Unhandled sense code [388659.188960] sd 4:0:0:0: [sdd] [388659.188962] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE [388659.188965] sd 4:0:0:0: [sdd] [388659.188967] Sense Key : Medium Error [current] [descriptor] [388659.188970] Descriptor sense data with sense descriptors (in hex): [388659.188972] 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00 [388659.188981] c4 95 96 84 [388659.188985] sd 4:0:0:0: [sdd] [388659.188988] Add. Sense: Unrecovered read error - auto reallocate failed [388659.188991] sd 4:0:0:0: [sdd] CDB: [388659.188992] Read(10): 28 00 c4 95 96 80 00 00 20 00 [388659.189000] end_request: I/O error, dev sdd, sector 3298137732 [388659.190740] BTRFS: bdev /dev/sdd errs: wr 0, rd 3120, flush 0, corrupt 0, ge n 0 [388659.192556] ata5: EH complete Just some background, but this error is usually indicative of either media degradation from long-term usage, or a head crash. At the same time, I started getting mails from smartd: Device: /dev/sdd [SAT], 2 Currently unreadable (pending) sectors Device info: Hitachi HDS723020BLA642, S/N:MN1220F30MNHUD, WWN:5-000cca-369c8f00b, FW:MN6OA580, 2.00 TB For details see host's SYSLOG. And this correlates with the above errors (although the current pending sectors being non-zero is less specific than the above). 
To fix it, it ended up with me adding a new 6TB disk and trying to delete the failing 2TB disks. That didn't go so well; apparently, the delete command aborts when ever it encounters I/O errors. So now my raid0 looks like this: I'm not going to comment on how to fix the current situation, as what has been stated in other people's replies pretty well covers that. I would however like to mention two things for future reference: 1. The delete command handles I/O errors just fine, provided that there is some form of redundancy in the filesystem. In your case, if this had been a raid1 array instead of raid0, then the delete command would have just fallen back to the other copy of the data when it hit an I/O error instead of dying. Just like a regular RAID0 array (be it LVM, MD, or hardware), you can't lose a device in a BTRFS raid0 array without losing the array. 2. While it would not have helped in this case, the preferred method when replacing a device is to use the `btrfs replace` command. It's a lot more efficient than add+delete (and exponentially more efficient than delete+add), and also a bit safer (in both cases because it needs to move less data). The only down-side to it is that you may need a couple of resize commands around it. klaus@box:~$ sudo btrfs fi show [sudo] password for klaus: Label: none uuid: 5db5f82c-2571-4e62-a6da-50da0867888a Total devices 4 FS bytes used 5.14TiB devid1 size 1.82TiB used 1.78TiB path /dev/sde devid2 size 1.82TiB used 1.78TiB path /dev/sdf devid3 size 0.00B used 1.49TiB path /dev/sdd devid4 size 5.46TiB used 305.21GiB path /dev/sdb Btrfs v3.17 Obviously, I want /dev/sdd emptied and deleted from the raid. So how do I do that? I thought of three possibilities myself. 
I am sure there are more, given that I am in no way a btrfs expert: 1) Try to force a deletion of /dev/sdd where btrfs copies all intact data to the other disks 2) Somehow re-balances the raid so that sdd is emptied, and then deleted 3) converting into a raid1, physically removing the failing disk, simulating a hard error, starting the raid degraded, and converting it back to raid0 again. How do you guys think I should go about this? Given that it's a raid0 for a reason, it's not the end of the world losing all data, but I'd really prefer losing as little as possible, obviously. FYI, I tried doing some scrubbing and balancing. There's traces of that in the syslog and dmesg I've attached. It's being used as firewall too, so there's a lot of Shorewall block messages spamming the log I'm afraid. Additional info: klaus@box:~$ uname -a Linux box 3.16.0-4-amd64 #1 SMP Debian 3.16.43-2+deb8u5 (2017-09-19) x86_64 GNU/Linux klaus@box:~$ sudo btrfs --version Btrfs v3.17 klaus@box:~$ sudo btrfs fi df /mnt Data, RAID0: total=5.34TiB, used=5.14TiB System, RAID0: total=96.00MiB, used=384.00KiB Metadata, RAID0: total=7.22GiB,
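As a sketch of the `btrfs replace` workflow recommended above for future device swaps (device names are placeholders — /dev/sdg stands for a hypothetical new disk that is *not* yet part of the filesystem, which replace requires; the devid 3 is taken from the `fi show` output in this thread but should be checked against your own):

```shell
# Copy the failing device's contents (and its devid) onto the new disk
# in a single pass, instead of the slower add+delete dance
btrfs replace start /dev/sdd /dev/sdg /mnt
btrfs replace status /mnt

# If the new disk is larger, grow that device to use the extra space
# (the replacement inherits the old device's devid)
btrfs filesystem resize 3:max /mnt
```

On a redundant profile, `replace start -r` additionally avoids reading from the failing source where another good copy exists; in raid0 there is no other copy, so the flag changes nothing.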
Re: A partially failing disk in raid0 needs replacement
On 2017-11-14 07:48, Roman Mamedov wrote: On Tue, 14 Nov 2017 10:36:22 +0200 Klaus Agnoletti wrote: Obviously, I want /dev/sdd emptied and deleted from the raid. * Unmount the RAID0 FS * copy the bad drive using `dd_rescue`[1] into a file on the 6TB drive (noting how much of it is actually unreadable -- chances are it's mostly intact) * physically remove the bad drive (have a powerdown or reboot for this to be sure Btrfs didn't remember it somewhere) * set up a loop device from the dd_rescue'd 2TB file * run `btrfs device scan` * mount the RAID0 filesystem * run the delete command on the loop device, it will not encounter I/O errors anymore. While the above procedure will work, it is worth noting that you may still lose data. [1] Note that "ddrescue" and "dd_rescue" are two different programs for the same purpose, one may work better than the other. I don't remember which. :) As a general rule, GNU ddrescue is more user friendly for block-level copies, while Kurt Garloff's dd_rescue tends to be better for copying at the file level. Both work fine in terms of reliability though.
Re: A partially failing disk in raid0 needs replacement
On 14 November 2017 at 09:36, Klaus Agnoletti wrote: > > How do you guys think I should go about this? I'd clone the disk with GNU ddrescue. https://www.gnu.org/software/ddrescue/
Re: A partially failing disk in raid0 needs replacement
On Tue, 14 Nov 2017 10:36:22 +0200 Klaus Agnoletti wrote: > Obviously, I want /dev/sdd emptied and deleted from the raid. * Unmount the RAID0 FS * copy the bad drive using `dd_rescue`[1] into a file on the 6TB drive (noting how much of it is actually unreadable -- chances are it's mostly intact) * physically remove the bad drive (have a powerdown or reboot for this to be sure Btrfs didn't remember it somewhere) * set up a loop device from the dd_rescue'd 2TB file * run `btrfs device scan` * mount the RAID0 filesystem * run the delete command on the loop device, it will not encounter I/O errors anymore. [1] Note that "ddrescue" and "dd_rescue" are two different programs for the same purpose, one may work better than the other. I don't remember which. :) -- With respect, Roman
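Roman's steps can be sketched as a shell session (device names and paths are placeholders, and this assumes — as in his original scenario — that the disk holding the image file is *not* itself a member of the filesystem):

```shell
# 1. Unmount the RAID0 filesystem
umount /mnt/raid0

# 2. Image the failing disk; the map file lets ddrescue resume and retry
#    around unreadable sectors
ddrescue /dev/sdd /mnt/spare/sdd.img /mnt/spare/sdd.map

# 3. Power down, physically remove the failing disk, reboot

# 4. Expose the image as a block device and let btrfs rediscover it
losetup /dev/loop0 /mnt/spare/sdd.img
btrfs device scan

# 5. Mount via any surviving member, then delete the now error-free device
mount /dev/sde /mnt/raid0
btrfs device delete /dev/loop0 /mnt/raid0
```

The delete still has to relocate every chunk that lived on the imaged disk, so expect it to take as long as a full balance of that data.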
Re: A partially failing disk in raid0 needs replacement
On Tue, Nov 14, 2017 at 10:36:22AM +0200, Klaus Agnoletti wrote: > I used to have 3x2TB in a btrfs in raid0. A few weeks ago, one of the ^ > 2TB disks started giving me I/O errors in dmesg like this: > > [388659.188988] Add. Sense: Unrecovered read error - auto reallocate failed Alas, chances to recover anything are pretty slim. That's RAID0 metadata for you. On the other hand, losing any non-trivial file while being able to gape at intact metadata isn't that much better, thus -mraid0 isn't completely unreasonable. > To fix it, it ended up with me adding a new 6TB disk and trying to > delete the failing 2TB disks. > > That didn't go so well; apparently, the delete command aborts when > ever it encounters I/O errors. So now my raid0 looks like this: > > klaus@box:~$ sudo btrfs fi show > [sudo] password for klaus: > Label: none uuid: 5db5f82c-2571-4e62-a6da-50da0867888a > Total devices 4 FS bytes used 5.14TiB > devid1 size 1.82TiB used 1.78TiB path /dev/sde > devid2 size 1.82TiB used 1.78TiB path /dev/sdf > devid3 size 0.00B used 1.49TiB path /dev/sdd > devid4 size 5.46TiB used 305.21GiB path /dev/sdb > Obviously, I want /dev/sdd emptied and deleted from the raid. > > So how do I do that? > > I thought of three possibilities myself. I am sure there are more, > given that I am in no way a btrfs expert: > > 1)Try to force a deletion of /dev/sdd where btrfs copies all intact > data to the other disks > 2) Somehow re-balances the raid so that sdd is emptied, and then deleted > 3) converting into a raid1, physically removing the failing disk, > simulating a hard error, starting the raid degraded, and converting it > back to raid0 again. There's hardly any intact data: roughly 2/3 of chunks have half of their blocks on the failed disk, densely interspersed. Even worse, metadata required to map those blocks to files is gone, too: if we naively assume there's only a single tree, a tree node is intact only if it and every single node on the path to the root is intact. 
In practice, this means it's a total filesystem loss. > How do you guys think I should go about this? Given that it's a raid0 > for a reason, it's not the end of the world losing all data, but I'd > really prefer losing as little as possible, obviously. As the disk isn't _completely_ gone, there's a slim chance of some stuff requiring only still-readable sectors. Probably a waste of time to try to recover, though. Meow! -- ⢀⣴⠾⠻⢶⣦⠀ Laws we want back: Poland, Dz.U. 1921 nr.30 poz.177 (also Dz.U. ⣾⠁⢰⠒⠀⣿⡁ 1920 nr.11 poz.61): Art.2: An official, guilty of accepting a gift ⢿⡄⠘⠷⠚⠋⠀ or another material benefit, or a promise thereof, [in matters ⠈⠳⣄ relevant to duties], shall be punished by death by shooting.
Re: Read before you deploy btrfs + zstd
On 2017-11-14 02:34, Martin Steigerwald wrote: Hello David. David Sterba - 13.11.17, 23:50: while 4.14 is still fresh, let me address some concerns I've seen on linux forums already. The newly added ZSTD support is a feature that has broader impact than just the runtime compression. The btrfs-progs understand filesystems with ZSTD since 4.13. The remaining key part is the bootloader. Up to now, there are no bootloaders supporting ZSTD. This could lead to an unmountable filesystem if the critical files under /boot get accidentally or intentionally compressed by ZSTD. But otherwise ZSTD is safe to use? Are you aware of any other issues? Aside from the obvious issue that recovery media like SystemRescueCD and the GParted LiveCD haven't caught up yet, and thus won't be able to do anything with the filesystem, my testing has not uncovered any issues, though it is by no means rigorous.
Re: [PATCH] btrfs/154: test for device dynamic rescan
On Mon, Nov 13, 2017 at 10:25:41AM +0800, Anand Jain wrote: > Make sure missing device is included in the alloc list when it is > scanned on a mounted FS. > > This test case needs btrfs kernel patch which is in the ML > [PATCH] btrfs: handle dynamically reappearing missing device > Without the kernel patch, the test will run, but reports as > failed, as the device scanned won't appear in the alloc_list. > > Signed-off-by: Anand JainTested without the fix and test failed as expected, test passed after applying the fix. Some minor nits below. > --- > tests/btrfs/154 | 188 > > tests/btrfs/154.out | 10 +++ > tests/btrfs/group | 1 + > 3 files changed, 199 insertions(+) > create mode 100755 tests/btrfs/154 > create mode 100644 tests/btrfs/154.out > > diff --git a/tests/btrfs/154 b/tests/btrfs/154 > new file mode 100755 > index ..8b06fc4d9347 > --- /dev/null > +++ b/tests/btrfs/154 > @@ -0,0 +1,188 @@ > +#! /bin/bash > +# FS QA Test 154 > +# > +# Test for reappearing missing device functionality. > +# This test will fail without the btrfs kernel patch > +# [PATCH] btrfs: handle dynamically reappearing missing device > +# > +#- > +# Copyright (c) 2017 Oracle. All Rights Reserved. > +# Author: Anand Jain > +# > +# This program is free software; you can redistribute it and/or > +# modify it under the terms of the GNU General Public License as > +# published by the Free Software Foundation. > +# > +# This program is distributed in the hope that it would be useful, > +# but WITHOUT ANY WARRANTY; without even the implied warranty of > +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the > +# GNU General Public License for more details. 
> +# > +# You should have received a copy of the GNU General Public License > +# along with this program; if not, write the Free Software Foundation, > +# Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA > +#- > +# > + > +seq=`basename $0` > +seqres=$RESULT_DIR/$seq > +echo "QA output created by $seq" > + > +here=`pwd` > +tmp=/tmp/$$ > +status=1 # failure is the default! > +trap "_cleanup; exit \$status" 0 1 2 3 15 > + > +_cleanup() > +{ > + cd / > + rm -f $tmp.* > +} > + > +# get standard environment, filters and checks > +. ./common/rc > +. ./common/filter > +. ./common/module > + > +# remove previous $seqres.full before test > +rm -f $seqres.full > + > +# real QA test starts here > + > +_supported_fs btrfs > +_supported_os Linux > +_require_scratch_dev_pool 2 > +_test_unmount This is not needed now, _require_loadable_fs_module will umount & mount test dev as necessary. > +_require_loadable_fs_module "btrfs" > + > +_scratch_dev_pool_get 2 > + > +DEV1=`echo $SCRATCH_DEV_POOL | awk '{print $1}'` > +DEV2=`echo $SCRATCH_DEV_POOL | awk '{print $2}'` > + > +echo DEV1=$DEV1 >> $seqres.full > +echo DEV2=$DEV2 >> $seqres.full > + > +# Balance won't be successful if filled too much > +DEV1_SZ=`blockdev --getsize64 $DEV1` > +DEV2_SZ=`blockdev --getsize64 $DEV2` > + > +# get min > +MAX_FS_SZ=`echo -e "$DEV1_SZ\n$DEV2_SZ" | sort | head -1` > +# Need disks with more than 2G > +if [ $MAX_FS_SZ -lt 20 ]; then > + _scratch_dev_pool_put > + _test_mount Then no need to _test_mount. 
> + _notrun "Smallest dev size $MAX_FS_SZ, Need at least 2G" > +fi > + > +MAX_FS_SZ=1 > +bs="1M" > +COUNT=$(($MAX_FS_SZ / 100)) > +CHECKPOINT1=0 > +CHECKPOINT2=0 > + > +setup() > +{ > + echo >> $seqres.full > + echo "MAX_FS_SZ=$MAX_FS_SZ COUNT=$COUNT" >> $seqres.full > + echo "setup" > + echo "-setup-" >> $seqres.full > + _scratch_pool_mkfs "-mraid1 -draid1" >> $seqres.full 2>&1 > + _scratch_mount >> $seqres.full 2>&1 > + dd if=/dev/urandom of="$SCRATCH_MNT"/tf bs=$bs count=1 \ > + >>$seqres.full 2>&1 > + _run_btrfs_util_prog filesystem show -m ${SCRATCH_MNT} > + _run_btrfs_util_prog filesystem df $SCRATCH_MNT > + COUNT=$(( $COUNT - 1 )) > + echo "unmount" >> $seqres.full > + _scratch_unmount > +} > + > +degrade_mount_write() > +{ > + echo >> $seqres.full > + echo "--degraded mount: max_fs_sz $max_fs_sz bytes--" >> $seqres.full > + echo > + echo "degraded mount" > + > + echo "clean btrfs ko" >> $seqres.full > + # un-scan the btrfs devices > + _reload_fs_module "btrfs" > + _mount -o degraded $DEV1 $SCRATCH_MNT >>$seqres.full 2>&1 > + cnt=$(( $COUNT/10 )) > + dd if=/dev/urandom of="$SCRATCH_MNT"/tf1 bs=$bs count=$cnt \ > + >>$seqres.full 2>&1 > + COUNT=$(( $COUNT - $cnt )) > + _run_btrfs_util_prog filesystem show -m $SCRATCH_MNT > + _run_btrfs_util_prog filesystem df $SCRATCH_MNT > +
Re: [PATCH 4/4] Btrfs: btrfs_dedupe_file_range() ioctl, remove 16MiB restriction
Sorry, I was just thinking that I could test that and send you some feedback, but for now, no time. I will check that later and try to add memory reusing. So, just ignore the patches for now. Thanks 2017-10-10 20:36 GMT+03:00 David Sterba: > On Tue, Oct 03, 2017 at 06:06:04PM +0300, Timofey Titovets wrote: >> At now btrfs_dedupe_file_range() restricted to 16MiB range for >> limit locking time and memory requirement for dedup ioctl() >> >> For too big input rage code silently set range to 16MiB >> >> Let's remove that restriction by do iterating over dedup range. >> That's backward compatible and will not change anything for request >> less then 16MiB. > > This would make the ioctl more pleasant to use. So far I haven't found > any problems to do the iteration. One possible speedup could be done to > avoid the repeated allocations in btrfs_extent_same if we're going to > iterate more than once. > > As this would mean the 16MiB length restriction is gone, this needs to > bubble up to the documentation > (http://man7.org/linux/man-pages/man2/ioctl_fideduperange.2.html) > > Have you tested the behaviour with larger ranges? -- Have a nice day, Timofey.
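For userspace readers, the effect of the patch can be pictured as moving this kind of chunking loop into the kernel: previously a caller issuing a large FIDEDUPERANGE request had to split it into 16MiB pieces itself. A sketch of that chunking arithmetic in plain shell (sizes in bytes; purely illustrative):

```shell
CHUNK=$((16 * 1024 * 1024))   # old per-ioctl limit, 16MiB

# Print "offset length" pairs covering a dedupe request of arbitrary size
split_range() {
    local off=$1 len=$2 step
    while [ "$len" -gt 0 ]; do
        step=$(( len < CHUNK ? len : CHUNK ))
        echo "$off $step"
        off=$(( off + step ))
        len=$(( len - step ))
    done
}

# A 40MiB request splits into 16MiB + 16MiB + 8MiB pieces
split_range 0 $((40 * 1024 * 1024))
```

With the kernel iterating internally, a single ioctl call covers the whole range and this bookkeeping disappears from applications.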
how to repair or access broken btrfs?
Hello, after a controller firmware bug / failure i've a broken btrfs. # parent transid verify failed on 181846016 wanted 143404 found 143399 running repair, fsck or zero-log always results in the same failure message: extent-tree.c:2725: alloc_reserved_tree_block: BUG_ON `ret` triggered, value -1 .. stack trace .. Is there an chance to get at least a single file out of the broken fs? Greets, Stefan Complete output: ./btrfs check --repair /dev/mapper/crypt_md0 enabling repair mode parent transid verify failed on 181846016 wanted 143404 found 143399 parent transid verify failed on 181846016 wanted 143404 found 143399 Ignoring transid failure Checking filesystem on /dev/mapper/crypt_md0 UUID: d3f9eee9-efbd-4590-858f-27b39d453350 repair mode will force to clear out log tree, are you sure? [y/N]: y parent transid verify failed on 308183040 wanted 143404 found 143399 parent transid verify failed on 308183040 wanted 143404 found 143399 Ignoring transid failure parent transid verify failed on 338870272 wanted 143404 found 143399 parent transid verify failed on 338870272 wanted 143404 found 143399 Ignoring transid failure parent transid verify failed on 12778157178880 wanted 143404 found 143399 parent transid verify failed on 12778157178880 wanted 143404 found 143399 Ignoring transid failure leaf parent key incorrect 38699008 btrfs unable to find ref byte nr 12778147823616 parent 0 root 2 owner 0 offset 0 parent transid verify failed on 308183040 wanted 143404 found 143399 Ignoring transid failure leaf parent key incorrect 91766784 extent-tree.c:2725: alloc_reserved_tree_block: BUG_ON `ret` triggered, value -1 ./btrfs[0x415cb3] ./btrfs[0x416ee5] ./btrfs[0x417104] ./btrfs[0x418cea] ./btrfs[0x418f06] ./btrfs(btrfs_alloc_free_block+0x1e4)[0x41b8d0] ./btrfs(__btrfs_cow_block+0xd3)[0x40c5f9] ./btrfs(btrfs_cow_block+0x110)[0x40d03b] ./btrfs(commit_tree_roots+0x53)[0x439a37] ./btrfs(btrfs_commit_transaction+0xf9)[0x439e02] ./btrfs(cmd_check+0x861)[0x46172e] 
./btrfs(main+0x163)[0x40b5e9]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5)[0x7f44b14fab45]
./btrfs[0x40b0b9]
Aborted
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
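[Editor's note: one read-only recovery path worth noting for questions like Stefan's is btrfs restore, which copies files out of a damaged filesystem without writing to it. This is a hedged sketch, not advice from the thread; the target directory /mnt/recovery is a placeholder and must live on a different, healthy volume.]

```shell
# Dry run first: -D lists what would be restored without copying anything.
btrfs restore -D -v /dev/mapper/crypt_md0 /mnt/recovery

# Then copy out whatever is reachable; the damaged fs is only read,
# never modified, so this is safe to try before any --repair attempt.
btrfs restore -v /dev/mapper/crypt_md0 /mnt/recovery
```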
RE: Read before you deploy btrfs + zstd
> -----Original Message-----
> From: linux-btrfs-ow...@vger.kernel.org [mailto:linux-btrfs-
> ow...@vger.kernel.org] On Behalf Of Martin Steigerwald
> Sent: Tuesday, 14 November 2017 6:35 PM
> To: dste...@suse.cz; linux-btrfs@vger.kernel.org
> Subject: Re: Read before you deploy btrfs + zstd
>
> Hello David.
>
> David Sterba - 13.11.17, 23:50:
> > while 4.14 is still fresh, let me address some concerns I've seen on
> > linux forums already.
> >
> > The newly added ZSTD support is a feature that has broader impact
> > than just the runtime compression. The btrfs-progs understand
> > filesystems with ZSTD since 4.13. The remaining key part is the
> > bootloader.
> >
> > Up to now, there are no bootloaders supporting ZSTD. This could lead
> > to an unmountable filesystem if the critical files under /boot get
> > accidentally or intentionally compressed by ZSTD.
>
> But otherwise ZSTD is safe to use? Are you aware of any other issues?
>
> I consider switching from LZO to ZSTD on this ThinkPad T520 with
> Sandybridge.

I've been using it since rc2 and had no trouble at all so far. The
filesystem is running faster now (with zstd) than it did uncompressed
on 4.13.

Paul.
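[Editor's note: for readers considering the same LZO-to-ZSTD switch, it is a mount-option change; this sketch uses placeholder mount points. Existing lzo-compressed extents stay readable, and only new writes pick up zstd unless the data is rewritten.]

```shell
# Switch a live mount over to zstd for new writes
mount -o remount,compress=zstd /

# Optionally rewrite existing file data with zstd (this also defragments,
# so mind the caveats about snapshots and reflinks before running it)
btrfs filesystem defragment -r -czstd /home
```

The bootloader caveat from the thread still applies: anything under /boot that the bootloader must read should not end up zstd-compressed until bootloaders support it.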
Re: Need help with incremental backup strategy (snapshots, defragmenting & performance)
On Mon, 13 Nov 2017 22:39:44 -0500 Dave wrote:

> I have my live system on one block device and a backup snapshot of it
> on another block device. I am keeping them in sync with hourly rsync
> transfers.
>
> Here's how this system works in a little more detail:
>
> 1. I establish the baseline by sending a full snapshot to the backup
>    block device using btrfs send-receive.
> 2. Next, on the backup device I immediately create a rw copy of that
>    baseline snapshot.
> 3. I delete the source snapshot to keep the live filesystem free of
>    all snapshots (so it can be optimally defragmented, etc.)
> 4. Hourly, I take a snapshot of the live system, rsync all changes to
>    the backup block device, and then delete the source snapshot. This
>    hourly process takes less than a minute currently. (My test system
>    has only moderate usage.)
> 5. Hourly, following the above step, I use snapper to take a snapshot
>    of the backup subvolume to create/preserve a history of changes.
>    For example, I can find the version of a file 30 hours prior.

Sounds a bit complex; I still don't get why you need all these snapshot
creations and deletions, and why you are even still using btrfs
send-receive. Here is my scheme:

/mnt/dst                 <- mounted backup storage volume
/mnt/dst/backup          <- a subvolume
/mnt/dst/backup/host1/   <- rsync destination for host1, regular directory
/mnt/dst/backup/host2/   <- rsync destination for host2, regular directory
/mnt/dst/backup/host3/   <- rsync destination for host3, regular directory
etc.

/mnt/dst/backup/host1/bin/
/mnt/dst/backup/host1/etc/
/mnt/dst/backup/host1/home/
...

Self-explanatory. All regular directories, not subvolumes.

Snapshots:

/mnt/dst/snaps/backup                   <- a regular directory
/mnt/dst/snaps/backup/2017-11-14T12:00/ <- snapshot 1 of /mnt/dst/backup
/mnt/dst/snaps/backup/2017-11-14T13:00/ <- snapshot 2 of /mnt/dst/backup
/mnt/dst/snaps/backup/2017-11-14T14:00/ <- snapshot 3 of /mnt/dst/backup

Accessing historic data:

/mnt/dst/snaps/backup/2017-11-14T12:00/host1/bin/bash
is /bin/bash for host1 as of 2017-11-14 12:00 (time on the backup
system).

No need for btrfs send-receive; only plain rsync is used, directly from
hostX:/ to /mnt/dst/backup/host1/.

No need to create or delete snapshots during the actual backup process.

A single common timeline is kept for all hosts to be backed up, so the
snapshot count is not multiplied by the number of hosts (in my case the
backup location is multi-purpose, so I somewhat care about the total
number of snapshots there as well).

Also, all of this works even with source hosts which do not use btrfs.

--
With respect,
Roman
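[Editor's note: Roman's scheme boils down to two commands per backup cycle. A minimal sketch, assuming the layout above; host names and rsync flags are illustrative, not taken from the thread.]

```shell
# 1. Plain rsync from each host into its regular directory under the
#    backup subvolume (no snapshots involved in this step).
rsync -aHAX --delete host1:/ /mnt/dst/backup/host1/

# 2. After all hosts are synced, take one read-only snapshot of the
#    whole backup subvolume, named by timestamp. This yields a single
#    shared timeline covering every host at once.
btrfs subvolume snapshot -r /mnt/dst/backup \
    "/mnt/dst/snaps/backup/$(date +%FT%H:%M)"
```

Because the per-host trees are regular directories, the snapshot of the parent subvolume captures all of them atomically, which is what keeps the snapshot count independent of the number of hosts.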
Re: Need help with incremental backup strategy (snapshots, defragmenting & performance)
On Tue, 14 Nov 2017 10:14:55 +0300 Marat Khalili wrote:

> Don't keep snapshots under the rsync target, place them under
> ../snapshots (if snapper supports this).
> Or, specify them in --exclude and avoid using --delete-excluded.

Both are good suggestions. In my case each system does have its own
snapshots as well, but they are retained for much shorter. So I both
use --exclude to avoid fetching the entire /snaps tree from the source
system, and store snapshots of the destination system outside of the
rsync target dirs.

> Or keep using -x if it works, why not?

-x will exclude the content of all subvolumes down the tree on the
source side -- not only the time-based ones. If you take care to never
casually create any subvolumes whose content you'd still want backed
up, then I guess it can work.

--
With respect,
Roman
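[Editor's note: the --exclude variant discussed above can be sketched as follows; paths are illustrative. Unlike -x, this skips only the named tree, so other subvolumes on the source are still descended into and backed up.]

```shell
# Leading "/" anchors the pattern to the transfer root, so only the
# source host's own snapshot tree is skipped. Plain --delete is used,
# not --delete-excluded, so existing copies of excluded paths on the
# destination are left alone.
rsync -aHAX --delete --exclude=/snaps host1:/ /mnt/dst/backup/host1/
```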