Re: [PATCH] btrfs: handle dynamically reappearing missing device
Hi Anand,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on btrfs/next]
[also build test ERROR on v4.14 next-20171114]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url: https://github.com/0day-ci/linux/commits/Anand-Jain/btrfs-handle-dynamically-reappearing-missing-device/20171115-143047
base: https://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git next
config: sparc64-allyesconfig (attached as .config)
compiler: sparc64-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=sparc64

All errors (new ones prefixed by >>):

   fs/btrfs/volumes.c: In function 'device_list_add':
>> fs/btrfs/volumes.c:732:10: error: implicit declaration of function 'btrfs_open_one_device'; did you mean 'btrfs_scan_one_device'? [-Werror=implicit-function-declaration]
      ret = btrfs_open_one_device(fs_devices, device, fmode,
            ^
            btrfs_scan_one_device
   cc1: some warnings being treated as errors

vim +732 fs/btrfs/volumes.c

   610
   611  /*
   612   * Add new device to list of registered devices
   613   *
   614   * Returns:
   615   * 1   - first time device is seen
   616   * 0   - device already known
   617   * < 0 - error
   618   */
   619  static noinline int device_list_add(const char *path,
   620                             struct btrfs_super_block *disk_super,
   621                             u64 devid, struct btrfs_fs_devices **fs_devices_ret)
   622  {
   623          struct btrfs_device *device;
   624          struct btrfs_fs_devices *fs_devices;
   625          struct rcu_string *name;
   626          int ret = 0;
   627          u64 found_transid = btrfs_super_generation(disk_super);
   628
   629          fs_devices = find_fsid(disk_super->fsid);
   630          if (!fs_devices) {
   631                  fs_devices = alloc_fs_devices(disk_super->fsid);
   632                  if (IS_ERR(fs_devices))
   633                          return PTR_ERR(fs_devices);
   634
   635                  list_add(&fs_devices->list, &fs_uuids);
   636
   637                  device = NULL;
   638          } else {
   639                  device = __find_device(&fs_devices->devices, devid,
   640                                         disk_super->dev_item.uuid);
   641          }
   642
   643          if (!device) {
   644                  if (fs_devices->opened)
   645                          return -EBUSY;
   646
   647                  device = btrfs_alloc_device(NULL, &devid,
   648                                              disk_super->dev_item.uuid);
   649                  if (IS_ERR(device)) {
   650                          /* we can safely leave the fs_devices entry around */
   651                          return PTR_ERR(device);
   652                  }
   653
   654                  name = rcu_string_strdup(path, GFP_NOFS);
   655                  if (!name) {
   656                          kfree(device);
   657                          return -ENOMEM;
   658                  }
   659                  rcu_assign_pointer(device->name, name);
   660
   661                  mutex_lock(&fs_devices->device_list_mutex);
   662                  list_add_rcu(&device->dev_list, &fs_devices->devices);
   663                  fs_devices->num_devices++;
   664                  mutex_unlock(&fs_devices->device_list_mutex);
   665
   666                  ret = 1;
   667                  device->fs_devices = fs_devices;
   668          } else if (!device->name || strcmp(device->name->str, path)) {
   669                  /*
   670                   * When FS is already mounted.
   671                   * 1. If you are here and if the device->name is NULL that
   672                   *    means this device was missing at time of FS mount.
   673                   * 2. If you are here and if the device->name is different
   674                   *    from 'path' that means either
   675                   *      a. The same device disappeared and reappeared with
   676                   *         different name. or
   677                   *      b. The missing-disk-which-was-replaced, has
   678                   *         reappeared now.
   679                   *
   680                   * We must allow 1 and 2a above. But 2b would be a spurious
   681                   * and unintentional.
   682                   *
   683                   * Further in case of 1 and 2a above, the disk at 'path'
   684                   * would have missed some transaction when it was away and
   685                   * in case of 2a the stale bdev has to be upd
Re: Tiered storage?
As a regular BTRFS user I can tell you that there is no such thing as hot data tracking yet. Some people seem to use bcache together with btrfs and come asking for help on the mailing list.

Raid5/6 have received a few fixes recently, and it *may* soon be worth trying out raid5/6 for data, but keeping metadata in raid1/10 (I would rather lose a file or two than the entire filesystem). I had plans to run some tests on this a while ago, but forgot about it. As all good citizens should, remember to have good backups. Last time I tested raid5/6 I ran into issues easily. For what it's worth, raid1/10 seems pretty rock solid as long as you have sufficient disks (hint: you need more than two for raid1 if you want to stay safe).

As for dedupe, there is (to my knowledge) nothing fully automatic yet. You have to run a program to scan your filesystem, but all the deduplication is done in the kernel. duperemove seemed to work quite well when I tested it, but there may be some performance implications.

Roy Sigurd Karlsbakk wrote:
> Hi all
>
> I've been following this project on and off for quite a few years, and I
> wonder if anyone has looked into tiered storage on it. With tiered storage,
> I mean hot data lying on fast storage and cold data on slow storage. I'm
> not talking about caching (where you just keep a copy of the hot data on
> the fast storage).
>
> And btw, how far is raid[56] and block-level dedup from something useful
> in production?
>
> Vennlig hilsen
>
> roy
> --
> Roy Sigurd Karlsbakk
> (+47) 98013356
> http://blogg.karlsbakk.net/
> GPG Public key: http://karlsbakk.net/roysigurdkarlsbakk.pubkey.txt
> --
> Hið góða skaltu í stein höggva, hið illa í snjó rita.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] btrfs/154: test for device dynamic rescan
On Wed, Nov 15, 2017 at 11:05:15AM +0800, Anand Jain wrote: > Make sure missing device is included in the alloc list when it is > scanned on a mounted FS. > > This test case needs btrfs kernel patch which is in the ML > [PATCH] btrfs: handle dynamically reappearing missing device > Without the kernel patch, the test will run, but reports as > failed, as the device scanned won't appear in the alloc_list. > > Signed-off-by: Anand Jain> --- > v2: Fixed review comments. > tests/btrfs/154 | 186 > > tests/btrfs/154.out | 10 +++ > tests/btrfs/group | 1 + > 3 files changed, 197 insertions(+) > create mode 100755 tests/btrfs/154 > create mode 100644 tests/btrfs/154.out > > diff --git a/tests/btrfs/154 b/tests/btrfs/154 > new file mode 100755 > index ..73a185157389 > --- /dev/null > +++ b/tests/btrfs/154 > @@ -0,0 +1,186 @@ > +#! /bin/bash > +# FS QA Test 154 > +# > +# Test for reappearing missing device functionality. > +# This test will fail without the btrfs kernel patch > +# [PATCH] btrfs: handle dynamically reappearing missing device > +# > +#- > +# Copyright (c) 2017 Oracle. All Rights Reserved. > +# Author: Anand Jain > +# > +# This program is free software; you can redistribute it and/or > +# modify it under the terms of the GNU General Public License as > +# published by the Free Software Foundation. > +# > +# This program is distributed in the hope that it would be useful, > +# but WITHOUT ANY WARRANTY; without even the implied warranty of > +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the > +# GNU General Public License for more details. > +# > +# You should have received a copy of the GNU General Public License > +# along with this program; if not, write the Free Software Foundation, > +# Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA > +#- > +# > + > +seq=`basename $0` > +seqres=$RESULT_DIR/$seq > +echo "QA output created by $seq" > + > +here=`pwd` > +tmp=/tmp/$$ > +status=1 # failure is the default! 
> +trap "_cleanup; exit \$status" 0 1 2 3 15 > + > +_cleanup() > +{ > + cd / > + rm -f $tmp.* > +} > + > +# get standard environment, filters and checks > +. ./common/rc > +. ./common/filter > +. ./common/module > + > +# remove previous $seqres.full before test > +rm -f $seqres.full > + > +# real QA test starts here > + > +_supported_fs btrfs > +_supported_os Linux > +_require_scratch_dev_pool 2 > +_require_loadable_fs_module "btrfs" > + > +_scratch_dev_pool_get 2 > + > +DEV1=`echo $SCRATCH_DEV_POOL | awk '{print $1}'` > +DEV2=`echo $SCRATCH_DEV_POOL | awk '{print $2}'` > + > +echo DEV1=$DEV1 >> $seqres.full > +echo DEV2=$DEV2 >> $seqres.full > + > +# Balance won't be successful if filled too much > +DEV1_SZ=`blockdev --getsize64 $DEV1` > +DEV2_SZ=`blockdev --getsize64 $DEV2` > + > +# get min > +MAX_FS_SZ=`echo -e "$DEV1_SZ\n$DEV2_SZ" | sort | head -1` > +# Need disks with more than 2G > +if [ $MAX_FS_SZ -lt 20 ]; then > + _scratch_dev_pool_put > + _notrun "Smallest dev size $MAX_FS_SZ, Need at least 2G" > +fi > + > +MAX_FS_SZ=1 > +bs="1M" > +COUNT=$(($MAX_FS_SZ / 100)) > +CHECKPOINT1=0 > +CHECKPOINT2=0 > + > +setup() > +{ > + echo >> $seqres.full > + echo "MAX_FS_SZ=$MAX_FS_SZ COUNT=$COUNT" >> $seqres.full > + echo "setup" > + echo "-setup-" >> $seqres.full > + _scratch_pool_mkfs "-mraid1 -draid1" >> $seqres.full 2>&1 > + _scratch_mount >> $seqres.full 2>&1 > + dd if=/dev/urandom of="$SCRATCH_MNT"/tf bs=$bs count=1 \ > + >>$seqres.full 2>&1 > + _run_btrfs_util_prog filesystem show -m ${SCRATCH_MNT} > + _run_btrfs_util_prog filesystem df $SCRATCH_MNT > + COUNT=$(( $COUNT - 1 )) > + echo "unmount" >> $seqres.full > + _scratch_unmount > +} > + > +degrade_mount_write() > +{ > + echo >> $seqres.full > + echo "--degraded mount: max_fs_sz $max_fs_sz bytes--" >> $seqres.full > + echo > + echo "degraded mount" > + > + echo "clean btrfs ko" >> $seqres.full > + # un-scan the btrfs devices > + _reload_fs_module "btrfs" > + _mount -o degraded $DEV1 $SCRATCH_MNT >>$seqres.full 
2>&1 > + cnt=$(( $COUNT/10 )) > + dd if=/dev/urandom of="$SCRATCH_MNT"/tf1 bs=$bs count=$cnt \ > + >>$seqres.full 2>&1 > + COUNT=$(( $COUNT - $cnt )) > + _run_btrfs_util_prog filesystem show -m $SCRATCH_MNT > + _run_btrfs_util_prog filesystem df $SCRATCH_MNT > + CHECKPOINT1=`md5sum $SCRATCH_MNT/tf1` > + echo $SCRATCH_MNT/tf1:$CHECKPOINT1 >> $seqres.full > +} > + > +scan_missing_dev_and_write() > +{ > + echo >> $seqres.full > + echo "--scan missing $DEV2--" >> $seqres.full > + echo > + echo "scan missing
Re: [PATCH] btrfs/154: test for device dynamic rescan
On 11/14/2017 08:12 PM, Eryu Guan wrote: On Mon, Nov 13, 2017 at 10:25:41AM +0800, Anand Jain wrote: Make sure missing device is included in the alloc list when it is scanned on a mounted FS. This test case needs btrfs kernel patch which is in the ML [PATCH] btrfs: handle dynamically reappearing missing device Without the kernel patch, the test will run, but reports as failed, as the device scanned won't appear in the alloc_list. Signed-off-by: Anand Jain

Tested without the fix and the test failed as expected; the test passed after applying the fix. Some minor nits below.

--- tests/btrfs/154 | 188 tests/btrfs/154.out | 10 +++ tests/btrfs/group | 1 + 3 files changed, 199 insertions(+) create mode 100755 tests/btrfs/154 create mode 100644 tests/btrfs/154.out diff --git a/tests/btrfs/154 b/tests/btrfs/154 new file mode 100755 index ..8b06fc4d9347 --- /dev/null +++ b/tests/btrfs/154 @@ -0,0 +1,188 @@ +#! /bin/bash +# FS QA Test 154 +# +# Test for reappearing missing device functionality. +# This test will fail without the btrfs kernel patch +# [PATCH] btrfs: handle dynamically reappearing missing device +# +#- +# Copyright (c) 2017 Oracle. All Rights Reserved. +# Author: Anand Jain +# +# This program is free software; you can redistribute it and/or +# modify it under the terms of the GNU General Public License as +# published by the Free Software Foundation. +# +# This program is distributed in the hope that it would be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. +# +# You should have received a copy of the GNU General Public License +# along with this program; if not, write the Free Software Foundation, +# Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA +#- +# + +seq=`basename $0` +seqres=$RESULT_DIR/$seq +echo "QA output created by $seq" + +here=`pwd` +tmp=/tmp/$$ +status=1 # failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15 + +_cleanup() +{ + cd / + rm -f $tmp.* +} + +# get standard environment, filters and checks +. ./common/rc +. ./common/filter +. ./common/module + +# remove previous $seqres.full before test +rm -f $seqres.full + +# real QA test starts here + +_supported_fs btrfs +_supported_os Linux +_require_scratch_dev_pool 2 +_test_unmount This is not needed now, _require_loadable_fs_module will umount & mount test dev as necessary. Right will fix it. +_require_loadable_fs_module "btrfs" + +_scratch_dev_pool_get 2 + +DEV1=`echo $SCRATCH_DEV_POOL | awk '{print $1}'` +DEV2=`echo $SCRATCH_DEV_POOL | awk '{print $2}'` + +echo DEV1=$DEV1 >> $seqres.full +echo DEV2=$DEV2 >> $seqres.full + +# Balance won't be successful if filled too much +DEV1_SZ=`blockdev --getsize64 $DEV1` +DEV2_SZ=`blockdev --getsize64 $DEV2` + +# get min +MAX_FS_SZ=`echo -e "$DEV1_SZ\n$DEV2_SZ" | sort | head -1` +# Need disks with more than 2G +if [ $MAX_FS_SZ -lt 20 ]; then + _scratch_dev_pool_put + _test_mount Then no need to _test_mount. Fixed this in v2. 
+ _notrun "Smallest dev size $MAX_FS_SZ, Need at least 2G" +fi + +MAX_FS_SZ=1 +bs="1M" +COUNT=$(($MAX_FS_SZ / 100)) +CHECKPOINT1=0 +CHECKPOINT2=0 + +setup() +{ + echo >> $seqres.full + echo "MAX_FS_SZ=$MAX_FS_SZ COUNT=$COUNT" >> $seqres.full + echo "setup" + echo "-setup-" >> $seqres.full + _scratch_pool_mkfs "-mraid1 -draid1" >> $seqres.full 2>&1 + _scratch_mount >> $seqres.full 2>&1 + dd if=/dev/urandom of="$SCRATCH_MNT"/tf bs=$bs count=1 \ + >>$seqres.full 2>&1 + _run_btrfs_util_prog filesystem show -m ${SCRATCH_MNT} + _run_btrfs_util_prog filesystem df $SCRATCH_MNT + COUNT=$(( $COUNT - 1 )) + echo "unmount" >> $seqres.full + _scratch_unmount +} + +degrade_mount_write() +{ + echo >> $seqres.full + echo "--degraded mount: max_fs_sz $max_fs_sz bytes--" >> $seqres.full + echo + echo "degraded mount" + + echo "clean btrfs ko" >> $seqres.full + # un-scan the btrfs devices + _reload_fs_module "btrfs" + _mount -o degraded $DEV1 $SCRATCH_MNT >>$seqres.full 2>&1 + cnt=$(( $COUNT/10 )) + dd if=/dev/urandom of="$SCRATCH_MNT"/tf1 bs=$bs count=$cnt \ + >>$seqres.full 2>&1 + COUNT=$(( $COUNT - $cnt )) + _run_btrfs_util_prog filesystem show -m $SCRATCH_MNT + _run_btrfs_util_prog filesystem df $SCRATCH_MNT + CHECKPOINT1=`md5sum $SCRATCH_MNT/tf1` + echo $SCRATCH_MNT/tf1:$CHECKPOINT1 >> $seqres.full 2>&1 "2>&1" not needed.
[PATCH v2] btrfs/154: test for device dynamic rescan
Make sure missing device is included in the alloc list when it is scanned on a mounted FS. This test case needs btrfs kernel patch which is in the ML [PATCH] btrfs: handle dynamically reappearing missing device Without the kernel patch, the test will run, but reports as failed, as the device scanned won't appear in the alloc_list. Signed-off-by: Anand Jain--- v2: Fixed review comments. tests/btrfs/154 | 186 tests/btrfs/154.out | 10 +++ tests/btrfs/group | 1 + 3 files changed, 197 insertions(+) create mode 100755 tests/btrfs/154 create mode 100644 tests/btrfs/154.out diff --git a/tests/btrfs/154 b/tests/btrfs/154 new file mode 100755 index ..73a185157389 --- /dev/null +++ b/tests/btrfs/154 @@ -0,0 +1,186 @@ +#! /bin/bash +# FS QA Test 154 +# +# Test for reappearing missing device functionality. +# This test will fail without the btrfs kernel patch +# [PATCH] btrfs: handle dynamically reappearing missing device +# +#- +# Copyright (c) 2017 Oracle. All Rights Reserved. +# Author: Anand Jain +# +# This program is free software; you can redistribute it and/or +# modify it under the terms of the GNU General Public License as +# published by the Free Software Foundation. +# +# This program is distributed in the hope that it would be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. +# +# You should have received a copy of the GNU General Public License +# along with this program; if not, write the Free Software Foundation, +# Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA +#- +# + +seq=`basename $0` +seqres=$RESULT_DIR/$seq +echo "QA output created by $seq" + +here=`pwd` +tmp=/tmp/$$ +status=1 # failure is the default! +trap "_cleanup; exit \$status" 0 1 2 3 15 + +_cleanup() +{ + cd / + rm -f $tmp.* +} + +# get standard environment, filters and checks +. ./common/rc +. ./common/filter +. 
./common/module + +# remove previous $seqres.full before test +rm -f $seqres.full + +# real QA test starts here + +_supported_fs btrfs +_supported_os Linux +_require_scratch_dev_pool 2 +_require_loadable_fs_module "btrfs" + +_scratch_dev_pool_get 2 + +DEV1=`echo $SCRATCH_DEV_POOL | awk '{print $1}'` +DEV2=`echo $SCRATCH_DEV_POOL | awk '{print $2}'` + +echo DEV1=$DEV1 >> $seqres.full +echo DEV2=$DEV2 >> $seqres.full + +# Balance won't be successful if filled too much +DEV1_SZ=`blockdev --getsize64 $DEV1` +DEV2_SZ=`blockdev --getsize64 $DEV2` + +# get min +MAX_FS_SZ=`echo -e "$DEV1_SZ\n$DEV2_SZ" | sort | head -1` +# Need disks with more than 2G +if [ $MAX_FS_SZ -lt 20 ]; then + _scratch_dev_pool_put + _notrun "Smallest dev size $MAX_FS_SZ, Need at least 2G" +fi + +MAX_FS_SZ=1 +bs="1M" +COUNT=$(($MAX_FS_SZ / 100)) +CHECKPOINT1=0 +CHECKPOINT2=0 + +setup() +{ + echo >> $seqres.full + echo "MAX_FS_SZ=$MAX_FS_SZ COUNT=$COUNT" >> $seqres.full + echo "setup" + echo "-setup-" >> $seqres.full + _scratch_pool_mkfs "-mraid1 -draid1" >> $seqres.full 2>&1 + _scratch_mount >> $seqres.full 2>&1 + dd if=/dev/urandom of="$SCRATCH_MNT"/tf bs=$bs count=1 \ + >>$seqres.full 2>&1 + _run_btrfs_util_prog filesystem show -m ${SCRATCH_MNT} + _run_btrfs_util_prog filesystem df $SCRATCH_MNT + COUNT=$(( $COUNT - 1 )) + echo "unmount" >> $seqres.full + _scratch_unmount +} + +degrade_mount_write() +{ + echo >> $seqres.full + echo "--degraded mount: max_fs_sz $max_fs_sz bytes--" >> $seqres.full + echo + echo "degraded mount" + + echo "clean btrfs ko" >> $seqres.full + # un-scan the btrfs devices + _reload_fs_module "btrfs" + _mount -o degraded $DEV1 $SCRATCH_MNT >>$seqres.full 2>&1 + cnt=$(( $COUNT/10 )) + dd if=/dev/urandom of="$SCRATCH_MNT"/tf1 bs=$bs count=$cnt \ + >>$seqres.full 2>&1 + COUNT=$(( $COUNT - $cnt )) + _run_btrfs_util_prog filesystem show -m $SCRATCH_MNT + _run_btrfs_util_prog filesystem df $SCRATCH_MNT + CHECKPOINT1=`md5sum $SCRATCH_MNT/tf1` + echo $SCRATCH_MNT/tf1:$CHECKPOINT1 >> 
$seqres.full +} + +scan_missing_dev_and_write() +{ + echo >> $seqres.full + echo "--scan missing $DEV2--" >> $seqres.full + echo + echo "scan missing dev and write" + + _run_btrfs_util_prog device scan $DEV2 + + echo >> $seqres.full + + _run_btrfs_util_prog filesystem show -m ${SCRATCH_MNT} + _run_btrfs_util_prog filesystem df ${SCRATCH_MNT} + + dd if=/dev/urandom of="$SCRATCH_MNT"/tf2 bs=$bs
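A side note on the min-of-two-sizes step in the test above: it uses `sort | head -1`, but byte counts with different digit lengths compare incorrectly under lexical sort, so `sort -n` is the safe spelling. A minimal sketch (the device sizes are hypothetical, standing in for `blockdev --getsize64` output):

```shell
# hypothetical sizes a 2TB and a 6TB drive might report via `blockdev --getsize64`
DEV1_SZ=2000398934016
DEV2_SZ=6001175126016

# numeric sort guarantees the true minimum even when digit counts differ
MIN_SZ=$(printf '%s\n%s\n' "$DEV1_SZ" "$DEV2_SZ" | sort -n | head -1)
echo "smallest device: $MIN_SZ bytes"
```

With equal digit counts, plain `sort` happens to agree; compare e.g. a 999GB size against a 6TB one and it would not.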
Re: A partially failing disk in raid0 needs replacement
On Tue, Nov 14, 2017 at 1:36 AM, Klaus Agnoletti wrote:
> Btrfs v3.17

Unrelated to the problem but this is pretty old.

> Linux box 3.16.0-4-amd64 #1 SMP Debian 3.16.43-2+deb8u5 (2017-09-19) x86_64 GNU/Linux

Also pretty old kernel.

> klaus@box:~$ sudo btrfs --version
> Btrfs v3.17
> klaus@box:~$ sudo btrfs fi df /mnt
> Data, RAID0: total=5.34TiB, used=5.14TiB
> System, RAID0: total=96.00MiB, used=384.00KiB
> Metadata, RAID0: total=7.22GiB, used=5.82GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B

The central two problems: failing hardware, and no copies of metadata. By default, mkfs.btrfs does -draid0 -mraid1 for multiple device volumes. Explicitly making metadata raid0 basically means it's a disposable file system the instant there's a problem.

What do you get for smartctl -l scterc /dev/

If you're lucky, this is really short. If it is something like 7 seconds, there's a chance the data in this sector can be recovered with a longer recovery time set by the drive *and* also setting the kernel's SCSI command timer to a value higher than 30 seconds (to match whatever you pick for the drive's error timeout). I'd pull something out of my ass like 60 seconds, or hell why not 120 seconds, for both. Maybe then there won't be a UNC error and you can quickly catch up your backups at the least.

But before trying device removal again, assuming changing the error timeout to be higher is possible, the first thing I'd do is convert metadata to raid1. Then remove the bad device.

--
Chris Murphy
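The advice above pairs two knobs that use different units: the drive's SCT ERC limit (set in tenths of a second via smartctl) and the kernel's SCSI command timer (whole seconds, via sysfs). A hedged sketch of keeping them consistent — the device name and the chosen values are illustrative, not a recommendation:

```shell
# Drive side: cap error recovery at 7.0s for reads and writes
# (argument is in tenths of a second; needs SCT ERC support):
#   smartctl -l scterc,70,70 /dev/sdX
# Kernel side: make the SCSI layer wait longer than the drive's limit:
#   echo 120 > /sys/block/sdX/device/timeout
# The hardware-touching commands stay as comments; the unit check below runs anywhere.

erc_deciseconds=70   # what we would tell the drive (7.0 s)
kernel_timeout=120   # what we would tell the kernel (seconds)

# convert deciseconds to seconds, rounding up
erc_seconds=$(( (erc_deciseconds + 9) / 10 ))
echo "drive gives up after ${erc_seconds}s, kernel waits ${kernel_timeout}s"

# the kernel must outwait the drive, or it resets the link mid-recovery
if [ "$kernel_timeout" -gt "$erc_seconds" ]; then
    echo "timeouts are consistent"
fi
```

The direction of the inequality is the whole point: a 30s kernel timer combined with a drive that retries for minutes means the kernel gives up first and the sector is never recovered.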
Re: A partially failing disk in raid0 needs replacement
On Tue, Nov 14, 2017 at 5:48 AM, Roman Mamedov wrote:
> On Tue, 14 Nov 2017 10:36:22 +0200
> Klaus Agnoletti wrote:
>
>> Obviously, I want /dev/sdd emptied and deleted from the raid.
>
> * Unmount the RAID0 FS
> * copy the bad drive using `dd_rescue`[1] into a file on the 6TB drive
>   (noting how much of it is actually unreadable -- chances are it's mostly intact)

This almost certainly will not work now; the delete command has copied metadata to the 6TB drive, so it would have to be removed first to remove that metadata, and Btrfs's record of that member device to avoid it being considered missing, and also any chunks successfully copied over.

--
Chris Murphy
Re: A partially failing disk in raid0 needs replacement
On Tue, Nov 14, 2017 at 5:38 AM, Adam Borowski wrote:
> On Tue, Nov 14, 2017 at 10:36:22AM +0200, Klaus Agnoletti wrote:
>> I used to have 3x2TB in a btrfs in raid0. A few weeks ago, one of the
>> 2TB disks started giving me I/O errors in dmesg like this:
>>
>> [388659.188988] Add. Sense: Unrecovered read error - auto reallocate failed
>
> Alas, chances to recover anything are pretty slim. That's RAID0 metadata
> for you.
>
> On the other hand, losing any non-trivial file while being able to gape at
> intact metadata isn't that much better, thus -mraid0 isn't completely
> unreasonable.

I don't know the statistics on UNC read error vs total drive failure. If I thought that total drive failure was 2x or more likely than a single UNC then maybe raid0 is reasonable. But it's a 64KB block size for raid0. I think metadata raid0 probably doesn't offer that much performance improvement over raid1, and if it did, that's a case for raid10 metadata.

In the UNC case, chances are it hits a data extent of a single file, in which case Btrfs can handle this fine; you just lose that one file. And if it hits the smaller target of metadata, it's fine if metadata is raid1 or raid10.

In a previous email in the archives, I did a test where I intentionally formatted one member drive of a Btrfs data raid0, metadata raid1, and it was totally recoverable, with a bunch of scary messages, and sometimes a file was corrupted. So it actually is pretty darn resilient when there is a copy of metadata. (I did not try DUP.)

--
Chris Murphy
Tiered storage?
Hi all

I've been following this project on and off for quite a few years, and I wonder if anyone has looked into tiered storage on it. With tiered storage, I mean hot data lying on fast storage and cold data on slow storage. I'm not talking about caching (where you just keep a copy of the hot data on the fast storage).

And btw, how far is raid[56] and block-level dedup from something useful in production?

Vennlig hilsen

roy
--
Roy Sigurd Karlsbakk
(+47) 98013356
http://blogg.karlsbakk.net/
GPG Public key: http://karlsbakk.net/roysigurdkarlsbakk.pubkey.txt
--
Hið góða skaltu í stein höggva, hið illa í snjó rita.
[PATCH 10/10] btrfs: rework end io for extent buffer reads
From: Josef Bacik

Now that the only thing that keeps eb's alive is io_pages and its refcount we need to hold the eb ref for the entire end io call so we don't get it removed out from underneath us. Also the hooks make no sense for us now, so rework this to be cleaner. Signed-off-by: Josef Bacik --- fs/btrfs/disk-io.c | 63 fs/btrfs/disk-io.h | 1 + fs/btrfs/extent_io.c | 67 +++- 3 files changed, 41 insertions(+), 90 deletions(-) diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 7ccb6d839126..459491d662a0 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -755,33 +755,13 @@ static int check_node(struct btrfs_root *root, struct extent_buffer *node) return ret; } -static int btree_readpage_end_io_hook(struct btrfs_io_bio *io_bio, - u64 phy_offset, struct page *page, - u64 start, u64 end, int mirror) +int btrfs_extent_buffer_end_read(struct extent_buffer *eb, int mirror) { + struct btrfs_fs_info *fs_info = eb->eb_info->fs_info; + struct btrfs_root *root = fs_info->tree_root; u64 found_start; int found_level; - struct extent_buffer *eb; - struct btrfs_root *root; - struct btrfs_fs_info *fs_info; int ret = 0; - int reads_done; - - if (!page->private) - goto out; - - eb = (struct extent_buffer *)page->private; - - /* the pending IO might have been the only thing that kept this buffer -* in memory.
Make sure we have a ref for all this other checks -*/ - extent_buffer_get(eb); - fs_info = eb->eb_info->fs_info; - root = fs_info->tree_root; - - reads_done = atomic_dec_and_test(&eb->io_pages); - if (!reads_done) - goto err; eb->read_mirror = mirror; if (test_bit(EXTENT_BUFFER_READ_ERR, &eb->bflags)) { @@ -833,45 +813,14 @@ static int btree_readpage_end_io_hook(struct btrfs_io_bio *io_bio, if (!ret) set_extent_buffer_uptodate(eb); err: - if (reads_done && - test_and_clear_bit(EXTENT_BUFFER_READAHEAD, &eb->bflags)) + if (test_and_clear_bit(EXTENT_BUFFER_READAHEAD, &eb->bflags)) btree_readahead_hook(eb, ret); - if (ret) { - /* -* our io error hook is going to dec the io pages -* again, we have to make sure it has something -* to decrement. -* -* TODO: Kill this, we've re-arranged how this works now so we -* don't need to do this io_pages dance. -*/ - atomic_inc(&eb->io_pages); + if (ret) clear_extent_buffer_uptodate(eb); - } - if (reads_done) { - clear_bit(EXTENT_BUFFER_READING, &eb->bflags); - smp_mb__after_atomic(); - wake_up_bit(&eb->bflags, EXTENT_BUFFER_READING); - } - free_extent_buffer(eb); -out: return ret; } -static int btree_io_failed_hook(struct page *page, int failed_mirror) -{ - struct extent_buffer *eb; - - eb = (struct extent_buffer *)page->private; - set_bit(EXTENT_BUFFER_READ_ERR, &eb->bflags); - eb->read_mirror = failed_mirror; - atomic_dec(&eb->io_pages); - if (test_and_clear_bit(EXTENT_BUFFER_READAHEAD, &eb->bflags)) - btree_readahead_hook(eb, -EIO); - return -EIO;/* we fixed nothing */ -} - static void end_workqueue_bio(struct bio *bio) { struct btrfs_end_io_wq *end_io_wq = bio->bi_private; @@ -4553,9 +4502,7 @@ static int btree_merge_bio_hook(struct page *page, unsigned long offset, static const struct extent_io_ops btree_extent_io_ops = { /* mandatory callbacks */ .submit_bio_hook = btree_submit_bio_hook, - .readpage_end_io_hook = btree_readpage_end_io_hook, .merge_bio_hook = btree_merge_bio_hook, - .readpage_io_failed_hook = btree_io_failed_hook, .set_range_writeback =
btrfs_set_range_writeback, .tree_fs_info = btree_fs_info, diff --git a/fs/btrfs/disk-io.h b/fs/btrfs/disk-io.h index 7f7c35d6347a..e1f4fef91547 100644 --- a/fs/btrfs/disk-io.h +++ b/fs/btrfs/disk-io.h @@ -152,6 +152,7 @@ int btree_lock_page_hook(struct page *page, void *data, int btrfs_get_num_tolerated_disk_barrier_failures(u64 flags); int __init btrfs_end_io_wq_init(void); void btrfs_end_io_wq_exit(void); +int btrfs_extent_buffer_end_read(struct extent_buffer *eb, int mirror); #ifdef CONFIG_DEBUG_LOCK_ALLOC void btrfs_init_lockdep(void); diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index 2077bd6ad1b3..1e5affee0f7e 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -20,6 +20,7 @@ #include "locking.h" #include "rcu-string.h" #include "backref.h" +#include "disk-io.h" static struct kmem_cache
[PATCH 03/10] lib: add a batch size to fprop_global
From: Josef Bacik

The flexible proportion stuff has been used to track how many pages we are writing out over a period of time, so counts everything in single increments. If we wanted to use another base value we need to be able to adjust the batch size to fit the units we'll be using for the proportions. Signed-off-by: Josef Bacik --- include/linux/flex_proportions.h | 4 +++- lib/flex_proportions.c | 11 +-- 2 files changed, 8 insertions(+), 7 deletions(-) diff --git a/include/linux/flex_proportions.h b/include/linux/flex_proportions.h index 0d348e011a6e..853f4305d1b2 100644 --- a/include/linux/flex_proportions.h +++ b/include/linux/flex_proportions.h @@ -20,7 +20,7 @@ */ #define FPROP_FRAC_SHIFT 10 #define FPROP_FRAC_BASE (1UL << FPROP_FRAC_SHIFT) - +#define FPROP_BATCH_SIZE (8*(1+ilog2(nr_cpu_ids))) /* * Global proportion definitions */ @@ -31,6 +31,8 @@ struct fprop_global { unsigned int period; /* Synchronization with period transitions */ seqcount_t sequence; + /* batch size */ + s32 batch_size; }; int fprop_global_init(struct fprop_global *p, gfp_t gfp); diff --git a/lib/flex_proportions.c b/lib/flex_proportions.c index 2cc1f94e03a1..5552523b663a 100644 --- a/lib/flex_proportions.c +++ b/lib/flex_proportions.c @@ -44,6 +44,7 @@ int fprop_global_init(struct fprop_global *p, gfp_t gfp) if (err) return err; seqcount_init(&p->sequence); + p->batch_size = FPROP_BATCH_SIZE; return 0; } @@ -166,8 +167,6 @@ void fprop_fraction_single(struct fprop_global *p, /* * PERCPU */ -#define PROP_BATCH (8*(1+ilog2(nr_cpu_ids))) - int fprop_local_init_percpu(struct fprop_local_percpu *pl, gfp_t gfp) { int err; @@ -204,11 +203,11 @@ static void fprop_reflect_period_percpu(struct fprop_global *p, if (period - pl->period < BITS_PER_LONG) { s64 val = percpu_counter_read(&pl->events); - if (val < (nr_cpu_ids * PROP_BATCH)) + if (val < (nr_cpu_ids * p->batch_size)) val = percpu_counter_sum(&pl->events); percpu_counter_add_batch(&pl->events, - -val + (val >> (period-pl->period)), PROP_BATCH); +
-val + (val >> (period-pl->period)), p->batch_size); } else percpu_counter_set(&pl->events, 0); pl->period = period; } @@ -219,7 +218,7 @@ static void fprop_reflect_period_percpu(struct fprop_global *p, void __fprop_inc_percpu(struct fprop_global *p, struct fprop_local_percpu *pl) { fprop_reflect_period_percpu(p, pl); - percpu_counter_add_batch(&pl->events, 1, PROP_BATCH); + percpu_counter_add_batch(&pl->events, 1, p->batch_size); percpu_counter_add(&p->events, 1); } @@ -267,6 +266,6 @@ void __fprop_inc_percpu_max(struct fprop_global *p, return; } else fprop_reflect_period_percpu(p, pl); - percpu_counter_add_batch(&pl->events, 1, PROP_BATCH); + percpu_counter_add_batch(&pl->events, 1, p->batch_size); percpu_counter_add(&p->events, 1); } -- 2.7.5
[PATCH 09/10] Btrfs: kill the btree_inode
From: Josef Bacik

In order to more efficiently support sub-page blocksizes we need to stop allocating pages from pagecache for our metadata. Instead switch to using the account_metadata* counters for making sure we are keeping the system aware of how much dirty metadata we have, and use the ->free_cached_objects super operation in order to handle freeing up extent buffers. This greatly simplifies how we deal with extent buffers as now we no longer have to tie the page cache reclamation stuff to the extent buffer stuff. This will also allow us to simply kmalloc() our data for sub-page blocksizes. Signed-off-by: Josef Bacik --- fs/btrfs/btrfs_inode.h | 1 - fs/btrfs/ctree.c | 18 +- fs/btrfs/ctree.h | 17 +- fs/btrfs/dir-item.c| 2 +- fs/btrfs/disk-io.c | 385 -- fs/btrfs/extent-tree.c | 14 +- fs/btrfs/extent_io.c | 919 ++--- fs/btrfs/extent_io.h | 51 +- fs/btrfs/inode.c | 6 +- fs/btrfs/print-tree.c | 13 +- fs/btrfs/reada.c | 2 +- fs/btrfs/root-tree.c | 2 +- fs/btrfs/super.c | 31 +- fs/btrfs/tests/btrfs-tests.c | 36 +- fs/btrfs/tests/extent-buffer-tests.c | 3 +- fs/btrfs/tests/extent-io-tests.c | 4 +- fs/btrfs/tests/free-space-tree-tests.c | 3 +- fs/btrfs/tests/inode-tests.c | 4 +- fs/btrfs/tests/qgroup-tests.c | 3 +- fs/btrfs/transaction.c | 13 +- 20 files changed, 757 insertions(+), 770 deletions(-) diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h index f9c6887a8b6c..24582650622d 100644 --- a/fs/btrfs/btrfs_inode.h +++ b/fs/btrfs/btrfs_inode.h @@ -241,7 +241,6 @@ static inline u64 btrfs_ino(const struct btrfs_inode *inode) u64 ino = inode->location.objectid; /* -* !ino: btree_inode * type == BTRFS_ROOT_ITEM_KEY: subvol dir */ if (!ino || inode->location.type == BTRFS_ROOT_ITEM_KEY) diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c index 531e0a8645b0..3c6610b5d0d3 100644 --- a/fs/btrfs/ctree.c +++ b/fs/btrfs/ctree.c @@ -1361,7 +1361,8 @@ tree_mod_log_rewind(struct btrfs_fs_info *fs_info, struct btrfs_path *path, if (tm->op == MOD_LOG_KEY_REMOVE_WHILE_FREEING)
{ BUG_ON(tm->slot != 0); - eb_rewin = alloc_dummy_extent_buffer(fs_info, eb->start); + eb_rewin = alloc_dummy_extent_buffer(fs_info->eb_info, +eb->start, eb->len); if (!eb_rewin) { btrfs_tree_read_unlock_blocking(eb); free_extent_buffer(eb); @@ -1444,7 +1445,8 @@ get_old_root(struct btrfs_root *root, u64 time_seq) } else if (old_root) { btrfs_tree_read_unlock(eb_root); free_extent_buffer(eb_root); - eb = alloc_dummy_extent_buffer(fs_info, logical); + eb = alloc_dummy_extent_buffer(root->fs_info->eb_info, logical, + root->fs_info->nodesize); } else { btrfs_set_lock_blocking_rw(eb_root, BTRFS_READ_LOCK); eb = btrfs_clone_extent_buffer(eb_root); @@ -1675,7 +1677,7 @@ int btrfs_realloc_node(struct btrfs_trans_handle *trans, continue; } - cur = find_extent_buffer(fs_info, blocknr); + cur = find_extent_buffer(fs_info->eb_info, blocknr); if (cur) uptodate = btrfs_buffer_uptodate(cur, gen, 0); else @@ -1748,7 +1750,7 @@ static noinline int generic_bin_search(struct extent_buffer *eb, int err; if (low > high) { - btrfs_err(eb->fs_info, + btrfs_err(eb->eb_info->fs_info, "%s: low (%d) > high (%d) eb %llu owner %llu level %d", __func__, low, high, eb->start, btrfs_header_owner(eb), btrfs_header_level(eb)); @@ -2260,7 +2262,7 @@ static void reada_for_search(struct btrfs_fs_info *fs_info, search = btrfs_node_blockptr(node, slot); blocksize = fs_info->nodesize; - eb = find_extent_buffer(fs_info, search); + eb = find_extent_buffer(fs_info->eb_info, search); if (eb) { free_extent_buffer(eb); return; @@ -2319,7 +2321,7 @@ static noinline void reada_for_balance(struct btrfs_fs_info *fs_info, if (slot > 0) { block1 = btrfs_node_blockptr(parent, slot - 1); gen = btrfs_node_ptr_generation(parent, slot - 1); - eb = find_extent_buffer(fs_info, block1); + eb = find_extent_buffer(fs_info->eb_info,
[PATCH 07/10] writeback: introduce super_operations->write_metadata
From: Josef BacikNow that we have metadata counters in the VM, we need to provide a way to kick writeback on dirty metadata. Introduce super_operations->write_metadata. This allows file systems to deal with writing back any dirty metadata we need based on the writeback needs of the system. Since there is no inode to key off of we need a list in the bdi for dirty super blocks to be added. From there we can find any dirty sb's on the bdi we are currently doing writeback on and call into their ->write_metadata callback. Signed-off-by: Josef Bacik Reviewed-by: Jan Kara Reviewed-by: Tejun Heo --- fs/fs-writeback.c| 72 fs/super.c | 6 include/linux/backing-dev-defs.h | 2 ++ include/linux/fs.h | 4 +++ mm/backing-dev.c | 2 ++ 5 files changed, 80 insertions(+), 6 deletions(-) diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c index 987448ed7698..fba703dff678 100644 --- a/fs/fs-writeback.c +++ b/fs/fs-writeback.c @@ -1479,6 +1479,31 @@ static long writeback_chunk_size(struct bdi_writeback *wb, return pages; } +static long writeback_sb_metadata(struct super_block *sb, + struct bdi_writeback *wb, + struct wb_writeback_work *work) +{ + struct writeback_control wbc = { + .sync_mode = work->sync_mode, + .tagged_writepages = work->tagged_writepages, + .for_kupdate= work->for_kupdate, + .for_background = work->for_background, + .for_sync = work->for_sync, + .range_cyclic = work->range_cyclic, + .range_start= 0, + .range_end = LLONG_MAX, + }; + long write_chunk; + + write_chunk = writeback_chunk_size(wb, work); + wbc.nr_to_write = write_chunk; + sb->s_op->write_metadata(sb, ); + work->nr_pages -= write_chunk - wbc.nr_to_write; + + return write_chunk - wbc.nr_to_write; +} + + /* * Write a portion of b_io inodes which belong to @sb. 
* @@ -1505,6 +1530,7 @@ static long writeback_sb_inodes(struct super_block *sb, unsigned long start_time = jiffies; long write_chunk; long wrote = 0; /* count both pages and inodes */ + bool done = false; while (!list_empty(>b_io)) { struct inode *inode = wb_inode(wb->b_io.prev); @@ -1621,12 +1647,18 @@ static long writeback_sb_inodes(struct super_block *sb, * background threshold and other termination conditions. */ if (wrote) { - if (time_is_before_jiffies(start_time + HZ / 10UL)) - break; - if (work->nr_pages <= 0) + if (time_is_before_jiffies(start_time + HZ / 10UL) || + work->nr_pages <= 0) { + done = true; break; + } } } + if (!done && sb->s_op->write_metadata) { + spin_unlock(>list_lock); + wrote += writeback_sb_metadata(sb, wb, work); + spin_lock(>list_lock); + } return wrote; } @@ -1635,6 +1667,7 @@ static long __writeback_inodes_wb(struct bdi_writeback *wb, { unsigned long start_time = jiffies; long wrote = 0; + bool done = false; while (!list_empty(>b_io)) { struct inode *inode = wb_inode(wb->b_io.prev); @@ -1654,12 +1687,39 @@ static long __writeback_inodes_wb(struct bdi_writeback *wb, /* refer to the same tests at the end of writeback_sb_inodes */ if (wrote) { - if (time_is_before_jiffies(start_time + HZ / 10UL)) - break; - if (work->nr_pages <= 0) + if (time_is_before_jiffies(start_time + HZ / 10UL) || + work->nr_pages <= 0) { + done = true; break; + } } } + + if (!done && wb_stat(wb, WB_METADATA_DIRTY_BYTES)) { + LIST_HEAD(list); + + spin_unlock(>list_lock); + spin_lock(>bdi->sb_list_lock); + list_splice_init(>bdi->dirty_sb_list, ); + while (!list_empty()) { + struct super_block *sb; + + sb = list_first_entry(, struct super_block, + s_bdi_dirty_list); + list_move_tail(>s_bdi_dirty_list, + >bdi->dirty_sb_list); + if
[PATCH 06/10] writeback: add counters for metadata usage
From: Josef BacikBtrfs has no bounds except memory on the amount of dirty memory that we have in use for metadata. Historically we have used a special inode so we could take advantage of the balance_dirty_pages throttling that comes with using pagecache. However as we'd like to support different blocksizes it would be nice to not have to rely on pagecache, but still get the balance_dirty_pages throttling without having to do it ourselves. So introduce *METADATA_DIRTY_BYTES and *METADATA_WRITEBACK_BYTES. These are zone and bdi_writeback counters to keep track of how many bytes we have in flight for METADATA. We need to count in bytes as blocksizes could be percentages of pagesize. We simply convert the bytes to number of pages where it is needed for the throttling. Also introduce NR_METADATA_BYTES so we can keep track of the total amount of pages used for metadata on the system. This is also needed so things like dirty throttling know that this is dirtyable memory as well and easily reclaimed. 
Signed-off-by: Josef Bacik Reviewed-by: Jan Kara --- drivers/base/node.c | 8 +++ fs/fs-writeback.c| 2 + fs/proc/meminfo.c| 8 +++ include/linux/backing-dev-defs.h | 2 + include/linux/mm.h | 9 +++ include/linux/mmzone.h | 3 + include/trace/events/writeback.h | 13 +++- mm/backing-dev.c | 4 ++ mm/page-writeback.c | 141 +++ mm/page_alloc.c | 20 -- mm/util.c| 1 + mm/vmscan.c | 19 +- mm/vmstat.c | 3 + 13 files changed, 211 insertions(+), 22 deletions(-) diff --git a/drivers/base/node.c b/drivers/base/node.c index 3855902f2c5b..a39cecc8957a 100644 --- a/drivers/base/node.c +++ b/drivers/base/node.c @@ -51,6 +51,8 @@ static DEVICE_ATTR(cpumap, S_IRUGO, node_read_cpumask, NULL); static DEVICE_ATTR(cpulist, S_IRUGO, node_read_cpulist, NULL); #define K(x) ((x) << (PAGE_SHIFT - 10)) +#define BtoK(x) ((x) >> 10) + static ssize_t node_read_meminfo(struct device *dev, struct device_attribute *attr, char *buf) { @@ -99,7 +101,10 @@ static ssize_t node_read_meminfo(struct device *dev, #endif n += sprintf(buf + n, "Node %d Dirty: %8lu kB\n" + "Node %d MetadataDirty: %8lu kB\n" "Node %d Writeback: %8lu kB\n" + "Node %d MetaWriteback: %8lu kB\n" + "Node %d Metadata: %8lu kB\n" "Node %d FilePages: %8lu kB\n" "Node %d Mapped: %8lu kB\n" "Node %d AnonPages: %8lu kB\n" @@ -119,8 +124,11 @@ static ssize_t node_read_meminfo(struct device *dev, #endif , nid, K(node_page_state(pgdat, NR_FILE_DIRTY)), + nid, BtoK(node_page_state(pgdat, NR_METADATA_DIRTY_BYTES)), nid, K(node_page_state(pgdat, NR_WRITEBACK)), + nid, BtoK(node_page_state(pgdat, NR_METADATA_WRITEBACK_BYTES)), nid, K(node_page_state(pgdat, NR_FILE_PAGES)), + nid, BtoK(node_page_state(pgdat, NR_METADATA_BYTES)), nid, K(node_page_state(pgdat, NR_FILE_MAPPED)), nid, K(node_page_state(pgdat, NR_ANON_MAPPED)), nid, K(i.sharedram), diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c index 245c430a2e41..987448ed7698 100644 --- a/fs/fs-writeback.c +++ b/fs/fs-writeback.c @@ -1814,6 +1814,7 @@ static struct wb_writeback_work 
*get_next_work_item(struct bdi_writeback *wb) return work; } +#define BtoP(x) ((x) >> PAGE_SHIFT) /* * Add in the number of potentially dirty inodes, because each inode * write can dirty pagecache in the underlying blockdev. @@ -1822,6 +1823,7 @@ static unsigned long get_nr_dirty_pages(void) { return global_node_page_state(NR_FILE_DIRTY) + global_node_page_state(NR_UNSTABLE_NFS) + + BtoP(global_node_page_state(NR_METADATA_DIRTY_BYTES)) + get_nr_dirty_inodes(); } diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c index cdd979724c74..fa1fd24a4d99 100644 --- a/fs/proc/meminfo.c +++ b/fs/proc/meminfo.c @@ -42,6 +42,8 @@ static void show_val_kb(struct seq_file *m, const char *s, unsigned long num) seq_write(m, " kB\n", 4); } +#define BtoP(x) ((x) >> PAGE_SHIFT) + static int meminfo_proc_show(struct seq_file *m, void *v) { struct sysinfo i; @@ -71,6 +73,8 @@ static int meminfo_proc_show(struct seq_file *m, void *v) show_val_kb(m, "Buffers:", i.bufferram); show_val_kb(m, "Cached: ", cached); show_val_kb(m, "SwapCached: ", total_swapcache_pages()); +
[PATCH 08/10] export radix_tree_iter_tag_set
From: Josef Bacik

We use this in btrfs for metadata writeback.

Acked-by: Matthew Wilcox
Signed-off-by: Josef Bacik
---
 lib/radix-tree.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 8b1feca1230a..0c1cde9fcb69 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -1459,6 +1459,7 @@ void radix_tree_iter_tag_set(struct radix_tree_root *root,
 {
 	node_tag_set(root, iter->node, tag, iter_offset(iter));
 }
+EXPORT_SYMBOL(radix_tree_iter_tag_set);
 
 static void node_tag_clear(struct radix_tree_root *root,
 			   struct radix_tree_node *node,
-- 
2.7.5
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 05/10] writeback: convert the flexible prop stuff to bytes
From: Josef BacikThe flexible proportions were all page based, but now that we are doing metadata writeout that can be smaller or larger than page size we need to account for this in bytes instead of number of pages. Signed-off-by: Josef Bacik --- mm/backing-dev.c| 2 +- mm/page-writeback.c | 19 --- 2 files changed, 13 insertions(+), 8 deletions(-) diff --git a/mm/backing-dev.c b/mm/backing-dev.c index 62a332a91b38..e0d7c62dc0ad 100644 --- a/mm/backing-dev.c +++ b/mm/backing-dev.c @@ -832,7 +832,7 @@ static int bdi_init(struct backing_dev_info *bdi) kref_init(>refcnt); bdi->min_ratio = 0; bdi->max_ratio = 100; - bdi->max_prop_frac = FPROP_FRAC_BASE; + bdi->max_prop_frac = FPROP_FRAC_BASE * PAGE_SIZE; INIT_LIST_HEAD(>bdi_list); INIT_LIST_HEAD(>wb_list); init_waitqueue_head(>wb_waitq); diff --git a/mm/page-writeback.c b/mm/page-writeback.c index e4563645749a..c491dee711a8 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -574,11 +574,11 @@ static unsigned long wp_next_time(unsigned long cur_time) return cur_time; } -static void wb_domain_writeout_inc(struct wb_domain *dom, +static void wb_domain_writeout_add(struct wb_domain *dom, struct fprop_local_percpu *completions, - unsigned int max_prop_frac) + long bytes, unsigned int max_prop_frac) { - __fprop_inc_percpu_max(>completions, completions, + __fprop_add_percpu_max(>completions, completions, bytes, max_prop_frac); /* First event after period switching was turned off? 
*/ if (unlikely(!dom->period_time)) { @@ -602,12 +602,12 @@ static inline void __wb_writeout_add(struct bdi_writeback *wb, long bytes) struct wb_domain *cgdom; __add_wb_stat(wb, WB_WRITTEN_BYTES, bytes); - wb_domain_writeout_inc(_wb_domain, >completions, + wb_domain_writeout_add(_wb_domain, >completions, bytes, wb->bdi->max_prop_frac); cgdom = mem_cgroup_wb_domain(wb); if (cgdom) - wb_domain_writeout_inc(cgdom, wb_memcg_completions(wb), + wb_domain_writeout_add(cgdom, wb_memcg_completions(wb), bytes, wb->bdi->max_prop_frac); } @@ -646,6 +646,7 @@ static void writeout_period(unsigned long t) int wb_domain_init(struct wb_domain *dom, gfp_t gfp) { + int ret; memset(dom, 0, sizeof(*dom)); spin_lock_init(>lock); @@ -655,7 +656,10 @@ int wb_domain_init(struct wb_domain *dom, gfp_t gfp) dom->dirty_limit_tstamp = jiffies; - return fprop_global_init(>completions, gfp); + ret = fprop_global_init(>completions, gfp); + if (!ret) + dom->completions.batch_size *= PAGE_SIZE; + return ret; } #ifdef CONFIG_CGROUP_WRITEBACK @@ -706,7 +710,8 @@ int bdi_set_max_ratio(struct backing_dev_info *bdi, unsigned max_ratio) ret = -EINVAL; } else { bdi->max_ratio = max_ratio; - bdi->max_prop_frac = (FPROP_FRAC_BASE * max_ratio) / 100; + bdi->max_prop_frac = ((FPROP_FRAC_BASE * max_ratio) / 100) * + PAGE_SIZE; } spin_unlock_bh(_lock); -- 2.7.5 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 01/10] remove mapping from balance_dirty_pages*()
From: Josef BacikThe only reason we pass in the mapping is to get the inode in order to see if writeback cgroups is enabled, and even then it only checks the bdi and a super block flag. balance_dirty_pages() doesn't even use the mapping. Since balance_dirty_pages*() works on a bdi level, just pass in the bdi and super block directly so we can avoid using mapping. This will allow us to still use balance_dirty_pages for dirty metadata pages that are not backed by an address_mapping. Signed-off-by: Josef Bacik Reviewed-by: Jan Kara --- drivers/mtd/devices/block2mtd.c | 12 fs/btrfs/disk-io.c | 3 ++- fs/btrfs/file.c | 3 ++- fs/btrfs/ioctl.c| 3 ++- fs/btrfs/relocation.c | 3 ++- fs/buffer.c | 3 ++- fs/iomap.c | 6 -- fs/ntfs/attrib.c| 11 --- fs/ntfs/file.c | 4 ++-- include/linux/backing-dev.h | 29 +++-- include/linux/writeback.h | 4 +++- mm/filemap.c| 4 +++- mm/memory.c | 5 - mm/page-writeback.c | 15 +++ 14 files changed, 72 insertions(+), 33 deletions(-) diff --git a/drivers/mtd/devices/block2mtd.c b/drivers/mtd/devices/block2mtd.c index 7c887f111a7d..7892d0b9fcb0 100644 --- a/drivers/mtd/devices/block2mtd.c +++ b/drivers/mtd/devices/block2mtd.c @@ -52,7 +52,8 @@ static struct page *page_read(struct address_space *mapping, int index) /* erase a specified part of the device */ static int _block2mtd_erase(struct block2mtd_dev *dev, loff_t to, size_t len) { - struct address_space *mapping = dev->blkdev->bd_inode->i_mapping; + struct inode *inode = dev->blkdev->bd_inode; + struct address_space *mapping = inode->i_mapping; struct page *page; int index = to >> PAGE_SHIFT; // page index int pages = len >> PAGE_SHIFT; @@ -71,7 +72,8 @@ static int _block2mtd_erase(struct block2mtd_dev *dev, loff_t to, size_t len) memset(page_address(page), 0xff, PAGE_SIZE); set_page_dirty(page); unlock_page(page); - balance_dirty_pages_ratelimited(mapping); + balance_dirty_pages_ratelimited(inode_to_bdi(inode), + inode->i_sb); break; } @@ -141,7 +143,8 @@ static int _block2mtd_write(struct 
block2mtd_dev *dev, const u_char *buf, loff_t to, size_t len, size_t *retlen) { struct page *page; - struct address_space *mapping = dev->blkdev->bd_inode->i_mapping; + struct inode *inode = dev->blkdev->bd_inode; + struct address_space *mapping = inode->i_mapping; int index = to >> PAGE_SHIFT; // page index int offset = to & ~PAGE_MASK; // page offset int cpylen; @@ -162,7 +165,8 @@ static int _block2mtd_write(struct block2mtd_dev *dev, const u_char *buf, memcpy(page_address(page) + offset, buf, cpylen); set_page_dirty(page); unlock_page(page); - balance_dirty_pages_ratelimited(mapping); + balance_dirty_pages_ratelimited(inode_to_bdi(inode), + inode->i_sb); } put_page(page); diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 689b9913ccb5..8b6df7688d52 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -4150,7 +4150,8 @@ static void __btrfs_btree_balance_dirty(struct btrfs_fs_info *fs_info, ret = percpu_counter_compare(_info->dirty_metadata_bytes, BTRFS_DIRTY_METADATA_THRESH); if (ret > 0) { - balance_dirty_pages_ratelimited(fs_info->btree_inode->i_mapping); + balance_dirty_pages_ratelimited(fs_info->sb->s_bdi, + fs_info->sb); } } diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c index ab1c38f2dd8c..4bc6cd6509be 100644 --- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -1779,7 +1779,8 @@ static noinline ssize_t __btrfs_buffered_write(struct file *file, cond_resched(); - balance_dirty_pages_ratelimited(inode->i_mapping); + balance_dirty_pages_ratelimited(inode_to_bdi(inode), + inode->i_sb); if (dirty_pages < (fs_info->nodesize >> PAGE_SHIFT) + 1) btrfs_btree_balance_dirty(fs_info); diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index 6a07d4e12fd2..ec92fb5e2b51 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -1368,7 +1368,8 @@ int
[PATCH 04/10] lib: add a __fprop_add_percpu_max
From: Josef BacikThis helper allows us to add an arbitrary amount to the fprop structures. Signed-off-by: Josef Bacik --- include/linux/flex_proportions.h | 11 +-- lib/flex_proportions.c | 9 + 2 files changed, 14 insertions(+), 6 deletions(-) diff --git a/include/linux/flex_proportions.h b/include/linux/flex_proportions.h index 853f4305d1b2..2d1a87331e5d 100644 --- a/include/linux/flex_proportions.h +++ b/include/linux/flex_proportions.h @@ -85,8 +85,8 @@ struct fprop_local_percpu { int fprop_local_init_percpu(struct fprop_local_percpu *pl, gfp_t gfp); void fprop_local_destroy_percpu(struct fprop_local_percpu *pl); void __fprop_inc_percpu(struct fprop_global *p, struct fprop_local_percpu *pl); -void __fprop_inc_percpu_max(struct fprop_global *p, struct fprop_local_percpu *pl, - int max_frac); +void __fprop_add_percpu_max(struct fprop_global *p, struct fprop_local_percpu *pl, + unsigned long nr, int max_frac); void fprop_fraction_percpu(struct fprop_global *p, struct fprop_local_percpu *pl, unsigned long *numerator, unsigned long *denominator); @@ -101,4 +101,11 @@ void fprop_inc_percpu(struct fprop_global *p, struct fprop_local_percpu *pl) local_irq_restore(flags); } +static inline +void __fprop_inc_percpu_max(struct fprop_global *p, + struct fprop_local_percpu *pl, int max_frac) +{ + __fprop_add_percpu_max(p, pl, 1, max_frac); +} + #endif diff --git a/lib/flex_proportions.c b/lib/flex_proportions.c index 5552523b663a..2190180a81fe 100644 --- a/lib/flex_proportions.c +++ b/lib/flex_proportions.c @@ -254,8 +254,9 @@ void fprop_fraction_percpu(struct fprop_global *p, * Like __fprop_inc_percpu() except that event is counted only if the given * type has fraction smaller than @max_frac/FPROP_FRAC_BASE */ -void __fprop_inc_percpu_max(struct fprop_global *p, - struct fprop_local_percpu *pl, int max_frac) +void __fprop_add_percpu_max(struct fprop_global *p, + struct fprop_local_percpu *pl, unsigned long nr, + int max_frac) { if (unlikely(max_frac < FPROP_FRAC_BASE)) { 
unsigned long numerator, denominator; @@ -266,6 +267,6 @@ void __fprop_inc_percpu_max(struct fprop_global *p, return; } else fprop_reflect_period_percpu(p, pl); - percpu_counter_add_batch(>events, 1, p->batch_size); - percpu_counter_add(>events, 1); + percpu_counter_add_batch(>events, nr, p->batch_size); + percpu_counter_add(>events, nr); } -- 2.7.5 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 02/10] writeback: convert WB_WRITTEN/WB_DIRTIED counters to bytes
From: Josef BacikThese are counters that constantly go up in order to do bandwidth calculations. It isn't important what the units are in, as long as they are consistent between the two of them, so convert them to count bytes written/dirtied, and allow the metadata accounting stuff to change the counters as well. Signed-off-by: Josef Bacik Acked-by: Tejun Heo --- fs/fuse/file.c | 4 ++-- include/linux/backing-dev-defs.h | 4 ++-- include/linux/backing-dev.h | 2 +- mm/backing-dev.c | 9 + mm/page-writeback.c | 20 ++-- 5 files changed, 20 insertions(+), 19 deletions(-) diff --git a/fs/fuse/file.c b/fs/fuse/file.c index cb7dff5c45d7..67e7c4fac28d 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -1471,7 +1471,7 @@ static void fuse_writepage_finish(struct fuse_conn *fc, struct fuse_req *req) for (i = 0; i < req->num_pages; i++) { dec_wb_stat(>wb, WB_WRITEBACK); dec_node_page_state(req->pages[i], NR_WRITEBACK_TEMP); - wb_writeout_inc(>wb); + wb_writeout_add(>wb, PAGE_SIZE); } wake_up(>page_waitq); } @@ -1776,7 +1776,7 @@ static bool fuse_writepage_in_flight(struct fuse_req *new_req, dec_wb_stat(>wb, WB_WRITEBACK); dec_node_page_state(page, NR_WRITEBACK_TEMP); - wb_writeout_inc(>wb); + wb_writeout_add(>wb, PAGE_SIZE); fuse_writepage_free(fc, new_req); fuse_request_free(new_req); goto out; diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h index 866c433e7d32..ded45ac2cec7 100644 --- a/include/linux/backing-dev-defs.h +++ b/include/linux/backing-dev-defs.h @@ -36,8 +36,8 @@ typedef int (congested_fn)(void *, int); enum wb_stat_item { WB_RECLAIMABLE, WB_WRITEBACK, - WB_DIRTIED, - WB_WRITTEN, + WB_DIRTIED_BYTES, + WB_WRITTEN_BYTES, NR_WB_STAT_ITEMS }; diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h index 14e266d12620..39b8dc486ea7 100644 --- a/include/linux/backing-dev.h +++ b/include/linux/backing-dev.h @@ -89,7 +89,7 @@ static inline s64 wb_stat_sum(struct bdi_writeback *wb, enum wb_stat_item item) return 
percpu_counter_sum_positive(>stat[item]); } -extern void wb_writeout_inc(struct bdi_writeback *wb); +extern void wb_writeout_add(struct bdi_writeback *wb, long bytes); /* * maximal error of a stat counter. diff --git a/mm/backing-dev.c b/mm/backing-dev.c index e19606bb41a0..62a332a91b38 100644 --- a/mm/backing-dev.c +++ b/mm/backing-dev.c @@ -68,14 +68,15 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v) wb_thresh = wb_calc_thresh(wb, dirty_thresh); #define K(x) ((x) << (PAGE_SHIFT - 10)) +#define BtoK(x) ((x) >> 10) seq_printf(m, "BdiWriteback: %10lu kB\n" "BdiReclaimable: %10lu kB\n" "BdiDirtyThresh: %10lu kB\n" "DirtyThresh:%10lu kB\n" "BackgroundThresh: %10lu kB\n" - "BdiDirtied: %10lu kB\n" - "BdiWritten: %10lu kB\n" + "BdiDirtiedBytes:%10lu kB\n" + "BdiWrittenBytes:%10lu kB\n" "BdiWriteBandwidth: %10lu kBps\n" "b_dirty:%10lu\n" "b_io: %10lu\n" @@ -88,8 +89,8 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v) K(wb_thresh), K(dirty_thresh), K(background_thresh), - (unsigned long) K(wb_stat(wb, WB_DIRTIED)), - (unsigned long) K(wb_stat(wb, WB_WRITTEN)), + (unsigned long) BtoK(wb_stat(wb, WB_DIRTIED_BYTES)), + (unsigned long) BtoK(wb_stat(wb, WB_WRITTEN_BYTES)), (unsigned long) K(wb->write_bandwidth), nr_dirty, nr_io, diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 1a47d4296750..e4563645749a 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -597,11 +597,11 @@ static void wb_domain_writeout_inc(struct wb_domain *dom, * Increment @wb's writeout completion count and the global writeout * completion count. Called from test_clear_page_writeback(). 
*/ -static inline void __wb_writeout_inc(struct bdi_writeback *wb) +static inline void __wb_writeout_add(struct bdi_writeback *wb, long bytes) { struct wb_domain *cgdom; - inc_wb_stat(wb, WB_WRITTEN); + __add_wb_stat(wb, WB_WRITTEN_BYTES, bytes); wb_domain_writeout_inc(_wb_domain, >completions, wb->bdi->max_prop_frac); @@ -611,15 +611,15 @@ static inline void __wb_writeout_inc(struct bdi_writeback *wb)
Re: Need help with incremental backup strategy (snapshots, defragmenting & performance)
On Tue, Nov 14, 2017 at 3:50 AM, Roman Mamedov wrote:
>
> On Mon, 13 Nov 2017 22:39:44 -0500
> Dave wrote:
>
> > I have my live system on one block device and a backup snapshot of it
> > on another block device. I am keeping them in sync with hourly rsync
> > transfers.
> >
> > Here's how this system works in a little more detail:
> >
> > 1. I establish the baseline by sending a full snapshot to the backup
> > block device using btrfs send-receive.
> > 2. Next, on the backup device I immediately create a rw copy of that
> > baseline snapshot.
> > 3. I delete the source snapshot to keep the live filesystem free of
> > all snapshots (so it can be optimally defragmented, etc.)
> > 4. hourly, I take a snapshot of the live system, rsync all changes to
> > the backup block device, and then delete the source snapshot. This
> > hourly process takes less than a minute currently. (My test system has
> > only moderate usage.)
> > 5. hourly, following the above step, I use snapper to take a snapshot
> > of the backup subvolume to create/preserve a history of changes. For
> > example, I can find the version of a file 30 hours prior.
>
> Sounds a bit complex, I still don't get why you need all these snapshot
> creations and deletions, and even still using btrfs send-receive.

Hopefully, my comments below will explain my reasons.

> Here is my scheme:
>
> /mnt/dst <- mounted backup storage volume
> /mnt/dst/backup <- a subvolume
> /mnt/dst/backup/host1/ <- rsync destination for host1, regular directory
> /mnt/dst/backup/host2/ <- rsync destination for host2, regular directory
> /mnt/dst/backup/host3/ <- rsync destination for host3, regular directory
> etc.
>
> /mnt/dst/backup/host1/bin/
> /mnt/dst/backup/host1/etc/
> /mnt/dst/backup/host1/home/
> ...
> Self explanatory. All regular directories, not subvolumes.
>
> Snapshots:
> /mnt/dst/snaps/backup <- a regular directory
> /mnt/dst/snaps/backup/2017-11-14T12:00/ <- snapshot 1 of /mnt/dst/backup
> /mnt/dst/snaps/backup/2017-11-14T13:00/ <- snapshot 2 of /mnt/dst/backup
> /mnt/dst/snaps/backup/2017-11-14T14:00/ <- snapshot 3 of /mnt/dst/backup
>
> Accessing historic data:
> /mnt/dst/snaps/backup/2017-11-14T12:00/host1/bin/bash
> ...
> /bin/bash for host1 as of 2017-11-14 12:00 (time on the backup system).
>
> No need for btrfs send-receive, only plain rsync is used, directly from
> hostX:/ to /mnt/dst/backup/host1/;

I prefer to start with a BTRFS snapshot at the backup destination. I
think that's the most "accurate" starting point.

> No need to create or delete snapshots during the actual backup process;

Then you can't guarantee consistency of the backed up information.

> A single common timeline is kept for all hosts to be backed up, snapshot
> count not multiplied by the number of hosts (in my case the backup
> location is multi-purpose, so I somewhat care about total number of
> snapshots there as well);
>
> Also, all of this works even with source hosts which do not use Btrfs.

That's not a concern for me because I prefer to use BTRFS everywhere.
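The hourly cycle described in steps 4 and 5 above can be sketched as a short script. All paths, the config name, and the rsync flags are illustrative, not taken from the thread; `run` only echoes each command so the plan can be reviewed before anything is executed for real.

```shell
#!/bin/sh
# Sketch of one hourly backup cycle from the scheme above.
# 'run' echoes the command instead of executing it (dry run).
run() { echo "+ $*"; }

LIVE=/mnt/live            # live btrfs filesystem (kept snapshot-free)
SNAP=/mnt/live/.hourly    # throwaway read-only snapshot of the live fs
DST=/mnt/backup/baseline  # rw copy of the baseline on the backup device

# step 4: snapshot the live system, rsync the changes, drop the snapshot
run btrfs subvolume snapshot -r "$LIVE" "$SNAP"
run rsync -aHAX --delete "$SNAP/" "$DST/"
run btrfs subvolume delete "$SNAP"

# step 5: let snapper preserve the hourly history on the backup side
run snapper -c backup create --description hourly
```

Rsyncing from a read-only snapshot rather than the live tree is what gives the consistency guarantee argued for in the reply above.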
Re: how to repair or access broken btrfs?
On 14.11.2017 at 18:45, Andrei Borzenkov wrote:
> 14.11.2017 12:56, Stefan Priebe - Profihost AG wrote:
>> Hello,
>>
>> after a controller firmware bug / failure i've a broken btrfs.
>>
>> # parent transid verify failed on 181846016 wanted 143404 found 143399
>>
>> running repair, fsck or zero-log always results in the same failure
>> message:
>> extent-tree.c:2725: alloc_reserved_tree_block: BUG_ON `ret` triggered,
>> value -1
>> .. stack trace ..
>>
>> Is there a chance to get at least a single file out of the broken fs?
>
> Did you try "btrfs restore"?

Great, that worked for that file. Still wondering why a repair is not
possible.

Greets,
Stefan
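For readers hitting the same situation: `btrfs restore` reads an unmountable filesystem without writing to it and copies files out. A hedged sketch of pulling a single file, as suggested in the thread — the device name, destination, and regex are examples, and `run` only echoes the commands (check `btrfs restore --help` on your version before running anything):

```shell
#!/bin/sh
# Illustrative only: 'run' echoes the commands rather than executing them.
run() { echo "+ $*"; }

DEV=/dev/sdX1            # the broken filesystem (example device)
OUT=/mnt/recovery        # destination on a healthy filesystem

# -D (dry run) lists what would be restored without writing anything
run btrfs restore -D "$DEV" "$OUT"

# restore one file; --path-regex takes an anchored regex that must match
# every directory level, hence the nested (|...) groups
run btrfs restore -i --path-regex '^/(|etc(|/fstab))$' "$DEV" "$OUT"
```

Unlike `btrfs check --repair`, this never modifies the damaged filesystem, which is why it can succeed where repair hits a BUG_ON.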
Re: Read before you deploy btrfs + zstd
David Sterba - 14.11.17, 19:49:
> On Tue, Nov 14, 2017 at 08:34:37AM +0100, Martin Steigerwald wrote:
> > Hello David.
> >
> > David Sterba - 13.11.17, 23:50:
> > > while 4.14 is still fresh, let me address some concerns I've seen on
> > > linux forums already.
> > >
> > > The newly added ZSTD support is a feature that has broader impact than
> > > just the runtime compression. The btrfs-progs understand filesystem
> > > with ZSTD since 4.13. The remaining key part is the bootloader.
> > >
> > > Up to now, there are no bootloaders supporting ZSTD. This could lead
> > > to an unmountable filesystem if the critical files under /boot get
> > > accidentally or intentionally compressed by ZSTD.
> >
> > But otherwise ZSTD is safe to use? Are you aware of any other issues?
>
> No issues from my own testing or reported by other users.

Thanks to you and the others. I think I try this soon.

Thanks,
-- 
Martin
Re: Read before you deploy btrfs + zstd
On Mon, Nov 13, 2017 at 11:50:46PM +0100, David Sterba wrote:
> Up to now, there are no bootloaders supporting ZSTD.

I've tried to implement the support to GRUB, still incomplete and hacky
but most of the code is there. The ZSTD implementation is copied from
kernel. The allocators need to be properly set up, as it needs to use
grub_malloc/grub_free for the workspace that's called from some ZSTD_*
functions.

https://github.com/kdave/grub/tree/btrfs-zstd
Re: Read before you deploy btrfs + zstd
On Tue, Nov 14, 2017 at 08:34:37AM +0100, Martin Steigerwald wrote:
> Hello David.
>
> David Sterba - 13.11.17, 23:50:
> > while 4.14 is still fresh, let me address some concerns I've seen on
> > linux forums already.
> >
> > The newly added ZSTD support is a feature that has broader impact than
> > just the runtime compression. The btrfs-progs understand filesystem
> > with ZSTD since 4.13. The remaining key part is the bootloader.
> >
> > Up to now, there are no bootloaders supporting ZSTD. This could lead
> > to an unmountable filesystem if the critical files under /boot get
> > accidentally or intentionally compressed by ZSTD.
>
> But otherwise ZSTD is safe to use? Are you aware of any other issues?

No issues from my own testing or reported by other users.
Re: how to repair or access broken btrfs?
14.11.2017 12:56, Stefan Priebe - Profihost AG wrote:
> Hello,
>
> after a controller firmware bug / failure i've a broken btrfs.
>
> # parent transid verify failed on 181846016 wanted 143404 found 143399
>
> running repair, fsck or zero-log always results in the same failure
> message:
> extent-tree.c:2725: alloc_reserved_tree_block: BUG_ON `ret` triggered,
> value -1
> .. stack trace ..
>
> Is there a chance to get at least a single file out of the broken fs?

Did you try "btrfs restore"?
Re: A partially failing disk in raid0 needs replacement
Hi Roman,

If you look at the 'show' command, the failing disk is sorta out of the
fs, so maybe removing the 6TB disk again will divide the data already on
the 6TB disk (which isn't more than 300something gigs) to the 2
well-functioning disks.

Still, as putting the dd-image of the 2TB disk on the temporary disk is
only temporary, I do need one more 2TB+ disk attached to create a more
permanent btrfs with the 6TB disk (which is what I eventually want). And
for that I need some more harddisk power cables/splitters. And another
disk. But that still seems to be the best option, so I will do that once
I have those things sorted out.

Thanks for your creative suggestion :)

/klaus

On Tue, Nov 14, 2017 at 4:44 PM, Roman Mamedov wrote:
> On Tue, 14 Nov 2017 15:09:52 +0100
> Klaus Agnoletti wrote:
>
>> Hi Roman
>>
>> I almost understand :-) - however, I need a bit more information:
>>
>> How do I copy the image file to the 6TB without screwing the existing
>> btrfs up when the fs is not mounted? Should I remove it from the raid
>> again?
>
> Oh, you already added it to your FS, that's so unfortunate. For my
> scenario I assumed you have a spare 6TB (or any 2TB+) disk you can use
> as temporary space.
>
> You could try removing it, but with one of the existing member drives
> malfunctioning, I wonder if trying any operation on that FS will cause
> further damage. For example if you remove the 6TB one, how do you
> prevent Btrfs from using the bad 2TB drive as destination to relocate
> data from the 6TB drive. Or use it for one of the metadata mirrors,
> which will fail to write properly, leading into transid failures later,
> etc.
>
> --
> With respect,
> Roman

-- 
Klaus Agnoletti
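The two options weighed in this thread — shrinking the array back down versus swapping the failing member out — map onto two btrfs commands. A hedged sketch with made-up device names; `run` only echoes the commands, and on a real system you would check `btrfs replace status` and the kernel log before trusting the result:

```shell
#!/bin/sh
# Illustrative only: 'run' echoes the commands rather than executing them.
run() { echo "+ $*"; }

MNT=/mnt/pool        # the mounted filesystem (example)
BAD=/dev/sdb         # failing member (example)
NEW=/dev/sdd         # fresh replacement disk (example)
EXTRA=/dev/sdc       # the recently added disk (example)

# Option 1: shrink the array again, relocating its data to the
# remaining members (the risk discussed above: relocation may write
# to the failing drive)
run btrfs device remove "$EXTRA" "$MNT"

# Option 2: copy the failing device onto a new one in place; -r avoids
# reading the source device when another copy of the data exists
run btrfs replace start -r "$BAD" "$NEW" "$MNT"
run btrfs replace status "$MNT"
```

For raid0 there is no redundant copy to fall back on, so `replace` still has to read the failing disk for the stripes it holds; it just avoids rewriting them through the bad drive.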
Btrfs progs pre-release 4.14-rc1
Hi, a pre-release has been tagged. Changes: * build: libzstd now required by default * check: more lowmem mode repair enhancements * subvol set-default: also accept path * prop set: compression accepts no/none, same as "" * filesystem usage: enable for filesystem on top of a seed device * rescue: new command fix-device-size * other * new tests * cleanups and refactoring * doc updates ETA for 4.14 is in +2 days (2017-11-16). Mailinglist patch backlog has grown again, I'll have to do more minor releases to get the features and fixes merged. No concrete plans for now, some patchsets are almost ready so they'll probably go first. Tarballs: https://www.kernel.org/pub/linux/kernel/people/kdave/btrfs-progs/ Git: git://git.kernel.org/pub/scm/linux/kernel/git/kdave/btrfs-progs.git Shortlog: Baruch Siach (1): btrfs-progs: convert: add missing types header Benjamin Peterson (1): btrfs-progs: docs: correct grammar David Sterba (24): btrfs-progs: help: print multiple syntax schemas on separate lines btrfs-progs: prop: also allow "none" to disable compression btrfs-progs: docs: update btrfs-properties btrfs-progs: image: move metadump definitions to own header btrfs-progs: build: use variables for btrfs-image images btrfs-progs: image: start a new header for sanitization functions btrfs-progs: image: introduce symbolic names for the sanitization modes btrfs-progs: image: pass rb_root to find_collisions btrfs-progs: image: drop unused parameter from sanitize_xattr btrfs-progs: image: pass sanitize mode and name tree separately to sanitize_inode_ref btrfs-progs: image: pass sanitize mode and name tree separately to sanitize_dir_item btrfs-progs: image: pass sanitize mode and name tree separately to sanitize_name btrfs-progs: image: move sanitization to new file btrfs-progs: don't use __u8 for fsid buffers btrfs-progs: tests: don't pass size to prepare_test_dev if not necessary btrfs-progs: tests: extend fsck/028 to test fix-device-size and mount btrfs-progs: docs: update mount 
options btrfs-progs: docs: add impact of atime/noatime btrfs-progs: docs: add note about mount option applicability btrfs-progs: build: require libzstd support by default btrfs-progs: build: mention library dependency for reiserfs btrfs-progs: docs: move the rescue fix-device-size command and update btrfs-progs: update CHANGES for v4.14 Btrfs progs v4.14-rc1 Lakshmipathi.G (1): btrfs-progs: tests/common: Display warning only after searching for btrfs kernel module Liu Bo (1): btrfs-progs: do not add stale device into fs_devices Lu Fengqi (7): btrfs-progs: qgroup: fix qgroup show sort by multi items btrfs-progs: test: Add test image for lowmem mode file extent interrupt btrfs-progs: lowmem check: Output more detailed information about file extent interrupt btrfs-progs: lowmem check: Fix false alert about referencer count mismatch btrfs-progs: test: Add test image for lowmem mode referencer count mismatch false alert btrfs-progs: qgroup: cleanup the redundant function add_qgroup btrfs-progs: qgroup: split update_qgroup to reduce arguments Misono, Tomohiro (6): btrfs-progs: subvol: change set-default to also accept path btrfs-progs: test: add new cli-test for subvol get/set-default btrfs-progs: fi: move dev_to_fsid to cmds-fi-usage for later use btrfs-progs: fi: enable fi usage for filesystem on top of seed device btrfs-progs: device: add description of alias to help message btrfs-progs: doc: add description of missing and example, of device remove Pavel Kretov (1): btrfs-progs: defrag: add a brief warning about ref-link breakage Qu Wenruo (14): btrfs-progs: tests: Allow check test to repair in lowmem mode for certain errors btrfs-progs: mkfs: avoid BUG_ON for chunk allocation when ENOSPC happens btrfs-progs: mkfs: avoid positive return value from cleanup_temp_chunks btrfs-progs: mkfs: fix overwritten return value for mkfs btrfs-progs: mkfs: error out gracefully for --rootdir btrfs-progs: convert: Open the fs readonly for rollback btrfs-progs: mkfs: refactor 
test_minimum_size to use the calculated minimal size btrfs-progs: rescue: Fix zero-log mounted branch btrfs-progs: Introduce function to fix unaligned device size btrfs-progs: Introduce function to fix super block total bytes btrfs-progs: rescue: Introduce fix-device-size btrfs-progs: check: Also check and repair unaligned/mismatch device and super sizes btrfs-progs: tests/fsck: Add test case image for 'rescue fix-dev-size' btrfs-progs: print-tree: Print offset as tree objectid for ROOT_ITEM Satoru Takeuchi (1): btrfs-progs: allow "no" to disable compression for convenience Su Yue (28): btrfs-progs:
Re: [GIT PULL] Btrfs changes for 4.15
On Tue, Nov 14, 2017 at 07:39:11AM +0800, Qu Wenruo wrote: > > - extend mount options to specify zlib compression level, -o compress=zlib:9 > > However the support for it has a big problem, it will cause wild memory > access for the "-o compress" mount option. > > Kernel ASAN can detect it easily and we already have a user report about > it. Btrfs/026 could also easily trigger it. > > The fixing patch was submitted some days ago: > https://patchwork.kernel.org/patch/10042553/ > > And the default compression level when not specified is zero, which > means no compression but direct memory copy. This fix will go in the next pull request. Thanks.
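For reference, the new mount-option syntax being discussed looks like this from the command line (device and mount point are placeholders; a sketch, not output from a tested setup):

```shell
# Explicit zlib compression level (1-9) on a btrfs mount
mount -o compress=zlib:9 /dev/sdX /mnt

# The option can also be changed on an already-mounted filesystem
mount -o remount,compress=zlib:9 /mnt
```

Only newly written data is affected by a change of compression option; existing extents keep whatever compression they were written with.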
Re: A partially failing disk in raid0 needs replacement
On Tue, 14 Nov 2017 15:09:52 +0100 Klaus Agnoletti wrote: > Hi Roman > > I almost understand :-) - however, I need a bit more information: > > How do I copy the image file to the 6TB without screwing the existing > btrfs up when the fs is not mounted? Should I remove it from the raid > again? Oh, you already added it to your FS, that's so unfortunate. For my scenario I assumed you have a spare 6TB (or any 2TB+) disk you can use as temporary space. You could try removing it, but with one of the existing member drives malfunctioning, I wonder if trying any operation on that FS will cause further damage. For example if you remove the 6TB one, how do you prevent Btrfs from using the bad 2TB drive as destination to relocate data from the 6TB drive. Or use it for one of the metadata mirrors, which will fail to write properly, leading into transid failures later, etc. -- With respect, Roman
Re: A partially failing disk in raid0 needs replacement
On Tue, 14 Nov 2017 17:48:56 +0500, Roman Mamedov wrote: > [1] Note that "ddrescue" and "dd_rescue" are two different programs > for the same purpose, one may work better than the other. I don't > remember which. :) One is a perl implementation and is the one working worse. ;-) -- Regards, Kai Replies to list-only preferred.
Re: A partially failing disk in raid0 needs replacement
Hi Austin Good points. Thanks a lot. /klaus On Tue, Nov 14, 2017 at 2:14 PM, Austin S. Hemmelgarnwrote: > On 2017-11-14 03:36, Klaus Agnoletti wrote: >> >> Hi list >> >> I used to have 3x2TB in a btrfs in raid0. A few weeks ago, one of the >> 2TB disks started giving me I/O errors in dmesg like this: >> >> [388659.173819] ata5.00: exception Emask 0x0 SAct 0x7fff SErr 0x0 >> action 0x0 >> [388659.175589] ata5.00: irq_stat 0x4008 >> [388659.177312] ata5.00: failed command: READ FPDMA QUEUED >> [388659.179045] ata5.00: cmd 60/20:60:80:96:95/00:00:c4:00:00/40 tag >> 12 ncq 1638 >> 4 in >> res 51/40:1c:84:96:95/00:00:c4:00:00/40 Emask 0x409 (media >> error) >> [388659.182552] ata5.00: status: { DRDY ERR } >> [388659.184303] ata5.00: error: { UNC } >> [388659.188899] ata5.00: configured for UDMA/133 >> [388659.188956] sd 4:0:0:0: [sdd] Unhandled sense code >> [388659.188960] sd 4:0:0:0: [sdd] >> [388659.188962] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE >> [388659.188965] sd 4:0:0:0: [sdd] >> [388659.188967] Sense Key : Medium Error [current] [descriptor] >> [388659.188970] Descriptor sense data with sense descriptors (in hex): >> [388659.188972] 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00 >> [388659.188981] c4 95 96 84 >> [388659.188985] sd 4:0:0:0: [sdd] >> [388659.188988] Add. Sense: Unrecovered read error - auto reallocate >> failed >> [388659.188991] sd 4:0:0:0: [sdd] CDB: >> [388659.188992] Read(10): 28 00 c4 95 96 80 00 00 20 00 >> [388659.189000] end_request: I/O error, dev sdd, sector 3298137732 >> [388659.190740] BTRFS: bdev /dev/sdd errs: wr 0, rd 3120, flush 0, >> corrupt 0, ge >> n 0 >> [388659.192556] ata5: EH complete > > Just some background, but this error is usually indicative of either media > degradation from long-term usage, or a head crash. 
>> >> >> At the same time, I started getting mails from smartd: >> >> Device: /dev/sdd [SAT], 2 Currently unreadable (pending) sectors >> Device info: >> Hitachi HDS723020BLA642, S/N:MN1220F30MNHUD, WWN:5-000cca-369c8f00b, >> FW:MN6OA580, 2.00 TB >> >> For details see host's SYSLOG. > > And this correlates with the above errors (although the current pending > sectors being non-zero is less specific than the above). >> >> >> To fix it, it ended up with me adding a new 6TB disk and trying to >> delete the failing 2TB disks. >> >> That didn't go so well; apparently, the delete command aborts when >> ever it encounters I/O errors. So now my raid0 looks like this: > > I'm not going to comment on how to fix the current situation, as what has > been stated in other people's replies pretty well covers that. > > I would however like to mention two things for future reference: > > 1. The delete command handles I/O errors just fine, provided that there is > some form of redundancy in the filesystem. In your case, if this had been a > raid1 array instead of raid0, then the delete command would have just fallen > back to the other copy of the data when it hit an I/O error instead of > dying. Just like a regular RAID0 array (be it LVM, MD, or hardware), you > can't lose a device in a BTRFS raid0 array without losing the array. > > 2. While it would not have helped in this case, the preferred method when > replacing a device is to use the `btrfs replace` command. It's a lot more > efficient than add+delete (and exponentially more efficient than > delete+add), and also a bit safer (in both cases because it needs to move > less data). The only down-side to it is that you may need a couple of > resize commands around it. 
> >> >> klaus@box:~$ sudo btrfs fi show >> [sudo] password for klaus: >> Label: none uuid: 5db5f82c-2571-4e62-a6da-50da0867888a >> Total devices 4 FS bytes used 5.14TiB >> devid1 size 1.82TiB used 1.78TiB path /dev/sde >> devid2 size 1.82TiB used 1.78TiB path /dev/sdf >> devid3 size 0.00B used 1.49TiB path /dev/sdd >> devid4 size 5.46TiB used 305.21GiB path /dev/sdb >> >> Btrfs v3.17 >> >> Obviously, I want /dev/sdd emptied and deleted from the raid. >> >> So how do I do that? >> >> I thought of three possibilities myself. I am sure there are more, >> given that I am in no way a btrfs expert: >> >> 1)Try to force a deletion of /dev/sdd where btrfs copies all intact >> data to the other disks >> 2) Somehow re-balances the raid so that sdd is emptied, and then deleted >> 3) converting into a raid1, physically removing the failing disk, >> simulating a hard error, starting the raid degraded, and converting it >> back to raid0 again. >> >> How do you guys think I should go about this? Given that it's a raid0 >> for a reason, it's not the end of the world losing all data, but I'd >> really prefer losing as little as possible, obviously. >> >> FYI, I tried doing some scrubbing and balancing. There's traces of >> that in the syslog and dmesg I've attached. It's being used as >> firewall
Re: A partially failing disk in raid0 needs replacement
Hi Roman I almost understand :-) - however, I need a bit more information: How do I copy the image file to the 6TB without screwing the existing btrfs up when the fs is not mounted? Should I remove it from the raid again? Also, as you might have noticed, I have a bit of an issue with the entire space of the 6TB disk being added to the btrfs when I added the disk. There's something kinda basic about using btrfs that I haven't really understood yet. Maybe you - or someone else - can point me in the right direction in terms of documentation. Thanks /klaus On Tue, Nov 14, 2017 at 1:48 PM, Roman Mamedov wrote: > On Tue, 14 Nov 2017 10:36:22 +0200 > Klaus Agnoletti wrote: > >> Obviously, I want /dev/sdd emptied and deleted from the raid. > > * Unmount the RAID0 FS > > * copy the bad drive using `dd_rescue`[1] into a file on the 6TB drive > (noting how much of it is actually unreadable -- chances are it's mostly > intact) > > * physically remove the bad drive (have a powerdown or reboot for this to be > sure Btrfs didn't remember it somewhere) > > * set up a loop device from the dd_rescue'd 2TB file > > * run `btrfs device scan` > > * mount the RAID0 filesystem > > * run the delete command on the loop device, it will not encounter I/O > errors anymore. > > > [1] Note that "ddrescue" and "dd_rescue" are two different programs for the > same purpose, one may work better than the other. I don't remember which. :) > > -- > With respect, > Roman -- Klaus Agnoletti
Re: A partially failing disk in raid0 needs replacement
On 2017-11-14 03:36, Klaus Agnoletti wrote: Hi list I used to have 3x2TB in a btrfs in raid0. A few weeks ago, one of the 2TB disks started giving me I/O errors in dmesg like this: [388659.173819] ata5.00: exception Emask 0x0 SAct 0x7fff SErr 0x0 action 0x0 [388659.175589] ata5.00: irq_stat 0x4008 [388659.177312] ata5.00: failed command: READ FPDMA QUEUED [388659.179045] ata5.00: cmd 60/20:60:80:96:95/00:00:c4:00:00/40 tag 12 ncq 1638 4 in res 51/40:1c:84:96:95/00:00:c4:00:00/40 Emask 0x409 (media error) [388659.182552] ata5.00: status: { DRDY ERR } [388659.184303] ata5.00: error: { UNC } [388659.188899] ata5.00: configured for UDMA/133 [388659.188956] sd 4:0:0:0: [sdd] Unhandled sense code [388659.188960] sd 4:0:0:0: [sdd] [388659.188962] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE [388659.188965] sd 4:0:0:0: [sdd] [388659.188967] Sense Key : Medium Error [current] [descriptor] [388659.188970] Descriptor sense data with sense descriptors (in hex): [388659.188972] 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00 [388659.188981] c4 95 96 84 [388659.188985] sd 4:0:0:0: [sdd] [388659.188988] Add. Sense: Unrecovered read error - auto reallocate failed [388659.188991] sd 4:0:0:0: [sdd] CDB: [388659.188992] Read(10): 28 00 c4 95 96 80 00 00 20 00 [388659.189000] end_request: I/O error, dev sdd, sector 3298137732 [388659.190740] BTRFS: bdev /dev/sdd errs: wr 0, rd 3120, flush 0, corrupt 0, ge n 0 [388659.192556] ata5: EH complete Just some background, but this error is usually indicative of either media degradation from long-term usage, or a head crash. At the same time, I started getting mails from smartd: Device: /dev/sdd [SAT], 2 Currently unreadable (pending) sectors Device info: Hitachi HDS723020BLA642, S/N:MN1220F30MNHUD, WWN:5-000cca-369c8f00b, FW:MN6OA580, 2.00 TB For details see host's SYSLOG. And this correlates with the above errors (although the current pending sectors being non-zero is less specific than the above). 
To fix it, it ended up with me adding a new 6TB disk and trying to delete the failing 2TB disks. That didn't go so well; apparently, the delete command aborts when ever it encounters I/O errors. So now my raid0 looks like this: I'm not going to comment on how to fix the current situation, as what has been stated in other people's replies pretty well covers that. I would however like to mention two things for future reference: 1. The delete command handles I/O errors just fine, provided that there is some form of redundancy in the filesystem. In your case, if this had been a raid1 array instead of raid0, then the delete command would have just fallen back to the other copy of the data when it hit an I/O error instead of dying. Just like a regular RAID0 array (be it LVM, MD, or hardware), you can't lose a device in a BTRFS raid0 array without losing the array. 2. While it would not have helped in this case, the preferred method when replacing a device is to use the `btrfs replace` command. It's a lot more efficient than add+delete (and exponentially more efficient than delete+add), and also a bit safer (in both cases because it needs to move less data). The only down-side to it is that you may need a couple of resize commands around it. klaus@box:~$ sudo btrfs fi show [sudo] password for klaus: Label: none uuid: 5db5f82c-2571-4e62-a6da-50da0867888a Total devices 4 FS bytes used 5.14TiB devid1 size 1.82TiB used 1.78TiB path /dev/sde devid2 size 1.82TiB used 1.78TiB path /dev/sdf devid3 size 0.00B used 1.49TiB path /dev/sdd devid4 size 5.46TiB used 305.21GiB path /dev/sdb Btrfs v3.17 Obviously, I want /dev/sdd emptied and deleted from the raid. So how do I do that? I thought of three possibilities myself. 
I am sure there are more, given that I am in no way a btrfs expert: 1) Try to force a deletion of /dev/sdd where btrfs copies all intact data to the other disks 2) Somehow re-balances the raid so that sdd is emptied, and then deleted 3) converting into a raid1, physically removing the failing disk, simulating a hard error, starting the raid degraded, and converting it back to raid0 again. How do you guys think I should go about this? Given that it's a raid0 for a reason, it's not the end of the world losing all data, but I'd really prefer losing as little as possible, obviously. FYI, I tried doing some scrubbing and balancing. There's traces of that in the syslog and dmesg I've attached. It's being used as firewall too, so there's a lot of Shorewall block messages spamming the log I'm afraid. Additional info: klaus@box:~$ uname -a Linux box 3.16.0-4-amd64 #1 SMP Debian 3.16.43-2+deb8u5 (2017-09-19) x86_64 GNU/Linux klaus@box:~$ sudo btrfs --version Btrfs v3.17 klaus@box:~$ sudo btrfs fi df /mnt Data, RAID0: total=5.34TiB, used=5.14TiB System, RAID0: total=96.00MiB, used=384.00KiB Metadata, RAID0: total=7.22GiB,
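As a sketch of the `btrfs replace` workflow recommended above for future device swaps (device names are placeholders — /dev/sdg stands for a hypothetical new disk that is *not* yet part of the filesystem, which replace requires; the devid 3 is taken from the `fi show` output in this thread but should be checked against your own):

```shell
# Copy the failing device's contents (and its devid) onto the new disk
# in a single pass, instead of the slower add+delete dance
btrfs replace start /dev/sdd /dev/sdg /mnt
btrfs replace status /mnt

# If the new disk is larger, grow that device to use the extra space
# (the replacement inherits the old device's devid)
btrfs filesystem resize 3:max /mnt
```

On a redundant profile, `replace start -r` additionally avoids reading from the failing source where another good copy exists; in raid0 there is no other copy, so the flag changes nothing.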
Re: A partially failing disk in raid0 needs replacement
On 2017-11-14 07:48, Roman Mamedov wrote: On Tue, 14 Nov 2017 10:36:22 +0200 Klaus Agnoletti wrote: Obviously, I want /dev/sdd emptied and deleted from the raid. * Unmount the RAID0 FS * copy the bad drive using `dd_rescue`[1] into a file on the 6TB drive (noting how much of it is actually unreadable -- chances are it's mostly intact) * physically remove the bad drive (have a powerdown or reboot for this to be sure Btrfs didn't remember it somewhere) * set up a loop device from the dd_rescue'd 2TB file * run `btrfs device scan` * mount the RAID0 filesystem * run the delete command on the loop device, it will not encounter I/O errors anymore. While the above procedure will work, it is worth noting that you may still lose data. [1] Note that "ddrescue" and "dd_rescue" are two different programs for the same purpose, one may work better than the other. I don't remember which. :) As a general rule, GNU ddrescue is more user friendly for block-level copies, while Kurt Garloff's dd_rescue tends to be better for copying at the file level. Both work fine in terms of reliability though.
Re: A partially failing disk in raid0 needs replacement
On 14 November 2017 at 09:36, Klaus Agnoletti wrote: > > How do you guys think I should go about this? I'd clone the disk with GNU ddrescue. https://www.gnu.org/software/ddrescue/
Re: A partially failing disk in raid0 needs replacement
On Tue, 14 Nov 2017 10:36:22 +0200 Klaus Agnoletti wrote: > Obviously, I want /dev/sdd emptied and deleted from the raid. * Unmount the RAID0 FS * copy the bad drive using `dd_rescue`[1] into a file on the 6TB drive (noting how much of it is actually unreadable -- chances are it's mostly intact) * physically remove the bad drive (have a powerdown or reboot for this to be sure Btrfs didn't remember it somewhere) * set up a loop device from the dd_rescue'd 2TB file * run `btrfs device scan` * mount the RAID0 filesystem * run the delete command on the loop device, it will not encounter I/O errors anymore. [1] Note that "ddrescue" and "dd_rescue" are two different programs for the same purpose, one may work better than the other. I don't remember which. :) -- With respect, Roman
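Roman's steps can be sketched as a shell session (device names and paths are placeholders, and this assumes — as in his original scenario — that the disk holding the image file is *not* itself a member of the filesystem):

```shell
# 1. Unmount the RAID0 filesystem
umount /mnt/raid0

# 2. Image the failing disk; the map file lets ddrescue resume and retry
#    around unreadable sectors
ddrescue /dev/sdd /mnt/spare/sdd.img /mnt/spare/sdd.map

# 3. Power down, physically remove the failing disk, reboot

# 4. Expose the image as a block device and let btrfs rediscover it
losetup /dev/loop0 /mnt/spare/sdd.img
btrfs device scan

# 5. Mount via any surviving member, then delete the now error-free device
mount /dev/sde /mnt/raid0
btrfs device delete /dev/loop0 /mnt/raid0
```

The delete still has to relocate every chunk that lived on the imaged disk, so expect it to take as long as a full balance of that data.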
Re: A partially failing disk in raid0 needs replacement
On Tue, Nov 14, 2017 at 10:36:22AM +0200, Klaus Agnoletti wrote: > I used to have 3x2TB in a btrfs in raid0. A few weeks ago, one of the ^ > 2TB disks started giving me I/O errors in dmesg like this: > > [388659.188988] Add. Sense: Unrecovered read error - auto reallocate failed Alas, chances to recover anything are pretty slim. That's RAID0 metadata for you. On the other hand, losing any non-trivial file while being able to gape at intact metadata isn't that much better, thus -mraid0 isn't completely unreasonable. > To fix it, it ended up with me adding a new 6TB disk and trying to > delete the failing 2TB disks. > > That didn't go so well; apparently, the delete command aborts when > ever it encounters I/O errors. So now my raid0 looks like this: > > klaus@box:~$ sudo btrfs fi show > [sudo] password for klaus: > Label: none uuid: 5db5f82c-2571-4e62-a6da-50da0867888a > Total devices 4 FS bytes used 5.14TiB > devid1 size 1.82TiB used 1.78TiB path /dev/sde > devid2 size 1.82TiB used 1.78TiB path /dev/sdf > devid3 size 0.00B used 1.49TiB path /dev/sdd > devid4 size 5.46TiB used 305.21GiB path /dev/sdb > Obviously, I want /dev/sdd emptied and deleted from the raid. > > So how do I do that? > > I thought of three possibilities myself. I am sure there are more, > given that I am in no way a btrfs expert: > > 1)Try to force a deletion of /dev/sdd where btrfs copies all intact > data to the other disks > 2) Somehow re-balances the raid so that sdd is emptied, and then deleted > 3) converting into a raid1, physically removing the failing disk, > simulating a hard error, starting the raid degraded, and converting it > back to raid0 again. There's hardly any intact data: roughly 2/3 of chunks have half of their blocks on the failed disk, densely interspersed. Even worse, metadata required to map those blocks to files is gone, too: if we naively assume there's only a single tree, a tree node is intact only if it and every single node on the path to the root is intact. 
In practice, this means it's a total filesystem loss. > How do you guys think I should go about this? Given that it's a raid0 > for a reason, it's not the end of the world losing all data, but I'd > really prefer losing as little as possible, obviously. As the disk isn't _completely_ gone, there's a slim chance of some stuff requiring only still-readable sectors. Probably a waste of time to try to recover, though. Meow! -- ⢀⣴⠾⠻⢶⣦⠀ Laws we want back: Poland, Dz.U. 1921 nr.30 poz.177 (also Dz.U. ⣾⠁⢰⠒⠀⣿⡁ 1920 nr.11 poz.61): Art.2: An official, guilty of accepting a gift ⢿⡄⠘⠷⠚⠋⠀ or another material benefit, or a promise thereof, [in matters ⠈⠳⣄ relevant to duties], shall be punished by death by shooting.
Re: Read before you deploy btrfs + zstd
On 2017-11-14 02:34, Martin Steigerwald wrote: Hello David. David Sterba - 13.11.17, 23:50: while 4.14 is still fresh, let me address some concerns I've seen on linux forums already. The newly added ZSTD support is a feature that has broader impact than just the runtime compression. The btrfs-progs understand filesystems with ZSTD since 4.13. The remaining key part is the bootloader. Up to now, there are no bootloaders supporting ZSTD. This could lead to an unmountable filesystem if the critical files under /boot get accidentally or intentionally compressed by ZSTD. But otherwise ZSTD is safe to use? Are you aware of any other issues? Aside from the obvious issue that recovery media like SystemRescueCD and the GParted LiveCD haven't caught up yet, and thus won't be able to do anything with the filesystem, my testing has not uncovered any issues, though it is by no means rigorous.
Re: [PATCH] btrfs/154: test for device dynamic rescan
On Mon, Nov 13, 2017 at 10:25:41AM +0800, Anand Jain wrote: > Make sure missing device is included in the alloc list when it is > scanned on a mounted FS. > > This test case needs btrfs kernel patch which is in the ML > [PATCH] btrfs: handle dynamically reappearing missing device > Without the kernel patch, the test will run, but reports as > failed, as the device scanned won't appear in the alloc_list. > > Signed-off-by: Anand JainTested without the fix and test failed as expected, test passed after applying the fix. Some minor nits below. > --- > tests/btrfs/154 | 188 > > tests/btrfs/154.out | 10 +++ > tests/btrfs/group | 1 + > 3 files changed, 199 insertions(+) > create mode 100755 tests/btrfs/154 > create mode 100644 tests/btrfs/154.out > > diff --git a/tests/btrfs/154 b/tests/btrfs/154 > new file mode 100755 > index ..8b06fc4d9347 > --- /dev/null > +++ b/tests/btrfs/154 > @@ -0,0 +1,188 @@ > +#! /bin/bash > +# FS QA Test 154 > +# > +# Test for reappearing missing device functionality. > +# This test will fail without the btrfs kernel patch > +# [PATCH] btrfs: handle dynamically reappearing missing device > +# > +#- > +# Copyright (c) 2017 Oracle. All Rights Reserved. > +# Author: Anand Jain > +# > +# This program is free software; you can redistribute it and/or > +# modify it under the terms of the GNU General Public License as > +# published by the Free Software Foundation. > +# > +# This program is distributed in the hope that it would be useful, > +# but WITHOUT ANY WARRANTY; without even the implied warranty of > +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the > +# GNU General Public License for more details. 
> +# > +# You should have received a copy of the GNU General Public License > +# along with this program; if not, write the Free Software Foundation, > +# Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA > +#- > +# > + > +seq=`basename $0` > +seqres=$RESULT_DIR/$seq > +echo "QA output created by $seq" > + > +here=`pwd` > +tmp=/tmp/$$ > +status=1 # failure is the default! > +trap "_cleanup; exit \$status" 0 1 2 3 15 > + > +_cleanup() > +{ > + cd / > + rm -f $tmp.* > +} > + > +# get standard environment, filters and checks > +. ./common/rc > +. ./common/filter > +. ./common/module > + > +# remove previous $seqres.full before test > +rm -f $seqres.full > + > +# real QA test starts here > + > +_supported_fs btrfs > +_supported_os Linux > +_require_scratch_dev_pool 2 > +_test_unmount This is not needed now, _require_loadable_fs_module will umount & mount test dev as necessary. > +_require_loadable_fs_module "btrfs" > + > +_scratch_dev_pool_get 2 > + > +DEV1=`echo $SCRATCH_DEV_POOL | awk '{print $1}'` > +DEV2=`echo $SCRATCH_DEV_POOL | awk '{print $2}'` > + > +echo DEV1=$DEV1 >> $seqres.full > +echo DEV2=$DEV2 >> $seqres.full > + > +# Balance won't be successful if filled too much > +DEV1_SZ=`blockdev --getsize64 $DEV1` > +DEV2_SZ=`blockdev --getsize64 $DEV2` > + > +# get min > +MAX_FS_SZ=`echo -e "$DEV1_SZ\n$DEV2_SZ" | sort | head -1` > +# Need disks with more than 2G > +if [ $MAX_FS_SZ -lt 20 ]; then > + _scratch_dev_pool_put > + _test_mount Then no need to _test_mount. 
> + _notrun "Smallest dev size $MAX_FS_SZ, Need at least 2G" > +fi > + > +MAX_FS_SZ=1 > +bs="1M" > +COUNT=$(($MAX_FS_SZ / 100)) > +CHECKPOINT1=0 > +CHECKPOINT2=0 > + > +setup() > +{ > + echo >> $seqres.full > + echo "MAX_FS_SZ=$MAX_FS_SZ COUNT=$COUNT" >> $seqres.full > + echo "setup" > + echo "-setup-" >> $seqres.full > + _scratch_pool_mkfs "-mraid1 -draid1" >> $seqres.full 2>&1 > + _scratch_mount >> $seqres.full 2>&1 > + dd if=/dev/urandom of="$SCRATCH_MNT"/tf bs=$bs count=1 \ > + >>$seqres.full 2>&1 > + _run_btrfs_util_prog filesystem show -m ${SCRATCH_MNT} > + _run_btrfs_util_prog filesystem df $SCRATCH_MNT > + COUNT=$(( $COUNT - 1 )) > + echo "unmount" >> $seqres.full > + _scratch_unmount > +} > + > +degrade_mount_write() > +{ > + echo >> $seqres.full > + echo "--degraded mount: max_fs_sz $max_fs_sz bytes--" >> $seqres.full > + echo > + echo "degraded mount" > + > + echo "clean btrfs ko" >> $seqres.full > + # un-scan the btrfs devices > + _reload_fs_module "btrfs" > + _mount -o degraded $DEV1 $SCRATCH_MNT >>$seqres.full 2>&1 > + cnt=$(( $COUNT/10 )) > + dd if=/dev/urandom of="$SCRATCH_MNT"/tf1 bs=$bs count=$cnt \ > + >>$seqres.full 2>&1 > + COUNT=$(( $COUNT - $cnt )) > + _run_btrfs_util_prog filesystem show -m $SCRATCH_MNT > + _run_btrfs_util_prog filesystem df $SCRATCH_MNT > +
Re: [PATCH 4/4] Btrfs: btrfs_dedupe_file_range() ioctl, remove 16MiB restriction
Sorry, I was just thinking that I could test that and send you some feedback, but for now, no time. I will check that later and try to add memory reusing. So, just ignore the patches for now. Thanks 2017-10-10 20:36 GMT+03:00 David Sterba: > On Tue, Oct 03, 2017 at 06:06:04PM +0300, Timofey Titovets wrote: >> At now btrfs_dedupe_file_range() restricted to 16MiB range for >> limit locking time and memory requirement for dedup ioctl() >> >> For too big input rage code silently set range to 16MiB >> >> Let's remove that restriction by do iterating over dedup range. >> That's backward compatible and will not change anything for request >> less then 16MiB. > > This would make the ioctl more pleasant to use. So far I haven't found > any problems to do the iteration. One possible speedup could be done to > avoid the repeated allocations in btrfs_extent_same if we're going to > iterate more than once. > > As this would mean the 16MiB length restriction is gone, this needs to > bubble up to the documentation > (http://man7.org/linux/man-pages/man2/ioctl_fideduperange.2.html) > > Have you tested the behaviour with larger ranges? -- Have a nice day, Timofey.
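For userspace readers, the effect of the patch can be pictured as moving this kind of chunking loop into the kernel: previously a caller issuing a large FIDEDUPERANGE request had to split it into 16MiB pieces itself. A sketch of that chunking arithmetic in plain shell (sizes in bytes; purely illustrative):

```shell
CHUNK=$((16 * 1024 * 1024))   # old per-ioctl limit, 16MiB

# Print "offset length" pairs covering a dedupe request of arbitrary size
split_range() {
    local off=$1 len=$2 step
    while [ "$len" -gt 0 ]; do
        step=$(( len < CHUNK ? len : CHUNK ))
        echo "$off $step"
        off=$(( off + step ))
        len=$(( len - step ))
    done
}

# A 40MiB request splits into 16MiB + 16MiB + 8MiB pieces
split_range 0 $((40 * 1024 * 1024))
```

With the kernel iterating internally, a single ioctl call covers the whole range and this bookkeeping disappears from applications.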
how to repair or access broken btrfs?
Hello, after a controller firmware bug / failure i've a broken btrfs. # parent transid verify failed on 181846016 wanted 143404 found 143399 running repair, fsck or zero-log always results in the same failure message: extent-tree.c:2725: alloc_reserved_tree_block: BUG_ON `ret` triggered, value -1 .. stack trace .. Is there an chance to get at least a single file out of the broken fs? Greets, Stefan Complete output: ./btrfs check --repair /dev/mapper/crypt_md0 enabling repair mode parent transid verify failed on 181846016 wanted 143404 found 143399 parent transid verify failed on 181846016 wanted 143404 found 143399 Ignoring transid failure Checking filesystem on /dev/mapper/crypt_md0 UUID: d3f9eee9-efbd-4590-858f-27b39d453350 repair mode will force to clear out log tree, are you sure? [y/N]: y parent transid verify failed on 308183040 wanted 143404 found 143399 parent transid verify failed on 308183040 wanted 143404 found 143399 Ignoring transid failure parent transid verify failed on 338870272 wanted 143404 found 143399 parent transid verify failed on 338870272 wanted 143404 found 143399 Ignoring transid failure parent transid verify failed on 12778157178880 wanted 143404 found 143399 parent transid verify failed on 12778157178880 wanted 143404 found 143399 Ignoring transid failure leaf parent key incorrect 38699008 btrfs unable to find ref byte nr 12778147823616 parent 0 root 2 owner 0 offset 0 parent transid verify failed on 308183040 wanted 143404 found 143399 Ignoring transid failure leaf parent key incorrect 91766784 extent-tree.c:2725: alloc_reserved_tree_block: BUG_ON `ret` triggered, value -1 ./btrfs[0x415cb3] ./btrfs[0x416ee5] ./btrfs[0x417104] ./btrfs[0x418cea] ./btrfs[0x418f06] ./btrfs(btrfs_alloc_free_block+0x1e4)[0x41b8d0] ./btrfs(__btrfs_cow_block+0xd3)[0x40c5f9] ./btrfs(btrfs_cow_block+0x110)[0x40d03b] ./btrfs(commit_tree_roots+0x53)[0x439a37] ./btrfs(btrfs_commit_transaction+0xf9)[0x439e02] ./btrfs(cmd_check+0x861)[0x46172e] 
./btrfs(main+0x163)[0x40b5e9]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5)[0x7f44b14fab45]
./btrfs[0x40b0b9]
Aborted
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
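[Editor's note: one read-only recovery path worth noting for questions like Stefan's is btrfs restore, which copies files out of a damaged filesystem without writing to it. This is a hedged sketch, not advice from the thread; the target directory /mnt/recovery is a placeholder and must live on a different, healthy volume.]

```shell
# Dry run first: -D lists what would be restored without copying anything.
btrfs restore -D -v /dev/mapper/crypt_md0 /mnt/recovery

# Then copy out whatever is reachable; the damaged fs is only read,
# never modified, so this is safe to try before any --repair attempt.
btrfs restore -v /dev/mapper/crypt_md0 /mnt/recovery
```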
RE: Read before you deploy btrfs + zstd
> -----Original Message-----
> From: linux-btrfs-ow...@vger.kernel.org [mailto:linux-btrfs-
> ow...@vger.kernel.org] On Behalf Of Martin Steigerwald
> Sent: Tuesday, 14 November 2017 6:35 PM
> To: dste...@suse.cz; linux-btrfs@vger.kernel.org
> Subject: Re: Read before you deploy btrfs + zstd
>
> Hello David.
>
> David Sterba - 13.11.17, 23:50:
> > while 4.14 is still fresh, let me address some concerns I've seen on
> > linux forums already.
> >
> > The newly added ZSTD support is a feature that has broader impact
> > than just the runtime compression. The btrfs-progs understand
> > filesystems with ZSTD since 4.13. The remaining key part is the
> > bootloader.
> >
> > Up to now, there are no bootloaders supporting ZSTD. This could lead
> > to an unmountable filesystem if the critical files under /boot get
> > accidentally or intentionally compressed by ZSTD.
>
> But otherwise ZSTD is safe to use? Are you aware of any other issues?
>
> I consider switching from LZO to ZSTD on this ThinkPad T520 with
> Sandybridge.

I've been using it since rc2 and had no trouble at all so far. The
filesystem is running faster now (with zstd) than it did uncompressed
on 4.13.

Paul.
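[Editor's note: for readers considering the same LZO-to-ZSTD switch, it is a mount-option change; this sketch uses placeholder mount points. Existing lzo-compressed extents stay readable, and only new writes pick up zstd unless the data is rewritten.]

```shell
# Switch a live mount over to zstd for new writes
mount -o remount,compress=zstd /

# Optionally rewrite existing file data with zstd (this also defragments,
# so mind the caveats about snapshots and reflinks before running it)
btrfs filesystem defragment -r -czstd /home
```

The bootloader caveat from the thread still applies: anything under /boot that the bootloader must read should not end up zstd-compressed until bootloaders support it.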
Re: Need help with incremental backup strategy (snapshots, defragmenting & performance)
On Mon, 13 Nov 2017 22:39:44 -0500 Dave wrote:

> I have my live system on one block device and a backup snapshot of it
> on another block device. I am keeping them in sync with hourly rsync
> transfers.
>
> Here's how this system works in a little more detail:
>
> 1. I establish the baseline by sending a full snapshot to the backup
>    block device using btrfs send-receive.
> 2. Next, on the backup device I immediately create a rw copy of that
>    baseline snapshot.
> 3. I delete the source snapshot to keep the live filesystem free of
>    all snapshots (so it can be optimally defragmented, etc.)
> 4. Hourly, I take a snapshot of the live system, rsync all changes to
>    the backup block device, and then delete the source snapshot. This
>    hourly process takes less than a minute currently. (My test system
>    has only moderate usage.)
> 5. Hourly, following the above step, I use snapper to take a snapshot
>    of the backup subvolume to create/preserve a history of changes.
>    For example, I can find the version of a file 30 hours prior.

Sounds a bit complex; I still don't get why you need all these snapshot
creations and deletions, and why you are even still using btrfs
send-receive. Here is my scheme:

/mnt/dst                 <- mounted backup storage volume
/mnt/dst/backup          <- a subvolume
/mnt/dst/backup/host1/   <- rsync destination for host1, regular directory
/mnt/dst/backup/host2/   <- rsync destination for host2, regular directory
/mnt/dst/backup/host3/   <- rsync destination for host3, regular directory
etc.

/mnt/dst/backup/host1/bin/
/mnt/dst/backup/host1/etc/
/mnt/dst/backup/host1/home/
...

Self-explanatory. All regular directories, not subvolumes.

Snapshots:

/mnt/dst/snaps/backup                   <- a regular directory
/mnt/dst/snaps/backup/2017-11-14T12:00/ <- snapshot 1 of /mnt/dst/backup
/mnt/dst/snaps/backup/2017-11-14T13:00/ <- snapshot 2 of /mnt/dst/backup
/mnt/dst/snaps/backup/2017-11-14T14:00/ <- snapshot 3 of /mnt/dst/backup

Accessing historic data:

/mnt/dst/snaps/backup/2017-11-14T12:00/host1/bin/bash
is /bin/bash for host1 as of 2017-11-14 12:00 (time on the backup
system).

No need for btrfs send-receive; only plain rsync is used, directly from
hostX:/ to /mnt/dst/backup/host1/.

No need to create or delete snapshots during the actual backup process.

A single common timeline is kept for all hosts to be backed up, so the
snapshot count is not multiplied by the number of hosts (in my case the
backup location is multi-purpose, so I somewhat care about the total
number of snapshots there as well).

Also, all of this works even with source hosts which do not use btrfs.

--
With respect,
Roman
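[Editor's note: Roman's scheme boils down to two commands per backup cycle. A minimal sketch, assuming the layout above; host names and rsync flags are illustrative, not taken from the thread.]

```shell
# 1. Plain rsync from each host into its regular directory under the
#    backup subvolume (no snapshots involved in this step).
rsync -aHAX --delete host1:/ /mnt/dst/backup/host1/

# 2. After all hosts are synced, take one read-only snapshot of the
#    whole backup subvolume, named by timestamp. This yields a single
#    shared timeline covering every host at once.
btrfs subvolume snapshot -r /mnt/dst/backup \
    "/mnt/dst/snaps/backup/$(date +%FT%H:%M)"
```

Because the per-host trees are regular directories, the snapshot of the parent subvolume captures all of them atomically, which is what keeps the snapshot count independent of the number of hosts.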
Re: Need help with incremental backup strategy (snapshots, defragmenting & performance)
On Tue, 14 Nov 2017 10:14:55 +0300 Marat Khalili wrote:

> Don't keep snapshots under the rsync target, place them under
> ../snapshots (if snapper supports this).
> Or, specify them in --exclude and avoid using --delete-excluded.

Both are good suggestions. In my case each system does have its own
snapshots as well, but they are retained for much shorter. So I both
use --exclude to avoid fetching the entire /snaps tree from the source
system, and store snapshots of the destination system outside of the
rsync target dirs.

> Or keep using -x if it works, why not?

-x will exclude the content of all subvolumes down the tree on the
source side -- not only the time-based ones. If you take care to never
casually create any subvolumes whose content you'd still want backed
up, then I guess it can work.

--
With respect,
Roman
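[Editor's note: the --exclude variant discussed above can be sketched as follows; paths are illustrative. Unlike -x, this skips only the named tree, so other subvolumes on the source are still descended into and backed up.]

```shell
# Leading "/" anchors the pattern to the transfer root, so only the
# source host's own snapshot tree is skipped. Plain --delete is used,
# not --delete-excluded, so existing copies of excluded paths on the
# destination are left alone.
rsync -aHAX --delete --exclude=/snaps host1:/ /mnt/dst/backup/host1/
```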