Re: [PATCH] fstests: btrfs: Regression test for leaking data reserved space
On Tue, Jun 28, 2016 at 09:54:51AM +0800, Qu Wenruo wrote:
> When btrfs hits EDQUOTA when reserving data space, it will leak already
> reserved data space.
>
> This test case will check it by using the stricter enospc_debug mount
> option to trigger a kernel warning at umount time.
>
> Signed-off-by: Qu Wenruo

Looks good to me. Tested on x86_64 and ppc64 hosts; the x86_64 host failed the test as expected, though the ppc64 host didn't.

Reviewed-by: Eryu Guan
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Adventures in btrfs raid5 disk recovery
On Mon, Jun 27, 2016 at 08:39:21PM -0600, Chris Murphy wrote:
> On Mon, Jun 27, 2016 at 7:52 PM, Zygo Blaxell wrote:
> > On Mon, Jun 27, 2016 at 04:30:23PM -0600, Chris Murphy wrote:
> >> Btrfs does have something of a work around for when things get slow,
> >> and that's balance, read and rewrite everything. The write forces
> >> sector remapping by the drive firmware for bad sectors.
> >
> > It's a crude form of "resilvering" as ZFS calls it.
>
> In what manner is it crude?

Balance relocates extents, looks up backrefs, and rewrites metadata, all of which is extra work beyond what resilvering requires (and extra work that is proportional to the number of backrefs and to the (currently extremely poor) performance of the backref-walking code, so snapshots and large files multiply the workload). Resilvering should just read data, reconstruct it from a mirror if necessary, and write it back to the original location (or read one mirror and rewrite the other). That's more like what scrub does, except scrub rewrites only the blocks it couldn't read (or that failed csum).

> > Last time I checked all the RAID implementations on Linux (ok, so that's
> > pretty much just md-raid) had some sort of repair capability.
>
> You can read man 4 md, and you can also look on linux-raid@, it's very
> clearly necessary for the drive to report a read or write error
> explicitly with LBA for md to do repairs. If there are link resets,
> bad sectors accumulate and the obvious inevitably happens.

I am looking at the md code. It looks at ->bi_error and nothing else as far as I can tell. It doesn't even care whether the error is EIO--any non-zero return value from the lower bio layer seems to trigger automatic recovery.
[RFC] Btrfs: add asynchronous compression support in zlib
This patch introduces a change in zlib.c to use the new asynchronous compression API (acomp) proposed in cryptodev (work in progress): https://patchwork.kernel.org/patch/9163577/

With this change BTRFS can offload zlib (de)compression to a hardware accelerator engine if an acomp hardware driver is registered in LKCF. The advantage of using acomp is saving CPU cycles and increasing disk IO through hardware offloading. The input pages (up to 32) are added to an sg-list and sent to acomp in one request; as it is an asynchronous call, the thread is put to sleep and the CPU is freed up, and once compression is done the callback is triggered and the thread is woken up.

This patch doesn't change the BTRFS disk format, which means files compressed by the hardware engine can be decompressed by the zlib software library, and vice versa. The existing synchronous zlib (de)compression path is not changed in the current implementation, but eventually both can be unified with the acomp API in LKCF.

Signed-off-by: Weigang Li
---
 fs/btrfs/zlib.c | 206
 1 file changed, 206 insertions(+)

diff --git a/fs/btrfs/zlib.c b/fs/btrfs/zlib.c
index 82990b8..957e603 100644
--- a/fs/btrfs/zlib.c
+++ b/fs/btrfs/zlib.c
@@ -31,6 +31,8 @@
 #include
 #include
 #include "compression.h"
+#include
+#include

 struct workspace {
 	z_stream strm;
@@ -38,6 +40,11 @@ struct workspace {
 	struct list_head list;
 };

+struct acomp_res {
+	struct completion *completion;
+	int *ret;
+};
+
 static void zlib_free_workspace(struct list_head *ws)
 {
 	struct workspace *workspace = list_entry(ws, struct workspace, list);
@@ -71,6 +78,119 @@ fail:
 	return ERR_PTR(-ENOMEM);
 }

+static void acomp_op_done(struct crypto_async_request *req, int err)
+{
+	struct acomp_res *res = req->data;
+
+	*res->ret = err;
+	complete(res->completion);
+}
+
+static int zlib_compress_pages_async(struct address_space *mapping,
+				u64 start, unsigned long len,
+				struct page **pages,
+				unsigned long nr_dest_pages,
+				unsigned long *out_pages,
+				unsigned long *total_in,
+				unsigned long *total_out,
+				unsigned long max_out)
+{
+	int ret, acomp_ret = -1, i = 0;
+	int nr_pages = 0;
+	struct page *out_page = NULL;
+	struct crypto_acomp *tfm = NULL;
+	struct acomp_req *req = NULL;
+	struct completion completion;
+	unsigned int nr_src_pages = 0, nr_dst_pages = 0, nr = 0;
+	struct sg_table *in_sg = NULL, *out_sg = NULL;
+	struct page **src_pages = NULL;
+	struct acomp_res res;
+
+	*out_pages = 0;
+	*total_out = 0;
+	*total_in = 0;
+
+	init_completion(&completion);
+	nr_src_pages = (len + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
+	src_pages = kcalloc(nr_src_pages, sizeof(struct page *), GFP_KERNEL);
+	nr = find_get_pages(mapping, start >> PAGE_CACHE_SHIFT,
+			nr_src_pages, src_pages);
+	if (nr != nr_src_pages) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	in_sg = kcalloc(1, sizeof(*in_sg), GFP_KERNEL);
+	ret = sg_alloc_table_from_pages(in_sg, src_pages, nr_src_pages,
+			0, len, GFP_KERNEL);
+	if (ret)
+		goto out;
+
+	/* pre-alloc dst pages, with same size as src */
+	nr_dst_pages = nr_src_pages;
+	for (i = 0; i < nr_dst_pages; i++) {
+		out_page = alloc_page(GFP_NOFS | __GFP_HIGHMEM);
+		if (!out_page) {
+			ret = -ENOMEM;
+			goto out;
+		}
+		pages[i] = out_page;
+	}
+
+	out_sg = kcalloc(1, sizeof(*out_sg), GFP_KERNEL);
+
+	ret = sg_alloc_table_from_pages(out_sg, pages, nr_dst_pages, 0,
+			(nr_dst_pages << PAGE_CACHE_SHIFT), GFP_KERNEL);
+	if (ret)
+		goto out;
+
+	tfm = crypto_alloc_acomp("zlib_deflate", 0, 0);
+	req = acomp_request_alloc(tfm, GFP_KERNEL);
+	acomp_request_set_params(req, in_sg->sgl, out_sg->sgl, len,
+			nr_dst_pages << PAGE_CACHE_SHIFT);
+
+	res.completion = &completion;
+	res.ret = &acomp_ret;
+	acomp_request_set_callback(req, CRYPTO_TFM_REQ_MAY_BACKLOG,
+			acomp_op_done, &res);
+	ret = crypto_acomp_compress(req);
+	if (ret == -EINPROGRESS) {
+		ret = wait_for_completion_timeout(&completion, 5000);
+		if (ret == 0) { /* timeout */
+			ret = -1;
+			goto out;
+		}
+	}
+
+	ret = *res.ret;
+	*total_in = len;
+	*total_out =
Re: Adventures in btrfs raid5 disk recovery
On Mon, Jun 27, 2016 at 7:52 PM, Zygo Blaxell wrote:
> On Mon, Jun 27, 2016 at 04:30:23PM -0600, Chris Murphy wrote:
>> On Mon, Jun 27, 2016 at 3:57 PM, Zygo Blaxell wrote:
>> > On Mon, Jun 27, 2016 at 10:17:04AM -0600, Chris Murphy wrote:
>> > If anything, I want the timeout to be shorter so that upper layers with
>> > redundancy can get an EIO and initiate repair promptly, and admins can
>> > get notified to evict chronic offenders from their drive slots, without
>> > having to pay extra for hard disk firmware with that feature.
>>
>> The drive totally thwarts this. It doesn't report back to the kernel
>> what command is hung, as far as I'm aware. It just hangs and goes into
>> a so-called "deep recovery"; there is no way to know what sector is
>> causing the problem
>
> I'm proposing just treating the link reset _as_ an EIO, unless transparent
> link resets are required for link speed negotiation or something.

That's not one EIO, that's possibly 31 items in the command queue that get knocked over when the link is reset. I don't have the expertise to know whether it's sane to interpret many EIOs all at once as an implicit indication of bad sectors. Offhand I think that's probably specious.

> The drive wouldn't be thwarting anything, the host would just ignore it
> (unless the drive doesn't respond to a link reset until after its internal
> timeout, in which case nothing is saved by shortening the timeout).
>
>> until the drive reports a read error, which will
>> include the affected sector LBA.
>
> It doesn't matter which sector. Chances are good that it was more than
> one of the outstanding requested sectors anyway. Rewrite them all.

*shrug* Even if valid, it only helps the raid 1+ cases. It does nothing to help raid0, linear/concat, or single-device deployments. Those users also deserve to have access to their data, if the drive can recover it by being given enough time to do so.

> We know which sectors they are because somebody has an IO operation
> waiting for a status on each of them (unless they're using AIO or some
> other API where a request can be fired at a hard drive and the reply
> discarded). Notify all of them that their IO failed and move on.

Dunno, maybe.

>> Btrfs does have something of a work around for when things get slow,
>> and that's balance, read and rewrite everything. The write forces
>> sector remapping by the drive firmware for bad sectors.
>
> It's a crude form of "resilvering" as ZFS calls it.

In what manner is it crude?

> If btrfs sees EIO from a lower block layer it will try to reconstruct the
> missing data (but not repair it). If that happens during a scrub,
> it will also attempt to rewrite the missing data over the original
> offending sectors. This happens every few months in my server pool,
> and seems to be working even on btrfs raid5.
>
> Last time I checked all the RAID implementations on Linux (ok, so that's
> pretty much just md-raid) had some sort of repair capability.

You can read man 4 md, and you can also look on linux-raid@; it's very clearly necessary for the drive to report a read or write error explicitly with LBA for md to do repairs. If there are link resets, bad sectors accumulate and the obvious inevitably happens.

>> For single drives and RAID 0, the only possible solution is to not do
>> link resets for up to 3 minutes and hope the drive returns the single
>> copy of data.
>
> So perhaps the timeout should be influenced by higher layers, e.g. if a
> disk becomes part of a raid1, its timeout should be shortened by default,
> while the timeout for a disk that is not used by a redundant layer should
> be longer.

And there are a pile of reasons why link resets are necessary that have nothing to do with bad sectors. So if you end up with a drive or controller misbehaving, and the new behavior is to force a bunch of new (corrective) writes to the drive right after a reset, it could actually make its problems worse for all we know.

I think it's highly speculative to assume that a hung block device means a bad sector and should be treated as one, and that doing so will cause no other side effects. That's a question for block device/SCSI experts to opine on, whether this is at all sane to do. I'm sure they're reasonably aware of this problem, and if it were that simple they'd presumably have done it already; but conversely, five years of telling users to change the command timer or stop using the wrong kind of drives for RAID really isn't sufficiently good advice either.

The reality is that drive manufacturers have handed us drives that far and wide don't support SCT ERC or have it disabled by default, so yeah, maybe the thing to do is for udev to poll the drive for SCT ERC: if it's already at 70,70 then leave the SCSI command timer as is. If it reports it's disabled, then udev needs to know if the drive is in some kind of RAID 1+ and, if so, set SCT ERC to 70,70. If it's a single drive,
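The udev-polling policy described above could be sketched as a small decision helper. This is purely illustrative: the `decide` function, its argument conventions, and the action names are all made up for the sketch, and the inputs ("is ERC enabled", "is the disk behind a redundant layer") are assumed to be gathered elsewhere, e.g. by parsing `smartctl -l scterc` output.

```shell
#!/bin/sh
# Hypothetical policy helper for the udev approach sketched above.
# $1: SCT ERC state as parsed from `smartctl -l scterc` ("enabled"/"disabled")
# $2: whether the disk sits under a redundant layer, RAID 1+ ("yes"/"no")
decide() {
    erc="$1"
    redundant="$2"
    if [ "$erc" = "enabled" ]; then
        # ERC already capped (e.g. 70,70): leave the SCSI command timer alone.
        echo "leave-timeout"
    elif [ "$redundant" = "yes" ]; then
        # Deep recovery would stall the array: cap ERC at 7 seconds.
        echo "set-erc-70,70"
    else
        # Single drive: give deep recovery a chance, raise the timer instead.
        echo "raise-timeout-180"
    fi
}
```

In a real rule, "set-erc-70,70" would map to `smartctl -l scterc,70,70 /dev/sdX` and "raise-timeout-180" to `echo 180 > /sys/block/sdX/device/timeout`; reliably detecting "redundant" (md, LVM raid, btrfs raid1+) is exactly the part the thread leaves open.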
Re: Kernel bug during RAID1 replace
On Mon, Jun 27, 2016 at 6:49 PM, Saint Germain wrote:
>
> I've tried both options and launched a replace, but I got the same error
> (replace is cancelled, kernel bug).
> I will leave these options on and attempt a ddrescue on /dev/sda
> to /dev/sdd.
> Then I will disconnect /dev/sda and reboot and see if it works better.

Sounds reasonable. Just make sure the file system is already unmounted when you use ddrescue, because otherwise you're block-copying it while it could be modified while rw mounted (the generation number tends to get incremented while rw mounted).

>> Last, I have no idea if the massive Btrfs write errors on sda are from
>> an earlier problem where the drive data or power cable got jiggled or
>> was otherwise absent temporarily? So depending on how the block
>> timeout change affects your data recovery, you might end up needing to
>> do a reboot to get back to a more stable state for all of this? It
>> really should be able to fix things *if* at least one copy can be read
>> and then written to the other drive.
>>
>
> I also have no idea why sda is behaving like this. I haven't done
> anything particular on these drives.

Yeah, pretty weird. At some point once things are stable, if this file system survives, you might want to use btrfs dev stat -z to wipe out those stats.

--
Chris Murphy
[PATCH] btrfs: Fix leaking bytes_may_use after hitting EDQUOTA
If one mounts btrfs with the enospc_debug mount option and hits the qgroup limit in btrfs_check_data_free_space(), then at unmount time a kernel warning will be triggered along with a data space info dump.

--
------------[ cut here ]------------
WARNING: CPU: 0 PID: 3875 at fs/btrfs/extent-tree.c:9785 btrfs_free_block_groups+0x2b8/0x460 [btrfs]
Modules linked in: btrfs ext4 jbd2 mbcache xor zlib_deflate raid6_pq xfs [last unloaded: btrfs]
CPU: 0 PID: 3875 Comm: umount Tainted: G        W 4.7.0-rc4+ #13
Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
 8800230a7d00 813b89e5 8800230a7d40 810c9b8b
 2639230a7d50 88003d523a78 88003d523b80 88000d1c
 88000d1c00c8
Call Trace:
 [] dump_stack+0x67/0x92
 [] __warn+0xcb/0xf0
 [] warn_slowpath_null+0x1d/0x20
 [] btrfs_free_block_groups+0x2b8/0x460 [btrfs]
 [] close_ctree+0x173/0x350 [btrfs]
 [] btrfs_put_super+0x19/0x20 [btrfs]
 [] generic_shutdown_super+0x6a/0xf0
 [] kill_anon_super+0x12/0x20
 [] btrfs_kill_super+0x18/0x110 [btrfs]
 [] deactivate_locked_super+0x3e/0x70
 [] deactivate_super+0x5c/0x60
 [] cleanup_mnt+0x3f/0x90
 [] __cleanup_mnt+0x12/0x20
 [] task_work_run+0x81/0xc0
 [] exit_to_usermode_loop+0xb3/0xc0
 [] syscall_return_slowpath+0xb0/0xc0
 [] entry_SYSCALL_64_fastpath+0xa6/0xa8
---[ end trace 99b9af8484495c66 ]---
BTRFS: space_info 1 has 8044544 free, is not full
BTRFS: space_info total=8388608, used=344064, pinned=0, reserved=0, may_use=409600, readonly=0
--

The problem is in btrfs_check_data_free_space(): we reserve data space first and then reserve qgroup space. However, if the qgroup reserve fails, we don't clean up the reserved data space, which leads to the kernel warning. Fix it by freeing the reserved data space when qgroup_reserve_data() fails.
Signed-off-by: Qu Wenruo
---
 fs/btrfs/extent-tree.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 29e5d00..e349da0 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4265,6 +4265,9 @@ int btrfs_check_data_free_space(struct inode *inode, u64 start, u64 len)
 	 * range, but don't impact performance on quota disable case.
 	 */
 	ret = btrfs_qgroup_reserve_data(inode, start, len);
+	if (ret < 0)
+		/* Qgroup reserve failed, need to cleanup reserved data space */
+		btrfs_free_reserved_data_space(inode, start, len);
 	return ret;
 }
--
2.9.0
Re: Adventures in btrfs raid5 disk recovery
On Mon, Jun 27, 2016 at 04:30:23PM -0600, Chris Murphy wrote:
> On Mon, Jun 27, 2016 at 3:57 PM, Zygo Blaxell wrote:
> > On Mon, Jun 27, 2016 at 10:17:04AM -0600, Chris Murphy wrote:
> > If anything, I want the timeout to be shorter so that upper layers with
> > redundancy can get an EIO and initiate repair promptly, and admins can
> > get notified to evict chronic offenders from their drive slots, without
> > having to pay extra for hard disk firmware with that feature.
>
> The drive totally thwarts this. It doesn't report back to the kernel
> what command is hung, as far as I'm aware. It just hangs and goes into
> a so-called "deep recovery"; there is no way to know what sector is
> causing the problem

I'm proposing just treating the link reset _as_ an EIO, unless transparent link resets are required for link speed negotiation or something. The drive wouldn't be thwarting anything; the host would just ignore it (unless the drive doesn't respond to a link reset until after its internal timeout, in which case nothing is saved by shortening the timeout).

> until the drive reports a read error, which will
> include the affected sector LBA.

It doesn't matter which sector. Chances are good that it was more than one of the outstanding requested sectors anyway. Rewrite them all.

We know which sectors they are because somebody has an IO operation waiting for a status on each of them (unless they're using AIO or some other API where a request can be fired at a hard drive and the reply discarded). Notify all of them that their IO failed and move on.

> Btrfs does have something of a work around for when things get slow,
> and that's balance, read and rewrite everything. The write forces
> sector remapping by the drive firmware for bad sectors.

It's a crude form of "resilvering" as ZFS calls it.

> > The upper layers could time the IOs, and make their own decisions based
> > on the timing (e.g. btrfs or mdadm could proactively repair anything that
> > took more than 10 seconds to read). That might be a better approach,
> > since shortening the time to an EIO is only useful when you have a
> > redundancy layer in place to do something about them.
>
> For RAID with redundancy, that's doable, although I have no idea what
> work is needed, or even if it's possible, to track commands in this
> manner, and fall back to some kind of repair mode as if it were a read
> error.

If btrfs sees EIO from a lower block layer it will try to reconstruct the missing data (but not repair it). If that happens during a scrub, it will also attempt to rewrite the missing data over the original offending sectors. This happens every few months in my server pool, and seems to be working even on btrfs raid5.

Last time I checked, all the RAID implementations on Linux (OK, so that's pretty much just md-raid) had some sort of repair capability. lvm uses (or can use) the md-raid implementation. ext4 and xfs on naked disk partitions will have problems, but that's because they were designed in the 1990s, when we were young and naive and still believed hard disks would one day become reliable devices without buggy firmware.

> For single drives and RAID 0, the only possible solution is to not do
> link resets for up to 3 minutes and hope the drive returns the single
> copy of data.

So perhaps the timeout should be influenced by higher layers, e.g. if a disk becomes part of a raid1, its timeout should be shortened by default, while the timeout for a disk that is not used by a redundant layer should be longer.

> Even in the case of Btrfs DUP, it's thwarted without a read error
> reported from the drive (or it returning bad data).

That case gets messy--different timeouts for different parts of the disk. Probably not practical.
[PATCH] fstests: btrfs: Regression test for leaking data reserved space
When btrfs hits EDQUOTA when reserving data space, it will leak already reserved data space.

This test case will check it by using the stricter enospc_debug mount option to trigger a kernel warning at umount time.

Signed-off-by: Qu Wenruo
---
 tests/btrfs/124     | 73 +
 tests/btrfs/124.out |  2 ++
 tests/btrfs/group   |  1 +
 3 files changed, 76 insertions(+)
 create mode 100755 tests/btrfs/124
 create mode 100644 tests/btrfs/124.out

diff --git a/tests/btrfs/124 b/tests/btrfs/124
new file mode 100755
index 000..94a5b28
--- /dev/null
+++ b/tests/btrfs/124
@@ -0,0 +1,73 @@
+#! /bin/bash
+# FS QA Test 124
+#
+# Regression test for leaking data space after hitting EDQUOTA
+#
+#-----------------------------------------------------------------------
+# Copyright (c) 2016 Fujitsu. All Rights Reserved.
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
+#-----------------------------------------------------------------------
+#

+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1	# failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+	cd /
+	rm -f $tmp.*
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+
+# remove previous $seqres.full before test
+rm -f $seqres.full
+
+# real QA test starts here
+
+# Modify as appropriate.
+_supported_fs btrfs
+_supported_os Linux
+_require_scratch
+
+_scratch_mkfs
+# Use enospc_debug mount option to trigger restrict space info check
+_scratch_mount "-o enospc_debug"
+
+_run_btrfs_util_prog quota enable $SCRATCH_MNT
+_run_btrfs_util_prog quota rescan -w $SCRATCH_MNT
+_run_btrfs_util_prog qgroup limit 512K 0/5 $SCRATCH_MNT
+
+# The amount of written data may change due to different nodesize at mkfs time,
+# so redirect stdout to seqres.full.
+# Also, EDQUOTA is expected, which can't be redirected due to the limitation
+# of _filter_xfs_io, so golden output will include EDQUOTA error message
+_pwrite_byte 0xcdcdcdcd 0 1M $SCRATCH_MNT/test_file | _filter_xfs_io \
+	>> $seqres.full
+
+# Fstests will umount the fs, and at umount time, kernel warning will be
+# triggered
+
+# success, all done
+status=0
+exit
diff --git a/tests/btrfs/124.out b/tests/btrfs/124.out
new file mode 100644
index 000..6774792
--- /dev/null
+++ b/tests/btrfs/124.out
@@ -0,0 +1,2 @@
+QA output created by 124
+pwrite64: Disk quota exceeded
diff --git a/tests/btrfs/group b/tests/btrfs/group
index 5a26ed7..a398213 100644
--- a/tests/btrfs/group
+++ b/tests/btrfs/group
@@ -126,3 +126,4 @@
 121 auto quick snapshot qgroup
 122 auto quick snapshot qgroup
 123 auto quick qgroup
+124 auto quick qgroup
--
2.5.5
Re: Kernel bug during RAID1 replace
On Mon, 27 Jun 2016 18:00:34 -0600, Chris Murphy wrote:
> On Mon, Jun 27, 2016 at 5:06 PM, Saint Germain
> wrote:
> > On Mon, 27 Jun 2016 16:58:37 -0600, Chris Murphy
> > wrote :
> >
> >> On Mon, Jun 27, 2016 at 4:55 PM, Chris Murphy
> >> wrote:
> >>
> >> >> BTRFS info (device sdb1): dev_replace from /dev/sda1 (devid 1)
> >> >> to /dev/sdd1 started scrub_handle_errored_block: 166 callbacks
> >> >> suppressed BTRFS warning (device sdb1): checksum error at
> >> >> logical 93445255168 on dev /dev/sda1, sector 77669048, root 5,
> >> >> inode 3434831, offset 479232, length 4096, links 1 (path:
> >> >> user/.local/share/zeitgeist/activity.sqlite-wal)
> >> >> btrfs_dev_stat_print_on_error: 166 callbacks suppressed BTRFS
> >> >> error (device sdb1): bdev /dev/sda1 errs: wr 0, rd 0, flush 0,
> >> >> corrupt 14221, gen 24 scrub_handle_errored_block: 166 callbacks
> >> >> suppressed BTRFS error (device sdb1): unable to fixup (regular)
> >> >> error at logical 93445255168 on dev /dev/sda1
> >> >
> >> > Shoot. You have a lot of these. It looks suspiciously like you're
> >> > hitting a case list regulars are only just starting to understand
> >>
> >> Forget this part completely. It doesn't affect raid1. I just
> >> re-read that your setup is not raid1, I don't know why I thought
> >> it was raid5.
> >>
> >> The likely issue here is that you've got legit corruptions on sda
> >> (mix of slow and flat out bad sectors), as well as a failing drive.
> >>
> >> This is also safe to issue:
> >>
> >> smartctl -l scterc /dev/sda
> >> smartctl -l scterc /dev/sdb
> >> cat /sys/block/sda/device/timeout
> >> cat /sys/block/sdb/device/timeout
> >>
> >
> > My setup is indeed RAID1 (and not RAID5)
> >
> > root@system:/# smartctl -l scterc /dev/sda
> > smartctl 6.4 2014-10-07 r4002 [x86_64-linux-4.6.0-0.bpo.1-amd64]
> > (local build) Copyright (C) 2002-14, Bruce Allen, Christian Franke,
> > www.smartmontools.org
> >
> > SCT Error Recovery Control:
> >            Read: Disabled
> >           Write: Disabled
> >
> > root@system:/# smartctl -l scterc /dev/sdb
> > smartctl 6.4 2014-10-07 r4002 [x86_64-linux-4.6.0-0.bpo.1-amd64]
> > (local build) Copyright (C) 2002-14, Bruce Allen, Christian Franke,
> > www.smartmontools.org
> >
> > SCT Error Recovery Control:
> >            Read: Disabled
> >           Write: Disabled
> >
> > root@system:/# cat /sys/block/sda/device/timeout
> > 30
> > root@system:/# cat /sys/block/sdb/device/timeout
> > 30
>
> Good news and bad news. The bad news is this is a significant
> misconfiguration, it's very common, and it means that any bad sectors
> that don't result in read errors within 30 seconds won't get fixed by
> Btrfs (or even mdadm or LVM raid). So they can accumulate.
>
> There are two options since your drives support SCT ERC.
>
> 1.
> smartctl -l scterc,70,70 /dev/sdX   ## done for both drives
>
> That will make sure the drive reports a read error in 7 seconds, well
> under the kernel's command timer of 7 seconds. This is how your drives
> should normally be configured for RAID usage.
>
> 2.
> echo 180 > /sys/block/sda/device/timeout
> echo 180 > /sys/block/sdb/device/timeout
>
> This *might* actually work better in your case. If you permit the
> drives to have really long error recovery, it might actually allow the
> data to be returned to Btrfs and then it can start fixing problems.
> Maybe. It's a long shot. And there will be upwards of 3-minute hangs.
>
> I would give this a shot first. You can issue these commands safely at
> any time; no umount is needed or anything like that. I would do this
> even before using cp/rsync or ddrescue because it increases the chance
> the drive can recover data from these bad sectors and fix the other
> drive.
>
> These settings are not persistent across a reboot unless you set a
> udev rule or equivalent.
>
> On one of my drives that supports SCT ERC it only accepts the smartctl
> -l command to set the timeout once. I can't change it without power
> cycling the drive or it just crashes (yay firmware bugs). Just FYI
> it's possible to run into other weirdness.
>

I've tried both options and launched a replace, but I got the same error (replace is cancelled, kernel bug). I will leave these options on and attempt a ddrescue on /dev/sda to /dev/sdd. Then I will disconnect /dev/sda and reboot and see if it works better.

> Last, I have no idea if the massive Btrfs write errors on sda are from
> an earlier problem where the drive data or power cable got jiggled or
> was otherwise absent temporarily? So depending on how the block
> timeout change affects your data recovery, you might end up needing to
> do a reboot to get back to a more stable state for all of this? It
> really should be able to fix things *if* at least one copy can be read
> and then written to the other drive.
>

I also have no idea why sda is behaving like this. I haven't done
Re: [PATCH 05/14] Btrfs: warn_on for unaccounted spaces
At 06/27/2016 09:03 PM, Chris Mason wrote:
> On 06/27/2016 12:47 AM, Qu Wenruo wrote:
>> Hi Josef,
>>
>> Would you please move this patch to the first of the patchset?
>>
>> It's making bisect quite hard, as it will always stop at this patch,
>> making it hard to check whether it's a regression or an existing bug.
>
> That's a good idea. Which workload are you having trouble with?
>
> -chris

The qgroup test which hits EDQUOTA.

After hitting EDQUOTA, unmount will always trigger a kernel warning for DATA space whose bytes_may_use is not zero.

The problem is long-standing; it seems buffered write doesn't clean up its delalloc-reserved DATA space.

Thanks,
Qu
Re: Kernel bug during RAID1 replace
On Mon, Jun 27, 2016 at 6:00 PM, Chris Murphy wrote:
> There are two options since your drives support SCT ERC.
>
> 1.
> smartctl -l scterc,70,70 /dev/sdX   ## done for both drives
>
> That will make sure the drive reports a read error in 7 seconds, well
> under the kernel's command timer of 7 seconds.

Correction: "well under the kernel's command timer default of 30 seconds"

--
Chris Murphy
Re: Kernel bug during RAID1 replace
On Mon, Jun 27, 2016 at 5:06 PM, Saint Germain wrote:
> On Mon, 27 Jun 2016 16:58:37 -0600, Chris Murphy
> wrote :
>
>> On Mon, Jun 27, 2016 at 4:55 PM, Chris Murphy
>> wrote:
>>
>> >> BTRFS info (device sdb1): dev_replace from /dev/sda1 (devid 1)
>> >> to /dev/sdd1 started scrub_handle_errored_block: 166 callbacks
>> >> suppressed BTRFS warning (device sdb1): checksum error at logical
>> >> 93445255168 on dev /dev/sda1, sector 77669048, root 5, inode
>> >> 3434831, offset 479232, length 4096, links 1 (path:
>> >> user/.local/share/zeitgeist/activity.sqlite-wal)
>> >> btrfs_dev_stat_print_on_error: 166 callbacks suppressed BTRFS
>> >> error (device sdb1): bdev /dev/sda1 errs: wr 0, rd 0, flush 0,
>> >> corrupt 14221, gen 24 scrub_handle_errored_block: 166 callbacks
>> >> suppressed BTRFS error (device sdb1): unable to fixup (regular)
>> >> error at logical 93445255168 on dev /dev/sda1
>> >
>> > Shoot. You have a lot of these. It looks suspiciously like you're
>> > hitting a case list regulars are only just starting to understand
>>
>> Forget this part completely. It doesn't affect raid1. I just re-read
>> that your setup is not raid1, I don't know why I thought it was raid5.
>>
>> The likely issue here is that you've got legit corruptions on sda (mix
>> of slow and flat out bad sectors), as well as a failing drive.
>>
>> This is also safe to issue:
>>
>> smartctl -l scterc /dev/sda
>> smartctl -l scterc /dev/sdb
>> cat /sys/block/sda/device/timeout
>> cat /sys/block/sdb/device/timeout
>>
>
> My setup is indeed RAID1 (and not RAID5)
>
> root@system:/# smartctl -l scterc /dev/sda
> smartctl 6.4 2014-10-07 r4002 [x86_64-linux-4.6.0-0.bpo.1-amd64] (local
> build) Copyright (C) 2002-14, Bruce Allen, Christian Franke,
> www.smartmontools.org
>
> SCT Error Recovery Control:
>            Read: Disabled
>           Write: Disabled
>
> root@system:/# smartctl -l scterc /dev/sdb
> smartctl 6.4 2014-10-07 r4002 [x86_64-linux-4.6.0-0.bpo.1-amd64] (local
> build) Copyright (C) 2002-14, Bruce Allen, Christian Franke,
> www.smartmontools.org
>
> SCT Error Recovery Control:
>            Read: Disabled
>           Write: Disabled
>
> root@system:/# cat /sys/block/sda/device/timeout
> 30
> root@system:/# cat /sys/block/sdb/device/timeout
> 30

Good news and bad news. The bad news is this is a significant misconfiguration, it's very common, and it means that any bad sectors that don't result in read errors within 30 seconds won't get fixed by Btrfs (or even mdadm or LVM raid). So they can accumulate.

There are two options since your drives support SCT ERC.

1.
smartctl -l scterc,70,70 /dev/sdX   ## done for both drives

That will make sure the drive reports a read error in 7 seconds, well under the kernel's command timer of 7 seconds. This is how your drives should normally be configured for RAID usage.

2.
echo 180 > /sys/block/sda/device/timeout
echo 180 > /sys/block/sdb/device/timeout

This *might* actually work better in your case. If you permit the drives to have really long error recovery, it might actually allow the data to be returned to Btrfs and then it can start fixing problems. Maybe. It's a long shot. And there will be upwards of 3-minute hangs.

I would give this a shot first. You can issue these commands safely at any time; no umount is needed or anything like that.
I would do this even before using cp/rsync or ddrescue, because it
increases the chance the drive can recover data from these bad sectors
and fix the other drive.

These settings are not persistent across a reboot unless you set a udev
rule or equivalent. On one of my drives that supports SCT ERC, it only
accepts the smartctl -l command to set the timeout once; I can't change
it without power cycling the drive or it just crashes (yay firmware
bugs). Just FYI, it's possible to run into other weirdness.

Last, I have no idea if the massive Btrfs write errors on sda are from
an earlier problem where the drive data or power cable got jiggled or
was otherwise absent temporarily. So depending on how the block timeout
change affects your data recovery, you might end up needing to do a
reboot to get back to a more stable state for all of this. It really
should be able to fix things *if* at least one copy can be read and
then written to the other drive.

-- 
Chris Murphy
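The "udev rule or equivalent" mentioned above could look roughly like the sketch below. This is an illustration only: the rule file name is arbitrary, the smartctl path and the `sd[a-z]` match are assumptions that need adapting to the actual system, and drives that don't support SCT ERC would instead need the 180 second kernel timeout written to sysfs.

```
# /etc/udev/rules.d/60-scterc.rules  (hypothetical file name and matches)
# Ask every SATA disk for 7.0 s error recovery control at add time.
ACTION=="add", SUBSYSTEM=="block", KERNEL=="sd[a-z]", \
  RUN+="/usr/sbin/smartctl -l scterc,70,70 /dev/%k"
```

A systemd unit or an rc.local snippet running the same smartctl command would do equally well; the point is only that the setting must be reapplied on every boot.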
Re: Kernel bug during RAID1 replace
On Mon, Jun 27, 2016 at 5:03 PM, Saint Germain wrote:
>
> Ok thanks I will begin to make an image with dd.
> Do you recommend to use sda or sdb ?

Well at the moment you're kinda stuck. I'd leave them together and just
get the data off the drive normally with cp -a (or just -r if you don't
care about permissions and other metadata like time stamps and xattr)
or rsync -a. Certainly the dying drive is being really pissy, but if
you get a bad read off one drive *maybe* it can correct off the other
drive. That's not possible if you pull one of those drives.

Also, as for imaging the drive, you probably need to use ddrescue
instead of dd. Be warned that there's a gotcha where you can corrupt
Btrfs volumes when multiple instances of the same fs UUID and dev UUID
appear at the same time to the kernel. So once you've cloned in this
manner, don't mount the volume until you hide (as in remove) one of the
copies. See "block level copies":
https://btrfs.wiki.kernel.org/index.php/Gotchas

> root@system:/# smartctl -x /dev/sda
> ID# ATTRIBUTE_NAME          FLAGS   VALUE WORST THRESH FAIL RAW_VALUE
>   1 Raw_Read_Error_Rate     POSR-K  100   100   051    -    0
>   2 Throughput_Performance  -OS--K  252   252   000    -    0
>   3 Spin_Up_Time            PO---K  091   090   025    -    2993
>   4 Start_Stop_Count        -O--CK  100   100   000    -    661
>   5 Reallocated_Sector_Ct   PO--CK  252   252   010    -    0
>   7 Seek_Error_Rate         -OSR-K  252   252   051    -    0
>   8 Seek_Time_Performance   --S--K  252   252   015    -    0
>   9 Power_On_Hours          -O--CK  100   100   000    -    1379
>  10 Spin_Retry_Count        -O--CK  252   252   051    -    0
>  12 Power_Cycle_Count       -O--CK  100   100   000    -    349
> 191 G-Sense_Error_Rate      -O---K  252   252   000    -    0
> 192 Power-Off_Retract_Count -O---K  252   252   000    -    0
> 194 Temperature_Celsius     -O      060   047   000    -    40 (Min/Max 18/53)
> 195 Hardware_ECC_Recovered  -O-RCK  100   100   000    -    0
> 196 Reallocated_Event_Count -O--CK  252   252   000    -    0
> 197 Current_Pending_Sector  -O--CK  252   252   000    -    0
> 198 Offline_Uncorrectable       CK  252   252   000    -    0
> 199 UDMA_CRC_Error_Count    -OS-CK  200   200   000    -    0
> 200 Multi_Zone_Error_Rate   -O-R-K  100   100   000    -    2
> 223 Load_Retry_Count        -O--CK  100   100   000    -    1
> 225 Load_Cycle_Count        -O--CK  099   099   000    -    10744
> 241 Total_LBAs_Written      -O--CK  095   094   000    -    7981553
> 242 Total_LBAs_Read         -O--CK  098   094   000    -    4015781

No current pending, reallocated, or uncorrected sectors. Interesting.
But this drive has piles of write errors. Why? Bad cable? That should
result in UDMA CRC errors, lots of them.

> SATA Phy Event Counters (GP Log 0x11)

No significant problems.

> root@system:/# smartctl -x /dev/sdb
>
> SMART Attributes Data Structure revision number: 16
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME          FLAGS   VALUE WORST THRESH FAIL RAW_VALUE
>   1 Raw_Read_Error_Rate     POSR-K  100   100   051    -    28
>   2 Throughput_Performance  -OS--K  252   252   000    -    0
>   3 Spin_Up_Time            PO---K  092   083   025    -    2678
>   4 Start_Stop_Count        -O--CK  100   100   000    -    575
>   5 Reallocated_Sector_Ct   PO--CK  252   252   010    -    0
>   7 Seek_Error_Rate         -OSR-K  252   252   051    -    0
>   8 Seek_Time_Performance   --S--K  252   252   015    -    0
>   9 Power_On_Hours          -O--CK  100   100   000    -    1391
>  10 Spin_Retry_Count        -O--CK  252   252   051    -    0
>  12 Power_Cycle_Count       -O--CK  100   100   000    -    371
> 191 G-Sense_Error_Rate      -O---K  252   252   000    -    0
> 192 Power-Off_Retract_Count -O---K  252   252   000    -    0
> 194 Temperature_Celsius     -O      061   047   000    -    39 (Min/Max 19/53)
> 195 Hardware_ECC_Recovered  -O-RCK  100   100   000    -    0
> 196 Reallocated_Event_Count -O--CK  252   252   000    -    0
> 197 Current_Pending_Sector  -O--CK  100   100   000    -    1
> 198 Offline_Uncorrectable       CK  252   252   000    -    0
> 199 UDMA_CRC_Error_Count    -OS-CK  200   200   000    -    0
> 200 Multi_Zone_Error_Rate   -O-R-K  100   100   000    -    3
> 223 Load_Retry_Count        -O--CK  100   100   000    -    1
> 225 Load_Cycle_Count        -O--CK  099   099   000    -    13957
> 241 Total_LBAs_Written      -O--CK  096   094   000    -    6153920
> 242 Total_LBAs_Read         -O--CK  097   094   000    -    4873960

One pending sector. Enough for a dozen scary warnings or so, but not
enough to account for as many as you have. Pretty curious.
> Error 28 [3] occurred at disk power-on lifetime: 1390 hours (57 days + 22 hours)
>   When the command that caused the error occurred, the device was active or idle.
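The imaging advice above (ddrescue rather than dd, plus the duplicate-UUID gotcha) can be sketched as a command plan. The script below is a dry run that only *prints* the commands rather than executing them; the device names and map file path are placeholders, not taken from this thread. GNU ddrescue's map file is what lets a second pass retry only the areas that failed in the first.

```shell
#!/bin/sh
# Dry-run sketch: echo the imaging commands instead of executing them.
# SRC, DST, and MAPFILE are assumptions; verify them before running for real.
SRC=/dev/sdb          # failing source disk (placeholder)
DST=/dev/sdd          # equal-or-larger target disk (placeholder)
MAPFILE=/root/sdb.map # hypothetical location for the ddrescue map file

# Pass 1: fast copy, skip scraping of bad areas (-n).
echo "ddrescue -f -n $SRC $DST $MAPFILE"
# Pass 2: retry only the bad areas recorded in the map file, 3 attempts.
echo "ddrescue -f -r3 $SRC $DST $MAPFILE"
# Before mounting either copy, make sure the kernel can see only ONE
# instance of the filesystem UUID, e.g. by physically detaching the source.
```

Printing the commands first and then executing them by hand is deliberate: a typo in the destination device would otherwise overwrite the wrong disk.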
Re: Kernel bug during RAID1 replace
On Mon, 27 Jun 2016 16:58:37 -0600, Chris Murphy wrote:

> On Mon, Jun 27, 2016 at 4:55 PM, Chris Murphy wrote:
>
> >> BTRFS info (device sdb1): dev_replace from /dev/sda1 (devid 1) to /dev/sdd1 started
> >> scrub_handle_errored_block: 166 callbacks suppressed
> >> BTRFS warning (device sdb1): checksum error at logical 93445255168 on dev /dev/sda1, sector 77669048, root 5, inode 3434831, offset 479232, length 4096, links 1 (path: user/.local/share/zeitgeist/activity.sqlite-wal)
> >> btrfs_dev_stat_print_on_error: 166 callbacks suppressed
> >> BTRFS error (device sdb1): bdev /dev/sda1 errs: wr 0, rd 0, flush 0, corrupt 14221, gen 24
> >> scrub_handle_errored_block: 166 callbacks suppressed
> >> BTRFS error (device sdb1): unable to fixup (regular) error at logical 93445255168 on dev /dev/sda1
> >
> > Shoot. You have a lot of these. It looks suspiciously like you're
> > hitting a case list regulars are only just starting to understand
>
> Forget this part completely. It doesn't affect raid1. I just re-read
> that your setup is not raid1, I don't know why I thought it was raid5.
>
> The likely issue here is that you've got legit corruptions on sda (mix
> of slow and flat out bad sectors), as well as a failing drive.
>
> This is also safe to issue:
>
> smartctl -l scterc /dev/sda
> smartctl -l scterc /dev/sdb
> cat /sys/block/sda/device/timeout
> cat /sys/block/sdb/device/timeout

My setup is indeed RAID1 (and not RAID5)

root@system:/# smartctl -l scterc /dev/sda
smartctl 6.4 2014-10-07 r4002 [x86_64-linux-4.6.0-0.bpo.1-amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

SCT Error Recovery Control:
           Read: Disabled
          Write: Disabled

root@system:/# smartctl -l scterc /dev/sdb
smartctl 6.4 2014-10-07 r4002 [x86_64-linux-4.6.0-0.bpo.1-amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

SCT Error Recovery Control:
           Read: Disabled
          Write: Disabled

root@system:/# cat /sys/block/sda/device/timeout
30
root@system:/# cat /sys/block/sdb/device/timeout
30
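The outputs above show the problematic combination discussed in this thread: SCT ERC disabled on both drives while the kernel's command timeout sits at 30 seconds. As a sketch, the two alternative fixes can be wrapped in small helpers; the functions below only *print* the commands (a dry run, so nothing is touched), and the device names are placeholders for the raid1 members.

```shell
# Print (not run) the two alternative fixes for the ERC/timeout mismatch.
print_erc_fix() {
    # Preferred when the drive supports SCT ERC:
    # 70 deciseconds = 7.0 s, well under the kernel's 30 s command timer.
    for dev in "$@"; do
        printf 'smartctl -l scterc,70,70 /dev/%s\n' "$dev"
    done
}
print_timeout_fix() {
    # Fallback when ERC is unsupported: let the kernel wait out the
    # drive's worst-case internal recovery instead of resetting the link.
    for dev in "$@"; do
        printf 'echo 180 > /sys/block/%s/device/timeout\n' "$dev"
    done
}
print_erc_fix sda sdb
print_timeout_fix sda sdb
```

Either way, the invariant to preserve is that the drive gives up (or is allowed to finish) before the kernel's command timer fires.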
Re: Kernel bug during RAID1 replace
On Mon, 27 Jun 2016 16:55:07 -0600, Chris Murphy wrote:

> On Mon, Jun 27, 2016 at 4:26 PM, Saint Germain wrote:
>
> > Thanks for your help.
> >
> > Ok here is the log from the mounting, and including btrfs replace
> > (btrfs replace start -f /dev/sda1 /dev/sdd1 /home):
> >
> > BTRFS info (device sdb1): disk space caching is enabled
> > BTRFS info (device sdb1): bdev /dev/sdb1 errs: wr 11881695, rd 12, flush 7928, corrupt 1705631, gen 1335
> > BTRFS info (device sdb1): bdev /dev/sda1 errs: wr 0, rd 0, flush 0, corrupt 14220, gen 24
>
> Eek. So sdb has 11+ million write errors, flush errors, read errors,
> and over 1 million corruptions. It's dying or dead.
>
> And sda has a dozen thousand+ corruptions. This isn't a good
> combination, as you have two devices with problems and raid5 only
> protects you from one device with problems.
>
> You were in the process of replacing sda, which is good, but it may
> not be enough...
>
> > BTRFS info (device sdb1): dev_replace from /dev/sda1 (devid 1) to /dev/sdd1 started
> > scrub_handle_errored_block: 166 callbacks suppressed
> > BTRFS warning (device sdb1): checksum error at logical 93445255168 on dev /dev/sda1, sector 77669048, root 5, inode 3434831, offset 479232, length 4096, links 1 (path: user/.local/share/zeitgeist/activity.sqlite-wal)
> > btrfs_dev_stat_print_on_error: 166 callbacks suppressed
> > BTRFS error (device sdb1): bdev /dev/sda1 errs: wr 0, rd 0, flush 0, corrupt 14221, gen 24
> > scrub_handle_errored_block: 166 callbacks suppressed
> > BTRFS error (device sdb1): unable to fixup (regular) error at logical 93445255168 on dev /dev/sda1
>
> Shoot. You have a lot of these. It looks suspiciously like you're
> hitting a case list regulars are only just starting to understand
> (somewhat) where it's possible to have a legit corrupt sector that
> Btrfs detects during scrub as wrong, fixes it from parity, but then
> occasionally wrongly overwrites the parity with bad parity. This
> doesn't cause an immediately recognizable problem. But if the volume
> becomes degraded later, Btrfs must use parity to reconstruct
> on-the-fly, and if it hits one of these bad parities, the
> reconstruction is bad, and ends up causing lots of these checksum
> errors. We can tell it's not metadata corruption because a.) there's a
> file listed as being affected and b.) the file system doesn't fail and
> go read only. But still it means those files are likely toast...
>
> [...snip many instances of checksum errors...]
>
> > BTRFS error (device sdb1): bdev /dev/sda1 errs: wr 0, rd 0, flush 0, corrupt 16217, gen 24
> > ata2.00: exception Emask 0x0 SAct 0x4000 SErr 0x0 action 0x0
> > ata2.00: irq_stat 0x4008
> > ata2.00: failed command: READ FPDMA QUEUED
> > ata2.00: cmd 60/08:70:08:d8:70/00:00:0f:00:00/40 tag 14 ncq 4096 in
> >          res 41/40:00:08:d8:70/00:00:0f:00:00/40 Emask 0x409 (media error)
> > ata2.00: status: { DRDY ERR }
> > ata2.00: error: { UNC }
> > ata2.00: configured for UDMA/133
> > sd 1:0:0:0: [sdb] tag#14 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
> > sd 1:0:0:0: [sdb] tag#14 Sense Key : Medium Error [current] [descriptor]
> > sd 1:0:0:0: [sdb] tag#14 Add. Sense: Unrecovered read error - auto reallocate failed
> > sd 1:0:0:0: [sdb] tag#14 CDB: Read(10) 28 00 0f 70 d8 08 00 00 08 00
> > blk_update_request: I/O error, dev sdb, sector 259053576
>
> OK yeah so bad sector on sdb. So you have two failures, because sda is
> already giving you trouble while being replaced, and on top of it you
> now get a 2nd (partial) failure via bad sectors.
>
> So rather urgently I think you need to copy things off this volume if
> you don't already have a backup, so you can save as much as possible.
> Don't write to the drives. You might even consider 'mount -o
> remount,ro' to avoid anything writing to the volume. Copy the most
> important data first, triage time.
>
> While that happens you can safely collect some more information:
>
> btrfs fi us
> smartctl -x   ## for both drives

Ok thanks I will begin to make an image with dd.
Do you recommend to use sda or sdb ?

In the meantime here are the info requested:

btrfs fi us /home
Overall:
    Device size:                   3.63TiB
    Device allocated:              2.76TiB
    Device unallocated:          888.51GiB
    Device missing:                  0.00B
    Used:                          2.62TiB
    Free (estimated):            517.56GiB  (min: 517.56GiB)
    Data ratio:                       2.00
    Metadata ratio:                   2.00
    Global reserve:              512.00MiB  (used: 0.00B)

Data,RAID1: Size:1.38TiB, Used:1.31TiB
   /dev/sda1       1.38TiB
   /dev/sdb1       1.38TiB

Metadata,RAID1: Size:5.00GiB, Used:3.15GiB
   /dev/sda1       5.00GiB
   /dev/sdb1       5.00GiB

System,RAID1: Size:64.00MiB, Used:216.00KiB
   /dev/sda1      64.00MiB
   /dev/sdb1      64.00MiB

Unallocated:
Re: Kernel bug during RAID1 replace
On Mon, Jun 27, 2016 at 4:55 PM, Chris Murphy wrote:

>> BTRFS info (device sdb1): dev_replace from /dev/sda1 (devid 1) to /dev/sdd1 started
>> scrub_handle_errored_block: 166 callbacks suppressed
>> BTRFS warning (device sdb1): checksum error at logical 93445255168 on dev /dev/sda1, sector 77669048, root 5, inode 3434831, offset 479232, length 4096, links 1 (path: user/.local/share/zeitgeist/activity.sqlite-wal)
>> btrfs_dev_stat_print_on_error: 166 callbacks suppressed
>> BTRFS error (device sdb1): bdev /dev/sda1 errs: wr 0, rd 0, flush 0, corrupt 14221, gen 24
>> scrub_handle_errored_block: 166 callbacks suppressed
>> BTRFS error (device sdb1): unable to fixup (regular) error at logical 93445255168 on dev /dev/sda1
>
> Shoot. You have a lot of these. It looks suspiciously like you're
> hitting a case list regulars are only just starting to understand

Forget this part completely. It doesn't affect raid1. I just re-read
that your setup is not raid1, I don't know why I thought it was raid5.

The likely issue here is that you've got legit corruptions on sda (mix
of slow and flat out bad sectors), as well as a failing drive.

This is also safe to issue:

smartctl -l scterc /dev/sda
smartctl -l scterc /dev/sdb
cat /sys/block/sda/device/timeout
cat /sys/block/sdb/device/timeout

-- 
Chris Murphy
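The four checks above are read-only, so they can be wrapped in a small survey loop that is safe to run at any time. This is a sketch; it assumes SATA/SCSI disks named sd*, tolerates systems where none exist, and only consults smartctl when it is installed (the ERC query may need root).

```shell
# Read-only survey: kernel command timeout (and SCT ERC state, when
# smartctl is available) for every sd* disk the kernel knows about.
for t in /sys/block/sd*/device/timeout; do
    [ -e "$t" ] || continue                  # glob didn't match: no sd* disks
    dev=${t#/sys/block/}; dev=${dev%%/*}     # /sys/block/sda/... -> sda
    printf '%s: kernel command timeout %ss\n' "$dev" "$(cat "$t")"
    if command -v smartctl >/dev/null 2>&1; then
        smartctl -l scterc "/dev/$dev" || true   # harmless read; needs root
    fi
done
```

Since nothing here writes to a device or to sysfs, it is reasonable to run on a degraded array while deciding what to do next.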
Re: Kernel bug during RAID1 replace
On Mon, Jun 27, 2016 at 4:26 PM, Saint Germain wrote:
>
> Thanks for your help.
>
> Ok here is the log from the mounting, and including btrfs replace
> (btrfs replace start -f /dev/sda1 /dev/sdd1 /home):
>
> BTRFS info (device sdb1): disk space caching is enabled
> BTRFS info (device sdb1): bdev /dev/sdb1 errs: wr 11881695, rd 12, flush 7928, corrupt 1705631, gen 1335
> BTRFS info (device sdb1): bdev /dev/sda1 errs: wr 0, rd 0, flush 0, corrupt 14220, gen 24

Eek. So sdb has 11+ million write errors, flush errors, read errors,
and over 1 million corruptions. It's dying or dead.

And sda has a dozen thousand+ corruptions. This isn't a good
combination, as you have two devices with problems and raid5 only
protects you from one device with problems.

You were in the process of replacing sda, which is good, but it may
not be enough...

> BTRFS info (device sdb1): dev_replace from /dev/sda1 (devid 1) to /dev/sdd1 started
> scrub_handle_errored_block: 166 callbacks suppressed
> BTRFS warning (device sdb1): checksum error at logical 93445255168 on dev /dev/sda1, sector 77669048, root 5, inode 3434831, offset 479232, length 4096, links 1 (path: user/.local/share/zeitgeist/activity.sqlite-wal)
> btrfs_dev_stat_print_on_error: 166 callbacks suppressed
> BTRFS error (device sdb1): bdev /dev/sda1 errs: wr 0, rd 0, flush 0, corrupt 14221, gen 24
> scrub_handle_errored_block: 166 callbacks suppressed
> BTRFS error (device sdb1): unable to fixup (regular) error at logical 93445255168 on dev /dev/sda1

Shoot. You have a lot of these. It looks suspiciously like you're
hitting a case list regulars are only just starting to understand
(somewhat) where it's possible to have a legit corrupt sector that
Btrfs detects during scrub as wrong, fixes it from parity, but then
occasionally wrongly overwrites the parity with bad parity. This
doesn't cause an immediately recognizable problem. But if the volume
becomes degraded later, Btrfs must use parity to reconstruct
on-the-fly, and if it hits one of these bad parities, the
reconstruction is bad, and ends up causing lots of these checksum
errors. We can tell it's not metadata corruption because a.) there's a
file listed as being affected and b.) the file system doesn't fail and
go read only. But still it means those files are likely toast...

[...snip many instances of checksum errors...]

> BTRFS error (device sdb1): bdev /dev/sda1 errs: wr 0, rd 0, flush 0, corrupt 16217, gen 24
> ata2.00: exception Emask 0x0 SAct 0x4000 SErr 0x0 action 0x0
> ata2.00: irq_stat 0x4008
> ata2.00: failed command: READ FPDMA QUEUED
> ata2.00: cmd 60/08:70:08:d8:70/00:00:0f:00:00/40 tag 14 ncq 4096 in
>          res 41/40:00:08:d8:70/00:00:0f:00:00/40 Emask 0x409 (media error)
> ata2.00: status: { DRDY ERR }
> ata2.00: error: { UNC }
> ata2.00: configured for UDMA/133
> sd 1:0:0:0: [sdb] tag#14 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
> sd 1:0:0:0: [sdb] tag#14 Sense Key : Medium Error [current] [descriptor]
> sd 1:0:0:0: [sdb] tag#14 Add. Sense: Unrecovered read error - auto reallocate failed
> sd 1:0:0:0: [sdb] tag#14 CDB: Read(10) 28 00 0f 70 d8 08 00 00 08 00
> blk_update_request: I/O error, dev sdb, sector 259053576

OK yeah so bad sector on sdb. So you have two failures, because sda is
already giving you trouble while being replaced, and on top of it you
now get a 2nd (partial) failure via bad sectors.

So rather urgently I think you need to copy things off this volume if
you don't already have a backup, so you can save as much as possible.
Don't write to the drives. You might even consider 'mount -o
remount,ro' to avoid anything writing to the volume. Copy the most
important data first, triage time.

While that happens you can safely collect some more information:

btrfs fi us
smartctl -x   ## for both drives

-- 
Chris Murphy
Re: Adventures in btrfs raid5 disk recovery
On Mon, Jun 27, 2016 at 3:57 PM, Zygo Blaxell wrote:
> On Mon, Jun 27, 2016 at 10:17:04AM -0600, Chris Murphy wrote:
>
>> It just came up again in a thread over the weekend on linux-raid@. I'm
>> going to ask while people are paying attention if a patch to change
>> the 30 second time out to something a lot higher has ever been
>> floated, what the negatives might be, and where to get this fixed if
>> it wouldn't be accepted in the kernel code directly.
>
> Defaults are defaults, they're not for everyone. 30 seconds is about
> two minutes too short for an SMR drive's worst-case write latency, or
> 28 seconds too long for an OLTP system, or just right for an end-user's
> personal machine with a low-energy desktop drive and a long spin-up time.

The question is where the correct place is to change the default so it
broadly captures most use cases, because the current one is definitely
incompatible with consumer SATA drives, whether in an enclosure or not.
Maybe it's with the kernel teams at each distribution? Or maybe an
upstream udev rule? In any case something needs to give here, because
it's been years of bugging users about this misconfiguration and people
constantly run into it, which means user education is not working.

> Once a drive starts taking 30+ seconds to do I/O, I consider the drive
> failed in the sense that it's too slow to meet latency requirements.

Well that is then a mismatch between use case and the drive purchasing
decision. Consumer drives do this. It's how they're designed to work.

> When the problem is that it's already taking too long, the solution is
> not waiting even longer. To put things in perspective, consider that
> server hardware watchdog timeouts are typically 60 seconds by default
> (if not maximum).

If you want the data retrieved from that particular device, the only
solution is waiting longer. The alternative is what you get: an IO
error (well, actually a link reset, which also means the entire command
queue is purged on SATA drives).

> If anything, I want the timeout to be shorter so that upper layers with
> redundancy can get an EIO and initiate repair promptly, and admins can
> get notified to evict chronic offenders from their drive slots, without
> having to pay extra for hard disk firmware with that feature.

The drive totally thwarts this. It doesn't report back to the kernel
which command is hung, as far as I'm aware. It just hangs and goes into
a so-called "deep recovery"; there is no way to know what sector is
causing the problem until the drive reports a read error, which will
include the affected sector's LBA.

Btrfs does have something of a workaround for when things get slow, and
that's balance: read and rewrite everything. The write forces sector
remapping by the drive firmware for bad sectors.

>> *Ideally* I think we'd want two timeouts. I'd like to see commands
>> have a timer that results in merely a warning that could be used by
>> e.g. btrfs scrub to know "hey this sector range is 'slow' I'm going to
>> write over those sectors". That's how bad sectors start out, they read
>> slower and eventually go beyond 30 seconds and now it's all link
>> resets. If the problem could be fixed before then... that's the best
>> scenario.
>
> What's the downside of a link reset? Can the driver not just return
> EIO for all the outstanding IOs in progress at reset, and let the upper
> layers deal with it? Or is the problem that the upper layers are all
> horribly broken by EIOs, or drive firmware horribly broken by link resets?

Link reset clears the entire command queue on SATA drives, and it wipes
away any possibility of finding out what LBA, or even a range of LBAs,
is the source of the stall. So it pretty much gets you nothing.

> The upper layers could time the IOs, and make their own decisions based
> on the timing (e.g. btrfs or mdadm could proactively repair anything that
> took more than 10 seconds to read). That might be a better approach,
> since shortening the time to an EIO is only useful when you have a
> redundancy layer in place to do something about them.

For RAID with redundancy, that's doable, although I have no idea what
work is needed, or even if it's possible, to track commands in this
manner and fall back to some kind of repair mode as if it were a read
error. For single drives and RAID 0, the only possible solution is to
not do link resets for up to 3 minutes and hope the drive returns the
single copy of data. Even in the case of Btrfs DUP, it's thwarted
without a read error reported from the drive (or it returning bad data).

-- 
Chris Murphy
Re: Strange behavior when replacing device on BTRFS RAID 5 array.
On 28/06/16 03:46, Chris Murphy wrote:
> On Mon, Jun 27, 2016 at 11:29 AM, Chris Murphy wrote:
>
>> Next is to decide to what degree you want to salvage this volume and
>> keep using Btrfs raid56 despite the risks
>
> Forgot to complete this thought. So if you get a backup, and decide
> you want to fix it, I would see if you can cancel the replace using
> "btrfs replace cancel " and confirm that it stops. And now is the
> risky part: whether to try "btrfs add" and then "btrfs remove", or to
> remove the bad drive, reboot, and see if it'll mount with -o degraded,
> and then use add and remove (in which case you'll use 'remove missing').
>
> With the first, you risk Btrfs still using the flaky bad drive.
>
> With the second, you risk whether a degraded mount will work, and
> whether any other drive in the array has a problem while degraded
> (like an unrecoverable read error from a single sector).

This is the exact set of circumstances that caused my corrupt array. I
was using RAID6, yet it still corrupted large portions of things. In
theory, due to having double parity, it should still have survived even
if a disk did go bad, but there we are. I first started a replace,
noted how slow it was going, cancelled the replace, then did an add /
delete. The system crashed and it was all over.

Just as another data point, I've been flogging the guts out of the
array with mdadm RAID6 doing a reshape of it, and no read errors,
system crashes or other problems in over 48 hours.

-- 
Steven Haigh
Email: net...@crc.id.au
Web: https://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897
Re: Kernel bug during RAID1 replace
On Mon, 27 Jun 2016 15:42:42 -0600, Chris Murphy wrote:

> On Mon, Jun 27, 2016 at 3:36 PM, Saint Germain wrote:
> > Hello,
> >
> > I am on Debian Jessie with a kernel from backports:
> > 4.6.0-0.bpo.1-amd64
> >
> > I am also using btrfs-tools 4.4.1-1.1~bpo8+1
> >
> > When trying to replace a RAID1 drive (with btrfs replace start
> > -f /dev/sda1 /dev/sdd1), the operation is cancelled after completing
> > only 5%.
> >
> > I got this error in the /var/log/syslog:
> > [ cut here ]
> > WARNING: CPU: 2 PID: 2617 at /build/linux-9LouV5/linux-4.6.1/fs/btrfs/dev-replace.c:430 btrfs_dev_replace_start+0x2be/0x400 [btrfs]
> > Modules linked in: uas(E) usb_storage(E) bnep(E) ftdi_sio(E) usbserial(E) snd_hda_codec_hdmi(E) nls_utf8(E) nls_cp437(E) vfat(E) fat(E) intel_rapl(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) coretemp(E) kvm_intel(E) kvm(E) iTCO_wdt(E) irqbypass(E) iTCO_vendor_support(E) crct10dif_pclmul(E) crc32_pclmul(E) ghash_clmulni_intel(E) hmac(E) drbg(E) ansi_cprng(E) aesni_intel(E) aes_x86_64(E) lrw(E) gf128mul(E) glue_helper(E) ablk_helper(E) cryptd(E) wl(POE) btusb(E) btrtl(E) btbcm(E) btintel(E) cfg80211(E) bluetooth(E) efi_pstore(E) snd_hda_codec_realtek(E) evdev(E) crc16(E) serio_raw(E) pcspkr(E) efivars(E) joydev(E) snd_hda_codec_generic(E) rfkill(E) snd_hda_intel(E) nuvoton_cir(E) rc_core(E) snd_hda_codec(E) i915(E) battery(E) snd_hda_core(E) snd_hwdep(E) soc_button_array(E) tpm_tis(E) drm_kms_helper(E) intel_smartconnect(E) snd_pcm(E) tpm(E) video(E) i2c_i801(E) snd_timer(E) drm(E) snd(E) lpc_ich(E) i2c_algo_bit(E) soundcore(E) mfd_core(E) mei_me(E) processor(E) button(E) mei(E) shpchp(E) fuse(E) autofs4(E) hid_logitech_hidpp(E) btrfs(E) hid_logitech_dj(E) usbhid(E) hid(E) xor(E) raid6_pq(E) sg(E) sr_mod(E) cdrom(E) sd_mod(E) crc32c_intel(E) ahci(E) libahci(E) libata(E) psmouse(E) scsi_mod(E) xhci_pci(E) ehci_pci(E) xhci_hcd(E) ehci_hcd(E) e1000e(E) usbcore(E) ptp(E) pps_core(E) usb_common(E) fjes(E)
> > CPU: 2 PID: 2617 Comm: btrfs Tainted: P OE 4.6.0-0.bpo.1-amd64 #1 Debian 4.6.1-1~bpo8+1
> > Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z87E-ITX, BIOS P2.10 10/04/2013
> > 0286 f0ba7fe7 813123c5 8107af94 880186caf000 fffb 8800c76b0800 8800cae7 8800cae70ee0 7ffdd5397d98
> > Call Trace:
> > [] ? dump_stack+0x5c/0x77
> > [] ? __warn+0xc4/0xe0
> > [] ? btrfs_dev_replace_start+0x2be/0x400 [btrfs]
> > [] ? btrfs_ioctl+0x1d42/0x2190 [btrfs]
> > [] ? handle_mm_fault+0x154d/0x1cb0
> > [] ? do_vfs_ioctl+0x99/0x5d0
> > [] ? SyS_ioctl+0x76/0x90
> > [] ? system_call_fast_compare_end+0xc/0x96
> > ---[ end trace 9fbfaa137cc5a72a ]---
> >
> > What should I do to replace correctly my drive ?
>
> I don't often see handle_mm_fault with btrfs problems, maybe the
> entire dmesg from mounting the fs and including btrfs replace would
> reveal a related problem that instigates the failure?
>
> If the device being replaced is acting unreliably, then you'd want to
> use -r with replace to ignore that device unless it's absolutely
> necessary to read from it.

Thanks for your help.

Ok here is the log from the mounting, and including btrfs replace
(btrfs replace start -f /dev/sda1 /dev/sdd1 /home):

BTRFS info (device sdb1): disk space caching is enabled
BTRFS info (device sdb1): bdev /dev/sdb1 errs: wr 11881695, rd 12, flush 7928, corrupt 1705631, gen 1335
BTRFS info (device sdb1): bdev /dev/sda1 errs: wr 0, rd 0, flush 0, corrupt 14220, gen 24
BTRFS info (device sdb1): dev_replace from /dev/sda1 (devid 1) to /dev/sdd1 started
scrub_handle_errored_block: 166 callbacks suppressed
BTRFS warning (device sdb1): checksum error at logical 93445255168 on dev /dev/sda1, sector 77669048, root 5, inode 3434831, offset 479232, length 4096, links 1 (path: user/.local/share/zeitgeist/activity.sqlite-wal)
btrfs_dev_stat_print_on_error: 166 callbacks suppressed
BTRFS error (device sdb1): bdev /dev/sda1 errs: wr 0, rd 0, flush 0, corrupt 14221, gen 24
scrub_handle_errored_block: 166 callbacks suppressed
BTRFS error (device sdb1): unable to fixup (regular) error at logical 93445255168 on dev /dev/sda1
BTRFS error (device sdb1): bdev /dev/sda1 errs: wr 0, rd 0, flush 0, corrupt 14222, gen 24
BTRFS error (device sdb1): unable to fixup (regular) error at logical 93445259264 on dev /dev/sda1
BTRFS warning (device sdb1): checksum error at logical 136349810688 on dev /dev/sda1, sector 140429952, root 5, inode 4265283, offset 0, length 4096, links 1 (path: user/Pictures/Picture-42-2.jpg)
BTRFS error (device sdb1): bdev /dev/sda1 errs: wr 0, rd 0, flush 0, corrupt 14223, gen 24
BTRFS error (device sdb1): unable to fixup (regular) error at logical 136349810688 on dev /dev/sda1
BTRFS warning (device sdb1): checksum
Re: Adventures in btrfs raid5 disk recovery
On Mon, Jun 27, 2016 at 10:17:04AM -0600, Chris Murphy wrote:
> On Mon, Jun 27, 2016 at 5:21 AM, Austin S. Hemmelgarn wrote:
> > On 2016-06-25 12:44, Chris Murphy wrote:
> >> On Fri, Jun 24, 2016 at 12:19 PM, Austin S. Hemmelgarn wrote:
> >>
> >> OK but hold on. During scrub, it should read data, compute checksums
> >> *and* parity, and compare those to what's on-disk - EXTENT_CSUM in
> >> the checksum tree, and the parity strip in the chunk tree. And if
> >> parity is wrong, then it should be replaced.
> >
> > Except that's horribly inefficient. With limited exceptions involving
> > highly situational co-processors, computing a checksum of a parity
> > block is always going to be faster than computing parity for the
> > stripe. By using that to check parity, we can safely speed up the
> > common case of near zero errors during a scrub by a pretty
> > significant factor.
>
> OK I'm in favor of that. Although somehow md gets away with this by
> computing and checking parity for its scrubs, and still manages to
> keep drives saturated in the process - at least HDDs, I'm not sure how
> it fares on SSDs.

A modest desktop CPU can compute raid6 parity at 6GB/sec, a less-modest
one at more than 10GB/sec. Maybe a bottleneck is within reach of an
array of SSDs vs. a slow CPU.

> It just came up again in a thread over the weekend on linux-raid@. I'm
> going to ask while people are paying attention if a patch to change
> the 30 second time out to something a lot higher has ever been
> floated, what the negatives might be, and where to get this fixed if
> it wouldn't be accepted in the kernel code directly.

Defaults are defaults, they're not for everyone. 30 seconds is about
two minutes too short for an SMR drive's worst-case write latency, or
28 seconds too long for an OLTP system, or just right for an end-user's
personal machine with a low-energy desktop drive and a long spin-up time.

Once a drive starts taking 30+ seconds to do I/O, I consider the drive
failed in the sense that it's too slow to meet latency requirements.
When the problem is that it's already taking too long, the solution is
not waiting even longer. To put things in perspective, consider that
server hardware watchdog timeouts are typically 60 seconds by default
(if not maximum).

If anything, I want the timeout to be shorter so that upper layers with
redundancy can get an EIO and initiate repair promptly, and admins can
get notified to evict chronic offenders from their drive slots, without
having to pay extra for hard disk firmware with that feature.

> *Ideally* I think we'd want two timeouts. I'd like to see commands
> have a timer that results in merely a warning that could be used by
> e.g. btrfs scrub to know "hey this sector range is 'slow' I'm going to
> write over those sectors". That's how bad sectors start out, they read
> slower and eventually go beyond 30 seconds and now it's all link
> resets. If the problem could be fixed before then... that's the best
> scenario.

What's the downside of a link reset? Can the driver not just return
EIO for all the outstanding IOs in progress at reset, and let the upper
layers deal with it? Or is the problem that the upper layers are all
horribly broken by EIOs, or drive firmware horribly broken by link resets?

The upper layers could time the IOs, and make their own decisions based
on the timing (e.g. btrfs or mdadm could proactively repair anything that
took more than 10 seconds to read). That might be a better approach,
since shortening the time to an EIO is only useful when you have a
redundancy layer in place to do something about them.

> The 2nd timer would be, OK the controller or drive just face planted, reset.
>
> --
> Chris Murphy
Re: Btrfs full balance command fails due to ENOSPC (bug 121071)
Hi! On 06/27/2016 11:26 PM, Henk Slager wrote: btrfs-debug does not show metadata and system chunks; the balancing problem might come from those. This script does show all chunks: https://github.com/knorrie/btrfs-heatmap/blob/master/show_usage.py Since the creation of python-btrfs, it has gathered a few useful example scripts: git clone https://github.com/knorrie/python-btrfs cd python-btrfs/examples/ (get root prompt) ./show_usage.py /mountpoint <- view sorted by 'virtual' address space ./show_dev_extents.py /mountpoint <- view sorted by physical layout The show_usage in the btrfs-heatmap repo is almost gone. I'm currently replacing all the proof of concept playing around stuff in there with dedicated png-creation code that uses the python-btrfs lib. So, it's better to refer to the examples in python-btrfs instead. Or, write some others to create overviews yourself, it gets easier every day. :D Have fun, -- Hans van Kranenburg - System / Network Engineer T +31 (0)10 2760434 | hans.van.kranenb...@mendix.com | www.mendix.com -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Kernel bug during RAID1 replace
On Mon, Jun 27, 2016 at 3:36 PM, Saint Germainwrote: > Hello, > > I am on Debian Jessie with a kernel from backports: > 4.6.0-0.bpo.1-amd64 > > I am also using btrfs-tools 4.4.1-1.1~bpo8+1 > > When trying to replace a RAID1 drive (with btrfs replace start > -f /dev/sda1 /dev/sdd1), the operation is cancelled after completing > only 5%. > > I got this error in the /var/log/syslog: > [ cut here ] > WARNING: CPU: 2 PID: 2617 at > /build/linux-9LouV5/linux-4.6.1/fs/btrfs/dev-replace.c:430 > btrfs_dev_replace_start+0x2be/0x400 [btrfs] > Modules linked in: uas(E) usb_storage(E) bnep(E) ftdi_sio(E) usbserial(E) > snd_hda_codec_hdmi(E) nls_utf8(E) nls_cp437(E) vfat(E) fat(E) intel_rapl(E) > x86_pkg_temp_thermal(E) intel_powerclamp(E) coretemp(E) kvm_intel(E) kvm(E) > iTCO_wdt(E) irqbypass(E) iTCO_vendor_support(E) crct10dif_pclmul(E) > crc32_pclmul(E) ghash_clmulni_intel(E) hmac(E) drbg(E) ansi_cprng(E) > aesni_intel(E) aes_x86_64(E) lrw(E) gf128mul(E) glue_helper(E) ablk_helper(E) > cryptd(E) wl(POE) btusb(E) btrtl(E) btbcm(E) btintel(E) cfg80211(E) > bluetooth(E) efi_pstore(E) snd_hda_codec_realtek(E) evdev(E) crc16(E) > serio_raw(E) pcspkr(E) efivars(E) joydev(E) snd_hda_codec_generic(E) > rfkill(E) snd_hda_intel(E) nuvoton_cir(E) rc_core(E) snd_hda_codec(E) i915(E) > battery(E) snd_hda_core(E) snd_hwdep(E) soc_button_array(E) tpm_tis(E) > drm_kms_helper(E) intel_smartconnect(E) snd_pcm(E) tpm(E) video(E) > i2c_i801(E) snd_timer(E) drm(E) snd(E) lpc_ich(E) i2c_algo_bit(E) > soundcore(E) mfd_core(E) mei_me(E) processor(E) button(E) mei(E) shpchp(E) > fuse(E) autofs4(E) hid_logitech_hidpp(E) btrfs(E) hid_logitech_dj(E) > usbhid(E) hid(E) xor(E) raid6_pq(E) sg(E) sr_mod(E) cdrom(E) sd_mod(E) > crc32c_intel(E) ahci(E) libahci(E) libata(E) psmouse(E) scsi_mod(E) > xhci_pci(E) ehci_pci(E) xhci_hcd(E) ehci_hcd(E) e1000e(E) usbcore(E) ptp(E) > pps_core(E) usb_common(E) fjes(E) > CPU: 2 PID: 2617 Comm: btrfs Tainted: P OE 4.6.0-0.bpo.1-amd64 #1 > Debian 4.6.1-1~bpo8+1 > Hardware 
name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z87E-ITX, BIOS > P2.10 10/04/2013 > 0286 f0ba7fe7 813123c5 > 8107af94 880186caf000 fffb > 8800c76b0800 8800cae7 8800cae70ee0 7ffdd5397d98 > Call Trace: > [] ? dump_stack+0x5c/0x77 > [] ? __warn+0xc4/0xe0 > [] ? btrfs_dev_replace_start+0x2be/0x400 [btrfs] > [] ? btrfs_ioctl+0x1d42/0x2190 [btrfs] > [] ? handle_mm_fault+0x154d/0x1cb0 > [] ? do_vfs_ioctl+0x99/0x5d0 > [] ? SyS_ioctl+0x76/0x90 > [] ? system_call_fast_compare_end+0xc/0x96 > ---[ end trace 9fbfaa137cc5a72a ]--- > > > > What should I do to replace correctly my drive ? I don't often see handle_mm_fault with btrfs problems, maybe the entire dmesg from mounting the fs and including btrfs replace would reveal a related problem that instigates the failure? If the device being replaced is acting unreliably, then you'd want to use -r with replace to ignore that device unless it's absolutely necessary to read from it. -- Chris Murphy
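[For the record, the -r variant suggested above would look roughly like this. Device names are taken from the report, and the mount point is a stand-in for the actual one:

```shell
# -r: avoid reading from the (possibly failing) source device unless
#     the data cannot be reconstructed from the other RAID1 copy
# -f: forcibly overwrite any existing filesystem on the target
btrfs replace start -r -f /dev/sda1 /dev/sdd1 /mnt

# Poll progress from another shell
btrfs replace status /mnt
```

With -r the replace is fed from the surviving mirror, so a source drive that resets or returns bad data is largely taken out of the picture.]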
Kernel bug during RAID1 replace
Hello, I am on Debian Jessie with a kernel from backports: 4.6.0-0.bpo.1-amd64 I am also using btrfs-tools 4.4.1-1.1~bpo8+1 When trying to replace a RAID1 drive (with btrfs replace start -f /dev/sda1 /dev/sdd1), the operation is cancelled after completing only 5%. I got this error in the /var/log/syslog: [ cut here ] WARNING: CPU: 2 PID: 2617 at /build/linux-9LouV5/linux-4.6.1/fs/btrfs/dev-replace.c:430 btrfs_dev_replace_start+0x2be/0x400 [btrfs] Modules linked in: uas(E) usb_storage(E) bnep(E) ftdi_sio(E) usbserial(E) snd_hda_codec_hdmi(E) nls_utf8(E) nls_cp437(E) vfat(E) fat(E) intel_rapl(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) coretemp(E) kvm_intel(E) kvm(E) iTCO_wdt(E) irqbypass(E) iTCO_vendor_support(E) crct10dif_pclmul(E) crc32_pclmul(E) ghash_clmulni_intel(E) hmac(E) drbg(E) ansi_cprng(E) aesni_intel(E) aes_x86_64(E) lrw(E) gf128mul(E) glue_helper(E) ablk_helper(E) cryptd(E) wl(POE) btusb(E) btrtl(E) btbcm(E) btintel(E) cfg80211(E) bluetooth(E) efi_pstore(E) snd_hda_codec_realtek(E) evdev(E) crc16(E) serio_raw(E) pcspkr(E) efivars(E) joydev(E) snd_hda_codec_generic(E) rfkill(E) snd_hda_intel(E) nuvoton_cir(E) rc_core(E) snd_hda_codec(E) i915(E) battery(E) snd_hda_core(E) snd_hwdep(E) soc_button_array(E) tpm_tis(E) drm_kms_helper(E) intel_smartconnect(E) snd_pcm(E) tpm(E) video(E) i2c_i801(E) snd_timer(E) drm(E) snd(E) lpc_ich(E) i2c_algo_bit(E) soundcore(E) mfd_core(E) mei_me(E) processor(E) button(E) mei(E) shp chp(E) fuse(E) autofs4(E) hid_logitech_hidpp(E) btrfs(E) hid_logitech_dj(E) usbhid(E) hid(E) xor(E) raid6_pq(E) sg(E) sr_mod(E) cdrom(E) sd_mod(E) crc32c_intel(E) ahci(E) libahci(E) libata(E) psmouse(E) scsi_mod(E) xhci_pci(E) ehci_pci(E) xhci_hcd(E) ehci_hcd(E) e1000e(E) usbcore(E) ptp(E) pps_core(E) usb_common(E) fjes(E) CPU: 2 PID: 2617 Comm: btrfs Tainted: P OE 4.6.0-0.bpo.1-amd64 #1 Debian 4.6.1-1~bpo8+1 Hardware name: To Be Filled By O.E.M. 
To Be Filled By O.E.M./Z87E-ITX, BIOS P2.10 10/04/2013 0286 f0ba7fe7 813123c5 8107af94 880186caf000 fffb 8800c76b0800 8800cae7 8800cae70ee0 7ffdd5397d98 Call Trace: [] ? dump_stack+0x5c/0x77 [] ? __warn+0xc4/0xe0 [] ? btrfs_dev_replace_start+0x2be/0x400 [btrfs] [] ? btrfs_ioctl+0x1d42/0x2190 [btrfs] [] ? handle_mm_fault+0x154d/0x1cb0 [] ? do_vfs_ioctl+0x99/0x5d0 [] ? SyS_ioctl+0x76/0x90 [] ? system_call_fast_compare_end+0xc/0x96 ---[ end trace 9fbfaa137cc5a72a ]--- What should I do to replace correctly my drive ? Thanks in advance,
Re: Btrfs full balance command fails due to ENOSPC (bug 121071)
On Mon, Jun 27, 2016 at 9:24 PM, Chris Murphy wrote: > On Mon, Jun 27, 2016 at 12:32 PM, Francesco Turco wrote: >> On 2016-06-27 20:18, Chris Murphy wrote: >>> If you can grab btrfs-debugfs from >>> https://github.com/kdave/btrfs-progs/blob/master/btrfs-debugfs >>> >>> And then attach the output to the bug report it might be useful for a >>> developer. But really your case is an odd duck, because there's fully >>> 14GiB unallocated, so it should be able to create a new one without >>> problem. >>> >>> $ sudo ./btrfs-debugfs -b / >> >> Done! Thank you, I was not aware of the existence of btrfs-debug... > > I'm not certain what the "1 enospc errors during balance" refers to. > That message happens several times, the balance operation isn't > aborted, and doesn't come with any call traces (those appear later). > Further, the btrfs-debugfs output suggests the balance worked - each > bg is contiguously located after the last and they're all new bg > offset values compared to what's found in the dmesg. btrfs-debug does not show metadata and system chunks; the balancing problem might come from those. This script does show all chunks: https://github.com/knorrie/btrfs-heatmap/blob/master/show_usage.py You might want to use vrange or drange balance filters so that you can just target a certain chunk and maybe that gives a hint where the problem might be. But anyhow, the behavior experienced is a bug. > This might be that obscure -28 enospc bug that affects some file > systems and hasn't been tracked down yet. If I recall correctly it's a > misleading error, and the only work around to get rid of it is migrate > to a new Btrfs file system. I don't think the file system is at any > risk in the current state, but I'm not certain as it's already an edge > case. I'd just make sure you keep suitably current backups and keep > using it.
> -- > Chris Murphy
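[The vrange/drange filters mentioned above let a balance target a single chunk instead of the whole filesystem. A sketch — the byte offsets are placeholders; take real ones from the show_usage.py or btrfs-debugfs output:

```shell
# Balance only the data chunk whose logical (virtual) start address is
# 252707274752: a one-byte range that overlaps exactly that chunk
btrfs balance start -dvrange=252707274752..252707274753 /mountpoint

# Same idea for a metadata chunk, keyed by its physical on-device
# offset instead of the logical address
btrfs balance start -mdrange=1104150528..1104150529 /mountpoint
```

A chunk is selected if it overlaps the given range at all, so the start..start+1 trick picks out exactly one chunk; that makes it possible to bisect which chunk triggers the ENOSPC.]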
Re: Strange behavior when replacing device on BTRFS RAID 5 array.
Nick Austin posted on Sun, 26 Jun 2016 20:57:32 -0700 as excerpted: > I have a 4 device BTRFS RAID 5 filesystem. > > One of the device members of this file system (sdr) had badblocks, so I > decided to replace it. While the others answered the direct question, there's something potentially more urgent... Btrfs raid56 mode has some fundamentally serious bugs as currently implemented, that we are just now finding out how serious they potentially are. For the details you can read the other active threads from the last week or so, but the important thing is that... For the time being, raid56 mode is not to be trusted repairable and as a result is now highly negative-recommended. Unless you are using pure testing data that you don't care about whether it lives or dies (either because it literally /is/ that trivial, or because you have tested backups, /making/ it that trivial), I'd urgently recommend either getting your data off it ASAP, or rebalancing to redundant-raid, raid1 or raid10, instead of parity-raid (5/6), before something worse happens and you no longer can. Raid1 mode is a reasonable alternative, as long as your data fits in the available space. Keeping in mind that btrfs raid1 is always two copies, with more than two devices upping the capacity, not the redundancy, 3 5.46 TB devices = 8.19 TB usable space. Given your 8+ TiB of data usage, plus metadata and system, that's unlikely to fit unless you delete some stuff (older snapshots probably, if you have them). So you'll need to keep it to four devices of that size. Btrfs raid10 is also considered as stable as btrfs in general, and would be doable with 4+ devices, but for various reasons I'll skip for brevity here (ask if you want them detailed), I'd recommend staying with btrfs raid1. Or switch to md- or dm-raid1. Or one other interesting alternative, a pair of md- or dm-raid0s, on top of which you run btrfs raid1. 
That gives you the data integrity of btrfs raid1, with somewhat better speed than the reasonably stable but as yet unoptimized btrfs raid10. And of course there's one other alternative, zfs, if you are good with its hardware requirements and licensing situation. But I'd recommend btrfs raid1 as the simple choice. It's what I'm using here (tho on a pair of ssds, so far smaller but faster media, so different use-case). -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman
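[The raid5-to-raid1 migration recommended above is an online balance with convert filters. A sketch, assuming there is enough unallocated space on every device to hold the new profile alongside the old one during the conversion:

```shell
# Convert data and metadata block groups from raid5 to raid1, online;
# the filesystem stays mounted and usable while this runs
btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt

# Watch progress from another shell
btrfs balance status /mnt

# Afterwards, per-type profiles should all read RAID1
btrfs filesystem df /mnt
```

On a large, full array this can take many hours, and it can be paused or cancelled with `btrfs balance pause`/`cancel` if needed.]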
Re: Adventures in btrfs raid5 disk recovery
On Mon, Jun 27, 2016 at 6:17 PM, Chris Murphy wrote: > On Mon, Jun 27, 2016 at 5:21 AM, Austin S. Hemmelgarn > wrote: >> On 2016-06-25 12:44, Chris Murphy wrote: >>> >>> On Fri, Jun 24, 2016 at 12:19 PM, Austin S. Hemmelgarn >>> wrote: >>> Well, the obvious major advantage that comes to mind for me to checksumming parity is that it would let us scrub the parity data itself and verify it. >>> >>> >>> OK but hold on. During scrub, it should read data, compute checksums >>> *and* parity, and compare those to what's on-disk -> EXTENT_CSUM in >>> the checksum tree, and the parity strip in the chunk tree. And if >>> parity is wrong, then it should be replaced. >> >> Except that's horribly inefficient. With limited exceptions involving >> highly situational co-processors, computing a checksum of a parity block is >> always going to be faster than computing parity for the stripe. By using >> that to check parity, we can safely speed up the common case of near zero >> errors during a scrub by a pretty significant factor. > > OK I'm in favor of that. Although somehow md gets away with this by > computing and checking parity for its scrubs, and still manages to > keep drives saturated in the process - at least HDDs, I'm not sure how > it fares on SSDs. What I read in this thread clarifies the different flavors of errors I saw when trying btrfs raid5 while corrupting 1 device or just unexpectedly removing a device and replacing it with a fresh one. In particular, I was not aware of the lack of parity csums, and I think this is really wrong. Consider a 4 disk btrfs raid10 and a 3 disk btrfs raid5. Both protect against the loss of 1 device or badblocks on 1 device. In the current design (unoptimized for performance), raid10 reads from 2 disks and raid5 as well (as far as I remember) per task/process. Which pair of strips for raid10 is pseudo-random AFAIK, so one could get low throughput if some device in the array is older/slower and that one is picked.
Going from the device to the fs logical layer is just a simple function, namely a copy, which allows keeping data in place (zero-copy). The data is at least read by the csum check and in case of failure, the btrfs code picks the alternative strip and corrects etc. For raid5, assuming it does avoid the parity in principle, it is also a strip pair and csum check. In case of csum failure, one needs the parity strip and a parity calculation. To me, it looks like the 'fear' of this calculation has made raid56 a sort of add-on, instead of a more integral part. Looking at the raid6 perf test at boot in dmesg, it is 30GByte/s, even higher than memory bandwidth. So although a calculation is needed in case data0strip+paritystrip would be used instead of data0strip+data1strip, I think looking at total cost, it can be cheaper than spending time on seeks, at least on HDDs. If the parity calculation is treated in a transparent way, same as copy, then there is more flexibility in selecting disks (and strips), and that enables easier design and performance optimizations I think. >> The ideal situation that I'd like to see for scrub WRT parity is: >> 1. Store checksums for the parity itself. >> 2. During scrub, if the checksum is good, the parity is good, and we just >> saved the time of computing the whole parity block. >> 3. If the checksum is not good, then compute the parity. If the parity just >> computed matches what is there already, the checksum is bad and should be >> rewritten (and we should probably recompute the whole block of checksums >> it's in), otherwise, the parity was bad, write out the new parity and update >> the checksum. This 3rd point: if parity matches but csum is not good, then there is a btrfs design error or some hardware/CPU/memory problem. Compare with btrfs raid10: if the copies match but the csum is wrong, then there is something fatally wrong. Just do the first step, the csum check, and if it is wrong, regenerate the assumed-corrupt strip from the 3 others.
And for 3 disk raid5 from the 2 others, whether it is copying or parity calculation. >> 4. Have an option to skip the csum check on the parity and always compute >> it.
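[The point that parity reconstruction is just another cheap transform, like copy, can be illustrated with a toy byte-wise XOR, which is exactly the raid5 parity operation: losing either data strip is recoverable from the surviving strip plus parity. A minimal bash sketch with made-up 4-byte strips:

```shell
# Toy 3-"disk" raid5 stripe: two data strips and their XOR parity
d0=(17 34 51 68)
d1=(9 18 27 36)

p=()   # parity strip: byte-wise XOR of d0 and d1
for i in "${!d0[@]}"; do p[i]=$(( d0[i] ^ d1[i] )); done

# Simulate losing d1 and rebuilding it from d0 + parity:
# XOR is its own inverse, so d0 ^ (d0 ^ d1) == d1
rebuilt=()
for i in "${!d0[@]}"; do rebuilt[i]=$(( d0[i] ^ p[i] )); done

echo "${rebuilt[@]}"   # prints the original d1: 9 18 27 36
```

The per-byte cost is one XOR, which is why the kernel's boot-time benchmark reaches tens of GB/s; the expensive part in practice is the extra I/O, not the arithmetic.]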
Re: Adventures in btrfs raid5 disk recovery
For what it's worth I found btrfs-map-logical can produce mapping for raid5 (didn't test raid6) by specifying the extent block length. If that's omitted it only shows the device+mapping for the first strip. This example is a 3 disk raid5, with a 128KiB file all in a single extent. [root@f24s ~]# btrfs-map-logical -l 14157742080 /dev/VG/a mirror 1 logical 14157742080 physical 1109327872 device /dev/mapper/VG-a mirror 2 logical 14157742080 physical 2183069696 device /dev/mapper/VG-c [root@f24s ~]# btrfs-map-logical -l 14157742080 -b 131072 /dev/VG/a mirror 1 logical 14157742080 physical 1109327872 device /dev/mapper/VG-a mirror 1 logical 14157807616 physical 1075773440 device /dev/mapper/VG-b mirror 2 logical 14157742080 physical 2183069696 device /dev/mapper/VG-c mirror 2 logical 14157807616 physical 2183069696 device /dev/mapper/VG-c It's also possible to use -c and -o to copy the extent to a file and more easily diff it with a control file, rather than using dd. Chris Murphy
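[For completeness, the dd route the mail compares against would use the physical offsets printed above directly. A sketch — offsets and device names come from that example, not from any general rule, and the exact -c/-o semantics should be checked against the btrfs-map-logical help for your progs version:

```shell
# Pull one 64 KiB strip of the extent straight off the block device;
# 1109327872 is the physical offset reported for /dev/mapper/VG-a,
# and it happens to be 64K-aligned (16927 * 65536)
dd if=/dev/mapper/VG-a of=/tmp/strip.0 bs=64K skip=$((1109327872 / 65536)) count=1

# Or let btrfs-map-logical copy the whole extent and compare it to a
# known-good control file
btrfs-map-logical -l 14157742080 -b 131072 -o /tmp/extent /dev/VG/a
cmp /tmp/extent /path/to/control-file && echo "extent matches control"
```

Either way yields raw strip contents that can be diffed per-device to see which copy (or which parity strip) diverged.]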
[PATCH v2 4/4] btrfs/126,127,128: test feature ioctl and sysfs interfaces
From: Jeff MahoneyThis tests the exporting of feature information from the kernel via sysfs and ioctl. The first test works whether the sysfs permissions are correct, if the information exported via sysfs matches what the ioctls are reporting, and if they both match the on-disk superblock's version of the feature sets. The second and third tests test online setting and clearing of feature bits via the sysfs and ioctl interfaces, checking whether they match the on-disk super on each cycle. Signed-off-by: Jeff Mahoney --- common/btrfs | 203 +++ src/btrfs_ioctl_helper.c | 88 + tests/btrfs/126 | 244 +++ tests/btrfs/126.out | 1 + tests/btrfs/127 | 166 tests/btrfs/127.out | 1 + tests/btrfs/128 | 128 + tests/btrfs/128.out | 1 + tests/btrfs/group| 3 + 9 files changed, 835 insertions(+) create mode 100755 tests/btrfs/126 create mode 100644 tests/btrfs/126.out create mode 100755 tests/btrfs/127 create mode 100644 tests/btrfs/127.out create mode 100755 tests/btrfs/128 create mode 100644 tests/btrfs/128.out diff --git a/common/btrfs b/common/btrfs index 5828d0a..2d7d0ce 100644 --- a/common/btrfs +++ b/common/btrfs @@ -48,3 +48,206 @@ _require_btrfs_raid_dev_pool() _require_scratch_dev_pool 4 # RAID10 _require_scratch_dev_pool_equal_size } + +# TODO Add tool to enable and test unknown feature bits +_btrfs_feature_lookup() { + local name=$1 + class="" + case "$name" in + mixed_backref) class=incompat; bit=0x1 ;; + default_subvol) class=incompat; bit=0x2 ;; + mixed_groups) class=incompat; bit=0x4 ;; + compress_lzo) class=incompat; bit=0x8 ;; + compress_lsov2) class=incompat; bit=0x10 ;; + big_metadata) class=incompat; bit=0x20 ;; + extended_iref) class=incompat; bit=0x40 ;; + raid56) class=incompat; bit=0x80 ;; + skinny_metadata)class=incompat; bit=0x100 ;; + compat:*) class=compat; bit=${name##compat:} ;; + compat_ro:*)class=compat_ro; bit=${name##compat_ro:} ;; + incompat:*) class=incompat; bit=${name##incompat:} ;; + esac + if [ -z "$class" ]; then + echo "Unknown feature name 
$name. xfstests needs updating." \ +" Skipping the test of sysfs values to superblock values" \ +>&2 + fi + + echo "$class/$bit" +} + +_btrfs_feature_get_class() { + bits=$(_btrfs_feature_lookup $1) + echo ${bits%/*} +} + +_btrfs_feature_get_bit() { + bits=$(_btrfs_feature_lookup $1) + echo ${bits#*/} +} + +_btrfs_feature_class_to_index() +{ + local class=$1 + local index=0 + + case "$class" in + compat) index=0 ;; + compat_ro) index=1 ;; + incompat) index=2 ;; + *) echo "Invalid class name $class" >&2 + esac + + echo $index +} + +# The ioctl helper outputs the supported feature flags as a series of +# 9 hex numbers, which represent bitfields. +# These 9 values represent 3 sets of 3 values. +# supported flags: compat compat_ro incompat, starting at index 0 +# settable online: compat compat_ro incompat, starting at index 3 +# clearable online: compat compat_ro incompat, starting at index 6 +# The returned mask is: 0x1 settable | 0x2 clearable +_btrfs_feature_ioctl_writeable_mask() +{ + local feature=$1 + local mnt=$2 + local index=0 + + # This usually won't matter. The supported bits are fs-module global. + if [ -z "$mnt" ]; then + mnt=$TEST_DIR + fi + + class=$(_btrfs_feature_get_class $1) + bit=$(_btrfs_feature_get_bit $1) + index=$(_btrfs_feature_class_to_index $class) + + local set_index=$(( $index + 3 )) + local clear_index=$(( $index + 6 )) + + out=$(src/btrfs_ioctl_helper $mnt GET_SUPPORTED_FEATURES) + set -- $out + supp_features=($@) + + settable=$(( ${supp_features[$set_index]} & $bit )) + clearable=$(( ${supp_features[$clear_index]} & $bit )) + + val=0 + if [ "$settable" -ne 0 ]; then + val=$(( $val | 1 )) + fi + if [ "$clearable" -ne 0 ]; then + val=$(( $val | 2 )) + fi + + echo $val +} + +_btrfs_feature_ioctl_index_settable_mask() +{ + local class_index=$1 + local mnt=$2 + + # This usually won't matter. The supported bits are fs-module global. 
+ if [ -z "$mnt" ]; then + mnt=$TEST_DIR + fi + + local set_index=$(( $class_index + 3 )) + + out=$(src/btrfs_ioctl_helper $mnt GET_SUPPORTED_FEATURES) + set -- $out + supp_features=($@) + + echo $(( ${supp_features[$set_index]} )) +} + +_btrfs_feature_ioctl_index_clearable_mask() +{ +
[PATCH v2 0/4] btrfs feature testing + props fix
From: Jeff MahoneyHi all - Thanks, Eryu, for the review. The btrfs feature testing changes were a patchet I wrote three years ago, and it looks like significant cleanup has happened in the xfstests since then. I'm sorry for the level of the review you had to do for them, but do appreciate that you did. This version should fix the outstanding issues, including some issues with the tests themselves, where e.g. the 32MB reserved size was file system-size (and implementation) dependent. Most notably, since these tests share some common functionality that ultimately hit ~250 lines, I chose to create a new common/btrfs library. Other than that, I tried to meet the level of consistency you were looking for with just printing errors instead of failing, not depending on error codes, etc. Thanks, -Jeff --- Jeff Mahoney (4): btrfs/048: extend _filter_btrfs_prop_error to handle additional errors btrfs/124: test global metadata reservation reporting btrfs/125: test sysfs exports of allocation and device membership info btrfs/126,127,128: test feature ioctl and sysfs interfaces .gitignore | 1 + common/btrfs | 253 +++ common/config| 7 +- common/filter.btrfs | 10 +- src/Makefile | 3 +- src/btrfs_ioctl_helper.c | 220 + tests/btrfs/048 | 6 +- tests/btrfs/048.out | 4 +- tests/btrfs/124 | 84 tests/btrfs/124.out | 1 + tests/btrfs/125 | 177 + tests/btrfs/125.out | 1 + tests/btrfs/126 | 244 + tests/btrfs/126.out | 1 + tests/btrfs/127 | 166 +++ tests/btrfs/127.out | 1 + tests/btrfs/128 | 128 tests/btrfs/128.out | 1 + tests/btrfs/group| 5 + 19 files changed, 1302 insertions(+), 11 deletions(-) create mode 100644 common/btrfs create mode 100644 src/btrfs_ioctl_helper.c create mode 100755 tests/btrfs/124 create mode 100644 tests/btrfs/124.out create mode 100755 tests/btrfs/125 create mode 100644 tests/btrfs/125.out create mode 100755 tests/btrfs/126 create mode 100644 tests/btrfs/126.out create mode 100755 tests/btrfs/127 create mode 100644 tests/btrfs/127.out create mode 100755 
tests/btrfs/128 create mode 100644 tests/btrfs/128.out -- 1.8.5.6
[PATCH v2 1/4] btrfs/048: extend _filter_btrfs_prop_error to handle additional errors
From: Jeff Mahoneybtrfsprogs v4.5.3 changed the formatting of some error messages. This patch extends the filter for btrfs prop to handle those. Signed-off-by: Jeff Mahoney --- common/filter.btrfs | 10 +++--- tests/btrfs/048 | 6 -- tests/btrfs/048.out | 4 ++-- 3 files changed, 13 insertions(+), 7 deletions(-) diff --git a/common/filter.btrfs b/common/filter.btrfs index 9970f4d..0cf7f0d 100644 --- a/common/filter.btrfs +++ b/common/filter.btrfs @@ -72,15 +72,19 @@ _filter_btrfs_compress_property() sed -e "s/compression=\(lzo\|zlib\)/COMPRESSION=XXX/g" } -# filter name of the property from the output, optionally verify against $1 +# filter error messages from btrfs prop, optionally verify against $1 # recognized message(s): # "object is not compatible with property: label" +# "invalid value for property:{, value}" +# "failed to {get,set} compression for $PATH[.:]: Invalid argument" _filter_btrfs_prop_error() { if ! [ -z "$1" ]; then - sed -e "s/\(compatible with property\): $1/\1/" + sed -e "s#\(compatible with property\): $1#\1#" \ + -e "s#^\(.*failed to [sg]et compression for $1\)[:.] 
\(.*\)#\1: \2#" else - sed -e "s/^\(.*compatible with property\).*/\1/" + sed -e "s#^\(.*compatible with property\).*#\1#" \ + -e "s#^\(.*invalid value for property\)[:.].*#\1#" fi } diff --git a/tests/btrfs/048 b/tests/btrfs/048 index 4a36303..0b907b0 100755 --- a/tests/btrfs/048 +++ b/tests/btrfs/048 @@ -79,7 +79,8 @@ echo -e "\nTesting subvolume ro property" _run_btrfs_util_prog subvolume create $SCRATCH_MNT/sv1 $BTRFS_UTIL_PROG property get $SCRATCH_MNT/sv1 ro echo "***" -$BTRFS_UTIL_PROG property set $SCRATCH_MNT/sv1 ro foo +$BTRFS_UTIL_PROG property set $SCRATCH_MNT/sv1 ro foo 2>&1 | + _filter_btrfs_prop_error echo "***" $BTRFS_UTIL_PROG property set $SCRATCH_MNT/sv1 ro true echo "***" @@ -99,7 +100,8 @@ $BTRFS_UTIL_PROG property get $SCRATCH_MNT/testdir/file1 compression $BTRFS_UTIL_PROG property get $SCRATCH_MNT/testdir/subdir1 compression echo "***" $BTRFS_UTIL_PROG property set $SCRATCH_MNT/testdir/file1 compression \ - foo 2>&1 | _filter_scratch + foo 2>&1 | _filter_scratch | + _filter_btrfs_prop_error SCRATCH_MNT/testdir/file1 echo "***" $BTRFS_UTIL_PROG property set $SCRATCH_MNT/testdir/file1 compression lzo $BTRFS_UTIL_PROG property get $SCRATCH_MNT/testdir/file1 compression diff --git a/tests/btrfs/048.out b/tests/btrfs/048.out index 0b20d0b..3e4e3d2 100644 --- a/tests/btrfs/048.out +++ b/tests/btrfs/048.out @@ -15,7 +15,7 @@ ERROR: object is not compatible with property Testing subvolume ro property ro=false *** -ERROR: invalid value for property. +ERROR: invalid value for property *** *** ro=true @@ -27,7 +27,7 @@ ro=false Testing compression property *** -ERROR: failed to set compression for SCRATCH_MNT/testdir/file1. 
Invalid argument +ERROR: failed to set compression for SCRATCH_MNT/testdir/file1: Invalid argument *** compression=lzo compression=lzo -- 1.8.5.6
[PATCH v2 3/4] btrfs/125: test sysfs exports of allocation and device membership info
From: Jeff MahoneyThis tests the sysfs publishing for btrfs allocation and device membership info under a number of different layouts, similar to the btrfs replace test. We test the allocation files only for existence and that they contain numerical values. We test the device membership by mapping the devices used to create the file system to sysfs paths and matching them against the paths used for the device membership symlinks. Signed-off-by: Jeff Mahoney --- common/btrfs| 7 +++ common/config | 7 ++- tests/btrfs/125 | 177 tests/btrfs/125.out | 1 + tests/btrfs/group | 1 + 5 files changed, 190 insertions(+), 3 deletions(-) create mode 100755 tests/btrfs/125 create mode 100644 tests/btrfs/125.out diff --git a/common/btrfs b/common/btrfs index b972b13..5828d0a 100644 --- a/common/btrfs +++ b/common/btrfs @@ -41,3 +41,10 @@ _require_btrfs_ioctl() _notrun "btrfs ioctl $ioctl not implemented." fi } + +# Requires the minimum size pool for largest btrfs RAID test +_require_btrfs_raid_dev_pool() +{ + _require_scratch_dev_pool 4 # RAID10 + _require_scratch_dev_pool_equal_size +} diff --git a/common/config b/common/config index c25b1ec..8577924 100644 --- a/common/config +++ b/common/config @@ -201,13 +201,14 @@ export DEBUGFS_PROG="`set_prog_path debugfs`" # newer systems have udevadm command but older systems like RHEL5 don't. 
# But if neither one is available, just set it to "sleep 1" to wait for lv to # be settled -UDEV_SETTLE_PROG="`set_prog_path udevadm`" -if [ "$UDEV_SETTLE_PROG" == "" ]; then +UDEVADM_PROG="`set_prog_path udevadm`" +if [ "$UDEVADM_PROG" == "" ]; then # try udevsettle command UDEV_SETTLE_PROG="`set_prog_path udevsettle`" else # udevadm is available, add 'settle' as subcommand - UDEV_SETTLE_PROG="$UDEV_SETTLE_PROG settle" + UDEV_SETTLE_PROG="$UDEVADM_PROG settle" + export UDEVADM_PROG fi # neither command is available, use sleep 1 if [ "$UDEV_SETTLE_PROG" == "" ]; then diff --git a/tests/btrfs/125 b/tests/btrfs/125 new file mode 100755 index 000..999a10e --- /dev/null +++ b/tests/btrfs/125 @@ -0,0 +1,177 @@ +#! /bin/bash +# FS QA Test No. 125 +# +# Test of the btrfs sysfs publishing +# +#--- +# Copyright (C) 2016 SUSE. All rights reserved. +# +# This program is free software; you can redistribute it and/or +# modify it under the terms of the GNU General Public License as +# published by the Free Software Foundation. +# +# This program is distributed in the hope that it would be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. +# +# You should have received a copy of the GNU General Public License +# along with this program; if not, write the Free Software Foundation, +# Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA +# +#--- +# + +seq=`basename $0` +seqres=$RESULT_DIR/$seq +echo "QA output created by $seq" + +here=`pwd` +tmp=/tmp/$$ +status=1 + +trap "_cleanup; exit \$status" 0 1 2 3 15 + +_cleanup() +{ + cd / + rm -f $tmp.* +} + +# remove previous $seqres.full before test +rm -f $seqres.full + +# get standard environment, filters and checks +. ./common/rc +. ./common/btrfs +. 
./common/filter + +# real QA test starts here +_supported_fs btrfs +_supported_os Linux +_require_scratch +_require_command "$UDEVADM_PROG" udevadm +_require_test +_require_btrfs_sysfs +_require_btrfs_raid_dev_pool + +sysfs_root=$(_btrfs_get_sysfs $TEST_DIR) + +[ -d "$sysfs_root/allocation" ] || _notrun "sysfs allocation dir not found" +[ -d "$sysfs_root/devices" ] || _notrun "sysfs devices dir not found" + +check_file() +{ + local file=$1 + base="$(basename $(dirname $file))/$(basename $file)" + value="$(cat $file)" + if [ -n "$(echo $value | tr -d 0-9)" ]; then + echo "ERROR: $base: numerical value expected" \ +"(got $value)" + fi +} + +check_chunk() +{ + path=$1 + mkfs_options=$2 + + chunktype=$(basename $path) + [ -d "$path" ] || echo "No $chunktype directory." + + for file in bytes_may_use bytes_pinned bytes_reserved bytes_used \ + disk_total disk_used flags total_bytes \ + total_bytes_pinned; do + check_file "$path/$file" + done + + if [ "$chunktype" = "data" -o "$chunktype" = "mixed" ]; then + opt="-d" + elif [ "$chunktype" = "metadata" -o "$chunktype" = "system" ]; then + opt="-m" + fi + + profile=$(echo $mkfs_options | sed -e "s/.*$opt \([[:alnum:]]*\).*/\1/") + [ -d "$path/$profile" ] || echo
[PATCH v2 2/4] btrfs/124: test global metadata reservation reporting
From: Jeff Mahoney

Btrfs can now report the size of the global metadata reservation via
ioctl and sysfs. This test confirms that we get sane results on an
empty file system.

Signed-off-by: Jeff Mahoney
---
 .gitignore               |   1 +
 common/btrfs             |  43 +++
 src/Makefile             |   3 +-
 src/btrfs_ioctl_helper.c | 132 +++
 tests/btrfs/124          |  84 ++
 tests/btrfs/124.out      |   1 +
 tests/btrfs/group        |   1 +
 7 files changed, 264 insertions(+), 1 deletion(-)
 create mode 100644 common/btrfs
 create mode 100644 src/btrfs_ioctl_helper.c
 create mode 100755 tests/btrfs/124
 create mode 100644 tests/btrfs/124.out

diff --git a/.gitignore b/.gitignore
index 28bd180..0e4f2a1 100644
--- a/.gitignore
+++ b/.gitignore
@@ -39,6 +39,7 @@
 /src/append_reader
 /src/append_writer
 /src/bstat
+/src/btrfs_ioctl_helper
 /src/bulkstat_unlink_test
 /src/bulkstat_unlink_test_modified
 /src/dbtest
diff --git a/common/btrfs b/common/btrfs
new file mode 100644
index 000..b972b13
--- /dev/null
+++ b/common/btrfs
@@ -0,0 +1,43 @@
+#!/bin/bash
+# Functions for testing btrfs
+
+_btrfs_get_fsid()
+{
+	local mnt=$1
+	if [ -z "$mnt" ]; then
+		mnt=$TEST_DIR
+	fi
+	$BTRFS_UTIL_PROG filesystem show $mnt | awk '/uuid:/ {print $NF}'
+}
+
+_btrfs_get_sysfs()
+{
+	local mnt=$1
+	local fsid=$(_btrfs_get_fsid $mnt)
+	echo "/sys/fs/btrfs/$fsid"
+}
+
+_require_btrfs_sysfs()
+{
+	local mnt=$1
+	if [ -z "$mnt" ]; then
+		mnt=$TEST_DIR
+	fi
+	if [ ! -d "$(_btrfs_get_sysfs $mnt)" ]; then
+		_notrun "btrfs sysfs support not available."
+	fi
+}
+
+_require_btrfs_ioctl()
+{
+	local ioctl=$1
+	local mnt=$2
+	shift 2
+	if [ -z "$mnt" ]; then
+		mnt=$TEST_DIR
+	fi
+	out=$(src/btrfs_ioctl_helper $mnt $ioctl $@)
+	if [ "$out" = "Not implemented." ]; then
+		_notrun "btrfs ioctl $ioctl not implemented."
+ fi +} diff --git a/src/Makefile b/src/Makefile index 1bf318b..c467475 100644 --- a/src/Makefile +++ b/src/Makefile @@ -20,7 +20,8 @@ LINUX_TARGETS = xfsctl bstat t_mtab getdevicesize preallo_rw_pattern_reader \ bulkstat_unlink_test_modified t_dir_offset t_futimens t_immutable \ stale_handle pwrite_mmap_blocked t_dir_offset2 seek_sanity_test \ seek_copy_test t_readdir_1 t_readdir_2 fsync-tester nsexec cloner \ - renameat2 t_getcwd e4compact test-nextquota punch-alternating + renameat2 t_getcwd e4compact test-nextquota punch-alternating \ + btrfs_ioctl_helper SUBDIRS = diff --git a/src/btrfs_ioctl_helper.c b/src/btrfs_ioctl_helper.c new file mode 100644 index 000..b6eb924 --- /dev/null +++ b/src/btrfs_ioctl_helper.c @@ -0,0 +1,132 @@ +#include +#include +#include +#include +#include +#include +#include +#include + +#ifndef BTRFS_IOCTL_MAGIC +#define BTRFS_IOCTL_MAGIC 0x94 +#endif + +#ifndef BTRFS_IOC_SPACE_INFO +struct btrfs_ioctl_space_info { +uint64_t flags; +uint64_t total_bytes; +uint64_t used_bytes; +}; + +struct btrfs_ioctl_space_args { +uint64_t space_slots; +uint64_t total_spaces; +struct btrfs_ioctl_space_info spaces[0]; +}; +#define BTRFS_IOC_SPACE_INFO _IOWR(BTRFS_IOCTL_MAGIC, 20, \ +struct btrfs_ioctl_space_args) +#endif +#ifndef BTRFS_SPACE_INFO_GLOBAL_RSV +#define BTRFS_SPACE_INFO_GLOBAL_RSV(1ULL << 49) +#endif + +static int global_rsv_ioctl(int fd, int argc, char *argv[]) +{ + struct btrfs_ioctl_space_args arg; + struct btrfs_ioctl_space_args *args; + int ret; + int i; + size_t size; + + arg.space_slots = 0; + + ret = ioctl(fd, BTRFS_IOC_SPACE_INFO, ); + if (ret) + return -errno; + + size = sizeof(*args) + sizeof(args->spaces[0]) * arg.total_spaces; + args = malloc(size); + if (!args) + return -ENOMEM; + + args->space_slots = arg.total_spaces; + + ret = ioctl(fd, BTRFS_IOC_SPACE_INFO, args); + if (ret) + return -errno; + + for (i = 0; i < args->total_spaces; i++) { + if (args->spaces[i].flags & BTRFS_SPACE_INFO_GLOBAL_RSV) { + unsigned long long 
reserved; + reserved = args->spaces[i].total_bytes; + printf("%llu\n", reserved); + return 0; + } + } + + return -ENOENT; +} + +#define IOCTL_TABLE_ENTRY(_ioctl_name, _handler) \ + { .name = #_ioctl_name, .ioctl_cmd = BTRFS_IOC_##_ioctl_name, \ + .handler = _handler, } + +struct ioctl_table_entry { + const char *name; + unsigned ioctl_cmd; + int (*handler)(int fd, int argc, char *argv[]); +}; + +static struct ioctl_table_entry
Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5
Steven Haigh posted on Mon, 27 Jun 2016 13:21:00 +1000 as excerpted: > I'd also recommend updates to the ArchLinux wiki - as for some reason I > always seem to end up there when searching for a certain topic... Not really btrfs related, but for people using popular search engines, at least, this is likely for two reasons: 1) Arch is apparently the most popular distro among reasonably technically literate users -- those who will both tend to have good technical knowledge and advice on the various real-life issues Linux users tend to encounter, and are likely to post it to an easily publicly indexable forum. (And in that regard, wikis are likely to be more indexable than (web archives of) mailing lists like this, because that's (part of) what wikis /do/ by design, make topics keyword searchable. Lists, not so much.) 2) Specifically to the point of being _publicly_ indexable, Arch may have a more liberal robots.txt that allows indexers more access, than other distros, some of which may limit robot access for performance reasons. With this combination, arch's wiki is a natural place for searches to point. So agreed, a high priority on getting the raid56 warning up there on the arch wiki is a good idea, indeed. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Btrfs full balance command fails due to ENOSPC (bug 121071)
On Mon, Jun 27, 2016 at 12:32 PM, Francesco Turco wrote:
> On 2016-06-27 20:18, Chris Murphy wrote:
>> If you can grab btrfs-debugfs from
>> https://github.com/kdave/btrfs-progs/blob/master/btrfs-debugfs
>>
>> And then attach the output to the bug report it might be useful for a
>> developer. But really your case is an odd duck, because there's fully
>> 14GiB unallocated, so it should be able to create a new one without
>> problem.
>>
>> $ sudo ./btrfs-debugfs -b /
>
> Done! Thank you, I was not aware of the existence of btrfs-debug...

I'm not certain what the "1 enospc errors during balance" message
refers to. That message happens several times, the balance operation
isn't aborted, and it doesn't come with any call traces (those appear
later). Further, the btrfs-debugfs output suggests the balance worked:
each bg is contiguously located after the last, and they all have new
bg offset values compared to what's found in the dmesg.

This might be that obscure -28 enospc bug that affects some file
systems and hasn't been tracked down yet. If I recall correctly it's a
misleading error, and the only workaround to get rid of it is to
migrate to a new Btrfs file system. I don't think the file system is
at any risk in its current state, but I'm not certain, as it's already
an edge case. I'd just make sure you keep suitably current backups and
keep using it.

-- 
Chris Murphy
Re: Bug in 'btrfs filesystem du' ?
On Mon, Jun 27, 2016 at 3:33 PM, M G Berberich wrote:
> On Monday, 27 June, M G Berberich wrote:
>> after a balance ‘btrfs filesystem du’ probably shows false data about
>> shared data.
>
> Oh, I forgot: I have btrfs-progs v4.5.2 and kernel 4.6.2.

With btrfs-progs v4.6.1 and kernel 4.7-rc5, the numbers about shared
data are correct.
Re: Btrfs full balance command fails due to ENOSPC (bug 121071)
On 2016-06-27 20:18, Chris Murphy wrote: > If you can grab btrfs-debugfs from > https://github.com/kdave/btrfs-progs/blob/master/btrfs-debugfs > > And then attach the output to the bug report it might be useful for a > developer. But really your case is an odd duck, because there's fully > 14GiB unallocated, so it should be able to create a new one without > problem. > > $ sudo ./btrfs-debugfs -b / Done! Thank you, I was not aware of the existence of btrfs-debug... -- Website: http://www.fturco.net/ GPG key: 6712 2364 B2FE 30E1 4791 EB82 7BB1 1F53 29DE CD34 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Btrfs full balance command fails due to ENOSPC (bug 121071)
On Mon, Jun 27, 2016 at 11:28 AM, Francesco Turco wrote:
> Note: I already filed bug 121071 but perhaps I should have written to
> this mailing list first.

https://bugzilla.kernel.org/show_bug.cgi?id=121071

It's a good bug report.

> Is there anything I can try? Should I run the command from a live CD?
> Is this a real bug or a mistake from an inexperienced btrfs user like
> me?

It's a bug somewhere, not a user mistake.

If you can grab btrfs-debugfs from
https://github.com/kdave/btrfs-progs/blob/master/btrfs-debugfs

And then attach the output to the bug report it might be useful for a
developer. But really your case is an odd duck, because there's fully
14GiB unallocated, so it should be able to create a new one without
problem.

$ sudo ./btrfs-debugfs -b /

-- 
Chris Murphy
Re: Strange behavior when replacing device on BTRFS RAID 5 array.
On Mon, Jun 27, 2016 at 11:29 AM, Chris Murphy wrote:
>
> Next is to decide to what degree you want to salvage this volume and
> keep using Btrfs raid56 despite the risks

Forgot to complete this thought.

So if you get a backup, and decide you want to fix it, I would see if
you can cancel the replace using "btrfs replace cancel" and confirm
that it stops.

And now is the risky part: whether to try "btrfs add" and then "btrfs
remove", or to remove the bad drive, reboot, and see if it'll mount
with -o degraded, and then use add and remove (in which case you'll
use 'remove missing'). With the first you risk Btrfs still using the
flaky bad drive. With the second you risk whether a degraded mount
will work, and whether any other drive in the array has a problem
while degraded (like an unrecoverable read error from a single
sector).

-- 
Chris Murphy
Re: Strange behavior when replacing device on BTRFS RAID 5 array.
On 2016-06-27 13:29, Chris Murphy wrote:
> On Sun, Jun 26, 2016 at 10:02 PM, Nick Austin wrote:
>> On Sun, Jun 26, 2016 at 8:57 PM, Nick Austin wrote:
>>> sudo btrfs fi show /mnt/newdata
>>> Label: '/var/data'  uuid: e4a2eb77-956e-447a-875e-4f6595a5d3ec
>>> Total devices 4 FS bytes used 8.07TiB
>>> devid 1 size 5.46TiB used 2.70TiB path /dev/sdg
>>> devid 2 size 5.46TiB used 2.70TiB path /dev/sdl
>>> devid 3 size 5.46TiB used 2.70TiB path /dev/sdm
>>> devid 4 size 5.46TiB used 2.70TiB path /dev/sdx
>>
>> It looks like fi show has bad data:
>>
>> When I start heavy IO on the filesystem (running rsync -c to verify
>> the data), I notice zero IO on the bad drive I told btrfs to replace,
>> and lots of IO to the expected replacement.
>>
>> I guess some metadata is messed up somewhere?
>>
>> avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
>>           25.19   0.00     7.81    28.46    0.00  38.54
>>
>> Device:    tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
>> sdg     437.00     75168.00      1792.00      75168       1792
>> sdl     443.00     76064.00      1792.00      76064       1792
>> sdm     438.00     75232.00      1472.00      75232       1472
>> sdw     443.00     75680.00      1856.00      75680       1856
>> sdx       0.00         0.00         0.00          0          0
>
> There are reported bugs with 'btrfs replace' and raid56, but I don't
> know the exact nature of those bugs, or when or how they manifest.
> It's recommended to fall back to 'btrfs add' and then 'btrfs delete',
> but you have other issues going on also.

One other thing to mention: if the device is failing, _always_ add
'-r' to the replace command line. This tells it to avoid reading from
the device being replaced (in raid1 or raid10 mode, it will pull from
the other mirror; in raid5/6 mode, it will recompute the block from
parity and compare to the stored checksums, which in turn means that
this _will_ be slower on raid5/6 than a regular replace). Link resets
and other issues that cause devices to disappear become more common
the more damaged a disk is, so avoiding reading from it becomes more
important too, because just reading from a disk puts stress on it.
Re: Strange behavior when replacing device on BTRFS RAID 5 array.
On Sun, Jun 26, 2016 at 10:02 PM, Nick Austin wrote:
> On Sun, Jun 26, 2016 at 8:57 PM, Nick Austin wrote:
>> sudo btrfs fi show /mnt/newdata
>> Label: '/var/data'  uuid: e4a2eb77-956e-447a-875e-4f6595a5d3ec
>> Total devices 4 FS bytes used 8.07TiB
>> devid 1 size 5.46TiB used 2.70TiB path /dev/sdg
>> devid 2 size 5.46TiB used 2.70TiB path /dev/sdl
>> devid 3 size 5.46TiB used 2.70TiB path /dev/sdm
>> devid 4 size 5.46TiB used 2.70TiB path /dev/sdx
>
> It looks like fi show has bad data:
>
> When I start heavy IO on the filesystem (running rsync -c to verify
> the data), I notice zero IO on the bad drive I told btrfs to replace,
> and lots of IO to the expected replacement.
>
> I guess some metadata is messed up somewhere?
>
> avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
>           25.19   0.00     7.81    28.46    0.00  38.54
>
> Device:    tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
> sdg     437.00     75168.00      1792.00      75168       1792
> sdl     443.00     76064.00      1792.00      76064       1792
> sdm     438.00     75232.00      1472.00      75232       1472
> sdw     443.00     75680.00      1856.00      75680       1856
> sdx       0.00         0.00         0.00          0          0

There are reported bugs with 'btrfs replace' and raid56, but I don't
know the exact nature of those bugs, or when or how they manifest.
It's recommended to fall back to 'btrfs add' and then 'btrfs delete',
but you have other issues going on also.

Devices dropping off and being renamed is something btrfs, in my
experience, does not handle well at all. The very fact the hardware is
dropping off and coming back is bad, so you really need to get that
sorted out as a prerequisite no matter what RAID technology you're
using.

First advice: make a backup. Don't change the volume further until
you've done this. Each attempt to make the volume healthy again
carries risks of totally breaking it and losing the ability to mount
it. So as long as it's mounted, take advantage of that. Pretend the
very next repair attempt will break the volume, and make your backup
accordingly.
Next is to decide to what degree you want to salvage this volume and
keep using Btrfs raid56 despite the risks (it's still rather
experimental, and in particular some things have been realized on the
list in the last week especially that make it not recommended, except
by people willing to poke it with a stick and learn how many more
bodies can be found with the current implementation), or if you just
want to migrate it over to something like XFS on mdadm or LVM raid 5
as soon as possible?

There's also the obligatory notice that applies to all Linux software
raid implementations, which is to discover if you have a very common
misconfiguration that enhances the chance of data loss if the volume
ever goes degraded and you need to rebuild with a new drive:

smartctl -l scterc <dev>
cat /sys/block/<dev>/device/timeout

The first value must be less than the second. Note the first value is
in deciseconds, the second is in seconds. And either 'unsupported' or
'unset' translates into a vague value that could be as high as 180
seconds.

-- 
Chris Murphy
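The deciseconds-versus-seconds rule above can be sketched as a tiny shell check. This is a sketch, not a tool from this thread: `erc_ok` is a made-up helper with sample values, and real use would parse the output of `smartctl -l scterc` and the sysfs timeout file.

```shell
#!/bin/sh
# erc_ok ERC TIMEOUT - succeed when the drive's SCT ERC limit (in
# deciseconds, as smartctl reports it) is below the kernel SCSI
# command timer (in seconds). An 'unsupported' or 'unset' ERC is
# treated as the vague worst case of 180 seconds mentioned above.
erc_ok() {
    erc=$1; timeout=$2
    case "$erc" in
        unsupported|unset) erc=1800 ;;
    esac
    [ $((erc / 10)) -lt "$timeout" ]
}

erc_ok 70 30 && echo "ok: 7 s drive recovery < 30 s kernel timer"
erc_ok unset 30 || echo "risky: raise the kernel timer or enable SCT ERC"
```

With a consumer drive that hides or disables SCT ERC, the second branch fires, which is exactly the misconfiguration described above.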
Btrfs full balance command fails due to ENOSPC (bug 121071)
Note: I already filed bug 121071 but perhaps I should have written to
this mailing list first.

I get the ENOSPC error when running a btrfs full balance command for
my root partition, even though it seems I have a lot of
free/unallocated space.

# btrfs filesystem show /
Label: none  uuid: 27150b83-7d90-4031-8e83-581315b9a254
	Total devices 1 FS bytes used 10.79GiB
	devid 1 size 25.00GiB used 13.31GiB path /dev/mapper/Desktop-root

# btrfs filesystem df /
Data, single: total=11.00GiB, used=10.40GiB
System, DUP: total=32.00MiB, used=16.00KiB
Metadata, DUP: total=1.12GiB, used=392.08MiB
GlobalReserve, single: total=144.00MiB, used=0.00B

# btrfs balance start --full-balance /
ERROR: error during balancing '/': No space left on device
There may be more info in syslog - try dmesg | tail

# dmesg | tail
[29807.441930] BTRFS info (device dm-2): found 13206 extents
[29807.879845] BTRFS info (device dm-2): relocating block group 47542435840 flags 1
[29827.116083] BTRFS info (device dm-2): found 12909 extents
[29830.500110] BTRFS info (device dm-2): found 12909 extents
[29830.976485] BTRFS info (device dm-2): relocating block group 46468694016 flags 1
[29848.924188] BTRFS info (device dm-2): found 5129 extents
[29851.533076] BTRFS info (device dm-2): found 5129 extents
[29851.994787] BTRFS info (device dm-2): relocating block group 46435139584 flags 34
[29852.399460] BTRFS info (device dm-2): found 1 extents
[29852.657983] BTRFS info (device dm-2): 1 enospc errors during balance

I have successfully balanced both the boot and home partitions before.
Only root gives me problems.

Is there anything I can try? Should I run the command from a live CD?
Is this a real bug or a mistake from an inexperienced btrfs user like
me?

Thanks.
-- 
Website: http://www.fturco.net/
GPG key: 6712 2364 B2FE 30E1 4791 EB82 7BB1 1F53 29DE CD34
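As a quick sanity check on the numbers in the report above: single data chunks consume their `total` once on the device, while DUP system and metadata chunks consume it twice. A sketch, with the GiB figures copied from the `btrfs fi df` output:

```shell
# GiB values from 'btrfs fi df /': data 11.00 (single),
# system 0.032 (DUP), metadata 1.12 (DUP)
data=11.00; system=0.032; metadata=1.12
allocated=$(awk "BEGIN { printf \"%.2f\", $data + 2*$system + 2*$metadata }")
echo "allocated: $allocated GiB"   # prints 13.30
```

That 13.30 GiB matches the "used 13.31GiB" figure in `btrfs fi show`, so the chunk accounting itself is consistent; the ENOSPC comes from somewhere else.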
Re: Bad hard drive - checksum verify failure forces readonly mount
On Mon, Jun 27, 2016 at 12:30 AM, Vasco Almeida wrote:
> File system image available at (choose one link)
> https://mega.nz/#!AkAEgKyB!RUa7G5xHIygWm0ALx5ZxQjjXNdFYa7lDRHJ_sW0bWLs
> https://www.sendspace.com/file/i70cft

> Should I file a bug report with that image dump linked above or btrfs-
> debug-tree output or both?

If it were me, I'd include both. Maybe the image is incomplete or vice
versa. The debug tree output is also human readable. I'd also put them
up in a cloud location where you can kinda forget about them for a
while; I've had images not looked at for 6+ months by a dev.

> I think I will use the subject of this thread as summary to file the
> bug. Can you think of something more suitable or is that fine?

I would try to summarize something like: file system created with
btrfs-progs version -, and mostly used with kernel version -, and
inexplicably the file system became unusable at boot time, always
mounting only readonly. Newer kernel versions still could not mount
it, nor was btrfs check using btrfs-progs version - able to repair it.
See thread URL for more details.

btrfs-image URL
btrfs-debug-tree URL

> I think I will reinstall the OS since, even if I manage to recover the
> file system from this issue, that OS will be something I can not trust
> fully.

Yeah, pretty much that's right. There is an rpm command where you can
have it check the signatures of all installed binaries, but I forget
what it is offhand. That'd be an alternative to reinstalling if the
init options were to work.

-- 
Chris Murphy
Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5
On Mon, 2016-06-27 at 07:35 +0300, Andrei Borzenkov wrote:
> The problem is that current implementation of RAID56 puts exactly CoW
> data at risk. I.e. writing new (copy of) data may suddenly make old
> (copy of) data inaccessible, even though it had been safely committed
> to disk and is now in read-only snapshot.

Sure... mine was just a general thing to be added.

No checksums => no way to tell which block is valid in case of silent
block errors => no way to recover except by chance => should be
included as a warning, especially as userland software starts to
automatically set nodatacow (IIRC systemd does so), thereby silently
breaking functionality (integrity+recoverability) assumed by the user.

Cheers,
Chris
Re: Adventures in btrfs raid5 disk recovery
On Mon, Jun 27, 2016 at 5:21 AM, Austin S. Hemmelgarn wrote:
> On 2016-06-25 12:44, Chris Murphy wrote:
>> On Fri, Jun 24, 2016 at 12:19 PM, Austin S. Hemmelgarn wrote:
>>
>>> Well, the obvious major advantage that comes to mind for me to
>>> checksumming parity is that it would let us scrub the parity data
>>> itself and verify it.
>>
>> OK but hold on. During scrub, it should read data, compute checksums
>> *and* parity, and compare those to what's on-disk: EXTENT_CSUM in
>> the checksum tree, and the parity strip in the chunk tree. And if
>> parity is wrong, then it should be replaced.
>
> Except that's horribly inefficient. With limited exceptions involving
> highly situational co-processors, computing a checksum of a parity
> block is always going to be faster than computing parity for the
> stripe. By using that to check parity, we can safely speed up the
> common case of near zero errors during a scrub by a pretty
> significant factor.

OK, I'm in favor of that. Although somehow md gets away with computing
and checking parity for its scrubs, and still manages to keep drives
saturated in the process - at least HDDs; I'm not sure how it fares on
SSDs.

> The ideal situation that I'd like to see for scrub WRT parity is:
> 1. Store checksums for the parity itself.
> 2. During scrub, if the checksum is good, the parity is good, and we
> just saved the time of computing the whole parity block.
> 3. If the checksum is not good, then compute the parity. If the
> parity just computed matches what is there already, the checksum is
> bad and should be rewritten (and we should probably recompute the
> whole block of checksums it's in), otherwise, the parity was bad,
> write out the new parity and update the checksum.
> 4. Have an option to skip the csum check on the parity and always
> compute it.
>>
>> Even 'check > md/sync_action' does this.
So no pun intended but Btrfs >> isn't even at parity with mdadm on data integrity if it doesn't check >> if the parity matches data. > > Except that MD and LVM don't have checksums to verify anything outside of > the very high-level metadata. They have to compute the parity during a > scrub because that's the _only_ way they have to check data integrity. Just > because that's the only way for them to check it does not mean we have to > follow their design, especially considering that we have other, faster ways > to check it. I'm not opposed to this optimization. But retroactively better qualifying my previous "major advantage" what I meant was in terms of solving functional deficiency. >> The much bigger problem we have right now that affects Btrfs, >> LVM/mdadm md raid, is this silly bad default with non-enterprise >> drives having no configurable SCT ERC, with ensuing long recovery >> times, and the kernel SCSI command timer at 30 seconds - which >> actually also fucks over regular single disk users also because it >> means they don't get the "benefit" of long recovery times, which is >> the whole g'd point of that feature. This itself causes so many >> problems where bad sectors just get worse and don't get fixed up >> because of all the link resets. So I still think it's a bullshit >> default kernel side because it pretty much affects the majority use >> case, it is only a non-problem with proprietary hardware raid, and >> software raid using enterprise (or NAS specific) drives that already >> have short recovery times by default. > > On this, we can agree. It just came up again in a thread over the weekend on linux-raid@. I'm going to ask while people are paying attention if a patch to change the 30 second time out to something a lot higher has ever been floated, what the negatives might be, and where to get this fixed if it wouldn't be accepted in the kernel code directly. *Ideally* I think we'd want two timeouts. 
I'd like to see commands have a timer that results in merely a
warning, which could be used by e.g. btrfs scrub to know "hey, this
sector range is 'slow', I'm going to write over those sectors". That's
how bad sectors start out: they read slower, and eventually go beyond
30 seconds and then it's all link resets. If the problem could be
fixed before then, that's the best scenario.

The second timer would be: OK, the controller or drive just face
planted; reset.

-- 
Chris Murphy
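For what it's worth, the usual way to persist a higher command timer today is a udev rule along these lines. The file name and the 180-second value here are illustrative, not a recommendation from this thread:

```
# /etc/udev/rules.d/60-scsi-timeout.rules (example)
ACTION=="add", SUBSYSTEM=="block", KERNEL=="sd[a-z]", \
    RUN+="/bin/sh -c 'echo 180 > /sys/block/%k/device/timeout'"
```

The `%k` substitution expands to the kernel device name, so the rule applies the timeout to each SCSI disk as it appears.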
Re: Bug in 'btrfs filesystem du' ?
On Monday, 27 June, M G Berberich wrote:
> after a balance ‘btrfs filesystem du’ probably shows false data about
> shared data.

Oh, I forgot: I have btrfs-progs v4.5.2 and kernel 4.6.2.

Regards,
bmg

-- 
"It makes no difference at all what gets decided today: I'm against it
anyway!" (SPD city councillor Kurt Schindler; Regensburg)
M G Berberich <m...@m-berberich.de>
Bug in 'btrfs filesystem du' ?
Hello,

after a balance, ‘btrfs filesystem du’ probably shows false data about
shared data.

To reproduce, create a (small) btrfs filesystem, copy some data into a
directory, then ‘cp -a --reflink’ the data. Now all data is shared and
‘btrfs fi du’ shows it correctly. In my case:

     Total   Exclusive  Set shared  Filename
  59.38MiB    29.69MiB    29.69MiB  .

After a balance, ‘btrfs fi du’ shows no shared data any more, but all
data as exclusive. In my case:

     Total   Exclusive  Set shared  Filename
  59.38MiB    59.38MiB       0.00B  .

As ‘btrfs fi df’ still shows used=29.69MiB, the problem probably is in
btrfs-tools.

Test-session log:

# dd if=/dev/urandom of=dev-btrfs bs=4K count=10
10+0 records in
10+0 records out
40960 bytes (410 MB, 391 MiB) copied, 24.7574 s, 16.5 MB/s
# mkfs.btrfs dev-btrfs
btrfs-progs v4.5.2
See http://btrfs.wiki.kernel.org for more information.

Label:              (null)
UUID:               698a2755-8ecb-468d-9577-9a48947361ea
Node size:          16384
Sector size:        4096
Filesystem size:    390.62MiB
Block group profiles:
  Data:             single            8.00MiB
  Metadata:         DUP              40.00MiB
  System:           DUP              12.00MiB
SSD detected:       no
Incompat features:  extref, skinny-metadata
Number of devices:  1
Devices:
   ID        SIZE  PATH
    1   390.62MiB  dev-btrfs

# mount /tmp/dev-btrfs /mnt/
# cd /mnt/
# btrfs fi du -s .
     Total   Exclusive  Set shared  Filename
     0.00B       0.00B       0.00B  .
# cp -a /scratch/kernel/linux-4.6/drivers/usb .
# btrfs fi du -s .
     Total   Exclusive  Set shared  Filename
  28.96MiB    28.96MiB       0.00B  .
# btrfs fi df .
Data, single: total=56.00MiB, used=3.61MiB
System, DUP: total=8.00MiB, used=16.00KiB
Metadata, DUP: total=32.00MiB, used=192.00KiB
GlobalReserve, single: total=16.00MiB, used=0.00B
# btrfs fi usage .
Overall: Device size: 390.62MiB Device allocated:136.00MiB Device unallocated: 254.62MiB Device missing: 0.00B Used: 32.06MiB Free (estimated):280.94MiB (min: 153.62MiB) Data ratio: 1.00 Metadata ratio: 2.00 Global reserve: 16.00MiB (used: 0.00B) Data,single: Size:56.00MiB, Used:29.69MiB /dev/loop0 56.00MiB Metadata,DUP: Size:32.00MiB, Used:1.17MiB /dev/loop0 64.00MiB System,DUP: Size:8.00MiB, Used:16.00KiB /dev/loop0 16.00MiB Unallocated: /dev/loop0254.62MiB # cp -a --reflink usb usb2 # btrfs fi du -s . Total Exclusive Set shared Filename 59.38MiB29.69MiB29.69MiB . # btrfs fi df . Data, single: total=56.00MiB, used=29.69MiB System, DUP: total=8.00MiB, used=16.00KiB Metadata, DUP: total=32.00MiB, used=1.17MiB GlobalReserve, single: total=16.00MiB, used=0.00B # btrfs fi usage . Overall: Device size: 390.62MiB Device allocated:136.00MiB Device unallocated: 254.62MiB Device missing: 0.00B Used: 32.06MiB Free (estimated):280.94MiB (min: 153.62MiB) Data ratio: 1.00 Metadata ratio: 2.00 Global reserve: 16.00MiB (used: 0.00B) Data,single: Size:56.00MiB, Used:29.69MiB /dev/loop0 56.00MiB Metadata,DUP: Size:32.00MiB, Used:1.17MiB /dev/loop0 64.00MiB System,DUP: Size:8.00MiB, Used:16.00KiB /dev/loop0 16.00MiB Unallocated: /dev/loop0254.62MiB # btrfs balance start . WARNING: Full balance without filters requested. This operation is very intense and takes potentially very long. It is recommended to use the balance filters to narrow down the balanced data. Use 'btrfs balance start --full-balance' option to skip this warning. The operation will start in 10 seconds. Use Ctrl-C to stop it. 10 9 8 7 6 5 4 3 2 1 Starting balance without any filters. Done, had to relocate 4 out of 4 chunks # btrfs fi du -s . Total Exclusive Set shared Filename 59.38MiB59.38MiB 0.00B . # btrfs fi df . Data, single: total=48.00MiB, used=29.69MiB System, DUP: total=24.00MiB, used=16.00KiB Metadata, DUP: total=24.00MiB, used=2.08MiB GlobalReserve, single: total=16.00MiB, used=0.00B # btrfs fi usage . 
Overall: Device size: 390.62MiB Device allocated:144.00MiB Device unallocated: 246.62MiB Device missing: 0.00B Used: 33.88MiB Free (estimated):264.94MiB (min: 141.62MiB) Data ratio: 1.00 Metadata ratio: 2.00 Global reserve: 16.00MiB (used: 0.00B) Data,single: Size:48.00MiB, Used:29.69MiB /dev/loop0 48.00MiB Metadata,DUP: Size:24.00MiB,
Re: [PATCH 05/14] Btrfs: warn_on for unaccounted spaces
On 06/27/2016 12:47 AM, Qu Wenruo wrote:
> Hi Josef,
>
> Would you please move this patch to the first of the patchset?
>
> It's making bisect quite hard, as it will always stop at this patch,
> making it hard to check whether a failure is a regression or an
> existing bug.

That's a good idea. Which workload are you having trouble with?

-chris
Re: Adventures in btrfs raid5 disk recovery
On 2016-06-25 12:44, Chris Murphy wrote:
> On Fri, Jun 24, 2016 at 12:19 PM, Austin S. Hemmelgarn wrote:
>
>> Well, the obvious major advantage that comes to mind for me to
>> checksumming parity is that it would let us scrub the parity data
>> itself and verify it.
>
> OK but hold on. During scrub, it should read data, compute checksums
> *and* parity, and compare those to what's on-disk: EXTENT_CSUM in
> the checksum tree, and the parity strip in the chunk tree. And if
> parity is wrong, then it should be replaced.

Except that's horribly inefficient. With limited exceptions involving
highly situational co-processors, computing a checksum of a parity
block is always going to be faster than computing parity for the
stripe. By using that to check parity, we can safely speed up the
common case of near zero errors during a scrub by a pretty significant
factor.

The ideal situation that I'd like to see for scrub WRT parity is:
1. Store checksums for the parity itself.
2. During scrub, if the checksum is good, the parity is good, and we
just saved the time of computing the whole parity block.
3. If the checksum is not good, then compute the parity. If the parity
just computed matches what is there already, the checksum is bad and
should be rewritten (and we should probably recompute the whole block
of checksums it's in), otherwise, the parity was bad, write out the
new parity and update the checksum.
4. Have an option to skip the csum check on the parity and always
compute it.

> Even 'check > md/sync_action' does this. So no pun intended but Btrfs
> isn't even at parity with mdadm on data integrity if it doesn't check
> if the parity matches data.

Except that MD and LVM don't have checksums to verify anything outside
of the very high-level metadata. They have to compute the parity
during a scrub because that's the _only_ way they have to check data
integrity.
Just because that's the only way for them to check it does not mean we have to follow their design, especially considering that we have other, faster ways to check it. I'd personally much rather know my parity is bad before I need to use it than after using it to reconstruct data and getting an error there, and I'd be willing to bet that most seasoned sysadmins working for companies using big storage arrays likely feel the same about it.

> That doesn't require parity csums though. It just requires computing
> parity during a scrub and comparing it to the parity on disk to make
> sure they're the same. If they aren't, assuming no other error for
> that full stripe read, then the parity block is replaced.

It does not require it, but it can make it significantly more efficient, and even a 1% increase in efficiency is a huge difference on a big array.

> So that's also something to check in the code or poke a system with a
> stick and see what happens.

I could see it being practical to have an option to turn this off for performance reasons or similar, but again, I have a feeling that most people would rather be able to check if a rebuild will eat data before trying to rebuild (depending on the situation in such a case, it will sometimes just make more sense to nuke the array and restore from a backup instead of spending time waiting for it to rebuild).

> The much bigger problem we have right now that affects Btrfs,
> LVM/mdadm md raid, is this silly bad default with non-enterprise
> drives having no configurable SCT ERC, with ensuing long recovery
> times, and the kernel SCSI command timer at 30 seconds - which
> actually also fucks over regular single disk users, because it means
> they don't get the "benefit" of long recovery times, which is the
> whole g'd point of that feature. This itself causes so many problems
> where bad sectors just get worse and don't get fixed up because of
> all the link resets.
> So I still think it's a bullshit default kernel side because it
> pretty much affects the majority use case, it is only a non-problem
> with proprietary hardware raid, and software raid using enterprise
> (or NAS specific) drives that already have short recovery times by
> default.

On this, we can agree.
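The mismatch described above (drive error recovery taking longer than the kernel's 30-second SCSI command timer, so the link gets reset before the bad sector is ever reported) is commonly worked around with one of two knobs. A hedged configuration sketch, assuming smartmontools and a SATA drive at /dev/sdX (not taken from this thread, just the usual advice):

```sh
# Option 1: shorten the drive's error recovery, if the drive supports
# SCT ERC at all (many desktop drives don't). 70 = 7.0 seconds for
# both read and write recovery.
smartctl -l scterc,70,70 /dev/sdX

# Option 2: if SCT ERC can't be set, raise the kernel's command timer
# above the drive's worst-case recovery time instead.
echo 180 > /sys/block/sdX/device/timeout
```

Either way the goal is the same: the drive gives up and reports the bad sector before the kernel resets the link, so md/btrfs can rewrite it. Neither setting survives a reboot; people typically wire this into a udev rule or boot script.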
Re: [PATCH v2 5/6] fstests: btrfs: test RAID1 device reappear and balanced
On Wed, Jun 22, 2016 at 07:01:54PM +0800, Anand Jain wrote: > > > On 06/21/2016 09:31 PM, Eryu Guan wrote: > > On Wed, Jun 15, 2016 at 04:48:47PM +0800, Anand Jain wrote: > > > From: Anand Jain> > > > > > The test does the following: > > > Initialize a RAID1 with some data > > > > > > Re-mount RAID1 degraded with _dev1_ and write up to > > > half of the FS capacity > > > > If test devices are big enough, this test consumes much longer test > > time. I tested with 15G scratch dev pool and this test ran ~200s on my > > 4vcpu 8G memory test vm. > > Right. Isn't that a good design? So that it gets tested differently > on different HW configs? Not in fstests. We should limit the run time of tests to an acceptable amount, for auto group it's within 5 minutes. > However the test time can be reduced by using smaller vdisk. I think either limit the write size or _notrun if the $max_fs_size is too big (say 30G). More comments below. > > Thanks, Anand > > > > Is it possible to limit the file size or the device size used? So it > > won't grow with device size. I'm thinking about something like > > _scratch_mkfs_sized, but that doesn't work for dev pool. > > > > > Save md5sum checkpoint1 > > > > > > Re-mount healthy RAID1 > > > > > > Let balance re-silver. > > > Save md5sum checkpoint2 > > > > > > Re-mount RAID1 degraded with _dev2_ > > > Save md5sum checkpoint3 > > > > > > Verify if all three md5sum match > > > > > > Signed-off-by: Anand Jain > > > --- > > > v2: > > > add tmp= and its rm > > > add comments to why _reload_btrfs_ko is used > > > add missing put and test_mount at notrun exit > > > use echo instead of _fail when checkpoints are checked > > > .out updated to remove Silence.. 
> > > > > > tests/btrfs/123 | 169 > > > > > > tests/btrfs/123.out | 7 +++ > > > tests/btrfs/group | 1 + > > > 3 files changed, 177 insertions(+) > > > create mode 100755 tests/btrfs/123 > > > create mode 100644 tests/btrfs/123.out > > > > > > diff --git a/tests/btrfs/123 b/tests/btrfs/123 > > > new file mode 100755 > > > index ..33decfd1c434 > > > --- /dev/null > > > +++ b/tests/btrfs/123 > > > @@ -0,0 +1,169 @@ > > > +#! /bin/bash > > > +# FS QA Test 123 > > > +# > > > +# This test verify the RAID1 reconstruction on the reappeared > > > +# device. By using the following steps: > > > +# Initialize a RAID1 with some data > > > +# > > > +# Re-mount RAID1 degraded with dev2 missing and write up to > > > +# half of the FS capacity. > > > +# Save md5sum checkpoint1 > > > +# > > > +# Re-mount healthy RAID1 > > > +# > > > +# Let balance re-silver. > > > +# Save md5sum checkpoint2 > > > +# > > > +# Re-mount RAID1 degraded with dev1 missing > > > +# Save md5sum checkpoint3 > > > +# > > > +# Verify if all three checkpoints match > > > +# > > > +#- > > > +# Copyright (c) 2016 Oracle. All Rights Reserved. > > > +# > > > +# This program is free software; you can redistribute it and/or > > > +# modify it under the terms of the GNU General Public License as > > > +# published by the Free Software Foundation. > > > +# > > > +# This program is distributed in the hope that it would be useful, > > > +# but WITHOUT ANY WARRANTY; without even the implied warranty of > > > +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the > > > +# GNU General Public License for more details. 
> > > +# > > > +# You should have received a copy of the GNU General Public License > > > +# along with this program; if not, write the Free Software Foundation, > > > +# Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA > > > +#- > > > +# > > > + > > > +seq=`basename $0` > > > +seqres=$RESULT_DIR/$seq > > > +echo "QA output created by $seq" > > > + > > > +here=`pwd` > > > +tmp=/tmp/$$ > > > +status=1 # failure is the default! > > > +trap "_cleanup; exit \$status" 0 1 2 3 15 > > > + > > > +_cleanup() > > > +{ > > > + cd / > > > + rm -f $tmp.* > > > +} > > > + > > > +# get standard environment, filters and checks > > > +. ./common/rc > > > +. ./common/filter > > > + > > > +# remove previous $seqres.full before test > > > +rm -f $seqres.full > > > + > > > +# real QA test starts here > > > + > > > +_supported_fs btrfs > > > +_supported_os Linux > > > +_require_scratch_nocheck > > > > Why don't check filesystem after test? A comment would be good if > > there's a good reason. Patch 6 needs it as well :) And can you please add comments on _require_scratch_nocheck in this patch and patch 6, and rebase the whole series after Dave pushed my pull request(on 06-25) to upstream, and resend? Thanks, Eryu > > > > > +_require_scratch_dev_pool 2 > > > + > > > +# the mounted test dir prevent btrfs unload, we need to unmount > > > +_test_unmount > > >
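Eryu's suggestion above (limit the write size so runtime doesn't scale with device size) can be sketched as a small helper. This is a hedged illustration, not fstests code: in a real test the available space would come from an fstests helper rather than a function parameter, and the 2GiB ceiling is an arbitrary pick.

```sh
#!/bin/bash
# Sketch: cap the degraded-mount write at min(half the fs, fixed
# ceiling) so the test doesn't grow with SCRATCH_DEV_POOL size.
capped_write_kb()
{
	local fs_avail_kb=$1
	local max_kb=$(( 2 * 1024 * 1024 ))	# 2GiB ceiling (arbitrary)
	local want_kb=$(( fs_avail_kb / 2 ))
	[ "$want_kb" -gt "$max_kb" ] && want_kb=$max_kb
	echo "$want_kb"
}
```

On a 15G pool this writes ~half the fs as before; on a 1T pool it stops at the ceiling, which addresses the ~200s runtime complaint without a _notrun.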
Re: Rescue a single-device btrfs instance with zeroed tree root
On 2016-06-21 at 20:23 +0300, Ivan Shapovalov wrote:
> Hello,
>
> So this is another case of "I lost my partition and do not have
> backups". More precisely, _this_ is the backup and it turned out to be
> damaged.
>
> (The backup was made by partclone.btrfs. Together with a zeroed out
> tree root, this asks for a bug in partclone...)
>
> So: the tree root is zeroes, backup roots are zeroes too,
> btrfs-find-root only reports blocks of level 0 (needed is 1).
> Is there something that can be done? Maybe it is possible to
> reconstruct the root from its children?
> Operations log following.
> Please Cc: me in replies as I'm not subscribed to the list.

Anyone? I'd appreciate any advice on how to rebuild the tree roots. I'd even write some code if someone tells me the disk format and logical tree constraints (i.e. in which order to put pointers to child nodes).

BTW, it looks like all tree roots are lost, i.e. btrfs-find-root with any objectid (extent tree, subvolume tree, anything in ctree.h) finds only level 0 nodes. It should be possible to rebuild intermediate nodes, isn't it? Do they contain valuable information?

Please Cc: me in replies as I'm not subscribed to the list.

Thanks,
--
Ivan Shapovalov / intelfx /

> 1. regular mount
>
> # mount /dev/loop0p3 /mnt/temp
>
> === dmesg ===
> [106737.299592] BTRFS info (device loop0p3): disk space caching is enabled
> [106737.299604] BTRFS: has skinny extents
> [106737.299884] BTRFS error (device loop0p3): bad tree block start 0 162633449472
> [106737.299888] BTRFS: failed to read tree root on loop0p3
> [106737.314359] BTRFS: open_ctree failed
> === end dmesg ===
>
> 2. mount with -o recovery
>
> # mount -o recovery /dev/loop0p3 /mnt/temp
>
> === dmesg ===
> [106742.305720] BTRFS warning (device loop0p3): 'recovery' is deprecated, use 'usebackuproot' instead
> [106742.305722] BTRFS info (device loop0p3): trying to use backup root at mount time
> [106742.305724] BTRFS info (device loop0p3): disk space caching is enabled
> [106742.305725] BTRFS: has skinny extents
> [106742.306056] BTRFS error (device loop0p3): bad tree block start 0 162633449472
> [106742.306060] BTRFS: failed to read tree root on loop0p3
> [106742.306069] BTRFS error (device loop0p3): bad tree block start 0 162633449472
> [106742.306071] BTRFS: failed to read tree root on loop0p3
> [106742.306084] BTRFS error (device loop0p3): bad tree block start 0 162632237056
> [106742.306086] BTRFS: failed to read tree root on loop0p3
> [106742.306097] BTRFS error (device loop0p3): bad tree block start 0 162626682880
> [106742.306100] BTRFS: failed to read tree root on loop0p3
> [106742.306111] BTRFS error (device loop0p3): bad tree block start 0 162609168384
> [106742.306114] BTRFS: failed to read tree root on loop0p3
> [106742.327272] BTRFS: open_ctree failed
> === end dmesg ===
>
> 3. btrfs-find-root
>
> # btrfs-find-root /dev/loop0p3
> Couldn't read tree root
> Superblock thinks the generation is 22332
> Superblock thinks the level is 1
> Well block 162633646080(gen: 22332 level: 0) seems good, but generation/level doesn't match, want gen: 22332 level: 1
> Well block 162633596928(gen: 22332 level: 0) seems good, but generation/level doesn't match, want gen: 22332 level: 1
> Well block 162633515008(gen: 22332 level: 0) seems good, but generation/level doesn't match, want gen: 22332 level: 1
>
> Thanks,
Re: [PATCH 4/4] fstests: btrfs/126,127,128: test feature ioctl and sysfs interfaces
On Fri, Jun 24, 2016 at 11:08:34AM -0400, je...@suse.com wrote: > From: Jeff Mahoney> > This tests the exporting of feature information from the kernel via > sysfs and ioctl. The first test works whether the sysfs permissions > are correct, if the information exported via sysfs matches > what the ioctls are reporting, and if they both match the on-disk > superblock's version of the feature sets. The second and third tests > test online setting and clearing of feature bits via the sysfs and > ioctl interfaces, checking whether they match the on-disk super on > each cycle. > > In every case, if the features are not present, it is not considered > a failure and a message indicating that will be dumped to the $num.full > file. > > Signed-off-by: Jeff Mahoney > --- > tests/btrfs/126 | 269 > > tests/btrfs/126.out | 2 + > tests/btrfs/127 | 185 > tests/btrfs/127.out | 2 + > tests/btrfs/128 | 178 ++ > tests/btrfs/128.out | 2 + > tests/btrfs/group | 3 + > 7 files changed, 641 insertions(+) > create mode 100755 tests/btrfs/126 > create mode 100644 tests/btrfs/126.out > create mode 100755 tests/btrfs/127 > create mode 100644 tests/btrfs/127.out > create mode 100755 tests/btrfs/128 > create mode 100644 tests/btrfs/128.out > > diff --git a/tests/btrfs/126 b/tests/btrfs/126 > new file mode 100755 > index 000..3d660c5 > --- /dev/null > +++ b/tests/btrfs/126 > @@ -0,0 +1,269 @@ > +#!/bin/bash > +# FA QA Test No. 126 > +# > +# Test online feature publishing > +# > +# This test doesn't test the changing of features. It does test that > +# the proper publishing bits and permissions match up with > +# the expected values. > +# > +#--- > +# Copyright (c) 2013 SUSE, All Rights Reserved. Copyright year 2016. > +# > +# This program is free software; you can redistribute it and/or > +# modify it under the terms of the GNU General Public License as > +# published by the Free Software Foundation. 
> +# > +# This program is distributed in the hope that it would be useful, > +# but WITHOUT ANY WARRANTY; without even the implied warranty of > +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the > +# GNU General Public License for more details. > +# > +# You should have received a copy of the GNU General Public License > +# along with this program; if not, write the Free Software Foundation, > +# Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA > +#--- > + > +seq=$(basename $0) > +seqres=$RESULT_DIR/$seq > +echo "== QA output created by $seq" > + > +here=$(pwd) > +tmp=/tmp/$$ > +status=1 Missing _cleanup() and trap, use './new btrfs' to create new btrfs tests. > + > +# get standard environment, filters and checks > +. ./common/rc > +. ./common/filter.btrfs > + > +_supported_fs btrfs > +_supported_os Linux > +_require_scratch > +_require_command $BTRFS_SHOW_SUPER_PROG _require_command "$BTRFS_SHOW_SUPER_PROG" btrfs-show-super > + > +_scratch_mkfs > /dev/null 2>&1 > +_scratch_mount > + > +check_features() { "{" on a separate line > + reserved="$2" > + method="$3" > + if [ "$1" != 0 ]; then > + echo "$method: failed: $reserved" > + exit 1 > + fi No need to check return value. > +if [ "$reserved" = "Not implemented." ]; then > +echo "Skipping ioctl test. Not implemented." >> $seqres.full > +return > +fi Call _notrun if ioctl not implemented. Do the check before actual test starts. And you're mixing spaces and tabs for indentation in this function. 
> +is_writeable() { "{" on a separate line > + local file=$1 > + mode=$(stat -c "0%a" "$file") > + mode=$(( $mode & 0200 )) > + > + [ "$mode" -eq 0 ] && return 1 > + return 0 > +} > + > +# ioctl > +read -a features < <(src/btrfs_ioctl_helper $SCRATCH_MNT GET_FEATURES 2>&1) > +check_features $? "$features" "GET_FEATURES" > + > +test_ioctl=true > +[ "${features[*]}" = "Not implemented." ] && test_ioctl=false > + > +read -a supp_features < <(src/btrfs_ioctl_helper $SCRATCH_MNT > GET_SUPPORTED_FEATURES 2>&1) > +check_features $? "$supp_features" "GET_SUPPORTED_FEATURES" > +[ "${supp_features[*]}" = "Not implemented." ] && test_ioctl=false These checks are not needed if the test was checked and _notrun properly before test. > + > +# Sysfs checks > +fsid=$(_btrfs_get_fsid $SCRATCH_DEV) > +sysfs_base="/sys/fs/btrfs" > + > +# TODO Add tool to enable and test unknown feature bits >
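Putting Eryu's review comments together, a reworked check_features() might look like the sketch below. This is a hedged illustration, not the actual v2 of the patch: _notrun is stubbed here (in fstests it comes from common/rc), the return-value check is dropped, the brace moves to its own line, and indentation is tabs throughout.

```sh
#!/bin/bash
# Stub of fstests' _notrun for a self-contained sketch.
_notrun()
{
	echo "notrun: $*"
	exit 0
}

# Reworked per review: no return-value check, _notrun on an
# unimplemented ioctl, "{" on its own line, tab indentation.
check_features()
{
	local reserved="$1"
	local method="$2"

	if [ "$reserved" = "Not implemented." ]; then
		_notrun "$method ioctl not implemented"
	fi
	echo "$method: ok"
}

check_features "compat:0x0" "GET_FEATURES"	# prints "GET_FEATURES: ok"
```

With this shape the "Not implemented." probing can run once before the real test starts, as the review asks, instead of being re-checked after every helper invocation.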
Re: [PATCH 3/4] fstests: btrfs/125: test sysfs exports of allocation and device membership info
On Fri, Jun 24, 2016 at 11:08:33AM -0400, je...@suse.com wrote: > From: Jeff Mahoney> > This tests the sysfs publishing for btrfs allocation and device > membership info under a number of different layouts, similar to the > btrfs replace test. We test the allocation files only for existence and > that they contain numerical values. We test the device membership > by mapping the devices used to create the file system to sysfs paths > and matching them against the paths used for the device membership > symlinks. > > It passes on kernels without a /sys/fs/btrfs/ directory. > > Signed-off-by: Jeff Mahoney > --- > common/config | 4 +- > common/rc | 7 ++ > tests/btrfs/125 | 193 > > tests/btrfs/125.out | 2 + > tests/btrfs/group | 1 + > 5 files changed, 205 insertions(+), 2 deletions(-) > create mode 100755 tests/btrfs/125 > create mode 100644 tests/btrfs/125.out > > diff --git a/common/config b/common/config > index c25b1ec..c5e65f7 100644 > --- a/common/config > +++ b/common/config > @@ -201,13 +201,13 @@ export DEBUGFS_PROG="`set_prog_path debugfs`" > # newer systems have udevadm command but older systems like RHEL5 don't. > # But if neither one is available, just set it to "sleep 1" to wait for lv to > # be settled > -UDEV_SETTLE_PROG="`set_prog_path udevadm`" > +UDEVADM_PROG="`set_prog_path udevadm`" > if [ "$UDEV_SETTLE_PROG" == "" ]; then $UDEVADM_PROG should be checked here, not $UDEV_SETTLE_PROG anymore. 
> # try udevsettle command > UDEV_SETTLE_PROG="`set_prog_path udevsettle`" > else > # udevadm is available, add 'settle' as subcommand > - UDEV_SETTLE_PROG="$UDEV_SETTLE_PROG settle" > + UDEV_SETTLE_PROG="$UDEVADM_PROG settle" > fi > # neither command is available, use sleep 1 > if [ "$UDEV_SETTLE_PROG" == "" ]; then > diff --git a/common/rc b/common/rc > index 4b05fcf..f4c4312 100644 > --- a/common/rc > +++ b/common/rc > @@ -76,6 +76,13 @@ _btrfs_get_subvolid() > $BTRFS_UTIL_PROG sub list $mnt | grep $name | awk '{ print $2 }' > } > > +_btrfs_get_feature_flags() > +{ > + local dev=$1 > + local class=$2 > + $BTRFS_SHOW_SUPER_PROG $dev | grep ^${class}_flags | awk '{print $NF}' > +} > + > _btrfs_get_fsid() > { > local dev=$1 > diff --git a/tests/btrfs/125 b/tests/btrfs/125 > new file mode 100755 > index 000..83f1921 > --- /dev/null > +++ b/tests/btrfs/125 > @@ -0,0 +1,193 @@ > +#! /bin/bash > +# FS QA Test No. 125 > +# > +# Test of the btrfs sysfs publishing > +# > +#--- > +# Copyright (C) 2013-2016 SUSE. All rights reserved. Copyright year is 2016. > +# > +# This program is free software; you can redistribute it and/or > +# modify it under the terms of the GNU General Public License as > +# published by the Free Software Foundation. > +# > +# This program is distributed in the hope that it would be useful, > +# but WITHOUT ANY WARRANTY; without even the implied warranty of > +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the > +# GNU General Public License for more details. > +# > +# You should have received a copy of the GNU General Public License > +# along with this program; if not, write the Free Software Foundation, > +# Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA > +# > +#--- > +# > + > +seq=`basename $0` > +seqres=$RESULT_DIR/$seq > +echo "== QA output created by $seq" > + > +here=`pwd` > +tmp=/tmp/$$ > +status=1 Missing _cleanup and trap, use "./new btrfs" to generate new btrfs test. 
> + > +# get standard environment, filters and checks > +. ./common/rc > +. ./common/filter > + > +# real QA test starts here > +_supported_fs btrfs Missing "_supported_os" call. Following the template create by the './new' script makes it easier :) > +_require_scratch > +_require_scratch_dev_pool > +_require_command "$UDEVADM_PROG" We usually provide a program name as a second param _require_command "$UDEVADM_PROG" udevadm > + > +rm -f $seqres.full > +rm -f $tmp.tmp This should be "rm -f $tmp.*" and belongs to _cleanup() > + > +check_file() { > + local file=$1 > + base="$(echo "$file" | sed -e 's#/sys/fs/btrfs/[0-9a-f-][0-9a-f-]*/##')" > + if [ ! -f "$file" ]; then > + echo "$base missing." > + return 0 No need to return 0/1 based on failure/pass, because check_chunk() doesn't need to exit on failure. > + else > + value="$(cat $file)" > + if [ -n "$(echo $value | tr -d 0-9)" ]; then > + echo "ERROR: $base: numerical value expected" \ > + "(got $value)" > + return 0 > + fi > + fi > + return 1 > +} > + > +check_chunk() { > + path=$1 > + mkfs_options=$2 > + error=false > + > +
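The common/config fix the review asks for (test $UDEVADM_PROG for emptiness, not the not-yet-assigned $UDEV_SETTLE_PROG) could look like this. Hedged sketch: set_prog_path is stubbed with `command -v` so the block is self-contained.

```sh
#!/bin/bash
# Stub of fstests' set_prog_path for illustration.
set_prog_path() { command -v "$1" 2>/dev/null; }

UDEVADM_PROG="`set_prog_path udevadm`"
if [ "$UDEVADM_PROG" == "" ]; then
	# udevadm missing: try the older udevsettle command
	UDEV_SETTLE_PROG="`set_prog_path udevsettle`"
else
	# udevadm is available, add 'settle' as subcommand
	UDEV_SETTLE_PROG="$UDEVADM_PROG settle"
fi
# neither command is available, just wait a second
if [ "$UDEV_SETTLE_PROG" == "" ]; then
	UDEV_SETTLE_PROG="sleep 1"
fi
echo "$UDEV_SETTLE_PROG"
```

The original diff's bug was subtle: after the rename, the first emptiness check always succeeded (the variable was never set), so udevadm systems silently fell through to the udevsettle branch.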
Re: [PATCH 2/4] fstests: btrfs/124: test global metadata reservation reporting
On Mon, Jun 27, 2016 at 03:16:47PM +0800, Eryu Guan wrote: > On Fri, Jun 24, 2016 at 11:08:32AM -0400, je...@suse.com wrote: > > From: Jeff Mahoney> > [snip] > > + > > +# get standard environment, filters and checks > > +. ./common/rc > > +. ./common/filter.btrfs > > + > > +_supported_fs btrfs > > +_supported_os Linux > > +_require_scratch > > + > > +_scratch_mkfs > /dev/null 2>&1 > > +_scratch_mount > > There should be some kind of "_require_xxx" or something like that to > _notrun if current running kernel doesn't have global metadata > reservation report implemented. Also need a _require_test_program call to make sure btrfs_ioctl_helper is built and in src/ dir. _require_test_program "btrfs_ioctl_helper" Sorry, I missed it in first review. Thanks, Eryu
Re: [PATCH 2/4] fstests: btrfs/124: test global metadata reservation reporting
On Fri, Jun 24, 2016 at 11:08:32AM -0400, je...@suse.com wrote: > From: Jeff Mahoney> > Btrfs can now report the size of the global metadata reservation > via ioctl and sysfs. > > This test confirms that we get sane results on an empty file system. > > ENOTTY and missing /sys/fs/btrfs//allocation are not considered > failures. > > Signed-off-by: Jeff Mahoney I'm reviewing mainly from the fstests perspective, need help from other btrfs developers to review the test itself to see if it's a valid & useful test. > --- > common/rc| 6 ++ > src/Makefile | 3 +- > src/btrfs_ioctl_helper.c | 220 > +++ > tests/btrfs/124 | 90 +++ > tests/btrfs/124.out | 2 + > tests/btrfs/group| 1 + > 6 files changed, 321 insertions(+), 1 deletion(-) > create mode 100644 src/btrfs_ioctl_helper.c > create mode 100755 tests/btrfs/124 > create mode 100644 tests/btrfs/124.out > > diff --git a/common/rc b/common/rc > index 3a9c4d1..4b05fcf 100644 > --- a/common/rc > +++ b/common/rc > @@ -76,6 +76,12 @@ _btrfs_get_subvolid() > $BTRFS_UTIL_PROG sub list $mnt | grep $name | awk '{ print $2 }' > } > > +_btrfs_get_fsid() > +{ > + local dev=$1 > + $BTRFS_UTIL_PROG filesystem show $dev|awk '/uuid:/ {print $NF}' > +} > + > # Prints the md5 checksum of a given file > _md5_checksum() > { > diff --git a/src/Makefile b/src/Makefile > index 1bf318b..c467475 100644 > --- a/src/Makefile > +++ b/src/Makefile > @@ -20,7 +20,8 @@ LINUX_TARGETS = xfsctl bstat t_mtab getdevicesize > preallo_rw_pattern_reader \ > bulkstat_unlink_test_modified t_dir_offset t_futimens t_immutable \ > stale_handle pwrite_mmap_blocked t_dir_offset2 seek_sanity_test \ > seek_copy_test t_readdir_1 t_readdir_2 fsync-tester nsexec cloner \ > - renameat2 t_getcwd e4compact test-nextquota punch-alternating > + renameat2 t_getcwd e4compact test-nextquota punch-alternating \ > + btrfs_ioctl_helper .gitignore needs an entry for new binary. 
But I'm wondering that is this something that can be added to btrfs-progs, either as part of the btrfs command or a seperate command? > > SUBDIRS = > > diff --git a/src/btrfs_ioctl_helper.c b/src/btrfs_ioctl_helper.c > new file mode 100644 > index 000..4344bdc > --- /dev/null > +++ b/src/btrfs_ioctl_helper.c > @@ -0,0 +1,220 @@ > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > + > +#ifndef BTRFS_IOCTL_MAGIC > +#define BTRFS_IOCTL_MAGIC 0x94 > +#endif > + > +#ifndef BTRFS_IOC_SPACE_INFO > +struct btrfs_ioctl_space_info { > +uint64_t flags; > +uint64_t total_bytes; > +uint64_t used_bytes; > +}; > + > +struct btrfs_ioctl_space_args { > +uint64_t space_slots; > +uint64_t total_spaces; > +struct btrfs_ioctl_space_info spaces[0]; > +}; > +#define BTRFS_IOC_SPACE_INFO _IOWR(BTRFS_IOCTL_MAGIC, 20, \ > +struct btrfs_ioctl_space_args) > +#endif > +#ifndef BTRFS_SPACE_INFO_GLOBAL_RSV > +#define BTRFS_SPACE_INFO_GLOBAL_RSV(1ULL << 49) > +#endif > + > +#ifndef BTRFS_IOC_GET_FEATURES > +struct btrfs_ioctl_feature_flags { > + uint64_t compat_flags; > + uint64_t compat_ro_flags; > + uint64_t incompat_flags; > +}; > + > +#define BTRFS_IOC_GET_FEATURES _IOR(BTRFS_IOCTL_MAGIC, 57, \ > + struct btrfs_ioctl_feature_flags) > +#define BTRFS_IOC_SET_FEATURES _IOW(BTRFS_IOCTL_MAGIC, 57, \ > + struct btrfs_ioctl_feature_flags[2]) > +#define BTRFS_IOC_GET_SUPPORTED_FEATURES _IOR(BTRFS_IOCTL_MAGIC, 57, \ > + struct btrfs_ioctl_feature_flags[3]) > +#endif > + > +static int global_rsv_ioctl(int fd, int argc, char *argv[]) > +{ > + struct btrfs_ioctl_space_args arg; > + struct btrfs_ioctl_space_args *args; > + int ret; > + int i; > + size_t size; > + > + arg.space_slots = 0; > + > + ret = ioctl(fd, BTRFS_IOC_SPACE_INFO, ); > + if (ret) > + return -errno; > + > + size = sizeof(*args) + sizeof(args->spaces[0]) * arg.total_spaces; > + args = malloc(size); > + if (!args) > + return -ENOMEM; > + > + args->space_slots = arg.total_spaces; > + > + ret = 
ioctl(fd, BTRFS_IOC_SPACE_INFO, args); > + if (ret) > + return -errno; > + > + for (i = 0; i < args->total_spaces; i++) { > + if (args->spaces[i].flags & BTRFS_SPACE_INFO_GLOBAL_RSV) { > + unsigned long long reserved; > + reserved = args->spaces[i].total_bytes; > + printf("%llu\n", reserved); > + return 0; > + } > + } > + > + return -ENOENT; > +} > + > +static int get_features_ioctl(int fd, int argc, char *argv[]) > +{ > + struct
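The helper's hand-rolled fallback macros can be sanity-checked by recomputing the ioctl request number from the quoted definitions. A sketch using the x86 _IOC encoding (dir<<30 | size<<16 | type<<8 | nr, with dir = read|write = 3; other architectures encode the direction bits differently):

```sh
#!/bin/bash
# Recompute BTRFS_IOC_SPACE_INFO = _IOWR(0x94, 20, sizeof(args)).
iowr()
{
	# args: type nr size -- x86 _IOWR encoding
	echo $(( (3 << 30) | ($3 << 16) | ($1 << 8) | $2 ))
}

# struct btrfs_ioctl_space_args header = two u64 fields = 16 bytes
BTRFS_IOC_SPACE_INFO=$(iowr $(( 0x94 )) 20 16)
printf '0x%x\n' "$BTRFS_IOC_SPACE_INFO"	# 0xc0109414
```

This kind of cross-check is worth doing whenever ioctl numbers are open-coded in a test helper, since a wrong struct size silently produces a different request number and an ENOTTY that looks like "not implemented".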
Re: Bad hard drive - checksum verify failure forces readonly mount
On Sun, 26-06-2016 at 13:54 -0600, Chris Murphy wrote: > On Sun, Jun 26, 2016 at 7:05 AM, Vasco Almeida> wrote: > > I have tried "btrfs check --repair /device" but that seems do not > > do > > any good. > > http://paste.fedoraproject.org/384960/66945936/ > > It did fix things, in particular with the snapshot that was having > problems being dropped. But it's not enough it seems to prevent it > from going read only. > > There's more than one bug here, you might see if the repair was good > enough that it's possible to use btrfs-image now. File system image available at (choose one link) https://mega.nz/#!AkAEgKyB!RUa7G5xHIygWm0ALx5ZxQjjXNdFYa7lDRHJ_sW0bWLs https://www.sendspace.com/file/i70cft > If not, use > btrfs-debug-tree > file.txt and post that file somewhere. This > does expose file names. Maybe that'll shed some light on the problem. > But also worth filing a bug at bugzilla.kernel.org with this debug > tree referenced (probably too big to attach), maybe a dev will be > able > to look at it and improve things so they don't fail. Should I file a bug report with that image dump linked above or btrfs-debug-tree output or both? I think I will use the subject of this thread as summary to file the bug. Can you think of something more suitable or is that fine? > > What else can I do or I must rebuild the file system? > > Well, it's a long shot but you could try using --repair --init-csum > which will create a new csum tree. But that applies to data, if the > problem with it going read only is due to metadata corruption this > won't help. And then last you could try --init-extent-tree. Thing I > can't answer is which order to do it in. > > In any case there will be files that you shouldn't trust after csum > has been recreated, anything corrupt will now have a new csum, so you > can get silent data corruption. It's better to just blow away this > file system and make a new one and reinstall the OS. 
But if you're > feeling brave, you can try one or both of those additional options > and > see if they can help. I think I will reinstall the OS since, even if I manage to recover the file system from this issue, that OS will be something I can not trust fully.
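For reference, the shorthand flags discussed in this thread map onto btrfs-progs roughly as below. A hedged sketch of the escalation ladder, with the caveat Chris gives: the right ordering of the last two steps is an open question, every step past the first is destructive, and recreated csums can hide silent data corruption.

```sh
# Read-only assessment first; nothing below should be run casually.
btrfs check /dev/sdX

# Metadata repair (already attempted in this thread):
btrfs check --repair /dev/sdX

# Rebuild the checksum tree ("--init-csum" in the mail); corrupt data
# gets fresh, valid-looking csums afterwards:
btrfs check --repair --init-csum-tree /dev/sdX

# Last resort: rebuild the extent tree:
btrfs check --repair --init-extent-tree /dev/sdX
```

If any of these are needed, the thread's conclusion stands: restore from backup or reinstall rather than trusting the repaired filesystem.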