Re: [PATCH] fstests: btrfs: Regression test for leaking data reserved space

2016-06-27 Thread Eryu Guan
On Tue, Jun 28, 2016 at 09:54:51AM +0800, Qu Wenruo wrote:
> When btrfs hits EDQUOTA while reserving data space, it leaks the
> already reserved data space.
> 
> This test case checks it by using the stricter enospc_debug mount
> option to trigger a kernel warning at umount time.
> 
> Signed-off-by: Qu Wenruo 

Looks good to me. Tested on x86_64 and ppc64 hosts; the x86_64 host
failed the test as expected, though the ppc64 host didn't.

Reviewed-by: Eryu Guan 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Adventures in btrfs raid5 disk recovery

2016-06-27 Thread Zygo Blaxell
On Mon, Jun 27, 2016 at 08:39:21PM -0600, Chris Murphy wrote:
> On Mon, Jun 27, 2016 at 7:52 PM, Zygo Blaxell
>  wrote:
> > On Mon, Jun 27, 2016 at 04:30:23PM -0600, Chris Murphy wrote:
> >> Btrfs does have something of a work around for when things get slow,
> >> and that's balance, read and rewrite everything. The write forces
> >> sector remapping by the drive firmware for bad sectors.
> >
> > It's a crude form of "resilvering" as ZFS calls it.
> 
> In what manner is it crude?

Balance relocates extents, looks up backrefs, and rewrites metadata, all
of which are extra work above what is required by resilvering (and extra
work that is proportional to the number of backrefs and the (currently
extremely poor) performance of the backref walking code, so snapshots
and large files multiply the workload).

Resilvering should just read data, reconstruct it from a mirror if
necessary, and write it back to the original location (or read one
mirror and rewrite the other).  That's more like what scrub does, except
scrub rewrites only the blocks it couldn't read (or that failed csum).

> > Last time I checked all the RAID implementations on Linux (ok, so that's
> > pretty much just md-raid) had some sort of repair capability.
> 
> You can read man 4 md, and you can also look on linux-raid@, it's very
> clearly necessary for the drive to report a read or write error
> explicitly with LBA for md to do repairs. If there are link resets,
> bad sectors accumulate and the obvious inevitably happens.

I am looking at the md code.  It looks at ->bi_error, and nothing else as
far as I can tell.  It doesn't even care if the error is EIO--any non-zero
return value from the lower bio layer seems to trigger automatic recovery.





[RFC] Btrfs: add asynchronous compression support in zlib

2016-06-27 Thread Weigang Li
This patch introduces a change in zlib.c to use the new asynchronous
compression API (acomp) proposed in cryptodev (work in progress):
https://patchwork.kernel.org/patch/9163577/
With it, BTRFS can offload zlib (de)compression to a hardware accelerator
engine if an acomp hardware driver is registered in LKCF; the advantage
of using acomp is saving CPU cycles and increasing disk IO by hardware
offloading.
The input pages (up to 32) are added to an sg-list and sent to acomp in
one request. As it is an asynchronous call, the thread is put to sleep,
freeing up the CPU; once compression is done, the callback is triggered
and the thread is woken up.
This patch doesn't change the BTRFS disk format, which means files
compressed by the hardware engine can be decompressed by the zlib
software library, and vice versa.
The previous synchronous zlib (de)compression path is unchanged in the
current implementation, but eventually the two can be unified behind the
acomp API in LKCF.

Signed-off-by: Weigang Li 
---
 fs/btrfs/zlib.c | 206 
 1 file changed, 206 insertions(+)

diff --git a/fs/btrfs/zlib.c b/fs/btrfs/zlib.c
index 82990b8..957e603 100644
--- a/fs/btrfs/zlib.c
+++ b/fs/btrfs/zlib.c
@@ -31,6 +31,8 @@
 #include 
 #include 
 #include "compression.h"
+#include 
+#include 
 
 struct workspace {
z_stream strm;
@@ -38,6 +40,11 @@ struct workspace {
struct list_head list;
 };
 
+struct acomp_res {
+   struct completion *completion;
+   int *ret;
+};
+
 static void zlib_free_workspace(struct list_head *ws)
 {
struct workspace *workspace = list_entry(ws, struct workspace, list);
@@ -71,6 +78,119 @@ fail:
return ERR_PTR(-ENOMEM);
 }
 
+static void acomp_op_done(struct crypto_async_request *req, int err)
+{
+   struct acomp_res *res = req->data;
+   *res->ret = err;
+   complete(res->completion);
+}
+
+static int zlib_compress_pages_async(struct address_space *mapping,
+   u64 start, unsigned long len,
+   struct page **pages,
+   unsigned long nr_dest_pages,
+   unsigned long *out_pages,
+   unsigned long *total_in,
+   unsigned long *total_out,
+   unsigned long max_out)
+{
+   int ret, acomp_ret = -1, i = 0;
+   int nr_pages = 0;
+   struct page *out_page = NULL;
+   struct crypto_acomp *tfm = NULL;
+   struct acomp_req *req = NULL;
+   struct completion completion;
+   unsigned int nr_src_pages = 0, nr_dst_pages = 0, nr = 0;
+   struct sg_table *in_sg = NULL, *out_sg = NULL;
+   struct page **src_pages = NULL;
+   struct acomp_res res;
+
+   *out_pages = 0;
+   *total_out = 0;
+   *total_in = 0;
+
+   init_completion(&completion);
+   nr_src_pages = (len + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
+   src_pages = kcalloc(nr_src_pages, sizeof(struct page *), GFP_KERNEL);
+   nr = find_get_pages(mapping, start >> PAGE_CACHE_SHIFT,
+   nr_src_pages, src_pages);
+   if (nr != nr_src_pages) {
+   ret = -ENOMEM;
+   goto out;
+   }
+
+   in_sg = kcalloc(1, sizeof(*in_sg), GFP_KERNEL);
+   ret = sg_alloc_table_from_pages(in_sg, src_pages, nr_src_pages,
+   0, len, GFP_KERNEL);
+   if (ret)
+   goto out;
+
+   /* pre-alloc dst pages, with same size as src */
+   nr_dst_pages =  nr_src_pages;
+   for (i = 0; i < nr_dst_pages; i++) {
+   out_page = alloc_page(GFP_NOFS | __GFP_HIGHMEM);
+   if (!out_page) {
+   ret = -ENOMEM;
+   goto out;
+   }
+   pages[i] = out_page;
+   }
+
+   out_sg = kcalloc(1, sizeof(*out_sg), GFP_KERNEL);
+
+   ret = sg_alloc_table_from_pages(out_sg, pages, nr_dst_pages, 0,
+   (nr_dst_pages << PAGE_CACHE_SHIFT), GFP_KERNEL);
+   if (ret)
+   goto out;
+
+   tfm = crypto_alloc_acomp("zlib_deflate", 0, 0);
+   req = acomp_request_alloc(tfm, GFP_KERNEL);
+   acomp_request_set_params(req, in_sg->sgl, out_sg->sgl, len,
+   nr_dst_pages << PAGE_CACHE_SHIFT);
+
+   res.completion = &completion;
+   res.ret = &acomp_ret;
+   acomp_request_set_callback(req, CRYPTO_TFM_REQ_MAY_BACKLOG,
+   acomp_op_done, &res);
+   ret = crypto_acomp_compress(req);
+   if (ret == -EINPROGRESS) {
+   ret = wait_for_completion_timeout(&completion, 5000);
+   if (ret == 0) { /* timeout */
+   ret = -1;
+   goto out;
+   }
+   }
+
+   ret = *res.ret;
+   *total_in = len;
+   *total_out = 

Re: Adventures in btrfs raid5 disk recovery

2016-06-27 Thread Chris Murphy
On Mon, Jun 27, 2016 at 7:52 PM, Zygo Blaxell
 wrote:
> On Mon, Jun 27, 2016 at 04:30:23PM -0600, Chris Murphy wrote:
>> On Mon, Jun 27, 2016 at 3:57 PM, Zygo Blaxell
>>  wrote:
>> > On Mon, Jun 27, 2016 at 10:17:04AM -0600, Chris Murphy wrote:
>> > If anything, I want the timeout to be shorter so that upper layers with
>> > redundancy can get an EIO and initiate repair promptly, and admins can
>> > get notified to evict chronic offenders from their drive slots, without
>> > having to pay extra for hard disk firmware with that feature.
>>
>> The drive totally thwarts this. It doesn't report back to the kernel
>> what command is hung, as far as I'm aware. It just hangs and goes into
>> a so-called "deep recovery"; there is no way to know what sector is
>> causing the problem
>
> I'm proposing just treat the link reset _as_ an EIO, unless transparent
> link resets are required for link speed negotiation or something.

That's not one EIO, that's possibly 31 items in the command queue that
get knocked over when the link is reset. I don't have the expertise to
know whether it's sane to interpret many EIO all at once as an
implicit indication of bad sectors. Off hand I think that's probably
specious.

> The drive wouldn't be thwarting anything, the host would just ignore it
> (unless the drive doesn't respond to a link reset until after its internal
> timeout, in which case nothing is saved by shortening the timeout).
>
>> until the drive reports a read error, which will
>> include the affected sector LBA.
>
> It doesn't matter which sector.  Chances are good that it was more than
> one of the outstanding requested sectors anyway.  Rewrite them all.

*shrug* even if valid, it only helps the raid 1+ cases. It does
nothing to help raid0, linear/concat, or single device deployments.
Those users also deserve to have access to their data, if the drive
can recover it by giving it enough time to do so.


> We know which sectors they are because somebody has an IO operation
> waiting for a status on each of them (unless they're using AIO or some
> other API where a request can be fired at a hard drive and the reply
> discarded).  Notify all of them that their IO failed and move on.

Dunno, maybe.


>
>> Btrfs does have something of a work around for when things get slow,
>> and that's balance, read and rewrite everything. The write forces
>> sector remapping by the drive firmware for bad sectors.
>
> It's a crude form of "resilvering" as ZFS calls it.

In what manner is it crude?




> If btrfs sees EIO from a lower block layer it will try to reconstruct the
> missing data (but not repair it).  If that happens during a scrub,
> it will also attempt to rewrite the missing data over the original
> offending sectors.  This happens every few months in my server pool,
> and seems to be working even on btrfs raid5.
>
> Last time I checked all the RAID implementations on Linux (ok, so that's
> pretty much just md-raid) had some sort of repair capability.

You can read man 4 md, and you can also look on linux-raid@, it's very
clearly necessary for the drive to report a read or write error
explicitly with LBA for md to do repairs. If there are link resets,
bad sectors accumulate and the obvious inevitably happens.



>
>> For single drives and RAID 0, the only possible solution is to not do
>> link resets for up to 3 minutes and hope the drive returns the single
>> copy of data.
>
> So perhaps the timeout should be influenced by higher layers, e.g. if a
> disk becomes part of a raid1, its timeout should be shortened by default,
> while the timeout for a disk that is not used by a redundant layer should
> be longer.

And there are a pile of reasons why link resets are necessary that
have nothing to do with bad sectors. So if you end up with a drive or
controller misbehaving, and the new behavior is to force a bunch of new
(corrective) writes to the drive right after a reset, it could actually
make its problems worse for all we know.

I think it's highly speculative to assume a hung block device means bad
sectors and should be treated as such, and that doing so will cause no
other side effects. That's a question for block device/SCSI experts to
opine on. I'm sure they're reasonably aware of this problem, and if it
were that simple they'd have done it already; but conversely, 5 years of
telling users to change the command timer or stop using the wrong kind
of drives for RAID really isn't sufficiently good advice either.

The reality is that drive manufacturers have handed us drives that far
and wide either don't support SCT ERC or have it disabled by default. So
maybe the thing to do is have udev poll the drive for SCT ERC: if it's
already at 70,70, leave the SCSI command timer as is. If it reports that
it's disabled, udev needs to know whether the drive is in some kind of
RAID 1+ and, if so, set SCT ERC to 70,70. If it's a single drive,
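The udev-driven policy described above can be sketched as a small decision
function. This is a hedged illustration only: the action strings, the
`choose_policy` name, and the inputs (the SCT ERC state as `smartctl`
reports it, plus a precomputed RAID-membership flag) are all assumptions;
actually querying the drive and detecting RAID 1+ membership is left out.

```shell
#!/bin/sh
# Sketch of the per-drive policy discussed above. Inputs: the SCT ERC
# read state ("Disabled" or a timeout like "70,70") and whether the
# drive backs a redundant array ("yes"/"no"). Prints the action a udev
# helper would take; it does not touch any real hardware.
choose_policy() {
    erc_state="$1"   # e.g. "Disabled" or "70,70"
    raid_member="$2" # "yes" if part of a RAID1+ / redundant setup
    if [ "$erc_state" != "Disabled" ]; then
        # Drive already has a bounded error recovery time.
        echo "leave-command-timer-alone"
    elif [ "$raid_member" = "yes" ]; then
        # Redundancy exists: prefer fast, explicit read errors.
        echo "smartctl -l scterc,70,70"
    else
        # Single copy of data: give the drive time to recover it.
        echo "raise-command-timer-to-180s"
    fi
}

choose_policy "70,70" yes    # -> leave-command-timer-alone
choose_policy Disabled yes   # -> smartctl -l scterc,70,70
choose_policy Disabled no    # -> raise-command-timer-to-180s
```

The point of the sketch is only that the timeout decision depends on two
observable facts, both available at udev time.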

Re: Kernel bug during RAID1 replace

2016-06-27 Thread Chris Murphy
On Mon, Jun 27, 2016 at 6:49 PM, Saint Germain  wrote:

>
> I've tried both options and launched a replace, but I got the same error
> (replace is cancelled, kernel bug).
> I will let these options on and attempt a ddrescue on /dev/sda
> to /dev/sdd.
> Then I will disconnect /dev/sda and reboot and see if it works better.

Sounds reasonable. Just make sure the file system is already unmounted
when you use ddrescue, because otherwise you're block-copying it while
it can still be modified (the generation number tends to get
incremented while mounted read-write).


>> Last, I have no idea if the massive Btrfs write errors on sda are from
>> an earlier problem where the drive data or power cable got jiggled or
>> was otherwise absent temporarily? So depending on how the block
>> timeout change affects your data recovery, you might end up needing to
>> do a reboot to get back to a more stable state for all of this? It
>> really should be able to fix things *if* at least one copy can be read
>> and then written to the other drive.
>>
>
> I have also no idea why is sda behaving like this. I haven't done
> anything particular on these drives.

Yeah, pretty weird. At some point, once things are stable and if this
file system survives, you might want to use btrfs dev stat -z to wipe
out those stats.


-- 
Chris Murphy


[PATCH] btrfs: Fix leaking bytes_may_use after hitting EDQUOTA

2016-06-27 Thread Qu Wenruo
If one mounts btrfs with the enospc_debug mount option and hits the
qgroup limit in btrfs_check_data_free_space(), then at unmount time a
kernel warning will be triggered along with a data space info dump.
--
[ cut here ]
WARNING: CPU: 0 PID: 3875 at fs/btrfs/extent-tree.c:9785
btrfs_free_block_groups+0x2b8/0x460 [btrfs]
Modules linked in: btrfs ext4 jbd2 mbcache xor zlib_deflate raid6_pq xfs
[last unloaded: btrfs]
CPU: 0 PID: 3875 Comm: umount Tainted: GW   4.7.0-rc4+ #13
Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox
12/01/2006
  8800230a7d00 813b89e5 
  8800230a7d40 810c9b8b 2639230a7d50
 88003d523a78 88003d523b80 88000d1c 88000d1c00c8
Call Trace:
 [] dump_stack+0x67/0x92
 [] __warn+0xcb/0xf0
 [] warn_slowpath_null+0x1d/0x20
 [] btrfs_free_block_groups+0x2b8/0x460 [btrfs]
 [] close_ctree+0x173/0x350 [btrfs]
 [] btrfs_put_super+0x19/0x20 [btrfs]
 [] generic_shutdown_super+0x6a/0xf0
 [] kill_anon_super+0x12/0x20
 [] btrfs_kill_super+0x18/0x110 [btrfs]
 [] deactivate_locked_super+0x3e/0x70
 [] deactivate_super+0x5c/0x60
 [] cleanup_mnt+0x3f/0x90
 [] __cleanup_mnt+0x12/0x20
 [] task_work_run+0x81/0xc0
 [] exit_to_usermode_loop+0xb3/0xc0
 [] syscall_return_slowpath+0xb0/0xc0
 [] entry_SYSCALL_64_fastpath+0xa6/0xa8
---[ end trace 99b9af8484495c66 ]---
BTRFS: space_info 1 has 8044544 free, is not full
BTRFS: space_info total=8388608, used=344064, pinned=0, reserved=0,
may_use=409600, readonly=0
--

The problem is in btrfs_check_data_free_space(): we reserve data space
first and then reserve qgroup space.
However, if the qgroup reserve fails, we don't clean up the reserved
data space, which leads to the kernel warning.

Fix it by freeing reserved data space when qgroup_reserve_data() fails.

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/extent-tree.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 29e5d00..e349da0 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4265,6 +4265,9 @@ int btrfs_check_data_free_space(struct inode *inode, u64 start, u64 len)
 * range, but don't impact performance on quota disable case.
 */
ret = btrfs_qgroup_reserve_data(inode, start, len);
+   if (ret < 0)
+   /* Qgroup reserve failed, need to cleanup reserved data space */
+   btrfs_free_reserved_data_space(inode, start, len);
return ret;
 }
 
-- 
2.9.0





Re: Adventures in btrfs raid5 disk recovery

2016-06-27 Thread Zygo Blaxell
On Mon, Jun 27, 2016 at 04:30:23PM -0600, Chris Murphy wrote:
> On Mon, Jun 27, 2016 at 3:57 PM, Zygo Blaxell
>  wrote:
> > On Mon, Jun 27, 2016 at 10:17:04AM -0600, Chris Murphy wrote:
> > If anything, I want the timeout to be shorter so that upper layers with
> > redundancy can get an EIO and initiate repair promptly, and admins can
> > get notified to evict chronic offenders from their drive slots, without
> > having to pay extra for hard disk firmware with that feature.
> 
> The drive totally thwarts this. It doesn't report back to the kernel
> what command is hung, as far as I'm aware. It just hangs and goes into
a so-called "deep recovery"; there is no way to know what sector is
> causing the problem

I'm proposing just treat the link reset _as_ an EIO, unless transparent
link resets are required for link speed negotiation or something.
The drive wouldn't be thwarting anything, the host would just ignore it
(unless the drive doesn't respond to a link reset until after its internal
timeout, in which case nothing is saved by shortening the timeout).

> until the drive reports a read error, which will
> include the affected sector LBA.

It doesn't matter which sector.  Chances are good that it was more than
one of the outstanding requested sectors anyway.  Rewrite them all.

We know which sectors they are because somebody has an IO operation
waiting for a status on each of them (unless they're using AIO or some
other API where a request can be fired at a hard drive and the reply
discarded).  Notify all of them that their IO failed and move on.

> Btrfs does have something of a work around for when things get slow,
> and that's balance, read and rewrite everything. The write forces
> sector remapping by the drive firmware for bad sectors.

It's a crude form of "resilvering" as ZFS calls it.

> > The upper layers could time the IOs, and make their own decisions based
> > on the timing (e.g. btrfs or mdadm could proactively repair anything that
> > took more than 10 seconds to read).  That might be a better approach,
> > since shortening the time to an EIO is only useful when you have a
> > redundancy layer in place to do something about them.
> 
> For RAID with redundancy, that's doable, although I have no idea what
> work is needed, or even if it's possible, to track commands in this
> manner, and fall back to some kind of repair mode as if it were a read
> error.

If btrfs sees EIO from a lower block layer it will try to reconstruct the
missing data (but not repair it).  If that happens during a scrub,
it will also attempt to rewrite the missing data over the original
offending sectors.  This happens every few months in my server pool,
and seems to be working even on btrfs raid5.

Last time I checked all the RAID implementations on Linux (ok, so that's
pretty much just md-raid) had some sort of repair capability.  lvm uses
(or can use) the md-raid implementation.  ext4 and xfs on naked disk
partitions will have problems, but that's because they were designed in
the 1990's when we were young and naive and still believed hard disks
would one day become reliable devices without buggy firmware.

> For single drives and RAID 0, the only possible solution is to not do
> link resets for up to 3 minutes and hope the drive returns the single
> copy of data.

So perhaps the timeout should be influenced by higher layers, e.g. if a
disk becomes part of a raid1, its timeout should be shortened by default,
while the timeout for a disk that is not used by a redundant layer should
be longer.

> Even in the case of Btrfs DUP, it's thwarted without a read error
> reported from the drive (or it returning bad data).

That case gets messy--different timeouts for different parts of the disk.
Probably not practical.





[PATCH] fstests: btrfs: Regression test for leaking data reserved space

2016-06-27 Thread Qu Wenruo
When btrfs hits EDQUOTA while reserving data space, it leaks the
already reserved data space.

This test case checks it by using the stricter enospc_debug mount
option to trigger a kernel warning at umount time.

Signed-off-by: Qu Wenruo 
---
 tests/btrfs/124 | 73 +
 tests/btrfs/124.out |  2 ++
 tests/btrfs/group   |  1 +
 3 files changed, 76 insertions(+)
 create mode 100755 tests/btrfs/124
 create mode 100644 tests/btrfs/124.out

diff --git a/tests/btrfs/124 b/tests/btrfs/124
new file mode 100755
index 000..94a5b28
--- /dev/null
+++ b/tests/btrfs/124
@@ -0,0 +1,73 @@
+#! /bin/bash
+# FS QA Test 124
+#
+# Regression test for leaking data space after hitting EDQUOTA
+#
+#---
+# Copyright (c) 2016 Fujitsu.  All Rights Reserved.
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#---
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1   # failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+   cd /
+   rm -f $tmp.*
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+
+# remove previous $seqres.full before test
+rm -f $seqres.full
+
+# real QA test starts here
+
+# Modify as appropriate.
+_supported_fs btrfs
+_supported_os Linux
+_require_scratch
+
+_scratch_mkfs
+# Use enospc_debug mount option to trigger restrict space info check
+_scratch_mount "-o enospc_debug"
+
+_run_btrfs_util_prog quota enable $SCRATCH_MNT
+_run_btrfs_util_prog quota rescan -w $SCRATCH_MNT
+_run_btrfs_util_prog qgroup limit 512K 0/5 $SCRATCH_MNT
+
+# The amount of written data may change due to different nodesize at mkfs time,
+# so redirect stdout to seqres.full.
+# Also, EDQUOTA is expected, which can't be redirected due to the limitation
+# of _filter_xfs_io, so golden output will include EDQUOTA error message
+_pwrite_byte 0xcdcdcdcd 0 1M $SCRATCH_MNT/test_file | _filter_xfs_io \
+   >> $seqres.full
+
+# Fstests will umount the fs, and at umount time, kernel warning will be
+# triggered
+
+# success, all done
+status=0
+exit
diff --git a/tests/btrfs/124.out b/tests/btrfs/124.out
new file mode 100644
index 000..6774792
--- /dev/null
+++ b/tests/btrfs/124.out
@@ -0,0 +1,2 @@
+QA output created by 124
+pwrite64: Disk quota exceeded
diff --git a/tests/btrfs/group b/tests/btrfs/group
index 5a26ed7..a398213 100644
--- a/tests/btrfs/group
+++ b/tests/btrfs/group
@@ -126,3 +126,4 @@
 121 auto quick snapshot qgroup
 122 auto quick snapshot qgroup
 123 auto quick qgroup
+124 auto quick qgroup
-- 
2.5.5





Re: Kernel bug during RAID1 replace

2016-06-27 Thread Saint Germain
On Mon, 27 Jun 2016 18:00:34 -0600, Chris Murphy
 wrote :

> On Mon, Jun 27, 2016 at 5:06 PM, Saint Germain 
> wrote:
> > On Mon, 27 Jun 2016 16:58:37 -0600, Chris Murphy
> >  wrote :
> >
> >> On Mon, Jun 27, 2016 at 4:55 PM, Chris Murphy
> >>  wrote:
> >>
> >> >> BTRFS info (device sdb1): dev_replace from /dev/sda1 (devid 1)
> >> >> to /dev/sdd1 started scrub_handle_errored_block: 166 callbacks
> >> >> suppressed BTRFS warning (device sdb1): checksum error at
> >> >> logical 93445255168 on dev /dev/sda1, sector 77669048, root 5,
> >> >> inode 3434831, offset 479232, length 4096, links 1 (path:
> >> >> user/.local/share/zeitgeist/activity.sqlite-wal)
> >> >> btrfs_dev_stat_print_on_error: 166 callbacks suppressed BTRFS
> >> >> error (device sdb1): bdev /dev/sda1 errs: wr 0, rd 0, flush 0,
> >> >> corrupt 14221, gen 24 scrub_handle_errored_block: 166 callbacks
> >> >> suppressed BTRFS error (device sdb1): unable to fixup (regular)
> >> >> error at logical 93445255168 on dev /dev/sda1
> >> >
> >> > Shoot. You have a lot of these. It looks suspiciously like you're
> >> > hitting a case that list regulars are only just starting to understand
> >>
> >> Forget this part completely. It doesn't affect raid1. I just
> >> re-read that your setup is not raid1, I don't know why I thought
> >> it was raid5.
> >>
> >> The likely issue here is that you've got legit corruptions on sda
> >> (mix of slow and flat out bad sectors), as well as a failing drive.
> >>
> >> This is also safe to issue:
> >>
> >> smartctl -l scterc /dev/sda
> >> smartctl -l scterc /dev/sdb
> >> cat /sys/block/sda/device/timeout
> >> cat /sys/block/sdb/device/timeout
> >>
> >
> > My setup is indeed RAID1 (and not RAID5)
> >
> > root@system:/# smartctl -l scterc /dev/sda
> > smartctl 6.4 2014-10-07 r4002 [x86_64-linux-4.6.0-0.bpo.1-amd64]
> > (local build) Copyright (C) 2002-14, Bruce Allen, Christian Franke,
> > www.smartmontools.org
> >
> > SCT Error Recovery Control:
> >Read: Disabled
> >   Write: Disabled
> >
> > root@system:/# smartctl -l scterc /dev/sdb
> > smartctl 6.4 2014-10-07 r4002 [x86_64-linux-4.6.0-0.bpo.1-amd64]
> > (local build) Copyright (C) 2002-14, Bruce Allen, Christian Franke,
> > www.smartmontools.org
> >
> > SCT Error Recovery Control:
> >Read: Disabled
> >   Write: Disabled
> >
> > root@system:/# cat /sys/block/sda/device/timeout
> > 30
> > root@system:/# cat /sys/block/sdb/device/timeout
> > 30
> 
> Good news and bad news. The bad news is that this is a significant
> misconfiguration, it's very common, and it means that bad sectors
> which don't result in read errors within 30 seconds don't get fixed
> by Btrfs (or even mdadm or LVM raid). So they can accumulate.
> 
> There are two options since your drives support SCT ERC.
> 
> 1.
> smartctl -l scterc,70,70 /dev/sdX  ## done for both drives
> 
> That will make sure the drive reports a read error in 7 seconds, well
> under the kernel's command timer of 7 seconds. This is how your drives
> should normally be configured for RAID usage.
> 
> 2.
> echo 180 > /sys/block/sda/device/timeout
> echo 180 > /sys/block/sdb/device/timeout
> 
> This *might* actually work better in your case. If you permit the
> drives to have really long error recovery, it might actually allow the
> data to be returned to Btrfs and then it can start fixing problems.
> Maybe. It's a long shot. And there will be upwards of 3 minute hangs.
> 
> I would give this a shot first. You can issue these commands safely at
> any time, no umount is needed or anything like that. I would do this
> even before using cp/rsync or ddrescue because it increases the chance
> the drive can recover data from these bad sectors and fix the other
> drive.
> 
> These settings are not persistent across a reboot unless you set a
> udev rule or equivalent.
> 
> On one of my drives that supports SCT ERC it only accepts the smartctl
> -l command to set the timeout once. I can't change it without power
> cycling the drive or it just crashes (yay firmware bugs). Just FYI
> it's possible to run into other weirdness.
> 

I've tried both options and launched a replace, but I got the same error
(replace is cancelled, kernel bug).
I will let these options on and attempt a ddrescue on /dev/sda
to /dev/sdd.
Then I will disconnect /dev/sda and reboot and see if it works better.

> Last, I have no idea if the massive Btrfs write errors on sda are from
> an earlier problem where the drive data or power cable got jiggled or
> was otherwise absent temporarily? So depending on how the block
> timeout change affects your data recovery, you might end up needing to
> do a reboot to get back to a more stable state for all of this? It
> really should be able to fix things *if* at least one copy can be read
> and then written to the other drive.
> 

I have also no idea why is sda behaving like this. I haven't done

Re: [PATCH 05/14] Btrfs: warn_on for unaccounted spaces

2016-06-27 Thread Qu Wenruo



At 06/27/2016 09:03 PM, Chris Mason wrote:



On 06/27/2016 12:47 AM, Qu Wenruo wrote:

Hi Josef,

Would you please move this patch to the first of the patchset?

It's making bisect quite hard, as it will always stop at this patch,
hard to check if it's a regression or existing bug.


That's a good idea.  Which workload are you having trouble with?

-chris



Qgroup test which hits EDQUOTA.

After hitting EDQUOTA, unmount will always trigger a kernel warning for
the DATA space whose bytes_may_use is not zero.


The problem is long-standing; it seems buffered writes don't clean up
their delalloc-reserved DATA space.


Thanks,
Qu




Re: Kernel bug during RAID1 replace

2016-06-27 Thread Chris Murphy
On Mon, Jun 27, 2016 at 6:00 PM, Chris Murphy  wrote:

> There are two options since your drives support SCT ERC.
>
> 1.
> smartctl -l scterc,70,70 /dev/sdX  ## done for both drives
>
> That will make sure the drive reports a read error in 7 seconds, well
> under the kernel's command timer of 7 seconds.

correction: "well under the kernel's command timer default of 30 seconds"



-- 
Chris Murphy


Re: Kernel bug during RAID1 replace

2016-06-27 Thread Chris Murphy
On Mon, Jun 27, 2016 at 5:06 PM, Saint Germain  wrote:
> On Mon, 27 Jun 2016 16:58:37 -0600, Chris Murphy
>  wrote :
>
>> On Mon, Jun 27, 2016 at 4:55 PM, Chris Murphy
>>  wrote:
>>
>> >> BTRFS info (device sdb1): dev_replace from /dev/sda1 (devid 1)
>> >> to /dev/sdd1 started scrub_handle_errored_block: 166 callbacks
>> >> suppressed BTRFS warning (device sdb1): checksum error at logical
>> >> 93445255168 on dev /dev/sda1, sector 77669048, root 5, inode
>> >> 3434831, offset 479232, length 4096, links 1 (path:
>> >> user/.local/share/zeitgeist/activity.sqlite-wal)
>> >> btrfs_dev_stat_print_on_error: 166 callbacks suppressed BTRFS
>> >> error (device sdb1): bdev /dev/sda1 errs: wr 0, rd 0, flush 0,
>> >> corrupt 14221, gen 24 scrub_handle_errored_block: 166 callbacks
>> >> suppressed BTRFS error (device sdb1): unable to fixup (regular)
>> >> error at logical 93445255168 on dev /dev/sda1
>> >
>> > Shoot. You have a lot of these. It looks suspiciously like you're
>> > hitting a case that list regulars are only just starting to understand
>>
>> Forget this part completely. It doesn't affect raid1. I just re-read
>> that your setup is not raid1, I don't know why I thought it was raid5.
>>
>> The likely issue here is that you've got legit corruptions on sda (mix
>> of slow and flat out bad sectors), as well as a failing drive.
>>
>> This is also safe to issue:
>>
>> smartctl -l scterc /dev/sda
>> smartctl -l scterc /dev/sdb
>> cat /sys/block/sda/device/timeout
>> cat /sys/block/sdb/device/timeout
>>
>
> My setup is indeed RAID1 (and not RAID5)
>
> root@system:/# smartctl -l scterc /dev/sda
> smartctl 6.4 2014-10-07 r4002 [x86_64-linux-4.6.0-0.bpo.1-amd64] (local
> build) Copyright (C) 2002-14, Bruce Allen, Christian Franke,
> www.smartmontools.org
>
> SCT Error Recovery Control:
>Read: Disabled
>   Write: Disabled
>
> root@system:/# smartctl -l scterc /dev/sdb
> smartctl 6.4 2014-10-07 r4002 [x86_64-linux-4.6.0-0.bpo.1-amd64] (local
> build) Copyright (C) 2002-14, Bruce Allen, Christian Franke,
> www.smartmontools.org
>
> SCT Error Recovery Control:
>Read: Disabled
>   Write: Disabled
>
> root@system:/# cat /sys/block/sda/device/timeout
> 30
> root@system:/# cat /sys/block/sdb/device/timeout
> 30

Good news and bad news. The bad news is this is a significant
misconfiguration, it's very common, and it means that any bad sector
whose recovery takes longer than 30 seconds never produces a read
error, so it never gets fixed by Btrfs (or even mdadm or LVM raid). So
they can accumulate.

There are two options since your drives support SCT ERC.

1.
smartctl -l scterc,70,70 /dev/sdX  ## done for both drives

That will make sure the drive reports a read error in 7 seconds, well
under the kernel's command timer of 30 seconds. This is how your drives
should normally be configured for RAID usage.

2.
echo 180 > /sys/block/sda/device/timeout
echo 180 > /sys/block/sdb/device/timeout

This *might* actually work better in your case. If you permit the
drives to have really long error recovery, it might actually allow the
data to be returned to Btrfs and then it can start fixing problems.
Maybe. It's a long shot. And there will be upwards of 3 minute hangs.

I would give this a shot first. You can issue these commands safely at
any time, no umount is needed or anything like that. I would do this
even before using cp/rsync or ddrescue because it increases the chance
the drive can recover data from these bad sectors and fix the other
drive.

These settings are not persistent across a reboot unless you set a
udev rule or equivalent.
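As a sketch of that persistence, the snippet below just prints an example
udev rule; the rules-file name, the match expression, and the smartctl path
are assumptions, so verify them against your distro (e.g. with `udevadm
test`) before installing anything:

```shell
# Sketch only: print an example udev rule that would reapply the SCT ERC
# setting on every device add. Paths and match keys are assumptions.
RULES='# /etc/udev/rules.d/60-scterc.rules (example)
# Set SCT ERC to 7.0s read/write on SATA disks at device-add time.
ACTION=="add", SUBSYSTEM=="block", KERNEL=="sd[a-z]", \
  RUN+="/usr/sbin/smartctl -l scterc,70,70 /dev/%k"'
printf '%s\n' "$RULES"
```

For drives that reject SCT ERC you would instead raise the kernel timeout
(the `echo 180 > /sys/block/sdX/device/timeout` approach above) from a boot
script, since udev can also write sysfs attributes for that.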

On one of my drives that supports SCT ERC it only accepts the smartctl
-l command to set the timeout once. I can't change it without power
cycling the drive or it just crashes (yay firmware bugs). Just FYI
it's possible to run into other weirdness.


Last, I have no idea if the massive Btrfs write errors on sda are from
an earlier problem where the drive data or power cable got jiggled or
was otherwise absent temporarily? So depending on how the block
timeout change affects your data recovery, you might end up needing to
do a reboot to get back to a more stable state for all of this? It
really should be able to fix things *if* at least one copy can be read
and then written to the other drive.



-- 
Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Kernel bug during RAID1 replace

2016-06-27 Thread Chris Murphy
On Mon, Jun 27, 2016 at 5:03 PM, Saint Germain  wrote:

>>
>
> Ok thanks I will begin to make an image with dd.
> Do you recommend to use sda or sdb ?

Well at the moment you're kinda stuck. I'd leave them together and
just get the data off the drive normally with cp -a (or just -r if you
don't care about permissions and other metadata like time stamps and
xattr) or rsync -a. Certainly the dying drive is being really pissy
but if you get a bad read off one drive *maybe* it can correct off the
other drive. But that's not possible if you pull one of those drives.

Also as for imaging the drive, you probably need to use ddrescue instead of dd.

Be warned that there's a gotcha where you can corrupt Btrfs volumes
where multiple instances of the same fs uuid and dev uuid appear at
the same time to the kernel. So once you've cloned in this manner,
don't mount the volume until you hide (as in remove) one of the
copies. See block level copies:
https://btrfs.wiki.kernel.org/index.php/Gotchas
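For the imaging step itself, a typical ddrescue invocation looks roughly
like the following. This is a sketch only: the source device and the
destination paths are placeholders (the image must land on a separate,
healthy disk), so the command is printed here rather than executed:

```shell
# Sketch: image the failing drive with GNU ddrescue (not plain dd).
# Device and destination paths are placeholders -- adjust before use.
SRC=/dev/sdb
IMG=/mnt/rescue/sdb.img
MAP=/mnt/rescue/sdb.map                  # map file lets ddrescue resume
CMD="ddrescue -d -r3 $SRC $IMG $MAP"     # direct I/O, retry bad areas 3 times
printf '%s\n' "$CMD"                     # printed, not executed, in this sketch
```

The map file is the important part: if the drive drops off the bus mid-copy,
rerunning the same command resumes where it left off instead of rereading
the good sectors.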





> root@system:/# smartctl -x /dev/sda
> ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
>   1 Raw_Read_Error_Rate     POSR-K   100   100   051    -    0
>   2 Throughput_Performance  -OS--K   252   252   000    -    0
>   3 Spin_Up_Time            PO---K   091   090   025    -    2993
>   4 Start_Stop_Count        -O--CK   100   100   000    -    661
>   5 Reallocated_Sector_Ct   PO--CK   252   252   010    -    0
>   7 Seek_Error_Rate         -OSR-K   252   252   051    -    0
>   8 Seek_Time_Performance   --S--K   252   252   015    -    0
>   9 Power_On_Hours          -O--CK   100   100   000    -    1379
>  10 Spin_Retry_Count        -O--CK   252   252   051    -    0
>  12 Power_Cycle_Count       -O--CK   100   100   000    -    349
> 191 G-Sense_Error_Rate      -O---K   252   252   000    -    0
> 192 Power-Off_Retract_Count -O---K   252   252   000    -    0
> 194 Temperature_Celsius     -O---K   060   047   000    -    40 (Min/Max 18/53)
> 195 Hardware_ECC_Recovered  -O-RCK   100   100   000    -    0
> 196 Reallocated_Event_Count -O--CK   252   252   000    -    0
> 197 Current_Pending_Sector  -O--CK   252   252   000    -    0
> 198 Offline_Uncorrectable   ----CK   252   252   000    -    0
> 199 UDMA_CRC_Error_Count    -OS-CK   200   200   000    -    0
> 200 Multi_Zone_Error_Rate   -O-R-K   100   100   000    -    2
> 223 Load_Retry_Count        -O--CK   100   100   000    -    1
> 225 Load_Cycle_Count        -O--CK   099   099   000    -    10744
> 241 Total_LBAs_Written      -O--CK   095   094   000    -    7981553
> 242 Total_LBAs_Read         -O--CK   098   094   000    -    4015781

No current pending, reallocated, or uncorrected sectors. Interesting.
But btrfs has counted piles of corruption errors on this drive. Why?
Bad cable? That should result in UDMA CRC errors, lots of them.

> SATA Phy Event Counters (GP Log 0x11)

No significant problems.



> root@system:/# smartctl -x /dev/sdb
>
> SMART Attributes Data Structure revision number: 16
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
>   1 Raw_Read_Error_Rate     POSR-K   100   100   051    -    28
>   2 Throughput_Performance  -OS--K   252   252   000    -    0
>   3 Spin_Up_Time            PO---K   092   083   025    -    2678
>   4 Start_Stop_Count        -O--CK   100   100   000    -    575
>   5 Reallocated_Sector_Ct   PO--CK   252   252   010    -    0
>   7 Seek_Error_Rate         -OSR-K   252   252   051    -    0
>   8 Seek_Time_Performance   --S--K   252   252   015    -    0
>   9 Power_On_Hours          -O--CK   100   100   000    -    1391
>  10 Spin_Retry_Count        -O--CK   252   252   051    -    0
>  12 Power_Cycle_Count       -O--CK   100   100   000    -    371
> 191 G-Sense_Error_Rate      -O---K   252   252   000    -    0
> 192 Power-Off_Retract_Count -O---K   252   252   000    -    0
> 194 Temperature_Celsius     -O---K   061   047   000    -    39 (Min/Max 19/53)
> 195 Hardware_ECC_Recovered  -O-RCK   100   100   000    -    0
> 196 Reallocated_Event_Count -O--CK   252   252   000    -    0
> 197 Current_Pending_Sector  -O--CK   100   100   000    -    1
> 198 Offline_Uncorrectable   ----CK   252   252   000    -    0
> 199 UDMA_CRC_Error_Count    -OS-CK   200   200   000    -    0
> 200 Multi_Zone_Error_Rate   -O-R-K   100   100   000    -    3
> 223 Load_Retry_Count        -O--CK   100   100   000    -    1
> 225 Load_Cycle_Count        -O--CK   099   099   000    -    13957
> 241 Total_LBAs_Written      -O--CK   096   094   000    -    6153920
> 242 Total_LBAs_Read         -O--CK   097   094   000    -    4873960

One pending sector. Enough for a dozen scary warnings or so, but not
enough to account for as many as you have. Pretty curious.


>
> Error 28 [3] occurred at disk power-on lifetime: 1390 hours (57 days + 22 
> hours)
>   When the command that caused the error occurred, the device was active or 
> idle.
>
>   

Re: Kernel bug during RAID1 replace

2016-06-27 Thread Saint Germain
On Mon, 27 Jun 2016 16:58:37 -0600, Chris Murphy
 wrote :

> On Mon, Jun 27, 2016 at 4:55 PM, Chris Murphy
>  wrote:
> 
> >> BTRFS info (device sdb1): dev_replace from /dev/sda1 (devid 1)
> >> to /dev/sdd1 started scrub_handle_errored_block: 166 callbacks
> >> suppressed BTRFS warning (device sdb1): checksum error at logical
> >> 93445255168 on dev /dev/sda1, sector 77669048, root 5, inode
> >> 3434831, offset 479232, length 4096, links 1 (path:
> >> user/.local/share/zeitgeist/activity.sqlite-wal)
> >> btrfs_dev_stat_print_on_error: 166 callbacks suppressed BTRFS
> >> error (device sdb1): bdev /dev/sda1 errs: wr 0, rd 0, flush 0,
> >> corrupt 14221, gen 24 scrub_handle_errored_block: 166 callbacks
> >> suppressed BTRFS error (device sdb1): unable to fixup (regular)
> >> error at logical 93445255168 on dev /dev/sda1
> >
> > Shoot. You have a lot of these. It looks suspiciously like you're
> > hitting a case list regulars are only just starting to understand
> 
> Forget this part completely. It doesn't affect raid1. I just re-read
> that your setup is not raid1, I don't know why I thought it was raid5.
> 
> The likely issue here is that you've got legit corruptions on sda (mix
> of slow and flat out bad sectors), as well as a failing drive.
> 
> This is also safe to issue:
> 
> smartctl -l scterc /dev/sda
> smartctl -l scterc /dev/sdb
> cat /sys/block/sda/device/timeout
> cat /sys/block/sdb/device/timeout
> 

My setup is indeed RAID1 (and not RAID5)

root@system:/# smartctl -l scterc /dev/sda
smartctl 6.4 2014-10-07 r4002 [x86_64-linux-4.6.0-0.bpo.1-amd64] (local
build) Copyright (C) 2002-14, Bruce Allen, Christian Franke,
www.smartmontools.org

SCT Error Recovery Control:
   Read: Disabled
  Write: Disabled

root@system:/# smartctl -l scterc /dev/sdb
smartctl 6.4 2014-10-07 r4002 [x86_64-linux-4.6.0-0.bpo.1-amd64] (local
build) Copyright (C) 2002-14, Bruce Allen, Christian Franke,
www.smartmontools.org

SCT Error Recovery Control:
   Read: Disabled
  Write: Disabled

root@system:/# cat /sys/block/sda/device/timeout
30
root@system:/# cat /sys/block/sdb/device/timeout
30


Re: Kernel bug during RAID1 replace

2016-06-27 Thread Saint Germain
On Mon, 27 Jun 2016 16:55:07 -0600, Chris Murphy
 wrote :

> On Mon, Jun 27, 2016 at 4:26 PM, Saint Germain 
> wrote:
> 
> >>
> >
> > Thanks for your help.
> >
> > Ok here is the log from the mounting, and including btrfs replace
> > (btrfs replace start -f /dev/sda1 /dev/sdd1 /home):
> >
> > BTRFS info (device sdb1): disk space caching is enabled
> > BTRFS info (device sdb1): bdev /dev/sdb1 errs: wr 11881695, rd 12,
> > flush 7928, corrupt 1705631, gen 1335 BTRFS info (device sdb1):
> > bdev /dev/sda1 errs: wr 0, rd 0, flush 0, corrupt 14220, gen 24
> 
> Eek. So sdb has 11+ million write errors, flush errors, read errors,
> and over 1 million corruptions. It's dying or dead.
> 
> And sda has a dozen thousand+ corruptions. This isn't a good
> combination, as you have two devices with problems and raid5 only
> protects you from one device with problems.
> 
> You were in the process of replacing sda, which is good, but it may
> not be enough...
> 
> 
> > BTRFS info (device sdb1): dev_replace from /dev/sda1 (devid 1)
> > to /dev/sdd1 started scrub_handle_errored_block: 166 callbacks
> > suppressed BTRFS warning (device sdb1): checksum error at logical
> > 93445255168 on dev /dev/sda1, sector 77669048, root 5, inode
> > 3434831, offset 479232, length 4096, links 1 (path:
> > user/.local/share/zeitgeist/activity.sqlite-wal)
> > btrfs_dev_stat_print_on_error: 166 callbacks suppressed BTRFS error
> > (device sdb1): bdev /dev/sda1 errs: wr 0, rd 0, flush 0, corrupt
> > 14221, gen 24 scrub_handle_errored_block: 166 callbacks suppressed
> > BTRFS error (device sdb1): unable to fixup (regular) error at
> > logical 93445255168 on dev /dev/sda1
> 
> Shoot. You have a lot of these. It looks suspiciously like you're
> hitting a case list regulars are only just starting to understand
> (somewhat) where it's possible to have a legit corrupt sector that
> Btrfs detects during scrub as wrong, fixes it from parity, but then
> occasionally wrongly overwrites the parity with bad parity. This
> doesn't cause an immediately recognizable problem. But if the volume
> becomes degraded later, Btrfs must use parity to reconstruct
> on-the-fly and if it hits one of these bad parities, the
> reconstruction is bad, and ends up causing lots of these checksum
> errors. We can tell it's not metadata corruption because a.) there's a
> file listed as being affected and b.) the file system doesn't fail and
> go read only. But still it means those files are likely toast...
> 
> 
> [...snip many instances of checksum errors...]
> 
> > BTRFS error (device sdb1): bdev /dev/sda1 errs: wr 0, rd 0, flush
> > 0, corrupt 16217, gen 24 ata2.00: exception Emask 0x0 SAct 0x4000
> > SErr 0x0 action 0x0 ata2.00: irq_stat 0x4008
> > ata2.00: failed command: READ FPDMA QUEUED
> > ata2.00: cmd 60/08:70:08:d8:70/00:00:0f:00:00/40 tag 14 ncq 4096 in
> >  res 41/40:00:08:d8:70/00:00:0f:00:00/40 Emask 0x409 (media
> > error)  ata2.00: status: { DRDY ERR }
> > ata2.00: error: { UNC }
> > ata2.00: configured for UDMA/133
> > sd 1:0:0:0: [sdb] tag#14 FAILED Result: hostbyte=DID_OK
> > driverbyte=DRIVER_SENSE sd 1:0:0:0: [sdb] tag#14 Sense Key : Medium
> > Error [current] [descriptor] sd 1:0:0:0: [sdb] tag#14 Add. Sense:
> > Unrecovered read error - auto reallocate failed sd 1:0:0:0: [sdb]
> > tag#14 CDB: Read(10) 28 00 0f 70 d8 08 00 00 08 00
> > blk_update_request: I/O error, dev sdb, sector 259053576
> 
> OK yeah so bad sector on sdb. So you have two failures because sda is
> already giving you trouble while being replaced and on top of it you
> now get a 2nd (partial) failure via bad sectors.
> 
> So rather urgently I think you need to copy things off this volume if
> you don't already have a backup so you can save as much as possible.
> Don't write to the drives. You might even consider 'mount -o
> remount,ro' to avoid anything writing to the volume. Copy the most
> important data first, triage time.
> 
> While that happens you can safely collect some more information:
> 
> btrfs fi us <mountpoint>
> smartctl -x <device>   ## for both drives
> 

Ok thanks I will begin to make an image with dd.
Do you recommend to use sda or sdb ?

In the meantime here are the info requested:

btrfs fi us /home
Overall:
Device size:   3.63TiB
Device allocated:  2.76TiB
Device unallocated:  888.51GiB
Device missing:  0.00B
Used:  2.62TiB
Free (estimated):517.56GiB  (min: 517.56GiB)
Data ratio:   2.00
Metadata ratio:   2.00
Global reserve:  512.00MiB  (used: 0.00B)

Data,RAID1: Size:1.38TiB, Used:1.31TiB
   /dev/sda1   1.38TiB
   /dev/sdb1   1.38TiB

Metadata,RAID1: Size:5.00GiB, Used:3.15GiB
   /dev/sda1   5.00GiB
   /dev/sdb1   5.00GiB

System,RAID1: Size:64.00MiB, Used:216.00KiB
   /dev/sda1  64.00MiB
   /dev/sdb1  64.00MiB

Unallocated:
   

Re: Kernel bug during RAID1 replace

2016-06-27 Thread Chris Murphy
On Mon, Jun 27, 2016 at 4:55 PM, Chris Murphy  wrote:

>> BTRFS info (device sdb1): dev_replace from /dev/sda1 (devid 1) to /dev/sdd1 
>> started
>> scrub_handle_errored_block: 166 callbacks suppressed
>> BTRFS warning (device sdb1): checksum error at logical 93445255168 on dev 
>> /dev/sda1, sector 77669048, root 5, inode 3434831, offset 479232, length 
>> 4096, links 1 (path: user/.local/share/zeitgeist/activity.sqlite-wal)
>> btrfs_dev_stat_print_on_error: 166 callbacks suppressed
>> BTRFS error (device sdb1): bdev /dev/sda1 errs: wr 0, rd 0, flush 0, corrupt 
>> 14221, gen 24
>> scrub_handle_errored_block: 166 callbacks suppressed
>> BTRFS error (device sdb1): unable to fixup (regular) error at logical 
>> 93445255168 on dev /dev/sda1
>
> Shoot. You have a lot of these. It looks suspiciously like you're
> hitting a case list regulars are only just starting to understand

Forget this part completely. It doesn't affect raid1. I just re-read
that your setup is not raid1, I don't know why I thought it was raid5.

The likely issue here is that you've got legit corruptions on sda (mix
of slow and flat out bad sectors), as well as a failing drive.

This is also safe to issue:

smartctl -l scterc /dev/sda
smartctl -l scterc /dev/sdb
cat /sys/block/sda/device/timeout
cat /sys/block/sdb/device/timeout




-- 
Chris Murphy


Re: Kernel bug during RAID1 replace

2016-06-27 Thread Chris Murphy
On Mon, Jun 27, 2016 at 4:26 PM, Saint Germain  wrote:

>>
>
> Thanks for your help.
>
> Ok here is the log from the mounting, and including btrfs replace
> (btrfs replace start -f /dev/sda1 /dev/sdd1 /home):
>
> BTRFS info (device sdb1): disk space caching is enabled
> BTRFS info (device sdb1): bdev /dev/sdb1 errs: wr 11881695, rd 12, flush 
> 7928, corrupt 1705631, gen 1335
> BTRFS info (device sdb1): bdev /dev/sda1 errs: wr 0, rd 0, flush 0, corrupt 
> 14220, gen 24

Eek. So sdb has 11+ million write errors, flush errors, read errors,
and over 1 million corruptions. It's dying or dead.

And sda has a dozen thousand+ corruptions. This isn't a good
combination, as you have two devices with problems and raid5 only
protects you from one device with problems.

You were in the process of replacing sda, which is good, but it may
not be enough...


> BTRFS info (device sdb1): dev_replace from /dev/sda1 (devid 1) to /dev/sdd1 
> started
> scrub_handle_errored_block: 166 callbacks suppressed
> BTRFS warning (device sdb1): checksum error at logical 93445255168 on dev 
> /dev/sda1, sector 77669048, root 5, inode 3434831, offset 479232, length 
> 4096, links 1 (path: user/.local/share/zeitgeist/activity.sqlite-wal)
> btrfs_dev_stat_print_on_error: 166 callbacks suppressed
> BTRFS error (device sdb1): bdev /dev/sda1 errs: wr 0, rd 0, flush 0, corrupt 
> 14221, gen 24
> scrub_handle_errored_block: 166 callbacks suppressed
> BTRFS error (device sdb1): unable to fixup (regular) error at logical 
> 93445255168 on dev /dev/sda1

Shoot. You have a lot of these. It looks suspiciously like you're
hitting a case list regulars are only just starting to understand
(somewhat) where it's possible to have a legit corrupt sector that
Btrfs detects during scrub as wrong, fixes it from parity, but then
occasionally wrongly overwrites the parity with bad parity. This
doesn't cause an immediately recognizable problem. But if the volume
becomes degraded later, Btrfs must use parity to reconstruct
on-the-fly and if it hits one of these bad parities, the
reconstruction is bad, and ends up causing lots of these checksum
errors. We can tell it's not metadata corruption because a.) there's a
file listed as being affected and b.) the file system doesn't fail and
go read only. But still it means those files are likely toast...


[...snip many instances of checksum errors...]

> BTRFS error (device sdb1): bdev /dev/sda1 errs: wr 0, rd 0, flush 0, corrupt 
> 16217, gen 24
> ata2.00: exception Emask 0x0 SAct 0x4000 SErr 0x0 action 0x0
> ata2.00: irq_stat 0x4008
> ata2.00: failed command: READ FPDMA QUEUED
> ata2.00: cmd 60/08:70:08:d8:70/00:00:0f:00:00/40 tag 14 ncq 4096 in
>  res 41/40:00:08:d8:70/00:00:0f:00:00/40 Emask 0x409 (media error) 
> ata2.00: status: { DRDY ERR }
> ata2.00: error: { UNC }
> ata2.00: configured for UDMA/133
> sd 1:0:0:0: [sdb] tag#14 FAILED Result: hostbyte=DID_OK 
> driverbyte=DRIVER_SENSE
> sd 1:0:0:0: [sdb] tag#14 Sense Key : Medium Error [current] [descriptor]
> sd 1:0:0:0: [sdb] tag#14 Add. Sense: Unrecovered read error - auto reallocate 
> failed
> sd 1:0:0:0: [sdb] tag#14 CDB: Read(10) 28 00 0f 70 d8 08 00 00 08 00
> blk_update_request: I/O error, dev sdb, sector 259053576

OK yeah so bad sector on sdb. So you have two failures because sda is
already giving you trouble while being replaced and on top of it you
now get a 2nd (partial) failure via bad sectors.

So rather urgently I think you need to copy things off this volume if
you don't already have a backup so you can save as much as possible.
Don't write to the drives. You might even consider 'mount -o
remount,ro' to avoid anything writing to the volume. Copy the most
important data first, triage time.

While that happens you can safely collect some more information:

btrfs fi us <mountpoint>
smartctl -x <device>   ## for both drives


-- 
Chris Murphy


Re: Adventures in btrfs raid5 disk recovery

2016-06-27 Thread Chris Murphy
On Mon, Jun 27, 2016 at 3:57 PM, Zygo Blaxell
 wrote:
> On Mon, Jun 27, 2016 at 10:17:04AM -0600, Chris Murphy wrote:

>
>> It just came up again in a thread over the weekend on linux-raid@. I'm
>> going to ask while people are paying attention if a patch to change
>> the 30 second time out to something a lot higher has ever been
>> floated, what the negatives might be, and where to get this fixed if
>> it wouldn't be accepted in the kernel code directly.
>
> Defaults are defaults, they're not for everyone.  30 seconds is about
> two minutes too short for an SMR drive's worst-case write latency, or
> 28 seconds too long for an OLTP system, or just right for an end-user's
> personal machine with a low-energy desktop drive and a long spin-up time.

The question is where is the correct place to change the default to
broadly capture most use cases, because it's definitely incompatible
with consumer SATA drives, whether in an enclosure or not.

Maybe it's with the kernel teams at each distribution? Or maybe an
upstream udev rule?

In any case something needs to give here because it's been years of
bugging users about this misconfiguration and people constantly run
into it, which means user education is not working.


>
> Once a drive starts taking 30+ seconds to do I/O, I consider the drive
> failed in the sense that it's too slow to meet latency requirements.

Well that is then a mismatch between use case and the drive purchasing
decision. Consumer drives do this. It's how they're designed to work.


> When the problem is that it's already taking too long, the solution is
> not waiting even longer.  To put things in perspective, consider that
> server hardware watchdog timeouts are typically 60 seconds by default
> (if not maximum).

If you want the data retrieved from that particular device, the only
solution is waiting longer. The alternative is what you get, an IO
error (well actually you get a link reset, which also means the entire
command queue is purged on SATA drives).


> If anything, I want the timeout to be shorter so that upper layers with
> redundancy can get an EIO and initiate repair promptly, and admins can
> get notified to evict chronic offenders from their drive slots, without
> having to pay extra for hard disk firmware with that feature.

The drive totally thwarts this. It doesn't report back to the kernel
what command is hung, as far as I'm aware. It just hangs and goes into
a so-called "deep recovery"; there is no way to know what sector is
causing the problem until the drive reports a read error, which will
include the affected sector LBA.

Btrfs does have something of a work around for when things get slow,
and that's balance, read and rewrite everything. The write forces
sector remapping by the drive firmware for bad sectors.
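In command form that workaround is roughly the following. A sketch only:
the mount point is a placeholder and the commands are printed here rather
than run (scrub reads and csum-checks everything, balance rewrites it):

```shell
# Sketch: read everything (scrub), then rewrite everything (balance) so
# the drive firmware remaps weak sectors. /mnt is a placeholder.
MNT=/mnt
SCRUB="btrfs scrub start -B $MNT"                    # -B: run in foreground
BALANCE="btrfs balance start --full-balance $MNT"    # rewrite all chunks
printf '%s\n%s\n' "$SCRUB" "$BALANCE"                # printed, not executed
```

On a healthy array the scrub alone repairs anything that fails csum; the
balance is the heavier step that forces a rewrite even of sectors that
still read correctly, just slowly.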


>> *Ideally* I think we'd want two timeouts. I'd like to see commands
>> have a timer that results in merely a warning that could be used by
>> e.g. btrfs scrub to know "hey this sector range is 'slow' I'm going to
>> write over those sectors". That's how bad sectors start out, they read
>> slower and eventually go beyond 30 seconds and now it's all link
>> resets. If the problem could be fixed before then... that's the best
>> scenario.
>
> What's the downside of a link reset?  Can the driver not just return
> EIO for all the outstanding IOs in progress at reset, and let the upper
> layers deal with it?  Or is the problem that the upper layers are all
> horribly broken by EIOs, or drive firmware horribly broken by link resets?

Link reset clears the entire command queue on SATA drives, and it
wipes away any possibility of finding out what LBA or even a range of
LBAs, is the source of the stall. So it pretty much gets you nothing.


> The upper layers could time the IOs, and make their own decisions based
> on the timing (e.g. btrfs or mdadm could proactively repair anything that
> took more than 10 seconds to read).  That might be a better approach,
> since shortening the time to an EIO is only useful when you have a
> redundancy layer in place to do something about them.

For RAID with redundancy, that's doable, although I have no idea what
work is needed, or even if it's possible, to track commands in this
manner, and fall back to some kind of repair mode as if it were a read
error.

For single drives and RAID 0, the only possible solution is to not do
link resets for up to 3 minutes and hope the drive returns the single
copy of data.

Even in the case of Btrfs DUP, it's thwarted without a read error
reported from the drive (or it returning bad data).



-- 
Chris Murphy


Re: Strange behavior when replacing device on BTRFS RAID 5 array.

2016-06-27 Thread Steven Haigh
On 28/06/16 03:46, Chris Murphy wrote:
> On Mon, Jun 27, 2016 at 11:29 AM, Chris Murphy  
> wrote:
> 
>>
>> Next is to decide to what degree you want to salvage this volume and
>> keep using Btrfs raid56 despite the risks
> 
> Forgot to complete this thought. So if you get a backup, and decide
> you want to fix it, I would see if you can cancel the replace using
> "btrfs replace cancel " and confirm that it stops. And now is the
> risky part, which is whether to try "btrfs add" and then "btrfs
> remove" or remove the bad drive, reboot, and see if it'll mount with
> -o degraded, and then use add and remove (in which case you'll use
> 'remove missing').
> 
> The first you risk Btrfs still using the flaky bad drive.
> 
> The second you risk whether a degraded mount will work, and whether
> any other drive in the array has a problem while degraded (like an
> unrecoverable read error from a single sector).

This is the exact set of circumstances that caused my corrupt array. I
was using RAID6 - yet it still corrupted large portions of things. In
theory, due to having double parity, it should still have survived even
if a disk did go bad - but there we are.

I first started a replace - noted how slow it was going - cancelled the
replace, then did an add / delete - the system crashed and it was all over.

Just as another data point, I've been flogging the guts out of the array
with mdadm RAID6 doing a reshape of that - and no read errors, system
crashes or other problems in over 48 hours.

-- 
Steven Haigh

Email: net...@crc.id.au
Web: https://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897



signature.asc
Description: OpenPGP digital signature


Re: Kernel bug during RAID1 replace

2016-06-27 Thread Saint Germain
On Mon, 27 Jun 2016 15:42:42 -0600, Chris Murphy
 wrote :

> On Mon, Jun 27, 2016 at 3:36 PM, Saint Germain 
> wrote:
> > Hello,
> >
> > I am on Debian Jessie with a kernel from backports:
> > 4.6.0-0.bpo.1-amd64
> >
> > I am also using btrfs-tools 4.4.1-1.1~bpo8+1
> >
> > When trying to replace a RAID1 drive (with btrfs replace start
> > -f /dev/sda1 /dev/sdd1), the operation is cancelled after completing
> > only 5%.
> >
> > I got this error in the /var/log/syslog:
> > [ cut here ]
> > WARNING: CPU: 2 PID: 2617
> > at /build/linux-9LouV5/linux-4.6.1/fs/btrfs/dev-replace.c:430
> > btrfs_dev_replace_start+0x2be/0x400 [btrfs] Modules linked in:
> > uas(E) usb_storage(E) bnep(E) ftdi_sio(E) usbserial(E)
> > snd_hda_codec_hdmi(E) nls_utf8(E) nls_cp437(E) vfat(E) fat(E)
> > intel_rapl(E) x86_pkg_temp_thermal(E) intel_powerclamp(E)
> > coretemp(E) kvm_intel(E) kvm(E) iTCO_wdt(E) irqbypass(E)
> > iTCO_vendor_support(E) crct10dif_pclmul(E) crc32_pclmul(E)
> > ghash_clmulni_intel(E) hmac(E) drbg(E) ansi_cprng(E) aesni_intel(E)
> > aes_x86_64(E) lrw(E) gf128mul(E) glue_helper(E) ablk_helper(E)
> > cryptd(E) wl(POE) btusb(E) btrtl(E) btbcm(E) btintel(E) cfg80211(E)
> > bluetooth(E) efi_pstore(E) snd_hda_codec_realtek(E) evdev(E)
> > crc16(E) serio_raw(E) pcspkr(E) efivars(E) joydev(E)
> > snd_hda_codec_generic(E) rfkill(E) snd_hda_intel(E) nuvoton_cir(E)
> > rc_core(E) snd_hda_codec(E) i915(E) battery(E) snd_hda_core(E)
> > snd_hwdep(E) soc_button_array(E) tpm_tis(E) drm_kms_helper(E)
> > intel_smartconnect(E) snd_pcm(E) tpm(E) video(E) i2c_i801(E)
> > snd_timer(E) drm(E) snd(E) lpc_ich(E) i2c_algo_bit(E) soundcore(E)
> > mfd_core(E) mei_me(E) processor(E) button(E) mei(E) shpchp(E)
> > fuse(E) autofs4(E) hid_logitech_hidpp(E) btrfs(E)
> > hid_logitech_dj(E) usbhid(E) hid(E) xor(E) raid6_pq(E) sg(E)
> > sr_mod(E) cdrom(E) sd_mod(E) crc32c_intel(E) ahci(E) libahci(E)
> > libata(E) psmouse(E) scsi_mod(E) xhci_pci(E) ehci_pci(E)
> > xhci_hcd(E) ehci_hcd(E) e1000e(E) usbcore(E) ptp(E) pps_core(E)
> > usb_common(E) fjes(E) CPU: 2 PID: 2617 Comm: btrfs Tainted:
> > P   OE   4.6.0-0.bpo.1-amd64 #1 Debian 4.6.1-1~bpo8+1
> > Hardware name: To Be Filled By O.E.M. To Be Filled By
> > O.E.M./Z87E-ITX, BIOS P2.10 10/04/2013 0286
> > f0ba7fe7 813123c5  
> > 8107af94 880186caf000 fffb 8800c76b0800
> > 8800cae7 8800cae70ee0 7ffdd5397d98 Call Trace:
> > [] ? dump_stack+0x5c/0x77 [] ?
> > __warn+0xc4/0xe0 [] ?
> > btrfs_dev_replace_start+0x2be/0x400 [btrfs] [] ?
> > btrfs_ioctl+0x1d42/0x2190 [btrfs] [] ?
> > handle_mm_fault+0x154d/0x1cb0 [] ?
> > do_vfs_ioctl+0x99/0x5d0 [] ? SyS_ioctl+0x76/0x90
> > [] ? system_call_fast_compare_end+0xc/0x96
> > ---[ end trace 9fbfaa137cc5a72a ]---
> >
> >
> >
> > What should I do to replace correctly my drive ?
> 
> I don't often see handle_mm_fault with btrfs problems, maybe the
> entire dmesg from mounting the fs and including btrfs replace would
> reveal a related problem that instigates the failure?
> 
> If the device being replaced is acting unreliably, then you'd want to
> use -r with replace to ignore that device unless it's absolutely
> necessary to read from it.
> 

Thanks for your help.

Ok here is the log from the mounting, and including btrfs replace
(btrfs replace start -f /dev/sda1 /dev/sdd1 /home):

BTRFS info (device sdb1): disk space caching is enabled
BTRFS info (device sdb1): bdev /dev/sdb1 errs: wr 11881695, rd 12, flush 7928, 
corrupt 1705631, gen 1335
BTRFS info (device sdb1): bdev /dev/sda1 errs: wr 0, rd 0, flush 0, corrupt 
14220, gen 24
BTRFS info (device sdb1): dev_replace from /dev/sda1 (devid 1) to /dev/sdd1 
started
scrub_handle_errored_block: 166 callbacks suppressed
BTRFS warning (device sdb1): checksum error at logical 93445255168 on dev 
/dev/sda1, sector 77669048, root 5, inode 3434831, offset 479232, length 4096, 
links 1 (path: user/.local/share/zeitgeist/activity.sqlite-wal)
btrfs_dev_stat_print_on_error: 166 callbacks suppressed
BTRFS error (device sdb1): bdev /dev/sda1 errs: wr 0, rd 0, flush 0, corrupt 
14221, gen 24
scrub_handle_errored_block: 166 callbacks suppressed
BTRFS error (device sdb1): unable to fixup (regular) error at logical 
93445255168 on dev /dev/sda1
BTRFS error (device sdb1): bdev /dev/sda1 errs: wr 0, rd 0, flush 0, corrupt 
14222, gen 24
BTRFS error (device sdb1): unable to fixup (regular) error at logical 
93445259264 on dev /dev/sda1
BTRFS warning (device sdb1): checksum error at logical 136349810688 on dev 
/dev/sda1, sector 140429952, root 5, inode 4265283, offset 0, length 4096, 
links 1 (path: user/Pictures/Picture-42-2.jpg)
BTRFS error (device sdb1): bdev /dev/sda1 errs: wr 0, rd 0, flush 0, corrupt 
14223, gen 24
BTRFS error (device sdb1): unable to fixup (regular) error at logical 
136349810688 on dev /dev/sda1
BTRFS warning (device sdb1): checksum 

Re: Adventures in btrfs raid5 disk recovery

2016-06-27 Thread Zygo Blaxell
On Mon, Jun 27, 2016 at 10:17:04AM -0600, Chris Murphy wrote:
> On Mon, Jun 27, 2016 at 5:21 AM, Austin S. Hemmelgarn
>  wrote:
> > On 2016-06-25 12:44, Chris Murphy wrote:
> >> On Fri, Jun 24, 2016 at 12:19 PM, Austin S. Hemmelgarn
> >>  wrote:
> >>
> >> OK but hold on. During scrub, it should read data, compute checksums
> >> *and* parity, and compare those to what's on-disk - > EXTENT_CSUM in
> >> the checksum tree, and the parity strip in the chunk tree. And if
> >> parity is wrong, then it should be replaced.
> >
> > Except that's horribly inefficient.  With limited exceptions involving
> > highly situational co-processors, computing a checksum of a parity block is
> > always going to be faster than computing parity for the stripe.  By using
> > that to check parity, we can safely speed up the common case of near zero
> > errors during a scrub by a pretty significant factor.
> 
> OK I'm in favor of that. Although somehow md gets away with this by
> computing and checking parity for its scrubs, and still manages to
> keep drives saturated in the process - at least HDDs, I'm not sure how
> it fares on SSDs.

A modest desktop CPU can compute raid6 parity at 6GB/sec, a less-modest
one at more than 10GB/sec.  Maybe a bottleneck is within reach of an
array of SSDs vs. a slow CPU.
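As a rough illustration of why the parity math itself is cheap (a toy Python sketch, not btrfs code — real kernels use vectorized SSE/AVX routines), RAID5 parity is just the bytewise XOR of the data strips, and a lost strip is the XOR of the survivors plus parity:

```python
# Toy RAID5 parity model (assumption: plain XOR parity, as in RAID5).
# This only demonstrates the math, not any kernel implementation.
def raid5_parity(strips):
    """Bytewise XOR of equal-length strips."""
    parity = bytearray(len(strips[0]))
    for strip in strips:
        for i, byte in enumerate(strip):
            parity[i] ^= byte
    return bytes(parity)

data = [b"\x01\x02", b"\x0f\x0f", b"\xf0\x00"]   # three data strips
p = raid5_parity(data)                           # parity strip

# Reconstruction: a lost strip is the XOR of the survivors and the parity.
recovered = raid5_parity([data[0], data[2], p])
assert recovered == data[1]
```

The inner loop is a single XOR per byte, which is why modern CPUs push it at several GB/sec.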

> It just came up again in a thread over the weekend on linux-raid@. I'm
> going to ask while people are paying attention if a patch to change
> the 30 second time out to something a lot higher has ever been
> floated, what the negatives might be, and where to get this fixed if
> it wouldn't be accepted in the kernel code directly.

Defaults are defaults, they're not for everyone.  30 seconds is about
two minutes too short for an SMR drive's worst-case write latency, or
28 seconds too long for an OLTP system, or just right for an end-user's
personal machine with a low-energy desktop drive and a long spin-up time.

Once a drive starts taking 30+ seconds to do I/O, I consider the drive
failed in the sense that it's too slow to meet latency requirements.
When the problem is that it's already taking too long, the solution is
not waiting even longer.  To put things in perspective, consider that
server hardware watchdog timeouts are typically 60 seconds by default
(if not maximum).

If anything, I want the timeout to be shorter so that upper layers with
redundancy can get an EIO and initiate repair promptly, and admins can
get notified to evict chronic offenders from their drive slots, without
having to pay extra for hard disk firmware with that feature.

> *Ideally* I think we'd want two timeouts. I'd like to see commands
> have a timer that results in merely a warning that could be used by
> e.g. btrfs scrub to know "hey this sector range is 'slow' I'm going to
> write over those sectors". That's how bad sectors start out, they read
> slower and eventually go beyond 30 seconds and now it's all link
> resets. If the problem could be fixed before then... that's the best
> scenario.

What's the downside of a link reset?  Can the driver not just return
EIO for all the outstanding IOs in progress at reset, and let the upper
layers deal with it?  Or is the problem that the upper layers are all
horribly broken by EIOs, or drive firmware horribly broken by link resets?

The upper layers could time the IOs, and make their own decisions based
on the timing (e.g. btrfs or mdadm could proactively repair anything that
took more than 10 seconds to read).  That might be a better approach,
since shortening the time to an EIO is only useful when you have a
redundancy layer in place to do something about them.
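A user-space sketch of that idea — timing reads and flagging slow ranges for proactive repair. The 10-second threshold and all names here are hypothetical; nothing below is an existing btrfs or md interface:

```python
import time

# Hypothetical policy sketch: threshold after which a block would be
# queued for proactive rewrite by a redundancy layer.
SLOW_READ_THRESHOLD = 10.0  # seconds

def timed_read(read_fn, block_nr):
    """Read a block and report whether it exceeded the repair threshold."""
    start = time.monotonic()
    data = read_fn(block_nr)
    elapsed = time.monotonic() - start
    return data, elapsed > SLOW_READ_THRESHOLD

def fake_read(block_nr):
    # Stand-in for a real device read.
    return b"\x00" * 4096

data, needs_repair = timed_read(fake_read, 0)
assert not needs_repair  # a healthy read finishes well under the threshold
```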

> The 2nd timer would be, OK the controller or drive just face planted, reset.
> 
> -- 
> Chris Murphy
> 




Re: Btrfs full balance command fails due to ENOSPC (bug 121071)

2016-06-27 Thread Hans van Kranenburg

Hi!

On 06/27/2016 11:26 PM, Henk Slager wrote:


btrfs-debug does not show metadata and system chunks; the balancing
problem might come from those.
This script does show all chunks:
https://github.com/knorrie/btrfs-heatmap/blob/master/show_usage.py


Since its creation, python-btrfs has gathered a few useful 
example scripts:


git clone https://github.com/knorrie/python-btrfs
cd python-btrfs/examples/
(get root prompt)

./show_usage.py /mountpoint <- view sorted by 'virtual' address space
./show_dev_extents.py /mountpoint <- view sorted by physical layout

The show_usage in the btrfs-heatmap repo is almost gone. I'm currently 
replacing all the proof-of-concept playing-around stuff in there with 
dedicated png-creation code that uses the python-btrfs lib.


So, it's better to refer to the examples in python-btrfs instead. Or, 
write some others to create overviews yourself, it gets easier every day. :D


Have fun,

--
Hans van Kranenburg - System / Network Engineer
T +31 (0)10 2760434 | hans.van.kranenb...@mendix.com | www.mendix.com
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Kernel bug during RAID1 replace

2016-06-27 Thread Chris Murphy
On Mon, Jun 27, 2016 at 3:36 PM, Saint Germain  wrote:
> Hello,
>
> I am on Debian Jessie with a kernel from backports:
> 4.6.0-0.bpo.1-amd64
>
> I am also using btrfs-tools 4.4.1-1.1~bpo8+1
>
> When trying to replace a RAID1 drive (with btrfs replace start
> -f /dev/sda1 /dev/sdd1), the operation is cancelled after completing
> only 5%.
>
> I got this error in the /var/log/syslog:
> [ cut here ]
> WARNING: CPU: 2 PID: 2617 at 
> /build/linux-9LouV5/linux-4.6.1/fs/btrfs/dev-replace.c:430 
> btrfs_dev_replace_start+0x2be/0x400 [btrfs]
> Modules linked in: uas(E) usb_storage(E) bnep(E) ftdi_sio(E) usbserial(E) 
> snd_hda_codec_hdmi(E) nls_utf8(E) nls_cp437(E) vfat(E) fat(E) intel_rapl(E) 
> x86_pkg_temp_thermal(E) intel_powerclamp(E) coretemp(E) kvm_intel(E) kvm(E) 
> iTCO_wdt(E) irqbypass(E) iTCO_vendor_support(E) crct10dif_pclmul(E) 
> crc32_pclmul(E) ghash_clmulni_intel(E) hmac(E) drbg(E) ansi_cprng(E) 
> aesni_intel(E) aes_x86_64(E) lrw(E) gf128mul(E) glue_helper(E) ablk_helper(E) 
> cryptd(E) wl(POE) btusb(E) btrtl(E) btbcm(E) btintel(E) cfg80211(E) 
> bluetooth(E) efi_pstore(E) snd_hda_codec_realtek(E) evdev(E) crc16(E) 
> serio_raw(E) pcspkr(E) efivars(E) joydev(E) snd_hda_codec_generic(E) 
> rfkill(E) snd_hda_intel(E) nuvoton_cir(E) rc_core(E) snd_hda_codec(E) i915(E) 
> battery(E) snd_hda_core(E) snd_hwdep(E) soc_button_array(E) tpm_tis(E) 
> drm_kms_helper(E) intel_smartconnect(E) snd_pcm(E) tpm(E) video(E) 
> i2c_i801(E) snd_timer(E) drm(E) snd(E) lpc_ich(E) i2c_algo_bit(E) 
> soundcore(E) mfd_core(E) mei_me(E) processor(E) button(E) mei(E) shpchp(E) 
> fuse(E) autofs4(E) hid_logitech_hidpp(E) btrfs(E) hid_logitech_dj(E) 
> usbhid(E) hid(E) xor(E) raid6_pq(E) sg(E) sr_mod(E) cdrom(E) sd_mod(E) 
> crc32c_intel(E) ahci(E) libahci(E) libata(E) psmouse(E) scsi_mod(E) 
> xhci_pci(E) ehci_pci(E) xhci_hcd(E) ehci_hcd(E) e1000e(E) usbcore(E) ptp(E) 
> pps_core(E) usb_common(E) fjes(E)
> CPU: 2 PID: 2617 Comm: btrfs Tainted: P   OE   4.6.0-0.bpo.1-amd64 #1 
> Debian 4.6.1-1~bpo8+1
> Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z87E-ITX, BIOS 
> P2.10 10/04/2013
>  0286 f0ba7fe7 813123c5 
>   8107af94 880186caf000 fffb
>  8800c76b0800 8800cae7 8800cae70ee0 7ffdd5397d98
> Call Trace:
>  [] ? dump_stack+0x5c/0x77
>  [] ? __warn+0xc4/0xe0
>  [] ? btrfs_dev_replace_start+0x2be/0x400 [btrfs]
>  [] ? btrfs_ioctl+0x1d42/0x2190 [btrfs]
>  [] ? handle_mm_fault+0x154d/0x1cb0
>  [] ? do_vfs_ioctl+0x99/0x5d0
>  [] ? SyS_ioctl+0x76/0x90
>  [] ? system_call_fast_compare_end+0xc/0x96
> ---[ end trace 9fbfaa137cc5a72a ]---
>
>
>
> What should I do to replace correctly my drive ?

I don't often see handle_mm_fault with btrfs problems; maybe the
entire dmesg, from mounting the fs through the btrfs replace, would
reveal a related problem that instigates the failure?

If the device being replaced is acting unreliably, then you'd want to
use -r with replace to ignore that device unless it's absolutely
necessary to read from it.

-- 
Chris Murphy


Kernel bug during RAID1 replace

2016-06-27 Thread Saint Germain
Hello,

I am on Debian Jessie with a kernel from backports:
4.6.0-0.bpo.1-amd64

I am also using btrfs-tools 4.4.1-1.1~bpo8+1

When trying to replace a RAID1 drive (with btrfs replace start
-f /dev/sda1 /dev/sdd1), the operation is cancelled after completing
only 5%.

I got this error in the /var/log/syslog:
[ cut here ]
WARNING: CPU: 2 PID: 2617 at 
/build/linux-9LouV5/linux-4.6.1/fs/btrfs/dev-replace.c:430 
btrfs_dev_replace_start+0x2be/0x400 [btrfs]
Modules linked in: uas(E) usb_storage(E) bnep(E) ftdi_sio(E) usbserial(E) 
snd_hda_codec_hdmi(E) nls_utf8(E) nls_cp437(E) vfat(E) fat(E) intel_rapl(E) 
x86_pkg_temp_thermal(E) intel_powerclamp(E) coretemp(E) kvm_intel(E) kvm(E) 
iTCO_wdt(E) irqbypass(E) iTCO_vendor_support(E) crct10dif_pclmul(E) 
crc32_pclmul(E) ghash_clmulni_intel(E) hmac(E) drbg(E) ansi_cprng(E) 
aesni_intel(E) aes_x86_64(E) lrw(E) gf128mul(E) glue_helper(E) ablk_helper(E) 
cryptd(E) wl(POE) btusb(E) btrtl(E) btbcm(E) btintel(E) cfg80211(E) 
bluetooth(E) efi_pstore(E) snd_hda_codec_realtek(E) evdev(E) crc16(E) 
serio_raw(E) pcspkr(E) efivars(E) joydev(E) snd_hda_codec_generic(E) rfkill(E) 
snd_hda_intel(E) nuvoton_cir(E) rc_core(E) snd_hda_codec(E) i915(E) battery(E) 
snd_hda_core(E) snd_hwdep(E) soc_button_array(E) tpm_tis(E) drm_kms_helper(E) 
intel_smartconnect(E) snd_pcm(E) tpm(E) video(E) i2c_i801(E) snd_timer(E) 
drm(E) snd(E) lpc_ich(E) i2c_algo_bit(E) soundcore(E) mfd_core(E) mei_me(E) 
processor(E) button(E) mei(E) shp
 chp(E) fuse(E) autofs4(E) hid_logitech_hidpp(E) btrfs(E) hid_logitech_dj(E) 
usbhid(E) hid(E) xor(E) raid6_pq(E) sg(E) sr_mod(E) cdrom(E) sd_mod(E) 
crc32c_intel(E) ahci(E) libahci(E) libata(E) psmouse(E) scsi_mod(E) xhci_pci(E) 
ehci_pci(E) xhci_hcd(E) ehci_hcd(E) e1000e(E) usbcore(E) ptp(E) pps_core(E) 
usb_common(E) fjes(E)
CPU: 2 PID: 2617 Comm: btrfs Tainted: P   OE   4.6.0-0.bpo.1-amd64 #1 
Debian 4.6.1-1~bpo8+1
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z87E-ITX, BIOS 
P2.10 10/04/2013
 0286 f0ba7fe7 813123c5 
  8107af94 880186caf000 fffb
 8800c76b0800 8800cae7 8800cae70ee0 7ffdd5397d98
Call Trace:
 [] ? dump_stack+0x5c/0x77
 [] ? __warn+0xc4/0xe0
 [] ? btrfs_dev_replace_start+0x2be/0x400 [btrfs]
 [] ? btrfs_ioctl+0x1d42/0x2190 [btrfs]
 [] ? handle_mm_fault+0x154d/0x1cb0
 [] ? do_vfs_ioctl+0x99/0x5d0
 [] ? SyS_ioctl+0x76/0x90
 [] ? system_call_fast_compare_end+0xc/0x96
---[ end trace 9fbfaa137cc5a72a ]---



What should I do to replace correctly my drive ?

Thanks in advance,


Re: Btrfs full balance command fails due to ENOSPC (bug 121071)

2016-06-27 Thread Henk Slager
On Mon, Jun 27, 2016 at 9:24 PM, Chris Murphy  wrote:
> On Mon, Jun 27, 2016 at 12:32 PM, Francesco Turco  wrote:
>> On 2016-06-27 20:18, Chris Murphy wrote:
>>> If you can grab btrfs-debugfs from
>>> https://github.com/kdave/btrfs-progs/blob/master/btrfs-debugfs
>>>
>>> And then attach the output to the bug report it might be useful for a
>>> developer. But really your case is an odd duck, because there's fully
>>> 14GiB unallocated, so it should be able to create a new one without
>>> problem.
>>>
>>> $ sudo ./btrfs-debugfs -b /
>>
>> Done! Thank you, I was not aware of the existence of btrfs-debug...
>
> I'm not certain what the "1 enospc errors during balance" refers to.
> That message happens several times, the balance operation isn't
> aborted, and doesn't come with any call traces (those appear later).
> Further, the btrfs-debugfs output suggests the balance worked - each
> bg is continguously located after the last and they're all new bg
> offset values compared to what's found in the dmesg.

btrfs-debug does not show metadata and system chunks; the balancing
problem might come from those.
This script does show all chunks:
https://github.com/knorrie/btrfs-heatmap/blob/master/show_usage.py

You might want to use vrange or drange balance filters so that you can
just target a certain chunk and maybe that gives a hint where the
problem might be. But anyhow, the behavior experienced is a bug.

> This might be that obscure -28 enospc bug that affects some file
> systems and hasn't been tracked down yet. If I recall correctly it's a
> misleading error, and the only work around to get rid of it is migrate
> to a new Btrfs file system. I don't think the file system is at any
> risk in the current state, but I'm not certain as it's already an edge
> case. I'd just make sure you keep suitably current backups and keep
> using it.
>
>
>
> --
> Chris Murphy


Re: Strange behavior when replacing device on BTRFS RAID 5 array.

2016-06-27 Thread Duncan
Nick Austin posted on Sun, 26 Jun 2016 20:57:32 -0700 as excerpted:

> I have a 4 device BTRFS RAID 5 filesystem.
> 
> One of the device members of this file system (sdr) had badblocks, so I
> decided to replace it.

While the others answered the direct question, there's something 
potentially more urgent...

Btrfs raid56 mode has some fundamentally serious bugs as currently 
implemented, that we are just now finding out how serious they 
potentially are.  For the details you can read the other active threads 
from the last week or so, but the important thing is that...

For the time being, raid56 mode cannot be trusted to repair itself, and as a 
result it is now strongly recommended against.  Unless you are using pure 
testing data that you don't care about whether it lives or dies (either 
because it literally /is/ that trivial, or because you have tested 
backups, /making/ it that trivial), I'd urgently recommend either getting 
your data off it ASAP, or rebalancing to redundant-raid, raid1 or raid10, 
instead of parity-raid (5/6), before something worse happens and you no 
longer can.

Raid1 mode is a reasonable alternative, as long as your data fits in the 
available space.  Keep in mind that btrfs raid1 is always two copies, 
with more than two devices adding capacity, not redundancy: three 
5.46 TB devices = 8.19 TB of usable space.  Given your 8+ TiB of data usage, 
plus metadata and system, that's unlikely to fit unless you delete some 
stuff (older snapshots probably, if you have them).  So you'll need to 
keep it to four devices of that size.

Btrfs raid10 is also considered as stable as btrfs in general, and would 
be doable with 4+ devices, but for various reasons I'll skip for brevity 
here (ask if you want them detailed), I'd recommend staying with btrfs 
raid1.

Or switch to md- or dm-raid1.  Or one other interesting alternative, a 
pair of md- or dm-raid0s, on top of which you run btrfs raid1.  That 
gives you the data integrity of btrfs raid1, with somewhat better speed 
than the reasonably stable but as yet unoptimized btrfs raid10.

And of course there's one other alternative, zfs, if you are good with 
its hardware requirements and licensing situation.

But I'd recommend btrfs raid1 as the simple choice.  It's what I'm using 
here (tho on a pair of ssds, so far smaller but faster media, so 
different use-case).

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: Adventures in btrfs raid5 disk recovery

2016-06-27 Thread Henk Slager
On Mon, Jun 27, 2016 at 6:17 PM, Chris Murphy  wrote:
> On Mon, Jun 27, 2016 at 5:21 AM, Austin S. Hemmelgarn
>  wrote:
>> On 2016-06-25 12:44, Chris Murphy wrote:
>>>
>>> On Fri, Jun 24, 2016 at 12:19 PM, Austin S. Hemmelgarn
>>>  wrote:
>>>
 Well, the obvious major advantage that comes to mind for me to
 checksumming
 parity is that it would let us scrub the parity data itself and verify
 it.
>>>
>>>
>>> OK but hold on. During scrub, it should read data, compute checksums
>>> *and* parity, and compare those to what's on-disk - > EXTENT_CSUM in
>>> the checksum tree, and the parity strip in the chunk tree. And if
>>> parity is wrong, then it should be replaced.
>>
>> Except that's horribly inefficient.  With limited exceptions involving
>> highly situational co-processors, computing a checksum of a parity block is
>> always going to be faster than computing parity for the stripe.  By using
>> that to check parity, we can safely speed up the common case of near zero
>> errors during a scrub by a pretty significant factor.
>
> OK I'm in favor of that. Although somehow md gets away with this by
> computing and checking parity for its scrubs, and still manages to
> keep drives saturated in the process - at least HDDs, I'm not sure how
> it fares on SSDs.

What I read in this thread clarifies the different flavors of errors I
saw when trying btrfs raid5 while corrupting 1 device or just
unexpectedly removing a device and replacing it with a fresh one.
I was especially unaware of the lack of parity csums, and I think
this is really wrong.

Consider a 4 disk btrfs raid10 and a 3 disk btrfs raid5. Both protect
against the loss of 1 device or badblocks on 1 device. In the current
design (unoptimized for performance), raid10 reads from 2 disks, and
raid5 does as well (as far as I remember) per task/process.
Which pair of strips raid10 picks is pseudo-random AFAIK, so one could
get low throughput if some device in the array is older/slower and
that one is picked. The mapping from device to fs logical layer is just
a simple function, namely a copy, which allows keeping data in place
(zero-copy). The data is at least read by the csum check, and in case of
failure the btrfs code picks the alternative strip and corrects it, etc.

For raid5, assuming it avoids the parity in principle, it is also
a strip pair and a csum check. In case of csum failure, one needs the
parity strip and a parity calculation. To me, it looks like the 'fear'
of this calculation has made raid56 a sort of add-on, instead of a
more integral part.

Looking at the raid6 perf test at boot in dmesg, it is 30 GByte/s, even
higher than memory bandwidth. So although a calculation is needed if
data0strip+paritystrip were used instead of data0strip+data1strip, I
think that, looking at total cost, it can be cheaper than spending time
on seeks, at least on HDDs. If the parity calculation is treated in a
transparent way, same as copy, then there is more flexibility in
selecting disks (and strips), which enables easier design and
performance optimizations I think.

>> The ideal situation that I'd like to see for scrub WRT parity is:
>> 1. Store checksums for the parity itself.
>> 2. During scrub, if the checksum is good, the parity is good, and we just
>> saved the time of computing the whole parity block.
>> 3. If the checksum is not good, then compute the parity.  If the parity just
>> computed matches what is there already, the checksum is bad and should be
>> rewritten (and we should probably recompute the whole block of checksums
>> it's in), otherwise, the parity was bad, write out the new parity and update
>> the checksum.

On this 3rd point: if the parity matches but the csum is not good, then
there is a btrfs design error or some hardware/CPU/memory problem.
Compare with btrfs raid10: if the copies match but the csum is wrong,
then something is fatally wrong. Just do the first step, the csum check,
and if it is wrong, regenerate the assumed-corrupt strip from the 3
others. And for 3-disk raid5, from the 2 others, whether that is a copy
or a parity calculation.

>> 4. Have an option to skip the csum check on the parity and always compute
>> it.
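The scrub flow in the quoted steps 1-3 can be sketched roughly like this (a Python toy model; `zlib.crc32` stands in for btrfs's crc32c, and the strip contents are illustrative):

```python
import zlib

def crc(block):
    # crc32 stands in for btrfs's crc32c here.
    return zlib.crc32(block)

def xor_parity(strips):
    """Bytewise XOR of equal-length strips (plain RAID5 parity)."""
    out = bytearray(len(strips[0]))
    for strip in strips:
        for i, byte in enumerate(strip):
            out[i] ^= byte
    return bytes(out)

def scrub_stripe(data_strips, parity, stored_parity_csum):
    # Step 2: checksum of the stored parity matches -> skip the recompute.
    if crc(parity) == stored_parity_csum:
        return "ok"
    # Step 3: csum mismatch -> recompute parity from the data strips.
    computed = xor_parity(data_strips)
    if computed == parity:
        return "rewrite-csum"    # parity was fine; the stored csum was bad
    return "rewrite-parity"      # parity was bad; write the recomputed strip

data = [b"\x01\x02", b"\x0f\x0f"]
good_parity = xor_parity(data)
assert scrub_stripe(data, good_parity, crc(good_parity)) == "ok"
assert scrub_stripe(data, good_parity, crc(good_parity) ^ 1) == "rewrite-csum"
assert scrub_stripe(data, b"\x00\x00", crc(good_parity)) == "rewrite-parity"
```

The common case (step 2) touches each parity strip once for a checksum, which is where the claimed speedup over always recomputing parity comes from.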


Re: Adventures in btrfs raid5 disk recovery

2016-06-27 Thread Chris Murphy
For what it's worth I found btrfs-map-logical can produce mapping for
raid5 (didn't test raid6) by specifying the extent block length. If
that's omitted it only shows the device+mapping for the first strip.

This example is a 3 disk raid5, with a 128KiB file all in a single extent.

[root@f24s ~]# btrfs-map-logical -l 14157742080 /dev/VG/a
mirror 1 logical 14157742080 physical 1109327872 device /dev/mapper/VG-a
mirror 2 logical 14157742080 physical 2183069696 device /dev/mapper/VG-c

[root@f24s ~]# btrfs-map-logical -l 14157742080 -b 131072 /dev/VG/a
mirror 1 logical 14157742080 physical 1109327872 device /dev/mapper/VG-a
mirror 1 logical 14157807616 physical 1075773440 device /dev/mapper/VG-b
mirror 2 logical 14157742080 physical 2183069696 device /dev/mapper/VG-c
mirror 2 logical 14157807616 physical 2183069696 device /dev/mapper/VG-c

It's also possible to use -c and -o to copy the extent to a file and
more easily diff it with a control file, rather than using dd.

Chris Murphy


[PATCH v2 4/4] btrfs/126,127,128: test feature ioctl and sysfs interfaces

2016-06-27 Thread jeffm
From: Jeff Mahoney 

This tests the exporting of feature information from the kernel via
sysfs and ioctl. The first test checks whether the sysfs permissions
are correct, if the information exported via sysfs matches
what the ioctls are reporting, and if they both match the on-disk
superblock's version of the feature sets. The second and third tests
test online setting and clearing of feature bits via the sysfs and
ioctl interfaces, checking whether they match the on-disk super on
each cycle.

Signed-off-by: Jeff Mahoney 
---
 common/btrfs | 203 +++
 src/btrfs_ioctl_helper.c |  88 +
 tests/btrfs/126  | 244 +++
 tests/btrfs/126.out  |   1 +
 tests/btrfs/127  | 166 
 tests/btrfs/127.out  |   1 +
 tests/btrfs/128  | 128 +
 tests/btrfs/128.out  |   1 +
 tests/btrfs/group|   3 +
 9 files changed, 835 insertions(+)
 create mode 100755 tests/btrfs/126
 create mode 100644 tests/btrfs/126.out
 create mode 100755 tests/btrfs/127
 create mode 100644 tests/btrfs/127.out
 create mode 100755 tests/btrfs/128
 create mode 100644 tests/btrfs/128.out

diff --git a/common/btrfs b/common/btrfs
index 5828d0a..2d7d0ce 100644
--- a/common/btrfs
+++ b/common/btrfs
@@ -48,3 +48,206 @@ _require_btrfs_raid_dev_pool()
_require_scratch_dev_pool 4 # RAID10
_require_scratch_dev_pool_equal_size
 }
+
+# TODO Add tool to enable and test unknown feature bits
+_btrfs_feature_lookup() {
+   local name=$1
+   class=""
+   case "$name" in
+   mixed_backref)  class=incompat; bit=0x1 ;;
+   default_subvol) class=incompat; bit=0x2 ;;
+   mixed_groups)   class=incompat; bit=0x4 ;;
+   compress_lzo)   class=incompat; bit=0x8 ;;
+   compress_lsov2) class=incompat; bit=0x10 ;;
+   big_metadata)   class=incompat; bit=0x20 ;;
+   extended_iref)  class=incompat; bit=0x40 ;;
+   raid56) class=incompat; bit=0x80 ;;
+   skinny_metadata)class=incompat; bit=0x100 ;;
+   compat:*)   class=compat; bit=${name##compat:} ;;
+   compat_ro:*)class=compat_ro; bit=${name##compat_ro:} ;;
+   incompat:*) class=incompat; bit=${name##incompat:} ;;
+   esac
+   if [ -z "$class" ]; then
+   echo "Unknown feature name $name. xfstests needs updating." \
+" Skipping the test of sysfs values to superblock values" \
+>&2
+   fi
+
+   echo "$class/$bit"
+}
+
+_btrfs_feature_get_class() {
+   bits=$(_btrfs_feature_lookup $1)
+   echo ${bits%/*}
+}
+
+_btrfs_feature_get_bit() {
+   bits=$(_btrfs_feature_lookup $1)
+   echo ${bits#*/}
+}
+
+_btrfs_feature_class_to_index()
+{
+   local class=$1
+   local index=0
+
+   case "$class" in
+   compat) index=0 ;;
+   compat_ro) index=1 ;;
+   incompat) index=2 ;;
+   *) echo "Invalid class name $class" >&2
+   esac
+
+   echo $index
+}
+
+# The ioctl helper outputs the supported feature flags as a series of
+# 9 hex numbers, which represent bitfields.
+# These 9 values represent 3 sets of 3 values.
+# supported flags: compat compat_ro incompat, starting at index 0
+# settable online: compat compat_ro incompat, starting at index 3
+# clearable online: compat compat_ro incompat, starting at index 6
+# The returned mask is: 0x1 settable | 0x2 clearable
+_btrfs_feature_ioctl_writeable_mask()
+{
+   local feature=$1
+   local mnt=$2
+   local index=0
+
+   # This usually won't matter.  The supported bits are fs-module global.
+   if [ -z "$mnt" ]; then
+   mnt=$TEST_DIR
+   fi
+
+   class=$(_btrfs_feature_get_class $1)
+   bit=$(_btrfs_feature_get_bit $1)
+   index=$(_btrfs_feature_class_to_index $class)
+
+   local set_index=$(( $index + 3 ))
+   local clear_index=$(( $index + 6 ))
+
+   out=$(src/btrfs_ioctl_helper $mnt GET_SUPPORTED_FEATURES)
+   set -- $out
+   supp_features=($@)
+
+   settable=$(( ${supp_features[$set_index]} & $bit ))
+   clearable=$(( ${supp_features[$clear_index]} & $bit ))
+
+   val=0
+   if [ "$settable" -ne 0 ]; then
+   val=$(( $val | 1 ))
+   fi
+   if [ "$clearable" -ne 0 ]; then
+   val=$(( $val | 2 ))
+   fi
+
+   echo $val
+}
+
+_btrfs_feature_ioctl_index_settable_mask()
+{
+   local class_index=$1
+   local mnt=$2
+
+   # This usually won't matter.  The supported bits are fs-module global.
+   if [ -z "$mnt" ]; then
+   mnt=$TEST_DIR
+   fi
+
+   local set_index=$(( $class_index + 3 ))
+
+   out=$(src/btrfs_ioctl_helper $mnt GET_SUPPORTED_FEATURES)
+   set -- $out
+   supp_features=($@)
+
+   echo $(( ${supp_features[$set_index]} ))
+}
+
+_btrfs_feature_ioctl_index_clearable_mask()
+{
+ 

[PATCH v2 0/4] btrfs feature testing + props fix

2016-06-27 Thread jeffm
From: Jeff Mahoney 

Hi all -

Thanks, Eryu, for the review.  The btrfs feature testing changes were a
patchset I wrote three years ago, and it looks like significant cleanup
has happened in xfstests since then.
review you had to do for them, but do appreciate that you did.

This version should fix the outstanding issues, including some issues
with the tests themselves, where e.g. the 32MB reserved size was
filesystem-size (and implementation) dependent.  Most notably, since these
tests share some common functionality that ultimately hit ~250 lines, I
chose to create a new common/btrfs library.  Other than that, I tried to
meet the level of consistency you were looking for with just printing
errors instead of failing, not depending on error codes, etc.

Thanks,

-Jeff

---

Jeff Mahoney (4):
  btrfs/048: extend _filter_btrfs_prop_error to handle additional errors
  btrfs/124: test global metadata reservation reporting
  btrfs/125: test sysfs exports of allocation and device membership info
  btrfs/126,127,128: test feature ioctl and sysfs interfaces

 .gitignore   |   1 +
 common/btrfs | 253 +++
 common/config|   7 +-
 common/filter.btrfs  |  10 +-
 src/Makefile |   3 +-
 src/btrfs_ioctl_helper.c | 220 +
 tests/btrfs/048  |   6 +-
 tests/btrfs/048.out  |   4 +-
 tests/btrfs/124  |  84 
 tests/btrfs/124.out  |   1 +
 tests/btrfs/125  | 177 +
 tests/btrfs/125.out  |   1 +
 tests/btrfs/126  | 244 +
 tests/btrfs/126.out  |   1 +
 tests/btrfs/127  | 166 +++
 tests/btrfs/127.out  |   1 +
 tests/btrfs/128  | 128 
 tests/btrfs/128.out  |   1 +
 tests/btrfs/group|   5 +
 19 files changed, 1302 insertions(+), 11 deletions(-)
 create mode 100644 common/btrfs
 create mode 100644 src/btrfs_ioctl_helper.c
 create mode 100755 tests/btrfs/124
 create mode 100644 tests/btrfs/124.out
 create mode 100755 tests/btrfs/125
 create mode 100644 tests/btrfs/125.out
 create mode 100755 tests/btrfs/126
 create mode 100644 tests/btrfs/126.out
 create mode 100755 tests/btrfs/127
 create mode 100644 tests/btrfs/127.out
 create mode 100755 tests/btrfs/128
 create mode 100644 tests/btrfs/128.out

-- 
1.8.5.6



[PATCH v2 1/4] btrfs/048: extend _filter_btrfs_prop_error to handle additional errors

2016-06-27 Thread jeffm
From: Jeff Mahoney 

btrfsprogs v4.5.3 changed the formatting of some error messages.  This
patch extends the filter for btrfs prop to handle those.

Signed-off-by: Jeff Mahoney 
---
 common/filter.btrfs | 10 +++---
 tests/btrfs/048 |  6 --
 tests/btrfs/048.out |  4 ++--
 3 files changed, 13 insertions(+), 7 deletions(-)

diff --git a/common/filter.btrfs b/common/filter.btrfs
index 9970f4d..0cf7f0d 100644
--- a/common/filter.btrfs
+++ b/common/filter.btrfs
@@ -72,15 +72,19 @@ _filter_btrfs_compress_property()
sed -e "s/compression=\(lzo\|zlib\)/COMPRESSION=XXX/g"
 }
 
-# filter name of the property from the output, optionally verify against $1
+# filter error messages from btrfs prop, optionally verify against $1
 # recognized message(s):
 #  "object is not compatible with property: label"
+#  "invalid value for property:{, value}"
+#  "failed to {get,set} compression for $PATH[.:]: Invalid argument"
 _filter_btrfs_prop_error()
 {
if ! [ -z "$1" ]; then
-   sed -e "s/\(compatible with property\): $1/\1/"
+   sed -e "s#\(compatible with property\): $1#\1#" \
+   -e "s#^\(.*failed to [sg]et compression for $1\)[:.] 
\(.*\)#\1: \2#"
else
-   sed -e "s/^\(.*compatible with property\).*/\1/"
+   sed -e "s#^\(.*compatible with property\).*#\1#" \
+   -e "s#^\(.*invalid value for property\)[:.].*#\1#"
fi
 }
 
diff --git a/tests/btrfs/048 b/tests/btrfs/048
index 4a36303..0b907b0 100755
--- a/tests/btrfs/048
+++ b/tests/btrfs/048
@@ -79,7 +79,8 @@ echo -e "\nTesting subvolume ro property"
 _run_btrfs_util_prog subvolume create $SCRATCH_MNT/sv1
 $BTRFS_UTIL_PROG property get $SCRATCH_MNT/sv1 ro
 echo "***"
-$BTRFS_UTIL_PROG property set $SCRATCH_MNT/sv1 ro foo
+$BTRFS_UTIL_PROG property set $SCRATCH_MNT/sv1 ro foo 2>&1 |
+   _filter_btrfs_prop_error
 echo "***"
 $BTRFS_UTIL_PROG property set $SCRATCH_MNT/sv1 ro true
 echo "***"
@@ -99,7 +100,8 @@ $BTRFS_UTIL_PROG property get $SCRATCH_MNT/testdir/file1 
compression
 $BTRFS_UTIL_PROG property get $SCRATCH_MNT/testdir/subdir1 compression
 echo "***"
 $BTRFS_UTIL_PROG property set $SCRATCH_MNT/testdir/file1 compression \
-   foo 2>&1 | _filter_scratch
+   foo 2>&1 | _filter_scratch |
+   _filter_btrfs_prop_error SCRATCH_MNT/testdir/file1
 echo "***"
 $BTRFS_UTIL_PROG property set $SCRATCH_MNT/testdir/file1 compression lzo
 $BTRFS_UTIL_PROG property get $SCRATCH_MNT/testdir/file1 compression
diff --git a/tests/btrfs/048.out b/tests/btrfs/048.out
index 0b20d0b..3e4e3d2 100644
--- a/tests/btrfs/048.out
+++ b/tests/btrfs/048.out
@@ -15,7 +15,7 @@ ERROR: object is not compatible with property
 Testing subvolume ro property
 ro=false
 ***
-ERROR: invalid value for property.
+ERROR: invalid value for property
 ***
 ***
 ro=true
@@ -27,7 +27,7 @@ ro=false
 
 Testing compression property
 ***
-ERROR: failed to set compression for SCRATCH_MNT/testdir/file1. Invalid 
argument
+ERROR: failed to set compression for SCRATCH_MNT/testdir/file1: Invalid 
argument
 ***
 compression=lzo
 compression=lzo
-- 
1.8.5.6



[PATCH v2 3/4] btrfs/125: test sysfs exports of allocation and device membership info

2016-06-27 Thread jeffm
From: Jeff Mahoney 

This tests the sysfs publishing for btrfs allocation and device
membership info under a number of different layouts, similar to the
btrfs replace test. We test the allocation files only for existence and
that they contain numerical values. We test the device membership
by mapping the devices used to create the file system to sysfs paths
and matching them against the paths used for the device membership
symlinks.

Signed-off-by: Jeff Mahoney 
---
 common/btrfs|   7 +++
 common/config   |   7 ++-
 tests/btrfs/125 | 177 
 tests/btrfs/125.out |   1 +
 tests/btrfs/group   |   1 +
 5 files changed, 190 insertions(+), 3 deletions(-)
 create mode 100755 tests/btrfs/125
 create mode 100644 tests/btrfs/125.out

diff --git a/common/btrfs b/common/btrfs
index b972b13..5828d0a 100644
--- a/common/btrfs
+++ b/common/btrfs
@@ -41,3 +41,10 @@ _require_btrfs_ioctl()
_notrun "btrfs ioctl $ioctl not implemented."
fi
 }
+
+# Requires the minimum size pool for largest btrfs RAID test
+_require_btrfs_raid_dev_pool()
+{
+   _require_scratch_dev_pool 4 # RAID10
+   _require_scratch_dev_pool_equal_size
+}
diff --git a/common/config b/common/config
index c25b1ec..8577924 100644
--- a/common/config
+++ b/common/config
@@ -201,13 +201,14 @@ export DEBUGFS_PROG="`set_prog_path debugfs`"
 # newer systems have udevadm command but older systems like RHEL5 don't.
 # But if neither one is available, just set it to "sleep 1" to wait for lv to
 # be settled
-UDEV_SETTLE_PROG="`set_prog_path udevadm`"
-if [ "$UDEV_SETTLE_PROG" == "" ]; then
+UDEVADM_PROG="`set_prog_path udevadm`"
+if [ "$UDEVADM_PROG" == "" ]; then
# try udevsettle command
UDEV_SETTLE_PROG="`set_prog_path udevsettle`"
 else
# udevadm is available, add 'settle' as subcommand
-   UDEV_SETTLE_PROG="$UDEV_SETTLE_PROG settle"
+   UDEV_SETTLE_PROG="$UDEVADM_PROG settle"
+   export UDEVADM_PROG
 fi
 # neither command is available, use sleep 1
 if [ "$UDEV_SETTLE_PROG" == "" ]; then
diff --git a/tests/btrfs/125 b/tests/btrfs/125
new file mode 100755
index 000..999a10e
--- /dev/null
+++ b/tests/btrfs/125
@@ -0,0 +1,177 @@
+#! /bin/bash
+# FS QA Test No. 125
+#
+# Test of the btrfs sysfs publishing
+#
+#---
+# Copyright (C) 2016 SUSE.  All rights reserved.
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#
+#---
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1
+
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+   cd /
+   rm -f $tmp.*
+}
+
+# remove previous $seqres.full before test
+rm -f $seqres.full
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/btrfs
+. ./common/filter
+
+# real QA test starts here
+_supported_fs btrfs
+_supported_os Linux
+_require_scratch
+_require_command "$UDEVADM_PROG" udevadm
+_require_test
+_require_btrfs_sysfs
+_require_btrfs_raid_dev_pool
+
+sysfs_root=$(_btrfs_get_sysfs $TEST_DIR)
+
+[ -d "$sysfs_root/allocation" ] || _notrun "sysfs allocation dir not found"
+[ -d "$sysfs_root/devices" ] || _notrun "sysfs devices dir not found"
+
+check_file()
+{
+   local file=$1
+   base="$(basename $(dirname $file))/$(basename $file)"
+   value="$(cat $file)"
+   if [ -n "$(echo $value | tr -d 0-9)" ]; then
+   echo "ERROR: $base: numerical value expected" \
+"(got $value)"
+   fi
+}
+
+check_chunk()
+{
+   path=$1
+   mkfs_options=$2
+
+   chunktype=$(basename $path)
+   [ -d "$path" ] || echo "No $chunktype directory."
+
+   for file in bytes_may_use bytes_pinned bytes_reserved bytes_used \
+   disk_total disk_used flags total_bytes \
+   total_bytes_pinned; do
+   check_file "$path/$file"
+   done
+
+   if [ "$chunktype" = "data" -o "$chunktype" = "mixed" ]; then
+   opt="-d"
+   elif [ "$chunktype" = "metadata" -o "$chunktype" = "system" ]; then
+   opt="-m"
+   fi
+
+   profile=$(echo $mkfs_options | sed -e "s/.*$opt \([[:alnum:]]*\).*/\1/")
+   [ -d "$path/$profile" ] || echo 

[PATCH v2 2/4] btrfs/124: test global metadata reservation reporting

2016-06-27 Thread jeffm
From: Jeff Mahoney 

Btrfs can now report the size of the global metadata reservation
via ioctl and sysfs.

This test confirms that we get sane results on an empty file system.

Signed-off-by: Jeff Mahoney 
---
 .gitignore   |   1 +
 common/btrfs |  43 +++
 src/Makefile |   3 +-
 src/btrfs_ioctl_helper.c | 132 +++
 tests/btrfs/124  |  84 ++
 tests/btrfs/124.out  |   1 +
 tests/btrfs/group|   1 +
 7 files changed, 264 insertions(+), 1 deletion(-)
 create mode 100644 common/btrfs
 create mode 100644 src/btrfs_ioctl_helper.c
 create mode 100755 tests/btrfs/124
 create mode 100644 tests/btrfs/124.out

diff --git a/.gitignore b/.gitignore
index 28bd180..0e4f2a1 100644
--- a/.gitignore
+++ b/.gitignore
@@ -39,6 +39,7 @@
 /src/append_reader
 /src/append_writer
 /src/bstat
+/src/btrfs_ioctl_helper
 /src/bulkstat_unlink_test
 /src/bulkstat_unlink_test_modified
 /src/dbtest
diff --git a/common/btrfs b/common/btrfs
new file mode 100644
index 000..b972b13
--- /dev/null
+++ b/common/btrfs
@@ -0,0 +1,43 @@
+#!/bin/bash
+# Functions for testing btrfs
+
+_btrfs_get_fsid()
+{
+   local mnt=$1
+   if [ -z "$mnt" ]; then
+   mnt=$TEST_DIR
+   fi
+   $BTRFS_UTIL_PROG filesystem show $mnt|awk '/uuid:/ {print $NF}'
+}
+
+_btrfs_get_sysfs()
+{
+   local mnt=$1
+   local fsid=$(_btrfs_get_fsid $mnt)
+   echo "/sys/fs/btrfs/$fsid"
+}
+
+_require_btrfs_sysfs()
+{
+   local mnt=$1
+   if [ -z "$mnt" ]; then
+   mnt=$TEST_DIR
+   fi
+   if [ ! -d "$(_btrfs_get_sysfs $mnt)" ];then
+   _notrun "btrfs sysfs support not available."
+   fi
+}
+
+_require_btrfs_ioctl()
+{
+   local ioctl=$1
+   local mnt=$2
+   shift 2
+   if [ -z "$mnt" ]; then
+   mnt=$TEST_DIR
+   fi
+   out=$(src/btrfs_ioctl_helper $mnt $ioctl $@)
+   if [ "$out" = "Not implemented." ]; then
+   _notrun "btrfs ioctl $ioctl not implemented."
+   fi
+}
diff --git a/src/Makefile b/src/Makefile
index 1bf318b..c467475 100644
--- a/src/Makefile
+++ b/src/Makefile
@@ -20,7 +20,8 @@ LINUX_TARGETS = xfsctl bstat t_mtab getdevicesize preallo_rw_pattern_reader \
bulkstat_unlink_test_modified t_dir_offset t_futimens t_immutable \
stale_handle pwrite_mmap_blocked t_dir_offset2 seek_sanity_test \
seek_copy_test t_readdir_1 t_readdir_2 fsync-tester nsexec cloner \
-   renameat2 t_getcwd e4compact test-nextquota punch-alternating
+   renameat2 t_getcwd e4compact test-nextquota punch-alternating \
+   btrfs_ioctl_helper
 
 SUBDIRS =
 
diff --git a/src/btrfs_ioctl_helper.c b/src/btrfs_ioctl_helper.c
new file mode 100644
index 000..b6eb924
--- /dev/null
+++ b/src/btrfs_ioctl_helper.c
@@ -0,0 +1,132 @@
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#ifndef BTRFS_IOCTL_MAGIC
+#define BTRFS_IOCTL_MAGIC 0x94
+#endif
+
+#ifndef BTRFS_IOC_SPACE_INFO
+struct btrfs_ioctl_space_info {
+uint64_t flags;
+uint64_t total_bytes;
+uint64_t used_bytes;
+};
+
+struct btrfs_ioctl_space_args {
+uint64_t space_slots;
+uint64_t total_spaces;
+struct btrfs_ioctl_space_info spaces[0];
+};
+#define BTRFS_IOC_SPACE_INFO _IOWR(BTRFS_IOCTL_MAGIC, 20, \
+struct btrfs_ioctl_space_args)
+#endif
+#ifndef BTRFS_SPACE_INFO_GLOBAL_RSV
+#define BTRFS_SPACE_INFO_GLOBAL_RSV (1ULL << 49)
+#endif
+
+static int global_rsv_ioctl(int fd, int argc, char *argv[])
+{
+   struct btrfs_ioctl_space_args arg;
+   struct btrfs_ioctl_space_args *args;
+   int ret;
+   int i;
+   size_t size;
+
+   arg.space_slots = 0;
+
+   ret = ioctl(fd, BTRFS_IOC_SPACE_INFO, &arg);
+   if (ret)
+   return -errno;
+
+   size = sizeof(*args) + sizeof(args->spaces[0]) * arg.total_spaces;
+   args = malloc(size);
+   if (!args)
+   return -ENOMEM;
+
+   args->space_slots = arg.total_spaces;
+
+   ret = ioctl(fd, BTRFS_IOC_SPACE_INFO, args);
+   if (ret)
+   return -errno;
+
+   for (i = 0; i < args->total_spaces; i++) {
+   if (args->spaces[i].flags & BTRFS_SPACE_INFO_GLOBAL_RSV) {
+   unsigned long long reserved;
+   reserved = args->spaces[i].total_bytes;
+   printf("%llu\n", reserved);
+   return 0;
+   }
+   }
+
+   return -ENOENT;
+}
+
+#define IOCTL_TABLE_ENTRY(_ioctl_name, _handler) \
+   { .name = #_ioctl_name, .ioctl_cmd = BTRFS_IOC_##_ioctl_name, \
+ .handler = _handler, }
+
+struct ioctl_table_entry {
+   const char *name;
+   unsigned ioctl_cmd;
+   int (*handler)(int fd, int argc, char *argv[]);
+};
+
+static struct ioctl_table_entry 

Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5

2016-06-27 Thread Duncan
Steven Haigh posted on Mon, 27 Jun 2016 13:21:00 +1000 as excerpted:

> I'd also recommend updates to the ArchLinux wiki - as for some reason I
> always seem to end up there when searching for a certain topic...

Not really btrfs related, but for people using popular search engines, at 
least, this is likely for two reasons:

1) Arch is apparently the most popular distro among reasonably 
technically literate users -- those who will both tend to have good 
technical knowledge and advice on the various real-life issues Linux 
users tend to encounter, and are likely to post it to an easily publicly 
indexable forum.  (And in that regard, wikis are likely to be more 
indexable than (web archives of) mailing lists like this, because that's 
(part of) what wikis /do/ by design, make topics keyword searchable.  
Lists, not so much.)

2) Specifically to the point of being _publicly_ indexable, Arch may have 
a more liberal robots.txt that allows indexers more access, than other 
distros, some of which may limit robot access for performance reasons.


With this combination, arch's wiki is a natural place for searches to 
point.

So agreed, a high priority on getting the raid56 warning up there on the 
arch wiki is a good idea, indeed.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: Btrfs full balance command fails due to ENOSPC (bug 121071)

2016-06-27 Thread Chris Murphy
On Mon, Jun 27, 2016 at 12:32 PM, Francesco Turco  wrote:
> On 2016-06-27 20:18, Chris Murphy wrote:
>> If you can grab btrfs-debugfs from
>> https://github.com/kdave/btrfs-progs/blob/master/btrfs-debugfs
>>
>> And then attach the output to the bug report it might be useful for a
>> developer. But really your case is an odd duck, because there's fully
>> 14GiB unallocated, so it should be able to create a new one without
>> problem.
>>
>> $ sudo ./btrfs-debugfs -b /
>
> Done! Thank you, I was not aware of the existence of btrfs-debug...

I'm not certain what the "1 enospc errors during balance" refers to.
That message happens several times, the balance operation isn't
aborted, and it doesn't come with any call traces (those appear later).
Further, the btrfs-debugfs output suggests the balance worked - each
bg is contiguously located after the last and they all have new bg
offset values compared to what's found in the dmesg.

This might be that obscure -28 enospc bug that affects some file
systems and hasn't been tracked down yet. If I recall correctly it's a
misleading error, and the only workaround to get rid of it is to
migrate to a new Btrfs file system. I don't think the file system is
at any risk in its current state, but I'm not certain, as it's already
an edge case. I'd just make sure you keep suitably current backups and
keep using it.



-- 
Chris Murphy


Re: Bug in 'btrfs filesystem du' ?

2016-06-27 Thread Henk Slager
On Mon, Jun 27, 2016 at 3:33 PM, M G Berberich  wrote:
> Am Montag, den 27. Juni schrieb M G Berberich:
>> after a balance ‘btrfs filesystem du’ probably shows false data about
>> shared data.
>
> Oh, I forgot: I have btrfs-progs v4.5.2 and kernel 4.6.2.

With  btrfs-progs v4.6.1 and kernel 4.7-rc5, the numbers are correct
about shared data.


Re: Btrfs full balance command fails due to ENOSPC (bug 121071)

2016-06-27 Thread Francesco Turco
On 2016-06-27 20:18, Chris Murphy wrote:
> If you can grab btrfs-debugfs from
> https://github.com/kdave/btrfs-progs/blob/master/btrfs-debugfs
> 
> And then attach the output to the bug report it might be useful for a
> developer. But really your case is an odd duck, because there's fully
> 14GiB unallocated, so it should be able to create a new one without
> problem.
> 
> $ sudo ./btrfs-debugfs -b /

Done! Thank you, I was not aware of the existence of btrfs-debug...

-- 
Website: http://www.fturco.net/
GPG key: 6712 2364 B2FE 30E1 4791 EB82 7BB1 1F53 29DE CD34


Re: Btrfs full balance command fails due to ENOSPC (bug 121071)

2016-06-27 Thread Chris Murphy
On Mon, Jun 27, 2016 at 11:28 AM, Francesco Turco  wrote:
> Note: I already filed bug 121071 but perhaps I should have written to
> this mailing list first.

https://bugzilla.kernel.org/show_bug.cgi?id=121071

It's a good bug report.


> Is there anything I can try? Should I run the command from a live CD? Is
> this a real bug or a mistake from an inexperienced btrfs user like me?

It's a bug somewhere, not a user mistake.

If you can grab btrfs-debugfs from
https://github.com/kdave/btrfs-progs/blob/master/btrfs-debugfs

And then attach the output to the bug report it might be useful for a
developer. But really your case is an odd duck, because there's fully
14GiB unallocated, so it should be able to create a new one without
problem.

$ sudo ./btrfs-debugfs -b /



-- 
Chris Murphy


Re: Strange behavior when replacing device on BTRFS RAID 5 array.

2016-06-27 Thread Chris Murphy
On Mon, Jun 27, 2016 at 11:29 AM, Chris Murphy  wrote:

>
> Next is to decide to what degree you want to salvage this volume and
> keep using Btrfs raid56 despite the risks

Forgot to complete this thought. So if you get a backup, and decide
you want to fix it, I would see if you can cancel the replace using
"btrfs replace cancel " and confirm that it stops. And now is the
risky part, which is whether to try "btrfs add" and then "btrfs
remove" or remove the bad drive, reboot, and see if it'll mount with
-o degraded, and then use add and remove (in which case you'll use
'remove missing').

With the first, you risk Btrfs still using the flaky bad drive.

With the second, you risk whether a degraded mount will work, and
whether any other drive in the array has a problem while degraded
(like an unrecoverable read error from a single sector).


-- 
Chris Murphy


Re: Strange behavior when replacing device on BTRFS RAID 5 array.

2016-06-27 Thread Austin S. Hemmelgarn

On 2016-06-27 13:29, Chris Murphy wrote:

On Sun, Jun 26, 2016 at 10:02 PM, Nick Austin  wrote:

On Sun, Jun 26, 2016 at 8:57 PM, Nick Austin  wrote:

sudo btrfs fi show /mnt/newdata
Label: '/var/data'  uuid: e4a2eb77-956e-447a-875e-4f6595a5d3ec
Total devices 4 FS bytes used 8.07TiB
devid1 size 5.46TiB used 2.70TiB path /dev/sdg
devid2 size 5.46TiB used 2.70TiB path /dev/sdl
devid3 size 5.46TiB used 2.70TiB path /dev/sdm
devid4 size 5.46TiB used 2.70TiB path /dev/sdx


It looks like fi show has bad data:

When I start heavy IO on the filesystem (running rsync -c to verify the data),
I notice zero IO on the bad drive I told btrfs to replace, and lots of IO to the
 expected replacement.

I guess some metadata is messed up somewhere?

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
  25.190.007.81   28.460.00   38.54

Device:tpskB_read/skB_wrtn/skB_readkB_wrtn
sdg 437.00 75168.00  1792.00  75168   1792
sdl 443.00 76064.00  1792.00  76064   1792
sdm 438.00 75232.00  1472.00  75232   1472
sdw 443.00 75680.00  1856.00  75680   1856
sdx   0.00 0.00 0.00  0  0


There have been reports of bugs with 'btrfs replace' and raid56, but I
don't know the exact nature of those bugs, or when and how they
manifest. It's recommended to fall back to using 'btrfs add' and then
'btrfs delete', but you have other issues going on also.
One other thing to mention: if the device is failing, _always_ add '-r'
to the replace command line.  This tells it to avoid reading from
the device being replaced (in raid1 or raid10 mode, it will pull from
the other mirror; in raid5/6 mode, it will recompute the block from
parity and compare it to the stored checksums, which in turn means that
this _will_ be slower on raid5/6 than a regular replace).  Link resets
and other issues that cause devices to disappear become more common the
more damaged a disk is, so avoiding reading from it becomes more
important too, because just reading from a disk puts stress on it.



Re: Strange behavior when replacing device on BTRFS RAID 5 array.

2016-06-27 Thread Chris Murphy
On Sun, Jun 26, 2016 at 10:02 PM, Nick Austin  wrote:
> On Sun, Jun 26, 2016 at 8:57 PM, Nick Austin  wrote:
>> sudo btrfs fi show /mnt/newdata
>> Label: '/var/data'  uuid: e4a2eb77-956e-447a-875e-4f6595a5d3ec
>> Total devices 4 FS bytes used 8.07TiB
>> devid1 size 5.46TiB used 2.70TiB path /dev/sdg
>> devid2 size 5.46TiB used 2.70TiB path /dev/sdl
>> devid3 size 5.46TiB used 2.70TiB path /dev/sdm
>> devid4 size 5.46TiB used 2.70TiB path /dev/sdx
>
> It looks like fi show has bad data:
>
> When I start heavy IO on the filesystem (running rsync -c to verify the data),
> I notice zero IO on the bad drive I told btrfs to replace, and lots of IO to 
> the
>  expected replacement.
>
> I guess some metadata is messed up somewhere?
>
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>   25.190.007.81   28.460.00   38.54
>
> Device:tpskB_read/skB_wrtn/skB_readkB_wrtn
> sdg 437.00 75168.00  1792.00  75168   1792
> sdl 443.00 76064.00  1792.00  76064   1792
> sdm 438.00 75232.00  1472.00  75232   1472
> sdw 443.00 75680.00  1856.00  75680   1856
> sdx   0.00 0.00 0.00  0  0

There have been reports of bugs with 'btrfs replace' and raid56, but I
don't know the exact nature of those bugs, or when and how they
manifest. It's recommended to fall back to using 'btrfs add' and then
'btrfs delete', but you have other issues going on also.

Devices dropping off and being renamed is something btrfs, in my
experience, does not handle well at all. The very fact the hardware is
dropping off and coming back is bad, so you really need to get that
sorted out as a prerequisite no matter what RAID technology you're
using.

First advice, make a backup. Don't change the volume further until
you've done this. Each attempt to make the volume healthy again
carries risks of totally breaking it and losing the ability to mount
it. So as long as it's mounted, take advantage of that. Pretend the
very next repair attempt will break the volume, and make your backup
accordingly.

Next is to decide to what degree you want to salvage this volume and
keep using Btrfs raid56 despite the risks (it's still rather
experimental, and in particular some things have been realized on the
list in the last week especially that make it not recommended, except
by people willing to poke it with a stick and learn how many more
bodies can be found with the current implementation) or if you just
want to migrate it over to something like XFS on mdadm or LVM raid 5
as soon as possible?

There's also the obligatory notice that applies to all Linux software
raid implementations which is to discover if you have a very common
misconfiguration that enhances the chance of data loss if the volume
ever goes degraded and you need to rebuild with a new drive:

smartctl -l scterc 
cat /sys/block//device/timeout

The first value must be less than the second. Note the first value is
in deciseconds, the second is in seconds. And either 'unsupported' or
'unset' translates into a vague value that could be as high as 180
seconds.
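That comparison can be sketched in shell. The numbers below are
hypothetical stand-ins for what smartctl and sysfs would report on a
real drive; the comments note where they would actually come from:

```shell
# Hypothetical values standing in for real hardware queries:
#   smartctl -l scterc /dev/sdX      -> ERC limit, in deciseconds
#   /sys/block/sdX/device/timeout    -> kernel SCSI command timer, in seconds
erc_ds=70          # drive gives up error recovery after 7.0 seconds
kernel_timeout=30  # kernel resets the link after 30 seconds

# The drive must give up before the kernel does; compare in deciseconds.
if [ "$erc_ds" -lt $((kernel_timeout * 10)) ]; then
    echo "OK: drive error recovery ends before the kernel command timer"
else
    echo "WARNING: misconfigured - raise the kernel timeout or lower SCT ERC"
fi
```

With these illustrative values (7.0s vs 30s) the configuration passes;
a drive reporting 'unsupported' would have to be treated as up to 180
seconds, which fails the check.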



-- 
Chris Murphy


Btrfs full balance command fails due to ENOSPC (bug 121071)

2016-06-27 Thread Francesco Turco
Note: I already filed bug 121071 but perhaps I should have written to
this mailing list first.

I get the ENOSPC error when running a btrfs full balance command for my
root partition, even if it seems I have a lot of free/unallocated space.

# btrfs filesystem show /
Label: none  uuid: 27150b83-7d90-4031-8e83-581315b9a254
Total devices 1 FS bytes used 10.79GiB
devid1 size 25.00GiB used 13.31GiB path /dev/mapper/Desktop-root
# btrfs filesystem df /
Data, single: total=11.00GiB, used=10.40GiB
System, DUP: total=32.00MiB, used=16.00KiB
Metadata, DUP: total=1.12GiB, used=392.08MiB
GlobalReserve, single: total=144.00MiB, used=0.00B
# btrfs balance start --full-balance /
ERROR: error during balancing '/': No space left on device
There may be more info in syslog - try dmesg | tail
# dmesg | tail
[29807.441930] BTRFS info (device dm-2): found 13206 extents
[29807.879845] BTRFS info (device dm-2): relocating block group
47542435840 flags 1
[29827.116083] BTRFS info (device dm-2): found 12909 extents
[29830.500110] BTRFS info (device dm-2): found 12909 extents
[29830.976485] BTRFS info (device dm-2): relocating block group
46468694016 flags 1
[29848.924188] BTRFS info (device dm-2): found 5129 extents
[29851.533076] BTRFS info (device dm-2): found 5129 extents
[29851.994787] BTRFS info (device dm-2): relocating block group
46435139584 flags 34
[29852.399460] BTRFS info (device dm-2): found 1 extents
[29852.657983] BTRFS info (device dm-2): 1 enospc errors during balance

I have successfully balanced both the boot and home partitions before.
Only root gives me problems.

Is there anything I can try? Should I run the command from a live CD? Is
this a real bug or a mistake from an inexperienced btrfs user like me?

Thanks.

-- 
Website: http://www.fturco.net/
GPG key: 6712 2364 B2FE 30E1 4791 EB82 7BB1 1F53 29DE CD34


Re: Bad hard drive - checksum verify failure forces readonly mount

2016-06-27 Thread Chris Murphy
On Mon, Jun 27, 2016 at 12:30 AM, Vasco Almeida  wrote:

> File system image available at (choose one link)
> https://mega.nz/#!AkAEgKyB!RUa7G5xHIygWm0ALx5ZxQjjXNdFYa7lDRHJ_sW0bWLs
> https://www.sendspace.com/file/i70cft




> Should I file a bug report with that image dump linked above or btrfs-
> debug-tree output or both?

If it were me, I'd include both. Maybe the image is incomplete or vice
versa. The debug tree output is also human readable. I'd also put them
up in a cloud location where you can kinda forget about them for a
while, I've had images not looked at for 6+ months by a dev.


> I think I will use the subject of this thread as summary to file the
> bug. Can you think of something more suitable or is that fine?

I would try to summarize something like:
file system created with btrfs-progs version -, and mostly used
with kernel version -, and inexplicably the file system became
unusable at boot time always mounting only readonly. Newer kernel
versions still could not mount it, nor was btrfs check using
btrfs-progs version - able to repair. See thread URL for more
details.

btrfs-image URL
btrfs-debug-tree URL


> I think I will reinstall the OS since, even if I manage to recover the
> file system from this issue, that OS will be something I can not trust
> fully.

Yeah pretty much that's right. There is an rpm command where you can
have it check the signatures of all installed binaries, but I forget
what it is offhand. That'd be an alternative to reinstalling if the
init options were to work.


-- 
Chris Murphy


Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5

2016-06-27 Thread Christoph Anton Mitterer
On Mon, 2016-06-27 at 07:35 +0300, Andrei Borzenkov wrote:
> The problem is that current implementation of RAID56 puts exactly CoW
> data at risk. I.e. writing new (copy of) data may suddenly make old
> (copy of) data inaccessible, even though it had been safely committed
> to
> disk and is now in read-only snapshot.

Sure,... mine was just a general thing to be added.
No checksums => no way to tell which block is valid in case of silent
block errors => no way to recover unless by chance

=> should be included as a warning, especially as userland software
starts to automatically set nodatacow (IIRC systemd does so), thereby
silently breaking functionality (integrity+recoverability) assumed by
the user.


Cheers,
Chris-



Re: Adventures in btrfs raid5 disk recovery

2016-06-27 Thread Chris Murphy
On Mon, Jun 27, 2016 at 5:21 AM, Austin S. Hemmelgarn
 wrote:
> On 2016-06-25 12:44, Chris Murphy wrote:
>>
>> On Fri, Jun 24, 2016 at 12:19 PM, Austin S. Hemmelgarn
>>  wrote:
>>
>>> Well, the obvious major advantage that comes to mind for me to
>>> checksumming
>>> parity is that it would let us scrub the parity data itself and verify
>>> it.
>>
>>
>> OK but hold on. During scrub, it should read data, compute checksums
>> *and* parity, and compare those to what's on-disk - > EXTENT_CSUM in
>> the checksum tree, and the parity strip in the chunk tree. And if
>> parity is wrong, then it should be replaced.
>
> Except that's horribly inefficient.  With limited exceptions involving
> highly situational co-processors, computing a checksum of a parity block is
> always going to be faster than computing parity for the stripe.  By using
> that to check parity, we can safely speed up the common case of near zero
> errors during a scrub by a pretty significant factor.

OK I'm in favor of that. Although somehow md gets away with this by
computing and checking parity for its scrubs, and still manages to
keep drives saturated in the process - at least HDDs, I'm not sure how
it fares on SSDs.




> The ideal situation that I'd like to see for scrub WRT parity is:
> 1. Store checksums for the parity itself.
> 2. During scrub, if the checksum is good, the parity is good, and we just
> saved the time of computing the whole parity block.
> 3. If the checksum is not good, then compute the parity.  If the parity just
> computed matches what is there already, the checksum is bad and should be
> rewritten (and we should probably recompute the whole block of checksums
> it's in), otherwise, the parity was bad, write out the new parity and update
> the checksum.
> 4. Have an option to skip the csum check on the parity and always compute
> it.
>>
>>
>> Even check > md/sync_action does this. So no pun intended but Btrfs
>> isn't even at parity with mdadm on data integrity if it doesn't check
>> if the parity matches data.
>
> Except that MD and LVM don't have checksums to verify anything outside of
> the very high-level metadata.  They have to compute the parity during a
> scrub because that's the _only_ way they have to check data integrity.  Just
> because that's the only way for them to check it does not mean we have to
> follow their design, especially considering that we have other, faster ways
> to check it.

I'm not opposed to this optimization. But to retroactively qualify my
previous "major advantage": what I meant was in terms of solving a
functional deficiency.
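As a reference point for the discussion, here is a minimal shell sketch
of that csum-first scrub flow; compute_csum, compute_parity, and the
stored_* values are illustrative stubs standing in for btrfs internals,
not real interfaces:

```shell
# Sketch of the proposed csum-first parity scrub for one parity block.
compute_csum()   { echo "csum-of-$1"; }   # fast: checksum one parity block
compute_parity() { echo "P"; }            # slow: recompute from the data strips

stored_parity="P"            # parity strip as read from disk
stored_csum="csum-of-P"      # its checksum from the csum tree (proposed)

if [ "$(compute_csum "$stored_parity")" = "$stored_csum" ]; then
    # Common case: csum matches, skip the expensive parity recompute.
    echo "parity good"
elif [ "$(compute_parity)" = "$stored_parity" ]; then
    # Parity itself is fine, so the stored csum was bad: rewrite the csum.
    echo "csum bad, rewrite csum"
else
    # Parity is bad: write out recomputed parity and update its csum.
    echo "parity bad, rewrite parity and update csum"
fi
```

The point of the design is visible in the branch order: the cheap
checksum comparison handles the near-zero-error common case, and the
expensive parity recompute only runs on a mismatch.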



>> The much bigger problem we have right now that affects Btrfs,
>> LVM/mdadm md raid, is this silly bad default with non-enterprise
>> drives having no configurable SCT ERC, with ensuing long recovery
>> times, and the kernel SCSI command timer at 30 seconds - which
>> actually also fucks over regular single disk users also because it
>> means they don't get the "benefit" of long recovery times, which is
>> the whole g'd point of that feature. This itself causes so many
>> problems where bad sectors just get worse and don't get fixed up
>> because of all the link resets. So I still think it's a bullshit
>> default kernel side because it pretty much affects the majority use
>> case, it is only a non-problem with proprietary hardware raid, and
>> software raid using enterprise (or NAS specific) drives that already
>> have short recovery times by default.
>
> On this, we can agree.

It just came up again in a thread over the weekend on linux-raid@. I'm
going to ask, while people are paying attention, whether a patch to
change the 30-second timeout to something a lot higher has ever been
floated, what the negatives might be, and where to get this fixed if
it wouldn't be accepted in the kernel code directly.

*Ideally* I think we'd want two timeouts. I'd like to see commands
have a timer that results in merely a warning that could be used by
e.g. btrfs scrub to know "hey, this sector range is 'slow', I'm going
to write over those sectors". That's how bad sectors start out: they
read slower and eventually go beyond 30 seconds, and then it's all
link resets. If the problem could be fixed before then... that's the
best scenario.

The 2nd timer would be, OK the controller or drive just face planted, reset.

-- 
Chris Murphy


Re: Bug in 'btrfs filesystem du' ?

2016-06-27 Thread M G Berberich
Am Montag, den 27. Juni schrieb M G Berberich:
> after a balance ‘btrfs filesystem du’ probably shows false data about
> shared data.

Oh, I forgot: I have btrfs-progs v4.5.2 and kernel 4.6.2.

MfG
bmg

-- 
„Des is völlig wurscht, was heut beschlos- | M G Berberich
 sen wird: I bin sowieso dagegn!“  | m...@m-berberich.de
(SPD-Stadtrat Kurt Schindler; Regensburg)  | 


Bug in 'btrfs filesystem du' ?

2016-06-27 Thread M G Berberich
Hello,

after a balance ‘btrfs filesystem du’ probably shows false data about
shared data.

To reproduce, create a (smal) btrfs-filesystem, copy some data in a
directory, then ‘cp -a --reflink’ the data. Now all data is shared and
‘btrfs fi du’ shows it correct. In my case:

 Total   Exclusive  Set shared  Filename
  59.38MiB29.69MiB29.69MiB  .

after a balance ‘btrfs fi du’ shows no shared data any more, but all
data as exclusive. In my case:

 Total   Exclusive  Set shared  Filename
  59.38MiB59.38MiB   0.00B  .

As ‘btrfs fi df’ still shows used=29.69MiB, the problem probably is
in btrfs-tools.

Test-session log:

# dd if=/dev/urandom of=dev-btrfs bs=4K count=10
10+0 records in
10+0 records out
40960 bytes (410 MB, 391 MiB) copied, 24.7574 s, 16.5 MB/s

# mkfs.btrfs dev-btrfs
btrfs-progs v4.5.2
See http://btrfs.wiki.kernel.org for more information.
 
Label:  (null)
UUID:   698a2755-8ecb-468d-9577-9a48947361ea
Node size:  16384
Sector size:4096
Filesystem size:390.62MiB
Block group profiles:
  Data: single8.00MiB
  Metadata: DUP  40.00MiB
  System:   DUP  12.00MiB
SSD detected:   no
Incompat features:  extref, skinny-metadata
Number of devices:  1
Devices:
   IDSIZE  PATH
1   390.62MiB  dev-btrfs
 

# mount /tmp/dev-btrfs /mnt/

# cd /mnt/

# btrfs fi du -s .
 Total   Exclusive  Set shared  Filename
 0.00B   0.00B   0.00B  .

# cp -a /scratch/kernel/linux-4.6/drivers/usb .

# btrfs fi du -s .
 Total   Exclusive  Set shared  Filename
  28.96MiB28.96MiB   0.00B  .

# btrfs fi df .
Data, single: total=56.00MiB, used=3.61MiB
System, DUP: total=8.00MiB, used=16.00KiB
Metadata, DUP: total=32.00MiB, used=192.00KiB
GlobalReserve, single: total=16.00MiB, used=0.00B

# btrfs fi usage .
Overall:
Device size: 390.62MiB
Device allocated:136.00MiB
Device unallocated:  254.62MiB
Device missing:  0.00B
Used: 32.06MiB
Free (estimated):280.94MiB  (min: 153.62MiB)
Data ratio:   1.00
Metadata ratio:   2.00
Global reserve:   16.00MiB  (used: 0.00B)
 
Data,single: Size:56.00MiB, Used:29.69MiB
   /dev/loop0 56.00MiB
 
Metadata,DUP: Size:32.00MiB, Used:1.17MiB
   /dev/loop0 64.00MiB
 
System,DUP: Size:8.00MiB, Used:16.00KiB
   /dev/loop0 16.00MiB
 
Unallocated:
   /dev/loop0254.62MiB

# cp -a --reflink usb usb2

# btrfs fi du -s .
 Total   Exclusive  Set shared  Filename
  59.38MiB29.69MiB29.69MiB  .

# btrfs fi df .
Data, single: total=56.00MiB, used=29.69MiB
System, DUP: total=8.00MiB, used=16.00KiB
Metadata, DUP: total=32.00MiB, used=1.17MiB
GlobalReserve, single: total=16.00MiB, used=0.00B

# btrfs fi usage .
Overall:
Device size: 390.62MiB
Device allocated:136.00MiB
Device unallocated:  254.62MiB
Device missing:  0.00B
Used: 32.06MiB
Free (estimated):280.94MiB  (min: 153.62MiB)
Data ratio:   1.00
Metadata ratio:   2.00
Global reserve:   16.00MiB  (used: 0.00B)
 
Data,single: Size:56.00MiB, Used:29.69MiB
   /dev/loop0 56.00MiB
 
Metadata,DUP: Size:32.00MiB, Used:1.17MiB
   /dev/loop0 64.00MiB
 
System,DUP: Size:8.00MiB, Used:16.00KiB
   /dev/loop0 16.00MiB
 
Unallocated:
   /dev/loop0254.62MiB

# btrfs balance start .
WARNING:
 
Full balance without filters requested. This operation is very
intense and takes potentially very long. It is recommended to
use the balance filters to narrow down the balanced data.
Use 'btrfs balance start --full-balance' option to skip this
warning. The operation will start in 10 seconds.
Use Ctrl-C to stop it.
10 9 8 7 6 5 4 3 2 1
Starting balance without any filters.
Done, had to relocate 4 out of 4 chunks

# btrfs fi du -s .
 Total   Exclusive  Set shared  Filename
  59.38MiB59.38MiB   0.00B  .

# btrfs fi df .
Data, single: total=48.00MiB, used=29.69MiB
System, DUP: total=24.00MiB, used=16.00KiB
Metadata, DUP: total=24.00MiB, used=2.08MiB
GlobalReserve, single: total=16.00MiB, used=0.00B

# btrfs fi usage .
Overall:
Device size: 390.62MiB
Device allocated:144.00MiB
Device unallocated:  246.62MiB
Device missing:  0.00B
Used: 33.88MiB
Free (estimated):264.94MiB  (min: 141.62MiB)
Data ratio:   1.00
Metadata ratio:   2.00
Global reserve:   16.00MiB  (used: 0.00B)
 
Data,single: Size:48.00MiB, Used:29.69MiB
   /dev/loop0 48.00MiB
 
Metadata,DUP: Size:24.00MiB, 

Re: [PATCH 05/14] Btrfs: warn_on for unaccounted spaces

2016-06-27 Thread Chris Mason



On 06/27/2016 12:47 AM, Qu Wenruo wrote:

Hi Josef,

Would you please move this patch to the first of the patchset?

It's making bisect quite hard, as it will always stop at this patch,
hard to check if it's a regression or existing bug.


That's a good idea.  Which workload are you having trouble with?

-chris


Re: Adventures in btrfs raid5 disk recovery

2016-06-27 Thread Austin S. Hemmelgarn

On 2016-06-25 12:44, Chris Murphy wrote:

On Fri, Jun 24, 2016 at 12:19 PM, Austin S. Hemmelgarn
 wrote:


Well, the obvious major advantage that comes to mind for me of checksumming
parity is that it would let us scrub the parity data itself and verify it.


OK but hold on. During scrub, it should read data, compute checksums
*and* parity, and compare those to what's on-disk -> EXTENT_CSUM in
the checksum tree, and the parity strip in the chunk tree. And if
parity is wrong, then it should be replaced.
Except that's horribly inefficient.  With limited exceptions involving 
highly situational co-processors, computing a checksum of a parity block 
is always going to be faster than computing parity for the stripe.  By 
using that to check parity, we can safely speed up the common case of 
near zero errors during a scrub by a pretty significant factor.


The ideal situation that I'd like to see for scrub WRT parity is:
1. Store checksums for the parity itself.
2. During scrub, if the checksum is good, the parity is good, and we 
just saved the time of computing the whole parity block.
3. If the checksum is not good, then compute the parity.  If the parity 
just computed matches what is there already, the checksum is bad and 
should be rewritten (and we should probably recompute the whole block of 
checksums it's in), otherwise, the parity was bad, write out the new 
parity and update the checksum.
4. Have an option to skip the csum check on the parity and always 
compute it.
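The four steps above can be sketched as a toy decision function. This is illustrative shell only: cksum and md5sum stand in for the real parity and checksum algorithms, and "blocks" are plain strings, not btrfs stripes.

```shell
# Toy model of scrub steps 1-4 above (stand-in hashes, not btrfs internals).
parity_of() { printf '%s%s' "$1" "$2" | cksum | cut -d' ' -f1; }
csum_of()   { printf '%s' "$1" | md5sum | cut -d' ' -f1; }

scrub_parity() {
	local d1=$1 d2=$2 stored_parity=$3 stored_csum=$4
	# Step 2: checksum matches -> parity is good, skip the expensive recompute.
	if [ "$(csum_of "$stored_parity")" = "$stored_csum" ]; then
		echo "parity-ok-fast"
		return
	fi
	# Step 3: checksum mismatch -> now pay for the parity computation.
	if [ "$(parity_of "$d1" "$d2")" = "$stored_parity" ]; then
		echo "rewrite-csum"      # parity was fine, the csum was stale
	else
		echo "rewrite-parity"    # parity itself was bad
	fi
}
```

In the no-error common case only the cheap checksum runs; the expensive parity computation is reached only on a mismatch, which is the whole point of step 2.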


Even check > md/sync_action does this. So no pun intended but Btrfs
isn't even at parity with mdadm on data integrity if it doesn't check
if the parity matches data.
Except that MD and LVM don't have checksums to verify anything outside 
of the very high-level metadata.  They have to compute the parity during 
a scrub because that's the _only_ way they have to check data integrity. 
 Just because that's the only way for them to check it does not mean we 
have to follow their design, especially considering that we have other, 
faster ways to check it.




I'd personally much rather know my parity is bad before I need to use it
than after using it to reconstruct data and getting an error there, and I'd
be willing to bet that most seasoned sysadmins working for companies using
big storage arrays likely feel the same about it.


That doesn't require parity csums though. It just requires computing
parity during a scrub and comparing it to the parity on disk to make
sure they're the same. If they aren't, assuming no other error for
that full stripe read, then the parity block is replaced.
It does not require it, but it can make it significantly more efficient, 
and even a 1% increase in efficiency is a huge difference on a big array.


So that's also something to check in the code or poke a system with a
stick and see what happens.


I could see it being
practical to have an option to turn this off for performance reasons or
similar, but again, I have a feeling that most people would rather be able
to check if a rebuild will eat data before trying to rebuild (depending on
the situation in such a case, it will sometimes just make more sense to nuke
the array and restore from a backup instead of spending time waiting for it
to rebuild).


The much bigger problem we have right now that affects Btrfs,
LVM/mdadm md raid, is this silly bad default with non-enterprise
drives having no configurable SCT ERC, with ensuing long recovery
times, and the kernel SCSI command timer at 30 seconds - which
actually also fucks over regular single disk users also because it
means they don't get the "benefit" of long recovery times, which is
the whole g'd point of that feature. This itself causes so many
problems where bad sectors just get worse and don't get fixed up
because of all the link resets. So I still think it's a bullshit
default kernel side because it pretty much affects the majority use
case, it is only a non-problem with proprietary hardware raid, and
software raid using enterprise (or NAS specific) drives that already
have short recovery times by default.
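For reference, the usual mitigation pairs the two knobs described above: cap the drive's error recovery where SCT ERC is supported, or raise the kernel's command timer where it isn't. /dev/sdX and the concrete numbers below are placeholders; the arithmetic only illustrates the invariant that the kernel timer must outlast the drive's worst-case recovery.

```shell
# Shown as comments because they need root and a real disk:
#   smartctl -l scterc,70,70 /dev/sdX          # cap drive recovery at 7.0 s
#   echo 180 > /sys/block/sdX/device/timeout   # or raise the 30 s SCSI timer
erc_deciseconds=70                               # SCT ERC is set in 0.1 s units
drive_recovery=$(( (erc_deciseconds + 9) / 10 )) # -> 7 s, rounded up
kernel_timer=$(( drive_recovery + 10 ))          # comfortable margin over 7 s
echo "drive=${drive_recovery}s kernel=${kernel_timer}s"
# prints: drive=7s kernel=17s
```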

On this, we can agree.


Re: [PATCH v2 5/6] fstests: btrfs: test RAID1 device reappear and balanced

2016-06-27 Thread Eryu Guan
On Wed, Jun 22, 2016 at 07:01:54PM +0800, Anand Jain wrote:
> 
> 
> On 06/21/2016 09:31 PM, Eryu Guan wrote:
> > On Wed, Jun 15, 2016 at 04:48:47PM +0800, Anand Jain wrote:
> > > From: Anand Jain 
> > > 
> > > The test does the following:
> > >   Initialize a RAID1 with some data
> > > 
> > >   Re-mount RAID1 degraded with _dev1_ and write up to
> > >   half of the FS capacity
> > 
> > If test devices are big enough, this test consumes much longer test
> > time. I tested with 15G scratch dev pool and this test ran ~200s on my
> > 4vcpu 8G memory test vm.
> 
>  Right. Isn't that a good design, so that it gets tested differently
>  on different HW configs?

Not in fstests. We should limit the run time of tests to an acceptable
amount, for auto group it's within 5 minutes.

>  However the test time can be reduced by using smaller vdisk.

I think either limit the write size or _notrun if the $max_fs_size is
too big (say 30G).
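A sketch of the suggested guard; names like max_fs_size are illustrative, not existing fstests helpers, and the 30G cut-off comes from the review above.

```shell
# Cap what the test writes so runtime stays bounded on big scratch pools.
dev_size_bytes=$((15 * 1024 * 1024 * 1024))      # pretend 15G scratch device
max_fs_size=$((30 * 1024 * 1024 * 1024))         # _notrun above this

if [ "$dev_size_bytes" -gt "$max_fs_size" ]; then
	echo "_notrun: scratch devices too large for a bounded runtime"
	exit 0
fi

# Write min(half the capacity, a fixed 2G cap) instead of a raw half.
half=$(( dev_size_bytes / 2 ))
cap=$(( 2 * 1024 * 1024 * 1024 ))
if [ "$half" -lt "$cap" ]; then write_size=$half; else write_size=$cap; fi
echo "$write_size"
```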

More comments below.

> 
> Thanks, Anand
> 
> 
> > Is it possible to limit the file size or the device size used? So it
> > won't grow with device size. I'm thinking about something like
> > _scratch_mkfs_sized, but that doesn't work for dev pool.
> > 
> > >   Save md5sum checkpoint1
> > > 
> > >   Re-mount healthy RAID1
> > > 
> > >   Let balance re-silver.
> > >   Save md5sum checkpoint2
> > > 
> > >   Re-mount RAID1 degraded with _dev2_
> > >   Save md5sum checkpoint3
> > > 
> > >   Verify if all three md5sum match
> > > 
> > > Signed-off-by: Anand Jain 
> > > ---
> > > v2:
> > >   add tmp= and its rm
> > >   add comments to why _reload_btrfs_ko is used
> > >   add missing put and test_mount at notrun exit
> > >   use echo instead of _fail when checkpoints are checked
> > >   .out updated to remove Silence..
> > > 
> > >  tests/btrfs/123 | 169 
> > > 
> > >  tests/btrfs/123.out |   7 +++
> > >  tests/btrfs/group   |   1 +
> > >  3 files changed, 177 insertions(+)
> > >  create mode 100755 tests/btrfs/123
> > >  create mode 100644 tests/btrfs/123.out
> > > 
> > > diff --git a/tests/btrfs/123 b/tests/btrfs/123
> > > new file mode 100755
> > > index ..33decfd1c434
> > > --- /dev/null
> > > +++ b/tests/btrfs/123
> > > @@ -0,0 +1,169 @@
> > > +#! /bin/bash
> > > +# FS QA Test 123
> > > +#
> > > +# This test verify the RAID1 reconstruction on the reappeared
> > > +# device. By using the following steps:
> > > +# Initialize a RAID1 with some data
> > > +#
> > > +# Re-mount RAID1 degraded with dev2 missing and write up to
> > > +# half of the FS capacity.
> > > +# Save md5sum checkpoint1
> > > +#
> > > +# Re-mount healthy RAID1
> > > +#
> > > +# Let balance re-silver.
> > > +# Save md5sum checkpoint2
> > > +#
> > > +# Re-mount RAID1 degraded with dev1 missing
> > > +# Save md5sum checkpoint3
> > > +#
> > > +# Verify if all three checkpoints match
> > > +#
> > > +#-
> > > +# Copyright (c) 2016 Oracle.  All Rights Reserved.
> > > +#
> > > +# This program is free software; you can redistribute it and/or
> > > +# modify it under the terms of the GNU General Public License as
> > > +# published by the Free Software Foundation.
> > > +#
> > > +# This program is distributed in the hope that it would be useful,
> > > +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> > > +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > > +# GNU General Public License for more details.
> > > +#
> > > +# You should have received a copy of the GNU General Public License
> > > +# along with this program; if not, write the Free Software Foundation,
> > > +# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
> > > +#-
> > > +#
> > > +
> > > +seq=`basename $0`
> > > +seqres=$RESULT_DIR/$seq
> > > +echo "QA output created by $seq"
> > > +
> > > +here=`pwd`
> > > +tmp=/tmp/$$
> > > +status=1 # failure is the default!
> > > +trap "_cleanup; exit \$status" 0 1 2 3 15
> > > +
> > > +_cleanup()
> > > +{
> > > + cd /
> > > + rm -f $tmp.*
> > > +}
> > > +
> > > +# get standard environment, filters and checks
> > > +. ./common/rc
> > > +. ./common/filter
> > > +
> > > +# remove previous $seqres.full before test
> > > +rm -f $seqres.full
> > > +
> > > +# real QA test starts here
> > > +
> > > +_supported_fs btrfs
> > > +_supported_os Linux
> > > +_require_scratch_nocheck
> > 
> > Why don't check filesystem after test? A comment would be good if
> > there's a good reason. Patch 6 needs it as well :)

And can you please add comments on _require_scratch_nocheck in this
patch and patch 6, and rebase the whole series after Dave pushed my
pull request (on 06-25) to upstream, and resend?

Thanks,
Eryu

> > 
> > > +_require_scratch_dev_pool 2
> > > +
> > > +# the mounted test dir prevent btrfs unload, we need to unmount
> > > +_test_unmount
> > > 

Re: Rescue a single-device btrfs instance with zeroed tree root

2016-06-27 Thread Ivan Shapovalov
On 2016-06-21 at 20:23 +0300, Ivan Shapovalov wrote:
> Hello,
> 
> So this is another case of "I lost my partition and do not have
> backups". More precisely, _this_ is the backup and it turned out to
> be
> damaged.
> 
> (The backup was made by partclone.btrfs. Together with a zeroed out
> tree root, this asks for a bug in partclone...)
> 
> So: the tree root is zeroes, backup roots are zeroes too,
> btrfs-find-root only reports blocks of level 0 (needed is 1).
> Is there something that can be done? Maybe it is possible to
> reconstruct the root from its children?
> Operations log following.
> Please Cc: me in replies as I'm not subscribed to the list.


Anyone? I'd appreciate any advice on how to rebuild the tree roots.

I'd even write some code if someone tells me the disk format and
logical tree constraints (i. e. in which order to put pointers to child
nodes).

BTW, it looks like all tree roots are lost, i. e. btrfs-find-root with
any objectid (extent tree, subvolume tree, anything in ctree.h) finds
only level 0 nodes.

It should be possible to rebuild intermediate nodes, shouldn't it? Do they
contain valuable information?

Please Cc: me in replies as I'm not subscribed to the list.

Thanks,
-- 

Ivan Shapovalov / intelfx /

> 
> 1. regular mount
> 
> # mount /dev/loop0p3 /mnt/temp
> 
> === dmesg ===
> [106737.299592] BTRFS info (device loop0p3): disk space caching is
> enabled
> [106737.299604] BTRFS: has skinny extents
> [106737.299884] BTRFS error (device loop0p3): bad tree block start 0
> 162633449472
> [106737.299888] BTRFS: failed to read tree root on loop0p3
> [106737.314359] BTRFS: open_ctree failed
> === end dmesg ===
> 
> 2. mount with -o recovery
> 
> # mount -o recovery /dev/loop0p3 /mnt/temp
> 
> === dmesg ===
> [106742.305720] BTRFS warning (device loop0p3): 'recovery' is
> deprecated, use 'usebackuproot' instead
> [106742.305722] BTRFS info (device loop0p3): trying to use backup
> root at mount time
> [106742.305724] BTRFS info (device loop0p3): disk space caching is
> enabled
> [106742.305725] BTRFS: has skinny extents
> [106742.306056] BTRFS error (device loop0p3): bad tree block start 0
> 162633449472
> [106742.306060] BTRFS: failed to read tree root on loop0p3
> [106742.306069] BTRFS error (device loop0p3): bad tree block start 0
> 162633449472
> [106742.306071] BTRFS: failed to read tree root on loop0p3
> [106742.306084] BTRFS error (device loop0p3): bad tree block start 0
> 162632237056
> [106742.306086] BTRFS: failed to read tree root on loop0p3
> [106742.306097] BTRFS error (device loop0p3): bad tree block start 0
> 162626682880
> [106742.306100] BTRFS: failed to read tree root on loop0p3
> [106742.306111] BTRFS error (device loop0p3): bad tree block start 0
> 162609168384
> [106742.306114] BTRFS: failed to read tree root on loop0p3
> [106742.327272] BTRFS: open_ctree failed
> === end dmesg ===
> 
> 
> 3. btrfs-find-root
> 
> # btrfs-find-root /dev/loop0p3
> Couldn't read tree root
> Superblock thinks the generation is 22332
> Superblock thinks the level is 1
> Well block 162633646080(gen: 22332 level: 0) seems good, but
> generation/level doesn't match, want gen: 22332 level: 1
> Well block 162633596928(gen: 22332 level: 0) seems good, but
> generation/level doesn't match, want gen: 22332 level: 1
> Well block 162633515008(gen: 22332 level: 0) seems good, but
> generation/level doesn't match, want gen: 22332 level: 1
> 
> Thanks,

signature.asc
Description: This is a digitally signed message part


Re: [PATCH 4/4] fstests: btrfs/126,127,128: test feature ioctl and sysfs interfaces

2016-06-27 Thread Eryu Guan
On Fri, Jun 24, 2016 at 11:08:34AM -0400, je...@suse.com wrote:
> From: Jeff Mahoney 
> 
> This tests the exporting of feature information from the kernel via
> sysfs and ioctl. The first test works whether the sysfs permissions
> are correct, if the information exported via sysfs matches
> what the ioctls are reporting, and if they both match the on-disk
> superblock's version of the feature sets. The second and third tests
> test online setting and clearing of feature bits via the sysfs and
> ioctl interfaces, checking whether they match the on-disk super on
> each cycle.
> 
> In every case, if the features are not present, it is not considered
> a failure and a message indicating that will be dumped to the $num.full
> file.
> 
> Signed-off-by: Jeff Mahoney 
> ---
>  tests/btrfs/126 | 269 
> 
>  tests/btrfs/126.out |   2 +
>  tests/btrfs/127 | 185 
>  tests/btrfs/127.out |   2 +
>  tests/btrfs/128 | 178 ++
>  tests/btrfs/128.out |   2 +
>  tests/btrfs/group   |   3 +
>  7 files changed, 641 insertions(+)
>  create mode 100755 tests/btrfs/126
>  create mode 100644 tests/btrfs/126.out
>  create mode 100755 tests/btrfs/127
>  create mode 100644 tests/btrfs/127.out
>  create mode 100755 tests/btrfs/128
>  create mode 100644 tests/btrfs/128.out
> 
> diff --git a/tests/btrfs/126 b/tests/btrfs/126
> new file mode 100755
> index 000..3d660c5
> --- /dev/null
> +++ b/tests/btrfs/126
> @@ -0,0 +1,269 @@
> +#!/bin/bash
> +# FS QA Test No. 126
> +#
> +# Test online feature publishing
> +#
> +# This test doesn't test the changing of features. It does test that
> +# the proper publishing bits and permissions match up with
> +# the expected values.
> +#
> +#---
> +# Copyright (c) 2013 SUSE, All Rights Reserved.

Copyright year 2016.

> +#
> +# This program is free software; you can redistribute it and/or
> +# modify it under the terms of the GNU General Public License as
> +# published by the Free Software Foundation.
> +#
> +# This program is distributed in the hope that it would be useful,
> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +# GNU General Public License for more details.
> +#
> +# You should have received a copy of the GNU General Public License
> +# along with this program; if not, write the Free Software Foundation,
> +# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
> +#---
> +
> +seq=$(basename $0)
> +seqres=$RESULT_DIR/$seq
> +echo "== QA output created by $seq"
> +
> +here=$(pwd)
> +tmp=/tmp/$$
> +status=1

Missing _cleanup() and trap, use './new btrfs' to create new btrfs
tests.

> +
> +# get standard environment, filters and checks
> +. ./common/rc
> +. ./common/filter.btrfs
> +
> +_supported_fs btrfs
> +_supported_os Linux
> +_require_scratch
> +_require_command $BTRFS_SHOW_SUPER_PROG

_require_command "$BTRFS_SHOW_SUPER_PROG" btrfs-show-super

> +
> +_scratch_mkfs > /dev/null 2>&1
> +_scratch_mount
> +
> +check_features() {

"{" on a separate line

> + reserved="$2"
> + method="$3"
> + if [ "$1" != 0 ]; then
> + echo "$method: failed: $reserved"
> + exit 1
> + fi

No need to check return value.

> +if [ "$reserved" = "Not implemented." ]; then
> +echo "Skipping ioctl test. Not implemented." >> $seqres.full
> +return
> +fi

Call _notrun if ioctl not implemented. Do the check before actual test
starts.

And you're mixing spaces and tabs for indentation in this function.

> +}
> +
> +error=false

All the checks around error can be omitted.

> +
> +# test -w will always return true if root is making the call.
> +# This would be true in most cases, but for sysfs files, the permissions
> +# are enforced even for root.
> +is_writeable() {

"{" on a separate line

> + local file=$1
> + mode=$(stat -c "0%a" "$file")
> + mode=$(( $mode & 0200 ))
> +
> + [ "$mode" -eq 0 ] && return 1
> + return 0
> +}
> +
> +# ioctl
> +read -a features < <(src/btrfs_ioctl_helper $SCRATCH_MNT GET_FEATURES 2>&1)
> +check_features $? "$features" "GET_FEATURES"
> +
> +test_ioctl=true
> +[ "${features[*]}" = "Not implemented." ] && test_ioctl=false
> +
> +read -a supp_features < <(src/btrfs_ioctl_helper $SCRATCH_MNT GET_SUPPORTED_FEATURES 2>&1)
> +check_features $? "$supp_features" "GET_SUPPORTED_FEATURES"
> +[ "${supp_features[*]}" = "Not implemented." ] && test_ioctl=false

These checks are not needed if the test was checked and _notrun properly
before test.
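A sketch of the probe-then-_notrun pattern being asked for. The stub function stands in for src/btrfs_ioctl_helper on a kernel without the ioctl, and _notrun itself is only echoed here; SCRATCH_MNT defaults to a placeholder path.

```shell
# Probe once before the test body, _notrun if the kernel lacks the ioctl.
SCRATCH_MNT=${SCRATCH_MNT:-/mnt/scratch}      # placeholder mount point
btrfs_ioctl_helper() { echo "Not implemented."; }   # stub for the helper

probe=$(btrfs_ioctl_helper "$SCRATCH_MNT" GET_FEATURES 2>&1)
if [ "$probe" = "Not implemented." ]; then
	verdict="_notrun"     # the real test would call _notrun "..." here
else
	verdict="run"
fi
echo "$verdict"
```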

> +
> +# Sysfs checks
> +fsid=$(_btrfs_get_fsid $SCRATCH_DEV)
> +sysfs_base="/sys/fs/btrfs"
> +
> +# TODO Add tool to enable and test unknown feature bits
> 

Re: [PATCH 3/4] fstests: btrfs/125: test sysfs exports of allocation and device membership info

2016-06-27 Thread Eryu Guan
On Fri, Jun 24, 2016 at 11:08:33AM -0400, je...@suse.com wrote:
> From: Jeff Mahoney 
> 
> This tests the sysfs publishing for btrfs allocation and device
> membership info under a number of different layouts, similar to the
> btrfs replace test. We test the allocation files only for existence and
> that they contain numerical values. We test the device membership
> by mapping the devices used to create the file system to sysfs paths
> and matching them against the paths used for the device membership
> symlinks.
> 
> It passes on kernels without a /sys/fs/btrfs/ directory.
> 
> Signed-off-by: Jeff Mahoney 
> ---
>  common/config   |   4 +-
>  common/rc   |   7 ++
>  tests/btrfs/125 | 193 
> 
>  tests/btrfs/125.out |   2 +
>  tests/btrfs/group   |   1 +
>  5 files changed, 205 insertions(+), 2 deletions(-)
>  create mode 100755 tests/btrfs/125
>  create mode 100644 tests/btrfs/125.out
> 
> diff --git a/common/config b/common/config
> index c25b1ec..c5e65f7 100644
> --- a/common/config
> +++ b/common/config
> @@ -201,13 +201,13 @@ export DEBUGFS_PROG="`set_prog_path debugfs`"
>  # newer systems have udevadm command but older systems like RHEL5 don't.
>  # But if neither one is available, just set it to "sleep 1" to wait for lv to
>  # be settled
> -UDEV_SETTLE_PROG="`set_prog_path udevadm`"
> +UDEVADM_PROG="`set_prog_path udevadm`"
>  if [ "$UDEV_SETTLE_PROG" == "" ]; then

$UDEVADM_PROG should be checked here, not $UDEV_SETTLE_PROG anymore.
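A minimal sketch of the corrected probe order; set_prog_path is stubbed with command -v, while the real helper in fstests' common/config does more.

```shell
# Probe the variable that was just set, per the review comment above.
set_prog_path() { command -v "$1" 2>/dev/null || true; }   # illustrative stub

UDEVADM_PROG=`set_prog_path udevadm`
if [ "$UDEVADM_PROG" = "" ]; then
	# no udevadm: fall back to the old udevsettle command
	UDEV_SETTLE_PROG=`set_prog_path udevsettle`
else
	UDEV_SETTLE_PROG="$UDEVADM_PROG settle"
fi
# neither command available: last-resort sleep
if [ "$UDEV_SETTLE_PROG" = "" ]; then
	UDEV_SETTLE_PROG="sleep 1"
fi
echo "$UDEV_SETTLE_PROG"
```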

>   # try udevsettle command
>   UDEV_SETTLE_PROG="`set_prog_path udevsettle`"
>  else
>   # udevadm is available, add 'settle' as subcommand
> - UDEV_SETTLE_PROG="$UDEV_SETTLE_PROG settle"
> + UDEV_SETTLE_PROG="$UDEVADM_PROG settle"
>  fi
>  # neither command is available, use sleep 1
>  if [ "$UDEV_SETTLE_PROG" == "" ]; then
> diff --git a/common/rc b/common/rc
> index 4b05fcf..f4c4312 100644
> --- a/common/rc
> +++ b/common/rc
> @@ -76,6 +76,13 @@ _btrfs_get_subvolid()
>   $BTRFS_UTIL_PROG sub list $mnt | grep $name | awk '{ print $2 }'
>  }
>  
> +_btrfs_get_feature_flags()
> +{
> + local dev=$1
> + local class=$2
> + $BTRFS_SHOW_SUPER_PROG $dev | grep ^${class}_flags | awk '{print $NF}'
> +}
> +
>  _btrfs_get_fsid()
>  {
>   local dev=$1
> diff --git a/tests/btrfs/125 b/tests/btrfs/125
> new file mode 100755
> index 000..83f1921
> --- /dev/null
> +++ b/tests/btrfs/125
> @@ -0,0 +1,193 @@
> +#! /bin/bash
> +# FS QA Test No. 125
> +#
> +# Test of the btrfs sysfs publishing
> +#
> +#---
> +# Copyright (C) 2013-2016 SUSE.  All rights reserved.

Copyright year is 2016.

> +#
> +# This program is free software; you can redistribute it and/or
> +# modify it under the terms of the GNU General Public License as
> +# published by the Free Software Foundation.
> +#
> +# This program is distributed in the hope that it would be useful,
> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +# GNU General Public License for more details.
> +#
> +# You should have received a copy of the GNU General Public License
> +# along with this program; if not, write the Free Software Foundation,
> +# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
> +#
> +#---
> +#
> +
> +seq=`basename $0`
> +seqres=$RESULT_DIR/$seq
> +echo "== QA output created by $seq"
> +
> +here=`pwd`
> +tmp=/tmp/$$
> +status=1

Missing _cleanup and trap, use "./new btrfs" to generate new btrfs test.

> +
> +# get standard environment, filters and checks
> +. ./common/rc
> +. ./common/filter
> +
> +# real QA test starts here
> +_supported_fs btrfs

Missing "_supported_os" call. Following the template created by the
'./new' script makes it easier :)

> +_require_scratch
> +_require_scratch_dev_pool
> +_require_command "$UDEVADM_PROG"

We usually provide a program name as a second param

_require_command "$UDEVADM_PROG" udevadm

> +
> +rm -f $seqres.full
> +rm -f $tmp.tmp

This should be "rm -f $tmp.*" and belongs to _cleanup()

> +
> +check_file() {
> + local file=$1
> + base="$(echo "$file" | sed -e 's#/sys/fs/btrfs/[0-9a-f-][0-9a-f-]*/##')"
> + if [ ! -f "$file" ]; then
> + echo "$base missing."
> + return 0

No need to return 0/1 based on failure/pass, because check_chunk()
doesn't need to exit on failure.

> + else
> + value="$(cat $file)"
> + if [ -n "$(echo $value | tr -d 0-9)" ]; then
> + echo "ERROR: $base: numerical value expected" \
> +  "(got $value)"
> + return 0
> + fi
> + fi
> + return 1
> +}
> +
> +check_chunk() {
> + path=$1
> + mkfs_options=$2
> + error=false
> +
> + 

Re: [PATCH 2/4] fstests: btrfs/124: test global metadata reservation reporting

2016-06-27 Thread Eryu Guan
On Mon, Jun 27, 2016 at 03:16:47PM +0800, Eryu Guan wrote:
> On Fri, Jun 24, 2016 at 11:08:32AM -0400, je...@suse.com wrote:
> > From: Jeff Mahoney 
> > 
[snip]
> > +
> > +# get standard environment, filters and checks
> > +. ./common/rc
> > +. ./common/filter.btrfs
> > +
> > +_supported_fs btrfs
> > +_supported_os Linux
> > +_require_scratch
> > +
> > +_scratch_mkfs > /dev/null 2>&1
> > +_scratch_mount
> 
> There should be some kind of "_require_xxx" or something like that to
> _notrun if current running kernel doesn't have global metadata
> reservation report implemented.

Also need a _require_test_program call to make sure btrfs_ioctl_helper
is built and in src/ dir.

_require_test_program "btrfs_ioctl_helper"

Sorry, I missed it in first review.

Thanks,
Eryu


Re: [PATCH 2/4] fstests: btrfs/124: test global metadata reservation reporting

2016-06-27 Thread Eryu Guan
On Fri, Jun 24, 2016 at 11:08:32AM -0400, je...@suse.com wrote:
> From: Jeff Mahoney 
> 
> Btrfs can now report the size of the global metadata reservation
> via ioctl and sysfs.
> 
> This test confirms that we get sane results on an empty file system.
> 
> ENOTTY and missing /sys/fs/btrfs//allocation are not considered
> failures.
> 
> Signed-off-by: Jeff Mahoney 

I'm reviewing mainly from the fstests perspective, need help from other
btrfs developers to review the test itself to see if it's a valid &
useful test.

> ---
>  common/rc|   6 ++
>  src/Makefile |   3 +-
>  src/btrfs_ioctl_helper.c | 220 +++
>  tests/btrfs/124  |  90 +++
>  tests/btrfs/124.out  |   2 +
>  tests/btrfs/group|   1 +
>  6 files changed, 321 insertions(+), 1 deletion(-)
>  create mode 100644 src/btrfs_ioctl_helper.c
>  create mode 100755 tests/btrfs/124
>  create mode 100644 tests/btrfs/124.out
> 
> diff --git a/common/rc b/common/rc
> index 3a9c4d1..4b05fcf 100644
> --- a/common/rc
> +++ b/common/rc
> @@ -76,6 +76,12 @@ _btrfs_get_subvolid()
>   $BTRFS_UTIL_PROG sub list $mnt | grep $name | awk '{ print $2 }'
>  }
>  
> +_btrfs_get_fsid()
> +{
> + local dev=$1
> + $BTRFS_UTIL_PROG filesystem show $dev|awk '/uuid:/ {print $NF}'
> +}
> +
>  # Prints the md5 checksum of a given file
>  _md5_checksum()
>  {
> diff --git a/src/Makefile b/src/Makefile
> index 1bf318b..c467475 100644
> --- a/src/Makefile
> +++ b/src/Makefile
> @@ -20,7 +20,8 @@ LINUX_TARGETS = xfsctl bstat t_mtab getdevicesize 
> preallo_rw_pattern_reader \
>   bulkstat_unlink_test_modified t_dir_offset t_futimens t_immutable \
>   stale_handle pwrite_mmap_blocked t_dir_offset2 seek_sanity_test \
>   seek_copy_test t_readdir_1 t_readdir_2 fsync-tester nsexec cloner \
> - renameat2 t_getcwd e4compact test-nextquota punch-alternating
> + renameat2 t_getcwd e4compact test-nextquota punch-alternating \
> + btrfs_ioctl_helper

.gitignore needs an entry for new binary.

But I'm wondering whether this is something that can be added to
btrfs-progs, either as part of the btrfs command or a separate command?

>  
>  SUBDIRS =
>  
> diff --git a/src/btrfs_ioctl_helper.c b/src/btrfs_ioctl_helper.c
> new file mode 100644
> index 000..4344bdc
> --- /dev/null
> +++ b/src/btrfs_ioctl_helper.c
> @@ -0,0 +1,220 @@
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#ifndef BTRFS_IOCTL_MAGIC
> +#define BTRFS_IOCTL_MAGIC 0x94
> +#endif
> +
> +#ifndef BTRFS_IOC_SPACE_INFO
> +struct btrfs_ioctl_space_info {
> +uint64_t flags;
> +uint64_t total_bytes;
> +uint64_t used_bytes;
> +};
> +
> +struct btrfs_ioctl_space_args {
> +uint64_t space_slots;
> +uint64_t total_spaces;
> +struct btrfs_ioctl_space_info spaces[0];
> +};
> +#define BTRFS_IOC_SPACE_INFO _IOWR(BTRFS_IOCTL_MAGIC, 20, \
> +struct btrfs_ioctl_space_args)
> +#endif
> +#ifndef BTRFS_SPACE_INFO_GLOBAL_RSV
> +#define BTRFS_SPACE_INFO_GLOBAL_RSV(1ULL << 49)
> +#endif
> +
> +#ifndef BTRFS_IOC_GET_FEATURES
> +struct btrfs_ioctl_feature_flags {
> + uint64_t compat_flags;
> + uint64_t compat_ro_flags;
> + uint64_t incompat_flags;
> +};
> +
> +#define BTRFS_IOC_GET_FEATURES _IOR(BTRFS_IOCTL_MAGIC, 57, \
> +   struct btrfs_ioctl_feature_flags)
> +#define BTRFS_IOC_SET_FEATURES _IOW(BTRFS_IOCTL_MAGIC, 57, \
> +   struct btrfs_ioctl_feature_flags[2])
> +#define BTRFS_IOC_GET_SUPPORTED_FEATURES _IOR(BTRFS_IOCTL_MAGIC, 57, \
> +   struct btrfs_ioctl_feature_flags[3])
> +#endif
> +
> +static int global_rsv_ioctl(int fd, int argc, char *argv[])
> +{
> + struct btrfs_ioctl_space_args arg;
> + struct btrfs_ioctl_space_args *args;
> + int ret;
> + int i;
> + size_t size;
> +
> + arg.space_slots = 0;
> +
> + ret = ioctl(fd, BTRFS_IOC_SPACE_INFO, &arg);
> + if (ret)
> + return -errno;
> +
> + size = sizeof(*args) + sizeof(args->spaces[0]) * arg.total_spaces;
> + args = malloc(size);
> + if (!args)
> + return -ENOMEM;
> +
> + args->space_slots = arg.total_spaces;
> +
> + ret = ioctl(fd, BTRFS_IOC_SPACE_INFO, args);
> + if (ret)
> + return -errno;
> +
> + for (i = 0; i < args->total_spaces; i++) {
> + if (args->spaces[i].flags & BTRFS_SPACE_INFO_GLOBAL_RSV) {
> + unsigned long long reserved;
> + reserved = args->spaces[i].total_bytes;
> + printf("%llu\n", reserved);
> + return 0;
> + }
> + }
> +
> + return -ENOENT;
> +}
> +
> +static int get_features_ioctl(int fd, int argc, char *argv[])
> +{
> + struct 

Re: Bad hard drive - checksum verify failure forces readonly mount

2016-06-27 Thread Vasco Almeida
On Sun, 26-06-2016 at 13:54 -0600, Chris Murphy wrote:
> On Sun, Jun 26, 2016 at 7:05 AM, Vasco Almeida wrote:
> > I have tried "btrfs check --repair /device" but that does not seem
> > to do any good.
> > http://paste.fedoraproject.org/384960/66945936/
> 
> It did fix things, in particular with the snapshot that was having
> problems being dropped. But it seems that wasn't enough to prevent it
> from going read only.
> 
> There's more than one bug here, you might see if the repair was good
> enough that it's possible to use btrfs-image now.

File system image available at (choose one link)
https://mega.nz/#!AkAEgKyB!RUa7G5xHIygWm0ALx5ZxQjjXNdFYa7lDRHJ_sW0bWLs
https://www.sendspace.com/file/i70cft

>  If not, use
> btrfs-debug-tree <device> > file.txt and post that file somewhere. This
> does expose file names. Maybe that'll shed some light on the problem.
> But also worth filing a bug at bugzilla.kernel.org with this debug
> tree referenced (probably too big to attach), maybe a dev will be
> able
> to look at it and improve things so they don't fail.
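[A minimal sketch of collecting the artifacts Chris suggests for a bug
report. This is a dry run: `run` only echoes each command, since the
real ones should be executed by hand against the affected device.
`/dev/sdX` and the output filename are placeholders, not values from
this thread.]

```shell
#!/bin/sh
# Dry-run sketch of gathering btrfs debugging artifacts for bugzilla.
# DEV is a hypothetical placeholder; point it at the damaged filesystem.
DEV="${DEV:-/dev/sdX}"

run() { echo "+ $*"; }           # print each command instead of executing it

run btrfs-debug-tree "$DEV"      # writes the tree dump to stdout; redirect to a file
run xz -9 btrfs-debug-tree.txt   # compress before uploading (the dump can be huge)
```

When run for real, drop the `run` wrapper and redirect the first command
into `btrfs-debug-tree.txt`.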

Should I file a bug report with that image dump linked above or btrfs-
debug-tree output or both?
I think I will use the subject of this thread as summary to file the
bug. Can you think of something more suitable or is that fine?

> > What else can I do or I must rebuild the file system?
> 
> Well, it's a long shot but you could try using --repair --init-csum-tree,
> which will create a new csum tree. But that applies to data; if the
> problem with it going read only is due to metadata corruption, this
> won't help. And then last you could try --init-extent-tree. The thing I
> can't answer is which order to do them in.
> 
> In any case there will be files that you shouldn't trust after csum
> has been recreated, anything corrupt will now have a new csum, so you
> can get silent data corruption. It's better to just blow away this
> file system and make a new one and reinstall the OS. But if you're
> feeling brave, you can try one or both of those additional options
> and
> see if they can help.
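[For reference, the escalation Chris describes could be scripted as
below. This is a dry-run sketch: every step rewrites metadata and is
destructive, so `attempt` only echoes the commands. `/dev/sdX` is a
placeholder; btrfs-progs spells the options `--init-csum-tree` and
`--init-extent-tree`, and the filesystem must be unmounted.]

```shell
#!/bin/sh
# Dry-run sketch of the repair escalation discussed above.
# DEV is a hypothetical placeholder for the damaged, unmounted filesystem.
DEV="${DEV:-/dev/sdX}"

attempt() { echo "+ $*"; }       # print instead of executing: each step is destructive

attempt btrfs check --repair "$DEV"                      # already tried earlier in the thread
attempt btrfs check --repair --init-csum-tree "$DEV"     # rebuild the checksum tree
attempt btrfs check --repair --init-extent-tree "$DEV"   # rebuild the extent tree
```

As Chris notes, after recreating the csum tree any already-corrupt file
gets a fresh, valid checksum, so restore anything important from backups
rather than trusting the surviving copies.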

I think I will reinstall the OS since, even if I manage to recover the
file system from this issue, that OS will be something I cannot fully
trust.