Re: enospace regression in 4.4

2016-04-12 Thread Duncan
Julian Taylor posted on Tue, 12 Apr 2016 17:52:57 +0200 as excerpted:

> $ sudo btrfs fi balance start -dusage=0 .
> ERROR: error during balancing '.': No space left on device

Not much to add, but this one really surprises me and it may be related 
to the new problem you're seeing.

I don't recall ever seeing a -dusage=0 actually error out due to ENOSPC 
before.  It normally either works, killing some empty chunks, or runs 
without error but also without finding any empty chunks to kill, thus 
"doing nothing, successfully" (to borrow the one-line name and 
description for true (1)).

That even a balance with -dusage=0 is actually failing, not just 
completing without doing anything as might be expected, is strange 
indeed.  With a bit of luck that's a strong hint to the devs as to what 
has actually gone wrong and how to fix it.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 00/13] Introduce device state 'failed', spare device and auto replace

2016-04-12 Thread Yauhen Kharuzhy
On Tue, Apr 12, 2016 at 10:15:50PM +0800, Anand Jain wrote:
> Thanks for various comments, tests and feedback.

This seems to work for me. I did trigger the OOM killer while testing it in
VirtualBox, but I don't think that is related to auto-replace; it looks like
an issue in the scrub implementation:

[  449.615157] CPU: 0 PID: 1771 Comm: btrfs-health Not tainted 4.4.5-scst31x+ #25
[  449.621763] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[  449.647614]   8800601c7660 813529e3 8800601c7858
[  449.659766]  88005ba66140 8800601c76d0 8121b41e 8800601c7680
[  449.683167]  810d7ccd 8800601c76a0 0206 81c6d0e0
[  449.700746] Call Trace:
[  449.705078]  [] dump_stack+0x85/0xc2
[  449.715238]  [] dump_header+0x5a/0x21d
[  449.725400]  [] ? trace_hardirqs_on+0xd/0x10
[  449.741261]  [] oom_kill_process+0x200/0x3d0
[  449.753042]  [] out_of_memory+0x562/0x580
[  449.765923]  [] ? out_of_memory+0x2d3/0x580
[  449.768455]  [] __alloc_pages_nodemask+0xafc/0xc80
[  449.770281]  [] alloc_pages_current+0x9b/0x1c0
[  449.783371]  [] scrub_pages+0xb5/0x400 [btrfs]
[  449.804598]  [] ? scrub_find_csum+0xd5/0x110 [btrfs]
[  449.819145]  [] scrub_stripe+0x82e/0x1180 [btrfs]
[  449.829299]  [] scrub_chunk+0x110/0x160 [btrfs]
[  449.835859]  [] scrub_enumerate_chunks+0x27c/0x560 [btrfs]
[  449.852805]  [] ? wake_atomic_t_function+0x30/0x70
[  449.867081]  [] btrfs_scrub_dev+0x1cd/0x680 [btrfs]
[  449.876784]  [] btrfs_dev_replace_start+0x334/0x540 [btrfs]
[  449.891503]  [] btrfs_auto_replace_start+0xf8/0x140 [btrfs]
[  449.911958]  [] health_kthread+0x246/0x490 [btrfs]
[  449.922132]  [] ? health_kthread+0x138/0x490 [btrfs]
[  449.946273]  [] ? btrfs_congested_fn+0x180/0x180 [btrfs]
[  449.975742]  [] kthread+0xef/0x110
[  449.994914]  [] ? __raw_callee_save___pv_queued_spin_unlock+0x11/0x20
[  450.022306]  [] ? kthread_create_on_node+0x200/0x200
[  450.036069]  [] ret_from_fork+0x3f/0x70
[  450.045622]  [] ? kthread_create_on_node+0x200/0x200
[  450.047625] Mem-Info:
[  450.055195] active_anon:30 inactive_anon:71 isolated_anon:0
[  450.055195]  active_file:220 inactive_file:980 isolated_file:0
[  450.055195]  unevictable:527 dirty:41 writeback:59 unstable:0
[  450.055195]  slab_reclaimable:18226 slab_unreclaimable:283931
[  450.055195]  mapped:612 shmem:10 pagetables:1209 bounce:0
[  450.055195]  free:3310 free_pcp:153 free_cma:0
[  450.069070] Node 0 DMA free:6232kB min:48kB low:60kB high:72kB active_anon:0kB inactive_anon:0kB active_file:8kB inactive_file:16kB unevictable:28kB isolated(anon):0kB isolated(file):0kB present:15992kB managed:15908kB mlocked:28kB dirty:4kB writeback:0kB mapped:4kB shmem:0kB slab_reclaimable:788kB slab_unreclaimable:6236kB kernel_stack:96kB pagetables:48kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:220 all_unreclaimable? yes
[  450.161023] lowmem_reserve[]: 0 1546 1546 1546
[  450.181786] Node 0 DMA32 free:10620kB min:4896kB low:6120kB high:7344kB active_anon:120kB inactive_anon:176kB active_file:964kB inactive_file:1132kB unevictable:2080kB isolated(anon):0kB isolated(file):0kB present:1668032kB managed:1583780kB mlocked:2080kB dirty:160kB writeback:112kB mapped:2568kB shmem:40kB slab_reclaimable:72116kB slab_unreclaimable:1129488kB kernel_stack:4192kB pagetables:4788kB unstable:0kB bounce:0kB free_pcp:740kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  450.267804] lowmem_reserve[]: 0 0 0 0
[  450.272899] Node 0 DMA: 45*4kB (UME) 31*8kB (UME) 19*16kB (ME) 10*32kB (ME) 7*64kB (ME) 7*128kB (UME) 3*256kB (UME) 2*512kB (UM) 2*1024kB (M) 0*2048kB 0*4096kB = 6236kB
[  450.286381] Node 0 DMA32: 2006*4kB (UME) 453*8kB (UME) 68*16kB (UME) 15*32kB (UM) 2*64kB (UM) 1*128kB (M) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 13472kB
[  450.299928] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[  450.304622] 985 total pagecache pages
[  450.306857] 111 pages in swap cache
[  450.308870] Swap cache stats: add 9380, delete 9269, find 113/183
[  450.312090] Free swap  = 381628kB
[  450.314188] Total swap = 418492kB
[  450.317644] 421006 pages RAM
[  450.319573] 0 pages HighMem/MovableOnly
[  450.322100] 21084 pages reserved
[  450.323853] 0 pages hwpoisoned
...

-- 
Yauhen Kharuzhy


Re: KERNEL PANIC + CORRUPTED BTRFS?

2016-04-12 Thread Chris Murphy
On Tue, Apr 12, 2016 at 9:48 AM, lenovomi  wrote:

> root@ubuntu:/home/ubuntu# btrfs restore -D -v  /dev/sda /mnt/usb/
> checksum verify failed on 17802818387968 found FF45E2D3 wanted BFB02AEC
> checksum verify failed on 17802818387968 found FF45E2D3 wanted BFB02AEC
> checksum verify failed on 17802818387968 found FF45E2D3 wanted BFB02AEC
> checksum verify failed on 17802818387968 found FF45E2D3 wanted BFB02AEC
> Csum didn't match
> Couldn't read tree root
> Could not open root, trying backup super
> warning, device 2 is missing
> warning devid 2 not found already
> warning devid 3 not found already
> warning devid 4 not found already
> checksum verify failed on 17802818387968 found FF45E2D3 wanted BFB02AEC
> checksum verify failed on 17802818387968 found FF45E2D3 wanted BFB02AEC
> Csum didn't match
> Couldn't read tree root
> Could not open root, trying backup super
> warning, device 2 is missing
> warning devid 2 not found already
> warning devid 3 not found already
> warning devid 4 not found already
> checksum verify failed on 17802818387968 found FF45E2D3 wanted BFB02AEC
> checksum verify failed on 17802818387968 found FF45E2D3 wanted BFB02AEC
> Csum didn't match
> Couldn't read tree root
> Could not open root, trying backup super
>

Why are devices 2, 3, 4 missing? I think there's a known issue where
btrfs fi show might see drives as available that other tools won't
see. Try 'btrfs dev scan' and then repeat the restore command with -D
just to see if the missing device warnings go away. If devices are
missing, it's kinda hard to do a restore.


If these are hard drives, there should be supers 0, 1, 2 and they
should all be the same. But they may not be the same on each drive, so
it's worth checking:

btrfs-show-super -f 

And then also btrfs-find-root 
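
A rough sketch of how the supers could be compared across drives: dump
every superblock copy with btrfs-show-super -a and check that the
generation fields agree. The helper names below (super_generations,
all_same, check_supers) are made up for illustration, and the awk pattern
assumes the dump prints a line whose first field is exactly "generation".

```shell
# Print the "generation" value of every superblock dump read from stdin.
# Other fields such as chunk_root_generation are skipped by the exact match.
super_generations() {
    awk '$1 == "generation" { print $2 }'
}

# Succeed only if all lines on stdin are identical (exactly one unique line).
all_same() {
    [ "$(sort -u | wc -l)" -eq 1 ]
}

# Hypothetical wrapper: pass every member device of the filesystem.
# Returns 0 when all superblock copies on all devices share one generation.
check_supers() {
    local dev
    for dev in "$@"; do
        btrfs-show-super -a "$dev"
    done | super_generations | all_same
}
```

Something like "check_supers /dev/sda /dev/sdb || echo mismatch" would then
flag a drive whose supers lag behind. A mismatch only tells you where to
look; it doesn't say which copy is the good one.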




-- 
Chris Murphy


Re: [PATCH 01/13] btrfs: Introduce a new function to check if all chunks a OK for degraded mount

2016-04-12 Thread Yauhen Kharuzhy
On Tue, Apr 12, 2016 at 10:15:51PM +0800, Anand Jain wrote:
> From: Qu Wenruo 
> 
> Introduce a new function, btrfs_check_degradable(), to judge if all chunks
> in btrfs is OK for degraded mount.
> 
> It provides the new basis for accurate btrfs mount/remount and even
> runtime degraded mount check other than old one-size-fit-all method.
> 
> Signed-off-by: Qu Wenruo 
> ---
>  fs/btrfs/volumes.c | 63 
> ++
>  fs/btrfs/volumes.h |  1 +
>  2 files changed, 64 insertions(+)
> 
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 9d72dabdddfc..a351c5dd9e9b 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -7039,3 +7039,66 @@ static void btrfs_close_one_device(struct btrfs_device *device)
>  
>   call_rcu(&device->rcu, free_device);
>  }
> +
> +/*
> + * Check if all chunks in the fs is OK for degraded mount
> + * Caller itself should do extra check if DEGRADED mount option is given
> + * for >0 return value.
> + *
> + * Return 0 if all chunks are OK.
> + * Return >0 if all chunks are degradable but not all OK.
> + * Return <0 if any chunk is not degradable or other bug.
> + */
> +int btrfs_check_degradable(struct btrfs_fs_info *fs_info, unsigned flags)
> +{
> + struct btrfs_mapping_tree *map_tree = &fs_info->mapping_tree;
> + struct extent_map *em;
> + u64 next_start = 0;
> + int ret = 0;
> +
> + if (flags & MS_RDONLY)
> + return 0;
> +
> + read_lock(&map_tree->map_tree.lock);
> + em = lookup_extent_mapping(&map_tree->map_tree, 0, (u64)(-1));
> + /* No any chunk? Should be a huge bug */
> + if (!em) {
> + ret = -ENOENT;
> + goto out;
> + }
> +
> + while (em) {
> + struct map_lookup *map;
> + int missing = 0;
> + int max_tolerated;
> + int i;
> +
> + map = (struct map_lookup *) em->bdev;
> + max_tolerated =
> + btrfs_get_num_tolerated_disk_barrier_failures(
> + map->type);
> + for (i = 0; i < map->num_stripes; i++) {
> + if (map->stripes[i].dev->missing)
> + missing++;
> + }
> + if (missing > max_tolerated) {
> + ret = -EIO;
> + btrfs_warn(fs_info,
> +"missing devices(%d) exceeds the limit(%d), 
> writebale mount is not allowed",
> +missing, max_tolerated);

Typo: s/writebale/writeable/


-- 
Yauhen Kharuzhy


Re: enospace regression in 4.4

2016-04-12 Thread Julian Taylor
On 12.04.2016 20:09, Henk Slager wrote:
> On Tue, Apr 12, 2016 at 5:52 PM, Julian Taylor
>  wrote:
>> smaller testcase that shows the immediate enospc after fallocate -> rm,
>> though I don't know if it is really related to the full filesystem
>> bugging out as the balance does work if you wait a few seconds after the
>> balance.
>> But this sequence of commands did work in 4.2.
>>
>>  $ sudo btrfs fi show /dev/mapper/lvm-testing
>> Label: none  uuid: 25889ba9-a957-415a-83b0-e34a62cb3212
>> Total devices 1 FS bytes used 225.18MiB
>> devid1 size 5.00GiB used 788.00MiB path /dev/mapper/lvm-testing
>>
>>  $ fallocate -l 4.4G test.dat
>>  $ rm -f test.dat
>>  $ sudo btrfs fi balance start -dusage=0 .
>> ERROR: error during balancing '.': No space left on device
>> There may be more info in syslog - try dmesg | tail
> 
> It seems that kernel 4.4.6 waits longer with de-allocating empty
> chunks and the balance kicks in at a time when the 5 GiB is still
> completely filled with chunks. As balance needs unallocated space (on
> device level, how much depends on profiles), this error can be
> expected.

Hm, OK, I'll put a sleep in the script then.
fallocate; rm; fallocate seems to work, so it's probably OK in normal usage.


> 
>> On 04/12/2016 12:24 PM, Julian Taylor wrote:
>>> hi,
>>> I have a system with two filesystems which are both affected by the
>>> notorious enospace bug when there is plenty of unallocated space
>>> available. The system is a raid0 on two 900 GiB disks and an iscsi
>>> single/dup 1.4TiB.
>>> To deal with the problem I use a cronjob that uses fallocate to give me
>>> an advance notice on the issue so I can apply the only workaround that
>>> works for me, which is shrink the fs to the minimum and grow it again.
>>> This has worked fine for a couple of month.
>>>
>>> I now updated from 4.2 to 4.4.6 and it appears my cronjob actually
>>> triggers an immediate enospc in the balance after removing the
>>> fallocated file and the shrink/resize workaround does not work anymore.
> 
> The filesystem itself is not resized AFAIU, correct?

btrfs resize -XG /mount
so resize filesystem but not the underlying device.

Actually the system just went into enospc again with unallocated space free,
even after the revert to 4.2, and the shrink trick doesn't want to work
anymore either ...
Though the 4.2 running now is not the same one where the shrink workaround
worked. I'll have to check the changelog to see if there are btrfs-related
changes in it.


> 
> You could shrink a file-system by a few GiB's (without changing the
> size of the underlying device), so that once it really gets filled up
> and hits enospc, you resize to max again and delete files or snapshot
> or something. Of course no option for a 24/7 unattended system, but
> maybe for a client laptop as testing.
> 

That is basically what I have been doing: I used the cronjob to see when
the enospc issue occurred and then shrank and regrew the fs to fix it. It
was relatively rare; I had to do it maybe every two months.

But now for some reason that trick doesn't work anymore either. I can
shrink it by 200G and resize it back to max and it still complains about
no free space. So now I'm at a loss on how to keep this system working.
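
For reference, the shrink-then-grow workaround described above can be
sketched as below. shrink_and_grow and the RUN hook are hypothetical names
for illustration; the 200G default just mirrors the amount mentioned here.
As noted, this no longer helped on the affected system, so treat it as a
record of the trick rather than a fix.

```shell
# Shrink the filesystem by some amount, then grow it back to the full
# device size. RUN is a hypothetical hook: leave it unset to execute the
# commands, or set RUN=echo to only print them.
shrink_and_grow() {
    local mnt="$1"
    local amount="${2:-200G}"
    $RUN btrfs filesystem resize "-$amount" "$mnt" &&
    $RUN btrfs filesystem resize max "$mnt"
}
```

Running "RUN=echo shrink_and_grow /data" first prints the two resize
commands without touching the filesystem.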


Re: loop subsystem corrupted after mounting multiple btrfs sub-volumes

2016-04-12 Thread Stanislav Brabec

On Feb 26, 2016 at 21:30 Al Viro wrote:
> Sigh...  sys_mount() (mount_bdev(), actually) has no way to tell if two
> loop devices refer to the same underlying object.  As far as it's
> concerned, you are asking to mount a completely unrelated block device.
> Which just happens to see the data (living in separate pagecache, even)
> modified behind its back (with some delay) after it gets written to another
> device.  Filesystem drivers generally don't like when something is screwing
> the underlying data, to put it mildly...



I wrote a loop device reuse patch for mount -oloop.

[PATCH 0/3] btrfs-safe implementation of -oloop
http://marc.info/?l=util-linux-ng&m=146048532307963&w=2

[PATCH 1/3] libmount: Re-organize is_mounted_same_loopfile()
http://marc.info/?l=util-linux-ng&m=146048535907971&w=2

[PATCH 2/3] libmount: reuse existing loop device
http://marc.info/?l=util-linux-ng&m=146048537807980&w=2

[PATCH 3/3] mount: Handle EROFS before calling mount() syscall
http://marc.info/?l=util-linux-ng&m=146048542007990&w=2

Although it works for me, there are still some controversial issues
described in [PATCH 0/3].

These patches will hide corruption of kernel loop control structures
mentioned earlier in this thread and in most cases prevent data
corruption.

--
Best Regards / S pozdravem,

Stanislav Brabec
software developer
-
SUSE LINUX, s. r. o.                         e-mail: sbra...@suse.com
Lihovarská 1060/12                           tel: +49 911 7405384547
190 00 Praha 9                               fax:  +420 284 084 001
Czech Republic                               http://www.suse.cz/
PGP: 830B 40D5 9E05 35D8 5E27 6FA3 717C 209F A04F CD76


Re: enospace regression in 4.4

2016-04-12 Thread Henk Slager
On Tue, Apr 12, 2016 at 5:52 PM, Julian Taylor
 wrote:
> smaller testcase that shows the immediate enospc after fallocate -> rm,
> though I don't know if it is really related to the full filesystem
> bugging out as the balance does work if you wait a few seconds after the
> balance.
> But this sequence of commands did work in 4.2.
>
>  $ sudo btrfs fi show /dev/mapper/lvm-testing
> Label: none  uuid: 25889ba9-a957-415a-83b0-e34a62cb3212
> Total devices 1 FS bytes used 225.18MiB
> devid1 size 5.00GiB used 788.00MiB path /dev/mapper/lvm-testing
>
>  $ fallocate -l 4.4G test.dat
>  $ rm -f test.dat
>  $ sudo btrfs fi balance start -dusage=0 .
> ERROR: error during balancing '.': No space left on device
> There may be more info in syslog - try dmesg | tail

It seems that kernel 4.4.6 waits longer before deallocating empty
chunks, so the balance kicks in at a time when the 5 GiB is still
completely filled with chunks. As balance needs unallocated space (at
the device level; how much depends on the profiles), this error can be
expected.
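
A minimal sketch of the "wait, then balance" idea, assuming the empty
chunks get reclaimed within roughly one commit interval (that timing is an
assumption, not something confirmed in this thread). commit_interval and
balance_empty_chunks are hypothetical helper names; 30 seconds is the
btrfs default commit interval and the 5-second margin is arbitrary.

```shell
# Print the commit interval (in seconds) found in a comma-separated
# mount-options string. Falls back to 30, the btrfs default, when no
# commit= option is present.
commit_interval() {
    local opts="$1"
    local opt
    local IFS=,
    for opt in $opts; do
        case $opt in
            commit=*) echo "${opt#commit=}"; return 0 ;;
        esac
    done
    echo 30
}

# Hypothetical helper: sleep past one commit interval so the kernel can
# drop empty chunks on its own, then run the usage=0 balance.
balance_empty_chunks() {
    local mnt="$1"
    local opts
    opts=$(findmnt -n -o OPTIONS --target "$mnt")
    sleep $(( $(commit_interval "$opts") + 5 ))
    btrfs fi balance start -dusage=0 "$mnt"
}
```

In the test case above this would replace the bare balance call right
after the rm, e.g. "balance_empty_chunks ." run as root.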

> On 04/12/2016 12:24 PM, Julian Taylor wrote:
>> hi,
>> I have a system with two filesystems which are both affected by the
>> notorious enospace bug when there is plenty of unallocated space
>> available. The system is a raid0 on two 900 GiB disks and an iscsi
>> single/dup 1.4TiB.
>> To deal with the problem I use a cronjob that uses fallocate to give me
>> an advance notice on the issue so I can apply the only workaround that
>> works for me, which is shrink the fs to the minimum and grow it again.
>> This has worked fine for a couple of month.
>>
>> I now updated from 4.2 to 4.4.6 and it appears my cronjob actually
>> triggers an immediate enospc in the balance after removing the
>> fallocated file and the shrink/resize workaround does not work anymore.

The filesystem itself is not resized AFAIU, correct?

>> it is mounted with enospc_debug but that just says "2 enospc in
>> balance". Nothing else useful in the log.
>>
>> I had to revert back to 4.2 to get the system running again so it is
>> currently not available for more testing, but I may be able to do more
>> tests if required in future.
>>
>> The cronjob does this once a day:
>>
>> #!/bin/bash
>> sync
>>
>> check() {
>>   date
>>   mnt=$1
>>   time btrfs fi balance start -mlimit=2 $mnt
>>   btrfs fi balance start -dusage=5 $mnt
>>   sync
>>   freespace=$(df -B1 $mnt | tail -n 1 | awk '{print $4 -
>> 50*1024*1024*1024}')
>>   fallocate -l $freespace $mnt/falloc
>>   /usr/sbin/filefrag $mnt/falloc
>>   rm -f $mnt/falloc
>>   btrfs fi balance start -dusage=0 $mnt

See my comment on the smaller test case; maybe you could add a delay
longer than the commit interval before this balance, to give the kernel
itself the chance to clean up empty chunks.

>>   time btrfs fi balance start -mlimit=2 $mnt
>>   time btrfs fi balance start -dlimit=10 $mnt
>>   date
>> }
>>
>> check /data
>> check /data/nas

It could be that with kernel 4.4.6 or newer, the original enospc
(so not the ones due to balances) does not pop up anymore. That would
mean the cronjob workaround itself creates a problem now. Can you give
some background on what other (types of) enospc occurred in the past,
and whether this was with the 4.2 kernel or older?

You could shrink the filesystem by a few GiB (without changing the
size of the underlying device), so that once it really gets filled up
and hits enospc, you resize to max again and delete files or snapshots
or something. Of course that is no option for a 24/7 unattended system,
but maybe for a client laptop as a test.

>> btrfs info:
>>
>>
>>  ~ $ btrfs --version
>> btrfs-progs v4.4
>> sagan5 ~ $ sudo btrfs fi show
>> Label: none  uuid: e4aef349-7a56-4287-93b1-79233e016aae
>>   Total devices 2 FS bytes used 898.18GiB
>>   devid1 size 880.00GiB used 473.03GiB path /dev/mapper/data-linear1
>>   devid2 size 880.00GiB used 473.03GiB path /dev/mapper/data-linear2
>>
>> Label: none  uuid: 14040f9b-53c8-46cf-be6b-35de746c3153
>>   Total devices 1 FS bytes used 557.19GiB
>>   devid1 size 1.36TiB used 585.95GiB path /dev/sdd
>>
>>  ~ $ sudo btrfs fi df /data
>> Data, RAID0: total=938.00GiB, used=895.09GiB
>> System, RAID1: total=32.00MiB, used=112.00KiB
>> Metadata, RAID1: total=4.00GiB, used=3.10GiB
>> GlobalReserve, single: total=512.00MiB, used=0.00B
>> sagan5 ~ $ sudo btrfs fi usage /data
>> Overall:
>> Device size: 1.72TiB
>> Device allocated:  946.06GiB
>> Device unallocated:813.94GiB
>> Device missing:0.00B
>> Used:  901.27GiB
>> Free (estimated):  856.85GiB  (min: 449.88GiB)
>> Data ratio: 1.00
>> Metadata ratio: 2.00
>> Global reserve:512.00MiB  (used: 0.00B)
>>
>> Data,RAID0: Size:938.00GiB, Used:895.09GiB
>>/dev/dm-1   469.00GiB
>>/dev/mapper/data-linear1

Re: [PATCH] Btrfs: track transid for delayed ref flushing

2016-04-12 Thread Josef Bacik

On 04/12/2016 01:43 PM, Liu Bo wrote:
> On Mon, Apr 11, 2016 at 05:37:40PM -0400, Josef Bacik wrote:
>> Using the offwakecputime bpf script I noticed most of our time was spent waiting
>> on the delayed ref throttling.  This is what is supposed to happen, but
>> sometimes the transaction can commit and then we're waiting for throttling that
>> doesn't matter anymore.  So change this stuff to be a little smarter by tracking
>> the transid we were in when we initiated the throttling.  If the transaction we
>> get is different then we can just bail out.  This resulted in a 50% speedup in
>> my fs_mark test, and reduced the amount of time spent throttling by 60 seconds
>> over the entire run (which is about 30 minutes).  Thanks,
>
> Does the bpf script show where it's waiting on?  delayed_refs spinlock?


We are waiting on the wait_for_completion() in 
btrfs_async_run_delayed_refs.  The script only catches where we're in 
TASK_UNINTERRUPTIBLE for longer than 100ms.




> Maybe we can make it even smarter.
>
> In __btrfs_end_transaction(), the only case it won't wait for running delayed
> refs is when trans is JOIN_NOLOCK or ATTACH and "must_run_delayed_refs = 2".
>
> In other cases, even we queue a work into helper worker to do async
> delayed refs processing, __btrfs_end_transaction() still waits there.
>
> Since it's a 50% speedup, it looks like at least 50% of
> __btrfs_end_transaction() are waiting for other trans's queued delayed
> refs, can we do the check a little bit earlier, in
> btrfs_async_run_delayed_refs()?


We'd have to start another transaction, we don't want to have to do 
that.  What I want to do later is have the flushing stuff running all 
the time, so we notice way sooner if we end up with a bunch of people 
all trying to throttle at once.  So we drop below the throttle watermark 
and everybody wakes up, instead of everybody does their work no matter 
what and then wakes up.  But this was quick and easy and I've got other 
stuff to do so this is what I posted ;).  Thanks,


Josef



Re: [PATCH] Btrfs: track transid for delayed ref flushing

2016-04-12 Thread Liu Bo
On Mon, Apr 11, 2016 at 05:37:40PM -0400, Josef Bacik wrote:
> Using the offwakecputime bpf script I noticed most of our time was spent waiting
> on the delayed ref throttling.  This is what is supposed to happen, but
> sometimes the transaction can commit and then we're waiting for throttling that
> doesn't matter anymore.  So change this stuff to be a little smarter by tracking
> the transid we were in when we initiated the throttling.  If the transaction we
> get is different then we can just bail out.  This resulted in a 50% speedup in
> my fs_mark test, and reduced the amount of time spent throttling by 60 seconds
> over the entire run (which is about 30 minutes).  Thanks,

Does the bpf script show where it's waiting on?  delayed_refs spinlock?

Maybe we can make it even smarter.

In __btrfs_end_transaction(), the only case it won't wait for running delayed
refs is when trans is JOIN_NOLOCK or ATTACH and "must_run_delayed_refs = 2".

In other cases, even we queue a work into helper worker to do async
delayed refs processing, __btrfs_end_transaction() still waits there.

Since it's a 50% speedup, it looks like at least 50% of
__btrfs_end_transaction() are waiting for other trans's queued delayed
refs, can we do the check a little bit earlier, in
btrfs_async_run_delayed_refs()?

> 
> Signed-off-by: Josef Bacik 
> ---
>  fs/btrfs/ctree.h   |  2 +-
>  fs/btrfs/extent-tree.c | 15 ---
>  fs/btrfs/inode.c   |  1 +
>  fs/btrfs/transaction.c |  3 ++-
>  4 files changed, 16 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 55a24c5..4222936 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -3505,7 +3505,7 @@ void btrfs_put_block_group(struct btrfs_block_group_cache *cache);
>  int btrfs_run_delayed_refs(struct btrfs_trans_handle *trans,
>  struct btrfs_root *root, unsigned long count);
>  int btrfs_async_run_delayed_refs(struct btrfs_root *root,
> -  unsigned long count, int wait);
> +  unsigned long count, u64 transid, int wait);
>  int btrfs_lookup_data_extent(struct btrfs_root *root, u64 start, u64 len);
>  int btrfs_lookup_extent_info(struct btrfs_trans_handle *trans,
>struct btrfs_root *root, u64 bytenr,
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 4b5a517..f23f426 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -2839,6 +2839,7 @@ int btrfs_should_throttle_delayed_refs(struct btrfs_trans_handle *trans,
>  
>  struct async_delayed_refs {
>   struct btrfs_root *root;
> + u64 transid;
>   int count;
>   int error;
>   int sync;
> @@ -2854,9 +2855,16 @@ static void delayed_ref_async_start(struct btrfs_work *work)
>  
>   async = container_of(work, struct async_delayed_refs, work);
>  
> - trans = btrfs_join_transaction(async->root);
> + trans = btrfs_attach_transaction(async->root);
>   if (IS_ERR(trans)) {
> - async->error = PTR_ERR(trans);
> + if (PTR_ERR(trans) != -ENOENT)
> + async->error = PTR_ERR(trans);
> + goto done;
> + }
> +
> + /* Don't bother flushing if we got into a different transaction */
> + if (trans->transid != async->transid) {
> + btrfs_end_transaction(trans, async->root);

Interesting: btrfs_end_transaction() will also queue work that runs
delayed_ref_async_start(), and it doesn't wait.

Thanks,

-liubo

>   goto done;
>   }
>  
> @@ -2880,7 +2888,7 @@ done:
>  }
>  
>  int btrfs_async_run_delayed_refs(struct btrfs_root *root,
> -  unsigned long count, int wait)
> +  unsigned long count, u64 transid, int wait)
>  {
>   struct async_delayed_refs *async;
>   int ret;
> @@ -2892,6 +2900,7 @@ int btrfs_async_run_delayed_refs(struct btrfs_root *root,
>   async->root = root->fs_info->tree_root;
>   async->count = count;
>   async->error = 0;
> + async->transid = transid;
>   if (wait)
>   async->sync = 1;
>   else
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 723e4bb..e6dd4cc 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -4534,6 +4534,7 @@ delete:
>   BUG_ON(ret);
>   if (btrfs_should_throttle_delayed_refs(trans, root))
>   btrfs_async_run_delayed_refs(root,
> +  trans->transid,
>   trans->delayed_ref_updates * 2, 0);
>   if (be_nice) {
>   if (truncate_space_check(trans, root,
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index 43885e5..7c7671d 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -817,6 +817,7 @@ static int 

Re: scrub: Tree block spanning stripes, ignored

2016-04-12 Thread Ivan P
Feel free to send me that modified btrfsck when you finish it, I'm
open for experiments as long as I have my backup copy.

Regards,
Ivan.

On Mon, Apr 11, 2016 at 3:10 AM, Qu Wenruo  wrote:
> There seems to be something wrong with btrfsck.
>
> Not sure if it's kernel clear_cache mount option or btrfsck to blame.
>
> Anyway, it shouldn't be a big problem though.
>
> If you want to make sure it won't damage your fs, it's better to mount with
> nospace_cache mount option.
>
> I'd try to implement a new option for btrfsck to clear space cache in case
> kernel mount option doesn't work, and hopes it may help you.
>
> Thanks,
> Qu
>
>
> Ivan P wrote on 2016/04/09 11:53 +0200:
>>
>> Well, the message is almost the same after mounting with clear_cache
>> -> unmounting -> mounting with regular options -> unmounting ->
>> running btrfsck --readonly.
>>
>> ===
>> Checking filesystem on /dev/sdb
>> UUID: 013cda95-8aab-4cb2-acdd-2f0f78036e02
>> checking extents
>> checking free space cache
>> block group 632463294464 has wrong amount of free space
>> failed to load free space cache for block group 632463294464
>> checking fs roots
>> checking csums
>> checking root refs
>> found 859557139239 bytes used err is 0
>> total csum bytes: 838453732
>> total tree bytes: 980516864
>> total fs tree bytes: 38387712
>> total extent tree bytes: 11026432
>> btree space waste bytes: 70912724
>> file data blocks allocated: 858788171776
>> referenced 858787610624
>> ===
>>
>> Or should I be using btrfsck without --readonly?
>>
>> Oh and almost forgot (again):
>>>
>>> For backref problem, did you rw mount the fs with some old kernel like
>>> 4.2?
>>> IIRC, I introduced a delayed_ref regression in that version.
>>> Maybe it's related to the bug.
>>>
>>> Thanks,
>>> Qu
>>
>>
>> The FS was created with btrfs-progs 4.2.3 and mounted on kernel 4.2.5,
>> so if that version also had the problem, then that's maybe it.
>>
>> On Fri, Apr 8, 2016 at 2:23 AM, Qu Wenruo  wrote:
>>>
>>>
>>>
>>> Ivan P wrote on 2016/04/07 17:33 +0200:


>>>> After running btrfsck --readonly again, the output is:
>>>>
>>>> ===
>>>> Checking filesystem on /dev/sdb
>>>> UUID: 013cda95-8aab-4cb2-acdd-2f0f78036e02
>>>> checking extents
>>>> checking free space cache
>>>> block group 632463294464 has wrong amount of free space
>>>> failed to load free space cache for block group 632463294464
>>>> checking fs roots
>>>> checking csums
>>>> checking root refs
>>>> found 859557139240 bytes used err is 0
>>>> total csum bytes: 838453732
>>>> total tree bytes: 980516864
>>>> total fs tree bytes: 38387712
>>>> total extent tree bytes: 11026432
>>>> btree space waste bytes: 70912460
>>>> file data blocks allocated: 858788433920
>>>> referenced 858787872768
>>>> ===
>>>>
>>>> Seems the free space is wrong because more data blocks are allocated
>>>> than referenced?
>>>
>>> Not sure, but space cache is never a big problem.
>>> Mount with clear_cache would rebuild space cache.
>>>
>>> It seems that your fs is in good condition now.
>>>
>>> Thanks,
>>> Qu
>>>
>>>> Regards,
>>>> Ivan.
>>>>
>>>> On Thu, Apr 7, 2016 at 2:58 AM, Qu Wenruo wrote:
>>>>>
>>>>> Ivan P wrote on 2016/04/06 21:39 +0200:
>>>>>>
>>>>>> Ok, I'm cautiously optimistic: after running btrfsck
>>>>>> --init-extent-tree --repair and running scrub, it finished without
>>>>>> errors.
>>>>>> Will run a file compare against my backup copy, but it seems the
>>>>>> repair was successful.
>>>>>
>>>>> Better run btrfsck again, to ensure no other problem.
>>>>>
>>>>> For backref problem, did you rw mount the fs with some old kernel like
>>>>> 4.2?
>>>>> IIRC, I introduced a delayed_ref regression in that version.
>>>>> Maybe it's related to the bug.
>>>>>
>>>>> Thanks,
>>>>> Qu
>>>>>
>>>>>>
>>>>>> Here is the btrfs-image btw:
>>>>>> https://dl.dropboxusercontent.com/u/19330332/image.btrfs (821Mb)
>>>>>>
>>>>>> Maybe you will be able to track down whatever caused this.
>>>>>>
>>>>>> Regards,
>>>>>> Ivan.
>>>>>>
>>>>>> On Sun, Apr 3, 2016 at 3:24 AM, Qu Wenruo wrote:
>>>>>>>
>>>>>>> On 04/03/2016 12:29 AM, Ivan P wrote:
>>>>>>>>
>>>>>>>> It's about 800Mb, I think I could upload that.
>>>>>>>>
>>>>>>>> I ran it with the -s parameter, is that enough to remove all
>>>>>>>> personal
>>>>>>>> info from the image?
>>>>>>>> Also, I had to run it with -w because otherwise it died on the same
>>>>>>>> corrupt node.
>>>>>>>
>>>>>>> You can also use -c9 to further compress the data.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Qu
>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Apr 1, 2016 at 2:25 AM, Qu Wenruo wrote:
>>>>>>>>>

Re: [PATCH] Btrfs: do not create empty block group if we have allocated data

2016-04-12 Thread Liu Bo
On Mon, Apr 11, 2016 at 05:02:18PM +0200, David Sterba wrote:
> On Mon, Dec 14, 2015 at 06:29:32PM -0800, Liu Bo wrote:
> > Now we force to create empty block group to keep data profile alive,
> > however, in the below example, we eventually get an empty block group
> > while we're trying to get more space for other types (metadata/system),
> > 
> > - Before,
> > block group "A": size=2G, used=1.2G
> > block group "B": size=2G, used=512M
> > 
> > - After "btrfs balance start -dusage=50 mount_point",
> > block group "A": size=2G, used=(1.2+0.5)G
> > block group "C": size=2G, used=0
> > 
> > Since there is no data in block group C, it won't be deleted
> > automatically and we have to get the unused 2G until the next mount.
> > 
> > Balance itself just moves data and doesn't remove data, so it's safe
> > to not create such a empty block group if we already have data
> >  allocated in other block groups.
> > 
> > Signed-off-by: Liu Bo 
> 
> I'm adding the patch to my for-next.

Thank you, David.

Thanks,

-liubo


[PATCH] Btrfs: remove BUG_ON()'s in btrfs_map_block

2016-04-12 Thread Josef Bacik
btrfs_map_block can go horribly wrong in the face of fs corruption, let's agree
to not be assholes and panic at any possible chance things are all fucked up.

Signed-off-by: Josef Bacik 
---
 fs/btrfs/volumes.c | 22 --
 1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index e2b54d5..ba8216b 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5278,7 +5278,18 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info, int rw,
stripe_nr = div64_u64(stripe_nr, stripe_len);
 
stripe_offset = stripe_nr * stripe_len;
-   BUG_ON(offset < stripe_offset);
+   if (offset < stripe_offset) {
+   btrfs_crit(fs_info, "stripe math has gone wrong, "
+  "stripe_offset=%llu, offset=%llu, start=%llu, "
+  "logical=%llu, stripe_len=%llu\n",
+  (unsigned long long)stripe_offset,
+  (unsigned long long)offset,
+  (unsigned long long)em->start,
+  (unsigned long long)logical,
+  (unsigned long long)stripe_len);
+   free_extent_map(em);
+   return -EINVAL;
+   }
 
/* stripe_offset is the offset of this block in its stripe*/
stripe_offset = offset - stripe_offset;
@@ -5519,7 +5530,14 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info, int rw,
&stripe_index);
mirror_num = stripe_index + 1;
}
-   BUG_ON(stripe_index >= map->num_stripes);
+   if (stripe_index >= map->num_stripes) {
+   btrfs_crit(fs_info, "stripe index math went horribly wrong, "
+  "got stripe_index=%lu, num_stripes=%lu\n",
+  (unsigned long)stripe_index,
+  (unsigned long)map->num_stripes);
+   ret = -EINVAL;
+   goto out;
+   }
 
num_alloc_stripes = num_stripes;
if (dev_replace_is_ongoing) {
-- 
2.5.0
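
For readers following along, the pattern the patch applies (report the inconsistency and hand an error back to the caller instead of halting the machine) can be sketched in plain userspace C. This is only an illustration under stated assumptions: stripe_offset_checked() and its parameters are hypothetical names, not kernel functions, and the real code also has to release the extent map before returning.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Userspace sketch, not kernel code. Computes the offset of a block inside
 * its stripe. If the inputs are inconsistent (as they may be on a corrupted
 * filesystem), log the values and return -EINVAL rather than aborting.
 * The function name and signature are hypothetical.
 */
static int stripe_offset_checked(uint64_t offset, uint64_t stripe_nr,
                                 uint64_t stripe_len, uint64_t *out)
{
    uint64_t stripe_start;

    assert(out != NULL);

    stripe_start = stripe_nr * stripe_len;
    if (offset < stripe_start) {
        fprintf(stderr,
                "stripe math has gone wrong, stripe_offset=%llu, offset=%llu\n",
                (unsigned long long)stripe_start,
                (unsigned long long)offset);
        return -EINVAL;
    }

    /* offset of this block within its stripe */
    *out = offset - stripe_start;
    return 0;
}
```

The caller then decides what to do with -EINVAL (fail the IO, go read-only), which is a policy decision that a BUG_ON() takes away.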



Re: KERNEL PANIC + CORRUPTED BTRFS?

2016-04-12 Thread lenovomi
One more thing,


root@ubuntu:/home/ubuntu# btrfs fi show
Label: 'universe'  uuid: 51e1c933-a39d-4bff-9cf7-f369b4b5d414
Total devices 4 FS bytes used 10.75TiB
devid    1 size 2.73TiB used 2.70TiB path /dev/sda
devid    2 size 2.73TiB used 2.70TiB path /dev/sdb
devid    3 size 2.73TiB used 2.71TiB path /dev/sdc
devid    4 size 2.73TiB used 2.71TiB path /dev/sdd

On Tue, Apr 12, 2016 at 5:43 AM, Chris Murphy  wrote:
> On Mon, Apr 11, 2016 at 3:51 PM, lenovomi  wrote:
>> Hi,
>>
>> i didnt try mount -o ro, when i tried to mount it via esata i got
>> kernel panic immediately. Then i conntected enclosure with drives via
>> usb and tried to mount it :
>
> OK so try '-o ro,recovery' and report back what you get.
>
>
>
>>
>> https://bpaste.net/show/641ab9172539
>> plugged via usb -> mount randomly one of the drive mount /dev/sda /mnt/brtfs
>>
>> I was told on irc channel that i should not run btrfs check and if so
>> i should run it as
>> btrfs check --repair --init-extent-tree
>>
>>
>> Also there was recommendation to run btrfs restore before repair.
>
> Did you use btrfs restore?
> https://btrfs.wiki.kernel.org/index.php/Restore
>
> And did you use --repair --init-extent-tree? I don't recommend it
> until you use restore as well.
>
>
>
>> Still not clear what should i do as next step.
>
> 1. mount with -o ro,recovery  and get important data backed up. It
> sounds like you don't have a backup?
>
> 2. If that doesn't work, use btrfs restore. It's tedious but at least
> you can update your backup.
>
> 3. Next try btrfs check without repair. There's some nuance whether
> it's better to use init-extent-tree or try zeroing the log. But don't
> use repair until there's a current backup with 1 or 2.
>
> --
> Chris Murphy


Re: enospace regression in 4.4

2016-04-12 Thread Julian Taylor
Here is a smaller testcase that shows the immediate ENOSPC after fallocate -> rm.
I don't know if it is really related to the full filesystem bugging out, as the
balance does work if you wait a few seconds after the balance.
But this sequence of commands did work in 4.2.

 $ sudo btrfs fi show /dev/mapper/lvm-testing
Label: none  uuid: 25889ba9-a957-415a-83b0-e34a62cb3212
Total devices 1 FS bytes used 225.18MiB
devid    1 size 5.00GiB used 788.00MiB path /dev/mapper/lvm-testing

 $ fallocate -l 4.4G test.dat
 $ rm -f test.dat
 $ sudo btrfs fi balance start -dusage=0 .
ERROR: error during balancing '.': No space left on device
There may be more info in syslog - try dmesg | tail
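
As background on why a failure here is unexpected, the selection rule of the data usage filter can be sketched as below. This is a sketch of the rule as I understand it, not something stated in this thread and not the kernel function itself: a block group is picked when its used bytes fall below the given percentage of its size, and -dusage=0 picks only block groups that hold nothing at all. The helper name and its parameters are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Sketch of the "usage" balance filter (assumed semantics, hypothetical
 * helper). Returns true if a block group of the given size and used bytes
 * would be selected for relocation by -dusage=<usage_pct>.
 */
static bool usage_filter_selects(uint64_t size, uint64_t used,
                                 unsigned int usage_pct)
{
    uint64_t thresh;

    assert(used <= size);

    if (usage_pct == 0)
        thresh = 1;                      /* only completely empty block groups */
    else if (usage_pct > 100)
        thresh = size;
    else
        thresh = size / 100 * usage_pct; /* rounded down, fine for a sketch */

    return used < thresh;
}
```

Under that reading, -dusage=0 never has data to move, so one would expect it either to free empty block groups or to do nothing, rather than return ENOSPC.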


On 04/12/2016 12:24 PM, Julian Taylor wrote:
> hi,
> I have a system with two filesystems which are both affected by the
> notorious enospace bug when there is plenty of unallocated space
> available. The system is a raid0 on two 900 GiB disks and an iscsi
> single/dup 1.4TiB.
> To deal with the problem I use a cronjob that uses fallocate to give me
> an advance notice on the issue so I can apply the only workaround that
> works for me, which is shrink the fs to the minimum and grow it again.
> This has worked fine for a couple of months.
> 
> I now updated from 4.2 to 4.4.6 and it appears my cronjob actually
> triggers an immediate enospc in the balance after removing the
> fallocated file and the shrink/resize workaround does not work anymore.
> it is mounted with enospc_debug but that just says "2 enospc in
> balance". Nothing else useful in the log.
> 
> I had to revert back to 4.2 to get the system running again so it is
> currently not available for more testing, but I may be able to do more
> tests if required in future.
> 
> The cronjob does this once a day:
> 
> #!/bin/bash
> sync
> 
> check() {
>   date
>   mnt=$1
>   time btrfs fi balance start -mlimit=2 $mnt
>   btrfs fi balance start -dusage=5 $mnt
>   sync
>   freespace=$(df -B1 $mnt | tail -n 1 | awk '{print $4 - 50*1024*1024*1024}')
>   fallocate -l $freespace $mnt/falloc
>   /usr/sbin/filefrag $mnt/falloc
>   rm -f $mnt/falloc
>   btrfs fi balance start -dusage=0 $mnt
> 
>   time btrfs fi balance start -mlimit=2 $mnt
>   time btrfs fi balance start -dlimit=10 $mnt
>   date
> }
> 
> check /data
> check /data/nas
> 
> 
> btrfs info:
> 
> 
>  ~ $ btrfs --version
> btrfs-progs v4.4
> sagan5 ~ $ sudo btrfs fi show
> Label: none  uuid: e4aef349-7a56-4287-93b1-79233e016aae
>   Total devices 2 FS bytes used 898.18GiB
>   devid    1 size 880.00GiB used 473.03GiB path /dev/mapper/data-linear1
>   devid    2 size 880.00GiB used 473.03GiB path /dev/mapper/data-linear2
> 
> Label: none  uuid: 14040f9b-53c8-46cf-be6b-35de746c3153
>   Total devices 1 FS bytes used 557.19GiB
>   devid    1 size 1.36TiB used 585.95GiB path /dev/sdd
> 
>  ~ $ sudo btrfs fi df /data
> Data, RAID0: total=938.00GiB, used=895.09GiB
> System, RAID1: total=32.00MiB, used=112.00KiB
> Metadata, RAID1: total=4.00GiB, used=3.10GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B
> sagan5 ~ $ sudo btrfs fi usage /data
> Overall:
> Device size: 1.72TiB
> Device allocated:  946.06GiB
> Device unallocated:813.94GiB
> Device missing:0.00B
> Used:  901.27GiB
> Free (estimated):  856.85GiB  (min: 449.88GiB)
> Data ratio: 1.00
> Metadata ratio: 2.00
> Global reserve:512.00MiB  (used: 0.00B)
> 
> Data,RAID0: Size:938.00GiB, Used:895.09GiB
>/dev/dm-1   469.00GiB
>/dev/mapper/data-linear1469.00GiB
> 
> Metadata,RAID1: Size:4.00GiB, Used:3.09GiB
>/dev/dm-1 4.00GiB
>/dev/mapper/data-linear1  4.00GiB
> 
> System,RAID1: Size:32.00MiB, Used:112.00KiB
>/dev/dm-132.00MiB
>/dev/mapper/data-linear1 32.00MiB
> 
> Unallocated:
>/dev/dm-1   406.97GiB
>/dev/mapper/data-linear1406.97GiB
> 



Re: KERNEL PANIC + CORRUPTED BTRFS?

2016-04-12 Thread lenovomi
Hi Chris,

i tried these:


1)

 mount -o ro,recovery  /dev/sda /mnt/usb

[ 1707.971925] BTRFS info (device sdd): enabling auto recovery
[ 1707.971933] BTRFS info (device sdd): disk space caching is enabled
[ 1708.005073] BTRFS: sdd checksum verify failed on 17802818387968
wanted BFB02AEC found FF45E2D3 level 1
[ 1708.005230] BTRFS: sdd checksum verify failed on 17802818387968
wanted BFB02AEC found FF45E2D3 level 1
[ 1708.005244] BTRFS: failed to read tree root on sdd
[ 1708.005395] BTRFS: sdd checksum verify failed on 17802818387968
wanted BFB02AEC found FF45E2D3 level 1
[ 1708.005550] BTRFS: sdd checksum verify failed on 17802818387968
wanted BFB02AEC found FF45E2D3 level 1
[ 1708.005562] BTRFS: failed to read tree root on sdd
[ 1708.509471] BTRFS: sdd checksum verify failed on 17802812309504
wanted BFB02AEC found 75187952 level 1
[ 1709.004781] BTRFS: sdd checksum verify failed on 17802812309504
wanted BFB02AEC found 75187952 level 1
[ 1709.004819] BTRFS: failed to read tree root on sdd
[ 1709.026808] BTRFS: sdd checksum verify failed on 17802806853632
wanted BFB02AEC found B6624BB6 level 1
[ 1709.109491] BTRFS: sdd checksum verify failed on 17802806853632
wanted BFB02AEC found B6624BB6 level 1
[ 1709.109521] BTRFS: failed to read tree root on sdd
[ 1709.111635] BTRFS: sdd checksum verify failed on 17802801315840
wanted BFB02AEC found 55A30545 level 1
[ 1709.113389] BTRFS: sdd checksum verify failed on 17802801315840
wanted BFB02AEC found 55A30545 level 1
[ 1709.113426] BTRFS: failed to read tree root on sdd
[ 1709.139603] BTRFS: open_ctree failed

2)
root@ubuntu:/home/ubuntu# btrfs check --readonly -s0 /dev/sda
using SB copy 0, bytenr 65536
checksum verify failed on 17802818387968 found FF45E2D3 wanted BFB02AEC
checksum verify failed on 17802818387968 found FF45E2D3 wanted BFB02AEC
checksum verify failed on 17802818387968 found FF45E2D3 wanted BFB02AEC
checksum verify failed on 17802818387968 found FF45E2D3 wanted BFB02AEC
Csum didn't match
Couldn't read tree root
Couldn't open file system
root@ubuntu:/home/ubuntu# btrfs check --readonly -s1 /dev/sda
using SB copy 1, bytenr 67108864
checksum verify failed on 17802818387968 found FF45E2D3 wanted BFB02AEC
checksum verify failed on 17802818387968 found FF45E2D3 wanted BFB02AEC
checksum verify failed on 17802818387968 found FF45E2D3 wanted BFB02AEC
checksum verify failed on 17802818387968 found FF45E2D3 wanted BFB02AEC
Csum didn't match
Couldn't read tree root
Couldn't open file system
root@ubuntu:/home/ubuntu# btrfs check --readonly -s2 /dev/sda
using SB copy 2, bytenr 274877906944
checksum verify failed on 17802818387968 found FF45E2D3 wanted BFB02AEC
checksum verify failed on 17802818387968 found FF45E2D3 wanted BFB02AEC
checksum verify failed on 17802818387968 found FF45E2D3 wanted BFB02AEC
checksum verify failed on 17802818387968 found FF45E2D3 wanted BFB02AEC
Csum didn't match
Couldn't read tree root
Couldn't open file system
root@ubuntu:/home/ubuntu# btrfs check --readonly -s3 /dev/sda
ERROR: super mirror should be less than: 3
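
A side note on the three bytenr values printed by -s0, -s1 and -s2: they follow a simple pattern, which the sketch below reproduces. The macro and function names are made up for this illustration; only the three resulting offsets come from the output above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names; the values are chosen to match the check output. */
#define SB_PRIMARY_OFFSET (64ULL * 1024)  /* "using SB copy 0, bytenr 65536" */
#define SB_MIRROR_SHIFT   12
#define SB_MIRROR_MAX     3               /* "super mirror should be less than: 3" */

/*
 * Byte offset of superblock copy `mirror` on a device.
 * Copy 0 sits at 64KiB; copy N (N >= 1) sits at 16KiB << (12 * N).
 */
static uint64_t sb_copy_offset(int mirror)
{
    assert(mirror >= 0 && mirror < SB_MIRROR_MAX);

    if (mirror == 0)
        return SB_PRIMARY_OFFSET;
    return (16ULL * 1024) << (SB_MIRROR_SHIFT * mirror);
}
```

So the copies live at 64KiB, 64MiB and 256GiB. All three copies here point at the same tree root block (17802818387968), which is why trying the other copies does not change the result.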


3) restore

root@ubuntu:/home/ubuntu# btrfs restore -D -v  /dev/sda /mnt/usb/
checksum verify failed on 17802818387968 found FF45E2D3 wanted BFB02AEC
checksum verify failed on 17802818387968 found FF45E2D3 wanted BFB02AEC
checksum verify failed on 17802818387968 found FF45E2D3 wanted BFB02AEC
checksum verify failed on 17802818387968 found FF45E2D3 wanted BFB02AEC
Csum didn't match
Couldn't read tree root
Could not open root, trying backup super
warning, device 2 is missing
warning devid 2 not found already
warning devid 3 not found already
warning devid 4 not found already
checksum verify failed on 17802818387968 found FF45E2D3 wanted BFB02AEC
checksum verify failed on 17802818387968 found FF45E2D3 wanted BFB02AEC
Csum didn't match
Couldn't read tree root
Could not open root, trying backup super
warning, device 2 is missing
warning devid 2 not found already
warning devid 3 not found already
warning devid 4 not found already
checksum verify failed on 17802818387968 found FF45E2D3 wanted BFB02AEC
checksum verify failed on 17802818387968 found FF45E2D3 wanted BFB02AEC
Csum didn't match
Couldn't read tree root
Could not open root, trying backup super


4) Everything is failing up to this point, any other ideas?:(


thanks



On Tue, Apr 12, 2016 at 5:43 AM, Chris Murphy  wrote:
> On Mon, Apr 11, 2016 at 3:51 PM, lenovomi  wrote:
>> Hi,
>>
>> i didnt try mount -o ro, when i tried to mount it via esata i got
>> kernel panic immediately. Then i conntected enclosure with drives via
>> usb and tried to mount it :
>
> OK so try '-o ro,recovery' and report back what you get.
>
>
>
>>
>> https://bpaste.net/show/641ab9172539
>> plugged via usb -> mount randomly one of the drive mount /dev/sda /mnt/brtfs
>>
>> I was told on irc channel that i should not run btrfs check and if so
>> i should run it as
>> btrfs check --repair --init-extent-tree
>>
>>
>> Also there was recommendation to run 

Re: [PATCH 10/13] btrfs: introduce helper functions to perform hot replace

2016-04-12 Thread kbuild test robot
Hi Anand,

[auto build test ERROR on btrfs/next]
[also build test ERROR on v4.6-rc3 next-20160412]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improving the system]

url:
https://github.com/0day-ci/linux/commits/Anand-Jain/Introduce-device-state-failed-spare-device-and-auto-replace/20160412-222557
base:   https://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git 
next
config: x86_64-randconfig-x010-201615 (attached as .config)
reproduce:
# save the attached .config to linux build tree
make ARCH=x86_64 

All errors (new ones prefixed by >>):

   fs/btrfs/dev-replace.c: In function 'btrfs_auto_replace_start':
   fs/btrfs/dev-replace.c:979:39: warning: passing argument 2 of 
'btrfs_dev_replace_start' from incompatible pointer type 
[-Wincompatible-pointer-types]
  ret = btrfs_dev_replace_start(root, tgt_path, src_devid, NULL,
  ^
   fs/btrfs/dev-replace.c:308:5: note: expected 'struct 
btrfs_ioctl_dev_replace_args *' but argument is of type 'char *'
int btrfs_dev_replace_start(struct btrfs_root *root,
^
>> fs/btrfs/dev-replace.c:979:9: error: too many arguments to function 'btrfs_dev_replace_start'
  ret = btrfs_dev_replace_start(root, tgt_path, src_devid, NULL,
^
   fs/btrfs/dev-replace.c:308:5: note: declared here
int btrfs_dev_replace_start(struct btrfs_root *root,
^

vim +/btrfs_dev_replace_start +979 fs/btrfs/dev-replace.c

   973  }
   974  
   975  if (atomic_xchg(
   976  &root->fs_info->mutually_exclusive_operation_running, 1)) {
   977  ret = BTRFS_ERROR_DEV_EXCL_RUN_IN_PROGRESS;
   978  } else {
 > 979  ret = btrfs_dev_replace_start(root, tgt_path, src_devid, NULL,
   980  BTRFS_IOCTL_DEV_REPLACE_CONT_READING_FROM_SRCDEV_MODE_ALWAYS);
   981  atomic_set(
   982  &root->fs_info->mutually_exclusive_operation_running, 0);

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation




Btrfs fails disastrously on fuzzy tests

2016-04-12 Thread Juergen Sauer
Hi!
Do you know this paper?

http://events.linuxfoundation.org/sites/events/files/slides/AFL%20filesystem%20fuzzing%2C%20Vault%202016.pdf

It has been making the rounds on the Linux news sites in Germany, see also (in German):

http://www.pro-linux.de/news/1/23449/fuzzy-test-f%C3%BCr-dateisysteme-vorgestellt.html

Kind regards,
Jürgen Sauer
-- 
Jürgen Sauer - automatiX GmbH,
+49-4209-4699, juergen.sa...@automatix.de
Managing director: Jürgen Sauer,
Place of jurisdiction: Amtsgericht Walsrode • HRB 120986
VAT ID: DE191468481 • Tax no.: 36/211/08000
GPG public key for signature verification:
http://www.automatix.de/juergen_sauer_publickey.gpg





[PATCH 09/13] btrfs: provide framework to get and put a spare device

2016-04-12 Thread Anand Jain
From: Anand Jain 

This adds functions to get and put a spare device from the list,
so that the hot replace code can pick a spare device when needed.

Signed-off-by: Anand Jain 
Tested-by: Austin S. Hemmelgarn 
---
 fs/btrfs/ctree.h   |  1 +
 fs/btrfs/super.c   |  5 +
 fs/btrfs/volumes.c | 53 +
 fs/btrfs/volumes.h |  2 ++
 4 files changed, 61 insertions(+)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index a823ff7944f1..1cf1bbf3058f 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -4185,6 +4185,7 @@ void btrfs_sysfs_remove_mounted(struct btrfs_fs_info *fs_info);
 ssize_t btrfs_listxattr(struct dentry *dentry, char *buffer, size_t size);
 
 /* super.c */
+struct file_system_type *btrfs_get_fs_type(void);
 int btrfs_parse_options(struct btrfs_root *root, char *options,
unsigned long new_flags);
 int btrfs_sync_fs(struct super_block *sb, int wait);
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 729f596b540a..2d77a8dde92c 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -69,6 +69,11 @@ static struct file_system_type btrfs_fs_type;
 
 static int btrfs_remount(struct super_block *sb, int *flags, char *data);
 
+struct file_system_type *btrfs_get_fs_type()
+{
+   return &btrfs_fs_type;
+}
+
 const char *btrfs_decode_error(int errno)
 {
char *errstr = "unknown";
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 150807e0310e..00d82872ede0 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -525,6 +525,59 @@ static void pending_bios_fn(struct btrfs_work *work)
run_scheduled_bios(device);
 }
 
+int btrfs_get_spare_device(char **path)
+{
+   int ret = 1;
+   struct btrfs_fs_devices *fs_devices;
+   struct btrfs_device *device;
+   struct list_head *fs_uuids = btrfs_get_fs_uuids();
+
+   mutex_lock(&uuid_mutex);
+   list_for_each_entry(fs_devices, fs_uuids, list) {
+   if (!fs_devices->spare)
+   continue;
+
+   /* as of now there is only one device in the spare fs_devices */
+   device = list_entry(fs_devices->devices.next,
+   struct btrfs_device, dev_list);
+
+   if (!device || !device->name)
+   continue;
+
+   fs_devices->spare = 0;
+   /*
+* Its under uuid_mutex and there is one spare per fsid
+* so rcu lock is actually not required
+*/
+   *path = kstrdup(device->name->str, GFP_KERNEL);
+   if (*path)
+   ret = 0;
+   else
+   ret = -ENOMEM;
+   break;
+   }
+
+   if (!ret) {
+   btrfs_sysfs_remove_fsid(fs_devices);
+   list_del(&fs_devices->list);
+   free_fs_devices(fs_devices);
+   }
+   mutex_unlock(&uuid_mutex);
+
+   return ret;
+}
+
+void btrfs_put_spare_device(char *path)
+{
+   struct file_system_type *btrfs_fs_type;
+   struct btrfs_fs_devices *fs_devices;
+
+   btrfs_fs_type = btrfs_get_fs_type();
+
+   if (btrfs_scan_one_device(path, FMODE_READ,
+   btrfs_fs_type, &fs_devices))
+   printk(KERN_INFO "failed to return spare device\n");
+}
 
 void btrfs_free_stale_device(struct btrfs_device *cur_dev)
 {
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 51cf716eb35b..b4308afa3097 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -469,6 +469,8 @@ int btrfs_init_new_device(struct btrfs_root *root, char *path);
 int btrfs_init_dev_replace_tgtdev(struct btrfs_root *root, char *device_path,
  struct btrfs_device *srcdev,
  struct btrfs_device **device_out);
+int btrfs_get_spare_device(char **path);
+void btrfs_put_spare_device(char *path);
 int btrfs_balance(struct btrfs_balance_control *bctl,
  struct btrfs_ioctl_balance_args *bargs);
 int btrfs_resume_balance_async(struct btrfs_fs_info *fs_info);
-- 
2.7.0



[PATCH v4 00/13] Introduce device state 'failed', spare device and auto replace

2016-04-12 Thread Anand Jain
Thanks for various comments, tests and feedback.

Background: Spare device and Auto replace:
 A spare device is predominantly used to narrow the time window
 during which a RAID runs degraded, because any further disk
 failure in that window would lead to catastrophic data loss.
 Data center storage generally has a couple of disks reserved as
 spares, so that one of them kicks in automatically to resilver
 the storage pool and bring it back to a healthy state.
 This is mainly a storage feature rather than a FS feature.
 I believe people acquainted with enterprise storage use cases
 will appreciate the need for it, which is why most or all
 enterprise storage has a spare device feature.

Btrfs device states:
 This patch set adds the 'failed' state and makes provision to use
 the 'offline' state, giving two new device states. To summarize
 the various device states and their meanings:

 /* missing: device wasn't found at the time of mount */
 int missing;

 /*
  * failed: device confirmed to have experienced critical
  * io failure
  */
 int failed;

 /*
  * offline: When there is no confirmation that a disk has
  * failed. But an interim communication breakdown
  * and not necessarily a candidate for the device replace.
  * Device might be online after user intervention or after
  * block transport layer error recovery.
  */
 int offline;


Device state transition tuning and visualization:
 Sysfs interfaces are planned to provide the required tuning for
 device state transitions and sensitivities, and visualization of
 device states. However, the sysfs framework which could provide such
 an interface is still being reviewed/tested and is not yet ready. So
 for testing and debugging these features I have used an updated
 version of the procfs patch which is on the ML:

  [PATCH] btrfs: debug: procfs-devlist: introduce procfs interface for
the device list for debugging

 I find the above patch very useful, easy to use (as compared to
 sysfs to visualize the device state) and stable.

This patch set does not depend on any of the sysfs patches as such.

Backward compatibility:
 Adds a new incompat feature flag
 (BTRFS_FEATURE_INCOMPAT_SPARE_DEV) to manage the spare device
 when older kernels are used. It has been tested to work fine
 with older kernel/progs versions.


Auto replace:
 Replace happens automatically: when any write or flush fails,
 the device is marked as failed, which stops any further IO
 attempt to that device. In the next commit cycle, auto replace
 picks a spare device to replace the failed device, so that the
 btrfs volume returns to a healthy state.
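
 The flow described above can be modelled roughly as follows. This is a
 userspace toy under stated assumptions, not the code in this series:
 struct sketch_dev, the state names and the three functions are all
 hypothetical, and the real replace copies data onto the spare rather
 than just swapping a path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one device slot in a btrfs volume. */
enum dev_state { DEV_ONLINE, DEV_FAILED };

struct sketch_dev {
    enum dev_state state;
    const char *path;
};

/* A write or flush error marks the device failed. */
static void on_write_or_flush_error(struct sketch_dev *dev)
{
    assert(dev != NULL);
    if (dev->state == DEV_ONLINE)
        dev->state = DEV_FAILED;
}

/* No further IO is attempted to a failed device. */
static bool can_submit_io(const struct sketch_dev *dev)
{
    return dev->state == DEV_ONLINE;
}

/*
 * Called once per commit cycle. If the device has failed and a global
 * spare is available, the spare takes its place and is consumed.
 * Returns true if a replace happened.
 */
static bool on_commit(struct sketch_dev *dev, const char **spare_path)
{
    if (dev->state != DEV_FAILED || *spare_path == NULL)
        return false;

    dev->path = *spare_path;
    *spare_path = NULL;
    dev->state = DEV_ONLINE;
    return true;
}
```

 The point of the two-step design is that the error path only flips a
 flag, and the heavier replace work is deferred to the commit cycle.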

Per FSID spare vs Global spare:
 As of now only global spares are supported, that is, spares
 are shared by all the btrfs filesystems in the system. However,
 in the future there will be a fs_info->no_auto_replace tunable
 which the user can set to limit the use of the global spare.


Example use case:
 Below is an example use case of the spare setup.

 Add a spare device:
btrfs spare add /dev/sde -f

 If a spare device has already been added earlier,
 just run

btrfs dev scan [/dev/sde]

 which will register the spare device with the kernel.

btrfs fi show
 Label: none uuid: 52f170c1-725c-457d-8cfd-d57090460091
  Total devices 2 FS bytes used 112.00KiB
  devid 1 size 2.00GiB used 417.50MiB path /dev/sdc
  devid 2 size 2.00GiB used 417.50MiB path /dev/sdd

Global spare
  device size 3.00GiB path /dev/sde


Patches:

Kernel:
 First, it needs Qu's per-chunk missing device patchset, which is
 included in this set.

 Next, patches 6-9 add support for the spare device. A kernel without
 the spare feature keeps the spare device away, and a kernel that
 supports the spare device will refuse to mount it. Further,
 these patches provide helper functions to pick a spare device and
 to release a spare device back to the spare device pool.

 Patch 10 provides helper function to auto replace.
 Patch 11 provides helper function to bring a device to failed state.
 Patch 12 marks a device as failed based on flush and write errors,
  and avoids any further IO to it.
 Last, patch 13 triggers auto replace.

Progs:
 Needs the 4 patches below, which add the sub-command 'spare' to manage
 the spare device. As of now, deleting a spare device has to be
 done using wipefs. However, in the long run we would want a proper
 btrfs command to do that job.


v3->v4:
Kernel:
 a.
  Mainly bug fixes. Thanks to Yauhen for the bug reports.
  Fixed the issue of bdev not being null. Also fixed the
  issue where auto replace didn't check for
  mutually_exclusive_operation_running. In this process,
  the function force_device_close() is changed quite a
  bit, mainly bdev is copied and nulled within the lock
  context, and later close on the copied bdev is called.
 b.
  changed the wording hot spare to spare device, as some of
  the legacy raid setup would need a particular 

[PATCH 01/13] btrfs: Introduce a new function to check if all chunks are OK for degraded mount

2016-04-12 Thread Anand Jain
From: Qu Wenruo 

Introduce a new function, btrfs_check_degradable(), to judge if all chunks
in btrfs are OK for degraded mount.

It provides the new basis for an accurate btrfs mount/remount and even
runtime degraded mount check, rather than the old one-size-fits-all method.

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/volumes.c | 63 ++
 fs/btrfs/volumes.h |  1 +
 2 files changed, 64 insertions(+)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 9d72dabdddfc..a351c5dd9e9b 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -7039,3 +7039,66 @@ static void btrfs_close_one_device(struct btrfs_device *device)
 
call_rcu(&device->rcu, free_device);
 }
+
+/*
+ * Check if all chunks in the fs is OK for degraded mount
+ * Caller itself should do extra check if DEGRADED mount option is given
+ * for >0 return value.
+ *
+ * Return 0 if all chunks are OK.
+ * Return >0 if all chunks are degradable but not all OK.
+ * Return <0 if any chunk is not degradable or other bug.
+ */
+int btrfs_check_degradable(struct btrfs_fs_info *fs_info, unsigned flags)
+{
+   struct btrfs_mapping_tree *map_tree = &fs_info->mapping_tree;
+   struct extent_map *em;
+   u64 next_start = 0;
+   int ret = 0;
+
+   if (flags & MS_RDONLY)
+   return 0;
+
+   read_lock(&map_tree->map_tree.lock);
+   em = lookup_extent_mapping(&map_tree->map_tree, 0, (u64)(-1));
+   /* No any chunk? Should be a huge bug */
+   if (!em) {
+   ret = -ENOENT;
+   goto out;
+   }
+
+   while (em) {
+   struct map_lookup *map;
+   int missing = 0;
+   int max_tolerated;
+   int i;
+
+   map = (struct map_lookup *) em->bdev;
+   max_tolerated =
+   btrfs_get_num_tolerated_disk_barrier_failures(
+   map->type);
+   for (i = 0; i < map->num_stripes; i++) {
+   if (map->stripes[i].dev->missing)
+   missing++;
+   }
+   if (missing > max_tolerated) {
+   ret = -EIO;
+   btrfs_warn(fs_info,
+  "missing devices(%d) exceeds the limit(%d), writeable mount is not allowed",
+  missing, max_tolerated);
+   goto out;
+   } else if (missing)
+   ret = 1;
+   next_start = extent_map_end(em);
+
+   /*
+* Always search range [next_start, (u64)-1) to find the next
+* chunk map
+*/
+   em = lookup_extent_mapping(&map_tree->map_tree, next_start,
+  (u64)(-1) - next_start);
+   }
+out:
+   read_unlock(&map_tree->map_tree.lock);
+   return ret;
+}
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 1939ebde63df..351431a3f5aa 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -566,5 +566,6 @@ static inline void unlock_chunks(struct btrfs_root *root)
 struct list_head *btrfs_get_fs_uuids(void);
 void btrfs_set_fs_info_ptr(struct btrfs_fs_info *fs_info);
 void btrfs_reset_fs_info_ptr(struct btrfs_fs_info *fs_info);
+int btrfs_check_degradable(struct btrfs_fs_info *fs_info, unsigned flags);
 
 #endif
-- 
2.7.0
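
To make the return convention of the per-chunk check easier to see, here is a stripped-down userspace sketch of it. It is only a model under stated assumptions: struct sketch_chunk and check_degradable() are hypothetical names, the tolerance is passed in rather than derived from the chunk profile, and the read-only shortcut and locking of the real function are left out.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical per-chunk summary used only by this sketch. */
struct sketch_chunk {
    int missing;        /* stripes of this chunk that sit on missing devices */
    int max_tolerated;  /* device failures the chunk's profile can survive */
};

/*
 * Returns 0 if every chunk is fully present, 1 if at least one chunk is
 * degraded but all are within tolerance, -EIO if any chunk has more
 * missing stripes than it tolerates, -ENOENT if there are no chunks.
 */
static int check_degradable(const struct sketch_chunk *chunks, size_t n)
{
    int ret = 0;
    size_t i;

    if (n == 0)
        return -ENOENT;
    assert(chunks != NULL);

    for (i = 0; i < n; i++) {
        if (chunks[i].missing > chunks[i].max_tolerated)
            return -EIO;
        if (chunks[i].missing)
            ret = 1;
    }
    return ret;
}
```

The difference from a global check is that a missing device only matters for the chunks that actually have a stripe on it, so a filesystem mixing profiles is judged chunk by chunk.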



[PATCH 11/13] btrfs: introduce device dynamic state transition to offline or failed

2016-04-12 Thread Anand Jain
From: Anand Jain 

This patch provides helper functions to force a device offline
or failed. We need these device states for the following reasons:
1) a. it can be reported that a device has failed when it does
   b. close the device when it goes offline so that the block layer can
      clean up
2) identify the candidate for the auto replace
3) avoid further commit errors reported against the failing device, and
4) a device in a multi-device btrfs may go offline from the system
   (but as of now, in some system configs btrfs gets unmounted in this
   context, which is not correct behavior)

Signed-off-by: Anand Jain 
Tested-by: Austin S. Hemmelgarn 
---
 fs/btrfs/volumes.c | 138 +
 fs/btrfs/volumes.h |  14 ++
 2 files changed, 152 insertions(+)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 00d82872ede0..275143c42374 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -7146,3 +7146,141 @@ out:
read_unlock(&map_tree->map_tree.lock);
return ret;
 }
+
+static void __close_device(struct work_struct *work)
+{
+   struct btrfs_device *device;
+
+   device = container_of(work, struct btrfs_device, rcu_work);
+
+   if (device->closing_bdev)
+   blkdev_put(device->closing_bdev, device->mode);
+
+   device->closing_bdev = NULL;
+}
+
+static void close_device(struct rcu_head *head)
+{
+   struct btrfs_device *device;
+
+   device = container_of(head, struct btrfs_device, rcu);
+
+   INIT_WORK(&device->rcu_work, __close_device);
+   schedule_work(&device->rcu_work);
+}
+
+void device_force_close(struct btrfs_device *device)
+{
+   struct btrfs_device *next_device;
+   struct btrfs_fs_devices *fs_devices;
+
+   fs_devices = device->fs_devices;
+
+   mutex_lock(&fs_devices->device_list_mutex);
+   mutex_lock(&fs_devices->fs_info->chunk_mutex);
+   spin_lock(&fs_devices->fs_info->free_chunk_lock);
+
+   next_device = list_entry(fs_devices->devices.next,
+   struct btrfs_device, dev_list);
+   if (device->bdev == fs_devices->fs_info->sb->s_bdev)
+   fs_devices->fs_info->sb->s_bdev = next_device->bdev;
+
+   if (device->bdev == fs_devices->latest_bdev)
+   fs_devices->latest_bdev = next_device->bdev;
+
+   if (device->bdev)
+   fs_devices->open_devices--;
+
+   if (device->writeable) {
+   list_del_init(&device->dev_alloc_list);
+   fs_devices->rw_devices--;
+   }
+   device->writeable = 0;
+
+   /*
+* fixme: works for now, but its better to keep the state of
+* missing and offline different, and update rest of the
+* places where we check for only missing and not for failed
+* or offline as of now.
+*/
+   device->missing = 1;
+   fs_devices->missing_devices++;
+   device->closing_bdev = device->bdev;
+   device->bdev = NULL;
+
+   call_rcu(&device->rcu, close_device);
+
+   spin_unlock(&fs_devices->fs_info->free_chunk_lock);
+   mutex_unlock(&fs_devices->fs_info->chunk_mutex);
+   mutex_unlock(&fs_devices->device_list_mutex);
+
+   rcu_barrier();
+}
+
+void btrfs_device_enforce_state(struct btrfs_device *dev, char *why)
+{
+   int tolerance;
+   bool degrade_option;
+   char dev_status[10];
+   char chunk_status[25];
+   struct btrfs_fs_info *fs_info;
+   struct btrfs_fs_devices *fs_devices;
+
+   fs_devices = dev->fs_devices;
+   fs_info = fs_devices->fs_info;
+   degrade_option = btrfs_test_opt(fs_info->fs_root, DEGRADED);
+
+   /* todo: support seed later */
+   if (fs_devices->seeding)
+   return;
+
+   /* this shouldn't be called if device is already missing */
+   if (dev->missing || !dev->bdev)
+   return;
+
+   if (dev->offline || dev->failed)
+   return;
+
+   /* Only RW device is requested to force close let FS handle it*/
+   if (fs_devices->rw_devices == 1) {
+   btrfs_std_error(fs_info, -EIO,
+   "force offline last RW device");
+   return;
+   }
+
+   if (!strcmp(why, "offline"))
+   dev->offline = 1;
+   else if (!strcmp(why, "failed"))
+   dev->failed = 1;
+   else
+   return;
+
+   /*
+* Here after, there shouldn't any reason why can't force
+* close this device
+*/
+   btrfs_sysfs_rm_device_link(fs_devices, dev);
+   device_force_close(dev);
+   strcpy(dev_status, "closed");
+
+   tolerance = btrfs_check_degradable(fs_info,
+   fs_info->sb->s_flags);
+   if (tolerance > 0) {
+   strncpy(chunk_status, "chunk(s) degraded", 25);
+   } else if(tolerance < 0) {
+   strncpy(chunk_status, "chunk(s) failed", 25);
+   } 

[PATCH 07/13] btrfs: add check not to mount a spare device

2016-04-12 Thread Anand Jain
From: Anand Jain 

Spare devices can be scanned but shouldn't be mountable.

Signed-off-by: Anand Jain 
Tested-by: Austin S. Hemmelgarn 
---
 fs/btrfs/disk-io.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 65c9f19d8017..e9fca3bc7e42 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2811,6 +2811,14 @@ int open_ctree(struct super_block *sb,
goto fail_alloc;
}
 
+   if (btrfs_super_incompat_flags(disk_super) &
+   BTRFS_FEATURE_INCOMPAT_SPARE_DEV) {
+   /*You can only scan a spare device but not mount*/
+   printk(KERN_ERR "BTRFS: You can't mount a spare device\n");
+   err = -ENOTSUPP;
+   goto fail_alloc;
+   }
+
/*
 * Needn't use the lock because there is no other task which will
 * update the flag.
-- 
2.7.0



[PATCH 10/13] btrfs: introduce helper functions to perform hot replace

2016-04-12 Thread Anand Jain
From: Anand Jain 

Hot replace / auto replace is an important volume manager feature
and is critical to data center operations, so that a degraded
volume can be brought back to a healthy state at the earliest and
without manual intervention.

This modifies the existing replace code to suit the needs of auto
replace; in the long run I hope both code paths can be merged.

Signed-off-by: Anand Jain 
Tested-by: Austin S. Hemmelgarn 
---
 fs/btrfs/dev-replace.c | 43 +++
 fs/btrfs/dev-replace.h |  1 +
 2 files changed, 44 insertions(+)

diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index 2b926867d136..ddc4843604df 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -957,3 +957,46 @@ void btrfs_bio_counter_inc_blocked(struct btrfs_fs_info *fs_info)
 &fs_info->fs_state));
}
 }
+
+int btrfs_auto_replace_start(struct btrfs_root *root, u64 src_devid)
+{
+   int ret;
+   char *tgt_path;
+   struct btrfs_fs_info *fs_info = root->fs_info;
+
+   if (!src_devid)
+   return -EINVAL;
+
+   if (fs_info->sb->s_flags & MS_RDONLY)
+   return -EROFS;
+
+   btrfs_dev_replace_lock(&fs_info->dev_replace, 0);
+   if (btrfs_dev_replace_is_ongoing(&fs_info->dev_replace)) {
+   btrfs_dev_replace_unlock(&fs_info->dev_replace, 0);
+   return -EBUSY;
+   }
+   btrfs_dev_replace_unlock(&fs_info->dev_replace, 0);
+
+   if (btrfs_get_spare_device(&tgt_path)) {
+   btrfs_err(root->fs_info,
+   "No spare device found/configured in the kernel");
+   return -EINVAL;
+   }
+
+   if (atomic_xchg(
+   &root->fs_info->mutually_exclusive_operation_running, 1)) {
+   ret = BTRFS_ERROR_DEV_EXCL_RUN_IN_PROGRESS;
+   } else {
+   ret = btrfs_dev_replace_start(root, tgt_path, src_devid, NULL,
+   BTRFS_IOCTL_DEV_REPLACE_CONT_READING_FROM_SRCDEV_MODE_ALWAYS);
+   atomic_set(
+   &root->fs_info->mutually_exclusive_operation_running, 0);
+   }
+
+   if (ret)
+   btrfs_put_spare_device(tgt_path);
+
+   kfree(tgt_path);
+
+   return ret;
+}
diff --git a/fs/btrfs/dev-replace.h b/fs/btrfs/dev-replace.h
index e922b42d91df..54b0812c8ba4 100644
--- a/fs/btrfs/dev-replace.h
+++ b/fs/btrfs/dev-replace.h
@@ -46,4 +46,5 @@ static inline void btrfs_dev_replace_stats_inc(atomic64_t *stat_value)
 {
atomic64_inc(stat_value);
 }
+int btrfs_auto_replace_start(struct btrfs_root *root, u64 src_devid);
 #endif
-- 
2.7.0



[PATCH 12/13] btrfs: check device for critical errors and mark failed

2016-04-12 Thread Anand Jain
From: Anand Jain 

Write and Flush errors are considered as critical errors,
upon which the device will be brought offline and marked as
failed. Write and Flush errors are identified using device
error statistics. This is monitored using a kthread
btrfs_health.
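
For readers following along, here is a minimal userspace sketch of the
"collect new critical errors, then mark the device failed" step described
above. It is not the kernel code: struct dev_model and both functions are
hypothetical names, and it uses C11 atomic_exchange() to take all pending
errors in one step, whereas the patch itself uses atomic_read() followed by
atomic_sub().

```c
#include <stdatomic.h>

/* Hypothetical userspace stand-in for the per-device error state. */
struct dev_model {
	atomic_int new_critical_errs; /* bumped by the I/O error path */
	int failed;                   /* set once by the health check */
};

/* Models the write/flush error path recording one critical error. */
static void record_critical_error(struct dev_model *dev)
{
	atomic_fetch_add(&dev->new_critical_errs, 1);
}

/*
 * One pass of the health check for a single device: take all pending
 * errors in a single atomic step and mark the device failed if there
 * were any. Returns 1 if the device is (now) failed, 0 otherwise.
 */
static int check_device_health(struct dev_model *dev)
{
	int pending;

	if (dev->failed)
		return 1;

	/* Assumption: swap-to-zero instead of the patch's read + sub. */
	pending = atomic_exchange(&dev->new_critical_errs, 0);
	if (pending > 0)
		dev->failed = 1;

	return dev->failed;
}
```

Taking the count with a single exchange means an error recorded between
"read" and "reset" cannot be lost, which is the property the read-then-subtract
sequence in the patch is also after.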

Signed-off-by: Anand Jain 
Tested-by: Austin S. Hemmelgarn 
---
 fs/btrfs/ctree.h   |   2 ++
 fs/btrfs/disk-io.c | 101 -
 fs/btrfs/volumes.c |   1 +
 fs/btrfs/volumes.h |   4 +++
 4 files changed, 107 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 1cf1bbf3058f..e36200cf6ead 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1569,6 +1569,7 @@ struct btrfs_fs_info {
struct mutex tree_log_mutex;
struct mutex transaction_kthread_mutex;
struct mutex cleaner_mutex;
+   struct mutex health_mutex;
struct mutex chunk_mutex;
struct mutex volume_mutex;
 
@@ -1686,6 +1687,7 @@ struct btrfs_fs_info {
struct btrfs_workqueue *extent_workers;
struct task_struct *transaction_kthread;
struct task_struct *cleaner_kthread;
+   struct task_struct *health_kthread;
int thread_pool_size;
 
struct kobject *space_info_kobj;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index e9fca3bc7e42..1deb5714cc3a 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1876,6 +1876,93 @@ sleep:
return 0;
 }
 
+/*
+ * returns:
+ * < 0 : Check didn't run, std error
+ *   0 : No errors found
+ * > 0 : # of devices having fatal errors
+ */
+static int btrfs_update_devices_health(struct btrfs_root *root)
+{
+   int ret = 0;
+   struct btrfs_device *device;
+   struct btrfs_fs_info *fs_info = root->fs_info;
+
+   if (btrfs_fs_closing(fs_info))
+   return -EBUSY;
+
+   /* mark disk(s) with write or flush error(s) as failed */
+   mutex_lock(&fs_info->volume_mutex);
+   list_for_each_entry_rcu(device,
+   &fs_info->fs_devices->devices, dev_list) {
+   int c_err;
+
+   if (device->failed) {
+   ret++;
+   continue;
+   }
+
+   /*
+* todo: replace target device's write/flush error,
+* skip for now
+*/
+   if (device->is_tgtdev_for_dev_replace)
+   continue;
+
+   if (!device->dev_stats_valid)
+   continue;
+
+   c_err = atomic_read(&device->new_critical_errs);
+   atomic_sub(c_err, &device->new_critical_errs);
+   if (c_err) {
+   btrfs_crit_in_rcu(fs_info,
+   "fatal error on device %s",
+   rcu_str_deref(device->name));
+   btrfs_device_enforce_state(device, "failed");
+   ret++;
+   }
+   }
+   mutex_unlock(&fs_info->volume_mutex);
+
+   return ret;
+}
+
+/*
+ * Devices health maintenance kthread, gets woken up by the transaction
+ * kthread. Once sysfs is ready, this should publish the report
+ * through sysfs so that user land scripts can invoke actions.
+ */
+static int health_kthread(void *arg)
+{
+   struct btrfs_root *root = arg;
+
+   do {
+   if (btrfs_need_cleaner_sleep(root))
+   goto sleep;
+
+   if (!mutex_trylock(&root->fs_info->health_mutex))
+   goto sleep;
+
+   if (btrfs_need_cleaner_sleep(root)) {
+   mutex_unlock(&root->fs_info->health_mutex);
+   goto sleep;
+   }
+
+   /* Check devices health */
+   btrfs_update_devices_health(root);
+
+   mutex_unlock(&root->fs_info->health_mutex);
+
+sleep:
+   set_current_state(TASK_INTERRUPTIBLE);
+   if (!kthread_should_stop())
+   schedule();
+   __set_current_state(TASK_RUNNING);
+   } while (!kthread_should_stop());
+
+   return 0;
+}
+
 static int transaction_kthread(void *arg)
 {
struct btrfs_root *root = arg;
@@ -1922,6 +2009,7 @@ static int transaction_kthread(void *arg)
btrfs_end_transaction(trans, root);
}
 sleep:
+   wake_up_process(root->fs_info->health_kthread);
wake_up_process(root->fs_info->cleaner_kthread);
mutex_unlock(&root->fs_info->transaction_kthread_mutex);
 
@@ -2668,6 +2756,7 @@ int open_ctree(struct super_block *sb,
mutex_init(&fs_info->chunk_mutex);
mutex_init(&fs_info->transaction_kthread_mutex);
mutex_init(&fs_info->cleaner_mutex);
+   mutex_init(&fs_info->health_mutex);
mutex_init(&fs_info->volume_mutex);
mutex_init(&fs_info->ro_block_group_mutex);
init_rwsem(&fs_info->commit_root_sem);
@@ -3010,11 

Re: [PATCH 10/13] btrfs: introduce helper functions to perform hot replace

2016-04-12 Thread Anand Jain



On 04/09/2016 06:05 AM, Yauhen Kharuzhy wrote:

On Sat, Apr 02, 2016 at 09:30:48AM +0800, Anand Jain wrote:

Hot replace / auto replace is important volume manager feature
and is critical to the data center operations, so that the degraded
volume can be brought back to a healthy state at the earliest and
without manual intervention.

This modifies the existing replace code to suite the need of auto
replace, in the long run I hope both the codes to be merged.

Signed-off-by: Anand Jain 
Tested-by: Austin S. Hemmelgarn 
---
  fs/btrfs/dev-replace.c | 43 +++
  fs/btrfs/dev-replace.h |  1 +
  2 files changed, 44 insertions(+)

diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index 2b926867d136..ceab4c51db32 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -957,3 +957,46 @@ void btrfs_bio_counter_inc_blocked(struct btrfs_fs_info *fs_info)
 &fs_info->fs_state));
}
  }
+
+int btrfs_auto_replace_start(struct btrfs_root *root,
+   struct btrfs_device *src_device)
+{
+   int ret;
+   char *tgt_path;
+   char *src_path;
+   struct btrfs_fs_info *fs_info = root->fs_info;
+
+   if (fs_info->sb->s_flags & MS_RDONLY)
+   return -EROFS;
+
+   btrfs_dev_replace_lock(&fs_info->dev_replace, 0);
+   if (btrfs_dev_replace_is_ongoing(&fs_info->dev_replace)) {
+   btrfs_dev_replace_unlock(&fs_info->dev_replace, 0);
+   return -EBUSY;
+   }
+   btrfs_dev_replace_unlock(&fs_info->dev_replace, 0);
+
+   if (btrfs_get_spare_device(&tgt_path)) {
+   btrfs_err(root->fs_info,
+   "No spare device found/configured in the kernel");
+   return -EINVAL;
+   }
+
+   rcu_read_lock();
+   src_path = kstrdup(rcu_str_deref(src_device->name), GFP_ATOMIC);
+   rcu_read_unlock();
+   if (!src_path) {
+   kfree(tgt_path);
+   return -ENOMEM;
+   }
+   ret = btrfs_dev_replace_start(root, tgt_path,
+   src_device->devid, src_path,
+   BTRFS_IOCTL_DEV_REPLACE_CONT_READING_FROM_SRCDEV_MODE_AVOID);
+   if (ret)
+   btrfs_put_spare_device(tgt_path);
+
+   kfree(tgt_path);
+   kfree(src_path);
+
+   return 0;
+}


Without the fs_info->mutually_exclusive_operation_running flag being set in
btrfs_auto_replace_start(), device add/remove/balance etc. can be
started in parallel with autoreplace. Should these scenarios be permitted?


 It is needed in case a device delete/add etc. is running; added
 in v4.  Thanks.
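
As an aside, the exclusion discussed here boils down to an atomic
test-and-set flag. The following is only a userspace sketch with C11 atomics;
exclusive_op_running, exclusive_op_try_start() and exclusive_op_finish() are
hypothetical names standing in for the kernel's
mutually_exclusive_operation_running flag and its atomic_xchg()/atomic_set()
calls.

```c
#include <stdatomic.h>

/* Hypothetical stand-in for fs_info->mutually_exclusive_operation_running. */
static atomic_int exclusive_op_running; /* static storage, starts at 0 */

/*
 * Try to claim the single "exclusive operation" slot.
 * Returns 1 if the caller now owns it, 0 if another operation
 * (device add/remove, balance, replace, ...) already holds it.
 */
static int exclusive_op_try_start(void)
{
	/* The exchange returns the previous value: 0 means we won. */
	return atomic_exchange(&exclusive_op_running, 1) == 0;
}

/* Release the slot. Only the caller that won the exchange may do this. */
static void exclusive_op_finish(void)
{
	atomic_store(&exclusive_op_running, 0);
}
```

The detail that matters is that a caller which loses the exchange must not
clear the flag, otherwise it would release a slot owned by someone else.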



diff --git a/fs/btrfs/dev-replace.h b/fs/btrfs/dev-replace.h
index e922b42d91df..b918b9d6e5df 100644
--- a/fs/btrfs/dev-replace.h
+++ b/fs/btrfs/dev-replace.h
@@ -46,4 +46,5 @@ static inline void btrfs_dev_replace_stats_inc(atomic64_t *stat_value)
  {
atomic64_inc(stat_value);
  }
+int btrfs_auto_replace_start(struct btrfs_root *root, struct btrfs_device *src_device);
  #endif
--
2.7.0





[PATCH 05/13] btrfs: Cleanup num_tolerated_disk_barrier_failures

2016-04-12 Thread Anand Jain
From: Qu Wenruo 

As we now use the per-chunk degradable check, the global
num_tolerated_disk_barrier_failures is of no use. So clean it up.

Signed-off-by: Qu Wenruo 

[Btrfs: resolve conflict to apply 'btrfs: Cleanup 
num_tolerated_disk_barrier_failures']
Signed-off-by: Anand Jain 
---
 fs/btrfs/ctree.h   |  2 --
 fs/btrfs/disk-io.c | 56 --
 fs/btrfs/disk-io.h |  2 --
 fs/btrfs/volumes.c | 17 -
 4 files changed, 77 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 0b5c2c71dffd..7a6471269b34 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1829,8 +1829,6 @@ struct btrfs_fs_info {
/* next backup root to be overwritten */
int backup_root_index;
 
-   int num_tolerated_disk_barrier_failures;
-
/* device replace state */
struct btrfs_dev_replace dev_replace;
 
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 9ad3667f5e71..65c9f19d8017 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2996,8 +2996,6 @@ retry_root_backup:
printk(KERN_ERR "BTRFS: Failed to read block groups: %d\n", ret);
goto fail_sysfs;
}
-   fs_info->num_tolerated_disk_barrier_failures =
-   btrfs_calc_num_tolerated_disk_barrier_failures(fs_info);
 
fs_info->cleaner_kthread = kthread_run(cleaner_kthread, tree_root,
   "btrfs-cleaner");
@@ -3564,60 +3562,6 @@ int btrfs_get_num_tolerated_disk_barrier_failures(u64 flags)
return min_tolerated;
 }
 
-int btrfs_calc_num_tolerated_disk_barrier_failures(
-   struct btrfs_fs_info *fs_info)
-{
-   struct btrfs_ioctl_space_info space;
-   struct btrfs_space_info *sinfo;
-   u64 types[] = {BTRFS_BLOCK_GROUP_DATA,
-  BTRFS_BLOCK_GROUP_SYSTEM,
-  BTRFS_BLOCK_GROUP_METADATA,
-  BTRFS_BLOCK_GROUP_DATA | BTRFS_BLOCK_GROUP_METADATA};
-   int i;
-   int c;
-   int num_tolerated_disk_barrier_failures =
-   (int)fs_info->fs_devices->num_devices;
-
-   for (i = 0; i < ARRAY_SIZE(types); i++) {
-   struct btrfs_space_info *tmp;
-
-   sinfo = NULL;
-   rcu_read_lock();
-   list_for_each_entry_rcu(tmp, &fs_info->space_info, list) {
-   if (tmp->flags == types[i]) {
-   sinfo = tmp;
-   break;
-   }
-   }
-   rcu_read_unlock();
-
-   if (!sinfo)
-   continue;
-
-   down_read(&sinfo->groups_sem);
-   for (c = 0; c < BTRFS_NR_RAID_TYPES; c++) {
-   u64 flags;
-
-   if (list_empty(&sinfo->block_groups[c]))
-   continue;
-
-   btrfs_get_block_group_info(&sinfo->block_groups[c],
-  &space);
-   if (space.total_bytes == 0 || space.used_bytes == 0)
-   continue;
-   flags = space.flags;
-
-   num_tolerated_disk_barrier_failures = min(
-   num_tolerated_disk_barrier_failures,
-   btrfs_get_num_tolerated_disk_barrier_failures(
-   flags));
-   }
-   up_read(&sinfo->groups_sem);
-   }
-
-   return num_tolerated_disk_barrier_failures;
-}
-
 static int write_all_supers(struct btrfs_root *root, int max_mirrors)
 {
struct list_head *head;
diff --git a/fs/btrfs/disk-io.h b/fs/btrfs/disk-io.h
index 8e79d0070bcf..dd155621f95f 100644
--- a/fs/btrfs/disk-io.h
+++ b/fs/btrfs/disk-io.h
@@ -141,8 +141,6 @@ struct btrfs_root *btrfs_create_tree(struct btrfs_trans_handle *trans,
 int btree_lock_page_hook(struct page *page, void *data,
void (*flush_fn)(void *));
 int btrfs_get_num_tolerated_disk_barrier_failures(u64 flags);
-int btrfs_calc_num_tolerated_disk_barrier_failures(
-   struct btrfs_fs_info *fs_info);
 int __init btrfs_end_io_wq_init(void);
 void btrfs_end_io_wq_exit(void);
 
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index d9cae4d7ba55..8549bd2b3a42 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1872,9 +1872,6 @@ int btrfs_rm_device(struct btrfs_root *root, char *device_path)
free_fs_devices(cur_devices);
}
 
-   root->fs_info->num_tolerated_disk_barrier_failures =
-   btrfs_calc_num_tolerated_disk_barrier_failures(root->fs_info);
-
/*
 * at this point, the device is zero sized.  We want to
 * remove it from the devices list and zero out the old super
@@ -2402,8 +2399,6 @@ int btrfs_init_new_device(struct btrfs_root *root, char *device_path)
 

Re: Global hotspare functionality

2016-04-12 Thread Anand Jain



On 04/05/2016 03:32 AM, Yauhen Kharuzhy wrote:

2016-04-01 18:15 GMT-07:00 Anand Jain :

Issue 2.
At the start of autoreplacing a drive with a hotspare, the kernel crashes
in the transaction handling code (inside btrfs_commit_transaction(), called
by the autoreplace initiating routines). I 'fixed' this by removing the
closing of bdev in btrfs_close_one_device_dont_free(), see

https://bitbucket.org/jekhor/linux-btrfs/commits/dfa441c9ec7b3833f6a5e4d0b6f8c678faea29bb?at=master
(oops text is attached also). Bdev is closed after replacing by
btrfs_dev_replace_finishing(), so this is safe but doesn't seem
to be the right way.



   I have sent out V2. I don't see that issue with this,
   could you pls try ?



Yes, it reproduced on v4.4.5 kernel. I will try with current
'for-linus-4.6' Chris' tree soon.

To emulate a drive failure, I disconnect the drive in VirtualBox, so bdev
can be freed by kernel after releasing of all references to it.



   So far the raid group profile would adapt to lower suitable
   group profile when device is missing/failed. This appears to
   be not happening with RAID56 OR there are stale IO which wasn't
   flushed out. Anyway to have this fixed I am moving the patch
btrfs: introduce device dynamic state transition to offline or failed
   to the top in v3 for any potential changes.
   But firstly we need a reliable test case, or a very carefully
   crafted test case which can create this situation

   Here below is the dm-error that I am using for testing, which
   apparently doesn't report this issue. Could you please try on V3. ?
   (pls note the device names are hard coded in the test script
   sorry about that) This would eventually be fstests script.


Hi,

I have reproduced this oops with attached script. I don't use any dm
layer, but just detach drive at scsi layer as xfstests do (device
management functions were copy-pasted from it).


 Nice. I was able to reproduce this (I also found a lockdep issue when
 running this; since it was in the original code, a separate patch was
 sent out). The issue was that bdev wasn't NULL; to fix this,
 btrfs_device_enforce_state() is changed quite a bit. V4 is out.

Thanks, Anand



[PATCH 03/13] btrfs: Do per-chunk degraded check for remount

2016-04-12 Thread Anand Jain
From: Qu Wenruo 

Same as for the mount time check, use the new btrfs_check_degraded() to do
the per-chunk check.

Signed-off-by: Qu Wenruo 

Btrfs: use btrfs_error instead of btrfs_err during remount

Signed-off-by: Anand Jain 
---
 fs/btrfs/super.c | 11 +++
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index b4e15416704d..729f596b540a 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -1767,11 +1767,14 @@ static int btrfs_remount(struct super_block *sb, int *flags, char *data)
goto restore;
}
 
-   if (fs_info->fs_devices->missing_devices >
-fs_info->num_tolerated_disk_barrier_failures &&
-   !(*flags & MS_RDONLY)) {
+   ret = btrfs_check_degradable(fs_info, *flags);
+   if (ret < 0) {
+   btrfs_err(fs_info,
+   "degraded writable remount failed %d", ret);
+   goto restore;
+   } else if (ret > 0 && !btrfs_test_opt(root, DEGRADED)) {
btrfs_warn(fs_info,
-   "too many missing devices, writeable remount is not allowed");
+   "some device missing, but still degraded mountable, please remount with -o degraded option");
ret = -EACCES;
goto restore;
}
-- 
2.7.0



[PATCH 08/13] btrfs: support btrfs dev scan for spare device

2016-04-12 Thread Anand Jain
From: Anand Jain 

When the user or system calls the BTRFS_IOC_SCAN_DEV ioctl on a
spare device, this patch makes sure the device is added to the
device list and marked as spare.

The same applies to BTRFS_IOC_DEVICES_READY, since that ioctl
has historically performed the scan as well.

Signed-off-by: Anand Jain 
Tested-by: Austin S. Hemmelgarn 
---
 fs/btrfs/volumes.c | 4 
 fs/btrfs/volumes.h | 2 ++
 2 files changed, 6 insertions(+)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 8549bd2b3a42..150807e0310e 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -605,6 +605,10 @@ static noinline int device_list_add(const char *path,
if (IS_ERR(fs_devices))
return PTR_ERR(fs_devices);
 
+   if (btrfs_super_incompat_flags(disk_super) &
+   BTRFS_FEATURE_INCOMPAT_SPARE_DEV)
+   fs_devices->spare = 1;
+
list_add(&fs_devices->list, &fs_uuids);
 
device = NULL;
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 48ced5cc09e4..51cf716eb35b 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -263,6 +263,8 @@ struct btrfs_fs_devices {
struct kobject fsid_kobj;
struct kobject *device_dir_kobj;
struct completion kobj_unregister;
+
+   int spare;
 };
 
 #define BTRFS_BIO_INLINE_CSUM_SIZE 64
-- 
2.7.0



[PATCH 02/13] btrfs: Do per-chunk check for mount time check

2016-04-12 Thread Anand Jain
From: Qu Wenruo 

Now use the btrfs_check_degraded() to do mount time degraded check.

With this patch, now we can mount with the following case:
 # mkfs.btrfs -f -m raid1 -d single /dev/sdb /dev/sdc
 # wipefs -a /dev/sdc
 # mount /dev/sdb /mnt/btrfs -o degraded
 As the single data chunk is only in sdb, so it's OK to mount as degraded,
 as missing one device is OK for RAID1.

But still fail with the following case as expected:
 # mkfs.btrfs -f -m raid1 -d single /dev/sdb /dev/sdc
 # wipefs -a /dev/sdb
 # mount /dev/sdc /mnt/btrfs -o degraded
 As the data chunk is only in sdb, so it's not OK to mount it as degraded.
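
The two cases above can be modelled with a small userspace sketch of a
per-chunk check. This is an illustration only, not the kernel function:
struct dev_model, struct chunk_model and check_degradable() are hypothetical
names, and the return convention (negative on failure, positive when degraded
but within tolerance, zero when nothing is missing) is inferred from how the
callers in this series test the result.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for a device and a chunk map. */
struct dev_model {
	int missing;
};

struct chunk_model {
	int max_tolerated;                  /* 0 for single, 1 for RAID1 */
	size_t num_stripes;
	const struct dev_model *stripes[4];
};

/*
 * Walk every chunk and count how many of its stripes sit on a missing
 * device. A chunk that lost more devices than its profile tolerates
 * makes the whole filesystem unusable for writing.
 */
static int check_degradable(const struct chunk_model *chunks, size_t nr_chunks)
{
	int degraded = 0;

	for (size_t c = 0; c < nr_chunks; c++) {
		int missing = 0;

		for (size_t i = 0; i < chunks[c].num_stripes; i++) {
			if (chunks[c].stripes[i]->missing)
				missing++;
		}
		if (missing > chunks[c].max_tolerated)
			return -EIO;
		if (missing > 0)
			degraded = 1;
	}
	return degraded;
}
```

With a RAID1 metadata chunk on sdb and sdc and a single data chunk on sdb,
losing sdc yields "degraded but mountable", while losing sdb yields -EIO,
matching the two mount examples.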

Reported-by: Zhao Lei 
Reported-by: Anand Jain 
Signed-off-by: Qu Wenruo 

[Btrfs: use btrfs_error instead of btrfs_err during mount]
Signed-off-by: Anand Jain 
---
 fs/btrfs/disk-io.c | 18 ++
 1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index d01f89d130e0..4f91a049fbca 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2885,6 +2885,16 @@ int open_ctree(struct super_block *sb,
goto fail_tree_roots;
}
 
+   ret = btrfs_check_degradable(fs_info, fs_info->sb->s_flags);
+   if (ret < 0) {
+   btrfs_err(fs_info, "degraded writable mount failed %d", ret);
+   goto fail_tree_roots;
+   } else if (ret > 0 && !btrfs_test_opt(chunk_root, DEGRADED)) {
+   btrfs_warn(fs_info,
+   "Some device missing, but still degraded mountable, please mount with -o degraded option");
+   ret = -EACCES;
+   goto fail_tree_roots;
+   }
/*
 * keep the device that is marked to be the target device for the
 * dev_replace procedure
@@ -2988,14 +2998,6 @@ retry_root_backup:
}
fs_info->num_tolerated_disk_barrier_failures =
btrfs_calc_num_tolerated_disk_barrier_failures(fs_info);
-   if (fs_info->fs_devices->missing_devices >
-fs_info->num_tolerated_disk_barrier_failures &&
-   !(sb->s_flags & MS_RDONLY)) {
-   pr_warn("BTRFS: missing devices(%llu) exceeds the limit(%d), writeable mount is not allowed\n",
-   fs_info->fs_devices->missing_devices,
-   fs_info->num_tolerated_disk_barrier_failures);
-   goto fail_sysfs;
-   }
 
fs_info->cleaner_kthread = kthread_run(cleaner_kthread, tree_root,
   "btrfs-cleaner");
-- 
2.7.0



[PATCH 04/13] btrfs: Allow barrier_all_devices to do per-chunk device check

2016-04-12 Thread Anand Jain
From: Qu Wenruo 

The last user of num_tolerated_disk_barrier_failures is
barrier_all_devices(). But it can easily be changed to use the new
per-chunk degradable check framework.

Now btrfs_device has two extra members, representing send/wait
errors, set at write_dev_flush() time. They are then checked in a similar
but more accurate way than in the old code.

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/disk-io.c | 13 +
 fs/btrfs/volumes.c |  6 +-
 fs/btrfs/volumes.h |  4 
 3 files changed, 14 insertions(+), 9 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 4f91a049fbca..9ad3667f5e71 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3496,8 +3496,6 @@ static int barrier_all_devices(struct btrfs_fs_info *info)
 {
struct list_head *head;
struct btrfs_device *dev;
-   int errors_send = 0;
-   int errors_wait = 0;
int ret;
 
/* send down all the barriers */
@@ -3506,7 +3504,7 @@ static int barrier_all_devices(struct btrfs_fs_info *info)
if (dev->missing)
continue;
if (!dev->bdev) {
-   errors_send++;
+   dev->err_send = 1;
continue;
}
if (!dev->in_fs_metadata || !dev->writeable)
@@ -3514,7 +3512,7 @@ static int barrier_all_devices(struct btrfs_fs_info *info)
 
ret = write_dev_flush(dev, 0);
if (ret)
-   errors_send++;
+   dev->err_send = 1;
}
 
/* wait for all the barriers */
@@ -3522,7 +3520,7 @@ static int barrier_all_devices(struct btrfs_fs_info *info)
if (dev->missing)
continue;
if (!dev->bdev) {
-   errors_wait++;
+   dev->err_wait = 1;
continue;
}
if (!dev->in_fs_metadata || !dev->writeable)
@@ -3530,10 +3528,9 @@ static int barrier_all_devices(struct btrfs_fs_info *info)
 
ret = write_dev_flush(dev, 1);
if (ret)
-   errors_wait++;
+   dev->err_wait = 1;
}
-   if (errors_send > info->num_tolerated_disk_barrier_failures ||
-   errors_wait > info->num_tolerated_disk_barrier_failures)
+   if (btrfs_check_degradable(info, info->sb->s_flags) < 0)
return -EIO;
return 0;
 }
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index a351c5dd9e9b..d9cae4d7ba55 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -7078,8 +7078,12 @@ int btrfs_check_degradable(struct btrfs_fs_info *fs_info, unsigned flags)
btrfs_get_num_tolerated_disk_barrier_failures(
map->type);
for (i = 0; i < map->num_stripes; i++) {
-   if (map->stripes[i].dev->missing)
+   if (map->stripes[i].dev->missing ||
+   map->stripes[i].dev->err_wait ||
+   map->stripes[i].dev->err_send)
missing++;
+   map->stripes[i].dev->err_wait = 0;
+   map->stripes[i].dev->err_send = 0;
}
if (missing > max_tolerated) {
ret = -EIO;
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 351431a3f5aa..48ced5cc09e4 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -76,6 +76,10 @@ struct btrfs_device {
int can_discard;
int is_tgtdev_for_dev_replace;
 
+   /* for barrier_all_devices() check */
+   int err_send;
+   int err_wait;
+
 #ifdef __BTRFS_NEED_DEVICE_DATA_ORDERED
seqcount_t data_seqcount;
 #endif
-- 
2.7.0



[PATCH 06/13] btrfs: introduce BTRFS_FEATURE_INCOMPAT_SPARE_DEV

2016-04-12 Thread Anand Jain
From: Anand Jain 

Add BTRFS_FEATURE_INCOMPAT_SPARE_DEV (0x400) flag to identify
a spare device.

Along with this, the mount path checks for the flag and fails the
mount, as spare devices aren't mountable.
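
A small userspace sketch of the flag test, for illustration. The macro value
(bit 10, i.e. 0x400) is taken from the diff in this patch; the helper names
is_spare_device() and mount_allowed() are hypothetical, and the -1 return is a
placeholder for the error code the mount path actually uses.

```c
#include <stdint.h>

/* Bit 10 of the superblock incompat flags, i.e. 0x400, as in the diff. */
#define FEATURE_INCOMPAT_SPARE_DEV (1ULL << 10)

/* Returns 1 if the superblock flags mark the device as a spare. */
static int is_spare_device(uint64_t incompat_flags)
{
	return (incompat_flags & FEATURE_INCOMPAT_SPARE_DEV) != 0;
}

/*
 * Scan accepts spare devices, mount refuses them.
 * Returns 0 if the mount may proceed, -1 (placeholder error) otherwise.
 */
static int mount_allowed(uint64_t incompat_flags)
{
	return is_spare_device(incompat_flags) ? -1 : 0;
}
```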

Signed-off-by: Anand Jain 
Tested-by: Austin S. Hemmelgarn 
---
 fs/btrfs/ctree.h | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 7a6471269b34..a823ff7944f1 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -531,6 +531,7 @@ struct btrfs_super_block {
 #define BTRFS_FEATURE_INCOMPAT_RAID56  (1ULL << 7)
 #define BTRFS_FEATURE_INCOMPAT_SKINNY_METADATA (1ULL << 8)
 #define BTRFS_FEATURE_INCOMPAT_NO_HOLES(1ULL << 9)
+#define BTRFS_FEATURE_INCOMPAT_SPARE_DEV   (1ULL << 10)
 
 #define BTRFS_FEATURE_COMPAT_SUPP  0ULL
 #define BTRFS_FEATURE_COMPAT_SAFE_SET  0ULL
@@ -551,7 +552,8 @@ struct btrfs_super_block {
 BTRFS_FEATURE_INCOMPAT_RAID56 |\
 BTRFS_FEATURE_INCOMPAT_EXTENDED_IREF | \
 BTRFS_FEATURE_INCOMPAT_SKINNY_METADATA |   \
-BTRFS_FEATURE_INCOMPAT_NO_HOLES)
+BTRFS_FEATURE_INCOMPAT_NO_HOLES |  \
+BTRFS_FEATURE_INCOMPAT_SPARE_DEV)
 
 #define BTRFS_FEATURE_INCOMPAT_SAFE_SET\
(BTRFS_FEATURE_INCOMPAT_EXTENDED_IREF)
-- 
2.7.0



[PATCH 13/13] btrfs: check for failed device and hot replace

2016-04-12 Thread Anand Jain
From: Anand Jain 

This patch checks for a failed device and kicks off auto
replace. If the user decides to disable auto replace,
that can be done through a future sysfs or ioctl interface
that sets the fs_info->no_auto_replace parameter to 1.
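
The decision described above can be sketched in userspace as follows. This is
a simplified model, not the kernel code: struct dev_model and both functions
are hypothetical names, a plain array replaces the RCU-protected device list,
and devid 0 is treated as "no failed device", as the patch does.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a btrfs device. */
struct dev_model {
	uint64_t devid;
	int failed;
};

/* Returns the devid of the first failed device, or 0 if there is none. */
static uint64_t first_failed_devid(const struct dev_model *devs, size_t nr)
{
	for (size_t i = 0; i < nr; i++) {
		if (devs[i].failed)
			return devs[i].devid;
	}
	return 0;
}

/*
 * Returns the devid that auto replace should be started for,
 * or 0 when there is nothing to do or auto replace is disabled.
 */
static uint64_t auto_replace_target(const struct dev_model *devs, size_t nr,
				    int no_auto_replace)
{
	if (no_auto_replace)
		return 0;
	return first_failed_devid(devs, nr);
}
```

Only one failed device is picked per pass; a second failed device would be
handled on a later wake-up of the health thread, once the first replace is no
longer running.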

Signed-off-by: Anand Jain 
Tested-by: Austin S. Hemmelgarn 
---
 fs/btrfs/ctree.h   |  2 ++
 fs/btrfs/disk-io.c | 35 +++
 2 files changed, 37 insertions(+)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index e36200cf6ead..3262430d65a3 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1862,6 +1862,8 @@ struct btrfs_fs_info {
struct list_head pinned_chunks;
 
int creating_free_space_tree;
+
+   int no_auto_replace;
 };
 
 struct btrfs_subvolume_writers {
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 1deb5714cc3a..5c5c51319bec 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1876,6 +1876,39 @@ sleep:
return 0;
 }
 
+static int btrfs_recuperate(struct btrfs_root *root)
+{
+   int ret = 0;
+   u64 failed_devid = 0;
+   struct btrfs_device *device;
+   struct btrfs_fs_devices *fs_devices;
+
+   fs_devices = root->fs_info->fs_devices;
+
+   /* fixme: does it need device_list_mutex */
+   mutex_lock(&fs_devices->device_list_mutex);
+   rcu_read_lock();
+   list_for_each_entry_rcu(device,
+   &fs_devices->devices, dev_list) {
+   if (device->failed) {
+   failed_devid = device->devid;
+   break;
+   }
+   }
+   rcu_read_unlock();
+   mutex_unlock(&fs_devices->device_list_mutex);
+
+   /*
+* We are using the replace code which should be interrupt-able
+* during unmount, and as of now there is no user land stop
+* request that we support and this will run until its complete
+*/
+   if (failed_devid && !root->fs_info->no_auto_replace)
+   ret = btrfs_auto_replace_start(root, failed_devid);
+
+   return ret;
+}
+
 /*
  * returns:
  * < 0 : Check didn't run, std error
@@ -1951,6 +1984,8 @@ static int health_kthread(void *arg)
/* Check devices health */
btrfs_update_devices_health(root);
 
+   btrfs_recuperate(root);
+
mutex_unlock(&root->fs_info->health_mutex);
 
 sleep:
-- 
2.7.0



[PATCH 1/1] btrfs: fix lock dep warning move scratch super outside of chunk_mutex

2016-04-12 Thread Anand Jain
Move scratch super outside of the chunk lock to avoid below
lockdep warning. The better place to scratch super is in
the function btrfs_rm_dev_replace_free_srcdev() just before
free_device, which is outside of the chunk lock as well.

To reproduce:
  (fresh boot)
  mkfs.btrfs -f -draid5 -mraid5 /dev/sdc /dev/sdd /dev/sde
  mount /dev/sdc /btrfs
  dd if=/dev/zero of=/btrfs/tf1 bs=4096 count=100
  (get devmgt from https://github.com/asj/devmgt.git)
  devmgt detach /dev/sde
  dd if=/dev/zero of=/btrfs/tf1 bs=4096 count=100
  sync
  btrfs replace start -Brf 3 /dev/sdf /btrfs <--
  devmgt attach host7

==
[ INFO: possible circular locking dependency detected ]
4.6.0-rc2asj+ #1 Not tainted
---

btrfs/2174 is trying to acquire lock:
(sb_writers){.+.+.+}, at:
[] __sb_start_write+0xb4/0xf0

but task is already holding lock:
(&fs_info->chunk_mutex){+.+.+.}, at:
[] btrfs_dev_replace_finishing+0x145/0x980 [btrfs]

which lock already depends on the new lock.

Chain exists of:
sb_writers --> &fs_devs->device_list_mutex --> &fs_info->chunk_mutex

Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&fs_info->chunk_mutex);
                               lock(&fs_devs->device_list_mutex);
                               lock(&fs_info->chunk_mutex);
  lock(sb_writers);

*** DEADLOCK ***

-> #0 (sb_writers){.+.+.+}:
[] __lock_acquire+0x1bc5/0x1ee0
[] lock_acquire+0xbe/0x210
[] percpu_down_read+0x4a/0xa0
[] __sb_start_write+0xb4/0xf0
[] mnt_want_write+0x24/0x50
[] path_openat+0x952/0x1190
[] do_filp_open+0x91/0x100
[] file_open_name+0xfc/0x140
[] filp_open+0x33/0x60
[] update_dev_time+0x16/0x40 [btrfs]
[] btrfs_scratch_superblocks+0x5d/0xb0 [btrfs]
[] btrfs_rm_dev_replace_remove_srcdev+0xae/0xd0 [btrfs]
[] btrfs_dev_replace_finishing+0x4b5/0x980 [btrfs]
[] btrfs_dev_replace_start+0x358/0x530 [btrfs]

Signed-off-by: Anand Jain 
---
 fs/btrfs/volumes.c | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 31bd791d6506..9d72dabdddfc 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1968,11 +1968,8 @@ void btrfs_rm_dev_replace_remove_srcdev(struct btrfs_fs_info *fs_info,
if (srcdev->missing)
fs_devices->missing_devices--;
 
-   if (srcdev->writeable) {
+   if (srcdev->writeable)
fs_devices->rw_devices--;
-   /* zero out the old super if it is writable */
-   btrfs_scratch_superblocks(srcdev->bdev, srcdev->name->str);
-   }
 
if (srcdev->bdev)
fs_devices->open_devices--;
@@ -1983,6 +1980,10 @@ void btrfs_rm_dev_replace_free_srcdev(struct btrfs_fs_info *fs_info,
 {
struct btrfs_fs_devices *fs_devices = srcdev->fs_devices;
 
+   if (srcdev->writeable) {
+   /* zero out the old super if it is writable */
+   btrfs_scratch_superblocks(srcdev->bdev, srcdev->name->str);
+   }
call_rcu(&srcdev->rcu, free_device);
 
/*
-- 
2.7.0





enospace regression in 4.4

2016-04-12 Thread Julian Taylor
hi,
I have a system with two filesystems which are both affected by the
notorious ENOSPC bug that occurs even when there is plenty of unallocated
space available. The system is a raid0 on two 900 GiB disks and an iSCSI
single/dup 1.4TiB volume.
To deal with the problem I use a cronjob that uses fallocate to give me
advance notice of the issue so I can apply the only workaround that
works for me, which is to shrink the fs to the minimum and grow it again.
This has worked fine for a couple of months.

I have now updated from 4.2 to 4.4.6, and it appears my cronjob actually
triggers an immediate ENOSPC in the balance after removing the
fallocated file, and the shrink/resize workaround does not work anymore.
The filesystem is mounted with enospc_debug, but that just says "2 enospc in
balance". Nothing else useful is in the log.

I had to revert back to 4.2 to get the system running again so it is
currently not available for more testing, but I may be able to do more
tests if required in future.

The cronjob does this once a day:

#!/bin/bash
sync

check() {
  date
  mnt=$1
  time btrfs fi balance start -mlimit=2 $mnt
  btrfs fi balance start -dusage=5 $mnt
  sync
  freespace=$(df -B1 $mnt | tail -n 1 | awk '{print $4 - 50*1024*1024*1024}')
  fallocate -l $freespace $mnt/falloc
  /usr/sbin/filefrag $mnt/falloc
  rm -f $mnt/falloc
  btrfs fi balance start -dusage=0 $mnt

  time btrfs fi balance start -mlimit=2 $mnt
  time btrfs fi balance start -dlimit=10 $mnt
  date
}

check /data
check /data/nas
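
The free-space arithmetic in the cron job above can be isolated into a small
helper. This is only a minimal sketch: the function name falloc_size and the
clamp to zero are assumptions that are not in the original script, which would
hand fallocate a negative length if less than 50 GiB were available.

```shell
#!/bin/bash
# Hypothetical helper, not part of the original cron job.
# Prints how many bytes to fallocate: available bytes minus a reserve
# (default 50 GiB, as in the script above), clamped to zero.
falloc_size() {
  local avail=$1
  local reserve=${2:-$((50 * 1024 * 1024 * 1024))}
  local size=$((avail - reserve))
  if ((size > 0)); then
    echo "$size"
  else
    echo 0
  fi
}

# Possible use inside check(), reusing the df pipeline from the script:
#   avail=$(df -B1 "$mnt" | tail -n 1 | awk '{print $4}')
#   size=$(falloc_size "$avail")
#   if ((size > 0)); then fallocate -l "$size" "$mnt/falloc"; fi
```

Keeping the subtraction in shell arithmetic also avoids depending on how awk
formats very large numbers.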


btrfs info:


 ~ $ btrfs --version
btrfs-progs v4.4
sagan5 ~ $ sudo btrfs fi show
Label: none  uuid: e4aef349-7a56-4287-93b1-79233e016aae
    Total devices 2 FS bytes used 898.18GiB
    devid    1 size 880.00GiB used 473.03GiB path /dev/mapper/data-linear1
    devid    2 size 880.00GiB used 473.03GiB path /dev/mapper/data-linear2

Label: none  uuid: 14040f9b-53c8-46cf-be6b-35de746c3153
    Total devices 1 FS bytes used 557.19GiB
    devid    1 size 1.36TiB used 585.95GiB path /dev/sdd

 ~ $ sudo btrfs fi df /data
Data, RAID0: total=938.00GiB, used=895.09GiB
System, RAID1: total=32.00MiB, used=112.00KiB
Metadata, RAID1: total=4.00GiB, used=3.10GiB
GlobalReserve, single: total=512.00MiB, used=0.00B
sagan5 ~ $ sudo btrfs fi usage /data
Overall:
    Device size:                   1.72TiB
    Device allocated:            946.06GiB
    Device unallocated:          813.94GiB
    Device missing:                  0.00B
    Used:                        901.27GiB
    Free (estimated):            856.85GiB  (min: 449.88GiB)
    Data ratio:                       1.00
    Metadata ratio:                   2.00
    Global reserve:              512.00MiB  (used: 0.00B)

Data,RAID0: Size:938.00GiB, Used:895.09GiB
   /dev/dm-1                 469.00GiB
   /dev/mapper/data-linear1  469.00GiB

Metadata,RAID1: Size:4.00GiB, Used:3.09GiB
   /dev/dm-1                   4.00GiB
   /dev/mapper/data-linear1    4.00GiB

System,RAID1: Size:32.00MiB, Used:112.00KiB
   /dev/dm-1                  32.00MiB
   /dev/mapper/data-linear1   32.00MiB

Unallocated:
   /dev/dm-1                 406.97GiB
   /dev/mapper/data-linear1  406.97GiB


[PATCH] fstests: btrfs/091: Disable compress to avoid output dismatch

2016-04-12 Thread Qu Wenruo
If btrfs/091 is run with the "-o compress=lzo" mount option, the test case
will fail, as compression makes extents much smaller on disk, making the
output differ from the golden output.

As this test case only tests qgroup, not compression, disable
compression explicitly in the test case.

Signed-off-by: Qu Wenruo 
---
 tests/btrfs/091 | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/tests/btrfs/091 b/tests/btrfs/091
index 16b5c16..ad9f71c 100755
--- a/tests/btrfs/091
+++ b/tests/btrfs/091
@@ -60,7 +60,8 @@ run_check _scratch_mkfs "--nodesize $NODESIZE"
 # inode cache will also take space in fs tree, disable them to get consistent
 # result.
 # discard error output since we will check return value manually.
-_scratch_mount "-o noinode_cache" 2> /dev/null
+# also disable all compression, or output will mismatch with golden output
+_scratch_mount "-o noinode_cache,compress=no,compress-force=no" 2> /dev/null
 
 # Check for old kernel which doesn't support 'noinode_cache' mount option
 if [ $? -ne 0 ]; then
-- 
1.8.3.1





[PATCH v2] btrfs: qgroup: Fix qgroup accounting when creating snapshot

2016-04-12 Thread Qu Wenruo
Current btrfs qgroup design implies a requirement that after calling
btrfs_qgroup_account_extents() there must be a commit root switch.

Normally this is OK, as btrfs_qgroup_account_extents() is only called
inside btrfs_commit_transaction(), just before commit_cowonly_roots().

However, there is an exception in create_pending_snapshot(), which calls
btrfs_qgroup_account_extents() without any commit root switch.

In case of creating a snapshot whose parent root is itself (create a
snapshot of fs tree), it will corrupt qgroup by the following trace:
(skipped unrelated data)
==
btrfs_qgroup_account_extent: bytenr = 29786112, num_bytes = 16384, nr_old_roots = 0, nr_new_roots = 1
qgroup_update_counters: qgid = 5, cur_old_count = 0, cur_new_count = 1, rfer = 0, excl = 0
qgroup_update_counters: qgid = 5, cur_old_count = 0, cur_new_count = 1, rfer = 16384, excl = 16384
btrfs_qgroup_account_extent: bytenr = 29786112, num_bytes = 16384, nr_old_roots = 0, nr_new_roots = 0
==

The problem is that in the first qgroup_account_extent(), the
nr_new_roots of the extent is 1, which means its reference count was
increased, and qgroup increased its rfer and excl.

But in the second qgroup_account_extent(), its reference count was
decreased, and between these two calls there was no root switch.
This leads to the same nr_old_roots, so the extent is simply ignored by
qgroup, which means it is wrongly accounted.
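
The effect described above can be reproduced with a toy model. This is only
a sketch and not btrfs code: the names account_extent, switch_commit_root and
simulate are hypothetical, and the model tracks a single extent and a single
rfer counter, with "old roots" read from a committed value that only changes
on a root switch.

```shell
#!/bin/bash
# Toy model of the accounting problem; NOT btrfs code.
# committed_refs stands for what a commit-root lookup reports (nr_old_roots).
# pending_refs stands for the current state (nr_new_roots of the last call).
rfer=0
committed_refs=0
pending_refs=0

# $1 = num_bytes, $2 = nr_new_roots
account_extent() {
  local num_bytes=$1 nr_new=$2 nr_old=$committed_refs
  if ((nr_old == 0 && nr_new > 0)); then
    rfer=$((rfer + num_bytes))      # extent became referenced
  elif ((nr_old > 0 && nr_new == 0)); then
    rfer=$((rfer - num_bytes))      # extent lost its last reference
  fi                                # old == new: nothing to account
  pending_refs=$nr_new
}

# Make the current state visible to later "old roots" lookups.
switch_commit_root() {
  committed_refs=$pending_refs
}

# $1 = "switch" or "noswitch"; prints the final rfer.
simulate() {
  rfer=0; committed_refs=0; pending_refs=0
  account_extent 16384 1            # reference added
  if [ "$1" = "switch" ]; then
    switch_commit_root
  fi
  account_extent 16384 0            # reference dropped
  echo "$rfer"
}
```

Without the switch, the second call sees old = 0 and new = 0, so the 16384
bytes stay accounted although the extent is no longer referenced. With the
switch, the second call sees old = 1 and new = 0, and rfer returns to 0.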

Fix it by calling commit_cowonly_roots() after qgroup_account_extent() in
create_pending_snapshot(), with the needed preparation.

Reported-by: Mark Fasheh 
Signed-off-by: Qu Wenruo 
---
changelog:
v2:
  Fix a soft lockup caused by missing switch_commit_root() call.
  Fix a warning caused by dirty-but-not-committed root.

Note:
  This may be the dirtiest hack I have ever done.
  There are already several different checks to decide whether a fs root
  should be updated, from root->last_trans to root->commit_root ==
  root->node.

  With this patch, we must switch the root of at least the related fs tree
  and the extent tree to allow qgroup to call
  btrfs_qgroup_account_extents().
  But this will break some transid checks, as transid is already
  updated to the current transid.
  (maybe we need a special sub-transid for qgroup use only?)

  As long as qgroup uses commit_root to determine old_roots,
  there is no better idea though.
---
 fs/btrfs/transaction.c | 96 +-
 1 file changed, 71 insertions(+), 25 deletions(-)

diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 43885e5..0f299a56 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -311,12 +311,13 @@ loop:
  * when the transaction commits
  */
 static int record_root_in_trans(struct btrfs_trans_handle *trans,
-  struct btrfs_root *root)
+  struct btrfs_root *root,
+  int force)
 {
-   if (test_bit(BTRFS_ROOT_REF_COWS, &root->state) &&
-   root->last_trans < trans->transid) {
+   if ((test_bit(BTRFS_ROOT_REF_COWS, &root->state) &&
+   root->last_trans < trans->transid) || force) {
WARN_ON(root == root->fs_info->extent_root);
-   WARN_ON(root->commit_root != root->node);
+   WARN_ON(root->commit_root != root->node && !force);
 
/*
 * see below for IN_TRANS_SETUP usage rules
@@ -331,7 +332,7 @@ static int record_root_in_trans(struct btrfs_trans_handle *trans,
smp_wmb();
 
spin_lock(&root->fs_info->fs_roots_radix_lock);
-   if (root->last_trans == trans->transid) {
+   if (root->last_trans == trans->transid && !force) {
spin_unlock(&root->fs_info->fs_roots_radix_lock);
return 0;
}
@@ -402,7 +403,7 @@ int btrfs_record_root_in_trans(struct btrfs_trans_handle *trans,
return 0;
 
mutex_lock(&root->fs_info->reloc_mutex);
-   record_root_in_trans(trans, root);
+   record_root_in_trans(trans, root, 0);
mutex_unlock(&root->fs_info->reloc_mutex);
 
return 0;
@@ -1383,7 +1384,7 @@ static noinline int create_pending_snapshot(struct btrfs_trans_handle *trans,
dentry = pending->dentry;
parent_inode = pending->dir;
parent_root = BTRFS_I(parent_inode)->root;
-   record_root_in_trans(trans, parent_root);
+   record_root_in_trans(trans, parent_root, 0);
 
cur_time = current_fs_time(parent_inode->i_sb);
 
@@ -1420,7 +1421,7 @@ static noinline int create_pending_snapshot(struct btrfs_trans_handle *trans,
goto fail;
}
 
-   record_root_in_trans(trans, root);
+   record_root_in_trans(trans, root, 0);
btrfs_set_root_last_snapshot(&root->root_item, trans->transid);
memcpy(new_root_item, &root->root_item, sizeof(*new_root_item));

Re: KERNEL PANIC + CORRUPTED BTRFS?

2016-04-12 Thread lenovomi
Hi Chris,

1) I tried mounting with ro,recovery and again got a kernel panic:
https://bpaste.net/show/895089db279a
https://bpaste.net/show/f3cf84532e26

2) I tried to execute restore, but these are the results:

https://bpaste.net/show/191e87b20a54


3) Should I run repair?


thanks

On Tue, Apr 12, 2016 at 5:43 AM, Chris Murphy  wrote:
> On Mon, Apr 11, 2016 at 3:51 PM, lenovomi  wrote:
>> Hi,
>>
>> i didnt try mount -o ro, when i tried to mount it via esata i got
>> kernel panic immediately. Then i conntected enclosure with drives via
>> usb and tried to mount it :
>
> OK so try '-o ro,recovery' and report back what you get.
>
>
>
>>
>> https://bpaste.net/show/641ab9172539
>> plugged via usb -> mount randomly one of the drive mount /dev/sda /mnt/brtfs
>>
>> I was told on irc channel that i should not run btrfs check and if so
>> i should run it as
>> btrfs check --repair --init-extent-tree
>>
>>
>> Also there was recommendation to run btrfs restore before repair.
>
> Did you use btrfs restore?
> https://btrfs.wiki.kernel.org/index.php/Restore
>
> And did you use --repair --init-extent-tree? I don't recommend it
> until you use restore as well.
>
>
>
>> Still not clear what should i do as next step.
>
> 1. mount with -o ro,recovery  and get important data backed up. It
> sounds like you don't have a backup?
>
> 2. If that doesn't work, use btrfs restore. It's tedious but at least
> you can update your backup.
>
> 3. Next try btrfs check without repair. There's some nuance whether
> it's better to use init-extent-tree or try zeroing the log. But don't
> use repair until there's a current backup with 1 or 2.
>
> --
> Chris Murphy
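
The three steps quoted above can be written down as a dry-run plan. This is
only a sketch: the function name recovery_plan and the backup directory
/mnt/backup are hypothetical, while /dev/sda and /mnt/brtfs are taken from
earlier in the thread. It prints the commands instead of running them, and it
deliberately leaves out --repair.

```shell
#!/bin/bash
# Dry-run sketch: prints the commands for the three suggested steps,
# in order, without executing anything.
# $1 = device, $2 = mount point, $3 = directory for restored files (hypothetical)
recovery_plan() {
  local dev=$1 mnt=$2 dest=$3
  # Step 1: read-only mount with the recovery option, then copy data off.
  echo "mount -o ro,recovery $dev $mnt"
  # Step 2: if the mount fails, pull files off the unmounted device.
  echo "btrfs restore $dev $dest"
  # Step 3: check without --repair, so nothing is written to the device.
  echo "btrfs check $dev"
}

recovery_plan /dev/sda /mnt/brtfs /mnt/backup
```

Each later step is only worth trying once the earlier one has failed or a
current backup exists.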


Lockdep warning when running btrfs/114

2016-04-12 Thread Qu Wenruo

Hi

When debugging the qgroup problem, I found the following lockdep warning
output when running btrfs/114.

It seems to be easier to trigger when running all qgroup tests in a row
(-g qgroup).

The source is the integration-4.6 branch *WITHOUT* my qgroup fix patch
v2 (not submitted yet).


I am just posting this as it seems to be related to a lot of infrastructure,
such as delayed_inode, delayed_refs, backref and qgroups.


Maybe someone has a better idea of what's going wrong and can fix it faster.

Thanks,
Qu

=
[ INFO: possible irq lock inversion dependency detected ]
4.5.0-rc6+ #6 Tainted: G   O
-
kswapd0/546 just changed the state of lock:
 (&delayed_node->mutex){+.+.-.}, at: [] __btrfs_release_delayed_node+0x3a/0x200 [btrfs]

but this lock took another, RECLAIM_FS-unsafe lock in the past:
 (pcpu_alloc_mutex){+.+.+.}

and interrupts could create inverse lock ordering between them.


other info that might help us debug this:
Chain exists of:
  &delayed_node->mutex --> &found->groups_sem --> pcpu_alloc_mutex

 Possible interrupt unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(pcpu_alloc_mutex);
                               local_irq_disable();
                               lock(&delayed_node->mutex);
                               lock(&found->groups_sem);
  <Interrupt>
    lock(&delayed_node->mutex);

 *** DEADLOCK ***

2 locks held by kswapd0/546:
 #0:  (shrinker_rwsem){..}, at: [] shrink_slab.part.63.constprop.82+0x3d/0x500
 #1:  (&type->s_umount_key#30){+.}, at: [] trylock_super+0x16/0x50


the shortest dependencies between 2nd lock and 1st lock:
   -> (pcpu_alloc_mutex){+.+.+.} ops: 1201 {
  HARDIRQ-ON-W at:
  [] __lock_acquire+0xb46/0x1cf0
  [] lock_acquire+0xcd/0x200
  [] mutex_lock_nested+0x71/0x3b0
  [] pcpu_alloc+0x42e/0x620
  [] __alloc_percpu+0x10/0x20
  [] __kmem_cache_create+0x36b/0x4c0
  [] kmem_cache_create+0x11f/0x220
  [] debug_objects_mem_init+0x30/0x1ec
  [] start_kernel+0x3ca/0x491
  [] x86_64_start_reservations+0x2a/0x2c
  [] x86_64_start_kernel+0xea/0xed
  SOFTIRQ-ON-W at:
  [] __lock_acquire+0x9f7/0x1cf0
  [] lock_acquire+0xcd/0x200
  [] mutex_lock_nested+0x71/0x3b0
  [] pcpu_alloc+0x42e/0x620
  [] __alloc_percpu+0x10/0x20
  [] __kmem_cache_create+0x36b/0x4c0
  [] kmem_cache_create+0x11f/0x220
  [] debug_objects_mem_init+0x30/0x1ec
  [] start_kernel+0x3ca/0x491
  [] x86_64_start_reservations+0x2a/0x2c
  [] x86_64_start_kernel+0xea/0xed
  RECLAIM_FS-ON-W at:
 [] mark_held_locks+0x71/0x90
 [] lockdep_trace_alloc+0xb1/0x100
 [] __kmalloc+0x4e/0x280
 [] pcpu_mem_zalloc+0x32/0x60
 [] pcpu_create_chunk+0x11/0x120
 [] pcpu_balance_workfn+0x435/0x5a0
 [] process_one_work+0x1fa/0x650
 [] worker_thread+0x126/0x4a0
 [] kthread+0xed/0x110
 [] ret_from_fork+0x3f/0x70
  INITIAL USE at:
 [] __lock_acquire+0x3b3/0x1cf0
 [] lock_acquire+0xcd/0x200
 [] mutex_lock_nested+0x71/0x3b0
 [] pcpu_alloc+0x42e/0x620
 [] __alloc_percpu+0x10/0x20
 [] __kmem_cache_create+0x36b/0x4c0
 [] create_boot_cache+0x67/0x91
 [] kmem_cache_init+0x4b/0xf3
 [] start_kernel+0x251/0x491
 [] x86_64_start_reservations+0x2a/0x2c
 [] x86_64_start_kernel+0xea/0xed
}
... key  at: [] pcpu_alloc_mutex+0x70/0xa0
... acquired at:
   [] lock_acquire+0xcd/0x200
   [] mutex_lock_nested+0x71/0x3b0
   [] pcpu_alloc+0x42e/0x620
   [] __alloc_percpu_gfp+0xd/0x10
   [] __percpu_counter_init+0x55/0xe0
   [] btrfs_init_fs_root+0x9b/0x1d0 [btrfs]
   [] btrfs_get_fs_root+0xc2/0x260 [btrfs]
   [] __resolve_indirect_refs+0x121/0x7d0 [btrfs]
   [] find_parent_nodes+0x39d/0x760 [btrfs]
   [] __btrfs_find_all_roots+0xbe/0x130 [btrfs]
   [] btrfs_find_all_roots+0x50/0x60 [btrfs]
   [] btrfs_qgroup_prepare_account_extents+0x53/0x90 [btrfs]
   [] btrfs_commit_transaction+0x490/0xb50 [btrfs]
   [] btrfs_sync_fs+0x7a/0x1d0