Re: So, does btrfs check lowmem take days? weeks?

2018-06-28 Thread Su Yue




On 06/29/2018 01:28 PM, Marc MERLIN wrote:

On Fri, Jun 29, 2018 at 01:07:20PM +0800, Qu Wenruo wrote:

lowmem repair seems to be going still, but it's been days and -p seems
to do absolutely nothing.


I'm afraid you hit a bug in the lowmem repair code.
By all means, --repair shouldn't really be used unless you're pretty
sure the problem is something btrfs check can handle.

That's also why --repair is still marked as dangerous.
Especially when it's combined with experimental lowmem mode.


Understood, but btrfs got corrupted (by itself or not, I don't know)
I cannot mount the filesystem read/write
I cannot btrfs check --repair it since that code will kill my machine
What do I have left?


My filesystem is "only" 10TB or so, albeit with a lot of files.


Unless you have tons of snapshots and reflinked (deduped) files, it
shouldn't take so long.


I may have a fair amount.
gargamel:~# btrfs check --mode=lowmem --repair -p /dev/mapper/dshelf2
enabling repair mode
WARNING: low-memory mode repair support is only partial
Checking filesystem on /dev/mapper/dshelf2
UUID: 0f1a0c9f-4e54-4fa7-8736-fd50818ff73d
Fixed 0 roots.
ERROR: extent[84302495744, 69632] referencer count mismatch (root: 21872, 
owner: 374857, offset: 3407872) wanted: 3, have: 4
Created new chunk [18457780224000 1073741824]
Delete backref in extent [84302495744 69632]
ERROR: extent[84302495744, 69632] referencer count mismatch (root: 22911, 
owner: 374857, offset: 3407872) wanted: 3, have: 4
Delete backref in extent [84302495744 69632]
ERROR: extent[125712527360, 12214272] referencer count mismatch (root: 21872, 
owner: 374857, offset: 114540544) wanted: 181, have: 240
Delete backref in extent [125712527360 12214272]
ERROR: extent[125730848768, 5111808] referencer count mismatch (root: 21872, 
owner: 374857, offset: 126754816) wanted: 68, have: 115
Delete backref in extent [125730848768 5111808]
ERROR: extent[125730848768, 5111808] referencer count mismatch (root: 22911, 
owner: 374857, offset: 126754816) wanted: 68, have: 115
Delete backref in extent [125730848768 5111808]
ERROR: extent[125736914944, 6037504] referencer count mismatch (root: 21872, 
owner: 374857, offset: 131866624) wanted: 115, have: 143
Delete backref in extent [125736914944 6037504]
ERROR: extent[125736914944, 6037504] referencer count mismatch (root: 22911, 
owner: 374857, offset: 131866624) wanted: 115, have: 143
Delete backref in extent [125736914944 6037504]
ERROR: extent[129952120832, 20242432] referencer count mismatch (root: 21872, 
owner: 374857, offset: 148234240) wanted: 302, have: 431
Delete backref in extent [129952120832 20242432]
ERROR: extent[129952120832, 20242432] referencer count mismatch (root: 22911, 
owner: 374857, offset: 148234240) wanted: 356, have: 433
Delete backref in extent [129952120832 20242432]
ERROR: extent[134925357056, 11829248] referencer count mismatch (root: 21872, 
owner: 374857, offset: 180371456) wanted: 161, have: 240
Delete backref in extent [134925357056 11829248]
ERROR: extent[134925357056, 11829248] referencer count mismatch (root: 22911, 
owner: 374857, offset: 180371456) wanted: 162, have: 240
Delete backref in extent [134925357056 11829248]
ERROR: extent[147895111680, 12345344] referencer count mismatch (root: 21872, 
owner: 374857, offset: 192200704) wanted: 170, have: 249
Delete backref in extent [147895111680 12345344]
ERROR: extent[147895111680, 12345344] referencer count mismatch (root: 22911, 
owner: 374857, offset: 192200704) wanted: 172, have: 251
Delete backref in extent [147895111680 12345344]
ERROR: extent[150850146304, 17522688] referencer count mismatch (root: 21872, 
owner: 374857, offset: 217653248) wanted: 348, have: 418
Delete backref in extent [150850146304 17522688]
ERROR: extent[156909494272, 55320576] referencer count mismatch (root: 22911, 
owner: 374857, offset: 235175936) wanted: 555, have: 1449
Deleted root 2 item[156909494272, 178, 5476627808561673095]
ERROR: extent[156909494272, 55320576] referencer count mismatch (root: 21872, 
owner: 374857, offset: 235175936) wanted: 556, have: 1452
Deleted root 2 item[156909494272, 178, 7338474132555182983]
ERROR: file extent[374857 235184128] root 21872 owner 21872 backref lost
Add one extent data backref [156909494272 55320576]
ERROR: file extent[374857 235184128] root 22911 owner 22911 backref lost
Add one extent data backref [156909494272 55320576]


My bad.
It's most probably a bug in the extent checking of lowmem mode, which
was reported by Chris too.
The extent check was wrong, so the repair did the wrong things.

I have figured out that the bug is that lowmem check can't deal with shared
tree blocks in a reloc tree. The fix is simple; you can try the following repo:


https://github.com/Damenly/btrfs-progs/tree/tmp1

Please run lowmem check without --repair first to be sure whether
your filesystem is fine.
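
Something like this should do (a rough sketch; the autotools build steps
are assumed from a stock btrfs-progs checkout, and the device path is the
one from your earlier run):

  git clone -b tmp1 https://github.com/Damenly/btrfs-progs.git
  cd btrfs-progs
  ./autogen.sh && ./configure && make
  # read-only lowmem check, no --repair
  ./btrfs check --mode=lowmem /dev/mapper/dshelf2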

Though the bug and its symptoms are clear enough, before sending my patch
I have to make a test image. I have spent a week studying btrfs balance
but it seems a little 

Re: So, does btrfs check lowmem take days? weeks?

2018-06-28 Thread Qu Wenruo


On 2018年06月29日 13:28, Marc MERLIN wrote:
> On Fri, Jun 29, 2018 at 01:07:20PM +0800, Qu Wenruo wrote:
>>> lowmem repair seems to be going still, but it's been days and -p seems
>>> to do absolutely nothing.
>>
>> I'm afraid you hit a bug in the lowmem repair code.
>> By all means, --repair shouldn't really be used unless you're pretty
>> sure the problem is something btrfs check can handle.
>>
>> That's also why --repair is still marked as dangerous.
>> Especially when it's combined with experimental lowmem mode.
> 
> Understood, but btrfs got corrupted (by itself or not, I don't know)
> I cannot mount the filesystem read/write
> I cannot btrfs check --repair it since that code will kill my machine
> What do I have left?

Just run a normal btrfs check, and post the output.
If a normal check eats up all your memory, use btrfs check --mode=lowmem.

--repair should be considered the last resort.
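
In other words, escalate roughly like this (a sketch; both check modes
stay read-only as long as --repair is not given):

  # original mode first, read-only
  btrfs check /dev/mapper/dshelf2
  # if that eats all your memory, retry read-only in lowmem mode
  btrfs check --mode=lowmem /dev/mapper/dshelf2
  # only once the reported errors are understood, consider --repair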

> 
>>> My filesystem is "only" 10TB or so, albeit with a lot of files.
>>
>> Unless you have tons of snapshots and reflinked (deduped) files, it
>> shouldn't take so long.
> 
> I may have a fair amount.
> gargamel:~# btrfs check --mode=lowmem --repair -p /dev/mapper/dshelf2 
> enabling repair mode
> WARNING: low-memory mode repair support is only partial
> Checking filesystem on /dev/mapper/dshelf2
> UUID: 0f1a0c9f-4e54-4fa7-8736-fd50818ff73d
> Fixed 0 roots.
> ERROR: extent[84302495744, 69632] referencer count mismatch (root: 21872, 
> owner: 374857, offset: 3407872) wanted: 3, have: 4
> Created new chunk [18457780224000 1073741824]
> Delete backref in extent [84302495744 69632]
> ERROR: extent[84302495744, 69632] referencer count mismatch (root: 22911, 
> owner: 374857, offset: 3407872) wanted: 3, have: 4
> Delete backref in extent [84302495744 69632]
> ERROR: extent[125712527360, 12214272] referencer count mismatch (root: 21872, 
> owner: 374857, offset: 114540544) wanted: 181, have: 240
> Delete backref in extent [125712527360 12214272]
> ERROR: extent[125730848768, 5111808] referencer count mismatch (root: 21872, 
> owner: 374857, offset: 126754816) wanted: 68, have: 115
> Delete backref in extent [125730848768 5111808]
> ERROR: extent[125730848768, 5111808] referencer count mismatch (root: 22911, 
> owner: 374857, offset: 126754816) wanted: 68, have: 115
> Delete backref in extent [125730848768 5111808]
> ERROR: extent[125736914944, 6037504] referencer count mismatch (root: 21872, 
> owner: 374857, offset: 131866624) wanted: 115, have: 143
> Delete backref in extent [125736914944 6037504]
> ERROR: extent[125736914944, 6037504] referencer count mismatch (root: 22911, 
> owner: 374857, offset: 131866624) wanted: 115, have: 143
> Delete backref in extent [125736914944 6037504]
> ERROR: extent[129952120832, 20242432] referencer count mismatch (root: 21872, 
> owner: 374857, offset: 148234240) wanted: 302, have: 431
> Delete backref in extent [129952120832 20242432]
> ERROR: extent[129952120832, 20242432] referencer count mismatch (root: 22911, 
> owner: 374857, offset: 148234240) wanted: 356, have: 433
> Delete backref in extent [129952120832 20242432]
> ERROR: extent[134925357056, 11829248] referencer count mismatch (root: 21872, 
> owner: 374857, offset: 180371456) wanted: 161, have: 240
> Delete backref in extent [134925357056 11829248]
> ERROR: extent[134925357056, 11829248] referencer count mismatch (root: 22911, 
> owner: 374857, offset: 180371456) wanted: 162, have: 240
> Delete backref in extent [134925357056 11829248]
> ERROR: extent[147895111680, 12345344] referencer count mismatch (root: 21872, 
> owner: 374857, offset: 192200704) wanted: 170, have: 249
> Delete backref in extent [147895111680 12345344]
> ERROR: extent[147895111680, 12345344] referencer count mismatch (root: 22911, 
> owner: 374857, offset: 192200704) wanted: 172, have: 251
> Delete backref in extent [147895111680 12345344]
> ERROR: extent[150850146304, 17522688] referencer count mismatch (root: 21872, 
> owner: 374857, offset: 217653248) wanted: 348, have: 418
> Delete backref in extent [150850146304 17522688]
> ERROR: extent[156909494272, 55320576] referencer count mismatch (root: 22911, 
> owner: 374857, offset: 235175936) wanted: 555, have: 1449
> Deleted root 2 item[156909494272, 178, 5476627808561673095]
> ERROR: extent[156909494272, 55320576] referencer count mismatch (root: 21872, 
> owner: 374857, offset: 235175936) wanted: 556, have: 1452
> Deleted root 2 item[156909494272, 178, 7338474132555182983]
> ERROR: file extent[374857 235184128] root 21872 owner 21872 backref lost
> Add one extent data backref [156909494272 55320576]
> ERROR: file extent[374857 235184128] root 22911 owner 22911 backref lost
> Add one extent data backref [156909494272 55320576]
> 
> The last two ERROR lines took over a day to get generated, so I'm not sure if 
> it's still working, but just slowly.

OK, that explains something.

One extent is referenced hundreds of times, no wonder it takes a long time.

Just one tip here, there are really too many 

Re: So, does btrfs check lowmem take days? weeks?

2018-06-28 Thread Marc MERLIN
On Fri, Jun 29, 2018 at 01:35:06PM +0800, Su Yue wrote:
> > It's hard to estimate, especially when every cross check involves a lot
> > of disk IO.
> > 
> > But at least, we could add such indicator to show we're doing something.
> > Maybe we can account all roots in root tree first, before checking a
> tree, report i/num_roots. So users can see whether the check is doing
> something meaningful or is stuck in a dead loop.

Sounds reasonable.
Do you want to submit something to git master for btrfs-progs, so I can pull
it and just run my btrfs check again?

In the meantime, how sane does the output I just posted look?

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems 
   what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/   | PGP 7F55D5F27AAF9D08


Re: So, does btrfs check lowmem take days? weeks?

2018-06-28 Thread Su Yue




On 06/29/2018 01:07 PM, Qu Wenruo wrote:



On 2018年06月29日 12:27, Marc MERLIN wrote:

Regular btrfs check --repair has a nice progress option. It wasn't
perfect, but it showed something.

But then it also takes all your memory quicker than the linux kernel can
defend itself and reliably completely kills my 32GB server quicker than
it can OOM anything.

lowmem repair seems to be going still, but it's been days and -p seems
to do absolutely nothing.


I'm afraid you hit a bug in the lowmem repair code.
By all means, --repair shouldn't really be used unless you're pretty
sure the problem is something btrfs check can handle.

That's also why --repair is still marked as dangerous.
Especially when it's combined with experimental lowmem mode.



My filesystem is "only" 10TB or so, albeit with a lot of files.


Unless you have tons of snapshots and reflinked (deduped) files, it
shouldn't take so long.



2 things that come to mind
1) can lowmem have some progress working so that I know if I'm looking
at days, weeks, or even months before it will be done?


It's hard to estimate, especially when every cross check involves a lot
of disk IO.

But at least, we could add such an indicator to show we're doing something.
Maybe we can account all roots in the root tree first; before checking a
tree, report i/num_roots. So users can see whether the check is doing
something meaningful or is stuck in a dead loop.

Thanks,
Su



2) non lowmem is more efficient obviously when it doesn't completely
crash your machine, but could lowmem be given an amount of memory to use
for caching, or maybe use some heuristics based on RAM free so that it's
not so excruciatingly slow?


IIRC a recent commit has added the ability.
a5ce5d219822 ("btrfs-progs: extent-cache: actually cache extent buffers")

That's already included in btrfs-progs v4.13.2.
So it should be a dead loop which lowmem repair code can't handle.

Thanks,
Qu



Thanks,
Marc








Re: So, does btrfs check lowmem take days? weeks?

2018-06-28 Thread Marc MERLIN
On Fri, Jun 29, 2018 at 01:07:20PM +0800, Qu Wenruo wrote:
> > lowmem repair seems to be going still, but it's been days and -p seems
> > to do absolutely nothing.
> 
> I'm afraid you hit a bug in the lowmem repair code.
> By all means, --repair shouldn't really be used unless you're pretty
> sure the problem is something btrfs check can handle.
> 
> That's also why --repair is still marked as dangerous.
> Especially when it's combined with experimental lowmem mode.

Understood, but btrfs got corrupted (by itself or not, I don't know)
I cannot mount the filesystem read/write
I cannot btrfs check --repair it since that code will kill my machine
What do I have left?

> > My filesystem is "only" 10TB or so, albeit with a lot of files.
> 
> Unless you have tons of snapshots and reflinked (deduped) files, it
> shouldn't take so long.

I may have a fair amount.
gargamel:~# btrfs check --mode=lowmem --repair -p /dev/mapper/dshelf2 
enabling repair mode
WARNING: low-memory mode repair support is only partial
Checking filesystem on /dev/mapper/dshelf2
UUID: 0f1a0c9f-4e54-4fa7-8736-fd50818ff73d
Fixed 0 roots.
ERROR: extent[84302495744, 69632] referencer count mismatch (root: 21872, 
owner: 374857, offset: 3407872) wanted: 3, have: 4
Created new chunk [18457780224000 1073741824]
Delete backref in extent [84302495744 69632]
ERROR: extent[84302495744, 69632] referencer count mismatch (root: 22911, 
owner: 374857, offset: 3407872) wanted: 3, have: 4
Delete backref in extent [84302495744 69632]
ERROR: extent[125712527360, 12214272] referencer count mismatch (root: 21872, 
owner: 374857, offset: 114540544) wanted: 181, have: 240
Delete backref in extent [125712527360 12214272]
ERROR: extent[125730848768, 5111808] referencer count mismatch (root: 21872, 
owner: 374857, offset: 126754816) wanted: 68, have: 115
Delete backref in extent [125730848768 5111808]
ERROR: extent[125730848768, 5111808] referencer count mismatch (root: 22911, 
owner: 374857, offset: 126754816) wanted: 68, have: 115
Delete backref in extent [125730848768 5111808]
ERROR: extent[125736914944, 6037504] referencer count mismatch (root: 21872, 
owner: 374857, offset: 131866624) wanted: 115, have: 143
Delete backref in extent [125736914944 6037504]
ERROR: extent[125736914944, 6037504] referencer count mismatch (root: 22911, 
owner: 374857, offset: 131866624) wanted: 115, have: 143
Delete backref in extent [125736914944 6037504]
ERROR: extent[129952120832, 20242432] referencer count mismatch (root: 21872, 
owner: 374857, offset: 148234240) wanted: 302, have: 431
Delete backref in extent [129952120832 20242432]
ERROR: extent[129952120832, 20242432] referencer count mismatch (root: 22911, 
owner: 374857, offset: 148234240) wanted: 356, have: 433
Delete backref in extent [129952120832 20242432]
ERROR: extent[134925357056, 11829248] referencer count mismatch (root: 21872, 
owner: 374857, offset: 180371456) wanted: 161, have: 240
Delete backref in extent [134925357056 11829248]
ERROR: extent[134925357056, 11829248] referencer count mismatch (root: 22911, 
owner: 374857, offset: 180371456) wanted: 162, have: 240
Delete backref in extent [134925357056 11829248]
ERROR: extent[147895111680, 12345344] referencer count mismatch (root: 21872, 
owner: 374857, offset: 192200704) wanted: 170, have: 249
Delete backref in extent [147895111680 12345344]
ERROR: extent[147895111680, 12345344] referencer count mismatch (root: 22911, 
owner: 374857, offset: 192200704) wanted: 172, have: 251
Delete backref in extent [147895111680 12345344]
ERROR: extent[150850146304, 17522688] referencer count mismatch (root: 21872, 
owner: 374857, offset: 217653248) wanted: 348, have: 418
Delete backref in extent [150850146304 17522688]
ERROR: extent[156909494272, 55320576] referencer count mismatch (root: 22911, 
owner: 374857, offset: 235175936) wanted: 555, have: 1449
Deleted root 2 item[156909494272, 178, 5476627808561673095]
ERROR: extent[156909494272, 55320576] referencer count mismatch (root: 21872, 
owner: 374857, offset: 235175936) wanted: 556, have: 1452
Deleted root 2 item[156909494272, 178, 7338474132555182983]
ERROR: file extent[374857 235184128] root 21872 owner 21872 backref lost
Add one extent data backref [156909494272 55320576]
ERROR: file extent[374857 235184128] root 22911 owner 22911 backref lost
Add one extent data backref [156909494272 55320576]

The last two ERROR lines took over a day to get generated, so I'm not sure if
it's still working, just very slowly.
For what it's worth, a non-lowmem check used to take 12 to 24h on that filesystem
back when it still worked.

> > 2 things that come to mind
> > 1) can lowmem have some progress working so that I know if I'm looking
> > at days, weeks, or even months before it will be done?
> 
> It's hard to estimate, especially when every cross check involves a lot
> of disk IO.
> But at least, we could add such indicator to show we're doing something.

Yes, anything to show that I should still wait would be good :)

> > 2) non lowmem 

[PATCH v2] Revert "btrfs: fix a possible umount deadlock"

2018-06-28 Thread Nikolay Borisov
Since commit 88c14590cdd6 ("btrfs: use RCU in btrfs_show_devname for
device list traversal") btrfs_show_devname no longer takes
device_list_mutex. As such the deadlock that 0ccd05285e7f ("btrfs: fix
a possible umount deadlock") aimed to fix no longer exists. So remove
the extra code that commit added.

This reverts commit 0ccd05285e7f5a8e297e1d6dfc41e7c65757d6fa.

Signed-off-by: Nikolay Borisov 
---

V2: 
 * Fixed build failure due to using old name of free_device_rcu function. 
 fs/btrfs/volumes.c | 26 ++
 1 file changed, 6 insertions(+), 20 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 5bd6f3a40f9c..011a19b7930f 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1004,7 +1004,7 @@ static void btrfs_close_bdev(struct btrfs_device *device)
 		blkdev_put(device->bdev, device->mode);
 }
 
-static void btrfs_prepare_close_one_device(struct btrfs_device *device)
+static void btrfs_close_one_device(struct btrfs_device *device)
 {
 	struct btrfs_fs_devices *fs_devices = device->fs_devices;
 	struct btrfs_device *new_device;
@@ -1022,6 +1022,8 @@ static void btrfs_prepare_close_one_device(struct btrfs_device *device)
 	if (test_bit(BTRFS_DEV_STATE_MISSING, &device->dev_state))
 		fs_devices->missing_devices--;
 
+	btrfs_close_bdev(device);
+
 	new_device = btrfs_alloc_device(NULL, &device->devid,
 					device->uuid);
 	BUG_ON(IS_ERR(new_device)); /* -ENOMEM */
@@ -1035,39 +1037,23 @@ static void btrfs_prepare_close_one_device(struct btrfs_device *device)
 
 	list_replace_rcu(&device->dev_list, &new_device->dev_list);
 	new_device->fs_devices = device->fs_devices;
+
+	call_rcu(&device->rcu, free_device_rcu);
 }
 
 static int close_fs_devices(struct btrfs_fs_devices *fs_devices)
 {
 	struct btrfs_device *device, *tmp;
-	struct list_head pending_put;
-
-	INIT_LIST_HEAD(&pending_put);
 
 	if (--fs_devices->opened > 0)
 		return 0;
 
 	mutex_lock(&fs_devices->device_list_mutex);
 	list_for_each_entry_safe(device, tmp, &fs_devices->devices, dev_list) {
-		btrfs_prepare_close_one_device(device);
-		list_add(&device->dev_list, &pending_put);
+		btrfs_close_one_device(device);
 	}
 	mutex_unlock(&fs_devices->device_list_mutex);
 
-	/*
-	 * btrfs_show_devname() is using the device_list_mutex,
-	 * sometimes call to blkdev_put() leads vfs calling
-	 * into this func. So do put outside of device_list_mutex,
-	 * as of now.
-	 */
-	while (!list_empty(&pending_put)) {
-		device = list_first_entry(&pending_put,
-				struct btrfs_device, dev_list);
-		list_del(&device->dev_list);
-		btrfs_close_bdev(device);
-		call_rcu(&device->rcu, free_device_rcu);
-	}
-
 	WARN_ON(fs_devices->open_devices);
 	WARN_ON(fs_devices->rw_devices);
 	fs_devices->opened = 0;
-- 
2.7.4



Re: So, does btrfs check lowmem take days? weeks?

2018-06-28 Thread Qu Wenruo


On 2018年06月29日 12:27, Marc MERLIN wrote:
> Regular btrfs check --repair has a nice progress option. It wasn't
> perfect, but it showed something.
> 
> But then it also takes all your memory quicker than the linux kernel can
> defend itself and reliably completely kills my 32GB server quicker than
> it can OOM anything.
> 
> lowmem repair seems to be going still, but it's been days and -p seems
> to do absolutely nothing.

I'm afraid you hit a bug in the lowmem repair code.
By all means, --repair shouldn't really be used unless you're pretty
sure the problem is something btrfs check can handle.

That's also why --repair is still marked as dangerous.
Especially when it's combined with experimental lowmem mode.

> 
> My filesystem is "only" 10TB or so, albeit with a lot of files.

Unless you have tons of snapshots and reflinked (deduped) files, it
shouldn't take so long.

> 
> 2 things that come to mind
> 1) can lowmem have some progress working so that I know if I'm looking
> at days, weeks, or even months before it will be done?

It's hard to estimate, especially when every cross check involves a lot
of disk IO.

But at least, we could add such an indicator to show we're doing something.

> 
> 2) non lowmem is more efficient obviously when it doesn't completely
> crash your machine, but could lowmem be given an amount of memory to use
> for caching, or maybe use some heuristics based on RAM free so that it's
> not so excruciatingly slow?

IIRC a recent commit has added the ability.
a5ce5d219822 ("btrfs-progs: extent-cache: actually cache extent buffers")

That's already included in btrfs-progs v4.13.2.
So it should be a dead loop which lowmem repair code can't handle.

Thanks,
Qu

> 
> Thanks,
> Marc
> 





So, does btrfs check lowmem take days? weeks?

2018-06-28 Thread Marc MERLIN
Regular btrfs check --repair has a nice progress option. It wasn't
perfect, but it showed something.

But then it also takes all your memory quicker than the linux kernel can
defend itself and reliably completely kills my 32GB server quicker than
it can OOM anything.

lowmem repair seems to be going still, but it's been days and -p seems
to do absolutely nothing.

My filesystem is "only" 10TB or so, albeit with a lot of files.

2 things that come to mind
1) can lowmem have some progress working so that I know if I'm looking
at days, weeks, or even months before it will be done?

2) non lowmem is more efficient obviously when it doesn't completely
crash your machine, but could lowmem be given an amount of memory to use
for caching, or maybe use some heuristics based on RAM free so that it's
not so excruciatingly slow?

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems 
   what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/   | PGP 7F55D5F27AAF9D08


Re: [PATCH RFC] btrfs: Do extra device generation check at mount time

2018-06-28 Thread james harvey
On Thu, Jun 28, 2018 at 3:15 AM, Qu Wenruo  wrote:
> I'd like to make sure everyone, including developers and end-users, are
> fine with the restrict error-out behavior.

Yes, please error out, as a start.  Requesting this was on my
btrfs-to-do list.  A device generation mismatch from a drive going
offline until forced reboot was what ultimately sent me on the path of
problems I had.  Had parent transid errors, which I realized was due
to this, and decided to replace the outdated lvm partition with a new
one.  (Hiding the bad one to prevent UUID problems of course.)


Re: call trace: WARNING: at /build/linux-uwVqDp/linux-4.16.16/fs/btrfs/ctree.h:1565 btrfs_update_device

2018-06-28 Thread Qu Wenruo



On 2018年06月29日 09:05, Christoph Anton Mitterer wrote:
> Hey Qu and Nikolay.
> 
> 
> On Thu, 2018-06-28 at 22:58 +0800, Qu Wenruo wrote:
>> Nothing special. Btrfs-progs will handle it pretty well.
> Since this is a remote system where the ISP provides only a rescue image
> with pretty old kernel/btrfs-progs, I had to copy a current local
> binary and use that... but that seemed to have worked quite well

Glad that works.

Just forgot to mention: if it's only the device size that is unaligned, while
the super block total_bytes is still accurate, you could just fix it
online, using "btrfs fi resize",
just like:
# btrfs fi resize 1:-4K 

Then it should fix the device size misalignment.
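
For example, assuming the fs is mounted at /mnt (hypothetical mount point)
and devid 1 is the offending device as reported by "btrfs fi show":

  # btrfs fi show /mnt
  # btrfs fi resize 1:-4K /mnt

Shrinking by a few KiB should be enough, since IIRC the resize path rounds
the new size down to the sector size anyway.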

> 
>> Because the WARN_ON() is newly added.
> Ah I see.
> 
>> Yep, latest will warn about it, and --repair can also fix it too.
> Great.
> 
> 
> On Thu, 2018-06-28 at 17:25 +0300, Nikolay Borisov wrote:
>> Was this an old FS or a fresh one?
> You mean in terms of original fs creation? Probably rather oldish.. I'd
> guess at least a year or maybe even 2-3 or more.
> 
>> Looking at the callstack this
>> seems
>> to have occurred due to "btrfs_set_device_total_bytes(leaf, dev_item,
>> btrfs_device_get_disk_total_bytes(device));" call. Meaning the total
>> bytes of the disk were unaligned. Perhaps this has been like that for
>> quite some time, then you did a couple of kernel upgrades (this
>> WARN_ON
>> was added later than 4.11) and just now you happened to delete a
>> chunk
>> which would trigger a device update on-disk ?
> Could be...
> 
> 
> The following was however still a bit strange:
> sda2 and sdb2 are the partitions on the two HDDs forming the RAID1.
> 
> 
> root@rescue ~ # ./btrfs rescue fix-device-size /dev/sda2
> Fixed device size for devid 2, old size: 999131127296 new size: 999131123712
> Fixed super total bytes, old size: 1998262251008 new size: 1998262247424
> Fixed unaligned/mismatched total_bytes for super block and device items
> root@rescue ~ # ./btrfs rescue fix-device-size /dev/sdb2
> No device size related problem found
> 
> As you can see, no alignment issues were found on sdb2.
> 
> I've created these at the same time...
> I don't think (but cannot exclude for 100%) that this server ever lost
> a disk (in that case I could imagine that newer progs/kernel might have
> created sdb2 with proper alignment)
> 
> Looking at the partitions:
> 
> root@rescue ~ # gdisk -l /dev/sda
> GPT fdisk (gdisk) version 1.0.1
> 
> Partition table scan:
>   MBR: protective
>   BSD: not present
>   APM: not present
>   GPT: present
> 
> Found valid GPT with protective MBR; using GPT.
> Disk /dev/sda: 1953525168 sectors, 931.5 GiB
> Logical sector size: 512 bytes
> Disk identifier (GUID): 
> Partition table holds up to 128 entries
> First usable sector is 34, last usable sector is 1953525134
> Partitions will be aligned on 2048-sector boundaries
> Total free space is 2014 sectors (1007.0 KiB)
> 
> Number  Start (sector)End (sector)  Size   Code  Name
>12048 2097151   1023.0 MiB  EF02  BIOS boot partition
>2 2097152  1953525134   930.5 GiB   8300  Linux filesystem
> root@rescue ~ # gdisk -l /dev/sdb
> GPT fdisk (gdisk) version 1.0.1
> 
> Partition table scan:
>   MBR: protective
>   BSD: not present
>   APM: not present
>   GPT: present
> 
> Found valid GPT with protective MBR; using GPT.
> Disk /dev/sdb: 1953525168 sectors, 931.5 GiB
> Logical sector size: 512 bytes
> Disk identifier (GUID): 
> Partition table holds up to 128 entries
> First usable sector is 34, last usable sector is 1953525134
> Partitions will be aligned on 2048-sector boundaries
> Total free space is 2014 sectors (1007.0 KiB)
> 
> Number  Start (sector)End (sector)  Size   Code  Name
>12048 2097151   1023.0 MiB  EF02  BIOS boot partition
>2 2097152  1953525134   930.5 GiB   8300  Linux filesystem
> 
> 
> Both the same... so if there was no device replace or so... then I
> wonder why only one device was affected.

Maybe it's the old mkfs causing the problem?
Although mkfs.btrfs added device size alignment much earlier than the
kernel did, it's still possible that the old mkfs doesn't handle the initial
device and the extra devices (mkfs.btrfs will always create a temporary fs on
the first device, then add all the other devices to the system) the same
way.

Thanks,
Qu

> 
> 
> Cheers,
> Chris.


Re: call trace: WARNING: at /build/linux-uwVqDp/linux-4.16.16/fs/btrfs/ctree.h:1565 btrfs_update_device

2018-06-28 Thread Christoph Anton Mitterer
Hey Qu and Nikolay.


On Thu, 2018-06-28 at 22:58 +0800, Qu Wenruo wrote:
> Nothing special. Btrfs-progs will handle it pretty well.
Since this is a remote system where the ISP provides only a rescue image
with pretty old kernel/btrfs-progs, I had to copy a current local
binary and use that... but that seemed to have worked quite well

> Because the WARN_ON() is newly added.
Ah I see.

> Yep, latest will warn about it, and --repair can also fix it too.
Great.


On Thu, 2018-06-28 at 17:25 +0300, Nikolay Borisov wrote:
> Was this an old FS or a fresh one?
You mean in terms of original fs creation? Probably rather oldish.. I'd
guess at least a year or maybe even 2-3 or more.

> Looking at the callstack this
> seems
> to have occurred due to "btrfs_set_device_total_bytes(leaf, dev_item,
> btrfs_device_get_disk_total_bytes(device));" call. Meaning the total
> bytes of the disk were unaligned. Perhaps this has been like that for
> quite some time, then you did a couple of kernel upgrades (this
> WARN_ON
> was added later than 4.11) and just now you happened to delete a
> chunk
> which would trigger a device update on-disk ?
Could be...


The following was however still a bit strange:
sda2 and sdb2 are the partitions on the two HDDs forming the RAID1.


root@rescue ~ # ./btrfs rescue fix-device-size /dev/sda2
Fixed device size for devid 2, old size: 999131127296 new size: 999131123712
Fixed super total bytes, old size: 1998262251008 new size: 1998262247424
Fixed unaligned/mismatched total_bytes for super block and device items
root@rescue ~ # ./btrfs rescue fix-device-size /dev/sdb2
No device size related problem found

As you can see, no alignment issues were found on sdb2.

I've created these at the same time...
I don't think (but cannot exclude for 100%) that this server ever lost
a disk (in that case I could imagine that newer progs/kernel might have
created sdb2 with proper alignment)

Looking at the partitions:

root@rescue ~ # gdisk -l /dev/sda
GPT fdisk (gdisk) version 1.0.1

Partition table scan:
  MBR: protective
  BSD: not present
  APM: not present
  GPT: present

Found valid GPT with protective MBR; using GPT.
Disk /dev/sda: 1953525168 sectors, 931.5 GiB
Logical sector size: 512 bytes
Disk identifier (GUID): 
Partition table holds up to 128 entries
First usable sector is 34, last usable sector is 1953525134
Partitions will be aligned on 2048-sector boundaries
Total free space is 2014 sectors (1007.0 KiB)

Number  Start (sector)End (sector)  Size   Code  Name
   12048 2097151   1023.0 MiB  EF02  BIOS boot partition
   2 2097152  1953525134   930.5 GiB   8300  Linux filesystem
root@rescue ~ # gdisk -l /dev/sdb
GPT fdisk (gdisk) version 1.0.1

Partition table scan:
  MBR: protective
  BSD: not present
  APM: not present
  GPT: present

Found valid GPT with protective MBR; using GPT.
Disk /dev/sdb: 1953525168 sectors, 931.5 GiB
Logical sector size: 512 bytes
Disk identifier (GUID): 
Partition table holds up to 128 entries
First usable sector is 34, last usable sector is 1953525134
Partitions will be aligned on 2048-sector boundaries
Total free space is 2014 sectors (1007.0 KiB)

Number  Start (sector)End (sector)  Size   Code  Name
   12048 2097151   1023.0 MiB  EF02  BIOS boot partition
   2 2097152  1953525134   930.5 GiB   8300  Linux filesystem


Both the same... so if there was no device replace or so... then I
wonder why only one device was affected.


Cheers,
Chris.


Re: [PATCH RFC] btrfs: Do extra device generation check at mount time

2018-06-28 Thread Qu Wenruo


On 2018年06月28日 22:36, Adam Borowski wrote:
> On Thu, Jun 28, 2018 at 03:04:43PM +0800, Qu Wenruo wrote:
>> There is a reporter considering btrfs raid1 has a major design flaw
>> which can't handle nodatasum files.
>>
>> Despite his incorrect expectation, btrfs indeed doesn't handle device
>> generation mismatch well.
>>
>> This means if one devices missed and re-appeared, even its generation
>> no longer matches with the rest device pool, btrfs does nothing to it,
>> but treat it as normal good device.
>>
>> At least let's detect such generation mismatch and avoid mounting the
>> fs.
> 
> Uhm, that'd be a nasty regression for the regular (no-nodatacow) case. 
> The vast majority of data is fine, and extents that have been written to
> while a device is missing will be either placed elsewhere (if the filesystem
> knew it was degraded) or read one of the copies to notice a wrong checksum
> and automatically recover (if the device was still falsely believed to be
> good at write time).

Yes, for a fs without any nodatasum usage, the behavior is indeed overkill.
But sometimes such an overkill sanity check is really important, as long
as nodatasum is a provided feature.

> 
> We currently don't have selective scrub yet so resyncing such single-copy
> extents is costly, but 1. all will be fine if the data is read, 2. it's
> possible to add such a smart resync in the future, far better than a
> write-intent bitmap can do.

Well, auto scrub for a device doesn't look that bad to me.
Since scrub is normally scheduled as routine maintenance work, it should
not be super expensive.

We only need to teach btrfs to treat such device as kind of degraded.
Then we can reuse most of the scrub routine to fix it.
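
(Until that exists, the manual equivalent once the missing device is back
is simply a scrub of the mounted fs, e.g. on a hypothetical mount point
/mnt:

  # btrfs scrub start -Bd /mnt

-B stays in the foreground and -d prints per-device statistics, so the
errors fixed on the formerly missing device are visible.)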

> 
> To do the latter, we can note the last generation the filesystem was known
> to be fully coherent (ie, all devices were successfully flushed with no
> mysterious write failures), then run selective scrub (perhaps even
> automatically) when the filesystem is no longer degraded.  There's some
> extra complexity with 3- or 4-way RAID (multiple levels of degradation) but
> a single number would help even there.
> 
> But even currently, without the above not-yet-written recovery, it's
> reasonably safe to continue without scrub -- it's a case of running
> partially degraded when the bad copy is already known to be suspicious.
> 
> For no-nodatacow data and metadata, that is.
> 
>> Currently there is no automatic rebuild yet, which means if users find
>> device generation mismatch error message, they can only mount the fs
>> using "device" and "degraded" mount option (if possible), then replace
>> the offending device to manually "rebuild" the fs.
> 
> As nodatacow already means "I don't care about this data, or have another
> way of recovering it", I don't quite get why we would drop existing
> auto-recovery for a common transient failure case.

Yep, exactly my understanding of nodatasum behavior.

However, in the real world btrfs is the only *linux* fs that supports data
csum, and the most widely used filesystems like ext4 and xfs don't.

As the discussion about the behavior goes, I find that LVM/mdraid +
ext4/xfs handle a missing device better than btrfs with nodatasum, which
means we need to at least provide what LVM/mdraid can provide.

> 
> If you're paranoid, perhaps some bit "this filesystem has some nodatacow
> data on it" could warrant such a block, but it would still need to be
> overridable _without_ a need for replace.  There's also the problem that
> systemd marks its journal nodatacow (despite it having infamously bad
> handling of failures!), and too many distributions infect their default
> installs with systemd, meaning such a bit would be on in most cases.
> 
> But why would I put all my other data at risk, just because there's a
> nodatacow file?  There's a big difference between scrubbing when only a few
> transactions worth of data is suspicious and completely throwing away a
> mostly-good replica to replace it from the now fully degraded copy.
> 
>> I totally understand that, generation based solution can't handle
>> split-brain case (where 2 RAID1 devices get mounted degraded separately)
>> at all, but at least let's handle what we can do.
> 
> Generation can do well at least unless both devices were mounted elsewhere
> and got the exact same number of transactions, the problem is that nodatacow
> doesn't bump generation number.

Generation is never a problem, as any metadata change will still bump
generation.

> 
>> The best way to solve the problem is to make btrfs treat such lower gen
>> devices as some kind of missing device, and queue an automatic scrub for
>> that device.
>> But that's a lot of extra work, at least let's start from detecting such
>> problem first.
> 
> I wonder if there's some way to treat problematic nodatacow files as
> degraded only?
> 
> Nodatacow misses most of btrfs mechanisms, thus to get it done right you'd
> need to pretty much copy all of md's 

Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files

2018-06-28 Thread Qu Wenruo


On 2018年06月29日 01:10, Andrei Borzenkov wrote:
> 28.06.2018 12:15, Qu Wenruo wrote:
>>
>>
>> On 2018年06月28日 16:16, Andrei Borzenkov wrote:
>>> On Thu, Jun 28, 2018 at 8:39 AM, Qu Wenruo  wrote:


 On 2018年06月28日 11:14, r...@georgianit.com wrote:
>
>
> On Wed, Jun 27, 2018, at 10:55 PM, Qu Wenruo wrote:
>
>>
>> Please get yourself clear of what other raid1 is doing.
>
> A drive failure, where the drive is still there when the computer 
> reboots, is a situation that *any* raid 1, (or for that matter, raid 5, 
> raid 6, anything but raid 0) will recover from perfectly without raising 
> a sweat. Some will rebuild the array automatically,

 WOW, that's black magic, at least for RAID1.
 The whole RAID1 has no idea of which copy is correct, unlike btrfs, which
 has datasum.

 Don't bother other things, just tell me how to determine which one is
 correct?

>>>
>>> When one drive fails, it is recorded in meta-data on remaining drives;
>>> probably configuration generation number is increased. Next time drive
>>> with older generation is not incorporated. Hardware controllers also
>>> keep this information in NVRAM and so do not even depend on scanning
>>> of other disks.
>>
>> Yep, the only possible way to determine such case is from external info.
>>
>> For device generation, it's possible to enhance btrfs, but at least we
>> could start by detecting it and refusing to RW mount, to avoid possible
>> further corruption.
>> But anyway, if one really cares about such a case, a hardware RAID
>> controller seems to be the only solution, as other software may have the
>> same problem.
>>
>> And the hardware solution looks pretty interesting, is the write to
>> NVRAM 100% atomic? Even at power loss?
>>
>>>
 The only possibility is that the misbehaving device missed several super
 block updates, so we have a chance to detect that it's out-of-date.
 But that's not always working.

>>>
>>> Why should it not work, as long as any write to the array is suspended
>>> until the superblock on the remaining devices is updated?
>>
>> What happens if there is no generation gap in device superblock?
>>
> 
> Well, you use "generation" in the strict btrfs sense, I use "generation"
> generically. That is exactly what btrfs apparently lacks currently -
> some monotonic counter that is used to record such an event.

Indeed, btrfs doesn't have any way to record which device got degraded
at all.
The usage of the btrfs device generation is already a kind of workaround.

So to keep the same behavior as mdraid/lvm, each time btrfs detects a
missing device or a fatal command (flush/fua) not executed correctly, btrfs
needs to record it, maybe in its device item, and commit it to disk.

In short, btrfs csum makes us a little complacent about the missing-device
case: normally csum will tell us which data is wrong, so we can avoid
complex device status tracking.
But apparently, if nodatasum is involved, everything falls outside our
expectations.

> 
>> If one device got some of its (nodatacow) data written to disk, while
>> the other device didn't, and neither of them reached the super block
>> update, there is no difference in the device superblocks, thus no
>> way to detect which is correct.
>>
> 
> Again, the very fact that a device failed should have triggered an update
> of the superblock to record this information, which presumably should
> increase some counter.

Indeed.

> 
>>>
 If you're talking about missing generation check for btrfs, that's
 valid, but it's far from a "major design flaw", as there are a lot of
 cases where other RAID1 (mdraid or LVM mirrored) can also be affected
 (the split-brain case).

>>>
>>> That's different. Yes, with software-based raid there is usually no
>>> way to detect outdated copy if no other copies are present. Having
>>> older valid data is still very different from corrupting newer data.
>>
>> While for the VDI case (or any VM image file format other than raw), older
>> valid data normally means corruption,
>> unless they have their own write-ahead log.
>> Some file formats may detect such problems by themselves if they have an
>> internal checksum, but anyway, older data normally means corruption,
>> especially when it's partially new and partially old.
>>
> 
> Yes, that's true. But there is really nothing that can be done here,
> even theoretically; it's hardly a reason not to do what looks possible.

Well, theoretically, you can just use datasum and datacow :)

Thanks,
Qu

> 
>> On the other hand, with data COW and csum, btrfs can ensure the whole
>> filesystem update is atomic (at least for single device).
>> So the title, especially the "major design flaw" part, couldn't be more wrong.
>>
>>>
> others will automatically kick out the misbehaving drive.  *none* of them 
> will take back the drive with old data and start commingling that 
> data with good copy.)\ This behaviour from BTRFS is completely abnormal.. 
> and 

Re: [PATCH] Revert "btrfs: fix a possible umount deadlock"

2018-06-28 Thread kbuild test robot
Hi Nikolay,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on v4.18-rc2]
[also build test ERROR on next-20180628]
[cannot apply to btrfs/next]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Nikolay-Borisov/Revert-btrfs-fix-a-possible-umount-deadlock/20180629-044154
config: x86_64-randconfig-x015-201825 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
reproduce:
# save the attached .config to linux build tree
make ARCH=x86_64 

All errors (new ones prefixed by >>):

   fs/btrfs/volumes.c: In function 'btrfs_close_one_device':
>> fs/btrfs/volumes.c:1041:25: error: 'free_device' undeclared (first use in 
>> this function); did you mean 'new_device'?
     call_rcu(&device->rcu, free_device);
^~~
new_device
   fs/btrfs/volumes.c:1041:25: note: each undeclared identifier is reported 
only once for each function it appears in

vim +1041 fs/btrfs/volumes.c

   1006  
   1007  static void btrfs_close_one_device(struct btrfs_device *device)
   1008  {
   1009          struct btrfs_fs_devices *fs_devices = device->fs_devices;
   1010          struct btrfs_device *new_device;
   1011          struct rcu_string *name;
   1012  
   1013          if (device->bdev)
   1014                  fs_devices->open_devices--;
   1015  
   1016          if (test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state) &&
   1017              device->devid != BTRFS_DEV_REPLACE_DEVID) {
   1018                  list_del_init(&device->dev_alloc_list);
   1019                  fs_devices->rw_devices--;
   1020          }
   1021  
   1022          if (test_bit(BTRFS_DEV_STATE_MISSING, &device->dev_state))
   1023                  fs_devices->missing_devices--;
   1024  
   1025          btrfs_close_bdev(device);
   1026  
   1027          new_device = btrfs_alloc_device(NULL, &device->devid,
   1028                                          device->uuid);
   1029          BUG_ON(IS_ERR(new_device)); /* -ENOMEM */
   1030  
   1031          /* Safe because we are under uuid_mutex */
   1032          if (device->name) {
   1033                  name = rcu_string_strdup(device->name->str, GFP_NOFS);
   1034                  BUG_ON(!name); /* -ENOMEM */
   1035                  rcu_assign_pointer(new_device->name, name);
   1036          }
   1037  
   1038          list_replace_rcu(&device->dev_list, &new_device->dev_list);
   1039          new_device->fs_devices = device->fs_devices;
   1040  
 > 1041          call_rcu(&device->rcu, free_device);
  1042  }
  1043  

---
0-DAY kernel test infrastructure            Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation




Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files

2018-06-28 Thread Chris Murphy
On Thu, Jun 28, 2018 at 11:37 AM, Goffredo Baroncelli
 wrote:

> Regarding your point 3), it must be pointed out that in the case of NOCOW files,
> even having the same transid is not enough. It is still possible that a
> copy is updated before a power failure, preventing the super-block update.
> I think that the only way to prevent this from happening is:
>   1) using a data journal (which means that each piece of data is written twice)
> OR
>   2) using a cow filesystem (with cow enabled, of course!)


There is no power failure in this example. So it's really off the
table when considering whether Btrfs or mdadm/lvm raid does better in the
same situation with a nodatacow file.

I think here is the problem in the Btrfs nodatacow case. Btrfs doesn't
have a way of untrusting nodatacow files on a previously missing drive
that hasn't been balanced. There is no such thing as nometadatacow, so
for metadata it figures out there's a problem no matter what and uses the
good copy, but it never "marks" the previously missing device as
suspicious. When it comes time to read a nodatacow file, Btrfs just
blindly reads off one of the drives; it has no mechanism for questioning
the formerly missing drive without csum.

That is actually a really weird and unique kind of write hole for
Btrfs raid1 when the data is nodatacow.

I have to agree with Remi. This is a flaw in the design or bad bug,
however you want to consider it. Because mdadm/lvm do not behave this
way in the exact same situation.

And an open question I have about scrub is whether it only ever checks
csums, meaning nodatacow files are never scrubbed, or if the copies are
at least compared to each other.

As for fixes:

- At mount time, Btrfs sees from the supers that there is a transid
mismatch, and does not read nodatacow files from the lower-transid device
until an auto balance has completed. Right now Btrfs doesn't have an
abbreviated balance that "replays" the events between two transids.
Basically it would work like send/receive, but for balance, to catch up
a previously missing device. Right now we have to do a full balance,
which is a brutal penalty for a briefly missing drive. Again, mdadm
and lvm do better here by default.

- Fix the performance issues of COW with disk images. ZFS doesn't even
have a nodatacow option and they're running VM images on ZFS and it
doesn't sound like they're running into ridiculous performance
penalties that make it impractical to use.



-- 
Chris Murphy


Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files

2018-06-28 Thread Chris Murphy
On Thu, Jun 28, 2018 at 9:37 AM, Remi Gauvin  wrote:
> On 2018-06-28 10:17 AM, Chris Murphy wrote:
>
>> 2. The new data goes in a single chunk; even if the user does a manual
>> balance (resync) their data isn't replicated. They must know to do a
>> -dconvert balance to replicate the new data. Again this is a net worse
>> behavior than mdadm out of the box, putting user data at risk.
>
> I'm not sure this is the case.  Even though writes failed to the
> disconnected device, btrfs seemed to keep on going as though it *were*.

Yeah in your case the failure happens during normal operation and in
that case there's no degraded state on Btrfs. So it keeps writing to
raid1 chunk on the working drive, with writes on the failed devices
going nowhere (with lots of write errors). When you stop using the
volume, fix the problem with the missing drive, then remount the
volume, it really should still use only the new copy on the never
missing drive, even though it won't necessarily notice the file is
missing on the formerly missing drive. You have to balance manually to
fix it.


> When the array was re-mounted with both devices, (never mounted as
> degraded), and scrub was run, scrub took a *long* time fixing errors, at
> a whopping 3MB/s, and reported having fixed millions of them.

That's slow but it's expected to fix a lot of problems. Even in a very
short amount of time there are thousands of missing data and metadata
extents that need to be replicated.




-- 
Chris Murphy


Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files

2018-06-28 Thread Remi Gauvin

> Acceptable, but it doesn't really apply to software-based RAID1.
> 

Which completely disregards the minor detail that all the software
RAIDs I know of can handle exactly this kind of situation without
losing or corrupting a single byte of data (errors on the remaining
hard drive notwithstanding).

Exactly what methods they employ to do so I'm not an expert at, but it
*does* work, contrary to your repeated assertions otherwise.

In any case, thank you for the patch you wrote.  I will, however,
propose a different solution.

Given the reliance of BTRFS on csum, and the lack of any
resynchronization (no matter how the drives got out of sync), I think
NoDataCow should just be ignored in the case of RAID, just like the data
blocks get copied if there is a snapshot.

In the current implementation of RAID on btrfs, RAID and nodatacow are
effectively mutually exclusive.  Consider the kinds of use cases
nodatacow is usually recommended for,  VM images and databases.   Even
though those files should have their own mechanisms for dealing with
incomplete writes, and data verification, BTRFS RAID creates a unique
situation where parts of the file can be inconsistent, with different
data being read depending on which device is doing the reading.

Regardless of which method, short term and long term, developers choose
to address this, this next part I have to stress I consider very important.

The status page really needs to be updated to reflect this gotcha.  It
*will* bite people in ways they do not expect, and disastrously.




Re: [PATCH] Revert "btrfs: fix a possible umount deadlock"

2018-06-28 Thread kbuild test robot
Hi Nikolay,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on v4.18-rc2]
[also build test ERROR on next-20180628]
[cannot apply to btrfs/next]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Nikolay-Borisov/Revert-btrfs-fix-a-possible-umount-deadlock/20180629-044154
config: i386-randconfig-a0-201825 (attached as .config)
compiler: gcc-4.9 (Debian 4.9.4-2) 4.9.4
reproduce:
# save the attached .config to linux build tree
make ARCH=i386 

All errors (new ones prefixed by >>):

   fs//btrfs/volumes.c: In function 'btrfs_close_one_device':
>> fs//btrfs/volumes.c:1041:25: error: 'free_device' undeclared (first use in 
>> this function)
     call_rcu(&device->rcu, free_device);
^
   fs//btrfs/volumes.c:1041:25: note: each undeclared identifier is reported 
only once for each function it appears in

vim +/free_device +1041 fs//btrfs/volumes.c

   1006  
   1007  static void btrfs_close_one_device(struct btrfs_device *device)
   1008  {
   1009          struct btrfs_fs_devices *fs_devices = device->fs_devices;
   1010          struct btrfs_device *new_device;
   1011          struct rcu_string *name;
   1012  
   1013          if (device->bdev)
   1014                  fs_devices->open_devices--;
   1015  
   1016          if (test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state) &&
   1017              device->devid != BTRFS_DEV_REPLACE_DEVID) {
   1018                  list_del_init(&device->dev_alloc_list);
   1019                  fs_devices->rw_devices--;
   1020          }
   1021  
   1022          if (test_bit(BTRFS_DEV_STATE_MISSING, &device->dev_state))
   1023                  fs_devices->missing_devices--;
   1024  
   1025          btrfs_close_bdev(device);
   1026  
   1027          new_device = btrfs_alloc_device(NULL, &device->devid,
   1028                                          device->uuid);
   1029          BUG_ON(IS_ERR(new_device)); /* -ENOMEM */
   1030  
   1031          /* Safe because we are under uuid_mutex */
   1032          if (device->name) {
   1033                  name = rcu_string_strdup(device->name->str, GFP_NOFS);
   1034                  BUG_ON(!name); /* -ENOMEM */
   1035                  rcu_assign_pointer(new_device->name, name);
   1036          }
   1037  
   1038          list_replace_rcu(&device->dev_list, &new_device->dev_list);
   1039          new_device->fs_devices = device->fs_devices;
   1040  
 > 1041          call_rcu(&device->rcu, free_device);
  1042  }
  1043  

---
0-DAY kernel test infrastructure            Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation




Incremental send/receive broken after snapshot restore

2018-06-28 Thread Hannes Schweizer
Hi,

Here's my environment:
Linux diablo 4.17.0-gentoo #5 SMP Mon Jun 25 00:26:55 CEST 2018 x86_64
Intel(R) Core(TM) i5 CPU 760 @ 2.80GHz GenuineIntel GNU/Linux
btrfs-progs v4.17

Label: 'online'  uuid: e4dc6617-b7ed-4dfb-84a6-26e3952c8390
Total devices 2 FS bytes used 3.16TiB
devid    1 size 1.82TiB used 1.58TiB path /dev/mapper/online0
devid    2 size 1.82TiB used 1.58TiB path /dev/mapper/online1
Data, RAID0: total=3.16TiB, used=3.15TiB
System, RAID0: total=16.00MiB, used=240.00KiB
Metadata, RAID0: total=7.00GiB, used=4.91GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

Label: 'offline'  uuid: 5b449116-93e5-473e-aaf5-bf3097b14f29
Total devices 2 FS bytes used 3.52TiB
devid    1 size 5.46TiB used 3.53TiB path /dev/mapper/offline0
devid    2 size 5.46TiB used 3.53TiB path /dev/mapper/offline1
Data, RAID1: total=3.52TiB, used=3.52TiB
System, RAID1: total=8.00MiB, used=512.00KiB
Metadata, RAID1: total=6.00GiB, used=5.11GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

Label: 'external'  uuid: 8bf13621-01f0-4f09-95c7-2c157d3087d0
Total devices 1 FS bytes used 3.65TiB
devid    1 size 5.46TiB used 3.66TiB path
/dev/mapper/luks-3c196e96-d46c-4a9c-9583-b79c707678fc
Data, single: total=3.64TiB, used=3.64TiB
System, DUP: total=32.00MiB, used=448.00KiB
Metadata, DUP: total=11.00GiB, used=9.72GiB
GlobalReserve, single: total=512.00MiB, used=0.00B


The following automatic backup scheme is in place:
hourly:
btrfs sub snap -r online/root online/root.

daily:
btrfs sub snap -r online/root online/root.
btrfs send -c online/root.
online/root. | btrfs receive offline
btrfs sub del -c online/root.

monthly:
btrfs sub snap -r online/root online/root.
btrfs send -c online/root.
online/root. | btrfs receive external
btrfs sub del -c online/root.
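
(A sketch of one daily iteration; the timestamped snapshot names below are
hypothetical examples, not the exact names the scripts use:)

btrfs sub snap -r online/root online/root.2018-06-28
btrfs send -c online/root.2018-06-27 online/root.2018-06-28 | btrfs receive offline
btrfs sub del -c online/root.2018-06-27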

Now here are the commands leading up to my problem:
After the online filesystem suddenly went ro, and btrfs check showed
massive problems, I decided to start the online array from scratch:
1: mkfs.btrfs -f -d raid0 -m raid0 -L "online" /dev/mapper/online0
/dev/mapper/online1

As you can see from the backup commands above, the snapshots of
offline and external are not related, so in order to at least keep the
extensive backlog of the external snapshot set (including all
reflinks), I decided to restore the latest snapshot from external.
2: btrfs send external/root. | btrfs receive online

I wanted to ensure I can restart the incremental backup flow from
online to external, so I did this
3: mv online/root. online/root
4: btrfs sub snap -r online/root online/root.
5: btrfs property set online/root ro false

Now, I naively expected that a simple restart of my automatic backups for
external would work.
However after running
6: btrfs sub snap -r online/root online/root.
7: btrfs send -c online/root.
online/root. | btrfs receive external
I see the following error:
ERROR: unlink root/.ssh/agent-diablo-_dev_pts_3 failed. No such file
or directory

Which is unfortunate, but the second problem actually encouraged me to
post this message.
As planned, I had to start the offline array from scratch as well,
because I no longer had any reference snapshot for incremental backups
on other devices:
8: mkfs.btrfs -f -d raid1 -m raid1 -L "offline" /dev/mapper/offline0
/dev/mapper/offline1

However restarting the automatic daily backup flow bails out with a
similar error, although no potentially problematic previous
incremental snapshots should be involved here!
ERROR: unlink o925031-987-0/2139527549 failed. No such file or directory

I'm a bit lost now. The only thing I could imagine which might be
confusing for btrfs is the residual "Received UUID" of online/root.
after command 2.
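
(A quick way to check for that residual value; the path is only an example
of where the restored subvolume lives:)

btrfs subvolume show online/root | grep -i 'received uuid'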
What's the recommended way to restore snapshots with send/receive
without breaking subsequent incremental backups (including reflinks of
existing backups)?

Any hints appreciated...
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: call trace: WARNING: at /build/linux-uwVqDp/linux-4.16.16/fs/btrfs/ctree.h:1565 btrfs_update_device

2018-06-28 Thread Nikolay Borisov



On 28.06.2018 17:17, Christoph Anton Mitterer wrote:
> On Thu, 2018-06-28 at 22:09 +0800, Qu Wenruo wrote:
>>> [   72.168662] WARNING: CPU: 0 PID: 242 at /build/linux-
>>> uwVqDp/linux-4.16.16/fs/btrfs/ctree.h:1565
>>> btrfs_update_device+0x1b2/0x1c0
>> It looks like it's the old WARN_ON() for unaligned device size.
>> Would you please verify if it is the case?
> 
> # blockdev --getsize64 /dev/sdb2 /dev/sda2
> 999131127296
> 999131127296
> 
> 
> Since getsize64 returns bytes and not sectors, I suppose it would need
> to be aligned to 1024 by the least?
> 
> 999131127296 / 1024 = 975713991,5
> 
> So it's not.
> 
> 
>> If so, "btrfs rescue fix-device-size" should handle it pretty well.
> 
> I guess this needs to be done with the fs unmounted?
> Anything to consider since I have RAID1 (except from running it on both
> devices)?
> 
> 
> Also, it's a bit strange that this error occurred never before (though
> the btrfs-restore manpage says the kernel would check for this since
> 4.11).

Was this an old FS or a fresh one? Looking at the callstack this seems
to have occured due to "btrfs_set_device_total_bytes(leaf, dev_item,
btrfs_device_get_disk_total_bytes(device));" call. Meaning the total
bytes of the disk were unalgined. Perhaps this has been like that for
quite some time, then you did a couple of kernel upgrades (this WARN_ON
was added later than 4.11) and just now you happened to delete a chunk
which would trigger a device update on-disk ?

> 
> It would further be nice if btrfs-check would warn about this.
> 
> 
> Thanks,
> Chris.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC] btrfs: Do extra device generation check at mount time

2018-06-28 Thread Qu Wenruo



On 2018年06月28日 16:04, Nikolay Borisov wrote:
> 
> 
> On 28.06.2018 10:04, Qu Wenruo wrote:
>> There is a reporter considering btrfs raid1 has a major design flaw
>> which can't handle nodatasum files.
>>
>> Despite his incorrect expectation, btrfs indeed doesn't handle device
>> generation mismatch well.
> 
> I think this is getting a bit personal, no need for that. Just say that
> btrfs doesn't handle this particular case and that's it. Though I agree
> the reporter's attitude wasn't exactly nice...
> 
>>
>> This means if one devices missed and re-appeared, even its generation
>> no longer matches with the rest device pool, btrfs does nothing to it,
>> but treat it as normal good device.
>>
>> At least let's detect such generation mismatch and avoid mounting the
>> fs.
>> Currently there is no automatic rebuild yet, which means if users find
>> device generation mismatch error message, they can only mount the fs
>> using "device" and "degraded" mount option (if possible), then replace
>> the offending device to manually "rebuild" the fs.
>>
>> Signed-off-by: Qu Wenruo 
>> ---
>> I totally understand that, generation based solution can't handle
>> split-brain case (where 2 RAID1 devices get mounted degraded separately)
>> at all, but at least let's handle what we can do.
>>
>> The best way to solve the problem is to make btrfs treat such lower gen
>> devices as some kind of missing device, and queue an automatic scrub for
>> that device.
>> But that's a lot of extra work, at least let's start from detecting such
>> problem first.
>> ---
>>  fs/btrfs/volumes.c | 50 ++
>>  1 file changed, 50 insertions(+)
>>
>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>> index e034ad9e23b4..80a7c44993bc 100644
>> --- a/fs/btrfs/volumes.c
>> +++ b/fs/btrfs/volumes.c
>> @@ -6467,6 +6467,49 @@ static void btrfs_report_missing_device(struct 
>> btrfs_fs_info *fs_info,
>>devid, uuid);
>>  }
>>  
>> +static int verify_devices_generation(struct btrfs_fs_info *fs_info,
>> + struct btrfs_device *dev)
>> +{
>> +struct btrfs_fs_devices *fs_devices = dev->fs_devices;
>> +struct btrfs_device *cur;
>> +bool warn_only = false;
>> +int ret = 0;
>> +
>> +if (!fs_devices || fs_devices->seeding || !dev->generation)
>> +return 0;
>> +
>> +/*
>> + * If we're not replaying log, we're completely safe to allow
>> + * generation mismatch as it won't write anything to disks, nor
>> + * remount to rw.
>> + */
>> +if (btrfs_test_opt(fs_info, NOLOGREPLAY))
>> +warn_only = true;
>> +
>> +rcu_read_lock();
>> +list_for_each_entry_rcu(cur, &fs_devices->devices, dev_list) {
>> +if (cur->generation && cur->generation != dev->generation) {
>> +if (warn_only) {
>> +btrfs_warn_rl_in_rcu(fs_info,
>> +"devid %llu has unexpected generation, has %llu expected %llu",
>> + dev->devid,
>> + dev->generation,
>> + cur->generation);
>> +} else {
>> +btrfs_err_rl_in_rcu(fs_info,
>> +"devid %llu has unexpected generation, has %llu expected %llu",
>> + dev->devid,
>> + dev->generation,
>> + cur->generation);
>> +ret = -EINVAL;
>> +break;
>> +}
>> +}
>> +}
>> +rcu_read_unlock();
>> +return ret;
>> +}
>> +
>>  static int read_one_chunk(struct btrfs_fs_info *fs_info, struct btrfs_key 
>> *key,
>>struct extent_buffer *leaf,
>>struct btrfs_chunk *chunk)
>> @@ -6552,6 +6595,13 @@ static int read_one_chunk(struct btrfs_fs_info 
>> *fs_info, struct btrfs_key *key,
>>  return PTR_ERR(map->stripes[i].dev);
>>  }
>>  btrfs_report_missing_device(fs_info, devid, uuid, 
>> false);
>> +} else {
>> +ret = verify_devices_generation(fs_info,
>> +map->stripes[i].dev);
>> +if (ret < 0) {
>> +free_extent_map(em);
>> +return ret;
>> +}
> 
> So this will be executed when doing mount. What about one device
> disappearing, while the filesystem is still mounted and then later
> re-appears. This code won't be executed in this case, no ?

This depends on how the re-appear happens.

If just several bio fails, it shouldn't be a big problem.
If it really re-appears, I'm not pretty sure how btrfs will handle it,
since there 

Re: Enabling quota may not correctly rescan on 4.17

2018-06-28 Thread Nikolay Borisov


On 28.06.2018 11:10, Misono Tomohiro wrote:
> On 2018/06/28 16:12, Qu Wenruo wrote:
>>
>>
>> On 2018年06月27日 16:25, Misono Tomohiro wrote:
>>> On 2018/06/27 17:10, Qu Wenruo wrote:


 On 2018年06月26日 14:00, Misono Tomohiro wrote:
> Hello Nikolay,
>
> I noticed that commit 5d23515be669 ("btrfs: Move qgroup rescan
> on quota enable to btrfs_quota_enable") in 4.17 sometimes causes
> to fail correctly rescanning quota when quota is enabled.
>
> Simple reproducer:
>
> $ mkfs.btrfs -f $DEV
> $ mount $DEV /mnt
> $ dd if=/dev/urandom of=/mnt/file bs=1000 count=1000
> $ btrfs quota enable /mnt
> $ umount /mnt
> $ btrfs check $DEV
> ...
>> Unfortunately in my environment, btrfs/114 failed to reprocduce it with
>> 1024 runs overnight, with v4.18-rc1 kernel.
>>
>> Would you please provide the whole btrfs-image dump of the corrupted fs?
> 
> Yes.
> The attached file is an image-dump of above reproducer (kernel 4.17.0, progs 
> 4.17)
> as the dump of btrfs/114 is a bit large for mail.
> 
> Though this does not always happen, I see the failure both on 4.17.0 or 
> 4.18-rc2.
> 
> Thanks,
> Tomohiro Misono

Misono,

Can you please try the attached patch?

> 
>>
>> There are several different assumptions on how the bug happens, with
>> your btrfs-image dump, it would help a lot to rule out some assumption.
>>
>> Thanks,
>> Qu
>From 10345e21bc2b4e61644da6b76ee4528710b2be25 Mon Sep 17 00:00:00 2001
From: Nikolay Borisov 
Date: Tue, 26 Jun 2018 10:03:58 +0300
Subject: [PATCH] btrfs: qgroups: Move transaction managed inside
 btrfs_quota_enable/disable

Commit 5d23515be669 ("btrfs: Move qgroup rescan on quota enable to
btrfs_quota_enable") not only resulted in an easier to follow code but
it also introduced a subtle bug. It changed the timing of when the initial
quota rescan happens - before the commit it would happen
after the transaction commit had occurred, but after the commit it might happen
before the transaction was committed. This results in failure to
correctly rescan the quota since there could be data which is still not
committed on disk.

This patch aims to fix this by moving the transaction creation/commit
inside btrfs_quota_enable, which allows scheduling the quota commit
after the transaction has been committed. For the sake of symmetry this patch
also moves the transaction logic inside btrfs_quota_disable.

Fixes: 5d23515be669 ("btrfs: Move qgroup rescan on quota enable to btrfs_quota_enable")
Reported-by: Misono Tomohiro 
Link: https://marc.info/?l=linux-btrfs=152999289017582
Signed-off-by: Nikolay Borisov 
---
 fs/btrfs/ioctl.c  | 15 ++-
 fs/btrfs/qgroup.c | 38 +++---
 fs/btrfs/qgroup.h |  6 ++
 3 files changed, 35 insertions(+), 24 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index a399750b9e41..316fb1af15e2 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -5135,9 +5135,7 @@ static long btrfs_ioctl_quota_ctl(struct file *file, void __user *arg)
 	struct inode *inode = file_inode(file);
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 	struct btrfs_ioctl_quota_ctl_args *sa;
-	struct btrfs_trans_handle *trans = NULL;
 	int ret;
-	int err;
 
 	if (!capable(CAP_SYS_ADMIN))
 		return -EPERM;
@@ -5153,28 +5151,19 @@ static long btrfs_ioctl_quota_ctl(struct file *file, void __user *arg)
 	}
 
 	down_write(_info->subvol_sem);
-	trans = btrfs_start_transaction(fs_info->tree_root, 2);
-	if (IS_ERR(trans)) {
-		ret = PTR_ERR(trans);
-		goto out;
-	}
 
 	switch (sa->cmd) {
 	case BTRFS_QUOTA_CTL_ENABLE:
-		ret = btrfs_quota_enable(trans, fs_info);
+		ret = btrfs_quota_enable(fs_info);
 		break;
 	case BTRFS_QUOTA_CTL_DISABLE:
-		ret = btrfs_quota_disable(trans, fs_info);
+		ret = btrfs_quota_disable(fs_info);
 		break;
 	default:
 		ret = -EINVAL;
 		break;
 	}
 
-	err = btrfs_commit_transaction(trans);
-	if (err && !ret)
-		ret = err;
-out:
 	kfree(sa);
 	up_write(_info->subvol_sem);
 drop_write:
diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index 1874a6d2e6f5..1d84af0d053f 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -875,8 +875,7 @@ static int btrfs_clean_quota_tree(struct btrfs_trans_handle *trans,
 	return ret;
 }
 
-int btrfs_quota_enable(struct btrfs_trans_handle *trans,
-		   struct btrfs_fs_info *fs_info)
+int btrfs_quota_enable(struct btrfs_fs_info *fs_info)
 {
 	struct btrfs_root *quota_root;
 	struct btrfs_root *tree_root = fs_info->tree_root;
@@ -886,6 +885,7 @@ int btrfs_quota_enable(struct btrfs_trans_handle *trans,
 	struct btrfs_key key;
 	struct btrfs_key found_key;
 	struct btrfs_qgroup *qgroup = NULL;
+	struct btrfs_trans_handle *trans = NULL;
 	int ret = 0;
 	int slot;
 
@@ -893,6 +893,12 @@ int btrfs_quota_enable(struct btrfs_trans_handle *trans,
 	if (fs_info->quota_root)
 		goto out;
 
+	trans = btrfs_start_transaction(tree_root, 2);
+	if (IS_ERR(trans)) {
+		ret = PTR_ERR(trans);
+		goto out;
+	}
+
 	

Re: [GIT PULL] Btrfs updates for 4.18

2018-06-28 Thread David Sterba
On Thu, Jun 28, 2018 at 07:22:59PM +0800, Anand Jain wrote:
>   The circular locking dependency warning occurs at FSSTRESS_PROG.
>   In particular it happens at doproc() in xfstests/ltp/fsstress.c, randomly
>   at any of the commands in
>   opdesc_tops[] = { ..}
>   which involve calling the mmap file operation when there is something
>   to commit.
> 
>   The commit transaction does need device_list_mutex which is also being
>   used for the btrfs_open_devices() in the commit 542c5908abfe84f7.
> 
>   But btrfs_open_devices() is only called at mount, and an mmap() can
>   only be established after the mount has completed. Given this, it's
>   unclear to me why the circular locking dependency check
>   is warning about this.
> 
>   I feel until we have clarity about this and also solve other problem
>   related to the streamlining of uuid_mutex, I suggest we revert
>   542c5908abfe84f7. Sorry for the inconvenience.

Ok, the revert is one option. I'm considering adding both the locks, like
in https://patchwork.kernel.org/patch/10478443/ . This would have no
effect, as btrfs_open_devices is called only from the mount path and the
list_sort is done only the first time, when there are no other
users of the list that are not also under the uuid_mutex.

This passed the syzbot and other tests, so this does not break things
and goes towards pushing the device_list_mutex as the real protection
mechanism for the fs_devices members.

Let me know what you think, the revert should be the last option if we
don't have anything better.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v1] btrfs: quota: Set rescan progress to (u64)-1 if we hit last leaf

2018-06-28 Thread David Sterba
On Wed, Jun 27, 2018 at 06:19:55PM +0800, Qu Wenruo wrote:
> Commit ff3d27a048d9 ("btrfs: qgroup: Finish rescan when hit the last leaf
> of extent tree") added a new exit for rescan finish.
> 
> However after finishing quota rescan, we set
> fs_info->qgroup_rescan_progress to (u64)-1 before we exit through the
> original exit path.
> While we missed that assignment of (u64)-1 in the new exit path.
> 
> The end result is, the quota status item doesn't have the same value.
> (-1 vs the last bytenr + 1)
> Although it doesn't affect quota accounting, it's still better to keep
> the original behavior.
> 
> Reported-by: Misono Tomohiro 
> Fixes: ff3d27a048d9 ("btrfs: qgroup: Finish rescan when hit the last leaf of 
> extent tree")
> Signed-off-by: Qu Wenruo 

Thanks, added to 4.18 queue.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: call trace: WARNING: at /build/linux-uwVqDp/linux-4.16.16/fs/btrfs/ctree.h:1565 btrfs_update_device

2018-06-28 Thread Nikolay Borisov



On 28.06.2018 16:19, Christoph Anton Mitterer wrote:
> Hey.
> 
> On a 4.16.16 kernel with a RAID 1 btrfs I got the following messages
> since today.
> 
> Data seems still to be readable (correctly)... and there are no other
> errors (like SATA errors) in the kernel log.
> 
> Any idea what these could mean?

It means you have a device, whose size is not aligned to 4k. You can fix
this either by resizing the device to be properly aligned to a 4k
boundary (if you are on x86) or better yet, just run :


btrfs rescue fix-device-size /dev/
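
(To confirm the unaligned size beforehand, a quick sketch; the device name
is only an example:)

sz=$(blockdev --getsize64 /dev/sdb2)
echo $(( sz % 4096 ))   # non-zero output means the size is not 4k aligned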

> 
> Thanks,
> Chris.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] Btrfs: keep pages dirty when using btrfs_writepage_fixup_worker

2018-06-28 Thread David Sterba
On Wed, Jun 20, 2018 at 07:56:12AM -0700, Chris Mason wrote:
> For COW, btrfs expects dirty pages to have been through a few setup
> steps.  This includes reserving space for the new block allocations and 
> marking
> the range in the state tree for delayed allocation.
> 
> A few places outside btrfs will dirty pages directly, especially when 
> unmapping
> mmap'd pages.  In order for these to properly go through COW, we run them
> through a fixup worker to wait for stable pages, and do the delalloc prep.
> 
> 87826df0ec36 added a window where the dirty pages were cleaned, but pending
> more action from the fixup worker.

Can you please be more specific about the window, where it starts and
ends?

> During this window, page migration can jump
> in and relocate the page.  Once our fixup work actually starts, it finds
> page->mapping is NULL and we end up freeing the page without ever writing it.

AFAICS the old and new code do the same sequence of calls from the first
mapping check:
ClearPageChecked, unlock_page, put_page, kfree, extent_changeset_free


> This leads to crc errors and other exciting problems, since it screws up the
> whole statemachine for waiting for ordered extents.  The fix here is to keep
> the page dirty while we're waiting for the fixup worker to get to work.  This
> also makes sure the error handling in btrfs_writepage_fixup_worker does the
> right thing with dirty bits when we run out of space.

So this would need to find the mapping first to be not NULL, go until
btrfs_start_ordered_extent where the lock is dropped and back to the again:
label, then check for the mapping that's now NULL?

But I still don't see how this is making things different.

In the remaining sequence

btrfs_lookup_ordered_range, btrfs_delalloc_reserve_space,
btrfs_set_extent_delalloc (without any errors), the clear page checked
comes after the extent is unlocked.

> Signed-off-by: Chris Mason 
> ---
>  fs/btrfs/inode.c | 67 
> +---
>  1 file changed, 49 insertions(+), 18 deletions(-)
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 0b86cf1..5538900 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -2100,11 +2100,21 @@ static void btrfs_writepage_fixup_worker(struct 
> btrfs_work *work)
>   page = fixup->page;
>  again:
>   lock_page(page);
> - if (!page->mapping || !PageDirty(page) || !PageChecked(page)) {
> - ClearPageChecked(page);
> +
> + /*
> +  * before we queued this fixup, we took a reference on the page.
> +  * page->mapping may go NULL, but it shouldn't be moved to a
> +  * different address space.
> +  */
> + if (!page->mapping || !PageDirty(page) || !PageChecked(page))
>   goto out_page;
> - }
>  
> + /*
> +  * we keep the PageChecked() bit set until we're done with the
> +  * btrfs_start_ordered_extent() dance that we do below.  That
> +  * drops and retakes the page lock, so we don't want new
> +  * fixup workers queued for this page during the churn.
> +  */
>   inode = page->mapping->host;
>   page_start = page_offset(page);
>   page_end = page_offset(page) + PAGE_SIZE - 1;
> @@ -2129,33 +2139,46 @@ static void btrfs_writepage_fixup_worker(struct 
> btrfs_work *work)
>  
>   ret = btrfs_delalloc_reserve_space(inode, &data_reserved, page_start,
>  PAGE_SIZE);
> - if (ret) {
> - mapping_set_error(page->mapping, ret);
> - end_extent_writepage(page, ret, page_start, page_end);
> - ClearPageChecked(page);
> - goto out;
> -  }
> + if (ret)
> + goto out_error;
>  
>   ret = btrfs_set_extent_delalloc(inode, page_start, page_end, 0,
>   &cached_state, 0);
> - if (ret) {
> - mapping_set_error(page->mapping, ret);
> - end_extent_writepage(page, ret, page_start, page_end);
> - ClearPageChecked(page);
> - goto out;
> - }
> + if (ret)
> + goto out_error;
>  
> - ClearPageChecked(page);
> - set_page_dirty(page);

Hm, so previously the page was dirty, unconditionally calling down to
set_page_dirty that could call btree_set_page_dirty and
__set_page_dirty_nobuffers. If the dirty bit is set there, it'll do
nothing.

So this should be equivalent to the new code but looks strange to say at
least.

>   btrfs_delalloc_release_extents(BTRFS_I(inode), PAGE_SIZE, false);
> +
> + /*
> +  * everything went as planned, we're now the proud owners of a
> +  * Dirty page with delayed allocation bits set and space reserved
> +  * for our COW destination.
> +  *
> +  * The page was dirty when we started, nothing should have cleaned it.
> +  */
> + BUG_ON(!PageDirty(page));
> +
>  out:
>   unlock_extent_cached(&BTRFS_I(inode)->io_tree, page_start, page_end,
>   &cached_state);
>  out_page:
> +

Re: [PATCH RFC] btrfs: Do extra device generation check at mount time

2018-06-28 Thread Nikolay Borisov



On 28.06.2018 10:04, Qu Wenruo wrote:
> There is a reporter considering btrfs raid1 has a major design flaw
> which can't handle nodatasum files.
> 
> Despite his incorrect expectation, btrfs indeed doesn't handle device
> generation mismatch well.

I think this is getting a bit personal, no need for that. Just say that
btrfs doesn't handle this particular case and that's it. Though I agree
the reporter's attitude wasn't exactly nice...

> 
> This means if one devices missed and re-appeared, even its generation
> no longer matches with the rest device pool, btrfs does nothing to it,
> but treat it as normal good device.
> 
> At least let's detect such generation mismatch and avoid mounting the
> fs.
> Currently there is no automatic rebuild yet, which means if users find
> device generation mismatch error message, they can only mount the fs
> using "device" and "degraded" mount option (if possible), then replace
> the offending device to manually "rebuild" the fs.
> 
> Signed-off-by: Qu Wenruo 
> ---
> I totally understand that, generation based solution can't handle
> split-brain case (where 2 RAID1 devices get mounted degraded separately)
> at all, but at least let's handle what we can do.
> 
> The best way to solve the problem is to make btrfs treat such lower gen
> devices as some kind of missing device, and queue an automatic scrub for
> that device.
> But that's a lot of extra work, at least let's start from detecting such
> problem first.
> ---
>  fs/btrfs/volumes.c | 50 ++
>  1 file changed, 50 insertions(+)
> 
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index e034ad9e23b4..80a7c44993bc 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -6467,6 +6467,49 @@ static void btrfs_report_missing_device(struct 
> btrfs_fs_info *fs_info,
> devid, uuid);
>  }
>  
> +static int verify_devices_generation(struct btrfs_fs_info *fs_info,
> +  struct btrfs_device *dev)
> +{
> + struct btrfs_fs_devices *fs_devices = dev->fs_devices;
> + struct btrfs_device *cur;
> + bool warn_only = false;
> + int ret = 0;
> +
> + if (!fs_devices || fs_devices->seeding || !dev->generation)
> + return 0;
> +
> + /*
> +  * If we're not replaying log, we're completely safe to allow
> +  * generation mismatch as it won't write anything to disks, nor
> +  * remount to rw.
> +  */
> + if (btrfs_test_opt(fs_info, NOLOGREPLAY))
> + warn_only = true;
> +
> + rcu_read_lock();
> + list_for_each_entry_rcu(cur, &fs_devices->devices, dev_list) {
> + if (cur->generation && cur->generation != dev->generation) {
> + if (warn_only) {
> + btrfs_warn_rl_in_rcu(fs_info,
> + "devid %llu has unexpected generation, has %llu expected %llu",
> +  dev->devid,
> +  dev->generation,
> +  cur->generation);
> + } else {
> + btrfs_err_rl_in_rcu(fs_info,
> + "devid %llu has unexpected generation, has %llu expected %llu",
> +  dev->devid,
> +  dev->generation,
> +  cur->generation);
> + ret = -EINVAL;
> + break;
> + }
> + }
> + }
> + rcu_read_unlock();
> + return ret;
> +}
> +
>  static int read_one_chunk(struct btrfs_fs_info *fs_info, struct btrfs_key 
> *key,
> struct extent_buffer *leaf,
> struct btrfs_chunk *chunk)
> @@ -6552,6 +6595,13 @@ static int read_one_chunk(struct btrfs_fs_info 
> *fs_info, struct btrfs_key *key,
>   return PTR_ERR(map->stripes[i].dev);
>   }
>   btrfs_report_missing_device(fs_info, devid, uuid, 
> false);
> + } else {
> + ret = verify_devices_generation(fs_info,
> + map->stripes[i].dev);
> + if (ret < 0) {
> + free_extent_map(em);
> + return ret;
> + }

So this will be executed when doing mount. What about one device
disappearing, while the filesystem is still mounted and then later
re-appears. This code won't be executed in this case, no ?
>   }
>   set_bit(BTRFS_DEV_STATE_IN_FS_METADATA,
>   &(map->stripes[i].dev->dev_state));
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  

[PATCH] Revert "btrfs: fix a possible umount deadlock"

2018-06-28 Thread Nikolay Borisov
Since commit 88c14590cdd6 ("btrfs: use RCU in btrfs_show_devname for
device list traversal") btrfs_show_devname no longer takes
device_list_mutex. As such the deadlock that 0ccd05285e7f ("btrfs: fix
a possible umount deadlock") aimed to fix no longer exists. So remove the 
extra code this commit added. No functional changes. 

This reverts commit 0ccd05285e7f5a8e297e1d6dfc41e7c65757d6fa.

Signed-off-by: Nikolay Borisov 
---

Anand, 

I would like to use the opportunity to ask you about the peculiar 
sequence needed to close btrfs devices. Why do we first replace the closed 
device with a copy in btrfs_close_one_device, then dispose of the copied
devices in btrfs_close_devices IFF we had fs_devices->seed not being NULL? 

 fs/btrfs/volumes.c | 26 ++
 1 file changed, 6 insertions(+), 20 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 5bd6f3a40f9c..70b0ed2ba4df 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1004,7 +1004,7 @@ static void btrfs_close_bdev(struct btrfs_device *device)
blkdev_put(device->bdev, device->mode);
 }
 
-static void btrfs_prepare_close_one_device(struct btrfs_device *device)
+static void btrfs_close_one_device(struct btrfs_device *device)
 {
struct btrfs_fs_devices *fs_devices = device->fs_devices;
struct btrfs_device *new_device;
@@ -1022,6 +1022,8 @@ static void btrfs_prepare_close_one_device(struct 
btrfs_device *device)
if (test_bit(BTRFS_DEV_STATE_MISSING, &device->dev_state))
fs_devices->missing_devices--;
 
+   btrfs_close_bdev(device);
+
new_device = btrfs_alloc_device(NULL, &device->devid,
device->uuid);
BUG_ON(IS_ERR(new_device)); /* -ENOMEM */
@@ -1035,39 +1037,23 @@ static void btrfs_prepare_close_one_device(struct 
btrfs_device *device)
 
list_replace_rcu(&device->dev_list, &new_device->dev_list);
new_device->fs_devices = device->fs_devices;
+
+   call_rcu(&device->rcu, free_device);
 }
 
 static int close_fs_devices(struct btrfs_fs_devices *fs_devices)
 {
struct btrfs_device *device, *tmp;
-   struct list_head pending_put;
-
-   INIT_LIST_HEAD(&pending_put);
 
if (--fs_devices->opened > 0)
return 0;
 
mutex_lock(&fs_devices->device_list_mutex);
list_for_each_entry_safe(device, tmp, &fs_devices->devices, dev_list) {
-   btrfs_prepare_close_one_device(device);
-   list_add(&device->dev_list, &pending_put);
+   btrfs_close_one_device(device);
}
mutex_unlock(&fs_devices->device_list_mutex);
 
-   /*
-* btrfs_show_devname() is using the device_list_mutex,
-* sometimes call to blkdev_put() leads vfs calling
-* into this func. So do put outside of device_list_mutex,
-* as of now.
-*/
-   while (!list_empty(&pending_put)) {
-   device = list_first_entry(&pending_put,
-   struct btrfs_device, dev_list);
-   list_del(&device->dev_list);
-   btrfs_close_bdev(device);
-   call_rcu(&device->rcu, free_device_rcu);
-   }
-
WARN_ON(fs_devices->open_devices);
WARN_ON(fs_devices->rw_devices);
fs_devices->opened = 0;
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files

2018-06-28 Thread Goffredo Baroncelli
On 06/28/2018 04:17 PM, Chris Murphy wrote:
> Btrfs does two, maybe three, bad things:
> 1. No automatic resync. This is a net worse behavior than mdadm and
> lvm, putting data at risk.
> 2. The new data goes in a single chunk; even if the user does a manual
> balance (resync) their data isn't replicated. They must know to do a
> -dconvert balance to replicate the new data. Again this is a net worse
> behavior than mdadm out of the box, putting user data at risk.
> 3. Apparently if nodatacow, given a file with two copies of different
> transid, Btrfs won't always pick the higher transid copy? If true
> that's terrible, and again not at all what mdadm/lvm are doing.

All these could be avoided simply by not allowing a multi-device filesystem to
mount without ensuring that all the devices have the same generation.

In the past I proposed a mount.btrfs helper; I am still thinking that it would
be the right place to
a) put all the checks before mounting the filesystem
b) print the correct information in order to help the user understand what he
has to do to solve the issues

Regarding your point 3), it must be pointed out that in the case of NOCOW files,
even having the same transid is not enough. It is still possible that one copy is
updated before a power failure that prevents the super-block update.
I think that the only way to prevent this from happening is:
  1) using a data journal (which means that each piece of data is written twice)
OR
  2) using a cow filesystem (with cow enabled of course !)

I think that this is a good example of why a battery-backed HW RAID controller
could be better than SW raid. Of course the likelihood of a lot of these problems
could be reduced by using an uninterruptible power supply.


BR
-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files

2018-06-28 Thread Andrei Borzenkov
28.06.2018 12:15, Qu Wenruo wrote:
> 
> 
> On 2018年06月28日 16:16, Andrei Borzenkov wrote:
>> On Thu, Jun 28, 2018 at 8:39 AM, Qu Wenruo  wrote:
>>>
>>>
>>> On 2018年06月28日 11:14, r...@georgianit.com wrote:


 On Wed, Jun 27, 2018, at 10:55 PM, Qu Wenruo wrote:

>
> Please get yourself clear of what other raid1 is doing.

 A drive failure, where the drive is still there when the computer reboots, 
 is a situation that *any* raid 1, (or for that matter, raid 5, raid 6, 
 anything but raid 0) will recover from perfectly without raising a sweat. 
 Some will rebuild the array automatically,
>>>
>>> WOW, that's black magic, at least for RAID1.
>>> The whole RAID1 has no idea of which copy is correct unlike btrfs who
>>> has datasum.
>>>
>>> Don't bother other things, just tell me how to determine which one is
>>> correct?
>>>
>>
>> When one drive fails, it is recorded in meta-data on remaining drives;
>> probably configuration generation number is increased. Next time drive
>> with older generation is not incorporated. Hardware controllers also
>> keep this information in NVRAM and so do not even depend on scanning
>> of other disks.
> 
> Yep, the only possible way to determine such case is from external info.
> 
> For device generation, it's possible to enhance btrfs, but at least we
> could start from detect and refuse to RW mount to avoid possible further
> corruption.
> But anyway, if one really cares about such case, hardware RAID
> controller seems to be the only solution as other software may have the
> same problem.
> 
> And the hardware solution looks pretty interesting, is the write to
> NVRAM 100% atomic? Even at power loss?
> 
>>
>>> The only possibility is that, the misbehaved device missed several super
>>> block update so we have a chance to detect it's out-of-date.
>>> But that's not always working.
>>>
>>
>> Why it should not work as long as any write to array is suspended
>> until superblock on remaining devices is updated?
> 
> What happens if there is no generation gap in device superblock?
> 

Well, you use "generation" in strict btrfs sense, I use "generation"
generically. That is exactly what btrfs apparently lacks currently -
some monotonic counter that is used to record such event.

> If one device got some of its (nodatacow) data written to disk, while
> the other device doesn't get data written, and neither of them reached
> super block update, there is no difference in device superblock, thus no
> way to detect which is correct.
> 

Again, the very fact that a device failed should have triggered an update of
the superblock to record this information, which presumably should increase
some counter.

>>
>>> If you're talking about missing generation check for btrfs, that's
>>> valid, but it's far from a "major design flaw", as there are a lot of
>>> cases where other RAID1 (mdraid or LVM mirrored) can also be affected
>>> (the brain-split case).
>>>
>>
>> That's different. Yes, with software-based raid there is usually no
>> way to detect outdated copy if no other copies are present. Having
>> older valid data is still very different from corrupting newer data.
> 
> While for VDI case (or any VM image file format other than raw), older
> valid data normally means corruption.
> Unless they have their own write-ahead log.
>> Some file format may detect such problem by themselves if they have
> internal checksum, but anyway, older data normally means corruption,
> especially when partial new and partial old.
>

Yes, that's true. But there is really nothing that can be done here,
even theoretically; it's hardly a reason not to do what looks possible.

> On the other hand, with data COW and csum, btrfs can ensure the whole
> filesystem update is atomic (at least for single device).
> So the title, especially the "major design flaw" part, couldn't be more wrong.
> 
>>
 others will automatically kick out the misbehaving drive.  *none* of them 
 will take back the the drive with old data and start commingling that data 
 with good copy.)\ This behaviour from BTRFS is completely abnormal.. and 
 defeats even the most basic expectations of RAID.
>>>
>>> RAID1 can only tolerate 1 missing device, it has nothing to do with
>>> error detection.
>>> And it's impossible to detect such case without extra help.
>>>
>>> Your expectation is completely wrong.
>>>
>>
>> Well ... somehow it is my experience as well ... :)
> 
> Acceptable, but not really apply to software based RAID1.
> 
> Thanks,
> Qu
> 
>>

 I'm not the one who has to clear his expectations here.

 --
 To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html

>>>
> 




signature.asc
Description: OpenPGP digital signature


[PATCH RFC] btrfs: Do extra device generation check at mount time

2018-06-28 Thread Remi Gauvin
On 2018-06-28 10:36 AM, Adam Borowski wrote:

> 
> Uhm, that'd be a nasty regression for the regular (no-nodatacow) case. 
> The vast majority of data is fine, and extents that have been written to
> while a device is missing will be either placed elsewhere (if the filesystem
> knew it was degraded) or read one of the copies to notice a wrong checksum
> and automatically recover (if the device was still falsely believed to be
> good at write time).
> 
> We currently don't have selective scrub yet so resyncing such single-copy

That might not be the case. though I don't really know the numbers
myself and repeating this is hearsay:

crc32 is not infallible.  1 in so many billion errors will be undetected
by it.  In the case of a dropped device with write failures, when you
*know* the data supposedly written to the disk is bad, re-synching from
believed good copy (so long as it passes checksum verification, of
course), is the only way to be certain that the data is good.


Otherwise, you can be left with a Schroedinger's bit somewhere,  (It's
not 0 or 1, but both, depending on which device the filesystem is
reading from at the time.)


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files

2018-06-28 Thread Remi Gauvin
On 2018-06-28 10:17 AM, Chris Murphy wrote:

> 2. The new data goes in a single chunk; even if the user does a manual
> balance (resync) their data isn't replicated. They must know to do a
> -dconvert balance to replicate the new data. Again this is a net worse
> behavior than mdadm out of the box, putting user data at risk.

I'm not sure this is the case.  Even though writes failed to the
disconnected device, btrfs seemed to keep on going as though it *were*.

When the array was re-mounted with both devices (never mounted as
degraded) and scrub was run, scrub took a *long* time fixing errors, at
a whopping 3MB/s, and reported having fixed millions of them.



Re: call trace: WARNING: at /build/linux-uwVqDp/linux-4.16.16/fs/btrfs/ctree.h:1565 btrfs_update_device

2018-06-28 Thread Qu Wenruo


On 2018年06月28日 22:17, Christoph Anton Mitterer wrote:
> On Thu, 2018-06-28 at 22:09 +0800, Qu Wenruo wrote:
>>> [   72.168662] WARNING: CPU: 0 PID: 242 at /build/linux-
>>> uwVqDp/linux-4.16.16/fs/btrfs/ctree.h:1565
>>> btrfs_update_device+0x1b2/0x1c0
>> It looks like it's the old WARN_ON() for unaligned device size.
>> Would you please verify if it is the case?
> 
> # blockdev --getsize64 /dev/sdb2 /dev/sda2
> 999131127296
> 999131127296
> 
> 
> Since getsize64 returns bytes and not sectors, I suppose it would need
> to be aligned to 1024 by the least?
> 
> 999131127296 / 1024 = 975713991,5
> 
> So it's not.

So it's the case.

> 
> 
>> If so, "btrfs rescue fix-device-size" should handle it pretty well.
> 
> I guess this needs to be done with the fs unmounted?

Yep.

> Anything to consider since I have RAID1 (except from running it on both
> devices)?

Nothing special. Btrfs-progs will handle it pretty well.

> 
> 
> Also, it's a bit strange that this error occurred never before (though
> the btrfs-restore manpage says the kernel would check for this since
> 4.11).

Because the WARN_ON() is newly added.

> 
> It would further be nice if btrfs-check would warn about this.

Yep, latest will warn about it, and --repair can also fix it too.

Thanks,
Qu

> 
> 
> Thanks,
> Chris.
> 



signature.asc
Description: OpenPGP digital signature


Re: unsolvable technical issues?

2018-06-28 Thread Adam Borowski
On Wed, Jun 27, 2018 at 08:50:11PM +0200, waxhead wrote:
> Chris Murphy wrote:
> > On Thu, Jun 21, 2018 at 5:13 PM, waxhead  wrote:
> > > According to this:
> > > 
> > > https://stratis-storage.github.io/StratisSoftwareDesign.pdf
> > > Page 4 , section 1.2
> > > 
> > > It claims that BTRFS still have significant technical issues that may 
> > > never
> > > be resolved.
> > > Could someone shed some light on exactly what these technical issues might
> > > be?! What are BTRFS biggest technical problems?
> > 
> > 
> > I think it's appropriate to file an issue and ask what they're
> > referring to. It very well might be use case specific to Red Hat.
> > https://github.com/stratis-storage/stratis-storage.github.io/issues

> https://github.com/stratis-storage/stratis-storage.github.io/issues/1
> 
> Apparently the author has toned down the wording a bit; this confirms that
> the claim was without basis and probably based on "popular myth".
> The document the PDF links to is not yet updated.

It's a company whose profits rely on users choosing it over anything that
competes.  Adding propaganda to a public document is a natural thing for
them to do.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ There's an easy way to tell toy operating systems from real ones.
⣾⠁⢰⠒⠀⣿⡁ Just look at how their shipped fonts display U+1F52B, this makes
⢿⡄⠘⠷⠚⠋⠀ the intended audience obvious.  It's also interesting to see OSes
⠈⠳⣄ go back and forth wrt their intended target.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC] btrfs: Do extra device generation check at mount time

2018-06-28 Thread Adam Borowski
On Thu, Jun 28, 2018 at 03:04:43PM +0800, Qu Wenruo wrote:
> There is a reporter considering btrfs raid1 has a major design flaw
> which can't handle nodatasum files.
> 
> Despite his incorrect expectation, btrfs indeed doesn't handle device
> generation mismatch well.
> 
> This means if one devices missed and re-appeared, even its generation
> no longer matches with the rest device pool, btrfs does nothing to it,
> but treat it as normal good device.
> 
> At least let's detect such generation mismatch and avoid mounting the
> fs.

Uhm, that'd be a nasty regression for the regular (no-nodatacow) case. 
The vast majority of data is fine, and extents that have been written to
while a device is missing will be either placed elsewhere (if the filesystem
knew it was degraded) or read one of the copies to notice a wrong checksum
and automatically recover (if the device was still falsely believed to be
good at write time).

We currently don't have selective scrub yet so resyncing such single-copy
extents is costly, but 1. all will be fine if the data is read, 2. it's
possible to add such a smart resync in the future, far better than a
write-intent bitmap can do.

To do the latter, we can note the last generation the filesystem was known
to be fully coherent (ie, all devices were successfully flushed with no
mysterious write failures), then run selective scrub (perhaps even
automatically) when the filesystem is no longer degraded.  There's some
extra complexity with 3- or 4-way RAID (multiple levels of degradation) but
a single number would help even there.

But even currently, without the above not-yet-written recovery, it's
reasonably safe to continue without scrub -- it's a case of running
partially degraded when the bad copy is already known to be suspicious.

For no-nodatacow data and metadata, that is.

> Currently there is no automatic rebuild yet, which means if users find
> device generation mismatch error message, they can only mount the fs
> using "device" and "degraded" mount option (if possible), then replace
> the offending device to manually "rebuild" the fs.

As nodatacow already means "I don't care about this data, or have another
way of recovering it", I don't quite get why we would drop existing
auto-recovery for a common transient failure case.

If you're paranoid, perhaps some bit "this filesystem has some nodatacow
data on it" could warrant such a block, but it would still need to be
overridable _without_ a need for replace.  There's also the problem that
systemd marks its journal nodatacow (despite it having infamously bad
handling of failures!), and too many distributions infect their default
installs with systemd, meaning such a bit would be on in most cases.

But why would I put all my other data at risk, just because there's a
nodatacow file?  There's a big difference between scrubbing when only a few
transactions worth of data is suspicious and completely throwing away a
mostly-good replica to replace it from the now fully degraded copy.

> I totally understand that, generation based solution can't handle
> split-brain case (where 2 RAID1 devices get mounted degraded separately)
> at all, but at least let's handle what we can do.

Generation can do well at least unless both devices were mounted elsewhere
and got the exact same number of transactions; the problem is that nodatacow
doesn't bump the generation number.

> The best way to solve the problem is to make btrfs treat such lower gen
> devices as some kind of missing device, and queue an automatic scrub for
> that device.
> But that's a lot of extra work, at least let's start from detecting such
> problem first.

I wonder if there's some way to treat problematic nodatacow files as
degraded only?

Nodatacow misses most of btrfs mechanisms, thus to get it done right you'd
need to pretty much copy all of md's logic, with a write-intent bitmap or an
equivalent.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ There's an easy way to tell toy operating systems from real ones.
⣾⠁⢰⠒⠀⣿⡁ Just look at how their shipped fonts display U+1F52B, this makes
⢿⡄⠘⠷⠚⠋⠀ the intended audience obvious.  It's also interesting to see OSes
⠈⠳⣄ go back and forth wrt their intended target.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC] btrfs: Do extra device generation check at mount time

2018-06-28 Thread Alberto Bursi


On 28/06/2018 09:04, Qu Wenruo wrote:
> Despite his incorrect expectation, btrfs indeed doesn't handle device
> generation mismatch well.
>
> This means if one devices missed and re-appeared, even its generation
> no longer matches with the rest device pool, btrfs does nothing to it,
> but treat it as normal good device.
>
> At least let's detect such generation mismatch and avoid mounting the
> fs.
> Currently there is no automatic rebuild yet, which means if users find
> device generation mismatch error message, they can only mount the fs
> using "device" and "degraded" mount option (if possible), then replace
> the offending device to manually "rebuild" the fs.
>

Yes. This is a long-standing issue; handling it this way is similar to
what mdadm (software raid) also does.

Please get this merged fast, don't get bogged down too much with 
integrating with
Anand Jain's branch as this is a big issue and should get at least this 
basic mitigation asap.

-Alberto


Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files

2018-06-28 Thread Chris Murphy
The problems are known with Btrfs raid1, but I think they bear
repeating because they are really not OK.

In the exact same described scenario: a simple clear cut drop off of a
member device, which then later clearly reappears (no transient
failure).

Both mdadm and LVM based raid1 would have re-added the missing device
and resynced it because internal bitmap is the default (on > 100G
arrays for mdadm and always with lvm). Only the new data would be
propagated to user space. Both mdadm and lvm have metadata to know
which drive has stale data in this common scenario.

Btrfs does two, maybe three, bad things:
1. No automatic resync. This is a net worse behavior than mdadm and
lvm, putting data at risk.
2. The new data goes in a single chunk; even if the user does a manual
balance (resync) their data isn't replicated. They must know to do a
-dconvert balance to replicate the new data. Again this is a net worse
behavior than mdadm out of the box, putting user data at risk.
3. Apparently if nodatacow, given a file with two copies of different
transid, Btrfs won't always pick the higher transid copy? If true
that's terrible, and again not at all what mdadm/lvm are doing.


Btrfs can do better because it has more information available to make
unambiguous decisions about data. But it needs to always do at least
as good a job as mdadm/lvm and as reported, that didn't happen. So
some tested is needed in particular case #3 above with nodatacow.
That's a huge bug, if it's true.
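
(For reference, a sketch of the manual "resync" described in points 1 and 2,
assuming a raid1 filesystem mounted at /mnt -- a workaround, not something
btrfs does automatically today:)

btrfs scrub start -Bd /mnt
btrfs balance start -dconvert=raid1,soft -mconvert=raid1,soft /mnt

The scrub rewrites copies that fail checksum verification from the good
mirror, and the 'soft' filter converts only chunks that are not already
raid1, re-replicating any single chunks written while the device was missing.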


Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: call trace: WARNING: at /build/linux-uwVqDp/linux-4.16.16/fs/btrfs/ctree.h:1565 btrfs_update_device

2018-06-28 Thread Christoph Anton Mitterer
On Thu, 2018-06-28 at 22:09 +0800, Qu Wenruo wrote:
> > [   72.168662] WARNING: CPU: 0 PID: 242 at /build/linux-
> > uwVqDp/linux-4.16.16/fs/btrfs/ctree.h:1565
> > btrfs_update_device+0x1b2/0x1c0It
> looks like it's the old WARN_ON() for unaligned device size.
> Would you please verify if it is the case?

# blockdev --getsize64 /dev/sdb2 /dev/sda2
999131127296
999131127296


Since getsize64 returns bytes and not sectors, I suppose it would need
to be aligned to 1024 by the least?

999131127296 / 1024 = 975713991,5

So it's not.


> If so, "btrfs rescue fix-device-size" should handle it pretty well.

I guess this needs to be done with the fs unmounted?
Anything to consider since I have RAID1 (except from running it on both
devices)?


Also, it's a bit strange that this error occurred never before (though
the btrfs-restore manpage says the kernel would check for this since
4.11).

It would further be nice if btrfs-check would warn about this.


Thanks,
Chris.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: call trace: WARNING: at /build/linux-uwVqDp/linux-4.16.16/fs/btrfs/ctree.h:1565 btrfs_update_device

2018-06-28 Thread Qu Wenruo


On 2018年06月28日 21:19, Christoph Anton Mitterer wrote:
> Hey.
> 
> On a 4.16.16 kernel with a RAID 1 btrfs I got the following messages
> since today.
> 
> Data seems still to be readable (correctly)... and there are no other
> errors (like SATA errors) in the kernel log.
> 
> Any idea what these could mean?
> 
> Thanks,
> Chris.
> 
> 
> [   72.168662] WARNING: CPU: 0 PID: 242 at 
> /build/linux-uwVqDp/linux-4.16.16/fs/btrfs/ctree.h:1565 
> btrfs_update_device+0x1b2/0x1c0

It looks like it's the old WARN_ON() for unaligned device size.

Would you please verify if it is the case?
If so, "btrfs rescue fix-device-size" should handle it pretty well.

Thanks,
Qu



signature.asc
Description: OpenPGP digital signature


Re: [PATCH RFC] btrfs: Do extra device generation check at mount time

2018-06-28 Thread Qu Wenruo


On 2018年06月28日 21:26, Anand Jain wrote:
> 
> 
> On 06/28/2018 04:02 PM, Qu Wenruo wrote:
>> Also CC Anand Jain, since he is also working on various device related
>> work, and an expert on this.
>>
>> Especially I'm not pretty sure if such enhancement is already on his
>> agenda or there are already some unmerged patch for this.
> 
>  Right, some of the patches are already in the ML and probably it needs
>  a revival. Unless there are new challenges, my target is to consolidate
>  them and get them integrated by this year. I am trying harder.
> 
>  I think it's a good idea to report the write-hole status to
>  user-land instead of failing the mount, so that a script can trigger
>  the re-silver as required by the use case. I introduced
>  fs_devices::volume_flags, which is under review as of now, and have
>  plans to add the volume degraded state bit which can be accessed via
>  sysfs. So yes, this is being taken care of.

Glad to hear that!

However, I find it pretty hard to trace your latest work; would you mind
sharing the git repo and branch you're working on?

Maybe I could take some time to do some review.

Thanks,
Qu

> 
> 
> Thanks, Anand
> 
> 
>> Thanks,
>> Qu
>>
>> On 2018年06月28日 15:04, Qu Wenruo wrote:
>>> There is a reporter considering btrfs raid1 has a major design flaw
>>> which can't handle nodatasum files.
>>>
>>> Despite his incorrect expectation, btrfs indeed doesn't handle device
>>> generation mismatch well.
>>>
>>> This means if one devices missed and re-appeared, even its generation
>>> no longer matches with the rest device pool, btrfs does nothing to it,
>>> but treat it as normal good device.
>>>
>>> At least let's detect such generation mismatch and avoid mounting the
>>> fs.
>>> Currently there is no automatic rebuild yet, which means if users find
>>> device generation mismatch error message, they can only mount the fs
>>> using "device" and "degraded" mount option (if possible), then replace
>>> the offending device to manually "rebuild" the fs.
>>>
>>> Signed-off-by: Qu Wenruo 
>>> ---
>>> I totally understand that, generation based solution can't handle
>>> split-brain case (where 2 RAID1 devices get mounted degraded separately)
>>> at all, but at least let's handle what we can do.
>>>
>>> The best way to solve the problem is to make btrfs treat such lower gen
>>> devices as some kind of missing device, and queue an automatic scrub for
>>> that device.
>>> But that's a lot of extra work, at least let's start from detecting such
>>> problem first.
>>> ---
>>>   fs/btrfs/volumes.c | 50 ++
>>>   1 file changed, 50 insertions(+)
>>>
>>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>>> index e034ad9e23b4..80a7c44993bc 100644
>>> --- a/fs/btrfs/volumes.c
>>> +++ b/fs/btrfs/volumes.c
>>> @@ -6467,6 +6467,49 @@ static void btrfs_report_missing_device(struct
>>> btrfs_fs_info *fs_info,
>>>     devid, uuid);
>>>   }
>>>   +static int verify_devices_generation(struct btrfs_fs_info *fs_info,
>>> + struct btrfs_device *dev)
>>> +{
>>> +    struct btrfs_fs_devices *fs_devices = dev->fs_devices;
>>> +    struct btrfs_device *cur;
>>> +    bool warn_only = false;
>>> +    int ret = 0;
>>> +
>>> +    if (!fs_devices || fs_devices->seeding || !dev->generation)
>>> +    return 0;
>>> +
>>> +    /*
>>> + * If we're not replaying log, we're completely safe to allow
>>> + * generation mismatch as it won't write anything to disks, nor
>>> + * remount to rw.
>>> + */
>>> +    if (btrfs_test_opt(fs_info, NOLOGREPLAY))
>>> +    warn_only = true;
>>> +
>>> +    rcu_read_lock();
>>> +    list_for_each_entry_rcu(cur, &fs_devices->devices, dev_list) {
>>> +    if (cur->generation && cur->generation != dev->generation) {
>>> +    if (warn_only) {
>>> +    btrfs_warn_rl_in_rcu(fs_info,
>>> +    "devid %llu has unexpected generation, has %llu expected %llu",
>>> + dev->devid,
>>> + dev->generation,
>>> + cur->generation);
>>> +    } else {
>>> +    btrfs_err_rl_in_rcu(fs_info,
>>> +    "devid %llu has unexpected generation, has %llu expected %llu",
>>> + dev->devid,
>>> + dev->generation,
>>> + cur->generation);
>>> +    ret = -EINVAL;
> 
> 
> 
> 
>>> +    break;
>>> +    }
>>> +    }
>>> +    }
>>> +    rcu_read_unlock();
>>> +    return ret;
>>> +}
>>> +
>>>   static int read_one_chunk(struct btrfs_fs_info *fs_info, struct
>>> btrfs_key *key,
>>>     struct extent_buffer *leaf,
>>>     struct btrfs_chunk *chunk)
>>> @@ -6552,6 +6595,13 @@ static int read_one_chunk(struct btrfs_fs_info
>>> *fs_info, struct btrfs_key *key,
>>>   return PTR_ERR(map->stripes[i].dev);
>>>   }
>>>   

Re: [PATCH RFC] btrfs: Do extra device generation check at mount time

2018-06-28 Thread Anand Jain




On 06/28/2018 04:02 PM, Qu Wenruo wrote:

Also CC Anand Jain, since he is also working on various device related
work, and an expert on this.

Especially I'm not pretty sure if such enhancement is already on his
agenda or there are already some unmerged patch for this.


 Right, some of the patches are already in the ML and probably it needs
 a revival. Unless there are new challenges, my target is to consolidate
 them and get them integrated by this year. I am trying harder.

 I think it's a good idea to report the write-hole status to
 userland instead of failing the mount, so that a script can trigger
 the resilver as required by the use case. I introduced
 fs_devices::volume_flags, which is under review as of now, and have
 plans to add a volume degraded state bit which can be accessed through
 sysfs. So yes, this is being taken care of.
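
 To illustrate, once such a degraded-state bit is exported, a monitoring
 script could look roughly like the sketch below. This is only a sketch:
 the "degraded" sysfs attribute is an assumption (btrfs does not export
 it today) and the mount point is a placeholder; only findmnt and the
 scrub command themselves exist.

 MNT=/mnt/data                                # placeholder mount point
 UUID=$(findmnt -no UUID "$MNT")              # fsid of the mounted btrfs
 # hypothetical attribute, not an existing interface
 if [ "$(cat /sys/fs/btrfs/$UUID/degraded 2>/dev/null)" = "1" ]; then
         btrfs scrub start -B "$MNT"          # resilver csum-protected data
 fi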


Thanks, Anand



Thanks,
Qu

On 2018年06月28日 15:04, Qu Wenruo wrote:

There is a reporter considering btrfs raid1 has a major design flaw
which can't handle nodatasum files.

Despite his incorrect expectation, btrfs indeed doesn't handle device
generation mismatch well.

This means if one device went missing and re-appeared, even though its
generation no longer matches the rest of the device pool, btrfs does
nothing to it, but treats it as a normal good device.

At least let's detect such generation mismatch and avoid mounting the
fs.
Currently there is no automatic rebuild yet, which means if users find
device generation mismatch error message, they can only mount the fs
using "device" and "degraded" mount option (if possible), then replace
the offending device to manually "rebuild" the fs.
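
(As an illustration of that manual path only, a rough sketch with
placeholder device names and devid, not taken from the patch itself:

$ mount -o degraded /dev/sda /mnt
$ btrfs filesystem show /mnt               # note the devid of the stale device
$ btrfs replace start -B 2 /dev/sdc /mnt   # replace devid 2 with a new disk

The "device=" mount option can be added if the other members are not
discovered automatically.)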

Signed-off-by: Qu Wenruo 
---
I totally understand that, generation based solution can't handle
split-brain case (where 2 RAID1 devices get mounted degraded separately)
at all, but at least let's handle what we can do.

The best way to solve the problem is to make btrfs treat such lower gen
devices as some kind of missing device, and queue an automatic scrub for
that device.
But that's a lot of extra work, at least let's start from detecting such
problem first.
---
  fs/btrfs/volumes.c | 50 ++
  1 file changed, 50 insertions(+)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index e034ad9e23b4..80a7c44993bc 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6467,6 +6467,49 @@ static void btrfs_report_missing_device(struct 
btrfs_fs_info *fs_info,
  devid, uuid);
  }
  
+static int verify_devices_generation(struct btrfs_fs_info *fs_info,

+struct btrfs_device *dev)
+{
+   struct btrfs_fs_devices *fs_devices = dev->fs_devices;
+   struct btrfs_device *cur;
+   bool warn_only = false;
+   int ret = 0;
+
+   if (!fs_devices || fs_devices->seeding || !dev->generation)
+   return 0;
+
+   /*
+* If we're not replaying log, we're completely safe to allow
+* generation mismatch as it won't write anything to disks, nor
+* remount to rw.
+*/
+   if (btrfs_test_opt(fs_info, NOLOGREPLAY))
+   warn_only = true;
+
+   rcu_read_lock();
+   list_for_each_entry_rcu(cur, &fs_devices->devices, dev_list) {
+   if (cur->generation && cur->generation != dev->generation) {
+   if (warn_only) {
+   btrfs_warn_rl_in_rcu(fs_info,
+   "devid %llu has unexpected generation, has %llu expected %llu",
+dev->devid,
+dev->generation,
+cur->generation);
+   } else {
+   btrfs_err_rl_in_rcu(fs_info,
+   "devid %llu has unexpected generation, has %llu expected %llu",
+dev->devid,
+dev->generation,
+cur->generation);
+   ret = -EINVAL;
+   break;
+   }
+   }
+   }
+   rcu_read_unlock();
+   return ret;
+}
+
  static int read_one_chunk(struct btrfs_fs_info *fs_info, struct btrfs_key 
*key,
  struct extent_buffer *leaf,
  struct btrfs_chunk *chunk)
@@ -6552,6 +6595,13 @@ static int read_one_chunk(struct btrfs_fs_info *fs_info, 
struct btrfs_key *key,
return PTR_ERR(map->stripes[i].dev);
}
btrfs_report_missing_device(fs_info, devid, uuid, 
false);
+   } else {
+   ret = verify_devices_generation(fs_info,
+   map->stripes[i].dev);
+   if (ret < 0) {
+  

Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files

2018-06-28 Thread Anand Jain




On 06/28/2018 09:42 AM, Remi Gauvin wrote:

There seems to be a major design flaw with BTRFS that needs to be better
documented, to avoid massive data loss.

Tested with Raid 1 on Ubuntu Kernel 4.15

The use case being tested was a Virtualbox VDI file created with
NODATACOW attribute, (as is often suggested, due to the painful
performance penalty of COW on these files.)
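
(For context, a nodatacow file like this is usually created by setting the
No_COW attribute on the containing directory, or on an empty file, before
any data is written; a sketch with placeholder paths:

$ mkdir /var/lib/vmimages
$ chattr +C /var/lib/vmimages     # new files created here inherit nodatacow
$ lsattr -d /var/lib/vmimages     # should show the C attribute

Setting +C on a file that already contains data generally has no effect.)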

However, if a device is temporarily dropped (in this case, tested by
disconnecting drives) and re-connects automatically on the next boot, BTRFS
does not in any way synchronize the VDI file, nor does it have any means to
know that one of the copies is out of date and bad.

The result of trying to use said VDI file is interestingly insane.




Scrub did not do anything to rectify the situation.


 Please use balance to rectify it, as it's RAID1: while one of the
 devices was missing, we wrote single chunks.
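
 For example, a sketch with a placeholder mount point, converting any
 single chunks written while the device was missing back to raid1:

 $ btrfs balance start -dconvert=raid1,soft -mconvert=raid1,soft /mnt

 The "soft" filter only touches chunks that are not already in the target
 profile, so this avoids rebalancing the whole filesystem.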

Thanks, Anand


call trace: WARNING: at /build/linux-uwVqDp/linux-4.16.16/fs/btrfs/ctree.h:1565 btrfs_update_device

2018-06-28 Thread Christoph Anton Mitterer
Hey.

On a 4.16.16 kernel with a RAID 1 btrfs I got the following messages
since today.

Data seems still to be readable (correctly)... and there are no other
errors (like SATA errors) in the kernel log.

Any idea what these could mean?

Thanks,
Chris.


[   72.168662] WARNING: CPU: 0 PID: 242 at 
/build/linux-uwVqDp/linux-4.16.16/fs/btrfs/ctree.h:1565 
btrfs_update_device+0x1b2/0x1c0 [btrfs]
[   72.168701] Modules linked in: cpufreq_userspace cpufreq_powersave 
cpufreq_conservative snd_hda_codec_hdmi ip6t_REJECT nf_reject_ipv6 
nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables xt_policy 
ipt_REJECT nf_reject_ipv4 xt_comment xt_tcpudp nf_conntrack_ipv4 powernow_k8 
nf_defrag_ipv4 edac_mce_amd snd_hda_intel kvm_amd snd_hda_codec ccp rng_core 
snd_hda_core kvm snd_hwdep irqbypass snd_pcm wmi_bmof radeon snd_timer ttm 
xt_multiport snd pcspkr soundcore drm_kms_helper k8temp ohci_pci ata_generic 
pata_atiixp ohci_hcd ehci_pci sg wmi xt_conntrack drm nf_conntrack i2c_algo_bit 
ehci_hcd usbcore button sp5100_tco usb_common shpchp i2c_piix4 iptable_filter 
binfmt_misc sunrpc hwmon_vid ip_tables x_tables autofs4 btrfs zstd_decompress 
zstd_compress xxhash sd_mod raid10 raid456 async_raid6_recov
[   72.168776]  async_memcpy async_pq async_xor async_tx libcrc32c 
crc32c_generic xor raid6_pq raid1 raid0 multipath linear md_mod evdev ahci 
libahci serio_raw libata r8169 mii scsi_mod
[   72.168820] CPU: 0 PID: 242 Comm: btrfs-cleaner Not tainted 4.16.0-2-amd64 
#1 Debian 4.16.16-2
[   72.168852] Hardware name: MICRO-STAR INTERNATIONAL CO.,LTD MS-7551/KA780G 
(MS-7551), BIOS V16.6 05/12/2010
[   72.168907] RIP: 0010:btrfs_update_device+0x1b2/0x1c0 [btrfs]
[   72.168939] RSP: 0018:bd5a810a3d60 EFLAGS: 00010206
[   72.168973] RAX: 0fff RBX: 938e847f8000 RCX: 00e8a0db1e00
[   72.169006] RDX: 1000 RSI: 3f5c RDI: 938e7a8015e0
[   72.169040] RBP: 938e8fb97a00 R08: bd5a810a3d10 R09: bd5a810a3d18
[   72.169073] R10: 0003 R11: 3000 R12: 
[   72.169106] R13: 3f3c R14: 938e7a8015e0 R15: 938e8f0c6328
[   72.169140] FS:  () GS:938e9dc0() 
knlGS:
[   72.169177] CS:  0010 DS:  ES:  CR0: 80050033
[   72.169210] CR2: 7fcff92ce000 CR3: 00020575e000 CR4: 06f0
[   72.169243] Call Trace:
[   72.169304]  btrfs_remove_chunk+0x2a9/0x8c0 [btrfs]
[   72.169359]  btrfs_delete_unused_bgs+0x323/0x3f0 [btrfs]
[   72.169415]  ? __btree_submit_bio_start+0x20/0x20 [btrfs]
[   72.169469]  cleaner_kthread+0x152/0x160 [btrfs]
[   72.169506]  kthread+0x113/0x130
[   72.169540]  ? kthread_create_worker_on_cpu+0x70/0x70
[   72.169575]  ? SyS_exit_group+0x10/0x10
[   72.169610]  ret_from_fork+0x35/0x40
[   72.169643] Code: 4c 89 f7 45 31 c0 ba 10 00 00 00 4c 89 ee e8 16 23 ff ff 
4c 89 f7 e8 9e ef fc ff e9 de fe ff ff 41 bc f4 ff ff ff e9 db fe ff ff <0f> 0b 
eb b7 e8 85 4c 1a c5 0f 1f 44 00 00 66 66 66 66 90 41 55 
[   72.169705] ---[ end trace ed549af9d9cf6190 ]---
[   72.170009] WARNING: CPU: 0 PID: 242 at 
/build/linux-uwVqDp/linux-4.16.16/fs/btrfs/ctree.h:1565 
btrfs_update_device+0x1b2/0x1c0 [btrfs]
[   72.170050] Modules linked in: cpufreq_userspace cpufreq_powersave 
cpufreq_conservative snd_hda_codec_hdmi ip6t_REJECT nf_reject_ipv6 
nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables xt_policy 
ipt_REJECT nf_reject_ipv4 xt_comment xt_tcpudp nf_conntrack_ipv4 powernow_k8 
nf_defrag_ipv4 edac_mce_amd snd_hda_intel kvm_amd snd_hda_codec ccp rng_core 
snd_hda_core kvm snd_hwdep irqbypass snd_pcm wmi_bmof radeon snd_timer ttm 
xt_multiport snd pcspkr soundcore drm_kms_helper k8temp ohci_pci ata_generic 
pata_atiixp ohci_hcd ehci_pci sg wmi xt_conntrack drm nf_conntrack i2c_algo_bit 
ehci_hcd usbcore button sp5100_tco usb_common shpchp i2c_piix4 iptable_filter 
binfmt_misc sunrpc hwmon_vid ip_tables x_tables autofs4 btrfs zstd_decompress 
zstd_compress xxhash sd_mod raid10 raid456 async_raid6_recov
[   72.170152]  async_memcpy async_pq async_xor async_tx libcrc32c 
crc32c_generic xor raid6_pq raid1 raid0 multipath linear md_mod evdev ahci 
libahci serio_raw libata r8169 mii scsi_mod
[   72.170204] CPU: 0 PID: 242 Comm: btrfs-cleaner Tainted: GW
4.16.0-2-amd64 #1 Debian 4.16.16-2
[   72.170241] Hardware name: MICRO-STAR INTERNATIONAL CO.,LTD MS-7551/KA780G 
(MS-7551), BIOS V16.6 05/12/2010
[   72.170300] RIP: 0010:btrfs_update_device+0x1b2/0x1c0 [btrfs]
[   72.170333] RSP: 0018:bd5a810a3d60 EFLAGS: 00010206
[   72.170367] RAX: 0fff RBX: 938e847f8000 RCX: 00e8a0db1e00
[   72.170401] RDX: 1000 RSI: 3f5c RDI: 938e7a8015e0
[   72.170434] RBP: 938e8fb97a00 R08: bd5a810a3d10 R09: bd5a810a3d18
[   72.170468] R10: 0003 R11: 3000 R12: 
[   72.170501] R13: 3f3c R14: 938e7a8015e0 R15: 938e8f0c6328
[   

Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files

2018-06-28 Thread Austin S. Hemmelgarn

On 2018-06-28 07:46, Qu Wenruo wrote:



On 2018年06月28日 19:12, Austin S. Hemmelgarn wrote:

On 2018-06-28 05:15, Qu Wenruo wrote:



On 2018年06月28日 16:16, Andrei Borzenkov wrote:

On Thu, Jun 28, 2018 at 8:39 AM, Qu Wenruo 
wrote:



On 2018年06月28日 11:14, r...@georgianit.com wrote:



On Wed, Jun 27, 2018, at 10:55 PM, Qu Wenruo wrote:



Please get yourself clear of what other raid1 is doing.


A drive failure, where the drive is still there when the computer
reboots, is a situation that *any* raid 1, (or for that matter,
raid 5, raid 6, anything but raid 0) will recover from perfectly
without raising a sweat. Some will rebuild the array automatically,


WOW, that's black magic, at least for RAID1.
The whole RAID1 has no idea of which copy is correct unlike btrfs who
has datasum.

Don't bother other things, just tell me how to determine which one is
correct?



When one drive fails, it is recorded in meta-data on remaining drives;
probably configuration generation number is increased. Next time drive
with older generation is not incorporated. Hardware controllers also
keep this information in NVRAM and so do not even depend on scanning
of other disks.


Yep, the only possible way to determine such case is from external info.

For device generation, it's possible to enhance btrfs, but at least we
could start from detect and refuse to RW mount to avoid possible further
corruption.
But anyway, if one really cares about such case, hardware RAID
controller seems to be the only solution as other software may have the
same problem.

LVM doesn't.  It detects that one of the devices was gone for some
period of time and marks the volume as degraded (and _might_, depending
on how you have things configured, automatically re-sync).  Not sure
about MD, but I am willing to bet it properly detects this type of
situation too.


And the hardware solution looks pretty interesting, is the write to
NVRAM 100% atomic? Even at power loss?

On a proper RAID controller, it's battery backed, and that battery
backing provides enough power to also make sure that the state change is
properly recorded in the event of power loss.


Well, that explains a lot of things.

So a hardware RAID controller can be considered as having a special
battery-backed (always atomic) journal device.
If we can't provide UPS for the whole system, a battery powered journal
device indeed makes sense.






The only possibility is that, the misbehaved device missed several
super
block update so we have a chance to detect it's out-of-date.
But that's not always working.



Why it should not work as long as any write to array is suspended
until superblock on remaining devices is updated?


What happens if there is no generation gap in device superblock?

If one device got some of its (nodatacow) data written to disk, while
the other device doesn't get data written, and neither of them reached
super block update, there is no difference in device superblock, thus no
way to detect which is correct.

Yes, but that should be a very small window (at least, once we finally
quit serializing writes across devices), and it's a problem on existing
RAID1 implementations too (and therefore isn't something we should be
using as an excuse for not doing this).





If you're talking about missing generation check for btrfs, that's
valid, but it's far from a "major design flaw", as there are a lot of
cases where other RAID1 (mdraid or LVM mirrored) can also be affected
(the brain-split case).



That's different. Yes, with software-based raid there is usually no
way to detect outdated copy if no other copies are present. Having
older valid data is still very different from corrupting newer data.


While for VDI case (or any VM image file format other than raw), older
valid data normally means corruption.
Unless they have their own write-ahead log.

Some file format may detect such problem by themselves if they have
internal checksum, but anyway, older data normally means corruption,
especially when partial new and partial old.

On the other hand, with data COW and csum, btrfs can ensure the whole
filesystem update is atomic (at least for single device).
So the title, especially the "major design flaw" part, couldn't be more wrong.

The title is excessive, but I'd agree it's a design flaw that BTRFS
doesn't at least notice that the generation ID's are different and
preferentially trust the device with the newer generation ID.


Well, a design flaw should be something that can't be easily fixed
without *huge* on-disk format or behavior change.
Flaw in btrfs' one-subvolume-per-tree metadata design or current extent
booking behavior could be called design flaw.
That would be a structural design flaw.  It's a result of how the
software is structured.  There are other types of design flaws though.


While for things like this, just as the submitted RFC patch, less than
100 lines could change the behavior.
I would still consider this case a design flaw (a purely behavioral one 
not tied to how 

Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files

2018-06-28 Thread Qu Wenruo


On 2018年06月28日 19:12, Austin S. Hemmelgarn wrote:
> On 2018-06-28 05:15, Qu Wenruo wrote:
>>
>>
>> On 2018年06月28日 16:16, Andrei Borzenkov wrote:
>>> On Thu, Jun 28, 2018 at 8:39 AM, Qu Wenruo 
>>> wrote:


 On 2018年06月28日 11:14, r...@georgianit.com wrote:
>
>
> On Wed, Jun 27, 2018, at 10:55 PM, Qu Wenruo wrote:
>
>>
>> Please get yourself clear of what other raid1 is doing.
>
> A drive failure, where the drive is still there when the computer
> reboots, is a situation that *any* raid 1, (or for that matter,
> raid 5, raid 6, anything but raid 0) will recover from perfectly
> without raising a sweat. Some will rebuild the array automatically,

 WOW, that's black magic, at least for RAID1.
 The whole RAID1 has no idea of which copy is correct unlike btrfs who
 has datasum.

 Don't bother other things, just tell me how to determine which one is
 correct?

>>>
>>> When one drive fails, it is recorded in meta-data on remaining drives;
>>> probably configuration generation number is increased. Next time drive
>>> with older generation is not incorporated. Hardware controllers also
>>> keep this information in NVRAM and so do not even depend on scanning
>>> of other disks.
>>
>> Yep, the only possible way to determine such case is from external info.
>>
>> For device generation, it's possible to enhance btrfs, but at least we
>> could start from detect and refuse to RW mount to avoid possible further
>> corruption.
>> But anyway, if one really cares about such case, hardware RAID
>> controller seems to be the only solution as other software may have the
>> same problem.
> LVM doesn't.  It detects that one of the devices was gone for some
> period of time and marks the volume as degraded (and _might_, depending
> on how you have things configured, automatically re-sync).  Not sure
> about MD, but I am willing to bet it properly detects this type of
> situation too.
>>
>> And the hardware solution looks pretty interesting, is the write to
>> NVRAM 100% atomic? Even at power loss?
> On a proper RAID controller, it's battery backed, and that battery
> backing provides enough power to also make sure that the state change is
> properly recorded in the event of power loss.

Well, that explains a lot of things.

So a hardware RAID controller can be considered as having a special
battery-backed (always atomic) journal device.
If we can't provide UPS for the whole system, a battery powered journal
device indeed makes sense.

>>
>>>
 The only possibility is that, the misbehaved device missed several
 super
 block update so we have a chance to detect it's out-of-date.
 But that's not always working.

>>>
>>> Why it should not work as long as any write to array is suspended
>>> until superblock on remaining devices is updated?
>>
>> What happens if there is no generation gap in device superblock?
>>
>> If one device got some of its (nodatacow) data written to disk, while
>> the other device doesn't get data written, and neither of them reached
>> super block update, there is no difference in device superblock, thus no
>> way to detect which is correct.
> Yes, but that should be a very small window (at least, once we finally
> quit serializing writes across devices), and it's a problem on existing
> RAID1 implementations too (and therefore isn't something we should be
> using as an excuse for not doing this).
>>
>>>
 If you're talking about missing generation check for btrfs, that's
 valid, but it's far from a "major design flaw", as there are a lot of
 cases where other RAID1 (mdraid or LVM mirrored) can also be affected
 (the brain-split case).

>>>
>>> That's different. Yes, with software-based raid there is usually no
>>> way to detect outdated copy if no other copies are present. Having
>>> older valid data is still very different from corrupting newer data.
>>
>> While for VDI case (or any VM image file format other than raw), older
>> valid data normally means corruption.
>> Unless they have their own write-ahead log.
>>
>> Some file format may detect such problem by themselves if they have
>> internal checksum, but anyway, older data normally means corruption,
>> especially when partial new and partial old.
>>
>> On the other hand, with data COW and csum, btrfs can ensure the whole
>> filesystem update is atomic (at least for single device).
>> So the title, especially the "major design flaw" part, couldn't be more wrong.
> The title is excessive, but I'd agree it's a design flaw that BTRFS
> doesn't at least notice that the generation ID's are different and
> preferentially trust the device with the newer generation ID.

Well, a design flaw should be something that can't be easily fixed
without *huge* on-disk format or behavior change.
Flaw in btrfs' one-subvolume-per-tree metadata design or current extent
booking behavior could be called design flaw.

While for things like this, just as the submitted RFC 

Re: [GIT PULL] Btrfs updates for 4.18

2018-06-28 Thread Anand Jain




On 06/12/2018 12:16 AM, David Sterba wrote:

On Mon, Jun 11, 2018 at 10:50:54AM +0100, Filipe Manana wrote:

btrfs: replace uuid_mutex by device_list_mutex in
btrfs_open_devices



   *
   * the mutex can be very coarse and can cover long-running operations
   *
   * protects: updates to fs_devices counters like missing devices, rw
devices,
   * seeding, structure cloning, openning/closing devices at mount/umount
time

generates some confusion since btrfs_open_devices(), after that
commit, no longer takes the uuid_mutex and it
updates some fs_devices counters (opened, open_devices, etc).


  As uuid_mutex is a global fs_uuids lock, using it for per-fsid operations
  doesn't make any sense.

  This problem is reproducible only on for-4.18; misc-next is fine.
  I am looking deeper.


What about the unprotected updates (increments) to fs_devices->opened
and fs_devices->open_devices?
Other functions are accessing/updating them while holding the uuid mutex.


The goal is to reduce usage of uuid_mutex only to protect search or
update of the fs_uuids list, everything else should be protected by the
device_list_mutex.

The commit 542c5908abfe84f7 (use device_list_mutex in
btrfs_open_devices) implements that but then the access to the ->opened
member is not protected consistently. There are patches that convert the
use to device_list_mutex but haven't been merged due to refinements or
pending review.

At this point I think we should revert the one commit 542c5908abfe84f7
as it introduces the locking problems and revisit the whole fs_devices
locking scheme again in the next dev cycle. That will be post rc1 as
there might be more to revert.
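
For reference, the revert itself would be the usual one-liner against the
branch in question (commit id as quoted above):

$ git revert 542c5908abfe84f7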




 I tried to narrow this down; it appears that some of the things the
 circular locking dependency check reports don't make sense.
 Below is what I have found so far.

 The test case btrfs/004 can be simplified to the following, which also
 reproduces the problem.

-8<-
$ cat 165
#! /bin/bash
# FS QA Test No. btrfs/165
#

seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"

here=`pwd`
tmp=/tmp/$$
status=1
noise_pid=0

_cleanup()
{
wait
rm -f $tmp.*
}
trap "_cleanup; exit \$status" 0 1 2 3 15

# get standard environment, filters and checks
. ./common/rc
. ./common/filter

# real QA test starts here
_supported_fs btrfs
_supported_os Linux
_require_scratch

rm -f $seqres.full

run_check _scratch_mkfs_sized $((2000 * 1024 * 1024))
run_check _scratch_mount
run_check $FSSTRESS_PROG -d $SCRATCH_MNT -w -p 1 -n 2000 $FSSTRESS_AVOID
run_check _scratch_unmount

echo "done"
status=0
exit
-8<-
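
 To run the reduced case (assuming it is saved as tests/btrfs/165 with a
 matching 165.out file, and SCRATCH_DEV/SCRATCH_MNT are set up in
 local.config):

$ cd xfstests-dev
$ ./check btrfs/165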

 The circular locking dependency warning occurs while FSSTRESS_PROG runs,
 in particular in doproc() in xfstests/ltp/fsstress.c, randomly at any of
 the commands in
 opdesc_tops[] = { ..}
 that involve calling the mmap file operation when there is something
 to commit.

 The commit transaction does need device_list_mutex which is also being
 used for the btrfs_open_devices() in the commit 542c5908abfe84f7.

 But btrfs_open_devices() is only called at mount, and mmap() can
 only happen after the mount has completed. Given this, it's unclear
 to me why the circular locking dependency check is warning about
 this.

 I feel that until we have clarity about this and also solve the other
 problems related to the streamlining of uuid_mutex, I suggest we revert
 542c5908abfe84f7. Sorry for the inconvenience.

Thanks, Anand




Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files

2018-06-28 Thread Austin S. Hemmelgarn

On 2018-06-28 05:15, Qu Wenruo wrote:



On 2018年06月28日 16:16, Andrei Borzenkov wrote:

On Thu, Jun 28, 2018 at 8:39 AM, Qu Wenruo  wrote:



On 2018年06月28日 11:14, r...@georgianit.com wrote:



On Wed, Jun 27, 2018, at 10:55 PM, Qu Wenruo wrote:



Please get yourself clear of what other raid1 is doing.


A drive failure, where the drive is still there when the computer reboots, is a 
situation that *any* raid 1, (or for that matter, raid 5, raid 6, anything but 
raid 0) will recover from perfectly without raising a sweat. Some will rebuild 
the array automatically,


WOW, that's black magic, at least for RAID1.
The whole RAID1 has no idea of which copy is correct unlike btrfs who
has datasum.

Don't bother other things, just tell me how to determine which one is
correct?



When one drive fails, it is recorded in meta-data on remaining drives;
probably configuration generation number is increased. Next time drive
with older generation is not incorporated. Hardware controllers also
keep this information in NVRAM and so do not even depend on scanning
of other disks.


Yep, the only possible way to determine such case is from external info.

For device generation, it's possible to enhance btrfs, but at least we
could start from detect and refuse to RW mount to avoid possible further
corruption.
But anyway, if one really cares about such case, hardware RAID
controller seems to be the only solution as other software may have the
same problem.
LVM doesn't.  It detects that one of the devices was gone for some 
period of time and marks the volume as degraded (and _might_, depending 
on how you have things configured, automatically re-sync).  Not sure 
about MD, but I am willing to bet it properly detects this type of 
situation too.


And the hardware solution looks pretty interesting, is the write to
NVRAM 100% atomic? Even at power loss?
On a proper RAID controller, it's battery backed, and that battery 
backing provides enough power to also make sure that the state change is 
properly recorded in the event of power loss.





The only possibility is that, the misbehaved device missed several super
block update so we have a chance to detect it's out-of-date.
But that's not always working.



Why it should not work as long as any write to array is suspended
until superblock on remaining devices is updated?


What happens if there is no generation gap in device superblock?

If one device got some of its (nodatacow) data written to disk, while
the other device doesn't get data written, and neither of them reached
super block update, there is no difference in device superblock, thus no
way to detect which is correct.
Yes, but that should be a very small window (at least, once we finally 
quit serializing writes across devices), and it's a problem on existing 
RAID1 implementations too (and therefore isn't something we should be 
using as an excuse for not doing this).





If you're talking about missing generation check for btrfs, that's
valid, but it's far from a "major design flaw", as there are a lot of
cases where other RAID1 (mdraid or LVM mirrored) can also be affected
(the brain-split case).



That's different. Yes, with software-based raid there is usually no
way to detect outdated copy if no other copies are present. Having
older valid data is still very different from corrupting newer data.


While for VDI case (or any VM image file format other than raw), older
valid data normally means corruption.
Unless they have their own write-ahead log.

Some file format may detect such problem by themselves if they have
internal checksum, but anyway, older data normally means corruption,
especially when partial new and partial old.

On the other hand, with data COW and csum, btrfs can ensure the whole
filesystem update is atomic (at least for single device).
So the title, especially the "major design flaw" part, couldn't be more wrong.
The title is excessive, but I'd agree it's a design flaw that BTRFS 
doesn't at least notice that the generation ID's are different and 
preferentially trust the device with the newer generation ID. The only 
special handling I can see that would be needed is around volumes 
mounted with the `nodatacow` option, which may not see generation 
changes for a very long time otherwise.





others will automatically kick out the misbehaving drive.  *none* of them will
take back the drive with old data and start commingling that data with the good
copy. This behaviour from BTRFS is completely abnormal, and defeats even the
most basic expectations of RAID.


RAID1 can only tolerate 1 missing device, it has nothing to do with
error detection.
And it's impossible to detect such case without extra help.

Your expectation is completely wrong.



Well ... somehow it is my experience as well ... :)


Acceptable, but not really apply to software based RAID1.

Thanks,
Qu





I'm not the one who has to clear his expectations here.


Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files

2018-06-28 Thread Qu Wenruo


On 2018年06月28日 16:16, Andrei Borzenkov wrote:
> On Thu, Jun 28, 2018 at 8:39 AM, Qu Wenruo  wrote:
>>
>>
>> On 2018年06月28日 11:14, r...@georgianit.com wrote:
>>>
>>>
>>> On Wed, Jun 27, 2018, at 10:55 PM, Qu Wenruo wrote:
>>>

 Please get yourself clear of what other raid1 is doing.
>>>
>>> A drive failure, where the drive is still there when the computer reboots, 
>>> is a situation that *any* raid 1, (or for that matter, raid 5, raid 6, 
>>> anything but raid 0) will recover from perfectly without raising a sweat. 
>>> Some will rebuild the array automatically,
>>
>> WOW, that's black magic, at least for RAID1.
>> The whole RAID1 has no idea of which copy is correct unlike btrfs who
>> has datasum.
>>
>> Don't bother other things, just tell me how to determine which one is
>> correct?
>>
> 
> When one drive fails, it is recorded in meta-data on remaining drives;
> probably configuration generation number is increased. Next time drive
> with older generation is not incorporated. Hardware controllers also
> keep this information in NVRAM and so do not even depend on scanning
> of other disks.

Yep, the only possible way to determine such case is from external info.

For device generation, it's possible to enhance btrfs, but at least we
could start from detect and refuse to RW mount to avoid possible further
corruption.
But anyway, if one really cares about such case, hardware RAID
controller seems to be the only solution as other software may have the
same problem.
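
For what it's worth, the per-device superblock generation can already be
compared by hand today; a sketch, with /dev/sda and /dev/sdb standing in
for the raid1 members:

$ btrfs inspect-internal dump-super /dev/sda | grep '^generation'
$ btrfs inspect-internal dump-super /dev/sdb | grep '^generation'

The device whose generation lags behind the others is the stale one.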

And the hardware solution looks pretty interesting, is the write to
NVRAM 100% atomic? Even at power loss?

> 
>> The only possibility is that, the misbehaved device missed several super
>> block update so we have a chance to detect it's out-of-date.
>> But that's not always working.
>>
> 
> Why it should not work as long as any write to array is suspended
> until superblock on remaining devices is updated?

What happens if there is no generation gap in device superblock?

If one device got some of its (nodatacow) data written to disk, while
the other device doesn't get data written, and neither of them reached
super block update, there is no difference in device superblock, thus no
way to detect which is correct.

> 
>> If you're talking about missing generation check for btrfs, that's
>> valid, but it's far from a "major design flaw", as there are a lot of
>> cases where other RAID1 (mdraid or LVM mirrored) can also be affected
>> (the brain-split case).
>>
> 
> That's different. Yes, with software-based raid there is usually no
> way to detect outdated copy if no other copies are present. Having
> older valid data is still very different from corrupting newer data.

While for VDI case (or any VM image file format other than raw), older
valid data normally means corruption.
Unless they have their own write-ahead log.

Some file format may detect such problem by themselves if they have
internal checksum, but anyway, older data normally means corruption,
especially when partial new and partial old.

On the other hand, with data COW and csum, btrfs can ensure the whole
filesystem update is atomic (at least for single device).
So the title, especially the "major design flaw" part, couldn't be more wrong.

> 
>>> others will automatically kick out the misbehaving drive.  *none* of them
>>> will take back the drive with old data and start commingling that data
>>> with the good copy. This behaviour from BTRFS is completely abnormal, and
>>> defeats even the most basic expectations of RAID.
>>
>> RAID1 can only tolerate 1 missing device, it has nothing to do with
>> error detection.
>> And it's impossible to detect such case without extra help.
>>
>> Your expectation is completely wrong.
>>
> 
> Well ... somehow it is my experience as well ... :)

Acceptable, but not really apply to software based RAID1.

Thanks,
Qu

> 
>>>
>>> I'm not the one who has to clear his expectations here.
>>>
>>>
>>



signature.asc
Description: OpenPGP digital signature


Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files

2018-06-28 Thread Andrei Borzenkov
On Thu, Jun 28, 2018 at 11:16 AM, Andrei Borzenkov  wrote:
> On Thu, Jun 28, 2018 at 8:39 AM, Qu Wenruo  wrote:
>>
>>
>> On 2018年06月28日 11:14, r...@georgianit.com wrote:
>>>
>>>
>>> On Wed, Jun 27, 2018, at 10:55 PM, Qu Wenruo wrote:
>>>

 Please get yourself clear of what other raid1 is doing.
>>>
>>> A drive failure, where the drive is still there when the computer reboots, 
>>> is a situation that *any* raid 1, (or for that matter, raid 5, raid 6, 
>>> anything but raid 0) will recover from perfectly without raising a sweat. 
>>> Some will rebuild the array automatically,
>>
>> WOW, that's black magic, at least for RAID1.
>> The whole RAID1 has no idea of which copy is correct unlike btrfs who
>> has datasum.
>>
>> Don't bother other things, just tell me how to determine which one is
>> correct?
>>
>
> When one drive fails, it is recorded in meta-data on remaining drives;
> probably configuration generation number is increased. Next time drive
> with older generation is not incorporated. Hardware controllers also
> keep this information in NVRAM and so do not even depend on scanning
> of other disks.
>
>> The only possibility is that, the misbehaved device missed several super
>> block update so we have a chance to detect it's out-of-date.
>> But that's not always working.
>>
>
> Why it should not work as long as any write to array is suspended
> until superblock on remaining devices is updated?
>
>> If you're talking about missing generation check for btrfs, that's
>> valid, but it's far from a "major design flaw", as there are a lot of
>> cases where other RAID1 (mdraid or LVM mirrored) can also be affected
>> (the brain-split case).
>>
>
> That's different. Yes, with software-based raid there is usually no
> way to detect outdated copy if no other copies are present. Having
> older valid data is still very different from corrupting newer data.
>
>>> others will automatically kick out the misbehaving drive.  *none* of them
>>> will take back the drive with old data and start commingling that data
>>> with the good copy. This behaviour from BTRFS is completely abnormal, and
>>> defeats even the most basic expectations of RAID.
>>
>> RAID1 can only tolerate 1 missing device, it has nothing to do with
>> error detection.
>> And it's impossible to detect such case without extra help.
>>
>> Your expectation is completely wrong.
>>
>
> Well ... somehow it is my experience as well ... :)

s/experience/expectation/

sorry.

>
>>>
>>> I'm not the one who has to clear his expectations here.
>>>
>>>
>>


Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files

2018-06-28 Thread Andrei Borzenkov
On Thu, Jun 28, 2018 at 8:39 AM, Qu Wenruo  wrote:
>
>
> On 2018年06月28日 11:14, r...@georgianit.com wrote:
>>
>>
>> On Wed, Jun 27, 2018, at 10:55 PM, Qu Wenruo wrote:
>>
>>>
>>> Please get yourself clear of what other raid1 is doing.
>>
>> A drive failure, where the drive is still there when the computer reboots, 
>> is a situation that *any* raid 1, (or for that matter, raid 5, raid 6, 
>> anything but raid 0) will recover from perfectly without raising a sweat. 
>> Some will rebuild the array automatically,
>
> WOW, that's black magic, at least for RAID1.
> The whole RAID1 has no idea of which copy is correct unlike btrfs who
> has datasum.
>
> Don't bother other things, just tell me how to determine which one is
> correct?
>

When one drive fails, it is recorded in meta-data on remaining drives;
probably configuration generation number is increased. Next time drive
with older generation is not incorporated. Hardware controllers also
keep this information in NVRAM and so do not even depend on scanning
of other disks.

> The only possibility is that, the misbehaved device missed several super
> block update so we have a chance to detect it's out-of-date.
> But that's not always working.
>

Why it should not work as long as any write to array is suspended
until superblock on remaining devices is updated?

> If you're talking about missing generation check for btrfs, that's
> valid, but it's far from a "major design flaw", as there are a lot of
> cases where other RAID1 (mdraid or LVM mirrored) can also be affected
> (the brain-split case).
>

That's different. Yes, with software-based raid there is usually no
way to detect outdated copy if no other copies are present. Having
older valid data is still very different from corrupting newer data.

>> others will automatically kick out the misbehaving drive.  *none* of them
>> will take back the drive with old data and start commingling that data
>> with the good copy. This behaviour from BTRFS is completely abnormal, and
>> defeats even the most basic expectations of RAID.
>
> RAID1 can only tolerate 1 missing device, it has nothing to do with
> error detection.
> And it's impossible to detect such case without extra help.
>
> Your expectation is completely wrong.
>

Well ... somehow it is my experience as well ... :)

>>
>> I'm not the one who has to clear his expectations here.
>>
>>
>


Re: Enabling quota may not correctly rescan on 4.17

2018-06-28 Thread Misono Tomohiro
On 2018/06/28 16:12, Qu Wenruo wrote:
> 
> 
> On 2018年06月27日 16:25, Misono Tomohiro wrote:
>> On 2018/06/27 17:10, Qu Wenruo wrote:
>>>
>>>
>>> On 2018年06月26日 14:00, Misono Tomohiro wrote:
 Hello Nikolay,

 I noticed that commit 5d23515be669 ("btrfs: Move qgroup rescan
 on quota enable to btrfs_quota_enable") in 4.17 sometimes causes the
 quota rescan to fail when quota is enabled.

 Simple reproducer:

 $ mkfs.btrfs -f $DEV
 $ mount $DEV /mnt
 $ dd if=/dev/urandom of=/mnt/file bs=1000 count=1000
 $ btrfs quota enable /mnt
 $ umount /mnt
 $ btrfs check $DEV
 ...
> Unfortunately in my environment, btrfs/114 failed to reproduce it with
> 1024 runs overnight, with v4.18-rc1 kernel.
> 
> Would you please provide the whole btrfs-image dump of the corrupted fs?

Yes.
The attached file is an image dump of the above reproducer (kernel 4.17.0,
progs 4.17), as the dump from btrfs/114 is a bit large for mail.
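
(For reference, such a dump can be created with btrfs-image against the
unmounted scratch device; the output path below is a placeholder:

$ btrfs-image -c9 -t4 /dev/sdh1 /tmp/quota-bug.img

-c9 selects maximum compression and -t4 uses four threads, which keeps the
image small enough to attach.)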

Though this does not always happen, I see the failure both on 4.17.0 and
4.18-rc2.

Thanks,
Tomohiro Misono

> 
> There are several different assumptions on how the bug happens, with
> your btrfs-image dump, it would help a lot to rule out some assumption.
> 
> Thanks,
> Qu


btrfs-image
Description: Binary data


Re: [PATCH RFC] btrfs: Do extra device generation check at mount time

2018-06-28 Thread Qu Wenruo
Also CC Anand Jain, since he is also working on various device related
work, and an expert on this.

Especially, I'm not quite sure whether such an enhancement is already on his
agenda or whether there are already some unmerged patches for this.

Thanks,
Qu

On 2018年06月28日 15:04, Qu Wenruo wrote:
> There is a reporter considering btrfs raid1 has a major design flaw
> which can't handle nodatasum files.
> 
> Despite his incorrect expectation, btrfs indeed doesn't handle device
> generation mismatch well.
> 
> This means if one device went missing and re-appeared, even though its
> generation no longer matches the rest of the device pool, btrfs does
> nothing to it, but treats it as a normal good device.
> 
> At least let's detect such generation mismatch and avoid mounting the
> fs.
> Currently there is no automatic rebuild yet, which means if users find
> device generation mismatch error message, they can only mount the fs
> using "device" and "degraded" mount option (if possible), then replace
> the offending device to manually "rebuild" the fs.
> 
> Signed-off-by: Qu Wenruo 
> ---
> I totally understand that, generation based solution can't handle
> split-brain case (where 2 RAID1 devices get mounted degraded separately)
> at all, but at least let's handle what we can do.
> 
> The best way to solve the problem is to make btrfs treat such lower gen
> devices as some kind of missing device, and queue an automatic scrub for
> that device.
> But that's a lot of extra work, at least let's start from detecting such
> problem first.
> ---
>  fs/btrfs/volumes.c | 50 ++
>  1 file changed, 50 insertions(+)
> 
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index e034ad9e23b4..80a7c44993bc 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -6467,6 +6467,49 @@ static void btrfs_report_missing_device(struct 
> btrfs_fs_info *fs_info,
> devid, uuid);
>  }
>  
> +static int verify_devices_generation(struct btrfs_fs_info *fs_info,
> +  struct btrfs_device *dev)
> +{
> + struct btrfs_fs_devices *fs_devices = dev->fs_devices;
> + struct btrfs_device *cur;
> + bool warn_only = false;
> + int ret = 0;
> +
> + if (!fs_devices || fs_devices->seeding || !dev->generation)
> + return 0;
> +
> + /*
> +  * If we're not replaying log, we're completely safe to allow
> +  * generation mismatch as it won't write anything to disks, nor
> +  * remount to rw.
> +  */
> + if (btrfs_test_opt(fs_info, NOLOGREPLAY))
> + warn_only = true;
> +
> + rcu_read_lock();
> + list_for_each_entry_rcu(cur, _devices->devices, dev_list) {
> + if (cur->generation && cur->generation != dev->generation) {
> + if (warn_only) {
> + btrfs_warn_rl_in_rcu(fs_info,
> + "devid %llu has unexpected generation, has %llu expected %llu",
> +  dev->devid,
> +  dev->generation,
> +  cur->generation);
> + } else {
> + btrfs_err_rl_in_rcu(fs_info,
> + "devid %llu has unexpected generation, has %llu expected %llu",
> +  dev->devid,
> +  dev->generation,
> +  cur->generation);
> + ret = -EINVAL;
> + break;
> + }
> + }
> + }
> + rcu_read_unlock();
> + return ret;
> +}
> +
>  static int read_one_chunk(struct btrfs_fs_info *fs_info, struct btrfs_key 
> *key,
> struct extent_buffer *leaf,
> struct btrfs_chunk *chunk)
> @@ -6552,6 +6595,13 @@ static int read_one_chunk(struct btrfs_fs_info 
> *fs_info, struct btrfs_key *key,
>   return PTR_ERR(map->stripes[i].dev);
>   }
>   btrfs_report_missing_device(fs_info, devid, uuid, 
> false);
> + } else {
> + ret = verify_devices_generation(fs_info,
> + map->stripes[i].dev);
> + if (ret < 0) {
> + free_extent_map(em);
> + return ret;
> + }
>   }
>   set_bit(BTRFS_DEV_STATE_IN_FS_METADATA,
>   &(map->stripes[i].dev->dev_state));
> 



signature.asc
Description: OpenPGP digital signature


RE: [PATCH RFC] btrfs: Do extra device generation check at mount time

2018-06-28 Thread Paul Jones
> -Original Message-
> From: linux-btrfs-ow...@vger.kernel.org  ow...@vger.kernel.org> On Behalf Of Qu Wenruo
> Sent: Thursday, 28 June 2018 5:16 PM
> To: Nikolay Borisov ; Qu Wenruo ;
> linux-btrfs@vger.kernel.org
> Subject: Re: [PATCH RFC] btrfs: Do extra device generation check at mount
> time
> 
> 
> 
> On 2018年06月28日 15:06, Nikolay Borisov wrote:
> >
> >
> > On 28.06.2018 10:04, Qu Wenruo wrote:
> >> There is a reporter considering btrfs raid1 has a major design flaw
> >> which can't handle nodatasum files.
> >>
> >> Despite his incorrect expectation, btrfs indeed doesn't handle device
> >> generation mismatch well.
> >>
> >> This means if one device went missing and re-appeared, even though its
> >> generation no longer matches the rest of the device pool, btrfs does
> >> nothing to it, but treats it as a normal good device.
> >>
> >> At least let's detect such generation mismatch and avoid mounting the
> >> fs.
> >> Currently there is no automatic rebuild yet, which means if users
> >> find device generation mismatch error message, they can only mount
> >> the fs using "device" and "degraded" mount option (if possible), then
> >> replace the offending device to manually "rebuild" the fs.
> >>
> >> Signed-off-by: Qu Wenruo 
> >
> > I think a testcase of this functionality is important as well.
> 
> It's currently an RFC patch, test case would come along with final version.
> 
> I'd like to make sure everyone, including developers and end-users, is fine
> with the strict error-out behavior.

I've been bitten by this before and was most surprised the first time it 
happened. I had assumed that of course btrfs would check such a thing before 
mounting.
Refusing to mount is a great first step, auto scrub is even better, and only 
"scrubbing" files with incorrect generation is better yet.

Paul.



 



Re: [PATCH RFC] btrfs: Do extra device generation check at mount time

2018-06-28 Thread Qu Wenruo



On 2018年06月28日 15:17, Su Yue wrote:
> 
> 
> On 06/28/2018 03:04 PM, Qu Wenruo wrote:
>> There is a reporter considering btrfs raid1 has a major design flaw
>> which can't handle nodatasum files.
>>
>> Despite his incorrect expectation, btrfs indeed doesn't handle device
>> generation mismatch well.
>>
> Just say " btrfs indeed doesn't handle device generation mismatch well."
> is enough.
> Otherwise, someday someone may be confused about who the reporter was
> and what was reported.

I'm just a little over-reacting to that rude reporter.

Will definitely change that part of commit message in final version.

Thanks,
Qu

> 
> Thanks,
> Su
>> This means if one device went missing and re-appeared, even though its
>> generation no longer matches the rest of the device pool, btrfs does
>> nothing to it, but treats it as a normal good device.
>>
>> At least let's detect such generation mismatch and avoid mounting the
>> fs.
>> Currently there is no automatic rebuild yet, which means if users find
>> device generation mismatch error message, they can only mount the fs
>> using "device" and "degraded" mount option (if possible), then replace
>> the offending device to manually "rebuild" the fs.
>>
>> Signed-off-by: Qu Wenruo 
>> ---
>> I totally understand that, generation based solution can't handle
>> split-brain case (where 2 RAID1 devices get mounted degraded separately)
>> at all, but at least let's handle what we can do.
>>
>> The best way to solve the problem is to make btrfs treat such lower gen
>> devices as some kind of missing device, and queue an automatic scrub for
>> that device.
>> But that's a lot of extra work, at least let's start from detecting such
>> problem first.
>> ---
>>   fs/btrfs/volumes.c | 50 ++
>>   1 file changed, 50 insertions(+)
>>
>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>> index e034ad9e23b4..80a7c44993bc 100644
>> --- a/fs/btrfs/volumes.c
>> +++ b/fs/btrfs/volumes.c
>> @@ -6467,6 +6467,49 @@ static void btrfs_report_missing_device(struct
>> btrfs_fs_info *fs_info,
>>     devid, uuid);
>>   }
>>   +static int verify_devices_generation(struct btrfs_fs_info *fs_info,
>> + struct btrfs_device *dev)
>> +{
>> +    struct btrfs_fs_devices *fs_devices = dev->fs_devices;
>> +    struct btrfs_device *cur;
>> +    bool warn_only = false;
>> +    int ret = 0;
>> +
>> +    if (!fs_devices || fs_devices->seeding || !dev->generation)
>> +    return 0;
>> +
>> +    /*
>> + * If we're not replaying log, we're completely safe to allow
>> + * generation mismatch as it won't write anything to disks, nor
>> + * remount to rw.
>> + */
>> +    if (btrfs_test_opt(fs_info, NOLOGREPLAY))
>> +    warn_only = true;
>> +
>> +    rcu_read_lock();
>> +    list_for_each_entry_rcu(cur, &fs_devices->devices, dev_list) {
>> +    if (cur->generation && cur->generation != dev->generation) {
>> +    if (warn_only) {
>> +    btrfs_warn_rl_in_rcu(fs_info,
>> +    "devid %llu has unexpected generation, has %llu expected %llu",
>> + dev->devid,
>> + dev->generation,
>> + cur->generation);
>> +    } else {
>> +    btrfs_err_rl_in_rcu(fs_info,
>> +    "devid %llu has unexpected generation, has %llu expected %llu",
>> + dev->devid,
>> + dev->generation,
>> + cur->generation);
>> +    ret = -EINVAL;
>> +    break;
>> +    }
>> +    }
>> +    }
>> +    rcu_read_unlock();
>> +    return ret;
>> +}
>> +
>>   static int read_one_chunk(struct btrfs_fs_info *fs_info, struct
>> btrfs_key *key,
>>     struct extent_buffer *leaf,
>>     struct btrfs_chunk *chunk)
>> @@ -6552,6 +6595,13 @@ static int read_one_chunk(struct btrfs_fs_info
>> *fs_info, struct btrfs_key *key,
>>   return PTR_ERR(map->stripes[i].dev);
>>   }
>>   btrfs_report_missing_device(fs_info, devid, uuid, false);
>> +    } else {
>> +    ret = verify_devices_generation(fs_info,
>> +    map->stripes[i].dev);
>> +    if (ret < 0) {
>> +    free_extent_map(em);
>> +    return ret;
>> +    }
>>   }
>>   set_bit(BTRFS_DEV_STATE_IN_FS_METADATA,
>>   &(map->stripes[i].dev->dev_state));
>>
> 
> 


Re: [PATCH RFC] btrfs: Do extra device generation check at mount time

2018-06-28 Thread Qu Wenruo



On 2018年06月28日 15:06, Nikolay Borisov wrote:
> 
> 
> On 28.06.2018 10:04, Qu Wenruo wrote:
>> There is a reporter considering btrfs raid1 has a major design flaw
>> which can't handle nodatasum files.
>>
>> Despite his incorrect expectation, btrfs indeed doesn't handle device
>> generation mismatch well.
>>
>> This means if one device went missing and re-appeared, even though its
>> generation no longer matches the rest of the device pool, btrfs does
>> nothing to it, but treats it as a normal good device.
>>
>> At least let's detect such generation mismatch and avoid mounting the
>> fs.
>> Currently there is no automatic rebuild yet, which means if users find
>> device generation mismatch error message, they can only mount the fs
>> using "device" and "degraded" mount option (if possible), then replace
>> the offending device to manually "rebuild" the fs.
>>
>> Signed-off-by: Qu Wenruo 
> 
> I think a testcase of this functionality is important as well.

It's currently an RFC patch, test case would come along with final version.

I'd like to make sure everyone, including developers and end-users, is
fine with the strict error-out behavior.

Thanks,
Qu

>> ---
>> I totally understand that, generation based solution can't handle
>> split-brain case (where 2 RAID1 devices get mounted degraded separately)
>> at all, but at least let's handle what we can do.
>>
>> The best way to solve the problem is to make btrfs treat such lower gen
>> devices as some kind of missing device, and queue an automatic scrub for
>> that device.
>> But that's a lot of extra work, at least let's start from detecting such
>> problem first.
>> ---
>>  fs/btrfs/volumes.c | 50 ++
>>  1 file changed, 50 insertions(+)
>>
>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>> index e034ad9e23b4..80a7c44993bc 100644
>> --- a/fs/btrfs/volumes.c
>> +++ b/fs/btrfs/volumes.c
>> @@ -6467,6 +6467,49 @@ static void btrfs_report_missing_device(struct 
>> btrfs_fs_info *fs_info,
>>devid, uuid);
>>  }
>>  
>> +static int verify_devices_generation(struct btrfs_fs_info *fs_info,
>> + struct btrfs_device *dev)
>> +{
>> +struct btrfs_fs_devices *fs_devices = dev->fs_devices;
>> +struct btrfs_device *cur;
>> +bool warn_only = false;
>> +int ret = 0;
>> +
>> +if (!fs_devices || fs_devices->seeding || !dev->generation)
>> +return 0;
>> +
>> +/*
>> + * If we're not replaying log, we're completely safe to allow
>> + * generation mismatch as it won't write anything to disks, nor
>> + * remount to rw.
>> + */
>> +if (btrfs_test_opt(fs_info, NOLOGREPLAY))
>> +warn_only = true;
>> +
>> +rcu_read_lock();
>> +list_for_each_entry_rcu(cur, &fs_devices->devices, dev_list) {
>> +if (cur->generation && cur->generation != dev->generation) {
>> +if (warn_only) {
>> +btrfs_warn_rl_in_rcu(fs_info,
>> +"devid %llu has unexpected generation, has %llu expected %llu",
>> + dev->devid,
>> + dev->generation,
>> + cur->generation);
>> +} else {
>> +btrfs_err_rl_in_rcu(fs_info,
>> +"devid %llu has unexpected generation, has %llu expected %llu",
>> + dev->devid,
>> + dev->generation,
>> + cur->generation);
>> +ret = -EINVAL;
>> +break;
>> +}
>> +}
>> +}
>> +rcu_read_unlock();
>> +return ret;
>> +}
>> +
>>  static int read_one_chunk(struct btrfs_fs_info *fs_info, struct btrfs_key 
>> *key,
>>struct extent_buffer *leaf,
>>struct btrfs_chunk *chunk)
>> @@ -6552,6 +6595,13 @@ static int read_one_chunk(struct btrfs_fs_info 
>> *fs_info, struct btrfs_key *key,
>>  return PTR_ERR(map->stripes[i].dev);
>>  }
>>  btrfs_report_missing_device(fs_info, devid, uuid, 
>> false);
>> +} else {
>> +ret = verify_devices_generation(fs_info,
>> +map->stripes[i].dev);
>> +if (ret < 0) {
>> +free_extent_map(em);
>> +return ret;
>> +}
>>  }
>>  set_bit(BTRFS_DEV_STATE_IN_FS_METADATA,
>>  &(map->stripes[i].dev->dev_state));
>>

Re: Enabling quota may not correctly rescan on 4.17

2018-06-28 Thread Qu Wenruo



On 2018年06月27日 16:25, Misono Tomohiro wrote:
> On 2018/06/27 17:10, Qu Wenruo wrote:
>>
>>
>> On 2018年06月26日 14:00, Misono Tomohiro wrote:
>>> Hello Nikolay,
>>>
>>> I noticed that commit 5d23515be669 ("btrfs: Move qgroup rescan
>>> on quota enable to btrfs_quota_enable") in 4.17 sometimes causes the
>>> quota rescan to fail when quota is enabled.
>>>
>>> Simple reproducer:
>>>
>>> $ mkfs.btrfs -f $DEV
>>> $ mount $DEV /mnt
>>> $ dd if=/dev/urandom of=/mnt/file bs=1000 count=1000
>>> $ btrfs quota enable /mnt
>>> $ umount /mnt
>>> $ btrfs check $DEV
>>> ...
>>> checking quota groups
>>> Counts for qgroup id: 0/5 are different
>>> our:referenced 1019904 referenced compressed 1019904
>>> disk:   referenced 16384 referenced compressed 16384
>>> diff:   referenced 1003520 referenced compressed 1003520
>>> our:exclusive 1019904 exclusive compressed 1019904
>>> disk:   exclusive 16384 exclusive compressed 16384
>>> diff:   exclusive 1003520 exclusive compressed 1003520
>>> found 1413120 bytes used, error(s) found
>>> ...
>>>
>>> This can be also observed in btrfs/114. (Note that progs < 4.17
>>> returns error code 0 even if quota is not consistency and therefore
>>> test will incorrectly pass.)
>>
>> BTW, would you please try to dump the quota tree for such mismatch case?
>>
>> It could be a btrfs-progs bug which it should skip quota checking if it
>> found the quota status item has RESCAN flag.
> 
> Yes, this is what I see after running btrfs/114 (/dev/sdh1 is scratch dev):
> 
> $ sudo btrfs check -Q /dev/sdh1
> Checking filesystem on /dev/sdh1
> UUID: d07f6028-0ae7-40d4-ac45-01a4505ddcfb
> Print quota groups for /dev/sdh1
> UUID: d07f6028-0ae7-40d4-ac45-01a4505ddcfb
> Counts for qgroup id: 0/5 are different
> our:referenced 170999808 referenced compressed 170999808
> disk:   referenced 16384 referenced compressed 16384
> diff:   referenced 170983424 referenced compressed 170983424
> our:exclusive 170999808 exclusive compressed 170999808
> disk:   exclusive 16384 exclusive compressed 16384
> diff:   exclusive 170983424 exclusive compressed 170983424
> 

Unfortunately in my environment, btrfs/114 failed to reproduce it with
1024 runs overnight, with v4.18-rc1 kernel.

Would you please provide the whole btrfs-image dump of the corrupted fs?

There are several different assumptions on how the bug happens, with
your btrfs-image dump, it would help a lot to rule out some assumption.

Thanks,
Qu

> 
> $ sudo btrfs inspect-internal dump-tree -t quota /dev/sdh1
> btrfs-progs v4.17
> quota tree key (QUOTA_TREE ROOT_ITEM 0)
> leaf 213958656 items 3 free space 16096 generation 9 owner QUOTA_TREE
> leaf 213958656 flags 0x1(WRITTEN) backref revision 1
> fs uuid d07f6028-0ae7-40d4-ac45-01a4505ddcfb
> chunk uuid 78d753d0-eeb7-4c3e-b825-b6c2c5de5c7a
> item 0 key (0 QGROUP_STATUS 0) itemoff 16251 itemsize 32
> version 1 generation 9 flags ON scan 30572545
> item 1 key (0 QGROUP_INFO 0/5) itemoff 16211 itemsize 40
> generation 7
> referenced 16384 referenced_compressed 16384
> exclusive 16384 exclusive_compressed 16384
> item 2 key (0 QGROUP_LIMIT 0/5) itemoff 16171 itemsize 40
> flags 0
> max_referenced 0 max_exclusive 0
> rsv_referenced 0 rsv_exclusive 0
> total bytes 26843545600
> bytes used 171769856
> uuid d07f6028-0ae7-40d4-ac45-01a4505ddcfb
> 
> 
> And if I mount+rescan again:
> 
> $ sudo mount /dev/sdh1 /mnt
> $ sudo btrfs quota rescan -w /mnt
> $ sudo umount /mnt
> 
> $ sudo btrfs check -Q /dev/sdh1
> Checking filesystem on /dev/sdh1
> UUID: d07f6028-0ae7-40d4-ac45-01a4505ddcfb
> Print quota groups for /dev/sdh1
> UUID: d07f6028-0ae7-40d4-ac45-01a4505ddcfb
> Counts for qgroup id: 0/5
> our:referenced 170999808 referenced compressed 170999808
> disk:   referenced 170999808 referenced compressed 170999808
> our:exclusive 170999808 exclusive compressed 170999808
> disk:   exclusive 170999808 exclusive compressed 170999808
> 
> $ sudo btrfs inspect-internal dump-tree -t quota /dev/sdh1
> btrfs-progs v4.17
> quota tree key (QUOTA_TREE ROOT_ITEM 0)
> leaf 31309824 items 3 free space 16096 generation 13 owner QUOTA_TREE
> leaf 31309824 flags 0x1(WRITTEN) backref revision 1
> fs uuid d07f6028-0ae7-40d4-ac45-01a4505ddcfb
> chunk uuid 78d753d0-eeb7-4c3e-b825-b6c2c5de5c7a
> item 0 key (0 QGROUP_STATUS 0) itemoff 16251 itemsize 32
> version 1 generation 13 flags ON scan 213827585
> item 1 key (0 QGROUP_INFO 0/5) itemoff 16211 itemsize 40
> generation 11
> referenced 170999808 referenced_compressed 170999808
> exclusive 170999808 exclusive_compressed 170999808
> item 2 key (0 QGROUP_LIMIT 0/5) itemoff 16171 itemsize 40
> flags 0
>

Re: [PATCH RFC] btrfs: Do extra device generation check at mount time

2018-06-28 Thread Su Yue




On 06/28/2018 03:04 PM, Qu Wenruo wrote:

There is a reporter who considers btrfs raid1 to have a major design flaw
because it can't handle nodatasum files.

Despite his incorrect expectation, btrfs indeed doesn't handle device
generation mismatch well.


Just say " btrfs indeed doesn't handle device generation mismatch well."
is enough.
Otherwise on someday someone may be confused about who is the reporter 
and what was reported.


Thanks,
Su

This means that if one device went missing and re-appeared, even though its
generation no longer matches the rest of the device pool, btrfs does nothing
about it, but treats it as a normal good device.

At least let's detect such a generation mismatch and avoid mounting the
fs.
Currently there is no automatic rebuild yet, which means that if users see
the device generation mismatch error message, they can only mount the fs
using the "device" and "degraded" mount options (if possible), then replace
the offending device to manually "rebuild" the fs.

Signed-off-by: Qu Wenruo 
---
I totally understand that a generation-based solution can't handle the
split-brain case (where 2 RAID1 devices get mounted degraded separately)
at all, but at least let's handle what we can.

The best way to solve the problem is to make btrfs treat such lower gen
devices as some kind of missing device, and queue an automatic scrub for
that device.
But that's a lot of extra work; at least let's start by detecting the
problem first.
---
  fs/btrfs/volumes.c | 50 ++
  1 file changed, 50 insertions(+)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index e034ad9e23b4..80a7c44993bc 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6467,6 +6467,49 @@ static void btrfs_report_missing_device(struct 
btrfs_fs_info *fs_info,
  devid, uuid);
  }
  
+static int verify_devices_generation(struct btrfs_fs_info *fs_info,
+				     struct btrfs_device *dev)
+{
+   struct btrfs_fs_devices *fs_devices = dev->fs_devices;
+   struct btrfs_device *cur;
+   bool warn_only = false;
+   int ret = 0;
+
+   if (!fs_devices || fs_devices->seeding || !dev->generation)
+   return 0;
+
+   /*
+* If we're not replaying log, we're completely safe to allow
+* generation mismatch as it won't write anything to disks, nor
+* remount to rw.
+*/
+   if (btrfs_test_opt(fs_info, NOLOGREPLAY))
+   warn_only = true;
+
+   rcu_read_lock();
+   list_for_each_entry_rcu(cur, &fs_devices->devices, dev_list) {
+   if (cur->generation && cur->generation != dev->generation) {
+   if (warn_only) {
+   btrfs_warn_rl_in_rcu(fs_info,
+   "devid %llu has unexpected generation, has %llu expected %llu",
+dev->devid,
+dev->generation,
+cur->generation);
+   } else {
+   btrfs_err_rl_in_rcu(fs_info,
+   "devid %llu has unexpected generation, has %llu expected %llu",
+dev->devid,
+dev->generation,
+cur->generation);
+   ret = -EINVAL;
+   break;
+   }
+   }
+   }
+   rcu_read_unlock();
+   return ret;
+}
+
  static int read_one_chunk(struct btrfs_fs_info *fs_info, struct btrfs_key 
*key,
  struct extent_buffer *leaf,
  struct btrfs_chunk *chunk)
@@ -6552,6 +6595,13 @@ static int read_one_chunk(struct btrfs_fs_info *fs_info, 
struct btrfs_key *key,
return PTR_ERR(map->stripes[i].dev);
}
btrfs_report_missing_device(fs_info, devid, uuid, 
false);
+   } else {
+   ret = verify_devices_generation(fs_info,
+   map->stripes[i].dev);
+   if (ret < 0) {
+   free_extent_map(em);
+   return ret;
+   }
}
set_bit(BTRFS_DEV_STATE_IN_FS_METADATA,
&(map->stripes[i].dev->dev_state));






Re: [PATCH RFC] btrfs: Do extra device generation check at mount time

2018-06-28 Thread Nikolay Borisov



On 28.06.2018 10:04, Qu Wenruo wrote:
> There is a reporter who considers btrfs raid1 to have a major design flaw
> because it can't handle nodatasum files.
> 
> Despite his incorrect expectation, btrfs indeed doesn't handle device
> generation mismatch well.
> 
> This means that if one device went missing and re-appeared, even though its
> generation no longer matches the rest of the device pool, btrfs does nothing
> about it, but treats it as a normal good device.
> 
> At least let's detect such a generation mismatch and avoid mounting the
> fs.
> Currently there is no automatic rebuild yet, which means that if users see
> the device generation mismatch error message, they can only mount the fs
> using the "device" and "degraded" mount options (if possible), then replace
> the offending device to manually "rebuild" the fs.
> 
> Signed-off-by: Qu Wenruo 

I think a testcase of this functionality is important as well.
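
Roughly, such a test only needs to provoke a stale generation on one member
and then check that the final mount is rejected. A bare-bones sketch (plain
shell, not a proper fstests case; loop device paths and sizes are made up):

  truncate -s 2G /tmp/img1 /tmp/img2
  DEV1=$(losetup -f --show /tmp/img1)
  DEV2=$(losetup -f --show /tmp/img2)

  mkfs.btrfs -f -d raid1 -m raid1 "$DEV1" "$DEV2"
  mount "$DEV1" /mnt
  dd if=/dev/zero of=/mnt/a bs=1M count=32
  umount /mnt

  # make DEV2 disappear, then write with only DEV1 so its generation moves on
  losetup -d "$DEV2"
  mount -o degraded "$DEV1" /mnt
  dd if=/dev/zero of=/mnt/b bs=1M count=32
  umount /mnt

  # bring the stale device back and rescan
  DEV2=$(losetup -f --show /tmp/img2)
  btrfs device scan

  # with the patch applied, this mount is expected to fail
  mount "$DEV1" /mnt && echo "mount unexpectedly succeeded"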
> ---
> I totally understand that a generation-based solution can't handle the
> split-brain case (where 2 RAID1 devices get mounted degraded separately)
> at all, but at least let's handle what we can.
> 
> The best way to solve the problem is to make btrfs treat such lower gen
> devices as some kind of missing device, and queue an automatic scrub for
> that device.
> But that's a lot of extra work; at least let's start by detecting the
> problem first.
> ---
>  fs/btrfs/volumes.c | 50 ++
>  1 file changed, 50 insertions(+)
> 
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index e034ad9e23b4..80a7c44993bc 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -6467,6 +6467,49 @@ static void btrfs_report_missing_device(struct 
> btrfs_fs_info *fs_info,
> devid, uuid);
>  }
>  
> +static int verify_devices_generation(struct btrfs_fs_info *fs_info,
> +  struct btrfs_device *dev)
> +{
> + struct btrfs_fs_devices *fs_devices = dev->fs_devices;
> + struct btrfs_device *cur;
> + bool warn_only = false;
> + int ret = 0;
> +
> + if (!fs_devices || fs_devices->seeding || !dev->generation)
> + return 0;
> +
> + /*
> +  * If we're not replaying log, we're completely safe to allow
> +  * generation mismatch as it won't write anything to disks, nor
> +  * remount to rw.
> +  */
> + if (btrfs_test_opt(fs_info, NOLOGREPLAY))
> + warn_only = true;
> +
> + rcu_read_lock();
> + list_for_each_entry_rcu(cur, &fs_devices->devices, dev_list) {
> + if (cur->generation && cur->generation != dev->generation) {
> + if (warn_only) {
> + btrfs_warn_rl_in_rcu(fs_info,
> + "devid %llu has unexpected generation, has %llu expected %llu",
> +  dev->devid,
> +  dev->generation,
> +  cur->generation);
> + } else {
> + btrfs_err_rl_in_rcu(fs_info,
> + "devid %llu has unexpected generation, has %llu expected %llu",
> +  dev->devid,
> +  dev->generation,
> +  cur->generation);
> + ret = -EINVAL;
> + break;
> + }
> + }
> + }
> + rcu_read_unlock();
> + return ret;
> +}
> +
>  static int read_one_chunk(struct btrfs_fs_info *fs_info, struct btrfs_key 
> *key,
> struct extent_buffer *leaf,
> struct btrfs_chunk *chunk)
> @@ -6552,6 +6595,13 @@ static int read_one_chunk(struct btrfs_fs_info 
> *fs_info, struct btrfs_key *key,
>   return PTR_ERR(map->stripes[i].dev);
>   }
>   btrfs_report_missing_device(fs_info, devid, uuid, 
> false);
> + } else {
> + ret = verify_devices_generation(fs_info,
> + map->stripes[i].dev);
> + if (ret < 0) {
> + free_extent_map(em);
> + return ret;
> + }
>   }
>   set_bit(BTRFS_DEV_STATE_IN_FS_METADATA,
>   &(map->stripes[i].dev->dev_state));
> 


[PATCH RFC] btrfs: Do extra device generation check at mount time

2018-06-28 Thread Qu Wenruo
There is a reporter who considers btrfs raid1 to have a major design flaw
because it can't handle nodatasum files.

Despite his incorrect expectation, btrfs indeed doesn't handle device
generation mismatch well.

This means that if one device went missing and re-appeared, even though its
generation no longer matches the rest of the device pool, btrfs does nothing
about it, but treats it as a normal good device.

At least let's detect such a generation mismatch and avoid mounting the
fs.
Currently there is no automatic rebuild yet, which means that if users see
the device generation mismatch error message, they can only mount the fs
using the "device" and "degraded" mount options (if possible), then replace
the offending device to manually "rebuild" the fs.
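
To make the last paragraph concrete, the manual recovery could look roughly
like the following. This is only a sketch; device paths and the devid are
examples, and whether the degraded mount works depends on the remaining
redundancy:

  # make sure the stale member (here devid 2, /dev/sdc) is no longer visible,
  # e.g. physically detach it or, if you are sure, wipe its superblock
  wipefs -a /dev/sdc

  # mount degraded from a good member ("-o device=" can be added to point at
  # other good members; ro,nologreplay is enough for read-only access)
  mount -o degraded /dev/sdb /mnt

  # rebuild the missing devid onto a replacement disk (or the wiped one)
  btrfs replace start 2 /dev/sdd /mnt
  btrfs replace status /mnt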

Signed-off-by: Qu Wenruo 
---
I totally understand that a generation-based solution can't handle the
split-brain case (where 2 RAID1 devices get mounted degraded separately)
at all, but at least let's handle what we can.

The best way to solve the problem is to make btrfs treat such lower gen
devices as some kind of missing device, and queue an automatic scrub for
that device.
But that's a lot of extra work; at least let's start by detecting the
problem first.
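
Until the kernel detects this on its own, the mismatch can also be spotted by
hand by comparing the superblock generation of every member device (a sketch;
the device paths are examples):

  for dev in /dev/sdb /dev/sdc; do
          echo -n "$dev: "
          btrfs inspect-internal dump-super "$dev" | grep '^generation'
  done
  # all members of a healthy filesystem should report the same generation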
---
 fs/btrfs/volumes.c | 50 ++
 1 file changed, 50 insertions(+)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index e034ad9e23b4..80a7c44993bc 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6467,6 +6467,49 @@ static void btrfs_report_missing_device(struct 
btrfs_fs_info *fs_info,
  devid, uuid);
 }
 
+static int verify_devices_generation(struct btrfs_fs_info *fs_info,
+struct btrfs_device *dev)
+{
+   struct btrfs_fs_devices *fs_devices = dev->fs_devices;
+   struct btrfs_device *cur;
+   bool warn_only = false;
+   int ret = 0;
+
+   if (!fs_devices || fs_devices->seeding || !dev->generation)
+   return 0;
+
+   /*
+* If we're not replaying log, we're completely safe to allow
+* generation mismatch as it won't write anything to disks, nor
+* remount to rw.
+*/
+   if (btrfs_test_opt(fs_info, NOLOGREPLAY))
+   warn_only = true;
+
+   rcu_read_lock();
+   list_for_each_entry_rcu(cur, &fs_devices->devices, dev_list) {
+   if (cur->generation && cur->generation != dev->generation) {
+   if (warn_only) {
+   btrfs_warn_rl_in_rcu(fs_info,
+   "devid %llu has unexpected generation, has %llu expected %llu",
+dev->devid,
+dev->generation,
+cur->generation);
+   } else {
+   btrfs_err_rl_in_rcu(fs_info,
+   "devid %llu has unexpected generation, has %llu expected %llu",
+dev->devid,
+dev->generation,
+cur->generation);
+   ret = -EINVAL;
+   break;
+   }
+   }
+   }
+   rcu_read_unlock();
+   return ret;
+}
+
 static int read_one_chunk(struct btrfs_fs_info *fs_info, struct btrfs_key *key,
  struct extent_buffer *leaf,
  struct btrfs_chunk *chunk)
@@ -6552,6 +6595,13 @@ static int read_one_chunk(struct btrfs_fs_info *fs_info, 
struct btrfs_key *key,
return PTR_ERR(map->stripes[i].dev);
}
btrfs_report_missing_device(fs_info, devid, uuid, 
false);
+   } else {
+   ret = verify_devices_generation(fs_info,
+   map->stripes[i].dev);
+   if (ret < 0) {
+   free_extent_map(em);
+   return ret;
+   }
}
set_bit(BTRFS_DEV_STATE_IN_FS_METADATA,
&(map->stripes[i].dev->dev_state));
-- 
2.18.0



Re: [PATCH] fstests: btrfs: Test if btrfs will corrupt nodatasum compressed extent when replacing device

2018-06-28 Thread Nikolay Borisov



On 28.06.2018 08:34, Eryu Guan wrote:
> On Thu, Jun 28, 2018 at 08:11:00AM +0300, Nikolay Borisov wrote:
>>
>>
>> On  1.06.2018 04:34, Qu Wenruo wrote:
>>> This is a long-existing bug (from 2012), recently exposed by a reporter:
>>> when a compressed extent without data csum gets written to the
>>> device-replace target device, the written data is in fact uncompressed data
>>> rather than the original compressed data.
>>>
>>> And since btrfs still considers the data compressed and will try to read it
>>> as compressed, this can cause read errors.
>>>
>>> The root cause has been located, and an RFC patch has already been sent to
>>> fix it, titled "[PATCH RFC] btrfs: scrub: Don't use inode pages for device replace".
>>> (The RFC is only about an extra possible way to fix the bug; the fix
>>> itself should work without problem.)
>>>
>>> Reported-by: James Harvey 
>>> Signed-off-by: Qu Wenruo 
>>
>> Reviewed-by: Nikolay Borisov 
> 
> Thanks for the review! I assume the v3 patch also passes your review :)
Yes, I just saw that you requested an ack from a btrfs developer some
time ago and this test didn't move forward, hence I replied. But yes,
it's a valid test for btrfs.
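
For readers who do not have the test at hand, the core of the scenario looks
roughly like this. A sketch only, not the fstests case itself; it assumes
that mounting with both compress and nodatasum is how the compressed,
csum-less extents are produced:

  mkfs.btrfs -f /dev/sdb
  mount -o compress=zlib,nodatasum /dev/sdb /mnt

  xfs_io -f -c "pwrite -S 0x61 0 64M" /mnt/file
  sync
  md5sum /mnt/file > /tmp/before.md5

  # replace the only member; on affected kernels the copied extents are
  # written uncompressed while still being flagged as compressed
  btrfs replace start -Bf /dev/sdb /dev/sdc /mnt

  # drop caches and re-read; a mismatch or read error indicates the bug
  echo 3 > /proc/sys/vm/drop_caches
  md5sum -c /tmp/before.md5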
> 
> Eryu