Re: kernel BUG at linux-4.2.0/fs/btrfs/extent-tree.c:1833 on rebalance

2015-09-17 Thread Stéphane Lesimple

Le 2015-09-17 05:03, Qu Wenruo a écrit :

Stéphane Lesimple wrote on 2015/09/16 22:41 +0200:

Le 2015-09-16 22:18, Duncan a écrit :
Stéphane Lesimple posted on Wed, 16 Sep 2015 15:04:20 +0200 as 
excerpted:




Well actually it's the (d) option ;)
I activate the quota feature for only one reason: being able to track
down how much space my snapshots are taking.


Yeah, that's exactly one of the ideal use cases of btrfs qgroups.

But I'm quite curious about the btrfsck error report on qgroups.

If btrfsck reports such an error, it means either I'm too confident about
the recent qgroup accounting rework, or btrfsck has some bug I didn't
take into consideration during the kernel rework.

Would you please provide the full result of the previous btrfsck with the
qgroup error?


Sure, I've saved the log somewhere just in case, here you are:

Counts for qgroup id: 3359 are different
our:referenced 7530119168 referenced compressed 7530119168
disk:   referenced 7530086400 referenced compressed 7530086400
diff:   referenced 32768 referenced compressed 32768
our:exclusive 49152 exclusive compressed 49152
disk:   exclusive 32768 exclusive compressed 32768
diff:   exclusive 16384 exclusive compressed 16384
Counts for qgroup id: 3361 are different
our:referenced 7530119168 referenced compressed 7530119168
disk:   referenced 7530086400 referenced compressed 7530086400
diff:   referenced 32768 referenced compressed 32768
our:exclusive 49152 exclusive compressed 49152
disk:   exclusive 32768 exclusive compressed 32768
diff:   exclusive 16384 exclusive compressed 16384
Counts for qgroup id: 3362 are different
our:referenced 7530119168 referenced compressed 7530119168
disk:   referenced 7530086400 referenced compressed 7530086400
diff:   referenced 32768 referenced compressed 32768
our:exclusive 49152 exclusive compressed 49152
disk:   exclusive 32768 exclusive compressed 32768
diff:   exclusive 16384 exclusive compressed 16384
Counts for qgroup id: 3363 are different
our:referenced 7530119168 referenced compressed 7530119168
disk:   referenced 7530086400 referenced compressed 7530086400
diff:   referenced 32768 referenced compressed 32768
our:exclusive 49152 exclusive compressed 49152
disk:   exclusive 32768 exclusive compressed 32768
diff:   exclusive 16384 exclusive compressed 16384
Counts for qgroup id: 3361 are different
our:referenced 7530119168 referenced compressed 7530119168
disk:   referenced 7530086400 referenced compressed 7530086400
diff:   referenced 32768 referenced compressed 32768
our:exclusive 49152 exclusive compressed 49152
disk:   exclusive 32768 exclusive compressed 32768
diff:   exclusive 16384 exclusive compressed 16384
Counts for qgroup id: 3362 are different
our:referenced 7530119168 referenced compressed 7530119168
disk:   referenced 7530086400 referenced compressed 7530086400
diff:   referenced 32768 referenced compressed 32768
our:exclusive 49152 exclusive compressed 49152
disk:   exclusive 32768 exclusive compressed 32768
diff:   exclusive 16384 exclusive compressed 16384
Counts for qgroup id: 3363 are different
our:referenced 7530119168 referenced compressed 7530119168
disk:   referenced 7530086400 referenced compressed 7530086400
diff:   referenced 32768 referenced compressed 32768
our:exclusive 49152 exclusive compressed 49152
disk:   exclusive 32768 exclusive compressed 32768
diff:   exclusive 16384 exclusive compressed 16384
Counts for qgroup id: 3364 are different
our:referenced 7530119168 referenced compressed 7530119168
disk:   referenced 7530086400 referenced compressed 7530086400
diff:   referenced 32768 referenced compressed 32768
our:exclusive 49152 exclusive compressed 49152
disk:   exclusive 32768 exclusive compressed 32768
diff:   exclusive 16384 exclusive compressed 16384
Counts for qgroup id: 3365 are different
our:referenced 7530119168 referenced compressed 7530119168
disk:   referenced 7530086400 referenced compressed 7530086400
diff:   referenced 32768 referenced compressed 32768
our:exclusive 49152 exclusive compressed 49152
disk:   exclusive 32768 exclusive compressed 32768
diff:   exclusive 16384 exclusive compressed 16384
Counts for qgroup id: 3366 are different
our:referenced 7530119168 referenced compressed 7530119168
disk:   referenced 7530086400 referenced compressed 7530086400
diff:   referenced 32768 referenced compressed 32768
our:exclusive 16384 exclusive compressed 16384
disk:   exclusive 16384 exclusive compressed 16384


Re: kernel BUG at linux-4.2.0/fs/btrfs/extent-tree.c:1833 on rebalance

2015-09-17 Thread Stéphane Lesimple

Le 2015-09-16 15:04, Stéphane Lesimple a écrit :

I also disabled quota because it almost certainly has nothing
to do with the bug


As it turns out, it seems that this assertion was completely wrong.

I've had balance running for more than 16 hours now, without a crash. 
This is almost 50% of the work done without any issue. Before, a crash 
would happen within minutes, sometimes an hour, but not much more. The 
problem is, I didn't change anything on the filesystem, well, apart 
from the benign quota disable. So Qu's question about the qgroup errors 
in fsck made me wonder: if I activate quota again, it'll still continue 
to balance flawlessly, right?


Well, it doesn't. I just ran btrfs quota enable on my filesystem; it 
completed successfully after some minutes (rescan -s said that no rescan 
was pending). Then less than 5 minutes later, the kernel crashed, at the 
same BUG_ON() as usual:


[60156.062082] BTRFS info (device dm-3): relocating block group 
972839452672 flags 129

[60185.203626] BTRFS info (device dm-3): found 1463 extents
[60414.452890] {btrfs} in insert_inline_extent_backref, got owner < 
BTRFS_FIRST_FREE_OBJECTID
[60414.452894] {btrfs} with bytenr=5197436141568 num_bytes=16384 
parent=5336636473344 root_objectid=3358 owner=1 offset=0 refs_to_add=1 
BTRFS_FIRST_FREE_OBJECTID=256

[60414.452924] [ cut here ]
[60414.452928] kernel BUG at fs/btrfs/extent-tree.c:1837!

owner is 1 again at this point in the code (this is still kernel 
4.3.0-rc1 with my added printks).


So I'll disable quota, again, and resume the balance. If I'm right, it 
should proceed without issue for 18 more hours !
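
For reference, the cycle described above boils down to roughly the following 
commands; the mount point is hypothetical:

  # re-enable quotas (this triggers an automatic rescan)
  btrfs quota enable /mnt
  # check that no rescan is still pending
  btrfs quota rescan -s /mnt
  # ... the kernel hits the BUG_ON() during balance a few minutes later ...
  # back the change out and resume the interrupted balance
  btrfs quota disable /mnt
  btrfs balance resume /mnt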


Qu, my filesystem is at your disposal :)

--
Stéphane.




Re: kernel BUG at linux-4.2.0/fs/btrfs/extent-tree.c:1833 on rebalance

2015-09-17 Thread Stéphane Lesimple

Le 2015-09-17 08:29, Stéphane Lesimple a écrit :

Le 2015-09-16 15:04, Stéphane Lesimple a écrit :

I also disabled quota because it almost certainly has nothing
to do with the bug


As it turns out, it seems that this assertion was completely wrong.

I've had balance running for more than 16 hours now, without a crash.
This is almost 50% of the work done without any issue. Before, a crash
would happen within minutes, sometimes an hour, but not much more. The
problem is, I didn't change anything on the filesystem, well, apart
from the benign quota disable. So Qu's question about the qgroup
errors in fsck made me wonder: if I activate quota again, it'll still
continue to balance flawlessly, right?

Well, it doesn't. I just ran btrfs quota enable on my filesystem; it
completed successfully after some minutes (rescan -s said that no
rescan was pending). Then less than 5 minutes later, the kernel
crashed, at the same BUG_ON() as usual:

[60156.062082] BTRFS info (device dm-3): relocating block group
972839452672 flags 129
[60185.203626] BTRFS info (device dm-3): found 1463 extents
[60414.452890] {btrfs} in insert_inline_extent_backref, got owner <
BTRFS_FIRST_FREE_OBJECTID
[60414.452894] {btrfs} with bytenr=5197436141568 num_bytes=16384
parent=5336636473344 root_objectid=3358 owner=1 offset=0 refs_to_add=1
BTRFS_FIRST_FREE_OBJECTID=256
[60414.452924] [ cut here ]
[60414.452928] kernel BUG at fs/btrfs/extent-tree.c:1837!

owner is 1 again at this point in the code (this is still kernel
4.3.0-rc1 with my added printks).

So I'll disable quota, again, and resume the balance. If I'm right, it
should proceed without issue for 18 more hours !


Damn, wrong again. It just re-crashed without quota enabled :(
The fact that it went perfectly well for 17+ hours and then crashed minutes 
after I reactivated quota might be complete chance then...


[ 5487.706499] {btrfs} in insert_inline_extent_backref, got owner < 
BTRFS_FIRST_FREE_OBJECTID
[ 5487.706504] {btrfs} with bytenr=6906661109760 num_bytes=16384 
parent=6905020874752 root_objectid=18446744073709551608 owner=1 offset=0 
refs_to_add=1 BTRFS_FIRST_FREE_OBJECTID=256

[ 5487.706536] [ cut here ]
[ 5487.706539] kernel BUG at fs/btrfs/extent-tree.c:1837!

For reference, the crash I had earlier this morning was as follows :

[60414.452894] {btrfs} with bytenr=5197436141568 num_bytes=16384 
parent=5336636473344 root_objectid=3358 owner=1 offset=0 refs_to_add=1 
BTRFS_FIRST_FREE_OBJECTID=256


So, this is a completely different part of the filesystem.
The bug is always the same though: owner=1, where the owner shouldn't be < 256.

Balance cancelled.

To me, it sounds like some sort of race condition. But I'm out of ideas 
on what to test now.


--
Stéphane.



Re: kernel BUG at linux-4.2.0/fs/btrfs/extent-tree.c:1833 on rebalance

2015-09-17 Thread Stéphane Lesimple

Le 2015-09-17 08:42, Qu Wenruo a écrit :

Stéphane Lesimple wrote on 2015/09/17 08:11 +0200:

Le 2015-09-17 05:03, Qu Wenruo a écrit :

Stéphane Lesimple wrote on 2015/09/16 22:41 +0200:

Le 2015-09-16 22:18, Duncan a écrit :

Stéphane Lesimple posted on Wed, 16 Sep 2015 15:04:20 +0200 as
excerpted:



Well actually it's the (d) option ;)
I activate the quota feature for only one reason: being able to track
down how much space my snapshots are taking.


Yeah, that's exactly one of the ideal use cases of btrfs qgroups.

But I'm quite curious about the btrfsck error report on qgroups.

If btrfsck reports such an error, it means either I'm too confident about
the recent qgroup accounting rework, or btrfsck has some bug I didn't
take into consideration during the kernel rework.

Would you please provide the full result of the previous btrfsck with the
qgroup error?


Sure, I've saved the log somewhere just in case, here you are:

[...]

Thanks for your log, a pretty interesting result.

BTW, did you enable qgroups on an old kernel, earlier than 4.2-rc1?
If so, I would be much more relaxed, as those could be problems of old
kernels.


The mkfs.btrfs was done under 3.19, but I'm almost sure I enabled quota
precisely under 4.2.0. My kern.log tends to confirm that (looking for
'qgroup scan completed').



If it's OK for you, would you please enable quota after reproducing
the bug, use the filesystem for some time, and then recheck it?


Sure, I've just reproduced the bug twice as I wanted, and posted the 
info, so now I've cancelled the balance and I can reenable quota. Will 
do it under 4.3.0-rc1. I'll keep you posted if btrfsck complains about 
it in the following days.


Regards,

--
Stéphane.



Re: [PATCH 2/2] btrfs: Remove unneeded missing device number check

2015-09-17 Thread Anand Jain



On 09/16/2015 11:43 AM, Qu Wenruo wrote:

As we do a per-chunk missing device number check at read_one_chunk() time,
there is no need for the global missing device number check.

Just remove it.


However, the missing device count that we have during the remount is not 
fine-grained per chunk.

---
btrfs_remount
::
    if (fs_info->fs_devices->missing_devices >
        fs_info->num_tolerated_disk_barrier_failures &&
        !(*flags & MS_RDONLY ||
          btrfs_test_opt(root, DEGRADED))) {
            btrfs_warn(fs_info,
                "too many missing devices, writeable remount is not allowed");
            ret = -EACCES;
            goto restore;
    }
-

Thanks, Anand



Now btrfs can handle the following case:
  # mkfs.btrfs -f -m raid1 -d single /dev/sdb /dev/sdc

  Data chunk will be located in sdb, so we should be safe to wipe sdc
  # wipefs -a /dev/sdc

  # mount /dev/sdb /mnt/btrfs -o degraded

Signed-off-by: Qu Wenruo 
---
  fs/btrfs/disk-io.c | 8 
  1 file changed, 8 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 0b658d0..ac640ea 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2947,14 +2947,6 @@ retry_root_backup:
}
fs_info->num_tolerated_disk_barrier_failures =
btrfs_calc_num_tolerated_disk_barrier_failures(fs_info);
-   if (fs_info->fs_devices->missing_devices >
-fs_info->num_tolerated_disk_barrier_failures &&
-   !(sb->s_flags & MS_RDONLY)) {
-   pr_warn("BTRFS: missing devices(%llu) exceeds the limit(%d), writeable mount is not allowed\n",
-   fs_info->fs_devices->missing_devices,
-   fs_info->num_tolerated_disk_barrier_failures);
-   goto fail_sysfs;
-   }

fs_info->cleaner_kthread = kthread_run(cleaner_kthread, tree_root,
   "btrfs-cleaner");




Re: kernel BUG at linux-4.2.0/fs/btrfs/extent-tree.c:1833 on rebalance

2015-09-17 Thread Stéphane Lesimple

Le 2015-09-17 10:11, Qu Wenruo a écrit :

Stéphane Lesimple wrote on 2015/09/17 10:02 +0200:

Le 2015-09-17 08:42, Qu Wenruo a écrit :

Stéphane Lesimple wrote on 2015/09/17 08:11 +0200:

Le 2015-09-17 05:03, Qu Wenruo a écrit :

Stéphane Lesimple wrote on 2015/09/16 22:41 +0200:

Le 2015-09-16 22:18, Duncan a écrit :

Stéphane Lesimple posted on Wed, 16 Sep 2015 15:04:20 +0200 as
excerpted:



Well actually it's the (d) option ;)
I activate the quota feature for only one reason: being able to track
down how much space my snapshots are taking.


Yeah, that's exactly one of the ideal use cases of btrfs qgroups.

But I'm quite curious about the btrfsck error report on qgroups.

If btrfsck reports such an error, it means either I'm too confident about
the recent qgroup accounting rework, or btrfsck has some bug I didn't
take into consideration during the kernel rework.

Would you please provide the full result of the previous btrfsck with the
qgroup error?


Sure, I've saved the log somewhere just in case, here you are:

[...]

Thanks for your log, a pretty interesting result.

BTW, did you enable qgroups on an old kernel, earlier than 4.2-rc1?
If so, I would be much more relaxed, as those could be problems of old
kernels.


The mkfs.btrfs was done under 3.19, but I'm almost sure I enabled quota
precisely under 4.2.0. My kern.log tends to confirm that (looking for
'qgroup scan completed').


Emmm, seems I need to pay more attention to this case now.
Any info about the workload on this btrfs fs?




If it's OK for you, would you please enable quota after reproducing
the bug, use the filesystem for some time, and then recheck it?


Sure, I've just reproduced the bug twice as I wanted, and posted the
info, so now I've cancelled the balance and I can reenable quota. Will
do it under 4.3.0-rc1. I'll keep you posted if btrfsck complains about
it in the following days.

Regards,


Thanks for your patience and detailed report.


You're very welcome.


But I still have another question: did you do any snapshot deletion
after quota was enabled?
(I'll assume you did, as there are a lot of backup snapshots and the old
ones should already be deleted.)


Actually no: this btrfs filesystem is quite new (less than a week old), as
I'm migrating from mdadm(raid1)+ext4 to btrfs. So those snapshots were
actually rsynced one by one from my hardlink-based "snapshots" under ext4
(those pseudo-snapshots are created using a program named "rsnapshot", if
you know it; it's basically a wrapper around cp -la). I haven't yet
activated automatic snapshot creation/deletion on my btrfs system, due to
the bugs I'm tripping on. So no snapshot was deleted.
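
A rough sketch of that migration loop, with hypothetical paths (the real 
subvolume layout may differ):

  # re-create the rsnapshot pseudo-snapshots as real btrfs snapshots
  btrfs subvolume create /mnt/btrfs/backups
  btrfs subvolume create /mnt/btrfs/backups/current
  for snap in /mnt/ext4/rsnapshot/daily.*; do
      # copy one hardlink-based snapshot into the working subvolume
      rsync -aHAX --delete "$snap"/ /mnt/btrfs/backups/current/
      # then freeze it as a read-only btrfs snapshot
      btrfs subvolume snapshot -r /mnt/btrfs/backups/current \
          "/mnt/btrfs/backups/$(basename "$snap")"
  done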



That's one of the known bugs, and Mark is working on it actively.
If you delete non-empty snapshots a lot, then I'd better add a hot fix
to mark qgroups inconsistent after snapshot deletion, and trigger a
rescan if possible.


I've made a btrfs-image of the filesystem just before disabling quotas
(which I did to get a clean btrfsck and to eliminate quotas from the
equation while trying to reproduce the bug I have). Would it be of any
use if I drop it somewhere for you to pick up? (2.9G in size.)
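
For reference, such a metadata-only dump can be produced with btrfs-image; a 
sketch, with the output path being hypothetical:

  # dump the filesystem metadata (no file data) into a compressed image
  btrfs-image -c9 -t4 /dev/dm-3 /var/tmp/btrfs-metadata.img
  # a developer can later restore it onto a scratch device with:
  #   btrfs-image -r /var/tmp/btrfs-metadata.img /dev/sdX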


In the meantime, I've reactivated quotas, unmounted the filesystem and
ran a btrfsck on it: as you would expect, there's no qgroup problem
reported so far. I'll clear all my snapshots, run a quota rescan, then
re-create them one by one by rsyncing from the ext4 system I still have.
Maybe I'll run into the issue again.


--
Stéphane.



btrfs progs 4.1.1 & 4.2 segfault on chunk-recover

2015-09-17 Thread Daniel Wiegert
Hello guys

I think I might have found a bug. Lots of text; I don't know what you want
from me and what not, so I'll try to get almost everything into one mail -
please don't shoot me! :)

To make a long story somewhat short, this is about what happened to me
(skip ahead if you don't care about the history):

Arch-linux, btrfs-progs 4.1.1 & 4.2, linux 4.1.6-1

Data, RAID5: total=3.11TiB, used=0.00B <-- this one said the other day
used=3.05TiB
System, RAID1: total=32.00MiB, used=0.00B
Metadata, RAID1: total=8.00GiB, used=144.00KiB
GlobalReserve, single: total=16.00MiB, used=0.00B

Label: 'Isolinear'  uuid: 9bb3f369-f2a9-46be-8dde-1106ae740e36
Total devices 9 FS bytes used 144.00KiB
devid    7 size 2.73TiB used 541.12GiB path /dev/sdi
devid    9 size 1.36TiB used 533.09GiB path /dev/sdd2
devid   10 size 1.36TiB used 533.09GiB path /dev/sdg2
devid   11 size 1.82TiB used 536.12GiB path /dev/sdj2
devid   12 size 1.82TiB used 538.09GiB path /dev/sdh2
devid   13 size 286.09GiB used 286.09GiB path /dev/sda3
devid   14 size 286.09GiB used 286.09GiB path /dev/sdb3
devid   15 size 372.61GiB used 372.61GiB path /dev/sdf1
*** Some devices missing

Drive 8 was a 1.36TiB drive.
Drive 15 is the new drive I added to the system.


* One of the 8 drives started to fail; SMART saw the errors, but I had
messed up my configuration and didn't get notified. It ran for 3-14 days
before I realized.
* I tried, on the actively running system, to btrfs dev del /dev/sd[failing] -
it did not work (I think it was csum errors).
* I added one new disk to the raid, rebooted, added the new disk to the
array and tried balancing. Power failed and the UPS failed after x hours.
* I rebooted and realized the failing drive was now dead. I could mount the
filesystem with degraded, and some files gave me a kernel panic
( https://goo.gl/photos/UXrZj6YEUW3945b37 ) - others were reading fine.
- I was unable to dev del missing.

At this point I knew the filesystem was probably broken beyond repair, so
I just tried all the commands I could think of: check --repair, check
--init-csum-tree, etc. Endless loop - at first very fast text scrolling,
lots of CPU, not much disk IO; after ~48h the text was slow, lots of CPU,
almost no disk IO, with the same type of message repeating (with new numbers):
-
ref mismatch on [17959857729536 4096] extent item 0, found 1
adding new data backref on 17959857729536 parent 35277570539520 owner
0 offset 0 found 1
Backref 17959857729536 parent 35277570539520 owner 0 offset 0 num_refs
0 not found in extent tree
Incorrect local backref count on 17959857729536 parent 35277570539520
owner 0 offset 0 found 1 wanted 0 back 0x145f7800
backpointer mismatch on [17959857729536 4096]
ref mismatch on [17959857733632 4096] extent item 0, found 1
adding new data backref on 17959857733632 parent 35277570785280 owner
0 offset 0 found 1
Backref 17959857733632 parent 35277570785280 owner 0 offset 0 num_refs
0 not found in extent tree
Incorrect local backref count on 17959857733632 parent 35277570785280
owner 0 offset 0 found 1 wanted 0 back 0x145f7b90
backpointer mismatch on [17959857733632 4096]
-

 Found out that chunk-recover gave a segfault (4.1.1 & kdave 4.2).
4.1.1 said in the backtrace:
#0  0x004251bb in btrfs_new_device_extent_record ()
#1  0x004301cb in ?? ()
#2  0x0043085d in ?? ()
#3  0x7fd8071074a4 in start_thread () from /usr/lib/libpthread.so.0
#4  0x7fd806e4513d in clone () from /usr/lib/libc.so.6

Not much help, but I compiled https://github.com/kdave/btrfs-progs
and got this backtrace:

-->  http://pastebin.com/XqRrqAB5

I can repeat the segfault. I made two btrfs-images; one is around 4MB,
the other around 300MB I think.
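
For anyone wanting to reproduce this, a sketch of how a symbolized backtrace 
can be captured, assuming the autotools build in the kdave tree and a 
hypothetical device path:

  # build btrfs-progs with debug info and run chunk-recover under gdb
  git clone https://github.com/kdave/btrfs-progs.git
  cd btrfs-progs
  ./autogen.sh && CFLAGS="-g -O0" ./configure && make
  gdb --args ./btrfs chunk-recover -v /dev/sdX
  # inside gdb:
  #   (gdb) run
  #   (gdb) thread apply all bt    # the crash happens in a scanning thread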

So, did I find a bug? I can't find my logs from the beginning of the
drive failure, i.e. what it said when I tried to remove the broken drive.
I might be able to try the setup again (I've got one more
drive-about-to-fail).



PS: I've tried to make alpine work, but it won't accept my passwords. I
hope the gmail web client is OK for you guys; the openwrt dev team rejected
my posts just because of this email client.

best regards
Daniel

end


Re: kernel BUG at linux-4.2.0/fs/btrfs/extent-tree.c:1833 on rebalance

2015-09-17 Thread Qu Wenruo



在 2015年09月17日 18:08, Stéphane Lesimple 写道:

Le 2015-09-17 10:11, Qu Wenruo a écrit :

Stéphane Lesimple wrote on 2015/09/17 10:02 +0200:

Le 2015-09-17 08:42, Qu Wenruo a écrit :

Stéphane Lesimple wrote on 2015/09/17 08:11 +0200:

Le 2015-09-17 05:03, Qu Wenruo a écrit :

Stéphane Lesimple wrote on 2015/09/16 22:41 +0200:

Le 2015-09-16 22:18, Duncan a écrit :

Stéphane Lesimple posted on Wed, 16 Sep 2015 15:04:20 +0200 as
excerpted:



Well actually it's the (d) option ;)
I activate the quota feature for only one reason: being able to track
down how much space my snapshots are taking.


Yeah, that's exactly one of the ideal use cases of btrfs qgroups.

But I'm quite curious about the btrfsck error report on qgroups.

If btrfsck reports such an error, it means either I'm too confident about
the recent qgroup accounting rework, or btrfsck has some bug I didn't
take into consideration during the kernel rework.

Would you please provide the full result of the previous btrfsck with the
qgroup error?


Sure, I've saved the log somewhere just in case, here you are:

[...]

Thanks for your log, a pretty interesting result.

BTW, did you enable qgroups on an old kernel, earlier than 4.2-rc1?
If so, I would be much more relaxed, as those could be problems of old
kernels.


The mkfs.btrfs was done under 3.19, but I'm almost sure I enabled quota
precisely under 4.2.0. My kern.log tends to confirm that (looking for
'qgroup scan completed').


Emmm, seems I need to pay more attention to this case now.
Any info about the workload on this btrfs fs?




If it's OK for you, would you please enable quota after reproducing
the bug, use the filesystem for some time, and then recheck it?


Sure, I've just reproduced the bug twice as I wanted, and posted the
info, so now I've cancelled the balance and I can reenable quota. Will
do it under 4.3.0-rc1. I'll keep you posted if btrfsck complains about
it in the following days.

Regards,


Thanks for your patience and detailed report.


You're very welcome.


But I still have another question: did you do any snapshot deletion
after quota was enabled?
(I'll assume you did, as there are a lot of backup snapshots and the old
ones should already be deleted.)


Actually no: this btrfs filesystem is quite new (less than a week old), as
I'm migrating from mdadm(raid1)+ext4 to btrfs. So those snapshots were
actually rsynced one by one from my hardlink-based "snapshots" under ext4
(those pseudo-snapshots are created using a program named "rsnapshot", if
you know it; it's basically a wrapper around cp -la). I haven't yet
activated automatic snapshot creation/deletion on my btrfs system, due to
the bugs I'm tripping on. So no snapshot was deleted.


Now things are getting tricky: as all known bugs are ruled out, it must 
be another hidden bug, even though we tried to rework the qgroup accounting code.





That's one of the known bugs, and Mark is working on it actively.
If you delete non-empty snapshots a lot, then I'd better add a hot fix
to mark qgroups inconsistent after snapshot deletion, and trigger a
rescan if possible.


I've made a btrfs-image of the filesystem just before disabling quotas
(which I did to get a clean btrfsck and to eliminate quotas from the
equation while trying to reproduce the bug I have). Would it be of any
use if I drop it somewhere for you to pick up? (2.9G in size.)


For a mismatch case, a static btrfs-image dump won't really help, as the 
important point is when and which operation caused the qgroup accounting 
to mismatch.




In the meantime, I've reactivated quotas, unmounted the filesystem and
ran a btrfsck on it: as you would expect, there's no qgroup problem
reported so far.


At least the rescan code is working without problems.


I'll clear all my snapshots, run a quota rescan, then
re-create them one by one by rsyncing from the ext4 system I still have.
Maybe I'll run into the issue again.



Would you mind doing the following check for each subvolume rsync?

1) Do 'sync; btrfs qgroup show -prce --raw' and save the output
2) Create the needed snapshot
3) Do 'sync; btrfs qgroup show -prce --raw' and save the output
4) Avoid doing IO if possible until step 6)
5) Do 'btrfs quota rescan -w' and save it
6) Do 'sync; btrfs qgroup show -prce --raw' and save the output
7) Rsync data from ext4 to the newly created snapshot

The point is that, as you mentioned, rescan is working fine, so we can 
compare the output from 1), 3) and 6) to see which qgroup accounting 
numbers change.


And if they differ, meaning the qgroup update at write time OR snapshot 
creation has something wrong, we can at least narrow the problem down to 
the qgroup update routine or snapshot creation.
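
A small wrapper for those steps might look like this; the mount point, source 
subvolume and log directory are hypothetical:

  #!/bin/bash
  # record qgroup numbers around a snapshot creation, following steps 1)-6)
  MNT=/mnt/btrfs
  SRC=$MNT/backups/current
  SNAP=$MNT/backups/snap-$(date +%Y%m%d-%H%M%S)
  LOGDIR=/var/log/qgroup-checks
  mkdir -p "$LOGDIR"

  save() { sync; btrfs qgroup show -prce --raw "$MNT" > "$LOGDIR/$1"; }

  save 1-before-snapshot                              # step 1
  btrfs subvolume snapshot "$SRC" "$SNAP"             # step 2
  save 3-after-snapshot                               # step 3
  # steps 4/5: avoid other IO, then run a blocking rescan and save its output
  btrfs quota rescan -w "$MNT" > "$LOGDIR/5-rescan"
  save 6-after-rescan                                 # step 6
  # step 7: rsync the data from ext4 into "$SNAP" afterwards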


Thanks,
Qu


Re: FYIO: A rant about btrfs

2015-09-17 Thread Austin S Hemmelgarn

On 2015-09-16 19:31, Hugo Mills wrote:

On Wed, Sep 16, 2015 at 03:21:26PM -0400, Austin S Hemmelgarn wrote:

On 2015-09-16 12:45, Martin Tippmann wrote:

2015-09-16 17:20 GMT+02:00 Austin S Hemmelgarn :
[...]

[...]

 From reading the list I understand that btrfs is still very much work
in progress and performance is not a top priority at this stage but I
don't see why it shouldn't perform at least equally good as ZFS/F2FS
on the same workloads. Is looking at performance problems on the
development roadmap?

Performance is on the roadmap, but the roadmap is notoriously
short-sighted when it comes to time-frame for completion of
something. You have to understand also that the focus in BTRFS has
also been more on data safety than performance, because that's the
intended niche, and the area most people look to ZFS for.


Wait... there's a roadmap? ;)

Yeah, maybe it's better to say that there's a directed graph of feature 
interdependence.  I was just basing my statement on the presence of a 
list of project ideas on the wiki. :)






Re: kernel BUG at linux-4.2.0/fs/btrfs/extent-tree.c:1833 on rebalance

2015-09-17 Thread Qu Wenruo



Stéphane Lesimple wrote on 2015/09/17 10:02 +0200:

Le 2015-09-17 08:42, Qu Wenruo a écrit :

Stéphane Lesimple wrote on 2015/09/17 08:11 +0200:

Le 2015-09-17 05:03, Qu Wenruo a écrit :

Stéphane Lesimple wrote on 2015/09/16 22:41 +0200:

Le 2015-09-16 22:18, Duncan a écrit :

Stéphane Lesimple posted on Wed, 16 Sep 2015 15:04:20 +0200 as
excerpted:



Well actually it's the (d) option ;)
I activate the quota feature for only one reason: being able to track
down how much space my snapshots are taking.


Yeah, that's exactly one of the ideal use cases of btrfs qgroups.

But I'm quite curious about the btrfsck error report on qgroups.

If btrfsck reports such an error, it means either I'm too confident about
the recent qgroup accounting rework, or btrfsck has some bug I didn't
take into consideration during the kernel rework.

Would you please provide the full result of the previous btrfsck with the
qgroup error?


Sure, I've saved the log somewhere just in case, here you are:

[...]

Thanks for your log, a pretty interesting result.

BTW, did you enable qgroups on an old kernel, earlier than 4.2-rc1?
If so, I would be much more relaxed, as those could be problems of old kernels.


The mkfs.btrfs was done under 3.19, but I'm almost sure I enabled quota
precisely under 4.2.0. My kern.log tends to confirm that (looking for
'qgroup scan completed').


Emmm, seems I need to pay more attention to this case now.
Any info about the workload on this btrfs fs?




If it's OK for you, would you please enable quota after reproducing
the bug, use the filesystem for some time, and then recheck it?


Sure, I've just reproduced the bug twice as I wanted, and posted the
info, so now I've cancelled the balance and I can reenable quota. Will
do it under 4.3.0-rc1. I'll keep you posted if btrfsck complains about
it in the following days.

Regards,


Thanks for your patience and detailed report.

But I still have another question: did you do any snapshot deletion
after quota was enabled?
(I'll assume you did, as there are a lot of backup snapshots and the old
ones should already be deleted.)


That's one of the known bugs, and Mark is working on it actively.
If you delete non-empty snapshots a lot, then I'd better add a hot fix to
mark qgroups inconsistent after snapshot deletion, and trigger a rescan if
possible.


Thanks,
Qu


Re: kernel BUG at linux-4.2.0/fs/btrfs/extent-tree.c:1833 on rebalance

2015-09-17 Thread Qu Wenruo



Stéphane Lesimple wrote on 2015/09/17 08:11 +0200:

Le 2015-09-17 05:03, Qu Wenruo a écrit :

Stéphane Lesimple wrote on 2015/09/16 22:41 +0200:

Le 2015-09-16 22:18, Duncan a écrit :

Stéphane Lesimple posted on Wed, 16 Sep 2015 15:04:20 +0200 as
excerpted:



Well actually it's the (d) option ;)
I activate the quota feature for only one reason: being able to track
down how much space my snapshots are taking.


Yeah, that's exactly one of the ideal use cases of btrfs qgroups.

But I'm quite curious about the btrfsck error report on qgroups.

If btrfsck reports such an error, it means either I'm too confident about
the recent qgroup accounting rework, or btrfsck has some bug I didn't
take into consideration during the kernel rework.

Would you please provide the full result of the previous btrfsck with the
qgroup error?


Sure, I've saved the log somewhere just in case, here you are:

Counts for qgroup id: 3359 are different
our:referenced 7530119168 referenced compressed 7530119168
disk:   referenced 7530086400 referenced compressed 7530086400
diff:   referenced 32768 referenced compressed 32768
our:exclusive 49152 exclusive compressed 49152
disk:   exclusive 32768 exclusive compressed 32768
diff:   exclusive 16384 exclusive compressed 16384
Counts for qgroup id: 3361 are different
our:referenced 7530119168 referenced compressed 7530119168
disk:   referenced 7530086400 referenced compressed 7530086400
diff:   referenced 32768 referenced compressed 32768
our:exclusive 49152 exclusive compressed 49152
disk:   exclusive 32768 exclusive compressed 32768
diff:   exclusive 16384 exclusive compressed 16384
Counts for qgroup id: 3362 are different
our:referenced 7530119168 referenced compressed 7530119168
disk:   referenced 7530086400 referenced compressed 7530086400
diff:   referenced 32768 referenced compressed 32768
our:exclusive 49152 exclusive compressed 49152
disk:   exclusive 32768 exclusive compressed 32768
diff:   exclusive 16384 exclusive compressed 16384
Counts for qgroup id: 3363 are different
our:referenced 7530119168 referenced compressed 7530119168
disk:   referenced 7530086400 referenced compressed 7530086400
diff:   referenced 32768 referenced compressed 32768
our:exclusive 49152 exclusive compressed 49152
disk:   exclusive 32768 exclusive compressed 32768
diff:   exclusive 16384 exclusive compressed 16384
Counts for qgroup id: 3361 are different
our:referenced 7530119168 referenced compressed 7530119168
disk:   referenced 7530086400 referenced compressed 7530086400
diff:   referenced 32768 referenced compressed 32768
our:exclusive 49152 exclusive compressed 49152
disk:   exclusive 32768 exclusive compressed 32768
diff:   exclusive 16384 exclusive compressed 16384
Counts for qgroup id: 3362 are different
our:referenced 7530119168 referenced compressed 7530119168
disk:   referenced 7530086400 referenced compressed 7530086400
diff:   referenced 32768 referenced compressed 32768
our:exclusive 49152 exclusive compressed 49152
disk:   exclusive 32768 exclusive compressed 32768
diff:   exclusive 16384 exclusive compressed 16384
Counts for qgroup id: 3363 are different
our:referenced 7530119168 referenced compressed 7530119168
disk:   referenced 7530086400 referenced compressed 7530086400
diff:   referenced 32768 referenced compressed 32768
our:exclusive 49152 exclusive compressed 49152
disk:   exclusive 32768 exclusive compressed 32768
diff:   exclusive 16384 exclusive compressed 16384
Counts for qgroup id: 3364 are different
our:referenced 7530119168 referenced compressed 7530119168
disk:   referenced 7530086400 referenced compressed 7530086400
diff:   referenced 32768 referenced compressed 32768
our:exclusive 49152 exclusive compressed 49152
disk:   exclusive 32768 exclusive compressed 32768
diff:   exclusive 16384 exclusive compressed 16384
Counts for qgroup id: 3365 are different
our:referenced 7530119168 referenced compressed 7530119168
disk:   referenced 7530086400 referenced compressed 7530086400
diff:   referenced 32768 referenced compressed 32768
our:exclusive 49152 exclusive compressed 49152
disk:   exclusive 32768 exclusive compressed 32768
diff:   exclusive 16384 exclusive compressed 16384
Counts for qgroup id: 3366 are different
our:referenced 7530119168 referenced compressed 7530119168
disk:   referenced 7530086400 referenced compressed 7530086400
diff:   referenced 32768 referenced compressed 32768
our:exclusive 16384 exclusive compressed 16384
disk:

Re: [PATCH 1/2] btrfs: Do per-chunk degrade mode check at mount time

2015-09-17 Thread Anand Jain



Hi Qu,

On 09/17/2015 09:48 AM, Qu Wenruo wrote:

To Anand Jain,

Any feedback on this method of allowing single chunks to still be
degraded-mountable?

It should be much better than allowing a degraded mount for any
missing-device case.


Yeah, this changes the way missing devices are counted, and it's more 
fine-grained. Makes sense to me.




Thanks,
Qu

Qu Wenruo wrote on 2015/09/16 11:43 +0800:

Btrfs supports different raid profiles for meta/data/sys chunks, and as
different profiles tolerate different numbers of missing devices, it's
better to check whether the filesystem can be mounted degraded on a
per-chunk basis.

So this patch adds a check in read_one_chunk() against the chunk's own
profile, instead of checking against the lowest duplication profile.

Reported-by: Zhao Lei 
Reported-by: Anand Jain 
Signed-off-by: Qu Wenruo 
---
  fs/btrfs/volumes.c | 17 +
  1 file changed, 17 insertions(+)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 644e070..3272187 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6164,12 +6164,15 @@ static int read_one_chunk(struct btrfs_root
*root, struct btrfs_key *key,
struct btrfs_chunk *chunk)
  {
struct btrfs_mapping_tree *map_tree = &root->fs_info->mapping_tree;
+struct super_block *sb = root->fs_info->sb;
  struct map_lookup *map;
  struct extent_map *em;
  u64 logical;
  u64 length;
  u64 devid;
  u8 uuid[BTRFS_UUID_SIZE];
+int missing = 0;
+int max_tolerated;
  int num_stripes;
  int ret;
  int i;
@@ -6238,7 +6241,21 @@ static int read_one_chunk(struct btrfs_root
*root, struct btrfs_key *key,
  btrfs_warn(root->fs_info, "devid %llu uuid %pU is missing",
  devid, uuid);
  }
+if (map->stripes[i].dev->missing)
+missing++;
  map->stripes[i].dev->in_fs_metadata = 1;
+
+}
+
+/* XXX: Why the function name is SO LONG?! */
+max_tolerated =
+btrfs_get_num_tolerated_disk_barrier_failures(map->type);
+if (missing > max_tolerated && !(sb->s_flags & MS_RDONLY)) {
+free_extent_map(em);
+btrfs_error(root->fs_info, -EIO,
+"missing device(%d) exceeds the limit(%d), writeable mount is not allowed\n",


 \n is not required.

Thanks, Anand


+missing, max_tolerated);
+return -EIO;
  }

write_lock(&map_tree->map_tree.lock);




Re: RAID1 storage server won't boot with one disk missing

2015-09-17 Thread Anand Jain


Thanks for the report.

 There is a bug where raid1 with one disk missing fails when you try to 
mount it for the 2nd time. I am not too sure whether in the boot 
process there would be a mount and then a remount/mount again? If yes, 
then it is potentially hitting the problem addressed in the patch below.


  Btrfs: allow -o rw,degraded for single group profile

 you may want to give this patch a try.

 more below..

On 09/17/2015 07:56 AM, erp...@gmail.com wrote:

Good afternoon,

Earlier today, I tried to set up a storage server using btrfs but ran
into some problems. The goal was to use two disks (4.0TB each) in a
raid1 configuration.

What I did:
1. Attached a single disk to a regular PC configured to boot with UEFI.
2. Booted from a thumb drive that had been made from an Ubuntu 14.04
Server x64 installation DVD.
3. Ran the installation procedure. When it came time to partition the
disk, I chose the guided partitioning option. The partitioning scheme
it suggested was:

* A 500MB EFI System Partition.
* An ext4 root partition of nearly 4 TB in size.
* A 4GB swap partition.

4. Changed the type of the middle partition from ext4 to btrfs, but
left everything else the same.
5. Finalized the partitioning scheme, allowing changes to be written to disk.
6. Continued the installation procedure until it finished. I was able
to boot into a working server from the single disk.
7. Attached the second disk.
8. Used parted to create a GPT label on the second disk and a btrfs
partition that was the same size as the btrfs partition on the first
disk.

# parted /dev/sdb
(parted) mklabel gpt
(parted) mkpart primary btrfs #s ##s
(parted) quit

9. Ran "btrfs device add /dev/sdb1 /" to add the second device to the
filesystem.
10. Ran "btrfs balance start -dconvert=raid1 -mconvert=raid1 /" and
waited for it to finish. It reported that it finished successfully.
11. Rebooted the system. At this point, everything appeared to be working.
12. Shut down the system, temporarily disconnected the second disk
(/dev/sdb) from the motherboard, and powered it back up.

What I expected to happen:
I expected that the system would either start as if nothing were
wrong, or would warn me that one half of the mirror was missing and
ask if I really wanted to start the system with the root array in a
degraded state.


 As of now it would/should start normally only when there is an
-o degraded entry.


 It looks like -o degraded is going to be a very obvious feature;
 I have plans to make it the default, and to provide a -o
 nodegraded option instead. Thanks for comments, if any.

Thanks, Anand



What actually happened:
During the boot process, a kernel message appeared indicating that the
"system array" could not be found for the root filesystem (as
identified by a UUID). It then dumped me to an initramfs prompt.
Powering down the system, reattaching the second disk, and powering it
on allowed me to boot successfully. Running "btrfs fi df /" showed
that all System data was stored as RAID1.

If I want to have a storage server where one of two drives can fail at
any time without causing much down time, am I on the right track? If
so, what should I try next to get the behavior I'm looking for?

Thanks,
Eric




Re: FYIO: A rant about btrfs

2015-09-17 Thread Martin Steigerwald
Am Mittwoch, 16. September 2015, 23:29:30 CEST schrieb Hugo Mills:
> > but even then having write-barriers
> > turned off is still not as safe as having them turned on.  Most of
> > the time when I've tried testing with 'nobarrier' (not just on BTRFS
> > but on ext* as well), I had just as many issues with data loss when
> > the system crashed as when it (simlated via killing the virtual
> > machine) lost power.  Both journaling and COW filesystems need to
> > ensure ordering of certain write operations to be able to maintain
> > consistency.  For example, the new/updated data blocks need to be on
> > disk before the metadata is updated to point to them, otherwise you
> > database can end up corrupted.
> 
>Indeed. The barriers are an ordering condition. The FS relies on
> (i.e. *requires*) that ordering condition, in order to be truly
> consistent. Running with "nobarrier" is a very strong signal that you
> really don't care about the data on the FS.
> 
>This is not a case of me simply believing that because I've been
> using btrfs for so long that I've got used to the peculiarities. The
> first time I heard about the nobarrier option, something like 6 years
> ago when I was first using btrfs, I thought "that's got to be a really
> silly idea". Any complex data structure, like a filesystem, is going
> to rely on some kind of ordering guarantees, somewhere in its
> structure. (The ordering might be strict, with a global clock, or
> barrier-based, or lattice-like, as for example a vector clock, but
> there's going to be _some_ concept of order). nobarrier allows the FS
> to ignore those guarantees, and even without knowing anything about
> the FS at all, doing so is a big red DANGER flag.

Official recommendation for XFS differs from that:

 Q. Should barriers be enabled with storage which has a persistent write 
cache?

Many hardware RAID have a persistent write cache which preserves it across 
power failure, interface resets, system crashes, etc. Using write barriers in 
this instance is not recommended and will in fact lower performance. 
Therefore, it is recommended to turn off the barrier support and mount the 
filesystem with "nobarrier", assuming your RAID controller is infallible and 
not resetting randomly like some common ones do. But take care about the hard 
disk write cache, which should be off. 

http://xfs.org/index.php/XFS_FAQ#Q._Should_barriers_be_enabled_with_storage_which_has_a_persistent_write_cache.3F

Thanks,
-- 
Martin


Re: RAID1 storage server won't boot with one disk missing

2015-09-17 Thread Chris Murphy
On Wed, Sep 16, 2015 at 5:56 PM, erp...@gmail.com  wrote:

> What I expected to happen:
> I expected that the system would either start as if nothing were
> wrong, or would warn me that one half of the mirror was missing and
> ask if I really wanted to start the system with the root array in a
> degraded state.

It's not this sophisticated yet. Btrfs does not "assemble" degraded by
default like mdadm and LVM based RAID. You need to manually mount it
with -o degraded and then continue the boot process, or use boot
parameter rootflags=degraded. Yet there is still some interaction
between btrfs dev scan and udev (?) that I don't understand precisely,
but what happens is when any device is missing, the Btrfs volume UUID
doesn't appear and therefore it still can't be mounted degraded if
volume UUID is used, e.g. boot parameter
root=UUID=  so that needs to be changed to a
/dev/sdXY type of notation and hope that you guess it correctly.
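
In practice that means something like the following, with hypothetical device 
names (the exact initramfs shell behaviour varies by distribution):

  # from the initramfs emergency shell: mount the surviving device degraded,
  # then let the boot continue
  mount -t btrfs -o degraded /dev/sda2 /root
  exit

  # or, for a single boot, edit the kernel command line (e.g. in GRUB) to use
  # a plain device node and allow a degraded mount:
  #   root=/dev/sda2 rootflags=degraded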



>
> What actually happened:
> During the boot process, a kernel message appeared indicating that the
> "system array" could not be found for the root filesystem (as
> identified by a UUID). It then dumped me to an initramfs prompt.
> Powering down the system, reattaching the second disk, and powering it
> on allowed me to boot successfully. Running "btrfs fi df /" showed
> that all System data was stored as RAID1.

Just an FYI to be really careful about degraded rw mounts. There is no
automatic resync to catch up the previously missing device with the
device that was degraded,rw mounted. You have to scrub or balance,
there's no optimization yet for Btrfs to effectively just "diff"
between the devices' generations and get them all in sync quickly.

Much worse is if you don't scrub or balance, and then redo the test
reversing the device to make missing. Now you have multiple devices
that were rw,degraded mounted, and putting them back together again
will corrupt the whole file system irreparably. Fixing the first
problem would (almost always) avoid the second problem.
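
A sketch of that catch-up step (the scrub or balance mentioned above), with a 
hypothetical mount point:

  # after the previously missing device is back, rewrite stale or missing
  # copies from the good device before doing anything else
  btrfs scrub start -Bd /mnt   # -B: run in the foreground, -d: per-device stats
  # a full balance also rewrites everything, but is much heavier:
  #   btrfs balance start /mnt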

> If I want to have a storage server where one of two drives can fail at
> any time without causing much down time, am I on the right track? If
> so, what should I try next to get the behavior I'm looking for?

It's totally not there yet if you want to obviate manual checks and
intervention for failure cases. Both mdadm and LVM integrated RAID
have monitoring and notification which Btrfs lacks entirely. So that
means you have to check it or create scripts to check it. What often
tends to happen is that Btrfs just keeps retrying rather than ignoring a
bad device, so you'll see piles of retries in dmesg, but Btrfs
doesn't kick out the bad device like md would. This could
go on for hours, or days. So if you aren't checking for it, you could
unwittingly have a degraded array already.
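
A minimal example of such a check, assuming a hypothetical mount point and 
mail address:

  #!/bin/bash
  # cron-able check: report non-zero btrfs device error counters
  MNT=/mnt/btrfs
  STATS=$(btrfs device stats "$MNT")
  if echo "$STATS" | awk '$2 != 0 { bad = 1 } END { exit !bad }'; then
      echo "$STATS" | mail -s "btrfs errors on $(hostname): $MNT" admin@example.com
  fi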



-- 
Chris Murphy


Re: RAID1 storage server won't boot with one disk missing

2015-09-17 Thread Chris Murphy
On Thu, Sep 17, 2015 at 9:18 AM, Anand Jain  wrote:
>
>  as of now it would/should start normally only when there is an entry -o
> degraded
>
>  it looks like -o degraded is going to be a very obvious feature,
>  I have plans of making it a default feature, and provide -o
>  nodegraded feature instead. Thanks for comments if any.


If degraded mounts happen by default, what happens when dev 1 goes
missing temporarily and dev 2 is mounted degraded,rw and then dev 1
reappears? Is there an automatic way to a.) catch up dev 1 with dev 2?
and then b.) automatically make the array no longer degraded?

I think it's a problem to have automatic degraded mounts when there's
no monitoring or notification system of problems. We can get silent
degraded mounts by default with no notification at all there's a
problem with a Btrfs volume.

So off hand my comment is that I think other work is needed before
degraded mounts is default behavior.



-- 
Chris Murphy


Re: FYIO: A rant about btrfs

2015-09-17 Thread Aneurin Price
On 16 September 2015 at 20:21, Austin S Hemmelgarn  wrote:
> ZFS has been around for much longer, it's been mature and feature complete 
> for more than a decade, and has had a long time to improve performance wise.  
> It is important to note though, that on low-end hardware, BTRFS can (and 
> often does in my experience) perform better than ZFS, because ZFS is a 
> serious resource hog (I have yet to see a stable ZFS deployment with less 
> than 16G of RAM, even with all the fancy features turned off).

If you have a real example of ZFS becoming unstable with, say, 4 or
8GB of memory, that doesn't involve attempting deduplication (which I
guess is what you mean by 'all the fancy features') on a many-TB pool,
I'd be interested to hear about it. (Psychic debugger says 'possibly
somebody trying to use a large L2ARC on a pool with many/large zvols')

My home fileserver has been running zfs for about 5 years, on a system
maxed out at 4GB RAM. Currently up to ~9TB of data. The only stability
problems I ever had were towards the beginning when I was using
zfs-fuse because zfsonlinux wasn't ready then, *and* I was trying out
deduplication.

I have a couple of work machines with 2GB RAM and pools currently
around 2.5TB full; no problems with these either in the couple of
years they've been in use, though granted these are lightly loaded
machines since what they mostly do is receive backup streams.

Bear in mind that these are Linux machines, and zfsonlinux's memory
management is known to be inferior to ZFS on Solaris and FreeBSD
(because it does not integrate with the page cache and instead grabs a
[configurable] chunk of memory, and doesn't always do a great job of
dropping it in response to memory pressure).


Re: RAID1 storage server won't boot with one disk missing

2015-09-17 Thread Goffredo Baroncelli
Hi Anand,


On 2015-09-17 17:18, Anand Jain wrote:
>  it looks like -o degraded is going to be a very obvious feature,
>  I have plans of making it a default feature, and provide -o
>  nodegraded feature instead. Thanks for comments if any.
> 
> Thanks, Anand

I am not sure if there is a "good" default for this kind of problem; there are 
several aspects:

- remote machine:
for a remote machine, I think that the root filesystem should be mounted 
anyway. For a secondary filesystem (home?), maybe user intervention 
would be better (but without home, how could a user log in?).

- spare:
in case of a degraded filesystem, the system could insert a spare disk, or a 
reshaping could be started (raid5->raid1, raid6->raid5).

- initramfs:
this is the most complicated thing: currently most initramfs implementations 
don't mount the filesystem if not all the volumes are available. Allowing a 
degraded root filesystem means:
a) waiting for the disks until a timeout
b) if the timeout expires, mounting in degraded mode (inserting a spare 
disk if available?)
c) otherwise mounting the filesystem as usual

- degraded:
I think that there are different levels of degraded. For example, in case of 
raid6 a missing device could be acceptable; however in case of raid5 this 
should not be allowed, and user intervention may be preferred.


In the past I suggested the use of a helper, mount.btrfs [1], which could 
handle all these cases better without kernel intervention:
- waiting for the devices to appear
- verifying that all the needed devices are present 
- mounting the filesystem, passing
 - all the devices to the kernel (without relying on udev and btrfs dev 
scan...)
 - whether to allow the degraded mode or not (policy)
 - whether to start insertion of a spare (policy)
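
A bare-bones sketch of the policy such a helper could implement; the device 
list, mount point and timeout are hypothetical:

  #!/bin/bash
  # toy mount.btrfs-style helper: wait for the member devices, then mount,
  # optionally falling back to a degraded mount as a matter of policy
  DEVICES=(/dev/sda2 /dev/sdb1)
  MNT=/mnt/data
  TIMEOUT=30

  for ((i = 0; i < TIMEOUT; i++)); do
      missing=0
      for dev in "${DEVICES[@]}"; do
          [ -b "$dev" ] || missing=1
      done
      [ "$missing" -eq 0 ] && break
      sleep 1
  done

  opts=""
  first=""
  for dev in "${DEVICES[@]}"; do
      [ -b "$dev" ] || continue
      first=${first:-$dev}
      opts="$opts,device=$dev"          # hand every present device to the kernel
  done
  [ "$missing" -ne 0 ] && opts="$opts,degraded"   # policy: allow degraded mount
  mount -t btrfs -o "${opts#,}" "$first" "$MNT"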

G.Baroncelli

[1] http://www.spinics.net/lists/linux-btrfs/msg39706.html


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: BTRFS as image store for KVM?

2015-09-17 Thread Chris Murphy
On Thu, Sep 17, 2015 at 11:56 AM, Gert Menke  wrote:
> Hi,
>
> thank you for your answers!
>
> So it seems there are several suboptimal alternatives here...
>
> MD+LVM is very close to what I want, but md has no way to cope with silent
> data corruption. So if I'd want to use a guest filesystem that has no
> checksums either, I'm out of luck.

You can use Btrfs in the guest to get at least notification of SDC. If
you want recovery also then that's a bit more challenging. The way
this has been done up until ZFS and Btrfs is T10 DIF (PI). There are
already checksums on the drive, but this adds more checksums that can
be confirmed through the entire storage stack, not just internal to
the drive hardware.

Another way is to put a conventional fs image on e.g. GlusterFS with
checksumming enabled (and at least distributed+replicated filtering).

If you do this directly on Btrfs, maybe you can mitigate some of the
fragmentation issues with bcache or dmcache; and for persistent
snapshotting, use qcow2 to do it instead of Btrfs. You'd use Btrfs
snapshots to create a subvolume for doing backups of the images, and
then get rid of the Btrfs snapshot.
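
One way that arrangement could look, with hypothetical paths (note that 
chattr +C also disables data checksums for those files):

  # dedicated subvolume for VM images, nodatacow to limit fragmentation,
  # qcow2 handling the persistent snapshots
  btrfs subvolume create /srv/vm-images
  chattr +C /srv/vm-images                 # new files inherit nodatacow
  qemu-img create -f qcow2 /srv/vm-images/guest.qcow2 40G

  # for backups: take a temporary read-only snapshot, copy it off, drop it
  btrfs subvolume snapshot -r /srv/vm-images /srv/vm-images-backup
  rsync -a /srv/vm-images-backup/ /backup/vm-images/
  btrfs subvolume delete /srv/vm-images-backup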


-- 
Chris Murphy


Re: BTRFS as image store for KVM?

2015-09-17 Thread Gert Menke

Hi,

thank you for your answers!

So it seems there are several suboptimal alternatives here...

MD+LVM is very close to what I want, but md has no way to cope with 
silent data corruption. So if I'd want to use a guest filesystem that 
has no checksums either, I'm out of luck.
I'm honestly a bit confused here - isn't checksumming one of the most 
obvious things to want in a software RAID setup? Is it a feature that 
might appear in the future? Maybe I should talk to the md guys...


BTRFS looks really nice feature-wise, but is not (yet) optimized for my 
use-case I guess. Disabling COW would certainly help, but I don't want 
to lose the data checksums. Is nodatacowbutkeepdatachecksums a feature 
that might turn up in the future?


Maybe ZFS is the best choice for my scenario. At least, it seems to work 
fine for Joyent - their SmartOS virtualization OS is essentially Illumos 
(Solaris) with ZFS, and KVM ported from Linux.
Since ZFS supports "Volumes" (virtual block devices inside a ZPool), I 
suspect these are probably optimized to be used for VM images (i.e. do 
as little COW as possible). Of course, snapshots will always degrade 
performance to a degree.


However, there are some drawbacks to ZFS:
- It's less flexible, especially when it comes to reconfiguration of 
disk arrays. Add or remove a disk to/from a RaidZ and rebalance, that 
would be just awesome. It's possible in BTRFS, but not ZFS. :-(
- The not-so-good integration of the fs cache, at least on Linux. I 
don't know if this is really an issue, though. Actually, I imagine it's 
more of an issue for guest systems, because it probably breaks memory 
ballooning. (?)


So it seems there are two options for me:
1. Go with ZFS for now, until BTRFS finds a better way to handle disk 
images, or until md gets data checksums.
2. Buy a bunch of SSDs for VM disk images and use spinning disks for 
data storage only. In that case, BTRFS should probably do fine.


Any comments on that? Am I missing something?

Thanks!
Gert


Re: BTRFS as image store for KVM?

2015-09-17 Thread Mike Fleetwood
On 17 September 2015 at 18:56, Gert Menke  wrote:
> MD+LVM is very close to what I want, but md has no way to cope with silent
> data corruption. So if I'd want to use a guest filesystem that has no
> checksums either, I'm out of luck.
> I'm honestly a bit confused here - isn't checksumming one of the most
> obvious things to want in a software RAID setup? Is it a feature that might
> appear in the future? Maybe I should talk to the md guys...
...
> Any comments on that? Am I missing something?

How about using file integrity checking tools for cases where the chosen
storage stack doesn't provide data checksumming?
E.g.
aide - http://aide.sourceforge.net/
cfv - http://cfv.sourceforge.net/
tripwire - http://sourceforge.net/projects/tripwire/

I don't use them myself, just providing options.
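
A bare-bones version of the same idea with plain coreutils, with hypothetical 
paths:

  # record checksums once, then re-verify later to detect silent corruption
  find /srv/images -type f -print0 | xargs -0 sha256sum > /var/lib/images.sha256
  # ... later ...
  sha256sum --quiet -c /var/lib/images.sha256    # prints only files that FAIL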

Mike


Re: kernel BUG at linux-4.2.0/fs/btrfs/extent-tree.c:1833 on rebalance

2015-09-17 Thread Stéphane Lesimple

Le 2015-09-17 12:41, Qu Wenruo a écrit :

>> In the meantime, I've reactivated quotas, umounted the filesystem and
>> ran a btrfsck on it: as you would expect, there's no qgroup problem
>> reported so far.
>
> At least the rescan code is working without problems.
>
>> I'll clear all my snapshots, run a quota rescan, then re-create them
>> one by one by rsyncing from my ext4 system that I still have.
>>
>> Maybe I'll run into the issue again.
>
> Would you mind doing the following check for each subvolume rsync?
>
> 1) Do 'sync; btrfs qgroup show -prce --raw' and save the output
> 2) Create the needed snapshot
> 3) Do 'sync; btrfs qgroup show -prce --raw' and save the output
> 4) Avoid doing IO if possible until step 6)
> 5) Do 'btrfs quota rescan -w' and save it
> 6) Do 'sync; btrfs qgroup show -prce --raw' and save the output
> 7) Rsync data from ext4 to the newly created snapshot
>
> The point is: since, as you mentioned, rescan is working fine, we can
> compare the output from 3), 6) and 1) to see which qgroup accounting
> numbers change.
>
> And if they differ, which means the qgroup update at write time OR at
> snapshot creation has something wrong, we can at least narrow the
> problem down to the qgroup update routine or to snapshot creation.
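For what it's worth, steps 1) to 7) could be wrapped in a small script
along these lines (only a sketch; the mount point, snapshot path and log
prefix are placeholders):

  #!/bin/sh
  # usage: qgroup-check.sh <mountpoint> <source-subvol> <new-snapshot>
  FS=$1; SRC=$2; SNAP=$3
  LOG=/root/qgroup-log

  sync; btrfs qgroup show -prce --raw "$FS" > "$LOG.step1"   # 1) baseline
  btrfs subvolume snapshot "$SRC" "$SNAP"                    # 2) create the snapshot
  sync; btrfs qgroup show -prce --raw "$FS" > "$LOG.step3"   # 3) after snapshot
  # 4) keep the filesystem quiet here
  btrfs quota rescan -w "$FS"                                # 5) rescan and wait
  sync; btrfs qgroup show -prce --raw "$FS" > "$LOG.step6"   # 6) after rescan
  # 7) now rsync the data from ext4 into "$SNAP"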


I was about to do that, but first there's something that sounds strange: 
I began by trashing all my snapshots, then ran a quota rescan and waited 
for it to complete, so as to start from a sane base.

However, this is the output of qgroup show now:

qgroupid           rfer                  excl  max_rfer  max_excl  parent  child
--------           ----                  ----  --------  --------  ------  -----
0/5               16384                 16384      none      none     ---    ---
0/1906    1657848029184         1657848029184      none      none     ---    ---
0/1909     124950921216          124950921216      none      none     ---    ---
0/1911    1054587293696         1054587293696      none      none     ---    ---
0/3270      23727300608           23727300608      none      none     ---    ---
0/3314      23206055936           23206055936      none      none     ---    ---
0/3317      18472996864                     0      none      none     ---    ---
0/3318      22235709440  18446744073708421120      none      none     ---    ---
0/3319      22240333824                     0      none      none     ---    ---
0/3320      22289608704                     0      none      none     ---    ---
0/3321      22289608704                     0      none      none     ---    ---
0/3322      18461151232                     0      none      none     ---    ---
0/3323      18423902208                     0      none      none     ---    ---
0/3324      18423902208                     0      none      none     ---    ---
0/3325      18463506432                     0      none      none     ---    ---
0/3326      18463506432                     0      none      none     ---    ---
0/3327      18463506432                     0      none      none     ---    ---
0/3328      18463506432                     0      none      none     ---    ---
0/3329      18585427968                     0      none      none     ---    ---
0/3330      18621472768  18446744073251348480      none      none     ---    ---
0/3331      18621472768                     0      none      none     ---    ---
0/3332      18621472768                     0      none      none     ---    ---
0/3333      18783076352                     0      none      none     ---    ---
0/3334      18799804416                     0      none      none     ---    ---
0/3335      18799804416                     0      none      none     ---    ---
0/3336      18816217088                     0      none      none     ---    ---
0/3337      18816266240                     0      none      none     ---    ---
0/3338      18816266240                     0      none      none     ---    ---
0/3339      18816266240                     0      none      none     ---    ---
0/3340      18816364544                     0      none      none     ---    ---
0/3341       7530119168            7530119168      none      none     ---    ---
0/3342       4919283712                     0      none      none     ---    ---
0/3343       4921724928                     0      none      none     ---    ---
0/3344       4921724928                     0      none      none     ---    ---
0/3345       6503317504  18446744073690902528      none      none     ---    ---
0/3346       6503452672                     0      none      none     ---    ---
0/3347       6509514752                     0      none      none     ---    ---
0/3348       6515793920                     0      none      none     ---    ---
0/3349       6515793920                     0      none      none     ---    ---
0/3350       6518685696                     0      none      none     ---    ---
0/3351       6521511936                     0      none      none     ---    ---

Re: RAID1 storage server won't boot with one disk missing

2015-09-17 Thread Roman Mamedov
On Thu, 17 Sep 2015 19:00:08 +0200
Goffredo Baroncelli  wrote:

> On 2015-09-17 17:18, Anand Jain wrote:
> >  it looks like -o degraded is going to be a very obvious feature,
> >  I have plans of making it a default feature, and provide -o
> >  nodegraded feature instead. Thanks for comments if any.
> > 
> I am not sure if there is a "good" default for this kind of problem

Yes there is. It is whatever people came to expect from using other RAID
systems and/or generally expect from RAID as a concept.

Both mdadm software RAID, and I believe virtually any hardware RAID
controller out there, will let you successfully boot up and give
read-write(!) access to a RAID in a non-critical failure state, because
that's kind of the whole point of RAID: to eliminate downtime. If the
removed disk is later re-added, it is automatically resynced. Mdadm can
also make use of its 'write intent bitmap' to resync only those areas of
the array which were touched in any way during the absence of the newly
re-added disk.

If you're concerned that the user "misses" the fact that they have a disk
down, then solve *that*: make some sort of notify daemon. E.g. mdadm has
a built-in "monitor" mode which sends e-mail on critical events for any
of the arrays.
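For comparison, the mdadm side of that looks roughly like this (array
name and mail address are placeholders of mine):

  Add a write-intent bitmap to an existing array:
  # mdadm --grow --bitmap=internal /dev/md0

  Run the monitor, mailing on degraded/failure events:
  # mdadm --monitor --scan --mail=root@localhost --daemonise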

-- 
With respect,
Roman


signature.asc
Description: PGP signature


Re: BTRFS as image store for KVM?

2015-09-17 Thread Hugo Mills
On Thu, Sep 17, 2015 at 07:56:08PM +0200, Gert Menke wrote:
> Hi,
> 
> thank you for your answers!
> 
> So it seems there are several suboptimal alternatives here...
> 
> MD+LVM is very close to what I want, but md has no way to cope with
> silent data corruption. So if I'd want to use a guest filesystem
> that has no checksums either, I'm out of luck.
> I'm honestly a bit confused here - isn't checksumming one of the
> most obvious things to want in a software RAID setup? Is it a
> feature that might appear in the future? Maybe I should talk to the
> md guys...
> 
> BTRFS looks really nice feature-wise, but is not (yet) optimized for
> my use-case I guess. Disabling COW would certainly help, but I don't
> want to lose the data checksums. Is nodatacowbutkeepdatachecksums a
> feature that might turn up in the future?
[snip]

   No. If you try doing that particular combination of features, you
end up with a filesystem that can be inconsistent: there's a race
condition between updating the data in a file and updating the csum
record for it, and the race can't be fixed.

   Hugo.

-- 
Hugo Mills | I spent most of my money on drink, women and fast
hugo@... carfax.org.uk | cars. The rest I wasted.
http://carfax.org.uk/  |
PGP: E2AB1DE4  |James Hunt


signature.asc
Description: Digital signature


Re: RAID1 storage server won't boot with one disk missing

2015-09-17 Thread Chris Murphy
On Thu, Sep 17, 2015 at 1:02 PM, Roman Mamedov  wrote:
> On Thu, 17 Sep 2015 19:00:08 +0200
> Goffredo Baroncelli  wrote:
>
>> On 2015-09-17 17:18, Anand Jain wrote:
>> >  it looks like -o degraded is going to be a very obvious feature,
>> >  I have plans of making it a default feature, and provide -o
>> >  nodegraded feature instead. Thanks for comments if any.
>> >
>> I am not sure if there is a "good" default for this kind of problem
>
> Yes there is. It is whatever people came to expect from using other RAID
> systems and/or generally expect from RAID as a concept.
>
> Both mdadm software RAID, and I believe virtually any hardware RAID
> controller out there, will let you successfully boot up and give
> read-write(!) access to a RAID in a non-critical failure state, because
> that's kind of the whole point of RAID: to eliminate downtime. If the
> removed disk is later re-added, it is automatically resynced. Mdadm can
> also make use of its 'write intent bitmap' to resync only those areas of
> the array which were touched in any way during the absence of the newly
> re-added disk.
>
> If you're concerned that the user "misses" the fact that they have a disk
> down, then solve *that*: make some sort of notify daemon. E.g. mdadm has
> a built-in "monitor" mode which sends e-mail on critical events for any
> of the arrays.

Given the current state: no proposal and no work done yet, I think
it's premature to change the default.

It's an open question what a modern monitoring and notification
mechanism should look like. At the moment it'd be a unique Btrfs thing,
because the mdadm and LVM methods aren't abstracted enough to reuse. I
wonder if the storaged and/or openlmi folks have some input on what this
would look like. Feedback from KDE and GNOME would also be useful, since
they rely on at least mdadm to present user-space notifications. I think
udisks2 is on the way out and storaged is on the way in; there's just
too much stuff that udisks2 doesn't do or gets confused about, including
LVM thinly provisioned volumes, not just Btrfs.


-- 
Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: BTRFS as image store for KVM?

2015-09-17 Thread Gert Menke

On 17.09.2015 at 21:43, Hugo Mills wrote:
> On Thu, Sep 17, 2015 at 07:56:08PM +0200, Gert Menke wrote:
>> BTRFS looks really nice feature-wise, but is not (yet) optimized for
>> my use-case I guess. Disabling COW would certainly help, but I don't
>> want to lose the data checksums. Is nodatacowbutkeepdatachecksums a
>> feature that might turn up in the future?
> [snip]
>
> No. If you try doing that particular combination of features, you
> end up with a filesystem that can be inconsistent: there's a race
> condition between updating the data in a file and updating the csum
> record for it, and the race can't be fixed.

I'm no filesystem expert, but isn't that what an intent log is for?
(Does btrfs have an intent log?)

And, is this also true for mirrored or raid5 disks? I'm thinking
something like "if the data does not match the checksum, just restore
both from mirror/parity" should be possible, right?


Gert
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[no subject]

2015-09-17 Thread Калинина Зинаида
See prices and product range at rusrusrus.ru

 

 

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: BTRFS as image store for KVM?

2015-09-17 Thread Gert Menke

On 17.09.2015 at 20:35, Chris Murphy wrote:
> You can use Btrfs in the guest to get at least notification of SDC.

Yes, but I'd rather not depend on all potential guest OSes having btrfs
or something similar.

> Another way is to put a conventional fs image on e.g. GlusterFS with
> checksumming enabled (and at least distributed+replicated filtering).

This sounds interesting! I'll have a look at this.

> If you do this directly on Btrfs, maybe you can mitigate some of the
> fragmentation issues with bcache or dmcache;

Thanks, I did not know about these. bcache seems to be more or less what
"zpool add foo cache /dev/ssd" does. Definitely worth a look.

> and for persistent snapshotting, use qcow2 to do it instead of Btrfs.
> You'd use Btrfs snapshots to create a subvolume for doing backups of
> the images, and then get rid of the Btrfs snapshot.

Good idea.

Thanks!
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] btrfs: Remove unneeded missing device number check

2015-09-17 Thread Qu Wenruo



Anand Jain wrote on 2015/09/18 09:47 +0800:



On 09/17/2015 06:01 PM, Qu Wenruo wrote:

Thanks for pointing this out.




Although the previous patch is small enough, for the remount case we need
to iterate over all the existing chunk cache.


  yes indeed.

  thinking hard on this: is there any test case that these two patches
solve which the original patch [1] didn't?


Yep, your patch is OK for fixing the single-chunk-on-a-safe-disk case.
But IMHO it's a little aggressive and not as safe as the old code.

For example, say one uses single metadata for 2 disks, and each disk has 
one metadata chunk on it.

One device goes missing later.

Then your patch will allow the fs to be mounted rw, even though some tree 
blocks may be on the missing device.
For the RO case it won't be too dangerous, but if we mount it RW, who 
knows what will happen.
(Normal tree COW should fail before a real write, but I'm not sure about 
other RW operations like scrub/replace/balance and so on.)
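An untested sketch of how that scenario might be reproduced, modelled on
the test case in the commit message quoted below (device names are
examples only):

  # mkfs.btrfs -f -m single -d single /dev/sdb /dev/sdc
  (write enough data that a metadata chunk ends up on each device)
  # wipefs -a /dev/sdc
  # mount /dev/sdb /mnt/btrfs -o degraded
  (a rw mount can now reference tree blocks that lived on the wiped disk)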


And I think that's the original design concept behind the old missing 
device number check, and it's not a bad idea to follow it anyway.


As for the patch size, I've found a good way to handle it, which should 
keep the patch(set) size below 200 lines.


Furthermore, it's even possible to make btrfs switch the mount option to 
degraded when a device goes missing at runtime.


Thanks,
Qu


  I tried to break both approaches (this patch set and [1]) but I
wasn't successful. Sorry if I am missing something.

Thanks, Anand

[1] [PATCH 23/23] Btrfs: allow -o rw,degraded for single group profile



So the fix for remount will take a little more time.



Thanks for reviewing.
Qu

在 2015年09月17日 17:43, Anand Jain 写道:



On 09/16/2015 11:43 AM, Qu Wenruo wrote:

As we do per-chunk missing device number check at read_one_chunk()
time,
it's not needed to do global missing device number check.

Just remove it.


However, the missing device count that we have during remount is not
fine-grained per chunk.
---
btrfs_remount
::
  if (fs_info->fs_devices->missing_devices >
  fs_info->num_tolerated_disk_barrier_failures &&
 !(*flags & MS_RDONLY ||
 btrfs_test_opt(root, DEGRADED))) {
 btrfs_warn(fs_info,
 "too many missing devices, writeable
remount is not allowed");
 ret = -EACCES;
 goto restore;
 }
-

Thanks, Anand



Now btrfs can handle the following case:
  # mkfs.btrfs -f -m raid1 -d single /dev/sdb /dev/sdc

  Data chunk will be located in sdb, so we should be safe to wipe sdc
  # wipefs -a /dev/sdc

  # mount /dev/sdb /mnt/btrfs -o degraded

Signed-off-by: Qu Wenruo 
---
  fs/btrfs/disk-io.c | 8 
  1 file changed, 8 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 0b658d0..ac640ea 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2947,14 +2947,6 @@ retry_root_backup:
  }
  fs_info->num_tolerated_disk_barrier_failures =
  btrfs_calc_num_tolerated_disk_barrier_failures(fs_info);
-if (fs_info->fs_devices->missing_devices >
- fs_info->num_tolerated_disk_barrier_failures &&
-!(sb->s_flags & MS_RDONLY)) {
-pr_warn("BTRFS: missing devices(%llu) exceeds the limit(%d),
writeable mount is not allowed\n",
-fs_info->fs_devices->missing_devices,
-fs_info->num_tolerated_disk_barrier_failures);
-goto fail_sysfs;
-}

  fs_info->cleaner_kthread = kthread_run(cleaner_kthread,
tree_root,
 "btrfs-cleaner");


--
To unsubscribe from this list: send the line "unsubscribe
linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: BTRFS as image store for KVM?

2015-09-17 Thread Duncan
Hugo Mills posted on Thu, 17 Sep 2015 19:43:14 + as excerpted:

>> Is nodatacowbutkeepdatachecksums a feature that might turn up
>> in the future?
> 
> No. If you try doing that particular combination of features, you
> end up with a filesystem that can be inconsistent: there's a race
> condition between updating the data in a file and updating the csum
> record for it, and the race can't be fixed.

...  Which is both why btrfs disables checksumming on nocow, and why
more traditional in-place-overwrite filesystems don't normally offer a 
checksumming feature -- it's only easily and reliably possible with copy-
on-write, as in-place-overwrite introduces race issues that are basically 
impossible to solve.

Logging can narrow the race, but consider: either the log introduces 
some level of copy-on-write itself, or, one way or another, you're going 
to have to write two things, one a checksum of the other. If both are 
in-place overwrites, then even with the race narrowed there's always 
going to be a point at which one has been written while the other hasn't 
been, and if failure occurs at that point...

The only real way around that is /some/ form of copy-on-write, such that 
both the change and its checksum can be written to a different location 
than the old version, with a single, atomic write then updating a pointer 
to point to the new version of both the data and its checksum, instead of 
the old one.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RAID1 storage server won't boot with one disk missing

2015-09-17 Thread Duncan
Anand Jain posted on Thu, 17 Sep 2015 23:18:36 +0800 as excerpted:

>> What I expected to happen:
>> I expected that the [btrfs raid1 data/metadata] system would either
>> start as if nothing were wrong, or would warn me that one half of the
>> mirror was missing and ask if I really wanted to start the system with
>> the root array in a degraded state.
> 
> as of now it would/should start normally only when there is an entry
> -o degraded
> 
> it looks like -o degraded is going to be a very obvious feature,
> I have plans of making it a default feature, and provide -o nodegraded
> feature instead. Thanks for comments if any.

As Chris Murphy, I have my doubts about this, and think it's likely to 
cause as many unhappy users as it prevents.

I'd definitely put -o nodegraded in my default options here, so it's not 
about me, but about all those others that would end up running a silently 
degraded system and have no idea until it's too late, as further devices 
have failed or the one single other available copy of something important 
(remember, still raid1 without N-mirrors option, unfortunately, so if a 
device drops out, that's now data/metadata with only a single valid copy 
regardless of the number of devices, and if it goes invalid...) fails 
checksum for whatever reason.

And since it only /allows/ degraded, not forcing it, if admins or distros 
want it as the default, -o degraded can be added now.  Nothing's stopping 
them except lack of knowledge of the option, the *same* lack of knowledge 
that would potentially cause so much harm if the default were switched.
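For instance, a hypothetical fstab line opting in today (the UUID and
mount point are placeholders):

  UUID=01234567-89ab-cdef-0123-456789abcdef  /data  btrfs  defaults,degraded  0  0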

Put it this way.  With the current default, if it fails and people have 
to ask about the unexpected failure here, no harm to existing data done, 
just add -o degraded and get on with things.  If -o degraded were made 
the default, failure mode would be *MUCH* worse, potential loss of the 
entire filesystem due to silent and thus uncorrected device loss and 
degraded mounting.

So despite the inconvenience of less knowledgeable people losing the 
availability of the filesystem until they can read the wiki or ask about 
it here, I don't believe changing the default to -o degraded is wise, at 
all.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] btrfs: Remove unneeded missing device number check

2015-09-17 Thread Anand Jain



On 09/17/2015 06:01 PM, Qu Wenruo wrote:

Thanks for pointing this out.




Although the previous patch is small enough, for the remount case we need
to iterate over all the existing chunk cache.


 yes indeed.

 thinking hard on this: is there any test case that these two patches 
solve which the original patch [1] didn't?


 I tried to break both approaches (this patch set and [1]) but I 
wasn't successful. Sorry if I am missing something.


Thanks, Anand

[1] [PATCH 23/23] Btrfs: allow -o rw,degraded for single group profile



So the fix for remount will take a little more time.



Thanks for reviewing.
Qu

在 2015年09月17日 17:43, Anand Jain 写道:



On 09/16/2015 11:43 AM, Qu Wenruo wrote:

As we do per-chunk missing device number check at read_one_chunk() time,
it's not needed to do global missing device number check.

Just remove it.


However, the missing device count that we have during remount is not
fine-grained per chunk.
---
btrfs_remount
::
  if (fs_info->fs_devices->missing_devices >
  fs_info->num_tolerated_disk_barrier_failures &&
 !(*flags & MS_RDONLY ||
 btrfs_test_opt(root, DEGRADED))) {
 btrfs_warn(fs_info,
 "too many missing devices, writeable
remount is not allowed");
 ret = -EACCES;
 goto restore;
 }
-

Thanks, Anand



Now btrfs can handle the following case:
  # mkfs.btrfs -f -m raid1 -d single /dev/sdb /dev/sdc

  Data chunk will be located in sdb, so we should be safe to wipe sdc
  # wipefs -a /dev/sdc

  # mount /dev/sdb /mnt/btrfs -o degraded

Signed-off-by: Qu Wenruo 
---
  fs/btrfs/disk-io.c | 8 
  1 file changed, 8 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 0b658d0..ac640ea 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2947,14 +2947,6 @@ retry_root_backup:
  }
  fs_info->num_tolerated_disk_barrier_failures =
  btrfs_calc_num_tolerated_disk_barrier_failures(fs_info);
-if (fs_info->fs_devices->missing_devices >
- fs_info->num_tolerated_disk_barrier_failures &&
-!(sb->s_flags & MS_RDONLY)) {
-pr_warn("BTRFS: missing devices(%llu) exceeds the limit(%d),
writeable mount is not allowed\n",
-fs_info->fs_devices->missing_devices,
-fs_info->num_tolerated_disk_barrier_failures);
-goto fail_sysfs;
-}

  fs_info->cleaner_kthread = kthread_run(cleaner_kthread, tree_root,
 "btrfs-cleaner");


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: BTRFS as image store for KVM?

2015-09-17 Thread Duncan
Chris Murphy posted on Thu, 17 Sep 2015 12:35:41 -0600 as excerpted:

> You'd use Btrfs snapshots to create a subvolume for doing backups of
> the images, and then get rid of the Btrfs snapshot.

The caveat here is that if the VM/DB is active during the backups (btrfs 
send/receive or other), it'll still COW1 any writes made during the 
existence of the btrfs snapshot.  If the backup can be scheduled during 
VM/DB downtime, or at least when activity is very low, the relatively 
short COW1 window should avoid serious fragmentation, but if not, even 
relatively short-lived snapshots are likely to trigger noticeable cow1 
fragmentation issues eventually.

Some users have ameliorated that by scheduling a weekly or monthly btrfs 
defrag, reporting that cow1 issues with temporary snapshots build up 
slowly enough that the scheduled defrag effectively eliminates the 
otherwise growing problem, but it's still an additional complication to 
have to configure and administer longer term.
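Such a scheduled defrag might look like the following cron.d entry (the
path and schedule are examples of mine, and note that defragmenting can
break reflink sharing with any snapshots that still exist):

  # /etc/cron.d/btrfs-defrag: weekly, Sunday 03:00
  0 3 * * 0  root  /usr/bin/btrfs filesystem defragment -r /srv/vm/images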

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: FYIO: A rant about btrfs

2015-09-17 Thread Duncan
Zygo Blaxell posted on Wed, 16 Sep 2015 18:08:56 -0400 as excerpted:

> On Wed, Sep 16, 2015 at 03:04:38PM -0400, Vincent Olivier wrote:
>> 
>> OK fine. Let it be clearer then (on the Btrfs wiki): nobarrier is an
>> absolute no go. Case closed.
> 
> Sometimes it is useful to make an ephemeral filesystem, i.e. a btrfs on
> a dm-crypt device with a random key that is not stored.  This
> configuration intentionally and completely destroys the entire
> filesystem, and all data on it, in the event of a power failure.  It's
> useful for things like temporary table storage, where ramfs is too
> small, swap-backed tmpfs is too slow, and/or there is a requirement that
> the data not be persisted across reboots.
> 
> In other words, nobarrier is for a little better performance when you
> already want to _intentionally_ destroy your filesystem on power
> failure.

Very good explanation of why it's useful to have such an otherwise 
destructive mount option even available in the first place.  Thanks! =:^)
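For the record, such an ephemeral filesystem can be set up along these
lines (device and mount point are placeholders; the key comes from
/dev/urandom and is never stored anywhere):

  # cryptsetup open --type plain --key-file /dev/urandom /dev/sdX scratch
  # mkfs.btrfs /dev/mapper/scratch
  # mount -o nobarrier /dev/mapper/scratch /mnt/scratch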

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: BTRFS as image store for KVM?

2015-09-17 Thread Sean Greenslade
On Thu, Sep 17, 2015 at 07:56:08PM +0200, Gert Menke wrote:
> MD+LVM is very close to what I want, but md has no way to cope with silent
> data corruption. So if I'd want to use a guest filesystem that has no
> checksums either, I'm out of luck.
> I'm honestly a bit confused here - isn't checksumming one of the most
> obvious things to want in a software RAID setup? Is it a feature that might
> appear in the future? Maybe I should talk to the md guys...

MD is emulating hardware RAID. In hardware RAID, you are doing
work at the block level. Block-level RAID has no understanding of the
filesystem(s) running on top of it. Therefore it would have to checksum
groups of blocks, and store those checksums on the physical disks
somewhere, perhaps by keeping some portion of the drive for itself. But
then this is not very efficient, since it is maintaining checksums for
data that may be useless (blocks the FS is not currently using). So then
you might make the RAID filesystem aware...and you now have BTRFS RAID.

Simply put, the block level is probably not an appropriate place for
checksumming to occur. BTRFS can make checksumming work much more
effectively and efficiently by doing it at the filesystem level.

--Sean
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RAID1 storage server won't boot with one disk missing

2015-09-17 Thread Gareth Pye
I think you have stated that in a very polite and friendly way. I'm
pretty sure I'd phrase it less politely :)

Following mdadm's example of an easy option to allow degraded mounting
makes sense, but it shouldn't be the default. Anyone with the expertise
to set that option can be expected to implement a way of knowing that
the mount is degraded.

People tend to be looking at BTRFS for a guarantee that data doesn't
die when hardware does. Defaults that defeat that shouldn't be used.

On Fri, Sep 18, 2015 at 11:36 AM, Duncan <1i5t5.dun...@cox.net> wrote:
> Anand Jain posted on Thu, 17 Sep 2015 23:18:36 +0800 as excerpted:
>
>>> What I expected to happen:
>>> I expected that the [btrfs raid1 data/metadata] system would either
>>> start as if nothing were wrong, or would warn me that one half of the
>>> mirror was missing and ask if I really wanted to start the system with
>>> the root array in a degraded state.
>>
>> as of now it would/should start normally only when there is an entry
>> -o degraded
>>
>> it looks like -o degraded is going to be a very obvious feature,
>> I have plans of making it a default feature, and provide -o nodegraded
>> feature instead. Thanks for comments if any.
>
> As Chris Murphy, I have my doubts about this, and think it's likely to
> cause as many unhappy users as it prevents.
>
> I'd definitely put -o nodegraded in my default options here, so it's not
> about me, but about all those others that would end up running a silently
> degraded system and have no idea until it's too late, as further devices
> have failed or the one single other available copy of something important
> (remember, still raid1 without N-mirrors option, unfortunately, so if a
> device drops out, that's now data/metadata with only a single valid copy
> regardless of the number of devices, and if it goes invalid...) fails
> checksum for whatever reason.
>
> And since it only /allows/ degraded, not forcing it, if admins or distros
> want it as the default, -o degraded can be added now.  Nothing's stopping
> them except lack of knowledge of the option, the *same* lack of knowledge
> that would potentially cause so much harm if the default were switched.
>
> Put it this way.  With the current default, if it fails and people have
> to ask about the unexpected failure here, no harm to existing data done,
> just add -o degraded and get on with things.  If -o degraded were made
> the default, failure mode would be *MUCH* worse, potential loss of the
> entire filesystem due to silent and thus uncorrected device loss and
> degraded mounting.
>
> So despite the inconvenience of less knowledgeable people losing the
> availability of the filesystem until they can read the wiki or ask about
> it here, I don't believe changing the default to -o degraded is wise, at
> all.
>
> --
> Duncan - List replies preferred.   No HTML msgs.
> "Every nonfree program has a lord, a master --
> and if you use the program, he is your master."  Richard Stallman
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Gareth Pye - blog.cerberos.id.au
Level 2 MTG Judge, Melbourne, Australia
"Dear God, I would like to file a bug report"
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html