Re: kernel BUG at linux-4.2.0/fs/btrfs/extent-tree.c:1833 on rebalance
On 2015-09-17 05:03, Qu Wenruo wrote: Stéphane Lesimple wrote on 2015/09/16 22:41 +0200: On 2015-09-16 22:18, Duncan wrote: Stéphane Lesimple posted on Wed, 16 Sep 2015 15:04:20 +0200 as excerpted:

Well, actually it's the (d) option ;) I activate the quota feature for only one reason: being able to track down how much space my snapshots are taking.

Yeah, that's exactly one of the ideal use cases of btrfs qgroup. But I'm quite curious about the btrfsck error report on qgroup. If btrfsck reports such an error, it means either I'm too confident about the recent qgroup accounting rework, or btrfsck has some bug which I didn't take into consideration during the kernel rework. Would you please provide the full result of the previous btrfsck with the qgroup error?

Sure, I've saved the log somewhere just in case, here you are:

Counts for qgroup id: 3359 are different
our:  referenced 7530119168 referenced compressed 7530119168
disk: referenced 7530086400 referenced compressed 7530086400
diff: referenced 32768 referenced compressed 32768
our:  exclusive 49152 exclusive compressed 49152
disk: exclusive 32768 exclusive compressed 32768
diff: exclusive 16384 exclusive compressed 16384
Counts for qgroup id: 3361 are different
our:  referenced 7530119168 referenced compressed 7530119168
disk: referenced 7530086400 referenced compressed 7530086400
diff: referenced 32768 referenced compressed 32768
our:  exclusive 49152 exclusive compressed 49152
disk: exclusive 32768 exclusive compressed 32768
diff: exclusive 16384 exclusive compressed 16384
Counts for qgroup id: 3362 are different
our:  referenced 7530119168 referenced compressed 7530119168
disk: referenced 7530086400 referenced compressed 7530086400
diff: referenced 32768 referenced compressed 32768
our:  exclusive 49152 exclusive compressed 49152
disk: exclusive 32768 exclusive compressed 32768
diff: exclusive 16384 exclusive compressed 16384
Counts for qgroup id: 3363 are different
our:  referenced 7530119168 referenced compressed 7530119168
disk: referenced 7530086400 referenced compressed 7530086400
diff: referenced 32768 referenced compressed 32768
our:  exclusive 49152 exclusive compressed 49152
disk: exclusive 32768 exclusive compressed 32768
diff: exclusive 16384 exclusive compressed 16384
Counts for qgroup id: 3361 are different
our:  referenced 7530119168 referenced compressed 7530119168
disk: referenced 7530086400 referenced compressed 7530086400
diff: referenced 32768 referenced compressed 32768
our:  exclusive 49152 exclusive compressed 49152
disk: exclusive 32768 exclusive compressed 32768
diff: exclusive 16384 exclusive compressed 16384
Counts for qgroup id: 3362 are different
our:  referenced 7530119168 referenced compressed 7530119168
disk: referenced 7530086400 referenced compressed 7530086400
diff: referenced 32768 referenced compressed 32768
our:  exclusive 49152 exclusive compressed 49152
disk: exclusive 32768 exclusive compressed 32768
diff: exclusive 16384 exclusive compressed 16384
Counts for qgroup id: 3363 are different
our:  referenced 7530119168 referenced compressed 7530119168
disk: referenced 7530086400 referenced compressed 7530086400
diff: referenced 32768 referenced compressed 32768
our:  exclusive 49152 exclusive compressed 49152
disk: exclusive 32768 exclusive compressed 32768
diff: exclusive 16384 exclusive compressed 16384
Counts for qgroup id: 3364 are different
our:  referenced 7530119168 referenced compressed 7530119168
disk: referenced 7530086400 referenced compressed 7530086400
diff: referenced 32768 referenced compressed 32768
our:  exclusive 49152 exclusive compressed 49152
disk: exclusive 32768 exclusive compressed 32768
diff: exclusive 16384 exclusive compressed 16384
Counts for qgroup id: 3365 are different
our:  referenced 7530119168 referenced compressed 7530119168
disk: referenced 7530086400 referenced compressed 7530086400
diff: referenced 32768 referenced compressed 32768
our:  exclusive 49152 exclusive compressed 49152
disk: exclusive 32768 exclusive compressed 32768
diff: exclusive 16384 exclusive compressed 16384
Counts for qgroup id: 3366 are different
our:  referenced 7530119168 referenced compressed 7530119168
disk: referenced 7530086400 referenced compressed 7530086400
diff: referenced 32768 referenced compressed 32768
our:  exclusive 16384 exclusive compressed 16384
disk: exclusive 16384 exclusive compressed 16384
Re: kernel BUG at linux-4.2.0/fs/btrfs/extent-tree.c:1833 on rebalance
On 2015-09-16 15:04, Stéphane Lesimple wrote: I also disabled quota because it has almost for sure nothing to do with the bug

As it turns out, that assertion was completely wrong. I've had balance running for more than 16 hours now, without a crash. That's almost 50% of the work done without any issue. Before, a crash would happen within minutes, sometimes an hour, but not much more. The thing is, I didn't change anything on the filesystem, well, apart from the benign quota disable. So Qu's question about the qgroup errors in fsck made me wonder: if I activate quota again, it'll still continue to balance flawlessly, right?

Well, it doesn't. I just ran btrfs quota enable on my filesystem; it completed successfully after a few minutes (rescan -s said that no rescan was pending). Then less than 5 minutes later, the kernel crashed, at the same BUG_ON() as usual:

[60156.062082] BTRFS info (device dm-3): relocating block group 972839452672 flags 129
[60185.203626] BTRFS info (device dm-3): found 1463 extents
[60414.452890] {btrfs} in insert_inline_extent_backref, got owner < BTRFS_FIRST_FREE_OBJECTID
[60414.452894] {btrfs} with bytenr=5197436141568 num_bytes=16384 parent=5336636473344 root_objectid=3358 owner=1 offset=0 refs_to_add=1 BTRFS_FIRST_FREE_OBJECTID=256
[60414.452924] ------------[ cut here ]------------
[60414.452928] kernel BUG at fs/btrfs/extent-tree.c:1837!

owner is 1 again at this point in the code (this is still kernel 4.3.0-rc1 with my added printks). So I'll disable quota, again, and resume the balance. If I'm right, it should proceed without issue for 18 more hours!

Qu, my filesystem is at your disposal :)

-- Stéphane.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: kernel BUG at linux-4.2.0/fs/btrfs/extent-tree.c:1833 on rebalance
On 2015-09-17 08:29, Stéphane Lesimple wrote: On 2015-09-16 15:04, Stéphane Lesimple wrote: I also disabled quota because it has almost for sure nothing to do with the bug

As it turns out, that assertion was completely wrong. I've had balance running for more than 16 hours now, without a crash. That's almost 50% of the work done without any issue. Before, a crash would happen within minutes, sometimes an hour, but not much more. The thing is, I didn't change anything on the filesystem, well, apart from the benign quota disable. So Qu's question about the qgroup errors in fsck made me wonder: if I activate quota again, it'll still continue to balance flawlessly, right?

Well, it doesn't. I just ran btrfs quota enable on my filesystem; it completed successfully after a few minutes (rescan -s said that no rescan was pending). Then less than 5 minutes later, the kernel crashed, at the same BUG_ON() as usual:

[60156.062082] BTRFS info (device dm-3): relocating block group 972839452672 flags 129
[60185.203626] BTRFS info (device dm-3): found 1463 extents
[60414.452890] {btrfs} in insert_inline_extent_backref, got owner < BTRFS_FIRST_FREE_OBJECTID
[60414.452894] {btrfs} with bytenr=5197436141568 num_bytes=16384 parent=5336636473344 root_objectid=3358 owner=1 offset=0 refs_to_add=1 BTRFS_FIRST_FREE_OBJECTID=256
[60414.452924] ------------[ cut here ]------------
[60414.452928] kernel BUG at fs/btrfs/extent-tree.c:1837!

owner is 1 again at this point in the code (this is still kernel 4.3.0-rc1 with my added printks). So I'll disable quota, again, and resume the balance. If I'm right, it should proceed without issue for 18 more hours!

Damn, wrong again. It just re-crashed without quota enabled :( The fact that it went perfectly well for 17+ hours and then crashed minutes after I reactivated quota might be pure chance then...
[ 5487.706499] {btrfs} in insert_inline_extent_backref, got owner < BTRFS_FIRST_FREE_OBJECTID
[ 5487.706504] {btrfs} with bytenr=6906661109760 num_bytes=16384 parent=6905020874752 root_objectid=18446744073709551608 owner=1 offset=0 refs_to_add=1 BTRFS_FIRST_FREE_OBJECTID=256
[ 5487.706536] ------------[ cut here ]------------
[ 5487.706539] kernel BUG at fs/btrfs/extent-tree.c:1837!

For reference, the crash I had earlier this morning was as follows:

[60414.452894] {btrfs} with bytenr=5197436141568 num_bytes=16384 parent=5336636473344 root_objectid=3358 owner=1 offset=0 refs_to_add=1 BTRFS_FIRST_FREE_OBJECTID=256

So, this is a completely different part of the filesystem. The bug is always the same, though: owner=1, where anything below 256 shouldn't appear. Balance cancelled. To me, it sounds like some sort of race condition. But I'm out of ideas on what to test now.

-- Stéphane.
Re: kernel BUG at linux-4.2.0/fs/btrfs/extent-tree.c:1833 on rebalance
On 2015-09-17 08:42, Qu Wenruo wrote: Stéphane Lesimple wrote on 2015/09/17 08:11 +0200: On 2015-09-17 05:03, Qu Wenruo wrote: Stéphane Lesimple wrote on 2015/09/16 22:41 +0200: On 2015-09-16 22:18, Duncan wrote: Stéphane Lesimple posted on Wed, 16 Sep 2015 15:04:20 +0200 as excerpted:

Well, actually it's the (d) option ;) I activate the quota feature for only one reason: being able to track down how much space my snapshots are taking.

Yeah, that's exactly one of the ideal use cases of btrfs qgroup. But I'm quite curious about the btrfsck error report on qgroup. If btrfsck reports such an error, it means either I'm too confident about the recent qgroup accounting rework, or btrfsck has some bug which I didn't take into consideration during the kernel rework. Would you please provide the full result of the previous btrfsck with the qgroup error?

Sure, I've saved the log somewhere just in case, here you are: [...]

Thanks for your log, pretty interesting result. BTW, did you enable qgroup from an old kernel, earlier than 4.2-rc1? If so, I would be much more relaxed, as it could be a problem of the old kernels.

The mkfs.btrfs was done under 3.19, but I'm almost sure I enabled quota under 4.2.0 precisely. My kern.log tends to confirm that (looking for 'qgroup scan completed').

If it's OK for you, would you please enable quota after reproducing the bug, use it for some time, and recheck it?

Sure, I've just reproduced the bug twice as I wanted, and posted the info, so now I've cancelled the balance and I can reenable quota. Will do it under 4.3.0-rc1. I'll keep you posted if btrfsck complains about it in the following days.

Regards,
-- Stéphane.
Re: [PATCH 2/2] btrfs: Remove unneeded missing device number check
On 09/16/2015 11:43 AM, Qu Wenruo wrote: As we do a per-chunk missing-device check at read_one_chunk() time, the global missing-device check is no longer needed. Just remove it.

However, the missing device count we have during a remount is not fine-grained per chunk:

btrfs_remount::
	if (fs_info->fs_devices->missing_devices >
	    fs_info->num_tolerated_disk_barrier_failures &&
	    !(*flags & MS_RDONLY || btrfs_test_opt(root, DEGRADED))) {
		btrfs_warn(fs_info,
			"too many missing devices, writeable remount is not allowed");
		ret = -EACCES;
		goto restore;
	}

- Thanks, Anand

Now btrfs can handle the following case:

 # mkfs.btrfs -f -m raid1 -d single /dev/sdb /dev/sdc

Data chunk will be located in sdb, so we should be safe to wipe sdc:

 # wipefs -a /dev/sdc
 # mount /dev/sdb /mnt/btrfs -o degraded

Signed-off-by: Qu Wenruo
---
 fs/btrfs/disk-io.c | 8 --------
 1 file changed, 8 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 0b658d0..ac640ea 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2947,14 +2947,6 @@ retry_root_backup:
 	}
 	fs_info->num_tolerated_disk_barrier_failures =
 		btrfs_calc_num_tolerated_disk_barrier_failures(fs_info);
-	if (fs_info->fs_devices->missing_devices >
-	    fs_info->num_tolerated_disk_barrier_failures &&
-	    !(sb->s_flags & MS_RDONLY)) {
-		pr_warn("BTRFS: missing devices(%llu) exceeds the limit(%d), writeable mount is not allowed\n",
-			fs_info->fs_devices->missing_devices,
-			fs_info->num_tolerated_disk_barrier_failures);
-		goto fail_sysfs;
-	}
 	fs_info->cleaner_kthread = kthread_run(cleaner_kthread, tree_root,
 					       "btrfs-cleaner");
Re: kernel BUG at linux-4.2.0/fs/btrfs/extent-tree.c:1833 on rebalance
On 2015-09-17 10:11, Qu Wenruo wrote: Stéphane Lesimple wrote on 2015/09/17 10:02 +0200: On 2015-09-17 08:42, Qu Wenruo wrote: Stéphane Lesimple wrote on 2015/09/17 08:11 +0200: On 2015-09-17 05:03, Qu Wenruo wrote: Stéphane Lesimple wrote on 2015/09/16 22:41 +0200: On 2015-09-16 22:18, Duncan wrote: Stéphane Lesimple posted on Wed, 16 Sep 2015 15:04:20 +0200 as excerpted:

Well, actually it's the (d) option ;) I activate the quota feature for only one reason: being able to track down how much space my snapshots are taking.

Yeah, that's exactly one of the ideal use cases of btrfs qgroup. But I'm quite curious about the btrfsck error report on qgroup. If btrfsck reports such an error, it means either I'm too confident about the recent qgroup accounting rework, or btrfsck has some bug which I didn't take into consideration during the kernel rework. Would you please provide the full result of the previous btrfsck with the qgroup error?

Sure, I've saved the log somewhere just in case, here you are: [...]

Thanks for your log, pretty interesting result. BTW, did you enable qgroup from an old kernel, earlier than 4.2-rc1? If so, I would be much more relaxed, as it could be a problem of the old kernels.

The mkfs.btrfs was done under 3.19, but I'm almost sure I enabled quota under 4.2.0 precisely. My kern.log tends to confirm that (looking for 'qgroup scan completed').

Emmm, seems I need to pay more attention to this case now. Any info about the workload on this btrfs fs?

If it's OK for you, would you please enable quota after reproducing the bug, use it for some time, and recheck it?

Sure, I've just reproduced the bug twice as I wanted, and posted the info, so now I've cancelled the balance and I can reenable quota. Will do it under 4.3.0-rc1. I'll keep you posted if btrfsck complains about it in the following days.

Regards,

Thanks for your patience and detailed report.

You're very welcome.

But I still have another question: did you do any snapshot deletion after quota was enabled? (I'll assume you did, as there are a lot of backup snapshots; old ones should already be deleted.)

Actually no: this btrfs filesystem is quite new (less than a week old), as I'm migrating from mdadm(raid1)+ext4 to btrfs. So those snapshots were actually rsynced one by one from my hardlink-based "snapshots" under ext4 (those pseudo-snapshots are created using a program named "rsnapshot", if you know it; it's basically a wrapper around cp -la). I haven't yet activated automatic snapshot creation/deletion on my btrfs system, due to the bugs I'm tripping on. So no snapshot was deleted.

That's one of the known bugs and Mark is working on it actively. If you delete non-empty snapshots a lot, then I'd better add a hot fix to mark qgroup inconsistent after snapshot deletion, and trigger a rescan if possible.

I've made a btrfs-image of the filesystem just before disabling quotas (which I did to get a clean btrfsck and eliminate quotas from the equation while trying to reproduce the bug I have). Would it be of any use if I drop it somewhere for you to pick it up? (2.9G in size.)

In the meantime, I've reactivated quotas, unmounted the filesystem and ran a btrfsck on it: as you would expect, there's no qgroup problem reported so far.

I'll clear all my snapshots, run a quota rescan, then re-create them one by one by rsyncing from the ext4 system I still have. Maybe I'll run into the issue again.

-- Stéphane.
btrfs progs 4.1.1 & 4.2 segfault on chunk-recover
Hello guys, I think I might have found a bug. Lots of text; I don't know what you want from me and what not, so I'll try to get almost everything into one mail, please don't shoot me! :)

To make a long story somewhat short, this is about what happened to me (skip ahead if you don't care about the history):

Arch Linux, btrfs-progs 4.1.1 & 4.2, linux 4.1.6-1

Data, RAID5: total=3.11TiB, used=0.00B  <-- this one said used=3.05TiB the other day
System, RAID1: total=32.00MiB, used=0.00B
Metadata, RAID1: total=8.00GiB, used=144.00KiB
GlobalReserve, single: total=16.00MiB, used=0.00B

Label: 'Isolinear' uuid: 9bb3f369-f2a9-46be-8dde-1106ae740e36
	Total devices 9 FS bytes used 144.00KiB
	devid  7 size 2.73TiB used 541.12GiB path /dev/sdi
	devid  9 size 1.36TiB used 533.09GiB path /dev/sdd2
	devid 10 size 1.36TiB used 533.09GiB path /dev/sdg2
	devid 11 size 1.82TiB used 536.12GiB path /dev/sdj2
	devid 12 size 1.82TiB used 538.09GiB path /dev/sdh2
	devid 13 size 286.09GiB used 286.09GiB path /dev/sda3
	devid 14 size 286.09GiB used 286.09GiB path /dev/sdb3
	devid 15 size 372.61GiB used 372.61GiB path /dev/sdf1
	*** Some devices missing

Drive 8 was a 1.36TiB drive; 15 is the new drive I added to the system.

* One of the 8 drives started to fail; SMART saw the error, but I messed up my configuration and didn't get notified. It ran for 3-14 days before I realized.
* I tried, on the active running system, to btrfs dev del /dev/sd[failing]. Did not work (I think it was csum errors).
* I added one new disk to the raid, rebooted and added the new disk to the array, then tried balancing. Power failed, and the UPS failed too, after x hours.
* I rebooted and realized the failing drive was now dead. I could mount the system with degraded; some files gave me a kernel panic ( https://goo.gl/photos/UXrZj6YEUW3945b37 ), others were reading fine. Was unable to dev del missing.

At this point I knew the system was probably broken beyond repair, so I just tried all commands I could think of.
check repair, check init-csum-tree, etc.: endless loop. First very fast text scrolling, lots of CPU, not much disk I/O; after ~48h the text was slow, lots of CPU, almost no disk I/O, with the same type of message repeating (with new numbers):

ref mismatch on [17959857729536 4096] extent item 0, found 1
adding new data backref on 17959857729536 parent 35277570539520 owner 0 offset 0 found 1
Backref 17959857729536 parent 35277570539520 owner 0 offset 0 num_refs 0 not found in extent tree
Incorrect local backref count on 17959857729536 parent 35277570539520 owner 0 offset 0 found 1 wanted 0 back 0x145f7800
backpointer mismatch on [17959857729536 4096]
ref mismatch on [17959857733632 4096] extent item 0, found 1
adding new data backref on 17959857733632 parent 35277570785280 owner 0 offset 0 found 1
Backref 17959857733632 parent 35277570785280 owner 0 offset 0 num_refs 0 not found in extent tree
Incorrect local backref count on 17959857733632 parent 35277570785280 owner 0 offset 0 found 1 wanted 0 back 0x145f7b90
backpointer mismatch on [17959857733632 4096]

I then found out that chunk-recover gave a segfault (4.1.1 & kdave 4.2). 4.1.1 said in bt:

#0 0x004251bb in btrfs_new_device_extent_record ()
#1 0x004301cb in ?? ()
#2 0x0043085d in ?? ()
#3 0x7fd8071074a4 in start_thread () from /usr/lib/libpthread.so.0
#4 0x7fd806e4513d in clone () from /usr/lib/libc.so.6

Not much help, but I compiled -> https://github.com/kdave/btrfs-progs and got this backtrace: --> http://pastebin.com/XqRrqAB5

I can repeat the segfault. I made two btrfs-images, one around 4MB, the other around 300MB I think it was. So, did I find a bug? I can't find my logs from the beginning of my drive failing, of what it said when I tried to remove the broken drive.
I might be able to try the setup again (I've got one more drive-about-to-fail).

PS: I've tried to make alpine work, but it won't accept my passwords. I hope the gmail web client is OK for you guys; the openwrt dev team rejected my posts just because of this email client.

best regards
Daniel
end
Re: kernel BUG at linux-4.2.0/fs/btrfs/extent-tree.c:1833 on rebalance
On 2015-09-17 18:08, Stéphane Lesimple wrote: On 2015-09-17 10:11, Qu Wenruo wrote: Stéphane Lesimple wrote on 2015/09/17 10:02 +0200: On 2015-09-17 08:42, Qu Wenruo wrote: Stéphane Lesimple wrote on 2015/09/17 08:11 +0200: On 2015-09-17 05:03, Qu Wenruo wrote: Stéphane Lesimple wrote on 2015/09/16 22:41 +0200: On 2015-09-16 22:18, Duncan wrote: Stéphane Lesimple posted on Wed, 16 Sep 2015 15:04:20 +0200 as excerpted:

Well, actually it's the (d) option ;) I activate the quota feature for only one reason: being able to track down how much space my snapshots are taking.

Yeah, that's exactly one of the ideal use cases of btrfs qgroup. But I'm quite curious about the btrfsck error report on qgroup. If btrfsck reports such an error, it means either I'm too confident about the recent qgroup accounting rework, or btrfsck has some bug which I didn't take into consideration during the kernel rework. Would you please provide the full result of the previous btrfsck with the qgroup error?

Sure, I've saved the log somewhere just in case, here you are: [...]

Thanks for your log, pretty interesting result. BTW, did you enable qgroup from an old kernel, earlier than 4.2-rc1? If so, I would be much more relaxed, as it could be a problem of the old kernels.

The mkfs.btrfs was done under 3.19, but I'm almost sure I enabled quota under 4.2.0 precisely. My kern.log tends to confirm that (looking for 'qgroup scan completed').

Emmm, seems I need to pay more attention to this case now. Any info about the workload on this btrfs fs?

If it's OK for you, would you please enable quota after reproducing the bug, use it for some time, and recheck it?

Sure, I've just reproduced the bug twice as I wanted, and posted the info, so now I've cancelled the balance and I can reenable quota. Will do it under 4.3.0-rc1. I'll keep you posted if btrfsck complains about it in the following days.

Regards,

Thanks for your patience and detailed report.

You're very welcome.
But I still have another question: did you do any snapshot deletion after quota was enabled? (I'll assume you did, as there are a lot of backup snapshots; old ones should already be deleted.)

Actually no: this btrfs filesystem is quite new (less than a week old), as I'm migrating from mdadm(raid1)+ext4 to btrfs. So those snapshots were actually rsynced one by one from my hardlink-based "snapshots" under ext4 (those pseudo-snapshots are created using a program named "rsnapshot", if you know it; it's basically a wrapper around cp -la). I haven't yet activated automatic snapshot creation/deletion on my btrfs system, due to the bugs I'm tripping on. So no snapshot was deleted.

Now things are getting tricky: as all the known bugs are ruled out, it must be another hidden bug, even though we reworked the qgroup accounting code.

That's one of the known bugs and Mark is working on it actively. If you delete non-empty snapshots a lot, then I'd better add a hot fix to mark qgroup inconsistent after snapshot deletion, and trigger a rescan if possible.

I've made a btrfs-image of the filesystem just before disabling quotas (which I did to get a clean btrfsck and eliminate quotas from the equation while trying to reproduce the bug I have). Would it be of any use if I drop it somewhere for you to pick it up? (2.9G in size.)

For a mismatch case, a static btrfs-image dump won't really help, as the important point is when, and which, operation caused the qgroup accounting to mismatch.

In the meantime, I've reactivated quotas, unmounted the filesystem and ran a btrfsck on it: as you would expect, there's no qgroup problem reported so far.

At least the rescan code is working without problem.

I'll clear all my snapshots, run a quota rescan, then re-create them one by one by rsyncing from the ext4 system I still have. Maybe I'll run into the issue again.

Would you mind doing the following check for each subvolume rsync?
1) Do 'sync; btrfs qgroup show -prce --raw' and save the output
2) Create the needed snapshot
3) Do 'sync; btrfs qgroup show -prce --raw' and save the output
4) Avoid doing IO if possible until step 6)
5) Do 'btrfs quota rescan -w' and save it
6) Do 'sync; btrfs qgroup show -prce --raw' and save the output
7) Rsync data from ext4 to the newly created snapshot

The point is, as you mentioned, rescan is working fine, so we can compare the outputs from 3), 6) and 1) to see which qgroup accounting numbers change. If they differ, meaning the qgroup update at write time OR at snapshot creation has something wrong, at least we can narrow the problem down to either the qgroup update routine or snapshot creation.

Thanks, Qu
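The seven steps above can be wrapped in a small shell function. This is only a sketch: the MNT/SRC/SNAP/OUT variables and the saved file names are mine, not from the mail, and SRC is whatever subvolume you are about to snapshot.

```shell
# Sketch of the 7-step qgroup check from the mail above.
# Caller sets: MNT (btrfs mountpoint), SRC (subvolume to snapshot),
# SNAP (snapshot path to create), OUT (directory for saved outputs).
run_qgroup_check() {
    mkdir -p "$OUT"

    save() {  # $1 = label; record the qgroup counters under that label
        sync
        btrfs qgroup show -prce --raw "$MNT" > "$OUT/qgroup-$1.txt"
    }

    save 1-before                               # step 1
    btrfs subvolume snapshot "$SRC" "$SNAP"     # step 2
    save 3-after-snapshot                       # step 3
    # step 4: avoid doing I/O on the fs until the rescan below finishes
    btrfs quota rescan -w "$MNT"                # step 5
    save 6-after-rescan                         # step 6
    # step 7: now rsync the data from ext4 into "$SNAP", then diff the
    # three saved files to see which accounting numbers moved, and when
}
```

Usage: set the four variables and call it, e.g. `MNT=/mnt/btrfs; SRC=/mnt/btrfs/data; SNAP=/mnt/btrfs/snap/2015-09-17; OUT=/tmp/qgroup-logs; run_qgroup_check`, then compare the qgroup-*.txt files with diff.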
Re: FYIO: A rant about btrfs
On 2015-09-16 19:31, Hugo Mills wrote: On Wed, Sep 16, 2015 at 03:21:26PM -0400, Austin S Hemmelgarn wrote: On 2015-09-16 12:45, Martin Tippmann wrote: 2015-09-16 17:20 GMT+02:00 Austin S Hemmelgarn: [...] [...]

From reading the list I understand that btrfs is still very much a work in progress and performance is not a top priority at this stage, but I don't see why it shouldn't perform at least as well as ZFS/F2FS on the same workloads. Is looking at performance problems on the development roadmap?

Performance is on the roadmap, but the roadmap is notoriously short-sighted when it comes to time-frames for completion. You also have to understand that the focus in BTRFS has been more on data safety than performance, because that's the intended niche, and the area most people look to ZFS for.

Wait... there's a roadmap? ;)

Yeah, maybe it's better to say that there's a directed graph of feature interdependence. I was just basing my statement on the presence of a list of project ideas on the wiki. :)
Re: kernel BUG at linux-4.2.0/fs/btrfs/extent-tree.c:1833 on rebalance
Stéphane Lesimple wrote on 2015/09/17 10:02 +0200: On 2015-09-17 08:42, Qu Wenruo wrote: Stéphane Lesimple wrote on 2015/09/17 08:11 +0200: On 2015-09-17 05:03, Qu Wenruo wrote: Stéphane Lesimple wrote on 2015/09/16 22:41 +0200: On 2015-09-16 22:18, Duncan wrote: Stéphane Lesimple posted on Wed, 16 Sep 2015 15:04:20 +0200 as excerpted:

Well, actually it's the (d) option ;) I activate the quota feature for only one reason: being able to track down how much space my snapshots are taking.

Yeah, that's exactly one of the ideal use cases of btrfs qgroup. But I'm quite curious about the btrfsck error report on qgroup. If btrfsck reports such an error, it means either I'm too confident about the recent qgroup accounting rework, or btrfsck has some bug which I didn't take into consideration during the kernel rework. Would you please provide the full result of the previous btrfsck with the qgroup error?

Sure, I've saved the log somewhere just in case, here you are: [...]

Thanks for your log, pretty interesting result. BTW, did you enable qgroup from an old kernel, earlier than 4.2-rc1? If so, I would be much more relaxed, as it could be a problem of the old kernels.

The mkfs.btrfs was done under 3.19, but I'm almost sure I enabled quota under 4.2.0 precisely. My kern.log tends to confirm that (looking for 'qgroup scan completed').

Emmm, seems I need to pay more attention to this case now. Any info about the workload on this btrfs fs?

If it's OK for you, would you please enable quota after reproducing the bug, use it for some time, and recheck it?

Sure, I've just reproduced the bug twice as I wanted, and posted the info, so now I've cancelled the balance and I can reenable quota. Will do it under 4.3.0-rc1. I'll keep you posted if btrfsck complains about it in the following days.

Regards,

Thanks for your patience and detailed report.

But I still have another question: did you do any snapshot deletion after quota was enabled?
(I'll assume you did, as there are a lot of backup snapshots; old ones should already be deleted.)

That's one of the known bugs and Mark is working on it actively. If you delete non-empty snapshots a lot, then I'd better add a hot fix to mark qgroup inconsistent after snapshot deletion, and trigger a rescan if possible.

Thanks, Qu
Re: kernel BUG at linux-4.2.0/fs/btrfs/extent-tree.c:1833 on rebalance
Stéphane Lesimple wrote on 2015/09/17 08:11 +0200: On 2015-09-17 05:03, Qu Wenruo wrote: Stéphane Lesimple wrote on 2015/09/16 22:41 +0200: On 2015-09-16 22:18, Duncan wrote: Stéphane Lesimple posted on Wed, 16 Sep 2015 15:04:20 +0200 as excerpted:

Well, actually it's the (d) option ;) I activate the quota feature for only one reason: being able to track down how much space my snapshots are taking.

Yeah, that's exactly one of the ideal use cases of btrfs qgroup. But I'm quite curious about the btrfsck error report on qgroup. If btrfsck reports such an error, it means either I'm too confident about the recent qgroup accounting rework, or btrfsck has some bug which I didn't take into consideration during the kernel rework. Would you please provide the full result of the previous btrfsck with the qgroup error?

Sure, I've saved the log somewhere just in case, here you are: [...]
Re: [PATCH 1/2] btrfs: Do per-chunk degrade mode check at mount time
Hi Qu,

On 09/17/2015 09:48 AM, Qu Wenruo wrote:

To Anand Jain,

Any feedback on this method to allow a single chunk to still be degraded-mountable? It should be much better than allowing a degraded mount for any missing-device case.

yeah. this changes the way missing devices are counted and it's more fine grained. makes sense to me.

Thanks, Qu

Qu Wenruo wrote on 2015/09/16 11:43 +0800:

Btrfs supports different raid profiles for meta/data/sys, and as different profiles tolerate different numbers of missing devices, it's better to check whether a degraded mount is possible on a per-chunk basis.

So this patch adds a check in read_one_chunk() against the chunk's own profile, rather than checking against the lowest-duplication profile.

Reported-by: Zhao Lei
Reported-by: Anand Jain
Signed-off-by: Qu Wenruo
---
 fs/btrfs/volumes.c | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 644e070..3272187 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6164,12 +6164,15 @@ static int read_one_chunk(struct btrfs_root *root, struct btrfs_key *key,
 			  struct btrfs_chunk *chunk)
 {
 	struct btrfs_mapping_tree *map_tree = &root->fs_info->mapping_tree;
+	struct super_block *sb = root->fs_info->sb;
 	struct map_lookup *map;
 	struct extent_map *em;
 	u64 logical;
 	u64 length;
 	u64 devid;
 	u8 uuid[BTRFS_UUID_SIZE];
+	int missing = 0;
+	int max_tolerated;
 	int num_stripes;
 	int ret;
 	int i;
@@ -6238,7 +6241,21 @@ static int read_one_chunk(struct btrfs_root *root, struct btrfs_key *key,
 			btrfs_warn(root->fs_info, "devid %llu uuid %pU is missing",
 				   devid, uuid);
 		}
+		if (map->stripes[i].dev->missing)
+			missing++;
 		map->stripes[i].dev->in_fs_metadata = 1;
+	}
+
+	/* XXX: Why the function name is SO LONG?!
+	 */
+	max_tolerated =
+		btrfs_get_num_tolerated_disk_barrier_failures(map->type);
+	if (missing > max_tolerated && !(sb->s_flags & MS_RDONLY)) {
+		free_extent_map(em);
+		btrfs_error(root->fs_info, -EIO,
+			"missing device(%d) exceeds the limit(%d), writeable mount is not allowed\n",

\n is not required.

Thanks, Anand

+			missing, max_tolerated);
+		return -EIO;
 	}

 	write_lock(&map_tree->map_tree.lock);
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: RAID1 storage server won't boot with one disk missing
Thanks for the report. There is a bug where raid1 with one disk missing fails when you try to mount it for the 2nd time. I am not too sure whether the boot process mounts and then remounts/mounts again? If yes, then it is potentially hitting the problem addressed in the patch below:

  Btrfs: allow -o rw,degraded for single group profile

you may want to give this patch a try. more below..

On 09/17/2015 07:56 AM, erp...@gmail.com wrote:

Good afternoon,

Earlier today, I tried to set up a storage server using btrfs but ran into some problems. The goal was to use two disks (4.0TB each) in a raid1 configuration. What I did:

1. Attached a single disk to a regular PC configured to boot with UEFI.
2. Booted from a thumb drive that had been made from an Ubuntu 14.04 Server x64 installation DVD.
3. Ran the installation procedure. When it came time to partition the disk, I chose the guided partitioning option. The partitioning scheme it suggested was:
   * A 500MB EFI System Partition.
   * An ext4 root partition of nearly 4 TB in size.
   * A 4GB swap partition.
4. Changed the type of the middle partition from ext4 to btrfs, but left everything else the same.
5. Finalized the partitioning scheme, allowing changes to be written to disk.
6. Continued the installation procedure until it finished. I was able to boot into a working server from the single disk.
7. Attached the second disk.
8. Used parted to create a GPT label on the second disk and a btrfs partition that was the same size as the btrfs partition on the first disk.
   # parted /dev/sdb
   (parted) mklabel gpt
   (parted) mkpart primary btrfs #s ##s
   (parted) quit
9. Ran "btrfs device add /dev/sdb1 /" to add the second device to the filesystem.
10. Ran "btrfs balance start -dconvert=raid1 -mconvert=raid1 /" and waited for it to finish. It reported that it finished successfully.
11. Rebooted the system. At this point, everything appeared to be working.
12.
Shut down the system, temporarily disconnected the second disk (/dev/sdb) from the motherboard, and powered it back up.

What I expected to happen:
I expected that the system would either start as if nothing were wrong, or would warn me that one half of the mirror was missing and ask if I really wanted to start the system with the root array in a degraded state.

as of now it would/should start normally only when there is a -o degraded entry.

it looks like -o degraded is going to be a very obvious feature. I have plans of making it the default, and providing a -o nodegraded option instead. Thanks for comments if any.

Thanks, Anand

What actually happened:
During the boot process, a kernel message appeared indicating that the "system array" could not be found for the root filesystem (as identified by a UUID). It then dumped me to an initramfs prompt. Powering down the system, reattaching the second disk, and powering it on allowed me to boot successfully. Running "btrfs fi df /" showed that all System data was stored as RAID1.

If I want to have a storage server where one of two drives can fail at any time without causing much down time, am I on the right track? If so, what should I try next to get the behavior I'm looking for?

Thanks,
Eric
Re: FYIO: A rant about btrfs
On Wednesday, 16 September 2015, 23:29:30 CEST, Hugo Mills wrote:
> > but even then having write-barriers
> > turned off is still not as safe as having them turned on. Most of
> > the time when I've tried testing with 'nobarrier' (not just on BTRFS
> > but on ext* as well), I had just as many issues with data loss when
> > the system crashed as when it lost power (simulated via killing the
> > virtual machine). Both journaling and COW filesystems need to
> > ensure ordering of certain write operations to be able to maintain
> > consistency. For example, the new/updated data blocks need to be on
> > disk before the metadata is updated to point to them, otherwise your
> > database can end up corrupted.
>
> Indeed. The barriers are an ordering condition. The FS relies on
> (i.e. *requires*) that ordering condition, in order to be truly
> consistent. Running with "nobarrier" is a very strong signal that you
> really don't care about the data on the FS.
>
> This is not a case of me simply believing that because I've been
> using btrfs for so long that I've got used to the peculiarities. The
> first time I heard about the nobarrier option, something like 6 years
> ago when I was first using btrfs, I thought "that's got to be a really
> silly idea". Any complex data structure, like a filesystem, is going
> to rely on some kind of ordering guarantees, somewhere in its
> structure. (The ordering might be strict, with a global clock, or
> barrier-based, or lattice-like, as for example a vector clock, but
> there's going to be _some_ concept of order). nobarrier allows the FS
> to ignore those guarantees, and even without knowing anything about
> the FS at all, doing so is a big red DANGER flag.

The official recommendation for XFS differs from that:

Q. Should barriers be enabled with storage which has a persistent write cache?

Many hardware RAID controllers have a persistent write cache which is preserved across power failure, interface resets, system crashes, etc.
Using write barriers in this instance is not recommended and will in fact lower performance. Therefore, it is recommended to turn off the barrier support and mount the filesystem with "nobarrier", assuming your RAID controller is infallible and not resetting randomly like some common ones do. But take care about the hard disk write cache, which should be off.

http://xfs.org/index.php/XFS_FAQ#Q._Should_barriers_be_enabled_with_storage_which_has_a_persistent_write_cache.3F

Thanks,
--
Martin
Re: RAID1 storage server won't boot with one disk missing
On Wed, Sep 16, 2015 at 5:56 PM, erp...@gmail.com wrote:
> What I expected to happen:
> I expected that the system would either start as if nothing were
> wrong, or would warn me that one half of the mirror was missing and
> ask if I really wanted to start the system with the root array in a
> degraded state.

It's not this sophisticated yet. Btrfs does not "assemble" degraded by default like mdadm and LVM based RAID. You need to manually mount it with -o degraded and then continue the boot process, or use the boot parameter rootflags=degraded.

Yet there is still some interaction between btrfs dev scan and udev (?) that I don't understand precisely, but what happens is that when any device is missing, the Btrfs volume UUID doesn't appear, and therefore it still can't be mounted degraded if the volume UUID is used, e.g. the boot parameter root=UUID= - so that needs to be changed to a /dev/sdXY type of notation, and you have to hope that you guess it correctly.

> What actually happened:
> During the boot process, a kernel message appeared indicating that the
> "system array" could not be found for the root filesystem (as
> identified by a UUID). It then dumped me to an initramfs prompt.
> Powering down the system, reattaching the second disk, and powering it
> on allowed me to boot successfully. Running "btrfs fi df /" showed
> that all System data was stored as RAID1.

Just an FYI to be really careful about degraded rw mounts. There is no automatic resync to catch up the previously missing device with the device that was degraded,rw mounted. You have to scrub or balance; there's no optimization yet for Btrfs to effectively just "diff" between the devices' generations and get them all in sync quickly.

Much worse is if you don't scrub or balance, and then redo the test reversing the device you make missing. Now you have multiple devices that were rw,degraded mounted, and putting them back together again will corrupt the whole file system irreparably.
Fixing the first problem would (almost always) avoid the second problem.

> If I want to have a storage server where one of two drives can fail at
> any time without causing much down time, am I on the right track? If
> so, what should I try next to get the behavior I'm looking for?

It's totally not there yet if you want to obviate manual checks and intervention for failure cases. Both mdadm and LVM integrated RAID have monitoring and notification, which Btrfs lacks entirely. So that means you have to check it, or create scripts to check it.

What often tends to happen is Btrfs just keeps retrying rather than ignoring a bad device, so you'll see piles of retries in dmesg. But Btrfs doesn't kick out the bad device like the md driver would do. This could go on for hours, or days. So if you aren't checking for it, you could unwittingly have a degraded array already.

--
Chris Murphy
Re: RAID1 storage server won't boot with one disk missing
On Thu, Sep 17, 2015 at 9:18 AM, Anand Jain wrote:
>
> as of now it would/should start normally only when there is an entry -o
> degraded
>
> it looks like -o degraded is going to be a very obvious feature,
> I have plans of making it a default feature, and provide -o
> nodegraded feature instead. Thanks for comments if any.

If degraded mounts happen by default, what happens when dev 1 goes missing temporarily and dev 2 is mounted degraded,rw, and then dev 1 reappears? Is there an automatic way to a.) catch up dev 1 with dev 2? and then b.) automatically make the array no longer degraded?

I think it's a problem to have automatic degraded mounts when there's no monitoring or notification system for problems. We can get silent degraded mounts by default with no notification at all that there's a problem with a Btrfs volume.

So off hand my comment is that I think other work is needed before degraded mounts become the default behavior.

--
Chris Murphy
Re: FYIO: A rant about btrfs
On 16 September 2015 at 20:21, Austin S Hemmelgarn wrote:
> ZFS has been around for much longer, it's been mature and feature complete
> for more than a decade, and has had a long time to improve performance wise.
> It is important to note though, that on low-end hardware, BTRFS can (and
> often does in my experience) perform better than ZFS, because ZFS is a
> serious resource hog (I have yet to see a stable ZFS deployment with less
> than 16G of RAM, even with all the fancy features turned off).

If you have a real example of ZFS becoming unstable with, say, 4 or 8GB of memory, that doesn't involve attempting deduplication (which I guess is what you mean by 'all the fancy features') on a many-TB pool, I'd be interested to hear about it. (Psychic debugger says 'possibly somebody trying to use a large L2ARC on a pool with many/large zvols')

My home fileserver has been running zfs for about 5 years, on a system maxed out at 4GB RAM. Currently up to ~9TB of data. The only stability problems I ever had were towards the beginning, when I was using zfs-fuse because zfsonlinux wasn't ready then, *and* I was trying out deduplication.

I have a couple of work machines with 2GB RAM and pools currently around 2.5TB full; no problems with these either in the couple of years they've been in use, though granted these are lightly loaded machines, since what they mostly do is receive backup streams.

Bear in mind that these are Linux machines, and zfsonlinux's memory management is known to be inferior to ZFS on Solaris and FreeBSD (because it does not integrate with the page cache and instead grabs a [configurable] chunk of memory, and doesn't always do a great job of dropping it in response to memory pressure).
Re: RAID1 storage server won't boot with one disk missing
Hi Anand,

On 2015-09-17 17:18, Anand Jain wrote:
> it looks like -o degraded is going to be a very obvious feature,
> I have plans of making it a default feature, and provide -o
> nodegraded feature instead. Thanks for comments if any.
>
> Thanks, Anand

I am not sure there is a "good" default for this kind of problem; there are several aspects:

- remote machine: for a remote machine, I think that the root filesystem should be mounted anyway. For a secondary filesystem (home?), maybe user intervention would be better (but without home, how could a user log in?).

- spare: in case of a degraded filesystem, the system could insert a spare disk; or a reshaping could be started (raid5->raid1, raid6->raid5).

- initramfs: this is the most complicated thing: currently most initramfs implementations don't mount the filesystem if all the volumes aren't available. Allowing a degraded root filesystem means:
  a) wait for the disks until a timeout;
  b) if the timeout expires, mount in degraded mode (inserting a spare disk if available?);
  c) otherwise mount the filesystem as usual.

- degraded: I think that there are different levels of "degraded". For example, in case of raid6 a missing device could be acceptable; however in case of raid5, this should not be allowed, and user intervention may be preferred.

In the past I suggested the use of a helper, mount.btrfs [1], which could handle all these cases better without kernel intervention:

- wait for the devices to appear
- verify that all the needed devices are present
- mount the filesystem, passing all the devices to the kernel (without relying on udev and btrfs dev scan...)
- allow degraded mode or not (policy)
- start insertion of the spare (policy)

G.Baroncelli

[1] http://www.spinics.net/lists/linux-btrfs/msg39706.html

--
gpg @keyserver.linux.it: Goffredo Baroncelli
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
Re: BTRFS as image store for KVM?
On Thu, Sep 17, 2015 at 11:56 AM, Gert Menke wrote:
> Hi,
>
> thank you for your answers!
>
> So it seems there are several suboptimal alternatives here...
>
> MD+LVM is very close to what I want, but md has no way to cope with silent
> data corruption. So if I'd want to use a guest filesystem that has no
> checksums either, I'm out of luck.

You can use Btrfs in the guest to get at least notification of SDC. If you want recovery also, then that's a bit more challenging. The way this has been done up until ZFS and Btrfs is T10 DIF (PI). There are already checksums on the drive, but this adds more checksums that can be confirmed through the entire storage stack, not just internal to the drive hardware.

Another way is to put a conventional fs image on e.g. GlusterFS with checksumming enabled (and at least distributed+replicated filtering).

If you do this directly on Btrfs, maybe you can mitigate some of the fragmentation issues with bcache or dmcache; and for persistent snapshotting, use qcow2 to do it instead of Btrfs. You'd use Btrfs snapshots to create a subvolume for doing backups of the images, and then get rid of the Btrfs snapshot.

--
Chris Murphy
Re: BTRFS as image store for KVM?
Hi,

thank you for your answers!

So it seems there are several suboptimal alternatives here...

MD+LVM is very close to what I want, but md has no way to cope with silent data corruption. So if I'd want to use a guest filesystem that has no checksums either, I'm out of luck. I'm honestly a bit confused here - isn't checksumming one of the most obvious things to want in a software RAID setup? Is it a feature that might appear in the future? Maybe I should talk to the md guys...

BTRFS looks really nice feature-wise, but is not (yet) optimized for my use-case I guess. Disabling COW would certainly help, but I don't want to lose the data checksums. Is nodatacowbutkeepdatachecksums a feature that might turn up in the future?

Maybe ZFS is the best choice for my scenario. At least, it seems to work fine for Joyent - their SmartOS virtualization OS is essentially Illumos (Solaris) with ZFS, and KVM ported from Linux. Since ZFS supports "Volumes" (virtual block devices inside a ZPool), I suspect these are probably optimized to be used for VM images (i.e. do as little COW as possible). Of course, snapshots will always degrade performance to a degree. However, there are some drawbacks to ZFS:

- It's less flexible, especially when it comes to reconfiguration of disk arrays. Add or remove a disk to/from a RaidZ and rebalance, that would be just awesome. It's possible in BTRFS, but not ZFS. :-(
- The not-so-good integration of the fs cache, at least on Linux. I don't know if this is really an issue, though. Actually, I imagine it's more of an issue for guest systems, because it probably breaks memory ballooning. (?)

So it seems there are two options for me:

1. Go with ZFS for now, until BTRFS finds a better way to handle disk images, or until md gets data checksums.
2. Buy a bunch of SSDs for VM disk images and use spinning disks for data storage only. In that case, BTRFS should probably do fine.

Any comments on that? Am I missing something?

Thanks!
Gert
Re: BTRFS as image store for KVM?
On 17 September 2015 at 18:56, Gert Menke wrote:
> MD+LVM is very close to what I want, but md has no way to cope with silent
> data corruption. So if I'd want to use a guest filesystem that has no
> checksums either, I'm out of luck.
> I'm honestly a bit confused here - isn't checksumming one of the most
> obvious things to want in a software RAID setup? Is it a feature that might
> appear in the future? Maybe I should talk to the md guys...
...
> Any comments on that? Am I missing something?

How about using file integrity checking tools for cases when the chosen storage stack doesn't provide data checksumming. E.g.

aide - http://aide.sourceforge.net/
cfv - http://cfv.sourceforge.net/
tripwire - http://sourceforge.net/projects/tripwire/

I don't use them, just providing options.

Mike
Re: kernel BUG at linux-4.2.0/fs/btrfs/extent-tree.c:1833 on rebalance
Le 2015-09-17 12:41, Qu Wenruo a écrit :

In the meantime, I've reactivated quotas, umounted the filesystem and ran a btrfsck on it: as you would expect, there's no qgroup problem reported so far. At least, the rescan code is working without problem. I'll clear all my snapshots, run a quota rescan, then re-create them one by one by rsyncing from my ext4 system I still have. Maybe I'll run into the issue again.

Would you mind doing the following check for each subvolume rsync?

1) Do 'sync; btrfs qgroup show -prce --raw' and save the output
2) Create the needed snapshot
3) Do 'sync; btrfs qgroup show -prce --raw' and save the output
4) Avoid doing IO if possible until step 6)
5) Do 'btrfs quota rescan -w' and save it
6) Do 'sync; btrfs qgroup show -prce --raw' and save the output
7) Rsync data from ext4 to the newly created snapshot

The point is, as you mentioned, rescan is working fine, so we can compare the output from 3), 6) and 1) to see which qgroup accounting numbers change. And if they differ, which means the qgroup update at write time OR at snapshot creation has something wrong, at least we can narrow the problem down to the qgroup update routine or to snapshot creation.

I was about to do that, but first there's something that sounds strange: I've begun by trashing all my snapshots, then ran a quota rescan, and waited for it to complete, to start on a sane base.
However, this is the output of qgroup show now:

qgroupid rfer          excl                 max_rfer max_excl parent child
-------- ------------- -------------------- -------- -------- ------ -----
0/5      16384         16384                none     none     ---    ---
0/1906   1657848029184 1657848029184        none     none     ---    ---
0/1909   124950921216  124950921216         none     none     ---    ---
0/1911   1054587293696 1054587293696        none     none     ---    ---
0/3270   23727300608   23727300608          none     none     ---    ---
0/3314   23206055936   23206055936          none     none     ---    ---
0/3317   18472996864   0                    none     none     ---    ---
0/3318   22235709440   18446744073708421120 none     none     ---    ---
0/3319   22240333824   0                    none     none     ---    ---
0/3320   22289608704   0                    none     none     ---    ---
0/3321   22289608704   0                    none     none     ---    ---
0/3322   18461151232   0                    none     none     ---    ---
0/3323   18423902208   0                    none     none     ---    ---
0/3324   18423902208   0                    none     none     ---    ---
0/3325   18463506432   0                    none     none     ---    ---
0/3326   18463506432   0                    none     none     ---    ---
0/3327   18463506432   0                    none     none     ---    ---
0/3328   18463506432   0                    none     none     ---    ---
0/3329   18585427968   0                    none     none     ---    ---
0/3330   18621472768   18446744073251348480 none     none     ---    ---
0/3331   18621472768   0                    none     none     ---    ---
0/3332   18621472768   0                    none     none     ---    ---
0/       18783076352   0                    none     none     ---    ---
0/3334   18799804416   0                    none     none     ---    ---
0/3335   18799804416   0                    none     none     ---    ---
0/3336   18816217088   0                    none     none     ---    ---
0/3337   18816266240   0                    none     none     ---    ---
0/3338   18816266240   0                    none     none     ---    ---
0/3339   18816266240   0                    none     none     ---    ---
0/3340   18816364544   0                    none     none     ---    ---
0/3341   7530119168    7530119168           none     none     ---    ---
0/3342   4919283712    0                    none     none     ---    ---
0/3343   4921724928    0                    none     none     ---    ---
0/3344   4921724928    0                    none     none     ---    ---
0/3345   6503317504    18446744073690902528 none     none     ---    ---
0/3346   6503452672    0                    none     none     ---    ---
0/3347   6509514752    0                    none     none     ---    ---
0/3348   6515793920    0                    none     none     ---    ---
0/3349   6515793920    0                    none     none     ---
Re: RAID1 storage server won't boot with one disk missing
On Thu, 17 Sep 2015 19:00:08 +0200, Goffredo Baroncelli wrote:
> On 2015-09-17 17:18, Anand Jain wrote:
> > it looks like -o degraded is going to be a very obvious feature,
> > I have plans of making it a default feature, and provide -o
> > nodegraded feature instead. Thanks for comments if any.
>
> I am not sure if there is a "good" default for this kind of problem

Yes there is. It is whatever people came to expect from using other RAID systems and/or generally expect from RAID as a concept.

Both mdadm software RAID and, I believe, virtually any hardware RAID controller out there will let you successfully boot up and give read-write(!) access to a RAID in a non-critical failure state, because that's kind of the whole point of a RAID: to eliminate downtime. If the removed disk is later re-added, then it is automatically resynced. Mdadm can also make use of its 'write intent bitmap' to resync only those areas of the array which were in any way touched during the absence of the newly re-added disk.

If you're concerned that the user "misses" the fact that they have a disk down, then solve *that*: make some sort of a notify daemon, e.g. mdadm has a built-in "monitor" mode which sends e-mail on critical events with any of the arrays.

--
With respect,
Roman
Re: BTRFS as image store for KVM?
On Thu, Sep 17, 2015 at 07:56:08PM +0200, Gert Menke wrote:
> Hi,
>
> thank you for your answers!
>
> So it seems there are several suboptimal alternatives here...
>
> MD+LVM is very close to what I want, but md has no way to cope with
> silent data corruption. So if I'd want to use a guest filesystem
> that has no checksums either, I'm out of luck.
> I'm honestly a bit confused here - isn't checksumming one of the
> most obvious things to want in a software RAID setup? Is it a
> feature that might appear in the future? Maybe I should talk to the
> md guys...
>
> BTRFS looks really nice feature-wise, but is not (yet) optimized for
> my use-case I guess. Disabling COW would certainly help, but I don't
> want to lose the data checksums. Is nodatacowbutkeepdatachecksums a
> feature that might turn up in the future?
[snip]

No. If you try doing that particular combination of features, you end up with a filesystem that can be inconsistent: there's a race condition between updating the data in a file and updating the csum record for it, and the race can't be fixed.

Hugo.

--
Hugo Mills              | I spent most of my money on drink, women and fast
hugo@... carfax.org.uk  | cars. The rest I wasted.
http://carfax.org.uk/   |
PGP: E2AB1DE4           |                                        James Hunt
Re: RAID1 storage server won't boot with one disk missing
On Thu, Sep 17, 2015 at 1:02 PM, Roman Mamedovwrote: > On Thu, 17 Sep 2015 19:00:08 +0200 > Goffredo Baroncelli wrote: > >> On 2015-09-17 17:18, Anand Jain wrote: >> > it looks like -o degraded is going to be a very obvious feature, >> > I have plans of making it a default feature, and provide -o >> > nodegraded feature instead. Thanks for comments if any. >> > >> I am not sure if there is a "good" default for this kind of problem > > Yes there is. It is whatever people came to expect from using other RAID > systems and/or generally expect from RAID as a concept. > > Both mdadm software RAID, and I believe virtually any hardware RAID controller > out there will let you to successfully boot up and give read-write(!) access > to a RAID in a non-critical failure state, because that's kind of the whole > point of a RAID, to eliminate downtime. If the removed disk is later re-added, > then it is automatically resynced. Mdadm can also make use of its 'write > intent bitmap' to resync only those areas of the array which were in any way > touched during the absence of the newly re-added disk. > > If you're concerned that the user "misses" the fact that they have a disk > down, then solve *that*, make some sort of a notify daemon, e.g. mdadm has a > built-in "monitor" mode which sends E-Mail on critical events with any of the > arrays. Given the current state: no proposal and no work done yet, I think it's premature to change the default. It's an open question what a modern monitoring and notification mechanism should look like. At the moment it'd be a unique Btrfs thing because the mdadm and LVM methods aren't abstracted enough to reuse. I wonder if the storaged and/or openlmi folks have some input on what this would look like. Feedback from KDE and GNOME also, who rely on at least mdadm in order to present user space notifications. 
I think udisks2 is on the way out and storaged is on the way in, there's just too much stuff that udisks2 doesn't do and is getting confused about, including LVM thinly provisioned volumes, not just Btrfs stuff.

--
Chris Murphy
Re: BTRFS as image store for KVM?
On 17.09.2015 at 21:43, Hugo Mills wrote:
> On Thu, Sep 17, 2015 at 07:56:08PM +0200, Gert Menke wrote:
> > BTRFS looks really nice feature-wise, but is not (yet) optimized for
> > my use-case I guess. Disabling COW would certainly help, but I don't
> > want to lose the data checksums. Is nodatacowbutkeepdatachecksums a
> > feature that might turn up in the future?
> [snip]
> No. If you try doing that particular combination of features, you end
> up with a filesystem that can be inconsistent: there's a race
> condition between updating the data in a file and updating the csum
> record for it, and the race can't be fixed.

I'm no filesystem expert, but isn't that what an intent log is for? (Does btrfs have an intent log?)

And, is this also true for mirrored or raid5 disks? I'm thinking something like "if the data does not match the checksum, just restore both from mirror/parity" should be possible, right?

Gert
Re: BTRFS as image store for KVM?
On 17.09.2015 at 20:35, Chris Murphy wrote:
> You can use Btrfs in the guest to get at least notification of SDC.

Yes, but I'd rather not depend on all potential guest OSes having btrfs or something similar.

> Another way is to put a conventional fs image on e.g. GlusterFS with
> checksumming enabled (and at least distributed+replicated filtering).

This sounds interesting! I'll have a look at this.

> If you do this directly on Btrfs, maybe you can mitigate some of the
> fragmentation issues with bcache or dmcache;

Thanks, I did not know about these. bcache seems to be more or less what "zpool add foo cache /dev/ssd" does. Definitely worth a look.

> and for persistent snapshotting, use qcow2 to do it instead of Btrfs.
> You'd use Btrfs snapshots to create a subvolume for doing backups of
> the images, and then get rid of the Btrfs snapshot.

Good idea. Thanks!
Re: [PATCH 2/2] btrfs: Remove unneeded missing device number check
Anand Jain wrote on 2015/09/18 09:47 +0800: On 09/17/2015 06:01 PM, Qu Wenruo wrote: Thanks for pointing this out. Although the previous patch is small enough, for the remount case we need to iterate all the existing chunk cache. yes indeed. thinking hard on this - is there any test-case that these two patches are solving, which the original patch [1] didn't solve ? Yep, your patch is OK to fix the single-chunk-on-a-safe-disk case. But IMHO, it's a little aggressive and not as safe as the old code. For example, say one uses single metadata for 2 disks, and each disk has one metadata chunk on it. One device goes missing later. Then your patch will allow the fs to be mounted rw, even though some tree blocks may be on the missing device. For the RO case, it won't be too dangerous, but if we mount it RW, who knows what will happen. (Normal tree COW should fail before a real write, but I'm not sure about other RW operations like scrub/replace/balance and others.) And I think that's the original design concept behind the old missing device number check, and it's not a bad idea to follow it anyway. For the patch size, I found a good way to handle it, which should keep the patch(set) size below 200 lines. Furthermore, it's even possible to make btrfs change the mount option to degraded for runtime device missing. Thanks, Qu I tried to break both the approaches (this patch set and [1]) but I wasn't successful. sorry if I am missing something. Thanks, Anand [1] [PATCH 23/23] Btrfs: allow -o rw,degraded for single group profile So the fix for remount will take a little more time. Thanks for reviewing. Qu On 2015-09-17 17:43, Anand Jain wrote: On 09/16/2015 11:43 AM, Qu Wenruo wrote: As we do the per-chunk missing device number check at read_one_chunk() time, the global missing device number check is no longer needed. Just remove it. However, the missing device count that we have during remount is not fine-grained per chunk.
--- btrfs_remount ::

if (fs_info->fs_devices->missing_devices >
        fs_info->num_tolerated_disk_barrier_failures &&
    !(*flags & MS_RDONLY || btrfs_test_opt(root, DEGRADED))) {
        btrfs_warn(fs_info,
                "too many missing devices, writeable remount is not allowed");
        ret = -EACCES;
        goto restore;
}
---

Thanks, Anand

Now btrfs can handle the following case:
# mkfs.btrfs -f -m raid1 -d single /dev/sdb /dev/sdc
Data chunk will be located in sdb, so we should be safe to wipe sdc
# wipefs -a /dev/sdc
# mount /dev/sdb /mnt/btrfs -o degraded

Signed-off-by: Qu Wenruo
---
 fs/btrfs/disk-io.c | 8 --------
 1 file changed, 8 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 0b658d0..ac640ea 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2947,14 +2947,6 @@ retry_root_backup:
 	}
 	fs_info->num_tolerated_disk_barrier_failures =
 		btrfs_calc_num_tolerated_disk_barrier_failures(fs_info);
-	if (fs_info->fs_devices->missing_devices >
-	     fs_info->num_tolerated_disk_barrier_failures &&
-	    !(sb->s_flags & MS_RDONLY)) {
-		pr_warn("BTRFS: missing devices(%llu) exceeds the limit(%d), writeable mount is not allowed\n",
-			fs_info->fs_devices->missing_devices,
-			fs_info->num_tolerated_disk_barrier_failures);
-		goto fail_sysfs;
-	}
 	fs_info->cleaner_kthread = kthread_run(cleaner_kthread, tree_root,
 					       "btrfs-cleaner");
Re: BTRFS as image store for KVM?
Hugo Mills posted on Thu, 17 Sep 2015 19:43:14 +0000 as excerpted: >> Is nodatacowbutkeepdatachecksums a feature that might turn up >> in the future? > > No. If you try doing that particular combination of features, you > end up with a filesystem that can be inconsistent: there's a race > condition between updating the data in a file and updating the csum > record for it, and the race can't be fixed. ... Which is both why btrfs disables checksumming on nocow, and why more traditional in-place-overwrite filesystems don't normally offer a checksumming feature -- it's only easily and reliably possible with copy-on-write, as in-place-overwrite introduces race issues that are basically impossible to solve. Logging can narrow the race, but consider: either they introduce some level of copy-on-write themselves, or one way or another, you're going to have to write two things, one a checksum of the other, and if they are in-place-overwrites, while the race can be narrowed, there's always going to be a point at which either one or the other will have been written, while the other hasn't been, and if failure occurs at that point... The only real way around that is /some/ form of copy-on-write, such that both the change and its checksum can be written to a different location than the old version, with a single, atomic write then updating a pointer to point to the new version of both the data and its checksum, instead of the old one. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman
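Duncan's "single, atomic write then updating a pointer" can be illustrated with a toy copy-on-write store. This is a conceptual sketch only -- the slot and pointer names are invented for illustration, and nothing here models btrfs's real on-disk format -- but it shows why a crash mid-update never exposes a data/csum mismatch: new data and its checksum land in fresh slots, and one pointer swap publishes both together.

```python
import hashlib

class CowStore:
    """Toy copy-on-write store: data and checksum are written to fresh
    slots, then a single pointer swap publishes both at once. A crash
    before the swap leaves the old, still-consistent version visible."""

    def __init__(self, data):
        self.slots = {0: (data, hashlib.sha256(data).hexdigest())}
        self.current = 0          # the one atomically-updated pointer

    def write(self, data, crash_before_publish=False):
        new = max(self.slots) + 1
        # Data and its csum go to a fresh slot; old slot is untouched.
        self.slots[new] = (data, hashlib.sha256(data).hexdigest())
        if crash_before_publish:
            return                # simulated power loss: pointer unchanged
        self.current = new        # the single atomic step

    def read(self):
        data, csum = self.slots[self.current]
        assert hashlib.sha256(data).hexdigest() == csum, "csum mismatch"
        return data

s = CowStore(b"v1")
s.write(b"v2", crash_before_publish=True)   # "crash" mid-update
assert s.read() == b"v1"                    # old version, still consistent
s.write(b"v2")                              # retry, completes this time
assert s.read() == b"v2"
```

With in-place overwrite there is no equivalent single step: data and csum live at fixed locations, so one of the two writes always lands first.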
Re: RAID1 storage server won't boot with one disk missing
Anand Jain posted on Thu, 17 Sep 2015 23:18:36 +0800 as excerpted: >> What I expected to happen: >> I expected that the [btrfs raid1 data/metadata] system would either >> start as if nothing were wrong, or would warn me that one half of the >> mirror was missing and ask if I really wanted to start the system with >> the root array in a degraded state. > > as of now it would/should start normally only when there is an entry > -o degraded > > it looks like -o degraded is going to be a very obvious feature, > I have plans of making it a default feature, and provide -o nodegraded > feature instead. Thanks for comments if any. Like Chris Murphy, I have my doubts about this, and think it's likely to cause as many unhappy users as it prevents. I'd definitely put -o nodegraded in my default options here, so it's not about me, but about all those others that would end up running a silently degraded system and have no idea until it's too late, as further devices have failed or the one single other available copy of something important (remember, still raid1 without N-mirrors option, unfortunately, so if a device drops out, that's now data/metadata with only a single valid copy regardless of the number of devices, and if it goes invalid...) fails checksum for whatever reason. And since it only /allows/ degraded, not forcing it, if admins or distros want it as the default, -o degraded can be added now. Nothing's stopping them except lack of knowledge of the option, the *same* lack of knowledge that would potentially cause so much harm if the default were switched. Put it this way. With the current default, if it fails and people have to ask about the unexpected failure here, no harm to existing data done, just add -o degraded and get on with things. If -o degraded were made the default, failure mode would be *MUCH* worse, potential loss of the entire filesystem due to silent and thus uncorrected device loss and degraded mounting.
So despite the inconvenience of less knowledgeable people losing the availability of the filesystem until they can read the wiki or ask about it here, I don't believe changing the default to -o degraded is wise, at all. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman
Re: [PATCH 2/2] btrfs: Remove unneeded missing device number check
On 09/17/2015 06:01 PM, Qu Wenruo wrote: Thanks for pointing this out. Although the previous patch is small enough, for the remount case we need to iterate all the existing chunk cache. yes indeed. thinking hard on this - is there any test-case that these two patches are solving, which the original patch [1] didn't solve ? I tried to break both the approaches (this patch set and [1]) but I wasn't successful. sorry if I am missing something. Thanks, Anand [1] [PATCH 23/23] Btrfs: allow -o rw,degraded for single group profile So the fix for remount will take a little more time. Thanks for reviewing. Qu On 2015-09-17 17:43, Anand Jain wrote: On 09/16/2015 11:43 AM, Qu Wenruo wrote: As we do the per-chunk missing device number check at read_one_chunk() time, the global missing device number check is no longer needed. Just remove it. However, the missing device count that we have during remount is not fine-grained per chunk.

--- btrfs_remount ::

if (fs_info->fs_devices->missing_devices >
        fs_info->num_tolerated_disk_barrier_failures &&
    !(*flags & MS_RDONLY || btrfs_test_opt(root, DEGRADED))) {
        btrfs_warn(fs_info,
                "too many missing devices, writeable remount is not allowed");
        ret = -EACCES;
        goto restore;
}
---

Thanks, Anand

Now btrfs can handle the following case:
# mkfs.btrfs -f -m raid1 -d single /dev/sdb /dev/sdc
Data chunk will be located in sdb, so we should be safe to wipe sdc
# wipefs -a /dev/sdc
# mount /dev/sdb /mnt/btrfs -o degraded

Signed-off-by: Qu Wenruo
---
 fs/btrfs/disk-io.c | 8 --------
 1 file changed, 8 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 0b658d0..ac640ea 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2947,14 +2947,6 @@ retry_root_backup:
 	}
 	fs_info->num_tolerated_disk_barrier_failures =
 		btrfs_calc_num_tolerated_disk_barrier_failures(fs_info);
-	if (fs_info->fs_devices->missing_devices >
-	     fs_info->num_tolerated_disk_barrier_failures &&
-	    !(sb->s_flags & MS_RDONLY)) {
-		pr_warn("BTRFS: missing devices(%llu) exceeds the limit(%d), writeable mount is not allowed\n",
-			fs_info->fs_devices->missing_devices,
-			fs_info->num_tolerated_disk_barrier_failures);
-		goto fail_sysfs;
-	}
 	fs_info->cleaner_kthread = kthread_run(cleaner_kthread, tree_root,
 					       "btrfs-cleaner");
Re: BTRFS as image store for KVM?
Chris Murphy posted on Thu, 17 Sep 2015 12:35:41 -0600 as excerpted: > You'd use Btrfs snapshots to create a subvolume for doing backups of > the images, and then get rid of the Btrfs snapshot. The caveat here is that if the VM/DB is active during the backups (btrfs send/receive or other), it'll still COW1 any writes during the existence of the btrfs snapshot. If the backup can be scheduled during VM/DB downtime or at least when activity is very low, the relatively short COW1 time should avoid serious fragmentation, but if not, even only relatively temporary snapshots are likely to trigger noticeable COW1 fragmentation issues eventually. Some users have ameliorated that by scheduling weekly or monthly btrfs defrag, reporting that COW1 issues with temporary snapshots build up slowly enough that the scheduled defrag effectively eliminates the otherwise growing problem, but it's still an additional complication to have to configure and administer, longer term. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman
Re: FYIO: A rant about btrfs
Zygo Blaxell posted on Wed, 16 Sep 2015 18:08:56 -0400 as excerpted: > On Wed, Sep 16, 2015 at 03:04:38PM -0400, Vincent Olivier wrote: >> >> OK fine. Let it be clearer then (on the Btrfs wiki): nobarrier is an >> absolute no go. Case closed. > > Sometimes it is useful to make an ephemeral filesystem, i.e. a btrfs on > a dm-crypt device with a random key that is not stored. This > configuration intentionally and completely destroys the entire > filesystem, and all data on it, in the event of a power failure. It's > useful for things like temporary table storage, where ramfs is too > small, swap-backed tmpfs is too slow, and/or there is a requirement that > the data not be persisted across reboots. > > In other words, nobarrier is for a little better performance when you > already want to _intentionally_ destroy your filesystem on power > failure. Very good explanation of why it's useful to have such an otherwise destructive mount option even available in the first place. Thanks! =:^) -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman
Re: BTRFS as image store for KVM?
On Thu, Sep 17, 2015 at 07:56:08PM +0200, Gert Menke wrote: > MD+LVM is very close to what I want, but md has no way to cope with silent > data corruption. So if I'd want to use a guest filesystem that has no > checksums either, I'm out of luck. > I'm honestly a bit confused here - isn't checksumming one of the most > obvious things to want in a software RAID setup? Is it a feature that might > appear in the future? Maybe I should talk to the md guys... MD is emulating hardware RAID. In hardware RAID, you are doing work at the block level. Block-level RAID has no understanding of the filesystem(s) running on top of it. Therefore it would have to checksum groups of blocks, and store those checksums on the physical disks somewhere, perhaps by keeping some portion of the drive for itself. But then this is not very efficient, since it is maintaining checksums for data that may be useless (blocks the FS is not currently using). So then you might make the RAID filesystem-aware... and you now have BTRFS RAID. Simply put, the block level is probably not an appropriate place for checksumming to occur. BTRFS can make checksumming work much more effectively and efficiently by doing it at the filesystem level. --Sean
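Sean's efficiency argument can be made concrete with a toy comparison. All numbers and names below are hypothetical, invented purely for illustration: a block-level checksummer, knowing nothing about the filesystem, must reserve a checksum slot for every block on the device, while a filesystem-level one checksums only live data.

```python
import zlib

DEVICE_BLOCKS = 1024                    # hypothetical device: 1024 x 4 KiB blocks
live_extents = {10: b"a" * 4096,        # only two blocks actually hold
                500: b"c" * 4096}       # filesystem data

# Block level (md/hardware style): no filesystem knowledge, so every
# block needs a checksum slot, whether it holds live data or garbage.
block_level_slots = DEVICE_BLOCKS

# Filesystem level (btrfs style): checksum only what the fs says is live.
fs_level_csums = {blk: zlib.crc32(data) for blk, data in live_extents.items()}

assert len(fs_level_csums) == 2
assert len(fs_level_csums) < block_level_slots
```

The gap widens as the device grows while the live data stays small, which is exactly the "maintaining checksums for data that may be useless" cost Sean points out.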
Re: RAID1 storage server won't boot with one disk missing
I think you have stated that in a very polite and friendly way. I'm pretty sure I'd phrase it less politely :) Following mdadm's example, an easy option to allow degraded mounting makes sense, but it shouldn't be the default. Anyone with the expertise to set that option can be expected to implement a way of knowing that the mount is degraded. People tend to be looking at BTRFS for a guarantee that data doesn't die when hardware does. Defaults that defeat that shouldn't be used. On Fri, Sep 18, 2015 at 11:36 AM, Duncan <1i5t5.dun...@cox.net> wrote: > Anand Jain posted on Thu, 17 Sep 2015 23:18:36 +0800 as excerpted: > >>> What I expected to happen: >>> I expected that the [btrfs raid1 data/metadata] system would either >>> start as if nothing were wrong, or would warn me that one half of the >>> mirror was missing and ask if I really wanted to start the system with >>> the root array in a degraded state. >> >> as of now it would/should start normally only when there is an entry >> -o degraded >> >> it looks like -o degraded is going to be a very obvious feature, >> I have plans of making it a default feature, and provide -o nodegraded >> feature instead. Thanks for comments if any. > > As Chris Murphy, I have my doubts about this, and think it's likely to > cause as many unhappy users as it prevents. > > I'd definitely put -o nodegraded in my default options here, so it's not > about me, but about all those others that would end up running a silently > degraded system and have no idea until it's too late, as further devices > have failed or the one single other available copy of something important > (remember, still raid1 without N-mirrors option, unfortunately, so if a > device drops out, that's now data/metadata with only a single valid copy > regardless of the number of devices, and if it goes invalid...) fails > checksum for whatever reason.
> > And since it only /allows/ degraded, not forcing it, if admins or distros > want it as the default, -o degraded can be added now. Nothing's stopping > them except lack of knowledge of the option, the *same* lack of knowledge > that would potentially cause so much harm if the default were switched. > > Put it this way. With the current default, if it fails and people have > to ask about the unexpected failure here, no harm to existing data done, > just add -o degraded and get on with things. If -o degraded were made > the default, failure mode would be *MUCH* worse, potential loss of the > entire filesystem due to silent and thus uncorrected device loss and > degraded mounting. > > So despite the inconvenience of less knowledgeable people losing the > availability of the filesystem until they can read the wiki or ask about > it here, I don't believe changing the default to -o degraded is wise, at > all. > > -- > Duncan - List replies preferred. No HTML msgs. > "Every nonfree program has a lord, a master -- > and if you use the program, he is your master." Richard Stallman -- Gareth Pye - blog.cerberos.id.au Level 2 MTG Judge, Melbourne, Australia "Dear God, I would like to file a bug report"