Re: 4.11 relocate crash, null pointer + rolling back a filesystem by X hours?
Marc MERLIN posted on Tue, 23 May 2017 09:58:47 -0700 as excerpted:

> That's a valid point, and in my case, I can back it up/restore, it just
> takes a bit of time, but most of the time is manually babysitting all
> those subvolumes that I need to recreate by hand with btrfs send/receive
> relationships, which all get lost during backup/restore.
> This is the most painful part.
>
> What's too big? I've only ever used a filesystem that fits on a raid
> of 4 data drives. That value has increased over time, but I don't have
> a crazy array of 20+ drives as a single filesystem, or anything.
> Since drives have gotten bigger, but not that much faster, I use bcache
> to make things more acceptable in speed.

What's too big? That depends on your tolerance for pain, but given the scenario of subvolumes manually recreated by hand with send/receive, I'd probably try to break it down so that while there's the same number of snapshots to restore, the number of subvolumes the snapshots are taken against is limited.

My own rule of thumb is that if it's taking so long that it's a barrier to doing it, I really need to either break things down further, or upgrade to faster storage. The latter is why I'm actually looking at upgrading my media and second backup set, on spinning rust, to ssd. Because while I used to do backups spinning rust to spinning rust of that size all the time, ssds have spoiled me, and now I dread doing the spinning rust backups... or restores. Tho in my case the spinning rust is only a half-TB, so a pair of half-TB to 1 TB ssds for an upgrade is still cost effective. It's not like I'm going multi-TB, which would still be cost prohibitive on SSD, particularly since I want raid1, so doubling the number of SSDs.

Meanwhile, what I'd do with that raid of four drives (and /did/ do with my 4-drive raid back a few storage generations ago, when 300 GB spinning-rust disks were still quite big, and what I do with my paired SSDs with btrfs now) is partition them up and do raids of partitions on each drive.

One thing that's nice about that is that you can actually do a set of backups on a second set of partitions on the same physical devices, because the physical device redundancy of the raids covers loss of a device, and the separate partitions and raids (btrfs raid1 now) cover the fat-finger or simple loss-of-filesystem risk. A second set of backups to separate devices can then be made just in case, and depending on the need, swapped out to off-premises or uploaded to the cloud or whatever, but you always have the primary backup at hand to boot to or mount if the working copy fails, by simply pointing to the backup partitions and filesystem instead of the normal working copy.

For root, I even have a grub menu item that switches to the backup copy, and for fstab, I have a set of stubs that are assembled via script into three copies of fstab that swap working and backup copies as necessary, with /etc/fstab itself being a symlink to the working-copy one, that I simply switch to point to the one that loads the backup copies as working, on the backup (a minimal sketch of that assembly follows below). Or I can mount the root filesystem for maintenance from the initramfs, and switch the fstab symlink from there, before exiting maintenance and booting the main system.

I learned this "split it up" method the hard way back before mdraid had write-intent bitmaps, and I had only two much larger raids, working and backup, where if one device dropped out and I brought it back in, I had to wait way too long for the huge working raid to resync.
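A minimal sketch of how such stub assembly could work; the stub names and paths here are illustrative, not the actual files:

  # assemble the three fstab variants from shared stubs
  cat fstab.common fstab.working > fstab.main
  cat fstab.common fstab.backup  > fstab.onbackup
  cat fstab.common fstab.maint   > fstab.maintenance
  # /etc/fstab itself is a symlink; repoint it to switch roles
  ln -sf fstab.main /etc/fstab        # normal boot, working copies mounted
  # ln -sf fstab.onbackup /etc/fstab  # boot with backup copies as working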
When I split things up by function into multiple raids, most of the time only some of them were active, and only one or two of the active ones would actually have been written at the time so were out of sync, and syncing them was fast as they were much smaller than the larger full-system raids I had been using previously.

>> *BUT*, and here's the "go further" part, keep in mind that
>> subvolume-read-only is a property, gettable and settable by btrfs
>> property.
>>
>> So you should be able to unset the read-only property of a subvolume or
>> snapshot, move it, then if desired, set it again.
>>
>> Of course I wouldn't expect send -p to work with such a snapshot, but
>> send -c /might/ still work, I'm not actually sure but I'd consider it
>> worth trying. (I'd try -p as well, but expect it to fail...)
>
> That's an interesting point, thanks for making it.
> In that case, I did have to destroy and recreate the filesystem since
> btrfs check --repair was unable to fix it, but knowing how to reparent
> read-only subvolumes may be handy in the future, thanks.

Hopefully you won't end up testing it any time soon, but if you do, please confirm whether my suspicions that send -p won't work after toggling and reparenting, but send -c still will, are correct. (For those who read this out of thread context where I believe I already stated it, my own use-case involves neither snapshots
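A minimal sketch of that experiment, for anyone who ends up trying it (pool and snapshot paths are illustrative):

  # unset read-only, move the snapshot, set read-only again
  btrfs property set -ts /pool/sub1/snap ro false
  mv /pool/sub1/snap /pool/snap
  btrfs property set -ts /pool/snap ro true
  # -p names a strict parent and is expected to fail after the move...
  btrfs send -p /pool/parent-snap /pool/snap | btrfs receive /backup
  # ...while -c only names a clone source, so it might still work:
  btrfs send -c /pool/parent-snap /pool/snap | btrfs receive /backup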
Re: 4.11 relocate crash, null pointer + rolling back a filesystem by X hours?
On Tue, May 02, 2017 at 05:01:02AM +0000, Duncan wrote:
> Marc MERLIN posted on Mon, 01 May 2017 20:23:46 -0700 as excerpted:
>
> > Also, how is --mode=lowmem being useful?
>
> FWIW, I just watched your talk that's linked from the wiki, and wondered
> what you were doing these days as I hadn't seen any posts from you here
> for awhile.

First, sorry for the late reply. Because you didn't Cc me in the answer, it went to a different folder where I only saw your replies now.

Off topic, but basically I'm not dead or anything, and I have btrfs working well enough to not mess with it further because I have many other hobbies :) That is, until I put a new SAS card in my server and hit some corruption bugs, and now I'm back to spending days fixing the system.

> Well, that you're asking that question confirms you've not been following
> the list too closely... Of course that's understandable as people have
> other stuff to do, but just sayin'.

That's exactly right. I'm subscribed to way too many lists on way too many topics to be up to date with all, sadly :(

> Of course on-list I'm somewhat known for my arguments propounding the
> notion that any filesystem that's too big to be practically maintained
> (including time necessary to restore from backups, should that be
> necessary for whatever reason) is... too big... and should ideally be
> broken along logical and functional boundaries into a number of
> individual smaller filesystems until such point as each one is found to
> be practically maintainable within a reasonably practical time frame.
> Don't put all the eggs in one basket, and when the bottom of one of those
> baskets inevitably falls out, most of your eggs will be safe in other
> baskets. =:^)

That's a valid point, and in my case, I can back it up/restore, it just takes a bit of time, but most of the time is manually babysitting all those subvolumes that I need to recreate by hand with btrfs send/receive relationships, which all get lost during backup/restore. This is the most painful part.

What's too big? I've only ever used a filesystem that fits on a raid of 4 data drives. That value has increased over time, but I don't have a crazy array of 20+ drives as a single filesystem, or anything. Since drives have gotten bigger, but not that much faster, I use bcache to make things more acceptable in speed.

> *BUT*, and here's the "go further" part, keep in mind that
> subvolume-read-only is a property, gettable and settable by btrfs
> property.
>
> So you should be able to unset the read-only property of a subvolume or
> snapshot, move it, then if desired, set it again.
>
> Of course I wouldn't expect send -p to work with such a snapshot, but
> send -c /might/ still work, I'm not actually sure but I'd consider it
> worth trying. (I'd try -p as well, but expect it to fail...)

That's an interesting point, thanks for making it. In that case, I did have to destroy and recreate the filesystem since btrfs check --repair was unable to fix it, but knowing how to reparent read-only subvolumes may be handy in the future, thanks.

Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 1024R/763BE901
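For reference, recreating one of those send/receive relationships from scratch looks roughly like this (a sketch; pool and snapshot paths are illustrative):

  # full bootstrap send: establishes a shared baseline snapshot on the destination
  btrfs subvolume snapshot -r /pool/vol /pool/vol.snap1
  btrfs send /pool/vol.snap1 | btrfs receive /backup
  # later incrementals can then use -p against the shared snapshot
  btrfs subvolume snapshot -r /pool/vol /pool/vol.snap2
  btrfs send -p /pool/vol.snap1 /pool/vol.snap2 | btrfs receive /backup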
Re: 4.11 relocate crash, null pointer + rolling back a filesystem by X hours?
On Fri, 5 May 2017 08:43:23 -0700, Marc MERLIN wrote:

[missing quote of the command]
> > Corrupted blocks are corrupted, that command is just trying to
> > corrupt it again.
> > It won't do the black magic to adjust tree blocks to avoid them.
>
> I see. You may have seen the earlier message from Kai Krakow who was
> able to recover his FS by trying this trick, but I understand it
> can't work in all cases.

Huh, what trick? I don't take credit for it... ;-) The corrupt-block trick must've been someone else...

--
Regards,
Kai

Replies to list-only preferred.
Re: 4.11 relocate crash, null pointer + rolling back a filesystem by X hours?
Thanks again for your answer. Obviously even if my filesystem is toast, it's useful to learn from what happened.

On Fri, May 05, 2017 at 01:03:02PM +0800, Qu Wenruo wrote:
> > > So unfortunately, your fs/subvolume trees are also corrupted.
> > > And almost no chance to do a graceful recovery.
> >
> > So I'm confused here. You're saying my metadata is not corrupted (and in
> > my case, I have DUP, so I should have 2 copies),
>
> Nope, here I'm all talking about metadata (tree blocks).
> The difference is the owner, either extent tree or fs/subvolume tree.

I see. I didn't realize that my filesystem managed to corrupt both copies of its metadata.

> The fsck doesn't check data blocks.

Right, that's what scrub does, fair enough.

> The problem is, the tree blocks (metadata) that refer to these data blocks
> are corrupted.
>
> And they are corrupted in such a way that both the extent tree (the tree
> containing extent allocation info) and the fs tree (the tree containing
> real fs info, like inodes and data locations) are corrupted.
>
> So graceful recovery is not possible now.

I see, thanks for explaining.

> Unfortunately, no; even though you have 2 copies, a lot of tree blocks are
> corrupted such that neither copy matches its checksum.

Thanks for confirming. I guess if I'm having corruption due to a bad card, it makes sense that both copies get updated one after another, and both got corrupted for the same reason.

> Corrupted blocks are corrupted; that command is just trying to corrupt it
> again.
> It won't do the black magic to adjust tree blocks to avoid them.

I see. You may have seen the earlier message from Kai Krakow who was able to recover his FS by trying this trick, but I understand it can't work in all cases.

Thanks again for your answers.
Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 1024R/763BE901
Re: 4.11 relocate crash, null pointer + rolling back a filesystem by X hours?
At 05/05/2017 10:40 AM, Marc MERLIN wrote:
> On Fri, May 05, 2017 at 09:19:29AM +0800, Qu Wenruo wrote:
>> Sorry for not noticing the link.
>
> No problem, it was only one line amongst many :)
> Thanks much for having had a look.
>
>> [Conclusion]
>> After checking the full result, some of the fs/subvolume trees are corrupted.
>>
>> [Details]
>> Some examples here:
>>
>> ---
>> ref mismatch on [6674127745024 32768] extent item 0, found 1
>> Backref 6674127745024 parent 7566652473344 owner 0 offset 0 num_refs 0 not found in extent tree
>> Incorrect local backref count on 6674127745024 parent 7566652473344 owner 0 offset 0 found 1 wanted 0 back 0x5648afda0f20
>> backpointer mismatch on [6674127745024 32768]
>> ---
>>
>> The extent at 6674127745024 seems to be a *DATA* extent.
>> While the current default nodesize is 16K and the ancient default nodesize is 4K.
>>
>> Unless you specified -n 32K at mkfs time, it's a DATA extent.
>
> I did not, so you must be right about DATA, which should be good, right?
> I don't mind losing data as long as the underlying metadata is correct.
>
> I should have given more data on the FS:
> gargamel:/var/local/src/btrfs-progs# btrfs fi df /mnt/btrfs_pool2/
> Data, single: total=6.28TiB, used=6.12TiB
> System, DUP: total=32.00MiB, used=720.00KiB
> Metadata, DUP: total=97.00GiB, used=94.39GiB

Tons of metadata, since the fs is so large.

> GlobalReserve, single: total=512.00MiB, used=0.00B
> gargamel:/var/local/src/btrfs-progs# btrfs fi usage /mnt/btrfs_pool2
> Overall:
>     Device size:          7.28TiB
>     Device allocated:     6.47TiB
>     Device unallocated:   824.48GiB
>     Device missing:       0.00B
>     Used:                 6.30TiB
>     Free (estimated):     994.45GiB  (min: 582.21GiB)
>     Data ratio:           1.00
>     Metadata ratio:       2.00
>     Global reserve:       512.00MiB  (used: 0.00B)
>
> Data,single: Size:6.28TiB, Used:6.12TiB
>    /dev/mapper/dshelf2     6.28TiB
>
> Metadata,DUP: Size:97.00GiB, Used:94.39GiB
>    /dev/mapper/dshelf2   194.00GiB
>
> System,DUP: Size:32.00MiB, Used:720.00KiB
>    /dev/mapper/dshelf2    64.00MiB
>
> Unallocated:
>    /dev/mapper/dshelf2   824.48GiB
>
>> Furthermore, it's a shared data backref, so it's using its parent tree
>> block to do the backref walk.
>>
>> And its parent tree block is 7566652473344.
>> That bytenr can't be found anywhere (including in the csum error output),
>> which is to say we can neither find that tree block nor reach the tree
>> root for it.
>>
>> Considering it's a data extent, its owner is either the root tree or an
>> fs/subvolume tree.
>>
>> Such cases are everywhere; I found other extents sized from 4K to 44K, so
>> I'm pretty sure some fs/subvolume tree is corrupted.
>> (A data extent in the root tree is seldom 4K sized)
>>
>> So unfortunately, your fs/subvolume trees are also corrupted.
>> And almost no chance to do a graceful recovery.
>
> So I'm confused here. You're saying my metadata is not corrupted (and in
> my case, I have DUP, so I should have 2 copies),

Nope, here I'm all talking about metadata (tree blocks).
The difference is the owner, either extent tree or fs/subvolume tree.

The fsck doesn't check data blocks.

> but with data blocks (which are not duped) corrupted, it's also possible
> to lose the filesystem in a way that it can't be taken back to a clean
> state, even by deleting some corrupted data?

No, it can't be repaired by deleting data.

The problem is, the tree blocks (metadata) that refer to these data blocks are corrupted.

And they are corrupted in such a way that both the extent tree (the tree containing extent allocation info) and the fs tree (the tree containing real fs info, like inodes and data locations) are corrupted.

So graceful recovery is not possible now.

>> [Alternatives]
>> I would recommend to use "btrfs restore -f " to restore the specified
>> subvolume.
>
> I don't need to restore data, the data is a backup.
> It will just take many days to recreate (plus many hours of typing from
> me, because the backup updates are automated, but recreating everything
> is not automated).
>
> So if I understand correctly, my metadata is fine (and I guess I have 2
> copies, so it would have been unlucky to get both copies corrupted), but
> enough data blocks got corrupted that btrfs cannot recover, even by
> deleting the corrupted data blocks. Correct?

Unfortunately, no; even though you have 2 copies, a lot of tree blocks are corrupted such that neither copy matches its checksum.

Just like the following tree block, where both copies have a wrong checksum:
---
checksum verify failed on 2899180224512 found ABBE39B0 wanted E0735D0E
checksum verify failed on 2899180224512 found 7A6D427F wanted 7E899EE5
---

> And is it not possible to clear the corrupted blocks like this?
> ./btrfs-corrupt-block -l 2899180224512 /dev/mapper/dshelf2
> and just accept the lost data, but get btrfs check --repair to deal with
> the deleted blocks and bring the rest back to a clean state?

No, that won't help.

Corrupted blocks are corrupted; that command is just trying to corrupt it again. It won't do the black magic to adjust tree blocks to avoid them.
Re: 4.11 relocate crash, null pointer + rolling back a filesystem by X hours?
On Fri, May 05, 2017 at 09:19:29AM +0800, Qu Wenruo wrote:
> Sorry for not noticing the link.

No problem, it was only one line amongst many :)
Thanks much for having had a look.

> [Conclusion]
> After checking the full result, some of the fs/subvolume trees are corrupted.
>
> [Details]
> Some examples here:
>
> ---
> ref mismatch on [6674127745024 32768] extent item 0, found 1
> Backref 6674127745024 parent 7566652473344 owner 0 offset 0 num_refs 0 not found in extent tree
> Incorrect local backref count on 6674127745024 parent 7566652473344 owner 0 offset 0 found 1 wanted 0 back 0x5648afda0f20
> backpointer mismatch on [6674127745024 32768]
> ---
>
> The extent at 6674127745024 seems to be a *DATA* extent.
> While the current default nodesize is 16K and the ancient default nodesize is 4K.
>
> Unless you specified -n 32K at mkfs time, it's a DATA extent.

I did not, so you must be right about DATA, which should be good, right? I don't mind losing data as long as the underlying metadata is correct.

I should have given more data on the FS:

gargamel:/var/local/src/btrfs-progs# btrfs fi df /mnt/btrfs_pool2/
Data, single: total=6.28TiB, used=6.12TiB
System, DUP: total=32.00MiB, used=720.00KiB
Metadata, DUP: total=97.00GiB, used=94.39GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

gargamel:/var/local/src/btrfs-progs# btrfs fi usage /mnt/btrfs_pool2
Overall:
    Device size:          7.28TiB
    Device allocated:     6.47TiB
    Device unallocated:   824.48GiB
    Device missing:       0.00B
    Used:                 6.30TiB
    Free (estimated):     994.45GiB  (min: 582.21GiB)
    Data ratio:           1.00
    Metadata ratio:       2.00
    Global reserve:       512.00MiB  (used: 0.00B)

Data,single: Size:6.28TiB, Used:6.12TiB
   /dev/mapper/dshelf2     6.28TiB

Metadata,DUP: Size:97.00GiB, Used:94.39GiB
   /dev/mapper/dshelf2   194.00GiB

System,DUP: Size:32.00MiB, Used:720.00KiB
   /dev/mapper/dshelf2    64.00MiB

Unallocated:
   /dev/mapper/dshelf2   824.48GiB

> Furthermore, it's a shared data backref, so it's using its parent tree
> block to do the backref walk.
>
> And its parent tree block is 7566652473344.
> That bytenr can't be found anywhere (including in the csum error output),
> which is to say we can neither find that tree block nor reach the tree
> root for it.
>
> Considering it's a data extent, its owner is either the root tree or an
> fs/subvolume tree.
>
> Such cases are everywhere; I found other extents sized from 4K to 44K, so
> I'm pretty sure some fs/subvolume tree is corrupted.
> (A data extent in the root tree is seldom 4K sized)
>
> So unfortunately, your fs/subvolume trees are also corrupted.
> And almost no chance to do a graceful recovery.

So I'm confused here. You're saying my metadata is not corrupted (and in my case, I have DUP, so I should have 2 copies), but with data blocks (which are not duped) corrupted, it's also possible to lose the filesystem in a way that it can't be taken back to a clean state, even by deleting some corrupted data?

> [Alternatives]
> I would recommend to use "btrfs restore -f " to restore the specified
> subvolume.

I don't need to restore data, the data is a backup. It will just take many days to recreate (plus many hours of typing from me, because the backup updates are automated, but recreating everything is not automated).

So if I understand correctly, my metadata is fine (and I guess I have 2 copies, so it would have been unlucky to get both copies corrupted), but enough data blocks got corrupted that btrfs cannot recover, even by deleting the corrupted data blocks. Correct?

And is it not possible to clear the corrupted blocks like this?
./btrfs-corrupt-block -l 2899180224512 /dev/mapper/dshelf2
and just accept the lost data, but get btrfs check --repair to deal with the deleted blocks and bring the rest back to a clean state?

Thanks,
Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 1024R/763BE901
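For completeness, the "btrfs restore" alternative Qu recommends would look roughly like this (a sketch; <bytenr> stands for a root bytenr taken from the -l listing, and the target directory is illustrative):

  # list candidate tree roots on the damaged device
  btrfs restore -l /dev/mapper/dshelf2
  # dry-run restore of one subvolume root into a scratch directory
  btrfs restore -f <bytenr> -D -v /dev/mapper/dshelf2 /mnt/scratch
  # real run, including snapshots and xattrs
  btrfs restore -f <bytenr> -s -x -v /dev/mapper/dshelf2 /mnt/scratch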
Re: 4.11 relocate crash, null pointer + rolling back a filesystem by X hours?
At 05/05/2017 09:19 AM, Qu Wenruo wrote:
> At 05/02/2017 11:23 AM, Marc MERLIN wrote:
>> Hi Chris,
>>
>> Thanks for the reply, much appreciated.
>>
>> On Mon, May 01, 2017 at 07:50:22PM -0600, Chris Murphy wrote:
>>> What about btrfs check (no repair), without and then also with
>>> --mode=lowmem?
>>>
>>> In theory I like the idea of a 24 hour rollback; but in normal usage
>>> Btrfs will eventually free up space containing stale and no longer
>>> necessary metadata. Like the chunk tree, it's always changing, so you
>>> get to a point, even with snapshots, that the old state of that tree
>>> is just - gone. A snapshot of an fs tree does not make the chunk tree
>>> frozen in time.
>>
>> Right, of course, I was being way over optimistic here. I kind of
>> forgot that metadata wasn't COW, my bad.
>> In any case, it's a big problem in my mind if no existing tools can fix
>> a file system of this size.
>>
>>> So before making any more changes, make sure you have a btrfs-image
>>> somewhere, even if it's huge. The offline checker needs to be able to
>>> repair it, right now it's all we have for such a case.
>>
>> The image will be huge, and take maybe 24H to make (last time it took
>> some silly amount of time like that), and honestly I'm not sure how
>> useful it'll be.
>> Outside of the kernel crashing if I do a btrfs balance, and hopefully
>> the crash report I gave is good enough, the state I'm in is not btrfs'
>> fault.
>>
>> If I can't roll back to a reasonably working state, with data loss of a
>> known quantity that I can recover from backup, I'll have to destroy the
>> filesystem and recover from scratch, which will take multiple days.
>> Since I can't wait too long before getting back to a working state, I
>> think I'm going to try btrfs check --repair after a scrub to get a list
>> of all the pathnames/inodes that are known to be damaged, and work from
>> there.
>> Sounds reasonable?
>>
>> Also, how is --mode=lowmem being useful?
>>
>> And for re-parenting a sub-subvolume, is that possible?
>> (I want to delete /sub1/ but I can't because I have /sub1/sub2 that's
>> also a subvolume and I'm not sure how to re-parent sub2 to somewhere
>> else so that I can subvolume delete sub1)
>>
>> In the meantime, a simple check without repair looks like this.
>> It will likely take many hours to complete:
>> gargamel:/var/local/space# btrfs check /dev/mapper/dshelf2
>> Checking filesystem on /dev/mapper/dshelf2
>> UUID: 03e9a50c-1ae6-4782-ab9c-5f310a98e653
>> checking extents
>> checksum verify failed on 3096461459456 found 0E6B7980 wanted FBE5477A
>> checksum verify failed on 3096461459456 found 0E6B7980 wanted FBE5477A
>> checksum verify failed on 2899180224512 found 7A6D427F wanted 7E899EE5
>> checksum verify failed on 2899180224512 found 7A6D427F wanted 7E899EE5
>> checksum verify failed on 2899180224512 found ABBE39B0 wanted E0735D0E
>> checksum verify failed on 2899180224512 found 7A6D427F wanted 7E899EE5
>> bytenr mismatch, want=2899180224512, have=3981076597540270796
>> checksum verify failed on 1449488023552 found CECC36AF wanted 199FE6C5
>> checksum verify failed on 1449488023552 found CECC36AF wanted 199FE6C5
>> checksum verify failed on 1449544613888 found 895D691B wanted A0C64D2B
>> checksum verify failed on 1449544613888 found 895D691B wanted A0C64D2B
>> parent transid verify failed on 1671538819072 wanted 293964 found 293902
>> parent transid verify failed on 1671538819072 wanted 293964 found 293902
>> checksum verify failed on 1671603781632 found 18BC28D6 wanted 372655A0
>> checksum verify failed on 1671603781632 found 18BC28D6 wanted 372655A0
>> checksum verify failed on 1759425052672 found 843B59F1 wanted F0FF7D00
>> checksum verify failed on 1759425052672 found 843B59F1 wanted F0FF7D00
>> checksum verify failed on 2182657212416 found CD8EFC0C wanted 70847071
>> checksum verify failed on 2182657212416 found CD8EFC0C wanted 70847071
>> checksum verify failed on 2898779357184 found 96395131 wanted 433D6E09
>> checksum verify failed on 2898779357184 found 96395131 wanted 433D6E09
>> checksum verify failed on 2899180224512 found 7A6D427F wanted 7E899EE5
>> checksum verify failed on 2899180224512 found 7A6D427F wanted 7E899EE5
>> checksum verify failed on 2899180224512 found ABBE39B0 wanted E0735D0E
>> checksum verify failed on 2899180224512 found 7A6D427F wanted 7E899EE5
>> bytenr mismatch, want=2899180224512, have=3981076597540270796
>> checksum verify failed on 2182657212416 found CD8EFC0C wanted 70847071
>> checksum verify failed on 2182657212416 found CD8EFC0C wanted 70847071
>> checksum verify failed on 2182657212416 found CD8EFC0C wanted 70847071
>> checksum verify failed on 2182657212416 found CD8EFC0C wanted 70847071
>> checksum verify failed on 2182657212416 found CD8EFC0C wanted 70847071
>> (...)
>
> Full output please.

Sorry for not noticing the link.

[Conclusion]
After checking the full result, some of the fs/subvolume trees are corrupted.

[Details]
Some examples here:

---
ref mismatch on [6674127745024 32768] extent item 0, found 1
Backref 6674127745024 parent 7566652473344 owner 0 offset 0 num_refs 0 not found in extent tree
Incorrect local backref count on 6674127745024 parent 7566652473344 owner 0 offset 0 found 1 wanted 0 back 0x5648afda0f20
backpointer mismatch on [6674127745024 32768]
---
Re: 4.11 relocate crash, null pointer + rolling back a filesystem by X hours?
At 05/02/2017 11:23 AM, Marc MERLIN wrote:
> Hi Chris,
>
> Thanks for the reply, much appreciated.
>
> On Mon, May 01, 2017 at 07:50:22PM -0600, Chris Murphy wrote:
>> What about btrfs check (no repair), without and then also with
>> --mode=lowmem?
>>
>> In theory I like the idea of a 24 hour rollback; but in normal usage
>> Btrfs will eventually free up space containing stale and no longer
>> necessary metadata. Like the chunk tree, it's always changing, so you
>> get to a point, even with snapshots, that the old state of that tree
>> is just - gone. A snapshot of an fs tree does not make the chunk tree
>> frozen in time.
>
> Right, of course, I was being way over optimistic here. I kind of forgot
> that metadata wasn't COW, my bad.
> In any case, it's a big problem in my mind if no existing tools can fix
> a file system of this size.
>
>> So before making any more changes, make sure you have a btrfs-image
>> somewhere, even if it's huge. The offline checker needs to be able to
>> repair it, right now it's all we have for such a case.
>
> The image will be huge, and take maybe 24H to make (last time it took
> some silly amount of time like that), and honestly I'm not sure how
> useful it'll be.
> Outside of the kernel crashing if I do a btrfs balance, and hopefully
> the crash report I gave is good enough, the state I'm in is not btrfs'
> fault.
>
> If I can't roll back to a reasonably working state, with data loss of a
> known quantity that I can recover from backup, I'll have to destroy the
> filesystem and recover from scratch, which will take multiple days.
> Since I can't wait too long before getting back to a working state, I
> think I'm going to try btrfs check --repair after a scrub to get a list
> of all the pathnames/inodes that are known to be damaged, and work from
> there.
> Sounds reasonable?
>
> Also, how is --mode=lowmem being useful?
>
> And for re-parenting a sub-subvolume, is that possible?
> (I want to delete /sub1/ but I can't because I have /sub1/sub2 that's
> also a subvolume and I'm not sure how to re-parent sub2 to somewhere
> else so that I can subvolume delete sub1)
>
> In the meantime, a simple check without repair looks like this.
> It will likely take many hours to complete:
> gargamel:/var/local/space# btrfs check /dev/mapper/dshelf2
> Checking filesystem on /dev/mapper/dshelf2
> UUID: 03e9a50c-1ae6-4782-ab9c-5f310a98e653
> checking extents
> checksum verify failed on 3096461459456 found 0E6B7980 wanted FBE5477A
> checksum verify failed on 3096461459456 found 0E6B7980 wanted FBE5477A
> checksum verify failed on 2899180224512 found 7A6D427F wanted 7E899EE5
> checksum verify failed on 2899180224512 found 7A6D427F wanted 7E899EE5
> checksum verify failed on 2899180224512 found ABBE39B0 wanted E0735D0E
> checksum verify failed on 2899180224512 found 7A6D427F wanted 7E899EE5
> bytenr mismatch, want=2899180224512, have=3981076597540270796
> checksum verify failed on 1449488023552 found CECC36AF wanted 199FE6C5
> checksum verify failed on 1449488023552 found CECC36AF wanted 199FE6C5
> checksum verify failed on 1449544613888 found 895D691B wanted A0C64D2B
> checksum verify failed on 1449544613888 found 895D691B wanted A0C64D2B
> parent transid verify failed on 1671538819072 wanted 293964 found 293902
> parent transid verify failed on 1671538819072 wanted 293964 found 293902
> checksum verify failed on 1671603781632 found 18BC28D6 wanted 372655A0
> checksum verify failed on 1671603781632 found 18BC28D6 wanted 372655A0
> checksum verify failed on 1759425052672 found 843B59F1 wanted F0FF7D00
> checksum verify failed on 1759425052672 found 843B59F1 wanted F0FF7D00
> checksum verify failed on 2182657212416 found CD8EFC0C wanted 70847071
> checksum verify failed on 2182657212416 found CD8EFC0C wanted 70847071
> checksum verify failed on 2898779357184 found 96395131 wanted 433D6E09
> checksum verify failed on 2898779357184 found 96395131 wanted 433D6E09
> checksum verify failed on 2899180224512 found 7A6D427F wanted 7E899EE5
> checksum verify failed on 2899180224512 found 7A6D427F wanted 7E899EE5
> checksum verify failed on 2899180224512 found ABBE39B0 wanted E0735D0E
> checksum verify failed on 2899180224512 found 7A6D427F wanted 7E899EE5
> bytenr mismatch, want=2899180224512, have=3981076597540270796
> checksum verify failed on 2182657212416 found CD8EFC0C wanted 70847071
> checksum verify failed on 2182657212416 found CD8EFC0C wanted 70847071
> checksum verify failed on 2182657212416 found CD8EFC0C wanted 70847071
> checksum verify failed on 2182657212416 found CD8EFC0C wanted 70847071
> checksum verify failed on 2182657212416 found CD8EFC0C wanted 70847071
> (...)

Full output please.

I know it will be long, but the point here is, the full output could help us at least locate where most of the corruption is.

If most of the corruption is only in the extent tree, the chance to recover increases hugely. As the extent tree is just backrefs for all allocated extents, it's not really important if recovery (read) is the primary goal.

But if other trees (fs or subvolume trees important to you) also got corrupted, I'm afraid your last chance will be
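A simple way to capture that full output for the list (a sketch):

  btrfs check /dev/mapper/dshelf2 2>&1 | tee /tmp/btrfs-check.log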
Re: 4.11 relocate crash, null pointer + rolling back a filesystem by X hours?
At 05/02/2017 02:08 AM, Marc MERLIN wrote:
> So, I forgot to mention that it's my main media and backup server that
> got corrupted. Yes, I do actually have a backup of a backup server, but
> it's going to take days to recover due to the amount of data to copy
> back, not counting lots of manual typing due to the number of
> subvolumes, btrfs send/receive relationships and so forth.
>
> Really, I should be able to roll back all writes from the last 24H, run
> a check --repair/scrub on top just to be sure, and be back on track.
>
> In the meantime, the good news is that the filesystem doesn't crash the
> kernel (the crash posted below) now that I was able to cancel the btrfs
> balance, but it goes read-only at the drop of a hat, even when I'm
> trying to delete recent snapshots and all data that was potentially
> written in the last 24H.
>
> On Mon, May 01, 2017 at 10:06:41AM -0700, Marc MERLIN wrote:
>> I have a filesystem that sadly got corrupted by a SAS card I just
>> installed yesterday.
>> I don't think in a case like this there is a way to roll back all
>> writes across all subvolumes in the last 24H, correct?

Sorry for the late reply. I thought the case was already finished, as I see little chance to recover. :(

No, there is no way to roll back, unless you're completely sure only 1 transaction commit happened in the last 24H. (Well, not really possible in the real world.)

Btrfs is only capable of rolling back to the *previous* commit. That's ensured by forced metadata CoW. But beyond the previous commit, only god knows.

If all metadata CoW writes were done in some place never used by any previous metadata, then there is a chance to recover. But mostly the possibility is very low; some mount options like ssd will change the extent allocator behavior to improve the possibility, but it still needs a lot of luck.

More detailed comments will be in my reply to the btrfs check mail.

Thanks,
Qu

>> Is the best thing to go in each subvolume, delete the recent snapshots,
>> and rename the one from 24H ago as the current one?
>
> Well, just like I expected, it's a pain in the rear, and this can't even
> help fix the top-level mountpoint, which doesn't have snapshots, so I
> can't roll it back.
> btrfs should really have an easy way to roll back X hours or days to
> recover from garbage written after a good known point, given that it is
> COW after all.
> Is there a way to do this with check --repair maybe?
>
> In the meantime, I got stuck while trying to delete snapshots:
> Let's say I have this:
> ID 428 gen 294021 top level 5 path backup
> ID 2023 gen 294021 top level 5 path Soft
> ID 3021 gen 294051 top level 428 path backup/debian32
> ID 4400 gen 294018 top level 428 path backup/debian64
> ID 4930 gen 294019 top level 428 path backup/ubuntu
>
> I can easily
> Delete subvolume (no-commit): '/mnt/btrfs_pool2/Soft'
> and then:
> gargamel:/mnt/btrfs_pool2# mv Soft_rw.20170430_01:50:22 Soft
>
> But I can't delete backup, which actually is mostly only a directory
> containing other things (in hindsight I shouldn't have made that a
> subvolume):
> Delete subvolume (no-commit): '/mnt/btrfs_pool2/backup'
> ERROR: cannot delete '/mnt/btrfs_pool2/backup': Directory not empty
>
> This is because backup has a lot of subvolumes due to btrfs send/receive
> relationships. Is it possible to recover there? Can you reparent
> subvolumes to a different subvolume without doing a full copy via btrfs
> send/receive?
>
> Thanks,
> Marc
>
> BTRFS warning (device dm-5): failed to load free space cache for block group 6746013696000, rebuilding it now
> BTRFS warning (device dm-5): block group 6754603630592 has wrong amount of free space
> BTRFS warning (device dm-5): failed to load free space cache for block group 6754603630592, rebuilding it now
> BTRFS warning (device dm-5): block group 7125178777600 has wrong amount of free space
> BTRFS warning (device dm-5): failed to load free space cache for block group 7125178777600, rebuilding it now
> BTRFS error (device dm-5): bad tree block start 3981076597540270796 2899180224512
> BTRFS error (device dm-5): bad tree block start 942082474969670243 2899180224512
> BTRFS: error (device dm-5) in __btrfs_free_extent:6944: errno=-5 IO failure
> BTRFS info (device dm-5): forced readonly
> BTRFS: error (device dm-5) in btrfs_run_delayed_refs:2961: errno=-5 IO failure
> BUG: unable to handle kernel NULL pointer dereference at (null)
> IP: __del_reloc_root+0x3f/0xa6
> PGD 189a0e067 PUD 189a0f067 PMD 0
> Oops: [#1] PREEMPT SMP
> Modules linked in: veth ip6table_filter ip6_tables ebtable_nat ebtables ppdev lp xt_addrtype br_netfilter bridge stp llc tun autofs4 softdog binfmt_misc ftdi_sio nfsd auth_rpcgss nfs_acl nfs lockd grace fscache sunrpc ipt_REJECT nf_reject_ipv4 xt_conntrack xt_mark xt_nat xt_tcpudp nf_log_ipv4 nf_log_common xt_LOG iptable_mangle iptable_filter lm85 hwmon_vid pl2303 dm_snapshot dm_bufio iptable_nat ip_tables nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_conntrack_ftp ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_nat nf_conntrack x_tables sg st snd_pcm_oss snd_mixer_oss bcache kvm_intel kvm
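As an aside, the usual way out of a balance that crashes on resume, as described above, is to skip it at mount time and then cancel it (a sketch, assuming the filesystem still mounts):

  # don't resume the interrupted balance when mounting
  mount -o skip_balance /dev/mapper/dshelf2 /mnt/btrfs_pool2
  # then cancel it for good
  btrfs balance cancel /mnt/btrfs_pool2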
Re: 4.11 relocate crash, null pointer + rolling back a filesystem by X hours?
On Mon, 1 May 2017 22:56:06 -0600, Chris Murphy wrote:

> On Mon, May 1, 2017 at 9:23 PM, Marc MERLIN wrote:
> > Hi Chris,
> >
> > Thanks for the reply, much appreciated.
> >
> > On Mon, May 01, 2017 at 07:50:22PM -0600, Chris Murphy wrote:
> >> What about btrfs check (no repair), without and then also with
> >> --mode=lowmem?
> >>
> >> In theory I like the idea of a 24 hour rollback; but in normal
> >> usage Btrfs will eventually free up space containing stale and no
> >> longer necessary metadata. Like the chunk tree, it's always
> >> changing, so you get to a point, even with snapshots, that the old
> >> state of that tree is just - gone. A snapshot of an fs tree does
> >> not make the chunk tree frozen in time.
> >
> > Right, of course, I was being way over optimistic here. I kind of
> > forgot that metadata wasn't COW, my bad.
>
> Well it is COW. But there's more to the file system than fs trees, and
> just because an fs tree gets snapshot doesn't mean all data is
> snapshot. So whether snapshot or not, there's metadata that becomes
> obsolete as the file system is updated and those areas get freed up
> and eventually overwritten.
>
> >> In any case, it's a big problem in my mind if no existing tools can
> >> fix a file system of this size. So before making any more changes,
> >> make sure you have a btrfs-image somewhere, even if it's huge. The
> >> offline checker needs to be able to repair it, right now it's all
> >> we have for such a case.
> >
> > The image will be huge, and take maybe 24H to make (last time it
> > took some silly amount of time like that), and honestly I'm not
> > sure how useful it'll be.
> > Outside of the kernel crashing if I do a btrfs balance, and
> > hopefully the crash report I gave is good enough, the state I'm in
> > is not btrfs' fault.
> >
> > If I can't roll back to a reasonably working state, with data loss
> > of a known quantity that I can recover from backup, I'll have to
> > destroy the filesystem and recover from scratch, which will take
> > multiple days. Since I can't wait too long before getting back to a
> > working state, I think I'm going to try btrfs check --repair after
> > a scrub to get a list of all the pathnames/inodes that are known to
> > be damaged, and work from there.
> > Sounds reasonable?
>
> Yes.
>
> > Also, how is --mode=lowmem being useful?
>
> Testing. lowmem is a different implementation, so it might find
> different things from the regular check.
>
> > And for re-parenting a sub-subvolume, is that possible?
> > (I want to delete /sub1/ but I can't because I have /sub1/sub2
> > that's also a subvolume and I'm not sure how to re-parent sub2 to
> > somewhere else so that I can subvolume delete sub1)
>
> Well you can move sub2 out of sub1 just like a directory and then
> delete sub1. If it's read-only it can't be moved, but you can use
> btrfs property get/set ro true/false to temporarily make it not
> read-only, move it, then make it read-only again, and it's still fine
> to use with btrfs send receive.
>
> > In the meantime, a simple check without repair looks like this.
> > It will likely take many hours to complete:
> > gargamel:/var/local/space# btrfs check /dev/mapper/dshelf2
> > Checking filesystem on /dev/mapper/dshelf2
> > UUID: 03e9a50c-1ae6-4782-ab9c-5f310a98e653
> > checking extents
> > checksum verify failed on 3096461459456 found 0E6B7980 wanted FBE5477A
> > checksum verify failed on 3096461459456 found 0E6B7980 wanted FBE5477A
> > checksum verify failed on 2899180224512 found 7A6D427F wanted 7E899EE5
> > checksum verify failed on 2899180224512 found 7A6D427F wanted 7E899EE5
> > checksum verify failed on 2899180224512 found ABBE39B0 wanted E0735D0E
> > checksum verify failed on 2899180224512 found 7A6D427F wanted 7E899EE5
> > bytenr mismatch, want=2899180224512, have=3981076597540270796
> > checksum verify failed on 1449488023552 found CECC36AF wanted 199FE6C5
> > checksum verify failed on 1449488023552 found CECC36AF wanted 199FE6C5
> > checksum verify failed on 1449544613888 found 895D691B wanted A0C64D2B
> > checksum verify failed on 1449544613888 found 895D691B wanted A0C64D2B
> > parent transid verify failed on 1671538819072 wanted 293964 found 293902
> > parent transid verify failed on 1671538819072 wanted 293964 found 293902
> > checksum verify failed on 1671603781632 found 18BC28D6 wanted 372655A0
> > checksum verify failed on 1671603781632 found 18BC28D6 wanted 372655A0
> > checksum verify failed on 1759425052672 found 843B59F1 wanted F0FF7D00
> > checksum verify failed on 1759425052672 found 843B59F1 wanted F0FF7D00
> > checksum verify failed on 2182657212416 found CD8EFC0C wanted 70847071
> > checksum verify failed on 2182657212416 found CD8EFC0C wanted 70847071
> > checksum verify failed on 2898779357184 found 96395131 wanted 433D6E09
Re: 4.11 relocate crash, null pointer + rolling back a filesystem by X hours?
On Tue, 2 May 2017 05:01:02 +0000 (UTC), Duncan <1i5t5.dun...@cox.net> wrote:

> Of course on-list I'm somewhat known for my arguments propounding the
> notion that any filesystem that's too big to be practically
> maintained (including time necessary to restore from backups, should
> that be necessary for whatever reason) is... too big... and should
> ideally be broken along logical and functional boundaries into a
> number of individual smaller filesystems until such point as each one
> is found to be practically maintainable within a reasonably practical
> time frame. Don't put all the eggs in one basket, and when the bottom
> of one of those baskets inevitably falls out, most of your eggs will
> be safe in other baskets. =:^)

Hehe... Yes, you're a fan of small filesystems. I'm more from the opposite camp, preferring one big filesystem, to not mess around with the size constraints of small filesystems fighting for the same volume space. It also gives such filesystems better chances for data locality, instead of data being spread across totally different parts of your fs mounts, and can reduce head movement.

Of course, much of this is not true if you use different devices per filesystem, or use SSDs, or a SAN where you have no real control over the physical placement of image stripes anyway. But well...

In an ideal world, subvolumes of btrfs would be totally independent of each other, only sharing the same volume and dynamically allocating chunks of space from it. If one is broken, it is simply not usable and it should be destroyable. A garbage collector would grab the leftover chunks from the subvolume and free them, and you could recreate this subvolume from backup. In reality, shared extents will cross subvolume borders, so this is probably not how things could work anytime in the near or far future.

This idea is more like having thinly provisioned LVM volumes which allocate space as the filesystems on top need it, much like doing thinly provisioned images on a VM host system. The problem here is that, unlike subvolumes, those chunks of space could never be given back to the host, as it doesn't know whether they are still in use. Of course, there are implementations available which allow thinning the images by passing TRIM through from the guest to the host (or by other means of communication channels between host and guest), but that usually doesn't give good performance, if it's even supported. I once tried to exploit this in VirtualBox and hoped it would translate guest discards into hole-punching requests on the host, and it's even documented to work that way... But (a) it was horribly slow, and (b) it was incredibly unstable, to the point of being useless. OTOH, it's not announced as a stable feature and has to be enabled by manually editing the XML config files.

But I still like the idea: Is it possible to make btrfs still work if one subvolume gets corrupted? Of course it should have ways of telling the user which other subvolumes are interconnected through shared extents, so those would also be discarded upon corruption cleanup - at least if those extents couldn't be made any sense of any longer. Since corruption is mostly an issue for subvolumes being written to, snapshots should be mostly safe.

Such a feature would also only make sense if btrfs had an online repair tool. BTW, are there plans for having an online repair tool in the future? Maybe one that only scans and fixes parts of the filesystem (for obvious performance reasons, wrt Duncan's idea of handling filesystems), i.e. those parts where the kernel discovered corruption? If I could then just delete and restore affected files, this would be even better than having independent subvolumes like above.

--
Regards,
Kai

Replies to list-only preferred.
Re: 4.11 relocate crash, null pointer + rolling back a filesystem by X hours?
On Mon, May 01, 2017 at 10:56:06PM -0600, Chris Murphy wrote:
> > Right, of course, I was being way over optimistic here. I kind of
> > forgot that metadata wasn't COW, my bad.
>
> Well it is COW. But there's more to the file system than fs trees, and
> just because an fs tree gets snapshot doesn't mean all data is
> snapshot. So whether snapshot or not, there's metadata that becomes
> obsolete as the file system is updated and those areas get freed up
> and eventually overwritten.

Got it, thanks for explaining.

> > Also, how is --mode=lowmem being useful?
>
> Testing. lowmem is a different implementation, so it might find
> different things from the regular check.

I see. I've fired off some scrub -r and then check to run overnight; I'll see if it finishes, assuming the kernel doesn't crash again. (Yeah, just to make things simpler, I'm hitting another issue when I/O piles up on btrfs on top of dmcrypt on top of bcache:
http://lkml.iu.edu/hypermail/linux/kernel/1705.0/00626.html
https://pastebin.com/YqE4riw0
but that's not a bcache bug, just something else getting in the way.)

> > And for re-parenting a sub-subvolume, is that possible?
> > (I want to delete /sub1/ but I can't because I have /sub1/sub2 that's
> > also a subvolume and I'm not sure how to re-parent sub2 to somewhere
> > else so that I can subvolume delete sub1)
>
> Well you can move sub2 out of sub1 just like a directory and then
> delete sub1. If it's read-only it can't be moved, but you can use
> btrfs property get/set ro true/false to temporarily make it not
> read-only, move it, then make it read-only again, and it's still fine
> to use with btrfs send receive.

Ah, I didn't think mv would work from inside a subvolume to outside of a subvolume without copying data (it doesn't for files), but I guess it does for subvolumes, good point. I'll try that, thanks.

> Not understanding the problem, it's by definition naive for me to
> suggest it should go read-only sooner before hosing itself. But I'd
> like to think it's possible for Btrfs to look backward every once in a
> while for sanity checking, to limit damage should it be occurring even
> if the hardware isn't reporting any problems.

Fair point. To be honest, maybe btrfs could indeed have detected problems earlier, but ultimately it's not really its fault if bad things happen when I'm having repeated storage errors underneath. For all I know, some data got written after getting corrupted, and btrfs would not notice that right away.

Now, I kind of naively thought I could simply unroll all writes done after a certain point. You pointed out (rightfully so) that it's not nearly as simple as I was hoping.

So at this point, I think it's just a matter of me providing check/repair logs if they are useful, and someone looking into this balance causing a kernel crash, which is IMO the only real thing that btrfs should reasonably fix. I'll update the thread when I have more logs and have moved further on the recovery.

Thanks,
Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 1024R/763BE901
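For concreteness, that overnight sequence would look roughly like this (a sketch, using the device and mountpoint from earlier in the thread):

  # read-only scrub: -B stay in foreground, -d per-device stats, -r read-only
  btrfs scrub start -B -d -r /mnt/btrfs_pool2
  # then both check implementations (check wants the filesystem unmounted)
  umount /mnt/btrfs_pool2
  btrfs check /dev/mapper/dshelf2
  btrfs check --mode=lowmem /dev/mapper/dshelf2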
Re: 4.11 relocate crash, null pointer + rolling back a filesystem by X hours?
Marc MERLIN posted on Mon, 01 May 2017 20:23:46 -0700 as excerpted:

> Also, how is --mode=lowmem being useful?

FWIW, I just watched your talk that's linked from the wiki, and wondered what you were doing these days as I hadn't seen any posts from you here for awhile.

Well, that you're asking that question confirms you've not been following the list too closely... Of course that's understandable as people have other stuff to do, but just sayin'.

The answer is... btrfs check in lowmem mode isn't simply lowmem, it's also effectively a very nearly entirely rewritten second implementation, which has already demonstrated its worth as it has allowed finding and fixing a number of bugs in normal-mode check. Of course normal-mode check has returned the favor a few times as well, so it is now reasonably standard list troubleshooting practice to ask for the output from both modes to see what and where they differ, especially if it's not something known to be directly fixable by normal mode, which of course remains the more mature default.

So even if neither one can actually fix the problem ATM, any differences in output both lend important clues to the real problem, and potentially help developers to find and fix bugs in one or the other implementation.

Tho it's worth noting that lowmem mode can be expected to take longer, as it favors lower memory usage over speed, just as the mode title suggests it will. On a filesystem as big as yours... it may unfortunately not be entirely practical, especially if, as you hint, there's at least some time pressure here, tho it's not extreme.

Of course on-list I'm somewhat known for my arguments propounding the notion that any filesystem that's too big to be practically maintained (including time necessary to restore from backups, should that be necessary for whatever reason) is... too big... and should ideally be broken along logical and functional boundaries into a number of individual smaller filesystems until such point as each one is found to be practically maintainable within a reasonably practical time frame. Don't put all the eggs in one basket, and when the bottom of one of those baskets inevitably falls out, most of your eggs will be safe in other baskets. =:^)

But as someone else (pg, IIRC) on-list is fond of saying, lots of other people "know better" (TM). Whatever. It's your data, your systems and your time, not mine. I just know what I've found (sometimes finding it the hard way!) to work best for me, and TBs on TBs of data on a single filesystem, even if it's a backup and is itself backed up, isn't something I'd be putting my own faith in, as the time even for a simple restore from backups is simply too high for me to consider it at all practical. =:^)

> And for re-parenting a sub-subvolume, is that possible?
> (I want to delete /sub1/ but I can't because I have /sub1/sub2 that's
> also a subvolume and I'm not sure how to re-parent sub2 to somewhere
> else so that I can subvolume delete sub1)

As I believe you know, my own use-case doesn't deal with subvolumes and snapshots, so this may be of limited practicality, but FWIW, the sysadmin's guide discussion of snapshot management and special cases seems apropos as a first stop, before going further:

https://btrfs.wiki.kernel.org/index.php/SysadminGuide#Managing_Snapshots

Note that toward the bottom of "management" it discusses moving subvolumes (which will obviously reparent them), but then below that, in special cases, it says that read-only subvolumes (and thus snapshots) cannot be moved, explaining why.

*BUT*, and here's the "go further" part, keep in mind that subvolume-read-only is a property, gettable and settable by btrfs property.

So you should be able to unset the read-only property of a subvolume or snapshot, move it, then if desired, set it again.

Of course I wouldn't expect send -p to work with such a snapshot, but send -c /might/ still work, I'm not actually sure but I'd consider it worth trying. (I'd try -p as well, but expect it to fail...)

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
Re: 4.11 relocate crash, null pointer + rolling back a filesystem by X hours?
On Mon, May 1, 2017 at 9:23 PM, Marc MERLIN wrote:
> Hi Chris,
>
> Thanks for the reply, much appreciated.
>
> On Mon, May 01, 2017 at 07:50:22PM -0600, Chris Murphy wrote:
>> What about btrfs check (no repair), without and then also with
>> --mode=lowmem?
>>
>> In theory I like the idea of a 24 hour rollback; but in normal usage
>> Btrfs will eventually free up space containing stale and no longer
>> necessary metadata. Like the chunk tree, it's always changing, so you
>> get to a point, even with snapshots, that the old state of that tree
>> is just - gone. A snapshot of an fs tree does not make the chunk tree
>> frozen in time.
>
> Right, of course, I was being way over optimistic here. I kind of forgot
> that metadata wasn't COW, my bad.

Well it is COW. But there's more to the file system than fs trees, and just because an fs tree gets snapshot doesn't mean all data is snapshot. So whether snapshot or not, there's metadata that becomes obsolete as the file system is updated and those areas get freed up and eventually overwritten.

>> In any case, it's a big problem in my mind if no existing tools can
>> fix a file system of this size. So before making any more changes, make
>> sure you have a btrfs-image somewhere, even if it's huge. The offline
>> checker needs to be able to repair it, right now it's all we have for
>> such a case.
>
> The image will be huge, and take maybe 24H to make (last time it took
> some silly amount of time like that), and honestly I'm not sure how
> useful it'll be.
> Outside of the kernel crashing if I do a btrfs balance, and hopefully
> the crash report I gave is good enough, the state I'm in is not btrfs'
> fault.
>
> If I can't roll back to a reasonably working state, with data loss of a
> known quantity that I can recover from backup, I'll have to destroy the
> filesystem and recover from scratch, which will take multiple days.
> Since I can't wait too long before getting back to a working state, I
> think I'm going to try btrfs check --repair after a scrub to get a list
> of all the pathnames/inodes that are known to be damaged, and work from
> there.
> Sounds reasonable?

Yes.

> Also, how is --mode=lowmem being useful?

Testing. lowmem is a different implementation, so it might find different things from the regular check.

> And for re-parenting a sub-subvolume, is that possible?
> (I want to delete /sub1/ but I can't because I have /sub1/sub2 that's
> also a subvolume and I'm not sure how to re-parent sub2 to somewhere
> else so that I can subvolume delete sub1)

Well you can move sub2 out of sub1 just like a directory and then delete sub1. If it's read-only it can't be moved, but you can use btrfs property get/set ro true/false to temporarily make it not read-only, move it, then make it read-only again, and it's still fine to use with btrfs send receive.

> In the meantime, a simple check without repair looks like this.
> It will likely take many hours to complete:
> gargamel:/var/local/space# btrfs check /dev/mapper/dshelf2
> Checking filesystem on /dev/mapper/dshelf2
> UUID: 03e9a50c-1ae6-4782-ab9c-5f310a98e653
> checking extents
> checksum verify failed on 3096461459456 found 0E6B7980 wanted FBE5477A
> checksum verify failed on 3096461459456 found 0E6B7980 wanted FBE5477A
> checksum verify failed on 2899180224512 found 7A6D427F wanted 7E899EE5
> checksum verify failed on 2899180224512 found 7A6D427F wanted 7E899EE5
> checksum verify failed on 2899180224512 found ABBE39B0 wanted E0735D0E
> checksum verify failed on 2899180224512 found 7A6D427F wanted 7E899EE5
> bytenr mismatch, want=2899180224512, have=3981076597540270796
> checksum verify failed on 1449488023552 found CECC36AF wanted 199FE6C5
> checksum verify failed on 1449488023552 found CECC36AF wanted 199FE6C5
> checksum verify failed on 1449544613888 found 895D691B wanted A0C64D2B
> checksum verify failed on 1449544613888 found 895D691B wanted A0C64D2B
> parent transid verify failed on 1671538819072 wanted 293964 found 293902
> parent transid verify failed on 1671538819072 wanted 293964 found 293902
> checksum verify failed on 1671603781632 found 18BC28D6 wanted 372655A0
> checksum verify failed on 1671603781632 found 18BC28D6 wanted 372655A0
> checksum verify failed on 1759425052672 found 843B59F1 wanted F0FF7D00
> checksum verify failed on 1759425052672 found 843B59F1 wanted F0FF7D00
> checksum verify failed on 2182657212416 found CD8EFC0C wanted 70847071
> checksum verify failed on 2182657212416 found CD8EFC0C wanted 70847071
> checksum verify failed on 2898779357184 found 96395131 wanted 433D6E09
> checksum verify failed on 2898779357184 found 96395131 wanted 433D6E09
> checksum verify failed on 2899180224512 found 7A6D427F wanted 7E899EE5
> checksum verify failed on 2899180224512 found 7A6D427F wanted 7E899EE5
> checksum verify failed on 2899180224512 found ABBE39B0 wanted E0735D0E
> checksum verify failed on 2899180224512 found 7A6D427F wanted 7E899EE5
> bytenr mismatch, want=2899180224512, have=3981076597540270796
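A concrete sketch of the move Chris describes, using Marc's sub1/sub2 example (paths illustrative):

  # temporarily clear the read-only flag so the snapshot can be moved
  btrfs property set -ts /mnt/pool/sub1/sub2 ro false
  mv /mnt/pool/sub1/sub2 /mnt/pool/sub2   # plain rename, no data copied
  btrfs property set -ts /mnt/pool/sub2 ro true
  # sub1 no longer contains subvolumes and can be deleted
  btrfs subvolume delete /mnt/pool/sub1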
Re: 4.11 relocate crash, null pointer + rolling back a filesystem by X hours?
Hi Chris,

Thanks for the reply, much appreciated.

On Mon, May 01, 2017 at 07:50:22PM -0600, Chris Murphy wrote:
> What about btrfs check (no repair), without and then also with --mode=lowmem?
>
> In theory I like the idea of a 24 hour rollback; but in normal usage
> Btrfs will eventually free up space containing stale and no longer
> necessary metadata. Like the chunk tree, it's always changing, so you
> get to a point, even with snapshots, that the old state of that tree
> is just - gone. A snapshot of an fs tree does not make the chunk tree
> frozen in time.

Right, of course, I was being way too optimistic here. I kind of forgot
that metadata wasn't COW, my bad.

> In any case, it's a big problem in my mind if no existing tools can
> fix a file system of this size. So before making any more changes, make
> sure you have a btrfs-image somewhere, even if it's huge. The offline
> checker needs to be able to repair it; right now it's all we have for
> such a case.

The image will be huge, and take maybe 24H to make (last time it took
some silly amount of time like that), and honestly I'm not sure how
useful it'll be.

Outside of the kernel crashing if I do a btrfs balance, and hopefully
the crash report I gave is good enough, the state I'm in is not btrfs'
fault.

If I can't roll back to a reasonably working state, with data loss of a
known quantity that I can recover from backup, I'll have to destroy the
filesystem and recover from scratch, which will take multiple days.
Since I can't wait too long before getting back to a working state, I
think I'm going to try btrfs check --repair after a scrub to get a list
of all the pathnames/inodes that are known to be damaged, and work from
there. Sounds reasonable?

Also, how is --mode=lowmem being useful?

And for re-parenting a sub-subvolume, is that possible?
(I want to delete /sub1/ but I can't because I have /sub1/sub2 that's
also a subvolume, and I'm not sure how to re-parent sub2 to somewhere
else so that I can subvolume delete sub1)

In the meantime, a simple check without repair looks like this.
It will likely take many hours to complete:

gargamel:/var/local/space# btrfs check /dev/mapper/dshelf2
Checking filesystem on /dev/mapper/dshelf2
UUID: 03e9a50c-1ae6-4782-ab9c-5f310a98e653
checking extents
checksum verify failed on 3096461459456 found 0E6B7980 wanted FBE5477A
checksum verify failed on 3096461459456 found 0E6B7980 wanted FBE5477A
checksum verify failed on 2899180224512 found 7A6D427F wanted 7E899EE5
checksum verify failed on 2899180224512 found 7A6D427F wanted 7E899EE5
checksum verify failed on 2899180224512 found ABBE39B0 wanted E0735D0E
checksum verify failed on 2899180224512 found 7A6D427F wanted 7E899EE5
bytenr mismatch, want=2899180224512, have=3981076597540270796
checksum verify failed on 1449488023552 found CECC36AF wanted 199FE6C5
checksum verify failed on 1449488023552 found CECC36AF wanted 199FE6C5
checksum verify failed on 1449544613888 found 895D691B wanted A0C64D2B
checksum verify failed on 1449544613888 found 895D691B wanted A0C64D2B
parent transid verify failed on 1671538819072 wanted 293964 found 293902
parent transid verify failed on 1671538819072 wanted 293964 found 293902
checksum verify failed on 1671603781632 found 18BC28D6 wanted 372655A0
checksum verify failed on 1671603781632 found 18BC28D6 wanted 372655A0
checksum verify failed on 1759425052672 found 843B59F1 wanted F0FF7D00
checksum verify failed on 1759425052672 found 843B59F1 wanted F0FF7D00
checksum verify failed on 2182657212416 found CD8EFC0C wanted 70847071
checksum verify failed on 2182657212416 found CD8EFC0C wanted 70847071
checksum verify failed on 2898779357184 found 96395131 wanted 433D6E09
checksum verify failed on 2898779357184 found 96395131 wanted 433D6E09
checksum verify failed on 2899180224512 found 7A6D427F wanted 7E899EE5
checksum verify failed on 2899180224512 found 7A6D427F wanted 7E899EE5
checksum verify failed on 2899180224512 found ABBE39B0 wanted E0735D0E
checksum verify failed on 2899180224512 found 7A6D427F wanted 7E899EE5
bytenr mismatch, want=2899180224512, have=3981076597540270796
checksum verify failed on 2182657212416 found CD8EFC0C wanted 70847071
checksum verify failed on 2182657212416 found CD8EFC0C wanted 70847071
checksum verify failed on 2182657212416 found CD8EFC0C wanted 70847071
checksum verify failed on 2182657212416 found CD8EFC0C wanted 70847071
checksum verify failed on 2182657212416 found CD8EFC0C wanted 70847071
(...)

Thanks,
Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 1024R/763BE901
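For readers following along, Marc's scrub-then-check plan could look
roughly like this. The device name is from the thread; the /mnt/dshelf2
mount point is a guess, the grep pattern is an assumption about how
scrub errors appear in the kernel log, and btrfs check must run with the
filesystem unmounted:

  btrfs scrub start -Bd /mnt/dshelf2             # -B waits in the foreground, -d shows per-device stats
  dmesg | grep 'checksum error'                  # scrub reports damaged inodes/paths to the kernel log
  umount /mnt/dshelf2
  btrfs check --mode=lowmem /dev/mapper/dshelf2  # the alternate checker implementation, as a second opinion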
Re: 4.11 relocate crash, null pointer + rolling back a filesystem by X hours?
What about btrfs check (no repair), without and then also with --mode=lowmem?

In theory I like the idea of a 24 hour rollback; but in normal usage
Btrfs will eventually free up space containing stale and no longer
necessary metadata. Like the chunk tree, it's always changing, so you
get to a point, even with snapshots, that the old state of that tree
is just - gone. A snapshot of an fs tree does not make the chunk tree
frozen in time.

Doing what you want maybe isn't a ton of work if it could be based on a
variation of the existing btrfs seed device code. Call it a "super
snapshot".

I like the idea of triage, where bad parts of the file system can just
be cut off. Other filesystems would say this is hardware sabotage and
nothing can be done. Btrfs is a bit deceptive in that it sort of invites
the idea that we can use hardware that isn't proven, and the fs can
survive.

In any case, it's a big problem in my mind if no existing tools can
fix a file system of this size. So before making any more changes, make
sure you have a btrfs-image somewhere, even if it's huge. The offline
checker needs to be able to repair it; right now it's all we have for
such a case.

Chris Murphy
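A sketch of capturing the metadata image Chris asks for, with the device
name from this thread and an illustrative output path (-c sets the
compression level and -t the thread count; -s can additionally sanitize
file names if the image will be shared with developers):

  btrfs-image -c9 -t4 /dev/mapper/dshelf2 /mnt/scratch/dshelf2.img

btrfs-image captures metadata only, not file data, which is why it is
much smaller than the filesystem itself but can still be large on an
array this size.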
Re: 4.11 relocate crash, null pointer + rolling back a filesystem by X hours?
So, I forgot to mention that it's my main media and backup server that
got corrupted. Yes, I do actually have a backup of the backup server,
but it's going to take days to recover due to the amount of data to copy
back, not counting lots of manual typing due to the number of
subvolumes, btrfs send/receive relationships, and so forth.

Really, I should be able to roll back all writes from the last 24H, run
a check --repair/scrub on top just to be sure, and be back on track.

In the meantime, the good news is that the filesystem doesn't crash the
kernel (see the posted crash below) now that I was able to cancel the
btrfs balance, but it goes read-only at the drop of a hat, even when I'm
trying to delete recent snapshots and all data that was potentially
written in the last 24H.

On Mon, May 01, 2017 at 10:06:41AM -0700, Marc MERLIN wrote:
> I have a filesystem that sadly got corrupted by a SAS card I just
> installed yesterday.
>
> I don't think in a case like this there is a way to roll back all
> writes across all subvolumes in the last 24H, correct?
>
> Is the best thing to go in each subvolume, delete the recent snapshots,
> and rename the one from 24H ago as the current one?

Well, just like I expected, it's a pain in the rear, and this can't even
help fix the top-level mountpoint, which doesn't have snapshots, so I
can't roll it back.

btrfs should really have an easy way to roll back X hours or days to
recover from garbage written after a known good point, given that it is
COW after all. Is there a way to do this with check --repair maybe?

In the meantime, I got stuck while trying to delete snapshots.
Let's say I have this:

ID 428 gen 294021 top level 5 path backup
ID 2023 gen 294021 top level 5 path Soft
ID 3021 gen 294051 top level 428 path backup/debian32
ID 4400 gen 294018 top level 428 path backup/debian64
ID 4930 gen 294019 top level 428 path backup/ubuntu

I can easily:
Delete subvolume (no-commit): '/mnt/btrfs_pool2/Soft'
and then:
gargamel:/mnt/btrfs_pool2# mv Soft_rw.20170430_01:50:22 Soft

But I can't delete backup, which actually is mostly only a directory
containing other things (in hindsight I shouldn't have made that a
subvolume):
Delete subvolume (no-commit): '/mnt/btrfs_pool2/backup'
ERROR: cannot delete '/mnt/btrfs_pool2/backup': Directory not empty

This is because backup has a lot of subvolumes due to btrfs send/receive
relationships.

Is it possible to recover there? Can you reparent subvolumes to a
different subvolume without doing a full copy via btrfs send/receive?
Thanks,
Marc

> BTRFS warning (device dm-5): failed to load free space cache for block group 6746013696000, rebuilding it now
> BTRFS warning (device dm-5): block group 6754603630592 has wrong amount of free space
> BTRFS warning (device dm-5): failed to load free space cache for block group 6754603630592, rebuilding it now
> BTRFS warning (device dm-5): block group 7125178777600 has wrong amount of free space
> BTRFS warning (device dm-5): failed to load free space cache for block group 7125178777600, rebuilding it now
> BTRFS error (device dm-5): bad tree block start 3981076597540270796 2899180224512
> BTRFS error (device dm-5): bad tree block start 942082474969670243 2899180224512
> BTRFS: error (device dm-5) in __btrfs_free_extent:6944: errno=-5 IO failure
> BTRFS info (device dm-5): forced readonly
> BTRFS: error (device dm-5) in btrfs_run_delayed_refs:2961: errno=-5 IO failure
> BUG: unable to handle kernel NULL pointer dereference at (null)
> IP: __del_reloc_root+0x3f/0xa6
> PGD 189a0e067
> PUD 189a0f067
> PMD 0
>
> Oops: [#1] PREEMPT SMP
> Modules linked in: veth ip6table_filter ip6_tables ebtable_nat ebtables ppdev lp xt_addrtype br_netfilter bridge stp llc tun autofs4 softdog binfmt_misc ftdi_sio nfsd auth_rpcgss nfs_acl nfs lockd grace fscache sunrpc ipt_REJECT nf_reject_ipv4 xt_conntrack xt_mark xt_nat xt_tcpudp nf_log_ipv4 nf_log_common xt_LOG iptable_mangle iptable_filter lm85 hwmon_vid pl2303 dm_snapshot dm_bufio iptable_nat ip_tables nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_conntrack_ftp ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_nat nf_conntrack x_tables sg st snd_pcm_oss snd_mixer_oss bcache kvm_intel kvm irqbypass snd_hda_codec_realtek snd_cmipci snd_hda_codec_generic snd_hda_intel snd_mpu401_uart snd_hda_codec snd_opl3_lib snd_rawmidi snd_hda_core snd_seq_device snd_hwdep eeepc_wmi snd_pcm asus_wmi rc_ati_x10 asix snd_timer ati_remote sparse_keymap usbnet rfkill snd hwmon soundcore rc_core evdev libphy tpm_infineon pcspkr i915 parport_pc i2c_i801 input_leds mei_me lpc_ich parport tpm_tis battery usbserial tpm_tis_core tpm wmi e1000e ptp pps_core fuse raid456 multipath mmc_block mmc_core lrw ablk_helper dm_crypt dm_mod async_raid6_recov async_pq async_xor async_memcpy async_tx crc32c_intel blowfish_x86_64 blowfish_common pcbc aesni_intel aes_x86_64 crypto_simd glue_helper cryptd xhci_pci ehci_pci sata_sil24
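For reference, the delete-and-promote rollback Marc performs above, as a
minimal sketch; the subvolume and snapshot names and the mount point are
taken from his example (-o limits the listing to subvolumes directly
below the given path):

  cd /mnt/btrfs_pool2
  btrfs subvolume list -o .                  # confirm what lives directly under the pool
  btrfs subvolume delete Soft                # drop the damaged working copy
  mv Soft_rw.20170430_01:50:22 Soft          # promote yesterday's rw snapshot to working copy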