Re: 4.11 relocate crash, null pointer + rolling back a filesystem by X hours?

2017-05-24 Thread Duncan
Marc MERLIN posted on Tue, 23 May 2017 09:58:47 -0700 as excerpted:

> That's a valid point, and in my case, I can back it up/restore, it just
> takes a bit of time, but most of the time is manually babysitting all
> those subvolumes that I need to recreate by hand with btrfs send/restore
> relationships, which all get lost during backup/restore.
> This is the most painful part.
> What's too big? I've only ever used a filesystem that fits on a raid
> of 4 data drives. That value has increased over time, but I don't have
> a crazy array of 20+ drives as a single filesystem, or anything.
> Since drives have gotten bigger, but not that much faster, I use bcache
> to make things more acceptable in speed.

What's too big?  That depends on your tolerance for pain, but given the 
scenario of subvolumes manually recreated by hand with send/receive, I'd 
probably try to break it down so that while there's the same number of 
snapshots to restore, the number of subvolumes the snapshots are taken 
against is limited.

My own rule of thumb is if it's taking so long that it's a barrier to 
doing it, I really need to either break things down further, or upgrade 
to faster storage.  The latter is why I'm actually looking at upgrading 
my media and second backup set, on spinning rust, to ssd.  Because while 
I used to do backups spinning rust to spinning rust of that size all the 
time, ssds have spoiled me, and now I dread doing the spinning rust 
backups... or restores.   Tho in my case the spinning rust is only a half-
TB, so a pair of half-TB to 1 TB ssds for an upgrade is still cost 
effective.  It's not like I'm going multi-TB, which would still be cost 
prohibitive on SSD, particularly since I want raid1, so doubling the 
number of SSDs.

Meanwhile, what I'd do with that raid of four drives (and /did/ do with 
my 4-drive raid back a few storage generations ago, when 300 GB spinning-
rust disks were still quite big, and what I do with my paired SSDs with 
btrfs now) is partition them up and do raids of partitions on each drive.
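
A rough sketch of that layout (device names, partition numbers, and 
labels here are purely hypothetical), with the working copy and the 
on-device primary backup as two separate btrfs raid1 filesystems across 
the same pair of drives:

mkfs.btrfs -m raid1 -d raid1 -L working /dev/sda2 /dev/sdb2   # working copy, raid1 across both drives
mkfs.btrfs -m raid1 -d raid1 -L backup1 /dev/sda3 /dev/sdb3   # primary backup, a separate filesystem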

One thing that's nice about that is that you can actually do a set of 
backups on a second set of partitions on the same physical devices, 
because the physical device redundancy of the raids covers loss of a 
device, and the separate partitions and raids (btrfs raid1 now) cover the 
fat-finger or simple loss of filesystem risk.  A second set of backups to 
separate devices can then be made just in case, and depending on the 
need, swapped out to off-premises or uploaded to the cloud or whatever, 
but you always have the primary backup at hand to boot to or mount if the 
working copy fails, by simply pointing to the backup partitions and 
filesystem instead of the normal working copy.  For root, I even have a 
grub menu item that switches to the backup copy, and for fstab, I have a 
set of stubs that are assembled via script into three copies of fstab 
that swap working and backup copies as necessary.  /etc/fstab itself is a 
symlink to the working-copy version, which I simply switch to point at 
the copy that mounts the backups as working when I'm booted on the 
backup.  Or I can mount the root filesystem for maintenance from the 
initramfs, and switch the fstab symlink from there, before exiting 
maintenance and booting the main system.
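
A minimal sketch of the fstab-stub idea, simplified to two variants 
(stub and output file names are hypothetical, and the stubs are assumed 
to live in /etc):

cat /etc/fstab.common /etc/fstab.work > /etc/fstab.main   # normal boot: working copies mounted
cat /etc/fstab.common /etc/fstab.back > /etc/fstab.bkup   # fallback: backups mounted as working
ln -sf fstab.main /etc/fstab                              # flip this symlink to choose which one boots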

I learned this "split it up" method the hard way back before mdraid had 
write-intent bitmaps, and I had only two much larger raids, working and 
backup, where if one device dropped out and I brought it back in, I had 
to wait way too long for the huge working raid to resync.  When I split 
things up by function into multiple raids, most of the time only some of 
them were active, and only one or two of the active ones would actually 
have been written to at the time and thus be out of sync, so resyncing 
them was fast, as they were much smaller than the larger full-system 
raids I had been using previously.

>> *BUT*, and here's the "go further" part, keep in mind that
>> subvolume-read-
>> only is a property, gettable and settable by btrfs property.
>> 
>> So you should be able to unset the read-only property of a subvolume or
>> snapshot, move it, then if desired, set it again.
>> 
>> Of course I wouldn't expect send -p to work with such a snapshot, but
>> send -c /might/ still work, I'm not actually sure but I'd consider it
>> worth trying.  (I'd try -p as well, but expect it to fail...)
> 
> That's an interesting point, thanks for making it.
> In that case, I did have to destroy and recreate the filesystem since
> btrfs check --repair was unable to fix it, but knowing how to reparent
> read only subvolumes may be handy in the future, thanks.

Hopefully you won't end up testing it any time soon, but if you do, 
please confirm whether my suspicions that send -p won't work after 
toggling and reparenting, but send -c still will, are correct.

(For those who read this out of thread context where I believe I already 
stated it, my own use-case involves neither snapshots nor subvolumes.)

Re: 4.11 relocate crash, null pointer + rolling back a filesystem by X hours?

2017-05-23 Thread Marc MERLIN
On Tue, May 02, 2017 at 05:01:02AM +0000, Duncan wrote:
> Marc MERLIN posted on Mon, 01 May 2017 20:23:46 -0700 as excerpted:
> 
> > Also, how is --mode=lowmem being useful?
> 
> FWIW, I just watched your talk that's linked from the wiki, and wondered 
> what you were doing these days as I hadn't seen any posts from you here 
> for a while.
 
First, sorry for the late reply. Because you didn't Cc me on the answer,
it went to a different folder and I only saw your replies now.
Off topic, but basically I'm not dead or anything; I have btrfs working
well enough not to mess with it further, because I have many other
hobbies :)  That is, until I put a new SAS card in my server, hit some
corruption bugs, and now I'm back to spending days fixing the system.

> Well, that you're asking that question confirms you've not been following 
> the list too closely...  Of course that's understandable as people have 
> other stuff to do, but just sayin'.

That's exactly right. I'm subscribed to way too many lists on way too
many topics to be up to date with all, sadly :(

> Of course on-list I'm somewhat known for my arguments propounding the 
> notion that any filesystem that's too big to be practically maintained 
> (including time necessary to restore from backups, should that be 
> necessary for whatever reason) is... too big... and should ideally be 
> broken along logical and functional boundaries into a number of 
> individual smaller filesystems until such point as each one is found to 
> be practically maintainable within a reasonably practical time frame.  
> Don't put all the eggs in one basket, and when the bottom of one of those 
> baskets inevitably falls out, most of your eggs will be safe in other 
> baskets. =:^)
 
That's a valid point, and in my case, I can back it up/restore, it just
takes a bit of time, but most of the time is manually babysitting all
those subvolumes that I need to recreate by hand with btrfs send/restore
relationships, which all get lost during backup/restore.
This is the most painful part.
What's too big? I've only ever used a filesystem that fits on a raid
of 4 data drives. That value has increased over time, but I don't have
a crazy array of 20+ drives as a single filesystem, or anything.
Since drives have gotten bigger, but not that much faster, I use bcache
to make things more acceptable in speed.

> *BUT*, and here's the "go further" part, keep in mind that subvolume-read-
> only is a property, gettable and settable by btrfs property.
> 
> So you should be able to unset the read-only property of a subvolume or 
> snapshot, move it, then if desired, set it again.
> 
> Of course I wouldn't expect send -p to work with such a snapshot, but 
> send -c /might/ still work, I'm not actually sure but I'd consider it 
> worth trying.  (I'd try -p as well, but expect it to fail...)

That's an interesting point, thanks for making it.
In that case, I did have to destroy and recreate the filesystem since
btrfs check --repair was unable to fix it, but knowing how to reparent
read only subvolumes may be handy in the future, thanks.

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems 
   what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 1024R/763BE901


Re: 4.11 relocate crash, null pointer + rolling back a filesystem by X hours?

2017-05-17 Thread Kai Krakow
On Fri, 5 May 2017 08:43:23 -0700, Marc MERLIN wrote:

[missing quote of the command]
> > Corrupted blocks are corrupted, that command is just trying to
> > corrupt it again.
> > It won't do the black magic to adjust tree blocks to avoid them.  
>  
> I see. You may have seen the earlier message from Kai Krakow who was
> able to recover his FS by trying this trick, but I understand it
> can't work in all cases.

Huh, what trick? I don't take credit for it... ;-)

The corrupt-block trick must've been someone else...


-- 
Regards,
Kai

Replies to list-only preferred.


Re: 4.11 relocate crash, null pointer + rolling back a filesystem by X hours?

2017-05-05 Thread Marc MERLIN
Thanks again for your answer. Obviously even if my filesystem is toast,
it's useful to learn from what happened.

On Fri, May 05, 2017 at 01:03:02PM +0800, Qu Wenruo wrote:
> > > So unfortunately, your fs/subvolume trees are also corrupted.
> > > And almost no chance to do a graceful recovery.
> > So I'm confused here. You're saying my metadata is not corrupted (and in
> > my case, I have DUP, so I should have 2 copies),
> 
> Nope, here I'm all talking about metadata (tree blocks).
> The difference is the owner, either the extent tree or an fs/subvolume tree.
 
I see. I didn't realize that my filesystem managed to corrupt both
copies of its metadata.

> The fsck doesn't check data blocks.

Right, that's what scrub does, fair enough.

> The problem is, tree blocks (metadata) that refer to these data blocks are
> corrupted.
> 
> And they are corrupted in such a way that both extent tree (tree contains
> extent allocation info) and fs tree (tree contains real fs info, like inode
> and data location) are corrupted.
> 
> So graceful recovery is not possible now.

I see, thanks for explaining.

> Unfortunately, no; even though you have 2 copies, a lot of tree blocks are
> corrupted such that neither copy matches its checksum.
 
Thanks for confirming. I guess if I'm having corruption due to a bad
card, it makes sense that both copies get written one after the other and
both got corrupted for the same reason.

> Corrupted blocks are corrupted, that command is just trying to corrupt it
> again.
> It won't do the black magic to adjust tree blocks to avoid them.
 
I see. You may have seen the earlier message from Kai Krakow who was
able to recover his FS by trying this trick, but I understand it
can't work in all cases.

Thanks again for your answers.
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems 
   what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 1024R/763BE901


Re: 4.11 relocate crash, null pointer + rolling back a filesystem by X hours?

2017-05-04 Thread Qu Wenruo



At 05/05/2017 10:40 AM, Marc MERLIN wrote:

On Fri, May 05, 2017 at 09:19:29AM +0800, Qu Wenruo wrote:

Sorry for not noticing the link.
  
no problem, it was only one line amongst many :)

Thanks much for having had a look.


[Conclusion]
After checking the full result, some of fs/subvolume trees are corrupted.

[Details]
Some example here:

---
ref mismatch on [6674127745024 32768] extent item 0, found 1
Backref 6674127745024 parent 7566652473344 owner 0 offset 0 num_refs 0 not
found in extent tree
Incorrect local backref count on 6674127745024 parent 7566652473344 owner 0
offset 0 found 1 wanted 0 back 0x5648afda0f20
backpointer mismatch on [6674127745024 32768]
---

The extent at 6674127745024 seems to be a *DATA* extent.
While the current default nodesize is 16K and the ancient default nodesize was 4K.

Unless you specified -n 32K at mkfs time, it's a DATA extent.


I did not, so you must be right about DATA, which should be good, right?
I don't mind losing data as long as the underlying metadata is correct.

I should have given more data on the FS:

gargamel:/var/local/src/btrfs-progs# btrfs fi df /mnt/btrfs_pool2/
Data, single: total=6.28TiB, used=6.12TiB
System, DUP: total=32.00MiB, used=720.00KiB
Metadata, DUP: total=97.00GiB, used=94.39GiB


Tons of metadata since the fs is so large.


GlobalReserve, single: total=512.00MiB, used=0.00B

gargamel:/var/local/src/btrfs-progs# btrfs fi usage /mnt/btrfs_pool2
Overall:
 Device size:   7.28TiB
 Device allocated:  6.47TiB
 Device unallocated:  824.48GiB
 Device missing:  0.00B
 Used:  6.30TiB
 Free (estimated):994.45GiB  (min: 582.21GiB)
 Data ratio:   1.00
 Metadata ratio:   2.00
 Global reserve:  512.00MiB  (used: 0.00B)

Data,single: Size:6.28TiB, Used:6.12TiB
/dev/mapper/dshelf2 6.28TiB

Metadata,DUP: Size:97.00GiB, Used:94.39GiB
/dev/mapper/dshelf2   194.00GiB

System,DUP: Size:32.00MiB, Used:720.00KiB
/dev/mapper/dshelf2    64.00MiB

Unallocated:
/dev/mapper/dshelf2   824.48GiB



Furthermore, it's a shared data backref; it's using its parent tree block
to do the backref walk.

And its parent tree block is 7566652473344.
While such a bytenr can't be found anywhere (including the csum error output),
that's to say we either can't find that tree block or can't reach the tree
root for it.

Considering it's a data extent, its owner is either the root tree or an fs/subvolume tree.


Such cases are everywhere, as I found other extents sized from 4K to 44K, so
I'm pretty sure some fs/subvolume tree must be corrupted.
(Data extents in the root tree are seldom 4K sized.)

So unfortunately, your fs/subvolume trees are also corrupted.
And almost no chance to do a graceful recovery.
  
So I'm confused here. You're saying my metadata is not corrupted (and in
my case, I have DUP, so I should have 2 copies),


Nope, here I'm all talking about metadata (tree blocks).
The difference is the owner, either the extent tree or an fs/subvolume tree.

The fsck doesn't check data blocks.


but with data blocks
(which are not duped) corrupted, it's also possible to lose the
filesystem in a way that it can't be taken back to a clean state, even
by deleting some corrupted data?


No, it can't be repaired by deleting data.

The problem is, tree blocks (metadata) that refer to these data blocks are 
corrupted.


And they are corrupted in such a way that both extent tree (tree 
contains extent allocation info) and fs tree (tree contains real fs 
info, like inode and data location) are corrupted.


So graceful recovery is not possible now.




[Alternatives]
I would recommend using "btrfs restore -f " to restore the specified
subvolume.
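
For reference, a hedged sketch of what such a restore could look like 
(the bytenr is a placeholder taken from the listing step, and the target 
path is arbitrary):

btrfs restore -l /dev/mapper/dshelf2                                       # list tree roots to pick from
btrfs restore -f <root-bytenr> -v /dev/mapper/dshelf2 /mnt/restore-target  # restore files under that root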


I don't need to restore data, the data is a backup. It will just take
many days to recreate (plus many hours of typing from me because the
backup updates are automated, but recreating everything is not
automated).

So if I understand correctly, my metadata is fine (and I guess I have 2
copies, so it would have been unlucky to get both copies corrupted), but
enough data blocks got corrupted that btrfs cannot recover, even by
deleting the corrupted data blocks. Correct?


Unfortunately, no; even though you have 2 copies, a lot of tree blocks are 
corrupted such that neither copy matches its checksum.


Just like the following tree block, both copies have the wrong checksum.
---
checksum verify failed on 2899180224512 found ABBE39B0 wanted E0735D0E
checksum verify failed on 2899180224512 found 7A6D427F wanted 7E899EE5
---



And is it not possible to clear the corrupted blocks like this?
./btrfs-corrupt-block -l  2899180224512 /dev/mapper/dshelf2
and just accept the lost data but get btrfs check repair to deal with
the deleted blocks and bring the rest back to a clean state?

No, that won't help.


Corrupted blocks are corrupted, that command is just trying to corrupt 
it again.

It won't do the black magic to adjust tree blocks to avoid them.

Re: 4.11 relocate crash, null pointer + rolling back a filesystem by X hours?

2017-05-04 Thread Marc MERLIN
On Fri, May 05, 2017 at 09:19:29AM +0800, Qu Wenruo wrote:
> Sorry for not noticing the link.
 
no problem, it was only one line amongst many :)
Thanks much for having had a look.

> [Conclusion]
> After checking the full result, some of fs/subvolume trees are corrupted.
> 
> [Details]
> Some example here:
> 
> ---
> ref mismatch on [6674127745024 32768] extent item 0, found 1
> Backref 6674127745024 parent 7566652473344 owner 0 offset 0 num_refs 0 not
> found in extent tree
> Incorrect local backref count on 6674127745024 parent 7566652473344 owner 0
> offset 0 found 1 wanted 0 back 0x5648afda0f20
> backpointer mismatch on [6674127745024 32768]
> ---
> 
> The extent at 6674127745024 seems to be a *DATA* extent.
> While the current default nodesize is 16K and the ancient default nodesize was 4K.
> 
> Unless you specified -n 32K at mkfs time, it's a DATA extent.

I did not, so you must be right about DATA, which should be good, right?
I don't mind losing data as long as the underlying metadata is correct.

I should have given more data on the FS:

gargamel:/var/local/src/btrfs-progs# btrfs fi df /mnt/btrfs_pool2/
Data, single: total=6.28TiB, used=6.12TiB
System, DUP: total=32.00MiB, used=720.00KiB
Metadata, DUP: total=97.00GiB, used=94.39GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

gargamel:/var/local/src/btrfs-progs# btrfs fi usage /mnt/btrfs_pool2
Overall:
Device size:   7.28TiB
Device allocated:  6.47TiB
Device unallocated:  824.48GiB
Device missing:  0.00B
Used:  6.30TiB
Free (estimated):994.45GiB  (min: 582.21GiB)
Data ratio:   1.00
Metadata ratio:   2.00
Global reserve:  512.00MiB  (used: 0.00B)

Data,single: Size:6.28TiB, Used:6.12TiB
   /dev/mapper/dshelf2 6.28TiB

Metadata,DUP: Size:97.00GiB, Used:94.39GiB
   /dev/mapper/dshelf2   194.00GiB

System,DUP: Size:32.00MiB, Used:720.00KiB
   /dev/mapper/dshelf2    64.00MiB

Unallocated:
   /dev/mapper/dshelf2   824.48GiB


> Furthermore, it's a shared data backref; it's using its parent tree block
> to do the backref walk.
> 
> And its parent tree block is 7566652473344.
> While such a bytenr can't be found anywhere (including the csum error output),
> that's to say we either can't find that tree block or can't reach the tree
> root for it.
> 
> Considering it's a data extent, its owner is either the root tree or an fs/subvolume tree.
> 
> 
> Such cases are everywhere, as I found other extents sized from 4K to 44K, so
> I'm pretty sure some fs/subvolume tree must be corrupted.
> (Data extents in the root tree are seldom 4K sized.)
> 
> So unfortunately, your fs/subvolume trees are also corrupted.
> And almost no chance to do a graceful recovery.
 
So I'm confused here. You're saying my metadata is not corrupted (and in
my case, I have DUP, so I should have 2 copies), but with data blocks
(which are not duped) corrupted, it's also possible to lose the
filesystem in a way that it can't be taken back to a clean state, even
by deleting some corrupted data?

> [Alternatives]
> I would recommend using "btrfs restore -f " to restore the specified
> subvolume.

I don't need to restore data, the data is a backup. It will just take
many days to recreate (plus many hours of typing from me because the
> backup updates are automated, but recreating everything is not
> automated).

So if I understand correctly, my metadata is fine (and I guess I have 2
copies, so it would have been unlucky to get both copies corrupted), but
enough data blocks got corrupted that btrfs cannot recover, even by
deleting the corrupted data blocks. Correct?

And is it not possible to clear the corrupted blocks like this?
./btrfs-corrupt-block -l  2899180224512 /dev/mapper/dshelf2
and just accept the lost data but get btrfs check repair to deal with
the deleted blocks and bring the rest back to a clean state?

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems 
   what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 1024R/763BE901


Re: 4.11 relocate crash, null pointer + rolling back a filesystem by X hours?

2017-05-04 Thread Qu Wenruo



At 05/05/2017 09:19 AM, Qu Wenruo wrote:



At 05/02/2017 11:23 AM, Marc MERLIN wrote:

Hi Chris,

Thanks for the reply, much appreciated.

On Mon, May 01, 2017 at 07:50:22PM -0600, Chris Murphy wrote:
What about btrfs check (no repair), without and then also with 
--mode=lowmem?


In theory I like the idea of a 24 hour rollback; but in normal usage
Btrfs will eventually free up space containing stale and no longer
necessary metadata. Like the chunk tree, it's always changing, so you
get to a point, even with snapshots, that the old state of that tree
is just - gone. A snapshot of an fs tree does not make the chunk tree
frozen in time.

Right, of course, I was being way over optimistic here. I kind of forgot
that metadata wasn't COW, my bad.


In any case, it's a big problem in my mind if no existing tools can
fix a file system of this size. So before making any more changes, make
sure you have a btrfs-image somewhere, even if it's huge. The offline
checker needs to be able to repair it, right now it's all we have for
such a case.


The image will be huge, and take maybe 24H to make (last time it took
some silly amount of time like that), and honestly I'm not sure how
useful it'll be.
Outside of the kernel crashing if I do a btrfs balance, and hopefully
the crash report I gave is good enough, the state I'm in is not btrfs'
fault.

If I can't roll back to a reasonably working state, with data loss of a
known quantity that I can recover from backup, I'll have to destroy the
filesystem and recover from scratch, which will take multiple days.
Since I can't wait too long before getting back to a working state, I
think I'm going to try btrfs check --repair after a scrub to get a list
of all the pathnames/inodes that are known to be damaged, and work from
there.
Sounds reasonable?

Also, how is --mode=lowmem being useful?

And for re-parenting a sub-subvolume, is that possible?
(I want to delete /sub1/ but I can't because I have /sub1/sub2 that's
also a subvolume, and I'm not sure how to re-parent sub2 to somewhere
else so that I can subvolume delete sub1)

In the meantime, a simple check without repair looks like this. It will
likely take many hours to complete:
gargamel:/var/local/space# btrfs check /dev/mapper/dshelf2
Checking filesystem on /dev/mapper/dshelf2
UUID: 03e9a50c-1ae6-4782-ab9c-5f310a98e653
checking extents
checksum verify failed on 3096461459456 found 0E6B7980 wanted FBE5477A
checksum verify failed on 3096461459456 found 0E6B7980 wanted FBE5477A
checksum verify failed on 2899180224512 found 7A6D427F wanted 7E899EE5
checksum verify failed on 2899180224512 found 7A6D427F wanted 7E899EE5
checksum verify failed on 2899180224512 found ABBE39B0 wanted E0735D0E
checksum verify failed on 2899180224512 found 7A6D427F wanted 7E899EE5
bytenr mismatch, want=2899180224512, have=3981076597540270796
checksum verify failed on 1449488023552 found CECC36AF wanted 199FE6C5
checksum verify failed on 1449488023552 found CECC36AF wanted 199FE6C5
checksum verify failed on 1449544613888 found 895D691B wanted A0C64D2B
checksum verify failed on 1449544613888 found 895D691B wanted A0C64D2B
parent transid verify failed on 1671538819072 wanted 293964 found 293902
parent transid verify failed on 1671538819072 wanted 293964 found 293902
checksum verify failed on 1671603781632 found 18BC28D6 wanted 372655A0
checksum verify failed on 1671603781632 found 18BC28D6 wanted 372655A0
checksum verify failed on 1759425052672 found 843B59F1 wanted F0FF7D00
checksum verify failed on 1759425052672 found 843B59F1 wanted F0FF7D00
checksum verify failed on 2182657212416 found CD8EFC0C wanted 70847071
checksum verify failed on 2182657212416 found CD8EFC0C wanted 70847071
checksum verify failed on 2898779357184 found 96395131 wanted 433D6E09
checksum verify failed on 2898779357184 found 96395131 wanted 433D6E09
checksum verify failed on 2899180224512 found 7A6D427F wanted 7E899EE5
checksum verify failed on 2899180224512 found 7A6D427F wanted 7E899EE5
checksum verify failed on 2899180224512 found ABBE39B0 wanted E0735D0E
checksum verify failed on 2899180224512 found 7A6D427F wanted 7E899EE5
bytenr mismatch, want=2899180224512, have=3981076597540270796
checksum verify failed on 2182657212416 found CD8EFC0C wanted 70847071
checksum verify failed on 2182657212416 found CD8EFC0C wanted 70847071
checksum verify failed on 2182657212416 found CD8EFC0C wanted 70847071
checksum verify failed on 2182657212416 found CD8EFC0C wanted 70847071
checksum verify failed on 2182657212416 found CD8EFC0C wanted 70847071
(...)


Full output please.


Sorry for not noticing the link.

[Conclusion]
After checking the full result, some of fs/subvolume trees are corrupted.

[Details]
Some example here:

---
ref mismatch on [6674127745024 32768] extent item 0, found 1
Backref 6674127745024 parent 7566652473344 owner 0 offset 0 num_refs 0 
not found in extent tree
Incorrect local backref count on 6674127745024 parent 7566652473344 
owner 0 offset 0 found 1 wanted 0 back 0x5648afda0f20

Re: 4.11 relocate crash, null pointer + rolling back a filesystem by X hours?

2017-05-04 Thread Qu Wenruo



At 05/02/2017 11:23 AM, Marc MERLIN wrote:

Hi Chris,

Thanks for the reply, much appreciated.

On Mon, May 01, 2017 at 07:50:22PM -0600, Chris Murphy wrote:

What about btrfs check (no repair), without and then also with --mode=lowmem?

In theory I like the idea of a 24 hour rollback; but in normal usage
Btrfs will eventually free up space containing stale and no longer
necessary metadata. Like the chunk tree, it's always changing, so you
get to a point, even with snapshots, that the old state of that tree
is just - gone. A snapshot of an fs tree does not make the chunk tree
frozen in time.
  
Right, of course, I was being way over optimistic here. I kind of forgot
that metadata wasn't COW, my bad.


In any case, it's a big problem in my mind if no existing tools can
fix a file system of this size. So before making any more changes, make
sure you have a btrfs-image somewhere, even if it's huge. The offline
checker needs to be able to repair it, right now it's all we have for
such a case.


The image will be huge, and take maybe 24H to make (last time it took
some silly amount of time like that), and honestly I'm not sure how
useful it'll be.
Outside of the kernel crashing if I do a btrfs balance, and hopefully
the crash report I gave is good enough, the state I'm in is not btrfs'
fault.

If I can't roll back to a reasonably working state, with data loss of a
known quantity that I can recover from backup, I'll have to destroy the
filesystem and recover from scratch, which will take multiple days.
Since I can't wait too long before getting back to a working state, I
think I'm going to try btrfs check --repair after a scrub to get a list
of all the pathnames/inodes that are known to be damaged, and work from
there.
Sounds reasonable?

Also, how is --mode=lowmem being useful?

And for re-parenting a sub-subvolume, is that possible?
(I want to delete /sub1/ but I can't because I have /sub1/sub2 that's also
a subvolume, and I'm not sure how to re-parent sub2 to somewhere else so
that I can subvolume delete sub1)

In the meantime, a simple check without repair looks like this. It will
likely take many hours to complete:
gargamel:/var/local/space# btrfs check /dev/mapper/dshelf2
Checking filesystem on /dev/mapper/dshelf2
UUID: 03e9a50c-1ae6-4782-ab9c-5f310a98e653
checking extents
checksum verify failed on 3096461459456 found 0E6B7980 wanted FBE5477A
checksum verify failed on 3096461459456 found 0E6B7980 wanted FBE5477A
checksum verify failed on 2899180224512 found 7A6D427F wanted 7E899EE5
checksum verify failed on 2899180224512 found 7A6D427F wanted 7E899EE5
checksum verify failed on 2899180224512 found ABBE39B0 wanted E0735D0E
checksum verify failed on 2899180224512 found 7A6D427F wanted 7E899EE5
bytenr mismatch, want=2899180224512, have=3981076597540270796
checksum verify failed on 1449488023552 found CECC36AF wanted 199FE6C5
checksum verify failed on 1449488023552 found CECC36AF wanted 199FE6C5
checksum verify failed on 1449544613888 found 895D691B wanted A0C64D2B
checksum verify failed on 1449544613888 found 895D691B wanted A0C64D2B
parent transid verify failed on 1671538819072 wanted 293964 found 293902
parent transid verify failed on 1671538819072 wanted 293964 found 293902
checksum verify failed on 1671603781632 found 18BC28D6 wanted 372655A0
checksum verify failed on 1671603781632 found 18BC28D6 wanted 372655A0
checksum verify failed on 1759425052672 found 843B59F1 wanted F0FF7D00
checksum verify failed on 1759425052672 found 843B59F1 wanted F0FF7D00
checksum verify failed on 2182657212416 found CD8EFC0C wanted 70847071
checksum verify failed on 2182657212416 found CD8EFC0C wanted 70847071
checksum verify failed on 2898779357184 found 96395131 wanted 433D6E09
checksum verify failed on 2898779357184 found 96395131 wanted 433D6E09
checksum verify failed on 2899180224512 found 7A6D427F wanted 7E899EE5
checksum verify failed on 2899180224512 found 7A6D427F wanted 7E899EE5
checksum verify failed on 2899180224512 found ABBE39B0 wanted E0735D0E
checksum verify failed on 2899180224512 found 7A6D427F wanted 7E899EE5
bytenr mismatch, want=2899180224512, have=3981076597540270796
checksum verify failed on 2182657212416 found CD8EFC0C wanted 70847071
checksum verify failed on 2182657212416 found CD8EFC0C wanted 70847071
checksum verify failed on 2182657212416 found CD8EFC0C wanted 70847071
checksum verify failed on 2182657212416 found CD8EFC0C wanted 70847071
checksum verify failed on 2182657212416 found CD8EFC0C wanted 70847071
(...)


Full output please.

I know it will be long, but the point here is, the full output could help
us to at least locate where most of the corruption is.


If most of the corruption is only in the extent tree, the chance to
recover increases hugely.


As the extent tree is just backrefs for all allocated extents, it's not
really important if recovery (read) is the primary goal.


But if other trees (fs or subvolume trees important to you) also got
corrupted, I'm afraid your last chance will be

Re: 4.11 relocate crash, null pointer + rolling back a filesystem by X hours?

2017-05-04 Thread Qu Wenruo



At 05/02/2017 02:08 AM, Marc MERLIN wrote:

So, I forgot to mention that it's my main media and backup server that got
corrupted. Yes, I do actually have a backup of a backup server, but it's
going to take days to recover due to the amount of data to copy back, not
counting lots of manual typing due to the number of subvolumes, btrfs
send/receive relationships and so forth.

Really, I should be able to roll back all writes from the last 24H, run a
check --repair/scrub on top just to be sure, and be back on track.

In the meantime, the good news is that the filesystem doesn't crash the
kernel (the posted crash below) now that I was able to cancel the btrfs 
balance,
but it goes read only at the drop of a hat, even when I'm trying to delete
recent snapshots and all data that was potentially written in the last 24H

On Mon, May 01, 2017 at 10:06:41AM -0700, Marc MERLIN wrote:

I have a filesystem that sadly got corrupted by a SAS card I just installed 
yesterday.

I don't think in a case like this there is a way to roll back all
writes across all subvolumes in the last 24H, correct?


Sorry for the late reply.
I thought the case was already finished, as I saw little chance to recover. :(

No, there is no way to roll back unless you're completely sure that only
one transaction commit happened in the last 24H.

(Well, not really possible in the real world.)

Btrfs is only capable of rolling back to the *previous* commit.
That's ensured by forced metadata CoW.

But beyond the previous commit, only god knows.
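
A hedged illustration of the limited rollback that does exist (assuming 
a kernel new enough for the usebackuproot mount option; older kernels 
called it -o recovery):

mount -o ro,usebackuproot /dev/mapper/dshelf2 /mnt/recovery   # try the backup tree roots in the superblock
btrfs-find-root /dev/mapper/dshelf2                           # hunt for older tree roots that may still be intact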

If all metadata CoW writes are done in places never used by any 
previous metadata, then there is a chance to recover.


But mostly the possibility is very low; some mount options like ssd will 
change the extent allocator behavior to improve the odds, but it still 
needs a lot of luck.


A more detailed comment will follow in my reply to the btrfs check mail.

Thanks,
Qu



Is the best thing to go in each subvolume, delete the recent snapshots and
rename the one from 24H as the current one?
  
Well, just like I expected, it's a pain in the rear and this can't even help
fix the top level mountpoint which doesn't have snapshots, so I can't roll
it back.
btrfs should really have an easy way to roll back X hours, or days, to
recover from garbage written after a known good point, given that it is COW
after all.

Is there a way do this with check --repair maybe?

In the meantime, I got stuck while trying to delete snapshots:

Let's say I have this:
ID 428 gen 294021 top level 5 path backup
ID 2023 gen 294021 top level 5 path Soft
ID 3021 gen 294051 top level 428 path backup/debian32
ID 4400 gen 294018 top level 428 path backup/debian64
ID 4930 gen 294019 top level 428 path backup/ubuntu

I can easily
Delete subvolume (no-commit): '/mnt/btrfs_pool2/Soft'
and then:
gargamel:/mnt/btrfs_pool2# mv Soft_rw.20170430_01:50:22 Soft

But I can't delete backup, which actually is mostly only a directory
containing other things (in hindsight I shouldn't have made that a
subvolume)
Delete subvolume (no-commit): '/mnt/btrfs_pool2/backup'
ERROR: cannot delete '/mnt/btrfs_pool2/backup': Directory not empty

This is because backup has a lot of subvolumes due to btrfs send/receive
relationships.

Is it possible to recover there? Can you reparent subvolumes to a different
subvolume without doing a full copy via btrfs send/receive?

Thanks,
Marc


BTRFS warning (device dm-5): failed to load free space cache for block group 
6746013696000, rebuilding it now
BTRFS warning (device dm-5): block group 6754603630592 has wrong amount of free 
space
BTRFS warning (device dm-5): failed to load free space cache for block group 
6754603630592, rebuilding it now
BTRFS warning (device dm-5): block group 7125178777600 has wrong amount of free 
space
BTRFS warning (device dm-5): failed to load free space cache for block group 
7125178777600, rebuilding it now
BTRFS error (device dm-5): bad tree block start 3981076597540270796 
2899180224512
BTRFS error (device dm-5): bad tree block start 942082474969670243 2899180224512
BTRFS: error (device dm-5) in __btrfs_free_extent:6944: errno=-5 IO failure
BTRFS info (device dm-5): forced readonly
BTRFS: error (device dm-5) in btrfs_run_delayed_refs:2961: errno=-5 IO failure
BUG: unable to handle kernel NULL pointer dereference at   (null)
IP: __del_reloc_root+0x3f/0xa6
PGD 189a0e067
PUD 189a0f067
PMD 0

Oops:  [#1] PREEMPT SMP
Modules linked in: veth ip6table_filter ip6_tables ebtable_nat ebtables ppdev 
lp xt_addrtype br_netfilter bridge stp llc tun autofs4 softdog binfmt_misc 
ftdi_sio nfsd auth_rpcgss nfs_acl nfs lockd grace fscache sunrpc ipt_REJECT 
nf_reject_ipv4 xt_conntrack xt_mark xt_nat xt_tcpudp nf_log_ipv4 nf_log_common 
xt_LOG iptable_mangle iptable_filter lm85 hwmon_vid pl2303 dm_snapshot dm_bufio 
iptable_nat ip_tables nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 
nf_conntrack_ftp ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_nat nf_conntrack 
x_tables sg st snd_pcm_oss snd_mixer_oss bcache kvm_intel kvm 

Re: 4.11 relocate crash, null pointer + rolling back a filesystem by X hours?

2017-05-02 Thread Kai Krakow
On Mon, 1 May 2017 22:56:06 -0600, Chris Murphy wrote:

> On Mon, May 1, 2017 at 9:23 PM, Marc MERLIN  wrote:
> > Hi Chris,
> >
> > Thanks for the reply, much appreciated.
> >
> > On Mon, May 01, 2017 at 07:50:22PM -0600, Chris Murphy wrote:  
> >> What about btrfs check (no repair), without and then also with
> >> --mode=lowmem?
> >>
> >> In theory I like the idea of a 24 hour rollback; but in normal
> >> usage Btrfs will eventually free up space containing stale and no
> >> longer necessary metadata. Like the chunk tree, it's always
> >> changing, so you get to a point, even with snapshots, that the old
> >> state of that tree is just - gone. A snapshot of an fs tree does
> >> not make the chunk tree frozen in time.  
> >
> > Right, of course, I was being way over optimistic here. I kind of
> > forgot that metadata wasn't COW, my bad.  
> 
> Well it is COW. But there's more to the file system than fs trees, and
> just because an fs tree gets snapshot doesn't mean all data is
> snapshot. So whether snapshot or not, there's metadata that becomes
> obsolete as the file system is updated and those areas get freed up
> and eventually overwritten.
> 
> 
> >  
> >> In any case, it's a big problem in my mind if no existing tools can
> >> fix a file system of this size. So before making any more changes,
> >> make sure you have a btrfs-image somewhere, even if it's huge. The
> >> offline checker needs to be able to repair it, right now it's all
> >> we have for such a case.  
> >
> > The image will be huge, and take maybe 24H to make (last time it
> > took some silly amount of time like that), and honestly I'm not
> > sure how useful it'll be.
> > Outside of the kernel crashing if I do a btrfs balance, and
> > hopefully the crash report I gave is good enough, the state I'm in
> > is not btrfs' fault.
> >
> > If I can't roll back to a reasonably working state, with data loss
> > of a known quantity that I can recover from backup, I'll have to
> > destroy the filesystem and recover from scratch, which will take
> > multiple days. Since I can't wait too long before getting back to a
> > working state, I think I'm going to try btrfs check --repair after
> > a scrub to get a list of all the pathnames/inodes that are known to
> > be damaged, and work from there.
> > Sounds reasonable?  
> 
> Yes.
> 
> 
> >
> > Also, how is --mode=lowmem being useful?  
> 
> Testing. lowmem is a different implementation, so it might find
> different things from the regular check.
> 
> 
> >
> > And for re-parenting a sub-subvolume, is that possible?
> > (I want to delete /sub1/ but I can't because I have /sub1/sub2
> > that's also a subvolume and I'm not sure how to re-parent sub2 to
> > somewhere else so that I can subvolume delete sub1)  
> 
> Well you can move sub2 out of sub1 just like a directory and then
> delete sub1. If it's read-only it can't be moved, but you can use
> btrfs property get/set ro true/false to temporarily make it not
> read-only, move it, then make it read-only again, and it's still fine
> to use with btrfs send receive.
> 
> 
> 
> 
> 
> >
> > In the meantime, a simple check without repair looks like this. It
> > will likely take many hours to complete:
> > gargamel:/var/local/space# btrfs check /dev/mapper/dshelf2
> > Checking filesystem on /dev/mapper/dshelf2
> > UUID: 03e9a50c-1ae6-4782-ab9c-5f310a98e653
> > checking extents
> > checksum verify failed on 3096461459456 found 0E6B7980 wanted
> > FBE5477A checksum verify failed on 3096461459456 found 0E6B7980
> > wanted FBE5477A checksum verify failed on 2899180224512 found
> > 7A6D427F wanted 7E899EE5 checksum verify failed on 2899180224512
> > found 7A6D427F wanted 7E899EE5 checksum verify failed on
> > 2899180224512 found ABBE39B0 wanted E0735D0E checksum verify failed
> > on 2899180224512 found 7A6D427F wanted 7E899EE5 bytenr mismatch,
> > want=2899180224512, have=3981076597540270796 checksum verify failed
> > on 1449488023552 found CECC36AF wanted 199FE6C5 checksum verify
> > failed on 1449488023552 found CECC36AF wanted 199FE6C5 checksum
> > verify failed on 1449544613888 found 895D691B wanted A0C64D2B
> > checksum verify failed on 1449544613888 found 895D691B wanted
> > A0C64D2B parent transid verify failed on 1671538819072 wanted
> > 293964 found 293902 parent transid verify failed on 1671538819072
> > wanted 293964 found 293902 checksum verify failed on 1671603781632
> > found 18BC28D6 wanted 372655A0 checksum verify failed on
> > 1671603781632 found 18BC28D6 wanted 372655A0 checksum verify failed
> > on 1759425052672 found 843B59F1 wanted F0FF7D00 checksum verify
> > failed on 1759425052672 found 843B59F1 wanted F0FF7D00 checksum
> > verify failed on 2182657212416 found CD8EFC0C wanted 70847071
> > checksum verify failed on 2182657212416 found CD8EFC0C wanted
> > 70847071 checksum verify failed on 2898779357184 found 96395131
> > wanted 433D6E09 checksum verify failed on 2898779357184 found
> > 96395131 wanted 

Re: 4.11 relocate crash, null pointer + rolling back a filesystem by X hours?

2017-05-02 Thread Kai Krakow
On Tue, 2 May 2017 05:01:02 +0000 (UTC), Duncan <1i5t5.dun...@cox.net> wrote:

> Of course on-list I'm somewhat known for my arguments propounding the 
> notion that any filesystem that's too big to be practically
> maintained (including time necessary to restore from backups, should
> that be necessary for whatever reason) is... too big... and should
> ideally be broken along logical and functional boundaries into a
> number of individual smaller filesystems until such point as each one
> is found to be practically maintainable within a reasonably practical
> time frame. Don't put all the eggs in one basket, and when the bottom
> of one of those baskets inevitably falls out, most of your eggs will
> be safe in other baskets. =:^)

Hehe... Yes, you're a fan of small filesystems. I'm more from the
opposite camp, preferring one big filesystem to not mess around with
size constraints of small filesystems fighting for the same volume
space. It also gives such filesystems better chances for data locality,
instead of data being spread across totally different parts of the device
for your various fs mounts, and can reduce head movement. Of course, much
of this is not true if you
use different devices per filesystem, or use SSDs, or SAN where you
have no real control over the physical placement of image stripes
anyway. But well...

In an ideal world, subvolumes of btrfs would be totally independent of
each other, just only share the same volume and dynamically allocating
chunks of space from it. If one is broken, it is simply not usable and
it should be destroyable. A garbage collector would grab the leftover
chunks from the subvolume and free them, and you could recreate this
subvolume from backup. In reality, shared extents will cross subvolume
borders so it is probably not how things could work anytime in the near
or far future.

This idea is more like having thinly provisioned LVM volumes which
allocate space as the filesystems on top need them, much like doing
thinly provisioned images with a VM host system. The problem here is,
unlike subvolumes, those chunks of space could never be given back to
the host as it doesn't know if it is still in use. Of course, there are
implementations available which allow thinning the images by passing
through TRIM from the guest to the host (or by other means of
communication channels between host and guest), but that usually doesn't
give good performance, if it's even supported.

I tried once to exploit this in VirtualBox and hoped it would translate
guest discards into hole punching requests on the host, and it's even
documented to work that way... But (a) it was horrible slow, and (b) it
was incredibly unstable to the point of being useless. OTOH, it's not
announced as a stable feature and has to be enabled by manually editing
the XML config files.

But I still like the idea: Is it possible to make btrfs still work if
one subvolume gets corrupted? Of course it should have ways of telling
the user which other subvolumes are interconnected through shared
extents so those would be also discarded upon corruption cleanup - at
least if those extents couldn't be made any sense of any longer. Since
corruption is an issue mostly of subvolumes being written to, snapshots
should be mostly safe.

Such a feature would also only make sense if btrfs had an online repair
tool. BTW, are there plans for having an online repair tool in the
future? Maybe one that only scans and fixes part of the filesystems
(for obvious performance reasons, wrt Duncan's idea of handling
filesystems), i.e. those parts that the kernel discovered having
corruptions? If I could then just delete and restore affected files,
this would be even better than having independent subvolumes like above.

-- 
Regards,
Kai

Replies to list-only preferred.



Re: 4.11 relocate crash, null pointer + rolling back a filesystem by X hours?

2017-05-01 Thread Marc MERLIN
On Mon, May 01, 2017 at 10:56:06PM -0600, Chris Murphy wrote:
> > Right, of course, I was being way over optimistic here. I kind of forgot
> > that metadata wasn't COW, my bad.
> 
> Well it is COW. But there's more to the file system than fs trees, and
> just because an fs tree gets snapshot doesn't mean all data is
> snapshot. So whether snapshot or not, there's metadata that becomes
> obsolete as the file system is updated and those areas get freed up
> and eventually overwritten.

Got it, thanks for explaining.

> > Also, how is --mode=lowmem being useful?
> 
> Testing. lowmem is a different implementation, so it might find
> different things from the regular check.
 
I see.
I've fired off a scrub -r and then a check to run overnight; I'll see
if it finishes overnight assuming the kernel doesn't crash again (yeah,
just to make things simpler, I'm hitting another issue when I/O piles up
on btrfs on top of dmcrypt on top of bcache:
http://lkml.iu.edu/hypermail/linux/kernel/1705.0/00626.html
https://pastebin.com/YqE4riw0
but that's not a bcache bug, just something else getting in the way).
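
For concreteness, a sketch of that overnight run (log file names are 
arbitrary; check is run against the unmounted device):

btrfs scrub start -B -r /mnt/btrfs_pool2 2>&1 | tee scrub-ro.log   # -B foreground, -r read-only
umount /mnt/btrfs_pool2
btrfs check /dev/mapper/dshelf2 2>&1 | tee check.log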

> > And for re-parenting a sub-subvolume, is that possible?
> > (I want to delete /sub1/ but I can't because I have /sub1/sub2 that's also 
> > a subvolume
> > and I'm not sure how to re-parent sub2 to somewhere else so that I can 
> > subvolume delete
> > sub1)
> 
> Well you can move sub2 out of sub1 just like a directory and then
> delete sub1. If it's read-only it can't be moved, but you can use
> btrfs property get/set ro true/false to temporarily make it not
> read-only, move it, then make it read-only again, and it's still fine
> to use with btrfs send receive.

Ah, I didn't think mv would work from inside a subvolume to outside of a
subvolume without copying data (it doesn't for files) but I guess it
would for subvolumes, good point.
I'll try that, thanks.

> Not understanding the problem, it's by definition naive for me to
> suggest it should go read-only sooner before hosing itself. But I'd
> like to think it's possible for Btrfs to look backward every once in a
> while for sanity checking, to limit damage should it be occurring even
> if the hardware isn't reporting any problems.

Fair point. To be honest, maybe btrfs could indeed have detected
problems earlier, but ultimately it's not really its fault if bad things
happen when I'm having repeated storage errors underneath. For all I
know, some data got written after getting corrupted and btrfs would not
notice that right away.
Now, I kind of naively thought I could simply unroll all writes done
after a certain point. You pointed right (rightfully so) that it's not
nearly as simple as I was hoping.

So at this point, I think it's just a matter of me providing
check/repair logs if they are useful, and someone looking into this
balance causing a kernel crash, which is IMO the only real thing that
btrfs should reasonably fix.

I'll update the thread when I have more logs and have moved further on
the recovery.

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems 
   what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 1024R/763BE901


Re: 4.11 relocate crash, null pointer + rolling back a filesystem by X hours?

2017-05-01 Thread Duncan
Marc MERLIN posted on Mon, 01 May 2017 20:23:46 -0700 as excerpted:

> Also, how is --mode=lowmem being useful?

FWIW, I just watched your talk that's linked from the wiki, and wondered 
what you were doing these days as I hadn't seen any posts from you here 
for a while.

Well, that you're asking that question confirms you've not been following 
the list too closely...  Of course that's understandable as people have 
other stuff to do, but just sayin'.

The answer is... btrfs check in lowmem mode isn't simply lowmem, it's 
also effectively a very nearly entirely rewritten second implementation, 
which has already demonstrated its worth as it has already allowed 
finding and fixing a number of bugs in normal mode check.  Of course 
normal mode check has returned the favor a few times as well, so it is 
now reasonably standard list troubleshooting practice to ask for the 
output from both modes to see what and where they differ, especially if 
it's not something known to be directly fixable by normal mode, which of 
course remains the more mature default.

So even if neither one can actually fix the problem ATM, any differences 
in output both lend important clues to the real problem, and potentially 
help developers to find and fix bugs in one or the other implementation.

Tho it's worth noting that lowmem mode can be expected to take longer, as 
it favors lower memory usage over speed, just as the mode title suggests 
it will.  On a filesystem as big as yours... it may unfortunately not be 
entirely practical, especially if as you hint there's at least some time 
pressure here, tho it's not extreme.
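
A sketch of how that two-mode comparison is usually gathered (log file 
names are arbitrary):

btrfs check /dev/mapper/dshelf2 2>&1 | tee check-normal.log
btrfs check --mode=lowmem /dev/mapper/dshelf2 2>&1 | tee check-lowmem.log
diff -u check-normal.log check-lowmem.log | less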

Of course on-list I'm somewhat known for my arguments propounding the 
notion that any filesystem that's too big to be practically maintained 
(including time necessary to restore from backups, should that be 
necessary for whatever reason) is... too big... and should ideally be 
broken along logical and functional boundaries into a number of 
individual smaller filesystems until such point as each one is found to 
be practically maintainable within a reasonably practical time frame.  
Don't put all the eggs in one basket, and when the bottom of one of those 
baskets inevitably falls out, most of your eggs will be safe in other 
baskets. =:^)

But as someone else (pg, IIRC) on-list is fond of saying, lots of other 
people "know better" (TM).  Whatever.  It's your data, your systems and 
your time, not mine.  I just know what I've found (sometimes finding it 
the hard way!) to work best for me, and TBs on TBs of data on a single 
filesystem, even if it's a backup and is itself backed up, isn't 
something I'd be putting my own faith in, as the time even for a simple 
restore from backups is simply too high for me to consider it at all 
practical. =:^)

> And for re-parenting a sub-subvolume, is that possible?
> (I want to delete /sub1/ but I can't because I have /sub1/sub2 that's
> also a subvolume and I'm not sure how to re-parent sub2 to somewhere
> else so that I can subvolume delete sub1)

As I believe you know my own use-case doesn't deal with subvolumes and 
snapshots, so this may be of limited practicality, but FWIW, the 
sysadmin's guide discussion of snapshot management and special cases 
seems apropos as a first stop, before going further:

https://btrfs.wiki.kernel.org/index.php/SysadminGuide#Managing_Snapshots

Note that toward the bottom of "management" it discusses moving 
subvolumes (which will obviously reparent them), but then below that in 
special cases it says that read-only subvolumes (and thus snapshots) 
cannot be moved, explaining why.


*BUT*, and here's the "go further" part, keep in mind that subvolume-read-
only is a property, gettable and settable by btrfs property.

So you should be able to unset the read-only property of a subvolume or 
snapshot, move it, then if desired, set it again.

Of course I wouldn't expect send -p to work with such a snapshot, but 
send -c /might/ still work, I'm not actually sure but I'd consider it 
worth trying.  (I'd try -p as well, but expect it to fail...)
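
A sketch of how one might run that test (paths hypothetical; snap.A is 
the snapshot that was toggled read-write, moved to a new parent, and set 
read-only again; snap.B is a newer read-only snapshot of the same data):

btrfs send -c /pool/moved/snap.A /pool/snaps/snap.B | btrfs receive /backup   # -c might still work
btrfs send -p /pool/moved/snap.A /pool/snaps/snap.B | btrfs receive /backup   # -p expected to fail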

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: 4.11 relocate crash, null pointer + rolling back a filesystem by X hours?

2017-05-01 Thread Chris Murphy
On Mon, May 1, 2017 at 9:23 PM, Marc MERLIN  wrote:
> Hi Chris,
>
> Thanks for the reply, much appreciated.
>
> On Mon, May 01, 2017 at 07:50:22PM -0600, Chris Murphy wrote:
>> What about btrfs check (no repair), without and then also with --mode=lowmem?
>>
>> In theory I like the idea of a 24 hour rollback; but in normal usage
>> Btrfs will eventually free up space containing stale and no longer
>> necessary metadata. Like the chunk tree, it's always changing, so you
>> get to a point, even with snapshots, that the old state of that tree
>> is just - gone. A snapshot of an fs tree does not make the chunk tree
>> frozen in time.
>
> Right, of course, I was being way over optimistic here. I kind of forgot
> that metadata wasn't COW, my bad.

Well it is COW. But there's more to the file system than fs trees, and
just because an fs tree gets snapshot doesn't mean all data is
snapshot. So whether snapshot or not, there's metadata that becomes
obsolete as the file system is updated and those areas get freed up
and eventually overwritten.


>
>> In any case, it's a big problem in my mind if no existing tools can
> >> fix a file system of this size. So before making any more changes, make
>> sure you have a btrfs-image somewhere, even if it's huge. The offline
>> checker needs to be able to repair it, right now it's all we have for
>> such a case.
>
> The image will be huge, and take maybe 24H to make (last time it took
> some silly amount of time like that), and honestly I'm not sure how
> useful it'll be.
> Outside of the kernel crashing if I do a btrfs balance, and hopefully
> the crash report I gave is good enough, the state I'm in is not btrfs'
> fault.
>
> If I can't roll back to a reasonably working state, with data loss of a
> known quantity that I can recover from backup, I'll have to destroy the
> filesystem and recover from scratch, which will take multiple days.
> Since I can't wait too long before getting back to a working state, I
> think I'm going to try btrfs check --repair after a scrub to get a list
> of all the pathnames/inodes that are known to be damaged, and work from
> there.
> Sounds reasonable?

Yes.


>
> Also, how is --mode=lowmem being useful?

Testing. lowmem is a different implementation, so it might find
different things from the regular check.


>
> And for re-parenting a sub-subvolume, is that possible?
> (I want to delete /sub1/ but I can't because I have /sub1/sub2 that's also a 
> subvolume
> and I'm not sure how to re-parent sub2 to somewhere else so that I can 
> subvolume delete
> sub1)

Well you can move sub2 out of sub1 just like a directory and then
delete sub1. If it's read-only it can't be moved, but you can use
btrfs property get/set ro true/false to temporarily make it not
read-only, move it, then make it read-only again, and it's still fine
to use with btrfs send receive.
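
Using the sub1/sub2 names from the question, that would look roughly 
like this (mount point hypothetical; the ro toggle is only needed if 
sub2 is a read-only snapshot):

btrfs property get -ts /mnt/pool/sub1/sub2 ro        # check whether it's read-only
btrfs property set -ts /mnt/pool/sub1/sub2 ro false
mv /mnt/pool/sub1/sub2 /mnt/pool/sub2                # plain rename, no data copied
btrfs property set -ts /mnt/pool/sub2 ro true
btrfs subvolume delete /mnt/pool/sub1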





>
> In the meantime, a simple check without repair looks like this. It will
> likely take many hours to complete:
> gargamel:/var/local/space# btrfs check /dev/mapper/dshelf2
> Checking filesystem on /dev/mapper/dshelf2
> UUID: 03e9a50c-1ae6-4782-ab9c-5f310a98e653
> checking extents
> checksum verify failed on 3096461459456 found 0E6B7980 wanted FBE5477A
> checksum verify failed on 3096461459456 found 0E6B7980 wanted FBE5477A
> checksum verify failed on 2899180224512 found 7A6D427F wanted 7E899EE5
> checksum verify failed on 2899180224512 found 7A6D427F wanted 7E899EE5
> checksum verify failed on 2899180224512 found ABBE39B0 wanted E0735D0E
> checksum verify failed on 2899180224512 found 7A6D427F wanted 7E899EE5
> bytenr mismatch, want=2899180224512, have=3981076597540270796
> checksum verify failed on 1449488023552 found CECC36AF wanted 199FE6C5
> checksum verify failed on 1449488023552 found CECC36AF wanted 199FE6C5
> checksum verify failed on 1449544613888 found 895D691B wanted A0C64D2B
> checksum verify failed on 1449544613888 found 895D691B wanted A0C64D2B
> parent transid verify failed on 1671538819072 wanted 293964 found 293902
> parent transid verify failed on 1671538819072 wanted 293964 found 293902
> checksum verify failed on 1671603781632 found 18BC28D6 wanted 372655A0
> checksum verify failed on 1671603781632 found 18BC28D6 wanted 372655A0
> checksum verify failed on 1759425052672 found 843B59F1 wanted F0FF7D00
> checksum verify failed on 1759425052672 found 843B59F1 wanted F0FF7D00
> checksum verify failed on 2182657212416 found CD8EFC0C wanted 70847071
> checksum verify failed on 2182657212416 found CD8EFC0C wanted 70847071
> checksum verify failed on 2898779357184 found 96395131 wanted 433D6E09
> checksum verify failed on 2898779357184 found 96395131 wanted 433D6E09
> checksum verify failed on 2899180224512 found 7A6D427F wanted 7E899EE5
> checksum verify failed on 2899180224512 found 7A6D427F wanted 7E899EE5
> checksum verify failed on 2899180224512 found ABBE39B0 wanted E0735D0E
> checksum verify failed on 2899180224512 found 7A6D427F wanted 7E899EE5
> bytenr 

Re: 4.11 relocate crash, null pointer + rolling back a filesystem by X hours?

2017-05-01 Thread Marc MERLIN
Hi Chris,

Thanks for the reply, much appreciated.

On Mon, May 01, 2017 at 07:50:22PM -0600, Chris Murphy wrote:
> What about btrfs check (no repair), without and then also with --mode=lowmem?
> 
> In theory I like the idea of a 24 hour rollback; but in normal usage
> Btrfs will eventually free up space containing stale and no longer
> necessary metadata. Like the chunk tree, it's always changing, so you
> get to a point, even with snapshots, that the old state of that tree
> is just - gone. A snapshot of an fs tree does not make the chunk tree
> frozen in time.
 
Right, of course, I was being way over optimistic here. I kind of forgot
that metadata wasn't COW, my bad.

> In any case, it's a big problem in my mind if no existing tools can
> fix a file system of this size. So before making any more changes, make
> sure you have a btrfs-image somewhere, even if it's huge. The offline
> checker needs to be able to repair it; right now it's all we have for
> such a case.

The image will be huge, and take maybe 24H to make (last time it took
some silly amount of time like that), and honestly I'm not sure how
useful it'll be.
Apart from the kernel crashing when I do a btrfs balance (hopefully
the crash report I gave is good enough), the state I'm in is not btrfs'
fault.

If I can't roll back to a reasonably working state, with data loss of a
known quantity that I can recover from backup, I'll have to destroy the
filesystem and recover from scratch, which will take multiple days.
Since I can't wait too long before getting back to a working state, I
think I'm going to try btrfs check --repair after a scrub to get a list
of all the pathnames/inodes that are known to be damaged, and work from
there.
Sounds reasonable?

Also, how is --mode=lowmem being useful?

And for re-parenting a sub-subvolume, is that possible?
(I want to delete /sub1/ but I can't because I have /sub1/sub2 that's also a
subvolume, and I'm not sure how to re-parent sub2 to somewhere else so that
I can subvolume delete sub1)

In the meantime, a simple check without repair looks like this. It will
likely take many hours to complete:
gargamel:/var/local/space# btrfs check /dev/mapper/dshelf2
Checking filesystem on /dev/mapper/dshelf2
UUID: 03e9a50c-1ae6-4782-ab9c-5f310a98e653
checking extents
checksum verify failed on 3096461459456 found 0E6B7980 wanted FBE5477A
checksum verify failed on 3096461459456 found 0E6B7980 wanted FBE5477A
checksum verify failed on 2899180224512 found 7A6D427F wanted 7E899EE5
checksum verify failed on 2899180224512 found 7A6D427F wanted 7E899EE5
checksum verify failed on 2899180224512 found ABBE39B0 wanted E0735D0E
checksum verify failed on 2899180224512 found 7A6D427F wanted 7E899EE5
bytenr mismatch, want=2899180224512, have=3981076597540270796
checksum verify failed on 1449488023552 found CECC36AF wanted 199FE6C5
checksum verify failed on 1449488023552 found CECC36AF wanted 199FE6C5
checksum verify failed on 1449544613888 found 895D691B wanted A0C64D2B
checksum verify failed on 1449544613888 found 895D691B wanted A0C64D2B
parent transid verify failed on 1671538819072 wanted 293964 found 293902
parent transid verify failed on 1671538819072 wanted 293964 found 293902
checksum verify failed on 1671603781632 found 18BC28D6 wanted 372655A0
checksum verify failed on 1671603781632 found 18BC28D6 wanted 372655A0
checksum verify failed on 1759425052672 found 843B59F1 wanted F0FF7D00
checksum verify failed on 1759425052672 found 843B59F1 wanted F0FF7D00
checksum verify failed on 2182657212416 found CD8EFC0C wanted 70847071
checksum verify failed on 2182657212416 found CD8EFC0C wanted 70847071
checksum verify failed on 2898779357184 found 96395131 wanted 433D6E09
checksum verify failed on 2898779357184 found 96395131 wanted 433D6E09
checksum verify failed on 2899180224512 found 7A6D427F wanted 7E899EE5
checksum verify failed on 2899180224512 found 7A6D427F wanted 7E899EE5
checksum verify failed on 2899180224512 found ABBE39B0 wanted E0735D0E
checksum verify failed on 2899180224512 found 7A6D427F wanted 7E899EE5
bytenr mismatch, want=2899180224512, have=3981076597540270796
checksum verify failed on 2182657212416 found CD8EFC0C wanted 70847071
checksum verify failed on 2182657212416 found CD8EFC0C wanted 70847071
checksum verify failed on 2182657212416 found CD8EFC0C wanted 70847071
checksum verify failed on 2182657212416 found CD8EFC0C wanted 70847071
checksum verify failed on 2182657212416 found CD8EFC0C wanted 70847071
(...)

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems 
   what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 1024R/763BE901


Re: 4.11 relocate crash, null pointer + rolling back a filesystem by X hours?

2017-05-01 Thread Chris Murphy
What about btrfs check (no repair), without and then also with --mode=lowmem?

In theory I like the idea of a 24 hour rollback; but in normal usage
Btrfs will eventually free up space containing stale and no longer
necessary metadata. Like the chunk tree, it's always changing, so you
get to a point, even with snapshots, that the old state of that tree
is just - gone. A snapshot of an fs tree does not make the chunk tree
frozen in time.

Doing what you want maybe isn't a ton of work if it could be based on
a variation of the existing btrfs seed device code. Call it a "super
snapshot".

I like the idea of triage, where bad parts of the file system can just
be cut off. With other filesystems, the answer would be that this is
hardware sabotage and nothing can be done. Btrfs is a bit deceptive in
that it sort of invites the idea that we can use hardware that isn't
proven and the fs will survive.

In any case, it's a big problem in my mind if no existing tools can
fix a file system of this size. So before making any more changes, make
sure you have a btrfs-image somewhere, even if it's huge. The offline
checker needs to be able to repair it; right now it's all we have for
such a case.
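
Something like this should capture it (flags per btrfs-image(8): -c is the
compression level, -t the thread count, and the output path is just an
example -- it needs to live on a different filesystem):

btrfs-image -c9 -t4 /dev/mapper/dshelf2 /some/other/fs/dshelf2.img

The image contains metadata only, not file data, which is the only reason
this is feasible at all on a filesystem this size.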


Chris Murphy


Re: 4.11 relocate crash, null pointer + rolling back a filesystem by X hours?

2017-05-01 Thread Marc MERLIN
So, I forgot to mention that it's my main media and backup server that got
corrupted. Yes, I do actually have a backup of a backup server, but it's
going to take days to recover due to the amount of data to copy back, not
counting lots of manual typing due to the number of subvolumes, btrfs
send/receive relationships and so forth.

Really, I should be able to roll back all writes from the last 24H, run a
check --repair/scrub on top just to be sure, and be back on track.

In the meantime, the good news is that the filesystem doesn't crash the
kernel (the posted crash below) now that I was able to cancel the btrfs
balance, but it goes read-only at the drop of a hat, even when I'm trying
to delete recent snapshots and all data that was potentially written in
the last 24H.

On Mon, May 01, 2017 at 10:06:41AM -0700, Marc MERLIN wrote:
> I have a filesystem that sadly got corrupted by a SAS card I just installed 
> yesterday.
> 
> I don't think in a case like this there is a way to roll back all
> writes across all subvolumes in the last 24H, correct?
> 
> Is the best thing to go in each subvolume, delete the recent snapshots and
> rename the one from 24H as the current one?
 
Well, just like I expected, it's a pain in the rear, and this can't even help
fix the top-level mountpoint, which doesn't have snapshots, so I can't roll
it back.
btrfs should really have an easy way to roll back X hours or days to
recover from garbage written after a known good point, given that it is COW
after all.

Is there a way to do this with check --repair, maybe?

In the meantime, I got stuck while trying to delete snapshots:

Let's say I have this:
ID 428 gen 294021 top level 5 path backup
ID 2023 gen 294021 top level 5 path Soft
ID 3021 gen 294051 top level 428 path backup/debian32
ID 4400 gen 294018 top level 428 path backup/debian64
ID 4930 gen 294019 top level 428 path backup/ubuntu

I can easily:
Delete subvolume (no-commit): '/mnt/btrfs_pool2/Soft'
and then:
gargamel:/mnt/btrfs_pool2# mv Soft_rw.20170430_01:50:22 Soft

But I can't delete backup, which actually is mostly only a directory
containing other things (in hindsight I shouldn't have made that a
subvolume):
Delete subvolume (no-commit): '/mnt/btrfs_pool2/backup'
ERROR: cannot delete '/mnt/btrfs_pool2/backup': Directory not empty

This is because backup has a lot of subvolumes due to btrfs send/receive
relationships.

Is it possible to recover there? Can you reparent subvolumes to a different
subvolume without doing a full copy via btrfs send/receive?
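
(As a hedged aside, btrfs subvolume list -o prints only the subvolumes below
a given path, so something like

btrfs subvolume list -o /mnt/btrfs_pool2/backup

should show exactly what would have to be moved out or deleted before backup
itself can be removed.)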

Thanks,
Marc

> BTRFS warning (device dm-5): failed to load free space cache for block group 
> 6746013696000, rebuilding it now
> BTRFS warning (device dm-5): block group 6754603630592 has wrong amount of 
> free space
> BTRFS warning (device dm-5): failed to load free space cache for block group 
> 6754603630592, rebuilding it now
> BTRFS warning (device dm-5): block group 7125178777600 has wrong amount of 
> free space
> BTRFS warning (device dm-5): failed to load free space cache for block group 
> 7125178777600, rebuilding it now
> BTRFS error (device dm-5): bad tree block start 3981076597540270796 
> 2899180224512
> BTRFS error (device dm-5): bad tree block start 942082474969670243 
> 2899180224512
> BTRFS: error (device dm-5) in __btrfs_free_extent:6944: errno=-5 IO failure
> BTRFS info (device dm-5): forced readonly
> BTRFS: error (device dm-5) in btrfs_run_delayed_refs:2961: errno=-5 IO failure
> BUG: unable to handle kernel NULL pointer dereference at   (null)
> IP: __del_reloc_root+0x3f/0xa6
> PGD 189a0e067
> PUD 189a0f067
> PMD 0
> 
> Oops:  [#1] PREEMPT SMP
> Modules linked in: veth ip6table_filter ip6_tables ebtable_nat ebtables ppdev 
> lp xt_addrtype br_netfilter bridge stp llc tun autofs4 softdog binfmt_misc 
> ftdi_sio nfsd auth_rpcgss nfs_acl nfs lockd grace fscache sunrpc ipt_REJECT 
> nf_reject_ipv4 xt_conntrack xt_mark xt_nat xt_tcpudp nf_log_ipv4 
> nf_log_common xt_LOG iptable_mangle iptable_filter lm85 hwmon_vid pl2303 
> dm_snapshot dm_bufio iptable_nat ip_tables nf_conntrack_ipv4 nf_defrag_ipv4 
> nf_nat_ipv4 nf_conntrack_ftp ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_nat 
> nf_conntrack x_tables sg st snd_pcm_oss snd_mixer_oss bcache kvm_intel kvm 
> irqbypass snd_hda_codec_realtek snd_cmipci snd_hda_codec_generic 
> snd_hda_intel snd_mpu401_uart snd_hda_codec snd_opl3_lib snd_rawmidi 
> snd_hda_core snd_seq_device snd_hwdep eeepc_wmi snd_pcm asus_wmi rc_ati_x10
>  asix snd_timer ati_remote sparse_keymap usbnet rfkill snd hwmon soundcore 
> rc_core evdev libphy tpm_infineon pcspkr i915 parport_pc i2c_i801 input_leds 
> mei_me lpc_ich parport tpm_tis battery usbserial tpm_tis_core tpm wmi e1000e 
> ptp pps_core fuse raid456 multipath mmc_block mmc_core lrw ablk_helper 
> dm_crypt dm_mod async_raid6_recov async_pq async_xor async_memcpy async_tx 
> crc32c_intel blowfish_x86_64 blowfish_common pcbc aesni_intel aes_x86_64 
> crypto_simd glue_helper cryptd xhci_pci ehci_pci sata_sil24