[Please CC me, I'm not on the list.]

The current plan is to dump the whole NVMe with dd (ongoing ...) and
experiment on that. Safer that way.
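A minimal sketch of that imaging step, assuming the copy runs from a rescue system with the source filesystem unmounted. The helper name image_and_verify and the paths in the usage comment are made up for illustration; only standard dd and cmp are used.

```shell
#!/bin/sh
# Sketch only: copy a source (block device or file) to an image file,
# then verify the copy. image_and_verify is a hypothetical helper name.
image_and_verify() {
    src=$1
    img=$2
    # Plain block copy; bs=1048576 (1 MiB) only reduces the number of
    # read/write calls, it does not change the resulting image.
    dd if="$src" of="$img" bs=1048576 2>/dev/null || return 1
    # cmp -s is silent and exits non-zero if source and image differ.
    cmp -s "$src" "$img"
}

# Usage (hypothetical paths):
#   image_and_verify /dev/nvme0n1 /mnt/backup/nvme0n1.img && echo "image OK"
```

The cmp pass reads both sides again in full, so on a large NVMe it roughly doubles the run time; it can be left out if the dd exit status is considered sufficient.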
Question: Can I work with the mounted backup image on the machine that
also contains the original disc? I vaguely recall something about
btrfs really not liking clones.

Cheers,
Christian

On Sun, 20 Oct 2019 at 09:41, Qu Wenruo <quwenruo.bt...@gmx.com> wrote:
>
> On 2019/10/20 3:01 PM, Christian Pernegger wrote:
> > [Please CC me, I'm not on the list.]
> >
> > Good morning & thank you.
> >
> > On Sun, 20 Oct 2019 at 02:38, Qu Wenruo
> > <quwenruo.bt...@gmx.com> wrote:
> >> It looks like you're using eGPU and the thunderbolt 3 connection
> >> disconnect?
> >> That would cause a kernel panic/hang or whatever.
> >
> > No, it's a Radeon VII in a Gigabyte X570 Aorus Master. The board has
> > PCIe 4, otherwise nothing exotic.
>
> Since Radeon 7 doesn't support PCIe 4, they would just negotiate to use
> PCIE 3, thus really nothing exotic.
>
> Just a kernel bug in amdgpu.
> But since you're already using Radeon 7, it's recommended to use newer
> kernel for latest drm updates.
>
> >
> >>> [...]
> >>> BTRFS error [...]: bad tree block start, want 284041084928 have 0
> >>> BTRFS error [...]: failed to read block groups: -5
> >>> BTRFS error [...]: open_ctree failed
> > ["big number" filled in above]
> >
> >> This means some tree blocks didn't reach disk or just got wiped out.
> >> Are you using discard mount option?
> >
> > Not to my knowledge. As in, I didn't set "discard", as far as I can
> > remember it didn't show up in mount output, but it's possible it's on
> > by default.
>
> Discard won't turn on by default IIRC.
> So it's not discard related.
>
> >
> >>> running btrfs check gives:
> >>> checksum verify failed on 284041084928 found E4E3BDB6 wanted 00000000
> >>> checksum verify failed on 284041084928 found E4E3BDB6 wanted 00000000
>
> This matches the kernel output, means that tree block doesn't reach disk
> at all.
>
> >>> bytenr mismatch, want=284041084928, have=0
> >>> ERROR: cannot open filesystem.
> > ["big number" and "8-digit hex" filled in above]
> >
> >> Again, some old tree blocks got wiped out.
> >> BTW, you don't need to wipe the numbers, sometimes it help developer to
> >> find some corner problem.
> >
> > I was just being lazy, sorry about that.
> >
> >> If it's the only problem, you can try this kernel branch to at least do
> >> a RO mount:
> >> https://github.com/adam900710/linux/tree/rescue_options
> >>
> >> Then mount the fs with "rescue=skipbg,ro" option.
> >> If the bad tree block is the only problem, it should be able to mount it.
> >>
> >> If that mount succeeded, and you can access all files, then it means
> >> only extent tree is corrupted, then you can try btrfs check
> >> --init-extent-tree, there are some reports of --init-extent-tree fixed
> >> the problem.
> >
> > You wouldn't happen to know of a bootable rescue image that has this?
>
> Archlinux iso at least has the latest btrfs-progs.
> You can try that.
>
> The latest btrfs check is not that super dangerous compared to older
> versions.
> You can try --init-extent-tree, if it finishes it should give you a more
> or less mountable fs.
>
> If it crashes, then it shouldn't cause extra damage, but still it's not
> 100% safe.
>
>
> I'd recommend the following safer methods before trying --init-extent-tree:
>
> - Dump backup roots first:
>   # btrfs ins dump-super -f <dev> | grep backup_tree_root
>   Then grab all big numbers.
>
> - Try backup_tree_root numbers in btrfs check first
>   # btrfs check -r <above big number> <dev>
>   Use the number with highest generation first.
>
>   It's the equivalent of kernel usebackuproot mount option, but more
>   control as you can try every backup and find which one can pass the
>   extent tree failure.
>
> If all backup fails to pass basic btrfs check, and all happen to have
> the same "wanted 00000000" then it means a big range of tree blocks
> get wiped out, not really related to btrfs but some hardware wipe.
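A rough sketch of the two backup-root steps quoted above, not a tested recovery script. It assumes that dump-super prints each backup root as a line shaped like "backup_tree_root: <bytenr> gen: <generation> level: <n>"; the helper name list_backup_roots and the $dev variable are made up for illustration.

```shell
#!/bin/sh
# Sketch only: read `btrfs inspect-internal dump-super -f` output on stdin
# and print the backup tree root bytenrs, highest generation first.
# Assumed line shape:  backup_tree_root: <bytenr> gen: <generation> level: <n>
# list_backup_roots is a hypothetical helper name.
list_backup_roots() {
    # Emit "<generation> <bytenr>" for each used slot (bytenr 0 = unused),
    # sort numerically descending by generation, then keep each bytenr once.
    awk '$1 == "backup_tree_root:" && $2 != 0 { print $4, $2 }' |
        sort -rn |
        awk '!seen[$2]++ { print $2 }'
}

# Usage (dev is a placeholder for the device or image to inspect;
# btrfs check does not write unless --repair is given):
#   btrfs inspect-internal dump-super -f "$dev" | list_backup_roots |
#       while read -r bytenr; do
#           btrfs check -r "$bytenr" "$dev" && break
#       done
```

Trying the roots in this order mirrors the advice to start with the highest generation; the loop stops at the first root that passes the check.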
>
> If one can pass the initial mount and gives extra errors, then you can
> add --repair to hope for a better chance to repair.
>
> > The affected machine obviously doesn't boot, getting the NVMe out
> > requires dismantling the CPU cooler, and TBH, I haven't built a kernel
> > in ~15 years.
>
> The safest one is still that out-of-tree rescue patchset, especially
> when we can't rule out other corruptions in other trees.
> I should really push that patchset harder into mainline.
>
> Just another unrelated hardware recommend, since you're already using
> Radeon 7 and X570 board, I guess using an AIO will make M.2 SSD more
> accessible.
>
> Or keep the exotic tower cooler, and use an M.2 to PCIe adapter to make
> your SSD more accessible, as CrossFire is already dead, I guess you have
> some free PCIE x4 slots.
>
> >
> >> About the cause, either btrfs didn't write some tree blocks correctly or
> >> the NVMe doesn't implement FUA/FLUSH correctly (which I don't believe is
> >> the case).
> >>
> >> So it's recommended to update the kernel to 5.3 kernel.
> >
> > FWIW, it's a Samsung 970 Evo Plus.
>
> It doesn't look like a hardware problem, but I keep my conclusion until
> you have tried all backup roots.
>
> Thanks,
> Qu
>
> > TBH, I didn't expect to lose more than the last couple minutes of
> > writes in such a crash, certainly not an unmountable filesystem. So
> > I'd love to know what caused this so I can avoid it in future. But
> > first things first, have to get this thing up & running again ...
> >
> > Cheers,
> > Christian
> >

On Sun, 20 Oct 2019 at 12:11, Christian Pernegger <perneg...@gmail.com> wrote:
>
> [Re-send, hit reply instead of reply-all by mistake. Please CC me, I'm
> not on the list.]
>
> Good morning & thank you.
>
> On Sun, 20 Oct 2019 at 02:38, Qu Wenruo <quwenruo.bt...@gmx.com> wrote:
> > It looks like you're using eGPU and the thunderbolt 3 connection disconnect?
> > That would cause a kernel panic/hang or whatever.
>
> No, it's a Radeon VII in a Gigabyte X570 Aorus Master. The board has
> PCIe 4, otherwise nothing exotic.
>
> > > [...]
> > > BTRFS error [...]: bad tree block start, want 284041084928 have 0
> > > BTRFS error [...]: failed to read block groups: -5
> > > BTRFS error [...]: open_ctree failed
> ["big number" filled in above]
>
> > This means some tree blocks didn't reach disk or just got wiped out.
> > Are you using discard mount option?
>
> Not to my knowledge. As in, I didn't set "discard", as far as I can
> remember it didn't show up in mount output, but it's possible it's on
> by default.
>
> > > running btrfs check gives:
> > > checksum verify failed on 284041084928 found E4E3BDB6 wanted 00000000
> > > checksum verify failed on 284041084928 found E4E3BDB6 wanted 00000000
> > > bytenr mismatch, want=284041084928, have=0
> > > ERROR: cannot open filesystem.
> ["big number" and "8-digit hex" filled in above]
>
> > Again, some old tree blocks got wiped out.
> > BTW, you don't need to wipe the numbers, sometimes it help developer to
> > find some corner problem.
>
> I was just being lazy, sorry about that.
>
> > If it's the only problem, you can try this kernel branch to at least do
> > a RO mount:
> > https://github.com/adam900710/linux/tree/rescue_options
> >
> > Then mount the fs with "rescue=skipbg,ro" option.
> > If the bad tree block is the only problem, it should be able to mount it.
> >
> > If that mount succeeded, and you can access all files, then it means
> > only extent tree is corrupted, then you can try btrfs check
> > --init-extent-tree, there are some reports of --init-extent-tree fixed
> > the problem.
>
> You wouldn't happen to know of a bootable rescue image that has this?
> The affected machine obviously doesn't boot, getting the NVMe out
> requires dismantling the CPU cooler, and TBH, I haven't built a kernel
> in ~15 years.
>
> > About the cause, either btrfs didn't write some tree blocks correctly or
> > the NVMe doesn't implement FUA/FLUSH correctly (which I don't believe is
> > the case).
> >
> > So it's recommended to update the kernel to 5.3 kernel.
>
> FWIW, it's a Samsung 970 Evo Plus.
> TBH, I didn't expect to lose more than the last couple minutes of
> writes in such a crash, certainly not an unmountable filesystem. So
> I'd love to know what caused this so I can avoid it in future. But
> first things first, have to get this thing up & running again ...
>
> Cheers,
> Christian
>
> On Sun, 20 Oct 2019 at 02:38, Qu Wenruo <quwenruo.bt...@gmx.com> wrote:
> >
> > On 2019/10/20 6:34 AM, Christian Pernegger wrote:
> > > [Please CC me, I'm not on the list.]
> > >
> > > Hello,
> > >
> > > I'm afraid I could use some help.
> > >
> > > The affected machine froze during a game, was entirely unresponsive
> > > locally, though ssh still worked. For completeness' sake, dmesg had:
> > > [110592.128512] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=3404070, emitted seq=3404071
> > > [110592.128545] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 1191 thread Xorg:cs0 pid 1204
> > > [110592.128549] amdgpu 0000:0c:00.0: GPU reset begin!
> > > [110592.138530] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=13149116, emitted seq=13149118
> > > [110592.138577] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Overcooked.exe pid 4830 thread dxvk-submit pid 4856
> > > [110592.138579] amdgpu 0000:0c:00.0: GPU reset begin!
> >
> > It looks like you're using eGPU and the thunderbolt 3 connection disconnect?
> > That would cause a kernel panic/hang or whatever.
> >
> > >
> > > Oh well, I thought, and "shutdown -h now" it. That quit my ssh session
> > > and locked me out, but otherwise didn't take, no reboot, still frozen.
> > > Alt-SysRq-REISUB it was. That did it.
> > >
> > > Only now all I get is a rescue shell, the pertinent messages look to
> > > be [everything is copied off the screen by hand]:
> > > [...]
> > > BTRFS info [...]: disk space caching is enabled
> > > BTRFS info [...]: has skinny extents
> > > BTRFS error [...]: bad tree block start, want [big number] have 0
> > > BTRFS error [...]: failed to read block groups: -5
> > > BTRFS error [...]: open_ctree failed
> >
> > This means some tree blocks didn't reach disk or just got wiped out.
> >
> > Are you using discard mount option?
> >
> > >
> > > Mounting with -o ro,usebackuproot doesn't change anything.
> > >
> > > running btrfs check gives:
> > > checksum verify failed on [same big number] found [8 digits hex] wanted 00000000
> > > checksum verify failed on [same big number] found [8 digits hex] wanted 00000000
> >
> > Again, some old tree blocks got wiped out.
> >
> > BTW, you don't need to wipe the numbers, sometimes it help developer to
> > find some corner problem.
> >
> > > bytenr mismatch, want=[same big number], have=0
> > > ERROR: cannot open filesystem.
> > >
> > > That's all I've got, I'd really appreciate some help. There's hourly
> > > snapshots courtesy of Timeshift, though I have a feeling those won't
> > > help ...
> >
> > If it's the only problem, you can try this kernel branch to at least do
> > a RO mount:
> > https://github.com/adam900710/linux/tree/rescue_options
> >
> > Then mount the fs with "rescue=skipbg,ro" option.
> > If the bad tree block is the only problem, it should be able to mount it.
> >
> > If that mount succeeded, and you can access all files, then it means
> > only extent tree is corrupted, then you can try btrfs check
> > --init-extent-tree, there are some reports of --init-extent-tree fixed
> > the problem.
> >
> > >
> > > Oh, it's a recent Linux Mint 19.2 install, default layout (@, @home),
> > > Timeshift enabled; on a single device (NVMe). HWE kernel (Kernel
> > > 5.0.0-31-generic), btrfs-progs 4.15.1.
> >
> > About the cause, either btrfs didn't write some tree blocks correctly or
> > the NVMe doesn't implement FUA/FLUSH correctly (which I don't believe is
> > the case).
> >
> > So it's recommended to update the kernel to 5.3 kernel.
> >
> > Thanks,
> > Qu
> >
> > >
> > > TIA,
> > > Christian