Re: diagnosing XFS corruption after upgrading to Fedora 36

2022-07-19 Thread George N. White III
On Tue, Jul 19, 2022 at 12:50 AM Patrick Hemmer wrote:

> Just to close this out, and not be "that guy" (https://xkcd.com/979/), I
> ended up just rolling the kernel back to the Fedora 35 kernel (5.14.10).
> Without a good way to isolate where the problem is (between XFS & LVM), I
> really didn't want to waste time tracking this down, and restoring my system
> every couple hours. I'll try again in 6 months or so and see if maybe it's
> been found and fixed.
>

If the old kernel works, that points to the kernel rather than a hardware
issue.  If your hardware is widely used, others will encounter the same
problem.  Searching for issues with linux 5.18 (any distro) and your specific
hardware may find other victims.

There were "bug fix" changes to XFS in 5.18:
https://www.phoronix.com/scan.php?page=news_item&px=Linux-5.18-XFS-Changes

-- 
George N. White III
___
users mailing list -- users@lists.fedoraproject.org
To unsubscribe send an email to users-le...@lists.fedoraproject.org
Fedora Code of Conduct: 
https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: 
https://lists.fedoraproject.org/archives/list/users@lists.fedoraproject.org
Do not reply to spam on the list, report it: 
https://pagure.io/fedora-infrastructure


Re: diagnosing XFS corruption after upgrading to Fedora 36

2022-07-18 Thread Patrick Hemmer
Just to close this out, and not be "that guy" (https://xkcd.com/979/), I ended 
up just rolling the kernel back to the Fedora 35 kernel (5.14.10).
Without a good way to isolate where the problem is (between XFS & LVM), I 
really didn't want to waste time tracking this down, and restoring my system 
every couple hours. I'll try again in 6 months or so and see if maybe it's been 
found and fixed.


Re: diagnosing XFS corruption after upgrading to Fedora 36

2022-07-18 Thread Roger Heflin
You might include a full dmesg/messages.  This is the sort of error
you get when there is an underlying read failure/breakage on the
device that the data is actually on.

You get SCSI/block errors first, and those then show up as filesystem
errors similar to these.  This sounds like the underlying device has
issues (bad device, bad cable, bad power...).
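
A rough way to check that ordering in a saved dmesg/messages file is
sketched below in Python.  The regular expressions are illustrative
assumptions covering a few common kernel messages, not a complete list, and
the function name classify is made up for this example.

```python
import re

# Assumed, non-exhaustive patterns for block-layer / device errors.
BLOCK_ERROR = re.compile(r"I/O error, dev |Buffer I/O error|failed command:")
# XFS messages that mention corruption or an I/O error.
XFS_ERROR = re.compile(r"XFS \([^)]+\): .*(corruption|I/O error)", re.IGNORECASE)


def classify(lines):
    """Report whether device errors appear before the first XFS error."""
    first_block = None
    first_xfs = None
    for index, line in enumerate(lines):
        # Test for XFS first so XFS's own "I/O error" lines are never
        # counted as block-layer errors.
        if XFS_ERROR.search(line):
            if first_xfs is None:
                first_xfs = index
        elif BLOCK_ERROR.search(line):
            if first_block is None:
                first_block = index
    if first_xfs is None:
        return "no-xfs-errors"
    if first_block is None:
        return "xfs-errors-only"
    if first_block < first_xfs:
        return "device-errors-first"
    return "xfs-errors-first"
```

Called on the lines of a log file, for example classify(open("messages.txt")),
a result of "device-errors-first" would fit the failing-device explanation,
while "xfs-errors-only" would leave the question open.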

On Sun, Jul 17, 2022 at 10:10 PM Patrick Hemmer  wrote:
>
> Ever since upgrading to Fedora 36, my root filesystem is getting corrupted 
> every few hours. I maintain block level backups, and I have to restore every 
> time this happens. xfs_repair can fix the filesystem, but the system is 
> typically unusable as there's often over 10k files in lost+found.
>
> I have tried creating a brand new filesystem (mkfs.xfs), but it still gets 
> corrupted.
>
> I would file a bug, but the caveat is that I also have LVM underneath the 
> filesystem. And so I don't know whether it's a problem with XFS, or LVM. I 
> have other XFS filesystems also on LVM, and have seen corruption on them as 
> well, but it's nowhere near as significant or frequent as on the root 
> filesystem.
>
> Sometimes I can detect the corruption before the kernel does, by doing a 
> snapshot, and running `xfs_repair -n` on the snapshot. And sometimes the 
> kernel will detect the corruption first, usually with a message like:
>
> Jul 17 15:06:52 whistler kernel: XFS (dm-0): Metadata corruption detected at 
> xfs_buf_ioend+0x14c/0x5d0 [xfs], xfs_inode block 0x46057c8 
> xfs_inode_buf_verify
> Jul 17 15:06:52 whistler kernel: XFS (dm-0): Unmount and run xfs_repair
> Jul 17 15:06:52 whistler kernel: XFS (dm-0): First 128 bytes of corrupted 
> metadata buffer:
> Jul 17 15:06:52 whistler kernel: 0000: 00 00 00 00 00 00 00 00 00 00 00 
> 00 00 00 00 00  
> Jul 17 15:06:52 whistler kernel: 0010: 00 00 00 00 00 00 00 00 00 00 00 
> 00 00 00 00 00  
> Jul 17 15:06:52 whistler kernel: 0020: 00 00 00 00 00 00 00 00 00 00 00 
> 00 00 00 00 00  
> Jul 17 15:06:52 whistler kernel: 0030: 00 00 00 00 00 00 00 00 00 00 00 
> 00 00 00 00 00  
> Jul 17 15:06:52 whistler kernel: 0040: 00 00 00 00 00 00 00 00 00 00 00 
> 00 00 00 00 00  
> Jul 17 15:06:52 whistler kernel: 0050: 00 00 00 00 00 00 00 00 00 00 00 
> 00 00 00 00 00  
> Jul 17 15:06:52 whistler kernel: 0060: 00 00 00 00 00 00 00 00 00 00 00 
> 00 00 00 00 00  
> Jul 17 15:06:52 whistler kernel: 0070: 00 00 00 00 00 00 00 00 00 00 00 
> 00 00 00 00 00  
> Jul 17 15:06:52 whistler kernel: XFS (dm-0): metadata I/O error in 
> "xfs_imap_to_bp+0x40/0x50 [xfs]" at daddr 0x46057c8 len 32 error 117
> Jul 17 15:06:52 whistler kernel: XFS (dm-0): Metadata I/O Error (0x1) 
> detected at xfs_trans_read_buf_map+0x179/0x2d0 [xfs] 
> (fs/xfs/xfs_trans_buf.c:296).  Shutting down filesystem.
> Jul 17 15:06:52 whistler kernel: XFS (dm-0): Please unmount the filesystem 
> and rectify the problem(s)
>
> So how can I proceed on this? Is there any way to determine whether this is 
> an LVM issue or an XFS issue?


Re: diagnosing XFS corruption after upgrading to Fedora 36

2022-07-18 Thread Tim via users
On Mon, 2022-07-18 at 07:29 -0300, George N. White III wrote:
> Cables and connectors should also be considered.  Try swapping cables
> and connections. "Contact enhancer" sometimes solves connection
> problems (now that cars are full of computers, you can buy 
> contact enhancer at auto supply stores). 

As someone who's been in electronics servicing for well over 30 years,
I can attest that connectors are the cause of many mysterious faults
where nothing else was wrong with the equipment.  Unplugging and
replugging fixed many faults, and using contact cleaner helps stop the
problem from rapidly recurring.  But use proper contact cleaner, not
*ordinary* WD40 (it's corrosive, and will cause worse contact problems
down the track, not to mention how horrible it is for the lungs).

I used to encounter many connector problems with PCs years ago (when I
frequently fixed other people's computers) because the case wasn't
rigid enough.  When people moved the box about, even by small amounts,
the chassis would twist and it pulled cards partway out of their
sockets.  I had one that pretty much had to stay untouched on the
shelf.  Thermal expansion and contraction also walks connectors apart.

One of my early computers had a very solid case, and it had a metal bar
between the front and back of the case, and another that was screwed
down over the top of plug-in cards to hold them firmly into place.

Modern SATA drive data and power connectors are not very good, in my
opinion, compared to the older style.  They had a much tighter grip. 
Some of the better SATA cables have a metal catch to stop them slipping
out.

-- 
 
uname -rsvp
Linux 3.10.0-1160.71.1.el7.x86_64 #1 SMP Tue Jun 28 15:37:28 UTC 2022 x86_64
 
Boilerplate:  All unexpected mail to my mailbox is automatically deleted.
I will only get to see the messages that are posted to the mailing list.
 


Re: diagnosing XFS corruption after upgrading to Fedora 36

2022-07-18 Thread George N. White III
On Mon, Jul 18, 2022 at 12:10 AM Patrick Hemmer wrote:

> Ever since upgrading to Fedora 36, my root filesystem is getting corrupted
> every few hours. I maintain block level backups, and I have to restore
> every time this happens. xfs_repair can fix the filesystem, but the system
> is typically unusable as there's often over 10k files in lost+found.
>
> I have tried creating a brand new filesystem (mkfs.xfs), but it still gets
> corrupted.
>
> I would file a bug, but the caveat is that I also have LVM underneath the
> filesystem. And so I don't know whether it's a problem with XFS, or LVM. I
> have other XFS filesystems also on LVM, and have seen corruption on them as
> well, but it's nowhere near as significant or frequent as on the root
> filesystem.
>
> Sometimes I can detect the corruption before the kernel does, by doing a
> snapshot, and running `xfs_repair -n` on the snapshot. And sometimes the
> kernel will detect the corruption first, usually with a message like:
>
> Jul 17 15:06:52 whistler kernel: XFS (dm-0): Metadata corruption detected
> at xfs_buf_ioend+0x14c/0x5d0 [xfs], xfs_inode block 0x46057c8
> xfs_inode_buf_verify
> Jul 17 15:06:52 whistler kernel: XFS (dm-0): Unmount and run xfs_repair
> Jul 17 15:06:52 whistler kernel: XFS (dm-0): First 128 bytes of corrupted
> metadata buffer:
> Jul 17 15:06:52 whistler kernel: 0000: 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00 00 00  
> Jul 17 15:06:52 whistler kernel: 0010: 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00 00 00  
> Jul 17 15:06:52 whistler kernel: 0020: 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00 00 00  
> Jul 17 15:06:52 whistler kernel: 0030: 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00 00 00  
> Jul 17 15:06:52 whistler kernel: 0040: 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00 00 00  
> Jul 17 15:06:52 whistler kernel: 0050: 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00 00 00  
> Jul 17 15:06:52 whistler kernel: 0060: 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00 00 00  
> Jul 17 15:06:52 whistler kernel: 0070: 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00 00 00  
> Jul 17 15:06:52 whistler kernel: XFS (dm-0): metadata I/O error in
> "xfs_imap_to_bp+0x40/0x50 [xfs]" at daddr 0x46057c8 len 32 error 117
> Jul 17 15:06:52 whistler kernel: XFS (dm-0): Metadata I/O Error (0x1)
> detected at xfs_trans_read_buf_map+0x179/0x2d0 [xfs]
> (fs/xfs/xfs_trans_buf.c:296).  Shutting down filesystem.
> Jul 17 15:06:52 whistler kernel: XFS (dm-0): Please unmount the filesystem
> and rectify the problem(s)
>
> So how can I proceed on this? Is there any way to determine whether this
> is an LVM issue or an XFS issue?
>

LVM and XFS on Linux have been very reliable, so you need to rule out
hardware problems.  If the drive supports S.M.A.R.T., then smartmontools can
run the drive's internal self-tests.  Some vendors provide test software
(often Windows only).  Cables and connectors should also be considered.  Try
swapping cables and connections.  "Contact enhancer" sometimes solves
connection problems (now that cars are full of computers, you can buy
contact enhancer at auto supply stores).

It is very useful to have an external drive-to-USB adapter.  For NVMe, a
USB-C NVMe case provides a way to test NVMe drives, and a cast-off 128G NVMe
card can be used in the adapter as a fast alternative to USB memory "keys".
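
As a minimal sketch of the smartmontools step (Python; the helper names and
the /dev/sda device path are placeholders, and the self-test log option is
the one used for ATA/SATA drives), the usual smartctl sequence could be
driven like this:

```python
import subprocess


def smart_commands(device):
    """Build the smartctl invocations for a basic drive check."""
    return [
        ["smartctl", "-H", device],              # overall health verdict
        ["smartctl", "-t", "short", device],     # start the short self-test
        ["smartctl", "-l", "selftest", device],  # read the self-test log
    ]


def run_smart_check(device):
    """Run each command in turn; needs root and smartmontools installed."""
    for argv in smart_commands(device):
        # smartctl uses a bit-mask exit status, so a non-zero code is not
        # treated as a fatal error here.
        subprocess.run(argv, check=False)
```

The short self-test runs inside the drive and takes a few minutes, so the
self-test log is only meaningful when read again after the test has had time
to finish, e.g. a second run of the last command from
smart_commands("/dev/sda").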






-- 
George N. White III


diagnosing XFS corruption after upgrading to Fedora 36

2022-07-17 Thread Patrick Hemmer
Ever since upgrading to Fedora 36, my root filesystem is getting corrupted 
every few hours. I maintain block level backups, and I have to restore every 
time this happens. xfs_repair can fix the filesystem, but the system is 
typically unusable as there's often over 10k files in lost+found.

I have tried creating a brand new filesystem (mkfs.xfs), but it still gets 
corrupted.

I would file a bug, but the caveat is that I also have LVM underneath the 
filesystem. And so I don't know whether it's a problem with XFS, or LVM. I have 
other XFS filesystems also on LVM, and have seen corruption on them as well, 
but it's nowhere near as significant or frequent as on the root filesystem.

Sometimes I can detect the corruption before the kernel does, by doing a 
snapshot, and running `xfs_repair -n` on the snapshot. And sometimes the kernel 
will detect the corruption first, usually with a message like:

Jul 17 15:06:52 whistler kernel: XFS (dm-0): Metadata corruption detected at xfs_buf_ioend+0x14c/0x5d0 [xfs], xfs_inode block 0x46057c8 xfs_inode_buf_verify
Jul 17 15:06:52 whistler kernel: XFS (dm-0): Unmount and run xfs_repair
Jul 17 15:06:52 whistler kernel: XFS (dm-0): First 128 bytes of corrupted metadata buffer:
Jul 17 15:06:52 whistler kernel: 0000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Jul 17 15:06:52 whistler kernel: 0010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Jul 17 15:06:52 whistler kernel: 0020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Jul 17 15:06:52 whistler kernel: 0030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Jul 17 15:06:52 whistler kernel: 0040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Jul 17 15:06:52 whistler kernel: 0050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Jul 17 15:06:52 whistler kernel: 0060: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Jul 17 15:06:52 whistler kernel: 0070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Jul 17 15:06:52 whistler kernel: XFS (dm-0): metadata I/O error in "xfs_imap_to_bp+0x40/0x50 [xfs]" at daddr 0x46057c8 len 32 error 117
Jul 17 15:06:52 whistler kernel: XFS (dm-0): Metadata I/O Error (0x1) detected at xfs_trans_read_buf_map+0x179/0x2d0 [xfs] (fs/xfs/xfs_trans_buf.c:296).  Shutting down filesystem.
Jul 17 15:06:52 whistler kernel: XFS (dm-0): Please unmount the filesystem and rectify the problem(s)
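
The fields in the "metadata I/O error" line can be decoded mechanically.
The Python sketch below assumes that XFS counts daddr and len in 512-byte
basic blocks and that error 117 is Linux's EUCLEAN, which XFS uses to mean
"filesystem corrupted"; the function and constant names are illustrative
only.

```python
import re

BASIC_BLOCK_BYTES = 512  # XFS disk addresses are in 512-byte basic blocks

# Linux errno values relevant here (hard-coded so the sketch is portable).
LINUX_ERRNO_NAMES = {5: "EIO", 117: "EUCLEAN"}

IO_ERROR_FIELDS = re.compile(r"at daddr (0x[0-9a-fA-F]+) len (\d+) error (\d+)")


def decode_xfs_io_error(line):
    """Extract byte offset, byte length and errno name from an XFS log line."""
    match = IO_ERROR_FIELDS.search(line)
    if match is None:
        return None
    daddr = int(match.group(1), 16)
    length = int(match.group(2))
    error = int(match.group(3))
    return {
        # Offset is relative to the start of the device XFS sits on (dm-0).
        "byte_offset": daddr * BASIC_BLOCK_BYTES,
        "byte_length": length * BASIC_BLOCK_BYTES,
        "errno_name": LINUX_ERRNO_NAMES.get(error, "errno %d" % error),
    }
```

For the line above, the length works out to 32 * 512 = 16384 bytes (16 KiB),
and the error is a corruption report from the verifier rather than an EIO
returned by the device.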

So how can I proceed on this? Is there any way to determine whether this is an 
LVM issue or an XFS issue?
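
The snapshot check described above could be scripted roughly as follows.
This is a sketch in Python under stated assumptions: a classic (non-thin)
logical volume, enough free space in the volume group for the snapshot, and
placeholder names (the volume group, logical volume, snapshot name "xfscheck"
and 1G size are all hypothetical).

```python
def snapshot_check_commands(vg, lv, snap="xfscheck", size="1G"):
    """Build the commands for a read-only XFS check of an LVM snapshot."""
    return [
        # Take a copy-on-write snapshot of the logical volume.
        ["lvcreate", "--snapshot", "--name", snap, "--size", size,
         "%s/%s" % (vg, lv)],
        # -n: no-modify mode, only report what would be repaired.
        ["xfs_repair", "-n", "/dev/%s/%s" % (vg, snap)],
        # Drop the snapshot again so it does not fill up.
        ["lvremove", "--yes", "%s/%s" % (vg, snap)],
    ]


def interpret_repair_status(returncode):
    """Map the exit status of `xfs_repair -n` to a short verdict."""
    if returncode == 0:
        return "no corruption detected"
    if returncode == 1:
        return "corruption detected"
    return "check did not complete"
```

Because the snapshot is taken from a mounted filesystem, its log may not have
been replayed, so xfs_repair -n may report some inconsistencies that are not
real; a clean result is more informative than a dirty one.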