Bug#988477: [Pkg-xen-devel] Bug#988477: xen-hypervisor-4.14-amd64: xen dmesg shows (XEN) AMD-Vi: IO_PAGE_FAULT on sata pci device
On Wed, May 28, 2025 at 05:21:00PM -0700, Elliott Mitchell wrote:
> On Sun, May 18, 2025 at 02:10:25PM +0200, Maximilian Engelhardt wrote:
> > On Monday, 14 April 2025 00:22:01 CEST Elliott Mitchell wrote:
> > >
> > > Do any of the Debian maintainers have an AMD machine set up for
> > > debugging? I'm not very well set up for debugging this particular
> > > issue. If you've got an AMD machine with a pair of available SATA
> > > ports (including SATA power!), I could send a pair of SATA devices
> > > known to readily reproduce the issue.
> >
> > I'm not aware of anybody in our team having hardware where they can
> > reproduce this issue, else I'm sure they would have already provided
> > feedback here. There are also not many reports here of people running
> > into this problem. Thus I assume it needs a special (and probably
> > rare) hardware combination to trigger this.
> > One thing I can add is that I have been running software raid1 with
> > Xen on two SATA SSDs on an Intel CPU for many years without seeing
> > any data corruption.
>
> I'm skeptical of it being rare, but certainly uncommon. You've got some
> similarity to the reproductions, but there are differences.
>
> First question, what brand/model are the SSDs? Samsung SSDs are known to
> be affected (severely affected for some models), while Crucial/Micron
> SSDs are unaffected (some models might be mildly affected).
>
> Second question, where are the SATA ports? Are they on-motherboard?
> Add-on card? The reproductions were with on-motherboard ports.
>
> What generation is your processor? Are you sure it has an IOMMU and Xen
> is driving the IOMMU? I had suspected Intel systems would be affected,
> but you may have disproven this.

Uh. I did hope you could help narrow things down some. Right now we've
got two confirmed reproductions, while you're the only person who isn't
seeing this reproduce. The biggest difference is you've got a system
with an Intel processor.

Yet we already know not all SSDs are affected, so it could be that your
pair are ones which won't reproduce the issue. On top of that, similar
to the spurious interrupt issue, it could be less severe on Intel
processors, and that has kept you safe.

Presently the shortage of reports seems mostly attributable to few
people using RAID1 with SSDs.

-- 
(\___(\___(\__ --=> 8-) EHM <=-- __/)___/)___/)
\BS (| [email protected] PGP 87145445 |) /
\_CS\ | _ -O #include O- _ | / _/
8A19\___\_|_/58D2 7E3D DDF4 7BA6 <-PGP-> 41D1 B375 37D0 8714\_|_/___/5445
Bug#988477: [Pkg-xen-devel] Bug#988477: xen-hypervisor-4.14-amd64: xen dmesg shows (XEN) AMD-Vi: IO_PAGE_FAULT on sata pci device
On Sun, May 18, 2025 at 02:10:25PM +0200, Maximilian Engelhardt wrote:
> On Monday, 14 April 2025 00:22:01 CEST Elliott Mitchell wrote:
> >
> > Do any of the Debian maintainers have an AMD machine set up for
> > debugging? I'm not very well set up for debugging this particular
> > issue. If you've got an AMD machine with a pair of available SATA
> > ports (including SATA power!), I could send a pair of SATA devices
> > known to readily reproduce the issue.
>
> I'm not aware of anybody in our team having hardware where they can
> reproduce this issue, else I'm sure they would have already provided
> feedback here. There are also not many reports here of people running
> into this problem. Thus I assume it needs a special (and probably rare)
> hardware combination to trigger this.
> One thing I can add is that I have been running software raid1 with Xen
> on two SATA SSDs on an Intel CPU for many years without seeing any data
> corruption.

I'm skeptical of it being rare, but certainly uncommon. You've got some
similarity to the reproductions, but there are differences.

First question, what brand/model are the SSDs? Samsung SSDs are known to
be affected (severely affected for some models), while Crucial/Micron
SSDs are unaffected (some models might be mildly affected).

Second question, where are the SATA ports? Are they on-motherboard?
Add-on card? The reproductions were with on-motherboard ports.

What generation is your processor? Are you sure it has an IOMMU and Xen
is driving the IOMMU? I had suspected Intel systems would be affected,
but you may have disproven this.

> As the Debian package versions of xen, linux, etc. have changed a bit
> since the last time this issue was reported as reproduced in this bug,
> it would be good to get confirmation the problem is still there in
> Debian unstable or testing.

This is possible.
Bug#988477: [Pkg-xen-devel] Bug#988477: xen-hypervisor-4.14-amd64: xen dmesg shows (XEN) AMD-Vi: IO_PAGE_FAULT on sata pci device
On Monday, 14 April 2025 00:22:01 CEST Elliott Mitchell wrote:
> The analysis is that the "(XEN) AMD-Vi: IO_PAGE_FAULT" message and the
> software RAID data loss are distinct bugs. That patch/commit likely
> makes the correlated message disappear, but almost certainly leaves the
> software RAID data loss behind.
>
> Do any of the Debian maintainers have an AMD machine set up for
> debugging? I'm not very well set up for debugging this particular
> issue. If you've got an AMD machine with a pair of available SATA ports
> (including SATA power!), I could send a pair of SATA devices known to
> readily reproduce the issue.

I'm not aware of anybody in our team having hardware where they can
reproduce this issue, else I'm sure they would have already provided
feedback here. There are also not many reports here of people running
into this problem. Thus I assume it needs a special (and probably rare)
hardware combination to trigger this.

One thing I can add is that I have been running software raid1 with Xen
on two SATA SSDs on an Intel CPU for many years without seeing any data
corruption.

As the Debian package versions of xen, linux, etc. have changed a bit
since the last time this issue was reported as reproduced in this bug,
it would be good to get confirmation the problem is still there in
Debian unstable or testing.

