Bug#988477: [Pkg-xen-devel] Bug#988477: xen-hypervisor-4.14-amd64: xen dmesg shows (XEN) AMD-Vi: IO_PAGE_FAULT on sata pci device

2025-07-03 Thread Elliott Mitchell
On Wed, May 28, 2025 at 05:21:00PM -0700, Elliott Mitchell wrote:
> On Sun, May 18, 2025 at 02:10:25PM +0200, Maximilian Engelhardt wrote:
> > On Montag, 14. April 2025 00:22:01 CEST Elliott Mitchell wrote:
> > > 
> > > Do any of the Debian maintainers have an AMD machine set up for debugging?
> > > I'm not very well set up for debugging this particular issue.  If you've
> > > got an AMD machine with a pair of available SATA ports (including SATA
> > > power!), I could send a pair of SATA devices known to readily reproduce
> > > the issue.
> > 
> > I'm not aware of anybody in our team having hardware where they can
> > reproduce this issue, else I'm sure they would have already provided
> > feedback here. There are also not many reports here of people running
> > into this problem. Thus I assume it needs a special (and probably rare)
> > hardware combination to trigger this.
> > One thing I can add is that I have been running software raid1 with Xen
> > on two SATA SSDs on an Intel CPU for many years without seeing any data
> > corruption.
> 
> I'm skeptical of it being rare, but certainly uncommon.  You've got some
> similarity to the reproductions, but there are differences.
> 
> First question, what brand/model are the SSDs?  Samsung SSDs are known to
> be affected (severely affected for some models), while Crucial/Micron
> SSDs are unaffected (some models might be mildly affected).
> 
> Second question, where are the SATA ports?  Are they on-motherboard?  Add-on
> card?  The reproductions were with on-motherboard ports.
> 
> What generation is your processor?  Are you sure it has an IOMMU and Xen
> is driving the IOMMU?  I had suspected Intel systems would be affected,
> but you may have disproven this.

Uh.  I had hoped you could help narrow things down some.  Right now
we've got two confirmed reproductions, while you're the only person who
isn't seeing this reproduce.

The biggest difference is you've got a system with an Intel processor.
Yet we already know not all SSDs are affected, so it could be that your
pair are ones which won't reproduce the issue.  On top of that, similar to
the spurious interrupt issue, it could be less severe on Intel processors,
and that has kept you safe.

Presently the shortage of reports seems mostly attributable to few people
using RAID1 with SSDs.


-- 
(\___(\___(\__  --=> 8-) EHM <=--  __/)___/)___/)
 \BS (| [email protected]  PGP 87145445 |)   /
  \_CS\   |  _  -O #include  O-   _  |   /  _/
8A19\___\_|_/58D2 7E3D DDF4 7BA6 <-PGP-> 41D1 B375 37D0 8714\_|_/___/5445



Bug#988477: [Pkg-xen-devel] Bug#988477: xen-hypervisor-4.14-amd64: xen dmesg shows (XEN) AMD-Vi: IO_PAGE_FAULT on sata pci device

2025-05-28 Thread Elliott Mitchell
On Sun, May 18, 2025 at 02:10:25PM +0200, Maximilian Engelhardt wrote:
> On Montag, 14. April 2025 00:22:01 CEST Elliott Mitchell wrote:
> > 
> > Do any of the Debian maintainers have an AMD machine set up for debugging?
> > I'm not very well set up for debugging this particular issue.  If you've
> > got an AMD machine with a pair of available SATA ports (including SATA
> > power!), I could send a pair of SATA devices known to readily reproduce
> > the issue.
> 
> I'm not aware of anybody in our team having hardware where they can
> reproduce this issue, else I'm sure they would have already provided
> feedback here. There are also not many reports here of people running
> into this problem. Thus I assume it needs a special (and probably rare)
> hardware combination to trigger this.
> One thing I can add is that I have been running software raid1 with Xen
> on two SATA SSDs on an Intel CPU for many years without seeing any data
> corruption.

I'm skeptical of it being rare, but certainly uncommon.  You've got some
similarity to the reproductions, but there are differences.

First question, what brand/model are the SSDs?  Samsung SSDs are known to
be affected (severely affected for some models), while Crucial/Micron
SSDs are unaffected (some models might be mildly affected).

Second question, where are the SATA ports?  Are they on-motherboard?  Add-on
card?  The reproductions were with on-motherboard ports.

What generation is your processor?  Are you sure it has an IOMMU and Xen
is driving the IOMMU?  I had suspected Intel systems would be affected,
but you may have disproven this.

> As the Debian package versions of xen, linux, etc. have changed a bit since the 
> last time this issue was reported as reproduced in this bug, it would be good 
> to get confirmation the problem is still there in Debian unstable or testing.

This is possible.


-- 
(\___(\___(\__  --=> 8-) EHM <=--  __/)___/)___/)
 \BS (| [email protected]  PGP 87145445 |)   /
  \_CS\   |  _  -O #include  O-   _  |   /  _/
8A19\___\_|_/58D2 7E3D DDF4 7BA6 <-PGP-> 41D1 B375 37D0 8714\_|_/___/5445



Bug#988477: [Pkg-xen-devel] Bug#988477: xen-hypervisor-4.14-amd64: xen dmesg shows (XEN) AMD-Vi: IO_PAGE_FAULT on sata pci device

2025-05-18 Thread Maximilian Engelhardt
On Montag, 14. April 2025 00:22:01 CEST Elliott Mitchell wrote:
> The analysis is that the "(XEN) AMD-Vi: IO_PAGE_FAULT" message and the
> software RAID data loss are distinct bugs.  That patch/commit likely
> makes the correlated message disappear, but almost certainly leaves the
> software RAID data loss behind.
> 
> Do any of the Debian maintainers have an AMD machine set up for debugging?
> I'm not very well set up for debugging this particular issue.  If you've
> got an AMD machine with a pair of available SATA ports (including SATA
> power!), I could send a pair of SATA devices known to readily reproduce
> the issue.

I'm not aware of anybody in our team having hardware where they can reproduce 
this issue, else I'm sure they would have already provided feedback here. 
There are also not many reports here of people running into this problem. Thus 
I assume it needs a special (and probably rare) hardware combination to 
trigger this.
One thing I can add is that I have been running software raid1 with Xen on two 
SATA SSDs on an Intel CPU for many years without seeing any data corruption.

As the Debian package versions of xen, linux, etc. have changed a bit since the 
last time this issue was reported as reproduced in this bug, it would be good 
to get confirmation the problem is still there in Debian unstable or testing.



