Alison Schofield wrote:
[..]
> > > "dpa":1073741824,
> > > "dpa_length":64,
> >
> > The dpa_length is also the hpa_length, right? So maybe just call the
> > field "length".
> >
>
> No, the length only refers to the device address space. I don't think
> the hpa is guaranteed to be contiguous, so only the starting hpa addr
> is offered.
>
> hmm..should we call it 'size' because that seems to imply less
> contiguous-ness than length?

The only way the length could be discontiguous in HPA space is if the
error length is greater than the interleave granularity. Given poison is
tracked in cachelines and the smallest granularity is 4 cachelines, it is
unlikely to hit the multiple HPA case.

However, I think the kernel side should aim to preclude that from
happening. Given that this is relying on the kernel's translation I
would make it so that the kernel never leaves the impacted HPAs as
ambiguous. For example, if the interleave_granularity of the region is
256 and the DPA length is 512, it would be helpful if the *kernel* split
that into multiple trace events to communicate the multiple impacted
HPAs rather than leave it as an exercise to userspace.
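
To make the split concrete, here is a rough sketch of the arithmetic in
Python, not kernel code. It assumes granularity boundaries are measured
from a per-decoder DPA base; split_by_granularity() and dpa_base are
names made up for illustration, and the DPA to HPA translation itself is
left out.

```python
def split_by_granularity(dpa, length, granularity, dpa_base=0):
    """Split [dpa, dpa + length) at interleave-granularity boundaries.

    Boundaries are assumed to fall at multiples of `granularity` measured
    from `dpa_base` (a hypothetical parameter). Returns a list of
    (dpa, length) tuples, none of which crosses a boundary, so each one
    would translate to a single contiguous HPA range.
    """
    if length <= 0 or granularity <= 0:
        raise ValueError("length and granularity must be positive")
    if dpa < dpa_base:
        raise ValueError("dpa must not be below dpa_base")

    chunks = []
    end = dpa + length
    while dpa < end:
        # First granularity boundary strictly above the current DPA.
        offset = dpa - dpa_base
        boundary = dpa_base + (offset // granularity + 1) * granularity
        chunk_end = min(end, boundary)
        chunks.append((dpa, chunk_end - dpa))
        dpa = chunk_end
    return chunks


# The example from above: granularity 256, DPA length 512.
print(split_by_granularity(1073741824, 512, 256))
# -> [(1073741824, 256), (1073742080, 256)]
```

A 64 byte record comes back as a single entry, so the common case would
still be one trace event per poison record.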

> Which should it be 'dpa_length' or 'size' (or 'length')

I recall we used "length" for the number of badblocks in "ndctl list
--media-errors", might as well keep it consistent.

> > > "hpa":1035355557888,
> > > "source":"Injected"
> > > },
> > > {
> > > "region":"region5",
> > > "dpa":1073745920,
> > > "dpa_length":64,
> > > "hpa":1035355566080,
> > > "source":"Injected"
> >
> > This "source" field feels like debug data. In production nobody is going
> > to be doing poison injection, and if the administrator injected it then
> > > it's implied they know that status. Otherwise a media-error is a
> > media-error regardless of the source.
>
> From CXL Spec Table 8-140 Sources can be:
>
> Unknown.
> External. Poison received from a source external to the device.
> Internal. The device generated poison from an internal source.
> Injected. The error was injected into the device for testing purposes.
> Vendor Specific.
>
> On the v5 review, Erwin commented:
> >> This is how I would use source.
> >> "external" = don't expect to see a cxl media error, look elsewhere like a
> >> UCNA or a mem_data error in the RP's CXL.CM RAS.
> >> "internal" = expect to see a media error for more information.
> >> "injected" = somebody injected the error, no service action needed except
> >> to maybe tighten up your security.
> >> "vendor" = see vendor
>
> If it's not presented here, user can look it up in the cxl_poison trace
> event directly.
>
> I think we should keep this as is.

Ah, I had forgotten Erwin's comment. Yeah, showing "external" vs
"internal" looks useful, "injected" gets to come along for the ride, and
if any vendor actually ships that "vendor" status, that's a good
indication to the end user to go shopping for a device that plays better
with open standards.

Might be useful to capture Erwin's analysis of how to use that field in
the man page, if it's not there already.
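
For reference, Erwin's usage notes could be summarized as a small lookup
like the sketch below. Only "Injected" is confirmed by the sample output
above; the other source strings are assumed to follow the spec names,
and advice_for_source() plus the fallback text for unrecognized values
are made up for illustration.

```python
# Guidance per poison source, paraphrased from Erwin's v5 review comment.
# The key spellings other than "injected" are assumptions.
SOURCE_ADVICE = {
    "external": "Don't expect a CXL media error; look elsewhere, e.g. a "
                "UCNA or a mem_data error in the RP's CXL.CM RAS.",
    "internal": "Expect a media error record with more information.",
    "injected": "Somebody injected the error; no service action needed "
                "except maybe tightening up security.",
    "vendor": "See the vendor.",
}


def advice_for_source(source):
    """Return the guidance for a "source" value (case-insensitive).

    Values without guidance in the thread (e.g. "Unknown") get a generic
    fallback, which is an assumption of this sketch.
    """
    key = source.strip().lower()
    return SOURCE_ADVICE.get(
        key, "No specific guidance; check the cxl_poison trace event."
    )


print(advice_for_source("Injected"))
```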