Some comments inline -----Original Message----- From: Dan Williams <[email protected]> Sent: Wednesday, December 6, 2023 8:23 PM To: Schofield, Alison <[email protected]>; Verma, Vishal L <[email protected]> Cc: Schofield, Alison <[email protected]>; [email protected]; [email protected] Subject: Re: [ndctl PATCH v5 4/5] cxl/list: add --poison option to cxl list
alison.schofield@ wrote: > From: Alison Schofield <[email protected]> > > The --poison option to 'cxl list' retrieves poison lists from memory > devices supporting the capability and displays the returned poison > records in the cxl list json. This option can apply to memdevs or > regions. > > Example usage in the Documentation/cxl/cxl-list.txt update. > > Signed-off-by: Alison Schofield <[email protected]> > --- > Documentation/cxl/cxl-list.txt | 58 ++++++++++++++++++++++++++++++++++ > cxl/filter.h | 3 ++ > cxl/list.c | 2 ++ > 3 files changed, 63 insertions(+) > > diff --git a/Documentation/cxl/cxl-list.txt > b/Documentation/cxl/cxl-list.txt index 838de4086678..ee2f1b2d9fae > 100644 > --- a/Documentation/cxl/cxl-list.txt > +++ b/Documentation/cxl/cxl-list.txt > @@ -415,6 +415,64 @@ OPTIONS > --region:: > Specify CXL region device name(s), or device id(s), to filter the > listing. > > +-L:: > +--poison:: > + Include poison information. The poison list is retrieved from the > + device(s) and poison records are added to the listing. Apply this > + option to memdevs and regions where devices support the poison > + list capability. While CXL calls it "poison" I am not convinced that's the term that end users can universally use for this. This is why "ndctl list" uses -M, but yeah, -M and -P are already taken. Even -E is taken for "errors". > + > +---- > +# cxl list -m mem11 --poison > +[ > + { > + "memdev":"mem11", > + "pmem_size":268435456, > + "ram_size":0, > + "serial":0, > + "host":"0000:37:00.0", > + "poison":{ > + "nr_records":1, > + "records":[ One cleanup I want to see before this goes live... drop nr_records and just make "poison" an array object directly. The number of records is trivially determined by the jq "len" operator. Also, per above rename "poison" to "media_errors". I believe "poison" is an x86'ism where "media_error" is a more generic term. [ETT] The problem is that a poison can be written into CXL and is NOT a media error. Generally... the only way a customer should interpret that a DPA is "poisoned" is that if you consume that address, you'll end up in a MCA-Recovery path. It does not indicate media health. But see discussion later about "source" > + { > + "dpa":0, > + "dpa_length":64, > + "source":"Internal", > + } > + ] > + } > + } > +] > +# cxl list -r region5 --poison > +[ > + { > + "region":"region5", > + "resource":1035623989248, > + "size":2147483648, > + "interleave_ways":2, > + "interleave_granularity":4096, > + "decode_state":"commit", > + "poison":{ > + "nr_records":2, > + "records":[ > + { > + "memdev":"mem2", > + "dpa":0, > + "dpa_length":64, Does length need to be prefixed with "dpa_"? > + "source":"Internal", I am not sure what the end user can do with "source"? I have tended to not emit things if I can't think of a use case for the field to be there. [ETT] The sources are "external" (poison written into CXL), "internal" (the device generated the poison), "injected" (ie via poison inject) and "vendor specific". This is how I would use source. "external" = don't expect to see a cxl media error, look elsewhere like a UCNA or a mem_data error in the RP's CXL.CM RAS. "internal" = expect to see a media error for more information. "injected" = somebody injected the error, no service action needed except to maybe tighten up your security. "vendor" = see vendor How about a HPA? :D
