Some comments inline

-----Original Message-----
From: Dan Williams <[email protected]>
Sent: Wednesday, December 6, 2023 8:23 PM
To: Schofield, Alison <[email protected]>; Verma, Vishal L 
<[email protected]>
Cc: Schofield, Alison <[email protected]>; [email protected]; 
[email protected]
Subject: Re: [ndctl PATCH v5 4/5] cxl/list: add --poison option to cxl list

alison.schofield@ wrote:
> From: Alison Schofield <[email protected]>
>
> The --poison option to 'cxl list' retrieves poison lists from memory
> devices supporting the capability and displays the returned poison
> records in the cxl list json. This option can apply to memdevs or
> regions.
>
> Example usage in the Documentation/cxl/cxl-list.txt update.
>
> Signed-off-by: Alison Schofield <[email protected]>
> ---
>  Documentation/cxl/cxl-list.txt | 58 ++++++++++++++++++++++++++++++++++
>  cxl/filter.h                   |  3 ++
>  cxl/list.c                     |  2 ++
>  3 files changed, 63 insertions(+)
>
> diff --git a/Documentation/cxl/cxl-list.txt
> b/Documentation/cxl/cxl-list.txt index 838de4086678..ee2f1b2d9fae
> 100644
> --- a/Documentation/cxl/cxl-list.txt
> +++ b/Documentation/cxl/cxl-list.txt
> @@ -415,6 +415,64 @@ OPTIONS
>  --region::
>       Specify CXL region device name(s), or device id(s), to filter the 
> listing.
>
> +-L::
> +--poison::
> +     Include poison information. The poison list is retrieved from the
> +     device(s) and poison records are added to the listing. Apply this
> +     option to memdevs and regions where devices support the poison
> +     list capability.

While CXL calls it "poison" I am not convinced that's the term that end users 
can universally use for this. This is why "ndctl list" uses -M, but yeah, -M 
and -P are already taken. Even -E is taken for "errors".

> +
> +----
> +# cxl list -m mem11 --poison
> +[
> +  {
> +    "memdev":"mem11",
> +    "pmem_size":268435456,
> +    "ram_size":0,
> +    "serial":0,
> +    "host":"0000:37:00.0",
> +    "poison":{
> +      "nr_records":1,
> +      "records":[

One cleanup I want to see before this goes live... drop nr_records and just 
make "poison" an array object directly. The number of records is trivially 
determined by the jq "len" operator.

Also, per above rename "poison" to "media_errors". I believe "poison" is an 
x86'ism where "media_error" is a more generic term.
[ETT] The problem is that a poison can be written into CXL and is NOT a media 
error.  Generally... the only way a customer should interpret that a DPA is 
"poisoned" is that if you consume that address, you'll end up in a MCA-Recovery 
path.  It does not indicate media health.  But see discussion later about 
"source"

> +        {
> +          "dpa":0,
> +          "dpa_length":64,
> +          "source":"Internal",
> +        }
> +      ]
> +    }
> +  }
> +]
> +# cxl list -r region5 --poison
> +[
> +  {
> +    "region":"region5",
> +    "resource":1035623989248,
> +    "size":2147483648,
> +    "interleave_ways":2,
> +    "interleave_granularity":4096,
> +    "decode_state":"commit",
> +    "poison":{
> +      "nr_records":2,
> +      "records":[
> +        {
> +          "memdev":"mem2",
> +          "dpa":0,
> +          "dpa_length":64,

Does length need to be prefixed with "dpa_"?

> +          "source":"Internal",

I am not sure what the end user can do with "source"? I have tended to not emit 
things if I can't think of a use case for the field to be there.
[ETT] The sources are "external" (poison written into CXL), "internal" (the 
device generated the poison), "injected" (ie via poison inject) and "vendor 
specific".
This is how I would use source.
"external" = don't expect to see a cxl media error, look elsewhere like a UCNA 
or a mem_data error in the RP's CXL.CM RAS.
"internal" = expect to see a media error for more information.
"injected" = somebody injected the error, no service action needed except to 
maybe tighten up your security.
"vendor" = see vendor

How about a HPA? :D

Reply via email to