On Thu, Jan 18, 2024 at 01:56:51PM -0800, Dan Williams wrote:
> alison.schofield@ wrote:
> > From: Alison Schofield <[email protected]>
> >
> > Changes since v5:
> > - Use a private parser for cxl_poison events. (Dan)
> > Previously used the default parser and re-parsed per the cxl-list
> > needs. Replace that with a private parsing method for cxl_poison.
> > - Add a private context to support private parsers.
> > - Add helpers to use with the cxl_poison parser.
> > - cxl list json: drop nr_records field (Dan)
> > - cxl list option: replace "poison" w "media-errors" (Dan)
> > - cxl list json: replace "poison" w "media_errors" (Dan)
> > - Link to v5:
> > https://lore.kernel.org/linux-cxl/[email protected]/
> >
> >
> > Begin cover letter:
> >
> > Add the option to add a memory devices poison list to the cxl-list
> > json output. Offer the option by memdev and by region. Sample usage:
> >
> > # cxl list -m mem1 --media-errors
> > [
> > {
> > "memdev":"mem1",
> > "pmem_size":1073741824,
> > "ram_size":1073741824,
> > "serial":1,
> > "numa_node":1,
> > "host":"cxl_mem.1",
> > "media_errors":[
> > {
> > "dpa":0,
> > "dpa_length":64,
> > "source":"Injected"
> > },
> > {
> > "region":"region5",
>
> It feels odd to list the region here. I feel like what really matters is
> to list the endpoint decoder and if someone wants to associate endpoint
> decoder to region, or endpoint decoder to memdev there are other queries
> for that.
>
> Then this format does not change between the "region" listing and
> "memdev" listing, they both just output the endpoint decoder and leave
> the rest to follow-on queries.
>
> For example I expect operations software has already recorded the
> endpoint decoder to region mapping, so when this data comes in the
> endpoint decoder is a key to make that association. Otherwise:
>
> cxl list -RT -e $endpoint_decoder
>
> ...can answer follow up questions about what is impacted by a given
> media error record.
I see it as a convenience offering, but I'm starting to see that your
stance is probably that a cxl-list option should only list additional
info provided by the option, and not include info that can be retrieved
elsewhere w cxl-list.
I plan to make this change to endpoint as you suggest.
>
> > "dpa":1073741824,
> > "dpa_length":64,
>
> The dpa_length is also the hpa_length, right? So maybe just call the
> field "length".
>
No, the length only refers to the device address space. I don't think
the hpa is guaranteed to be contiguous, so only the starting hpa addr
is offered.
hmm..should we call it 'size' because that seems to imply less
contiguous-ness than length?
Which should it be 'dpa_length' or 'size' (or 'length')
> > "hpa":1035355557888,
> > "source":"Injected"
> > },
> > {
> > "region":"region5",
> > "dpa":1073745920,
> > "dpa_length":64,
> > "hpa":1035355566080,
> > "source":"Injected"
>
> This "source" field feels like debug data. In production nobody is going
> to be doing poison injection, and if the administrator injected it then
> its implied they know that status. Otherwise a media-error is a
> media-error regardless of the source.
>From CXL Spec Tabel 8-140 Sources can be:
Unknown.
External. Poison received from a source external to the device.
Internal. The device generated poison from an internal source.
Injected. The error was injected into the device for testing purposes.
Vendor Specific.
On the v5 review, Erwin commented:
>> This is how I would use source.
>> "external" = don't expect to see a cxl media error, look elsewhere like a
>> UCNA or a mem_data error in the RP's CXL.CM RAS.
>> "internal" = expect to see a media error for more information.
>> "injected" = somebody injected the error, no service action needed except to
>> maybe tighten up your security.
>> "vendor" = see vendor
If it's not presented here, user can look it up in the cxl_poison trace
event directly.
I think we should keep this as is.
>
> > }
> > ]
> > }
> > ]
> >
> > # cxl list -r region5 --media-errors
> > [
> > {
> > "region":"region5",
> > "resource":1035355553792,
> > "size":2147483648,
> > "type":"pmem",
> > "interleave_ways":2,
> > "interleave_granularity":4096,
> > "decode_state":"commit",
> > "media_errors":[
> > {
> > "memdev":"mem1",
> > "dpa":1073741824,
> > "dpa_length":64,
> > "hpa":1035355557888,
> > "source":"Injected"
> > },
> > {
> > "memdev":"mem1",
> > "dpa":1073745920,
> > "dpa_length":64,
> > "hpa":1035355566080,
> > "source":"Injected"
> > }
> > ]
> > }
> > ]
> >
> > Alison Schofield (7):
> > libcxl: add interfaces for GET_POISON_LIST mailbox commands
> > cxl: add an optional pid check to event parsing
> > cxl/event_trace: add a private context for private parsers
> > cxl/event_trace: add helpers get_field_[string|data]()
> > cxl/list: collect and parse media_error records
> > cxl/list: add --media-errors option to cxl list
> > cxl/test: add cxl-poison.sh unit test
> >
> > Documentation/cxl/cxl-list.txt | 71 +++++++++++
> > cxl/event_trace.c | 53 +++++++-
> > cxl/event_trace.h | 9 +-
> > cxl/filter.h | 3 +
> > cxl/json.c | 218 +++++++++++++++++++++++++++++++++
> > cxl/lib/libcxl.c | 47 +++++++
> > cxl/lib/libcxl.sym | 6 +
> > cxl/libcxl.h | 2 +
> > cxl/list.c | 2 +
> > test/cxl-poison.sh | 133 ++++++++++++++++++++
> > test/meson.build | 2 +
> > 11 files changed, 543 insertions(+), 3 deletions(-)
> > create mode 100644 test/cxl-poison.sh
> >
> >
> > base-commit: a871e6153b11fe63780b37cdcb1eb347b296095c
> > --
> > 2.37.3
> >
> >
>
>