Re: [RFC PATCH] cxl: avoid duplicating report from MCE & device

2024-06-21 Thread Dan Williams
Shiyang Ruan wrote:
> Background:
> Since a CXL device is a memory device, when the CPU consumes a poison page
> of a CXL device it always triggers an MCE by interrupt (INT18), no matter
> which "-First" path is configured.  This is the first report.  Then
> currently, in FW-First path, the poison event is transferred according 
> to the following process: CXL device -> firmware -> OS:ACPI->APEI->GHES 
>  -> CPER -> trace report.  This is the second one.  These two reports
> indicate the same poisoned page, which is the so-called "duplicate
> report"[1].  And the memory_failure() handling I'm trying to add in
> OS-First path could also be another duplicate report.
> 
> Hope the flow below could make it easier to understand:
> CPU accesses bad memory on CXL device, then
>  -> MCE (INT18), *always* report (1)
>  -> * FW-First (implemented now)
>   -> CXL device -> FW
> -> OS:ACPI->APEI->GHES->CPER -> trace report (2.a)
> * OS-First (not implemented yet, I'm working on it)
>   -> CXL device -> MSI
> -> OS:CXL driver -> memory_failure() (2.b)
> so, the (1) and (2.a/b) are duplicated.
> 
> (I didn't get a response to my reply in [1], but I have to make a patch to
> solve this problem, so please correct me if my understanding is wrong.)

The CPU MCE may not be in the loop. Consider the case of patrol scrub,
or device-DMA accessing poison. In that case the device will signal a
component event and the CPU may never issue the MCE.

What is missing for me from this description is *why* does the duplicate
report matter in practice? If all that happens is that the kernel
repeats the lookup to offline the page and set the HWPoison bit, is that
duplicated work worth adding more tracking?

> This patch adds a new notifier_block, with priority MCE_PRIO_CXL, into
> `x86_mce_decoder_chain` so that the CXL memdev can check whether the
> current poison page has already been reported (if yes, stop the notifier
> chain and skip the subsequent memory_failure() report).  In this way, if
> the poison page was already handled (recorded and reported) in (1) or (2),
> the other path won't duplicate the report.  The record can be cleared when
> cxl_clear_poison() is called.
> 
> [1] 
> https://lore.kernel.org/linux-cxl/664d948fb86f0_e8be29...@dwillia2-mobl3.amr.corp.intel.com.notmuch/
> 
> Signed-off-by: Shiyang Ruan 
> ---
>  arch/x86/include/asm/mce.h |   1 +
>  drivers/cxl/core/mbox.c| 130 +
>  drivers/cxl/core/memdev.c  |   6 +-
>  drivers/cxl/cxlmem.h   |   3 +
>  4 files changed, 139 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
> index dfd2e9699bd7..d8109c48e7d9 100644
> --- a/arch/x86/include/asm/mce.h
> +++ b/arch/x86/include/asm/mce.h
> @@ -182,6 +182,7 @@ enum mce_notifier_prios {
>   MCE_PRIO_NFIT,
>   MCE_PRIO_EXTLOG,
>   MCE_PRIO_UC,
> + MCE_PRIO_CXL,
>   MCE_PRIO_EARLY,
>   MCE_PRIO_CEC,
>   MCE_PRIO_HIGHEST = MCE_PRIO_CEC
> diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> index 2626f3fff201..0eb3c5401e81 100644
> --- a/drivers/cxl/core/mbox.c
> +++ b/drivers/cxl/core/mbox.c
> @@ -4,6 +4,8 @@
>  #include 
>  #include 
>  #include 
> +#include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -880,6 +882,9 @@ void cxl_event_trace_record(const struct cxl_memdev 
> *cxlmd,
>   if (cxlr)
>   hpa = cxl_trace_hpa(cxlr, cxlmd, dpa);
>  
> + if (hpa != ULLONG_MAX && cxl_mce_recorded(hpa))
> + return;
> +
>   if (event_type == CXL_CPER_EVENT_GEN_MEDIA)
>   trace_cxl_general_media(cxlmd, type, cxlr, hpa,
>   &evt->gen_media);
> @@ -1408,6 +1413,127 @@ int cxl_poison_state_init(struct cxl_memdev_state 
> *mds)
>  }
>  EXPORT_SYMBOL_NS_GPL(cxl_poison_state_init, CXL);
>  
> +struct cxl_mce_record {
> + struct list_head node;
> + u64 hpa;
> +};
> +LIST_HEAD(cxl_mce_records);
> +DEFINE_MUTEX(cxl_mce_mutex);

I would recommend an xarray for this use case as that already has its
own internal locking and efficient memory allocation for new nodes.
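For illustration, a minimal sketch of what xarray-based tracking might
look like (names are hypothetical, not from the patch; one entry per
poisoned page, keyed by PFN):

static DEFINE_XARRAY(cxl_poison_xa);

static int cxl_mce_record(u64 hpa)
{
	/* xa_insert() fails with -EBUSY if the PFN is already tracked */
	return xa_insert(&cxl_poison_xa, PHYS_PFN(hpa), xa_mk_value(1),
			 GFP_KERNEL);
}

static bool cxl_mce_recorded(u64 hpa)
{
	return xa_load(&cxl_poison_xa, PHYS_PFN(hpa)) != NULL;
}

static void cxl_mce_clear(u64 hpa)
{
	/* for the cxl_clear_poison() path */
	xa_erase(&cxl_poison_xa, PHYS_PFN(hpa));
}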

However, the "why" question needs to be answered first.



Re: [PATCH v3 2/2] cxl/core: add poison creation event handler

2024-05-21 Thread Dan Williams
Shiyang Ruan wrote:
[..]
> >> My expectation is MF_ACTION_REQUIRED is not appropriate for CXL event
> >> reported errors since action is only required for direct consumption
> >> events and those need not be reported through the device event queue.
> > Got it.
> 
> I'm not very sure about the 'Host write/read' type.  In my opinion, these 
> two types of event should be sent from the device when the CPU is accessing a 
> bad memory address, so they could be thought of as a sync event which needs 

Hmm, no that's not my understanding of a sync event. I expect when error
notifications are synchronous the CPU is guaranteed not to make forward
progress past the point of encountering the error. MSI-signaled
component-events are always asynchronous by that definition because the
CPU is free running while the interrupt is in-flight.

> the 'MF_ACTION_REQUIRED' flag.  Then, we can determine the flag by the 
> types like this:
> - CXL_EVENT_TRANSACTION_READ | CXL_EVENT_TRANSACTION_WRITE
>=> MF_ACTION_REQUIRED
> - CXL_EVENT_TRANSACTION_INJECT_POISON => MF_SW_SIMULATED
> - others  => 0

I doubt any reasonable policy can be inferred from the transaction type.
Consider that the CPU itself does not take a synchronous exception when
writes encounter poison. At most those are flagged via CMCI
(corrected machine check interrupt). The only events that cause
exceptions are CPU reads that consume poison. The device has no idea
whether read events are coming from a CPU or a DMA event.

MF_SW_SIMULATED is purely for software simulated poison events as
injected poison can still cause system-fatal damage if the poison is
ingested in an unrecoverable path.

So, I think all CXL poison notification events should trigger an action
optional memory_failure(). I expect this needs to make sure that
duplicates are not a problem. I.e. in the case of CPU consumption of CXL
poison, that causes a synchronous MF_ACTION_REQUIRED event via the MCE
path *and* it may trigger the device to send an error record for the
same page. As far as I can see, duplicate reports (MCE + CXL device) are
unavoidable.
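(For reference, the "action optional" variant of the report amounts to
something like the sketch below; passing flags of 0 leaves
MF_ACTION_REQUIRED clear:)

	/* sketch: device-reported poison as an "action optional" event */
	if (hpa != ULLONG_MAX)
		memory_failure_queue(PHYS_PFN(hpa), 0);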



Re: [PATCH v3 2/2] cxl/core: add poison creation event handler

2024-04-23 Thread Dan Williams
Shiyang Ruan wrote:
> Currently driver only traces cxl events, poison creation (for both vmem
> and pmem type) on cxl memdev is silent.

As it should be.

> The OS needs to be notified so it can handle poison pages in time.

No, it was always the case that latent poison is an "action optional"
event. I am not understanding the justification for this approach. What
breaks if the kernel does not forward events to memory_failure_queue()?

Consider that in the CPU consumption case that the firmware first path
will do its own memory_failure_queue() and in the native case the MCE
handler will take care of this. So that leaves pages that are accessed
by DMA or background operation that encounter poison. Those are "action
optional" scenarios and it is not clear to me how the driver tells the
difference.

This needs more precision on which agent is responsible for what level
of reporting. The distribution of responsibility between ACPI GHES,
EDAC, and the CXL driver is messy and I expect this changelog to
demonstrate it understands all those considerations.

> Per CXL spec, the device error event could be signaled through
> FW-First and OS-First methods.
> 
> So, add poison creation event handler in OS-First method:
>   - Qemu:

Why is QEMU relevant for this patch? QEMU is only a development vehicle;
the upstream enabling should reference shipping, or expected-to-ship,
hardware implementations.

> - CXL device reports POISON creation event to OS by MSI by sending
>   GMER/DER after injecting a poison record;

When you say "inject" here do you mean "add to the poison list if
present". Because "inject" to me means the "Inject Poison" Memory Device
Command.

>   - CXL driver:
> a. parse the POISON event from GMER/DER;
> b. translate poisoned DPA to HPA (PFN);
> c. enqueue poisoned PFN to memory_failure's work queue;
> 
> Signed-off-by: Shiyang Ruan 
> ---
>  drivers/cxl/core/mbox.c   | 119 +-
>  drivers/cxl/cxlmem.h  |   8 +--
>  include/linux/cxl-event.h |  18 +-
>  3 files changed, 125 insertions(+), 20 deletions(-)
> 
> diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> index f0f54aeccc87..76af0d73859d 100644
> --- a/drivers/cxl/core/mbox.c
> +++ b/drivers/cxl/core/mbox.c
> @@ -837,25 +837,116 @@ int cxl_enumerate_cmds(struct cxl_memdev_state *mds)
>  }
>  EXPORT_SYMBOL_NS_GPL(cxl_enumerate_cmds, CXL);
>  
> -void cxl_event_trace_record(const struct cxl_memdev *cxlmd,
> - enum cxl_event_log_type type,
> - enum cxl_event_type event_type,
> - const uuid_t *uuid, union cxl_event *evt)
> +static void cxl_report_poison(struct cxl_memdev *cxlmd, struct cxl_region 
> *cxlr,
> +   u64 dpa)
>  {
> - if (event_type == CXL_CPER_EVENT_GEN_MEDIA)
> + u64 hpa = cxl_trace_hpa(cxlr, cxlmd, dpa);
> + unsigned long pfn = PHYS_PFN(hpa);
> +
> + if (!IS_ENABLED(CONFIG_MEMORY_FAILURE))
> + return;

No need for this check, memory_failure_queue() is already stubbed out in
the CONFIG_MEMORY_FAILURE=n case.

> + memory_failure_queue(pfn, MF_ACTION_REQUIRED);

My expectation is MF_ACTION_REQUIRED is not appropriate for CXL event
reported errors since action is only required for direct consumption
events and those need not be reported through the device event queue.

It would be useful to collaborate with a BIOS firmware engineer so that
the kernel ends up with similar logic as is used to set CPER record
severity, or at least understands why it would want to be different.
See how ghes_handle_memory_failure() determines the
memory_failure_queue() flags.
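(Paraphrasing that GHES logic from memory, the shape is roughly the
following; the exact conditions should be checked against
drivers/acpi/apei/ghes.c:)

	int flags = -1;

	/* corrected-but-threshold-exceeded errors get soft offline */
	if (sec_sev == GHES_SEV_CORRECTED &&
	    (gdata->flags & CPER_SEC_ERROR_THRESHOLD_EXCEEDED))
		flags = MF_SOFT_OFFLINE;
	/* recoverable errors get "action optional" handling */
	if (sev == GHES_SEV_RECOVERABLE && sec_sev == GHES_SEV_RECOVERABLE)
		flags = 0;
	if (flags != -1)
		memory_failure_queue(pfn, flags);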



Re: [PATCH v3 1/2] cxl/core: correct length of DPA field masks

2024-04-23 Thread Dan Williams
Shiyang Ruan wrote:
> The length of Physical Address in General Media Event Record/DRAM Event
> Record is 64-bit, so the field mask should be defined as such length.
> Otherwise, this causes cxl_general_media and cxl_dram tracepoints to
> mask off the upper-32-bits of DPA addresses. The cxl_poison event is
> unaffected.
> 
> If userspace was doing its own DPA-to-HPA translation this could lead to
> incorrect page retirement decisions, but there is no known consumer
> (like rasdaemon) of this event today.
> 
> Fixes: d54a531a430b ("cxl/mem: Trace General Media Event Record")
> Cc: 
> Cc: Dan Williams 
> Cc: Davidlohr Bueso 
> Cc: Jonathan Cameron 
> Cc: Ira Weiny 
> Signed-off-by: Shiyang Ruan 
> ---
>  drivers/cxl/core/trace.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/cxl/core/trace.h b/drivers/cxl/core/trace.h
> index e5f13260fc52..cdfce932d5b1 100644
> --- a/drivers/cxl/core/trace.h
> +++ b/drivers/cxl/core/trace.h
> @@ -253,7 +253,7 @@ TRACE_EVENT(cxl_generic_event,
>   * DRAM Event Record
>   * CXL rev 3.0 section 8.2.9.2.1.2; Table 8-44
>   */
> -#define CXL_DPA_FLAGS_MASK   0x3F
> +#define CXL_DPA_FLAGS_MASK   0x3FULL
>  #define CXL_DPA_MASK (~CXL_DPA_FLAGS_MASK)
>  
>  #define CXL_DPA_VOLATILE BIT(0)

Looks good,

Reviewed-by: Dan Williams 



Re: [RFC PATCH v2 5/6] cxl: add definition for transaction types

2024-03-29 Thread Dan Williams
Shiyang Ruan wrote:
> The transaction types are defined in General Media Event Record/DRAM Event
> per CXL rev 3.0 Section 8.2.9.2.1.1; Table 8-43 and
> Section 8.2.9.2.1.2; Table 8-44.  Add them for Event Record handler use.

Combine this patch with the one that uses them so that the use case can
be reviewed together with the implementation.



Re: [RFC PATCH v2 4/6] cxl/core: report poison when injecting from debugfs

2024-03-29 Thread Dan Williams
Shiyang Ruan wrote:
> Poison injection from debugfs is silent too.  Add calling
> cxl_mem_report_poison() to make it able to do memory_failure().

Why does this need to be signalled? It is a debug interface, the
debugger can also trigger a read after the injection, or trigger page
soft-offline.



Re: [RFC PATCH v2 3/6] cxl/core: add report option for cxl_mem_get_poison()

2024-03-29 Thread Dan Williams
Shiyang Ruan wrote:
> The GMER only has a "Physical Address" field, nothing that indicates length.
> So, when a poison event is received, we could use GET_POISON_LIST command
> to get the poison list.  Now driver has cxl_mem_get_poison(), so
> reuse it and add a parameter 'bool report', report poison record to MCE
> if set true.

I am not sure I agree with the rationale here because there is no
correlation between the event being signaled and the current state of
the poison list. It also establishes a race between multiple GMER events,
i.e. imagine the hardware sends 4 GMER events to communicate a 256B
poison discovery event. Does the driver need logic to support GMER events
2, 3, and 4 if it already saw all 256B of poison after processing GMER
event 1?

I think the best the driver can do is assume at least 64B of poison
per-event and depend on multiple notifications to handle larger poison
lengths.
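Under that assumption the per-event handling could be as simple as this
sketch (field and helper names borrowed from the surrounding patches, not
a tested implementation):

	/* one GMER == at least 64B of poison == one page to fail */
	u64 dpa = le64_to_cpu(rec->phys_addr) & CXL_DPA_MASK;
	u64 hpa = cxl_trace_hpa(cxlr, cxlmd, dpa);

	if (hpa != ULLONG_MAX)
		memory_failure_queue(PHYS_PFN(hpa), 0); /* action optional */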

Otherwise, the poison list is really only useful for pre-populating
pages to offline after a reboot, i.e. to catch the kernel up with the
state of poison pages after a reboot.



Re: [RFC PATCH v2 2/6] cxl/core: introduce cxl_mem_report_poison()

2024-03-29 Thread Dan Williams
Shiyang Ruan wrote:
> If poison is detected (reported from the cxl memdev), the OS should be notified to
> handle it. So, introduce this helper function for later use:
>   1. translate DPA to HPA;
>   2. enqueue records into memory_failure's work queue;
> 
> Signed-off-by: Shiyang Ruan 

This patch is too small, it needs the corresponding caller to make sense
of the proposed change.

> ---
> 
> Currently poison injection from debugfs always create a 64-bytes-length
> record, which is fine.  But the injection from qemu's QMP API:
> qmp_cxl_inject_poison() could create a poison record with a big length,
> which may cause many repeated calls to memory_failure_queue().
> Though MEMORY_FAILURE_FIFO_SIZE is 1 << 4, it seems not enough.

What matters is what devices do in practice, the kernel should not be
worried about odd corner cases that only exist in QEMU injection
scenarios.



Re: [RFC PATCH v2 1/6] cxl/core: correct length of DPA field masks

2024-03-29 Thread Dan Williams
Shiyang Ruan wrote:
> The length of Physical Address in General Media Event Record/DRAM Event
> Record is 64-bit, so the field mask should be defined as such length.
> Otherwise, this causes cxl_general_media and cxl_dram tracepoints to
> mask off the upper-32-bits of DPA addresses. The cxl_poison event is
> unaffected.
> 
> If userspace was doing its own DPA-to-HPA translation this could lead to
> incorrect page retirement decisions, but there is no known consumer
> (like rasdaemon) of this event today.
> 
> Fixes: d54a531a430b ("cxl/mem: Trace General Media Event Record")
> Cc: 
> Cc: Dan Williams 
> Cc: Davidlohr Bueso 
> Cc: Jonathan Cameron 
> Cc: Ira Weiny 
> Signed-off-by: Shiyang Ruan 
> ---
>  drivers/cxl/core/trace.h | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/cxl/core/trace.h b/drivers/cxl/core/trace.h
> index e5f13260fc52..e2d1f296df97 100644
> --- a/drivers/cxl/core/trace.h
> +++ b/drivers/cxl/core/trace.h
> @@ -253,11 +253,11 @@ TRACE_EVENT(cxl_generic_event,
>   * DRAM Event Record
>   * CXL rev 3.0 section 8.2.9.2.1.2; Table 8-44
>   */
> -#define CXL_DPA_FLAGS_MASK   0x3F
> +#define CXL_DPA_FLAGS_MASK   0x3FULL

This change makes sense...

>  #define CXL_DPA_MASK (~CXL_DPA_FLAGS_MASK)
>  
> -#define CXL_DPA_VOLATILE BIT(0)
> -#define CXL_DPA_NOT_REPAIRABLE   BIT(1)
> +#define CXL_DPA_VOLATILE BIT_ULL(0)
> +#define CXL_DPA_NOT_REPAIRABLE   BIT_ULL(1)

...these do not. The argument to __print_flags() is an unsigned long, so
they will be cast down to (unsigned long), and they are never used as a
mask so the generated code should not change.



Re: [RFC PATCH v2 0/6] cxl: add poison event handler

2024-03-29 Thread Dan Williams
Alison Schofield wrote:
> On Fri, Mar 29, 2024 at 11:22:32AM -0700, Dan Williams wrote:
> > Alison Schofield wrote:
> > [..]
> > > Upon receipt of that new poison list, call memory_failure_queue()
> > > on *any* poison in a mapped space. Is that OK?  Can we call
> > > memory_failure_queue() on any and every poison report that is in
> > > HPA space regardless of whether it first came to us through a GMER?
> > > I'm actually wondering if that is going to be the next ask anyway -
> > > ie report all poison.
> > 
> > memory_failure_queue() should be called on poison creation events. Leave
> > the MF_ACTION_REQUIRED flag not set so that memory_failure() performs
> > "action optional" handling.  So I would expect memory_failure_queue()
> > notification for GMER events, but not on poison list events.
> 
> Seems I totally missed the point of this patch set.
> Is it's only purpose to make sure that poison that is injected gets
> reported to memory_failure?

Clarify terms, "poison injection" to me is a debug-only event to test
that poison handling is working, "poison creation" is an event where new
poison was encountered by CPU consumption, deposited by a
DMA-with-poison transaction, or discovered by a background scrub
operation.

> 
> So this single patch only:
> 1. Poison inject leads to this GMER/CXL_EVENT_TRANSACTION_INJECT_POISON 

Inject is a special case. Likely this should copy the PMEM legacy where
notifying memory_failure() on injected poison is optional:

"ndctl inject-error --no-notify"

> 2. Driver sees GMER/CXL_EVENT_TRANSACTION_INJECT_POISON and reads poison
> list to get accurate length.

Again, inject is the least interesting for the common case, production
kernels care about "Media ECC Error" and "Scrub Media ECC Error"
regardless of transaction type.

> 3. Driver reports that to memory_failure_queue()
> 
> Still expect there's some code sharing opportunities and I still wonder
> about what is next in this area.

One area this needs to be careful is in unifying the OS-first and
FW-first paths. In the FW-first case the platform can trigger
memory_failure() along with the GMER by just posting a memory failure
CPER record. So there is a risk that things get doubly reported if the
GMER handling code blindly triggers memory_failure(). Might be benign,
but probably best avoided.



Re: [RFC PATCH v2 0/6] cxl: add poison event handler

2024-03-29 Thread Dan Williams
Alison Schofield wrote:
[..]
> Upon receipt of that new poison list, call memory_failure_queue()
> on *any* poison in a mapped space. Is that OK?  Can we call
> memory_failure_queue() on any and every poison report that is in
> HPA space regardless of whether it first came to us through a GMER?
> I'm actually wondering if that is going to be the next ask anyway -
> ie report all poison.

memory_failure_queue() should be called on poison creation events. Leave
the MF_ACTION_REQUIRED flag not set so that memory_failure() performs
"action optional" handling.  So I would expect memory_failure_queue()
notification for GMER events, but not on poison list events.



Re: [PATCH v2 1/1] cxl/mem: Fix for the index of Clear Event Record Handle

2024-03-18 Thread Dan Williams
Jonathan Cameron wrote:
> On Mon, 18 Mar 2024 10:29:28 +0800
> Yuquan Wang  wrote:
> 
> > The dev_dbg info for Clear Event Records mailbox command would report
> > the handle of the next record to clear not the current one.
> > 
> > This was because the index 'i' had incremented before printing the
> > current handle value.
> > 
> > Signed-off-by: Yuquan Wang 
> > ---
> >  drivers/cxl/core/mbox.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> > index 9adda4795eb7..b810a6aa3010 100644
> > --- a/drivers/cxl/core/mbox.c
> > +++ b/drivers/cxl/core/mbox.c
> > @@ -915,7 +915,7 @@ static int cxl_clear_event_record(struct 
> > cxl_memdev_state *mds,
> >  
> > payload->handles[i++] = gen->hdr.handle;
> > dev_dbg(mds->cxlds.dev, "Event log '%d': Clearing %u\n", log,
> > -   le16_to_cpu(payload->handles[i]));
> > +   le16_to_cpu(payload->handles[i-1]));
> Trivial but needs spaces around the -. e.g.  [i - 1] 
> 
> Maybe Dan can fix up whilst applying.
> 
> Otherwise
> 
> Reviewed-by: Jonathan Cameron 

I have enlisted Dave to start wrangling CXL kernel patches upstream, and
I will fall back to just reviewing.

Dave, you can add my:

Reviewed-by: Dan Williams 

...with the same caveat as above.



Re: [PATCH 1/1] cxl/mem: Fix for the index of Clear Event Record Handle

2024-03-15 Thread Dan Williams
Yuquan Wang wrote:
> The dev_dbg info for Clear Event Records mailbox command would report
> the handle of the next record to clear not the current one.
> 
> This was because the index 'i' had incremented before printing the
> current handle value.
> 
> This fix also adjusts the index variable name from 'i' to 'clear_cnt',
> which is easier for developers to distinguish and understand.
> 
> Signed-off-by: Yuquan Wang 
> ---
>  drivers/cxl/core/mbox.c | 21 +++--
>  1 file changed, 11 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> index 9adda4795eb7..3ca2845ae6aa 100644
> --- a/drivers/cxl/core/mbox.c
> +++ b/drivers/cxl/core/mbox.c
> @@ -881,7 +881,7 @@ static int cxl_clear_event_record(struct cxl_memdev_state 
> *mds,
>   struct cxl_mbox_cmd mbox_cmd;
>   u16 cnt;
>   int rc = 0;
> - int i;
> + int clear_cnt;
>  
>   /* Payload size may limit the max handles */
>   if (pl_size > mds->payload_size) {
> @@ -908,28 +908,29 @@ static int cxl_clear_event_record(struct 
> cxl_memdev_state *mds,
>* Clear Event Records uses u8 for the handle cnt while Get Event
>* Record can return up to 0x records.
>*/
> - i = 0;
> + clear_cnt = 0;
>   for (cnt = 0; cnt < total; cnt++) {
>   struct cxl_event_record_raw *raw = &get_pl->records[cnt];
>   struct cxl_event_generic *gen = &raw->event.generic;
>  
> - payload->handles[i++] = gen->hdr.handle;
> + payload->handles[clear_cnt] = gen->hdr.handle;
>   dev_dbg(mds->cxlds.dev, "Event log '%d': Clearing %u\n", log,
> - le16_to_cpu(payload->handles[i]));

Couldn't this patch be much smaller by just changing this to "i - 1",
because from the description the only problem is the dev_dbg()
statement, so just fix that.



Re: [RFC PATCH 1/5] cxl/core: correct length of DPA field masks

2024-02-21 Thread Dan Williams
[ add Ira and Davidlohr ]

Shiyang Ruan wrote:
> 
> 
> 在 2024/2/10 14:34, Dan Williams 写道:
> > Shiyang Ruan wrote:
> >> The length of Physical Address in General Media Event Record/DRAM Event
> >> Record is 64-bit, so the field mask should be defined as such length.
> > 
> > Can you include this user visible side-effect of this change. Looks like
> > this could cause usages of CXL_DPA_FLAGS_MASK to return an incorrect
> > result?
> 
> Ok.  Will change it to this:
> 
> The length of Physical Address in General Media Event Record/DRAM Event 
> Record is 64bit, per CXL Spec r3.0 - 8.2.9.2.1.1, Table 8-43.  Currently 
> CXL_DPA_FLAGS_MASK is defined as int (32bit), then CXL_DPA_MASK is an int 
> too, it will be 0xFFFFFFC0 while using "->dpa & CXL_DPA_MASK" to 
> obtain real physical address (to drop flags in lower bits), in this case 
> the higher 32bit of ->dpa will be lost.
> 
> To avoid this, define CXL_DPA_FLAGS_MASK as 64bit: 0x3FULL.

That part was clear, what is missing is the impact. For example, this
bug only impacts the output of the dpa field in the cxl_general_media
and cxl_dram tracepoints, but no other part of the CXL code.

So in this case the impact of the bug is low, but in the worst case a
wrong-size mask could compromise the security / integrity of the system.

So I would expect a changelog like this to make the importance of the
fix clear and to notify the original stakeholders where the bug was
introduced.

---
The length of Physical Address in General Media Event Record/DRAM Event
Record is 64-bit, so the field mask should be defined as such length.
Otherwise, this causes cxl_general_media and cxl_dram tracepoints to
mask off the upper-32-bits of DPA addresses. The cxl_poison event is
unaffected.

If userspace was doing its own DPA-to-HPA translation this could lead to
incorrect page retirement decisions, but there is no known consumer
(like rasdaemon) of this event today.

Fixes: d54a531a430b ("cxl/mem: Trace General Media Event Record")
Cc: 
Cc: Dan Williams 
Cc: Davidlohr Bueso 
Cc: Jonathan Cameron 
Cc: Ira Weiny 
---



RE: [RFC PATCH 5/5] cxl/core: add poison injection event handler

2024-02-09 Thread Dan Williams
Shiyang Ruan wrote:
> Currently the driver only traces cxl events, poison injection on cxl memdev
> is silent.  OS needs to be notified then it could handle poison range
> in time.  Per CXL spec, the device error event could be signaled through
> FW-First and OS-First methods.
> 
> So, add poison event handler in OS-First method:
>   - qemu:
> - CXL device report POISON event to OS by MSI by sending GMER after
>   injecting a poison record

QEMU details do not belong in a kernel changelog. It is ok for an RFC,
but my hope is that this can be tested on hardware after being proven on
QEMU.

>   - CXL driver
> a. read the POISON event through GMER;   <-- this patch
> b. get POISON list;
> c. translate DPA to HPA;
> d. construct a mce instance, then call mce_log() to queue this mce
>instance;

It is not clear to me why the kernel should proactively fire machine
check notifications on injection? The changelog needs to make clear why
the kernel should do this, and the consequences of not doing it.

For CPU consumed poison the machine check event will already fire. For
background discovery of poison, that should translate to a
memory_failure() notification with the MF_ACTION_REQUIRED flag cleared.
Userspace, like rasdaemon, can then make a page offline decision.



RE: [RFC PATCH 4/5] cxl/core: add report option for cxl_mem_get_poison()

2024-02-09 Thread Dan Williams
Shiyang Ruan wrote:
> When a poison event is received, driver uses GET_POISON_LIST command
> to get the poison list.  Now driver has cxl_mem_get_poison(), so
> reuse it and add a parameter 'bool report', report poison record to MCE
> if set true.

If the memory error record has the poison event, why does the poison
list need to be retrieved by the kernel? I would expect it is sufficient
to just report the single poison event and leave it to userspace to
react to that event and retrieve more data if it wants.



RE: [RFC PATCH 3/5] cxl/core: introduce cxl_mem_report_poison()

2024-02-09 Thread Dan Williams
Shiyang Ruan wrote:
> If poison is detected (reported from the cxl memdev), the OS should be notified to
> handle it.  Introduce this function:
>   1. translate DPA to HPA;
>   2. construct a MCE instance; (TODO: more details need to be filled)
>   3. log it into MCE event queue;
> 
> After that, MCE mechanism can walk over its notifier chain to execute
> specific handlers.
> 
> Signed-off-by: Shiyang Ruan 
> ---
>  arch/x86/kernel/cpu/mce/core.c |  1 +
>  drivers/cxl/core/mbox.c| 33 +
>  2 files changed, 34 insertions(+)
> 
> diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
> index bc39252bc54f..a64c0aceb7e0 100644
> --- a/arch/x86/kernel/cpu/mce/core.c
> +++ b/arch/x86/kernel/cpu/mce/core.c
> @@ -131,6 +131,7 @@ void mce_setup(struct mce *m)
>   m->ppin = cpu_data(m->extcpu).ppin;
>   m->microcode = boot_cpu_data.microcode;
>  }
> +EXPORT_SYMBOL_GPL(mce_setup);

No, mce_setup() is x86 specific and the CXL subsystem is CPU
architecture independent. My expectation is that CXL should translate
errors for edac similar to how the ACPI GHES code does it. See usage of
edac_raw_mc_handle_error() and memory_failure_queue().

Otherwise an MCE is a CPU consumption of poison event, and CXL is
reporting device-side discovery of poison.
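A sketch of the architecture-neutral direction (reusing the
cxl_memdev_dpa_to_hpa() helper from patch 2/5 rather than fabricating a
struct mce; illustrative only, not the proposed patch):

static void cxl_mem_report_poison(struct cxl_memdev *cxlmd, u64 dpa)
{
	phys_addr_t hpa = cxl_memdev_dpa_to_hpa(cxlmd, dpa);

	/* device-side discovery is "action optional", not an MCE */
	if (hpa)
		memory_failure_queue(PHYS_PFN(hpa), 0);
}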



RE: [RFC PATCH 2/5] cxl/core: introduce cxl_memdev_dpa_to_hpa()

2024-02-09 Thread Dan Williams
Shiyang Ruan wrote:
> When a memdev is assigned to a region, its Device Physical Address will be
> mapped to Host Physical Address.  Introduce this helper function to
> translate HPA from a given memdev and its DPA.
> 
> Signed-off-by: Shiyang Ruan 
> ---
>  drivers/cxl/core/memdev.c | 12 
>  drivers/cxl/cxlmem.h  |  1 +
>  2 files changed, 13 insertions(+)
> 
> diff --git a/drivers/cxl/core/memdev.c b/drivers/cxl/core/memdev.c
> index dae8802ecdb0..c304e709ef0e 100644
> --- a/drivers/cxl/core/memdev.c
> +++ b/drivers/cxl/core/memdev.c
> @@ -319,6 +319,18 @@ static int cxl_validate_poison_dpa(struct cxl_memdev 
> *cxlmd, u64 dpa)
>   return 0;
>  }
>  
> +phys_addr_t cxl_memdev_dpa_to_hpa(struct cxl_memdev *cxlmd, u64 dpa)
> +{
> + struct cxl_region *cxlr = cxl_dpa_to_region(cxlmd, dpa);
> +
> + if (cxlr)
> + return cxlr->params.res->start + dpa;
> + else {
> + dev_dbg(&cxlmd->dev, "device belongs to no region.\n");
> + return 0;
> + }
> +}

Hmm no, I would not say memdevs are assigned to regions, *endpoint
decoders* are assigned to regions. cxl_dpa_to_region() is only an
internal helper when the endpoint decoder is unknown. Otherwise endpoint
decoders have a direct-link to their region, if mapped. See usage of
"cxled->cxld.region".



RE: [RFC PATCH 1/5] cxl/core: correct length of DPA field masks

2024-02-09 Thread Dan Williams
Shiyang Ruan wrote:
> The length of Physical Address in General Media Event Record/DRAM Event
> Record is 64-bit, so the field mask should be defined as such length.

Can you include this user visible side-effect of this change. Looks like
this could cause usages of CXL_DPA_FLAGS_MASK to return an incorrect
result?

> 
> Signed-off-by: Shiyang Ruan 
> ---
>  drivers/cxl/core/trace.h | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/cxl/core/trace.h b/drivers/cxl/core/trace.h
> index 89445435303a..388a87d972c2 100644
> --- a/drivers/cxl/core/trace.h
> +++ b/drivers/cxl/core/trace.h
> @@ -253,11 +253,11 @@ TRACE_EVENT(cxl_generic_event,
>   * DRAM Event Record
>   * CXL rev 3.0 section 8.2.9.2.1.2; Table 8-44
>   */
> -#define CXL_DPA_FLAGS_MASK   0x3F
> +#define CXL_DPA_FLAGS_MASK   0x3FULL
>  #define CXL_DPA_MASK (~CXL_DPA_FLAGS_MASK)
>  
> -#define CXL_DPA_VOLATILE BIT(0)
> -#define CXL_DPA_NOT_REPAIRABLE   BIT(1)
> +#define CXL_DPA_VOLATILE BIT_ULL(0)
> +#define CXL_DPA_NOT_REPAIRABLE   BIT_ULL(1)
>  #define show_dpa_flags(flags)__print_flags(flags, "|",   
>\
>   { CXL_DPA_VOLATILE, "VOLATILE"  }, \
>   { CXL_DPA_NOT_REPAIRABLE,   "NOT_REPAIRABLE"}  \
> -- 
> 2.34.1
> 





Re: [PATCH v6 1/2] qom: new object to associate device to numa node

2024-01-09 Thread Dan Williams
Jason Gunthorpe wrote:
> On Tue, Jan 09, 2024 at 06:02:03PM +0100, David Hildenbrand wrote:
> > > Given that, an alternative proposal that I think would work
> > > for you would be to add a 'placeholder' memory node definition
> > > in SRAT (so allow 0 size explicitly - might need a new SRAT
> > > entry to avoid backwards compat issues).
> > 
> > Putting all the PCI/GI/... complexity aside, I'll just raise again that for
> > virtio-mem something simple like that might be helpful as well, IIUC.
> > 
> > -numa node,nodeid=2 \
> > ...
> > -device virtio-mem-pci,node=2,... \
> > 
> > All we need is the OS to prepare for an empty node that will get populated
> > with memory later.
> 
> That is all this is doing too, the NUMA relationship of the actual
> memory is described already by the PCI device since it is a BAR on the
> device.
> 
> The only purpose is to get the empty nodes into Linux :(
> 
> > So if that's what a "placeholder" node definition in srat could achieve as
> > well, even without all of the other acpi-generic-initiator stuff, that would
> > be great.
> 
> Seems like there are two quite similar use cases.. virtio-mem is going
> to be calling the same family of kernel API I suspect :)

It seems sad that we, as an industry, went through all of this trouble
to define a dynamically enumerable CXL device model only to turn around
and require static ACPI tables to tell us how to enumerate it.

A similar problem exists on the memory target side and the approach
taken there was to have Linux statically reserve at least enough numa
node numbers for all the platform CXL memory ranges (defined in the
ACPI.CEDT.CFMWS), but with the promise to come back and broach the
dynamic node creation problem "if the need arises".

This initiator-node enumeration case seems like that occasion where the
need has arisen to get Linux out of the mode of needing to declare all
possible numa nodes early in boot. Allow for nodes to be discoverable
post NUMA-init.

One strawman scheme that comes to mind is instead of "add nodes early" in
boot, "delete unused nodes late" in boot after the device topology has
been enumerated. Otherwise, requiring static ACPI tables to further
enumerate an industry-standard dynamically enumerated bus seems to be
going in the wrong direction.



Re: [PATCH v3] hw/cxl: Add QTG _DSM support for ACPI0017 device

2023-10-06 Thread Dan Williams
Jonathan Cameron wrote:
[..]
> > > > 
> > > > what does "a WORD" mean is unclear - do you match what hardware does
> > > > when you use aml_buffer? pls mention this in commit log, and
> > > > show actual hardware dump for comparison.  
> > > The CXL spec says WORD without much qualification. It's a 16bit value 
> > > AFAICT. I'll add additional comment. Currently I do not have access to 
> > > actual hardware unfortunately. I'm constructing this purely based on spec 
> > > description.  
> > 
> 
> WORD does seem to be clearly defined in the ACPI spec as uint16
> and as this is describing a DSDT blob I think we can assume that
> it means that.  Also lines up with the fixed sizes in CEDT.

I am not sure it means that, the ACPI specification indicates that packages
can hold "integers" and integers can be any size up to 64-bits.

> > It's not clear buffer is actually word though.
> > 
> > Jonathan do you have hardware access?
> 
> No.  +CC linux-cxl to see if anyone else has hardware + BIOS with
> QTG implemented...  There will be lots of implementations soon so I'd make
> no guarantee they will all interpret this the same.
> 
> Aim here is Linux kernel enablement support, and unfortunately that almost
> always means we are ahead of easy availability of hardware. If it exists
> it's probably prototypes in a lab, in which case no guarantees on the
> BIOS tables presented...

From a pre-production system the ASL is putting the result of SizeOf
directly into the first element in the return package:

Local1 = SizeOf (CXQI)
Local0 [0x00] = Local1

...where CXQI appears to be a fixed table of QTG ids for the platform, and
SizeOf() returns an integer type with no restriction that it be a 16-bit
value.

So I think the specification is misleading by specifying WORD when ACPI
"Package" objects convey "Integers" where the size of the integer can be a
u8 to a u64.

> > Also, possible to get clarification from the spec committee?
> 
> I'm unclear what we are clarifying.  As I read it, the current implementation
> is indeed wrong and I failed to notice this earlier :(
> 
> Ultimately data encoding (ACPI 6.5 section 20.2.3 Data Objects Encoding)
> should I think be
> 
> 0x0B 0x00 0x00
> WordPrefix then data : note if you try a 0x0001 and feed
> it to iasl it will squash it into a byte instead and indeed if you
> force the binary to the above it will decode it as 0x but recompile
> that and you will be back to just
> 0x00 (as bytes don't need a prefix..)
> 
> Currently it would be.
> 0x11 0x05 0x0a 0x02 0x00 0x01
> BufferOp 
> 
> Btw I built a minimal DSDT file to test this and iasl isn't happy with
> the fact the _DSM doesn't return anything at all if ARG2 isn't 1 or 2.
> Whilst that's impdef territory as not covered by the CXL spec, we should
> return 'something' ;)
> 
> Anyhow, to do this as per the CXL spec we need an aml_word()
> that just implements the word case from aml_int()

If I understand correctly, aml_int() is sufficient since this is not a
Field() where access size matters.

> Chances are that it never matters if we get an encoding that is
> only single byte (because the value is small) but who knows what
> other AML parsers might do.

I expect the reason WORD is used in the specification is because of the
size of the QTG ID field in the CFMWS. ACPI could support returning more
than USHRT_MAX in an Integer field in a Package, but those IDs above
USHRT_MAX could not be represented in CFMWS.

[..]
> > but again it is not clear at all what does spec mean.
> > an integer up to 0xf? a buffer as you did? just two bytes?
> > 
> > could be any of these.
> 
> The best we have in the way of description is the multiple QTG example
> where it's
> Package() {2, 1} combined with it being made up of WORDs
> 
> whereas in general that will get squashed to a pair of bytes...
> So I'm thinking WORDs is the max size rather than the only size but
> given ambiguity we should encode them as words anyway.

My assertion is that for the Package format the size of the integer does
not matter because the Package.Integer type can convey up to 64-bit values.
For being compatible with the *usage* of that max id, values that do not
fit into 16-bits are out of spec, but nothing prevents the Package from
using any size integer, afaics.



Re: [PATCH] hw/acpi/cxl: Drop device-memory support from CFMWS entries

2023-03-22 Thread Dan Williams
Jonathan Cameron wrote:
> On Mon, 20 Mar 2023 23:08:31 -0700
> Dan Williams  wrote:
> 
> > While it was a reasonable idea to specify no window restrictions at the
> > outset of the CXL emulation support, it turns out that in practice a
> > platform will never follow the QEMU example of specifying simultaneous
> > support for HDM-H and HDM-D[B] in a single window.
> > 
> > HDM-D mandates extra bus cycles for host/device bias protocol, and HDM-DB
> > mandates extra bus cycles for back-invalidate protocol, so hardware must
> > be explicitly prepared for device-memory unlike host-only memory
> > (HDM-H).
> > 
> > In preparation for the kernel dropping support for windows that do not
> > select between device and host-only memory, move QEMU exclusively to
> > declaring host-only windows.
> > 
> > Signed-off-by: Dan Williams 
> Hi Dan,
> 
> Can we have some spec references? I think the Protocol tables in
> appendix C would work for that - but more specific examples called
> out from them would be good.

I presume our messages crossed in the ether:

https://lore.kernel.org/linux-cxl/641a018ed7fb8_269929...@dwillia2-xfh.jf.intel.com.notmuch/

...but yes, if I was still committed to this change, which I am not at
this point, some references from Appendix C and "3.3.11 Forward Progress
and Ordering Rules" would be appropriate.

> I'm also not totally convinced it isn't a host implementation detail
> - key here is that host bridge's are still part of the host so can
> do impdef stuff as long as they look correct to CXL side and to
> host side.
> 
> Time for some wild implementation conjecturing.
> 
> Imagine a host that has host bridges of above average smartness.
> Those host bridges have HDM decoders (this doesn't work if not)
> 
> Host is using a single HPA window for HDM-D[B] and HDM-H.
> The host bridge knows the target is HDM-H - it can get that from
> the HDM decoder Target Type bit etc.  The HB can send (to the
> rest of the Host) whatever replies are necessary / fill in extra
> info to make it look like HDM-D[B] to the host interconnect protocol.
> 
> (after some fun with a white board we think you can make this efficient
>  by potentially making the Host bridge HDM decoder setup visible to
>  other parts of the host - relatively easy given lots of time allowed
>  for a decoder commit).
> 
> Why would you do this?  Limited HPA space availability on the host
> and wanting to be very flexible about use of the CXL windows.

Limited HPA space and fewer decode rules at the root level indeed sounds
compelling.

> Obviously this is all moot if there is a constraint we can point to
> in a specification.

At this time I don't have such a reference.

> BTW. I'm carrying a patch (it's on the gitlab tree) that I haven't
> posted yet that lets you configure this restriction at runtime as
> a similar potential host implementation restriction occurs for
> PMEM vs Volatile.  That is also needed to exercise the fun corners of
> QTG etc.

Sounds good.



RE: [PATCH] hw/acpi/cxl: Drop device-memory support from CFMWS entries

2023-03-21 Thread Dan Williams
Dan Williams wrote:
> While it was a reasonable idea to specify no window restrictions at the
> outset of the CXL emulation support, it turns out that in practice a
> platform will never follow the QEMU example of specifying simultaneous
> support for HDM-H and HDM-D[B] in a single window.
> 
> HDM-D mandates extra bus cycles for host/device bias protocol, and HDM-DB
> mandates extra bus cycles for back-invalidate protocol, so hardware must
> be explicitly prepared for device-memory unlike host-only memory
> (HDM-H).
> 
> In preparation for the kernel dropping support for windows that do not
> select between device and host-only memory, move QEMU exclusively to
> declaring host-only windows.

An offline discussion determined that a sufficiently sophisticated
platform might be able to support mixed HDM-H and HDM-D[B] in the same
window, so the kernel is not going to drop this support.



[PATCH] hw/acpi/cxl: Drop device-memory support from CFMWS entries

2023-03-20 Thread Dan Williams
While it was a reasonable idea to specify no window restrictions at the
outset of the CXL emulation support, it turns out that in practice a
platform will never follow the QEMU example of specifying simultaneous
support for HDM-H and HDM-D[B] in a single window.

HDM-D mandates extra bus cycles for host/device bias protocol, and HDM-DB
mandates extra bus cycles for back-invalidate protocol, so hardware must
be explicitly prepared for device-memory unlike host-only memory
(HDM-H).

In preparation for the kernel dropping support for windows that do not
select between device and host-only memory, move QEMU exclusively to
declaring host-only windows.

Signed-off-by: Dan Williams 
---
 hw/acpi/cxl.c |4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/hw/acpi/cxl.c b/hw/acpi/cxl.c
index 2bf8c0799359..defb289e2fef 100644
--- a/hw/acpi/cxl.c
+++ b/hw/acpi/cxl.c
@@ -103,8 +103,8 @@ static void cedt_build_cfmws(GArray *table_data, CXLState 
*cxls)
 /* Host Bridge Interleave Granularity */
 build_append_int_noprefix(table_data, fw->enc_int_gran, 4);
 
-/* Window Restrictions */
-build_append_int_noprefix(table_data, 0x0f, 2); /* No restrictions */
+/* Window Restrictions: Host-only ram and pmem */
+build_append_int_noprefix(table_data, 0x0e, 2);
 
 /* QTG ID */
 build_append_int_noprefix(table_data, 0, 2);




Re: cxl nvdimm Potential probe ordering issues.

2023-01-20 Thread Dan Williams
Gregory Price wrote:
> On Fri, Jan 20, 2023 at 09:38:13AM -0800, Dan Williams wrote:
> > As it stands currently that dax device and the cxl device are not
> > related since a default dax-device is loaded just based on the presence
> > of an EFI_MEMORY_SP address range in the address map. With the new ram
> > enabling that default device will be elided and CXL will register a
> > dax-device parented by a cxl region.
> > 
> > >- The memory *does not* auto-online, instead the dax device can be
> > >  onlined as system-ram *manually* via ndctl and friends
> > 
> > That *manually* part is the problem that needs distro help to solve. It
> > should be the case that by default all Linux distributions auto-online
> > all dax-devices. If that happens to online memory that is too slow for
> > general use, or too high-performance / precious for general purpose use
> > then the administrator can set policy after the fact. Unfortunately user
> > policy can not be applied if these memory ranges were onlined by the
> > kernel at boot, so that's why the kernel policy defaults to not-online.
> > 
> > In other words, there is no guarantee that memory that was assigned to
> > the general purpose pool at boot can be removed. The only guaranteed
> > behavior is to never give the memory to the core kernel in the first
> > instance and always let user policy route the memory.
> > 
> > > 3) The code creates an nvdimm_bridge IFF a CFMW is defined - regardless
> > >of the type-3 device configuration (pmem-only or vmem-only)
> > 
> > Correct, the top-level bus code (cxl_acpi) and the endpoint code
> > (cxl_mem, cxl_port) need to handshake before establishing regions. For
> > pmem regions the platform needs to claim the availability of a pmem
> > capable CXL window.
> > 
> > > 4) As you can see above, multiple decoders are registered.  I'm not sure
> > >if that's correct or not, but it does seem odd given there's only one
> > >cxl type-3 device.  Odd that decoder0.0 shows up when CFMW is there,
> > >but not when it isn't.
> > 
> > CXL windows are modeled as decoders hanging off the CXL root device
> > (ACPI0017 on ACPI based platforms). An endpoint decoder can then map a
> > selection of that window.
> > 
> > > Don't know why I haven't thought of this until now, but is the CFMW code
> > > reporting something odd about what's behind it?  Is it assuming the
> > > devices are pmem?
> > 
> > No, the cxl_acpi code is just advertising platform decode possibilities
> > independent of what devices show up. Think of this like the PCI MMIO
> > space that gets allocated to a root bridge at the beginning of time.
> > That space may or may not get consumed based on what devices show up
> > downstream.
> 
> Thank you for the explanation Dan, and thank you for you patience
> @JCameron.  I'm fairly sure I grok it now.
> 
> Summarizing to make sure: the cxl driver is providing what would be the
> CXL.io (control) path, and the CXL.mem path is basically being simulated
> by what otherwise would be a traditional PCI memory region. This explains
> why turning off Legacy mode drops the dax devices, and why the topology
> looks strange - the devices are basically attached in 2 different ways.
> 
> Might there be interest from the QEMU community to implement this
> legacy-style setup in the short term, in an effort to test the the
> control path of type-3 devices while we wait for the kernel to catch up?
> 
> Or should we forget this mode ever existed and just barrel forward
> with HDM decoders and writing the kernel code to hook up the underlying
> devices in drivers/cxl?

Which mode are you referring?

The next steps for the kernel enabling relevant to this thread are:

* ram region discovery (platform firmware or kexec established)
* ram region creation
* pmem region discovery (from labels)



Re: cxl nvdimm Potential probe ordering issues.

2023-01-20 Thread Dan Williams
Gregory Price wrote:
> On Thu, Jan 19, 2023 at 03:04:49PM +, Jonathan Cameron wrote:
> > Gregory, would you mind checking if
> > cxl_nvb is NULL here...
> > https://elixir.bootlin.com/linux/v6.2-rc4/source/drivers/cxl/pmem.c#L67
> > (printk before it is used should work).
> > 
> > Might also be worth checking cxl_nvd and cxl_ds
> > but my guess is cxl_nvb is our problem (it is when I deliberate change
> > the load order).
> > 
> > Jonathan
> > 
> 
> This is exactly the issue.  cxl_nvb is null, the rest appear fine.
> 
> Also, note that, weirdly, the non-volatile bridge shows up when launching
> this in volatile mode, but no stack trace appears.
> 
> ¯\_(ツ)_/¯
> 
> After spending way too much time tracing through the current cxl driver
> code, i have only really determined that
> 
> 1) The code is very pmem oriented, and it's unclear to me how the driver
>as-is differentiates a persistent device from a volatile device. That
>code path still completely escapes me.  The only differentiating code
>i see is in the memdev probe path that creates mem#/pmem and mem#/ram

Yes, pmem was the initial focus because it had the most dependency on
the OS to setup vs BIOS, but the ram enabling is at the top of the queue
now.

> 
> 2) The code successfully manages probe, enable, and mount a REAL device
>- cxl memdev appears (/sys/bus/cxl/devices/mem0)
>- a dax device appears (/sys/bus/dax/devices/)
>  This happens at boot, which I assume must be bios related

As it stands currently that dax device and the cxl device are not
related since a default dax-device is loaded just based on the presence
of an EFI_MEMORY_SP address range in the address map. With the new ram
enabling that default device will be elided and CXL will register a
dax-device parented by a cxl region.

>- The memory *does not* auto-online, instead the dax device can be
>  onlined as system-ram *manually* via ndctl and friends

That *manually* part is the problem that needs distro help to solve. It
should be the case that by default all Linux distributions auto-online
all dax-devices. If that happens to online memory that is too slow for
general use, or too high-performance / precious for general purpose use
then the administrator can set policy after the fact. Unfortunately user
policy can not be applied if these memory ranges were onlined by the
kernel at boot, so that's why the kernel policy defaults to not-online.

In other words, there is no guarantee that memory that was assigned to
the general purpose pool at boot can be removed. The only guaranteed
behavior is to never give the memory to the core kernel in the first
instance and always let user policy route the memory.

> 3) The code creates an nvdimm_bridge IFF a CFMW is defined - regardless
>of the type-3 device configuration (pmem-only or vmem-only)

Correct, the top-level bus code (cxl_acpi) and the endpoint code
(cxl_mem, cxl_port) need to handshake before establishing regions. For
pmem regions the platform needs to claim the availability of a pmem
capable CXL window.

> 
># CFMW defined
>[root@fedora ~]# ls /sys/bus/cxl/devices/
>decoder0.0  decoder2.0  mem0port1
>decoder1.0  endpoint2   nvdimm-bridge0  root0
> 
># CFMW not defined
>[root@fedora ~]# ls /sys/bus/cxl/devices/
>decoder1.0  decoder2.0  endpoint2  mem0  port1  root0
> 
> 4) As you can see above, multiple decoders are registered.  I'm not sure
>if that's correct or not, but it does seem odd given there's only one
>cxl type-3 device.  Odd that decoder0.0 shows up when CFMW is there,
>but not when it isn't.

CXL windows are modeled as decoders hanging off the CXL root device
(ACPI0017 on ACPI based platforms). An endpoint decoder can then map a
selection of that window.

> 
>Note: All these tests have two root ports:
>-device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=52 \
>-device cxl-rp,id=rp0,bus=cxl.0,chassis=0,port=0,slot=0 \
>-device cxl-rp,id=rp1,bus=cxl.0,chassis=0,port=1,slot=1 \
> 
> 
> Don't know why I haven't thought of this until now, but is the CFMW code
> reporting something odd about what's behind it?  Is it assuming the
> devices are pmem?

No, the cxl_acpi code is just advertising platform decode possibilities
independent of what devices show up. Think of this like the PCI MMIO
space that gets allocated to a root bridge at the beginning of time.
That space may or may not get consumed based on what devices show up
downstream.



Re: cxl nvdimm Potential probe ordering issues.

2023-01-20 Thread Dan Williams
Jonathan Cameron wrote:
> On Thu, 19 Jan 2023 15:04:49 +
> Jonathan Cameron  wrote:
> 
> > On Thu, 19 Jan 2023 12:42:44 +
> > Jonathan Cameron via  wrote:
> > 
> > > On Wed, 18 Jan 2023 14:31:53 -0500
> > > Gregory Price  wrote:
> > >   
> > > > I apparently forgot an intro lol
> > > > 
> > > > I tested the DOE linux branch with the 2023-1-11 QEMU branch with both
> > > > volatile, non-volatile, and "legacy" (pre-my-patch) non-volatile mode.
> > > > 
> > > > 1) *In volatile mode, there are no stack traces present (during boot*)
> > > > 
> > > > On Wed, Jan 18, 2023 at 02:22:08PM -0500, Gregory Price wrote:
> > > > > 
> > > > > 1) No stack traces present
> > > > > 2) Device usage appears to work, but cxl-cli fails to create a 
> > > > > region, i
> > > > > haven't checked why yet (also tried ndctl-75, same results)
> > > > > 3) There seems to be some other regression with the cxl_pmem_init
> > > > > routine, because I get a stack trace in this setup regardless of 
> > > > > whether
> > > > > I apply the type-3 device commit.
> > > > > 
> > > > > 
> > > > > All tests below with the previously posted DOE linux branch.
> > > > > Base QEMU branch was Jonathan's 2023-1-11
> > > > > 
> > > > > 
> > > > > DOE Branch - 2023-1-11 (HEAD) (all commits)
> > > > > 
> > > > > QEMU Config:
> > > > > sudo /opt/qemu-cxl/bin/qemu-system-x86_64 \
> > > > > -drive 
> > > > > file=/var/lib/libvirt/images/cxl.qcow2,format=qcow2,index=0,media=disk,id=hd
> > > > >  \
> > > > > -m 3G,slots=4,maxmem=8G \
> > > > > -smp 4 \
> > > > > -machine type=q35,accel=kvm,cxl=on \
> > > > > -enable-kvm \
> > > > > -nographic \
> > > > > -object memory-backend-ram,id=mem0,size=1G,share=on \
> > > > > -device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=52 \
> > > > > -device cxl-rp,id=rp0,bus=cxl.0,chassis=0,slot=0 \
> > > > > -device cxl-type3,bus=rp0,volatile-memdev=mem0,id=cxl-mem0 \
> > > > > -M cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=1G
> > > > > 
> > > > > Result:  This worked, but cxl-cli could not create a region (will look
> > > > > into this further later).
> > > > > 
> > > > > 
> > > > > 
> > > > > 
> > > > > When running with a persistent memory configuration, I'm seeing a
> > > > > kernel stack trace on cxl_pmem_init
> > > > > 
> > > > > Config:
> > > > > sudo /opt/qemu-cxl/bin/qemu-system-x86_64 \
> > > > > -drive 
> > > > > file=/var/lib/libvirt/images/cxl.qcow2,format=qcow2,index=0,media=disk,id=hd
> > > > >  \
> > > > > -m 3G,slots=4,maxmem=4G \
> > > > > -smp 4 \
> > > > > -machine type=q35,accel=kvm,cxl=on \
> > > > > -enable-kvm \
> > > > > -nographic \
> > > > > -device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=52 \
> > > > > -device cxl-rp,port=0,id=rp0,bus=cxl.0,chassis=0,slot=0 \
> > > > > -object memory-backend-file,id=cxl-mem0,mem-path=/tmp/mem0,size=1G \
> > > > > -object memory-backend-file,id=cxl-lsa0,mem-path=/tmp/lsa0,size=1G \
> > > > > -device 
> > > > > cxl-type3,bus=rp0,persistent-memdev=cxl-mem0,lsa=cxl-lsa0,id=cxl-pmem0
> > > > >  \
> > > > > -M cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=1G
> > > > > 
> > > > > 
> > > > > [   62.167518] BUG: kernel NULL pointer dereference, address: 
> > > > > 04c0
> > > > > [   62.185069] #PF: supervisor read access in kernel mode
> > > > > [   62.198502] #PF: error_code(0x) - not-present page
> > > > > [   62.211019] PGD 0 P4D 0
> > > > > [   62.220521] Oops:  [#1] PREEMPT SMP PTI
> > > > > [   62.233457] CPU: 3 PID: 558 Comm: systemd-udevd Not tainted 
> > > > > 6.2.0-rc1+ #1
> > > > > [   62.252886] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), 
> > > > > BIOS rel-1.16.1-0-g3208b098f51a-prebuilt.qemu.org 04/01/2014
> > > > > [   62.258432] Adding 2939900k swap on /dev/zram0.  Priority:100 
> > > > > extents:1 across:2939900k SSDscFS
> > > > > [   62.285513] RIP: 0010:cxl_nvdimm_probe+0x8d/0x130 [cxl_pmem]
> > > > > [   62.285529] Code: 85 c0 0f 85 90 00 00 00 f0 80 0c 24 40 f0 80 4c 
> > > > > 24 08 10 f0 80 4c 24 08 20 f0 80 4c 24 08 40 49 8d 84 24 b8 04 00 00 
> > > > > 4c 89 0
> > > > > [   62.285531] RSP: 0018:acff0141fc38 EFLAGS: 00010202
> > > > > [   62.285534] RAX: 97a8a37b84b8 RBX: 97a8a37b8000 RCX: 
> > > > > 
> > > > > [   62.285536] RDX: 0001 RSI: 97a8a37b8000 RDI: 
> > > > > 
> > > > > [   62.285537] RBP: 97a8a37b8000 R08: 0001 R09: 
> > > > > 0001
> > > > > [   62.285538] R10: 0001 R11:  R12: 
> > > > > 97a8a37b8000
> > > > > [   62.285539] R13: 97a982c3dc28 R14:  R15: 
> > > > > 
> > > > > [   62.285541] FS:  7f2619829580() GS:97a9bca0() 
> > > > > knlGS:
> > > > > [   62.285542] CS:  0010 DS:  ES:  CR0: 80050033
> > > > > [   62.285544] CR2: 04c0 CR3: 0001056a8000 CR4: 
> > > > > 06e0
> > > > > [   62.285653] Call Trace:
> > > > > [   62.285656]  
> > > > > [   62.285660]  cxl_bus_probe+0x17/0x50
> > > > 

Re: [BUG] cxl can not create region

2022-08-15 Thread Dan Williams
Jonathan Cameron wrote:
> On Fri, 12 Aug 2022 16:44:03 +0100
> Jonathan Cameron  wrote:
> 
> > On Thu, 11 Aug 2022 18:08:57 +0100
> > Jonathan Cameron via  wrote:
> > 
> > > On Tue, 9 Aug 2022 17:08:25 +0100
> > > Jonathan Cameron  wrote:
> > >   
> > > > On Tue, 9 Aug 2022 21:07:06 +0800
> > > > Bobo WL  wrote:
> > > > 
> > > > > Hi Jonathan
> > > > > 
> > > > > Thanks for your reply!
> > > > > 
> > > > > On Mon, Aug 8, 2022 at 8:37 PM Jonathan Cameron
> > > > >  wrote:  
> > > > > >
> > > > > > Probably not related to your problem, but there is a disconnect in 
> > > > > > QEMU /
> > > > > > kernel assumptions around the presence of an HDM decoder when a HB 
> > > > > > only
> > > > > > has a single root port. Spec allows it to be provided or not as an 
> > > > > > implementation choice.
> > > > > > Kernel assumes it isn't provide. Qemu assumes it is.
> > > > > >
> > > > > > The temporary solution is to throw in a second root port on the HB 
> > > > > > and not
> > > > > > connect anything to it.  Longer term I may special case this so 
> > > > > > that the particular
> > > > > > decoder defaults to pass through settings in QEMU if there is only 
> > > > > > one root port.
> > > > > >
> > > > > 
> > > > > You are right! After adding an extra HB in qemu, I can create a x1
> > > > > region successfully.
> > > > > But have some errors in Nvdimm:
> > > > > 
> > > > > [   74.925838] Unknown online node for memory at 0x100, 
> > > > > assuming node 0
> > > > > [   74.925846] Unknown target node for memory at 0x100, 
> > > > > assuming node 0
> > > > > [   74.927470] nd_region region0: nmem0: is disabled, failing probe   
> > > > >
> > > > 
> > > > Ah. I've seen this one, but not chased it down yet.  Was on my todo 
> > > > list to chase
> > > > down. Once I reach this state I can verify the HDM Decode is correct 
> > > > which is what
> > > > I've been using to test (Which wasn't true until earlier this week). 
> > > > I'm currently testing via devmem, more for historical reasons than 
> > > > because it makes
> > > > that much sense anymore.  
> > > 
> > > *embarrassed cough*.  We haven't fully hooked the LSA up in qemu yet.
> > > I'd forgotten that was still on the todo list. I don't think it will
> > > be particularly hard to do and will take a look in the next few days.
> > > 
> > > Very very indirectly this error is causing a driver probe fail that means 
> > > that
> > > we hit a code path that has a rather odd looking check on NDD_LABELING.
> > > Should not have gotten near that path though - hence the problem is 
> > > actually
> > > when we call cxl_pmem_get_config_data() and it returns an error because
> > > we haven't fully connected up the command in QEMU.  
> > 
> > So at least one bug in QEMU. We were not supporting variable length payloads
> > on mailbox inputs (but were on outputs).  That hasn't mattered until we get
> > to LSA writes.
> > We just need to relax the condition on the supplied length.
> > 
> > diff --git a/hw/cxl/cxl-mailbox-utils.c b/hw/cxl/cxl-mailbox-utils.c
> > index c352a935c4..fdda9529fe 100644
> > --- a/hw/cxl/cxl-mailbox-utils.c
> > +++ b/hw/cxl/cxl-mailbox-utils.c
> > @@ -510,7 +510,7 @@ void cxl_process_mailbox(CXLDeviceState *cxl_dstate)
> >  cxl_cmd = &cxl_cmd_set[set][cmd];
> >  h = cxl_cmd->handler;
> >  if (h) {
> > -if (len == cxl_cmd->in) {
> > +if (len == cxl_cmd->in || !cxl_cmd->in) {
> Fix is wrong as we use ~0 as the placeholder for variable payload, not 0.
> 
> With that fixed we hit new fun paths - after some errors we get the worrying
> trace below - not totally sure, but it looks like a failure in an error
> cleanup path.
> I'll chase down the error source, but even then this is probably triggerable
> by a hardware problem or similar.  Some bonus prints in here from me chasing
> error paths, but it's otherwise just cxl/next + the fix I posted earlier
> today.
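
For reference, a compilable sketch of the corrected length check: per the note
above, ~0 (not 0) is the placeholder for a variable-length input payload. The
constant and function names here are illustrative, not QEMU's:

    #include <stdbool.h>
    #include <stdint.h>

    /* illustrative marker for "variable length input payload" */
    #define CXL_VARIABLE_PAYLOAD (~0U)

    /* Accept an exact length match, or any length when the command
     * declares a variable-size input payload. */
    bool mbox_payload_len_ok(uint32_t len, uint32_t expected_in)
    {
        return len == expected_in || expected_in == CXL_VARIABLE_PAYLOAD;
    }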

One of the scenarios that I cannot rule out is nvdimm_probe() racing
nd_region_probe(), but given all the work it takes to create a region I
suspect all the nvdimm_probe() work has completed...

It is at least one potentially wrong hypothesis that needs to be chased
down.

> 
> [   69.919877] nd_bus ndbus0: START: nd_region.probe(region0)
> [   69.920108] nd_region_probe
> [   69.920623] [ cut here ]
> [   69.920675] refcount_t: addition on 0; use-after-free.
> [   69.921314] WARNING: CPU: 3 PID: 710 at lib/refcount.c:25 
> refcount_warn_saturate+0xa0/0x144
> [   69.926949] Modules linked in: cxl_pmem cxl_mem cxl_pci cxl_port cxl_acpi 
> cxl_core
> [   69.928830] CPU: 3 PID: 710 Comm: kworker/u8:9 Not tainted 5.19.0-rc3+ #399
> [   69.930596] Hardware name: QEMU QEMU Virtual Machine, BIOS 0.0.0 02/06/2015
> [   69.931482] Workqueue: events_unbound async_run_entry_fn
> [   69.932403] pstate: 6045 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> [   69.934023] pc : refcount_warn_saturate+0xa0/0x144
> [   69.935161] lr : refcount_

Re: [BUG] cxl can not create region

2022-08-12 Thread Dan Williams
Jonathan Cameron wrote:
> On Thu, 11 Aug 2022 18:08:57 +0100
> Jonathan Cameron via  wrote:
> 
> > On Tue, 9 Aug 2022 17:08:25 +0100
> > Jonathan Cameron  wrote:
> > 
> > > On Tue, 9 Aug 2022 21:07:06 +0800
> > > Bobo WL  wrote:
> > >   
> > > > Hi Jonathan
> > > > 
> > > > Thanks for your reply!
> > > > 
> > > > On Mon, Aug 8, 2022 at 8:37 PM Jonathan Cameron
> > > >  wrote:
> > > > >
> > > > > Probably not related to your problem, but there is a disconnect in
> > > > > QEMU / kernel assumptions around the presence of an HDM decoder when a HB only
> > > > > has a single root port. The spec allows it to be provided or not as an
> > > > > implementation choice.
> > > > > The kernel assumes it isn't provided. QEMU assumes it is.
> > > > >
> > > > > The temporary solution is to throw in a second root port on the HB 
> > > > > and not
> > > > > connect anything to it.  Longer term I may special case this so that 
> > > > > the particular
> > > > > decoder defaults to pass through settings in QEMU if there is only 
> > > > > one root port.
> > > > >  
> > > > 
> > > > You are right! After adding an extra HB in qemu, I can create a x1
> > > > region successfully.
> > > > But have some errors in Nvdimm:
> > > > 
> > > > [   74.925838] Unknown online node for memory at 0x100, 
> > > > assuming node 0
> > > > [   74.925846] Unknown target node for memory at 0x100, 
> > > > assuming node 0
> > > > [   74.927470] nd_region region0: nmem0: is disabled, failing probe
> > > 
> > > Ah. I've seen this one, but not chased it down yet.  Was on my todo list 
> > > to chase
> > > down. Once I reach this state I can verify the HDM Decode is correct 
> > > which is what
> > > I've been using to test (Which wasn't true until earlier this week). 
> > > I'm currently testing via devmem, more for historical reasons than 
> > > because it makes
> > > that much sense anymore.
> > 
> > *embarrassed cough*.  We haven't fully hooked the LSA up in qemu yet.
> > I'd forgotten that was still on the todo list. I don't think it will
> > be particularly hard to do and will take a look in the next few days.
> > 
> > Very very indirectly this error is causing a driver probe fail that means 
> > that
> > we hit a code path that has a rather odd looking check on NDD_LABELING.
> > Should not have gotten near that path though - hence the problem is actually
> > when we call cxl_pmem_get_config_data() and it returns an error because
> > we haven't fully connected up the command in QEMU.
> 
> So at least one bug in QEMU. We were not supporting variable length payloads
> on mailbox inputs (but were on outputs).  That hasn't mattered until we get
> to LSA writes.
> We just need to relax the condition on the supplied length.
> 
> diff --git a/hw/cxl/cxl-mailbox-utils.c b/hw/cxl/cxl-mailbox-utils.c
> index c352a935c4..fdda9529fe 100644
> --- a/hw/cxl/cxl-mailbox-utils.c
> +++ b/hw/cxl/cxl-mailbox-utils.c
> @@ -510,7 +510,7 @@ void cxl_process_mailbox(CXLDeviceState *cxl_dstate)
>  cxl_cmd = &cxl_cmd_set[set][cmd];
>  h = cxl_cmd->handler;
>  if (h) {
> -if (len == cxl_cmd->in) {
> +if (len == cxl_cmd->in || !cxl_cmd->in) {
>  cxl_cmd->payload = cxl_dstate->mbox_reg_state +
>  A_CXL_DEV_CMD_PAYLOAD;
>  ret = (*h)(cxl_cmd, cxl_dstate, &len);
> 
> 
> This lets the nvdimm/region probe fine, but I'm getting some issues with
> namespace capacity so I'll look at what is causing that next.
> Unfortunately I'm not that familiar with the driver/nvdimm side of things
> so it'll take a while to figure out what kicks off what!

The whirlwind tour is that 'struct nd_region' instances that represent a
persistent memory address range are composed of one or more mappings of
'struct nvdimm' objects. The nvdimm object is driven by the dimm driver
in drivers/nvdimm/dimm.c. That driver is mainly charged with unlocking
the dimm (if locked) and interrogating the label area to look for
namespace labels.

The label command calls are routed to the '->ndctl()' callback that was
registered when the CXL nvdimm_bus_descriptor was created. That callback
handles both 'bus' scope calls, currently none for CXL, and per nvdimm
calls. cxl_pmem_nvdimm_ctl() translates those generic LIBNVDIMM commands
to CXL commands.

The 'struct nvdimm' objects that the CXL side registers have the
NDD_LABELING flag set which means that namespaces need to be explicitly
created / provisioned from region capacity. Otherwise, if
drivers/nvdimm/dimm.c does not find a namespace-label-index block then
the region reverts to label-less mode and a default namespace equal to
the size of the region is instantiated.
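
To make the NDD_LABELING distinction concrete, here is a small self-contained
sketch (plain C, names loosely modeled on the kernel's; this is not kernel
code):

    #include <stdbool.h>
    #include <stdio.h>

    struct dimm {
        bool ndd_labeling;     /* namespaces must be explicitly provisioned */
        bool has_label_index;  /* namespace-label-index block found in the LSA */
    };

    /* Mirrors the decision described above: CXL nvdimms set NDD_LABELING,
     * so they never fall back to label-less mode. */
    const char *namespace_mode(const struct dimm *d)
    {
        if (d->has_label_index)
            return "labeled: enumerate namespaces from the label area";
        if (d->ndd_labeling)
            return "labeled: namespaces must be explicitly created";
        return "label-less: default namespace equal to the region size";
    }

    int main(void)
    {
        struct dimm cxl_nvdimm = { .ndd_labeling = true, .has_label_index = false };
        puts(namespace_mode(&cxl_nvdimm));
        return 0;
    }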

If you are seeing small mismatches in namespace capacity then it may
just be the fact that by default 'ndctl create-namespace' results in an
'fsdax' mode namespace which just means that it is a block device where
1.5% of the capacity is reserved for 'struct page' metadata. You should
be able to see namespace capacity
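
As a back-of-the-envelope check of that 1.5% figure, assuming 64 bytes of
'struct page' per 4 KiB page (1/64 = 1.5625%):

    #include <stdio.h>

    int main(void)
    {
        const double page_size = 4096, page_meta = 64;   /* bytes */
        double frac = page_meta / page_size;             /* ~0.0156 */
        double ns_size = 1024.0 * 1024 * 1024;           /* 1 GiB namespace */

        printf("overhead: %.2f%%\n", frac * 100);        /* 1.56% */
        printf("reserved: %.0f MiB\n", ns_size * frac / (1 << 20)); /* 16 MiB */
        return 0;
    }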

Re: [BUG] cxl can not create region

2022-08-11 Thread Dan Williams
Dan Williams wrote:
> Bobo WL wrote:
> > Hi Dan,
> > 
> > Thanks for your reply!
> > 
> > On Mon, Aug 8, 2022 at 11:58 PM Dan Williams  
> > wrote:
> > >
> > > What is the output of:
> > >
> > > cxl list -MDTu -d decoder0.0
> > >
> > > ...? It might be the case that mem1 cannot be mapped by decoder0.0, or
> > > at least not in the specified order, or that validation check is broken.
> > 
> > Command "cxl list -MDTu -d decoder0.0" output:
> 
> Thanks for this, I think I know the problem, but will try some
> experiments with cxl_test first.

Hmm, so my cxl_test experiment unfortunately passed so I'm not
reproducing the failure mode. This is the result of creating x4 region
with devices directly attached to a single host-bridge:

# cxl create-region -d decoder3.5 -w 4 -m -g 256 mem{12,10,9,11} -s $((1<<30))
{
  "region":"region8",
  "resource":"0xf1f000",
  "size":"1024.00 MiB (1073.74 MB)",
  "interleave_ways":4,
  "interleave_granularity":256,
  "decode_state":"commit",
  "mappings":[
{
  "position":3,
  "memdev":"mem11",
  "decoder":"decoder21.0"
},
{
  "position":2,
  "memdev":"mem9",
  "decoder":"decoder19.0"
},
{
  "position":1,
  "memdev":"mem10",
  "decoder":"decoder20.0"
},
{
  "position":0,
  "memdev":"mem12",
  "decoder":"decoder22.0"
}
  ]
}
cxl region: cmd_create_region: created 1 region

> Did the commit_store() crash stop reproducing with latest cxl/preview
> branch?

I missed the answer to this question.

All of these changes are now in Linus' tree; perhaps give that a try and
post the debug log again?



Re: [BUG] cxl can not create region

2022-08-09 Thread Dan Williams
Bobo WL wrote:
> Hi Dan,
> 
> Thanks for your reply!
> 
> On Mon, Aug 8, 2022 at 11:58 PM Dan Williams  wrote:
> >
> > What is the output of:
> >
> > cxl list -MDTu -d decoder0.0
> >
> > ...? It might be the case that mem1 cannot be mapped by decoder0.0, or
> > at least not in the specified order, or that validation check is broken.
> 
> Command "cxl list -MDTu -d decoder0.0" output:

Thanks for this, I think I know the problem, but will try some
experiments with cxl_test first.

Did the commit_store() crash stop reproducing with latest cxl/preview
branch?



RE: [BUG] cxl can not create region

2022-08-08 Thread Dan Williams
Bobo WL wrote:
> Hi list
> 
> I want to test cxl functions in arm64, and found some problems I can't
> figure out.
> 
> My test environment:
> 
> 1. build latest bios from https://github.com/tianocore/edk2.git master
> branch(cc2db6ebfb6d9d85ba4c7b35fba1fa37fffc0bc2)
> 2. build latest qemu-system-aarch64 from git://git.qemu.org/qemu.git
> master branch(846dcf0ba4eff824c295f06550b8673ff3f31314). With cxl arm
> support patch: 
> https://patchwork.kernel.org/project/cxl/cover/20220616141950.23374-1-jonathan.came...@huawei.com/
> 3. build Linux kernel from
> https://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl.git preview
> branch(65fc1c3d26b96002a5aa1f4012fae4dc98fd5683)
> 4. build latest ndctl tools from https://github.com/pmem/ndctl
> create_region branch(8558b394e449779e3a4f3ae90fae77ede0bca159)
> 
> And my qemu test commands:
> sudo $QEMU_BIN -M virt,gic-version=3,cxl=on -m 4g,maxmem=8G,slots=8 \
> -cpu max -smp 8 -nographic -no-reboot \
> -kernel $KERNEL -bios $BIOS_BIN \
> -drive if=none,file=$ROOTFS,format=qcow2,id=hd \
> -device virtio-blk-pci,drive=hd -append 'root=/dev/vda1
> nokaslr dyndbg="module cxl* +p"' \
> -object memory-backend-ram,size=4G,id=mem0 \
> -numa node,nodeid=0,cpus=0-7,memdev=mem0 \
> -net nic -net user,hostfwd=tcp::-:22 -enable-kvm \
> -object
> memory-backend-file,id=cxl-mem0,share=on,mem-path=/tmp/cxltest.raw,size=256M
> \
> -object
> memory-backend-file,id=cxl-mem1,share=on,mem-path=/tmp/cxltest1.raw,size=256M
> \
> -object
> memory-backend-file,id=cxl-mem2,share=on,mem-path=/tmp/cxltest2.raw,size=256M
> \
> -object
> memory-backend-file,id=cxl-mem3,share=on,mem-path=/tmp/cxltest3.raw,size=256M
> \
> -object
> memory-backend-file,id=cxl-lsa0,share=on,mem-path=/tmp/lsa0.raw,size=256M
> \
> -object
> memory-backend-file,id=cxl-lsa1,share=on,mem-path=/tmp/lsa1.raw,size=256M
> \
> -object
> memory-backend-file,id=cxl-lsa2,share=on,mem-path=/tmp/lsa2.raw,size=256M
> \
> -object
> memory-backend-file,id=cxl-lsa3,share=on,mem-path=/tmp/lsa3.raw,size=256M
> \
> -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
> -device cxl-rp,port=0,bus=cxl.1,id=root_port0,chassis=0,slot=0 \
> -device cxl-upstream,bus=root_port0,id=us0 \
> -device cxl-downstream,port=0,bus=us0,id=swport0,chassis=0,slot=4 \
> -device
> cxl-type3,bus=swport0,memdev=cxl-mem0,lsa=cxl-lsa0,id=cxl-pmem0 \
> -device cxl-downstream,port=1,bus=us0,id=swport1,chassis=0,slot=5 \
> -device
> cxl-type3,bus=swport1,memdev=cxl-mem1,lsa=cxl-lsa1,id=cxl-pmem1 \
> -device cxl-downstream,port=2,bus=us0,id=swport2,chassis=0,slot=6 \
> -device
> cxl-type3,bus=swport2,memdev=cxl-mem2,lsa=cxl-lsa2,id=cxl-pmem2 \
> -device cxl-downstream,port=3,bus=us0,id=swport3,chassis=0,slot=7 \
> -device
> cxl-type3,bus=swport3,memdev=cxl-mem3,lsa=cxl-lsa3,id=cxl-pmem3 \
> -M 
> cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=4G,cxl-fmw.0.interleave-granularity=4k
> 
> And I have got two problems.
> 1. When I want to create an x1 region with the command: "cxl create-region -d
> decoder0.0 -w 1 -g 4096 mem0", the kernel crashed with a null pointer
> dereference. Crash log:
> 
> [  534.697324] cxl_region region0: config state: 0
> [  534.697346] cxl_region region0: probe: -6
> [  534.697368] cxl_acpi ACPI0017:00: decoder0.0: created region0
> [  534.699115] cxl region0: mem0:endpoint3 decoder3.0 add:
> mem0:decoder3.0 @ 0 next: none nr_eps: 1 nr_targets: 1
> [  534.699149] cxl region0: :0d:00.0:port2 decoder2.0 add:
> mem0:decoder3.0 @ 0 next: mem0 nr_eps: 1 nr_targets: 1
> [  534.699167] cxl region0: ACPI0016:00:port1 decoder1.0 add:
> mem0:decoder3.0 @ 0 next: :0d:00.0 nr_eps: 1 nr_targets: 1
> [  534.699176] cxl region0: ACPI0016:00:port1 iw: 1 ig: 256
> [  534.699182] cxl region0: ACPI0016:00:port1 target[0] = :0c:00.0
> for mem0:decoder3.0 @ 0
> [  534.699189] cxl region0: :0d:00.0:port2 iw: 1 ig: 256
> [  534.699193] cxl region0: :0d:00.0:port2 target[0] =
> :0e:00.0 for mem0:decoder3.0 @ 0
> [  534.699405] Unable to handle kernel NULL pointer dereference at
> virtual address 
> [  534.701474] Mem abort info:
> [  534.701994]   ESR = 0x8604
> [  534.702653]   EC = 0x21: IABT (current EL), IL = 32 bits
> [  534.703616]   SET = 0, FnV = 0
> [  534.704174]   EA = 0, S1PTW = 0
> [  534.704803]   FSC = 0x04: level 0 translation fault
> [  534.705694] user pgtable: 4k pages, 48-bit VAs, pgdp=00010144a000
> [  534.706875] [] pgd=, p4d=
> [  534.709855] Internal error: Oops: 8604 [#1] PREEMPT SMP
> [  534.710301] Modules linked in:
> [  534.710546] CPU: 7 PID: 331 Comm: cxl Not tainted
> 5.19.0-rc3-00064-g65fc1c3d26b9-dirty #11
> [  534.715393] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
> [  534.717179] pstate: 6045 (nZCv daif +PAN -

Re: [PATCH v4 0/3] nvdimm: Enable sync-dax property for nvdimm

2021-05-04 Thread Dan Williams
On Tue, May 4, 2021 at 2:03 AM Aneesh Kumar K.V
 wrote:
>
> On 5/4/21 11:13 AM, Pankaj Gupta wrote:
> 
>
> >>
> >> What this patch series did was to express that property via a device
> >> tree node and guest driver enables a hypercall based flush mechanism to
> >> ensure persistence.
> >
> > Would a VIRTIO-based mechanism (entirely asynchronous, no trap at the host
> > side) be better than a hypercall-based one? Registering memory can be done
> > either way. We implemented the virtio-pmem flush mechanism with the below
> > considerations:
> >
> > - Proper semantics for guest flush requests.
> > - An efficient mechanism from a performance point of view.
> >
>
> sure, virio-pmem can be used as an alternative.
>
> > I am just asking myself if we have a platform-agnostic mechanism already
> > there; maybe we can extend it to suit our needs? Maybe I am missing some
> > points here.
> >
>
> What is being attempted in this series is to indicate to the guest OS
> that the backing device/file used for the emulated nvdimm device cannot
> guarantee persistence via CPU cache flush instructions.

Right, the detail I am arguing is that it should be a device
description, not a backend file property. In other words this proposal
is advocating this:

-object memory-backend-file,id=mem1,share,sync-dax=$sync-dax,mem-path=$path
-device nvdimm,memdev=mem1

...and I am advocating for reusing or duplicating the virtio-pmem
model like this:

-object memory-backend-file,id=mem1,share,mem-path=$path
-device spapr-hcall,memdev=mem1

...because the interface to the guest is the device, not the
memory-backend-file.



Re: [PATCH v4 0/3] nvdimm: Enable sync-dax property for nvdimm

2021-05-03 Thread Dan Williams
On Mon, May 3, 2021 at 7:06 AM Shivaprasad G Bhat  wrote:
>
>
> On 5/1/21 12:44 AM, Dan Williams wrote:
> > Some corrections to terminology confusion below...
> >
> >
> > On Wed, Apr 28, 2021 at 8:49 PM Shivaprasad G Bhat  
> > wrote:
> >> The nvdimm devices are expected to ensure write persistence during power
> >> failure kind of scenarios.
> > No, QEMU is not expected to make that guarantee. QEMU is free to lie
> > to the guest about the persistence guarantees of the guest PMEM
> > ranges. It's more accurate to say that QEMU nvdimm devices can emulate
> > persistent memory and optionally pass through host power-fail
> > persistence guarantees to the guest. The power-fail persistence domain
> > can be one of "cpu_cache", or "memory_controller" if the persistent
> > memory region is "synchronous". If the persistent range is not
> > synchronous, it really isn't "persistent memory"; it's memory mapped
> > storage that needs I/O commands to flush.
>
> Since this is virtual nvdimm(v-nvdimm) backed by a file, and the data is
> completely
>
> in the host pagecache, and we need a way to ensure that host pagecaches
>
> are flushed to the backend. This analogous to the WPQ flush being offloaded
>
> to the hypervisor.

No, it isn't analogous. WPQ flush is an optional mechanism to force
data to a higher durability domain. The flush in this interface is
mandatory. It's a different class of device.

The proposal that "sync-dax=unsafe" for non-PPC architectures, is a
fundamental misrepresentation of how this is supposed to work. Rather
than make "sync-dax" a first class citizen of the device-description
interface I'm proposing that you make this a separate device-type.
This also solves the problem that "sync-dax" with an implicit
architecture backend assumption is not precise, but a new "non-nvdimm"
device type would make it explicit what the host is advertising to the
guest.

>
>
> Ref: https://github.com/dgibson/qemu/blob/main/docs/nvdimm.txt
>
>
>
> >
> >> The libpmem has architecture specific instructions like dcbf on POWER
> > Which "libpmem" is this? PMDK is a reference library not a PMEM
> > interface... maybe I'm missing what libpmem has to do with QEMU?
>
>
> I was referring to the semantics of flushing pmem cache lines as in
> PMDK/libpmem.
>
>
> >
> >> to flush the cache data to backend nvdimm device during normal writes
> >> followed by explicit flushes if the backend devices are not synchronous
> >> DAX capable.
> >>
> >> Qemu - virtual nvdimm devices are memory mapped. The dcbf in the guest
> >> and the subsequent flush doesn't traslate to actual flush to the backend
> > s/traslate/translate/
> >
> >> file on the host in case of file backed v-nvdimms. This is addressed by
> >> virtio-pmem in case of x86_64 by making explicit flushes translating to
> >> fsync at qemu.
> > Note that virtio-pmem was a proposal for a specific optimization of
> > allowing guests to share page cache. The virtio-pmem approach is not
> > to be confused with actual persistent memory.
> >
> >> On SPAPR, the issue is addressed by adding a new hcall to
> >> request for an explicit flush from the guest ndctl driver when the backend
> > What is an "ndctl" driver? ndctl is userspace tooling, do you mean the
> > guest pmem driver?
>
>
> oops, wrong terminology. I was referring to the guest libnvdimm and
> papr_scm kernel modules.
>
>
> >
> >> nvdimm cannot ensure write persistence with dcbf alone. So, the approach
> >> here is to convey when the hcall flush is required in a device tree
> >> property. The guest makes the hcall when the property is found, instead
> >> of relying on dcbf.
> >>
> >> A new device property sync-dax is added to the nvdimm device. When the
> >> sync-dax is 'writeback'(default for PPC), device property
> >> "hcall-flush-required" is set, and the guest makes hcall H_SCM_FLUSH
> >> requesting for an explicit flush.
> > I'm not sure "sync-dax" is a suitable name for the property of the
> > guest persistent memory.
>
>
> sync-dax property translates ND_REGION_ASYNC flag being set/unset

Yes, I am aware, but that property alone is not sufficient to identify
the flush mechanism.

>
> for the pmem region, also based on whether the nvdimm_flush callback is
> provided in papr_scm or not. As everything boils down to the synchronous
> nature of the device, I chose

Re: [PATCH v4 0/3] nvdimm: Enable sync-dax property for nvdimm

2021-04-30 Thread Dan Williams
Some corrections to terminology confusion below...


On Wed, Apr 28, 2021 at 8:49 PM Shivaprasad G Bhat  wrote:
>
> The nvdimm devices are expected to ensure write persistence during power
> failure kind of scenarios.

No, QEMU is not expected to make that guarantee. QEMU is free to lie
to the guest about the persistence guarantees of the guest PMEM
ranges. It's more accurate to say that QEMU nvdimm devices can emulate
persistent memory and optionally pass through host power-fail
persistence guarantees to the guest. The power-fail persistence domain
can be one of "cpu_cache", or "memory_controller" if the persistent
memory region is "synchronous". If the persistent range is not
synchronous, it really isn't "persistent memory"; it's memory mapped
storage that needs I/O commands to flush.
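
To illustrate that distinction with a minimal userspace sketch (POSIX calls
only; 'is_sync_pmem' stands in for whatever the device description advertises,
and 'map' is assumed to be an mmap'd, page-aligned range):

    #include <string.h>
    #include <sys/mman.h>

    /* True persistent memory: CPU cache flushes alone reach the persistence
     * domain (e.g. libpmem's pmem_persist()). Memory-mapped storage: an
     * explicit I/O command, modeled here by msync(), is required. */
    void write_durably(void *map, const void *buf, size_t len, int is_sync_pmem)
    {
        memcpy(map, buf, len);
        if (!is_sync_pmem)
            msync(map, len, MS_SYNC); /* flush command to the backing store */
        /* else: flush CPU caches to the persistence domain */
    }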

> The libpmem has architecture specific instructions like dcbf on POWER

Which "libpmem" is this? PMDK is a reference library not a PMEM
interface... maybe I'm missing what libpmem has to do with QEMU?

> to flush the cache data to backend nvdimm device during normal writes
> followed by explicit flushes if the backend devices are not synchronous
> DAX capable.
>
> Qemu - virtual nvdimm devices are memory mapped. The dcbf in the guest
> and the subsequent flush doesn't traslate to actual flush to the backend

s/traslate/translate/

> file on the host in case of file backed v-nvdimms. This is addressed by
> virtio-pmem in case of x86_64 by making explicit flushes translating to
> fsync at qemu.

Note that virtio-pmem was a proposal for a specific optimization of
allowing guests to share page cache. The virtio-pmem approach is not
to be confused with actual persistent memory.

> On SPAPR, the issue is addressed by adding a new hcall to
> request for an explicit flush from the guest ndctl driver when the backend

What is an "ndctl" driver? ndctl is userspace tooling, do you mean the
guest pmem driver?

> nvdimm cannot ensure write persistence with dcbf alone. So, the approach
> here is to convey when the hcall flush is required in a device tree
> property. The guest makes the hcall when the property is found, instead
> of relying on dcbf.
>
> A new device property sync-dax is added to the nvdimm device. When the
> sync-dax is 'writeback'(default for PPC), device property
> "hcall-flush-required" is set, and the guest makes hcall H_SCM_FLUSH
> requesting for an explicit flush.

I'm not sure "sync-dax" is a suitable name for the property of the
guest persistent memory. There is no requirement that the
memory-backend file for a guest be a dax-capable file. It's also
implementation specific what hypercall needs to be invoked for a given
occurrence of "sync-dax". What does that map to on non-PPC platforms
for example? It seems to me that an "nvdimm" device presents the
synchronous usage model and a whole other device type implements an
async-hypercall setup that the guest happens to service with its
nvdimm stack, but it's not an "nvdimm" anymore at that point.

> sync-dax is "unsafe" on all other platforms(x86, ARM) and old pseries machines
> prior to 5.2 on PPC. sync-dax="writeback" on ARM and x86_64 is prevented
> now as the flush semantics are unimplemented.

"sync-dax" has no meaning on its own, I think this needs an explicit
mechanism to convey both the "not-sync" property *and* the callback
method, it shouldn't be inferred by arch type.

> When the backend file is actually synchronous DAX capable and no explicit
> flushes are required, the sync-dax mode 'direct' is to be used.
>
> The below demonstration shows the map_sync behavior with sync-dax writeback &
> direct.
> (https://github.com/avocado-framework-tests/avocado-misc-tests/blob/master/memory/ndctl.py.data/map_sync.c)
>
> The pmem0 is from an nvdimm with sync-dax=direct, and pmem1 is from an
> nvdimm with sync-dax=writeback, mounted as
> /dev/pmem0 on /mnt1 type xfs 
> (rw,relatime,attr2,dax=always,inode64,logbufs=8,logbsize=32k,noquota)
> /dev/pmem1 on /mnt2 type xfs 
> (rw,relatime,attr2,dax=always,inode64,logbufs=8,logbsize=32k,noquota)
>
> [root@atest-guest ~]# ./mapsync /mnt1/newfile > When 
> sync-dax=unsafe/direct
> [root@atest-guest ~]# ./mapsync /mnt2/newfile > when sync-dax=writeback
> Failed to mmap  with Operation not supported
>
> The first patch does the header file cleanup necessary for the
> subsequent ones. Second patch implements the hcall, adds the necessary
> vmstate properties to the spapr machine structure for carrying the hcall
> status during save-restore. The hcall being asynchronous in nature,
> the patch uses aio utilities to offload the flush. The third patch adds
> the 'sync-dax' device property and enables the device tree property
> for the guest to utilise the hcall.
>
> The kernel changes to exploit this hcall is at
> https://github.com/linuxppc/linux/commit/75b7c05ebf9026.patch
>
> ---
> v3 - https://lists.gnu.org/archive/html/qemu-devel/2021-03/msg07916.html
> Changes from v3:
>   - Fixed the forward declaration coding guidel

Re: Microsoft and Intel NVDIMM ACPI _DSM interfaces status?

2021-03-17 Thread Dan Williams
On Wed, Mar 17, 2021 at 4:49 AM Stefan Hajnoczi  wrote:
>
> Hi,
> Microsoft and Intel developed two different ACPI NVDIMM _DSM interfaces.
>
> The specs for the Intel interface are available here:
> https://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf
>
> This is the interface that QEMU emulates. It has been reported that
> Windows 2016 Server and 2019 Server guests do not recognize QEMU's
> emulated NVDIMM devices using the Microsoft driver.
>
> I'd like to understand the path forward that will allow both Linux and
> Windows guests to successfully use QEMU's emulated NVDIMM device
> (https://gitlab.com/qemu-project/qemu/-/blob/master/hw/acpi/nvdimm.c).
>
> Are/have these two interfaces being/been unified?

No, they service 2 different classes of NVDIMMs. The Microsoft
specification was defined for NVDIMM-N devices that are the
traditional battery/super-capacitor backed DDR with sometimes an equal
amount of flash to squirrel away data to non-volatile media on power
loss. The Intel one is for a persistent media class of device where
there is no need to communicate health attributes like "energy source
died" or "restore from flash" failed.

> Should QEMU emulate both of them to make running Windows guests easy?

Depends on how tolerant Windows is to different format-interface-code
implementations and what QEMU should return on each of the functions
it decides to implement.

Note that QEMU only implements a small subset of the Intel commands,
i.e. just enough to provision namespaces in the NVDIMM label area.
"NVDIMM label area" is a concept that is newer than the NVDIMM-N
definition which is why you don't see labels mentioned in the
Microsoft specification. Since then ACPI has developed common label
area access methods, see "6.5.10 NVDIMM Label Methods" in the ACPI 6.4
specification.

Note that you'll also see "9.20.8 NVDIMM Device Methods" in that spec
for some health management commands that overlap what the Microsoft
and Intel specs communicate. Linux does not support them since any
platform that might support them will also support the Intel
specification so there's no driving need for Linux to migrate. I do
not know the status of that command support in Windows, or the HPE
command support in Windows for that matter.

If you need to support guests that expect the Hyper-V command
definition, and will fail to attach NVDIMM support in the absence of
that definition then QEMU needs UUID_NFIT_DIMM_N_HYPERV support, note
that is a different command set than even UUID_NFIT_DIMM_N_MSFT
(include/acpi/acuuid.h). That would also require changes to virtual
ACPI NFIT to advertise the correlated format interface code. There may
be more... you would need someone from Hyper-V land to enumerate all
that is expected.



Re: [RFC] Set addresses for memory devices [CXL]

2021-01-27 Thread Dan Williams
On Wed, Jan 27, 2021 at 7:52 PM Ben Widawsky  wrote:
>
> Hi list, Igor.
>
> I wanted to get some ideas on how to better handle this. Per the recent
> discussion [1], it's become clear that there needs to be more thought put into
> how to manage the address space for CXL memory devices. If you see the
> discussion on interleave [2] there's a decent diagram for the problem 
> statement.
>
> A CXL topology looks just like a PCIe topology. A CXL memory device is a 
> memory
> expander. It's a byte addressable address range with a combination of 
> persistent
> and volatile memory. In a CXL capable system, you can effectively think of 
> these
> things as more configurable NVDIMMs. The memory devices have an interface that
> allows the OS to program the base physical address range it claims called an 
> HDM
> (Host Defined Memory) decoder. A larger address range is claimed by a host
> bridge (or a combination of host bridges in the interleaved case) which is
> platform specific.
>
> Originally, my plan was to create a single memory backend for a "window" and
> subregion the devices in there. So for example, if you had two devices under a
> hostbridge, each of 256M size, the window would be some fixed GPA of 512M+ 
> size
> memory backend, and those memory devices would be a subregion of the
> hostbridge's window. I thought this was working in my patch series, but as it
> turns out, this doesn't actually work as I intended. `info mtree` looks good,
> but `info memory-devices` doesn't.
>

A couple clarifying questions...

> So let me list the requirements and hopefully get some feedback on the best 
> way
> to handle it.
> 1. A PCIe like device has a persistent memory region (I don't care about
> volatile at the moment).

What do you mean by "PCIe" like? If it is PCI enumerable by the guest
it has no business being treated as proper memory because the OS
rightly assumes that PCIe address space is not I/O coherent to other
initiators.

> 2. The physical address base for the memory region is programmable.
> 3. Memory accesses will support interleaving across multiple host bridges.

So, per 1. it would look like a PCIe address space inside QEMU but
advertised as an I/O coherent platform resource in the guest?



Re: [RFC PATCH 00/25] Introduce CXL 2.0 Emulation

2020-12-03 Thread Dan Williams
[ add linux-cxl for the Linux driver question ]


On Thu, Dec 3, 2020 at 9:10 PM Chris Browy  wrote:
[..]

I'll let Ben address the other questions...

> 4. What are the userspace system APIs for targeting CXL HDM address domain?
>Usually you can mmap a SPA if you know how to look it up.

tl;dr: /proc/iomem and sysfs

There are a lot of moving pieces to this question. Once a CXL.mem
device is mapped the driver will arrange for it to appear in
/proc/iomem. If it is persistent memory it will be accessed through a
/dev/pmem or /dev/dax device, and if it is volatile it will show up as
just another "System RAM" range. The System RAM range may end up with
a unique Linux NUMA node id. Alternatively, if the BIOS maps the
device then the BIOS will set up the EFI memory map + ACPI
SRAT/SLIT/HMAT for Linux to interpret.

Once something shows up in /proc/iomem it was traditionally accessible
via /dev/mem. These days though /dev/mem is slowly being deprecated
due to security (see CONFIG_LOCK_DOWN_KERNEL*) and stability (see
CONFIG_IO_STRICT_DEVMEM)  concerns, so depending on your kernel
configuration tools may not be able to use /dev/mem to map the SPA
directly and will need to go through the driver provided interface.

The driver needs to be able to coordinate coherent HDM settings across
multiple devices in an interleave, all switches in the path, and the
root bridge(s). So I am expecting a sysfs interface similar to the
nvdimm-subsystem and device-dax-subsystem where there will be
sysfs-files to enumerate the mappings and a provisioning mechanism to
reconfigure HDMs with various interleave settings. For a flavor of
what this might look like see the description of NVDIMM "Region"
devices:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/driver-api/nvdimm/nvdimm.rst#n469

The difference will be that unlike the nvdimm case where regions are
platform-firmware-defined and static, CXL regions are runtime dynamic.
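
As a concrete example of the /proc/iomem side of this, a minimal scanner (each
line is "start-end : name"; run as root, since unprivileged reads show zeroed
addresses, and "System RAM" is just one example of a range name to match on):

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        FILE *f = fopen("/proc/iomem", "r");
        char line[256];

        if (!f)
            return 1;
        while (fgets(line, sizeof(line), f))
            if (strstr(line, "System RAM")) /* print matching ranges */
                fputs(line, stdout);
        fclose(f);
        return 0;
    }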



Re: [PATCH] exec: fetch the alignment of Linux devdax pmem character device nodes

2020-04-09 Thread Dan Williams
On Thu, Apr 9, 2020 at 7:33 AM Liu, Jingqi  wrote:
>
> On 4/8/2020 5:42 PM, Joao Martins wrote:
> > On 4/8/20 3:25 AM, Liu, Jingqi wrote:
> >> On 4/8/2020 2:28 AM, Joao Martins wrote:
> >>> On 4/7/20 5:55 PM, Dan Williams wrote:
> >>>> On Tue, Apr 7, 2020 at 4:01 AM Joao Martins  
> >>>> wrote:
> >>>>> Perhaps, you meant instead:
> >>>>>
> >>>>>   /sys/dev/char/%d:%d/align
> >>>>>
> >>>> Hmm, are you sure that's working?
> >>> It is, except that I made the slight mistake of testing with a bunch of 
> >>> wip
> >>> patches on top which one of them actually adds the 'align' to child dax 
> >>> device.
> >>>
> >>> Argh, my apologies - and thanks for noticing.
> >>>
> >>>> I expect the alignment to be found
> >>>> in the region device:
> >>>>
> >>>> /sys/class/dax:
> >>>> /sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0012:00/ndbus1/region1/dax1.1/dax1.0
> >>>> $(readlink -f /sys/dev/char/253\:263)/../align
> >>>> $(readlink -f /sys/dev/char/253\:263)/device/align
> >>>>
> >>>>
> >>>> /sys/bus/dax:
> >>>> /sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0012:00/ndbus1/region1/dax1.0/dax1.0
> >>>> $(readlink -f /sys/dev/char/253\:265)/../align
> >>>> $(readlink -f /sys/dev/char/253\:265)/device/align <-- No such file
> >>>>
> >>>> The use of the /sys/dev/char/%d:%d/device is only supported by the
> >>>> deprecated /sys/class/dax.
> >> Hi Dan,
> >>
> >> Thanks for your comments.
> >>
> >> Seems it is a mistake.
> >>
> >> It should be: $(readlink -f /sys/dev/char/253\:263)/../../align
> >>
> > Hmm, perhaps you have an extra '../' in the path? This works for me:
> >
> > # ls $(readlink -f /sys/dev/char/252\:0/../align)
> > /sys/devices/platform/e820_pmem/ndbus0/region0/dax0.0/dax0.0/../align
> > # cat $(readlink -f /sys/dev/char/252\:0)/../align
> > 2097152
> > # cat /sys/dev/char/252\:0/../align
> > 2097152
>
> Hi Joao,
>
> Hmm, I need to have an extra '../' in the path. The details are as follows:
>
> # ll /dev/dax2.0
> crw--- 1 root root 251, 5 Mar 20 13:35 /dev/dax2.0
> # uname -r
> 5.6.0-rc1-00044-gb19e8c684703
> # readlink -f /sys/dev/char/251\:5/
> /sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0012:00/ndbus0/region2/dax2.1/dax/dax2.0
> # ls $(readlink -f /sys/dev/char/251\:5)/../align
> ls: cannot access
> '/sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0012:00/ndbus0/region2/dax2.1/dax/dax2.0/../align':
> No such file or directory
> # ls $(readlink -f /sys/dev/char/251\:5)/../dax_region/align
> ls: cannot access
> '/sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0012:00/ndbus0/region2/dax2.1/dax/dax2.0/../dax_region/align':
> No such file or directory
> # ls $(readlink -f /sys/dev/char/251\:5)/../../align
> /sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0012:00/ndbus0/region2/dax2.1/dax/dax2.0/../../align
> # ls $(readlink -f /sys/dev/char/251\:5)/../../dax_region/align
> /sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0012:00/ndbus0/region2/dax2.1/dax/dax2.0/../../dax_region/align
> # lsmod|grep pmem
> dax_pmem_compat16384  0
> device_dax 20480  1 dax_pmem_compat
> dax_pmem_core  16384  1 dax_pmem_compat
> # lsmod|grep dax
> dax_pmem_compat16384  0
> device_dax 20480  1 dax_pmem_compat
> dax_pmem_core  16384  1 dax_pmem_compat
>
> Seems some configurations are different?
>
> Can you share your info as above? Thanks.

Alternatively maybe you can use libdaxctl that automatically handles
the ABI differences between compat-dax-class and dax-bus layouts? I
didn't recommend it before because I was not sure about qemu's policy
about taking on new library dependencies, but with libdaxctl you could
do something like (untested):

path = g_strdup_printf("/sys/dev/char/%d:%d", major(st.st_rdev),
minor(st.st_rdev));
rpath = realpath(path, NULL);
daxctl_region_foreach(ctx, region)
if (strstr(daxctl_region_get_path(region), rpath)) {
align = daxctl_region_get_align(region);
break;
}
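
A self-contained (and equally untested) rendering of that sketch; the strstr()
arguments are flipped relative to the snippet above, on the assumption that the
resolved character-device path nests under the region's sysfs path:

    #include <stdlib.h>
    #include <string.h>
    #include <sys/stat.h>
    #include <sys/sysmacros.h>
    #include <glib.h>
    #include <daxctl/libdaxctl.h>

    /* Return the dax region alignment for a devdax character device,
     * or -1 on failure. */
    long devdax_align(const struct stat *st)
    {
        struct daxctl_ctx *ctx;
        struct daxctl_region *region;
        g_autofree char *path = NULL;
        char *rpath;
        long align = -1;

        if (daxctl_new(&ctx) < 0)
            return -1;
        path = g_strdup_printf("/sys/dev/char/%d:%d",
                               major(st->st_rdev), minor(st->st_rdev));
        rpath = realpath(path, NULL);
        if (rpath) {
            daxctl_region_foreach(ctx, region)
                if (strstr(rpath, daxctl_region_get_path(region))) {
                    align = daxctl_region_get_align(region);
                    break;
                }
            free(rpath);
        }
        daxctl_unref(ctx);
        return align;
    }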



Re: [PATCH] exec: fetch the alignment of Linux devdax pmem character device nodes

2020-04-07 Thread Dan Williams
On Tue, Apr 7, 2020 at 4:01 AM Joao Martins  wrote:
>
>
>
> On 4/1/20 4:13 AM, Jingqi Liu wrote:
> > If the backend file is devdax pmem character device, the alignment
> > specified by the option 'align=NUM' in the '-object memory-backend-file'
> > needs to match the alignment requirement of the devdax pmem character 
> > device.
> >
> > This patch fetches the devdax pmem file 'align', so that we can compare
> > it with the NUM of 'align=NUM'.
> > The NUM needs to be larger than or equal to the devdax pmem file 'align'.
> >
> > It also fixes the problem that mmap() returns failure in qemu_ram_mmap()
> > when the NUM of 'align=NUM' is less than the devdax pmem file 'align'.
> >
> > Cc: Dan Williams 
> > Signed-off-by: Jingqi Liu 
> > ---
> >  exec.c | 46 +-
> >  1 file changed, 45 insertions(+), 1 deletion(-)
> >
> > diff --git a/exec.c b/exec.c
> > index de9d949902..8221abffec 100644
> > --- a/exec.c
> > +++ b/exec.c
> > @@ -1736,6 +1736,42 @@ static int64_t get_file_size(int fd)
> >      return size;
> >  }
> >
> > +static int64_t get_file_align(int fd)
> > +{
> > +    int64_t align = -1;
> > +#if defined(__linux__)
> > +    struct stat st;
> > +
> > +    if (fstat(fd, &st) < 0) {
> > +        return -errno;
> > +    }
> > +
> > +    /* Special handling for devdax character devices */
> > +    if (S_ISCHR(st.st_mode)) {
> > +        g_autofree char *subsystem_path = NULL;
> > +        g_autofree char *subsystem = NULL;
> > +
> > +        subsystem_path = g_strdup_printf("/sys/dev/char/%d:%d/subsystem",
> > +                                         major(st.st_rdev), minor(st.st_rdev));
> > +        subsystem = g_file_read_link(subsystem_path, NULL);
> > +
> > +        if (subsystem && g_str_has_suffix(subsystem, "/dax")) {
> > +            g_autofree char *align_path = NULL;
> > +            g_autofree char *align_str = NULL;
> > +
> > +            align_path = g_strdup_printf("/sys/dev/char/%d:%d/device/align",
> > +                                         major(st.st_rdev), minor(st.st_rdev));
> > +
>
> Perhaps, you meant instead:
>
> /sys/dev/char/%d:%d/align
>

Hmm, are you sure that's working? I expect the alignment to be found
in the region device:

/sys/class/dax:
/sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0012:00/ndbus1/region1/dax1.1/dax1.0
$(readlink -f /sys/dev/char/253\:263)/../align
$(readlink -f /sys/dev/char/253\:263)/device/align


/sys/bus/dax:
/sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0012:00/ndbus1/region1/dax1.0/dax1.0
$(readlink -f /sys/dev/char/253\:265)/../align
$(readlink -f /sys/dev/char/253\:265)/device/align <-- No such file

The use of the /sys/dev/char/%d:%d/device is only supported by the
deprecated /sys/class/dax. The current /sys/bus/dax device-model can
be a drop-in replacement as long as software is not written to the
/sys/class sysfs layout, i.e. it uses ../ instead of device/ to walk
to the region properties.



Re: [PATCH] exec: fetch the alignment of Linux devdax pmem character device nodes

2020-04-07 Thread Dan Williams
On Tue, Apr 7, 2020 at 1:08 AM Paolo Bonzini  wrote:
>
> On 07/04/20 09:29, Liu, Jingqi wrote:
> > Ping.
> >
> > Any comments are appreciated.
> >
> > Hi Paolo, Richard,
> >
> > Any comments about this ?
>
> I was hoping to get a review from someone else because I have no way to
> test it.  But I've now queued the patch, thanks.

Does qemu run tests in a nested VM? The difficult aspect of testing
devdax is that you need to boot your kernel with a special option or
have existing memory ranges assigned to the device. Although, Joao had
thoughts about allowing dynamic creation of device-dax instances by hot
unplugging memory.


>
> Paolo
>
> >
> > Thanks,
> >
> > Jingqi
> >
> > On 4/1/2020 11:13 AM, Liu, Jingqi wrote:
> >> If the backend file is devdax pmem character device, the alignment
> >> specified by the option 'align=NUM' in the '-object memory-backend-file'
> >> needs to match the alignment requirement of the devdax pmem character
> >> device.
> >>
> >> This patch fetches the devdax pmem file 'align', so that we can compare
> >> it with the NUM of 'align=NUM'.
> >> The NUM needs to be larger than or equal to the devdax pmem file 'align'.
> >>
> >> It also fixes the problem that mmap() returns failure in qemu_ram_mmap()
> >> when the NUM of 'align=NUM' is less than the devdax pmem file 'align'.
> >>
> >> Cc: Dan Williams 
> >> Signed-off-by: Jingqi Liu 
> >> ---
> >>   exec.c | 46 +-
> >>   1 file changed, 45 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/exec.c b/exec.c
> >> index de9d949902..8221abffec 100644
> >> --- a/exec.c
> >> +++ b/exec.c
> >> @@ -1736,6 +1736,42 @@ static int64_t get_file_size(int fd)
> >>      return size;
> >>  }
> >>
> >> +static int64_t get_file_align(int fd)
> >> +{
> >> +    int64_t align = -1;
> >> +#if defined(__linux__)
> >> +    struct stat st;
> >> +
> >> +    if (fstat(fd, &st) < 0) {
> >> +        return -errno;
> >> +    }
> >> +
> >> +    /* Special handling for devdax character devices */
> >> +    if (S_ISCHR(st.st_mode)) {
> >> +        g_autofree char *subsystem_path = NULL;
> >> +        g_autofree char *subsystem = NULL;
> >> +
> >> +        subsystem_path = g_strdup_printf("/sys/dev/char/%d:%d/subsystem",
> >> +                                         major(st.st_rdev), minor(st.st_rdev));
> >> +        subsystem = g_file_read_link(subsystem_path, NULL);
> >> +
> >> +        if (subsystem && g_str_has_suffix(subsystem, "/dax")) {
> >> +            g_autofree char *align_path = NULL;
> >> +            g_autofree char *align_str = NULL;
> >> +
> >> +            align_path = g_strdup_printf("/sys/dev/char/%d:%d/device/align",
> >> +                                         major(st.st_rdev), minor(st.st_rdev));
> >> +
> >> +            if (g_file_get_contents(align_path, &align_str, NULL, NULL)) {
> >> +                return g_ascii_strtoll(align_str, NULL, 0);
> >> +            }
> >> +        }
> >> +    }
> >> +#endif /* defined(__linux__) */
> >> +
> >> +    return align;
> >> +}
> >> +
> >>  static int file_ram_open(const char *path,
> >>                           const char *region_name,
> >>                           bool *created,
> >> @@ -2275,7 +2311,7 @@ RAMBlock *qemu_ram_alloc_from_fd(ram_addr_t size, MemoryRegion *mr,
> >>  {
> >>      RAMBlock *new_block;
> >>      Error *local_err = NULL;
> >> -    int64_t file_size;
> >> +    int64_t file_size, file_align;
> >>
> >>      /* Just support these ram flags by now. */
> >>      assert((ram_flags & ~(RAM_SHARED | RAM_PMEM)) == 0);
> >> @@ -2311,6 +2347,14 @@ RAMBlock *qemu_ram_alloc_from_fd(ram_addr_t size, MemoryRegion *mr,
> >>          return NULL;
> >>      }
> >>
> >> +    file_align = get_file_align(fd);
> >> +    if (file_align > 0 && mr && file_align > mr->align) {
> >> +        error_setg(errp, "backing store align 0x%" PRIx64
> >> +                   " is larger than 'align' option 0x" RAM_ADDR_FMT,
> >> +                   file_align, mr->align);
> >> +        return NULL;

Is there any downside to just making the alignment value be the max of
the device-dax instance align and the command line option? Why force
someone to debug the option unnecessarily?



Re: [Qemu-devel] [PATCH v9 05/11] numa: Extend CLI to provide initiator information for numa nodes

2019-08-14 Thread Dan Williams
On Wed, Aug 14, 2019 at 6:57 PM Tao Xu  wrote:
>
> On 8/15/2019 5:29 AM, Dan Williams wrote:
> > On Tue, Aug 13, 2019 at 10:14 PM Tao Xu  wrote:
> >>
> >> On 8/14/2019 10:39 AM, Dan Williams wrote:
> >>> On Tue, Aug 13, 2019 at 8:00 AM Igor Mammedov  wrote:
> >>>>
> >>>> On Fri,  9 Aug 2019 14:57:25 +0800
> >>>> Tao  wrote:
> >>>>
> >>>>> From: Tao Xu 
> >>>>>
> [...]
> >>>>> +    for (i = 0; i < machine->numa_state->num_nodes; i++) {
> >>>>> +        if (numa_info[i].initiator_valid &&
> >>>>> +            !numa_info[numa_info[i].initiator].has_cpu) {
> >>>>                 ^^ possible out of bounds read, see below
> >>>>
> >>>>> +            error_report("The initiator-id %"PRIu16 " of NUMA node %d"
> >>>>> +                         " does not exist.", numa_info[i].initiator, i);
> >>>>> +            error_printf("\n");
> >>>>> +
> >>>>> +            exit(1);
> >>>>> +        }
> >>>> it takes care only about nodes that have cpus or memory-only ones that 
> >>>> have
> >>>> initiator explicitly provided on CLI. And leaves possibility to have
> >>>> memory-only nodes without initiator mixed with nodes that have initiator.
> >>>> Is it valid to have mixed configuration?
> >>>> Should we forbid it?
> >>>
> >>> The spec talks about the "Proximity Domain for the Attached Initiator"
> >>> field only being valid if the memory controller for the memory can be
> >>> identified by an initiator id in the SRAT. So I expect the only way to
> >>> define a memory proximity domain without this local initiator is to
> >>> allow specifying a node-id that does not have an entry in the SRAT.
> >>>
> >> Hi Dan,
> >>
> >> So there may be a situation where the Attached Initiator field is not
> >> valid? If true, I would allow the user to input an invalid initiator.
> >
> > Yes it's something the OS needs to consider because the platform may
> > not be able to meet the constraint that a single initiator is
> > associated with the memory controller for a given memory target. In
> > retrospect it would have been nice if the spec reserved 0x for
> > this purpose, but it seems "not in SRAT" is the only way to identify
> > memory that is not attached to any single initiator.
> >
> But as far as I know, QEMU can't emulate a NUMA node "not in SRAT". I am
> wondering if it is effective to only set the initiator invalid?

You don't need to emulate a NUMA node not in SRAT. Just put a number
in this HMAT entry larger than the largest proximity domain number
found in the SRAT.
>



Re: [Qemu-devel] [PATCH v9 05/11] numa: Extend CLI to provide initiator information for numa nodes

2019-08-14 Thread Dan Williams
On Tue, Aug 13, 2019 at 10:14 PM Tao Xu  wrote:
>
> On 8/14/2019 10:39 AM, Dan Williams wrote:
> > On Tue, Aug 13, 2019 at 8:00 AM Igor Mammedov  wrote:
> >>
> >> On Fri,  9 Aug 2019 14:57:25 +0800
> >> Tao  wrote:
> >>
> >>> From: Tao Xu 
> >>>
> >>> In ACPI 6.3 chapter 5.2.27 Heterogeneous Memory Attribute Table (HMAT),
> >>> The initiator represents processor which access to memory. And in 5.2.27.3
> >>> Memory Proximity Domain Attributes Structure, the attached initiator is
> >>> defined as where the memory controller responsible for a memory proximity
> >>> domain. With attached initiator information, the topology of heterogeneous
> >>> memory can be described.
> >>>
> >>> Extend CLI of "-numa node" option to indicate the initiator numa node-id.
> >>> In the linux kernel, the codes in drivers/acpi/hmat/hmat.c parse and 
> >>> report
> >>> the platform's HMAT tables.
> >>>
> >>> Reviewed-by: Jingqi Liu 
> >>> Suggested-by: Dan Williams 
> >>> Signed-off-by: Tao Xu 
> >>> ---
> >>>
> >>> No changes in v9
> >>> ---
> >>>   hw/core/machine.c | 24 
> >>>   hw/core/numa.c| 13 +
> >>>   include/sysemu/numa.h |  3 +++
> >>>   qapi/machine.json |  6 +-
> >>>   qemu-options.hx   | 27 +++
> >>>   5 files changed, 68 insertions(+), 5 deletions(-)
> >>>
> >>> diff --git a/hw/core/machine.c b/hw/core/machine.c
> >>> index 3c55470103..113184a9df 100644
> >>> --- a/hw/core/machine.c
> >>> +++ b/hw/core/machine.c
> >>> @@ -640,6 +640,7 @@ void machine_set_cpu_numa_node(MachineState *machine,
> >>>                                 const CpuInstanceProperties *props, Error **errp)
> >>>  {
> >>>      MachineClass *mc = MACHINE_GET_CLASS(machine);
> >>> +    NodeInfo *numa_info = machine->numa_state->nodes;
> >>>      bool match = false;
> >>>      int i;
> >>>
> >>> @@ -709,6 +710,16 @@ void machine_set_cpu_numa_node(MachineState *machine,
> >>>              match = true;
> >>>              slot->props.node_id = props->node_id;
> >>>              slot->props.has_node_id = props->has_node_id;
> >>> +
> >>> +            if (numa_info[props->node_id].initiator_valid &&
> >>> +                (props->node_id != numa_info[props->node_id].initiator)) {
> >>> +                error_setg(errp, "The initiator of CPU NUMA node %" PRId64
> >>> +                           " should be itself.", props->node_id);
> >>> +                return;
> >>> +            }
> >>> +            numa_info[props->node_id].initiator_valid = true;
> >>> +            numa_info[props->node_id].has_cpu = true;
> >>> +            numa_info[props->node_id].initiator = props->node_id;
> >>>          }
> >>>
> >>>          if (!match) {
> >>> @@ -1050,6 +1061,7 @@ static void machine_numa_finish_cpu_init(MachineState *machine)
> >>>      GString *s = g_string_new(NULL);
> >>>      MachineClass *mc = MACHINE_GET_CLASS(machine);
> >>>      const CPUArchIdList *possible_cpus = mc->possible_cpu_arch_ids(machine);
> >>> +    NodeInfo *numa_info = machine->numa_state->nodes;
> >>>
> >>>      assert(machine->numa_state->num_nodes);
> >>>      for (i = 0; i < possible_cpus->len; i++) {
> >>> @@ -1083,6 +1095,18 @@ static void machine_numa_finish_cpu_init(MachineState *machine)
> >>>              machine_set_cpu_numa_node(machine, &props, &error_fatal);
> >>>          }
> >>>      }
> >>> +
> >>> +    for (i = 0; i < machine->numa_state->num_nodes; i++) {
> >>> +        if (numa_info[i].initiator_valid &&
> >>> +            !numa_info[numa_info[i].initiator].has_cpu) {
> >>              ^^ possible out of bounds read, see below
> >>
> >>> +            error_report("The initiator-id %"PRIu16 " of NUMA node %d"
> >>> +                         " does not exist.", numa_info[i].initiator, i);
>

Re: [Qemu-devel] [PATCH v9 05/11] numa: Extend CLI to provide initiator information for numa nodes

2019-08-13 Thread Dan Williams
On Tue, Aug 13, 2019 at 8:00 AM Igor Mammedov  wrote:
>
> On Fri,  9 Aug 2019 14:57:25 +0800
> Tao  wrote:
>
> > From: Tao Xu 
> >
> > In ACPI 6.3 chapter 5.2.27 Heterogeneous Memory Attribute Table (HMAT),
> > The initiator represents processor which access to memory. And in 5.2.27.3
> > Memory Proximity Domain Attributes Structure, the attached initiator is
> > defined as where the memory controller responsible for a memory proximity
> > domain. With attached initiator information, the topology of heterogeneous
> > memory can be described.
> >
> > Extend CLI of "-numa node" option to indicate the initiator numa node-id.
> > In the linux kernel, the codes in drivers/acpi/hmat/hmat.c parse and report
> > the platform's HMAT tables.
> >
> > Reviewed-by: Jingqi Liu 
> > Suggested-by: Dan Williams 
> > Signed-off-by: Tao Xu 
> > ---
> >
> > No changes in v9
> > ---
> >  hw/core/machine.c | 24 
> >  hw/core/numa.c| 13 +
> >  include/sysemu/numa.h |  3 +++
> >  qapi/machine.json |  6 +-
> >  qemu-options.hx   | 27 +++
> >  5 files changed, 68 insertions(+), 5 deletions(-)
> >
> > diff --git a/hw/core/machine.c b/hw/core/machine.c
> > index 3c55470103..113184a9df 100644
> > --- a/hw/core/machine.c
> > +++ b/hw/core/machine.c
> > @@ -640,6 +640,7 @@ void machine_set_cpu_numa_node(MachineState *machine,
> >                                 const CpuInstanceProperties *props, Error **errp)
> >  {
> >      MachineClass *mc = MACHINE_GET_CLASS(machine);
> > +    NodeInfo *numa_info = machine->numa_state->nodes;
> >      bool match = false;
> >      int i;
> >
> > @@ -709,6 +710,16 @@ void machine_set_cpu_numa_node(MachineState *machine,
> >              match = true;
> >              slot->props.node_id = props->node_id;
> >              slot->props.has_node_id = props->has_node_id;
> > +
> > +            if (numa_info[props->node_id].initiator_valid &&
> > +                (props->node_id != numa_info[props->node_id].initiator)) {
> > +                error_setg(errp, "The initiator of CPU NUMA node %" PRId64
> > +                           " should be itself.", props->node_id);
> > +                return;
> > +            }
> > +            numa_info[props->node_id].initiator_valid = true;
> > +            numa_info[props->node_id].has_cpu = true;
> > +            numa_info[props->node_id].initiator = props->node_id;
> >          }
> >
> >          if (!match) {
> > @@ -1050,6 +1061,7 @@ static void machine_numa_finish_cpu_init(MachineState *machine)
> >      GString *s = g_string_new(NULL);
> >      MachineClass *mc = MACHINE_GET_CLASS(machine);
> >      const CPUArchIdList *possible_cpus = mc->possible_cpu_arch_ids(machine);
> > +    NodeInfo *numa_info = machine->numa_state->nodes;
> >
> >      assert(machine->numa_state->num_nodes);
> >      for (i = 0; i < possible_cpus->len; i++) {
> > @@ -1083,6 +1095,18 @@ static void machine_numa_finish_cpu_init(MachineState *machine)
> >              machine_set_cpu_numa_node(machine, &props, &error_fatal);
> >          }
> >      }
> > +
> > +    for (i = 0; i < machine->numa_state->num_nodes; i++) {
> > +        if (numa_info[i].initiator_valid &&
> > +            !numa_info[numa_info[i].initiator].has_cpu) {
>              ^^ possible out of bounds read, see below
>
> > +            error_report("The initiator-id %"PRIu16 " of NUMA node %d"
> > +                         " does not exist.", numa_info[i].initiator, i);
> > +            error_printf("\n");
> > +
> > +            exit(1);
> > +        }
> it takes care only about nodes that have cpus or memory-only ones that have
> initiator explicitly provided on CLI. And leaves possibility to have
> memory-only nodes without initiator mixed with nodes that have initiator.
> Is it valid to have mixed configuration?
> Should we forbid it?

The spec talks about the "Proximity Domain for the Attached Initiator"
field only being valid if the memory controller for the memory can be
identified by an initiator id in the SRAT. So I expect the only way to
define a memory proximity domain without this local initiator is to
allow specifying a node-id that does not have an entry in the SRAT.

That would be a useful feature for testing OS HMAT p

Re: [Qemu-devel] [PATCH v10 4/7] dm: enable synchronous dax

2019-05-31 Thread Dan Williams
On Tue, May 21, 2019 at 6:43 AM Pankaj Gupta  wrote:
>
>  This patch sets the dax device 'DAXDEV_SYNC' flag if all the target
>  devices of the device mapper support synchronous DAX. If the device
>  mapper consists of both synchronous and asynchronous dax devices,
>  we don't set the 'DAXDEV_SYNC' flag.
>
> Signed-off-by: Pankaj Gupta 
> ---
>  drivers/md/dm-table.c | 14 ++
>  1 file changed, 14 insertions(+)
>
> diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
> index cde3b49b2a91..1cce626ff576 100644
> --- a/drivers/md/dm-table.c
> +++ b/drivers/md/dm-table.c
> @@ -886,10 +886,17 @@ static int device_supports_dax(struct dm_target *ti, 
> struct dm_dev *dev,
> return bdev_dax_supported(dev->bdev, PAGE_SIZE);
>  }
>
> +static int device_synchronous(struct dm_target *ti, struct dm_dev *dev,
> +  sector_t start, sector_t len, void *data)
> +{
> +   return dax_synchronous(dev->dax_dev);
> +}
> +
>  static bool dm_table_supports_dax(struct dm_table *t)
>  {
> struct dm_target *ti;
> unsigned i;
> +   bool dax_sync = true;
>
> /* Ensure that all targets support DAX. */
> for (i = 0; i < dm_table_get_num_targets(t); i++) {
> @@ -901,7 +908,14 @@ static bool dm_table_supports_dax(struct dm_table *t)
> if (!ti->type->iterate_devices ||
> !ti->type->iterate_devices(ti, device_supports_dax, NULL))
> return false;
> +
> +   /* Check devices support synchronous DAX */
> +   if (dax_sync &&
> +   !ti->type->iterate_devices(ti, device_synchronous, NULL))
> +   dax_sync = false;

Looks like this needs to be rebased on the current state of v5.2-rc,
and then we can nudge Mike for an ack.



Re: [Qemu-devel] [PATCH v4 00/11] Build ACPI Heterogeneous Memory Attribute Table (HMAT)

2019-05-31 Thread Dan Williams
On Tue, May 7, 2019 at 11:32 PM Tao Xu  wrote:
>
> This series of patches will build Heterogeneous Memory Attribute Table (HMAT)
> according to the command line. The ACPI HMAT describes the memory attributes,
> such as memory side cache attributes and bandwidth and latency details,
> related to the System Physical Address (SPA) Memory Ranges.
> The software is expected to use this information as hint for optimization.
>
> OSPM evaluates HMAT only during system initialization. Any changes to the HMAT
> state at runtime or information regarding HMAT for hot plug are communicated
> using the _HMA method.
[..]

Hi,

I gave these patches a try while developing support for the new EFI
v2.8 Specific Purpose Memory attribute [1]. I have a gap / feature
request to note to make this implementation capable of emulating
current shipping platform BIOS implementations for persistent memory
platforms.

The NUMA configuration I tested was:

-numa node,mem=4G,cpus=0-19,nodeid=0
-numa node,mem=4G,cpus=20-39,nodeid=1
-numa node,mem=4G,nodeid=2
-numa node,mem=4G,nodeid=3

...and it produced an entry like the following for proximity domain 2.

[0C8h 0200   2]   Structure Type :  [Memory Proximity
Domain Attributes]
[0CAh 0202   2] Reserved : 
[0CCh 0204   4]   Length : 0028
[0D0h 0208   2]Flags (decoded below) : 0002
Processor Proximity Domain Valid : 0
[0D2h 0210   2]Reserved1 : 
[0D4h 0212   4]   Processor Proximity Domain : 0002
[0D8h 0216   4]  Memory Proximity Domain : 0002
[0DCh 0220   4]Reserved2 : 
[0E0h 0224   8]Reserved3 : 00024000
[0E8h 0232   8]Reserved4 : 0001

Notice that the Processor "Proximity Domain Valid" bit is clear. I
understand that the implementation is keying off of whether cpus are
defined for that same node or not, but that's not how current
persistent memory platforms implement "Processor Proximity Domain". On
these platforms persistent memory indeed has its own proximity domain,
but the Processor Proximity Domain is expected to be assigned to the
domain that houses the memory controller for that persistent memory.
So to emulate that configuration it would be useful to have a way to
specify "Processor Proximity Domain" without needing to define CPUs in
that domain.

Something like:

-numa node,mem=4G,cpus=0-19,nodeid=0
-numa node,mem=4G,cpus=20-39,nodeid=1
-numa node,mem=4G,nodeid=2,localnodeid=0
-numa node,mem=4G,nodeid=3,localnodeid=1

...to specify that node2 memory is connected / local to node0 and
node3 memory is connected / local to node1. In general HMAT specifies
that all performance differentiated memory ranges have their own
proximity domain, but those are expected to still be associated with a
local/host/home-socket memory controller.

[1]: https://lists.01.org/pipermail/linux-nvdimm/2019-May/021668.html



Re: [Qemu-devel] [PATCH v9 1/7] libnvdimm: nd_region flush callback support

2019-05-15 Thread Dan Williams
On Tue, May 14, 2019 at 7:55 AM Pankaj Gupta  wrote:
>
> This patch adds functionality to perform flush from guest
> to host over VIRTIO. We are registering a callback based
> on 'nd_region' type. virtio_pmem driver requires this special
> flush function. For rest of the region types we are registering
> existing flush function. Report error returned by host fsync
> failure to userspace.
>
> Signed-off-by: Pankaj Gupta 
> ---
>  drivers/acpi/nfit/core.c |  4 ++--
>  drivers/nvdimm/claim.c   |  6 --
>  drivers/nvdimm/nd.h  |  1 +
>  drivers/nvdimm/pmem.c| 13 -
>  drivers/nvdimm/region_devs.c | 26 --
>  include/linux/libnvdimm.h|  8 +++-
>  6 files changed, 46 insertions(+), 12 deletions(-)
>
> diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
> index 5a389a4f4f65..08dde76cf459 100644
> --- a/drivers/acpi/nfit/core.c
> +++ b/drivers/acpi/nfit/core.c
> @@ -2434,7 +2434,7 @@ static void write_blk_ctl(struct nfit_blk *nfit_blk, 
> unsigned int bw,
> offset = to_interleave_offset(offset, mmio);
>
> writeq(cmd, mmio->addr.base + offset);
> -   nvdimm_flush(nfit_blk->nd_region);
> +   nvdimm_flush(nfit_blk->nd_region, NULL);
>
> if (nfit_blk->dimm_flags & NFIT_BLK_DCR_LATCH)
> readq(mmio->addr.base + offset);
> @@ -2483,7 +2483,7 @@ static int acpi_nfit_blk_single_io(struct nfit_blk 
> *nfit_blk,
> }
>
> if (rw)
> -   nvdimm_flush(nfit_blk->nd_region);
> +   nvdimm_flush(nfit_blk->nd_region, NULL);
>
> rc = read_blk_stat(nfit_blk, lane) ? -EIO : 0;
> return rc;
> diff --git a/drivers/nvdimm/claim.c b/drivers/nvdimm/claim.c
> index fb667bf469c7..13510bae1e6f 100644
> --- a/drivers/nvdimm/claim.c
> +++ b/drivers/nvdimm/claim.c
> @@ -263,7 +263,7 @@ static int nsio_rw_bytes(struct nd_namespace_common *ndns,
> struct nd_namespace_io *nsio = to_nd_namespace_io(&ndns->dev);
> unsigned int sz_align = ALIGN(size + (offset & (512 - 1)), 512);
> sector_t sector = offset >> 9;
> -   int rc = 0;
> +   int rc = 0, ret = 0;
>
> if (unlikely(!size))
> return 0;
> @@ -301,7 +301,9 @@ static int nsio_rw_bytes(struct nd_namespace_common *ndns,
> }
>
> memcpy_flushcache(nsio->addr + offset, buf, size);
> -   nvdimm_flush(to_nd_region(ndns->dev.parent));
> +   ret = nvdimm_flush(to_nd_region(ndns->dev.parent), NULL);
> +   if (ret)
> +   rc = ret;
>
> return rc;
>  }
> diff --git a/drivers/nvdimm/nd.h b/drivers/nvdimm/nd.h
> index a5ac3b240293..0c74d2428bd7 100644
> --- a/drivers/nvdimm/nd.h
> +++ b/drivers/nvdimm/nd.h
> @@ -159,6 +159,7 @@ struct nd_region {
> struct badblocks bb;
> struct nd_interleave_set *nd_set;
> struct nd_percpu_lane __percpu *lane;
> +   int (*flush)(struct nd_region *nd_region, struct bio *bio);

So this triggers:

In file included from drivers/nvdimm/e820.c:7:
./include/linux/libnvdimm.h:140:51: warning: ‘struct bio’ declared
inside parameter list will not be visible outside of this definition
or declaration
  int (*flush)(struct nd_region *nd_region, struct bio *bio);
   ^~~
I was already feeling uneasy about trying to squeeze this into v5.2,
but this warning and the continued drip of comments leads me to
conclude that this driver would do well to wait one more development
cycle. Lets close out the final fixups and let this driver soak in
-next. Then for the v5.3 cycle I'll redouble my efforts towards the
goal of closing patch acceptance at the -rc6 / -rc7 development
milestone.
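
One plausible fixup for that warning, sketched under the assumption that
the callback stays in include/linux/libnvdimm.h, is a forward declaration
so the header no longer depends on <linux/bio.h>; this is a hypothetical
fragment, not the patch that was eventually applied.

	/* Sketch: forward-declare struct bio so users such as
	 * drivers/nvdimm/e820.c that never include <linux/bio.h>
	 * still see a complete declaration. */
	struct bio;

	struct nd_region_desc {
		/* ... existing fields ... */
		int (*flush)(struct nd_region *nd_region, struct bio *bio);
	};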



Re: [Qemu-devel] [PATCH v9 4/7] dm: enable synchronous dax

2019-05-15 Thread Dan Williams
[ add Mike and dm-devel ]

Mike, any concerns with the below addition to the device-mapper-dax
implementation?

On Tue, May 14, 2019 at 7:58 AM Pankaj Gupta  wrote:
>
>  This patch sets dax device 'DAXDEV_SYNC' flag if all the target
>  devices of device mapper support synchronous DAX. If device
>  mapper consists of both synchronous and asynchronous dax devices,
>  we don't set 'DAXDEV_SYNC' flag.
>
> Signed-off-by: Pankaj Gupta 
> ---
>  drivers/md/dm-table.c | 14 ++
>  1 file changed, 14 insertions(+)
>
> diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
> index cde3b49b2a91..1cce626ff576 100644
> --- a/drivers/md/dm-table.c
> +++ b/drivers/md/dm-table.c
> @@ -886,10 +886,17 @@ static int device_supports_dax(struct dm_target *ti, 
> struct dm_dev *dev,
> return bdev_dax_supported(dev->bdev, PAGE_SIZE);
>  }
>
> +static int device_synchronous(struct dm_target *ti, struct dm_dev *dev,
> +  sector_t start, sector_t len, void *data)
> +{
> +   return dax_synchronous(dev->dax_dev);
> +}
> +
>  static bool dm_table_supports_dax(struct dm_table *t)
>  {
> struct dm_target *ti;
> unsigned i;
> +   bool dax_sync = true;
>
> /* Ensure that all targets support DAX. */
> for (i = 0; i < dm_table_get_num_targets(t); i++) {
> @@ -901,7 +908,14 @@ static bool dm_table_supports_dax(struct dm_table *t)
> if (!ti->type->iterate_devices ||
> !ti->type->iterate_devices(ti, device_supports_dax, NULL))
> return false;
> +
> +   /* Check devices support synchronous DAX */
> +   if (dax_sync &&
> +   !ti->type->iterate_devices(ti, device_synchronous, NULL))
> +   dax_sync = false;
> }
> +   if (dax_sync)
> +   set_dax_synchronous(t->md->dax_dev);
>
> return true;
>  }
> --
> 2.20.1
>



Re: [Qemu-devel] [PATCH v9 2/7] virtio-pmem: Add virtio pmem driver

2019-05-15 Thread Dan Williams
On Tue, May 14, 2019 at 8:25 AM Pankaj Gupta  wrote:
>
>
> > On 5/14/19 7:54 AM, Pankaj Gupta wrote:
> > > diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
> > > index 35897649c24f..94bad084ebab 100644
> > > --- a/drivers/virtio/Kconfig
> > > +++ b/drivers/virtio/Kconfig
> > > @@ -42,6 +42,17 @@ config VIRTIO_PCI_LEGACY
> > >
> > >   If unsure, say Y.
> > >
> > > +config VIRTIO_PMEM
> > > +   tristate "Support for virtio pmem driver"
> > > +   depends on VIRTIO
> > > +   depends on LIBNVDIMM
> > > +   help
> > > +   This driver provides access to virtio-pmem devices, storage devices
> > > +   that are mapped into the physical address space - similar to NVDIMMs
> > > +- with a virtio-based flushing interface.
> > > +
> > > +   If unsure, say M.
> >
> > 
> > from Documentation/process/coding-style.rst:
> > "Lines under a ``config`` definition
> > are indented with one tab, while help text is indented an additional two
> > spaces."
>
> ah... I changed help text and 'checkpatch' did not say anything :( .
>
> Will wait for Dan, if it's possible to add two spaces to the help text
> while applying the series.

I'm inclined to handle this with a fixup appended to the end of the
series just so the patchwork-bot does not get confused by the content
changing from what was sent to the list.



Re: [Qemu-devel] [PATCH v8 3/6] libnvdimm: add dax_dev sync flag

2019-05-13 Thread Dan Williams
On Mon, May 13, 2019 at 10:32 AM Pankaj Gupta  wrote:
>
>
> Hi Dan,
>
> While testing device mapper with DAX, I faced a bug with the commit:
>
> commit ad428cdb525a97d15c0349fdc80f3d58befb50df
> Author: Dan Williams 
> Date:   Wed Feb 20 21:12:50 2019 -0800
>
> When I reverted the condition to old code[1] it worked for me. I am
> wondering: when we map two different devices (e.g. with device mapper),
> will start & end pfn still point to the same pgmap? Or is there something
> else that I am missing here.
>
> Note: I tested only EXT4.
>
> [1]
>
> -   if (pgmap && pgmap->type == MEMORY_DEVICE_FS_DAX)
> +   end_pgmap = get_dev_pagemap(pfn_t_to_pfn(end_pfn), NULL);
> +   if (pgmap && pgmap == end_pgmap && pgmap->type == MEMORY_DEVICE_FS_DAX
> +   && pfn_t_to_page(pfn)->pgmap == pgmap
> +   && pfn_t_to_page(end_pfn)->pgmap == pgmap
> +   && pfn_t_to_pfn(pfn) == PHYS_PFN(__pa(kaddr))
> +   && pfn_t_to_pfn(end_pfn) == PHYS_PFN(__pa(end_kaddr)))

Ugh, yes, device-mapper continues to be an awkward fit for dax (or
vice versa). We would either need a way to have a multi-level pfn to
pagemap lookup for composite devices, or a way to discern that even
though the pagemap is different that the result is still valid / not
an indication that we have leaked into an unassociated address range.
Perhaps a per-daxdev callback for ->dax_supported() so that
device-mapper internals can be used for this validation.

We need to get that fixed up, but I don't see it as a blocker /
pre-requisite for virtio-pmem.
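
A rough sketch of that per-daxdev idea, with hypothetical names for the
new hook (this is the shape being suggested, not an existing API;
generic_fsdax_supported() stands in for the current start/end pgmap
comparison).

	/* Sketch: let the dax_device owner validate a range so that
	 * device-mapper can recurse into its component devices instead
	 * of relying on a single pgmap lookup. */
	struct dax_operations {
		/* ... existing ops ... */
		bool (*dax_supported)(struct dax_device *dax_dev,
				struct block_device *bdev, int blocksize,
				sector_t start, sector_t len);
	};

	bool dax_supported(struct dax_device *dax_dev,
			struct block_device *bdev, int blocksize,
			sector_t start, sector_t len)
	{
		if (!dax_dev->ops->dax_supported)
			return generic_fsdax_supported(dax_dev, bdev,
					blocksize, start, len);
		return dax_dev->ops->dax_supported(dax_dev, bdev,
				blocksize, start, len);
	}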



Re: [Qemu-devel] [PATCH v8 3/6] libnvdimm: add dax_dev sync flag

2019-05-10 Thread Dan Williams
On Fri, May 10, 2019 at 5:45 PM Pankaj Gupta  wrote:
>
>
>
> > >
> > > This patch adds 'DAXDEV_SYNC' flag which is set
> > > for nd_region doing synchronous flush. This is later
> > > used to disable MAP_SYNC functionality for the
> > > ext4 & xfs filesystems for devices that don't support
> > > synchronous flush.
> > >
> > > Signed-off-by: Pankaj Gupta 
> > > ---
> > >  drivers/dax/bus.c|  2 +-
> > >  drivers/dax/super.c  | 13 -
> > >  drivers/md/dm.c  |  3 ++-
> > >  drivers/nvdimm/pmem.c|  5 -
> > >  drivers/nvdimm/region_devs.c |  7 +++
> > >  include/linux/dax.h  |  8 ++--
> > >  include/linux/libnvdimm.h|  1 +
> > >  7 files changed, 33 insertions(+), 6 deletions(-)
> > [..]
> > > diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> > > index 043f0761e4a0..ee007b75d9fd 100644
> > > --- a/drivers/md/dm.c
> > > +++ b/drivers/md/dm.c
> > > @@ -1969,7 +1969,8 @@ static struct mapped_device *alloc_dev(int minor)
> > > sprintf(md->disk->disk_name, "dm-%d", minor);
> > >
> > > if (IS_ENABLED(CONFIG_DAX_DRIVER)) {
> > > -   dax_dev = alloc_dax(md, md->disk->disk_name, &dm_dax_ops);
> > > +   dax_dev = alloc_dax(md, md->disk->disk_name, &dm_dax_ops,
> > > +DAXDEV_F_SYNC);
> >
> > Apologies for not realizing this until now, but this is broken.
> > Imagine a device-mapper configuration composed of both 'async'
> > virtio-pmem and 'sync' pmem. The 'sync' flag needs to be unified
> > across all members. I would change this argument to '0' and then
> > arrange for it to be set at dm_table_supports_dax() time after
> > validating that all components support synchronous dax.
>
> o.k. Need to set 'DAXDEV_F_SYNC' flag after verifying all the target
> components support synchronous DAX.
>
> Just a question: if a device-mapper configuration is composed of both
> virtio-pmem and pmem devices, do we want to configure device mapper for
> async flush?

If it's composed of both then, yes, it needs to be async flush at the
device-mapper level. Otherwise MAP_SYNC will succeed and fail to
trigger fsync on the host file when necessary for the virtio-pmem
backed portion of the device-mapper device.



Re: [Qemu-devel] [PATCH v7 2/6] virtio-pmem: Add virtio pmem driver

2019-05-10 Thread Dan Williams
On Wed, May 8, 2019 at 4:19 AM Pankaj Gupta  wrote:
>
>
> Hi Dan,
>
> Thank you for the review. Please see my reply inline.
>
> >
> > Hi Pankaj,
> >
> > Some minor file placement comments below.
>
> Sure.
>
> >
> > On Thu, Apr 25, 2019 at 10:02 PM Pankaj Gupta  wrote:
> > >
> > > This patch adds virtio-pmem driver for KVM guest.
> > >
> > > Guest reads the persistent memory range information from
> > > Qemu over VIRTIO and registers it on nvdimm_bus. It also
> > > creates a nd_region object with the persistent memory
> > > range information so that existing 'nvdimm/pmem' driver
> > > can reserve this into system memory map. This way
> > > 'virtio-pmem' driver uses existing functionality of pmem
> > > driver to register persistent memory compatible for DAX
> > > capable filesystems.
> > >
> > > This also provides function to perform guest flush over
> > > VIRTIO from 'pmem' driver when userspace performs flush
> > > on DAX memory range.
> > >
> > > Signed-off-by: Pankaj Gupta 
> > > ---
> > >  drivers/nvdimm/virtio_pmem.c | 114 +
> > >  drivers/virtio/Kconfig   |  10 +++
> > >  drivers/virtio/Makefile  |   1 +
> > >  drivers/virtio/pmem.c| 118 +++
> > >  include/linux/virtio_pmem.h  |  60 
> > >  include/uapi/linux/virtio_ids.h  |   1 +
> > >  include/uapi/linux/virtio_pmem.h |  10 +++
> > >  7 files changed, 314 insertions(+)
> > >  create mode 100644 drivers/nvdimm/virtio_pmem.c
> > >  create mode 100644 drivers/virtio/pmem.c
> > >  create mode 100644 include/linux/virtio_pmem.h
> > >  create mode 100644 include/uapi/linux/virtio_pmem.h
> > >
> > > diff --git a/drivers/nvdimm/virtio_pmem.c b/drivers/nvdimm/virtio_pmem.c
> > > new file mode 100644
> > > index ..66b582f751a3
> > > --- /dev/null
> > > +++ b/drivers/nvdimm/virtio_pmem.c
> > > @@ -0,0 +1,114 @@
> > > +// SPDX-License-Identifier: GPL-2.0
> > > +/*
> > > + * virtio_pmem.c: Virtio pmem Driver
> > > + *
> > > + * Discovers persistent memory range information
> > > + * from host and provides a virtio based flushing
> > > + * interface.
> > > + */
> > > +#include 
> > > +#include "nd.h"
> > > +
> > > + /* The interrupt handler */
> > > +void host_ack(struct virtqueue *vq)
> > > +{
> > > +   unsigned int len;
> > > +   unsigned long flags;
> > > +   struct virtio_pmem_request *req, *req_buf;
> > > +   struct virtio_pmem *vpmem = vq->vdev->priv;
> > > +
> > > +   spin_lock_irqsave(&vpmem->pmem_lock, flags);
> > > +   while ((req = virtqueue_get_buf(vq, &len)) != NULL) {
> > > +   req->done = true;
> > > +   wake_up(&req->host_acked);
> > > +
> > > +   if (!list_empty(&vpmem->req_list)) {
> > > +   req_buf = list_first_entry(&vpmem->req_list,
> > > +   struct virtio_pmem_request, list);
> > > +   list_del(&vpmem->req_list);
> > > +   req_buf->wq_buf_avail = true;
> > > +   wake_up(&req_buf->wq_buf);
> > > +   }
> > > +   }
> > > +   spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
> > > +}
> > > +EXPORT_SYMBOL_GPL(host_ack);
> > > +
> > > + /* The request submission function */
> > > +int virtio_pmem_flush(struct nd_region *nd_region)
> > > +{
> > > +   int err;
> > > +   unsigned long flags;
> > > +   struct scatterlist *sgs[2], sg, ret;
> > > +   struct virtio_device *vdev = nd_region->provider_data;
> > > +   struct virtio_pmem *vpmem = vdev->priv;
> > > +   struct virtio_pmem_request *req;
> > > +
> > > +   might_sleep();
> > > +   req = kmalloc(sizeof(*req), GFP_KERNEL);
> > > +   if (!req)
> > > +   return -ENOMEM;
> > > +
> > > +   req->done = req->wq_buf_avail = false;
> > > +   strcpy(req->name, "FLUSH");
> > > +   init_waitqueue_head(&req->host_acked);
> > > +   init_waitqueue_head(&req->wq_buf);
> > > +   sg_init_one(&sg, req->name, strlen(req->name));
> > > +   sgs[0] = &sg;
> > > +   sg_init_one(&ret, &req->ret, sizeof(req->ret));
> > > +   sgs[1] = &ret;
> > > +
> > > +   spin_lock_irqsave(&vpmem->pmem_lock, flags);
> > > +   err = virtqueue_add_sgs(vpmem->req_vq, sgs, 1, 1, req, 
> > > GFP_ATOMIC);
> > > +   if (err) {
> > > +   dev_err(&vdev->dev, "failed to send command to virtio pmem
> > > device\n");
> > > +
> > > +   list_add_tail(&vpmem->req_list, &req->list);
> > > +   spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
> > > +
> > > +   /* When host has read buffer, this completes via host_ack
> > > */
> > > +   wait_event(req->wq_buf, req->wq_buf_avail);
> > > +   spin_lock_irqsave(&vpmem->pmem_lock, flags);
> > > +   }
> > > +   err = virtqueue_kick(vpmem->req_vq);
> > > +   spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
> > > +
> > > +   

Re: [Qemu-devel] [PATCH v8 0/6] virtio pmem driver

2019-05-10 Thread Dan Williams
On Fri, May 10, 2019 at 8:52 AM Pankaj Gupta  wrote:
>
>  Hi Michael & Dan,
>
>  Please review/ack the patch series from LIBNVDIMM & VIRTIO side.
>  We have ack on ext4, xfs patches(4, 5 & 6) patch 2. Still need
>  your ack on nvdimm patches(1 & 3) & virtio patch 2.

I was planning to merge these via the nvdimm tree, not ack them. Did
you have another maintainer lined up to take these patches?



Re: [Qemu-devel] [PATCH v8 3/6] libnvdimm: add dax_dev sync flag

2019-05-10 Thread Dan Williams
On Fri, May 10, 2019 at 8:53 AM Pankaj Gupta  wrote:
>
> This patch adds 'DAXDEV_SYNC' flag which is set
> for nd_region doing synchronous flush. This is later
> used to disable MAP_SYNC functionality for the
> ext4 & xfs filesystems for devices that don't support
> synchronous flush.
>
> Signed-off-by: Pankaj Gupta 
> ---
>  drivers/dax/bus.c|  2 +-
>  drivers/dax/super.c  | 13 -
>  drivers/md/dm.c  |  3 ++-
>  drivers/nvdimm/pmem.c|  5 -
>  drivers/nvdimm/region_devs.c |  7 +++
>  include/linux/dax.h  |  8 ++--
>  include/linux/libnvdimm.h|  1 +
>  7 files changed, 33 insertions(+), 6 deletions(-)
[..]
> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> index 043f0761e4a0..ee007b75d9fd 100644
> --- a/drivers/md/dm.c
> +++ b/drivers/md/dm.c
> @@ -1969,7 +1969,8 @@ static struct mapped_device *alloc_dev(int minor)
> sprintf(md->disk->disk_name, "dm-%d", minor);
>
> if (IS_ENABLED(CONFIG_DAX_DRIVER)) {
> -   dax_dev = alloc_dax(md, md->disk->disk_name, &dm_dax_ops);
> +   dax_dev = alloc_dax(md, md->disk->disk_name, &dm_dax_ops,
> +DAXDEV_F_SYNC);

Apologies for not realizing this until now, but this is broken.
Imagine a device-mapper configuration composed of both 'async'
virtio-pmem and 'sync' pmem. The 'sync' flag needs to be unified
across all members. I would change this argument to '0' and then
arrange for it to be set at dm_table_supports_dax() time after
validating that all components support synchronous dax.



Re: [Qemu-devel] [PATCH v7 6/6] xfs: disable map_sync for async flush

2019-05-07 Thread Dan Williams
On Thu, Apr 25, 2019 at 10:03 PM Pankaj Gupta  wrote:
>
> Don't support 'MAP_SYNC' with non-DAX files and DAX files
> with asynchronous dax_device. Virtio pmem provides
> asynchronous host page cache flush mechanism. We don't
> support 'MAP_SYNC' with virtio pmem and xfs.
>
> Signed-off-by: Pankaj Gupta 
> ---
>  fs/xfs/xfs_file.c | 9 ++---
>  1 file changed, 6 insertions(+), 3 deletions(-)

Darrick, does this look ok to take through the nvdimm tree?

>
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index a7ceae90110e..f17652cca5ff 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -1203,11 +1203,14 @@ xfs_file_mmap(
> struct file *filp,
> struct vm_area_struct *vma)
>  {
> +   struct dax_device   *dax_dev;
> +
> +   dax_dev = xfs_find_daxdev_for_inode(file_inode(filp));
> /*
> -* We don't support synchronous mappings for non-DAX files. At least
> -* until someone comes with a sensible use case.
> +* We don't support synchronous mappings for non-DAX files and
> +* for DAX files if underneath dax_device is not synchronous.
>  */
> -   if (!IS_DAX(file_inode(filp)) && (vma->vm_flags & VM_SYNC))
> +   if (!daxdev_mapping_supported(vma, dax_dev))
> return -EOPNOTSUPP;
>
> file_accessed(filp);
> --
> 2.20.1
>
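
For context, the helper referenced in the hunk above could plausibly be
structured like this; a sketch assuming a dax_synchronous() query exists
on the dax_device, not necessarily the final implementation.

	/* Sketch: MAP_SYNC (VM_SYNC) is only honored for DAX files
	 * backed by a synchronous dax_device; everything else is
	 * rejected with -EOPNOTSUPP by the caller. */
	static inline bool daxdev_mapping_supported(struct vm_area_struct *vma,
						struct dax_device *dax_dev)
	{
		if (!(vma->vm_flags & VM_SYNC))
			return true;
		if (!IS_DAX(file_inode(vma->vm_file)))
			return false;
		return dax_synchronous(dax_dev);
	}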



Re: [Qemu-devel] [PATCH v7 3/6] libnvdimm: add dax_dev sync flag

2019-05-07 Thread Dan Williams
On Thu, Apr 25, 2019 at 10:02 PM Pankaj Gupta  wrote:
>
> This patch adds 'DAXDEV_SYNC' flag which is set
> for nd_region doing synchronous flush. This is later
> used to disable MAP_SYNC functionality for the
> ext4 & xfs filesystems for devices that don't support
> synchronous flush.
>
> Signed-off-by: Pankaj Gupta 
[..]
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index 0dd316a74a29..c97fc0cc7167 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -7,6 +7,9 @@
>  #include 
>  #include 
>
> +/* Flag for synchronous flush */
> +#define DAXDEV_F_SYNC true

I'd feel better, i.e. it reads more canonically, if this was defined
as (1UL << 0) and the argument to alloc_dax() was changed to 'unsigned
long flags' rather than a bool.
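
Spelled out, the suggestion amounts to roughly the following sketch; the
names mirror the quoted patch and the exact signature is still up for
discussion.

	/* Sketch: a real flags word instead of a bare bool */
	#define DAXDEV_F_SYNC	(1UL << 0)

	struct dax_device *alloc_dax(void *private, const char *host,
			const struct dax_operations *ops,
			unsigned long flags);

	/* callers then read naturally: */
	dax_dev = alloc_dax(md, md->disk->disk_name, &dm_dax_ops,
			DAXDEV_F_SYNC);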



Re: [Qemu-devel] [PATCH v7 2/6] virtio-pmem: Add virtio pmem driver

2019-05-07 Thread Dan Williams
Hi Pankaj,

Some minor file placement comments below.

On Thu, Apr 25, 2019 at 10:02 PM Pankaj Gupta  wrote:
>
> This patch adds virtio-pmem driver for KVM guest.
>
> Guest reads the persistent memory range information from
> Qemu over VIRTIO and registers it on nvdimm_bus. It also
> creates a nd_region object with the persistent memory
> range information so that existing 'nvdimm/pmem' driver
> can reserve this into system memory map. This way
> 'virtio-pmem' driver uses existing functionality of pmem
> driver to register persistent memory compatible for DAX
> capable filesystems.
>
> This also provides function to perform guest flush over
> VIRTIO from 'pmem' driver when userspace performs flush
> on DAX memory range.
>
> Signed-off-by: Pankaj Gupta 
> ---
>  drivers/nvdimm/virtio_pmem.c | 114 +
>  drivers/virtio/Kconfig   |  10 +++
>  drivers/virtio/Makefile  |   1 +
>  drivers/virtio/pmem.c| 118 +++
>  include/linux/virtio_pmem.h  |  60 
>  include/uapi/linux/virtio_ids.h  |   1 +
>  include/uapi/linux/virtio_pmem.h |  10 +++
>  7 files changed, 314 insertions(+)
>  create mode 100644 drivers/nvdimm/virtio_pmem.c
>  create mode 100644 drivers/virtio/pmem.c
>  create mode 100644 include/linux/virtio_pmem.h
>  create mode 100644 include/uapi/linux/virtio_pmem.h
>
> diff --git a/drivers/nvdimm/virtio_pmem.c b/drivers/nvdimm/virtio_pmem.c
> new file mode 100644
> index ..66b582f751a3
> --- /dev/null
> +++ b/drivers/nvdimm/virtio_pmem.c
> @@ -0,0 +1,114 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * virtio_pmem.c: Virtio pmem Driver
> + *
> + * Discovers persistent memory range information
> + * from host and provides a virtio based flushing
> + * interface.
> + */
> +#include 
> +#include "nd.h"
> +
> + /* The interrupt handler */
> +void host_ack(struct virtqueue *vq)
> +{
> +   unsigned int len;
> +   unsigned long flags;
> +   struct virtio_pmem_request *req, *req_buf;
> +   struct virtio_pmem *vpmem = vq->vdev->priv;
> +
> +   spin_lock_irqsave(&vpmem->pmem_lock, flags);
> +   while ((req = virtqueue_get_buf(vq, &len)) != NULL) {
> +   req->done = true;
> +   wake_up(&req->host_acked);
> +
> +   if (!list_empty(&vpmem->req_list)) {
> +   req_buf = list_first_entry(&vpmem->req_list,
> +   struct virtio_pmem_request, list);
> +   list_del(&vpmem->req_list);
> +   req_buf->wq_buf_avail = true;
> +   wake_up(&req_buf->wq_buf);
> +   }
> +   }
> +   spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
> +}
> +EXPORT_SYMBOL_GPL(host_ack);
> +
> + /* The request submission function */
> +int virtio_pmem_flush(struct nd_region *nd_region)
> +{
> +   int err;
> +   unsigned long flags;
> +   struct scatterlist *sgs[2], sg, ret;
> +   struct virtio_device *vdev = nd_region->provider_data;
> +   struct virtio_pmem *vpmem = vdev->priv;
> +   struct virtio_pmem_request *req;
> +
> +   might_sleep();
> +   req = kmalloc(sizeof(*req), GFP_KERNEL);
> +   if (!req)
> +   return -ENOMEM;
> +
> +   req->done = req->wq_buf_avail = false;
> +   strcpy(req->name, "FLUSH");
> +   init_waitqueue_head(&req->host_acked);
> +   init_waitqueue_head(&req->wq_buf);
> +   sg_init_one(&sg, req->name, strlen(req->name));
> +   sgs[0] = &sg;
> +   sg_init_one(&ret, &req->ret, sizeof(req->ret));
> +   sgs[1] = &ret;
> +
> +   spin_lock_irqsave(&vpmem->pmem_lock, flags);
> +   err = virtqueue_add_sgs(vpmem->req_vq, sgs, 1, 1, req, GFP_ATOMIC);
> +   if (err) {
> +   dev_err(&vdev->dev, "failed to send command to virtio pmem 
> device\n");
> +
> +   list_add_tail(&vpmem->req_list, &req->list);
> +   spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
> +
> +   /* When host has read buffer, this completes via host_ack */
> +   wait_event(req->wq_buf, req->wq_buf_avail);
> +   spin_lock_irqsave(&vpmem->pmem_lock, flags);
> +   }
> +   err = virtqueue_kick(vpmem->req_vq);
> +   spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
> +
> +   if (!err) {
> +   err = -EIO;
> +   goto ret;
> +   }
> +   /* When host has read buffer, this completes via host_ack */
> +   wait_event(req->host_acked, req->done);
> +   err = req->ret;
> +ret:
> +   kfree(req);
> +   return err;
> +};
> +
> + /* The asynchronous flush callback function */
> +int async_pmem_flush(struct nd_region *nd_region, struct bio *bio)
> +{
> +   int rc = 0;
> +
> +   /* Create child bio for asynchronous flush and chain with
> +* parent bio. Otherwise directly call nd_region flush.
> +*/
> +   if (bio && bio->bi_iter.bi_

Re: [Qemu-devel] [PATCH v5 1/6] libnvdimm: nd_region flush callback support

2019-04-22 Thread Dan Williams
On Mon, Apr 22, 2019 at 8:59 AM Jeff Moyer  wrote:
>
> Dan Williams  writes:
>
> > On Thu, Apr 18, 2019 at 9:18 AM Christoph Hellwig  
> > wrote:
> >>
> >> On Thu, Apr 18, 2019 at 09:05:05AM -0700, Dan Williams wrote:
> >> > > > I'd either add a comment about avoiding retpoline overhead here or 
> >> > > > just
> >> > > > make ->flush == NULL mean generic_nvdimm_flush(). Just so that 
> >> > > > people don't
> >> > > > get confused by the code.
> >> > >
> >> > > Isn't this premature optimization?  I really don't like adding things
> >> > > like this without some numbers to show it's worth it.
> >> >
> >> > I don't think it's premature given this optimization technique is
> >> > already being deployed elsewhere, see:
> >> >
> >> > https://lwn.net/Articles/774347/
> >>
> >> For one this one was backed by numbers, and second after feedback
> >> from Linus we switched to the NULL pointer check instead.
> >
> > Ok I should have noticed the switch to NULL pointer check. However,
> > the question still stands do we want everyone to run numbers to
> > justify this optimization, or make it a new common kernel coding
> > practice to do:
> >
> > if (!object->op)
> > generic_op(object);
> > else
> > object->op(object);
> >
> > ...in hot paths?
>
> I don't think nvdimm_flush is a hot path.  Numbers of some
> representative workload would prove one of us right.

I'd rather say that the "if (!op) do_generic()" pattern is more
readable in the general case; it saves grepping for who set the op in the
common case. The fact that it has the potential to be faster is gravy
at that point.



Re: [Qemu-devel] [PATCH v5 1/6] libnvdimm: nd_region flush callback support

2019-04-18 Thread Dan Williams
On Thu, Apr 18, 2019 at 9:18 AM Christoph Hellwig  wrote:
>
> On Thu, Apr 18, 2019 at 09:05:05AM -0700, Dan Williams wrote:
> > > > I'd either add a comment about avoiding retpoline overhead here or just
> > > > make ->flush == NULL mean generic_nvdimm_flush(). Just so that people 
> > > > don't
> > > > get confused by the code.
> > >
> > > Isn't this premature optimization?  I really don't like adding things
> > > like this without some numbers to show it's worth it.
> >
> > I don't think it's premature given this optimization technique is
> > already being deployed elsewhere, see:
> >
> > https://lwn.net/Articles/774347/
>
> For one this one was backed by numbers, and second after feedback
> from Linus we switched to the NULL pointer check instead.

Ok I should have noticed the switch to NULL pointer check. However,
the question still stands do we want everyone to run numbers to
justify this optimization, or make it a new common kernel coding
practice to do:

if (!object->op)
generic_op(object);
else
object->op(object);

...in hot paths? I agree with not doing premature optimization in
principle, but this hack is minimally intrusive from a readability
perspective similar to likely()/unlikely() usage which also don't come
with numbers on a per patch basis.
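
Applied to this series, the convention would look something like the
following sketch, assuming nvdimm_flush() keeps its bio argument.

	/* Sketch: a NULL ->flush selects the generic synchronous path
	 * directly, avoiding a retpolined indirect call in the common
	 * case. */
	int nvdimm_flush(struct nd_region *nd_region, struct bio *bio)
	{
		if (!nd_region->flush)
			return generic_nvdimm_flush(nd_region);
		return nd_region->flush(nd_region, bio);
	}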



Re: [Qemu-devel] [PATCH v5 1/6] libnvdimm: nd_region flush callback support

2019-04-18 Thread Dan Williams
On Fri, Apr 12, 2019 at 6:12 AM Jeff Moyer  wrote:
>
> Jan Kara  writes:
>
> > On Thu 11-04-19 07:51:48, Dan Williams wrote:
> >> On Tue, Apr 9, 2019 at 9:09 PM Pankaj Gupta  wrote:
> >> > +   } else {
> >> > +   if (nd_region->flush(nd_region))
> >> > +   rc = -EIO;
> >>
> >> Given the common case wants to be fast and synchronous I think we
> >> should try to avoid retpoline overhead by default. So something like
> >> this:
> >>
> >> if (nd_region->flush == generic_nvdimm_flush)
> >> rc = generic_nvdimm_flush(...);
> >
> > I'd either add a comment about avoiding retpoline overhead here or just
> > make ->flush == NULL mean generic_nvdimm_flush(). Just so that people don't
> > get confused by the code.
>
> Isn't this premature optimization?  I really don't like adding things
> like this without some numbers to show it's worth it.

I don't think it's premature given this optimization technique is
already being deployed elsewhere, see:

https://lwn.net/Articles/774347/



Re: [Qemu-devel] [PATCH v5 1/6] libnvdimm: nd_region flush callback support

2019-04-11 Thread Dan Williams
On Thu, Apr 11, 2019 at 9:02 AM Pankaj Gupta  wrote:
>
>
>
> > >
> > > This patch adds functionality to perform flush from guest
> > > to host over VIRTIO. We are registering a callback based
> > > on 'nd_region' type. virtio_pmem driver requires this special
> > > flush function. For rest of the region types we are registering
> > > existing flush function. Report error returned by host fsync
> > > failure to userspace.
> > >
> > > This also handles asynchronous flush requests from the block layer
> > > by creating a child bio and chaining it with parent bio.
> > >
> > > Signed-off-by: Pankaj Gupta 
> > > ---bio_chain Dan williams
> > [..]
> > > diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
> > > index b4ef7d9ff22e..fb1041ab32a6 100644
> > > --- a/drivers/nvdimm/region_devs.c
> > > +++ b/drivers/nvdimm/region_devs.c
> > > @@ -295,7 +295,9 @@ static ssize_t deep_flush_store(struct device *dev,
> > > struct device_attribute *att
> > > return rc;
> > > if (!flush)
> > > return -EINVAL;
> > > -   nvdimm_flush(nd_region);
> > > +   rc = nvdimm_flush(nd_region, NULL, false);
> > > +   if (rc)
> > > +   return rc;
> > >
> > > return len;
> > >  }
> > > @@ -1085,6 +1087,11 @@ static struct nd_region *nd_region_create(struct
> > > nvdimm_bus *nvdimm_bus,
> > > dev->of_node = ndr_desc->of_node;
> > > nd_region->ndr_size = resource_size(ndr_desc->res);
> > > nd_region->ndr_start = ndr_desc->res->start;
> > > +   if (ndr_desc->flush)
> > > +   nd_region->flush = ndr_desc->flush;
> > > +   else
> > > +   nd_region->flush = generic_nvdimm_flush;
> > > +
> > > nd_device_register(dev);
> > >
> > > return nd_region;
> > > @@ -1125,11 +1132,36 @@ struct nd_region
> > > *nvdimm_volatile_region_create(struct nvdimm_bus *nvdimm_bus,
> > >  }
> > >  EXPORT_SYMBOL_GPL(nvdimm_volatile_region_create);
> > >
> > > +int nvdimm_flush(struct nd_region *nd_region, struct bio *bio, bool 
> > > async)
> > > +{
> >
> > I don't quite see the point of the 'async' argument. All the usages of
> > this routine are either
> >
> > nvdimm_flush(nd_region, bio, true)
> > ...or:
> > nvdimm_flush(nd_region, NULL, false)
>
> Agree.
>
> >
> > ...so why not gate async behavior on the presence of the 'bio' argument?
>
> Sure.
>
> >
> >
> > > +   int rc = 0;
> > > +
> > > +   /* Create child bio for asynchronous flush and chain with
> > > +* parent bio. Otherwise directly call nd_region flush.
> > > +*/
> > > +   if (async && bio->bi_iter.bi_sector != -1) {
> > > +
> > > +   struct bio *child = bio_alloc(GFP_ATOMIC, 0);
> > > +
> > > +   if (!child)
> > > +   return -ENOMEM;
> > > +   bio_copy_dev(child, bio);
> > > +   child->bi_opf = REQ_PREFLUSH;
> > > +   child->bi_iter.bi_sector = -1;
> > > +   bio_chain(child, bio);
> > > +   submit_bio(child);
> >
> > I understand how this works, but it's a bit too "magical" for my
> > taste. I would prefer that all flush implementations take an optional
> > 'bio' argument rather than rely on the make_request implementation to
> > stash the bio away on a driver specific list.
>
> I did this to make use of "bio_chain" for chaining a child bio for async
> flush, as suggested in [1]. Are you saying to remove this and just call
> "flush" based on the bio argument? Or did I implement the 'bio_chain'
> request entirely wrong?

No, I think you implemented it correctly. I'm just asking for the
chaining to be performed internal to the ->flush() callback rather
than in the common nvdimm_flush() front-end.
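
In other words, something like this sketch, with the chaining internal to
the driver callback; the function name is hypothetical and the body is
lifted from the quoted patch.

	/* Sketch: the virtio driver's ->flush() does its own bio
	 * chaining; synchronous callers simply pass bio = NULL. */
	static int virtio_pmem_region_flush(struct nd_region *nd_region,
			struct bio *bio)
	{
		if (bio) {
			struct bio *child = bio_alloc(GFP_ATOMIC, 0);

			if (!child)
				return -ENOMEM;
			bio_copy_dev(child, bio);
			child->bi_opf = REQ_PREFLUSH;
			child->bi_iter.bi_sector = -1;
			bio_chain(child, bio);
			submit_bio(child);
			return 0;
		}
		return virtio_pmem_flush(nd_region);
	}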



Re: [Qemu-devel] [PATCH v5 3/6] libnvdimm: add dax_dev sync flag

2019-04-11 Thread Dan Williams
On Tue, Apr 9, 2019 at 9:10 PM Pankaj Gupta  wrote:
>
> This patch adds 'DAXDEV_SYNC' flag which is set
> for nd_region doing synchronous flush. This is later
> used to disable MAP_SYNC functionality for the
> ext4 & xfs filesystems for devices that don't support
> synchronous flush.
>
> Signed-off-by: Pankaj Gupta 
> ---
>  drivers/dax/bus.c|  2 +-
>  drivers/dax/super.c  | 13 -
>  drivers/md/dm.c  |  2 +-
>  drivers/nvdimm/pmem.c|  3 ++-
>  drivers/nvdimm/region_devs.c |  7 +++
>  include/linux/dax.h  |  9 +++--
>  include/linux/libnvdimm.h|  1 +
>  7 files changed, 31 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
> index 2109cfe80219..431bf7d2a7f9 100644
> --- a/drivers/dax/bus.c
> +++ b/drivers/dax/bus.c
> @@ -388,7 +388,7 @@ struct dev_dax *__devm_create_dev_dax(struct dax_region 
> *dax_region, int id,
>  * No 'host' or dax_operations since there is no access to this
>  * device outside of mmap of the resulting character device.
>  */
> -   dax_dev = alloc_dax(dev_dax, NULL, NULL);
> +   dax_dev = alloc_dax(dev_dax, NULL, NULL, true);

I find apis that take a boolean as unreadable. What does 'true' mean?
It wastes time to go look at the function definition vs something
like:

alloc_dax(dev_dax, NULL, NULL, DAXDEV_F_SYNC);



Re: [Qemu-devel] [PATCH v5 1/6] libnvdimm: nd_region flush callback support

2019-04-11 Thread Dan Williams
On Tue, Apr 9, 2019 at 9:09 PM Pankaj Gupta  wrote:
>
> This patch adds functionality to perform flush from guest
> to host over VIRTIO. We are registering a callback based
> on 'nd_region' type. virtio_pmem driver requires this special
> flush function. For rest of the region types we are registering
> existing flush function. Report error returned by host fsync
> failure to userspace.
>
> This also handles asynchronous flush requests from the block layer
> by creating a child bio and chaining it with parent bio.
>
> Signed-off-by: Pankaj Gupta 
> ---
[..]
> diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
> index b4ef7d9ff22e..fb1041ab32a6 100644
> --- a/drivers/nvdimm/region_devs.c
> +++ b/drivers/nvdimm/region_devs.c
> @@ -295,7 +295,9 @@ static ssize_t deep_flush_store(struct device *dev, 
> struct device_attribute *att
> return rc;
> if (!flush)
> return -EINVAL;
> -   nvdimm_flush(nd_region);
> +   rc = nvdimm_flush(nd_region, NULL, false);
> +   if (rc)
> +   return rc;
>
> return len;
>  }
> @@ -1085,6 +1087,11 @@ static struct nd_region *nd_region_create(struct 
> nvdimm_bus *nvdimm_bus,
> dev->of_node = ndr_desc->of_node;
> nd_region->ndr_size = resource_size(ndr_desc->res);
> nd_region->ndr_start = ndr_desc->res->start;
> +   if (ndr_desc->flush)
> +   nd_region->flush = ndr_desc->flush;
> +   else
> +   nd_region->flush = generic_nvdimm_flush;
> +
> nd_device_register(dev);
>
> return nd_region;
> @@ -1125,11 +1132,36 @@ struct nd_region 
> *nvdimm_volatile_region_create(struct nvdimm_bus *nvdimm_bus,
>  }
>  EXPORT_SYMBOL_GPL(nvdimm_volatile_region_create);
>
> +int nvdimm_flush(struct nd_region *nd_region, struct bio *bio, bool async)
> +{

I don't quite see the point of the 'async' argument. All the usages of
this routine are either

nvdimm_flush(nd_region, bio, true)
...or:
nvdimm_flush(nd_region, NULL, false)

...so why not gate async behavior on the presence of the 'bio' argument?


> +   int rc = 0;
> +
> +   /* Create child bio for asynchronous flush and chain with
> +* parent bio. Otherwise directly call nd_region flush.
> +*/
> +   if (async && bio->bi_iter.bi_sector != -1) {
> +
> +   struct bio *child = bio_alloc(GFP_ATOMIC, 0);
> +
> +   if (!child)
> +   return -ENOMEM;
> +   bio_copy_dev(child, bio);
> +   child->bi_opf = REQ_PREFLUSH;
> +   child->bi_iter.bi_sector = -1;
> +   bio_chain(child, bio);
> +   submit_bio(child);

I understand how this works, but it's a bit too "magical" for my
taste. I would prefer that all flush implementations take an optional
'bio' argument rather than rely on the make_request implementation to
stash the bio away on a driver specific list.

> +   } else {
> +   if (nd_region->flush(nd_region))
> +   rc = -EIO;

Given the common case wants to be fast and synchronous I think we
should try to avoid retpoline overhead by default. So something like
this:

if (nd_region->flush == generic_nvdimm_flush)
rc = generic_nvdimm_flush(...);



Re: [Qemu-devel] Why only devdax guarantees guest data persistence ?

2019-02-20 Thread Dan Williams
On Wed, Feb 20, 2019 at 7:37 AM Stefan Hajnoczi  wrote:
>
> On Fri, Feb 15, 2019 at 05:09:31PM +, bipin.tomar--- via Qemu-devel wrote:
> > Text from "docs/nvdimm.txt" says:
> > Guest Data Persistence
> > --
> >
> > Though QEMU supports multiple types of vNVDIMM backends on Linux,
> > currently the only one that can guarantee the guest write persistence
> > is the device DAX on the real NVDIMM device (e.g., /dev/dax0.0), to
> > which all guest access do not involve any host-side kernel cache.
> >
> > I think "host-side kernel cache" here implies "page cache". Why does fsdax
> > NOT have the same persistence guarantees as devdax for vNVDIMM?
> > Both modes avoid using the page cache, so why is devdax explicitly called
> > out?
>
> File systems may require msync(2)/fsync(2) to guarantee persistence even
> with DAX (just a cache flush instruction may not be enough!).  Emulated
> NVDIMM devices lack an fsync interface so guests are unable to fsync the
> host file system.
>
> This is not an issue with devdax since there is no host file system.
>
> virtio-pmem is an effort to add a paravirtualized fsync-style interface
> and should solve this problem in the future:
> https://lkml.org/lkml/2019/1/9/541

A para-virtualized flush is only necessary if the host implements
caches and dirty-metadata that cannot be coordinated / flushed by
MAP_SYNC. virtio-pmem is not a solution to a problem, virtio-pmem is a
non-DAX alternative implementation of persistent memory that requires
the guest to coordinate metadata and host page cache flushing.



Re: [Qemu-devel] Querying the size of devdax devices from userspace

2019-02-04 Thread Dan Williams
On Sun, Feb 3, 2019 at 11:56 PM Stefan Hajnoczi  wrote:
>
> How can userspace applications query the size of devdax character
> devices?
>
> stat(1) doesn't know how large the device is:
>
>   # stat /dev/dax0.0
> File: /dev/dax0.0
> Size: 0 Blocks: 0  IO Block: 4096   character special 
> file
>   Device: 6h/6d Inode: 56764   Links: 1 Device type: fa,d
>
> ndctl(1) finds out by digging in /sys:
>
>   # ndctl list
>   [
> {
>   "dev":"namespace0.0",
>   "mode":"devdax",
>   "map":"dev",
>   "size":2111832064,
>   "uuid":"235acf4d-503f-46be-bf64-f26faf9777ef",
>   "chardev":"dax0.0"
> }
>   ]
>
> I'm not sure how to do it without enumerating all nvdimms in /sys.  Is
> there a mapping from devdax major/minor number to a /sys path?
>
> The use case I have in mind is that QEMU currently takes the size as a
> command-line parameter.  The failure mode is ugly when users get this
> value wrong: the guest gets a softlockup and there is no error message
> on the host side.
>
> I'd like QEMU to detect the size or at least reject size values that are
> too large.  In order to do that userspace needs a convenient way of
> querying the size.  Any ideas?

You're looking for /sys/dev/char. That will take you from major:minor
to sysfs. Perhaps borrow fio's implementation for the same:

   http://git.kernel.dk/cgit/fio/tree/engines/dev-dax.c#n258
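
As an example, a minimal userspace sketch of that lookup, assuming the
device-dax sysfs node exposes a 'size' attribute as the fio engine relies
on; error handling is reduced to the essentials.

	/* Sketch: resolve a devdax chardev to
	 * /sys/dev/char/<maj>:<min>/size and read it. */
	#include <stdio.h>
	#include <sys/stat.h>
	#include <sys/sysmacros.h>

	static long long devdax_size(const char *path)
	{
		struct stat st;
		char sysfs[64];
		long long size = -1;
		FILE *f;

		if (stat(path, &st) < 0 || !S_ISCHR(st.st_mode))
			return -1;
		snprintf(sysfs, sizeof(sysfs), "/sys/dev/char/%u:%u/size",
			 major(st.st_rdev), minor(st.st_rdev));
		f = fopen(sysfs, "r");
		if (!f)
			return -1;
		if (fscanf(f, "%lld", &size) != 1)
			size = -1;
		fclose(f);
		return size;
	}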



Re: [Qemu-devel] [PATCH V9 4/6] util/mmap-alloc: support MAP_SYNC in qemu_ram_mmap()

2019-01-22 Thread Dan Williams
On Mon, Jan 21, 2019 at 7:27 PM Michael S. Tsirkin  wrote:
[..]
> > 2. The reset of cases:
> >   - we will never pass the MAP_SYNC to mmap2
>
> I don't see code probing for MAP_SYNC support. Did I miss it?
> But if all you want is to have old linux ignore MAP_SYNC,
> I think you got your wish automatically - just do not set
> MAP_SHARED_VALIDATE.

That will also cause new Linux to ignore MAP_SYNC.
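
For reference, a sketch of the userspace probe pattern under discussion:
kernels that predate MAP_SHARED_VALIDATE fail the first mmap() with
EINVAL, while newer kernels that cannot honor MAP_SYNC fail it with
EOPNOTSUPP, so a fallback to plain MAP_SHARED stays possible without
silently losing the guarantee. On some libc versions MAP_SYNC may need
<linux/mman.h>.

	/* Sketch: probe for MAP_SYNC, falling back to MAP_SHARED. */
	#define _GNU_SOURCE
	#include <errno.h>
	#include <sys/mman.h>

	static void *map_pmem(int fd, size_t len)
	{
		void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
			       MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);

		if (p == MAP_FAILED &&
		    (errno == EOPNOTSUPP || errno == EINVAL))
			p = mmap(NULL, len, PROT_READ | PROT_WRITE,
				 MAP_SHARED, fd, 0);
		return p;
	}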



Re: [Qemu-devel] [PATCH v3 0/5] kvm "virtio pmem" device

2019-01-14 Thread Dan Williams
On Mon, Jan 14, 2019 at 1:25 PM Dave Chinner  wrote:
>
> On Mon, Jan 14, 2019 at 02:15:40AM -0500, Pankaj Gupta wrote:
> >
> > > > Until you have images (and hence host page cache) shared between
> > > > multiple guests. People will want to do this, because it means they
> > > > only need a single set of pages in host memory for executable
> > > > binaries rather than a set of pages per guest. Then you have
> > > > multiple guests being able to detect residency of the same set of
> > > > pages. If the guests can then, in any way, control eviction of the
> > > > pages from the host cache, then we have a guest-to-guest information
> > > > leak channel.
> > >
> > > I don't think we should ever be considering something that would allow a
> > > guest to evict page's from the host's pagecache [1].  The guest should
> > > be able to kick its own references to the host's pagecache out of its
> > > own pagecache, but not be able to influence whether the host or another
> > > guest has a read-only mapping cached.
> > >
> > > [1] Unless the guest is allowed to modify the host's file; obviously
> > > truncation, holepunching, etc are going to evict pages from the host's
> > > page cache.
> >
> > This is so correct. The guest does not evict host page cache pages directly.
>
> They don't right now.
>
> But someone is going to end up asking for discard to work so that
> the guest can free unused space in the underlying spares image (i.e.
> make use of fstrim or mount -o discard) because they have workloads
> that have bursts of space usage and they need to trim the image
> files afterwards to keep their overall space usage under control.
>
> And then

...we reject / push back on that patch citing the above concern.

> > In case of virtio-pmem & DAX, guest clears guest page cache exceptional 
> > entries.
> > Its solely decision of host to take action on the host page cache pages.
> >
> > In case of virtio-pmem, guest does not modify host file directly i.e don't
> > perform hole punch & truncation operation directly on host file.
>
> ... this will no longer be true, and the nuclear landmine in this
> driver interface will have been armed

I agree with the need to be careful when / if explicit cache control
is added, but that's not the case today.



Re: [Qemu-devel] [PATCH v3 0/5] kvm "virtio pmem" device

2019-01-12 Thread Dan Williams
On Sat, Jan 12, 2019 at 5:38 PM Pankaj Gupta  wrote:
>
>
>
> >
> > On Thu 10-01-19 12:26:17, Dave Chinner wrote:
> > > On Wed, Jan 09, 2019 at 08:17:31PM +0530, Pankaj Gupta wrote:
> > > >  This patch series has implementation for "virtio pmem".
> > > >  "virtio pmem" is fake persistent memory(nvdimm) in guest
> > > >  which allows to bypass the guest page cache. This also
> > > >  implements a VIRTIO based asynchronous flush mechanism.
> > >
> > > Hmmm. Sharing the host page cache direct into the guest VM. Sounds
> > > like a good idea, but...
> > >
> > > This means the guest VM can now run timing attacks to observe host
> > > side page cache residency, and depending on the implementation I'm
> > > guessing that the guest will be able to control host side page
> > > cache eviction, too (e.g. via discard or hole punch operations).
> > >
> > > Which means this functionality looks to me like a new vector for
> > > information leakage into and out of the guest VM via guest
> > > controlled host page cache manipulation.
> > >
> > > https://arxiv.org/pdf/1901.01161
> > >
> > > I might be wrong, but if I'm not we're going to have to be very
> > > careful about how guest VMs can access and manipulate host side
> > > resources like the page cache.
> >
> > Right. Thinking about this I would be more concerned about the fact that
> > guest can effectively pin amount of host's page cache upto size of the
> > device/file passed to guest as PMEM, can't it Pankaj? Or is there some QEMU
> > magic that avoids this?
>
> Yes, guest will pin these host page cache pages using 'get_user_pages' by
> elevating the page reference count. But these pages can be reclaimed by host
> at any time when there is memory pressure.

Wait, how can the guest pin the host pages? I would expect this to
happen only when using vfio and device assignment. Otherwise, no the
host can't reclaim a pinned page, that's the whole point of a pin to
prevent the mm from reclaiming ownership.

> KVM does not permanently pin pages. vfio does that but we are not using
> it here.

Right, so I'm confused by your pin assertion above.



Re: [Qemu-devel] [PATCH v3 3/5] libnvdimm: add nd_region buffered dax_dev flag

2019-01-09 Thread Dan Williams
On Wed, Jan 9, 2019 at 5:53 AM Pankaj Gupta  wrote:
>
> This patch adds a 'DAXDEV_BUFFERED' flag which is set
> for the nd_region corresponding to virtio pmem. This is
> later used to disable MAP_SYNC functionality for the ext4
> & xfs filesystems.
>
> Signed-off-by: Pankaj Gupta 
> ---
>  drivers/dax/super.c  | 17 +
>  drivers/nvdimm/pmem.c|  3 +++
>  drivers/nvdimm/region_devs.c |  7 +++
>  drivers/virtio/pmem.c|  1 +
>  include/linux/dax.h  |  9 +
>  include/linux/libnvdimm.h|  6 ++
>  6 files changed, 43 insertions(+)
>
> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> index 6e928f3..9128740 100644
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -167,6 +167,8 @@ enum dax_device_flags {
> DAXDEV_ALIVE,
> /* gate whether dax_flush() calls the low level flush routine */
> DAXDEV_WRITE_CACHE,
> +   /* flag to disable MAP_SYNC for virtio based host page cache flush */
> +   DAXDEV_BUFFERED,
>  };
>
>  /**
> @@ -335,6 +337,21 @@ bool dax_write_cache_enabled(struct dax_device *dax_dev)
>  }
>  EXPORT_SYMBOL_GPL(dax_write_cache_enabled);
>
> +void virtio_pmem_host_cache(struct dax_device *dax_dev, bool wc)
> +{
> +   if (wc)
> +   set_bit(DAXDEV_BUFFERED, &dax_dev->flags);
> +   else
> +   clear_bit(DAXDEV_BUFFERED, &dax_dev->flags);
> +}
> +EXPORT_SYMBOL_GPL(virtio_pmem_host_cache);

The "write_cache" property was structured this way because it can
conceivably change at runtime. The MAP_SYNC capability should be
static and never changed after init.

> +bool virtio_pmem_host_cache_enabled(struct dax_device *dax_dev)
> +{
> +   return test_bit(DAXDEV_BUFFERED, &dax_dev->flags);
> +}
> +EXPORT_SYMBOL_GPL(virtio_pmem_host_cache_enabled);

Echoing Darrick and Jan, this should be a generic property of a
dax_device and not specific to virtio. I don't like the "buffered"
designation as that's not accurate. There may be hardware reasons why
a dax_device is not synchronous, like a requirement to flush a
write-pending queue or otherwise notify the device of new writes.

I would just have a dax_synchronous() helper and a DAXDEV_SYNC flag. I
would also modify alloc_dax() to take a flags argument so that the
capability can be instantiated when the dax_device is allocated.
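
A sketch of that shape, mirroring the existing write_cache helpers but as
a static, init-time property; the names follow the suggestion above and
are not yet an in-tree API.

	/* Sketch: a static "synchronous" capability, set once via
	 * alloc_dax() flags and never toggled at runtime. */
	enum dax_device_flags {
		DAXDEV_ALIVE,
		DAXDEV_WRITE_CACHE,
		DAXDEV_SYNC,	/* device is synchronous, MAP_SYNC-safe */
	};

	bool dax_synchronous(struct dax_device *dax_dev)
	{
		return test_bit(DAXDEV_SYNC, &dax_dev->flags);
	}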



Re: [Qemu-devel] [PATCH v2 2/2] virtio-pmem: Add virtio pmem driver

2018-10-17 Thread Dan Williams
On Wed, Oct 17, 2018 at 12:11 PM Pankaj Gupta  wrote:
>
>
>
> > On Fri, Oct 12, 2018 at 10:01 PM Pankaj Gupta  wrote:
> > >
> > > This patch adds virtio-pmem driver for KVM guest.
> > >
> > > Guest reads the persistent memory range information from
> > > Qemu over VIRTIO and registers it on nvdimm_bus. It also
> > > creates a nd_region object with the persistent memory
> > > range information so that existing 'nvdimm/pmem' driver
> > > can reserve this into system memory map. This way
> > > 'virtio-pmem' driver uses existing functionality of pmem
> > > driver to register persistent memory compatible for DAX
> > > capable filesystems.
> > >
> > > This also provides function to perform guest flush over
> > > VIRTIO from 'pmem' driver when userspace performs flush
> > > on DAX memory range.
> >
> > Before we can move forward with this driver we need additional
> > filesystem enabling to detect when the backing device is fronting DAX
> > pmem or a paravirtualized page cache through virtio-pmem. Any
> > interface that requires fsync() and a round trip to the hypervisor to
> > flush host page cache is not DAX.
>
> I saw your proposal[1] for the new mmap flag MAP_DIRECT. IIUIC the mapping
> should fail for MAP_DIRECT if it requires an explicit flush or buffer
> indirection. So, if we disable the MAP_SYNC flag for virtio-pmem, should
> this fail MAP_DIRECT as well? Otherwise, without MAP_DIRECT, virtio-pmem
> should default to the VIRTIO flush mechanism.

Right, although I wouldn't worry about MAP_DIRECT in the short term
since we're still discussing what the upstream interface should be. Regardless
of whether MAP_DIRECT is specified or not the virtio-flush mechanism
would always be used for virtio-pmem. I.e. there is no possibility to
get full DAX operation with virtio-pmem, only the page-cache bypass
sub-set.

Taking a look at where we could inject this check for filesystems it's
a bit awkward to do it in xfs_file_mmap() for example because we do
not have the backing device for the extents of the inode. So at a
minimum you would need to investigate calling xfs_inode_supports_dax()
from that path and teaching it about a new dax_device flag. I'm
thinking the dax_device flag should be called DAXDEV_BUFFERED to
indicate the presence of software buffering on a device that otherwise
supports bypassing the local page cache.



Re: [Qemu-devel] [PATCH v2 2/2] virtio-pmem: Add virtio pmem driver

2018-10-13 Thread Dan Williams
On Fri, Oct 12, 2018 at 10:01 PM Pankaj Gupta  wrote:
>
> This patch adds virtio-pmem driver for KVM guest.
>
> Guest reads the persistent memory range information from
> Qemu over VIRTIO and registers it on nvdimm_bus. It also
> creates a nd_region object with the persistent memory
> range information so that existing 'nvdimm/pmem' driver
> can reserve this into system memory map. This way
> 'virtio-pmem' driver uses existing functionality of pmem
> driver to register persistent memory compatible for DAX
> capable filesystems.
>
> This also provides function to perform guest flush over
> VIRTIO from 'pmem' driver when userspace performs flush
> on DAX memory range.

Before we can move forward with this driver we need additional
filesystem enabling to detect when the backing device is fronting DAX
pmem or a paravirtualized page cache through virtio-pmem. Any
interface that requires fsync() and a round trip to the hypervisor to
flush host page cache is not DAX.



Re: [Qemu-devel] [PATCH v2 0/2] kvm "fake DAX" device

2018-10-13 Thread Dan Williams
On Fri, Oct 12, 2018 at 10:00 PM Pankaj Gupta  wrote:
>
>  This patch series has implementation for "fake DAX".
>  "fake DAX" is fake persistent memory(nvdimm) in guest
>  which allows to bypass the guest page cache. This also
>  implements a VIRTIO based asynchronous flush mechanism.

Can we stop calling this 'fake DAX', because it isn't 'DAX' and it's
starting to confuse people. This enabling is effectively a
host-page-cache-passthrough mechanism not DAX. Let's call the whole
approach virtio-pmem, and leave DAX out of the name to hopefully
prevent people from wondering why some DAX features are disabled with
this driver. For example MAP_SYNC is not compatible with this
approach.

Additional enabling is need to disable MAP_SYNC in the presence of a
virtio-pmem device. See the rough proposal here:

https://lkml.org/lkml/2018/4/25/756



Re: [Qemu-devel] [PATCH 3/3] virtio-pmem: Add virtio pmem driver

2018-09-27 Thread Dan Williams
On Thu, Sep 27, 2018 at 6:07 AM Pankaj Gupta  wrote:
[..]
> > We are plugging a VIRTIO based flush callback for the virtio_pmem driver.
> > If the pmem driver (pmem_make_request) has to queue requests, we have to
> > plug "blk_mq_ops" callbacks for the corresponding VIRTIO vqs. AFAICU there
> > is no existing multiqueue code merged for the pmem driver yet, though I
> > could see patches by Dave upstream.
> > upstream.
> >
>
> I thought about this, and with the current infrastructure "make_request"
> releases the spinlock and makes the current thread/task wait. All other
> threads are free to call 'make_request'/flush and similarly wait after
> releasing the lock.

Which lock are you referring?

> This actually works like a queue of threads
> waiting for notifications from host.
>
> Current pmem code does not have multiqueue support and I am not sure if
> core pmem code needs it. Adding multiqueue support just for virtio-pmem
> and not for pmem in the same driver will be confusing or require a lot
> of tweaking.

Why does the pmem driver need to be converted to multiqueue support?

> Could you please give your suggestions on this.

I was expecting that flush requests that cannot be completed
synchronously be placed on a queue and have bio_endio() called at a
future time. I.e. use bio_chain() to manage the async portion of the
flush request. This causes the guest block layer to just assume the
bio was queued and will be completed at some point in the future.



Re: [Qemu-devel] [PATCH 3/3] virtio-pmem: Add virtio pmem driver

2018-09-21 Thread Dan Williams
On Fri, Aug 31, 2018 at 6:32 AM Pankaj Gupta  wrote:
>
> This patch adds virtio-pmem driver for KVM guest.
>
> Guest reads the persistent memory range information from
> Qemu over VIRTIO and registers it on nvdimm_bus. It also
> creates a nd_region object with the persistent memory
> range information so that existing 'nvdimm/pmem' driver
> can reserve this into system memory map. This way
> 'virtio-pmem' driver uses existing functionality of pmem
> driver to register persistent memory compatible for DAX
> capable filesystems.
>
> This also provides function to perform guest flush over
> VIRTIO from 'pmem' driver when userspace performs flush
> on DAX memory range.
>
> Signed-off-by: Pankaj Gupta 
> ---
>  drivers/virtio/Kconfig   |   9 ++
>  drivers/virtio/Makefile  |   1 +
>  drivers/virtio/virtio_pmem.c | 255 
> +++
>  include/uapi/linux/virtio_ids.h  |   1 +
>  include/uapi/linux/virtio_pmem.h |  40 ++
>  5 files changed, 306 insertions(+)
>  create mode 100644 drivers/virtio/virtio_pmem.c
>  create mode 100644 include/uapi/linux/virtio_pmem.h
>
> diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
> index 3589764..a331e23 100644
> --- a/drivers/virtio/Kconfig
> +++ b/drivers/virtio/Kconfig
> @@ -42,6 +42,15 @@ config VIRTIO_PCI_LEGACY
>
>   If unsure, say Y.
>
> +config VIRTIO_PMEM
> +   tristate "Support for virtio pmem driver"
> +   depends on VIRTIO
> +   help
> +   This driver provides support for virtio based flushing interface
> +   for persistent memory range.
> +
> +   If unsure, say M.
> +
>  config VIRTIO_BALLOON
> tristate "Virtio balloon driver"
> depends on VIRTIO
> diff --git a/drivers/virtio/Makefile b/drivers/virtio/Makefile
> index 3a2b5c5..cbe91c6 100644
> --- a/drivers/virtio/Makefile
> +++ b/drivers/virtio/Makefile
> @@ -6,3 +6,4 @@ virtio_pci-y := virtio_pci_modern.o virtio_pci_common.o
>  virtio_pci-$(CONFIG_VIRTIO_PCI_LEGACY) += virtio_pci_legacy.o
>  obj-$(CONFIG_VIRTIO_BALLOON) += virtio_balloon.o
>  obj-$(CONFIG_VIRTIO_INPUT) += virtio_input.o
> +obj-$(CONFIG_VIRTIO_PMEM) += virtio_pmem.o
> diff --git a/drivers/virtio/virtio_pmem.c b/drivers/virtio/virtio_pmem.c
> new file mode 100644
> index 000..c22cc87
> --- /dev/null
> +++ b/drivers/virtio/virtio_pmem.c
> @@ -0,0 +1,255 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * virtio_pmem.c: Virtio pmem Driver
> + *
> + * Discovers persistent memory range information
> + * from host and provides a virtio based flushing
> + * interface.
> + */
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 

I think we need to split this driver into 2 files,
drivers/virtio/pmem.c would discover and register the virtual pmem
device with the libnvdimm core, and drivers/nvdimm/virtio.c would
house virtio_pmem_flush().

> +
> +struct virtio_pmem_request {
> +   /* Host return status corresponding to flush request */
> +   int ret;
> +
> +   /* command name*/
> +   char name[16];
> +
> +   /* Wait queue to process deferred work after ack from host */
> +   wait_queue_head_t host_acked;
> +   bool done;
> +
> +   /* Wait queue to process deferred work after virt queue buffer avail 
> */
> +   wait_queue_head_t wq_buf;

Why does this need wait_queues per request? Shouldn't this be per-device?
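
To illustrate the alternative (all names hypothetical): one shared
per-device wait queue that every request sleeps on, keyed by its own
completion flag:

/* Sketch: hoist the wait queue out of the request and share it per device */
struct virtio_pmem_dev {
	struct virtqueue *req_vq;
	spinlock_t lock;
	wait_queue_head_t host_acked;	/* host_ack() does wake_up_all() */
};

struct virtio_pmem_req {
	int ret;
	bool done;			/* set under the device lock */
};

static int virtio_pmem_wait(struct virtio_pmem_dev *vpmem,
			    struct virtio_pmem_req *req)
{
	wait_event(vpmem->host_acked, req->done);
	return req->ret;
}
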

> +   bool wq_buf_avail;
> +   struct list_head list;
> +};
> +
> +struct virtio_pmem {
> +   struct virtio_device *vdev;
> +
> +   /* Virtio pmem request queue */
> +   struct virtqueue *req_vq;
> +
> +   /* nvdimm bus registers virtio pmem device */
> +   struct nvdimm_bus *nvdimm_bus;
> +   struct nvdimm_bus_descriptor nd_desc;
> +
> +   /* List to store deferred work if virtqueue is full */
> +   struct list_head req_list;
> +
> +   /* Synchronize virtqueue data */
> +   spinlock_t pmem_lock;
> +
> +   /* Memory region information */
> +   uint64_t start;
> +   uint64_t size;
> +};
> +
> +static struct virtio_device_id id_table[] = {
> +   { VIRTIO_ID_PMEM, VIRTIO_DEV_ANY_ID },
> +   { 0 },
> +};
> +
> + /* The interrupt handler */
> +static void host_ack(struct virtqueue *vq)
> +{
> +   unsigned int len;
> +   unsigned long flags;
> +   struct virtio_pmem_request *req, *req_buf;
> +   struct virtio_pmem *vpmem = vq->vdev->priv;
> +
> +   spin_lock_irqsave(&vpmem->pmem_lock, flags);
> +   while ((req = virtqueue_get_buf(vq, &len)) != NULL) {
> +   req->done = true;
> +   wake_up(&req->host_acked);
> +
> +   if (!list_empty(&vpmem->req_list)) {
> +   req_buf = list_first_entry(&vpmem->req_list,
> +   struct virtio_pmem_request, list);
> +   list_del(&vpmem->req_list);
> +   req_buf->wq_buf_avail = true;
> +  

Re: [Qemu-devel] [PATCH 1/3] nd: move nd_region to common header

2018-09-21 Thread Dan Williams
On Fri, Aug 31, 2018 at 6:31 AM Pankaj Gupta  wrote:
>
> This patch moves nd_region definition to common header
> include/linux/nd.h file. This is required for flush callback
> support for both virtio-pmem & pmem driver.
>
> Signed-off-by: Pankaj Gupta 
> ---
>  drivers/nvdimm/nd.h | 39 ---
>  include/linux/nd.h  | 40 
>  2 files changed, 40 insertions(+), 39 deletions(-)

No, we need to find a way to do this without dumping all of these
internal details to a public / global header.



Re: [Qemu-devel] [PATCH 2/3] libnvdimm: nd_region flush callback support

2018-09-21 Thread Dan Williams
On Fri, Aug 31, 2018 at 6:32 AM Pankaj Gupta  wrote:
>
> This patch adds functionality to perform flush from guest
> to host over VIRTIO. We are registering a callback based
> on 'nd_region' type. virtio_pmem driver requires this special
> flush function. For rest of the region types we are registering
> existing flush function. Report error returned by host fsync
> failure to userspace.
>
> Signed-off-by: Pankaj Gupta 

This looks ok to me, just some nits below.

> ---
>  drivers/acpi/nfit/core.c |  7 +--
>  drivers/nvdimm/claim.c   |  3 ++-
>  drivers/nvdimm/pmem.c| 12 
>  drivers/nvdimm/region_devs.c | 12 ++--
>  include/linux/libnvdimm.h|  4 +++-
>  5 files changed, 28 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
> index b072cfc..cd63b69 100644
> --- a/drivers/acpi/nfit/core.c
> +++ b/drivers/acpi/nfit/core.c
> @@ -2216,6 +2216,7 @@ static void write_blk_ctl(struct nfit_blk *nfit_blk, 
> unsigned int bw,
>  {
> u64 cmd, offset;
> struct nfit_blk_mmio *mmio = &nfit_blk->mmio[DCR];
> +   struct nd_region *nd_region = nfit_blk->nd_region;
>
> enum {
> BCW_OFFSET_MASK = (1ULL << 48)-1,
> @@ -2234,7 +2235,7 @@ static void write_blk_ctl(struct nfit_blk *nfit_blk, 
> unsigned int bw,
> offset = to_interleave_offset(offset, mmio);
>
> writeq(cmd, mmio->addr.base + offset);
> -   nvdimm_flush(nfit_blk->nd_region);
> +   nd_region->flush(nd_region);

I would keep the indirect function call override inside of
nvdimm_flush. Then this hunk can go away...
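
In other words, something like this sketch, where generic_nvdimm_flush()
is a placeholder name for today's flush-hint path:

/* Sketch: callers keep calling nvdimm_flush(); the per-region override,
 * if any, is dispatched internally. */
int nvdimm_flush(struct nd_region *nd_region)
{
	if (nd_region->flush)
		return nd_region->flush(nd_region);
	return generic_nvdimm_flush(nd_region);	/* flush-hint writes */
}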

>
> if (nfit_blk->dimm_flags & NFIT_BLK_DCR_LATCH)
> readq(mmio->addr.base + offset);
> @@ -2245,6 +2246,7 @@ static int acpi_nfit_blk_single_io(struct nfit_blk 
> *nfit_blk,
> unsigned int lane)
>  {
> struct nfit_blk_mmio *mmio = &nfit_blk->mmio[BDW];
> +   struct nd_region *nd_region = nfit_blk->nd_region;
> unsigned int copied = 0;
> u64 base_offset;
> int rc;
> @@ -2283,7 +2285,8 @@ static int acpi_nfit_blk_single_io(struct nfit_blk 
> *nfit_blk,
> }
>
> if (rw)
> -   nvdimm_flush(nfit_blk->nd_region);
> +   nd_region->flush(nd_region);
> +
>

...ditto, no need to touch this code.

> rc = read_blk_stat(nfit_blk, lane) ? -EIO : 0;
> return rc;
> diff --git a/drivers/nvdimm/claim.c b/drivers/nvdimm/claim.c
> index fb667bf..49dce9c 100644
> --- a/drivers/nvdimm/claim.c
> +++ b/drivers/nvdimm/claim.c
> @@ -262,6 +262,7 @@ static int nsio_rw_bytes(struct nd_namespace_common *ndns,
>  {
> struct nd_namespace_io *nsio = to_nd_namespace_io(&ndns->dev);
> unsigned int sz_align = ALIGN(size + (offset & (512 - 1)), 512);
> +   struct nd_region *nd_region = to_nd_region(ndns->dev.parent);
> sector_t sector = offset >> 9;
> int rc = 0;
>
> @@ -301,7 +302,7 @@ static int nsio_rw_bytes(struct nd_namespace_common *ndns,
> }
>
> memcpy_flushcache(nsio->addr + offset, buf, size);
> -   nvdimm_flush(to_nd_region(ndns->dev.parent));
> +   nd_region->flush(nd_region);

For this you would need to teach nsio_rw_bytes() that the flush can fail.
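
Roughly, assuming nvdimm_flush() grows a 0 / -EIO return, the tail of
nsio_rw_bytes() would become something like:

	memcpy_flushcache(nsio->addr + offset, buf, size);
	if (nvdimm_flush(nd_region))	/* sketch: stop ignoring the result */
		rc = -EIO;

	return rc;
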

>
> return rc;
>  }
> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> index 6071e29..ba57cfa 100644
> --- a/drivers/nvdimm/pmem.c
> +++ b/drivers/nvdimm/pmem.c
> @@ -201,7 +201,8 @@ static blk_qc_t pmem_make_request(struct request_queue 
> *q, struct bio *bio)
> struct nd_region *nd_region = to_region(pmem);
>
> if (bio->bi_opf & REQ_PREFLUSH)
> -   nvdimm_flush(nd_region);
> +   bio->bi_status = nd_region->flush(nd_region);
> +

Let's have nvdimm_flush() return 0 or -EIO if it fails since thats
what nsio_rw_bytes() expects, and you'll need to translate that to:
BLK_STS_IOERR
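
i.e., sketched:

	/* sketch: translate a failed flush into a block status */
	if ((bio->bi_opf & REQ_PREFLUSH) && nvdimm_flush(nd_region))
		bio->bi_status = BLK_STS_IOERR;
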

>
> do_acct = nd_iostat_start(bio, &start);
> bio_for_each_segment(bvec, bio, iter) {
> @@ -216,7 +217,7 @@ static blk_qc_t pmem_make_request(struct request_queue 
> *q, struct bio *bio)
> nd_iostat_end(bio, start);
>
> if (bio->bi_opf & REQ_FUA)
> -   nvdimm_flush(nd_region);
> +   bio->bi_status = nd_region->flush(nd_region);

Same comment.

>
> bio_endio(bio);
> return BLK_QC_T_NONE;
> @@ -517,6 +518,7 @@ static int nd_pmem_probe(struct device *dev)
>  static int nd_pmem_remove(struct device *dev)
>  {
> struct pmem_device *pmem = dev_get_drvdata(dev);
> +   struct nd_region *nd_region = to_region(pmem);
>
> if (is_nd_btt(dev))
> nvdimm_namespace_detach_btt(to_nd_btt(dev));
> @@ -528,14 +530,16 @@ static int nd_pmem_remove(struct device *dev)
> sysfs_put(pmem->bb_state);
> pmem->bb_state = NULL;
> }
> -   nvdimm_flush(to_nd_region(dev->parent));
> +   nd_region->flush(nd_region);

Not 

Re: [Qemu-devel] [RFC PATCH 1/1] nvdimm: let qemu requiring section alignment of pmem resource.

2018-06-11 Thread Dan Williams
On Mon, Jun 11, 2018 at 9:26 AM, Stefan Hajnoczi  wrote:
> On Mon, Jun 11, 2018 at 06:54:25PM +0800, Zhang Yi wrote:
>> Nvdimm driver uses memory hot-plug APIs to map its pmem resource,
>> which works at a section granularity.
>>
>> When QEMU emulates the vNVDIMM device and decreases the label-storage,
>> QEMU will put the vNVDIMMs directly next to one another in physical
>> address space, which means that the boundary between them won't
>> align to the 128 MB memory section size.
>
> I'm having a hard time parsing this.
>
> Where does the "128 MB memory section size" come from?  ACPI?
> A chipset-specific value?
>

The devm_memremap_pages() implementation use the memory hotplug core
to allocate the 'struct page' array/map for persistent memory. Memory
hotplug can only be performed in terms of sections, 128MB on x86_64.
There is some limited support for allowing devm_memremap_pages() to
overlap 'System RAM' within a given section, but it does not currently
support multiple devm_memremap_pages() calls overlapping within the
same section. There is currently a kernel bug where we do not handle
this unsupported configuration gracefully. The fix will cause
configurations that try to overlap 2 persistent memory
ranges in the same section to fail.

The proposed fix is trying to make sure that QEMU does not run afoul
of this constraint.

There is currently no line of sight to reduce the minimum memory
hotplug alignment size to less than 128M. Also, as other architectures
outside of x86_64 add devm_memremap_pages() support, the minimum
section alignment constraint might change and is a property of a guest
OS. My understanding is that some guest OSes might expect an even
larger persistent memory minimum alignment.
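
To make the constraint concrete, a sketch of the placement rule QEMU
would need to follow (the 128MB value is x86_64-specific, as noted
above, and a property of the guest OS):

#include <stdint.h>

/* x86_64 memory hotplug section size; not a constant across architectures */
#define SECTION_SIZE	(128ULL << 20)

static uint64_t align_up(uint64_t addr, uint64_t align)
{
	return (addr + align - 1) & ~(align - 1);
}

/* sketch: start each vNVDIMM at the next section boundary so that two
 * devm_memremap_pages() ranges never share a section */
static uint64_t next_nvdimm_base(uint64_t prev_end)
{
	return align_up(prev_end, SECTION_SIZE);
}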



Re: [Qemu-devel] [qemu PATCH v4 3/4] nvdimm, acpi: support NFIT platform capabilities

2018-06-06 Thread Dan Williams
On Wed, Jun 6, 2018 at 4:20 PM, Elliott, Robert (Persistent Memory)
 wrote:
>
>> > Okay, we can move to the symbolic names.  Do you want them to be that
>> long, or
>> > would:
>> >
>> > nvdimm-cap-cpu
>> > nvdimm-cap-mem-ctrl
>> > nvdimm-cap-mirroring
>>
>> Wait, why is mirroring part of this?
>
> This data structure is intended to report any kind of platform capability, not
> just platform persistence capabilities.

Yes, but there's nothing actionable that a qemu guest OS can do with
that mirroring information, so there's no need at this time to add cli
cruft and code to support it.



Re: [Qemu-devel] [qemu PATCH v4 3/4] nvdimm, acpi: support NFIT platform capabilities

2018-06-05 Thread Dan Williams
On Tue, Jun 5, 2018 at 3:07 PM, Ross Zwisler
 wrote:
> On Tue, Jun 05, 2018 at 09:37:25PM +0300, Michael S. Tsirkin wrote:
>> On Tue, Jun 05, 2018 at 11:15:00AM -0700, Dan Williams wrote:
>> > On Tue, Jun 5, 2018 at 9:42 AM, Ross Zwisler
>> >  wrote:
>> > > On Tue, Jun 05, 2018 at 06:25:27PM +0300, Michael S. Tsirkin wrote:
>> > >> On Mon, May 21, 2018 at 10:32:02AM -0600, Ross Zwisler wrote:
>> > >> > Add a machine command line option to allow the user to control the 
>> > >> > Platform
>> > >> > Capabilities Structure in the virtualized NFIT.  This Platform 
>> > >> > Capabilities
>> > >> > Structure was added in ACPI 6.2 Errata A.
>> > >> >
>> > >> > Signed-off-by: Ross Zwisler 
>> > >>
>> > >> I tried playing with it and encoding the capabilities is
>> > >> quite awkward.
>> > >>
>> > >> Can we add bits for specific capabilities instead of nvdimm-cap?
>> > >>
>> > >> How about:
>> > >>
>> > >> "cpu-flush-on-power-loss-cap"
>> > >> "memory-flush-on-power-loss-cap"
>> > >> "byte-addressable-mirroring-cap"
>> > >
>> > > Hmmm...I don't like that as much because:
>> > >
>> > > a) It's very verbose.  Looking at my current qemu command line few other
>> > >options require that many characters, and you'd commonly be defining 
>> > > more
>> > >than one of these for a given VM.
>> > >
>> > > b) It means that the QEMU will need to be updated if/when new flags are 
>> > > added,
>> > >because we'll have to have new options for each flag.  The current
>> > >implementation is more future-proof because you can specify any flags
>> > >value you want.
>> > >
>> > > However, if you feel strongly about this, I'll make the change.
>> >
>> > Straw-man: Could we do something similar to what we are doing in ndctl?
>> >
>> > enum ndctl_persistence_domain {
>> > PERSISTENCE_NONE = 0,
>> > PERSISTENCE_MEM_CTRL = 10,
>> > PERSISTENCE_CPU_CACHE = 20,
>> > PERSISTENCE_UNKNOWN = INT_MAX,
>> > };
>> >
>> > ...and have the command line take a number where "10" and "20" are
>> > supported today, but allows us to adapt to new persistence domains in
>> > the future.
>>
>> I'm fine with that except can we have symbolic names instead of numbers
>> on command line?
>>
>> --
>> MST
>
> Okay, we can move to the symbolic names.  Do you want them to be that long, or
> would:
>
> nvdimm-cap-cpu
> nvdimm-cap-mem-ctrl
> nvdimm-cap-mirroring

Wait, why is mirroring part of this?

I was thinking this option would be:

--persistence-domain={cpu,mem-ctrl}

...and try not to let ACPI specifics leak into the qemu command line
interface. For example PowerPC qemu could have a persistence domain
communicated via Open Firmware or some other mechanism.
>
> or something be better?



Re: [Qemu-devel] [qemu PATCH v4 3/4] nvdimm, acpi: support NFIT platform capabilities

2018-06-05 Thread Dan Williams
On Tue, Jun 5, 2018 at 9:42 AM, Ross Zwisler
 wrote:
> On Tue, Jun 05, 2018 at 06:25:27PM +0300, Michael S. Tsirkin wrote:
>> On Mon, May 21, 2018 at 10:32:02AM -0600, Ross Zwisler wrote:
>> > Add a machine command line option to allow the user to control the Platform
>> > Capabilities Structure in the virtualized NFIT.  This Platform Capabilities
>> > Structure was added in ACPI 6.2 Errata A.
>> >
>> > Signed-off-by: Ross Zwisler 
>>
>> I tried playing with it and encoding the capabilities is
>> quite awkward.
>>
>> Can we add bits for specific capabilities instead of nvdimm-cap?
>>
>> How about:
>>
>> "cpu-flush-on-power-loss-cap"
>> "memory-flush-on-power-loss-cap"
>> "byte-addressable-mirroring-cap"
>
> Hmmm...I don't like that as much because:
>
> a) It's very verbose.  Looking at my current qemu command line few other
>options require that many characters, and you'd commonly be defining more
>than one of these for a given VM.
>
> b) It means that the QEMU will need to be updated if/when new flags are added,
>because we'll have to have new options for each flag.  The current
>implementation is more future-proof because you can specify any flags
>value you want.
>
> However, if you feel strongly about this, I'll make the change.

Straw-man: Could we do something similar to what we are doing in ndctl?

enum ndctl_persistence_domain {
PERSISTENCE_NONE = 0,
PERSISTENCE_MEM_CTRL = 10,
PERSISTENCE_CPU_CACHE = 20,
PERSISTENCE_UNKNOWN = INT_MAX,
};

...and have the command line take a number where "10" and "20" are
supported today, but allows us to adapt to new persistence domains in
the future.



Re: [Qemu-devel] Questions about vNVDIMM on qemu/KVM

2018-05-24 Thread Dan Williams
On Thu, May 24, 2018 at 12:19 AM, Yasunori Goto  wrote:
>> On Tue, May 22, 2018 at 10:08 PM, Yasunori Goto  
>> wrote:
>> > Hello,
>> >
>> > I'm investigating status of vNVDIMM on qemu/KVM,
>> > and I have some questions about it. I'd be glad if anyone could answer them.
>> >
>> > In my understanding, qemu/KVM has a feature to show NFIT for guest,
>> > and it will be still updated about platform capability with this patch set.
>> > https://lists.gnu.org/archive/html/qemu-devel/2018-05/msg04756.html
>> >
>> > And libvirt also supports this feature with 
>> > https://libvirt.org/formatdomain.html#elementsMemory
>> >
>> >
>> > However, virtio-pmem is being developed now, and it is better
>> > for architectures that detect NVDIMM regions without ACPI (like s390x)
>>
>> I think you are confusing virtio-pmem (patches from Pankaj) and
>> virtio-mem (patches from David)? ...or I'm confused.
>
> Probably, "I" am confusing.
> So, your clarification is very helpful for me.
>
>
>>
>> > In addition, it is also necessary to flush guest contents on a vNVDIMM
>> > that has a backing file.
>>
>> virtio-pmem is a mechanism to use host page cache as pmem in a guest.
>> It does not support high performance memory applications because it
>> requires fsync/msync. I.e. it is not DAX, it is the traditional mmap
>> I/O model, but moving page cache management to the host rather than
>> duplicating it in guests.
>
> Ah, ok.
>
>
>>
>> > Q1) Does the ACPI.NFIT bus of qemu/kvm remain with virtio-pmem?
>> > What would the role of each be if both NFIT and virtio-pmem are
>> > available? If my understanding is correct, both NFIT and virtio-pmem
>> > are used to detect vNVDIMM regions, but only one seems to be necessary.
>>
>> We need both because they are different. Guest DAX should not be using
>> virtio-pmem.
>
> Hmm. Ok.
>
> But I would like to understand one more thing.
> In the following mail, it seems that e820 bus will be used for fake DAX.
>
> https://lists.01.org/pipermail/linux-nvdimm/2018-January/013926.html
>
> Could you tell me what the relationship is between "fake DAX" in this
> mail and Guest DAX?
> Why is e820 necessary for this case?
>

It was proposed as a starting template for writing a new nvdimm bus
driver. All we need is a way to communicate both the address range and
the flush interface. This could be done with a new SPA Range GUID with
the NFIT, or a custom virtio-pci device that registers a special
nvdimm region with this property. My preference is whichever approach
minimizes the code duplication, because the pmem driver should be
re-used as much as possible.



Re: [Qemu-devel] Questions about vNVDIMM on qemu/KVM

2018-05-23 Thread Dan Williams
On Tue, May 22, 2018 at 10:08 PM, Yasunori Goto  wrote:
> Hello,
>
> I'm investigating status of vNVDIMM on qemu/KVM,
> and I have some questions about it. I'd be glad if anyone could answer them.
>
> In my understanding, qemu/KVM has a feature to show NFIT for guest,
> and it will be still updated about platform capability with this patch set.
> https://lists.gnu.org/archive/html/qemu-devel/2018-05/msg04756.html
>
> And libvirt also supports this feature with 
> https://libvirt.org/formatdomain.html#elementsMemory
>
>
> However, virtio-pmem is being developed now, and it is better
> for architectures that detect NVDIMM regions without ACPI (like s390x)

I think you are confusing virtio-pmem (patches from Pankaj) and
virtio-mem (patches from David)? ...or I'm confused.

> In addition, it is also necessary to flush guest contents on a vNVDIMM
> that has a backing file.

virtio-pmem is a mechanism to use host page cache as pmem in a guest.
It does not support high performance memory applications because it
requires fsync/msync. I.e. it is not DAX, it is the traditional mmap
I/O model, but moving page cache management to the host rather than
duplicating it in guests.

> Q1) Does the ACPI.NFIT bus of qemu/kvm remain with virtio-pmem?
> What would the role of each be if both NFIT and virtio-pmem are
> available? If my understanding is correct, both NFIT and virtio-pmem
> are used to detect vNVDIMM regions, but only one seems to be necessary.

We need both because they are different. Guest DAX should not be using
virtio-pmem.

> Otherwise, is the NFIT bus just for keeping compatibility,
> and virtio-pmem is promising way?
>
>
> Q2) What bus is(will be?) created for virtio-pmem?
> I could confirm the bus of NFIT is created with ,
> and I heard other bus will be created for virtio-pmem, but I could not
> find what bus is created concretely.
> ---
>   # ndctl list -B
>   {
>  "provider":"ACPI.NFIT",
>  "dev":"ndbus0"
>   }
> ---
>
> I think it affects what operations the user will be able to perform, and
> what notification is necessary for vNVDIMM.
> ACPI defines some operations like namespace control, and notification
> for NVDIMM health status or others.
> (I suppose that other status notification might be necessary for vNVDIMM,
>  but I'm not sure yet...)
>
> If my understanding is wrong, please correct me.

The current plan, per my understanding, is a virtio-pmem SPA UUID
added to the virtual NFIT so that the guest driver can load the pmem
driver but also hook up the virtio command ring for forwarding
WRITE_{FUA,FLUSH} commands as host fsync operations.



Re: [Qemu-devel] [RFC v2 2/2] pmem: device flush over VIRTIO

2018-04-26 Thread Dan Williams
On Thu, Apr 26, 2018 at 9:40 AM, Pankaj Gupta  wrote:
>
>>
>> On Wed, Apr 25, 2018 at 04:54:14PM +0530, Pankaj Gupta wrote:
>> > This patch adds functionality to perform
>> > flush from guest to host over VIRTIO
>> > when the 'ND_REGION_VIRTIO' flag is set on
>> > nd_region. The flag is set by the 'virtio-pmem'
>> > driver.
>> >
>> > Signed-off-by: Pankaj Gupta 
>> > ---
>> >  drivers/nvdimm/region_devs.c | 7 +++
>> >  1 file changed, 7 insertions(+)
>> >
>> > diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
>> > index a612be6..6c6454e 100644
>> > --- a/drivers/nvdimm/region_devs.c
>> > +++ b/drivers/nvdimm/region_devs.c
>> > @@ -20,6 +20,7 @@
>> >  #include 
>> >  #include "nd-core.h"
>> >  #include "nd.h"
>> > +#include 
>> >
>> >  /*
>> >   * For readq() and writeq() on 32-bit builds, the hi-lo, lo-hi order is
>> > @@ -1074,6 +1075,12 @@ void nvdimm_flush(struct nd_region *nd_region)
>> > struct nd_region_data *ndrd = dev_get_drvdata(&nd_region->dev);
>> > int i, idx;
>> >
>> > +   /* call PV device flush */
>> > +   if (test_bit(ND_REGION_VIRTIO, &nd_region->flags)) {
>> > +   virtio_pmem_flush(&nd_region->dev);
>> > +   return;
>> > +   }
>>
>> How does libnvdimm know when flush has completed?
>>
>> Callers expect the flush to be finished when nvdimm_flush() returns but
>> the virtio driver has only queued the request, it hasn't waited for
>> completion!
>
> I tried to implement what nvdimm does right now. It just writes to
> the flush hint address to make sure data persists.

nvdimm_flush() is currently expected to be synchronous. Currently it
is sfence(); write to special address; sfence(). By the time the
second sfence returns the data is flushed. So you would need to make
this virtio flush interface synchronous as well, but that appears
problematic to stop the guest for unbounded amounts of time. Instead,
you need to rework nvdimm_flush() and the pmem driver to make these
flush requests asynchronous and add the plumbing for completion
callbacks via bio_endio().

> I just did not want to block guest write requests till host side
> fsync completes.

You must complete the flush before bio_endio(), otherwise you're
violating the expectations of the guest filesystem/block-layer.

>
> be worse for operations on different guest files because all these operations 
> would happen
> ultimately on same file at host.
>
> I think with the current way, we can achieve an asynchronous queuing
> mechanism at the cost of not being 100% sure when fsync will complete,
> but it is assured that it will happen. Also, it's an entire block
> flush.

No, again, that's broken. We need to add the plumbing for
communicating the fsync() completion relative to the WRITE_{FLUSH,FUA}
bio in the guest.



Re: [Qemu-devel] [PATCH] acpi/nvdimm: remove forward name references

2018-04-25 Thread Dan Williams
>> > Good to know, thanks.
>
> I have made a big mistake. I was comparing the resulting SSDT of this patch 
> to an SSDT
> Generated by a much older version of kvm/qemu. The latest kvm/qemu upstream 
> builds
> the SSDT with the operation region contained inside of methods. This should 
> resolve
> MEMA of the operation region when the method is invoked. The older SSDT that 
> I was
> looking at declared the memory region outside of methods and had a forward 
> reference
> to MEMA which we no longer support.
>
> So the latest qemu/kvm works just fine and this patch is not needed, Dan has 
> confirmed that
> the latest qemu/kvm works for his nvdimm setup.
>
> Thanks for all of the help and I'm sorry for any confusion that I might have 
> caused!

Yes, my bad as well. I saw the regression on v2.8.0, but updating to
latest qemu-mainline also fixes the problem for me.

Thanks for the debug.



Re: [Qemu-devel] [RFC v2 1/2] virtio: add pmem driver

2018-04-25 Thread Dan Williams
[ adding Jeff directly since he has also been looking at
infrastructure to track when MAP_SYNC should be disabled ]

On Wed, Apr 25, 2018 at 7:21 AM, Dan Williams  wrote:
> On Wed, Apr 25, 2018 at 4:24 AM, Pankaj Gupta  wrote:
>> This patch adds virtio-pmem driver for KVM
>> guest.
>
> Minor nit, please expand your changelog line wrapping to 72 columns.
>
>>
>> Guest reads the persistent memory range
>> information from Qemu over VIRTIO and registers
>> it on nvdimm_bus. It also creates a nd_region
>> object with the persistent memory range
>> information so that existing 'nvdimm/pmem'
>> driver can reserve this into system memory map.
>> This way 'virtio-pmem' driver uses existing
>> functionality of pmem driver to register persistent
>> memory compatible for DAX capable filesystems.
>
> We need some additional enabling to disable MAP_SYNC for this
> configuration. In other words, if fsync() is required then we must
> disable the MAP_SYNC optimization. I think this should be a struct
> dax_device property looked up at mmap time in each MAP_SYNC capable
> ->mmap() file operation implementation.



Re: [Qemu-devel] [RFC v2 2/2] pmem: device flush over VIRTIO

2018-04-25 Thread Dan Williams
On Wed, Apr 25, 2018 at 4:24 AM, Pankaj Gupta  wrote:
> This patch adds functionality to perform
> flush from guest to host over VIRTIO
> when the 'ND_REGION_VIRTIO' flag is set on
> nd_region. The flag is set by the 'virtio-pmem'
> driver.
>
> Signed-off-by: Pankaj Gupta 
> ---
>  drivers/nvdimm/region_devs.c | 7 +++
>  1 file changed, 7 insertions(+)
>
> diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
> index a612be6..6c6454e 100644
> --- a/drivers/nvdimm/region_devs.c
> +++ b/drivers/nvdimm/region_devs.c
> @@ -20,6 +20,7 @@
>  #include 
>  #include "nd-core.h"
>  #include "nd.h"
> +#include 
>
>  /*
>   * For readq() and writeq() on 32-bit builds, the hi-lo, lo-hi order is
> @@ -1074,6 +1075,12 @@ void nvdimm_flush(struct nd_region *nd_region)
> struct nd_region_data *ndrd = dev_get_drvdata(&nd_region->dev);
> int i, idx;
>
> +   /* call PV device flush */
> +   if (test_bit(ND_REGION_VIRTIO, &nd_region->flags)) {
> +   virtio_pmem_flush(&nd_region->dev);
> +   return;
> +   }
> +

I'd rather introduce a ->flush() operation hanging off of 'struct
nd_region' so that this multiplexing can be a static setting.
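
A sketch of that shape, with illustrative names and a simplified
signature only:

/* Sketch: pick the flush implementation once, at region registration,
 * instead of testing a flag on every nvdimm_flush() call. */
static int virtio_pmem_flush(struct nd_region *nd_region)
{
	/* queue a flush request to the host and wait for the ack (elided) */
	return 0;
}

static void virtio_pmem_setup_region(struct nd_region_desc *ndr_desc)
{
	ndr_desc->flush = virtio_pmem_flush;	/* static dispatch */
}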



Re: [Qemu-devel] [RFC v2 1/2] virtio: add pmem driver

2018-04-25 Thread Dan Williams
On Wed, Apr 25, 2018 at 4:24 AM, Pankaj Gupta  wrote:
> This patch adds virtio-pmem driver for KVM
> guest.

Minor nit, please expand your changelog line wrapping to 72 columns.

>
> Guest reads the persistent memory range
> information from Qemu over VIRTIO and registers
> it on nvdimm_bus. It also creates a nd_region
> object with the persistent memory range
> information so that existing 'nvdimm/pmem'
> driver can reserve this into system memory map.
> This way 'virtio-pmem' driver uses existing
> functionality of pmem driver to register persistent
> memory compatible for DAX capable filesystems.

We need some additional enabling to disable MAP_SYNC for this
configuration. In other words, if fsync() is required then we must
disable the MAP_SYNC optimization. I think this should be a struct
dax_device property looked up at mmap time in each MAP_SYNC capable
->mmap() file operation implementation.
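
The check would look something like the sketch below. dax_synchronous()
is the name such a property eventually took upstream, so treat the exact
names as assumptions at the time of this mail:

/* Sketch: refuse a MAP_SYNC mapping unless the dax_device can honor it */
static bool daxdev_map_sync_supported(struct vm_area_struct *vma,
				      struct dax_device *dax_dev)
{
	if (!(vma->vm_flags & VM_SYNC))
		return true;		/* nothing to enforce */
	if (!IS_DAX(file_inode(vma->vm_file)))
		return false;
	return dax_synchronous(dax_dev); /* false for a virtio-pmem device */
}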



Re: [Qemu-devel] RFC: altering the NVDIMM acpi table

2018-04-23 Thread Dan Williams
On Mon, Apr 23, 2018 at 1:35 PM, Schmauss, Erik  wrote:
> Hello,
>
> I work on ACPICA and we have recently made changes to the behavior of
> the Linux AML interpreter to match other OS implementations. After
> sending the patches to upstream Linux, we have identified that
> hw/acpi/nvdimm.c specifies an ACPI table with a forward reference
>  (MEMA is a forward reference that is no longer supported as of Linux
> 4.17-rc1).
>
> We would like to change this file to move the declaration of Name
> (MEMA,...) to appear as the very first declaration in the SSDT. Below is a 
> patch outlining the change that I would like to make.
> However, I am having a hard time getting make check to run
>  to completion in a reasonable amount of time. It always seems to fail
> on some sort of checksum test... It would be great if you could let me
> know what you think of the change and what I can do to speed up the
> execution time of make check...
>
>
> Thanks,
>
> Erik Schmauss
>
> diff --git a/hw/acpi/nvdimm.c b/hw/acpi/nvdimm.c
> index 59d6e4254c..7c9efd9ac7 100644
> --- a/hw/acpi/nvdimm.c
> +++ b/hw/acpi/nvdimm.c
> @@ -1234,6 +1234,9 @@ static void nvdimm_build_ssdt(GArray *table_offsets, 
> GArray *table_data,
>  ssdt = init_aml_allocator();
>  acpi_data_push(ssdt->buf, sizeof(AcpiTableHeader));
>
> +mem_addr_offset = build_append_named_dword(table_data,
> +   NVDIMM_ACPI_MEM_ADDR);
> +
>  sb_scope = aml_scope("\\_SB");
>
>  dev = aml_device("NVDR");
> @@ -1266,9 +1269,6 @@ static void nvdimm_build_ssdt(GArray *table_offsets, 
> GArray *table_data,
>
>  /* copy AML table into ACPI tables blob and patch header there */
>  g_array_append_vals(table_data, ssdt->buf->data, ssdt->buf->len);
> -mem_addr_offset = build_append_named_dword(table_data,
> -   NVDIMM_ACPI_MEM_ADDR);
> -
>  bios_linker_loader_alloc(linker,
>   NVDIMM_DSM_MEM_FILE, dsm_dma_arrea,
>   sizeof(NvdimmDsmIn), false /* high memory */);

I gave this a shot and it appears to break some assumption of where
this device is mapped relative to System RAM:

 ioremap on RAM at 0xbffe - 0x00019ffe1fff
 WARNING: CPU: 0 PID: 0 at arch/x86/mm/ioremap.c:166
__ioremap_caller+0x28b/0x300
 Modules linked in:
 CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.17.0-rc1+ #1734
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
rel-1.11.1-0-g0551a4be2c-prebuilt.qemu-project.org 04/01/2014
 RIP: 0010:__ioremap_caller+0x28b/0x300
 RSP: :82603db0 EFLAGS: 00010286
 RAX:  RBX: bffe RCX: 0006
 RDX: 0168 RSI: 82618f90 RDI: 0246
 RBP: e0002000 R08:  R09: 
 R10: 88043e7d8000 R11:  R12: 
 R13: bffe R14: 81a731e9 R15: 82603ee4
 FS:  () GS:88043140() knlGS:
 CS:  0010 DS:  ES:  CR0: 80050033
 CR2:  CR3: 0261 CR4: 000406b0
 Call Trace:
  ? acpi_os_map_iomem+0x7b/0x1c0
  acpi_os_map_iomem+0x189/0x1c0
  acpi_tb_acquire_table+0x39/0x64
  acpi_tb_validate_table+0x21/0x33
  acpi_tb_verify_temp_table+0x37/0x213
  acpi_reallocate_root_table+0xe1/0x112
  acpi_early_init+0x4b/0x102
  start_kernel+0x419/0x4ee
  secondary_startup_64+0xa5/0xb0



Re: [Qemu-devel] [PATCH v2 5/7] ACPICA: Integrate package handling with module-level code

2018-04-16 Thread Dan Williams
On Mon, Apr 16, 2018 at 5:05 PM, Schmauss, Erik  wrote:
>
>> -Original Message-
>> From: linux-acpi-ow...@vger.kernel.org [mailto:linux-acpi-
>> ow...@vger.kernel.org] On Behalf Of Dan Williams
>> Sent: Monday, April 16, 2018 4:22 PM
>> To: Schmauss, Erik 
>> Cc: Rafael J. Wysocki ; Linux ACPI > a...@vger.kernel.org>; Moore, Robert ; linux-
>> nvdimm ; Qemu Developers > de...@nongnu.org>
>> Subject: Re: [PATCH v2 5/7] ACPICA: Integrate package handling with module-
>> level code
>>
>> On Mon, Apr 16, 2018 at 4:15 PM, Schmauss, Erik 
>> wrote:
>> > [ trimming ]
>> >> >> Rafael, we may want to hold back on the module-level code changes
>> >> >> (the patches below) for rc1. Between this and the strange _TSS
>> >> >> issue, it seems like there are a few more things to resolve before
>> >> >> this is ready for kernel upstream.
>> >> >
>> > Hi Rafael,
>> >
>> >> > It looks like you are asking me to queue up reverts as per the
>> >> > Dan's report, is that correct?
>> >
>> > This is indeed what I meant last week. However, I've looked into the
>> > issue and Dan's qemu instance had AML that we no longer support. This
>> > is because the ACPICA commit makes changes to the execution of AML
>> > during table load to match windows AML interpreter behavior so this commit
>> also got rid of support for executing code containing forward references 
>> (except
>> for package elements).
>> >
>> > I've suggested a fix for the firmware in a separate email. So I would
>> > say that this issue is resolved if Dan can run his test successfully 
>> > with the
>> adjusted firmware.
>> >
>> > If Dan's test is successful, we don’t need to revert these changes
>>
>
> Hi Dan,
>
>> I'm concerned about other qemu-kvm users that do not upgrade their hypervisor
>> at the same pace as their guest kernel. Especially for cloud providers that 
>> may
>> be running latest mainline kernel on older qemu-kvm this will look like a 
>> pure
>> kernel regression. Is there a quick fix we can carry in the kernel to 
>> support these
>> forward references, at least until we know that qemu-kvm is no longer 
>> shipping
>> the broken AML?
>
> This is a very good point. Thanks for bringing this up! One option is for 
> them to set
> the global variable acpi_gbl_execute_tables_as_methods in 
> include/acpi/acpixf.h to false.
> This will effectively revert the new behavior in the AML interpreter and go 
> back to the old way.
> Since this is a global flag, we could have a command line option for Linux 
> kernel to turn this
> feature on.
>
> Out of curiosity, is this ACPI table somehow customized for your work? I have 
> a collection
> of acpi tables and your ACPI tables are the only ones that have an 
> OperationRegion called
> NRAM. What is the chance that others will be running Linux on the same tables 
> as the one
> you sent me?

I don't think there's anything atypical about my particular setup. It
creates two virtual NVDIMMs that each represent a 30GB address space.
I suspect any user of the KVM NVDIMM virtualization would see the same
problem.



Re: [Qemu-devel] [PATCH v2 5/7] ACPICA: Integrate package handling with module-level code

2018-04-16 Thread Dan Williams
On Mon, Apr 16, 2018 at 4:15 PM, Schmauss, Erik  wrote:
> [ trimming ]
>> >> Rafael, we may want to hold back on the module-level code changes
>> >> (the patches below) for rc1. Between this and the strange _TSS issue,
>> >> it seems like there are a few more things to resolve before this is
>> >> ready for kernel upstream.
>> >
> Hi Rafael,
>
>> > It looks like you are asking me to queue up reverts as per the Dan's
>> > report, is that correct?
>
> This is indeed what I meant last week. However, I've looked into the issue 
> and Dan's qemu
> instance had AML that we no longer support. This is because the ACPICA commit 
> makes changes to the execution of AML
> during table load to match windows AML interpreter behavior so this commit 
> also got rid of support for executing code
> containing forward references (except for package elements).
>
> I've suggested a fix for the firmware in a separate email. So I would say 
> that this issue is resolved if Dan can run
> his test successfully with the adjusted firmware.
>
> If Dan's test is successful, we don’t need to revert these changes

I'm concerned about other qemu-kvm users that do not upgrade their
hypervisor at the same pace as their guest kernel. Especially for
cloud providers that may be running latest mainline kernel on older
qemu-kvm this will look like a pure kernel regression. Is there a
quick fix we can carry in the kernel to support these forward
references, at least until we know that qemu-kvm is no longer shipping
the broken AML?



Re: [Qemu-devel] [PATCH] hw/acpi-build: build SRAT memory affinity structures for NVDIMM

2018-02-21 Thread Dan Williams
On Wed, Feb 21, 2018 at 5:55 AM, Igor Mammedov  wrote:
> On Tue, 20 Feb 2018 17:17:58 -0800
> Dan Williams  wrote:
>
>> On Tue, Feb 20, 2018 at 6:10 AM, Igor Mammedov  wrote:
>> > On Sat, 17 Feb 2018 14:31:35 +0800
>> > Haozhong Zhang  wrote:
>> >
>> >> ACPI 6.2A Table 5-129 "SPA Range Structure" requires the proximity
>> >> domain of a NVDIMM SPA range must match with corresponding entry in
>> >> SRAT table.
>> >>
>> >> The address ranges of vNVDIMM in QEMU are allocated from the
>> >> hot-pluggable address space, which is entirely covered by one SRAT
>> >> memory affinity structure. However, users can set the vNVDIMM
>> >> proximity domain in NFIT SPA range structure by the 'node' property of
>> >> '-device nvdimm' to a value different than the one in the above SRAT
>> >> memory affinity structure.
>> >>
>> In order to solve such proximity domain mismatch, this patch builds one
>> >> SRAT memory affinity structure for each NVDIMM device with the
>> >> proximity domain used in NFIT. The remaining hot-pluggable address
>> >> space is covered by one or multiple SRAT memory affinity structures
>> >> with the proximity domain of the last node as before.
>> >>
>> >> Signed-off-by: Haozhong Zhang 
>> > If we consider a hotpluggable system, a correctly implemented OS should
>> > be able to pull proximity from Device::_PXM and override any value from SRAT.
>> > Do we really have a problem here (anything that breaks if we would use 
>> > _PXM)?
>> > Maybe we should add _PXM object to nvdimm device nodes instead of 
>> > massaging SRAT?
>>
>> Unfortunately _PXM is an awkward fit. Currently the proximity domain
>> is attached to the SPA range structure. The SPA range may be
>> associated with multiple DIMM devices and those individual NVDIMMs may
>> have conflicting _PXM properties.
> There shouldn't be any conflict here as  NVDIMM device's _PXM method,
> should override in runtime any proximity specified by parent scope.
> (as parent scope I'd also count boot time NFIT/SRAT tables).
>
> To make it more clear we could clear valid proximity domain flag in SPA
> like this:
>
> diff --git a/hw/acpi/nvdimm.c b/hw/acpi/nvdimm.c
> index 59d6e42..131bca5 100644
> --- a/hw/acpi/nvdimm.c
> +++ b/hw/acpi/nvdimm.c
> @@ -260,9 +260,7 @@ nvdimm_build_structure_spa(GArray *structures, 
> DeviceState *dev)
>   */
>  nfit_spa->flags = cpu_to_le16(1 /* Control region is strictly for
> management during hot add/online
> -   operation */ |
> -  2 /* Data in Proximity Domain field is
> -   valid*/);
> +   operation */);
>
>  /* NUMA node. */
>  nfit_spa->proximity_domain = cpu_to_le32(node);
>
>> Even if that was unified across
>> DIMMs it is ambiguous whether a DIMM-device _PXM would relate to the
>> device's control interface, or the assembled persistent memory SPA
>> range.
> I'm not sure what you mean by 'device's control interface',
> could you clarify where the ambiguity comes from?

There are multiple SPA range types. In addition to the typical
Persistent Memory SPA range there are also Control Region SPA ranges
for MMIO registers on the DIMM for Block Apertures and other purposes.

>
> I read spec as: _PXM applies to address range covered by NVDIMM
> device it belongs to.

No, an NVDIMM may contribute to multiple SPA ranges and those ranges
may span sockets.

>
> As for assembled SPA, I'd assume that it applies to interleaved set
> and all NVDIMMs with it should be on the same node. It's somewhat
> irrelevant question though as QEMU so far implements only
>   1:1:1/SPA:Region Mapping:NVDIMM Device/
> mapping.
>
> My main concern with using static configuration tables for proximity
> mapping, we'd miss on hotplug side of equation. However if we start
> from dynamic side first, we could later complement it with static
> tables if there really were need for it.

Especially when you consider the new HMAT table that wants to have
proximity domains for describing performance characteristics of an
address range relative to an initiator, the _PXM method on an
individual NVDIMM device is a poor fit for describing a wider set.



Re: [Qemu-devel] [PATCH] hw/acpi-build: build SRAT memory affinity structures for NVDIMM

2018-02-20 Thread Dan Williams
On Tue, Feb 20, 2018 at 6:10 AM, Igor Mammedov  wrote:
> On Sat, 17 Feb 2018 14:31:35 +0800
> Haozhong Zhang  wrote:
>
>> ACPI 6.2A Table 5-129 "SPA Range Structure" requires the proximity
>> domain of a NVDIMM SPA range must match with corresponding entry in
>> SRAT table.
>>
>> The address ranges of vNVDIMM in QEMU are allocated from the
>> hot-pluggable address space, which is entirely covered by one SRAT
>> memory affinity structure. However, users can set the vNVDIMM
>> proximity domain in NFIT SPA range structure by the 'node' property of
>> '-device nvdimm' to a value different than the one in the above SRAT
>> memory affinity structure.
>>
>> In order to solve such proximity domain mismatch, this patch builds one
>> SRAT memory affinity structure for each NVDIMM device with the
>> proximity domain used in NFIT. The remaining hot-pluggable address
>> space is covered by one or multiple SRAT memory affinity structures
>> with the proximity domain of the last node as before.
>>
>> Signed-off-by: Haozhong Zhang 
> If we consider a hotpluggable system, a correctly implemented OS should
> be able to pull proximity from Device::_PXM and override any value from SRAT.
> Do we really have a problem here (anything that breaks if we would use _PXM)?
> Maybe we should add _PXM object to nvdimm device nodes instead of massaging 
> SRAT?

Unfortunately _PXM is an awkward fit. Currently the proximity domain
is attached to the SPA range structure. The SPA range may be
associated with multiple DIMM devices and those individual NVDIMMs may
have conflicting _PXM properties. Even if that was unified across
DIMMs it is ambiguous whether a DIMM-device _PXM would relate to the
device's control interface, or the assembled persistent memory SPA
range.



Re: [Qemu-devel] [PATCH v2 7/8] migration/ram: ensure write persistence on loading compressed pages to PMEM

2018-02-07 Thread Dan Williams
On Wed, Feb 7, 2018 at 10:37 AM, Dr. David Alan Gilbert
 wrote:
> * Dan Williams (dan.j.willi...@intel.com) wrote:
>> On Wed, Feb 7, 2018 at 10:08 AM, Dr. David Alan Gilbert
>>  wrote:
>> > * Dan Williams (dan.j.willi...@intel.com) wrote:
>> >> On Wed, Feb 7, 2018 at 5:24 AM, Dr. David Alan Gilbert
>> >>  wrote:
>> >> > * Haozhong Zhang (haozhong.zh...@intel.com) wrote:
>> >> >> On 02/07/18 13:03 +, Dr. David Alan Gilbert wrote:
>> >> >> > * Haozhong Zhang (haozhong.zh...@intel.com) wrote:
>> >> >> > > On 02/07/18 11:54 +, Dr. David Alan Gilbert wrote:
>> >> >> > > > * Haozhong Zhang (haozhong.zh...@intel.com) wrote:
>> >> >> > > > > When loading a compressed page to persistent memory, flush CPU 
>> >> >> > > > > cache
>> >> >> > > > > after the data is decompressed. Combined with a call to 
>> >> >> > > > > pmem_drain()
>> >> >> > > > > at the end of memory loading, we can guarantee those 
>> >> >> > > > > compressed pages
>> >> >> > > > > are persistently loaded to PMEM.
>> >> >> > > >
>> >> >> > > > Can you explain why this can use the flush and doesn't need the 
>> >> >> > > > special
>> >> >> > > > memset?
>> >> >> > >
>> >> >> > > The best approach to ensure the write persistence is to operate 
>> >> >> > > pmem
>> >> >> > > all via libpmem, e.g., pmem_memcpy_nodrain() + pmem_drain(). 
>> >> >> > > However,
>> >> >> > > the write to pmem in this case is performed by uncompress() which 
>> >> >> > > is
>> >> >> > > implemented out of QEMU and libpmem. It may or may not use libpmem,
>> >> >> > > which is not controlled by QEMU. Therefore, we have to use the less
>> >> >> > > optimal approach, that is to flush cache for all pmem addresses 
>> >> >> > > that
>> >> >> > > uncompress() may have written, i.e.,/e.g., memcpy() and/or 
>> >> >> > > memset() in
>> >> >> > > uncompress(), and pmem_flush() + pmem_drain() in QEMU.
>> >> >> >
>> >> >> > In what way is it less optimal?
>> >> >> > If that's a legal thing to do, then why not just do a pmem_flush +
>> >> >> > pmem_drain right at the end of the ram loading and leave all the 
>> >> >> > rest of
>> >> >> > the code untouched?
>> >> >>
>> >> >> For example, the implementation pmem_memcpy_nodrain() prefers to use
>> >> >> movnt instructions w/o flush to write pmem if those instructions are
>> >> >> available, and falls back to memcpy() + flush if movnt are not
>> >> >> available, so I suppose the latter is less optimal.
>> >> >
>> >> > But if you use normal memcpy calls to copy a few GB of RAM in an
>> >> > incoming migrate and then do a single flush at the end, isn't that
>> >> > better?
>> >>
>> >> Not really, because now you've needlessly polluted the cache and are
>> >> spending CPU looping over the cachelines that could have been bypassed
>> >> with movnt.
>> >
>> > What's different in the pmem case?   Isn't what you've said true in the
>> > normal migrate case as well?
>> >
>>
>> In the normal migrate case the memory is volatile so once the copy is
>> globally visiable you're done. In the pmem case the migration is not
>> complete until the persistent state is synchronized.
>>
>> Note, I'm talking in generalities because I don't know the deeper
>> details of the migrate flow.
>
> On the receive side of a migrate, during a normal precopy migrate
> (without xbzrle) nothing is going to be reading that RAM until
> the whole RAM load has completed anyway - so we're not benefiting from
> it being in the caches either.
>
> In the pmem case, again since nothing is going to be reading from that
> RAM until the end anyway, why bother flushing as we go as opposed to at
> the end?

Flushing at the end implies doing a large loop flushing the caches at
the end of the transfer because the x86 ISA only exposes a
line-by-line flush to unprivileged code rather than a full cache flush
like what the kernel can do with wbinvd. So, better to flush as we go
rather than incur the overhead of the loop at the end. I.e. I'm
assuming it is more efficient to do 'movnt' in the first instance and
not worry about the flush loop.
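
For scale, the unprivileged flush loop being avoided looks roughly like
this (userspace sketch; assumes clflushopt support and a compiler flag
such as -mclflushopt):

#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

#define CACHELINE	64UL

/* sketch: line-by-line flush over an entire transferred range; this is
 * the loop a flush-at-the-end scheme pays for, and movnt avoids */
static void flush_range(const void *addr, size_t len)
{
	const char *p = (const char *)((uintptr_t)addr & ~(CACHELINE - 1));
	const char *end = (const char *)addr + len;

	for (; p < end; p += CACHELINE)
		_mm_clflushopt((void *)p);
	_mm_sfence();	/* order flushes before claiming persistence */
}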



Re: [Qemu-devel] [PATCH v2 7/8] migration/ram: ensure write persistence on loading compressed pages to PMEM

2018-02-07 Thread Dan Williams
On Wed, Feb 7, 2018 at 10:08 AM, Dr. David Alan Gilbert
 wrote:
> * Dan Williams (dan.j.willi...@intel.com) wrote:
>> On Wed, Feb 7, 2018 at 5:24 AM, Dr. David Alan Gilbert
>>  wrote:
>> > * Haozhong Zhang (haozhong.zh...@intel.com) wrote:
>> >> On 02/07/18 13:03 +, Dr. David Alan Gilbert wrote:
>> >> > * Haozhong Zhang (haozhong.zh...@intel.com) wrote:
>> >> > > On 02/07/18 11:54 +, Dr. David Alan Gilbert wrote:
>> >> > > > * Haozhong Zhang (haozhong.zh...@intel.com) wrote:
>> >> > > > > When loading a compressed page to persistent memory, flush CPU 
>> >> > > > > cache
>> >> > > > > after the data is decompressed. Combined with a call to 
>> >> > > > > pmem_drain()
>> >> > > > > at the end of memory loading, we can guarantee those compressed 
>> >> > > > > pages
>> >> > > > > are persistently loaded to PMEM.
>> >> > > >
>> >> > > > Can you explain why this can use the flush and doesn't need the 
>> >> > > > special
>> >> > > > memset?
>> >> > >
>> >> > > The best approach to ensure the write persistence is to operate pmem
>> >> > > all via libpmem, e.g., pmem_memcpy_nodrain() + pmem_drain(). However,
>> >> > > the write to pmem in this case is performed by uncompress() which is
>> >> > > implemented out of QEMU and libpmem. It may or may not use libpmem,
>> >> > > which is not controlled by QEMU. Therefore, we have to use the less
>> >> > > optimal approach, that is to flush cache for all pmem addresses that
>> >> > > uncompress() may have written, i.e.,/e.g., memcpy() and/or memset() in
>> >> > > uncompress(), and pmem_flush() + pmem_drain() in QEMU.
>> >> >
>> >> > In what way is it less optimal?
>> >> > If that's a legal thing to do, then why not just do a pmem_flush +
>> >> > pmem_drain right at the end of the ram loading and leave all the rest of
>> >> > the code untouched?
>> >>
>> >> For example, the implementation pmem_memcpy_nodrain() prefers to use
>> >> movnt instructions w/o flush to write pmem if those instructions are
>> >> available, and falls back to memcpy() + flush if movnt are not
>> >> available, so I suppose the latter is less optimal.
>> >
>> > But if you use normal memcpy calls to copy a few GB of RAM in an
>> > incoming migrate and then do a single flush at the end, isn't that
>> > better?
>>
>> Not really, because now you've needlessly polluted the cache and are
>> spending CPU looping over the cachelines that could have been bypassed
>> with movnt.
>
> What's different in the pmem case?   Isn't what you've said true in the
> normal migrate case as well?
>

In the normal migrate case the memory is volatile so once the copy is
globally visiable you're done. In the pmem case the migration is not
complete until the persistent state is synchronized.

Note, I'm talking in generalities because I don't know the deeper
details of the migrate flow.



Re: [Qemu-devel] [PATCH v2 7/8] migration/ram: ensure write persistence on loading compressed pages to PMEM

2018-02-07 Thread Dan Williams
On Wed, Feb 7, 2018 at 5:24 AM, Dr. David Alan Gilbert
 wrote:
> * Haozhong Zhang (haozhong.zh...@intel.com) wrote:
>> On 02/07/18 13:03 +, Dr. David Alan Gilbert wrote:
>> > * Haozhong Zhang (haozhong.zh...@intel.com) wrote:
>> > > On 02/07/18 11:54 +, Dr. David Alan Gilbert wrote:
>> > > > * Haozhong Zhang (haozhong.zh...@intel.com) wrote:
>> > > > > When loading a compressed page to persistent memory, flush CPU cache
>> > > > > after the data is decompressed. Combined with a call to pmem_drain()
>> > > > > at the end of memory loading, we can guarantee those compressed pages
>> > > > > are persistently loaded to PMEM.
>> > > >
>> > > > Can you explain why this can use the flush and doesn't need the special
>> > > > memset?
>> > >
>> > > The best approach to ensure the write persistence is to operate pmem
>> > > all via libpmem, e.g., pmem_memcpy_nodrain() + pmem_drain(). However,
>> > > the write to pmem in this case is performed by uncompress() which is
>> > > implemented out of QEMU and libpmem. It may or may not use libpmem,
>> > > which is not controlled by QEMU. Therefore, we have to use the less
>> > > optimal approach, that is to flush cache for all pmem addresses that
>> > > uncompress() may have written, i.e.,/e.g., memcpy() and/or memset() in
>> > > uncompress(), and pmem_flush() + pmem_drain() in QEMU.
>> >
>> > In what way is it less optimal?
>> > If that's a legal thing to do, then why not just do a pmem_flush +
>> > pmem_drain right at the end of the ram loading and leave all the rest of
>> > the code untouched?
>>
>> For example, the implementation pmem_memcpy_nodrain() prefers to use
>> movnt instructions w/o flush to write pmem if those instructions are
>> available, and falls back to memcpy() + flush if movnt are not
>> available, so I suppose the latter is less optimal.
>
> But if you use normal memcpy calls to copy a few GB of RAM in an
> incoming migrate and then do a single flush at the end, isn't that
> better?

Not really, because now you've needlessly polluted the cache and are
spending CPU looping over the cachelines that could have been bypassed
with movnt.



Re: [Qemu-devel] [PATCH v4 0/6] nvdimm: support MAP_SYNC for memory-backend-file

2018-01-31 Thread Dan Williams
[ adding Michal and lsf-pci ]

On Wed, Jan 31, 2018 at 7:02 PM, Dan Williams  wrote:
> On Wed, Jan 31, 2018 at 6:29 PM, Haozhong Zhang
>  wrote:
>> + vfio maintainer Alex Williamson in case my understanding of vfio is 
>> incorrect.
>>
>> On 01/31/18 16:32 -0800, Dan Williams wrote:
>>> On Wed, Jan 31, 2018 at 4:24 PM, Haozhong Zhang
>>>  wrote:
>>> > On 01/31/18 16:08 -0800, Dan Williams wrote:
>>> >> On Wed, Jan 31, 2018 at 4:02 PM, Haozhong Zhang
>>> >>  wrote:
>>> >> > On 01/31/18 14:25 -0800, Dan Williams wrote:
>>> >> >> On Tue, Jan 30, 2018 at 10:02 PM, Haozhong Zhang
>>> >> >>  wrote:
>>> >> >> > Linux 4.15 introduces a new mmap flag MAP_SYNC, which can be used to
>>> >> >> > guarantee the write persistence to mmap'ed files supporting DAX 
>>> >> >> > (e.g.,
>>> >> >> > files on ext4/xfs file system mounted with '-o dax').
>>> >> >>
>>> >> >> Wait, MAP_SYNC does not guarantee persistence. It makes sure that the
>>> >> >> metadata is in sync after a fault. However, that does not make
>>> >> >> filesystem-DAX safe for use with QEMU, because we still need to
>>> >> >> coordinate DMA with fileystem operations. There is no way to do that
>>> >> >> coordination from within a guest. QEMU needs to use device-dax if the
>>> >> >> guest might ever perform DMA to a virtual-pmem range. See this patch
>>> >> >> set for more details on the DAX vs DMA problem [1]. I think we need to
>>> >> >> enforce this in the host kernel. I.e. do not allow file backed DAX
>>> >> >> pages to be mapped in EPT entries unless / until we have a solution to
>>> >> >> the DMA synchronization problem. Apologies for not noticing this
>>> >> >> earlier.
>>> >> >
>>> >> > QEMU does not truncate or punch holes of the file once it has been
>>> >> > mmap()'ed. Does the problem [1] still exist in such case?
>>> >>
>>> >> Something else on the system might. The only agent that could enforce
>>> >> protection is the kernel, and the kernel will likely just disallow
>>> >> passing addresses from filesystem-dax vmas through to a guest
>>> >> altogether. I think there's even a problem in the non-DAX case unless
>>> >> KVM is pinning pages while they are handed out to a guest. The problem
>>> >> is that we don't have a page cache page to pin in the DAX case.
>>> >>
>>> >
>>> > Does it mean any user-space code like
>>> >   ptr = mmap(..., fd, ...); // fd refers to a file on DAX filesystem
>>> >   // make DMA to ptr
>>> > is unsafe?
>>>
>>> Yes, it is currently unsafe because there is no coordination with the
>>> filesystem if it decides to make block layout changes. We can fix that
>>> in the non-virtualization case by having the filesystem wait for DMA
>>> completion callbacks (i.e. wait for all pages to be idle), but as far
>>> as I can see we can't do the same coordination for DMA initiated by a
>>> guest device driver.
>>>
>>
>> I think that fix [1] also works for KVM/QEMU. The guest DMA are
>> performed on two types of devices:
>>
>> 1. For emulated devices, the guest DMA requests are trapped and
>>actually performed by QEMU on the host side. The host side fix [1]
>>can cover this case.
>>
>> 2. For passthrough devices, vfio pins all pages, including those
>>backed by dax mode files, used by the guest if any device is
>>passthroughed to it. If I read the commit message in [2] correctly,
>>operations that change the page-to-file offset association of pages
>>from dax mode files will be deferred until the reference count of
>>the affected pages becomes 1.  That is, if any passthrough device
>>is used with a VM, the changes of page-to-file offset will not be
>>able to happen until the VM is shutdown, so the fix [1] still takes
>>effect here.
>
> This sounds like a longterm mapping under control of vfio and not the
> filesystem. See get_user_pages_longterm(), it is a problem if pages
> are pinned indefinitely especially DAX. It sounds like vfio is in the
> same boat as RDMA and cannot support long lived pins of DAX pages. As
> of 4.15 RDMA to filesystem-DAX pages has been disabled. The eventual
> fix will be to create a "memory-registration with lease" semantic
> available for RDMA so that the kernel can forcibly revoke page pinning
> to perform physical layout changes. In the near term it seems
> vaddr_get_pfn() needs to be fixed to use get_user_pages_longterm() so
> that filesystem-dax mappings are explicitly disallowed.
>
>> Another question is how a user-space application (e.g., QEMU) knows
>> whether it's safe to mmap a file on the DAX file system?
>
> I think we fix vaddr_get_pfn() to start failing for DAX mappings
> unless/until we can add a "with lease" mechanism. Userspace will know
> when it is safe again when vfio stops failing.

Btw, there is an LSF/MM topic proposal on this subject [1].

[1]: https://lists.01.org/pipermail/linux-nvdimm/2018-January/013935.html
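
A sketch of the near-term vfio change being suggested above, against the
4.15-era GUP API (the helper name is hypothetical):

/* Sketch: get_user_pages_longterm() fails on filesystem-DAX vmas, so a
 * vaddr_get_pfn() built on it would refuse those mappings outright. */
static int vaddr_pin_page(unsigned long vaddr, struct page **page)
{
	long ret;

	down_read(&current->mm->mmap_sem);
	ret = get_user_pages_longterm(vaddr, 1, FOLL_WRITE, page, NULL);
	up_read(&current->mm->mmap_sem);

	return ret == 1 ? 0 : -EFAULT;
}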



Re: [Qemu-devel] [PATCH v4 0/6] nvdimm: support MAP_SYNC for memory-backend-file

2018-01-31 Thread Dan Williams
On Wed, Jan 31, 2018 at 6:29 PM, Haozhong Zhang
 wrote:
> + vfio maintainer Alex Williamson in case my understanding of vfio is 
> incorrect.
>
> On 01/31/18 16:32 -0800, Dan Williams wrote:
>> On Wed, Jan 31, 2018 at 4:24 PM, Haozhong Zhang
>>  wrote:
>> > On 01/31/18 16:08 -0800, Dan Williams wrote:
>> >> On Wed, Jan 31, 2018 at 4:02 PM, Haozhong Zhang
>> >>  wrote:
>> >> > On 01/31/18 14:25 -0800, Dan Williams wrote:
>> >> >> On Tue, Jan 30, 2018 at 10:02 PM, Haozhong Zhang
>> >> >>  wrote:
>> >> >> > Linux 4.15 introduces a new mmap flag MAP_SYNC, which can be used to
>> >> >> > guarantee the write persistence to mmap'ed files supporting DAX 
>> >> >> > (e.g.,
>> >> >> > files on ext4/xfs file system mounted with '-o dax').
>> >> >>
>> >> >> Wait, MAP_SYNC does not guarantee persistence. It makes sure that the
>> >> >> metadata is in sync after a fault. However, that does not make
>> >> >> filesystem-DAX safe for use with QEMU, because we still need to
>> >> >> coordinate DMA with filesystem operations. There is no way to do that
>> >> >> coordination from within a guest. QEMU needs to use device-dax if the
>> >> >> guest might ever perform DMA to a virtual-pmem range. See this patch
>> >> >> set for more details on the DAX vs DMA problem [1]. I think we need to
>> >> >> enforce this in the host kernel. I.e., do not allow file-backed DAX
>> >> >> pages to be mapped in EPT entries unless / until we have a solution to
>> >> >> the DMA synchronization problem. Apologies for not noticing this
>> >> >> earlier.
>> >> >
>> >> > QEMU does not truncate or punch holes in the file once it has been
>> >> > mmap()'ed. Does the problem [1] still exist in that case?
>> >>
>> >> Something else on the system might. The only agent that could enforce
>> >> protection is the kernel, and the kernel will likely just disallow
>> >> passing addresses from filesystem-dax vmas through to a guest
>> >> altogether. I think there's even a problem in the non-DAX case unless
>> >> KVM is pinning pages while they are handed out to a guest. The problem
>> >> is that we don't have a page cache page to pin in the DAX case.
>> >>
>> >
>> > Does it mean any user-space code like
>> >   ptr = mmap(..., fd, ...); // fd refers to a file on DAX filesystem
>> >   // make DMA to ptr
>> > is unsafe?
>>
>> Yes, it is currently unsafe because there is no coordination with the
>> filesystem if it decides to make block layout changes. We can fix that
>> in the non-virtualization case by having the filesystem wait for DMA
>> completion callbacks (i.e. wait for all pages to be idle), but as far
>> as I can see we can't do the same coordination for DMA initiated by a
>> guest device driver.
>>
>
> I think that fix [1] also works for KVM/QEMU. Guest DMA is
> performed on two types of devices:
>
> 1. For emulated devices, the guest DMA requests are trapped and
>actually performed by QEMU on the host side. The host side fix [1]
>can cover this case.
>
> 2. For passthrough devices, vfio pins all pages used by the guest,
>    including those backed by dax mode files, whenever any device is
>    passed through to it. If I read the commit message in [2] correctly,
>operations that change the page-to-file offset association of pages
>from dax mode files will be deferred until the reference count of
>    the affected pages becomes 1. That is, if any passthrough device
>    is used with a VM, page-to-file offset changes cannot happen until
>    the VM is shut down, so the fix [1] still takes effect here.

This sounds like a longterm mapping under control of vfio and not the
filesystem. See get_user_pages_longterm(); it is a problem if pages
are pinned indefinitely, especially DAX pages. It sounds like vfio is in the
same boat as RDMA and cannot support long lived pins of DAX pages. As
of 4.15 RDMA to filesystem-DAX pages has been disabled. The eventual
fix will be to create a "memory-registration with lease" semantic
available for RDMA so that the kernel can forcibly revoke page pinning
to perform physical layout changes. In the near term it seems
vaddr_get_pfn() needs to be fixed to use get_user_pages_longterm() so
that filesystem-dax mappings are explicitly disallowed.
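
A sketch of that near-term vaddr_get_pfn() change (a hypothetical
simplification, not the actual vfio patch; the page-fault fallback
path vfio also carries is omitted):

/*
 * Pin the page with get_user_pages_longterm() so that
 * filesystem-DAX vmas are refused (-EOPNOTSUPP) instead of being
 * pinned indefinitely. Only valid against current->mm, which is
 * what longterm pinning requires anyway.
 */
static int vaddr_get_pfn(unsigned long vaddr, int prot,
			 unsigned long *pfn)
{
	unsigned int flags = (prot & IOMMU_WRITE) ? FOLL_WRITE : 0;
	struct page *page;
	int ret;

	down_read(&current->mm->mmap_sem);
	ret = get_user_pages_longterm(vaddr, 1, flags, &page, NULL);
	up_read(&current->mm->mmap_sem);

	if (ret == 1) {
		*pfn = page_to_pfn(page);
		return 0;
	}
	/* -EOPNOTSUPP here means a filesystem-DAX mapping was rejected */
	return ret < 0 ? ret : -EFAULT;
}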

> Another question: how does a user-space application (e.g., QEMU) know
> whether it's safe to mmap a file on a DAX file system?

I think we fix vaddr_get_pfn() to start failing for DAX mappings
unless/until we can add a "with lease" mechanism. Userspace will know
it is safe again when vfio stops failing.



Re: [Qemu-devel] [PATCH v4 0/6] nvdimm: support MAP_SYNC for memory-backend-file

2018-01-31 Thread Dan Williams
On Wed, Jan 31, 2018 at 4:24 PM, Haozhong Zhang
 wrote:
> On 01/31/18 16:08 -0800, Dan Williams wrote:
>> On Wed, Jan 31, 2018 at 4:02 PM, Haozhong Zhang
>>  wrote:
>> > On 01/31/18 14:25 -0800, Dan Williams wrote:
>> >> On Tue, Jan 30, 2018 at 10:02 PM, Haozhong Zhang
>> >>  wrote:
>> >> > Linux 4.15 introduces a new mmap flag MAP_SYNC, which can be used to
>> >> > guarantee the write persistence to mmap'ed files supporting DAX (e.g.,
>> >> > files on ext4/xfs file system mounted with '-o dax').
>> >>
>> >> Wait, MAP_SYNC does not guarantee persistence. It makes sure that the
>> >> metadata is in sync after a fault. However, that does not make
>> >> filesystem-DAX safe for use with QEMU, because we still need to
>> >> coordinate DMA with filesystem operations. There is no way to do that
>> >> coordination from within a guest. QEMU needs to use device-dax if the
>> >> guest might ever perform DMA to a virtual-pmem range. See this patch
>> >> set for more details on the DAX vs DMA problem [1]. I think we need to
>> >> enforce this in the host kernel. I.e., do not allow file-backed DAX
>> >> pages to be mapped in EPT entries unless / until we have a solution to
>> >> the DMA synchronization problem. Apologies for not noticing this
>> >> earlier.
>> >
>> > QEMU does not truncate or punch holes in the file once it has been
>> > mmap()'ed. Does the problem [1] still exist in that case?
>>
>> Something else on the system might. The only agent that could enforce
>> protection is the kernel, and the kernel will likely just disallow
>> passing addresses from filesystem-dax vmas through to a guest
>> altogether. I think there's even a problem in the non-DAX case unless
>> KVM is pinning pages while they are handed out to a guest. The problem
>> is that we don't have a page cache page to pin in the DAX case.
>>
>
> Does it mean any user-space code like
>   ptr = mmap(..., fd, ...); // fd refers to a file on DAX filesystem
>   // make DMA to ptr
> is unsafe?

Yes, it is currently unsafe because there is no coordination with the
filesystem if it decides to make block layout changes. We can fix that
in the non-virtualization case by having the filesystem wait for DMA
completion callbacks (i.e. wait for all pages to be idle), but as far
as I can see we can't do the same coordination for DMA initiated by a
guest device driver.
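
To make the quoted pseudocode concrete, here is a minimal sketch of
the unsafe pattern (hypothetical paths, error handling trimmed). The
mmap itself is fine; the hazard is handing the mapping to any DMA
initiator, here simulated with an O_DIRECT read from another device:

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/mnt/dax/file", O_RDWR);	/* file on a DAX fs */
	char *ptr = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
			 MAP_SHARED, fd, 0);

	/* the O_DIRECT read makes a device DMA into the DAX page */
	int disk = open("/dev/sdb", O_RDONLY | O_DIRECT);
	read(disk, ptr, 4096);

	/*
	 * If the filesystem truncates, hole-punches, or reflinks
	 * /mnt/dax/file while that DMA is in flight, the device can
	 * scribble on blocks the filesystem has already reallocated.
	 */
	return 0;
}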



Re: [Qemu-devel] [PATCH v4 0/6] nvdimm: support MAP_SYNC for memory-backend-file

2018-01-31 Thread Dan Williams
On Wed, Jan 31, 2018 at 4:02 PM, Haozhong Zhang
 wrote:
> On 01/31/18 14:25 -0800, Dan Williams wrote:
>> On Tue, Jan 30, 2018 at 10:02 PM, Haozhong Zhang
>>  wrote:
>> > Linux 4.15 introduces a new mmap flag MAP_SYNC, which can be used to
>> > guarantee the write persistence to mmap'ed files supporting DAX (e.g.,
>> > files on ext4/xfs file system mounted with '-o dax').
>>
>> Wait, MAP_SYNC does not guarantee persistence. It makes sure that the
>> metadata is in sync after a fault. However, that does not make
>> filesystem-DAX safe for use with QEMU, because we still need to
>> coordinate DMA with filesystem operations. There is no way to do that
>> coordination from within a guest. QEMU needs to use device-dax if the
>> guest might ever perform DMA to a virtual-pmem range. See this patch
>> set for more details on the DAX vs DMA problem [1]. I think we need to
>> enforce this in the host kernel. I.e., do not allow file-backed DAX
>> pages to be mapped in EPT entries unless / until we have a solution to
>> the DMA synchronization problem. Apologies for not noticing this
>> earlier.
>
> QEMU does not truncate or punch holes in the file once it has been
> mmap()'ed. Does the problem [1] still exist in that case?

Something else on the system might. The only agent that could enforce
protection is the kernel, and the kernel will likely just disallow
passing addresses from filesystem-dax vmas through to a guest
altogether. I think there's even a problem in the non-DAX case unless
KVM is pinning pages while they are handed out to a guest. The problem
is that we don't have a page cache page to pin in the DAX case.



Re: [Qemu-devel] [PATCH v4 0/6] nvdimm: support MAP_SYNC for memory-backend-file

2018-01-31 Thread Dan Williams
On Tue, Jan 30, 2018 at 10:02 PM, Haozhong Zhang
 wrote:
> Linux 4.15 introduces a new mmap flag MAP_SYNC, which can be used to
> guarantee the write persistence to mmap'ed files supporting DAX (e.g.,
> files on ext4/xfs file system mounted with '-o dax').

Wait, MAP_SYNC does not guarantee persistence. It makes sure that the
metadata is in sync after a fault. However, that does not make
filesystem-DAX safe for use with QEMU, because we still need to
coordinate DMA with filesystem operations. There is no way to do that
coordination from within a guest. QEMU needs to use device-dax if the
guest might ever perform DMA to a virtual-pmem range. See this patch
set for more details on the DAX vs DMA problem [1]. I think we need to
enforce this in the host kernel. I.e., do not allow file-backed DAX
pages to be mapped in EPT entries unless / until we have a solution to
the DMA synchronization problem. Apologies for not noticing this
earlier.

[1]: https://lists.01.org/pipermail/linux-nvdimm/2017-December/013704.html
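
For completeness, what MAP_SYNC does buy userspace is crash-consistent
CPU stores without a trailing fsync(), nothing more. A minimal sketch
(hypothetical mount point; the cache-line flush would normally come
from a pmem library such as libpmem):

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/mman.h>

#ifndef MAP_SYNC
#define MAP_SYNC 0x80000
#endif
#ifndef MAP_SHARED_VALIDATE
#define MAP_SHARED_VALIDATE 0x03
#endif

int main(void)
{
	int fd = open("/mnt/dax/file", O_RDWR);
	/*
	 * MAP_SHARED_VALIDATE makes the kernel reject flags it does
	 * not understand, so this fails on a non-DAX file instead of
	 * silently dropping MAP_SYNC.
	 */
	char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
		       MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
	if (p == MAP_FAILED)
		return 1;

	p[0] = 1;	/* durable once the cache line is flushed; no
			 * fsync() needed, but DMA to p is still
			 * uncoordinated with the filesystem */
	return 0;
}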


