Re: [RFC PATCH] cxl: avoid duplicating report from MCE & device
Shiyang Ruan wrote: > Background: > Since CXL device is a memory device, while CPU consumes a poison page of > CXL device, it always triggers a MCE by interrupt (INT18), no matter > which-First path is configured. This is the first report. Then > currently, in FW-First path, the poison event is transferred according > to the following process: CXL device -> firmware -> OS:ACPI->APEI->GHES > -> CPER -> trace report. This is the second one. These two reports > are indicating the same poisoning page, which is the so-called "duplicate > report"[1]. And the memory_failure() handling I'm trying to add in > OS-First path could also be another duplicate report. > > Hope the flow below could make it easier to understand: > CPU accesses bad memory on CXL device, then > -> MCE (INT18), *always* report (1) > -> * FW-First (implemented now) > -> CXL device -> FW > -> OS:ACPI->APEI->GHES->CPER -> trace report (2.a) > * OS-First (not implemented yet, I'm working on it) > -> CXL device -> MSI > -> OS:CXL driver -> memory_failure() (2.b) > so, the (1) and (2.a/b) are duplicated. > > (I didn't get response in my reply for [1] while I have to make patch to > solve this problem, so please correct me if my understanding is wrong.) The CPU MCE may not be in the loop. Consider the case of patrol scrub, or device-DMA accessing poison. In that case the device will signal a component event and the CPU may never issue the MCE. What is missing for me from this description is *why* does the duplicate report matter in practice? If all that happens is that the kernel repeats the lookup to offline the page and set the HWPoison bit, is that duplicated work worth adding more tracking? > This patch adds a new notifier_block and MCE_PRIO_CXL, for CXL memdev > to check whether the current poison page has been reported (if yes, > stop the notifier chain, won't call the following memory_failure() > to report), into `x86_mce_decoder_chain`. 
In this way, if the poison > page already handled(recorded and reported) in (1) or (2), the other one > won't duplicate the report. The record could be clear when > cxl_clear_poison() is called. > > [1] > https://lore.kernel.org/linux-cxl/664d948fb86f0_e8be29...@dwillia2-mobl3.amr.corp.intel.com.notmuch/ > > Signed-off-by: Shiyang Ruan > --- > arch/x86/include/asm/mce.h | 1 + > drivers/cxl/core/mbox.c| 130 + > drivers/cxl/core/memdev.c | 6 +- > drivers/cxl/cxlmem.h | 3 + > 4 files changed, 139 insertions(+), 1 deletion(-) > > diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h > index dfd2e9699bd7..d8109c48e7d9 100644 > --- a/arch/x86/include/asm/mce.h > +++ b/arch/x86/include/asm/mce.h > @@ -182,6 +182,7 @@ enum mce_notifier_prios { > MCE_PRIO_NFIT, > MCE_PRIO_EXTLOG, > MCE_PRIO_UC, > + MCE_PRIO_CXL, > MCE_PRIO_EARLY, > MCE_PRIO_CEC, > MCE_PRIO_HIGHEST = MCE_PRIO_CEC > diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c > index 2626f3fff201..0eb3c5401e81 100644 > --- a/drivers/cxl/core/mbox.c > +++ b/drivers/cxl/core/mbox.c > @@ -4,6 +4,8 @@ > #include > #include > #include > +#include > +#include > #include > #include > #include > @@ -880,6 +882,9 @@ void cxl_event_trace_record(const struct cxl_memdev > *cxlmd, > if (cxlr) > hpa = cxl_trace_hpa(cxlr, cxlmd, dpa); > > + if (hpa != ULLONG_MAX && cxl_mce_recorded(hpa)) > + return; > + > if (event_type == CXL_CPER_EVENT_GEN_MEDIA) > trace_cxl_general_media(cxlmd, type, cxlr, hpa, > &evt->gen_media); > @@ -1408,6 +1413,127 @@ int cxl_poison_state_init(struct cxl_memdev_state > *mds) > } > EXPORT_SYMBOL_NS_GPL(cxl_poison_state_init, CXL); > > +struct cxl_mce_record { > + struct list_head node; > + u64 hpa; > +}; > +LIST_HEAD(cxl_mce_records); > +DEFINE_MUTEX(cxl_mce_mutex); I would recommend an xarray for this use case as that already has its own internal locking and efficient memory allocation for new nodes. However, the "why" question needs to be answered first.
Re: [PATCH v3 2/2] cxl/core: add poison creation event handler
Shiyang Ruan wrote: [..] > >> My expectation is MF_ACTION_REQUIRED is not appropriate for CXL event > >> reported errors since action is only required for direct consumption > >> events and those need not be reported through the device event queue. > > Got it. > > I'm not very sure about 'Host write/read' type. In my opinion, these > two types of event should be sent from device when CPU is accessing a > bad memory address, they could be thought of a sync event which needs Hmm, no that's not my understanding of a sync event. I expect when error notifications are synchronous the CPU is guaranteed not to make forward progress past the point of encountering the error. MSI-signaled component-events are always asynchronous by that definition because the CPU is free running while the interrupt is in-flight. > the 'MF_ACTION_REQUIRED' flag. Then, we can determine the flag by the > types like this: > - CXL_EVENT_TRANSACTION_READ | CXL_EVENT_TRANSACTION_WRITE >=> MF_ACTION_REQUIRED > - CXL_EVENT_TRANSACTION_INJECT_POISON => MF_SW_SIMULATED > - others => 0 I doubt any reasonable policy can be inferred from the transaction type. Consider that the CPU itself does not take a synchronous exception when writes encounter poison. At most those are flagged via CMCI (corrected machine check interrupt). The only events that cause exceptions are CPU reads that consume poison. The device has no idea whether read events are coming from a CPU or a DMA event. MF_SW_SIMULATED is purely for software simulated poison events as injected poison can still cause system fatal damage if the poison is ingested in an unrecoverable path. So, I think all CXL poison notification events should trigger an action optional memory_failure(). I expect this needs to make sure that duplicates are not a problem. I.e. in the case of CPU consumption of CXL poison, that causes a synchronous MF_ACTION_REQUIRED event via the MCE path *and* it may trigger the device to send an error record for the same page.
As far as I can see, duplicate reports (MCE + CXL device) are unavoidable.
Re: [PATCH v3 2/2] cxl/core: add poison creation event handler
Shiyang Ruan wrote: > Currently driver only traces cxl events, poison creation (for both vmem > and pmem type) on cxl memdev is silent. As it should be. > OS needs to be notified then it could handle poison pages in time. No, it was always the case that latent poison is an "action optional" event. I am not understanding the justification for this approach. What breaks if the kernel does not forward events to memory_failure_queue()? Consider that in the CPU consumption case that the firmware first path will do its own memory_failure_queue() and in the native case the MCE handler will take care of this. So that leaves pages that are accessed by DMA or background operation that encounter poison. Those are "action optional" scenarios and it is not clear to me how the driver tells the difference. This needs more precision on which agent is responsible for what level of reporting. The distribution of responsibility between ACPI GHES, EDAC, and the CXL driver is messy and I expect this changelog to demonstrate it understands all those considerations. > Per CXL spec, the device error event could be signaled through > FW-First and OS-First methods. > > So, add poison creation event handler in OS-First method: > - Qemu: Why is QEMU relevant for this patch? QEMU is only a development vehicle; the upstream enabling should reference shipping, or expected-to-be-shipping, hardware implementations. > - CXL device reports POISON creation event to OS by MSI by sending > GMER/DER after injecting a poison record; When you say "inject" here do you mean "add to the poison list if present"? Because "inject" to me means the "Inject Poison" Memory Device Command. > - CXL driver: > a. parse the POISON event from GMER/DER; > b. translate poisoned DPA to HPA (PFN); > c. 
enqueue poisoned PFN to memory_failure's work queue; > > Signed-off-by: Shiyang Ruan > --- > drivers/cxl/core/mbox.c | 119 +- > drivers/cxl/cxlmem.h | 8 +-- > include/linux/cxl-event.h | 18 +- > 3 files changed, 125 insertions(+), 20 deletions(-) > > diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c > index f0f54aeccc87..76af0d73859d 100644 > --- a/drivers/cxl/core/mbox.c > +++ b/drivers/cxl/core/mbox.c > @@ -837,25 +837,116 @@ int cxl_enumerate_cmds(struct cxl_memdev_state *mds) > } > EXPORT_SYMBOL_NS_GPL(cxl_enumerate_cmds, CXL); > > -void cxl_event_trace_record(const struct cxl_memdev *cxlmd, > - enum cxl_event_log_type type, > - enum cxl_event_type event_type, > - const uuid_t *uuid, union cxl_event *evt) > +static void cxl_report_poison(struct cxl_memdev *cxlmd, struct cxl_region > *cxlr, > + u64 dpa) > { > - if (event_type == CXL_CPER_EVENT_GEN_MEDIA) > + u64 hpa = cxl_trace_hpa(cxlr, cxlmd, dpa); > + unsigned long pfn = PHYS_PFN(hpa); > + > + if (!IS_ENABLED(CONFIG_MEMORY_FAILURE)) > + return; No need for this check, memory_failure_queue() is already stubbed out in the CONFIG_MEMORY_FAILURE=n case. > + memory_failure_queue(pfn, MF_ACTION_REQUIRED); My expectation is MF_ACTION_REQUIRED is not appropriate for CXL event reported errors since action is only required for direct consumption events and those need not be reported through the device event queue. It would be useful to collaborate with a BIOS firmware engineer so that the kernel ends up with similar logic as is used to set CPER record severity, or at least understands why it would want to be different. See how ghes_handle_memory_failure() determines the memory_failure_queue() flags.
Re: [PATCH v3 1/2] cxl/core: correct length of DPA field masks
Shiyang Ruan wrote: > The length of Physical Address in General Media Event Record/DRAM Event > Record is 64-bit, so the field mask should be defined as such length. > Otherwise, this causes cxl_general_media and cxl_dram tracepoints to > mask off the upper-32-bits of DPA addresses. The cxl_poison event is > unaffected. > > If userspace was doing its own DPA-to-HPA translation this could lead to > incorrect page retirement decisions, but there is no known consumer > (like rasdaemon) of this event today. > > Fixes: d54a531a430b ("cxl/mem: Trace General Media Event Record") > Cc: > Cc: Dan Williams > Cc: Davidlohr Bueso > Cc: Jonathan Cameron > Cc: Ira Weiny > Signed-off-by: Shiyang Ruan > --- > drivers/cxl/core/trace.h | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/drivers/cxl/core/trace.h b/drivers/cxl/core/trace.h > index e5f13260fc52..cdfce932d5b1 100644 > --- a/drivers/cxl/core/trace.h > +++ b/drivers/cxl/core/trace.h > @@ -253,7 +253,7 @@ TRACE_EVENT(cxl_generic_event, > * DRAM Event Record > * CXL rev 3.0 section 8.2.9.2.1.2; Table 8-44 > */ > -#define CXL_DPA_FLAGS_MASK 0x3F > +#define CXL_DPA_FLAGS_MASK 0x3FULL > #define CXL_DPA_MASK (~CXL_DPA_FLAGS_MASK) > > #define CXL_DPA_VOLATILE BIT(0) Looks good, Reviewed-by: Dan Williams
Re: [RFC PATCH v2 5/6] cxl: add definition for transaction types
Shiyang Ruan wrote: > The transaction types are defined in General Media Event Record/DRAM Event > per CXL rev 3.0 Section 8.2.9.2.1.1; Table 8-43 and > Section 8.2.9.2.1.2; Table 8-44. Add them for Event Record handler use. Combine this patch with the one that uses them so that the use case can be reviewed together with the implementation.
Re: [RFC PATCH v2 4/6] cxl/core: report poison when injecting from debugfs
Shiyang Ruan wrote: > Poison injection from debugfs is silent too. Add calling > cxl_mem_report_poison() to make it able to do memory_failure(). Why does this need to be signalled? It is a debug interface; the debugger can also trigger a read after the injection, or trigger page soft-offline.
Re: [RFC PATCH v2 3/6] cxl/core: add report option for cxl_mem_get_poison()
Shiyang Ruan wrote: > The GMER only has "Physical Address" field, no such one indicates length. > So, when a poison event is received, we could use GET_POISON_LIST command > to get the poison list. Now driver has cxl_mem_get_poison(), so > reuse it and add a parameter 'bool report', report poison record to MCE > if set true. I am not sure I agree with the rationale here because there is no correlation between the event being signaled and the current state of the poison list. It also establishes a race between multiple GMER events, i.e. imagine the hardware sends 4 GMER events to communicate a 256B poison discovery event. Does the driver need logic to support GMER event 2, 3, and 4 if it already saw all 256B of poison after processing GMER event 1? I think the best the driver can do is assume at least 64B of poison per-event and depend on multiple notifications to handle larger poison lengths. Otherwise, the poison list is really only useful for pre-populating pages to offline after a reboot, i.e. to catch the kernel up with the state of poison pages after a reboot.
Re: [RFC PATCH v2 2/6] cxl/core: introduce cxl_mem_report_poison()
Shiyang Ruan wrote: > If poison is detected (reported from cxl memdev), OS should be notified to > handle it. So, introduce this helper function for later use: > 1. translate DPA to HPA; > 2. enqueue records into memory_failure's work queue; > > Signed-off-by: Shiyang Ruan This patch is too small, it needs the corresponding caller to make sense of the proposed change. > --- > > Currently poison injection from debugfs always create a 64-bytes-length > record, which is fine. But the injection from qemu's QMP API: > qmp_cxl_inject_poison() could create a poison record contains big length, > which may cause many many times of calling memory_failure_queue(). > Though the MEMORY_FAILURE_FIFO_SIZE is 1 << 4, it seems not enough. What matters is what devices do in practice; the kernel should not be worried about odd corner cases that only exist in QEMU injection scenarios.
Re: [RFC PATCH v2 1/6] cxl/core: correct length of DPA field masks
Shiyang Ruan wrote: > The length of Physical Address in General Media Event Record/DRAM Event > Record is 64-bit, so the field mask should be defined as such length. > Otherwise, this causes cxl_general_media and cxl_dram tracepoints to > mask off the upper-32-bits of DPA addresses. The cxl_poison event is > unaffected. > > If userspace was doing its own DPA-to-HPA translation this could lead to > incorrect page retirement decisions, but there is no known consumer > (like rasdaemon) of this event today. > > Fixes: d54a531a430b ("cxl/mem: Trace General Media Event Record") > Cc: > Cc: Dan Williams > Cc: Davidlohr Bueso > Cc: Jonathan Cameron > Cc: Ira Weiny > Signed-off-by: Shiyang Ruan > --- > drivers/cxl/core/trace.h | 6 +++--- > 1 file changed, 3 insertions(+), 3 deletions(-) > > diff --git a/drivers/cxl/core/trace.h b/drivers/cxl/core/trace.h > index e5f13260fc52..e2d1f296df97 100644 > --- a/drivers/cxl/core/trace.h > +++ b/drivers/cxl/core/trace.h > @@ -253,11 +253,11 @@ TRACE_EVENT(cxl_generic_event, > * DRAM Event Record > * CXL rev 3.0 section 8.2.9.2.1.2; Table 8-44 > */ > -#define CXL_DPA_FLAGS_MASK 0x3F > +#define CXL_DPA_FLAGS_MASK 0x3FULL This change makes sense... > #define CXL_DPA_MASK (~CXL_DPA_FLAGS_MASK) > > -#define CXL_DPA_VOLATILE BIT(0) > -#define CXL_DPA_NOT_REPAIRABLE BIT(1) > +#define CXL_DPA_VOLATILE BIT_ULL(0) > +#define CXL_DPA_NOT_REPAIRABLE BIT_ULL(1) ...these do not. The argument to __print_flags() is an unsigned long, so they will be cast down to (unsigned long), and they are never used as a mask so the generated code should not change.
Re: [RFC PATCH v2 0/6] cxl: add poison event handler
Alison Schofield wrote: > On Fri, Mar 29, 2024 at 11:22:32AM -0700, Dan Williams wrote: > > Alison Schofield wrote: > > [..] > > > Upon receipt of that new poison list, call memory_failure_queue() > > > on *any* poison in a mapped space. Is that OK? Can we call > > > memory_failure_queue() on any and every poison report that is in > > > HPA space regardless of whether it first came to us through a GMER? > > > I'm actually wondering if that is going to be the next ask anyway - > > > ie report all poison. > > > > memory_failure_queue() should be called on poison creation events. Leave > > the MF_ACTION_REQUIRED flag not set so that memory_failure() performs > > "action optional" handling. So I would expect memory_failure_queue() > > notification for GMER events, but not on poison list events. > > Seems I totally missed the point of this patch set. > Is its only purpose to make sure that poison that is injected gets > reported to memory_failure? Clarify terms, "poison injection" to me is a debug-only event to test that poison handling is working, "poison creation" is an event where new poison was encountered by CPU consumption, deposited by a DMA-with-poison transaction, or discovered by a background scrub operation. > > So this single patch only: > 1. Poison inject leads to this GMER/CXL_EVENT_TRANSACTION_INJECT_POISON Inject is a special case. Likely this should copy the PMEM legacy where notifying memory_failure() on injected poison is optional: "ndctl inject-error --no-notify" > 2. Driver sees GMER/CXL_EVENT_TRANSACTION_INJECT_POISON and reads poison > list to get accurate length. Again, inject is the least interesting for the common case, production kernels care about "Media ECC Error" and "Scrub Media ECC Error" regardless of transaction type. > 3. Driver reports that to memory_failure_queue() > > Still expect there's some code sharing opportunities and I still wonder about what is next in this area. 
One area this needs to be careful is in unifying the OS-first and FW-first paths. In the FW-first case the platform can trigger memory_failure() along with the GMER by just posting a memory failure CPER record. So there is a risk that things get doubly reported if the GMER handling code blindly triggers memory_failure(). Might be benign, but probably best avoided.
Re: [RFC PATCH v2 0/6] cxl: add poison event handler
Alison Schofield wrote: [..] > Upon receipt of that new poison list, call memory_failure_queue() > on *any* poison in a mapped space. Is that OK? Can we call > memory_failure_queue() on any and every poison report that is in > HPA space regardless of whether it first came to us through a GMER? > I'm actually wondering if that is going to be the next ask anyway - > ie report all poison. memory_failure_queue() should be called on poison creation events. Leave the MF_ACTION_REQUIRED flag not set so that memory_failure() performs "action optional" handling. So I would expect memory_failure_queue() notification for GMER events, but not on poison list events.
Re: [PATCH v2 1/1] cxl/mem: Fix for the index of Clear Event Record Handle
Jonathan Cameron wrote: > On Mon, 18 Mar 2024 10:29:28 +0800 > Yuquan Wang wrote: > > > The dev_dbg info for Clear Event Records mailbox command would report > > the handle of the next record to clear not the current one. > > > > This was because the index 'i' had incremented before printing the > > current handle value. > > > > Signed-off-by: Yuquan Wang > > --- > > drivers/cxl/core/mbox.c | 2 +- > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c > > index 9adda4795eb7..b810a6aa3010 100644 > > --- a/drivers/cxl/core/mbox.c > > +++ b/drivers/cxl/core/mbox.c > > @@ -915,7 +915,7 @@ static int cxl_clear_event_record(struct > > cxl_memdev_state *mds, > > > > payload->handles[i++] = gen->hdr.handle; > > dev_dbg(mds->cxlds.dev, "Event log '%d': Clearing %u\n", log, > > - le16_to_cpu(payload->handles[i])); > > + le16_to_cpu(payload->handles[i-1])); > Trivial but needs spaces around the -. e.g. [i - 1] > > Maybe Dan can fix up whilst applying. > > Otherwise > > Reviewed-by: Jonathan Cameron I have enlisted Dave to start wrangling CXL kernel patches upstream, and I will fall back to just reviewing. Dave, you can add my: Reviewed-by: Dan Williams ...with the same caveat as above.
Re: [PATCH 1/1] cxl/mem: Fix for the index of Clear Event Record Handle
Yuquan Wang wrote: > The dev_dbg info for Clear Event Records mailbox command would report > the handle of the next record to clear not the current one. > > This was because the index 'i' had incremented before printing the > current handle value. > > This fix also adjusts the index variable name from 'i' to 'clear_cnt' > which can be easier for developers to distinguish and understand. > > Signed-off-by: Yuquan Wang > --- > drivers/cxl/core/mbox.c | 21 +++-- > 1 file changed, 11 insertions(+), 10 deletions(-) > > diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c > index 9adda4795eb7..3ca2845ae6aa 100644 > --- a/drivers/cxl/core/mbox.c > +++ b/drivers/cxl/core/mbox.c > @@ -881,7 +881,7 @@ static int cxl_clear_event_record(struct cxl_memdev_state > *mds, > struct cxl_mbox_cmd mbox_cmd; > u16 cnt; > int rc = 0; > - int i; > + int clear_cnt; > > /* Payload size may limit the max handles */ > if (pl_size > mds->payload_size) { > @@ -908,28 +908,29 @@ static int cxl_clear_event_record(struct > cxl_memdev_state *mds, >* Clear Event Records uses u8 for the handle cnt while Get Event >* Record can return up to 0x records. >*/ > - i = 0; > + clear_cnt = 0; > for (cnt = 0; cnt < total; cnt++) { > struct cxl_event_record_raw *raw = &get_pl->records[cnt]; > struct cxl_event_generic *gen = &raw->event.generic; > > - payload->handles[i++] = gen->hdr.handle; > + payload->handles[clear_cnt] = gen->hdr.handle; > dev_dbg(mds->cxlds.dev, "Event log '%d': Clearing %u\n", log, > - le16_to_cpu(payload->handles[i])); Couldn't this patch be much smaller by just changing this to "i - 1", because from the description the only problem is the dev_dbg() statement, so just fix that.
Re: [RFC PATCH 1/5] cxl/core: correct length of DPA field masks
[ add Ira and Davidlohr ] Shiyang Ruan wrote: > > > On 2024/2/10 14:34, Dan Williams wrote: > > Shiyang Ruan wrote: > >> The length of Physical Address in General Media Event Record/DRAM Event > >> Record is 64-bit, so the field mask should be defined as such length. > > > > Can you include this user visible side-effect of this change. Looks like > > this could cause usages of CXL_DPA_FLAGS_MASK to return an incorrect > > result? > > Ok. Will change it to this: > > The length of Physical Address in General Media Event Record/DRAM Event > Record is 64bit, per CXL Spec r3.0 - 8.2.9.2.1.1, Table 8-43. Currently > CXL_DPA_FLAGS_MASK is defined as int (32bit), then CXL_DPA_MASK is an int > too, it will be 0xFFC0 while using "->dpa & CXL_DPA_MASK" to > obtain real physical address (to drop flags in lower bits), in this case > the higher 32bit of ->dpa will be lost. > > To avoid this, define CXL_DPA_FLAGS_MASK as 64bit: 0x3FULL. That part was clear, what is missing is the impact. For example, this bug only impacts the output of the dpa field in the cxl_general_media and cxl_dram tracepoints, but no other part of the CXL code. So in this case the impact of the bug is low, but in the worst case a wrong size mask could compromise the security / integrity of the system. So I would expect a changelog like this to make the importance of the fix clear and to notify the original stakeholders where the bug was introduced. --- The length of Physical Address in General Media Event Record/DRAM Event Record is 64-bit, so the field mask should be defined as such length. Otherwise, this causes cxl_general_media and cxl_dram tracepoints to mask off the upper-32-bits of DPA addresses. The cxl_poison event is unaffected. If userspace was doing its own DPA-to-HPA translation this could lead to incorrect page retirement decisions, but there is no known consumer (like rasdaemon) of this event today. 
Fixes: d54a531a430b ("cxl/mem: Trace General Media Event Record") Cc: Cc: Dan Williams Cc: Davidlohr Bueso Cc: Jonathan Cameron Cc: Ira Weiny ---
RE: [RFC PATCH 5/5] cxl/core: add poison injection event handler
Shiyang Ruan wrote: > Currently driver only trace cxl events, poison injection on cxl memdev > is silent. OS needs to be notified then it could handle poison range > in time. Per CXL spec, the device error event could be signaled through > FW-First and OS-First methods. > > So, add poison event handler in OS-First method: > - qemu: > - CXL device report POISON event to OS by MSI by sending GMER after > injecting a poison record QEMU details do not belong in a kernel changelog. It is ok for an RFC, but my hope is that this can be tested on hardware after being proven on QEMU. > - CXL driver > a. read the POISON event through GMER; <-- this patch > b. get POISON list; > c. translate DPA to HPA; > d. construct a mce instance, then call mce_log() to queue this mce >instance; It is not clear to me why the kernel should proactively fire machine check notifications on injection. The changelog needs to make clear why the kernel should do this, and the consequences of not doing it. For CPU consumed poison the machine check event will already fire. For background discovery of poison, that should translate to a memory_failure() notification with the MF_ACTION_REQUIRED flag cleared. Userspace, like rasdaemon, can then make a page offline decision.
RE: [RFC PATCH 4/5] cxl/core: add report option for cxl_mem_get_poison()
Shiyang Ruan wrote: > When a poison event is received, driver uses GET_POISON_LIST command > to get the poison list. Now driver has cxl_mem_get_poison(), so > reuse it and add a parameter 'bool report', report poison record to MCE > if set true. If the memory error record has the poison event, why does the poison list need to be retrieved by the kernel? I would expect it is sufficient to just report the single poison event and leave it to userspace to react to that event and retrieve more data if it wants.
RE: [RFC PATCH 3/5] cxl/core: introduce cxl_mem_report_poison()
Shiyang Ruan wrote: > If poison is detected(reported from cxl memdev), OS should be notified to > handle it. Introduce this function: > 1. translate DPA to HPA; > 2. construct a MCE instance; (TODO: more details need to be filled) > 3. log it into MCE event queue; > > After that, MCE mechanism can walk over its notifier chain to execute > specific handlers. > > Signed-off-by: Shiyang Ruan > --- > arch/x86/kernel/cpu/mce/core.c | 1 + > drivers/cxl/core/mbox.c| 33 + > 2 files changed, 34 insertions(+) > > diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c > index bc39252bc54f..a64c0aceb7e0 100644 > --- a/arch/x86/kernel/cpu/mce/core.c > +++ b/arch/x86/kernel/cpu/mce/core.c > @@ -131,6 +131,7 @@ void mce_setup(struct mce *m) > m->ppin = cpu_data(m->extcpu).ppin; > m->microcode = boot_cpu_data.microcode; > } > +EXPORT_SYMBOL_GPL(mce_setup); No, mce_setup() is x86 specific and the CXL subsystem is CPU architecture independent. My expectation is that CXL should translate errors for edac similar to how the ACPI GHES code does it. See usage of edac_raw_mc_handle_error() and memory_failure_queue(). Otherwise an MCE is a CPU consumption of poison event, and CXL is reporting device-side discovery of poison.
RE: [RFC PATCH 2/5] cxl/core: introduce cxl_memdev_dpa_to_hpa()
Shiyang Ruan wrote: > When a memdev is assigned to a region, its Device Physical Address will be > mapped to Host Physical Address. Introduce this helper function to > translate HPA from a given memdev and its DPA. > > Signed-off-by: Shiyang Ruan > --- > drivers/cxl/core/memdev.c | 12 > drivers/cxl/cxlmem.h | 1 + > 2 files changed, 13 insertions(+) > > diff --git a/drivers/cxl/core/memdev.c b/drivers/cxl/core/memdev.c > index dae8802ecdb0..c304e709ef0e 100644 > --- a/drivers/cxl/core/memdev.c > +++ b/drivers/cxl/core/memdev.c > @@ -319,6 +319,18 @@ static int cxl_validate_poison_dpa(struct cxl_memdev > *cxlmd, u64 dpa) > return 0; > } > > +phys_addr_t cxl_memdev_dpa_to_hpa(struct cxl_memdev *cxlmd, u64 dpa) > +{ > + struct cxl_region *cxlr = cxl_dpa_to_region(cxlmd, dpa); > + > + if (cxlr) > + return cxlr->params.res->start + dpa; > + else { > + dev_dbg(&cxlmd->dev, "device belongs to no region.\n"); > + return 0; > + } > +} Hmm no, I would not say memdevs are assigned to regions, *endpoint decoders* are assigned to regions. cxl_dpa_to_region() is only an internal helper when the endpoint decoder is unknown. Otherwise endpoint decoders have a direct-link to their region, if mapped. See usage of "cxled->cxld.region".
RE: [RFC PATCH 1/5] cxl/core: correct length of DPA field masks
Shiyang Ruan wrote: > The length of Physical Address in General Media Event Record/DRAM Event > Record is 64-bit, so the field mask should be defined as such length. Can you include this user visible side-effect of this change. Looks like this could cause usages of CXL_DPA_FLAGS_MASK to return an incorrect result? > > Signed-off-by: Shiyang Ruan > --- > drivers/cxl/core/trace.h | 6 +++--- > 1 file changed, 3 insertions(+), 3 deletions(-) > > diff --git a/drivers/cxl/core/trace.h b/drivers/cxl/core/trace.h > index 89445435303a..388a87d972c2 100644 > --- a/drivers/cxl/core/trace.h > +++ b/drivers/cxl/core/trace.h > @@ -253,11 +253,11 @@ TRACE_EVENT(cxl_generic_event, > * DRAM Event Record > * CXL rev 3.0 section 8.2.9.2.1.2; Table 8-44 > */ > -#define CXL_DPA_FLAGS_MASK 0x3F > +#define CXL_DPA_FLAGS_MASK 0x3FULL > #define CXL_DPA_MASK (~CXL_DPA_FLAGS_MASK) > > -#define CXL_DPA_VOLATILE BIT(0) > -#define CXL_DPA_NOT_REPAIRABLE BIT(1) > +#define CXL_DPA_VOLATILE BIT_ULL(0) > +#define CXL_DPA_NOT_REPAIRABLE BIT_ULL(1) > #define show_dpa_flags(flags)__print_flags(flags, "|", >\ > { CXL_DPA_VOLATILE, "VOLATILE" }, \ > { CXL_DPA_NOT_REPAIRABLE, "NOT_REPAIRABLE"} \ > -- > 2.34.1 >
Re: [PATCH v6 1/2] qom: new object to associate device to numa node
Jason Gunthorpe wrote: > On Tue, Jan 09, 2024 at 06:02:03PM +0100, David Hildenbrand wrote: > > > Given that, an alternative proposal that I think would work > > > for you would be to add a 'placeholder' memory node definition > > > in SRAT (so allow 0 size explicitly - might need a new SRAT > > > entry to avoid backwards compat issues). > > > > Putting all the PCI/GI/... complexity aside, I'll just raise again that for > > virtio-mem something simple like that might be helpful as well, IIUC. > > > > -numa node,nodeid=2 \ > > ... > > -device virtio-mem-pci,node=2,... \ > > > > All we need is the OS to prepare for an empty node that will get populated > > with memory later. > > That is all this is doing too, the NUMA relationship of the actual > memory is described already by the PCI device since it is a BAR on the > device. > > The only purpose is to get the empty nodes into Linux :( > > > So if that's what a "placeholder" node definition in srat could achieve as > > well, even without all of the other acpi-generic-initiator stuff, that would > > be great. > > Seems like there are two quite similar use cases.. virtio-mem is going > to be calling the same family of kernel API I suspect :) It seems sad that we, as an industry, went through all of this trouble to define a dynamically enumerable CXL device model only to turn around and require static ACPI tables to tell us how to enumerate it. A similar problem exists on the memory target side and the approach taken there was to have Linux statically reserve at least enough numa node numbers for all the platform CXL memory ranges (defined in the ACPI.CEDT.CFMWS), but with the promise to come back and broach the dynamic node creation problem "if the need arises". This initiator-node enumeration case seems like that occasion where the need has arisen to get Linux out of the mode of needing to declare all possible numa nodes early in boot. Allow for nodes to be discoverable post NUMA-init. 
One strawman scheme that comes to mind is instead of "add nodes early" in boot, "delete unused nodes late" in boot after the device topology has been enumerated. Otherwise, requiring static ACPI tables to further enumerate an industry-standard dynamically enumerated bus seems to be going in the wrong direction.
Re: [PATCH v3] hw/cxl: Add QTG _DSM support for ACPI0017 device
Jonathan Cameron wrote: [..] > > > > > > > > what does "a WORD" mean is unclear - do you match what hardware does > > > > when you use aml_buffer? pls mention this in commit log, and > > > > show actual hardware dump for comparison. > > > The CXL spec says WORD without much qualification. It's a 16bit value > > > AFAICT. I'll add additional comment. Currently I do not have access to > > > actual hardware unfortunately. I'm constructing this purely based on spec > > > description. > > > > WORD does seem to be clearly defined in the ACPI spec as uint16 > and as this is describing a DSDT blob I think we can be safe that > it means that. Also lines up with the fixed sizes in CEDT. I am not sure it means that, the ACPI specification indicates that packages can hold "integers" and integers can be any size up to 64-bits. > > It's not clear buffer is actually word though. > > > > Jonathan do you have hardware access? > > No. +CC linux-cxl to see if anyone else has hardware + BIOS with > QTG implemented... There will be lots of implementations soon so I'd make > no guarantee they will all interpret this the same. > > Aim here is Linux kernel enablement support, and unfortunately that almost > always means we are ahead of easy availability of hardware. If it exists > it's probably prototypes in a lab, in which case no guarantees on the > BIOS tables presented... From a pre-production system the ASL is putting the result of SizeOf directly into the first element in the return package: Local1 = SizeOf (CXQI) Local0 [0x00] = Local1 ...where CXQI appears to be a fixed table of QTG ids for the platform, and SizeOf() returns an integer type with no restriction that it be a 16-bit value. So I think the specification is misleading by specifying WORD when ACPI "Package" objects convey "Integers" where the size of the integer can be a u8 to a u64. > > Also, possible to get clarification from the spec committee? > > I'm unclear what we are clarifying. 
As I read it the current implementation > is indeed wrong and I failed to notice this earlier :( > > Ultimately the data encoding (ACPI 6.5 section 20.2.3 Data Objects Encoding) > should I think be > > 0x0B 0x00 0x00 > WordPrefix then data : note if you try a 0x0001 and feed > it to iasl it will squash it into a byte instead and indeed if you > force the binary to the above it will decode it as 0x but recompile > that and you will be back to just > 0x00 (as bytes don't need a prefix..) > > Currently it would be: > 0x11 0x05 0x0a 0x02 0x00 0x01 > BufferOp > > Btw I built a minimal DSDT file to test this and iasl isn't happy with > the fact the _DSM doesn't return anything at all if ARG2 isn't 1 or 2. > Whilst that's impdef territory as not covered by the CXL spec, we should > return 'something' ;) > > Anyhow, to do this as per the CXL spec we need an aml_word() > that just implements the word case from aml_int() If I understand correctly, aml_int() is sufficient since this is not a Field() where access size matters. > Chances are that it never matters if we get an encoding that is > only single byte (because the value is small) but who knows what > other AML parsers might do. I expect the reason WORD is used in the specification is because of the size of the QTG ID field in the CFMWS. ACPI could support returning more than USHRT_MAX in an Integer field in a Package, but those IDs above USHRT_MAX could not be represented in CFMWS. [..] > > but again it is not clear at all what the spec means. > > an integer up to 0xf? a buffer as you did? just two bytes? > > > > could be any of these. > > The best we have in the way of description is the multiple QTG example > where it's > Package() {2, 1} combined with it being made up of WORDs > > whereas in general that will get squashed to a pair of bytes... > So I'm thinking WORDs is the max size rather than the only size but > given the ambiguity we should encode them as words anyway.
My assertion is that for the Package format the size of the integer does not matter because the Package.Integer type can convey up to 64-bit values. To be compatible with the *usage* of that max id, values that do not fit into 16 bits are out of spec, but nothing prevents the Package from using any size integer, afaics.
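The byte-vs-word "squashing" described in this thread follows from the AML data-object encoding rules (ACPI 6.5 section 20.2.3): an integer constant's prefix is chosen by its value, not by any declared width, so a small QTG ID naturally shrinks to a BytePrefix form or even a bare ZeroOp/OneOp opcode. A minimal sketch of those rules, as a hypothetical helper (this is not QEMU's aml_int()):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Sketch of AML integer-constant encoding per ACPI 6.5 section 20.2.3
 * (Data Objects Encoding). Picks the smallest legal encoding, which is
 * why iasl "squashes" small WORD values down to bytes. Returns the
 * number of bytes written to out (at most 9). Illustrative only.
 */
static size_t aml_encode_int(uint64_t v, uint8_t *out)
{
    if (v == 0x00) { out[0] = 0x00; return 1; }   /* ZeroOp */
    if (v == 0x01) { out[0] = 0x01; return 1; }   /* OneOp */
    if (v <= 0xff) {                              /* BytePrefix */
        out[0] = 0x0a;
        out[1] = (uint8_t)v;
        return 2;
    }
    if (v <= 0xffff) {                            /* WordPrefix, little-endian */
        out[0] = 0x0b;
        out[1] = (uint8_t)(v & 0xff);
        out[2] = (uint8_t)(v >> 8);
        return 3;
    }
    if (v <= 0xffffffffull) {                     /* DWordPrefix */
        out[0] = 0x0c;
        for (int i = 0; i < 4; i++)
            out[1 + i] = (uint8_t)(v >> (8 * i));
        return 5;
    }
    out[0] = 0x0e;                                /* QWordPrefix */
    for (int i = 0; i < 8; i++)
        out[1 + i] = (uint8_t)(v >> (8 * i));
    return 9;
}
```

Under these rules a QTG ID of 1 round-trips through iasl as a single OneOp byte, while 0x0100 keeps its three-byte WordPrefix form — consistent with Jonathan's observation that WORD is effectively a maximum size, not the only size.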
Re: [PATCH] hw/acpi/cxl: Drop device-memory support from CFMWS entries
Jonathan Cameron wrote: > On Mon, 20 Mar 2023 23:08:31 -0700 > Dan Williams wrote: > > > While it was a reasonable idea to specify no window restrictions at the > > outset of the CXL emulation support, it turns out that in practice a > > platform will never follow the QEMU example of specifying simultaneous > > support for HDM-H and HDM-D[B] in a single window. > > > > HDM-D mandates extra bus cycles for host/device bias protocol, and HDM-DB > > mandates extra bus cycles for back-invalidate protocol, so hardware must > > be explicitly prepared for device-memory unlike host-only memory > > (HDM-H). > > > > In preparation for the kernel dropping support for windows that do not > > select between device and host-only memory, move QEMU exclusively to > > declaring host-only windows. > > > > Signed-off-by: Dan Williams > Hi Dan, > > Can we have some spec references? I think the Protocol tables in > appendix C would work for that - but more specific examples called > out from them would be good. I presume our messages crossed in the ether: https://lore.kernel.org/linux-cxl/641a018ed7fb8_269929...@dwillia2-xfh.jf.intel.com.notmuch/ ...but yes, if I was still committed to this change, which I am not at this point, some references from Appendix C and "3.3.11 Forward Progress and Ordering Rules" would be appropriate. > I'm also not totally convinced it isn't a host implementation detail > - key here is that host bridges are still part of the host so can > do impdef stuff as long as they look correct to CXL side and to > host side. > > Time for some wild implementation conjecturing. > > Imagine a host that has host bridges of above average smartness. > Those host bridges have HDM decoders (this doesn't work if not) > > Host is using a single HPA window for HDM-D[B] and HDM-H. > The host bridge knows the target is HDM-H - it can get that from > the HDM decoder Target Type bit etc.
The HB can send (to the > rest of the Host) whatever replies are necessary / fill in extra > info to make it look like HDM-D[B] to the host interconnect protocol. > > (after some fun with a whiteboard we think you can make this efficient > by potentially making the Host bridge HDM decoder setup visible to > other parts of the host - relatively easy given the lots of time allowed > for a decoder commit). > > Why would you do this? Limited HPA space availability on the host > and wanting to be very flexible about use of the CXL windows. Limited HPA space and fewer decode rules at the root level indeed sounds compelling. > Obviously this is all moot if there is a constraint we can point to > in a specification. At this time I don't have such a reference. > BTW. I'm carrying a patch (it's on the gitlab tree) that I haven't > posted yet that lets you configure this restriction at runtime as > a similar potential host implementation restriction occurs for > PMEM vs Volatile. That is also needed to exercise the fun corners of > QTG etc. Sounds good.
RE: [PATCH] hw/acpi/cxl: Drop device-memory support from CFMWS entries
Dan Williams wrote: > While it was a reasonable idea to specify no window restrictions at the > outset of the CXL emulation support, it turns out that in practice a > platform will never follow the QEMU example of specifying simultaneous > support for HDM-H and HDM-D[B] in a single window. > > HDM-D mandates extra bus cycles for host/device bias protocol, and HDM-DB > mandates extra bus cycles for back-invalidate protocol, so hardware must > be explicitly prepared for device-memory unlike host-only memory > (HDM-H). > > In preparation for the kernel dropping support for windows that do not > select between device and host-only memory, move QEMU exclusively to > declaring host-only windows. An offline discussion determined that a sufficiently sophisticated platform might be able to support mixed HDM-H and HDM-D[B] in the same window, so the kernel is not going to drop this support.
[PATCH] hw/acpi/cxl: Drop device-memory support from CFMWS entries
While it was a reasonable idea to specify no window restrictions at the outset of the CXL emulation support, it turns out that in practice a platform will never follow the QEMU example of specifying simultaneous support for HDM-H and HDM-D[B] in a single window.

HDM-D mandates extra bus cycles for host/device bias protocol, and HDM-DB mandates extra bus cycles for back-invalidate protocol, so hardware must be explicitly prepared for device-memory unlike host-only memory (HDM-H).

In preparation for the kernel dropping support for windows that do not select between device and host-only memory, move QEMU exclusively to declaring host-only windows.

Signed-off-by: Dan Williams
---
 hw/acpi/cxl.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/hw/acpi/cxl.c b/hw/acpi/cxl.c
index 2bf8c0799359..defb289e2fef 100644
--- a/hw/acpi/cxl.c
+++ b/hw/acpi/cxl.c
@@ -103,8 +103,8 @@ static void cedt_build_cfmws(GArray *table_data, CXLState *cxls)
         /* Host Bridge Interleave Granularity */
         build_append_int_noprefix(table_data, fw->enc_int_gran, 4);

-        /* Window Restrictions */
-        build_append_int_noprefix(table_data, 0x0f, 2); /* No restrictions */
+        /* Window Restrictions: Host-only ram and pmem */
+        build_append_int_noprefix(table_data, 0x0e, 2);

         /* QTG ID */
         build_append_int_noprefix(table_data, 0, 2);
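For reference, the 0x0f-to-0x0e change above clears exactly one bit of the CFMWS Window Restrictions field. The bit layout below is my reading of the CEDT CFMWS definition (an assumption worth double-checking against the spec revision you target):

```c
#include <assert.h>

/*
 * CEDT CFMWS "Window Restrictions" bits, assumed layout per the CXL
 * CEDT definition (verify against the spec). The patch above flips
 * 0x0f to 0x0e, i.e. it clears only the Type 2 (device memory) bit.
 */
#define CFMWS_RESTRICT_TYPE2     (1u << 0)  /* CXL Type 2 (device) memory */
#define CFMWS_RESTRICT_TYPE3     (1u << 1)  /* CXL Type 3 (host-only) memory */
#define CFMWS_RESTRICT_VOLATILE  (1u << 2)  /* Volatile memory */
#define CFMWS_RESTRICT_PMEM      (1u << 3)  /* Persistent memory */
#define CFMWS_RESTRICT_FIXED     (1u << 4)  /* Fixed device configuration */

/* Does a window advertise support for device (Type 2) memory? */
static inline int cfmws_allows_device_mem(unsigned int restrictions)
{
    return !!(restrictions & CFMWS_RESTRICT_TYPE2);
}
```

So 0x0f ("no restrictions") is Type2 | Type3 | Volatile | Persistent, and 0x0e is the same window minus device-memory support.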
Re: cxl nvdimm Potential probe ordering issues.
Gregory Price wrote: > On Fri, Jan 20, 2023 at 09:38:13AM -0800, Dan Williams wrote: > > As it stands currently that dax device and the cxl device are not > > related since a default dax-device is loaded just based on the presence > > of an EFI_MEMORY_SP address range in the address map. With the new ram > > enabling that default device will be elided and CXL will register a > > dax-device parented by a cxl region. > > > > >- The memory *does not* auto-online, instead the dax device can be > > > onlined as system-ram *manually* via ndctl and friends > > > > That *manually* part is the problem that needs distro help to solve. It > > should be the case that by default all Linux distributions auto-online > > all dax-devices. If that happens to online memory that is too slow for > > general use, or too high-performance / precious for general purpose use > > then the administrator can set policy after the fact. Unfortunately user > > policy cannot be applied if these memory ranges were onlined by the > > kernel at boot, so that's why the kernel policy defaults to not-online. > > > > In other words, there is no guarantee that memory that was assigned to > > the general purpose pool at boot can be removed. The only guaranteed > > behavior is to never give the memory to the core kernel in the first > > instance and always let user policy route the memory. > > > > > 3) The code creates an nvdimm_bridge IFF a CFMW is defined - regardless > > >of the type-3 device configuration (pmem-only or vmem-only) > > > > Correct, the top-level bus code (cxl_acpi) and the endpoint code > > (cxl_mem, cxl_port) need to handshake before establishing regions. For > > pmem regions the platform needs to claim the availability of a pmem > > capable CXL window. > > > > > 4) As you can see above, multiple decoders are registered. I'm not sure > > >if that's correct or not, but it does seem odd given there's only one > > >cxl type-3 device.
Odd that decoder0.0 shows up when CFMW is there, > > >but not when it isn't. > > > > CXL windows are modeled as decoders hanging off the CXL root device > > (ACPI0017 on ACPI based platforms). An endpoint decoder can then map a > > selection of that window. > > > > > Don't know why I haven't thought of this until now, but is the CFMW code > > > reporting something odd about what's behind it? Is it assuming the > > > devices are pmem? > > > > No, the cxl_acpi code is just advertising platform decode possibilities > > independent of what devices show up. Think of this like the PCI MMIO > > space that gets allocated to a root bridge at the beginning of time. > > That space may or may not get consumed based on what devices show up > > downstream. > > Thank you for the explanation Dan, and thank you for your patience > @JCameron. I'm fairly sure I grok it now. > > Summarizing to make sure: the cxl driver is providing what would be the > CXL.io (control) path, and the CXL.mem path is basically being simulated > by what otherwise would be a traditional PCI memory region. This explains > why turning off Legacy mode drops the dax devices, and why the topology > looks strange - the devices are basically attached in 2 different ways. > > Might there be interest from the QEMU community to implement this > legacy-style setup in the short term, in an effort to test the control path > of type-3 devices while we wait for the kernel to catch up? > > Or should we forget this mode ever existed and just barrel forward > with HDM decoders and writing the kernel code to hook up the underlying > devices in drivers/cxl? Which mode are you referring to? The next steps for the kernel enabling relevant to this thread are:

* ram region discovery (platform firmware or kexec established)
* ram region creation
* pmem region discovery (from labels)
Re: cxl nvdimm Potential probe ordering issues.
Gregory Price wrote: > On Thu, Jan 19, 2023 at 03:04:49PM +, Jonathan Cameron wrote: > > Gregory, would you mind checking if > > cxl_nvb is NULL here... > > https://elixir.bootlin.com/linux/v6.2-rc4/source/drivers/cxl/pmem.c#L67 > > (printk before it is used should work). > > > > Might also be worth checking cxl_nvd and cxl_ds > > but my guess is cxl_nvb is our problem (it is when I deliberately change > > the load order). > > > > Jonathan > > > > This is exactly the issue. cxl_nvb is null, the rest appear fine. > > Also, note, that weirdly the non-volatile bridge shows up when launching > this in volatile mode, but no stack trace appears. > > ¯\_(ツ)_/¯ > > After spending way too much time tracing through the current cxl driver > code, I have only really determined that > > 1) The code is very pmem oriented, and it's unclear to me how the driver >as-is differentiates a persistent device from a volatile device. That >code path still completely escapes me. The only differentiating code >I see is in the memdev probe path that creates mem#/pmem and mem#/ram Yes, pmem was the initial focus because it had the most dependency on the OS to set up vs BIOS, but the ram enabling is at the top of the queue now. > > 2) The code successfully manages to probe, enable, and mount a REAL device >- cxl memdev appears (/sys/bus/cxl/devices/mem0) >- a dax device appears (/sys/bus/dax/devices/) > This happens at boot, which I assume must be BIOS related As it stands currently that dax device and the cxl device are not related since a default dax-device is loaded just based on the presence of an EFI_MEMORY_SP address range in the address map. With the new ram enabling that default device will be elided and CXL will register a dax-device parented by a cxl region. >- The memory *does not* auto-online, instead the dax device can be > onlined as system-ram *manually* via ndctl and friends That *manually* part is the problem that needs distro help to solve.
It should be the case that by default all Linux distributions auto-online all dax-devices. If that happens to online memory that is too slow for general use, or too high-performance / precious for general purpose use, then the administrator can set policy after the fact. Unfortunately user policy cannot be applied if these memory ranges were onlined by the kernel at boot, so that's why the kernel policy defaults to not-online. In other words, there is no guarantee that memory that was assigned to the general purpose pool at boot can be removed. The only guaranteed behavior is to never give the memory to the core kernel in the first instance and always let user policy route the memory. > 3) The code creates an nvdimm_bridge IFF a CFMW is defined - regardless >of the type-3 device configuration (pmem-only or vmem-only) Correct, the top-level bus code (cxl_acpi) and the endpoint code (cxl_mem, cxl_port) need to handshake before establishing regions. For pmem regions the platform needs to claim the availability of a pmem-capable CXL window. > ># CFMW defined >[root@fedora ~]# ls /sys/bus/cxl/devices/ >decoder0.0 decoder2.0 mem0 port1 >decoder1.0 endpoint2 nvdimm-bridge0 root0 > ># CFMW not defined >[root@fedora ~]# ls /sys/bus/cxl/devices/ >decoder1.0 decoder2.0 endpoint2 mem0 port1 root0 > > 4) As you can see above, multiple decoders are registered. I'm not sure >if that's correct or not, but it does seem odd given there's only one >cxl type-3 device. Odd that decoder0.0 shows up when CFMW is there, >but not when it isn't. CXL windows are modeled as decoders hanging off the CXL root device (ACPI0017 on ACPI based platforms). An endpoint decoder can then map a selection of that window.
> >Note: All these tests have two root ports: >-device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=52 \ >-device cxl-rp,id=rp0,bus=cxl.0,chassis=0,port=0,slot=0 \ >-device cxl-rp,id=rp1,bus=cxl.0,chassis=0,port=1,slot=1 \ > > > Don't know why I haven't thought of this until now, but is the CFMW code > reporting something odd about what's behind it? Is it assuming the > devices are pmem? No, the cxl_acpi code is just advertising platform decode possibilities independent of what devices show up. Think of this like the PCI MMIO space that gets allocated to a root bridge at the beginning of time. That space may or may not get consumed based on what devices show up downstream.
Re: cxl nvdimm Potential probe ordering issues.
Jonathan Cameron wrote: > On Thu, 19 Jan 2023 15:04:49 + > Jonathan Cameron wrote: > > > On Thu, 19 Jan 2023 12:42:44 + > > Jonathan Cameron via wrote: > > > > > On Wed, 18 Jan 2023 14:31:53 -0500 > > > Gregory Price wrote: > > > > > > > I apparently forgot an intro lol > > > > > > > > I tested the DOE linux branch with the 2023-1-11 QEMU branch with both > > > > volatile, non-volatile, and "legacy" (pre-my-patch) non-volatile mode. > > > > > > > > 1) *In volatile mode, there are no stack traces present (during boot*) > > > > > > > > On Wed, Jan 18, 2023 at 02:22:08PM -0500, Gregory Price wrote: > > > > > > > > > > 1) No stack traces present > > > > > 2) Device usage appears to work, but cxl-cli fails to create a > > > > > region, i > > > > > haven't checked why yet (also tried ndctl-75, same results) > > > > > 3) There seems to be some other regression with the cxl_pmem_init > > > > > routine, because I get a stack trace in this setup regardless of > > > > > whether > > > > > I apply the type-3 device commit. > > > > > > > > > > > > > > > All tests below with the previously posted DOE linux branch. 
> > > > > Base QEMU branch was Jonathan's 2023-1-11 > > > > > > > > > > > > > > > DOE Branch - 2023-1-11 (HEAD) (all commits) > > > > > > > > > > QEMU Config: > > > > > sudo /opt/qemu-cxl/bin/qemu-system-x86_64 \ > > > > > -drive > > > > > file=/var/lib/libvirt/images/cxl.qcow2,format=qcow2,index=0,media=disk,id=hd > > > > > \ > > > > > -m 3G,slots=4,maxmem=8G \ > > > > > -smp 4 \ > > > > > -machine type=q35,accel=kvm,cxl=on \ > > > > > -enable-kvm \ > > > > > -nographic \ > > > > > -object memory-backend-ram,id=mem0,size=1G,share=on \ > > > > > -device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=52 \ > > > > > -device cxl-rp,id=rp0,bus=cxl.0,chassis=0,slot=0 \ > > > > > -device cxl-type3,bus=rp0,volatile-memdev=mem0,id=cxl-mem0 \ > > > > > -M cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=1G > > > > > > > > > > Result: This worked, but cxl-cli could not create a region (will look > > > > > into this further later). > > > > > > > > > > > > > > > > > > > > > > > > > When running with a persistent memory configuration, I'm seeing a > > > > > kernel stack trace on cxl_pmem_init > > > > > > > > > > Config: > > > > > sudo /opt/qemu-cxl/bin/qemu-system-x86_64 \ > > > > > -drive > > > > > file=/var/lib/libvirt/images/cxl.qcow2,format=qcow2,index=0,media=disk,id=hd > > > > > \ > > > > > -m 3G,slots=4,maxmem=4G \ > > > > > -smp 4 \ > > > > > -machine type=q35,accel=kvm,cxl=on \ > > > > > -enable-kvm \ > > > > > -nographic \ > > > > > -device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=52 \ > > > > > -device cxl-rp,port=0,id=rp0,bus=cxl.0,chassis=0,slot=0 \ > > > > > -object memory-backend-file,id=cxl-mem0,mem-path=/tmp/mem0,size=1G \ > > > > > -object memory-backend-file,id=cxl-lsa0,mem-path=/tmp/lsa0,size=1G \ > > > > > -device > > > > > cxl-type3,bus=rp0,persistent-memdev=cxl-mem0,lsa=cxl-lsa0,id=cxl-pmem0 > > > > > \ > > > > > -M cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=1G > > > > > > > > > > > > > > > [ 62.167518] BUG: kernel NULL pointer dereference, address: > > > > > 04c0 > > > > > [ 
62.185069] #PF: supervisor read access in kernel mode > > > > > [ 62.198502] #PF: error_code(0x) - not-present page > > > > > [ 62.211019] PGD 0 P4D 0 > > > > > [ 62.220521] Oops: [#1] PREEMPT SMP PTI > > > > > [ 62.233457] CPU: 3 PID: 558 Comm: systemd-udevd Not tainted > > > > > 6.2.0-rc1+ #1 > > > > > [ 62.252886] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), > > > > > BIOS rel-1.16.1-0-g3208b098f51a-prebuilt.qemu.org 04/01/2014 > > > > > [ 62.258432] Adding 2939900k swap on /dev/zram0. Priority:100 > > > > > extents:1 across:2939900k SSDscFS > > > > > [ 62.285513] RIP: 0010:cxl_nvdimm_probe+0x8d/0x130 [cxl_pmem] > > > > > [ 62.285529] Code: 85 c0 0f 85 90 00 00 00 f0 80 0c 24 40 f0 80 4c > > > > > 24 08 10 f0 80 4c 24 08 20 f0 80 4c 24 08 40 49 8d 84 24 b8 04 00 00 > > > > > 4c 89 0 > > > > > [ 62.285531] RSP: 0018:acff0141fc38 EFLAGS: 00010202 > > > > > [ 62.285534] RAX: 97a8a37b84b8 RBX: 97a8a37b8000 RCX: > > > > > > > > > > [ 62.285536] RDX: 0001 RSI: 97a8a37b8000 RDI: > > > > > > > > > > [ 62.285537] RBP: 97a8a37b8000 R08: 0001 R09: > > > > > 0001 > > > > > [ 62.285538] R10: 0001 R11: R12: > > > > > 97a8a37b8000 > > > > > [ 62.285539] R13: 97a982c3dc28 R14: R15: > > > > > > > > > > [ 62.285541] FS: 7f2619829580() GS:97a9bca0() > > > > > knlGS: > > > > > [ 62.285542] CS: 0010 DS: ES: CR0: 80050033 > > > > > [ 62.285544] CR2: 04c0 CR3: 0001056a8000 CR4: > > > > > 06e0 > > > > > [ 62.285653] Call Trace: > > > > > [ 62.285656] > > > > > [ 62.285660] cxl_bus_probe+0x17/0x50 > > > >
Re: [BUG] cxl can not create region
Jonathan Cameron wrote: > On Fri, 12 Aug 2022 16:44:03 +0100 > Jonathan Cameron wrote: > > > On Thu, 11 Aug 2022 18:08:57 +0100 > > Jonathan Cameron via wrote: > > > > > On Tue, 9 Aug 2022 17:08:25 +0100 > > > Jonathan Cameron wrote: > > > > > > > On Tue, 9 Aug 2022 21:07:06 +0800 > > > > Bobo WL wrote: > > > > > > > > > Hi Jonathan > > > > > > > > > > Thanks for your reply! > > > > > > > > > > On Mon, Aug 8, 2022 at 8:37 PM Jonathan Cameron > > > > > wrote: > > > > > > > > > > > > Probably not related to your problem, but there is a disconnect in > > > > > > QEMU / > > > > > > kernel assumptions around the presence of an HDM decoder when a HB > > > > > > only > > > > > > has a single root port. Spec allows it to be provided or not as an > > > > > > implementation choice. > > > > > > Kernel assumes it isn't provided. Qemu assumes it is. > > > > > > > > > > > > The temporary solution is to throw in a second root port on the HB > > > > > > and not > > > > > > connect anything to it. Longer term I may special case this so > > > > > > that the particular > > > > > > decoder defaults to pass through settings in QEMU if there is only > > > > > > one root port. > > > > > > > > > > > > > > > > You are right! After adding an extra HB in qemu, I can create an x1 > > > > > region successfully. > > > > > But have some errors in Nvdimm: > > > > > > > > > > [ 74.925838] Unknown online node for memory at 0x100, > > > > > assuming node 0 > > > > > [ 74.925846] Unknown target node for memory at 0x100, > > > > > assuming node 0 > > > > > [ 74.927470] nd_region region0: nmem0: is disabled, failing probe > > > > > > > > > > > > > Ah. I've seen this one, but not chased it down yet. Was on my todo > > > > list to chase > > > > down. Once I reach this state I can verify the HDM Decode is correct > > > > which is what > > > > I've been using to test (Which wasn't true until earlier this week).
> > > > I'm currently testing via devmem, more for historical reasons than > > > > because it makes > > > > that much sense anymore. > > > > > > *embarrassed cough*. We haven't fully hooked the LSA up in qemu yet. > > > I'd forgotten that was still on the todo list. I don't think it will > > > be particularly hard to do and will take a look in next few days. > > > > > > Very very indirectly this error is causing a driver probe fail that means > > > that > > > we hit a code path that has a rather odd looking check on NDD_LABELING. > > > Should not have gotten near that path though - hence the problem is > > > actually > > > when we call cxl_pmem_get_config_data() and it returns an error because > > > we haven't fully connected up the command in QEMU. > > > > So at least one bug in QEMU. We were not supporting variable length payloads > > on mailbox > > inputs (but were on outputs). That hasn't mattered until we get to LSA > > writes. > > We just need to relax the condition on the supplied length.
> >
> > diff --git a/hw/cxl/cxl-mailbox-utils.c b/hw/cxl/cxl-mailbox-utils.c
> > index c352a935c4..fdda9529fe 100644
> > --- a/hw/cxl/cxl-mailbox-utils.c
> > +++ b/hw/cxl/cxl-mailbox-utils.c
> > @@ -510,7 +510,7 @@ void cxl_process_mailbox(CXLDeviceState *cxl_dstate)
> >      cxl_cmd = &cxl_cmd_set[set][cmd];
> >      h = cxl_cmd->handler;
> >      if (h) {
> > -        if (len == cxl_cmd->in) {
> > +        if (len == cxl_cmd->in || !cxl_cmd->in) {
> Fix is wrong as we use ~0 as the placeholder for variable payload, not 0. > > With that fixed we hit new fun paths - after some errors we get the > worrying - not totally sure but looks like a failure on an error cleanup. > I'll chase down the error source, but even then this is probably triggerable > by > a hardware problem or similar. Some bonus prints in here from me chasing > error paths, but it's otherwise just cxl/next + the fix I posted earlier > today.
One of the scenarios that I cannot rule out is nvdimm_probe() racing nd_region_probe(), but given all the work it takes to create a region I suspect all the nvdimm_probe() work to have completed... It is at least one potentially wrong hypothesis that needs to be chased down. > > [ 69.919877] nd_bus ndbus0: START: nd_region.probe(region0) > [ 69.920108] nd_region_probe > [ 69.920623] [ cut here ] > [ 69.920675] refcount_t: addition on 0; use-after-free. > [ 69.921314] WARNING: CPU: 3 PID: 710 at lib/refcount.c:25 > refcount_warn_saturate+0xa0/0x144 > [ 69.926949] Modules linked in: cxl_pmem cxl_mem cxl_pci cxl_port cxl_acpi > cxl_core > [ 69.928830] CPU: 3 PID: 710 Comm: kworker/u8:9 Not tainted 5.19.0-rc3+ #399 > [ 69.930596] Hardware name: QEMU QEMU Virtual Machine, BIOS 0.0.0 02/06/2015 > [ 69.931482] Workqueue: events_unbound async_run_entry_fn > [ 69.932403] pstate: 6045 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) > [ 69.934023] pc : refcount_warn_saturate+0xa0/0x144 > [ 69.935161] lr : refcount_
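Jonathan's correction about the sentinel can be captured in a tiny check: QEMU marks variable-payload commands with an expected input length of ~0, not 0, so the dispatch must accept any length for those commands. A condensed sketch (names here are hypothetical, not QEMU's):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Sketch of the corrected mailbox input-length check discussed in this
 * thread: an expected length of ~0 means "variable payload, accept any
 * length"; testing for 0 instead (as in the first attempt) is wrong.
 */
#define CXL_VARIABLE_PAYLOAD (~0u)

static bool mbox_len_ok(unsigned int len, unsigned int expected_in)
{
    return expected_in == CXL_VARIABLE_PAYLOAD || len == expected_in;
}
```

Note that with the first (wrong) version of the fix, a fixed-length command declaring `in == 0` would have accepted any payload length, which is why the sentinel has to be ~0 rather than 0.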
Re: [BUG] cxl can not create region
Jonathan Cameron wrote: > On Thu, 11 Aug 2022 18:08:57 +0100 > Jonathan Cameron via wrote: > > > On Tue, 9 Aug 2022 17:08:25 +0100 > > Jonathan Cameron wrote: > > > > > On Tue, 9 Aug 2022 21:07:06 +0800 > > > Bobo WL wrote: > > > > > > > Hi Jonathan > > > > > > > > Thanks for your reply! > > > > > > > > On Mon, Aug 8, 2022 at 8:37 PM Jonathan Cameron > > > > wrote: > > > > > > > > > > Probably not related to your problem, but there is a disconnect in > > > > > QEMU / > > > > > kernel assumptions around the presence of an HDM decoder when a HB only > > > > > has a single root port. Spec allows it to be provided or not as an > > > > > implementation choice. > > > > > Kernel assumes it isn't provided. Qemu assumes it is. > > > > > > > > > > The temporary solution is to throw in a second root port on the HB > > > > > and not > > > > > connect anything to it. Longer term I may special case this so that > > > > > the particular > > > > > decoder defaults to pass through settings in QEMU if there is only > > > > > one root port. > > > > > > > > > > > > > You are right! After adding an extra HB in qemu, I can create an x1 > > > > region successfully. > > > > But have some errors in Nvdimm: > > > > > > > > [ 74.925838] Unknown online node for memory at 0x100, > > > > assuming node 0 > > > > [ 74.925846] Unknown target node for memory at 0x100, > > > > assuming node 0 > > > > [ 74.927470] nd_region region0: nmem0: is disabled, failing probe > > > > > > Ah. I've seen this one, but not chased it down yet. Was on my todo list > > > to chase > > > down. Once I reach this state I can verify the HDM Decode is correct > > > which is what > > > I've been using to test (Which wasn't true until earlier this week). > > > I'm currently testing via devmem, more for historical reasons than > > > because it makes > > > that much sense anymore. > > > > *embarrassed cough*. We haven't fully hooked the LSA up in qemu yet. > > I'd forgotten that was still on the todo list.
I don't think it will > > be particularly hard to do and will take a look in next few days. > > > > Very very indirectly this error is causing a driver probe fail that means > > that > > we hit a code path that has a rather odd looking check on NDD_LABELING. > > Should not have gotten near that path though - hence the problem is actually > > when we call cxl_pmem_get_config_data() and it returns an error because > > we haven't fully connected up the command in QEMU. > > So at least one bug in QEMU. We were not supporting variable length payloads > on mailbox > inputs (but were on outputs). That hasn't mattered until we get to LSA > writes. > We just need to relax the condition on the supplied length.
>
> diff --git a/hw/cxl/cxl-mailbox-utils.c b/hw/cxl/cxl-mailbox-utils.c
> index c352a935c4..fdda9529fe 100644
> --- a/hw/cxl/cxl-mailbox-utils.c
> +++ b/hw/cxl/cxl-mailbox-utils.c
> @@ -510,7 +510,7 @@ void cxl_process_mailbox(CXLDeviceState *cxl_dstate)
>      cxl_cmd = &cxl_cmd_set[set][cmd];
>      h = cxl_cmd->handler;
>      if (h) {
> -        if (len == cxl_cmd->in) {
> +        if (len == cxl_cmd->in || !cxl_cmd->in) {
>              cxl_cmd->payload = cxl_dstate->mbox_reg_state +
>                  A_CXL_DEV_CMD_PAYLOAD;
>              ret = (*h)(cxl_cmd, cxl_dstate, &len);
>
> This lets the nvdimm/region probe fine, but I'm getting some issues with > namespace capacity so I'll look at what is causing that next. > Unfortunately I'm not that familiar with the driver/nvdimm side of things > so it takes a while to figure out what kicks off what! The whirlwind tour is that 'struct nd_region' instances that represent a persistent memory address range are composed of one or more mappings of 'struct nvdimm' objects. The nvdimm object is driven by the dimm driver in drivers/nvdimm/dimm.c. That driver is mainly charged with unlocking the dimm (if locked) and interrogating the label area to look for namespace labels. The label command calls are routed to the '->ndctl()' callback that was registered when the CXL nvdimm_bus_descriptor was created.
That callback handles both 'bus' scope calls, currently none for CXL, and per nvdimm calls. cxl_pmem_nvdimm_ctl() translates those generic LIBNVDIMM commands to CXL commands. The 'struct nvdimm' objects that the CXL side registers have the NDD_LABELING flag set which means that namespaces need to be explicitly created / provisioned from region capacity. Otherwise, if drivers/nvdimm/dimm.c does not find a namespace-label-index block then the region reverts to label-less mode and a default namespace equal to the size of the region is instantiated. If you are seeing small mismatches in namespace capacity then it may just be the fact that by default 'ndctl create-namespace' results in an 'fsdax' mode namespace which just means that it is a block device where 1.5% of the capacity is reserved for 'struct page' metadata. You should be able to see namespace capacity
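The ~1.5% fsdax overhead mentioned above is just struct-page arithmetic: each 4096-byte page of namespace capacity needs a 64-byte struct page, so the metadata fraction is 64/4096 = 1.5625%. A back-of-envelope helper (illustrative only; the kernel also rounds the reservation for alignment, so real numbers differ slightly):

```c
#include <stdint.h>

/*
 * Rough 'struct page' metadata cost for an fsdax namespace: 64 bytes
 * of metadata per 4096-byte page, i.e. capacity/64. Illustrative
 * sketch, not the kernel's exact pfn-metadata reservation math.
 */
static uint64_t fsdax_metadata_bytes(uint64_t capacity_bytes)
{
    return capacity_bytes / 4096 * 64;
}
```

For a 1 GiB region this predicts a 16 MiB reservation, which is the kind of "small mismatch" in reported namespace capacity described above.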
Re: [BUG] cxl can not create region
Dan Williams wrote: > Bobo WL wrote: > > Hi Dan, > > > > Thanks for your reply! > > > > On Mon, Aug 8, 2022 at 11:58 PM Dan Williams > > wrote: > > > > > > What is the output of: > > > > > > cxl list -MDTu -d decoder0.0 > > > > > > ...? It might be the case that mem1 cannot be mapped by decoder0.0, or > > > at least not in the specified order, or that validation check is broken. > > > > Command "cxl list -MDTu -d decoder0.0" output: > > Thanks for this, I think I know the problem, but will try some > experiments with cxl_test first. Hmm, so my cxl_test experiment unfortunately passed so I'm not reproducing the failure mode. This is the result of creating an x4 region with devices directly attached to a single host-bridge:

# cxl create-region -d decoder3.5 -w 4 -m -g 256 mem{12,10,9,11} -s $((1<<30))
{
  "region":"region8",
  "resource":"0xf1f000",
  "size":"1024.00 MiB (1073.74 MB)",
  "interleave_ways":4,
  "interleave_granularity":256,
  "decode_state":"commit",
  "mappings":[
    {
      "position":3,
      "memdev":"mem11",
      "decoder":"decoder21.0"
    },
    {
      "position":2,
      "memdev":"mem9",
      "decoder":"decoder19.0"
    },
    {
      "position":1,
      "memdev":"mem10",
      "decoder":"decoder20.0"
    },
    {
      "position":0,
      "memdev":"mem12",
      "decoder":"decoder22.0"
    }
  ]
}
cxl region: cmd_create_region: created 1 region

> Did the commit_store() crash stop reproducing with latest cxl/preview > branch? I missed the answer to this question. All of these changes are now in Linus' tree; perhaps give that a try and post the debug log again?
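The mapping list in the x4 output above reflects plain modulo interleave: consecutive granularity-sized chunks of the region rotate across the endpoint decoders by position. A simplified sketch of that position math (assuming power-of-2 ways and granularity; some platforms use the XOR-based interleave arithmetic instead):

```c
#include <stdint.h>

/*
 * Which interleave position (and therefore which memdev) serves a
 * given offset into the region, for simple modulo interleave.
 * Simplified sketch; assumes power-of-2 ways and granularity.
 */
static unsigned int interleave_position(uint64_t region_offset,
                                        unsigned int granularity,
                                        unsigned int ways)
{
    return (unsigned int)((region_offset / granularity) % ways);
}
```

With ways=4 and granularity=256, offsets 0, 256, 512, and 768 land on positions 0 through 3 (mem12, mem10, mem9, mem11 in the listing above), then the cycle repeats.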
Re: [BUG] cxl can not create region
Bobo WL wrote: > Hi Dan, > > Thanks for your reply! > > On Mon, Aug 8, 2022 at 11:58 PM Dan Williams wrote: > > > > What is the output of: > > > > cxl list -MDTu -d decoder0.0 > > > > ...? It might be the case that mem1 cannot be mapped by decoder0.0, or > > at least not in the specified order, or that validation check is broken. > > Command "cxl list -MDTu -d decoder0.0" output: Thanks for this, I think I know the problem, but will try some experiments with cxl_test first. Did the commit_store() crash stop reproducing with latest cxl/preview branch?
RE: [BUG] cxl can not create region
Bobo WL wrote: > Hi list > > I want to test cxl functions in arm64, and found some problems I can't > figure out. > > My test environment: > > 1. build latest bios from https://github.com/tianocore/edk2.git master > branch(cc2db6ebfb6d9d85ba4c7b35fba1fa37fffc0bc2) > 2. build latest qemu-system-aarch64 from git://git.qemu.org/qemu.git > master branch(846dcf0ba4eff824c295f06550b8673ff3f31314). With cxl arm > support patch: > https://patchwork.kernel.org/project/cxl/cover/20220616141950.23374-1-jonathan.came...@huawei.com/ > 3. build Linux kernel from > https://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl.git preview > branch(65fc1c3d26b96002a5aa1f4012fae4dc98fd5683) > 4. build latest ndctl tools from https://github.com/pmem/ndctl > create_region branch(8558b394e449779e3a4f3ae90fae77ede0bca159) > > And my qemu test commands: > sudo $QEMU_BIN -M virt,gic-version=3,cxl=on -m 4g,maxmem=8G,slots=8 \ > -cpu max -smp 8 -nographic -no-reboot \ > -kernel $KERNEL -bios $BIOS_BIN \ > -drive if=none,file=$ROOTFS,format=qcow2,id=hd \ > -device virtio-blk-pci,drive=hd -append 'root=/dev/vda1 > nokaslr dyndbg="module cxl* +p"' \ > -object memory-backend-ram,size=4G,id=mem0 \ > -numa node,nodeid=0,cpus=0-7,memdev=mem0 \ > -net nic -net user,hostfwd=tcp::-:22 -enable-kvm \ > -object > memory-backend-file,id=cxl-mem0,share=on,mem-path=/tmp/cxltest.raw,size=256M > \ > -object > memory-backend-file,id=cxl-mem1,share=on,mem-path=/tmp/cxltest1.raw,size=256M > \ > -object > memory-backend-file,id=cxl-mem2,share=on,mem-path=/tmp/cxltest2.raw,size=256M > \ > -object > memory-backend-file,id=cxl-mem3,share=on,mem-path=/tmp/cxltest3.raw,size=256M > \ > -object > memory-backend-file,id=cxl-lsa0,share=on,mem-path=/tmp/lsa0.raw,size=256M > \ > -object > memory-backend-file,id=cxl-lsa1,share=on,mem-path=/tmp/lsa1.raw,size=256M > \ > -object > memory-backend-file,id=cxl-lsa2,share=on,mem-path=/tmp/lsa2.raw,size=256M > \ > -object > 
memory-backend-file,id=cxl-lsa3,share=on,mem-path=/tmp/lsa3.raw,size=256M > \ > -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \ > -device cxl-rp,port=0,bus=cxl.1,id=root_port0,chassis=0,slot=0 \ > -device cxl-upstream,bus=root_port0,id=us0 \ > -device cxl-downstream,port=0,bus=us0,id=swport0,chassis=0,slot=4 \ > -device > cxl-type3,bus=swport0,memdev=cxl-mem0,lsa=cxl-lsa0,id=cxl-pmem0 \ > -device cxl-downstream,port=1,bus=us0,id=swport1,chassis=0,slot=5 \ > -device > cxl-type3,bus=swport1,memdev=cxl-mem1,lsa=cxl-lsa1,id=cxl-pmem1 \ > -device cxl-downstream,port=2,bus=us0,id=swport2,chassis=0,slot=6 \ > -device > cxl-type3,bus=swport2,memdev=cxl-mem2,lsa=cxl-lsa2,id=cxl-pmem2 \ > -device cxl-downstream,port=3,bus=us0,id=swport3,chassis=0,slot=7 \ > -device > cxl-type3,bus=swport3,memdev=cxl-mem3,lsa=cxl-lsa3,id=cxl-pmem3 \ > -M > cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=4G,cxl-fmw.0.interleave-granularity=4k > > And I have got two problems. > 1. When I want to create x1 region with command: "cxl create-region -d > decoder0.0 -w 1 -g 4096 mem0", kernel crashed with null pointer > reference. 
Crash log: > > [ 534.697324] cxl_region region0: config state: 0 > [ 534.697346] cxl_region region0: probe: -6 > [ 534.697368] cxl_acpi ACPI0017:00: decoder0.0: created region0 > [ 534.699115] cxl region0: mem0:endpoint3 decoder3.0 add: > mem0:decoder3.0 @ 0 next: none nr_eps: 1 nr_targets: 1 > [ 534.699149] cxl region0: :0d:00.0:port2 decoder2.0 add: > mem0:decoder3.0 @ 0 next: mem0 nr_eps: 1 nr_targets: 1 > [ 534.699167] cxl region0: ACPI0016:00:port1 decoder1.0 add: > mem0:decoder3.0 @ 0 next: :0d:00.0 nr_eps: 1 nr_targets: 1 > [ 534.699176] cxl region0: ACPI0016:00:port1 iw: 1 ig: 256 > [ 534.699182] cxl region0: ACPI0016:00:port1 target[0] = :0c:00.0 > for mem0:decoder3.0 @ 0 > [ 534.699189] cxl region0: :0d:00.0:port2 iw: 1 ig: 256 > [ 534.699193] cxl region0: :0d:00.0:port2 target[0] = > :0e:00.0 for mem0:decoder3.0 @ 0 > [ 534.699405] Unable to handle kernel NULL pointer dereference at > virtual address > [ 534.701474] Mem abort info: > [ 534.701994] ESR = 0x8604 > [ 534.702653] EC = 0x21: IABT (current EL), IL = 32 bits > [ 534.703616] SET = 0, FnV = 0 > [ 534.704174] EA = 0, S1PTW = 0 > [ 534.704803] FSC = 0x04: level 0 translation fault > [ 534.705694] user pgtable: 4k pages, 48-bit VAs, pgdp=00010144a000 > [ 534.706875] [] pgd=, p4d= > [ 534.709855] Internal error: Oops: 8604 [#1] PREEMPT SMP > [ 534.710301] Modules linked in: > [ 534.710546] CPU: 7 PID: 331 Comm: cxl Not tainted > 5.19.0-rc3-00064-g65fc1c3d26b9-dirty #11 > [ 534.715393] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015 > [ 534.717179] pstate: 6045 (nZCv daif +PAN -
Re: [PATCH v4 0/3] nvdimm: Enable sync-dax property for nvdimm
On Tue, May 4, 2021 at 2:03 AM Aneesh Kumar K.V wrote: > > On 5/4/21 11:13 AM, Pankaj Gupta wrote: > > > >> > >> What this patch series did was to express that property via a device > >> tree node and guest driver enables a hypercall based flush mechanism to > >> ensure persistence. > > > > Would VIRTIO (entirely asynchronous, no trap at host side) based > > mechanism is better > > than hyper-call based? Registering memory can be done any way. We > > implemented virtio-pmem > > flush mechanisms with below considerations: > > > > - Proper semantic for guest flush requests. > > - Efficient mechanism for performance pov. > > > > sure, virio-pmem can be used as an alternative. > > > I am just asking myself if we have platform agnostic mechanism already > > there, maybe > > we can extend it to suit our needs? Maybe I am missing some points here. > > > > What is being attempted in this series is to indicate to the guest OS > that the backing device/file used for emulated nvdimm device cannot > guarantee the persistence via cpu cache flush instructions. Right, the detail I am arguing is that it should be a device description, not a backend file property. In other words this proposal is advocating this: -object memory-backend-file,id=mem1,share,sync-dax=$sync-dax,mem-path=$path -device nvdimm,memdev=mem1 ...and I am advocating for reusing or duplicating the virtio-pmem model like this: -object memory-backend-file,id=mem1,share,mem-path=$path -device spapr-hcall,memdev=mem1 ...because the interface to the guest is the device, not the memory-backend-file.
Re: [PATCH v4 0/3] nvdimm: Enable sync-dax property for nvdimm
On Mon, May 3, 2021 at 7:06 AM Shivaprasad G Bhat wrote: > > > On 5/1/21 12:44 AM, Dan Williams wrote: > > Some corrections to terminology confusion below... > > > > > > On Wed, Apr 28, 2021 at 8:49 PM Shivaprasad G Bhat > > wrote: > >> The nvdimm devices are expected to ensure write persistence during power > >> failure kind of scenarios. > > No, QEMU is not expected to make that guarantee. QEMU is free to lie > > to the guest about the persistence guarantees of the guest PMEM > > ranges. It's more accurate to say that QEMU nvdimm devices can emulate > > persistent memory and optionally pass through host power-fail > > persistence guarantees to the guest. The power-fail persistence domain > > can be one of "cpu_cache", or "memory_controller" if the persistent > > memory region is "synchronous". If the persistent range is not > > synchronous, it really isn't "persistent memory"; it's memory mapped > > storage that needs I/O commands to flush. > > Since this is virtual nvdimm(v-nvdimm) backed by a file, and the data is > completely > > in the host pagecache, and we need a way to ensure that host pagecaches > > are flushed to the backend. This analogous to the WPQ flush being offloaded > > to the hypervisor. No, it isn't analogous. WPQ flush is an optional mechanism to force data to a higher durability domain. The flush in this interface is mandatory. It's a different class of device. The proposal that "sync-dax=unsafe" for non-PPC architectures, is a fundamental misrepresentation of how this is supposed to work. Rather than make "sync-dax" a first class citizen of the device-description interface I'm proposing that you make this a separate device-type. This also solves the problem that "sync-dax" with an implicit architecture backend assumption is not precise, but a new "non-nvdimm" device type would make it explicit what the host is advertising to the guest. 
> > > Ref: https://github.com/dgibson/qemu/blob/main/docs/nvdimm.txt > > > > > > >> The libpmem has architecture specific instructions like dcbf on POWER > > Which "libpmem" is this? PMDK is a reference library not a PMEM > > interface... maybe I'm missing what libpmem has to do with QEMU? > > > I was referrering to semantics of flushing pmem cache lines as in > > PMDK/libpmem. > > > > > >> to flush the cache data to backend nvdimm device during normal writes > >> followed by explicit flushes if the backend devices are not synchronous > >> DAX capable. > >> > >> Qemu - virtual nvdimm devices are memory mapped. The dcbf in the guest > >> and the subsequent flush doesn't traslate to actual flush to the backend > > s/traslate/translate/ > > > >> file on the host in case of file backed v-nvdimms. This is addressed by > >> virtio-pmem in case of x86_64 by making explicit flushes translating to > >> fsync at qemu. > > Note that virtio-pmem was a proposal for a specific optimization of > > allowing guests to share page cache. The virtio-pmem approach is not > > to be confused with actual persistent memory. > > > >> On SPAPR, the issue is addressed by adding a new hcall to > >> request for an explicit flush from the guest ndctl driver when the backend > > What is an "ndctl" driver? ndctl is userspace tooling, do you mean the > > guest pmem driver? > > > oops, wrong terminologies. I was referring to guest libnvdimm and > > papr_scm kernel modules. > > > > > >> nvdimm cannot ensure write persistence with dcbf alone. So, the approach > >> here is to convey when the hcall flush is required in a device tree > >> property. The guest makes the hcall when the property is found, instead > >> of relying on dcbf. > >> > >> A new device property sync-dax is added to the nvdimm device. When the > >> sync-dax is 'writeback'(default for PPC), device property > >> "hcall-flush-required" is set, and the guest makes hcall H_SCM_FLUSH > >> requesting for an explicit flush. 
> > I'm not sure "sync-dax" is a suitable name for the property of the > > guest persistent memory. > > > sync-dax property translates ND_REGION_ASYNC flag being set/unset Yes, I am aware, but that property alone is not sufficient to identify the flush mechanism. > > for the pmem region also if the nvdimm_flush callback is provided in the > > papr_scm or not. As everything boils down to synchronous nature > > of the device, I chose
Re: [PATCH v4 0/3] nvdimm: Enable sync-dax property for nvdimm
Some corrections to terminology confusion below... On Wed, Apr 28, 2021 at 8:49 PM Shivaprasad G Bhat wrote: > > The nvdimm devices are expected to ensure write persistence during power > failure kind of scenarios. No, QEMU is not expected to make that guarantee. QEMU is free to lie to the guest about the persistence guarantees of the guest PMEM ranges. It's more accurate to say that QEMU nvdimm devices can emulate persistent memory and optionally pass through host power-fail persistence guarantees to the guest. The power-fail persistence domain can be one of "cpu_cache", or "memory_controller" if the persistent memory region is "synchronous". If the persistent range is not synchronous, it really isn't "persistent memory"; it's memory mapped storage that needs I/O commands to flush. > The libpmem has architecture specific instructions like dcbf on POWER Which "libpmem" is this? PMDK is a reference library not a PMEM interface... maybe I'm missing what libpmem has to do with QEMU? > to flush the cache data to backend nvdimm device during normal writes > followed by explicit flushes if the backend devices are not synchronous > DAX capable. > > Qemu - virtual nvdimm devices are memory mapped. The dcbf in the guest > and the subsequent flush doesn't traslate to actual flush to the backend s/traslate/translate/ > file on the host in case of file backed v-nvdimms. This is addressed by > virtio-pmem in case of x86_64 by making explicit flushes translating to > fsync at qemu. Note that virtio-pmem was a proposal for a specific optimization of allowing guests to share page cache. The virtio-pmem approach is not to be confused with actual persistent memory. > On SPAPR, the issue is addressed by adding a new hcall to > request for an explicit flush from the guest ndctl driver when the backend What is an "ndctl" driver? ndctl is userspace tooling, do you mean the guest pmem driver? > nvdimm cannot ensure write persistence with dcbf alone. 
So, the approach > here is to convey when the hcall flush is required in a device tree > property. The guest makes the hcall when the property is found, instead > of relying on dcbf. > > A new device property sync-dax is added to the nvdimm device. When the > sync-dax is 'writeback'(default for PPC), device property > "hcall-flush-required" is set, and the guest makes hcall H_SCM_FLUSH > requesting for an explicit flush. I'm not sure "sync-dax" is a suitable name for the property of the guest persistent memory. There is no requirement that the memory-backend file for a guest be a dax-capable file. It's also implementation specific what hypercall needs to be invoked for a given occurrence of "sync-dax". What does that map to on non-PPC platforms for example? It seems to me that an "nvdimm" device presents the synchronous usage model and a whole other device type implements an async-hypercall setup that the guest happens to service with its nvdimm stack, but it's not an "nvdimm" anymore at that point. > sync-dax is "unsafe" on all other platforms(x86, ARM) and old pseries machines > prior to 5.2 on PPC. sync-dax="writeback" on ARM and x86_64 is prevented > now as the flush semantics are unimplemented. "sync-dax" has no meaning on its own, I think this needs an explicit mechanism to convey both the "not-sync" property *and* the callback method, it shouldn't be inferred by arch type. > When the backend file is actually synchronous DAX capable and no explicit > flushes are required, the sync-dax mode 'direct' is to be used. > > The below demonstration shows the map_sync behavior with sync-dax writeback & > direct. 
> (https://github.com/avocado-framework-tests/avocado-misc-tests/blob/master/memory/ndctl.py.data/map_sync.c) > > The pmem0 is from nvdimm with With sync-dax=direct, and pmem1 is from > nvdimm with syn-dax=writeback, mounted as > /dev/pmem0 on /mnt1 type xfs > (rw,relatime,attr2,dax=always,inode64,logbufs=8,logbsize=32k,noquota) > /dev/pmem1 on /mnt2 type xfs > (rw,relatime,attr2,dax=always,inode64,logbufs=8,logbsize=32k,noquota) > > [root@atest-guest ~]# ./mapsync /mnt1/newfile > When > sync-dax=unsafe/direct > [root@atest-guest ~]# ./mapsync /mnt2/newfile > when sync-dax=writeback > Failed to mmap with Operation not supported > > The first patch does the header file cleanup necessary for the > subsequent ones. Second patch implements the hcall, adds the necessary > vmstate properties to spapr machine structure for carrying the hcall > status during save-restore. The nature of the hcall being asynchronus, > the patch uses aio utilities to offload the flush. The third patch adds > the 'sync-dax' device property and enables the device tree property > for the guest to utilise the hcall. > > The kernel changes to exploit this hcall is at > https://github.com/linuxppc/linux/commit/75b7c05ebf9026.patch > > --- > v3 - https://lists.gnu.org/archive/html/qemu-devel/2021-03/msg07916.html > Changes from v3: > - Fixed the forward declaration coding guidel
Re: Microsoft and Intel NVDIMM ACPI _DSM interfaces status?
On Wed, Mar 17, 2021 at 4:49 AM Stefan Hajnoczi wrote: > > Hi, > Microsoft and Intel developed two different ACPI NVDIMM _DSM interfaces. > > The specs for the Intel interface are available here: > https://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf > > This is the interface that QEMU emulates. It has been reported that > Windows 2016 Server and 2019 Server guests do not recognize QEMU's > emulated NVDIMM devices using the Microsoft driver. > > I'd like to understand the path forward that will allow both Linux and > Windows guests to successfully use QEMU's emulated NVDIMM device > (https://gitlab.com/qemu-project/qemu/-/blob/master/hw/acpi/nvdimm.c). > > Are/have these two interfaces being/been unified? No, they service 2 different classes of NVDIMMs. The Microsoft specification was defined for NVDIMM-N devices that are the traditional battery/super-capacitor backed DDR with sometimes an equal amount of flash to squirrel away data to non-volatile media on power loss. The Intel one is for a persistent media class of device where there is no need to communicate health attributes like "energy source died" or "restore from flash" failed. > Should QEMU emulate both of them to make running Windows guests easy? Depends on how tolerant Windows is to different format-interface-code implementations and what QEMU should return on each of the functions it decides to implement. Note that QEMU only implements a small subset of the Intel commands, i.e. just enough to provision namespaces in the NVDIMM label area. "NVDIMM label area" is a concept that is newer than the NVDIMM-N definition which is why you don't see labels mentioned in the Microsoft specification. Since then ACPI has developed common label area access methods, see "6.5.10 NVDIMM Label Methods" in the ACPI 6.4 specification. Note that you'll also see "9.20.8 NVDIMM Device Methods" in that spec for some health management commands that overlap what the Microsoft and Intel specs communicate. 
Linux does not support them since any platform that might support them will also support the Intel specification, so there's no driving need for Linux to migrate. I do not know the status of that command support in Windows, or the HPE command support in Windows for that matter. If you need to support guests that expect the Hyper-V command definition, and will fail to attach NVDIMM support in the absence of that definition, then QEMU needs UUID_NFIT_DIMM_N_HYPERV support; note that is a different command set than even UUID_NFIT_DIMM_N_MSFT (include/acpi/acuuid.h). That would also require changes to the virtual ACPI NFIT to advertise the correlated format interface code. There may be more... you would need someone from Hyper-V land to enumerate all that is expected.
Re: [RFC] Set addresses for memory devices [CXL]
On Wed, Jan 27, 2021 at 7:52 PM Ben Widawsky wrote: > > Hi list, Igor. > > I wanted to get some ideas on how to better handle this. Per the recent > discussion [1], it's become clear that there needs to be more thought put into > how to manage the address space for CXL memory devices. If you see the > discussion on interleave [2] there's a decent diagram for the problem > statement. > > A CXL topology looks just like a PCIe topology. A CXL memory device is a > memory > expander. It's a byte addressable address range with a combination of > persistent > and volatile memory. In a CXL capable system, you can effectively think of > these > things as more configurable NVDIMMs. The memory devices have an interface that > allows the OS to program the base physical address range it claims called an > HDM > (Host Defined Memory) decoder. A larger address range is claimed by a host > bridge (or a combination of host bridges in the interleaved case) which is > platform specific. > > Originally, my plan was to create a single memory backend for a "window" and > subregion the devices in there. So for example, if you had two devices under a > hostbridge, each of 256M size, the window would be some fixed GPA of 512M+ > size > memory backend, and those memory devices would be a subregion of the > hostbridge's window. I thought this was working in my patch series, but as it > turns out, this doesn't actually work as I intended. `info mtree` looks good, > but `info memory-devices` doesn't. > A couple clarifying questions... > So let me list the requirements and hopefully get some feedback on the best > way > to handle it. > 1. A PCIe like device has a persistent memory region (I don't care about > volatile at the moment). What do you mean by "PCIe" like? If it is PCI enumerable by the guest it has no business being treated as proper memory because the OS rightly assumes that PCIe address space is not I/O coherent to other initiators. > 2. 
The physical address base for the memory region is programmable. > 3. Memory accesses will support interleaving across multiple host bridges. So, per 1. it would look like a PCIe address space inside QEMU but advertised as an I/O coherent platform resource in the guest?
Re: [RFC PATCH 00/25] Introduce CXL 2.0 Emulation
[ add linux-cxl for the Linux driver question ] On Thu, Dec 3, 2020 at 9:10 PM Chris Browy wrote: [..] I'll let Ben address the other questions... > 4. What are the userspace system APIs for targeting CXL HDM address domain? > Usually you can mmap a SPA if you know how to look it up. tl;dr: /proc/iomem and sysfs There are a lot of moving pieces to this question. Once a CXL.mem device is mapped the driver will arrange for it to appear in /proc/iomem. If it is persistent memory it will be accessed through a /dev/pmem or /dev/dax device, and if it is volatile it will show up as just another "System RAM" range. The System RAM range may end up with a unique Linux NUMA node id. Alternatively, if the BIOS maps the device then the BIOS will set up the EFI memory map + ACPI SRAT/SLIT/HMAT for Linux to interpret. Once something shows up in /proc/iomem it was traditionally accessible via /dev/mem. These days though /dev/mem is slowly being deprecated due to security (see CONFIG_LOCK_DOWN_KERNEL*) and stability (see CONFIG_IO_STRICT_DEVMEM) concerns, so depending on your kernel configuration tools may not be able to use /dev/mem to map the SPA directly and will need to go through the driver provided interface. The driver needs to be able to coordinate coherent HDM settings across multiple devices in an interleave, all switches in the path, and the root bridge(s). So I am expecting a sysfs interface similar to the nvdimm-subsystem and device-dax-subsystem where there will be sysfs-files to enumerate the mappings and a provisioning mechanism to reconfigure HDMs with various interleave settings. For a flavor of what this might look like see the description of NVDIMM "Region" devices: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/driver-api/nvdimm/nvdimm.rst#n469 The difference will be that unlike the nvdimm case where regions are platform-firmware-defined and static, CXL regions are runtime dynamic.
Re: [PATCH] exec: fetch the alignment of Linux devdax pmem character device nodes
On Thu, Apr 9, 2020 at 7:33 AM Liu, Jingqi wrote: > > On 4/8/2020 5:42 PM, Joao Martins wrote: > > On 4/8/20 3:25 AM, Liu, Jingqi wrote: > >> On 4/8/2020 2:28 AM, Joao Martins wrote: > >>> On 4/7/20 5:55 PM, Dan Williams wrote: > >>>> On Tue, Apr 7, 2020 at 4:01 AM Joao Martins > >>>> wrote: > >>>>> Perhaps, you meant instead: > >>>>> > >>>>> /sys/dev/char/%d:%d/align > >>>>> > >>>> Hmm, are you sure that's working? > >>> It is, except that I made the slight mistake of testing with a bunch of > >>> wip > >>> patches on top which one of them actually adds the 'align' to child dax > >>> device. > >>> > >>> Argh, my apologies - and thanks for noticing. > >>> > >>>> I expect the alignment to be found > >>>> in the region device: > >>>> > >>>> /sys/class/dax: > >>>> /sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0012:00/ndbus1/region1/dax1.1/dax1.0 > >>>> $(readlink -f /sys/dev/char/253\:263)/../align > >>>> $(readlink -f /sys/dev/char/253\:263)/device/align > >>>> > >>>> > >>>> /sys/bus/dax: > >>>> /sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0012:00/ndbus1/region1/dax1.0/dax1.0 > >>>> $(readlink -f /sys/dev/char/253\:265)/../align > >>>> $(readlink -f /sys/dev/char/253\:265)/device/align <-- No such file > >>>> > >>>> The use of the /sys/dev/char/%d:%d/device is only supported by the > >>>> deprecated /sys/class/dax. > >> Hi Dan, > >> > >> Thanks for your comments. > >> > >> Seems it is a mistake. > >> > >> It should be: $(readlink -f /sys/dev/char/253\:263)/../../align > >> > > Hmm, perhaps you have an extra '../' in the path? This works for me: > > > > # ls $(readlink -f /sys/dev/char/252\:0/../align) > > /sys/devices/platform/e820_pmem/ndbus0/region0/dax0.0/dax0.0/../align > > # cat $(readlink -f /sys/dev/char/252\:0)/../align > > 2097152 > > # cat /sys/dev/char/252\:0/../align > > 2097152 > > Hi Joao, > > Hmm, I need to have an extra '../' in the path. 
The details are as follows: > > # ll /dev/dax2.0 > crw--- 1 root root 251, 5 Mar 20 13:35 /dev/dax2.0 > # uname -r > 5.6.0-rc1-00044-gb19e8c684703 > # readlink -f /sys/dev/char/251\:5/ > /sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0012:00/ndbus0/region2/dax2.1/dax/dax2.0 > # ls $(readlink -f /sys/dev/char/251\:5)/../align > ls: cannot access > '/sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0012:00/ndbus0/region2/dax2.1/dax/dax2.0/../align': > No such file or directory > # ls $(readlink -f /sys/dev/char/251\:5)/../dax_region/align > ls: cannot access > '/sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0012:00/ndbus0/region2/dax2.1/dax/dax2.0/../dax_region/align': > No such file or directory > # ls $(readlink -f /sys/dev/char/251\:5)/../../align > /sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0012:00/ndbus0/region2/dax2.1/dax/dax2.0/../../align > # ls $(readlink -f /sys/dev/char/251\:5)/../../dax_region/align > /sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0012:00/ndbus0/region2/dax2.1/dax/dax2.0/../../dax_region/align > # lsmod|grep pmem > dax_pmem_compat16384 0 > device_dax 20480 1 dax_pmem_compat > dax_pmem_core 16384 1 dax_pmem_compat > # lsmod|grep dax > dax_pmem_compat16384 0 > device_dax 20480 1 dax_pmem_compat > dax_pmem_core 16384 1 dax_pmem_compat > > Seems some configurations are different ? > > Can you share your info as above ? Thanks. Alternatively maybe you can use libdaxctl that automatically handles the ABI differences between compat-dax-class and dax-bus layouts? I didn't recommend it before because I was not sure about qemu's policy about taking on new library dependencies, but with libdaxctl you could do something like (untested): path = g_strdup_printf("/sys/dev/char/%d:%d", major(st.st_rdev), minor(st.st_rdev)); rpath = realpath(path, NULL); daxctl_region_foreach(ctx, region) if (strstr(daxctl_region_get_path(region), rpath)) { align = daxctl_region_get_align(region); break; }
Re: [PATCH] exec: fetch the alignment of Linux devdax pmem character device nodes
On Tue, Apr 7, 2020 at 4:01 AM Joao Martins wrote: > > > > On 4/1/20 4:13 AM, Jingqi Liu wrote: > > If the backend file is devdax pmem character device, the alignment > > specified by the option 'align=NUM' in the '-object memory-backend-file' > > needs to match the alignment requirement of the devdax pmem character > > device. > > > > This patch fetches the devdax pmem file 'align', so that we can compare > > it with the NUM of 'align=NUM'. > > The NUM needs to be larger than or equal to the devdax pmem file 'align'. > > > > It also fixes the problem that mmap() returns failure in qemu_ram_mmap() > > when the NUM of 'align=NUM' is less than the devdax pmem file 'align'. > > > > Cc: Dan Williams > > Signed-off-by: Jingqi Liu > > --- > > exec.c | 46 +- > > 1 file changed, 45 insertions(+), 1 deletion(-) > > > > diff --git a/exec.c b/exec.c > > index de9d949902..8221abffec 100644 > > --- a/exec.c > > +++ b/exec.c > > @@ -1736,6 +1736,42 @@ static int64_t get_file_size(int fd) > > return size; > > } > > > > +static int64_t get_file_align(int fd) > > +{ > > +int64_t align = -1; > > +#if defined(__linux__) > > +struct stat st; > > + > > +if (fstat(fd, &st) < 0) { > > +return -errno; > > +} > > + > > +/* Special handling for devdax character devices */ > > +if (S_ISCHR(st.st_mode)) { > > +g_autofree char *subsystem_path = NULL; > > +g_autofree char *subsystem = NULL; > > + > > +subsystem_path = g_strdup_printf("/sys/dev/char/%d:%d/subsystem", > > + major(st.st_rdev), > > minor(st.st_rdev)); > > +subsystem = g_file_read_link(subsystem_path, NULL); > > + > > +if (subsystem && g_str_has_suffix(subsystem, "/dax")) { > > +g_autofree char *align_path = NULL; > > +g_autofree char *align_str = NULL; > > + > > +align_path = > > g_strdup_printf("/sys/dev/char/%d:%d/device/align", > > +major(st.st_rdev), minor(st.st_rdev)); > > + > > Perhaps, you meant instead: > > /sys/dev/char/%d:%d/align > Hmm, are you sure that's working? 
I expect the alignment to be found in the region device: /sys/class/dax: /sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0012:00/ndbus1/region1/dax1.1/dax1.0 $(readlink -f /sys/dev/char/253\:263)/../align $(readlink -f /sys/dev/char/253\:263)/device/align /sys/bus/dax: /sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0012:00/ndbus1/region1/dax1.0/dax1.0 $(readlink -f /sys/dev/char/253\:265)/../align $(readlink -f /sys/dev/char/253\:265)/device/align <-- No such file The use of the /sys/dev/char/%d:%d/device is only supported by the deprecated /sys/class/dax. The current /sys/bus/dax device-model can be a drop in replacement as long as software is not written to the /sys/class sysfs layout, i.e. it uses ../ instead of device/ to walk to the region properties.
Re: [PATCH] exec: fetch the alignment of Linux devdax pmem character device nodes
On Tue, Apr 7, 2020 at 1:08 AM Paolo Bonzini wrote: > > On 07/04/20 09:29, Liu, Jingqi wrote: > > Ping. > > > > Any comments are appreciated. > > > > Hi Paolo, Richard, > > > > Any comments about this ? > > I was hoping to get a review from someone else because I have no way to > test it. But I've now queued the patch, thanks. Does qemu run tests in a nested VM? The difficult aspect of testing devdax is that you need to boot your kernel with a special option or have existing memory ranges assigned to the device. Although, Joao had thoughts about allowing dynamic creation of device-dax instance by hot unplugging memory. > > Paolo > > > > > Thanks, > > > > Jingqi > > > > On 4/1/2020 11:13 AM, Liu, Jingqi wrote: > >> If the backend file is devdax pmem character device, the alignment > >> specified by the option 'align=NUM' in the '-object memory-backend-file' > >> needs to match the alignment requirement of the devdax pmem character > >> device. > >> > >> This patch fetches the devdax pmem file 'align', so that we can compare > >> it with the NUM of 'align=NUM'. > >> The NUM needs to be larger than or equal to the devdax pmem file 'align'. > >> > >> It also fixes the problem that mmap() returns failure in qemu_ram_mmap() > >> when the NUM of 'align=NUM' is less than the devdax pmem file 'align'. 
> >> > >> Cc: Dan Williams > >> Signed-off-by: Jingqi Liu > >> --- > >> exec.c | 46 +- > >> 1 file changed, 45 insertions(+), 1 deletion(-) > >> > >> diff --git a/exec.c b/exec.c > >> index de9d949902..8221abffec 100644 > >> --- a/exec.c > >> +++ b/exec.c > >> @@ -1736,6 +1736,42 @@ static int64_t get_file_size(int fd) > >> return size; > >> } > >> +static int64_t get_file_align(int fd) > >> +{ > >> +int64_t align = -1; > >> +#if defined(__linux__) > >> +struct stat st; > >> + > >> +if (fstat(fd, &st) < 0) { > >> +return -errno; > >> +} > >> + > >> +/* Special handling for devdax character devices */ > >> +if (S_ISCHR(st.st_mode)) { > >> +g_autofree char *subsystem_path = NULL; > >> +g_autofree char *subsystem = NULL; > >> + > >> +subsystem_path = > >> g_strdup_printf("/sys/dev/char/%d:%d/subsystem", > >> + major(st.st_rdev), > >> minor(st.st_rdev)); > >> +subsystem = g_file_read_link(subsystem_path, NULL); > >> + > >> +if (subsystem && g_str_has_suffix(subsystem, "/dax")) { > >> +g_autofree char *align_path = NULL; > >> +g_autofree char *align_str = NULL; > >> + > >> +align_path = > >> g_strdup_printf("/sys/dev/char/%d:%d/device/align", > >> +major(st.st_rdev), > >> minor(st.st_rdev)); > >> + > >> +if (g_file_get_contents(align_path, &align_str, NULL, > >> NULL)) { > >> +return g_ascii_strtoll(align_str, NULL, 0); > >> +} > >> +} > >> +} > >> +#endif /* defined(__linux__) */ > >> + > >> +return align; > >> +} > >> + > >> static int file_ram_open(const char *path, > >>const char *region_name, > >>bool *created, > >> @@ -2275,7 +2311,7 @@ RAMBlock *qemu_ram_alloc_from_fd(ram_addr_t > >> size, MemoryRegion *mr, > >> { > >> RAMBlock *new_block; > >> Error *local_err = NULL; > >> -int64_t file_size; > >> +int64_t file_size, file_align; > >> /* Just support these ram flags by now. 
*/ > >> assert((ram_flags & ~(RAM_SHARED | RAM_PMEM)) == 0); > >> @@ -2311,6 +2347,14 @@ RAMBlock *qemu_ram_alloc_from_fd(ram_addr_t > >> size, MemoryRegion *mr, > >> return NULL; > >> } > >> +file_align = get_file_align(fd); > >> +if (file_align > 0 && mr && file_align > mr->align) { > >> +error_setg(errp, "backing store align 0x%" PRIx64 > >> + " is larger than 'align' option 0x" RAM_ADDR_FMT, > >> + file_align, mr->align); > >> +return NULL; Is there any downside to just making the alignment value be the max of the device-dax instance align and the command line option? Why force someone to debug the option unnecessarily?
Re: [Qemu-devel] [PATCH v9 05/11] numa: Extend CLI to provide initiator information for numa nodes
On Wed, Aug 14, 2019 at 6:57 PM Tao Xu wrote: > > On 8/15/2019 5:29 AM, Dan Williams wrote: > > On Tue, Aug 13, 2019 at 10:14 PM Tao Xu wrote: > >> > >> On 8/14/2019 10:39 AM, Dan Williams wrote: > >>> On Tue, Aug 13, 2019 at 8:00 AM Igor Mammedov wrote: > >>>> > >>>> On Fri, 9 Aug 2019 14:57:25 +0800 > >>>> Tao wrote: > >>>> > >>>>> From: Tao Xu > >>>>> > [...] > >>>>> +for (i = 0; i < machine->numa_state->num_nodes; i++) { > >>>>> +if (numa_info[i].initiator_valid && > >>>>> +!numa_info[numa_info[i].initiator].has_cpu) { > >>>> ^^ possible out of > >>>> bounds read, see bellow > >>>> > >>>>> +error_report("The initiator-id %"PRIu16 " of NUMA node %d" > >>>>> + " does not exist.", numa_info[i].initiator, > >>>>> i); > >>>>> +error_printf("\n"); > >>>>> + > >>>>> +exit(1); > >>>>> +} > >>>> it takes care only about nodes that have cpus or memory-only ones that > >>>> have > >>>> initiator explicitly provided on CLI. And leaves possibility to have > >>>> memory-only nodes without initiator mixed with nodes that have initiator. > >>>> Is it valid to have mixed configuration? > >>>> Should we forbid it? > >>> > >>> The spec talks about the "Proximity Domain for the Attached Initiator" > >>> field only being valid if the memory controller for the memory can be > >>> identified by an initiator id in the SRAT. So I expect the only way to > >>> define a memory proximity domain without this local initiator is to > >>> allow specifying a node-id that does not have an entry in the SRAT. > >>> > >> Hi Dan, > >> > >> So there may be a situation for the Attached Initiator field is not > >> valid? If true, I would allow user to input Initiator invalid. > > > > Yes it's something the OS needs to consider because the platform may > > not be able to meet the constraint that a single initiator is > > associated with the memory controller for a given memory target. 
In > > retrospect it would have been nice if the spec reserved 0x for > > this purpose, but it seems "not in SRAT" is the only way to identify > > memory that is not attached to any single initiator. > > > But as far as I know, QEMU can't emulate a NUMA node "not in SRAT". I am > wondering if it is effective to only set the initiator invalid? You don't need to emulate a NUMA node not in SRAT. Just put a number in this HMAT entry larger than the largest proximity domain number found in the SRAT. >
Re: [Qemu-devel] [PATCH v9 05/11] numa: Extend CLI to provide initiator information for numa nodes
On Tue, Aug 13, 2019 at 10:14 PM Tao Xu wrote: > > On 8/14/2019 10:39 AM, Dan Williams wrote: > > On Tue, Aug 13, 2019 at 8:00 AM Igor Mammedov wrote: > >> > >> On Fri, 9 Aug 2019 14:57:25 +0800 > >> Tao wrote: > >> > >>> From: Tao Xu > >>> > >>> In ACPI 6.3 chapter 5.2.27 Heterogeneous Memory Attribute Table (HMAT), > >>> The initiator represents processor which access to memory. And in 5.2.27.3 > >>> Memory Proximity Domain Attributes Structure, the attached initiator is > >>> defined as where the memory controller responsible for a memory proximity > >>> domain. With attached initiator information, the topology of heterogeneous > >>> memory can be described. > >>> > >>> Extend CLI of "-numa node" option to indicate the initiator numa node-id. > >>> In the linux kernel, the codes in drivers/acpi/hmat/hmat.c parse and > >>> report > >>> the platform's HMAT tables. > >>> > >>> Reviewed-by: Jingqi Liu > >>> Suggested-by: Dan Williams > >>> Signed-off-by: Tao Xu > >>> --- > >>> > >>> No changes in v9 > >>> --- > >>> hw/core/machine.c | 24 > >>> hw/core/numa.c| 13 + > >>> include/sysemu/numa.h | 3 +++ > >>> qapi/machine.json | 6 +- > >>> qemu-options.hx | 27 +++ > >>> 5 files changed, 68 insertions(+), 5 deletions(-) > >>> > >>> diff --git a/hw/core/machine.c b/hw/core/machine.c > >>> index 3c55470103..113184a9df 100644 > >>> --- a/hw/core/machine.c > >>> +++ b/hw/core/machine.c > >>> @@ -640,6 +640,7 @@ void machine_set_cpu_numa_node(MachineState *machine, > >>> const CpuInstanceProperties *props, > >>> Error **errp) > >>> { > >>> MachineClass *mc = MACHINE_GET_CLASS(machine); > >>> +NodeInfo *numa_info = machine->numa_state->nodes; > >>> bool match = false; > >>> int i; > >>> > >>> @@ -709,6 +710,16 @@ void machine_set_cpu_numa_node(MachineState *machine, > >>> match = true; > >>> slot->props.node_id = props->node_id; > >>> slot->props.has_node_id = props->has_node_id; > >>> + > >>> +if (numa_info[props->node_id].initiator_valid && > >>> +(props->node_id != 
numa_info[props->node_id].initiator)) { > >>> +error_setg(errp, "The initiator of CPU NUMA node %" PRId64 > >>> + " should be itself.", props->node_id); > >>> +return; > >>> +} > >>> +numa_info[props->node_id].initiator_valid = true; > >>> +numa_info[props->node_id].has_cpu = true; > >>> +numa_info[props->node_id].initiator = props->node_id; > >>> } > >>> > >>> if (!match) { > >>> @@ -1050,6 +1061,7 @@ static void > >>> machine_numa_finish_cpu_init(MachineState *machine) > >>> GString *s = g_string_new(NULL); > >>> MachineClass *mc = MACHINE_GET_CLASS(machine); > >>> const CPUArchIdList *possible_cpus = > >>> mc->possible_cpu_arch_ids(machine); > >>> +NodeInfo *numa_info = machine->numa_state->nodes; > >>> > >>> assert(machine->numa_state->num_nodes); > >>> for (i = 0; i < possible_cpus->len; i++) { > >>> @@ -1083,6 +1095,18 @@ static void > >>> machine_numa_finish_cpu_init(MachineState *machine) > >>> machine_set_cpu_numa_node(machine, &props, &error_fatal); > >>> } > >>> } > >>> + > >>> +for (i = 0; i < machine->numa_state->num_nodes; i++) { > >>> +if (numa_info[i].initiator_valid && > >>> +!numa_info[numa_info[i].initiator].has_cpu) { > >>^^ possible out of bounds > >> read, see bellow > >> > >>> +error_report("The initiator-id %"PRIu16 " of NUMA node %d" > >>> + " does not exist.", numa_info[i].initiator, i); >
Re: [Qemu-devel] [PATCH v9 05/11] numa: Extend CLI to provide initiator information for numa nodes
On Tue, Aug 13, 2019 at 8:00 AM Igor Mammedov wrote: > > On Fri, 9 Aug 2019 14:57:25 +0800 > Tao wrote: > > > From: Tao Xu > > > > In ACPI 6.3 chapter 5.2.27 Heterogeneous Memory Attribute Table (HMAT), > > The initiator represents processor which access to memory. And in 5.2.27.3 > > Memory Proximity Domain Attributes Structure, the attached initiator is > > defined as where the memory controller responsible for a memory proximity > > domain. With attached initiator information, the topology of heterogeneous > > memory can be described. > > > > Extend CLI of "-numa node" option to indicate the initiator numa node-id. > > In the linux kernel, the codes in drivers/acpi/hmat/hmat.c parse and report > > the platform's HMAT tables. > > > > Reviewed-by: Jingqi Liu > > Suggested-by: Dan Williams > > Signed-off-by: Tao Xu > > --- > > > > No changes in v9 > > --- > > hw/core/machine.c | 24 > > hw/core/numa.c| 13 + > > include/sysemu/numa.h | 3 +++ > > qapi/machine.json | 6 +- > > qemu-options.hx | 27 +++ > > 5 files changed, 68 insertions(+), 5 deletions(-) > > > > diff --git a/hw/core/machine.c b/hw/core/machine.c > > index 3c55470103..113184a9df 100644 > > --- a/hw/core/machine.c > > +++ b/hw/core/machine.c > > @@ -640,6 +640,7 @@ void machine_set_cpu_numa_node(MachineState *machine, > > const CpuInstanceProperties *props, Error > > **errp) > > { > > MachineClass *mc = MACHINE_GET_CLASS(machine); > > +NodeInfo *numa_info = machine->numa_state->nodes; > > bool match = false; > > int i; > > > > @@ -709,6 +710,16 @@ void machine_set_cpu_numa_node(MachineState *machine, > > match = true; > > slot->props.node_id = props->node_id; > > slot->props.has_node_id = props->has_node_id; > > + > > +if (numa_info[props->node_id].initiator_valid && > > +(props->node_id != numa_info[props->node_id].initiator)) { > > +error_setg(errp, "The initiator of CPU NUMA node %" PRId64 > > + " should be itself.", props->node_id); > > +return; > > +} > > +numa_info[props->node_id].initiator_valid = 
true; > > +numa_info[props->node_id].has_cpu = true; > > +numa_info[props->node_id].initiator = props->node_id; > > } > > > > if (!match) { > > @@ -1050,6 +1061,7 @@ static void machine_numa_finish_cpu_init(MachineState > > *machine) > > GString *s = g_string_new(NULL); > > MachineClass *mc = MACHINE_GET_CLASS(machine); > > const CPUArchIdList *possible_cpus = > > mc->possible_cpu_arch_ids(machine); > > +NodeInfo *numa_info = machine->numa_state->nodes; > > > > assert(machine->numa_state->num_nodes); > > for (i = 0; i < possible_cpus->len; i++) { > > @@ -1083,6 +1095,18 @@ static void > > machine_numa_finish_cpu_init(MachineState *machine) > > machine_set_cpu_numa_node(machine, &props, &error_fatal); > > } > > } > > + > > +for (i = 0; i < machine->numa_state->num_nodes; i++) { > > +if (numa_info[i].initiator_valid && > > +!numa_info[numa_info[i].initiator].has_cpu) { > ^^ possible out of bounds read, > see bellow > > > +error_report("The initiator-id %"PRIu16 " of NUMA node %d" > > + " does not exist.", numa_info[i].initiator, i); > > +error_printf("\n"); > > + > > +exit(1); > > +} > it takes care only about nodes that have cpus or memory-only ones that have > initiator explicitly provided on CLI. And leaves possibility to have > memory-only nodes without initiator mixed with nodes that have initiator. > Is it valid to have mixed configuration? > Should we forbid it? The spec talks about the "Proximity Domain for the Attached Initiator" field only being valid if the memory controller for the memory can be identified by an initiator id in the SRAT. So I expect the only way to define a memory proximity domain without this local initiator is to allow specifying a node-id that does not have an entry in the SRAT. That would be a useful feature for testing OS HMAT p
Re: [Qemu-devel] [PATCH v10 4/7] dm: enable synchronous dax
On Tue, May 21, 2019 at 6:43 AM Pankaj Gupta wrote: > > This patch sets dax device 'DAXDEV_SYNC' flag if all the target > devices of device mapper support synchrononous DAX. If device > mapper consists of both synchronous and asynchronous dax devices, > we don't set 'DAXDEV_SYNC' flag. > > Signed-off-by: Pankaj Gupta > --- > drivers/md/dm-table.c | 14 ++ > 1 file changed, 14 insertions(+) > > diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c > index cde3b49b2a91..1cce626ff576 100644 > --- a/drivers/md/dm-table.c > +++ b/drivers/md/dm-table.c > @@ -886,10 +886,17 @@ static int device_supports_dax(struct dm_target *ti, > struct dm_dev *dev, > return bdev_dax_supported(dev->bdev, PAGE_SIZE); > } > > +static int device_synchronous(struct dm_target *ti, struct dm_dev *dev, > + sector_t start, sector_t len, void *data) > +{ > + return dax_synchronous(dev->dax_dev); > +} > + > static bool dm_table_supports_dax(struct dm_table *t) > { > struct dm_target *ti; > unsigned i; > + bool dax_sync = true; > > /* Ensure that all targets support DAX. */ > for (i = 0; i < dm_table_get_num_targets(t); i++) { > @@ -901,7 +908,14 @@ static bool dm_table_supports_dax(struct dm_table *t) > if (!ti->type->iterate_devices || > !ti->type->iterate_devices(ti, device_supports_dax, NULL)) > return false; > + > + /* Check devices support synchronous DAX */ > + if (dax_sync && > + !ti->type->iterate_devices(ti, device_synchronous, NULL)) > + dax_sync = false; Looks like this needs to be rebased on the current state of v5.2-rc, and then we can nudge Mike for an ack.
Re: [Qemu-devel] [PATCH v4 00/11] Build ACPI Heterogeneous Memory Attribute Table (HMAT)
On Tue, May 7, 2019 at 11:32 PM Tao Xu wrote: > > This series of patches will build Heterogeneous Memory Attribute Table (HMAT) > according to the command line. The ACPI HMAT describes the memory attributes, > such as memory side cache attributes and bandwidth and latency details, > related to the System Physical Address (SPA) Memory Ranges. > The software is expected to use this information as hint for optimization. > > OSPM evaluates HMAT only during system initialization. Any changes to the HMAT > state at runtime or information regarding HMAT for hot plug are communicated > using the _HMA method. [..] Hi, I gave these patches a try while developing support for the new EFI v2.8 Specific Purpose Memory attribute [1]. I have a gap / feature request to note to make this implementation capable of emulating current shipping platform BIOS implementations for persistent memory platforms. The NUMA configuration I tested was:

-numa node,mem=4G,cpus=0-19,nodeid=0
-numa node,mem=4G,cpus=20-39,nodeid=1
-numa node,mem=4G,nodeid=2
-numa node,mem=4G,nodeid=3

...and it produced an entry like the following for proximity domain 2:

[0C8h 0200 2]                 Structure Type : [Memory Proximity Domain Attributes]
[0CAh 0202 2]                       Reserved :
[0CCh 0204 4]                         Length : 0028
[0D0h 0208 2]          Flags (decoded below) : 0002
               Processor Proximity Domain Valid : 0
[0D2h 0210 2]                      Reserved1 :
[0D4h 0212 4]     Processor Proximity Domain : 0002
[0D8h 0216 4]        Memory Proximity Domain : 0002
[0DCh 0220 4]                      Reserved2 :
[0E0h 0224 8]                      Reserved3 : 00024000
[0E8h 0232 8]                      Reserved4 : 0001

Notice that the Processor "Proximity Domain Valid" bit is clear. I understand that the implementation is keying off of whether cpus are defined for that same node or not, but that's not how current persistent memory platforms implement "Processor Proximity Domain".
On these platforms persistent memory indeed has its own proximity domain, but the Processor Proximity Domain is expected to be assigned to the domain that houses the memory controller for that persistent memory. So to emulate that configuration it would be useful to have a way to specify "Processor Proximity Domain" without needing to define CPUs in that domain. Something like:

-numa node,mem=4G,cpus=0-19,nodeid=0
-numa node,mem=4G,cpus=20-39,nodeid=1
-numa node,mem=4G,nodeid=2,localnodeid=0
-numa node,mem=4G,nodeid=3,localnodeid=1

...to specify that node2 memory is connected / local to node0 and node3 memory is connected / local to node1. In general HMAT specifies that all performance differentiated memory ranges have their own proximity domain, but those are expected to still be associated with a local/host/home-socket memory controller.

[1]: https://lists.01.org/pipermail/linux-nvdimm/2019-May/021668.html
Re: [Qemu-devel] [PATCH v9 1/7] libnvdimm: nd_region flush callback support
On Tue, May 14, 2019 at 7:55 AM Pankaj Gupta wrote: > > This patch adds functionality to perform flush from guest > to host over VIRTIO. We are registering a callback based > on 'nd_region' type. virtio_pmem driver requires this special > flush function. For rest of the region types we are registering > existing flush function. Report error returned by host fsync > failure to userspace. > > Signed-off-by: Pankaj Gupta > --- > drivers/acpi/nfit/core.c | 4 ++-- > drivers/nvdimm/claim.c | 6 -- > drivers/nvdimm/nd.h | 1 + > drivers/nvdimm/pmem.c| 13 - > drivers/nvdimm/region_devs.c | 26 -- > include/linux/libnvdimm.h| 8 +++- > 6 files changed, 46 insertions(+), 12 deletions(-) > > diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c > index 5a389a4f4f65..08dde76cf459 100644 > --- a/drivers/acpi/nfit/core.c > +++ b/drivers/acpi/nfit/core.c > @@ -2434,7 +2434,7 @@ static void write_blk_ctl(struct nfit_blk *nfit_blk, > unsigned int bw, > offset = to_interleave_offset(offset, mmio); > > writeq(cmd, mmio->addr.base + offset); > - nvdimm_flush(nfit_blk->nd_region); > + nvdimm_flush(nfit_blk->nd_region, NULL); > > if (nfit_blk->dimm_flags & NFIT_BLK_DCR_LATCH) > readq(mmio->addr.base + offset); > @@ -2483,7 +2483,7 @@ static int acpi_nfit_blk_single_io(struct nfit_blk > *nfit_blk, > } > > if (rw) > - nvdimm_flush(nfit_blk->nd_region); > + nvdimm_flush(nfit_blk->nd_region, NULL); > > rc = read_blk_stat(nfit_blk, lane) ? 
-EIO : 0; > return rc; > diff --git a/drivers/nvdimm/claim.c b/drivers/nvdimm/claim.c > index fb667bf469c7..13510bae1e6f 100644 > --- a/drivers/nvdimm/claim.c > +++ b/drivers/nvdimm/claim.c > @@ -263,7 +263,7 @@ static int nsio_rw_bytes(struct nd_namespace_common *ndns, > struct nd_namespace_io *nsio = to_nd_namespace_io(&ndns->dev); > unsigned int sz_align = ALIGN(size + (offset & (512 - 1)), 512); > sector_t sector = offset >> 9; > - int rc = 0; > + int rc = 0, ret = 0; > > if (unlikely(!size)) > return 0; > @@ -301,7 +301,9 @@ static int nsio_rw_bytes(struct nd_namespace_common *ndns, > } > > memcpy_flushcache(nsio->addr + offset, buf, size); > - nvdimm_flush(to_nd_region(ndns->dev.parent)); > + ret = nvdimm_flush(to_nd_region(ndns->dev.parent), NULL); > + if (ret) > + rc = ret; > > return rc; > } > diff --git a/drivers/nvdimm/nd.h b/drivers/nvdimm/nd.h > index a5ac3b240293..0c74d2428bd7 100644 > --- a/drivers/nvdimm/nd.h > +++ b/drivers/nvdimm/nd.h > @@ -159,6 +159,7 @@ struct nd_region { > struct badblocks bb; > struct nd_interleave_set *nd_set; > struct nd_percpu_lane __percpu *lane; > + int (*flush)(struct nd_region *nd_region, struct bio *bio); So this triggers: In file included from drivers/nvdimm/e820.c:7: ./include/linux/libnvdimm.h:140:51: warning: ‘struct bio’ declared inside parameter list will not be visible outside of this definition or declaration int (*flush)(struct nd_region *nd_region, struct bio *bio); ^~~ I was already feeling uneasy about trying to squeeze this into v5.2, but this warning and the continued drip of comments leads me to conclude that this driver would do well to wait one more development cycle. Lets close out the final fixups and let this driver soak in -next. Then for the v5.3 cycle I'll redouble my efforts towards the goal of closing patch acceptance at the -rc6 / -rc7 development milestone.
Re: [Qemu-devel] [PATCH v9 4/7] dm: enable synchronous dax
[ add Mike and dm-devel ] Mike, any concerns with the below addition to the device-mapper-dax implementation? On Tue, May 14, 2019 at 7:58 AM Pankaj Gupta wrote: > > This patch sets dax device 'DAXDEV_SYNC' flag if all the target > devices of device mapper support synchrononous DAX. If device > mapper consists of both synchronous and asynchronous dax devices, > we don't set 'DAXDEV_SYNC' flag. > > Signed-off-by: Pankaj Gupta > --- > drivers/md/dm-table.c | 14 ++ > 1 file changed, 14 insertions(+) > > diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c > index cde3b49b2a91..1cce626ff576 100644 > --- a/drivers/md/dm-table.c > +++ b/drivers/md/dm-table.c > @@ -886,10 +886,17 @@ static int device_supports_dax(struct dm_target *ti, > struct dm_dev *dev, > return bdev_dax_supported(dev->bdev, PAGE_SIZE); > } > > +static int device_synchronous(struct dm_target *ti, struct dm_dev *dev, > + sector_t start, sector_t len, void *data) > +{ > + return dax_synchronous(dev->dax_dev); > +} > + > static bool dm_table_supports_dax(struct dm_table *t) > { > struct dm_target *ti; > unsigned i; > + bool dax_sync = true; > > /* Ensure that all targets support DAX. */ > for (i = 0; i < dm_table_get_num_targets(t); i++) { > @@ -901,7 +908,14 @@ static bool dm_table_supports_dax(struct dm_table *t) > if (!ti->type->iterate_devices || > !ti->type->iterate_devices(ti, device_supports_dax, NULL)) > return false; > + > + /* Check devices support synchronous DAX */ > + if (dax_sync && > + !ti->type->iterate_devices(ti, device_synchronous, NULL)) > + dax_sync = false; > } > + if (dax_sync) > + set_dax_synchronous(t->md->dax_dev); > > return true; > } > -- > 2.20.1 >
Re: [Qemu-devel] [PATCH v9 2/7] virtio-pmem: Add virtio pmem driver
On Tue, May 14, 2019 at 8:25 AM Pankaj Gupta wrote: > > > > On 5/14/19 7:54 AM, Pankaj Gupta wrote: > > > diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig > > > index 35897649c24f..94bad084ebab 100644 > > > --- a/drivers/virtio/Kconfig > > > +++ b/drivers/virtio/Kconfig > > > @@ -42,6 +42,17 @@ config VIRTIO_PCI_LEGACY > > > > > > If unsure, say Y. > > > > > > +config VIRTIO_PMEM > > > + tristate "Support for virtio pmem driver" > > > + depends on VIRTIO > > > + depends on LIBNVDIMM > > > + help > > > + This driver provides access to virtio-pmem devices, storage devices > > > + that are mapped into the physical address space - similar to NVDIMMs > > > +- with a virtio-based flushing interface. > > > + > > > + If unsure, say M. > > > > > > from Documentation/process/coding-style.rst: > > "Lines under a ``config`` definition > > are indented with one tab, while help text is indented an additional two > > spaces." > > ah... I changed help text and 'checkpatch' did not say anything :( . > > Will wait for Dan, If its possible to add two spaces to help text while > applying > the series. I'm inclined to handle this with a fixup appended to the end of the series just so the patchwork-bot does not get confused by the content changing from what was sent to the list.
Re: [Qemu-devel] [PATCH v8 3/6] libnvdimm: add dax_dev sync flag
On Mon, May 13, 2019 at 10:32 AM Pankaj Gupta wrote: > > > Hi Dan, > > While testing device mapper with DAX, I faced a bug with the commit: > > commit ad428cdb525a97d15c0349fdc80f3d58befb50df > Author: Dan Williams > Date: Wed Feb 20 21:12:50 2019 -0800 > > When I reverted the condition to old code[1] it worked for me. I > am thinking when we map two different devices (e.g with device mapper), will > start & end pfn still point to same pgmap? Or there is something else which > I am missing here. > > Note: I tested only EXT4. > > [1] > > - if (pgmap && pgmap->type == MEMORY_DEVICE_FS_DAX) > + end_pgmap = get_dev_pagemap(pfn_t_to_pfn(end_pfn), NULL); > + if (pgmap && pgmap == end_pgmap && pgmap->type == > MEMORY_DEVICE_FS_DAX > + && pfn_t_to_page(pfn)->pgmap == pgmap > + && pfn_t_to_page(end_pfn)->pgmap == pgmap > + && pfn_t_to_pfn(pfn) == PHYS_PFN(__pa(kaddr)) > + && pfn_t_to_pfn(end_pfn) == > PHYS_PFN(__pa(end_kaddr))) Ugh, yes, device-mapper continues to be an awkward fit for dax (or vice versa). We would either need a way to have a multi-level pfn to pagemap lookup for composite devices, or a way to discern that even though the pagemap is different that the result is still valid / not an indication that we have leaked into an unassociated address range. Perhaps a per-daxdev callback for ->dax_supported() so that device-mapper internals can be used for this validation. We need to get that fixed up, but I don't see it as a blocker / pre-requisite for virtio-pmem.
Re: [Qemu-devel] [PATCH v8 3/6] libnvdimm: add dax_dev sync flag
On Fri, May 10, 2019 at 5:45 PM Pankaj Gupta wrote: > > > > > > > > > This patch adds 'DAXDEV_SYNC' flag which is set > > > for nd_region doing synchronous flush. This later > > > is used to disable MAP_SYNC functionality for > > > ext4 & xfs filesystem for devices don't support > > > synchronous flush. > > > > > > Signed-off-by: Pankaj Gupta > > > --- > > > drivers/dax/bus.c| 2 +- > > > drivers/dax/super.c | 13 - > > > drivers/md/dm.c | 3 ++- > > > drivers/nvdimm/pmem.c| 5 - > > > drivers/nvdimm/region_devs.c | 7 +++ > > > include/linux/dax.h | 8 ++-- > > > include/linux/libnvdimm.h| 1 + > > > 7 files changed, 33 insertions(+), 6 deletions(-) > > [..] > > > diff --git a/drivers/md/dm.c b/drivers/md/dm.c > > > index 043f0761e4a0..ee007b75d9fd 100644 > > > --- a/drivers/md/dm.c > > > +++ b/drivers/md/dm.c > > > @@ -1969,7 +1969,8 @@ static struct mapped_device *alloc_dev(int minor) > > > sprintf(md->disk->disk_name, "dm-%d", minor); > > > > > > if (IS_ENABLED(CONFIG_DAX_DRIVER)) { > > > - dax_dev = alloc_dax(md, md->disk->disk_name, &dm_dax_ops); > > > + dax_dev = alloc_dax(md, md->disk->disk_name, &dm_dax_ops, > > > +DAXDEV_F_SYNC); > > > > Apologies for not realizing this until now, but this is broken. > > Imaging a device-mapper configuration composed of both 'async' > > virtio-pmem and 'sync' pmem. The 'sync' flag needs to be unified > > across all members. I would change this argument to '0' and then > > arrange for it to be set at dm_table_supports_dax() time after > > validating that all components support synchronous dax. > > o.k. Need to set 'DAXDEV_F_SYNC' flag after verifying all the target > components support synchronous DAX. > > Just a question, If device mapper configuration have composed of both > virtio-pmem or pmem devices, we want to configure device mapper for async > flush? If it's composed of both then, yes, it needs to be async flush at the device-mapper level. 
Otherwise MAP_SYNC will succeed and fail to trigger fsync on the host file when necessary for the virtio-pmem backed portion of the device-mapper device.
Re: [Qemu-devel] [PATCH v7 2/6] virtio-pmem: Add virtio pmem driver
On Wed, May 8, 2019 at 4:19 AM Pankaj Gupta wrote: > > > Hi Dan, > > Thank you for the review. Please see my reply inline. > > > > > Hi Pankaj, > > > > Some minor file placement comments below. > > Sure. > > > > > On Thu, Apr 25, 2019 at 10:02 PM Pankaj Gupta wrote: > > > > > > This patch adds virtio-pmem driver for KVM guest. > > > > > > Guest reads the persistent memory range information from > > > Qemu over VIRTIO and registers it on nvdimm_bus. It also > > > creates a nd_region object with the persistent memory > > > range information so that existing 'nvdimm/pmem' driver > > > can reserve this into system memory map. This way > > > 'virtio-pmem' driver uses existing functionality of pmem > > > driver to register persistent memory compatible for DAX > > > capable filesystems. > > > > > > This also provides function to perform guest flush over > > > VIRTIO from 'pmem' driver when userspace performs flush > > > on DAX memory range. > > > > > > Signed-off-by: Pankaj Gupta > > > --- > > > drivers/nvdimm/virtio_pmem.c | 114 + > > > drivers/virtio/Kconfig | 10 +++ > > > drivers/virtio/Makefile | 1 + > > > drivers/virtio/pmem.c| 118 +++ > > > include/linux/virtio_pmem.h | 60 > > > include/uapi/linux/virtio_ids.h | 1 + > > > include/uapi/linux/virtio_pmem.h | 10 +++ > > > 7 files changed, 314 insertions(+) > > > create mode 100644 drivers/nvdimm/virtio_pmem.c > > > create mode 100644 drivers/virtio/pmem.c > > > create mode 100644 include/linux/virtio_pmem.h > > > create mode 100644 include/uapi/linux/virtio_pmem.h > > > > > > diff --git a/drivers/nvdimm/virtio_pmem.c b/drivers/nvdimm/virtio_pmem.c > > > new file mode 100644 > > > index ..66b582f751a3 > > > --- /dev/null > > > +++ b/drivers/nvdimm/virtio_pmem.c > > > @@ -0,0 +1,114 @@ > > > +// SPDX-License-Identifier: GPL-2.0 > > > +/* > > > + * virtio_pmem.c: Virtio pmem Driver > > > + * > > > + * Discovers persistent memory range information > > > + * from host and provides a virtio based flushing > > > + * 
interface. > > > + */ > > > +#include > > > +#include "nd.h" > > > + > > > + /* The interrupt handler */ > > > +void host_ack(struct virtqueue *vq) > > > +{ > > > + unsigned int len; > > > + unsigned long flags; > > > + struct virtio_pmem_request *req, *req_buf; > > > + struct virtio_pmem *vpmem = vq->vdev->priv; > > > + > > > + spin_lock_irqsave(&vpmem->pmem_lock, flags); > > > + while ((req = virtqueue_get_buf(vq, &len)) != NULL) { > > > + req->done = true; > > > + wake_up(&req->host_acked); > > > + > > > + if (!list_empty(&vpmem->req_list)) { > > > + req_buf = list_first_entry(&vpmem->req_list, > > > + struct virtio_pmem_request, list); > > > + list_del(&vpmem->req_list); > > > + req_buf->wq_buf_avail = true; > > > + wake_up(&req_buf->wq_buf); > > > + } > > > + } > > > + spin_unlock_irqrestore(&vpmem->pmem_lock, flags); > > > +} > > > +EXPORT_SYMBOL_GPL(host_ack); > > > + > > > + /* The request submission function */ > > > +int virtio_pmem_flush(struct nd_region *nd_region) > > > +{ > > > + int err; > > > + unsigned long flags; > > > + struct scatterlist *sgs[2], sg, ret; > > > + struct virtio_device *vdev = nd_region->provider_data; > > > + struct virtio_pmem *vpmem = vdev->priv; > > > + struct virtio_pmem_request *req; > > > + > > > + might_sleep(); > > > + req = kmalloc(sizeof(*req), GFP_KERNEL); > > > + if (!req) > > > + return -ENOMEM; > > > + > > > + req->done = req->wq_buf_avail = false; > > > + strcpy(req->name, "FLUSH"); > > > + init_waitqueue_head(&req->host_acked); > > > + init_waitqueue_head(&req->wq_buf); > > > + sg_init_one(&sg, req->name, strlen(req->name)); > > > + sgs[0] = &sg; > > > + sg_init_one(&ret, &req->ret, sizeof(req->ret)); > > > + sgs[1] = &ret; > > > + > > > + spin_lock_irqsave(&vpmem->pmem_lock, flags); > > > + err = virtqueue_add_sgs(vpmem->req_vq, sgs, 1, 1, req, > > > GFP_ATOMIC); > > > + if (err) { > > > + dev_err(&vdev->dev, "failed to send command to virtio pmem > > > device\n"); > > > + > > > + list_add_tail(&vpmem->req_list, 
&req->list); > > > + spin_unlock_irqrestore(&vpmem->pmem_lock, flags); > > > + > > > + /* When host has read buffer, this completes via host_ack > > > */ > > > + wait_event(req->wq_buf, req->wq_buf_avail); > > > + spin_lock_irqsave(&vpmem->pmem_lock, flags); > > > + } > > > + err = virtqueue_kick(vpmem->req_vq); > > > + spin_unlock_irqrestore(&vpmem->pmem_lock, flags); > > > + > > > +
Re: [Qemu-devel] [PATCH v8 0/6] virtio pmem driver
On Fri, May 10, 2019 at 8:52 AM Pankaj Gupta wrote: > > Hi Michael & Dan, > > Please review/ack the patch series from LIBNVDIMM & VIRTIO side. > We have acks on the ext4 & xfs patches (4, 5 & 6) and patch 2. Still need > your ack on the nvdimm patches (1 & 3) & virtio patch 2. I was planning to merge these via the nvdimm tree, not ack them. Did you have another maintainer lined up to take these patches?
Re: [Qemu-devel] [PATCH v8 3/6] libnvdimm: add dax_dev sync flag
On Fri, May 10, 2019 at 8:53 AM Pankaj Gupta wrote: > > This patch adds 'DAXDEV_SYNC' flag which is set > for nd_region doing synchronous flush. This later > is used to disable MAP_SYNC functionality for > ext4 & xfs filesystem for devices don't support > synchronous flush. > > Signed-off-by: Pankaj Gupta > --- > drivers/dax/bus.c| 2 +- > drivers/dax/super.c | 13 - > drivers/md/dm.c | 3 ++- > drivers/nvdimm/pmem.c| 5 - > drivers/nvdimm/region_devs.c | 7 +++ > include/linux/dax.h | 8 ++-- > include/linux/libnvdimm.h| 1 + > 7 files changed, 33 insertions(+), 6 deletions(-) [..] > diff --git a/drivers/md/dm.c b/drivers/md/dm.c > index 043f0761e4a0..ee007b75d9fd 100644 > --- a/drivers/md/dm.c > +++ b/drivers/md/dm.c > @@ -1969,7 +1969,8 @@ static struct mapped_device *alloc_dev(int minor) > sprintf(md->disk->disk_name, "dm-%d", minor); > > if (IS_ENABLED(CONFIG_DAX_DRIVER)) { > - dax_dev = alloc_dax(md, md->disk->disk_name, &dm_dax_ops); > + dax_dev = alloc_dax(md, md->disk->disk_name, &dm_dax_ops, > +DAXDEV_F_SYNC); Apologies for not realizing this until now, but this is broken. Imagine a device-mapper configuration composed of both 'async' virtio-pmem and 'sync' pmem. The 'sync' flag needs to be unified across all members. I would change this argument to '0' and then arrange for it to be set at dm_table_supports_dax() time after validating that all components support synchronous dax.
Re: [Qemu-devel] [PATCH v7 6/6] xfs: disable map_sync for async flush
On Thu, Apr 25, 2019 at 10:03 PM Pankaj Gupta wrote: > > Don't support 'MAP_SYNC' with non-DAX files and DAX files > with asynchronous dax_device. Virtio pmem provides > asynchronous host page cache flush mechanism. We don't > support 'MAP_SYNC' with virtio pmem and xfs. > > Signed-off-by: Pankaj Gupta > --- > fs/xfs/xfs_file.c | 9 ++--- > 1 file changed, 6 insertions(+), 3 deletions(-) Darrick, does this look ok to take through the nvdimm tree? > > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c > index a7ceae90110e..f17652cca5ff 100644 > --- a/fs/xfs/xfs_file.c > +++ b/fs/xfs/xfs_file.c > @@ -1203,11 +1203,14 @@ xfs_file_mmap( > struct file *filp, > struct vm_area_struct *vma) > { > + struct dax_device *dax_dev; > + > + dax_dev = xfs_find_daxdev_for_inode(file_inode(filp)); > /* > -* We don't support synchronous mappings for non-DAX files. At least > -* until someone comes with a sensible use case. > +* We don't support synchronous mappings for non-DAX files and > +* for DAX files if underneath dax_device is not synchronous. > */ > - if (!IS_DAX(file_inode(filp)) && (vma->vm_flags & VM_SYNC)) > + if (!daxdev_mapping_supported(vma, dax_dev)) > return -EOPNOTSUPP; > > file_accessed(filp); > -- > 2.20.1 >
Re: [Qemu-devel] [PATCH v7 3/6] libnvdimm: add dax_dev sync flag
On Thu, Apr 25, 2019 at 10:02 PM Pankaj Gupta wrote: > > This patch adds 'DAXDEV_SYNC' flag which is set > for nd_region doing synchronous flush. This later > is used to disable MAP_SYNC functionality for > ext4 & xfs filesystem for devices don't support > synchronous flush. > > Signed-off-by: Pankaj Gupta [..] > diff --git a/include/linux/dax.h b/include/linux/dax.h > index 0dd316a74a29..c97fc0cc7167 100644 > --- a/include/linux/dax.h > +++ b/include/linux/dax.h > @@ -7,6 +7,9 @@ > #include > #include > > +/* Flag for synchronous flush */ > +#define DAXDEV_F_SYNC true I'd feel better, i.e. it reads more canonically, if this was defined as (1UL << 0) and the argument to alloc_dax() was changed to 'unsigned long flags' rather than a bool.
Re: [Qemu-devel] [PATCH v7 2/6] virtio-pmem: Add virtio pmem driver
Hi Pankaj, Some minor file placement comments below. On Thu, Apr 25, 2019 at 10:02 PM Pankaj Gupta wrote: > > This patch adds virtio-pmem driver for KVM guest. > > Guest reads the persistent memory range information from > Qemu over VIRTIO and registers it on nvdimm_bus. It also > creates a nd_region object with the persistent memory > range information so that existing 'nvdimm/pmem' driver > can reserve this into system memory map. This way > 'virtio-pmem' driver uses existing functionality of pmem > driver to register persistent memory compatible for DAX > capable filesystems. > > This also provides function to perform guest flush over > VIRTIO from 'pmem' driver when userspace performs flush > on DAX memory range. > > Signed-off-by: Pankaj Gupta > --- > drivers/nvdimm/virtio_pmem.c | 114 + > drivers/virtio/Kconfig | 10 +++ > drivers/virtio/Makefile | 1 + > drivers/virtio/pmem.c| 118 +++ > include/linux/virtio_pmem.h | 60 > include/uapi/linux/virtio_ids.h | 1 + > include/uapi/linux/virtio_pmem.h | 10 +++ > 7 files changed, 314 insertions(+) > create mode 100644 drivers/nvdimm/virtio_pmem.c > create mode 100644 drivers/virtio/pmem.c > create mode 100644 include/linux/virtio_pmem.h > create mode 100644 include/uapi/linux/virtio_pmem.h > > diff --git a/drivers/nvdimm/virtio_pmem.c b/drivers/nvdimm/virtio_pmem.c > new file mode 100644 > index ..66b582f751a3 > --- /dev/null > +++ b/drivers/nvdimm/virtio_pmem.c > @@ -0,0 +1,114 @@ > +// SPDX-License-Identifier: GPL-2.0 > +/* > + * virtio_pmem.c: Virtio pmem Driver > + * > + * Discovers persistent memory range information > + * from host and provides a virtio based flushing > + * interface. 
> + */ > +#include > +#include "nd.h" > + > + /* The interrupt handler */ > +void host_ack(struct virtqueue *vq) > +{ > + unsigned int len; > + unsigned long flags; > + struct virtio_pmem_request *req, *req_buf; > + struct virtio_pmem *vpmem = vq->vdev->priv; > + > + spin_lock_irqsave(&vpmem->pmem_lock, flags); > + while ((req = virtqueue_get_buf(vq, &len)) != NULL) { > + req->done = true; > + wake_up(&req->host_acked); > + > + if (!list_empty(&vpmem->req_list)) { > + req_buf = list_first_entry(&vpmem->req_list, > + struct virtio_pmem_request, list); > + list_del(&vpmem->req_list); > + req_buf->wq_buf_avail = true; > + wake_up(&req_buf->wq_buf); > + } > + } > + spin_unlock_irqrestore(&vpmem->pmem_lock, flags); > +} > +EXPORT_SYMBOL_GPL(host_ack); > + > + /* The request submission function */ > +int virtio_pmem_flush(struct nd_region *nd_region) > +{ > + int err; > + unsigned long flags; > + struct scatterlist *sgs[2], sg, ret; > + struct virtio_device *vdev = nd_region->provider_data; > + struct virtio_pmem *vpmem = vdev->priv; > + struct virtio_pmem_request *req; > + > + might_sleep(); > + req = kmalloc(sizeof(*req), GFP_KERNEL); > + if (!req) > + return -ENOMEM; > + > + req->done = req->wq_buf_avail = false; > + strcpy(req->name, "FLUSH"); > + init_waitqueue_head(&req->host_acked); > + init_waitqueue_head(&req->wq_buf); > + sg_init_one(&sg, req->name, strlen(req->name)); > + sgs[0] = &sg; > + sg_init_one(&ret, &req->ret, sizeof(req->ret)); > + sgs[1] = &ret; > + > + spin_lock_irqsave(&vpmem->pmem_lock, flags); > + err = virtqueue_add_sgs(vpmem->req_vq, sgs, 1, 1, req, GFP_ATOMIC); > + if (err) { > + dev_err(&vdev->dev, "failed to send command to virtio pmem > device\n"); > + > + list_add_tail(&vpmem->req_list, &req->list); > + spin_unlock_irqrestore(&vpmem->pmem_lock, flags); > + > + /* When host has read buffer, this completes via host_ack */ > + wait_event(req->wq_buf, req->wq_buf_avail); > + spin_lock_irqsave(&vpmem->pmem_lock, flags); > + } > + err = 
virtqueue_kick(vpmem->req_vq); > + spin_unlock_irqrestore(&vpmem->pmem_lock, flags); > + > + if (!err) { > + err = -EIO; > + goto ret; > + } > + /* When host has read buffer, this completes via host_ack */ > + wait_event(req->host_acked, req->done); > + err = req->ret; > +ret: > + kfree(req); > + return err; > +}; > + > + /* The asynchronous flush callback function */ > +int async_pmem_flush(struct nd_region *nd_region, struct bio *bio) > +{ > + int rc = 0; > + > + /* Create child bio for asynchronous flush and chain with > +* parent bio. Otherwise directly call nd_region flush. > +*/ > + if (bio && bio->bi_iter.bi_
Re: [Qemu-devel] [PATCH v5 1/6] libnvdimm: nd_region flush callback support
On Mon, Apr 22, 2019 at 8:59 AM Jeff Moyer wrote: > > Dan Williams writes: > > > On Thu, Apr 18, 2019 at 9:18 AM Christoph Hellwig > > wrote: > >> > >> On Thu, Apr 18, 2019 at 09:05:05AM -0700, Dan Williams wrote: > >> > > > I'd either add a comment about avoiding retpoline overhead here or > >> > > > just > >> > > > make ->flush == NULL mean generic_nvdimm_flush(). Just so that > >> > > > people don't > >> > > > get confused by the code. > >> > > > >> > > Isn't this premature optimization? I really don't like adding things > >> > > like this without some numbers to show it's worth it. > >> > > >> > I don't think it's premature given this optimization technique is > >> > already being deployed elsewhere, see: > >> > > >> > https://lwn.net/Articles/774347/ > >> > >> For one this one was backed by numbers, and second after feedback > >> from Linux we switched to the NULL pointer check instead. > > > > Ok I should have noticed the switch to NULL pointer check. However, > > the question still stands do we want everyone to run numbers to > > justify this optimization, or make it a new common kernel coding > > practice to do: > > > > if (!object->op) > > generic_op(object); > > else > > object->op(object); > > > > ...in hot paths? > > I don't think nvdimm_flush is a hot path. Numbers of some > representative workload would prove one of us right. I'd rather say that the "if (!op) do_generic()" pattern is more readable in the general case, and saves grepping for who set the op in the common case. The fact that it has the potential to be faster is gravy at that point.
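Sketched out, the convention under discussion looks like the following (illustrative types, not the actual nvdimm code — the real generic_nvdimm_flush() writes to flush hint registers):

```c
#include <stddef.h>

struct nd_region;
typedef int (*flush_fn)(struct nd_region *);

struct nd_region {
	flush_fn flush;		/* NULL means "use the generic path" */
};

/* Stand-in for the generic write-pending-queue flush. */
static int generic_nvdimm_flush(struct nd_region *nd_region)
{
	return 0;
}

/* Hypothetical driver override, e.g. a paravirtualized flush. */
static int demo_flush(struct nd_region *nd_region)
{
	return 7;
}

static int region_flush(struct nd_region *nd_region)
{
	if (!nd_region->flush)
		return generic_nvdimm_flush(nd_region);	/* direct call */
	return nd_region->flush(nd_region);		/* indirect call */
}
```

The common case compiles to a direct call, so no retpoline thunk is involved, and "who set ->flush?" only needs answering for the override case.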
Re: [Qemu-devel] [PATCH v5 1/6] libnvdimm: nd_region flush callback support
On Thu, Apr 18, 2019 at 9:18 AM Christoph Hellwig wrote: > > On Thu, Apr 18, 2019 at 09:05:05AM -0700, Dan Williams wrote: > > > > I'd either add a comment about avoiding retpoline overhead here or just > > > > make ->flush == NULL mean generic_nvdimm_flush(). Just so that people > > > > don't > > > > get confused by the code. > > > > > > Isn't this premature optimization? I really don't like adding things > > > like this without some numbers to show it's worth it. > > > > I don't think it's premature given this optimization technique is > > already being deployed elsewhere, see: > > > > https://lwn.net/Articles/774347/ > > For one this one was backed by numbers, and second after feedback > from Linux we switched to the NULL pointer check instead. Ok I should have noticed the switch to NULL pointer check. However, the question still stands do we want everyone to run numbers to justify this optimization, or make it a new common kernel coding practice to do: if (!object->op) generic_op(object); else object->op(object); ...in hot paths? I agree with not doing premature optimization in principle, but this hack is minimally intrusive from a readability perspective similar to likely()/unlikely() usage which also don't come with numbers on a per patch basis.
Re: [Qemu-devel] [PATCH v5 1/6] libnvdimm: nd_region flush callback support
On Fri, Apr 12, 2019 at 6:12 AM Jeff Moyer wrote: > > Jan Kara writes: > > > On Thu 11-04-19 07:51:48, Dan Williams wrote: > >> On Tue, Apr 9, 2019 at 9:09 PM Pankaj Gupta wrote: > >> > + } else { > >> > + if (nd_region->flush(nd_region)) > >> > + rc = -EIO; > >> > >> Given the common case wants to be fast and synchronous I think we > >> should try to avoid retpoline overhead by default. So something like > >> this: > >> > >> if (nd_region->flush == generic_nvdimm_flush) > >> rc = generic_nvdimm_flush(...); > > > > I'd either add a comment about avoiding retpoline overhead here or just > > make ->flush == NULL mean generic_nvdimm_flush(). Just so that people don't > > get confused by the code. > > Isn't this premature optimization? I really don't like adding things > like this without some numbers to show it's worth it. I don't think it's premature given this optimization technique is already being deployed elsewhere, see: https://lwn.net/Articles/774347/
Re: [Qemu-devel] [PATCH v5 1/6] libnvdimm: nd_region flush callback support
On Thu, Apr 11, 2019 at 9:02 AM Pankaj Gupta wrote: > > > > > > > > > This patch adds functionality to perform flush from guest > > > to host over VIRTIO. We are registering a callback based > > > on 'nd_region' type. virtio_pmem driver requires this special > > > flush function. For rest of the region types we are registering > > > existing flush function. Report error returned by host fsync > > > failure to userspace. > > > > > > This also handles asynchronous flush requests from the block layer > > > by creating a child bio and chaining it with parent bio. > > > > > > Signed-off-by: Pankaj Gupta > > > ---bio_chain Dan williams > > [..] > > > diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c > > > index b4ef7d9ff22e..fb1041ab32a6 100644 > > > --- a/drivers/nvdimm/region_devs.c > > > +++ b/drivers/nvdimm/region_devs.c > > > @@ -295,7 +295,9 @@ static ssize_t deep_flush_store(struct device *dev, > > > struct device_attribute *att > > > return rc; > > > if (!flush) > > > return -EINVAL; > > > - nvdimm_flush(nd_region); > > > + rc = nvdimm_flush(nd_region, NULL, false); > > > + if (rc) > > > + return rc; > > > > > > return len; > > > } > > > @@ -1085,6 +1087,11 @@ static struct nd_region *nd_region_create(struct > > > nvdimm_bus *nvdimm_bus, > > > dev->of_node = ndr_desc->of_node; > > > nd_region->ndr_size = resource_size(ndr_desc->res); > > > nd_region->ndr_start = ndr_desc->res->start; > > > + if (ndr_desc->flush) > > > + nd_region->flush = ndr_desc->flush; > > > + else > > > + nd_region->flush = generic_nvdimm_flush; > > > + > > > nd_device_register(dev); > > > > > > return nd_region; > > > @@ -1125,11 +1132,36 @@ struct nd_region > > > *nvdimm_volatile_region_create(struct nvdimm_bus *nvdimm_bus, > > > } > > > EXPORT_SYMBOL_GPL(nvdimm_volatile_region_create); > > > > > > +int nvdimm_flush(struct nd_region *nd_region, struct bio *bio, bool > > > async) > > > +{ > > > > I don't quite see the point of the 'async' argument. 
All the usages of > > this routine are either > > > > nvdimm_flush(nd_region, bio, true) > > ...or: > > nvdimm_flush(nd_region, NULL, false) > > Agree. > > > > > ...so why not gate async behavior on the presence of the 'bio' argument? > > Sure. > > > > > > > > + int rc = 0; > > > + > > > + /* Create child bio for asynchronous flush and chain with > > > +* parent bio. Otherwise directly call nd_region flush. > > > +*/ > > > + if (async && bio->bi_iter.bi_sector != -1) { > > > + > > > + struct bio *child = bio_alloc(GFP_ATOMIC, 0); > > > + > > > + if (!child) > > > + return -ENOMEM; > > > + bio_copy_dev(child, bio); > > > + child->bi_opf = REQ_PREFLUSH; > > > + child->bi_iter.bi_sector = -1; > > > + bio_chain(child, bio); > > > + submit_bio(child); > > > > I understand how this works, but it's a bit too "magical" for my > > taste. I would prefer that all flush implementations take an optional > > 'bio' argument rather than rely on the make_request implementation to > > stash the bio away on a driver specific list. > > I did this to make use of "bio_chain" for chaining child bio for async flush > suggested [1]. Are you saying to remove this and just call "flush" based on > bio argument? Or I implemented the 'bio_chain' request entirely wrong? No, I think you implemented it correctly. I'm just asking for the chaining to be performed internal to the ->flush() callback rather than in the common nvdimm_flush() front-end.
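Restructured per that comment, the chaining would move into the driver's own callback, leaving the common front-end trivial. This is kernel-style pseudocode following the review discussion, not a buildable patch — the final signatures were still being negotiated in this thread:

```c
/* Each flush implementation takes the optional bio itself; the async
 * chaining is an internal detail of the virtio implementation. */
int virtio_pmem_flush_op(struct nd_region *nd_region, struct bio *bio)
{
	if (bio) {
		/* async: chain a child preflush so the parent bio is
		 * completed later via bio_endio() */
		struct bio *child = bio_alloc(GFP_ATOMIC, 0);

		if (!child)
			return -ENOMEM;
		bio_copy_dev(child, bio);
		child->bi_opf = REQ_PREFLUSH;
		child->bi_iter.bi_sector = -1;
		bio_chain(child, bio);
		submit_bio(child);
		return 0;
	}
	return virtio_pmem_flush(nd_region);	/* synchronous path */
}

/* The common front-end no longer needs an 'async' flag: */
int nvdimm_flush(struct nd_region *nd_region, struct bio *bio)
{
	return nd_region->flush(nd_region, bio);
}
```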
Re: [Qemu-devel] [PATCH v5 3/6] libnvdimm: add dax_dev sync flag
On Tue, Apr 9, 2019 at 9:10 PM Pankaj Gupta wrote: > > This patch adds 'DAXDEV_SYNC' flag which is set > for nd_region doing synchronous flush. This later > is used to disable MAP_SYNC functionality for > ext4 & xfs filesystem for devices don't support > synchronous flush. > > Signed-off-by: Pankaj Gupta > --- > drivers/dax/bus.c| 2 +- > drivers/dax/super.c | 13 - > drivers/md/dm.c | 2 +- > drivers/nvdimm/pmem.c| 3 ++- > drivers/nvdimm/region_devs.c | 7 +++ > include/linux/dax.h | 9 +++-- > include/linux/libnvdimm.h| 1 + > 7 files changed, 31 insertions(+), 6 deletions(-) > > diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c > index 2109cfe80219..431bf7d2a7f9 100644 > --- a/drivers/dax/bus.c > +++ b/drivers/dax/bus.c > @@ -388,7 +388,7 @@ struct dev_dax *__devm_create_dev_dax(struct dax_region > *dax_region, int id, > * No 'host' or dax_operations since there is no access to this > * device outside of mmap of the resulting character device. > */ > - dax_dev = alloc_dax(dev_dax, NULL, NULL); > + dax_dev = alloc_dax(dev_dax, NULL, NULL, true); I find APIs that take a bare boolean unreadable. What does 'true' mean? It wastes time to go look at the function definition vs something like: alloc_dax(dev_dax, NULL, NULL, DAXDEV_F_SYNC);
Re: [Qemu-devel] [PATCH v5 1/6] libnvdimm: nd_region flush callback support
On Tue, Apr 9, 2019 at 9:09 PM Pankaj Gupta wrote: > > This patch adds functionality to perform flush from guest > to host over VIRTIO. We are registering a callback based > on 'nd_region' type. virtio_pmem driver requires this special > flush function. For rest of the region types we are registering > existing flush function. Report error returned by host fsync > failure to userspace. > > This also handles asynchronous flush requests from the block layer > by creating a child bio and chaining it with parent bio. > > Signed-off-by: Pankaj Gupta > --- [..] > diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c > index b4ef7d9ff22e..fb1041ab32a6 100644 > --- a/drivers/nvdimm/region_devs.c > +++ b/drivers/nvdimm/region_devs.c > @@ -295,7 +295,9 @@ static ssize_t deep_flush_store(struct device *dev, > struct device_attribute *att > return rc; > if (!flush) > return -EINVAL; > - nvdimm_flush(nd_region); > + rc = nvdimm_flush(nd_region, NULL, false); > + if (rc) > + return rc; > > return len; > } > @@ -1085,6 +1087,11 @@ static struct nd_region *nd_region_create(struct > nvdimm_bus *nvdimm_bus, > dev->of_node = ndr_desc->of_node; > nd_region->ndr_size = resource_size(ndr_desc->res); > nd_region->ndr_start = ndr_desc->res->start; > + if (ndr_desc->flush) > + nd_region->flush = ndr_desc->flush; > + else > + nd_region->flush = generic_nvdimm_flush; > + > nd_device_register(dev); > > return nd_region; > @@ -1125,11 +1132,36 @@ struct nd_region > *nvdimm_volatile_region_create(struct nvdimm_bus *nvdimm_bus, > } > EXPORT_SYMBOL_GPL(nvdimm_volatile_region_create); > > +int nvdimm_flush(struct nd_region *nd_region, struct bio *bio, bool async) > +{ I don't quite see the point of the 'async' argument. All the usages of this routine are either nvdimm_flush(nd_region, bio, true) ...or: nvdimm_flush(nd_region, NULL, false) ...so why not gate async behavior on the presence of the 'bio' argument? 
> + int rc = 0; > + > + /* Create child bio for asynchronous flush and chain with > +* parent bio. Otherwise directly call nd_region flush. > +*/ > + if (async && bio->bi_iter.bi_sector != -1) { > + > + struct bio *child = bio_alloc(GFP_ATOMIC, 0); > + > + if (!child) > + return -ENOMEM; > + bio_copy_dev(child, bio); > + child->bi_opf = REQ_PREFLUSH; > + child->bi_iter.bi_sector = -1; > + bio_chain(child, bio); > + submit_bio(child); I understand how this works, but it's a bit too "magical" for my taste. I would prefer that all flush implementations take an optional 'bio' argument rather than rely on the make_request implementation to stash the bio away on a driver specific list. > + } else { > + if (nd_region->flush(nd_region)) > + rc = -EIO; Given the common case wants to be fast and synchronous I think we should try to avoid retpoline overhead by default. So something like this: if (nd_region->flush == generic_nvdimm_flush) rc = generic_nvdimm_flush(...);
Re: [Qemu-devel] Why only devdax guarantees guest data persistence ?
On Wed, Feb 20, 2019 at 7:37 AM Stefan Hajnoczi wrote: > > On Fri, Feb 15, 2019 at 05:09:31PM +, bipin.tomar--- via Qemu-devel wrote: > > Text from "docs/nvdimm.txt" says: > > Guest Data Persistence > > -- > > > > Though QEMU supports multiple types of vNVDIMM backends on Linux, > > currently the only one that can guarantee the guest write persistence > > is the device DAX on the real NVDIMM device (e.g., /dev/dax0.0), to > > which all guest access do not involve any host-side kernel cache. > > > > I think here "host-side kernel cache" imply "page cache". Why does fsdax > > NOT have the same persistence guarantees as devdax for vNVDIMM? > > Both the modes avoid using page cache then why is devdax explicitly called > > out? > > File systems may require msync(2)/fsync(2) to guarantee persistence even > with DAX (just a cache flush instruction may not be enough!). Emulated > NVDIMM devices lack an fsync interface so guests are unable to fsync the > host file system. > > This is not an issue with devdax since there is no host file system. > > virtio-pmem is an effort to add a paravirtualized fsync-style interface > and should solve this problem in the future: > https://lkml.org/lkml/2019/1/9/541 A para-virtualized flush is only necessary if the host implements caches and dirty-metadata that cannot be coordinated / flushed by MAP_SYNC. virtio-pmem is not a solution to a problem, virtio-pmem is a non-DAX alternative implementation of persistent memory that requires the guest to coordinate metadata and host page cache flushing.
Re: [Qemu-devel] Querying the size of devdax devices from userspace
On Sun, Feb 3, 2019 at 11:56 PM Stefan Hajnoczi wrote: > > How can userspace applications query the size of devdax character > devices? > > stat(1) doesn't know how large the device is: > > # stat /dev/dax0.0 > File: /dev/dax0.0 > Size: 0 Blocks: 0 IO Block: 4096 character special > file > Device: 6h/6d Inode: 56764 Links: 1 Device type: fa,d > > ndctl(1) finds out by digging in /sys: > > # ndctl list > [ > { > "dev":"namespace0.0", > "mode":"devdax", > "map":"dev", > "size":2111832064, > "uuid":"235acf4d-503f-46be-bf64-f26faf9777ef", > "chardev":"dax0.0" > } > ] > > I'm not sure how to do it without enumerating all nvdimms in /sys. Is > there a mapping from devdax major/minor number to a /sys path? > > The use case I have in mind is that QEMU currently takes the size as a > command-line parameter. The failure mode is ugly when users get this > value wrong: the guest gets a softlockup and there is no error message > on the host side. > > I'd like QEMU to detect the size or at least reject size values that are > too large. In order to do that userspace needs a convenient way of > querying the size. Any ideas? You're looking for /sys/dev/char. That will take you from major:minor to sysfs. Perhaps borrow fio's implementation for the same: http://git.kernel.dk/cgit/fio/tree/engines/dev-dax.c#n258
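A minimal sketch of that lookup: stat the device node, then build the /sys/dev/char path from its major:minor. The /sys/dev/char layout is standard; whether a "size" attribute lives in that directory depends on the device class (it does for device-dax):

```c
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>

/* Build the sysfs directory path for a character device node. For a
 * devdax node such as /dev/dax0.0, a 'size' attribute is expected in
 * the resulting directory. */
static int chardev_sysfs_path(const char *devnode, char *buf, size_t len)
{
	struct stat st;

	if (stat(devnode, &st) != 0 || !S_ISCHR(st.st_mode))
		return -1;
	snprintf(buf, len, "/sys/dev/char/%u:%u",
		 major(st.st_rdev), minor(st.st_rdev));
	return 0;
}
```

So /dev/dax0.0 resolves to something like /sys/dev/char/251:0, and reading the size file there gives the namespace size without enumerating every nvdimm in /sys.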
Re: [Qemu-devel] [PATCH V9 4/6] util/mmap-alloc: support MAP_SYNC in qemu_ram_mmap()
On Mon, Jan 21, 2019 at 7:27 PM Michael S. Tsirkin wrote: [..] > > 2. The reset of cases: > > - we will never pass the MAP_SYNC to mmap2 > > I don't see code probing for MAP_SYNC support. Did I miss it? > But if all you want is to have old linux ignore MAP_SYNC, > I think you got your wish automatically - just do not set > MAP_SHARED_VALIDATE. That will also cause new Linux to ignore MAP_SYNC.
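To make the trade-off concrete, here is a hedged sketch of the userspace side: request MAP_SYNC under MAP_SHARED_VALIDATE and fall back when the kernel or filesystem refuses. The fallback flag values below mirror the Linux uapi definitions in case older libc headers lack them; dropping MAP_SHARED_VALIDATE would instead make every kernel silently ignore MAP_SYNC, which is Dan's point:

```c
#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>

#ifndef MAP_SHARED_VALIDATE
#define MAP_SHARED_VALIDATE 0x03
#endif
#ifndef MAP_SYNC
#define MAP_SYNC 0x80000
#endif

/* Try a synchronous DAX mapping first; fall back to a plain shared
 * mapping (where the caller must fsync) when MAP_SYNC is rejected
 * (EOPNOTSUPP on non-DAX files, EINVAL on pre-4.15 kernels). */
static void *map_sync_or_fallback(int fd, size_t len, int *synced)
{
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
	if (p != MAP_FAILED) {
		*synced = 1;
		return p;
	}
	*synced = 0;
	return mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
}
```

With MAP_SHARED_VALIDATE the application finds out at mmap() time whether it really got synchronous semantics, instead of discovering at crash time that the flag was ignored.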
Re: [Qemu-devel] [PATCH v3 0/5] kvm "virtio pmem" device
On Mon, Jan 14, 2019 at 1:25 PM Dave Chinner wrote: > > On Mon, Jan 14, 2019 at 02:15:40AM -0500, Pankaj Gupta wrote: > > > > > > Until you have images (and hence host page cache) shared between > > > > multiple guests. People will want to do this, because it means they > > > > only need a single set of pages in host memory for executable > > > > binaries rather than a set of pages per guest. Then you have > > > > multiple guests being able to detect residency of the same set of > > > > pages. If the guests can then, in any way, control eviction of the > > > > pages from the host cache, then we have a guest-to-guest information > > > > leak channel. > > > > > > I don't think we should ever be considering something that would allow a > > > guest to evict page's from the host's pagecache [1]. The guest should > > > be able to kick its own references to the host's pagecache out of its > > > own pagecache, but not be able to influence whether the host or another > > > guest has a read-only mapping cached. > > > > > > [1] Unless the guest is allowed to modify the host's file; obviously > > > truncation, holepunching, etc are going to evict pages from the host's > > > page cache. > > > > This is so correct. Guest does not not evict host page cache pages directly. > > They don't right now. > > But someone is going to end up asking for discard to work so that > the guest can free unused space in the underlying spares image (i.e. > make use of fstrim or mount -o discard) because they have workloads > that have bursts of space usage and they need to trim the image > files afterwards to keep their overall space usage under control. > > And then ...we reject / push back on that patch citing the above concern. > > In case of virtio-pmem & DAX, guest clears guest page cache exceptional > > entries. > > Its solely decision of host to take action on the host page cache pages. 
> > > > In case of virtio-pmem, guest does not modify host file directly i.e don't > > perform hole punch & truncation operation directly on host file. > > ... this will no longer be true, and the nuclear landmine in this > driver interface will have been armed I agree with the need to be careful when / if explicit cache control is added, but that's not the case today.
Re: [Qemu-devel] [PATCH v3 0/5] kvm "virtio pmem" device
On Sat, Jan 12, 2019 at 5:38 PM Pankaj Gupta wrote: > > > > > > > On Thu 10-01-19 12:26:17, Dave Chinner wrote: > > > On Wed, Jan 09, 2019 at 08:17:31PM +0530, Pankaj Gupta wrote: > > > > This patch series has implementation for "virtio pmem". > > > > "virtio pmem" is fake persistent memory(nvdimm) in guest > > > > which allows to bypass the guest page cache. This also > > > > implements a VIRTIO based asynchronous flush mechanism. > > > > > > H. Sharing the host page cache direct into the guest VM. Sounds > > > like a good idea, but. > > > > > > This means the guest VM can now run timing attacks to observe host > > > side page cache residency, and depending on the implementation I'm > > > guessing that the guest will be able to control host side page > > > cache eviction, too (e.g. via discard or hole punch operations). > > > > > > Which means this functionality looks to me like a new vector for > > > information leakage into and out of the guest VM via guest > > > controlled host page cache manipulation. > > > > > > https://arxiv.org/pdf/1901.01161 > > > > > > I might be wrong, but if I'm not we're going to have to be very > > > careful about how guest VMs can access and manipulate host side > > > resources like the page cache. > > > > Right. Thinking about this I would be more concerned about the fact that > > guest can effectively pin amount of host's page cache upto size of the > > device/file passed to guest as PMEM, can't it Pankaj? Or is there some QEMU > > magic that avoids this? > > Yes, guest will pin these host page cache pages using 'get_user_pages' by > elevating the page reference count. But these pages can be reclaimed by host > at any time when there is memory pressure. Wait, how can the guest pin the host pages? I would expect this to happen only when using vfio and device assignment. Otherwise, no, the host can't reclaim a pinned page; that's the whole point of a pin: to prevent the mm from reclaiming ownership.
> KVM does not permanently pin pages. vfio does that but we are not using > it here. Right, so I'm confused by your pin assertion above.
Re: [Qemu-devel] [PATCH v3 3/5] libnvdimm: add nd_region buffered dax_dev flag
On Wed, Jan 9, 2019 at 5:53 AM Pankaj Gupta wrote: > > This patch adds 'DAXDEV_BUFFERED' flag which is set > for virtio pmem corresponding nd_region. This later > is used to disable MAP_SYNC functionality for ext4 > & xfs filesystem. > > Signed-off-by: Pankaj Gupta > --- > drivers/dax/super.c | 17 + > drivers/nvdimm/pmem.c| 3 +++ > drivers/nvdimm/region_devs.c | 7 +++ > drivers/virtio/pmem.c| 1 + > include/linux/dax.h | 9 + > include/linux/libnvdimm.h| 6 ++ > 6 files changed, 43 insertions(+) > > diff --git a/drivers/dax/super.c b/drivers/dax/super.c > index 6e928f3..9128740 100644 > --- a/drivers/dax/super.c > +++ b/drivers/dax/super.c > @@ -167,6 +167,8 @@ enum dax_device_flags { > DAXDEV_ALIVE, > /* gate whether dax_flush() calls the low level flush routine */ > DAXDEV_WRITE_CACHE, > + /* flag to disable MAP_SYNC for virtio based host page cache flush */ > + DAXDEV_BUFFERED, > }; > > /** > @@ -335,6 +337,21 @@ bool dax_write_cache_enabled(struct dax_device *dax_dev) > } > EXPORT_SYMBOL_GPL(dax_write_cache_enabled); > > +void virtio_pmem_host_cache(struct dax_device *dax_dev, bool wc) > +{ > + if (wc) > + set_bit(DAXDEV_BUFFERED, &dax_dev->flags); > + else > + clear_bit(DAXDEV_BUFFERED, &dax_dev->flags); > +} > +EXPORT_SYMBOL_GPL(virtio_pmem_host_cache); The "write_cache" property was structured this way because it can conceivably change at runtime. The MAP_SYNC capability should be static and never changed after init. > +bool virtio_pmem_host_cache_enabled(struct dax_device *dax_dev) > +{ > + return test_bit(DAXDEV_BUFFERED, &dax_dev->flags); > +} > +EXPORT_SYMBOL_GPL(virtio_pmem_host_cache_enabled); Echoing Darrick and Jan this is should be a generic property of a dax_device and not specific to virtio. I don't like the "buffered" designation as that's not accurate. There may be hardware reasons why a dax_device is not synchronous, like a requirement to flush a write-pending queue or otherwise notify the device of new writes. 
I would just have a dax_synchronous() helper and a DAXDEV_SYNC flag. I would also modify alloc_dax() to take a flags argument so that the capability can be instantiated when the dax_device is allocated.
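The shape being suggested, modeled in userspace for illustration (the names follow the review comment; the signatures eventually merged into the kernel may differ):

```c
#include <stdbool.h>
#include <stddef.h>

/* Capability flag passed to alloc_dax() at allocation time. MAP_SYNC
 * support is a static property of the device — unlike write-cache,
 * it is never toggled after init. */
#define DAXDEV_F_SYNC	(1UL << 0)

struct dax_device {
	unsigned long flags;
};

/* Model of alloc_dax() taking a flags argument instead of a bool. */
static struct dax_device *alloc_dax_model(struct dax_device *d,
					  unsigned long flags)
{
	d->flags = flags;
	return d;
}

/* Generic query usable by any filesystem gating MAP_SYNC. */
static bool dax_synchronous(const struct dax_device *d)
{
	return d->flags & DAXDEV_F_SYNC;
}
```

Filesystems would then gate MAP_SYNC on a generic dax_synchronous() check rather than a virtio-specific "buffered" query, which also covers hardware devices that need a write-pending-queue flush or similar notification.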
Re: [Qemu-devel] [PATCH v2 2/2] virtio-pmem: Add virtio pmem driver
On Wed, Oct 17, 2018 at 12:11 PM Pankaj Gupta wrote: > > > > > On Fri, Oct 12, 2018 at 10:01 PM Pankaj Gupta wrote: > > > > > > This patch adds virtio-pmem driver for KVM guest. > > > > > > Guest reads the persistent memory range information from > > > Qemu over VIRTIO and registers it on nvdimm_bus. It also > > > creates a nd_region object with the persistent memory > > > range information so that existing 'nvdimm/pmem' driver > > > can reserve this into system memory map. This way > > > 'virtio-pmem' driver uses existing functionality of pmem > > > driver to register persistent memory compatible for DAX > > > capable filesystems. > > > > > > This also provides function to perform guest flush over > > > VIRTIO from 'pmem' driver when userspace performs flush > > > on DAX memory range. > > > > Before we can move forward with this driver we need additional > > filesystem enabling to detect when the backing device is fronting DAX > > pmem or a paravirtualized page cache through virtio-pmem. Any > > interface that requires fsync() and a round trip to the hypervisor to > > flush host page cache is not DAX. > > I saw your proposal[1] for new mmap flag MAP_DIRECT. IIUIC mapping should > fail for > MAP_DIRECT if it requires explicit flush or buffer indirection. So, if we > disable > MAP_SYNC flag for virtio-pmem this should fail MAP_DIRECT as well? Otherwise > without MAP_DIRECT, virtio-pmem should be defaulted to VIRTIO flush mechanism. Right, although I wouldn't worry about MAP_DIRECT in the short term since we're still discussing what the upstream interface will look like. Regardless of whether MAP_DIRECT is specified or not, the virtio-flush mechanism would always be used for virtio-pmem. I.e. there is no possibility to get full DAX operation with virtio-pmem, only the page-cache bypass sub-set.
Taking a look at where we could inject this check for filesystems it's a bit awkward to do it in xfs_file_mmap() for example because we do not have the backing device for the extents of the inode. So at a minimum you would need to investigate calling xfs_inode_supports_dax() from that path and teaching it about a new dax_device flag. I'm thinking the dax_device flag should be called DAXDEV_BUFFERED to indicate the presence of software buffering on a device that otherwise supports bypassing the local page cache.
Re: [Qemu-devel] [PATCH v2 2/2] virtio-pmem: Add virtio pmem driver
On Fri, Oct 12, 2018 at 10:01 PM Pankaj Gupta wrote: > > This patch adds virtio-pmem driver for KVM guest. > > Guest reads the persistent memory range information from > Qemu over VIRTIO and registers it on nvdimm_bus. It also > creates a nd_region object with the persistent memory > range information so that existing 'nvdimm/pmem' driver > can reserve this into system memory map. This way > 'virtio-pmem' driver uses existing functionality of pmem > driver to register persistent memory compatible for DAX > capable filesystems. > > This also provides function to perform guest flush over > VIRTIO from 'pmem' driver when userspace performs flush > on DAX memory range. Before we can move forward with this driver we need additional filesystem enabling to detect when the backing device is fronting DAX pmem or a paravirtualized page cache through virtio-pmem. Any interface that requires fsync() and a round trip to the hypervisor to flush host page cache is not DAX.
Re: [Qemu-devel] [PATCH v2 0/2] kvm "fake DAX" device
On Fri, Oct 12, 2018 at 10:00 PM Pankaj Gupta wrote: > > This patch series has implementation for "fake DAX". > "fake DAX" is fake persistent memory(nvdimm) in guest > which allows to bypass the guest page cache. This also > implements a VIRTIO based asynchronous flush mechanism. Can we stop calling this 'fake DAX'? It isn't DAX, and it's starting to confuse people. This enabling is effectively a host-page-cache-passthrough mechanism, not DAX. Let's call the whole approach virtio-pmem, and leave DAX out of the name to hopefully prevent people from wondering why some DAX features are disabled with this driver. For example, MAP_SYNC is not compatible with this approach. Additional enabling is needed to disable MAP_SYNC in the presence of a virtio-pmem device. See the rough proposal here: https://lkml.org/lkml/2018/4/25/756
Re: [Qemu-devel] [PATCH 3/3] virtio-pmem: Add virtio pmem driver
On Thu, Sep 27, 2018 at 6:07 AM Pankaj Gupta wrote: [..] > > We are plugging VIRTIO based flush callback for virtio_pmem driver. If pmem > > driver (pmem_make_request) has to queue request we have to plug "blk_mq_ops" > > callbacks for corresponding VIRTIO vqs. AFAICU there is no existing > > multiqueue > > code merged for pmem driver yet, though i could see patches by Dave > > upstream. > > > > I thought about this and with current infrastructure "make_request" releases > spinlock > and makes current thread/task. All Other threads are free to call > 'make_request'/flush > and similarly wait by releasing the lock. Which lock are you referring to? > This actually works like a queue of threads > waiting for notifications from host. > > Current pmem code do not have multiqueue support and I am not sure if core > pmem code > needs it. Adding multiqueue support just for virtio-pmem and not for pmem in > same driver > will be confusing or require alot of tweaking. Why does the pmem driver need to be converted to multiqueue support? > Could you please give your suggestions on this. I was expecting that flush requests that cannot be completed synchronously be placed on a queue and have bio_endio() called at a future time. I.e. use bio_chain() to manage the async portion of the flush request. This causes the guest block layer to just assume the bio was queued and will be completed at some point in the future.
Re: [Qemu-devel] [PATCH 3/3] virtio-pmem: Add virtio pmem driver
On Fri, Aug 31, 2018 at 6:32 AM Pankaj Gupta wrote: > > This patch adds virtio-pmem driver for KVM guest. > > Guest reads the persistent memory range information from > Qemu over VIRTIO and registers it on nvdimm_bus. It also > creates a nd_region object with the persistent memory > range information so that existing 'nvdimm/pmem' driver > can reserve this into system memory map. This way > 'virtio-pmem' driver uses existing functionality of pmem > driver to register persistent memory compatible for DAX > capable filesystems. > > This also provides function to perform guest flush over > VIRTIO from 'pmem' driver when userspace performs flush > on DAX memory range. > > Signed-off-by: Pankaj Gupta > --- > drivers/virtio/Kconfig | 9 ++ > drivers/virtio/Makefile | 1 + > drivers/virtio/virtio_pmem.c | 255 > +++ > include/uapi/linux/virtio_ids.h | 1 + > include/uapi/linux/virtio_pmem.h | 40 ++ > 5 files changed, 306 insertions(+) > create mode 100644 drivers/virtio/virtio_pmem.c > create mode 100644 include/uapi/linux/virtio_pmem.h > > diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig > index 3589764..a331e23 100644 > --- a/drivers/virtio/Kconfig > +++ b/drivers/virtio/Kconfig > @@ -42,6 +42,15 @@ config VIRTIO_PCI_LEGACY > > If unsure, say Y. > > +config VIRTIO_PMEM > + tristate "Support for virtio pmem driver" > + depends on VIRTIO > + help > + This driver provides support for virtio based flushing interface > + for persistent memory range. > + > + If unsure, say M. 
> + > config VIRTIO_BALLOON > tristate "Virtio balloon driver" > depends on VIRTIO > diff --git a/drivers/virtio/Makefile b/drivers/virtio/Makefile > index 3a2b5c5..cbe91c6 100644 > --- a/drivers/virtio/Makefile > +++ b/drivers/virtio/Makefile > @@ -6,3 +6,4 @@ virtio_pci-y := virtio_pci_modern.o virtio_pci_common.o > virtio_pci-$(CONFIG_VIRTIO_PCI_LEGACY) += virtio_pci_legacy.o > obj-$(CONFIG_VIRTIO_BALLOON) += virtio_balloon.o > obj-$(CONFIG_VIRTIO_INPUT) += virtio_input.o > +obj-$(CONFIG_VIRTIO_PMEM) += virtio_pmem.o > diff --git a/drivers/virtio/virtio_pmem.c b/drivers/virtio/virtio_pmem.c > new file mode 100644 > index 000..c22cc87 > --- /dev/null > +++ b/drivers/virtio/virtio_pmem.c > @@ -0,0 +1,255 @@ > +// SPDX-License-Identifier: GPL-2.0 > +/* > + * virtio_pmem.c: Virtio pmem Driver > + * > + * Discovers persistent memory range information > + * from host and provides a virtio based flushing > + * interface. > + */ > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include I think we need to split this driver into 2 files, drivers/virtio/pmem.c would discover and register the virtual pmem device with the libnvdimm core, and drivers/nvdimm/virtio.c would house virtio_pmem_flush(). > + > +struct virtio_pmem_request { > + /* Host return status corresponding to flush request */ > + int ret; > + > + /* command name*/ > + char name[16]; > + > + /* Wait queue to process deferred work after ack from host */ > + wait_queue_head_t host_acked; > + bool done; > + > + /* Wait queue to process deferred work after virt queue buffer avail > */ > + wait_queue_head_t wq_buf; Why does this need wait_queue's per request? shouldn't this be per-device? 
> + bool wq_buf_avail; > + struct list_head list; > +}; > + > +struct virtio_pmem { > + struct virtio_device *vdev; > + > + /* Virtio pmem request queue */ > + struct virtqueue *req_vq; > + > + /* nvdimm bus registers virtio pmem device */ > + struct nvdimm_bus *nvdimm_bus; > + struct nvdimm_bus_descriptor nd_desc; > + > + /* List to store deferred work if virtqueue is full */ > + struct list_head req_list; > + > + /* Synchronize virtqueue data */ > + spinlock_t pmem_lock; > + > + /* Memory region information */ > + uint64_t start; > + uint64_t size; > +}; > + > +static struct virtio_device_id id_table[] = { > + { VIRTIO_ID_PMEM, VIRTIO_DEV_ANY_ID }, > + { 0 }, > +}; > + > + /* The interrupt handler */ > +static void host_ack(struct virtqueue *vq) > +{ > + unsigned int len; > + unsigned long flags; > + struct virtio_pmem_request *req, *req_buf; > + struct virtio_pmem *vpmem = vq->vdev->priv; > + > + spin_lock_irqsave(&vpmem->pmem_lock, flags); > + while ((req = virtqueue_get_buf(vq, &len)) != NULL) { > + req->done = true; > + wake_up(&req->host_acked); > + > + if (!list_empty(&vpmem->req_list)) { > + req_buf = list_first_entry(&vpmem->req_list, > + struct virtio_pmem_request, list); > + list_del(&vpmem->req_list); > + req_buf->wq_buf_avail = true; > +
Re: [Qemu-devel] [PATCH 1/3] nd: move nd_region to common header
On Fri, Aug 31, 2018 at 6:31 AM Pankaj Gupta wrote: > > This patch moves nd_region definition to common header > include/linux/nd.h file. This is required for flush callback > support for both virtio-pmem & pmem driver. > > Signed-off-by: Pankaj Gupta > --- > drivers/nvdimm/nd.h | 39 --- > include/linux/nd.h | 40 > 2 files changed, 40 insertions(+), 39 deletions(-) No, we need to find a way to do this without dumping all of these internal details to a public / global header.
Re: [Qemu-devel] [PATCH 2/3] libnvdimm: nd_region flush callback support
On Fri, Aug 31, 2018 at 6:32 AM Pankaj Gupta wrote: > > This patch adds functionality to perform flush from guest > to host over VIRTIO. We are registering a callback based > on 'nd_region' type. virtio_pmem driver requires this special > flush function. For rest of the region types we are registering > existing flush function. Report error returned by host fsync > failure to userspace. > > Signed-off-by: Pankaj Gupta This looks ok to me, just some nits below. > --- > drivers/acpi/nfit/core.c | 7 +-- > drivers/nvdimm/claim.c | 3 ++- > drivers/nvdimm/pmem.c| 12 > drivers/nvdimm/region_devs.c | 12 ++-- > include/linux/libnvdimm.h| 4 +++- > 5 files changed, 28 insertions(+), 10 deletions(-) > > diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c > index b072cfc..cd63b69 100644 > --- a/drivers/acpi/nfit/core.c > +++ b/drivers/acpi/nfit/core.c > @@ -2216,6 +2216,7 @@ static void write_blk_ctl(struct nfit_blk *nfit_blk, > unsigned int bw, > { > u64 cmd, offset; > struct nfit_blk_mmio *mmio = &nfit_blk->mmio[DCR]; > + struct nd_region *nd_region = nfit_blk->nd_region; > > enum { > BCW_OFFSET_MASK = (1ULL << 48)-1, > @@ -2234,7 +2235,7 @@ static void write_blk_ctl(struct nfit_blk *nfit_blk, > unsigned int bw, > offset = to_interleave_offset(offset, mmio); > > writeq(cmd, mmio->addr.base + offset); > - nvdimm_flush(nfit_blk->nd_region); > + nd_region->flush(nd_region); I would keep the indirect function call override inside of nvdimm_flush. Then this hunk can go away... 
> > if (nfit_blk->dimm_flags & NFIT_BLK_DCR_LATCH) > readq(mmio->addr.base + offset); > @@ -2245,6 +2246,7 @@ static int acpi_nfit_blk_single_io(struct nfit_blk > *nfit_blk, > unsigned int lane) > { > struct nfit_blk_mmio *mmio = &nfit_blk->mmio[BDW]; > + struct nd_region *nd_region = nfit_blk->nd_region; > unsigned int copied = 0; > u64 base_offset; > int rc; > @@ -2283,7 +2285,8 @@ static int acpi_nfit_blk_single_io(struct nfit_blk > *nfit_blk, > } > > if (rw) > - nvdimm_flush(nfit_blk->nd_region); > + nd_region->flush(nd_region); > + > ...ditto, no need to touch this code. > rc = read_blk_stat(nfit_blk, lane) ? -EIO : 0; > return rc; > diff --git a/drivers/nvdimm/claim.c b/drivers/nvdimm/claim.c > index fb667bf..49dce9c 100644 > --- a/drivers/nvdimm/claim.c > +++ b/drivers/nvdimm/claim.c > @@ -262,6 +262,7 @@ static int nsio_rw_bytes(struct nd_namespace_common *ndns, > { > struct nd_namespace_io *nsio = to_nd_namespace_io(&ndns->dev); > unsigned int sz_align = ALIGN(size + (offset & (512 - 1)), 512); > + struct nd_region *nd_region = to_nd_region(ndns->dev.parent); > sector_t sector = offset >> 9; > int rc = 0; > > @@ -301,7 +302,7 @@ static int nsio_rw_bytes(struct nd_namespace_common *ndns, > } > > memcpy_flushcache(nsio->addr + offset, buf, size); > - nvdimm_flush(to_nd_region(ndns->dev.parent)); > + nd_region->flush(nd_region); For this you would need to teach nsio_rw_bytes() that the flush can fail. 
> > return rc; > } > diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c > index 6071e29..ba57cfa 100644 > --- a/drivers/nvdimm/pmem.c > +++ b/drivers/nvdimm/pmem.c > @@ -201,7 +201,8 @@ static blk_qc_t pmem_make_request(struct request_queue > *q, struct bio *bio) > struct nd_region *nd_region = to_region(pmem); > > if (bio->bi_opf & REQ_PREFLUSH) > - nvdimm_flush(nd_region); > + bio->bi_status = nd_region->flush(nd_region); > + Let's have nvdimm_flush() return 0 or -EIO if it fails since thats what nsio_rw_bytes() expects, and you'll need to translate that to: BLK_STS_IOERR > > do_acct = nd_iostat_start(bio, &start); > bio_for_each_segment(bvec, bio, iter) { > @@ -216,7 +217,7 @@ static blk_qc_t pmem_make_request(struct request_queue > *q, struct bio *bio) > nd_iostat_end(bio, start); > > if (bio->bi_opf & REQ_FUA) > - nvdimm_flush(nd_region); > + bio->bi_status = nd_region->flush(nd_region); Same comment. > > bio_endio(bio); > return BLK_QC_T_NONE; > @@ -517,6 +518,7 @@ static int nd_pmem_probe(struct device *dev) > static int nd_pmem_remove(struct device *dev) > { > struct pmem_device *pmem = dev_get_drvdata(dev); > + struct nd_region *nd_region = to_region(pmem); > > if (is_nd_btt(dev)) > nvdimm_namespace_detach_btt(to_nd_btt(dev)); > @@ -528,14 +530,16 @@ static int nd_pmem_remove(struct device *dev) > sysfs_put(pmem->bb_state); > pmem->bb_state = NULL; > } > - nvdimm_flush(to_nd_region(dev->parent)); > + nd_region->flush(nd_region); Not
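The error plumbing asked for in the review above can be sketched outside the kernel. This is a hedged illustration with stand-in names (the real status codes live in include/linux/blk_types.h, and the real flush path is nvdimm_flush()): the region flush returns 0 or -EIO as nsio_rw_bytes() expects, and the bio submission path translates that into a block-layer status.

```c
#include <assert.h>
#include <errno.h>

/* FAKE_BLK_STS_* are stand-ins for the kernel's BLK_STS_* values. */
enum fake_blk_status { FAKE_BLK_STS_OK = 0, FAKE_BLK_STS_IOERR = 10 };

/* Region flush: report a host fsync failure to the caller as -EIO. */
static int region_flush(int host_fsync_rc)
{
    return host_fsync_rc ? -EIO : 0;
}

/* pmem_make_request() would set bio->bi_status from this translation. */
static enum fake_blk_status errno_to_fake_blk_status(int rc)
{
    return rc ? FAKE_BLK_STS_IOERR : FAKE_BLK_STS_OK;
}
```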
Re: [Qemu-devel] [RFC PATCH 1/1] nvdimm: let qemu requiring section alignment of pmem resource.
On Mon, Jun 11, 2018 at 9:26 AM, Stefan Hajnoczi wrote: > On Mon, Jun 11, 2018 at 06:54:25PM +0800, Zhang Yi wrote: >> Nvdimm driver use Memory hot-plug APIs to map it's pmem resource, >> which at a section granularity. >> >> When QEMU emulated the vNVDIMM device, decrease the label-storage, >> QEMU will put the vNVDIMMs directly next to one another in physical >> address space, which means that the boundary between them won't >> align to the 128 MB memory section size. > > I'm having a hard time parsing this. > > Where does the "128 MB memory section size" come from? ACPI? > A chipset-specific value? > The devm_memremap_pages() implementation uses the memory hotplug core to allocate the 'struct page' array/map for persistent memory. Memory hotplug can only be performed in terms of sections, 128MB on x86_64. There is some limited support for allowing devm_memremap_pages() to overlap 'System RAM' within a given section, but it does not currently support multiple devm_memremap_pages() calls overlapping within the same section. There is currently a kernel bug where we do not handle this unsupported configuration gracefully. The fix will cause configurations that try to overlap 2 persistent memory ranges in the same section to fail. The proposed fix is trying to make sure that QEMU does not run afoul of this constraint. There is currently no line of sight to reduce the minimum memory hotplug alignment size to less than 128M. Also, as other architectures outside of x86_64 add devm_memremap_pages() support, the minimum section alignment constraint might change and is a property of a guest OS. My understanding is that some guest OSes might expect an even larger persistent memory minimum alignment.
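The section constraint described above boils down to simple arithmetic: two persistent memory ranges must not land in the same 128MB section. A small sketch (helper names are made up for illustration; the kernel's own macros live in include/linux/mmzone.h):

```c
#include <assert.h>
#include <stdint.h>

#define SECTION_SIZE (128ULL << 20)   /* 128 MB sections on x86_64 */

/* Round an address up to the next section boundary. */
static uint64_t section_align_up(uint64_t addr)
{
    return (addr + SECTION_SIZE - 1) & ~(SECTION_SIZE - 1);
}

/* Two ranges collide if the end of one and the start of the next
 * fall in the same 128MB section -- the unsupported configuration. */
static int sections_overlap(uint64_t end_a, uint64_t start_b)
{
    return (end_a / SECTION_SIZE) == (start_b / SECTION_SIZE);
}
```

So QEMU padding each vNVDIMM's start and size to section_align_up() values avoids the constraint entirely.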
Re: [Qemu-devel] [qemu PATCH v4 3/4] nvdimm, acpi: support NFIT platform capabilities
On Wed, Jun 6, 2018 at 4:20 PM, Elliott, Robert (Persistent Memory) wrote: > >> > Okay, we can move to the symbolic names. Do you want them to be that >> long, or >> > would: >> > >> > nvdimm-cap-cpu >> > nvdimm-cap-mem-ctrl >> > nvdimm-cap-mirroring >> >> Wait, why is mirroring part of this? > > This data structure is intended to report any kind of platform capability, not > just platform persistence capabilities. Yes, but there's nothing actionable that a qemu guest OS can do with that mirroring information, so there's no need at this time to add cli cruft and code to support it.
Re: [Qemu-devel] [qemu PATCH v4 3/4] nvdimm, acpi: support NFIT platform capabilities
On Tue, Jun 5, 2018 at 3:07 PM, Ross Zwisler wrote: > On Tue, Jun 05, 2018 at 09:37:25PM +0300, Michael S. Tsirkin wrote: >> On Tue, Jun 05, 2018 at 11:15:00AM -0700, Dan Williams wrote: >> > On Tue, Jun 5, 2018 at 9:42 AM, Ross Zwisler >> > wrote: >> > > On Tue, Jun 05, 2018 at 06:25:27PM +0300, Michael S. Tsirkin wrote: >> > >> On Mon, May 21, 2018 at 10:32:02AM -0600, Ross Zwisler wrote: >> > >> > Add a machine command line option to allow the user to control the >> > >> > Platform >> > >> > Capabilities Structure in the virtualized NFIT. This Platform >> > >> > Capabilities >> > >> > Structure was added in ACPI 6.2 Errata A. >> > >> > >> > >> > Signed-off-by: Ross Zwisler >> > >> >> > >> I tried playing with it and encoding the capabilities is >> > >> quite awkward. >> > >> >> > >> Can we add bits for specific capabilities instead of nvdimm-cap? >> > >> >> > >> How about: >> > >> >> > >> "cpu-flush-on-power-loss-cap" >> > >> "memory-flush-on-power-loss-cap" >> > >> "byte-addressable-mirroring-cap" >> > > >> > > Hmmm...I don't like that as much because: >> > > >> > > a) It's very verbose. Looking at my current qemu command line few other >> > >options require that many characters, and you'd commonly be defining >> > > more >> > >than one of these for a given VM. >> > > >> > > b) It means that the QEMU will need to be updated if/when new flags are >> > > added, >> > >because we'll have to have new options for each flag. The current >> > >implementation is more future-proof because you can specify any flags >> > >value you want. >> > > >> > > However, if you feel strongly about this, I'll make the change. >> > >> > Straw-man: Could we do something similar with what we are doing in ndctl? 
>> > >> > enum ndctl_persistence_domain { >> > PERSISTENCE_NONE = 0, >> > PERSISTENCE_MEM_CTRL = 10, >> > PERSISTENCE_CPU_CACHE = 20, >> > PERSISTENCE_UNKNOWN = INT_MAX, >> > }; >> > >> > ...and have the command line take a number where "10" and "20" are >> > supported today, but allows us to adapt to new persistence domains in >> > the future. >> >> I'm fine with that except can we have symbolic names instead of numbers >> on command line? >> >> -- >> MST > > Okay, we can move to the symbolic names. Do you want them to be that long, or > would: > > nvdimm-cap-cpu > nvdimm-cap-mem-ctrl > nvdimm-cap-mirroring Wait, why is mirroring part of this? I was thinking this option would be: --persistence-domain={cpu,mem-ctrl} ...and try not to let ACPI specifics leak into the qemu command line interface. For example PowerPC qemu could have a persistence domain communicated via Open Firmware or some other mechanism. > > or something be better?
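The straw-man above can be sketched as a small parser: symbolic names on the command line map to the ndctl-style numeric domain values, with gaps left for future domains. The option name and accepted strings here are illustrative, not a real QEMU interface.

```c
#include <assert.h>
#include <limits.h>
#include <string.h>

/* Numeric values follow the ndctl enum quoted in the thread. */
enum persistence_domain {
    PERSISTENCE_NONE      = 0,
    PERSISTENCE_MEM_CTRL  = 10,
    PERSISTENCE_CPU_CACHE = 20,
    PERSISTENCE_UNKNOWN   = INT_MAX,
};

/* Hypothetical handler for --persistence-domain={cpu,mem-ctrl}. */
static enum persistence_domain parse_persistence_domain(const char *arg)
{
    if (!arg || !*arg)
        return PERSISTENCE_NONE;
    if (strcmp(arg, "mem-ctrl") == 0)
        return PERSISTENCE_MEM_CTRL;
    if (strcmp(arg, "cpu") == 0)
        return PERSISTENCE_CPU_CACHE;
    return PERSISTENCE_UNKNOWN;
}
```

Keeping the gaps between the numeric values lets a new domain (say, persistence at the media) slot in without renumbering existing ones.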
Re: [Qemu-devel] [qemu PATCH v4 3/4] nvdimm, acpi: support NFIT platform capabilities
On Tue, Jun 5, 2018 at 9:42 AM, Ross Zwisler wrote: > On Tue, Jun 05, 2018 at 06:25:27PM +0300, Michael S. Tsirkin wrote: >> On Mon, May 21, 2018 at 10:32:02AM -0600, Ross Zwisler wrote: >> > Add a machine command line option to allow the user to control the Platform >> > Capabilities Structure in the virtualized NFIT. This Platform Capabilities >> > Structure was added in ACPI 6.2 Errata A. >> > >> > Signed-off-by: Ross Zwisler >> >> I tried playing with it and encoding the capabilities is >> quite awkward. >> >> Can we add bits for specific capabilities instead of nvdimm-cap? >> >> How about: >> >> "cpu-flush-on-power-loss-cap" >> "memory-flush-on-power-loss-cap" >> "byte-addressable-mirroring-cap" > > Hmmm...I don't like that as much because: > > a) It's very verbose. Looking at my current qemu command line few other >options require that many characters, and you'd commonly be defining more >than one of these for a given VM. > > b) It means that the QEMU will need to be updated if/when new flags are added, >because we'll have to have new options for each flag. The current >implementation is more future-proof because you can specify any flags >value you want. > > However, if you feel strongly about this, I'll make the change. Straw-man: Could we do something similar with what we are doing in ndctl? enum ndctl_persistence_domain { PERSISTENCE_NONE = 0, PERSISTENCE_MEM_CTRL = 10, PERSISTENCE_CPU_CACHE = 20, PERSISTENCE_UNKNOWN = INT_MAX, }; ...and have the command line take a number where "10" and "20" are supported today, but allows us to adapt to new persistence domains in the future.
Re: [Qemu-devel] Questions about vNVDIMM on qemu/KVM
On Thu, May 24, 2018 at 12:19 AM, Yasunori Goto wrote: >> On Tue, May 22, 2018 at 10:08 PM, Yasunori Goto >> wrote: >> > Hello, >> > >> > I'm investigating status of vNVDIMM on qemu/KVM, >> > and I have some questions about it. I'm glad if anyone answer them. >> > >> > In my understanding, qemu/KVM has a feature to show NFIT for guest, >> > and it will be still updated about platform capability with this patch set. >> > https://lists.gnu.org/archive/html/qemu-devel/2018-05/msg04756.html >> > >> > And libvirt also supports this feature with >> > https://libvirt.org/formatdomain.html#elementsMemory >> > >> > >> > However, virtio-pmem is developing now, and it is better >> > for archtectures to detect regions of NVDIMM without ACPI (like s390x) >> >> I think you are confusing virtio-pmem (patches from Pankaj) and >> virtio-mem (patches from David)? ...or I'm confused. > > Probably, "I" am confusing. > So, your clarification is very helpful for me. > > >> >> > In addition, It is also necessary to flush guest contents on vNVDIMM >> > who has a backend-file. >> >> virtio-pmem is a mechanism to use host page cache as pmem in a guest. >> It does not support high performance memory applications because it >> requires fsync/msync. I.e. it is not DAX it is the traditional mmap >> I/O model, but moving page cache management to the host rather than >> duplicating it in guests. > > Ah, ok. > > >> >> > Q1) Does ACPI.NFIT bus of qemu/kvm remain with virtio-pmem? >> > How do each roles become it if both NFIT and virtio-pmem will be >> > available? >> > If my understanding is correct, both NFIT and virtio-pmem is used to >> > detect vNVDIMM regions, but only one seems to be necessary >> >> We need both because they are different. Guest DAX should not be using >> virtio-pmem. > > Hmm. Ok. > > But ,I would like understand one more thing. > In the following mail, it seems that e820 bus will be used for fake DAX. 
> > https://lists.01.org/pipermail/linux-nvdimm/2018-January/013926.html > > Could you tell me what is relationship between "fake DAX" in this mail > and Guest DAX? > Why e820 is necessary for this case? > It was proposed as a starting template for writing a new nvdimm bus driver. All we need is a way to communicate both the address range and the flush interface. This could be done with a new SPA Range GUID with the NFIT, or a custom virtio-pci device that registers a special nvdimm region with this property. My preference is whichever approach minimizes the code duplication, because the pmem driver should be re-used as much as possible.
Re: [Qemu-devel] Questions about vNVDIMM on qemu/KVM
On Tue, May 22, 2018 at 10:08 PM, Yasunori Goto wrote: > Hello, > > I'm investigating status of vNVDIMM on qemu/KVM, > and I have some questions about it. I'm glad if anyone answer them. > > In my understanding, qemu/KVM has a feature to show NFIT for guest, > and it will be still updated about platform capability with this patch set. > https://lists.gnu.org/archive/html/qemu-devel/2018-05/msg04756.html > > And libvirt also supports this feature with > https://libvirt.org/formatdomain.html#elementsMemory > > > However, virtio-pmem is developing now, and it is better > for archtectures to detect regions of NVDIMM without ACPI (like s390x) I think you are confusing virtio-pmem (patches from Pankaj) and virtio-mem (patches from David)? ...or I'm confused. > In addition, It is also necessary to flush guest contents on vNVDIMM > who has a backend-file. virtio-pmem is a mechanism to use host page cache as pmem in a guest. It does not support high performance memory applications because it requires fsync/msync. I.e. it is not DAX it is the traditional mmap I/O model, but moving page cache management to the host rather than duplicating it in guests. > Q1) Does ACPI.NFIT bus of qemu/kvm remain with virtio-pmem? > How do each roles become it if both NFIT and virtio-pmem will be > available? > If my understanding is correct, both NFIT and virtio-pmem is used to > detect vNVDIMM regions, but only one seems to be necessary We need both because they are different. Guest DAX should not be using virtio-pmem. > Otherwize, is the NFIT bus just for keeping compatibility, > and virtio-pmem is promising way? > > > Q2) What bus is(will be?) created for virtio-pmem? > I could confirm the bus of NFIT is created with , > and I heard other bus will be created for virtio-pmem, but I could not > find what bus is created concretely. 
> --- > # ndctl list -B > { > "provider":"ACPI.NFIT", > "dev":"ndbus0" > } > --- > > I think it affects what operations user will be able to, and what > notification is necessary for vNVDIMM. > ACPI defines some operations like namespace controll, and notification > for NVDIMM health status or others. > (I suppose that other status notification might be necessary for vNVDIMM, > but I'm not sure yet...) > > If my understanding is wrong, please correct me. The current plan, per my understanding, is a virtio-pmem SPA UUID added to the virtual NFIT so that the guest driver can load the pmem driver but also hook up the virtio command ring for forwarding WRITE_{FUA,FLUSH} commands as host fsync operations.
Re: [Qemu-devel] [RFC v2 2/2] pmem: device flush over VIRTIO
On Thu, Apr 26, 2018 at 9:40 AM, Pankaj Gupta wrote: > >> >> On Wed, Apr 25, 2018 at 04:54:14PM +0530, Pankaj Gupta wrote: >> > This patch adds functionality to perform >> > flush from guest to hosy over VIRTIO >> > when 'ND_REGION_VIRTIO'flag is set on >> > nd_negion. Flag is set by 'virtio-pmem' >> > driver. >> > >> > Signed-off-by: Pankaj Gupta >> > --- >> > drivers/nvdimm/region_devs.c | 7 +++ >> > 1 file changed, 7 insertions(+) >> > >> > diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c >> > index a612be6..6c6454e 100644 >> > --- a/drivers/nvdimm/region_devs.c >> > +++ b/drivers/nvdimm/region_devs.c >> > @@ -20,6 +20,7 @@ >> > #include >> > #include "nd-core.h" >> > #include "nd.h" >> > +#include >> > >> > /* >> > * For readq() and writeq() on 32-bit builds, the hi-lo, lo-hi order is >> > @@ -1074,6 +1075,12 @@ void nvdimm_flush(struct nd_region *nd_region) >> > struct nd_region_data *ndrd = dev_get_drvdata(&nd_region->dev); >> > int i, idx; >> > >> > + /* call PV device flush */ >> > + if (test_bit(ND_REGION_VIRTIO, &nd_region->flags)) { >> > + virtio_pmem_flush(&nd_region->dev); >> > + return; >> > + } >> >> How does libnvdimm know when flush has completed? >> >> Callers expect the flush to be finished when nvdimm_flush() returns but >> the virtio driver has only queued the request, it hasn't waited for >> completion! > > I tried to implement what nvdimm does right now. It just writes to > flush hint address to make sure data persists. nvdimm_flush() is currently expected to be synchronous. Currently it is sfence(); write to special address; sfence(). By the time the second sfence returns the data is flushed. So you would need to make this virtio flush interface synchronous as well, but that appears problematic, since it could stop the guest for unbounded amounts of time. Instead, you need to rework nvdimm_flush() and the pmem driver to make these flush requests asynchronous and add the plumbing for completion callbacks via bio_endio().
> I just did not want to block guest write requests till host side > fsync completes. You must complete the flush before bio_endio(), otherwise you're violating the expectations of the guest filesystem/block-layer. > > be worse for operations on different guest files because all these operations > would happen > ultimately on same file at host. > > I think with current way, we can achieve an asynchronous queuing mechanism on > cost of not > 100% sure when fsync would complete but it is assured it will happen. Also, > its entire block > flush. No, again, that's broken. We need to add the plumbing for communicating the fsync() completion relative the WRITE_{FLUSH,FUA} bio in the guest.
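The synchronous contract described above — fence, flush-hint write, fence, done on return — can be sketched in userspace. Everything here is an analogy: the "register" is a plain variable standing in for a real flush-hint MMIO address, and `__sync_synchronize()` stands in for sfence.

```c
#include <assert.h>
#include <stdint.h>

static volatile uint64_t flush_hint;   /* stand-in for the hint register */
static int flushes_completed;

/* Sketch of the synchronous nvdimm_flush() sequence: when this
 * returns, the flush has happened -- the property the virtio variant
 * above gives up by merely queueing the request. */
static void fake_nvdimm_flush(void)
{
    __sync_synchronize();        /* order prior stores (sfence) */
    flush_hint = 1;              /* write to the flush hint address */
    __sync_synchronize();        /* drain the hint write (sfence) */
    flushes_completed++;         /* synchronous: done on return */
}
```

An async virtio flush cannot provide this return-means-done guarantee, which is why the completion has to be reported later via bio_endio() instead.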
Re: [Qemu-devel] [PATCH] acpi/nvdimm: remove forward name references
>> > Good to know, thanks. > > I have made a big mistake. I was comparing the resulting SSDT of this patch > to an SSDT > Generated by a much older version of kvm/qemu. The latest kvm/qemu upstream > builds > the SSDT with the operation region contained inside of methods. This should > resolve > MEMA of the operation region when the method is invoked. The older SSDT that > I was > looking at declared the memory region outside of methods and had a forward > reference > to MEMA which we no longer support. > > So the latest qemu/kvm works just fine and this patch is not needed, Dan has > confirmed that > the latest qemu/kvm works for his nvdimm setup. > > Thanks for all of the help and I'm sorry for any confusion that I might have > caused! Yes, my bad as well. I saw the regression on v2.8.0, but updating to latest qemu-mainline also fixes the problem for me. Thanks for the debug.
Re: [Qemu-devel] [RFC v2 1/2] virtio: add pmem driver
[ adding Jeff directly since he has also been looking at infrastructure to track when MAP_SYNC should be disabled ] On Wed, Apr 25, 2018 at 7:21 AM, Dan Williams wrote: > On Wed, Apr 25, 2018 at 4:24 AM, Pankaj Gupta wrote: >> This patch adds virtio-pmem driver for KVM >> guest. > > Minor nit, please expand your changelog line wrapping to 72 columns. > >> >> Guest reads the persistent memory range >> information from Qemu over VIRTIO and registers >> it on nvdimm_bus. It also creates a nd_region >> object with the persistent memory range >> information so that existing 'nvdimm/pmem' >> driver can reserve this into system memory map. >> This way 'virtio-pmem' driver uses existing >> functionality of pmem driver to register persistent >> memory compatible for DAX capable filesystems. > > We need some additional enabling to disable MAP_SYNC for this > configuration. In other words, if fsync() is required then we must > disable the MAP_SYNC optimization. I think this should be a struct > dax_device property looked up at mmap time in each MAP_SYNC capable > ->mmap() file operation implementation.
Re: [Qemu-devel] [RFC v2 2/2] pmem: device flush over VIRTIO
On Wed, Apr 25, 2018 at 4:24 AM, Pankaj Gupta wrote: > This patch adds functionality to perform > flush from guest to hosy over VIRTIO > when 'ND_REGION_VIRTIO'flag is set on > nd_negion. Flag is set by 'virtio-pmem' > driver. > > Signed-off-by: Pankaj Gupta > --- > drivers/nvdimm/region_devs.c | 7 +++ > 1 file changed, 7 insertions(+) > > diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c > index a612be6..6c6454e 100644 > --- a/drivers/nvdimm/region_devs.c > +++ b/drivers/nvdimm/region_devs.c > @@ -20,6 +20,7 @@ > #include > #include "nd-core.h" > #include "nd.h" > +#include > > /* > * For readq() and writeq() on 32-bit builds, the hi-lo, lo-hi order is > @@ -1074,6 +1075,12 @@ void nvdimm_flush(struct nd_region *nd_region) > struct nd_region_data *ndrd = dev_get_drvdata(&nd_region->dev); > int i, idx; > > + /* call PV device flush */ > + if (test_bit(ND_REGION_VIRTIO, &nd_region->flags)) { > + virtio_pmem_flush(&nd_region->dev); > + return; > + } > + I'd rather introduce a ->flush() operation hanging off of 'struct nd_region' so that this multiplexing can be a static setting.
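The `->flush()` indirection suggested here can be sketched with stand-in types (not the real libnvdimm structures): the virtio-vs-native choice is made once when the region is registered, instead of testing a flag bit on every flush call.

```c
#include <assert.h>
#include <errno.h>

/* Stand-in for 'struct nd_region' carrying a flush operation. */
struct fake_nd_region {
    int (*flush)(struct fake_nd_region *nd_region);
};

static int generic_nvdimm_flush(struct fake_nd_region *nd_region)
{
    (void)nd_region;
    return 0;            /* writes flush hints; cannot fail here */
}

static int virtio_pmem_flush(struct fake_nd_region *nd_region)
{
    (void)nd_region;
    return -EIO;         /* pretend the host-side fsync failed */
}

/* At region registration, pick the op once based on region type,
 * making the multiplexing a static setting. */
static void region_init(struct fake_nd_region *r, int is_virtio)
{
    r->flush = is_virtio ? virtio_pmem_flush : generic_nvdimm_flush;
}
```

Callers then just invoke `nd_region->flush(nd_region)` with no per-call flag test, which is the shape the later nd_region flush-callback patch series adopted.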
Re: [Qemu-devel] [RFC v2 1/2] virtio: add pmem driver
On Wed, Apr 25, 2018 at 4:24 AM, Pankaj Gupta wrote: > This patch adds virtio-pmem driver for KVM > guest. Minor nit, please expand your changelog line wrapping to 72 columns. > > Guest reads the persistent memory range > information from Qemu over VIRTIO and registers > it on nvdimm_bus. It also creates a nd_region > object with the persistent memory range > information so that existing 'nvdimm/pmem' > driver can reserve this into system memory map. > This way 'virtio-pmem' driver uses existing > functionality of pmem driver to register persistent > memory compatible for DAX capable filesystems. We need some additional enabling to disable MAP_SYNC for this configuration. In other words, if fsync() is required then we must disable the MAP_SYNC optimization. I think this should be a struct dax_device property looked up at mmap time in each MAP_SYNC capable ->mmap() file operation implementation.
Re: [Qemu-devel] RFC: altering the NVDIMM acpi table
On Mon, Apr 23, 2018 at 1:35 PM, Schmauss, Erik wrote: > Hello, > > I work on ACPICA and we have recently made changes to the behavior of > the Linux AML interpreter to match other OS implementations. After > sending the patches to upstream Linux, we have identified that > hw/acpi/nvdimm.c specifies an ACPI table with a forward reference > (MEMA is a forward reference that is no longer supported as of Linux > 4.17-rc1). > > We would like to change this file to move the declaration of Name > (MEMA,...) to appear as the very first declaration in the SSDT. Below is a > patch outlining the change that I would like to make. > However, I am having a hard time getting make check to run > to completion in a reasonable amount of time. It always seems to fail > on some sort of checksum test... It would be great if you could let me > know what you think of the change and what I can do to speed up the > execution time of make check... > > > Thanks, > > Erik Schmauss > > diff --git a/hw/acpi/nvdimm.c b/hw/acpi/nvdimm.c > index 59d6e4254c..7c9efd9ac7 100644 > --- a/hw/acpi/nvdimm.c > +++ b/hw/acpi/nvdimm.c > @@ -1234,6 +1234,9 @@ static void nvdimm_build_ssdt(GArray *table_offsets, > GArray *table_data, > ssdt = init_aml_allocator(); > acpi_data_push(ssdt->buf, sizeof(AcpiTableHeader)); > > +mem_addr_offset = build_append_named_dword(table_data, > + NVDIMM_ACPI_MEM_ADDR); > + > sb_scope = aml_scope("\\_SB"); > > dev = aml_device("NVDR"); > @@ -1266,9 +1269,6 @@ static void nvdimm_build_ssdt(GArray *table_offsets, > GArray *table_data, > > /* copy AML table into ACPI tables blob and patch header there */ > g_array_append_vals(table_data, ssdt->buf->data, ssdt->buf->len); > -mem_addr_offset = build_append_named_dword(table_data, > - NVDIMM_ACPI_MEM_ADDR); > - > bios_linker_loader_alloc(linker, > NVDIMM_DSM_MEM_FILE, dsm_dma_arrea, > sizeof(NvdimmDsmIn), false /* high memory */); I gave this a shot and it appears to break some assumption of where this device is mapped relative
to System RAM:

ioremap on RAM at 0xbffe - 0x00019ffe1fff
WARNING: CPU: 0 PID: 0 at arch/x86/mm/ioremap.c:166 __ioremap_caller+0x28b/0x300
Modules linked in:
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.17.0-rc1+ #1734
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.1-0-g0551a4be2c-prebuilt.qemu-project.org 04/01/2014
RIP: 0010:__ioremap_caller+0x28b/0x300
RSP: :82603db0 EFLAGS: 00010286
RAX: RBX: bffe RCX: 0006
RDX: 0168 RSI: 82618f90 RDI: 0246
RBP: e0002000 R08: R09:
R10: 88043e7d8000 R11: R12:
R13: bffe R14: 81a731e9 R15: 82603ee4
FS: () GS:88043140() knlGS:
CS: 0010 DS: ES: CR0: 80050033
CR2: CR3: 0261 CR4: 000406b0
Call Trace:
 ? acpi_os_map_iomem+0x7b/0x1c0
 acpi_os_map_iomem+0x189/0x1c0
 acpi_tb_acquire_table+0x39/0x64
 acpi_tb_validate_table+0x21/0x33
 acpi_tb_verify_temp_table+0x37/0x213
 acpi_reallocate_root_table+0xe1/0x112
 acpi_early_init+0x4b/0x102
 start_kernel+0x419/0x4ee
 secondary_startup_64+0xa5/0xb0
Re: [Qemu-devel] [PATCH v2 5/7] ACPICA: Integrate package handling with module-level code
On Mon, Apr 16, 2018 at 5:05 PM, Schmauss, Erik wrote: > >> -Original Message- >> From: linux-acpi-ow...@vger.kernel.org [mailto:linux-acpi- >> ow...@vger.kernel.org] On Behalf Of Dan Williams >> Sent: Monday, April 16, 2018 4:22 PM >> To: Schmauss, Erik >> Cc: Rafael J. Wysocki ; Linux ACPI > a...@vger.kernel.org>; Moore, Robert ; linux- >> nvdimm ; Qemu Developers > de...@nongnu.org> >> Subject: Re: [PATCH v2 5/7] ACPICA: Integrate package handling with module- >> level code >> >> On Mon, Apr 16, 2018 at 4:15 PM, Schmauss, Erik >> wrote: >> > [ trimming ] >> >> >> Rafael, we may want to hold back on the module-level code changes >> >> >> (the patches below) for rc1. Between this and the strange _TSS >> >> >> issue, it seems like there are a few more things to resolve before >> >> >> this is ready for kernel upstream. >> >> > >> > Hi Rafael, >> > >> >> > It looks like you are asking me to queue up reverts as per the >> >> > Dan's report, is that correct? >> > >> > This is indeed what I meant last week. However, I've looked into the >> > issue and Dan's qemu instance had AML that we no longer support. This >> > is because the ACPICA commit makes changes to the execution of AML >> > during table load to match windows AML interpreter behavior so this commit >> also got rid of support for executing code containing forward references >> (except >> for package elements). >> > >> > I've suggested a fix for the firmware in a separate email. So I would >> > say that this issue is resolved after if Dan can run his test successfully >> > with the >> adjusted firmware. >> > >> > If Dan's test is successful, we don’t need to revert these changes >> > > Hi Dan, > >> I'm concerned about other qemu-kvm users that do not upgrade their hypervisor >> at the same pace as their guest kernel. Especially for cloud providers that >> may >> be running latest mainline kernel on older qemu-kvm this will look like a >> pure >> kernel regression. 
Is there a quick fix we can carry in the kernel to >> support these >> forward references, at least until we know that qemu-kvm is no longer >> shipping >> the broken AML? > > This is a very good point. Thanks for bringing this up! One option is for > them to set > the global variable acpi_gbl_execute_tables_as_methods in > include/acpi/acpixf.h to false. > This will effectively revert the new behavior in the AML interpreter and go > back to the old way. > Since this is a global flag, we could have a command line option for Linux > kernel to turn this > feature on. > > Out of curiosity, is this ACPI table somehow customized for your work? I have > a collection > of acpi tables and your ACPI tables are the only ones that have an > OperationRegion called > NRAM. What is the chance that others will be running Linux on the same tables > as the one > you sent me? I don't think there's anything atypical about my particular setup. It creates two virtual NVDIMMs that each represent a 30GB address space. I suspect any user of the KVM NVDIMM virtualization would see the same problem.
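[Editorial sketch] Erik's suggestion above, flipping the ACPICA global acpi_gbl_execute_tables_as_methods back to false via a Linux command-line option, could look roughly like the following. The parameter name and the parsing helper are hypothetical illustrations only; a real kernel implementation would use __setup() rather than scanning a string:

```c
#include <stdbool.h>
#include <string.h>

/* Stand-in for ACPICA's acpi_gbl_execute_tables_as_methods global
 * (declared in include/acpi/acpixf.h): true selects the new
 * Windows-compatible table-load behavior, false restores the old
 * interpreter behavior that tolerated forward references in AML. */
static bool execute_tables_as_methods = true;

/* Hypothetical boot-parameter handler: "acpica_table_methods=off"
 * flips the global back to the legacy behavior. The option name is
 * an illustration, not an actual kernel parameter. */
static void parse_acpica_param(const char *cmdline)
{
    const char *p = strstr(cmdline, "acpica_table_methods=");

    if (p && strncmp(p + strlen("acpica_table_methods="), "off", 3) == 0)
        execute_tables_as_methods = false;
}
```

Whether such a knob is worth carrying depends on how long broken qemu-kvm AML stays in the field, per Dan's regression concern above.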
Re: [Qemu-devel] [PATCH v2 5/7] ACPICA: Integrate package handling with module-level code
On Mon, Apr 16, 2018 at 4:15 PM, Schmauss, Erik wrote: > [ trimming ] >> >> Rafael, we may want to hold back on the module-level code changes >> >> (the patches below) for rc1. Between this and the strange _TSS issue, >> >> it seems like there are a few more things to resolve before this is >> >> ready for kernel upstream. >> > > Hi Rafael, > >> > It looks like you are asking me to queue up reverts as per the Dan's >> > report, is that correct? > > This is indeed what I meant last week. However, I've looked into the issue > and Dan's qemu > instance had AML that we no longer support. This is because the ACPICA commit > makes changes to the execution of AML > during table load to match windows AML interpreter behavior so this commit > also got rid of support for executing code > containing forward references (except for package elements). > > I've suggested a fix for the firmware in a separate email. So I would say > that this issue is resolved after if Dan can run > his test successfully with the adjusted firmware. > > If Dan's test is successful, we don’t need to revert these changes I'm concerned about other qemu-kvm users that do not upgrade their hypervisor at the same pace as their guest kernel. Especially for cloud providers that may be running latest mainline kernel on older qemu-kvm this will look like a pure kernel regression. Is there a quick fix we can carry in the kernel to support these forward references, at least until we know that qemu-kvm is no longer shipping the broken AML?
Re: [Qemu-devel] [PATCH] hw/acpi-build: build SRAT memory affinity structures for NVDIMM
On Wed, Feb 21, 2018 at 5:55 AM, Igor Mammedov wrote: > On Tue, 20 Feb 2018 17:17:58 -0800 > Dan Williams wrote: > >> On Tue, Feb 20, 2018 at 6:10 AM, Igor Mammedov wrote: >> > On Sat, 17 Feb 2018 14:31:35 +0800 >> > Haozhong Zhang wrote: >> > >> >> ACPI 6.2A Table 5-129 "SPA Range Structure" requires the proximity >> >> domain of a NVDIMM SPA range must match with corresponding entry in >> >> SRAT table. >> >> >> >> The address ranges of vNVDIMM in QEMU are allocated from the >> >> hot-pluggable address space, which is entirely covered by one SRAT >> >> memory affinity structure. However, users can set the vNVDIMM >> >> proximity domain in NFIT SPA range structure by the 'node' property of >> >> '-device nvdimm' to a value different than the one in the above SRAT >> >> memory affinity structure. >> >> >> >> In order to solve such proximity domain mismatch, this patch build one >> >> SRAT memory affinity structure for each NVDIMM device with the >> >> proximity domain used in NFIT. The remaining hot-pluggable address >> >> space is covered by one or multiple SRAT memory affinity structures >> >> with the proximity domain of the last node as before. >> >> >> >> Signed-off-by: Haozhong Zhang >> > If we consider hotpluggable system, correctly implemented OS should >> > be able pull proximity from Device::_PXM and override any value from SRAT. >> > Do we really have a problem here (anything that breaks if we would use >> > _PXM)? >> > Maybe we should add _PXM object to nvdimm device nodes instead of >> > massaging SRAT? >> >> Unfortunately _PXM is an awkward fit. Currently the proximity domain >> is attached to the SPA range structure. The SPA range may be >> associated with multiple DIMM devices and those individual NVDIMMs may >> have conflicting _PXM properties. > There shouldn't be any conflict here as NVDIMM device's _PXM method, > should override in runtime any proximity specified by parent scope. > (as parent scope I'd also count boot time NFIT/SRAT tables). 
> > To make it more clear we could clear valid proximity domain flag in SPA > like this: > > diff --git a/hw/acpi/nvdimm.c b/hw/acpi/nvdimm.c > index 59d6e42..131bca5 100644 > --- a/hw/acpi/nvdimm.c > +++ b/hw/acpi/nvdimm.c > @@ -260,9 +260,7 @@ nvdimm_build_structure_spa(GArray *structures, > DeviceState *dev) > */ > nfit_spa->flags = cpu_to_le16(1 /* Control region is strictly for > management during hot add/online > - operation */ | > - 2 /* Data in Proximity Domain field is > - valid*/); > + operation */); > > /* NUMA node. */ > nfit_spa->proximity_domain = cpu_to_le32(node); > >> Even if that was unified across >> DIMMs it is ambiguous whether a DIMM-device _PXM would relate to the >> device's control interface, or the assembled persistent memory SPA >> range. > I'm not sure what you mean under 'device's control interface', > could you clarify where the ambiguity comes from? There are multiple SPA range types. In addition to the typical Persistent Memory SPA range there are also Control Region SPA ranges for MMIO registers on the DIMM for Block Apertures and other purposes. > > I read spec as: _PXM applies to address range covered by NVDIMM > device it belongs to. No, an NVDIMM may contribute to multiple SPA ranges and those ranges may span sockets. > > As for assembled SPA, I'd assume that it applies to interleaved set > and all NVDIMMs with it should be on the same node. It's somewhat > irrelevant question though as QEMU so far implements only > 1:1:1/SPA:Region Mapping:NVDIMM Device/ > mapping. > > My main concern with using static configuration tables for proximity > mapping, we'd miss on hotplug side of equation. However if we start > from dynamic side first, we could later complement it with static > tables if there really were need for it. 
Especially when you consider the new HMAT table that wants to have proximity domains for describing performance characteristics of an address range relative to an initiator, the _PXM method on an individual NVDIMM device is a poor fit for describing a wider set.
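[Editorial sketch] For reference, the per-NVDIMM SRAT entry the patch emits is the ACPI Memory Affinity Structure (Type 1), a fixed 40-byte record that carries the proximity domain for one address range; emitting one per NVDIMM is what lets it match the domain in the NFIT SPA Range Structure. A minimal layout sketch (field names are illustrative, not QEMU's actual build_srat_memory() code):

```c
#include <stdint.h>

/* ACPI SRAT Memory Affinity Structure (Type 1), always 40 bytes.
 * One entry per NVDIMM lets proximity_domain here match the value
 * in the corresponding NFIT SPA Range Structure. */
#pragma pack(push, 1)
typedef struct {
    uint8_t  type;              /* 1 = Memory Affinity */
    uint8_t  length;            /* always 40 */
    uint32_t proximity_domain;
    uint16_t reserved1;
    uint32_t base_addr_lo;      /* range base, low 32 bits */
    uint32_t base_addr_hi;
    uint32_t length_lo;         /* range length, low 32 bits */
    uint32_t length_hi;
    uint32_t reserved2;
    uint32_t flags;             /* bit0 Enabled, bit1 Hot Pluggable,
                                   bit2 NonVolatile */
    uint8_t  reserved3[8];
} srat_mem_affinity;
#pragma pack(pop)
```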
Re: [Qemu-devel] [PATCH] hw/acpi-build: build SRAT memory affinity structures for NVDIMM
On Tue, Feb 20, 2018 at 6:10 AM, Igor Mammedov wrote: > On Sat, 17 Feb 2018 14:31:35 +0800 > Haozhong Zhang wrote: > >> ACPI 6.2A Table 5-129 "SPA Range Structure" requires the proximity >> domain of a NVDIMM SPA range must match with corresponding entry in >> SRAT table. >> >> The address ranges of vNVDIMM in QEMU are allocated from the >> hot-pluggable address space, which is entirely covered by one SRAT >> memory affinity structure. However, users can set the vNVDIMM >> proximity domain in NFIT SPA range structure by the 'node' property of >> '-device nvdimm' to a value different than the one in the above SRAT >> memory affinity structure. >> >> In order to solve such proximity domain mismatch, this patch build one >> SRAT memory affinity structure for each NVDIMM device with the >> proximity domain used in NFIT. The remaining hot-pluggable address >> space is covered by one or multiple SRAT memory affinity structures >> with the proximity domain of the last node as before. >> >> Signed-off-by: Haozhong Zhang > If we consider hotpluggable system, correctly implemented OS should > be able pull proximity from Device::_PXM and override any value from SRAT. > Do we really have a problem here (anything that breaks if we would use _PXM)? > Maybe we should add _PXM object to nvdimm device nodes instead of massaging > SRAT? Unfortunately _PXM is an awkward fit. Currently the proximity domain is attached to the SPA range structure. The SPA range may be associated with multiple DIMM devices and those individual NVDIMMs may have conflicting _PXM properties. Even if that was unified across DIMMs it is ambiguous whether a DIMM-device _PXM would relate to the device's control interface, or the assembled persistent memory SPA range.
Re: [Qemu-devel] [PATCH v2 7/8] migration/ram: ensure write persistence on loading compressed pages to PMEM
On Wed, Feb 7, 2018 at 10:37 AM, Dr. David Alan Gilbert wrote: > * Dan Williams (dan.j.willi...@intel.com) wrote: >> On Wed, Feb 7, 2018 at 10:08 AM, Dr. David Alan Gilbert >> wrote: >> > * Dan Williams (dan.j.willi...@intel.com) wrote: >> >> On Wed, Feb 7, 2018 at 5:24 AM, Dr. David Alan Gilbert >> >> wrote: >> >> > * Haozhong Zhang (haozhong.zh...@intel.com) wrote: >> >> >> On 02/07/18 13:03 +, Dr. David Alan Gilbert wrote: >> >> >> > * Haozhong Zhang (haozhong.zh...@intel.com) wrote: >> >> >> > > On 02/07/18 11:54 +, Dr. David Alan Gilbert wrote: >> >> >> > > > * Haozhong Zhang (haozhong.zh...@intel.com) wrote: >> >> >> > > > > When loading a compressed page to persistent memory, flush CPU >> >> >> > > > > cache >> >> >> > > > > after the data is decompressed. Combined with a call to >> >> >> > > > > pmem_drain() >> >> >> > > > > at the end of memory loading, we can guarantee those >> >> >> > > > > compressed pages >> >> >> > > > > are persistently loaded to PMEM. >> >> >> > > > >> >> >> > > > Can you explain why this can use the flush and doesn't need the >> >> >> > > > special >> >> >> > > > memset? >> >> >> > > >> >> >> > > The best approach to ensure the write persistence is to operate >> >> >> > > pmem >> >> >> > > all via libpmem, e.g., pmem_memcpy_nodrain() + pmem_drain(). >> >> >> > > However, >> >> >> > > the write to pmem in this case is performed by uncompress() which >> >> >> > > is >> >> >> > > implemented out of QEMU and libpmem. It may or may not use libpmem, >> >> >> > > which is not controlled by QEMU. Therefore, we have to use the less >> >> >> > > optimal approach, that is to flush cache for all pmem addresses >> >> >> > > that >> >> >> > > uncompress() may have written, i.e.,/e.g., memcpy() and/or >> >> >> > > memset() in >> >> >> > > uncompress(), and pmem_flush() + pmem_drain() in QEMU. >> >> >> > >> >> >> > In what way is it less optimal? 
>> >> >> > If that's a legal thing to do, then why not just do a pmem_flush + >> >> >> > pmem_drain right at the end of the ram loading and leave all the >> >> >> > rest of >> >> >> > the code untouched? >> >> >> >> >> >> For example, the implementation pmem_memcpy_nodrain() prefers to use >> >> >> movnt instructions w/o flush to write pmem if those instructions are >> >> >> available, and falls back to memcpy() + flush if movnt are not >> >> >> available, so I suppose the latter is less optimal. >> >> > >> >> > But if you use normal memcpy calls to copy a few GB of RAM in an >> >> > incoming migrate and then do a single flush at the end, isn't that >> >> > better? >> >> >> >> Not really, because now you've needlessly polluted the cache and are >> >> spending CPU looping over the cachelines that could have been bypassed >> >> with movnt. >> > >> > What's different in the pmem case? Isn't what you've said true in the >> > normal migrate case as well? >> > >> >> In the normal migrate case the memory is volatile so once the copy is >> globally visiable you're done. In the pmem case the migration is not >> complete until the persistent state is synchronized. >> >> Note, I'm talking in generalities because I don't know the deeper >> details of the migrate flow. > > On the receive side of a migrate, during a normal precopy migrate > (without xbzrle) nothing is going to be reading that RAM until > the whole RAM load has completed anyway - so we're not benefiting from > it being in the caches either. > > In the pmem case, again since nothing is going to be reading from that > RAM until the end anyway, why bother flushing as we go as opposed to at > the end? Flushing at the end implies doing a large loop flushing the caches at the end of the transfer because the x86 ISA only exposes a line-by-line flush to unprivileged code rather than a full cache flush like what the kernel can do with wbinvd. 
So, better to flush as we go rather than incur the overhead of the loop at the end. I.e. I'm assuming it is more efficient to do 'movnt' in the first instance and not worry about the flush loop.
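[Editorial sketch] To put a number on the "large loop" cost: unprivileged flushing is per cache line (clflush/clflushopt cover 64 bytes each on current x86), so a deferred flush of a multi-GB transfer means tens of millions of flush instructions. A small sketch of the arithmetic (illustrative, not QEMU or libpmem code):

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64u

/* Number of clflush/clflushopt operations needed to cover [addr, addr+len).
 * A real flusher (e.g. libpmem's pmem_flush()) issues one such instruction
 * per line and then fences with pmem_drain(); movnt stores bypass the cache
 * entirely and need only the final fence. */
static size_t flush_line_count(uintptr_t addr, size_t len)
{
    uintptr_t start = addr & ~(uintptr_t)(CACHE_LINE - 1);
    uintptr_t end   = addr + len;

    return (end - start + CACHE_LINE - 1) / CACHE_LINE;
}
```

Flushing even 1 GiB line-by-line is over 16 million instructions, which is the loop Dan is arguing to avoid by streaming with movnt as the data arrives.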
Re: [Qemu-devel] [PATCH v2 7/8] migration/ram: ensure write persistence on loading compressed pages to PMEM
On Wed, Feb 7, 2018 at 10:08 AM, Dr. David Alan Gilbert wrote: > * Dan Williams (dan.j.willi...@intel.com) wrote: >> On Wed, Feb 7, 2018 at 5:24 AM, Dr. David Alan Gilbert >> wrote: >> > * Haozhong Zhang (haozhong.zh...@intel.com) wrote: >> >> On 02/07/18 13:03 +, Dr. David Alan Gilbert wrote: >> >> > * Haozhong Zhang (haozhong.zh...@intel.com) wrote: >> >> > > On 02/07/18 11:54 +, Dr. David Alan Gilbert wrote: >> >> > > > * Haozhong Zhang (haozhong.zh...@intel.com) wrote: >> >> > > > > When loading a compressed page to persistent memory, flush CPU >> >> > > > > cache >> >> > > > > after the data is decompressed. Combined with a call to >> >> > > > > pmem_drain() >> >> > > > > at the end of memory loading, we can guarantee those compressed >> >> > > > > pages >> >> > > > > are persistently loaded to PMEM. >> >> > > > >> >> > > > Can you explain why this can use the flush and doesn't need the >> >> > > > special >> >> > > > memset? >> >> > > >> >> > > The best approach to ensure the write persistence is to operate pmem >> >> > > all via libpmem, e.g., pmem_memcpy_nodrain() + pmem_drain(). However, >> >> > > the write to pmem in this case is performed by uncompress() which is >> >> > > implemented out of QEMU and libpmem. It may or may not use libpmem, >> >> > > which is not controlled by QEMU. Therefore, we have to use the less >> >> > > optimal approach, that is to flush cache for all pmem addresses that >> >> > > uncompress() may have written, i.e.,/e.g., memcpy() and/or memset() in >> >> > > uncompress(), and pmem_flush() + pmem_drain() in QEMU. >> >> > >> >> > In what way is it less optimal? >> >> > If that's a legal thing to do, then why not just do a pmem_flush + >> >> > pmem_drain right at the end of the ram loading and leave all the rest of >> >> > the code untouched? 
>> >> >> >> For example, the implementation pmem_memcpy_nodrain() prefers to use >> >> movnt instructions w/o flush to write pmem if those instructions are >> >> available, and falls back to memcpy() + flush if movnt are not >> >> available, so I suppose the latter is less optimal. >> > >> > But if you use normal memcpy calls to copy a few GB of RAM in an >> > incoming migrate and then do a single flush at the end, isn't that >> > better? >> >> Not really, because now you've needlessly polluted the cache and are >> spending CPU looping over the cachelines that could have been bypassed >> with movnt. > > What's different in the pmem case? Isn't what you've said true in the > normal migrate case as well? > In the normal migrate case the memory is volatile so once the copy is globally visible you're done. In the pmem case the migration is not complete until the persistent state is synchronized. Note, I'm talking in generalities because I don't know the deeper details of the migrate flow.
Re: [Qemu-devel] [PATCH v2 7/8] migration/ram: ensure write persistence on loading compressed pages to PMEM
On Wed, Feb 7, 2018 at 5:24 AM, Dr. David Alan Gilbert wrote: > * Haozhong Zhang (haozhong.zh...@intel.com) wrote: >> On 02/07/18 13:03 +, Dr. David Alan Gilbert wrote: >> > * Haozhong Zhang (haozhong.zh...@intel.com) wrote: >> > > On 02/07/18 11:54 +, Dr. David Alan Gilbert wrote: >> > > > * Haozhong Zhang (haozhong.zh...@intel.com) wrote: >> > > > > When loading a compressed page to persistent memory, flush CPU cache >> > > > > after the data is decompressed. Combined with a call to pmem_drain() >> > > > > at the end of memory loading, we can guarantee those compressed pages >> > > > > are persistently loaded to PMEM. >> > > > >> > > > Can you explain why this can use the flush and doesn't need the special >> > > > memset? >> > > >> > > The best approach to ensure the write persistence is to operate pmem >> > > all via libpmem, e.g., pmem_memcpy_nodrain() + pmem_drain(). However, >> > > the write to pmem in this case is performed by uncompress() which is >> > > implemented out of QEMU and libpmem. It may or may not use libpmem, >> > > which is not controlled by QEMU. Therefore, we have to use the less >> > > optimal approach, that is to flush cache for all pmem addresses that >> > > uncompress() may have written, i.e.,/e.g., memcpy() and/or memset() in >> > > uncompress(), and pmem_flush() + pmem_drain() in QEMU. >> > >> > In what way is it less optimal? >> > If that's a legal thing to do, then why not just do a pmem_flush + >> > pmem_drain right at the end of the ram loading and leave all the rest of >> > the code untouched? >> >> For example, the implementation pmem_memcpy_nodrain() prefers to use >> movnt instructions w/o flush to write pmem if those instructions are >> available, and falls back to memcpy() + flush if movnt are not >> available, so I suppose the latter is less optimal. > > But if you use normal memcpy calls to copy a few GB of RAM in an > incoming migrate and then do a single flush at the end, isn't that > better? 
Not really, because now you've needlessly polluted the cache and are spending CPU looping over the cachelines that could have been bypassed with movnt.
Re: [Qemu-devel] [PATCH v4 0/6] nvdimm: support MAP_SYNC for memory-backend-file
[ adding Michal and lsf-pci ] On Wed, Jan 31, 2018 at 7:02 PM, Dan Williams wrote: > On Wed, Jan 31, 2018 at 6:29 PM, Haozhong Zhang > wrote: >> + vfio maintainer Alex Williamson in case my understanding of vfio is >> incorrect. >> >> On 01/31/18 16:32 -0800, Dan Williams wrote: >>> On Wed, Jan 31, 2018 at 4:24 PM, Haozhong Zhang >>> wrote: >>> > On 01/31/18 16:08 -0800, Dan Williams wrote: >>> >> On Wed, Jan 31, 2018 at 4:02 PM, Haozhong Zhang >>> >> wrote: >>> >> > On 01/31/18 14:25 -0800, Dan Williams wrote: >>> >> >> On Tue, Jan 30, 2018 at 10:02 PM, Haozhong Zhang >>> >> >> wrote: >>> >> >> > Linux 4.15 introduces a new mmap flag MAP_SYNC, which can be used to >>> >> >> > guarantee the write persistence to mmap'ed files supporting DAX >>> >> >> > (e.g., >>> >> >> > files on ext4/xfs file system mounted with '-o dax'). >>> >> >> >>> >> >> Wait, MAP_SYNC does not guarantee persistence. It makes sure that the >>> >> >> metadata is in sync after a fault. However, that does not make >>> >> >> filesystem-DAX safe for use with QEMU, because we still need to >>> >> >> coordinate DMA with fileystem operations. There is no way to do that >>> >> >> coordination from within a guest. QEMU needs to use device-dax if the >>> >> >> guest might ever perform DMA to a virtual-pmem range. See this patch >>> >> >> set for more details on the DAX vs DMA problem [1]. I think we need to >>> >> >> enforce this in the host kernel. I.e. do not allow file backed DAX >>> >> >> pages to be mapped in EPT entries unless / until we have a solution to >>> >> >> the DMA synchronization problem. Apologies for not noticing this >>> >> >> earlier. >>> >> > >>> >> > QEMU does not truncate or punch holes of the file once it has been >>> >> > mmap()'ed. Does the problem [1] still exist in such case? >>> >> >>> >> Something else on the system might. 
The only agent that could enforce >>> >> protection is the kernel, and the kernel will likely just disallow >>> >> passing addresses from filesystem-dax vmas through to a guest >>> >> altogether. I think there's even a problem in the non-DAX case unless >>> >> KVM is pinning pages while they are handed out to a guest. The problem >>> >> is that we don't have a page cache page to pin in the DAX case. >>> >> >>> > >>> > Does it mean any user-space code like >>> > ptr = mmap(..., fd, ...); // fd refers to a file on DAX filesystem >>> > // make DMA to ptr >>> > is unsafe? >>> >>> Yes, it is currently unsafe because there is no coordination with the >>> filesytem if it decides to make block layout changes. We can fix that >>> in the non-virtualization case by having the filesystem wait for DMA >>> completion callbacks (i.e. what for all pages to be idle), but as far >>> as I can see we can't do the same coordination for DMA initiated by a >>> guest device driver. >>> >> >> I think that fix [1] also works for KVM/QEMU. The guest DMA are >> performed on two types of devices: >> >> 1. For emulated devices, the guest DMA requests are trapped and >>actually performed by QEMU on the host side. The host side fix [1] >>can cover this case. >> >> 2. For passthrough devices, vfio pins all pages, including those >>backed by dax mode files, used by the guest if any device is >>passthroughed to it. If I read the commit message in [2] correctly, >>operations that change the page-to-file offset association of pages >>from dax mode files will be deferred until the reference count of >>the affected pages becomes 1. That is, if any passthrough device >>is used with a VM, the changes of page-to-file offset will not be >>able to happen until the VM is shutdown, so the fix [1] still takes >>effect here. > > This sounds like a longterm mapping under control of vfio and not the > filesystem. See get_user_pages_longterm(), it is a problem if pages > are pinned indefinitely especially DAX. 
It sounds like vfio is in the > same boat as RDMA and cannot support long lived pins of DAX pages. As > of 4.15 RDMA to filesystem-DAX pages has been disabled. The eventual > fix will be to create a "memory-registration with lease" semantic > available for RDMA so that the kernel can forcibly revoke page pinning > to perform physical layout changes. In the near it seems > vaddr_get_pfn() needs to be fixed to use get_user_pages_longterm() so > that filesystem-dax mappings are explicitly disallowed. > >> Another question is how a user-space application (e.g., QEMU) knows >> whether it's safe to mmap a file on the DAX file system? > > I think we fix vaddr_get_pfn() to start failing for DAX mappings > unless/until we can add a "with lease" mechanism. Userspace will know > when it is safe again when vfio stops failing. Btw, there is an LSF/MM topic proposal on this subject [1]. [1]: https://lists.01.org/pipermail/linux-nvdimm/2018-January/013935.html
Re: [Qemu-devel] [PATCH v4 0/6] nvdimm: support MAP_SYNC for memory-backend-file
On Wed, Jan 31, 2018 at 6:29 PM, Haozhong Zhang wrote: > + vfio maintainer Alex Williamson in case my understanding of vfio is > incorrect. > > On 01/31/18 16:32 -0800, Dan Williams wrote: >> On Wed, Jan 31, 2018 at 4:24 PM, Haozhong Zhang >> wrote: >> > On 01/31/18 16:08 -0800, Dan Williams wrote: >> >> On Wed, Jan 31, 2018 at 4:02 PM, Haozhong Zhang >> >> wrote: >> >> > On 01/31/18 14:25 -0800, Dan Williams wrote: >> >> >> On Tue, Jan 30, 2018 at 10:02 PM, Haozhong Zhang >> >> >> wrote: >> >> >> > Linux 4.15 introduces a new mmap flag MAP_SYNC, which can be used to >> >> >> > guarantee the write persistence to mmap'ed files supporting DAX >> >> >> > (e.g., >> >> >> > files on ext4/xfs file system mounted with '-o dax'). >> >> >> >> >> >> Wait, MAP_SYNC does not guarantee persistence. It makes sure that the >> >> >> metadata is in sync after a fault. However, that does not make >> >> >> filesystem-DAX safe for use with QEMU, because we still need to >> >> >> coordinate DMA with fileystem operations. There is no way to do that >> >> >> coordination from within a guest. QEMU needs to use device-dax if the >> >> >> guest might ever perform DMA to a virtual-pmem range. See this patch >> >> >> set for more details on the DAX vs DMA problem [1]. I think we need to >> >> >> enforce this in the host kernel. I.e. do not allow file backed DAX >> >> >> pages to be mapped in EPT entries unless / until we have a solution to >> >> >> the DMA synchronization problem. Apologies for not noticing this >> >> >> earlier. >> >> > >> >> > QEMU does not truncate or punch holes of the file once it has been >> >> > mmap()'ed. Does the problem [1] still exist in such case? >> >> >> >> Something else on the system might. The only agent that could enforce >> >> protection is the kernel, and the kernel will likely just disallow >> >> passing addresses from filesystem-dax vmas through to a guest >> >> altogether. 
I think there's even a problem in the non-DAX case unless >> >> KVM is pinning pages while they are handed out to a guest. The problem >> >> is that we don't have a page cache page to pin in the DAX case. >> >> >> > >> > Does it mean any user-space code like >> > ptr = mmap(..., fd, ...); // fd refers to a file on DAX filesystem >> > // make DMA to ptr >> > is unsafe? >> >> Yes, it is currently unsafe because there is no coordination with the >> filesytem if it decides to make block layout changes. We can fix that >> in the non-virtualization case by having the filesystem wait for DMA >> completion callbacks (i.e. what for all pages to be idle), but as far >> as I can see we can't do the same coordination for DMA initiated by a >> guest device driver. >> > > I think that fix [1] also works for KVM/QEMU. The guest DMA are > performed on two types of devices: > > 1. For emulated devices, the guest DMA requests are trapped and >actually performed by QEMU on the host side. The host side fix [1] >can cover this case. > > 2. For passthrough devices, vfio pins all pages, including those >backed by dax mode files, used by the guest if any device is >passthroughed to it. If I read the commit message in [2] correctly, >operations that change the page-to-file offset association of pages >from dax mode files will be deferred until the reference count of >the affected pages becomes 1. That is, if any passthrough device >is used with a VM, the changes of page-to-file offset will not be >able to happen until the VM is shutdown, so the fix [1] still takes >effect here. This sounds like a longterm mapping under control of vfio and not the filesystem. See get_user_pages_longterm(), it is a problem if pages are pinned indefinitely especially DAX. It sounds like vfio is in the same boat as RDMA and cannot support long lived pins of DAX pages. As of 4.15 RDMA to filesystem-DAX pages has been disabled. 
The eventual fix will be to create a "memory-registration with lease" semantic available for RDMA so that the kernel can forcibly revoke page pinning to perform physical layout changes. In the near term it seems vaddr_get_pfn() needs to be fixed to use get_user_pages_longterm() so that filesystem-dax mappings are explicitly disallowed. > Another question is how a user-space application (e.g., QEMU) knows > whether it's safe to mmap a file on the DAX file system? I think we fix vaddr_get_pfn() to start failing for DAX mappings unless/until we can add a "with lease" mechanism. Userspace will know when it is safe again when vfio stops failing.
Re: [Qemu-devel] [PATCH v4 0/6] nvdimm: support MAP_SYNC for memory-backend-file
On Wed, Jan 31, 2018 at 4:24 PM, Haozhong Zhang wrote: > On 01/31/18 16:08 -0800, Dan Williams wrote: >> On Wed, Jan 31, 2018 at 4:02 PM, Haozhong Zhang >> wrote: >> > On 01/31/18 14:25 -0800, Dan Williams wrote: >> >> On Tue, Jan 30, 2018 at 10:02 PM, Haozhong Zhang >> >> wrote: >> >> > Linux 4.15 introduces a new mmap flag MAP_SYNC, which can be used to >> >> > guarantee the write persistence to mmap'ed files supporting DAX (e.g., >> >> > files on ext4/xfs file system mounted with '-o dax'). >> >> >> >> Wait, MAP_SYNC does not guarantee persistence. It makes sure that the >> >> metadata is in sync after a fault. However, that does not make >> >> filesystem-DAX safe for use with QEMU, because we still need to >> >> coordinate DMA with fileystem operations. There is no way to do that >> >> coordination from within a guest. QEMU needs to use device-dax if the >> >> guest might ever perform DMA to a virtual-pmem range. See this patch >> >> set for more details on the DAX vs DMA problem [1]. I think we need to >> >> enforce this in the host kernel. I.e. do not allow file backed DAX >> >> pages to be mapped in EPT entries unless / until we have a solution to >> >> the DMA synchronization problem. Apologies for not noticing this >> >> earlier. >> > >> > QEMU does not truncate or punch holes of the file once it has been >> > mmap()'ed. Does the problem [1] still exist in such case? >> >> Something else on the system might. The only agent that could enforce >> protection is the kernel, and the kernel will likely just disallow >> passing addresses from filesystem-dax vmas through to a guest >> altogether. I think there's even a problem in the non-DAX case unless >> KVM is pinning pages while they are handed out to a guest. The problem >> is that we don't have a page cache page to pin in the DAX case. >> > > Does it mean any user-space code like > ptr = mmap(..., fd, ...); // fd refers to a file on DAX filesystem > // make DMA to ptr > is unsafe? 
Yes, it is currently unsafe because there is no coordination with the filesystem if it decides to make block layout changes. We can fix that in the non-virtualization case by having the filesystem wait for DMA completion callbacks (i.e. wait for all pages to be idle), but as far as I can see we can't do the same coordination for DMA initiated by a guest device driver.
Re: [Qemu-devel] [PATCH v4 0/6] nvdimm: support MAP_SYNC for memory-backend-file
On Wed, Jan 31, 2018 at 4:02 PM, Haozhong Zhang wrote: > On 01/31/18 14:25 -0800, Dan Williams wrote: >> On Tue, Jan 30, 2018 at 10:02 PM, Haozhong Zhang >> wrote: >> > Linux 4.15 introduces a new mmap flag MAP_SYNC, which can be used to >> > guarantee the write persistence to mmap'ed files supporting DAX (e.g., >> > files on ext4/xfs file system mounted with '-o dax'). >> >> Wait, MAP_SYNC does not guarantee persistence. It makes sure that the >> metadata is in sync after a fault. However, that does not make >> filesystem-DAX safe for use with QEMU, because we still need to >> coordinate DMA with fileystem operations. There is no way to do that >> coordination from within a guest. QEMU needs to use device-dax if the >> guest might ever perform DMA to a virtual-pmem range. See this patch >> set for more details on the DAX vs DMA problem [1]. I think we need to >> enforce this in the host kernel. I.e. do not allow file backed DAX >> pages to be mapped in EPT entries unless / until we have a solution to >> the DMA synchronization problem. Apologies for not noticing this >> earlier. > > QEMU does not truncate or punch holes of the file once it has been > mmap()'ed. Does the problem [1] still exist in such case? Something else on the system might. The only agent that could enforce protection is the kernel, and the kernel will likely just disallow passing addresses from filesystem-dax vmas through to a guest altogether. I think there's even a problem in the non-DAX case unless KVM is pinning pages while they are handed out to a guest. The problem is that we don't have a page cache page to pin in the DAX case.
Re: [Qemu-devel] [PATCH v4 0/6] nvdimm: support MAP_SYNC for memory-backend-file
On Tue, Jan 30, 2018 at 10:02 PM, Haozhong Zhang wrote: > Linux 4.15 introduces a new mmap flag MAP_SYNC, which can be used to > guarantee the write persistence to mmap'ed files supporting DAX (e.g., > files on ext4/xfs file system mounted with '-o dax'). Wait, MAP_SYNC does not guarantee persistence. It makes sure that the metadata is in sync after a fault. However, that does not make filesystem-DAX safe for use with QEMU, because we still need to coordinate DMA with filesystem operations. There is no way to do that coordination from within a guest. QEMU needs to use device-dax if the guest might ever perform DMA to a virtual-pmem range. See this patch set for more details on the DAX vs DMA problem [1]. I think we need to enforce this in the host kernel. I.e. do not allow file backed DAX pages to be mapped in EPT entries unless / until we have a solution to the DMA synchronization problem. Apologies for not noticing this earlier. [1]: https://lists.01.org/pipermail/linux-nvdimm/2017-December/013704.html
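[Editorial sketch] The MAP_SYNC semantics under discussion are opt-in at mmap() time: the kernel only accepts the flag (passed via MAP_SHARED_VALIDATE so unknown flags are rejected) when the file actually supports DAX, so userspace gets a hard failure rather than silently losing the metadata-sync guarantee. A minimal probe, assuming Linux 4.15+ flag values (hard-coded as a fallback for older headers):

```c
#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MAP_SHARED_VALIDATE
#define MAP_SHARED_VALIDATE 0x03     /* Linux 4.15+: reject unknown flags */
#endif
#ifndef MAP_SYNC
#define MAP_SYNC 0x80000             /* Linux 4.15+: synchronous page faults */
#endif

/* Returns 1 if the kernel accepted a MAP_SYNC mapping of fd (i.e. the file
 * is on a DAX-capable filesystem), 0 otherwise. On non-DAX filesystems the
 * mmap() fails with EOPNOTSUPP; on pre-4.15 kernels, with EINVAL. */
static int try_map_sync(int fd, size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);

    if (p == MAP_FAILED)
        return 0;
    munmap(p, len);
    return 1;
}
```

Note this probe only tells QEMU that MAP_SYNC is available; per the thread above, it says nothing about the separate DMA-vs-filesystem coordination problem.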