On 04/06/17 20:02 +0800, Xiao Guangrong wrote:
> On 04/06/2017 05:43 PM, Stefan Hajnoczi wrote:
> > On Fri, Mar 31, 2017 at 04:41:43PM +0800, Haozhong Zhang wrote:
> > > This patch series constructs the flush hint address structures for
> > > nvdimm devices in QEMU.
> > >
> > > It's of course not for 2.9. I send it out early in order to get
> > > comments on one point I'm uncertain about (see the detailed
> > > explanation below). Thanks for any comments in advance!
> > >
> > > Background
> > > ---------------
> >
> > Extra background:
> >
> > Flush Hint Addresses are necessary because:
> >
> > 1. Some hardware configurations may require them. In other words, a
> >    cache flush instruction is not enough to persist data.
> >
> > 2. The host file system may need fsync(2) calls (e.g. to persist
> >    metadata changes).
> >
> > Without Flush Hint Addresses only some NVDIMM configurations actually
> > guarantee data persistence.
> >
> > > Flush hint address structure is a substructure of NFIT and specifies
> > > one or more addresses, namely Flush Hint Addresses. Software can write
> > > to any one of these flush hint addresses to cause any preceding writes
> > > to the NVDIMM region to be flushed out of the intervening platform
> > > buffers to the targeted NVDIMM. More details can be found in ACPI Spec
> > > 6.1, Section 5.2.25.8 "Flush Hint Address Structure".
> >
> > Do you have performance data? I'm concerned that the Flush Hint Address
> > hardware interface is not virtualization-friendly.
> >
> > In Linux, drivers/nvdimm/region_devs.c:nvdimm_flush() does:
> >
> >     wmb();
> >     for (i = 0; i < nd_region->ndr_mappings; i++)
> >         if (ndrd_get_flush_wpq(ndrd, i, 0))
> >             writeq(1, ndrd_get_flush_wpq(ndrd, i, idx));
> >     wmb();
> >
> > That looks pretty lightweight - it's an MMIO write between write
> > barriers.
> >
> > This patch implements the MMIO write like this:
> >
> >     void nvdimm_flush(NVDIMMDevice *nvdimm)
> >     {
> >         if (nvdimm->backend_fd != -1) {
> >             /*
> >              * If the backend store is a physical NVDIMM device, fsync()
> >              * will trigger the flush via the flush hint on the host
> >              * device.
> >              */
> >             fsync(nvdimm->backend_fd);
> >         }
> >     }
> >
> > The MMIO store instruction is turned into a synchronous fsync(2) system
> > call plus vmexit/vmenter and a QEMU userspace context switch:
> >
> > 1. The vcpu blocks during the fsync(2) system call. The MMIO write
> >    instruction has an unexpected and huge latency.
> >
> > 2. The vcpu thread holds the QEMU global mutex so all other threads
> >    (including the monitor) are blocked during fsync(2). Other vcpu
> >    threads may block if they vmexit.
> >
> > It is hard to implement this efficiently in QEMU. This is why I said
> > the hardware interface is not virtualization-friendly. It's cheap on
> > real hardware but expensive under virtualization.
> >
> > We should think about the optimal way of implementing Flush Hint
> > Addresses in QEMU. But if there is no reasonable way to implement them
> > then I think it's better *not* to implement them, just like the Block
> > Window feature, which is also not virtualization-friendly. Users who
> > want a block device can use virtio-blk. I don't think NVDIMM Block
> > Window can achieve better performance than virtio-blk under
> > virtualization (although I'm happy to be proven wrong).
> >
> > Some ideas for a faster implementation:
> >
> > 1. Use memory_region_clear_global_locking() to avoid taking the QEMU
> >    global mutex.
> >    Little synchronization is necessary as long as the NVDIMM device
> >    isn't hot unplugged (not yet supported anyway).
> >
> > 2. Can the host kernel provide a way to mmap Address Flush Hints from
> >    the physical NVDIMM in cases where the configuration does not
> >    require host kernel interception? That way QEMU can map the physical
> >    NVDIMM's Address Flush Hints directly into the guest. The hypervisor
> >    is bypassed and performance would be good.
> >
> > I'm not sure there is anything we can do to make the case where the
> > host kernel wants an fsync(2) fast :(.
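
For 1., something along these lines might work (an untested sketch; the
flush_hint_mr field and the function names are made up for
illustration, and nvdimm_flush() is the helper from this series):

    #include "qemu/osdep.h"
    #include "exec/memory.h"
    #include "hw/mem/nvdimm.h"

    /* Hypothetical MMIO handlers for a vNVDIMM flush hint page. */
    static uint64_t nvdimm_flush_hint_read(void *opaque, hwaddr addr,
                                           unsigned size)
    {
        return 0;
    }

    static void nvdimm_flush_hint_write(void *opaque, hwaddr addr,
                                        uint64_t data, unsigned size)
    {
        NVDIMMDevice *nvdimm = opaque;

        /* Runs without the BQL once global locking is cleared below. */
        nvdimm_flush(nvdimm);
    }

    static const MemoryRegionOps nvdimm_flush_hint_ops = {
        .read = nvdimm_flush_hint_read,
        .write = nvdimm_flush_hint_write,
        .endianness = DEVICE_LITTLE_ENDIAN,
    };

    static void nvdimm_register_flush_hint(NVDIMMDevice *nvdimm,
                                           MemoryRegion *sysmem,
                                           hwaddr base)
    {
        memory_region_init_io(&nvdimm->flush_hint_mr, OBJECT(nvdimm),
                              &nvdimm_flush_hint_ops, nvdimm,
                              "nvdimm-flush-hint", 4096);
        /* Dispatch flush-hint writes outside the QEMU global mutex. */
        memory_region_clear_global_locking(&nvdimm->flush_hint_mr);
        memory_region_add_subregion(sysmem, base, &nvdimm->flush_hint_mr);
    }

That still leaves the writing vcpu blocked in fsync(2), but at least
other vcpus and the monitor would not be stalled behind the global
mutex.
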
> Good point.
>
> We can assume flush-CPU-cache-to-make-persistence is always
> available on Intel hardware, so that the flush-hint-table is not
> needed if the vNVDIMM is based on a real Intel NVDIMM device.

We can let users of qemu (e.g. libvirt) detect whether the backend
device supports ADR, and pass the 'flush-hint' option to qemu only if
ADR is not supported.

> If the vNVDIMM device is based on the regular file, I think
> fsync is the bottleneck rather than this mmio-virtualization. :(

Yes, fsync() on the regular file is the bottleneck. We may either

1/ perform the host-side flush in an asynchronous way which will not
   block the vcpu for too long, or

2/ not provide a strong durability guarantee for non-NVDIMM backends
   and not emulate the flush hint for the guest at all.

(I know 1/ does not provide a strong durability guarantee either.)
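
For 1/, a rough and untested sketch of what I have in mind, assuming we
can reuse QEMU's thread pool (nvdimm_flush_async, nvdimm_flush_worker
and nvdimm_flush_complete are made-up names; backend_fd is from this
series):

    #include "qemu/osdep.h"
    #include "block/aio.h"
    #include "block/thread-pool.h"
    #include "hw/mem/nvdimm.h"

    /* Worker: runs in a thread-pool thread, not in the vcpu thread. */
    static int nvdimm_flush_worker(void *opaque)
    {
        int fd = (intptr_t)opaque;

        return fsync(fd) == 0 ? 0 : -errno;
    }

    static void nvdimm_flush_complete(void *opaque, int ret)
    {
        if (ret < 0) {
            /* TODO: how do we report a failed flush to the guest? */
        }
    }

    /* Called from the flush-hint MMIO write handler. */
    void nvdimm_flush_async(NVDIMMDevice *nvdimm)
    {
        ThreadPool *pool = aio_get_thread_pool(qemu_get_aio_context());

        if (nvdimm->backend_fd != -1) {
            thread_pool_submit_aio(pool, nvdimm_flush_worker,
                                   (void *)(intptr_t)nvdimm->backend_fd,
                                   nvdimm_flush_complete, NULL);
        }
    }

The guest's MMIO write would then complete before the data is actually
durable, which is exactly the weaker guarantee mentioned above.

Haozhong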