On Wed, May 09, 2018 at 07:54:31PM +0200, Max Reitz wrote:
> On 2018-05-09 12:16, Stefan Hajnoczi wrote:
> > On Tue, May 08, 2018 at 05:03:09PM +0200, Kevin Wolf wrote:
> >> On 08.05.2018 at 16:41, Eric Blake wrote:
> >>> On 12/25/2017 01:33 AM, He Junyan wrote:
> >>
> >> 2. Make the nvdimm device use the QEMU block layer so that it is
> >>    backed by a non-raw disk image (such as a qcow2 file representing
> >>    the content of the nvdimm) that supports snapshots.
> >>
> >>    This part is hard because it requires some completely new
> >>    infrastructure such as mapping clusters of the image file to guest
> >>    pages, and doing cluster allocation (including the copy on write
> >>    logic) by handling guest page faults.
> >>
> >> I think it makes sense to invest some effort into such interfaces,
> >> but be prepared for a long journey.
> >
> > I like the suggestion but it needs to be followed up with a concrete
> > design that is feasible and fair for Junyan and others to implement.
> > Otherwise the "long journey" is really just a way of rejecting this
> > feature.
> >
> > Let's discuss the details of using the block layer for NVDIMM and try
> > to come up with a plan.
> >
> > The biggest issue with using the block layer is that persistent
> > memory applications use load/store instructions to directly access
> > data. This is fundamentally different from the block layer, which
> > transfers blocks of data to and from the device.
> >
> > Because of block DMA, QEMU is able to perform processing at each
> > block driver graph node. This doesn't exist for persistent memory
> > because software does not trap I/O. Therefore the concept of filter
> > nodes doesn't make sense for persistent memory - we certainly do not
> > want to trap every I/O because performance would be terrible.
> >
> > Another difference is that persistent memory I/O is synchronous.
> > Load/store instructions execute quickly. Perhaps we could use KVM
> > async page faults in cases where QEMU needs to perform processing,
> > but again the performance would be bad.
> 
> Let me first say that I have no idea how the interface to NVDIMM looks.
> I just assume it works pretty much like normal RAM (so the interface is
> just that it’s a part of the physical address space).
> 
> Also, it sounds a bit like you are already discarding my idea, but here
> goes anyway.
> 
> Would it be possible to introduce a buffering block driver that
> presents the guest an area of RAM/NVDIMM through an NVDIMM interface
> (so I suppose as part of the guest address space)? For writing, we’d
> keep a dirty bitmap on it, and then we’d asynchronously move the dirty
> areas through the block layer, so basically like mirror. On flushing,
> we’d block until everything is clean.
> 
> For reading, we’d follow a COR/stream model, basically, where
> everything is unpopulated in the beginning and everything is loaded
> through the block layer both asynchronously all the time and on-demand
> whenever the guest needs something that has not been loaded yet.
> 
> Now I notice that that looks pretty much like a backing file model
> where we constantly run both a stream and a commit job at the same
> time.
> 
> The user could decide how much memory to use for the buffer, so it
> could either hold everything or be partially unallocated.
> 
> You’d probably want to back the buffer by NVDIMM normally, so that
> nothing is lost on crashes (though this would imply that for partial
> allocation the buffering block driver would need to know the mapping
> between the area in real NVDIMM and its virtual representation of it).
> 
> Just my two cents while scanning through qemu-block to find emails that
> don’t actually concern me...
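(For concreteness, a rough userspace sketch of the dirty-bitmap writeback
scheme described above. It is not QEMU block layer code: buffer_write(),
writeback_page() and buffer_flush() are made-up names, a plain pwrite()
backend stands in for the block layer, and real dirty tracking would come
from dirty memory logging or write-protection faults rather than an
explicit call.)

    /* Sketch of the proposed buffering scheme: guest stores land in a
     * RAM/NVDIMM buffer, touched pages are tracked in a bitmap and written
     * back to a backing file; flush only returns once everything is clean.
     */
    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>
    #include <unistd.h>

    #define PAGE_SIZE  4096
    #define BUF_PAGES  1024                          /* 4 MiB guest-visible buffer */

    static uint8_t  buffer[BUF_PAGES * PAGE_SIZE];   /* area mapped into the guest */
    static uint64_t dirty[BUF_PAGES / 64];           /* one bit per buffer page */

    static void set_dirty(size_t p)   { dirty[p / 64] |=  UINT64_C(1) << (p % 64); }
    static void clear_dirty(size_t p) { dirty[p / 64] &= ~(UINT64_C(1) << (p % 64)); }
    static bool is_dirty(size_t p)    { return dirty[p / 64] >> (p % 64) & 1; }

    /* Guest write path: copy into the buffer and mark the touched pages dirty. */
    static void buffer_write(size_t offset, const void *data, size_t len)
    {
        memcpy(buffer + offset, data, len);
        for (size_t p = offset / PAGE_SIZE; p <= (offset + len - 1) / PAGE_SIZE; p++) {
            set_dirty(p);
        }
    }

    /* Background writeback of one dirty page, like one iteration of mirror. */
    static int writeback_page(int backing_fd, size_t page)
    {
        if (pwrite(backing_fd, buffer + page * PAGE_SIZE,
                   PAGE_SIZE, (off_t)(page * PAGE_SIZE)) != PAGE_SIZE) {
            return -1;
        }
        clear_dirty(page);
        return 0;
    }

    /* Guest flush: block until every dirty page has been written back. */
    static int buffer_flush(int backing_fd)
    {
        for (size_t p = 0; p < BUF_PAGES; p++) {
            if (is_dirty(p) && writeback_page(backing_fd, p) < 0) {
                return -1;
            }
        }
        return fdatasync(backing_fd);
    }

A real driver would of course issue these writes through the block layer
and would also need the COR/stream half to populate the buffer on demand,
but the "dirty bitmap, asynchronous writeback, flush waits until clean"
control flow would look roughly like this.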
The guest kernel already implements this - it's the page cache and the
block layer! Doing it in QEMU with dirty memory logging enabled is less
efficient than doing it in the guest. That's why I said it's better to
just use block devices than to implement buffering.

I'm saying that persistent memory emulation on top of the iscsi:// block
driver (for example) does not make sense. It could be implemented but
the performance wouldn't be better than block I/O and the
complexity/code size in QEMU isn't justified IMO.

Stefan

> > Most protocol drivers do not support direct memory access. iscsi,
> > curl, etc just don't fit the model. One might be tempted to implement
> > buffering but at that point it's better to just use block devices.
> >
> > I have CCed Pankaj, who is working on the virtio-pmem device. I need
> > to be clear that emulated NVDIMM cannot be supported with the block
> > layer since it lacks a guest flush mechanism. There is no way for
> > applications to let the hypervisor know the file needs to be fsynced.
> > That's what virtio-pmem addresses.
> >
> > Summary:
> > A subset of the block layer could be used to back virtio-pmem. This
> > requires a new block driver API and the KVM async page fault
> > mechanism for trapping and mapping pages. Actual emulated NVDIMM
> > devices cannot be supported unless the hardware specification is
> > extended with a virtualization-friendly interface in the future.
> >
> > Please let me know your thoughts.
> >
> > Stefan
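(To make the "no guest flush mechanism" point concrete, here is a rough,
hand-written x86 sketch of how an application inside the guest persists
data to a DAX-mapped region, roughly what libpmem's pmem_persist() does.
The persist() helper and the fixed 64-byte cache line size are illustrative
assumptions, not code from any patch in this thread; it needs -mclwb to
build.)

    /* Persistence from inside the guest: plain stores followed by cache
     * line write-backs and a fence. Nothing in this path is device I/O
     * that the hypervisor could trap and turn into an fsync of the image
     * file, which is the gap virtio-pmem closes with an explicit flush
     * request from the guest driver.
     */
    #include <immintrin.h>      /* _mm_clwb (CLWB), _mm_sfence */
    #include <stdint.h>
    #include <string.h>

    #define CACHELINE 64

    static void persist(void *pmem_dst, const void *src, size_t len)
    {
        memcpy(pmem_dst, src, len);                 /* ordinary stores */

        uintptr_t line = (uintptr_t)pmem_dst & ~(uintptr_t)(CACHELINE - 1);
        uintptr_t end  = (uintptr_t)pmem_dst + len;
        for (; line < end; line += CACHELINE) {
            _mm_clwb((void *)line);                 /* write back each cache line */
        }
        _mm_sfence();                               /* order the write-backs */
    }

That is the reason emulated NVDIMM cannot be supported with the block
layer as-is: the flush never leaves the CPU/memory path, so the interface
would have to grow a virtualization-friendly flush, which is exactly what
virtio-pmem provides as a paravirtual device.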