On Tue, Jul 25, 2017 at 7:27 AM, Pankaj Gupta <pagu...@redhat.com> wrote:
>
>> Subject: Re: KVM "fake DAX" flushing interface - discussion
>>
>> On Mon 24-07-17 08:06:07, Pankaj Gupta wrote:
>> >
>> > > On Sun 23-07-17 13:10:34, Dan Williams wrote:
>> > > > On Sun, Jul 23, 2017 at 11:10 AM, Rik van Riel <r...@redhat.com> wrote:
>> > > > > On Sun, 2017-07-23 at 09:01 -0700, Dan Williams wrote:
>> > > > >> [ adding Ross and Jan ]
>> > > > >>
>> > > > >> On Sun, Jul 23, 2017 at 7:04 AM, Rik van Riel <r...@redhat.com>
>> > > > >> wrote:
>> > > > >> >
>> > > > >> > The goal is to increase density of guests, by moving page
>> > > > >> > cache into the host (where it can be easily reclaimed).
>> > > > >> >
>> > > > >> > If we assume the guests will be backed by relatively fast
>> > > > >> > SSDs, a "whole device flush" from filesystem journaling
>> > > > >> > code (issued where the filesystem issues a barrier or
>> > > > >> > disk cache flush today) may be just what we need to make
>> > > > >> > that work.
>> > > > >>
>> > > > >> Ok, apologies, I indeed had some pieces of the proposal confused.
>> > > > >>
>> > > > >> However, it still seems like the storage interface is not capable of
>> > > > >> expressing what is needed, because the operation that is needed is a
>> > > > >> range flush. In the guest you want the DAX page dirty tracking to
>> > > > >> communicate range flush information to the host, but there's no
>> > > > >> readily available block i/o semantic that software running on top of
>> > > > >> the fake pmem device can use to communicate with the host. Instead you
>> > > > >> want to intercept the dax_flush() operation and turn it into a queued
>> > > > >> request on the host.
>> > > > >>
>> > > > >> In 4.13 we have turned this dax_flush() operation into an explicit
>> > > > >> driver call. That seems a better interface to modify than trying to
>> > > > >> map block-storage flush-cache / force-unit-access commands to this
>> > > > >> host request.
>> > > > >>
>> > > > >> The additional piece you would need to consider is whether to track
>> > > > >> all writes in addition to mmap writes in the guest as DAX-page-cache
>> > > > >> dirtying events, or arrange for every dax_copy_from_iter() operation()
>> > > > >> to also queue a sync on the host, but that essentially turns the host
>> > > > >> page cache into a pseudo write-through mode.
>> > > > >
>> > > > > I suspect initially it will be fine to not offer DAX
>> > > > > semantics to applications using these "fake DAX" devices
>> > > > > from a virtual machine, because the DAX APIs are designed
>> > > > > for a much higher performance device than these fake DAX
>> > > > > setups could ever give.
>> > > >
>> > > > Right, we don't need DAX, per se, in the guest.
>> > > >
>> > > > >
>> > > > > Having userspace call fsync/msync like done normally, and
>> > > > > having those coarser calls be turned into somewhat efficient
>> > > > > backend flushes would be perfectly acceptable.
>> > > > >
>> > > > > The big question is, what should that kind of interface look
>> > > > > like?
>> > > >
>> > > > To me, this looks much like the dirty cache tracking that is done in
>> > > > the address_space radix for the DAX case, but modified to coordinate
>> > > > queued / page-based flushing when the guest wants to persist data.
>> > > > The similarity to DAX is not storing guest allocated pages in the
>> > > > radix but entries that track dirty guest physical addresses.
>> > >
>> > > Let me check whether I understand the problem correctly. So we want to
>> > > export a block device (essentially a page cache of this block device) to a
>> > > guest as PMEM and use DAX in the guest to save guest's page cache. The
>> >
>> > that's correct.
>> >
>> > > natural way to make the persistence work would be to make ->flush callback
>> > > of the PMEM device to do an upcall to the host which could then fdatasync()
>> > > appropriate image file range however the performance would suck in such
>> > > case since ->flush gets called for at most one page ranges from DAX.
>> >
>> > Discussion is : sync a range using paravirt device or flush hit addresses
>> > vs block device flush.
>> >
>> > >
>> > > So what you could do instead is to completely ignore ->flush calls for the
>> > > PMEM device and instead catch the bio with REQ_PREFLUSH flag set on the
>> > > PMEM device (generated by blkdev_issue_flush() or the journalling
>> > > machinery) and fdatasync() the whole image file at that moment - in fact
>> > > you must do that for metadata IO to hit persistent storage anyway in your
>> > > setting. This would very closely follow how exporting block devices with
>> > > volatile cache works with KVM these days AFAIU and the performance will be
>> > > the same.
>> >
>> > yes 'blkdev_issue_flush' does set 'REQ_OP_WRITE | REQ_PREFLUSH' flags.
>> > As per suggestions looks like block flushing device is way ahead.
>> >
>> > If we do an asynchronous block flush at guest side(put current task in
>> > wait queue till host side fdatasync completes) can solve the purpose? Or
>> > do we need another paravirt device for this?
>>
>> Well, even currently if you have PMEM device, you still have also a block
>> device and a request queue associated with it and metadata IO goes through
>> that path. So in your case you will have the same in the guest as a result
>> of exposing virtual PMEM device to the guest and you just need to make sure
>> this virtual block device behaves the same way as traditional virtualized
>> block devices in KVM in respose to 'REQ_OP_WRITE | REQ_PREFLUSH' requests.
>
> Looks like only way to send flush(blk dev) from guest to host with nvdimm
> is using flush hint addresses.
> Is this the correct interface I am looking?
>
> blkdev_issue_flush
>   submit_bio_wait
>     submit_bio
>       generic_make_request
>         pmem_make_request
>           ...
>           if (bio->bi_opf & REQ_FLUSH)
>                   nvdimm_flush(nd_region);

I would inject a paravirtualized version of pmem_make_request() that
sends an async flush operation over virtio to the host. Don't try to
use flush hint addresses for this; they don't have the proper
semantics. The guest should be allowed to issue the flush and receive
the completion asynchronously rather than taking a VM exit and
blocking on that request.
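
A minimal userspace sketch of the asynchronous flush idea, under stated
assumptions: this is a plain C model, not kernel or virtio code, and
every name in it (flush_req, flush_ring, guest_submit_flush,
host_process) is hypothetical. The counter bump stands in for the
host's fdatasync() on the image file, and batching all pending
requests under one sync is an assumption of the sketch, not something
proposed in the thread.

```c
/* Userspace model only. All names here are hypothetical. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_SIZE 8

struct flush_req {
	int id;		/* guest-chosen request id */
	bool done;	/* set by the host on completion */
	int status;	/* 0 on success */
};

struct flush_ring {
	struct flush_req *slots[RING_SIZE];
	size_t head;	/* next slot the host consumes */
	size_t tail;	/* next free slot for the guest */
	int host_syncs;	/* counts stand-in fdatasync() calls */
};

/* Guest side: queue a flush and return at once, without blocking. */
static int guest_submit_flush(struct flush_ring *r, struct flush_req *req)
{
	if (r->tail - r->head == RING_SIZE)
		return -1;	/* ring full, caller retries later */
	req->done = false;
	r->slots[r->tail % RING_SIZE] = req;
	r->tail++;
	return 0;
}

/*
 * Host side: drain the ring. One sync covers every request that was
 * queued before it started, then each request is completed. Returns
 * the number of requests completed.
 */
static size_t host_process(struct flush_ring *r)
{
	size_t n = r->tail - r->head;

	if (n == 0)
		return 0;
	r->host_syncs++;	/* stand-in for fdatasync(image_fd) */
	while (r->head != r->tail) {
		struct flush_req *req = r->slots[r->head % RING_SIZE];

		assert(req != NULL);
		req->status = 0;
		req->done = true;	/* guest observes this later */
		r->head++;
	}
	return n;
}
```

In this model the guest calls guest_submit_flush() and carries on; it
only looks at req->done (or would sleep on a wait queue in a real
driver) once host_process() has run, so the submitter never blocks
inside the submit path.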