Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
On Thu, Jan 18, 2018 at 11:51 AM, David Hildenbrand wrote:
>>> 1] Existing pmem driver & virtio for region discovery:
>>>
>>> Use the existing pmem driver, which is tightly coupled with concepts of
>>> namespaces, labels etc. from ACPI region discovery, and re-implement these
>>> concepts with virtio so that the existing pmem driver can understand them.
>>> In addition, the pmem driver is tasked with sending flush commands
>>> using virtio.
>>
>> It's not tightly coupled. The whole point of libnvdimm is to be
>> agnostic to ACPI, e820 or any other range discovery. The only work to
>> do beyond identifying the address range is teaching libnvdimm to pass
>> along a flush control interface to the pmem driver.
>>
>>> 2] Existing pmem driver & ACPI NFIT for region discovery:
>>>
>>> If we use ACPI NFIT, we need to teach the existing ACPI driver to add this
>>> new memory type and teach the existing pmem driver to handle it. We still
>>> need an asynchronous (virtio) way to send flush commands: a virtio
>>> device/driver, or an arbitrary key/value-like pair, just to send commands
>>> from guest to host using virtio.
>>>
>>> 3] New virtio-pmem driver & paravirt device:
>>>
>>> The third way is a new virtio-pmem driver, with less work to support the
>>> existing features of the different protocols, and with an asynchronous way
>>> of sending flush commands.
>>>
>>> This needs to duplicate some of the work the existing pmem driver does,
>>> but as discussed previously we can separate out common code from the
>>> existing pmem driver and reuse it.
>>>
>>> Among these approaches I also prefer 3].
>>
>> I disagree; the reason we went down this ACPI path was to limit the
>> needless duplication of most of the pmem driver.
>
> I have way too little insight to make qualified statements about the
> different approaches here. :)
>
> All I am interested in is making this as independent of
> architecture-specific technologies (e.g. ACPI) as possible. We will want
> this e.g. for s390x too, rather sooner than later. So trying to couple
> this (somehow) to ACPI just for the sake of less code to copy will not
> pay off in the long run.
>
> Better to have a clean virtio interface/design right from the start.
>
> So I hope my words will be heard :)

I think that's reasonable. Once we have the virtio-based discovery I think
the incremental changes to the libnvdimm core and the pmem driver are small.
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
>> 1] Existing pmem driver & virtio for region discovery:
>>
>> Use the existing pmem driver, which is tightly coupled with concepts of
>> namespaces, labels etc. from ACPI region discovery, and re-implement these
>> concepts with virtio so that the existing pmem driver can understand them.
>> In addition, the pmem driver is tasked with sending flush commands
>> using virtio.
>
> It's not tightly coupled. The whole point of libnvdimm is to be
> agnostic to ACPI, e820 or any other range discovery. The only work to
> do beyond identifying the address range is teaching libnvdimm to pass
> along a flush control interface to the pmem driver.
>
>> 2] Existing pmem driver & ACPI NFIT for region discovery:
>>
>> If we use ACPI NFIT, we need to teach the existing ACPI driver to add this
>> new memory type and teach the existing pmem driver to handle it. We still
>> need an asynchronous (virtio) way to send flush commands: a virtio
>> device/driver, or an arbitrary key/value-like pair, just to send commands
>> from guest to host using virtio.
>>
>> 3] New virtio-pmem driver & paravirt device:
>>
>> The third way is a new virtio-pmem driver, with less work to support the
>> existing features of the different protocols, and with an asynchronous way
>> of sending flush commands.
>>
>> This needs to duplicate some of the work the existing pmem driver does,
>> but as discussed previously we can separate out common code from the
>> existing pmem driver and reuse it.
>>
>> Among these approaches I also prefer 3].
>
> I disagree; the reason we went down this ACPI path was to limit the
> needless duplication of most of the pmem driver.

I have way too little insight to make qualified statements about the
different approaches here. :)

All I am interested in is making this as independent of
architecture-specific technologies (e.g. ACPI) as possible. We will want
this e.g. for s390x too, rather sooner than later. So trying to couple
this (somehow) to ACPI just for the sake of less code to copy will not
pay off in the long run.

Better to have a clean virtio interface/design right from the start.

So I hope my words will be heard :)

--
Thanks,

David / dhildenb
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
On Thu, Jan 18, 2018 at 11:36 AM, Pankaj Gupta wrote:
>> On Thu, Jan 18, 2018 at 10:54 AM, Pankaj Gupta wrote:
>>
>>>>>> I'd like to emphasize again that I would prefer a virtio-pmem-only
>>>>>> solution.
>>>>>>
>>>>>> There are architectures out there (e.g. s390x) that don't support
>>>>>> NVDIMMs - there is no HW interface to expose any such stuff.
>>>>>>
>>>>>> However, with virtio-pmem, we could make it work also on architectures
>>>>>> not having ACPI and friends.
>>>>>
>>>>> ACPI and virtio-only can share the same pmem driver. There are two
>>>>> parts to this: region discovery and setting up the pmem driver. For
>>>>> discovery you can either have an NFIT-bus-defined range, or a new
>>>>> virtio-pmem bus define it. As far as the pmem driver itself goes, it's
>>>>> agnostic to how the range is discovered.
>>>>
>>>> And in addition to discovery + setup, we need the flush via virtio.
>>>>
>>>>> In other words, pmem consumes 'regions' from libnvdimm, and a bus
>>>>> provider like nfit, e820, or a new virtio mechanism produces 'regions'.
>>>>
>>>> That sounds good to me. I would like to see how the ACPI discovery
>>>> variant connects to a virtio ring.
>>>>
>>>> The natural way for me would be:
>>>>
>>>> A virtio-X device supplies a memory region ("discovery") and also the
>>>> interface for flushes for this device. So one virtio-X device corresponds
>>>> to one pmem device. No ACPI needs to be involved (also not on
>>>> architectures that have ACPI).
>>>
>>> I agree here; if we discover regions with virtio-X we don't need to worry
>>> about ACPI NFIT. Actually, there are three ways to do it, with pros and
>>> cons for each approach:
>>>
>>> 1] Existing pmem driver & virtio for region discovery:
>>>
>>> Use the existing pmem driver, which is tightly coupled with concepts of
>>> namespaces, labels etc. from ACPI region discovery, and re-implement these
>>> concepts with virtio so that the existing pmem driver can understand them.
>>> In addition, the pmem driver is tasked with sending flush commands
>>> using virtio.
>>
>> It's not tightly coupled. The whole point of libnvdimm is to be
>> agnostic to ACPI, e820 or any other range discovery. The only work to
>> do beyond identifying the address range is teaching libnvdimm to pass
>> along a flush control interface to the pmem driver.
>
> o.k., that means we can configure libnvdimm with virtio as well and use the
> existing pmem driver. AFAICU it uses the nvdimm bus?
>
> Do we need other features which ACPI provides?

No, to keep it simple use nvdimm_pmem_region_create() without registering
any DIMM devices. I'd start with the e820 driver as a bus driver reference
(drivers/nvdimm/e820.c) rather than trying to unwind the complexity of the
nfit driver.
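The producer/consumer split described here can be put in a few lines of code. Below is a toy userspace model of the idea (a bus provider such as e820, nfit, or a virtio device only *produces* a region; a generic pmem consumer attaches to it and gets a provider-specific flush hook). All names, addresses, and sizes are hypothetical; this is not the real libnvdimm API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy model: a region carries only an address range and a flush hook,
 * so the consumer stays agnostic to how the range was discovered. */
struct toy_region {
    unsigned long long start;               /* guest-physical base */
    unsigned long long size;                /* length in bytes */
    int (*flush)(struct toy_region *r);     /* provider-specific flush */
};

/* Provider side: e.g. a hypothetical virtio-pmem bus driver. */
static int virtio_toy_flush(struct toy_region *r)
{
    (void)r;
    /* A real driver would queue a flush request on a virtqueue and
     * wait for the host-side fsync to complete; here we just succeed. */
    return 0;
}

static void virtio_toy_probe(struct toy_region *r)
{
    r->start = 0x100000000ULL;  /* hypothetical 4 GiB base */
    r->size  = 0x40000000ULL;   /* hypothetical 1 GiB range */
    r->flush = virtio_toy_flush;
}

/* Consumer side: the "pmem driver" sees only a region + flush hook. */
static int toy_pmem_attach(struct toy_region *r)
{
    if (!r->size || !r->flush)
        return -1;
    return r->flush(r);  /* e.g. invoked on a flush request */
}
```

The point of the sketch is only the direction of the dependency: the consumer never asks *how* the range was found, it just calls the hook the provider installed.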
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
> On Thu, Jan 18, 2018 at 10:54 AM, Pankaj Gupta wrote:
>
>>>>> I'd like to emphasize again that I would prefer a virtio-pmem-only
>>>>> solution.
>>>>>
>>>>> There are architectures out there (e.g. s390x) that don't support
>>>>> NVDIMMs - there is no HW interface to expose any such stuff.
>>>>>
>>>>> However, with virtio-pmem, we could make it work also on architectures
>>>>> not having ACPI and friends.
>>>>
>>>> ACPI and virtio-only can share the same pmem driver. There are two
>>>> parts to this: region discovery and setting up the pmem driver. For
>>>> discovery you can either have an NFIT-bus-defined range, or a new
>>>> virtio-pmem bus define it. As far as the pmem driver itself goes, it's
>>>> agnostic to how the range is discovered.
>>>
>>> And in addition to discovery + setup, we need the flush via virtio.
>>>
>>>> In other words, pmem consumes 'regions' from libnvdimm, and a bus
>>>> provider like nfit, e820, or a new virtio mechanism produces 'regions'.
>>>
>>> That sounds good to me. I would like to see how the ACPI discovery
>>> variant connects to a virtio ring.
>>>
>>> The natural way for me would be:
>>>
>>> A virtio-X device supplies a memory region ("discovery") and also the
>>> interface for flushes for this device. So one virtio-X device corresponds
>>> to one pmem device. No ACPI needs to be involved (also not on
>>> architectures that have ACPI).
>>
>> I agree here; if we discover regions with virtio-X we don't need to worry
>> about ACPI NFIT. Actually, there are three ways to do it, with pros and
>> cons for each approach:
>>
>> 1] Existing pmem driver & virtio for region discovery:
>>
>> Use the existing pmem driver, which is tightly coupled with concepts of
>> namespaces, labels etc. from ACPI region discovery, and re-implement these
>> concepts with virtio so that the existing pmem driver can understand them.
>> In addition, the pmem driver is tasked with sending flush commands
>> using virtio.
>
> It's not tightly coupled. The whole point of libnvdimm is to be
> agnostic to ACPI, e820 or any other range discovery. The only work to
> do beyond identifying the address range is teaching libnvdimm to pass
> along a flush control interface to the pmem driver.

o.k., that means we can configure libnvdimm with virtio as well and use the
existing pmem driver. AFAICU it uses the nvdimm bus?

Do we need other features which ACPI provides?

acpi_nfit_init
  nvdimm_bus_register
  ...
  acpi_nfit_register_region
    acpi_region_create
      nvdimm_pmem_region_create

Also, I need to check how to pass the virtio flush interface.

>> 2] Existing pmem driver & ACPI NFIT for region discovery:
>>
>> If we use ACPI NFIT, we need to teach the existing ACPI driver to add this
>> new memory type and teach the existing pmem driver to handle it. We still
>> need an asynchronous (virtio) way to send flush commands: a virtio
>> device/driver, or an arbitrary key/value-like pair, just to send commands
>> from guest to host using virtio.
>>
>> 3] New virtio-pmem driver & paravirt device:
>>
>> The third way is a new virtio-pmem driver, with less work to support the
>> existing features of the different protocols, and with an asynchronous way
>> of sending flush commands.
>>
>> This needs to duplicate some of the work the existing pmem driver does,
>> but as discussed previously we can separate out common code from the
>> existing pmem driver and reuse it.
>>
>> Among these approaches I also prefer 3].
>
> I disagree; the reason we went down this ACPI path was to limit the
> needless duplication of most of the pmem driver.

yes.
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
>>> I'd like to emphasize again that I would prefer a virtio-pmem-only
>>> solution.
>>>
>>> There are architectures out there (e.g. s390x) that don't support
>>> NVDIMMs - there is no HW interface to expose any such stuff.
>>>
>>> However, with virtio-pmem, we could make it work also on architectures
>>> not having ACPI and friends.
>>
>> ACPI and virtio-only can share the same pmem driver. There are two
>> parts to this: region discovery and setting up the pmem driver. For
>> discovery you can either have an NFIT-bus-defined range, or a new
>> virtio-pmem bus define it. As far as the pmem driver itself goes, it's
>> agnostic to how the range is discovered.
>
> And in addition to discovery + setup, we need the flush via virtio.
>
>> In other words, pmem consumes 'regions' from libnvdimm, and a bus
>> provider like nfit, e820, or a new virtio mechanism produces 'regions'.
>
> That sounds good to me. I would like to see how the ACPI discovery
> variant connects to a virtio ring.
>
> The natural way for me would be:
>
> A virtio-X device supplies a memory region ("discovery") and also the
> interface for flushes for this device. So one virtio-X device corresponds
> to one pmem device. No ACPI needs to be involved (also not on architectures
> that have ACPI).

I agree here; if we discover regions with virtio-X we don't need to worry
about ACPI NFIT. Actually, there are three ways to do it, with pros and
cons for each approach:

1] Existing pmem driver & virtio for region discovery:

Use the existing pmem driver, which is tightly coupled with concepts of
namespaces, labels etc. from ACPI region discovery, and re-implement these
concepts with virtio so that the existing pmem driver can understand them.
In addition, the pmem driver is tasked with sending flush commands
using virtio.

2] Existing pmem driver & ACPI NFIT for region discovery:

If we use ACPI NFIT, we need to teach the existing ACPI driver to add this
new memory type and teach the existing pmem driver to handle it. We still
need an asynchronous (virtio) way to send flush commands: a virtio
device/driver, or an arbitrary key/value-like pair, just to send commands
from guest to host using virtio.

3] New virtio-pmem driver & paravirt device:

The third way is a new virtio-pmem driver, with less work to support the
existing features of the different protocols, and with an asynchronous way
of sending flush commands.

This needs to duplicate some of the work the existing pmem driver does,
but as discussed previously we can separate out common code from the
existing pmem driver and reuse it.

Among these approaches I also prefer 3].

> --
> Thanks,
> David / dhildenb
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
On Thu, Jan 18, 2018 at 10:54 AM, Pankaj Gupta wrote:
>>>> I'd like to emphasize again that I would prefer a virtio-pmem-only
>>>> solution.
>>>>
>>>> There are architectures out there (e.g. s390x) that don't support
>>>> NVDIMMs - there is no HW interface to expose any such stuff.
>>>>
>>>> However, with virtio-pmem, we could make it work also on architectures
>>>> not having ACPI and friends.
>>>
>>> ACPI and virtio-only can share the same pmem driver. There are two
>>> parts to this: region discovery and setting up the pmem driver. For
>>> discovery you can either have an NFIT-bus-defined range, or a new
>>> virtio-pmem bus define it. As far as the pmem driver itself goes, it's
>>> agnostic to how the range is discovered.
>>
>> And in addition to discovery + setup, we need the flush via virtio.
>>
>>> In other words, pmem consumes 'regions' from libnvdimm, and a bus
>>> provider like nfit, e820, or a new virtio mechanism produces 'regions'.
>>
>> That sounds good to me. I would like to see how the ACPI discovery
>> variant connects to a virtio ring.
>>
>> The natural way for me would be:
>>
>> A virtio-X device supplies a memory region ("discovery") and also the
>> interface for flushes for this device. So one virtio-X device corresponds
>> to one pmem device. No ACPI needs to be involved (also not on
>> architectures that have ACPI).
>
> I agree here; if we discover regions with virtio-X we don't need to worry
> about ACPI NFIT. Actually, there are three ways to do it, with pros and
> cons for each approach:
>
> 1] Existing pmem driver & virtio for region discovery:
>
> Use the existing pmem driver, which is tightly coupled with concepts of
> namespaces, labels etc. from ACPI region discovery, and re-implement these
> concepts with virtio so that the existing pmem driver can understand them.
> In addition, the pmem driver is tasked with sending flush commands
> using virtio.

It's not tightly coupled. The whole point of libnvdimm is to be
agnostic to ACPI, e820 or any other range discovery. The only work to
do beyond identifying the address range is teaching libnvdimm to pass
along a flush control interface to the pmem driver.

> 2] Existing pmem driver & ACPI NFIT for region discovery:
>
> If we use ACPI NFIT, we need to teach the existing ACPI driver to add this
> new memory type and teach the existing pmem driver to handle it. We still
> need an asynchronous (virtio) way to send flush commands: a virtio
> device/driver, or an arbitrary key/value-like pair, just to send commands
> from guest to host using virtio.
>
> 3] New virtio-pmem driver & paravirt device:
>
> The third way is a new virtio-pmem driver, with less work to support the
> existing features of the different protocols, and with an asynchronous way
> of sending flush commands.
>
> This needs to duplicate some of the work the existing pmem driver does,
> but as discussed previously we can separate out common code from the
> existing pmem driver and reuse it.
>
> Among these approaches I also prefer 3].

I disagree; the reason we went down this ACPI path was to limit the
needless duplication of most of the pmem driver.
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
On Thu, Jan 18, 2018 at 9:48 AM, David Hildenbrand wrote:
>>> I'd like to emphasize again that I would prefer a virtio-pmem-only
>>> solution.
>>>
>>> There are architectures out there (e.g. s390x) that don't support
>>> NVDIMMs - there is no HW interface to expose any such stuff.
>>>
>>> However, with virtio-pmem, we could make it work also on architectures
>>> not having ACPI and friends.
>>
>> ACPI and virtio-only can share the same pmem driver. There are two
>> parts to this: region discovery and setting up the pmem driver. For
>> discovery you can either have an NFIT-bus-defined range, or a new
>> virtio-pmem bus define it. As far as the pmem driver itself goes, it's
>> agnostic to how the range is discovered.
>
> And in addition to discovery + setup, we need the flush via virtio.
>
>> In other words, pmem consumes 'regions' from libnvdimm, and a bus
>> provider like nfit, e820, or a new virtio mechanism produces 'regions'.
>
> That sounds good to me. I would like to see how the ACPI discovery
> variant connects to a virtio ring.
>
> The natural way for me would be:
>
> A virtio-X device supplies a memory region ("discovery") and also the
> interface for flushes for this device. So one virtio-X device corresponds
> to one pmem device. No ACPI needs to be involved (also not on architectures
> that have ACPI).

Hmm, yes - it seems that if ACPI is just going to be used as a trigger for
"go find the virtio-X interface for this range", we could have started from
a virtio device in the first place.
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
>> I'd like to emphasize again that I would prefer a virtio-pmem-only
>> solution.
>>
>> There are architectures out there (e.g. s390x) that don't support
>> NVDIMMs - there is no HW interface to expose any such stuff.
>>
>> However, with virtio-pmem, we could make it work also on architectures
>> not having ACPI and friends.
>
> ACPI and virtio-only can share the same pmem driver. There are two
> parts to this: region discovery and setting up the pmem driver. For
> discovery you can either have an NFIT-bus-defined range, or a new
> virtio-pmem bus define it. As far as the pmem driver itself goes, it's
> agnostic to how the range is discovered.

And in addition to discovery + setup, we need the flush via virtio.

> In other words, pmem consumes 'regions' from libnvdimm, and a bus
> provider like nfit, e820, or a new virtio mechanism produces 'regions'.

That sounds good to me. I would like to see how the ACPI discovery
variant connects to a virtio ring.

The natural way for me would be:

A virtio-X device supplies a memory region ("discovery") and also the
interface for flushes for this device. So one virtio-X device corresponds to
one pmem device. No ACPI needs to be involved (also not on architectures
that have ACPI).

--
Thanks,

David / dhildenb
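One way to picture a virtio-X device that supplies both the memory region and the flush interface is the following hypothetical config-space and request layout. Nothing here is standardized by the thread; the struct names, the field layout, and the single-request protocol are all illustrative assumptions (virtio config fields are little-endian and naturally aligned, which the layout below respects).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical config space: the device advertises the pmem range,
 * which is the "discovery" half of the proposal. */
struct virtio_pmem_config {
    uint64_t start;  /* guest-physical start of the pmem range */
    uint64_t size;   /* length of the range in bytes */
};

/* Hypothetical request protocol: a single request type on one
 * virtqueue, which is the "flush interface" half of the proposal. */
#define VIRTIO_PMEM_REQ_FLUSH 0

struct virtio_pmem_req {
    uint32_t type;   /* VIRTIO_PMEM_REQ_FLUSH */
};

struct virtio_pmem_resp {
    uint32_t ret;    /* 0 on success, errno-style code otherwise */
};
```

With a layout like this, one device maps to exactly one pmem range, matching the "one virtio-X device corresponds to one pmem device" idea, and the host side can simply fsync() the backing file when it dequeues a flush request.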
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
On Thu, Jan 18, 2018 at 8:53 AM, David Hildenbrand wrote:
> On 24.11.2017 13:40, Pankaj Gupta wrote:
>>
>> Hello,
>>
>> Thank you all for all the useful suggestions.
>> I want to summarize the discussions so far in the
>> thread. Please see below:
>>
>>>>>> We can go with the "best" interface for what
>>>>>> could be a relatively slow flush (fsync on a
>>>>>> file on ssd/disk on the host), which requires
>>>>>> that the flushing task wait on completion
>>>>>> asynchronously.
>>>>>
>>>>> I'd like to clarify the interface of "wait on completion
>>>>> asynchronously" and KVM async page fault a bit more.
>>>>>
>>>>> The current design of async page fault only works on RAM rather
>>>>> than MMIO, i.e., if the page fault is caused by accessing the
>>>>> device memory of an emulated device, it needs to go to
>>>>> userspace (QEMU), which emulates the operation in the vCPU's
>>>>> thread.
>>>>>
>>>>> As I mentioned before, the memory region used for the vNVDIMM
>>>>> flush interface should be MMIO, and considering its support
>>>>> on other hypervisors, we had better push this async
>>>>> mechanism into the flush interface design itself rather
>>>>> than depend on KVM async page fault.
>>>>
>>>> I would expect this interface to be virtio-ring based, to queue flush
>>>> requests asynchronously to the host.
>>>
>>> Could we reuse the virtio-blk device, only with a different device id?
>>
>> As per previous discussions, there were suggestions on the two main parts
>> of the project:
>>
>> 1] Expose vNVDIMM memory range to KVM guest.
>>
>>    - Add a flag in the ACPI NFIT table for this new memory type. Do we
>>      need NVDIMM spec changes for this?
>>
>>    - The guest should be able to add this memory to the system memory map.
>>      The name of the added memory in '/proc/iomem' should be different
>>      (shared memory?) from persistent memory, as it does not satisfy the
>>      exact definition of persistent memory (it requires an explicit
>>      flush).
>>
>>    - The guest should not allow 'device-dax' and other fancy features
>>      which are not virtualization friendly.
>>
>> 2] Flushing interface to persist guest changes.
>>
>>    - As per the suggestion by ChristophH (CCed), we explored options other
>>      than virtio, like MMIO etc. Most of these options are not use-case
>>      friendly: we want to do an fsync on a file on ssd/disk on the host,
>>      and we cannot make guest vCPUs wait for that time.
>>
>>    - Though adding a new driver (virtio-pmem) looks like repeated work and
>>      is not needed, so we can go with the existing pmem driver and add a
>>      flush specific to this new memory type.
>
> I'd like to emphasize again that I would prefer a virtio-pmem-only
> solution.
>
> There are architectures out there (e.g. s390x) that don't support
> NVDIMMs - there is no HW interface to expose any such stuff.
>
> However, with virtio-pmem, we could make it work also on architectures
> not having ACPI and friends.

ACPI and virtio-only can share the same pmem driver. There are two
parts to this: region discovery and setting up the pmem driver. For
discovery you can either have an NFIT-bus-defined range, or a new
virtio-pmem bus define it. As far as the pmem driver itself goes, it's
agnostic to how the range is discovered.

In other words, pmem consumes 'regions' from libnvdimm, and a bus
provider like nfit, e820, or a new virtio mechanism produces 'regions'.
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
On 24.11.2017 13:40, Pankaj Gupta wrote:
>
> Hello,
>
> Thank you all for all the useful suggestions.
> I want to summarize the discussions so far in the
> thread. Please see below:
>
>>>>> We can go with the "best" interface for what
>>>>> could be a relatively slow flush (fsync on a
>>>>> file on ssd/disk on the host), which requires
>>>>> that the flushing task wait on completion
>>>>> asynchronously.
>>>>
>>>> I'd like to clarify the interface of "wait on completion
>>>> asynchronously" and KVM async page fault a bit more.
>>>>
>>>> The current design of async page fault only works on RAM rather
>>>> than MMIO, i.e., if the page fault is caused by accessing the
>>>> device memory of an emulated device, it needs to go to
>>>> userspace (QEMU), which emulates the operation in the vCPU's
>>>> thread.
>>>>
>>>> As I mentioned before, the memory region used for the vNVDIMM
>>>> flush interface should be MMIO, and considering its support
>>>> on other hypervisors, we had better push this async
>>>> mechanism into the flush interface design itself rather
>>>> than depend on KVM async page fault.
>>>
>>> I would expect this interface to be virtio-ring based, to queue flush
>>> requests asynchronously to the host.
>>
>> Could we reuse the virtio-blk device, only with a different device id?
>
> As per previous discussions, there were suggestions on the two main parts
> of the project:
>
> 1] Expose vNVDIMM memory range to KVM guest.
>
>    - Add a flag in the ACPI NFIT table for this new memory type. Do we
>      need NVDIMM spec changes for this?
>
>    - The guest should be able to add this memory to the system memory map.
>      The name of the added memory in '/proc/iomem' should be different
>      (shared memory?) from persistent memory, as it does not satisfy the
>      exact definition of persistent memory (it requires an explicit flush).
>
>    - The guest should not allow 'device-dax' and other fancy features
>      which are not virtualization friendly.
>
> 2] Flushing interface to persist guest changes.
>
>    - As per the suggestion by ChristophH (CCed), we explored options other
>      than virtio, like MMIO etc. Most of these options are not use-case
>      friendly: we want to do an fsync on a file on ssd/disk on the host,
>      and we cannot make guest vCPUs wait for that time.
>
>    - Though adding a new driver (virtio-pmem) looks like repeated work and
>      is not needed, so we can go with the existing pmem driver and add a
>      flush specific to this new memory type.

I'd like to emphasize again that I would prefer a virtio-pmem-only
solution.

There are architectures out there (e.g. s390x) that don't support
NVDIMMs - there is no HW interface to expose any such stuff.

However, with virtio-pmem, we could make it work also on architectures
not having ACPI and friends.

>    - The suggestion by Paolo & Stefan (previously) to use virtio-blk makes
>      sense if we just want a flush vehicle to send guest commands to the
>      host and get a reply after asynchronous execution. There was a
>      previous discussion [1] with Rik & Dan on this.
>
>    [1] https://lists.gnu.org/archive/html/qemu-devel/2017-07/msg08373.html
>
> Is my understanding correct here?
>
> Thanks,
> Pankaj

--
Thanks,

David / dhildenb
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
Hi Dan,

Thanks for your reply.

> On Fri, Jan 12, 2018 at 10:23 PM, Pankaj Gupta wrote:
>
>> Hello Dan,
>>
>>> Not a flag, but a new "Address Range Type GUID". See section "5.2.25.2
>>> System Physical Address (SPA) Range Structure" in the ACPI 6.2A
>>> specification. Since it is a GUID, we could define a Linux-specific
>>> type for this case, but spec changes would allow non-Linux hypervisors
>>> to advertise a standard interface to guests.
>>
>> I have added a new SPA with a GUID for this memory type, and I could add
>> this new memory type to the system memory map. I need help with the
>> namespace handling for this new type. As mentioned in the discussion [1]:
>>
>> - Create a new namespace for this new memory type
>> - Teach libnvdimm how to handle this new namespace
>>
>> I have some queries on this:
>>
>> 1] How would namespace handling of this new memory type work?
>
> This would be a namespace that creates a pmem device, but does not allow
> DAX.

o.k.

>> 2] There are existing namespace types:
>> ND_DEVICE_NAMESPACE_IO, ND_DEVICE_NAMESPACE_PMEM, ND_DEVICE_NAMESPACE_BLK
>>
>> How will libnvdimm handle this new namespace type in conjunction with the
>> existing memory types, regions & namespaces?
>
> The type will be either ND_DEVICE_NAMESPACE_IO or
> ND_DEVICE_NAMESPACE_PMEM depending on whether you configure KVM to
> provide a virtual NVDIMM and label space. In other words, the only
> difference between this range and a typical persistent memory range is
> that we will have a flag to disable DAX operation.

o.k. In short, we have to disable the 'QUEUE_FLAG_DAX' flag for this
namespace & region, and also not execute the code below for this new type?

pmem_attach_disk()
...
	dax_dev = alloc_dax(pmem, disk->disk_name, &pmem_dax_ops);
	if (!dax_dev) {
		put_disk(disk);
		return -ENOMEM;
	}
	dax_write_cache(dax_dev, wbc);
	pmem->dax_dev = dax_dev;

> See the usage of nvdimm_has_cache() in pmem_attach_disk() as an
> example of how to pass attributes about the "region" to the pmem
> driver.

sure.

>> 3] For sending guest-to-host flush commands, we still have to think about
>> some async way?
>
> I thought we discussed this being a paravirtualized virtio command ring?

o.k., will implement this.
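The attach-time decision being discussed (a region attribute telling the pmem driver to skip DAX setup) can be sketched as a toy model. The flag name and structures below are hypothetical, not the kernel's; the real mechanism would be a region attribute passed through libnvdimm, analogous to how nvdimm_has_cache() is consumed in pmem_attach_disk().

```c
#include <assert.h>

/* Hypothetical region flag: the range needs an explicit host-side
 * flush (the "fake DAX" case), so direct mapping must stay off. */
#define TOY_REGION_HOST_FLUSH  (1 << 0)

struct toy_disk {
    int dax_enabled;  /* models QUEUE_FLAG_DAX being set or not */
};

static void toy_pmem_attach_disk(struct toy_disk *d, unsigned region_flags)
{
    if (region_flags & TOY_REGION_HOST_FLUSH) {
        /* virtio-backed range: skip the alloc_dax() path entirely,
         * leave the DAX queue flag clear */
        d->dax_enabled = 0;
        return;
    }
    /* ordinary persistent memory range: enable DAX as usual */
    d->dax_enabled = 1;
}
```

The point is only that the branch lives in one place at attach time; everything after it (bio submission, flush handling) is shared between the two kinds of region.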
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
On Fri, Jan 12, 2018 at 10:23 PM, Pankaj Gupta wrote:
>
> Hello Dan,
>
>> Not a flag, but a new "Address Range Type GUID". See section "5.2.25.2
>> System Physical Address (SPA) Range Structure" in the ACPI 6.2A
>> specification. Since it is a GUID, we could define a Linux-specific
>> type for this case, but spec changes would allow non-Linux hypervisors
>> to advertise a standard interface to guests.
>
> I have added a new SPA with a GUID for this memory type, and I could add
> this new memory type to the system memory map. I need help with the
> namespace handling for this new type. As mentioned in the discussion [1]:
>
> - Create a new namespace for this new memory type
> - Teach libnvdimm how to handle this new namespace
>
> I have some queries on this:
>
> 1] How would namespace handling of this new memory type work?

This would be a namespace that creates a pmem device, but does not allow
DAX.

> 2] There are existing namespace types:
> ND_DEVICE_NAMESPACE_IO, ND_DEVICE_NAMESPACE_PMEM, ND_DEVICE_NAMESPACE_BLK
>
> How will libnvdimm handle this new namespace type in conjunction with the
> existing memory types, regions & namespaces?

The type will be either ND_DEVICE_NAMESPACE_IO or
ND_DEVICE_NAMESPACE_PMEM depending on whether you configure KVM to
provide a virtual NVDIMM and label space. In other words, the only
difference between this range and a typical persistent memory range is
that we will have a flag to disable DAX operation.

See the usage of nvdimm_has_cache() in pmem_attach_disk() as an
example of how to pass attributes about the "region" to the pmem
driver.

> 3] For sending guest-to-host flush commands, we still have to think about
> some async way?

I thought we discussed this being a paravirtualized virtio command ring?
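The paravirtualized command ring mentioned here can be modeled in a few lines. Below is a minimal single-threaded sketch of the asynchronous flow: the guest queues a flush request without blocking, the host later services it (a real host would fsync() the file backing the pmem range), and the guest reaps the completion. The ring layout and names are illustrative only, not a real virtqueue.

```c
#include <assert.h>
#include <string.h>

#define RING_SIZE 8

enum req_state { REQ_FREE, REQ_PENDING, REQ_DONE };

struct flush_ring {
    enum req_state slot[RING_SIZE];
    int head;  /* next slot the guest fills */
    int tail;  /* next slot the host services */
};

/* Guest side: queue a flush without blocking the vCPU. */
static int guest_queue_flush(struct flush_ring *r)
{
    int idx = r->head;
    if (r->slot[idx] != REQ_FREE)
        return -1;                  /* ring full */
    r->slot[idx] = REQ_PENDING;
    r->head = (idx + 1) % RING_SIZE;
    return idx;                     /* token the guest can wait on later */
}

/* Host side: service one request (a real host would fsync() the
 * backing file here) and post a completion. */
static int host_service_one(struct flush_ring *r)
{
    int idx = r->tail;
    if (r->slot[idx] != REQ_PENDING)
        return -1;                  /* nothing to do */
    r->slot[idx] = REQ_DONE;
    r->tail = (idx + 1) % RING_SIZE;
    return idx;
}

/* Guest side: reap a completion for a previously queued token. */
static int guest_flush_done(struct flush_ring *r, int token)
{
    if (r->slot[token] != REQ_DONE)
        return 0;
    r->slot[token] = REQ_FREE;      /* recycle the slot */
    return 1;
}
```

The key property the thread is after is visible in the model: queuing and completion are decoupled, so the flushing task can sleep on the token instead of spinning in the vCPU while the host does a slow fsync.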
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
Hello Dan,

> Not a flag, but a new "Address Range Type GUID". See section "5.2.25.2
> System Physical Address (SPA) Range Structure" in the ACPI 6.2A
> specification. Since it is a GUID, we could define a Linux-specific
> type for this case, but spec changes would allow non-Linux hypervisors
> to advertise a standard interface to guests.

I have added a new SPA with a GUID for this memory type, and I could add
this new memory type to the system memory map. I need help with the
namespace handling for this new type. As mentioned in the discussion [1]:

- Create a new namespace for this new memory type
- Teach libnvdimm how to handle this new namespace

I have some queries on this:

1] How would namespace handling of this new memory type work?

2] There are existing namespace types:
ND_DEVICE_NAMESPACE_IO, ND_DEVICE_NAMESPACE_PMEM, ND_DEVICE_NAMESPACE_BLK

How will libnvdimm handle this new namespace type in conjunction with the
existing memory types, regions & namespaces?

3] For sending guest-to-host flush commands, we still have to think about
some async way?

[1] https://lists.gnu.org/archive/html/qemu-devel/2017-07/msg08404.html

Thanks,
Pankaj
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
On Fri, Nov 24, 2017 at 4:40 AM, Pankaj Gupta wrote:
[..]
> 1] Expose vNVDIMM memory range to KVM guest.
>
>    - Add a flag in the ACPI NFIT table for this new memory type. Do we
>      need NVDIMM spec changes for this?

Not a flag, but a new "Address Range Type GUID". See section "5.2.25.2
System Physical Address (SPA) Range Structure" in the ACPI 6.2A
specification. Since it is a GUID, we could define a Linux-specific
type for this case, but spec changes would allow non-Linux hypervisors
to advertise a standard interface to guests.
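For reference, the SPA Range Structure identifies the region type by a 16-byte GUID. A small sketch of turning the raw bytes into the familiar string form: in the usual GUID encoding the first three fields are stored little-endian and the last eight bytes are stored as-is. The example GUID used in the test is made up for illustration, not a value from the spec.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Format 16 raw GUID bytes (as they would appear in the SPA Range
 * Structure) into the canonical 36-character string form. */
static void guid_to_str(const uint8_t g[16], char out[37])
{
    snprintf(out, 37,
             "%02X%02X%02X%02X-%02X%02X-%02X%02X-%02X%02X-"
             "%02X%02X%02X%02X%02X%02X",
             g[3], g[2], g[1], g[0],   /* 32-bit field, little-endian */
             g[5], g[4],               /* 16-bit field, little-endian */
             g[7], g[6],               /* 16-bit field, little-endian */
             g[8], g[9],               /* clock-seq bytes, as stored */
             g[10], g[11], g[12], g[13], g[14], g[15]); /* node bytes */
}
```

A new Linux-specific region type would amount to picking a fresh GUID and matching on these 16 bytes when parsing the NFIT.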
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
On 24/11/2017 14:02, Pankaj Gupta wrote: > >>>- Suggestion by Paolo & Stefan(previously) to use virtio-blk makes sense >>>if just >>> want a flush vehicle to send guest commands to host and get reply >>> after asynchronous >>> execution. There was previous discussion [1] with Rik & Dan on this. >>> >>> [1] https://lists.gnu.org/archive/html/qemu-devel/2017-07/msg08373.html >> >> ... in fact, the virtio-blk device _could_ actually accept regular I/O >> too. That would make it easier to boot from pmem. Is there anything >> similar in regular hardware? > > there is existing block device associated(hard bind) with the pmem range. > Also, comment by Christoph [1], about removing block device with DAX support. > Still I am not clear about this. Am I missing anything here? The I/O part of the blk device would only be used by the firmware. In Linux, the different device id would bind the device to a different driver that would only be used for flushing. But maybe this idea makes no sense. :) Paolo
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
> >- Suggestion by Paolo & Stefan(previously) to use virtio-blk makes sense > >if just > > want a flush vehicle to send guest commands to host and get reply > > after asynchronous > > execution. There was previous discussion [1] with Rik & Dan on this. > > > > [1] https://lists.gnu.org/archive/html/qemu-devel/2017-07/msg08373.html > > ... in fact, the virtio-blk device _could_ actually accept regular I/O > too. That would make it easier to boot from pmem. Is there anything > similar in regular hardware? There is an existing block device associated (hard bound) with the pmem range. Also, there is a comment by Christoph [1] about removing the block device with DAX support. I am still not clear about this. Am I missing anything here? [1] https://marc.info/?l=kvm&m=150822740332536&w=2 Pankaj
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
On 24/11/2017 13:40, Pankaj Gupta wrote: >- Suggestion by Paolo & Stefan(previously) to use virtio-blk makes sense > if just > want a flush vehicle to send guest commands to host and get reply after > asynchronous > execution. There was previous discussion [1] with Rik & Dan on this. > > [1] https://lists.gnu.org/archive/html/qemu-devel/2017-07/msg08373.html ... in fact, the virtio-blk device _could_ actually accept regular I/O too. That would make it easier to boot from pmem. Is there anything similar in regular hardware? Paolo
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
Hello, Thank you all for all the useful suggestions. I want to summarize the discussions so far in the thread. Please see below: > >> > >>> We can go with the "best" interface for what > >>> could be a relatively slow flush (fsync on a > >>> file on ssd/disk on the host), which requires > >>> that the flushing task wait on completion > >>> asynchronously. > >> > >> > >> I'd like to clarify the interface of "wait on completion > >> asynchronously" and KVM async page fault a bit more. > >> > >> Current design of async-page-fault only works on RAM rather > >> than MMIO, i.e, if the page fault caused by accessing the > >> device memory of a emulated device, it needs to go to > >> userspace (QEMU) which emulates the operation in vCPU's > >> thread. > >> > >> As i mentioned before the memory region used for vNVDIMM > >> flush interface should be MMIO and consider its support > >> on other hypervisors, so we do better push this async > >> mechanism into the flush interface design itself rather > >> than depends on kvm async-page-fault. > > > > I would expect this interface to be virtio-ring based to queue flush > > requests asynchronously to the host. > > Could we reuse the virtio-blk device, only with a different device id? As per previous discussions, there were suggestions on the two main parts of the project: 1] Expose vNVDIMM memory range to KVM guest. - Add a flag in the ACPI NFIT table for this new memory type. Do we need NVDIMM spec changes for this? - Guest should be able to add this memory to the system memory map. The name of the added memory in '/proc/iomem' should be different (shared memory?) from persistent memory, as it does not satisfy the exact definition of persistent memory (it requires an explicit flush). - Guest should not allow 'device-dax' and other fancy features which are not virtualization friendly. 2] Flushing interface to persist guest changes. - As per the suggestion by ChristophH (CCed), we explored options other than virtio, like MMIO, etc. 
Looks like most of these options are not use-case friendly, as we want to do fsync on a file on an ssd/disk on the host and cannot make guest vCPUs wait for that time. - Though adding a new driver (virtio-pmem) looks like repeated work and is not needed, so we can go with the existing pmem driver and add a flush specific to this new memory type. - The suggestion by Paolo & Stefan (previously) to use virtio-blk makes sense if we just want a flush vehicle to send guest commands to the host and get a reply after asynchronous execution. There was a previous discussion [1] with Rik & Dan on this. [1] https://lists.gnu.org/archive/html/qemu-devel/2017-07/msg08373.html Is my understanding correct here? Thanks, Pankaj
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
On 23/11/2017 17:14, Dan Williams wrote: > On Wed, Nov 22, 2017 at 8:05 PM, Xiao Guangrong > wrote: >> >> >> On 11/22/2017 02:19 AM, Rik van Riel wrote: >> >>> We can go with the "best" interface for what >>> could be a relatively slow flush (fsync on a >>> file on ssd/disk on the host), which requires >>> that the flushing task wait on completion >>> asynchronously. >> >> >> I'd like to clarify the interface of "wait on completion >> asynchronously" and KVM async page fault a bit more. >> >> Current design of async-page-fault only works on RAM rather >> than MMIO, i.e, if the page fault caused by accessing the >> device memory of a emulated device, it needs to go to >> userspace (QEMU) which emulates the operation in vCPU's >> thread. >> >> As i mentioned before the memory region used for vNVDIMM >> flush interface should be MMIO and consider its support >> on other hypervisors, so we do better push this async >> mechanism into the flush interface design itself rather >> than depends on kvm async-page-fault. > > I would expect this interface to be virtio-ring based to queue flush > requests asynchronously to the host. Could we reuse the virtio-blk device, only with a different device id? Thanks, Paolo
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
On Wed, Nov 22, 2017 at 8:05 PM, Xiao Guangrong wrote: > > > On 11/22/2017 02:19 AM, Rik van Riel wrote: > >> We can go with the "best" interface for what >> could be a relatively slow flush (fsync on a >> file on ssd/disk on the host), which requires >> that the flushing task wait on completion >> asynchronously. > > > I'd like to clarify the interface of "wait on completion > asynchronously" and KVM async page fault a bit more. > > Current design of async-page-fault only works on RAM rather > than MMIO, i.e, if the page fault caused by accessing the > device memory of a emulated device, it needs to go to > userspace (QEMU) which emulates the operation in vCPU's > thread. > > As i mentioned before the memory region used for vNVDIMM > flush interface should be MMIO and consider its support > on other hypervisors, so we do better push this async > mechanism into the flush interface design itself rather > than depends on kvm async-page-fault. I would expect this interface to be virtio-ring based to queue flush requests asynchronously to the host.
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
On 11/22/2017 02:19 AM, Rik van Riel wrote: We can go with the "best" interface for what could be a relatively slow flush (fsync on a file on ssd/disk on the host), which requires that the flushing task wait on completion asynchronously. I'd like to clarify the interface of "wait on completion asynchronously" and KVM async page fault a bit more. The current design of async-page-fault only works on RAM rather than MMIO, i.e., if the page fault is caused by accessing the device memory of an emulated device, it needs to go to userspace (QEMU) which emulates the operation in the vCPU's thread. As I mentioned before, the memory region used for the vNVDIMM flush interface should be MMIO, and considering its support on other hypervisors, we had better push this async mechanism into the flush interface design itself rather than depend on KVM async-page-fault.
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
On Tue, 2017-11-21 at 10:26 -0800, Dan Williams wrote: > On Tue, Nov 21, 2017 at 10:19 AM, Rik van Riel > wrote: > > On Fri, 2017-11-03 at 14:21 +0800, Xiao Guangrong wrote: > > > On 11/03/2017 12:30 AM, Dan Williams wrote: > > > > > > > > Good point, I was assuming that the mmio flush interface would > > > > be > > > > discovered separately from the NFIT-defined memory range. > > > > Perhaps > > > > via > > > > PCI in the guest? This piece of the proposal needs a bit more > > > > thought... > > > > > > > > > > Consider the case that the vNVDIMM device on normal storage and > > > vNVDIMM device on real nvdimm hardware can both exist in VM, the > > > flush interface should be able to associate with the SPA region > > > respectively. That's why I'd like to integrate the flush > > > interface > > > into NFIT/ACPI by using a separate table. Is it possible to be a > > > part of ACPI specification? :) > > > > It would also be perfectly fine to have the > > virtio PCI device indicate which vNVDIMM > > range it flushes. > > > > Since the guest OS needs to support that kind > > of device anyway, does it really matter which > > direction the device association points? > > > > We can go with the "best" interface for what > > could be a relatively slow flush (fsync on a > > file on ssd/disk on the host), which requires > > that the flushing task wait on completion > > asynchronously. > > > > If that kind of interface cannot be advertised > > through NFIT/ACPI, wouldn't it be perfectly fine > > to have only the virtio PCI device indicate which > > vNVDIMM range it flushes? > > > > Yes, we could do this with a custom PCI device, however the NFIT is > frustratingly close to being able to define something like this. At > the very least we can start with a "SPA Range GUID" that is Linux > specific to indicate "call this virtio flush interface on FUA / flush > cache requests" as a stop gap until a standardized flush interface > can > be defined. 
Ahh, is that a "look for a device with this GUID" NFIT hint? That would be enough to tip off OSes that do not support that device that they found a vNVDIMM device that they cannot safely flush, which could help them report such errors to userspace... -- All rights reversed
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
On Tue, Nov 21, 2017 at 10:19 AM, Rik van Riel wrote: > On Fri, 2017-11-03 at 14:21 +0800, Xiao Guangrong wrote: >> On 11/03/2017 12:30 AM, Dan Williams wrote: >> > >> > Good point, I was assuming that the mmio flush interface would be >> > discovered separately from the NFIT-defined memory range. Perhaps >> > via >> > PCI in the guest? This piece of the proposal needs a bit more >> > thought... >> > >> >> Consider the case that the vNVDIMM device on normal storage and >> vNVDIMM device on real nvdimm hardware can both exist in VM, the >> flush interface should be able to associate with the SPA region >> respectively. That's why I'd like to integrate the flush interface >> into NFIT/ACPI by using a separate table. Is it possible to be a >> part of ACPI specification? :) > > It would also be perfectly fine to have the > virtio PCI device indicate which vNVDIMM > range it flushes. > > Since the guest OS needs to support that kind > of device anyway, does it really matter which > direction the device association points? > > We can go with the "best" interface for what > could be a relatively slow flush (fsync on a > file on ssd/disk on the host), which requires > that the flushing task wait on completion > asynchronously. > > If that kind of interface cannot be advertised > through NFIT/ACPI, wouldn't it be perfectly fine > to have only the virtio PCI device indicate which > vNVDIMM range it flushes? > Yes, we could do this with a custom PCI device, however the NFIT is frustratingly close to being able to define something like this. At the very least we can start with a "SPA Range GUID" that is Linux specific to indicate "call this virtio flush interface on FUA / flush cache requests" as a stop gap until a standardized flush interface can be defined.
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
On Fri, 2017-11-03 at 14:21 +0800, Xiao Guangrong wrote: > On 11/03/2017 12:30 AM, Dan Williams wrote: > > > > Good point, I was assuming that the mmio flush interface would be > > discovered separately from the NFIT-defined memory range. Perhaps > > via > > PCI in the guest? This piece of the proposal needs a bit more > > thought... > > > > Consider the case that the vNVDIMM device on normal storage and > vNVDIMM device on real nvdimm hardware can both exist in VM, the > flush interface should be able to associate with the SPA region > respectively. That's why I'd like to integrate the flush interface > into NFIT/ACPI by using a separate table. Is it possible to be a > part of ACPI specification? :) It would also be perfectly fine to have the virtio PCI device indicate which vNVDIMM range it flushes. Since the guest OS needs to support that kind of device anyway, does it really matter which direction the device association points? We can go with the "best" interface for what could be a relatively slow flush (fsync on a file on ssd/disk on the host), which requires that the flushing task wait on completion asynchronously. If that kind of interface cannot be advertised through NFIT/ACPI, wouldn't it be perfectly fine to have only the virtio PCI device indicate which vNVDIMM range it flushes? -- All rights reversed
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
> > > > > >> [..] > >> >> Yes, the GUID will specifically identify this range as "Virtio Shared > >> >> Memory" (or whatever name survives after a bikeshed debate). The > >> >> libnvdimm core then needs to grow a new region type that mostly > >> >> behaves the same as a "pmem" region, but drivers/nvdimm/pmem.c grows a > >> >> new flush interface to perform the host communication. Device-dax > >> >> would be disallowed from attaching to this region type, or we could > >> >> grow a new device-dax type that does not allow the raw device to be > >> >> mapped, but allows a filesystem mounted on top to manage the flush > >> >> interface. > >> > > >> > > >> > I am afraid it is not a good idea that a single SPA is used for multiple > >> > purposes. For the region used as "pmem" is directly mapped to the VM so > >> > that guest can freely access it without host's assistance, however, for > >> > the region used as "host communication" is not mapped to VM, so that > >> > it causes VM-exit and host gets the chance to do specific operations, > >> > e.g, flush cache. So we'd better distinctly define these two regions to > >> > avoid the unnecessary complexity in hypervisor. > >> > >> Good point, I was assuming that the mmio flush interface would be > >> discovered separately from the NFIT-defined memory range. Perhaps via > >> PCI in the guest? This piece of the proposal needs a bit more > >> thought... > > > > Also, in earlier discussions we agreed for entire device flush whenever > > guest > > performs a fsync on DAX file. If we do a MMIO call for this, guest CPU > > would be > > trapped for the duration device flush is completed. > > > > Instead, if we do perform an asynchronous flush guest CPU's can be utilized > > by > > some other tasks till flush completes? > > Yes, the interface for the guest to trigger and wait for flush > requests should be asynchronous, just like a storage "flush-cache" > command. 
One idea I got while discussing this with Rik & Amit during KVM Forum is to use something similar to the Hyper-V key-value pair mechanism for sharing commands between guest <=> host. I don't think such a thing exists yet for KVM? Or how can we utilize existing features in KVM to achieve this?
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
On Sun, Nov 5, 2017 at 11:57 PM, Pankaj Gupta wrote: > > >> [..] >> >> Yes, the GUID will specifically identify this range as "Virtio Shared >> >> Memory" (or whatever name survives after a bikeshed debate). The >> >> libnvdimm core then needs to grow a new region type that mostly >> >> behaves the same as a "pmem" region, but drivers/nvdimm/pmem.c grows a >> >> new flush interface to perform the host communication. Device-dax >> >> would be disallowed from attaching to this region type, or we could >> >> grow a new device-dax type that does not allow the raw device to be >> >> mapped, but allows a filesystem mounted on top to manage the flush >> >> interface. >> > >> > >> > I am afraid it is not a good idea that a single SPA is used for multiple >> > purposes. For the region used as "pmem" is directly mapped to the VM so >> > that guest can freely access it without host's assistance, however, for >> > the region used as "host communication" is not mapped to VM, so that >> > it causes VM-exit and host gets the chance to do specific operations, >> > e.g, flush cache. So we'd better distinctly define these two regions to >> > avoid the unnecessary complexity in hypervisor. >> >> Good point, I was assuming that the mmio flush interface would be >> discovered separately from the NFIT-defined memory range. Perhaps via >> PCI in the guest? This piece of the proposal needs a bit more >> thought... > > Also, in earlier discussions we agreed for entire device flush whenever guest > performs a fsync on DAX file. If we do a MMIO call for this, guest CPU would > be > trapped for the duration device flush is completed. > > Instead, if we do perform an asynchronous flush guest CPU's can be utilized by > some other tasks till flush completes? Yes, the interface for the guest to trigger and wait for flush requests should be asynchronous, just like a storage "flush-cache" command.
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
> [..] > >> Yes, the GUID will specifically identify this range as "Virtio Shared > >> Memory" (or whatever name survives after a bikeshed debate). The > >> libnvdimm core then needs to grow a new region type that mostly > >> behaves the same as a "pmem" region, but drivers/nvdimm/pmem.c grows a > >> new flush interface to perform the host communication. Device-dax > >> would be disallowed from attaching to this region type, or we could > >> grow a new device-dax type that does not allow the raw device to be > >> mapped, but allows a filesystem mounted on top to manage the flush > >> interface. > > > > > > I am afraid it is not a good idea that a single SPA is used for multiple > > purposes. For the region used as "pmem" is directly mapped to the VM so > > that guest can freely access it without host's assistance, however, for > > the region used as "host communication" is not mapped to VM, so that > > it causes VM-exit and host gets the chance to do specific operations, > > e.g, flush cache. So we'd better distinctly define these two regions to > > avoid the unnecessary complexity in hypervisor. > > Good point, I was assuming that the mmio flush interface would be > discovered separately from the NFIT-defined memory range. Perhaps via > PCI in the guest? This piece of the proposal needs a bit more > thought... Also, in earlier discussions we agreed on an entire-device flush whenever the guest performs an fsync on a DAX file. If we do an MMIO call for this, the guest CPU would be trapped until the device flush is completed. Instead, if we perform an asynchronous flush, guest CPUs can be utilized by other tasks till the flush completes? Thanks, Pankaj
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
On 11/03/2017 12:30 AM, Dan Williams wrote: On Thu, Nov 2, 2017 at 1:50 AM, Xiao Guangrong wrote: [..] Yes, the GUID will specifically identify this range as "Virtio Shared Memory" (or whatever name survives after a bikeshed debate). The libnvdimm core then needs to grow a new region type that mostly behaves the same as a "pmem" region, but drivers/nvdimm/pmem.c grows a new flush interface to perform the host communication. Device-dax would be disallowed from attaching to this region type, or we could grow a new device-dax type that does not allow the raw device to be mapped, but allows a filesystem mounted on top to manage the flush interface. I am afraid it is not a good idea that a single SPA is used for multiple purposes. For the region used as "pmem" is directly mapped to the VM so that guest can freely access it without host's assistance, however, for the region used as "host communication" is not mapped to VM, so that it causes VM-exit and host gets the chance to do specific operations, e.g, flush cache. So we'd better distinctly define these two regions to avoid the unnecessary complexity in hypervisor. Good point, I was assuming that the mmio flush interface would be discovered separately from the NFIT-defined memory range. Perhaps via PCI in the guest? This piece of the proposal needs a bit more thought... Consider the case that the vNVDIMM device on normal storage and vNVDIMM device on real nvdimm hardware can both exist in VM, the flush interface should be able to associate with the SPA region respectively. That's why I'd like to integrate the flush interface into NFIT/ACPI by using a separate table. Is it possible to be a part of ACPI specification? :)
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
On Thu, Nov 2, 2017 at 1:50 AM, Xiao Guangrong wrote: [..] >> Yes, the GUID will specifically identify this range as "Virtio Shared >> Memory" (or whatever name survives after a bikeshed debate). The >> libnvdimm core then needs to grow a new region type that mostly >> behaves the same as a "pmem" region, but drivers/nvdimm/pmem.c grows a >> new flush interface to perform the host communication. Device-dax >> would be disallowed from attaching to this region type, or we could >> grow a new device-dax type that does not allow the raw device to be >> mapped, but allows a filesystem mounted on top to manage the flush >> interface. > > > I am afraid it is not a good idea that a single SPA is used for multiple > purposes. For the region used as "pmem" is directly mapped to the VM so > that guest can freely access it without host's assistance, however, for > the region used as "host communication" is not mapped to VM, so that > it causes VM-exit and host gets the chance to do specific operations, > e.g, flush cache. So we'd better distinctly define these two regions to > avoid the unnecessary complexity in hypervisor. Good point, I was assuming that the mmio flush interface would be discovered separately from the NFIT-defined memory range. Perhaps via PCI in the guest? This piece of the proposal needs a bit more thought...
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
On 11/01/2017 11:20 PM, Dan Williams wrote: On 11/01/2017 12:25 PM, Dan Williams wrote: [..] It's not persistent memory if it requires a hypercall to make it persistent. Unless memory writes can be made durable purely with cpu instructions it's dangerous for it to be treated as a PMEM range. Consider a guest that tried to map it with device-dax which has no facility to route requests to a special flushing interface. Can we separate the concept of flush interface from persistent memory? Say there are two APIs, one is used to indicate the memory type (i.e, /proc/iomem) and another one indicates the flush interface. So for existing nvdimm hardwares: 1: Persist-memory + CLFLUSH 2: Persiste-memory + flush-hint-table (I know Intel does not use it) and for the virtual nvdimm which backended on normal storage: Persist-memory + virtual flush interface I see the flush interface as fundamental to identifying the media properties. It's not byte-addressable persistent memory if the application needs to call a sideband interface to manage writes. This is why we have pushed for something like the MAP_SYNC interface to make filesystem-dax actually behave in a way that applications can safely treat it as persistent memory, and this is also the guarantee that device-dax provides. Changing the flush interface makes it distinct and unusable for applications that want to manage data persistence in userspace. I was thinking that from the device's perspective, both of them are not persistent until a flush operation is issued (clflush or virtual flush-interface). But you are right, from the user/software's perspective, their fundamentals are different. So for the virtual nvdimm which is backended on normal storage, we should refuse MAP_SYNC and the only way to guarantee persistence is fsync/fdatasync. Actually, we can treat a SPA region which associates with specific flush interface as special GUID as your proposal, please see more in below comment... 
In what way is this "more complicated"? It was trivial to add support for the "volatile" NFIT range, this will not be any more complicated than that. Introducing memory type is easy indeed, however, a new flush interface definition is inevitable, i.e, we need a standard way to discover the MMIOs to communicate with host. Right, the proposed way to do that for x86 platforms is a new SPA Range GUID type. in the NFIT. So this SPA is used for both persistent memory region and flush interface? Maybe i missed it in previous mails, could you please detail how to do it? Yes, the GUID will specifically identify this range as "Virtio Shared Memory" (or whatever name survives after a bikeshed debate). The libnvdimm core then needs to grow a new region type that mostly behaves the same as a "pmem" region, but drivers/nvdimm/pmem.c grows a new flush interface to perform the host communication. Device-dax would be disallowed from attaching to this region type, or we could grow a new device-dax type that does not allow the raw device to be mapped, but allows a filesystem mounted on top to manage the flush interface. I am afraid it is not a good idea that a single SPA is used for multiple purposes. For the region used as "pmem" is directly mapped to the VM so that guest can freely access it without host's assistance, however, for the region used as "host communication" is not mapped to VM, so that it causes VM-exit and host gets the chance to do specific operations, e.g, flush cache. So we'd better distinctly define these two regions to avoid the unnecessary complexity in hypervisor.
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
> On 11/01/2017 12:25 PM, Dan Williams wrote: [..] >> It's not persistent memory if it requires a hypercall to make it >> persistent. Unless memory writes can be made durable purely with cpu >> instructions it's dangerous for it to be treated as a PMEM range. >> Consider a guest that tried to map it with device-dax which has no >> facility to route requests to a special flushing interface. >> > > Can we separate the concept of flush interface from persistent memory? > Say there are two APIs, one is used to indicate the memory type (i.e, > /proc/iomem) and another one indicates the flush interface. > > So for existing nvdimm hardwares: > 1: Persist-memory + CLFLUSH > 2: Persiste-memory + flush-hint-table (I know Intel does not use it) > > and for the virtual nvdimm which backended on normal storage: > Persist-memory + virtual flush interface I see the flush interface as fundamental to identifying the media properties. It's not byte-addressable persistent memory if the application needs to call a sideband interface to manage writes. This is why we have pushed for something like the MAP_SYNC interface to make filesystem-dax actually behave in a way that applications can safely treat it as persistent memory, and this is also the guarantee that device-dax provides. Changing the flush interface makes it distinct and unusable for applications that want to manage data persistence in userspace. >>> In what way is this "more complicated"? It was trivial to add support for the "volatile" NFIT range, this will not be any more complicated than that. >>> >>> Introducing memory type is easy indeed, however, a new flush interface >>> definition is inevitable, i.e, we need a standard way to discover the >>> MMIOs to communicate with host. >> >> >> Right, the proposed way to do that for x86 platforms is a new SPA >> Range GUID type. in the NFIT. >> > > So this SPA is used for both persistent memory region and flush interface? 
> Maybe i missed it in previous mails, could you please detail how to do > it? Yes, the GUID will specifically identify this range as "Virtio Shared Memory" (or whatever name survives after a bikeshed debate). The libnvdimm core then needs to grow a new region type that mostly behaves the same as a "pmem" region, but drivers/nvdimm/pmem.c grows a new flush interface to perform the host communication. Device-dax would be disallowed from attaching to this region type, or we could grow a new device-dax type that does not allow the raw device to be mapped, but allows a filesystem mounted on top to manage the flush interface. > BTW, please note hypercall is not acceptable for standard, MMIO/PIO regions > are. (Oh, yes, it depends on Paolo. :)) MMIO/PIO regions works for me, that's not the part of the proposal I'm concerned about.
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
On 11/01/2017 12:25 PM, Dan Williams wrote: On Tue, Oct 31, 2017 at 8:43 PM, Xiao Guangrong wrote: On 10/31/2017 10:20 PM, Dan Williams wrote: On Tue, Oct 31, 2017 at 12:13 AM, Xiao Guangrong wrote: On 07/27/2017 08:54 AM, Dan Williams wrote: At that point, would it make sense to expose these special virtio-pmem areas to the guest in a slightly different way, so the regions that need virtio flushing are not bound by the regular driver, and the regular driver can continue to work for memory regions that are backed by actual pmem in the host? Hmm, yes that could be feasible especially if it uses the ACPI NFIT mechanism. It would basically involve defining a new SPA (System Phyiscal Address) range GUID type, and then teaching libnvdimm to treat that as a new pmem device type. I would prefer a new flush mechanism to a new memory type introduced to NFIT, e.g, in that mechanism we can define request queues and completion queues and any other features to make virtualization friendly. That would be much simpler. No that's more confusing because now we are overloading the definition of persistent memory. I want this memory type identified from the top of the stack so it can appear differently in /proc/iomem and also implement this alternate flush communication. For the characteristic of memory, I have no idea why VM should know this difference. It can be completely transparent to VM, that means, VM does not need to know where this virtual PMEM comes from (for a really nvdimm backend or a normal storage). The only discrepancy is the flush interface. It's not persistent memory if it requires a hypercall to make it persistent. Unless memory writes can be made durable purely with cpu instructions it's dangerous for it to be treated as a PMEM range. Consider a guest that tried to map it with device-dax which has no facility to route requests to a special flushing interface. Can we separate the concept of flush interface from persistent memory? 
Say there are two APIs, one used to indicate the memory type (i.e., /proc/iomem) and another one indicating the flush interface. So for existing nvdimm hardware: 1: Persistent-memory + CLFLUSH 2: Persistent-memory + flush-hint-table (I know Intel does not use it) and for the virtual nvdimm which is backed by normal storage: Persistent-memory + virtual flush interface In what way is this "more complicated"? It was trivial to add support for the "volatile" NFIT range, this will not be any more complicated than that. Introducing a memory type is easy indeed; however, a new flush interface definition is inevitable, i.e., we need a standard way to discover the MMIOs to communicate with the host. Right, the proposed way to do that for x86 platforms is a new SPA Range GUID type in the NFIT. So this SPA is used for both the persistent memory region and the flush interface? Maybe I missed it in previous mails, could you please detail how to do it? BTW, please note a hypercall is not acceptable for a standard; MMIO/PIO regions are. (Oh, yes, it depends on Paolo. :))
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
On Tue, Oct 31, 2017 at 8:43 PM, Xiao Guangrong wrote: > > > On 10/31/2017 10:20 PM, Dan Williams wrote: >> >> On Tue, Oct 31, 2017 at 12:13 AM, Xiao Guangrong >> wrote: >>> >>> >>> >>> On 07/27/2017 08:54 AM, Dan Williams wrote: >>> > At that point, would it make sense to expose these special > virtio-pmem areas to the guest in a slightly different way, > so the regions that need virtio flushing are not bound by > the regular driver, and the regular driver can continue to > work for memory regions that are backed by actual pmem in > the host? Hmm, yes that could be feasible especially if it uses the ACPI NFIT mechanism. It would basically involve defining a new SPA (System Phyiscal Address) range GUID type, and then teaching libnvdimm to treat that as a new pmem device type. >>> >>> >>> >>> I would prefer a new flush mechanism to a new memory type introduced >>> to NFIT, e.g, in that mechanism we can define request queues and >>> completion queues and any other features to make virtualization >>> friendly. That would be much simpler. >>> >> >> No that's more confusing because now we are overloading the definition >> of persistent memory. I want this memory type identified from the top >> of the stack so it can appear differently in /proc/iomem and also >> implement this alternate flush communication. >> > > For the characteristic of memory, I have no idea why VM should know this > difference. It can be completely transparent to VM, that means, VM > does not need to know where this virtual PMEM comes from (for a really > nvdimm backend or a normal storage). The only discrepancy is the flush > interface. It's not persistent memory if it requires a hypercall to make it persistent. Unless memory writes can be made durable purely with cpu instructions it's dangerous for it to be treated as a PMEM range. Consider a guest that tried to map it with device-dax which has no facility to route requests to a special flushing interface. 
> >> In what way is this "more complicated"? It was trivial to add support >> for the "volatile" NFIT range, this will not be any more complicated >> than that. >> > > Introducing memory type is easy indeed, however, a new flush interface > definition is inevitable, i.e, we need a standard way to discover the > MMIOs to communicate with host. Right, the proposed way to do that for x86 platforms is a new SPA Range GUID type. in the NFIT.
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
On 10/31/2017 10:20 PM, Dan Williams wrote: On Tue, Oct 31, 2017 at 12:13 AM, Xiao Guangrong wrote: On 07/27/2017 08:54 AM, Dan Williams wrote: At that point, would it make sense to expose these special virtio-pmem areas to the guest in a slightly different way, so the regions that need virtio flushing are not bound by the regular driver, and the regular driver can continue to work for memory regions that are backed by actual pmem in the host? Hmm, yes that could be feasible especially if it uses the ACPI NFIT mechanism. It would basically involve defining a new SPA (System Phyiscal Address) range GUID type, and then teaching libnvdimm to treat that as a new pmem device type. I would prefer a new flush mechanism to a new memory type introduced to NFIT, e.g, in that mechanism we can define request queues and completion queues and any other features to make virtualization friendly. That would be much simpler. No that's more confusing because now we are overloading the definition of persistent memory. I want this memory type identified from the top of the stack so it can appear differently in /proc/iomem and also implement this alternate flush communication. For the characteristic of memory, I have no idea why VM should know this difference. It can be completely transparent to VM, that means, VM does not need to know where this virtual PMEM comes from (for a really nvdimm backend or a normal storage). The only discrepancy is the flush interface. In what way is this "more complicated"? It was trivial to add support for the "volatile" NFIT range, this will not be any more complicated than that. Introducing memory type is easy indeed, however, a new flush interface definition is inevitable, i.e, we need a standard way to discover the MMIOs to communicate with host.
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
On Tue, Oct 31, 2017 at 12:13 AM, Xiao Guangrong wrote: > > > On 07/27/2017 08:54 AM, Dan Williams wrote: > >>> At that point, would it make sense to expose these special >>> virtio-pmem areas to the guest in a slightly different way, >>> so the regions that need virtio flushing are not bound by >>> the regular driver, and the regular driver can continue to >>> work for memory regions that are backed by actual pmem in >>> the host? >> >> >> Hmm, yes that could be feasible especially if it uses the ACPI NFIT >> mechanism. It would basically involve defining a new SPA (System >> Physical Address) range GUID type, and then teaching libnvdimm to >> treat that as a new pmem device type. > > > I would prefer a new flush mechanism to a new memory type introduced > to NFIT, e.g., in that mechanism we can define request queues and > completion queues and any other features to make virtualization > friendly. That would be much simpler. > No that's more confusing because now we are overloading the definition of persistent memory. I want this memory type identified from the top of the stack so it can appear differently in /proc/iomem and also implement this alternate flush communication. In what way is this "more complicated"? It was trivial to add support for the "volatile" NFIT range, this will not be any more complicated than that.
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
On 07/27/2017 08:54 AM, Dan Williams wrote: At that point, would it make sense to expose these special virtio-pmem areas to the guest in a slightly different way, so the regions that need virtio flushing are not bound by the regular driver, and the regular driver can continue to work for memory regions that are backed by actual pmem in the host? Hmm, yes that could be feasible especially if it uses the ACPI NFIT mechanism. It would basically involve defining a new SPA (System Physical Address) range GUID type, and then teaching libnvdimm to treat that as a new pmem device type. I would prefer a new flush mechanism to a new memory type introduced to NFIT, e.g., in that mechanism we can define request queues and completion queues and any other features to make virtualization friendly. That would be much simpler.
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
On Wed, Jul 26, 2017 at 4:46 PM, Rik van Riel wrote: > On Wed, 2017-07-26 at 14:40 -0700, Dan Williams wrote: >> On Wed, Jul 26, 2017 at 2:27 PM, Rik van Riel >> wrote: >> > On Wed, 2017-07-26 at 09:47 -0400, Pankaj Gupta wrote: >> > > > >> > > >> > > Just want to summarize here (high level): >> > > >> > > This will require implementing a new 'virtio-pmem' device which >> > > presents >> > > a DAX address range (like pmem) to guest with read/write (direct >> > > access) >> > > & device flush functionality. Also, qemu should implement >> > > corresponding >> > > support for flush using virtio. >> > > >> > >> > Alternatively, the existing pmem code, with >> > a flush-only block device on the side, which >> > is somehow associated with the pmem device. >> > >> > I wonder which alternative leads to the least >> > code duplication, and the least maintenance >> > hassle going forward. >> >> I'd much prefer to have another driver. I.e. a driver that refactors >> out some common pmem details into a shared object and can attach to >> ND_DEVICE_NAMESPACE_{IO,PMEM}. A control device on the side seems >> like >> a recipe for confusion. > > At that point, would it make sense to expose these special > virtio-pmem areas to the guest in a slightly different way, > so the regions that need virtio flushing are not bound by > the regular driver, and the regular driver can continue to > work for memory regions that are backed by actual pmem in > the host? Hmm, yes that could be feasible especially if it uses the ACPI NFIT mechanism. It would basically involve defining a new SPA (System Physical Address) range GUID type, and then teaching libnvdimm to treat that as a new pmem device type. See usage of UUID_PERSISTENT_MEMORY in drivers/acpi/nfit/ and the eventual region description sent to nvdimm_pmem_region_create(). 
We would then need to plumb a new flag so that nd_region_to_nstype() in libnvdimm returns a different namespace type number for this virtio use case, but otherwise the rest of libnvdimm should treat the region as pmem.
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
On Wed, 2017-07-26 at 14:40 -0700, Dan Williams wrote: > On Wed, Jul 26, 2017 at 2:27 PM, Rik van Riel > wrote: > > On Wed, 2017-07-26 at 09:47 -0400, Pankaj Gupta wrote: > > > > > > > > > > Just want to summarize here(high level): > > > > > > This will require implementing new 'virtio-pmem' device which > > > presents > > > a DAX address range(like pmem) to guest with read/write(direct > > > access) > > > & device flush functionality. Also, qemu should implement > > > corresponding > > > support for flush using virtio. > > > > > > > Alternatively, the existing pmem code, with > > a flush-only block device on the side, which > > is somehow associated with the pmem device. > > > > I wonder which alternative leads to the least > > code duplication, and the least maintenance > > hassle going forward. > > I'd much prefer to have another driver. I.e. a driver that refactors > out some common pmem details into a shared object and can attach to > ND_DEVICE_NAMESPACE_{IO,PMEM}. A control device on the side seems > like > a recipe for confusion. At that point, would it make sense to expose these special virtio-pmem areas to the guest in a slightly different way, so the regions that need virtio flushing are not bound by the regular driver, and the regular driver can continue to work for memory regions that are backed by actual pmem in the host? > With a $new_driver in hand you can just do: > > modprobe $new_driver > echo $namespace > /sys/bus/nd/drivers/nd_pmem/unbind > echo $namespace > /sys/bus/nd/drivers/$new_driver/new_id > echo $namespace > /sys/bus/nd/drivers/$new_driver/bind > > ...and the guest can arrange for $new_driver to be the default, so > you > don't need to do those steps each boot of the VM, by doing: > > echo "blacklist nd_pmem" > /etc/modprobe.d/virt-dax-flush.conf > echo "alias nd:t4* $new_driver" >> /etc/modprobe.d/virt-dax- > flush.conf > echo "alias nd:t5* $new_driver" >> /etc/modprobe.d/virt-dax- > flush.conf
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
On Wed, Jul 26, 2017 at 2:27 PM, Rik van Riel wrote: > On Wed, 2017-07-26 at 09:47 -0400, Pankaj Gupta wrote: >> > >> Just want to summarize here(high level): >> >> This will require implementing new 'virtio-pmem' device which >> presents >> a DAX address range(like pmem) to guest with read/write(direct >> access) >> & device flush functionality. Also, qemu should implement >> corresponding >> support for flush using virtio. >> > Alternatively, the existing pmem code, with > a flush-only block device on the side, which > is somehow associated with the pmem device. > > I wonder which alternative leads to the least > code duplication, and the least maintenance > hassle going forward. I'd much prefer to have another driver. I.e. a driver that refactors out some common pmem details into a shared object and can attach to ND_DEVICE_NAMESPACE_{IO,PMEM}. A control device on the side seems like a recipe for confusion. With a $new_driver in hand you can just do: modprobe $new_driver echo $namespace > /sys/bus/nd/drivers/nd_pmem/unbind echo $namespace > /sys/bus/nd/drivers/$new_driver/new_id echo $namespace > /sys/bus/nd/drivers/$new_driver/bind ...and the guest can arrange for $new_driver to be the default, so you don't need to do those steps each boot of the VM, by doing: echo "blacklist nd_pmem" > /etc/modprobe.d/virt-dax-flush.conf echo "alias nd:t4* $new_driver" >> /etc/modprobe.d/virt-dax-flush.conf echo "alias nd:t5* $new_driver" >> /etc/modprobe.d/virt-dax-flush.conf
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
On Wed, 2017-07-26 at 09:47 -0400, Pankaj Gupta wrote: > > > Just want to summarize here (high level): > > This will require implementing a new 'virtio-pmem' device which > presents > a DAX address range (like pmem) to guest with read/write (direct > access) > & device flush functionality. Also, qemu should implement > corresponding > support for flush using virtio. > Alternatively, the existing pmem code, with a flush-only block device on the side, which is somehow associated with the pmem device. I wonder which alternative leads to the least code duplication, and the least maintenance hassle going forward. -- All rights reversed
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
> > On Tue, 2017-07-25 at 07:46 -0700, Dan Williams wrote: > > On Tue, Jul 25, 2017 at 7:27 AM, Pankaj Gupta > > wrote: > > > > > > Looks like the only way to send a flush (blk dev) from guest to host with > > > nvdimm > > > is using flush hint addresses. Is this the correct interface I am > > > looking at? > > > > > > blkdev_issue_flush > > > submit_bio_wait > > > submit_bio > > > generic_make_request > > > pmem_make_request > > > ... > > > if (bio->bi_opf & REQ_FLUSH) > > > nvdimm_flush(nd_region); > > > > I would inject a paravirtualized version of pmem_make_request() that > > sends an async flush operation over virtio to the host. Don't try to > > use flush hint addresses for this, they don't have the proper > > semantics. The guest should be allowed to issue the flush and receive > > the completion asynchronously rather than taking a VM exit and > > blocking on that request. > > That is my feeling, too. A slower IO device benefits > greatly from an asynchronous flush mechanism. Thanks for all the suggestions! Just want to summarize here (high level): This will require implementing a new 'virtio-pmem' device which presents a DAX address range (like pmem) to guest with read/write (direct access) & device flush functionality. Also, qemu should implement corresponding support for flush using virtio. Thanks, Pankaj > > -- > All rights reversed
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
On Tue, 2017-07-25 at 07:46 -0700, Dan Williams wrote: > On Tue, Jul 25, 2017 at 7:27 AM, Pankaj Gupta > wrote: > > > > Looks like the only way to send a flush (blk dev) from guest to host with > > nvdimm > > is using flush hint addresses. Is this the correct interface I am > > looking at? > > > > blkdev_issue_flush > > submit_bio_wait > > submit_bio > > generic_make_request > > pmem_make_request > > ... > > if (bio->bi_opf & REQ_FLUSH) > > nvdimm_flush(nd_region); > > I would inject a paravirtualized version of pmem_make_request() that > sends an async flush operation over virtio to the host. Don't try to > use flush hint addresses for this, they don't have the proper > semantics. The guest should be allowed to issue the flush and receive > the completion asynchronously rather than taking a VM exit and > blocking on that request. That is my feeling, too. A slower IO device benefits greatly from an asynchronous flush mechanism. -- All rights reversed
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
On Tue, Jul 25, 2017 at 7:27 AM, Pankaj Gupta wrote: > >> Subject: Re: KVM "fake DAX" flushing interface - discussion >> >> On Mon 24-07-17 08:06:07, Pankaj Gupta wrote: >> > >> > > On Sun 23-07-17 13:10:34, Dan Williams wrote: >> > > > On Sun, Jul 23, 2017 at 11:10 AM, Rik van Riel wrote: >> > > > > On Sun, 2017-07-23 at 09:01 -0700, Dan Williams wrote: >> > > > >> [ adding Ross and Jan ] >> > > > >> >> > > > >> On Sun, Jul 23, 2017 at 7:04 AM, Rik van Riel >> > > > >> wrote: >> > > > >> > >> > > > >> > The goal is to increase density of guests, by moving page >> > > > >> > cache into the host (where it can be easily reclaimed). >> > > > >> > >> > > > >> > If we assume the guests will be backed by relatively fast >> > > > >> > SSDs, a "whole device flush" from filesystem journaling >> > > > >> > code (issued where the filesystem issues a barrier or >> > > > >> > disk cache flush today) may be just what we need to make >> > > > >> > that work. >> > > > >> >> > > > >> Ok, apologies, I indeed had some pieces of the proposal confused. >> > > > >> >> > > > >> However, it still seems like the storage interface is not capable of >> > > > >> expressing what is needed, because the operation that is needed is a >> > > > >> range flush. In the guest you want the DAX page dirty tracking to >> > > > >> communicate range flush information to the host, but there's no >> > > > >> readily available block i/o semantic that software running on top of >> > > > >> the fake pmem device can use to communicate with the host. Instead >> > > > >> you >> > > > >> want to intercept the dax_flush() operation and turn it into a >> > > > >> queued >> > > > >> request on the host. >> > > > >> >> > > > >> In 4.13 we have turned this dax_flush() operation into an explicit >> > > > >> driver call. That seems a better interface to modify than trying to >> > > > >> map block-storage flush-cache / force-unit-access commands to this >> > > > >> host request. 
>> > > > >> >> > > > >> The additional piece you would need to consider is whether to track >> > > > >> all writes in addition to mmap writes in the guest as DAX-page-cache >> > > > >> dirtying events, or arrange for every dax_copy_from_iter() >> > > > >> operation() >> > > > >> to also queue a sync on the host, but that essentially turns the >> > > > >> host >> > > > >> page cache into a pseudo write-through mode. >> > > > > >> > > > > I suspect initially it will be fine to not offer DAX >> > > > > semantics to applications using these "fake DAX" devices >> > > > > from a virtual machine, because the DAX APIs are designed >> > > > > for a much higher performance device than these fake DAX >> > > > > setups could ever give. >> > > > >> > > > Right, we don't need DAX, per se, in the guest. >> > > > >> > > > > >> > > > > Having userspace call fsync/msync like done normally, and >> > > > > having those coarser calls be turned into somewhat efficient >> > > > > backend flushes would be perfectly acceptable. >> > > > > >> > > > > The big question is, what should that kind of interface look >> > > > > like? >> > > > >> > > > To me, this looks much like the dirty cache tracking that is done in >> > > > the address_space radix for the DAX case, but modified to coordinate >> > > > queued / page-based flushing when the guest wants to persist data. >> > > > The similarity to DAX is not storing guest allocated pages in the >> > > > radix but entries that track dirty guest physical addresses. >> > > >> > > Let me check whether I understand the problem correctly. So we want to >> > > export a block device (essentially a page cache of this block device) to >> > > a >> > > guest as PMEM and use DAX in the guest to save guest's page cache. The >> > >> > that's correct. 
>> > >> > > natural way to make the persistence work would be to make ->flush >> > > callback >> > > of the PMEM device to do an upcall to the host which could then >> > > fdatasync() >> > > appropriate image file range however the performance would suck in such >> > > case since ->flush gets called for at most one page ranges from DAX. >> > >> > Discussion is: sync a range using paravirt device or flush hint addresses >> > vs block device flush. >> > >> > > >> > > So what you could do instead is to completely ignore ->flush calls for >> > > the >> > > PMEM device and instead catch the bio with REQ_PREFLUSH flag set on the >> > > PMEM device (generated by blkdev_issue_flush() or the journalling >> > > machinery) and fdatasync() the whole image file at that moment - in fact >> > > you must do that for metadata IO to hit persistent storage anyway in your >> > > setting. This would very closely follow how exporting block devices with >> > > volatile cache works with KVM these days AFAIU and the performance will >> > > be >> > > the same. >> > >> > yes 'blkdev_issue_flush' does set 'REQ_OP_WRITE | REQ_PREFLUSH' flags. >> > As per suggestions looks like block flushing device is way ahead. >> > >> > If we do an asynchronous block flush at guest side (put current task in >> > wait queue till host side fdatasync completes) can solve the purpose? Or >> > do we need another paravirt device for this?
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
> Subject: Re: KVM "fake DAX" flushing interface - discussion > > On Mon 24-07-17 08:06:07, Pankaj Gupta wrote: > > > > > On Sun 23-07-17 13:10:34, Dan Williams wrote: > > > > On Sun, Jul 23, 2017 at 11:10 AM, Rik van Riel wrote: > > > > > On Sun, 2017-07-23 at 09:01 -0700, Dan Williams wrote: > > > > >> [ adding Ross and Jan ] > > > > >> > > > > >> On Sun, Jul 23, 2017 at 7:04 AM, Rik van Riel > > > > >> wrote: > > > > >> > > > > > >> > The goal is to increase density of guests, by moving page > > > > >> > cache into the host (where it can be easily reclaimed). > > > > >> > > > > > >> > If we assume the guests will be backed by relatively fast > > > > >> > SSDs, a "whole device flush" from filesystem journaling > > > > >> > code (issued where the filesystem issues a barrier or > > > > >> > disk cache flush today) may be just what we need to make > > > > >> > that work. > > > > >> > > > > >> Ok, apologies, I indeed had some pieces of the proposal confused. > > > > >> > > > > >> However, it still seems like the storage interface is not capable of > > > > >> expressing what is needed, because the operation that is needed is a > > > > >> range flush. In the guest you want the DAX page dirty tracking to > > > > >> communicate range flush information to the host, but there's no > > > > >> readily available block i/o semantic that software running on top of > > > > >> the fake pmem device can use to communicate with the host. Instead > > > > >> you > > > > >> want to intercept the dax_flush() operation and turn it into a > > > > >> queued > > > > >> request on the host. > > > > >> > > > > >> In 4.13 we have turned this dax_flush() operation into an explicit > > > > >> driver call. That seems a better interface to modify than trying to > > > > >> map block-storage flush-cache / force-unit-access commands to this > > > > >> host request. 
> > > > >> > > > > >> The additional piece you would need to consider is whether to track > > > > >> all writes in addition to mmap writes in the guest as DAX-page-cache > > > > >> dirtying events, or arrange for every dax_copy_from_iter() > > > > >> operation() > > > > >> to also queue a sync on the host, but that essentially turns the > > > > >> host > > > > >> page cache into a pseudo write-through mode. > > > > > > > > > > I suspect initially it will be fine to not offer DAX > > > > > semantics to applications using these "fake DAX" devices > > > > > from a virtual machine, because the DAX APIs are designed > > > > > for a much higher performance device than these fake DAX > > > > > setups could ever give. > > > > > > > > Right, we don't need DAX, per se, in the guest. > > > > > > > > > > > > > > Having userspace call fsync/msync like done normally, and > > > > > having those coarser calls be turned into somewhat efficient > > > > > backend flushes would be perfectly acceptable. > > > > > > > > > > The big question is, what should that kind of interface look > > > > > like? > > > > > > > > To me, this looks much like the dirty cache tracking that is done in > > > > the address_space radix for the DAX case, but modified to coordinate > > > > queued / page-based flushing when the guest wants to persist data. > > > > The similarity to DAX is not storing guest allocated pages in the > > > > radix but entries that track dirty guest physical addresses. > > > > > > Let me check whether I understand the problem correctly. So we want to > > > export a block device (essentially a page cache of this block device) to > > > a > > > guest as PMEM and use DAX in the guest to save guest's page cache. The > > > > that's correct. 
> > > > > natural way to make the persistence work would be to make ->flush > > > callback > > > of the PMEM device to do an upcall to the host which could then > > > fdatasync() > > > appropriate image file range however the performance would suck in such > > > case since ->flush gets called for at most one page ranges from DAX. > > > > Discussion is: sync a range using paravirt device or flush hint addresses > > vs block device flush. > > > > > > > > So what you could do instead is to completely ignore ->flush calls for > > > the > > > PMEM device and instead catch the bio with REQ_PREFLUSH flag set on the > > > PMEM device (generated by blkdev_issue_flush() or the journalling > > > machinery) and fdatasync() the whole image file at that moment - in fact > > > you must do that for metadata IO to hit persistent storage anyway in your > > > setting. This would very closely follow how exporting block devices with > > > volatile cache works with KVM these days AFAIU and the performance will > > > be > > > the same. > > > > yes 'blkdev_issue_flush' does set 'REQ_OP_WRITE | REQ_PREFLUSH' flags. > > As per suggestions looks like block flushing device is way ahead. > > > > If we do an asynchronous block flush at guest side (put current task in > > wait queue till host side fdatasync completes) can solve the purpose? Or > > do we need another paravirt device for this?
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
On Mon, Jul 24, 2017 at 8:48 AM, Jan Kara wrote: > On Mon 24-07-17 08:10:05, Dan Williams wrote: >> On Mon, Jul 24, 2017 at 5:37 AM, Jan Kara wrote: [..] >> This approach would turn into a full fsync on the host. The question >> in my mind is whether there is any optimization to be had by trapping >> dax_flush() and calling msync() on host ranges, but Jan is right >> trapping blkdev_issue_flush() and turning around and calling host >> fsync() is the most straightforward approach that does not need driver >> interface changes. The dax_flush() approach would need to modify it >> into a async completion interface. > > If the backing device on the host is actually a normal block device or an > image file, doing full fsync() is the most efficient implementation > anyway... Ah, ok, great. That was the gap in my understanding.
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
On Mon 24-07-17 08:10:05, Dan Williams wrote: > On Mon, Jul 24, 2017 at 5:37 AM, Jan Kara wrote: > > On Mon 24-07-17 08:06:07, Pankaj Gupta wrote: > >> > >> > On Sun 23-07-17 13:10:34, Dan Williams wrote: > >> > > On Sun, Jul 23, 2017 at 11:10 AM, Rik van Riel wrote: > >> > > > On Sun, 2017-07-23 at 09:01 -0700, Dan Williams wrote: > >> > > >> [ adding Ross and Jan ] > >> > > >> > >> > > >> On Sun, Jul 23, 2017 at 7:04 AM, Rik van Riel > >> > > >> wrote: > >> > > >> > > >> > > >> > The goal is to increase density of guests, by moving page > >> > > >> > cache into the host (where it can be easily reclaimed). > >> > > >> > > >> > > >> > If we assume the guests will be backed by relatively fast > >> > > >> > SSDs, a "whole device flush" from filesystem journaling > >> > > >> > code (issued where the filesystem issues a barrier or > >> > > >> > disk cache flush today) may be just what we need to make > >> > > >> > that work. > >> > > >> > >> > > >> Ok, apologies, I indeed had some pieces of the proposal confused. > >> > > >> > >> > > >> However, it still seems like the storage interface is not capable of > >> > > >> expressing what is needed, because the operation that is needed is a > >> > > >> range flush. In the guest you want the DAX page dirty tracking to > >> > > >> communicate range flush information to the host, but there's no > >> > > >> readily available block i/o semantic that software running on top of > >> > > >> the fake pmem device can use to communicate with the host. Instead > >> > > >> you > >> > > >> want to intercept the dax_flush() operation and turn it into a > >> > > >> queued > >> > > >> request on the host. > >> > > >> > >> > > >> In 4.13 we have turned this dax_flush() operation into an explicit > >> > > >> driver call. That seems a better interface to modify than trying to > >> > > >> map block-storage flush-cache / force-unit-access commands to this > >> > > >> host request. 
> >> > > >> > >> > > >> The additional piece you would need to consider is whether to track > >> > > >> all writes in addition to mmap writes in the guest as DAX-page-cache > >> > > >> dirtying events, or arrange for every dax_copy_from_iter() > >> > > >> operation() > >> > > >> to also queue a sync on the host, but that essentially turns the > >> > > >> host > >> > > >> page cache into a pseudo write-through mode. > >> > > > > >> > > > I suspect initially it will be fine to not offer DAX > >> > > > semantics to applications using these "fake DAX" devices > >> > > > from a virtual machine, because the DAX APIs are designed > >> > > > for a much higher performance device than these fake DAX > >> > > > setups could ever give. > >> > > > >> > > Right, we don't need DAX, per se, in the guest. > >> > > > >> > > > > >> > > > Having userspace call fsync/msync like done normally, and > >> > > > having those coarser calls be turned into somewhat efficient > >> > > > backend flushes would be perfectly acceptable. > >> > > > > >> > > > The big question is, what should that kind of interface look > >> > > > like? > >> > > > >> > > To me, this looks much like the dirty cache tracking that is done in > >> > > the address_space radix for the DAX case, but modified to coordinate > >> > > queued / page-based flushing when the guest wants to persist data. > >> > > The similarity to DAX is not storing guest allocated pages in the > >> > > radix but entries that track dirty guest physical addresses. > >> > > >> > Let me check whether I understand the problem correctly. So we want to > >> > export a block device (essentially a page cache of this block device) to > >> > a > >> > guest as PMEM and use DAX in the guest to save guest's page cache. The > >> > >> that's correct. 
> >> > >> > natural way to make the persistence work would be to make ->flush > >> > callback > >> > of the PMEM device to do an upcall to the host which could then > >> > fdatasync() > >> > appropriate image file range however the performance would suck in such > >> > case since ->flush gets called for at most one page ranges from DAX. > >> > >> Discussion is: sync a range using paravirt device or flush hint addresses > >> vs block device flush. > >> > >> > > >> > So what you could do instead is to completely ignore ->flush calls for > >> > the > >> > PMEM device and instead catch the bio with REQ_PREFLUSH flag set on the > >> > PMEM device (generated by blkdev_issue_flush() or the journalling > >> > machinery) and fdatasync() the whole image file at that moment - in fact > >> > you must do that for metadata IO to hit persistent storage anyway in your > >> > setting. This would very closely follow how exporting block devices with > >> > volatile cache works with KVM these days AFAIU and the performance will > >> > be > >> > the same. > >> > >> yes 'blkdev_issue_flush' does set 'REQ_OP_WRITE | REQ_PREFLUSH' flags. > >> As per suggestions looks like block flushing device is way ahead. > >> > >> If we do an asynchronous block flush at guest side (put current task in > >> wait queue till host side fdatasync completes) can solve the purpose? Or > >> do we need another paravirt device for this?
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
On Mon, Jul 24, 2017 at 5:37 AM, Jan Kara wrote: > On Mon 24-07-17 08:06:07, Pankaj Gupta wrote: >> >> > On Sun 23-07-17 13:10:34, Dan Williams wrote: >> > > On Sun, Jul 23, 2017 at 11:10 AM, Rik van Riel wrote: >> > > > On Sun, 2017-07-23 at 09:01 -0700, Dan Williams wrote: >> > > >> [ adding Ross and Jan ] >> > > >> >> > > >> On Sun, Jul 23, 2017 at 7:04 AM, Rik van Riel >> > > >> wrote: >> > > >> > >> > > >> > The goal is to increase density of guests, by moving page >> > > >> > cache into the host (where it can be easily reclaimed). >> > > >> > >> > > >> > If we assume the guests will be backed by relatively fast >> > > >> > SSDs, a "whole device flush" from filesystem journaling >> > > >> > code (issued where the filesystem issues a barrier or >> > > >> > disk cache flush today) may be just what we need to make >> > > >> > that work. >> > > >> >> > > >> Ok, apologies, I indeed had some pieces of the proposal confused. >> > > >> >> > > >> However, it still seems like the storage interface is not capable of >> > > >> expressing what is needed, because the operation that is needed is a >> > > >> range flush. In the guest you want the DAX page dirty tracking to >> > > >> communicate range flush information to the host, but there's no >> > > >> readily available block i/o semantic that software running on top of >> > > >> the fake pmem device can use to communicate with the host. Instead >> > > >> you >> > > >> want to intercept the dax_flush() operation and turn it into a queued >> > > >> request on the host. >> > > >> >> > > >> In 4.13 we have turned this dax_flush() operation into an explicit >> > > >> driver call. That seems a better interface to modify than trying to >> > > >> map block-storage flush-cache / force-unit-access commands to this >> > > >> host request. 
>> > > >> >> > > >> The additional piece you would need to consider is whether to track >> > > >> all writes in addition to mmap writes in the guest as DAX-page-cache >> > > >> dirtying events, or arrange for every dax_copy_from_iter() >> > > >> operation() >> > > >> to also queue a sync on the host, but that essentially turns the host >> > > >> page cache into a pseudo write-through mode. >> > > > >> > > > I suspect initially it will be fine to not offer DAX >> > > > semantics to applications using these "fake DAX" devices >> > > > from a virtual machine, because the DAX APIs are designed >> > > > for a much higher performance device than these fake DAX >> > > > setups could ever give. >> > > >> > > Right, we don't need DAX, per se, in the guest. >> > > >> > > > >> > > > Having userspace call fsync/msync like done normally, and >> > > > having those coarser calls be turned into somewhat efficient >> > > > backend flushes would be perfectly acceptable. >> > > > >> > > > The big question is, what should that kind of interface look >> > > > like? >> > > >> > > To me, this looks much like the dirty cache tracking that is done in >> > > the address_space radix for the DAX case, but modified to coordinate >> > > queued / page-based flushing when the guest wants to persist data. >> > > The similarity to DAX is not storing guest allocated pages in the >> > > radix but entries that track dirty guest physical addresses. >> > >> > Let me check whether I understand the problem correctly. So we want to >> > export a block device (essentially a page cache of this block device) to a >> > guest as PMEM and use DAX in the guest to save guest's page cache. The >> >> that's correct. 
>> >> > natural way to make the persistence work would be to make ->flush callback >> > of the PMEM device to do an upcall to the host which could then fdatasync() >> > appropriate image file range however the performance would suck in such >> > case since ->flush gets called for at most one page ranges from DAX. >> >> Discussion is : sync a range using paravirt device or flush hit addresses >> vs block device flush. >> >> > >> > So what you could do instead is to completely ignore ->flush calls for the >> > PMEM device and instead catch the bio with REQ_PREFLUSH flag set on the >> > PMEM device (generated by blkdev_issue_flush() or the journalling >> > machinery) and fdatasync() the whole image file at that moment - in fact >> > you must do that for metadata IO to hit persistent storage anyway in your >> > setting. This would very closely follow how exporting block devices with >> > volatile cache works with KVM these days AFAIU and the performance will be >> > the same. >> >> yes 'blkdev_issue_flush' does set 'REQ_OP_WRITE | REQ_PREFLUSH' flags. >> As per suggestions looks like block flushing device is way ahead. >> >> If we do an asynchronous block flush at guest side(put current task in >> wait queue till host side fdatasync completes) can solve the purpose? Or >> do we need another paravirt device for this? > > Well, even currently if you have PMEM device, you still have also a block > device and a request queue associated with it and metadata IO goes through > that pat
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
On Mon 24-07-17 08:06:07, Pankaj Gupta wrote: > > > On Sun 23-07-17 13:10:34, Dan Williams wrote: > > > On Sun, Jul 23, 2017 at 11:10 AM, Rik van Riel wrote: > > > > On Sun, 2017-07-23 at 09:01 -0700, Dan Williams wrote: > > > >> [ adding Ross and Jan ] > > > >> > > > >> On Sun, Jul 23, 2017 at 7:04 AM, Rik van Riel > > > >> wrote: > > > >> > > > > >> > The goal is to increase density of guests, by moving page > > > >> > cache into the host (where it can be easily reclaimed). > > > >> > > > > >> > If we assume the guests will be backed by relatively fast > > > >> > SSDs, a "whole device flush" from filesystem journaling > > > >> > code (issued where the filesystem issues a barrier or > > > >> > disk cache flush today) may be just what we need to make > > > >> > that work. > > > >> > > > >> Ok, apologies, I indeed had some pieces of the proposal confused. > > > >> > > > >> However, it still seems like the storage interface is not capable of > > > >> expressing what is needed, because the operation that is needed is a > > > >> range flush. In the guest you want the DAX page dirty tracking to > > > >> communicate range flush information to the host, but there's no > > > >> readily available block i/o semantic that software running on top of > > > >> the fake pmem device can use to communicate with the host. Instead > > > >> you > > > >> want to intercept the dax_flush() operation and turn it into a queued > > > >> request on the host. > > > >> > > > >> In 4.13 we have turned this dax_flush() operation into an explicit > > > >> driver call. That seems a better interface to modify than trying to > > > >> map block-storage flush-cache / force-unit-access commands to this > > > >> host request. 
> > > >> > > > >> The additional piece you would need to consider is whether to track > > > >> all writes in addition to mmap writes in the guest as DAX-page-cache > > > >> dirtying events, or arrange for every dax_copy_from_iter() > > > >> operation() > > > >> to also queue a sync on the host, but that essentially turns the host > > > >> page cache into a pseudo write-through mode. > > > > > > > > I suspect initially it will be fine to not offer DAX > > > > semantics to applications using these "fake DAX" devices > > > > from a virtual machine, because the DAX APIs are designed > > > > for a much higher performance device than these fake DAX > > > > setups could ever give. > > > > > > Right, we don't need DAX, per se, in the guest. > > > > > > > > > > > Having userspace call fsync/msync like done normally, and > > > > having those coarser calls be turned into somewhat efficient > > > > backend flushes would be perfectly acceptable. > > > > > > > > The big question is, what should that kind of interface look > > > > like? > > > > > > To me, this looks much like the dirty cache tracking that is done in > > > the address_space radix for the DAX case, but modified to coordinate > > > queued / page-based flushing when the guest wants to persist data. > > > The similarity to DAX is not storing guest allocated pages in the > > > radix but entries that track dirty guest physical addresses. > > > > Let me check whether I understand the problem correctly. So we want to > > export a block device (essentially a page cache of this block device) to a > > guest as PMEM and use DAX in the guest to save guest's page cache. The > > that's correct. > > > natural way to make the persistence work would be to make ->flush callback > > of the PMEM device to do an upcall to the host which could then fdatasync() > > appropriate image file range however the performance would suck in such > > case since ->flush gets called for at most one page ranges from DAX. 
> > Discussion is : sync a range using paravirt device or flush hit addresses > vs block device flush. > > > > > So what you could do instead is to completely ignore ->flush calls for the > > PMEM device and instead catch the bio with REQ_PREFLUSH flag set on the > > PMEM device (generated by blkdev_issue_flush() or the journalling > > machinery) and fdatasync() the whole image file at that moment - in fact > > you must do that for metadata IO to hit persistent storage anyway in your > > setting. This would very closely follow how exporting block devices with > > volatile cache works with KVM these days AFAIU and the performance will be > > the same. > > yes 'blkdev_issue_flush' does set 'REQ_OP_WRITE | REQ_PREFLUSH' flags. > As per suggestions looks like block flushing device is way ahead. > > If we do an asynchronous block flush at guest side(put current task in > wait queue till host side fdatasync completes) can solve the purpose? Or > do we need another paravirt device for this? Well, even currently if you have PMEM device, you still have also a block device and a request queue associated with it and metadata IO goes through that path. So in your case you will have the same in the guest as a result of exposing virtual PMEM device to the guest and you just need to make s
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
> On Sun 23-07-17 13:10:34, Dan Williams wrote: > > On Sun, Jul 23, 2017 at 11:10 AM, Rik van Riel wrote: > > > On Sun, 2017-07-23 at 09:01 -0700, Dan Williams wrote: > > >> [ adding Ross and Jan ] > > >> > > >> On Sun, Jul 23, 2017 at 7:04 AM, Rik van Riel > > >> wrote: > > >> > > > >> > The goal is to increase density of guests, by moving page > > >> > cache into the host (where it can be easily reclaimed). > > >> > > > >> > If we assume the guests will be backed by relatively fast > > >> > SSDs, a "whole device flush" from filesystem journaling > > >> > code (issued where the filesystem issues a barrier or > > >> > disk cache flush today) may be just what we need to make > > >> > that work. > > >> > > >> Ok, apologies, I indeed had some pieces of the proposal confused. > > >> > > >> However, it still seems like the storage interface is not capable of > > >> expressing what is needed, because the operation that is needed is a > > >> range flush. In the guest you want the DAX page dirty tracking to > > >> communicate range flush information to the host, but there's no > > >> readily available block i/o semantic that software running on top of > > >> the fake pmem device can use to communicate with the host. Instead > > >> you > > >> want to intercept the dax_flush() operation and turn it into a queued > > >> request on the host. > > >> > > >> In 4.13 we have turned this dax_flush() operation into an explicit > > >> driver call. That seems a better interface to modify than trying to > > >> map block-storage flush-cache / force-unit-access commands to this > > >> host request. > > >> > > >> The additional piece you would need to consider is whether to track > > >> all writes in addition to mmap writes in the guest as DAX-page-cache > > >> dirtying events, or arrange for every dax_copy_from_iter() > > >> operation() > > >> to also queue a sync on the host, but that essentially turns the host > > >> page cache into a pseudo write-through mode. 
> > > > > > I suspect initially it will be fine to not offer DAX > > > semantics to applications using these "fake DAX" devices > > > from a virtual machine, because the DAX APIs are designed > > > for a much higher performance device than these fake DAX > > > setups could ever give. > > > > Right, we don't need DAX, per se, in the guest. > > > > > > > > Having userspace call fsync/msync like done normally, and > > > having those coarser calls be turned into somewhat efficient > > > backend flushes would be perfectly acceptable. > > > > > > The big question is, what should that kind of interface look > > > like? > > > > To me, this looks much like the dirty cache tracking that is done in > > the address_space radix for the DAX case, but modified to coordinate > > queued / page-based flushing when the guest wants to persist data. > > The similarity to DAX is not storing guest allocated pages in the > > radix but entries that track dirty guest physical addresses. > > Let me check whether I understand the problem correctly. So we want to > export a block device (essentially a page cache of this block device) to a > guest as PMEM and use DAX in the guest to save guest's page cache. That's correct. > The natural way to make the persistence work would be to make ->flush callback > of the PMEM device to do an upcall to the host which could then fdatasync() > appropriate image file range however the performance would suck in such > case since ->flush gets called for at most one page ranges from DAX. The discussion is whether to sync a range (via a paravirt device, or by flushing the hinted addresses) or to do a whole block-device flush. > > So what you could do instead is to completely ignore ->flush calls for the > PMEM device and instead catch the bio with REQ_PREFLUSH flag set on the > PMEM device (generated by blkdev_issue_flush() or the journalling > machinery) and fdatasync() the whole image file at that moment - in fact > you must do that for metadata IO to hit persistent storage anyway in your > setting.
This would very closely follow how exporting block devices with > volatile cache works with KVM these days AFAIU and the performance will be > the same. Yes, 'blkdev_issue_flush' does set the 'REQ_OP_WRITE | REQ_PREFLUSH' flags. Based on the suggestions so far, the block-device flush approach looks like the way forward. Would an asynchronous block flush on the guest side (putting the current task on a wait queue until the host-side fdatasync completes) solve the problem, or do we need another paravirt device for this? > > Honza > -- > Jan Kara > SUSE Labs, CR >
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
On Sun 23-07-17 13:10:34, Dan Williams wrote: > On Sun, Jul 23, 2017 at 11:10 AM, Rik van Riel wrote: > > On Sun, 2017-07-23 at 09:01 -0700, Dan Williams wrote: > >> [ adding Ross and Jan ] > >> > >> On Sun, Jul 23, 2017 at 7:04 AM, Rik van Riel > >> wrote: > >> > > >> > The goal is to increase density of guests, by moving page > >> > cache into the host (where it can be easily reclaimed). > >> > > >> > If we assume the guests will be backed by relatively fast > >> > SSDs, a "whole device flush" from filesystem journaling > >> > code (issued where the filesystem issues a barrier or > >> > disk cache flush today) may be just what we need to make > >> > that work. > >> > >> Ok, apologies, I indeed had some pieces of the proposal confused. > >> > >> However, it still seems like the storage interface is not capable of > >> expressing what is needed, because the operation that is needed is a > >> range flush. In the guest you want the DAX page dirty tracking to > >> communicate range flush information to the host, but there's no > >> readily available block i/o semantic that software running on top of > >> the fake pmem device can use to communicate with the host. Instead > >> you > >> want to intercept the dax_flush() operation and turn it into a queued > >> request on the host. > >> > >> In 4.13 we have turned this dax_flush() operation into an explicit > >> driver call. That seems a better interface to modify than trying to > >> map block-storage flush-cache / force-unit-access commands to this > >> host request. > >> > >> The additional piece you would need to consider is whether to track > >> all writes in addition to mmap writes in the guest as DAX-page-cache > >> dirtying events, or arrange for every dax_copy_from_iter() > >> operation() > >> to also queue a sync on the host, but that essentially turns the host > >> page cache into a pseudo write-through mode. 
> > > > I suspect initially it will be fine to not offer DAX > > semantics to applications using these "fake DAX" devices > > from a virtual machine, because the DAX APIs are designed > > for a much higher performance device than these fake DAX > > setups could ever give. > > Right, we don't need DAX, per se, in the guest. > > > > > Having userspace call fsync/msync like done normally, and > > having those coarser calls be turned into somewhat efficient > > backend flushes would be perfectly acceptable. > > > > The big question is, what should that kind of interface look > > like? > > To me, this looks much like the dirty cache tracking that is done in > the address_space radix for the DAX case, but modified to coordinate > queued / page-based flushing when the guest wants to persist data. > The similarity to DAX is not storing guest allocated pages in the > radix but entries that track dirty guest physical addresses. Let me check whether I understand the problem correctly. So we want to export a block device (essentially a page cache of this block device) to a guest as PMEM and use DAX in the guest to save guest's page cache. The natural way to make the persistence work would be to make ->flush callback of the PMEM device to do an upcall to the host which could then fdatasync() appropriate image file range however the performance would suck in such case since ->flush gets called for at most one page ranges from DAX. So what you could do instead is to completely ignore ->flush calls for the PMEM device and instead catch the bio with REQ_PREFLUSH flag set on the PMEM device (generated by blkdev_issue_flush() or the journalling machinery) and fdatasync() the whole image file at that moment - in fact you must do that for metadata IO to hit persistent storage anyway in your setting. This would very closely follow how exporting block devices with volatile cache works with KVM these days AFAIU and the performance will be the same. Honza -- Jan Kara SUSE Labs, CR
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
On Sun, Jul 23, 2017 at 11:10 AM, Rik van Riel wrote: > On Sun, 2017-07-23 at 09:01 -0700, Dan Williams wrote: >> [ adding Ross and Jan ] >> >> On Sun, Jul 23, 2017 at 7:04 AM, Rik van Riel >> wrote: >> > >> > The goal is to increase density of guests, by moving page >> > cache into the host (where it can be easily reclaimed). >> > >> > If we assume the guests will be backed by relatively fast >> > SSDs, a "whole device flush" from filesystem journaling >> > code (issued where the filesystem issues a barrier or >> > disk cache flush today) may be just what we need to make >> > that work. >> >> Ok, apologies, I indeed had some pieces of the proposal confused. >> >> However, it still seems like the storage interface is not capable of >> expressing what is needed, because the operation that is needed is a >> range flush. In the guest you want the DAX page dirty tracking to >> communicate range flush information to the host, but there's no >> readily available block i/o semantic that software running on top of >> the fake pmem device can use to communicate with the host. Instead >> you >> want to intercept the dax_flush() operation and turn it into a queued >> request on the host. >> >> In 4.13 we have turned this dax_flush() operation into an explicit >> driver call. That seems a better interface to modify than trying to >> map block-storage flush-cache / force-unit-access commands to this >> host request. >> >> The additional piece you would need to consider is whether to track >> all writes in addition to mmap writes in the guest as DAX-page-cache >> dirtying events, or arrange for every dax_copy_from_iter() >> operation() >> to also queue a sync on the host, but that essentially turns the host >> page cache into a pseudo write-through mode. 
> > I suspect initially it will be fine to not offer DAX > semantics to applications using these "fake DAX" devices > from a virtual machine, because the DAX APIs are designed > for a much higher performance device than these fake DAX > setups could ever give. Right, we don't need DAX, per se, in the guest. > > Having userspace call fsync/msync like done normally, and > having those coarser calls be turned into somewhat efficient > backend flushes would be perfectly acceptable. > > The big question is, what should that kind of interface look > like? To me, this looks much like the dirty cache tracking that is done in the address_space radix for the DAX case, but modified to coordinate queued / page-based flushing when the guest wants to persist data. The similarity to DAX is not storing guest allocated pages in the radix but entries that track dirty guest physical addresses.
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
On Sun, 2017-07-23 at 09:01 -0700, Dan Williams wrote: > [ adding Ross and Jan ] > > On Sun, Jul 23, 2017 at 7:04 AM, Rik van Riel > wrote: > > > > The goal is to increase density of guests, by moving page > > cache into the host (where it can be easily reclaimed). > > > > If we assume the guests will be backed by relatively fast > > SSDs, a "whole device flush" from filesystem journaling > > code (issued where the filesystem issues a barrier or > > disk cache flush today) may be just what we need to make > > that work. > > Ok, apologies, I indeed had some pieces of the proposal confused. > > However, it still seems like the storage interface is not capable of > expressing what is needed, because the operation that is needed is a > range flush. In the guest you want the DAX page dirty tracking to > communicate range flush information to the host, but there's no > readily available block i/o semantic that software running on top of > the fake pmem device can use to communicate with the host. Instead > you > want to intercept the dax_flush() operation and turn it into a queued > request on the host. > > In 4.13 we have turned this dax_flush() operation into an explicit > driver call. That seems a better interface to modify than trying to > map block-storage flush-cache / force-unit-access commands to this > host request. > > The additional piece you would need to consider is whether to track > all writes in addition to mmap writes in the guest as DAX-page-cache > dirtying events, or arrange for every dax_copy_from_iter() > operation() > to also queue a sync on the host, but that essentially turns the host > page cache into a pseudo write-through mode. I suspect initially it will be fine to not offer DAX semantics to applications using these "fake DAX" devices from a virtual machine, because the DAX APIs are designed for a much higher performance device than these fake DAX setups could ever give. 
Having userspace call fsync/msync like done normally, and having those coarser calls be turned into somewhat efficient backend flushes would be perfectly acceptable. The big question is, what should that kind of interface look like?
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
[ adding Ross and Jan ] On Sun, Jul 23, 2017 at 7:04 AM, Rik van Riel wrote: > On Sat, 2017-07-22 at 12:34 -0700, Dan Williams wrote: >> On Fri, Jul 21, 2017 at 8:58 AM, Stefan Hajnoczi > > wrote: >> > >> > Maybe the NVDIMM folks can comment on this idea. >> >> I think it's unworkable to use the flush hints as a guest-to-host >> fsync mechanism. That mechanism was designed to flush small memory >> controller buffers, not large swaths of dirty memory. What about >> running the guests in a writethrough cache mode to avoid needing >> dirty >> cache management altogether? Either way I think you need to use >> device-dax on the host, or one of the two work-in-progress filesystem >> mechanisms (synchronous-faults or S_IOMAP_FROZEN) to avoid need any >> metadata coordination between guests and the host. > > The thing Pankaj is looking at is to use the DAX mechanisms > inside the guest (disk image as memory mapped nvdimm area), > with that disk image backed by a regular storage device on > the host. > > The goal is to increase density of guests, by moving page > cache into the host (where it can be easily reclaimed). > > If we assume the guests will be backed by relatively fast > SSDs, a "whole device flush" from filesystem journaling > code (issued where the filesystem issues a barrier or > disk cache flush today) may be just what we need to make > that work. Ok, apologies, I indeed had some pieces of the proposal confused. However, it still seems like the storage interface is not capable of expressing what is needed, because the operation that is needed is a range flush. In the guest you want the DAX page dirty tracking to communicate range flush information to the host, but there's no readily available block i/o semantic that software running on top of the fake pmem device can use to communicate with the host. Instead you want to intercept the dax_flush() operation and turn it into a queued request on the host. 
In 4.13 we have turned this dax_flush() operation into an explicit driver call. That seems a better interface to modify than trying to map block-storage flush-cache / force-unit-access commands to this host request. The additional piece you would need to consider is whether to track all writes in addition to mmap writes in the guest as DAX-page-cache dirtying events, or arrange for every dax_copy_from_iter() operation() to also queue a sync on the host, but that essentially turns the host page cache into a pseudo write-through mode.
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
On Sat, 2017-07-22 at 12:34 -0700, Dan Williams wrote: > On Fri, Jul 21, 2017 at 8:58 AM, Stefan Hajnoczi > wrote: > > > > Maybe the NVDIMM folks can comment on this idea. > > I think it's unworkable to use the flush hints as a guest-to-host > fsync mechanism. That mechanism was designed to flush small memory > controller buffers, not large swaths of dirty memory. What about > running the guests in a writethrough cache mode to avoid needing > dirty > cache management altogether? Either way I think you need to use > device-dax on the host, or one of the two work-in-progress filesystem > mechanisms (synchronous-faults or S_IOMAP_FROZEN) to avoid need any > metadata coordination between guests and the host. The thing Pankaj is looking at is to use the DAX mechanisms inside the guest (disk image as memory mapped nvdimm area), with that disk image backed by a regular storage device on the host. The goal is to increase density of guests, by moving page cache into the host (where it can be easily reclaimed). If we assume the guests will be backed by relatively fast SSDs, a "whole device flush" from filesystem journaling code (issued where the filesystem issues a barrier or disk cache flush today) may be just what we need to make that work.
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
On Fri, Jul 21, 2017 at 8:58 AM, Stefan Hajnoczi wrote: > On Fri, Jul 21, 2017 at 09:29:15AM -0400, Pankaj Gupta wrote: >> >> > > A] Problems to solve: >> > > -- >> > > >> > > 1] We are considering two approaches for 'fake DAX flushing interface'. >> > > >> > > 1.1] fake dax with NVDIMM flush hints & KVM async page fault >> > > >> > > - Existing interface. >> > > >> > > - The approach to use flush hint address is already nacked upstream. >> > > >> > > - Flush hint not queued interface for flushing. Applications might >> > >avoid to use it. >> > >> > This doesn't contradicts the last point about async operation and vcpu >> > control. KVM async page faults turn the Address Flush Hints write into >> > an async operation so the guest can get other work done while waiting >> > for completion. >> > >> > > >> > > - Flush hint address traps from guest to host and do an entire fsync >> > >on backing file which itself is costly. >> > > >> > > - Can be used to flush specific pages on host backing disk. We can >> > >send data(pages information) equal to cache-line size(limitation) >> > >and tell host to sync corresponding pages instead of entire disk >> > >sync. >> > >> > Are you sure? Your previous point says only the entire device can be >> > synced. The NVDIMM Adress Flush Hints interface does not involve >> > address range information. >> >> Just syncing entire block device should be simple but costly. Using flush >> hint address to write data which contains list/info of dirty pages to >> flush requires more thought. This calls mmio write callback at Qemu side. >> As per Intel (ACPI spec 6.1, Table 5-135) there is limit to max length >> of data guest can write and is equal to cache line size. >> >> > >> > > >> > > - This will be an asynchronous operation and vCPU control is >> > > returned >> > >quickly. 
>> > > >> > > >> > > 1.2] Using additional para virt device in addition to pmem device(fake >> > > dax >> > > with device flush) >> > >> > Perhaps this can be exposed via ACPI as part of the NVDIMM standards >> > instead of a separate KVM-only paravirt device. >> >> Same reason as above. If we decide on sending list of dirty pages there is >> limit to send max size of data to host using flush hint address. > > I understand now: you are proposing to change the semantics of the > Address Flush Hints interface. You want the value written to have > meaning (the address range that needs to be flushed). > > Today the spec says: > > The content of the data is not relevant to the functioning of the > flush hint mechanism. > > Maybe the NVDIMM folks can comment on this idea. I think it's unworkable to use the flush hints as a guest-to-host fsync mechanism. That mechanism was designed to flush small memory controller buffers, not large swaths of dirty memory. What about running the guests in a writethrough cache mode to avoid needing dirty cache management altogether? Either way I think you need to use device-dax on the host, or one of the two work-in-progress filesystem mechanisms (synchronous-faults or S_IOMAP_FROZEN) to avoid need any metadata coordination between guests and the host.
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
On Fri, Jul 21, 2017 at 09:29:15AM -0400, Pankaj Gupta wrote: > > > > A] Problems to solve: > > > -- > > > > > > 1] We are considering two approaches for 'fake DAX flushing interface'. > > > > > > 1.1] fake dax with NVDIMM flush hints & KVM async page fault > > > > > > - Existing interface. > > > > > > - The approach to use flush hint address is already nacked upstream. > > > > > > - Flush hint not queued interface for flushing. Applications might > > >avoid to use it. > > > > This doesn't contradicts the last point about async operation and vcpu > > control. KVM async page faults turn the Address Flush Hints write into > > an async operation so the guest can get other work done while waiting > > for completion. > > > > > > > > - Flush hint address traps from guest to host and do an entire fsync > > >on backing file which itself is costly. > > > > > > - Can be used to flush specific pages on host backing disk. We can > > >send data(pages information) equal to cache-line size(limitation) > > >and tell host to sync corresponding pages instead of entire disk > > >sync. > > > > Are you sure? Your previous point says only the entire device can be > > synced. The NVDIMM Adress Flush Hints interface does not involve > > address range information. > > Just syncing entire block device should be simple but costly. Using flush > hint address to write data which contains list/info of dirty pages to > flush requires more thought. This calls mmio write callback at Qemu side. > As per Intel (ACPI spec 6.1, Table 5-135) there is limit to max length > of data guest can write and is equal to cache line size. > > > > > > > > > - This will be an asynchronous operation and vCPU control is returned > > >quickly. > > > > > > > > > 1.2] Using additional para virt device in addition to pmem device(fake > > > dax > > > with device flush) > > > > Perhaps this can be exposed via ACPI as part of the NVDIMM standards > > instead of a separate KVM-only paravirt device. > > Same reason as above. 
If we decide on sending list of dirty pages there is > limit to send max size of data to host using flush hint address. I understand now: you are proposing to change the semantics of the Address Flush Hints interface. You want the value written to have meaning (the address range that needs to be flushed). Today the spec says: The content of the data is not relevant to the functioning of the flush hint mechanism. Maybe the NVDIMM folks can comment on this idea.
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
On Fri, 2017-07-21 at 09:29 -0400, Pankaj Gupta wrote: > > > > > > - Flush hint address traps from guest to host and do an > > > entire fsync > > > on backing file which itself is costly. > > > > > > - Can be used to flush specific pages on host backing disk. > > > We can > > > send data(pages information) equal to cache-line > > > size(limitation) > > > and tell host to sync corresponding pages instead of > > > entire disk > > > sync. > > > > Are you sure? Your previous point says only the entire device can > > be > > synced. The NVDIMM Adress Flush Hints interface does not involve > > address range information. > > Just syncing entire block device should be simple but costly. Costly depends on just how fast the backing IO device is. If the backing IO is a spinning disk, doing targeted range syncs will certainly be faster. On the other hand, if the backing IO is one of the latest generation SSD devices, it may be faster to have just one hypercall and flush everything, than it would be to have separate sync calls for each range that we want flushed. Should we design our interfaces for yesterday's storage devices, or for tomorrow's storage devices?
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
> > A] Problems to solve:
> > --
> >
> > 1] We are considering two approaches for the 'fake DAX flushing interface'.
> >
> > 1.1] fake DAX with NVDIMM flush hints & KVM async page fault
> >
> >  - Existing interface.
> >
> >  - The approach of using the flush hint address has already been nacked
> >    upstream.
> >
> >  - The flush hint is not a queued interface for flushing. Applications
> >    might avoid using it.
>
> This doesn't contradict the last point about async operation and vCPU
> control. KVM async page faults turn the Address Flush Hints write into
> an async operation so the guest can get other work done while waiting
> for completion.
>
> >  - A flush hint address write traps from guest to host and does an entire
> >    fsync on the backing file, which is itself costly.
> >
> >  - Can be used to flush specific pages on the host backing disk. We can
> >    send data (page information) up to the cache-line size (a limitation)
> >    and tell the host to sync the corresponding pages instead of syncing
> >    the entire disk.
>
> Are you sure? Your previous point says only the entire device can be
> synced. The NVDIMM Address Flush Hints interface does not involve
> address range information.

Just syncing the entire block device should be simple but costly. Using the
flush hint address to write data which contains a list of dirty pages to
flush requires more thought. This calls the MMIO write callback on the QEMU
side. As per Intel (ACPI spec 6.1, Table 5-135) there is a limit on the max
length of data the guest can write, equal to the cache-line size.

> >  - This will be an asynchronous operation and vCPU control is returned
> >    quickly.
> >
> > 1.2] Using an additional paravirt device in addition to the pmem device
> >      (fake DAX with device flush)
>
> Perhaps this can be exposed via ACPI as part of the NVDIMM standards
> instead of a separate KVM-only paravirt device.

Same reason as above. If we decide on sending a list of dirty pages, there
is a limit on the max size of data sent to the host using the flush hint
address.
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
On Fri, Jul 21, 2017 at 02:56:34AM -0400, Pankaj Gupta wrote:
> A] Problems to solve:
> --
>
> 1] We are considering two approaches for the 'fake DAX flushing interface'.
>
> 1.1] fake DAX with NVDIMM flush hints & KVM async page fault
>
>  - Existing interface.
>
>  - The approach of using the flush hint address has already been nacked
>    upstream.
>
>  - The flush hint is not a queued interface for flushing. Applications
>    might avoid using it.

This doesn't contradict the last point about async operation and vCPU
control. KVM async page faults turn the Address Flush Hints write into
an async operation so the guest can get other work done while waiting
for completion.

>  - A flush hint address write traps from guest to host and does an entire
>    fsync on the backing file, which is itself costly.
>
>  - Can be used to flush specific pages on the host backing disk. We can
>    send data (page information) up to the cache-line size (a limitation)
>    and tell the host to sync the corresponding pages instead of syncing
>    the entire disk.

Are you sure? Your previous point says only the entire device can be
synced. The NVDIMM Address Flush Hints interface does not involve
address range information.

>  - This will be an asynchronous operation and vCPU control is returned
>    quickly.
>
> 1.2] Using an additional paravirt device in addition to the pmem device
>      (fake DAX with device flush)

Perhaps this can be exposed via ACPI as part of the NVDIMM standards
instead of a separate KVM-only paravirt device.
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
> > Hello,
> >
> > We shared a proposal for the 'KVM fake DAX flushing interface'.
> >
> > https://lists.gnu.org/archive/html/qemu-devel/2017-05/msg02478.html
>
> In the above link:
>
>   "Overall goal of project is to increase the number of virtual machines
>    that can be run on a physical machine, in order to *increase the
>    density* of customer virtual machines"
>
> Is the fake persistent memory used as normal RAM in the guest? If not,
> how is it expected to be used in the guest?

Yes, the guest will have an NVDIMM DAX device and not use the page cache
for most operations. The host will manage the memory requirements of all
the guests.

> > We did an initial POC in which we used a 'virtio-blk' device to perform
> > a device flush on pmem fsync on an ext4 filesystem. There are a few
> > hacks to make things work. We need suggestions on the points below
> > before we start the actual implementation.
> >
> > A] Problems to solve:
> > --
> >
> > 1] We are considering two approaches for the 'fake DAX flushing interface'.
> >
> > 1.1] fake DAX with NVDIMM flush hints & KVM async page fault
> >
> >  - Existing interface.
> >
> >  - The approach of using the flush hint address has already been nacked
> >    upstream.
> >
> >  - The flush hint is not a queued interface for flushing. Applications
> >    might avoid using it.
> >
> >  - A flush hint address write traps from guest to host and does an entire
> >    fsync on the backing file, which is itself costly.
> >
> >  - Can be used to flush specific pages on the host backing disk. We can
> >    send data (page information) up to the cache-line size (a limitation)
> >    and tell the host to sync the corresponding pages instead of syncing
> >    the entire disk.
> >
> >  - This will be an asynchronous operation and vCPU control is returned
> >    quickly.
> >
> > 1.2] Using an additional paravirt device in addition to the pmem device
> >      (fake DAX with device flush)
> >
> >  - New interface.
> >
> >  - The guest maintains information about DAX dirty pages as exceptional
> >    entries in the radix tree.
> >
> >  - If we want to flush specific pages from guest to host, we need to send
> >    a list of the dirty pages corresponding to the file on which we are
> >    doing fsync.
> >
> >  - This will require implementation of a new interface: a new paravirt
> >    device for sending flush requests.
> >
> >  - The host side will perform fsync/fdatasync on the list of dirty pages
> >    or on the entire file backing the block device.
> >
> > 2] Questions:
> > ---
> >
> > 2.1] Not sure why the WPQ flush is not a queued interface. Can we force
> >      applications to call it? Device DAX doesn't call fsync/msync either.
> >
> > 2.2] Depending on the interface we decide on, we need an optimal solution
> >      to sync a range of pages:
> >
> >      - Send a range of pages from guest to host to sync asynchronously
> >        instead of syncing the entire block device?
>
> e.g. a new virtio device to deliver sync requests to the host?
>
> >      - The other option is to sync the entire file backing the disk to
> >        make sure all the writes are persistent. In our case, the backing
> >        file is a regular file on a non-NVDIMM device, so the host page
> >        cache has the list of dirty pages, which can be used with fsync
> >        or a similar interface.
>
> As the number of dirty pages can vary, the latency of each host fsync is
> likely to vary over a large range.
>
> > 2.3] If we do a host fsync on the entire disk, we will be flushing all
> >      the dirty data to the backend file. Just thinking which would be
> >      the better approach: flushing pages on the corresponding guest file
> >      fsync, or the entire block device?
> >
> > 2.4] If we decide to choose one of the above approaches, we need to
> >      consider all DAX-supporting filesystems (ext4/xfs). Does hooking
> >      code into the corresponding fsync code of the filesystem seem
> >      reasonable? Just thinking for the flush hint address use case.
> >      Or how would flush hint addresses be invoked with fsync or a
> >      similar API?
> >
> > 2.5] Also, with filesystem journalling and other mount options like
> >      barriers, ordered, etc., how do we decide whether to use the page
> >      flush hint or a regular fsync on the file?
> >
> > 2.6] If at the guest side we have the PFNs of all the dirty pages in
> >      the radix tree and we send these to the host, would the host side
> >      be able to find the corresponding pages and flush them all?
>
> That may require that the host file system provide an API to flush
> specified blocks/extents and their metadata in the file system. I'm not
> familiar with this part and don't know whether such an API exists.
>
> Haozhong
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
On 07/21/17 02:56 -0400, Pankaj Gupta wrote:
>
> Hello,
>
> We shared a proposal for the 'KVM fake DAX flushing interface'.
>
> https://lists.gnu.org/archive/html/qemu-devel/2017-05/msg02478.html
>

In the above link:

  "Overall goal of project is to increase the number of virtual machines
   that can be run on a physical machine, in order to *increase the
   density* of customer virtual machines"

Is the fake persistent memory used as normal RAM in the guest? If not, how
is it expected to be used in the guest?

> We did an initial POC in which we used a 'virtio-blk' device to perform
> a device flush on pmem fsync on an ext4 filesystem. There are a few hacks
> to make things work. We need suggestions on the points below before we
> start the actual implementation.
>
> A] Problems to solve:
> --
>
> 1] We are considering two approaches for the 'fake DAX flushing interface'.
>
> 1.1] fake DAX with NVDIMM flush hints & KVM async page fault
>
>  - Existing interface.
>
>  - The approach of using the flush hint address has already been nacked
>    upstream.
>
>  - The flush hint is not a queued interface for flushing. Applications
>    might avoid using it.
>
>  - A flush hint address write traps from guest to host and does an entire
>    fsync on the backing file, which is itself costly.
>
>  - Can be used to flush specific pages on the host backing disk. We can
>    send data (page information) up to the cache-line size (a limitation)
>    and tell the host to sync the corresponding pages instead of syncing
>    the entire disk.
>
>  - This will be an asynchronous operation and vCPU control is returned
>    quickly.
>
> 1.2] Using an additional paravirt device in addition to the pmem device
>      (fake DAX with device flush)
>
>  - New interface.
>
>  - The guest maintains information about DAX dirty pages as exceptional
>    entries in the radix tree.
>
>  - If we want to flush specific pages from guest to host, we need to send
>    a list of the dirty pages corresponding to the file on which we are
>    doing fsync.
>
>  - This will require implementation of a new interface: a new paravirt
>    device for sending flush requests.
>
>  - The host side will perform fsync/fdatasync on the list of dirty pages
>    or on the entire file backing the block device.
>
> 2] Questions:
> ---
>
> 2.1] Not sure why the WPQ flush is not a queued interface. Can we force
>      applications to call it? Device DAX doesn't call fsync/msync either.
>
> 2.2] Depending on the interface we decide on, we need an optimal solution
>      to sync a range of pages:
>
>      - Send a range of pages from guest to host to sync asynchronously
>        instead of syncing the entire block device?

e.g. a new virtio device to deliver sync requests to the host?

>      - The other option is to sync the entire file backing the disk to
>        make sure all the writes are persistent. In our case, the backing
>        file is a regular file on a non-NVDIMM device, so the host page
>        cache has the list of dirty pages, which can be used with fsync or
>        a similar interface.

As the number of dirty pages can vary, the latency of each host fsync is
likely to vary over a large range.

> 2.3] If we do a host fsync on the entire disk, we will be flushing all the
>      dirty data to the backend file. Just thinking which would be the
>      better approach: flushing pages on the corresponding guest file
>      fsync, or the entire block device?
>
> 2.4] If we decide to choose one of the above approaches, we need to
>      consider all DAX-supporting filesystems (ext4/xfs). Does hooking
>      code into the corresponding fsync code of the filesystem seem
>      reasonable? Just thinking for the flush hint address use case.
>      Or how would flush hint addresses be invoked with fsync or a similar
>      API?
>
> 2.5] Also, with filesystem journalling and other mount options like
>      barriers, ordered, etc., how do we decide whether to use the page
>      flush hint or a regular fsync on the file?
>
> 2.6] If at the guest side we have the PFNs of all the dirty pages in the
>      radix tree and we send these to the host, would the host side be
>      able to find the corresponding pages and flush them all?

That may require that the host file system provide an API to flush
specified blocks/extents and their metadata in the file system. I'm not
familiar with this part and don't know whether such an API exists.

Haozhong
[Qemu-devel] KVM "fake DAX" flushing interface - discussion
Hello,

We shared a proposal for the 'KVM fake DAX flushing interface':

https://lists.gnu.org/archive/html/qemu-devel/2017-05/msg02478.html

We did an initial POC in which we used a 'virtio-blk' device to perform
a device flush on pmem fsync on an ext4 filesystem. There are a few hacks
to make things work. We need suggestions on the points below before we
start the actual implementation.

A] Problems to solve:
--

1] We are considering two approaches for the 'fake DAX flushing interface'.

1.1] fake DAX with NVDIMM flush hints & KVM async page fault

 - Existing interface.

 - The approach of using the flush hint address has already been nacked
   upstream.

 - The flush hint is not a queued interface for flushing. Applications
   might avoid using it.

 - A flush hint address write traps from guest to host and does an entire
   fsync on the backing file, which is itself costly.

 - Can be used to flush specific pages on the host backing disk. We can
   send data (page information) up to the cache-line size (a limitation)
   and tell the host to sync the corresponding pages instead of syncing
   the entire disk.

 - This will be an asynchronous operation and vCPU control is returned
   quickly.

1.2] Using an additional paravirt device in addition to the pmem device
     (fake DAX with device flush)

 - New interface.

 - The guest maintains information about DAX dirty pages as exceptional
   entries in the radix tree.

 - If we want to flush specific pages from guest to host, we need to send
   a list of the dirty pages corresponding to the file on which we are
   doing fsync.

 - This will require implementation of a new interface: a new paravirt
   device for sending flush requests.

 - The host side will perform fsync/fdatasync on the list of dirty pages
   or on the entire file backing the block device.

2] Questions:
---

2.1] Not sure why the WPQ flush is not a queued interface. Can we force
     applications to call it? Device DAX doesn't call fsync/msync either.

2.2] Depending on the interface we decide on, we need an optimal solution
     to sync a range of pages:

     - Send a range of pages from guest to host to sync asynchronously
       instead of syncing the entire block device?

     - The other option is to sync the entire file backing the disk to
       make sure all the writes are persistent. In our case, the backing
       file is a regular file on a non-NVDIMM device, so the host page
       cache has the list of dirty pages, which can be used with fsync or
       a similar interface.

2.3] If we do a host fsync on the entire disk, we will be flushing all the
     dirty data to the backend file. Just thinking which would be the
     better approach: flushing pages on the corresponding guest file fsync,
     or the entire block device?

2.4] If we decide to choose one of the above approaches, we need to
     consider all DAX-supporting filesystems (ext4/xfs). Does hooking code
     into the corresponding fsync code of the filesystem seem reasonable?
     Just thinking for the flush hint address use case. Or how would flush
     hint addresses be invoked with fsync or a similar API?

2.5] Also, with filesystem journalling and other mount options like
     barriers, ordered, etc., how do we decide whether to use the page
     flush hint or a regular fsync on the file?

2.6] If at the guest side we have the PFNs of all the dirty pages in the
     radix tree and we send these to the host, would the host side be able
     to find the corresponding pages and flush them all?

Suggestions & ideas are welcome.

Thanks,
Pankaj