Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
On Thu, Jan 18, 2018 at 11:51 AM, David Hildenbrand wrote:
>>> 1] Existing pmem driver & virtio for region discovery:
>>>
>>> Use the existing pmem driver, which is tightly coupled with concepts of
>>> namespaces, labels etc. from ACPI region discovery, and re-implement these
>>> concepts with virtio so that the existing pmem driver can understand them.
>>> In addition, the pmem driver is tasked with sending flush commands
>>> using virtio.
>>
>> It's not tightly coupled. The whole point of libnvdimm is to be
>> agnostic to ACPI, e820 or any other range discovery. The only work to
>> do beyond identifying the address range is teaching libnvdimm to pass
>> along a flush control interface to the pmem driver.
>>
>>> 2] Existing pmem driver & ACPI NFIT for region discovery:
>>>
>>> If we use ACPI NFIT, we need to teach the existing ACPI driver to add this
>>> new memory type and teach the existing pmem driver to handle it. We still
>>> need an asynchronous (virtio) way to send flush commands: a virtio
>>> device/driver, or an arbitrary key/value-like pair, just to send commands
>>> from guest to host using virtio.
>>>
>>> 3] New virtio-pmem driver & paravirt device:
>>>
>>> The third way is a new virtio-pmem driver, with less work to support the
>>> existing features of the different protocols, and with an asynchronous way
>>> of sending flush commands.
>>>
>>> This needs to duplicate some of the work the existing pmem driver does,
>>> but as discussed previously we can separate out common code from the
>>> existing pmem driver and reuse it.
>>>
>>> Among these approaches I also prefer 3].
>>
>> I disagree; the reason we went down this ACPI path was to limit the
>> needless duplication of most of the pmem driver.
>
> I have way too little insight to make qualified statements about the
> different approaches here. :)
>
> All I am interested in is making this as independent of
> architecture-specific technologies (e.g. ACPI) as possible. We will want
> this e.g. for s390x too, rather sooner than later. So trying to couple
> this (somehow) to ACPI just for the sake of less code to copy will not
> pay off in the long run.
>
> Better to have a clean virtio interface/design right from the start.
>
> So I hope my words will be heard :)

I think that's reasonable. Once we have the virtio-based discovery I think
the incremental changes to the libnvdimm core and the pmem driver are small.
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
>> 1] Existing pmem driver & virtio for region discovery:
>>
>> Use the existing pmem driver, which is tightly coupled with concepts of
>> namespaces, labels etc. from ACPI region discovery, and re-implement these
>> concepts with virtio so that the existing pmem driver can understand them.
>> In addition, the pmem driver is tasked with sending flush commands
>> using virtio.
>
> It's not tightly coupled. The whole point of libnvdimm is to be
> agnostic to ACPI, e820 or any other range discovery. The only work to
> do beyond identifying the address range is teaching libnvdimm to pass
> along a flush control interface to the pmem driver.
>
>> 2] Existing pmem driver & ACPI NFIT for region discovery:
>>
>> If we use ACPI NFIT, we need to teach the existing ACPI driver to add this
>> new memory type and teach the existing pmem driver to handle it. We still
>> need an asynchronous (virtio) way to send flush commands: a virtio
>> device/driver, or an arbitrary key/value-like pair, just to send commands
>> from guest to host using virtio.
>>
>> 3] New virtio-pmem driver & paravirt device:
>>
>> The third way is a new virtio-pmem driver, with less work to support the
>> existing features of the different protocols, and with an asynchronous way
>> of sending flush commands.
>>
>> This needs to duplicate some of the work the existing pmem driver does,
>> but as discussed previously we can separate out common code from the
>> existing pmem driver and reuse it.
>>
>> Among these approaches I also prefer 3].
>
> I disagree; the reason we went down this ACPI path was to limit the
> needless duplication of most of the pmem driver.

I have way too little insight to make qualified statements about the
different approaches here. :)

All I am interested in is making this as independent of
architecture-specific technologies (e.g. ACPI) as possible. We will want
this e.g. for s390x too, rather sooner than later. So trying to couple
this (somehow) to ACPI just for the sake of less code to copy will not
pay off in the long run.

Better to have a clean virtio interface/design right from the start.

So I hope my words will be heard :)

--
Thanks,

David / dhildenb
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
On Thu, Jan 18, 2018 at 11:36 AM, Pankaj Gupta wrote:
>> On Thu, Jan 18, 2018 at 10:54 AM, Pankaj Gupta wrote:
>>
>>>>>> I'd like to emphasize again that I would prefer a virtio-pmem-only
>>>>>> solution.
>>>>>>
>>>>>> There are architectures out there (e.g. s390x) that don't support
>>>>>> NVDIMMs - there is no HW interface to expose any such stuff.
>>>>>>
>>>>>> However, with virtio-pmem, we could make it work also on architectures
>>>>>> not having ACPI and friends.
>>>>>
>>>>> ACPI and virtio-only can share the same pmem driver. There are two
>>>>> parts to this: region discovery and setting up the pmem driver. For
>>>>> discovery you can either have an NFIT-bus-defined range, or a new
>>>>> virtio-pmem bus define it. As far as the pmem driver itself goes, it's
>>>>> agnostic to how the range is discovered.
>>>>
>>>> And in addition to discovery + setup, we need the flush via virtio.
>>>>
>>>>> In other words, pmem consumes 'regions' from libnvdimm, and a bus
>>>>> provider like nfit, e820, or a new virtio mechanism produces 'regions'.
>>>>
>>>> That sounds good to me. I would like to see how the ACPI discovery
>>>> variant connects to a virtio ring.
>>>>
>>>> The natural way for me would be:
>>>>
>>>> A virtio-X device supplies a memory region ("discovery") and also the
>>>> interface for flushes for this device. So one virtio-X device corresponds
>>>> to one pmem device. No ACPI needs to be involved (also not on
>>>> architectures that have ACPI).
>>>
>>> I agree here; if we discover regions with virtio-X we don't need to worry
>>> about ACPI NFIT. Actually, there are three ways to do it, with pros and
>>> cons for each approach:
>>>
>>> 1] Existing pmem driver & virtio for region discovery:
>>>
>>> Use the existing pmem driver, which is tightly coupled with concepts of
>>> namespaces, labels etc. from ACPI region discovery, and re-implement these
>>> concepts with virtio so that the existing pmem driver can understand them.
>>> In addition, the pmem driver is tasked with sending flush commands
>>> using virtio.
>>
>> It's not tightly coupled. The whole point of libnvdimm is to be
>> agnostic to ACPI, e820 or any other range discovery. The only work to
>> do beyond identifying the address range is teaching libnvdimm to pass
>> along a flush control interface to the pmem driver.
>
> o.k., that means we can configure libnvdimm with virtio as well and use the
> existing pmem driver. AFAICU it uses the nvdimm bus?
>
> Do we need other features which ACPI provides?

No, to keep it simple use nvdimm_pmem_region_create() without registering
any DIMM devices. I'd start with the e820 driver as a bus driver reference
(drivers/nvdimm/e820.c) rather than trying to unwind the complexity of the
nfit driver.
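The producer/consumer split described here can be put in a few lines of code. Below is a toy userspace model of the idea (a bus provider such as e820, nfit, or a virtio device only *produces* a region; a generic pmem consumer attaches to it and gets a provider-specific flush hook). All names, addresses, and sizes are hypothetical; this is not the real libnvdimm API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy model: a region carries only an address range and a flush hook,
 * so the consumer stays agnostic to how the range was discovered. */
struct toy_region {
    unsigned long long start;               /* guest-physical base */
    unsigned long long size;                /* length in bytes */
    int (*flush)(struct toy_region *r);     /* provider-specific flush */
};

/* Provider side: e.g. a hypothetical virtio-pmem bus driver. */
static int virtio_toy_flush(struct toy_region *r)
{
    (void)r;
    /* A real driver would queue a flush request on a virtqueue and
     * wait for the host-side fsync to complete; here we just succeed. */
    return 0;
}

static void virtio_toy_probe(struct toy_region *r)
{
    r->start = 0x100000000ULL;  /* hypothetical 4 GiB base */
    r->size  = 0x40000000ULL;   /* hypothetical 1 GiB range */
    r->flush = virtio_toy_flush;
}

/* Consumer side: the "pmem driver" sees only a region + flush hook. */
static int toy_pmem_attach(struct toy_region *r)
{
    if (!r->size || !r->flush)
        return -1;
    return r->flush(r);  /* e.g. invoked on a flush request */
}
```

The point of the sketch is only the direction of the dependency: the consumer never asks *how* the range was found, it just calls the hook the provider installed.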
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
> On Thu, Jan 18, 2018 at 10:54 AM, Pankaj Gupta wrote:
>
>>>>> I'd like to emphasize again that I would prefer a virtio-pmem-only
>>>>> solution.
>>>>>
>>>>> There are architectures out there (e.g. s390x) that don't support
>>>>> NVDIMMs - there is no HW interface to expose any such stuff.
>>>>>
>>>>> However, with virtio-pmem, we could make it work also on architectures
>>>>> not having ACPI and friends.
>>>>
>>>> ACPI and virtio-only can share the same pmem driver. There are two
>>>> parts to this: region discovery and setting up the pmem driver. For
>>>> discovery you can either have an NFIT-bus-defined range, or a new
>>>> virtio-pmem bus define it. As far as the pmem driver itself goes, it's
>>>> agnostic to how the range is discovered.
>>>
>>> And in addition to discovery + setup, we need the flush via virtio.
>>>
>>>> In other words, pmem consumes 'regions' from libnvdimm, and a bus
>>>> provider like nfit, e820, or a new virtio mechanism produces 'regions'.
>>>
>>> That sounds good to me. I would like to see how the ACPI discovery
>>> variant connects to a virtio ring.
>>>
>>> The natural way for me would be:
>>>
>>> A virtio-X device supplies a memory region ("discovery") and also the
>>> interface for flushes for this device. So one virtio-X device corresponds
>>> to one pmem device. No ACPI needs to be involved (also not on
>>> architectures that have ACPI).
>>
>> I agree here; if we discover regions with virtio-X we don't need to worry
>> about ACPI NFIT. Actually, there are three ways to do it, with pros and
>> cons for each approach:
>>
>> 1] Existing pmem driver & virtio for region discovery:
>>
>> Use the existing pmem driver, which is tightly coupled with concepts of
>> namespaces, labels etc. from ACPI region discovery, and re-implement these
>> concepts with virtio so that the existing pmem driver can understand them.
>> In addition, the pmem driver is tasked with sending flush commands
>> using virtio.
>
> It's not tightly coupled. The whole point of libnvdimm is to be
> agnostic to ACPI, e820 or any other range discovery. The only work to
> do beyond identifying the address range is teaching libnvdimm to pass
> along a flush control interface to the pmem driver.

o.k., that means we can configure libnvdimm with virtio as well and use the
existing pmem driver. AFAICU it uses the nvdimm bus?

Do we need other features which ACPI provides?

acpi_nfit_init
  nvdimm_bus_register
  ...
  acpi_nfit_register_region
    acpi_region_create
      nvdimm_pmem_region_create

Also, I need to check how to pass the virtio flush interface.

>> 2] Existing pmem driver & ACPI NFIT for region discovery:
>>
>> If we use ACPI NFIT, we need to teach the existing ACPI driver to add this
>> new memory type and teach the existing pmem driver to handle it. We still
>> need an asynchronous (virtio) way to send flush commands: a virtio
>> device/driver, or an arbitrary key/value-like pair, just to send commands
>> from guest to host using virtio.
>>
>> 3] New virtio-pmem driver & paravirt device:
>>
>> The third way is a new virtio-pmem driver, with less work to support the
>> existing features of the different protocols, and with an asynchronous way
>> of sending flush commands.
>>
>> This needs to duplicate some of the work the existing pmem driver does,
>> but as discussed previously we can separate out common code from the
>> existing pmem driver and reuse it.
>>
>> Among these approaches I also prefer 3].
>
> I disagree; the reason we went down this ACPI path was to limit the
> needless duplication of most of the pmem driver.

yes.
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
>>> I'd like to emphasize again that I would prefer a virtio-pmem-only
>>> solution.
>>>
>>> There are architectures out there (e.g. s390x) that don't support
>>> NVDIMMs - there is no HW interface to expose any such stuff.
>>>
>>> However, with virtio-pmem, we could make it work also on architectures
>>> not having ACPI and friends.
>>
>> ACPI and virtio-only can share the same pmem driver. There are two
>> parts to this: region discovery and setting up the pmem driver. For
>> discovery you can either have an NFIT-bus-defined range, or a new
>> virtio-pmem bus define it. As far as the pmem driver itself goes, it's
>> agnostic to how the range is discovered.
>
> And in addition to discovery + setup, we need the flush via virtio.
>
>> In other words, pmem consumes 'regions' from libnvdimm, and a bus
>> provider like nfit, e820, or a new virtio mechanism produces 'regions'.
>
> That sounds good to me. I would like to see how the ACPI discovery
> variant connects to a virtio ring.
>
> The natural way for me would be:
>
> A virtio-X device supplies a memory region ("discovery") and also the
> interface for flushes for this device. So one virtio-X device corresponds
> to one pmem device. No ACPI needs to be involved (also not on architectures
> that have ACPI).

I agree here; if we discover regions with virtio-X we don't need to worry
about ACPI NFIT. Actually, there are three ways to do it, with pros and
cons for each approach:

1] Existing pmem driver & virtio for region discovery:

Use the existing pmem driver, which is tightly coupled with concepts of
namespaces, labels etc. from ACPI region discovery, and re-implement these
concepts with virtio so that the existing pmem driver can understand them.
In addition, the pmem driver is tasked with sending flush commands
using virtio.

2] Existing pmem driver & ACPI NFIT for region discovery:

If we use ACPI NFIT, we need to teach the existing ACPI driver to add this
new memory type and teach the existing pmem driver to handle it. We still
need an asynchronous (virtio) way to send flush commands: a virtio
device/driver, or an arbitrary key/value-like pair, just to send commands
from guest to host using virtio.

3] New virtio-pmem driver & paravirt device:

The third way is a new virtio-pmem driver, with less work to support the
existing features of the different protocols, and with an asynchronous way
of sending flush commands.

This needs to duplicate some of the work the existing pmem driver does,
but as discussed previously we can separate out common code from the
existing pmem driver and reuse it.

Among these approaches I also prefer 3].

> --
> Thanks,
> David / dhildenb
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
On Thu, Jan 18, 2018 at 10:54 AM, Pankaj Gupta wrote:
>>>> I'd like to emphasize again that I would prefer a virtio-pmem-only
>>>> solution.
>>>>
>>>> There are architectures out there (e.g. s390x) that don't support
>>>> NVDIMMs - there is no HW interface to expose any such stuff.
>>>>
>>>> However, with virtio-pmem, we could make it work also on architectures
>>>> not having ACPI and friends.
>>>
>>> ACPI and virtio-only can share the same pmem driver. There are two
>>> parts to this: region discovery and setting up the pmem driver. For
>>> discovery you can either have an NFIT-bus-defined range, or a new
>>> virtio-pmem bus define it. As far as the pmem driver itself goes, it's
>>> agnostic to how the range is discovered.
>>
>> And in addition to discovery + setup, we need the flush via virtio.
>>
>>> In other words, pmem consumes 'regions' from libnvdimm, and a bus
>>> provider like nfit, e820, or a new virtio mechanism produces 'regions'.
>>
>> That sounds good to me. I would like to see how the ACPI discovery
>> variant connects to a virtio ring.
>>
>> The natural way for me would be:
>>
>> A virtio-X device supplies a memory region ("discovery") and also the
>> interface for flushes for this device. So one virtio-X device corresponds
>> to one pmem device. No ACPI needs to be involved (also not on
>> architectures that have ACPI).
>
> I agree here; if we discover regions with virtio-X we don't need to worry
> about ACPI NFIT. Actually, there are three ways to do it, with pros and
> cons for each approach:
>
> 1] Existing pmem driver & virtio for region discovery:
>
> Use the existing pmem driver, which is tightly coupled with concepts of
> namespaces, labels etc. from ACPI region discovery, and re-implement these
> concepts with virtio so that the existing pmem driver can understand them.
> In addition, the pmem driver is tasked with sending flush commands
> using virtio.

It's not tightly coupled. The whole point of libnvdimm is to be
agnostic to ACPI, e820 or any other range discovery. The only work to
do beyond identifying the address range is teaching libnvdimm to pass
along a flush control interface to the pmem driver.

> 2] Existing pmem driver & ACPI NFIT for region discovery:
>
> If we use ACPI NFIT, we need to teach the existing ACPI driver to add this
> new memory type and teach the existing pmem driver to handle it. We still
> need an asynchronous (virtio) way to send flush commands: a virtio
> device/driver, or an arbitrary key/value-like pair, just to send commands
> from guest to host using virtio.
>
> 3] New virtio-pmem driver & paravirt device:
>
> The third way is a new virtio-pmem driver, with less work to support the
> existing features of the different protocols, and with an asynchronous way
> of sending flush commands.
>
> This needs to duplicate some of the work the existing pmem driver does,
> but as discussed previously we can separate out common code from the
> existing pmem driver and reuse it.
>
> Among these approaches I also prefer 3].

I disagree; the reason we went down this ACPI path was to limit the
needless duplication of most of the pmem driver.
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
On Thu, Jan 18, 2018 at 9:48 AM, David Hildenbrand wrote:
>>> I'd like to emphasize again that I would prefer a virtio-pmem-only
>>> solution.
>>>
>>> There are architectures out there (e.g. s390x) that don't support
>>> NVDIMMs - there is no HW interface to expose any such stuff.
>>>
>>> However, with virtio-pmem, we could make it work also on architectures
>>> not having ACPI and friends.
>>
>> ACPI and virtio-only can share the same pmem driver. There are two
>> parts to this: region discovery and setting up the pmem driver. For
>> discovery you can either have an NFIT-bus-defined range, or a new
>> virtio-pmem bus define it. As far as the pmem driver itself goes, it's
>> agnostic to how the range is discovered.
>
> And in addition to discovery + setup, we need the flush via virtio.
>
>> In other words, pmem consumes 'regions' from libnvdimm, and a bus
>> provider like nfit, e820, or a new virtio mechanism produces 'regions'.
>
> That sounds good to me. I would like to see how the ACPI discovery
> variant connects to a virtio ring.
>
> The natural way for me would be:
>
> A virtio-X device supplies a memory region ("discovery") and also the
> interface for flushes for this device. So one virtio-X device corresponds
> to one pmem device. No ACPI needs to be involved (also not on architectures
> that have ACPI).

Hmm, yes - it seems that if ACPI is just going to be used as a trigger for
"go find the virtio-X interface for this range", we could have started from
a virtio device in the first place.
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
>> I'd like to emphasize again that I would prefer a virtio-pmem-only
>> solution.
>>
>> There are architectures out there (e.g. s390x) that don't support
>> NVDIMMs - there is no HW interface to expose any such stuff.
>>
>> However, with virtio-pmem, we could make it work also on architectures
>> not having ACPI and friends.
>
> ACPI and virtio-only can share the same pmem driver. There are two
> parts to this: region discovery and setting up the pmem driver. For
> discovery you can either have an NFIT-bus-defined range, or a new
> virtio-pmem bus define it. As far as the pmem driver itself goes, it's
> agnostic to how the range is discovered.

And in addition to discovery + setup, we need the flush via virtio.

> In other words, pmem consumes 'regions' from libnvdimm, and a bus
> provider like nfit, e820, or a new virtio mechanism produces 'regions'.

That sounds good to me. I would like to see how the ACPI discovery
variant connects to a virtio ring.

The natural way for me would be:

A virtio-X device supplies a memory region ("discovery") and also the
interface for flushes for this device. So one virtio-X device corresponds to
one pmem device. No ACPI needs to be involved (also not on architectures
that have ACPI).

--
Thanks,

David / dhildenb
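One way to picture a virtio-X device that supplies both the memory region and the flush interface is the following hypothetical config-space and request layout. Nothing here is standardized by the thread; the struct names, the field layout, and the single-request protocol are all illustrative assumptions (virtio config fields are little-endian and naturally aligned, which the layout below respects).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical config space: the device advertises the pmem range,
 * which is the "discovery" half of the proposal. */
struct virtio_pmem_config {
    uint64_t start;  /* guest-physical start of the pmem range */
    uint64_t size;   /* length of the range in bytes */
};

/* Hypothetical request protocol: a single request type on one
 * virtqueue, which is the "flush interface" half of the proposal. */
#define VIRTIO_PMEM_REQ_FLUSH 0

struct virtio_pmem_req {
    uint32_t type;   /* VIRTIO_PMEM_REQ_FLUSH */
};

struct virtio_pmem_resp {
    uint32_t ret;    /* 0 on success, errno-style code otherwise */
};
```

With a layout like this, one device maps to exactly one pmem range, matching the "one virtio-X device corresponds to one pmem device" idea, and the host side can simply fsync() the backing file when it dequeues a flush request.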
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
On Thu, Jan 18, 2018 at 8:53 AM, David Hildenbrand wrote:
> On 24.11.2017 13:40, Pankaj Gupta wrote:
>>
>> Hello,
>>
>> Thank you all for all the useful suggestions.
>> I want to summarize the discussions so far in the
>> thread. Please see below:
>>
>>>>>> We can go with the "best" interface for what
>>>>>> could be a relatively slow flush (fsync on a
>>>>>> file on ssd/disk on the host), which requires
>>>>>> that the flushing task wait on completion
>>>>>> asynchronously.
>>>>>
>>>>> I'd like to clarify the interface of "wait on completion
>>>>> asynchronously" and KVM async page fault a bit more.
>>>>>
>>>>> The current design of async page fault only works on RAM rather
>>>>> than MMIO, i.e., if the page fault is caused by accessing the
>>>>> device memory of an emulated device, it needs to go to
>>>>> userspace (QEMU), which emulates the operation in the vCPU's
>>>>> thread.
>>>>>
>>>>> As I mentioned before, the memory region used for the vNVDIMM
>>>>> flush interface should be MMIO, and considering its support
>>>>> on other hypervisors, we had better push this async
>>>>> mechanism into the flush interface design itself rather
>>>>> than depend on KVM async page fault.
>>>>
>>>> I would expect this interface to be virtio-ring based, to queue flush
>>>> requests asynchronously to the host.
>>>
>>> Could we reuse the virtio-blk device, only with a different device id?
>>
>> As per previous discussions, there were suggestions on the two main parts
>> of the project:
>>
>> 1] Expose vNVDIMM memory range to KVM guest.
>>
>>    - Add a flag in the ACPI NFIT table for this new memory type. Do we
>>      need NVDIMM spec changes for this?
>>
>>    - The guest should be able to add this memory to the system memory map.
>>      The name of the added memory in '/proc/iomem' should be different
>>      (shared memory?) from persistent memory, as it does not satisfy the
>>      exact definition of persistent memory (it requires an explicit
>>      flush).
>>
>>    - The guest should not allow 'device-dax' and other fancy features
>>      which are not virtualization friendly.
>>
>> 2] Flushing interface to persist guest changes.
>>
>>    - As per the suggestion by ChristophH (CCed), we explored options other
>>      than virtio, like MMIO etc. Most of these options are not use-case
>>      friendly: we want to do an fsync on a file on ssd/disk on the host,
>>      and we cannot make guest vCPUs wait for that time.
>>
>>    - Though adding a new driver (virtio-pmem) looks like repeated work and
>>      is not needed, so we can go with the existing pmem driver and add a
>>      flush specific to this new memory type.
>
> I'd like to emphasize again that I would prefer a virtio-pmem-only
> solution.
>
> There are architectures out there (e.g. s390x) that don't support
> NVDIMMs - there is no HW interface to expose any such stuff.
>
> However, with virtio-pmem, we could make it work also on architectures
> not having ACPI and friends.

ACPI and virtio-only can share the same pmem driver. There are two
parts to this: region discovery and setting up the pmem driver. For
discovery you can either have an NFIT-bus-defined range, or a new
virtio-pmem bus define it. As far as the pmem driver itself goes, it's
agnostic to how the range is discovered.

In other words, pmem consumes 'regions' from libnvdimm, and a bus
provider like nfit, e820, or a new virtio mechanism produces 'regions'.
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
On 24.11.2017 13:40, Pankaj Gupta wrote:
>
> Hello,
>
> Thank you all for all the useful suggestions.
> I want to summarize the discussions so far in the
> thread. Please see below:
>
>>>>> We can go with the "best" interface for what
>>>>> could be a relatively slow flush (fsync on a
>>>>> file on ssd/disk on the host), which requires
>>>>> that the flushing task wait on completion
>>>>> asynchronously.
>>>>
>>>> I'd like to clarify the interface of "wait on completion
>>>> asynchronously" and KVM async page fault a bit more.
>>>>
>>>> The current design of async page fault only works on RAM rather
>>>> than MMIO, i.e., if the page fault is caused by accessing the
>>>> device memory of an emulated device, it needs to go to
>>>> userspace (QEMU), which emulates the operation in the vCPU's
>>>> thread.
>>>>
>>>> As I mentioned before, the memory region used for the vNVDIMM
>>>> flush interface should be MMIO, and considering its support
>>>> on other hypervisors, we had better push this async
>>>> mechanism into the flush interface design itself rather
>>>> than depend on KVM async page fault.
>>>
>>> I would expect this interface to be virtio-ring based, to queue flush
>>> requests asynchronously to the host.
>>
>> Could we reuse the virtio-blk device, only with a different device id?
>
> As per previous discussions, there were suggestions on the two main parts
> of the project:
>
> 1] Expose vNVDIMM memory range to KVM guest.
>
>    - Add a flag in the ACPI NFIT table for this new memory type. Do we
>      need NVDIMM spec changes for this?
>
>    - The guest should be able to add this memory to the system memory map.
>      The name of the added memory in '/proc/iomem' should be different
>      (shared memory?) from persistent memory, as it does not satisfy the
>      exact definition of persistent memory (it requires an explicit flush).
>
>    - The guest should not allow 'device-dax' and other fancy features
>      which are not virtualization friendly.
>
> 2] Flushing interface to persist guest changes.
>
>    - As per the suggestion by ChristophH (CCed), we explored options other
>      than virtio, like MMIO etc. Most of these options are not use-case
>      friendly: we want to do an fsync on a file on ssd/disk on the host,
>      and we cannot make guest vCPUs wait for that time.
>
>    - Though adding a new driver (virtio-pmem) looks like repeated work and
>      is not needed, so we can go with the existing pmem driver and add a
>      flush specific to this new memory type.

I'd like to emphasize again that I would prefer a virtio-pmem-only
solution.

There are architectures out there (e.g. s390x) that don't support
NVDIMMs - there is no HW interface to expose any such stuff.

However, with virtio-pmem, we could make it work also on architectures
not having ACPI and friends.

>    - The suggestion by Paolo & Stefan (previously) to use virtio-blk makes
>      sense if we just want a flush vehicle to send guest commands to the
>      host and get a reply after asynchronous execution. There was a
>      previous discussion [1] with Rik & Dan on this.
>
>    [1] https://lists.gnu.org/archive/html/qemu-devel/2017-07/msg08373.html
>
> Is my understanding correct here?
>
> Thanks,
> Pankaj

--
Thanks,

David / dhildenb
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
Hi Dan,

Thanks for your reply.

> On Fri, Jan 12, 2018 at 10:23 PM, Pankaj Gupta wrote:
>
>> Hello Dan,
>>
>>> Not a flag, but a new "Address Range Type GUID". See section "5.2.25.2
>>> System Physical Address (SPA) Range Structure" in the ACPI 6.2A
>>> specification. Since it is a GUID, we could define a Linux-specific
>>> type for this case, but spec changes would allow non-Linux hypervisors
>>> to advertise a standard interface to guests.
>>
>> I have added a new SPA with a GUID for this memory type, and I could add
>> this new memory type to the system memory map. I need help with the
>> namespace handling for this new type. As mentioned in the discussion [1]:
>>
>> - Create a new namespace for this new memory type
>> - Teach libnvdimm how to handle this new namespace
>>
>> I have some queries on this:
>>
>> 1] How would namespace handling of this new memory type work?
>
> This would be a namespace that creates a pmem device, but does not allow
> DAX.

o.k.

>> 2] There are existing namespace types:
>> ND_DEVICE_NAMESPACE_IO, ND_DEVICE_NAMESPACE_PMEM, ND_DEVICE_NAMESPACE_BLK
>>
>> How will libnvdimm handle this new namespace type in conjunction with the
>> existing memory types, regions & namespaces?
>
> The type will be either ND_DEVICE_NAMESPACE_IO or
> ND_DEVICE_NAMESPACE_PMEM depending on whether you configure KVM to
> provide a virtual NVDIMM and label space. In other words, the only
> difference between this range and a typical persistent memory range is
> that we will have a flag to disable DAX operation.

o.k. In short, we have to disable the 'QUEUE_FLAG_DAX' flag for this
namespace & region, and also not execute the code below for this new type?

pmem_attach_disk()
...
	dax_dev = alloc_dax(pmem, disk->disk_name, &pmem_dax_ops);
	if (!dax_dev) {
		put_disk(disk);
		return -ENOMEM;
	}
	dax_write_cache(dax_dev, wbc);
	pmem->dax_dev = dax_dev;

> See the usage of nvdimm_has_cache() in pmem_attach_disk() as an
> example of how to pass attributes about the "region" to the pmem
> driver.

sure.

>> 3] For sending guest-to-host flush commands, we still have to think about
>> some async way?
>
> I thought we discussed this being a paravirtualized virtio command ring?

o.k., will implement this.
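The attach-time decision being discussed (a region attribute telling the pmem driver to skip DAX setup) can be sketched as a toy model. The flag name and structures below are hypothetical, not the kernel's; the real mechanism would be a region attribute passed through libnvdimm, analogous to how nvdimm_has_cache() is consumed in pmem_attach_disk().

```c
#include <assert.h>

/* Hypothetical region flag: the range needs an explicit host-side
 * flush (the "fake DAX" case), so direct mapping must stay off. */
#define TOY_REGION_HOST_FLUSH  (1 << 0)

struct toy_disk {
    int dax_enabled;  /* models QUEUE_FLAG_DAX being set or not */
};

static void toy_pmem_attach_disk(struct toy_disk *d, unsigned region_flags)
{
    if (region_flags & TOY_REGION_HOST_FLUSH) {
        /* virtio-backed range: skip the alloc_dax() path entirely,
         * leave the DAX queue flag clear */
        d->dax_enabled = 0;
        return;
    }
    /* ordinary persistent memory range: enable DAX as usual */
    d->dax_enabled = 1;
}
```

The point is only that the branch lives in one place at attach time; everything after it (bio submission, flush handling) is shared between the two kinds of region.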
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
On Fri, Jan 12, 2018 at 10:23 PM, Pankaj Gupta wrote:
>
> Hello Dan,
>
>> Not a flag, but a new "Address Range Type GUID". See section "5.2.25.2
>> System Physical Address (SPA) Range Structure" in the ACPI 6.2A
>> specification. Since it is a GUID, we could define a Linux-specific
>> type for this case, but spec changes would allow non-Linux hypervisors
>> to advertise a standard interface to guests.
>
> I have added a new SPA with a GUID for this memory type, and I could add
> this new memory type to the system memory map. I need help with the
> namespace handling for this new type. As mentioned in the discussion [1]:
>
> - Create a new namespace for this new memory type
> - Teach libnvdimm how to handle this new namespace
>
> I have some queries on this:
>
> 1] How would namespace handling of this new memory type work?

This would be a namespace that creates a pmem device, but does not allow
DAX.

> 2] There are existing namespace types:
> ND_DEVICE_NAMESPACE_IO, ND_DEVICE_NAMESPACE_PMEM, ND_DEVICE_NAMESPACE_BLK
>
> How will libnvdimm handle this new namespace type in conjunction with the
> existing memory types, regions & namespaces?

The type will be either ND_DEVICE_NAMESPACE_IO or
ND_DEVICE_NAMESPACE_PMEM depending on whether you configure KVM to
provide a virtual NVDIMM and label space. In other words, the only
difference between this range and a typical persistent memory range is
that we will have a flag to disable DAX operation.

See the usage of nvdimm_has_cache() in pmem_attach_disk() as an
example of how to pass attributes about the "region" to the pmem
driver.

> 3] For sending guest-to-host flush commands, we still have to think about
> some async way?

I thought we discussed this being a paravirtualized virtio command ring?
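The paravirtualized command ring mentioned here can be modeled in a few lines. Below is a minimal single-threaded sketch of the asynchronous flow: the guest queues a flush request without blocking, the host later services it (a real host would fsync() the file backing the pmem range), and the guest reaps the completion. The ring layout and names are illustrative only, not a real virtqueue.

```c
#include <assert.h>
#include <string.h>

#define RING_SIZE 8

enum req_state { REQ_FREE, REQ_PENDING, REQ_DONE };

struct flush_ring {
    enum req_state slot[RING_SIZE];
    int head;  /* next slot the guest fills */
    int tail;  /* next slot the host services */
};

/* Guest side: queue a flush without blocking the vCPU. */
static int guest_queue_flush(struct flush_ring *r)
{
    int idx = r->head;
    if (r->slot[idx] != REQ_FREE)
        return -1;                  /* ring full */
    r->slot[idx] = REQ_PENDING;
    r->head = (idx + 1) % RING_SIZE;
    return idx;                     /* token the guest can wait on later */
}

/* Host side: service one request (a real host would fsync() the
 * backing file here) and post a completion. */
static int host_service_one(struct flush_ring *r)
{
    int idx = r->tail;
    if (r->slot[idx] != REQ_PENDING)
        return -1;                  /* nothing to do */
    r->slot[idx] = REQ_DONE;
    r->tail = (idx + 1) % RING_SIZE;
    return idx;
}

/* Guest side: reap a completion for a previously queued token. */
static int guest_flush_done(struct flush_ring *r, int token)
{
    if (r->slot[token] != REQ_DONE)
        return 0;
    r->slot[token] = REQ_FREE;      /* recycle the slot */
    return 1;
}
```

The key property the thread is after is visible in the model: queuing and completion are decoupled, so the flushing task can sleep on the token instead of spinning in the vCPU while the host does a slow fsync.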
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
Hello Dan,

> Not a flag, but a new "Address Range Type GUID". See section "5.2.25.2
> System Physical Address (SPA) Range Structure" in the ACPI 6.2A
> specification. Since it is a GUID, we could define a Linux-specific
> type for this case, but spec changes would allow non-Linux hypervisors
> to advertise a standard interface to guests.

I have added a new SPA with a GUID for this memory type, and I could add
this new memory type to the system memory map. I need help with the
namespace handling for this new type. As mentioned in the discussion [1]:

- Create a new namespace for this new memory type
- Teach libnvdimm how to handle this new namespace

I have some queries on this:

1] How would namespace handling of this new memory type work?

2] There are existing namespace types:
ND_DEVICE_NAMESPACE_IO, ND_DEVICE_NAMESPACE_PMEM, ND_DEVICE_NAMESPACE_BLK

How will libnvdimm handle this new namespace type in conjunction with the
existing memory types, regions & namespaces?

3] For sending guest-to-host flush commands, we still have to think about
some async way?

[1] https://lists.gnu.org/archive/html/qemu-devel/2017-07/msg08404.html

Thanks,
Pankaj
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
On Fri, Nov 24, 2017 at 4:40 AM, Pankaj Gupta wrote:
[..]
> 1] Expose vNVDIMM memory range to KVM guest.
>
>    - Add a flag in the ACPI NFIT table for this new memory type. Do we
>      need NVDIMM spec changes for this?

Not a flag, but a new "Address Range Type GUID". See section "5.2.25.2
System Physical Address (SPA) Range Structure" in the ACPI 6.2A
specification. Since it is a GUID, we could define a Linux-specific
type for this case, but spec changes would allow non-Linux hypervisors
to advertise a standard interface to guests.
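For reference, the SPA Range Structure identifies the region type by a 16-byte GUID. A small sketch of turning the raw bytes into the familiar string form: in the usual GUID encoding the first three fields are stored little-endian and the last eight bytes are stored as-is. The example GUID used in the test is made up for illustration, not a value from the spec.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Format 16 raw GUID bytes (as they would appear in the SPA Range
 * Structure) into the canonical 36-character string form. */
static void guid_to_str(const uint8_t g[16], char out[37])
{
    snprintf(out, 37,
             "%02X%02X%02X%02X-%02X%02X-%02X%02X-%02X%02X-"
             "%02X%02X%02X%02X%02X%02X",
             g[3], g[2], g[1], g[0],   /* 32-bit field, little-endian */
             g[5], g[4],               /* 16-bit field, little-endian */
             g[7], g[6],               /* 16-bit field, little-endian */
             g[8], g[9],               /* clock-seq bytes, as stored */
             g[10], g[11], g[12], g[13], g[14], g[15]); /* node bytes */
}
```

A new Linux-specific region type would amount to picking a fresh GUID and matching on these 16 bytes when parsing the NFIT.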
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
On 24/11/2017 14:02, Pankaj Gupta wrote: > >>>- Suggestion by Paolo & Stefan(previously) to use virtio-blk makes sense >>>if just >>> want a flush vehicle to send guest commands to host and get reply >>> after asynchronous >>> execution. There was previous discussion [1] with Rik & Dan on this. >>> >>> [1] https://lists.gnu.org/archive/html/qemu-devel/2017-07/msg08373.html >> >> ... in fact, the virtio-blk device _could_ actually accept regular I/O >> too. That would make it easier to boot from pmem. Is there anything >> similar in regular hardware? > > there is existing block device associated(hard bind) with the pmem range. > Also, comment by Christoph [1], about removing block device with DAX support. > Still I am not clear about this. Am I missing anything here? The I/O part of the blk device would only be used by the firmware. In Linux, the different device id would bind the device to a different driver that would only be used for flushing. But maybe this idea makes no sense. :) Paolo
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
> >- Suggestion by Paolo & Stefan(previously) to use virtio-blk makes sense > >if just > > want a flush vehicle to send guest commands to host and get reply > > after asynchronous > > execution. There was previous discussion [1] with Rik & Dan on this. > > > > [1] https://lists.gnu.org/archive/html/qemu-devel/2017-07/msg08373.html > > ... in fact, the virtio-blk device _could_ actually accept regular I/O > too. That would make it easier to boot from pmem. Is there anything > similar in regular hardware? There is an existing block device associated (hard bound) with the pmem range. Also, there is a comment by Christoph [1] about removing the block device with DAX support. I am still not clear about this. Am I missing anything here? [1] https://marc.info/?l=kvm&m=150822740332536&w=2 Pankaj
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
On 24/11/2017 13:40, Pankaj Gupta wrote: >- Suggestion by Paolo & Stefan(previously) to use virtio-blk makes sense > if just > want a flush vehicle to send guest commands to host and get reply after > asynchronous > execution. There was previous discussion [1] with Rik & Dan on this. > > [1] https://lists.gnu.org/archive/html/qemu-devel/2017-07/msg08373.html ... in fact, the virtio-blk device _could_ actually accept regular I/O too. That would make it easier to boot from pmem. Is there anything similar in regular hardware? Paolo
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
Hello, Thank you all for all the useful suggestions. I want to summarize the discussions so far in the thread. Please see below: > >> > >>> We can go with the "best" interface for what > >>> could be a relatively slow flush (fsync on a > >>> file on ssd/disk on the host), which requires > >>> that the flushing task wait on completion > >>> asynchronously. > >> > >> > >> I'd like to clarify the interface of "wait on completion > >> asynchronously" and KVM async page fault a bit more. > >> > >> Current design of async-page-fault only works on RAM rather > >> than MMIO, i.e, if the page fault caused by accessing the > >> device memory of a emulated device, it needs to go to > >> userspace (QEMU) which emulates the operation in vCPU's > >> thread. > >> > >> As i mentioned before the memory region used for vNVDIMM > >> flush interface should be MMIO and consider its support > >> on other hypervisors, so we do better push this async > >> mechanism into the flush interface design itself rather > >> than depends on kvm async-page-fault. > > > > I would expect this interface to be virtio-ring based to queue flush > > requests asynchronously to the host. > > Could we reuse the virtio-blk device, only with a different device id? As per previous discussions, there were suggestions on the two main parts of the project: 1] Expose vNVDIMM memory range to KVM guest. - Add a flag in the ACPI NFIT table for this new memory type. Do we need NVDIMM spec changes for this? - Guest should be able to add this memory to the system memory map. The name of the added memory in '/proc/iomem' should be different (shared memory?) from persistent memory, as it does not satisfy the exact definition of persistent memory (it requires an explicit flush). - Guest should not allow 'device-dax' and other fancy features which are not virtualization friendly. 2] Flushing interface to persist guest changes. - As per the suggestion by ChristophH (CCed), we explored options other than virtio, like MMIO, etc. 
Looks like most of these options are not use-case friendly, as we want to do fsync on a file on an ssd/disk on the host and cannot make guest vCPUs wait for that time. - Though adding a new driver (virtio-pmem) looks like repeated work and is not needed, so we can go with the existing pmem driver and add a flush specific to this new memory type. - The suggestion by Paolo & Stefan (previously) to use virtio-blk makes sense if we just want a flush vehicle to send guest commands to the host and get a reply after asynchronous execution. There was a previous discussion [1] with Rik & Dan on this. [1] https://lists.gnu.org/archive/html/qemu-devel/2017-07/msg08373.html Is my understanding correct here? Thanks, Pankaj
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
On 23/11/2017 17:14, Dan Williams wrote: > On Wed, Nov 22, 2017 at 8:05 PM, Xiao Guangrong > wrote: >> >> >> On 11/22/2017 02:19 AM, Rik van Riel wrote: >> >>> We can go with the "best" interface for what >>> could be a relatively slow flush (fsync on a >>> file on ssd/disk on the host), which requires >>> that the flushing task wait on completion >>> asynchronously. >> >> >> I'd like to clarify the interface of "wait on completion >> asynchronously" and KVM async page fault a bit more. >> >> Current design of async-page-fault only works on RAM rather >> than MMIO, i.e, if the page fault caused by accessing the >> device memory of a emulated device, it needs to go to >> userspace (QEMU) which emulates the operation in vCPU's >> thread. >> >> As i mentioned before the memory region used for vNVDIMM >> flush interface should be MMIO and consider its support >> on other hypervisors, so we do better push this async >> mechanism into the flush interface design itself rather >> than depends on kvm async-page-fault. > > I would expect this interface to be virtio-ring based to queue flush > requests asynchronously to the host. Could we reuse the virtio-blk device, only with a different device id? Thanks, Paolo
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
On Wed, Nov 22, 2017 at 8:05 PM, Xiao Guangrong wrote: > > > On 11/22/2017 02:19 AM, Rik van Riel wrote: > >> We can go with the "best" interface for what >> could be a relatively slow flush (fsync on a >> file on ssd/disk on the host), which requires >> that the flushing task wait on completion >> asynchronously. > > > I'd like to clarify the interface of "wait on completion > asynchronously" and KVM async page fault a bit more. > > Current design of async-page-fault only works on RAM rather > than MMIO, i.e, if the page fault caused by accessing the > device memory of a emulated device, it needs to go to > userspace (QEMU) which emulates the operation in vCPU's > thread. > > As i mentioned before the memory region used for vNVDIMM > flush interface should be MMIO and consider its support > on other hypervisors, so we do better push this async > mechanism into the flush interface design itself rather > than depends on kvm async-page-fault. I would expect this interface to be virtio-ring based to queue flush requests asynchronously to the host.
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
On 11/22/2017 02:19 AM, Rik van Riel wrote: We can go with the "best" interface for what could be a relatively slow flush (fsync on a file on ssd/disk on the host), which requires that the flushing task wait on completion asynchronously. I'd like to clarify the interface of "wait on completion asynchronously" and KVM async page fault a bit more. The current design of async-page-fault only works on RAM rather than MMIO, i.e., if the page fault is caused by accessing the device memory of an emulated device, it needs to go to userspace (QEMU) which emulates the operation in the vCPU's thread. As I mentioned before, the memory region used for the vNVDIMM flush interface should be MMIO, and considering its support on other hypervisors, we had better push this async mechanism into the flush interface design itself rather than depend on KVM async-page-fault.
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
On Tue, 2017-11-21 at 10:26 -0800, Dan Williams wrote: > On Tue, Nov 21, 2017 at 10:19 AM, Rik van Riel > wrote: > > On Fri, 2017-11-03 at 14:21 +0800, Xiao Guangrong wrote: > > > On 11/03/2017 12:30 AM, Dan Williams wrote: > > > > > > > > Good point, I was assuming that the mmio flush interface would > > > > be > > > > discovered separately from the NFIT-defined memory range. > > > > Perhaps > > > > via > > > > PCI in the guest? This piece of the proposal needs a bit more > > > > thought... > > > > > > > > > > Consider the case that the vNVDIMM device on normal storage and > > > vNVDIMM device on real nvdimm hardware can both exist in VM, the > > > flush interface should be able to associate with the SPA region > > > respectively. That's why I'd like to integrate the flush > > > interface > > > into NFIT/ACPI by using a separate table. Is it possible to be a > > > part of ACPI specification? :) > > > > It would also be perfectly fine to have the > > virtio PCI device indicate which vNVDIMM > > range it flushes. > > > > Since the guest OS needs to support that kind > > of device anyway, does it really matter which > > direction the device association points? > > > > We can go with the "best" interface for what > > could be a relatively slow flush (fsync on a > > file on ssd/disk on the host), which requires > > that the flushing task wait on completion > > asynchronously. > > > > If that kind of interface cannot be advertised > > through NFIT/ACPI, wouldn't it be perfectly fine > > to have only the virtio PCI device indicate which > > vNVDIMM range it flushes? > > > > Yes, we could do this with a custom PCI device, however the NFIT is > frustratingly close to being able to define something like this. At > the very least we can start with a "SPA Range GUID" that is Linux > specific to indicate "call this virtio flush interface on FUA / flush > cache requests" as a stop gap until a standardized flush interface > can > be defined. 
Ahh, is that a "look for a device with this GUID" NFIT hint? That would be enough to tip off OSes that do not support that device that they found a vNVDIMM device that they cannot safely flush, which could help them report such errors to userspace... -- All rights reversed
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
On Tue, Nov 21, 2017 at 10:19 AM, Rik van Riel wrote: > On Fri, 2017-11-03 at 14:21 +0800, Xiao Guangrong wrote: >> On 11/03/2017 12:30 AM, Dan Williams wrote: >> > >> > Good point, I was assuming that the mmio flush interface would be >> > discovered separately from the NFIT-defined memory range. Perhaps >> > via >> > PCI in the guest? This piece of the proposal needs a bit more >> > thought... >> > >> >> Consider the case that the vNVDIMM device on normal storage and >> vNVDIMM device on real nvdimm hardware can both exist in VM, the >> flush interface should be able to associate with the SPA region >> respectively. That's why I'd like to integrate the flush interface >> into NFIT/ACPI by using a separate table. Is it possible to be a >> part of ACPI specification? :) > > It would also be perfectly fine to have the > virtio PCI device indicate which vNVDIMM > range it flushes. > > Since the guest OS needs to support that kind > of device anyway, does it really matter which > direction the device association points? > > We can go with the "best" interface for what > could be a relatively slow flush (fsync on a > file on ssd/disk on the host), which requires > that the flushing task wait on completion > asynchronously. > > If that kind of interface cannot be advertised > through NFIT/ACPI, wouldn't it be perfectly fine > to have only the virtio PCI device indicate which > vNVDIMM range it flushes? > Yes, we could do this with a custom PCI device, however the NFIT is frustratingly close to being able to define something like this. At the very least we can start with a "SPA Range GUID" that is Linux specific to indicate "call this virtio flush interface on FUA / flush cache requests" as a stop gap until a standardized flush interface can be defined.
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
On Fri, 2017-11-03 at 14:21 +0800, Xiao Guangrong wrote: > On 11/03/2017 12:30 AM, Dan Williams wrote: > > > > Good point, I was assuming that the mmio flush interface would be > > discovered separately from the NFIT-defined memory range. Perhaps > > via > > PCI in the guest? This piece of the proposal needs a bit more > > thought... > > > > Consider the case that the vNVDIMM device on normal storage and > vNVDIMM device on real nvdimm hardware can both exist in VM, the > flush interface should be able to associate with the SPA region > respectively. That's why I'd like to integrate the flush interface > into NFIT/ACPI by using a separate table. Is it possible to be a > part of ACPI specification? :) It would also be perfectly fine to have the virtio PCI device indicate which vNVDIMM range it flushes. Since the guest OS needs to support that kind of device anyway, does it really matter which direction the device association points? We can go with the "best" interface for what could be a relatively slow flush (fsync on a file on ssd/disk on the host), which requires that the flushing task wait on completion asynchronously. If that kind of interface cannot be advertised through NFIT/ACPI, wouldn't it be perfectly fine to have only the virtio PCI device indicate which vNVDIMM range it flushes? -- All rights reversed
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
> > > > > >> [..] > >> >> Yes, the GUID will specifically identify this range as "Virtio Shared > >> >> Memory" (or whatever name survives after a bikeshed debate). The > >> >> libnvdimm core then needs to grow a new region type that mostly > >> >> behaves the same as a "pmem" region, but drivers/nvdimm/pmem.c grows a > >> >> new flush interface to perform the host communication. Device-dax > >> >> would be disallowed from attaching to this region type, or we could > >> >> grow a new device-dax type that does not allow the raw device to be > >> >> mapped, but allows a filesystem mounted on top to manage the flush > >> >> interface. > >> > > >> > > >> > I am afraid it is not a good idea that a single SPA is used for multiple > >> > purposes. For the region used as "pmem" is directly mapped to the VM so > >> > that guest can freely access it without host's assistance, however, for > >> > the region used as "host communication" is not mapped to VM, so that > >> > it causes VM-exit and host gets the chance to do specific operations, > >> > e.g, flush cache. So we'd better distinctly define these two regions to > >> > avoid the unnecessary complexity in hypervisor. > >> > >> Good point, I was assuming that the mmio flush interface would be > >> discovered separately from the NFIT-defined memory range. Perhaps via > >> PCI in the guest? This piece of the proposal needs a bit more > >> thought... > > > > Also, in earlier discussions we agreed for entire device flush whenever > > guest > > performs a fsync on DAX file. If we do a MMIO call for this, guest CPU > > would be > > trapped for the duration device flush is completed. > > > > Instead, if we do perform an asynchronous flush guest CPU's can be utilized > > by > > some other tasks till flush completes? > > Yes, the interface for the guest to trigger and wait for flush > requests should be asynchronous, just like a storage "flush-cache" > command. 
One idea I got while discussing this with Rik & Amit during KVM Forum is to use something similar to the Hyper-V key-value pair mechanism for sharing commands between guest <=> host. I don't think such a thing exists yet for KVM? Or how can we utilize existing features in KVM to achieve this?
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
On Sun, Nov 5, 2017 at 11:57 PM, Pankaj Gupta wrote: > > >> [..] >> >> Yes, the GUID will specifically identify this range as "Virtio Shared >> >> Memory" (or whatever name survives after a bikeshed debate). The >> >> libnvdimm core then needs to grow a new region type that mostly >> >> behaves the same as a "pmem" region, but drivers/nvdimm/pmem.c grows a >> >> new flush interface to perform the host communication. Device-dax >> >> would be disallowed from attaching to this region type, or we could >> >> grow a new device-dax type that does not allow the raw device to be >> >> mapped, but allows a filesystem mounted on top to manage the flush >> >> interface. >> > >> > >> > I am afraid it is not a good idea that a single SPA is used for multiple >> > purposes. For the region used as "pmem" is directly mapped to the VM so >> > that guest can freely access it without host's assistance, however, for >> > the region used as "host communication" is not mapped to VM, so that >> > it causes VM-exit and host gets the chance to do specific operations, >> > e.g, flush cache. So we'd better distinctly define these two regions to >> > avoid the unnecessary complexity in hypervisor. >> >> Good point, I was assuming that the mmio flush interface would be >> discovered separately from the NFIT-defined memory range. Perhaps via >> PCI in the guest? This piece of the proposal needs a bit more >> thought... > > Also, in earlier discussions we agreed for entire device flush whenever guest > performs a fsync on DAX file. If we do a MMIO call for this, guest CPU would > be > trapped for the duration device flush is completed. > > Instead, if we do perform an asynchronous flush guest CPU's can be utilized by > some other tasks till flush completes? Yes, the interface for the guest to trigger and wait for flush requests should be asynchronous, just like a storage "flush-cache" command.
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
> [..] > >> Yes, the GUID will specifically identify this range as "Virtio Shared > >> Memory" (or whatever name survives after a bikeshed debate). The > >> libnvdimm core then needs to grow a new region type that mostly > >> behaves the same as a "pmem" region, but drivers/nvdimm/pmem.c grows a > >> new flush interface to perform the host communication. Device-dax > >> would be disallowed from attaching to this region type, or we could > >> grow a new device-dax type that does not allow the raw device to be > >> mapped, but allows a filesystem mounted on top to manage the flush > >> interface. > > > > > > I am afraid it is not a good idea that a single SPA is used for multiple > > purposes. For the region used as "pmem" is directly mapped to the VM so > > that guest can freely access it without host's assistance, however, for > > the region used as "host communication" is not mapped to VM, so that > > it causes VM-exit and host gets the chance to do specific operations, > > e.g, flush cache. So we'd better distinctly define these two regions to > > avoid the unnecessary complexity in hypervisor. > > Good point, I was assuming that the mmio flush interface would be > discovered separately from the NFIT-defined memory range. Perhaps via > PCI in the guest? This piece of the proposal needs a bit more > thought... Also, in earlier discussions we agreed on an entire-device flush whenever the guest performs an fsync on a DAX file. If we do an MMIO call for this, the guest CPU would be trapped until the device flush is completed. Instead, if we perform an asynchronous flush, guest CPUs can be utilized by other tasks till the flush completes? Thanks, Pankaj
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
On 11/03/2017 12:30 AM, Dan Williams wrote: On Thu, Nov 2, 2017 at 1:50 AM, Xiao Guangrong wrote: [..] Yes, the GUID will specifically identify this range as "Virtio Shared Memory" (or whatever name survives after a bikeshed debate). The libnvdimm core then needs to grow a new region type that mostly behaves the same as a "pmem" region, but drivers/nvdimm/pmem.c grows a new flush interface to perform the host communication. Device-dax would be disallowed from attaching to this region type, or we could grow a new device-dax type that does not allow the raw device to be mapped, but allows a filesystem mounted on top to manage the flush interface. I am afraid it is not a good idea that a single SPA is used for multiple purposes. For the region used as "pmem" is directly mapped to the VM so that guest can freely access it without host's assistance, however, for the region used as "host communication" is not mapped to VM, so that it causes VM-exit and host gets the chance to do specific operations, e.g, flush cache. So we'd better distinctly define these two regions to avoid the unnecessary complexity in hypervisor. Good point, I was assuming that the mmio flush interface would be discovered separately from the NFIT-defined memory range. Perhaps via PCI in the guest? This piece of the proposal needs a bit more thought... Consider the case that the vNVDIMM device on normal storage and vNVDIMM device on real nvdimm hardware can both exist in VM, the flush interface should be able to associate with the SPA region respectively. That's why I'd like to integrate the flush interface into NFIT/ACPI by using a separate table. Is it possible to be a part of ACPI specification? :)
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
On Thu, Nov 2, 2017 at 1:50 AM, Xiao Guangrong wrote: [..] >> Yes, the GUID will specifically identify this range as "Virtio Shared >> Memory" (or whatever name survives after a bikeshed debate). The >> libnvdimm core then needs to grow a new region type that mostly >> behaves the same as a "pmem" region, but drivers/nvdimm/pmem.c grows a >> new flush interface to perform the host communication. Device-dax >> would be disallowed from attaching to this region type, or we could >> grow a new device-dax type that does not allow the raw device to be >> mapped, but allows a filesystem mounted on top to manage the flush >> interface. > > > I am afraid it is not a good idea that a single SPA is used for multiple > purposes. For the region used as "pmem" is directly mapped to the VM so > that guest can freely access it without host's assistance, however, for > the region used as "host communication" is not mapped to VM, so that > it causes VM-exit and host gets the chance to do specific operations, > e.g, flush cache. So we'd better distinctly define these two regions to > avoid the unnecessary complexity in hypervisor. Good point, I was assuming that the mmio flush interface would be discovered separately from the NFIT-defined memory range. Perhaps via PCI in the guest? This piece of the proposal needs a bit more thought...
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
On 11/01/2017 11:20 PM, Dan Williams wrote: On 11/01/2017 12:25 PM, Dan Williams wrote: [..] It's not persistent memory if it requires a hypercall to make it persistent. Unless memory writes can be made durable purely with cpu instructions it's dangerous for it to be treated as a PMEM range. Consider a guest that tried to map it with device-dax which has no facility to route requests to a special flushing interface. Can we separate the concept of flush interface from persistent memory? Say there are two APIs, one is used to indicate the memory type (i.e, /proc/iomem) and another one indicates the flush interface. So for existing nvdimm hardwares: 1: Persist-memory + CLFLUSH 2: Persiste-memory + flush-hint-table (I know Intel does not use it) and for the virtual nvdimm which backended on normal storage: Persist-memory + virtual flush interface I see the flush interface as fundamental to identifying the media properties. It's not byte-addressable persistent memory if the application needs to call a sideband interface to manage writes. This is why we have pushed for something like the MAP_SYNC interface to make filesystem-dax actually behave in a way that applications can safely treat it as persistent memory, and this is also the guarantee that device-dax provides. Changing the flush interface makes it distinct and unusable for applications that want to manage data persistence in userspace. I was thinking that from the device's perspective, both of them are not persistent until a flush operation is issued (clflush or virtual flush-interface). But you are right, from the user/software's perspective, their fundamentals are different. So for the virtual nvdimm which is backended on normal storage, we should refuse MAP_SYNC and the only way to guarantee persistence is fsync/fdatasync. Actually, we can treat a SPA region which associates with specific flush interface as special GUID as your proposal, please see more in below comment... 
In what way is this "more complicated"? It was trivial to add support for the "volatile" NFIT range, this will not be any more complicated than that. Introducing memory type is easy indeed, however, a new flush interface definition is inevitable, i.e, we need a standard way to discover the MMIOs to communicate with host. Right, the proposed way to do that for x86 platforms is a new SPA Range GUID type. in the NFIT. So this SPA is used for both persistent memory region and flush interface? Maybe i missed it in previous mails, could you please detail how to do it? Yes, the GUID will specifically identify this range as "Virtio Shared Memory" (or whatever name survives after a bikeshed debate). The libnvdimm core then needs to grow a new region type that mostly behaves the same as a "pmem" region, but drivers/nvdimm/pmem.c grows a new flush interface to perform the host communication. Device-dax would be disallowed from attaching to this region type, or we could grow a new device-dax type that does not allow the raw device to be mapped, but allows a filesystem mounted on top to manage the flush interface. I am afraid it is not a good idea that a single SPA is used for multiple purposes. For the region used as "pmem" is directly mapped to the VM so that guest can freely access it without host's assistance, however, for the region used as "host communication" is not mapped to VM, so that it causes VM-exit and host gets the chance to do specific operations, e.g, flush cache. So we'd better distinctly define these two regions to avoid the unnecessary complexity in hypervisor.
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
> On 11/01/2017 12:25 PM, Dan Williams wrote: [..] >> It's not persistent memory if it requires a hypercall to make it >> persistent. Unless memory writes can be made durable purely with cpu >> instructions it's dangerous for it to be treated as a PMEM range. >> Consider a guest that tried to map it with device-dax which has no >> facility to route requests to a special flushing interface. >> > > Can we separate the concept of flush interface from persistent memory? > Say there are two APIs, one is used to indicate the memory type (i.e, > /proc/iomem) and another one indicates the flush interface. > > So for existing nvdimm hardwares: > 1: Persist-memory + CLFLUSH > 2: Persiste-memory + flush-hint-table (I know Intel does not use it) > > and for the virtual nvdimm which backended on normal storage: > Persist-memory + virtual flush interface I see the flush interface as fundamental to identifying the media properties. It's not byte-addressable persistent memory if the application needs to call a sideband interface to manage writes. This is why we have pushed for something like the MAP_SYNC interface to make filesystem-dax actually behave in a way that applications can safely treat it as persistent memory, and this is also the guarantee that device-dax provides. Changing the flush interface makes it distinct and unusable for applications that want to manage data persistence in userspace. >>> In what way is this "more complicated"? It was trivial to add support for the "volatile" NFIT range, this will not be any more complicated than that. >>> >>> Introducing memory type is easy indeed, however, a new flush interface >>> definition is inevitable, i.e, we need a standard way to discover the >>> MMIOs to communicate with host. >> >> >> Right, the proposed way to do that for x86 platforms is a new SPA >> Range GUID type. in the NFIT. >> > > So this SPA is used for both persistent memory region and flush interface? 
> Maybe i missed it in previous mails, could you please detail how to do > it? Yes, the GUID will specifically identify this range as "Virtio Shared Memory" (or whatever name survives after a bikeshed debate). The libnvdimm core then needs to grow a new region type that mostly behaves the same as a "pmem" region, but drivers/nvdimm/pmem.c grows a new flush interface to perform the host communication. Device-dax would be disallowed from attaching to this region type, or we could grow a new device-dax type that does not allow the raw device to be mapped, but allows a filesystem mounted on top to manage the flush interface. > BTW, please note hypercall is not acceptable for standard, MMIO/PIO regions > are. (Oh, yes, it depends on Paolo. :)) MMIO/PIO regions works for me, that's not the part of the proposal I'm concerned about.
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
On 11/01/2017 12:25 PM, Dan Williams wrote: On Tue, Oct 31, 2017 at 8:43 PM, Xiao Guangrong wrote: On 10/31/2017 10:20 PM, Dan Williams wrote: On Tue, Oct 31, 2017 at 12:13 AM, Xiao Guangrong wrote: On 07/27/2017 08:54 AM, Dan Williams wrote: At that point, would it make sense to expose these special virtio-pmem areas to the guest in a slightly different way, so the regions that need virtio flushing are not bound by the regular driver, and the regular driver can continue to work for memory regions that are backed by actual pmem in the host? Hmm, yes that could be feasible especially if it uses the ACPI NFIT mechanism. It would basically involve defining a new SPA (System Phyiscal Address) range GUID type, and then teaching libnvdimm to treat that as a new pmem device type. I would prefer a new flush mechanism to a new memory type introduced to NFIT, e.g, in that mechanism we can define request queues and completion queues and any other features to make virtualization friendly. That would be much simpler. No that's more confusing because now we are overloading the definition of persistent memory. I want this memory type identified from the top of the stack so it can appear differently in /proc/iomem and also implement this alternate flush communication. For the characteristic of memory, I have no idea why VM should know this difference. It can be completely transparent to VM, that means, VM does not need to know where this virtual PMEM comes from (for a really nvdimm backend or a normal storage). The only discrepancy is the flush interface. It's not persistent memory if it requires a hypercall to make it persistent. Unless memory writes can be made durable purely with cpu instructions it's dangerous for it to be treated as a PMEM range. Consider a guest that tried to map it with device-dax which has no facility to route requests to a special flushing interface. Can we separate the concept of flush interface from persistent memory? 
Say there are two APIs, one used to indicate the memory type (i.e., /proc/iomem) and another one indicating the flush interface. So for existing nvdimm hardware: 1: Persistent-memory + CLFLUSH 2: Persistent-memory + flush-hint-table (I know Intel does not use it) and for the virtual nvdimm which is backed by normal storage: Persistent-memory + virtual flush interface In what way is this "more complicated"? It was trivial to add support for the "volatile" NFIT range, this will not be any more complicated than that. Introducing a memory type is easy indeed; however, a new flush interface definition is inevitable, i.e., we need a standard way to discover the MMIOs to communicate with the host. Right, the proposed way to do that for x86 platforms is a new SPA Range GUID type in the NFIT. So this SPA is used for both the persistent memory region and the flush interface? Maybe I missed it in previous mails, could you please detail how to do it? BTW, please note a hypercall is not acceptable for a standard; MMIO/PIO regions are. (Oh, yes, it depends on Paolo. :))
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
On Tue, Oct 31, 2017 at 8:43 PM, Xiao Guangrong wrote: > > > On 10/31/2017 10:20 PM, Dan Williams wrote: >> >> On Tue, Oct 31, 2017 at 12:13 AM, Xiao Guangrong >> wrote: >>> >>> >>> >>> On 07/27/2017 08:54 AM, Dan Williams wrote: >>> > At that point, would it make sense to expose these special > virtio-pmem areas to the guest in a slightly different way, > so the regions that need virtio flushing are not bound by > the regular driver, and the regular driver can continue to > work for memory regions that are backed by actual pmem in > the host? Hmm, yes that could be feasible especially if it uses the ACPI NFIT mechanism. It would basically involve defining a new SPA (System Phyiscal Address) range GUID type, and then teaching libnvdimm to treat that as a new pmem device type. >>> >>> >>> >>> I would prefer a new flush mechanism to a new memory type introduced >>> to NFIT, e.g, in that mechanism we can define request queues and >>> completion queues and any other features to make virtualization >>> friendly. That would be much simpler. >>> >> >> No that's more confusing because now we are overloading the definition >> of persistent memory. I want this memory type identified from the top >> of the stack so it can appear differently in /proc/iomem and also >> implement this alternate flush communication. >> > > For the characteristic of memory, I have no idea why VM should know this > difference. It can be completely transparent to VM, that means, VM > does not need to know where this virtual PMEM comes from (for a really > nvdimm backend or a normal storage). The only discrepancy is the flush > interface. It's not persistent memory if it requires a hypercall to make it persistent. Unless memory writes can be made durable purely with cpu instructions it's dangerous for it to be treated as a PMEM range. Consider a guest that tried to map it with device-dax which has no facility to route requests to a special flushing interface. 
> >> In what way is this "more complicated"? It was trivial to add support >> for the "volatile" NFIT range, this will not be any more complicated >> than that. >> > > Introducing memory type is easy indeed, however, a new flush interface > definition is inevitable, i.e, we need a standard way to discover the > MMIOs to communicate with host. Right, the proposed way to do that for x86 platforms is a new SPA Range GUID type. in the NFIT.
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
On 10/31/2017 10:20 PM, Dan Williams wrote: On Tue, Oct 31, 2017 at 12:13 AM, Xiao Guangrong wrote: On 07/27/2017 08:54 AM, Dan Williams wrote: At that point, would it make sense to expose these special virtio-pmem areas to the guest in a slightly different way, so the regions that need virtio flushing are not bound by the regular driver, and the regular driver can continue to work for memory regions that are backed by actual pmem in the host? Hmm, yes that could be feasible especially if it uses the ACPI NFIT mechanism. It would basically involve defining a new SPA (System Phyiscal Address) range GUID type, and then teaching libnvdimm to treat that as a new pmem device type. I would prefer a new flush mechanism to a new memory type introduced to NFIT, e.g, in that mechanism we can define request queues and completion queues and any other features to make virtualization friendly. That would be much simpler. No that's more confusing because now we are overloading the definition of persistent memory. I want this memory type identified from the top of the stack so it can appear differently in /proc/iomem and also implement this alternate flush communication. For the characteristic of memory, I have no idea why VM should know this difference. It can be completely transparent to VM, that means, VM does not need to know where this virtual PMEM comes from (for a really nvdimm backend or a normal storage). The only discrepancy is the flush interface. In what way is this "more complicated"? It was trivial to add support for the "volatile" NFIT range, this will not be any more complicated than that. Introducing memory type is easy indeed, however, a new flush interface definition is inevitable, i.e, we need a standard way to discover the MMIOs to communicate with host.
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
On Tue, Oct 31, 2017 at 12:13 AM, Xiao Guangrong wrote: > > > On 07/27/2017 08:54 AM, Dan Williams wrote: > >>> At that point, would it make sense to expose these special >>> virtio-pmem areas to the guest in a slightly different way, >>> so the regions that need virtio flushing are not bound by >>> the regular driver, and the regular driver can continue to >>> work for memory regions that are backed by actual pmem in >>> the host? >> >> >> Hmm, yes that could be feasible especially if it uses the ACPI NFIT >> mechanism. It would basically involve defining a new SPA (System >> Physical Address) range GUID type, and then teaching libnvdimm to >> treat that as a new pmem device type. > > > I would prefer a new flush mechanism to a new memory type introduced > to NFIT, e.g., in that mechanism we can define request queues and > completion queues and any other features to make virtualization > friendly. That would be much simpler. > No that's more confusing because now we are overloading the definition of persistent memory. I want this memory type identified from the top of the stack so it can appear differently in /proc/iomem and also implement this alternate flush communication. In what way is this "more complicated"? It was trivial to add support for the "volatile" NFIT range, this will not be any more complicated than that.
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
On 07/27/2017 08:54 AM, Dan Williams wrote: At that point, would it make sense to expose these special virtio-pmem areas to the guest in a slightly different way, so the regions that need virtio flushing are not bound by the regular driver, and the regular driver can continue to work for memory regions that are backed by actual pmem in the host? Hmm, yes that could be feasible especially if it uses the ACPI NFIT mechanism. It would basically involve defining a new SPA (System Physical Address) range GUID type, and then teaching libnvdimm to treat that as a new pmem device type. I would prefer a new flush mechanism to a new memory type introduced to NFIT, e.g., in that mechanism we can define request queues and completion queues and any other features to make virtualization friendly. That would be much simpler.
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
On Wed, Jul 26, 2017 at 4:46 PM, Rik van Riel wrote: > On Wed, 2017-07-26 at 14:40 -0700, Dan Williams wrote: >> On Wed, Jul 26, 2017 at 2:27 PM, Rik van Riel >> wrote: >> > On Wed, 2017-07-26 at 09:47 -0400, Pankaj Gupta wrote: >> > > > >> > > >> > > Just want to summarize here (high level): >> > > >> > > This will require implementing a new 'virtio-pmem' device which >> > > presents >> > > a DAX address range (like pmem) to guest with read/write (direct >> > > access) >> > > & device flush functionality. Also, qemu should implement >> > > corresponding >> > > support for flush using virtio. >> > > >> > >> > Alternatively, the existing pmem code, with >> > a flush-only block device on the side, which >> > is somehow associated with the pmem device. >> > >> > I wonder which alternative leads to the least >> > code duplication, and the least maintenance >> > hassle going forward. >> >> I'd much prefer to have another driver. I.e. a driver that refactors >> out some common pmem details into a shared object and can attach to >> ND_DEVICE_NAMESPACE_{IO,PMEM}. A control device on the side seems >> like >> a recipe for confusion. > > At that point, would it make sense to expose these special > virtio-pmem areas to the guest in a slightly different way, > so the regions that need virtio flushing are not bound by > the regular driver, and the regular driver can continue to > work for memory regions that are backed by actual pmem in > the host? Hmm, yes that could be feasible especially if it uses the ACPI NFIT mechanism. It would basically involve defining a new SPA (System Physical Address) range GUID type, and then teaching libnvdimm to treat that as a new pmem device type. See usage of UUID_PERSISTENT_MEMORY in drivers/acpi/nfit/ and the eventual region description sent to nvdimm_pmem_region_create(). 
We would then need to plumb a new flag so that nd_region_to_nstype() in libnvdimm returns a different namespace type number for this virtio use case, but otherwise the rest of libnvdimm should treat the region as pmem.
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
On Wed, 2017-07-26 at 14:40 -0700, Dan Williams wrote: > On Wed, Jul 26, 2017 at 2:27 PM, Rik van Riel > wrote: > > On Wed, 2017-07-26 at 09:47 -0400, Pankaj Gupta wrote: > > > > > > > > > > Just want to summarize here(high level): > > > > > > This will require implementing new 'virtio-pmem' device which > > > presents > > > a DAX address range(like pmem) to guest with read/write(direct > > > access) > > > & device flush functionality. Also, qemu should implement > > > corresponding > > > support for flush using virtio. > > > > > > > Alternatively, the existing pmem code, with > > a flush-only block device on the side, which > > is somehow associated with the pmem device. > > > > I wonder which alternative leads to the least > > code duplication, and the least maintenance > > hassle going forward. > > I'd much prefer to have another driver. I.e. a driver that refactors > out some common pmem details into a shared object and can attach to > ND_DEVICE_NAMESPACE_{IO,PMEM}. A control device on the side seems > like > a recipe for confusion. At that point, would it make sense to expose these special virtio-pmem areas to the guest in a slightly different way, so the regions that need virtio flushing are not bound by the regular driver, and the regular driver can continue to work for memory regions that are backed by actual pmem in the host? > With a $new_driver in hand you can just do: > > modprobe $new_driver > echo $namespace > /sys/bus/nd/drivers/nd_pmem/unbind > echo $namespace > /sys/bus/nd/drivers/$new_driver/new_id > echo $namespace > /sys/bus/nd/drivers/$new_driver/bind > > ...and the guest can arrange for $new_driver to be the default, so > you > don't need to do those steps each boot of the VM, by doing: > > echo "blacklist nd_pmem" > /etc/modprobe.d/virt-dax-flush.conf > echo "alias nd:t4* $new_driver" >> /etc/modprobe.d/virt-dax- > flush.conf > echo "alias nd:t5* $new_driver" >> /etc/modprobe.d/virt-dax- > flush.conf
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
On Wed, Jul 26, 2017 at 2:27 PM, Rik van Riel wrote: > On Wed, 2017-07-26 at 09:47 -0400, Pankaj Gupta wrote: >> > >> Just want to summarize here(high level): >> >> This will require implementing new 'virtio-pmem' device which >> presents >> a DAX address range(like pmem) to guest with read/write(direct >> access) >> & device flush functionality. Also, qemu should implement >> corresponding >> support for flush using virtio. >> > Alternatively, the existing pmem code, with > a flush-only block device on the side, which > is somehow associated with the pmem device. > > I wonder which alternative leads to the least > code duplication, and the least maintenance > hassle going forward. I'd much prefer to have another driver. I.e. a driver that refactors out some common pmem details into a shared object and can attach to ND_DEVICE_NAMESPACE_{IO,PMEM}. A control device on the side seems like a recipe for confusion. With a $new_driver in hand you can just do: modprobe $new_driver echo $namespace > /sys/bus/nd/drivers/nd_pmem/unbind echo $namespace > /sys/bus/nd/drivers/$new_driver/new_id echo $namespace > /sys/bus/nd/drivers/$new_driver/bind ...and the guest can arrange for $new_driver to be the default, so you don't need to do those steps each boot of the VM, by doing: echo "blacklist nd_pmem" > /etc/modprobe.d/virt-dax-flush.conf echo "alias nd:t4* $new_driver" >> /etc/modprobe.d/virt-dax-flush.conf echo "alias nd:t5* $new_driver" >> /etc/modprobe.d/virt-dax-flush.conf
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
On Wed, 2017-07-26 at 09:47 -0400, Pankaj Gupta wrote: > > > Just want to summarize here (high level): > > This will require implementing a new 'virtio-pmem' device which > presents > a DAX address range (like pmem) to guest with read/write (direct > access) > & device flush functionality. Also, qemu should implement > corresponding > support for flush using virtio. > Alternatively, the existing pmem code, with a flush-only block device on the side, which is somehow associated with the pmem device. I wonder which alternative leads to the least code duplication, and the least maintenance hassle going forward. -- All rights reversed
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
> > On Tue, 2017-07-25 at 07:46 -0700, Dan Williams wrote: > > On Tue, Jul 25, 2017 at 7:27 AM, Pankaj Gupta > > wrote: > > > > > > Looks like the only way to send a flush (blk dev) from guest to host with > > > nvdimm > > > is using flush hint addresses. Is this the correct interface I am > > > looking at? > > > > > > blkdev_issue_flush > > > submit_bio_wait > > > submit_bio > > > generic_make_request > > > pmem_make_request > > > ... > > > if (bio->bi_opf & REQ_FLUSH) > > > nvdimm_flush(nd_region); > > > > I would inject a paravirtualized version of pmem_make_request() that > > sends an async flush operation over virtio to the host. Don't try to > > use flush hint addresses for this, they don't have the proper > > semantics. The guest should be allowed to issue the flush and receive > > the completion asynchronously rather than taking a VM exit and > > blocking on that request. > > That is my feeling, too. A slower IO device benefits > greatly from an asynchronous flush mechanism. Thanks for all the suggestions! Just want to summarize here (high level): This will require implementing a new 'virtio-pmem' device which presents a DAX address range (like pmem) to guest with read/write (direct access) & device flush functionality. Also, qemu should implement corresponding support for flush using virtio. Thanks, Pankaj > > -- > All rights reversed
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
On Tue, 2017-07-25 at 07:46 -0700, Dan Williams wrote: > On Tue, Jul 25, 2017 at 7:27 AM, Pankaj Gupta > wrote: > > > > Looks like the only way to send a flush (blk dev) from guest to host with > > nvdimm > > is using flush hint addresses. Is this the correct interface I am > > looking at? > > > > blkdev_issue_flush > > submit_bio_wait > > submit_bio > > generic_make_request > > pmem_make_request > > ... > > if (bio->bi_opf & REQ_FLUSH) > > nvdimm_flush(nd_region); > > I would inject a paravirtualized version of pmem_make_request() that > sends an async flush operation over virtio to the host. Don't try to > use flush hint addresses for this, they don't have the proper > semantics. The guest should be allowed to issue the flush and receive > the completion asynchronously rather than taking a VM exit and > blocking on that request. That is my feeling, too. A slower IO device benefits greatly from an asynchronous flush mechanism. -- All rights reversed
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
On Tue, Jul 25, 2017 at 7:27 AM, Pankaj Gupta wrote: > >> Subject: Re: KVM "fake DAX" flushing interface - discussion >> >> On Mon 24-07-17 08:06:07, Pankaj Gupta wrote: >> > >> > > On Sun 23-07-17 13:10:34, Dan Williams wrote: >> > > > On Sun, Jul 23, 2017 at 11:10 AM, Rik van Riel wrote: >> > > > > On Sun, 2017-07-23 at 09:01 -0700, Dan Williams wrote: >> > > > >> [ adding Ross and Jan ] >> > > > >> >> > > > >> On Sun, Jul 23, 2017 at 7:04 AM, Rik van Riel >> > > > >> wrote: >> > > > >> > >> > > > >> > The goal is to increase density of guests, by moving page >> > > > >> > cache into the host (where it can be easily reclaimed). >> > > > >> > >> > > > >> > If we assume the guests will be backed by relatively fast >> > > > >> > SSDs, a "whole device flush" from filesystem journaling >> > > > >> > code (issued where the filesystem issues a barrier or >> > > > >> > disk cache flush today) may be just what we need to make >> > > > >> > that work. >> > > > >> >> > > > >> Ok, apologies, I indeed had some pieces of the proposal confused. >> > > > >> >> > > > >> However, it still seems like the storage interface is not capable of >> > > > >> expressing what is needed, because the operation that is needed is a >> > > > >> range flush. In the guest you want the DAX page dirty tracking to >> > > > >> communicate range flush information to the host, but there's no >> > > > >> readily available block i/o semantic that software running on top of >> > > > >> the fake pmem device can use to communicate with the host. Instead >> > > > >> you >> > > > >> want to intercept the dax_flush() operation and turn it into a >> > > > >> queued >> > > > >> request on the host. >> > > > >> >> > > > >> In 4.13 we have turned this dax_flush() operation into an explicit >> > > > >> driver call. That seems a better interface to modify than trying to >> > > > >> map block-storage flush-cache / force-unit-access commands to this >> > > > >> host request. 
>> > > > >> >> > > > >> The additional piece you would need to consider is whether to track >> > > > >> all writes in addition to mmap writes in the guest as DAX-page-cache >> > > > >> dirtying events, or arrange for every dax_copy_from_iter() >> > > > >> operation() >> > > > >> to also queue a sync on the host, but that essentially turns the >> > > > >> host >> > > > >> page cache into a pseudo write-through mode. >> > > > > >> > > > > I suspect initially it will be fine to not offer DAX >> > > > > semantics to applications using these "fake DAX" devices >> > > > > from a virtual machine, because the DAX APIs are designed >> > > > > for a much higher performance device than these fake DAX >> > > > > setups could ever give. >> > > > >> > > > Right, we don't need DAX, per se, in the guest. >> > > > >> > > > > >> > > > > Having userspace call fsync/msync like done normally, and >> > > > > having those coarser calls be turned into somewhat efficient >> > > > > backend flushes would be perfectly acceptable. >> > > > > >> > > > > The big question is, what should that kind of interface look >> > > > > like? >> > > > >> > > > To me, this looks much like the dirty cache tracking that is done in >> > > > the address_space radix for the DAX case, but modified to coordinate >> > > > queued / page-based flushing when the guest wants to persist data. >> > > > The similarity to DAX is not storing guest allocated pages in the >> > > > radix but entries that track dirty guest physical addresses. >> > > >> > > Let me check whether I understand the problem correctly. So we want to >> > > export a block device (essentially a page cache of this block device) to >> > > a >> > > guest as PMEM and use DAX in the guest to save guest's page cache. The >> > >> > that's correct. 
>> > >> > > natural way to make the persistence work would be to make ->flush >> > > callback >> > > of the PMEM device to do an upcall to the host which could then >> > > fdatasync() >> > > appropriate image file range however the performance would suck in such >> > > case since ->flush gets called for at most one page ranges from DAX. >> > >> > Discussion is: sync a range using paravirt device or flush hint addresses >> > vs block device flush. >> > >> > > >> > > So what you could do instead is to completely ignore ->flush calls for >> > > the >> > > PMEM device and instead catch the bio with REQ_PREFLUSH flag set on the >> > > PMEM device (generated by blkdev_issue_flush() or the journalling >> > > machinery) and fdatasync() the whole image file at that moment - in fact >> > > you must do that for metadata IO to hit persistent storage anyway in your >> > > setting. This would very closely follow how exporting block devices with >> > > volatile cache works with KVM these days AFAIU and the performance will >> > > be >> > > the same. >> > >> > yes 'blkdev_issue_flush' does set 'REQ_OP_WRITE | REQ_PREFLUSH' flags. >> > As per suggestions looks like block flushing device is way ahead. >> > >> > If we do an asynchronous block flush at guest side (put current task in >> > wait queue till host side fdatasync completes) can solve the purpose? Or >> > do we need another paravirt device for this?
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
> Subject: Re: KVM "fake DAX" flushing interface - discussion > > On Mon 24-07-17 08:06:07, Pankaj Gupta wrote: > > > > > On Sun 23-07-17 13:10:34, Dan Williams wrote: > > > > On Sun, Jul 23, 2017 at 11:10 AM, Rik van Riel wrote: > > > > > On Sun, 2017-07-23 at 09:01 -0700, Dan Williams wrote: > > > > >> [ adding Ross and Jan ] > > > > >> > > > > >> On Sun, Jul 23, 2017 at 7:04 AM, Rik van Riel > > > > >> wrote: > > > > >> > > > > > >> > The goal is to increase density of guests, by moving page > > > > >> > cache into the host (where it can be easily reclaimed). > > > > >> > > > > > >> > If we assume the guests will be backed by relatively fast > > > > >> > SSDs, a "whole device flush" from filesystem journaling > > > > >> > code (issued where the filesystem issues a barrier or > > > > >> > disk cache flush today) may be just what we need to make > > > > >> > that work. > > > > >> > > > > >> Ok, apologies, I indeed had some pieces of the proposal confused. > > > > >> > > > > >> However, it still seems like the storage interface is not capable of > > > > >> expressing what is needed, because the operation that is needed is a > > > > >> range flush. In the guest you want the DAX page dirty tracking to > > > > >> communicate range flush information to the host, but there's no > > > > >> readily available block i/o semantic that software running on top of > > > > >> the fake pmem device can use to communicate with the host. Instead > > > > >> you > > > > >> want to intercept the dax_flush() operation and turn it into a > > > > >> queued > > > > >> request on the host. > > > > >> > > > > >> In 4.13 we have turned this dax_flush() operation into an explicit > > > > >> driver call. That seems a better interface to modify than trying to > > > > >> map block-storage flush-cache / force-unit-access commands to this > > > > >> host request. 
> > > > >> > > > > >> The additional piece you would need to consider is whether to track > > > > >> all writes in addition to mmap writes in the guest as DAX-page-cache > > > > >> dirtying events, or arrange for every dax_copy_from_iter() > > > > >> operation() > > > > >> to also queue a sync on the host, but that essentially turns the > > > > >> host > > > > >> page cache into a pseudo write-through mode. > > > > > > > > > > I suspect initially it will be fine to not offer DAX > > > > > semantics to applications using these "fake DAX" devices > > > > > from a virtual machine, because the DAX APIs are designed > > > > > for a much higher performance device than these fake DAX > > > > > setups could ever give. > > > > > > > > Right, we don't need DAX, per se, in the guest. > > > > > > > > > > > > > > Having userspace call fsync/msync like done normally, and > > > > > having those coarser calls be turned into somewhat efficient > > > > > backend flushes would be perfectly acceptable. > > > > > > > > > > The big question is, what should that kind of interface look > > > > > like? > > > > > > > > To me, this looks much like the dirty cache tracking that is done in > > > > the address_space radix for the DAX case, but modified to coordinate > > > > queued / page-based flushing when the guest wants to persist data. > > > > The similarity to DAX is not storing guest allocated pages in the > > > > radix but entries that track dirty guest physical addresses. > > > > > > Let me check whether I understand the problem correctly. So we want to > > > export a block device (essentially a page cache of this block device) to > > > a > > > guest as PMEM and use DAX in the guest to save guest's page cache. The > > > > that's correct. 
> > > > > natural way to make the persistence work would be to make ->flush > > > callback > > > of the PMEM device to do an upcall to the host which could then > > > fdatasync() > > > appropriate image file range however the performance would suck in such > > > case since ->flush gets called for at most one page ranges from DAX. > > > > Discussion is: sync a range using paravirt device or flush hint addresses > > vs block device flush. > > > > > > > > So what you could do instead is to completely ignore ->flush calls for > > > the > > > PMEM device and instead catch the bio with REQ_PREFLUSH flag set on the > > > PMEM device (generated by blkdev_issue_flush() or the journalling > > > machinery) and fdatasync() the whole image file at that moment - in fact > > > you must do that for metadata IO to hit persistent storage anyway in your > > > setting. This would very closely follow how exporting block devices with > > > volatile cache works with KVM these days AFAIU and the performance will > > > be > > > the same. > > > > yes 'blkdev_issue_flush' does set 'REQ_OP_WRITE | REQ_PREFLUSH' flags. > > As per suggestions looks like block flushing device is way ahead. > > > > If we do an asynchronous block flush at guest side (put current task in > > wait queue till host side fdatasync completes) can solve the purpose? Or > > do we need another paravirt device for this?
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
On Mon, Jul 24, 2017 at 8:48 AM, Jan Kara wrote: > On Mon 24-07-17 08:10:05, Dan Williams wrote: >> On Mon, Jul 24, 2017 at 5:37 AM, Jan Kara wrote: [..] >> This approach would turn into a full fsync on the host. The question >> in my mind is whether there is any optimization to be had by trapping >> dax_flush() and calling msync() on host ranges, but Jan is right >> trapping blkdev_issue_flush() and turning around and calling host >> fsync() is the most straightforward approach that does not need driver >> interface changes. The dax_flush() approach would need to modify it >> into a async completion interface. > > If the backing device on the host is actually a normal block device or an > image file, doing full fsync() is the most efficient implementation > anyway... Ah, ok, great. That was the gap in my understanding.
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
On Mon 24-07-17 08:10:05, Dan Williams wrote: > On Mon, Jul 24, 2017 at 5:37 AM, Jan Kara wrote: > > On Mon 24-07-17 08:06:07, Pankaj Gupta wrote: > >> > >> > On Sun 23-07-17 13:10:34, Dan Williams wrote: > >> > > On Sun, Jul 23, 2017 at 11:10 AM, Rik van Riel wrote: > >> > > > On Sun, 2017-07-23 at 09:01 -0700, Dan Williams wrote: > >> > > >> [ adding Ross and Jan ] > >> > > >> > >> > > >> On Sun, Jul 23, 2017 at 7:04 AM, Rik van Riel > >> > > >> wrote: > >> > > >> > > >> > > >> > The goal is to increase density of guests, by moving page > >> > > >> > cache into the host (where it can be easily reclaimed). > >> > > >> > > >> > > >> > If we assume the guests will be backed by relatively fast > >> > > >> > SSDs, a "whole device flush" from filesystem journaling > >> > > >> > code (issued where the filesystem issues a barrier or > >> > > >> > disk cache flush today) may be just what we need to make > >> > > >> > that work. > >> > > >> > >> > > >> Ok, apologies, I indeed had some pieces of the proposal confused. > >> > > >> > >> > > >> However, it still seems like the storage interface is not capable of > >> > > >> expressing what is needed, because the operation that is needed is a > >> > > >> range flush. In the guest you want the DAX page dirty tracking to > >> > > >> communicate range flush information to the host, but there's no > >> > > >> readily available block i/o semantic that software running on top of > >> > > >> the fake pmem device can use to communicate with the host. Instead > >> > > >> you > >> > > >> want to intercept the dax_flush() operation and turn it into a > >> > > >> queued > >> > > >> request on the host. > >> > > >> > >> > > >> In 4.13 we have turned this dax_flush() operation into an explicit > >> > > >> driver call. That seems a better interface to modify than trying to > >> > > >> map block-storage flush-cache / force-unit-access commands to this > >> > > >> host request. 
> >> > > >> > >> > > >> The additional piece you would need to consider is whether to track > >> > > >> all writes in addition to mmap writes in the guest as DAX-page-cache > >> > > >> dirtying events, or arrange for every dax_copy_from_iter() > >> > > >> operation() > >> > > >> to also queue a sync on the host, but that essentially turns the > >> > > >> host > >> > > >> page cache into a pseudo write-through mode. > >> > > > > >> > > > I suspect initially it will be fine to not offer DAX > >> > > > semantics to applications using these "fake DAX" devices > >> > > > from a virtual machine, because the DAX APIs are designed > >> > > > for a much higher performance device than these fake DAX > >> > > > setups could ever give. > >> > > > >> > > Right, we don't need DAX, per se, in the guest. > >> > > > >> > > > > >> > > > Having userspace call fsync/msync like done normally, and > >> > > > having those coarser calls be turned into somewhat efficient > >> > > > backend flushes would be perfectly acceptable. > >> > > > > >> > > > The big question is, what should that kind of interface look > >> > > > like? > >> > > > >> > > To me, this looks much like the dirty cache tracking that is done in > >> > > the address_space radix for the DAX case, but modified to coordinate > >> > > queued / page-based flushing when the guest wants to persist data. > >> > > The similarity to DAX is not storing guest allocated pages in the > >> > > radix but entries that track dirty guest physical addresses. > >> > > >> > Let me check whether I understand the problem correctly. So we want to > >> > export a block device (essentially a page cache of this block device) to > >> > a > >> > guest as PMEM and use DAX in the guest to save guest's page cache. The > >> > >> that's correct. 
> >> > >> > natural way to make the persistence work would be to make ->flush > >> > callback > >> > of the PMEM device to do an upcall to the host which could then > >> > fdatasync() > >> > appropriate image file range however the performance would suck in such > >> > case since ->flush gets called for at most one page ranges from DAX. > >> > >> Discussion is: sync a range using paravirt device or flush hint addresses > >> vs block device flush. > >> > >> > > >> > So what you could do instead is to completely ignore ->flush calls for > >> > the > >> > PMEM device and instead catch the bio with REQ_PREFLUSH flag set on the > >> > PMEM device (generated by blkdev_issue_flush() or the journalling > >> > machinery) and fdatasync() the whole image file at that moment - in fact > >> > you must do that for metadata IO to hit persistent storage anyway in your > >> > setting. This would very closely follow how exporting block devices with > >> > volatile cache works with KVM these days AFAIU and the performance will > >> > be > >> > the same. > >> > >> yes 'blkdev_issue_flush' does set 'REQ_OP_WRITE | REQ_PREFLUSH' flags. > >> As per suggestions looks like block flushing device is way ahead. > >> > >> If we do an asynchronous block flush at guest side (put current task in > >> wait queue till host side fdatasync completes) can solve the purpose? Or > >> do we need another paravirt device for this?
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
On Mon, Jul 24, 2017 at 5:37 AM, Jan Kara wrote: > On Mon 24-07-17 08:06:07, Pankaj Gupta wrote: >> >> > On Sun 23-07-17 13:10:34, Dan Williams wrote: >> > > On Sun, Jul 23, 2017 at 11:10 AM, Rik van Riel wrote: >> > > > On Sun, 2017-07-23 at 09:01 -0700, Dan Williams wrote: >> > > >> [ adding Ross and Jan ] >> > > >> >> > > >> On Sun, Jul 23, 2017 at 7:04 AM, Rik van Riel >> > > >> wrote: >> > > >> > >> > > >> > The goal is to increase density of guests, by moving page >> > > >> > cache into the host (where it can be easily reclaimed). >> > > >> > >> > > >> > If we assume the guests will be backed by relatively fast >> > > >> > SSDs, a "whole device flush" from filesystem journaling >> > > >> > code (issued where the filesystem issues a barrier or >> > > >> > disk cache flush today) may be just what we need to make >> > > >> > that work. >> > > >> >> > > >> Ok, apologies, I indeed had some pieces of the proposal confused. >> > > >> >> > > >> However, it still seems like the storage interface is not capable of >> > > >> expressing what is needed, because the operation that is needed is a >> > > >> range flush. In the guest you want the DAX page dirty tracking to >> > > >> communicate range flush information to the host, but there's no >> > > >> readily available block i/o semantic that software running on top of >> > > >> the fake pmem device can use to communicate with the host. Instead >> > > >> you >> > > >> want to intercept the dax_flush() operation and turn it into a queued >> > > >> request on the host. >> > > >> >> > > >> In 4.13 we have turned this dax_flush() operation into an explicit >> > > >> driver call. That seems a better interface to modify than trying to >> > > >> map block-storage flush-cache / force-unit-access commands to this >> > > >> host request. 
>> > > >> >> > > >> The additional piece you would need to consider is whether to track >> > > >> all writes in addition to mmap writes in the guest as DAX-page-cache >> > > >> dirtying events, or arrange for every dax_copy_from_iter() >> > > >> operation() >> > > >> to also queue a sync on the host, but that essentially turns the host >> > > >> page cache into a pseudo write-through mode. >> > > > >> > > > I suspect initially it will be fine to not offer DAX >> > > > semantics to applications using these "fake DAX" devices >> > > > from a virtual machine, because the DAX APIs are designed >> > > > for a much higher performance device than these fake DAX >> > > > setups could ever give. >> > > >> > > Right, we don't need DAX, per se, in the guest. >> > > >> > > > >> > > > Having userspace call fsync/msync like done normally, and >> > > > having those coarser calls be turned into somewhat efficient >> > > > backend flushes would be perfectly acceptable. >> > > > >> > > > The big question is, what should that kind of interface look >> > > > like? >> > > >> > > To me, this looks much like the dirty cache tracking that is done in >> > > the address_space radix for the DAX case, but modified to coordinate >> > > queued / page-based flushing when the guest wants to persist data. >> > > The similarity to DAX is not storing guest allocated pages in the >> > > radix but entries that track dirty guest physical addresses. >> > >> > Let me check whether I understand the problem correctly. So we want to >> > export a block device (essentially a page cache of this block device) to a >> > guest as PMEM and use DAX in the guest to save guest's page cache. The >> >> that's correct. 
>> >> > natural way to make the persistence work would be to make ->flush callback >> > of the PMEM device to do an upcall to the host which could then fdatasync() >> > appropriate image file range however the performance would suck in such >> > case since ->flush gets called for at most one page ranges from DAX. >> >> Discussion is : sync a range using paravirt device or flush hit addresses >> vs block device flush. >> >> > >> > So what you could do instead is to completely ignore ->flush calls for the >> > PMEM device and instead catch the bio with REQ_PREFLUSH flag set on the >> > PMEM device (generated by blkdev_issue_flush() or the journalling >> > machinery) and fdatasync() the whole image file at that moment - in fact >> > you must do that for metadata IO to hit persistent storage anyway in your >> > setting. This would very closely follow how exporting block devices with >> > volatile cache works with KVM these days AFAIU and the performance will be >> > the same. >> >> yes 'blkdev_issue_flush' does set 'REQ_OP_WRITE | REQ_PREFLUSH' flags. >> As per suggestions looks like block flushing device is way ahead. >> >> If we do an asynchronous block flush at guest side(put current task in >> wait queue till host side fdatasync completes) can solve the purpose? Or >> do we need another paravirt device for this? > > Well, even currently if you have PMEM device, you still have also a block > device and a request queue associated with it and metadata IO goes through > that pat
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
On Mon 24-07-17 08:06:07, Pankaj Gupta wrote: > > > On Sun 23-07-17 13:10:34, Dan Williams wrote: > > > On Sun, Jul 23, 2017 at 11:10 AM, Rik van Riel wrote: > > > > On Sun, 2017-07-23 at 09:01 -0700, Dan Williams wrote: > > > >> [ adding Ross and Jan ] > > > >> > > > >> On Sun, Jul 23, 2017 at 7:04 AM, Rik van Riel > > > >> wrote: > > > >> > > > > >> > The goal is to increase density of guests, by moving page > > > >> > cache into the host (where it can be easily reclaimed). > > > >> > > > > >> > If we assume the guests will be backed by relatively fast > > > >> > SSDs, a "whole device flush" from filesystem journaling > > > >> > code (issued where the filesystem issues a barrier or > > > >> > disk cache flush today) may be just what we need to make > > > >> > that work. > > > >> > > > >> Ok, apologies, I indeed had some pieces of the proposal confused. > > > >> > > > >> However, it still seems like the storage interface is not capable of > > > >> expressing what is needed, because the operation that is needed is a > > > >> range flush. In the guest you want the DAX page dirty tracking to > > > >> communicate range flush information to the host, but there's no > > > >> readily available block i/o semantic that software running on top of > > > >> the fake pmem device can use to communicate with the host. Instead > > > >> you > > > >> want to intercept the dax_flush() operation and turn it into a queued > > > >> request on the host. > > > >> > > > >> In 4.13 we have turned this dax_flush() operation into an explicit > > > >> driver call. That seems a better interface to modify than trying to > > > >> map block-storage flush-cache / force-unit-access commands to this > > > >> host request. 
> > > >> > > > >> The additional piece you would need to consider is whether to track > > > >> all writes in addition to mmap writes in the guest as DAX-page-cache > > > >> dirtying events, or arrange for every dax_copy_from_iter() > > > >> operation() > > > >> to also queue a sync on the host, but that essentially turns the host > > > >> page cache into a pseudo write-through mode. > > > > > > > > I suspect initially it will be fine to not offer DAX > > > > semantics to applications using these "fake DAX" devices > > > > from a virtual machine, because the DAX APIs are designed > > > > for a much higher performance device than these fake DAX > > > > setups could ever give. > > > > > > Right, we don't need DAX, per se, in the guest. > > > > > > > > > > > Having userspace call fsync/msync like done normally, and > > > > having those coarser calls be turned into somewhat efficient > > > > backend flushes would be perfectly acceptable. > > > > > > > > The big question is, what should that kind of interface look > > > > like? > > > > > > To me, this looks much like the dirty cache tracking that is done in > > > the address_space radix for the DAX case, but modified to coordinate > > > queued / page-based flushing when the guest wants to persist data. > > > The similarity to DAX is not storing guest allocated pages in the > > > radix but entries that track dirty guest physical addresses. > > > > Let me check whether I understand the problem correctly. So we want to > > export a block device (essentially a page cache of this block device) to a > > guest as PMEM and use DAX in the guest to save guest's page cache. The > > that's correct. > > > natural way to make the persistence work would be to make ->flush callback > > of the PMEM device to do an upcall to the host which could then fdatasync() > > appropriate image file range however the performance would suck in such > > case since ->flush gets called for at most one page ranges from DAX. 
> > Discussion is : sync a range using paravirt device or flush hit addresses > vs block device flush. > > > > > So what you could do instead is to completely ignore ->flush calls for the > > PMEM device and instead catch the bio with REQ_PREFLUSH flag set on the > > PMEM device (generated by blkdev_issue_flush() or the journalling > > machinery) and fdatasync() the whole image file at that moment - in fact > > you must do that for metadata IO to hit persistent storage anyway in your > > setting. This would very closely follow how exporting block devices with > > volatile cache works with KVM these days AFAIU and the performance will be > > the same. > > yes 'blkdev_issue_flush' does set 'REQ_OP_WRITE | REQ_PREFLUSH' flags. > As per suggestions looks like block flushing device is way ahead. > > If we do an asynchronous block flush at guest side(put current task in > wait queue till host side fdatasync completes) can solve the purpose? Or > do we need another paravirt device for this? Well, even currently if you have PMEM device, you still have also a block device and a request queue associated with it and metadata IO goes through that path. So in your case you will have the same in the guest as a result of exposing virtual PMEM device to the guest and you just need to make s
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
> On Sun 23-07-17 13:10:34, Dan Williams wrote: > > On Sun, Jul 23, 2017 at 11:10 AM, Rik van Riel wrote: > > > On Sun, 2017-07-23 at 09:01 -0700, Dan Williams wrote: > > >> [ adding Ross and Jan ] > > >> > > >> On Sun, Jul 23, 2017 at 7:04 AM, Rik van Riel > > >> wrote: > > >> > > > >> > The goal is to increase density of guests, by moving page > > >> > cache into the host (where it can be easily reclaimed). > > >> > > > >> > If we assume the guests will be backed by relatively fast > > >> > SSDs, a "whole device flush" from filesystem journaling > > >> > code (issued where the filesystem issues a barrier or > > >> > disk cache flush today) may be just what we need to make > > >> > that work. > > >> > > >> Ok, apologies, I indeed had some pieces of the proposal confused. > > >> > > >> However, it still seems like the storage interface is not capable of > > >> expressing what is needed, because the operation that is needed is a > > >> range flush. In the guest you want the DAX page dirty tracking to > > >> communicate range flush information to the host, but there's no > > >> readily available block i/o semantic that software running on top of > > >> the fake pmem device can use to communicate with the host. Instead > > >> you > > >> want to intercept the dax_flush() operation and turn it into a queued > > >> request on the host. > > >> > > >> In 4.13 we have turned this dax_flush() operation into an explicit > > >> driver call. That seems a better interface to modify than trying to > > >> map block-storage flush-cache / force-unit-access commands to this > > >> host request. > > >> > > >> The additional piece you would need to consider is whether to track > > >> all writes in addition to mmap writes in the guest as DAX-page-cache > > >> dirtying events, or arrange for every dax_copy_from_iter() > > >> operation() > > >> to also queue a sync on the host, but that essentially turns the host > > >> page cache into a pseudo write-through mode. 
> > > > > > I suspect initially it will be fine to not offer DAX > > > semantics to applications using these "fake DAX" devices > > > from a virtual machine, because the DAX APIs are designed > > > for a much higher performance device than these fake DAX > > > setups could ever give. > > > > Right, we don't need DAX, per se, in the guest. > > > > > > > > Having userspace call fsync/msync like done normally, and > > > having those coarser calls be turned into somewhat efficient > > > backend flushes would be perfectly acceptable. > > > > > > The big question is, what should that kind of interface look > > > like? > > > > To me, this looks much like the dirty cache tracking that is done in > > the address_space radix for the DAX case, but modified to coordinate > > queued / page-based flushing when the guest wants to persist data. > > The similarity to DAX is not storing guest allocated pages in the > > radix but entries that track dirty guest physical addresses. > > Let me check whether I understand the problem correctly. So we want to > export a block device (essentially a page cache of this block device) to a > guest as PMEM and use DAX in the guest to save guest's page cache. That's correct. > The natural way to make the persistence work would be to make ->flush callback > of the PMEM device to do an upcall to the host which could then fdatasync() > appropriate image file range however the performance would suck in such > case since ->flush gets called for at most one page ranges from DAX. The discussion is whether to sync a range (via a paravirt device, or by flushing the hinted addresses) or to do a whole block-device flush. > > So what you could do instead is to completely ignore ->flush calls for the > PMEM device and instead catch the bio with REQ_PREFLUSH flag set on the > PMEM device (generated by blkdev_issue_flush() or the journalling > machinery) and fdatasync() the whole image file at that moment - in fact > you must do that for metadata IO to hit persistent storage anyway in your > setting.
This would very closely follow how exporting block devices with > volatile cache works with KVM these days AFAIU and the performance will be > the same. Yes, 'blkdev_issue_flush' does set the 'REQ_OP_WRITE | REQ_PREFLUSH' flags. Based on the suggestions so far, the block-device flush approach looks like the way forward. Would an asynchronous block flush on the guest side (putting the current task on a wait queue until the host-side fdatasync completes) solve the problem, or do we need another paravirt device for this? > > Honza > -- > Jan Kara > SUSE Labs, CR >
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
On Sun 23-07-17 13:10:34, Dan Williams wrote: > On Sun, Jul 23, 2017 at 11:10 AM, Rik van Riel wrote: > > On Sun, 2017-07-23 at 09:01 -0700, Dan Williams wrote: > >> [ adding Ross and Jan ] > >> > >> On Sun, Jul 23, 2017 at 7:04 AM, Rik van Riel > >> wrote: > >> > > >> > The goal is to increase density of guests, by moving page > >> > cache into the host (where it can be easily reclaimed). > >> > > >> > If we assume the guests will be backed by relatively fast > >> > SSDs, a "whole device flush" from filesystem journaling > >> > code (issued where the filesystem issues a barrier or > >> > disk cache flush today) may be just what we need to make > >> > that work. > >> > >> Ok, apologies, I indeed had some pieces of the proposal confused. > >> > >> However, it still seems like the storage interface is not capable of > >> expressing what is needed, because the operation that is needed is a > >> range flush. In the guest you want the DAX page dirty tracking to > >> communicate range flush information to the host, but there's no > >> readily available block i/o semantic that software running on top of > >> the fake pmem device can use to communicate with the host. Instead > >> you > >> want to intercept the dax_flush() operation and turn it into a queued > >> request on the host. > >> > >> In 4.13 we have turned this dax_flush() operation into an explicit > >> driver call. That seems a better interface to modify than trying to > >> map block-storage flush-cache / force-unit-access commands to this > >> host request. > >> > >> The additional piece you would need to consider is whether to track > >> all writes in addition to mmap writes in the guest as DAX-page-cache > >> dirtying events, or arrange for every dax_copy_from_iter() > >> operation() > >> to also queue a sync on the host, but that essentially turns the host > >> page cache into a pseudo write-through mode. 
> > > > I suspect initially it will be fine to not offer DAX > > semantics to applications using these "fake DAX" devices > > from a virtual machine, because the DAX APIs are designed > > for a much higher performance device than these fake DAX > > setups could ever give. > > Right, we don't need DAX, per se, in the guest. > > > > > Having userspace call fsync/msync like done normally, and > > having those coarser calls be turned into somewhat efficient > > backend flushes would be perfectly acceptable. > > > > The big question is, what should that kind of interface look > > like? > > To me, this looks much like the dirty cache tracking that is done in > the address_space radix for the DAX case, but modified to coordinate > queued / page-based flushing when the guest wants to persist data. > The similarity to DAX is not storing guest allocated pages in the > radix but entries that track dirty guest physical addresses. Let me check whether I understand the problem correctly. So we want to export a block device (essentially a page cache of this block device) to a guest as PMEM and use DAX in the guest to save guest's page cache. The natural way to make the persistence work would be to make ->flush callback of the PMEM device to do an upcall to the host which could then fdatasync() appropriate image file range however the performance would suck in such case since ->flush gets called for at most one page ranges from DAX. So what you could do instead is to completely ignore ->flush calls for the PMEM device and instead catch the bio with REQ_PREFLUSH flag set on the PMEM device (generated by blkdev_issue_flush() or the journalling machinery) and fdatasync() the whole image file at that moment - in fact you must do that for metadata IO to hit persistent storage anyway in your setting. This would very closely follow how exporting block devices with volatile cache works with KVM these days AFAIU and the performance will be the same. Honza -- Jan Kara SUSE Labs, CR
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
On Sun, Jul 23, 2017 at 11:10 AM, Rik van Riel wrote: > On Sun, 2017-07-23 at 09:01 -0700, Dan Williams wrote: >> [ adding Ross and Jan ] >> >> On Sun, Jul 23, 2017 at 7:04 AM, Rik van Riel >> wrote: >> > >> > The goal is to increase density of guests, by moving page >> > cache into the host (where it can be easily reclaimed). >> > >> > If we assume the guests will be backed by relatively fast >> > SSDs, a "whole device flush" from filesystem journaling >> > code (issued where the filesystem issues a barrier or >> > disk cache flush today) may be just what we need to make >> > that work. >> >> Ok, apologies, I indeed had some pieces of the proposal confused. >> >> However, it still seems like the storage interface is not capable of >> expressing what is needed, because the operation that is needed is a >> range flush. In the guest you want the DAX page dirty tracking to >> communicate range flush information to the host, but there's no >> readily available block i/o semantic that software running on top of >> the fake pmem device can use to communicate with the host. Instead >> you >> want to intercept the dax_flush() operation and turn it into a queued >> request on the host. >> >> In 4.13 we have turned this dax_flush() operation into an explicit >> driver call. That seems a better interface to modify than trying to >> map block-storage flush-cache / force-unit-access commands to this >> host request. >> >> The additional piece you would need to consider is whether to track >> all writes in addition to mmap writes in the guest as DAX-page-cache >> dirtying events, or arrange for every dax_copy_from_iter() >> operation() >> to also queue a sync on the host, but that essentially turns the host >> page cache into a pseudo write-through mode. 
> > I suspect initially it will be fine to not offer DAX > semantics to applications using these "fake DAX" devices > from a virtual machine, because the DAX APIs are designed > for a much higher performance device than these fake DAX > setups could ever give. Right, we don't need DAX, per se, in the guest. > > Having userspace call fsync/msync like done normally, and > having those coarser calls be turned into somewhat efficient > backend flushes would be perfectly acceptable. > > The big question is, what should that kind of interface look > like? To me, this looks much like the dirty cache tracking that is done in the address_space radix for the DAX case, but modified to coordinate queued / page-based flushing when the guest wants to persist data. The similarity to DAX is not storing guest allocated pages in the radix but entries that track dirty guest physical addresses.
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
On Sun, 2017-07-23 at 09:01 -0700, Dan Williams wrote: > [ adding Ross and Jan ] > > On Sun, Jul 23, 2017 at 7:04 AM, Rik van Riel > wrote: > > > > The goal is to increase density of guests, by moving page > > cache into the host (where it can be easily reclaimed). > > > > If we assume the guests will be backed by relatively fast > > SSDs, a "whole device flush" from filesystem journaling > > code (issued where the filesystem issues a barrier or > > disk cache flush today) may be just what we need to make > > that work. > > Ok, apologies, I indeed had some pieces of the proposal confused. > > However, it still seems like the storage interface is not capable of > expressing what is needed, because the operation that is needed is a > range flush. In the guest you want the DAX page dirty tracking to > communicate range flush information to the host, but there's no > readily available block i/o semantic that software running on top of > the fake pmem device can use to communicate with the host. Instead > you > want to intercept the dax_flush() operation and turn it into a queued > request on the host. > > In 4.13 we have turned this dax_flush() operation into an explicit > driver call. That seems a better interface to modify than trying to > map block-storage flush-cache / force-unit-access commands to this > host request. > > The additional piece you would need to consider is whether to track > all writes in addition to mmap writes in the guest as DAX-page-cache > dirtying events, or arrange for every dax_copy_from_iter() > operation() > to also queue a sync on the host, but that essentially turns the host > page cache into a pseudo write-through mode. I suspect initially it will be fine to not offer DAX semantics to applications using these "fake DAX" devices from a virtual machine, because the DAX APIs are designed for a much higher performance device than these fake DAX setups could ever give. 
Having userspace call fsync/msync like done normally, and having those coarser calls be turned into somewhat efficient backend flushes would be perfectly acceptable. The big question is, what should that kind of interface look like?
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
[ adding Ross and Jan ] On Sun, Jul 23, 2017 at 7:04 AM, Rik van Riel wrote: > On Sat, 2017-07-22 at 12:34 -0700, Dan Williams wrote: >> On Fri, Jul 21, 2017 at 8:58 AM, Stefan Hajnoczi > > wrote: >> > >> > Maybe the NVDIMM folks can comment on this idea. >> >> I think it's unworkable to use the flush hints as a guest-to-host >> fsync mechanism. That mechanism was designed to flush small memory >> controller buffers, not large swaths of dirty memory. What about >> running the guests in a writethrough cache mode to avoid needing >> dirty >> cache management altogether? Either way I think you need to use >> device-dax on the host, or one of the two work-in-progress filesystem >> mechanisms (synchronous-faults or S_IOMAP_FROZEN) to avoid need any >> metadata coordination between guests and the host. > > The thing Pankaj is looking at is to use the DAX mechanisms > inside the guest (disk image as memory mapped nvdimm area), > with that disk image backed by a regular storage device on > the host. > > The goal is to increase density of guests, by moving page > cache into the host (where it can be easily reclaimed). > > If we assume the guests will be backed by relatively fast > SSDs, a "whole device flush" from filesystem journaling > code (issued where the filesystem issues a barrier or > disk cache flush today) may be just what we need to make > that work. Ok, apologies, I indeed had some pieces of the proposal confused. However, it still seems like the storage interface is not capable of expressing what is needed, because the operation that is needed is a range flush. In the guest you want the DAX page dirty tracking to communicate range flush information to the host, but there's no readily available block i/o semantic that software running on top of the fake pmem device can use to communicate with the host. Instead you want to intercept the dax_flush() operation and turn it into a queued request on the host. 
In 4.13 we have turned this dax_flush() operation into an explicit driver call. That seems a better interface to modify than trying to map block-storage flush-cache / force-unit-access commands to this host request. The additional piece you would need to consider is whether to track all writes in addition to mmap writes in the guest as DAX-page-cache dirtying events, or arrange for every dax_copy_from_iter() operation() to also queue a sync on the host, but that essentially turns the host page cache into a pseudo write-through mode.
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
On Sat, 2017-07-22 at 12:34 -0700, Dan Williams wrote: > On Fri, Jul 21, 2017 at 8:58 AM, Stefan Hajnoczi > wrote: > > > > Maybe the NVDIMM folks can comment on this idea. > > I think it's unworkable to use the flush hints as a guest-to-host > fsync mechanism. That mechanism was designed to flush small memory > controller buffers, not large swaths of dirty memory. What about > running the guests in a writethrough cache mode to avoid needing > dirty > cache management altogether? Either way I think you need to use > device-dax on the host, or one of the two work-in-progress filesystem > mechanisms (synchronous-faults or S_IOMAP_FROZEN) to avoid need any > metadata coordination between guests and the host. The thing Pankaj is looking at is to use the DAX mechanisms inside the guest (disk image as memory mapped nvdimm area), with that disk image backed by a regular storage device on the host. The goal is to increase density of guests, by moving page cache into the host (where it can be easily reclaimed). If we assume the guests will be backed by relatively fast SSDs, a "whole device flush" from filesystem journaling code (issued where the filesystem issues a barrier or disk cache flush today) may be just what we need to make that work.
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
On Fri, Jul 21, 2017 at 8:58 AM, Stefan Hajnoczi wrote: > On Fri, Jul 21, 2017 at 09:29:15AM -0400, Pankaj Gupta wrote: >> >> > > A] Problems to solve: >> > > -- >> > > >> > > 1] We are considering two approaches for 'fake DAX flushing interface'. >> > > >> > > 1.1] fake dax with NVDIMM flush hints & KVM async page fault >> > > >> > > - Existing interface. >> > > >> > > - The approach to use flush hint address is already nacked upstream. >> > > >> > > - Flush hint not queued interface for flushing. Applications might >> > >avoid to use it. >> > >> > This doesn't contradicts the last point about async operation and vcpu >> > control. KVM async page faults turn the Address Flush Hints write into >> > an async operation so the guest can get other work done while waiting >> > for completion. >> > >> > > >> > > - Flush hint address traps from guest to host and do an entire fsync >> > >on backing file which itself is costly. >> > > >> > > - Can be used to flush specific pages on host backing disk. We can >> > >send data(pages information) equal to cache-line size(limitation) >> > >and tell host to sync corresponding pages instead of entire disk >> > >sync. >> > >> > Are you sure? Your previous point says only the entire device can be >> > synced. The NVDIMM Adress Flush Hints interface does not involve >> > address range information. >> >> Just syncing entire block device should be simple but costly. Using flush >> hint address to write data which contains list/info of dirty pages to >> flush requires more thought. This calls mmio write callback at Qemu side. >> As per Intel (ACPI spec 6.1, Table 5-135) there is limit to max length >> of data guest can write and is equal to cache line size. >> >> > >> > > >> > > - This will be an asynchronous operation and vCPU control is >> > > returned >> > >quickly. 
>> > > >> > > >> > > 1.2] Using additional para virt device in addition to pmem device(fake >> > > dax >> > > with device flush) >> > >> > Perhaps this can be exposed via ACPI as part of the NVDIMM standards >> > instead of a separate KVM-only paravirt device. >> >> Same reason as above. If we decide on sending list of dirty pages there is >> limit to send max size of data to host using flush hint address. > > I understand now: you are proposing to change the semantics of the > Address Flush Hints interface. You want the value written to have > meaning (the address range that needs to be flushed). > > Today the spec says: > > The content of the data is not relevant to the functioning of the > flush hint mechanism. > > Maybe the NVDIMM folks can comment on this idea. I think it's unworkable to use the flush hints as a guest-to-host fsync mechanism. That mechanism was designed to flush small memory controller buffers, not large swaths of dirty memory. What about running the guests in a writethrough cache mode to avoid needing dirty cache management altogether? Either way I think you need to use device-dax on the host, or one of the two work-in-progress filesystem mechanisms (synchronous-faults or S_IOMAP_FROZEN) to avoid need any metadata coordination between guests and the host.
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
On Fri, Jul 21, 2017 at 09:29:15AM -0400, Pankaj Gupta wrote: > > > > A] Problems to solve: > > > -- > > > > > > 1] We are considering two approaches for 'fake DAX flushing interface'. > > > > > > 1.1] fake dax with NVDIMM flush hints & KVM async page fault > > > > > > - Existing interface. > > > > > > - The approach to use flush hint address is already nacked upstream. > > > > > > - Flush hint not queued interface for flushing. Applications might > > >avoid to use it. > > > > This doesn't contradicts the last point about async operation and vcpu > > control. KVM async page faults turn the Address Flush Hints write into > > an async operation so the guest can get other work done while waiting > > for completion. > > > > > > > > - Flush hint address traps from guest to host and do an entire fsync > > >on backing file which itself is costly. > > > > > > - Can be used to flush specific pages on host backing disk. We can > > >send data(pages information) equal to cache-line size(limitation) > > >and tell host to sync corresponding pages instead of entire disk > > >sync. > > > > Are you sure? Your previous point says only the entire device can be > > synced. The NVDIMM Adress Flush Hints interface does not involve > > address range information. > > Just syncing entire block device should be simple but costly. Using flush > hint address to write data which contains list/info of dirty pages to > flush requires more thought. This calls mmio write callback at Qemu side. > As per Intel (ACPI spec 6.1, Table 5-135) there is limit to max length > of data guest can write and is equal to cache line size. > > > > > > > > > - This will be an asynchronous operation and vCPU control is returned > > >quickly. > > > > > > > > > 1.2] Using additional para virt device in addition to pmem device(fake > > > dax > > > with device flush) > > > > Perhaps this can be exposed via ACPI as part of the NVDIMM standards > > instead of a separate KVM-only paravirt device. > > Same reason as above. 
If we decide on sending list of dirty pages there is > limit to send max size of data to host using flush hint address. I understand now: you are proposing to change the semantics of the Address Flush Hints interface. You want the value written to have meaning (the address range that needs to be flushed). Today the spec says: The content of the data is not relevant to the functioning of the flush hint mechanism. Maybe the NVDIMM folks can comment on this idea.
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
On Fri, 2017-07-21 at 09:29 -0400, Pankaj Gupta wrote: > > > > > > - Flush hint address traps from guest to host and do an > > > entire fsync > > > on backing file which itself is costly. > > > > > > - Can be used to flush specific pages on host backing disk. > > > We can > > > send data(pages information) equal to cache-line > > > size(limitation) > > > and tell host to sync corresponding pages instead of > > > entire disk > > > sync. > > > > Are you sure? Your previous point says only the entire device can > > be > > synced. The NVDIMM Adress Flush Hints interface does not involve > > address range information. > > Just syncing entire block device should be simple but costly. Costly depends on just how fast the backing IO device is. If the backing IO is a spinning disk, doing targeted range syncs will certainly be faster. On the other hand, if the backing IO is one of the latest generation SSD devices, it may be faster to have just one hypercall and flush everything, than it would be to have separate sync calls for each range that we want flushed. Should we design our interfaces for yesterday's storage devices, or for tomorrow's storage devices?
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
> > A] Problems to solve:
> > --
> >
> > 1] We are considering two approaches for the 'fake DAX flushing interface'.
> >
> > 1.1] fake DAX with NVDIMM flush hints & KVM async page fault
> >
> >  - Existing interface.
> >
> >  - The approach of using the flush hint address has already been nacked
> >    upstream.
> >
> >  - The flush hint is not a queued interface for flushing. Applications
> >    might avoid using it.
>
> This doesn't contradict the last point about async operation and vCPU
> control. KVM async page faults turn the Address Flush Hints write into
> an async operation so the guest can get other work done while waiting
> for completion.
>
> >  - A flush hint address write traps from guest to host and does an entire
> >    fsync on the backing file, which is itself costly.
> >
> >  - Can be used to flush specific pages on the host backing disk. We can
> >    send data (page information) up to the cache-line size (a limitation)
> >    and tell the host to sync the corresponding pages instead of syncing
> >    the entire disk.
>
> Are you sure? Your previous point says only the entire device can be
> synced. The NVDIMM Address Flush Hints interface does not involve
> address range information.

Just syncing the entire block device should be simple but costly. Using the
flush hint address to write data which contains a list of dirty pages to
flush requires more thought. This calls the MMIO write callback on the QEMU
side. As per Intel (ACPI spec 6.1, Table 5-135) there is a limit on the max
length of data the guest can write, equal to the cache-line size.

> >  - This will be an asynchronous operation and vCPU control is returned
> >    quickly.
> >
> > 1.2] Using an additional paravirt device in addition to the pmem device
> >      (fake DAX with device flush)
>
> Perhaps this can be exposed via ACPI as part of the NVDIMM standards
> instead of a separate KVM-only paravirt device.

Same reason as above. If we decide on sending a list of dirty pages, there
is a limit on the max size of data sent to the host using the flush hint
address.
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
On Fri, Jul 21, 2017 at 02:56:34AM -0400, Pankaj Gupta wrote:
> A] Problems to solve:
> --
>
> 1] We are considering two approaches for the 'fake DAX flushing interface'.
>
> 1.1] fake DAX with NVDIMM flush hints & KVM async page fault
>
>  - Existing interface.
>
>  - The approach of using the flush hint address has already been nacked
>    upstream.
>
>  - The flush hint is not a queued interface for flushing. Applications
>    might avoid using it.

This doesn't contradict the last point about async operation and vCPU
control. KVM async page faults turn the Address Flush Hints write into
an async operation so the guest can get other work done while waiting
for completion.

>  - A flush hint address write traps from guest to host and does an entire
>    fsync on the backing file, which is itself costly.
>
>  - Can be used to flush specific pages on the host backing disk. We can
>    send data (page information) up to the cache-line size (a limitation)
>    and tell the host to sync the corresponding pages instead of syncing
>    the entire disk.

Are you sure? Your previous point says only the entire device can be
synced. The NVDIMM Address Flush Hints interface does not involve
address range information.

>  - This will be an asynchronous operation and vCPU control is returned
>    quickly.
>
> 1.2] Using an additional paravirt device in addition to the pmem device
>      (fake DAX with device flush)

Perhaps this can be exposed via ACPI as part of the NVDIMM standards
instead of a separate KVM-only paravirt device.
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
> > Hello,
> >
> > We shared a proposal for the 'KVM fake DAX flushing interface'.
> >
> > https://lists.gnu.org/archive/html/qemu-devel/2017-05/msg02478.html
>
> In the above link:
>
>   "Overall goal of project is to increase the number of virtual machines
>    that can be run on a physical machine, in order to *increase the
>    density* of customer virtual machines"
>
> Is the fake persistent memory used as normal RAM in the guest? If not,
> how is it expected to be used in the guest?

Yes, the guest will have an NVDIMM DAX device and not use the page cache
for most operations. The host will manage the memory requirements of all
the guests.

> > We did an initial POC in which we used a 'virtio-blk' device to perform
> > a device flush on pmem fsync on an ext4 filesystem. There are a few
> > hacks to make things work. We need suggestions on the points below
> > before we start the actual implementation.
> >
> > A] Problems to solve:
> > --
> >
> > 1] We are considering two approaches for the 'fake DAX flushing interface'.
> >
> > 1.1] fake DAX with NVDIMM flush hints & KVM async page fault
> >
> >  - Existing interface.
> >
> >  - The approach of using the flush hint address has already been nacked
> >    upstream.
> >
> >  - The flush hint is not a queued interface for flushing. Applications
> >    might avoid using it.
> >
> >  - A flush hint address write traps from guest to host and does an entire
> >    fsync on the backing file, which is itself costly.
> >
> >  - Can be used to flush specific pages on the host backing disk. We can
> >    send data (page information) up to the cache-line size (a limitation)
> >    and tell the host to sync the corresponding pages instead of syncing
> >    the entire disk.
> >
> >  - This will be an asynchronous operation and vCPU control is returned
> >    quickly.
> >
> > 1.2] Using an additional paravirt device in addition to the pmem device
> >      (fake DAX with device flush)
> >
> >  - New interface.
> >
> >  - The guest maintains information about DAX dirty pages as exceptional
> >    entries in the radix tree.
> >
> >  - If we want to flush specific pages from guest to host, we need to send
> >    a list of the dirty pages corresponding to the file on which we are
> >    doing fsync.
> >
> >  - This will require implementation of a new interface: a new paravirt
> >    device for sending flush requests.
> >
> >  - The host side will perform fsync/fdatasync on the list of dirty pages
> >    or on the entire file backing the block device.
> >
> > 2] Questions:
> > ---
> >
> > 2.1] Not sure why the WPQ flush is not a queued interface. Can we force
> >      applications to call it? Device DAX doesn't call fsync/msync either.
> >
> > 2.2] Depending on the interface we decide on, we need an optimal solution
> >      to sync a range of pages:
> >
> >      - Send a range of pages from guest to host to sync asynchronously
> >        instead of syncing the entire block device?
>
> e.g. a new virtio device to deliver sync requests to the host?
>
> >      - The other option is to sync the entire file backing the disk to
> >        make sure all the writes are persistent. In our case, the backing
> >        file is a regular file on a non-NVDIMM device, so the host page
> >        cache has the list of dirty pages, which can be used with fsync
> >        or a similar interface.
>
> As the number of dirty pages can vary, the latency of each host fsync is
> likely to vary over a large range.
>
> > 2.3] If we do a host fsync on the entire disk, we will be flushing all
> >      the dirty data to the backend file. Just thinking which would be
> >      the better approach: flushing pages on the corresponding guest file
> >      fsync, or the entire block device?
> >
> > 2.4] If we decide to choose one of the above approaches, we need to
> >      consider all DAX-supporting filesystems (ext4/xfs). Does hooking
> >      code into the corresponding fsync code of the filesystem seem
> >      reasonable? Just thinking for the flush hint address use case.
> >      Or how would flush hint addresses be invoked with fsync or a
> >      similar API?
> >
> > 2.5] Also, with filesystem journalling and other mount options like
> >      barriers, ordered, etc., how do we decide whether to use the page
> >      flush hint or a regular fsync on the file?
> >
> > 2.6] If at the guest side we have the PFNs of all the dirty pages in
> >      the radix tree and we send these to the host, would the host side
> >      be able to find the corresponding pages and flush them all?
>
> That may require that the host file system provide an API to flush
> specified blocks/extents and their metadata in the file system. I'm not
> familiar with this part and don't know whether such an API exists.
>
> Haozhong
Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion
On 07/21/17 02:56 -0400, Pankaj Gupta wrote:
>
> Hello,
>
> We shared a proposal for the 'KVM fake DAX flushing interface'.
>
> https://lists.gnu.org/archive/html/qemu-devel/2017-05/msg02478.html
>

In the above link:

  "Overall goal of project is to increase the number of virtual machines
   that can be run on a physical machine, in order to *increase the
   density* of customer virtual machines"

Is the fake persistent memory used as normal RAM in the guest? If not, how
is it expected to be used in the guest?

> We did an initial POC in which we used a 'virtio-blk' device to perform
> a device flush on pmem fsync on an ext4 filesystem. There are a few hacks
> to make things work. We need suggestions on the points below before we
> start the actual implementation.
>
> A] Problems to solve:
> --
>
> 1] We are considering two approaches for the 'fake DAX flushing interface'.
>
> 1.1] fake DAX with NVDIMM flush hints & KVM async page fault
>
>  - Existing interface.
>
>  - The approach of using the flush hint address has already been nacked
>    upstream.
>
>  - The flush hint is not a queued interface for flushing. Applications
>    might avoid using it.
>
>  - A flush hint address write traps from guest to host and does an entire
>    fsync on the backing file, which is itself costly.
>
>  - Can be used to flush specific pages on the host backing disk. We can
>    send data (page information) up to the cache-line size (a limitation)
>    and tell the host to sync the corresponding pages instead of syncing
>    the entire disk.
>
>  - This will be an asynchronous operation and vCPU control is returned
>    quickly.
>
> 1.2] Using an additional paravirt device in addition to the pmem device
>      (fake DAX with device flush)
>
>  - New interface.
>
>  - The guest maintains information about DAX dirty pages as exceptional
>    entries in the radix tree.
>
>  - If we want to flush specific pages from guest to host, we need to send
>    a list of the dirty pages corresponding to the file on which we are
>    doing fsync.
>
>  - This will require implementation of a new interface: a new paravirt
>    device for sending flush requests.
>
>  - The host side will perform fsync/fdatasync on the list of dirty pages
>    or on the entire file backing the block device.
>
> 2] Questions:
> ---
>
> 2.1] Not sure why the WPQ flush is not a queued interface. Can we force
>      applications to call it? Device DAX doesn't call fsync/msync either.
>
> 2.2] Depending on the interface we decide on, we need an optimal solution
>      to sync a range of pages:
>
>      - Send a range of pages from guest to host to sync asynchronously
>        instead of syncing the entire block device?

e.g. a new virtio device to deliver sync requests to the host?

>      - The other option is to sync the entire file backing the disk to
>        make sure all the writes are persistent. In our case, the backing
>        file is a regular file on a non-NVDIMM device, so the host page
>        cache has the list of dirty pages, which can be used with fsync or
>        a similar interface.

As the number of dirty pages can vary, the latency of each host fsync is
likely to vary over a large range.

> 2.3] If we do a host fsync on the entire disk, we will be flushing all the
>      dirty data to the backend file. Just thinking which would be the
>      better approach: flushing pages on the corresponding guest file
>      fsync, or the entire block device?
>
> 2.4] If we decide to choose one of the above approaches, we need to
>      consider all DAX-supporting filesystems (ext4/xfs). Does hooking
>      code into the corresponding fsync code of the filesystem seem
>      reasonable? Just thinking for the flush hint address use case.
>      Or how would flush hint addresses be invoked with fsync or a similar
>      API?
>
> 2.5] Also, with filesystem journalling and other mount options like
>      barriers, ordered, etc., how do we decide whether to use the page
>      flush hint or a regular fsync on the file?
>
> 2.6] If at the guest side we have the PFNs of all the dirty pages in the
>      radix tree and we send these to the host, would the host side be
>      able to find the corresponding pages and flush them all?

That may require that the host file system provide an API to flush
specified blocks/extents and their metadata in the file system. I'm not
familiar with this part and don't know whether such an API exists.

Haozhong
[Qemu-devel] KVM "fake DAX" flushing interface - discussion
Hello,

We shared a proposal for the 'KVM fake DAX flushing interface':

https://lists.gnu.org/archive/html/qemu-devel/2017-05/msg02478.html

We did an initial POC in which we used a 'virtio-blk' device to perform
a device flush on pmem fsync on an ext4 filesystem. There are a few hacks
to make things work. We need suggestions on the points below before we
start the actual implementation.

A] Problems to solve:
--

1] We are considering two approaches for the 'fake DAX flushing interface'.

1.1] fake DAX with NVDIMM flush hints & KVM async page fault

 - Existing interface.

 - The approach of using the flush hint address has already been nacked
   upstream.

 - The flush hint is not a queued interface for flushing. Applications
   might avoid using it.

 - A flush hint address write traps from guest to host and does an entire
   fsync on the backing file, which is itself costly.

 - Can be used to flush specific pages on the host backing disk. We can
   send data (page information) up to the cache-line size (a limitation)
   and tell the host to sync the corresponding pages instead of syncing
   the entire disk.

 - This will be an asynchronous operation and vCPU control is returned
   quickly.

1.2] Using an additional paravirt device in addition to the pmem device
     (fake DAX with device flush)

 - New interface.

 - The guest maintains information about DAX dirty pages as exceptional
   entries in the radix tree.

 - If we want to flush specific pages from guest to host, we need to send
   a list of the dirty pages corresponding to the file on which we are
   doing fsync.

 - This will require implementation of a new interface: a new paravirt
   device for sending flush requests.

 - The host side will perform fsync/fdatasync on the list of dirty pages
   or on the entire file backing the block device.

2] Questions:
---

2.1] Not sure why the WPQ flush is not a queued interface. Can we force
     applications to call it? Device DAX doesn't call fsync/msync either.

2.2] Depending on the interface we decide on, we need an optimal solution
     to sync a range of pages:

     - Send a range of pages from guest to host to sync asynchronously
       instead of syncing the entire block device?

     - The other option is to sync the entire file backing the disk to
       make sure all the writes are persistent. In our case, the backing
       file is a regular file on a non-NVDIMM device, so the host page
       cache has the list of dirty pages, which can be used with fsync or
       a similar interface.

2.3] If we do a host fsync on the entire disk, we will be flushing all the
     dirty data to the backend file. Just thinking which would be the
     better approach: flushing pages on the corresponding guest file fsync,
     or the entire block device?

2.4] If we decide to choose one of the above approaches, we need to
     consider all DAX-supporting filesystems (ext4/xfs). Does hooking code
     into the corresponding fsync code of the filesystem seem reasonable?
     Just thinking for the flush hint address use case. Or how would flush
     hint addresses be invoked with fsync or a similar API?

2.5] Also, with filesystem journalling and other mount options like
     barriers, ordered, etc., how do we decide whether to use the page
     flush hint or a regular fsync on the file?

2.6] If at the guest side we have the PFNs of all the dirty pages in the
     radix tree and we send these to the host, would the host side be able
     to find the corresponding pages and flush them all?

Suggestions & ideas are welcome.

Thanks,
Pankaj