Re: [Qemu-devel] [PATCH v4 01/13] vfio: KABI for migration interface

Alex Williamson Fri, 21 Jun 2019 15:02:44 -0700

On Sat, 22 Jun 2019 02:00:08 +0530
Kirti Wankhede <kwankh...@nvidia.com> wrote:
> On 6/22/2019 1:30 AM, Alex Williamson wrote:
> > On Sat, 22 Jun 2019 01:05:48 +0530
> > Kirti Wankhede <kwankh...@nvidia.com> wrote:
> >   
> >> On 6/21/2019 8:33 PM, Alex Williamson wrote:  
> >>> On Fri, 21 Jun 2019 11:22:15 +0530
> >>> Kirti Wankhede <kwankh...@nvidia.com> wrote:
> >>>     
> >>>> On 6/20/2019 10:48 PM, Alex Williamson wrote:    
> >>>>> On Thu, 20 Jun 2019 20:07:29 +0530
> >>>>> Kirti Wankhede <kwankh...@nvidia.com> wrote:
> >>>>>       
> >>>>>> - Defined MIGRATION region type and sub-type.
> >>>>>> - Used 3 bits to define VFIO device states.
> >>>>>>     Bit 0 => _RUNNING
> >>>>>>     Bit 1 => _SAVING
> >>>>>>     Bit 2 => _RESUMING
> >>>>>>     Combination of these bits defines VFIO device's state during 
> >>>>>> migration
> >>>>>>     _STOPPED => All bits 0 indicates VFIO device stopped.
> >>>>>>     _RUNNING => Normal VFIO device running state.
> >>>>>>     _SAVING | _RUNNING => vCPUs are running, VFIO device is running 
> >>>>>> but start
> >>>>>>                           saving state of device i.e. pre-copy state
> >>>>>>     _SAVING  => vCPUs are stoppped, VFIO device should be stopped, and
> >>>>>>                           save device state,i.e. stop-n-copy state
> >>>>>>     _RESUMING => VFIO device resuming state.
> >>>>>>     _SAVING | _RESUMING => Invalid state if _SAVING and _RESUMING bits 
> >>>>>> are set
> >>>>>> - Defined vfio_device_migration_info structure which will be placed at 
> >>>>>> 0th
> >>>>>>   offset of migration region to get/set VFIO device related 
> >>>>>> information.
> >>>>>>   Defined members of structure and usage on read/write access:
> >>>>>>     * device_state: (read/write)
> >>>>>>         To convey VFIO device state to be transitioned to. Only 3 bits 
> >>>>>> are used
> >>>>>>         as of now.
> >>>>>>     * pending bytes: (read only)
> >>>>>>         To get pending bytes yet to be migrated for VFIO device.
> >>>>>>     * data_offset: (read only)
> >>>>>>         To get data offset in migration from where data exist during 
> >>>>>> _SAVING
> >>>>>>         and from where data should be written by user space 
> >>>>>> application during
> >>>>>>          _RESUMING state
> >>>>>>     * data_size: (read/write)
> >>>>>>         To get and set size of data copied in migration region during 
> >>>>>> _SAVING
> >>>>>>         and _RESUMING state.
> >>>>>>     * start_pfn, page_size, total_pfns: (write only)
> >>>>>>         To get bitmap of dirty pages from vendor driver from given
> >>>>>>         start address for total_pfns.
> >>>>>>     * copied_pfns: (read only)
> >>>>>>         To get number of pfns bitmap copied in migration region.
> >>>>>>         Vendor driver should copy the bitmap with bits set only for
> >>>>>>         pages to be marked dirty in migration region. Vendor driver
> >>>>>>         should return 0 if there are 0 pages dirty in requested
> >>>>>>         range. Vendor driver should return -1 to mark all pages in the 
> >>>>>> section
> >>>>>>         as dirty
> >>>>>>
> >>>>>> Migration region looks like:
> >>>>>>  ------------------------------------------------------------------
> >>>>>> |vfio_device_migration_info|    data section                      |
> >>>>>> |                          |     ///////////////////////////////  |
> >>>>>>  ------------------------------------------------------------------
> >>>>>>  ^                              ^                              ^
> >>>>>>  offset 0-trapped part        data_offset                 data_size
> >>>>>>
> >>>>>> Data section is always followed by vfio_device_migration_info
> >>>>>> structure in the region, so data_offset will always be none-0.
> >>>>>> Offset from where data is copied is decided by kernel driver, data
> >>>>>> section can be trapped or mapped depending on how kernel driver
> >>>>>> defines data section. If mmapped, then data_offset should be page
> >>>>>> aligned, where as initial section which contain
> >>>>>> vfio_device_migration_info structure might not end at offset which
> >>>>>> is page aligned.
> >>>>>>
> >>>>>> Signed-off-by: Kirti Wankhede <kwankh...@nvidia.com>
> >>>>>> Reviewed-by: Neo Jia <c...@nvidia.com>
> >>>>>> ---
> >>>>>>  linux-headers/linux/vfio.h | 71 
> >>>>>> ++++++++++++++++++++++++++++++++++++++++++++++
> >>>>>>  1 file changed, 71 insertions(+)
> >>>>>>
> >>>>>> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
> >>>>>> index 24f505199f83..274ec477eb82 100644
> >>>>>> --- a/linux-headers/linux/vfio.h
> >>>>>> +++ b/linux-headers/linux/vfio.h
> >>>>>> @@ -372,6 +372,77 @@ struct vfio_region_gfx_edid {
> >>>>>>   */
> >>>>>>  #define VFIO_REGION_SUBTYPE_IBM_NVLINK2_ATSD  (1)
> >>>>>>  
> >>>>>> +/* Migration region type and sub-type */
> >>>>>> +#define VFIO_REGION_TYPE_MIGRATION            (2)
> >>>>>> +#define VFIO_REGION_SUBTYPE_MIGRATION         (1)
> >>>>>> +
> >>>>>> +/**
> >>>>>> + * Structure vfio_device_migration_info is placed at 0th offset of
> >>>>>> + * VFIO_REGION_SUBTYPE_MIGRATION region to get/set VFIO device 
> >>>>>> related migration
> >>>>>> + * information. Field accesses from this structure are only supported 
> >>>>>> at their
> >>>>>> + * native width and alignment, otherwise should return error.
> >>>>>> + *
> >>>>>> + * device_state: (read/write)
> >>>>>> + *      To indicate vendor driver the state VFIO device should be 
> >>>>>> transitioned
> >>>>>> + *      to. If device state transition fails, write to this field 
> >>>>>> return error.
> >>>>>> + *      It consists of 3 bits:
> >>>>>> + *      - If bit 0 set, indicates _RUNNING state. When its reset, 
> >>>>>> that indicates
> >>>>>> + *        _STOPPED state. When device is changed to _STOPPED, driver 
> >>>>>> should stop
> >>>>>> + *        device before write returns.
> >>>>>> + *      - If bit 1 set, indicates _SAVING state.
> >>>>>> + *      - If bit 2 set, indicates _RESUMING state.
> >>>>>> + *
> >>>>>> + * pending bytes: (read only)
> >>>>>> + *      Read pending bytes yet to be migrated from vendor driver
> >>>>>> + *
> >>>>>> + * data_offset: (read only)
> >>>>>> + *      User application should read data_offset in migration region 
> >>>>>> from where
> >>>>>> + *      user application should read data during _SAVING state or 
> >>>>>> write data
> >>>>>> + *      during _RESUMING state.
> >>>>>> + *
> >>>>>> + * data_size: (read/write)
> >>>>>> + *      User application should read data_size to know data copied in 
> >>>>>> migration
> >>>>>> + *      region during _SAVING state and write size of data copied in 
> >>>>>> migration
> >>>>>> + *      region during _RESUMING state.
> >>>>>> + *
> >>>>>> + * start_pfn: (write only)
> >>>>>> + *      Start address pfn to get bitmap of dirty pages from vendor 
> >>>>>> driver duing
> >>>>>> + *      _SAVING state.
> >>>>>> + *
> >>>>>> + * page_size: (write only)
> >>>>>> + *      User application should write the page_size of pfn.
> >>>>>> + *
> >>>>>> + * total_pfns: (write only)
> >>>>>> + *      Total pfn count from start_pfn for which dirty bitmap is 
> >>>>>> requested.
> >>>>>> + *
> >>>>>> + * copied_pfns: (read only)
> >>>>>> + *      pfn count for which dirty bitmap is copied to migration 
> >>>>>> region.
> >>>>>> + *      Vendor driver should copy the bitmap with bits set only for 
> >>>>>> pages to be
> >>>>>> + *      marked dirty in migration region.
> >>>>>> + *      Vendor driver should return 0 if there are 0 pages dirty in 
> >>>>>> requested
> >>>>>> + *      range.
> >>>>>> + *      Vendor driver should return -1 to mark all pages in the 
> >>>>>> section as
> >>>>>> + *      dirty.      
> >>>>>
> >>>>> Is the protocol that the user writes start_pfn/page_size/total_pfns in
> >>>>> any order and then the read of copied_pfns is what triggers the
> >>>>> snapshot?      
> >>>>
> >>>> Yes.
> >>>>    
> >>>>>  Are start_pfn/page_size/total_pfns sticky such that a user
> >>>>> can write them once and get repeated refreshes of the dirty bitmap by
> >>>>> re-reading copied_pfns?      
> >>>>
> >>>> Yes and that bitmap should be for given range (from start_pfn till
> >>>> start_pfn + tolal_pfns).
> >>>> Re-reading of copied_pfns is to handle the case where it might be
> >>>> possible that vendor driver reserved area for bitmap < total bitmap size
> >>>> for range (start_pfn to start_pfn + tolal_pfns), then user will have to
> >>>> iterate till copied_pfns == total_pfns or till copied_pfns == 0 (that
> >>>> is, there are no pages dirty in rest of the range)    
> >>>
> >>> So reading copied_pfns triggers the data range to be updated, but the
> >>> caller cannot assume it to be synchronous and uses total_pfns to poll
> >>> that the update is complete?  How does the vendor driver differentiate
> >>> the user polling for the previous update to finish versus requesting a
> >>> new update?
> >>>     
> >>
> >> Write on start_pfn/page_size/total_pfns, then read on copied_pfns
> >> indicates new update, where as sequential read on copied_pfns indicates
> >> polling for previous update.  
> > 
> > Hmm, this seems to contradict the answer to my question above where I
> > ask if the write fields are sticky so a user can trigger a refresh via
> > copied_pfns.  
> 
> Sorry, how its contradict? pasting it again below:
> >>>>>  Are start_pfn/page_size/total_pfns sticky such that a user
> >>>>> can write them once and get repeated refreshes of the dirty bitmap by
> >>>>> re-reading copied_pfns?  
> >>>>
> >>>> Yes and that bitmap should be for given range (from start_pfn till
> >>>> start_pfn + tolal_pfns).
> >>>> Re-reading of copied_pfns is to handle the case where it might be
> >>>> possible that vendor driver reserved area for bitmap < total bitmap  
> size
> >>>> for range (start_pfn to start_pfn + tolal_pfns), then user will have to
> >>>> iterate till copied_pfns == total_pfns or till copied_pfns == 0 (that
> >>>> is, there are no pages dirty in rest of the range)


Sorry, I guess I misinterpreted again.  So the vendor driver can return
copied_pfns < total_pfns if it has a buffer limitation, not as an
indication of its background progress in writing out the bitmap.  Just
as a proof of concept, let's say the vendor driver has a 1 bit buffer
and I write 0 to start_pfn and 3 to total_pfns.  I read copied_pfns,
which returns 1, so I read data_offset to find where this 1 bit is
located and then read my bit from that location.  This is the dirty
state of the first pfn.  I read copied_pfns again and it reports 2, so
I again read data_offset to find where the data is located, and it's my
job to remember that I've already read 1 bit, so 2 means there's only 1
bit available and it's the second pfn.  I read the bit.  I again read
copied_pfns, which now reports 3, I read data_offset to find the
location of the data, I remember that I've already read 2 bits, so I
read my bit into the 3rd pfn.  This seems rather clumsy.

Now that copied_pfns == total_pfns, what happens if I read copied_pfns
again?  This is actually what I thought I was asking previously.

Should we expose the pfn buffer size and fault on writes of larger than that
size, requiring the user to iterate start_pfn themselves?  Are there
any operations where the user can assume data_offset is constant?  Thanks,

Alex

Re: [Qemu-devel] [PATCH v4 01/13] vfio: KABI for migration interface

Reply via email to