On Tue, 25 Jun 2019 00:22:16 +0530
Kirti Wankhede <kwankh...@nvidia.com> wrote:
> On 6/24/2019 8:55 PM, Alex Williamson wrote:
> > On Mon, 24 Jun 2019 20:30:08 +0530
> > Kirti Wankhede <kwankh...@nvidia.com> wrote:
> >
> >> On 6/22/2019 3:31 AM, Alex Williamson wrote:
> >>> On Sat, 22 Jun 2019 02:00:08 +0530
> >>> Kirti Wankhede <kwankh...@nvidia.com> wrote:
> >>>> On 6/22/2019 1:30 AM, Alex Williamson wrote:
> >>>>> On Sat, 22 Jun 2019 01:05:48 +0530
> >>>>> Kirti Wankhede <kwankh...@nvidia.com> wrote:
> >>>>>
> >>>>>> On 6/21/2019 8:33 PM, Alex Williamson wrote:
> >>>>>>> On Fri, 21 Jun 2019 11:22:15 +0530
> >>>>>>> Kirti Wankhede <kwankh...@nvidia.com> wrote:
> >>>>>>>
> >>>>>>>> On 6/20/2019 10:48 PM, Alex Williamson wrote:
> >>>>>>>>> On Thu, 20 Jun 2019 20:07:29 +0530
> >>>>>>>>> Kirti Wankhede <kwankh...@nvidia.com> wrote:
> >>>>>>>>>
> >>>>>>>>>> - Defined MIGRATION region type and sub-type.
> >>>>>>>>>> - Used 3 bits to define VFIO device states.
> >>>>>>>>>>     Bit 0 => _RUNNING
> >>>>>>>>>>     Bit 1 => _SAVING
> >>>>>>>>>>     Bit 2 => _RESUMING
> >>>>>>>>>>   Combination of these bits defines the VFIO device's state during
> >>>>>>>>>>   migration:
> >>>>>>>>>>     _STOPPED            => All bits 0 indicates the VFIO device is
> >>>>>>>>>>                            stopped.
> >>>>>>>>>>     _RUNNING            => Normal VFIO device running state.
> >>>>>>>>>>     _SAVING | _RUNNING  => vCPUs are running and the VFIO device is
> >>>>>>>>>>                            running, but start saving device state,
> >>>>>>>>>>                            i.e. pre-copy state.
> >>>>>>>>>>     _SAVING             => vCPUs are stopped, the VFIO device should
> >>>>>>>>>>                            be stopped and its state saved, i.e.
> >>>>>>>>>>                            stop-and-copy state.
> >>>>>>>>>>     _RESUMING           => VFIO device resuming state.
> >>>>>>>>>>     _SAVING | _RESUMING => Invalid state; _SAVING and _RESUMING must
> >>>>>>>>>>                            not be set together.
> >>>>>>>>>> - Defined the vfio_device_migration_info structure which will be
> >>>>>>>>>>   placed at offset 0 of the migration region to get/set VFIO device
> >>>>>>>>>>   related information.  Defined members of the structure and usage on
> >>>>>>>>>>   read/write access:
> >>>>>>>>>>   * device_state: (read/write)
> >>>>>>>>>>       To convey the VFIO device state to be transitioned to.  Only 3
> >>>>>>>>>>       bits are used as of now.
> >>>>>>>>>>   * pending_bytes: (read only)
> >>>>>>>>>>       To get pending bytes yet to be migrated for the VFIO device.
> >>>>>>>>>>   * data_offset: (read only)
> >>>>>>>>>>       To get the offset in the migration region from where data exists
> >>>>>>>>>>       during _SAVING and where data should be written by the user space
> >>>>>>>>>>       application during the _RESUMING state.
> >>>>>>>>>>   * data_size: (read/write)
> >>>>>>>>>>       To get and set the size of data copied in the migration region
> >>>>>>>>>>       during the _SAVING and _RESUMING states.
> >>>>>>>>>>   * start_pfn, page_size, total_pfns: (write only)
> >>>>>>>>>>       To get the bitmap of dirty pages from the vendor driver from the
> >>>>>>>>>>       given start address for total_pfns.
> >>>>>>>>>>   * copied_pfns: (read only)
> >>>>>>>>>>       To get the number of pfns for which the bitmap has been copied to
> >>>>>>>>>>       the migration region.  The vendor driver should copy the bitmap
> >>>>>>>>>>       with bits set only for pages to be marked dirty in the migration
> >>>>>>>>>>       region.  The vendor driver should return 0 if there are 0 pages
> >>>>>>>>>>       dirty in the requested range.  The vendor driver should return -1
> >>>>>>>>>>       to mark all pages in the section as dirty.
> >>>>>>>>>>
> >>>>>>>>>> Migration region looks like:
> >>>>>>>>>>  ------------------------------------------------------------------
> >>>>>>>>>>  |vfio_device_migration_info|          data section                |
> >>>>>>>>>>  |                          |  ///////////////////////////////    |
> >>>>>>>>>>  ------------------------------------------------------------------
> >>>>>>>>>>  ^                          ^                                  ^
> >>>>>>>>>>  offset 0-trapped part      data_offset                        data_size
> >>>>>>>>>>
> >>>>>>>>>> The data section always follows the vfio_device_migration_info
> >>>>>>>>>> structure in the region, so data_offset will always be non-zero.  The
> >>>>>>>>>> offset from where data is copied is decided by the kernel driver; the
> >>>>>>>>>> data section can be trapped or mmapped depending on how the kernel
> >>>>>>>>>> driver defines it.  If mmapped, data_offset should be page aligned,
> >>>>>>>>>> whereas the initial section which contains the
> >>>>>>>>>> vfio_device_migration_info structure might not end at a page-aligned
> >>>>>>>>>> offset.
> >>>>>>>>>>
> >>>>>>>>>> Signed-off-by: Kirti Wankhede <kwankh...@nvidia.com>
> >>>>>>>>>> Reviewed-by: Neo Jia <c...@nvidia.com>
> >>>>>>>>>> ---
> >>>>>>>>>>  linux-headers/linux/vfio.h | 71 ++++++++++++++++++++++++++++++++++++++++++++++
> >>>>>>>>>>  1 file changed, 71 insertions(+)
> >>>>>>>>>>
> >>>>>>>>>> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
> >>>>>>>>>> index 24f505199f83..274ec477eb82 100644
> >>>>>>>>>> --- a/linux-headers/linux/vfio.h
> >>>>>>>>>> +++ b/linux-headers/linux/vfio.h
> >>>>>>>>>> @@ -372,6 +372,77 @@ struct vfio_region_gfx_edid {
> >>>>>>>>>>   */
> >>>>>>>>>>  #define VFIO_REGION_SUBTYPE_IBM_NVLINK2_ATSD (1)
> >>>>>>>>>>
> >>>>>>>>>> +/* Migration region type and sub-type */
> >>>>>>>>>> +#define VFIO_REGION_TYPE_MIGRATION             (2)
> >>>>>>>>>> +#define VFIO_REGION_SUBTYPE_MIGRATION          (1)
> >>>>>>>>>> +
> >>>>>>>>>> +/**
> >>>>>>>>>> + * Structure vfio_device_migration_info is placed at the 0th offset of
> >>>>>>>>>> + * the VFIO_REGION_SUBTYPE_MIGRATION region to get/set VFIO device
> >>>>>>>>>> + * related migration information.  Field accesses from this structure
> >>>>>>>>>> + * are only supported at their native width and alignment; otherwise an
> >>>>>>>>>> + * error should be returned.
> >>>>>>>>>> + *
> >>>>>>>>>> + * device_state: (read/write)
> >>>>>>>>>> + *      To indicate to the vendor driver the state the VFIO device
> >>>>>>>>>> + *      should be transitioned to.  If the device state transition
> >>>>>>>>>> + *      fails, a write to this field returns an error.
> >>>>>>>>>> + *      It consists of 3 bits:
> >>>>>>>>>> + *      - If bit 0 is set, it indicates the _RUNNING state.  When it is
> >>>>>>>>>> + *        clear, that indicates the _STOPPED state.  When the device is
> >>>>>>>>>> + *        changed to _STOPPED, the driver should stop the device before
> >>>>>>>>>> + *        the write returns.
> >>>>>>>>>> + *      - If bit 1 is set, it indicates the _SAVING state.
> >>>>>>>>>> + *      - If bit 2 is set, it indicates the _RESUMING state.
> >>>>>>>>>> + *
> >>>>>>>>>> + * pending_bytes: (read only)
> >>>>>>>>>> + *      Read pending bytes yet to be migrated from the vendor driver.
> >>>>>>>>>> + *
> >>>>>>>>>> + * data_offset: (read only)
> >>>>>>>>>> + *      User application should read data_offset in the migration
> >>>>>>>>>> + *      region, from where it should read data during the _SAVING state
> >>>>>>>>>> + *      or write data during the _RESUMING state.
> >>>>>>>>>> + *
> >>>>>>>>>> + * data_size: (read/write)
> >>>>>>>>>> + *      User application should read data_size to know the size of data
> >>>>>>>>>> + *      copied in the migration region during the _SAVING state and
> >>>>>>>>>> + *      write the size of data copied in the migration region during the
> >>>>>>>>>> + *      _RESUMING state.
> >>>>>>>>>> + *
> >>>>>>>>>> + * start_pfn: (write only)
> >>>>>>>>>> + *      Start address pfn to get the bitmap of dirty pages from the
> >>>>>>>>>> + *      vendor driver during the _SAVING state.
> >>>>>>>>>> + *
> >>>>>>>>>> + * page_size: (write only)
> >>>>>>>>>> + *      User application should write the page_size of the pfn.
> >>>>>>>>>> + *
> >>>>>>>>>> + * total_pfns: (write only)
> >>>>>>>>>> + *      Total pfn count from start_pfn for which the dirty bitmap is
> >>>>>>>>>> + *      requested.
> >>>>>>>>>> + *
> >>>>>>>>>> + * copied_pfns: (read only)
> >>>>>>>>>> + *      pfn count for which the dirty bitmap is copied to the migration
> >>>>>>>>>> + *      region.
> >>>>>>>>>> + *      Vendor driver should copy the bitmap with bits set only for
> >>>>>>>>>> + *      pages to be marked dirty in the migration region.
> >>>>>>>>>> + *      Vendor driver should return 0 if there are 0 pages dirty in the
> >>>>>>>>>> + *      requested range.
> >>>>>>>>>> + *      Vendor driver should return -1 to mark all pages in the section
> >>>>>>>>>> + *      as dirty.
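
As a reading aid, here is a rough sketch of the register layout those field
descriptions imply.  The struct body itself is not quoted above, so the field
types, ordering, and macro names below are assumptions for illustration, not
the contents of the actual patch:

/* Sketch only: layout implied by the documentation above; field order,
 * widths and macro names are assumptions, not the posted patch.
 * __u32/__u64 come from <linux/types.h>. */
struct vfio_device_migration_info {
        __u32 device_state;     /* read/write */
#define VFIO_DEVICE_STATE_RUNNING    (1 << 0)   /* clear => _STOPPED */
#define VFIO_DEVICE_STATE_SAVING     (1 << 1)
#define VFIO_DEVICE_STATE_RESUMING   (1 << 2)
        __u32 reserved;
        __u64 pending_bytes;    /* read only  */
        __u64 data_offset;      /* read only  */
        __u64 data_size;        /* read/write */
        __u64 start_pfn;        /* write only */
        __u64 page_size;        /* write only */
        __u64 total_pfns;       /* write only */
        __u64 copied_pfns;      /* read only  */
};

Per the commit message, pre-copy would then be a write of _SAVING | _RUNNING
to device_state and stop-and-copy a write of _SAVING alone.
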
> >>>>>>>>>
> >>>>>>>>> Is the protocol that the user writes start_pfn/page_size/total_pfns in
> >>>>>>>>> any order and then the read of copied_pfns is what triggers the
> >>>>>>>>> snapshot?
> >>>>>>>>
> >>>>>>>> Yes.
> >>>>>>>>
> >>>>>>>>> Are start_pfn/page_size/total_pfns sticky such that a user
> >>>>>>>>> can write them once and get repeated refreshes of the dirty bitmap by
> >>>>>>>>> re-reading copied_pfns?
> >>>>>>>>
> >>>>>>>> Yes, and that bitmap should be for the given range (from start_pfn till
> >>>>>>>> start_pfn + total_pfns).
> >>>>>>>> Re-reading of copied_pfns is to handle the case where the vendor driver
> >>>>>>>> might have reserved an area for the bitmap that is smaller than the
> >>>>>>>> total bitmap size for the range (start_pfn to start_pfn + total_pfns);
> >>>>>>>> the user then has to iterate till copied_pfns == total_pfns or till
> >>>>>>>> copied_pfns == 0 (that is, there are no pages dirty in the rest of the
> >>>>>>>> range).
> >>>>>>>
> >>>>>>> So reading copied_pfns triggers the data range to be updated, but the
> >>>>>>> caller cannot assume it to be synchronous and uses total_pfns to poll
> >>>>>>> that the update is complete?  How does the vendor driver differentiate
> >>>>>>> the user polling for the previous update to finish versus requesting a
> >>>>>>> new update?
> >>>>>>>
> >>>>>>
> >>>>>> A write to start_pfn/page_size/total_pfns followed by a read of
> >>>>>> copied_pfns indicates a new update, whereas a sequential read of
> >>>>>> copied_pfns indicates polling for the previous update.
> >>>>>
> >>>>> Hmm, this seems to contradict the answer to my question above where I
> >>>>> ask if the write fields are sticky so a user can trigger a refresh via
> >>>>> copied_pfns.
> >>>>
> >>>> Sorry, how does it contradict?  Pasting it again below:
> >>>>
> >>>>>>>>> Are start_pfn/page_size/total_pfns sticky such that a user
> >>>>>>>>> can write them once and get repeated refreshes of the dirty bitmap by
> >>>>>>>>> re-reading copied_pfns?
> >>>>>>>>
> >>>>>>>> Yes, and that bitmap should be for the given range (from start_pfn till
> >>>>>>>> start_pfn + total_pfns).
> >>>>>>>> Re-reading of copied_pfns is to handle the case where the vendor driver
> >>>>>>>> might have reserved an area for the bitmap that is smaller than the
> >>>>>>>> total bitmap size for the range (start_pfn to start_pfn + total_pfns);
> >>>>>>>> the user then has to iterate till copied_pfns == total_pfns or till
> >>>>>>>> copied_pfns == 0 (that is, there are no pages dirty in the rest of the
> >>>>>>>> range).
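
For illustration, a minimal userspace sketch of the iteration being described.
The field offsets, the pread()/pwrite() access pattern, the buffer sizing and
the mark_dirty() callback are placeholders, not the actual QEMU
implementation; error handling and the copied_pfns == -1 (mark everything
dirty) case are omitted:

#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Assumed offsets of fields within vfio_device_migration_info. */
enum {
        OFF_DATA_OFFSET = 16,
        OFF_START_PFN   = 32,
        OFF_PAGE_SIZE   = 40,
        OFF_TOTAL_PFNS  = 48,
        OFF_COPIED_PFNS = 56,
};

/* devfd/region_offset would come from VFIO_DEVICE_GET_REGION_INFO on the
 * migration region; mark_dirty() stands in for the dirty-bitmap sync. */
static void sync_dirty_bitmap(int devfd, off_t region_offset,
                              uint64_t start_pfn, uint64_t page_size,
                              uint64_t total_pfns,
                              void (*mark_dirty)(void *bitmap,
                                                 uint64_t start_addr,
                                                 uint64_t npages))
{
        uint64_t count = 0;
        uint8_t bitmap[4096];

        /* The write-only fields are sticky: written once for the range. */
        pwrite(devfd, &start_pfn, sizeof(start_pfn), region_offset + OFF_START_PFN);
        pwrite(devfd, &page_size, sizeof(page_size), region_offset + OFF_PAGE_SIZE);
        pwrite(devfd, &total_pfns, sizeof(total_pfns), region_offset + OFF_TOTAL_PFNS);

        while (count < total_pfns) {
                uint64_t copied_pfns, data_offset;

                /* Reading copied_pfns asks the vendor driver for the next chunk. */
                pread(devfd, &copied_pfns, sizeof(copied_pfns),
                      region_offset + OFF_COPIED_PFNS);
                if (copied_pfns == 0)
                        break;  /* no dirty pages in the rest of the range */
                if ((copied_pfns + 7) / 8 > sizeof(bitmap))
                        return; /* sketch: real code would size the buffer properly */

                /* The bitmap for these copied_pfns pages starts at data_offset. */
                pread(devfd, &data_offset, sizeof(data_offset),
                      region_offset + OFF_DATA_OFFSET);
                pread(devfd, bitmap, (copied_pfns + 7) / 8,
                      region_offset + data_offset);

                /* 'count' is the caller's only record of which pfns this chunk
                 * covers, which is exactly the bookkeeping under discussion. */
                mark_dirty(bitmap, (start_pfn + count) * page_size, copied_pfns);
                count += copied_pfns;
        }
}

In QEMU the mark_dirty() step corresponds to the
cpu_physical_memory_set_dirty_lebitmap() call shown further down in the
thread.
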
> >>>
> >>> Sorry, I guess I misinterpreted again.  So the vendor driver can return
> >>> copied_pfns < total_pfns if it has a buffer limitation, not as an
> >>> indication of its background progress in writing out the bitmap.  Just
> >>> as a proof of concept, let's say the vendor driver has a 1 bit buffer
> >>> and I write 0 to start_pfn and 3 to total_pfns.  I read copied_pfns,
> >>> which returns 1, so I read data_offset to find where this 1 bit is
> >>> located and then read my bit from that location.  This is the dirty
> >>> state of the first pfn.  I read copied_pfns again and it reports 2,
> >>
> >> It should report 1 to indicate it has data for one pfn.
> >>
> >>> I again read data_offset to find where the data is located, and it's my
> >>> job to remember that I've already read 1 bit, so 2 means there's only 1
> >>> bit available and it's the second pfn.
> >>
> >> No.
> >> Here 'I' means the user application, right?
> >
> > Yes
> >
> >> The user application knows for how many pfns it has already received the
> >> bitmap, i.e. see 'count' in function vfio_get_dirty_page_list().
> >>
> >> Here copied_pfns is the number of pfns for which the bitmap is available
> >> in the buffer.  The start address for that bitmap is then calculated by
> >> the user application as:
> >>   ((start_pfn + count) * page_size)
> >>
> >> Then QEMU calls:
> >>
> >> cpu_physical_memory_set_dirty_lebitmap((unsigned long *)buf,
> >>                                        (start_pfn + count) * page_size,
> >>                                        copied_pfns);
> >>
> >>> I read the bit.  I again read
> >>> copied_pfns, which now reports 3, I read data_offset to find the
> >>> location of the data, I remember that I've already read 2 bits, so I
> >>> read my bit for the 3rd pfn.  This seems rather clumsy.
> >>>
> >>
> >> Hope the above explanation helps.
> >
> > Still seems rather clumsy: the knowledge of which bit(s) are available
> > in the buffer can only be known by foreknowledge of which bits have
> > already been read.  That seems error prone for both the user and the
> > vendor driver to stay in sync.
> >
> >>> Now that copied_pfns == total_pfns, what happens if I read copied_pfns
> >>> again?  This is actually what I thought I was asking previously.
> >>>
> >>
> >> It should return 0.
> >
> > Are we assuming no new pages have been dirtied?  What if pages have
> > been dirtied?
> >
> >>> Should we expose the pfn buffer size and fault on writes larger than
> >>> that size, requiring the user to iterate start_pfn themselves?
> >>
> >> Who should fault, the vendor driver or the user application?
> >>
> >> Here the vendor driver is writing data to the data section.
> >> In the steps in this patch set, the user application is incrementing
> >> start_pfn by adding the copied_pfns count.
> >
> > The user app is writing total_pfns to get a range, correct?  The vendor
> > driver could return errno on that write if total_pfns exceeds the
> > available buffer size.
> >
>
> OK.  If the vendor driver returns an error, will the user application then
> retry with a smaller size?

I think we'd need to improve the header to indicate the available size,
it would seem unreasonable to me to require the user to guess how much
is available.  Thanks,

Alex
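
One possible shape of that last suggestion, purely illustrative: neither the
bitmap_buffer_pfns field, its offset, nor this helper exists in the posted
patch; they are assumptions sketched only to show how advertising the buffer
size could remove the guesswork:

#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

#define OFF_BITMAP_BUFFER_PFNS  64   /* hypothetical read-only capacity field */

static uint64_t clamp_total_pfns(int devfd, off_t region_offset, uint64_t want)
{
        uint64_t cap;

        /* Read the advertised capacity (largest range the vendor driver can
         * service in one request); fall back if the field is not implemented. */
        if (pread(devfd, &cap, sizeof(cap),
                  region_offset + OFF_BITMAP_BUFFER_PFNS) != sizeof(cap) ||
            cap == 0)
                return want;

        return want < cap ? want : cap;
}

The user application would then advance start_pfn in chunks of the clamped
size instead of relying on the vendor driver to truncate the request via
copied_pfns.
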