On 6/25/2019 12:31 AM, Alex Williamson wrote:
> On Tue, 25 Jun 2019 00:22:16 +0530
> Kirti Wankhede <kwankh...@nvidia.com> wrote:
>
>> On 6/24/2019 8:55 PM, Alex Williamson wrote:
>>> On Mon, 24 Jun 2019 20:30:08 +0530
>>> Kirti Wankhede <kwankh...@nvidia.com> wrote:
>>>
>>>> On 6/22/2019 3:31 AM, Alex Williamson wrote:
>>>>> On Sat, 22 Jun 2019 02:00:08 +0530
>>>>> Kirti Wankhede <kwankh...@nvidia.com> wrote:
>>>>>> On 6/22/2019 1:30 AM, Alex Williamson wrote:
>>>>>>> On Sat, 22 Jun 2019 01:05:48 +0530
>>>>>>> Kirti Wankhede <kwankh...@nvidia.com> wrote:
>>>>>>>
>>>>>>>> On 6/21/2019 8:33 PM, Alex Williamson wrote:
>>>>>>>>> On Fri, 21 Jun 2019 11:22:15 +0530
>>>>>>>>> Kirti Wankhede <kwankh...@nvidia.com> wrote:
>>>>>>>>>
>>>>>>>>>> On 6/20/2019 10:48 PM, Alex Williamson wrote:
>>>>>>>>>>> On Thu, 20 Jun 2019 20:07:29 +0530
>>>>>>>>>>> Kirti Wankhede <kwankh...@nvidia.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> - Defined MIGRATION region type and sub-type.
>>>>>>>>>>>> - Used 3 bits to define VFIO device states.
>>>>>>>>>>>>     Bit 0 => _RUNNING
>>>>>>>>>>>>     Bit 1 => _SAVING
>>>>>>>>>>>>     Bit 2 => _RESUMING
>>>>>>>>>>>>     Combination of these bits defines the VFIO device's state
>>>>>>>>>>>>     during migration:
>>>>>>>>>>>>     _STOPPED            => All bits 0 indicates VFIO device stopped.
>>>>>>>>>>>>     _RUNNING            => Normal VFIO device running state.
>>>>>>>>>>>>     _SAVING | _RUNNING  => vCPUs are running, VFIO device is running
>>>>>>>>>>>>                            but starts saving device state, i.e. the
>>>>>>>>>>>>                            pre-copy state.
>>>>>>>>>>>>     _SAVING             => vCPUs are stopped, VFIO device should be
>>>>>>>>>>>>                            stopped and save device state, i.e. the
>>>>>>>>>>>>                            stop-and-copy state.
>>>>>>>>>>>>     _RESUMING           => VFIO device resuming state.
>>>>>>>>>>>>     _SAVING | _RESUMING => Invalid state if both _SAVING and
>>>>>>>>>>>>                            _RESUMING bits are set.
>>>>>>>>>>>> - Defined vfio_device_migration_info structure which will be placed
>>>>>>>>>>>>   at offset 0 of the migration region to get/set VFIO device related
>>>>>>>>>>>>   information. Defined members of the structure and their usage on
>>>>>>>>>>>>   read/write access:
>>>>>>>>>>>>     * device_state: (read/write)
>>>>>>>>>>>>         To convey the VFIO device state to be transitioned to. Only
>>>>>>>>>>>>         3 bits are used as of now.
>>>>>>>>>>>>     * pending_bytes: (read only)
>>>>>>>>>>>>         To get pending bytes yet to be migrated for the VFIO device.
>>>>>>>>>>>>     * data_offset: (read only)
>>>>>>>>>>>>         To get the offset in the migration region from where data
>>>>>>>>>>>>         exists during _SAVING, and to where data should be written
>>>>>>>>>>>>         by the user space application during the _RESUMING state.
>>>>>>>>>>>>     * data_size: (read/write)
>>>>>>>>>>>>         To get and set the size of data copied into the migration
>>>>>>>>>>>>         region during the _SAVING and _RESUMING states.
>>>>>>>>>>>>     * start_pfn, page_size, total_pfns: (write only)
>>>>>>>>>>>>         To get the bitmap of dirty pages from the vendor driver from
>>>>>>>>>>>>         the given start address for total_pfns.
>>>>>>>>>>>>     * copied_pfns: (read only)
>>>>>>>>>>>>         To get the number of pfns whose bitmap was copied into the
>>>>>>>>>>>>         migration region. The vendor driver should copy the bitmap
>>>>>>>>>>>>         with bits set only for pages to be marked dirty. The vendor
>>>>>>>>>>>>         driver should return 0 if there are 0 dirty pages in the
>>>>>>>>>>>>         requested range, and -1 to mark all pages in the section as
>>>>>>>>>>>>         dirty.
>>>>>>>>>>>> Migration region looks like:
>>>>>>>>>>>>  ------------------------------------------------------------------
>>>>>>>>>>>> | vfio_device_migration_info |           data section             |
>>>>>>>>>>>> |                            |    //////////////////////////////  |
>>>>>>>>>>>>  ------------------------------------------------------------------
>>>>>>>>>>>>  ^                            ^                                 ^
>>>>>>>>>>>>  offset 0 - trapped part      data_offset                       data_size
>>>>>>>>>>>>
>>>>>>>>>>>> The data section always follows the vfio_device_migration_info
>>>>>>>>>>>> structure in the region, so data_offset will always be non-zero.
>>>>>>>>>>>> The offset from where data is copied is decided by the kernel
>>>>>>>>>>>> driver; the data section can be trapped or mmapped depending on how
>>>>>>>>>>>> the kernel driver defines it. If mmapped, data_offset should be
>>>>>>>>>>>> page aligned, whereas the initial section which contains the
>>>>>>>>>>>> vfio_device_migration_info structure might not end at a page
>>>>>>>>>>>> aligned offset.
>>>>>>>>>>>>
>>>>>>>>>>>> Signed-off-by: Kirti Wankhede <kwankh...@nvidia.com>
>>>>>>>>>>>> Reviewed-by: Neo Jia <c...@nvidia.com>
>>>>>>>>>>>> ---
>>>>>>>>>>>>  linux-headers/linux/vfio.h | 71 ++++++++++++++++++++++++++++++++++++++++++++++
>>>>>>>>>>>>  1 file changed, 71 insertions(+)
>>>>>>>>>>>>
>>>>>>>>>>>> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
>>>>>>>>>>>> index 24f505199f83..274ec477eb82 100644
>>>>>>>>>>>> --- a/linux-headers/linux/vfio.h
>>>>>>>>>>>> +++ b/linux-headers/linux/vfio.h
>>>>>>>>>>>> @@ -372,6 +372,77 @@ struct vfio_region_gfx_edid {
>>>>>>>>>>>> */
>>>>>>>>>>>> #define VFIO_REGION_SUBTYPE_IBM_NVLINK2_ATSD (1)
>>>>>>>>>>>>
>>>>>>>>>>>> +/* Migration region type and sub-type */
>>>>>>>>>>>> +#define VFIO_REGION_TYPE_MIGRATION       (2)
>>>>>>>>>>>> +#define VFIO_REGION_SUBTYPE_MIGRATION    (1)
>>>>>>>>>>>> +
>>>>>>>>>>>> +/**
>>>>>>>>>>>> + * Structure vfio_device_migration_info is placed at offset 0 of the
>>>>>>>>>>>> + * VFIO_REGION_SUBTYPE_MIGRATION region to get/set VFIO device related
>>>>>>>>>>>> + * migration information. Field accesses from this structure are only
>>>>>>>>>>>> + * supported at their native width and alignment, otherwise the access
>>>>>>>>>>>> + * should return an error.
>>>>>>>>>>>> + *
>>>>>>>>>>>> + * device_state: (read/write)
>>>>>>>>>>>> + *      To indicate to the vendor driver the state the VFIO device should
>>>>>>>>>>>> + *      be transitioned to. If the device state transition fails, a write
>>>>>>>>>>>> + *      to this field returns an error. It consists of 3 bits:
>>>>>>>>>>>> + *      - If bit 0 is set, it indicates the _RUNNING state. When it is
>>>>>>>>>>>> + *        cleared, that indicates the _STOPPED state. When the device is
>>>>>>>>>>>> + *        changed to _STOPPED, the driver should stop the device before
>>>>>>>>>>>> + *        the write returns.
>>>>>>>>>>>> + *      - If bit 1 is set, it indicates the _SAVING state.
>>>>>>>>>>>> + *      - If bit 2 is set, it indicates the _RESUMING state.
>>>>>>>>>>>> + *
>>>>>>>>>>>> + * pending_bytes: (read only)
>>>>>>>>>>>> + *      Read pending bytes yet to be migrated from the vendor driver.
>>>>>>>>>>>> + *
>>>>>>>>>>>> + * data_offset: (read only)
>>>>>>>>>>>> + *      The user application should read data_offset to learn the offset
>>>>>>>>>>>> + *      in the migration region from where it should read data during the
>>>>>>>>>>>> + *      _SAVING state or write data during the _RESUMING state.
>>>>>>>>>>>> + *
>>>>>>>>>>>> + * data_size: (read/write)
>>>>>>>>>>>> + *      The user application should read data_size to know the size of
>>>>>>>>>>>> + *      data copied into the migration region during the _SAVING state,
>>>>>>>>>>>> + *      and write the size of data copied into the migration region
>>>>>>>>>>>> + *      during the _RESUMING state.
>>>>>>>>>>>> + *
>>>>>>>>>>>> + * start_pfn: (write only)
>>>>>>>>>>>> + *      Start address pfn from which to get the bitmap of dirty pages
>>>>>>>>>>>> + *      from the vendor driver during the _SAVING state.
>>>>>>>>>>>> + *
>>>>>>>>>>>> + * page_size: (write only)
>>>>>>>>>>>> + *      The user application should write the page_size of the pfn.
>>>>>>>>>>>> + *
>>>>>>>>>>>> + * total_pfns: (write only)
>>>>>>>>>>>> + *      Total pfn count from start_pfn for which the dirty bitmap is
>>>>>>>>>>>> + *      requested.
>>>>>>>>>>>> + *
>>>>>>>>>>>> + * copied_pfns: (read only)
>>>>>>>>>>>> + *      The pfn count for which the dirty bitmap is copied into the
>>>>>>>>>>>> + *      migration region.
>>>>>>>>>>>> + *      The vendor driver should copy the bitmap with bits set only for
>>>>>>>>>>>> + *      pages to be marked dirty in the migration region.
>>>>>>>>>>>> + *      The vendor driver should return 0 if there are 0 pages dirty in
>>>>>>>>>>>> + *      the requested range.
>>>>>>>>>>>> + *      The vendor driver should return -1 to mark all pages in the
>>>>>>>>>>>> + *      section as dirty.
>>>>>>>>>>>
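As an aside, since the struct member definitions themselves are trimmed from
the quoted hunk above, a minimal sketch of a layout matching the fields
described in that comment could look like the following; the types, ordering
and the reserved padding here are assumptions for illustration only, not the
patch's actual definition:

    #include <linux/types.h>

    struct vfio_device_migration_info {
            __u32 device_state;     /* _RUNNING/_SAVING/_RESUMING bits (RW) */
            __u32 reserved;
            __u64 pending_bytes;    /* bytes yet to be migrated (RO) */
            __u64 data_offset;      /* offset of data section in region (RO) */
            __u64 data_size;        /* size of data in data section (RW) */
            __u64 start_pfn;        /* first pfn of dirty-bitmap request (WO) */
            __u64 page_size;        /* page size used for pfn accounting (WO) */
            __u64 total_pfns;       /* number of pfns requested (WO) */
            __u64 copied_pfns;      /* pfns whose bitmap was copied (RO) */
    };
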
>>>>>>>>>>> Is the protocol that the user writes start_pfn/page_size/total_pfns
>>>>>>>>>>> in any order and then the read of copied_pfns is what triggers the
>>>>>>>>>>> snapshot?
>>>>>>>>>>
>>>>>>>>>> Yes.
>>>>>>>>>>
>>>>>>>>>>> Are start_pfn/page_size/total_pfns sticky such that a user can write
>>>>>>>>>>> them once and get repeated refreshes of the dirty bitmap by
>>>>>>>>>>> re-reading copied_pfns?
>>>>>>>>>>
>>>>>>>>>> Yes, and that bitmap should be for the given range (from start_pfn
>>>>>>>>>> till start_pfn + total_pfns).
>>>>>>>>>> Re-reading copied_pfns handles the case where the vendor driver's
>>>>>>>>>> reserved area for the bitmap is smaller than the total bitmap size
>>>>>>>>>> for the range (start_pfn to start_pfn + total_pfns); then the user
>>>>>>>>>> will have to iterate till copied_pfns == total_pfns or till
>>>>>>>>>> copied_pfns == 0 (that is, there are no pages dirty in the rest of
>>>>>>>>>> the range).
>>>>>>>>>
>>>>>>>>> So reading copied_pfns triggers the data range to be updated, but the
>>>>>>>>> caller cannot assume it to be synchronous and uses total_pfns to poll
>>>>>>>>> that the update is complete? How does the vendor driver differentiate
>>>>>>>>> the user polling for the previous update to finish versus requesting a
>>>>>>>>> new update?
>>>>>>>>>
>>>>>>>>
>>>>>>>> A write to start_pfn/page_size/total_pfns followed by a read of
>>>>>>>> copied_pfns indicates a new update, whereas a sequential read of
>>>>>>>> copied_pfns indicates polling for the previous update.
>>>>>>>
>>>>>>> Hmm, this seems to contradict the answer to my question above where I
>>>>>>> ask if the write fields are sticky so a user can trigger a refresh via
>>>>>>> copied_pfns.
>>>>>>
>>>>>> Sorry, how does it contradict? Pasting it again below:
>>>>>>>>>>> Are start_pfn/page_size/total_pfns sticky such that a user can write
>>>>>>>>>>> them once and get repeated refreshes of the dirty bitmap by
>>>>>>>>>>> re-reading copied_pfns?
>>>>>>>>>>
>>>>>>>>>> Yes, and that bitmap should be for the given range (from start_pfn
>>>>>>>>>> till start_pfn + total_pfns).
>>>>>>>>>> Re-reading copied_pfns handles the case where the vendor driver's
>>>>>>>>>> reserved area for the bitmap is smaller than the total bitmap size
>>>>>>>>>> for the range (start_pfn to start_pfn + total_pfns); then the user
>>>>>>>>>> will have to iterate till copied_pfns == total_pfns or till
>>>>>>>>>> copied_pfns == 0 (that is, there are no pages dirty in the rest of
>>>>>>>>>> the range).
>>>>>
>>>>> Sorry, I guess I misinterpreted again. So the vendor driver can return
>>>>> copied_pfns < total_pfns if it has a buffer limitation, not as an
>>>>> indication of its background progress in writing out the bitmap. Just
>>>>> as a proof of concept, let's say the vendor driver has a 1 bit buffer
>>>>> and I write 0 to start_pfn and 3 to total_pfns. I read copied_pfns,
>>>>> which returns 1, so I read data_offset to find where this 1 bit is
>>>>> located and then read my bit from that location. This is the dirty
>>>>> state of the first pfn. I read copied_pfns again and it reports 2,
>>>>
>>>> It should report 1 to indicate it has data for one pfn.
>>>>
>>>>> I again read data_offset to find where the data is located, and it's my
>>>>> job to remember that I've already read 1 bit, so 2 means there's only 1
>>>>> bit available and it's the second pfn.
>>>>
>>>> No.
>>>> Here 'I' means User application, right?
>>>
>>> Yes
>>>
>>>> The user application knows for how many pfns it has already received the
>>>> bitmap, i.e. see 'count' in function vfio_get_dirty_page_list().
>>>>
>>>> Here copied_pfns is the number of pfns for which the bitmap is available
>>>> in the buffer. The start address for that bitmap is then calculated by
>>>> the user application as:
>>>> ((start_pfn + count) * page_size)
>>>>
>>>> Then QEMU calls:
>>>>
>>>> cpu_physical_memory_set_dirty_lebitmap((unsigned long *)buf,
>>>>                                        (start_pfn + count) * page_size,
>>>>                                        copied_pfns);
>>>>
>>>>> I read the bit. I again read
>>>>> copied_pfns, which now reports 3, I read data_offset to find the
>>>>> location of the data, I remember that I've already read 2 bits, so I
>>>>> read my bit into the 3rd pfn. This seems rather clumsy.
>>>>>
>>>>
>>>> Hope the above explanation helps.
>>>
>>> Still seems rather clumsy; which bit(s) are available in the buffer can
>>> only be known from foreknowledge of which bits have already been read.
>>> That seems error prone for both the user and the vendor driver to stay
>>> in sync.
>>>
>>>>> Now that copied_pfns == total_pfns, what happens if I read copied_pfns
>>>>> again? This is actually what I thought I was asking previously.
>>>>>
>>>>
>>>> It should return 0.
>>>
>>> Are we assuming no new pages have been dirtied? What if pages have
>>> been dirtied?
>>>
>>>>> Should we expose the pfn buffer size and fault on writes larger than
>>>>> that size, requiring the user to iterate start_pfn themselves?
>>>>
>>>> Who should fault, the vendor driver or the user application?
>>>>
>>>> Here the vendor driver is writing data to the data section.
>>>> In the steps in this patch set, the user application is incrementing
>>>> start_pfn by adding the copied_pfns count.
>>>
>>> The user app is writing total_pfns to get a range, correct? The vendor
>>> driver could return errno on that write if total_pfns exceeds the
>>> available buffer size.
>>>
>>
>> OK. If the vendor driver returns an error, will the user application then
>> retry with a smaller size?
>
> I think we'd need to improve the header to indicate the available size;
> it would seem unreasonable to me to require the user to guess how much
> is available. Thanks,
>
Instead of returning an error on the write to total_pfns, how about writing
the updated start_pfn and total_pfns again, as below:

count = 0
while (total_pfns > 0) {
    write(start_pfn + count)
    write(page_size)
    write(total_pfns)
    read(copied_pfns)
    if (copied_pfns == 0)
        break           /* no pages dirty in the rest of the range */
    read(data_offset)
    read bitmap from data_offset for copied_pfns and mark pages dirty
    count += copied_pfns
    total_pfns -= copied_pfns
}
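
To make the intent concrete, below is a rough user-space sketch of that loop
in C. It is only illustrative: the struct layout and field offsets, the
mark_pages_dirty() helper, and the assumption that the header fields are
trapped (pread/pwrite on the device fd) while the bitmap is read out of the
data section are all assumptions for this sketch, not part of the proposed
UAPI; the copied_pfns == -1 ("mark everything dirty") case is omitted for
brevity.

#define _GNU_SOURCE
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

/* Assumed header layout at offset 0 of the migration region (see the
 * sketch earlier in this mail); real offsets come from the final UAPI. */
struct mig_info {
        uint32_t device_state;
        uint32_t reserved;
        uint64_t pending_bytes;
        uint64_t data_offset;
        uint64_t data_size;
        uint64_t start_pfn;
        uint64_t page_size;
        uint64_t total_pfns;
        uint64_t copied_pfns;
};

/* Hypothetical helper supplied by the caller: mark 'npfns' pages starting
 * at guest physical address 'gpa' dirty according to 'bitmap'. */
extern void mark_pages_dirty(const void *bitmap, uint64_t gpa, uint64_t npfns);

/* Read/write one header field; 'region_off' is the migration region's
 * offset within the VFIO device fd. */
static int mig_rd(int fd, off_t region_off, size_t field, void *val, size_t sz)
{
        return pread(fd, val, sz, region_off + field) == (ssize_t)sz ? 0 : -1;
}

static int mig_wr(int fd, off_t region_off, size_t field, const void *val,
                  size_t sz)
{
        return pwrite(fd, val, sz, region_off + field) == (ssize_t)sz ? 0 : -1;
}

/* Pull the dirty bitmap for [start_pfn, start_pfn + total_pfns) in chunks
 * of whatever the vendor driver can buffer (copied_pfns per iteration). */
int sync_dirty_bitmap(int fd, off_t region_off, uint64_t start_pfn,
                      uint64_t page_size, uint64_t total_pfns)
{
        uint64_t count = 0;
        int ret = -1;
        void *bitmap = malloc((total_pfns + 7) / 8);    /* worst-case chunk */

        if (!bitmap)
                return -1;

        while (total_pfns > 0) {
                uint64_t pfn = start_pfn + count, copied, data_offset;

                if (mig_wr(fd, region_off, offsetof(struct mig_info, start_pfn),
                           &pfn, sizeof(pfn)) ||
                    mig_wr(fd, region_off, offsetof(struct mig_info, page_size),
                           &page_size, sizeof(page_size)) ||
                    mig_wr(fd, region_off, offsetof(struct mig_info, total_pfns),
                           &total_pfns, sizeof(total_pfns)))
                        goto out;

                /* Reading copied_pfns asks the vendor driver to fill the
                 * bitmap for as many pfns as it can buffer. */
                if (mig_rd(fd, region_off, offsetof(struct mig_info, copied_pfns),
                           &copied, sizeof(copied)))
                        goto out;
                if (copied == 0)        /* no dirty pages left in the range */
                        break;

                if (mig_rd(fd, region_off, offsetof(struct mig_info, data_offset),
                           &data_offset, sizeof(data_offset)))
                        goto out;

                /* Read the bitmap for 'copied' pfns from the data section. */
                if (pread(fd, bitmap, (copied + 7) / 8,
                          region_off + data_offset) != (ssize_t)((copied + 7) / 8))
                        goto out;

                mark_pages_dirty(bitmap, (start_pfn + count) * page_size, copied);
                count += copied;
                total_pfns -= copied;
        }
        ret = 0;
out:
        free(bitmap);
        return ret;
}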
Thanks,
Kirti