On 6/25/2019 12:31 AM, Alex Williamson wrote:
> On Tue, 25 Jun 2019 00:22:16 +0530
> Kirti Wankhede <kwankh...@nvidia.com> wrote:
>
>> On 6/24/2019 8:55 PM, Alex Williamson wrote:
>>> On Mon, 24 Jun 2019 20:30:08 +0530
>>> Kirti Wankhede <kwankh...@nvidia.com> wrote:
>>>
>>>> On 6/22/2019 3:31 AM, Alex Williamson wrote:
>>>>> On Sat, 22 Jun 2019 02:00:08 +0530
>>>>> Kirti Wankhede <kwankh...@nvidia.com> wrote:
>>>>>> On 6/22/2019 1:30 AM, Alex Williamson wrote:
>>>>>>> On Sat, 22 Jun 2019 01:05:48 +0530
>>>>>>> Kirti Wankhede <kwankh...@nvidia.com> wrote:
>>>>>>>
>>>>>>>> On 6/21/2019 8:33 PM, Alex Williamson wrote:
>>>>>>>>> On Fri, 21 Jun 2019 11:22:15 +0530
>>>>>>>>> Kirti Wankhede <kwankh...@nvidia.com> wrote:
>>>>>>>>>
>>>>>>>>>> On 6/20/2019 10:48 PM, Alex Williamson wrote:
>>>>>>>>>>> On Thu, 20 Jun 2019 20:07:29 +0530
>>>>>>>>>>> Kirti Wankhede <kwankh...@nvidia.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> - Defined MIGRATION region type and sub-type.
>>>>>>>>>>>> - Used 3 bits to define VFIO device states.
>>>>>>>>>>>>     Bit 0 => _RUNNING
>>>>>>>>>>>>     Bit 1 => _SAVING
>>>>>>>>>>>>     Bit 2 => _RESUMING
>>>>>>>>>>>>     Combination of these bits defines the VFIO device's state
>>>>>>>>>>>>     during migration:
>>>>>>>>>>>>     _STOPPED            => All bits 0 indicates VFIO device stopped.
>>>>>>>>>>>>     _RUNNING            => Normal VFIO device running state.
>>>>>>>>>>>>     _SAVING | _RUNNING  => vCPUs are running, VFIO device is running
>>>>>>>>>>>>                            but starts saving device state, i.e. the
>>>>>>>>>>>>                            pre-copy state.
>>>>>>>>>>>>     _SAVING             => vCPUs are stopped, VFIO device should be
>>>>>>>>>>>>                            stopped and save device state, i.e. the
>>>>>>>>>>>>                            stop-and-copy state.
>>>>>>>>>>>>     _RESUMING           => VFIO device resuming state.
>>>>>>>>>>>>     _SAVING | _RESUMING => Invalid state if both _SAVING and
>>>>>>>>>>>>                            _RESUMING bits are set.
>>>>>>>>>>>> - Defined vfio_device_migration_info structure which will be placed
>>>>>>>>>>>>   at offset 0 of the migration region to get/set VFIO device related
>>>>>>>>>>>>   information. Defined members of the structure and their usage on
>>>>>>>>>>>>   read/write access:
>>>>>>>>>>>>     * device_state: (read/write)
>>>>>>>>>>>>         To convey the VFIO device state to be transitioned to. Only
>>>>>>>>>>>>         3 bits are used as of now.
>>>>>>>>>>>>     * pending_bytes: (read only)
>>>>>>>>>>>>         To get pending bytes yet to be migrated for the VFIO device.
>>>>>>>>>>>>     * data_offset: (read only)
>>>>>>>>>>>>         To get the offset in the migration region from where data
>>>>>>>>>>>>         exists during _SAVING, and to where data should be written
>>>>>>>>>>>>         by the user space application during the _RESUMING state.
>>>>>>>>>>>>     * data_size: (read/write)
>>>>>>>>>>>>         To get and set the size of data copied into the migration
>>>>>>>>>>>>         region during the _SAVING and _RESUMING states.
>>>>>>>>>>>>     * start_pfn, page_size, total_pfns: (write only)
>>>>>>>>>>>>         To get the bitmap of dirty pages from the vendor driver from
>>>>>>>>>>>>         the given start address for total_pfns.
>>>>>>>>>>>>     * copied_pfns: (read only)
>>>>>>>>>>>>         To get the number of pfns whose bitmap was copied into the
>>>>>>>>>>>>         migration region. The vendor driver should copy the bitmap
>>>>>>>>>>>>         with bits set only for pages to be marked dirty. The vendor
>>>>>>>>>>>>         driver should return 0 if there are 0 dirty pages in the
>>>>>>>>>>>>         requested range, and -1 to mark all pages in the section as
>>>>>>>>>>>>         dirty.
>>>>>>>>>>>> Migration region looks like:
>>>>>>>>>>>>  ------------------------------------------------------------------
>>>>>>>>>>>> | vfio_device_migration_info |           data section             |
>>>>>>>>>>>> |                            |    //////////////////////////////  |
>>>>>>>>>>>>  ------------------------------------------------------------------
>>>>>>>>>>>>  ^                            ^                                 ^
>>>>>>>>>>>>  offset 0 - trapped part      data_offset                       data_size
>>>>>>>>>>>>
>>>>>>>>>>>> The data section always follows the vfio_device_migration_info
>>>>>>>>>>>> structure in the region, so data_offset will always be non-zero.
>>>>>>>>>>>> The offset from where data is copied is decided by the kernel
>>>>>>>>>>>> driver; the data section can be trapped or mmapped depending on how
>>>>>>>>>>>> the kernel driver defines it. If mmapped, data_offset should be
>>>>>>>>>>>> page aligned, whereas the initial section which contains the
>>>>>>>>>>>> vfio_device_migration_info structure might not end at a page
>>>>>>>>>>>> aligned offset.
>>>>>>>>>>>>
>>>>>>>>>>>> Signed-off-by: Kirti Wankhede <kwankh...@nvidia.com>
>>>>>>>>>>>> Reviewed-by: Neo Jia <c...@nvidia.com>
>>>>>>>>>>>> ---
>>>>>>>>>>>>  linux-headers/linux/vfio.h | 71 ++++++++++++++++++++++++++++++++++++++++++++++
>>>>>>>>>>>>  1 file changed, 71 insertions(+)
>>>>>>>>>>>>
>>>>>>>>>>>> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
>>>>>>>>>>>> index 24f505199f83..274ec477eb82 100644
>>>>>>>>>>>> --- a/linux-headers/linux/vfio.h
>>>>>>>>>>>> +++ b/linux-headers/linux/vfio.h
>>>>>>>>>>>> @@ -372,6 +372,77 @@ struct vfio_region_gfx_edid {
>>>>>>>>>>>> */
>>>>>>>>>>>> #define VFIO_REGION_SUBTYPE_IBM_NVLINK2_ATSD (1)
>>>>>>>>>>>>
>>>>>>>>>>>> +/* Migration region type and sub-type */
>>>>>>>>>>>> +#define VFIO_REGION_TYPE_MIGRATION       (2)
>>>>>>>>>>>> +#define VFIO_REGION_SUBTYPE_MIGRATION    (1)
>>>>>>>>>>>> +
>>>>>>>>>>>> +/**
>>>>>>>>>>>> + * Structure vfio_device_migration_info is placed at offset 0 of the
>>>>>>>>>>>> + * VFIO_REGION_SUBTYPE_MIGRATION region to get/set VFIO device related
>>>>>>>>>>>> + * migration information. Field accesses from this structure are only
>>>>>>>>>>>> + * supported at their native width and alignment, otherwise the access
>>>>>>>>>>>> + * should return an error.
>>>>>>>>>>>> + *
>>>>>>>>>>>> + * device_state: (read/write)
>>>>>>>>>>>> + *      To indicate to the vendor driver the state the VFIO device should
>>>>>>>>>>>> + *      be transitioned to. If the device state transition fails, a write
>>>>>>>>>>>> + *      to this field returns an error. It consists of 3 bits:
>>>>>>>>>>>> + *      - If bit 0 is set, it indicates the _RUNNING state. When it is
>>>>>>>>>>>> + *        cleared, that indicates the _STOPPED state. When the device is
>>>>>>>>>>>> + *        changed to _STOPPED, the driver should stop the device before
>>>>>>>>>>>> + *        the write returns.
>>>>>>>>>>>> + *      - If bit 1 is set, it indicates the _SAVING state.
>>>>>>>>>>>> + *      - If bit 2 is set, it indicates the _RESUMING state.
>>>>>>>>>>>> + *
>>>>>>>>>>>> + * pending_bytes: (read only)
>>>>>>>>>>>> + *      Read pending bytes yet to be migrated from the vendor driver.
>>>>>>>>>>>> + *
>>>>>>>>>>>> + * data_offset: (read only)
>>>>>>>>>>>> + *      The user application should read data_offset to learn the offset
>>>>>>>>>>>> + *      in the migration region from where it should read data during the
>>>>>>>>>>>> + *      _SAVING state or write data during the _RESUMING state.
>>>>>>>>>>>> + *
>>>>>>>>>>>> + * data_size: (read/write)
>>>>>>>>>>>> + *      The user application should read data_size to know the size of
>>>>>>>>>>>> + *      data copied into the migration region during the _SAVING state,
>>>>>>>>>>>> + *      and write the size of data copied into the migration region
>>>>>>>>>>>> + *      during the _RESUMING state.
>>>>>>>>>>>> + *
>>>>>>>>>>>> + * start_pfn: (write only)
>>>>>>>>>>>> + *      Start address pfn from which to get the bitmap of dirty pages
>>>>>>>>>>>> + *      from the vendor driver during the _SAVING state.
>>>>>>>>>>>> + *
>>>>>>>>>>>> + * page_size: (write only)
>>>>>>>>>>>> + *      The user application should write the page_size of the pfn.
>>>>>>>>>>>> + *
>>>>>>>>>>>> + * total_pfns: (write only)
>>>>>>>>>>>> + *      Total pfn count from start_pfn for which the dirty bitmap is
>>>>>>>>>>>> + *      requested.
>>>>>>>>>>>> + *
>>>>>>>>>>>> + * copied_pfns: (read only)
>>>>>>>>>>>> + *      The pfn count for which the dirty bitmap is copied into the
>>>>>>>>>>>> + *      migration region.
>>>>>>>>>>>> + *      The vendor driver should copy the bitmap with bits set only for
>>>>>>>>>>>> + *      pages to be marked dirty in the migration region.
>>>>>>>>>>>> + *      The vendor driver should return 0 if there are 0 pages dirty in
>>>>>>>>>>>> + *      the requested range.
>>>>>>>>>>>> + *      The vendor driver should return -1 to mark all pages in the
>>>>>>>>>>>> + *      section as dirty.
>>>>>>>>>>>
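As an aside, since the struct member definitions themselves are trimmed from
the quoted hunk above, a minimal sketch of a layout matching the fields
described in that comment could look like the following; the types, ordering
and the reserved padding here are assumptions for illustration only, not the
patch's actual definition:

    #include <linux/types.h>

    struct vfio_device_migration_info {
            __u32 device_state;     /* _RUNNING/_SAVING/_RESUMING bits (RW) */
            __u32 reserved;
            __u64 pending_bytes;    /* bytes yet to be migrated (RO) */
            __u64 data_offset;      /* offset of data section in region (RO) */
            __u64 data_size;        /* size of data in data section (RW) */
            __u64 start_pfn;        /* first pfn of dirty-bitmap request (WO) */
            __u64 page_size;        /* page size used for pfn accounting (WO) */
            __u64 total_pfns;       /* number of pfns requested (WO) */
            __u64 copied_pfns;      /* pfns whose bitmap was copied (RO) */
    };
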
>>>>>>>>>>> Is the protocol that the user writes start_pfn/page_size/total_pfns
>>>>>>>>>>> in any order and then the read of copied_pfns is what triggers the
>>>>>>>>>>> snapshot?
>>>>>>>>>>
>>>>>>>>>> Yes.
>>>>>>>>>>
>>>>>>>>>>> Are start_pfn/page_size/total_pfns sticky such that a user can write
>>>>>>>>>>> them once and get repeated refreshes of the dirty bitmap by
>>>>>>>>>>> re-reading copied_pfns?
>>>>>>>>>>
>>>>>>>>>> Yes, and that bitmap should be for the given range (from start_pfn
>>>>>>>>>> till start_pfn + total_pfns).
>>>>>>>>>> Re-reading copied_pfns handles the case where the vendor driver's
>>>>>>>>>> reserved area for the bitmap is smaller than the total bitmap size
>>>>>>>>>> for the range (start_pfn to start_pfn + total_pfns); then the user
>>>>>>>>>> will have to iterate till copied_pfns == total_pfns or till
>>>>>>>>>> copied_pfns == 0 (that is, there are no pages dirty in the rest of
>>>>>>>>>> the range).
>>>>>>>>>
>>>>>>>>> So reading copied_pfns triggers the data range to be updated, but the
>>>>>>>>> caller cannot assume it to be synchronous and uses total_pfns to poll
>>>>>>>>> that the update is complete? How does the vendor driver differentiate
>>>>>>>>> the user polling for the previous update to finish versus requesting a
>>>>>>>>> new update?
>>>>>>>>>
>>>>>>>>
>>>>>>>> A write to start_pfn/page_size/total_pfns followed by a read of
>>>>>>>> copied_pfns indicates a new update, whereas a sequential read of
>>>>>>>> copied_pfns indicates polling for the previous update.
>>>>>>>
>>>>>>> Hmm, this seems to contradict the answer to my question above where I
>>>>>>> ask if the write fields are sticky so a user can trigger a refresh via
>>>>>>> copied_pfns.
>>>>>>
>>>>>> Sorry, how does it contradict? Pasting it again below:
>>>>>>>>>>> Are start_pfn/page_size/total_pfns sticky such that a user can write
>>>>>>>>>>> them once and get repeated refreshes of the dirty bitmap by
>>>>>>>>>>> re-reading copied_pfns?
>>>>>>>>>>
>>>>>>>>>> Yes, and that bitmap should be for the given range (from start_pfn
>>>>>>>>>> till start_pfn + total_pfns).
>>>>>>>>>> Re-reading copied_pfns handles the case where the vendor driver's
>>>>>>>>>> reserved area for the bitmap is smaller than the total bitmap size
>>>>>>>>>> for the range (start_pfn to start_pfn + total_pfns); then the user
>>>>>>>>>> will have to iterate till copied_pfns == total_pfns or till
>>>>>>>>>> copied_pfns == 0 (that is, there are no pages dirty in the rest of
>>>>>>>>>> the range).
>>>>>
>>>>> Sorry, I guess I misinterpreted again. So the vendor driver can return
>>>>> copied_pfns < total_pfns if it has a buffer limitation, not as an
>>>>> indication of its background progress in writing out the bitmap. Just
>>>>> as a proof of concept, let's say the vendor driver has a 1 bit buffer
>>>>> and I write 0 to start_pfn and 3 to total_pfns. I read copied_pfns,
>>>>> which returns 1, so I read data_offset to find where this 1 bit is
>>>>> located and then read my bit from that location. This is the dirty
>>>>> state of the first pfn. I read copied_pfns again and it reports 2,
>>>>
>>>> It should report 1 to indicate it has data for one pfn.
>>>>
>>>>> I again read data_offset to find where the data is located, and it's my
>>>>> job to remember that I've already read 1 bit, so 2 means there's only 1
>>>>> bit available and it's the second pfn.
>>>>
>>>> No.
>>>> Here 'I' means User application, right?
>>>
>>> Yes
>>>
>>>> The user application knows for how many pfns it has already received the
>>>> bitmap, i.e. see 'count' in function vfio_get_dirty_page_list().
>>>>
>>>> Here copied_pfns is the number of pfns for which the bitmap is available
>>>> in the buffer. The start address for that bitmap is then calculated by
>>>> the user application as:
>>>> ((start_pfn + count) * page_size)
>>>>
>>>> Then QEMU calls:
>>>>
>>>> cpu_physical_memory_set_dirty_lebitmap((unsigned long *)buf,
>>>>                                        (start_pfn + count) * page_size,
>>>>                                        copied_pfns);
>>>>
>>>>> I read the bit. I again read
>>>>> copied_pfns, which now reports 3, I read data_offset to find the
>>>>> location of the data, I remember that I've already read 2 bits, so I
>>>>> read my bit into the 3rd pfn. This seems rather clumsy.
>>>>>
>>>>
>>>> Hope the above explanation helps.
>>>
>>> Still seems rather clumsy; which bit(s) are available in the buffer can
>>> only be known from foreknowledge of which bits have already been read.
>>> That seems error prone for both the user and the vendor driver to stay
>>> in sync.
>>>
>>>>> Now that copied_pfns == total_pfns, what happens if I read copied_pfns
>>>>> again? This is actually what I thought I was asking previously.
>>>>>
>>>>
>>>> It should return 0.
>>>
>>> Are we assuming no new pages have been dirtied? What if pages have
>>> been dirtied?
>>>
>>>>> Should we expose the pfn buffer size and fault on writes larger than
>>>>> that size, requiring the user to iterate start_pfn themselves?
>>>>
>>>> Who should fault, the vendor driver or the user application?
>>>>
>>>> Here the vendor driver is writing data to the data section.
>>>> In the steps in this patch set, the user application is incrementing
>>>> start_pfn by adding the copied_pfns count.
>>>
>>> The user app is writing total_pfns to get a range, correct? The vendor
>>> driver could return errno on that write if total_pfns exceeds the
>>> available buffer size.
>>>
>>
>> OK. If the vendor driver returns an error, will the user application then
>> retry with a smaller size?
>
> I think we'd need to improve the header to indicate the available size;
> it would seem unreasonable to me to require the user to guess how much
> is available. Thanks,
>
Instead of returning an error on the write to total_pfns, how about writing
the updated start_pfn and total_pfns again, as below:

count = 0
while (total_pfns > 0) {
    write(start_pfn + count)
    write(page_size)
    write(total_pfns)
    read(copied_pfns)
    if (copied_pfns == 0)
        break           /* no pages dirty in the rest of the range */
    read(data_offset)
    read bitmap from data_offset for copied_pfns and mark pages dirty
    count += copied_pfns
    total_pfns -= copied_pfns
}
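
To make the intent concrete, below is a rough user-space sketch of that loop
in C. It is only illustrative: the struct layout and field offsets, the
mark_pages_dirty() helper, and the assumption that the header fields are
trapped (pread/pwrite on the device fd) while the bitmap is read out of the
data section are all assumptions for this sketch, not part of the proposed
UAPI; the copied_pfns == -1 ("mark everything dirty") case is omitted for
brevity.

#define _GNU_SOURCE
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

/* Assumed header layout at offset 0 of the migration region (see the
 * sketch earlier in this mail); real offsets come from the final UAPI. */
struct mig_info {
        uint32_t device_state;
        uint32_t reserved;
        uint64_t pending_bytes;
        uint64_t data_offset;
        uint64_t data_size;
        uint64_t start_pfn;
        uint64_t page_size;
        uint64_t total_pfns;
        uint64_t copied_pfns;
};

/* Hypothetical helper supplied by the caller: mark 'npfns' pages starting
 * at guest physical address 'gpa' dirty according to 'bitmap'. */
extern void mark_pages_dirty(const void *bitmap, uint64_t gpa, uint64_t npfns);

/* Read/write one header field; 'region_off' is the migration region's
 * offset within the VFIO device fd. */
static int mig_rd(int fd, off_t region_off, size_t field, void *val, size_t sz)
{
        return pread(fd, val, sz, region_off + field) == (ssize_t)sz ? 0 : -1;
}

static int mig_wr(int fd, off_t region_off, size_t field, const void *val,
                  size_t sz)
{
        return pwrite(fd, val, sz, region_off + field) == (ssize_t)sz ? 0 : -1;
}

/* Pull the dirty bitmap for [start_pfn, start_pfn + total_pfns) in chunks
 * of whatever the vendor driver can buffer (copied_pfns per iteration). */
int sync_dirty_bitmap(int fd, off_t region_off, uint64_t start_pfn,
                      uint64_t page_size, uint64_t total_pfns)
{
        uint64_t count = 0;
        int ret = -1;
        void *bitmap = malloc((total_pfns + 7) / 8);    /* worst-case chunk */

        if (!bitmap)
                return -1;

        while (total_pfns > 0) {
                uint64_t pfn = start_pfn + count, copied, data_offset;

                if (mig_wr(fd, region_off, offsetof(struct mig_info, start_pfn),
                           &pfn, sizeof(pfn)) ||
                    mig_wr(fd, region_off, offsetof(struct mig_info, page_size),
                           &page_size, sizeof(page_size)) ||
                    mig_wr(fd, region_off, offsetof(struct mig_info, total_pfns),
                           &total_pfns, sizeof(total_pfns)))
                        goto out;

                /* Reading copied_pfns asks the vendor driver to fill the
                 * bitmap for as many pfns as it can buffer. */
                if (mig_rd(fd, region_off, offsetof(struct mig_info, copied_pfns),
                           &copied, sizeof(copied)))
                        goto out;
                if (copied == 0)        /* no dirty pages left in the range */
                        break;

                if (mig_rd(fd, region_off, offsetof(struct mig_info, data_offset),
                           &data_offset, sizeof(data_offset)))
                        goto out;

                /* Read the bitmap for 'copied' pfns from the data section. */
                if (pread(fd, bitmap, (copied + 7) / 8,
                          region_off + data_offset) != (ssize_t)((copied + 7) / 8))
                        goto out;

                mark_pages_dirty(bitmap, (start_pfn + count) * page_size, copied);
                count += copied;
                total_pfns -= copied;
        }
        ret = 0;
out:
        free(bitmap);
        return ret;
}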
Thanks,
Kirti