On Sat, 22 Jun 2019 02:00:08 +0530 Kirti Wankhede <kwankh...@nvidia.com> wrote: > On 6/22/2019 1:30 AM, Alex Williamson wrote: > > On Sat, 22 Jun 2019 01:05:48 +0530 > > Kirti Wankhede <kwankh...@nvidia.com> wrote: > > > >> On 6/21/2019 8:33 PM, Alex Williamson wrote: > >>> On Fri, 21 Jun 2019 11:22:15 +0530 > >>> Kirti Wankhede <kwankh...@nvidia.com> wrote: > >>> > >>>> On 6/20/2019 10:48 PM, Alex Williamson wrote: > >>>>> On Thu, 20 Jun 2019 20:07:29 +0530 > >>>>> Kirti Wankhede <kwankh...@nvidia.com> wrote: > >>>>> > >>>>>> - Defined MIGRATION region type and sub-type. > >>>>>> - Used 3 bits to define VFIO device states. > >>>>>> Bit 0 => _RUNNING > >>>>>> Bit 1 => _SAVING > >>>>>> Bit 2 => _RESUMING > >>>>>> Combination of these bits defines VFIO device's state during > >>>>>> migration > >>>>>> _STOPPED => All bits 0 indicates VFIO device stopped. > >>>>>> _RUNNING => Normal VFIO device running state. > >>>>>> _SAVING | _RUNNING => vCPUs are running, VFIO device is running > >>>>>> but start > >>>>>> saving state of device i.e. pre-copy state > >>>>>> _SAVING => vCPUs are stoppped, VFIO device should be stopped, and > >>>>>> save device state,i.e. stop-n-copy state > >>>>>> _RESUMING => VFIO device resuming state. > >>>>>> _SAVING | _RESUMING => Invalid state if _SAVING and _RESUMING bits > >>>>>> are set > >>>>>> - Defined vfio_device_migration_info structure which will be placed at > >>>>>> 0th > >>>>>> offset of migration region to get/set VFIO device related > >>>>>> information. > >>>>>> Defined members of structure and usage on read/write access: > >>>>>> * device_state: (read/write) > >>>>>> To convey VFIO device state to be transitioned to. Only 3 bits > >>>>>> are used > >>>>>> as of now. > >>>>>> * pending bytes: (read only) > >>>>>> To get pending bytes yet to be migrated for VFIO device. > >>>>>> * data_offset: (read only) > >>>>>> To get data offset in migration from where data exist during > >>>>>> _SAVING > >>>>>> and from where data should be written by user space > >>>>>> application during > >>>>>> _RESUMING state > >>>>>> * data_size: (read/write) > >>>>>> To get and set size of data copied in migration region during > >>>>>> _SAVING > >>>>>> and _RESUMING state. > >>>>>> * start_pfn, page_size, total_pfns: (write only) > >>>>>> To get bitmap of dirty pages from vendor driver from given > >>>>>> start address for total_pfns. > >>>>>> * copied_pfns: (read only) > >>>>>> To get number of pfns bitmap copied in migration region. > >>>>>> Vendor driver should copy the bitmap with bits set only for > >>>>>> pages to be marked dirty in migration region. Vendor driver > >>>>>> should return 0 if there are 0 pages dirty in requested > >>>>>> range. Vendor driver should return -1 to mark all pages in the > >>>>>> section > >>>>>> as dirty > >>>>>> > >>>>>> Migration region looks like: > >>>>>> ------------------------------------------------------------------ > >>>>>> |vfio_device_migration_info| data section | > >>>>>> | | /////////////////////////////// | > >>>>>> ------------------------------------------------------------------ > >>>>>> ^ ^ ^ > >>>>>> offset 0-trapped part data_offset data_size > >>>>>> > >>>>>> Data section is always followed by vfio_device_migration_info > >>>>>> structure in the region, so data_offset will always be none-0. > >>>>>> Offset from where data is copied is decided by kernel driver, data > >>>>>> section can be trapped or mapped depending on how kernel driver > >>>>>> defines data section. If mmapped, then data_offset should be page > >>>>>> aligned, where as initial section which contain > >>>>>> vfio_device_migration_info structure might not end at offset which > >>>>>> is page aligned. > >>>>>> > >>>>>> Signed-off-by: Kirti Wankhede <kwankh...@nvidia.com> > >>>>>> Reviewed-by: Neo Jia <c...@nvidia.com> > >>>>>> --- > >>>>>> linux-headers/linux/vfio.h | 71 > >>>>>> ++++++++++++++++++++++++++++++++++++++++++++++ > >>>>>> 1 file changed, 71 insertions(+) > >>>>>> > >>>>>> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h > >>>>>> index 24f505199f83..274ec477eb82 100644 > >>>>>> --- a/linux-headers/linux/vfio.h > >>>>>> +++ b/linux-headers/linux/vfio.h > >>>>>> @@ -372,6 +372,77 @@ struct vfio_region_gfx_edid { > >>>>>> */ > >>>>>> #define VFIO_REGION_SUBTYPE_IBM_NVLINK2_ATSD (1) > >>>>>> > >>>>>> +/* Migration region type and sub-type */ > >>>>>> +#define VFIO_REGION_TYPE_MIGRATION (2) > >>>>>> +#define VFIO_REGION_SUBTYPE_MIGRATION (1) > >>>>>> + > >>>>>> +/** > >>>>>> + * Structure vfio_device_migration_info is placed at 0th offset of > >>>>>> + * VFIO_REGION_SUBTYPE_MIGRATION region to get/set VFIO device > >>>>>> related migration > >>>>>> + * information. Field accesses from this structure are only supported > >>>>>> at their > >>>>>> + * native width and alignment, otherwise should return error. > >>>>>> + * > >>>>>> + * device_state: (read/write) > >>>>>> + * To indicate vendor driver the state VFIO device should be > >>>>>> transitioned > >>>>>> + * to. If device state transition fails, write to this field > >>>>>> return error. > >>>>>> + * It consists of 3 bits: > >>>>>> + * - If bit 0 set, indicates _RUNNING state. When its reset, > >>>>>> that indicates > >>>>>> + * _STOPPED state. When device is changed to _STOPPED, driver > >>>>>> should stop > >>>>>> + * device before write returns. > >>>>>> + * - If bit 1 set, indicates _SAVING state. > >>>>>> + * - If bit 2 set, indicates _RESUMING state. > >>>>>> + * > >>>>>> + * pending bytes: (read only) > >>>>>> + * Read pending bytes yet to be migrated from vendor driver > >>>>>> + * > >>>>>> + * data_offset: (read only) > >>>>>> + * User application should read data_offset in migration region > >>>>>> from where > >>>>>> + * user application should read data during _SAVING state or > >>>>>> write data > >>>>>> + * during _RESUMING state. > >>>>>> + * > >>>>>> + * data_size: (read/write) > >>>>>> + * User application should read data_size to know data copied in > >>>>>> migration > >>>>>> + * region during _SAVING state and write size of data copied in > >>>>>> migration > >>>>>> + * region during _RESUMING state. > >>>>>> + * > >>>>>> + * start_pfn: (write only) > >>>>>> + * Start address pfn to get bitmap of dirty pages from vendor > >>>>>> driver duing > >>>>>> + * _SAVING state. > >>>>>> + * > >>>>>> + * page_size: (write only) > >>>>>> + * User application should write the page_size of pfn. > >>>>>> + * > >>>>>> + * total_pfns: (write only) > >>>>>> + * Total pfn count from start_pfn for which dirty bitmap is > >>>>>> requested. > >>>>>> + * > >>>>>> + * copied_pfns: (read only) > >>>>>> + * pfn count for which dirty bitmap is copied to migration > >>>>>> region. > >>>>>> + * Vendor driver should copy the bitmap with bits set only for > >>>>>> pages to be > >>>>>> + * marked dirty in migration region. > >>>>>> + * Vendor driver should return 0 if there are 0 pages dirty in > >>>>>> requested > >>>>>> + * range. > >>>>>> + * Vendor driver should return -1 to mark all pages in the > >>>>>> section as > >>>>>> + * dirty. > >>>>> > >>>>> Is the protocol that the user writes start_pfn/page_size/total_pfns in > >>>>> any order and then the read of copied_pfns is what triggers the > >>>>> snapshot? > >>>> > >>>> Yes. > >>>> > >>>>> Are start_pfn/page_size/total_pfns sticky such that a user > >>>>> can write them once and get repeated refreshes of the dirty bitmap by > >>>>> re-reading copied_pfns? > >>>> > >>>> Yes and that bitmap should be for given range (from start_pfn till > >>>> start_pfn + tolal_pfns). > >>>> Re-reading of copied_pfns is to handle the case where it might be > >>>> possible that vendor driver reserved area for bitmap < total bitmap size > >>>> for range (start_pfn to start_pfn + tolal_pfns), then user will have to > >>>> iterate till copied_pfns == total_pfns or till copied_pfns == 0 (that > >>>> is, there are no pages dirty in rest of the range) > >>> > >>> So reading copied_pfns triggers the data range to be updated, but the > >>> caller cannot assume it to be synchronous and uses total_pfns to poll > >>> that the update is complete? How does the vendor driver differentiate > >>> the user polling for the previous update to finish versus requesting a > >>> new update? > >>> > >> > >> Write on start_pfn/page_size/total_pfns, then read on copied_pfns > >> indicates new update, where as sequential read on copied_pfns indicates > >> polling for previous update. > > > > Hmm, this seems to contradict the answer to my question above where I > > ask if the write fields are sticky so a user can trigger a refresh via > > copied_pfns. > > Sorry, how its contradict? pasting it again below: > >>>>> Are start_pfn/page_size/total_pfns sticky such that a user > >>>>> can write them once and get repeated refreshes of the dirty bitmap by > >>>>> re-reading copied_pfns? > >>>> > >>>> Yes and that bitmap should be for given range (from start_pfn till > >>>> start_pfn + tolal_pfns). > >>>> Re-reading of copied_pfns is to handle the case where it might be > >>>> possible that vendor driver reserved area for bitmap < total bitmap > size > >>>> for range (start_pfn to start_pfn + tolal_pfns), then user will have to > >>>> iterate till copied_pfns == total_pfns or till copied_pfns == 0 (that > >>>> is, there are no pages dirty in rest of the range)
Sorry, I guess I misinterpreted again. So the vendor driver can return copied_pfns < total_pfns if it has a buffer limitation, not as an indication of its background progress in writing out the bitmap. Just as a proof of concept, let's say the vendor driver has a 1 bit buffer and I write 0 to start_pfn and 3 to total_pfns. I read copied_pfns, which returns 1, so I read data_offset to find where this 1 bit is located and then read my bit from that location. This is the dirty state of the first pfn. I read copied_pfns again and it reports 2, so I again read data_offset to find where the data is located, and it's my job to remember that I've already read 1 bit, so 2 means there's only 1 bit available and it's the second pfn. I read the bit. I again read copied_pfns, which now reports 3, I read data_offset to find the location of the data, I remember that I've already read 2 bits, so I read my bit into the 3rd pfn. This seems rather clumsy. Now that copied_pfns == total_pfns, what happens if I read copied_pfns again? This is actually what I thought I was asking previously. Should we expose the pfn buffer size and fault on writes of larger than that size, requiring the user to iterate start_pfn themselves? Are there any operations where the user can assume data_offset is constant? Thanks, Alex