On Tue, 25 Jun 2019 00:22:16 +0530
Kirti Wankhede <kwankh...@nvidia.com> wrote:
> On 6/24/2019 8:55 PM, Alex Williamson wrote:
> > On Mon, 24 Jun 2019 20:30:08 +0530
> > Kirti Wankhede <kwankh...@nvidia.com> wrote:
> >
> >> On 6/22/2019 3:31 AM, Alex Williamson wrote:
> >>> On Sat, 22 Jun 2019 02:00:08 +0530
> >>> Kirti Wankhede <kwankh...@nvidia.com> wrote:
> >>>> On 6/22/2019 1:30 AM, Alex Williamson wrote:
> >>>>> On Sat, 22 Jun 2019 01:05:48 +0530
> >>>>> Kirti Wankhede <kwankh...@nvidia.com> wrote:
> >>>>>
> >>>>>> On 6/21/2019 8:33 PM, Alex Williamson wrote:
> >>>>>>> On Fri, 21 Jun 2019 11:22:15 +0530
> >>>>>>> Kirti Wankhede <kwankh...@nvidia.com> wrote:
> >>>>>>>
> >>>>>>>> On 6/20/2019 10:48 PM, Alex Williamson wrote:
> >>>>>>>>> On Thu, 20 Jun 2019 20:07:29 +0530
> >>>>>>>>> Kirti Wankhede <kwankh...@nvidia.com> wrote:
> >>>>>>>>>
> >>>>>>>>>> - Defined MIGRATION region type and sub-type.
> >>>>>>>>>> - Used 3 bits to define VFIO device states.
> >>>>>>>>>>     Bit 0 => _RUNNING
> >>>>>>>>>>     Bit 1 => _SAVING
> >>>>>>>>>>     Bit 2 => _RESUMING
> >>>>>>>>>>   Combination of these bits defines the VFIO device's state during
> >>>>>>>>>>   migration:
> >>>>>>>>>>     _STOPPED            => All bits 0 indicates the VFIO device is
> >>>>>>>>>>                            stopped.
> >>>>>>>>>>     _RUNNING            => Normal VFIO device running state.
> >>>>>>>>>>     _SAVING | _RUNNING  => vCPUs are running and the VFIO device is
> >>>>>>>>>>                            running, but start saving device state,
> >>>>>>>>>>                            i.e. pre-copy state.
> >>>>>>>>>>     _SAVING             => vCPUs are stopped, the VFIO device should
> >>>>>>>>>>                            be stopped and its state saved, i.e.
> >>>>>>>>>>                            stop-and-copy state.
> >>>>>>>>>>     _RESUMING           => VFIO device resuming state.
> >>>>>>>>>>     _SAVING | _RESUMING => Invalid state; _SAVING and _RESUMING must
> >>>>>>>>>>                            not be set together.
> >>>>>>>>>> - Defined the vfio_device_migration_info structure which will be
> >>>>>>>>>>   placed at offset 0 of the migration region to get/set VFIO device
> >>>>>>>>>>   related information.  Defined members of the structure and usage on
> >>>>>>>>>>   read/write access:
> >>>>>>>>>>   * device_state: (read/write)
> >>>>>>>>>>       To convey the VFIO device state to be transitioned to.  Only 3
> >>>>>>>>>>       bits are used as of now.
> >>>>>>>>>>   * pending_bytes: (read only)
> >>>>>>>>>>       To get pending bytes yet to be migrated for the VFIO device.
> >>>>>>>>>>   * data_offset: (read only)
> >>>>>>>>>>       To get the offset in the migration region from where data exists
> >>>>>>>>>>       during _SAVING and where data should be written by the user space
> >>>>>>>>>>       application during the _RESUMING state.
> >>>>>>>>>>   * data_size: (read/write)
> >>>>>>>>>>       To get and set the size of data copied in the migration region
> >>>>>>>>>>       during the _SAVING and _RESUMING states.
> >>>>>>>>>>   * start_pfn, page_size, total_pfns: (write only)
> >>>>>>>>>>       To get the bitmap of dirty pages from the vendor driver from the
> >>>>>>>>>>       given start address for total_pfns.
> >>>>>>>>>>   * copied_pfns: (read only)
> >>>>>>>>>>       To get the number of pfns for which the bitmap has been copied to
> >>>>>>>>>>       the migration region.  The vendor driver should copy the bitmap
> >>>>>>>>>>       with bits set only for pages to be marked dirty in the migration
> >>>>>>>>>>       region.  The vendor driver should return 0 if there are 0 pages
> >>>>>>>>>>       dirty in the requested range.  The vendor driver should return -1
> >>>>>>>>>>       to mark all pages in the section as dirty.
> >>>>>>>>>>
> >>>>>>>>>> Migration region looks like:
> >>>>>>>>>>  ------------------------------------------------------------------
> >>>>>>>>>>  |vfio_device_migration_info|          data section                |
> >>>>>>>>>>  |                          |  ///////////////////////////////    |
> >>>>>>>>>>  ------------------------------------------------------------------
> >>>>>>>>>>  ^                          ^                                  ^
> >>>>>>>>>>  offset 0-trapped part      data_offset                        data_size
> >>>>>>>>>>
> >>>>>>>>>> The data section always follows the vfio_device_migration_info
> >>>>>>>>>> structure in the region, so data_offset will always be non-zero.  The
> >>>>>>>>>> offset from where data is copied is decided by the kernel driver; the
> >>>>>>>>>> data section can be trapped or mmapped depending on how the kernel
> >>>>>>>>>> driver defines it.  If mmapped, data_offset should be page aligned,
> >>>>>>>>>> whereas the initial section which contains the
> >>>>>>>>>> vfio_device_migration_info structure might not end at a page-aligned
> >>>>>>>>>> offset.
> >>>>>>>>>>
> >>>>>>>>>> Signed-off-by: Kirti Wankhede <kwankh...@nvidia.com>
> >>>>>>>>>> Reviewed-by: Neo Jia <c...@nvidia.com>
> >>>>>>>>>> ---
> >>>>>>>>>>  linux-headers/linux/vfio.h | 71 ++++++++++++++++++++++++++++++++++++++++++++++
> >>>>>>>>>>  1 file changed, 71 insertions(+)
> >>>>>>>>>>
> >>>>>>>>>> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
> >>>>>>>>>> index 24f505199f83..274ec477eb82 100644
> >>>>>>>>>> --- a/linux-headers/linux/vfio.h
> >>>>>>>>>> +++ b/linux-headers/linux/vfio.h
> >>>>>>>>>> @@ -372,6 +372,77 @@ struct vfio_region_gfx_edid {
> >>>>>>>>>>   */
> >>>>>>>>>>  #define VFIO_REGION_SUBTYPE_IBM_NVLINK2_ATSD (1)
> >>>>>>>>>>
> >>>>>>>>>> +/* Migration region type and sub-type */
> >>>>>>>>>> +#define VFIO_REGION_TYPE_MIGRATION             (2)
> >>>>>>>>>> +#define VFIO_REGION_SUBTYPE_MIGRATION          (1)
> >>>>>>>>>> +
> >>>>>>>>>> +/**
> >>>>>>>>>> + * Structure vfio_device_migration_info is placed at the 0th offset of
> >>>>>>>>>> + * the VFIO_REGION_SUBTYPE_MIGRATION region to get/set VFIO device
> >>>>>>>>>> + * related migration information.  Field accesses from this structure
> >>>>>>>>>> + * are only supported at their native width and alignment; otherwise an
> >>>>>>>>>> + * error should be returned.
> >>>>>>>>>> + *
> >>>>>>>>>> + * device_state: (read/write)
> >>>>>>>>>> + *      To indicate to the vendor driver the state the VFIO device
> >>>>>>>>>> + *      should be transitioned to.  If the device state transition
> >>>>>>>>>> + *      fails, a write to this field returns an error.
> >>>>>>>>>> + *      It consists of 3 bits:
> >>>>>>>>>> + *      - If bit 0 is set, it indicates the _RUNNING state.  When it is
> >>>>>>>>>> + *        clear, that indicates the _STOPPED state.  When the device is
> >>>>>>>>>> + *        changed to _STOPPED, the driver should stop the device before
> >>>>>>>>>> + *        the write returns.
> >>>>>>>>>> + *      - If bit 1 is set, it indicates the _SAVING state.
> >>>>>>>>>> + *      - If bit 2 is set, it indicates the _RESUMING state.
> >>>>>>>>>> + *
> >>>>>>>>>> + * pending_bytes: (read only)
> >>>>>>>>>> + *      Read pending bytes yet to be migrated from the vendor driver.
> >>>>>>>>>> + *
> >>>>>>>>>> + * data_offset: (read only)
> >>>>>>>>>> + *      User application should read data_offset in the migration
> >>>>>>>>>> + *      region, from where it should read data during the _SAVING state
> >>>>>>>>>> + *      or write data during the _RESUMING state.
> >>>>>>>>>> + *
> >>>>>>>>>> + * data_size: (read/write)
> >>>>>>>>>> + *      User application should read data_size to know the size of data
> >>>>>>>>>> + *      copied in the migration region during the _SAVING state and
> >>>>>>>>>> + *      write the size of data copied in the migration region during the
> >>>>>>>>>> + *      _RESUMING state.
> >>>>>>>>>> + *
> >>>>>>>>>> + * start_pfn: (write only)
> >>>>>>>>>> + *      Start address pfn to get the bitmap of dirty pages from the
> >>>>>>>>>> + *      vendor driver during the _SAVING state.
> >>>>>>>>>> + *
> >>>>>>>>>> + * page_size: (write only)
> >>>>>>>>>> + *      User application should write the page_size of the pfn.
> >>>>>>>>>> + *
> >>>>>>>>>> + * total_pfns: (write only)
> >>>>>>>>>> + *      Total pfn count from start_pfn for which the dirty bitmap is
> >>>>>>>>>> + *      requested.
> >>>>>>>>>> + *
> >>>>>>>>>> + * copied_pfns: (read only)
> >>>>>>>>>> + *      pfn count for which the dirty bitmap is copied to the migration
> >>>>>>>>>> + *      region.
> >>>>>>>>>> + *      Vendor driver should copy the bitmap with bits set only for
> >>>>>>>>>> + *      pages to be marked dirty in the migration region.
> >>>>>>>>>> + *      Vendor driver should return 0 if there are 0 pages dirty in the
> >>>>>>>>>> + *      requested range.
> >>>>>>>>>> + *      Vendor driver should return -1 to mark all pages in the section
> >>>>>>>>>> + *      as dirty.
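
As a reading aid, here is a rough sketch of the register layout those field
descriptions imply.  The struct body itself is not quoted above, so the field
types, ordering, and macro names below are assumptions for illustration, not
the contents of the actual patch:

/* Sketch only: layout implied by the documentation above; field order,
 * widths and macro names are assumptions, not the posted patch.
 * __u32/__u64 come from <linux/types.h>. */
struct vfio_device_migration_info {
        __u32 device_state;     /* read/write */
#define VFIO_DEVICE_STATE_RUNNING    (1 << 0)   /* clear => _STOPPED */
#define VFIO_DEVICE_STATE_SAVING     (1 << 1)
#define VFIO_DEVICE_STATE_RESUMING   (1 << 2)
        __u32 reserved;
        __u64 pending_bytes;    /* read only  */
        __u64 data_offset;      /* read only  */
        __u64 data_size;        /* read/write */
        __u64 start_pfn;        /* write only */
        __u64 page_size;        /* write only */
        __u64 total_pfns;       /* write only */
        __u64 copied_pfns;      /* read only  */
};

Per the commit message, pre-copy would then be a write of _SAVING | _RUNNING
to device_state and stop-and-copy a write of _SAVING alone.
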
> >>>>>>>>>
> >>>>>>>>> Is the protocol that the user writes start_pfn/page_size/total_pfns in
> >>>>>>>>> any order and then the read of copied_pfns is what triggers the
> >>>>>>>>> snapshot?
> >>>>>>>>
> >>>>>>>> Yes.
> >>>>>>>>
> >>>>>>>>> Are start_pfn/page_size/total_pfns sticky such that a user
> >>>>>>>>> can write them once and get repeated refreshes of the dirty bitmap by
> >>>>>>>>> re-reading copied_pfns?
> >>>>>>>>
> >>>>>>>> Yes, and that bitmap should be for the given range (from start_pfn till
> >>>>>>>> start_pfn + total_pfns).
> >>>>>>>> Re-reading of copied_pfns is to handle the case where the vendor driver
> >>>>>>>> might have reserved an area for the bitmap that is smaller than the
> >>>>>>>> total bitmap size for the range (start_pfn to start_pfn + total_pfns);
> >>>>>>>> the user then has to iterate till copied_pfns == total_pfns or till
> >>>>>>>> copied_pfns == 0 (that is, there are no pages dirty in the rest of the
> >>>>>>>> range).
> >>>>>>>
> >>>>>>> So reading copied_pfns triggers the data range to be updated, but the
> >>>>>>> caller cannot assume it to be synchronous and uses total_pfns to poll
> >>>>>>> that the update is complete?  How does the vendor driver differentiate
> >>>>>>> the user polling for the previous update to finish versus requesting a
> >>>>>>> new update?
> >>>>>>>
> >>>>>>
> >>>>>> A write to start_pfn/page_size/total_pfns followed by a read of
> >>>>>> copied_pfns indicates a new update, whereas a sequential read of
> >>>>>> copied_pfns indicates polling for the previous update.
> >>>>>
> >>>>> Hmm, this seems to contradict the answer to my question above where I
> >>>>> ask if the write fields are sticky so a user can trigger a refresh via
> >>>>> copied_pfns.
> >>>>
> >>>> Sorry, how does it contradict?  Pasting it again below:
> >>>>
> >>>>>>>>> Are start_pfn/page_size/total_pfns sticky such that a user
> >>>>>>>>> can write them once and get repeated refreshes of the dirty bitmap by
> >>>>>>>>> re-reading copied_pfns?
> >>>>>>>>
> >>>>>>>> Yes, and that bitmap should be for the given range (from start_pfn till
> >>>>>>>> start_pfn + total_pfns).
> >>>>>>>> Re-reading of copied_pfns is to handle the case where the vendor driver
> >>>>>>>> might have reserved an area for the bitmap that is smaller than the
> >>>>>>>> total bitmap size for the range (start_pfn to start_pfn + total_pfns);
> >>>>>>>> the user then has to iterate till copied_pfns == total_pfns or till
> >>>>>>>> copied_pfns == 0 (that is, there are no pages dirty in the rest of the
> >>>>>>>> range).
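
For illustration, a minimal userspace sketch of the iteration being described.
The field offsets, the pread()/pwrite() access pattern, the buffer sizing and
the mark_dirty() callback are placeholders, not the actual QEMU
implementation; error handling and the copied_pfns == -1 (mark everything
dirty) case are omitted:

#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Assumed offsets of fields within vfio_device_migration_info. */
enum {
        OFF_DATA_OFFSET = 16,
        OFF_START_PFN   = 32,
        OFF_PAGE_SIZE   = 40,
        OFF_TOTAL_PFNS  = 48,
        OFF_COPIED_PFNS = 56,
};

/* devfd/region_offset would come from VFIO_DEVICE_GET_REGION_INFO on the
 * migration region; mark_dirty() stands in for the dirty-bitmap sync. */
static void sync_dirty_bitmap(int devfd, off_t region_offset,
                              uint64_t start_pfn, uint64_t page_size,
                              uint64_t total_pfns,
                              void (*mark_dirty)(void *bitmap,
                                                 uint64_t start_addr,
                                                 uint64_t npages))
{
        uint64_t count = 0;
        uint8_t bitmap[4096];

        /* The write-only fields are sticky: written once for the range. */
        pwrite(devfd, &start_pfn, sizeof(start_pfn), region_offset + OFF_START_PFN);
        pwrite(devfd, &page_size, sizeof(page_size), region_offset + OFF_PAGE_SIZE);
        pwrite(devfd, &total_pfns, sizeof(total_pfns), region_offset + OFF_TOTAL_PFNS);

        while (count < total_pfns) {
                uint64_t copied_pfns, data_offset;

                /* Reading copied_pfns asks the vendor driver for the next chunk. */
                pread(devfd, &copied_pfns, sizeof(copied_pfns),
                      region_offset + OFF_COPIED_PFNS);
                if (copied_pfns == 0)
                        break;  /* no dirty pages in the rest of the range */
                if ((copied_pfns + 7) / 8 > sizeof(bitmap))
                        return; /* sketch: real code would size the buffer properly */

                /* The bitmap for these copied_pfns pages starts at data_offset. */
                pread(devfd, &data_offset, sizeof(data_offset),
                      region_offset + OFF_DATA_OFFSET);
                pread(devfd, bitmap, (copied_pfns + 7) / 8,
                      region_offset + data_offset);

                /* 'count' is the caller's only record of which pfns this chunk
                 * covers, which is exactly the bookkeeping under discussion. */
                mark_dirty(bitmap, (start_pfn + count) * page_size, copied_pfns);
                count += copied_pfns;
        }
}

In QEMU the mark_dirty() step corresponds to the
cpu_physical_memory_set_dirty_lebitmap() call shown further down in the
thread.
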
> >>>
> >>> Sorry, I guess I misinterpreted again.  So the vendor driver can return
> >>> copied_pfns < total_pfns if it has a buffer limitation, not as an
> >>> indication of its background progress in writing out the bitmap.  Just
> >>> as a proof of concept, let's say the vendor driver has a 1 bit buffer
> >>> and I write 0 to start_pfn and 3 to total_pfns.  I read copied_pfns,
> >>> which returns 1, so I read data_offset to find where this 1 bit is
> >>> located and then read my bit from that location.  This is the dirty
> >>> state of the first pfn.  I read copied_pfns again and it reports 2,
> >>
> >> It should report 1 to indicate it has data for one pfn.
> >>
> >>> I again read data_offset to find where the data is located, and it's my
> >>> job to remember that I've already read 1 bit, so 2 means there's only 1
> >>> bit available and it's the second pfn.
> >>
> >> No.
> >> Here 'I' means the user application, right?
> >
> > Yes
> >
> >> The user application knows for how many pfns it has already received the
> >> bitmap, i.e. see 'count' in function vfio_get_dirty_page_list().
> >>
> >> Here copied_pfns is the number of pfns for which the bitmap is available
> >> in the buffer.  The start address for that bitmap is then calculated by
> >> the user application as:
> >>   ((start_pfn + count) * page_size)
> >>
> >> Then QEMU calls:
> >>
> >> cpu_physical_memory_set_dirty_lebitmap((unsigned long *)buf,
> >>                                        (start_pfn + count) * page_size,
> >>                                        copied_pfns);
> >>
> >>> I read the bit.  I again read
> >>> copied_pfns, which now reports 3, I read data_offset to find the
> >>> location of the data, I remember that I've already read 2 bits, so I
> >>> read my bit for the 3rd pfn.  This seems rather clumsy.
> >>>
> >>
> >> Hope the above explanation helps.
> >
> > Still seems rather clumsy: the knowledge of which bit(s) are available
> > in the buffer can only be known by foreknowledge of which bits have
> > already been read.  That seems error prone for both the user and the
> > vendor driver to stay in sync.
> >
> >>> Now that copied_pfns == total_pfns, what happens if I read copied_pfns
> >>> again?  This is actually what I thought I was asking previously.
> >>>
> >>
> >> It should return 0.
> >
> > Are we assuming no new pages have been dirtied?  What if pages have
> > been dirtied?
> >
> >>> Should we expose the pfn buffer size and fault on writes larger than
> >>> that size, requiring the user to iterate start_pfn themselves?
> >>
> >> Who should fault, the vendor driver or the user application?
> >>
> >> Here the vendor driver is writing data to the data section.
> >> In the steps in this patch set, the user application is incrementing
> >> start_pfn by adding the copied_pfns count.
> >
> > The user app is writing total_pfns to get a range, correct?  The vendor
> > driver could return errno on that write if total_pfns exceeds the
> > available buffer size.
> >
>
> OK.  If the vendor driver returns an error, will the user application then
> retry with a smaller size?

I think we'd need to improve the header to indicate the available size,
it would seem unreasonable to me to require the user to guess how much
is available.  Thanks,

Alex
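
One possible shape of that last suggestion, purely illustrative: neither the
bitmap_buffer_pfns field, its offset, nor this helper exists in the posted
patch; they are assumptions sketched only to show how advertising the buffer
size could remove the guesswork:

#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

#define OFF_BITMAP_BUFFER_PFNS  64   /* hypothetical read-only capacity field */

static uint64_t clamp_total_pfns(int devfd, off_t region_offset, uint64_t want)
{
        uint64_t cap;

        /* Read the advertised capacity (largest range the vendor driver can
         * service in one request); fall back if the field is not implemented. */
        if (pread(devfd, &cap, sizeof(cap),
                  region_offset + OFF_BITMAP_BUFFER_PFNS) != sizeof(cap) ||
            cap == 0)
                return want;

        return want < cap ? want : cap;
}

The user application would then advance start_pfn in chunks of the clamped
size instead of relying on the vendor driver to truncate the request via
copied_pfns.
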