Re: [PATCH v12 Kernel 4/7] vfio iommu: Implementation of ioctl to for dirty pages tracking.
On Wed, 19 Feb 2020 09:51:32 +0530 Kirti Wankhede wrote: > On 2/19/2020 3:11 AM, Alex Williamson wrote: > > On Tue, 18 Feb 2020 11:28:53 +0530 > > Kirti Wankhede wrote: > > > >> > >> > >>> As I understand the above algorithm, we find a vfio_dma > >>> overlapping the request and populate the bitmap for that range. Then > >>> we go back and put_user() for each byte that we touched. We could > >>> instead simply work on a one byte buffer as we enumerate the requested > >>> range and do a put_user() ever time we reach the end of it and have > >>> bits > >>> set. That would greatly simplify the above example. But I would > >>> expect > >>> that we're a) more likely to get asked for ranges covering a single > >>> vfio_dma > >> > >> QEMU ask for single vfio_dma during each iteration. > >> > >> If we restrict this ABI to cover single vfio_dma only, then it > >> simplifies the logic here. That was my original suggestion. Should we > >> think about that again? > > > > But we currently allow unmaps that overlap multiple vfio_dmas as long > > as no vfio_dma is bisected, so I think that implies that an unmap while > > asking for the dirty bitmap has even further restricted semantics. I'm > > also reluctant to design an ABI around what happens to be the current > > QEMU implementation. > > > > If we take your example above, ranges {0x,0xa000} and > > {0xa000,0x1} ({start,end}), I think you're working with the > > following two bitmaps in this implementation: > > > > 0011 b > > 0011b > > > > And we need to combine those into: > > > > b > > > > Right? > > > > But it seems like that would be easier if the second bitmap was instead: > > > > 1100b > > > > Then we wouldn't need to worry about the entire bitmap being shifted by > > the bit offset within the byte, which limits our fixes to the boundary > > byte and allows us to use copy_to_user() directly for the bulk of the > > copy. So how do we get there? 
> > > > I think we start with allocating the vfio_dma bitmap to account for > > this initial offset, so we calculate bitmap_base_iova as: > > (iova & ~((PAGE_SIZE << 3) - 1)) > > We then use bitmap_base_iova in calculating which bits to set. > > > > The user needs to follow the same rules, and maybe this adds some value > > to the user providing the bitmap size rather than the kernel > > calculating it. For example, if the user wanted the dirty bitmap for > > the range {0xa000,0x1} above, they'd provide at least a 1 byte > > bitmap, but we'd return bit #2 set to indicate 0xa000 is dirty. > > > > Effectively the user can ask for any iova range, but the buffer will be > > filled relative to the zeroth bit of the bitmap following the above > > bitmap_base_iova formula (and replacing PAGE_SIZE with the user > > requested pgsize). I'm tempted to make this explicit in the user > > interface (ie. only allow bitmaps starting on aligned pages), but a > > user is able to map and unmap single pages and we need to support > > returning a dirty bitmap with an unmap, so I don't think we can do that. > > > > Sigh, finding adjacent vfio_dmas within the same byte seems simpler than > this. > >>> > >>> How does KVM do this? My intent was that if all of our bitmaps share > >>> the same alignment then we can merge the intersection and continue to > >>> use copy_to_user() on either side. However, if QEMU doesn't do the > >>> same, it doesn't really help us. Is QEMU stuck with an implementation > >>> of only retrieving dirty bits per MemoryRegionSection exactly because > >>> of this issue and therefore we can rely on it in our implementation as > >>> well? Thanks, > >>> > >> > >> QEMU sync dirty_bitmap per MemoryRegionSection. Within > >> MemoryRegionSection there could be multiple KVMSlots. QEMU queries > >> dirty_bitmap per KVMSlot and mark dirty for each KVMSlot. 
> >> On kernel side, KVM_GET_DIRTY_LOG ioctl calls > >> kvm_get_dirty_log_protect(), where it uses copy_to_user() to copy bitmap > >> of that memSlot. > >> vfio_dma is per MemoryRegionSection. We can reply on MemoryRegionSection > >> in our implementation. But to get bitmap during unmap, we have to take > >> care of concatenating bitmaps. > > > > So KVM does not worry about bitmap alignment because the interface is > > based on slots, a dirty bitmap can only be retrieved for a single, > > entire slot. We need VFIO_IOMMU_UNMAP_DMA to maintain its support for > > spanning multiple vfio_dmas, but maybe we have some leeway that we > > don't need to support both multiple vfio_dmas and dirty bitmap at the > > same time. It seems like it would be a massive simplification if we > > required an unmap with dirty bitmap to s
Re: [PATCH v12 Kernel 4/7] vfio iommu: Implementation of ioctl to for dirty pages tracking.
On 2/19/2020 3:11 AM, Alex Williamson wrote: On Tue, 18 Feb 2020 11:28:53 +0530 Kirti Wankhede wrote: As I understand the above algorithm, we find a vfio_dma overlapping the request and populate the bitmap for that range. Then we go back and put_user() for each byte that we touched. We could instead simply work on a one byte buffer as we enumerate the requested range and do a put_user() ever time we reach the end of it and have bits set. That would greatly simplify the above example. But I would expect that we're a) more likely to get asked for ranges covering a single vfio_dma QEMU ask for single vfio_dma during each iteration. If we restrict this ABI to cover single vfio_dma only, then it simplifies the logic here. That was my original suggestion. Should we think about that again? But we currently allow unmaps that overlap multiple vfio_dmas as long as no vfio_dma is bisected, so I think that implies that an unmap while asking for the dirty bitmap has even further restricted semantics. I'm also reluctant to design an ABI around what happens to be the current QEMU implementation. If we take your example above, ranges {0x,0xa000} and {0xa000,0x1} ({start,end}), I think you're working with the following two bitmaps in this implementation: 0011 b 0011b And we need to combine those into: b Right? But it seems like that would be easier if the second bitmap was instead: 1100b Then we wouldn't need to worry about the entire bitmap being shifted by the bit offset within the byte, which limits our fixes to the boundary byte and allows us to use copy_to_user() directly for the bulk of the copy. So how do we get there? I think we start with allocating the vfio_dma bitmap to account for this initial offset, so we calculate bitmap_base_iova as: (iova & ~((PAGE_SIZE << 3) - 1)) We then use bitmap_base_iova in calculating which bits to set. 
The user needs to follow the same rules, and maybe this adds some value to the user providing the bitmap size rather than the kernel calculating it. For example, if the user wanted the dirty bitmap for the range {0xa000,0x1} above, they'd provide at least a 1 byte bitmap, but we'd return bit #2 set to indicate 0xa000 is dirty. Effectively the user can ask for any iova range, but the buffer will be filled relative to the zeroth bit of the bitmap following the above bitmap_base_iova formula (and replacing PAGE_SIZE with the user requested pgsize). I'm tempted to make this explicit in the user interface (ie. only allow bitmaps starting on aligned pages), but a user is able to map and unmap single pages and we need to support returning a dirty bitmap with an unmap, so I don't think we can do that. Sigh, finding adjacent vfio_dmas within the same byte seems simpler than this. How does KVM do this? My intent was that if all of our bitmaps share the same alignment then we can merge the intersection and continue to use copy_to_user() on either side. However, if QEMU doesn't do the same, it doesn't really help us. Is QEMU stuck with an implementation of only retrieving dirty bits per MemoryRegionSection exactly because of this issue and therefore we can rely on it in our implementation as well? Thanks, QEMU sync dirty_bitmap per MemoryRegionSection. Within MemoryRegionSection there could be multiple KVMSlots. QEMU queries dirty_bitmap per KVMSlot and mark dirty for each KVMSlot. On kernel side, KVM_GET_DIRTY_LOG ioctl calls kvm_get_dirty_log_protect(), where it uses copy_to_user() to copy bitmap of that memSlot. vfio_dma is per MemoryRegionSection. We can reply on MemoryRegionSection in our implementation. But to get bitmap during unmap, we have to take care of concatenating bitmaps. So KVM does not worry about bitmap alignment because the interface is based on slots, a dirty bitmap can only be retrieved for a single, entire slot. 
We need VFIO_IOMMU_UNMAP_DMA to maintain its support for spanning multiple vfio_dmas, but maybe we have some leeway that we don't need to support both multiple vfio_dmas and dirty bitmap at the same time. It seems like it would be a massive simplification if we required an unmap with dirty bitmap to span exactly one vfio_dma, right?

Yes.

I don't see that we'd break any existing users with that. It's unfortunate that we can't have the flexibility of the existing calling convention, but I think there's good reason for it here. Our separate dirty bitmap log reporting would follow the same semantics. I think this all aligns with how the MemoryListener works in QEMU right now, correct? For example we wouldn't need any extra per MAP_DMA tracking in QEMU like KVM has for its slots.

That's right. Should we go ahead with the implementation to get the dirty bitmap for one vfio_dma for the GET_DIRTY ioctl and the unmap with dirty ioctl? Accordingly we can have sanity checks in these ioctls.

Thanks,
Kirti

In QEMU, in function kvm_physical_sync_dirty_bitmap() t
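The bitmap_base_iova formula quoted earlier in this message can be checked with a few lines of C. This is only a sketch under stated assumptions: 4 KiB pages, bit 0 being the least significant bit of the first bitmap byte, and helper names (sketch_bitmap_base_iova, sketch_bitmap_bit) that are made up for illustration and are not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * One bitmap byte covers 8 pages, i.e. (pgsize << 3) bytes of iova.
 * Rounding the iova down to that boundary gives the iova that bit 0 of
 * the first bitmap byte stands for.  pgsize must be a power of two.
 */
static uint64_t sketch_bitmap_base_iova(uint64_t iova, uint64_t pgsize)
{
    assert(pgsize != 0 && (pgsize & (pgsize - 1)) == 0);
    return iova & ~((pgsize << 3) - 1);
}

/* Bit index of the page at iova, counted from the aligned base. */
static unsigned int sketch_bitmap_bit(uint64_t iova, uint64_t pgsize)
{
    uint64_t base = sketch_bitmap_base_iova(iova, pgsize);

    return (unsigned int)((iova - base) / pgsize);
}
```

With pgsize = 4096, sketch_bitmap_base_iova(0xa000, 4096) is 0x8000 and sketch_bitmap_bit(0xa000, 4096) is 2, which matches the "bit #2 set to indicate 0xa000 is dirty" statement above.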
Re: [PATCH v12 Kernel 4/7] vfio iommu: Implementation of ioctl to for dirty pages tracking.
On Tue, 18 Feb 2020 11:28:53 +0530 Kirti Wankhede wrote: > > > >As I understand the above algorithm, we find a vfio_dma > > overlapping the request and populate the bitmap for that range. Then > > we go back and put_user() for each byte that we touched. We could > > instead simply work on a one byte buffer as we enumerate the requested > > range and do a put_user() ever time we reach the end of it and have bits > > set. That would greatly simplify the above example. But I would expect > > that we're a) more likely to get asked for ranges covering a single > > vfio_dma > > QEMU ask for single vfio_dma during each iteration. > > If we restrict this ABI to cover single vfio_dma only, then it > simplifies the logic here. That was my original suggestion. Should we > think about that again? > >>> > >>> But we currently allow unmaps that overlap multiple vfio_dmas as long > >>> as no vfio_dma is bisected, so I think that implies that an unmap while > >>> asking for the dirty bitmap has even further restricted semantics. I'm > >>> also reluctant to design an ABI around what happens to be the current > >>> QEMU implementation. > >>> > >>> If we take your example above, ranges {0x,0xa000} and > >>> {0xa000,0x1} ({start,end}), I think you're working with the > >>> following two bitmaps in this implementation: > >>> > >>> 0011 b > >>> 0011b > >>> > >>> And we need to combine those into: > >>> > >>> b > >>> > >>> Right? > >>> > >>> But it seems like that would be easier if the second bitmap was instead: > >>> > >>> 1100b > >>> > >>> Then we wouldn't need to worry about the entire bitmap being shifted by > >>> the bit offset within the byte, which limits our fixes to the boundary > >>> byte and allows us to use copy_to_user() directly for the bulk of the > >>> copy. So how do we get there? 
> >>> > >>> I think we start with allocating the vfio_dma bitmap to account for > >>> this initial offset, so we calculate bitmap_base_iova as: > >>> (iova & ~((PAGE_SIZE << 3) - 1)) > >>> We then use bitmap_base_iova in calculating which bits to set. > >>> > >>> The user needs to follow the same rules, and maybe this adds some value > >>> to the user providing the bitmap size rather than the kernel > >>> calculating it. For example, if the user wanted the dirty bitmap for > >>> the range {0xa000,0x1} above, they'd provide at least a 1 byte > >>> bitmap, but we'd return bit #2 set to indicate 0xa000 is dirty. > >>> > >>> Effectively the user can ask for any iova range, but the buffer will be > >>> filled relative to the zeroth bit of the bitmap following the above > >>> bitmap_base_iova formula (and replacing PAGE_SIZE with the user > >>> requested pgsize). I'm tempted to make this explicit in the user > >>> interface (ie. only allow bitmaps starting on aligned pages), but a > >>> user is able to map and unmap single pages and we need to support > >>> returning a dirty bitmap with an unmap, so I don't think we can do that. > >>> > >> > >> Sigh, finding adjacent vfio_dmas within the same byte seems simpler than > >> this. > > > > How does KVM do this? My intent was that if all of our bitmaps share > > the same alignment then we can merge the intersection and continue to > > use copy_to_user() on either side. However, if QEMU doesn't do the > > same, it doesn't really help us. Is QEMU stuck with an implementation > > of only retrieving dirty bits per MemoryRegionSection exactly because > > of this issue and therefore we can rely on it in our implementation as > > well? Thanks, > > > > QEMU sync dirty_bitmap per MemoryRegionSection. Within > MemoryRegionSection there could be multiple KVMSlots. QEMU queries > dirty_bitmap per KVMSlot and mark dirty for each KVMSlot. 
> On kernel side, KVM_GET_DIRTY_LOG ioctl calls > kvm_get_dirty_log_protect(), where it uses copy_to_user() to copy bitmap > of that memSlot. > vfio_dma is per MemoryRegionSection. We can reply on MemoryRegionSection > in our implementation. But to get bitmap during unmap, we have to take > care of concatenating bitmaps. So KVM does not worry about bitmap alignment because the interface is based on slots, a dirty bitmap can only be retrieved for a single, entire slot. We need VFIO_IOMMU_UNMAP_DMA to maintain its support for spanning multiple vfio_dmas, but maybe we have some leeway that we don't need to support both multiple vfio_dmas and dirty bitmap at the same time. It seems like it would be a massive simplification if we required an unmap with dirty bitmap to span exactly one vfio_dma, right? I don't see that we'd break any existing users with that, it's unfortunate that we can't have the flexibility of the existing calling convention, but I think there's good reason for it here. Our separate dirty bitmap log reporting would follow the same semantics. I think this all aligns with how the MemoryListener works in
Re: [PATCH v12 Kernel 4/7] vfio iommu: Implementation of ioctl to for dirty pages tracking.
As I understand the above algorithm, we find a vfio_dma overlapping the request and populate the bitmap for that range. Then we go back and put_user() for each byte that we touched. We could instead simply work on a one byte buffer as we enumerate the requested range and do a put_user() every time we reach the end of it and have bits set. That would greatly simplify the above example. But I would expect that we're a) more likely to get asked for ranges covering a single vfio_dma

QEMU asks for a single vfio_dma during each iteration.

If we restrict this ABI to cover a single vfio_dma only, then it simplifies the logic here. That was my original suggestion. Should we think about that again?

But we currently allow unmaps that overlap multiple vfio_dmas as long as no vfio_dma is bisected, so I think that implies that an unmap while asking for the dirty bitmap has even further restricted semantics. I'm also reluctant to design an ABI around what happens to be the current QEMU implementation.

If we take your example above, ranges {0x,0xa000} and {0xa000,0x1} ({start,end}), I think you're working with the following two bitmaps in this implementation:

0011 b
0011b

And we need to combine those into:

b

Right?

But it seems like that would be easier if the second bitmap was instead:

1100b

Then we wouldn't need to worry about the entire bitmap being shifted by the bit offset within the byte, which limits our fixes to the boundary byte and allows us to use copy_to_user() directly for the bulk of the copy. So how do we get there?

I think we start with allocating the vfio_dma bitmap to account for this initial offset, so we calculate bitmap_base_iova as:

(iova & ~((PAGE_SIZE << 3) - 1))

We then use bitmap_base_iova in calculating which bits to set.

The user needs to follow the same rules, and maybe this adds some value to the user providing the bitmap size rather than the kernel calculating it. For example, if the user wanted the dirty bitmap for the range {0xa000,0x1} above, they'd provide at least a 1 byte bitmap, but we'd return bit #2 set to indicate 0xa000 is dirty.

Effectively the user can ask for any iova range, but the buffer will be filled relative to the zeroth bit of the bitmap following the above bitmap_base_iova formula (and replacing PAGE_SIZE with the user requested pgsize). I'm tempted to make this explicit in the user interface (ie. only allow bitmaps starting on aligned pages), but a user is able to map and unmap single pages and we need to support returning a dirty bitmap with an unmap, so I don't think we can do that.

Sigh, finding adjacent vfio_dmas within the same byte seems simpler than this.

How does KVM do this? My intent was that if all of our bitmaps share the same alignment then we can merge the intersection and continue to use copy_to_user() on either side. However, if QEMU doesn't do the same, it doesn't really help us. Is QEMU stuck with an implementation of only retrieving dirty bits per MemoryRegionSection exactly because of this issue and therefore we can rely on it in our implementation as well? Thanks,

QEMU syncs dirty_bitmap per MemoryRegionSection. Within a MemoryRegionSection there could be multiple KVMSlots. QEMU queries dirty_bitmap per KVMSlot and marks dirty for each KVMSlot.
On the kernel side, the KVM_GET_DIRTY_LOG ioctl calls kvm_get_dirty_log_protect(), where it uses copy_to_user() to copy the bitmap of that memSlot.
vfio_dma is per MemoryRegionSection. We can rely on MemoryRegionSection in our implementation. But to get the bitmap during unmap, we have to take care of concatenating bitmaps.

In QEMU, in function kvm_physical_sync_dirty_bitmap() there is a comment where the bitmap size is calculated and the bitmap is defined as 'void __user *dirty_bitmap', which is also the concern you raised and could be handled similarly as below.
    /* XXX bad kernel interface alert
     * For dirty bitmap, kernel allocates array of size aligned to
     * bits-per-long. But for case when the kernel is 64bits and
     * the userspace is 32bits, userspace can't align to the same
     * bits-per-long, since sizeof(long) is different between kernel
     * and user space. This way, userspace will provide buffer which
     * may be 4 bytes less than the kernel will use, resulting in
     * userspace memory corruption (which is not detectable by valgrind
     * too, in most cases).
     * So for now, let's align to 64 instead of HOST_LONG_BITS here, in
     * a hope that sizeof(long) won't become >8 any time soon.
     */
    if (!mem->dirty_bmap) {
        hwaddr bitmap_size = ALIGN(((mem->memory_size) >> TARGET_PAGE_BITS),
                                   /*HOST_LONG_BITS*/ 64) / 8;
        /* Allocate on the first log_sync, once and for all */
        mem->dirty_bmap = g_malloc0(bitmap_size);
    }

Thanks,
Kirti
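For reference, the size calculation in the QEMU snippet above can be reproduced standalone. The sketch below is not QEMU code: sketch_align_up and sketch_dirty_bitmap_size are hypothetical names, and the page shift is passed in rather than taken from TARGET_PAGE_BITS. It only shows the arithmetic of "one bit per page, padded to a multiple of 64 bits".

```c
#include <assert.h>
#include <stdint.h>

/* Round x up to a multiple of a; a must be a power of two. */
static uint64_t sketch_align_up(uint64_t x, uint64_t a)
{
    assert(a != 0 && (a & (a - 1)) == 0);
    return (x + a - 1) & ~(a - 1);
}

/*
 * Bytes needed for a dirty bitmap with one bit per page, padded to a
 * whole number of 64-bit words so that a 64-bit kernel and a 32-bit
 * userspace agree on the buffer size.
 */
static uint64_t sketch_dirty_bitmap_size(uint64_t memory_size,
                                         unsigned int page_bits)
{
    uint64_t npages;

    assert(page_bits < 64);
    npages = memory_size >> page_bits;
    return sketch_align_up(npages, 64) / 8;
}
```

For example, a 64 KiB region with 4 KiB pages has 16 pages, which pads to 64 bits, so the function returns 8 bytes. A region of 65 pages pads to 128 bits and needs 16 bytes.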
Re: [PATCH v12 Kernel 4/7] vfio iommu: Implementation of ioctl to for dirty pages tracking.
On Tue, 18 Feb 2020 00:43:48 +0530 Kirti Wankhede wrote: > On 2/14/2020 4:50 AM, Alex Williamson wrote: > > On Fri, 14 Feb 2020 01:41:35 +0530 > > Kirti Wankhede wrote: > > > >> > >> > >> > >> +static int vfio_iova_dirty_bitmap(struct vfio_iommu *iommu, > >> dma_addr_t iova, > >> +size_t size, uint64_t pgsize, > >> +unsigned char __user *bitmap) > >> +{ > >> + struct vfio_dma *dma; > >> + dma_addr_t i = iova, iova_limit; > >> + unsigned int bsize, nbits = 0, l = 0; > >> + unsigned long pgshift = __ffs(pgsize); > >> + > >> + while ((dma = vfio_find_dma(iommu, i, pgsize))) { > >> + int ret, j; > >> + unsigned int npages = 0, shift = 0; > >> + unsigned char temp = 0; > >> + > >> + /* mark all pages dirty if all pages are pinned and > >> mapped. */ > >> + if (dma->iommu_mapped) { > >> + iova_limit = min(dma->iova + dma->size, iova + > >> size); > >> + npages = iova_limit/pgsize; > >> + bitmap_set(dma->bitmap, 0, npages); > > > > npages is derived from iova_limit, which is the number of bits to set > > dirty relative to the first requested iova, not iova zero, ie. the set > > of dirty bits is offset from those requested unless iova == dma->iova. > > > > Right, fixing. > > > Also I hope dma->bitmap was actually allocated. Not only does the > > START error path potentially leave dirty tracking enabled without all > > the bitmap allocated, when does the bitmap get allocated for a new > > vfio_dma when dirty tracking is enabled? Seems it only occurs if a > > vpfn gets marked dirty. > > > > Right. > > Fixing error paths. 
> > > >> + } else if (dma->bitmap) { > >> + struct rb_node *n = rb_first(&dma->pfn_list); > >> + bool found = false; > >> + > >> + for (; n; n = rb_next(n)) { > >> + struct vfio_pfn *vpfn = rb_entry(n, > >> + struct vfio_pfn, node); > >> + if (vpfn->iova >= i) { > >> + found = true; > >> + break; > >> + } > >> + } > >> + > >> + if (!found) { > >> + i += dma->size; > >> + continue; > >> + } > >> + > >> + for (; n; n = rb_next(n)) { > >> + unsigned int s; > >> + struct vfio_pfn *vpfn = rb_entry(n, > >> + struct vfio_pfn, node); > >> + > >> + if (vpfn->iova >= iova + size) > >> + break; > >> + > >> + s = (vpfn->iova - dma->iova) >> pgshift; > >> + bitmap_set(dma->bitmap, s, 1); > >> + > >> + iova_limit = vpfn->iova + pgsize; > >> + } > >> + npages = iova_limit/pgsize; > > > > Isn't iova_limit potentially uninitialized here? For example, if our > > vfio_dma covers {0,8192} and we ask for the bitmap of {0,4096} and > > there's a vpfn at {4096,8192}. I think that means vpfn->iova >= i > > (4096 >= 0), so we break with found = true, then we test 4096 >= 0 + > > 4096 and break, and npages = /pgsize. > > > > Right, Fixing it. > > >> + } > >> + > >> + bsize = dirty_bitmap_bytes(npages); > >> + shift = nbits % BITS_PER_BYTE; > >> + > >> + if (npages && shift) { > >> + l--; > >> + if (!access_ok((void __user *)bitmap + l, > >> + sizeof(unsigned char))) > >> + return -EINVAL; > >> + > >> + ret = __get_user(temp, bitmap + l); > > > > I don't understand why we care to get the user's bitmap, are we trying > > to leave whatever garbage they might have set in it and only also set > > the dirty bits? That seems unnecessary. > > > > Suppose dma mapped ranges are {start, size}: > {0, 0xa000}, {0xa000, 0x1} > > Bitmap asked from 0 - 0x1. Say suppose all pages are dirty. > Then in first iterat
Re: [PATCH v12 Kernel 4/7] vfio iommu: Implementation of ioctl to for dirty pages tracking.
On 2/14/2020 4:50 AM, Alex Williamson wrote: On Fri, 14 Feb 2020 01:41:35 +0530 Kirti Wankhede wrote: +static int vfio_iova_dirty_bitmap(struct vfio_iommu *iommu, dma_addr_t iova, + size_t size, uint64_t pgsize, + unsigned char __user *bitmap) +{ + struct vfio_dma *dma; + dma_addr_t i = iova, iova_limit; + unsigned int bsize, nbits = 0, l = 0; + unsigned long pgshift = __ffs(pgsize); + + while ((dma = vfio_find_dma(iommu, i, pgsize))) { + int ret, j; + unsigned int npages = 0, shift = 0; + unsigned char temp = 0; + + /* mark all pages dirty if all pages are pinned and mapped. */ + if (dma->iommu_mapped) { + iova_limit = min(dma->iova + dma->size, iova + size); + npages = iova_limit/pgsize; + bitmap_set(dma->bitmap, 0, npages); npages is derived from iova_limit, which is the number of bits to set dirty relative to the first requested iova, not iova zero, ie. the set of dirty bits is offset from those requested unless iova == dma->iova. Right, fixing. Also I hope dma->bitmap was actually allocated. Not only does the START error path potentially leave dirty tracking enabled without all the bitmap allocated, when does the bitmap get allocated for a new vfio_dma when dirty tracking is enabled? Seems it only occurs if a vpfn gets marked dirty. Right. Fixing error paths. + } else if (dma->bitmap) { + struct rb_node *n = rb_first(&dma->pfn_list); + bool found = false; + + for (; n; n = rb_next(n)) { + struct vfio_pfn *vpfn = rb_entry(n, + struct vfio_pfn, node); + if (vpfn->iova >= i) { + found = true; + break; + } + } + + if (!found) { + i += dma->size; + continue; + } + + for (; n; n = rb_next(n)) { + unsigned int s; + struct vfio_pfn *vpfn = rb_entry(n, + struct vfio_pfn, node); + + if (vpfn->iova >= iova + size) + break; + + s = (vpfn->iova - dma->iova) >> pgshift; + bitmap_set(dma->bitmap, s, 1); + + iova_limit = vpfn->iova + pgsize; + } + npages = iova_limit/pgsize; Isn't iova_limit potentially uninitialized here? 
For example, if our vfio_dma covers {0,8192} and we ask for the bitmap of {0,4096} and there's a vpfn at {4096,8192}. I think that means vpfn->iova >= i (4096 >= 0), so we break with found = true, then we test 4096 >= 0 + 4096 and break, and npages = /pgsize. Right, Fixing it. + } + + bsize = dirty_bitmap_bytes(npages); + shift = nbits % BITS_PER_BYTE; + + if (npages && shift) { + l--; + if (!access_ok((void __user *)bitmap + l, + sizeof(unsigned char))) + return -EINVAL; + + ret = __get_user(temp, bitmap + l); I don't understand why we care to get the user's bitmap, are we trying to leave whatever garbage they might have set in it and only also set the dirty bits? That seems unnecessary. Suppose dma mapped ranges are {start, size}: {0, 0xa000}, {0xa000, 0x1} Bitmap asked from 0 - 0x1. Say suppose all pages are dirty. Then in first iteration for dma {0,0xa000} there are 10 pages, so 10 bits are set, put_user() happens for 2 bytes, (0011 b). In second iteration for dma {0xa000, 0x1} there are 6 pages and these bits should be appended to previous byte. So get_user() that byte, then shift-OR rest of the bitmap, result should be: ( b) Without get_user() and shift-OR, resulting bitmap would be 11 0011 b which would be wrong. Seems like if we use a put_user() approach then we should look for adjacent vfio_dmas within the same byte/word/dword before we push it to the user to avoid this sort of inefficiency. Won't that add more complication to logic? I'm tempted to think it might be less complicated. Also why do we need these access_ok() checks when we already checked the range at the start of the ioctl? Since pointer is updated runtime here, better to check that pointer before using that pointer. Sorry, I
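The review comment about npages being computed from iova_limit alone can be illustrated with a small sketch. This is an assumption about what a corrected calculation could look like, not the fix that was actually posted: the requested range is clamped to the vfio_dma, and both the first bit and the page count are taken relative to the start of the vfio_dma rather than iova zero. struct sketch_span and sketch_dirty_span are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type: which bits of the per-dma bitmap to touch. */
struct sketch_span {
    uint64_t start_bit; /* first bit, relative to the start of the dma */
    uint64_t npages;    /* number of bits; 0 if the ranges do not overlap */
};

static struct sketch_span sketch_dirty_span(uint64_t dma_iova, uint64_t dma_size,
                                            uint64_t req_iova, uint64_t req_size,
                                            unsigned int pgshift)
{
    uint64_t start = req_iova > dma_iova ? req_iova : dma_iova;
    uint64_t dma_end = dma_iova + dma_size;
    uint64_t req_end = req_iova + req_size;
    uint64_t limit = dma_end < req_end ? dma_end : req_end;
    struct sketch_span s = { 0, 0 };

    assert(pgshift < 64);
    if (limit > start) {
        s.start_bit = (start - dma_iova) >> pgshift;
        s.npages = (limit - start) >> pgshift;
    }
    return s;
}
```

With a vfio_dma of {iova 0xa000, size 0x6000} and a request starting at 0 that covers it, this yields start_bit 0 and npages 6, whereas dividing the clamped end address by the page size would give 16.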
Re: [PATCH v12 Kernel 4/7] vfio iommu: Implementation of ioctl to for dirty pages tracking.
On Fri, 14 Feb 2020 01:41:35 +0530 Kirti Wankhede wrote: > > > > +static int vfio_iova_dirty_bitmap(struct vfio_iommu *iommu, dma_addr_t > iova, > + size_t size, uint64_t pgsize, > + unsigned char __user *bitmap) > +{ > +struct vfio_dma *dma; > +dma_addr_t i = iova, iova_limit; > +unsigned int bsize, nbits = 0, l = 0; > +unsigned long pgshift = __ffs(pgsize); > + > +while ((dma = vfio_find_dma(iommu, i, pgsize))) { > +int ret, j; > +unsigned int npages = 0, shift = 0; > +unsigned char temp = 0; > + > +/* mark all pages dirty if all pages are pinned and > mapped. */ > +if (dma->iommu_mapped) { > +iova_limit = min(dma->iova + dma->size, iova + > size); > +npages = iova_limit/pgsize; > +bitmap_set(dma->bitmap, 0, npages); > >>> > >>> npages is derived from iova_limit, which is the number of bits to set > >>> dirty relative to the first requested iova, not iova zero, ie. the set > >>> of dirty bits is offset from those requested unless iova == dma->iova. > >>> > >> > >> Right, fixing. > >> > >>> Also I hope dma->bitmap was actually allocated. Not only does the > >>> START error path potentially leave dirty tracking enabled without all > >>> the bitmap allocated, when does the bitmap get allocated for a new > >>> vfio_dma when dirty tracking is enabled? Seems it only occurs if a > >>> vpfn gets marked dirty. > >>> > >> > >> Right. > >> > >> Fixing error paths. 
> >> > >> > +} else if (dma->bitmap) { > +struct rb_node *n = rb_first(&dma->pfn_list); > +bool found = false; > + > +for (; n; n = rb_next(n)) { > +struct vfio_pfn *vpfn = rb_entry(n, > +struct vfio_pfn, node); > +if (vpfn->iova >= i) { > +found = true; > +break; > +} > +} > + > +if (!found) { > +i += dma->size; > +continue; > +} > + > +for (; n; n = rb_next(n)) { > +unsigned int s; > +struct vfio_pfn *vpfn = rb_entry(n, > +struct vfio_pfn, node); > + > +if (vpfn->iova >= iova + size) > +break; > + > +s = (vpfn->iova - dma->iova) >> pgshift; > +bitmap_set(dma->bitmap, s, 1); > + > +iova_limit = vpfn->iova + pgsize; > +} > +npages = iova_limit/pgsize; > >>> > >>> Isn't iova_limit potentially uninitialized here? For example, if our > >>> vfio_dma covers {0,8192} and we ask for the bitmap of {0,4096} and > >>> there's a vpfn at {4096,8192}. I think that means vpfn->iova >= i > >>> (4096 >= 0), so we break with found = true, then we test 4096 >= 0 + > >>> 4096 and break, and npages = /pgsize. > >>> > >> > >> Right, Fixing it. > >> > +} > + > +bsize = dirty_bitmap_bytes(npages); > +shift = nbits % BITS_PER_BYTE; > + > +if (npages && shift) { > +l--; > +if (!access_ok((void __user *)bitmap + l, > +sizeof(unsigned char))) > +return -EINVAL; > + > +ret = __get_user(temp, bitmap + l); > >>> > >>> I don't understand why we care to get the user's bitmap, are we trying > >>> to leave whatever garbage they might have set in it and only also set > >>> the dirty bits? That seems unnecessary. > >>> > >> > >> Suppose dma mapped ranges are {start, size}: > >> {0, 0xa000}, {0xa000, 0x1} > >> > >> Bitmap asked from 0 - 0x1. Say suppose all pages are dirty. > >> Then in first iteration for dma {0,0xa000} there are 10 pages, so 10 > >> bits are set, put_user() happens for 2 bytes, (0011 b). > >> In second iteration for dma {0xa000, 0x1} there are 6 pages and > >> these bits should be appended to previous byte. So get_user() tha
Re: [PATCH v12 Kernel 4/7] vfio iommu: Implementation of ioctl to for dirty pages tracking.
+static int vfio_iova_dirty_bitmap(struct vfio_iommu *iommu, dma_addr_t iova,
+				  size_t size, uint64_t pgsize,
+				  unsigned char __user *bitmap)
+{
+	struct vfio_dma *dma;
+	dma_addr_t i = iova, iova_limit;
+	unsigned int bsize, nbits = 0, l = 0;
+	unsigned long pgshift = __ffs(pgsize);
+
+	while ((dma = vfio_find_dma(iommu, i, pgsize))) {
+		int ret, j;
+		unsigned int npages = 0, shift = 0;
+		unsigned char temp = 0;
+
+		/* mark all pages dirty if all pages are pinned and mapped. */
+		if (dma->iommu_mapped) {
+			iova_limit = min(dma->iova + dma->size, iova + size);
+			npages = iova_limit/pgsize;
+			bitmap_set(dma->bitmap, 0, npages);

npages is derived from iova_limit, which is the number of bits to set dirty relative to the first requested iova, not iova zero, ie. the set of dirty bits is offset from those requested unless iova == dma->iova.

Right, fixing.

Also I hope dma->bitmap was actually allocated. Not only does the START error path potentially leave dirty tracking enabled without all the bitmap allocated, when does the bitmap get allocated for a new vfio_dma when dirty tracking is enabled? Seems it only occurs if a vpfn gets marked dirty.

Right. Fixing error paths.

+		} else if (dma->bitmap) {
+			struct rb_node *n = rb_first(&dma->pfn_list);
+			bool found = false;
+
+			for (; n; n = rb_next(n)) {
+				struct vfio_pfn *vpfn = rb_entry(n,
+						struct vfio_pfn, node);
+				if (vpfn->iova >= i) {
+					found = true;
+					break;
+				}
+			}
+
+			if (!found) {
+				i += dma->size;
+				continue;
+			}
+
+			for (; n; n = rb_next(n)) {
+				unsigned int s;
+				struct vfio_pfn *vpfn = rb_entry(n,
+						struct vfio_pfn, node);
+
+				if (vpfn->iova >= iova + size)
+					break;
+
+				s = (vpfn->iova - dma->iova) >> pgshift;
+				bitmap_set(dma->bitmap, s, 1);
+
+				iova_limit = vpfn->iova + pgsize;
+			}
+			npages = iova_limit/pgsize;

Isn't iova_limit potentially uninitialized here? For example, if our vfio_dma covers {0,8192} and we ask for the bitmap of {0,4096} and there's a vpfn at {4096,8192}. I think that means vpfn->iova >= i (4096 >= 0), so we break with found = true, then we test 4096 >= 0 + 4096 and break, and npages = /pgsize.

Right, Fixing it.

+		}
+
+		bsize = dirty_bitmap_bytes(npages);
+		shift = nbits % BITS_PER_BYTE;
+
+		if (npages && shift) {
+			l--;
+			if (!access_ok((void __user *)bitmap + l,
+					sizeof(unsigned char)))
+				return -EINVAL;
+
+			ret = __get_user(temp, bitmap + l);

I don't understand why we care to get the user's bitmap, are we trying to leave whatever garbage they might have set in it and only also set the dirty bits? That seems unnecessary.

Suppose dma mapped ranges are {start, size}: {0, 0xa000}, {0xa000, 0x1}
Bitmap asked from 0 - 0x1. Say suppose all pages are dirty.
Then in first iteration for dma {0,0xa000} there are 10 pages, so 10 bits are set, put_user() happens for 2 bytes, (0011 b).
In second iteration for dma {0xa000, 0x1} there are 6 pages and these bits should be appended to previous byte. So get_user() that byte, then shift-OR rest of the bitmap, result should be: ( b)
Without get_user() and shift-OR, resulting bitmap would be 11 0011 b which would be wrong.

Seems like if we use a put_user() approach then we should look for adjacent vfio_dmas within the same byte/word/dword before we push it to the user to avoid this sort of inefficiency.

Won't that add more complication to logic?

Also why do we need these access_ok() checks when we already checked the range at the start of the ioctl?

Since pointer is updated runtime here, better to check that pointer before using that pointer.

Sorry, I still don't understand this, we check access_ok() with a pointer and a length, therefore as long as we're incrementing the pointer within that length, why do we need to retest? Idea
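To make the offset and partial-byte points above concrete, here is a minimal user-space sketch, not the patch's code. It sets bits at positions relative to the first requested iova, so two adjacent ranges that share a byte combine with a plain OR and no read-back of the destination is needed. `struct range` and `mark_overlap_dirty()` are hypothetical names introduced only for illustration, and the range sizes in the usage note (0xa000 and 0x6000 with 4 KiB pages) are assumed values chosen to give 10 and 6 pages.

```c
#include <stdint.h>

/* Hypothetical stand-in for a mapped range; not the kernel's struct vfio_dma. */
struct range {
	uint64_t iova;
	uint64_t size;
};

/*
 * Mark every page in the overlap of @r and the request [iova, iova + size)
 * dirty in @bitmap. Bit 0 of @bitmap corresponds to the first requested
 * page, so the bit index is (page address - iova) >> pgshift rather than
 * an index relative to iova zero or to the start of @r.
 */
static void mark_overlap_dirty(uint8_t *bitmap, const struct range *r,
			       uint64_t iova, uint64_t size,
			       unsigned int pgshift)
{
	uint64_t start = r->iova > iova ? r->iova : iova;
	uint64_t r_end = r->iova + r->size;
	uint64_t end = r_end < iova + size ? r_end : iova + size;
	uint64_t bit;

	/* No overlap between this range and the request. */
	if (start >= end)
		return;

	for (bit = (start - iova) >> pgshift;
	     bit < (end - iova) >> pgshift; bit++)
		bitmap[bit / 8] |= (uint8_t)(1u << (bit % 8));
}
```

Calling this for a 10-page range and then an adjacent 6-page range against a zeroed two-byte buffer leaves the second byte at 0x03 after the first call and 0xff after the second. The bulk of such a buffer could then be copied out in one go, which is the direction the review suggests.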
Re: [PATCH v12 Kernel 4/7] vfio iommu: Implementation of ioctl to for dirty pages tracking.
On Thu, 13 Feb 2020 02:26:23 +0530 Kirti Wankhede wrote: > On 2/10/2020 10:55 PM, Alex Williamson wrote: > > On Sat, 8 Feb 2020 01:12:31 +0530 > > Kirti Wankhede wrote: > > > >> VFIO_IOMMU_DIRTY_PAGES ioctl performs three operations: > >> - Start pinned and unpinned pages tracking while migration is active > >> - Stop pinned and unpinned dirty pages tracking. This is also used to > >>stop dirty pages tracking if migration failed or cancelled. > >> - Get dirty pages bitmap. This ioctl returns bitmap of dirty pages, its > >>user space application responsibility to copy content of dirty pages > >>from source to destination during migration. > >> > >> To prevent DoS attack, memory for bitmap is allocated per vfio_dma > >> structure. Bitmap size is calculated considering smallest supported page > >> size. Bitmap is allocated when dirty logging is enabled for those > >> vfio_dmas whose vpfn list is not empty or whole range is mapped, in > >> case of pass-through device. > >> > >> There could be multiple option as to when bitmap should be populated: > >> * Polulate bitmap for already pinned pages when bitmap is allocated for > >>a vfio_dma with the smallest supported page size. Updates bitmap from > >>page pinning and unpinning functions. When user application queries > >>bitmap, check if requested page size is same as page size used to > >>populated bitmap. If it is equal, copy bitmap. But if not equal, > >>re-populated bitmap according to requested page size and then copy to > >>user. > >>Pros: Bitmap gets populated on the fly after dirty tracking has > >> started. > >>Cons: If requested page size is different than smallest supported > >> page size, then bitmap has to be re-populated again, with > >> additional overhead of allocating bitmap memory again for > >> re-population of bitmap. > > > > No memory needs to be allocated to re-populate the bitmap. 
The bitmap > > is clear-on-read and by tracking the bitmap in the smallest supported > > page size we can guarantee that we can fit the user requested bitmap > > size within the space occupied by that minimal page size range of the > > bitmap. Therefore we'd destructively translate the requested region of > > the bitmap to a different page size, write it out to the user, and > > clear it. Also we expect userspace to use the minimum page size almost > > exclusively, which is optimized by this approach as dirty bit tracking > > is spread out over each page pinning operation. > > > >> > >> * Populate bitmap when bitmap is queried by user application. > >>Pros: Bitmap is populated with requested page size. This eliminates > >> the need to re-populate bitmap if requested page size is > >> different than smallest supported pages size. > >>Cons: There is one time processing time, when bitmap is queried. > > > > Another significant Con is that the vpfn list needs to track and manage > > unpinned pages, which makes it more complex and intrusive. The > > previous option seems to have both time and complexity advantages, > > especially in the case we expect to be most common of the user > > accessing the bitmap with the minimum page size, ie. PAGE_SIZE. It's > > also not clear why we pre-allocate the bitmap at all with this approach. > > > >> I prefer later option with simple logic and to eliminate over-head of > >> bitmap repopulation in case of differnt page sizes. Later option is > >> implemented in this patch. > > > > Hmm, we'll see below, but I not convinced based on the above rationale. 
> > > >> Signed-off-by: Kirti Wankhede > >> Reviewed-by: Neo Jia > >> --- > >> drivers/vfio/vfio_iommu_type1.c | 299 > >> ++-- > >> 1 file changed, 287 insertions(+), 12 deletions(-) > >> > >> diff --git a/drivers/vfio/vfio_iommu_type1.c > >> b/drivers/vfio/vfio_iommu_type1.c > >> index d386461e5d11..df358dc1c85b 100644 > >> --- a/drivers/vfio/vfio_iommu_type1.c > >> +++ b/drivers/vfio/vfio_iommu_type1.c > >> @@ -70,6 +70,7 @@ struct vfio_iommu { > >>unsigned intdma_avail; > >>boolv2; > >>boolnesting; > >> + booldirty_page_tracking; > >> }; > >> > >> struct vfio_domain { > >> @@ -90,6 +91,7 @@ struct vfio_dma { > >>boollock_cap; /* capable(CAP_IPC_LOCK) */ > >>struct task_struct *task; > >>struct rb_root pfn_list; /* Ex-user pinned pfn list */ > >> + unsigned long *bitmap; > >> }; > >> > >> struct vfio_group { > >> @@ -125,6 +127,7 @@ struct vfio_regions { > >>(!list_empty(&iommu->domain_list)) > >> > >> static int put_pfn(unsigned long pfn, int prot); > >> +static unsigned long vfio_pgsize_bitmap(struct vfio_iommu *iommu); > >> > >> /* > >>* This code handles mapping and unmapping of user data b
Re: [PATCH v12 Kernel 4/7] vfio iommu: Implementation of ioctl to for dirty pages tracking.
On 2/10/2020 10:55 PM, Alex Williamson wrote: On Sat, 8 Feb 2020 01:12:31 +0530 Kirti Wankhede wrote: VFIO_IOMMU_DIRTY_PAGES ioctl performs three operations: - Start pinned and unpinned pages tracking while migration is active - Stop pinned and unpinned dirty pages tracking. This is also used to stop dirty pages tracking if migration failed or cancelled. - Get dirty pages bitmap. This ioctl returns bitmap of dirty pages, its user space application responsibility to copy content of dirty pages from source to destination during migration. To prevent DoS attack, memory for bitmap is allocated per vfio_dma structure. Bitmap size is calculated considering smallest supported page size. Bitmap is allocated when dirty logging is enabled for those vfio_dmas whose vpfn list is not empty or whole range is mapped, in case of pass-through device. There could be multiple option as to when bitmap should be populated: * Polulate bitmap for already pinned pages when bitmap is allocated for a vfio_dma with the smallest supported page size. Updates bitmap from page pinning and unpinning functions. When user application queries bitmap, check if requested page size is same as page size used to populated bitmap. If it is equal, copy bitmap. But if not equal, re-populated bitmap according to requested page size and then copy to user. Pros: Bitmap gets populated on the fly after dirty tracking has started. Cons: If requested page size is different than smallest supported page size, then bitmap has to be re-populated again, with additional overhead of allocating bitmap memory again for re-population of bitmap. No memory needs to be allocated to re-populate the bitmap. The bitmap is clear-on-read and by tracking the bitmap in the smallest supported page size we can guarantee that we can fit the user requested bitmap size within the space occupied by that minimal page size range of the bitmap. 
Therefore we'd destructively translate the requested region of the bitmap to a different page size, write it out to the user, and clear it. Also we expect userspace to use the minimum page size almost exclusively, which is optimized by this approach as dirty bit tracking is spread out over each page pinning operation. * Populate bitmap when bitmap is queried by user application. Pros: Bitmap is populated with requested page size. This eliminates the need to re-populate bitmap if requested page size is different than smallest supported pages size. Cons: There is one time processing time, when bitmap is queried. Another significant Con is that the vpfn list needs to track and manage unpinned pages, which makes it more complex and intrusive. The previous option seems to have both time and complexity advantages, especially in the case we expect to be most common of the user accessing the bitmap with the minimum page size, ie. PAGE_SIZE. It's also not clear why we pre-allocate the bitmap at all with this approach. I prefer later option with simple logic and to eliminate over-head of bitmap repopulation in case of differnt page sizes. Later option is implemented in this patch. Hmm, we'll see below, but I not convinced based on the above rationale. 
Signed-off-by: Kirti Wankhede Reviewed-by: Neo Jia --- drivers/vfio/vfio_iommu_type1.c | 299 ++-- 1 file changed, 287 insertions(+), 12 deletions(-) diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c index d386461e5d11..df358dc1c85b 100644 --- a/drivers/vfio/vfio_iommu_type1.c +++ b/drivers/vfio/vfio_iommu_type1.c @@ -70,6 +70,7 @@ struct vfio_iommu { unsigned intdma_avail; boolv2; boolnesting; + booldirty_page_tracking; }; struct vfio_domain { @@ -90,6 +91,7 @@ struct vfio_dma { boollock_cap; /* capable(CAP_IPC_LOCK) */ struct task_struct *task; struct rb_root pfn_list; /* Ex-user pinned pfn list */ + unsigned long *bitmap; }; struct vfio_group { @@ -125,6 +127,7 @@ struct vfio_regions { (!list_empty(&iommu->domain_list)) static int put_pfn(unsigned long pfn, int prot); +static unsigned long vfio_pgsize_bitmap(struct vfio_iommu *iommu); /* * This code handles mapping and unmapping of user data buffers @@ -174,6 +177,57 @@ static void vfio_unlink_dma(struct vfio_iommu *iommu, struct vfio_dma *old) rb_erase(&old->node, &iommu->dma_list); } +static inline unsigned long dirty_bitmap_bytes(unsigned int npages) +{ + if (!npages) + return 0; + + return ALIGN(npages, BITS_PER_LONG) / sizeof(unsigned long); +} + +static int vfio_dma_bitmap_alloc(struct vfio_iommu *iommu, +struct vfio_dma *dma, unsign
Re: [PATCH v12 Kernel 4/7] vfio iommu: Implementation of ioctl to for dirty pages tracking.
On Tue, Feb 11, 2020 at 11:45:43AM +0800, Alex Williamson wrote: > On Mon, 10 Feb 2020 21:52:51 -0500 > Yan Zhao wrote: > > > On Tue, Feb 11, 2020 at 03:44:54AM +0800, Alex Williamson wrote: > > > On Mon, 10 Feb 2020 04:49:54 -0500 > > > Yan Zhao wrote: > > > > > > > On Sat, Feb 08, 2020 at 03:42:31AM +0800, Kirti Wankhede wrote: > > > > > VFIO_IOMMU_DIRTY_PAGES ioctl performs three operations: > > > > > - Start pinned and unpinned pages tracking while migration is active > > > > > - Stop pinned and unpinned dirty pages tracking. This is also used to > > > > > stop dirty pages tracking if migration failed or cancelled. > > > > > - Get dirty pages bitmap. This ioctl returns bitmap of dirty pages, > > > > > its > > > > > user space application responsibility to copy content of dirty pages > > > > > from source to destination during migration. > > > > > > > > > > To prevent DoS attack, memory for bitmap is allocated per vfio_dma > > > > > structure. Bitmap size is calculated considering smallest supported > > > > > page > > > > > size. Bitmap is allocated when dirty logging is enabled for those > > > > > vfio_dmas whose vpfn list is not empty or whole range is mapped, in > > > > > case of pass-through device. > > > > > > > > > > There could be multiple option as to when bitmap should be populated: > > > > > * Polulate bitmap for already pinned pages when bitmap is allocated > > > > > for > > > > > a vfio_dma with the smallest supported page size. Updates bitmap > > > > > from > > > > > page pinning and unpinning functions. When user application queries > > > > > bitmap, check if requested page size is same as page size used to > > > > > populated bitmap. If it is equal, copy bitmap. But if not equal, > > > > > re-populated bitmap according to requested page size and then copy > > > > > to > > > > > user. > > > > > Pros: Bitmap gets populated on the fly after dirty tracking has > > > > > started. 
> > > > > Cons: If requested page size is different than smallest supported > > > > > page size, then bitmap has to be re-populated again, with > > > > > additional overhead of allocating bitmap memory again for > > > > > re-population of bitmap. > > > > > > > > > > * Populate bitmap when bitmap is queried by user application. > > > > > Pros: Bitmap is populated with requested page size. This eliminates > > > > > the need to re-populate bitmap if requested page size is > > > > > different than smallest supported pages size. > > > > > Cons: There is one time processing time, when bitmap is queried. > > > > > > > > > > I prefer later option with simple logic and to eliminate over-head of > > > > > bitmap repopulation in case of differnt page sizes. Later option is > > > > > implemented in this patch. > > > > > > > > > > Signed-off-by: Kirti Wankhede > > > > > Reviewed-by: Neo Jia > > > > > --- > > > > > drivers/vfio/vfio_iommu_type1.c | 299 > > > > > ++-- > > > > > 1 file changed, 287 insertions(+), 12 deletions(-) > > > > > > > > > > diff --git a/drivers/vfio/vfio_iommu_type1.c > > > > > b/drivers/vfio/vfio_iommu_type1.c > > > > > index d386461e5d11..df358dc1c85b 100644 > > > > > --- a/drivers/vfio/vfio_iommu_type1.c > > > > > +++ b/drivers/vfio/vfio_iommu_type1.c > > > [snip] > > > > > @@ -830,6 +924,113 @@ static unsigned long vfio_pgsize_bitmap(struct > > > > > vfio_iommu *iommu) > > > > > return bitmap; > > > > > } > > > > > > > > > > +static int vfio_iova_dirty_bitmap(struct vfio_iommu *iommu, > > > > > dma_addr_t iova, > > > > > + size_t size, uint64_t pgsize, > > > > > + unsigned char __user *bitmap) > > > > > +{ > > > > > + struct vfio_dma *dma; > > > > > + dma_addr_t i = iova, iova_limit; > > > > > + unsigned int bsize, nbits = 0, l = 0; > > > > > + unsigned long pgshift = __ffs(pgsize); > > > > > + > > > > > + while ((dma = vfio_find_dma(iommu, i, pgsize))) { > > > > > + int ret, j; > > > > > + unsigned int npages = 0, shift = 0; > > > > > + unsigned char 
temp = 0; > > > > > + > > > > > + /* mark all pages dirty if all pages are pinned and > > > > > mapped. */ > > > > > + if (dma->iommu_mapped) { > > > > > + iova_limit = min(dma->iova + dma->size, iova + > > > > > size); > > > > > + npages = iova_limit/pgsize; > > > > > + bitmap_set(dma->bitmap, 0, npages); > > > > for pass-through devices, it's not good to always return all pinned > > > > pages as > > > > dirty. could it also call vfio_pin_pages to track dirty pages? or any > > > > other interface provided to do that? > > > > > > See patch 7/7. Thanks, > > > > > hi Alex and Kirti, > > for pass-through devices, though patch 7/7 enables the vendor driver to > > set dirty pages by calling vfio_pin_pages,
Re: [PATCH v12 Kernel 4/7] vfio iommu: Implementation of ioctl to for dirty pages tracking.
On Mon, 10 Feb 2020 21:52:51 -0500 Yan Zhao wrote: > On Tue, Feb 11, 2020 at 03:44:54AM +0800, Alex Williamson wrote: > > On Mon, 10 Feb 2020 04:49:54 -0500 > > Yan Zhao wrote: > > > > > On Sat, Feb 08, 2020 at 03:42:31AM +0800, Kirti Wankhede wrote: > > > > VFIO_IOMMU_DIRTY_PAGES ioctl performs three operations: > > > > - Start pinned and unpinned pages tracking while migration is active > > > > - Stop pinned and unpinned dirty pages tracking. This is also used to > > > > stop dirty pages tracking if migration failed or cancelled. > > > > - Get dirty pages bitmap. This ioctl returns bitmap of dirty pages, its > > > > user space application responsibility to copy content of dirty pages > > > > from source to destination during migration. > > > > > > > > To prevent DoS attack, memory for bitmap is allocated per vfio_dma > > > > structure. Bitmap size is calculated considering smallest supported page > > > > size. Bitmap is allocated when dirty logging is enabled for those > > > > vfio_dmas whose vpfn list is not empty or whole range is mapped, in > > > > case of pass-through device. > > > > > > > > There could be multiple option as to when bitmap should be populated: > > > > * Polulate bitmap for already pinned pages when bitmap is allocated for > > > > a vfio_dma with the smallest supported page size. Updates bitmap from > > > > page pinning and unpinning functions. When user application queries > > > > bitmap, check if requested page size is same as page size used to > > > > populated bitmap. If it is equal, copy bitmap. But if not equal, > > > > re-populated bitmap according to requested page size and then copy to > > > > user. > > > > Pros: Bitmap gets populated on the fly after dirty tracking has > > > > started. > > > > Cons: If requested page size is different than smallest supported > > > > page size, then bitmap has to be re-populated again, with > > > > additional overhead of allocating bitmap memory again for > > > > re-population of bitmap. 
> > > > > > > > * Populate bitmap when bitmap is queried by user application. > > > > Pros: Bitmap is populated with requested page size. This eliminates > > > > the need to re-populate bitmap if requested page size is > > > > different than smallest supported pages size. > > > > Cons: There is one time processing time, when bitmap is queried. > > > > > > > > I prefer later option with simple logic and to eliminate over-head of > > > > bitmap repopulation in case of differnt page sizes. Later option is > > > > implemented in this patch. > > > > > > > > Signed-off-by: Kirti Wankhede > > > > Reviewed-by: Neo Jia > > > > --- > > > > drivers/vfio/vfio_iommu_type1.c | 299 > > > > ++-- > > > > 1 file changed, 287 insertions(+), 12 deletions(-) > > > > > > > > diff --git a/drivers/vfio/vfio_iommu_type1.c > > > > b/drivers/vfio/vfio_iommu_type1.c > > > > index d386461e5d11..df358dc1c85b 100644 > > > > --- a/drivers/vfio/vfio_iommu_type1.c > > > > +++ b/drivers/vfio/vfio_iommu_type1.c > > [snip] > > > > @@ -830,6 +924,113 @@ static unsigned long vfio_pgsize_bitmap(struct > > > > vfio_iommu *iommu) > > > > return bitmap; > > > > } > > > > > > > > +static int vfio_iova_dirty_bitmap(struct vfio_iommu *iommu, dma_addr_t > > > > iova, > > > > + size_t size, uint64_t pgsize, > > > > + unsigned char __user *bitmap) > > > > +{ > > > > + struct vfio_dma *dma; > > > > + dma_addr_t i = iova, iova_limit; > > > > + unsigned int bsize, nbits = 0, l = 0; > > > > + unsigned long pgshift = __ffs(pgsize); > > > > + > > > > + while ((dma = vfio_find_dma(iommu, i, pgsize))) { > > > > + int ret, j; > > > > + unsigned int npages = 0, shift = 0; > > > > + unsigned char temp = 0; > > > > + > > > > + /* mark all pages dirty if all pages are pinned and > > > > mapped. 
*/ > > > > + if (dma->iommu_mapped) { > > > > + iova_limit = min(dma->iova + dma->size, iova + > > > > size); > > > > + npages = iova_limit/pgsize; > > > > + bitmap_set(dma->bitmap, 0, npages); > > > for pass-through devices, it's not good to always return all pinned pages > > > as > > > dirty. could it also call vfio_pin_pages to track dirty pages? or any > > > other interface provided to do that? > > > > See patch 7/7. Thanks, > > > hi Alex and Kirti, > for pass-through devices, though patch 7/7 enables the vendor driver to > set dirty pages by calling vfio_pin_pages, however, its overhead is much > higher than the previous way of generating a bitmap directly to user. > And it also requires pass-through device vendor driver to track guest > operations to know when to call vfio_pin_pages. > There are still use cases like a pass-through device is
Re: [PATCH v12 Kernel 4/7] vfio iommu: Implementation of ioctl to for dirty pages tracking.
On Tue, Feb 11, 2020 at 03:44:54AM +0800, Alex Williamson wrote: > On Mon, 10 Feb 2020 04:49:54 -0500 > Yan Zhao wrote: > > > On Sat, Feb 08, 2020 at 03:42:31AM +0800, Kirti Wankhede wrote: > > > VFIO_IOMMU_DIRTY_PAGES ioctl performs three operations: > > > - Start pinned and unpinned pages tracking while migration is active > > > - Stop pinned and unpinned dirty pages tracking. This is also used to > > > stop dirty pages tracking if migration failed or cancelled. > > > - Get dirty pages bitmap. This ioctl returns bitmap of dirty pages, its > > > user space application responsibility to copy content of dirty pages > > > from source to destination during migration. > > > > > > To prevent DoS attack, memory for bitmap is allocated per vfio_dma > > > structure. Bitmap size is calculated considering smallest supported page > > > size. Bitmap is allocated when dirty logging is enabled for those > > > vfio_dmas whose vpfn list is not empty or whole range is mapped, in > > > case of pass-through device. > > > > > > There could be multiple option as to when bitmap should be populated: > > > * Polulate bitmap for already pinned pages when bitmap is allocated for > > > a vfio_dma with the smallest supported page size. Updates bitmap from > > > page pinning and unpinning functions. When user application queries > > > bitmap, check if requested page size is same as page size used to > > > populated bitmap. If it is equal, copy bitmap. But if not equal, > > > re-populated bitmap according to requested page size and then copy to > > > user. > > > Pros: Bitmap gets populated on the fly after dirty tracking has > > > started. > > > Cons: If requested page size is different than smallest supported > > > page size, then bitmap has to be re-populated again, with > > > additional overhead of allocating bitmap memory again for > > > re-population of bitmap. > > > > > > * Populate bitmap when bitmap is queried by user application. 
> > > Pros: Bitmap is populated with requested page size. This eliminates > > > the need to re-populate bitmap if requested page size is > > > different than smallest supported pages size. > > > Cons: There is one time processing time, when bitmap is queried. > > > > > > I prefer later option with simple logic and to eliminate over-head of > > > bitmap repopulation in case of differnt page sizes. Later option is > > > implemented in this patch. > > > > > > Signed-off-by: Kirti Wankhede > > > Reviewed-by: Neo Jia > > > --- > > > drivers/vfio/vfio_iommu_type1.c | 299 > > > ++-- > > > 1 file changed, 287 insertions(+), 12 deletions(-) > > > > > > diff --git a/drivers/vfio/vfio_iommu_type1.c > > > b/drivers/vfio/vfio_iommu_type1.c > > > index d386461e5d11..df358dc1c85b 100644 > > > --- a/drivers/vfio/vfio_iommu_type1.c > > > +++ b/drivers/vfio/vfio_iommu_type1.c > [snip] > > > @@ -830,6 +924,113 @@ static unsigned long vfio_pgsize_bitmap(struct > > > vfio_iommu *iommu) > > > return bitmap; > > > } > > > > > > +static int vfio_iova_dirty_bitmap(struct vfio_iommu *iommu, dma_addr_t > > > iova, > > > + size_t size, uint64_t pgsize, > > > + unsigned char __user *bitmap) > > > +{ > > > + struct vfio_dma *dma; > > > + dma_addr_t i = iova, iova_limit; > > > + unsigned int bsize, nbits = 0, l = 0; > > > + unsigned long pgshift = __ffs(pgsize); > > > + > > > + while ((dma = vfio_find_dma(iommu, i, pgsize))) { > > > + int ret, j; > > > + unsigned int npages = 0, shift = 0; > > > + unsigned char temp = 0; > > > + > > > + /* mark all pages dirty if all pages are pinned and mapped. */ > > > + if (dma->iommu_mapped) { > > > + iova_limit = min(dma->iova + dma->size, iova + size); > > > + npages = iova_limit/pgsize; > > > + bitmap_set(dma->bitmap, 0, npages); > > for pass-through devices, it's not good to always return all pinned pages as > > dirty. could it also call vfio_pin_pages to track dirty pages? or any > > other interface provided to do that? > > See patch 7/7. 
Thanks, >

hi Alex and Kirti,
for pass-through devices, although patch 7/7 enables the vendor driver to mark pages dirty by calling vfio_pin_pages, its overhead is much higher than the previous approach of generating a bitmap directly for the user. It also requires the pass-through device's vendor driver to track guest operations to know when to call vfio_pin_pages.
There are still use cases where a pass-through device is able to track dirty pages in its hardware buffer, so is there a way for it to pass its dirty bitmap to the user?

Thanks
Yan
Re: [PATCH v12 Kernel 4/7] vfio iommu: Implementation of ioctl to for dirty pages tracking.
On Mon, 10 Feb 2020 04:49:54 -0500 Yan Zhao wrote: > On Sat, Feb 08, 2020 at 03:42:31AM +0800, Kirti Wankhede wrote: > > VFIO_IOMMU_DIRTY_PAGES ioctl performs three operations: > > - Start pinned and unpinned pages tracking while migration is active > > - Stop pinned and unpinned dirty pages tracking. This is also used to > > stop dirty pages tracking if migration failed or cancelled. > > - Get dirty pages bitmap. This ioctl returns bitmap of dirty pages, its > > user space application responsibility to copy content of dirty pages > > from source to destination during migration. > > > > To prevent DoS attack, memory for bitmap is allocated per vfio_dma > > structure. Bitmap size is calculated considering smallest supported page > > size. Bitmap is allocated when dirty logging is enabled for those > > vfio_dmas whose vpfn list is not empty or whole range is mapped, in > > case of pass-through device. > > > > There could be multiple option as to when bitmap should be populated: > > * Polulate bitmap for already pinned pages when bitmap is allocated for > > a vfio_dma with the smallest supported page size. Updates bitmap from > > page pinning and unpinning functions. When user application queries > > bitmap, check if requested page size is same as page size used to > > populated bitmap. If it is equal, copy bitmap. But if not equal, > > re-populated bitmap according to requested page size and then copy to > > user. > > Pros: Bitmap gets populated on the fly after dirty tracking has > > started. > > Cons: If requested page size is different than smallest supported > > page size, then bitmap has to be re-populated again, with > > additional overhead of allocating bitmap memory again for > > re-population of bitmap. > > > > * Populate bitmap when bitmap is queried by user application. > > Pros: Bitmap is populated with requested page size. This eliminates > > the need to re-populate bitmap if requested page size is > > different than smallest supported pages size. 
> > Cons: There is one time processing time, when bitmap is queried. > > > > I prefer later option with simple logic and to eliminate over-head of > > bitmap repopulation in case of differnt page sizes. Later option is > > implemented in this patch. > > > > Signed-off-by: Kirti Wankhede > > Reviewed-by: Neo Jia > > --- > > drivers/vfio/vfio_iommu_type1.c | 299 > > ++-- > > 1 file changed, 287 insertions(+), 12 deletions(-) > > > > diff --git a/drivers/vfio/vfio_iommu_type1.c > > b/drivers/vfio/vfio_iommu_type1.c > > index d386461e5d11..df358dc1c85b 100644 > > --- a/drivers/vfio/vfio_iommu_type1.c > > +++ b/drivers/vfio/vfio_iommu_type1.c [snip] > > @@ -830,6 +924,113 @@ static unsigned long vfio_pgsize_bitmap(struct > > vfio_iommu *iommu) > > return bitmap; > > } > > > > +static int vfio_iova_dirty_bitmap(struct vfio_iommu *iommu, dma_addr_t > > iova, > > + size_t size, uint64_t pgsize, > > + unsigned char __user *bitmap) > > +{ > > + struct vfio_dma *dma; > > + dma_addr_t i = iova, iova_limit; > > + unsigned int bsize, nbits = 0, l = 0; > > + unsigned long pgshift = __ffs(pgsize); > > + > > + while ((dma = vfio_find_dma(iommu, i, pgsize))) { > > + int ret, j; > > + unsigned int npages = 0, shift = 0; > > + unsigned char temp = 0; > > + > > + /* mark all pages dirty if all pages are pinned and mapped. */ > > + if (dma->iommu_mapped) { > > + iova_limit = min(dma->iova + dma->size, iova + size); > > + npages = iova_limit/pgsize; > > + bitmap_set(dma->bitmap, 0, npages); > for pass-through devices, it's not good to always return all pinned pages as > dirty. could it also call vfio_pin_pages to track dirty pages? or any > other interface provided to do that? See patch 7/7. Thanks, Alex
Re: [PATCH v12 Kernel 4/7] vfio iommu: Implementation of ioctl to for dirty pages tracking.
On Sat, 8 Feb 2020 01:12:31 +0530
Kirti Wankhede wrote:

> VFIO_IOMMU_DIRTY_PAGES ioctl performs three operations:
> - Start pinned and unpinned pages tracking while migration is active
> - Stop pinned and unpinned dirty pages tracking. This is also used to
>   stop dirty pages tracking if migration failed or cancelled.
> - Get dirty pages bitmap. This ioctl returns bitmap of dirty pages, its
>   user space application responsibility to copy content of dirty pages
>   from source to destination during migration.
>
> To prevent DoS attack, memory for bitmap is allocated per vfio_dma
> structure. Bitmap size is calculated considering smallest supported page
> size. Bitmap is allocated when dirty logging is enabled for those
> vfio_dmas whose vpfn list is not empty or whole range is mapped, in
> case of pass-through device.
>
> There could be multiple option as to when bitmap should be populated:
> * Populate bitmap for already pinned pages when bitmap is allocated for
>   a vfio_dma with the smallest supported page size. Updates bitmap from
>   page pinning and unpinning functions. When user application queries
>   bitmap, check if requested page size is same as page size used to
>   populate bitmap. If it is equal, copy bitmap. But if not equal,
>   re-populate bitmap according to requested page size and then copy to
>   user.
>   Pros: Bitmap gets populated on the fly after dirty tracking has
>         started.
>   Cons: If requested page size is different than smallest supported
>         page size, then bitmap has to be re-populated again, with
>         additional overhead of allocating bitmap memory again for
>         re-population of bitmap.

No memory needs to be allocated to re-populate the bitmap. The bitmap is clear-on-read and by tracking the bitmap in the smallest supported page size we can guarantee that we can fit the user requested bitmap size within the space occupied by that minimal page size range of the bitmap. Therefore we'd destructively translate the requested region of the bitmap to a different page size, write it out to the user, and clear it. Also we expect userspace to use the minimum page size almost exclusively, which is optimized by this approach as dirty bit tracking is spread out over each page pinning operation.

>
> * Populate bitmap when bitmap is queried by user application.
>   Pros: Bitmap is populated with requested page size. This eliminates
>         the need to re-populate bitmap if requested page size is
>         different than smallest supported pages size.
>   Cons: There is one time processing time, when bitmap is queried.

Another significant Con is that the vpfn list needs to track and manage unpinned pages, which makes it more complex and intrusive. The previous option seems to have both time and complexity advantages, especially in the case we expect to be most common of the user accessing the bitmap with the minimum page size, ie. PAGE_SIZE. It's also not clear why we pre-allocate the bitmap at all with this approach.

> I prefer latter option with simple logic and to eliminate over-head of
> bitmap repopulation in case of different page sizes. Latter option is
> implemented in this patch.

Hmm, we'll see below, but I'm not convinced based on the above rationale.
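The "no memory needs to be allocated" argument above can be sketched in user-space C as follows. This is an illustration under assumptions, not code from the patch or from vfio_iommu_type1.c: `coarsen_bitmap_inplace()` and `test_bit8()` are hypothetical helpers, the bitmap is a plain byte array, and `ratio` is the requested page size divided by the minimum page size. Because output bit k is always written at or below the input bits it summarizes, the coarser bitmap fits in the front of the same buffer.

```c
#include <stddef.h>
#include <stdint.h>

/* Return bit @bit of byte-array bitmap @bm (hypothetical helper). */
static int test_bit8(const uint8_t *bm, size_t bit)
{
	return (bm[bit / 8] >> (bit % 8)) & 1;
}

/*
 * Destructively translate @bm, which tracks @nbits pages at the minimum
 * page size, so that each output bit covers @ratio minimum-size pages.
 * The result overwrites the start of @bm and the remaining bits are
 * cleared. Returns the number of valid output bits.
 */
static size_t coarsen_bitmap_inplace(uint8_t *bm, size_t nbits, size_t ratio)
{
	size_t nout, out, in;

	if (ratio == 0)
		return 0;

	nout = (nbits + ratio - 1) / ratio;
	for (out = 0; out < nout; out++) {
		size_t first = out * ratio;
		size_t last = first + ratio < nbits ? first + ratio : nbits;
		int dirty = 0;

		/* A coarse page is dirty if any fine page inside it is. */
		for (in = first; in < last; in++)
			dirty |= test_bit8(bm, in);

		/* out <= first, so no unread input bit is overwritten. */
		if (dirty)
			bm[out / 8] |= (uint8_t)(1u << (out % 8));
		else
			bm[out / 8] &= (uint8_t)~(1u << (out % 8));
	}

	/* Everything past the coarse result has been consumed; clear it. */
	for (in = nout; in < nbits; in++)
		bm[in / 8] &= (uint8_t)~(1u << (in % 8));

	return nout;
}
```

A caller would copy the first `nout` bits to the user and then clear them as well, which gives the clear-on-read behaviour described above. With `ratio` equal to 1, the common case expected here, the loop rewrites each bit with its own value and nothing changes.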
> Signed-off-by: Kirti Wankhede
> Reviewed-by: Neo Jia
> ---
>  drivers/vfio/vfio_iommu_type1.c | 299 ++--
>  1 file changed, 287 insertions(+), 12 deletions(-)
>
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index d386461e5d11..df358dc1c85b 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -70,6 +70,7 @@ struct vfio_iommu {
>  	unsigned int		dma_avail;
>  	bool			v2;
>  	bool			nesting;
> +	bool			dirty_page_tracking;
>  };
>
>  struct vfio_domain {
> @@ -90,6 +91,7 @@ struct vfio_dma {
>  	bool			lock_cap;	/* capable(CAP_IPC_LOCK) */
>  	struct task_struct	*task;
>  	struct rb_root		pfn_list;	/* Ex-user pinned pfn list */
> +	unsigned long		*bitmap;
>  };
>
>  struct vfio_group {
> @@ -125,6 +127,7 @@ struct vfio_regions {
>  					(!list_empty(&iommu->domain_list))
>
>  static int put_pfn(unsigned long pfn, int prot);
> +static unsigned long vfio_pgsize_bitmap(struct vfio_iommu *iommu);
>
>  /*
>   * This code handles mapping and unmapping of user data buffers
> @@ -174,6 +177,57 @@ static void vfio_unlink_dma(struct vfio_iommu *iommu, struct vfio_dma *old)
>  	rb_erase(&old->node, &iommu->dma_list);
>  }
>
> +static inline unsigned long dirty_bitmap_bytes(unsigned int npages)
> +{
> +	if (!npages)
> +		return 0;
> +
> +	return ALIGN(npages, BITS_PER_LONG) / sizeof(unsigned long);
> +}
> +
> +static int vfio_dma_bitmap_alloc(struct vfio_iommu *iommu,
> +
Re: [PATCH v12 Kernel 4/7] vfio iommu: Implementation of ioctl to for dirty pages tracking.
On Sat, Feb 08, 2020 at 03:42:31AM +0800, Kirti Wankhede wrote:

> VFIO_IOMMU_DIRTY_PAGES ioctl performs three operations:
> - Start pinned and unpinned pages tracking while migration is active
> - Stop pinned and unpinned dirty pages tracking. This is also used to
>   stop dirty pages tracking if migration failed or cancelled.
> - Get dirty pages bitmap. This ioctl returns bitmap of dirty pages, its
>   user space application responsibility to copy content of dirty pages
>   from source to destination during migration.
>
> To prevent DoS attack, memory for bitmap is allocated per vfio_dma
> structure. Bitmap size is calculated considering smallest supported page
> size. Bitmap is allocated when dirty logging is enabled for those
> vfio_dmas whose vpfn list is not empty or whole range is mapped, in
> case of pass-through device.
>
> There could be multiple option as to when bitmap should be populated:
> * Polulate bitmap for already pinned pages when bitmap is allocated for
>   a vfio_dma with the smallest supported page size. Updates bitmap from
>   page pinning and unpinning functions. When user application queries
>   bitmap, check if requested page size is same as page size used to
>   populated bitmap. If it is equal, copy bitmap. But if not equal,
>   re-populated bitmap according to requested page size and then copy to
>   user.
>   Pros: Bitmap gets populated on the fly after dirty tracking has
>         started.
>   Cons: If requested page size is different than smallest supported
>         page size, then bitmap has to be re-populated again, with
>         additional overhead of allocating bitmap memory again for
>         re-population of bitmap.
>
> * Populate bitmap when bitmap is queried by user application.
>   Pros: Bitmap is populated with requested page size. This eliminates
>         the need to re-populate bitmap if requested page size is
>         different than smallest supported pages size.
>   Cons: There is one time processing time, when bitmap is queried.
>
> I prefer later option with simple logic and to eliminate over-head of
> bitmap repopulation in case of differnt page sizes. Later option is
> implemented in this patch.
>
> Signed-off-by: Kirti Wankhede
> Reviewed-by: Neo Jia
> ---
>  drivers/vfio/vfio_iommu_type1.c | 299 ++--
>  1 file changed, 287 insertions(+), 12 deletions(-)
>
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index d386461e5d11..df358dc1c85b 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -70,6 +70,7 @@ struct vfio_iommu {
>  	unsigned int		dma_avail;
>  	bool			v2;
>  	bool			nesting;
> +	bool			dirty_page_tracking;
>  };
>
>  struct vfio_domain {
> @@ -90,6 +91,7 @@ struct vfio_dma {
>  	bool			lock_cap;	/* capable(CAP_IPC_LOCK) */
>  	struct task_struct	*task;
>  	struct rb_root		pfn_list;	/* Ex-user pinned pfn list */
> +	unsigned long		*bitmap;
>  };
>
>  struct vfio_group {
> @@ -125,6 +127,7 @@ struct vfio_regions {
>  					(!list_empty(&iommu->domain_list))
>
>  static int put_pfn(unsigned long pfn, int prot);
> +static unsigned long vfio_pgsize_bitmap(struct vfio_iommu *iommu);
>
>  /*
>   * This code handles mapping and unmapping of user data buffers
> @@ -174,6 +177,57 @@ static void vfio_unlink_dma(struct vfio_iommu *iommu, struct vfio_dma *old)
>  	rb_erase(&old->node, &iommu->dma_list);
>  }
>
> +static inline unsigned long dirty_bitmap_bytes(unsigned int npages)
> +{
> +	if (!npages)
> +		return 0;
> +
> +	return ALIGN(npages, BITS_PER_LONG) / sizeof(unsigned long);
> +}
> +
> +static int vfio_dma_bitmap_alloc(struct vfio_iommu *iommu,
> +				 struct vfio_dma *dma, unsigned long pgsizes)
> +{
> +	unsigned long pgshift = __ffs(pgsizes);
> +
> +	if (!RB_EMPTY_ROOT(&dma->pfn_list) || dma->iommu_mapped) {
> +		unsigned long npages = dma->size >> pgshift;
> +		unsigned long bsize = dirty_bitmap_bytes(npages);
> +
> +		dma->bitmap = kvzalloc(bsize, GFP_KERNEL);
> +		if (!dma->bitmap)
> +			return -ENOMEM;
> +	}
> +	return 0;
> +}
> +
> +static int vfio_dma_all_bitmap_alloc(struct vfio_iommu *iommu,
> +				     unsigned long pgsizes)
> +{
> +	struct rb_node *n = rb_first(&iommu->dma_list);
> +	int ret;
> +
> +	for (; n; n = rb_next(n)) {
> +		struct vfio_dma *dma = rb_entry(n, struct vfio_dma, node);
> +
> +		ret = vfio_dma_bitmap_alloc(iommu, dma, pgsizes);
> +		if (ret)
> +			return ret;
> +	}
> +	return 0;
> +}
> +
> +static void vfio_dma_all_bitmap_free(struct vfio_iommu *iommu)
> +{
> +	struct rb_node *n = rb_firs