(Re-post, as my mail client somehow attached my previous post to the wrong thread. I do not mean to spam y'all, but maybe my previous mail got lost in your filters ... as I have not yet seen any answer to my questions/remarks.)
> On 2016/9/14 6:55, Alex Williamson wrote:
>
> [cc +Paolo]
>
>> On Thu, 11 Aug 2016 19:05:57 +0800
>> Yongji Xie <address@hidden> wrote:
>>
>> Now the kernel commit 05f0c03fbac1 ("vfio-pci: Allow to mmap
>> sub-page MMIO BARs if the mmio page is exclusive") allows VFIO
>> to mmap sub-page BARs. This is the corresponding QEMU patch.

Immediate questions first: It seems the mentioned commit will be part of kernel 4.8? But as far as I can judge, this change should also cooperate with older/existing kernels (which would then simply make QEMU behave as before)? (For my actual point of interest related to this patch, please see further down.)

>> With those patches applied, we could passthrough sub-page BARs
>> to guest, which can help to improve IO performance for some devices.
>>
>> In this patch, we expand MemoryRegions of these sub-page
>> MMIO BARs to PAGE_SIZE in vfio_pci_write_config(), so that
>> the BARs could be passed to KVM ioctl KVM_SET_USER_MEMORY_REGION
>> with a valid size. The expanding size will be recovered when
>> the base address of sub-page BAR is changed and not page aligned
>> any more in guest. And we also set the priority of these BARs'
>> memory regions to zero in case of overlap with BARs which share
>> the same page with sub-page BARs in guest.
>>
>> Signed-off-by: Yongji Xie <address@hidden>
>> ---
>>  hw/vfio/common.c |  3 +--
>>  hw/vfio/pci.c    | 76 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>  2 files changed, 77 insertions(+), 2 deletions(-)
>
> Hi Yongji,

...

>> +    mr = region->mem;
>> +    mmap_mr = &region->mmaps[0].mem;
>> +    memory_region_transaction_begin();
>> +    if (memory_region_size(mr) < qemu_real_host_page_size) {
>> +        if (!(bar_addr & ~qemu_real_host_page_mask) &&
>> +            memory_region_is_mapped(mr) && region->mmaps[0].mmap) {
>> +            /* Expand memory region to page size and set priority */
>> +            memory_region_del_subregion(r->address_space, mr);
>> +            memory_region_set_size(mr, qemu_real_host_page_size);
>> +            memory_region_set_size(mmap_mr, qemu_real_host_page_size);
>> +            memory_region_add_subregion_overlap(r->address_space,
>> +                                                bar_addr, mr, 0);
>> +        }
>> +    } else {
>> +        /* This case would happen when guest rescan one PCI device */
>> +        if (bar_addr & ~qemu_real_host_page_mask) {
>> +            /* Recover the size of memory region */
>> +            memory_region_set_size(mr, r->size);
>> +            memory_region_set_size(mmap_mr, r->size);
>> +        } else if (memory_region_is_mapped(mr)) {
>> +            /* Set the priority of memory region to zero */
>> +            memory_region_del_subregion(r->address_space, mr);
>> +            memory_region_add_subregion_overlap(r->address_space,
>> +                                                bar_addr, mr, 0);
>> +        }
>> +    }
>> +    memory_region_transaction_commit();
>
> Paolo, as the reigning memory API expert, do you see any issues with
> this? Expanding the size of a container MemoryRegion and the contained
> mmap'd subregion and manipulating priorities so that we don't interfere
> with other MemoryRegions.

Since the following QEMU commit, the function memory_region_add_subregion_overlap() actually has a misleading name:

http://git.qemu.org/?p=qemu.git;a=blobdiff;f=memory.c;h=ac5236b51587ee397edd177502fc20ce159f2235;hp=9daac5ea2d9a9c83533880a812760683f6e09765;hb=b61359781958759317ee6fd1a45b59be0b7dbbe1;hpb=ab0a99560857302b60053c245d1231acbd976cd4

The only thing memory_region_add_subregion_overlap() now does differently from memory_region_add_subregion() is to set the region's priority to a value of the caller's choice. The _default_ priority chosen by memory_region_add_subregion() _is_ 0.
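For illustration, after that commit the two functions in memory.c boil down to roughly the following (paraphrased from memory, not a verbatim copy, so please double-check against the tree):

    void memory_region_add_subregion(MemoryRegion *mr,
                                     hwaddr offset,
                                     MemoryRegion *subregion)
    {
        /* "plain" add: the priority is simply hard-wired to 0 */
        subregion->priority = 0;
        memory_region_add_subregion_common(mr, offset, subregion);
    }

    void memory_region_add_subregion_overlap(MemoryRegion *mr,
                                             hwaddr offset,
                                             MemoryRegion *subregion,
                                             int priority)
    {
        /* "overlap" add: identical, except the caller picks the priority */
        subregion->priority = priority;
        memory_region_add_subregion_common(mr, offset, subregion);
    }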
So, explicitly choosing memory_region_add_subregion_overlap(..., 0) does nothing new. Or does it? Actually, _yes_, because I see Alex is actually willing to discuss the choice of memory region priorities related to VFIO and mmap.

Why do I "invade" this thread? I would like you to consider thinking twice about selecting proper priorities for _any_ mmap-related region (i.e. also the aligned case), and here is why. (I will also make a statement related to region expansion for alignment.)

First of all, I recently suggested a patch which can visualise what I write about subsequently:
https://lists.gnu.org/archive/html/qemu-devel/2016-09/msg01315.html
(I would appreciate it if somebody would review it and thus get it merged.)

As a general remark, the sub-page mmap case does not only occur when a 'small' BAR is encountered; it also occurs when a fully mmap-ed page is split by a 'small' VFIO quirk. Hi Alex, here we go again about RTL8168 and its MSI-X quirk. (Subsequently I also relate this to, and draw conclusions for, Yongji's patch.)

For certain RTL8168 models the mentioned quirk cuts a full-page BAR into 3 pieces: 0..quirk_addr-1, the quirk itself, and quirk_addr+quirk_size..page_size-1. What I found is that both "mmap fractions" behave in a _buggy_ way. (An attempt at an analysis follows.) The first piece could be covered by Yongji's patch, but the third piece certainly not, as the kernel patch is limited to aligned small pieces - and that only starting with kernel 4.8.

This is what we also need to solve, and this is where the above priority choice comes into play (but there is more to consider):

Right after memory_region_transaction_commit() a new flat view is created. As documented in docs/memory.txt, priorities _are_ important for deciding which 'stuff' is finally exposed in the flat view for a certain address range. _But_ priorities are not the only important property, so they need to fit into a larger picture. (Here I refer again to my suggested 'mtree info <mapinfo>' patch.)

Complexity of the situation, besides watching your priorities: As far as I understand, the mmap functionality is two-phased. There is the mmap system call, which Yongji's patch is tuning around, and then the setup of the actual DMA, which happens for sections of the flat view in hw/vfio/common.c:vfio_dma_map() via an ioctl on the container file descriptor. vfio_dma_map() is called from hw/vfio/common.c:vfio_iommu_map_notify() _and_ from hw/vfio/common.c:vfio_listener_region_add() (for mmap flat ranges).

There be dragons: vfio_dma_map() is _skipped_ for ranges which are not fully page aligned (there may be exceptions, but the buggy behaviour is triggered by a skip case currently occurring for the quirk split pieces mentioned above; a toy illustration of the rounding follows below). This would also happen for non-page-extended, page-aligned sub-page BARs, I suppose.

Why is this skip-and-forget 'dangerous'? On the one hand, the rules in memory.txt determine which mmap region is actually exposed in the flat range and thus fed into the skip-happy function vfio_listener_region_add() (I do not understand enough about vfio_iommu_map_notify(), so I disregard that one here for now). On the other hand, the regions that are DMA-skipped do not seem to safely fall back to the 'slow' region-ops/memory-handlers. This is certainly also related to the rules specified in memory.txt - either those rules are implemented correctly but not correctly employed by well-adjusted subregion stacking, or the implementation of those stacking rules is itself buggy.
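To make the skip case a bit more tangible, here is a small standalone toy program modelling (under my understanding, so treat it as an assumption rather than a quote of the real code) the rounding that vfio_listener_region_add() applies before it calls vfio_dma_map(): the start of a flat range is rounded up to a host page boundary, the end is rounded down, and if nothing remains the section is simply dropped. All addresses and sizes below are made up for illustration:

    #include <stdint.h>
    #include <stdio.h>

    #define HOST_PAGE_SIZE 4096ULL   /* assumed 4K host pages */

    /* Model of the alignment filter: round the start up and the end down
     * to page boundaries; an empty result means "skip, no vfio_dma_map()". */
    static int would_be_dma_mapped(uint64_t start, uint64_t size)
    {
        uint64_t aligned_start = (start + HOST_PAGE_SIZE - 1) & ~(HOST_PAGE_SIZE - 1);
        uint64_t aligned_end   = (start + size) & ~(HOST_PAGE_SIZE - 1);
        return aligned_start < aligned_end;
    }

    int main(void)
    {
        /* A full, page-aligned BAR page: mapped as expected. */
        printf("full page:    %d\n", would_be_dma_mapped(0xfebd0000, 0x1000));
        /* A page-aligned sub-page BAR (no page expansion): skipped. */
        printf("sub-page BAR: %d\n", would_be_dma_mapped(0xfebd0000, 0x100));
        /* The unaligned tail fraction after a hypothetical quirk split at
         * offset 0x100 of the page: skipped as well. */
        printf("quirk tail:   %d\n", would_be_dma_mapped(0xfebd0100, 0xf00));
        return 0;
    }

The second and third cases are exactly the fractions discussed above: nothing page sized survives the rounding, so they never reach vfio_dma_map(), and whether they still have usable slow-path ops then depends entirely on the subregion stacking.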
End of last year I was able to 'fix' the RTL8168 MSI-X quirk problem by adding an additional subregion (one page in size, referring to the page-sized quirk-split BAR) which re-exposed the required region-ops for the slow path. Back then Alex called that a hack, but still: I assume the sub-page BARs which are the subject of Yongji's patch will also be subject to the correct-stacking problem, and thus just picking a seemingly well-educated priority for a page-extended BAR region might 'endanger' the extended page fraction's slow-path/region-ops.

Conclusion:

_This_ patch probably needs an additional page-sized region added right at the correct position of the subregion 'stack', in order to be able to 'create and forget', i.e. to be prepared for all the different additional subregion add-ons potentially introduced by yet unknown quirks. And while we're at it, we could also fix the RTL8168 MSI-X quirk using the same precautionary add-a-helper-region approach.

I have had a few patch snippets in store since the start of this year which roughly address the DMA-skip problem (they still need refinement). Is anybody interested in starting to discuss those?

In order to test/verify/visualise, I suggest again:
https://lists.gnu.org/archive/html/qemu-devel/2016-09/msg04808.html
(I would really appreciate a review, refinement and merge.)

And maybe we should even add a refactoring patch for the purpose of renaming memory_region_add_subregion_overlap() to memory_region_add_subregion_stacked() or memory_region_add_subregion_prioritised() or similar.

Regards,
Thorsten