George / Tim,

Could you help me review these memory patches? Thanks!

-Quan
> -----Original Message-----
> From: Xu, Quan
> Sent: Wednesday, September 16, 2015 9:24 PM
> To: andrew.coop...@citrix.com; Dong, Eddie; ian.campb...@citrix.com;
> ian.jack...@eu.citrix.com; jbeul...@suse.com; Nakajima, Jun; k...@xen.org;
> Tian, Kevin; t...@xen.org; Zhang, Yang Z; george.dun...@eu.citrix.com
> Cc: xen-devel@lists.xen.org; Xu, Quan
> Subject: [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device
>
> Introduction
> ============
>
> The VT-d code currently has a number of cases where completion of certain
> operations is waited for by way of spinning. The majority of instances use
> that timeout indirectly through the IOMMU_WAIT_OP() macro, allowing for
> loops of up to 1 second (DMAR_OPERATION_TIMEOUT). While in many of the
> cases this may be acceptable, the invalidation case is particularly
> problematic.
>
> Currently the hypervisor polls the status address of the wait descriptor
> for up to 1 second to get the invalidation flush result. When the
> invalidation queue includes a Device-TLB invalidation, 1 second is the
> wrong bound for the invalidation sync: it reflects response times of the
> IOMMU engine itself, whereas for Device-TLB invalidation with PCI-e
> Address Translation Services (ATS) in use, the ATS specification mandates
> a timeout of 1 _minute_ for cache flush. The ATS case therefore needs to
> be taken into consideration when doing invalidations. Obviously we can't
> spin for a minute, so invalidation absolutely needs to be converted to a
> non-spinning model.
>
> This series also needs to fix a memory security issue:
> a page freed by a domain must be held until the Device-TLB flush has
> completed (the ATS timeout of 1 _minute_).
> The page previously associated with the freed portion of GPA must not be
> reallocated for another purpose until the appropriate invalidations have
> been performed. Otherwise, the original page owner can still access the
> freed page through DMA.
>
> Why RFC
> =======
> Patches 0001--0005 and 0013 are IOMMU related.
> Patch 0006 introduces a new flag (vCPU / MMU related).
> Patch 0007 is vCPU related.
> Patches 0008--0012 are MMU related.
>
> 1. The Xen MMU is very complicated. Could the Xen MMU experts help me
>    verify whether I have covered all of the cases?
>
> 2. For gnttab_transfer, if the Device-TLB flush has still not completed
>    when the transferred page is to be mapped into the remote domain,
>    schedule and wait on a waitqueue until the Device-TLB flush is
>    completed. Is that correct?
>
>    (I have tested waitqueues in decrease_reservation() [the do_memory_op()
>    hypercall]: I woke up a domain with only one vCPU via the debug-key
>    tool, and the domain was still working after waiting 60s on a
>    waitqueue.)
>
> Design Overview
> ===============
>
> This design implements a non-spinning model for Device-TLB invalidation,
> using an interrupt based mechanism. The Device-TLB invalidation status is
> tracked in a per-domain invalidation table. The invalidation table keeps
> the count of in-flight Device-TLB invalidation requests, and also provides
> a per-domain global polling parameter for those in-flight requests.
> When submitting invalidation requests, update the invalidation table's
> count of in-flight Device-TLB invalidation requests and put the address of
> the per-domain global polling parameter into the Status Address field of
> each invalidation wait descriptor.
>
> For example:
>
>   .
>   |invl |
>   |wait | Status Data    = 1 (the count of in-flight Device-TLB
>   |dsc  |                     invalidation requests)
>   |     | Status Address = virt_to_maddr(&_a_global_polling_parameter_per_domain_)
>   .
>   .
>   |invl |
>   |wait | Status Data    = 2 (the count of in-flight Device-TLB
>   |dsc  |                     invalidation requests)
>   |     | Status Address = virt_to_maddr(&_a_global_polling_parameter_per_domain_)
>   .
>   .
>   |invl |
>   |wait | Status Data    = 3 (the count of in-flight Device-TLB
>   |dsc  |                     invalidation requests)
>   |     | Status Address = virt_to_maddr(&_a_global_polling_parameter_per_domain_)
>   .
>   .
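> In code, such a submission could look roughly like the sketch below. This
> is only an illustration, not the actual patch code: submit_wait_descriptor()
> is a hypothetical helper, and the qi_table structure and its fields are the
> ones introduced under "More details" further down.
>
>   /*
>    * Minimal sketch: bump the per-domain count of in-flight Device-TLB
>    * invalidation requests and point the wait descriptor's status write
>    * at the per-domain poll slot.
>    */
>   static void queue_dev_iotlb_wait(struct iommu *iommu, struct domain *d)
>   {
>       struct qi_table *tbl = &domain_hvm_iommu(d)->table;
>
>       /* One more Device-TLB invalidation request is now in flight. */
>       tbl->qi_table_status_data++;
>
>       /*
>        * SW=1: hardware writes Status Data to Status Address on completion.
>        * IF=1: hardware also raises the invalidation completion interrupt.
>        * FN=1: later descriptors are processed only after this one.
>        */
>       submit_wait_descriptor(iommu, 1 /* sw */, 1 /* if */, 1 /* fn */,
>                              tbl->qi_table_status_data,
>                              virt_to_maddr(&tbl->qi_table_poll_slot));
>   }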
> More information about the VT-d Invalidation Wait Descriptor can be found
> in section 6.5.2.8 (Invalidation Wait Descriptor) of
> http://www.intel.com/content/www/us/en/intelligent-systems/intel-technology/vt-directed-io-spec.html
>
>   Status Address and Data: Status address and data is used by hardware to
>       perform the wait descriptor completion status write when the Status
>       Write (SW) field is Set. Hardware behavior is undefined if the
>       Status Address is not a valid memory address (e.g. the interrupt
>       address range 0xFEEX_XXXX, etc.). The Status Address and Data fields
>       are ignored by hardware when the Status Write field is Clear.
>
> The invalidation completion event interrupt is generated only after the
> invalidation wait descriptor completes. The invalidation interrupt handler
> schedules a soft-irq to do the following check:
>
>   if the invalidation table's count of in-flight Device-TLB invalidation
>   requests == the polling parameter:
>       this domain has no in-flight Device-TLB invalidation requests.
>   else:
>       this domain has in-flight Device-TLB invalidation requests.
>
> Tracking domain status:
>   The vCPU is NOT allowed to enter guest mode and is put into the
>   SCHEDOP_yield list while it has in-flight Device-TLB invalidation
>   requests.
>
> Memory security issue:
>   With PCI-e Address Translation Services (ATS) in use, the ATS spec
>   mandates a timeout of 1 minute for cache flush.
>   A page freed by the domain must be held until the Device-TLB flush has
>   completed. The page previously associated with the freed portion of GPA
>   must not be reallocated for another purpose until the appropriate
>   invalidations have been performed. Otherwise, the original page owner
>   can still access the freed page through DMA.
>
>   *Hold the page until the Device-TLB flush is completed:
>     - Unlink the page from the original owner.
>     - Remove the page from the domain's page_list.
>     - Decrease the domain's total page count.
>     - Add the page to qi_hold_page_list.
>
>   *Put the page in the Queued Invalidation (QI) interrupt handler once the
>    Device-TLB flush is completed.
>
> Invalidation fault:
>   A fault event will be generated if an invalidation fails. We can then
>   disable the devices.
>
> For Context invalidation and IOTLB invalidation without Device-TLB
> invalidation, Queued Invalidation (QI) submits invalidation requests as
> before. (This is a tradeoff, since the interrupt adds overhead; it will be
> revisited in a coming patch series.)
>
> More details
> ============
>
> 1. Invalidation table. We define a qi_table structure per domain:
>
>   +struct qi_table {
>   +    u64 qi_table_poll_slot;
>   +    u32 qi_table_status_data;
>   +};
>
>   @ struct hvm_iommu {
>   +    /* IOMMU Queued Invalidation(QI) */
>   +    struct qi_table table;
>   }
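> For illustration, the completion check from the Design Overview then boils
> down to comparing the two fields above. A minimal sketch, assuming a
> hypothetical dev_iotlb_flush_pending() helper name:
>
>   /* True while this domain still has in-flight Device-TLB invalidations. */
>   static bool_t dev_iotlb_flush_pending(const struct domain *d)
>   {
>       const struct qi_table *tbl = &domain_hvm_iommu(d)->table;
>
>       /*
>        * Hardware writes Status Data into the poll slot when the matching
>        * wait descriptor completes; once the last written value equals the
>        * count of submitted requests, nothing is in flight any more.
>        */
>       return tbl->qi_table_poll_slot != tbl->qi_table_status_data;
>   }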
> 2. Modification to Device-IOTLB invalidation:
>   - Enable interrupt notification when hardware completes the
>     invalidations: set the FN, IF and SW bits in the Invalidation Wait
>     Descriptor. The SW bit is set as well because the notification
>     interrupt is global, not per domain, so we still need to poll the
>     status address in the QI interrupt handler to know which Device-TLB
>     invalidation request has completed.
>   - A new per-domain flag (*qi_flag) is used to track the status of
>     Device-TLB invalidation requests. The *qi_flag is set before
>     submitting the Device-TLB invalidation requests. While the *qi_flag is
>     set, the vCPU is NOT allowed to enter guest mode and is put into the
>     SCHEDOP_yield list.
>   - New synchronization logic:
>       if there is no Device-TLB invalidation:
>           fall back to the current invalidation logic.
>       else:
>           set the IF, SW, FN bits in the wait descriptor and prepare the
>           Status Data.
>           set *qi_flag.
>           put the domain in the pending-flush list (the vCPU is NOT
>           allowed to enter guest mode and is put into SCHEDOP_yield while
>           the *qi_flag is set).
>           return.
>
> More information about the VT-d Invalidation Wait Descriptor can be found
> in section 6.5.2.8 (Invalidation Wait Descriptor) of
> http://www.intel.com/content/www/us/en/intelligent-systems/intel-technology/vt-directed-io-spec.html
>
>   SW: Indicates that invalidation wait descriptor completion is reported
>       by performing a coherent DWORD write of the value in the Status Data
>       field to the address specified in the Status Address field.
>   FN: Indicates that the descriptors following the invalidation wait
>       descriptor must be processed by hardware only after the invalidation
>       wait descriptor completes.
>   IF: Indicates that invalidation wait descriptor completion is reported
>       by generating an invalidation completion event per the programming
>       of the Invalidation Completion Event Registers.
>
> 3. Modification to the domain running lifecycle:
>   - While the *qi_flag is set, i.e. there are in-flight Device-TLB
>     invalidation requests, the domain's vCPUs are not allowed to enter
>     guest mode and are put into the SCHEDOP_yield list.
>
> 4. New interrupt handler for invalidation completion:
>   - When hardware completes the Device-TLB invalidation requests, it
>     generates an interrupt to notify the hypervisor.
>   - The interrupt handler schedules a tasklet to handle it.
>   - The tasklet does the following:
>     * Clear the IWC field in the Invalidation Completion Status register.
>       If the IWC field was already Set at the time of setting this field,
>       it is not treated as a new interrupt condition.
>     * Scan the domain list (the domains with VT-d passthrough devices;
>       scan 'iommu->domid_bitmap'):
>         for each domain:
>             check the domain's invalidation table values
>             (qi_table_poll_slot and qi_table_status_data).
>             if they are equal:
>                 put the held pages.
>                 clear the invalidation table.
>                 clear the *qi_flag.
>     * If the IP field of the Invalidation Event Control Register is Set,
>       try to clear IWC and scan the domain list again, instead of having
>       another interrupt generated.
>     * Clear the IM field of the Invalidation Event Control Register.
>
>   (( The IWC / IP / IM logic is as follows:
>
>      Interrupt condition (an invalidation wait descriptor with the
>      Interrupt Flag (IF) field Set completed)
>              ||
>              v
>      --------------------------- (IWC) ---------------------------
>      (IWC is Set)                                 (IWC is not Set)
>          ||                                             ||
>          v                                              v
>      (not treated as a new                        (Set IWC / IP)
>       interrupt condition)                              ||
>                                                         v
>                       ------------------- (IM) -------------------
>                       (IM is Set)                   (IM is not Set)
>                           ||                              ||
>                           v                               v
>               (the interrupt is held         (cause the interrupt message;
>                pending; clearing IM           hardware then clears IP)
>                causes the interrupt
>                message)
>
>      * When the IWC field is cleared, the IP field is cleared as well.
>   ))
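> A minimal sketch of the tasklet in item 4, not the actual patch code:
> qi_clear_iwc(), qi_ip_is_set(), qi_unmask_interrupt(), clear_qi_flag() and
> qi_release_held_pages() are hypothetical helpers, and for_each_domain()
> plus need_iommu() stands in for the iommu->domid_bitmap scan.
>
>   static void qi_completion_tasklet(unsigned long data)
>   {
>       struct iommu *iommu = (struct iommu *)data;
>       struct domain *d;
>
>       do {
>           /* Clear IWC first so a later completion raises a new condition. */
>           qi_clear_iwc(iommu);
>
>           rcu_read_lock(&domlist_read_lock);
>           for_each_domain ( d )
>           {
>               struct qi_table *tbl;
>
>               if ( !need_iommu(d) )
>                   continue;
>
>               tbl = &domain_hvm_iommu(d)->table;
>               /* All wait descriptors of this domain have written back. */
>               if ( tbl->qi_table_poll_slot == tbl->qi_table_status_data )
>               {
>                   qi_release_held_pages(d); /* put pages on qi_hold_page_list */
>                   tbl->qi_table_poll_slot = 0;
>                   tbl->qi_table_status_data = 0;
>                   clear_qi_flag(d);         /* vCPUs may enter guest mode again */
>               }
>           }
>           rcu_read_unlock(&domlist_read_lock);
>           /* If IP is still set, rescan instead of taking another interrupt. */
>       } while ( qi_ip_is_set(iommu) );
>
>       /* Re-enable the invalidation completion interrupt (clear IM). */
>       qi_unmask_interrupt(iommu);
>   }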
> 5. Invalidation failure:
>   - A fault event will be generated if an invalidation fails. We can
>     disable the devices when such an invalidation fault event is received.
>
> 6. Memory security issue:
>
>   A page freed by the domain must be held until the Device-TLB flush has
>   completed. The page previously associated with the freed portion of GPA
>   must not be reallocated for another purpose until the appropriate
>   invalidations have been performed. Otherwise, the original page owner
>   can still access the freed page through DMA.
>
>   *Hold the page until the Device-TLB flush is completed:
>     - Unlink the page from the original owner.
>     - Remove the page from the domain's page_list.
>     - Decrease the domain's total page count.
>     - Add the page to qi_hold_page_list.
>
>   *Put the page in the Queued Invalidation(QI) interrupt handler once the
>    Device-TLB flush is completed.
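> The hold/put pairing above might look roughly like the sketch below (an
> illustration only, not the actual patch code). qi_hold_page_list and
> qi_hold_page_lock are assumed new fields of struct hvm_iommu, and
> qi_release_held_pages() is the hypothetical helper called from the QI
> completion tasklet sketched earlier; the caller is assumed to have already
> unlinked the page from its owner and adjusted the domain's page counts.
>
>   static void qi_hold_page(struct domain *d, struct page_info *pg)
>   {
>       struct hvm_iommu *hd = domain_hvm_iommu(d);
>
>       spin_lock(&hd->qi_hold_page_lock);
>       /* Park the page until no Device-TLB invalidation is in flight. */
>       page_list_add_tail(pg, &hd->qi_hold_page_list);
>       spin_unlock(&hd->qi_hold_page_lock);
>   }
>
>   static void qi_release_held_pages(struct domain *d)
>   {
>       struct hvm_iommu *hd = domain_hvm_iommu(d);
>       struct page_info *pg;
>
>       spin_lock(&hd->qi_hold_page_lock);
>       while ( (pg = page_list_remove_head(&hd->qi_hold_page_list)) != NULL )
>           put_page(pg);   /* now safe: no in-flight Device-TLB flush */
>       spin_unlock(&hd->qi_hold_page_lock);
>   }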
> ----
> There are 3 reasons to submit Device-TLB invalidation requests:
>   * VT-d initialization.
>   * Reassigning device ownership.
>   * Memory modification.
>
> 6.1 VT-d initialization
>
>   While VT-d is initializing, there is no guest domain running yet, so
>   there is no memory security issue.
>
>   iotlb (iotlb/device-tlb)
>   |--iommu_flush_iotlb_global()--iommu_flush_all()--intel_iommu_hwdom_init()
>                                                   --init_vtd_hw()
>
> 6.2 Reassigning device ownership
>
>   Reassigning device ownership is invoked by 2 hypercalls: do_physdev_op()
>   and arch_do_domctl().
>   While the *qi_flag is set, the domain is not allowed to enter guest
>   mode. If the appropriate invalidations may not have been performed yet,
>   the *qi_flag is still set and these devices are not ready for guest
>   domains to launch DMA with. So with the *qi_flag introduced, there is no
>   memory security issue.
>
>   iotlb (iotlb/device-tlb)
>   |--iommu_flush_iotlb_dsi()
>      |--domain_context_mapping_one() ...
>      |--domain_context_unmap_one() ...
>   |--iommu_flush_iotlb_psi()
>      |--domain_context_mapping_one() ...
>      |--domain_context_unmap_one() ...
>
> 6.3 Memory modification
>
>   While memory is being modified, there are a lot of invoker flows that
>   update the EPT, but not all of them update the IOMMU page tables. The
>   IOMMU page tables are only updated when all of the following three
>   conditions are met:
>     * the P2M is the host p2m ( p2m_is_hostp2m(p2m) ).
>     * the previous mfn is not equal to the new mfn (prev_mfn != new_mfn).
>     * the domain needs IOMMU (need_iommu(d)).
>
>   ##
>   |--iommu_pte_flush()--ept_set_entry()
>
>   # PoD (populate on demand) is not supported while IOMMU passthrough is
>     enabled, so ignore the PoD invoker flows below:
>   |--p2m_pod_zero_check_superpage() ...
>   |--p2m_pod_zero_check() ...
>   |--p2m_pod_demand_populate() ...
>   |--p2m_pod_decrease_reservation() ...
>   |--guest_physmap_mark_populate_on_demand() ...
>
>   # Xen paging is not supported while IOMMU passthrough is enabled, so
>     ignore the Xen paging invoker flows below:
>   |--p2m_mem_paging_evict() ...
>   |--p2m_mem_paging_resume() ...
>   |--p2m_mem_paging_prep() ...
>   |--p2m_mem_paging_populate() ...
>   |--p2m_mem_paging_nominate() ...
>   |--p2m_alloc_table()--shadow_enable()--paging_enable()--shadow_domctl()
>      --paging_domctl()--arch_do_domctl()--do_domctl()
>   |--paging_domctl_continuation()
>
>   # Xen memory sharing is not supported while IOMMU passthrough is
>     enabled, so ignore the sharing invoker flow below:
>   |--set_shared_p2m_entry() ...
>
>   # The domain is paused, so it can't launch DMA:
>   |--relinquish_shared_pages()--domain_relinquish_resources( case
>      RELMEM_shared: )--domain_kill()--do_domctl()
>
>   # The p2m below is not the host p2m; it is L2 to L0. So ignore the
>     invoker flow below:
>   |--nestedhap_fix_p2m()--nestedhvm_hap_nested_page_fault()
>      --hvm_hap_nested_page_fault()--ept_handle_violation()
>      --vmx_vmexit_handler()
>
>   # If prev_mfn == new_mfn, the IOMMU page tables are not updated. So
>     ignore the invoker flows below:
>   |--p2m_mem_access_check()--hvm_hap_nested_page_fault()
>      --ept_handle_violation()--vmx_vmexit_handler() (L1 --> L0, but it
>      only checks p2m_type_t)
>   |--p2m_set_mem_access() ...
>   |--guest_physmap_mark_populate_on_demand() ...
>   |--p2m_change_type_one() ...
>
>   # The previous page is neither put nor allocated for Xen or other guest
>     domains, so there is no memory security issue. Ignore the invoker
>     flows below:
>   |--p2m_remove_page()--guest_physmap_remove_page() ...
>   |--clear_mmio_p2m_entry()--unmap_mmio_regions()--do_domctl()
>   |--map_mmio_regions()--do_domctl()
>
>   # Hold the pages which are removed in guest_remove_page(), and put them
>     in the QI interrupt handler when there are no in-flight Device-TLB
>     invalidation requests:
>   |--clear_mmio_p2m_entry()--*guest_remove_page()*--decrease_reservation()
>   |--xenmem_add_to_physmap_one()--xenmem_add_to_physmap()
>      /xenmem_add_to_physmap_batch() ..--do_memory_op()
>   |--p2m_add_foreign()--xenmem_add_to_physmap_one() ..--do_memory_op()
>   |--guest_physmap_add_entry()--create_grant_p2m_mapping() ...
>      --do_grant_table_op()
>
>   (( Some more explanation: the previous pages may actually have been
>      mapped from the Xen heap for guest domains in decrease_reservation()
>      / xenmem_add_to_physmap_one() / p2m_add_foreign(), but those are not
>      mapped into the IOMMU table. The following 4 cases map Xen heap pages
>      for guest domains:
>        * the shared page for Xen Oprofile.
>        * the vLAPIC mapping.
>        * the grant table shared page.
>        * the domain's shared_info page.
>   ))
>
>   # For grant_unmap*: ignore it at this point, as we can hold the page
>     when the domain frees the xenballooned page.
>   |--iommu_map_page()--__gnttab_unmap_common()--__gnttab_unmap_grant_ref()
>      --gnttab_unmap_grant_ref()--do_grant_table_op()
>   |--__gnttab_unmap_and_replace()--gnttab_unmap_and_replace()
>      --do_grant_table_op()
>
>   # For grant_map*: ignore it, as there is no pfn<--->mfn translation in
>     the Device-TLB yet.
>
>   # For grant_transfer:
>   |--p2m_remove_page()--guest_physmap_remove_page()
>      |--gnttab_transfer() ...--do_grant_table_op()
>
>   If the Device-TLB flush has still not completed when the transferred
>   page is to be mapped into the remote domain, schedule and wait on a
>   waitqueue until the Device-TLB flush is completed (a sketch of this
>   follows below, after Plan B).
>
>   Plan B:
>   (( If the Device-TLB flush has still not completed before adding the
>      transferred page to the target domain, allocate a new page for the
>      target domain and hold the old transferred page, which will be put in
>      the QI interrupt handler when there are no in-flight Device-TLB
>      invalidation requests. ))
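> The waitqueue variant above could look roughly like the sketch below. It
> is an illustration only, not the actual patch code: d->qi_flush_wq is an
> assumed new struct domain field initialised with init_waitqueue_head(),
> and dev_iotlb_flush_pending() is the hypothetical helper sketched earlier;
> wait_event() and wake_up_all() are Xen's existing waitqueue primitives.
>
>   #include <xen/wait.h>
>
>   /* Block the current vCPU until no Device-TLB invalidation is in flight. */
>   static void wait_for_dev_iotlb_flush(struct domain *d)
>   {
>       /* Sleeps on the waitqueue; re-checks the condition on each wakeup. */
>       wait_event(d->qi_flush_wq, !dev_iotlb_flush_pending(d));
>   }
>
>   /* Called from the QI completion path once the flush has finished. */
>   static void dev_iotlb_flush_done(struct domain *d)
>   {
>       wake_up_all(&d->qi_flush_wq);
>   }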
> Quan Xu (13):
>   vt-d: Redefine iommu_set_interrupt() for registering MSI interrupt
>   vt-d: Register MSI for async invalidation completion interrupt.
>   vt-d: Track the Device-TLB invalidation status in an invalidation table.
>   vt-d: Clear invalidation table in invaidation interrupt handler
>   vt-d: Clear the IWC field of Invalidation Event Control Register in
>   vt-d: Introduce a new per-domain flag - qi_flag.
>   vt-d: If the qi_flag is Set, the domain's vCPUs are not allowed to
>   vt-d: Held on the freed page until the Device-TLB flush is completed.
>   vt-d: Put the page in Queued Invalidation(QI) interrupt handler if
>   vt-d: Held on the removed page until the Device-TLB flush is completed.
>   vt-d: If the Device-TLB flush is still not completed when
>   vt-d: For gnttab_transfer, If the Device-TLB flush is still
>   vt-d: Set the IF bit in Invalidation Wait Descriptor When submit Device-TLB
>
>  xen/arch/x86/hvm/vmx/entry.S         |  10 ++
>  xen/arch/x86/x86_64/asm-offsets.c    |   1 +
>  xen/common/domain.c                  |  15 ++
>  xen/common/grant_table.c             |  16 ++
>  xen/common/memory.c                  |  16 +-
>  xen/drivers/passthrough/vtd/iommu.c  | 290 +++++++++++++++++++++++++++++++++--
>  xen/drivers/passthrough/vtd/iommu.h  |  18 +++
>  xen/drivers/passthrough/vtd/qinval.c |  51 +++++-
>  xen/include/xen/hvm/iommu.h          |  42 +++++
>  9 files changed, 443 insertions(+), 16 deletions(-)
>
> --
> 1.8.3.2

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel