Re: [Xen-devel] [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device
Jan Beulich wrote on 2015-10-15:
>>> On 15.10.15 at 03:03, wrote:
>> Jan Beulich wrote on 2015-10-14:
>>> As long as the multi-millisecond spins aren't going to go away by
>>> other means, I think conversion to async mode is ultimately unavoidable.
>>
>> I don't fully agree. I think the time spent spinning is what matters. To me,
>> less than 1 ms is acceptable, and if the hardware can guarantee it,
>> then sync mode is also ok.
>
> Okay, let me put the condition slightly differently - any spin on the
> order of what a WBINVD might take ought to be okay, provided both are

From the data we collected, the invalidation is completed within several us.
IMO, the time for WBINVD varies with the size and hierarchy of the caches,
and it may take more than several us in the worst case.

> equally (in)accessible to guests. The whole discussion is really about
> limiting the impact misbehaving guests can have on the whole system.
> (Obviously any spin time reaching the order of a scheduling time slice
> is a problem.)

The premise for a misbehaving guest impacting the system is that the IOMMU
is buggy and takes a long time to complete the invalidation. In other
words, if all invalidations are able to complete within several us, what
does the spin time matter?

>> I remember the original motivation for handling the ATS problem was: 1.
>> the ATS spec allows a 60s timeout to complete the flush while Xen only
>> allows 1s, and 2. a spin loop of 1s is not reasonable since it will
>> hurt the scheduler. For the former, as we discussed before, either
>> disabling ATS support or only supporting some specific ATS
>> devices (which complete the flush in less than 10ms or 1ms) is acceptable. For
>> the latter, if a spin loop of 1s is not acceptable, we can reduce the
>> timeout to 10ms or 1ms to eliminate the performance impact.
>
> If we really can, why has it been chosen to be 1s in the first place?

All I can tell is that 1s is just the value the original author chose; it
has no special meaning. I have double-checked with our hardware expert and
he suggests using as small a value as possible. According to his comment,
10ms is sufficiently large.

>> Yes, I'd agree the best solution would be for Xen to have the async mode.
>> But spin loops are used widely in the IOMMU code: not only for
>> invalidations, lots of DMAR operations use spinning to sync the
>> hardware's status. For those operations, it is hard to use async mode.
>> Or, even if it is possible to use async mode, I don't see the benefit
>> considering the cost and complexity, which means we either need a timer
>> or a softirq to do the check.
>
> Even if the cost is high, limited overall throughput by undue spinning
> is worth it imo even outside of misbehaving guest considerations. I'm
> surprised you're not getting similar pressure on this from the KVM
> folks (assuming the use of spinning is similar there).

Because no one has observed such an invalidation timeout issue so far.
What we have discussed is only theoretical.

Btw, I have reported the issue to the Linux IOMMU maintainer but he didn't
say anything about it.

Best regards,
Yang

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
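The reduced-timeout option discussed above (spin for at most ~10ms instead of 1s, and fail the flush instead of panicking the host) can be sketched as a standalone simulation. NOW(), the completion check, and the constants below are illustrative stand-ins, not Xen's real interfaces:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Sketch of a bounded invalidation wait: spin for at most 10ms and, on
 * expiry, return an error so the caller can crash the offending *guest*
 * rather than panic() the whole host. All names here are hypothetical. */

#define MILLISECS(ms)  ((uint64_t)(ms) * 1000000ULL)   /* ns */
#define QI_TIMEOUT     MILLISECS(10)

static uint64_t fake_now_ns;          /* simulated clock                  */
static uint64_t completion_time_ns;   /* when the "IOMMU" finishes        */

/* Each poll costs 1us of simulated time. */
static uint64_t NOW(void) { return fake_now_ns += 1000; }

static bool invalidation_done(void)
{
    return fake_now_ns >= completion_time_ns;
}

/* Returns 0 on completion, -1 on timeout (caller crashes the guest). */
static int queue_invalidate_wait_bounded(void)
{
    uint64_t start = NOW();

    while ( !invalidation_done() )
    {
        if ( NOW() > start + QI_TIMEOUT )
            return -1;    /* contain the damage: no host panic() */
    }
    return 0;
}
```

With a well-behaved IOMMU completing in a few us the spin exits almost immediately; only a device abusing the ATS spec's 60s allowance hits the error path.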
Re: [Xen-devel] [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device
Jan Beulich wrote on 2015-10-15: On 15.10.15 at 10:52,wrote: >> Jan Beulich wrote on 2015-10-15: >> On 15.10.15 at 09:28, wrote: The premise for a misbehaving guest to impact the system is that the IOMMU is buggy which takes long time to complete the invalidation. In other words, if all invalidations are able to complete within several us, what's the matter to do with the spin time? >>> >>> The risk of exploits of such poorly behaving IOMMUs. I.e. if >>> properly >> >> But this is not a software flaw. A guest has no way to know the >> underlying IOMMU is wrong and it cannot exploit it. > > A guest doesn't need to know what IOMMU is there in order to try some > exploit. Plus - based other information it may be able to make an > educated guess. As I said before, the premise is the IOMMU is buggy. > >>> operating IOMMUs only require several us, why spin for several ms? >> >> 10ms is just my suggestion. I don't know whether future hardware >> will need more time to complete the invalidation. So I think we need >> to have a large enough timeout here. Meanwhile, doesn't impact the > scheduling. > > It does, as explained further down in my previous reply. > >> I remember the origin motivation to handle ATS problem is due to: 1. >> ATS spec allow 60s timeout to completed the flush which Xen only >> allows 1s, and 2. spin loop for 1s is not reasonable since it >> will hurt the scheduler. For the former, as we discussed before, >> either disable ATS support or only support some specific ATS >> devices(complete the flush less than 10ms or 1ms) is acceptable. >> For the latter, if spin loop for 1s is not acceptable, we can >> reduce the timeout to 10ms or 1ms > to eliminate the performance impaction. > > If we really can, why has it been chosen to be 1s in the first place? What I can tell is 1s is just the value the original author chooses. It has no special means. I have double check with our hardware expert and he suggests us to use the value as small as possible. 
According to his comment, 10ms is sufficiently large.
>>>
>>> So here you talk about milliseconds again, while above you talked
>>> about microseconds. Can we at least settle on an order of what is
>>> required? 10ms is 10 times the minimum time slice credit1 allows, i.e.
>>> awfully long.
>>
>> We can use whatever value you think reasonable that covers most
>> invalidation cases. For the remaining cases, the vcpu can yield the CPU
>> to others until a timer fires. In the callback function, the hypervisor
>> can check whether the invalidation has completed. If yes, schedule the
>> vcpu back in. Otherwise, kill the guest due to the unpredictable
>> invalidation timeout.
>
> Using a timer implies you again think about pausing the vCPU until the
> invalidation completes. Which, as discussed before, has its own
> problems and, even worse, won't cover the domain's other vCPU-s or
> devices still possibly doing work involving the use of the being
> invalidated entries. Or did you have something else in mind?

Why not pause the whole domain? Based on Quan's data, all the
invalidations in his experiment completed within 3us. So perhaps 10us is
enough to cover all invalidations on today's IOMMUs (I need to check with
a hardware expert to get the exact data). The timer mechanism is a backup
only for the extreme case, which exists merely in theory. So the
probability of a guest triggering the timer mechanism can be ignored, and
even if it happens, it only affects that guest itself.

> IOW - as soon as spinning time reaches the order of the scheduler time
> slice, I think the only sane model is async operation with proper
> refcounting.
>
> Jan

Best regards,
Yang
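The pause-the-whole-domain fallback described above can be sketched as a small state machine. The types and helpers are made-up stand-ins for Xen's domain and timer APIs, under the assumption stated in the mail (common case completes within a short spin; the timer path is the rare backup):

```c
#include <assert.h>
#include <stdbool.h>

/* Hedged sketch: spin briefly; if the flush hasn't completed, pause the
 * whole domain and arm a timer. The timer callback either unpauses the
 * domain (flush done) or kills it (unpredictable timeout). */

enum flush_state { FLUSH_DONE, FLUSH_PENDING, FLUSH_TIMED_OUT };

struct fake_domain {
    bool paused;
    bool dead;
};

static bool hw_flush_complete;   /* stands in for the QI status slot */

static enum flush_state begin_flush(struct fake_domain *d)
{
    /* Short spin covering the common (sub-10us) case. */
    for ( int polls = 0; polls < 10; polls++ )
        if ( hw_flush_complete )
            return FLUSH_DONE;

    /* Rare case: stop all of the domain's vCPUs, wait for the timer. */
    d->paused = true;
    return FLUSH_PENDING;
}

/* Timer callback: runs once, after the grace period has elapsed. */
static enum flush_state flush_timer_fn(struct fake_domain *d)
{
    if ( hw_flush_complete )
    {
        d->paused = false;        /* schedule the domain back in */
        return FLUSH_DONE;
    }
    d->dead = true;               /* kill the guest: flush never finished */
    return FLUSH_TIMED_OUT;
}
```

Pausing the whole domain (rather than one vCPU) addresses Jan's point that other vCPUs or devices might still be using the entries being invalidated, though it still leaves in-flight DMA from the device itself to worry about.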
Re: [Xen-devel] [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device
>>> On 15.10.15 at 03:03,wrote: > Jan Beulich wrote on 2015-10-14: >> As long as the multi-millisecond spins aren't going to go away by >> other means, I think conversion to async mode is ultimately unavoidable. > > I am not fully agreed. I think the time to spin is important. To me, less > than 1 ms is acceptable and if the hardware can guarantee it, then sync mode > also is ok. Okay, let me put the condition slightly differently - any spin on the order of what a WBINVD might take ought to be okay, provided both are equally (in)accessible to guests. The whole discussion is really about limiting the impact misbehaving guests can have on the whole system. (Obviously any spin time reaching the order of a scheduling time slice is a problem.) > I remember the origin motivation to handle ATS problem is due to: 1. ATS > spec allow 60s timeout to completed the flush which Xen only allows 1s, and > 2. spin loop for 1s is not reasonable since it will hurt the scheduler. For > the former, as we discussed before, either disable ATS support or only > support some specific ATS devices(complete the flush less than 10ms or 1ms) > is acceptable. For the latter, if spin loop for 1s is not acceptable, we can > reduce the timeout to 10ms or 1ms to eliminate the performance impaction. If we really can, why has it been chosen to be 1s in the first place? > Yes, I'd agree it would be best solution if Xen has the async mode. But spin > loop is used widely in iommu code: not only for invalidations, lots of DMAR > operations are using spin to sync hardware's status. For those operations, it > is hard to use async mode. Or, even it is possible to use async mode, I don't > see the benefit considering the cost and complexity which means we either > need a timer or a softirq to do the check. Even if the cost is high, limited overall throughput by undue spinning is worth it imo even outside of misbehaving guest considerations. 
I'm surprised you're not getting similar pressure on this from the KVM folks (assuming the use of spinning is similar there). Jan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device
>>> On 15.10.15 at 09:28,wrote: > Jan Beulich wrote on 2015-10-15: > On 15.10.15 at 03:03, wrote: >>> Jan Beulich wrote on 2015-10-14: As long as the multi-millisecond spins aren't going to go away by other means, I think conversion to async mode is ultimately unavoidable. >>> >>> I am not fully agreed. I think the time to spin is important. To me, >>> less than 1 ms is acceptable and if the hardware can guarantee it, >>> then sync mode also is ok. >> >> Okay, let me put the condition slightly differently - any spin on the >> order of what a WBINVD might take ought to be okay, provided both are > > From the data we collected, the invalidation is completed within several us. > IMO, the time for WBINVD is varying due the size and different cache > hierarchies. And it may take more than several us in worst case. Understood - hence the setting of the worst case latency of WBINVD as an upper bound for other (kind of similar) software operation. >> equally (in)accessible to guests. The whole discussion is really about >> limiting the impact misbehaving guests can have on the whole system. >> (Obviously any spin time reaching the order of a scheduling time slice >> is a problem.) > > The premise for a misbehaving guest to impact the system is that the IOMMU > is buggy which takes long time to complete the invalidation. In other words, > if all invalidations are able to complete within several us, what's the > matter to do with the spin time? The risk of exploits of such poorly behaving IOMMUs. I.e. if properly operating IOMMUs only require several us, why spin for several ms? >>> I remember the origin motivation to handle ATS problem is due to: 1. >>> ATS spec allow 60s timeout to completed the flush which Xen only >>> allows 1s, and 2. spin loop for 1s is not reasonable since it will >>> hurt the scheduler. 
For the former, as we discussed before, either >>> disable ATS support or only support some specific ATS >>> devices(complete the flush less than 10ms or 1ms) is acceptable. For >>> the latter, if spin loop for 1s is not acceptable, we can reduce the >>> timeout to 10ms or 1ms >> to eliminate the performance impaction. >> >> If we really can, why has it been chosen to be 1s in the first place? > > What I can tell is 1s is just the value the original author chooses. It has > no special means. I have double check with our hardware expert and he > suggests us to use the value as small as possible. According his comment, > 10ms is sufficiently large. So here you talk about milliseconds again, while above you talked about microsecond. Can we at least settle on an order of what is required? 10ms is 10 times the minimum time slice credit1 allows, i.e. awfully long. >>> Yes, I'd agree it would be best solution if Xen has the async mode. >>> But spin loop is used widely in iommu code: not only for >>> invalidations, lots of DMAR operations are using spin to sync >>> hardware's status. For those operations, it is hard to use async mode. >>> Or, even it is possible to use async mode, I don't see the benefit >>> considering the cost and complexity which means we either need a >>> timer or a >> softirq to do the check. >> >> Even if the cost is high, limited overall throughput by undue spinning >> is worth it imo even outside of misbehaving guest considerations. I'm >> surprised you're not getting similar pressure on this from the KVM >> folks (assuming the use of spinning is similar there). > > Because no one observe such invalidation timeout issue so far. What we have > discussed are only in theory. As is the case with many security related things. We shouldn't wait until someone exploits them. > btw, I have told the issue to Linux IOMMU maintainer but he didn't say > anything on it. 
Interesting - that may speak for itself (depending on how long this has been pending), but otoh is in line with experience I have with many (but not all) Linux maintainers. Jan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device
>>> On 15.10.15 at 10:52,wrote: > Jan Beulich wrote on 2015-10-15: > On 15.10.15 at 09:28, wrote: >>> The premise for a misbehaving guest to impact the system is that the >>> IOMMU is buggy which takes long time to complete the invalidation. >>> In other words, if all invalidations are able to complete within >>> several us, what's the matter to do with the spin time? >> >> The risk of exploits of such poorly behaving IOMMUs. I.e. if properly > > But this is not a software flaw. A guest has no way to know the underlying > IOMMU is wrong and it cannot exploit it. A guest doesn't need to know what IOMMU is there in order to try some exploit. Plus - based other information it may be able to make an educated guess. >> operating IOMMUs only require several us, why spin for several ms? > > 10ms is just my suggestion. I don't know whether future hardware will need > more time to complete the invalidation. So I think we need to have a large > enough timeout here. Meanwhile, doesn't impact the scheduling. It does, as explained further down in my previous reply. > I remember the origin motivation to handle ATS problem is due to: 1. > ATS spec allow 60s timeout to completed the flush which Xen only > allows 1s, and 2. spin loop for 1s is not reasonable since it will > hurt the scheduler. For the former, as we discussed before, either > disable ATS support or only support some specific ATS > devices(complete the flush less than 10ms or 1ms) is acceptable. > For the latter, if spin loop for 1s is not acceptable, we can > reduce the timeout to 10ms or 1ms to eliminate the performance impaction. If we really can, why has it been chosen to be 1s in the first place? >>> >>> What I can tell is 1s is just the value the original author chooses. >>> It has no special means. I have double check with our hardware >>> expert and he suggests us to use the value as small as possible. >>> According his comment, 10ms is sufficiently large. 
>> >> So here you talk about milliseconds again, while above you talked >> about microsecond. Can we at least settle on an order of what is >> required? 10ms is >> 10 times the minimum time slice credit1 allows, i.e. >> awfully long. > > We can use an appropriate value which you think reasonable which can cover > most of invalidation cases. For left cases, the vcpu can yield the CPU to > others until a timer fired. In callback function, hypervisor can check > whether the invalidation is completed. If yes, schedule in the vcpu. > Otherwise, kill the guest due to unpredictable invalidation timeout. Using a timer implies you again think about pausing the vCPU until the invalidation completes. Which, as discussed before, has its own problems and, even worse, won't cover the domain's other vCPU-s or devices still possibly doing work involving the use of the being invalidated entries. Or did you have something else in mind? IOW - as soon as spinning time reaches the order of the scheduler time slice, I think the only sane model is async operation with proper refcounting. Jan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
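The async-with-refcounting model Jan argues for can be sketched as follows: the unmap path queues the flush and returns immediately, and a later poll (e.g. from a softirq) retires completed flushes and releases the references that kept the affected pages alive. This is an illustrative toy, not Xen code; the queue layout and sequence-number scheme are assumptions:

```c
#include <assert.h>
#include <stdbool.h>

/* Sketch of async invalidation: no spinning on the issue path; a poll
 * later notices hardware completion and drops the held references. */

#define MAX_PENDING 8

struct pending_flush {
    bool     in_use;
    unsigned seq;        /* sequence number the wait descriptor writes  */
    int      held_refs;  /* refs dropped only once the flush completes  */
};

static struct pending_flush queue[MAX_PENDING];
static unsigned hw_completed_seq;    /* stands in for the QI status write */

/* Issue a flush: record it and return without waiting. */
static int flush_async(unsigned seq, int refs)
{
    for ( int i = 0; i < MAX_PENDING; i++ )
        if ( !queue[i].in_use )
        {
            queue[i] = (struct pending_flush){ true, seq, refs };
            return 0;
        }
    return -1;   /* queue full: caller must throttle */
}

/* Softirq-style poll: retire every flush the hardware has completed.
 * Returns the number of page references released. */
static int flush_poll(void)
{
    int released = 0;

    for ( int i = 0; i < MAX_PENDING; i++ )
        if ( queue[i].in_use && queue[i].seq <= hw_completed_seq )
        {
            released += queue[i].held_refs;   /* put_page() each ref here */
            queue[i].in_use = false;
        }
    return released;
}
```

The key property is that no CPU ever blocks for the duration of the device flush, however long a misbehaving device takes, because correctness is carried by the held references rather than by the wait.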
Re: [Xen-devel] [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device
Jan Beulich wrote on 2015-10-15: On 15.10.15 at 09:28,wrote: >> Jan Beulich wrote on 2015-10-15: >> On 15.10.15 at 03:03, wrote: Jan Beulich wrote on 2015-10-14: > As long as the multi-millisecond spins aren't going to go away by > other means, I think conversion to async mode is ultimately unavoidable. I am not fully agreed. I think the time to spin is important. To me, less than 1 ms is acceptable and if the hardware can guarantee it, then sync mode also is ok. >>> >>> Okay, let me put the condition slightly differently - any spin on >>> the order of what a WBINVD might take ought to be okay, provided >>> both are >> >> From the data we collected, the invalidation is completed within several us. >> IMO, the time for WBINVD is varying due the size and different cache >> hierarchies. And it may take more than several us in worst case. > > Understood - hence the setting of the worst case latency of WBINVD as > an upper bound for other (kind of similar) software operation. > >>> equally (in)accessible to guests. The whole discussion is really about >>> limiting the impact misbehaving guests can have on the whole system. >>> (Obviously any spin time reaching the order of a scheduling time slice >>> is a problem.) >> >> The premise for a misbehaving guest to impact the system is that the >> IOMMU is buggy which takes long time to complete the invalidation. >> In other words, if all invalidations are able to complete within >> several us, what's the matter to do with the spin time? > > The risk of exploits of such poorly behaving IOMMUs. I.e. if properly But this is not a software flaw. A guest has no way to know the underlying IOMMU is wrong and it cannot exploit it. > operating IOMMUs only require several us, why spin for several ms? 10ms is just my suggestion. I don't know whether future hardware will need more time to complete the invalidation. So I think we need to have a large enough timeout here. Meanwhile, doesn't impact the scheduling. 
> I remember the origin motivation to handle ATS problem is due to: 1. ATS spec allow 60s timeout to completed the flush which Xen only allows 1s, and 2. spin loop for 1s is not reasonable since it will hurt the scheduler. For the former, as we discussed before, either disable ATS support or only support some specific ATS devices(complete the flush less than 10ms or 1ms) is acceptable. For the latter, if spin loop for 1s is not acceptable, we can reduce the timeout to 10ms or 1ms >>> to eliminate the performance impaction. >>> >>> If we really can, why has it been chosen to be 1s in the first place? >> >> What I can tell is 1s is just the value the original author chooses. >> It has no special means. I have double check with our hardware >> expert and he suggests us to use the value as small as possible. >> According his comment, 10ms is sufficiently large. > > So here you talk about milliseconds again, while above you talked > about microsecond. Can we at least settle on an order of what is > required? 10ms is > 10 times the minimum time slice credit1 allows, i.e. > awfully long. We can use an appropriate value which you think reasonable which can cover most of invalidation cases. For left cases, the vcpu can yield the CPU to others until a timer fired. In callback function, hypervisor can check whether the invalidation is completed. If yes, schedule in the vcpu. Otherwise, kill the guest due to unpredictable invalidation timeout. Best regards, Yang ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device
>> >>> On 13.10.2015 at 22:50wrote: > >>> On 13.10.15 at 16:29, wrote: > >> > >>>On 29.09.2015 at 15:22 wrote: > >> >>> On 29.09.15 at 04:53, wrote: > >> Monday, September 28, 2015 2:47 PM, wrote: > >> >> >>> On 28.09.15 at 05:08, wrote: > >> >> Thursday, September 24, 2015 12:27 AM, Tim Deegan wrote: > > 1. if (page->count_info & PGC_count_mask == 0) and (page->count_info > > != 0) In this case, can the page be freed to xen domain heap? > > Whether a page can get freed depends on changes to count_info, not just its > current state. For instance, PGC_allocated set implies > page->count_info & PGC_count_mask != 0, i.e. your question above > cannot be answered properly. Just look at put_page() - it frees the page when > the count _drops_ to zero. > > > 2. if (page->count_info & PGC_count_mask == 0) and > (page->u.inuse.type_info != 0) : > > In this case, can the page be freed to xen domain heap? > > Generally type_info should be zero when the ref count is zero; there are, I > think, > exceptional cases (like during domain death) where this might get violated > with > no harm. But again - look at put_page() and you'll see that type_info doesn't > matter for whether a page gets freed; all it matter is whether a page's type > can > change: Only when type count is zero. > Jan, Thanks for kind explanation. -Quan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device
>> >>>On 13.10.2015 at 17:35,wrote: > At 11:09 + on 11 Oct (1444561760), Xu, Quan wrote: > What in particular is worrying you about scheme A? AFAICS you need to build > the same refcount-taking mechanism for either scheme. > > Is it the interactions with other p2m-based features in VMs that don't have > devices passed through? In that case perhaps you could just mandate that ATS > support means no shared HAP/IOMMU tables, and do the refcounting only in the > IOMMU ones? > I am worrying that I should keep a log of all relevant pending derefs and to be processed when the flush completes for __scheme_A__.. Now I know only need the log/deref whenever you need to flush the IOMMU. :):) > > I think __scheme_A__ is complex to keep a log of all relevant pending > > derefs, and to be processed when the flush completes; > > > > optimized __scheme_A__: > > It could keep a log of the reference only when the IOMMU entry is _ > removed/overwritten_.(if the IOMMU entry is not _ removed/overwritten_, it is > safe.). > > Yes, though I'd add any change from read-write to read-only. > Basically, you only need the log/deref whenever you need to flush the > IOMMU. :) > A summary of __scheme_A__: Q1: - when to take the references? take the reference when the IOMMU entry is _created_; in detail: --iommu_map_page(), or --ept_set_entry() [Once IOMMU shares EPT page table.] Q2: how do you know when to drop them? - Log (or something) when the IOMMU entry is removed/overwritten; and - Drop the entry when the flush completes. in detail: --iommu_unmap_page(); or --ept_set_entry() [Once IOMMU shares EPT page table.] **The challenge: how to log when IOMMU entry is removed/overwritten? Q3: what to do about mappings of other domains' memory (i.e. grant and foreign mappings). i.e. grant: -Take the reference when the IOMMU entry is _created_; then, -Queue the reference drop;and -Queue grant iflag update(only for grant_unmap and grant_transfer are enough -- we can hold on this question). __afaics__: 1. 
Are Q1/Q2/Q3 enough for this memory security issue?
2. Are there any other potential memory issues remaining once I finish __scheme_A__?
3. Do you have any idea how to log when an IOMMU entry is removed/overwritten?

Tim, thanks for your help!
Quan
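The Q1/Q2 flow of __scheme_A__ summarized above (reference taken at entry creation, deref logged at removal/overwrite, log drained when the flush completes) can be sketched with toy refcounts. All names are hypothetical stand-ins for iommu_map_page()/iommu_unmap_page() and the flush-completion path:

```c
#include <assert.h>

/* Sketch of __scheme_A__: hold a reference for as long as a (possibly
 * stale) Device-TLB entry might still let the device reach the page. */

#define LOG_SIZE 16

static int page_refcount[4];       /* toy per-page refcounts             */
static int deref_log[LOG_SIZE];    /* pages awaiting a completed flush   */
static int deref_log_len;

/* Q1: iommu_map_page() counterpart - reference taken at entry creation. */
static void map_entry(int page)
{
    page_refcount[page]++;
}

/* Q2, first half: entry removed/overwritten - queue the deref instead of
 * dropping it, since the device may still hold a stale TLB entry. */
static void unmap_entry(int page)
{
    deref_log[deref_log_len++] = page;
}

/* Q2, second half: the flush completed - the queued derefs are now safe. */
static void flush_completed(void)
{
    for ( int i = 0; i < deref_log_len; i++ )
        page_refcount[deref_log[i]]--;
    deref_log_len = 0;
}
```

Between unmap_entry() and flush_completed() the page's refcount stays elevated, which is exactly what prevents the page being freed and reused while the device could still DMA into it.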
Re: [Xen-devel] [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device
>>> On 14.10.15 at 07:12,wrote: > Jan Beulich wrote on 2015-10-13: > On 13.10.15 at 07:27, wrote: >>> Jan Beulich wrote on 2015-10-12: >>> On 12.10.15 at 03:42, wrote: > So, my suggestion is that we can rely on user to not assign the > ATS device if hypervisor says it cannot support such device. For > example, if hypervisor find the invalidation isn't completed in 1 > second, then hypervisor can crash itself and tell the user this > ATS device needs more than 1 second invalidation time which is not > support by >>> Xen. Crashing the hypervisor in such a case is a security issue, i.e. is not >>> >>> Indeed. Crashing the guest is more reasonable. >>> an acceptable thing (and the fact that we panic() on timeout expiry right now isn't really acceptable either). If crashing the offending guest was sufficient to contain the issue, that might be an option. Else >>> >>> I think it should be sufficient (any concern from you?). >> >> Having looked at the code, it wasn't immediately clear whether that >> would work. After all there one would think there would be a reason >> for the code panic()ing right now instead. > > What the panic()ing refer to here? E.g. what we have in queue_invalidate_wait(): while ( poll_slot != QINVAL_STAT_DONE ) { if ( NOW() > (start_time + DMAR_OPERATION_TIMEOUT) ) { print_qi_regs(iommu); panic("queue invalidate wait descriptor was not executed"); } cpu_relax(); } >>> But if solution 1 is acceptable, I prefer it since most of ATS >>> devices are still able to play with Xen. >> >> With a multi-millisecond spin, solution 1 would imo be acceptable only >> as a transitional measure. > > What does the transitional measure mean? Do you mean we still need the async > flush for ATS issue or we can adapt solution 1 before ATS spec changed? As long as the multi-millisecond spins aren't going to go away by other means, I think conversion to async mode is ultimately unavoidable. 
Jan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device
Jan Beulich wrote on 2015-10-14: On 14.10.15 at 07:12,wrote: >> Jan Beulich wrote on 2015-10-13: >> On 13.10.15 at 07:27, wrote: Jan Beulich wrote on 2015-10-12: On 12.10.15 at 03:42, wrote: >> So, my suggestion is that we can rely on user to not assign the >> ATS device if hypervisor says it cannot support such device. For >> example, if hypervisor find the invalidation isn't completed in >> 1 second, then hypervisor can crash itself and tell the user >> this ATS device needs more than 1 second invalidation time which >> is not support by Xen. > > Crashing the hypervisor in such a case is a security issue, i.e. > is not Indeed. Crashing the guest is more reasonable. > an acceptable thing (and the fact that we panic() on timeout expiry > right now isn't really acceptable either). If crashing the offending > guest was sufficient to contain the issue, that might be an option. > Else I think it should be sufficient (any concern from you?). >>> >>> Having looked at the code, it wasn't immediately clear whether that >>> would work. After all there one would think there would be a reason >>> for the code panic()ing right now instead. >> >> What the panic()ing refer to here? > > E.g. what we have in queue_invalidate_wait(): > > while ( poll_slot != QINVAL_STAT_DONE ) > { > if ( NOW() > (start_time + DMAR_OPERATION_TIMEOUT) ) > { > print_qi_regs(iommu); > panic("queue invalidate wait descriptor was not > executed"); > } > cpu_relax(); > } But if solution 1 is acceptable, I prefer it since most of ATS devices are still able to play with Xen. >>> >>> With a multi-millisecond spin, solution 1 would imo be acceptable >>> only as a transitional measure. >> >> What does the transitional measure mean? Do you mean we still need >> the async flush for ATS issue or we can adapt solution 1 before ATS spec >> changed? > > As long as the multi-millisecond spins aren't going to go away by > other means, I think conversion to async mode is ultimately unavoidable. I am not fully agreed. 
I think the time spent spinning is what matters. To me, less than 1 ms is
acceptable, and if the hardware can guarantee it, then sync mode is also ok.

I remember the original motivation for handling the ATS problem was: 1. the
ATS spec allows a 60s timeout to complete the flush while Xen only allows
1s, and 2. a spin loop of 1s is not reasonable since it will hurt the
scheduler. For the former, as we discussed before, either disabling ATS
support or only supporting some specific ATS devices (which complete the
flush in less than 10ms or 1ms) is acceptable. For the latter, if a spin
loop of 1s is not acceptable, we can reduce the timeout to 10ms or 1ms to
eliminate the performance impact.

Yes, I'd agree the best solution would be for Xen to have the async mode.
But spin loops are used widely in the IOMMU code: not only for
invalidations, lots of DMAR operations use spinning to sync the hardware's
status. For those operations, it is hard to use async mode. Or, even if it
is possible to use async mode, I don't see the benefit considering the cost
and complexity, which means we either need a timer or a softirq to do the
check.

Best regards,
Yang
Re: [Xen-devel] [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device
Jan Beulich wrote on 2015-10-13: On 13.10.15 at 07:27,wrote: >> Jan Beulich wrote on 2015-10-12: >> On 12.10.15 at 03:42, wrote: So, my suggestion is that we can rely on user to not assign the ATS device if hypervisor says it cannot support such device. For example, if hypervisor find the invalidation isn't completed in 1 second, then hypervisor can crash itself and tell the user this ATS device needs more than 1 second invalidation time which is not support by >> Xen. >>> >>> Crashing the hypervisor in such a case is a security issue, i.e. is >>> not >> >> Indeed. Crashing the guest is more reasonable. >> >>> an acceptable thing (and the fact that we panic() on timeout expiry >>> right now isn't really acceptable either). If crashing the >>> offending guest was sufficient to contain the issue, that might be an >>> option. >>> Else >> >> I think it should be sufficient (any concern from you?). > > Having looked at the code, it wasn't immediately clear whether that > would work. After all there one would think there would be a reason > for the code panic()ing right now instead. What the panic()ing refer to here? > >> Hypervisor can >> crash the guest with hint that the device may need long time to >> complete the invalidation or device maybe bad. And user should add >> the device to a blacklist to disallow assignment again. >> >>> ripping out ATS support (and limiting spin time below what there is >>> currently) may be the only alternative to fixing it. >> >> Yes, it is another solution considering ATS device is rare currently. >> For spin time, 10ms should be enough in both two solutions. > > But 10ms is awfully much already. Which is why I've been advocating > async flushing independent of ATS. Agree. Technically speaking, async flush is the best solution. But considering the complexity and the benefit it brings, a compromise solution maybe better. > >> But if solution 1 is acceptable, I prefer it since most of ATS >> devices are still able to play with Xen. 
>
> With a multi-millisecond spin, solution 1 would imo be acceptable only
> as a transitional measure.

What does "transitional measure" mean? Do you mean we still need the async
flush for the ATS issue, or that we can adopt solution 1 until the ATS spec
is changed?

>
> Jan

Best regards,
Yang
Re: [Xen-devel] [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device
>>> On 13.10.15 at 07:27,wrote: > Jan Beulich wrote on 2015-10-12: > On 12.10.15 at 03:42, wrote: >>> So, my suggestion is that we can rely on user to not assign the ATS >>> device if hypervisor says it cannot support such device. For >>> example, if hypervisor find the invalidation isn't completed in 1 >>> second, then hypervisor can crash itself and tell the user this ATS >>> device needs more than 1 second invalidation time which is not support by > Xen. >> >> Crashing the hypervisor in such a case is a security issue, i.e. is not > > Indeed. Crashing the guest is more reasonable. > >> an acceptable thing (and the fact that we panic() on timeout expiry >> right now isn't really acceptable either). If crashing the offending >> guest was sufficient to contain the issue, that might be an option. Else > > I think it should be sufficient (any concern from you?). Having looked at the code, it wasn't immediately clear whether that would work. After all there one would think there would be a reason for the code panic()ing right now instead. > Hypervisor can > crash the guest with hint that the device may need long time to complete the > invalidation or device maybe bad. And user should add the device to a > blacklist to disallow assignment again. > >> ripping out ATS support (and limiting spin time below what there is >> currently) may be the only alternative to fixing it. > > Yes, it is another solution considering ATS device is rare currently. For > spin time, 10ms should be enough in both two solutions. But 10ms is awfully much already. Which is why I've been advocating async flushing independent of ATS. > But if solution 1 is acceptable, I prefer it since most of ATS devices are > still able to play with Xen. With a multi-millisecond spin, solution 1 would imo be acceptable only as a transitional measure. Jan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device
Hi,

At 11:09 + on 11 Oct (1444561760), Xu, Quan wrote:
> One question: do the two lists refer to page_list and arch.relmem_list?

No, I was wondering if a page ever needed to be queued waiting for two
different flushes -- e.g. if there are multiple IOMMUs.

> I know you prefer __scheme_A__ (I think Jan prefers __scheme_A__ too. Jan,
> correct me if I am wrong :) ), which fits better with the usual way
> refcounts are used. But __scheme_A__ would be difficult for buy-in by my
> team (obviously: why spend so much effort on such a small issue? Why is
> __scheme_B__ not acceptable?). I think __scheme_A__ is also a tricky
> solution.

What in particular is worrying you about scheme A? AFAICS you need to build
the same refcount-taking mechanism for either scheme. Is it the interactions
with other p2m-based features in VMs that don't have devices passed through?
In that case perhaps you could just mandate that ATS support means no shared
HAP/IOMMU tables, and do the refcounting only in the IOMMU ones?

> I think __scheme_A__ is complex: keeping a log of all relevant pending
> derefs, to be processed when the flush completes.
>
> Optimized __scheme_A__:
> It could keep a log of the reference only when the IOMMU entry is
> _removed/overwritten_ (if the IOMMU entry is not _removed/overwritten_, it
> is safe).

Yes, though I'd add any change from read-write to read-only. Basically, you
only need the log/deref whenever you need to flush the IOMMU. :)

Cheers,

Tim.
Re: [Xen-devel] [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device
>>> On 13.10.15 at 16:29, wrote:
>>>> On 29.09.2015 at 15:22, wrote:
>>>>> On 29.09.15 at 04:53, wrote:
>> Monday, September 28, 2015 2:47 PM, wrote:
>>>>> On 28.09.15 at 05:08, wrote:
>>> Thursday, September 24, 2015 12:27 AM, Tim Deegan wrote:
>> The extra ref taken will prevent the page from getting freed.
>
> Jan, could you share more about it?
>
> I want to check some cases of Xen memory, i.e.:
>
> 1. If (page->count_info & PGC_count_mask) == 0 and page->count_info != 0:
> in this case, can the page be freed to the Xen domain heap?

Whether a page can get freed depends on changes to count_info, not just its
current state. For instance, PGC_allocated being set implies
page->count_info & PGC_count_mask != 0, i.e. your question above cannot be
answered properly. Just look at put_page() - it frees the page when the
count _drops_ to zero.

> 2. If (page->count_info & PGC_count_mask) == 0 and
> page->u.inuse.type_info != 0:
> in this case, can the page be freed to the Xen domain heap?

Generally type_info should be zero when the ref count is zero; there are, I
think, exceptional cases (like during domain death) where this might get
violated with no harm. But again - look at put_page() and you'll see that
type_info doesn't matter for whether a page gets freed; all that matters is
whether a page's type can change: only when the type count is zero.

Jan
Re: [Xen-devel] [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device
>>>> On 29.09.2015 at 15:22, wrote:
>>>>> On 29.09.15 at 04:53, wrote:
> Monday, September 28, 2015 2:47 PM, wrote:
>>>>> On 28.09.15 at 05:08, wrote:
>> Thursday, September 24, 2015 12:27 AM, Tim Deegan wrote:
> The extra ref taken will prevent the page from getting freed.

Jan, could you share more about it?

I want to check some cases of Xen memory, i.e.:

1. If (page->count_info & PGC_count_mask) == 0 and page->count_info != 0:
in this case, can the page be freed to the Xen domain heap?

2. If (page->count_info & PGC_count_mask) == 0 and
page->u.inuse.type_info != 0:
in this case, can the page be freed to the Xen domain heap?

Thanks!

Quan
Re: [Xen-devel] [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device
>>> On 12.10.15 at 03:42, wrote:
> According to the discussion and suggestions you made in the past several
> weeks, obviously it is not an easy task. So I am wondering whether it is
> worth doing, since:
> 1. ATS devices are not popular. I only know of one NIC from Myricom that
> has ATS capabilities.
> 2. The issue is only theoretical. Linux, Windows, and VMware all use
> spinning now, as does Xen, but none of them has observed any problem so
> far.
> 3. I know there is a proposal to modify the timeout value (maybe to less
> than 1 ms) in the ATS spec to mitigate the problem. But the risk is how
> long it takes to achieve that.
> 4. The most important concern is that it is too complicated to fix in Xen,
> since it needs to modify the core memory part. And I don't think Quan and
> I have enough knowledge to do it perfectly currently. It may take a long
> time, half a year or a year? (We have spent three months so far.) Yes, if
> Tim likes to take it, it will be much faster. :)
>
> So, my suggestion is that we can rely on the user to not assign the ATS
> device if the hypervisor says it cannot support such a device. For
> example, if the hypervisor finds the invalidation isn't completed in 1
> second, then it can crash itself and tell the user this ATS device needs
> more than 1 second of invalidation time, which is not supported by Xen.

Crashing the hypervisor in such a case is a security issue, i.e. is not an
acceptable thing (and the fact that we panic() on timeout expiry right now
isn't really acceptable either). If crashing the offending guest was
sufficient to contain the issue, that might be an option. Else ripping out
ATS support (and limiting spin time below what there is currently) may be
the only alternative to fixing it.

Jan
Re: [Xen-devel] [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device
>>> On 11.10.15 at 13:09, wrote:
> On 11.10.2015 at 2:25, wrote:
>> At 17:02 + on 07 Oct (1444237344), Xu, Quan wrote:
>>> Q2: how do you know when to drop them?
>>> - log (or something) when the IOMMU entry is removed/overwritten; and
>>> - drop the entry when the flush completes.
>>>
>>> -- We can add a new page_list_entry structure per page_info, and add the
>>> page with the new page_list_entry structure to a per-domain page list
>>> when the IOMMU entry is removed/overwritten; and drop the entry when the
>>> flush completes.
>>
>> As Jan says, I think that might be too much overhead -- a new
>> page_list_entry would be a pretty big increase to page_info.
>> (And potentially might not be enough if a page ever needs to be on two
>> lists? Or can we make sure that never happens?)
>>
>> Storing the list of to-be-flushed MFNs separately sounds better. The
>> question is how to make sure we don't run out of memory to store that
>> list. Maybe have a pool allocated that's big enough for sensible use, and
>> fail p2m updates with -EAGAIN when we run out?
>
> Ignore this method. It is not a good idea.
> One question: do the two lists refer to page_list and arch.relmem_list?
> It might be a challenge to make sure we don't run out of memory, if it
> stores additional relevant information.
> An aggressive method:
> -- Remove this page from the domain's page_list / arch.relmem_list.
> ... [like steal_page(): swizzle the owner -- page_set_owner(page, NULL);
> unlink from the original owner -- page_list_del(page, &d->page_list)].
> Then (taking a reference for __scheme_B__) add this page to a per-domain
> page list, and put this page when the flush completes.

I'm not following - are you suggesting to remove the page from the guest
temporarily? That's certainly not a valid approach.

> I am afraid that this method may cause some potential issues, i.e. make
> Xen crash in some corner cases.

And such would preclude its use too.

>>> __scheme B__
>>> Q1: - when to take the references?
>>>
>>> take the reference when the IOMMU entry is _removed/overwritten_;
>>> in detail:
>>> --iommu_unmap_page(), or
>>> --ept_set_entry() [once the IOMMU shares the EPT page table]
>>
>> Hmm. That might be practical, taking a typecount of whatever type the
>> page happens to be at the time. I would prefer scheme A though, if it can
>> be made to work. It fits better with the usual way refcounts are used,
>> and it's more likely to be a good long-term fix. This is the second time
>> we've had to add refcounting for an edge case, and I suspect it won't be
>> the last.
>
> I know you prefer __scheme_A__ (I think Jan prefers __scheme_A__ too. Jan,
> correct me if I am wrong :) ), which fits better with the usual way
> refcounts are used. But __scheme_A__ would be difficult for buy-in by my
> team (obviously: why spend so much effort on such a small issue? Why is
> __scheme_B__ not acceptable?). I think __scheme_A__ is also a tricky
> solution.
>
>> The IOMMU entry associated with the freed portion of GPA should not be
>> reallocated for another purpose until the flush completes.
>
> I think __scheme_A__ is complex: keeping a log of all relevant pending
> derefs, to be processed when the flush completes.
>
> Optimized __scheme_A__:
> It could keep a log of the reference only when the IOMMU entry is
> _removed/overwritten_ (if the IOMMU entry is not _removed/overwritten_, it
> is safe).

And I'm afraid this is too vague for me to understand - what do you want to
track under what conditions?

Jan
Re: [Xen-devel] [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device
Jan Beulich wrote on 2015-10-12:
> >>> On 12.10.15 at 03:42, wrote:
>> According to the discussion and suggestions you made in the past several
>> weeks, obviously it is not an easy task. So I am wondering whether it is
>> worth doing, since:
>> 1. ATS devices are not popular. I only know of one NIC from Myricom that
>> has ATS capabilities.
>> 2. The issue is only theoretical. Linux, Windows, and VMware all use
>> spinning now, as does Xen, but none of them has observed any problem so
>> far.
>> 3. I know there is a proposal to modify the timeout value (maybe to less
>> than 1 ms) in the ATS spec to mitigate the problem. But the risk is how
>> long it takes to achieve that.
>> 4. The most important concern is that it is too complicated to fix in
>> Xen, since it needs to modify the core memory part. And I don't think
>> Quan and I have enough knowledge to do it perfectly currently. It may
>> take a long time, half a year or a year? (We have spent three months so
>> far.) Yes, if Tim likes to take it, it will be much faster. :)
>>
>> So, my suggestion is that we can rely on the user to not assign the ATS
>> device if the hypervisor says it cannot support such a device. For
>> example, if the hypervisor finds the invalidation isn't completed in
>> 1 second, then it can crash itself and tell the user this ATS device
>> needs more than 1 second of invalidation time, which is not supported by
>> Xen.
>
> Crashing the hypervisor in such a case is a security issue, i.e. is not

Indeed. Crashing the guest is more reasonable.

> an acceptable thing (and the fact that we panic() on timeout expiry
> right now isn't really acceptable either). If crashing the offending
> guest was sufficient to contain the issue, that might be an option. Else

I think it should be sufficient (any concern from you?). The hypervisor can
crash the guest with a hint that the device may need a long time to complete
the invalidation, or that the device may be bad. And the user should add the
device to a blacklist to disallow assignment again.

> ripping out ATS support (and limiting spin time below what there is
> currently) may be the only alternative to fixing it.

Yes, it is another solution, considering ATS devices are rare currently. For
spin time, 10ms should be enough in both solutions. But if solution 1 is
acceptable, I prefer it, since most ATS devices would still be able to play
with Xen.

Best regards,
Yang
Re: [Xen-devel] [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device
Xu, Quan wrote on 2015-09-16:
> Introduction
>
> VT-d code currently has a number of cases where completion of certain
> operations is being waited for by way of spinning. The majority of
> instances use that variable indirectly through the IOMMU_WAIT_OP() macro,
> allowing for loops of up to 1 second (DMAR_OPERATION_TIMEOUT). While in
> many of the cases this may be acceptable, the invalidation case seems
> particularly problematic.
>
> Currently the hypervisor polls the status address of the wait descriptor
> for up to 1 second to get the invalidation flush result. When the
> invalidation queue includes a Device-TLB invalidation, using 1 second is
> a mistake in the invalidation sync: that 1 second timeout reflects
> response times of the IOMMU engine, not of Device-TLB invalidation with
> PCI-e Address Translation Services (ATS) in use. The ATS specification
> mandates a timeout of 1 _minute_ for cache flush.
> The ATS case needs to be taken into consideration when doing
> invalidations. Obviously we can't spin for a minute, so invalidation
> absolutely needs to be converted to a non-spinning model.
>
> Also I should fix the new memory security issue.
> The page freed from the domain should be held until the Device-TLB flush
> is completed (ATS timeout of 1 _minute_).
> The page previously associated with the freed portion of GPA should not
> be reallocated for another purpose until the appropriate invalidations
> have been performed. Otherwise, the original page owner can still access
> the freed page through DMA.

Hi Maintainers,

According to the discussion and suggestions you made in the past several
weeks, obviously it is not an easy task. So I am wondering whether it is
worth doing, since:
1. ATS devices are not popular. I only know of one NIC from Myricom that
has ATS capabilities.
2. The issue is only theoretical. Linux, Windows, and VMware all use
spinning now, as does Xen, but none of them has observed any problem so far.
3. I know there is a proposal to modify the timeout value (maybe to less
than 1 ms) in the ATS spec to mitigate the problem. But the risk is how
long it takes to achieve that.
4. The most important concern is that it is too complicated to fix in Xen,
since it needs to modify the core memory part. And I don't think Quan and I
have enough knowledge to do it perfectly currently. It may take a long
time, half a year or a year? (We have spent three months so far.) Yes, if
Tim likes to take it, it will be much faster. :)

So, my suggestion is that we can rely on the user to not assign the ATS
device if the hypervisor says it cannot support such a device. For example,
if the hypervisor finds the invalidation isn't completed in 1 second, then
it can crash itself and tell the user this ATS device needs more than
1 second of invalidation time, which is not supported by Xen.

Any comments?

Best regards,
Yang
Re: [Xen-devel] [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device
Hi,

At 17:02 + on 07 Oct (1444237344), Xu, Quan wrote:
> __scheme A__
> Q1: - when to take the references?
> take the reference when the IOMMU entry is _created_;
> in detail:
> --iommu_map_page(), or
> --ept_set_entry() [once the IOMMU shares the EPT page table]
>
> That leaves one question:
> -- what to do with hot-plug ATS device pass-through? The EPT code isn't
> aware that the IOMMU will share the EPT page table at the time the EPT
> page table is _created_.

Either you have to take all the references even before the hot-plug, or you
have to do an (interruptible) pass over the EPT table when needs_iommu
becomes set. Of those I prefer the first one. EPT entries already take
references to foreign pages, and I think that extending that to other kinds
should be OK.

> Q2: how do you know when to drop them?
> - log (or something) when the IOMMU entry is removed/overwritten; and
> - drop the entry when the flush completes.
>
> -- We can add a new page_list_entry structure per page_info, and add the
> page with the new page_list_entry structure to a per-domain page list when
> the IOMMU entry is removed/overwritten; and drop the entry when the flush
> completes.

As Jan says, I think that might be too much overhead -- a new
page_list_entry would be a pretty big increase to page_info.
(And potentially might not be enough if a page ever needs to be on two
lists? Or can we make sure that never happens?)

Storing the list of to-be-flushed MFNs separately sounds better. The
question is how to make sure we don't run out of memory to store that list.
Maybe have a pool allocated that's big enough for sensible use, and fail
p2m updates with -EAGAIN when we run out?

> Q3: what to do about mappings of other domains' memory (i.e. grant and
> foreign mappings).
> Between two domains, now I have only one idea to fix this tricky issue --
> a waitqueue.
> I.e. grant:
> For gnttab_transfer/gnttab_unmap, wait on a waitqueue before updating the
> grant flag, until the Device-TLB flush is completed.
> For grant-mapped pages, it is safe, as with the modification of
> gnttab_unmap.

Waitqueues are probably more heavyweight than is needed here. The grant
unmap operation can complete immediately, and we can queue the reference
drop (and maybe even some of the maptrack updates) to be finished when the
flush completes. It could maybe use the same queue/log as the normal p2m
updates.

> __scheme B__
> Q1: - when to take the references?
>
> take the reference when the IOMMU entry is _removed/overwritten_;
> in detail:
> --iommu_unmap_page(), or
> --ept_set_entry() [once the IOMMU shares the EPT page table]

Hmm. That might be practical, taking a typecount of whatever type the page
happens to be at the time. I would prefer scheme A though, if it can be
made to work. It fits better with the usual way refcounts are used, and
it's more likely to be a good long-term fix. This is the second time we've
had to add refcounting for an edge case, and I suspect it won't be the last.

Cheers,

Tim.
Re: [Xen-devel] [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device
>>> On 09.10.15 at 09:06, wrote:
>>>> On 08.10.2015 at 16:52, wrote:
>>>>> On 07.10.15 at 19:02, wrote:
>>> Q3: what to do about mappings of other domains' memory (i.e. grant and
>>> foreign mappings).
>>> Between two domains, now I have only one idea to fix this tricky issue
>>> -- a waitqueue.
>>> I.e. grant:
>>> For gnttab_transfer/gnttab_unmap, wait on a waitqueue before updating
>>> the grant flag, until the Device-TLB flush is completed.
>>> For grant-mapped pages, it is safe, as with the modification of
>>> gnttab_unmap.
>>
>> Hmm, wouldn't grant transfers already be taken care of by the extra
>> references? See steal_page(). Perhaps the ordering between its invocation
>> and guest_physmap_remove_page() would need to be switched (with the
>> latter getting undone if steal_page() fails). The waiting for the flush
>> to complete could - afaics - be done by using the grant-ops' inherent
>> batching (and hence easy availability of continuations). But I admit
>> there's some hand waving here without closer inspection...
>
> I think the extra references can NOT fix the security issue between two
> domains.
> I.e. if domA transfers the ownership of a page to domB, but domA still
> holds extra references on the page, I think that is not correct.

Again - see steal_page(): Pages with references beyond the single
allocation related one can't change ownership.

>>> __scheme B__
>>> Q1: - when to take the references?
>>>
>>> take the reference when the IOMMU entry is _removed/overwritten_;
>>> in detail:
>>> --iommu_unmap_page(), or
>>> --ept_set_entry() [once the IOMMU shares the EPT page table]
>>>
>>> * Make sure the IOMMU page should not be reallocated for another purpose
>>> until the appropriate invalidations have been performed.
>>> * In this case, it does not matter whether the ATS device is hot-plug
>>> pass-through or assigned at domain initialization.
>>>
>>> Q2 / Q3:
>>> The same as __scheme A__ Q2/Q3 above.
>>>
>>> One question: is __scheme B__ safe? If it is safe, I prefer __scheme B__.
>>
>> While at the first glance this looks like a neat idea -
>
> I think this is safe and a good solution.
> I hope you can look into __scheme B__. I need an _Acked-by_ from you and
> Tim Deegan.

What do you mean here? I'm not going to ack a patch that hasn't even got
written, and while scheme B looks possible, I might still overlook
something, so I also can't up front ack that model (which may then lead to
you expecting that once implemented it gets accepted).

Jan
Re: [Xen-devel] [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device
>>>> On 09.10.2015 at 15:18, wrote:
>>>>> On 09.10.15 at 09:06, wrote:
>>>>>> On 08.10.2015 at 16:52, wrote:
>>>>>>> On 07.10.15 at 19:02, wrote:
>>>> __scheme B__
>>>> Q1: - when to take the references?
>>>>
>>>> take the reference when the IOMMU entry is _removed/overwritten_;
>>>> in detail:
>>>> --iommu_unmap_page(), or
>>>> --ept_set_entry() [once the IOMMU shares the EPT page table]
>>>>
>>>> * Make sure the IOMMU page should not be reallocated for another
>>>> purpose until the appropriate invalidations have been performed.
>>>> * In this case, it does not matter whether the ATS device is hot-plug
>>>> pass-through or assigned at domain initialization.
>>>>
>>>> Q2 / Q3:
>>>> The same as __scheme A__ Q2/Q3 above.
>>>>
>>>> One question: is __scheme B__ safe? If it is safe, I prefer
>>>> __scheme B__.
>>>
>>> While at the first glance this looks like a neat idea -
>>
>> I think this is safe and a good solution.
>> I hope you can look into __scheme B__. I need an _Acked-by_ from you and
>> Tim Deegan.
>
> What do you mean here?

Just verify it. If it is working, I'll continue to write the patch based on
it. If it is not working, I'll continue to research.

> I'm not going to ack a patch that hasn't even got written, and while
> scheme B looks possible, I might still overlook something, so I also
> can't up front ack that model (which may then lead to you expecting that
> once implemented it gets accepted).

I am getting started writing a patch based on __scheme B__ and will send it
out ASAP.

Thanks

Quan
Re: [Xen-devel] [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device
>>> On 07.10.15 at 19:02, wrote:
> __scheme A__
> Q1: - when to take the references?
> take the reference when the IOMMU entry is _created_;
> in detail:
> --iommu_map_page(), or
> --ept_set_entry() [once the IOMMU shares the EPT page table]
>
> That leaves one question:
> -- what to do with hot-plug ATS device pass-through? The EPT code isn't
> aware that the IOMMU will share the EPT page table at the time the EPT
> page table is _created_.

How is it not? iommu_hap_pt_share is a global option, and hence even for a
domain which does not (yet) require an IOMMU these references could be
acquired proactively.

> Q2: how do you know when to drop them?
> - log (or something) when the IOMMU entry is removed/overwritten; and
> - drop the entry when the flush completes.
>
> -- We can add a new page_list_entry structure per page_info, and add the
> page with the new page_list_entry structure to a per-domain page list when
> the IOMMU entry is removed/overwritten; and drop the entry when the flush
> completes.

Please be very careful when considering to grow the size of struct
page_info - due to the amount of its instances this would have a quite
measurable effect on the memory overhead of memory management. Since the
number of pages currently having a flush pending shouldn't be that high at
any point in time, I don't think such increased overhead would be
justifiable.

> Q3: what to do about mappings of other domains' memory (i.e. grant and
> foreign mappings).
> Between two domains, now I have only one idea to fix this tricky issue --
> a waitqueue.
> I.e. grant:
> For gnttab_transfer/gnttab_unmap, wait on a waitqueue before updating the
> grant flag, until the Device-TLB flush is completed.
> For grant-mapped pages, it is safe, as with the modification of
> gnttab_unmap.

Hmm, wouldn't grant transfers already be taken care of by the extra
references? See steal_page(). Perhaps the ordering between its invocation
and guest_physmap_remove_page() would need to be switched (with the latter
getting undone if steal_page() fails). The waiting for the flush to
complete could - afaics - be done by using the grant-ops' inherent batching
(and hence easy availability of continuations). But I admit there's some
hand waving here without closer inspection...

> __scheme B__
> Q1: - when to take the references?
>
> take the reference when the IOMMU entry is _removed/overwritten_;
> in detail:
> --iommu_unmap_page(), or
> --ept_set_entry() [once the IOMMU shares the EPT page table]
>
> * Make sure the IOMMU page should not be reallocated for another purpose
> until the appropriate invalidations have been performed.
> * In this case, it does not matter whether the ATS device is hot-plug
> pass-through or assigned at domain initialization.
>
> Q2 / Q3:
> The same as __scheme A__ Q2/Q3 above.
>
> One question: is __scheme B__ safe? If it is safe, I prefer __scheme B__.

While at the first glance this looks like a neat idea - what do you do if
obtaining the reference fails? Not touching the EPT entry may be possible
(but not nice), but not doing the IOMMU side update when the EPT one
already succeeded would seem very cumbersome (as you'd need to roll back
the EPT side update). Maybe a two-phase operation (obtain references for
IOMMU, update EPT, update IOMMU) could solve this (and could then actually
be independent of whether page tables are shared).

Jan
Re: [Xen-devel] [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device
>>> On October 01, 2015, at 5:09 PM, wrote:
> At 15:05 + on 30 Sep (1443625549), Xu, Quan wrote:
>>>>> On September 29, 2015, at 5:12 PM, wrote:
>> Could I introduce a new typed reference which can only be dereffed in the
>> QI interrupt handler (or associated tasklet)? -- (stop me, I always want
>> to add some new flag or type ..) And preventing changes of ownership/type
>> on the relevant pages.
>
> If you get_page_and_type(PGT_writable_page), then that guarantees that
> the page won't change type or be freed until the matching
> put_page_and_type(), which does what you want. Likewise, for read-only
> mappings, get_page() will guarantee that the page isn't freed until you
> call put_page(). (We don't care about read-only DMA mappings of
> specially-typed pages, so you don't need a typecount there.)

Thanks for your further clarification.

> That leaves three questions:
> - when to take the references?
> - how do you know when to drop them?
> - what to do about mappings of other domains' memory (i.e. grant and
>   foreign mappings).

I think these are the key points. Look at the answers below.

> IIUC your current scheme is to take the reference as the page is freed
> and drop it after the flush.

Yes.

> That's not enough for pages where permissions change for other reasons,
> so it will at least have to be "take the reference when the IOMMU entry
> is removed/changed". IMO that's sufficiently invasive that you might as
> well:
> - take the reference when the IOMMU entry is _created_;
> - log (or something) when the IOMMU entry is removed/overwritten; and
> - drop the entry when the flush completes.

__scheme A__
Q1: - when to take the references?
take the reference when the IOMMU entry is _created_;
in detail:
--iommu_map_page(), or
--ept_set_entry() [once the IOMMU shares the EPT page table]

That leaves one question:
-- what to do with hot-plug ATS device pass-through? The EPT code isn't
aware that the IOMMU will share the EPT page table at the time the EPT page
table is _created_.

Q2: how do you know when to drop them?
- log (or something) when the IOMMU entry is removed/overwritten; and
- drop the entry when the flush completes.

-- We can add a new page_list_entry structure per page_info, and add the
page with the new page_list_entry structure to a per-domain page list when
the IOMMU entry is removed/overwritten; and drop the entry when the flush
completes.

Q3: what to do about mappings of other domains' memory (i.e. grant and
foreign mappings).
Between two domains, now I have only one idea to fix this tricky issue -- a
waitqueue.
I.e. grant:
For gnttab_transfer/gnttab_unmap, wait on a waitqueue before updating the
grant flag, until the Device-TLB flush is completed.
For grant-mapped pages, it is safe, as with the modification of
gnttab_unmap.

__scheme B__
Q1: - when to take the references?
take the reference when the IOMMU entry is _removed/overwritten_;
in detail:
--iommu_unmap_page(), or
--ept_set_entry() [once the IOMMU shares the EPT page table]

* Make sure the IOMMU page should not be reallocated for another purpose
until the appropriate invalidations have been performed.
* In this case, it does not matter whether the ATS device is hot-plug
pass-through or assigned at domain initialization.

Q2 / Q3:
The same as __scheme A__ Q2/Q3 above.

One question: is __scheme B__ safe? If it is safe, I prefer __scheme B__.

Tim, thanks very much!

Quan

> Like other schemes (I'm thinking about the p2m foreign mapping stuff
> here) the reference counting is best done at the very bottom of the
> stack, when the actual entries are being written and cleared. IOW I'd be
> adding more code to atomic_write_ept_entry() and friends.
>
> Cheers,
>
> Tim.
Re: [Xen-devel] [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device
>>>>> On September 29, 2015 at 3:22 PM, wrote:
>>>>> On 29.09.15 at 04:53, wrote:
> Monday, September 28, 2015 2:47 PM, wrote:
>>>>> On 28.09.15 at 05:08, wrote:
>> Thursday, September 24, 2015 12:27 AM, Tim Deegan wrote:
>>
>> For Tim's suggestion -- "to make the IOMMU table take typed refcounts to
>> anything it points to, and only drop those refcounts when the flush
>> completes":
>>
>> From the IOMMU point of view, it can walk through the IOMMU table to get
>> these pages and take typed refcounts.
>> These pages may be owned by hardware_domain, dummy, an HVM guest, etc.
>> Could I narrow it down to HVM guests? --- It is not for anything it
>> points to, but just for what is HVM-guest related. This would simplify
>> the design.
>
> I don't follow. Why would you want to walk page tables? And why would an
> HVM guest have pages other than those owned by itself or granted access
> to by another guest mapped in its IOMMU page tables?

It is tricky. Let's ignore it.
This was an analysis of the IOMMU table taking typed refcounts to anything
it points to. I know the IOMMU table and EPT table may share the same page
table ('iommu_hap_pt_share = 1'). Then, go through the IOMMU table and take
typed refcounts.

> In any event - the ref-counting would need to happen as you _create_ the
> mappings, not at some later point.

As a general rule, agreed. When creating the mappings, under what
conditions should typed refcounts be taken?

>> Just for a check, do typed refcounts refer to the following?
>>
>> --- a/xen/include/asm-x86/mm.h
>> +++ b/xen/include/asm-x86/mm.h
>> @@ -183,6 +183,7 @@ struct page_info
>>  #define PGT_seg_desc_page PG_mask(5, 4)  /* using this page in a GDT/LDT? */
>>  #define PGT_writable_page PG_mask(7, 4)  /* has writable mappings?       */
>>  #define PGT_shared_page   PG_mask(8, 4)  /* CoW sharable page            */
>> +#define PGT_dev_tlb_page  PG_mask(9, 4)  /* Maybe in Device-TLB mapping? */
>>  #define PGT_type_mask     PG_mask(15, 4) /* Bits 28-31 or 60-63.         */
>>
>> * I define a new typed refcount, PGT_dev_tlb_page.
>
> Why? I.e. why won't a base ref for r/o pages and a writable type-ref for
> r/w ones suffice, just like we do everywhere else?

I think it is different from r/o or writable. When the page is freed from
the domain and the Device-TLB flush is not completed, the page is _not_ r/o
or writable, and can only be accessed through DMA. Maybe it would modify a
lot of related code. r/o or writable are acceptable to me.

>>> Once you do that, I don't think there'll be a reason to pause the guest
>>> for the duration of the flush. And really (as pointed out before)
>>> pausing the guest would get us _far_ away from how real hardware
>>> behaves.
>>
>> Once I do that, I think the guest should still be paused if the
>> Device-TLB flush is not completed.
>>
>> As mentioned in a previous email, for example:
>> Call a do_memory_op HYPERCALL to free a pageX (gfn1 <---> mfn1). The
>> gfn1 is the freed portion of GPA.
>> Assume that there is a mapping (gfn1 <---> mfn1) in the Device-TLB. If
>> the Device-TLB flush is not completed and we return to guest mode, the
>> guest may call a do_memory_op HYPERCALL to allocate a new pageY (mfn2)
>> for gfn1.
>> Then: the EPT mapping is (gfn1 <---> mfn2), the Device-TLB mapping is
>> (gfn1 <---> mfn1).
>>
>> If the Device-TLB flush is not completed, DMA associated with gfn1 may
>> still write some data to pageX (gfn1 <---> mfn1), but pageX will be
>> released to Xen when the Device-TLB flush is completed. It is maybe not
>> correct for the guest to read data from gfn1 after the DMA (now the page
>> associated with gfn1 is pageY).
>>
>> Right?
>
> No. The extra ref taken will prevent the page from getting freed. And as
> long as the flush is in process, DMA to/from the page is going to produce
> undefined results (affecting only the guest). But note that there may be
> reasons for an entity external to the guest, invoking the operation which
> ultimately led to the flush, to do this on a paused guest only. But
> that's not of concern to the hypervisor side implementation.

Reasonable. Jan, thanks!

-Quan
Re: [Xen-devel] [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device
>>> On 30.09.15 at 15:55,wrote: >> >> >> >>> On September 29, 2015 at 3:22 PM, wrote: >> >>> On 29.09.15 at 04:53, wrote: >> Monday, September 28, 2015 2:47 PM, wrote: >> >> >>> On 28.09.15 at 05:08, wrote: >> >> Thursday, September 24, 2015 12:27 AM, Tim Deegan wrote: >> > >> > For Tim's suggestion --"to make the IOMMU table take typed refcounts >> > to anything it points to, and only drop those refcounts when the flush >> > completes." >> > >> > From IOMMU point of view, if it can walk through IOMMU table to get >> > these pages and take typed refcounts. >> > These pages are maybe owned by hardware_domain, dummy, HVM guest .etc. >> > could I narrow it down to HVM guest? --- It is not for anything it >> > points to, but just for HVM guest related. this will simplify the design. >> >> I don't follow. Why would you want to walk page tables? And why would a HVM >> guest have pages other than those owned by itself or granted access to by >> another guest mapped in its IOMMU page tables? > > It is tricky. Let's ignore it. > > This is an analysis of IOMMU table to take typed refcounts to anything it > points to. > I know the IOMMU table and EPT table may share the same page > table('iommu_hap_pt_share = 1'). > Then, go through iommu table and take typed refcounts. > >> In any event - the ref-counting >> would need to happen as you _create_ the mappings, not at some later point. >> > a general rule. Agreed. > > When create the mappings, what conditions to take typed refcounts? On any r/o mapping, take a general ref. On any r/w mapping take a writable ref. >> > --- a/xen/include/asm-x86/mm.h >> > +++ b/xen/include/asm-x86/mm.h >> > @@ -183,6 +183,7 @@ struct page_info >> > #define PGT_seg_desc_page PG_mask(5, 4) /* using this page in a GDT/LDT? >> */ >> > #define PGT_writable_page PG_mask(7, 4) /* has writable mappings? 
>> */ >> > #define PGT_shared_page PG_mask(8, 4) /* CoW sharable page >> */ >> > +#define PGT_dev_tlb_page PG_mask(9, 4) /* Maybe in Device-TLB >> mapping? */ >> > #define PGT_type_mask PG_mask(15, 4) /* Bits 28-31 or 60-63. >> */ >> > >> > * I define a new typed refcounts PGT_dev_tlb_page. >> >> Why? I.e. why won't a base ref for r/o pages and a writable type-ref for r/w > ones >> suffice, just like we do everywhere else? >> > > I think it is different from r/o or writable. > > The page is freed from the domain while the Device-TLB flush is not yet completed. > The page is _not_ r/o or writable, and can only be accessed through DMA. DMA or not, the page can either be accessed (read) (when mapped r/o in the IOMMU) or written (when mapped r/w in the IOMMU). For the purpose of refcounting it doesn't matter where the reads/writes originate. Jan
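The rule Jan states earlier in this message ("On any r/o mapping, take a general ref. On any r/w mapping take a writable ref.") can be sketched as a small model. This is illustrative only: the struct and function names (`page_info` fields, `iommu_map_ref`, etc.) are simplified assumptions, not Xen's actual `get_page`/`get_page_type` machinery.

```c
#include <assert.h>
#include <stdbool.h>

/* Minimal model of a page's reference state. */
struct page_info {
    unsigned int count;      /* base (general) references */
    unsigned int type_count; /* writable type references */
};

/* Take the refs needed when an IOMMU mapping is *created*:
 * a base ref for any mapping, plus a writable type ref for r/w ones. */
static bool iommu_map_ref(struct page_info *pg, bool writable)
{
    pg->count++;
    if (writable)
        pg->type_count++;
    return true;
}

/* Drop them only once the corresponding Device-TLB flush has completed. */
static void iommu_unmap_unref(struct page_info *pg, bool writable)
{
    if (writable)
        pg->type_count--;
    pg->count--;
}

/* A page may be freed or re-typed only when no references remain. */
static bool page_can_be_freed(const struct page_info *pg)
{
    return pg->count == 0 && pg->type_count == 0;
}
```

The point of the model is that no new `PGT_dev_tlb_page` type is needed: holding the ordinary refs across the flush window is what keeps the page from being freed or reallocated.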
Re: [Xen-devel] [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device
>> >>> On September 29, 2015, at 5:12 PM,wrote: > At 03:08 + on 28 Sep (1443409723), Xu, Quan wrote: > > >>> Thursday, September 24, 2015 12:27 AM, Tim Deegan wrote: > > > 7/13: I'm not convinced that making the vcpu spin calling > > > sched_yield() is a very good plan. Better to explicitly pause the > > > domain if you need its vcpus not to run. But first -- why does > > > IOMMU flushing mean that vcpus can't be run? > > > > Ensure that the required Device-TLB flushes are applied before > > returning to guest mode via hypercall completion. the domain can also > > DMA these freed pages. For example, Call do_memory_op HYPERCALL to > > free a pageX (gfn --- mfn) from domain, and assume that there is a > > mapping(gfn --- mfn) in Device-TLB, once the vcpu has returned to > > guest mode, then the domain can still DMA this freed pageX. Domain > > kernel cannot use this page being freed, otherwise this is a domain > > kernel bug. > > > OK - let's ignore guest kernel bugs. IIUC you're worried about the guest OS > telling a device to issue DMA to an address that has changed in the IOMMU > tables (unmapped, remapped elsewhere, permissions changed, ...) but not yet > been flushed? Yes, issue DMA to an address that has changed in the IOMMU table and EPT table, but not yet been flushed. > > Unfortunately, pausing the guest's CPUs doesn't stop that. A malicious guest > could enqueue network receive buffers pointing to that address, and then > arrange for a packet to arrive between the IOMMU table change and the flush > completion. Cool !! > So you'll need to do something else to make the unmap safe. >The usual > method in Xen is to hold a reference to the page (for read-only > mappings) Read-only mapping refers to 'PGT_pinned'? Could I introduce a new typed reference which can only be dereferenced in the QI interrupt handler (or associated tasklet)? --(stop me, I always want to add some new flag or type ..) And preventing changes of ownership/type on the relevant pages.
> or a typed reference (for read-write), and not release that reference > until the flush has completed. That's OK with in-line synchronous flushes. > > With the flush taking longer than Xen can wait for, you'll need to do > something > more complex, e.g.: - keep a log of all relevant pending derefs, to be processed when the >flush completes; One of the CCed mentioned this solution in internal discussions. But it is tricky and over-engineered. I would need more than half a year to implement it. > or > - have some other method of preventing changes of ownership/type on >the relevant pages. I prefer this solution. > E.g. for CPU TLBs, we keep a per-page counter >(tlbflush-timestamp) that we can use to detect whether enough TLB >flushes have happened since the page was freed. > > The log is tricky - I'm not sure how to make sure that it has bounded size > if a > flush can take seconds. > > I'm not sure the counter works either -- when that detector triggers we do a > synchronous TLB-flush IPI to make the operation safe, and that's exactly what > we > can't do here. > > Any other ideas floating around? > > Cheers, > Tim, thanks for your help. Any idea I come up with, I will send out, even if it is not a complete solution. Quan > Tim.
Re: [Xen-devel] [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device
>>> On 29.09.15 at 04:53,wrote: Monday, September 28, 2015 2:47 PM, wrote: >> >>> On 28.09.15 at 05:08, wrote: >> Thursday, September 24, 2015 12:27 AM, Tim Deegan wrote: > >> It would be a guest kernel bug, but all _we_ care about is that such a guest > kernel >> bug won't affect the hypervisor or other guests. > > It won't affect the hypervisor or other guest domains. > As the required Device-TLB flushes are not applied, the hypercall is not > completed. The being freed page is still owned by this buggy > Guest, not released back to xen or reallocated for other guests. Seems like you misunderstood the purpose of my reply: I wasn't claiming that what you patch set currently does would constitute an issue. I was simply stating a general rule to consider when thinking about which solutions are viable and which aren't. > For Tim's suggestion --"to make the IOMMU table take typed refcounts to > anything it points to, and only drop those refcounts when the flush > completes." > > From IOMMU point of view, if it can walk through IOMMU table to get these > pages and take typed refcounts. > These pages are maybe owned by hardware_domain, dummy, HVM guest .etc. could > I narrow it down to HVM guest? --- It is not for anything it points to, but > just > for HVM guest related. this will simplify the design. I don't follow. Why would you want to walk page tables? And why would a HVM guest have pages other than those owned by itself or granted access to by another guest mapped in its IOMMU page tables? In any event - the ref-counting would need to happen as you _create_ the mappings, not at some later point. > from HVM guest point of view, once the ATS device is assigned, we can: > *pause the HVM guest domain. > *scan domain's xenpage_list, page_list and arch.relmem_list to get these > pages, which will be took typed refcounts ( PGT_dev_tlb_page -- a new type). > *unpause the HVM guest domain. 
> > (we can ignore domain's xenpage_list) as: > (( >Actually, the previous pages are maybe mapped from Xen heap for guest > domains in decrease_reservation() / xenmem_add_to_physmap_one() >/ p2m_add_foreign(), but they are not mapped to IOMMU table. The below 4 > functions will map xen heap page for guest domains: > * share page for xen Oprofile. > * vLAPIC mapping. > * grant table shared page. > * domain share_info page. > )) Neither of which really has a need to be in the IOMMU page tables afaics. > Just for check, do typed refcounts refer to the following? > > --- a/xen/include/asm-x86/mm.h > +++ b/xen/include/asm-x86/mm.h > @@ -183,6 +183,7 @@ struct page_info > #define PGT_seg_desc_page PG_mask(5, 4) /* using this page in a GDT/LDT? */ > #define PGT_writable_page PG_mask(7, 4) /* has writable mappings? */ > #define PGT_shared_page PG_mask(8, 4) /* CoW sharable page */ > +#define PGT_dev_tlb_page PG_mask(9, 4) /* Maybe in Device-TLB mapping? */ > #define PGT_type_mask PG_mask(15, 4) /* Bits 28-31 or 60-63. */ > > * I define a new typed refcounts PGT_dev_tlb_page. Why? I.e. why won't a base ref for r/o pages and a writable type-ref for r/w ones suffice, just like we do everywhere else? >> Once you do that, I >> don't think there'll be a reason to pause the guest for the duration of the > flush. >> And really (as pointed out before) pausing the guest would get us _far_ away >> from how real hardware behaves. >> > > Once I do that, I think the guest should be still paused, if the Device-TLB > flush is not completed. > > As mentioned in previous email, for example: > Call do_memory_op HYPERCALL to free a pageX (gfn1 <---> mfn1). The gfn1 is > the > freed portion of GPA. > assume that there is a mapping(gfn1<---> mfn1) in Device-TLB. If the > Device-TLB > flush is not completed and return to guest mode, > the guest may call do_memory_op HYPERCALL to allocate a new pageY(mfn2) to > gfn1.. 
> then: > the EPT mapping is (gfn1--mfn2), the Device-TLB mapping is (gfn1<--->mfn1) . > > If the Device-TLB flush is not completed, DMA associated with gfn1 may still > write some data with pageX(gfn1 <---> mfn1), but pageX will be > released to Xen when the Device-TLB flush is completed. It is maybe not > correct for guest to read data from gfn1 after DMA(now the page associated > with gfn1 is pageY ). > > Right? No. The extra ref taken will prevent the page from getting freed. And as long as the flush is in process, DMA to/from the page is going to produce undefined results (affecting only the guest). But note that there may be reasons for an external to the guest entity invoking the operation which ultimately led to the flush to do this on a paused guest only. But that's not of concern to the hypervisor side implementation. Jan
Re: [Xen-devel] [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device
>>> On 29.09.15 at 11:11,wrote: > With the flush taking longer than Xen can wait for, you'll need to > do something more complex, e.g.: > - keep a log of all relevant pending derefs, to be processed when the >flush completes; or > - have some other method of preventing changes of ownership/type on >the relevant pages. E.g. for CPU TLBs, we keep a per-page counter >(tlbflush-timestamp) that we can use to detect whether enough TLB >flushes have happened since the page was freed. > > The log is tricky - I'm not sure how to make sure that it has bounded > size if a flush can take seconds. > > I'm not sure the counter works either -- when that detector triggers > we do a synchronous TLB-flush IPI to make the operation safe, and > that's exactly what we can't do here. > > Any other ideas floating around? The variant of the log model might work if sufficient information is available in the interrupt handler (or associated tasklet) to identify a much smaller subset of pages to scan through. Since there is a 32-bit quantity written to a pre-determined location upon qi completion, I wonder whether that couldn't be used for that purpose - 32 bits disambiguate a page within 16Tb of RAM, so there wouldn't need to be too many hashed together in a single chain. Otoh of course we can't have 2^32 standalone list heads. Jan
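Jan's idea above -- keying pending derefs off the 32-bit status value written on qinval completion, hashed into a bounded number of chains -- could look roughly like the following sketch. All names (`pending_deref`, `DEREF_HASH_BUCKETS`, the cookie field) are hypothetical; this only models the data structure, not real qinval handling.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEREF_HASH_BUCKETS 256  /* real size is a space vs. chain-length tradeoff */

struct pending_deref {
    uint32_t cookie;             /* the 32-bit value the wait descriptor writes */
    unsigned long mfn;           /* page whose ref would be dropped on completion */
    struct pending_deref *next;
};

static struct pending_deref *deref_hash[DEREF_HASH_BUCKETS];

static unsigned int hash_cookie(uint32_t cookie)
{
    return cookie % DEREF_HASH_BUCKETS;
}

/* Record a deref that must wait for the flush tagged with this cookie. */
static void pending_deref_add(struct pending_deref *e)
{
    unsigned int b = hash_cookie(e->cookie);
    e->next = deref_hash[b];
    deref_hash[b] = e;
}

/* Called from the invalidation-completion interrupt (or its tasklet):
 * scan only the chain matching the completed cookie, not all pages.
 * Returns the number of entries retired. */
static unsigned int pending_deref_complete(uint32_t cookie)
{
    unsigned int b = hash_cookie(cookie), retired = 0;
    struct pending_deref **pp = &deref_hash[b];

    while (*pp) {
        if ((*pp)->cookie == cookie) {
            struct pending_deref *e = *pp;
            *pp = e->next;       /* unlink; a real handler would drop the ref here */
            retired++;
        } else {
            pp = &(*pp)->next;
        }
    }
    return retired;
}
```

With 2^32 possible cookies spread over a modest bucket count, each completion only walks a short chain -- which is the "much smaller subset of pages to scan through" the message asks for -- while avoiding 2^32 standalone list heads.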
Re: [Xen-devel] [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device
Hi, At 03:08 + on 28 Sep (1443409723), Xu, Quan wrote: > >>> Thursday, September 24, 2015 12:27 AM, Tim Deegan wrote: > > 7/13: I'm not convinced that making the vcpu spin calling > > sched_yield() is a very good plan. Better to explicitly pause the domain > > if you > > need its vcpus not to run. But first -- why does IOMMU flushing mean that > > vcpus can't be run? > > Ensure that the required Device-TLB flushes are applied before > returning to guest mode via hypercall completion. the domain can > also DMA these freed pages. For example, Call do_memory_op HYPERCALL > to free a pageX (gfn --- mfn) from domain, and assume that there is > a mapping(gfn --- mfn) in Device-TLB, once the vcpu has returned to > guest mode, then the domain can still DMA this freed pageX. Domain > kernel cannot use this page being freed, otherwise this is a domain > kernel bug. OK - let's ignore guest kernel bugs. IIUC you're worried about the guest OS telling a device to issue DMA to an address that has changed in the IOMMU tables (unmapped, remapped elsewhere, permissions changed, ...) but not yet been flushed? Unfortunately, pausing the guest's CPUs doesn't stop that. A malicious guest could enqueue network receive buffers pointing to that address, and then arrange for a packet to arrive between the IOMMU table change and the flush completion. So you'll need to do something else to make the unmap safe. The usual method in Xen is to hold a reference to the page (for read-only mappings) or a typed reference (for read-write), and not release that reference until the flush has completed. That's OK with in-line synchronous flushes. With the flush taking longer than Xen can wait for, you'll need to do something more complex, e.g.: - keep a log of all relevant pending derefs, to be processed when the flush completes; or - have some other method of preventing changes of ownership/type on the relevant pages. E.g.
for CPU TLBs, we keep a per-page counter (tlbflush-timestamp) that we can use to detect whether enough TLB flushes have happened since the page was freed. The log is tricky - I'm not sure how to make sure that it has bounded size if a flush can take seconds. I'm not sure the counter works either -- when that detector triggers we do a synchronous TLB-flush IPI to make the operation safe, and that's exactly what we can't do here. Any other ideas floating around? Cheers, Tim.
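The CPU-TLB scheme Tim refers to can be modelled in a few lines: a global flush counter plus a per-page stamp taken when the page is freed, with reuse allowed only once the counter has advanced. The names here (`flush_count`, `page_stamp`) are illustrative, not Xen's `tlbflush_clock`/`tlbflush_timestamp` code, and counter wrap-around is deliberately ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

static uint32_t flush_count;     /* incremented on each completed flush */

struct page_stamp {
    uint32_t tlbflush_timestamp; /* value of flush_count when the page was freed */
};

/* Stamp the page at free time with the current flush epoch. */
static void page_freed(struct page_stamp *pg)
{
    pg->tlbflush_timestamp = flush_count;
}

/* A flush (TLB or, hypothetically, Device-TLB) has completed. */
static void flush_completed(void)
{
    flush_count++;
}

/* Safe to reallocate only if at least one flush completed since the free.
 * Real code must also handle counter wrap, omitted here. */
static bool page_safe_to_reuse(const struct page_stamp *pg)
{
    return flush_count != pg->tlbflush_timestamp;
}
```

Tim's objection maps directly onto this model: when `page_safe_to_reuse()` fails, the CPU case forces a synchronous flush IPI to advance the counter immediately -- and with a Device-TLB flush that may take seconds, that forcing step is exactly what is unavailable.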
Re: [Xen-devel] [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device
>>> On 28.09.15 at 05:08,wrote: Thursday, September 24, 2015 12:27 AM, Tim Deegan wrote: >> 7/13: I'm not convinced that making the vcpu spin calling >> sched_yield() is a very good plan. Better to explicitly pause the domain if >> you >> need its vcpus not to run. But first -- why does IOMMU flushing mean that >> vcpus can't be run? > > Ensure that the required Device-TLB flushes are applied before returning to > guest mode via hypercall completion. > the domain can also DMA these freed pages. > For example, Call do_memory_op HYPERCALL to free a pageX (gfn --- mfn) from > domain, and assume that there is > a mapping(gfn --- mfn) in Device-TLB, once the vcpu has returned to guest > mode, > then the domain can still DMA this freed pageX. > Domain kernel cannot use this page being freed, otherwise this is a domain > kernel bug. It would be a guest kernel bug, but all _we_ care about is that such a guest kernel bug won't affect the hypervisor or other guests. You need to answer the question (perhaps just for yourself) taking into account Tim's suggestion to hold references to all pages mapped by the IOMMU page tables. Once you do that, I don't think there'll be a reason to pause the guest for the duration of the flush. And really (as pointed out before) pausing the guest would get us _far_ away from how real hardware behaves. The only possibly tricky thing will be how to know in the flush completion handler which pages to drop references for, as it doesn't look like you'd be able to put them on a list without allocating extra memory for tracking (and allocation in turn would be bad as it can fail). > I didn't make the IOMMU table to take typed refcount to anything it points > to. This is really complex. But unavoidable I think, and with that I'm not sure it makes a lot of sense to do further (detailed) review of the initial version of the series. Jan
Re: [Xen-devel] [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device
>>> Monday, September 28, 2015 2:47 PM,wrote: > >>> On 28.09.15 at 05:08, wrote: > Thursday, September 24, 2015 12:27 AM, Tim Deegan wrote: > It would be a guest kernel bug, but all _we_ care about is that such a guest > kernel > bug won't affect the hypervisor or other guests. It won't affect the hypervisor or other guest domains. As the required Device-TLB flushes are not applied, the hypercall is not completed. The being freed page is still owned by this buggy Guest, not released back to xen or reallocated for other guests. > You need to answer the > question (perhaps just for yourself) taking into account Tim's suggestion to > hold > references to all pages mapped by the IOMMU page tables. It is safe and complex. But if Tim can ack all of my memory analysis, does my solution work for upstream? For Tim's suggestion --"to make the IOMMU table take typed refcounts to anything it points to, and only drop those refcounts when the flush completes." >From IOMMU point of view, if it can walk through IOMMU table to get these >pages and take typed refcounts. These pages are maybe owned by hardware_domain, dummy, HVM guest .etc. could I narrow it down to HVM guest? --- It is not for anything it points to, but just for HVM guest related. this will simplify the design. from HVM guest point of view, once the ATS device is assigned, we can: *pause the HVM guest domain. *scan domain's xenpage_list, page_list and arch.relmem_list to get these pages, which will be took typed refcounts ( PGT_dev_tlb_page -- a new type). *unpause the HVM guest domain. (we can ignore domain's xenpage_list) as: (( Actually, the previous pages are maybe mapped from Xen heap for guest domains in decrease_reservation() / xenmem_add_to_physmap_one() / p2m_add_foreign(), but they are not mapped to IOMMU table. The below 4 functions will map xen heap page for guest domains: * share page for xen Oprofile. * vLAPIC mapping. * grant table shared page. * domain share_info page. 
)) * Once assigned a new page, if the ATS device is assigned, we should also take typed refcounts ( PGT_dev_tlb_page). * Once freed a page, the ATS device is assigned, we should check the typed refcounts ( PGT_dev_tlb_page) in free_domheap_pages() If the typed refcounts is PGT_dev_tlb_page, the page should be hold in a page list per-domain and freed in QI interrupt handler. Just for check, do typed refcounts refer to the following? --- a/xen/include/asm-x86/mm.h +++ b/xen/include/asm-x86/mm.h @@ -183,6 +183,7 @@ struct page_info #define PGT_seg_desc_page PG_mask(5, 4) /* using this page in a GDT/LDT? */ #define PGT_writable_page PG_mask(7, 4) /* has writable mappings? */ #define PGT_shared_page PG_mask(8, 4) /* CoW sharable page */ +#define PGT_dev_tlb_page PG_mask(9, 4) /* Maybe in Device-TLB mapping? */ #define PGT_type_mask PG_mask(15, 4) /* Bits 28-31 or 60-63. */ * I define a new typed refcounts PGT_dev_tlb_page. > Once you do that, I > don't think there'll be a reason to pause the guest for the duration of the > flush. > And really (as pointed out before) pausing the guest would get us _far_ away > from how real hardware behaves. > Once I do that, I think the guest should be still paused, if the Device-TLB flush is not completed. As mentioned in previous email, for example: Call do_memory_op HYPERCALL to free a pageX (gfn1 <---> mfn1). The gfn1 is the freed portion of GPA. assume that there is a mapping(gfn1<---> mfn1) in Device-TLB. If the Device-TLB flush is not completed and return to guest mode, the guest may call do_memory_op HYPERCALL to allocate a new pageY(mfn2) to gfn1.. then: the EPT mapping is (gfn1--mfn2), the Device-TLB mapping is (gfn1<--->mfn1) . If the Device-TLB flush is not completed, DMA associated with gfn1 may still write some data with pageX(gfn1 <---> mfn1), but pageX will be Released to xen when the Device-TLB flush is completed. 
It is maybe not correct for guest to read data from gfn1 after DMA(now the page associated with gfn1 is pageY ). Right? > The only possibly tricky thing will be how to know in the flush completion > handler > which pages to drop references for, as it doesn't look like you'd be able to > put > them on a list without allocating extra memory for tracking (and allocation > in turn > would be bad as it can fail). > * Once a page is freed and an ATS device is assigned, we should check the typed refcount ( PGT_dev_tlb_page) in free_domheap_pages(). If the typed refcount is PGT_dev_tlb_page, the page should be held on a per-domain page list and freed in the QI interrupt handler. > > I didn't make the IOMMU table to take typed refcount to anything it > > points to. This is really complex. > > But unavoidable I think, and with that I'm not sure it makes a lot of sense > to do > further (detailed) review of the initial version of the
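Quan's proposal -- park pages on a per-domain list at free time when an ATS device is assigned, and release them from the QI (invalidation completion) interrupt handler -- can be sketched as below. Everything here is hypothetical modelling, not the real `free_domheap_pages()` path; the list is singly linked for brevity and locking is omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct held_page {
    struct held_page *next;
    bool freed;                      /* actually released back to the allocator? */
};

struct domain_sketch {
    bool ats_device_assigned;
    struct held_page *pending_free;  /* pages awaiting Device-TLB flush completion */
};

/* Stands in for the decision point inside free_domheap_pages(). */
static void free_or_hold(struct domain_sketch *d, struct held_page *pg)
{
    if (d->ats_device_assigned) {
        /* Defer: the page may still be reachable via a stale Device-TLB entry. */
        pg->next = d->pending_free;
        d->pending_free = pg;
    } else {
        pg->freed = true;            /* normal immediate free */
    }
}

/* Invoked from the QI interrupt handler once the flush has completed;
 * returns how many deferred pages were released. */
static unsigned int qi_release_pending(struct domain_sketch *d)
{
    unsigned int n = 0;
    while (d->pending_free) {
        struct held_page *pg = d->pending_free;
        d->pending_free = pg->next;
        pg->freed = true;
        n++;
    }
    return n;
}
```

Note this only captures the free path; it side-steps Jan's objection that tracking storage must not require allocation (here the list node is embedded in the page structure itself, which is the usual way around that).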
Re: [Xen-devel] [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device
Tim, thanks for your review. >>> Thursday, September 24, 2015 12:27 AM, Tim Deegan wrote: > Hi, > > At 14:09 + on 21 Sep (1442844587), Xu, Quan wrote: > > George / Tim, > > Could you help me review these memory patches? Thanks! > > The interrupt-mapping and chipset control parts of this are outside my > understanding. :) And I'm not an x86/mm maintainer any more, but I'll have a > look: > I've heard so much about you from my teammates. :):) > 7/13: I'm not convinced that making the vcpu spin calling > sched_yield() is a very good plan. Better to explicitly pause the domain if > you > need its vcpus not to run. But first -- why does IOMMU flushing mean that > vcpus can't be run? > Ensure that the required Device-TLB flushes are applied before returning to guest mode via hypercall completion. the domain can also DMA this freed pages. For example, Call do_memory_op HYPERCALL to free a pageX (gfn --- mfn) from domain, and assume that there is a mapping(gfn --- mfn) in Device-TLB, once the vcpu has returned to guest mode, then the domain can still DMA this freed pageX. Domain kernel cannot use this being freed page, otherwise this is a domain kernel bug. In fact, any time you want to reschedule, you need to raise SCHEDULE_SOFTIRQ, which is then checked and serviced in do_softirq(). It can also pause the domain and unpause the domain in QI interrupt handler, or block the vCPU and unblock all of the domain's vCPU in QI interrupt handler. It is the performance consideration while using sched_yield(). > 8/13: This doesn't seem like it's enough make this safe. :( Yes, you can't > allocate > a page to another VM while there are IOTLB entries pointing to it, but you > also > can't use it for other things inside the same domain! > > It might be enough, if you can argue that e.g. the IOMMU tables only ever have > mappings of pages owned by the domain, and that any other feature that might > rely on the daomin's memory being made read-only (e.g. 
sharing) is explicitly > disallowed, but I'd want to see those things mechanically enforced. > As mentioned above, Device-TLB flushes are applied before returning to guest mode via hypercall completion. If the hypercall is not completed, the freed page will not be released back to Xen. The domain kernel cannot use this page while it is being freed; otherwise it is a domain kernel bug. So I think it is enough. :):) > I think the safest answer is to make the IOMMU table take typed refcounts to > anything it points to, and only drop those refcounts when the flush completes, > but I can imagine that becoming complex. > Yes, I also think so. I have a similar design, taking a typed refcount and increasing the page refcount (as PGT_pinned / PGC_allocated), but I did it only in the page-free invocation flow. I dropped this design, as I would have to introduce another flag to be aware that the domain is assigned an ATS device (an ugly design). I didn't make the IOMMU table take typed refcounts to anything it points to. This is really complex. > You may also need to consider grant-mapped memory - you need to make sure > the grant isn't released until after the flush completes. > I can introduce a new grant type for this case (such as GNTMAP_iommu_dma). For ending foreign access, it should _only_ be used when the remote domain has unmapped the frame. gnttab_query_foreign_access( gref ) will indicate the state of any mapping. > 12/13: Ah, I see you are looking at grant table stuff, at least for the > transfer path. :) > Still the mapping path needs to be looked at. > Yes, I think you can do it. :):) > Cheers, > > Tim. Tim, thanks again. I really appreciate your review. Quan
Re: [Xen-devel] [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device
Hi, At 14:09 + on 21 Sep (1442844587), Xu, Quan wrote: > George / Tim, > Could you help me review these memory patches? Thanks! The interrupt-mapping and chipset control parts of this are outside my understanding. :) And I'm not an x86/mm maintainer any more, but I'll have a look: 7/13: I'm not convinced that making the vcpu spin calling sched_yield() is a very good plan. Better to explicitly pause the domain if you need its vcpus not to run. But first -- why does IOMMU flushing mean that vcpus can't be run? 8/13: This doesn't seem like it's enough to make this safe. :( Yes, you can't allocate a page to another VM while there are IOTLB entries pointing to it, but you also can't use it for other things inside the same domain! It might be enough, if you can argue that e.g. the IOMMU tables only ever have mappings of pages owned by the domain, and that any other feature that might rely on the domain's memory being made read-only (e.g. sharing) is explicitly disallowed, but I'd want to see those things mechanically enforced. I think the safest answer is to make the IOMMU table take typed refcounts to anything it points to, and only drop those refcounts when the flush completes, but I can imagine that becoming complex. You may also need to consider grant-mapped memory - you need to make sure the grant isn't released until after the flush completes. 12/13: Ah, I see you are looking at grant table stuff, at least for the transfer path. :) Still the mapping path needs to be looked at. Cheers, Tim.
Re: [Xen-devel] [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device
>>> On 21.09.15 at 11:46,wrote: >>> >>> On 21.09.15 at 16:51, < jbeul...@suse.com > wrote: >>- Anything else? > > > Just test the extreme case. The ATS specification mandates a timeout of 1 > _minute_ for cache flush, even though it doesn't take so much time for cache > flush. > In my design, if the Device-TLB flush is not completed, the domain's vCPUs are not > allowed to enter guest mode (patch #7), otherwise, the logic is not correct. Hmm, that would be a serious limitation, and I can't immediately see a reason: Taking this with virtualization removed, I don't think such an invalidation would stall a physical CPU for a whole minute. Sadly I also can't immediately think of a solution, but I guess first of all I'll have to take a look at the series (which unfortunately may take a few days to get to). Jan
Re: [Xen-devel] [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device
>>> On 21.09.15 at 16:03,wrote: On 21.09.15 at 20:04, < jbeul...@suse.com > wrote: >> >>> On 21.09.15 at 11:46, wrote: >> >>> >>> On 21.09.15 at 16:51, < jbeul...@suse.com > wrote: >> >>- Anything else? >> > >> > >> > Just test the extreme case. The ATS specification mandates a timeout >> > of 1 _minute_ for cache flush, even though it doesn't take so much >> > time for cache flush. >> > In my design, if the Device-TLB flush is not completed, the domain's vCPUs >> > are not allowed to enter guest mode (patch #7), otherwise, the logic is not >> correct. >> >> Hmm, that would be a serious limitation, and I can't immediately see a > reason: >> Taking this with virtualization removed, I don't think such an invalidation > would >> stall a physical CPU for a whole minute. Sadly I also can't immediately > think of a >> solution, but I guess first of all I'll have to take a look at the series > (which >> unfortunately may take a few days to get to). > > thanks for your review. Any comment, I will reply to your emails ASAP. Thanks. It would however have been nice if you addressed the implied question (regarding the reason) right away. Jan
Re: [Xen-devel] [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device
>>> On 21.09.15 at 20:04, < jbeul...@suse.com > wrote: > >>> On 21.09.15 at 11:46,wrote: > >>> >>> On 21.09.15 at 16:51, < jbeul...@suse.com > wrote: > >>- Anything else? > > > > > > Just test the extreme case. The ATS specification mandates a timeout > > of 1 _minute_ for cache flush, even though it doesn't take so much > > time for cache flush. > > In my design, if the Device-TLB flush is not completed, the domain's vCPUs > > are not allowed to enter guest mode (patch #7), otherwise, the logic is not > correct. > > Hmm, that would be a serious limitation, and I can't immediately see a reason: > Taking this with virtualization removed, I don't think such an invalidation > would > stall a physical CPU for a whole minute. Sadly I also can't immediately think > of a > solution, but I guess first of all I'll have to take a look at the series > (which > unfortunately may take a few days to get to). > > Jan Jan, thanks for your review. Any comment, I will reply to your emails ASAP. -Quan
Re: [Xen-devel] [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device
George / Tim, Could you help me review these memory patches? Thanks! -Quan > -Original Message- > From: Xu, Quan > Sent: Wednesday, September 16, 2015 9:24 PM > To: andrew.coop...@citrix.com; Dong, Eddie; ian.campb...@citrix.com; > ian.jack...@eu.citrix.com; jbeul...@suse.com; Nakajima, Jun; k...@xen.org; > Tian, Kevin; t...@xen.org; Zhang, Yang Z; george.dun...@eu.citrix.com > Cc: xen-devel@lists.xen.org; Xu, Quan > Subject: [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device > > Introduction > > >VT-d code currently has a number of cases where completion of certain > operations is being waited for by way of spinning. The majority of instances > use > that variable indirectly through the IOMMU_WAIT_OP() macro, allowing for loops of > up to 1 second (DMAR_OPERATION_TIMEOUT). While in many of the cases this > may be acceptable, the invalidation case seems particularly problematic. > > Currently the hypervisor polls the status address of the wait descriptor for up to 1 > second to > get the invalidation flush result. When the invalidation queue includes Device-TLB > invalidation, using 1 second is a mistake here in the invalidation sync. The 1 > second timeout here is related to response times by the IOMMU engine; for Device-TLB invalidation with PCI-e Address Translation Services (ATS) in > use, > the ATS specification mandates a timeout of 1 _minute_ for cache flush. The > ATS > case needs to be taken into consideration when doing invalidations. Obviously > we can't spin for a minute, so invalidation absolutely needs to be converted > to a > non-spinning model. > >Also I should fix the new memory security issue. > The page freed from the domain should be held, until the Device-TLB flush > is > completed (ATS timeout of 1 _minute_). > The page previously associated with the freed portion of GPA should not be > reallocated for another purpose until the appropriate invalidations have been > performed.
Otherwise, the original page owner can still access the freed page through DMA.
>
> Why RFC
> ===
> Patch 0001--0005, 0013 are IOMMU related.
> Patch 0006 is about a new flag (vCPU / MMU related).
> Patch 0007 is vCPU related.
> Patch 0008--0012 are MMU related.
>
> 1. Xen MMU is very complicated. Could Xen MMU experts help me verify whether I have covered all of the cases?
>
> 2. For gnttab_transfer, if the Device-TLB flush has still not completed when the transferring page is to be mapped into a remote domain, schedule and wait on a waitqueue until the Device-TLB flush is completed. Is that correct?
> (I have tested the waitqueue in decrease_reservation() [do_memory_op() hypercall]: I wake up a domain (with only one vCPU) with the debug-key tool, and the domain is still working after waiting 60s on the waitqueue.)
>
> Design Overview
> ===
>
> This design implements a non-spinning model for Device-TLB invalidation, using an interrupt-based mechanism. Track the Device-TLB invalidation status in a per-domain invalidation table. The invalidation table keeps the count of in-flight Device-TLB invalidation requests, and also provides a global polling parameter per domain for those in-flight requests. When submitting invalidation requests, update the invalidation table's count of in-flight Device-TLB invalidation requests, and put the address of the per-domain global polling parameter into the Status Address field of each invalidation wait descriptor.
>
> For example:
>
> | invl |
> | dsc  |
> | wait | Status Data = 1 (the count of in-flight Device-TLB invalidation requests)
> | dsc  | Status Address = virt_to_maddr(&_a_global_polling_parameter_per_domain_)
> .
> .
> | invl |
> | dsc  |
> | wait | Status Data = 2 (the count of in-flight Device-TLB invalidation requests)
> | dsc  | Status Address = virt_to_maddr(&_a_global_polling_parameter_per_domain_)
> .
> .
> | invl |
> | dsc  |
> | wait | Status Data = 3 (the count of in-flight Device-TLB invalidation requests)
> | dsc  | Status Address = virt_to_maddr(&_a_global_polling_parameter_per_domain_)
>
> For more information about the VT-d Invalidation Wait Descriptor, please refer to
> http://www.intel.com/content/www/us/en/intelligent-systems/intel-technology/vt-directed-io-spec.html
> section 6.5.2.8, Invalidation Wait Descriptor:
> Status Address and Data: The status address and data are used by hardware to perform the wait descriptor completion status write when the Status Write (SW) field is Set. Hardware behavior is undefined if the Status Address is in the interrupt address range (0xFEEX_ etc.). The Status Address and Data fields are ignored by hardware when the Status Write field is Clear.
>
> The invalidation completion event interrupt is generated only after the invalidation wait descriptor completes.
Re: [Xen-devel] [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device
Thanks, Jan.

>>> On 21.09.15 at 16:51, <jbeul...@suse.com> wrote:
>>>> On 17.09.15 at 05:26, wrote:
>>>> Much more information:
>>>> If I run a service in this domain and test this waitqueue case, the domain is still working after 60s, but it prints out a Call Trace with $dmesg:
>>>>
>>>> [ 161.978599] BUG: soft lockup - CPU#0 stuck for 57s! [kworker/0:1:272]
>>>
>>> Not sure what you meant to tell us with that (it clearly means there's a bug somewhere in the series you're testing):
>>> - Drop the current series and wait for an update?

No.

>>> - A request for help? If so, I don't think there's much to be said based on just the kernel soft lockup detector output.
>>> - Anything else?

Just testing the extreme case. The ATS specification mandates a timeout of 1 _minute_ for cache flush, even though the cache flush doesn't actually take that long. In my design, if the Device-TLB flush is not completed, the domain's vCPUs are not allowed to enter guest mode (patch #7); otherwise, the logic is not correct.

~Quan

> Jan

___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device
>>> On 17.09.15 at 05:26, wrote:
> Much more information:
> If I run a service in this domain and test this waitqueue case, the domain is still working after 60s, but it prints out a Call Trace with $dmesg:
>
> [ 161.978599] BUG: soft lockup - CPU#0 stuck for 57s! [kworker/0:1:272]

Not sure what you meant to tell us with that (it clearly means there's a bug somewhere in the series you're testing):
- Drop the current series and wait for an update?
- A request for help? If so, I don't think there's much to be said based on just the kernel soft lockup detector output.
- Anything else?

Jan
Re: [Xen-devel] [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device
Julien Grall writes ("Re: [Xen-devel] [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device"):
> On 16/09/2015 14:47, Ian Jackson wrote:
> > I don't consider myself qualified to review that. I think the MAINTAINERS file should have an entry for xen/ but it doesn't seem to.
> I think such a patch should come from you.

You're probably right. Just sent.

Thanks, Ian.
Re: [Xen-devel] [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device
On 16/09/2015 14:47, Ian Jackson wrote:
> Julien Grall writes ("Re: [Xen-devel] [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device"):
>> On 16/09/15 11:46, Ian Jackson wrote:
>>> JOOI why did you CC me? I did a quick scan of these patches and they don't seem to have any tools impact. I would prefer not to be CC'd unless there is a reason why my attention would be valuable.
>> The common directory is maintained by the "THE REST" group. From the MAINTAINERS file you are part of it.
> Ah. Hmmm. You mean xen/common/ ?

Right.

> I don't consider myself qualified to review that. I think the MAINTAINERS file should have an entry for xen/ but it doesn't seem to.

I think such a patch should come from you.

Regards,

--
Julien Grall
Re: [Xen-devel] [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device
Quan Xu writes ("[Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device"):
> Introduction

Thanks for your submission.

JOOI why did you CC me? I did a quick scan of these patches and they don't seem to have any tools impact. I would prefer not to be CC'd unless there is a reason why my attention would be valuable.

Regards, Ian.
Re: [Xen-devel] [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device
Julien Grall writes ("Re: [Xen-devel] [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device"):
> On 16/09/15 11:46, Ian Jackson wrote:
> > JOOI why did you CC me? I did a quick scan of these patches and they don't seem to have any tools impact. I would prefer not to be CC'd unless there is a reason why my attention would be valuable.
> The common directory is maintained by the "THE REST" group. From the MAINTAINERS file you are part of it.

Ah. Hmmm. You mean xen/common/ ? I don't consider myself qualified to review that. I think the MAINTAINERS file should have an entry for xen/ but it doesn't seem to.

Ian.
Re: [Xen-devel] [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device
> -Original Message-
> From: Ian Jackson [mailto:ian.jack...@eu.citrix.com]
> Sent: Wednesday, September 16, 2015 6:47 PM
> To: Xu, Quan
> Cc: andrew.coop...@citrix.com; Dong, Eddie; ian.campb...@citrix.com; ian.jack...@eu.citrix.com; jbeul...@suse.com; Nakajima, Jun; k...@xen.org; Tian, Kevin; t...@xen.org; Zhang, Yang Z; george.dun...@eu.citrix.com; xen-devel@lists.xen.org
> Subject: Re: [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device
>
> Quan Xu writes ("[Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device"):
> > Introduction
>
> Thanks for your submission.
>
> JOOI why did you CC me? I did a quick scan of these patches and they don't seem to have any tools impact. I would prefer not to be CC'd unless there is a reason why my attention would be valuable.

Ian, thanks for your quick response! For patch-11 and patch-12, I got your email address from the scripts/get_maintainer.pl tool.

-Quan

> Regards,
> Ian.
Re: [Xen-devel] [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device
> -Original Message-
> From: Xu, Quan
> Sent: Wednesday, September 16, 2015 9:24 PM
> To: andrew.coop...@citrix.com; Dong, Eddie; ian.campb...@citrix.com; ian.jack...@eu.citrix.com; jbeul...@suse.com; Nakajima, Jun; k...@xen.org; Tian, Kevin; t...@xen.org; Zhang, Yang Z; george.dun...@eu.citrix.com
> Cc: xen-devel@lists.xen.org; Xu, Quan
> Subject: [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device
>
> Introduction
>
> The VT-d code currently has a number of cases where completion of certain operations is waited for by way of spinning. The majority of instances spin, via the IOMMU_WAIT_OP() macro, for up to 1 second (DMAR_OPERATION_TIMEOUT). While in many of the cases this may be acceptable, the invalidation case seems particularly problematic.
>
> Currently the hypervisor polls the status address of the wait descriptor for up to 1 second to get the invalidation flush result. When the invalidation queue includes a Device-TLB invalidation, a 1-second timeout is a mistake: that value reflects response times of the IOMMU engine itself, whereas for Device-TLB invalidation with PCI-e Address Translation Services (ATS) in use, the ATS specification mandates a timeout of 1 _minute_ for cache flush. The ATS case needs to be taken into consideration when doing invalidations. Obviously we can't spin for a minute, so invalidation absolutely needs to be converted to a non-spinning model.
>
> I also need to fix a new memory security issue: a page freed by the domain must be held until the Device-TLB flush has completed (ATS timeout of 1 _minute_). The page previously associated with the freed portion of GPA must not be reallocated for another purpose until the appropriate invalidations have been performed. Otherwise, the original page owner can still access the freed page through DMA.
> > Why RFC
> > ===
> > Patch 0001--0005, 0013 are IOMMU related.
> > Patch 0006 is about a new flag (vCPU / MMU related).
> > Patch 0007 is vCPU related.
> > Patch 0008--0012 are MMU related.
> >
> > 1. Xen MMU is very complicated. Could Xen MMU experts help me verify whether I have covered all of the cases?
> >
> > 2. For gnttab_transfer, if the Device-TLB flush has still not completed when the transferring page is to be mapped into a remote domain, schedule and wait on a waitqueue until the Device-TLB flush is completed. Is that correct?
> > (I have tested the waitqueue in decrease_reservation() [do_memory_op() hypercall]: I wake up a domain (with only one vCPU) with the debug-key tool, and the domain is still working after waiting 60s on the waitqueue.)

Much more information: if I run a service in this domain and test this waitqueue case, the domain is still working after 60s, but it prints out a Call Trace with $dmesg:

[ 161.978599] BUG: soft lockup - CPU#0 stuck for 57s! [kworker/0:1:272]
[ 161.978621] Modules linked in: crct10dif_pclmul(F) crc32_pclmul(F) joydev(F) ghash_clmulni_intel(F) cryptd(F) xen_kbdfront(F) microcode(F) cirrus ttm drm_kms_helper drm psmouse(F) serio_raw(F) syscopyarea(F) sysfillrect(F) sysimgblt(F) i2c_piix4 ext2(F) mac_hid lp(F) parport(F) myri10ge dca floppy(F)
[ 161.978626] CPU: 0 PID: 272 Comm: kworker/0:1 Tainted: GF 3.11.0-12-generic #19-Ubuntu
[ 161.978628] Hardware name: Xen HVM domU, BIOS 4.6.0-rc 08/03/2015
[ 161.978638] Workqueue: events balloon_process
[ 161.978640] task: 88007a1b4650 ti: 88007a1f2000 task.ti: 88007a1f2000
[ 161.978650] RIP: 0010:[] [] xen_hypercall_memory_op+0x5/0x20
[ 161.978652] RSP: 0018:88007a1f3d60 EFLAGS: 0246
[ 161.978653] RAX: 000c RBX: RCX: 1568
[ 161.978654] RDX: RSI: 88007a1f3d70 RDI: 0041
[ 161.978656] RBP: 88007a1f3db8 R08: 81d04888 R09: 0006a5dd
[ 161.978657] R10: 3690 R11: 88007f7fa750 R12: 880036671000
[ 161.978658] R13: 810c6176 R14: 88007a1f3d20 R15:
[ 161.978660] FS: () GS:88007be0() knlGS:
[ 161.978661] CS: 0010 DS: ES: CR0: 80050033
[ 161.978662] CR2: 7f8d3e97e000 CR3: 01c0e000 CR4: 001406f0
[ 161.978669] Stack:
[ 161.978673] 8141a16a 365b0048 81fb1520 00ff
[ 161.978676] 7ff0 81c97301 1600
[ 161.978678] 88007be13e00 88007be17e00 88007a1f3e20
[ 161.978679] Call Trace:
[ 161.978684] [] ? decrease_reservation+0x29a/0x2e0
[ 161.978688] [] balloon_process+0x333/0x430
[ 161.978695] [] process_one_work+0x17c/0x430
[ 161.978699] [] worker_thread+0x11c/0x3c0
[ 161.978702] [] ?