Re: [Xen-devel] [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device

2015-10-15 Thread Zhang, Yang Z
Jan Beulich wrote on 2015-10-15:
 On 15.10.15 at 03:03,  wrote:
>> Jan Beulich wrote on 2015-10-14:
>>> As long as the multi-millisecond spins aren't going to go away by
>>> other means, I think conversion to async mode is ultimately unavoidable.
>> 
>> I am not fully agreed. I think the time to spin is important. To me,
>> less than 1 ms is acceptable and if the hardware can guarantee it,
>> then sync mode also is ok.
> 
> Okay, let me put the condition slightly differently - any spin on the
> order of what a WBINVD might take ought to be okay, provided both are

From the data we collected, the invalidation completes within several us. 
IMO, the time for WBINVD varies with the cache size and the different cache 
hierarchies, and it may take more than several us in the worst case.

> equally (in)accessible to guests. The whole discussion is really about
> limiting the impact misbehaving guests can have on the whole system.
> (Obviously any spin time reaching the order of a scheduling time slice
> is a problem.)

The premise for a misbehaving guest impacting the system is that the IOMMU is 
buggy and takes a long time to complete the invalidation. In other words, if 
all invalidations are able to complete within several us, what does the spin 
time matter?

> 
>> I remember the origin motivation to handle ATS problem is due to: 1.
>> ATS spec allow 60s timeout to completed the flush which Xen only
>> allows 1s, and 2. spin loop for 1s is not reasonable since it will
>> hurt the scheduler. For the former, as we discussed before, either
>> disable ATS support or only support some specific ATS
>> devices(complete the flush less than 10ms or 1ms) is acceptable. For
>> the latter, if spin loop for 1s is not acceptable, we can reduce the
>> timeout to 10ms or 1ms
> to eliminate the performance impaction.
> 
> If we really can, why has it been chosen to be 1s in the first place?

What I can tell is that 1s is just the value the original author chose; it has 
no special meaning. I have double-checked with our hardware expert and he 
suggests we use as small a value as possible. According to his comment, 10ms 
is sufficiently large.

> 
>> Yes, I'd agree it would be best solution if Xen has the async mode.
>> But spin loop is used widely in iommu code: not only for
>> invalidations, lots of DMAR operations are using spin to sync
>> hardware's status. For those operations, it is hard to use async mode.
>> Or, even it is possible to use async mode, I don't see the benefit
>> considering the cost and complexity which means we either need a
>> timer or a
> softirq to do the check.
> 
> Even if the cost is high, limited overall throughput by undue spinning
> is worth it imo even outside of misbehaving guest considerations. I'm
> surprised you're not getting similar pressure on this from the KVM
> folks (assuming the use of spinning is similar there).

Because no one has observed such an invalidation timeout issue so far. What we 
have discussed is only theoretical.

BTW, I have reported the issue to the Linux IOMMU maintainer, but he didn't say 
anything about it.

Best regards,
Yang




Re: [Xen-devel] [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device

2015-10-15 Thread Zhang, Yang Z
Jan Beulich wrote on 2015-10-15:
 On 15.10.15 at 10:52,  wrote:
>> Jan Beulich wrote on 2015-10-15:
>> On 15.10.15 at 09:28,  wrote:
 The premise for a misbehaving guest to impact the system is that
 the IOMMU is buggy which takes long time to complete the invalidation.
 In other words, if all invalidations are able to complete within
 several us, what's the matter to do with the spin time?
>>> 
>>> The risk of exploits of such poorly behaving IOMMUs. I.e. if
>>> properly
>> 
>> But this is not a software flaw. A guest has no way to know the
>> underlying IOMMU is wrong and it cannot exploit it.
> 
> A guest doesn't need to know what IOMMU is there in order to try some
> exploit. Plus - based other information it may be able to make an
> educated guess.

As I said before, the premise is that the IOMMU is buggy.

> 
>>> operating IOMMUs only require several us, why spin for several ms?
>> 
>> 10ms is just my suggestion. I don't know whether future hardware
>> will need more time to complete the invalidation. So I think we need
>> to have a large enough timeout here. Meanwhile, doesn't impact the
> scheduling.
> 
> It does, as explained further down in my previous reply.
> 
>> I remember the origin motivation to handle ATS problem is due to: 1.
>> ATS spec allow 60s timeout to completed the flush which Xen only
>> allows 1s, and 2. spin loop for 1s is not reasonable since it
>> will hurt the scheduler. For the former, as we discussed before,
>> either disable ATS support or only support some specific ATS
>> devices(complete the flush less than 10ms or 1ms) is acceptable.
>> For the latter, if spin loop for 1s is not acceptable, we can
>> reduce the timeout to 10ms or 1ms
> to eliminate the performance impaction.
> 
> If we really can, why has it been chosen to be 1s in the first place?
 
 What I can tell is 1s is just the value the original author chooses.
 It has no special means. I have double check with our hardware
 expert and he suggests us to use the value as small as possible.
 According his comment, 10ms is sufficiently large.
>>> 
>>> So here you talk about milliseconds again, while above you talked
>>> about microsecond. Can we at least settle on an order of what is
>>> required? 10ms is
>>> 10 times the minimum time slice credit1 allows, i.e.
>>> awfully long.
>> 
>> We can use an appropriate value which you think reasonable which can
>> cover most of invalidation cases. For left cases, the vcpu can yield
>> the CPU to others until a timer fired. In callback function, hypervisor
>> can check whether the invalidation is completed. If yes, schedule in
>> the vcpu. Otherwise, kill the guest due to unpredictable invalidation
>> timeout.
> 
> Using a timer implies you again think about pausing the vCPU until the
> invalidation completes. Which, as discussed before, has its own
> problems and, even worse, won't cover the domain's other vCPU-s or
> devices still possibly doing work involving the use of the being
> invalidated entries. Or did you have something else in mind?

Why not pause the whole domain? Based on Quan's data, all the invalidations in 
his experiment completed within 3us, so perhaps 10us is enough to cover all 
invalidations on today's IOMMUs (I need to check with our hardware expert to 
get the exact data). The timer mechanism is a backup only for the extreme case, 
which exists only in theory, so the probability of a guest triggering the timer 
mechanism can be ignored. Even if it happens, it only affects the guest itself.
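
For illustration, such a scheme might look roughly like the sketch below. It is 
only a sketch: dtlb_flush_done(), the two hvm_domain fields and the constants 
are invented for the example (and the timer is assumed to have been 
init_timer()'d at domain creation); it is not code from Quan's series.

    static void dtlb_flush_poll(void *data)
    {
        struct domain *d = data;

        if ( dtlb_flush_done(d) )                    /* assumed helper */
        {
            domain_unpause(d);                       /* flush done - run again */
            return;
        }

        if ( NOW() > d->arch.hvm_domain.dtlb_deadline )  /* assumed field */
        {
            printk(XENLOG_ERR "d%d: device-TLB flush timed out\n",
                   d->domain_id);
            domain_crash(d);                         /* contain the guest */
            return;
        }

        /* Not done yet, still within budget - poll again shortly. */
        set_timer(&d->arch.hvm_domain.dtlb_timer, NOW() + MICROSECS(100));
    }

    static void dtlb_flush_wait(struct domain *d)
    {
        s_time_t start = NOW();

        /* Fast path: nearly all invalidations complete within a few us. */
        while ( NOW() - start < MICROSECS(10) )
        {
            if ( dtlb_flush_done(d) )
                return;
            cpu_relax();
        }

        /* Slow path: pause the whole domain and finish the wait from a timer. */
        domain_pause_nosync(d);
        d->arch.hvm_domain.dtlb_deadline = start + MILLISECS(10);
        set_timer(&d->arch.hvm_domain.dtlb_timer, NOW() + MICROSECS(100));
    }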

> 
> IOW - as soon as spinning time reaches the order of the scheduler time
> slice, I think the only sane model is async operation with proper refcounting.
> 
> Jan


Best regards,
Yang





Re: [Xen-devel] [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device

2015-10-15 Thread Jan Beulich
>>> On 15.10.15 at 03:03,  wrote:
> Jan Beulich wrote on 2015-10-14:
>> As long as the multi-millisecond spins aren't going to go away by
>> other means, I think conversion to async mode is ultimately unavoidable.
> 
> I am not fully agreed. I think the time to spin is important. To me, less 
> than 1 ms is acceptable and if the hardware can guarantee it, then sync mode 
> also is ok. 

Okay, let me put the condition slightly differently - any spin on the
order of what a WBINVD might take ought to be okay, provided
both are equally (in)accessible to guests. The whole discussion is
really about limiting the impact misbehaving guests can have on
the whole system. (Obviously any spin time reaching the order of
a scheduling time slice is a problem.)

> I remember the origin motivation to handle ATS problem is due to: 1. ATS 
> spec allow 60s timeout to completed the flush which Xen only allows 1s, and 
> 2. spin loop for 1s is not reasonable since it will hurt the scheduler. For 
> the former, as we discussed before, either disable ATS support or only 
> support some specific ATS devices(complete the flush less than 10ms or 1ms) 
> is acceptable. For the latter, if spin loop for 1s is not acceptable, we can 
> reduce the timeout to 10ms or 1ms to eliminate the performance impaction. 

If we really can, why has it been chosen to be 1s in the first place?

> Yes, I'd agree it would be best solution if Xen has the async mode. But spin 
> loop is used widely in iommu code: not only for invalidations, lots of DMAR 
> operations are using spin to sync hardware's status. For those operations, it 
> is hard to use async mode. Or, even it is possible to use async mode, I don't 
> see the benefit considering the cost and complexity which means we either 
> need a timer or a softirq to do the check.

Even if the cost is high, limited overall throughput by undue spinning is
worth it imo even outside of misbehaving guest considerations. I'm
surprised you're not getting similar pressure on this from the KVM folks
(assuming the use of spinning is similar there).

Jan




Re: [Xen-devel] [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device

2015-10-15 Thread Jan Beulich
>>> On 15.10.15 at 09:28,  wrote:
> Jan Beulich wrote on 2015-10-15:
> On 15.10.15 at 03:03,  wrote:
>>> Jan Beulich wrote on 2015-10-14:
 As long as the multi-millisecond spins aren't going to go away by
 other means, I think conversion to async mode is ultimately unavoidable.
>>> 
>>> I am not fully agreed. I think the time to spin is important. To me,
>>> less than 1 ms is acceptable and if the hardware can guarantee it,
>>> then sync mode also is ok.
>> 
>> Okay, let me put the condition slightly differently - any spin on the
>> order of what a WBINVD might take ought to be okay, provided both are
> 
> From the data we collected, the invalidation is completed within several us. 
> IMO, the time for WBINVD is varying due the size and different cache 
> hierarchies. And it may take more than several us in worst case.

Understood - hence the setting of the worst case latency of WBINVD
as an upper bound for other (kind of similar) software operation.

>> equally (in)accessible to guests. The whole discussion is really about
>> limiting the impact misbehaving guests can have on the whole system.
>> (Obviously any spin time reaching the order of a scheduling time slice
>> is a problem.)
> 
> The premise for a misbehaving guest to impact the system is that the IOMMU 
> is buggy which takes long time to complete the invalidation. In other words, 
> if all invalidations are able to complete within several us, what's the 
> matter to do with the spin time? 

The risk of exploits of such poorly behaving IOMMUs. I.e. if properly
operating IOMMUs only require several us, why spin for several ms?

>>> I remember the origin motivation to handle ATS problem is due to: 1.
>>> ATS spec allow 60s timeout to completed the flush which Xen only
>>> allows 1s, and 2. spin loop for 1s is not reasonable since it will
>>> hurt the scheduler. For the former, as we discussed before, either
>>> disable ATS support or only support some specific ATS
>>> devices(complete the flush less than 10ms or 1ms) is acceptable. For
>>> the latter, if spin loop for 1s is not acceptable, we can reduce the
>>> timeout to 10ms or 1ms
>> to eliminate the performance impaction.
>> 
>> If we really can, why has it been chosen to be 1s in the first place?
> 
> What I can tell is 1s is just the value the original author chooses. It has 
> no special means. I have double check with our hardware expert and he 
> suggests us to use the value as small as possible. According his comment, 
> 10ms is sufficiently large.  

So here you talk about milliseconds again, while above you talked
about microseconds. Can we at least settle on an order of what is
required? 10ms is 10 times the minimum time slice credit1 allows, i.e.
awfully long.

>>> Yes, I'd agree it would be best solution if Xen has the async mode.
>>> But spin loop is used widely in iommu code: not only for
>>> invalidations, lots of DMAR operations are using spin to sync
>>> hardware's status. For those operations, it is hard to use async mode.
>>> Or, even it is possible to use async mode, I don't see the benefit
>>> considering the cost and complexity which means we either need a
>>> timer or a
>> softirq to do the check.
>> 
>> Even if the cost is high, limited overall throughput by undue spinning
>> is worth it imo even outside of misbehaving guest considerations. I'm
>> surprised you're not getting similar pressure on this from the KVM
>> folks (assuming the use of spinning is similar there).
> 
> Because no one observe such invalidation timeout issue so far. What we have 
> discussed are only in theory. 

As is the case with many security related things. We shouldn't wait
until someone exploits them.

> btw, I have told the issue to Linux IOMMU maintainer but he didn't say 
> anything on it.

Interesting - that may speak for itself (depending on how long this
has been pending), but otoh is in line with experience I have with
many (but not all) Linux maintainers.

Jan



Re: [Xen-devel] [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device

2015-10-15 Thread Jan Beulich
>>> On 15.10.15 at 10:52,  wrote:
> Jan Beulich wrote on 2015-10-15:
> On 15.10.15 at 09:28,  wrote:
>>> The premise for a misbehaving guest to impact the system is that the
>>> IOMMU is buggy which takes long time to complete the invalidation.
>>> In other words, if all invalidations are able to complete within
>>> several us, what's the matter to do with the spin time?
>> 
>> The risk of exploits of such poorly behaving IOMMUs. I.e. if properly
> 
> But this is not a software flaw. A guest has no way to know the underlying 
> IOMMU is wrong and it cannot exploit it.

A guest doesn't need to know what IOMMU is there in order to try
some exploit. Plus - based on other information it may be able to make
an educated guess.

>> operating IOMMUs only require several us, why spin for several ms?
> 
> 10ms is just my suggestion. I don't know whether future hardware will need 
> more time to complete the invalidation. So I think we need to have a large 
> enough timeout here. Meanwhile, doesn't impact the scheduling.

It does, as explained further down in my previous reply.

> I remember the origin motivation to handle ATS problem is due to: 1.
> ATS spec allow 60s timeout to completed the flush which Xen only
> allows 1s, and 2. spin loop for 1s is not reasonable since it will
> hurt the scheduler. For the former, as we discussed before, either
> disable ATS support or only support some specific ATS
> devices(complete the flush less than 10ms or 1ms) is acceptable.
> For the latter, if spin loop for 1s is not acceptable, we can
> reduce the timeout to 10ms or 1ms
 to eliminate the performance impaction.
 
 If we really can, why has it been chosen to be 1s in the first place?
>>> 
>>> What I can tell is 1s is just the value the original author chooses.
>>> It has no special means. I have double check with our hardware
>>> expert and he suggests us to use the value as small as possible.
>>> According his comment, 10ms is sufficiently large.
>> 
>> So here you talk about milliseconds again, while above you talked
>> about microsecond. Can we at least settle on an order of what is
>> required? 10ms is
>> 10 times the minimum time slice credit1 allows, i.e.
>> awfully long.
> 
> We can use an appropriate value which you think reasonable which can cover 
> most of invalidation cases. For left cases, the vcpu can yield the CPU to 
> others until a timer fired. In callback function, hypervisor can check 
> whether the invalidation is completed. If yes, schedule in the vcpu. 
> Otherwise, kill the guest due to unpredictable invalidation timeout.

Using a timer implies you again think about pausing the vCPU until
the invalidation completes. Which, as discussed before, has its own
problems and, even worse, won't cover the domain's other vCPU-s
or devices still possibly doing work involving the use of the being
invalidated entries. Or did you have something else in mind?

IOW - as soon as spinning time reaches the order of the scheduler
time slice, I think the only sane model is async operation with
proper refcounting.

Jan




Re: [Xen-devel] [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device

2015-10-15 Thread Zhang, Yang Z
Jan Beulich wrote on 2015-10-15:
 On 15.10.15 at 09:28,  wrote:
>> Jan Beulich wrote on 2015-10-15:
>> On 15.10.15 at 03:03,  wrote:
 Jan Beulich wrote on 2015-10-14:
> As long as the multi-millisecond spins aren't going to go away by
> other means, I think conversion to async mode is ultimately unavoidable.
 
 I am not fully agreed. I think the time to spin is important. To
 me, less than 1 ms is acceptable and if the hardware can guarantee
 it, then sync mode also is ok.
>>> 
>>> Okay, let me put the condition slightly differently - any spin on
>>> the order of what a WBINVD might take ought to be okay, provided
>>> both are
>> 
>> From the data we collected, the invalidation is completed within several us.
>> IMO, the time for WBINVD is varying due the size and different cache
>> hierarchies. And it may take more than several us in worst case.
> 
> Understood - hence the setting of the worst case latency of WBINVD as
> an upper bound for other (kind of similar) software operation.
> 
>>> equally (in)accessible to guests. The whole discussion is really about
>>> limiting the impact misbehaving guests can have on the whole system.
>>> (Obviously any spin time reaching the order of a scheduling time slice
>>> is a problem.)
>> 
>> The premise for a misbehaving guest to impact the system is that the
>> IOMMU is buggy which takes long time to complete the invalidation.
>> In other words, if all invalidations are able to complete within
>> several us, what's the matter to do with the spin time?
> 
> The risk of exploits of such poorly behaving IOMMUs. I.e. if properly

But this is not a software flaw. A guest has no way to know whether the 
underlying IOMMU is misbehaving, and it cannot exploit it.

> operating IOMMUs only require several us, why spin for several ms?

10ms is just my suggestion. I don't know whether future hardware will need more 
time to complete the invalidation, so I think we need a timeout here that is 
large enough yet, meanwhile, doesn't impact the scheduling.

> 
 I remember the origin motivation to handle ATS problem is due to: 1.
 ATS spec allow 60s timeout to completed the flush which Xen only
 allows 1s, and 2. spin loop for 1s is not reasonable since it will
 hurt the scheduler. For the former, as we discussed before, either
 disable ATS support or only support some specific ATS
 devices(complete the flush less than 10ms or 1ms) is acceptable.
 For the latter, if spin loop for 1s is not acceptable, we can
 reduce the timeout to 10ms or 1ms
>>> to eliminate the performance impaction.
>>> 
>>> If we really can, why has it been chosen to be 1s in the first place?
>> 
>> What I can tell is 1s is just the value the original author chooses.
>> It has no special means. I have double check with our hardware
>> expert and he suggests us to use the value as small as possible.
>> According his comment, 10ms is sufficiently large.
> 
> So here you talk about milliseconds again, while above you talked
> about microsecond. Can we at least settle on an order of what is
> required? 10ms is
> 10 times the minimum time slice credit1 allows, i.e.
> awfully long.

We can use whatever value you think reasonable that covers most invalidation 
cases. For the remaining cases, the vCPU can yield the CPU to others until a 
timer fires. In the callback function, the hypervisor can check whether the 
invalidation has completed: if yes, schedule the vCPU back in; otherwise, kill 
the guest due to the unpredictable invalidation timeout.

Best regards,
Yang





Re: [Xen-devel] [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device

2015-10-14 Thread Xu, Quan
>> >>> On 13.10.2015 at 22:50  wrote:
> >>> On 13.10.15 at 16:29,  wrote:
> >> > >>>On 29.09.2015 at 15:22  wrote:
> >> >>> On 29.09.15 at 04:53,  wrote:
> >>  Monday, September 28, 2015 2:47 PM, wrote:
> >> >> >>> On 28.09.15 at 05:08,  wrote:
> >> >>  Thursday, September 24, 2015 12:27 AM, Tim Deegan wrote:

> > 1. if (page->count_info & PGC_count_mask == 0) and (page->count_info
> > != 0) In this case, can the page be freed to xen domain heap?
> 
> Whether a page can get freed depends on changes to count_info, not just its
> current state. For instance, PGC_allocated set implies
> page->count_info & PGC_count_mask != 0, i.e. your question above
> cannot be answered properly. Just look at put_page() - it frees the page when
> the count _drops_ to zero.
> 
> > 2. if  (page->count_info & PGC_count_mask == 0) and
> (page->u.inuse.type_info != 0) :
> > In this case, can the page be freed to xen domain heap?
> 
> Generally type_info should be zero when the ref count is zero; there are, I 
> think,
> exceptional cases (like during domain death) where this might get violated 
> with
> no harm. But again - look at put_page() and you'll see that type_info doesn't
> matter for whether a page gets freed; all it matter is whether a page's type 
> can
> change: Only when type count is zero.
> 

Jan, thanks for the kind explanation.

-Quan



Re: [Xen-devel] [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device

2015-10-14 Thread Xu, Quan


>> >>>On 13.10.2015 at 17:35,  wrote:
> At 11:09 + on 11 Oct (1444561760), Xu, Quan wrote:


> What in particular is worrying you about scheme A?  AFAICS you need to build
> the same refcount-taking mechanism for either scheme.
> 
> Is it the interactions with other p2m-based features in VMs that don't have
> devices passed through?  In that case perhaps you could just mandate that ATS
> support means no shared HAP/IOMMU tables, and do the refcounting only in the
> IOMMU ones?
> 

I was worried that, for __scheme_A__, I would have to keep a log of all 
relevant pending derefs to be processed when the flush completes.
Now I know we only need the log/deref whenever we need to flush the IOMMU. :):)


> >  I think __scheme_A__ is complex to keep a log of all relevant pending
> > derefs, and to be processed when the flush completes;
> >
> > optimized __scheme_A__:
> > It could keep a log of the reference only when the IOMMU entry is _
> removed/overwritten_.(if the IOMMU entry is not _ removed/overwritten_, it is
> safe.).
> 
> Yes, though I'd add any change from read-write to read-only.
> Basically, you only need the log/deref whenever you need to flush the
> IOMMU. :)
> 

A summary of __scheme_A__:

 Q1: when to take the references?
 Take the reference when the IOMMU entry is _created_;
 in detail:
  --iommu_map_page(), or
  --ept_set_entry() [once the IOMMU shares the EPT page table].

 Q2: how do you know when to drop them?
 - Log (or something) when the IOMMU entry is removed/overwritten; and
 - drop the entry when the flush completes.
 in detail:
  --iommu_unmap_page(), or
  --ept_set_entry() [once the IOMMU shares the EPT page table].
 (A rough sketch of this logging follows at the end of this mail.)

   **The challenge: how to log when an IOMMU entry is removed/overwritten?

 Q3: what to do about mappings of other domains' memory (i.e. grant and
 foreign mappings).

 i.e. grant:
 - Take the reference when the IOMMU entry is _created_; then
 - queue the reference drop; and
 - queue the grant flag update (only grant_unmap and grant_transfer need
   this -- we can hold off on this question).

 __afaics__:
   1. Are Q1/Q2/Q3 enough for this memory security issue?
   2. Are there any other potential memory issues once I finish
      __scheme_A__?
   3. Do you have any idea how to log when an IOMMU entry is
      removed/overwritten?
Tim, thanks for your help!
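
P.S. To double-check my own understanding of Q1/Q2, here is a rough sketch of 
the deferred-deref log. It is illustrative only -- every name below is 
invented, the list would need to be per-domain, and real code must handle 
allocation failure instead of BUG_ON().

    /* Entries whose IOMMU mapping was removed/overwritten, each still
     * holding the page reference taken at map time (Q1); drained only
     * once the flush has completed. */
    struct pending_deref {
        struct list_head list;
        struct page_info *page;
    };

    static LIST_HEAD(pending_derefs);
    static DEFINE_SPINLOCK(pending_lock);

    /* Q2, first half: called where the IOMMU entry is removed/overwritten. */
    static void defer_put_page(struct page_info *page)
    {
        struct pending_deref *p = xmalloc(struct pending_deref);

        BUG_ON(!p);             /* sketch only; real code returns -ENOMEM */
        p->page = page;
        spin_lock(&pending_lock);
        list_add_tail(&p->list, &pending_derefs);
        spin_unlock(&pending_lock);
    }

    /* Q2, second half: called once the (device-)TLB flush has completed. */
    static void flush_completed(void)
    {
        struct pending_deref *p, *tmp;

        spin_lock(&pending_lock);
        list_for_each_entry_safe ( p, tmp, &pending_derefs, list )
        {
            list_del(&p->list);
            put_page(p->page);  /* finally drop the reference taken in Q1 */
            xfree(p);
        }
        spin_unlock(&pending_lock);
    }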


Quan



Re: [Xen-devel] [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device

2015-10-14 Thread Jan Beulich
>>> On 14.10.15 at 07:12,  wrote:
> Jan Beulich wrote on 2015-10-13:
> On 13.10.15 at 07:27,  wrote:
>>> Jan Beulich wrote on 2015-10-12:
>>> On 12.10.15 at 03:42,  wrote:
> So, my suggestion is that we can rely on user to not assign the
> ATS device if hypervisor says it cannot support such device. For
> example, if hypervisor find the invalidation isn't completed in 1
> second, then hypervisor can crash itself and tell the user this
> ATS device needs more than 1 second invalidation time which is not
> support by
>>> Xen.
 
 Crashing the hypervisor in such a case is a security issue, i.e. is
 not
>>> 
>>> Indeed. Crashing the guest is more reasonable.
>>> 
 an acceptable thing (and the fact that we panic() on timeout expiry
 right now isn't really acceptable either). If crashing the
 offending guest was sufficient to contain the issue, that might be an 
 option.
 Else
>>> 
>>> I think it should be sufficient (any concern from you?).
>> 
>> Having looked at the code, it wasn't immediately clear whether that
>> would work. After all there one would think there would be a reason
>> for the code panic()ing right now instead.
> 
> What the panic()ing refer to here?

E.g. what we have in queue_invalidate_wait():

    while ( poll_slot != QINVAL_STAT_DONE )
    {
        if ( NOW() > (start_time + DMAR_OPERATION_TIMEOUT) )
        {
            print_qi_regs(iommu);
            panic("queue invalidate wait descriptor was not executed");
        }
        cpu_relax();
    }

>>> But if solution 1 is acceptable, I prefer it since most of ATS
>>> devices are still able to play with Xen.
>> 
>> With a multi-millisecond spin, solution 1 would imo be acceptable only
>> as a transitional measure.
> 
> What does the transitional measure mean? Do you mean we still need the async 
> flush for ATS issue or we can adapt solution 1 before ATS spec changed?

As long as the multi-millisecond spins aren't going to go away by
other means, I think conversion to async mode is ultimately
unavoidable.

Jan




Re: [Xen-devel] [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device

2015-10-14 Thread Zhang, Yang Z
Jan Beulich wrote on 2015-10-14:
 On 14.10.15 at 07:12,  wrote:
>> Jan Beulich wrote on 2015-10-13:
>> On 13.10.15 at 07:27,  wrote:
 Jan Beulich wrote on 2015-10-12:
 On 12.10.15 at 03:42,  wrote:
>> So, my suggestion is that we can rely on user to not assign the
>> ATS device if hypervisor says it cannot support such device. For
>> example, if hypervisor find the invalidation isn't completed in
>> 1 second, then hypervisor can crash itself and tell the user
>> this ATS device needs more than 1 second invalidation time which
>> is not support by
 Xen.
> 
> Crashing the hypervisor in such a case is a security issue, i.e.
> is not
 
 Indeed. Crashing the guest is more reasonable.
 
> an acceptable thing (and the fact that we panic() on timeout expiry
> right now isn't really acceptable either). If crashing the offending
> guest was sufficient to contain the issue, that might be an option.
> Else
 
 I think it should be sufficient (any concern from you?).
>>> 
>>> Having looked at the code, it wasn't immediately clear whether that
>>> would work. After all there one would think there would be a reason
>>> for the code panic()ing right now instead.
>> 
>> What the panic()ing refer to here?
> 
> E.g. what we have in queue_invalidate_wait():
> 
> while ( poll_slot != QINVAL_STAT_DONE )
> {
>     if ( NOW() > (start_time + DMAR_OPERATION_TIMEOUT) )
>     {
>         print_qi_regs(iommu);
>         panic("queue invalidate wait descriptor was not executed");
>     }
>     cpu_relax();
> }
 But if solution 1 is acceptable, I prefer it since most of ATS
 devices are still able to play with Xen.
>>> 
>>> With a multi-millisecond spin, solution 1 would imo be acceptable
>>> only as a transitional measure.
>> 
>> What does the transitional measure mean? Do you mean we still need
>> the async flush for ATS issue or we can adapt solution 1 before ATS spec 
>> changed?
> 
> As long as the multi-millisecond spins aren't going to go away by
> other means, I think conversion to async mode is ultimately unavoidable.

I don't fully agree. I think the time spent spinning is what matters. To me, 
less than 1 ms is acceptable, and if the hardware can guarantee it, then sync 
mode is also OK. 
I recall the original motivation for handling the ATS problem was: 1. the ATS 
spec allows a 60s timeout to complete the flush while Xen only allows 1s, and 
2. spinning for 1s is not reasonable since it will hurt the scheduler. For the 
former, as we discussed before, either disabling ATS support or only supporting 
some specific ATS devices (which complete the flush in less than 10ms or 1ms) 
is acceptable. For the latter, if spinning for 1s is not acceptable, we can 
reduce the timeout to 10ms or 1ms to eliminate the performance impact. 
Yes, I'd agree the best solution would be for Xen to have an async mode. But 
spinning is used widely in the IOMMU code: not only for invalidations, lots of 
DMAR operations use spinning to sync the hardware's status. For those 
operations, it is hard to use an async mode. And even where an async mode is 
possible, I don't see the benefit considering the cost and complexity, which 
means we either need a timer or a softirq to do the check.

Best regards,
Yang





Re: [Xen-devel] [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device

2015-10-13 Thread Zhang, Yang Z
Jan Beulich wrote on 2015-10-13:
 On 13.10.15 at 07:27,  wrote:
>> Jan Beulich wrote on 2015-10-12:
>> On 12.10.15 at 03:42,  wrote:
 So, my suggestion is that we can rely on user to not assign the
 ATS device if hypervisor says it cannot support such device. For
 example, if hypervisor find the invalidation isn't completed in 1
 second, then hypervisor can crash itself and tell the user this
 ATS device needs more than 1 second invalidation time which is not
 support by
>> Xen.
>>> 
>>> Crashing the hypervisor in such a case is a security issue, i.e. is
>>> not
>> 
>> Indeed. Crashing the guest is more reasonable.
>> 
>>> an acceptable thing (and the fact that we panic() on timeout expiry
>>> right now isn't really acceptable either). If crashing the
>>> offending guest was sufficient to contain the issue, that might be an 
>>> option.
>>> Else
>> 
>> I think it should be sufficient (any concern from you?).
> 
> Having looked at the code, it wasn't immediately clear whether that
> would work. After all there one would think there would be a reason
> for the code panic()ing right now instead.

What does the panic()ing refer to here?

> 
>> Hypervisor can
>> crash the guest with hint that the device may need long time to
>> complete the invalidation or device maybe bad. And user should add
>> the device to a blacklist to disallow assignment again.
>> 
>>> ripping out ATS support (and limiting spin time below what there is
>>> currently) may be the only alternative to fixing it.
>> 
>> Yes, it is another solution considering ATS device is rare currently.
>> For spin time, 10ms should be enough in both two solutions.
> 
> But 10ms is awfully much already. Which is why I've been advocating
> async flushing independent of ATS.

Agreed. Technically speaking, async flush is the best solution. But considering 
the complexity versus the benefit it brings, a compromise solution may be better.

> 
>> But if solution 1 is acceptable, I prefer it since most of ATS
>> devices are still able to play with Xen.
> 
> With a multi-millisecond spin, solution 1 would imo be acceptable only
> as a transitional measure.

What does the transitional measure mean? Do you mean we still need the async 
flush for the ATS issue, or that we can adopt solution 1 until the ATS spec is changed?

> 
> Jan


Best regards,
Yang





Re: [Xen-devel] [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device

2015-10-13 Thread Jan Beulich
>>> On 13.10.15 at 07:27,  wrote:
> Jan Beulich wrote on 2015-10-12:
> On 12.10.15 at 03:42,  wrote:
>>> So, my suggestion is that we can rely on user to not assign the ATS
>>> device if hypervisor says it cannot support such device. For
>>> example, if hypervisor find the invalidation isn't completed in 1
>>> second, then hypervisor can crash itself and tell the user this ATS
>>> device needs more than 1 second invalidation time which is not support by 
> Xen.
>> 
>> Crashing the hypervisor in such a case is a security issue, i.e. is not
> 
> Indeed. Crashing the guest is more reasonable. 
> 
>> an acceptable thing (and the fact that we panic() on timeout expiry
>> right now isn't really acceptable either). If crashing the offending
>> guest was sufficient to contain the issue, that might be an option. Else
> 
> I think it should be sufficient (any concern from you?).

Having looked at the code, it wasn't immediately clear whether
that would work. After all, one would think there is a reason for
the code panic()ing right now instead.

> Hypervisor can 
> crash the guest with hint that the device may need long time to complete the 
> invalidation or device maybe bad. And user should add the device to a 
> blacklist to disallow assignment again. 
> 
>> ripping out ATS support (and limiting spin time below what there is
>> currently) may be the only alternative to fixing it.
> 
> Yes, it is another solution considering ATS device is rare currently. For 
> spin time, 10ms should be enough in both two solutions. 

But 10ms is awfully much already. Which is why I've been advocating
async flushing independent of ATS.

> But if solution 1 is acceptable, I prefer it since most of ATS devices are 
> still able to play with Xen.

With a multi-millisecond spin, solution 1 would imo be acceptable only
as a transitional measure.

Jan




Re: [Xen-devel] [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device

2015-10-13 Thread Tim Deegan
Hi,

At 11:09 + on 11 Oct (1444561760), Xu, Quan wrote:
> One question: do two lists refer to page_list and arch.relmem_list?

No, I was wondering if a page ever needed to be queued waiting for two
different flushes -- e.g. if there are multiple IOMMUs.

> I know you prefer __scheme_A__(I think Jan prefers __scheme_A__ too.  Jan, 
> correct me, if I am wrong :) )
> which fits better with the usual way refcounts are used. But __scheme_A__ 
> would be difficult for buy-in by my team (Obviously, why spend so many effort 
> for such a small issue? why does __scheme_B__ not accept?) I think, 
> __scheme_A__ is also a tricky solution.
> 

What in particular is worrying you about scheme A?  AFAICS you need to
build the same refcount-taking mechanism for either scheme.

Is it the interactions with other p2m-based features in VMs that don't
have devices passed through?  In that case perhaps you could just
mandate that ATS support means no shared HAP/IOMMU tables, and do the
refcounting only in the IOMMU ones?

>  I think __scheme_A__ is complex to keep a log of all relevant pending 
> derefs, and to be processed when the flush completes;
> 
> optimized __scheme_A__:
> It could keep a log of the reference only when the IOMMU entry is _ 
> removed/overwritten_.(if the IOMMU entry is not _ removed/overwritten_, it is 
> safe.).

Yes, though I'd add any change from read-write to read-only.
Basically, you only need the log/deref whenever you need to flush the
IOMMU. :)

Cheers,

Tim.



Re: [Xen-devel] [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device

2015-10-13 Thread Jan Beulich
>>> On 13.10.15 at 16:29,  wrote:
>> > >>>On 29.09.2015 at 15:22  wrote:
>> >>> On 29.09.15 at 04:53,  wrote:
>>  Monday, September 28, 2015 2:47 PM, wrote:
>> >> >>> On 28.09.15 at 05:08,  wrote:
>> >>  Thursday, September 24, 2015 12:27 AM, Tim Deegan wrote:
>>The extra ref taken will prevent the page from getting freed. 
> 
> Jan, could you share more about it?
> 
> I want to check some cases of Xen memory. i.e.
> 
> 1. if (page->count_info & PGC_count_mask == 0) and (page->count_info != 0)
> In this case, can the page be freed to xen domain heap?

Whether a page can get freed depends on changes to count_info, not
just its current state. For instance, PGC_allocated set implies
page->count_info & PGC_count_mask != 0, i.e. your question above
cannot be answered properly. Just look at put_page() - it frees the
page when the count _drops_ to zero.

> 2. if  (page->count_info & PGC_count_mask == 0) and  (page->u.inuse.type_info 
> != 0) :
> In this case, can the page be freed to xen domain heap?

Generally type_info should be zero when the ref count is zero; there
are, I think, exceptional cases (like during domain death) where this
might get violated with no harm. But again - look at put_page() and
you'll see that type_info doesn't matter for whether a page gets
freed; all it matters for is whether a page's type can change: only
when the type count is zero.
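
(The logic being described here is roughly the following -- a paraphrased
sketch, not the exact source:)

    void put_page_sketch(struct page_info *page)
    {
        unsigned long nx, x, y = page->count_info;

        do {
            ASSERT((y & PGC_count_mask) != 0);
            x  = y;
            nx = x - 1;                     /* drop one general reference */
        } while ( unlikely((y = cmpxchg(&page->count_info, x, nx)) != x) );

        /* The page is freed only when the count _drops_ to zero;
         * type_info is not consulted here at all. */
        if ( unlikely((nx & PGC_count_mask) == 0) )
            free_domheap_page(page);
    }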

Jan




Re: [Xen-devel] [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device

2015-10-13 Thread Xu, Quan
>> >>>On 29.09.2015 at 15:22  wrote:
> >>> On 29.09.15 at 04:53,  wrote:
>  Monday, September 28, 2015 2:47 PM, wrote:
> >> >>> On 28.09.15 at 05:08,  wrote:
> >>  Thursday, September 24, 2015 12:27 AM, Tim Deegan wrote:


>The extra ref taken will prevent the page from getting freed. 

Jan, could you share more about it?

I want to check some cases of Xen memory. i.e.

1. if ((page->count_info & PGC_count_mask) == 0) and (page->count_info != 0):
In this case, can the page be freed to the Xen domain heap?

2. if ((page->count_info & PGC_count_mask) == 0) and (page->u.inuse.type_info != 0):
In this case, can the page be freed to the Xen domain heap?

Thanks!

Quan



Re: [Xen-devel] [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device

2015-10-12 Thread Jan Beulich
>>> On 12.10.15 at 03:42,  wrote:
> According the discussion and suggestion you made in past several weeks, 
> obviously, it is not an easy task. So I am wondering whether it is worth to 
> do it since: 
> 1. ATS device is not popular. I only know one NIC from Myricom has ATS 
> capabilities.
> 2. The issue is only in theory. Linux, Windows, VMware are all using spin 
> now as well as Xen, but none of them observed any problem so far.
> 3. I know there is propose to modify the timeout value(maybe less in 1 ms) 
> in ATS spec to mitigate the problem. But the risk is how long to achieve it.
> 4. The most important concern is it is too complicated to fix it in Xen 
> since it needs to modify the core memory part. And I don't think Quan and i 
> have the enough knowledge to do it perfectly currently. It may take long 
> time, half of year or one year?(We have spent three months so far). Yes, if 
> Tim likes to take it. It will be much fast. :)
> 
> So, my suggestion is that we can rely on user to not assign the ATS device 
> if hypervisor says it cannot support such device. For example, if hypervisor 
> find the invalidation isn't completed in 1 second, then hypervisor can crash 
> itself and tell the user this ATS device needs more than 1 second 
> invalidation time which is not support by Xen.

Crashing the hypervisor in such a case is a security issue, i.e. is not
an acceptable thing (and the fact that we panic() on timeout expiry
right now isn't really acceptable either). If crashing the offending
guest was sufficient to contain the issue, that might be an option.
Else ripping out ATS support (and limiting spin time below what there
is currently) may be the only alternative to fixing it.

Jan




Re: [Xen-devel] [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device

2015-10-12 Thread Jan Beulich
>>> On 11.10.15 at 13:09,  wrote:
> On 11.10.2015 at 2:25,  wrote:
>> At 17:02 + on 07 Oct (1444237344), Xu, Quan wrote:
>> > Q2: how do you know when to drop them?
>> >- log (or something) when the IOMMU entry is removed/overwritten; and
>> >- drop the entry when the flush completes.
>> >
>> >-- We can add a new page_list_entry structure per page_info, and Add the
>> page with the new page_list_entry structure to per domain page list, when
>> > the IOMMU entry is removed/overwritten; and drop the entry when the
>> flush completes.
>> 
>> As Jan says, I think that might be too much overhead -- a new page_list_entry
>> would be a pretty big increase to page_info.
>> (And potentially might not be enough if a page ever needs to be on two 
> lists? Or
>> can we make sure that never happens?)
>> 
>> Storing the list of to-be-flushed MFNs separately sounds better.  The 
>> question 
> is
>> how to make sure we don't run out of memory to store that list.  Maybe have 
> a
>> pool allocated that's big enough for sensible use, and fail p2m updates with
>> -EAGAIN when we run out?
>> 
> Ignore this method. It is not a good idea.
> One question: do two lists refer to page_list and arch.relmem_list?
> It is might a challenge to make sure we don't run out of memory, if it store 
> additional relevant information.
> An aggressive method:
> --Remove this page from domain page_list / arch.relmem_list.  
> ... [like steal_page(). Swizzle the owner -- page_set_owner(page, NULL);  
> Unlink from original owner -- page_list_del(page, >page_list)] Then,( take 
> a 
> reference for _scheme_B__,) add this page to a per domain page list(), and 
> put this page when the flush completes.

I'm not following - are you suggesting to remove the page from the
guest temporarily? That's certainly not a valid approach.

> I am afraid that this method may cause some potential issues. i.e. make xen 
> crash in some corner cases ..

And such would preclude its use too.

>> > __scheme B__
>> > Q1: - when to take the references?
>> >
>> > take the reference when the IOMMU entry is _ removed/overwritten_;
>> > in detail:
>> >  --iommu_unmap_page(), or
>> >  --ept_set_entry() [Once IOMMU shares EPT page table.]
>> 
>> Hmm.  That might be practical, taking a typecount of whatever type the page
>> happens to be at the time.  I would prefer scheme A though, if it can be 
> made
>> to work.  It fits better with the usual way refcounts are used, and it's 
> more likely
>> to be a good long-term fix.  This is the second time we've had to add
>> refcounting for an edge case, and I suspect it won't be the last.
> 
> I know you prefer __scheme_A__(I think Jan prefers __scheme_A__ too.  Jan, 
> correct me, if I am wrong :) )
> which fits better with the usual way refcounts are used. But __scheme_A__ 
> would be difficult for buy-in by my team (Obviously, why spend so many effort 
> for such a small issue? why does __scheme_B__ not accept?) I think, 
> __scheme_A__ is also a tricky solution.
> 
> 
>  The IOMMU entry associated with the freed portion of GPA should not be 
> reallocated for another purpose until the flush completes.
>  I think __scheme_A__ is complex to keep a log of all relevant pending 
> derefs, and to be processed when the flush completes;
> 
> optimized __scheme_A__:
> It could keep a log of the reference only when the IOMMU entry is _ 
> removed/overwritten_.(if the IOMMU entry is not _ removed/overwritten_, it is 
> safe.).

And I'm afraid this is too vague for me to understand - what do you
want to track under what conditions?

Jan




Re: [Xen-devel] [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device

2015-10-12 Thread Zhang, Yang Z
Jan Beulich wrote on 2015-10-12:
 On 12.10.15 at 03:42,  wrote:
>> According the discussion and suggestion you made in past several
>> weeks, obviously, it is not an easy task. So I am wondering whether
>> it is worth to do it since:
>> 1. ATS device is not popular. I only know one NIC from Myricom has
>> ATS capabilities.
>> 2. The issue is only in theory. Linux, Windows, VMware are all using
>> spin now as well as Xen, but none of them observed any problem so far.
>> 3. I know there is propose to modify the timeout value(maybe less in
>> 1
>> ms) in ATS spec to mitigate the problem. But the risk is how long to achieve 
>> it.
>> 4. The most important concern is it is too complicated to fix it in
>> Xen since it needs to modify the core memory part. And I don't think
>> Quan and i have the enough knowledge to do it perfectly currently.
>> It may take long time, half of year or one year?(We have spent three
>> months so far). Yes, if Tim likes to take it. It will be much fast.
>> :)
>> 
>> So, my suggestion is that we can rely on user to not assign the ATS
>> device if hypervisor says it cannot support such device. For
>> example, if hypervisor find the invalidation isn't completed in 1
>> second, then hypervisor can crash itself and tell the user this ATS
>> device needs more than 1 second invalidation time which is not support by 
>> Xen.
> 
> Crashing the hypervisor in such a case is a security issue, i.e. is not

Indeed. Crashing the guest is more reasonable. 

> an acceptable thing (and the fact that we panic() on timeout expiry
> right now isn't really acceptable either). If crashing the offending
> guest was sufficient to contain the issue, that might be an option. Else

I think it should be sufficient (any concerns from you?). The hypervisor can 
crash the guest with a hint that the device may need a long time to complete 
the invalidation or that the device may be bad. And the user should add the 
device to a blacklist to disallow assigning it again.
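
For illustration, the "crash the guest with a hint" path might look roughly 
like the sketch below (illustrative only; how the device and domain are known 
at timeout is an assumption, not code from the series):

    static void dtlb_flush_timeout(struct domain *d, const struct pci_dev *pdev)
    {
        printk(XENLOG_WARNING
               "d%d: ATS device %04x:%02x:%02x.%u did not complete its "
               "device-TLB flush in time; consider blacklisting it\n",
               d->domain_id, pdev->seg, pdev->bus,
               PCI_SLOT(pdev->devfn), PCI_FUNC(pdev->devfn));
        domain_crash(d);   /* crash only the offending guest, not the host */
    }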

> ripping out ATS support (and limiting spin time below what there is
> currently) may be the only alternative to fixing it.

Yes, that is another solution, considering ATS devices are rare currently. For 
the spin time, 10ms should be enough in both solutions. 
But if solution 1 is acceptable, I prefer it, since most ATS devices would 
still be able to work with Xen.

Best regards,
Yang





Re: [Xen-devel] [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device

2015-10-11 Thread Zhang, Yang Z
Xu, Quan wrote on 2015-09-16:
> Introduction
> 
> 
> VT-d code currently has a number of cases where completion of
> certain operations is being waited for by way of spinning. The
> majority of instances do so indirectly through the IOMMU_WAIT_OP()
> macro, allowing for loops of up to 1 second (DMAR_OPERATION_TIMEOUT).
> While in many of the cases this may be acceptable, the invalidation
> case seems particularly problematic.
> 
> Currently the hypervisor polls the status address of the wait descriptor for
> up to 1 second to get the invalidation flush result. When the invalidation
> queue includes a Device-TLB invalidation, using 1 second is a mistake: that
> timeout is sized for response times of the IOMMU engine itself, not for
> Device-TLB invalidation with PCI-e Address Translation Services (ATS) in
> use, for which the ATS specification mandates a timeout of 1 _minute_ for
> the cache flush.
> The ATS case needs to be taken into consideration when doing invalidations.
> Obviously we can't spin for a minute, so invalidation absolutely needs
> to be converted to a non-spinning model.
> 
> Also, I should fix the new memory security issue:
> a page freed from the domain should be held until the Device-TLB
> flush has completed (ATS timeout of 1 _minute_).
> The page previously associated with the freed portion of GPA should
> not be reallocated for another purpose until the appropriate
> invalidations have been performed. Otherwise, the original page owner
> can still access the freed page through DMA.
> 
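
(For context, the IOMMU_WAIT_OP() spinning referred to in the quoted cover
letter follows roughly the pattern below -- a from-memory sketch, not the
literal Xen macro:)

    #define IOMMU_WAIT_OP(iommu, offset, op, cond, sts)                  \
    do {                                                                 \
        s_time_t start = NOW();                                          \
        for ( ; ; )                                                      \
        {                                                                \
            sts = op((iommu)->reg, offset);                              \
            if ( cond )                                                  \
                break;                                                   \
            if ( NOW() > start + DMAR_OPERATION_TIMEOUT )  /* 1 second */\
                panic("IOMMU register operation timed out");             \
            cpu_relax();                                                 \
        }                                                                \
    } while ( 0 )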

Hi Maintainers,

According to the discussion and the suggestions you made over the past several 
weeks, this is obviously not an easy task. So I am wondering whether it is 
worth doing, since: 
1. ATS devices are not popular. I only know of one NIC, from Myricom, that has 
ATS capabilities.
2. The issue exists only in theory. Linux, Windows and VMware are all using 
spinning now, as is Xen, but none of them has observed any problem so far.
3. I know there is a proposal to modify the timeout value (maybe to less than 
1 ms) in the ATS spec to mitigate the problem. But the risk is how long that 
will take.
4. The most important concern is that it is too complicated to fix in Xen, 
since it needs to modify the core memory management part. And I don't think 
Quan and I have enough knowledge to do it perfectly at the moment. It may take 
a long time, half a year or one year? (We have spent three months so far.) Yes, 
if Tim would like to take it, it would be much faster. :)

So, my suggestion is that we can rely on the user not to assign the ATS device 
if the hypervisor says it cannot support such a device. For example, if the 
hypervisor finds the invalidation isn't completed within 1 second, then the 
hypervisor can crash itself and tell the user that this ATS device needs more 
than 1 second of invalidation time, which is not supported by Xen.

Any comments?

Best regards,
Yang





Re: [Xen-devel] [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device

2015-10-10 Thread Tim Deegan
Hi,

At 17:02 + on 07 Oct (1444237344), Xu, Quan wrote:
> __scheme A__
> Q1: - when to take the references?
> take the reference when the IOMMU entry is _created_;
> in detail:
>  --iommu_map_page(), or
>  --ept_set_entry() [Once IOMMU shares EPT page table.]
> 
> That leaves one question:
> -- how to do with hot-plug ATS device pass-through? As the EPT doesn't 
> aware IOMMU will share EPT page table when EPT page table was _created_.

Either you have to take all the references even before the hot-plug,
or you have to do an (interruptible) pass over the EPT table when
needs_iommu becomes set.  Of those I prefer the first one.  EPT
entries already take references to foreign pages, and I think that
extending that to other kinds should be OK.
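
(A minimal sketch of "take the reference when the mapping is created",
assuming the current iommu_map_page() signature; the wrapper name is
invented and error paths are glossed over:)

    static int iommu_map_page_with_ref(struct domain *d, unsigned long gfn,
                                       unsigned long mfn, unsigned int flags)
    {
        struct page_info *page = mfn_to_page(mfn);
        int rc;

        /* Hold a general reference so the page cannot be freed/reused
         * while a stale device-TLB entry may still point at it. */
        if ( !get_page(page, d) )
            return -EINVAL;

        rc = iommu_map_page(d, gfn, mfn, flags);
        if ( rc )
            put_page(page);        /* map failed - nothing to defer */

        return rc;
    }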

> Q2: how do you know when to drop them?
>- log (or something) when the IOMMU entry is removed/overwritten; and
>- drop the entry when the flush completes.
>
>-- We can add a new page_list_entry structure per page_info, and Add the 
> page with the new page_list_entry structure to per domain page list, when
> the IOMMU entry is removed/overwritten; and drop the entry when the flush 
> completes. 

As Jan says, I think that might be too much overhead -- a new
page_list_entry would be a pretty big increase to page_info.
(And potentially might not be enough if a page ever needs to be on two
lists? Or can we make sure that never happens?)

Storing the list of to-be-flushed MFNs separately sounds better.  The
question is how to make sure we don't run out of memory to store that
list.  Maybe have a pool allocated that's big enough for sensible use,
and fail p2m updates with -EAGAIN when we run out?
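
(Sketched just to make the pool idea concrete -- the names and the size are
invented:)

    #define PENDING_FLUSH_SLOTS 1024       /* "big enough for sensible use" */

    struct pending_mfn {
        unsigned long mfn;
        bool_t        in_use;
    };

    static struct pending_mfn pending_pool[PENDING_FLUSH_SLOTS];
    static DEFINE_SPINLOCK(pending_pool_lock);

    /* Record an MFN whose IOMMU mapping was just removed/overwritten. */
    static int queue_pending_mfn(unsigned long mfn)
    {
        unsigned int i;
        int rc = -EAGAIN;                  /* pool exhausted: caller retries */

        spin_lock(&pending_pool_lock);
        for ( i = 0; i < PENDING_FLUSH_SLOTS; i++ )
            if ( !pending_pool[i].in_use )
            {
                pending_pool[i].mfn = mfn;
                pending_pool[i].in_use = 1;
                rc = 0;
                break;
            }
        spin_unlock(&pending_pool_lock);

        return rc;                         /* -EAGAIN fails the p2m update */
    }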

> Q3: what to do about mappings of other domains' memory (i.e. grant and 
> foreign mappings).
>Between two domains, now I have only one idea to fix this tricky issue -- 
> waitqueue.
>I.e. grant.
> For gnttab_transfer /gnttab_unmap , wait on a waitqueue before updating 
> grant flag, until the Device-TLB flush is completed.
> For grant-mapped, it is safe as the modification of gnttab_unmap.

Waitqueues are probably more heavyweight than is needed here.  The
grant unmap operation can complete immediately, and we can queue the
reference drop (and maybe even some of the maptrack updates) to be
finished when the flush completes.  It could maybe use the same
queue/log as the normal p2m updates.

> __scheme B__
> Q1: - when to take the references?
> 
> take the reference when the IOMMU entry is _ removed/overwritten_;
> in detail:
>  --iommu_unmap_page(), or
>  --ept_set_entry() [Once IOMMU shares EPT page table.]

Hmm.  That might be practical, taking a typecount of whatever type the
page happens to be at the time.  I would prefer scheme A though, if it
can be made to work.  It fits better with the usual way refcounts are
used, and it's more likely to be a good long-term fix.  This is the
second time we've had to add refcounting for an edge case, and I
suspect it won't be the last.

Cheers,

Tim.




Re: [Xen-devel] [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device

2015-10-09 Thread Jan Beulich
>>> On 09.10.15 at 09:06,  wrote:
>> > >>>On 08.10.2015 at 16:52  wrote:
>> >>> On 07.10.15 at 19:02,  wrote:
>> > Q3: what to do about mappings of other domains' memory (i.e. grant and
>> > foreign mappings).
>> >Between two domains, now I have only one idea to fix this tricky
>> > issue -- waitqueue.
>> >I.e. grant.
>> > For gnttab_transfer /gnttab_unmap , wait on a waitqueue before
>> > updating grant flag, until the Device-TLB flush is completed.
>> > For grant-mapped, it is safe as the modification of gnttab_unmap.
>> 
>> Hmm, wouldn't grant transfers already be taken care of by the extra 
>> references?
>> See steal_page(). Perhaps the ordering between its invocation and
>> guest_physmap_remove_page() would need to be switched (with the latter
>> getting undone if steal_page() fails). The waiting for the flush to complete 
>> could -
>> afaics - be done by using the grant-ops' inherent batching (and hence easy
>> availability of continuations). But I admit there's some hand waiving here 
>> without
>> closer inspection...
> 
> I think the extra references can NOT fix the security issue between two 
> domains.
> i.e. If domA transfers the ownership of a page to domB, but the domA still 
> take extra references of the page. I think it is not correct.

Again - see steal_page(): Pages with references beyond the single
allocation related one can't change ownership.
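
(Paraphrasing the check in question -- a sketch, not the verbatim
steal_page() code:)

    /* A page may only change owner while its sole general reference is the
     * PGC_allocated one; any extra reference held for a pending device-TLB
     * flush therefore blocks the transfer. */
    static bool_t page_can_change_owner(const struct page_info *page)
    {
        unsigned long x = page->count_info;

        return (x & (PGC_count_mask | PGC_allocated)) == (1 | PGC_allocated);
    }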

>> > __scheme B__
>> > Q1: - when to take the references?
>> >
>> > take the reference when the IOMMU entry is _ removed/overwritten_;
>> > in detail:
>> >  --iommu_unmap_page(), or
>> >  --ept_set_entry() [Once IOMMU shares EPT page table.]
>> >
>> > * Make sure IOMMU page should not be reallocated for
>> >  another purpose until the appropriate invalidations have been
>> >  performed.
>> > * in this case, it does not matter hot-plug ATS device
>> > pass-through or ATS device assigned in domain initialization.
>> >
>> > Q2 / Q3:
>> > The same as above __scheme A__ Q2/Q3.
>> >
>> > One question: is __scheme B__ safe? If it is safe, I prefer __scheme B__..
>> 
>> While at the first glance this looks like a neat idea - 
> 
> 
> I think this is safe and a good solution.
> I hope you can review into the __scheme B__. I need _Acked-by_ you and Tim 
> Deegan.

What do you mean here? I'm not going to ack a patch that hasn't
even got written, and while scheme B looks possible, I might still
overlook something, so I also can't up front ack that model (which
may then lead to you expecting that once implemented it gets
accepted).

Jan




Re: [Xen-devel] [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device

2015-10-09 Thread Xu, Quan
>> >>> On 09.10.2015 at 15:18  wrote:
> >>> On 09.10.15 at 09:06,  wrote:
> >> > >>>On 08.10.2015 at 16:52  wrote:
> >> >>> On 07.10.15 at 19:02,  wrote:
 
> >> > __scheme B__
> >> > Q1: - when to take the references?
> >> >
> >> > take the reference when the IOMMU entry is _
> removed/overwritten_;
> >> > in detail:
> >> >  --iommu_unmap_page(), or
> >> >  --ept_set_entry() [Once IOMMU shares EPT page table.]
> >> >
> >> > * Make sure IOMMU page should not be reallocated for
> >> >  another purpose until the appropriate invalidations have been
> >> >  performed.
> >> > * in this case, it does not matter hot-plug ATS device
> >> > pass-through or ATS device assigned in domain initialization.
> >> >
> >> > Q2 / Q3:
> >> > The same as above __scheme A__ Q2/Q3.
> >> >
> >> > One question: is __scheme B__ safe? If it is safe, I prefer __scheme 
> >> > B__..
> >>
> >> While at the first glance this looks like a neat idea -
> >
> >
> > I think this is safe and a good solution.
> > I hope you can review __scheme B__. I need an _Acked-by_ from you and
> > Tim Deegan.
> 
> What do you mean here?

Just verify it.
If it works, I will continue to write the patch based on it.
If it does not work, I will continue to research ..


> I'm not going to ack a patch that hasn't even got
> written, and while scheme B looks possible, I might still overlook something, 
> so I
> also can't up front ack that model (which may then lead to you expecting that
> once implemented it gets accepted).


I am getting started on writing a patch based on __scheme B__ and will send it 
out ASAP.
Thanks 

Quan


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device

2015-10-08 Thread Jan Beulich
>>> On 07.10.15 at 19:02,  wrote:
> __scheme A__
> Q1: - when to take the references?
> take the reference when the IOMMU entry is _created_;
> in detail:
>  --iommu_map_page(), or
>  --ept_set_entry() [Once IOMMU shares EPT page table.]
> 
> That leaves one question:
>> > -- how to handle hot-plugged ATS device pass-through? When the EPT page 
>> > table was _created_, EPT was not aware that the IOMMU would share it.

How is it not? iommu_hap_pt_share is a global option, and hence
even for a domain which does not (yet) require an IOMMU these
references could be acquired proactively.

> Q2: how do you know when to drop them?
>- log (or something) when the IOMMU entry is removed/overwritten; and
>- drop the entry when the flush completes.
>
>-- We can add a new page_list_entry structure per page_info, and Add the 
> page with the new page_list_entry structure to per domain page list, when
> the IOMMU entry is removed/overwritten; and drop the entry when the 
> flush completes. 

Please be very careful when considering to grow the size of struct
page_info - due to the amount of its instances this would have a
quite measurable effect on the memory overhead of memory
management. Since the number of pages currently having a flush
pending shouldn't be that high at any point in time, I don't think
such increased overhead would be justifiable.

> Q3: what to do about mappings of other domains' memory (i.e. grant and 
> foreign mappings).
>Between two domains, now I have only one idea to fix this tricky issue -- 
> waitqueue.
>I.e. grant.
> For gnttab_transfer /gnttab_unmap , wait on a waitqueue before updating 
> grant flag, until the Device-TLB flush is completed.
> For grant-mapped, it is safe as the modification of gnttab_unmap.

Hmm, wouldn't grant transfers already be taken care of by the
extra references? See steal_page(). Perhaps the ordering
between its invocation and guest_physmap_remove_page() would
need to be switched (with the latter getting undone if steal_page()
fails). The waiting for the flush to complete could - afaics - be done
by using the grant-ops' inherent batching (and hence easy availability
of continuations). But I admit there's some hand waiving here
without closer inspection...

> __scheme B__
> Q1: - when to take the references?
> 
> take the reference when the IOMMU entry is _ removed/overwritten_;
> in detail:
>  --iommu_unmap_page(), or
>  --ept_set_entry() [Once IOMMU shares EPT page table.]
> 
> * Make sure IOMMU page should not be reallocated for
>  another purpose until the appropriate invalidations have been
>  performed. 
> * in this case, it does not matter hot-plug ATS device pass-through or 
> ATS 
> device assigned in domain initialization.
> 
> Q2 / Q3: 
> The same as above __scheme A__ Q2/Q3.
> 
> One question: is __scheme B__ safe? If it is safe, I prefer __scheme B__..

While at the first glance this looks like a neat idea - what do you do
if obtaining the reference fails? Not touching the EPT entry may be
possible (but not nice), but not doing the IOMMU side update when
the EPT one already succeeded would seem very cumbersome (as
you'd need to roll back the EPT side update). Maybe a two-phase
operation (obtain references for IOMMU, update EPT, update IOMMU)
could solve this (and could then actually be independent of whether
page tables are shared).
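A rough sketch of that two-phase idea, with hypothetical helper names (take_iommu_refs(), ept_update() and iommu_update() stand in for the real get_page*/p2m/IOMMU calls); the point is only the ordering and the rollback path:

#include <stdbool.h>
#include <stdint.h>

typedef uint64_t gfn_t;
typedef uint64_t mfn_t;
struct domain;   /* opaque in this sketch */

/* Hypothetical stand-ins for the real reference/EPT/IOMMU primitives. */
bool take_iommu_refs(struct domain *d, mfn_t mfn, bool writable);
void drop_iommu_refs(struct domain *d, mfn_t mfn, bool writable);
int  ept_update(struct domain *d, gfn_t gfn, mfn_t mfn, bool writable);
int  iommu_update(struct domain *d, gfn_t gfn, mfn_t mfn, bool writable);

/*
 * Phase 1: obtain the references the IOMMU mapping will need.
 * Phase 2: update the EPT entry.
 * Phase 3: update the IOMMU entry.
 * A failure in phase 2 only requires dropping the phase-1 references, so
 * the EPT is never left pointing at a page the IOMMU side could not
 * reference -- and the flow is the same whether or not page tables are shared.
 */
static int map_two_phase(struct domain *d, gfn_t gfn, mfn_t mfn, bool writable)
{
    int rc;

    if ( !take_iommu_refs(d, mfn, writable) )
        return -1;

    rc = ept_update(d, gfn, mfn, writable);
    if ( rc )
    {
        drop_iommu_refs(d, mfn, writable);
        return rc;
    }

    return iommu_update(d, gfn, mfn, writable);
}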

Jan


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device

2015-10-07 Thread Xu, Quan

>>> >> On October 01, 2015, at 5:09 PM  wrote:
> At 15:05 + on 30 Sep (1443625549), Xu, Quan wrote:
> > >> >>> On September 29, 2015, at 5:12 PM,  wrote:

> > Could I introduce a new typed reference which can only be dereferenced in
> > the QI interrupt handler (or associated tasklet)? --(stop me, I always want
> > to add some new flag or type ..) And prevent changes of ownership/type on
> > the relevant pages.
> 
> If you get_page_and_type(PGT_writable_page), then that guarantees that the
> page won't change type or be freed until the matching put_page_and_type(),
> which does what you want.  Likewise, for read-only mappings, get_page() will
> guarantee that the page isn't freed until you call put_page().  (We don't care
> about read-only DMA mappings of specially-typed pages so you don't need a
> typecount there).
> 
Thanks for your further clarification.

> That leaves three questions:
>  - when to take the references?
>  - how do you know when to drop them?
>  - what to do about mappings of other domains' memory (i.e. grant
>and foreign mappings).
> 

I think these are the key points. See the answers below.

> IIUC your current scheme is to take the reference as the page is freed and 
> drop it
> after the flush.

Yes,

> That's not enough for pages where permissions change for
> other reasons, so it will at least have to be "take the reference when the 
> IOMMU
> entry is removed/changed".  IMO that's sufficiently invasive that you might as
> well:
>  - take the reference when the IOMMU entry is _created_;
>  - log (or something) when the IOMMU entry is removed/overwritten; and
>  - drop the entry when the flush completes.


__scheme A__
Q1: - when to take the references?
take the reference when the IOMMU entry is _created_;
in detail:
 --iommu_map_page(), or
 --ept_set_entry() [Once IOMMU shares EPT page table.]

That leaves one question:
-- how to handle hot-plugged ATS device pass-through? When the EPT page table 
was _created_, EPT was not aware that the IOMMU would share it.

Q2: how do you know when to drop them?
   - log (or something) when the IOMMU entry is removed/overwritten; and
   - drop the entry when the flush completes.
   
   -- We can add a new page_list_entry structure per page_info, add the page 
(via the new page_list_entry structure) to a per-domain page list when the 
IOMMU entry is removed/overwritten, and drop the entry when the flush 
completes. 

Q3: what to do about mappings of other domains' memory (i.e. grant and foreign 
mappings).
   Between two domains, I currently have only one idea to fix this tricky 
issue -- a waitqueue.
   I.e. for grants:
For gnttab_transfer / gnttab_unmap, wait on a waitqueue before updating the 
grant flag, until the Device-TLB flush is completed.
For grant-mapped pages, it is safe, the same as the gnttab_unmap modification.
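As an illustration of the waitqueue idea in the grant paths, a minimal sketch assuming Xen's wait_event()/wake_up_all() from xen/wait.h; the per-domain state structure and its accessor below are hypothetical, not part of the posted series:

#include <xen/sched.h>
#include <xen/wait.h>

/* Hypothetical per-domain state, for illustration only. */
struct devtlb_flush_state {
    struct waitqueue_head wq;
    int flush_pending;
};

/* Hypothetical accessor for that state. */
struct devtlb_flush_state *devtlb_state(struct domain *d);

/* Grant path: sleep (instead of spinning) until the flush completes. */
static void wait_for_devtlb_flush(struct domain *d)
{
    struct devtlb_flush_state *st = devtlb_state(d);

    wait_event(st->wq, !st->flush_pending);
}

/* QI completion interrupt (or its tasklet): wake the waiters. */
static void devtlb_flush_completed(struct domain *d)
{
    struct devtlb_flush_state *st = devtlb_state(d);

    st->flush_pending = 0;
    wake_up_all(&st->wq);
}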



__scheme B__
Q1: - when to take the references?

take the reference when the IOMMU entry is _ removed/overwritten_;
in detail:
 --iommu_unmap_page(), or
 --ept_set_entry() [Once IOMMU shares EPT page table.]

* Make sure IOMMU page should not be reallocated for
 another purpose until the appropriate invalidations have been
 performed. 
* in this case, it does not matter hot-plug ATS device pass-through or ATS 
device assigned in domain initialization.

Q2 / Q3: 
The same as above __scheme A__ Q2/Q3.

One question: is __scheme B__ safe? If it is safe, I prefer __scheme B__..
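For Q2 (log when the IOMMU entry is removed/overwritten, drop when the flush completes), here is a minimal, self-contained sketch of the per-domain pending list; the structures are purely illustrative, and a real version would need to distinguish pages queued after a flush was already issued (e.g. with a generation counter, as discussed elsewhere in the thread):

#include <stdint.h>
#include <stddef.h>

/* Illustrative stand-in for a page_list_entry hanging off page_info. */
struct pending_page {
    struct pending_page *next;
    uint64_t mfn;
};

struct devtlb_pending {
    struct pending_page *head;   /* pages whose IOMMU entry was removed */
};

/* Called where the IOMMU entry is removed/overwritten: keep the page. */
static void devtlb_defer_page(struct devtlb_pending *p, struct pending_page *pg)
{
    pg->next = p->head;
    p->head = pg;
}

/* Called from the QI completion handler: now the references can go. */
static void devtlb_release_pages(struct devtlb_pending *p,
                                 void (*put_ref)(uint64_t mfn))
{
    while ( p->head )
    {
        struct pending_page *pg = p->head;

        p->head = pg->next;
        put_ref(pg->mfn);   /* drop the ref that kept the page from reuse */
    }
}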



Tim, thanks very much!


Quan



> 
> Like other schemes (I'm thinking about the p2m foreign mapping stuff
> here) the reference counting is best done at the very bottom of the stack, 
> when
> the actual entries are being written and cleared.  IOW I'd be adding more code
> to atomic_write_ept_entry() and friends.
> 
> Cheers,
> 
> Tim.

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device

2015-09-30 Thread Xu, Quan
> >> >> >>> On September 29, 2015 at 3:22 PM,  wrote:
> >>> On 29.09.15 at 04:53,  wrote:
>  Monday, September 28, 2015 2:47 PM, wrote:
> >> >>> On 28.09.15 at 05:08,  wrote:
> >>  Thursday, September 24, 2015 12:27 AM, Tim Deegan wrote:
> >
> > For Tim's suggestion --"to make the IOMMU table take typed refcounts
> > to anything it points to, and only drop those refcounts when the flush
> > completes."
> >
> > From IOMMU point of view, if it can walk through IOMMU table to get
> > these pages and take typed refcounts.
> > These pages are maybe owned by hardware_domain, dummy, HVM guest .etc.
> > could I narrow it down to HVM guest? --- It is not for anything it
> > points to, but just for HVM guest related. this will simplify the design.
> 
> I don't follow. Why would you want to walk page tables? And why would a HVM
> guest have pages other than those owned by itself or granted access to by
> another guest mapped in its IOMMU page tables?

It is tricky. Let's ignore it.

This is an analysis of IOMMU table to take typed refcounts to anything it 
points to.
I know the IOMMU table and EPT table may share the same page 
table('iommu_hap_pt_share = 1').
Then, go through iommu table and take typed refcounts.

> In any event - the ref-counting
> would need to happen as you _create_ the mappings, not at some later point.
> 
a general rule. Agreed.

When creating the mappings, under what conditions should we take typed refcounts?



> >  Just for check, do typed refcounts refer to the following?
> >
> > --- a/xen/include/asm-x86/mm.h
> > +++ b/xen/include/asm-x86/mm.h
> > @@ -183,6 +183,7 @@ struct page_info
> >  #define PGT_seg_desc_page PG_mask(5, 4)  /* using this page in a GDT/LDT? */
> >  #define PGT_writable_page PG_mask(7, 4)  /* has writable mappings? */
> >  #define PGT_shared_page   PG_mask(8, 4)  /* CoW sharable page  */
> > +#define PGT_dev_tlb_page  PG_mask(9, 4)  /* Maybe in Device-TLB mapping?   */
> >  #define PGT_type_mask PG_mask(15, 4) /* Bits 28-31 or 60-63.   */
> >
> > * I define a new typed refcounts PGT_dev_tlb_page.
> 
> Why? I.e. why won't a base ref for r/o pages and a writable type-ref for r/w 
> ones
> suffice, just like we do everywhere else?
> 

I think it is different from r/o or writable.

The page is freed from the domain while the Device-TLB flush is not yet completed.
The page is _not_ mapped r/o or writable, and can only be accessed through DMA.

Maybe it would require modifying a lot of related code. 
r/o or writable refs are acceptable to me. 


> >> Once you do that, I
> >> don't think there'll be a reason to pause the guest for the duration
> >> of the
> > flush.
> >> And really (as pointed out before) pausing the guest would get us
> >> _far_ away from how real hardware behaves.
> >>
> >
> > Once I do that, I think the guest should be still paused, if the
> > Device-TLB flush is not completed.
> >
> > As mentioned in previous email, for example:
> > Call do_memory_op HYPERCALL to free a pageX (gfn1 <---> mfn1). The
> > gfn1 is the freed portion of GPA.
> > assume that there is a mapping(gfn1<---> mfn1) in Device-TLB. If the
> > Device-TLB flush is not completed and return to guest mode, the guest
> > may call do_memory_op HYPERCALL to allocate a new pageY(mfn2) to
> > gfn1..
> > then:
> > the EPT mapping is (gfn1--mfn2), the Device-TLB mapping is (gfn1<--->mfn1) .
> >
> > If the Device-TLB flush is not completed, DMA associated with gfn1 may
> > still write some data with pageX(gfn1 <---> mfn1), but pageX will be
> > Released to xen when the Device-TLB flush is completed. It is maybe
> > not correct for guest to read data from gfn1 after DMA(now the page
> > associated with gfn1 is pageY ).
> >
> > Right?
> 
> No. The extra ref taken will prevent the page from getting freed. And as long 
> as
> the flush is in process, DMA to/from the page is going to produce undefined
> results (affecting only the guest). But note that there may be reasons for an
> external to the guest entity invoking the operation which ultimately led to 
> the
> flush to do this on a paused guest only. But that's not of concern to the
> hypervisor side implementation.
> 

Reasonable.

Jan, thanks!

-Quan


> Jan

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device

2015-09-30 Thread Jan Beulich
>>> On 30.09.15 at 15:55,  wrote:
>> >> >> >>> On September 29, 2015 at 3:22 PM,  wrote:
>> >>> On 29.09.15 at 04:53,  wrote:
>>  Monday, September 28, 2015 2:47 PM, wrote:
>> >> >>> On 28.09.15 at 05:08,  wrote:
>> >>  Thursday, September 24, 2015 12:27 AM, Tim Deegan wrote:
>> >
>> > For Tim's suggestion --"to make the IOMMU table take typed refcounts
>> > to anything it points to, and only drop those refcounts when the flush
>> > completes."
>> >
>> > From IOMMU point of view, if it can walk through IOMMU table to get
>> > these pages and take typed refcounts.
>> > These pages are maybe owned by hardware_domain, dummy, HVM guest .etc.
>> > could I narrow it down to HVM guest? --- It is not for anything it
>> > points to, but just for HVM guest related. this will simplify the design.
>> 
>> I don't follow. Why would you want to walk page tables? And why would a HVM
>> guest have pages other than those owned by itself or granted access to by
>> another guest mapped in its IOMMU page tables?
> 
> It is tricky. Let's ignore it.
> 
> This is an analysis of IOMMU table to take typed refcounts to anything it 
> points to.
> I know the IOMMU table and EPT table may share the same page 
> table('iommu_hap_pt_share = 1').
> Then, go through iommu table and take typed refcounts.
> 
>> In any event - the ref-counting
>> would need to happen as you _create_ the mappings, not at some later point.
>> 
> a general rule. Agreed.
> 
> When creating the mappings, under what conditions should we take typed refcounts?

On any r/o mapping, take a general ref. On any r/w mapping take
a writable ref.
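In Xen terms, that rule could look like the following sketch; get_page()/get_page_and_type() and their put_* counterparts are the existing primitives, while the wrapper names are made up for illustration:

#include <xen/errno.h>
#include <xen/mm.h>
#include <asm/mm.h>

/* Take the reference an IOMMU mapping needs: a general ref for r/o,
 * a writable type-ref (which also takes a general ref) for r/w. */
static int iommu_take_mapping_ref(struct page_info *page,
                                  struct domain *owner, int writable)
{
    if ( writable )
        return get_page_and_type(page, owner, PGT_writable_page) ? 0 : -EINVAL;

    return get_page(page, owner) ? 0 : -EINVAL;
}

/* Drop it once the Device-TLB flush covering the unmap has completed. */
static void iommu_drop_mapping_ref(struct page_info *page, int writable)
{
    if ( writable )
        put_page_and_type(page);   /* drops both the type and the general ref */
    else
        put_page(page);
}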

>> > --- a/xen/include/asm-x86/mm.h
>> > +++ b/xen/include/asm-x86/mm.h
>> > @@ -183,6 +183,7 @@ struct page_info
>> >  #define PGT_seg_desc_page PG_mask(5, 4)  /* using this page in a GDT/LDT? */
>> >  #define PGT_writable_page PG_mask(7, 4)  /* has writable mappings? */
>> >  #define PGT_shared_page   PG_mask(8, 4)  /* CoW sharable page  */
>> > +#define PGT_dev_tlb_page  PG_mask(9, 4)  /* Maybe in Device-TLB mapping?   */
>> >  #define PGT_type_mask PG_mask(15, 4) /* Bits 28-31 or 60-63.   */
>> >
>> > * I define a new typed refcounts PGT_dev_tlb_page.
>> 
>> Why? I.e. why won't a base ref for r/o pages and a writable type-ref for r/w 
> ones
>> suffice, just like we do everywhere else?
>> 
> 
> I think it is different from r/o or writable.
> 
> The page freed from the domain, the Device-TLB flush is not completed.
> The page is _not_ r/o or writable, and can only access though DMA..

DMA or not, the page can either be accessed (read) (when mapped
r/o in the IOMMU) or written (when mapped r/w in the IOMMU). For
the purpose of refcounting it doesn't matter where the reads/writes
originate.

Jan


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device

2015-09-30 Thread Xu, Quan
>> >>> On September 29, 2015, at 5:12 PM,  wrote:
> At 03:08 + on 28 Sep (1443409723), Xu, Quan wrote:
> > >>> Thursday, September 24, 2015 12:27 AM, Tim Deegan wrote:
> > > 7/13: I'm not convinced that making the vcpu spin calling
> > > sched_yield() is a very good plan.  Better to explicitly pause the
> > > domain if you need its vcpus not to run.  But first -- why does
> > > IOMMU flushing mean that vcpus can't be run?
> >
> > Ensure that the required Device-TLB flushes are applied before
> > returning to guest mode via hypercall completion.  the domain can also
> > DMA this freed pages.  For example, Call do_memory_op HYPERCALL to
> > free a pageX (gfn --- mfn) from domain, and assume that there is a
> > mapping(gfn --- mfn) in Device-TLB, once the vcpu has returned to
> > guest mode, then the domain can still DMA this freed pageX.  Domain
> > kernel cannot use this being freed page, otherwise this is a domain
> > kernel bug.
> 
> 
> OK - let's ignore guest kernel bugs.  IIUC you're worried about the guest OS
> telling a device to issue DMA to an address that has changed in the IOMMU
> tables (unmapped, remapped elsewhere, permissions changed, ...) but not yet
> been flushed?


Yes, issue DMA to an address that has changed in the IOMMU table and EPT table, 
but not yet been flushed.


> 
> Unfortunately, pausing the guest's CPUs doesn't stop that.  A malicious guest
> could enqueue network receive buffers pointing to that address, and then
> arrange for a packet to arrive between the IOMMU table change and the flush
> completion.

Cool !!

> So you'll need to do something else to make the unmap safe.
>The usual
> method in Xen is to hold a reference to the page (for read-only
> mappings)


Does a read-only mapping refer to 'PGT_pinned'?
Could I introduce a new typed reference which can only be dereferenced in the QI 
interrupt handler (or associated tasklet)? --(stop me, I always want to add 
some new flag or type ..)
And prevent changes of ownership/type on the relevant pages.


> or a typed reference (for read-write), and not release that reference
> until the flush has completed.  That's OK with in-line synchronous flushes.
> 
> With the flush taking longer than Xen can wait for, you'll need to do 
> something
> more complex, e.g.:
>  - keep a log of all relevant pending derefs, to be processed when the
>flush completes; 



One of the CCed people mentioned this solution in internal discussions, but it 
is tricky and over-engineered.
I would need more than half a year to implement it.


> or
>  - have some other method of preventing changes of ownership/type on
>the relevant pages. 


I prefer this solution.


> E.g. for CPU TLBs, we keep a per-page counter
>(tlbflush-timestamp) that we can use to detect whether enough TLB
>flushes have happened since the page was freed.
> 
> The log is tricky - I'm not sure how to make sure that it has bounded size 
> if a
> flush can take seconds.
> 
> I'm not sure the counter works either -- when that detector triggers we do a
> synchronous TLB-flush IPI to make the operation safe, and that's exactly what 
> we
> can't do here.
> 
> Any other ideas floating around?
> 
> Cheers,
> 

Tim, thanks for your help.
If I have any idea, I will send it out, even if it is not a complete solution. 

Quan

> Tim.

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device

2015-09-29 Thread Jan Beulich
>>> On 29.09.15 at 04:53,  wrote:
 Monday, September 28, 2015 2:47 PM, wrote:
>> >>> On 28.09.15 at 05:08,  wrote:
>>  Thursday, September 24, 2015 12:27 AM, Tim Deegan wrote:
> 
>> It would be a guest kernel bug, but all _we_ care about is that such a guest 
> kernel
>> bug won't affect the hypervisor or other guests.
> 
> It won't affect the hypervisor or other guest domains.
> As the required Device-TLB flushes are not applied, the hypercall is not 
> completed. The page being freed is still owned by this buggy
> guest, not released back to Xen or reallocated for other guests.

Seems like you misunderstood the purpose of my reply: I wasn't
claiming that what your patch set currently does would constitute an
issue. I was simply stating a general rule to consider when thinking
about which solutions are viable and which aren't.

> For Tim's suggestion --"to make the IOMMU table take typed refcounts to
> anything it points to, and only drop those refcounts when the flush 
> completes."
> 
> From IOMMU point of view, if it can walk through IOMMU table to get these 
> pages and take typed refcounts. 
> These pages are maybe owned by hardware_domain, dummy, HVM guest .etc. could 
> I narrow it down to HVM guest? --- It is not for anything it points to, but 
> just 
> for HVM guest related. this will simplify the design.

I don't follow. Why would you want to walk page tables? And why
would a HVM guest have pages other than those owned by itself or
granted access to by another guest mapped in its IOMMU page
tables? In any event - the ref-counting would need to happen as
you _create_ the mappings, not at some later point.

> from HVM guest point of view, once the ATS device is assigned, we can: 
> *pause the HVM guest domain.
> *scan domain's xenpage_list, page_list and arch.relmem_list to get these 
> pages, which will be took typed refcounts ( PGT_dev_tlb_page -- a new type).
> *unpause the HVM guest domain.
> 
> (we can ignore domain's xenpage_list) as:
> ((
>Actually, the previous pages are maybe mapped from Xen heap for guest 
> domains in decrease_reservation() / xenmem_add_to_physmap_one()
>/ p2m_add_foreign(), but they are not mapped to IOMMU table. The below 4 
> functions will map xen heap page for guest domains:
>   * share page for xen Oprofile.
>   * vLAPIC mapping.
>   * grant table shared page.
>   * domain share_info page.
> ))

Neither of which really has a need to be in the IOMMU page tables
afaics.

>  Just for check, do typed refcounts refer to the following?
> 
> --- a/xen/include/asm-x86/mm.h
> +++ b/xen/include/asm-x86/mm.h
> @@ -183,6 +183,7 @@ struct page_info
>  #define PGT_seg_desc_page PG_mask(5, 4)  /* using this page in a GDT/LDT?  */
>  #define PGT_writable_page PG_mask(7, 4)  /* has writable mappings? */
>  #define PGT_shared_page   PG_mask(8, 4)  /* CoW sharable page  */
> +#define PGT_dev_tlb_page  PG_mask(9, 4)  /* Maybe in Device-TLB mapping?   */
>  #define PGT_type_mask PG_mask(15, 4) /* Bits 28-31 or 60-63.   */
> 
> * I define a new typed refcounts PGT_dev_tlb_page.

Why? I.e. why won't a base ref for r/o pages and a writable type-ref
for r/w ones suffice, just like we do everywhere else?

>> Once you do that, I
>> don't think there'll be a reason to pause the guest for the duration of the 
> flush.
>> And really (as pointed out before) pausing the guest would get us _far_ away
>> from how real hardware behaves.
>> 
> 
> Once I do that, I think the guest should be still paused, if the Device-TLB 
> flush is not completed.
> 
> As mentioned in previous email, for example:
> Call do_memory_op HYPERCALL to free a pageX (gfn1 <---> mfn1). The gfn1 is 
> the 
> freed portion of GPA.
> assume that there is a mapping(gfn1<---> mfn1) in Device-TLB. If the 
> Device-TLB 
> flush is not completed and return to guest mode,
> the guest may call do_memory_op HYPERCALL to allocate a new pageY(mfn2) to 
> gfn1..
> then:
> the EPT mapping is (gfn1--mfn2), the Device-TLB mapping is (gfn1<--->mfn1) .
> 
> If the Device-TLB flush is not completed, DMA associated with gfn1 may still 
> write some data with pageX(gfn1 <---> mfn1), but pageX will be 
> Released to xen when the Device-TLB flush is completed. It is maybe not 
> correct for guest to read data from gfn1 after DMA(now the page associated 
> with gfn1 is pageY ).
> 
> Right?

No. The extra ref taken will prevent the page from getting freed. And
as long as the flush is in process, DMA to/from the page is going to
produce undefined results (affecting only the guest). But note that
there may be reasons for an external to the guest entity invoking the
operation which ultimately led to the flush to do this on a paused guest
only. But that's not of concern to the hypervisor side implementation.

Jan

___
Xen-devel mailing list
Xen-devel@lists.xen.org

Re: [Xen-devel] [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device

2015-09-29 Thread Jan Beulich
>>> On 29.09.15 at 11:11,  wrote:
> With the flush taking longer than Xen can wait for, you'll need to
> do something more complex, e.g.:
>  - keep a log of all relevant pending derefs, to be processed when the
>flush completes; or
>  - have some other method of preventing changes of ownership/type on
>the relevant pages.  E.g. for CPU TLBs, we keep a per-page counter
>(tlbflush-timestamp) that we can use to detect whether enough TLB
>flushes have happened since the page was freed.
> 
> The log is tricky - I'm not sure how to make sure that it has bounded
> size if a flush can take seconds.
> 
> I'm not sure the counter works either -- when that detector triggers
> we do a synchronous TLB-flush IPI to make the operation safe, and
> that's exactly what we can't do here.
> 
> Any other ideas floating around?

The variant of the log model might work if sufficient information is
available in the interrupt handler (or associated tasklet) to identify
a much smaller subset of pages to scan through. Since there is a
32-bit quantity written to a pre-determined location upon qi
completion, I wonder whether that couldn't be used for that purpose
- 32 bits disambiguate a page within 16Tb of RAM, so there wouldn't
need to be too many hashed together in a single chain. Otoh of
course we can't have 2^32 standalone list heads.
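As a self-contained sketch of that variant (all structures below are made up for illustration): the 32-bit value programmed as a wait descriptor's Status Data names the page, and the completion handler hashes it to pick the short chain it has to walk:

#include <stdint.h>
#include <stddef.h>

#define NR_FLUSH_BUCKETS 256   /* illustrative; a real table would be sized to RAM */

struct flush_entry {
    struct flush_entry *next;
    uint32_t id;               /* 32-bit page identifier, e.g. the frame number */
};

static struct flush_entry *flush_bucket[NR_FLUSH_BUCKETS];

static unsigned int bucket_of(uint32_t id)
{
    return id % NR_FLUSH_BUCKETS;
}

/* On unmap: queue the page; the returned id is written as Status Data. */
static uint32_t devtlb_defer(struct flush_entry *pg)
{
    unsigned int b = bucket_of(pg->id);

    pg->next = flush_bucket[b];
    flush_bucket[b] = pg;
    return pg->id;
}

/* QI completion: only the chain selected by the reported id is scanned. */
static void devtlb_complete(uint32_t status_data, void (*put_ref)(uint32_t id))
{
    struct flush_entry **pp = &flush_bucket[bucket_of(status_data)];

    while ( *pp )
    {
        struct flush_entry *pg = *pp;

        if ( pg->id == status_data )
        {
            *pp = pg->next;
            put_ref(pg->id);
        }
        else
            pp = &pg->next;
    }
}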

Jan


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device

2015-09-29 Thread Tim Deegan
Hi,

At 03:08 + on 28 Sep (1443409723), Xu, Quan wrote:
> >>> Thursday, September 24, 2015 12:27 AM, Tim Deegan wrote:
> > 7/13: I'm not convinced that making the vcpu spin calling
> > sched_yield() is a very good plan.  Better to explicitly pause the domain 
> > if you
> > need its vcpus not to run.  But first -- why does IOMMU flushing mean that
> > vcpus can't be run?
> 
> Ensure that the required Device-TLB flushes are applied before
> returning to guest mode via hypercall completion.  the domain can
> also DMA this freed pages.  For example, Call do_memory_op HYPERCALL
> to free a pageX (gfn --- mfn) from domain, and assume that there is
> a mapping(gfn --- mfn) in Device-TLB, once the vcpu has returned to
> guest mode, then the domain can still DMA this freed pageX.  Domain
> kernel cannot use this being freed page, otherwise this is a domain
> kernel bug.


OK - let's ignore guest kernel bugs.  IIUC you're worried about the
guest OS telling a device to issue DMA to an address that has changed
in the IOMMU tables (unmapped, remapped elsewhere, permissions
changed, ...) but not yet been flushed?

Unfortunately, pausing the guest's CPUs doesn't stop that.  A
malicious guest could enqueue network receive buffers pointing to
that address, and then arrange for a packet to arrive between the IOMMU
table change and the flush completion.

So you'll need to do something else to make the unmap safe.  The usual
method in Xen is to hold a reference to the page (for read-only
mappings) or a typed reference (for read-write), and not release that
reference until the flush has completed.  That's OK with in-line
synchronous flushes.

With the flush taking longer than Xen can wait for, you'll need to
do something more complex, e.g.:
 - keep a log of all relevant pending derefs, to be processed when the
   flush completes; or
 - have some other method of preventing changes of ownership/type on
   the relevant pages.  E.g. for CPU TLBs, we keep a per-page counter
   (tlbflush-timestamp) that we can use to detect whether enough TLB
   flushes have happened since the page was freed.
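Adapted to Device-TLB flushes, that counter check would look roughly like the sketch below (names are illustrative; as noted next, the open question is what to do when the check fails, since a synchronous flush is not an option here):

#include <stdint.h>
#include <stdbool.h>

struct devtlb_gen {
    uint32_t issued;      /* bumped when an invalidation is queued        */
    uint32_t completed;   /* updated from the QI completion interrupt     */
};

/* Record, per page, the flush generation current when it was unmapped. */
static inline uint32_t devtlb_stamp(const struct devtlb_gen *g)
{
    return g->issued;
}

/* The page may only be reused once a flush at least that recent completed. */
static inline bool devtlb_page_reusable(const struct devtlb_gen *g,
                                        uint32_t page_stamp)
{
    /* Signed difference copes with 32-bit counter wraparound. */
    return (int32_t)(g->completed - page_stamp) >= 0;
}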

The log is tricky - I'm not sure how toq make sure that it has bounded
size if a flush can take seconds.

I'm not sure the counter works either -- when that detector triggers
we do a synchronous TLB-flush IPI to make the operation safe, and
that's exactly what we can't do here.

Any other ideas floating around?

Cheers,

Tim.

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device

2015-09-28 Thread Jan Beulich
>>> On 28.09.15 at 05:08,  wrote:
 Thursday, September 24, 2015 12:27 AM, Tim Deegan wrote:
>> 7/13: I'm not convinced that making the vcpu spin calling
>> sched_yield() is a very good plan.  Better to explicitly pause the domain if 
>> you
>> need its vcpus not to run.  But first -- why does IOMMU flushing mean that
>> vcpus can't be run?
> 
> Ensure that the required Device-TLB flushes are applied before returning to 
> guest mode via hypercall completion.
> the domain can also DMA this freed pages.
> For example, Call do_memory_op HYPERCALL to free a pageX (gfn --- mfn) from 
> domain, and assume that there is
> a mapping(gfn --- mfn) in Device-TLB, once the vcpu has returned to guest 
> mode, 
> then the domain can still DMA this freed pageX.
> Domain kernel cannot use this being freed page, otherwise this is a domain 
> kernel bug.

It would be a guest kernel bug, but all _we_ care about is that such
a guest kernel bug won't affect the hypervisor or other guests. You
need to answer the question (perhaps just for yourself) taking into
account Tim's suggestion to hold references to all pages mapped by
the IOMMU page tables. Once you do that, I don't think there'll be
a reason to pause the guest for the duration of the flush. And really
(as pointed out before) pausing the guest would get us _far_ away
from how real hardware behaves.

The only possibly tricky thing will be how to know in the flush
completion handler which pages to drop references for, as it doesn't
look like you'd be able to put them on a list without allocating extra
memory for tracking (and allocation in turn would be bad as it can
fail).

> I didn't make the IOMMU table to take typed refcount to anything it points 
> to. This is really complex.

But unavoidable I think, and with that I'm not sure it makes a lot of
sense to do further (detailed) review of the initial version of the series.

Jan


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device

2015-09-28 Thread Xu, Quan
>>> Monday, September 28, 2015 2:47 PM, wrote:
> >>> On 28.09.15 at 05:08,  wrote:
>  Thursday, September 24, 2015 12:27 AM, Tim Deegan wrote:

> It would be a guest kernel bug, but all _we_ care about is that such a guest 
> kernel
> bug won't affect the hypervisor or other guests.

It won't affect the hypervisor or other guest domains.
As the required Device-TLB flushes are not applied, the hypercall is not 
completed. The page being freed is still owned by this buggy
guest, not released back to Xen or reallocated for other guests.


> You need to answer the
> question (perhaps just for yourself) taking into account Tim's suggestion to 
> hold
> references to all pages mapped by the IOMMU page tables. 

It is safe and complex.
But if Tim can ack all of my memory analysis, does my solution work for 
upstream?


For Tim's suggestion --"to make the IOMMU table take typed refcounts to
anything it points to, and only drop those refcounts when the flush completes."

From the IOMMU point of view, it can walk through the IOMMU table to get these 
pages and take typed refcounts. 
These pages may be owned by hardware_domain, dummy, an HVM guest, etc. Could I 
narrow it down to HVM guests? --- It is not for anything it points to, but just 
for HVM-guest-related pages. This will simplify the design.

from HVM guest point of view, once the ATS device is assigned, we can: 
*pause the HVM guest domain.
*scan the domain's xenpage_list, page_list and arch.relmem_list to get these pages, 
which will take typed refcounts (PGT_dev_tlb_page -- a new type).
*unpause the HVM guest domain.

(we can ignore domain's xenpage_list) as:
((
   Actually, the previous pages may be mapped from the Xen heap for guest 
domains in decrease_reservation() / xenmem_add_to_physmap_one()
   / p2m_add_foreign(), but they are not mapped into the IOMMU table. The below 4 
functions will map xen heap page for guest domains:
  * share page for xen Oprofile.
  * vLAPIC mapping.
  * grant table shared page.
  * domain share_info page.
))


* Once a new page is assigned, if an ATS device is assigned, we should also take 
a typed refcount (PGT_dev_tlb_page).
* Once a page is freed, if an ATS device is assigned, we should check the typed 
refcount (PGT_dev_tlb_page) in free_domheap_pages().
  If the typed refcount is PGT_dev_tlb_page, the page should be held on a 
per-domain page list and freed in the QI interrupt handler.
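A hedged sketch of that free-path hook: free_domheap_page(), page_list_add_tail() and d->page_alloc_lock are existing Xen primitives, while the pending-flush test, the devtlb_pending_list field and the wrapper name are hypothetical.

#include <xen/mm.h>
#include <xen/sched.h>

/* Hypothetical: non-zero while a Device-TLB flush covering this page is pending. */
int devtlb_flush_pending(const struct page_info *pg);

static void free_domheap_page_deferred(struct domain *d, struct page_info *pg)
{
    if ( devtlb_flush_pending(pg) )
    {
        /* Park the page on a per-domain list; the QI completion handler
         * (or a tasklet it schedules) frees it once the flush is done. */
        spin_lock(&d->page_alloc_lock);
        page_list_add_tail(pg, &d->devtlb_pending_list);  /* hypothetical field */
        spin_unlock(&d->page_alloc_lock);
        return;
    }

    free_domheap_page(pg);   /* normal path */
}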


 Just for check, do typed refcounts refer to the following?

--- a/xen/include/asm-x86/mm.h
+++ b/xen/include/asm-x86/mm.h
@@ -183,6 +183,7 @@ struct page_info
 #define PGT_seg_desc_page PG_mask(5, 4)  /* using this page in a GDT/LDT?  */
 #define PGT_writable_page PG_mask(7, 4)  /* has writable mappings? */
 #define PGT_shared_page   PG_mask(8, 4)  /* CoW sharable page  */
+#define PGT_dev_tlb_page  PG_mask(9, 4)  /* Maybe in Device-TLB mapping?   */
 #define PGT_type_mask PG_mask(15, 4) /* Bits 28-31 or 60-63.   */




* I define a new typed refcounts PGT_dev_tlb_page.



> Once you do that, I
> don't think there'll be a reason to pause the guest for the duration of the 
> flush.
> And really (as pointed out before) pausing the guest would get us _far_ away
> from how real hardware behaves.
> 

Once I do that, I think the guest should still be paused if the Device-TLB 
flush is not completed.

As mentioned in previous email, for example:
Call do_memory_op HYPERCALL to free a pageX (gfn1 <---> mfn1). The gfn1 is the 
freed portion of GPA.
assume that there is a mapping(gfn1<---> mfn1) in Device-TLB. If the Device-TLB 
flush is not completed and return to guest mode,
the guest may call do_memory_op HYPERCALL to allocate a new pageY(mfn2) to 
gfn1..
then:
the EPT mapping is (gfn1--mfn2), the Device-TLB mapping is (gfn1<--->mfn1) .

If the Device-TLB flush is not completed, DMA associated with gfn1 may still 
write some data with pageX(gfn1 <---> mfn1), but pageX will be 
Released to xen when the Device-TLB flush is completed. It is maybe not correct 
for guest to read data from gfn1 after DMA(now the page associated with gfn1 is 
pageY ).

Right?


> The only possibly tricky thing will be how to know in the flush completion 
> handler
> which pages to drop references for, as it doesn't look like you'd be able to 
> put
> them on a list without allocating extra memory for tracking (and allocation 
> in turn
> would be bad as it can fail).
> 

* Once a page is freed, if an ATS device is assigned, we should check the typed 
refcount (PGT_dev_tlb_page) in free_domheap_pages().
  If the typed refcount is PGT_dev_tlb_page, the page should be held on a 
per-domain page list and freed in the QI interrupt handler.


> > I didn't make the IOMMU table to take typed refcount to anything it
> > points to. This is really complex.
> 
> But unavoidable I think, and with that I'm not sure it makes a lot of sense 
> to do further (detailed) review of the initial version of the series.

Re: [Xen-devel] [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device

2015-09-27 Thread Xu, Quan
Tim, thanks for your review.

>>> Thursday, September 24, 2015 12:27 AM, Tim Deegan wrote:
> Hi,
> 
> At 14:09 + on 21 Sep (1442844587), Xu, Quan wrote:
> > George / Tim,
> > Could you help me review these memory patches? Thanks!
> 
> The interrupt-mapping and chipset control parts of this are outside my
> understanding. :)  And I'm not an x86/mm maintainer any more, but I'll have a
> look:
> 

I've heard so much about you from my teammates. :):)


> 7/13: I'm not convinced that making the vcpu spin calling
> sched_yield() is a very good plan.  Better to explicitly pause the domain if 
> you
> need its vcpus not to run.  But first -- why does IOMMU flushing mean that
> vcpus can't be run?
> 

Ensure that the required Device-TLB flushes are applied before returning to 
guest mode via hypercall completion; otherwise the domain can still DMA to 
these freed pages.
For example: call the do_memory_op HYPERCALL to free a pageX (gfn --- mfn) from 
the domain, and assume that there is a mapping (gfn --- mfn) in the Device-TLB.
Once the vCPU has returned to guest mode, the domain can still DMA to this 
freed pageX.
The domain kernel cannot use this page while it is being freed, otherwise this 
is a domain kernel bug.

In fact, any time you want to reschedule, you need to raise SCHEDULE_SOFTIRQ, 
which is then checked and serviced in do_softirq().
We could also pause the domain and unpause it in the QI interrupt handler, or 
block the vCPU and unblock all of the domain's vCPUs in the QI interrupt 
handler. Using sched_yield() was a performance consideration.
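As a sketch of the pause-based alternative: domain_pause_nosync()/domain_unpause() are existing Xen calls, while queue_devtlb_invalidation() and the function names are hypothetical. The idea is simply to keep the domain's vCPUs out of guest mode while the flush is outstanding, instead of spinning.

#include <xen/sched.h>

/* Hypothetical: submit the QI descriptors for this domain's Device-TLB flush. */
void queue_devtlb_invalidation(struct domain *d);

/* Issue the flush with the domain paused rather than spinning in the vCPU. */
static void devtlb_flush_paused(struct domain *d)
{
    domain_pause_nosync(d);          /* vCPUs stop entering guest mode */
    queue_devtlb_invalidation(d);
}

/* Called from the invalidation completion interrupt (or its tasklet). */
static void devtlb_flush_done(struct domain *d)
{
    domain_unpause(d);
}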



> 8/13: This doesn't seem like it's enough make this safe. :(  Yes, you can't 
> allocate
> a page to another VM while there are IOTLB entries pointing to it, but you 
> also
> can't use it for other things inside the same domain!
> 
> It might be enough, if you can argue that e.g. the IOMMU tables only ever have
> mappings of pages owned by the domain, and that any other feature that might
> rely on the domain's memory being made read-only (e.g. sharing) is explicitly
> disallowed, but I'd want to see those things mechanically enforced.
> 



As mentioned above, Device-TLB flushes are applied before returning to guest 
mode via hypercall completion.
If the hypercall is not completed, the freed page will not be released back to 
Xen. The domain kernel cannot use this page while it is being freed, otherwise 
this is a domain kernel bug.
So I think it is enough.   :):)




> I think the safest answer is to make the IOMMU table take typed refcounts to
> anything it points to, and only drop those refcounts when the flush completes,
> but I can imagine that becoming complex.
> 

Yes, I also think so.
I have a similar design, taking a typed refcount and increasing the page refcount 
(as PGT_pinned / PGC_allocated), but I did it only in the page-free caller flow.
I dropped this design, as I would have to introduce another flag to be aware that 
the domain has an ATS device assigned (an ugly design). 
I didn't make the IOMMU table take a typed refcount on everything it points to. 
This is really complex.

> You may also need to consider grant-mapped memory - you need to make sure
> the grant isn't released until after the flush completes.
> 
 I can introduce a new grant type for this case (such as GNTMAP_iommu_dma)..
For ending foreign access, it should _only_ be used when the remote domain has 
unmapped the frame. 
gnttab_query_foreign_access( gref ) will indicate the state of any mapping.

> 12/13: Ah, I see you are looking at grant table stuff, at least for the 
> transfer path. :)
> Still the mapping path needs to be looked at.
> 

Yes, I think you can do it. :):)


> Cheers,
> 
> Tim.


Tim, thanks again. I really appreciate your review.


Quan

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device

2015-09-23 Thread Tim Deegan
Hi,

At 14:09 + on 21 Sep (1442844587), Xu, Quan wrote:
> George / Tim,
> Could you help me review these memory patches? Thanks!

The interrupt-mapping and chipset control parts of this are outside my
understanding. :)  And I'm not an x86/mm maintainer any more, but I'll
have a look:

7/13: I'm not convinced that making the vcpu spin calling
sched_yield() is a very good plan.  Better to explicitly pause the
domain if you need its vcpus not to run.  But first -- why does IOMMU
flushing mean that vcpus can't be run?

8/13: This doesn't seem like it's enough make this safe. :(  Yes, you
can't allocate a page to another VM while there are IOTLB entries
pointing to it, but you also can't use it for other things inside the
same domain!

It might be enough, if you can argue that e.g. the IOMMU tables only
ever have mappings of pages owned by the domain, and that any other
feature that might rely on the domain's memory being made read-only
(e.g. sharing) is explicitly disallowed, but I'd want to see those
things mechanically enforced.

I think the safest answer is to make the IOMMU table take typed
refcounts to anything it points to, and only drop those refcounts when
the flush completes, but I can imagine that becoming complex.

You may also need to consider grant-mapped memory - you need to make
sure the grant isn't released until after the flush completes.

12/13: Ah, I see you are looking at grant table stuff, at least for
the transfer path. :)  Still the mapping path needs to be looked at.

Cheers,

Tim.

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device

2015-09-21 Thread Jan Beulich
>>> On 21.09.15 at 11:46,  wrote:
>>> >>> On 21.09.15 at 16:51, < jbeul...@suse.com > wrote:
>>- Anything else?
> 
> 
> Just test the extreme case. The ATS specification mandates a timeout of 1 
> _minute_ for cache flush, even though it doesn't take so much time for cache 
> flush.
> In my design, if the Device-TLB is not completed, the domain's vCPUs are not 
> allowed entry guest mode (patch #7), otherwise, the logic is not correct.

Hmm, that would be a serious limitation, and I can't immediately
see a reason: Taking this with virtualization removed, I don't
think such an invalidation would stall a physical CPU for a whole
minute. Sadly I also can't immediately think of a solution, but I
guess first of all I'll have to take a look at the series (which
unfortunately may take a few days to get to).

Jan


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device

2015-09-21 Thread Jan Beulich
>>> On 21.09.15 at 16:03,  wrote:

 On 21.09.15 at 20:04, < jbeul...@suse.com > wrote:
>> >>> On 21.09.15 at 11:46,  wrote:
>> >>> >>> On 21.09.15 at 16:51, < jbeul...@suse.com > wrote:
>> >>- Anything else?
>> >
>> >
>> > Just test the extreme case. The ATS specification mandates a timeout
>> > of 1 _minute_ for cache flush, even though it doesn't take so much
>> > time for cache flush.
>> > In my design, if the Device-TLB is not completed, the domain's vCPUs
>> > are not allowed entry guest mode (patch #7), otherwise, the logic is not
>> correct.
>> 
>> Hmm, that would be a serious limitation, and I can't immediately see a 
> reason:
>> Taking this with virtualization removed, I don't think such an invalidation 
> would
>> stall a physical CPU for a whole minute. Sadly I also can't immediately 
> think of a
>> solution, but I guess first of all I'll have to take a look at the series 
> (which
>> unfortunately may take a few days to get to).
> 
> Thanks for your review. I will reply to any comments ASAP.

Thanks. It would however have been nice if you addressed the
implied question (regarding the reason) right away.

Jan


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device

2015-09-21 Thread Xu, Quan

>>> On 21.09.15 at 20:04, < jbeul...@suse.com > wrote:
> >>> On 21.09.15 at 11:46,  wrote:
> >>> >>> On 21.09.15 at 16:51, < jbeul...@suse.com > wrote:
> >>- Anything else?
> >
> >
> > Just test the extreme case. The ATS specification mandates a timeout
> > of 1 _minute_ for cache flush, even though it doesn't take so much
> > time for cache flush.
> > In my design, if the Device-TLB is not completed, the domain's vCPUs
> > are not allowed entry guest mode (patch #7), otherwise, the logic is not
> correct.
> 
> Hmm, that would be a serious limitation, and I can't immediately see a reason:
> Taking this with virtualization removed, I don't think such an invalidation 
> would
> stall a physical CPU for a whole minute. Sadly I also can't immediately think 
> of a
> solution, but I guess first of all I'll have to take a look at the series 
> (which
> unfortunately may take a few days to get to).
> 
> Jan

Jan, 

Thanks for your review. I will reply to any comments ASAP.
 
-Quan 

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device

2015-09-21 Thread Xu, Quan
George / Tim,
Could you help me review these memory patches? Thanks!

-Quan



> -Original Message-
> From: Xu, Quan
> Sent: Wednesday, September 16, 2015 9:24 PM
> To: andrew.coop...@citrix.com; Dong, Eddie; ian.campb...@citrix.com;
> ian.jack...@eu.citrix.com; jbeul...@suse.com; Nakajima, Jun; k...@xen.org;
> Tian, Kevin; t...@xen.org; Zhang, Yang Z; george.dun...@eu.citrix.com
> Cc: xen-devel@lists.xen.org; Xu, Quan
> Subject: [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device
> 
> Introduction
> 
> 
>VT-d code currently has a number of cases where completion of certain
> operations is being waited for by way of spinning. The majority of instances 
> use
> that variable indirectly through IOMMU_WAIT_OP() macro , allowing for loops of
> up to 1 second (DMAR_OPERATION_TIMEOUT). While in many of the cases this
> may be acceptable, the invalidation case seems particularly problematic.
> 
> Currently the hypervisor polls the status address of the wait descriptor for
> up to 1 second to get the invalidation flush result. When the invalidation
> queue includes a Device-TLB invalidation, using 1 second is a mistake for the
> invalidation sync: the 1 second timeout is sized for response times of the
> IOMMU engine itself, not for Device-TLB invalidation with PCI-e Address
> Translation Services (ATS) in use. The ATS specification mandates a timeout
> of 1 _minute_ for cache flush. The ATS case needs to be taken into
> consideration when doing invalidations. Obviously we can't spin for a minute,
> so invalidation absolutely needs to be converted to a non-spinning model.
> 
>    Also I should fix the new memory security issue.
> The page freed from the domain should be held until the Device-TLB flush is
> completed (ATS timeout of 1 _minute_).
> The page previously associated with the freed portion of GPA should not be
> reallocated for another purpose until the appropriate invalidations have been
> performed. Otherwise, the original page owner can still access the freed page
> through DMA.
> 
> Why RFC
> ===
> Patch 0001--0005, 0013 are IOMMU related.
> Patch 0006 is about new flag (vCPU / MMU related).
> Patch 0007 is vCPU related.
> Patch 0008--0012 are MMU related.
> 
> 1. Xen MMU is very complicated. Could Xen MMU experts help me verify
>    whether I have covered all of the cases?
> 
> 2. For gnttab_transfer, if the Device-TLB flush is still not completed when
>    mapping the transferring page to a remote domain, schedule and wait on a
>    waitqueue until the Device-TLB flush is completed. Is it correct?
> 
>    (I have tested the waitqueue in decrease_reservation() [do_memory_op()
>    hypercall]. I wake up a domain (with only one vCPU) with the debug-key
>    tool, and the domain is still working after waiting 60s in a waitqueue.)
> 
> 
> Design Overview
> ===
> 
> This design implements a non-spinning model for Device-TLB invalidation --
> using an interrupt-based mechanism. It tracks the Device-TLB invalidation
> status in a per-domain invalidation table. The invalidation table keeps the
> count of in-flight Device-TLB invalidation requests, and also provides a
> global per-domain polling parameter for in-flight Device-TLB invalidation
> requests.
> When submitting invalidation requests, update the invalidation table's count
> of in-flight Device-TLB invalidation requests and assign the address of the
> per-domain global polling parameter to the Status Address of each
> invalidation wait descriptor.
> 
> For example:
>   .
>   |invl |  Status Data = 1 (the count of in-flight Device-TLB invalidation requests)
>   |wait |  Status Address = virt_to_maddr(&_a_global_polling_parameter_per_domain_)
>   |dsc  |
>   .
>   .
>   |invl |  Status Data = 2 (the count of in-flight Device-TLB invalidation requests)
>   |wait |  Status Address = virt_to_maddr(&_a_global_polling_parameter_per_domain_)
>   |dsc  |
>   .
>   .
>   |invl |  Status Data = 3 (the count of in-flight Device-TLB invalidation requests)
>   |wait |  Status Address = virt_to_maddr(&_a_global_polling_parameter_per_domain_)
>   |dsc  |
>   .
> 
> More information about the VT-d Invalidation Wait Descriptor: please refer to
> http://www.intel.com/content/www/us/en/intelligent-systems/intel-technology/vt-directed-io-spec.html
>   6.5.2.8 Invalidation Wait Descriptor.
> Status Address and Data: Status address and data is used by hardware to
> perform the wait descriptor completion status write when the Status Write (SW)
> field is Set. Hardware behavior is undefined if the Status address range of
> 0xFEEX_ etc.). The Status Address and Data fields are ignored by hardware when
> the Status Write field is Clear.
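To restate the bookkeeping described in the quoted design overview as code, here is a minimal illustrative sketch of a per-domain invalidation table; the field and function names are made up, not the ones in the posted patches:

#include <stdint.h>

/* Illustrative per-domain invalidation table for the async model. */
struct devtlb_inval_table {
    uint32_t inflight_count;      /* in-flight Device-TLB invalidation requests */
    volatile uint32_t poll_word;  /* hardware writes Status Data here when done */
};

/*
 * When a Device-TLB invalidation is submitted: bump the count, use it as the
 * wait descriptor's Status Data, and point the Status Address at the per-domain
 * polling word (in Xen that address would be virt_to_maddr(&t->poll_word)).
 */
static uint32_t prepare_wait_descriptor(struct devtlb_inval_table *t)
{
    return ++t->inflight_count;   /* value to program as Status Data */
}

/*
 * Completion check (polled, or done from the completion-event interrupt): the
 * polling word holds the Status Data of the most recently completed wait
 * descriptor, so all requests are done once it catches up with the count.
 */
static int all_invalidations_done(const struct devtlb_inval_table *t)
{
    return t->poll_word == t->inflight_count;
}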
> 
> The invalidation completion event interrupt is generated only after the
> invalidation wait descriptor completes. In 

Re: [Xen-devel] [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device

2015-09-21 Thread Xu, Quan
Thanks Jan.

>> >>> On 21.09.15 at 16:51, < jbeul...@suse.com > wrote:
>>> On 17.09.15 at 05:26,  wrote:
> Much more information:
>If I run a service in this domain and tested this waitqueue case. 
> The domain is still working after 60s, but It prints out Call Trace with 
> $dmesg:
> 
> [  161.978599] BUG: soft lockup - CPU#0 stuck for 57s! 
> [kworker/0:1:272]

>Not sure what you meant to tell us with that (it clearly means there's a bug 
>somewhere in the series you're testing):
>- Drop the current series and wait for an update?

No.

>- A request for help? If so, I don't think there's much to be said
  based on just the kernel soft lockup detector output.
>- Anything else?


I just tested the extreme case. The ATS specification mandates a timeout of 1 
_minute_ for cache flush, even though a cache flush doesn't take that much 
time.
In my design, if the Device-TLB flush is not completed, the domain's vCPUs are 
not allowed to enter guest mode (patch #7), otherwise the logic is not correct.



~Quan


>Jan


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device

2015-09-21 Thread Jan Beulich
>>> On 17.09.15 at 05:26,  wrote:
> Much more information:
>If I run a service in this domain and tested this waitqueue case. The 
> domain is still working after 60s, but It prints out Call Trace with $dmesg:
> 
> [  161.978599] BUG: soft lockup - CPU#0 stuck for 57s! [kworker/0:1:272]

Not sure what you meant to tell us with that (it clearly means
there's a bug somewhere in the series you're testing):
- Drop the current series and wait for an update?
- A request for help? If so, I don't think there's much to be said
  based on just the kernel soft lockup detector output.
- Anything else?

Jan


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device

2015-09-17 Thread Ian Jackson
Julien Grall writes ("Re: [Xen-devel] [Patch RFC 00/13] VT-d Asynchronous 
Device-TLB Flush for ATS Device"):
> On 16/09/2015 14:47, Ian Jackson wrote:
> > I don't consider myself qualified to review that.  I think the
> > MAINTAINERS file should have an entry for xen/ but it doesn't seem to.
> 
> I think such a patch should come from you.

You're probably right.  Just sent.

Thanks,
Ian.

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device

2015-09-17 Thread Julien Grall



On 16/09/2015 14:47, Ian Jackson wrote:

Julien Grall writes ("Re: [Xen-devel] [Patch RFC 00/13] VT-d Asynchronous Device-TLB 
Flush for ATS Device"):

On 16/09/15 11:46, Ian Jackson wrote:

JOOI why did you CC me ?  I did a quick scan of these patches and they
don't seem to have any tools impact.  I would prefer not to be CC'd
unless there is a reason why my attention would be valuable.


The common directory is maintained by "THE REST" group. From the
MAINTAINERS file you are part of it.


Ah.  Hmmm.  You mean xen/common/ ?


Right.



I don't consider myself qualified to review that.  I think the
MAINTAINERS file should have an entry for xen/ but it doesn't seem to.


I think such a patch should come from you.

Regards,

--
Julien Grall

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device

2015-09-16 Thread Ian Jackson
Quan Xu writes ("[Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS 
Device"):
> Introduction
> 

Thanks for your submission.

JOOI why did you CC me ?  I did a quick scan of these patches and they
don't seem to have any tools impact.  I would prefer not to be CC'd
unless there is a reason why my attention would be valuable.

Regards,
Ian.

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device

2015-09-16 Thread Ian Jackson
Julien Grall writes ("Re: [Xen-devel] [Patch RFC 00/13] VT-d Asynchronous 
Device-TLB Flush for ATS Device"):
> On 16/09/15 11:46, Ian Jackson wrote:
> > JOOI why did you CC me ?  I did a quick scan of these patches and they
> > don't seem to have any tools impact.  I would prefer not to be CC'd
> > unless there is a reason why my attention would be valuable.
> 
> The common directory is maintained by "THE REST" group. From the
> MAINTAINERS file you are part of it.

Ah.  Hmmm.  You mean xen/common/ ?

I don't consider myself qualified to review that.  I think the
MAINTAINERS file should have an entry for xen/ but it doesn't seem to.

Ian.

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device

2015-09-16 Thread Xu, Quan


> -Original Message-
> From: Ian Jackson [mailto:ian.jack...@eu.citrix.com]
> Sent: Wednesday, September 16, 2015 6:47 PM
> To: Xu, Quan
> Cc: andrew.coop...@citrix.com; Dong, Eddie; ian.campb...@citrix.com;
> ian.jack...@eu.citrix.com; jbeul...@suse.com; Nakajima, Jun; k...@xen.org;
> Tian, Kevin; t...@xen.org; Zhang, Yang Z; george.dun...@eu.citrix.com;
> xen-devel@lists.xen.org
> Subject: Re: [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS
> Device
> 
> Quan Xu writes ("[Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS
> Device"):
> > Introduction
> > 
> 
> Thanks for your submission.
> 
> JOOI why did you CC me ?  I did a quick scan of these patches and they don't
> seem to have any tools impact.  I would prefer not to be CC'd unless there is 
> a
> reason why my attention would be valuable.

Ian,
Thanks for your quick response!
For patch-11 and patch-12, I got your email address from the 
scripts/get_maintainer.pl tool. 

-Quan

> 
> Regards,
> Ian.

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device

2015-09-16 Thread Xu, Quan


> -Original Message-
> From: Xu, Quan
> Sent: Wednesday, September 16, 2015 9:24 PM
> To: andrew.coop...@citrix.com; Dong, Eddie; ian.campb...@citrix.com;
> ian.jack...@eu.citrix.com; jbeul...@suse.com; Nakajima, Jun; k...@xen.org;
> Tian, Kevin; t...@xen.org; Zhang, Yang Z; george.dun...@eu.citrix.com
> Cc: xen-devel@lists.xen.org; Xu, Quan
> Subject: [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device
> 
> Introduction
> 
> 
>VT-d code currently has a number of cases where completion of certain
> operations is being waited for by way of spinning. The majority of instances 
> use
> that variable indirectly through IOMMU_WAIT_OP() macro , allowing for loops of
> up to 1 second (DMAR_OPERATION_TIMEOUT). While in many of the cases this
> may be acceptable, the invalidation case seems particularly problematic.
> 
> Currently the hypervisor polls the status address of the wait descriptor for
> up to 1 second to get the invalidation flush result. When the invalidation
> queue includes a Device-TLB invalidation, using 1 second is a mistake for the
> invalidation sync: the 1 second timeout is sized for response times of the
> IOMMU engine itself, not for Device-TLB invalidation with PCI-e Address
> Translation Services (ATS) in use. The ATS specification mandates a timeout
> of 1 _minute_ for cache flush. The ATS case needs to be taken into
> consideration when doing invalidations. Obviously we can't spin for a minute,
> so invalidation absolutely needs to be converted to a non-spinning model.
> 
>    Also I should fix the new memory security issue.
> The page freed from the domain should be held until the Device-TLB flush is
> completed (ATS timeout of 1 _minute_).
> The page previously associated with the freed portion of GPA should not be
> reallocated for another purpose until the appropriate invalidations have been
> performed. Otherwise, the original page owner can still access the freed page
> through DMA.
> 
> Why RFC
> ===
> Patch 0001--0005, 0013 are IOMMU related.
> Patch 0006 is about new flag (vCPU / MMU related).
> Patch 0007 is vCPU related.
> Patch 0008--0012 are MMU related.
> 
> 1. Xen MMU is very complicated. Could Xen MMU experts help me verify
>    whether I have covered all of the cases?
> 
> 2. For gnttab_transfer, if the Device-TLB flush is still not completed when
>    mapping the transferring page to a remote domain, schedule and wait on a
>    waitqueue until the Device-TLB flush is completed. Is it correct?
> 
>    (I have tested the waitqueue in decrease_reservation() [do_memory_op()
>    hypercall]. I wake up a domain (with only one vCPU) with the debug-key
>    tool, and the domain is still working after waiting 60s in a waitqueue.)


Much more information:
   If I run a service in this domain and test this waitqueue case, the domain 
is still working after 60s, but it prints out a Call Trace in dmesg:

[  161.978599] BUG: soft lockup - CPU#0 stuck for 57s! [kworker/0:1:272]
[  161.978621] Modules linked in: crct10dif_pclmul(F) crc32_pclmul(F) joydev(F) 
ghash_clmulni_intel(F) cryptd(F) xen_kbdfront(F) microcode(F) cirrus ttm 
drm_kms_helper drm psmouse(F) serio_raw(F) syscopyarea(F) sysfillrect(F) 
sysimgblt(F) i2c_piix4 ext2(F) mac_hid lp(F) parport(F) myri10ge dca floppy(F)
[  161.978626] CPU: 0 PID: 272 Comm: kworker/0:1 Tainted: GF
3.11.0-12-generic #19-Ubuntu
[  161.978628] Hardware name: Xen HVM domU, BIOS 4.6.0-rc 08/03/2015
[  161.978638] Workqueue: events balloon_process
[  161.978640] task: 88007a1b4650 ti: 88007a1f2000 task.ti: 
88007a1f2000
[  161.978650] RIP: 0010:[]  [] 
xen_hypercall_memory_op+0x5/0x20
[  161.978652] RSP: 0018:88007a1f3d60  EFLAGS: 0246
[  161.978653] RAX: 000c RBX:  RCX: 1568
[  161.978654] RDX:  RSI: 88007a1f3d70 RDI: 0041
[  161.978656] RBP: 88007a1f3db8 R08: 81d04888 R09: 0006a5dd
[  161.978657] R10: 3690 R11: 88007f7fa750 R12: 880036671000
[  161.978658] R13: 810c6176 R14: 88007a1f3d20 R15: 
[  161.978660] FS:  () GS:88007be0() 
knlGS:
[  161.978661] CS:  0010 DS:  ES:  CR0: 80050033
[  161.978662] CR2: 7f8d3e97e000 CR3: 01c0e000 CR4: 001406f0
[  161.978669] Stack:
[  161.978673]  8141a16a 365b0048 81fb1520 
00ff
[  161.978676]   7ff0 81c97301 
1600
[  161.978678]  88007be13e00 88007be17e00  
88007a1f3e20
[  161.978679] Call Trace:
[  161.978684]  [] ? decrease_reservation+0x29a/0x2e0
[  161.978688]  [] balloon_process+0x333/0x430
[  161.978695]  [] process_one_work+0x17c/0x430
[  161.978699]  [] worker_thread+0x11c/0x3c0
[  161.978702]  [] ?