Re: [Xen-devel] [PATCH] xen/domctl: lower loglevel of XEN_DOMCTL_memory_mapping

2015-09-11 Thread Jan Beulich
>>> On 11.09.15 at 02:59,  wrote:
> If you want a formula I would do:
> 
> #define MAX_SOCKETS 8
> 
>  max_pfns = pow(2, MAX_SOCKETS - min(nr_iommus(), MAX_SOCKETS)) * 64;
> 
> Where nr_iommus would have to be somehow implemented, ditto for pow.
> 
> This should give you:
>  8 -> 64
>  7 -> 128
>  6 -> 256
>  5 -> 512
>  4 -> 1024
>  3 -> 2048
>  2 -> 4096
>  1 -> 8192

16k seems excessive as a default. Also - why would this be related
to the number of sockets? I don't think there's a one-IOMMU-per-
socket rule; fixed-number-of-IOMMUs-per-node might come closer,
but there we'd have the problem of what "fixed number" is. Wouldn't
something as simple as 1024 / nr_iommus() do?
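
A minimal sketch of that suggestion (illustrative only: nr_iommus() is a
placeholder that would still need implementing, and the non-zero floor just
reflects the "ensure the limit is not zero" option mentioned further down
the thread):

unsigned int limit = 1024 / nr_iommus();   /* assumes nr_iommus() >= 1 */

if ( !limit )          /* keep the per-hypercall limit non-zero */
    limit = 1;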

I also don't follow what cache flushes you talked about earlier: I
don't think the IOMMUs drive any global cache flushes, and I
would have thought the size limited IOTLB and (CPU side) cache
ones should be pretty limited in terms of bus load (unless the TLB
ones would get converted to global ones due to lacking IOMMU
capabilities). Is that not the case?

Jan




Re: [Xen-devel] [PATCH] xen/domctl: lower loglevel of XEN_DOMCTL_memory_mapping

2015-09-11 Thread Jan Beulich
>>> On 11.09.15 at 14:05,  wrote:
> The flush_all(FLUSH_CACHE) in mtrr.c will result in a flush_area_mask for
> all CPUs in the host. It will take more time to issue an IPI to all
> logical cores the more cores there are. I admit that x2apic_cluster mode
> may speed this up, but not all hosts will have that enabled.
> 
> The data flush will force all data out to the memory controllers, and it's
> possible that CPUs in a different package have cached data corresponding
> to a particular memory controller, which will become a bottleneck.
> 
> In the worst case, with a large delay between XEN_DOMCTL_memory_mapping
> hypercalls and on an 8-socket system, you may end up writing out 45MB
> (L3 cache) * 8 = 360MB to a single memory controller for every 64 pages
> (256KiB) of domU p2m updated.

True.

Considering that BARs need to be properly aligned in both guest
and host address spaces, I wonder why we aren't using large
pages to map such huge BARs then. As it looks, this would require
redefining the semantics of the domctl once again, but that's not
a big problem since it's a domctl. I'll see if I can cook up something
(assuming that hosts used for passing through devices with such
huge BARs will have support for at least 2MB pages in both EPT
[NPT always has] and the IOMMU).
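
A rough sketch of the chunking such a change would imply (purely
illustrative; map_superpage() and map_page() are hypothetical helpers, not
the actual domctl or p2m code):

#define SUPERPAGE_PAGES (1UL << 9)   /* 2MB worth of 4KB pages */

while ( nr_mfns )
{
    if ( !(gfn & (SUPERPAGE_PAGES - 1)) && !(mfn & (SUPERPAGE_PAGES - 1)) &&
         nr_mfns >= SUPERPAGE_PAGES )
    {
        map_superpage(d, gfn, mfn);          /* one 2MB EPT/IOMMU entry */
        gfn += SUPERPAGE_PAGES;
        mfn += SUPERPAGE_PAGES;
        nr_mfns -= SUPERPAGE_PAGES;
    }
    else
    {
        map_page(d, gfn, mfn);               /* fall back to a 4KB entry */
        ++gfn;
        ++mfn;
        --nr_mfns;
    }
}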

Jan




Re: [Xen-devel] [PATCH] xen/domctl: lower loglevel of XEN_DOMCTL_memory_mapping

2015-09-11 Thread Malcolm Crossley
On 11/09/15 12:11, Jan Beulich wrote:
 On 11.09.15 at 12:28,  wrote:
>> On 11/09/15 10:17, Jan Beulich wrote:
>> On 11.09.15 at 02:59,  wrote:
 If you want a formula I would do:

 #define MAX_SOCKETS 8

  max_pfns = pow(2, MAX_SOCKETS - min(nr_iommus(), MAX_SOCKETS)) * 64;

 Where nr_iommus would have to be somehow implemented, ditto for pow.

 This should give you:
  8 -> 64
  7 -> 128
  6 -> 256
  5 -> 512
  4 -> 1024
  3 -> 2048
  2 -> 4096
  1 -> 8192
>>>
>>> 16k seems excessive as a default. Also - why would this be related
>>> to the number of sockets? I don't think there's a one-IOMMU-per-
>>> socket rule; fixed-number-of-IOMMUs-per-node might come closer,
>>> but there we'd have the problem of what "fixed number" is. Wouldn't
>>> something as simple as 1024 / nr_iommus() do?
>>>
>>> I also don't follow what cache flushes you talked about earlier: I
>>> don't think the IOMMUs drive any global cache flushes, and I
>>> would have thought the size limited IOTLB and (CPU side) cache
>>> ones should be pretty limited in terms of bus load (unless the TLB
>>> ones would get converted to global ones due to lacking IOMMU
>>> capabilities). Is that not the case?
>>
>> The data cache flushes are caused by the memory_type_changed()
>> call at the bottom of the XEN_DOMCTL_memory_mapping hypercall,
>> not by the IOMMU code itself.
> 
> In which case - contrary to what Konrad said he measured - their
> impact on overall throughput shouldn't scale with socket (or node)
> count (unless the hardware implementation of the flushes is bad).
> 

The flush_all(FLUSH_CACHE) in mtrr.c will result in a flush_area_mask for
all CPUs in the host. It will take more time to issue an IPI to all logical
cores the more cores there are. I admit that x2apic_cluster mode may speed
this up, but not all hosts will have that enabled.

The data flush will force all data out to the memory controllers, and it's
possible that CPUs in a different package have cached data corresponding to
a particular memory controller, which will become a bottleneck.

In the worst case, with a large delay between XEN_DOMCTL_memory_mapping
hypercalls and on an 8-socket system, you may end up writing out 45MB
(L3 cache) * 8 = 360MB to a single memory controller for every 64 pages
(256KiB) of domU p2m updated.
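
Pulling the pieces of this thread together, the path being described is
roughly the following (a paraphrased call chain, not verbatim source):

/*
 * XEN_DOMCTL_memory_mapping
 *   -> memory_type_changed(d)      unconditional, at the end of the domctl
 *     -> flush_all(FLUSH_CACHE)    x86 mtrr.c path, taken when the domain
 *                                  has a PCI device assigned
 *       -> flush_area_mask(...)    IPI to every online logical CPU, each of
 *                                  which then writes its data cache back to
 *                                  the memory controllers
 */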

Malcolm


> Jan
> 




Re: [Xen-devel] [PATCH] xen/domctl: lower loglevel of XEN_DOMCTL_memory_mapping

2015-09-11 Thread Malcolm Crossley
On 11/09/15 10:17, Jan Beulich wrote:
 On 11.09.15 at 02:59,  wrote:
>> If you want a formula I would do:
>>
>> #define MAX_SOCKETS 8
>>
>>  max_pfns = pow(2, MAX_SOCKETS - min(nr_iommus(), MAX_SOCKETS)) * 64;
>>
>> Where nr_iommus would have to be somehow implemented, ditto for pow.
>>
>> This should give you:
>>  8   -> 64
>>  7   -> 128
>>  6   -> 256
>>  5   -> 512
>>  4   -> 1024
>>  3   -> 2048
>>  2   -> 4096
>>  1   -> 8192
> 
> 16k seems excessive as a default. Also - why would this be related
> to the number of sockets? I don't think there's a one-IOMMU-per-
> socket rule; fixed-number-of-IOMMUs-per-node might come closer,
> but there we'd have the problem of what "fixed number" is. Wouldn't
> something as simple as 1024 / nr_iommus() do?
> 
> I also don't follow what cache flushes you talked about earlier: I
> don't think the IOMMUs drive any global cache flushes, and I
> would have thought the size limited IOTLB and (CPU side) cache
> ones should be pretty limited in terms of bus load (unless the TLB
> ones would get converted to global ones due to lacking IOMMU
> capabilities). Is that not the case?

The data cache flushes are caused by the memory_type_changed() call at the
bottom of the XEN_DOMCTL_memory_mapping hypercall, not by the IOMMU code itself.

Malcolm

> 
> Jan
> 




Re: [Xen-devel] [PATCH] xen/domctl: lower loglevel of XEN_DOMCTL_memory_mapping

2015-09-11 Thread Jan Beulich
>>> On 11.09.15 at 12:28,  wrote:
> On 11/09/15 10:17, Jan Beulich wrote:
> On 11.09.15 at 02:59,  wrote:
>>> If you want a formula I would do:
>>>
>>> #define MAX_SOCKETS 8
>>>
>>>  max_pfns = pow(2, MAX_SOCKETS - min(nr_iommus(), MAX_SOCKETS)) * 64;
>>>
>>> Where nr_iommus would have to be somehow implemented, ditto for pow.
>>>
>>> This should give you:
>>>  8  -> 64
>>>  7  -> 128
>>>  6  -> 256
>>>  5  -> 512
>>>  4  -> 1024
>>>  3  -> 2048
>>>  2  -> 4096
>>>  1  -> 8192
>> 
>> 16k seems excessive as a default. Also - why would this be related
>> to the number of sockets? I don't think there's a one-IOMMU-per-
>> socket rule; fixed-number-of-IOMMUs-per-node might come closer,
>> but there we'd have the problem of what "fixed number" is. Wouldn't
>> something as simple as 1024 / nr_iommus() do?
>> 
>> I also don't follow what cache flushes you talked about earlier: I
>> don't think the IOMMUs drive any global cache flushes, and I
>> would have thought the size limited IOTLB and (CPU side) cache
>> ones should be pretty limited in terms of bus load (unless the TLB
>> ones would get converted to global ones due to lacking IOMMU
>> capabilities). Is that not the case?
> 
> The data cache flushes are caused by the memory_type_changed()
> call at the bottom of the XEN_DOMCTL_memory_mapping hypercall,
> not by the IOMMU code itself.

In which case - contrary to what Konrad said he measured - their
impact on overall throughput shouldn't scale with socket (or node)
count (unless the hardware implementation of the flushes is bad).

Jan




Re: [Xen-devel] [PATCH] xen/domctl: lower loglevel of XEN_DOMCTL_memory_mapping

2015-09-10 Thread Chen, Tiejun

Sort of (the patch has the intended effect, but for its size very
many rough edges).



I guess we need to amend the original parameter, once_mapping_mfns, like
this:


/* xen_once_mapping_mfns: number of MFNs to map per memory_mapping hypercall. */
unsigned int xen_once_mapping_mfns;
size_param("once_mapping_mfns", xen_once_mapping_mfns);

static void xen_once_mapping_mfns_setup(void)
{
    if ( xen_once_mapping_mfns < 64 )
        xen_once_mapping_mfns = 64;
    else if ( xen_once_mapping_mfns > 1024 )
        xen_once_mapping_mfns = 1024;
}

Or what is your expected rule?

Thanks
Tiejun



Re: [Xen-devel] [PATCH] xen/domctl: lower loglevel of XEN_DOMCTL_memory_mapping

2015-09-10 Thread Jan Beulich
>>> On 10.09.15 at 07:28,  wrote:
>>> If the 64 limit was arbitrary then I would suggest increasing it to at
>>> least 1024, so that at least 4MB of BAR can be mapped in one go and the
>>> overhead is reduced by a factor of 16.
>>
>> 1024 may be a little much, but 256 is certainly a possibility, plus
>> Konrad's suggestion to allow this limit to be controlled via a
>> command-line option.
> 
> Are you guys suggesting something like this?

Sort of (the patch has the intended effect, but for its size very
many rough edges).

Jan

> diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
> index 3946e4c..a9671bb 100644
> --- a/xen/arch/x86/setup.c
> +++ b/xen/arch/x86/setup.c
> @@ -88,6 +88,10 @@ boolean_param("noapic", skip_ioapic_setup);
>   s8 __read_mostly xen_cpuidle = -1;
>   boolean_param("cpuidle", xen_cpuidle);
> 
> +/* once_mapping_mfns: number of MFNs to map per memory_mapping hypercall. */
> +unsigned int xen_once_mapping_mfns;
> +integer_param("once_mapping_mfns", xen_once_mapping_mfns);
> +
>   #ifndef NDEBUG
>   unsigned long __initdata highmem_start;
>   size_param("highmem-start", highmem_start);
> diff --git a/xen/common/domctl.c b/xen/common/domctl.c
> index 3bf39f1..82c85e3 100644
> --- a/xen/common/domctl.c
> +++ b/xen/common/domctl.c
> @@ -33,6 +33,8 @@
>   #include 
>   #include 
> 
> +extern unsigned int xen_once_mapping_mfns;
> +
>   static DEFINE_SPINLOCK(domctl_lock);
>   DEFINE_SPINLOCK(vcpu_alloc_lock);
> 
> @@ -1035,7 +1037,7 @@ long do_domctl(XEN_GUEST_HANDLE_PARAM(xen_domctl_t) u_domctl)
> 
>   ret = -E2BIG;
>   /* Must break hypercall up as this could take a while. */
> -if ( nr_mfns > 64 )
> +if ( nr_mfns > xen_once_mapping_mfns )
>   break;
> 
>   ret = -EPERM;
> 
> Thanks
> Tiejun






Re: [Xen-devel] [PATCH] xen/domctl: lower loglevel of XEN_DOMCTL_memory_mapping

2015-09-10 Thread Jan Beulich
>>> On 10.09.15 at 10:55,  wrote:
>>  Sort of (the patch has the intended effect, but for its size very
>> many rough edges).
>>
> 
> I guess we need to amend the original parameter, once_mapping_mfns, like
> this:
> 
> /* xen_once_mapping_mfns: number of MFNs to map per memory_mapping hypercall. */
> unsigned int xen_once_mapping_mfns;
> size_param("once_mapping_mfns", xen_once_mapping_mfns);
> 
> static void xen_once_mapping_mfns_setup(void)
> {
>     if ( xen_once_mapping_mfns < 64 )
>         xen_once_mapping_mfns = 64;
>     else if ( xen_once_mapping_mfns > 1024 )
>         xen_once_mapping_mfns = 1024;
> }

Right, that's one of the things that would need taking care of.
(Whether enforcing an upper limit is actually needed I'm not
sure - we generally allow the admin to shoot himself in the foot
if he wants to. And whether the lower limit should be 64 instead
of just ensuring the limit is not zero is another question.)

Jan




Re: [Xen-devel] [PATCH] xen/domctl: lower loglevel of XEN_DOMCTL_memory_mapping

2015-09-10 Thread Konrad Rzeszutek Wilk
On Thu, Sep 10, 2015 at 02:59:17AM -0600, Jan Beulich wrote:
> >>> On 10.09.15 at 10:55,  wrote:
> >>  Sort of (the patch has the intended effect, but for its size very
> >> many rough edges).
> >>
> > 
> > I guess we need to amend the original parameter, once_mapping_mfns, like
> > this:
> > 
> > /* xen_once_mapping_mfns: number of MFNs to map per memory_mapping hypercall. */
> > unsigned int xen_once_mapping_mfns;
> > size_param("once_mapping_mfns", xen_once_mapping_mfns);
> > 
> > static void xen_once_mapping_mfns_setup(void)
> > {
> >     if ( xen_once_mapping_mfns < 64 )
> >         xen_once_mapping_mfns = 64;
> >     else if ( xen_once_mapping_mfns > 1024 )
> >         xen_once_mapping_mfns = 1024;
> > }
> 
> Right, that's one of the things that would need taking care of.
> (Whether enforcing an upper limit is actually needed I'm not
> sure - we generally allow the admin to shoot himself in the foot
> if he wants to. And whether the lower limit should be 64 instead
> of just ensuring the limit is not zero is another question.)

64 was semi-arbitrary - it ended up giving good latency on
highly scalable machines (8-socket). Higher numbers ended up
affecting the latency.

But higher numbers on small-socket machines were OK.
(As they do not have 8 IOMMU VT-d chipsets all potentially
flooding the QPI with serialized cache flushes.)
> 
> Jan
> 



Re: [Xen-devel] [PATCH] xen/domctl: lower loglevel of XEN_DOMCTL_memory_mapping

2015-09-10 Thread Konrad Rzeszutek Wilk
On Fri, Sep 11, 2015 at 08:44:50AM +0800, Chen, Tiejun wrote:
> >>Right, that's one of the things that would need taking care of.
> >>(Whether enforcing an upper limit is actually needed I'm not
> >>sure - we generally allow the admin to shoot himself in the foot
> >>if he wants to. And whether the lower limit should be 64 instead
> >>of just ensuring the limit is not zero is another question.)
> >
> >64 was semi-arbitrary - it ended up giving good latency on
> >highly scalable machines (8-socket). Higher numbers ended up
> >affecting the latency.
> >
> >But higher numbers on small-socket machines were OK.
> >(As they do not have 8 IOMMU VT-d chipsets all potentially
> >flooding the QPI with serialized cache flushes.)
> >>
> 
> So we should make this range [8, ] here, but 64 by
> default. Right?

Not sure I follow you. The 8 was based on 8-socket machines having 8
IOMMUs.

If you want a formula I would do:

#define MAX_SOCKETS 8

 max_pfns = pow(2, MAX_SOCKETS - min(nr_iommus(), MAX_SOCKETS)) * 64;

Where nr_iommus would have to be somehow implemented, ditto for pow.

This should give you:
 8  -> 64
 7  -> 128
 6  -> 256
 5  -> 512
 4  -> 1024
 3  -> 2048
 2  -> 4096
 1  -> 8192
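
For reference, an integer-only version of the same formula - a sketch under
the assumption that a nr_iommus() helper existed (it doesn't today, as noted
above), with the clamp to MAX_SOCKETS also an assumption so that larger
counts don't underflow the shift:

#define MAX_SOCKETS 8

/* 8 IOMMUs -> 64 PFNs, 4 -> 1024, 1 -> 8192. */
static unsigned int calc_max_pfns(void)
{
    unsigned int iommus = nr_iommus();

    if ( iommus > MAX_SOCKETS )
        iommus = MAX_SOCKETS;

    return 64u << (MAX_SOCKETS - iommus);   /* pow(2, ...) * 64, integer form */
}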

> 
> Thanks
> Tiejun
> 



Re: [Xen-devel] [PATCH] xen/domctl: lower loglevel of XEN_DOMCTL_memory_mapping

2015-09-10 Thread Chen, Tiejun

Right, that's one of the things that would need taking care of.
(Whether enforcing an upper limit is actually needed I'm not
sure - we generally allow the admin to shoot himself in the foot
if he wants to. And whether the lower limit should be 64 instead
of just ensuring the limit is not zero is another question.)


64 was semi-arbitrary - it ended up giving good latency on
highly scalable machines (8-socket). Higher numbers ended up
affecting the latency.

But higher numbers on small-socket machines were OK.
(As they do not have 8 IOMMU VT-d chipsets all potentially
flooding the QPI with serialized cache flushes.)




So we should make this range [8, ] here, but 64 by 
default. Right?


Thanks
Tiejun




[Xen-devel] [PATCH] xen/domctl: lower loglevel of XEN_DOMCTL_memory_mapping

2015-09-09 Thread Tiejun Chen
We should lower the loglevel to XENLOG_G_DEBUG when mapping or
unmapping memory via XEN_DOMCTL_memory_mapping, since it is
sufficient to see this information only while debugging.

CC: Ian Campbell 
CC: Ian Jackson 
CC: Jan Beulich 
CC: Keir Fraser 
CC: Tim Deegan 
Signed-off-by: Tiejun Chen 
---
 xen/common/domctl.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/xen/common/domctl.c b/xen/common/domctl.c
index 7f959f3..3bf39f1 100644
--- a/xen/common/domctl.c
+++ b/xen/common/domctl.c
@@ -1049,7 +1049,7 @@ long do_domctl(XEN_GUEST_HANDLE_PARAM(xen_domctl_t) u_domctl)
 
 if ( add )
 {
-printk(XENLOG_G_INFO
+printk(XENLOG_G_DEBUG
"memory_map:add: dom%d gfn=%lx mfn=%lx nr=%lx\n",
d->domain_id, gfn, mfn, nr_mfns);
 
@@ -1061,7 +1061,7 @@ long do_domctl(XEN_GUEST_HANDLE_PARAM(xen_domctl_t) u_domctl)
 }
 else
 {
-printk(XENLOG_G_INFO
+printk(XENLOG_G_DEBUG
"memory_map:remove: dom%d gfn=%lx mfn=%lx nr=%lx\n",
d->domain_id, gfn, mfn, nr_mfns);
 
-- 
1.9.1




Re: [Xen-devel] [PATCH] xen/domctl: lower loglevel of XEN_DOMCTL_memory_mapping

2015-09-09 Thread Jan Beulich
>>> On 09.09.15 at 16:20,  wrote:
> Perhaps the solution is to remove the first printk(s) and just have them
> once the operation has completed? That may fix the outstanding tasklet
> problem?

Considering that this is a tool stack based retry, how would the
hypervisor know when the _whole_ operation is done?

Jan




Re: [Xen-devel] [PATCH] xen/domctl: lower loglevel of XEN_DOMCTL_memory_mapping

2015-09-09 Thread Konrad Rzeszutek Wilk
On Wed, Sep 09, 2015 at 02:50:25PM +0800, Tiejun Chen wrote:
> We should lower the loglevel to XENLOG_G_DEBUG when mapping or
> unmapping memory via XEN_DOMCTL_memory_mapping, since it is
> sufficient to see this information only while debugging.

The issue you folks are hitting, where it takes eons to boot
with a large BAR, _may_ be related to tasklets and triggering
the hypercall_preempt_check().

I think you may be using 'console_to_ring' or such and collecting
the hypervisor logs. That will schedule a tasklet whenever printk
is used - and the tasklet will cause hypercall_preempt_check() to
return true.

So what we get is that at the entrance (even before calling map_mmio_region)
to this function we have an outstanding softirq we must process - which
means that this function exits as soon as possible to service it.

Perhaps the solution is to remove the first printk(s) and just have them
once the operation has completed? That may fix the outstanding tasklet
problem?

Could you try that (or perhaps you have already tried it?) please?
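
For context, the generic pattern being referred to looks roughly like this
(illustrative only - as Jan points out elsewhere in the thread,
XEN_DOMCTL_memory_mapping itself has no such preemption point, which is why
the chunking is done via toolstack retries):

for ( i = 0; i < nr_mfns; i++ )
{
    /* Returns true whenever there is pending softirq/event work, e.g. the
     * tasklet scheduled by a printk with console_to_ring enabled. */
    if ( hypercall_preempt_check() )
    {
        rc = -ERESTART;    /* assumption: the caller restarts the operation */
        break;
    }
    /* ... map one page ... */
}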

> 
> CC: Ian Campbell 
> CC: Ian Jackson 
> CC: Jan Beulich 
> CC: Keir Fraser 
> CC: Tim Deegan 
> Signed-off-by: Tiejun Chen 
> ---
>  xen/common/domctl.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/xen/common/domctl.c b/xen/common/domctl.c
> index 7f959f3..3bf39f1 100644
> --- a/xen/common/domctl.c
> +++ b/xen/common/domctl.c
> @@ -1049,7 +1049,7 @@ long do_domctl(XEN_GUEST_HANDLE_PARAM(xen_domctl_t) u_domctl)
>  
>  if ( add )
>  {
> -printk(XENLOG_G_INFO
> +printk(XENLOG_G_DEBUG
> "memory_map:add: dom%d gfn=%lx mfn=%lx nr=%lx\n",
> d->domain_id, gfn, mfn, nr_mfns);
>  
> @@ -1061,7 +1061,7 @@ long do_domctl(XEN_GUEST_HANDLE_PARAM(xen_domctl_t) u_domctl)
>  }
>  else
>  {
> -printk(XENLOG_G_INFO
> +printk(XENLOG_G_DEBUG
> "memory_map:remove: dom%d gfn=%lx mfn=%lx nr=%lx\n",
> d->domain_id, gfn, mfn, nr_mfns);
>  
> -- 
> 1.9.1
> 
> 



Re: [Xen-devel] [PATCH] xen/domctl: lower loglevel of XEN_DOMCTL_memory_mapping

2015-09-09 Thread Jan Beulich
>>> On 09.09.15 at 16:50,  wrote:
> On Wed, Sep 09, 2015 at 08:33:52AM -0600, Jan Beulich wrote:
>> >>> On 09.09.15 at 16:20,  wrote:
>> > Perhaps the solution is to remove the first printk(s) and just have them
>> > once the operation has completed? That may fix the outstanding tasklet
>> > problem?
>> 
>> Considering that this is a tool stack based retry, how would the
>> hypervisor know when the _whole_ operation is done?
> 
> I was merely thinking of moving the printk _after_ the map_mmio_regions
> so there wouldn't be any outstanding preemption points in map_mmio_regions
> (so it can at least do the 64 PFNs).

But there are no preemption points. That's why you needed to use
tool stack based retries in the first place.

Jan




Re: [Xen-devel] [PATCH] xen/domctl: lower loglevel of XEN_DOMCTL_memory_mapping

2015-09-09 Thread Jan Beulich
>>> On 09.09.15 at 17:19,  wrote:
> On 09/09/15 15:50, Konrad Rzeszutek Wilk wrote:
>> On Wed, Sep 09, 2015 at 08:33:52AM -0600, Jan Beulich wrote:
>> On 09.09.15 at 16:20,  wrote:
 Perhaps the solution is to remove the first printk(s) and just have them
 once the operation has completed? That may fix the outstanding tasklet
 problem?
>>>
>>> Considering that this is a tool stack based retry, how would the
>>> hypervisor know when the _whole_ operation is done?
>> 
>> I was merely thinking of moving the printk _after_ the map_mmio_regions
>> so there wouldn't be any outstanding preemption points in map_mmio_regions
>> (so it can at least do the 64 PFNs).
>> 
>> But going forward a couple of ideas:
>> 
>>  - The 64 limit was arbitrary. It could have been 42 or PFNs /
>>    num_online_cpus(), or actually finding out the size of the BAR and
>>    figuring the optimal case so that it will be done under 1ms. Or
>>    perhaps just provide a boot-time parameter for those that really
>>    are struggling with this.
> 
> The issue of it taking a long time to map a large BAR is caused by the
> unconditional memory_type_changed call at the bottom of the
> XEN_DOMCTL_memory_mapping function.
> 
> The memory_type_changed call results in a full data cache flush on a VM
> with a PCI device assigned.
> 
> We have seen this overhead cause a 1GB BAR to take 20 seconds to map its
> MMIO.
> 
> If the 64 limit was arbitrary then I would suggest increasing it to at
> least 1024, so that at least 4MB of BAR can be mapped in one go and the
> overhead is reduced by a factor of 16.

1024 may be a little much, but 256 is certainly a possibility, plus
Konrad's suggestion to allow this limit to be controlled via a
command-line option.

> Currently the toolstack attempts lower and lower powers-of-2 sizes of the
> MMIO region to be mapped until the hypercall succeeds.
> 
> It is not clear to me why we have an unconditional call to
> memory_type_changed in the domctl. Can somebody explain why it can't be
> made conditional on errors?

You mean only on success? That may be possible (albeit as you
can see from the commit introducing it I wasn't sure, and I'm still
not sure), but wouldn't buy you anything.

Jan




Re: [Xen-devel] [PATCH] xen/domctl: lower loglevel of XEN_DOMCTL_memory_mapping

2015-09-09 Thread Konrad Rzeszutek Wilk
On Wed, Sep 09, 2015 at 08:55:38AM -0600, Jan Beulich wrote:
> >>> On 09.09.15 at 16:50,  wrote:
> > On Wed, Sep 09, 2015 at 08:33:52AM -0600, Jan Beulich wrote:
> >> >>> On 09.09.15 at 16:20,  wrote:
> >> > Perhaps the solution is to remove the first printk(s) and just have them
> >> > once the operation has completed? That may fix the outstanding tasklet
> >> > problem?
> >> 
> >> Considering that this is a tool stack based retry, how would the
> >> hypervisor know when the _whole_ operation is done?
> > 
> > I was merely thinking of moving the printk _after_ the map_mmio_regions
> > so there wouldn't be any outstanding preemption points in map_mmio_regions
> > (so it can at least do the 64 PFNs).
> 
> But there are no preemption points. That's why you needed to use
> tool stack based retries in the first place.

You are totally right. I recall seeing it, but I realize it was an
RFC patch I wrote that would have made map_mmio_regions call
hypercall_preempt_check()!

Sorry about the noise!
> 
> Jan
> 



Re: [Xen-devel] [PATCH] xen/domctl: lower loglevel of XEN_DOMCTL_memory_mapping

2015-09-09 Thread Konrad Rzeszutek Wilk
On Wed, Sep 09, 2015 at 08:33:52AM -0600, Jan Beulich wrote:
> >>> On 09.09.15 at 16:20,  wrote:
> > Perhaps the solution is to remove the first printk(s) and just have them
> > once the operation has completed? That may fix the outstanding tasklet
> > problem?
> 
> Considering that this is a tool stack based retry, how would the
> hypervisor know when the _whole_ operation is done?

I was merely thinking of moving the printk _after_ the map_mmio_regions
so there wouldn't be any outstanding preemption points in map_mmio_regions
(so it can at least do the 64 PFNs).

But going forward, a couple of ideas:

 - The 64 limit was arbitrary. It could have been 42, or PFNs /
   num_online_cpus(), or actually finding out the size of the BAR and
   figuring the optimal case so that it will be done under 1ms. Or perhaps
   just provide a boot-time parameter for those that really are struggling
   with this.

 - Perhaps add a new API to the P2M code so it can do the operations
   in batches (see the sketch after this list). Our map_mmio_region
   iterates over every PFN, which is surely not efficient. Some batching
   could help? Maybe? What other code could benefit from this? Would the
   boot-time creation of IOMMU page-tables also help with this (which
   takes 10 minutes on a 1TB box, BTW)?

 - This printk can be altogether removed from the hypervisor and moved
   into the toolstack. That is, the libxl xc_domain_add_to_physmap
   could itself use the logging facility (xc_report?) to log the action.
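
A sketch of what such a batched interface might look like (the name and
signature are purely illustrative assumptions, not an existing Xen API):

/* Hypothetical: establish nr contiguous p2m/IOMMU mappings in one call, so
 * locking and (IO)TLB maintenance can be amortised over the whole range
 * instead of being paid per PFN. */
int p2m_map_mmio_batch(struct domain *d, unsigned long start_gfn,
                       unsigned long start_mfn, unsigned long nr);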
   
> 
> Jan
> 



Re: [Xen-devel] [PATCH] xen/domctl: lower loglevel of XEN_DOMCTL_memory_mapping

2015-09-09 Thread Malcolm Crossley
On 09/09/15 15:50, Konrad Rzeszutek Wilk wrote:
> On Wed, Sep 09, 2015 at 08:33:52AM -0600, Jan Beulich wrote:
> On 09.09.15 at 16:20,  wrote:
>>> Perhaps the solution is to remove the first printk(s) and just have them
>>> once the operation has completed? That may fix the outstanding tasklet
>>> problem?
>>
>> Considering that this is a tool stack based retry, how would the
>> hypervisor know when the _whole_ operation is done?
> 
> I was merely thinking of moving the printk _after_ the map_mmio_regions
> so there wouldn't be any outstanding preemption points in map_mmio_regions
> (so it can at least do the 64 PFNs).
> 
> But going forward a couple of ideas:
> 
>  - The 64 limit was arbitrary. It could have been 42 or PFNs /
>    num_online_cpus(), or actually finding out the size of the BAR and
>    figuring the optimal case so that it will be done under 1ms. Or
>    perhaps just provide a boot-time parameter for those that really
>    are struggling with this.

The issue of it taking a long time to map a large BAR is caused by the
unconditional memory_type_changed call at the bottom of the
XEN_DOMCTL_memory_mapping function.

The memory_type_changed call results in a full data cache flush on a VM
with a PCI device assigned.

We have seen this overhead cause a 1GB BAR to take 20 seconds to map its
MMIO.

If the 64 limit was arbitrary then I would suggest increasing it to at
least 1024, so that at least 4MB of BAR can be mapped in one go and the
overhead is reduced by a factor of 16.

Currently the toolstack attempts lower and lower powers-of-2 sizes of the
MMIO region to be mapped until the hypercall succeeds.
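
An illustrative-only sketch of that retry strategy (try_map() here is a
hypothetical wrapper around the domctl, not the actual libxc/libxl code):

unsigned long chunk, done;

/* Halve the request until the hypervisor stops rejecting it with -E2BIG
 * (the first successful call also maps the first chunk) ... */
for ( chunk = nr_mfns; chunk > 1; chunk >>= 1 )
    if ( try_map(gfn, mfn, chunk) != -E2BIG )
        break;

/* ... then map the remainder of the BAR in chunks of that size. */
for ( done = chunk; done < nr_mfns; done += chunk )
    try_map(gfn + done, mfn + done,
            nr_mfns - done < chunk ? nr_mfns - done : chunk);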

It is not clear to me why we have an unconditional call to
memory_type_changed in the domctl. Can somebody explain why it can't be
made conditional on errors?

Malcolm


> 
>  - Perhaps add a new API to the P2M code so it can do the operations
>in batches. Our map_mmio_region iterates over every PFN which is
>surely not efficient. Some batching could help? Maybe? What other
>code could benefit from this? Would the boot-time creation of IOMMU
>page-tables also help with this (which takes 10 minutes on a 1TB box
>BTW)
> 
>  - This printk can be altogether removed from the hypervisor and moved
>    into the toolstack. That is, the libxl xc_domain_add_to_physmap
>    could itself use the logging facility (xc_report?) to log the action.
>
>>
>> Jan
>>
> 
> 




Re: [Xen-devel] [PATCH] xen/domctl: lower loglevel of XEN_DOMCTL_memory_mapping

2015-09-09 Thread Chen, Tiejun

If the 64 limit was arbitrary then I would suggest increasing it to at
least 1024, so that at least 4MB of BAR can be mapped in one go and the
overhead is reduced by a factor of 16.


1024 may be a little much, but 256 is certainly a possibility, plus
Konrad's suggestion to allow this limit to be controlled via a
command-line option.


Are you guys suggesting something like this?

diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
index 3946e4c..a9671bb 100644
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -88,6 +88,10 @@ boolean_param("noapic", skip_ioapic_setup);
 s8 __read_mostly xen_cpuidle = -1;
 boolean_param("cpuidle", xen_cpuidle);

+/* once_mapping_mfns: number of MFNs to map per memory_mapping hypercall. */
+unsigned int xen_once_mapping_mfns;
+integer_param("once_mapping_mfns", xen_once_mapping_mfns);
+
 #ifndef NDEBUG
 unsigned long __initdata highmem_start;
 size_param("highmem-start", highmem_start);
diff --git a/xen/common/domctl.c b/xen/common/domctl.c
index 3bf39f1..82c85e3 100644
--- a/xen/common/domctl.c
+++ b/xen/common/domctl.c
@@ -33,6 +33,8 @@
 #include 
 #include 

+extern unsigned int xen_once_mapping_mfns;
+
 static DEFINE_SPINLOCK(domctl_lock);
 DEFINE_SPINLOCK(vcpu_alloc_lock);

@@ -1035,7 +1037,7 @@ long do_domctl(XEN_GUEST_HANDLE_PARAM(xen_domctl_t) u_domctl)


 ret = -E2BIG;
 /* Must break hypercall up as this could take a while. */
-if ( nr_mfns > 64 )
+if ( nr_mfns > xen_once_mapping_mfns )
 break;

 ret = -EPERM;

Thanks
Tiejun
