Re: [Xen-devel] Xen 4.6 Development Update (five months reminder, 5 WEEKS TO FREEZE)

2015-06-08 Thread Yu, Zhang



On 6/5/2015 9:53 PM, wei.l...@citrix.com wrote:

(Note, please trim your quotes when replying, and also trim the CC list if
necessary. You might also consider changing the subject line of your reply to
"Status of  (Was: Xen 4.6 Development Update (X months reminder)")

Hi all

We are now five months into the 4.6 development window. This is an email to keep
track of all the patch series I gathered. It is by no means complete and / or
accurate. Feel free to reply to this email with new projects or correct my
misunderstanding.

= Timeline =

We are planning on a 9-month release cycle, but we could also release a bit
earlier if everything goes well (no blocker, no critical bug).

* Development start: 6 Jan 2015
<=== We are here ===>
* Feature Freeze: 10 Jul 2015
* RCs: TBD
* Release Date: 9 Oct 2015 (could release earlier)

The RCs and release will of course depend on stability and bugs, and
will therefore be fairly unpredictable.

Bug fixes, if acked by a maintainer, can go in anytime before the first
RC. Later on we will need to weigh the risk of regression against the reward,
to eliminate the possibility of a bug fix introducing another bug.

= Prognosis =

The states are: none -> fair -> ok -> good -> done

none - nothing yet
fair - still working on it, patches are prototypes or RFC
ok   - patches posted, acting on review
good - some last minute pieces
done - all done, might have bugs

= Bug Fixes =

Bug fixes can be checked in without a freeze exception throughout the
freeze, unless the maintainer thinks they are particularly high
risk.  In later RCs, we may even begin rejecting bug fixes if the
broken functionality is small and the risk to other functionality is
high.

Documentation changes can go in anytime if the maintainer is OK with them.

These are guidelines and principles to give you an idea where we're coming
from; if you think there's a good reason why making an exception for you will
help us make Xen better than not doing so, feel free to make your case.

== Hypervisor ==

*  Alternate p2m: support multiple copies of host p2m (ok)
   -  Ed White

*  Improve RTDS scheduler (none)
Change RTDS from quantum-driven to event-driven
   -  Dagaen Golomb, Meng Xu, Chong Li

*  Credit2: introduce per-vcpu soft affinity (good)
   -  Justin T. Weaver

*  Credit2: introduce per-vcpu hard affinity (fair)
   -  Justin T. Weaver

*  sndif: add API for para-virtual sound (fair)
v7 posted
   -  Oleksandr Dmytryshyn

*  gnttab: improve scalability (good)
   -  David Vrabel

*  Xen multiboot2-EFI support (ok)
See http://lists.xen.org/archives/html/xen-devel/2015-01/msg03962.html
 http://lists.xen.org/archives/html/xen-devel/2015-01/msg03982.html
   -  Daniel Kiper

*  Credit2 production ready (none)
cpu reservation
   -  George Dunlap

*  VM event patches (none)
Add support for XSETBV vm_events,
Support hibernating guests
Support for VMCALL-based vm_events
   -  Razvan Cojocaru

=== Hypervisor X86 ===

*  Intel Cache Allocation Technology (good)
   -  Chao Peng

*  Intel GVT-g (none)
requires refactoring ioreq-server, fixing 16-byte MMIO emulation
    and optional PV IOMMU support
   -  Yu, Zhang

ioreq-server: still in development. Previously tried to refactor the
ioreq-server to track the IO resources using a radix tree, but this
approach would consume too much memory. Now trying an interval
rbtree instead. Will send the patch out ASAP.
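As a rough standalone illustration of the idea (not the patch itself, and
the names are only illustrative): because the tracked ranges never overlap,
a search tree ordered by range start can answer containment queries in
O(log n) while storing one node per range, rather than one entry per page
as a radix tree keyed by frame number would.

    #include <stdbool.h>
    #include <stddef.h>

    struct range_node {
        unsigned long s, e;               /* inclusive range bounds */
        struct range_node *left, *right;  /* BST ordered by s */
    };

    static bool range_tree_contains(const struct range_node *n,
                                    unsigned long s, unsigned long e)
    {
        while ( n != NULL )
        {
            if ( e < n->s )
                n = n->left;              /* query lies below this range */
            else if ( s > n->e )
                n = n->right;             /* query lies above this range */
            else
                /* Overlap: with non-overlapping, merged ranges the query
                 * is contained iff it fits inside this single node. */
                return s >= n->s && e <= n->e;
        }
        return false;
    }

The real implementation would use a self-balancing (red-black) tree to keep
the O(log n) bound.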

fixing 16-byte MMIO emulation: Paul Durrant has been working on this.

PV IOMMU support: Malcolm Crossley has been preparing the draft design.
We have had several rounds of internal discussion about how this design covers
the basic requirements of XenGT. Will continue the discussion and prepare
the patch once the design is sent to the community.

BTW, thank you Paul & Malcolm. :-)

Yu



*  Porting Intel P-state driver to Xen (fair)
   -  Wang, Wei

*  VT-d Posted-interrupt (PI) (good)
v2 posted
   -  Wu, Feng

*  Support controlling the max C-state sub-state (fair)
v3 posted
Haven't seen the patch reposted.
   -  Ross Lagerwall

*  IOMMU ABI for guests to map their DMA regions (fair)
   -  Malcolm Crossley

*  RMRR fix (fair)
RFC posted
   -  Tiejun Chen

*  VPMU - 'perf' support in Xen (good)
v21 posted
Need reviews/final ack.
   -  Boris Ostrovsky

*  PVH domU (fair)
RFC posted
   -  Elena Ufimtseva

=== Hypervisor ARM ===

*  ITS support (fair)
   -  Vijaya Kumar K

*  Add ACPI support for arm64 on Xen (fair)
RFC posted
   -  Parth Dixit

*  ARM remote processor iommu module (GPUs + IPUs) (fair)
v3 posted
   -  Andrii Tseglytskyi

*  ARM VM save/restore/live migration (none)
Needs to be rebased against migration v2 - no code posted.
   -  None

*  ARM GICv2m support (none)
   -  Suravee Suthikulanit

*  ARM  PCI passthrough (fair)
   -  Manish Jaggi

*  ARM GICv2 on GICv3 support (none)
   -  Julien Grall
   -  Vijay Kilari

== Xen toolstack ==

*  Split libxc int

Re: [Xen-devel] [RFC] Xen PV IOMMU interface draft B

2015-06-17 Thread Yu, Zhang

Hi Malcolm,

  Thank you very much for accommodating our XenGT requirements in your
design. Following are some XenGT-related questions. :)

On 6/13/2015 12:43 AM, Malcolm Crossley wrote:

Hi All,

Here is a design for allowing guests to control the IOMMU. This
allows the guest GFN mapping to be programmed into the IOMMU and
avoids using the SWIOTLB bounce buffer technique in the Linux kernel
(except for legacy 32 bit DMA IO devices).

Draft B has been expanded to include Bus Address mapping/lookup for Mediated
pass-through emulators.

The pandoc markdown format of the document is provided below to allow
for easier inline comments:

% Xen PV IOMMU interface
% Malcolm Crossley <>
   Paul Durrant <>
% Draft B

Introduction
============

Revision History
----------------


Version  Date         Changes
-------  -----------  --------------
Draft A  10 Apr 2014  Initial draft.

Draft B  12 Jun 2015  Second draft.


Background
==========

Linux kernel SWIOTLB
--------------------

Xen PV guests use a Pseudophysical Frame Number (PFN) address space which is
decoupled from the host Machine Frame Number (MFN) address space.

PV guest hardware drivers are only aware of the PFN address space and
assume that if PFN addresses are contiguous then the hardware addresses would
be contiguous as well. The decoupling between PFN and MFN address spaces means
PFN and MFN addresses may not be contiguous across page boundaries and thus a
buffer allocated in GFN address space which spans a page boundary may not be
contiguous in MFN address space.

PV hardware drivers cannot tolerate this behaviour and so a special
"bounce buffer" region is used to hide this issue from the drivers.

A bounce buffer region is a special part of the PFN address space which has
been made to be contiguous in both PFN and MFN address spaces. When a driver
requests that a buffer which spans a page boundary be made available for
hardware to read, the core operating system code copies the buffer into a
temporarily reserved part of the bounce buffer region and then returns the
MFN address of that reserved part back to the driver. The driver then
instructs the hardware to read the copy of the buffer in the bounce buffer.
Similarly, if the driver requests that a buffer be made available for
hardware to write to, a region of the bounce buffer is first reserved, and
after the hardware completes writing, the reserved region of the bounce
buffer is copied back to the originally allocated buffer.
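To make the read-direction sequence concrete, here is a minimal sketch of
the flow described above (the helper names are hypothetical, not the Linux
SWIOTLB API):

    #include <string.h>

    typedef unsigned long hw_addr_t;

    /* Assumed helpers: reserve a PFN/MFN-contiguous slot in the bounce
     * buffer region, and resolve a slot's machine (hardware) address. */
    extern void *bounce_reserve(size_t len);
    extern hw_addr_t slot_to_hw_addr(void *slot);

    /* Driver asks for a buffer to be made readable by the hardware. */
    hw_addr_t map_for_device_read(const void *buf, size_t len)
    {
        void *slot = bounce_reserve(len);  /* may fail: fixed-size region */
        memcpy(slot, buf, len);            /* the copy overhead noted below */
        return slot_to_hw_addr(slot);      /* hardware reads the copy */
    }

For device writes the flow reverses: a slot is reserved first, the hardware
writes into it, and the slot is then copied back to the original buffer.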

The overhead of memory copies to/from the bounce buffer region is high
and damages performance. Furthermore, there is a risk the fixed-size
bounce buffer region will become exhausted and it will not be possible to
return a hardware address back to the driver. The Linux kernel drivers do not
tolerate this failure and so the kernel is forced to crash, as an
uncorrectable error has occurred.

Input/Output Memory Management Units (IOMMUs) allow for an inbound address
mapping to be created from the I/O Bus address space (typically PCI) to
the machine frame number address space. IOMMUs typically use a page table
mechanism to manage the mappings and therefore can create mappings of page size
granularity or larger.

The I/O Bus address space will be referred to as the Bus Frame Number (BFN)
address space for the rest of this document.
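As a toy illustration of the inbound mapping just described (a flat table
stands in for the real page-table walk; none of these names are a Xen or
hardware interface):

    #include <errno.h>

    #define IOMMU_ENTRIES 4096UL

    /* One page-granularity translation entry per bus frame number. */
    static unsigned long bfn_to_mfn_table[IOMMU_ENTRIES];

    static int iommu_map_page(unsigned long bfn, unsigned long mfn)
    {
        if ( bfn >= IOMMU_ENTRIES )
            return -EINVAL;
        bfn_to_mfn_table[bfn] = mfn;  /* device accesses to BFN reach MFN */
        return 0;
    }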


Mediated Pass-through Emulators
-------------------------------

Mediated Pass-through emulators allow guest domains to interact with
hardware devices via emulator mediation. The emulator runs in a domain separate
from the guest domain and is used to enforce security of guest access to the
hardware devices and isolation of different guests accessing the same hardware
device.

The emulator requires a mechanism to map guest addresses to bus addresses that
the hardware devices can access.


Clarification of GFN and BFN fields for different guest types
--------------------------------------------------------------
The definition of Guest Frame Numbers (GFNs) varies depending on the guest type.

The diagram below details the memory accesses originating from the CPU, per guest type:

   HVM guest  PV guest

  (VA)   (VA)
   |  |
  MMUMMU
   |  |
  (GFN)   |
   |  | (GFN)
  HAP a.k.a EPT/NPT   |
   |  |
  (MFN)  (MFN)
   |  |
  RAMRAM

For PV guests GFN is equal to MFN for a single page but not fo

Re: [Xen-devel] Xen 4.6 Development Update (three months reminder)

2015-05-03 Thread Yu, Zhang

Hi Wei,

  This is Zhang Yu from the Intel graphics virtualization team. Previously
at the Xen hackathon, Paul and I mentioned that there are several patch
series for XenGT that need to be tracked for Xen 4.6.

  Here, I'd like to confirm with you about these patchsets:
  1> 16-byte MMIO emulation fix – owned by Paul;
  2> Ioreq server refactor – owned by Yu;
  3> The PV IOMMU – owned by Malcolm; This one may not be completed in
Xen 4.6, but a basic feature (to return a BFN which equals the MFN when
the IOMMU is 1:1 mapped or is disabled) might be necessary in this release.
  So could we also add separate tracks for these patches (I noticed the
3rd is already mentioned in your mail)?  :-)


Thanks
Yu

On 4/14/2015 6:27 PM, wei.l...@citrix.com wrote:

Hi all

We are now three months into the 4.6 development window. This is an email to keep
track of all the patch series I gathered. It is by no means complete and / or
accurate. Feel free to reply to this email with new projects or correct my
misunderstanding.

= Timeline =

We are planning on a 9-month release cycle, but we could also release a bit
earlier if everything goes well (no blocker, no critical bug).

* Development start: 6 Jan 2015
<=== We are here ===>
* Feature Freeze: 10 Jul 2015
* RCs: TBD
* Release Date: 9 Oct 2015 (could release earlier)

The RCs and release will of course depend on stability and bugs, and
will therefore be fairly unpredictable.

Bug fixes, if acked by a maintainer, can go in anytime before the first
RC. Later on we will need to weigh the risk of regression against the reward,
to eliminate the possibility of a bug fix introducing another bug.

= Prognosis =

The states are: none -> fair -> ok -> good -> done

none - nothing yet
fair - still working on it, patches are prototypes or RFC
ok   - patches posted, acting on review
good - some last minute pieces
done - all done, might have bugs

= Bug Fixes =

Bug fixes can be checked in without a freeze exception throughout the
freeze, unless the maintainer thinks they are particularly high
risk.  In later RCs, we may even begin rejecting bug fixes if the
broken functionality is small and the risk to other functionality is
high.

Documentation changes can go in anytime if the maintainer is OK with them.

These are guidelines and principles to give you an idea where we're coming
from; if you think there's a good reason why making an exception for you will
help us make Xen better than not doing so, feel free to make your case.

== Hypervisor ==

*  Alternate p2m: support multiple copies of host p2m (ok)
   -  Ed White

*  Improve RTDS scheduler (none)
   -  Dagaen Golomb, Meng Xu

*  Credit2: introduce per-vcpu hard and soft affinity (good)
   -  Justin T. Weaver

*  sndif: add API for para-virtual sound (fair)
v7 posted
   -  Oleksandr Dmytryshyn

*  gnttab: improve scalability (good)
v5 posted
   -  Christoph Egger

*  Display IO topology when PXM data is available (good)
v3 posted
   -  Boris Ostrovsky

*  Xen multiboot2-EFI support (fair)
See http://lists.xen.org/archives/html/xen-devel/2013-05/msg02281.html
RFC posted
   -  Daniel Kiper

*  Credit2 production ready (none)
cpu pinning, numa affinity and cpu reservation
   -  George Dunlap

*  VM event patches (none)
Add support for XSETBV vm_events,
Support hibernating guests
Support for VMCALL-based vm_events
   -  Razvan Cojocaru

=== Hypervisor X86 ===

*  Intel Cache Allocation Technology (good)
   -  Chao Peng

*  VT-d Posted-interrupt (PI) (none)
   -  Wu, Feng

*  HT enabled with credit has a 7.9% perf drop. (none)
kernbench demonstrated it
http://www.gossamer-threads.com/lists/xen/devel/339409
This has existed since the introduction of credit1.
   -  Dario Faggioli

*  Support controlling the max C-state sub-state (fair)
v3 posted
Haven't seen the patch reposted.
   -  Ross Lagerwall

*  IOMMU ABI for guests to map their DMA regions (fair)
   -  Malcolm Crossley

*  Intel PML (Page Modification Logging) for Xen (none)
design doc posted
   -  Kai Huang

*  RMRR fix (fair)
RFC posted
   -  Tiejun Chen

*  VPMU - 'perf' support in Xen (good)
v14 posted
Need reviews/final ack.
   -  Boris Ostrovsky

*  PVH - AMD hardware support. (fair)
RFC posted
   -  Elena Ufimtseva

*  PVH dom0 (fair)
RFC posted
   -  Elena Ufimtseva

=== Hypervisor ARM ===

*  Mem_access for ARM (good)
v13 posted
   -  Tamas K Lengyel

*  ITS support (fair)
   -  Vijaya Kumar K

*  Add ACPI support for arm64 on Xen (fair)
RFC posted
   -  Parth Dixit

*  ARM: reenable support for 32-bit userspace running in a 64-bit guest (good)
v2 posted
   -  Ian Campbell

*  ARM remote processor iommu module (GPUs + IPUs) (fair)
v3 posted
   -  Andrii Tseglytskyi

*  ARM VM save/restore/live migration (none)
Needs to be rebased against migration v2 - no code posted.
   -  None

*  ARM GICv2m support (none)
   -  Suravee Suthikulanit

*  ARM - passthrough of non-PCI (ok)
   -  Julien Grall

*  ARM  PCI passthrough (none)
   -  

Re: [Xen-devel] Xen 4.6 Development Update (three months reminder)

2015-05-04 Thread Yu, Zhang

Hi Wei,

Thanks for your reply.

On 5/4/2015 5:44 PM, Wei Liu wrote:

(Thanks for trimming the CC list beforehand)

On Mon, May 04, 2015 at 02:05:49PM +0800, Yu, Zhang wrote:

Hi Wei,



Hello.


   This is Zhang Yu from the Intel graphics virtualization team. Previously at the
Xen hackathon, Paul and I mentioned that there are several patch series for XenGT
that need to be tracked for Xen 4.6.
   Here, I'd like to confirm with you about these patchsets:
   1> 16-byte MMIO emulation fix – owned by Paul;


Could you explain a bit why this is needed? AIUI it's just a latent
bug that was discovered by this particular use case, right? In other words,
not really a regression introduced by ioreq server.
OK. Then we will fix this, but it's not necessary to track this bug. IIRC,
this is not a regression. Am I right, Paul? :-)



   2> Ioreq server refactor – owned by Yu;
   3> The PV IOMMU – owned by Malcolm; This one may not be completed in Xen
4.6, but a basic feature (to return a BFN which equals the MFN when the IOMMU is
1:1 mapped or is disabled) might be necessary in this release.
   So could we also add separate tracks for these patches (I noticed the 3rd
is already mentioned in your mail)?  :-)



I tend to track only big feature items. Non-blocking bugs and small
refactoring are not tracked.
Well, by "big feature", I'm not sure if this ioreq server refactor issue
qualifies under this definition. :-) But this is part of the functionality
that supports the Intel GVT-g solution, which is a big feature from an
overall POV. However, if we track the Intel GVT-g feature as a whole new
feature, the patch series would seem too scattered.
Sorry for being unfamiliar with the Xen development schedules, but is
there any approach by which we can track the ioreq server refactor patches (my
mission is to upstream this in Xen 4.6)?  :-)


The first one needs to be actively tracked if it's a regression.  I
already track the third one since it's a big feature.

Wei.

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel



Thanks
Yu

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] Xen 4.6 Development Update (three months reminder)

2015-05-05 Thread Yu, Zhang

Thank you, Wei.

On 5/5/2015 5:12 PM, Wei Liu wrote:

On Mon, May 04, 2015 at 08:51:56PM +0800, Yu, Zhang wrote:

Hi Wei,

Thanks for your reply.

On 5/4/2015 5:44 PM, Wei Liu wrote:

(Thanks for trimming the CC list beforehand)

On Mon, May 04, 2015 at 02:05:49PM +0800, Yu, Zhang wrote:

Hi Wei,



Hello.


   This is Zhang Yu from the Intel graphics virtualization team. Previously at the
Xen hackathon, Paul and I mentioned that there are several patch series for XenGT
that need to be tracked for Xen 4.6.
   Here, I'd like to confirm with you about these patchsets:
   1> 16-byte MMIO emulation fix – owned by Paul;


Could you explain a bit why this is needed? AIUI it's just a latent
bug that was discovered by this particular use case, right? In other words,
not really a regression introduced by ioreq server.

OK. Then we will fix this, but it's not necessary to track this bug. IIRC, this
is not a regression. Am I right, Paul? :-)



   2> Ioreq server refactor – owned by Yu;
   3> The PV IOMMU – owned by Malcolm; This one may not be completed in Xen
4.6, but a basic feature (to return a BFN which equals the MFN when the IOMMU is
1:1 mapped or is disabled) might be necessary in this release.
   So could we also add separate tracks for these patches (I noticed the 3rd
is already mentioned in your mail)?  :-)



I tend to track only big feature items. Non-blocking bugs and small
refactoring are not tracked.

Well, by "big feature", I'm not sure if this ioreq server refactor issue
qualifies under this definition. :-) But this is part of the functionality that
supports the Intel GVT-g solution, which is a big feature from an overall
POV. However, if we track the Intel GVT-g feature as a whole new feature,
the patch series would seem too scattered.


As I understand it, Intel GVT-g consists of different components. The Xen
component is only one of many components that float around. I can try to
set up a Xen GVT-g item and put this under it as a subitem if it makes sense.
Yes. So if convenient, how about setting up an Intel GVT-g item and putting
the ioreq server patch series as a subitem in 4.6?



Sorry for being unfamiliar with the Xen development schedules, but is there
any approach by which we can track the ioreq server refactor patches (my mission
is to upstream this in Xen 4.6)?  :-)


I think this sort of thing happens when it happens. You just need to
follow the usual development process. Note that we need not wait until
everything in this list goes in before we can release 4.6. Tracking them
here is more about having an idea of what exciting things are going on
within the Xen community.

Got it, and thank you! :-)


Wei.



The first one needs to be actively tracked if it's a regression.  I
already track the third one since it's a big feature.

Wei.

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel



Thanks
Yu


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


B.R.
Yu

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


[Xen-devel] [PATCH] Refactor ioreq server for better performance

2015-06-26 Thread Yu Zhang
  XenGT leverages ioreq server to track and forward the accesses to
GPU IO resources, e.g. the PPGTT (per-process graphic translation
tables). Currently, ioreq server uses rangeset to track the BDF/
PIO/MMIO ranges to be emulated. To select an ioreq server, the
rangeset is searched to see if the IO range is recorded. However,
traversing the linked list inside rangeset can be time consuming
when the number of ranges is high. On the HSW platform, the number
of PPGTTs for each vGPU can be several hundred; on BDW, this value
can be several thousand.
  To increase the performance, a new data structure, rb_rangeset,
is defined. Compared with rangeset, which is based on a doubly linked
list with O(n) search time, rb_rangeset is based on a red-black
tree with O(log(n)) search time. Besides the underlying data
structure difference with rangeset, another one is that rb_rangeset
does not provide a spinlock; instead, it leaves locking to its users.
  Besides, MAX_NR_IO_RANGES is changed to 8192 to accommodate more
ranges.

Signed-off-by: Yu Zhang 
---
 xen/arch/x86/hvm/hvm.c   |  52 -
 xen/common/Makefile  |   1 +
 xen/common/rb_rangeset.c | 243 +++
 xen/include/asm-x86/hvm/domain.h |   4 +-
 xen/include/xen/rb_rangeset.h|  45 
 5 files changed, 311 insertions(+), 34 deletions(-)
 create mode 100644 xen/common/rb_rangeset.c
 create mode 100644 xen/include/xen/rb_rangeset.h

diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index 535d622..be70925 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -37,6 +37,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -809,7 +810,7 @@ static void hvm_ioreq_server_unmap_pages(struct hvm_ioreq_server *s,
 }
 }
 
-static void hvm_ioreq_server_free_rangesets(struct hvm_ioreq_server *s,
+static void hvm_ioreq_server_free_rb_rangesets(struct hvm_ioreq_server *s,
 bool_t is_default)
 {
 unsigned int i;
@@ -818,10 +819,10 @@ static void hvm_ioreq_server_free_rangesets(struct hvm_ioreq_server *s,
 return;
 
 for ( i = 0; i < NR_IO_RANGE_TYPES; i++ )
-rangeset_destroy(s->range[i]);
+rb_rangeset_destroy(s->range[i]);
 }
 
-static int hvm_ioreq_server_alloc_rangesets(struct hvm_ioreq_server *s, 
+static int hvm_ioreq_server_alloc_rb_rangesets(struct hvm_ioreq_server *s,
 bool_t is_default)
 {
 unsigned int i;
@@ -832,33 +833,20 @@ static int hvm_ioreq_server_alloc_rangesets(struct hvm_ioreq_server *s,
 
 for ( i = 0; i < NR_IO_RANGE_TYPES; i++ )
 {
-char *name;
-
-rc = asprintf(&name, "ioreq_server %d %s", s->id,
-  (i == HVMOP_IO_RANGE_PORT) ? "port" :
-  (i == HVMOP_IO_RANGE_MEMORY) ? "memory" :
-  (i == HVMOP_IO_RANGE_PCI) ? "pci" :
-  "");
-if ( rc )
-goto fail;
-
-s->range[i] = rangeset_new(s->domain, name,
-   RANGESETF_prettyprint_hex);
-
-xfree(name);
+s->range[i] = rb_rangeset_new();
 
 rc = -ENOMEM;
 if ( !s->range[i] )
 goto fail;
 
-rangeset_limit(s->range[i], MAX_NR_IO_RANGES);
+s->range[i]->nr_ranges = MAX_NR_IO_RANGES;
 }
 
  done:
 return 0;
 
  fail:
-hvm_ioreq_server_free_rangesets(s, 0);
+hvm_ioreq_server_free_rb_rangesets(s, 0);
 
 return rc;
 }
@@ -934,7 +922,7 @@ static int hvm_ioreq_server_init(struct hvm_ioreq_server *s, struct domain *d,
 INIT_LIST_HEAD(&s->ioreq_vcpu_list);
 spin_lock_init(&s->bufioreq_lock);
 
-rc = hvm_ioreq_server_alloc_rangesets(s, is_default);
+rc = hvm_ioreq_server_alloc_rb_rangesets(s, is_default);
 if ( rc )
 return rc;
 
@@ -960,7 +948,7 @@ static int hvm_ioreq_server_init(struct hvm_ioreq_server *s, struct domain *d,
 hvm_ioreq_server_unmap_pages(s, is_default);
 
  fail_map:
-hvm_ioreq_server_free_rangesets(s, is_default);
+hvm_ioreq_server_free_rb_rangesets(s, is_default);
 
 return rc;
 }
@@ -971,7 +959,7 @@ static void hvm_ioreq_server_deinit(struct hvm_ioreq_server *s,
 ASSERT(!s->enabled);
 hvm_ioreq_server_remove_all_vcpus(s);
 hvm_ioreq_server_unmap_pages(s, is_default);
-hvm_ioreq_server_free_rangesets(s, is_default);
+hvm_ioreq_server_free_rb_rangesets(s, is_default);
 }
 
 static ioservid_t next_ioservid(struct domain *d)
@@ -1149,7 +1137,7 @@ static int hvm_map_io_range_to_ioreq_server(struct domain *d, ioservid_t id,
 
 if ( s->id == id )
 {
-struct rangeset *r;
+struct rb_rangeset *r;
 
 switch ( type )
 {
@@ -1169,10 +1157,10 @@ static int hvm_map_

Re: [Xen-devel] [PATCH] Refactor ioreq server for better performance

2015-06-30 Thread Yu, Zhang

Thank you, Paul.

On 6/29/2015 8:12 PM, Paul Durrant wrote:

-Original Message-
From: Yu Zhang [mailto:yu.c.zh...@linux.intel.com]
Sent: 26 June 2015 11:30
To: xen-de...@lists.xenproject.org; Paul Durrant; Andrew Cooper;
jbeul...@suse.com; Kevin Tian; zhiyuan...@intel.com
Subject: [PATCH] Refactor ioreq server for better performance

   XenGT leverages ioreq server to track and forward the accesses to
GPU IO resources, e.g. the PPGTT (per-process graphic translation
tables). Currently, ioreq server uses rangeset to track the BDF/
PIO/MMIO ranges to be emulated. To select an ioreq server, the
rangeset is searched to see if the IO range is recorded. However,
traversing the linked list inside rangeset can be time consuming
when the number of ranges is high. On the HSW platform, the number
of PPGTTs for each vGPU can be several hundred; on BDW, this value
can be several thousand.
   To increase the performance, a new data structure, rb_rangeset,
is defined. Compared with rangeset, which is based on a doubly linked
list with O(n) search time, rb_rangeset is based on a red-black
tree with O(log(n)) search time. Besides the underlying data
structure difference with rangeset, another one is that rb_rangeset
does not provide a spinlock; instead, it leaves locking to its users.
   Besides, MAX_NR_IO_RANGES is changed to 8192 to accommodate more
ranges.

Signed-off-by: Yu Zhang 
---
  xen/arch/x86/hvm/hvm.c   |  52 -
  xen/common/Makefile  |   1 +
  xen/common/rb_rangeset.c | 243
+++
  xen/include/asm-x86/hvm/domain.h |   4 +-
  xen/include/xen/rb_rangeset.h|  45 
  5 files changed, 311 insertions(+), 34 deletions(-)
  create mode 100644 xen/common/rb_rangeset.c
  create mode 100644 xen/include/xen/rb_rangeset.h

diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index 535d622..be70925 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -37,6 +37,7 @@
  #include 
  #include 
  #include 
+#include 
  #include 
  #include 
  #include 
@@ -809,7 +810,7 @@ static void hvm_ioreq_server_unmap_pages(struct hvm_ioreq_server *s,
  }
  }

-static void hvm_ioreq_server_free_rangesets(struct hvm_ioreq_server *s,
+static void hvm_ioreq_server_free_rb_rangesets(struct hvm_ioreq_server *s,


Did you need to change the name of the function here?

Got it. It's not necessary to change the name. :)




  bool_t is_default)
  {
  unsigned int i;
@@ -818,10 +819,10 @@ static void hvm_ioreq_server_free_rangesets(struct hvm_ioreq_server *s,
  return;

  for ( i = 0; i < NR_IO_RANGE_TYPES; i++ )
-rangeset_destroy(s->range[i]);
+rb_rangeset_destroy(s->range[i]);
  }

-static int hvm_ioreq_server_alloc_rangesets(struct hvm_ioreq_server *s,
+static int hvm_ioreq_server_alloc_rb_rangesets(struct hvm_ioreq_server *s,


Same here.


Yes, and thanks.



  bool_t is_default)
  {
  unsigned int i;
@@ -832,33 +833,20 @@ static int hvm_ioreq_server_alloc_rangesets(struct hvm_ioreq_server *s,

  for ( i = 0; i < NR_IO_RANGE_TYPES; i++ )
  {
-char *name;
-
-rc = asprintf(&name, "ioreq_server %d %s", s->id,
-  (i == HVMOP_IO_RANGE_PORT) ? "port" :
-  (i == HVMOP_IO_RANGE_MEMORY) ? "memory" :
-  (i == HVMOP_IO_RANGE_PCI) ? "pci" :
-  "");
-if ( rc )
-goto fail;
-
-s->range[i] = rangeset_new(s->domain, name,
-   RANGESETF_prettyprint_hex);
-
-xfree(name);
+s->range[i] = rb_rangeset_new();



I think assigning a name to the rangeset and having a debug-key dump is useful. 
Can you not duplicate that in your new implementation?



Well, I can add some dump routines, e.g. hvm_ioreq_server_dump_range(),
to dump the ranges inside each ioreq server of a domain. This routine
is similar to rangeset_domain_printk(). But unlike the rangeset,
which is also inserted into the domain->rangesets list, the new rb_rangeset
is only a member of the ioreq server. So we can dump the ranges inside a
domain by first accessing each ioreq server.
Do you think this approach is duplication?


  rc = -ENOMEM;
  if ( !s->range[i] )
  goto fail;

-rangeset_limit(s->range[i], MAX_NR_IO_RANGES);
+s->range[i]->nr_ranges = MAX_NR_IO_RANGES;


I'd add a limit function rather than just stooging into the structure fields.


Yes, will do. :)


  }

   done:
  return 0;

   fail:
-hvm_ioreq_server_free_rangesets(s, 0);
+hvm_ioreq_server_free_rb_rangesets(s, 0);



Without the name change this diff is gone.



Yes, and thanks.


  return rc;
  }
@@ -934,7 +922,7 @@ static int hvm_ioreq_server_init(struct
hv

Re: [Xen-devel] [PATCH] Refactor ioreq server for better performance

2015-07-01 Thread Yu, Zhang



On 6/30/2015 5:59 PM, Paul Durrant wrote:

-Original Message-
From: Yu, Zhang [mailto:yu.c.zh...@linux.intel.com]
Sent: 30 June 2015 08:11
To: Paul Durrant; xen-de...@lists.xenproject.org; Andrew Cooper;
jbeul...@suse.com; Kevin Tian; zhiyuan...@intel.com
Subject: Re: [Xen-devel] [PATCH] Refactor ioreq server for better
performance

Thank you, Paul.



No problem :-)

[snip]


I think assigning a name to the rangeset and having a debug-key dump is
useful. Can you not duplicate that in your new implementation?




Well, I can add some dump routines, e.g. hvm_ioreq_server_dump_range(),
to dump the ranges inside each ioreq server of a domain. This routine
is similar to rangeset_domain_printk(). But unlike the rangeset,
which is also inserted into the domain->rangesets list, the new rb_rangeset
is only a member of the ioreq server. So we can dump the ranges inside a
domain by first accessing each ioreq server.
Do you think this approach is duplication?



Either add an rb_rangesets list to the domain, or have the debug key walk the 
ioreq server list and dump the rangesets in each. The former is obviously the 
simplest.



Thanks, Paul.
Well, I agree the former approach would be simpler. But I still doubt
if this is more reasonable. :)
IIUC, one of the reasons for struct domain to have a rangeset list(and
a spinlock - rangesets_lock), is because there are iomem_caps and
irq_caps for each domain. These 2 rangeset members of struct domain are
platform independent.
However, struct rb_rangeset is only supposed to be used in ioreq
server, which is only for x86 hvm cases. Adding a rb_rangeset list
member(similarly, if so, a rb_rangesets_lock is also required) in
struct domain maybe useless for hardware domain and for platforms other
than x86.
So, I'd like to register a new debug key, to dump the ioreq server
informations, just like the keys to dump iommu p2m table or the irq
mappings. With a new debug key, we do not need to add a spinlock for
rb_rangeset in struct domain, the one in ioreq server would be enough.
Does this sound reasonable?
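For concreteness, the walk I have in mind would look roughly like the
sketch below (it reuses the Xen-internal types and fields from the patch;
rb_rangeset_printk() is a hypothetical dump helper, not existing code):

    /* Sketch only: dump every non-default ioreq server's ranges under
     * the existing per-domain ioreq server lock - no new lock needed. */
    void dump_ioreq_server_ranges(struct domain *d)
    {
        struct hvm_ioreq_server *s;
        unsigned int i;

        spin_lock(&d->arch.hvm_domain.ioreq_server.lock);

        list_for_each_entry ( s, &d->arch.hvm_domain.ioreq_server.list,
                              list_entry )
            for ( i = 0; i < NR_IO_RANGE_TYPES; i++ )
                rb_rangeset_printk(s->range[i]);  /* hypothetical helper */

        spin_unlock(&d->arch.hvm_domain.ioreq_server.lock);
    }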


[snip]


Is this limit enough? I think having something that's toolstack-tunable would
be more future-proof.




For now, this value would suffice. I'd prefer this as a default value.
As you know, I have also considered an xl param to do so. And one
question is: would a per-domain param be appropriate? I mean, each
hvm guest can have multiple ioreq servers. If this is acceptable, I can cook
another patch to do so, is this OK?



That's ok with me, but you may need to mention the idea of a follow-up patch in 
the check-in comment.


Sure, and thanks. :)

Yu

   Paul


Thanks
Yu


Paul



   struct hvm_ioreq_server {
   struct list_head   list_entry;
@@ -68,7 +68,7 @@ struct hvm_ioreq_server {
   /* Lock to serialize access to buffered ioreq ring */
   spinlock_t bufioreq_lock;
   evtchn_port_t  bufioreq_evtchn;
-struct rangeset*range[NR_IO_RANGE_TYPES];
+struct rb_rangeset *range[NR_IO_RANGE_TYPES];
   bool_t enabled;
   bool_t bufioreq_atomic;
   };
diff --git a/xen/include/xen/rb_rangeset.h
b/xen/include/xen/rb_rangeset.h
new file mode 100644
index 000..768230c
--- /dev/null
+++ b/xen/include/xen/rb_rangeset.h
@@ -0,0 +1,45 @@
+/*
+  Red-black tree based rangeset
+
+  This program is free software; you can redistribute it and/or modify
+  it under the terms of the GNU General Public License as published by
+  the Free Software Foundation; either version 2 of the License, or
+  (at your option) any later version.
+
+  This program is distributed in the hope that it will be useful,
+  but WITHOUT ANY WARRANTY; without even the implied warranty of
+  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+  GNU General Public License for more details.
+
+  You should have received a copy of the GNU General Public License
+  along with this program; if not, write to the Free Software
+  Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307 USA

+*/
+
+#ifndef __RB_RANGESET_H__
+#define __RB_RANGESET_H__
+
+#include 
+
+struct rb_rangeset {
+long nr_ranges;
+struct rb_root   rbroot;
+};
+
+struct rb_range {
+struct rb_node node;
+unsigned long s, e;
+};
+
+struct rb_rangeset *rb_rangeset_new(void);
+void rb_rangeset_destroy(struct rb_rangeset *r);
+bool_t rb_rangeset_overlaps_range(struct rb_rangeset *r,
+unsigned long s, unsigned long e);
+bool_t rb_rangeset_contains_range(
+struct rb_rangeset *r, unsigned long s, unsigned long e);
+int rb_rangeset_add_range(struct rb_rangeset *r,
+unsigned long s, unsigned long e);
+int rb_rangeset_remove_range(struct rb_rangeset *r,
+unsigned long s, unsigned long e);
+
+#endif /* __RB_RANGESET_H__ */
--
1.9.1



___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://

Re: [Xen-devel] Xen 4.6 Development Update (2 WEEKS TO FREEZE, important information in preamble)

2015-07-01 Thread Yu, Zhang



= Prognosis =

The states are: none -> fair -> ok -> good -> done

none - nothing yet
fair - still working on it, patches are prototypes or RFC
ok   - patches posted, acting on review
good - some last minute pieces
done - all done, might have bugs




*  Intel GVT-g (none)
requires refactoring ioreq-server, fixing 16-byte MMIO emulation
and optional PV IOMMU support
   -  Yu, Zhang

Hi Wei, following is the status of Intel GVT-g:

1> ioreq-server refactor: fair. Patch sent out by me, but not many
comments yet.

2> 16-byte MMIO emulation: I believe the status is good; several patch
versions have been sent out by Paul.

3> PV IOMMU: new draft in discussion. Malcolm Crossley has been
working on it. So the status should be none or fair?

Thanks
Yu



*  Porting Intel P-state driver to Xen (fair)
   -  Wang, Wei






___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel



___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH] Refactor ioreq server for better performance

2015-07-02 Thread Yu, Zhang

[snip]



Thanks, Paul.
Well, I agree the former approach would be simpler. But I still doubt
whether it is more reasonable. :)
IIUC, one of the reasons for struct domain to have a rangeset list (and
a spinlock - rangesets_lock) is that there are iomem_caps and
irq_caps for each domain. These 2 rangeset members of struct domain are
platform independent.
However, struct rb_rangeset is only supposed to be used in the ioreq
server, which is only for x86 hvm cases. Adding a rb_rangeset list
member (and similarly, if so, a rb_rangesets_lock would also be required) to
struct domain may be useless for the hardware domain and for platforms other
than x86.


Fair enough.


So, I'd like to register a new debug key to dump the ioreq server
information, just like the keys to dump the iommu p2m table or the irq
mappings. With a new debug key, we do not need to add a spinlock for
rb_rangeset in struct domain; the one in the ioreq server would be enough.
Does this sound reasonable?



That would be ok with me, but I'm not sure about claiming a whole debug key for 
this. Is there any other one that you could piggy-back on? If not, then maybe 
just make it part of the 'q' output.


Thanks, my new implementation uses the 'q' debug key. Will send out the
new version later. :)

Yu


   Paul



[snip]




___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel




___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel







___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


[Xen-devel] [PATCH 1/2] Resize the MAX_NR_IO_RANGES for ioreq server

2015-07-02 Thread Yu Zhang
MAX_NR_IO_RANGES is used by the ioreq server as the maximum
number of discrete ranges to be tracked. This patch changes
its value to 8k, so that more ranges can be tracked on the next
generation of Intel platforms in XenGT. Future patches can
extend the limit to be toolstack tunable, and MAX_NR_IO_RANGES
can serve as a default limit.

Signed-off-by: Yu Zhang 
---
 xen/include/asm-x86/hvm/domain.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/xen/include/asm-x86/hvm/domain.h b/xen/include/asm-x86/hvm/domain.h
index ad68fcf..d62fda9 100644
--- a/xen/include/asm-x86/hvm/domain.h
+++ b/xen/include/asm-x86/hvm/domain.h
@@ -49,7 +49,7 @@ struct hvm_ioreq_vcpu {
 };
 
 #define NR_IO_RANGE_TYPES (HVMOP_IO_RANGE_PCI + 1)
-#define MAX_NR_IO_RANGES  256
+#define MAX_NR_IO_RANGES  8192
 
 struct hvm_ioreq_server {
 struct list_head   list_entry;
-- 
1.9.1


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


[Xen-devel] [PATCH 2/2] Add new data structure to track ranges.

2015-07-02 Thread Yu Zhang
This patch introduces a new data structure, struct rb_rangeset,
to represent a group of continuous ranges, e.g. the start and end
addresses of PIO/MMIO regions. For now, this structure is intended
to help the ioreq server forward I/O requests to backend device
models more efficiently.

Behavior of this new data structure is quite similar to rangeset,
with the major difference being the time complexity. Based on a doubly
linked list, struct rangeset provides O(n) time complexity for
searching. struct rb_rangeset is based on a red-black tree; with
binary searching, the time complexity is improved to O(log(n)) -
more suitable for tracking massive numbers of discrete ranges.

Ioreq server code is changed to utilize this new type, and a new
routine, hvm_ioreq_server_dump_range_info, is added to dump all the
ranges tracked in an ioreq server.

Signed-off-by: Yu Zhang 
---
 xen/arch/x86/domain.c|   3 +
 xen/arch/x86/hvm/hvm.c   |  56 ++--
 xen/common/Makefile  |   1 +
 xen/common/rb_rangeset.c | 281 +++
 xen/include/asm-x86/hvm/domain.h |   2 +-
 xen/include/asm-x86/hvm/hvm.h|   1 +
 xen/include/xen/rb_rangeset.h|  49 +++
 7 files changed, 378 insertions(+), 15 deletions(-)
 create mode 100644 xen/common/rb_rangeset.c
 create mode 100644 xen/include/xen/rb_rangeset.h

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index a8fe046..f8a8b80 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -2086,6 +2086,9 @@ int domain_relinquish_resources(struct domain *d)
 void arch_dump_domain_info(struct domain *d)
 {
 paging_dump_domain_info(d);
+
+if ( is_hvm_domain(d) )
+hvm_ioreq_server_dump_range_info(d);
 }
 
 void arch_dump_vcpu_info(struct vcpu *v)
diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index 535d622..c79676e 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -37,6 +37,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -818,7 +819,7 @@ static void hvm_ioreq_server_free_rangesets(struct hvm_ioreq_server *s,
 return;
 
 for ( i = 0; i < NR_IO_RANGE_TYPES; i++ )
-rangeset_destroy(s->range[i]);
+rb_rangeset_destroy(s->range[i]);
 }
 
 static int hvm_ioreq_server_alloc_rangesets(struct hvm_ioreq_server *s, 
@@ -842,8 +843,7 @@ static int hvm_ioreq_server_alloc_rangesets(struct hvm_ioreq_server *s,
 if ( rc )
 goto fail;
 
-s->range[i] = rangeset_new(s->domain, name,
-   RANGESETF_prettyprint_hex);
+s->range[i] = rb_rangeset_new(name);
 
 xfree(name);
 
@@ -851,7 +851,7 @@ static int hvm_ioreq_server_alloc_rangesets(struct hvm_ioreq_server *s,
 if ( !s->range[i] )
 goto fail;
 
-rangeset_limit(s->range[i], MAX_NR_IO_RANGES);
+rb_rangeset_limit(s->range[i], MAX_NR_IO_RANGES);
 }
 
  done:
@@ -1149,7 +1149,7 @@ static int hvm_map_io_range_to_ioreq_server(struct domain *d, ioservid_t id,
 
 if ( s->id == id )
 {
-struct rangeset *r;
+struct rb_rangeset *r;
 
 switch ( type )
 {
@@ -1169,10 +1169,10 @@ static int hvm_map_io_range_to_ioreq_server(struct domain *d, ioservid_t id,
 break;
 
 rc = -EEXIST;
-if ( rangeset_overlaps_range(r, start, end) )
+if ( rb_rangeset_overlaps_range(r, start, end) )
 break;
 
-rc = rangeset_add_range(r, start, end);
+rc = rb_rangeset_add_range(r, start, end);
 break;
 }
 }
@@ -1200,7 +1200,7 @@ static int hvm_unmap_io_range_from_ioreq_server(struct domain *d, ioservid_t id,
 
 if ( s->id == id )
 {
-struct rangeset *r;
+struct rb_rangeset *r;
 
 switch ( type )
 {
@@ -1220,10 +1220,10 @@ static int hvm_unmap_io_range_from_ioreq_server(struct domain *d, ioservid_t id,
 break;
 
 rc = -ENOENT;
-if ( !rangeset_contains_range(r, start, end) )
+if ( !rb_rangeset_contains_range(r, start, end) )
 break;
 
-rc = rangeset_remove_range(r, start, end);
+rc = rb_rangeset_remove_range(r, start, end);
 break;
 }
 }
@@ -1349,6 +1349,34 @@ static void hvm_destroy_all_ioreq_servers(struct domain *d)
 spin_unlock(&d->arch.hvm_domain.ioreq_server.lock);
 }
 
+void  hvm_ioreq_server_dump_range_info(struct domain *d)
+{
+unsigned int i;
+struct hvm_ioreq_server *s;
+
+spin_lock(&d->arch.hvm_domain.ioreq_server.lock);
+
+list_for_each_entry ( s,
+  &d->arch.hvm_domain.ioreq_server.list,
+  list_entry )
+{
+if ( s == d->arch.hvm_domain.default_ioreq_server )
+continue;
+
+   

[Xen-devel] [PATCH 0/2] Refactor ioreq server for better performance.

2015-07-02 Thread Yu Zhang
XenGT leverages ioreq server to track and forward the accesses to
GPU I/O resources, e.g. the PPGTT (per-process graphic translation
tables). Currently, ioreq server uses rangeset to track the BDF/
PIO/MMIO ranges to be emulated. To select an ioreq server, the
rangeset is searched to see if the I/O range is recorded. However,
traversing the linked list inside rangeset can be time consuming
when the number of ranges is high. On the HSW platform, the number
of PPGTTs for each vGPU can be several hundred; on BDW, this value
can be several thousand.

To accommodate more ranges, the limit on the number of ranges in an
ioreq server, MAX_NR_IO_RANGES, is changed - future patches will be
provided to tune this with other approaches. And to increase the ioreq
server performance, a new data structure, rb_rangeset, is introduced.

Yu Zhang (2):
  Resize the MAX_NR_IO_RANGES for ioreq server
  Add new data structure to track ranges.

 xen/arch/x86/domain.c|   3 +
 xen/arch/x86/hvm/hvm.c   |  56 ++--
 xen/common/Makefile  |   1 +
 xen/common/rb_rangeset.c | 281 +++
 xen/include/asm-x86/hvm/domain.h |   4 +-
 xen/include/asm-x86/hvm/hvm.h|   1 +
 xen/include/xen/rb_rangeset.h|  49 +++
 7 files changed, 379 insertions(+), 16 deletions(-)
 create mode 100644 xen/common/rb_rangeset.c
 create mode 100644 xen/include/xen/rb_rangeset.h

-- 
1.9.1


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH 0/2] Refactor ioreq server for better performance.

2015-07-02 Thread Yu, Zhang

Oh, I forgot the version number and change history.
This patchset should be version 2.

The change history should be:
1> Split the original patch into 2;
2> Take Paul Durrant’s comments:
a> Add a name member in the struct rb_rangeset, and use the ‘q’
debug key to dump the ranges in ioreq server;
b> Keep original routine names for hvm ioreq server;
c> Commit message changes – mention that a future patch will change
the maximum ranges inside ioreq server;

Sorry, my fault. Could I add this change history in the next version,
or should I resend the version 2? :)

Thanks
Yu

On 7/2/2015 8:31 PM, Yu Zhang wrote:

XenGT leverages ioreq server to track and forward the accesses to
GPU I/O resources, e.g. the PPGTT (per-process graphic translation
tables). Currently, ioreq server uses rangeset to track the BDF/
PIO/MMIO ranges to be emulated. To select an ioreq server, the
rangeset is searched to see if the I/O range is recorded. However,
traversing the linked list inside rangeset can be time consuming
when the number of ranges is high. On the HSW platform, the number
of PPGTTs for each vGPU can be several hundred; on BDW, this value
can be several thousand.

To accommodate more ranges, the limit on the number of ranges in an
ioreq server, MAX_NR_IO_RANGES, is changed - future patches will be
provided to tune this with other approaches. And to increase the ioreq
server performance, a new data structure, rb_rangeset, is introduced.

Yu Zhang (2):
   Resize the MAX_NR_IO_RANGES for ioreq server
   Add new data structure to track ranges.

  xen/arch/x86/domain.c|   3 +
  xen/arch/x86/hvm/hvm.c   |  56 ++--
  xen/common/Makefile  |   1 +
  xen/common/rb_rangeset.c | 281 +++
  xen/include/asm-x86/hvm/domain.h |   4 +-
  xen/include/asm-x86/hvm/hvm.h|   1 +
  xen/include/xen/rb_rangeset.h|  49 +++
  7 files changed, 379 insertions(+), 16 deletions(-)
  create mode 100644 xen/common/rb_rangeset.c
  create mode 100644 xen/include/xen/rb_rangeset.h



___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


[Xen-devel] [PATCH v2 2/2] Add new data structure to track ranges.

2015-07-06 Thread Yu Zhang
This patch introduces a new data structure, struct rb_rangeset,
to represent a group of continuous ranges, e.g. the start and end
addresses of PIO/MMIO regions. For now, this structure is intended
to help the ioreq server forward I/O requests to backend device
models more efficiently.

Behavior of this new data structure is quite similar to rangeset,
with the major difference being the time complexity. Based on a doubly
linked list, struct rangeset provides O(n) time complexity for
searching. struct rb_rangeset is based on a red-black tree; with
binary searching, the time complexity is improved to O(log(n)) -
more suitable for tracking massive numbers of discrete ranges.

Ioreq server code is changed to utilize this new type, and a new
routine, hvm_ioreq_server_dump_range_info, is added to dump all the
ranges tracked in an ioreq server.

Signed-off-by: Yu Zhang 
---
 xen/arch/x86/domain.c|   3 +
 xen/arch/x86/hvm/hvm.c   |  56 ++--
 xen/common/Makefile  |   1 +
 xen/common/rb_rangeset.c | 281 +++
 xen/include/asm-x86/hvm/domain.h |   2 +-
 xen/include/asm-x86/hvm/hvm.h|   1 +
 xen/include/xen/rb_rangeset.h|  49 +++
 7 files changed, 378 insertions(+), 15 deletions(-)
 create mode 100644 xen/common/rb_rangeset.c
 create mode 100644 xen/include/xen/rb_rangeset.h

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index a8fe046..f8a8b80 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -2086,6 +2086,9 @@ int domain_relinquish_resources(struct domain *d)
 void arch_dump_domain_info(struct domain *d)
 {
 paging_dump_domain_info(d);
+
+if ( is_hvm_domain(d) )
+hvm_ioreq_server_dump_range_info(d);
 }
 
 void arch_dump_vcpu_info(struct vcpu *v)
diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index 535d622..c79676e 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -37,6 +37,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -818,7 +819,7 @@ static void hvm_ioreq_server_free_rangesets(struct hvm_ioreq_server *s,
 return;
 
 for ( i = 0; i < NR_IO_RANGE_TYPES; i++ )
-rangeset_destroy(s->range[i]);
+rb_rangeset_destroy(s->range[i]);
 }
 
 static int hvm_ioreq_server_alloc_rangesets(struct hvm_ioreq_server *s, 
@@ -842,8 +843,7 @@ static int hvm_ioreq_server_alloc_rangesets(struct hvm_ioreq_server *s,
 if ( rc )
 goto fail;
 
-s->range[i] = rangeset_new(s->domain, name,
-   RANGESETF_prettyprint_hex);
+s->range[i] = rb_rangeset_new(name);
 
 xfree(name);
 
@@ -851,7 +851,7 @@ static int hvm_ioreq_server_alloc_rangesets(struct hvm_ioreq_server *s,
 if ( !s->range[i] )
 goto fail;
 
-rangeset_limit(s->range[i], MAX_NR_IO_RANGES);
+rb_rangeset_limit(s->range[i], MAX_NR_IO_RANGES);
 }
 
  done:
@@ -1149,7 +1149,7 @@ static int hvm_map_io_range_to_ioreq_server(struct domain *d, ioservid_t id,
 
 if ( s->id == id )
 {
-struct rangeset *r;
+struct rb_rangeset *r;
 
 switch ( type )
 {
@@ -1169,10 +1169,10 @@ static int hvm_map_io_range_to_ioreq_server(struct domain *d, ioservid_t id,
 break;
 
 rc = -EEXIST;
-if ( rangeset_overlaps_range(r, start, end) )
+if ( rb_rangeset_overlaps_range(r, start, end) )
 break;
 
-rc = rangeset_add_range(r, start, end);
+rc = rb_rangeset_add_range(r, start, end);
 break;
 }
 }
@@ -1200,7 +1200,7 @@ static int hvm_unmap_io_range_from_ioreq_server(struct domain *d, ioservid_t id,
 
 if ( s->id == id )
 {
-struct rangeset *r;
+struct rb_rangeset *r;
 
 switch ( type )
 {
@@ -1220,10 +1220,10 @@ static int hvm_unmap_io_range_from_ioreq_server(struct domain *d, ioservid_t id,
 break;
 
 rc = -ENOENT;
-if ( !rangeset_contains_range(r, start, end) )
+if ( !rb_rangeset_contains_range(r, start, end) )
 break;
 
-rc = rangeset_remove_range(r, start, end);
+rc = rb_rangeset_remove_range(r, start, end);
 break;
 }
 }
@@ -1349,6 +1349,34 @@ static void hvm_destroy_all_ioreq_servers(struct domain *d)
 spin_unlock(&d->arch.hvm_domain.ioreq_server.lock);
 }
 
+void  hvm_ioreq_server_dump_range_info(struct domain *d)
+{
+unsigned int i;
+struct hvm_ioreq_server *s;
+
+spin_lock(&d->arch.hvm_domain.ioreq_server.lock);
+
+list_for_each_entry ( s,
+  &d->arch.hvm_domain.ioreq_server.list,
+  list_entry )
+{
+if ( s == d->arch.hvm_domain.default_ioreq_server )
+continue;
+
+   

[Xen-devel] [PATCH v2 0/2] Refactor ioreq server for better performance.

2015-07-06 Thread Yu Zhang
XenGT leverages ioreq server to track and forward the accesses to
GPU I/O resources, e.g. the PPGTT (per-process graphic translation
tables). Currently, ioreq server uses rangeset to track the BDF/
PIO/MMIO ranges to be emulated. To select an ioreq server, the
rangeset is searched to see if the I/O range is recorded. However,
traversing the linked list inside rangeset can be time consuming
when the number of ranges is high. On the HSW platform, the number
of PPGTTs for each vGPU can be several hundred; on BDW, this value
can be several thousand.

To accommodate more ranges, the limit on the number of ranges in an
ioreq server, MAX_NR_IO_RANGES, is changed - future patches will be
provided to tune this with other approaches. And to increase the ioreq
server performance, a new data structure, rb_rangeset, is introduced.

Changes in v2:
1> Split the original patch into 2;
2> Take Paul Durrant's comments:
  a> Add a name member in the struct rb_rangeset, and use the 'q'
debug key to dump the ranges in ioreq server;
  b> Keep original routine names for hvm ioreq server;
  c> Commit message changes - mention that a future patch will change
the maximum ranges inside ioreq server.

Yu Zhang (2):
  Resize the MAX_NR_IO_RANGES for ioreq server
  Add new data structure to track ranges.

 xen/arch/x86/domain.c|   3 +
 xen/arch/x86/hvm/hvm.c   |  56 ++--
 xen/common/Makefile  |   1 +
 xen/common/rb_rangeset.c | 281 +++
 xen/include/asm-x86/hvm/domain.h |   4 +-
 xen/include/asm-x86/hvm/hvm.h|   1 +
 xen/include/xen/rb_rangeset.h|  49 +++
 7 files changed, 379 insertions(+), 16 deletions(-)
 create mode 100644 xen/common/rb_rangeset.c
 create mode 100644 xen/include/xen/rb_rangeset.h

-- 
1.9.1


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


[Xen-devel] [PATCH v2 1/2] Resize the MAX_NR_IO_RANGES for ioreq server

2015-07-06 Thread Yu Zhang
MAX_NR_IO_RANGES is used by the ioreq server as the maximum
number of discrete ranges to be tracked. This patch changes
its value to 8k, so that more ranges can be tracked on the next
generation of Intel platforms in XenGT. Future patches can
extend the limit to be toolstack tunable, and MAX_NR_IO_RANGES
can serve as a default limit.

Signed-off-by: Yu Zhang 
---
 xen/include/asm-x86/hvm/domain.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/xen/include/asm-x86/hvm/domain.h b/xen/include/asm-x86/hvm/domain.h
index ad68fcf..d62fda9 100644
--- a/xen/include/asm-x86/hvm/domain.h
+++ b/xen/include/asm-x86/hvm/domain.h
@@ -49,7 +49,7 @@ struct hvm_ioreq_vcpu {
 };
 
 #define NR_IO_RANGE_TYPES (HVMOP_IO_RANGE_PCI + 1)
-#define MAX_NR_IO_RANGES  256
+#define MAX_NR_IO_RANGES  8192
 
 struct hvm_ioreq_server {
 struct list_head   list_entry;
-- 
1.9.1


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH 0/2] Refactor ioreq server for better performance.

2015-07-06 Thread Yu, Zhang

A new patchset with a v2 prefix and a change log was just sent out.
So please ignore this thread, and sorry for the inconvenience. :)

Yu

On 7/2/2015 11:07 PM, Yu, Zhang wrote:

Oh, I forgot the version number and change history.
This patchset should be version 2.

The change history should be:
1> Split the original patch into 2;
2> Take Paul Durrant’s comments:
 a> Add a name member in the struct rb_rangeset, and use the ‘q’
debug key to dump the ranges in ioreq server;
 b> Keep original routine names for hvm ioreq server;
 c> Commit message changes – mention that a future patch will change
the maximum ranges inside ioreq server;

Sorry, my fault. Could I add this change history in the next version,
or should I resend the version 2? :)

Thanks
Yu

On 7/2/2015 8:31 PM, Yu Zhang wrote:

XenGT leverages ioreq server to track and forward the accesses to
GPU I/O resources, e.g. the PPGTT (per-process graphic translation
tables). Currently, ioreq server uses rangeset to track the BDF/
PIO/MMIO ranges to be emulated. To select an ioreq server, the
rangeset is searched to see if the I/O range is recorded. However,
traversing the linked list inside rangeset can be time consuming
when the number of ranges is high. On the HSW platform, the number
of PPGTTs for each vGPU can be several hundred; on BDW, this value
can be several thousand.

To accommodate more ranges, the limit on the number of ranges in an
ioreq server, MAX_NR_IO_RANGES, is changed - future patches will be
provided to tune this with other approaches. And to increase the ioreq
server performance, a new data structure, rb_rangeset, is introduced.

Yu Zhang (2):
   Resize the MAX_NR_IO_RANGES for ioreq server
   Add new data structure to track ranges.

  xen/arch/x86/domain.c|   3 +
  xen/arch/x86/hvm/hvm.c   |  56 ++--
  xen/common/Makefile  |   1 +
  xen/common/rb_rangeset.c | 281
+++
  xen/include/asm-x86/hvm/domain.h |   4 +-
  xen/include/asm-x86/hvm/hvm.h|   1 +
  xen/include/xen/rb_rangeset.h|  49 +++
  7 files changed, 379 insertions(+), 16 deletions(-)
  create mode 100644 xen/common/rb_rangeset.c
  create mode 100644 xen/include/xen/rb_rangeset.h



___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel




___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH v2 1/2] Resize the MAX_NR_IO_RANGES for ioreq server

2015-07-07 Thread Yu, Zhang

Thanks a lot, George.

On 7/6/2015 10:06 PM, George Dunlap wrote:

On Mon, Jul 6, 2015 at 2:33 PM, Paul Durrant  wrote:

-Original Message-
From: George Dunlap [mailto:george.dun...@eu.citrix.com]
Sent: 06 July 2015 14:28
To: Paul Durrant; George Dunlap
Cc: Yu Zhang; xen-devel@lists.xen.org; Keir (Xen.org); Jan Beulich; Andrew
Cooper; Kevin Tian; zhiyuan...@intel.com
Subject: Re: [Xen-devel] [PATCH v2 1/2] Resize the MAX_NR_IO_RANGES for
ioreq server

On 07/06/2015 02:09 PM, Paul Durrant wrote:

-Original Message-
From: dunl...@gmail.com [mailto:dunl...@gmail.com] On Behalf Of
George Dunlap
Sent: 06 July 2015 13:50
To: Paul Durrant
Cc: Yu Zhang; xen-devel@lists.xen.org; Keir (Xen.org); Jan Beulich;

Andrew

Cooper; Kevin Tian; zhiyuan...@intel.com
Subject: Re: [Xen-devel] [PATCH v2 1/2] Resize the MAX_NR_IO_RANGES

for

ioreq server

On Mon, Jul 6, 2015 at 1:38 PM, Paul Durrant 
wrote:

-Original Message-
From: dunl...@gmail.com [mailto:dunl...@gmail.com] On Behalf Of
George Dunlap
Sent: 06 July 2015 13:36
To: Yu Zhang
Cc: xen-devel@lists.xen.org; Keir (Xen.org); Jan Beulich; Andrew

Cooper;

Paul Durrant; Kevin Tian; zhiyuan...@intel.com
Subject: Re: [Xen-devel] [PATCH v2 1/2] Resize the

MAX_NR_IO_RANGES

for

ioreq server

On Mon, Jul 6, 2015 at 7:25 AM, Yu Zhang 
wrote:

MAX_NR_IO_RANGES is used by ioreq server as the maximum
number of discrete ranges to be tracked. This patch changes
its value to 8k, so that more ranges can be tracked on next
generation of Intel platforms in XenGT. Future patches can
extend the limit to be toolstack tunable, and MAX_NR_IO_RANGES
can serve as a default limit.

Signed-off-by: Yu Zhang 


I said this at the Hackathon, and I'll say it here:  I think this is
the wrong approach.

The problem here is not that you don't have enough memory ranges.  The
problem is that you are not tracking memory ranges, but individual
pages.

You need to make a new interface that allows you to tag individual
gfns as p2m_mmio_write_dm, and then allow one ioreq server to get
notifications for all such writes.



I think that is conflating things. It's quite conceivable that more than one
ioreq server will handle write_dm pages. If we had enough types to have
two page types per server then I'd agree with you, but we don't.

What's conflating things is using an interface designed for *device
memory ranges* to instead *track writes to gfns*.


What's the difference? Are you asserting that all device memory ranges
have read side effects and therefore write_dm is not a reasonable
optimization to use? I would not want to make that assertion.

Using write_dm is not the problem; it's having thousands of memory
"ranges" of 4k each that I object to.

Which is why I suggested adding an interface to request updates to gfns
(by marking them write_dm), rather than abusing the io range interface.



And it's the assertion that use of write_dm will only be relevant to gfns, and
that all such notifications need only go to a single ioreq server, that I have
a problem with. Whilst the use of io ranges to track gfn updates is, I agree,
not ideal, I think the overloading of write_dm is not a step in the right
direction.


So there are two questions here.

First of all, I certainly think that the *interface* should be able to
be transparently extended to support multiple ioreq servers being able
to track gfns.  My suggestion was to add a hypercall that allows an
ioreq server to say, "Please send modifications to gfn N to ioreq
server X"; and that for the time being, only allow one such X to exist
at a time per domain.  That is, if ioreq server Y makes such a call
after ioreq server X has done so, return -EBUSY.  That way we can add
support when we need it.
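
To make that concrete, here is a minimal sketch of those semantics in
self-contained C. Every name here is invented for illustration; no
such interface exists in the tree today.

#include <errno.h>

typedef unsigned int ioservid_t;
#define IOSERVID_NONE ((ioservid_t)-1)

/* Per-domain record of which ioreq server receives gfn-write events. */
struct gfn_write_tracker {
    ioservid_t owner;                 /* IOSERVID_NONE if unclaimed */
};

/* "Please send modifications to tracked gfns to ioreq server id." */
static int track_gfn_writes(struct gfn_write_tracker *t, ioservid_t id)
{
    /* Only one subscriber per domain for now; a second ioreq server
     * asking gets -EBUSY, leaving room to lift the limit later. */
    if ( t->owner != IOSERVID_NONE && t->owner != id )
        return -EBUSY;
    t->owner = id;
    return 0;
}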



Well, I also agree the current implementation is probably not optimal.
And yes, it seems promiscuous (hope I did not use the wrong word :) )
to mix the device I/O ranges and the guest memory. But forwarding an
ioreq to the backend driver just by a p2m type? Although it would be
easy for XenGT to take this approach, I agree with Paul that this would
weaken the functionality of the ioreq server. Besides, is it appropriate
for a p2m type to be used this way? It seems strange to me.


In fact, you probably already have a problem with two ioreq servers,
because (if I recall correctly) you don't know for sure when a page


Fortunately, we do, and these unmapped page tables will be removed from
the rangeset of the ioreq server. So the following scenario won't happen. :)


has stopped being used as a GPU pagetable.  Consider the following
scenario:
1. Two devices, served by ioreq servers 1 and 2.
2. driver for device served by ioreq server 1 allocates a page, uses
it as a pagetable.  ioreq server 1 adds that pfn to the ranges it's
watching.
3. driver frees page back to guest OS; but ioreq server 1 doesn't know
so it doesn't release the range
4. driver f

Re: [Xen-devel] [PATCH v2 1/2] Resize the MAX_NR_IO_RANGES for ioreq server

2015-07-07 Thread Yu, Zhang

Hi Jan,

On 7/7/2015 10:04 PM, Jan Beulich wrote:

On 07.07.15 at 15:11,  wrote:

-----Original Message-----
From: Jan Beulich [mailto:jbeul...@suse.com]
Sent: 07 July 2015 13:53
To: Paul Durrant
Cc: Andrew Cooper; George Dunlap; Kevin Tian; zhiyuan...@intel.com; Zhang
Yu; xen-devel@lists.xen.org; Keir (Xen.org)
Subject: RE: [Xen-devel] [PATCH v2 1/2] Resize the MAX_NR_IO_RANGES for
ioreq server


On 07.07.15 at 11:23,  wrote:

I wonder, would it be sufficient - at this stage - to add a new mapping
sub-op to the HVM op to distinguish mapping of gfns vs. MMIO ranges. That
way we could use the same implementation underneath for now (using the
rb_rangeset, which I think stands on its own merits for MMIO ranges
anyway)

Which would be (taking into account the good description of the
differences between RAM and MMIO pages given by George
yesterday [I think])? I continue to not be convinced we need
this new rangeset type (the more that its name seems wrong,
since - as said by George - we're unlikely to deal with ranges
here).



I don't see that implementing rangesets on top of rb tree is a problem. IMO
it's a useful optimization in its own right since it takes something that's
currently O(n) and makes it O(log n) using an rb tree implementation that's
already there. In fact, perhaps we just make the current rangeset
implementation use rb trees underneath, then there's no need for the extra
API.


I wouldn't mind such an overall change (provided it doesn't
introduce new issues), but I'm not convinced of having two
rangeset implementations, and I don't see the lookup speed
as an issue with the uses we have for rangesets now. The
arguments for bumping the number of ranges, which would
possibly affect lookup in a negative way, haven't been
convincing to me so far (and XenGT isn't going to make 4.6
anyway).


I know that George and you have concerns about the differences
between MMIO and guest page tables, but I do not quite understand
why. :)

Although the granularity of the write-protected memory is 4K in our
case, the trapped address is still a guest physical address, not a
guest page frame number. We can see the range as a 4K-sized one,
right? Besides, in many cases the guest graphics page tables are
contiguous, and the size of a range inside the ioreq server is more
than 4K. As you can see, the existing code already tracks the guest
graphics page tables by rangeset; the difference here is that there
would be more page tables to take care of on BDW. Changing the
upper limit of the number of ranges inside an ioreq server does
not necessarily mean there would be that many.

About the rb_rangeset I introduced, it is only for the XenGT case,
because only in such cases is a more efficient data structure
necessary - maybe in the future there will be more uses for
this type.

Yes, XenGT may not make 4.6. And thanks for your frankness. :)
But I would still very much appreciate any advice and directions
you can give me.

B.R.
Yu

Jan







Re: [Xen-devel] [PATCH v2 1/2] Resize the MAX_NR_IO_RANGES for ioreq server

2015-07-07 Thread Yu, Zhang



On 7/7/2015 10:43 PM, Jan Beulich wrote:

On 07.07.15 at 16:30,  wrote:

I know that George and you have concerns about the differences
between MMIO and guest page tables, but I do not quite understand
why. :)


But you read George's very nice description of the differences? I
ask because if you did, I don't see why you re-raise the question
above.



Well, yes. I guess you mean this statement:
"the former is one or two actual ranges of a significant size; the
latter are (apparently) thousands of ranges of one page each."?
But I do not understand why this is abusing the io range interface.
Does the number matter so much? :)

B.R.
Yu


Jan




Re: [Xen-devel] [PATCH v2 1/2] Resize the MAX_NR_IO_RANGES for ioreq server

2015-07-07 Thread Yu, Zhang



On 7/7/2015 11:10 PM, Jan Beulich wrote:

On 07.07.15 at 16:49,  wrote:

On 7/7/2015 10:43 PM, Jan Beulich wrote:

On 07.07.15 at 16:30,  wrote:

I know that George and you have concerns about the differences
between MMIO and guest page tables, but I do not quite understand
why. :)


But you read George's very nice description of the differences? I
ask because if you did, I don't see why you re-raise the question
above.



Well, yes. I guess you mean this statement:
"the former is one or two actual ranges of a significant size; the
latter are (apparently) thousands of ranges of one page each."?
But I do not understand why this is abusing the io range interface.
Does the number matter so much? :)


Yes, we specifically set it that low so misbehaving tool stacks
(perhaps de-privileged) can't cause the hypervisor to allocate
undue amounts of memory for tracking these ranges. This
concern, btw, applies as much to the rb-rangesets.


Thanks for your explanation, Jan. : )
In fact, I have considered adding another patch to make this limit
toolstack-tunable. One problem I encountered is how to guarantee the
validity of the configured value - it must not over-consume the xen
heap. But I do agree there should definitely be more amendment patches.
A minimal sketch of the kind of validation I have in mind is below.
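
This is only a sketch; the names and the per-domain budget are invented
for illustration, not taken from any existing interface.

#include <errno.h>

#define RANGE_NODE_SIZE   64          /* assumed per-range tracking cost */
#define RANGE_HEAP_BUDGET (1u << 20)  /* assumed per-domain budget: 1 MiB */

/* Reject a toolstack-supplied max_ranges that would let a (possibly
 * de-privileged) toolstack make the hypervisor allocate unbounded
 * memory for range tracking. */
static int validate_max_ranges(unsigned int requested)
{
    if ( requested == 0 ||
         requested > RANGE_HEAP_BUDGET / RANGE_NODE_SIZE )
        return -EINVAL;
    return 0;
}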

B.R.
Yu


Plus the number you bump MAX_NR_IO_RANGES to is - as I
understood it - obtained phenomenologically, i.e. there's no
reason not to assume that some bigger graphics card may
need this to be further bumped. The current count is arbitrary
too, but limiting guests only in so far as there can't be more
than so many (possibly huge) MMIO ranges on the complete set
of devices passed through to it.

And finally, the I/O ranges are called I/O ranges because they
are intended to cover I/O memory. RAM clearly isn't I/O memory,
even if it may be accessed directly by devices.

Jan




[Xen-devel] [PATCH v4] x86: add p2m_mmio_write_dm

2014-11-28 Thread Yu Zhang
XenGT (Intel Graphics Virtualization technology, please refer to
https://01.org/xen/blogs/srclarkx/2013/graphics-virtualization-
xengt) driver runs inside Dom0 as a virtual graphics device model,
and needs to trap and emulate the guest's write operations to some
specific memory pages, like memory pages used by the guest graphics
driver as PPGTT (per-process graphics translation table). We agreed
to add a new p2m type, "p2m_mmio_write_dm", to trap the write
operations on these graphics page tables.

Handling of this new p2m type is similar to the existing p2m_ram_ro
in most condition checks, the only difference being the final policy
of emulation vs. drop. For p2m_mmio_write_dm type pages, writes will
go to the device model via ioreq-server.

Previously, the conclusion of our v3 patch review was to provide a
more generalized HVMOP_map_io_range_to_ioreq_server hypercall, by
separating rangesets inside an ioreq server into read-protected/write-
protected/both-protected. Yet, after offline discussion with Paul,
we believe a more simplified solution may suffice. We can keep the
existing HVMOP_map_io_range_to_ioreq_server hypercall, and let the
user decide whether or not a p2m type change is necessary, because
in most cases the emulator will already use the p2m_mmio_dm type.

Changes from v3:
 - Use the existing HVMOP_map_io_range_to_ioreq_server hypercall
   to add write protected range.
 - Modify the HVMOP_set_mem_type hypercall to support the new p2m
   type for this range

Changes from v2:
 - Remove execute attribute of the new p2m type p2m_mmio_write_dm
 - Use the existing rangeset for keeping the write-protected page range
   instead of introducing a hash table.
 - Some code style fix.

Changes from v1:
 - Changes the new p2m type name from p2m_ram_wp to p2m_mmio_write_dm.
   This means that we treat the pages as a special mmio range instead
   of ram.
 - Move macros to c file since only this file is using them.
 - Address various comments from Jan.

Signed-off-by: Yu Zhang 
Signed-off-by: Wei Ye 
---
 xen/arch/x86/hvm/hvm.c  | 13 ++---
 xen/arch/x86/mm/p2m-ept.c   |  1 +
 xen/arch/x86/mm/p2m-pt.c|  1 +
 xen/include/asm-x86/p2m.h   |  1 +
 xen/include/public/hvm/hvm_op.h |  1 +
 5 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index 8f49b44..5f806e8 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -2838,7 +2838,8 @@ int hvm_hap_nested_page_fault(paddr_t gpa, unsigned long gla,
  * to the mmio handler.
  */
 if ( (p2mt == p2m_mmio_dm) || 
- (npfec.write_access && (p2mt == p2m_ram_ro)) )
+ (npfec.write_access &&
+   ((p2mt == p2m_ram_ro) || (p2mt == p2m_mmio_write_dm))) )
 {
 put_gfn(p2m->domain, gfn);
 
@@ -5922,6 +5923,8 @@ long do_hvm_op(unsigned long op, XEN_GUEST_HANDLE_PARAM(void) arg)
 a.mem_type =  HVMMEM_ram_rw;
 else if ( p2m_is_grant(t) )
 a.mem_type =  HVMMEM_ram_rw;
+else if ( t == p2m_mmio_write_dm )
+a.mem_type = HVMMEM_mmio_write_dm;
 else
 a.mem_type =  HVMMEM_mmio_dm;
 rc = __copy_to_guest(arg, &a, 1) ? -EFAULT : 0;
@@ -5941,7 +5944,8 @@ long do_hvm_op(unsigned long op, XEN_GUEST_HANDLE_PARAM(void) arg)
 static const p2m_type_t memtype[] = {
 [HVMMEM_ram_rw]  = p2m_ram_rw,
 [HVMMEM_ram_ro]  = p2m_ram_ro,
-[HVMMEM_mmio_dm] = p2m_mmio_dm
+[HVMMEM_mmio_dm] = p2m_mmio_dm,
+[HVMMEM_mmio_write_dm] = p2m_mmio_write_dm
 };
 
 if ( copy_from_guest(&a, arg, 1) )
@@ -5987,14 +5991,17 @@ long do_hvm_op(unsigned long op, XEN_GUEST_HANDLE_PARAM(void) arg)
 rc = -EAGAIN;
 goto param_fail4;
 }
+
 if ( !p2m_is_ram(t) &&
- (!p2m_is_hole(t) || a.hvmmem_type != HVMMEM_mmio_dm) )
+ (!p2m_is_hole(t) || a.hvmmem_type != HVMMEM_mmio_dm) &&
+ t != p2m_mmio_write_dm )
 {
 put_gfn(d, pfn);
 goto param_fail4;
 }
 
 rc = p2m_change_type_one(d, pfn, t, memtype[a.hvmmem_type]);
+
 put_gfn(d, pfn);
 if ( rc )
 goto param_fail4;
diff --git a/xen/arch/x86/mm/p2m-ept.c b/xen/arch/x86/mm/p2m-ept.c
index 15c6e83..e21a92d 100644
--- a/xen/arch/x86/mm/p2m-ept.c
+++ b/xen/arch/x86/mm/p2m-ept.c
@@ -136,6 +136,7 @@ static void ept_p2m_type_to_flags(ept_entry_t *entry, p2m_type_t type, p2m_acces
 entry->x = 0;
 break;
 case p2m_grant_map_ro:
+case p2m_mmio_write_dm:
 entry->r = 1;
 entry->w = entry->x = 0;
 break;
diff --git a/xen/arch/x86/mm/p2m-pt.c b/xen/arch/x86/mm/p2m-pt.c
index e48b63a..26fb18d 100644
--- a/xen/arch/x86/mm/p2m-pt.c

Re: [Xen-devel] [PATCH v4] x86: add p2m_mmio_write_dm

2014-12-01 Thread Yu, Zhang



On 11/28/2014 5:57 PM, Jan Beulich wrote:

On 28.11.14 at 08:59,  wrote:

--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -2838,7 +2838,8 @@ int hvm_hap_nested_page_fault(paddr_t gpa, unsigned long gla,
   * to the mmio handler.
   */
  if ( (p2mt == p2m_mmio_dm) ||
- (npfec.write_access && (p2mt == p2m_ram_ro)) )
+ (npfec.write_access &&
+   ((p2mt == p2m_ram_ro) || (p2mt == p2m_mmio_write_dm))) )


Why are you corrupting indentation here?

Thanks for your comments, Jan.
The indentation here is to make sure that
((p2mt == p2m_ram_ro) || (p2mt == p2m_mmio_write_dm)) is grouped
together. But I am not sure if this is correct according to the xen
coding style. I may have misunderstood your previous comment on Sep 3,
which said "the indentation would need adjustment" in reply to
"[Xen-devel] [PATCH v3 1/2] x86: add p2m_mmio_write_dm".




Furthermore the code you modify here suggests that p2m_ram_ro
already has the needed semantics - writes get passed to the DM.
None of the other changes you make, and none of the other uses
of p2m_ram_ro appear to be in conflict with your intentions, so
you'd really need to explain better why you need the new type.


Thanks Jan.
To my understanding, pages with p2m_ram_ro are not supposed to be
modified by the guest. So in __hvm_copy(), when the p2m type of a page
is p2m_ram_ro, no copy will occur.
However, for our usage we just want this page to be write-protected, so
that our device model can be triggered to do some emulation. The content
written to this page is not supposed to be dropped. This way, if a read
operation is subsequently performed by the guest on this page, the
guest will still see its anticipated value. A toy model of the
difference follows.
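
The following is plain C, not Xen code; it only models the behavioural
difference being described: p2m_ram_ro drops guest writes, so a
read-back sees the stale value, while p2m_mmio_write_dm hands the write
to the device model, so a read-back sees what the guest wrote.

#include <assert.h>

static unsigned int page;                 /* one guest "page" word */

static void write_ram_ro(unsigned int v)
{
    (void)v;                              /* write is logged and discarded */
}

static void write_mmio_write_dm(unsigned int v)
{
    page = v;                             /* device model emulates the write */
}

int main(void)
{
    page = 0xAA;
    write_ram_ro(0x55);
    assert(page == 0xAA);                 /* stale: the write was dropped */
    write_mmio_write_dm(0x55);
    assert(page == 0x55);                 /* guest sees its own write back */
    return 0;
}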


Maybe I need to update my commit message to explain this more clearly. :)


@@ -5922,6 +5923,8 @@ long do_hvm_op(unsigned long op, XEN_GUEST_HANDLE_PARAM(void) arg)
  a.mem_type =  HVMMEM_ram_rw;
  else if ( p2m_is_grant(t) )
  a.mem_type =  HVMMEM_ram_rw;
+else if ( t == p2m_mmio_write_dm )
+a.mem_type = HVMMEM_mmio_write_dm;
  else
  a.mem_type =  HVMMEM_mmio_dm;
  rc = __copy_to_guest(arg, &a, 1) ? -EFAULT : 0;
@@ -5941,7 +5944,8 @@ long do_hvm_op(unsigned long op, XEN_GUEST_HANDLE_PARAM(void) arg)
  static const p2m_type_t memtype[] = {
  [HVMMEM_ram_rw]  = p2m_ram_rw,
  [HVMMEM_ram_ro]  = p2m_ram_ro,
-[HVMMEM_mmio_dm] = p2m_mmio_dm
+[HVMMEM_mmio_dm] = p2m_mmio_dm,
+[HVMMEM_mmio_write_dm] = p2m_mmio_write_dm
  };

  if ( copy_from_guest(&a, arg, 1) )
@@ -5987,14 +5991,17 @@ long do_hvm_op(unsigned long op, XEN_GUEST_HANDLE_PARAM(void) arg)
  rc = -EAGAIN;
  goto param_fail4;
  }
+


Stray addition of a blank line?


  if ( !p2m_is_ram(t) &&
- (!p2m_is_hole(t) || a.hvmmem_type != HVMMEM_mmio_dm) )
+ (!p2m_is_hole(t) || a.hvmmem_type != HVMMEM_mmio_dm) &&
+ t != p2m_mmio_write_dm )


Do you really want to permit e.g. transitions between mmio_dm and
mmio_write_dm? We should be as restrictive as possible here to not
open up paths to security problems.


  {
  put_gfn(d, pfn);
  goto param_fail4;
  }

  rc = p2m_change_type_one(d, pfn, t, memtype[a.hvmmem_type]);
+
  put_gfn(d, pfn);
  if ( rc )
  goto param_fail4;


Another stray newline addition.


--- a/xen/include/asm-x86/p2m.h
+++ b/xen/include/asm-x86/p2m.h
@@ -72,6 +72,7 @@ typedef enum {
  p2m_ram_shared = 12,  /* Shared or sharable memory */
  p2m_ram_broken = 13,  /* Broken page, access cause domain crash */
  p2m_map_foreign  = 14,/* ram pages from foreign domain */
+p2m_mmio_write_dm = 15,   /* Read-only; writes go to the device model */
  } p2m_type_t;

  /* Modifiers to the query */



If the new type is really needed, shouldn't this get added to
P2M_RO_TYPES?

Well, previously, I wished to differentiate HVMMEM_ram_ro and the
newly added HVMMEM_mmio_write_dm in the
HVMOP_get_mem_type/HVMOP_set_mem_type hypercalls. I'd rather think of
this new type as write-protected than read-only.

But I'll take a look to see if this can be added to P2M_RO_TYPES. Thanks.

Yu


Jan




Re: [Xen-devel] [PATCH v4] x86: add p2m_mmio_write_dm

2014-12-01 Thread Yu, Zhang



On 12/1/2014 8:13 PM, Tim Deegan wrote:

At 11:17 + on 01 Dec (1417429027), Jan Beulich wrote:

On 01.12.14 at 11:30,  wrote:

At 09:32 + on 01 Dec (1417422746), Jan Beulich wrote:

On 01.12.14 at 09:49,  wrote:

To my understanding, pages with p2m_ram_ro are not supposed to be
modified by the guest. So in __hvm_copy(), when the p2m type of a page
is p2m_ram_ro, no copy will occur.
However, for our usage we just want this page to be write-protected, so
that our device model can be triggered to do some emulation. The content
written to this page is not supposed to be dropped. This way, if a read
operation is subsequently performed by the guest on this page, the
guest will still see its anticipated value.


__hvm_copy() is only a helper function, and doesn't write to
mmio_dm space either; instead its (indirect) callers would invoke
hvmemul_do_mmio() upon seeing HVMCOPY_bad_gfn_to_mfn
returns. The question hence is about the apparent inconsistency
resulting from writes to ram_ro being dropped here but getting
passed to the DM by hvm_hap_nested_page_fault(). Tim - is
that really intentional?


No - and AFAICT it shouldn't be happening.  It _is_ how it was
implemented originally, because it involved fewer moving parts and
didn't need to be efficient (and after all, writes to entirely missing
addresses go to the device model too).

But the code was later updated to log and discard writes to read-only
memory (see 4d8aa29 from Trolle Selander).

Early version of p2m_ram_ro were documented in the internal headers as
sending the writes to the DM, but the public interface (HVMMEM_ram_ro)
has always said that writes are discarded.


Hmm, so which way do you recommend resolving the inconsistency
then - match what the public interface says or what the apparent
original intention for the internal type was? Presumably we need to
follow the public interface mandated model, and hence require the
new type to be introduced.


Sorry, I was unclear -- there isn't an inconsistency; both internal
and public headers currently say that writes are discarded and AFAICT
that is what the code does.

But yes, we ought to follow the established hypercall interface, and
so we need the new type.


During this bit of archaeology I realised that either this new type
should _not_ be made part of P2M_RO_TYPES, or, better, we need a new
class of P2M types (P2M_DISCARD_WRITE_TYPES, say) that should be used
for these paths in emulate_gva_to_mfn() and __hvm_copy(), containing
just p2m_ram_ro and p2m_grant_map_ro.


And I suppose in that latter case the new type could be made part
of P2M_RO_TYPES()?


Yes indeed, as P2M_RO_TYPES is defined as "must have the _PAGE_RW bit
clear in their PTEs".



Thanks Tim.
The following is my understanding of P2M_RO_TYPES and your comments.
Not sure if I got it right; please correct me if anything is wrong:
1> P2M_RO_TYPES now bears 2 meanings: one is "the W bit is clear in the
pte"; the other is to discard the write operations;

2> We had better define another class to bear the second meaning.

Also some questions on the new p2m class, say P2M_DISCARD_WRITE_TYPES,
and the new predicate, say p2m_is_discard_write:
1> You mentioned emulate_gva_to_mfn() and __hvm_copy() should discard
the write ops, yet I also noticed many other places using
p2m_is_readonly, or only the "p2mt == p2m_ram_ro" judgement (in
__hvm_copy/__hvm_clear). Among all these other places, are there any
that are also supposed to use p2m_is_discard_write?
2> Writes to p2m_grant_map_ro are also supposed to be discarded? Should
handling of this type of page go into __hvm_copy()/__hvm_clear()?


I'm new to this area; sorry for my messy questions. :)

Yu





Cheers,

Tim.






Re: [Xen-devel] [PATCH v4] x86: add p2m_mmio_write_dm

2014-12-01 Thread Yu, Zhang



On 12/1/2014 8:31 PM, Jan Beulich wrote:

On 01.12.14 at 13:13,  wrote:

At 11:17 + on 01 Dec (1417429027), Jan Beulich wrote:

On 01.12.14 at 11:30,  wrote:

At 09:32 + on 01 Dec (1417422746), Jan Beulich wrote:

On 01.12.14 at 09:49,  wrote:

To my understanding, pages with p2m_ram_ro are not supposed to be
modified by the guest. So in __hvm_copy(), when the p2m type of a page
is p2m_ram_ro, no copy will occur.
However, for our usage we just want this page to be write-protected, so
that our device model can be triggered to do some emulation. The content
written to this page is not supposed to be dropped. This way, if a read
operation is subsequently performed by the guest on this page, the
guest will still see its anticipated value.


__hvm_copy() is only a helper function, and doesn't write to
mmio_dm space either; instead its (indirect) callers would invoke
hvmemul_do_mmio() upon seeing HVMCOPY_bad_gfn_to_mfn
returns. The question hence is about the apparent inconsistency
resulting from writes to ram_ro being dropped here but getting
passed to the DM by hvm_hap_nested_page_fault(). Tim - is
that really intentional?


No - and AFAICT it shouldn't be happening.  It _is_ how it was
implemented originally, because it involved fewer moving parts and
didn't need to be efficient (and after all, writes to entirely missing
addresses go to the device model too).

But the code was later updated to log and discard writes to read-only
memory (see 4d8aa29 from Trolle Selander).

Early version of p2m_ram_ro were documented in the internal headers as
sending the writes to the DM, but the public interface (HVMMEM_ram_ro)
has always said that writes are discarded.


Hmm, so which way do you recommend resolving the inconsistency
then - match what the public interface says or what the apparent
original intention for the internal type was? Presumably we need to
follow the public interface mandated model, and hence require the
new type to be introduced.


Sorry, I was unclear -- there isn't an inconsistency; both internal
and public headers currently say that writes are discarded and AFAICT
that is what the code does.


Not for hvm_hap_nested_page_fault() afaict - the forwarding to
DM there contradicts the "writes are discarded" model that other
code paths follow.


Thanks, Jan.
By "inconsistency", do you mean the p2m_ram_ro shall not trigger the 
handle_mmio_with_translation() in hvm_hap_nested_page_fault()?
I'm also a bit confused with the "writes are discarded/dropped" comments 
in the code. Does this mean writes to the p2m_ram_ro pages should be 
abandoned without going to the dm, or going to the dm and  ignored 
later? The code seems to be the second one.



Jan





___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH v4] x86: add p2m_mmio_write_dm

2014-12-02 Thread Yu, Zhang



On 12/2/2014 7:40 PM, Tim Deegan wrote:

At 15:38 +0800 on 02 Dec (1417531126), Yu, Zhang wrote:

On 12/1/2014 8:13 PM, Tim Deegan wrote:

At 11:17 + on 01 Dec (1417429027), Jan Beulich wrote:

On 01.12.14 at 11:30,  wrote:

During this bit of archaeology I realised that either this new type
should _not_ be made part of P2M_RO_TYPES, or, better, we need a new
class of P2M types (P2M_DISCARD_WRITE_TYPES, say) that should be used
for these paths in emulate_gva_to_mfn() and __hvm_copy(), containing
just p2m_ram_ro and p2m_grant_map_ro.


And I suppose in that latter case the new type could be made part
of P2M_RO_TYPES()?


Yes indeed, as P2M_RO_TYPES is defined as "must have the _PAGE_RW bit
clear in their PTEs".



Thanks Tim.
The following is my understanding of P2M_RO_TYPES and your comments.
Not sure if I got it right; please correct me if anything is wrong:
1> P2M_RO_TYPES now bears 2 meanings: one is "the W bit is clear in the
pte"; the other is to discard the write operations;
2> We had better define another class to bear the second meaning.


Yes, that's what I meant.

Answering your other questions in reverse order:


2> Writes to p2m_grant_map_ro are also supposed to be discarded? Should
handling of this type of page go into __hvm_copy()/__hvm_clear()?


I think so, yes.  At the moment we inject #GP when the guest writes to
a read-only grant, which is OK: the guest really ought to know better.
But I think we'll probably end up with neater code if we handle
read-only grants the same way as p2m_ram_ro.

Anyone else have an opinion on the right thing to do here?


Also some questions on the new p2m class, say P2M_DISCARD_WRITE_TYPES,
and the new predicate, say p2m_is_discard_write:
1> You mentioned emulate_gva_to_mfn() and __hvm_copy() should discard
the write ops, yet I also noticed many other places using
p2m_is_readonly, or only the "p2mt == p2m_ram_ro" judgement (in
__hvm_copy/__hvm_clear). Among all these other places, are there any
that are also supposed to use p2m_is_discard_write?


I've just had a look through them all, and I can see exactly four
places that should be using the new p2m_is_discard_write() test:

  - emulate_gva_to_mfn() (Though in fact it's a no-op as shadow-mode
guests never have p2m_ram_shared or p2m_ram_logdirty mappings.)
  - __hvm_copy()
  - __hvm_clear() and
  - hvm_hap_nested_page_fault() (where you should also remove the
explicit handling of p2m_grant_map_ro below.)


Thank you, Tim & Jan.
To summarize all the comments:

1> a new p2m type, p2m_mmio_write_dm, is to be added;
2> the new p2m type needs to be added to the P2M_RO_TYPES class;
3> a new p2m class, say P2M_DISCARD_WRITE_TYPES (which only includes
p2m_ram_ro and p2m_grant_map_ro), and a new predicate, say
p2m_is_discard_write, are needed in those 4 places to discard the
write op;
4> and of course hvm_hap_nested_page_fault() does not need the special
handling for p2m_grant_map_ro anymore;

5> coding style changes pointed out by Jan;
6> clarify the commit message.

I'll prepare the patch. Thanks! :)

Yu


Looking through that turned up a few other oddities, which I'm
listing here to remind myself to look at them later (i.e. you don't
need to worry about them for this patch):

  - nsvm_get_nvmcb_page() and nestedhap_walk_L0_p2m() need to handle
p2m_ram_logdirty or they might spuriously fail during live
migration.
  - __hvm_copy() and __hvm_clear are probably over-strict in their
failure to handle grant types.
  - P2M_UNMAP_TYPES in vmce.c is a mess.  It's not the right place to
define this, since it definitely won't be seen by anyone
adding a new type, and it already has an 'XXX' comment that says
it doesn't cover a lot of cases. :(

I'll have a look at those another time.

Cheers,

Tim.






Re: [Xen-devel] [PATCH v4] x86: add p2m_mmio_write_dm

2014-12-04 Thread Yu, Zhang



On 12/2/2014 7:40 PM, Tim Deegan wrote:

At 15:38 +0800 on 02 Dec (1417531126), Yu, Zhang wrote:

On 12/1/2014 8:13 PM, Tim Deegan wrote:

At 11:17 + on 01 Dec (1417429027), Jan Beulich wrote:

On 01.12.14 at 11:30,  wrote:

During this bit of archaeology I realised that either this new type
should _not_ be made part of P2M_RO_TYPES, or, better, we need a new
class of P2M types (P2M_DISCARD_WRITE_TYPES, say) that should be used
for these paths in emulate_gva_to_mfn() and __hvm_copy(), containing
just p2m_ram_ro and p2m_grant_map_ro.


And I suppose in that latter case the new type could be made part
of P2M_RO_TYPES()?


Yes indeed, as P2M_RO_TYPES is defined as "must have the _PAGE_RW bit
clear in their PTEs".



Thanks Tim.
The following is my understanding of P2M_RO_TYPES and your comments.
Not sure if I got it right; please correct me if anything is wrong:
1> P2M_RO_TYPES now bears 2 meanings: one is "the W bit is clear in the
pte"; the other is to discard the write operations;
2> We had better define another class to bear the second meaning.


Yes, that's what I meant.

Answering your other questions in reverse order:


2> Writes to p2m_grant_map_ro are also supposed to be discarded? Should
handling of this type of page go into __hvm_copy()/__hvm_clear()?


I think so, yes.  At the moment we inject #GP when the guest writes to
a read-only grant, which is OK: the guest really ought to know better.
But I think we'll probably end up with neater code if we handle
read-only grants the same way as p2m_ram_ro.

Anyone else have an opinion on the right thing to do here?


Also some questions on the new p2m class, say P2M_DISCARD_WRITE_TYPES,
and the new predicate, say p2m_is_discard_write:
1> You mentioned emulate_gva_to_mfn() and __hvm_copy() should discard
the write ops, yet I also noticed many other places using
p2m_is_readonly, or only the "p2mt == p2m_ram_ro" judgement (in
__hvm_copy/__hvm_clear). Among all these other places, are there any
that are also supposed to use p2m_is_discard_write?


I've just had a look through them all, and I can see exactly four
places that should be using the new p2m_is_discard_write() test:

  - emulate_gva_to_mfn() (Though in fact it's a no-op as shadow-mode
guests never have p2m_ram_shared or p2m_ram_logdirty mappings.)
  - __hvm_copy()
  - __hvm_clear() and
  - hvm_hap_nested_page_fault() (where you should also remove the
explicit handling of p2m_grant_map_ro below.)

Looking through that turned up a few other oddities, which I'm
listing here to remind myself to look at them later (i.e. you don't
need to worry about them for this patch):

  - nsvm_get_nvmcb_page() and nestedhap_walk_L0_p2m() need to handle
p2m_ram_logdirty or they might spuriously fail during live
migration.
  - __hvm_copy() and __hvm_clear are probably over-strict in their
failure to handle grant types.


Hi Tim. Sorry to bother you. :)
I just noticed that in __hvm_copy()/__hvm_clear(), the grant types are
handled before p2m_ram_ro - they return HVMCOPY_unhandleable. So if
p2m_is_discard_write() is supposed to replace the handling of
p2m_ram_ro, handling of p2m_grant_map_ro will still return
HVMCOPY_unhandleable before the p2m_is_discard_write() predicate is
reached. Even if we move the p2m_is_discard_write() test before the
handling of grant types, it seems not quite clean.
By "over-strict in their failure to handle grant types", do you also
mean this?


Thanks
Yu


  - P2M_UNMAP_TYPES in vmce.c is a mess.  It's not the right place to
define this, since it definitely won't be seen by anyone
adding a new type, and it already has an 'XXX' comment that says
it doesn't cover a lot of cases. :(

I'll have a look at those another time.

Cheers,

Tim.



Re: [Xen-devel] [PATCH v4] x86: add p2m_mmio_write_dm

2014-12-04 Thread Yu, Zhang



On 12/4/2014 5:36 PM, Tim Deegan wrote:

Hi,

At 17:01 +0800 on 04 Dec (1417708878), Yu, Zhang wrote:

I just noticed that in __hvm_copy()/__hvm_clear(), the grant types are
handled before p2m_ram_ro - they return HVMCOPY_unhandleable. So if
p2m_is_discard_write() is supposed to replace the handling of
p2m_ram_ro, handling of p2m_grant_map_ro will still return
HVMCOPY_unhandleable before the p2m_is_discard_write() predicate is
reached. Even if we move the p2m_is_discard_write() test before the
handling of grant types, it seems not quite clean.
By "over-strict in their failure to handle grant types", do you also
mean this?


Yes, that's the sort of thing I meant.  I'll try to write a patch for
that later today or next week -- in the meantime I think you should
ignore it. :)

An unrelated thought: when you send your next version can you send it
as a two-patch series, where the first patch does the
p2m_is_discard_write() changes and the second adds your new type?


Sure, and thank you, Tim. :)


Cheers,

Tim.



[Xen-devel] [PATCH v5 0/2] add new p2m type class and new p2m type

2014-12-04 Thread Yu Zhang
XenGT (Intel Graphics Virtualization technology, please refer to
https://01.org/xen/blogs/srclarkx/2013/graphics-virtualization-
xengt) driver runs inside Dom0 as a virtual graphics device model,
and needs to trap and emulate the guest's write operations to some
specific memory pages, like memory pages used by the guest graphics
driver as PPGTT (per-process graphics translation table). We added
a new p2m type, p2m_mmio_write_dm, to trap and emulate the write
operations on these graphics page tables.

Handling of this new p2m type is similar to the existing p2m_ram_ro
in most condition checks, the only difference being the final policy
of emulation vs. drop. For p2m_ram_ro types, write operations will not
trigger the device model, and will be discarded later in __hvm_copy();
while for p2m_mmio_write_dm type pages, writes will go to the
device model via ioreq-server.

Previously, the conclusion of our v3 patch review was to provide a
more generalized HVMOP_map_io_range_to_ioreq_server hypercall, by
separating rangesets inside an ioreq server into read-protected/write-
protected/both-protected. Yet, after offline discussion with Paul,
we believe a more simplified solution may suffice. We can keep the
existing HVMOP_map_io_range_to_ioreq_server hypercall, and let the
user decide whether or not a p2m type change is necessary, because
in most cases the emulator will already use the p2m_mmio_dm type.

Changes from v4:
 - A new p2m type class, P2M_DISCARD_WRITE_TYPES, is added;
 - A new predicate, p2m_is_discard_write, is used in __hvm_copy()/
   __hvm_clear()/emulate_gva_to_mfn()/hvm_hap_nested_page_fault(),
   to discard the write operations;
 - The new p2m type, p2m_mmio_write_dm, is added to P2M_RO_TYPES;
 - Coding style changes;

Changes from v3:
 - Use the existing HVMOP_map_io_range_to_ioreq_server hypercall
   to add write protected range;
 - Modify the HVMOP_set_mem_type hypercall to support the new p2m
   type for this range.

Changes from v2:
 - Remove execute attribute of the new p2m type p2m_mmio_write_dm;
 - Use the existing rangeset for keeping the write-protected page range
   instead of introducing a hash table;
 - Some code style fix.

Changes from v1:
 - Changes the new p2m type name from p2m_ram_wp to p2m_mmio_write_dm.
   This means that we treat the pages as a special mmio range instead
   of ram;
 - Move macros to c file since only this file is using them.
 - Address various comments from Jan.

Yu Zhang (2):
  Add a new p2m type class - P2M_DISCARD_WRITE_TYPES
  add a new p2m type - p2m_mmio_write_dm

 xen/arch/x86/hvm/hvm.c  | 25 ++---
 xen/arch/x86/mm/p2m-ept.c   |  1 +
 xen/arch/x86/mm/p2m-pt.c|  1 +
 xen/arch/x86/mm/shadow/multi.c  |  2 +-
 xen/include/asm-x86/p2m.h   |  9 -
 xen/include/public/hvm/hvm_op.h |  1 +
 6 files changed, 22 insertions(+), 17 deletions(-)

-- 
1.9.1




[Xen-devel] [PATCH v5 1/2] add a new p2m type class - P2M_DISCARD_WRITE_TYPES

2014-12-04 Thread Yu Zhang
From: Yu Zhang 

Currently, the P2M_RO_TYPES bears 2 meanings: one is
"_PAGE_RW bit is clear in their PTEs", and another is
to discard the write operations on these pages. This
patch adds a p2m type class, P2M_DISCARD_WRITE_TYPES,
to bear the second meaning, so we can use this type
class instead of the P2M_RO_TYPES, to decide if a write
operation is to be ignored.

Signed-off-by: Yu Zhang 
---
 xen/arch/x86/hvm/hvm.c | 16 +++-
 xen/arch/x86/mm/shadow/multi.c |  2 +-
 xen/include/asm-x86/p2m.h  |  5 +
 3 files changed, 9 insertions(+), 14 deletions(-)

diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index 51ffc90..967f822 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -2837,7 +2837,7 @@ int hvm_hap_nested_page_fault(paddr_t gpa, unsigned long gla,
  * to the mmio handler.
  */
 if ( (p2mt == p2m_mmio_dm) || 
- (npfec.write_access && (p2mt == p2m_ram_ro)) )
+ (npfec.write_access && (p2m_is_discard_write(p2mt))) )
 {
 put_gfn(p2m->domain, gfn);
 
@@ -2882,16 +2882,6 @@ int hvm_hap_nested_page_fault(paddr_t gpa, unsigned long gla,
 goto out_put_gfn;
 }
 
-/* Shouldn't happen: Maybe the guest was writing to a r/o grant mapping? */
-if ( npfec.write_access && (p2mt == p2m_grant_map_ro) )
-{
-gdprintk(XENLOG_WARNING,
- "trying to write to read-only grant mapping\n");
-hvm_inject_hw_exception(TRAP_gp_fault, 0);
-rc = 1;
-goto out_put_gfn;
-}
-
  /* If we fell through, the vcpu will retry now that access restrictions have
   * been removed. It may fault again if the p2m entry type still requires so.
   * Otherwise, this is an error condition. */
@@ -3941,7 +3931,7 @@ static enum hvm_copy_result __hvm_copy(
 
 if ( flags & HVMCOPY_to_guest )
 {
-if ( p2mt == p2m_ram_ro )
+if ( p2m_is_discard_write(p2mt) )
 {
 static unsigned long lastpage;
 if ( xchg(&lastpage, gfn) != gfn )
@@ -4035,7 +4025,7 @@ static enum hvm_copy_result __hvm_clear(paddr_t addr, int size)
 
 p = (char *)__map_domain_page(page) + (addr & ~PAGE_MASK);
 
-if ( p2mt == p2m_ram_ro )
+if ( p2m_is_discard_write(p2mt) )
 {
 static unsigned long lastpage;
 if ( xchg(&lastpage, gfn) != gfn )
diff --git a/xen/arch/x86/mm/shadow/multi.c b/xen/arch/x86/mm/shadow/multi.c
index 225290e..94cf06d 100644
--- a/xen/arch/x86/mm/shadow/multi.c
+++ b/xen/arch/x86/mm/shadow/multi.c
@@ -4575,7 +4575,7 @@ static mfn_t emulate_gva_to_mfn(struct vcpu *v,
 {
 return _mfn(BAD_GFN_TO_MFN);
 }
-if ( p2m_is_readonly(p2mt) )
+if ( p2m_is_discard_write(p2mt) )
 {
 put_page(page);
 return _mfn(READONLY_GFN);
diff --git a/xen/include/asm-x86/p2m.h b/xen/include/asm-x86/p2m.h
index 5f7fe71..42de75d 100644
--- a/xen/include/asm-x86/p2m.h
+++ b/xen/include/asm-x86/p2m.h
@@ -113,6 +113,10 @@ typedef unsigned int p2m_query_t;
   | p2m_to_mask(p2m_grant_map_ro)   \
   | p2m_to_mask(p2m_ram_shared) )
 
+/* Write-discard types, which should discard the write operations */
+#define P2M_DISCARD_WRITE_TYPES (p2m_to_mask(p2m_ram_ro) \
+  | p2m_to_mask(p2m_grant_map_ro))
+
 /* Types that can be subject to bulk transitions. */
 #define P2M_CHANGEABLE_TYPES (p2m_to_mask(p2m_ram_rw) \
   | p2m_to_mask(p2m_ram_logdirty) )
@@ -145,6 +149,7 @@ typedef unsigned int p2m_query_t;
 #define p2m_is_hole(_t) (p2m_to_mask(_t) & P2M_HOLE_TYPES)
 #define p2m_is_mmio(_t) (p2m_to_mask(_t) & P2M_MMIO_TYPES)
 #define p2m_is_readonly(_t) (p2m_to_mask(_t) & P2M_RO_TYPES)
+#define p2m_is_discard_write(_t) (p2m_to_mask(_t) & P2M_DISCARD_WRITE_TYPES)
 #define p2m_is_changeable(_t) (p2m_to_mask(_t) & P2M_CHANGEABLE_TYPES)
 #define p2m_is_pod(_t) (p2m_to_mask(_t) & P2M_POD_TYPES)
 #define p2m_is_grant(_t) (p2m_to_mask(_t) & P2M_GRANT_TYPES)
-- 
1.9.1




[Xen-devel] [PATCH v5 2/2] add a new p2m type - p2m_mmio_write_dm

2014-12-04 Thread Yu Zhang
From: Yu Zhang 

A new p2m type, p2m_mmio_write_dm, is added to trap and emulate
the write operations on the GPU's page tables. Handling of this new
p2m type is similar to the existing p2m_ram_ro in most condition
checks, the only difference being the final policy of emulation vs.
drop. For p2m_ram_ro types, write operations will not trigger the
device model, and will be discarded later in __hvm_copy(); while for
p2m_mmio_write_dm type pages, writes will go to the device model
via ioreq-server.

Signed-off-by: Yu Zhang 
Signed-off-by: Wei Ye 
---
 xen/arch/x86/hvm/hvm.c  | 11 ---
 xen/arch/x86/mm/p2m-ept.c   |  1 +
 xen/arch/x86/mm/p2m-pt.c|  1 +
 xen/include/asm-x86/p2m.h   |  4 +++-
 xen/include/public/hvm/hvm_op.h |  1 +
 5 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index 967f822..b4bdfab 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -2837,7 +2837,8 @@ int hvm_hap_nested_page_fault(paddr_t gpa, unsigned long gla,
  * to the mmio handler.
  */
 if ( (p2mt == p2m_mmio_dm) || 
- (npfec.write_access && (p2m_is_discard_write(p2mt))) )
+ (npfec.write_access &&
+  (p2m_is_discard_write(p2mt) || (p2mt == p2m_mmio_write_dm))) )
 {
 put_gfn(p2m->domain, gfn);
 
@@ -5904,6 +5905,8 @@ long do_hvm_op(unsigned long op, XEN_GUEST_HANDLE_PARAM(void) arg)
 get_gfn_query_unlocked(d, a.pfn, &t);
 if ( p2m_is_mmio(t) )
 a.mem_type =  HVMMEM_mmio_dm;
+else if ( t == p2m_mmio_write_dm )
+a.mem_type = HVMMEM_mmio_write_dm;
 else if ( p2m_is_readonly(t) )
 a.mem_type =  HVMMEM_ram_ro;
 else if ( p2m_is_ram(t) )
@@ -5931,7 +5934,8 @@ long do_hvm_op(unsigned long op, XEN_GUEST_HANDLE_PARAM(void) arg)
 static const p2m_type_t memtype[] = {
 [HVMMEM_ram_rw]  = p2m_ram_rw,
 [HVMMEM_ram_ro]  = p2m_ram_ro,
-[HVMMEM_mmio_dm] = p2m_mmio_dm
+[HVMMEM_mmio_dm] = p2m_mmio_dm,
+[HVMMEM_mmio_write_dm] = p2m_mmio_write_dm
 };
 
 if ( copy_from_guest(&a, arg, 1) )
@@ -5978,7 +5982,8 @@ long do_hvm_op(unsigned long op, XEN_GUEST_HANDLE_PARAM(void) arg)
 goto param_fail4;
 }
 if ( !p2m_is_ram(t) &&
- (!p2m_is_hole(t) || a.hvmmem_type != HVMMEM_mmio_dm) )
+ (!p2m_is_hole(t) || a.hvmmem_type != HVMMEM_mmio_dm) &&
+ t != p2m_mmio_write_dm )
 {
 put_gfn(d, pfn);
 goto param_fail4;
diff --git a/xen/arch/x86/mm/p2m-ept.c b/xen/arch/x86/mm/p2m-ept.c
index 15c6e83..e21a92d 100644
--- a/xen/arch/x86/mm/p2m-ept.c
+++ b/xen/arch/x86/mm/p2m-ept.c
@@ -136,6 +136,7 @@ static void ept_p2m_type_to_flags(ept_entry_t *entry, p2m_type_t type, p2m_acces
 entry->x = 0;
 break;
 case p2m_grant_map_ro:
+case p2m_mmio_write_dm:
 entry->r = 1;
 entry->w = entry->x = 0;
 break;
diff --git a/xen/arch/x86/mm/p2m-pt.c b/xen/arch/x86/mm/p2m-pt.c
index e48b63a..26fb18d 100644
--- a/xen/arch/x86/mm/p2m-pt.c
+++ b/xen/arch/x86/mm/p2m-pt.c
@@ -94,6 +94,7 @@ static unsigned long p2m_type_to_flags(p2m_type_t t, mfn_t mfn)
 default:
 return flags | _PAGE_NX_BIT;
 case p2m_grant_map_ro:
+case p2m_mmio_write_dm:
 return flags | P2M_BASE_FLAGS | _PAGE_NX_BIT;
 case p2m_ram_ro:
 case p2m_ram_logdirty:
diff --git a/xen/include/asm-x86/p2m.h b/xen/include/asm-x86/p2m.h
index 42de75d..866fb0d 100644
--- a/xen/include/asm-x86/p2m.h
+++ b/xen/include/asm-x86/p2m.h
@@ -72,6 +72,7 @@ typedef enum {
 p2m_ram_shared = 12,  /* Shared or sharable memory */
 p2m_ram_broken = 13,  /* Broken page, access cause domain crash */
 p2m_map_foreign  = 14,/* ram pages from foreign domain */
+p2m_mmio_write_dm = 15,   /* Read-only; writes go to the device model */
 } p2m_type_t;
 
 /* Modifiers to the query */
@@ -111,7 +112,8 @@ typedef unsigned int p2m_query_t;
 #define P2M_RO_TYPES (p2m_to_mask(p2m_ram_logdirty) \
   | p2m_to_mask(p2m_ram_ro) \
   | p2m_to_mask(p2m_grant_map_ro)   \
-  | p2m_to_mask(p2m_ram_shared) )
+  | p2m_to_mask(p2m_ram_shared)   \
+  | p2m_to_mask(p2m_mmio_write_dm))
 
 /* Write-discard types, which should discard the write operations */
 #define P2M_DISCARD_WRITE_TYPES (p2m_to_mask(p2m_ram_ro) \
diff --git a/xen/include/public/hvm/hvm_op.h b/xen/include/public/hvm/hvm_op.h
index eeb0a60..a4e5345 100644
--- a/xen/include/public/hvm/hvm_op.h
+++ b/xen/include/public/hvm/hvm_op.h
@@ -81,6 +81,7 @@ typedef enum {
 HVMMEM_ram_rw, /* Normal read/write guest RAM

Re: [Xen-devel] [PATCH v5 2/2] add a new p2m type - p2m_mmio_write_dm

2014-12-04 Thread Yu, Zhang



On 12/5/2014 12:04 AM, Tim Deegan wrote:

Hi,

At 21:13 +0800 on 04 Dec (1417724006), Yu Zhang wrote:

A new p2m type, p2m_mmio_write_dm, is added to trap and emulate
the write operations on GPU's page tables. Handling of this new
p2m type are similar with existing p2m_ram_ro in most condition
checks, with only difference on final policy of emulation vs. drop.
For p2m_ram_ro types, write operations will not trigger the device
model, and will be discarded later in __hvm_copy(); while for the
p2m_mmio_write_dm type pages, writes will go to the device model
via ioreq-server.

Signed-off-by: Yu Zhang 
Signed-off-by: Wei Ye 


Thanks for this -- only two comments:


@@ -5978,7 +5982,8 @@ long do_hvm_op(unsigned long op, XEN_GUEST_HANDLE_PARAM(void) arg)
  goto param_fail4;
  }
  if ( !p2m_is_ram(t) &&
- (!p2m_is_hole(t) || a.hvmmem_type != HVMMEM_mmio_dm) )
+ (!p2m_is_hole(t) || a.hvmmem_type != HVMMEM_mmio_dm) &&
+ t != p2m_mmio_write_dm )


I think that Jan already brought this up, and maybe I missed your
answer: this relaxation looks wrong to me. I would have thought that
transitions between p2m_mmio_write_dm and p2m_ram_rw/p2m_ram_logdirty
would be the only ones you would want to allow.


Ha. Sorry, my negligence, and thanks for pointing that out. :)
The transition we use now is only between p2m_mmio_write_dm and
p2m_ram_rw. So how about this:


if ( !p2m_is_ram(t) &&
 (!p2m_is_hole(t) || a.hvmmem_type != HVMMEM_mmio_dm) &&
 (t != p2m_mmio_write_dm || a.hvmmem_type != HVMMEM_ram_rw) )




@@ -111,7 +112,8 @@ typedef unsigned int p2m_query_t;
  #define P2M_RO_TYPES (p2m_to_mask(p2m_ram_logdirty) \
| p2m_to_mask(p2m_ram_ro) \
| p2m_to_mask(p2m_grant_map_ro)   \
-  | p2m_to_mask(p2m_ram_shared) )
+  | p2m_to_mask(p2m_ram_shared)   \
+  | p2m_to_mask(p2m_mmio_write_dm))


Nit: please align the '\' with the others above it.

Got it, and thanks.

B.R.
Yu


Cheers,

Tim.






[Xen-devel] [PATCH v6 2/2] add a new p2m type - p2m_mmio_write_dm

2014-12-05 Thread Yu Zhang
From: Yu Zhang 

A new p2m type, p2m_mmio_write_dm, is added to trap and emulate
the write operations on the GPU's page tables. Handling of this new
p2m type is similar to the existing p2m_ram_ro in most condition
checks, the only difference being the final policy of emulation vs.
drop. For p2m_ram_ro types, write operations will not trigger the
device model, and will be discarded later in __hvm_copy(); while for
p2m_mmio_write_dm type pages, writes will go to the device model
via ioreq-server.

Signed-off-by: Yu Zhang 
Signed-off-by: Wei Ye 
---
 xen/arch/x86/hvm/hvm.c  | 11 ---
 xen/arch/x86/mm/p2m-ept.c   |  1 +
 xen/arch/x86/mm/p2m-pt.c|  1 +
 xen/include/asm-x86/p2m.h   |  4 +++-
 xen/include/public/hvm/hvm_op.h |  1 +
 5 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index 967f822..25114fc 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -2837,7 +2837,8 @@ int hvm_hap_nested_page_fault(paddr_t gpa, unsigned long gla,
  * to the mmio handler.
  */
 if ( (p2mt == p2m_mmio_dm) || 
- (npfec.write_access && (p2m_is_discard_write(p2mt))) )
+ (npfec.write_access &&
+  (p2m_is_discard_write(p2mt) || (p2mt == p2m_mmio_write_dm))) )
 {
 put_gfn(p2m->domain, gfn);
 
@@ -5904,6 +5905,8 @@ long do_hvm_op(unsigned long op, XEN_GUEST_HANDLE_PARAM(void) arg)
 get_gfn_query_unlocked(d, a.pfn, &t);
 if ( p2m_is_mmio(t) )
 a.mem_type =  HVMMEM_mmio_dm;
+else if ( t == p2m_mmio_write_dm )
+a.mem_type = HVMMEM_mmio_write_dm;
 else if ( p2m_is_readonly(t) )
 a.mem_type =  HVMMEM_ram_ro;
 else if ( p2m_is_ram(t) )
@@ -5931,7 +5934,8 @@ long do_hvm_op(unsigned long op, XEN_GUEST_HANDLE_PARAM(void) arg)
 static const p2m_type_t memtype[] = {
 [HVMMEM_ram_rw]  = p2m_ram_rw,
 [HVMMEM_ram_ro]  = p2m_ram_ro,
-[HVMMEM_mmio_dm] = p2m_mmio_dm
+[HVMMEM_mmio_dm] = p2m_mmio_dm,
+[HVMMEM_mmio_write_dm] = p2m_mmio_write_dm
 };
 
 if ( copy_from_guest(&a, arg, 1) )
@@ -5978,7 +5982,8 @@ long do_hvm_op(unsigned long op, XEN_GUEST_HANDLE_PARAM(void) arg)
 goto param_fail4;
 }
 if ( !p2m_is_ram(t) &&
- (!p2m_is_hole(t) || a.hvmmem_type != HVMMEM_mmio_dm) )
+ (!p2m_is_hole(t) || a.hvmmem_type != HVMMEM_mmio_dm) &&
+ (t != p2m_mmio_write_dm || a.hvmmem_type != HVMMEM_ram_rw) )
 {
 put_gfn(d, pfn);
 goto param_fail4;
diff --git a/xen/arch/x86/mm/p2m-ept.c b/xen/arch/x86/mm/p2m-ept.c
index 15c6e83..e21a92d 100644
--- a/xen/arch/x86/mm/p2m-ept.c
+++ b/xen/arch/x86/mm/p2m-ept.c
@@ -136,6 +136,7 @@ static void ept_p2m_type_to_flags(ept_entry_t *entry, p2m_type_t type, p2m_acces
 entry->x = 0;
 break;
 case p2m_grant_map_ro:
+case p2m_mmio_write_dm:
 entry->r = 1;
 entry->w = entry->x = 0;
 break;
diff --git a/xen/arch/x86/mm/p2m-pt.c b/xen/arch/x86/mm/p2m-pt.c
index e48b63a..26fb18d 100644
--- a/xen/arch/x86/mm/p2m-pt.c
+++ b/xen/arch/x86/mm/p2m-pt.c
@@ -94,6 +94,7 @@ static unsigned long p2m_type_to_flags(p2m_type_t t, mfn_t mfn)
 default:
 return flags | _PAGE_NX_BIT;
 case p2m_grant_map_ro:
+case p2m_mmio_write_dm:
 return flags | P2M_BASE_FLAGS | _PAGE_NX_BIT;
 case p2m_ram_ro:
 case p2m_ram_logdirty:
diff --git a/xen/include/asm-x86/p2m.h b/xen/include/asm-x86/p2m.h
index 42de75d..2cf73ca 100644
--- a/xen/include/asm-x86/p2m.h
+++ b/xen/include/asm-x86/p2m.h
@@ -72,6 +72,7 @@ typedef enum {
 p2m_ram_shared = 12,  /* Shared or sharable memory */
 p2m_ram_broken = 13,  /* Broken page, access cause domain crash */
 p2m_map_foreign  = 14,/* ram pages from foreign domain */
+p2m_mmio_write_dm = 15,   /* Read-only; writes go to the device model */
 } p2m_type_t;
 
 /* Modifiers to the query */
@@ -111,7 +112,8 @@ typedef unsigned int p2m_query_t;
 #define P2M_RO_TYPES (p2m_to_mask(p2m_ram_logdirty) \
   | p2m_to_mask(p2m_ram_ro) \
   | p2m_to_mask(p2m_grant_map_ro)   \
-  | p2m_to_mask(p2m_ram_shared) )
+  | p2m_to_mask(p2m_ram_shared) \
+  | p2m_to_mask(p2m_mmio_write_dm))
 
 /* Write-discard types, which should discard the write operations */
 #define P2M_DISCARD_WRITE_TYPES (p2m_to_mask(p2m_ram_ro) \
diff --git a/xen/include/public/hvm/hvm_op.h b/xen/include/public/hvm/hvm_op.h
index eeb0a60..a4e5345 100644
--- a/xen/include/public/hvm/hvm_op.h
+++ b/xen/include/public/hvm/hvm_op.h
@@ -81,6 +81,7 @@ typedef enum {
 HVMMEM_ram_rw, 

[Xen-devel] [PATCH v6 1/2] add a new p2m type class - P2M_DISCARD_WRITE_TYPES

2014-12-05 Thread Yu Zhang
From: Yu Zhang 

Currently, the P2M_RO_TYPES bears 2 meanings: one is
"_PAGE_RW bit is clear in their PTEs", and another is
to discard the write operations on these pages. This
patch adds a p2m type class, P2M_DISCARD_WRITE_TYPES,
to bear the second meaning, so we can use this type
class instead of the P2M_RO_TYPES, to decide if a write
operation is to be ignored.

Signed-off-by: Yu Zhang 
Reviewed-by: Tim Deegan 
---
 xen/arch/x86/hvm/hvm.c | 16 +++-
 xen/arch/x86/mm/shadow/multi.c |  2 +-
 xen/include/asm-x86/p2m.h  |  5 +
 3 files changed, 9 insertions(+), 14 deletions(-)

diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index 51ffc90..967f822 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -2837,7 +2837,7 @@ int hvm_hap_nested_page_fault(paddr_t gpa, unsigned long gla,
  * to the mmio handler.
  */
 if ( (p2mt == p2m_mmio_dm) || 
- (npfec.write_access && (p2mt == p2m_ram_ro)) )
+ (npfec.write_access && (p2m_is_discard_write(p2mt))) )
 {
 put_gfn(p2m->domain, gfn);
 
@@ -2882,16 +2882,6 @@ int hvm_hap_nested_page_fault(paddr_t gpa, unsigned long gla,
 goto out_put_gfn;
 }
 
-/* Shouldn't happen: Maybe the guest was writing to a r/o grant mapping? */
-if ( npfec.write_access && (p2mt == p2m_grant_map_ro) )
-{
-gdprintk(XENLOG_WARNING,
- "trying to write to read-only grant mapping\n");
-hvm_inject_hw_exception(TRAP_gp_fault, 0);
-rc = 1;
-goto out_put_gfn;
-}
-
  /* If we fell through, the vcpu will retry now that access restrictions have
   * been removed. It may fault again if the p2m entry type still requires so.
   * Otherwise, this is an error condition. */
@@ -3941,7 +3931,7 @@ static enum hvm_copy_result __hvm_copy(
 
 if ( flags & HVMCOPY_to_guest )
 {
-if ( p2mt == p2m_ram_ro )
+if ( p2m_is_discard_write(p2mt) )
 {
 static unsigned long lastpage;
 if ( xchg(&lastpage, gfn) != gfn )
@@ -4035,7 +4025,7 @@ static enum hvm_copy_result __hvm_clear(paddr_t addr, int size)
 
 p = (char *)__map_domain_page(page) + (addr & ~PAGE_MASK);
 
-if ( p2mt == p2m_ram_ro )
+if ( p2m_is_discard_write(p2mt) )
 {
 static unsigned long lastpage;
 if ( xchg(&lastpage, gfn) != gfn )
diff --git a/xen/arch/x86/mm/shadow/multi.c b/xen/arch/x86/mm/shadow/multi.c
index 225290e..94cf06d 100644
--- a/xen/arch/x86/mm/shadow/multi.c
+++ b/xen/arch/x86/mm/shadow/multi.c
@@ -4575,7 +4575,7 @@ static mfn_t emulate_gva_to_mfn(struct vcpu *v,
 {
 return _mfn(BAD_GFN_TO_MFN);
 }
-if ( p2m_is_readonly(p2mt) )
+if ( p2m_is_discard_write(p2mt) )
 {
 put_page(page);
 return _mfn(READONLY_GFN);
diff --git a/xen/include/asm-x86/p2m.h b/xen/include/asm-x86/p2m.h
index 5f7fe71..42de75d 100644
--- a/xen/include/asm-x86/p2m.h
+++ b/xen/include/asm-x86/p2m.h
@@ -113,6 +113,10 @@ typedef unsigned int p2m_query_t;
   | p2m_to_mask(p2m_grant_map_ro)   \
   | p2m_to_mask(p2m_ram_shared) )
 
+/* Write-discard types, which should discard the write operations */
+#define P2M_DISCARD_WRITE_TYPES (p2m_to_mask(p2m_ram_ro) \
+  | p2m_to_mask(p2m_grant_map_ro))
+
 /* Types that can be subject to bulk transitions. */
 #define P2M_CHANGEABLE_TYPES (p2m_to_mask(p2m_ram_rw) \
   | p2m_to_mask(p2m_ram_logdirty) )
@@ -145,6 +149,7 @@ typedef unsigned int p2m_query_t;
 #define p2m_is_hole(_t) (p2m_to_mask(_t) & P2M_HOLE_TYPES)
 #define p2m_is_mmio(_t) (p2m_to_mask(_t) & P2M_MMIO_TYPES)
 #define p2m_is_readonly(_t) (p2m_to_mask(_t) & P2M_RO_TYPES)
+#define p2m_is_discard_write(_t) (p2m_to_mask(_t) & P2M_DISCARD_WRITE_TYPES)
 #define p2m_is_changeable(_t) (p2m_to_mask(_t) & P2M_CHANGEABLE_TYPES)
 #define p2m_is_pod(_t) (p2m_to_mask(_t) & P2M_POD_TYPES)
 #define p2m_is_grant(_t) (p2m_to_mask(_t) & P2M_GRANT_TYPES)
-- 
1.9.1




[Xen-devel] [PATCH v6 0/2] add new p2m type class and new p2m type

2014-12-05 Thread Yu Zhang
XenGT (Intel Graphics Virtualization technology, please refer to
https://01.org/xen/blogs/srclarkx/2013/graphics-virtualization-
xengt) driver runs inside Dom0 as a virtual graphics device model,
and needs to trap and emulate the guest's write operations to some
specific memory pages, like memory pages used by the guest graphics
driver as PPGTT (per-process graphics translation table). We added
a new p2m type, p2m_mmio_write_dm, to trap and emulate the write
operations on these graphics page tables.

Handling of this new p2m type is similar to the existing p2m_ram_ro
in most condition checks, the only difference being the final policy
of emulation vs. drop. For p2m_ram_ro types, write operations will not
trigger the device model, and will be discarded later in __hvm_copy();
while for p2m_mmio_write_dm type pages, writes will go to the
device model via ioreq-server.

Previously, the conclusion of our v3 patch review was to provide a
more generalized HVMOP_map_io_range_to_ioreq_server hypercall, by
separating rangesets inside an ioreq server into read-protected/write-
protected/both-protected. Yet, after offline discussion with Paul,
we believe a more simplified solution may suffice. We can keep the
existing HVMOP_map_io_range_to_ioreq_server hypercall, and let the
user decide whether or not a p2m type change is necessary, because
in most cases the emulator will already use the p2m_mmio_dm type.

Changes from v5:
 - Stricter type checks for p2m type transitions;
 - One code style change.

Changes from v4:
 - A new p2m type class, P2M_DISCARD_WRITE_TYPES, is added;
 - A new predicate, p2m_is_discard_write, is used in __hvm_copy()/
   __hvm_clear()/emulate_gva_to_mfn()/hvm_hap_nested_page_fault(),
   to discard the write operations;
 - The new p2m type, p2m_mmio_write_dm, is added to P2M_RO_TYPES;
 - Coding style changes;

Changes from v3:
 - Use the existing HVMOP_map_io_range_to_ioreq_server hypercall
   to add write protected range;
 - Modify the HVMOP_set_mem_type hypercall to support the new p2m
   type for this range.

Changes from v2:
 - Remove execute attribute of the new p2m type p2m_mmio_write_dm;
 - Use the existing rangeset to keep the write-protected page range,
   instead of introducing a hash table;
 - Some code style fixes.

Changes from v1:
 - Change the new p2m type name from p2m_ram_wp to p2m_mmio_write_dm.
   This means that we treat the pages as a special mmio range instead
   of ram;
 - Move macros to the .c file since only that file uses them;
 - Address various comments from Jan.

Yu Zhang (2):
  Add a new p2m type class - P2M_DISCARD_WRITE_TYPES
  add a new p2m type - p2m_mmio_write_dm

 xen/arch/x86/hvm/hvm.c  | 25 ++---
 xen/arch/x86/mm/p2m-ept.c   |  1 +
 xen/arch/x86/mm/p2m-pt.c|  1 +
 xen/arch/x86/mm/shadow/multi.c  |  2 +-
 xen/include/asm-x86/p2m.h   |  9 -
 xen/include/public/hvm/hvm_op.h |  1 +
 6 files changed, 22 insertions(+), 17 deletions(-)

-- 
1.9.1




Re: [Xen-devel] [PATCH v6 0/2] add new p2m type class and new p2m type

2014-12-08 Thread Yu, Zhang

Hi Tim & Jan,

  Thank you very much for your review.
  And could you please also advise me on how to get an ACK? I'm not 
sure what the next action I need to take is. :-)


B.R.
Yu

On 12/6/2014 11:55 AM, Yu Zhang wrote:

[...]





[Xen-devel] One question about the hypercall to translate gfn to mfn.

2014-12-09 Thread Yu, Zhang

Hi all,

  As you can see, we are pushing our XenGT patches upstream. One 
feature we need in Xen is to translate a guest's gfn to an mfn in the 
XenGT Dom0 device model.


  Here we may have 2 similar solutions:
  1> Paul told me (and thank you, Paul :)) that there used to be a 
hypercall, XENMEM_translate_gpfn_list, which was removed by Keir in 
commit 2d2f7977a052e655db6748be5dabf5a58f5c5e32, because there was no 
user at that time. So solution 1 is to revert this commit. However, 
since this hypercall was removed ages ago, the revert ran into many 
conflicts, e.g. gmfn_to_mfn is no longer used on x86, etc.


  2> In our project, we defined a new hypercall, 
XENMEM_get_mfn_from_pfn, which has an implementation similar to the 
previous XENMEM_translate_gpfn_list. One of the major differences is 
that this newly defined one is x86-only (called in arch_memory_op), 
so we do not have to worry about the ARM side.


  Does anyone have any suggestions about this?
  Thanks in advance. :)

B.R.
Yu



Re: [Xen-devel] [PATCH v6 0/2] add new p2m type class and new p2m type

2014-12-09 Thread Yu, Zhang



On 12/9/2014 4:31 PM, Jan Beulich wrote:

On 09.12.14 at 03:02,  wrote:

Thank you very much for your review.
And could you please also help me about how to get an ACK? I'm not
sure what's the next action I need to take. :-)


I don't think you need to take any action at this point. The second
patch will need Tim's ack, yes, but that's nothing to worry about
(yet), since even with his ack the two patches wouldn't go in until
after 4.5 got branched off of staging.


Got it, and thanks!

Yu


Jan




Re: [Xen-devel] One question about the hypercall to translate gfn to mfn.

2014-12-09 Thread Yu, Zhang



On 12/9/2014 6:19 PM, Paul Durrant wrote:

I think use of a raw mfn value currently works only because dom0 is using a 
1:1 IOMMU mapping scheme. Is my understanding correct, or do you really need 
raw mfn values?

Thanks for your quick response, Paul.
Well, not exactly for this case. :)
In XenGT, our need to translate gfn to mfn is for the GPU's page tables, 
which contain the translation between graphics addresses and memory 
addresses. These page tables are maintained by the GPU drivers, and our 
service domain needs a method to translate the guest physical addresses 
written by the vGPU into host physical ones.
We do not use an IOMMU in XenGT, and therefore this translation may not 
necessarily be a 1:1 mapping.
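
To illustrate why (hypothetical helper names, not XenGT code): each
guest GTT entry holds a guest-physical frame number, while the hardware
walking the shadow table needs a host-physical one, so shadowing an
entry looks roughly like:

    #include <stdint.h>

    #define XGT_PAGE_SHIFT 12

    /* translate_gfn_to_mfn() stands in for the hypercall under
     * discussion; the low attribute bits are carried over unchanged. */
    extern uint64_t translate_gfn_to_mfn(uint64_t gfn);

    static uint64_t shadow_gtt_entry(uint64_t guest_entry)
    {
        uint64_t gfn  = guest_entry >> XGT_PAGE_SHIFT;
        uint64_t mfn  = translate_gfn_to_mfn(gfn);
        uint64_t attr = guest_entry & ((1ULL << XGT_PAGE_SHIFT) - 1);

        return (mfn << XGT_PAGE_SHIFT) | attr;
    }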


B.R.
Yu



[Xen-devel] [PATCH v7 2/2] add a new p2m type - p2m_mmio_write_dm

2014-12-12 Thread Yu Zhang
From: Yu Zhang 

A new p2m type, p2m_mmio_write_dm, is added to trap and emulate
the write operations on the GPU's page tables. Handling of this new
p2m type is similar to the existing p2m_ram_ro in most condition
checks; the only difference is the final policy of emulation vs.
drop. For p2m_ram_ro pages, write operations will not trigger the
device model, and will be discarded later in __hvm_copy(); for
p2m_mmio_write_dm pages, writes will go to the device model via the
ioreq server.

Signed-off-by: Yu Zhang 
Signed-off-by: Wei Ye 
Reviewed-by: Jan Beulich 
Reviewed-by: Tim Deegan 
---
 xen/arch/x86/hvm/hvm.c  | 11 ---
 xen/arch/x86/mm/p2m-ept.c   |  1 +
 xen/arch/x86/mm/p2m-pt.c|  1 +
 xen/arch/x86/mm/shadow/multi.c  |  3 ++-
 xen/include/asm-x86/p2m.h   |  4 +++-
 xen/include/public/hvm/hvm_op.h |  1 +
 6 files changed, 16 insertions(+), 5 deletions(-)

diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index 2eb0795..f66f2c6 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -2837,7 +2837,8 @@ int hvm_hap_nested_page_fault(paddr_t gpa, unsigned long gla,
  * to the mmio handler.
  */
 if ( (p2mt == p2m_mmio_dm) || 
- (npfec.write_access && (p2m_is_discard_write(p2mt))) )
+ (npfec.write_access &&
+  (p2m_is_discard_write(p2mt) || (p2mt == p2m_mmio_write_dm))) )
 {
 put_gfn(p2m->domain, gfn);
 
@@ -5922,6 +5923,8 @@ long do_hvm_op(unsigned long op, XEN_GUEST_HANDLE_PARAM(void) arg)
 get_gfn_query_unlocked(d, a.pfn, &t);
 if ( p2m_is_mmio(t) )
 a.mem_type =  HVMMEM_mmio_dm;
+else if ( t == p2m_mmio_write_dm )
+a.mem_type = HVMMEM_mmio_write_dm;
 else if ( p2m_is_readonly(t) )
 a.mem_type =  HVMMEM_ram_ro;
 else if ( p2m_is_ram(t) )
@@ -5949,7 +5952,8 @@ long do_hvm_op(unsigned long op, XEN_GUEST_HANDLE_PARAM(void) arg)
 static const p2m_type_t memtype[] = {
 [HVMMEM_ram_rw]  = p2m_ram_rw,
 [HVMMEM_ram_ro]  = p2m_ram_ro,
-[HVMMEM_mmio_dm] = p2m_mmio_dm
+[HVMMEM_mmio_dm] = p2m_mmio_dm,
+[HVMMEM_mmio_write_dm] = p2m_mmio_write_dm
 };
 
 if ( copy_from_guest(&a, arg, 1) )
@@ -5996,7 +6000,8 @@ long do_hvm_op(unsigned long op, XEN_GUEST_HANDLE_PARAM(void) arg)
 goto param_fail4;
 }
 if ( !p2m_is_ram(t) &&
- (!p2m_is_hole(t) || a.hvmmem_type != HVMMEM_mmio_dm) )
+ (!p2m_is_hole(t) || a.hvmmem_type != HVMMEM_mmio_dm) &&
+ (t != p2m_mmio_write_dm || a.hvmmem_type != HVMMEM_ram_rw) )
 {
 put_gfn(d, pfn);
 goto param_fail4;
diff --git a/xen/arch/x86/mm/p2m-ept.c b/xen/arch/x86/mm/p2m-ept.c
index 15c6e83..e21a92d 100644
--- a/xen/arch/x86/mm/p2m-ept.c
+++ b/xen/arch/x86/mm/p2m-ept.c
@@ -136,6 +136,7 @@ static void ept_p2m_type_to_flags(ept_entry_t *entry, p2m_type_t type, p2m_acces
 entry->x = 0;
 break;
 case p2m_grant_map_ro:
+case p2m_mmio_write_dm:
 entry->r = 1;
 entry->w = entry->x = 0;
 break;
diff --git a/xen/arch/x86/mm/p2m-pt.c b/xen/arch/x86/mm/p2m-pt.c
index e48b63a..26fb18d 100644
--- a/xen/arch/x86/mm/p2m-pt.c
+++ b/xen/arch/x86/mm/p2m-pt.c
@@ -94,6 +94,7 @@ static unsigned long p2m_type_to_flags(p2m_type_t t, mfn_t mfn)
 default:
 return flags | _PAGE_NX_BIT;
 case p2m_grant_map_ro:
+case p2m_mmio_write_dm:
 return flags | P2M_BASE_FLAGS | _PAGE_NX_BIT;
 case p2m_ram_ro:
 case p2m_ram_logdirty:
diff --git a/xen/arch/x86/mm/shadow/multi.c b/xen/arch/x86/mm/shadow/multi.c
index 94cf06d..65815bb 100644
--- a/xen/arch/x86/mm/shadow/multi.c
+++ b/xen/arch/x86/mm/shadow/multi.c
@@ -3181,7 +3181,8 @@ static int sh_page_fault(struct vcpu *v,
 }
 
 /* Need to hand off device-model MMIO to the device model */
-if ( p2mt == p2m_mmio_dm ) 
+if ( p2mt == p2m_mmio_dm
+ || (p2mt == p2m_mmio_write_dm && ft == ft_demand_write) )
 {
 gpa = guest_walk_to_gpa(&gw);
 goto mmio;
diff --git a/xen/include/asm-x86/p2m.h b/xen/include/asm-x86/p2m.h
index 42de75d..2cf73ca 100644
--- a/xen/include/asm-x86/p2m.h
+++ b/xen/include/asm-x86/p2m.h
@@ -72,6 +72,7 @@ typedef enum {
 p2m_ram_shared = 12,  /* Shared or sharable memory */
 p2m_ram_broken = 13,  /* Broken page, access cause domain crash */
 p2m_map_foreign  = 14,/* ram pages from foreign domain */
+p2m_mmio_write_dm = 15,   /* Read-only; writes go to the device model */
 } p2m_type_t;
 
 /* Modifiers to the query */
@@ -111,7 +112,8 @@ typedef unsigned int p2m_query_t;
 #define P2M_RO_TYPES (p2m_to_mask(p2m_ram_logdirty) \
   | p2m_to_mask(p2m

[Xen-devel] [PATCH v7 0/2] add new p2m type class and new p2m type

2014-12-12 Thread Yu Zhang
XenGT (Intel Graphics Virtualization technology, please refer to
https://01.org/xen/blogs/srclarkx/2013/graphics-virtualization-xengt)
runs its driver inside Dom0 as a virtual graphics device model, and
needs to trap and emulate the guest's write operations to some
specific memory pages, such as the pages used by the guest graphics
driver as the PPGTT (per-process graphics translation table). We added
a new p2m type, p2m_mmio_write_dm, to trap and emulate the write
operations on these graphics page tables.

Handling of this new p2m type is similar to the existing p2m_ram_ro
in most condition checks; the only difference is the final policy of
emulation vs. drop. For p2m_ram_ro pages, write operations will not
trigger the device model, and will be discarded later in __hvm_copy();
for p2m_mmio_write_dm pages, writes will go to the device model via
the ioreq server.

Previously, the conclusion of our v3 patch review was to provide a
more generalized HVMOP_map_io_range_to_ioreq_server hypercall, by
separating the rangesets inside an ioreq server into read-protected/
write-protected/both-protected ones. Yet, after offline discussion
with Paul, we believe a simpler solution may suffice. We can keep the
existing HVMOP_map_io_range_to_ioreq_server hypercall, and let the
user decide whether or not a p2m type change is necessary, because
in most cases the emulator will already use the p2m_mmio_dm type.


Changes from v6:
 - Handle the new p2m type in the shadow-pagetable code.

Changes from v5:
 - Stricter type checks for p2m type transitions;
 - One code style change.

Changes from v4:
 - A new p2m type class, P2M_DISCARD_WRITE_TYPES, is added;
 - A new predicate, p2m_is_discard_write, is used in __hvm_copy()/
   __hvm_clear()/emulate_gva_to_mfn()/hvm_hap_nested_page_fault(),
   to discard the write operations;
 - The new p2m type, p2m_mmio_write_dm, is added to P2M_RO_TYPES;
 - Coding style changes;

Changes from v3:
 - Use the existing HVMOP_map_io_range_to_ioreq_server hypercall
   to add write protected range;
 - Modify the HVMOP_set_mem_type hypercall to support the new p2m
   type for this range.

Changes from v2:
 - Remove execute attribute of the new p2m type p2m_mmio_write_dm;
 - Use the existing rangeset to keep the write-protected page range,
   instead of introducing a hash table;
 - Some code style fixes.

Changes from v1:
 - Change the new p2m type name from p2m_ram_wp to p2m_mmio_write_dm.
   This means that we treat the pages as a special mmio range instead
   of ram;
 - Move macros to the .c file since only that file uses them;
 - Address various comments from Jan.

Yu Zhang (2):
  Add a new p2m type class - P2M_DISCARD_WRITE_TYPES
  add a new p2m type - p2m_mmio_write_dm

 xen/arch/x86/hvm/hvm.c  | 25 ++---
 xen/arch/x86/mm/p2m-ept.c   |  1 +
 xen/arch/x86/mm/p2m-pt.c|  1 +
 xen/arch/x86/mm/shadow/multi.c  |  2 +-
 xen/include/asm-x86/p2m.h   |  9 -
 xen/include/public/hvm/hvm_op.h |  1 +
 6 files changed, 22 insertions(+), 17 deletions(-)

-- 
1.9.1




[Xen-devel] [PATCH v7 1/2] add a new p2m type class - P2M_DISCARD_WRITE_TYPES

2014-12-12 Thread Yu Zhang
From: Yu Zhang 

Currently, P2M_RO_TYPES bears two meanings: one is
"the _PAGE_RW bit is clear in their PTEs", and the other is
that write operations to these pages are to be discarded. This
patch adds a p2m type class, P2M_DISCARD_WRITE_TYPES, to bear
the second meaning, so we can use this type class instead of
P2M_RO_TYPES to decide whether a write operation is to be
ignored.

Signed-off-by: Yu Zhang 
Reviewed-by: Tim Deegan 
---
 xen/arch/x86/hvm/hvm.c | 16 +++-
 xen/arch/x86/mm/shadow/multi.c |  2 +-
 xen/include/asm-x86/p2m.h  |  5 +
 3 files changed, 9 insertions(+), 14 deletions(-)

diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index bc414ff..2eb0795 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -2837,7 +2837,7 @@ int hvm_hap_nested_page_fault(paddr_t gpa, unsigned long gla,
  * to the mmio handler.
  */
 if ( (p2mt == p2m_mmio_dm) || 
- (npfec.write_access && (p2mt == p2m_ram_ro)) )
+ (npfec.write_access && (p2m_is_discard_write(p2mt))) )
 {
 put_gfn(p2m->domain, gfn);
 
@@ -2882,16 +2882,6 @@ int hvm_hap_nested_page_fault(paddr_t gpa, unsigned long gla,
 goto out_put_gfn;
 }
 
-/* Shouldn't happen: Maybe the guest was writing to a r/o grant mapping? */
-if ( npfec.write_access && (p2mt == p2m_grant_map_ro) )
-{
-gdprintk(XENLOG_WARNING,
- "trying to write to read-only grant mapping\n");
-hvm_inject_hw_exception(TRAP_gp_fault, 0);
-rc = 1;
-goto out_put_gfn;
-}
-
  /* If we fell through, the vcpu will retry now that access restrictions have
   * been removed. It may fault again if the p2m entry type still requires so.
   * Otherwise, this is an error condition. */
@@ -3941,7 +3931,7 @@ static enum hvm_copy_result __hvm_copy(
 
 if ( flags & HVMCOPY_to_guest )
 {
-if ( p2mt == p2m_ram_ro )
+if ( p2m_is_discard_write(p2mt) )
 {
 static unsigned long lastpage;
 if ( xchg(&lastpage, gfn) != gfn )
@@ -4035,7 +4025,7 @@ static enum hvm_copy_result __hvm_clear(paddr_t addr, int size)
 
 p = (char *)__map_domain_page(page) + (addr & ~PAGE_MASK);
 
-if ( p2mt == p2m_ram_ro )
+if ( p2m_is_discard_write(p2mt) )
 {
 static unsigned long lastpage;
 if ( xchg(&lastpage, gfn) != gfn )
diff --git a/xen/arch/x86/mm/shadow/multi.c b/xen/arch/x86/mm/shadow/multi.c
index 225290e..94cf06d 100644
--- a/xen/arch/x86/mm/shadow/multi.c
+++ b/xen/arch/x86/mm/shadow/multi.c
@@ -4575,7 +4575,7 @@ static mfn_t emulate_gva_to_mfn(struct vcpu *v,
 {
 return _mfn(BAD_GFN_TO_MFN);
 }
-if ( p2m_is_readonly(p2mt) )
+if ( p2m_is_discard_write(p2mt) )
 {
 put_page(page);
 return _mfn(READONLY_GFN);
diff --git a/xen/include/asm-x86/p2m.h b/xen/include/asm-x86/p2m.h
index 5f7fe71..42de75d 100644
--- a/xen/include/asm-x86/p2m.h
+++ b/xen/include/asm-x86/p2m.h
@@ -113,6 +113,10 @@ typedef unsigned int p2m_query_t;
   | p2m_to_mask(p2m_grant_map_ro)   \
   | p2m_to_mask(p2m_ram_shared) )
 
+/* Write-discard types, which should discard the write operations */
+#define P2M_DISCARD_WRITE_TYPES (p2m_to_mask(p2m_ram_ro) \
+  | p2m_to_mask(p2m_grant_map_ro))
+
 /* Types that can be subject to bulk transitions. */
 #define P2M_CHANGEABLE_TYPES (p2m_to_mask(p2m_ram_rw) \
   | p2m_to_mask(p2m_ram_logdirty) )
@@ -145,6 +149,7 @@ typedef unsigned int p2m_query_t;
 #define p2m_is_hole(_t) (p2m_to_mask(_t) & P2M_HOLE_TYPES)
 #define p2m_is_mmio(_t) (p2m_to_mask(_t) & P2M_MMIO_TYPES)
 #define p2m_is_readonly(_t) (p2m_to_mask(_t) & P2M_RO_TYPES)
+#define p2m_is_discard_write(_t) (p2m_to_mask(_t) & P2M_DISCARD_WRITE_TYPES)
 #define p2m_is_changeable(_t) (p2m_to_mask(_t) & P2M_CHANGEABLE_TYPES)
 #define p2m_is_pod(_t) (p2m_to_mask(_t) & P2M_POD_TYPES)
 #define p2m_is_grant(_t) (p2m_to_mask(_t) & P2M_GRANT_TYPES)
-- 
1.9.1




Re: [Xen-devel] [PATCH] docs/design: introduce HVMMEM_ioreq_serverX types

2016-02-25 Thread Yu, Zhang

Hi Paul,

  Thanks a lot for your help on this! And below are my questions.

On 2/25/2016 11:49 PM, Paul Durrant wrote:

This patch adds a new 'designs' subdirectory under docs as a repository
for this and future design proposals.

Signed-off-by: Paul Durrant 
---

For convenience this document can also be viewed in PDF at:

http://xenbits.xen.org/people/pauldu/hvmmem_ioreq_server.pdf
---
  docs/designs/hvmmem_ioreq_server.md | 63 +
  1 file changed, 63 insertions(+)
  create mode 100755 docs/designs/hvmmem_ioreq_server.md

diff --git a/docs/designs/hvmmem_ioreq_server.md 
b/docs/designs/hvmmem_ioreq_server.md
new file mode 100755
index 000..47fa715
--- /dev/null
+++ b/docs/designs/hvmmem_ioreq_server.md
@@ -0,0 +1,63 @@
+HVMMEM\_ioreq\_serverX
+--
+
+Background
+==
+
+The concept of the IOREQ server was introduced to allow multiple distinct
+device emulators to serve a single VM. The XenGT project uses an IOREQ server to
+provide mediated pass-through of Intel GPUs to guests and, as part of the
+mediation, needs to intercept accesses to GPU page-tables (or GTTs) that
+reside in guest RAM.
+
+The current implementation of this sets the type of GTT pages to type
+HVMMEM\_mmio\_write\_dm, which causes Xen to emulate writes to such pages,
+and then maps the guest physical addresses of those pages to the XenGT
+IOREQ server using the HVMOP\_map\_io\_range\_to\_ioreq\_server hypercall.
+However, because the number of GTTs is potentially large, using this
+approach does not scale well.
+
+Proposal
+
+
+Because the number of spare types available in the P2M type-space is
+currently very limited it is proposed that HVMMEM\_mmio\_write\_dm be
+replaced by a single new type HVMMEM\_ioreq\_server. In future, if the
+P2M type-space is increased, this can be renamed to HVMMEM\_ioreq\_server0
+and new HVMMEM\_ioreq\_server1, HVMMEM\_ioreq\_server2, etc. types
+can be added.
+
+Accesses to a page of type HVMMEM\_ioreq\_serverX should be the same as
+HVMMEM\_ram\_rw until the type is _claimed_ by an IOREQ server. Furthermore


Sorry, do you mean that even when a gfn is set to the HVMMEM_ioreq_serverX
type, its access rights in the P2M still remain unchanged? So the new
hypercall pair, HVMOP_[un]map_mem_type_to_ioreq_server, is also responsible
for the PTE updates on the access bits?

If so, I'm afraid this would be time-consuming, because the map/unmap
would have to traverse all P2M structures to detect the PTEs with the
HVMMEM_ioreq_serverX flag set. Yet in XenGT, setting this flag is
triggered dynamically by the construction/destruction of shadow PPGTTs.
And I'm not sure how great the performance penalty would be, with
frequent EPT table walks and EPT TLB flushes.

If not, I guess we can (e.g. when trying to write-protect a gfn):
1> use HVMOP_set_mem_type to set the HVMMEM_ioreq_serverX flag, which
for the write-protected case works the same as HVMMEM_mmio_write_dm;
if successful, accesses to a page of type HVMMEM_ioreq_serverX should
trigger the ioreq server selection path, but will be discarded.
2> after HVMOP_map_mem_type_to_ioreq_server is called, all accesses to
pages of type HVMMEM_ioreq_serverX would be forwarded to the
specified ioreq server.

As for the XenGT backend device model, we only need to issue the map
hypercall once, when trying to construct the first shadow PPGTT, and the
unmap hypercall when a VM is going to be torn down.
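
A sketch of that lifecycle, assuming hypothetical libxc wrappers for
the proposed HVMOP_[un]map_mem_type_to_ioreq_server pair (the wrapper
names and signatures are assumptions; error handling omitted):

    #include <xenctrl.h>

    /* Claim the type once, when the first shadow PPGTT is built:
     * from then on, writes to HVMMEM_ioreq_server pages are forwarded
     * to this ioreq server. */
    static int xengt_claim_wp_type(xc_interface *xch, domid_t domid,
                                   ioservid_t id)
    {
        return xc_hvm_map_mem_type_to_ioreq_server(xch, domid, id,
                                                   HVMMEM_ioreq_server);
    }

    /* Release the type once, when the VM is torn down. */
    static int xengt_release_wp_type(xc_interface *xch, domid_t domid,
                                     ioservid_t id)
    {
        return xc_hvm_unmap_mem_type_from_ioreq_server(xch, domid, id,
                                                       HVMMEM_ioreq_server);
    }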

Any suggestions? :)

Thanks
Yu



Re: [Xen-devel] [PATCH] docs/design: introduce HVMMEM_ioreq_serverX types

2016-02-26 Thread Yu, Zhang

Thanks, Paul.

On 2/26/2016 5:21 PM, Paul Durrant wrote:

-Original Message-
From: Yu, Zhang [mailto:yu.c.zh...@linux.intel.com]
Sent: 26 February 2016 06:59
To: Paul Durrant; xen-de...@lists.xenproject.org
Subject: Re: [Xen-devel] [PATCH] docs/design: introduce HVMMEM_ioreq_serverX types

Hi Paul,

Thanks a lot for your help on this! And below are my questions.

On 2/25/2016 11:49 PM, Paul Durrant wrote:

[...]

Sorry, do you mean that even when a gfn is set to the HVMMEM_ioreq_serverX
type, its access rights in the P2M still remain unchanged? So the new
hypercall pair, HVMOP_[un]map_mem_type_to_ioreq_server, is also responsible
for the PTE updates on the access bits?


Yes, the access bits would not change *on existing* pages of this type until 
the type is claimed.



If so, I'm afraid this would be time-consuming, because the map/unmap
would have to traverse all P2M structures to detect the PTEs with the
HVMMEM_ioreq_serverX flag set. Yet in XenGT, setting this flag is
triggered dynamically by the construction/destruction of shadow PPGTTs.
And I'm not sure how great the performance penalty would be, with
frequent EPT table walks and EPT TLB flushes.


I don't see the concern. I am assuming XenGT would claim the type at start of 
day and then just do HVMOP_set_mem_type operations on PPGTT pages as necessary, 
which would not require p2m traversal (since that hypercall takes the gfn as an 
arg), and the EPT flush requirements are no different to setting the 
mmio_write_dm type in the current implementation.


So this looks like my second interpretation, with step 1> and step 2>
changed?





If not, I guess we can (e.g. when trying to write-protect a gfn):
1> use HVMOP_set_mem_type to set the HVMMEM_ioreq_serverX flag, which
for the write-protected case works the same as HVMMEM_mmio_write_dm;
if successful, accesses to a page of type HVMMEM_ioreq_serverX should
trigger the ioreq server selection path, but will be discarded.
2> after HVMOP_map_mem_type_to_ioreq_server is called, all accesses to
pages of type HVMMEM_ioreq_serverX would be forwarded to the
specified ioreq server.

As for the XenGT backend device model, we only need to issue the map
hypercall once, when trying to construct the first shadow PPGTT, and the
unmap hypercall when a VM is going to be torn down.



Yes, that's correct. A single claim at start of day and a single 'unclaim' at 
end of day. In between, the current HVMOP_set_mem_type is pretty much as before 
(just with the new type name) and there is no longer a need for the hypercall 
to map the range to the ioreq server.

   Paul



B.R.
Yu



Re: [Xen-devel] [PATCH] docs/design: introduce HVMMEM_ioreq_serverX types

2016-02-26 Thread Yu, Zhang

On 2/26/2016 5:18 PM, Jan Beulich wrote:

On 26.02.16 at 07:59,  wrote:

[...]


Sorry, do you mean that even when a gfn is set to the HVMMEM_ioreq_serverX
type, its access rights in the P2M still remain unchanged? So the new
hypercall pair, HVMOP_[un]map_mem_type_to_ioreq_server, is also responsible
for the PTE updates on the access bits?

If so, I'm afraid this would be time-consuming, because the map/unmap
would have to traverse all P2M structures to detect the PTEs with the
HVMMEM_ioreq_serverX flag set. Yet in XenGT, setting this flag is
triggered dynamically by the construction/destruction of shadow PPGTTs.
And I'm not sure how great the performance penalty would be, with
frequent EPT table walks and EPT TLB flushes.


No walking of EPT trees will be necessary in that case, just like it
already has been made unnecessary for other changes resulting
in various PTE attributes needing re-calculation. We'll only need
to extend the p2m_memory_type_changed() mechanism to cover
changes like this one.


So you mean when the access bits are to be updated, we can leverage
something like p2m_memory_type_changed() (which I guess only deals with
memory types, not access bits) to avoid walking the EPT trees? I'll
need to study this part.
need to study this part.
Anyway, thanks for your advice. :)

B.R.
Yu



Re: [Xen-devel] [PATCH] docs/design: introduce HVMMEM_ioreq_serverX types

2016-02-26 Thread Yu, Zhang



On 2/26/2016 5:50 PM, Paul Durrant wrote:

-Original Message-
From: Yu, Zhang [mailto:yu.c.zh...@linux.intel.com]
Sent: 26 February 2016 09:37
To: Jan Beulich
Cc: Paul Durrant; xen-de...@lists.xenproject.org
Subject: Re: [Xen-devel] [PATCH] docs/design: introduce HVMMEM_ioreq_serverX types

On 2/26/2016 5:18 PM, Jan Beulich wrote:

On 26.02.16 at 07:59,  wrote:

[...]


Sorry, do you mean that even when a gfn is set to the HVMMEM_ioreq_serverX
type, its access rights in the P2M still remain unchanged? So the new
hypercall pair, HVMOP_[un]map_mem_type_to_ioreq_server, is also responsible
for the PTE updates on the access bits?

If so, I'm afraid this would be time-consuming, because the map/unmap
would have to traverse all P2M structures to detect the PTEs with the
HVMMEM_ioreq_serverX flag set. Yet in XenGT, setting this flag is
triggered dynamically by the construction/destruction of shadow PPGTTs.
And I'm not sure how great the performance penalty would be, with
frequent EPT table walks and EPT TLB flushes.


No walking of EPT trees will be necessary in that case, just like it
already has been made unnecessary for other changes resulting
in various PTE attributes needing re-calculation. We'll only need
to extend the p2m_memory_type_changed() mechanism to cover
changes like this one.


So you mean when the access bits are to be updated, we can leverage
something like p2m_memory_type_changed() (which I guess only deals with
memory types, not access bits) to avoid walking the EPT trees? I'll
need to study this part.


No, the P2M is walked when the map/unmap hypercall is issued but, in the XenGT 
use-case, that hypercall is issued once at start of day and - if everything is 
working as I believe it should - there won't actually be any pages of type 
HVMMEM_ioreq_server at that point, so no EPT flush is required.


Anyway, thanks for your advice. :)


I will post an implementation hopefully in the next few days once I've proved 
it works in my XenGT rig.


Great. Looking forward to this implementation, and thanks for your
help. :)

B.R.
Yu




Re: [Xen-devel] [V9 0/3] Refactor ioreq server for better performance.

2015-12-31 Thread Yu, Zhang

Shuai, thank you very much for helping me push these patches!
And sorry for the delay due to my illness.
Now I'm back and will pick this up. :)

B.R.
Yu

On 12/15/2015 10:05 AM, Shuai Ruan wrote:

From: Yu Zhang

XenGT leverages the ioreq server to track and forward accesses to
GPU I/O resources, e.g. the PPGTT (per-process graphics translation
tables). Currently, the ioreq server uses rangesets to track the BDF/
PIO/MMIO ranges to be emulated. To select an ioreq server, the
rangeset is searched to see if the I/O range is recorded. However,
traversing the linked list inside a rangeset can be time-consuming
when the number of ranges is high. On the HSW platform, the number of
PPGTTs for each vGPU can be several hundred. On BDW, this value can
be several thousand. This patch series refactors rangeset to base it
on a red-black tree, so that searching is more efficient.

Besides, this patchset also splits the tracking of MMIO and guest
ram ranges into different rangesets. And to accommodate more ranges,
the limit on the number of ranges in an ioreq server, MAX_NR_IO_RANGES,
is raised - future patches might be provided to tune this with other
approaches.

Changes in v9:
1> Change order of patch 2 and patch 3.
2> Introduce a const static array before hvm_ioreq_server_alloc_rangesets().
3> Coding style changes.

Changes in v8:
Use a clearer API name to map/unmap the write-protected memory in
ioreq server.

Changes in v7:
1> Coding style changes;
2> Fix a typo in hvm_select_ioreq_server().

Changes in v6:
Break the identical relationship between ioreq type and rangeset
index inside ioreq server.

Changes in v5:
1> Use gpfn, instead of gpa to track guest write-protected pages;
2> Remove redundant conditional statement in routine find_range().

Changes in v4:
Keep the name HVMOP_IO_RANGE_MEMORY for MMIO resources, and add
a new one, HVMOP_IO_RANGE_WP_MEM, for write-protected memory.

Changes in v3:
1> Use a seperate rangeset for guest ram pages in ioreq server;
2> Refactor rangeset, instead of introduce a new data structure.

Changes in v2:
1> Split the original patch into 2;
2> Take Paul Durrant's comments:
   a> Add a name member in the struct rb_rangeset, and use the 'q'
debug key to dump the ranges in the ioreq server;
   b> Keep original routine names for the hvm ioreq server;
   c> Commit message changes - mention that a future patch will change
the maximum number of ranges inside the ioreq server.


Yu Zhang (3):
   Remove identical relationship between ioreq type and rangeset type.
   Refactor rangeset structure for better performance.
   Differentiate IO/mem resources tracked by ioreq server

  tools/libxc/include/xenctrl.h| 31 +++
  tools/libxc/xc_domain.c  | 61 ++
  xen/arch/x86/hvm/hvm.c   | 43 ++---
  xen/common/rangeset.c| 82 +---
  xen/include/asm-x86/hvm/domain.h |  4 +-
  xen/include/public/hvm/hvm_op.h  |  1 +
  6 files changed, 185 insertions(+), 37 deletions(-)





Re: [Xen-devel] [V9 2/3] Refactor rangeset structure for better performance.

2015-12-31 Thread Yu, Zhang



On 12/21/2015 10:38 PM, Jan Beulich wrote:

On 15.12.15 at 03:05,  wrote:

This patch refactors struct rangeset to base it on the red-black
tree structure, instead of on the current doubly linked list. By
now, ioreq leverages rangeset to keep track of the IO/memory
resources to be emulated. Yet when number of ranges inside one
ioreq server is very high, traversing a doubly linked list could
be time consuming. With this patch, the time complexity for
searching a rangeset can be improved from O(n) to O(log(n)).
Interfaces of rangeset still remain the same, and no new APIs
introduced.


So this indeed addresses one of the two original concerns. But
what about the other (resource use due to thousands of ranges
in use by a single VM)? IOW I'm still unconvinced this is the way
to go.



Thank you, Jan. As you saw in patch 3/3, the other concern was addressed
by extending the rangeset size, which may not be convincing for you.
But I believe this patch - refactoring the rangeset to an rb_tree - does
not only solve XenGT's performance issue, but may also be helpful in
the future, e.g. if someday the rangeset is not allocated on the Xen
heap and can hold a great number of ranges. :)
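
(A back-of-envelope illustration: with the BDW figure of 8192
write-protected ranges from the cover letter, a balanced-tree lookup
needs at most about log2(8192) = 13 comparisons, versus ~4096 on
average for a linear scan of the list.)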

Yu


Jan




Re: [Xen-devel] [V9 3/3] Differentiate IO/mem resources tracked by ioreq server

2015-12-31 Thread Yu, Zhang



On 12/21/2015 10:45 PM, Jan Beulich wrote:

On 15.12.15 at 03:05,  wrote:

--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -935,6 +935,9 @@ static void hvm_ioreq_server_free_rangesets(struct hvm_ioreq_server *s,
  rangeset_destroy(s->range[i]);
  }

+static const char *io_range_name[ NR_IO_RANGE_TYPES ] =


const


OK. Thanks.




+{"port", "mmio", "pci", "wp-ed memory"};


As brief as possible, but still understandable - e.g. "wp-mem"?



Got it. Thanks.


@@ -2593,6 +2597,16 @@ struct hvm_ioreq_server *hvm_select_ioreq_server(struct domain *d,
  type = (p->type == IOREQ_TYPE_PIO) ?
  HVMOP_IO_RANGE_PORT : HVMOP_IO_RANGE_MEMORY;
  addr = p->addr;
+if ( type == HVMOP_IO_RANGE_MEMORY )
+{
+ ram_page = get_page_from_gfn(d, p->addr >> PAGE_SHIFT,
+  &p2mt, P2M_UNSHARE);
+ if ( p2mt == p2m_mmio_write_dm )
+ type = HVMOP_IO_RANGE_WP_MEM;
+
+ if ( ram_page )
+ put_page(ram_page);
+}


You evaluate the page's current type here - what if it subsequently
changes? I don't think it is appropriate to leave the hypervisor at
the mercy of the device model here.



Well, I do not quite understand your concern. :)
Here, get_page_from_gfn() is used to determine whether the addr is MMIO
or write-protected RAM. If this p2m type is changed, the change would
have been triggered by the guest or the device model, e.g. because this
RAM is no longer supposed to be used as a graphics translation table.
And that should be fine. But I also wonder: is there any other routine
more appropriate for getting a p2m type from the gfn?


--- a/xen/include/asm-x86/hvm/domain.h
+++ b/xen/include/asm-x86/hvm/domain.h
@@ -48,8 +48,8 @@ struct hvm_ioreq_vcpu {
  bool_t   pending;
  };

-#define NR_IO_RANGE_TYPES (HVMOP_IO_RANGE_PCI + 1)
-#define MAX_NR_IO_RANGES  256
+#define NR_IO_RANGE_TYPES (HVMOP_IO_RANGE_WP_MEM + 1)
+#define MAX_NR_IO_RANGES  8192


I'm sure I've objected before to this universal bumping of the limit:
Even if I were to withdraw my objection to the higher limit on the
new kind of tracked resource, I would continue to object to all
other resources getting their limits bumped too.



Hah. So how about we keep MAX_NR_IO_RANGES as 256, and use a new value,
say MAX_NR_WR_MEM_RANGES, set to 8192 in this patch? :)

Thanks a lot & happy new year!


Yu


Jan




Re: [Xen-devel] [V9 3/3] Differentiate IO/mem resources tracked by ioreq server

2016-01-06 Thread Yu, Zhang



On 1/6/2016 4:59 PM, Jan Beulich wrote:

On 31.12.15 at 10:33,  wrote:

On 12/21/2015 10:45 PM, Jan Beulich wrote:

On 15.12.15 at 03:05,  wrote:

--- a/xen/include/asm-x86/hvm/domain.h
+++ b/xen/include/asm-x86/hvm/domain.h
@@ -48,8 +48,8 @@ struct hvm_ioreq_vcpu {
   bool_t   pending;
   };

-#define NR_IO_RANGE_TYPES (HVMOP_IO_RANGE_PCI + 1)
-#define MAX_NR_IO_RANGES  256
+#define NR_IO_RANGE_TYPES (HVMOP_IO_RANGE_WP_MEM + 1)
+#define MAX_NR_IO_RANGES  8192


I'm sure I've objected before to this universal bumping of the limit:
Even if I were to withdraw my objection to the higher limit on the
new kind of tracked resource, I would continue to object to all
other resources getting their limits bumped too.



Hah. So how about we keep MAX_NR_IO_RANGES as 256, and use a new value,
say MAX_NR_WR_MEM_RANGES, set to 8192 in this patch? :)


That would at least limit the damage to the newly introduced type.
But I suppose you realize it would still be a resource consumption
concern. In order for this not to become a security issue, you
might e.g. stay with the conservative old limit and allow a command
line or, even better, guest config file override to it (effectively
making the admin state his consent to the higher resource use).


Thanks, Jan. I'll try to use the guest config file to set this limit. :)

Yu


Jan




Re: [Xen-devel] [V9 3/3] Differentiate IO/mem resources tracked by ioreq server

2016-01-06 Thread Yu, Zhang



On 1/6/2016 5:58 PM, Jan Beulich wrote:

On 06.01.16 at 10:44,  wrote:

  -Original Message-
From: Jan Beulich [mailto:jbeul...@suse.com]
Sent: 06 January 2016 08:59
To: Zhang Yu
Cc: Andrew Cooper; Paul Durrant; Wei Liu; Ian Jackson; Stefano Stabellini;
Kevin Tian; zhiyuan...@intel.com; Shuai Ruan; xen-devel@lists.xen.org; Keir
(Xen.org)
Subject: Re: [Xen-devel] [V9 3/3] Differentiate IO/mem resources tracked by
ioreq server


On 31.12.15 at 10:33,  wrote:

On 12/21/2015 10:45 PM, Jan Beulich wrote:

On 15.12.15 at 03:05,  wrote:

@@ -2593,6 +2597,16 @@ struct hvm_ioreq_server *hvm_select_ioreq_server(struct domain *d,

   type = (p->type == IOREQ_TYPE_PIO) ?
   HVMOP_IO_RANGE_PORT : HVMOP_IO_RANGE_MEMORY;
   addr = p->addr;
+if ( type == HVMOP_IO_RANGE_MEMORY )
+{
+ ram_page = get_page_from_gfn(d, p->addr >> PAGE_SHIFT,
+  &p2mt, P2M_UNSHARE);
+ if ( p2mt == p2m_mmio_write_dm )
+ type = HVMOP_IO_RANGE_WP_MEM;
+
+ if ( ram_page )
+ put_page(ram_page);
+}


You evaluate the page's current type here - what if it subsequently
changes? I don't think it is appropriate to leave the hypervisor at
the mercy of the device model here.


Well, I do not quite understand your concern. :)
Here, get_page_from_gfn() is used to determine whether the addr is MMIO
or write-protected RAM. If this p2m type is changed, the change would
have been triggered by the guest or the device model, e.g. because this
RAM is no longer supposed to be used as a graphics translation table.
And that should be fine. But I also wonder: is there any other routine
more appropriate for getting a p2m type from the gfn?


No, the question isn't the choice of method to retrieve the
current type, but the lack of measures against the retrieved
type becoming stale by the time you actually use it.


I don't think that issue is specific to this code. AFAIK nothing in the I/O
emulation system protects against a type change whilst a request is in
flight.
Also, what are the consequences of a change? Only that the wrong range type
is selected and the emulation goes to the wrong place. This may be a problem
for the VM but should cause no other problems.


Okay, I buy this argument, but I think it would help if that was spelled
out this way in the commit message.


Thank you, Paul & Jan. :)
A note will be added to explain this in the commit message in the next
version.

Yu


Jan




[Xen-devel] [PATCH v10 2/3] Differentiate IO/mem resources tracked by ioreq server

2016-01-19 Thread Yu Zhang
Currently in the ioreq server, guest write-protected ram pages are
tracked in the same rangeset as device mmio resources. Yet
unlike device mmio, which can come in big chunks, the guest write-
protected pages may be discrete ranges of 4K bytes each. This
patch uses a separate rangeset for the guest ram pages.

To differentiate between the write-protected memory ranges and the
mmio ranges when selecting an ioreq server, the p2m type is retrieved
by calling get_page_from_gfn(). And we do not need to worry about the
p2m type changing during the ioreq selection process.

Note: Previously, a new hypercall or subop was suggested to map
write-protected pages into the ioreq server. However, it turned out
the handler of this new hypercall would be almost the same as the
existing pair - HVMOP_[un]map_io_range_to_ioreq_server - and there's
already a type parameter in this hypercall. So no new hypercall is
defined; only a new type is introduced.

Acked-by: Wei Liu 
Acked-by: Ian Campbell 
Reviewed-by: Kevin Tian 
Signed-off-by: Shuai Ruan 
Signed-off-by: Yu Zhang 
---
 tools/libxc/include/xenctrl.h| 31 
 tools/libxc/xc_domain.c  | 61 
 xen/arch/x86/hvm/hvm.c   | 27 +++---
 xen/include/asm-x86/hvm/domain.h |  2 +-
 xen/include/public/hvm/hvm_op.h  |  1 +
 5 files changed, 117 insertions(+), 5 deletions(-)

diff --git a/tools/libxc/include/xenctrl.h b/tools/libxc/include/xenctrl.h
index 079cad0..036c72d 100644
--- a/tools/libxc/include/xenctrl.h
+++ b/tools/libxc/include/xenctrl.h
@@ -2023,6 +2023,37 @@ int xc_hvm_unmap_io_range_from_ioreq_server(xc_interface *xch,
 int is_mmio,
 uint64_t start,
 uint64_t end);
+/**
+ * This function registers a range of write-protected memory for emulation.
+ *
+ * @parm xch a handle to an open hypervisor interface.
+ * @parm domid the domain id to be serviced
+ * @parm id the IOREQ Server id.
+ * @parm start start of range
+ * @parm end end of range (inclusive).
+ * @return 0 on success, -1 on failure.
+ */
+int xc_hvm_map_wp_mem_range_to_ioreq_server(xc_interface *xch,
+domid_t domid,
+ioservid_t id,
+xen_pfn_t start,
+xen_pfn_t end);
+
+/**
+ * This function deregisters a range of write-protected memory for emulation.
+ *
+ * @parm xch a handle to an open hypervisor interface.
+ * @parm domid the domain id to be serviced
+ * @parm id the IOREQ Server id.
+ * @parm start start of range
+ * @parm end end of range (inclusive).
+ * @return 0 on success, -1 on failure.
+ */
+int xc_hvm_unmap_wp_mem_range_from_ioreq_server(xc_interface *xch,
+domid_t domid,
+ioservid_t id,
+xen_pfn_t start,
+xen_pfn_t end);
 
 /**
  * This function registers a PCI device for config space emulation.
diff --git a/tools/libxc/xc_domain.c b/tools/libxc/xc_domain.c
index 99e0d48..4f43695 100644
--- a/tools/libxc/xc_domain.c
+++ b/tools/libxc/xc_domain.c
@@ -1544,6 +1544,67 @@ int xc_hvm_unmap_io_range_from_ioreq_server(xc_interface *xch, domid_t domid,
 return rc;
 }
 
+int xc_hvm_map_wp_mem_range_to_ioreq_server(xc_interface *xch,
+domid_t domid,
+ioservid_t id,
+xen_pfn_t start,
+xen_pfn_t end)
+{
+DECLARE_HYPERCALL;
+DECLARE_HYPERCALL_BUFFER(xen_hvm_io_range_t, arg);
+int rc;
+
+arg = xc_hypercall_buffer_alloc(xch, arg, sizeof(*arg));
+if ( arg == NULL )
+return -1;
+
+hypercall.op = __HYPERVISOR_hvm_op;
+hypercall.arg[0] = HVMOP_map_io_range_to_ioreq_server;
+hypercall.arg[1] = HYPERCALL_BUFFER_AS_ARG(arg);
+
+arg->domid = domid;
+arg->id = id;
+arg->type = HVMOP_IO_RANGE_WP_MEM;
+arg->start = start;
+arg->end = end;
+
+rc = do_xen_hypercall(xch, &hypercall);
+
+xc_hypercall_buffer_free(xch, arg);
+return rc;
+}
+
+int xc_hvm_unmap_wp_mem_range_from_ioreq_server(xc_interface *xch,
+domid_t domid,
+ioservid_t id,
+xen_pfn_t start,
+xen_pfn_t end)
+{
+DECLARE_HYPERCALL;
+DECLARE_HYPERCALL_BUFFER(xen_hvm_io_range_t, arg);
+int rc;
+
+arg = xc_hypercall_buffer_alloc(xch, arg, sizeof(*arg));
+if ( arg == NULL )
+return

[Xen-devel] [PATCH v10 0/3] Refactor ioreq server for better performance.

2016-01-19 Thread Yu Zhang
XenGT leverages the ioreq server to track and forward accesses to
GPU I/O resources, e.g. the PPGTT (per-process graphics translation
tables). Currently, the ioreq server uses rangesets to track the BDF/
PIO/MMIO ranges to be emulated. To select an ioreq server, the
rangeset is searched to see if the I/O range is recorded. However,
traversing the linked list inside a rangeset can be time-consuming
when the number of ranges is high. On the HSW platform, the number of
PPGTTs for each vGPU can be several hundred. On BDW, this value can
be several thousand. This patch series refactors rangeset to base it
on a red-black tree, so that searching is more efficient.

Besides, this patchset also splits the tracking of MMIO and guest
ram ranges into different rangesets. And to accommodate more ranges,
a new parameter, max_ranges, is introduced in the hvm configuration
file.
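
For illustration, a guest config fragment using the new parameter (the
8192 figure is the BDW value discussed in patch 3/3; tune it per
workload):

    builder = "hvm"
    max_ranges = 8192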

Changes in v10: 
1> Add a new patch to configure the range limit inside ioreq server.
2> Commit message changes. 
3> The previous patch "[1/3] Remove identical relationship between
   ioreq type and rangeset type." has already been merged, and is not
   included in this series now.

Changes in v9: 
1> Change order of patch 2 and patch 3.
2> Introduce a const static array before hvm_ioreq_server_alloc_rangesets().
3> Coding style changes.

Changes in v8: 
Use a clearer API name to map/unmap the write-protected memory in
ioreq server.

Changes in v7: 
1> Coding style changes;
2> Fix a typo in hvm_select_ioreq_server().

Changes in v6: 
Break the identical relationship between ioreq type and rangeset
index inside ioreq server.

Changes in v5:
1> Use gpfn, instead of gpa to track guest write-protected pages;
2> Remove redundant conditional statement in routine find_range().

Changes in v4:
Keep the name HVMOP_IO_RANGE_MEMORY for MMIO resources, and add
a new one, HVMOP_IO_RANGE_WP_MEM, for write-protected memory.

Changes in v3:
1> Use a seperate rangeset for guest ram pages in ioreq server;
2> Refactor rangeset, instead of introduce a new data structure.

Changes in v2:
1> Split the original patch into 2;
2> Take Paul Durrant's comments:
  a> Add a name member in the struct rb_rangeset, and use the 'q'
debug key to dump the ranges in the ioreq server;
  b> Keep original routine names for the hvm ioreq server;
  c> Commit message changes - mention that a future patch will change
the maximum number of ranges inside the ioreq server.

Yu Zhang (3):
  Refactor rangeset structure for better performance.
  Differentiate IO/mem resources tracked by ioreq server
  tools: introduce parameter max_ranges.

 docs/man/xl.cfg.pod.5| 17 +
 tools/libxc/include/xenctrl.h| 31 +++
 tools/libxc/xc_domain.c  | 61 ++
 tools/libxl/libxl_dom.c  |  3 ++
 tools/libxl/libxl_types.idl  |  1 +
 tools/libxl/xl_cmdimpl.c |  4 ++
 xen/arch/x86/hvm/hvm.c   | 34 ++---
 xen/common/rangeset.c| 82 +---
 xen/include/asm-x86/hvm/domain.h |  2 +-
 xen/include/public/hvm/hvm_op.h  |  1 +
 xen/include/public/hvm/params.h  |  5 ++-
 11 files changed, 212 insertions(+), 29 deletions(-)

-- 
1.9.1




[Xen-devel] [PATCH 3/3] tools: introduce parameter max_ranges.

2016-01-19 Thread Yu Zhang
A new parameter - max_ranges - is added to set the upper limit on the
number of ranges to be tracked inside one ioreq server rangeset.

The ioreq server uses a group of rangesets to track the I/O or memory
resources to be emulated. The default value of this limit is set to
256. Yet there are circumstances under which the limit should exceed
the default one. E.g. in XenGT, when tracking the per-process graphics
translation tables on Intel Broadwell platforms, the number of page
tables concerned will be several thousand (normally in this case, 8192
should be a big enough value). Users who set this item explicitly are
supposed to know the specific scenarios that necessitate this
configuration.

Signed-off-by: Yu Zhang 
---
 docs/man/xl.cfg.pod.5   | 17 +
 tools/libxl/libxl_dom.c |  3 +++
 tools/libxl/libxl_types.idl |  1 +
 tools/libxl/xl_cmdimpl.c|  4 
 xen/arch/x86/hvm/hvm.c  |  7 ++-
 xen/include/public/hvm/params.h |  5 -
 6 files changed, 35 insertions(+), 2 deletions(-)

diff --git a/docs/man/xl.cfg.pod.5 b/docs/man/xl.cfg.pod.5
index 8899f75..562563d 100644
--- a/docs/man/xl.cfg.pod.5
+++ b/docs/man/xl.cfg.pod.5
@@ -962,6 +962,23 @@ FIFO-based event channel ABI support up to 131,071 event 
channels.
 Other guests are limited to 4095 (64-bit x86 and ARM) or 1023 (32-bit
 x86).
 
+=item B<max_ranges>
+
+Limit the maximum ranges that can be tracked inside one ioreq server
+rangeset.
+
+The ioreq server uses a group of rangesets to track the I/O or memory
+resources to be emulated. By default, this item is not set. Not
+configuring this item, or setting its value to 0, will result in the
+upper limit being set to its default value - 256. Yet there are
+circumstances under which the upper limit inside one rangeset should
+exceed the default one. E.g. in XenGT, when tracking the per-process
+graphics translation tables on Intel Broadwell platforms, the number
+of page tables concerned will be several thousand (normally in this
+case, 8192 should be a big enough value). Users who set this item
+explicitly are supposed to know the specific scenarios that
+necessitate this configuration.
+
 =back
 
 =head2 Paravirtualised (PV) Guest Specific Options
diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c
index 47971a9..607b0c4 100644
--- a/tools/libxl/libxl_dom.c
+++ b/tools/libxl/libxl_dom.c
@@ -288,6 +288,9 @@ static void hvm_set_conf_params(xc_interface *handle, uint32_t domid,
 libxl_defbool_val(info->u.hvm.nested_hvm));
 xc_hvm_param_set(handle, domid, HVM_PARAM_ALTP2M,
 libxl_defbool_val(info->u.hvm.altp2m));
+if (info->u.hvm.max_ranges > 0)
+xc_hvm_param_set(handle, domid, HVM_PARAM_MAX_RANGES,
+info->u.hvm.max_ranges);
 }
 
 int libxl__build_pre(libxl__gc *gc, uint32_t domid,
diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl
index 9ad7eba..c936265 100644
--- a/tools/libxl/libxl_types.idl
+++ b/tools/libxl/libxl_types.idl
@@ -518,6 +518,7 @@ libxl_domain_build_info = Struct("domain_build_info",[
("serial_list",  libxl_string_list),
("rdm", libxl_rdm_reserve),
("rdm_mem_boundary_memkb", MemKB),
+   ("max_ranges", uint32),
])),
  ("pv", Struct(None, [("kernel", string),
   ("slack_memkb", MemKB),
diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
index 25507c7..9359de7 100644
--- a/tools/libxl/xl_cmdimpl.c
+++ b/tools/libxl/xl_cmdimpl.c
@@ -1626,6 +1626,10 @@ static void parse_config_data(const char *config_source,
 
 if (!xlu_cfg_get_long (config, "rdm_mem_boundary", &l, 0))
 b_info->u.hvm.rdm_mem_boundary_memkb = l * 1024;
+
+if (!xlu_cfg_get_long (config, "max_ranges", &l, 0))
+b_info->u.hvm.max_ranges = l;
+
 break;
 case LIBXL_DOMAIN_TYPE_PV:
 {
diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index d59e7bc..2f85089 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -943,6 +943,10 @@ static int hvm_ioreq_server_alloc_rangesets(struct hvm_ioreq_server *s,
 {
 unsigned int i;
 int rc;
+unsigned int max_ranges =
+( s->domain->arch.hvm_domain.params[HVM_PARAM_MAX_RANGES] > 0 ) ?
+s->domain->arch.hvm_domain.params[HVM_PARAM_MAX_RANGES] :
+MAX_NR_IO_RANGES;
 
 if ( is_default )
 goto done;
@@ -965,7 +969,7 @@ static int hvm_ioreq_server_alloc_rangesets(struct hvm_ioreq_server *s,
 if ( !s->range[i] )
 goto fail;
 
-rangeset_limit(s->range[i], MAX_NR_IO_RANGES);
+rangeset_limit(s->range

[Xen-devel] [PATCH v10 1/3] Refactor rangeset structure for better performance.

2016-01-19 Thread Yu Zhang
This patch refactors struct rangeset to base it on the red-black
tree structure, instead of on the current doubly linked list. By
now, ioreq leverages rangeset to keep track of the IO/memory
resources to be emulated. Yet when number of ranges inside one
ioreq server is very high, traversing a doubly linked list could
be time consuming. With this patch, the time complexity for
searching a rangeset can be improved from O(n) to O(log(n)).
Interfaces of rangeset still remain the same, and no new APIs
introduced.

Reviewed-by: Paul Durrant 
Signed-off-by: Shuai Ruan 
Signed-off-by: Yu Zhang 
---
 xen/common/rangeset.c | 82 +--
 1 file changed, 60 insertions(+), 22 deletions(-)

diff --git a/xen/common/rangeset.c b/xen/common/rangeset.c
index 6c6293c..d15d8d5 100644
--- a/xen/common/rangeset.c
+++ b/xen/common/rangeset.c
@@ -10,11 +10,12 @@
 #include 
 #include 
+#include <xen/rbtree.h>
+#include 
 #include 
 
 /* An inclusive range [s,e] and pointer to next range in ascending order. */
 struct range {
-struct list_head list;
+struct rb_node node;
 unsigned long s, e;
 };
 
@@ -24,7 +25,7 @@ struct rangeset {
 struct domain   *domain;
 
 /* Ordered list of ranges contained in this set, and protecting lock. */
-struct list_head range_list;
+struct rb_root   range_tree;
 
 /* Number of ranges that can be allocated */
 long nr_ranges;
@@ -45,41 +46,78 @@ struct rangeset {
 static struct range *find_range(
 struct rangeset *r, unsigned long s)
 {
-struct range *x = NULL, *y;
+struct rb_node *node;
+struct range   *x;
+struct range   *prev = NULL;
 
-list_for_each_entry ( y, &r->range_list, list )
+node = r->range_tree.rb_node;
+while ( node != NULL )
 {
-if ( y->s > s )
-break;
-x = y;
+x = container_of(node, struct range, node);
+if ( (s >= x->s) && (s <= x->e) )
+return x;
+if ( s < x->s )
+node = node->rb_left;
+else
+{
+prev = x;
+node = node->rb_right;
+}
 }
 
-return x;
+return prev;
 }
 
 /* Return the lowest range in the set r, or NULL if r is empty. */
 static struct range *first_range(
 struct rangeset *r)
 {
-if ( list_empty(&r->range_list) )
-return NULL;
-return list_entry(r->range_list.next, struct range, list);
+struct rb_node *node;
+
+node = rb_first(&r->range_tree);
+if ( node != NULL )
+return container_of(node, struct range, node);
+
+return NULL;
 }
 
 /* Return range following x in ascending order, or NULL if x is the highest. */
 static struct range *next_range(
 struct rangeset *r, struct range *x)
 {
-if ( x->list.next == &r->range_list )
-return NULL;
-return list_entry(x->list.next, struct range, list);
+struct rb_node *node;
+
+node = rb_next(&x->node);
+if ( node != NULL )
+return container_of(node, struct range, node);
+
+return NULL;
 }
 
 /* Insert range y after range x in r. Insert as first range if x is NULL. */
 static void insert_range(
 struct rangeset *r, struct range *x, struct range *y)
 {
-list_add(&y->list, (x != NULL) ? &x->list : &r->range_list);
+struct rb_node **node;
+struct rb_node *parent = NULL;
+
+if ( x == NULL )
+node = &r->range_tree.rb_node;
+else
+{
+node = &x->node.rb_right;
+parent = &x->node;
+}
+
+while ( *node != NULL )
+{
+parent = *node;
+node = &parent->rb_left;
+}
+
+/* Add new node and rebalance the red-black tree. */
+rb_link_node(&y->node, parent, node);
+rb_insert_color(&y->node, &r->range_tree);
 }
 
 /* Remove a range from its list and free it. */
@@ -88,7 +126,7 @@ static void destroy_range(
 {
 r->nr_ranges++;
 
-list_del(&x->list);
+rb_erase(&x->node, &r->range_tree);
 xfree(x);
 }
 
@@ -319,7 +357,7 @@ bool_t rangeset_contains_singleton(
 bool_t rangeset_is_empty(
 const struct rangeset *r)
 {
-return ((r == NULL) || list_empty(&r->range_list));
+return ((r == NULL) || RB_EMPTY_ROOT(&r->range_tree));
 }
 
 struct rangeset *rangeset_new(
@@ -332,7 +370,7 @@ struct rangeset *rangeset_new(
 return NULL;
 
 rwlock_init(&r->lock);
-INIT_LIST_HEAD(&r->range_list);
+r->range_tree = RB_ROOT;
 r->nr_ranges = -1;
 
 BUG_ON(flags & ~RANGESETF_prettyprint_hex);
@@ -410,7 +448,7 @@ void rangeset_domain_destroy(
 
 void rangeset_swap(struct rangeset *a, struct rangeset *b)
 {
-LIST_HEAD(tmp);
+struct rb_node *tmp;
 
 if ( a < b )
 {
@@ -423,9 +461,9 @@ void rangeset_swap(struct rangeset *a, struct rangeset *b)
 write_lock(&
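
A minimal usage sketch (not part of the patch; the function and values
below are made up) showing that existing rangeset callers are unaffected
by the switch to an rb-tree - the interface is unchanged, only the
internal search changes:

#include <xen/rangeset.h>
#include <xen/sched.h>

/* Sketch: track one write-protected gfn and test membership using the
 * existing rangeset interfaces, exactly as before this patch. */
static bool_t example_track_gfn(struct domain *d, unsigned long gfn)
{
    struct rangeset *r = rangeset_new(d, "example", 0);
    bool_t found;

    if ( r == NULL )
        return 0;

    rangeset_limit(r, 256);                  /* cap allocatable ranges */

    if ( rangeset_add_range(r, gfn, gfn) )   /* one 4K page, inclusive */
    {
        rangeset_destroy(r);
        return 0;
    }

    /* With this patch, the lookup below is an O(log n) tree descent. */
    found = rangeset_contains_singleton(r, gfn);
    rangeset_destroy(r);

    return found;
}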

Re: [Xen-devel] [PATCH 3/3] tools: introduce parameter max_ranges.

2016-01-19 Thread Yu, Zhang



On 1/20/2016 11:14 AM, Tian, Kevin wrote:

From: Ian Campbell [mailto:ian.campb...@citrix.com]
Sent: Tuesday, January 19, 2016 11:19 PM

On Tue, 2016-01-19 at 15:04 +, Wei Liu wrote:

This patch doesn't seem to have been CCd to the tools maintainers, adding
Ian too, I think everyone else was picked up along the way.

Please use ./scripts/get_maintainers.pl in the future.


On Tue, Jan 19, 2016 at 02:47:40PM +, Paul Durrant wrote:
[...]

ranges so perhaps the parameter name could be
'max_wp_memory_ranges'?




What does "WP" mean? "Write Protected"?



Yes.


Is this parameter closely related to IOREQ server? Should it contain
"ioreq" somehow?



It is closely related but ioreq server is an implementation detail so
do we want to expose it as a tunable? The concept we need to capture
is that the toolstack can tune the limit of the maximum number of
pages in the VM that can be set such that writes are emulated (but
reads are as for normal ram). Or I guess we could get very specific
and call it something like 'max_gtt_shadows'?


I would prefer a generic concept in this case ("wp"). Let's wait a bit for
other people to voice their opinion.

Whichever one we pick, the meaning of the acronym needs to be clearly
documented...


I've got no ideas for a better name, "max_ranges" is clearly too generic
though.

One thought -- does XenGT require some other configuration option to enable
it or maybe a privilege which the target domain must necessarily have?
Could we use something like one of those to cause the t/stack to just DTRT
without the user having to micromanage the number of pages which are
allowed to have this property?



Using "wp" is clear to me.


Thank you all. :)
So how about "max_wp_ram_ranges"? And the "wp" shall be well explained
in the documentation.



As a feature, this write-protection is not GPU-virtualization specific.
In the future the same mediated pass-through idea used in XenGT may be
used on other I/O devices which need to shadow some structure, with a
requirement to write-protect guest memory. So it's not good to tie this
to either XenGT or GTT.


Thank you, Kevin.
Well, if this parameter is not supposed to be XenGT specific, we do not
need to connect it with any XenGT flag such as "vgt=1" or "GVT-g=1".
Hence the user will have to configure max_wp_ram_ranges himself,
right?

B.R.
Yu


Thanks
Kevin





Re: [Xen-devel] [PATCH 3/3] tools: introduce parameter max_ranges.

2016-01-19 Thread Yu, Zhang



On 1/20/2016 11:58 AM, Tian, Kevin wrote:

From: Yu, Zhang [mailto:yu.c.zh...@linux.intel.com]
Sent: Wednesday, January 20, 2016 11:33 AM

As a feature, this write-protection is not GPU-virtualization specific.
In the future the same mediated pass-through idea used in XenGT may be
used on other I/O devices which need to shadow some structure, with a
requirement to write-protect guest memory. So it's not good to tie this
to either XenGT or GTT.


Thank you, Kevin.
Well, if this parameter is not supposed to be XenGT specific, we do not
need to connect it with any XenGT flag such as "vgt=1" or "GVT-g=1".
Hence the user will have to configure max_wp_ram_ranges himself,
right?



Not always. The option can be configured manually by the user, or
automatically set in the code when "vgt=1" is recognized.


OK. That sounds more reasonable. :)
To give a summary, I'll do the following changes in the next version:

1> rename this new parameter to "max_wp_ram_ranges", then use this
parameter as the wp-ram rangeset limit; for the I/O rangeset, keep
MAX_NR_IO_RANGES as its limit;
2> clear the documentation part;
3> define a LIBXL_HAVE_XXX in libxl.h to indicate a new field in the
build info;
4> We do not introduce the xengt flag by now, and will add code to
automatically set the "max_wp_ram_ranges" after this flag is accepted
in the future.

Does anyone have more suggestions? :)

B.R.
Yu


Thanks
Kevin





Re: [Xen-devel] [PATCH 3/3] tools: introduce parameter max_ranges.

2016-01-20 Thread Yu, Zhang



On 1/20/2016 6:18 PM, Paul Durrant wrote:

-Original Message-
From: Ian Campbell [mailto:ian.campb...@citrix.com]
Sent: 20 January 2016 10:16
To: Kevin Tian; Yu, Zhang; Wei Liu; Paul Durrant
Cc: Keir (Xen.org); jbeul...@suse.com; Andrew Cooper; xen-
de...@lists.xen.org; Lv, Zhiyuan; Stefano Stabellini
Subject: Re: [Xen-devel] [PATCH 3/3] tools: introduce parameter
max_ranges.

On Wed, 2016-01-20 at 03:58 +, Tian, Kevin wrote:

From: Yu, Zhang [mailto:yu.c.zh...@linux.intel.com]
Sent: Wednesday, January 20, 2016 11:33 AM

As a feature, this write-protection is not GPU-virtualization specific.
In the future the same mediated pass-through idea used in XenGT may be
used on other I/O devices which need to shadow some structure, with a
requirement to write-protect guest memory. So it's not good to tie this
to either XenGT or GTT.


Thank you, Kevin.
Well, if this parameter is not supposed to be XenGT specific, we do not
need to connect it with any XenGT flag such as "vgt=1" or "GVT-g=1".
Hence the user will have to configure max_wp_ram_ranges himself,
right?



Not always. The option can be configured manually by the user, or
automatically set in the code when "vgt=1" is recognized.


Is the latter approach not always sufficient? IOW, if it can be done
automatically, why would the user need to tweak it?



I think the latter is sufficient for now. We always have the option of
adding a specific wp_ram_ranges parameter in the future if there is a need.


Thank you all for your reply.
Well, I believe the latter option is only sufficient for most
usage models on BDW, due to rangeset's ability to merge contiguous
pages into one range; but there might be some extreme cases, e.g.
too many graphics-related applications in one VM, which create a
great deal of per-process graphic translation tables. And also,
future cpu platforms might provide even more PPGTTs. So, I suggest
we use this max_wp_ram_ranges, and give the control to the system
administrator. Besides, like Kevin said, XenGT's mediated pass-thru
idea can also be adopted for other devices, and this parameter may
also help.
Also, we have plans to upstream the tool-stack changes later this
year. If this max_wp_ram_ranges is not convenient, we can introduce
a method to automatically set its default value.

B.R.
Yu


   Paul


Ian.




[Xen-devel] [PATCH v11 2/3] Differentiate IO/mem resources tracked by ioreq server

2016-01-21 Thread Yu Zhang
Currently in the ioreq server, guest write-protected ram pages are
tracked in the same rangeset as device mmio resources. Yet
unlike device mmio, which can be in big chunks, the guest write-
protected pages may be discrete ranges of 4K bytes each. This
patch uses a separate rangeset for the guest ram pages.

To differentiate the ioreq type between the write-protected memory
ranges and the mmio ranges when selecting an ioreq server, the p2m
type is retrieved by calling get_page_from_gfn(). And we do not
need to worry about the p2m type changing during the ioreq selection
process.

Note: Previously, a new hypercall or subop was suggested to map
write-protected pages into the ioreq server. However, it turned out
that the handler of this new hypercall would be almost the same as the
existing pair - HVMOP_[un]map_io_range_to_ioreq_server, and there's
already a type parameter in this hypercall. So no new hypercall is
defined; only a new type is introduced.

Acked-by: Wei Liu 
Acked-by: Ian Campbell 
Reviewed-by: Kevin Tian 
Reviewed-by: Paul Durrant 
Signed-off-by: Shuai Ruan 
Signed-off-by: Yu Zhang 
---
 tools/libxc/include/xenctrl.h| 31 
 tools/libxc/xc_domain.c  | 61 
 xen/arch/x86/hvm/hvm.c   | 27 +++---
 xen/include/asm-x86/hvm/domain.h |  2 +-
 xen/include/public/hvm/hvm_op.h  |  1 +
 5 files changed, 117 insertions(+), 5 deletions(-)

diff --git a/tools/libxc/include/xenctrl.h b/tools/libxc/include/xenctrl.h
index 079cad0..036c72d 100644
--- a/tools/libxc/include/xenctrl.h
+++ b/tools/libxc/include/xenctrl.h
@@ -2023,6 +2023,37 @@ int xc_hvm_unmap_io_range_from_ioreq_server(xc_interface 
*xch,
 int is_mmio,
 uint64_t start,
 uint64_t end);
+/**
+ * This function registers a range of write-protected memory for emulation.
+ *
+ * @parm xch a handle to an open hypervisor interface.
+ * @parm domid the domain id to be serviced
+ * @parm id the IOREQ Server id.
+ * @parm start start of range
+ * @parm end end of range (inclusive).
+ * @return 0 on success, -1 on failure.
+ */
+int xc_hvm_map_wp_mem_range_to_ioreq_server(xc_interface *xch,
+domid_t domid,
+ioservid_t id,
+xen_pfn_t start,
+xen_pfn_t end);
+
+/**
+ * This function deregisters a range of write-protected memory for emulation.
+ *
+ * @parm xch a handle to an open hypervisor interface.
+ * @parm domid the domain id to be serviced
+ * @parm id the IOREQ Server id.
+ * @parm start start of range
+ * @parm end end of range (inclusive).
+ * @return 0 on success, -1 on failure.
+ */
+int xc_hvm_unmap_wp_mem_range_from_ioreq_server(xc_interface *xch,
+domid_t domid,
+ioservid_t id,
+xen_pfn_t start,
+xen_pfn_t end);
 
 /**
  * This function registers a PCI device for config space emulation.
diff --git a/tools/libxc/xc_domain.c b/tools/libxc/xc_domain.c
index 99e0d48..4f43695 100644
--- a/tools/libxc/xc_domain.c
+++ b/tools/libxc/xc_domain.c
@@ -1544,6 +1544,67 @@ int xc_hvm_unmap_io_range_from_ioreq_server(xc_interface 
*xch, domid_t domid,
 return rc;
 }
 
+int xc_hvm_map_wp_mem_range_to_ioreq_server(xc_interface *xch,
+domid_t domid,
+ioservid_t id,
+xen_pfn_t start,
+xen_pfn_t end)
+{
+DECLARE_HYPERCALL;
+DECLARE_HYPERCALL_BUFFER(xen_hvm_io_range_t, arg);
+int rc;
+
+arg = xc_hypercall_buffer_alloc(xch, arg, sizeof(*arg));
+if ( arg == NULL )
+return -1;
+
+hypercall.op = __HYPERVISOR_hvm_op;
+hypercall.arg[0] = HVMOP_map_io_range_to_ioreq_server;
+hypercall.arg[1] = HYPERCALL_BUFFER_AS_ARG(arg);
+
+arg->domid = domid;
+arg->id = id;
+arg->type = HVMOP_IO_RANGE_WP_MEM;
+arg->start = start;
+arg->end = end;
+
+rc = do_xen_hypercall(xch, &hypercall);
+
+xc_hypercall_buffer_free(xch, arg);
+return rc;
+}
+
+int xc_hvm_unmap_wp_mem_range_from_ioreq_server(xc_interface *xch,
+domid_t domid,
+ioservid_t id,
+xen_pfn_t start,
+xen_pfn_t end)
+{
+DECLARE_HYPERCALL;
+DECLARE_HYPERCALL_BUFFER(xen_hvm_io_range_t, arg);
+int rc;
+
+arg = xc_hypercall_buffer_alloc(xch, arg, sizeof(*arg));
+if ( arg == NULL )
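
As a usage sketch (illustrative only - the helper name and the idea of
calling it per page are made up; the libxc calls are the ones added by
this patch), an emulator such as the XenGT backend could register a
single guest page-table page like this:

#include <xenctrl.h>

/* Sketch: write-protect one guest page so that writes to it are
 * forwarded to the ioreq server. start == end because the range is
 * inclusive and covers a single 4K page. */
static int track_ppgtt_page(xc_interface *xch, domid_t domid,
                            ioservid_t id, xen_pfn_t gfn)
{
    return xc_hvm_map_wp_mem_range_to_ioreq_server(xch, domid, id,
                                                   gfn, gfn);
}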

[Xen-devel] [PATCH v11 0/3] Refactor ioreq server for better performance.

2016-01-21 Thread Yu Zhang
XenGT leverages the ioreq server to track and forward accesses to
GPU I/O resources, e.g. the PPGTT (per-process graphic translation
tables). Currently, the ioreq server uses rangeset to track the BDF/
PIO/MMIO ranges to be emulated. To select an ioreq server, the
rangeset is searched to see if the I/O range is recorded. However,
traversing the linked list inside rangeset could be time consuming
when the number of ranges is high. On the HSW platform, the number
of PPGTTs for each vGPU could be several hundred. On BDW, this value
could be several thousand. This patch series refactors rangeset to
base it on a red-black tree, so that searching is more efficient.

Besides, this patchset also splits the tracking of MMIO and guest
ram ranges into different rangesets. And to accommodate more ranges,
a new parameter, max_wp_ram_ranges, is introduced in the hvm
configuration file.
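
To put rough numbers on the search cost (illustrative, assuming a
reasonably balanced tree and taking n = 8192, the value suggested
for BDW in patch 3/3):

  linked-list search:    up to n = 8192 nodes visited
  red-black tree search: at most ~2*log2(n) = 26 nodes visited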

Changes in v11: 
1> rename the new parameter to "max_wp_ram_ranges", and use it
specifically for write-protected ram ranges.
2> clear the documentation part.
3> define a LIBXL_HAVE_BUILDINFO_HVM_MAX_WP_RAM_RANGES in libxl.h.

Changes in v10: 
1> Add a new patch to configure the range limit inside ioreq server.
2> Commit message changes. 
3> The previous patch "[1/3] Remove identical relationship between
   ioreq type and rangeset type." has already been merged, and is not
   included in this series now.

Changes in v9: 
1> Change order of patch 2 and patch3.
2> Introduce a const static array before hvm_ioreq_server_alloc_rangesets().
3> Coding style changes.

Changes in v8: 
Use a clearer API name to map/unmap the write-protected memory in
ioreq server.

Changes in v7: 
1> Coding style changes;
2> Fix a typo in hvm_select_ioreq_server().

Changes in v6: 
Break the identical relationship between ioreq type and rangeset
index inside ioreq server.

Changes in v5:
1> Use gpfn, instead of gpa to track guest write-protected pages;
2> Remove redundant conditional statement in routine find_range().

Changes in v4:
Keep the name HVMOP_IO_RANGE_MEMORY for MMIO resources, and add
a new one, HVMOP_IO_RANGE_WP_MEM, for write-protected memory.

Changes in v3:
1> Use a separate rangeset for guest ram pages in ioreq server;
2> Refactor rangeset, instead of introduce a new data structure.

Changes in v2:
1> Split the original patch into 2;
2> Take Paul Durrant's comments:
  a> Add a name member in the struct rb_rangeset, and use the 'q'
debug key to dump the ranges in ioreq server;
  b> Keep original routine names for hvm ioreq server;
  c> Commit message changes - mention that a future patch to change
the maximum ranges inside ioreq server.

Yu Zhang (3):
  Refactor rangeset structure for better performance.
  Differentiate IO/mem resources tracked by ioreq server
  tools: introduce parameter max_wp_ram_ranges.

 docs/man/xl.cfg.pod.5| 18 +
 tools/libxc/include/xenctrl.h| 31 +++
 tools/libxc/xc_domain.c  | 61 ++
 tools/libxl/libxl.h  |  5 +++
 tools/libxl/libxl_dom.c  |  3 ++
 tools/libxl/libxl_types.idl  |  1 +
 tools/libxl/xl_cmdimpl.c |  4 ++
 xen/arch/x86/hvm/hvm.c   | 37 +++---
 xen/common/rangeset.c| 82 +---
 xen/include/asm-x86/hvm/domain.h |  2 +-
 xen/include/public/hvm/hvm_op.h  |  1 +
 xen/include/public/hvm/params.h  |  5 ++-
 12 files changed, 221 insertions(+), 29 deletions(-)

-- 
1.9.1




[Xen-devel] [PATCH v11 1/3] Refactor rangeset structure for better performance.

2016-01-21 Thread Yu Zhang
This patch refactors struct rangeset to base it on the red-black
tree structure, instead of on the current doubly linked list. Today,
ioreq leverages rangeset to keep track of the IO/memory resources
to be emulated. Yet when the number of ranges inside one ioreq
server is very high, traversing a doubly linked list could be time
consuming. With this patch, the time complexity for searching a
rangeset is improved from O(n) to O(log(n)). The rangeset interfaces
remain the same, and no new APIs are introduced.

Reviewed-by: Paul Durrant 
Signed-off-by: Shuai Ruan 
Signed-off-by: Yu Zhang 
---
 xen/common/rangeset.c | 82 +--
 1 file changed, 60 insertions(+), 22 deletions(-)

diff --git a/xen/common/rangeset.c b/xen/common/rangeset.c
index 6c6293c..d15d8d5 100644
--- a/xen/common/rangeset.c
+++ b/xen/common/rangeset.c
@@ -10,11 +10,12 @@
 #include 
 #include 
 #include 
+#include <xen/rbtree.h>
 #include 
 
 /* An inclusive range [s,e] and pointer to next range in ascending order. */
 struct range {
-struct list_head list;
+struct rb_node node;
 unsigned long s, e;
 };
 
@@ -24,7 +25,7 @@ struct rangeset {
 struct domain   *domain;
 
 /* Ordered list of ranges contained in this set, and protecting lock. */
-struct list_head range_list;
+struct rb_root   range_tree;
 
 /* Number of ranges that can be allocated */
 long nr_ranges;
@@ -45,41 +46,78 @@ struct rangeset {
 static struct range *find_range(
 struct rangeset *r, unsigned long s)
 {
-struct range *x = NULL, *y;
+struct rb_node *node;
+struct range   *x;
+struct range   *prev = NULL;
 
-list_for_each_entry ( y, &r->range_list, list )
+node = r->range_tree.rb_node;
+while ( node != NULL )
 {
-if ( y->s > s )
-break;
-x = y;
+x = container_of(node, struct range, node);
+if ( (s >= x->s) && (s <= x->e) )
+return x;
+if ( s < x->s )
+node = node->rb_left;
+else
+{
+prev = x;
+node = node->rb_right;
+}
 }
 
-return x;
+return prev;
 }
 
 /* Return the lowest range in the set r, or NULL if r is empty. */
 static struct range *first_range(
 struct rangeset *r)
 {
-if ( list_empty(&r->range_list) )
-return NULL;
-return list_entry(r->range_list.next, struct range, list);
+struct rb_node *node;
+
+node = rb_first(&r->range_tree);
+if ( node != NULL )
+return container_of(node, struct range, node);
+
+return NULL;
 }
 
 /* Return range following x in ascending order, or NULL if x is the highest. */
 static struct range *next_range(
 struct rangeset *r, struct range *x)
 {
-if ( x->list.next == &r->range_list )
-return NULL;
-return list_entry(x->list.next, struct range, list);
+struct rb_node *node;
+
+node = rb_next(&x->node);
+if ( node != NULL )
+return container_of(node, struct range, node);
+
+return NULL;
 }
 
 /* Insert range y after range x in r. Insert as first range if x is NULL. */
 static void insert_range(
 struct rangeset *r, struct range *x, struct range *y)
 {
-list_add(&y->list, (x != NULL) ? &x->list : &r->range_list);
+struct rb_node **node;
+struct rb_node *parent = NULL;
+
+if ( x == NULL )
+node = &r->range_tree.rb_node;
+else
+{
+node = &x->node.rb_right;
+parent = &x->node;
+}
+
+while ( *node != NULL )
+{
+parent = *node;
+node = &parent->rb_left;
+}
+
+/* Add new node and rebalance the red-black tree. */
+rb_link_node(&y->node, parent, node);
+rb_insert_color(&y->node, &r->range_tree);
 }
 
 /* Remove a range from its list and free it. */
@@ -88,7 +126,7 @@ static void destroy_range(
 {
 r->nr_ranges++;
 
-list_del(&x->list);
+rb_erase(&x->node, &r->range_tree);
 xfree(x);
 }
 
@@ -319,7 +357,7 @@ bool_t rangeset_contains_singleton(
 bool_t rangeset_is_empty(
 const struct rangeset *r)
 {
-return ((r == NULL) || list_empty(&r->range_list));
+return ((r == NULL) || RB_EMPTY_ROOT(&r->range_tree));
 }
 
 struct rangeset *rangeset_new(
@@ -332,7 +370,7 @@ struct rangeset *rangeset_new(
 return NULL;
 
 rwlock_init(&r->lock);
-INIT_LIST_HEAD(&r->range_list);
+r->range_tree = RB_ROOT;
 r->nr_ranges = -1;
 
 BUG_ON(flags & ~RANGESETF_prettyprint_hex);
@@ -410,7 +448,7 @@ void rangeset_domain_destroy(
 
 void rangeset_swap(struct rangeset *a, struct rangeset *b)
 {
-LIST_HEAD(tmp);
+struct rb_node *tmp;
 
 if ( a < b )
 {
@@ -423,9 +461,9 @@ void rangeset_swap(struct rangeset *a, struct rangeset *b)
 write_lock(&

[Xen-devel] [PATCH v2 3/3] tools: introduce parameter max_wp_ram_ranges.

2016-01-21 Thread Yu Zhang
A new parameter - max_wp_ram_ranges is added to set the upper limit
of write-protected ram ranges to be tracked inside one ioreq server
rangeset.

Ioreq server uses a group of rangesets to track the I/O or memory
resources to be emulated. Default limit of ranges that one rangeset
can allocate is set to a small value, due to the fact that these ranges
are allocated in xen heap. Yet for the write-protected ram ranges,
there are circumstances under which the upper limit inside one rangeset
should exceed the default one. E.g. in XenGT, when tracking the
per-process graphic translation tables on Intel Broadwell platforms,
the number of page tables concerned will be several thousand (normally
in this case, 8192 could be a big enough value). Users who set this
item explicitly are supposed to know the specific scenarios that
necessitate this configuration.

Signed-off-by: Yu Zhang 
---
 docs/man/xl.cfg.pod.5   | 18 ++
 tools/libxl/libxl.h |  5 +
 tools/libxl/libxl_dom.c |  3 +++
 tools/libxl/libxl_types.idl |  1 +
 tools/libxl/xl_cmdimpl.c|  4 
 xen/arch/x86/hvm/hvm.c  | 10 +-
 xen/include/public/hvm/params.h |  5 -
 7 files changed, 44 insertions(+), 2 deletions(-)

diff --git a/docs/man/xl.cfg.pod.5 b/docs/man/xl.cfg.pod.5
index 8899f75..7634c42 100644
--- a/docs/man/xl.cfg.pod.5
+++ b/docs/man/xl.cfg.pod.5
@@ -962,6 +962,24 @@ FIFO-based event channel ABI support up to 131,071 event 
channels.
 Other guests are limited to 4095 (64-bit x86 and ARM) or 1023 (32-bit
 x86).
 
=item B<max_wp_ram_ranges>
+
+Limit the maximum write-protected ram ranges that can be tracked
+inside one ioreq server rangeset.
+
+Ioreq server uses a group of rangesets to track the I/O or memory
+resources to be emulated. Default limit of ranges that one rangeset
+can allocate is set to a small value, due to the fact that these ranges
+are allocated in xen heap. Yet for the write-protected ram ranges,
+there are circumstances under which the upper limit inside one rangeset
+should exceed the default one. E.g. in XenGT, when tracking the per-
+process graphic translation tables on Intel Broadwell platforms, the
+number of page tables concerned will be several thousand (normally
+in this case, 8192 could be a big enough value). Not configuring this
+item, or setting its value to 0, will result in the upper limit being
+set to its default. Users who set this item explicitly are supposed
+to know the specific scenarios that necessitate this configuration.
+
 =back
 
 =head2 Paravirtualised (PV) Guest Specific Options
diff --git a/tools/libxl/libxl.h b/tools/libxl/libxl.h
index 156c0d5..6698d72 100644
--- a/tools/libxl/libxl.h
+++ b/tools/libxl/libxl.h
@@ -136,6 +136,11 @@
 #define LIBXL_HAVE_BUILDINFO_EVENT_CHANNELS 1
 
 /*
+ * libxl_domain_build_info has the u.hvm.max_wp_ram_ranges field.
+ */
+#define LIBXL_HAVE_BUILDINFO_HVM_MAX_WP_RAM_RANGES 1
+
+/*
  * libxl_domain_build_info has the u.hvm.ms_vm_genid field.
  */
 #define LIBXL_HAVE_BUILDINFO_HVM_MS_VM_GENID 1
diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c
index 2269998..54173cb 100644
--- a/tools/libxl/libxl_dom.c
+++ b/tools/libxl/libxl_dom.c
@@ -288,6 +288,9 @@ static void hvm_set_conf_params(xc_interface *handle, 
uint32_t domid,
 libxl_defbool_val(info->u.hvm.nested_hvm));
 xc_hvm_param_set(handle, domid, HVM_PARAM_ALTP2M,
 libxl_defbool_val(info->u.hvm.altp2m));
+if (info->u.hvm.max_wp_ram_ranges > 0)
+xc_hvm_param_set(handle, domid, HVM_PARAM_MAX_WP_RAM_RANGES,
+info->u.hvm.max_wp_ram_ranges);
 }
 
 int libxl__build_pre(libxl__gc *gc, uint32_t domid,
diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl
index 9ad7eba..c7d7b5f 100644
--- a/tools/libxl/libxl_types.idl
+++ b/tools/libxl/libxl_types.idl
@@ -518,6 +518,7 @@ libxl_domain_build_info = Struct("domain_build_info",[
("serial_list",  libxl_string_list),
("rdm", libxl_rdm_reserve),
("rdm_mem_boundary_memkb", MemKB),
+   ("max_wp_ram_ranges", uint32),
])),
  ("pv", Struct(None, [("kernel", string),
   ("slack_memkb", MemKB),
diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
index 25507c7..8bb7cc7 100644
--- a/tools/libxl/xl_cmdimpl.c
+++ b/tools/libxl/xl_cmdimpl.c
@@ -1626,6 +1626,10 @@ static void parse_config_data(const char *config_source,
 
 if (!xlu_cfg_get_long (config, "rdm_mem_boundary", &l, 0))
 b_info->u.hvm.rdm_mem_boundary_memkb = l * 1024;
+
+if (!xlu_cfg_get_long (config, "max_wp_ram_ranges", &l, 0))
+b_info->u.hvm.max_wp_ram_ranges = l;
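
For reference, a guest config fragment exercising the new option might
look like this (values are illustrative; 8192 is the value suggested
above for XenGT on BDW):

    builder = "hvm"
    memory = 2048
    max_wp_ram_ranges = 8192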

Re: [Xen-devel] [PATCH v2 3/3] tools: introduce parameter max_wp_ram_ranges.

2016-01-25 Thread Yu, Zhang

Thank you, Jan.

On 1/22/2016 4:01 PM, Jan Beulich wrote:

On 22.01.16 at 04:20,  wrote:

--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -940,6 +940,10 @@ static int hvm_ioreq_server_alloc_rangesets(struct
hvm_ioreq_server *s,
  {
  unsigned int i;
  int rc;
+unsigned int max_wp_ram_ranges =
+( s->domain->arch.hvm_domain.params[HVM_PARAM_MAX_WP_RAM_RANGES] > 0 ) 
?
+s->domain->arch.hvm_domain.params[HVM_PARAM_MAX_WP_RAM_RANGES] :
+MAX_NR_IO_RANGES;


Besides this having stray blanks inside the parentheses it truncates
the value from 64 to 32 bits and would benefit from using the gcc
extension of omitting the middle operand of ?:. But even better
would imo be if you avoided the local variable and ...


On second thought, how about we define a default value for this
parameter in libxl.h, and initialize the parameter with the default
value when creating the domain, if it's not configured?
About this local variable, we keep it, and ...


@@ -962,7 +966,10 @@ static int hvm_ioreq_server_alloc_rangesets(struct 
hvm_ioreq_server *s,
  if ( !s->range[i] )
  goto fail;

-rangeset_limit(s->range[i], MAX_NR_IO_RANGES);
+if ( i == HVMOP_IO_RANGE_WP_MEM )
+rangeset_limit(s->range[i], max_wp_ram_ranges);
+else
+rangeset_limit(s->range[i], MAX_NR_IO_RANGES);


... did the entire computation here, using ?: for the second argument
of the function invocation.


... replace the if/else pair with sth. like:
rangeset_limit(s->range[i],
   ((i == HVMOP_IO_RANGE_WP_MEM)?
max_wp_ram_ranges:
MAX_NR_IO_RANGES));
This 'max_wp_ram_ranges' has no other particular use, but the string
"s->domain->arch.hvm_domain.params[HVM_PARAM_MAX_WP_RAM_RANGES]"
is too lengthy, and can easily break the 80-column limit. :)
Does this approach sound OK? :)


@@ -6009,6 +6016,7 @@ static int hvm_allow_set_param(struct domain *d,
  case HVM_PARAM_IOREQ_SERVER_PFN:
  case HVM_PARAM_NR_IOREQ_SERVER_PAGES:
  case HVM_PARAM_ALTP2M:
+case HVM_PARAM_MAX_WP_RAM_RANGES:
  if ( value != 0 && a->value != value )
  rc = -EEXIST;
  break;


Is there a particular reason you want this limit to be unchangeable
after having got set once?


Well, not exactly. :)
I added this limit because by now we do not have any approach to
change the max range numbers inside ioreq server during run-time.
I can add another patch to introduce an xl command, which can change
it dynamically. But I doubt the necessity of this new command and
also wonder whether it would cause more confusion for
the user...

Jan



B.R.
Yu



Re: [Xen-devel] [PATCH v11 2/3] Differentiate IO/mem resources tracked by ioreq server

2016-01-26 Thread Yu, Zhang



On 1/22/2016 7:43 PM, Jan Beulich wrote:

On 22.01.16 at 04:20,  wrote:

@@ -2601,6 +2605,16 @@ struct hvm_ioreq_server *hvm_select_ioreq_server(struct 
domain *d,
  type = (p->type == IOREQ_TYPE_PIO) ?
  HVMOP_IO_RANGE_PORT : HVMOP_IO_RANGE_MEMORY;
  addr = p->addr;
+if ( type == HVMOP_IO_RANGE_MEMORY )
+{
+ ram_page = get_page_from_gfn(d, p->addr >> PAGE_SHIFT,
+  &p2mt, P2M_UNSHARE);


It seems to me like I had asked before: Why P2M_UNSHARE instead
of just P2M_QUERY? (This could surely be fixed up while committing,
the more that I've already done some cleanup here, but I'd like to
understand this before it goes in.)


Hah, sorry for my bad memory. :)
I did not find P2M_QUERY; only P2M_UNSHARE and P2M_ALLOC are
defined. But after reading the code in ept_get_entry(), I guess the
P2M_UNSHARE is not accurate, maybe I should use 0 here for the
p2m_query_t parameter in get_page_from_gfn()?


+ if ( p2mt == p2m_mmio_write_dm )
+ type = HVMOP_IO_RANGE_WP_MEM;
+
+ if ( ram_page )
+ put_page(ram_page);
+}
  }

  list_for_each_entry ( s,
@@ -2642,6 +2656,11 @@ struct hvm_ioreq_server *hvm_select_ioreq_server(struct 
domain *d,
  }

  break;
+case HVMOP_IO_RANGE_WP_MEM:
+if ( rangeset_contains_singleton(r, PFN_DOWN(addr)) )
+return s;


Considering you've got p2m_mmio_write_dm above - can this
validly return false here?


Well, if we have multiple ioreq servers defined, it will...
Currently, this p2m type is only used in XenGT, which has only one
ioreq server other than qemu for the vGPU. But suppose there will
be more devices using this type and more ioreq servers introduced
for them, it can return false.


Jan



B.R.
Yu



Re: [Xen-devel] [PATCH v2 3/3] tools: introduce parameter max_wp_ram_ranges.

2016-01-26 Thread Yu, Zhang



On 1/26/2016 7:00 PM, Jan Beulich wrote:

On 26.01.16 at 08:32,  wrote:

On 1/22/2016 4:01 PM, Jan Beulich wrote:

On 22.01.16 at 04:20,  wrote:

--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -940,6 +940,10 @@ static int hvm_ioreq_server_alloc_rangesets(struct
hvm_ioreq_server *s,
   {
   unsigned int i;
   int rc;
+unsigned int max_wp_ram_ranges =
+( s->domain->arch.hvm_domain.params[HVM_PARAM_MAX_WP_RAM_RANGES] > 0 ) 
?
+s->domain->arch.hvm_domain.params[HVM_PARAM_MAX_WP_RAM_RANGES] :
+MAX_NR_IO_RANGES;


Besides this having stray blanks inside the parentheses it truncates
the value from 64 to 32 bits and would benefit from using the gcc
extension of omitting the middle operand of ?:. But even better
would imo be if you avoided the local variable and ...


On second thought, how about we define a default value for this
parameter in libxl.h, and initialize the parameter with the default
value when creating the domain, if it's not configured?


No, I don't think the tool stack should be determining the default
here (unless you want the default to be zero, and have zero
indeed mean zero).


Thank you, Jan.
If we do not provide a default value in the tool stack, the code above
should be kept, to initialize the local variable with either the one
set in the configuration file, or with MAX_NR_IO_RANGES. Is this OK?


About this local variable, we keep it, and ...


@@ -962,7 +966,10 @@ static int hvm_ioreq_server_alloc_rangesets(struct

hvm_ioreq_server *s,

   if ( !s->range[i] )
   goto fail;

-rangeset_limit(s->range[i], MAX_NR_IO_RANGES);
+if ( i == HVMOP_IO_RANGE_WP_MEM )
+rangeset_limit(s->range[i], max_wp_ram_ranges);
+else
+rangeset_limit(s->range[i], MAX_NR_IO_RANGES);


... did the entire computation here, using ?: for the second argument
of the function invocation.


... replace the if/else pair with sth. like:
  rangeset_limit(s->range[i],
 ((i == HVMOP_IO_RANGE_WP_MEM)?
  max_wp_ram_ranges:
  MAX_NR_IO_RANGES));
This 'max_wp_ram_ranges' has no other particular use, but the string
"s->domain->arch.hvm_domain.params[HVM_PARAM_MAX_WP_RAM_RANGES]"
is too lengthy, and can easily break the 80-column limit. :)
Does this approach sound OK? :)


Seems better than the original, so okay.


@@ -6009,6 +6016,7 @@ static int hvm_allow_set_param(struct domain *d,
   case HVM_PARAM_IOREQ_SERVER_PFN:
   case HVM_PARAM_NR_IOREQ_SERVER_PAGES:
   case HVM_PARAM_ALTP2M:
+case HVM_PARAM_MAX_WP_RAM_RANGES:
   if ( value != 0 && a->value != value )
   rc = -EEXIST;
   break;


Is there a particular reason you want this limit to be unchangeable
after having got set once?


Well, not exactly. :)
I added this limit because by now we do not have any approach to
change the max range numbers inside ioreq server during run-time.
I can add another patch to introduce an xl command, which can change
it dynamically. But I doubt the necessity of this new command and
also wonder whether it would cause more confusion for
the user...


And I didn't say you need to expose this to the user. All I asked
was whether you really mean the value to be a set-once one. If
yes, the code above is fine. If no, the code above should be
changed, but there's then still no need to expose a way to
"manually" adjust the value until a need for such arises.



I see. The constraint is not necessary. And I'll remove this code. :)


Jan




B.R.
Yu



Re: [Xen-devel] [PATCH v11 2/3] Differentiate IO/mem resources tracked by ioreq server

2016-01-26 Thread Yu, Zhang



On 1/26/2016 7:24 PM, Jan Beulich wrote:

On 26.01.16 at 08:59,  wrote:




On 1/22/2016 7:43 PM, Jan Beulich wrote:

On 22.01.16 at 04:20,  wrote:

@@ -2601,6 +2605,16 @@ struct hvm_ioreq_server

*hvm_select_ioreq_server(struct domain *d,

   type = (p->type == IOREQ_TYPE_PIO) ?
   HVMOP_IO_RANGE_PORT : HVMOP_IO_RANGE_MEMORY;
   addr = p->addr;
+if ( type == HVMOP_IO_RANGE_MEMORY )
+{
+ ram_page = get_page_from_gfn(d, p->addr >> PAGE_SHIFT,
+  &p2mt, P2M_UNSHARE);


It seems to me like I had asked before: Why P2M_UNSHARE instead
of just P2M_QUERY? (This could surely be fixed up while committing,
the more that I've already done some cleanup here, but I'd like to
understand this before it goes in.)


Hah, sorry for my bad memory. :)
I did not find P2M_QUERY; only P2M_UNSHARE and P2M_ALLOC are
defined. But after reading the code in ept_get_entry(), I guess the
P2M_UNSHARE is not accurate, maybe I should use 0 here for the
p2m_query_t parameter in get_page_from_gfn()?


Ah, sorry for the misnamed suggestion. I'm not sure whether using
zero here actually matches your needs; P2M_UNSHARE though
seems odd in any case, so at least switching to P2M_ALLOC (to
populate PoD pages) would seem to be necessary.



Thanks, Jan.  :)
And now I believe we should use zero here. For now, XenGT does not
support PoD, and here all we care about is whether the p2m type of
this gfn is p2m_mmio_write_dm.


@@ -2642,6 +2656,11 @@ struct hvm_ioreq_server *hvm_select_ioreq_server(struct 
domain *d,
   }

   break;
+case HVMOP_IO_RANGE_WP_MEM:
+if ( rangeset_contains_singleton(r, PFN_DOWN(addr)) )
+return s;


Considering you've got p2m_mmio_write_dm above - can this
validly return false here?


Well, if we have multiple ioreq servers defined, it will...


Ah, right. That's fine then.

Jan




B.R.
Yu



Re: [Xen-devel] [PATCH v2 3/3] tools: introduce parameter max_wp_ram_ranges.

2016-01-26 Thread Yu, Zhang



On 1/26/2016 7:16 PM, David Vrabel wrote:

On 22/01/16 03:20, Yu Zhang wrote:

--- a/docs/man/xl.cfg.pod.5
+++ b/docs/man/xl.cfg.pod.5
@@ -962,6 +962,24 @@ FIFO-based event channel ABI support up to 131,071 event 
channels.
  Other guests are limited to 4095 (64-bit x86 and ARM) or 1023 (32-bit
  x86).

=item B<max_wp_ram_ranges>
+
+Limit the maximum write-protected ram ranges that can be tracked
+inside one ioreq server rangeset.
+
+Ioreq server uses a group of rangesets to track the I/O or memory
+resources to be emulated. Default limit of ranges that one rangeset
+can allocate is set to a small value, due to the fact that these ranges
+are allocated in xen heap. Yet for the write-protected ram ranges,
+there are circumstances under which the upper limit inside one rangeset
+should exceed the default one. E.g. in XenGT, when tracking the per-
+process graphic translation tables on Intel Broadwell platforms, the
+number of page tables concerned will be several thousand (normally
+in this case, 8192 could be a big enough value). Not configuring this
+item, or setting its value to 0, will result in the upper limit being
+set to its default. Users who set this item explicitly are supposed
+to know the specific scenarios that necessitate this configuration.


This help text isn't very helpful.  How is a user supposed to "know the
specific scenarios" that need this option?



Thank you for your comment, David. :)

Well, "know the specific scenarios" may seem too ambiguous. Here the
"specific scenarios" means when this parameter is used:
1> for virtual devices other than vGPU in GVT-g;
2> for GVT-g, there also might be some extreme cases, e.g. too many
graphics-related applications in one VM, which create a great deal of
per-process graphic translation tables.
3> for GVT-g, future cpu platforms which provide even more PPGTTs.
Other than these cases, 8192 is a suggested value for this option.

So how about we add a section to point out these scenarios in this
text?


Why doesn't the toolstack (or qemu) automatically set this value based
on whether GVT-g/GVT-d is being used? Then there is no need to even
present this option to the user.

David



For now, this parameter is only used by GVT-g, but we are expecting
more uses for other devices which adopt this mediated pass-through idea.
Indeed, XenGT has an xl configuration flag, and several other XenGT
specific parameters. We have plans to upstream these options later
this year. After these XenGT options are accepted, we can set this
"max_wp_ram_ranges" to a default value if GVT-g is detected and the
"max_wp_ram_ranges" is not explicitly configured.



Re: [Xen-devel] [PATCH v2 3/3] tools: introduce parameter max_wp_ram_ranges.

2016-01-27 Thread Yu, Zhang



On 1/27/2016 6:27 PM, Jan Beulich wrote:

On 27.01.16 at 08:01,  wrote:




On 1/26/2016 7:00 PM, Jan Beulich wrote:

On 26.01.16 at 08:32,  wrote:

On 1/22/2016 4:01 PM, Jan Beulich wrote:

On 22.01.16 at 04:20,  wrote:

--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -940,6 +940,10 @@ static int hvm_ioreq_server_alloc_rangesets(struct
hvm_ioreq_server *s,
{
unsigned int i;
int rc;
+unsigned int max_wp_ram_ranges =
+( s->domain->arch.hvm_domain.params[HVM_PARAM_MAX_WP_RAM_RANGES] > 0 ) 
?
+s->domain->arch.hvm_domain.params[HVM_PARAM_MAX_WP_RAM_RANGES] :
+MAX_NR_IO_RANGES;


Besides this having stray blanks inside the parentheses it truncates
the value from 64 to 32 bits and would benefit from using the gcc
extension of omitting the middle operand of ?:. But even better
would imo be if you avoided the local variable and ...


On second thought, how about we define a default value for this
parameter in libxl.h, and initialize the parameter with the default
value when creating the domain, if it's not configured?


No, I don't think the tool stack should be determining the default
here (unless you want the default to be zero, and have zero
indeed mean zero).


Thank you, Jan.
If we do not provide a default value in tool stack, the code above
should be kept, to initialize the local variable with either the one
set in the configuration file, or with MAX_NR_IO_RANGES. Is this OK?


Well, not exactly: For one, the original comment (still present
above) regarding truncation holds. And then another question is:
Do you expect this resource type to be useful with its number of
ranges limited to MAX_NR_IO_RANGES? I ask because if the
answer is "no", having it default to zero might be as reasonable.



Thanks, Jan.

About the default value:
  You are right. :) For XenGT, MAX_NR_IO_RANGES may only work under
limited conditions. Having it default to zero means XenGT users must
manually configure this option. Since we have plans to push other XenGT
tool stack parameters (including a GVT-g flag), how about we set this
max_wp_ram_ranges to a default value when the GVT-g flag is detected;
until then, max_wp_ram_ranges is supposed to be configured explicitly for
XenGT?

About the truncation issue:
  I do not quite follow. Will this hurt if the value configured does
not exceed 4G? What about a type cast?

B.R.
Yu




Jan






Re: [Xen-devel] [PATCH v2 3/3] tools: introduce parameter max_wp_ram_ranges.

2016-01-27 Thread Yu, Zhang



On 1/27/2016 10:32 PM, Jan Beulich wrote:

On 27.01.16 at 15:13,  wrote:

About the default value:
You are right. :) For XenGT, MAX_NR_IO_RANGES may only work under
limited conditions. Having it default to zero means XenGT users must
manually configure this option. Since we have plans to push other XenGT
tool stack parameters (including a GVT-g flag), how about we set this
max_wp_ram_ranges to a default value when the GVT-g flag is detected;
until then, max_wp_ram_ranges is supposed to be configured explicitly for
XenGT?


Sounds reasonable, and in line with what iirc was discussed on
the tool stack side.



Great, and thanks.


About the truncation issue:
I do not quite follow. Will this hurt if the value configured does
not exceed 4G? What about a type cast?


A typecast would not alter behavior in any way. And of course
a problem only arises if the value was above 4 billion. You either
need to refuse such values while the attempt is made to set it,
or you need to deal with the full range of possible values. Likely
the former is the better (and I wonder whether the upper
bound shouldn't be forced even lower than 4 billion).



Oh, I see. A check with the upper bound sounds better. Using 4G as the
upper bound is a little conservative, but I do not have any better
criteria right now. :)
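
A minimal sketch of the hazard under discussion (values made up;
params[] entries are uint64_t as in the hypervisor):

    uint64_t param = 0x100000001ULL;   /* admin asked for 2^32 + 1 */
    unsigned int limit = param;        /* silently truncated to 1, hence
                                          the idea of refusing values
                                          >= 4G at set time */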


Jan




Thanks
Yu



Re: [Xen-devel] [PATCH v2 3/3] tools: introduce parameter max_wp_ram_ranges.

2016-01-27 Thread Yu, Zhang



On 1/27/2016 11:12 PM, Jan Beulich wrote:

On 27.01.16 at 15:56,  wrote:

On 1/27/2016 10:32 PM, Jan Beulich wrote:

On 27.01.16 at 15:13,  wrote:

About the truncation issue:
 I do not quite follow. Will this hurt if the value configured does
not exceed 4G? What about a type cast?


A typecast would not alter behavior in any way. And of course
a problem only arises if the value was above 4 billion. You either
need to refuse such values while the attempt is made to set it,
or you need to deal with the full range of possible values. Likely
the former is the better (and I wonder whether the upper
bound shouldn't be forced even lower than 4 billion).


Oh, I see. A check with the upper bound sounds better. Using 4G as the
upper bound is a little conservative, but I do not have any better
criteria right now. :)


But when making that decision keep security in mind: How much
memory would it take to populate 4G rangeset nodes?


Well, for XenGT, one extreme case I can imagine would be that half of
all the guest ram is used as GPU page tables, and the page frames
containing these page tables are discontiguous (rangeset can merge
contiguous ranges). For other virtual devices to leverage the
write-protected gfn rangeset, I believe the same idea applies. :)
Is this logic OK?

Thanks
Yu
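
Worked numbers for that extreme case (guest size is illustrative):

    guest RAM (example):                 8 GiB
    half used for GPU page tables:       4 GiB
    4 GiB / 4 KiB per page:              1,048,576 pages
    worst case (no adjacent pages):      ~1M discrete ranges to track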



Re: [Xen-devel] [PATCH v2 3/3] tools: introduce parameter max_wp_ram_ranges.

2016-01-27 Thread Yu, Zhang



On 1/27/2016 11:58 PM, Jan Beulich wrote:

On 27.01.16 at 16:23,  wrote:




On 1/27/2016 11:12 PM, Jan Beulich wrote:

On 27.01.16 at 15:56,  wrote:

On 1/27/2016 10:32 PM, Jan Beulich wrote:

On 27.01.16 at 15:13,  wrote:

About the truncation issue:
  I do not quite follow. Will this hurt if the value configured does
not exceed 4G? What about a type cast?


A typecast would not alter behavior in any way. And of course
a problem only arises if the value was above 4 billion. You either
need to refuse such values while the attempt is made to set it,
or you need to deal with the full range of possible values. Likely
the former is the better (and I wonder whether the upper
bound shouldn't be forced even lower than 4 billion).


Oh, I see. A check with the upper bound sounds better. Using 4G as the
upper bound is a little conservative, but I do not have any better
criteria right now. :)


But when making that decision keep security in mind: How much
memory would it take to populate 4G rangeset nodes?


Well, for XenGT, one extreme case I can imagine would be that half of
all the guest ram is used as GPU page tables, and the page frames
containing these page tables are discontiguous (rangeset can merge
contiguous ranges). For other virtual devices to leverage the
write-protected gfn rangeset, I believe the same idea applies. :)
Is this logic OK?


I can follow it, yes, but 4G ranges mean 16Tb of memory put
in page tables, which to be honest doesn't seem reasonable to
me.



Thanks for your reply, Jan.
Indeed: 4G ranges of one 4 KiB page each would be 4 * 2^30 * 4 KiB =
16 TiB of page tables. In most cases max_memkb in the configuration
file will not be set to such a big value. So I'd suggest we compare
the limit calculated from max_memkb against 4G, and choose the smaller
one as the upper bound. If VMs with huge amounts of RAM become common
some time in the future, we should use uint64 for the rangeset limit
rather than a uint32.


Yu



[Xen-devel] [PATCH v12 1/3] Refactor rangeset structure for better performance.

2016-01-29 Thread Yu Zhang
This patch refactors struct rangeset to base it on the red-black
tree structure, instead of on the current doubly linked list. Today,
ioreq leverages rangeset to keep track of the IO/memory resources
to be emulated. Yet when the number of ranges inside one ioreq
server is very high, traversing a doubly linked list could be time
consuming. With this patch, the time complexity for searching a
rangeset is improved from O(n) to O(log(n)). The rangeset interfaces
remain the same, and no new APIs are introduced.

Reviewed-by: Paul Durrant 
Signed-off-by: Shuai Ruan 
Signed-off-by: Yu Zhang 
---
 xen/common/rangeset.c | 82 +--
 1 file changed, 60 insertions(+), 22 deletions(-)

diff --git a/xen/common/rangeset.c b/xen/common/rangeset.c
index 6c6293c..d15d8d5 100644
--- a/xen/common/rangeset.c
+++ b/xen/common/rangeset.c
@@ -10,11 +10,12 @@
 #include 
 #include 
 #include 
+#include <xen/rbtree.h>
 #include 
 
 /* An inclusive range [s,e] and pointer to next range in ascending order. */
 struct range {
-struct list_head list;
+struct rb_node node;
 unsigned long s, e;
 };
 
@@ -24,7 +25,7 @@ struct rangeset {
 struct domain   *domain;
 
 /* Ordered list of ranges contained in this set, and protecting lock. */
-struct list_head range_list;
+struct rb_root   range_tree;
 
 /* Number of ranges that can be allocated */
 long nr_ranges;
@@ -45,41 +46,78 @@ struct rangeset {
 static struct range *find_range(
 struct rangeset *r, unsigned long s)
 {
-struct range *x = NULL, *y;
+struct rb_node *node;
+struct range   *x;
+struct range   *prev = NULL;
 
-list_for_each_entry ( y, &r->range_list, list )
+node = r->range_tree.rb_node;
+while ( node != NULL )
 {
-if ( y->s > s )
-break;
-x = y;
+x = container_of(node, struct range, node);
+if ( (s >= x->s) && (s <= x->e) )
+return x;
+if ( s < x->s )
+node = node->rb_left;
+else
+{
+prev = x;
+node = node->rb_right;
+}
 }
 
-return x;
+return prev;
 }
 
 /* Return the lowest range in the set r, or NULL if r is empty. */
 static struct range *first_range(
 struct rangeset *r)
 {
-if ( list_empty(&r->range_list) )
-return NULL;
-return list_entry(r->range_list.next, struct range, list);
+struct rb_node *node;
+
+node = rb_first(&r->range_tree);
+if ( node != NULL )
+return container_of(node, struct range, node);
+
+return NULL;
 }
 
 /* Return range following x in ascending order, or NULL if x is the highest. */
 static struct range *next_range(
 struct rangeset *r, struct range *x)
 {
-if ( x->list.next == &r->range_list )
-return NULL;
-return list_entry(x->list.next, struct range, list);
+struct rb_node *node;
+
+node = rb_next(&x->node);
+if ( node != NULL )
+return container_of(node, struct range, node);
+
+return NULL;
 }
 
 /* Insert range y after range x in r. Insert as first range if x is NULL. */
 static void insert_range(
 struct rangeset *r, struct range *x, struct range *y)
 {
-list_add(&y->list, (x != NULL) ? &x->list : &r->range_list);
+struct rb_node **node;
+struct rb_node *parent = NULL;
+
+if ( x == NULL )
+node = &r->range_tree.rb_node;
+else
+{
+node = &x->node.rb_right;
+parent = &x->node;
+}
+
+while ( *node != NULL )
+{
+parent = *node;
+node = &parent->rb_left;
+}
+
+/* Add new node and rebalance the red-black tree. */
+rb_link_node(&y->node, parent, node);
+rb_insert_color(&y->node, &r->range_tree);
 }
 
 /* Remove a range from its list and free it. */
@@ -88,7 +126,7 @@ static void destroy_range(
 {
 r->nr_ranges++;
 
-list_del(&x->list);
+rb_erase(&x->node, &r->range_tree);
 xfree(x);
 }
 
@@ -319,7 +357,7 @@ bool_t rangeset_contains_singleton(
 bool_t rangeset_is_empty(
 const struct rangeset *r)
 {
-return ((r == NULL) || list_empty(&r->range_list));
+return ((r == NULL) || RB_EMPTY_ROOT(&r->range_tree));
 }
 
 struct rangeset *rangeset_new(
@@ -332,7 +370,7 @@ struct rangeset *rangeset_new(
 return NULL;
 
 rwlock_init(&r->lock);
-INIT_LIST_HEAD(&r->range_list);
+r->range_tree = RB_ROOT;
 r->nr_ranges = -1;
 
 BUG_ON(flags & ~RANGESETF_prettyprint_hex);
@@ -410,7 +448,7 @@ void rangeset_domain_destroy(
 
 void rangeset_swap(struct rangeset *a, struct rangeset *b)
 {
-LIST_HEAD(tmp);
+struct rb_node *tmp;
 
 if ( a < b )
 {
@@ -423,9 +461,9 @@ void rangeset_swap(struct rangeset *a, struct rangeset *b)
 write_lock(&

[Xen-devel] [PATCH v3 3/3] tools: introduce parameter max_wp_ram_ranges.

2016-01-29 Thread Yu Zhang
A new parameter - max_wp_ram_ranges is added to set the upper limit
of write-protected ram ranges to be tracked inside one ioreq server
rangeset.

Ioreq server uses a group of rangesets to track the I/O or memory
resources to be emulated. Default limit of ranges that one rangeset
can allocate is set to a small value, due to the fact that these ranges
are allocated in xen heap. Yet for the write-protected ram ranges,
there are circumstances under which the upper limit inside one rangeset
should exceed the default one. E.g. in XenGT, when tracking the PPGTTs
(per-process graphic translation tables) on Intel BDW platforms, the
number of page tables concerned will be several thousand.

For XenGT running on the Intel BDW platform, 8192 is a suggested value
for this parameter in most cases. But users who set this item explicitly
are also supposed to know the specific scenarios that necessitate this
configuration. Especially when this parameter is used:
1> for virtual devices other than vGPUs in XenGT;
2> for XenGT, there also might be some extreme cases, e.g. too many
graphics-related applications in one VM, which create a great deal of
per-process graphic translation tables;
3> for XenGT, future cpu platforms which provide even more per-process
graphic translation tables.

Signed-off-by: Yu Zhang 
---
 docs/man/xl.cfg.pod.5   | 26 ++
 tools/libxl/libxl.h |  5 +
 tools/libxl/libxl_dom.c |  3 +++
 tools/libxl/libxl_types.idl |  1 +
 tools/libxl/xl_cmdimpl.c| 17 +
 xen/arch/x86/hvm/hvm.c  |  7 ++-
 xen/include/public/hvm/params.h |  5 -
 7 files changed, 62 insertions(+), 2 deletions(-)

diff --git a/docs/man/xl.cfg.pod.5 b/docs/man/xl.cfg.pod.5
index 8899f75..c294fd3 100644
--- a/docs/man/xl.cfg.pod.5
+++ b/docs/man/xl.cfg.pod.5
@@ -962,6 +962,32 @@ FIFO-based event channel ABI support up to 131,071 event 
channels.
 Other guests are limited to 4095 (64-bit x86 and ARM) or 1023 (32-bit
 x86).
 
=item B<max_wp_ram_ranges>
+
+Limit the maximum write-protected ram ranges that can be tracked
+inside one ioreq server rangeset.
+
+Ioreq server uses a group of rangesets to track the I/O or memory
+resources to be emulated. Default limit of ranges that one rangeset
+can allocate is set to a small value, due to the fact that these
+ranges are allocated in xen heap. Yet for the write-protected ram
+ranges, there are circumstances under which the upper limit inside
+one rangeset should exceed the default one. E.g. in Intel GVT-g,
+when tracking the PPGTT (per-process graphic translation tables) on
+Intel Broadwell platforms, the number of page tables concerned will
+be several thousand.
+
+For the Intel GVT-g Broadwell platform, 8192 is a suggested value for
+this parameter in most cases. But users who set this item explicitly
+are also supposed to know the specific scenarios that necessitate
+this configuration. Especially when this parameter is used:
+1> for virtual devices other than vGPU in GVT-g;
+2> for GVT-g, there also might be some extreme cases, e.g. too many
+graphics-related applications in one VM, which create a great deal of
+per-process graphic translation tables;
+3> for GVT-g, future cpu platforms which provide even more per-process
+graphic translation tables.
+
 =back
 
 =head2 Paravirtualised (PV) Guest Specific Options
diff --git a/tools/libxl/libxl.h b/tools/libxl/libxl.h
index fa87f53..18828c5 100644
--- a/tools/libxl/libxl.h
+++ b/tools/libxl/libxl.h
@@ -136,6 +136,11 @@
 #define LIBXL_HAVE_BUILDINFO_EVENT_CHANNELS 1
 
 /*
+ * libxl_domain_build_info has the u.hvm.max_wp_ram_ranges field.
+ */
+#define LIBXL_HAVE_BUILDINFO_HVM_MAX_WP_RAM_RANGES 1
+
+/*
  * libxl_domain_build_info has the u.hvm.ms_vm_genid field.
  */
 #define LIBXL_HAVE_BUILDINFO_HVM_MS_VM_GENID 1
diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c
index 2269998..54173cb 100644
--- a/tools/libxl/libxl_dom.c
+++ b/tools/libxl/libxl_dom.c
@@ -288,6 +288,9 @@ static void hvm_set_conf_params(xc_interface *handle, 
uint32_t domid,
 libxl_defbool_val(info->u.hvm.nested_hvm));
 xc_hvm_param_set(handle, domid, HVM_PARAM_ALTP2M,
 libxl_defbool_val(info->u.hvm.altp2m));
+if (info->u.hvm.max_wp_ram_ranges > 0)
+xc_hvm_param_set(handle, domid, HVM_PARAM_MAX_WP_RAM_RANGES,
+info->u.hvm.max_wp_ram_ranges);
 }
 
 int libxl__build_pre(libxl__gc *gc, uint32_t domid,
diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl
index 9ad7eba..9185014 100644
--- a/tools/libxl/libxl_types.idl
+++ b/tools/libxl/libxl_types.idl
@@ -518,6 +518,7 @@ libxl_domain_build_info = Struct("domain_build_info",[
("serial_list",  libxl_string_list),
("rdm", libxl_rdm_reserve),
("rdm_mem_boundary_memkb&q

[Xen-devel] [PATCH v12 2/3] Differentiate IO/mem resources tracked by ioreq server

2016-01-29 Thread Yu Zhang
Currently in the ioreq server, guest write-protected ram pages are
tracked in the same rangeset as device mmio resources. Yet
unlike device mmio, which can be in big chunks, the guest write-
protected pages may be discrete ranges of 4K bytes each. This
patch uses a separate rangeset for the guest ram pages.

To differentiate the ioreq type between the write-protected memory
ranges and the mmio ranges when selecting an ioreq server, the p2m
type is retrieved by calling get_page_from_gfn(). And we do not
need to worry about the p2m type changing during the ioreq selection
process.

Note: Previously, a new hypercall or subop was suggested to map
write-protected pages into the ioreq server. However, it turned out
that the handler of this new hypercall would be almost the same as the
existing pair - HVMOP_[un]map_io_range_to_ioreq_server, and there's
already a type parameter in this hypercall. So no new hypercall is
defined; only a new type is introduced.

Acked-by: Wei Liu 
Acked-by: Ian Campbell 
Reviewed-by: Kevin Tian 
Reviewed-by: Paul Durrant 
Signed-off-by: Shuai Ruan 
Signed-off-by: Yu Zhang 
---
 tools/libxc/include/xenctrl.h| 31 ++
 tools/libxc/xc_domain.c  | 55 
 xen/arch/x86/hvm/hvm.c   | 26 ---
 xen/include/asm-x86/hvm/domain.h |  2 +-
 xen/include/public/hvm/hvm_op.h  |  1 +
 5 files changed, 110 insertions(+), 5 deletions(-)

diff --git a/tools/libxc/include/xenctrl.h b/tools/libxc/include/xenctrl.h
index 1d656ac..1a5f4ec 100644
--- a/tools/libxc/include/xenctrl.h
+++ b/tools/libxc/include/xenctrl.h
@@ -1714,6 +1714,37 @@ int xc_hvm_unmap_io_range_from_ioreq_server(xc_interface *xch,
 int is_mmio,
 uint64_t start,
 uint64_t end);
+/**
+ * This function registers a range of write-protected memory for emulation.
+ *
+ * @parm xch a handle to an open hypervisor interface.
+ * @parm domid the domain id to be serviced
+ * @parm id the IOREQ Server id.
+ * @parm start start of range
+ * @parm end end of range (inclusive).
+ * @return 0 on success, -1 on failure.
+ */
+int xc_hvm_map_wp_mem_range_to_ioreq_server(xc_interface *xch,
+domid_t domid,
+ioservid_t id,
+xen_pfn_t start,
+xen_pfn_t end);
+
+/**
+ * This function deregisters a range of write-protected memory for emulation.
+ *
+ * @parm xch a handle to an open hypervisor interface.
+ * @parm domid the domain id to be serviced
+ * @parm id the IOREQ Server id.
+ * @parm start start of range
+ * @parm end end of range (inclusive).
+ * @return 0 on success, -1 on failure.
+ */
+int xc_hvm_unmap_wp_mem_range_from_ioreq_server(xc_interface *xch,
+domid_t domid,
+ioservid_t id,
+xen_pfn_t start,
+xen_pfn_t end);
 
 /**
  * This function registers a PCI device for config space emulation.
diff --git a/tools/libxc/xc_domain.c b/tools/libxc/xc_domain.c
index 921113d..e21b602 100644
--- a/tools/libxc/xc_domain.c
+++ b/tools/libxc/xc_domain.c
@@ -1523,6 +1523,61 @@ int xc_hvm_unmap_io_range_from_ioreq_server(xc_interface *xch, domid_t domid,
 return rc;
 }
 
+int xc_hvm_map_wp_mem_range_to_ioreq_server(xc_interface *xch,
+domid_t domid,
+ioservid_t id,
+xen_pfn_t start,
+xen_pfn_t end)
+{
+DECLARE_HYPERCALL_BUFFER(xen_hvm_io_range_t, arg);
+int rc;
+
+arg = xc_hypercall_buffer_alloc(xch, arg, sizeof(*arg));
+if ( arg == NULL )
+return -1;
+
+arg->domid = domid;
+arg->id = id;
+arg->type = HVMOP_IO_RANGE_WP_MEM;
+arg->start = start;
+arg->end = end;
+
+rc = xencall2(xch->xcall, __HYPERVISOR_hvm_op,
+  HVMOP_map_io_range_to_ioreq_server,
+  HYPERCALL_BUFFER_AS_ARG(arg));
+
+xc_hypercall_buffer_free(xch, arg);
+return rc;
+}
+
+int xc_hvm_unmap_wp_mem_range_from_ioreq_server(xc_interface *xch,
+domid_t domid,
+ioservid_t id,
+xen_pfn_t start,
+xen_pfn_t end)
+{
+DECLARE_HYPERCALL_BUFFER(xen_hvm_io_range_t, arg);
+int rc;
+
+arg = xc_hypercall_buffer_alloc(xch, arg, sizeof(*arg));
+if ( arg == NULL )
+return -1;
+
+arg->domid = domid;
+arg->id = id;
+arg->type = HVMOP_IO_RANGE_WP_MEM;
+arg->start = start;
+arg->end = end;
+
+rc = xencall2(xch->xcall, __HYPERVISOR_hvm_op,
+  HVMOP_unmap_io_range_from_ioreq_server,
+  HYPERCALL_BUFFER_AS_ARG(arg));
+
+xc_hypercall_buffer_free(xch, arg);
+return rc;
+}
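
A hedged usage sketch of the pair above (domain id, server id and gpfn
are made up for illustration):

    xc_interface *xch = xc_interface_open(NULL, NULL, 0);
    domid_t domid = 1;         /* illustrative domain id */
    ioservid_t id = 0;         /* illustrative ioreq server id */
    xen_pfn_t gpfn = 0x12340;  /* a guest GTT page, say */

    /* Track writes to the page, emulate them, then stop tracking. */
    if ( xc_hvm_map_wp_mem_range_to_ioreq_server(xch, domid, id,
                                                 gpfn, gpfn) < 0 )
        perror("map wp range");
    /* ... service the resulting ioreqs ... */
    xc_hvm_unmap_wp_mem_range_from_ioreq_server(xch, domid, id, gpfn, gpfn);
    xc_interface_close(xch);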

[Xen-devel] [PATCH v12 0/3] Refactor ioreq server for better performance.

2016-01-29 Thread Yu Zhang

XenGT leverages ioreq server to track and forward the accesses to
GPU I/O resources, e.g. the PPGTT(per-process graphic translation
tables). Currently, ioreq server uses rangeset to track the BDF/
PIO/MMIO ranges to be emulated. To select an ioreq server, the 
rangeset is searched to see if the I/O range is recorded. However,
traversing the linked list inside rangeset could be time consuming
when the number of ranges is high. On HSW platforms, the number of
PPGTTs for each vGPU could be several hundred. On BDW, this value could
be several thousand. This patch series refactors rangeset to base
it on a red-black tree, so that searching is more efficient.

Besides, this patchset also splits the tracking of MMIO and guest
ram ranges into different rangesets. And to accommodate more ranges,
a new parameter, max_wp_ram_ranges, is introduced in the hvm
configuration file.
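
The core of that refactoring can be pictured with a short editorial
sketch (type names assumed, not the series' actual code):

    /* Ranges kept in a balanced tree keyed on the start address make
     * the lookup O(log n) instead of an O(n) linked-list walk. */
    struct range { unsigned long s, e; struct rb_node node; };

    static struct range *range_find(struct rb_root *root, unsigned long addr)
    {
        struct rb_node *n = root->rb_node;

        while ( n )
        {
            struct range *r = container_of(n, struct range, node);

            if ( addr < r->s )
                n = n->rb_left;
            else if ( addr > r->e )
                n = n->rb_right;
            else
                return r;    /* s <= addr <= e */
        }

        return NULL;
    }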

Changes in v12: 
1> check the validity of max_wp_ram_ranges.
2> documentation and commit message changes.

Changes in v11: 
1> rename the new parameter to "max_wp_ram_ranges", and use it
specifically for write-protected ram ranges.
2> clear the documentation part.
3> define a LIBXL_HAVE_BUILDINFO_HVM_MAX_WP_RAM_RANGES in libxl.h.

Changes in v10: 
1> Add a new patch to configure the range limit inside ioreq server.
2> Commit message changes. 
3> The previous patch "[1/3] Remove identical relationship between
   ioreq type and rangeset type." has already been merged, and is not
   included in this series now.

Changes in v9: 
1> Change order of patch 2 and patch3.
2> Introduce a const static array before hvm_ioreq_server_alloc_rangesets().
3> Coding style changes.

Changes in v8: 
Use a clearer API name to map/unmap the write-protected memory in
ioreq server.

Changes in v7: 
1> Coding style changes;
2> Fix a typo in hvm_select_ioreq_server().

Changes in v6: 
Break the identical relationship between ioreq type and rangeset
index inside ioreq server.

Changes in v5:
1> Use gpfn, instead of gpa to track guest write-protected pages;
2> Remove redundant conditional statement in routine find_range().

Changes in v4:
Keep the name HVMOP_IO_RANGE_MEMORY for MMIO resources, and add
a new one, HVMOP_IO_RANGE_WP_MEM, for write-protected memory.

Changes in v3:
1> Use a separate rangeset for guest ram pages in ioreq server;
2> Refactor rangeset, instead of introducing a new data structure.

Changes in v2:
1> Split the original patch into 2;
2> Take Paul Durrant's comments:
  a> Add a name member in the struct rb_rangeset, and use the 'q'
debug key to dump the ranges in ioreq server;
  b> Keep original routine names for hvm ioreq server;
  c> Commit message changes - mention that a future patch will change
the maximum ranges inside ioreq server.

Yu Zhang (3):
  Refactor rangeset structure for better performance.
  Differentiate IO/mem resources tracked by ioreq server
  tools: introduce parameter max_wp_ram_ranges.

 docs/man/xl.cfg.pod.5| 26 +
 tools/libxc/include/xenctrl.h| 31 +++
 tools/libxc/xc_domain.c  | 55 +++
 tools/libxl/libxl.h  |  5 +++
 tools/libxl/libxl_dom.c  |  3 ++
 tools/libxl/libxl_types.idl  |  1 +
 tools/libxl/xl_cmdimpl.c | 17 +
 xen/arch/x86/hvm/hvm.c   | 33 +---
 xen/common/rangeset.c| 82 +---
 xen/include/asm-x86/hvm/domain.h |  2 +-
 xen/include/public/hvm/hvm_op.h  |  1 +
 xen/include/public/hvm/params.h  |  5 ++-
 12 files changed, 232 insertions(+), 29 deletions(-)

-- 
1.9.1




Re: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter max_wp_ram_ranges.

2016-01-30 Thread Yu, Zhang


On 1/30/2016 12:33 AM, Jan Beulich wrote:

On 29.01.16 at 11:45,  wrote:

--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -940,6 +940,8 @@ static int hvm_ioreq_server_alloc_rangesets(struct hvm_ioreq_server *s,
  {
  unsigned int i;
  int rc;
+unsigned int max_wp_ram_ranges =
+s->domain->arch.hvm_domain.params[HVM_PARAM_MAX_WP_RAM_RANGES];


You're still losing the upper 32 bits here. Iirc you agreed to range
check the value before storing into params[]...

Jan




Thanks, Jan. :)
In this version, the check is added in routine parse_config_data().
If option 'max_wp_ram_ranges' is configured with an unreasonable value,
xl will terminate before calling xc_hvm_param_set(). Does this
change meet your requirement? Or did I misunderstand something
on this issue?


B.R.
Yu



Re: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter max_wp_ram_ranges.

2016-02-01 Thread Yu, Zhang



On 2/1/2016 9:07 PM, Jan Beulich wrote:

On 01.02.16 at 13:49,  wrote:

On Mon, Feb 01, 2016 at 05:15:16AM -0700, Jan Beulich wrote:

On 01.02.16 at 13:02,  wrote:

On Mon, Feb 01, 2016 at 12:52:51AM -0700, Jan Beulich wrote:

On 30.01.16 at 15:38,  wrote:



On 1/30/2016 12:33 AM, Jan Beulich wrote:

On 29.01.16 at 11:45,  wrote:

--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -940,6 +940,8 @@ static int hvm_ioreq_server_alloc_rangesets(struct hvm_ioreq_server *s,

   {
   unsigned int i;
   int rc;
+unsigned int max_wp_ram_ranges =
+s->domain->arch.hvm_domain.params[HVM_PARAM_MAX_WP_RAM_RANGES];


You're still losing the upper 32 bits here. Iirc you agreed to range
check the value before storing into params[]...


Thanks, Jan. :)
In this version, the check is added in routine parse_config_data().
If option 'max_wp_ram_ranges' is configured with an unreasonable value,
xl will terminate before calling xc_hvm_param_set(). Does this
change meet your requirement? Or did I misunderstand something
on this issue?


Checking in the tools is desirable, but the hypervisor shouldn't rely
on any tool side checking.



As in, the hypervisor needs to sanitise all input from the toolstack? I
don't think Xen does that today.


If it doesn't, then that's a bug. Note that in many cases (domctl-s
and alike) such bogus trusting in the tool stack behaving correctly
is only not a security issue due to XSA-77. Yet with XSA-77 we
made quite clear that we shouldn't knowingly allow in further such
issues (it'll be hard enough to find and address all existing ones).


So are you suggesting pulling the check done in toolstack into
hypervisor?


I think the check in the tools should stay (allowing for a
distinguishable error message to be issued); all I'm saying is
that doing the check in the tools is not enough.

Jan



Thank you Jan and Wei. And sorry for the late response.
But I still do not quite understand. :)
If the tool stack can guarantee the validity of a parameter,
under which circumstances will the hypervisor be threatened?
I'm not familiar with XSA-77, and I'll read it ASAP.

B.R.
Yu




Re: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter max_wp_ram_ranges.

2016-02-01 Thread Yu, Zhang



On 2/1/2016 7:57 PM, Wei Liu wrote:

On Fri, Jan 29, 2016 at 06:45:14PM +0800, Yu Zhang wrote:

diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
index 25507c7..0c19dee 100644
--- a/tools/libxl/xl_cmdimpl.c
+++ b/tools/libxl/xl_cmdimpl.c
@@ -35,6 +35,7 @@
  #include 
  #include 
  #include 
+#include 

  #include "libxl.h"
  #include "libxl_utils.h"
@@ -1626,6 +1627,22 @@ static void parse_config_data(const char *config_source,

  if (!xlu_cfg_get_long (config, "rdm_mem_boundary", &l, 0))
  b_info->u.hvm.rdm_mem_boundary_memkb = l * 1024;
+
+if (!xlu_cfg_get_long (config, "max_wp_ram_ranges", &l, 0)) {
+uint64_t nr_pages = (b_info->max_memkb << 10) >> XC_PAGE_SHIFT;
+
+/* Due to rangeset's ability to combine continuous ranges, this
+ * parameter shall not be configured with values greater than half
+ * of the number of VM's page frames. It also shall not exceed 4G,
+ * because of the limitation from the rangeset side. */
+if (l > (nr_pages / 2) || l > UINT32_MAX) {
+fprintf(stderr, "ERROR: Invalid value for \"max_wp_ram_ranges\". 
"
+"Shall not exceed %ld or 4G.\n", nr_pages / 2);
+exit(1);
+}
+b_info->u.hvm.max_wp_ram_ranges = l;
+}
+


Xl is only one of the applications that use libxl (the library).  This
check should be inside libxl so that all applications (xl, libvirt and
others) have the same behaviour.

Take a look at initiate_domain_create where numerous validations are
done.

Wei.



Thank you, Wei. I'll try to move this part into
initiate_domain_create().
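
For what it's worth, a hedged sketch of what such a libxl-level check
might look like (helper name and placement assumed, not actual libxl
code):

    static int check_max_wp_ram_ranges(libxl__gc *gc,
                                       const libxl_domain_build_info *b_info)
    {
        uint64_t nr_pages = (b_info->max_memkb << 10) >> XC_PAGE_SHIFT;
        uint64_t v = b_info->u.hvm.max_wp_ram_ranges;

        /* Same bound as the xl check: at most half the guest's page
         * frames, and small enough to survive the uint32 assignment. */
        if (v > nr_pages / 2 || v > UINT32_MAX) {
            LOG(ERROR, "invalid max_wp_ram_ranges value");
            return ERROR_INVAL;
        }

        return 0;
    }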

B.R.
Yu



Re: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter max_wp_ram_ranges.

2016-02-01 Thread Yu, Zhang



On 2/2/2016 12:16 AM, Jan Beulich wrote:

On 01.02.16 at 16:14,  wrote:

But I still do not quite understand. :)
If the tool stack can guarantee the validity of a parameter,
under which circumstances will the hypervisor be threatened?


At least in disaggregated environments the hypervisor cannot
trust the (parts of the) tool stack(s) living outside of Dom0. But
even without disaggregation in mind it is bad practice to have
the hypervisor assume the tool stack will only pass sane values.
Just at the example of the param you're introducing: You don't
even do the validation in libxc, so any (theoretical) tool stack
not based on xl/libxl would not be guaranteed to pass a sane
value. And even if you moved it into libxc, one could still argue
that there could be an even more theoretical tool stack not even
building on top of libxc.

Jan



Great. Thank you very much for your patience to explain.
Just sent out another mail about my understanding a moment ago,
seems I partially get it. :)
My vnc connection is too slow, will change the code tomorrow.





Yu



Re: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter max_wp_ram_ranges.

2016-02-01 Thread Yu, Zhang



On 2/1/2016 11:14 PM, Yu, Zhang wrote:



On 2/1/2016 9:07 PM, Jan Beulich wrote:

On 01.02.16 at 13:49,  wrote:

On Mon, Feb 01, 2016 at 05:15:16AM -0700, Jan Beulich wrote:

On 01.02.16 at 13:02,  wrote:

On Mon, Feb 01, 2016 at 12:52:51AM -0700, Jan Beulich wrote:

On 30.01.16 at 15:38,  wrote:



On 1/30/2016 12:33 AM, Jan Beulich wrote:

On 29.01.16 at 11:45,  wrote:

--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -940,6 +940,8 @@ static int hvm_ioreq_server_alloc_rangesets(struct hvm_ioreq_server *s,

   {
   unsigned int i;
   int rc;
+unsigned int max_wp_ram_ranges =
+
s->domain->arch.hvm_domain.params[HVM_PARAM_MAX_WP_RAM_RANGES];


You're still losing the upper 32 bits here. Iirc you agreed to
range
check the value before storing into params[]...


Thanks, Jan. :)
In this version, the check is added in routine parse_config_data().
If option 'max_wp_ram_ranges' is configured with an unreasonable value,
xl will terminate before calling xc_hvm_param_set(). Does this
change meet your requirement? Or did I misunderstand something
on this issue?


Checking in the tools is desirable, but the hypervisor shouldn't rely
on any tool side checking.



As in, the hypervisor needs to sanitise all input from the toolstack? I
don't think Xen does that today.


If it doesn't, then that's a bug. Note that in many cases (domctl-s
and alike) such bogus trusting in the tool stack behaving correctly
is only not a security issue due to XSA-77. Yet with XSA-77 we
made quite clear that we shouldn't knowingly allow in further such
issues (it'll be hard enough to find and address all existing ones).


So are you suggesting pulling the check done in toolstack into
hypervisor?


I think the check in the tools should stay (allowing for a
distinguishable error message to be issued); all I'm saying is
that doing the check in the tools is not enough.

Jan



Thank you Jan and Wei. And sorry for the late response.
But I still do not quite understand. :)
If the tool stack can guarantee the validity of a parameter,
under which circumstances will the hypervisor be threatened?
I'm not familiar with XSA-77, and I'll read it ASAP.

B.R.
Yu


Sorry to bother you, Jan.
After a second thought, I guess one of the security concerns
is when some app tries to trigger HVMOP_set_param
directly with some illegal values.
So, we also need to validate this param in hvm_allow_set_param;
currently hvm_allow_set_param has not performed any
validation on other parameters, but we need to do this for the new
ones. Is this understanding correct?
Another question is: as to the tool stack side, do you think
an error message would suffice? Shouldn't xl be terminated?
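
A minimal sketch of the hypervisor-side check being discussed (bound
and helper names hypothetical):

    /* Reject out-of-range values before they are stored in
     * d->arch.hvm_domain.params[], so the 32-bit consumer in
     * hvm_ioreq_server_alloc_rangesets() never sees a truncated
     * value. */
    #define MAX_NR_WP_RAM_RANGES 8192UL   /* assumed upper bound */

    static int check_wp_ram_ranges_param(uint64_t value)
    {
        return value <= MAX_NR_WP_RAM_RANGES ? 0 : -EINVAL;
    }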

Thanks
Yu





Re: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter max_wp_ram_ranges.

2016-02-01 Thread Yu, Zhang



On 2/2/2016 12:35 AM, Jan Beulich wrote:

On 01.02.16 at 17:19,  wrote:

After a second thought, I guess one of the security concerns
is when some app tries to trigger HVMOP_set_param
directly with some illegal values.


Not sure what "directly" is supposed to mean here.


I mean with no validation by itself, like libxc...


So, we also need to validate this param in hvm_allow_set_param;
currently hvm_allow_set_param has not performed any
validation on other parameters, but we need to do this for the new
ones. Is this understanding correct?


Yes.


Another question is: as to the tool stack side, do you think
an error message would suffice? Shouldn't xl be terminated?


I have no idea what consistent behavior in such a case would
be - I'll defer input on this to the tool stack maintainers.



Thank you.
Wei, which one do you prefer?

Yu



Re: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter max_wp_ram_ranges.

2016-02-02 Thread Yu, Zhang

Thanks for your reply, Ian.

On 2/2/2016 1:05 AM, Ian Jackson wrote:

Yu, Zhang writes ("Re: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter 
max_wp_ram_ranges."):

On 2/2/2016 12:35 AM, Jan Beulich wrote:

On 01.02.16 at 17:19,  wrote:

So, we also need to validate this param in hvm_allow_set_param;
currently hvm_allow_set_param has not performed any
validation on other parameters, but we need to do this for the new
ones. Is this understanding correct?


Yes.


Another question is: as to the tool stack side, do you think
an error message would suffice? Shouldn't xl be terminated?


I have no idea what consistent behavior in such a case would
be - I'll defer input on this to the tool stack maintainers.


Thank you.
Wei, which one do you prefer?


I think that arrangements should be made for the hypercall failure to
be properly reported to the caller, and properly logged.

I don't think it is desirable to duplicate the sanity check in
xl/libxl/libxc.  That would simply result in there being two limits to
update.



Sorry, I do not follow. What does "being two limits to update" mean?


I have to say, though, that the situation with this parameter seems
quite unsatisfactory.  It seems to be a kind of bodge.



By "situation with this parameter", do you mean:
a> the introduction of this parameter in tool stack, or
b> the sanitizing of this parameter (in fact I'd prefer not to treat
the check of this parameter as sanitizing, because it only checks
the input against 4G to avoid data loss in the uint64 to uint32
assignment in hvm_ioreq_server_alloc_rangesets)?




The changeable limit is there to prevent excessive resource usage by a
guest.  But the docs suggest that the excessive usage might be
normal.  That sounds like a suboptimal design to me.



Yes, there might be situations where this limit is set to some large
value. But I believe that situation would be very rare. Like the docs
suggest, for XenGT, 8K is a big enough one for most cases.


For reference, here is the docs proposed in this patch:

   =item B

   Limit the maximum write-protected ram ranges that can be tracked
   inside one ioreq server rangeset.

   Ioreq server uses a group of rangesets to track the I/O or memory
   resources to be emulated. Default limit of ranges that one rangeset
   can allocate is set to a small value, due to the fact that these
   ranges are allocated in xen heap. Yet for the write-protected ram
   ranges, there are circumstances under which the upper limit inside
   one rangeset should exceed the default one. E.g. in Intel GVT-g,
   when tracking the PPGTT(per-process graphic translation tables) on
   Intel broadwell platforms, the number of page tables concerned will
   be of several thousand.

   For Intel GVT-g broadwell platform, 8192 is a suggested value for
   this parameter in most cases. But users who set this item explicitly
   are also supposed to know the specific scenarios that necessitate
   this configuration, especially when this parameter is used:
   1> for virtual devices other than vGPU in GVT-g;
   2> for GVT-g, there also might be some extreme cases, e.g. too many
   graphic related applications in one VM, which create a great deal of
   per-process graphic translation tables;
   3> for GVT-g, future cpu platforms which provide even more per-process
   graphic translation tables.

Having said that, if the hypervisor maintainers are happy with a
situation where this value is configured explicitly, and the
configurations where a non-default value is required are expected to be
rare, then I guess we can live with it.

Ian.




Thanks
Yu



Re: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter max_wp_ram_ranges.

2016-02-02 Thread Yu, Zhang



On 2/2/2016 6:32 PM, Jan Beulich wrote:

On 01.02.16 at 18:05,  wrote:

Having said that, if the hypervisor maintainers are happy with a
situation where this value is configured explicitly, and the
configurations where a non-default value is required are expected to be
rare, then I guess we can live with it.


Well, from the very beginning I have been not very happy with
the introduction of this, and I still consider it half way acceptable
only because of not seeing any good alternative. If we look at
it strictly, it's in violation of the rule we set forth after XSA-77:
No introduction of new code making the system susceptible to
bad (malicious) tool stack behavior, and hence we should reject
it. Yet that would leave XenGT in a state where it would have no
perspective of ever getting merged, which doesn't seem very
desirable either.

Jan



Thanks, Jan.
I understand your concern, and to be honest, I do not think
this is an optimal solution. But I also have no better idea
in mind.  :(
Another option may be: instead of opening this parameter to
the tool stack, we use a XenGT flag, which sets the rangeset
limit to a default value. But like I said, this default value
may not always work on future XenGT platforms.


B.R.
Yu



Re: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter max_wp_ram_ranges.

2016-02-02 Thread Yu, Zhang



On 2/2/2016 7:51 PM, Wei Liu wrote:

On Tue, Feb 02, 2016 at 04:04:14PM +0800, Yu, Zhang wrote:

Thanks for your reply, Ian.

On 2/2/2016 1:05 AM, Ian Jackson wrote:

Yu, Zhang writes ("Re: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter 
max_wp_ram_ranges."):

On 2/2/2016 12:35 AM, Jan Beulich wrote:

On 01.02.16 at 17:19,  wrote:

So, we also need to validate this param in hvm_allow_set_param;
currently hvm_allow_set_param has not performed any
validation on other parameters, but we need to do this for the new
ones. Is this understanding correct?


Yes.


Another question is: as to the tool stack side, do you think
an error message would suffice? Shouldn't xl be terminated?


I have no idea what consistent behavior in such a case would
be - I'll defer input on this to the tool stack maintainers.


Thank you.
Wei, which one do you prefer?


I think that arrangements should be made for the hypercall failure to
be properly reported to the caller, and properly logged.

I don't think it is desirable to duplicate the sanity check in
xl/libxl/libxc.  That would simply result in there being two limits to
update.



Sorry, I do not follow. What does "being two limits to update" mean?



I can't speak for Ian, but my understanding is that if the code logic is
duplicated in several places, you need to update all of them whenever
you change the logic. But he hasn't said whether this is a blocker for
this series, so I will let him clarify.


Thank you, Wei. This explanation helps. :)

B.R.
Yu



Re: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter max_wp_ram_ranges.

2016-02-02 Thread Yu, Zhang



On 2/2/2016 7:12 PM, Jan Beulich wrote:

On 02.02.16 at 11:56,  wrote:

I understand your concern, and to be honest, I do not think
this is an optimal solution. But I also have no better idea
in mind.  :(
Another option may be: instead of opening this parameter to
the tool stack, we use a XenGT flag, which sets the rangeset
limit to a default value. But like I said, this default value
may not always work on future XenGT platforms.


Assuming that you think of something set e.g. by hypervisor
command line option: How would that work? I.e. how would
that limit the resource use for all VMs not using XenGT? Or if
you mean a flag settable in the domain config - how would you
avoid a malicious admin setting this flag for all the VMs created
in the controlled partition of the system?



Well, I am not satisfied with this new parameter, because:
1> exposing an option like max_wp_ram_ranges to the user seems too
detailed;
2> but if not, using a XenGT flag means it would be hard for the
hypervisor to find a default value which can work in all situations
theoretically, although in practice, 8K is already big enough.

However, as to the security concern you raised, I cannot fully
understand it. :) E.g. I believe a malicious admin can also breach the
system even without this patch. This argument may not be convincing to
you, but as to this specific case, even if an admin sets the XenGT flag
for all VMs, what harm will this action do? It only means the ioreq
server can allocate at most 8K ranges; will that consume all of the Xen
heap, especially for 64-bit Xen?


Anyway, despite different opinions, I still need to say thank you
for your explanation. Upstreaming XenGT features is my task; it is
painfully rewarding to receive suggestions from community maintainers,
which helps a newbie like me better understand the virtualization
technology. :)

Thanks
Yu



Re: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter max_wp_ram_ranges.

2016-02-02 Thread Yu, Zhang



On 2/2/2016 10:42 PM, Jan Beulich wrote:

On 02.02.16 at 15:01,  wrote:

On 2/2/2016 7:12 PM, Jan Beulich wrote:

On 02.02.16 at 11:56,  wrote:

I understand your concern, and to be honest, I do not think
this is an optimal solution. But I also have no better idea
in mind.  :(
Another option may be: instead of opening this parameter to
the tool stack, we use a XenGT flag, which sets the rangeset
limit to a default value. But like I said, this default value
may not always work on future XenGT platforms.


Assuming that you think of something set e.g. by hypervisor
command line option: How would that work? I.e. how would
that limit the resource use for all VMs not using XenGT? Or if
you mean a flag settable in the domain config - how would you
avoid a malicious admin setting this flag for all the VMs created
in the controlled partition of the system?


Well, I am not satisfied with this new parameter, because:
1> exposing an option like max_wp_ram_ranges to the user seems too
detailed;
2> but if not, using a XenGT flag means it would be hard for the
hypervisor to find a default value which can work in all situations
theoretically, although in practice, 8K is already big enough.

However, as to the security concern you raised, I cannot fully
understand it. :) E.g. I believe a malicious admin can also breach the
system even without this patch. This argument may not be convincing to
you, but as to this specific case, even if an admin sets the XenGT flag
for all VMs, what harm will this action do? It only means the ioreq
server can allocate at most 8K ranges; will that consume all of the Xen
heap, especially for 64-bit Xen?


First of all, so far you meant to set a limit of 4G, which - taking a
handful of domains - if fully used would take even a mid-size
host out of memory. And then you have to consider bad effects
resulting from Xen itself not normally having a lot of memory left
(especially when "dom0_mem=" is not forcing most of the memory
to be in Xen's hands), which may mean that one domain
exhausting Xen's memory can affect another domain if Xen can't
allocate memory it needs to support that other domain, in the
worst case leading to a domain crash. And this all is still leaving
aside Xen's own health...



Thanks, Jan.
The limit of 4G is to avoid data loss in the uint64 to uint32
assignment. And I can accept the 8K limit for XenGT in practice.
After all, it is vGPU page tables we are trying to trap and emulate,
not normal page frames.

And I guess the reason that one domain exhausting Xen's memory can
affect another domain is that rangeset uses the Xen heap, instead of
per-domain memory. So what about using an 8K limit for now for XenGT,
and in the future, if a per-domain memory allocation solution for
rangeset is ready, we will still need to limit the rangeset size. Does
this sound more acceptable?

B.R.
Yu




Re: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter max_wp_ram_ranges.

2016-02-02 Thread Yu, Zhang



On 2/2/2016 11:21 PM, Jan Beulich wrote:

On 02.02.16 at 16:00,  wrote:

The limit of 4G is to avoid data loss in the uint64 to uint32
assignment. And I can accept the 8K limit for XenGT in practice.
After all, it is vGPU page tables we are trying to trap and emulate,
not normal page frames.

And I guess the reason that one domain exhausting Xen's memory can
affect another domain is that rangeset uses the Xen heap, instead of
per-domain memory. So what about using an 8K limit for now for XenGT,
and in the future, if a per-domain memory allocation solution for
rangeset is ready, we will still need to limit the rangeset size. Does
this sound more acceptable?


The lower the limit the better (but no matter how low the limit
it won't make this a pretty thing). Anyway I'd still like to wait
for what Ian may further say on this.



OK then. :)
Ian, do you have any suggestions?

Thanks
Yu



Re: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter max_wp_ram_ranges.

2016-02-02 Thread Yu, Zhang



On 2/2/2016 11:21 PM, Jan Beulich wrote:

On 02.02.16 at 16:00,  wrote:

The limit of 4G is to avoid data loss in the uint64 to uint32
assignment. And I can accept the 8K limit for XenGT in practice.
After all, it is vGPU page tables we are trying to trap and emulate,
not normal page frames.

And I guess the reason that one domain exhausting Xen's memory can
affect another domain is that rangeset uses the Xen heap, instead of
per-domain memory. So what about using an 8K limit for now for XenGT,
and in the future, if a per-domain memory allocation solution for
rangeset is ready, we will still need to limit the rangeset size. Does
this sound more acceptable?


The lower the limit the better (but no matter how low the limit
it won't make this a pretty thing). Anyway I'd still like to wait
for what Ian may further say on this.


Hi Jan, I just had a discussion with my colleague. We believe 8K could
be the biggest limit for the write-protected ram ranges. If, in the
future, the number of vGPU page tables exceeds this limit, we will
modify our back-end device model to find a trade-off method, instead of
extending this limit. If you can accept this value as the upper bound
of the rangeset, maybe we do not need to add any tool stack parameters,
but define a MAX_NR_WR_RAM_RANGES for the write-protected ram rangeset.
As to other rangesets, we keep their limit at 256. Does this sound OK? :)
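
A sketch of what that could look like (editorial; MAX_NR_IO_RANGES is
the existing 256 default, the other constant and the helper are the
proposal here):

    #define MAX_NR_IO_RANGES       256    /* existing per-rangeset limit */
    #define MAX_NR_WR_RAM_RANGES   8192   /* proposed wp-ram upper bound */

    static unsigned int ioreq_server_max_ranges(unsigned int type)
    {
        /* Only the write-protected ram rangeset gets the larger bound. */
        return type == HVMOP_IO_RANGE_WP_MEM ? MAX_NR_WR_RAM_RANGES
                                             : MAX_NR_IO_RANGES;
    }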

B.R.
Yu




Re: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter max_wp_ram_ranges.

2016-02-04 Thread Yu, Zhang

On 2/4/2016 1:50 AM, George Dunlap wrote:

On Wed, Feb 3, 2016 at 3:10 PM, Paul Durrant  wrote:

  * Is it possible for libxl to somehow tell from the rest of the
configuration that this larger limit should be applied ?



If a XenGT-enabled VM is provisioned through libxl then some larger limit is 
likely to be required. One of the issues is that it is impossible (or at least 
difficult) to know how many GTTs are going to need to be shadowed.


By GTT, you mean the GPU pagetables I assume?  So you're talking about


Yes, GTT is "Graphics Translation Table" for short.


how large this value should be made, not whether the
heuristically-chosen larger value should be used.  libxl should be
able to tell that XenGT is enabled, I assume, so it should be able to
automatically bump this to 8k if necessary, right?



Yes.


But I think you'd still need a parameter that you could tweak if it
turned out that 8k wasn't enough for a particular workload, right?



Well, not exactly. For XenGT, the latest suggestion is that even when 8K
is not enough, we will not extend this limit anymore. But when
introducing this parameter, I had thought it might also be helpful for
other device virtualization cases which would like to use the mediated
passthrough idea.


  * If we are talking about mmio ranges for ioreq servers, why do
guests which do not use this feature have the ability to create
them at all ?


It's not the guest that directly creates the ranges, it's the emulator. 
Normally device emulation would require a relatively small number of MMIO 
ranges and a total number that cannot be influenced by the guest itself. In 
this case though, as I said above, the number *can* be influenced by the guest 
(although it is still the emulator which actually causes the ranges to be 
created).


Just to make this clear: The guest chooses how many gpfns are used in
the GPU pagetables; for each additional gpfn in the guest pagetable,
qemu / xen have the option of either marking it to be emulated (at the
moment, by marking it as a one-page "MMIO region") or crashing the
guest.



Well, kind of. The backend device model in dom0 (not qemu) makes the
decision whether or not this page is to be emulated.



(A background problem I have is that this thread is full of product
name jargon and assumes a lot of background knowledge of the
implementation of these features - background knowledge which I lack
and which isn't in these patches.  If someone could point me at a
quick summary of what `GVT-g' and `GVT-d' are that might help.)



GVT-d is a name applied to PCI passthrough of an Intel GPU. GVT-g is a name 
applied to Intel GPU virtualization, which makes use of an emulator to mediate 
guest access to the real GPU so that it is possible to share the resources 
securely.


And GTT are the GPU equivalent of page tables?


Yes.

Here let me try to give some brief introduction to the jargons:
* Intel GVT-d: an intel graphic virtualization solution, which dedicates
one physical GPU to a guest exclusively.

* Intel GVT-g: an intel graphic virtualization solution, with mediated
pass-through support. One physical GPU can be shared by multiple guests.
GPU performance-critical resources are partitioned by and passed
through to different vGPUs. Other GPU resources are trapped and
emulated by the device model.

* XenGT: Intel GVT-g code name for Xen.
Here this patch series are features required by XenGT.

* vGPU: virtual GPU presented to guests.

* GTT: abbreviation for graphics translation table, a page table
structure which translates graphics memory addresses to physical
ones. For a vGPU, the PTEs in its GTT are GPFNs, which raises a demand
for the device model to construct a group of shadow GPU page tables.

Thanks
Yu



Re: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter max_wp_ram_ranges.

2016-02-04 Thread Yu, Zhang



On 2/4/2016 2:21 AM, George Dunlap wrote:

On Wed, Feb 3, 2016 at 5:41 PM, George Dunlap
 wrote:

I think at some point I suggested an alternate design based on marking
such gpfns with a special p2m type; I can't remember if that
suggestion was actually addressed or not.


FWIW, the thread where I suggested using p2m types was in response to

<1436163912-1506-2-git-send-email-yu.c.zh...@linux.intel.com>

Looking through it again, the main objection Paul gave[1]  was:

"And it's the assertion that use of write_dm will only be relevant to
gfns, and that all such notifications only need go to a single ioreq
server, that I have a problem with. Whilst the use of io ranges to
track gfn updates is, I agree, not ideal I think the overloading of
write_dm is not a step in the right direction."

Two issues raised here, about using only p2m types to implement write_dm:
1. More than one ioreq server may want to use the write_dm functionality
2. ioreq servers may want to use write_dm for things other than individual gpfns

My answer to #1 was:
1. At the moment, we only need to support a single ioreq server using write_dm
2. It's not technically difficult to extend the number of servers
supported to something sensible, like 4 (using 4 different write_dm
p2m types)
3. The interface can be designed such that we can extend support to
multiple servers when we need to.

My answer to #2 was that there's no reason why using write_dm could be
used for both individual gpfns and ranges; there's no reason the
interface can't take a "start" and "count" argument, even if for the
time being "count" is almost always going to be 1.



Well, talking about "the 'count' always going to be 1". I doubt that. :)
Statistics in XenGT shows that, GPU page tables are very likely to
be allocated in contiguous gpfns.


Compare this to the downsides of the approach you're proposing:
1. Using 40 bytes of hypervisor space per guest GPU pagetable page (as
opposed to using a bit in the existing p2m table)
2. Walking down an RB tree with 8000 individual nodes to find out
which server to send the message to (rather than just reading the
value from the p2m table).


8K is an upper limit for the rangeset; in many cases the RB tree will
not contain that many nodes.


3. Needing to determine on a guest-by-guest basis whether to change the limit
4. Needing to have an interface to make the limit even bigger, just in
case we find workloads that have even more GTTs.



Well, as I suggested in yesterday's reply, XenGT can choose not to
change this limit even when workloads are getting heavy - with
tradeoffs on the device model side.


I really don't understand where you're coming from on this.  The
approach you've chosen looks to me to be slower, more difficult to
implement, and more complicated; and it's caused a lot more resistance
trying to get this series accepted.



I agree utilizing the p2m types to do so is more efficient and quite
intuitive. But I hesitate to occupy the software-available bits in EPT
PTEs (as Andrew's reply notes). Although we have introduced one, we
believe it can also be used for other situations in the future, not
just XenGT.


Thanks
Yu



Re: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter max_wp_ram_ranges.

2016-02-04 Thread Yu, Zhang



On 2/4/2016 3:12 AM, George Dunlap wrote:

On 03/02/16 18:39, Andrew Cooper wrote:

On 03/02/16 18:21, George Dunlap wrote:

2. It's not technically difficult to extend the number of servers
supported to something sensible, like 4 (using 4 different write_dm
p2m types)


While technically true, spare bits in the pagetable entries are at a
premium, and steadily decreasing as Intel are introducing new features.

We have 16 current p2m types, and a finite upper bound of 6 bits of p2m
type space, already with a push to reduce this number.

While introducing 1 new p2m type for this purpose might be an acceptable
tradeoff, a using a p2m type per ioreq server is not IMO.


It is true that we don't have a ton of elbow room to grow at the moment.

But we actually already have a single p2m type -- mmio_write_dm -- that
as far as I know is only being used by XenGT.  We don't actually need to
add any new p2m types for my initial proposal.

Going forward, we probably will, at some point, need to implement a
parallel "p2t" structure to keep track of types -- and probably will
whether end up implementing 4 separate write_dm types or not (for the
reasons you describe).



Thank you, George. Could you please elaborate more about the idea of
"p2t"?

Yu



Re: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter max_wp_ram_ranges.

2016-02-04 Thread Yu, Zhang



On 2/4/2016 5:28 PM, Paul Durrant wrote:

-Original Message-
From: Yu, Zhang [mailto:yu.c.zh...@linux.intel.com]
Sent: 04 February 2016 08:51
To: George Dunlap; Ian Jackson
Cc: Paul Durrant; Kevin Tian; Wei Liu; Ian Campbell; Andrew Cooper; xen-
de...@lists.xen.org; Stefano Stabellini; zhiyuan...@intel.com; Jan Beulich;
Keir (Xen.org)
Subject: Re: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter
max_wp_ram_ranges.



On 2/4/2016 2:21 AM, George Dunlap wrote:

On Wed, Feb 3, 2016 at 5:41 PM, George Dunlap
 wrote:

I think at some point I suggested an alternate design based on marking
such gpfns with a special p2m type; I can't remember if that
suggestion was actually addressed or not.


FWIW, the thread where I suggested using p2m types was in response to

<1436163912-1506-2-git-send-email-yu.c.zh...@linux.intel.com>

Looking through it again, the main objection Paul gave[1]  was:

"And it's the assertion that use of write_dm will only be relevant to
gfns, and that all such notifications only need go to a single ioreq
server, that I have a problem with. Whilst the use of io ranges to
track gfn updates is, I agree, not ideal I think the overloading of
write_dm is not a step in the right direction."

Two issues raised here, about using only p2m types to implement

write_dm:

1. More than one ioreq server may want to use the write_dm functionality
2. ioreq servers may want to use write_dm for things other than individual

gpfns


My answer to #1 was:
1. At the moment, we only need to support a single ioreq server using

write_dm

2. It's not technically difficult to extend the number of servers
supported to something sensible, like 4 (using 4 different write_dm
p2m types)
3. The interface can be designed such that we can extend support to
multiple servers when we need to.

My answer to #2 was that there's no reason why using write_dm could be
used for both individual gpfns and ranges; there's no reason the
interface can't take a "start" and "count" argument, even if for the
time being "count" is almost always going to be 1.



Well, talking about "the 'count' always going to be 1". I doubt that. :)
Statistics in XenGT shows that, GPU page tables are very likely to
be allocated in contiguous gpfns.


Compare this to the downsides of the approach you're proposing:
1. Using 40 bytes of hypervisor space per guest GPU pagetable page (as
opposed to using a bit in the existing p2m table)
2. Walking down an RB tree with 8000 individual nodes to find out
which server to send the message to (rather than just reading the
value from the p2m table).


8K is an upper limit for the rangeset; in many cases the RB tree will
not contain that many nodes.


3. Needing to determine on a guest-by-guest basis whether to change the

limit

4. Needing to have an interface to make the limit even bigger, just in
case we find workloads that have even more GTTs.



Well, as I suggested in yesterday's reply, XenGT can choose not to
change this limit even when workloads are getting heavy - with
tradeoffs on the device model side.


I assume this means that the emulator can 'unshadow' GTTs (I guess on an LRU 
basis) so that it can shadow new ones when the limit has been exhausted?
If so, how bad is performance likely to be if we live with a lower limit and 
take the hit of unshadowing if the guest GTTs become heavily fragmented?


Thank you, Paul.

Well, I was told the emulator has approaches to delay the shadowing of
the GTT till future GPU commands are submitted. For now, I'm not sure
about the performance penalties if the limit is set too low. Although
we are confident 8K is a safe limit, it still seems too high to be
accepted. We will perform more experiments with this new approach to
find a balance between the lowest limit and XenGT performance.

So another question is: if the value of this limit really matters, will
a lower one be more acceptable (the current 256 being not enough)?

Thanks
Yu



Re: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter max_wp_ram_ranges.

2016-02-05 Thread Yu, Zhang



On 2/4/2016 7:06 PM, George Dunlap wrote:

On Thu, Feb 4, 2016 at 9:38 AM, Yu, Zhang  wrote:

On 2/4/2016 5:28 PM, Paul Durrant wrote:

I assume this means that the emulator can 'unshadow' GTTs (I guess on an
LRU basis) so that it can shadow new ones when the limit has been exhausted?
If so, how bad is performance likely to be if we live with a lower limit
and take the hit of unshadowing if the guest GTTs become heavily fragmented?


Thank you, Paul.

Well, I was told the emulator has approaches to delay the shadowing of
the GTT till future GPU commands are submitted. For now, I'm not sure
about the performance penalties if the limit is set too low. Although
we are confident 8K is a safe limit, it still seems too high to be
accepted. We will perform more experiments with this new approach to
find a balance between the lowest limit and XenGT performance.


Just to check some of my assumptions:

I assume that unlike memory accesses, your GPU hardware cannot
'recover' from faults in the GTTs. That is, for memory, you can take a
page fault, fix up the pagetables, and then re-execute the original
instruction; but so far I haven't heard of any devices being able to
seamlessly re-execute a transaction after a fault.  Is my
understanding correct?



Yes


If that is the case, then for every top-level value (whatever the
equivalent of the CR3), you need to be able to shadow the entire GTT
tree below it, yes?  You can't use a trick that the memory shadow
pagetables can use, of unshadowing parts of the tree and reshadowing
them.

So as long as the currently-in-use GTT tree contains no more than
$LIMIT ranges, you can unshadow and reshadow; this will be slow, but
strictly speaking correct.

What do you do if the guest driver switches to a GTT such that the
entire tree takes up more than $LIMIT entries?



Good question. As with memory virtualization, IIUC, besides
write-protecting the guest page tables, we can also track updates to
them when cr3 is written or when a tlb flush occurs. We can consider
optimizing our GPU device model to achieve a similar goal, e.g. when a
root pointer (like cr3) to the page table is written and when a set of
commands is submitted (both situations are triggered by MMIO
operations). But taking performance into consideration, we probably
still need to write-protect all the page tables when they are first
created. It requires a lot of optimization work on the device model
side to find a balance between a minimal set of wp-ed gpfns and
reasonable performance. We'd like to have a try. :)

Yu



Re: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter max_wp_ram_ranges.

2016-02-05 Thread Yu, Zhang



On 2/5/2016 1:12 AM, George Dunlap wrote:

On 04/02/16 14:08, Jan Beulich wrote:

On 04.02.16 at 14:33,  wrote:

Jan Beulich writes ("Re: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter
max_wp_ram_ranges."):

On 04.02.16 at 10:38,  wrote:

So another question is, if value of this limit really matters, will a
lower one be more acceptable(the current 256 being not enough)?


If you've carefully read George's replies, [...]


Thanks to George for the very clear explanation, and also to him for
an illuminating in-person discussion.

It is disturbing that as a result of me as a tools maintainer asking
questions about what seems to me to be a troublesome a user-visible
control setting in libxl, we are now apparently revisiting lower
layers of the hypervisor design, which have already been committed.

While I find George's line of argument convincing, neither I nor
George are maintainers of the relevant hypervisor code.  I am not
going to insist that anything in the hypervisor is done different and
am not trying to use my tools maintainer position to that end.

Clearly there has been a failure of our workflow to consider and
review everything properly together.  But given where we are now, I
think that this discussion about hypervisor internals is probably a
distraction.


While I recall George having made that alternative suggestion,
both Yu and Paul having reservations against it made me not
insist on that alternative. Instead I've been trying to limit some
of the bad effects that the variant originally proposed brought
with it. Clearly, with the more detailed reply George has now
given (involving areas where he is the maintainer for), I should
have been more demanding towards the exploration of that
alternative. That's clearly unfortunate, and I apologize for that,
but such things happen.

As to one of the patches already having for committed - I'm not
worried about that at all. We can always revert, that's why the
thing is called "unstable".


It looks like I should have been more careful to catch up on the current
state of things before I started arguing again -- please accept my
apologies.



In fact, I need to say thank you all for your patience and suggestions.
I'm thrilled to see XenGT is receiving so much attention. :)


I see that patch 2/3 addresses the gpfn/io question in the commit
message by saying, "Previously, a new hypercall or subop was suggested
to map write-protected pages into ioreq server. However, it turned out
handler of this new hypercall would be almost the same with the existing
pair - HVMOP_[un]map_io_range_to_ioreq_server, and there's already a
type parameter in this hypercall. So no new hypercall defined, only a
new type is introduced."

And I see that 2/3 internally separates the WP_RAM type into a separate
rangeset, whose size can be adjusted separately.

This addresses my complaint about the interface using gpfns rather than
MMIO ranges as an interface (somewhat anyway).  Sorry for not
acknowledging this at first.

The question of the internal implementation -- whether to use RB tree
rangesets, or radix trees (as apparently ARM memaccess does) or p2m
types -- is an internal implementation question.  I think p2m types is
long-term the best way to go, but it won't hurt to have the current
implementation checked in, as long as it doesn't have any impacts on the
stable interface.

At the moment, as far as I can tell, there's no way for libxl to even
run a version of qemu with XenGT enabled, so there's no real need for
libxl to be involved.



I agree.


The purpose of having the limit would putatively be to prevent a guest
being able to trigger an exhaustion of hypervisor memory by inducing the
device model to mark an arbitrary number of ranges as mmio_dm.

Two angles on this.

First, assuming that limiting the number of ranges is what we want:  I'm
not really a fan of using HVM_PARAMs for this, but as long as it's not
considered a public interface (i.e., it could go away or disappear and
everything would Just Work), then I wouldn't object.

Although I would ask: would it instead be suitable for now to just set
the default limit for WP_RAM to 8192 in the hypervisor, since we do
expect it to be tracking gpfn ranges rather than IO regions?  And if we


That is what we suggested in v9. But Jan proposed we leave this
option to the admin. And to some extent, I can understand his concern.


determine in the future that more ranges are necessary, to then do the
work of moving it to using p2m types (or exposing a knob to adjust it)?

But (and this is the other angle): is simply marking a numerical limit
sufficient to avoid memory exhaustion? Is there a danger that after
creating several guests, such that Xen was now running very low on
memory, that a guest would (purposely or not) cause memory to be
exhausted sometime further after boot, causing a system-wide DoS (or
just general lack of stability)?



This worry sounds reasonable. So from this point of view, I guess value
of thi

Re: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter max_wp_ram_ranges.

2016-02-05 Thread Yu, Zhang



On 2/5/2016 12:18 PM, Tian, Kevin wrote:

From: George Dunlap [mailto:george.dun...@citrix.com]
Sent: Friday, February 05, 2016 1:12 AM

On 04/02/16 14:08, Jan Beulich wrote:

On 04.02.16 at 14:33,  wrote:

Jan Beulich writes ("Re: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter
max_wp_ram_ranges."):

On 04.02.16 at 10:38,  wrote:

So another question is, if value of this limit really matters, will a
lower one be more acceptable(the current 256 being not enough)?


If you've carefully read George's replies, [...]


Thanks to George for the very clear explanation, and also to him for
an illuminating in-person discussion.

It is disturbing that as a result of me as a tools maintainer asking
questions about what seems to me to be a troublesome a user-visible
control setting in libxl, we are now apparently revisiting lower
layers of the hypervisor design, which have already been committed.

While I find George's line of argument convincing, neither I nor
George are maintainers of the relevant hypervisor code.  I am not
going to insist that anything in the hypervisor is done different and
am not trying to use my tools maintainer position to that end.

Clearly there has been a failure of our workflow to consider and
review everything properly together.  But given where we are now, I
think that this discussion about hypervisor internals is probably a
distraction.


While I recall George having made that alternative suggestion,
both Yu and Paul having reservations against it made me not
insist on that alternative. Instead I've been trying to limit some
of the bad effects that the variant originally proposed brought
with it. Clearly, with the more detailed reply George has now
given (involving areas where he is the maintainer for), I should
have been more demanding towards the exploration of that
alternative. That's clearly unfortunate, and I apologize for that,
but such things happen.

As to one of the patches already having for committed - I'm not
worried about that at all. We can always revert, that's why the
thing is called "unstable".


It looks like I should have been more careful to catch up on the current
state of things before I started arguing again -- please accept my
apologies.


Thanks George for your careful thinking.



I see that patch 2/3 addresses the gpfn/io question in the commit
message by saying, "Previously, a new hypercall or subop was suggested
to map write-protected pages into ioreq server. However, it turned out
handler of this new hypercall would be almost the same with the existing
pair - HVMOP_[un]map_io_range_to_ioreq_server, and there's already a
type parameter in this hypercall. So no new hypercall defined, only a
new type is introduced."

And I see that 2/3 internally separates the WP_RAM type into a separate
rangeset, whose size can be adjusted separately.

This addresses my complaint about the interface using gpfns rather than
MMIO ranges as an interface (somewhat anyway).  Sorry for not
acknowledging this at first.

The question of the internal implementation -- whether to use RB tree
rangesets, or radix trees (as apparently ARM memaccess does) or p2m
types -- is an internal implementation question.  I think p2m types is
long-term the best way to go, but it won't hurt to have the current
implementation checked in, as long as it doesn't have any impacts on the
stable interface.


I'm still trying to understand your suggestion vs. this one. Today we
already have a p2m_mmio_write_dm type. It's there already, and any
write fault hitting that type will be delivered to the ioreq server.
Then the next open question is how an ioreq server could know whether
it should handle this request or not, which is why some tracking
structures (either RB/radix) are created to maintain that specific
information. It's under the assumption that multiple ioreq servers
co-exist, so a loop check on all ioreq servers is required to identify
the right target. And multiple ioreq servers are a real case in XenGT,
because our vGPU device model is in the kernel, as part of the Intel
i915 graphics driver. So at least two ioreq servers already exist, with
one routing to XenGT in Dom0 kernel space and the other to the default
Qemu in Dom0 user space.

In your long-term approach with p2m types, it looks like you are
proposing to encode the ioreq server ID in the p2m type directly
(e.g. 4 bits), which then eliminates the need for tracking on the
ioreq server side, so the whole security concern is gone. And there is
no limitation at all. But because available p2m bits are limited, as
Andrew pointed out, it might be reasonable to implement this approach
when a new p2t structure is added, which is why we consider it a
long-term approach.

Please correct me if the above understanding is wrong.
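
As an editorial illustration of that long-term idea (names hypothetical,
and assuming consecutive write_dm p2m type values):

    /* With e.g. four consecutive write_dm p2m types, the faulting gfn's
     * type alone names the target ioreq server - no rangeset walk. */
    #define P2M_WRITE_DM_FIRST  p2m_mmio_write_dm   /* first of the group */
    #define NR_WRITE_DM_SERVERS 4

    static inline unsigned int write_dm_server_id(p2m_type_t t)
    {
        return t - P2M_WRITE_DM_FIRST;   /* 0 .. NR_WRITE_DM_SERVERS-1 */
    }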



At the moment, as far as I can tell, there's no way for libxl to even
run a version of qemu with XenGT enabled, so there's no real need for
libxl to be involved.


No way yet, because we haven't upstreamed all the toolstack changes,
but we should still discuss the requirement as we've been d

[Xen-devel] [PATCH v3 1/2] Differentiate IO/mem resources tracked by ioreq server

2015-08-09 Thread Yu Zhang
Currently in ioreq server, guest write-protected ram pages are
tracked in the same rangeset as device mmio resources. Yet unlike
device mmio, which can come in big chunks, the guest write-
protected pages may be discrete ranges of 4K bytes each.

This patch uses a separate rangeset for the guest ram pages.
And a new ioreq type, IOREQ_TYPE_MEM, is defined.

Note: Previously, a new hypercall or subop was suggested to map
write-protected pages into ioreq server. However, it turned out the
handler of this new hypercall would be almost the same as the
existing pair - HVMOP_[un]map_io_range_to_ioreq_server, and there's
already a type parameter in this hypercall. So no new hypercall is
defined, only a new type is introduced.

Signed-off-by: Yu Zhang 
---
 tools/libxc/include/xenctrl.h| 39 +++---
 tools/libxc/xc_domain.c  | 59 ++--
 xen/arch/x86/hvm/hvm.c   | 33 +++---
 xen/include/asm-x86/hvm/domain.h |  4 +--
 xen/include/public/hvm/hvm_op.h  |  3 +-
 xen/include/public/hvm/ioreq.h   |  1 +
 6 files changed, 126 insertions(+), 13 deletions(-)

diff --git a/tools/libxc/include/xenctrl.h b/tools/libxc/include/xenctrl.h
index de3c0ad..3e8c203 100644
--- a/tools/libxc/include/xenctrl.h
+++ b/tools/libxc/include/xenctrl.h
@@ -1976,12 +1976,12 @@ int xc_hvm_set_ioreq_server_state(xc_interface *xch,
   int enabled);
 
 /**
- * This function registers a range of memory or I/O ports for emulation.
+ * This function registers a range of mmio or I/O ports for emulation.
  *
  * @parm xch a handle to an open hypervisor interface.
  * @parm domid the domain id to be serviced
  * @parm id the IOREQ Server id.
- * @parm is_mmio is this a range of ports or memory
+ * @parm is_mmio is this a range of ports or mmio
  * @parm start start of range
  * @parm end end of range (inclusive).
  * @return 0 on success, -1 on failure.
@@ -1994,12 +1994,12 @@ int xc_hvm_map_io_range_to_ioreq_server(xc_interface *xch,
 uint64_t end);
 
 /**
- * This function deregisters a range of memory or I/O ports for emulation.
+ * This function deregisters a range of mmio or I/O ports for emulation.
  *
  * @parm xch a handle to an open hypervisor interface.
  * @parm domid the domain id to be serviced
  * @parm id the IOREQ Server id.
- * @parm is_mmio is this a range of ports or memory
+ * @parm is_mmio is this a range of ports or mmio
  * @parm start start of range
  * @parm end end of range (inclusive).
  * @return 0 on success, -1 on failure.
@@ -2010,6 +2010,37 @@ int xc_hvm_unmap_io_range_from_ioreq_server(xc_interface *xch,
 int is_mmio,
 uint64_t start,
 uint64_t end);
+/**
+ * This function registers a range of memory for emulation.
+ *
+ * @parm xch a handle to an open hypervisor interface.
+ * @parm domid the domain id to be serviced
+ * @parm id the IOREQ Server id.
+ * @parm start start of range
+ * @parm end end of range (inclusive).
+ * @return 0 on success, -1 on failure.
+ */
+int xc_hvm_map_mem_range_to_ioreq_server(xc_interface *xch,
+domid_t domid,
+ioservid_t id,
+uint64_t start,
+uint64_t end);
+
+/**
+ * This function deregisters a range of memory for emulation.
+ *
+ * @parm xch a handle to an open hypervisor interface.
+ * @parm domid the domain id to be serviced
+ * @parm id the IOREQ Server id.
+ * @parm start start of range
+ * @parm end end of range (inclusive).
+ * @return 0 on success, -1 on failure.
+ */
+int xc_hvm_unmap_mem_range_from_ioreq_server(xc_interface *xch,
+domid_t domid,
+ioservid_t id,
+uint64_t start,
+uint64_t end);
 
 /**
  * This function registers a PCI device for config space emulation.
diff --git a/tools/libxc/xc_domain.c b/tools/libxc/xc_domain.c
index 2ee26fb..7b36c99 100644
--- a/tools/libxc/xc_domain.c
+++ b/tools/libxc/xc_domain.c
@@ -1514,7 +1514,7 @@ int xc_hvm_map_io_range_to_ioreq_server(xc_interface *xch, domid_t domid,
 
 arg->domid = domid;
 arg->id = id;
-arg->type = is_mmio ? HVMOP_IO_RANGE_MEMORY : HVMOP_IO_RANGE_PORT;
+arg->type = is_mmio ? HVMOP_IO_RANGE_MMIO : HVMOP_IO_RANGE_PORT;
 arg->start = start;
 arg->end = end;
 
@@ -1542,7 +1542,7 @@ int xc_hvm_unmap_io_range_from_ioreq_server(xc_interface *xch, domid_t domid,
 
 arg->domid = domid;
 arg->id = id;
-arg->type = is_mmio ? HVMOP_IO_RANGE_MEMORY : HVMOP_IO_RANGE_PORT;
+arg->type = is_mmio ? HVMOP_IO_RANGE_MMIO : HVMOP_IO_RANGE_PORT;
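
(The archived message is cut off at this point, so the remainder of
the xc_domain.c hunk - including the bodies of the two new mem-range
wrappers - is not shown. Presumably they mirror the existing
io_range wrappers; the following is a sketch inferred from the
visible pattern, not the literal posted code.)

int xc_hvm_map_mem_range_to_ioreq_server(xc_interface *xch,
                                         domid_t domid, ioservid_t id,
                                         uint64_t start, uint64_t end)
{
    DECLARE_HYPERCALL;
    DECLARE_HYPERCALL_BUFFER(xen_hvm_io_range_t, arg);
    int rc;

    arg = xc_hypercall_buffer_alloc(xch, arg, sizeof(*arg));
    if ( arg == NULL )
        return -1;

    hypercall.op     = __HYPERVISOR_hvm_op;
    hypercall.arg[0] = HVMOP_map_io_range_to_ioreq_server;
    hypercall.arg[1] = HYPERCALL_BUFFER_AS_ARG(arg);

    arg->domid = domid;
    arg->id    = id;
    arg->type  = HVMOP_IO_RANGE_MEMORY;  /* the new type for guest ram */
    arg->start = start;
    arg->end   = end;

    rc = do_xen_hypercall(xch, &hypercall);

    xc_hypercall_buffer_free(xch, arg);
    return rc;
}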

[Xen-devel] [PATCH v3 2/2] Refactor rangeset structure for better performance.

2015-08-09 Thread Yu Zhang
This patch refactors struct rangeset to base it on a red-black
tree structure instead of the current doubly linked list.
Currently, ioreq servers leverage rangeset to keep track of the
IO/memory resources to be emulated, and when the number of ranges
inside one ioreq server is very high, traversing a doubly linked
list can be time consuming. With this patch, the time complexity of
searching a rangeset improves from O(n) to O(log(n)). The rangeset
interfaces remain the same; no new APIs are introduced.

Signed-off-by: Yu Zhang 
---
 xen/common/rangeset.c | 82 +--
 1 file changed, 60 insertions(+), 22 deletions(-)

diff --git a/xen/common/rangeset.c b/xen/common/rangeset.c
index 6c6293c..87b6aab 100644
--- a/xen/common/rangeset.c
+++ b/xen/common/rangeset.c
@@ -10,11 +10,12 @@
 #include <xen/sched.h>
 #include <xen/errno.h>
 #include <xen/rangeset.h>
+#include <xen/rbtree.h>
 #include <xsm/xsm.h>
 
 /* An inclusive range [s,e] and pointer to next range in ascending order. */
 struct range {
-struct list_head list;
+struct rb_node node;
 unsigned long s, e;
 };
 
@@ -24,7 +25,7 @@ struct rangeset {
 struct domain   *domain;
 
 /* Ordered list of ranges contained in this set, and protecting lock. */
-struct list_head range_list;
+struct rb_root   range_tree;
 
 /* Number of ranges that can be allocated */
 long nr_ranges;
@@ -45,41 +46,78 @@ struct rangeset {
 static struct range *find_range(
 struct rangeset *r, unsigned long s)
 {
-struct range *x = NULL, *y;
+struct rb_node *node;
+struct range   *x;
+struct range   *prev = NULL;
 
-list_for_each_entry ( y, &r->range_list, list )
+node = r->range_tree.rb_node;
+while ( node )
 {
-if ( y->s > s )
-break;
-x = y;
+x = container_of(node, struct range, node);
+if ( (s >= x->s) && (s <= x->e) )
+return x;
+if ( s < x->s )
+node = node->rb_left;
+else if ( s > x->s )
+{
+prev = x;
+node = node->rb_right;
+}
 }
 
-return x;
+return prev;
 }
 
 /* Return the lowest range in the set r, or NULL if r is empty. */
 static struct range *first_range(
 struct rangeset *r)
 {
-if ( list_empty(&r->range_list) )
-return NULL;
-return list_entry(r->range_list.next, struct range, list);
+struct rb_node *node;
+
+node = rb_first(&r->range_tree);
+if ( node != NULL )
+return container_of(node, struct range, node);
+
+return NULL;
 }
 
 /* Return range following x in ascending order, or NULL if x is the highest. */
 static struct range *next_range(
 struct rangeset *r, struct range *x)
 {
-if ( x->list.next == &r->range_list )
-return NULL;
-return list_entry(x->list.next, struct range, list);
+struct rb_node *node;
+
+node = rb_next(&x->node);
+if ( node )
+return container_of(node, struct range, node);
+
+return NULL;
 }
 
 /* Insert range y after range x in r. Insert as first range if x is NULL. */
 static void insert_range(
 struct rangeset *r, struct range *x, struct range *y)
 {
-list_add(&y->list, (x != NULL) ? &x->list : &r->range_list);
+struct rb_node **node;
+struct rb_node *parent = NULL;
+
+if ( x == NULL )
+node = &r->range_tree.rb_node;
+else
+{
+node = &x->node.rb_right;
+parent = &x->node;
+}
+
+while ( *node )
+{
+parent = *node;
+node = &parent->rb_left;
+}
+
+/* Add new node and rebalance the red-black tree. */
+rb_link_node(&y->node, parent, node);
+rb_insert_color(&y->node, &r->range_tree);
 }
 
 /* Remove a range from its list and free it. */
@@ -88,7 +126,7 @@ static void destroy_range(
 {
 r->nr_ranges++;
 
-list_del(&x->list);
+rb_erase(&x->node, &r->range_tree);
 xfree(x);
 }
 
@@ -319,7 +357,7 @@ bool_t rangeset_contains_singleton(
 bool_t rangeset_is_empty(
 const struct rangeset *r)
 {
-return ((r == NULL) || list_empty(&r->range_list));
+return ((r == NULL) || RB_EMPTY_ROOT(&r->range_tree));
 }
 
 struct rangeset *rangeset_new(
@@ -332,7 +370,7 @@ struct rangeset *rangeset_new(
 return NULL;
 
 rwlock_init(&r->lock);
-INIT_LIST_HEAD(&r->range_list);
+r->range_tree = RB_ROOT;
 r->nr_ranges = -1;
 
 BUG_ON(flags & ~RANGESETF_prettyprint_hex);
@@ -410,7 +448,7 @@ void rangeset_domain_destroy(
 
 void rangeset_swap(struct rangeset *a, struct rangeset *b)
 {
-LIST_HEAD(tmp);
+struct rb_node* tmp;
 
 if ( a < b )
 {
@@ -423,9 +461,9 @@ void rangeset_swap(struct rangeset *a, struct rangeset *b)
 write_lock(&a->lock);
 }
 
-list_splice_init(&a->range_list, &tmp);
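
(The archive truncates the remainder of this hunk. With the
tree-based rangeset, the splice calls presumably reduce to
exchanging the two root pointers - a sketch under that assumption,
not the literal posted code:)

    /* Swap the two trees by exchanging their root pointers; each
     * struct range stays linked into whichever tree now owns it. */
    tmp = a->range_tree.rb_node;
    a->range_tree.rb_node = b->range_tree.rb_node;
    b->range_tree.rb_node = tmp;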

[Xen-devel] [PATCH v3 0/2] Refactor ioreq server for better performance.

2015-08-09 Thread Yu Zhang
XenGT leverages ioreq server to track and forward accesses to GPU
I/O resources, e.g. the PPGTT (per-process graphic translation
tables). Currently, ioreq server uses rangeset to track the BDF/
PIO/MMIO ranges to be emulated. To select an ioreq server, the
rangeset is searched to see if the I/O range is recorded. However,
traversing the linked list inside rangeset can be time consuming
when the number of ranges is high. On the HSW platform, the number
of PPGTTs for each vGPU can be several hundred; on BDW, several
thousand. This patch series refactors rangeset to base it on a
red-black tree, so that searching is more efficient.

Besides, this patchset also splits the tracking of MMIO and guest
ram ranges into different rangesets. And to accommodate more ranges,
the limit on the number of ranges in an ioreq server,
MAX_NR_IO_RANGES, is changed - future patches might be provided to
tune this with other approaches.


Changes in v3: 
1> Use a separate rangeset for guest ram pages in ioreq server;
2> Refactor rangeset, instead of introducing a new data structure.

Changes in v2: 
1> Split the original patch into 2;
2> Take Paul Durrant's comments:
  a> Add a name member in the struct rb_rangeset, and use the 'q' 
debug key to dump the ranges in ioreq server;
  b> Keep original routine names for hvm ioreq server;
  c> Commit message changes - mention that a future patch will
change the maximum number of ranges inside an ioreq server.

Yu Zhang (2):
  Differentiate IO/mem resources tracked by ioreq server
  Refactor rangeset structure for better performance.

 tools/libxc/include/xenctrl.h| 39 +--
 tools/libxc/xc_domain.c  | 59 -
 xen/arch/x86/hvm/hvm.c   | 33 ++--
 xen/common/rangeset.c| 82 +---
 xen/include/asm-x86/hvm/domain.h |  4 +-
 xen/include/public/hvm/hvm_op.h  |  3 +-
 xen/include/public/hvm/ioreq.h   |  1 +
 7 files changed, 186 insertions(+), 35 deletions(-)

-- 
1.9.1




Re: [Xen-devel] [PATCH v3 1/2] Differentiate IO/mem resources tracked by ioreq server

2015-08-11 Thread Yu, Zhang



On 8/10/2015 4:26 PM, Wei Liu wrote:

On Mon, Aug 10, 2015 at 11:33:40AM +0800, Yu Zhang wrote:

Currently in ioreq server, guest write-protected ram pages are
tracked in the same rangeset as device mmio resources. Yet unlike
device mmio, which can come in big chunks, the guest write-
protected pages may be discrete ranges of 4K bytes each.

This patch uses a separate rangeset for the guest ram pages.
And a new ioreq type, IOREQ_TYPE_MEM, is defined.

Note: Previously, a new hypercall or subop was suggested to map
write-protected pages into ioreq server. However, it turned out the
handler of this new hypercall would be almost the same as the
existing pair - HVMOP_[un]map_io_range_to_ioreq_server, and there's
already a type parameter in this hypercall. So no new hypercall is
defined, only a new type is introduced.

Signed-off-by: Yu Zhang 
---
  tools/libxc/include/xenctrl.h| 39 +++---
  tools/libxc/xc_domain.c  | 59 ++--


FWIW the hypercall wrappers look correct to me.


diff --git a/xen/include/public/hvm/hvm_op.h b/xen/include/public/hvm/hvm_op.h
index 014546a..9106cb9 100644
--- a/xen/include/public/hvm/hvm_op.h
+++ b/xen/include/public/hvm/hvm_op.h
@@ -329,8 +329,9 @@ struct xen_hvm_io_range {
  ioservid_t id;   /* IN - server id */
  uint32_t type;   /* IN - type of range */
  # define HVMOP_IO_RANGE_PORT   0 /* I/O port range */
-# define HVMOP_IO_RANGE_MEMORY 1 /* MMIO range */
+# define HVMOP_IO_RANGE_MMIO   1 /* MMIO range */
  # define HVMOP_IO_RANGE_PCI    2 /* PCI segment/bus/dev/func range */
+# define HVMOP_IO_RANGE_MEMORY 3 /* MEMORY range */


This looks problematic. Maybe you can get away with this because this is
a toolstack-only interface?
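
(Presumably the problem being pointed at is the renumbering:
HVMOP_IO_RANGE_MEMORY keeps its name but changes value from 1 to 3,
as the comparison below illustrates.)

/* Old numbering:                  New numbering (this patch):
 *  HVMOP_IO_RANGE_PORT   == 0      HVMOP_IO_RANGE_PORT   == 0
 *  HVMOP_IO_RANGE_MEMORY == 1      HVMOP_IO_RANGE_MMIO   == 1
 *  HVMOP_IO_RANGE_PCI    == 2      HVMOP_IO_RANGE_PCI    == 2
 *                                  HVMOP_IO_RANGE_MEMORY == 3
 * A consumer built against the old header still passes 1 and now
 * registers an MMIO-only range; anything recompiled picks up the
 * new value 3 and changes behaviour silently. */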


Thanks Wei.
Well, I believe this interface could be used by the backend device
driver and by qemu as well (which I had neglected).  :-)

Yu


Wei.




