[Xen-devel] [PATCH 3/5] xen: add some __init and static annotations in arch/x86/xen/setup.c

2015-01-27 Thread Juergen Gross
Some more functions in arch/x86/xen/setup.c can be made "__init".
xen_ignore_unusable() can be made "static".

Signed-off-by: Juergen Gross 
---
 arch/x86/xen/setup.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c
index 4dcc608..55f388e 100644
--- a/arch/x86/xen/setup.c
+++ b/arch/x86/xen/setup.c
@@ -535,8 +535,8 @@ static unsigned long __init xen_get_max_pages(void)
return min(max_pages, MAX_DOMAIN_PAGES);
 }
 
-static void xen_align_and_add_e820_region(phys_addr_t start, phys_addr_t size,
- int type)
+static void __init xen_align_and_add_e820_region(phys_addr_t start,
+phys_addr_t size, int type)
 {
phys_addr_t end = start + size;
 
@@ -549,7 +549,7 @@ static void xen_align_and_add_e820_region(phys_addr_t start, phys_addr_t size,
e820_add_region(start, end - start, type);
 }
 
-void xen_ignore_unusable(struct e820entry *list, size_t map_size)
+static void __init xen_ignore_unusable(struct e820entry *list, size_t map_size)
 {
struct e820entry *entry;
unsigned int i;
-- 
2.1.2
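
As an aside, a minimal sketch (illustrative only, not from the patch; the
function name is made up) of what the two annotations in question buy:

	#include <linux/init.h>

	/*
	 * __init places the function in the .init.text section, which the
	 * kernel frees once boot is complete, so boot-only helpers cost no
	 * memory at runtime.  static limits the symbol to this file, so the
	 * compiler can warn if it becomes unused.
	 */
	static void __init example_boot_only_helper(void)
	{
		/* runs once during boot; its memory is reclaimed afterwards */
	}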




[Xen-devel] [PATCH 2/3] xen: scsiback: add LUN of restored domain

2015-01-30 Thread Juergen Gross
When a Xen domain is being restored, the LUN state of a pvscsi device
is "Connected" rather than "Initialising", as it would be when
attaching a new pvscsi LUN.

This must be taken into account when adding a new pvscsi device for
a domain, as otherwise the pvscsi LUN won't be connected to the
SCSI target associated with it.

Signed-off-by: Juergen Gross 
---
 drivers/xen/xen-scsiback.c | 19 ++-
 1 file changed, 14 insertions(+), 5 deletions(-)

diff --git a/drivers/xen/xen-scsiback.c b/drivers/xen/xen-scsiback.c
index 6457784..4290921 100644
--- a/drivers/xen/xen-scsiback.c
+++ b/drivers/xen/xen-scsiback.c
@@ -993,7 +993,7 @@ found:
 }
 
 static void scsiback_do_add_lun(struct vscsibk_info *info, const char *state,
-   char *phy, struct ids_tuple *vir)
+   char *phy, struct ids_tuple *vir, int try)
 {
if (!scsiback_add_translation_entry(info, phy, vir)) {
if (xenbus_printf(XBT_NIL, info->dev->nodename, state,
@@ -1001,7 +1001,7 @@ static void scsiback_do_add_lun(struct vscsibk_info *info, const char *state,
pr_err("xen-pvscsi: xenbus_printf error %s\n", state);
scsiback_del_translation_entry(info, vir);
}
-   } else {
+   } else if (!try) {
xenbus_printf(XBT_NIL, info->dev->nodename, state,
  "%d", XenbusStateClosed);
}
@@ -1061,10 +1061,19 @@ static void scsiback_do_1lun_hotplug(struct vscsibk_info *info, int op,
 
switch (op) {
case VSCSIBACK_OP_ADD_OR_DEL_LUN:
-   if (device_state == XenbusStateInitialising)
-   scsiback_do_add_lun(info, state, phy, &vir);
-   if (device_state == XenbusStateClosing)
+   switch (device_state) {
+   case XenbusStateInitialising:
+   scsiback_do_add_lun(info, state, phy, &vir, 0);
+   break;
+   case XenbusStateConnected:
+   scsiback_do_add_lun(info, state, phy, &vir, 1);
+   break;
+   case XenbusStateClosing:
scsiback_do_del_lun(info, state, &vir);
+   break;
+   default:
+   break;
+   }
break;
 
case VSCSIBACK_OP_UPDATEDEV_STATE:
-- 
2.1.4




[Xen-devel] [PATCH 1/3] xen: mark pvscsi frontend request consumed only after last read

2015-01-30 Thread Juergen Gross
A request in the ring buffer mustn't be read after it has been marked
as consumed. Otherwise it might already have been reused by the
frontend without violating the ring protocol.

To avoid inconsistencies in the backend, only work on a private copy
of the request. This ensures a malicious guest cannot bypass the
backend's consistency checks by modifying an active request.
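
A compact sketch of the double-fetch hazard described above (valid_act()
and do_act() are made-up helpers, not code from this driver):

	/* Racy: two fetches from the shared slot.  The guest can change
	 * ->act between the validation and the use.
	 */
	if (!valid_act(RING_GET_REQUEST(ring, rc)->act))
		return -EINVAL;
	do_act(RING_GET_REQUEST(ring, rc)->act);	/* may see a new value */

	/* Safe: one fetch into a private copy; validate and use that. */
	struct vscsiif_request req = *RING_GET_REQUEST(ring, rc);
	if (!valid_act(req.act))
		return -EINVAL;
	do_act(req.act);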

Signed-off-by: Juergen Gross 
---
 drivers/xen/xen-scsiback.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/xen/xen-scsiback.c b/drivers/xen/xen-scsiback.c
index e999496e..6457784 100644
--- a/drivers/xen/xen-scsiback.c
+++ b/drivers/xen/xen-scsiback.c
@@ -708,7 +708,7 @@ static int prepare_pending_reqs(struct vscsibk_info *info,
 static int scsiback_do_cmd_fn(struct vscsibk_info *info)
 {
struct vscsiif_back_ring *ring = &info->ring;
-   struct vscsiif_request *ring_req;
+   struct vscsiif_request ring_req;
struct vscsibk_pend *pending_req;
RING_IDX rc, rp;
int err, more_to_do;
@@ -734,11 +734,11 @@ static int scsiback_do_cmd_fn(struct vscsibk_info *info)
if (!pending_req)
return 1;
 
-   ring_req = RING_GET_REQUEST(ring, rc);
+   memcpy(&ring_req, RING_GET_REQUEST(ring, rc), sizeof(ring_req));
ring->req_cons = ++rc;
 
-   act = ring_req->act;
-   err = prepare_pending_reqs(info, ring_req, pending_req);
+   act = ring_req.act;
+   err = prepare_pending_reqs(info, &ring_req, pending_req);
if (err) {
switch (err) {
case -ENODEV:
@@ -756,7 +756,7 @@ static int scsiback_do_cmd_fn(struct vscsibk_info *info)
 
switch (act) {
case VSCSIIF_ACT_SCSI_CDB:
-   if (scsiback_gnttab_data_map(ring_req, pending_req)) {
+   if (scsiback_gnttab_data_map(&ring_req, pending_req)) {
scsiback_fast_flush_area(pending_req);
scsiback_do_resp_with_sense(NULL,
DRIVER_ERROR << 24, 0, pending_req);
@@ -767,7 +767,7 @@ static int scsiback_do_cmd_fn(struct vscsibk_info *info)
break;
case VSCSIIF_ACT_SCSI_ABORT:
scsiback_device_action(pending_req, TMR_ABORT_TASK,
-   ring_req->ref_rqid);
+   ring_req.ref_rqid);
break;
case VSCSIIF_ACT_SCSI_RESET:
scsiback_device_action(pending_req, TMR_LUN_RESET, 0);
-- 
2.1.4




[Xen-devel] [PATCH 3/3] xen: support suspend/resume in pvscsi frontend

2015-01-30 Thread Juergen Gross
Up to now the pvscsi frontend hasn't supported domain suspend and
resume. When a domain with an assigned pvscsi device was suspended
and resumed again, it was not able to use the device any more: trying
to do so resulted in hanging processes.

Support suspend and resume of pvscsi devices.

Signed-off-by: Juergen Gross 
---
 drivers/scsi/xen-scsifront.c | 189 ---
 1 file changed, 162 insertions(+), 27 deletions(-)

diff --git a/drivers/scsi/xen-scsifront.c b/drivers/scsi/xen-scsifront.c
index 34199d2..b32157b 100644
--- a/drivers/scsi/xen-scsifront.c
+++ b/drivers/scsi/xen-scsifront.c
@@ -63,6 +63,7 @@
 
 #define VSCSIFRONT_OP_ADD_LUN  1
 #define VSCSIFRONT_OP_DEL_LUN  2
+#define VSCSIFRONT_OP_READD_LUN  3
 
 /* Tuning point. */
 #define VSCSIIF_DEFAULT_CMD_PER_LUN 10
@@ -113,8 +114,13 @@ struct vscsifrnt_info {
DECLARE_BITMAP(shadow_free_bitmap, VSCSIIF_MAX_REQS);
struct vscsifrnt_shadow *shadow[VSCSIIF_MAX_REQS];
 
+   /* Following items are protected by the host lock. */
wait_queue_head_t wq_sync;
+   wait_queue_head_t wq_pause;
unsigned int wait_ring_available:1;
+   unsigned int waiting_pause:1;
+   unsigned int pause:1;
+   unsigned callers;
 
char dev_state_path[64];
struct task_struct *curr;
@@ -274,31 +280,31 @@ static void scsifront_sync_cmd_done(struct vscsifrnt_info *info,
wake_up(&shadow->wq_reset);
 }
 
-static int scsifront_cmd_done(struct vscsifrnt_info *info)
+static void scsifront_do_response(struct vscsifrnt_info *info,
+ struct vscsiif_response *ring_rsp)
+{
+   if (WARN(ring_rsp->rqid >= VSCSIIF_MAX_REQS ||
+test_bit(ring_rsp->rqid, info->shadow_free_bitmap),
+"illegal rqid %u returned by backend!\n", ring_rsp->rqid))
+   return;
+
+   if (info->shadow[ring_rsp->rqid]->act == VSCSIIF_ACT_SCSI_CDB)
+   scsifront_cdb_cmd_done(info, ring_rsp);
+   else
+   scsifront_sync_cmd_done(info, ring_rsp);
+}
+
+static int scsifront_ring_drain(struct vscsifrnt_info *info)
 {
struct vscsiif_response *ring_rsp;
RING_IDX i, rp;
int more_to_do = 0;
-   unsigned long flags;
-
-   spin_lock_irqsave(info->host->host_lock, flags);
 
rp = info->ring.sring->rsp_prod;
rmb();  /* ordering required respective to dom0 */
for (i = info->ring.rsp_cons; i != rp; i++) {
-
ring_rsp = RING_GET_RESPONSE(&info->ring, i);
-
-   if (WARN(ring_rsp->rqid >= VSCSIIF_MAX_REQS ||
-test_bit(ring_rsp->rqid, info->shadow_free_bitmap),
-"illegal rqid %u returned by backend!\n",
-ring_rsp->rqid))
-   continue;
-
-   if (info->shadow[ring_rsp->rqid]->act == VSCSIIF_ACT_SCSI_CDB)
-   scsifront_cdb_cmd_done(info, ring_rsp);
-   else
-   scsifront_sync_cmd_done(info, ring_rsp);
+   scsifront_do_response(info, ring_rsp);
}
 
info->ring.rsp_cons = i;
@@ -308,6 +314,18 @@ static int scsifront_cmd_done(struct vscsifrnt_info *info)
else
info->ring.sring->rsp_event = i + 1;
 
+   return more_to_do;
+}
+
+static int scsifront_cmd_done(struct vscsifrnt_info *info)
+{
+   int more_to_do;
+   unsigned long flags;
+
+   spin_lock_irqsave(info->host->host_lock, flags);
+
+   more_to_do = scsifront_ring_drain(info);
+
info->wait_ring_available = 0;
 
spin_unlock_irqrestore(info->host->host_lock, flags);
@@ -328,6 +346,24 @@ static irqreturn_t scsifront_irq_fn(int irq, void *dev_id)
return IRQ_HANDLED;
 }
 
+static void scsifront_finish_all(struct vscsifrnt_info *info)
+{
+   unsigned i;
+   struct vscsiif_response resp;
+
+   scsifront_ring_drain(info);
+
+   for (i = 0; i < VSCSIIF_MAX_REQS; i++) {
+   if (test_bit(i, info->shadow_free_bitmap))
+   continue;
+   resp.rqid = i;
+   resp.sense_len = 0;
+   resp.rslt = DID_RESET << 16;
+   resp.residual_len = 0;
+   scsifront_do_response(info, &resp);
+   }
+}
+
 static int map_data_for_request(struct vscsifrnt_info *info,
struct scsi_cmnd *sc,
struct vscsiif_request *ring_req,
@@ -475,6 +511,27 @@ static struct vscsiif_request *scsifront_command2ring(
return ring_req;
 }
 
+static int scsifront_enter(struct vscsifrnt_info *info)
+{
+   if (info->pause)
+   return 1;
+   info->callers++;
+   return 0;
+}
+
+static void scsifront_return(struct vscsifrnt_info *info)
+{
+   info->call
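
[The archived message is truncated here. Based on the fields introduced
above (callers, waiting_pause, wq_pause), the return path presumably pairs
with scsifront_enter() roughly as in this sketch, which is not the actual
patch text:

	static void scsifront_return(struct vscsifrnt_info *info)
	{
		info->callers--;
		if (!info->callers && info->waiting_pause) {
			info->waiting_pause = 0;
			wake_up(&info->wq_pause);
		}
	}
]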

[Xen-devel] [PATCH 0/3] xen: pvscsi: avoid race, support suspend/resume

2015-01-30 Thread Juergen Gross
In the pvscsi backend copy the frontend request to ensure it is not
changed by the frontend during processing it in the backend.

Support suspend/resume of the domain to be able to access a pvscsi
device in the frontend afterwards.

Juergen Gross (3):
  xen: mark pvscsi frontend request consumed only after last read
  xen: scsiback: add LUN of restored domain
  xen: support suspend/resume in pvscsi frontend

 drivers/scsi/xen-scsifront.c | 189 ---
 drivers/xen/xen-scsiback.c   |  31 ---
 2 files changed, 182 insertions(+), 38 deletions(-)

-- 
2.1.4




Re: [Xen-devel] [PATCH 1/3] xen: mark pvscsi frontend request consumed only after last read

2015-01-30 Thread Juergen Gross

On 01/30/2015 12:47 PM, Jan Beulich wrote:

On 30.01.15 at 12:21,  wrote:

@@ -734,11 +734,11 @@ static int scsiback_do_cmd_fn(struct vscsibk_info *info)
if (!pending_req)
return 1;

-   ring_req = RING_GET_REQUEST(ring, rc);
+   memcpy(&ring_req, RING_GET_REQUEST(ring, rc), sizeof(ring_req));


I'd recommend the type safe *ring_req = *RING_GET_REQUEST(ring, rc)
here.


I think I'll use ring_req = *RING_GET_REQUEST(ring, rc) :-)




ring->req_cons = ++rc;

-   act = ring_req->act;
-   err = prepare_pending_reqs(info, ring_req, pending_req);
+   act = ring_req.act;


Is this helper variable then still needed?


No, you're right. Will delete it.


Juergen



[Xen-devel] [PATCH linux-2.6.18] xen: mark pvscsi frontend request consumed only after last read

2015-01-30 Thread Juergen Gross
A request in the ring buffer mustn't be read after it has been marked
as consumed. Otherwise it might already have been reused by the
frontend without violating the ring protocol.

To avoid inconsistencies in the backend, only work on a private copy
of the request. This ensures a malicious guest cannot bypass the
backend's consistency checks by modifying an active request.

Signed-off-by: Juergen Gross 

diff -r 578e5aea3cbb drivers/xen/scsiback/scsiback.c
--- a/drivers/xen/scsiback/scsiback.c   Mon Jan 19 11:51:46 2015 +0100
+++ b/drivers/xen/scsiback/scsiback.c   Fri Jan 30 14:43:29 2015 +0100
@@ -579,7 +579,7 @@ invalid_value:
 static int _scsiback_do_cmd_fn(struct vscsibk_info *info)
 {
struct vscsiif_back_ring *ring = &info->ring;
-   vscsiif_request_t  *ring_req;
+   vscsiif_request_t ring_req;
 
pending_req_t *pending_req;
RING_IDX rc, rp;
@@ -609,10 +609,10 @@ static int _scsiback_do_cmd_fn(struct vs
break;
}
 
-   ring_req = RING_GET_REQUEST(ring, rc);
+   ring_req = *RING_GET_REQUEST(ring, rc);
ring->req_cons = ++rc;
 
-   err = prepare_pending_reqs(info, ring_req,
+   err = prepare_pending_reqs(info, &ring_req,
pending_req);
switch (err ?: pending_req->act) {
case VSCSIIF_ACT_SCSI_CDB:



[Xen-devel] [PATCH linux-2.6.18] support suspend/resume in pvscsi drivers

2015-01-30 Thread Juergen Gross
Up to now the pvscsi drivers haven't supported domain suspend and
resume. When a domain with an assigned pvscsi device was suspended
and resumed again, it was not able to use the device any more: trying
to do so resulted in hanging processes.

Support suspend and resume of pvscsi devices.

Signed-off-by: Juergen Gross 

diff -r 578e5aea3cbb drivers/xen/scsiback/xenbus.c
--- a/drivers/xen/scsiback/xenbus.c Mon Jan 19 11:51:46 2015 +0100
+++ b/drivers/xen/scsiback/xenbus.c Fri Jan 30 13:57:29 2015 +0100
@@ -167,33 +167,48 @@ static void scsiback_do_lun_hotplug(stru
 
 	switch (op) {
 	case VSCSIBACK_OP_ADD_OR_DEL_LUN:
-		if (device_state == XenbusStateInitialising) {
+		switch (device_state) {
+		case XenbusStateInitialising:
+		case XenbusStateConnected:
 			sdev = scsiback_get_scsi_device(&phy);
-			if (!sdev)
-				xenbus_printf(XBT_NIL, dev->nodename, state_str,
-					      "%d", XenbusStateClosed);
-			else {
-				err = scsiback_add_translation_entry(be->info, sdev, &vir);
-				if (!err) {
-					if (xenbus_printf(XBT_NIL, dev->nodename, state_str,
-							  "%d", XenbusStateInitialised)) {
-						printk(KERN_ERR "scsiback: xenbus_printf error %s\n", state_str);
-						scsiback_del_translation_entry(be->info, &vir);
-					}
-				} else {
-					scsi_device_put(sdev);
-					xenbus_printf(XBT_NIL, dev->nodename, state_str,
-						      "%d", XenbusStateClosed);
-				}
+			if (!sdev) {
+				xenbus_printf(XBT_NIL, dev->nodename, state_str,
+					      "%d", XenbusStateClosed);
+				break;
 			}
-		}
+			if (scsiback_add_translation_entry(be->info, sdev, &vir)) {
+				scsi_device_put(sdev);
+				if (device_state == XenbusStateConnected)
+					break;
+				xenbus_printf(XBT_NIL, dev->nodename, state_str,
+					      "%d", XenbusStateClosed);
+				break;
+			}
+			if (!xenbus_printf(XBT_NIL, dev->nodename, state_str,
+					   "%d", XenbusStateInitialised))
+				break;
+			printk(KERN_ERR "scsiback: xenbus_printf error %s\n",
+			       state_str);
+			scsiback_del_translation_entry(be->info, &vir);
+			break;
 
-		if (device_state == XenbusStateClosing) {
-			if (!scsiback_del_translation_entry(be->info, &vir)) {
-				if (xenbus_printf(XBT_NIL, dev->nodename, state_str,
-						  "%d", XenbusStateClosed))
-					printk(KERN_ERR "scsiback: xenbus_printf error %s\n", state_str);
-			}
+		case XenbusStateClosing:
+			if (scsiback_del_translation_entry(be->info, &vir))
+				break;
+			if (xenbus_printf(XBT_NIL, dev->nodename, state_str,
+					  "%d", XenbusStateClosed))
+				printk(KERN_ERR "scsiback: xenbus_

Re: [Xen-devel] [PATCH linux-2.6.18] xen: mark pvscsi frontend request consumed only after last read

2015-01-30 Thread Juergen Gross

On 01/30/2015 03:22 PM, Jan Beulich wrote:

On 30.01.15 at 14:51, <"jgr...@suse.com".non-mime.internet> wrote:

A request in the ring buffer mustn't be read after it has been marked
as consumed. Otherwise it might already have been reused by the
frontend without violating the ring protocol.

To avoid inconsistencies in the backend, only work on a private copy
of the request. This ensures a malicious guest cannot bypass the
backend's consistency checks by modifying an active request.


I'm not convinced we need this in this version of the driver: c/s
590:c4134d1a3e3f took care of reading each ring_req field just
once.


This might be true. But the consumer index is incremented before the
last item of the request is read. This is a violation of the ring
interface: the frontend is free to put another request in this slot
while the backend is still using it.
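
To illustrate the ordering rule (a sketch using the standard ring macros;
not a quote from either patch):

	/* Unsafe: once req_cons moves past the slot, the frontend may
	 * legitimately reuse it, so later reads race with the guest.
	 */
	ring_req = RING_GET_REQUEST(ring, rc);
	ring->req_cons = ++rc;
	act = ring_req->act;	/* reads a slot the frontend may now own */

	/* Safe: take a private copy first, then mark the slot consumed. */
	struct vscsiif_request req = *RING_GET_REQUEST(ring, rc);
	ring->req_cons = ++rc;
	act = req.act;		/* all later reads hit the private copy */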


Juergen




Re: [Xen-devel] [PATCH linux-2.6.18] xen: mark pvscsi frontend request consumed only after last read

2015-01-30 Thread Juergen Gross

On 01/30/2015 03:32 PM, Jan Beulich wrote:

On 30.01.15 at 15:22,  wrote:

On 30.01.15 at 14:51, <"jgr...@suse.com".non-mime.internet> wrote:

A request in the ring buffer mustn't be read after it has been marked
as consumed. Otherwise it might already have been reused by the
frontend without violating the ring protocol.

To avoid inconsistencies in the backend, only work on a private copy
of the request. This ensures a malicious guest cannot bypass the
backend's consistency checks by modifying an active request.


I'm not convinced we need this in this version of the driver: c/s
590:c4134d1a3e3f took care of reading each ring_req field just
once.


I should have clarified that I didn't mean we don't need to change
anything here: We should still move down the point where the
ring slot gets accounted as consumed.


My solution is more robust, I think. You don't have to be careful not
to introduce another double read somewhere.

Juergen




Re: [Xen-devel] [PATCH linux-2.6.18] support suspend/resume in pvscsi drivers

2015-01-30 Thread Juergen Gross

On 01/30/2015 03:46 PM, Jan Beulich wrote:

On 30.01.15 at 14:52, <"jgr...@suse.com".non-mime.internet> wrote:

@@ -231,8 +242,23 @@ static int scsifront_cmd_done(struct vsc
return more_to_do;
  }

+void scsifront_finish_all(struct vscsifrnt_info *info)
+{
+   unsigned i;
+   struct vscsiif_response resp;

+   scsifront_ring_drain(info);


Shouldn't you at least issue some kind of warning when this returns
non-zero?


If a warning should be issued, it should be done after the following
loop, in case at least one request was terminated there.

I'm really not sure whether a warning is required here. If you like,
I can add one.

Juergen



Re: [Xen-devel] [PATCH linux-2.6.18] support suspend/resume in pvscsi drivers

2015-01-30 Thread Juergen Gross

On 01/30/2015 04:05 PM, Jan Beulich wrote:

On 30.01.15 at 15:54,  wrote:

On 01/30/2015 03:46 PM, Jan Beulich wrote:

On 30.01.15 at 14:52, <"jgr...@suse.com".non-mime.internet> wrote:

@@ -231,8 +242,23 @@ static int scsifront_cmd_done(struct vsc
return more_to_do;
   }

+void scsifront_finish_all(struct vscsifrnt_info *info)
+{
+   unsigned i;
+   struct vscsiif_response resp;

+   scsifront_ring_drain(info);


Shouldn't you at least issue some kind of warning when this returns
non-zero?


If a warning should be issued, it should be done after the following
loop, in case at least one request was terminated there.

I'm really not sure whether a warning is required here. If you like,
I can add one.


I'm not sure; I'm merely asking because I saw the function return
value being ignored here.


I think it can be 0 only. We are handling resume, so the ring which is
being drained will no longer be filled by the backend.

Juergen




Re: [Xen-devel] [PATCH linux-2.6.18] xen: mark pvscsi frontend request consumed only after last read

2015-02-02 Thread Juergen Gross

On 02/02/2015 08:52 AM, Jan Beulich wrote:

On 30.01.15 at 14:51, <"jgr...@suse.com".non-mime.internet> wrote:

A request in the ring buffer mustn't be read after it has been marked
as consumed. Otherwise it might already have been reused by the
frontend without violating the ring protocol.


This is irrelevant, as the ->req_cons is a backend private field (if it
was a shared one, a barrier would have been needed between
copying and updating that field).
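
For reference, the consumer pattern this refers to (a sketch only;
process_request() is a made-up placeholder): the barrier that is needed
sits between reading req_prod and reading the request slots, while
req_cons lives in the backend-private ring struct and needs none:

	rp = ring->sring->req_prod;
	rmb();	/* fetch req_prod before reading any request slots */
	for (rc = ring->req_cons; rc != rp; rc++) {
		struct vscsiif_request req = *RING_GET_REQUEST(ring, rc);

		/* req_cons is private to the backend, so the copy above
		 * and the update below need no barrier between them.
		 */
		ring->req_cons = rc + 1;
		process_request(&req);
	}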


Hmm, you are right. Interesting, I've always thought req_cons would be
used by the frontend, too.

Thanks for updating my protocol know-how. :-)




To avoid inconsistencies in the backend, only work on a private copy
of the request. This ensures a malicious guest cannot bypass the
backend's consistency checks by modifying an active request.


Hence, also considering my earlier reply, I don't view the change
as necessary.


Agreed.


Juergen




Re: [Xen-devel] pvSCSI test

2015-02-03 Thread Juergen Gross

On 02/03/2015 07:16 PM, Kristian Hagsted Rasmussen wrote:


Hi Olaf and Juergen


I am interested in testing pvSCSI as I have a system where it would be ideal.
I have tried to apply this patch "http://marc.info/?l=xen-devel&m=139885599019457&w=2",
called "libxl: add support for pvscsi, iteration 1", to my xen-4.5 tree.
I am using kernel 3.18.4 with xen-scsiback and xen-scsifront compiled
into the kernel, for both Dom0 and DomU.

  My Ubuntu domU is running fine, however the scsi disk I want to pass through 
does not appear.

in my config I have added the line:
vscsi= ['3:0:0:0,0:0:0:0']
I have also tried with:
vscsi= ['/dev/sdb,0:0:0:0']

but no matter which syntax I use I get the following error in dmesg log:
xen-pvscsi: 3:0:0:0 doesn't exist

I have no errors in the log files for the domU.

Am I missing something in my configuration, perhaps some device hiding like for 
pci pass through?


The upstream pvscsi backend is using the target infrastructure. Until
the tools are aware of this you'll have to configure the device to
pass to a domain in Dom0 e.g. via a script:

#!/bin/bash
# usage:
# mkpvscsi device
# e.g. mkpvscsi /dev/sr0

DEV=$1

gen_uuid()
{
cat /proc/sys/kernel/random/uuid | \
awk '{print "naa.6001405" substr($1,1,8) substr($1,10,1);}'
}

TARG=`gen_uuid`
INIT=`gen_uuid`

NODE=`lsscsi | awk '$(NF) == "'$DEV'"
{ print substr($1,2,length($1)-4); }'`
NAME=`echo $NODE | sed 's/:/_/g'`

modprobe configfs
mount -t configfs configfs /sys/kernel/config
modprobe xen-scsiback
modprobe target_core_mod
cd /sys/kernel/config/target

mkdir -p core/pscsi_0/$NAME
echo "$DEV" >core/pscsi_0/$NAME/udev_path

mkdir -p xen-pvscsi/$TARG/tpgt_0
echo "$NODE" >xen-pvscsi/$TARG/tpgt_0/param/alias
echo $INIT >xen-pvscsi/$TARG/tpgt_0/nexus
mkdir xen-pvscsi/$TARG/tpgt_0/lun/lun_0

cd xen-pvscsi/$TARG/tpgt_0/lun/lun_0
ln -s ../../../../../../target/core/pscsi_0/$NAME xen-pvscsi_port




After doing this you can use the xen tools to give the device to a domU.
Please note: this script is untested, I've used a simpler one which
fitted my needs (using targetcli, which isn't available everywhere).
I have tested all single steps of the script above, though.

Happy testing,

Juergen

P.S.: If you are feeling adventurous you can try other target backends
  than pscsi, e.g. iscsi or fileio.



Re: [Xen-devel] Xen's Linux kernel config options V2

2015-02-03 Thread Juergen Gross

On 02/04/2015 01:48 AM, Luis R. Rodriguez wrote:

I'm going to work on this now so my replies below.

Note: If we want feature to require XEN_PV || XEN_PVH || XEN_PVHVM,
since XEN_BACKEND depends on them I think we could just use
XEN_BACKEND as a shorthand. Furthermore if we then wanted something to
be available for both backend and frontend we could use a dependency
on XEN_BACKEND || XEN_FRONTEND. Thoughts?

On Fri, Jan 9, 2015 at 11:02 AM, Konrad Rzeszutek Wilk
 wrote:

On Tue, Dec 16, 2014 at 05:21:05PM +0100, Juergen Gross wrote:

After some feedback for the first draft I'd suggest the following:

Option  Selects Depends
--
XEN
   PCI_XEN(x86)  SWIOTLB_XEN
   XEN_DOM0  XEN_BACKEND XEN_PV(x86) ||
 PCI_XEN(x86)XEN_PVH(x86)
 XEN_ACPI_HOTPLUG_MEMORY XEN_STUB
 XEN_ACPI_HOTPLUG_CPUXEN_STUB
 XEN_MCE_LOG(x86)


and XEN_ACPI_PROCESSOR(x86)


Added.


   XEN_MAX_DOMAIN_MEMORY(x86)


which depends on XEN_PV


Adjusted, but so far that's the only XEN_PV alone-dependent option.
Are you sure? This defines MAX_DOMAIN_PAGES, and is used in
arch/x86/xen/setup.c for xen_get_max_pages(). Can't this be used for
XEN_DOM0?


This option will be replaced by another one once my patches for
supporting >500GB pv-domains are ready.

For now you could let it depend on XEN_HAVE_PVMMU. It is relevant for
domUs as well.


Juergen




   XEN_SAVE_RESTORE(x86)
   XEN_DEBUG_FS
   XEN_WDT


.. which only makes sense in a XEN_DOM0? Perhaps make it part of XEN_DOM0?


Adjusted.


   XEN_BALLOON
 XEN_SELFBALLOONING  XEN_TMEM
 XEN_BALLOON_MEMORY_HOTPLUG
 XEN_SCRUB_PAGES
   XENFS XEN_PRIVCMD
 XEN_COMPAT_XENFS
   XEN_SYS_HYPERVISOR


Available on all? As in if we select CONFIG_XEN this would automatically show up?


I think this could be further compartmentalized. For XEN_BALLOON,
XEN_SELFBALLOONING, XEN_BALLOON_MEMORY_HOTPLUG, and XEN_SCRUB_PAGES we
have:

static int __init balloon_init(void)
{
 if (!xen_domain())
 return -ENODEV;

 pr_info("Initialising balloon driver\n");

 register_balloon(&balloon_dev);

 register_xen_selfballooning(&balloon_dev);

 register_xenstore_notifier(&xenstore_notifier);

 return 0;
}
subsys_initcall(balloon_init);


So as I see it XEN_BALLOON should depend on XEN_PV || XEN_PVH -- not
sure how ballooning would be used on HVM only domains although right
now ballooning would indeed be initialized in such situations, should
it not? If it should not then the above check should be for
xen_pvh_domain() not just xen_domain(). If it should work for hvm
domains too we could perhaps use XEN_BACKEND || XEN_FRONTEND.

As for XENFS we have the same check on init for xen_domain(), we only
have a small difference for two types of cases. If your kernel
supports XEN_DOM0 you also get exposed on the xenfs the xsd_kva and
xsd_port which correspond to the xen_store_evtchn and
xen_store_interface respectively. Does it make sense to expose xenfs
for hvms? If so under this new arrangement perhaps it should depend on
XEN_BACKEND || XEN_FRONTEND ?

XEN_SYS_HYPERVISOR just populates the generic /sys/hypervisor/ and
also extends it with Xen specific information, its initialization also
depends on xen_domain() but I cannot think of a reason to have this
for HVM. Perhaps this should depend on XEN_BACKEND only ?


   XEN_DEV_EVTCHN


Frontends and backends select this?


This registers /dev/xen/evtchn only if we're in xen_domain(). Do we
want this to user visible selectable option and have it depend on
XEN_BACKEND || XEN_FRONTEND ?


   XEN_GNTDEV


Backend should select this?


Based on my review I would think that this should depend on
XEN_BACKEND but be user selectable. I'm hoping Stefano can best decide
this though.


   XEN_GRANT_DEV_ALLOC
   SWIOTLB_XEN


(Make it hidden?)


As for XEN_GRANT_DEV_ALLOC -- if we have XEN_GTDEV as user selectable
its not clear to me why this would not be, and have it depend on
XEN_BACKEND, Stefano?

As for SWIOTLB_XEN -- should that not just depend on XEN_PV && X86 ?
At least drivers/xen/swiotlb-xen.c describes this as code which
provides an IOMMU for Xen PV guests with PCI passthrough.


   XEN_TMEM
   XEN_PRIVCMD


Backend select this?


OK


   XEN_STUB(x86_64)  BROKEN
   XEN_ACPI_PROCESSOR(x86)
   XEN_HAVE_PVMMU
   XEN_EFI(x64)
   XEN_XENBUS_FRONTEND


(make it hidden?)


Well XEN_STUB is broken... and it's useful for CPU / memory hotplug
only. How about making XEN_STUB depend on XEN_BACKEND?

It seems to me that XEN_ACPI_PROCESSOR should also depend on XEN_BACKEND.

XEN_HAVE_PVMMU is only used when XEN_B

Re: [Xen-devel] pvSCSI test

2015-02-06 Thread Juergen Gross

On 02/06/2015 10:32 AM, Kristian Hagsted Rasmussen wrote:

On Wednesday, February 4, 2015 05:41, Juergen Gross  wrote:

To: Kristian Hagsted Rasmussen; Olaf Hering; xen-de...@lists.xensource.com
Subject: Re: pvSCSI test

On 02/03/2015 07:16 PM, Kristian Hagsted Rasmussen wrote:


Hi Olaf and Juergen


I am interested in testing pvSCSI as I have a system where it would be ideal.
I have tried to apply this patch "http://marc.info/?l=xen-devel&m=139885599019457&w=2",
called "libxl: add support for pvscsi, iteration 1", to my xen-4.5 tree.
I am using kernel 3.18.4 with xen-scsiback and xen-scsifront compiled
into the kernel, for both Dom0 and DomU.

   My Ubuntu domU is running fine, however the scsi disk I want to pass through 
does not appear.

in my config I have added the line:
vscsi= ['3:0:0:0,0:0:0:0']
I have also tried with:
vscsi= ['/dev/sdb,0:0:0:0']

but no matter which syntax I use I get the following error in dmesg log:
xen-pvscsi: 3:0:0:0 doesn't exist

I have no errors in the log files for the domU.

Am I missing something in my configuration, perhaps some device hiding like for 
pci pass through?


The upstream pvscsi backend is using the target infrastructure. Until
the tools are aware of this you'll have to configure the device to
pass to a domain in Dom0 e.g. via a script:

#!/bin/bash
# usage:
# mkpvscsi device
# e.g. mkpvscsi /dev/sr0

DEV=$1

gen_uuid()
{
  cat /proc/sys/kernel/random/uuid | \
 awk '{print "naa.6001405" substr($1,1,8) substr($1,10,1);}'
}

TARG=`gen_uuid`
INIT=`gen_uuid`

NODE=`lsscsi | awk '$(NF) == "'$DEV'"
 { print substr($1,2,length($1)-4); }'`
NAME=`echo $NODE | sed 's/:/_/g'`



I used targetcli to create the pscsi target, so
skipped from here


modprobe configfs
mount -t configfs configfs /sys/kernel/config
modprobe xen-scsiback
modprobe target_core_mod


to here


cd /sys/kernel/config/target



and here to


mkdir -p core/pscsi_0/$NAME
echo "$DEV" >core/pscsi_0/$NAME/udev_path


to here


mkdir -p xen-pvscsi/$TARG/tpgt_0
echo "$NODE" >xen-pvscsi/$TARG/tpgt_0/param/alias
echo $INIT >xen-pvscsi/$TARG/tpgt_0/nexus
mkdir xen-pvscsi/$TARG/tpgt_0/lun/lun_0

cd xen-pvscsi/$TARG/tpgt_0/lun/lun_0
ln -s ../../../../../../target/core/pscsi_0/$NAME xen-pvscsi_port




After doing this you can use the xen tools to give the device to a domU.
Please note: this script is untested, I've used a simpler one which
fitted my needs (using targetcli, which isn't available everywhere).
I have tested all single steps of the script above, though.


Thanks for the script. I had some problems with the symlinking when
running the script, so I created the pscsi entry in targetcli instead and
shortened the script to skip all lines concerning the core dir.
This made the script work for me, however I still get the
xen-pvscsi: 3:0:0:0 doesn't exist
error in the dmesg log.

Are you using another patch for the xen-tree than the one I listed above?


I've used xm to get the drivers working, not xl. I wanted to have only
one changed component during tests in order to know which component is
failing. :-)

What are the contents of

/sys/kernel/config/target/xen-pvscsi/$TARG/tpgt_0/param/alias

($TARG replaced by the UUID generated above, of course)?

This should be "3:0:0:0" in your case. That's where the backend is
looking for the match from xenstore.

And the final symlink in the script is required. $NAME can be anything,
but has to match the pscsi name, of course.

Juergen



Re: [Xen-devel] pvSCSI test

2015-02-06 Thread Juergen Gross

On 02/06/2015 03:02 PM, Kristian Hagsted Rasmussen wrote:


On Friday, February 6, 2015 10:57, Juergen Gross  wrote:

To: Kristian Hagsted Rasmussen; Olaf Hering; xen-de...@lists.xensource.com
Subject: Re: pvSCSI test

On 02/06/2015 10:32 AM, Kristian Hagsted Rasmussen wrote:

On Wednesday, February 4, 2015 05:41, Juergen Gross  wrote:

To: Kristian Hagsted Rasmussen; Olaf Hering; xen-de...@lists.xensource.com
Subject: Re: pvSCSI test

On 02/03/2015 07:16 PM, Kristian Hagsted Rasmussen wrote:


Hi Olaf and Juergen


I am interested in testing pvSCSI as I have a system where it would be ideal.
I have tried to apply this patch "http://marc.info/?l=xen-devel&m=139885599019457&w=2",
called "libxl: add support for pvscsi, iteration 1", to my xen-4.5 tree.
I am using kernel 3.18.4 with xen-scsiback and xen-scsifront compiled
into the kernel, for both Dom0 and DomU.


I have by the way changed my kernel config, so that xen-scsiback, 
xen-scsifront, target_core_mod is compiled as modules.

snip



I've used xm to get the drivers working, not xl. I wanted to have only
one changed component during tests in order to know which component is
failing. :-)

What are the contents of

/sys/kernel/config/target/xen-pvscsi/$TARG/tpgt_0/param/alias

($TARG replaced by the UUID generated above, of course)?

This should be "3:0:0:0" in your case. That's where the backend is
looking for the match from xenstore.

And the final symlink in the script is required. $NAME can be anything,
but has to match the pscsi name, of course.


After some more fiddling around, I believe the configuration should be okay. My
/etc/target/xen-pvscsi_start.sh looks like this:

modprobe xen-scsiback
mkdir /sys/kernel/config/target/xen-pvscsi
mkdir -p /sys/kernel/config/target/xen-pvscsi/naa.600140512a981c66/tpgt_0
echo naa.6001405708ab297e > /sys/kernel/config/target/xen-pvscsi/naa.600140512a981c66/tpgt_0/nexus
# xen-pvscsi Target Ports
mkdir -p /sys/kernel/config/target/xen-pvscsi/naa.600140512a981c66/tpgt_0/lun/lun_0
ln -s /sys/kernel/config/target/xen-pvscsi/naa.600140512a981c66/tpgt_0/lun/lun_0/../../../../../../target/core/iblock_0/3_0_0_0 /sys/kernel/config/target/xen-pvscsi/naa.600140512a981c66/tpgt_0/lun/lun_0/xen-pvscsi_port


iblock_0?

You are not using pscsi, but iblock. Is that on purpose? I have tested
pscsi and fileio only.

What does lsscsi tell you after adding the device via targetcli? I
suppose you see a new scsi target you should use instead of 3:0:0:0
(that's what I did in the fileio case).


# Attributes for xen-pvscsi Target Portal Group
# Parameters for xen-pvscsi Target Portal Group
echo "3:0:0:0" > /sys/kernel/config/target/xen-pvscsi/naa.600140512a981c66/tpgt_0/param/alias

And
cat /sys/kernel/config/target/xen-pvscsi/naa.600140512a981c66/tpgt_0/param/alias
returns 3:0:0:0 as expected.

If I understand this correctly, it is xen-scsiback that reads the target 
configuration from ConfigFS and hence the problem that xen-pvscsi cannot find 
the device has nothing to do with which toolstack is used?


Yes and no. In theory the backend would accept anything from xenstore
which it can find in configfs. The toolstack however will only write
values into the xenstore it believes are valid SCSI devices.


If you are willing to part with the script you used together with targetcli, I
would be more than happy to try that out.


I did my first tests with fileio on a machine I no longer have access
to. After I got it running I changed to my local test machine and pscsi.
Here I did the targetcli stuff manually and verified afterwards that
the single steps I put in my script were working. So the script I gave
you is basically the documentation of the manual setup I used.

Juergen



Re: [Xen-devel] [RESEND Patch V2 1/4] xen: build infrastructure for generating hypercall depending symbols

2015-02-06 Thread Juergen Gross

Hey, x86 maintainers!

Could you please comment?


Juergen

On 01/28/2015 06:11 AM, Juergen Gross wrote:

*Ping*

David wants a comment from the x86 maintainers.


Juergen

On 01/21/2015 08:49 AM, Juergen Gross wrote:

Today there are several places in the kernel which build tables
containing one entry for each possible Xen hypercall. Create an
infrastructure to be able to generate these tables at build time.
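
For illustration: the generated header contains one HYPERCALL(name) line
per hypercall, so a consumer defines the macro, includes the header, and
gets a complete table. A hypothetical user (not part of this patch) might
look like:

	/* hypothetical consumer of the generated xen-hypercalls.h */
	static const char *hypercall_names[] = {
	#define HYPERCALL(x) [__HYPERVISOR_##x] = #x,
	#include <asm/xen-hypercalls.h>
	#undef HYPERCALL
	};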

Based-on-patch-by: Jan Beulich 
Signed-off-by: Juergen Gross 
Reviewed-by: David Vrabel 
---
  arch/x86/syscalls/Makefile |  9 +
  scripts/xen-hypercalls.sh  | 12 
  2 files changed, 21 insertions(+)
  create mode 100644 scripts/xen-hypercalls.sh

diff --git a/arch/x86/syscalls/Makefile b/arch/x86/syscalls/Makefile
index 3323c27..a55abb9 100644
--- a/arch/x86/syscalls/Makefile
+++ b/arch/x86/syscalls/Makefile
@@ -19,6 +19,9 @@ quiet_cmd_syshdr = SYSHDR  $@
  quiet_cmd_systbl = SYSTBL  $@
cmd_systbl = $(CONFIG_SHELL) '$(systbl)' $< $@

+quiet_cmd_hypercalls = HYPERCALLS $@
+  cmd_hypercalls = $(CONFIG_SHELL) '$<' $@ $(filter-out $<,$^)
+
  syshdr_abi_unistd_32 := i386
  $(uapi)/unistd_32.h: $(syscall32) $(syshdr)
  $(call if_changed,syshdr)
@@ -47,10 +50,16 @@ $(out)/syscalls_32.h: $(syscall32) $(systbl)
  $(out)/syscalls_64.h: $(syscall64) $(systbl)
  $(call if_changed,systbl)

+$(out)/xen-hypercalls.h: $(srctree)/scripts/xen-hypercalls.sh
+$(call if_changed,hypercalls)
+
+$(out)/xen-hypercalls.h: $(srctree)/include/xen/interface/xen*.h
+
  uapisyshdr-y+= unistd_32.h unistd_64.h unistd_x32.h
  syshdr-y+= syscalls_32.h
  syshdr-$(CONFIG_X86_64)+= unistd_32_ia32.h unistd_64_x32.h
  syshdr-$(CONFIG_X86_64)+= syscalls_64.h
+syshdr-$(CONFIG_XEN)+= xen-hypercalls.h

  targets+= $(uapisyshdr-y) $(syshdr-y)

diff --git a/scripts/xen-hypercalls.sh b/scripts/xen-hypercalls.sh
new file mode 100644
index 000..676d922
--- /dev/null
+++ b/scripts/xen-hypercalls.sh
@@ -0,0 +1,12 @@
+#!/bin/sh
+out="$1"
+shift
+in="$@"
+
+for i in $in; do
+eval $CPP $LINUXINCLUDE -dD -imacros "$i" -x c /dev/null
+done | \
+awk '$1 == "#define" && $2 ~ /__HYPERVISOR_[a-z][a-z_0-9]*/ { v[$3] = $2 }
+END {   print "/* auto-generated by scripts/xen-hypercall.sh */"
+for (i in v) if (!(v[i] in v))
+print "HYPERCALL("substr(v[i], 14)")"}' | sort -u >$out



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/







Re: [Xen-devel] pvSCSI test

2015-02-08 Thread Juergen Gross

On 02/06/2015 09:33 PM, Kristian Hagsted Rasmussen wrote:


On Friday, February 6, 2015 15:25, Juergen Gross  wrote:

To: Kristian Hagsted Rasmussen; Olaf Hering; xen-de...@lists.xensource.com
Subject: Re: [Xen-devel] pvSCSI test




After some more fiddling around, I believe the configuration should be okay. My 
/etc/target/xen-pvscsi_start.sh look like this:

modprobe xen-scsiback
mkdir /sys/kernel/config/target/xen-pvscsi
mkdir -p /sys/kernel/config/target/xen-pvscsi/naa.600140512a981c66/tpgt_0
echo naa.6001405708ab297e > 
/sys/kernel/config/target/xen-pvscsi/naa.600140512a981c66/tpgt_0/nexus
 xen-pvscsi Target Ports
mkdir -p 
/sys/kernel/config/target/xen-pvscsi/naa.600140512a981c66/tpgt_0/lun/lun_0
ln -s 
/sys/kernel/config/target/xen-pvscsi/naa.600140512a981c66/tpgt_0/lun/lun_0/../../../../../../target/core/iblock_0/3_0_0_0
 
/sys/kernel/config/target/xen-pvscsi/naa.600140512a981c66/tpgt_0/lun/lun_0/xen-pvscsi_port


iblock_0?


Sorry, the result is the same if I use pscsi; I just tried iblock as it was a
hard disk I tried to pass through, and the LIO documentation advises against
using pscsi for hard drives.

I just now tried to pass through a Blu-ray drive on another machine, however I
get the exact same error. What I noticed, however, was that the error only
appears if the DomU kernel supports xen-scsifront. I hazard a guess that this
is because the backend only tries to connect to the device if the frontend
makes a call for it.

In xenstore I have the following entry for vscsi under /local/domain/0/backend:
vscsi = ""
  3 = ""
   0 = ""
frontend = "/local/domain/3/device/vscsi/0"
frontend-id = "3"
online = "1"
state = "4"
feature-host = "0"
vscsi-devs = ""
 dev-0 = ""
  p-dev = "3:0:0:0"
  v-dev = "0:0:0:0"
  state = "6"
feature-sg-grant = "128"

As I am no expert I am not sure about this, but shouldn't there be a path to
the device, more than just the p-dev statement?


No, that's okay. The connection between p-dev and the drive is done
via the target infrastructure.

Something seems to be wrong with your link in configfs: the target
seems not to be active. Could you please check the link to be correct?
Please check whether the pscsi (or iblock) entry is active. This can
be done via the "ls" command in targetcli for example.

When I tested the pscsi entry in configfs switched to "active" when I
linked the xen-pvscsi entry to it.


Do I have to manually add the device to xenstore?


I never did it. :-)


Juergen



If you do not feel like answering more of my questions please feel free to say
so, I am just interested in this work and really look forward to its inclusion
in xen.

/Kristian


You are not using pscsi, but iblock. Is that on purpose? I have tested
pscsi and fileio only.

What does lsscsi tell you after adding the device via targetcli? I
suppose you see a new scsi target you should use instead of 3:0:0:0
(that's what I did in the fileio case).


 Attributes for xen-pvscsi Target Portal Group
 Parameters for xen-pvscsi Target Portal Group
echo "3:0:0:0" > 
/sys/kernel/config/target/xen-pvscsi/naa.600140512a981c66/tpgt_0/param/alias

And
cat /sys/kernel/config/target/xen-pvscsi/naa.600140512a981c66/tpgt_0/param/alias
returns 3:0:0:0 as expected.

If I understand this correctly, it is xen-scsiback that reads the target 
configuration from ConfigFS and hence the problem that xen-pvscsi cannot find 
the device has nothing to do with which toolstack is used?


Yes and no. In theory the backend would accept anything from xenstore
which it can find in configfs. The toolstack however will only write
values into the xenstore it believes are valid SCSI devices.


If you are willing to part with the script you used together with targetcli, I 
would be more then happy to try that out.


I did my first tests with fileio on a machine I have no longer access
to. after I got it running I changed to my local test machine and pscsi.
Here I did the targetcli stuff manually and verified afterwards that
the single steps I put in my script were working. So the script I gave
you is basically the documentation of the manual setup I used.

Juergen






Re: [Xen-devel] Xen's Linux kernel config options v3

2015-02-08 Thread Juergen Gross

On 02/07/2015 12:44 AM, Luis R. Rodriguez wrote:

This is a third respin of a design proposal for a rework of the Xen-related
config options in the Linux kernel. The first two proposals came from Juergen;
I'm taking on the work now as some other work I am doing is related to this.
This third iteration addresses the feedback given on Juergen's last v2
proposal. Let me know if there are any questions or any further feedback before
we start addressing the changes.

Reasons to consider a cleanup / reorganizing of the kconfig options:

- Everything depends on Xen but that's obviously not right. For instance
   we want to be able to build Xen frontend drivers for HVM domains without
   the need for choosing a pvops kernel: currently the frontend drivers need
   Xen configured which depends on PARAVIRT.
- XEN should not depend on PAE, we can have HVM guests without PAE.
- Some features are available for x86 only, in spite of these not being
   architecture specific, e.g. XEN_DEBUG_FS
- Be able to build a Dom0 using XEN_PVH(x86) without having to configure
   XEN_HAVE_PVMMU(x86).

Current Xen related config options are as follows:

Legend:

- The first column are the Xen config options. Indentation in this
   column reflects the dependency between those options (e.g.
   XEN_SCSI_BACKEND depends on XEN_BACKEND, which in turn depends on
   XEN_DOM0, which depends on XEN).
- The second column shows the options which are selected automatically
   if the corresponding option in the first column is configured.
- The last column shows additional dependencies which can't be shown via
   the indentation level of the first column.
- Some options or dependencies are architecture specific, they are
   listed with the architecture requirements in parens (e.g. XEN_TMEM(x86)
   which is available for x86 only).
- Non-Xen options are mentioned only if they seem to be really relevant,
   like e.g. PARAVIRT or BROKEN.
- All listed options are user selectable; options which are not user selectable
   but automatic are prefixed with a '*' on the left hand side for emphasis

Option  Selects Depends
-
XEN PARAVIRT_CLOCK(x86) PARAVIRT(x86)
 XEN_HAVE_PVMMU(x86)
 SWIOTLB_XEN(arm,arm64)
   PCI_XEN(x86)  SWIOTLB_XEN
   XEN_DOM0  PCI_XEN(x86)
 XEN_BACKEND
   XEN_BLKDEV_BACKEND
   XEN_PCIDEV_BACKEND(x86)
   XEN_SCSI_BACKEND
   XEN_NETDEV_BACKEND
 XEN_ACPI_HOTPLUG_MEMORY XEN_STUB
 XEN_ACPI_HOTPLUG_CPUXEN_STUB
 XEN_MCE_LOG(x86)
   XEN_ACPI_PROCESSOR(x86)  ACPI_PROCESSOR
CPU_FREQ
   XEN_PVHVM(x86)
 XEN_PVH(x86)
   XEN_MAX_DOMAIN_MEMORY(x86)
   XEN_SAVE_RESTORE(x86)
   XEN_DEBUG_FS(x86)
   XEN_FBDEV_FRONTENDXEN_XENBUS_FRONTEND
 INPUT_XEN_KBDDEV_FRONTEND
   XEN_BLKDEV_FRONTEND   XEN_XENBUS_FRONTEND
   HVC_XEN
 HVC_XEN_FRONTENDXEN_XENBUS_FRONTEND
   TCG_XEN   XEN_XENBUS_FRONTEND
   XEN_PCIDEV_FRONTEND   PCI_XEN
 XEN_XENBUS_FRONTEND
   XEN_SCSI_FRONTEND XEN_XENBUS_FRONTEND
   INPUT_XEN_KBDDEV_FRONTEND XEN_XENBUS_FRONTEND
   XEN_WDT
   XEN_BALLOON
 XEN_SELFBALLOONING  XEN_TMEM
 XEN_BALLOON_MEMORY_HOTPLUG
 XEN_SCRUB_PAGES
   XEN_DEV_EVTCHN
   XENFS XEN_PRIVCMD
 XEN_COMPAT_XENFS
   XEN_SYS_HYPERVISOR
   XEN_XENBUS_FRONTEND
   XEN_GNTDEV
   XEN_GRANT_DEV_ALLOC
   SWIOTLB_XEN
   XEN_TMEM(x86)
   XEN_PRIVCMD
   XEN_STUB(x86_64)  BROKEN
   XEN_ACPI_PROCESSOR(x86)
   XEN_HAVE_PVMMU
   XEN_EFI(x64)
   XEN_NETDEV_FRONTEND   XEN_XENBUS_FRONTEND

After some feedback for the first draft I'd suggest the following:

Option  Selects Depends
--
XEN
   XEN_PV(x86)   XEN_HAVE_PVMMU(x86)
 PARAVIRT
 PARAVIRT_CLOCK
   XEN_PVH(x86)  XEN_PVHVM
 PARAVIRT
 PARAVIRT_CLOCK
   XEN_PVHVM(x86)PARAVIRT
 PARAVIRT_CLOCK
   XEN_BACKEND   SWIOTLB_XEN(arm,arm64)  XEN_PV(x86) ||
 XEN_PVH(x86) ||
 XEN_PVHVM(x86)
XEN_TMEM(x86)
XEN_PRIVCMD


Wrong indentation of the above 2 entries.

Juergen


  

Re: [Xen-devel] [PATCH] xen-scsiback: some modifications about code comment

2015-02-08 Thread Juergen Gross

On 02/07/2015 04:31 AM, Rudy Zhang wrote:

From: Tao Chen 

Signed-off-by: Tao Chen 


Are some white space fixes in comments really worth a patch?

Juergen


---
  drivers/xen/xen-scsiback.c | 16 
  1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/drivers/xen/xen-scsiback.c b/drivers/xen/xen-scsiback.c
index 3e32146..59f09fd 100644
--- a/drivers/xen/xen-scsiback.c
+++ b/drivers/xen/xen-scsiback.c
@@ -83,7 +83,7 @@ struct ids_tuple {

  struct v2p_entry {
struct ids_tuple v; /* translate from */
-   struct scsiback_tpg *tpg;   /* translate to   */
+   struct scsiback_tpg *tpg;   /* translate to */
unsigned int lun;
struct kref kref;
struct list_head l;
@@ -525,7 +525,7 @@ static int scsiback_gnttab_data_map(struct vscsiif_request *ring_req,
}
}

-   /* free of (sgl) in fast_flush_area()*/
+   /* free of (sgl) in fast_flush_area() */
pending_req->sgl = kmalloc_array(nr_segments,
sizeof(struct scatterlist), GFP_KERNEL);
if (!pending_req->sgl)
@@ -1084,7 +1084,7 @@ static void scsiback_do_1lun_hotplug(struct vscsibk_info *info, int op,
}
}
break;
-   /*When it is necessary, processing is added here.*/
+   /* When it is necessary, processing is added here. */
default:
break;
}
@@ -1475,8 +1475,8 @@ static u32 scsiback_tpg_get_inst_index(struct se_portal_group *se_tpg)
  static int scsiback_check_stop_free(struct se_cmd *se_cmd)
  {
/*
-* Do not release struct se_cmd's containing a valid TMR
-* pointer.  These will be released directly in scsiback_device_action()
+* Do not release struct se_cmd's containing a valid TMR pointer.
+* These will be released directly in scsiback_device_action()
 * with transport_generic_free_cmd().
 */
if (se_cmd->se_cmd_flags & SCF_SCSI_TMR_CDB)
@@ -1642,7 +1642,7 @@ static int scsiback_make_nexus(struct scsiback_tpg *tpg,
return -ENOMEM;
}
/*
-*  Initialize the struct se_session pointer
+* Initialize the struct se_session pointer
 */
tv_nexus->tvn_se_sess = transport_init_session(TARGET_PROT_NORMAL);
if (IS_ERR(tv_nexus->tvn_se_sess)) {
@@ -1759,7 +1759,7 @@ static ssize_t scsiback_tpg_store_nexus(struct se_portal_group *se_tpg,
unsigned char i_port[VSCSI_NAMELEN], *ptr, *port_ptr;
int ret;
/*
-* Shutdown the active I_T nexus if 'NULL' is passed..
+* Shutdown the active I_T nexus if 'NULL' is passed.
 */
if (!strncmp(page, "NULL", 4)) {
ret = scsiback_drop_nexus(tpg);
@@ -1930,7 +1930,7 @@ static void scsiback_drop_tpg(struct se_portal_group *se_tpg)
 */
scsiback_drop_nexus(tpg);
/*
-* Deregister the se_tpg from TCM..
+* Deregister the se_tpg from TCM.
 */
core_tpg_deregister(se_tpg);
kfree(tpg);






Re: [Xen-devel] Crash in acpi_ps_peek_opcode when booting kernel 3.19 as Xen dom0

2015-02-09 Thread Juergen Gross

On 02/09/2015 02:33 PM, Stefan Bader wrote:

On 09.02.2015 14:07, Stefan Bader wrote:

On 05.02.2015 20:36, Konrad Rzeszutek Wilk wrote:

On Thu, Feb 05, 2015 at 03:33:02PM +0100, Stefan Bader wrote:

While experimenting/testing various kernel versions I discovered that trying to
boot a Haswell based host will always crash when booting as Xen dom0
(Xen-4.4.1). The same crash happens since v3.19-rc1 and still does happen with
v3.19-rc7. A bare metal boot has no issues, and an Opteron based host also has
no issues (dom0 and bare metal).
Could it be a table that the other host does not have, and since it's only
happening in dom0, maybe some cpu capability that needs to be passed on?


Usually it means that the ACPI AML code is trying to do something with
the IOAPIC or something which is not accessible.

But this on the other hand looks to be trying to execute some AML code
that is unknown. Any chance you can disassemble it and perhaps also
run with acpi debug options on to figure out where it blows up?


The weird thing here is that bare-metal on the same machine does work. And
previous kernels did work as well. So I think we can assume the ACPI tables are
ok. It could even be a red herring. Well, it likely is, as booting with
acpi=off hangs instead of crashing.

Since I got no clue, I did what we always do when we are dumbfounded: I went
ahead and bisected 3.18..3.19-rc1. Unfortunately the very last kernel I built
was something in between good and bad. Good as it did not crash exactly, but
bad as it did not come up in a usable state. So I would not be sure the
claimed-to-be offending commit is right. Could be one in the range of:

G  * xen: use common page allocation function in p2m.c
* xen: Delay remapping memory of pv-domain
g  * xen: Delay m2p_override initialization
-> * xen: Delay invalidating extra memory
B  * x86: Introduce function to get pmd entry pointer

(G) really good, (g) somewhat not bad, (B) bad, (->) claimed first broken.


Oh, since that all sounds related to E820 in some way:

(XEN) Xen-e820 RAM map:
(XEN)   - 0009a400 (usable)
(XEN)  0009a400 - 000a (reserved)
(XEN)  000e - 0010 (reserved)
(XEN)  0010 - 30a48000 (usable)
(XEN)  30a48000 - 30a49000 (reserved)


Hmm, this memory hole is at a rather low address. Could it be some
vital data (one of kernel, page tables, initrd or p2m map) is located
at this address?

This would be a problem similar to the one I ran into when trying to
test on a machine with 1TB of memory, where the p2m map was too big
to fit into contiguous memory.

Could you check the addresses where the hypervisor puts this data for
Dom0?


Juergen


(XEN)  30a49000 - a27f4000 (usable)
(XEN)  a27f4000 - a2ab4000 (reserved)
(XEN)  a2ab4000 - a2fb4000 (ACPI NVS)
(XEN)  a2fb4000 - a2feb000 (ACPI data)
(XEN)  a2feb000 - a300 (usable)
(XEN)  a300 - afa0 (reserved)
(XEN)  e000 - f000 (reserved)
(XEN)  fec0 - fec01000 (reserved)
(XEN)  fed0 - fed04000 (reserved)
(XEN)  fed1 - fed1a000 (reserved)
(XEN)  fed1c000 - fed2 (reserved)
(XEN)  fed84000 - fed85000 (reserved)
(XEN)  fee0 - fee01000 (reserved)
(XEN)  ffc0 - 0001 (reserved)
(XEN)  0001 - 00024e60 (usable)

and how it looks with a 3.18 boot:

[0.00] e820: BIOS-provided physical RAM map:
[0.00] Xen: [mem 0x-0x00099fff] usable
[0.00] Xen: [mem 0x0009a400-0x000f] reserved
[0.00] Xen: [mem 0x0010-0x30a47fff] usable
[0.00] Xen: [mem 0x30a48000-0x30a48fff] reserved
[0.00] Xen: [mem 0x30a49000-0xa27f3fff] usable
[0.00] Xen: [mem 0xa27f4000-0xa2ab3fff] reserved
[0.00] Xen: [mem 0xa2ab4000-0xa2fb3fff] ACPI NVS
[0.00] Xen: [mem 0xa2fb4000-0xa2feafff] ACPI data
[0.00] Xen: [mem 0xa2feb000-0xa2ff] usable
[0.00] Xen: [mem 0xa300-0xaf9f] reserved
[0.00] Xen: [mem 0xe000-0xefff] reserved
[0.00] Xen: [mem 0xfec0-0xfec00fff] reserved
[0.00] Xen: [mem 0xfed0-0xfed03fff] reserved
[0.00] Xen: [mem 0xfed1-0xfed19fff] reserved
[0.00] Xen: [mem 0xfed1c000-0xfed1] reserved
[0.00] Xen: [mem 0xfed84000-0xfed84fff] reserved
[0.00] Xen: [mem 0xfee0-0xfeef] reserved
[0.00] Xen: [mem 0xffc0-0x] reserved
[0.00] Xen: [mem 0x0001-0x0001bd

Re: [Xen-devel] Xen's Linux kernel config options v3

2015-02-09 Thread Juergen Gross

On 02/09/2015 08:52 PM, Luis R. Rodriguez wrote:

On Mon, Feb 09, 2015 at 07:17:23AM +0100, Juergen Gross wrote:

On 02/07/2015 12:44 AM, Luis R. Rodriguez wrote:

After some feedback for the first draft I'd suggest the following:

Option  Selects Depends
--
XEN
XEN_PV(x86)   XEN_HAVE_PVMMU(x86)
  PARAVIRT
  PARAVIRT_CLOCK
XEN_PVH(x86)  XEN_PVHVM
  PARAVIRT
  PARAVIRT_CLOCK
XEN_PVHVM(x86)PARAVIRT
  PARAVIRT_CLOCK
XEN_BACKEND   SWIOTLB_XEN(arm,arm64)  XEN_PV(x86) ||
  XEN_PVH(x86) ||
  XEN_PVHVM(x86)
XEN_TMEM(x86)
XEN_PRIVCMD


Wrong indentation of above 2 entries


I had moved this to select based on Konrad's suggestion that the backend
selects this, but then Jan noted this is not necessarily true given that there
is no connection between these and backend functionality, and I forgot to move
it out again. As such I'll remove both completely from selects -- but it's not
clear to me that XEN_BACKEND should depend on both, as you seem (maybe I
misunderstood) to be implying. Should it?

Do we just want to keep both as is today?

Option  Selects Depends
--
XEN
 XEN_TMEM(x86)  !ARM && !ARM64 (default m if CLEANCACHE || FRONTSWAP)
 XEN_PRIVCMD


This is what I meant. Just let them depend on XEN like before.


Juergen



config XEN_TMEM
 tristate
 depends on !ARM && !ARM64
 default m if (CLEANCACHE || FRONTSWAP)
 help
   Shim to interface in-kernel Transcendent Memory hooks
   (e.g. cleancache and frontswap) to Xen tmem hypercalls.

config XEN_PRIVCMD
 tristate
 depends on XEN
 default m

We can at least remove that explicit 'depends on XEN' as it is already part of
the Kconfig file top level mainmenu, but that's a trivial obvious change.

   Luis







Re: [Xen-devel] [RESEND Patch V2 1/4] xen: build infrastructure for generating hypercall depending symbols

2015-02-12 Thread Juergen Gross

*PING*

David still wants a comment from the x86 maintainers...

Juergen

On 01/21/2015 08:49 AM, Juergen Gross wrote:

Today there are several places in the kernel which build tables
containing one entry for each possible Xen hypercall. Create an
infrastructure to be able to generate these tables at build time.

Based-on-patch-by: Jan Beulich 
Signed-off-by: Juergen Gross 
Reviewed-by: David Vrabel 
---
  arch/x86/syscalls/Makefile |  9 +
  scripts/xen-hypercalls.sh  | 12 
  2 files changed, 21 insertions(+)
  create mode 100644 scripts/xen-hypercalls.sh

diff --git a/arch/x86/syscalls/Makefile b/arch/x86/syscalls/Makefile
index 3323c27..a55abb9 100644
--- a/arch/x86/syscalls/Makefile
+++ b/arch/x86/syscalls/Makefile
@@ -19,6 +19,9 @@ quiet_cmd_syshdr = SYSHDR  $@
  quiet_cmd_systbl = SYSTBL  $@
cmd_systbl = $(CONFIG_SHELL) '$(systbl)' $< $@

+quiet_cmd_hypercalls = HYPERCALLS $@
+  cmd_hypercalls = $(CONFIG_SHELL) '$<' $@ $(filter-out $<,$^)
+
  syshdr_abi_unistd_32 := i386
  $(uapi)/unistd_32.h: $(syscall32) $(syshdr)
$(call if_changed,syshdr)
@@ -47,10 +50,16 @@ $(out)/syscalls_32.h: $(syscall32) $(systbl)
  $(out)/syscalls_64.h: $(syscall64) $(systbl)
$(call if_changed,systbl)

+$(out)/xen-hypercalls.h: $(srctree)/scripts/xen-hypercalls.sh
+   $(call if_changed,hypercalls)
+
+$(out)/xen-hypercalls.h: $(srctree)/include/xen/interface/xen*.h
+
  uapisyshdr-y  += unistd_32.h unistd_64.h unistd_x32.h
  syshdr-y  += syscalls_32.h
  syshdr-$(CONFIG_X86_64)   += unistd_32_ia32.h unistd_64_x32.h
  syshdr-$(CONFIG_X86_64)   += syscalls_64.h
+syshdr-$(CONFIG_XEN)   += xen-hypercalls.h

  targets   += $(uapisyshdr-y) $(syshdr-y)

diff --git a/scripts/xen-hypercalls.sh b/scripts/xen-hypercalls.sh
new file mode 100644
index 000..676d922
--- /dev/null
+++ b/scripts/xen-hypercalls.sh
@@ -0,0 +1,12 @@
+#!/bin/sh
+out="$1"
+shift
+in="$@"
+
+for i in $in; do
+   eval $CPP $LINUXINCLUDE -dD -imacros "$i" -x c /dev/null
+done | \
+awk '$1 == "#define" && $2 ~ /__HYPERVISOR_[a-z][a-z_0-9]*/ { v[$3] = $2 }
+   END {   print "/* auto-generated by scripts/xen-hypercall.sh */"
+   for (i in v) if (!(v[i] in v))
+   print "HYPERCALL("substr(v[i], 14)")"}' | sort -u >$out




___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] pvSCSI test

2015-02-15 Thread Juergen Gross

On 02/12/2015 05:43 PM, Kristian Hagsted Rasmussen wrote:

On Monday, February 9, 2015 07:02, Juergen Gross  wrote:

To: Kristian Hagsted Rasmussen; Olaf Hering; xen-de...@lists.xensource.com
Subject: Re: [Xen-devel] pvSCSI test


snip



No, that's okay. The connection between p-dev and the drive is done
via the target infrastructure.

Something seems to be wrong with your link in configfs: the target
seems not to be active. Could you please check the link to be correct?
Please check whether the pscsi (or iblock) entry is active. This can
be done via the "ls" command in targetcli for example.
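
For example, in the targetcli shell (path assumed):

	/> ls /backstores/pscsi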



In targetcli, ls returns:

o- / ............................................................. [...]
  o- backstores .................................................. [...]
  | o- fileio ..................................... [0 Storage Object]
  | o- iblock ..................................... [0 Storage Object]
  | o- pscsi ...................................... [1 Storage Object]
  | | o- 3:0:0:0 ................................ [/dev/sdb activated]
  | o- rd_dr ...................................... [0 Storage Object]
  | o- rd_mcp ..................................... [0 Storage Object]
  o- ib_srpt ............................................ [0 Targets]
  o- iscsi .............................................. [0 Targets]
  o- loopback ........................................... [0 Targets]
  o- qla2xxx ............................................ [0 Targets]
  o- tcm_fc ............................................. [0 Targets]

And my script for starting xen-pvscsi is this:

modprobe xen-scsiback
mkdir /sys/kernel/config/target/xen-pvscsi
mkdir -p /sys/kernel/config/target/xen-pvscsi/naa.600140512a981c66/tpgt_0
echo naa.6001405708ab297e > /sys/kernel/config/target/xen-pvscsi/naa.600140512a981c66/tpgt_0/nexus
# pvscsi Target Ports
mkdir -p /sys/kernel/config/target/xen-pvscsi/naa.600140512a981c66/tpgt_0/lun/lun_0
ln -s /sys/kernel/config/target/core/pscsi_0/3:0:0:0 /sys/kernel/config/target/xen-pvscsi/naa.600140512a981c66/tpgt_0/lun/lun_0/xen-pvscsi_port
# Attributes for pvscsi Target Portal Group
# Parameters for pvscsi Target Portal Group
echo "3:0:0:0" > /sys/kernel/config/target/xen-pvscsi/naa.600140512a981c66/tpgt_0/param/alias

I hope you can spot my error, as I am a little lost right now.


At least I have spotted one error, but I think I am the one to blame:

I've told you to write "3:0:0:0" to alias. This was wrong. It should
be only "3:0:0". The LUN number is not part of the alias to use.

Sorry for that,

Juergen




When I tested it, the pscsi entry in configfs switched to "active" when I
linked the xen-pvscsi entry to it.


Do I have to manually add the device to xenstore?


I never did it. :-)


Juergen



If you do not feel like answering more of my questions, please feel free to say
so; I am just interested in this work and really look forward to its inclusion
in Xen.

/Kristian


You are not using pscsi, but iblock. Is that on purpose? I have tested
pscsi and fileio only.

What does lsscsi tell you after adding the device via targetcli? I
suppose you see a new scsi target you should use instead of 3:0:0:0
(that's what I did in the fileio case).



I do not see more devices with lsscsi when I add an iBlock device. I also
tested with a fileIO device, which does not show up in lsscsi either. However,
I can get it to show up by making a loopback entry in targetcli; this does not
change the outcome of my domain creation, though.

Best regards Kristian

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel




___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


[Xen-devel] [PATCH V2 1/3] xen: mark pvscsi frontend request consumed only after last read

2015-02-16 Thread Juergen Gross
A request in the ring buffer mustn't be read after it has been marked
as consumed. Otherwise it might already have been reused by the
frontend without violating the ring protocol.

To avoid inconsistencies in the backend only work on a private copy
of the request. This will ensure a malicious guest not being able to
bypass consistency checks of the backend by modifying an active
request.

Signed-off-by: Juergen Gross 
---
 drivers/xen/xen-scsiback.c | 14 ++
 1 file changed, 6 insertions(+), 8 deletions(-)

diff --git a/drivers/xen/xen-scsiback.c b/drivers/xen/xen-scsiback.c
index 61653a0..9faca6a 100644
--- a/drivers/xen/xen-scsiback.c
+++ b/drivers/xen/xen-scsiback.c
@@ -709,12 +709,11 @@ static int prepare_pending_reqs(struct vscsibk_info *info,
 static int scsiback_do_cmd_fn(struct vscsibk_info *info)
 {
struct vscsiif_back_ring *ring = &info->ring;
-   struct vscsiif_request *ring_req;
+   struct vscsiif_request ring_req;
struct vscsibk_pend *pending_req;
RING_IDX rc, rp;
int err, more_to_do;
uint32_t result;
-   uint8_t act;
 
rc = ring->req_cons;
rp = ring->sring->req_prod;
@@ -735,11 +734,10 @@ static int scsiback_do_cmd_fn(struct vscsibk_info *info)
if (!pending_req)
return 1;
 
-   ring_req = RING_GET_REQUEST(ring, rc);
+   ring_req = *RING_GET_REQUEST(ring, rc);
ring->req_cons = ++rc;
 
-   act = ring_req->act;
-   err = prepare_pending_reqs(info, ring_req, pending_req);
+   err = prepare_pending_reqs(info, &ring_req, pending_req);
if (err) {
switch (err) {
case -ENODEV:
@@ -755,9 +753,9 @@ static int scsiback_do_cmd_fn(struct vscsibk_info *info)
return 1;
}
 
-   switch (act) {
+   switch (ring_req.act) {
case VSCSIIF_ACT_SCSI_CDB:
-   if (scsiback_gnttab_data_map(ring_req, pending_req)) {
+   if (scsiback_gnttab_data_map(&ring_req, pending_req)) {
scsiback_fast_flush_area(pending_req);
scsiback_do_resp_with_sense(NULL,
DRIVER_ERROR << 24, 0, pending_req);
@@ -768,7 +766,7 @@ static int scsiback_do_cmd_fn(struct vscsibk_info *info)
break;
case VSCSIIF_ACT_SCSI_ABORT:
scsiback_device_action(pending_req, TMR_ABORT_TASK,
-   ring_req->ref_rqid);
+   ring_req.ref_rqid);
break;
case VSCSIIF_ACT_SCSI_RESET:
scsiback_device_action(pending_req, TMR_LUN_RESET, 0);
-- 
2.1.4


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


[Xen-devel] [PATCH V2 0/3] xen: pvscsi: avoid race, support suspend/resume

2015-02-16 Thread Juergen Gross
In the pvscsi backend, copy the frontend request to ensure it is not
changed by the frontend while it is being processed in the backend.

Support suspend/resume of the domain to be able to access a pvscsi
device in the frontend afterwards.

Changes in V2:
- changed scsiback_do_cmd_fn() as suggested by Jan Beulich
- added adaptation of backend parameters in the frontend after resuming

Juergen Gross (3):
  xen: mark pvscsi frontend request consumed only after last read
  xen: scsiback: add LUN of restored domain
  xen: support suspend/resume in pvscsi frontend

 drivers/scsi/xen-scsifront.c | 214 ---
 drivers/xen/xen-scsiback.c   |  33 ---
 2 files changed, 199 insertions(+), 48 deletions(-)

-- 
2.1.4


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


[Xen-devel] [PATCH V2 2/3] xen: scsiback: add LUN of restored domain

2015-02-16 Thread Juergen Gross
When a xen domain is being restored the LUN state of a pvscsi device
is "Connected" and not "Initialising" as in case of attaching a new
pvscsi LUN.

This must be taken into account when adding a new pvscsi device for
a domain as otherwise the pvscsi LUN won't be connected to the
SCSI target associated with it.

Signed-off-by: Juergen Gross 
---
 drivers/xen/xen-scsiback.c | 19 ++-
 1 file changed, 14 insertions(+), 5 deletions(-)

diff --git a/drivers/xen/xen-scsiback.c b/drivers/xen/xen-scsiback.c
index 9faca6a..9d60176 100644
--- a/drivers/xen/xen-scsiback.c
+++ b/drivers/xen/xen-scsiback.c
@@ -992,7 +992,7 @@ found:
 }
 
 static void scsiback_do_add_lun(struct vscsibk_info *info, const char *state,
-   char *phy, struct ids_tuple *vir)
+   char *phy, struct ids_tuple *vir, int try)
 {
if (!scsiback_add_translation_entry(info, phy, vir)) {
if (xenbus_printf(XBT_NIL, info->dev->nodename, state,
@@ -1000,7 +1000,7 @@ static void scsiback_do_add_lun(struct vscsibk_info *info, const char *state,
pr_err("xen-pvscsi: xenbus_printf error %s\n", state);
scsiback_del_translation_entry(info, vir);
}
-   } else {
+   } else if (!try) {
xenbus_printf(XBT_NIL, info->dev->nodename, state,
  "%d", XenbusStateClosed);
}
@@ -1060,10 +1060,19 @@ static void scsiback_do_1lun_hotplug(struct vscsibk_info *info, int op,
 
switch (op) {
case VSCSIBACK_OP_ADD_OR_DEL_LUN:
-   if (device_state == XenbusStateInitialising)
-   scsiback_do_add_lun(info, state, phy, &vir);
-   if (device_state == XenbusStateClosing)
+   switch (device_state) {
+   case XenbusStateInitialising:
+   scsiback_do_add_lun(info, state, phy, &vir, 0);
+   break;
+   case XenbusStateConnected:
+   scsiback_do_add_lun(info, state, phy, &vir, 1);
+   break;
+   case XenbusStateClosing:
scsiback_do_del_lun(info, state, &vir);
+   break;
+   default:
+   break;
+   }
break;
 
case VSCSIBACK_OP_UPDATEDEV_STATE:
-- 
2.1.4


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


[Xen-devel] [PATCH V2 3/3] xen: support suspend/resume in pvscsi frontend

2015-02-16 Thread Juergen Gross
Up to now the pvscsi frontend hasn't supported domain suspend and
resume. When a domain with an assigned pvscsi device was suspended
and resumed again, it was not able to use the device any more: trying
to do so resulted in hanging processes.

Support suspend and resume of pvscsi devices.

Signed-off-by: Juergen Gross 
---
 drivers/scsi/xen-scsifront.c | 214 ---
 1 file changed, 179 insertions(+), 35 deletions(-)

diff --git a/drivers/scsi/xen-scsifront.c b/drivers/scsi/xen-scsifront.c
index 34199d2..78d9506 100644
--- a/drivers/scsi/xen-scsifront.c
+++ b/drivers/scsi/xen-scsifront.c
@@ -63,6 +63,7 @@
 
 #define VSCSIFRONT_OP_ADD_LUN  1
 #define VSCSIFRONT_OP_DEL_LUN  2
+#define VSCSIFRONT_OP_READD_LUN3
 
 /* Tuning point. */
 #define VSCSIIF_DEFAULT_CMD_PER_LUN 10
@@ -113,8 +114,13 @@ struct vscsifrnt_info {
DECLARE_BITMAP(shadow_free_bitmap, VSCSIIF_MAX_REQS);
struct vscsifrnt_shadow *shadow[VSCSIIF_MAX_REQS];
 
+   /* Following items are protected by the host lock. */
wait_queue_head_t wq_sync;
+   wait_queue_head_t wq_pause;
unsigned int wait_ring_available:1;
+   unsigned int waiting_pause:1;
+   unsigned int pause:1;
+   unsigned callers;
 
char dev_state_path[64];
struct task_struct *curr;
@@ -274,31 +280,31 @@ static void scsifront_sync_cmd_done(struct vscsifrnt_info *info,
wake_up(&shadow->wq_reset);
 }
 
-static int scsifront_cmd_done(struct vscsifrnt_info *info)
+static void scsifront_do_response(struct vscsifrnt_info *info,
+ struct vscsiif_response *ring_rsp)
+{
+   if (WARN(ring_rsp->rqid >= VSCSIIF_MAX_REQS ||
+test_bit(ring_rsp->rqid, info->shadow_free_bitmap),
+"illegal rqid %u returned by backend!\n", ring_rsp->rqid))
+   return;
+
+   if (info->shadow[ring_rsp->rqid]->act == VSCSIIF_ACT_SCSI_CDB)
+   scsifront_cdb_cmd_done(info, ring_rsp);
+   else
+   scsifront_sync_cmd_done(info, ring_rsp);
+}
+
+static int scsifront_ring_drain(struct vscsifrnt_info *info)
 {
struct vscsiif_response *ring_rsp;
RING_IDX i, rp;
int more_to_do = 0;
-   unsigned long flags;
-
-   spin_lock_irqsave(info->host->host_lock, flags);
 
rp = info->ring.sring->rsp_prod;
rmb();  /* ordering required respective to dom0 */
for (i = info->ring.rsp_cons; i != rp; i++) {
-
ring_rsp = RING_GET_RESPONSE(&info->ring, i);
-
-   if (WARN(ring_rsp->rqid >= VSCSIIF_MAX_REQS ||
-test_bit(ring_rsp->rqid, info->shadow_free_bitmap),
-"illegal rqid %u returned by backend!\n",
-ring_rsp->rqid))
-   continue;
-
-   if (info->shadow[ring_rsp->rqid]->act == VSCSIIF_ACT_SCSI_CDB)
-   scsifront_cdb_cmd_done(info, ring_rsp);
-   else
-   scsifront_sync_cmd_done(info, ring_rsp);
+   scsifront_do_response(info, ring_rsp);
}
 
info->ring.rsp_cons = i;
@@ -308,6 +314,18 @@ static int scsifront_cmd_done(struct vscsifrnt_info *info)
else
info->ring.sring->rsp_event = i + 1;
 
+   return more_to_do;
+}
+
+static int scsifront_cmd_done(struct vscsifrnt_info *info)
+{
+   int more_to_do;
+   unsigned long flags;
+
+   spin_lock_irqsave(info->host->host_lock, flags);
+
+   more_to_do = scsifront_ring_drain(info);
+
info->wait_ring_available = 0;
 
spin_unlock_irqrestore(info->host->host_lock, flags);
@@ -328,6 +346,24 @@ static irqreturn_t scsifront_irq_fn(int irq, void *dev_id)
return IRQ_HANDLED;
 }
 
+static void scsifront_finish_all(struct vscsifrnt_info *info)
+{
+   unsigned i;
+   struct vscsiif_response resp;
+
+   scsifront_ring_drain(info);
+
+   for (i = 0; i < VSCSIIF_MAX_REQS; i++) {
+   if (test_bit(i, info->shadow_free_bitmap))
+   continue;
+   resp.rqid = i;
+   resp.sense_len = 0;
+   resp.rslt = DID_RESET << 16;
+   resp.residual_len = 0;
+   scsifront_do_response(info, &resp);
+   }
+}
+
 static int map_data_for_request(struct vscsifrnt_info *info,
struct scsi_cmnd *sc,
struct vscsiif_request *ring_req,
@@ -475,6 +511,27 @@ static struct vscsiif_request *scsifront_command2ring(
return ring_req;
 }
 
+static int scsifront_enter(struct vscsifrnt_info *info)
+{
+   if (info->pause)
+   return 1;
+   info->callers++;
+   return 0;
+}
+
+static void scsifront_return(struct vscsifrnt_info *info)
+{
+   info->call

Re: [Xen-devel] [RFC v1 0/8] xen: kconfig changes

2015-02-16 Thread Juergen Gross

On 02/17/2015 01:25 AM, Luis R. Rodriguez wrote:

On Mon, Feb 16, 2015 at 4:20 PM, Luis R. Rodriguez
 wrote:

As it is per our agreed upon changes we can in theory enable a
XEN_PVHVM system without XEN_PV or XEN_PVH. If this is indeed
desirable this poses an issue at build time


And this also raises the question of whether or not we should make
XEN_PVHVM a user-selectable option; right now it is a def_bool and is
therefore not human selectable. You can implicitly disable it by
disabling PCI, for example. If we want it to be exposed to the
user we can then add some description of what it means, and the
user will then be able to read / select / enable XEN_PV, XEN_PVHVM,
XEN_PVH. Right now they'd only be able to select XEN_PV and/or
XEN_PVH; XEN_PVHVM is implicit.


I think making XEN_PVHVM user selectable is okay.
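
A minimal sketch of what a user-selectable option could look like (prompt
and help text illustrative only, dependencies as in the current def_bool):

	config XEN_PVHVM
		bool "Xen PVHVM guest support"
		default y
		depends on XEN && PCI && X86_LOCAL_APIC
		help
		  Support running as a Xen HVM guest using paravirtualized
		  drivers and interfaces (PVHVM).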

Juergen


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [RFC v1 0/8] xen: kconfig changes

2015-02-16 Thread Juergen Gross

On 02/17/2015 01:20 AM, Luis R. Rodriguez wrote:

On Thu, Feb 12, 2015 at 3:07 AM, David Vrabel  wrote:

On 12/02/15 06:03, Luis R. Rodriguez wrote:

From: "Luis R. Rodriguez" 

Here's the first shot at the Kconfig changes for Xen as discussed
on the mailing list a little while ago [0]. Let me know if you spot
any issues or if you'd like things split differently. I tried to
make things as atomic as possible, without being too ridiculous
about the atomicity of the changes; for instance, the HVC changes
were reasonable to just fold into the other change they touched.

Haven't gone to war with testing the Kconfig changes yet given this
is just the first RFC. If things look good please look for major
issues and let me know.


Can you spin a v2 and make a git branch available, please?  I would like
people to be able to easily try out the changes rather than looking at
the diffs.

If I haven't commented on a specific patch it's because I thought it
looked ok.


Sure thing; before that I should address the issues I have found with
the Kconfig changes and what we need. What I see so far:

1) due to a recursive dependency it seems we should consider having
XEN_DOM0 select SWIOTLB_XEN instead of depending on it. That fixes it:

diff --git a/arch/x86/xen/Kconfig b/arch/x86/xen/Kconfig
index d930574..c25e12b 100644
--- a/arch/x86/xen/Kconfig
+++ b/arch/x86/xen/Kconfig
@@ -14,7 +14,8 @@ config XEN
  config XEN_DOM0
 def_bool y
 select XEN_BACKEND
-   depends on XEN && PCI_XEN && SWIOTLB_XEN
+   select SWIOTLB_XEN
+   depends on XEN && PCI_XEN
 depends on X86_LOCAL_APIC && X86_IO_APIC && ACPI && PCI
 depends on XEN_PV || XEN_PVH


I'm fine with this.


2) due to a recursive dependency it doesn't seem we should have
XEN_FRONTEND select CONFIG_XEN -- with that in place we end up
with:

arch/x86/xen/Kconfig:5:error: recursive dependency detected!
arch/x86/xen/Kconfig:5: symbol XEN is selected by XEN_FRONTEND
drivers/xen/Kconfig:82: symbol XEN_FRONTEND depends on XEN

If we remove the select XEN from XEN_FRONTEND that fixes it. Not sure
what is ideal here though.

XEN_FRONTEND does not depend on XEN but the select seems to imply it.
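
One way out would be to drop the select and keep only an explicit
dependency; a minimal sketch, assuming the XEN_FRONTEND symbol proposed in
this series:

	config XEN_FRONTEND
		bool "Xen frontend driver support"
		depends on XEN
		help
		  Common code needed by the Xen frontend drivers.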

3) The simple memory setup build issue:

As it is per our agreed upon changes we can in theory enable a
XEN_PVHVM system without XEN_PV or XEN_PVH. If this is indeed
desirable this poses an issue at build time at
arch/x86/xen/enlighten.c on xen_start_kernel() here:

 if (xen_feature(XENFEAT_auto_translated_physmap))
 x86_init.resources.memory_setup = xen_auto_xlated_memory_setup;
 else
 x86_init.resources.memory_setup = xen_memory_setup;


If we have neither XEN_PV nor XEN_PVH set, why do we have to build
enlighten.c? It will never be used. Same should apply to several other
files in arch/x86/xen.
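
A hypothetical way to express that in kbuild (XEN_PV_OPS is a made-up
helper symbol, not part of this series) would be:

	config XEN_PV_OPS
		def_bool y
		depends on XEN_PV || XEN_PVH

and then in arch/x86/xen/Makefile:

	obj-$(CONFIG_XEN_PV_OPS) += enlighten.o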

Juergen


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


[Xen-devel] [PATCH 09/13] xen: check for kernel memory conflicting with memory layout

2015-02-17 Thread Juergen Gross
Checks whether the pre-allocated memory of the loaded kernel is in
conflict with the target memory map. If this is the case, just panic
instead of running into problems later.

Signed-off-by: Juergen Gross 
---
 arch/x86/xen/setup.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c
index eb219c1..37a34f9 100644
--- a/arch/x86/xen/setup.c
+++ b/arch/x86/xen/setup.c
@@ -829,6 +829,12 @@ static void __init xen_reserve_xen_mfnlist(void)
 PFN_PHYS(xen_start_info->nr_p2m_frames));
 }
 
+static int __init xen_kernel_mem_conflict(phys_addr_t start, phys_addr_t size)
+{
+   panic("kernel is located at position conflicting with E820 map!\n");
+   return 0;
+}
+
 /**
  * machine_specific_memory_setup - Hook for machine specific memory setup.
  **/
@@ -843,6 +849,10 @@ char * __init xen_memory_setup(void)
int i;
int op;
 
+   xen_add_reserved_area(__pa_symbol(_text),
+ __pa_symbol(__bss_stop) - __pa_symbol(_text),
+ xen_kernel_mem_conflict, 0);
+
xen_reserve_xen_mfnlist();
 
xen_max_pfn = min(MAX_DOMAIN_PAGES, xen_start_info->nr_pages);
-- 
2.1.4


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


[Xen-devel] [PATCH 07/13] xen: find unused contiguous memory area

2015-02-17 Thread Juergen Gross
For being able to relocate pre-allocated data areas like initrd or
p2m list it is mandatory to find a contiguous memory area which is
not yet in use and doesn't conflict with the memory map we want to
be in effect.

In case such an area is found, reserve it at once, as this will be
required in any case.

Signed-off-by: Juergen Gross 
---
 arch/x86/xen/setup.c   | 34 ++
 arch/x86/xen/xen-ops.h |  1 +
 2 files changed, 35 insertions(+)

diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c
index a0af554..9c49d71 100644
--- a/arch/x86/xen/setup.c
+++ b/arch/x86/xen/setup.c
@@ -732,6 +732,40 @@ void __init xen_add_reserved_area(phys_addr_t start, phys_addr_t size,
 }
 
 /*
+ * Find a free area in physical memory not yet reserved and compliant with
+ * E820 map.
+ * Used to relocate pre-allocated areas like initrd or p2m list which are in
+ * conflict with the to be used E820 map.
+ * In case no area is found, return 0. Otherwise return the physical address
+ * of the area which is already reserved for convenience.
+ */
+phys_addr_t __init xen_find_free_area(phys_addr_t size)
+{
+   unsigned mapcnt;
+   phys_addr_t addr, start;
+   struct e820entry *entry = xen_e820_map;
+
+   for (mapcnt = 0; mapcnt < xen_e820_map_entries; mapcnt++, entry++) {
+   if (entry->type != E820_RAM || entry->size < size)
+   continue;
+   start = entry->addr;
+   for (addr = start; addr < start + size; addr += PAGE_SIZE) {
+   if (!memblock_is_reserved(addr))
+   continue;
+   start = addr + PAGE_SIZE;
+   if (start + size > entry->addr + entry->size)
+   break;
+   }
+   if (addr >= start + size) {
+   memblock_reserve(start, size);
+   return start;
+   }
+   }
+
+   return 0;
+}
+
+/*
  * Reserve Xen mfn_list.
  * See comment above "struct start_info" in <xen/interface/xen.h>
  * We tried to make the the memblock_reserve more selective so
diff --git a/arch/x86/xen/xen-ops.h b/arch/x86/xen/xen-ops.h
index fee4f70..8181e01 100644
--- a/arch/x86/xen/xen-ops.h
+++ b/arch/x86/xen/xen-ops.h
@@ -44,6 +44,7 @@ void __init xen_inv_extra_mem(void);
 void __init xen_remap_memory(void);
 void __init xen_add_reserved_area(phys_addr_t start, phys_addr_t size,
int (*func)(phys_addr_t, phys_addr_t), int reserve);
+phys_addr_t __init xen_find_free_area(phys_addr_t size);
 char * __init xen_memory_setup(void);
 char * xen_auto_xlated_memory_setup(void);
 void __init xen_arch_setup(void);
-- 
2.1.4


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


[Xen-devel] [PATCH 11/13] xen: move initrd away from e820 non-ram area

2015-02-17 Thread Juergen Gross
When adapting the dom0 memory layout to that of the host make sure
the initrd isn't moved to another pfn range, as it won't be found
there any more.

The easiest way to accomplish that is by copying the initrd to an
area which is RAM according to the E820 map.

Signed-off-by: Juergen Gross 
---
 arch/x86/xen/enlighten.c | 22 ++
 1 file changed, 22 insertions(+)

diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
index 78a881b..21c82dfd 100644
--- a/arch/x86/xen/enlighten.c
+++ b/arch/x86/xen/enlighten.c
@@ -1530,6 +1530,25 @@ static void __init xen_pvh_early_guest_init(void)
 }
 #endif/* CONFIG_XEN_PVH */
 
+static int __init xen_initrd_mem_conflict(phys_addr_t start, phys_addr_t size)
+{
+   phys_addr_t new;
+
+   new = xen_find_free_area(size);
+   if (!new)
+   panic("initrd is located at position conflicting with E820 
map!\n");
+
+   xen_phys_memcpy(new, start, size);
+   pr_info("initrd moved from [mem %#010llx-%#010llx] to [mem 
%#010llx-%#010llx]\n",
+   start, start + size, new, new + size);
+   memblock_free(start, size);
+
+   boot_params.hdr.ramdisk_image = new;
+   boot_params.ext_ramdisk_image = new >> 32;
+
+   return 1;
+}
+
 /* First C function to be called on Xen boot */
 asmlinkage __visible void __init xen_start_kernel(void)
 {
@@ -1691,6 +1710,9 @@ asmlinkage __visible void __init xen_start_kernel(void)
boot_params.hdr.ramdisk_size = xen_start_info->mod_len;
boot_params.hdr.cmd_line_ptr = __pa(xen_start_info->cmd_line);
 
+   xen_add_reserved_area(initrd_start, xen_start_info->mod_len,
+ xen_initrd_mem_conflict, 0);
+
if (!xen_initial_domain()) {
add_preferred_console("xenboot", 0, NULL);
add_preferred_console("tty", 0, NULL);
-- 
2.1.4


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


[Xen-devel] [PATCH 08/13] xen: add service function to copy physical memory areas

2015-02-17 Thread Juergen Gross
In case a pre-allocated memory area is to be moved in order to avoid
a conflict with the target E820 map we need a way to copy data between
physical addresses.

Add a function doing this via early_memremap().

Signed-off-by: Juergen Gross 
---
 arch/x86/xen/setup.c   | 29 +
 arch/x86/xen/xen-ops.h |  1 +
 2 files changed, 30 insertions(+)

diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c
index 9c49d71..eb219c1 100644
--- a/arch/x86/xen/setup.c
+++ b/arch/x86/xen/setup.c
@@ -766,6 +766,35 @@ phys_addr_t __init xen_find_free_area(phys_addr_t size)
 }
 
 /*
+ * Like memcpy, but with physical addresses for dest and src.
+ */
+void __init xen_phys_memcpy(phys_addr_t dest, phys_addr_t src, phys_addr_t n)
+{
+   phys_addr_t dest_off, src_off, dest_len, src_len, len;
+   void *from, *to;
+
+   while (n) {
+   dest_off = dest & ~PAGE_MASK;
+   src_off = src & ~PAGE_MASK;
+   dest_len = n;
+   if (dest_len > (NR_FIX_BTMAPS << PAGE_SHIFT) - dest_off)
+   dest_len = (NR_FIX_BTMAPS << PAGE_SHIFT) - dest_off;
+   src_len = n;
+   if (src_len > (NR_FIX_BTMAPS << PAGE_SHIFT) - src_off)
+   src_len = (NR_FIX_BTMAPS << PAGE_SHIFT) - src_off;
+   len = min(dest_len, src_len);
+   to = early_memremap(dest - dest_off, dest_len + dest_off);
+   from = early_memremap(src - src_off, src_len + src_off);
+   memcpy(to, from, len);
+   early_iounmap(to, dest_len + dest_off);
+   early_iounmap(from, src_len + src_off);
+   n -= len;
+   dest += len;
+   src += len;
+   }
+}
+
+/*
  * Reserve Xen mfn_list.
  * See comment above "struct start_info" in <xen/interface/xen.h>
  * We tried to make the the memblock_reserve more selective so
diff --git a/arch/x86/xen/xen-ops.h b/arch/x86/xen/xen-ops.h
index 8181e01..9bf9df8 100644
--- a/arch/x86/xen/xen-ops.h
+++ b/arch/x86/xen/xen-ops.h
@@ -45,6 +45,7 @@ void __init xen_remap_memory(void);
 void __init xen_add_reserved_area(phys_addr_t start, phys_addr_t size,
int (*func)(phys_addr_t, phys_addr_t), int reserve);
 phys_addr_t __init xen_find_free_area(phys_addr_t size);
+void __init xen_phys_memcpy(phys_addr_t dest, phys_addr_t src, phys_addr_t n);
 char * __init xen_memory_setup(void);
 char * xen_auto_xlated_memory_setup(void);
 void __init xen_arch_setup(void);
-- 
2.1.4


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


[Xen-devel] [PATCH 01/13] xen: sync with xen header

2015-02-17 Thread Juergen Gross
Use the newest header from the xen tree to get some new structure
layouts.

Signed-off-by: Juergen Gross 
---
 arch/x86/include/asm/xen/interface.h | 96 
 1 file changed, 87 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/xen/interface.h b/arch/x86/include/asm/xen/interface.h
index 3400dba..3b88eea 100644
--- a/arch/x86/include/asm/xen/interface.h
+++ b/arch/x86/include/asm/xen/interface.h
@@ -3,12 +3,38 @@
  *
  * Guest OS interface to x86 Xen.
  *
- * Copyright (c) 2004, K A Fraser
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to
+ * deal in the Software without restriction, including without limitation the
+ * rights to use, copy, modify, merge, publish, distribute, sublicense, and/or
+ * sell copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+ * DEALINGS IN THE SOFTWARE.
+ *
+ * Copyright (c) 2004-2006, K A Fraser
  */
 
 #ifndef _ASM_X86_XEN_INTERFACE_H
 #define _ASM_X86_XEN_INTERFACE_H
 
+/*
+ * XEN_GUEST_HANDLE represents a guest pointer, when passed as a field
+ * in a struct in memory.
+ * XEN_GUEST_HANDLE_PARAM represent a guest pointer, when passed as an
+ * hypercall argument.
+ * XEN_GUEST_HANDLE_PARAM and XEN_GUEST_HANDLE are the same on X86 but
+ * they might not be on other architectures.
+ */
 #ifdef __XEN__
 #define __DEFINE_GUEST_HANDLE(name, type) \
 typedef struct { type *p; } __guest_handle_ ## name
@@ -88,13 +114,16 @@ DEFINE_GUEST_HANDLE(xen_ulong_t);
  * start of the GDT because some stupid OSes export hard-coded selector values
  * in their ABI. These hard-coded values are always near the start of the GDT,
  * so Xen places itself out of the way, at the far end of the GDT.
+ *
+ * NB The LDT is set using the MMUEXT_SET_LDT op of HYPERVISOR_mmuext_op
  */
 #define FIRST_RESERVED_GDT_PAGE  14
 #define FIRST_RESERVED_GDT_BYTE  (FIRST_RESERVED_GDT_PAGE * 4096)
 #define FIRST_RESERVED_GDT_ENTRY (FIRST_RESERVED_GDT_BYTE / 8)
 
 /*
- * Send an array of these to HYPERVISOR_set_trap_table()
+ * Send an array of these to HYPERVISOR_set_trap_table().
+ * Terminate the array with a sentinel entry, with traps[].address==0.
  * The privilege level specifies which modes may enter a trap via a software
  * interrupt. On x86/64, since rings 1 and 2 are unavailable, we allocate
  * privilege levels as follows:
@@ -118,10 +147,41 @@ struct trap_info {
 DEFINE_GUEST_HANDLE_STRUCT(trap_info);
 
 struct arch_shared_info {
-unsigned long max_pfn;  /* max pfn that appears in table */
-/* Frame containing list of mfns containing list of mfns containing p2m. */
-unsigned long pfn_to_mfn_frame_list_list;
-unsigned long nmi_reason;
+   /*
+* Number of valid entries in the p2m table(s) anchored at
+* pfn_to_mfn_frame_list_list and/or p2m_vaddr.
+*/
+   unsigned long max_pfn;
+   /*
+* Frame containing list of mfns containing list of mfns containing p2m.
+* A value of 0 indicates it has not yet been set up, ~0 indicates it
+* has been set to invalid e.g. due to the p2m being too large for the
+* 3-level p2m tree. In this case the linear mapper p2m list anchored
+* at p2m_vaddr is to be used.
+*/
+   xen_pfn_t pfn_to_mfn_frame_list_list;
+   unsigned long nmi_reason;
+   /*
+* Following three fields are valid if p2m_cr3 contains a value
+* different from 0.
+* p2m_cr3 is the root of the address space where p2m_vaddr is valid.
+* p2m_cr3 is in the same format as a cr3 value in the vcpu register
+* state and holds the folded machine frame number (via xen_pfn_to_cr3)
+* of a L3 or L4 page table.
+* p2m_vaddr holds the virtual address of the linear p2m list. All
+* entries in the range [0...max_pfn[ are accessible via this pointer.
+* p2m_generation will be incremented by the guest before and after each
+* change of the mappings of the p2m list. p2m_generation starts at 0
+* and a value with the least significant bit set indicates that a
+* mapping update is in progress. This allows guest external software
+* (e.g. in Dom0) to verif

[Xen-devel] [PATCH 02/13] xen: anchor linear p2m list in shared info structure

2015-02-17 Thread Juergen Gross
The linear p2m list should be anchored in the shared info structure
read by the Xen tools to be able to support 64 bit pv-domains larger
than 512 MB. Additionally the linear p2m list interface includes a
generation count which is changed prior to and after each mapping
change of the p2m list. By reading the generation count the Xen tools can
detect changes of the mappings and re-read the p2m list when necessary.
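
For illustration, a consumer of this interface (e.g. a tool mapping the
p2m list) could use the generation count like a seqlock reader; a rough
sketch, not actual tool code:

	for (;;) {
		unsigned long gen = shared_info->arch.p2m_generation;

		if (gen & 1)
			continue;	/* mapping update in progress, retry */
		rmb();
		/* ... read the p2m entries of interest ... */
		rmb();
		if (gen == shared_info->arch.p2m_generation)
			break;		/* mappings were stable while reading */
	}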

Signed-off-by: Juergen Gross 
---
 arch/x86/xen/p2m.c | 13 +
 1 file changed, 13 insertions(+)

diff --git a/arch/x86/xen/p2m.c b/arch/x86/xen/p2m.c
index f18fd1d..df73cc5 100644
--- a/arch/x86/xen/p2m.c
+++ b/arch/x86/xen/p2m.c
@@ -256,6 +256,10 @@ void xen_setup_mfn_list_list(void)
HYPERVISOR_shared_info->arch.pfn_to_mfn_frame_list_list =
virt_to_mfn(p2m_top_mfn);
HYPERVISOR_shared_info->arch.max_pfn = xen_max_p2m_pfn;
+   HYPERVISOR_shared_info->arch.p2m_generation = 0;
+   HYPERVISOR_shared_info->arch.p2m_vaddr = (unsigned long)xen_p2m_addr;
+   HYPERVISOR_shared_info->arch.p2m_cr3 =
+   xen_pfn_to_cr3(virt_to_mfn(swapper_pg_dir));
 }
 
 /* Set up p2m_top to point to the domain-builder provided p2m pages */
@@ -469,8 +473,10 @@ static pte_t *alloc_p2m_pmd(unsigned long addr, pte_t *pte_pg)
 
ptechk = lookup_address(vaddr, &level);
if (ptechk == pte_pg) {
+   HYPERVISOR_shared_info->arch.p2m_generation++;
set_pmd(pmdp,
__pmd(__pa(pte_newpg[i]) | _KERNPG_TABLE));
+   HYPERVISOR_shared_info->arch.p2m_generation++;
pte_newpg[i] = NULL;
}
 
@@ -568,8 +574,10 @@ static bool alloc_p2m(unsigned long pfn)
spin_lock_irqsave(&p2m_update_lock, flags);
 
if (pte_pfn(*ptep) == p2m_pfn) {
+   HYPERVISOR_shared_info->arch.p2m_generation++;
set_pte(ptep,
pfn_pte(PFN_DOWN(__pa(p2m)), PAGE_KERNEL));
+   HYPERVISOR_shared_info->arch.p2m_generation++;
if (mid_mfn)
mid_mfn[mididx] = virt_to_mfn(p2m);
p2m = NULL;
@@ -621,6 +629,11 @@ bool __set_phys_to_machine(unsigned long pfn, unsigned long mfn)
return true;
}
 
+   /*
+* The interface requires atomic updates on p2m elements.
+* xen_safe_write_ulong() is using __put_user which does an atomic
+* store via asm().
+*/
if (likely(!xen_safe_write_ulong(xen_p2m_addr + pfn, mfn)))
return true;
 
-- 
2.1.4


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


[Xen-devel] [PATCH 05/13] xen: simplify xen_set_identity_and_remap() by using global variables

2015-02-17 Thread Juergen Gross
xen_set_identity_and_remap() is used to prepare remapping of memory
conflicting with the E820 map. It is tracking the pfn where to remap
new memory via a local variable which is passed to a subfunction
which in turn returns the new value for that variable.

Additionally the targeted maximum pfn is passed as a parameter to
sub functions.

Simplify that construct by using just global variables in the
source for that purpose. This will make things simpler when we need
those values later, too.

Signed-off-by: Juergen Gross 
---
 arch/x86/xen/setup.c | 63 +---
 1 file changed, 30 insertions(+), 33 deletions(-)

diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c
index ab6c36e..0dda131 100644
--- a/arch/x86/xen/setup.c
+++ b/arch/x86/xen/setup.c
@@ -56,6 +56,9 @@ static struct {
 } xen_remap_buf __initdata __aligned(PAGE_SIZE);
 static unsigned long xen_remap_mfn __initdata = INVALID_P2M_ENTRY;
 
+static unsigned long xen_remap_pfn;
+static unsigned long xen_max_pfn;
+
 /* 
  * The maximum amount of extra memory compared to the base size.  The
  * main scaling factor is the size of struct page.  At extreme ratios
@@ -223,7 +226,7 @@ static int __init xen_free_mfn(unsigned long mfn)
  * as a fallback if the remapping fails.
  */
 static void __init xen_set_identity_and_release_chunk(unsigned long start_pfn,
-   unsigned long end_pfn, unsigned long nr_pages, unsigned long *released)
+   unsigned long end_pfn, unsigned long *released)
 {
unsigned long pfn, end;
int ret;
@@ -231,7 +234,7 @@ static void __init xen_set_identity_and_release_chunk(unsigned long start_pfn,
WARN_ON(start_pfn > end_pfn);
 
/* Release pages first. */
-   end = min(end_pfn, nr_pages);
+   end = min(end_pfn, xen_max_pfn);
for (pfn = start_pfn; pfn < end; pfn++) {
unsigned long mfn = pfn_to_mfn(pfn);
 
@@ -302,7 +305,7 @@ static void __init xen_update_mem_tables(unsigned long pfn, unsigned long mfn)
  * its callers.
  */
 static void __init xen_do_set_identity_and_remap_chunk(
-unsigned long start_pfn, unsigned long size, unsigned long remap_pfn)
+   unsigned long start_pfn, unsigned long size)
 {
unsigned long buf = (unsigned long)&xen_remap_buf;
unsigned long mfn_save, mfn;
@@ -317,7 +320,7 @@ static void __init xen_do_set_identity_and_remap_chunk(
 
mfn_save = virt_to_mfn(buf);
 
-   for (ident_pfn_iter = start_pfn, remap_pfn_iter = remap_pfn;
+   for (ident_pfn_iter = start_pfn, remap_pfn_iter = xen_remap_pfn;
 ident_pfn_iter < ident_end_pfn;
 ident_pfn_iter += REMAP_SIZE, remap_pfn_iter += REMAP_SIZE) {
chunk = (left < REMAP_SIZE) ? left : REMAP_SIZE;
@@ -350,17 +353,16 @@ static void __init xen_do_set_identity_and_remap_chunk(
  * This function takes a contiguous pfn range that needs to be identity mapped
  * and:
  *
- *  1) Finds a new range of pfns to use to remap based on E820 and remap_pfn.
+ *  1) Finds a new range of pfns to use to remap based on E820 and
+ * xen_remap_pfn.
  *  2) Calls the do_ function to actually do the mapping/remapping work.
  *
  * The goal is to not allocate additional memory but to remap the existing
  * pages. In the case of an error the underlying memory is simply released back
  * to Xen and not remapped.
  */
-static unsigned long __init xen_set_identity_and_remap_chunk(
-   unsigned long start_pfn, unsigned long end_pfn, unsigned long nr_pages,
-   unsigned long remap_pfn, unsigned long *released,
-   unsigned long *remapped)
+static void __init xen_set_identity_and_remap_chunk(unsigned long start_pfn,
+   unsigned long end_pfn, unsigned long *released, unsigned long *remapped)
 {
unsigned long pfn;
unsigned long i = 0;
@@ -373,30 +375,30 @@ static unsigned long __init xen_set_identity_and_remap_chunk(
unsigned long remap_range_size;
 
/* Do not remap pages beyond the current allocation */
-   if (cur_pfn >= nr_pages) {
+   if (cur_pfn >= xen_max_pfn) {
/* Identity map remaining pages */
set_phys_range_identity(cur_pfn, cur_pfn + size);
break;
}
-   if (cur_pfn + size > nr_pages)
-   size = nr_pages - cur_pfn;
+   if (cur_pfn + size > xen_max_pfn)
+   size = xen_max_pfn - cur_pfn;
 
-   remap_range_size = xen_find_pfn_range(&remap_pfn);
+   remap_range_size = xen_find_pfn_range(&xen_remap_pfn);
if (!remap_range_size) {
pr_warning("Unable to find available pfn range, not 
remapping identity pages\n");
xen_set_identity_and_release_chunk(cur_pfn,
-   

[Xen-devel] [PATCH 13/13] xen: allow more than 512 GB of RAM for 64 bit pv-domains

2015-02-17 Thread Juergen Gross
64 bit pv-domains under Xen are limited to 512 GB of RAM today. The
main reason has been the 3 level p2m tree, which was replaced by the
virtual mapped linear p2m list. Parallel to the p2m list which is
being used by the kernel itself there is a 3 level mfn tree for usage
by the Xen tools and eventually for crash dump analysis. For this tree
the linear p2m list can serve as a replacement, too. As the kernel
can't know whether the tools are capable of dealing with the p2m list
instead of the mfn tree, the limit of 512 GB can't be dropped in all
cases.

This patch replaces the hard limit by a kernel parameter which tells
the kernel to obey the 512 GB limit or not. The default is selected by
a configuration parameter which specifies whether the 512 GB limit
should be active per default for dom0 (only crash dump analysis is
affected) and/or for domUs (additionally domain save/restore/migration
are affected).

Memory above the domain limit is returned to the hypervisor instead of
being identity mapped, which was wrong anyways.

The kernel configuration parameter to specify the maximum size of a
domain can be deleted, as it is not relevant any more.

Signed-off-by: Juergen Gross 
---
 Documentation/kernel-parameters.txt |  7 
 arch/x86/include/asm/xen/page.h |  4 ---
 arch/x86/xen/Kconfig| 31 +++-
 arch/x86/xen/p2m.c  | 10 +++---
 arch/x86/xen/setup.c| 72 ++---
 5 files changed, 93 insertions(+), 31 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index a89e326..7bf6342 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -3959,6 +3959,13 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
plus one apbt timer for broadcast timer.
x86_intel_mid_timer=apbt_only | lapic_and_apbt
 
+   xen_512gb_limit [KNL,X86-64,XEN]
+   Restricts the kernel running paravirtualized under Xen
+   to use only up to 512 GB of RAM. The reason to do so is
+   crash analysis tools and Xen tools for doing domain
+   save/restore/migration must be enabled to handle larger
+   domains.
+
xen_emul_unplug=[HW,X86,XEN]
Unplug Xen emulated devices
Format: [unplug0,][unplug1]
diff --git a/arch/x86/include/asm/xen/page.h b/arch/x86/include/asm/xen/page.h
index 358dcd3..18a11f2 100644
--- a/arch/x86/include/asm/xen/page.h
+++ b/arch/x86/include/asm/xen/page.h
@@ -35,10 +35,6 @@ typedef struct xpaddr {
 #define FOREIGN_FRAME(m)   ((m) | FOREIGN_FRAME_BIT)
 #define IDENTITY_FRAME(m)  ((m) | IDENTITY_FRAME_BIT)
 
-/* Maximum amount of memory we can handle in a domain in pages */
-#define MAX_DOMAIN_PAGES   \
-((unsigned long)((u64)CONFIG_XEN_MAX_DOMAIN_MEMORY * 1024 * 1024 * 1024 / PAGE_SIZE))
-
 extern unsigned long *machine_to_phys_mapping;
 extern unsigned long  machine_to_phys_nr;
 extern unsigned long *xen_p2m_addr;
diff --git a/arch/x86/xen/Kconfig b/arch/x86/xen/Kconfig
index e88fda8..b61a15e 100644
--- a/arch/x86/xen/Kconfig
+++ b/arch/x86/xen/Kconfig
@@ -23,14 +23,29 @@ config XEN_PVHVM
def_bool y
depends on XEN && PCI && X86_LOCAL_APIC
 
-config XEN_MAX_DOMAIN_MEMORY
-   int
-   default 500 if X86_64
-   default 64 if X86_32
-   depends on XEN
-   help
- This only affects the sizing of some bss arrays, the unused
- portions of which are freed.
+if X86_64
+choice
+   prompt "Support pv-domains larger than 512GB"
+   default XEN_512GB_NONE
+   help
+ Support paravirtualized domains with more than 512GB of RAM.
+
+ The Xen tools and crash dump analysis tools might not support
+ pv-domains with more than 512 GB of RAM. This option controls the
+ default setting of the kernel to use only up to 512 GB or more.
+ It is always possible to change the default via specifying the
+ boot parameter "xen_512gb_limit".
+
+   config XEN_512GB_NONE
+   bool "neither dom0 nor domUs can be larger than 512GB"
+   config XEN_512GB_DOM0
+   bool "dom0 can be larger than 512GB, domUs not"
+   config XEN_512GB_DOMU
+   bool "domUs can be larger than 512GB, dom0 not"
+   config XEN_512GB_ALL
+   bool "dom0 and domUs can be larger than 512GB"
+endchoice
+endif
 
 config XEN_SAVE_RESTORE
bool
diff --git a/arch/x86/xen/p2m.c b/arch/x86/xen/p2m.c
index df73cc5..12a1e98 100644
--- a/arch/x86/xen/p2m.c
+++ b/arch/x86/xen/p2m.c
@@ -502,7 +502,7 @@ static pte_t *alloc_p2m_pmd(unsigned long addr, pte_t *pte

[Xen-devel] [PATCH 12/13] xen: if p2m list located in to be remapped region delay remapping

2015-02-17 Thread Juergen Gross
When adapting the memory layout of dom0 to that of the host, care must
be taken not to remap the initial p2m list provided by the hypervisor.

If the p2m map is detected to be in a region which is going to be
remapped, delay the remapping of that area. Not doing so can either
crash the system very early, or lead to clobbered data as the target
memory area of the remap operation will no longer be reserved.

Signed-off-by: Juergen Gross 
---
 arch/x86/xen/setup.c | 26 +-
 1 file changed, 21 insertions(+), 5 deletions(-)

diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c
index 37a34f9..84a6473 100644
--- a/arch/x86/xen/setup.c
+++ b/arch/x86/xen/setup.c
@@ -794,6 +794,20 @@ void __init xen_phys_memcpy(phys_addr_t dest, phys_addr_t src, phys_addr_t n)
}
 }
 
+#ifdef CONFIG_X86_64
+static int __init xen_p2m_conflict(phys_addr_t start, phys_addr_t size)
+{
+   /* Delay invalidating memory. */
+   return 0;
+}
+#else
+static int __init xen_p2m_conflict(phys_addr_t start, phys_addr_t size)
+{
+   panic("p2m list is located at position conflicting with E820 map!\n");
+   return 0;
+}
+#endif
+
 /*
  * Reserve Xen mfn_list.
  * See comment above "struct start_info" in <xen/interface/xen.h>
@@ -819,14 +833,16 @@ void __init xen_phys_memcpy(phys_addr_t dest, phys_addr_t src, phys_addr_t n)
 static void __init xen_reserve_xen_mfnlist(void)
 {
if (xen_start_info->mfn_list >= __START_KERNEL_map) {
-   memblock_reserve(__pa(xen_start_info->mfn_list),
-xen_start_info->pt_base -
-xen_start_info->mfn_list);
+   xen_add_reserved_area(__pa(xen_start_info->mfn_list),
+ xen_start_info->pt_base -
+ xen_start_info->mfn_list,
+ xen_p2m_conflict, 1);
return;
}
 
-   memblock_reserve(PFN_PHYS(xen_start_info->first_p2m_pfn),
-PFN_PHYS(xen_start_info->nr_p2m_frames));
+   xen_add_reserved_area(PFN_PHYS(xen_start_info->first_p2m_pfn),
+ PFN_PHYS(xen_start_info->nr_p2m_frames),
+ xen_p2m_conflict, 1);
 }
 
 static int __init xen_kernel_mem_conflict(phys_addr_t start, phys_addr_t size)
-- 
2.1.4


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


[Xen-devel] [PATCH 03/13] xen: eliminate scalability issues from initial mapping setup

2015-02-17 Thread Juergen Gross
Direct Xen to place the initial P->M table outside of the initial
mapping, as otherwise the 1G (implementation) / 2G (theoretical)
restriction on the size of the initial mapping limits the amount
of memory a domain can be handed initially.

As the initial P->M table is copied rather early during boot to
domain private memory and its initial virtual mapping is dropped,
the easiest way to avoid virtual address conflicts with other
addresses in the kernel is to use a user address area for the
virtual address of the initial P->M table. This allows us to just
throw away the page tables of the initial mapping after the copy
without having to care about address invalidation.

It should be noted that this patch won't enable a pv-domain to USE
more than 512 GB of RAM. It just enables it to be started with a
P->M table covering more memory. This is especially important for
being able to boot a Dom0 on a system with more than 512 GB memory.

Signed-off-by: Juergen Gross 
Based-on-patch-by: Jan Beulich 
---
 arch/x86/xen/mmu.c  | 126 
 arch/x86/xen/setup.c|  67 ++---
 arch/x86/xen/xen-head.S |   2 +
 3 files changed, 156 insertions(+), 39 deletions(-)

diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index adca9e2..1ca5197 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -1114,6 +1114,77 @@ static void __init xen_cleanhighmap(unsigned long vaddr,
xen_mc_flush();
 }
 
+/*
+ * Make a page range writeable and free it.
+ */
+static void __init xen_free_ro_pages(unsigned long paddr, unsigned long size)
+{
+   void *vaddr = __va(paddr);
+   void *vaddr_end = vaddr + size;
+
+   for (; vaddr < vaddr_end; vaddr += PAGE_SIZE)
+   make_lowmem_page_readwrite(vaddr);
+
+   memblock_free(paddr, size);
+}
+
+static void __init xen_cleanmfnmap_free_pgtbl(void *pgtbl)
+{
+   unsigned long pa = __pa(pgtbl) & PHYSICAL_PAGE_MASK;
+
+   ClearPagePinned(virt_to_page(__va(pa)));
+   xen_free_ro_pages(pa, PAGE_SIZE);
+}
+
+/*
+ * Since it is well isolated we can (and since it is perhaps large we should)
+ * also free the page tables mapping the initial P->M table.
+ */
+static void __init xen_cleanmfnmap(unsigned long vaddr)
+{
+   unsigned long va = vaddr & PMD_MASK;
+   unsigned long pa;
+   pgd_t *pgd = pgd_offset_k(va);
+   pud_t *pud_page = pud_offset(pgd, 0);
+   pud_t *pud;
+   pmd_t *pmd;
+   pte_t *pte;
+   unsigned int i;
+
+   set_pgd(pgd, __pgd(0));
+   do {
+   pud = pud_page + pud_index(va);
+   if (pud_none(*pud)) {
+   va += PUD_SIZE;
+   } else if (pud_large(*pud)) {
+   pa = pud_val(*pud) & PHYSICAL_PAGE_MASK;
+   xen_free_ro_pages(pa, PUD_SIZE);
+   va += PUD_SIZE;
+   } else {
+   pmd = pmd_offset(pud, va);
+   if (pmd_large(*pmd)) {
+   pa = pmd_val(*pmd) & PHYSICAL_PAGE_MASK;
+   xen_free_ro_pages(pa, PMD_SIZE);
+   } else if (!pmd_none(*pmd)) {
+   pte = pte_offset_kernel(pmd, va);
+   for (i = 0; i < PTRS_PER_PTE; ++i) {
+   if (pte_none(pte[i]))
+   break;
+   pa = pte_pfn(pte[i]) << PAGE_SHIFT;
+   xen_free_ro_pages(pa, PAGE_SIZE);
+   }
+   xen_cleanmfnmap_free_pgtbl(pte);
+   }
+   va += PMD_SIZE;
+   if (pmd_index(va))
+   continue;
+   xen_cleanmfnmap_free_pgtbl(pmd);
+   }
+
+   } while (pud_index(va) || pmd_index(va));
+   xen_cleanmfnmap_free_pgtbl(pud_page);
+}
+
 static void __init xen_pagetable_p2m_free(void)
 {
unsigned long size;
@@ -1128,18 +1199,25 @@ static void __init xen_pagetable_p2m_free(void)
/* using __ka address and sticking INVALID_P2M_ENTRY! */
memset((void *)xen_start_info->mfn_list, 0xff, size);
 
-   /* We should be in __ka space. */
-   BUG_ON(xen_start_info->mfn_list < __START_KERNEL_map);
addr = xen_start_info->mfn_list;
-   /* We roundup to the PMD, which means that if anybody at this stage is
-* using the __ka address of xen_start_info or xen_start_info->shared_info
-* they are in going to crash. Fortunatly we have already revectored
-* in xen_setup_kernel_pagetable and in xen_setup_shared_info. */
+   /*
+* We could be in __ka space.
+* We roundup to the PMD, which means that if anybody at this stage is
+* u

[Xen-devel] [PATCH 06/13] xen: detect pre-allocated memory interfering with e820 map

2015-02-17 Thread Juergen Gross
Currently, especially for dom0, guest memory with guest pfns not
matching host areas populated with RAM is remapped to areas which
are RAM native as well. This is done to be able to use identity
mappings (pfn == mfn) for I/O areas.

Up to now it is not checked whether the remapped memory is already
in use. Remapping used memory will probably result in data corruption,
as the remapped memory will no longer be reserved. Any memory
allocation after the remap can claim that memory.

Add an infrastructure to check for conflicts of reserved memory areas
and in case of a conflict to react via an area specific function.

This function has 3 options:
- Panic
- Handle the conflict by moving the data to another memory area.
  This is indicated by a return value other than 0.
- Just return 0. This will delay invalidating the conflicting memory
  area to just before doing the remap. This option will be usable for
  cases only where the memory will no longer be needed when the remap
  operation will be started, e.g. for the p2m list, which is already
  copied then.

When doing the remap, check for not remapping a reserved page.

Signed-off-by: Juergen Gross 
---
 arch/x86/xen/setup.c   | 185 +++--
 arch/x86/xen/xen-ops.h |   2 +
 2 files changed, 182 insertions(+), 5 deletions(-)

diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c
index 0dda131..a0af554 100644
--- a/arch/x86/xen/setup.c
+++ b/arch/x86/xen/setup.c
@@ -59,6 +59,20 @@ static unsigned long xen_remap_mfn __initdata = INVALID_P2M_ENTRY;
 static unsigned long xen_remap_pfn;
 static unsigned long xen_max_pfn;
 
+/*
+ * Areas with memblock_reserve()d memory to be checked against final E820 map.
+ * Each area has an associated function to handle conflicts (by either
+ * removing the conflict or by just crashing with an appropriate message).
+ * The array has a fixed size as there are only few areas of interest which are
+ * well known: kernel, page tables, p2m, initrd.
+ */
+#define XEN_N_RESERVED_AREAS   4
+static struct {
+   phys_addr_t start;
+   phys_addr_t size;
+   int (*func)(phys_addr_t start, phys_addr_t size);
+} xen_reserved_area[XEN_N_RESERVED_AREAS] __initdata;
+
 /* 
  * The maximum amount of extra memory compared to the base size.  The
  * main scaling factor is the size of struct page.  At extreme ratios
@@ -365,10 +379,10 @@ static void __init xen_set_identity_and_remap_chunk(unsigned long start_pfn,
unsigned long end_pfn, unsigned long *released, unsigned long *remapped)
 {
unsigned long pfn;
-   unsigned long i = 0;
+   unsigned long i;
unsigned long n = end_pfn - start_pfn;
 
-   while (i < n) {
+   for (i = 0; i < n; ) {
unsigned long cur_pfn = start_pfn + i;
unsigned long left = n - i;
unsigned long size = left;
@@ -411,6 +425,53 @@ static void __init xen_set_identity_and_remap_chunk(unsigned long start_pfn,
(unsigned long)__va(pfn << PAGE_SHIFT),
mfn_pte(pfn, PAGE_KERNEL_IO), 0);
 }
+/* Check to be remapped memory area for conflicts with reserved areas.
+ *
+ * Skip regions known to be reserved which are handled later. For these
+ * regions we have to increase the remapped counter in order to reserve
+ * extra memory space.
+ *
+ * In case a memory page already in use is to be remapped, just BUG().
+ */
+static void __init xen_set_identity_and_remap_chunk_chk(unsigned long start_pfn,
+   unsigned long end_pfn, unsigned long *released, unsigned long *remapped)
+{
+   unsigned long pfn;
+   unsigned long area_start, area_end;
+   unsigned i;
+
+   for (i = 0; i < XEN_N_RESERVED_AREAS; i++) {
+
+   if (!xen_reserved_area[i].size)
+   break;
+
+   area_start = PFN_DOWN(xen_reserved_area[i].start);
+   area_end = PFN_UP(xen_reserved_area[i].start +
+ xen_reserved_area[i].size);
+   if (area_start >= end_pfn || area_end <= start_pfn)
+   continue;
+
+   if (area_start > start_pfn)
+   xen_set_identity_and_remap_chunk(start_pfn, area_start,
+released, remapped);
+
+   if (area_end < end_pfn)
+   xen_set_identity_and_remap_chunk(area_end, end_pfn,
+released, remapped);
+
+   *remapped += min(area_end, end_pfn) -
+   max(area_start, start_pfn);
+
+   return;
+   }
+
+   /* Test for memory already in use */
+   for (pfn = start_pfn; pfn < end_pfn; pfn++)
+   BUG_ON(memblock_is_reserved(PFN_PHYS(pfn)));
+
+   xen_set_identity_and_remap_chunk(start_pfn, end_pfn,
+released, remap

[Xen-devel] [PATCH 10/13] xen: check pre-allocated page tables for conflict with memory map

2015-02-17 Thread Juergen Gross
Check whether the page tables built by the domain builder are at
memory addresses which are in conflict with the target memory map.
If this is the case just panic instead of running into problems
later.

Signed-off-by: Juergen Gross 
---
 arch/x86/xen/mmu.c | 15 ---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index 1ca5197..6641459 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -1863,6 +1863,12 @@ void __init xen_setup_machphys_mapping(void)
 #endif
 }
 
+static int __init xen_pt_memory_conflict(phys_addr_t start, phys_addr_t size)
+{
+   panic("page tables are located at position conflicting with E820 
map!\n");
+   return 0;
+}
+
 #ifdef CONFIG_X86_64
 static void __init convert_pfn_mfn(void *v)
 {
@@ -1998,7 +2004,9 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
check_pt_base(&pt_base, &pt_end, addr[i]);
 
/* Our (by three pages) smaller Xen pagetable that we are using */
-   memblock_reserve(PFN_PHYS(pt_base), (pt_end - pt_base) * PAGE_SIZE);
+   xen_add_reserved_area(PFN_PHYS(pt_base),
+ (pt_end - pt_base) * PAGE_SIZE,
+ xen_pt_memory_conflict, 1);
/* protect xen_start_info */
memblock_reserve(__pa(xen_start_info), PAGE_SIZE);
/* Revector the xen_start_info */
@@ -2074,8 +2082,9 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
  PFN_DOWN(__pa(initial_page_table)));
xen_write_cr3(__pa(initial_page_table));
 
-   memblock_reserve(__pa(xen_start_info->pt_base),
-xen_start_info->nr_pt_frames * PAGE_SIZE);
+   xen_add_reserved_area(__pa(xen_start_info->pt_base),
+ xen_start_info->nr_pt_frames * PAGE_SIZE,
+ xen_pt_memory_conflict, 1);
 }
 #endif /* CONFIG_X86_64 */
 
-- 
2.1.4


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


[Xen-devel] [PATCH 00/13] xen: support pv-domains larger than 512GB

2015-02-17 Thread Juergen Gross
Support 64 bit pv-domains with more than 512GB of memory.

Tested with 64 bit dom0 on machines with 8GB and 1TB and 32 bit dom0 on an
8GB machine. Conflicts between the E820 map and different hypervisor populated
memory areas have been tested via a fake E820 map reserved area on the
8GB machine.

Juergen Gross (13):
  xen: sync with xen header
  xen: anchor linear p2m list in shared info structure
  xen: eliminate scalability issues from initial mapping setup
  xen: move static e820 map to global scope
  xen: simplify xen_set_identity_and_remap() by using global variables
  xen: detect pre-allocated memory interfering with e820 map
  xen: find unused contiguous memory area
  xen: add service function to copy physical memory areas
  xen: check for kernel memory conflicting with memory layout
  xen: check pre-allocated page tables for conflict with memory map
  xen: move initrd away from e820 non-ram area
  xen: if p2m list located in to be remapped region delay remapping
  xen: allow more than 512 GB of RAM for 64 bit pv-domains

 Documentation/kernel-parameters.txt  |   7 +
 arch/x86/include/asm/xen/interface.h |  96 ++-
 arch/x86/include/asm/xen/page.h  |   4 -
 arch/x86/xen/Kconfig |  31 +-
 arch/x86/xen/enlighten.c |  22 ++
 arch/x86/xen/mmu.c   | 141 -
 arch/x86/xen/p2m.c   |  23 +-
 arch/x86/xen/setup.c | 536 ---
 arch/x86/xen/xen-head.S  |   2 +
 arch/x86/xen/xen-ops.h   |   4 +
 10 files changed, 717 insertions(+), 149 deletions(-)

-- 
2.1.4


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


[Xen-devel] [PATCH 04/13] xen: move static e820 map to global scope

2015-02-17 Thread Juergen Gross
Instead of using a function-local static e820 map in xen_memory_setup()
and passing the map as a parameter to various functions in the same
source file, use a map directly accessible by all functions in the file.

Signed-off-by: Juergen Gross 
---
 arch/x86/xen/setup.c | 96 +++-
 1 file changed, 49 insertions(+), 47 deletions(-)

diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c
index adad417..ab6c36e 100644
--- a/arch/x86/xen/setup.c
+++ b/arch/x86/xen/setup.c
@@ -38,6 +38,10 @@ struct xen_memory_region xen_extra_mem[XEN_EXTRA_MEM_MAX_REGIONS] __initdata;
 /* Number of pages released from the initial allocation. */
 unsigned long xen_released_pages;
 
+/* E820 map used during setting up memory. */
+static struct e820entry xen_e820_map[E820MAX] __initdata;
+static u32 xen_e820_map_entries __initdata;
+
 /*
  * Buffer used to remap identity mapped pages. We only need the virtual space.
  * The physical page behind this address is remapped as needed to different
@@ -164,15 +168,13 @@ void __init xen_inv_extra_mem(void)
  * This function updates min_pfn with the pfn found and returns
  * the size of that range or zero if not found.
  */
-static unsigned long __init xen_find_pfn_range(
-   const struct e820entry *list, size_t map_size,
-   unsigned long *min_pfn)
+static unsigned long __init xen_find_pfn_range(unsigned long *min_pfn)
 {
-   const struct e820entry *entry;
+   const struct e820entry *entry = xen_e820_map;
unsigned int i;
unsigned long done = 0;
 
-   for (i = 0, entry = list; i < map_size; i++, entry++) {
+   for (i = 0; i < xen_e820_map_entries; i++, entry++) {
unsigned long s_pfn;
unsigned long e_pfn;
 
@@ -356,9 +358,9 @@ static void __init xen_do_set_identity_and_remap_chunk(
  * to Xen and not remapped.
  */
 static unsigned long __init xen_set_identity_and_remap_chunk(
-const struct e820entry *list, size_t map_size, unsigned long start_pfn,
-   unsigned long end_pfn, unsigned long nr_pages, unsigned long remap_pfn,
-   unsigned long *released, unsigned long *remapped)
+   unsigned long start_pfn, unsigned long end_pfn, unsigned long nr_pages,
+   unsigned long remap_pfn, unsigned long *released,
+   unsigned long *remapped)
 {
unsigned long pfn;
unsigned long i = 0;
@@ -379,8 +381,7 @@ static unsigned long __init xen_set_identity_and_remap_chunk(
if (cur_pfn + size > nr_pages)
size = nr_pages - cur_pfn;
 
-   remap_range_size = xen_find_pfn_range(list, map_size,
- &remap_pfn);
+   remap_range_size = xen_find_pfn_range(&remap_pfn);
if (!remap_range_size) {
pr_warning("Unable to find available pfn range, not 
remapping identity pages\n");
xen_set_identity_and_release_chunk(cur_pfn,
@@ -411,13 +412,12 @@ static unsigned long __init xen_set_identity_and_remap_chunk(
return remap_pfn;
 }
 
-static void __init xen_set_identity_and_remap(
-   const struct e820entry *list, size_t map_size, unsigned long nr_pages,
-   unsigned long *released, unsigned long *remapped)
+static void __init xen_set_identity_and_remap(unsigned long nr_pages,
+   unsigned long *released, unsigned long *remapped)
 {
phys_addr_t start = 0;
unsigned long last_pfn = nr_pages;
-   const struct e820entry *entry;
+   const struct e820entry *entry = xen_e820_map;
unsigned long num_released = 0;
unsigned long num_remapped = 0;
int i;
@@ -433,9 +433,9 @@ static void __init xen_set_identity_and_remap(
 * example) the DMI tables in a reserved region that begins on
 * a non-page boundary.
 */
-   for (i = 0, entry = list; i < map_size; i++, entry++) {
+   for (i = 0; i < xen_e820_map_entries; i++, entry++) {
phys_addr_t end = entry->addr + entry->size;
-   if (entry->type == E820_RAM || i == map_size - 1) {
+   if (entry->type == E820_RAM || i == xen_e820_map_entries - 1) {
unsigned long start_pfn = PFN_DOWN(start);
unsigned long end_pfn = PFN_UP(end);
 
@@ -444,9 +444,9 @@ static void __init xen_set_identity_and_remap(
 
if (start_pfn < end_pfn)
last_pfn = xen_set_identity_and_remap_chunk(
-   list, map_size, start_pfn,
-   end_pfn, nr_pages, last_pfn,
-   &num_released, &num_remapped);
+   start_pfn, end_pfn, nr_pages,
+   last_pfn, &num_released,
+					&num_remapped);

Re: [Xen-devel] [PATCH 13/13] xen: allow more than 512 GB of RAM for 64 bit pv-domains

2015-02-18 Thread Juergen Gross

On 02/18/2015 10:21 AM, Paul Bolle wrote:

On Wed, 2015-02-18 at 07:52 +0100, Juergen Gross wrote:

64 bit pv-domains under Xen are limited to 512 GB of RAM today. The
main reason has been the 3 level p2m tree, which was replaced by the
virtual mapped linear p2m list. Parallel to the p2m list which is
being used by the kernel itself there is a 3 level mfn tree for usage
by the Xen tools and eventually for crash dump analysis. For this tree
the linear p2m list can serve as a replacement, too. As the kernel
can't know whether the tools are capable of dealing with the p2m list
instead of the mfn tree, the limit of 512 GB can't be dropped in all
cases.

This patch replaces the hard limit by a kernel parameter which tells
the kernel to obey the 512 GB limit or not. The default is selected by
a configuration parameter which specifies whether the 512 GB limit
should be active per default for dom0 (only crash dump analysis is
affected) and/or for domUs (additionally domain save/restore/migration
are affected).

Memory above the domain limit is returned to the hypervisor instead of
being identity mapped, which was wrong anyways.

The kernel configuration parameter to specify the maximum size of a
domain can be deleted, as it is not relevant any more.

Signed-off-by: Juergen Gross 
---
  Documentation/kernel-parameters.txt |  7 
  arch/x86/include/asm/xen/page.h |  4 ---
  arch/x86/xen/Kconfig| 31 +++-
  arch/x86/xen/p2m.c  | 10 +++---
  arch/x86/xen/setup.c| 72 ++---
  5 files changed, 93 insertions(+), 31 deletions(-)


[...]


--- a/arch/x86/xen/Kconfig
+++ b/arch/x86/xen/Kconfig
@@ -23,14 +23,29 @@ config XEN_PVHVM
def_bool y
depends on XEN && PCI && X86_LOCAL_APIC

-config XEN_MAX_DOMAIN_MEMORY
-   int
-   default 500 if X86_64
-   default 64 if X86_32
-   depends on XEN
-   help
- This only affects the sizing of some bss arrays, the unused
- portions of which are freed.
+if X86_64


Not
 && XEN
?


The complete directory is made only if CONFIG_XEN is set.




+choice
+   prompt "Support pv-domains larger than 512GB"
+   default XEN_512GB_NONE
+   help
+ Support paravirtualized domains with more than 512GB of RAM.
+
+ The Xen tools and crash dump analysis tools might not support
+ pv-domains with more than 512 GB of RAM. This option controls
+ whether the kernel by default uses only up to 512 GB or allows more.
+ It is always possible to change the default via specifying the
+ boot parameter "xen_512gb_limit".
+
+   config XEN_512GB_NONE
+   bool "neither dom0 nor domUs can be larger than 512GB"
+   config XEN_512GB_DOM0
+   bool "dom0 can be larger than 512GB, domUs not"
+   config XEN_512GB_DOMU
+   bool "domUs can be larger than 512GB, dom0 not"
+   config XEN_512GB_ALL
+   bool "dom0 and domUs can be larger than 512GB"
+endchoice


So there are actually two independent limits, configured through a
choice with four entries. Would using just two separate Kconfig symbols
(XEN_512GB_DOM0 and XEN_512GB_DOMU) without a choice wrapper also work?


Yes.


Because ...


+endif

  config XEN_SAVE_RESTORE
 bool


[...]


diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c
index 84a6473..16d94de 100644
--- a/arch/x86/xen/setup.c
+++ b/arch/x86/xen/setup.c
@@ -32,6 +32,8 @@
  #include "p2m.h"
  #include "mmu.h"

+#define GB(x) ((uint64_t)(x) * 1024 * 1024 * 1024)
+
  /* Amount of extra memory space we add to the e820 ranges */
  struct xen_memory_region xen_extra_mem[XEN_EXTRA_MEM_MAX_REGIONS] __initdata;

@@ -85,6 +87,27 @@ static struct {
   */
  #define EXTRA_MEM_RATIO   (10)

+static bool xen_dom0_512gb_limit __initdata =
+   IS_ENABLED(CONFIG_XEN_512GB_NONE) || IS_ENABLED(CONFIG_XEN_512GB_DOMU);


... then this could be something like:
 static bool xen_dom0_512gb_limit __initdata = 
!IS_ENABLED(CONFIG_XEN_512GB_DOM0);


+static bool xen_domu_512gb_limit __initdata =
+   IS_ENABLED(CONFIG_XEN_512GB_NONE) || IS_ENABLED(CONFIG_XEN_512GB_DOM0);
+


and this likewise:
 static bool xen_domu_512gb_limit __initdata = 
!IS_ENABLED(CONFIG_XEN_512GB_DOMU);

Correct?


Yes.

That's a matter of taste, I think.




+static int __init xen_parse_512gb(char *arg)
+{
+   bool val = false;
+
+   if (!arg)
+   val = true;
+   else if (strtobool(arg, &val))
+   return 1;
+
+   xen_dom0_512gb_limit = val;
+   xen_domu_512gb_limit = val;
+
+   return 0;
+}
+early_param("xen_512gb_limit", xen_parse_512gb);
+
  static void __init xen_add_extra_mem(phys_addr_t start, phys_addr_t size)
  {
int i;


So one can configure these two limits separately.

Re: [Xen-devel] [PATCH 13/13] xen: allow more than 512 GB of RAM for 64 bit pv-domains

2015-02-18 Thread Juergen Gross

On 02/18/2015 10:49 AM, Jan Beulich wrote:

On 18.02.15 at 10:37,  wrote:

On 02/18/2015 10:21 AM, Paul Bolle wrote:

On Wed, 2015-02-18 at 07:52 +0100, Juergen Gross wrote:

--- a/arch/x86/xen/Kconfig
+++ b/arch/x86/xen/Kconfig
@@ -23,14 +23,29 @@ config XEN_PVHVM
def_bool y
depends on XEN && PCI && X86_LOCAL_APIC

-config XEN_MAX_DOMAIN_MEMORY
-   int
-   default 500 if X86_64
-   default 64 if X86_32
-   depends on XEN
-   help
- This only affects the sizing of some bss arrays, the unused
- portions of which are freed.
+if X86_64


Not
  && XEN
?


The complete directory is made only if CONFIG_XEN is set.


But that doesn't mean this file gets used only when XEN is enabled.


Oh, you are right. I seem to have mixed up the directory's make and
Kconfig handling.


I would think though that an eventual "if XEN" should have wider
scope than just this option (i.e. likely almost the entire file).


Indeed.

So I'll either add the XEN dependency for the new option, or do
another patch adding "if XEN" just below the XEN config entry and
removing the XEN dependencies from the rest of the entries.

As Luis is just doing a rework of XEN Kconfig stuff, I think I'll add
the XEN dependency.


Juergen


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [RFC v1 0/8] xen: kconfig changes

2015-02-18 Thread Juergen Gross

On 02/18/2015 11:03 AM, David Vrabel wrote:

On 17/02/15 07:39, Juergen Gross wrote:


If we have neither XEN_PV nor XEN_PVH set, why do we have to build
enlighten.c? It will never be used. Same should apply to several other
files in arch/x86/xen.


Can we limit this series to only Kconfig changes?  I don't really like
scope-creep in patch series.


Are you sure this is possible? XEN will be configured in more cases than
today: this is the result of being able to build pv-drivers for hvm
domains.

BTW: it was you who wanted XEN_PVHVM to imply XEN.

So today the complete directory arch/x86/xen isn't built for non-pv
kernels. Do you really want to change this? I don't think this is
acceptable.


Juergen

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH 02/13] xen: anchor linear p2m list in shared info structure

2015-02-18 Thread Juergen Gross

On 02/18/2015 11:32 AM, David Vrabel wrote:

On 18/02/15 06:51, Juergen Gross wrote:

The linear p2m list should be anchored in the shared info structure


I'm not really sure what you mean by "anchored".


Bad wording? What about:

The virtual address of the linear p2m list should be stored in the
shared info structure.




read by the Xen tools to be able to support 64 bit pv-domains larger
than 512 GB. Additionally the linear p2m list interface includes a
generation count which is changed prior to and after each mapping
change of the p2m list. Reading the generation count the Xen tools can
detect changes of the mappings and re-read the p2m list if necessary.

[...]

--- a/arch/x86/xen/p2m.c
+++ b/arch/x86/xen/p2m.c
@@ -256,6 +256,10 @@ void xen_setup_mfn_list_list(void)
HYPERVISOR_shared_info->arch.pfn_to_mfn_frame_list_list =
virt_to_mfn(p2m_top_mfn);
HYPERVISOR_shared_info->arch.max_pfn = xen_max_p2m_pfn;
+   HYPERVISOR_shared_info->arch.p2m_generation = 0;
+   HYPERVISOR_shared_info->arch.p2m_vaddr = (unsigned long)xen_p2m_addr;
+   HYPERVISOR_shared_info->arch.p2m_cr3 =
+   xen_pfn_to_cr3(virt_to_mfn(swapper_pg_dir));
  }

  /* Set up p2m_top to point to the domain-builder provided p2m pages */
@@ -469,8 +473,10 @@ static pte_t *alloc_p2m_pmd(unsigned long addr, pte_t 
*pte_pg)

ptechk = lookup_address(vaddr, &level);
if (ptechk == pte_pg) {
+   HYPERVISOR_shared_info->arch.p2m_generation++;
set_pmd(pmdp,
__pmd(__pa(pte_newpg[i]) | _KERNPG_TABLE));
+   HYPERVISOR_shared_info->arch.p2m_generation++;


Do these increments of p2m_generation need to be atomic?


Hmm, they are done under lock. I don't think the compiler is allowed to
reorder the writes to p2m_generation across set_pmd().


Juergen

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH 02/13] xen: anchor linear p2m list in shared info structure

2015-02-18 Thread Juergen Gross

On 02/18/2015 11:50 AM, Andrew Cooper wrote:

On 18/02/15 10:42, Juergen Gross wrote:



   /* Set up p2m_top to point to the domain-builder provided p2m
pages */
@@ -469,8 +473,10 @@ static pte_t *alloc_p2m_pmd(unsigned long addr,
pte_t *pte_pg)

   ptechk = lookup_address(vaddr, &level);
   if (ptechk == pte_pg) {
+HYPERVISOR_shared_info->arch.p2m_generation++;
   set_pmd(pmdp,
   __pmd(__pa(pte_newpg[i]) | _KERNPG_TABLE));
+HYPERVISOR_shared_info->arch.p2m_generation++;


Do these increments of p2m_generation need to be atomic?


Hmm, they are done under lock. I don't think the compiler is allowed to
reorder the writes to p2m_generation across set_pmd().


They do need smp_wmb() to guarantee that the increment is visible before
the update occurs, just as the toolstack will need smp_rmb() to read.


Okay, I'll add smp_wmb() before and after calling set_pmd().


Juergen


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH 02/13] xen: anchor linear p2m list in shared info structure

2015-02-18 Thread Juergen Gross

On 02/18/2015 11:54 AM, David Vrabel wrote:

On 18/02/15 10:50, Andrew Cooper wrote:

On 18/02/15 10:42, Juergen Gross wrote:



   /* Set up p2m_top to point to the domain-builder provided p2m
pages */
@@ -469,8 +473,10 @@ static pte_t *alloc_p2m_pmd(unsigned long addr,
pte_t *pte_pg)

   ptechk = lookup_address(vaddr, &level);
   if (ptechk == pte_pg) {
+HYPERVISOR_shared_info->arch.p2m_generation++;
   set_pmd(pmdp,
   __pmd(__pa(pte_newpg[i]) | _KERNPG_TABLE));
+HYPERVISOR_shared_info->arch.p2m_generation++;


Do these increments of p2m_generation need to be atomic?


Hmm, they are done under lock. I don't think the compiler is allowed to
reorder the writes to p2m_generation across set_pmd().


They do need smp_wmb() to guarantee that the increment is visible before
the update occurs, just as the toolstack will need smp_rmb() to read.


smp_wmb() isn't good enough since you need the barrier even on non-smp
-- you need a wmb().


Okay, will do.
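
For illustration, the agreed ordering boils down to the following
sketch; the reader loop is a hypothetical toolstack-side counterpart
and not part of this series:

	/* writer (kernel), under the p2m lock */
	HYPERVISOR_shared_info->arch.p2m_generation++;
	wmb();	/* generation now odd: update in progress */
	set_pmd(pmdp, __pmd(__pa(pte_newpg[i]) | _KERNPG_TABLE));
	wmb();	/* mapping visible before final generation bump */
	HYPERVISOR_shared_info->arch.p2m_generation++;

	/* reader (hypothetical toolstack code; "shared" is the mapped
	 * shared info page): retry on an odd or changed count */
	uint64_t gen;

	do {
		gen = shared->arch.p2m_generation;
		rmb();
		/* ... read p2m mappings ... */
		rmb();
	} while ((gen & 1) || gen != shared->arch.p2m_generation);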

Juergen


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH 13/13] xen: allow more than 512 GB of RAM for 64 bit pv-domains

2015-02-18 Thread Juergen Gross

On 02/18/2015 12:18 PM, David Vrabel wrote:

On 18/02/15 06:52, Juergen Gross wrote:


+if X86_64
+choice
+   prompt "Support pv-domains larger than 512GB"
+   default XEN_512GB_NONE
+   help
+ Support paravirtualized domains with more than 512GB of RAM.
+
+ The Xen tools and crash dump analysis tools might not support
+ pv-domains with more than 512 GB of RAM. This option controls
+ whether the kernel by default uses only up to 512 GB or allows more.
+ It is always possible to change the default via specifying the
+ boot parameter "xen_512gb_limit".
+
+   config XEN_512GB_NONE
+   bool "neither dom0 nor domUs can be larger than 512GB"
+   config XEN_512GB_DOM0
+   bool "dom0 can be larger than 512GB, domUs not"
+   config XEN_512GB_DOMU
+   bool "domUs can be larger than 512GB, dom0 not"
+   config XEN_512GB_ALL
+   bool "dom0 and domUs can be larger than 512GB"
+endchoice
+endif


This configuration option doesn't look useful to me.  Can we get rid of
it with runtime checking?  E.g.,

If dom0, enable >512G.
If domU, enable >512G if requested by command line option /or/ toolstack
indicates that it supports the linear p2m.


How is the toolstack supposed to indicate this?

I'd be more than happy to get rid of that option. For Dom0 you seem to
have changed your mind (you rejected enabling >512GB as default last
year).

Doing some more tests I found the command line option is problematic:
The option seems to be evaluated only after it is needed (I did the
first tests using the config option). Can we get rid of the option
even for domU? Or do I have to pre-scan the command line for the option?
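
(For illustration, such a pre-scan might look like the sketch below; it
assumes xen_start_info->cmd_line is already accessible at that point
and is not part of the posted series:)

	static bool __init xen_512gb_param_present(void)
	{
		/* hypothetical: look for the option before early_param()
		 * processing has run */
		return strstr((const char *)xen_start_info->cmd_line,
			      "xen_512gb_limit") != NULL;
	}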


And

If max_pfn < 512G, populate 3-level p2m /unless/ toolstack indicates it
supports the linear p2m.


What about Dom0?


People using crash analysis tools that need the 3-level p2m can clamp
dom0 memory with the Xen command line option.  FWIW, the tool we use
doesn't need this.


Interesting. Which tool are you using?


Juergen


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel




Re: [Xen-devel] [PATCH 06/13] xen: detect pre-allocated memory interfering with e820 map

2015-02-23 Thread Juergen Gross

On 02/19/2015 07:07 PM, David Vrabel wrote:

On 18/02/2015 06:51, Juergen Gross wrote:

Currently, especially for dom0, guest memory whose guest pfns don't
match host areas populated with RAM is remapped to areas which are
native RAM as well. This is done to be able to use identity
mappings (pfn == mfn) for I/O areas.

Up to now it is not checked whether the remapped memory is already
in use. Remapping used memory will probably result in data corruption,
as the remapped memory will no longer be reserved. Any memory
allocation after the remap can claim that memory.

Add an infrastructure to check for conflicts of reserved memory areas
and in case of a conflict to react via an area specific function.

This function has 3 options:
- Panic
- Handle the conflict by moving the data to another memory area.
   This is indicated by a return value other than 0.
- Just return 0. This will delay invalidating the conflicting memory
   area to just before doing the remap. This option will be usable for
   cases only where the memory will no longer be needed when the remap
   operation will be started, e.g. for the p2m list, which is already
   copied then.

When doing the remap, check for not remapping a reserved page.
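
As an illustration of the handler contract, a hypothetical handler (not
taken from the series) could look like this:

	/* Hypothetical conflict handler: return non-zero after moving the
	 * conflicting data out of the way, return 0 to delay invalidation
	 * until just before the remap, or panic() if no recovery is
	 * possible. */
	static int __init example_memory_conflict(phys_addr_t start,
						  phys_addr_t size)
	{
		/* e.g. copy the data to a safe area here ... */
		return 1;	/* conflict handled by moving the data */
	}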

Signed-off-by: Juergen Gross 
---
  arch/x86/xen/setup.c   | 185 +++--
  arch/x86/xen/xen-ops.h |   2 +
  2 files changed, 182 insertions(+), 5 deletions(-)

diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c
index 0dda131..a0af554 100644
--- a/arch/x86/xen/setup.c
+++ b/arch/x86/xen/setup.c
@@ -59,6 +59,20 @@ static unsigned long xen_remap_mfn __initdata = INVALID_P2M_ENTRY;
  static unsigned long xen_remap_pfn;
  static unsigned long xen_max_pfn;

+/*
+ * Areas with memblock_reserve()d memory to be checked against final E820 map.
+ * Each area has an associated function to handle conflicts (by either
+ * removing the conflict or by just crashing with an appropriate message).
+ * The array has a fixed size as there are only few areas of interest which are
+ * well known: kernel, page tables, p2m, initrd.
+ */
+#define XEN_N_RESERVED_AREAS	4
+static struct {
+	phys_addr_t	start;
+	phys_addr_t	size;
+	int		(*func)(phys_addr_t start, phys_addr_t size);
+} xen_reserved_area[XEN_N_RESERVED_AREAS] __initdata;
+
  /*
   * The maximum amount of extra memory compared to the base size.  The
   * main scaling factor is the size of struct page.  At extreme ratios
@@ -365,10 +379,10 @@ static void __init xen_set_identity_and_remap_chunk(unsigned long start_pfn,
 	unsigned long end_pfn, unsigned long *released, unsigned long *remapped)
 {
 	unsigned long pfn;
-	unsigned long i = 0;
+	unsigned long i;
 	unsigned long n = end_pfn - start_pfn;
 
-	while (i < n) {
+	for (i = 0; i < n; ) {
 		unsigned long cur_pfn = start_pfn + i;
 		unsigned long left = n - i;
 		unsigned long size = left;
@@ -411,6 +425,53 @@ static void __init xen_set_identity_and_remap_chunk(unsigned long start_pfn,
 			(unsigned long)__va(pfn << PAGE_SHIFT),
 			mfn_pte(pfn, PAGE_KERNEL_IO), 0);
 	}
+/* Check to be remapped memory area for conflicts with reserved areas.
+ *
+ * Skip regions known to be reserved which are handled later. For these
+ * regions we have to increase the remapped counter in order to reserve
+ * extra memory space.
+ *
+ * In case a memory page already in use is to be remapped, just BUG().
+ */
+static void __init xen_set_identity_and_remap_chunk_chk(unsigned long start_pfn,
+	unsigned long end_pfn, unsigned long *released, unsigned long *remapped)


...remap_chunk_checked() ?


I just wanted to avoid making the function name even longer. OTOH I
really don't mind using your suggestion. :-)




+{
+	unsigned long pfn;
+	unsigned long area_start, area_end;
+	unsigned i;
+
+	for (i = 0; i < XEN_N_RESERVED_AREAS; i++) {
+
+		if (!xen_reserved_area[i].size)
+			break;
+
+		area_start = PFN_DOWN(xen_reserved_area[i].start);
+		area_end = PFN_UP(xen_reserved_area[i].start +
+				  xen_reserved_area[i].size);
+		if (area_start >= end_pfn || area_end <= start_pfn)
+			continue;
+
+		if (area_start > start_pfn)
+			xen_set_identity_and_remap_chunk(start_pfn, area_start,
+							 released, remapped);
+
+		if (area_end < end_pfn)
+			xen_set_identity_and_remap_chunk(area_end, end_pfn,
+							 released, remapped);
+
+		*remapped += min(area_end, end_pfn) -
+			     max(area_start, start_pfn);
+
+		return;


Why not defer the whole chunk that conflicts?  Or for that matter defer
all this remapping to the last minute.


There are two problems arising from this:

- In the initrd case remapping would be deferred too long: the initrd
  data is still in use when device initialization is running. And we
  really

Re: [Xen-devel] [PATCH 10/13] xen: check pre-allocated page tables for conflict with memory map

2015-02-23 Thread Juergen Gross

On 02/19/2015 06:37 PM, David Vrabel wrote:



On 18/02/2015 06:52, Juergen Gross wrote:

Check whether the page tables built by the domain builder are at
memory addresses which are in conflict with the target memory map.
If this is the case just panic instead of running into problems
later.


Again, what ensures this never actually happens?


Same answer as before: nothing.

Juergen


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH 12/13] xen: if p2m list located in to be remapped region delay remapping

2015-02-23 Thread Juergen Gross

On 02/19/2015 06:44 PM, David Vrabel wrote:

On 18/02/2015 06:52, Juergen Gross wrote:

With adapting the memory layout of dom0 to that of the host care must
be taken not to remap the initial p2m list supported by the hypervisor.


"...supplied by the hypervisor" ?


Yes, of course.




If the p2m map is detected to be in a region which is going to be
remapped, delay the remapping of that area. Not doing so can either
crash the system very early, or lead to clobbered data as the target
memory area of the remap operation will no longer be reserved.


Would it be better to relocate the p2m before remapping memory?  If not,
explain why in the commit message.


Okay, will do.


Juergen

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH v0 RFC 0/2] xl/libxl support for PVUSB

2014-11-10 Thread Juergen Gross

On 11/10/2014 04:01 PM, Konrad Rzeszutek Wilk wrote:

On Mon, Nov 10, 2014 at 01:37:44AM -0700, Chun Yan Liu wrote:

Is there any progress on this work? I didn't see new version after this.
Anyone knows the status?


I believe Olaf and Juergen were looking at this for Xen 4.6?


I'm working on the kernel pvusb drivers.

Juergen



CC-ing them.


Thanks,
Chunyan


On 8/11/2014 at 04:23 AM, in message

<1407702234-22309-1-git-send-email-caobosi...@gmail.com>, Bo Cao
 wrote:

Finally I have a workable version of xl/libxl support for PVUSB. Most of
its commands work properly now, but there are still some problems to be
solved.
Please take a look and give me some advice.

== What has been implemented? ==
I have implemented libxl functions for PVUSB in libxl_usb.c. It mainly
consists of two parts: usbctrl_add/remove/list and usb_add/remove/list,
in which a usbctrl denotes a usb controller into which a usb device can
be plugged. I don't use "ao_dev" in libxl_device_usbctrl_add since we
don't need to execute a hotplug script for a usbctrl, and without
"ao_dev", adding a default usbctrl for a usb device is easier.

For the commands to manipulate usb devices such as "xl usb-attach" and
"xl usb-detach", this patch currently only supports specifying usb
devices by their interface in sysfs. Using this interface, we can read
usb device information through sysfs and bind/unbind usb devices. (The
support for mapping the "lsusb" bus:addr to the sysfs usb interface
will come later.)

== What needs to be done next? ==
There are two main problems to be solved.

1.  PVUSB Options in VM Guest's Configuration File
 The interface in the VM Guest's configuration file to add a usb device
is: "usb=[interface="1-1"]".
But the problem now is that after the default usbctrl is added, the
state of the usbctrl is "2", i.e. "XenbusStateInitWait", waiting for
xen-usbfront to connect. The xen-usbfront in the VM Guest isn't loaded
yet. Therefore, "sysfs_intf_write" will report an error. Does anyone
have a clue how to solve this?

2. sysfs_intf_write
 In the process of "xl usb-attach domid intf=1-1", after writing "1-1"
to the Xenstore entry, we need to bind the controller of this usb
device to the usbback driver so that it can be used by the VM Guest.
For example, for usb device "1-1", its controller interface may be
"1-1:1.0", and we write this value to
"/sys/bus/usb/driver/usbback/bind".
But some devices have two controllers, for example "1-1:1.0" and
"1-1:1.1". I think this means the device has two functions, such as
usbhid and usb-storage. So in this case, do we bind both controllers to
usbback?
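
(For concreteness, a minimal sketch of such a bind helper -- purely
illustrative, not part of this series; error handling is trimmed and
the sysfs path is passed in since the exact driver path may differ:)

	#include <stdio.h>

	/* Write one interface name (e.g. "1-1:1.0") to the usbback
	 * bind file. */
	static int bind_intf_to_usbback(const char *bind_path,
					const char *intf)
	{
		FILE *f = fopen(bind_path, "w");

		if (!f)
			return -1;
		fprintf(f, "%s", intf);
		return fclose(f) ? -1 : 0;
	}

Binding a two-function device would then simply call this helper once
per interface ("1-1:1.0" and "1-1:1.1").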


There may be some errors or bugs in the code. Feel free to tell me.

Cheers,

- Simon

---
CC: George Dunlap 
CC: Ian Jackson 
CC: Ian Campbell 
CC: Pasi Kärkkäinen 
CC: Lars Kurth 



___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel




___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel





___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


[Xen-devel] [PATCH V3 0/8] xen: Switch to virtual mapped linear p2m list

2014-11-10 Thread Juergen Gross
Paravirtualized kernels running on Xen use a three level tree for
translation of guest specific physical addresses to machine global
addresses. This p2m tree is used for construction of page table
entries, so the p2m tree walk is performance critical.

By using a linear virtually mapped p2m list, accesses to p2m elements
can be sped up while also simplifying the code. To achieve this goal
some p2m related initializations have to be performed later in the
boot process, as the final p2m list can be set up only after basic
memory management functions are available.
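
To illustrate the effect on the hot lookup path (a sketch using the
names from the patches below):

	/* before: software walk of the three-level p2m tree */
	mfn = p2m_top[p2m_top_index(pfn)][p2m_mid_index(pfn)][p2m_index(pfn)];

	/* after: plain array index into the virtually mapped linear list */
	mfn = xen_p2m_addr[pfn];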

Changes in V3:
- Carved out (new) patch 1 to make pure code movement more obvious
  as requested by David Vrabel
- New patch 6 introducing __pfn_to_mfn() (taken from patch 7) as
  requested by David Vrabel
- New patch 8 to speed up set_phys_to_machine() as suggested by
  David Vrabel

Changes in V2:
- splitted patch 2 in 4 smaller ones as requested by David Vrabel
- added highmem check when remapping kernel memory as requested by
  David Vrabel

Juergen Gross (8):
  xen: Make functions static
  xen: Delay remapping memory of pv-domain
  xen: Delay m2p_override initialization
  xen: Delay invalidating extra memory
  x86: Introduce function to get pmd entry pointer
  xen: Hide get_phys_to_machine() to be able to tune common path
  xen: switch to linear virtual mapped sparse p2m list
  xen: Speed up set_phys_to_machine() by using read-only mappings

 arch/x86/include/asm/pgtable_types.h |1 +
 arch/x86/include/asm/xen/page.h  |   49 +-
 arch/x86/mm/pageattr.c   |   20 +
 arch/x86/xen/mmu.c   |   38 +-
 arch/x86/xen/p2m.c   | 1315 ++
 arch/x86/xen/setup.c |  460 ++--
 arch/x86/xen/xen-ops.h   |6 +-
 7 files changed, 854 insertions(+), 1035 deletions(-)

-- 
2.1.2


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


[Xen-devel] [PATCH V3 2/8] xen: Delay remapping memory of pv-domain

2014-11-10 Thread Juergen Gross
Early in the boot process the memory layout of a pv-domain is changed
to match the E820 map (either the host one for Dom0 or the Xen one)
regarding placement of RAM and PCI holes. This requires removing memory
pages initially located at positions not suitable for RAM and adding
them later at higher addresses where no restrictions apply.

To be able to operate on the hypervisor supported p2m list until a
virtual mapped linear p2m list can be constructed, remapping must
be delayed until virtual memory management is initialized, as the
initial p2m list can't be extended arbitrarily at physical memory
initialization time due to its fixed structure.

A further advantage is the reduction in complexity and code volume as
we don't have to be careful regarding memory restrictions during p2m
updates.

Signed-off-by: Juergen Gross 
Reviewed-by: David Vrabel 
---
 arch/x86/include/asm/xen/page.h |   1 -
 arch/x86/xen/mmu.c  |   4 +
 arch/x86/xen/p2m.c  | 149 
 arch/x86/xen/setup.c| 385 +++-
 arch/x86/xen/xen-ops.h  |   1 +
 5 files changed, 223 insertions(+), 317 deletions(-)

diff --git a/arch/x86/include/asm/xen/page.h b/arch/x86/include/asm/xen/page.h
index 6c16451..b475297 100644
--- a/arch/x86/include/asm/xen/page.h
+++ b/arch/x86/include/asm/xen/page.h
@@ -44,7 +44,6 @@ extern unsigned long  machine_to_phys_nr;
 
 extern unsigned long get_phys_to_machine(unsigned long pfn);
 extern bool set_phys_to_machine(unsigned long pfn, unsigned long mfn);
-extern bool __init early_set_phys_to_machine(unsigned long pfn, unsigned long mfn);
 extern bool __set_phys_to_machine(unsigned long pfn, unsigned long mfn);
 extern unsigned long set_phys_range_identity(unsigned long pfn_s,
 unsigned long pfn_e);
diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index a8a1a3d..d3e492b 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -1223,6 +1223,10 @@ static void __init xen_pagetable_init(void)
/* Allocate and initialize top and mid mfn levels for p2m structure */
xen_build_mfn_list_list();
 
+   /* Remap memory freed because of conflicts with E820 map */
+   if (!xen_feature(XENFEAT_auto_translated_physmap))
+   xen_remap_memory();
+
xen_setup_shared_info();
xen_post_allocator_init();
 }
diff --git a/arch/x86/xen/p2m.c b/arch/x86/xen/p2m.c
index fa75842..f67f8cf 100644
--- a/arch/x86/xen/p2m.c
+++ b/arch/x86/xen/p2m.c
@@ -164,6 +164,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -204,6 +205,8 @@ RESERVE_BRK(p2m_mid, PAGE_SIZE * (MAX_DOMAIN_PAGES / (P2M_PER_PAGE * P2M_MID_PER_PAGE)));
  */
 RESERVE_BRK(p2m_identity_remap, PAGE_SIZE * 2 * 3 * MAX_REMAP_RANGES);
 
+static int use_brk = 1;
+
 static inline unsigned p2m_top_index(unsigned long pfn)
 {
BUG_ON(pfn >= MAX_P2M_PFN);
@@ -268,6 +271,22 @@ static void p2m_init(unsigned long *p2m)
p2m[i] = INVALID_P2M_ENTRY;
 }
 
+static void * __ref alloc_p2m_page(void)
+{
+   if (unlikely(use_brk))
+   return extend_brk(PAGE_SIZE, PAGE_SIZE);
+
+   if (unlikely(!slab_is_available()))
+   return alloc_bootmem_align(PAGE_SIZE, PAGE_SIZE);
+
+   return (void *)__get_free_page(GFP_KERNEL | __GFP_REPEAT);
+}
+
+static void free_p2m_page(void *p)
+{
+   free_page((unsigned long)p);
+}
+
 /*
  * Build the parallel p2m_top_mfn and p2m_mid_mfn structures
  *
@@ -287,13 +306,13 @@ void __ref xen_build_mfn_list_list(void)
 
/* Pre-initialize p2m_top_mfn to be completely missing */
if (p2m_top_mfn == NULL) {
-   p2m_mid_missing_mfn = alloc_bootmem_align(PAGE_SIZE, PAGE_SIZE);
+   p2m_mid_missing_mfn = alloc_p2m_page();
p2m_mid_mfn_init(p2m_mid_missing_mfn, p2m_missing);
 
-   p2m_top_mfn_p = alloc_bootmem_align(PAGE_SIZE, PAGE_SIZE);
+   p2m_top_mfn_p = alloc_p2m_page();
p2m_top_mfn_p_init(p2m_top_mfn_p);
 
-   p2m_top_mfn = alloc_bootmem_align(PAGE_SIZE, PAGE_SIZE);
+   p2m_top_mfn = alloc_p2m_page();
p2m_top_mfn_init(p2m_top_mfn);
} else {
/* Reinitialise, mfn's all change after migration */
@@ -327,7 +346,7 @@ void __ref xen_build_mfn_list_list(void)
 * missing parts of the mfn tree after
 * runtime.
 */
-   mid_mfn_p = alloc_bootmem_align(PAGE_SIZE, PAGE_SIZE);
+   mid_mfn_p = alloc_p2m_page();
p2m_mid_mfn_init(mid_mfn_p, p2m_missing);
 
p2m_top_mfn_p[topidx] = mid_mfn_p;
@@ -364,17 +383,17 @@ void __init xen_build_dynamic_phys_to_machine(void)
max_pfn = min(MAX_DOMAIN_PAGES, xen_start_info->nr_pages);
xen_max_p2m_pfn = max_pfn;
 
-   p2m_missing = exten

[Xen-devel] [PATCH V3 1/8] xen: Make functions static

2014-11-10 Thread Juergen Gross
Some functions in arch/x86/xen/p2m.c are used locally only. Make them
static. Rearrange the functions in p2m.c to avoid forward declarations.

While at it correct some style issues (long lines, use pr_warn()).

Signed-off-by: Juergen Gross 
---
 arch/x86/include/asm/xen/page.h |   6 -
 arch/x86/xen/p2m.c  | 347 
 2 files changed, 172 insertions(+), 181 deletions(-)

diff --git a/arch/x86/include/asm/xen/page.h b/arch/x86/include/asm/xen/page.h
index c949923..6c16451 100644
--- a/arch/x86/include/asm/xen/page.h
+++ b/arch/x86/include/asm/xen/page.h
@@ -52,15 +52,9 @@ extern unsigned long set_phys_range_identity(unsigned long pfn_s,
 extern int set_foreign_p2m_mapping(struct gnttab_map_grant_ref *map_ops,
   struct gnttab_map_grant_ref *kmap_ops,
   struct page **pages, unsigned int count);
-extern int m2p_add_override(unsigned long mfn, struct page *page,
-   struct gnttab_map_grant_ref *kmap_op);
 extern int clear_foreign_p2m_mapping(struct gnttab_unmap_grant_ref *unmap_ops,
 struct gnttab_map_grant_ref *kmap_ops,
 struct page **pages, unsigned int count);
-extern int m2p_remove_override(struct page *page,
-  struct gnttab_map_grant_ref *kmap_op,
-  unsigned long mfn);
-extern struct page *m2p_find_override(unsigned long mfn);
 extern unsigned long m2p_find_override_pfn(unsigned long mfn, unsigned long pfn);
 
 static inline unsigned long pfn_to_mfn(unsigned long pfn)
diff --git a/arch/x86/xen/p2m.c b/arch/x86/xen/p2m.c
index 9201a38..fa75842 100644
--- a/arch/x86/xen/p2m.c
+++ b/arch/x86/xen/p2m.c
@@ -896,6 +896,61 @@ static unsigned long mfn_hash(unsigned long mfn)
return hash_long(mfn, M2P_OVERRIDE_HASH_SHIFT);
 }
 
+/* Add an MFN override for a particular page */
+static int m2p_add_override(unsigned long mfn, struct page *page,
+   struct gnttab_map_grant_ref *kmap_op)
+{
+   unsigned long flags;
+   unsigned long pfn;
+   unsigned long uninitialized_var(address);
+   unsigned level;
+   pte_t *ptep = NULL;
+
+   pfn = page_to_pfn(page);
+   if (!PageHighMem(page)) {
+   address = (unsigned long)__va(pfn << PAGE_SHIFT);
+   ptep = lookup_address(address, &level);
+   if (WARN(ptep == NULL || level != PG_LEVEL_4K,
+"m2p_add_override: pfn %lx not mapped", pfn))
+   return -EINVAL;
+   }
+
+   if (kmap_op != NULL) {
+   if (!PageHighMem(page)) {
+   struct multicall_space mcs =
+   xen_mc_entry(sizeof(*kmap_op));
+
+   MULTI_grant_table_op(mcs.mc,
+   GNTTABOP_map_grant_ref, kmap_op, 1);
+
+   xen_mc_issue(PARAVIRT_LAZY_MMU);
+   }
+   }
+   spin_lock_irqsave(&m2p_override_lock, flags);
+   list_add(&page->lru,  &m2p_overrides[mfn_hash(mfn)]);
+   spin_unlock_irqrestore(&m2p_override_lock, flags);
+
+   /* p2m(m2p(mfn)) == mfn: the mfn is already present somewhere in
+* this domain. Set the FOREIGN_FRAME_BIT in the p2m for the other
+* pfn so that the following mfn_to_pfn(mfn) calls will return the
+* pfn from the m2p_override (the backend pfn) instead.
+* We need to do this because the pages shared by the frontend
+* (xen-blkfront) can be already locked (lock_page, called by
+* do_read_cache_page); when the userspace backend tries to use them
+* with direct_IO, mfn_to_pfn returns the pfn of the frontend, so
+* do_blockdev_direct_IO is going to try to lock the same pages
+* again resulting in a deadlock.
+* As a side effect get_user_pages_fast might not be safe on the
+* frontend pages while they are being shared with the backend,
+* because mfn_to_pfn (that ends up being called by GUPF) will
+* return the backend pfn rather than the frontend pfn. */
+   pfn = mfn_to_pfn_no_overrides(mfn);
+   if (get_phys_to_machine(pfn) == mfn)
+   set_phys_to_machine(pfn, FOREIGN_FRAME(mfn));
+
+   return 0;
+}
+
 int set_foreign_p2m_mapping(struct gnttab_map_grant_ref *map_ops,
struct gnttab_map_grant_ref *kmap_ops,
struct page **pages, unsigned int count)
@@ -955,61 +1010,123 @@ out:
 }
 EXPORT_SYMBOL_GPL(set_foreign_p2m_mapping);
 
-/* Add an MFN override for a particular page */
-int m2p_add_override(unsigned long mfn, struct page *page,
-   struct gnttab_map_grant_ref *kmap_op)
-{
-   unsigned long flags;
-   unsigned long pfn;
-   unsigned long uninitialized_var(address);
-   uns

[Xen-devel] [PATCH V3 3/8] xen: Delay m2p_override initialization

2014-11-10 Thread Juergen Gross
The m2p overrides are used to be able to find the local pfn for a
foreign mfn mapped into the domain. They are used by driver backends
having to access frontend data.

As this functionality isn't used in early boot it makes no sense to
initialize the m2p override functions very early. It can be done
later without doing any harm, removing the need for allocating memory
via extend_brk().

While at it make some m2p override functions static as they are only
used internally.

Signed-off-by: Juergen Gross 
---
 arch/x86/xen/p2m.c | 33 +++--
 1 file changed, 19 insertions(+), 14 deletions(-)

diff --git a/arch/x86/xen/p2m.c b/arch/x86/xen/p2m.c
index f67f8cf..97252e3 100644
--- a/arch/x86/xen/p2m.c
+++ b/arch/x86/xen/p2m.c
@@ -426,8 +426,6 @@ void __init xen_build_dynamic_phys_to_machine(void)
}
p2m_top[topidx][mididx] = &mfn_list[pfn];
}
-
-   m2p_override_init();
 }
 #ifdef CONFIG_X86_64
 unsigned long __init xen_revector_p2m_tree(void)
@@ -498,13 +496,15 @@ unsigned long __init xen_revector_p2m_tree(void)
}
/* This should be the leafs allocated for identity from _brk. */
}
-   return (unsigned long)mfn_list;
 
+   m2p_override_init();
+   return (unsigned long)mfn_list;
 }
 #else
 unsigned long __init xen_revector_p2m_tree(void)
 {
use_brk = 0;
+   m2p_override_init();
return 0;
 }
 #endif
@@ -794,15 +794,16 @@ bool set_phys_to_machine(unsigned long pfn, unsigned long mfn)
 #define M2P_OVERRIDE_HASH_SHIFT	10
 #define M2P_OVERRIDE_HASH  (1 << M2P_OVERRIDE_HASH_SHIFT)
 
-static RESERVE_BRK_ARRAY(struct list_head, m2p_overrides, M2P_OVERRIDE_HASH);
+static struct list_head *m2p_overrides;
 static DEFINE_SPINLOCK(m2p_override_lock);
 
 static void __init m2p_override_init(void)
 {
unsigned i;
 
-   m2p_overrides = extend_brk(sizeof(*m2p_overrides) * M2P_OVERRIDE_HASH,
-  sizeof(unsigned long));
+   m2p_overrides = alloc_bootmem_align(
+   sizeof(*m2p_overrides) * M2P_OVERRIDE_HASH,
+   sizeof(unsigned long));
 
for (i = 0; i < M2P_OVERRIDE_HASH; i++)
INIT_LIST_HEAD(&m2p_overrides[i]);
@@ -930,21 +931,25 @@ EXPORT_SYMBOL_GPL(set_foreign_p2m_mapping);
 static struct page *m2p_find_override(unsigned long mfn)
 {
unsigned long flags;
-   struct list_head *bucket = &m2p_overrides[mfn_hash(mfn)];
+   struct list_head *bucket;
struct page *p, *ret;
 
ret = NULL;
 
-   spin_lock_irqsave(&m2p_override_lock, flags);
+   if (m2p_overrides) {
+   bucket = &m2p_overrides[mfn_hash(mfn)];
 
-   list_for_each_entry(p, bucket, lru) {
-   if (page_private(p) == mfn) {
-   ret = p;
-   break;
+   spin_lock_irqsave(&m2p_override_lock, flags);
+
+   list_for_each_entry(p, bucket, lru) {
+   if (page_private(p) == mfn) {
+   ret = p;
+   break;
+   }
}
-   }
 
-   spin_unlock_irqrestore(&m2p_override_lock, flags);
+   spin_unlock_irqrestore(&m2p_override_lock, flags);
+   }
 
return ret;
 }
-- 
2.1.2


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


[Xen-devel] [PATCH V3 4/8] xen: Delay invalidating extra memory

2014-11-10 Thread Juergen Gross
When the physical memory configuration is initialized the p2m entries
for unpopulated memory pages are set to "invalid". As those pages
are beyond the hypervisor built p2m list the p2m tree has to be
extended.

This patch delays processing the extra memory related p2m entries
during the boot process until some more basic memory management
functions are callable. This removes the need to create new p2m
entries until virtual memory management is available.
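
For illustration, the resulting lookup order in get_phys_to_machine()
boils down to the following sketch (with added comments, mirroring the
hunk below):

	if (unlikely(pfn >= xen_p2m_size)) {
		if (pfn < xen_max_p2m_pfn)
			return xen_chk_extra_mem(pfn);	/* delayed extra mem */
		return IDENTITY_FRAME(pfn);	/* beyond the p2m list */
	}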

Signed-off-by: Juergen Gross 
Reviewed-by: David Vrabel 
---
 arch/x86/include/asm/xen/page.h |   3 +
 arch/x86/xen/p2m.c  | 130 
 arch/x86/xen/setup.c| 103 ++-
 arch/x86/xen/xen-ops.h  |   3 +-
 4 files changed, 107 insertions(+), 132 deletions(-)

diff --git a/arch/x86/include/asm/xen/page.h b/arch/x86/include/asm/xen/page.h
index b475297..28fa795 100644
--- a/arch/x86/include/asm/xen/page.h
+++ b/arch/x86/include/asm/xen/page.h
@@ -41,6 +41,9 @@ typedef struct xpaddr {
 
 extern unsigned long *machine_to_phys_mapping;
 extern unsigned long  machine_to_phys_nr;
+extern unsigned long *xen_p2m_addr;
+extern unsigned long  xen_p2m_size;
+extern unsigned long  xen_max_p2m_pfn;
 
 extern unsigned long get_phys_to_machine(unsigned long pfn);
 extern bool set_phys_to_machine(unsigned long pfn, unsigned long mfn);
diff --git a/arch/x86/xen/p2m.c b/arch/x86/xen/p2m.c
index 97252e3..6a9dfa6 100644
--- a/arch/x86/xen/p2m.c
+++ b/arch/x86/xen/p2m.c
@@ -181,7 +181,12 @@
 
 static void __init m2p_override_init(void);
 
+unsigned long *xen_p2m_addr __read_mostly;
+EXPORT_SYMBOL_GPL(xen_p2m_addr);
+unsigned long xen_p2m_size __read_mostly;
+EXPORT_SYMBOL_GPL(xen_p2m_size);
 unsigned long xen_max_p2m_pfn __read_mostly;
+EXPORT_SYMBOL_GPL(xen_max_p2m_pfn);
 
 static unsigned long *p2m_mid_missing_mfn;
 static unsigned long *p2m_top_mfn;
@@ -198,13 +203,6 @@ static RESERVE_BRK_ARRAY(unsigned long *, p2m_mid_identity, P2M_MID_PER_PAGE);
 
 RESERVE_BRK(p2m_mid, PAGE_SIZE * (MAX_DOMAIN_PAGES / (P2M_PER_PAGE * P2M_MID_PER_PAGE)));
 
-/* For each I/O range remapped we may lose up to two leaf pages for the boundary
- * violations and three mid pages to cover up to 3GB. With
- * early_can_reuse_p2m_middle() most of the leaf pages will be reused by the
- * remapped region.
- */
-RESERVE_BRK(p2m_identity_remap, PAGE_SIZE * 2 * 3 * MAX_REMAP_RANGES);
-
 static int use_brk = 1;
 
 static inline unsigned p2m_top_index(unsigned long pfn)
@@ -376,12 +374,14 @@ void __init xen_build_dynamic_phys_to_machine(void)
unsigned long max_pfn;
unsigned long pfn;
 
-if (xen_feature(XENFEAT_auto_translated_physmap))
+   if (xen_feature(XENFEAT_auto_translated_physmap))
return;
 
+   xen_p2m_addr = (unsigned long *)xen_start_info->mfn_list;
mfn_list = (unsigned long *)xen_start_info->mfn_list;
max_pfn = min(MAX_DOMAIN_PAGES, xen_start_info->nr_pages);
xen_max_p2m_pfn = max_pfn;
+   xen_p2m_size = max_pfn;
 
p2m_missing = alloc_p2m_page();
p2m_init(p2m_missing);
@@ -497,6 +497,11 @@ unsigned long __init xen_revector_p2m_tree(void)
/* This should be the leafs allocated for identity from _brk. */
}
 
+   xen_p2m_size = xen_max_p2m_pfn;
+   xen_p2m_addr = mfn_list;
+
+   xen_inv_extra_mem();
+
m2p_override_init();
return (unsigned long)mfn_list;
 }
@@ -504,6 +509,8 @@ unsigned long __init xen_revector_p2m_tree(void)
 unsigned long __init xen_revector_p2m_tree(void)
 {
use_brk = 0;
+   xen_p2m_size = xen_max_p2m_pfn;
+   xen_inv_extra_mem();
m2p_override_init();
return 0;
 }
@@ -512,8 +519,12 @@ unsigned long get_phys_to_machine(unsigned long pfn)
 {
unsigned topidx, mididx, idx;
 
-   if (unlikely(pfn >= MAX_P2M_PFN))
+   if (unlikely(pfn >= xen_p2m_size)) {
+   if (pfn < xen_max_p2m_pfn)
+   return xen_chk_extra_mem(pfn);
+
return IDENTITY_FRAME(pfn);
+   }
 
topidx = p2m_top_index(pfn);
mididx = p2m_mid_index(pfn);
@@ -611,78 +622,12 @@ static bool alloc_p2m(unsigned long pfn)
return true;
 }
 
-static bool __init early_alloc_p2m(unsigned long pfn, bool check_boundary)
-{
-   unsigned topidx, mididx, idx;
-   unsigned long *p2m;
-
-   topidx = p2m_top_index(pfn);
-   mididx = p2m_mid_index(pfn);
-   idx = p2m_index(pfn);
-
-   /* Pfff.. No boundary cross-over, lets get out. */
-   if (!idx && check_boundary)
-   return false;
-
-   WARN(p2m_top[topidx][mididx] == p2m_identity,
-   "P2M[%d][%d] == IDENTITY, should be MISSING (or alloced)!\n",
-   topidx, mididx);
-
-   /*
-* Could be done by xen_build_dynamic_phys_to_machine..
-*/
-   if (p2m_top[topidx][mididx] != p2m_missing)
-   retu

[Xen-devel] [PATCH V3 7/8] xen: switch to linear virtual mapped sparse p2m list

2014-11-10 Thread Juergen Gross
At start of the day the Xen hypervisor presents a contiguous mfn list
to a pv-domain. In order to support sparse memory this mfn list is
accessed via a three level p2m tree built early in the boot process.
Whenever the system needs the mfn associated with a pfn this tree is
used to find the mfn.

Instead of using a software walked tree for accessing a specific mfn
list entry this patch is creating a virtual address area for the
entire possible mfn list including memory holes. The holes are
covered by mapping a pre-defined page consisting only of "invalid
mfn" entries. Access to an mfn entry is possible by just using the
virtual base address of the mfn list and the pfn as index into that
list. This speeds up the (hot) path of determining the mfn of a
pfn.

Kernel build on a Dell Latitude E6440 (2 cores, HT) in 64 bit Dom0
showed the following improvements:

Elapsed time: 32:50 ->  32:35
System:   18:07 ->  17:47
User:104:00 -> 103:30

Tested on 64 bit dom0 and 32 bit domU.

Signed-off-by: Juergen Gross 
---
 arch/x86/include/asm/xen/page.h |  14 +-
 arch/x86/xen/mmu.c  |  32 +-
 arch/x86/xen/p2m.c  | 732 +---
 arch/x86/xen/xen-ops.h  |   2 +-
 4 files changed, 342 insertions(+), 438 deletions(-)

diff --git a/arch/x86/include/asm/xen/page.h b/arch/x86/include/asm/xen/page.h
index 07d8a7b..4a227ec 100644
--- a/arch/x86/include/asm/xen/page.h
+++ b/arch/x86/include/asm/xen/page.h
@@ -72,7 +72,19 @@ extern unsigned long m2p_find_override_pfn(unsigned long mfn, unsigned long pfn)
  */
 static inline unsigned long __pfn_to_mfn(unsigned long pfn)
 {
-   return get_phys_to_machine(pfn);
+   unsigned long mfn;
+
+   if (pfn < xen_p2m_size)
+   mfn = xen_p2m_addr[pfn];
+   else if (unlikely(pfn < xen_max_p2m_pfn))
+   return get_phys_to_machine(pfn);
+   else
+   return IDENTITY_FRAME(pfn);
+
+   if (unlikely(mfn == INVALID_P2M_ENTRY))
+   return get_phys_to_machine(pfn);
+
+   return mfn;
 }
 
 static inline unsigned long pfn_to_mfn(unsigned long pfn)
diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index 31ca515..0b43c45 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -1158,20 +1158,16 @@ static void __init xen_cleanhighmap(unsigned long vaddr,
 * instead of somewhere later and be confusing. */
xen_mc_flush();
 }
-static void __init xen_pagetable_p2m_copy(void)
+
+static void __init xen_pagetable_p2m_free(void)
 {
unsigned long size;
unsigned long addr;
-   unsigned long new_mfn_list;
-
-   if (xen_feature(XENFEAT_auto_translated_physmap))
-   return;
 
size = PAGE_ALIGN(xen_start_info->nr_pages * sizeof(unsigned long));
 
-   new_mfn_list = xen_revector_p2m_tree();
/* No memory or already called. */
-   if (!new_mfn_list || new_mfn_list == xen_start_info->mfn_list)
+   if ((unsigned long)xen_p2m_addr == xen_start_info->mfn_list)
return;
 
/* using __ka address and sticking INVALID_P2M_ENTRY! */
@@ -1189,8 +1185,6 @@ static void __init xen_pagetable_p2m_copy(void)
 
size = PAGE_ALIGN(xen_start_info->nr_pages * sizeof(unsigned long));
memblock_free(__pa(xen_start_info->mfn_list), size);
-   /* And revector! Bye bye old array */
-   xen_start_info->mfn_list = new_mfn_list;
 
/* At this stage, cleanup_highmap has already cleaned __ka space
 * from _brk_limit way up to the max_pfn_mapped (which is the end of
@@ -1214,12 +1208,26 @@ static void __init xen_pagetable_p2m_copy(void)
 }
 #endif
 
-static void __init xen_pagetable_init(void)
+static void __init xen_pagetable_p2m_setup(void)
 {
-   paging_init();
+   if (xen_feature(XENFEAT_auto_translated_physmap))
+   return;
+
+   xen_vmalloc_p2m_tree();
+
 #ifdef CONFIG_X86_64
-   xen_pagetable_p2m_copy();
+   xen_pagetable_p2m_free();
 #endif
+   /* And revector! Bye bye old array */
+   xen_start_info->mfn_list = (unsigned long)xen_p2m_addr;
+}
+
+static void __init xen_pagetable_init(void)
+{
+   paging_init();
+
+   xen_pagetable_p2m_setup();
+
/* Allocate and initialize top and mid mfn levels for p2m structure */
xen_build_mfn_list_list();
 
diff --git a/arch/x86/xen/p2m.c b/arch/x86/xen/p2m.c
index 328875a..7df446d 100644
--- a/arch/x86/xen/p2m.c
+++ b/arch/x86/xen/p2m.c
@@ -3,21 +3,22 @@
  * guests themselves, but it must also access and update the p2m array
  * during suspend/resume when all the pages are reallocated.
  *
- * The p2m table is logically a flat array, but we implement it as a
- * three-level tree to allow the address space to be sparse.
+ * The logical flat p2m table is mapped to a linear kernel memory area.
+ * For accesses by Xen a three-level tree linked via mfns only is set up to
+ * allow the address space to be sparse.

[Xen-devel] [PATCH V3 8/8] xen: Speed up set_phys_to_machine() by using read-only mappings

2014-11-10 Thread Juergen Gross
Instead of checking at each call of set_phys_to_machine() whether a
new p2m page has to be allocated due to writing an entry in a large
invalid or identity area, just map those areas read only and react
to a page fault on write by allocating the new page.

This change will make the common path with no allocation much
faster as it only requires a single write of the new mfn instead
of walking the address translation tables and checking for the
special cases.
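
In other words, the hot path (a sketch mirroring the hunk below)
becomes:

	/* try the plain write first; __put_user() fails only if the
	 * page is mapped read-only, and only then is the slow path
	 * taken to allocate a writable p2m page */
	if (likely(!__put_user(mfn, xen_p2m_addr + pfn)))
		return true;	/* common case: entry written directly */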

Suggested-by: David Vrabel 
Signed-off-by: Juergen Gross 
---
 arch/x86/xen/p2m.c | 14 --
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/arch/x86/xen/p2m.c b/arch/x86/xen/p2m.c
index 7df446d..58cf04c 100644
--- a/arch/x86/xen/p2m.c
+++ b/arch/x86/xen/p2m.c
@@ -70,6 +70,7 @@
 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -313,9 +314,9 @@ static void __init xen_rebuild_p2m_list(unsigned long *p2m)
paravirt_alloc_pte(&init_mm, __pa(p2m_identity_pte) >> PAGE_SHIFT);
for (i = 0; i < PTRS_PER_PTE; i++) {
set_pte(p2m_missing_pte + i,
-   pfn_pte(PFN_DOWN(__pa(p2m_missing)), PAGE_KERNEL));
+   pfn_pte(PFN_DOWN(__pa(p2m_missing)), PAGE_KERNEL_RO));
set_pte(p2m_identity_pte + i,
-   pfn_pte(PFN_DOWN(__pa(p2m_identity)), PAGE_KERNEL));
+   pfn_pte(PFN_DOWN(__pa(p2m_identity)), PAGE_KERNEL_RO));
}
 
for (pfn = 0; pfn < xen_max_p2m_pfn; pfn += chunk) {
@@ -362,7 +363,7 @@ static void __init xen_rebuild_p2m_list(unsigned long *p2m)
p2m_missing : p2m_identity;
ptep = populate_extra_pte((unsigned long)(p2m + pfn));
set_pte(ptep,
-   pfn_pte(PFN_DOWN(__pa(mfns)), PAGE_KERNEL));
+   pfn_pte(PFN_DOWN(__pa(mfns)), PAGE_KERNEL_RO));
continue;
}
 
@@ -621,6 +622,9 @@ bool __set_phys_to_machine(unsigned long pfn, unsigned long mfn)
return true;
}
 
+   if (likely(!__put_user(mfn, xen_p2m_addr + pfn)))
+   return true;
+
ptep = lookup_address((unsigned long)(xen_p2m_addr + pfn), &level);
BUG_ON(!ptep || level != PG_LEVEL_4K);
 
@@ -630,9 +634,7 @@ bool __set_phys_to_machine(unsigned long pfn, unsigned long mfn)
if (pte_pfn(*ptep) == PFN_DOWN(__pa(p2m_identity)))
return mfn == IDENTITY_FRAME(pfn);
 
-   xen_p2m_addr[pfn] = mfn;
-
-   return true;
+   return false;
 }
 
 bool set_phys_to_machine(unsigned long pfn, unsigned long mfn)
-- 
2.1.2


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


[Xen-devel] [PATCH V3 6/8] xen: Hide get_phys_to_machine() to be able to tune common path

2014-11-10 Thread Juergen Gross
Today get_phys_to_machine() is always called when the mfn for a pfn
is to be obtained. Add an inline wrapper __pfn_to_mfn() to be able
to avoid calling get_phys_to_machine() where possible once the switch
to a linear mapped p2m list has been done.

Signed-off-by: Juergen Gross 
---
 arch/x86/include/asm/xen/page.h | 27 +--
 arch/x86/xen/mmu.c  |  2 +-
 arch/x86/xen/p2m.c  |  6 +++---
 3 files changed, 25 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/xen/page.h b/arch/x86/include/asm/xen/page.h
index 28fa795..07d8a7b 100644
--- a/arch/x86/include/asm/xen/page.h
+++ b/arch/x86/include/asm/xen/page.h
@@ -59,6 +59,22 @@ extern int clear_foreign_p2m_mapping(struct gnttab_unmap_grant_ref *unmap_ops,
 struct page **pages, unsigned int count);
 extern unsigned long m2p_find_override_pfn(unsigned long mfn, unsigned long pfn);
 
+/*
+ * When to use pfn_to_mfn(), __pfn_to_mfn() or get_phys_to_machine():
+ * - pfn_to_mfn() returns either INVALID_P2M_ENTRY or the mfn. In case of an
+ *   identity entry the identity indicator will be cleared.
+ * - __pfn_to_mfn() returns the found entry of the p2m table. A possibly set
+ *   identity indicator will be still set. __pfn_to_mfn() is encapsulating
+ *   get_phys_to_machine() and might skip that function if possible to speed
+ *   up the common path.
+ * - get_phys_to_machine() is basically the same as __pfn_to_mfn(), but
+ *   without any short cuts for the common fast path.
+ */
+static inline unsigned long __pfn_to_mfn(unsigned long pfn)
+{
+   return get_phys_to_machine(pfn);
+}
+
 static inline unsigned long pfn_to_mfn(unsigned long pfn)
 {
unsigned long mfn;
@@ -66,7 +82,7 @@ static inline unsigned long pfn_to_mfn(unsigned long pfn)
if (xen_feature(XENFEAT_auto_translated_physmap))
return pfn;
 
-   mfn = get_phys_to_machine(pfn);
+   mfn = __pfn_to_mfn(pfn);
 
if (mfn != INVALID_P2M_ENTRY)
mfn &= ~(FOREIGN_FRAME_BIT | IDENTITY_FRAME_BIT);
@@ -79,7 +95,7 @@ static inline int phys_to_machine_mapping_valid(unsigned long 
pfn)
if (xen_feature(XENFEAT_auto_translated_physmap))
return 1;
 
-   return get_phys_to_machine(pfn) != INVALID_P2M_ENTRY;
+   return __pfn_to_mfn(pfn) != INVALID_P2M_ENTRY;
 }
 
 static inline unsigned long mfn_to_pfn_no_overrides(unsigned long mfn)
@@ -113,7 +129,7 @@ static inline unsigned long mfn_to_pfn(unsigned long mfn)
return mfn;
 
pfn = mfn_to_pfn_no_overrides(mfn);
-   if (get_phys_to_machine(pfn) != mfn) {
+   if (__pfn_to_mfn(pfn) != mfn) {
/*
 * If this appears to be a foreign mfn (because the pfn
 * doesn't map back to the mfn), then check the local override
@@ -129,8 +145,7 @@ static inline unsigned long mfn_to_pfn(unsigned long mfn)
 * entry doesn't map back to the mfn and m2p_override doesn't have a
 * valid entry for it.
 */
-   if (pfn == ~0 &&
-   get_phys_to_machine(mfn) == IDENTITY_FRAME(mfn))
+   if (pfn == ~0 && __pfn_to_mfn(mfn) == IDENTITY_FRAME(mfn))
pfn = mfn;
 
return pfn;
@@ -176,7 +191,7 @@ static inline unsigned long mfn_to_local_pfn(unsigned long mfn)
return mfn;
 
pfn = mfn_to_pfn(mfn);
-   if (get_phys_to_machine(pfn) != mfn)
+   if (__pfn_to_mfn(pfn) != mfn)
return -1; /* force !pfn_valid() */
return pfn;
 }
diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index d3e492b..31ca515 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -387,7 +387,7 @@ static pteval_t pte_pfn_to_mfn(pteval_t val)
unsigned long mfn;
 
if (!xen_feature(XENFEAT_auto_translated_physmap))
-   mfn = get_phys_to_machine(pfn);
+   mfn = __pfn_to_mfn(pfn);
else
mfn = pfn;
/*
diff --git a/arch/x86/xen/p2m.c b/arch/x86/xen/p2m.c
index 6a9dfa6..328875a 100644
--- a/arch/x86/xen/p2m.c
+++ b/arch/x86/xen/p2m.c
@@ -785,7 +785,7 @@ static int m2p_add_override(unsigned long mfn, struct page *page,
 * because mfn_to_pfn (that ends up being called by GUPF) will
 * return the backend pfn rather than the frontend pfn. */
pfn = mfn_to_pfn_no_overrides(mfn);
-   if (get_phys_to_machine(pfn) == mfn)
+   if (__pfn_to_mfn(pfn) == mfn)
set_phys_to_machine(pfn, FOREIGN_FRAME(mfn));
 
return 0;
@@ -965,7 +965,7 @@ static int m2p_remove_override(struct page *page,
 * pfn again. */
mfn &= ~FOREIGN_FRAME_BIT;
pfn = mfn_to_pfn_no_overrides(mfn);
-   if (get_phys_to_machine(pfn) == FOREIGN_FRAME(mfn) &&
+   if (__pfn_to_mfn(pfn) == FOREIGN_FRAME(mfn) &&
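
To make the intended fast/slow path split concrete, here is a minimal
illustrative sketch (not part of the patch; it assumes the later switch to
the linear list exports something like xen_p2m_addr/xen_p2m_size):

static inline unsigned long __pfn_to_mfn(unsigned long pfn)
{
	/* Fast path: direct lookup in the linear mapped p2m list. */
	if (pfn < xen_p2m_size)
		return xen_p2m_addr[pfn];

	/* Slow path: fall back to the full lookup. */
	return get_phys_to_machine(pfn);
}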

[Xen-devel] [PATCH V3 5/8] x86: Introduce function to get pmd entry pointer

2014-11-10 Thread Juergen Gross
Introduces lookup_pmd_address() to get the address of the pmd entry
related to a virtual address in the current address space. This
function is needed for support of a virtual mapped sparse p2m list
in xen pv domains.

Signed-off-by: Juergen Gross 
---
 arch/x86/include/asm/pgtable_types.h |  1 +
 arch/x86/mm/pageattr.c   | 20 
 2 files changed, 21 insertions(+)

diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 0778964..d83f5e7 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -396,6 +396,7 @@ static inline void update_page_count(int level, unsigned long pages) { }
 extern pte_t *lookup_address(unsigned long address, unsigned int *level);
 extern pte_t *lookup_address_in_pgd(pgd_t *pgd, unsigned long address,
unsigned int *level);
+extern pmd_t *lookup_pmd_address(unsigned long address);
 extern phys_addr_t slow_virt_to_phys(void *__address);
 extern int kernel_map_pages_in_pgd(pgd_t *pgd, u64 pfn, unsigned long address,
   unsigned numpages, unsigned long page_flags);
diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index 36de293..1298108 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -384,6 +384,26 @@ static pte_t *_lookup_address_cpa(struct cpa_data *cpa, unsigned long address,
 }
 
 /*
+ * Lookup the PMD entry for a virtual address. Return a pointer to the entry
+ * or NULL if not present.
+ */
+pmd_t *lookup_pmd_address(unsigned long address)
+{
+   pgd_t *pgd;
+   pud_t *pud;
+
+   pgd = pgd_offset_k(address);
+   if (pgd_none(*pgd))
+   return NULL;
+
+   pud = pud_offset(pgd, address);
+   if (pud_none(*pud) || pud_large(*pud) || !pud_present(*pud))
+   return NULL;
+
+   return pmd_offset(pud, address);
+}
+
+/*
  * This is necessary because __pa() does not work on some
  * kinds of memory, like vmalloc() or the alloc_remap()
  * areas on 32-bit NUMA systems.  The percpu areas can
-- 
2.1.2


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
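
A short usage sketch of the new helper (hypothetical caller, purely
illustrative of why a pmd entry rather than a pte entry is needed here):

static bool vaddr_uses_large_mapping(unsigned long vaddr)
{
	pmd_t *pmdp = lookup_pmd_address(vaddr);

	/* NULL or empty entry: nothing mapped at pmd level. */
	if (!pmdp || pmd_none(*pmdp))
		return false;

	/* Set and large: vaddr is covered by a 2 MiB mapping. */
	return pmd_large(*pmdp);
}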


Re: [Xen-devel] [PATCH V3 1/8] xen: Make functions static

2014-11-11 Thread Juergen Gross

On 11/11/2014 11:21 AM, David Vrabel wrote:

On 11/11/14 05:43, Juergen Gross wrote:

Some functions in arch/x86/xen/p2m.c are used locally only. Make them
static. Rearrange the functions in p2m.c to avoid forward declarations.

While at it correct some style issues (long lines, use pr_warn()).


Please don't add extra stuff like this.  In general if you feel yourself
writing "while at it..." or "also..." then you need another patch.


I applied the changes only to functions I was moving, as checkpatch was
complaining. Documentation says this should be avoided only when moving
functions between files.

If you still think I should omit these changes I'll throw them out.



I also don't care about long lines if they are under 100 characters.


I do. :-)

But same again: either no style corrections or all.


Juergen

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH V3 2/8] xen: Delay remapping memory of pv-domain

2014-11-11 Thread Juergen Gross

On 11/11/2014 12:45 PM, Andrew Cooper wrote:

On 11/11/14 05:43, Juergen Gross wrote:

diff --git a/arch/x86/xen/p2m.c b/arch/x86/xen/p2m.c
index fa75842..f67f8cf 100644
--- a/arch/x86/xen/p2m.c
+++ b/arch/x86/xen/p2m.c
@@ -268,6 +271,22 @@ static void p2m_init(unsigned long *p2m)
p2m[i] = INVALID_P2M_ENTRY;
  }

+static void * __ref alloc_p2m_page(void)
+{
+   if (unlikely(use_brk))
+   return extend_brk(PAGE_SIZE, PAGE_SIZE);
+
+   if (unlikely(!slab_is_available()))
+   return alloc_bootmem_align(PAGE_SIZE, PAGE_SIZE);
+
+   return (void *)__get_free_page(GFP_KERNEL | __GFP_REPEAT);
+}
+
+static void free_p2m_page(void *p)
+{
+   free_page((unsigned long)p);
+}
+


What guarantees are there that free_p2m_page() is only called on p2m
pages allocated using __get_free_page() ?  I can see from this diff that
this is the case, but that doesn't help someone coming along in the future.

At the very least, a comment is warranted about the apparent mismatch
between {alloc,free}_p2m_page().


Okay, I'll add a comment.




@@ -420,6 +439,7 @@ unsigned long __init xen_revector_p2m_tree(void)
unsigned long *mfn_list = NULL;
unsigned long size;

+   use_brk = 0;
va_start = xen_start_info->mfn_list;
/*We copy in increments of P2M_PER_PAGE * sizeof(unsigned long),
 * so make sure it is rounded up to that */
@@ -484,6 +504,7 @@ unsigned long __init xen_revector_p2m_tree(void)
  #else
  unsigned long __init xen_revector_p2m_tree(void)
  {
+   use_brk = 0;
return 0;
  }
  #endif


This appears to be a completely orphaned function.

It has a split definition based on CONFIG_X86_64, but the sole caller is
xen_pagetable_p2m_copy() which is X86_64 only.

How does use_brk get cleared for 32bit PV guests?


Good catch. use_brk is removed in a later patch and I have to admit I
didn't test each patch with 32 bit guests, just some of them (including
the final one, of course).

I'll change (and test) the patch accordingly.


Juergen


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
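
For illustration, the promised comment (plus a defensive check) might end
up looking roughly like this; a sketch only, not the actual follow-up patch:

/*
 * free_p2m_page() is only valid for pages obtained via alloc_p2m_page()
 * after slab is available, i.e. pages from __get_free_page(). Pages
 * allocated via extend_brk() or the bootmem allocator must never be
 * passed to free_p2m_page().
 */
static void free_p2m_page(void *p)
{
	BUG_ON(!slab_is_available());
	free_page((unsigned long)p);
}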


Re: [Xen-devel] Wrong cpupool handling

2014-11-11 Thread Juergen Gross

Hi Dietmar,

On 11/11/2014 01:18 PM, Dietmar Hahn wrote:

Hi list,

When creating a cpupool, starting and destroying a guest within this pool,
then removing this pool doesn't work because of EBUSY.

It seems the cause of this behavior is the commit
bac6334b51d9bcfe57ecf4a4cb5288348fcf044a.

In domain_kill() the function sched_move_domain() gets called changing the
d->cpupool pointer to the new cpupool without incrementing/decrementing the
counters "n_dom" of the new/old cpupool.

This leads to decrementing the wrong cpupool0->n_dom counter when
cpupool_rm_domain() gets called at the end and my own cpupool can't be
destroyed because n_dom = 1!

I don't have a quick patch because I'm not familiar enough with the code
at this time, but I think it should be fixed for 4.5.


Could you try the attached (untested) patch? Should apply to HEAD.

Juergen

>From f231f837b6dffd071c59517286562ccde7c5b5bc Mon Sep 17 00:00:00 2001
From: Juergen Gross 
Date: Tue, 11 Nov 2014 15:03:33 +0100
Subject: [PATCH] Adjust number of domains in cpupools when destroying domain

Commit bac6334b51d9bcfe57ecf4a4cb5288348fcf044a (move domain to
cpupool0 before destroying it) introduced an error in the accounting
of cpupools regarding the number of domains. The number of domains
is not adjusted when a domain is moved to cpupool0 in domain_kill().

Correct this by introducing a cpupool function doing the move
instead of open coding it by calling sched_move_domain().

Signed-off-by: Juergen Gross 
---
 xen/common/cpupool.c| 46 +-
 xen/common/domain.c |  2 +-
 xen/include/xen/sched.h |  1 +
 3 files changed, 35 insertions(+), 14 deletions(-)

diff --git a/xen/common/cpupool.c b/xen/common/cpupool.c
index 73249d3..552d791 100644
--- a/xen/common/cpupool.c
+++ b/xen/common/cpupool.c
@@ -225,6 +225,35 @@ static int cpupool_destroy(struct cpupool *c)
 }
 
 /*
+ * Move domain to another cpupool
+ */
+static int cpupool_move_domain_unlocked(struct domain *d, struct cpupool *c)
+{
+int ret;
+
+d->cpupool->n_dom--;
+ret = sched_move_domain(d, c);
+if ( ret )
+d->cpupool->n_dom++;
+else
+c->n_dom++;
+
+return ret;
+}
+int cpupool_move_domain(struct domain *d, struct cpupool *c)
+{
+int ret;
+
+spin_lock(&cpupool_lock);
+
+ret = cpupool_move_domain_unlocked(d, c);
+
+spin_unlock(&cpupool_lock);
+
+return ret;
+}
+
+/*
  * assign a specific cpu to a cpupool
  * cpupool_lock must be held
  */
@@ -338,13 +367,9 @@ static int cpupool_unassign_cpu(struct cpupool *c, unsigned int cpu)
 ret = -EBUSY;
 break;
 }
-c->n_dom--;
-ret = sched_move_domain(d, cpupool0);
+ret = cpupool_move_domain_unlocked(d, cpupool0);
 if ( ret )
-{
-c->n_dom++;
 break;
-}
 cpupool0->n_dom++;
 }
 rcu_read_unlock(&domlist_read_lock);
@@ -613,16 +638,11 @@ int cpupool_do_sysctl(struct xen_sysctl_cpupool_op *op)
 d->domain_id, op->cpupool_id);
 ret = -ENOENT;
 spin_lock(&cpupool_lock);
+
 c = cpupool_find_by_id(op->cpupool_id);
 if ( (c != NULL) && cpumask_weight(c->cpu_valid) )
-{
-d->cpupool->n_dom--;
-ret = sched_move_domain(d, c);
-if ( ret )
-d->cpupool->n_dom++;
-else
-c->n_dom++;
-}
+ret = cpupool_move_domain_unlocked(d, c);
+
 spin_unlock(&cpupool_lock);
 cpupool_dprintk("cpupool move_domain(dom=%d)->pool=%d ret %d\n",
 d->domain_id, op->cpupool_id, ret);
diff --git a/xen/common/domain.c b/xen/common/domain.c
index a3f51ec..4a62c1d 100644
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -621,7 +621,7 @@ int domain_kill(struct domain *d)
 rc = -EAGAIN;
 break;
 }
-if ( sched_move_domain(d, cpupool0) )
+if ( cpupool_move_domain(d, cpupool0) )
 return -EAGAIN;
 for_each_vcpu ( d, v )
 unmap_vcpu_info(v);
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index c5157e6..46fc6e3 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -871,6 +871,7 @@ struct cpupool *cpupool_get_by_id(int poolid);
 void cpupool_put(struct cpupool *pool);
 int cpupool_add_domain(struct domain *d, int poolid);
 void cpupool_rm_domain(struct domain *d);
+int cpupool_move_domain(struct domain *d, struct cpupool *c);
 int cpupool_do_sysctl(struct xen_sysctl_cpupool_op *op);
 void schedule_dump(struct cpupool *c);
 extern void dump_runq(unsigned char key);
-- 
2.1.2

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
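
For reference, the broken accounting Dietmar reported can be traced like
this (an illustrative walk-through of the code paths named above):

/*
 * domain d created in pool P:      P->n_dom == 1
 * domain_kill(d):
 *   sched_move_domain(d, cpupool0) sets d->cpupool = cpupool0, but
 *                                  neither P->n_dom nor cpupool0->n_dom
 *                                  is adjusted
 * cpupool_rm_domain(d):            decrements d->cpupool->n_dom, i.e.
 *                                  cpupool0->n_dom - the wrong pool
 * cpupool_destroy(P):              fails with -EBUSY as P->n_dom == 1
 */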


Re: [Xen-devel] Wrong cpupool handling

2014-11-11 Thread Juergen Gross

Hi again,

On 11/11/2014 01:18 PM, Dietmar Hahn wrote:

Hi list,

When creating a cpupool, starting and destroying a guest within this pool,
then removing this pool doesn't work because of EBUSY.

It seems the cause of this behavior is the commit
bac6334b51d9bcfe57ecf4a4cb5288348fcf044a.

In domain_kill() the function sched_move_domain() gets called changing the
d->cpupool pointer to the new cpupool without incrementing/decrementing the
counters "n_dom" of the new/old cpupool.

This leads to decrementing the wrong cpupool0->n_dom counter when
cpupool_rm_domain() gets called at the end and my own cpupool can't be
destroyed because n_dom = 1!

I don't have a quick patch because I'm not familiar enough with the code
at this time, but I think it should be fixed for 4.5.


Please discard previous patch, try this one.

Juergen

>From 629a7fe8ed07a13304d9378d357420ec885f59e3 Mon Sep 17 00:00:00 2001
From: Juergen Gross 
Date: Tue, 11 Nov 2014 15:03:33 +0100
Subject: [PATCH] Adjust number of domains in cpupools when destroying domain

Commit bac6334b51d9bcfe57ecf4a4cb5288348fcf044a (move domain to
cpupool0 before destroying it) introduced an error in the accounting
of cpupools regarding the number of domains. The number of domains
is not adjusted when a domain is moved to cpupool0 in domain_kill().

Correct this by introducing a cpupool function doing the move
instead of open coding it by calling sched_move_domain().

Signed-off-by: Juergen Gross 
---
 xen/common/cpupool.c| 47 +--
 xen/common/domain.c |  2 +-
 xen/include/xen/sched.h |  1 +
 3 files changed, 35 insertions(+), 15 deletions(-)

diff --git a/xen/common/cpupool.c b/xen/common/cpupool.c
index 73249d3..c6e3869 100644
--- a/xen/common/cpupool.c
+++ b/xen/common/cpupool.c
@@ -225,6 +225,35 @@ static int cpupool_destroy(struct cpupool *c)
 }
 
 /*
+ * Move domain to another cpupool
+ */
+static int cpupool_move_domain_unlocked(struct domain *d, struct cpupool *c)
+{
+int ret;
+
+d->cpupool->n_dom--;
+ret = sched_move_domain(d, c);
+if ( ret )
+d->cpupool->n_dom++;
+else
+c->n_dom++;
+
+return ret;
+}
+int cpupool_move_domain(struct domain *d, struct cpupool *c)
+{
+int ret;
+
+spin_lock(&cpupool_lock);
+
+ret = cpupool_move_domain_unlocked(d, c);
+
+spin_unlock(&cpupool_lock);
+
+return ret;
+}
+
+/*
  * assign a specific cpu to a cpupool
  * cpupool_lock must be held
  */
@@ -338,14 +367,9 @@ static int cpupool_unassign_cpu(struct cpupool *c, unsigned int cpu)
 ret = -EBUSY;
 break;
 }
-c->n_dom--;
-ret = sched_move_domain(d, cpupool0);
+ret = cpupool_move_domain_unlocked(d, cpupool0);
 if ( ret )
-{
-c->n_dom++;
 break;
-}
-cpupool0->n_dom++;
 }
 rcu_read_unlock(&domlist_read_lock);
 if ( ret )
@@ -613,16 +637,11 @@ int cpupool_do_sysctl(struct xen_sysctl_cpupool_op *op)
 d->domain_id, op->cpupool_id);
 ret = -ENOENT;
 spin_lock(&cpupool_lock);
+
 c = cpupool_find_by_id(op->cpupool_id);
 if ( (c != NULL) && cpumask_weight(c->cpu_valid) )
-{
-d->cpupool->n_dom--;
-ret = sched_move_domain(d, c);
-if ( ret )
-d->cpupool->n_dom++;
-else
-c->n_dom++;
-}
+ret = cpupool_move_domain_unlocked(d, c);
+
 spin_unlock(&cpupool_lock);
 cpupool_dprintk("cpupool move_domain(dom=%d)->pool=%d ret %d\n",
 d->domain_id, op->cpupool_id, ret);
diff --git a/xen/common/domain.c b/xen/common/domain.c
index a3f51ec..4a62c1d 100644
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -621,7 +621,7 @@ int domain_kill(struct domain *d)
 rc = -EAGAIN;
 break;
 }
-if ( sched_move_domain(d, cpupool0) )
+if ( cpupool_move_domain(d, cpupool0) )
 return -EAGAIN;
 for_each_vcpu ( d, v )
 unmap_vcpu_info(v);
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index c5157e6..46fc6e3 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -871,6 +871,7 @@ struct cpupool *cpupool_get_by_id(int poolid);
 void cpupool_put(struct cpupool *pool);
 int cpupool_add_domain(struct domain *d, int poolid);
 void cpupool_rm_domain(struct domain *d);
+int cpupool_move_domain(struct domain *d, struct cpupool *c);
 int cpupool_do_sysctl(struct xen_sysctl_cpupool_op *op);
 void schedule_dump(struct cpupool *c);
 extern void dump_runq(unsigned char key);
-- 
2.1.2

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] Wrong cpupool handling

2014-11-12 Thread Juergen Gross

On 11/12/2014 10:53 AM, Dietmar Hahn wrote:

On Tuesday, 11 November 2014, 15:21:01, Juergen Gross wrote:

Hi again,

On 11/11/2014 01:18 PM, Dietmar Hahn wrote:

Hi list,

When creating a cpupool, starting and destroying a guest within this pool,
then removing this pool doesn't work because of EBUSY.

It seems the cause of this behavior is the commit
bac6334b51d9bcfe57ecf4a4cb5288348fcf044a.

In domain_kill() the function sched_move_domain() gets called changing the
d->cpupool pointer to the new cpupool without incrementing/decrementing the
counters "n_dom" of the new/old cpupool.

This leads to decrementing the wrong cpupool0->n_dom counter when
cpupool_rm_domain() gets called at the end and my own cpupool can't be
destroyed because n_dom = 1!

I don't have a quick patch because I'm not familiar enough with the code
at this time, but I think it should be fixed for 4.5.


Please discard previous patch, try this one.


Yes this patch works.


Thanks. Can I add your "tested-by:"?


But I think in general a better solution would be to have the changing of the
cpupool pointer in sched_move_domain() together with the increment/decrement
of the counters but I see the locking problem.


The scheduler should never change cpupool owned data. The cpupool
pointer is domain data, so changing this is okay.


Juergen


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


[Xen-devel] [PATCH] Adjust number of domains in cpupools when destroying domain

2014-11-12 Thread Juergen Gross
Commit bac6334b51d9bcfe57ecf4a4cb5288348fcf044a (move domain to
cpupool0 before destroying it) introduced an error in the accounting
of cpupools regarding the number of domains. The number of domains
is not adjusted when a domain is moved to cpupool0 in domain_kill().

Correct this by introducing a cpupool function doing the move
instead of open coding it by calling sched_move_domain().

Signed-off-by: Juergen Gross 
Tested-by: Dietmar Hahn 
---
 xen/common/cpupool.c| 47 +--
 xen/common/domain.c |  2 +-
 xen/include/xen/sched.h |  1 +
 3 files changed, 35 insertions(+), 15 deletions(-)

diff --git a/xen/common/cpupool.c b/xen/common/cpupool.c
index 73249d3..c6e3869 100644
--- a/xen/common/cpupool.c
+++ b/xen/common/cpupool.c
@@ -225,6 +225,35 @@ static int cpupool_destroy(struct cpupool *c)
 }
 
 /*
+ * Move domain to another cpupool
+ */
+static int cpupool_move_domain_unlocked(struct domain *d, struct cpupool *c)
+{
+int ret;
+
+d->cpupool->n_dom--;
+ret = sched_move_domain(d, c);
+if ( ret )
+d->cpupool->n_dom++;
+else
+c->n_dom++;
+
+return ret;
+}
+int cpupool_move_domain(struct domain *d, struct cpupool *c)
+{
+int ret;
+
+spin_lock(&cpupool_lock);
+
+ret = cpupool_move_domain_unlocked(d, c);
+
+spin_unlock(&cpupool_lock);
+
+return ret;
+}
+
+/*
  * assign a specific cpu to a cpupool
  * cpupool_lock must be held
  */
@@ -338,14 +367,9 @@ static int cpupool_unassign_cpu(struct cpupool *c, unsigned int cpu)
 ret = -EBUSY;
 break;
 }
-c->n_dom--;
-ret = sched_move_domain(d, cpupool0);
+ret = cpupool_move_domain_unlocked(d, cpupool0);
 if ( ret )
-{
-c->n_dom++;
 break;
-}
-cpupool0->n_dom++;
 }
 rcu_read_unlock(&domlist_read_lock);
 if ( ret )
@@ -613,16 +637,11 @@ int cpupool_do_sysctl(struct xen_sysctl_cpupool_op *op)
 d->domain_id, op->cpupool_id);
 ret = -ENOENT;
 spin_lock(&cpupool_lock);
+
 c = cpupool_find_by_id(op->cpupool_id);
 if ( (c != NULL) && cpumask_weight(c->cpu_valid) )
-{
-d->cpupool->n_dom--;
-ret = sched_move_domain(d, c);
-if ( ret )
-d->cpupool->n_dom++;
-else
-c->n_dom++;
-}
+ret = cpupool_move_domain_unlocked(d, c);
+
 spin_unlock(&cpupool_lock);
 cpupool_dprintk("cpupool move_domain(dom=%d)->pool=%d ret %d\n",
 d->domain_id, op->cpupool_id, ret);
diff --git a/xen/common/domain.c b/xen/common/domain.c
index a3f51ec..4a62c1d 100644
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -621,7 +621,7 @@ int domain_kill(struct domain *d)
 rc = -EAGAIN;
 break;
 }
-if ( sched_move_domain(d, cpupool0) )
+if ( cpupool_move_domain(d, cpupool0) )
 return -EAGAIN;
 for_each_vcpu ( d, v )
 unmap_vcpu_info(v);
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index c5157e6..46fc6e3 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -871,6 +871,7 @@ struct cpupool *cpupool_get_by_id(int poolid);
 void cpupool_put(struct cpupool *pool);
 int cpupool_add_domain(struct domain *d, int poolid);
 void cpupool_rm_domain(struct domain *d);
+int cpupool_move_domain(struct domain *d, struct cpupool *c);
 int cpupool_do_sysctl(struct xen_sysctl_cpupool_op *op);
 void schedule_dump(struct cpupool *c);
 extern void dump_runq(unsigned char key);
-- 
2.1.2


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


[Xen-devel] [PATCH] Adjust number of domains in cpupools when destroying domain

2014-11-12 Thread Juergen Gross
Commit bac6334b51d9bcfe57ecf4a4cb5288348fcf044a (move domain to
cpupool0 before destroying it) introduced an error in the accounting
of cpupools regarding the number of domains. The number of domains
is not adjusted when a domain is moved to cpupool0 in domain_kill().

Correct this by introducing a cpupool function doing the move
instead of open coding it by calling sched_move_domain().

Signed-off-by: Juergen Gross 
Tested-by: Dietmar Hahn 
Reviewed-by: Andrew Cooper 
---
 xen/common/cpupool.c| 47 +--
 xen/common/domain.c |  2 +-
 xen/include/xen/sched.h |  1 +
 3 files changed, 35 insertions(+), 15 deletions(-)

diff --git a/xen/common/cpupool.c b/xen/common/cpupool.c
index 73249d3..a758a8b 100644
--- a/xen/common/cpupool.c
+++ b/xen/common/cpupool.c
@@ -225,6 +225,35 @@ static int cpupool_destroy(struct cpupool *c)
 }
 
 /*
+ * Move domain to another cpupool
+ */
+static int cpupool_move_domain_locked(struct domain *d, struct cpupool *c)
+{
+int ret;
+
+d->cpupool->n_dom--;
+ret = sched_move_domain(d, c);
+if ( ret )
+d->cpupool->n_dom++;
+else
+c->n_dom++;
+
+return ret;
+}
+int cpupool_move_domain(struct domain *d, struct cpupool *c)
+{
+int ret;
+
+spin_lock(&cpupool_lock);
+
+ret = cpupool_move_domain_locked(d, c);
+
+spin_unlock(&cpupool_lock);
+
+return ret;
+}
+
+/*
  * assign a specific cpu to a cpupool
  * cpupool_lock must be held
  */
@@ -338,14 +367,9 @@ static int cpupool_unassign_cpu(struct cpupool *c, unsigned int cpu)
 ret = -EBUSY;
 break;
 }
-c->n_dom--;
-ret = sched_move_domain(d, cpupool0);
+ret = cpupool_move_domain_locked(d, cpupool0);
 if ( ret )
-{
-c->n_dom++;
 break;
-}
-cpupool0->n_dom++;
 }
 rcu_read_unlock(&domlist_read_lock);
 if ( ret )
@@ -613,16 +637,11 @@ int cpupool_do_sysctl(struct xen_sysctl_cpupool_op *op)
 d->domain_id, op->cpupool_id);
 ret = -ENOENT;
 spin_lock(&cpupool_lock);
+
 c = cpupool_find_by_id(op->cpupool_id);
 if ( (c != NULL) && cpumask_weight(c->cpu_valid) )
-{
-d->cpupool->n_dom--;
-ret = sched_move_domain(d, c);
-if ( ret )
-d->cpupool->n_dom++;
-else
-c->n_dom++;
-}
+ret = cpupool_move_domain_locked(d, c);
+
 spin_unlock(&cpupool_lock);
 cpupool_dprintk("cpupool move_domain(dom=%d)->pool=%d ret %d\n",
 d->domain_id, op->cpupool_id, ret);
diff --git a/xen/common/domain.c b/xen/common/domain.c
index a3f51ec..4a62c1d 100644
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -621,7 +621,7 @@ int domain_kill(struct domain *d)
 rc = -EAGAIN;
 break;
 }
-if ( sched_move_domain(d, cpupool0) )
+if ( cpupool_move_domain(d, cpupool0) )
 return -EAGAIN;
 for_each_vcpu ( d, v )
 unmap_vcpu_info(v);
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index c5157e6..46fc6e3 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -871,6 +871,7 @@ struct cpupool *cpupool_get_by_id(int poolid);
 void cpupool_put(struct cpupool *pool);
 int cpupool_add_domain(struct domain *d, int poolid);
 void cpupool_rm_domain(struct domain *d);
+int cpupool_move_domain(struct domain *d, struct cpupool *c);
 int cpupool_do_sysctl(struct xen_sysctl_cpupool_op *op);
 void schedule_dump(struct cpupool *c);
 extern void dump_runq(unsigned char key);
-- 
2.1.2


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH] Adjust number of domains in cpupools when destroying domain

2014-11-12 Thread Juergen Gross

On 11/12/2014 12:10 PM, George Dunlap wrote:

On Wed, Nov 12, 2014 at 10:40 AM, Juergen Gross  wrote:

Commit bac6334b51d9bcfe57ecf4a4cb5288348fcf044a (move domain to
cpupool0 before destroying it) introduced an error in the accounting
of cpupools regarding the number of domains. The number of domains
is not adjusted when a domain is moved to cpupool0 in domain_kill().

Correct this by introducing a cpupool function doing the move
instead of open coding it by calling sched_move_domain().

Signed-off-by: Juergen Gross 
Tested-by: Dietmar Hahn 


Juergen / Dietmar -- do either of you have a reasonably complete set
of tests for cpupools?  It seems like even basic corner cases (like
shutting down a domain in a pool and then destroying a pool) aren't
being tested.

It would be really good if someone could try to do a more thorough
test before the 4.5 release.  It shouldn't be too hard to write a
script to test a lot of this functionality programmatically.


For the xm toolstack we had some tests at Fujitsu. Dietmar, you could
ask Lutz for advice. He might still have the scripts somewhere. They
should be easily adaptable to xl. In case you don't have time to try
them would you send them to me?

Juergen


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH V3 2/8] xen: Delay remapping memory of pv-domain

2014-11-12 Thread Juergen Gross

On 11/12/2014 10:45 PM, Konrad Rzeszutek Wilk wrote:

On Tue, Nov 11, 2014 at 06:43:40AM +0100, Juergen Gross wrote:

diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index a8a1a3d..d3e492b 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -1223,6 +1223,10 @@ static void __init xen_pagetable_init(void)
/* Allocate and initialize top and mid mfn levels for p2m structure */
xen_build_mfn_list_list();

+   /* Remap memory freed because of conflicts with E820 map */


s/becasue of/due to


Okay.


/* Boundary cross-over for the edges: */
-   p2m = extend_brk(PAGE_SIZE, PAGE_SIZE);
+   p2m = alloc_p2m_page();

p2m_init(p2m);

@@ -640,7 +651,7 @@ static bool __init early_alloc_p2m_middle(unsigned long pfn)

mid = p2m_top[topidx];
if (mid == p2m_mid_missing) {
-   mid = extend_brk(PAGE_SIZE, PAGE_SIZE);
+   mid = alloc_p2m_page();

p2m_mid_init(mid, p2m_missing);

@@ -649,100 +660,6 @@ static bool __init early_alloc_p2m_middle(unsigned long pfn)
return true;
  }



I would split this patch in two - one for the extend_brk/alloc_page conversation
to alloc_p2m_page and free_page to free_p2m_page.


Okay.


-/* Buffer used to remap identity mapped pages */
-unsigned long xen_remap_buf[P2M_PER_PAGE] __initdata;
+/*
+ * Buffer used to remap identity mapped pages. We only need the virtual space.


Could you expand on the 'need the virtual space'?


I'll update the comment to:

/*
 * Buffer used to remap identity mapped pages. We only need the virtual
 * space. The physical page behind this address is remapped as needed to
 * different buffer pages.
 */




.. snip..

  /*
   * This function updates the p2m and m2p tables with an identity map from
- * start_pfn to start_pfn+size and remaps the underlying RAM of the original
- allocation at remap_pfn. It must do so carefully in P2M_PER_PAGE sized blocks
- to not exhaust the reserved brk space. Doing it in properly aligned blocks
- ensures we only allocate the minimum required leaf pages in the p2m table. It
- * copies the existing mfns from the p2m table under the 1:1 map, overwrites
- * them with the identity map and then updates the p2m and m2p tables with the
- * remapped memory.
+ * start_pfn to start_pfn+size and prepares remapping the underlying RAM of the
+ * original allocation at remap_pfn. The information needed for remapping is
+ * saved in the memory itself to avoid the need for allocating buffers. The
+ * complete remap information is contained in a list of MFNs each containing
+ * up to REMAP_SIZE MFNs and the start target PFN for doing the remap.
+ * This enables to preserve the original mfn sequence while doing the remapping


us to


Yep.


+ * at a time when the memory management is capable of allocating virtual and
+ * physical memory in arbitrary amounts.


You might want to add, see 'xen_remap_memory' and its callers.


Okay.


-   /* These two checks move from the start to end boundaries */
-   if (ident_boundary_pfn == ident_start_pfn_align)
-   ident_boundary_pfn = ident_pfn_iter;
-   if (remap_boundary_pfn == remap_start_pfn_align)
-   remap_boundary_pfn = remap_pfn_iter;
+   /* Map first pfn to xen_remap_buf */
+   mfn = pfn_to_mfn(ident_pfn_iter);
+   set_pte_mfn(buf, mfn, PAGE_KERNEL);


So you set the buf to be point to 'mfn'.


Correct.



-   /* Check we aren't past the end */
-   BUG_ON(ident_boundary_pfn >= start_pfn + size);
-   BUG_ON(remap_boundary_pfn >= remap_pfn + size);
+   /* Save mapping information in page */
+   xen_remap_buf.next_area_mfn = xen_remap_mfn;
+   xen_remap_buf.target_pfn = remap_pfn_iter;
+   xen_remap_buf.size = chunk;
+   for (i = 0; i < chunk; i++)
+   xen_remap_buf.mfns[i] = pfn_to_mfn(ident_pfn_iter + i);

-   mfn = pfn_to_mfn(ident_boundary_pfn);
+   /* New element first in list */


I don't get that comment. Don't you mean the MFN of the last chunk you
had stashed the 'xen_remap_buf' structure in?

The 'xen_remap_mfn' ends up being the the tail value of this
"list".


I'll redo the comment:

/* Put remap buf into list. */


+/*
+ * Remap the memory prepared in xen_do_set_identity_and_remap_chunk().
+ */
+void __init xen_remap_memory(void)
+{
+   unsigned long buf = (unsigned long)&xen_remap_buf;
+   unsigned long mfn_save, mfn, pfn;
+   unsigned long remapped = 0, released = 0;
+   unsigned int i, free;
+   unsigned long pfn_s = ~0UL;
+   unsigned long len = 0;
+
+   mfn_save = virt_to_mfn(buf);
+
+   while (xen_remap_mfn != INVALID_P2M_ENTRY) {


So the 'list' is constructed by going forward - that is from low-numbered
PFNs to higher numbered ones. But the 'xen_remap_mfn' is going the
other way - from the highest PFN to the lowest PFN.
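
For orientation, a condensed sketch of the walk under discussion, based
only on the fields visible in the patch (xen_remap_buf, next_area_mfn,
set_pte_mfn); simplified, not the actual code:

while (xen_remap_mfn != INVALID_P2M_ENTRY) {
	/* Map the current chunk page into the virtual-only buffer page. */
	set_pte_mfn(buf, xen_remap_mfn, PAGE_KERNEL);

	/* ... restore xen_remap_buf.size mfns starting at
	 * xen_remap_buf.target_pfn ... */

	/* Advance to the next chunk in the list. */
	xen_remap_mfn = xen_remap_buf.next_area_mfn;
}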

Re: [Xen-devel] [PATCH V3 4/8] xen: Delay invalidating extra memory

2014-11-12 Thread Juergen Gross

On 11/12/2014 11:10 PM, Konrad Rzeszutek Wilk wrote:

@@ -376,12 +374,14 @@ void __init xen_build_dynamic_phys_to_machine(void)
unsigned long max_pfn;
unsigned long pfn;

-if (xen_feature(XENFEAT_auto_translated_physmap))
+   if (xen_feature(XENFEAT_auto_translated_physmap))


Spurious change.


I'll remove it.



.. snip..

diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c
index 0e5f9b6..8d5985b 100644
--- a/arch/x86/xen/setup.c
+++ b/arch/x86/xen/setup.c
@@ -75,7 +75,6 @@ static unsigned long xen_remap_mfn __initdata = INVALID_P2M_ENTRY;

  static void __init xen_add_extra_mem(u64 start, u64 size)
  {
-   unsigned long pfn;
int i;

for (i = 0; i < XEN_EXTRA_MEM_MAX_REGIONS; i++) {
@@ -95,17 +94,74 @@ static void __init xen_add_extra_mem(u64 start, u64 size)
printk(KERN_WARNING "Warning: not enough extra memory regions\n");

memblock_reserve(start, size);
+}

-   xen_max_p2m_pfn = PFN_DOWN(start + size);
-   for (pfn = PFN_DOWN(start); pfn < xen_max_p2m_pfn; pfn++) {
-   unsigned long mfn = pfn_to_mfn(pfn);
+static void __init xen_del_extra_mem(u64 start, u64 size)
+{
+   int i;
+   u64 start_r, size_r;

-   if (WARN_ONCE(mfn == pfn, "Trying to over-write 1-1 mapping (pfn: %lx)\n", pfn))
-   continue;
-   WARN_ONCE(mfn != INVALID_P2M_ENTRY, "Trying to remove %lx which has %lx mfn!\n",
- pfn, mfn);
+   for (i = 0; i < XEN_EXTRA_MEM_MAX_REGIONS; i++) {
+   start_r = xen_extra_mem[i].start;
+   size_r = xen_extra_mem[i].size;
+
+   /* Start of region. */
+   if (start_r == start) {
+   BUG_ON(size > size_r);
+   xen_extra_mem[i].start += size;
+   xen_extra_mem[i].size -= size;
+   break;
+   }
+   /* End of region. */
+   if (start_r + size_r == start + size) {
+   BUG_ON(size > size_r);
+   xen_extra_mem[i].size -= size;
+   break;
+   }
+   /* Mid of region. */
+   if (start > start_r && start < start_r + size_r) {
+   BUG_ON(start + size > start_r + size_r);
+   xen_extra_mem[i].size = start - start_r;
+   xen_add_extra_mem(start + size, start_r + size_r -
+ (start + size));


Which ends up calling 'memblock_reserve' for an region it already has
reserved. Should we call memblock_free(start_r, size_r - size) before calling 
this?

Or is that not neccessary as memblock_* is pretty smart about this sort of 
thing?


Regions marked via memblock_reserve() are allowed to overlap. I can add
a comment.


Juergen


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH V3 5/8] x86: Introduce function to get pmd entry pointer

2014-11-12 Thread Juergen Gross

On 11/12/2014 11:12 PM, Konrad Rzeszutek Wilk wrote:

On Tue, Nov 11, 2014 at 06:43:43AM +0100, Juergen Gross wrote:

Introduces lookup_pmd_address() to get the address of the pmd entry
related to a virtual address in the current address space. This
function is needed for support of a virtual mapped sparse p2m list
in xen pv domains.


What is wrong with using 'lookup_address' ?


It doesn't return the needed information. I need a pmd entry here, not
a pte entry.


Juergen

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH V3 6/8] xen: Hide get_phys_to_machine() to be able to tune common path

2014-11-13 Thread Juergen Gross

On 11/12/2014 11:18 PM, Konrad Rzeszutek Wilk wrote:

On Tue, Nov 11, 2014 at 06:43:44AM +0100, Juergen Gross wrote:

Today get_phys_to_machine() is always called when the mfn for a pfn
is to be obtained. Add a wrapper __pfn_to_mfn() as inline function
to be able to avoid calling get_phys_to_machine() when possible as


s/when/where/


No. It's not a matter of the caller, but of the p2m list entry.


soon as the switch to a linear mapped p2m list has been done.


But your inline function still calls get_phys_to_machine?


Sure. The switch is done in the next patch. David asked me to split
the patch, doing the preparation by adding __pfn_to_mfn() in its own
patch.






Signed-off-by: Juergen Gross 
---
  arch/x86/include/asm/xen/page.h | 27 +--
  arch/x86/xen/mmu.c  |  2 +-
  arch/x86/xen/p2m.c  |  6 +++---
  3 files changed, 25 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/xen/page.h b/arch/x86/include/asm/xen/page.h
index 28fa795..07d8a7b 100644
--- a/arch/x86/include/asm/xen/page.h
+++ b/arch/x86/include/asm/xen/page.h
@@ -59,6 +59,22 @@ extern int clear_foreign_p2m_mapping(struct gnttab_unmap_grant_ref *unmap_ops,
 struct page **pages, unsigned int count);
  extern unsigned long m2p_find_override_pfn(unsigned long mfn, unsigned long pfn);

+/*
+ * When to use pfn_to_mfn(), __pfn_to_mfn() or get_phys_to_machine():
+ * - pfn_to_mfn() returns either INVALID_P2M_ENTRY or the mfn. In case of an
+ *   identity entry the identity indicator will be cleared.


Why don't you say : In case of identity PFN the same PFN is returned.

But you did miss that also the FOREIGN_FRAME_BIT is cleared.


I'll reword the comment.




+ * - __pfn_to_mfn() returns the found entry of the p2m table. A possibly set


s/of the/in the/

+ *   identity indicator will be still set. __pfn_to_mfn() is encapsulating

.. be still set if the PFN is an identity one.

+ *   get_phys_to_machine() and might skip that function if possible to speed
+ *   up the common path.


How is is skipping that function? The patch below does no such thing?


The next patch in this series does.




+ * - get_phys_to_machine() is basically the same as __pfn_to_mfn(), but
+ *   without any short cuts for the common fast path.


Right. Perhaps we should call it 'slow_p2m' instead of the 'get_phys_to_machine'.


That's a matter of taste, I think. I can change it if nobody else
objects.


Juergen

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH V3 7/8] xen: switch to linear virtual mapped sparse p2m list

2014-11-13 Thread Juergen Gross

On 11/11/2014 06:47 PM, David Vrabel wrote:

On 11/11/14 05:43, Juergen Gross wrote:

At start of the day the Xen hypervisor presents a contiguous mfn list
to a pv-domain. In order to support sparse memory this mfn list is
accessed via a three level p2m tree built early in the boot process.
Whenever the system needs the mfn associated with a pfn this tree is
used to find the mfn.

Instead of using a software walked tree for accessing a specific mfn
list entry this patch is creating a virtual address area for the
entire possible mfn list including memory holes. The holes are
covered by mapping a pre-defined  page consisting only of "invalid
mfn" entries. Access to a mfn entry is possible by just using the
virtual base address of the mfn list and the pfn as index into that
list. This speeds up the (hot) path of determining the mfn of a
pfn.

Kernel build on a Dell Latitude E6440 (2 cores, HT) in 64 bit Dom0
showed following improvements:

Elapsed time:  32:50 ->  32:35
System:        18:07 ->  17:47
User:         104:00 -> 103:30

Tested on 64 bit dom0 and 32 bit domU.


Reviewed-by: David Vrabel 

Can you please test this with the following guests/scenarios.

* 64 bit dom0 with PCI devices with high MMIO BARs.


I'm not sure I have a machine available with this configuration.


* 32 bit domU with PCI devices assigned.
* 32 bit domU with 64 GiB of memory.
* domU that starts pre-ballooned and is subsequently ballooned up.
* 64 bit domU that is saved and restored (or local host migration)
* 32 bit domU that is saved and restored (or local host migration)


I'll try.


Juergen

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH V3 2/8] xen: Delay remapping memory of pv-domain

2014-11-13 Thread Juergen Gross

On 11/13/2014 08:56 PM, Konrad Rzeszutek Wilk wrote:

+   mfn_save = virt_to_mfn(buf);
+
+   while (xen_remap_mfn != INVALID_P2M_ENTRY) {


So the 'list' is constructed by going forward - that is from low-numbered
PFNs to higher numbered ones. But the 'xen_remap_mfn' is going the
other way - from the highest PFN to the lowest PFN.

Won't that mean we will restore the chunks of memory in the wrong
order? That is we will still restore them in chunks size, but the
chunks will be in descending order instead of ascending?


No, the information where to put each chunk is contained in the chunk
data. I can add a comment explaining this.


Right, the MFNs in a "chunks" are going to be restored in the right order.

I was thinking that the "chunks" (so a set of MFNs) will be restored in
the opposite order that they are written to.

And oddly enough the "chunks" are done in 512-3 = 509 MFNs at once?


More don't fit on a single page due to the other info needed. So: yes.








+   /* Map the remap information */
+   set_pte_mfn(buf, xen_remap_mfn, PAGE_KERNEL);
+
+   BUG_ON(xen_remap_mfn != xen_remap_buf.mfns[0]);
+
+   free = 0;
+   pfn = xen_remap_buf.target_pfn;
+   for (i = 0; i < xen_remap_buf.size; i++) {
+   mfn = xen_remap_buf.mfns[i];
+   if (!released && xen_update_mem_tables(pfn, mfn)) {
+   remapped++;


If we fail 'xen_update_mem_tables' we will on the next chunk (so i+1) keep on
freeing pages instead of trying to remap. Is that intentional? Could we
try to remap?


Hmm, I'm not sure this is worth the effort. What could lead to failure
here? I suspect we could even just BUG() on failure. What do you think?


I was hoping that this question would lead to making this loop a bit
simpler as you would have to spread some of the code in the loop
into functions.

And keep 'remmaped' and 'released' reset every loop.

However, if it makes the code more complex - then please
forget my question.


Using BUG() instead would make the code less complex. Do you really
think xen_update_mem_tables() would ever fail in a sane system?

- set_phys_to_machine() would fail only on a memory shortage. Just
  going on without adding more memory wouldn't lead to a healthy system,
  I think.
- The hypervisor calls would fail only in case of parameter errors.
  This should never happen, so dying seems to be the correct reaction.

David, what do you think?


Juergen

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH V6 00/18] x86: Full support of PAT

2014-11-13 Thread Juergen Gross

Ingo,

could you take the patches, please?


Juergen

On 11/03/2014 02:01 PM, Juergen Gross wrote:

The x86 architecture offers via the PAT (Page Attribute Table) a way to
specify different caching modes in page table entries. The PAT MSR contains
8 entries each specifying one of 6 possible cache modes. A pte references one
of those entries via 3 bits: _PAGE_PAT, _PAGE_PWT and _PAGE_PCD.

The Linux kernel currently supports only 4 different cache modes. The PAT MSR
is set up in a way that the setting of _PAGE_PAT in a pte doesn't matter: the
top 4 entries in the PAT MSR are the same as the 4 lower entries.

This results in the kernel not supporting e.g. write-through mode. Especially
this cache mode would speed up drivers of video cards which now have to use
uncached accesses.

OTOH some old processors (Pentium) don't support PAT correctly and the Xen
hypervisor has been using a different PAT MSR configuration for some time now
and can't change that as this setting is part of the ABI.

This patch set abstracts the cache mode from the pte and introduces tables to
translate between cache mode and pte bits (the default cache mode "write back"
is hard-wired to PAT entry 0). The tables are statically initialized with
values being compatible to old processors and current usage. As soon as the
PAT MSR is changed (or - in case of Xen - is read at boot time) the tables are
changed accordingly. Requests of mappings with special cache modes are always
possible now, in case they are not supported there will be a fallback to a
compatible but slower mode.
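
The core mechanism can be pictured as two small lookup tables (a
simplified sketch; names and sizes are an approximation of the series,
not its exact contents):

/* cache mode -> pte cache bits (_PAGE_PAT | _PAGE_PCD | _PAGE_PWT) */
uint16_t cachemode2pte_tbl[_PAGE_CACHE_MODE_NUM];

/* 3-bit pte cache index (PAT<<2 | PCD<<1 | PWT) -> cache mode */
uint8_t pte2cachemode_tbl[8];

/* Every mapping request goes through the tables, so an unsupported
 * mode (e.g. WT on a CPU without working PAT) can transparently fall
 * back to a compatible but slower mode like UC_MINUS. */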

Summing it up, this patch set adds the following features:
- capability to support WT and WP cache modes on processors with full PAT
   support
- processors with no or incorrect PAT support are still working as today, even
   if WT or WP cache mode are selected by drivers for some pages
- reduction of Xen special handling regarding cache mode

Changes in V6:
- add new patch 10 (x86: Remove looking for setting of _PAGE_PAT_LARGE in
   pageattr.c) as suggested by Thomas Gleixner
- replaced SOB of Stefan Bader by "Based-on-patch-by:" as suggested by
   Borislav Petkov

Changes in V5:
- split up first patch as requested by Ingo Molnar and Thomas Gleixner
- add a helper function in pat_init_cache_modes() as requested by Ingo Molnar

Changes in V4:
- rebased to 3.18-rc2

Changes in V3:
- corrected two minor nits (UC_MINUS, again) detected by Toshi Kani

Changes in V2:
- simplified handling of PAT MSR write under Xen as suggested by David Vrabel
- removed resetting of pat_enabled under Xen
- two small corrections requested by Toshi Kani (UC_MINUS cache mode in
   vermilion driver, fix 32 bit kernel build failure)
- correct build error on non-x86 arch by moving definition of
   update_cache_mode_entry() to x86 specific header

Changes since RFC:
- renamed functions and variables as suggested by Toshi Kani
- corrected cache mode bits for WT and WP
- modified handling of PAT MSR write under Xen as suggested by Jan Beulich


Juergen Gross (18):
   x86: Make page cache mode a real type
   x86: Use new cache mode type in include/asm/fb.h
   x86: Use new cache mode type in drivers/video/fbdev/gbefb.c
   x86: Use new cache mode type in drivers/video/fbdev/vermilion
   x86: Use new cache mode type in arch/x86/pci
   x86: Use new cache mode type in arch/x86/mm/init_64.c
   x86: Use new cache mode type in asm/pgtable.h
   x86: Use new cache mode type in mm/iomap_32.c
   x86: Use new cache mode type in track_pfn_remap() and
 track_pfn_insert()
   x86: Remove looking for setting of _PAGE_PAT_LARGE in pageattr.c
   x86: Use new cache mode type in setting page attributes
   x86: Use new cache mode type in mm/ioremap.c
   x86: Use new cache mode type in memtype related functions
   x86: Clean up pgtable_types.h
   x86: Support PAT bit in pagetable dump for lower levels
   x86: Respect PAT bit when copying pte values between large and normal
 pages
   x86: Enable PAT to use cache mode translation tables
   xen: Support Xen pv-domains using PAT

  arch/x86/include/asm/cacheflush.h |  38 ---
  arch/x86/include/asm/fb.h |   6 +-
  arch/x86/include/asm/io.h |   2 +-
  arch/x86/include/asm/pat.h|   7 +-
  arch/x86/include/asm/pgtable.h|  19 ++--
  arch/x86/include/asm/pgtable_types.h  |  96 
  arch/x86/mm/dump_pagetables.c |  24 ++--
  arch/x86/mm/init.c|  37 +++
  arch/x86/mm/init_64.c |   9 +-
  arch/x86/mm/iomap_32.c|  12 +-
  arch/x86/mm/ioremap.c |  63 ++-
  arch/x86/mm/mm_internal.h |   2 +
  arch/x86/mm/pageattr.c|  84 --
  arch/x86/mm/pat.c | 176 +++---
  arch/x86/mm/pat_internal.h|  22 ++--
  arch/x86/mm/pat_rbtree.c 

[Xen-devel] [PATCH 0/4] support guest virtual mapped p2m list

2014-11-14 Thread Juergen Gross
The x86 struct arch_shared_info field pfn_to_mfn_frame_list_list
currently contains the mfn of the top level page frame of the 3 level
p2m tree, which is used by the Xen tools during saving and restoring
(and live migration) of pv domains and for crash dump analysis. With
three levels of the p2m tree it is possible to support up to 512 GB of
RAM for a 64 bit pv domain.

A 32 bit pv domain can support more, as each memory page can hold 1024
instead of 512 entries, leading to a limit of 4 TB.
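
The limits follow directly from the page size (a quick sanity check of
the numbers above):

64 bit: 4096 / sizeof(unsigned long) = 512 entries per p2m page,
        so 512^3 pfns * 4 KiB/page = 2^27 * 2^12 bytes = 512 GiB.
32 bit: 4096 / 4 = 1024 entries per p2m page,
        so 1024^3 pfns * 4 KiB/page = 2^30 * 2^12 bytes = 4 TiB.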

To be able to support more RAM on x86-64 switch to a virtual mapped
p2m list.

Juergen Gross (4):
  expand x86 arch_shared_info to support linear p2m list
  introduce arch_get_features()
  introduce boot parameter for setting XENFEAT_virtual_p2m
  document new boot parameter virt_p2m

 docs/misc/xen-command-line.markdown | 22 ++
 xen/arch/arm/domain.c   |  5 +++
 xen/arch/x86/domain.c   | 80 +
 xen/common/kernel.c | 22 +-
 xen/include/public/arch-x86/xen.h   |  7 +++-
 xen/include/public/features.h   |  3 ++
 xen/include/xen/domain.h|  2 +
 7 files changed, 120 insertions(+), 21 deletions(-)

-- 
2.1.2


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


[Xen-devel] [PATCH 3/4] introduce boot parameter for setting XENFEAT_virtual_p2m

2014-11-14 Thread Juergen Gross
Introduce a new boot parameter "virt_p2m" to be able to set
XENFEAT_virtual_p2m for a pv domain.

As long as Xen tools and kdump don't support this new feature it is
turned off by default.

Signed-off-by: Juergen Gross 
---
 xen/arch/x86/domain.c | 50 ++
 1 file changed, 50 insertions(+)

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index d98aabd..ccb54f6 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -2166,8 +2166,44 @@ static int __init init_vcpu_kick_softirq(void)
 }
 __initcall(init_vcpu_kick_softirq);
 
+#define VIRT_P2M_DOM00x01
+#define VIRT_P2M_DOM0_LARGE  0x02
+#define VIRT_P2M_DOMU0x04
+#define VIRT_P2M_DOMU_LARGE  0x08
+static unsigned virt_p2m = 0;
+
+static void __init parse_virt_p2m(const char *s)
+{
+char *ss;
+int b;
+
+do {
+ss = strchr(s, ',');
+if ( ss )
+*ss = '\0';
+
+b = parse_bool(s);
+if ( b == 0 )
+virt_p2m = 0;
+else if ( b == 1 )
+virt_p2m = VIRT_P2M_DOM0 | VIRT_P2M_DOMU;
+else if ( !strcmp(s, "dom0") )
+virt_p2m |= VIRT_P2M_DOM0;
+else if ( !strcmp(s, "dom0_large") )
+virt_p2m |= VIRT_P2M_DOM0_LARGE;
+else if ( !strcmp(s, "domu") )
+virt_p2m |= VIRT_P2M_DOMU;
+else if ( !strcmp(s, "domu_large") )
+virt_p2m |= VIRT_P2M_DOMU_LARGE;
+
+s = ss + 1;
+} while ( ss );
+}
+custom_param("virt_p2m", parse_virt_p2m);
+
 uint32_t arch_get_features(struct domain *d, unsigned int submap_idx)
 {
+#define DOM_IS_LARGE(d) ((d)->max_pages > 1U << 27)
 uint32_t submap = 0;
 
 switch ( submap_idx )
@@ -2179,6 +2215,20 @@ uint32_t arch_get_features(struct domain *d, unsigned int submap_idx)
 submap |= (1U << XENFEAT_mmu_pt_update_preserve_ad) |
   (1U << XENFEAT_highmem_assist) |
   (1U << XENFEAT_gnttab_map_avail_bits);
+if ( is_hardware_domain(d) )
+{
+if ( virt_p2m & VIRT_P2M_DOM0 )
+submap |= 1U << XENFEAT_virtual_p2m;
+if ( DOM_IS_LARGE(d) && virt_p2m & VIRT_P2M_DOM0_LARGE )
+submap |= 1U << XENFEAT_virtual_p2m;
+}
+else
+{
+if ( virt_p2m & VIRT_P2M_DOMU )
+submap |= 1U << XENFEAT_virtual_p2m;
+if ( DOM_IS_LARGE(d) && virt_p2m & VIRT_P2M_DOMU_LARGE )
+submap |= 1U << XENFEAT_virtual_p2m;
+}
 break;
 case guest_type_pvh:
 submap |= (1U << XENFEAT_hvm_safe_pvclock) |
-- 
2.1.2


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
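
As a usage illustration (hypothetical Xen command line, matching the
parser above), enabling the feature only for guests that actually exceed
the 3 level limit would be:

virt_p2m=dom0_large,domu_large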


[Xen-devel] [PATCH 1/4] expand x86 arch_shared_info to support linear p2m list

2014-11-14 Thread Juergen Gross
The x86 struct arch_shared_info field pfn_to_mfn_frame_list_list
currently contains the mfn of the top level page frame of the 3 level
p2m tree, which is used by the Xen tools during saving and restoring
(and live migration) of pv domains and for crash dump analysis. With
three levels of the p2m tree it is possible to support up to 512 GB of
RAM for a 64 bit pv domain.

A 32 bit pv domain can support more, as each memory page can hold 1024
instead of 512 entries, leading to a limit of 4 TB.

To be able to support more RAM on x86-64 switch to a virtual mapped
p2m list.

This patch expands struct arch_shared_info with a new p2m list virtual
address and the mfn of the page table root. The new information is
indicated by the domain to be valid by storing ~0UL into
pfn_to_mfn_frame_list_list. The hypervisor indicates usability of this
feature by a new flag XENFEAT_virtual_p2m.
---
 xen/include/public/arch-x86/xen.h | 7 ++-
 xen/include/public/features.h | 3 +++
 2 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/xen/include/public/arch-x86/xen.h b/xen/include/public/arch-x86/xen.h
index f35804b..b0f85a9 100644
--- a/xen/include/public/arch-x86/xen.h
+++ b/xen/include/public/arch-x86/xen.h
@@ -224,7 +224,12 @@ struct arch_shared_info {
 /* Frame containing list of mfns containing list of mfns containing p2m. */
 xen_pfn_t pfn_to_mfn_frame_list_list;
 unsigned long nmi_reason;
-uint64_t pad[32];
+/*
+ * Following two fields are valid if pfn_to_mfn_frame_list_list contains
+ * ~0UL.
+ */
+unsigned long p2m_vaddr;/* virtual address of the p2m list */
+unsigned long p2m_as_root;  /* mfn of the top level page table */
 };
 typedef struct arch_shared_info arch_shared_info_t;
 
diff --git a/xen/include/public/features.h b/xen/include/public/features.h
index 16d92aa..ff0b82d 100644
--- a/xen/include/public/features.h
+++ b/xen/include/public/features.h
@@ -99,6 +99,9 @@
 #define XENFEAT_grant_map_identity12
  */
 
+/* x86: guest may specify virtual address of p2m list */
+#define XENFEAT_virtual_p2m   13
+
 #define XENFEAT_NR_SUBMAPS 1
 
 #endif /* __XEN_PUBLIC_FEATURES_H__ */
-- 
2.1.2


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
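
From the guest side the new fields would be used roughly like this
(hypothetical Linux sketch; xen_p2m_addr and the choice of pagetable
root are assumptions, not part of this patch):

if (xen_feature(XENFEAT_virtual_p2m)) {
	struct shared_info *sh = HYPERVISOR_shared_info;

	sh->arch.p2m_vaddr = (unsigned long)xen_p2m_addr;
	sh->arch.p2m_as_root = virt_to_mfn(init_level4_pgt);
	/* Signal validity of the two fields above. */
	sh->arch.pfn_to_mfn_frame_list_list = ~0UL;
}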


[Xen-devel] [PATCH 4/4] document new boot parameter virt_p2m

2014-11-14 Thread Juergen Gross
Add documentation for the new boot parameter "virt_p2m".

Signed-off-by: Juergen Gross 
---
 docs/misc/xen-command-line.markdown | 22 ++
 1 file changed, 22 insertions(+)

diff --git a/docs/misc/xen-command-line.markdown b/docs/misc/xen-command-line.markdown
index 0830e5f..c56273d 100644
--- a/docs/misc/xen-command-line.markdown
+++ b/docs/misc/xen-command-line.markdown
@@ -1272,6 +1272,28 @@ The optional `keep` parameter causes Xen to continue using the vga
 console even after dom0 has been started.  The default behaviour is to
 relinquish control to dom0.
 
+### virt\_p2m
+> `= List of [  | dom0 | dom0\_large | domu | domu\_large ]`
+
+> Default: `false`
+
+Allow pv-domains to specify a virtual address for the domain's p2m list which
+is used by the Xen tools during domain save and restore and by kdump.
+
+`dom0` enables this feature for Dom0.
+
+`dom0\_large` enables this feature for Dom0 with more than 512 GiB of RAM
+(the traditional 3 level p2m tree can't map more than that).
+
+`domu` enables this feature for all pv domains but Dom0.
+
+`domu\_large` enables this feature for all pv domains with more than 512 GiB
+of RAM but Dom0.
+
+`true` enables this feature for all pv domains.
+
+`false` disables this feature for all domains.
+
 ### vpid (Intel)
 > `= `
 
-- 
2.1.2


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


[Xen-devel] [PATCH 2/4] introduce arch_get_features()

2014-11-14 Thread Juergen Gross
The XENVER_get_features sub command of the xen_version hypercall is
handled completely in common/kernel.c despite of some architecture
dependant parts.

Move the architecture dependant parts in an own function in
arch/*/domain.c

Signed-off-by: Juergen Gross 
---
 xen/arch/arm/domain.c|  5 +
 xen/arch/x86/domain.c| 30 ++
 xen/common/kernel.c  | 22 ++
 xen/include/xen/domain.h |  2 ++
 4 files changed, 39 insertions(+), 20 deletions(-)

diff --git a/xen/arch/arm/domain.c b/xen/arch/arm/domain.c
index 7221bc8..dc5a3fb 100644
--- a/xen/arch/arm/domain.c
+++ b/xen/arch/arm/domain.c
@@ -823,6 +823,11 @@ void vcpu_block_unless_event_pending(struct vcpu *v)
 vcpu_unblock(current);
 }
 
+uint32_t arch_get_features(struct domain *d, unsigned int submap_idx)
+{
+return 0;
+}
+
 /*
  * Local variables:
  * mode: C
diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index ae0a344..d98aabd 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -2166,6 +2166,36 @@ static int __init init_vcpu_kick_softirq(void)
 }
 __initcall(init_vcpu_kick_softirq);
 
+uint32_t arch_get_features(struct domain *d, unsigned int submap_idx)
+{
+uint32_t submap = 0;
+
+switch ( submap_idx )
+{
+case 0:
+switch ( d->guest_type )
+{
+case guest_type_pv:
+submap |= (1U << XENFEAT_mmu_pt_update_preserve_ad) |
+  (1U << XENFEAT_highmem_assist) |
+  (1U << XENFEAT_gnttab_map_avail_bits);
+break;
+case guest_type_pvh:
+submap |= (1U << XENFEAT_hvm_safe_pvclock) |
+  (1U << XENFEAT_supervisor_mode_kernel) |
+  (1U << XENFEAT_hvm_callback_vector);
+break;
+case guest_type_hvm:
+submap |= (1U << XENFEAT_hvm_safe_pvclock) |
+  (1U << XENFEAT_hvm_callback_vector) |
+  (1U << XENFEAT_hvm_pirqs);
+break;
+}
+break;
+}
+
+return submap;
+}
 
 /*
  * Local variables:
diff --git a/xen/common/kernel.c b/xen/common/kernel.c
index d23c422..d22a860 100644
--- a/xen/common/kernel.c
+++ b/xen/common/kernel.c
@@ -312,31 +312,13 @@ DO(xen_version)(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
 fi.submap |= 1U << XENFEAT_supervisor_mode_kernel;
 if ( is_hardware_domain(current->domain) )
 fi.submap |= 1U << XENFEAT_dom0;
-#ifdef CONFIG_X86
-switch ( d->guest_type )
-{
-case guest_type_pv:
-fi.submap |= (1U << XENFEAT_mmu_pt_update_preserve_ad) |
- (1U << XENFEAT_highmem_assist) |
- (1U << XENFEAT_gnttab_map_avail_bits);
-break;
-case guest_type_pvh:
-fi.submap |= (1U << XENFEAT_hvm_safe_pvclock) |
- (1U << XENFEAT_supervisor_mode_kernel) |
- (1U << XENFEAT_hvm_callback_vector);
-break;
-case guest_type_hvm:
-fi.submap |= (1U << XENFEAT_hvm_safe_pvclock) |
- (1U << XENFEAT_hvm_callback_vector) |
- (1U << XENFEAT_hvm_pirqs);
-break;
-}
-#endif
 break;
 default:
 return -EINVAL;
 }
 
+fi.submap |= arch_get_features(d, fi.submap_idx);
+
 if ( copy_to_guest(arg, &fi, 1) )
 return -EFAULT;
 return 0;
diff --git a/xen/include/xen/domain.h b/xen/include/xen/domain.h
index 9215b0e..0d12dc0 100644
--- a/xen/include/xen/domain.h
+++ b/xen/include/xen/domain.h
@@ -80,6 +80,8 @@ extern spinlock_t vcpu_alloc_lock;
 bool_t domctl_lock_acquire(void);
 void domctl_lock_release(void);
 
+uint32_t arch_get_features(struct domain *d, unsigned int submap_idx);
+
 /*
  * Continue the current hypercall via func(data) on specified cpu.
  * If this function returns 0 then the function is guaranteed to run at some
-- 
2.1.2


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH V3 7/8] xen: switch to linear virtual mapped sparse p2m list

2014-11-14 Thread Juergen Gross

On 11/14/2014 12:58 PM, David Vrabel wrote:

On 13/11/14 09:21, Juergen Gross wrote:

On 11/11/2014 06:47 PM, David Vrabel wrote:


Can you please test this with the following guests/scenarios.

* 64 bit dom0 with PCI devices with high MMIO BARs.


I'm not sure I have a machine available with this configuration.


We have a bunch of them in our test lab. Unfortunately, xapi doesn't
work on Linux 3.12 or later so I won't be able to test this series in
the short term.


I've found one. Stay tuned. :-)


Juergen


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH 1/4] expand x86 arch_shared_info to support linear p2m list

2014-11-14 Thread Juergen Gross

On 11/14/2014 12:41 PM, Andrew Cooper wrote:

On 14/11/14 09:37, Juergen Gross wrote:

The x86 struct arch_shared_info field pfn_to_mfn_frame_list_list
currently contains the mfn of the top level page frame of the 3 level
p2m tree, which is used by the Xen tools during saving and restoring
(and live migration) of pv domains and for crash dump analysis. With
three levels of the p2m tree it is possible to support up to 512 GB of
RAM for a 64 bit pv domain.

A 32 bit pv domain can support more, as each memory page can hold 1024
instead of 512 entries, leading to a limit of 4 TB.

To be able to support more RAM on x86-64 switch to a virtual mapped
p2m list.

This patch expands struct arch_shared_info with a new p2m list virtual
address and the mfn of the page table root. The new information is
indicated by the domain to be valid by storing ~0UL into
pfn_to_mfn_frame_list_list. The hypervisor indicates usability of this
feature by a new flag XENFEAT_virtual_p2m.


How do you envisage this being used?  Are you expecting the tools to do
manual pagetable walks using xc_map_foreign_xxx() ?


Yes. Not very different compared to today's mapping via the 3 level
p2m tree. Just another entry format, 4 instead of 3 levels and starting
at an offset.
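
Roughly like this (a sketch only, not the actual tool code; error
handling, huge pages and the offset handling are omitted):

#include <stdint.h>
#include <sys/mman.h>
#include <xenctrl.h>

#define P2M_PAGE_SIZE 4096

/* Resolve the mfn backing one page of the linear p2m list by walking
 * the guest's own 4-level page tables from the announced root mfn. */
static unsigned long p2m_frame_mfn(xc_interface *xch, uint32_t domid,
                                   unsigned long root_mfn, uint64_t vaddr)
{
    unsigned long mfn = root_mfn;
    int level;

    for ( level = 3; level >= 0; level-- )      /* L4 down to L1 */
    {
        uint64_t *tab = xc_map_foreign_range(xch, domid, P2M_PAGE_SIZE,
                                             PROT_READ, mfn);
        uint64_t pte = tab[(vaddr >> (12 + 9 * level)) & 0x1ff];

        munmap(tab, P2M_PAGE_SIZE);
        mfn = (pte & ((1ULL << 52) - 1)) >> 12; /* strip flag bits */
    }

    return mfn;
}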


Juergen


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH 1/4] expand x86 arch_shared_info to support linear p2m list

2014-11-14 Thread Juergen Gross

On 11/14/2014 03:59 PM, Andrew Cooper wrote:

On 14/11/14 14:14, Jürgen Groß wrote:

On 11/14/2014 02:56 PM, Andrew Cooper wrote:

On 14/11/14 12:53, Juergen Gross wrote:

On 11/14/2014 12:41 PM, Andrew Cooper wrote:

On 14/11/14 09:37, Juergen Gross wrote:

The x86 struct arch_shared_info field pfn_to_mfn_frame_list_list
currently contains the mfn of the top level page frame of the 3 level
p2m tree, which is used by the Xen tools during saving and restoring
(and live migration) of pv domains and for crash dump analysis. With
three levels of the p2m tree it is possible to support up to 512
GB of
RAM for a 64 bit pv domain.

A 32 bit pv domain can support more, as each memory page can hold
1024
instead of 512 entries, leading to a limit of 4 TB.

To be able to support more RAM on x86-64 switch to a virtual mapped
p2m list.

This patch expands struct arch_shared_info with a new p2m list
virtual
address and the mfn of the page table root. The new information is
indicated by the domain to be valid by storing ~0UL into
pfn_to_mfn_frame_list_list. The hypervisor indicates usability of
this
feature by a new flag XENFEAT_virtual_p2m.


How do you envisage this being used?  Are you expecting the tools
to do
manual pagetable walks using xc_map_foreign_xxx() ?


Yes. Not very different compared to today's mapping via the 3 level
p2m tree. Just another entry format, 4 instead of 3 levels and starting
at an offset.


Yes - David and I were discussing this over lunch, and it is not
actually very different.

In reality, how likely is it that the pages backing this virtual linear
array change?


Very unlikely, I think. But not impossible.


One issue currently is that, during the live part of migration, the
toolstack has no way of working out whether the structure of the p2m has
changed (intermediate leaves rearranged, or the length increasing).

In the case that the VM does change the structure of the p2m under the
feet of the toolstack, migration will either blow up in a non-subtle way
with a p2m/m2p mismatch, or in a subtle way with the receiving side
copying the new p2m over the wrong part of the new domain.

I am wondering whether, with this new p2m method, we can take sufficient
steps to be able to guarantee mishaps like this can't occur.


This should be easy: I could add a counter in arch_shared_info which is
incremented whenever a p2m mapping is being changed. The toolstack could
compare the counter values before start and at end of migration and redo
the migration (or fail) if they are different. In order to avoid races
I would have to increment the counter before and after changing the
mapping.
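
Roughly like this (sketch only; p2m_generation is a hypothetical new
field in arch_shared_info):

/* Guest side, around every change of a p2m mapping: */
void p2m_mapping_change(void)
{
    HYPERVISOR_shared_info->arch.p2m_generation++;
    wmb();          /* make the odd value visible before the change */
    /* ... remap the affected p2m page ... */
    wmb();
    HYPERVISOR_shared_info->arch.p2m_generation++;
}

/* Toolstack side, sampled before and after the transfer: */
bool p2m_unchanged(unsigned long before, unsigned long after)
{
    /* an odd value means a change was in flight when sampled */
    return before == after && !(before & 1);
}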



That is insufficient I believe.

Consider:

* Toolstack walks pagetables and maps the frames containing the linear p2m
* Live migration starts
* VM remaps a frame in the middle of the linear p2m
* Live migration continues, but the toolstack has a stale frame in the
middle of its view of the p2m.


This would be covered by my suggestion. At the end of the memory
transfer (with some bogus contents) the toolstack would discover the
change of the p2m structure and either fail the migration or start it
from the beginning and thus overwriting the bogus frames.


As the p2m is almost never expected to change, I think it might be
better to have a flag the toolstack can set to say "The toolstack is
peeking at your p2m behind your back - you must not change its structure."


Be careful here: changes of the structure can be due to two scenarios:
- ballooning (invalid entries being populated): this is no problem, as
  we can stop the ballooning during live migration.
- mapping of grant pages e.g. in a stub domain (first map in an area
  formerly marked as invalid): you can't stop this, as the stub domain
  has to do some work. Here a restart of the migration should work, as
  the p2m structure change can only happen once for each affected p2m
  page.


Having just thought this through, I think there is also a race condition
between a VM changing an entry in the p2m, and the toolstack doing
verifications of frames being sent.


Okay, so the flag you mentioned should just prohibit changes in the
p2m list related to memory frames of the affected domain: ballooning
up or down, or rearranging the memory layout (does this happen today?).
Mapping and unmapping of grant pages should be still allowed.
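
Purely illustrative (p2m_frozen is a hypothetical flag name), the
ballooning path could then look like:

static int balloon_change_reservation(long delta_pages)
{
    if ( HYPERVISOR_shared_info->arch.p2m_frozen )
        return -EBUSY;  /* toolstack is reading the p2m, retry later */

    /* ... populate or depopulate p2m entries as usual ... */
    return 0;
}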


Juergen

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH V3 2/8] xen: Delay remapping memory of pv-domain

2014-11-14 Thread Juergen Gross

On 11/14/2014 05:47 PM, Konrad Rzeszutek Wilk wrote:

On Fri, Nov 14, 2014 at 05:53:19AM +0100, Juergen Gross wrote:

On 11/13/2014 08:56 PM, Konrad Rzeszutek Wilk wrote:

+   mfn_save = virt_to_mfn(buf);
+
+   while (xen_remap_mfn != INVALID_P2M_ENTRY) {


So the 'list' is constructed by going forward - that is from low-numbered
PFNs to higher numbered ones. But the 'xen_remap_mfn' is going the
other way - from the highest PFN to the lowest PFN.

Won't that mean we will restore the chunks of memory in the wrong
order? That is we will still restore them in chunks size, but the
chunks will be in descending order instead of ascending?


No, the information where to put each chunk is contained in the chunk
data. I can add a comment explaining this.


Right, the MFNs in a "chunk" are going to be restored in the right order.

I was thinking that the "chunks" (so a set of MFNs) will be restored in
the opposite order that they are written to.

And oddly enough the "chunks" are done in 512-3 = 509 MFNs at once?


More don't fit on a single page due to the other info needed. So: yes.


But you could use two pages - one for the structure and the other
for the list of MFNs. That would fix the problem of having only
509 MFNs being contiguous per chunk when restoring.


That's no problem (see below).


Anyhow the point I had that I am worried is that we do not restore the
MFNs in the same order. We do it in "chunk" size which is OK (so the 509 MFNs
at once)- but the order we traverse the restoration process is the opposite of
the save process. Say we have 4MB of contiguous MFNs, so two (err, three)
chunks. The first one we iterate is from 0->509, the second is 510->1018, the
last is 1019->1023. When we restore (remap) we start with the last 'chunk'
so we end up restoring them: 1019->1023, 510->1018, 0->509 order.


No. When building up the chunks we save in each chunk where to put it
on remap. So in your example 0-509 should be mapped at +0,
510-1018 at +510, and 1019-1023 at +1019.

When remapping we map 1019-1023 to +1019, 510-1018 at +510
and last 0-509 at +0. So we do the mapping in reverse order, but
to the correct pfns.
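
An illustrative layout (the real patch encodes it slightly differently;
the names here are made up): each chunk page carries its own target pfn,
so the list can be walked in any order:

#define REMAP_CHUNK_MFNS 509                /* 512 slots minus 3 header words */

struct remap_chunk {
    unsigned long next_mfn;                 /* next chunk; INVALID_P2M_ENTRY ends the list */
    unsigned long target_pfn;               /* first pfn this chunk maps to */
    unsigned long count;                    /* valid entries in mfn[] */
    unsigned long mfn[REMAP_CHUNK_MFNS];    /* exactly fills a 4 KB page on x86-64 */
};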

Juergen

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH 1/4] expand x86 arch_shared_info to support linear p2m list

2014-11-17 Thread Juergen Gross

On 11/14/2014 05:08 PM, Andrew Cooper wrote:

On 14/11/14 15:32, Juergen Gross wrote:

On 11/14/2014 03:59 PM, Andrew Cooper wrote:

On 14/11/14 14:14, Jürgen Groß wrote:

On 11/14/2014 02:56 PM, Andrew Cooper wrote:

On 14/11/14 12:53, Juergen Gross wrote:

On 11/14/2014 12:41 PM, Andrew Cooper wrote:

On 14/11/14 09:37, Juergen Gross wrote:

The x86 struct arch_shared_info field pfn_to_mfn_frame_list_list
currently contains the mfn of the top level page frame of the 3
level
p2m tree, which is used by the Xen tools during saving and
restoring
(and live migration) of pv domains and for crash dump analysis.
With
three levels of the p2m tree it is possible to support up to 512
GB of
RAM for a 64 bit pv domain.

A 32 bit pv domain can support more, as each memory page can hold
1024
instead of 512 entries, leading to a limit of 4 TB.

To be able to support more RAM on x86-64 switch to a virtual mapped
p2m list.

This patch expands struct arch_shared_info with a new p2m list
virtual
address and the mfn of the page table root. The new information is
indicated by the domain to be valid by storing ~0UL into
pfn_to_mfn_frame_list_list. The hypervisor indicates usability of
this
feature by a new flag XENFEAT_virtual_p2m.


How do you envisage this being used?  Are you expecting the tools
to do
manual pagetable walks using xc_map_foreign_xxx() ?


Yes. Not very different compared to today's mapping via the 3 level
p2m tree. Just another entry format, 4 instead of 3 levels and
starting
at an offset.


Yes - David and I were discussing this over lunch, and it is not
actually very different.

In reality, how likely is it that the pages backing this virtual
linear
array change?


Very unlikely, I think. But not impossible.


One issue currently is that, during the live part of migration, the
toolstack has no way of working out whether the structure of the
p2m has
changed (intermediate leaves rearranged, or the length increasing).

In the case that the VM does change the structure of the p2m under the
feet of the toolstack, migration will either blow up in a
non-subtle way
with a p2m/m2p mismatch, or in a subtle way with the receiving side
copying the new p2m over the wrong part of the new domain.

I am wondering whether, with this new p2m method, we can take
sufficient
steps to be able to guarantee mishaps like this can't occur.


This should be easy: I could add a counter in arch_shared_info which is
incremented whenever a p2m mapping is being changed. The toolstack
could
compare the counter values before start and at end of migration and
redo
the migration (or fail) if they are different. In order to avoid races
I would have to increment the counter before and after changing the
mapping.



That is insufficient I believe.

Consider:

* Toolstack walks pagetables and maps the frames containing the
linear p2m
* Live migration starts
* VM remaps a frame in the middle of the linear p2m
* Live migration continues, but the toolstack has a stale frame in the
middle of its view of the p2m.


This would be covered by my suggestion. At the end of the memory
transfer (with some bogus contents) the toolstack would discover the
change of the p2m structure and either fail the migration or start it
from the beginning and thus overwriting the bogus frames.


Checking after pause is too late.  The content of the p2m is used verify
each frame being sent on the wire, so is in active use for the entire
duration of live migration.

If the toolstack starts verifying frames being sent using information
from a stale p2m, the best that can be hoped for is that the toolstack
declares that the p2m and m2p are inconsistent and abort the migrate.




As the p2m is almost never expected to change, I think it might be
better to have a flag the toolstack can set to say "The toolstack is
peeking at your p2m behind your back - you must not change its
structure."


Be careful here: changes of the structure can be due to two scenarios:
- ballooning (invalid entries being populated): this is no problem, as
   we can stop the ballooning during live migration.
- mapping of grant pages e.g. in a stub domain (first map in an area
   formerly marked as invalid): you can't stop this, as the stub domain
   has to do some work. Here a restart of the migration should work, as
   the p2m structure change can only happen once for each affected p2m
   page.


Migration is not at all possible with a domain referencing foreign frames.

The live part can cope with foreign frames referenced in the ptes.  As
part of the pause handling in the VM, the frontends must unmap any
grants they have.  After pause, any remaining foreign frames cause a
migration failure.




Having just thought this through, I think there is also a race condition
between a VM changing an entry in the p2m, and the toolstack doing
verifications of frames being sent.


Okay, so the flag you mentioned should just prohibit changes in the
p2m list related to memory frames of the affected domain: ballooning
up or down, or rearranging the memory layout (does this happen today?).
Mapping and unmapping of grant pages should be still allowed.

Re: [Xen-devel] [PATCH 3/4] introduce boot parameter for setting XENFEAT_virtual_p2m

2014-11-19 Thread Juergen Gross

On 11/19/2014 10:04 PM, Konrad Rzeszutek Wilk wrote:

On Fri, Nov 14, 2014 at 10:37:25AM +0100, Juergen Gross wrote:

Introduce a new boot parameter "virt_p2m" to be able to set
XENFEAT_virtual_p2m for a pv domain.

As long as Xen tools and kdump don't support this new feature it is
turned off by default.


Couldn't the dom0_large and dom0 be detected automatically? That is
the dom0 could advertise it can do large-dom0 support and Xen would
automatically switch to the right mode?


No, that's not the problem. Xen has to indicate it is capable to handle
the new mode. At dom0 construction time the dom0 kernel can't know about
the capability of kdump to handle the new mode.

In case the new interface is accepted I'll set up some kdump patches to
handle it. We can switch to dom0/dom0_large set on default if they are
accepted on time (e.g. at the time the kernel support for the new
interface is put in place).
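
The Xen side parsing would be straightforward; a sketch, assuming the
option values discussed here ("dom0", "dom0_large"):

static bool_t __initdata virt_p2m_dom0;
static bool_t __initdata virt_p2m_dom0_large;

static void __init parse_virt_p2m(const char *s)
{
    if ( !strcmp(s, "dom0") )
        virt_p2m_dom0 = 1;
    else if ( !strcmp(s, "dom0_large") )
        virt_p2m_dom0 = virt_p2m_dom0_large = 1;
}
custom_param("virt_p2m", parse_virt_p2m);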


Juergen


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH V3 2/8] xen: Delay remapping memory of pv-domain

2014-11-19 Thread Juergen Gross

On 11/19/2014 08:43 PM, Konrad Rzeszutek Wilk wrote:

On Fri, Nov 14, 2014 at 06:14:06PM +0100, Juergen Gross wrote:

On 11/14/2014 05:47 PM, Konrad Rzeszutek Wilk wrote:

On Fri, Nov 14, 2014 at 05:53:19AM +0100, Juergen Gross wrote:

On 11/13/2014 08:56 PM, Konrad Rzeszutek Wilk wrote:

+   mfn_save = virt_to_mfn(buf);
+
+   while (xen_remap_mfn != INVALID_P2M_ENTRY) {


So the 'list' is constructed by going forward - that is from low-numbered
PFNs to higher numbered ones. But the 'xen_remap_mfn' is going the
other way - from the highest PFN to the lowest PFN.

Won't that mean we will restore the chunks of memory in the wrong
order? That is we will still restore them in chunks size, but the
chunks will be in descending order instead of ascending?


No, the information where to put each chunk is contained in the chunk
data. I can add a comment explaining this.


Right, the MFNs in a "chunk" are going to be restored in the right order.

I was thinking that the "chunks" (so a set of MFNs) will be restored in
the opposite order that they are written to.

And oddly enough the "chunks" are done in 512-3 = 509 MFNs at once?


More don't fit on a single page due to the other info needed. So: yes.


But you could use two pages - one for the structure and the other
for the list of MFNs. That would fix the problem of having only
509 MFNs being contiguous per chunk when restoring.


That's no problem (see below).


Anyhow the point I had that I am worried is that we do not restore the
MFNs in the same order. We do it in "chunk" size which is OK (so the 509 MFNs
at once)- but the order we traverse the restoration process is the opposite of
the save process. Say we have 4MB of contiguous MFNs, so two (err, three)
chunks. The first one we iterate is from 0->509, the second is 510->1018, the
last is 1019->1023. When we restore (remap) we start with the last 'chunk'
so we end up restoring them: 1019->1023, 510->1018, 0->509 order.


No. When building up the chunks we save in each chunk where to put it
on remap. So in your example 0-509 should be mapped at +0,
510-1018 at +510, and 1019-1023 at +1019.

When remapping we map 1019-1023 to +1019, 510-1018 at +510
and last 0-509 at +0. So we do the mapping in reverse order, but
to the correct pfns.


Excellent! Could a condensed version of that explanation be put in the code ?


Sure.

Juergen


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH V3 0/8] xen: Switch to virtual mapped linear p2m list

2014-11-19 Thread Juergen Gross

On 11/19/2014 09:41 PM, Konrad Rzeszutek Wilk wrote:

On Tue, Nov 11, 2014 at 06:43:38AM +0100, Juergen Gross wrote:

Paravirtualized kernels running on Xen use a three level tree for
translation of guest specific physical addresses to machine global
addresses. This p2m tree is used for construction of page table
entries, so the p2m tree walk is performance critical.

By using a linear virtual mapped p2m list accesses to p2m elements
can be sped up while even simplifying code. To achieve this goal
some p2m related initializations have to be performed later in the
boot process, as the final p2m list can be set up only after basic
memory management functions are available.
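
To illustrate the speedup (simplified from the Linux p2m code; details
like identity and missing ranges are omitted):

/* Old: three dependent memory loads through the p2m tree. */
unsigned long get_phys_to_machine_tree(unsigned long pfn)
{
    unsigned topidx = pfn / (P2M_MID_PER_PAGE * P2M_PER_PAGE);
    unsigned mididx = (pfn / P2M_PER_PAGE) % P2M_MID_PER_PAGE;

    return p2m_top[topidx][mididx][pfn % P2M_PER_PAGE];
}

/* New: a single load once the list is virtually mapped. */
unsigned long get_phys_to_machine_linear(unsigned long pfn)
{
    return xen_p2m_addr[pfn];
}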



Hey Juergen,

I finially finished looking at the patchset. Had some comments,
some questions that I hope can make it in the patch so that in
six months or so when somebody looks at the code they can
understand the subtle pieces.


Yep.

OTOH: What was hard to write should be hard to read ;-)


Looking forward to the v4! (Thought keep in mind that next week
is Thanksgiving week so won't be able to look much after Wednesday)


Let's see how testing is going. Setting up the test system wasn't
very smooth due to some unrelated issues.




  arch/x86/include/asm/pgtable_types.h |    1 +
  arch/x86/include/asm/xen/page.h      |   49 +-
  arch/x86/mm/pageattr.c               |   20 +
  arch/x86/xen/mmu.c                   |   38 +-
  arch/x86/xen/p2m.c                   | 1315 ++
  arch/x86/xen/setup.c                 |  460 ++--
  arch/x86/xen/xen-ops.h               |    6 +-
  7 files changed, 854 insertions(+), 1035 deletions(-)


And best of - we are deleting more code!


Indeed. But it's a shame the beautiful ASCII-art in p2m.c is part of the
deletions.


Juergen

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] Buggy interaction of live migration and p2m updates

2014-11-20 Thread Juergen Gross

On 11/20/2014 07:28 PM, Andrew Cooper wrote:

Hello,

Tim, David and I were discussing this over lunch.  This email is a
(hopefully accurate) account of our findings, and potential solutions.
(If I have messed up, please shout.)

Currently, correct live migration of PV domains relies on the toolstack
(which has a live mapping of the guests p2m) not observing stale values
when the guest updates its p2m, and the race condition between a p2m
update and an m2p update.  Realistically, this means no updates to the
p2m at all, due to several potential race conditions.  Should any race
conditions happen (e.g. ballooning while live migrating), the effects
could be anything from an aborted migration to VM memory corruption.

It should be noted that migrationv2 does not fix any of this.  It alters
the way in which some race conditions might be observed.  During
development of migrationv2, there was an explicit non-requirement of
fixing the existing Ballooning+LiveMigration issues we were aware of,
although at the time, we were not aware of this specific set of issues.
Our goal was to simply make migrationv2 work in the same circumstances
as previously, but with a bitness-agnostic wire format and
forward-extensible protocol.


As far as these issues are concerned, there are two distinct p2m
modifications which we care about:
1) p2m structure changes (rearranging the layout of the p2m)
2) p2m content changes (altering entries in the p2m)

There is no possible way for the toolstack to prevent a domain from
altering its p2m.  At the moment, ballooning typically only occurs when
requested by the toolstack, but the underlying operations
(increase/decrease_reservation, mem_exchange, etc) can be used by the
guest at any point.  This includes Wei's guest memory fragmentation
changes.  Changes to the content of the p2m also occur for grant map and
unmap operations.


Currently in PV guests, the p2m is implemented using a 3-level tree,
with its root in the guest's shared_info page.  It provides a hard VM
memory limit of 4TB for 32bit PV guests (which is far higher than the
128GB limit from the compat p2m mappings), or 512GB for 64bit PV guests.

Juergen has a proposed new p2m interface using a virtual linear
mapping.  This is conceptually similar to the previous implementation
(which is fine from the toolstacks point of view), but far less
complicated from the guests point of view, and removes the memory limits
imposed by the p2m structure.

The new virtual linear mapping suffers from the same interaction issues
as the old 3-level tree did, but the introduction of the new interface
affords us an opportunity to make all API modifications at once to
reduce churn.


During live migration, the toolstack maps the guest's p2m into a linear
mapping in the toolstack's virtual address space.  This is done once at
the start of migration, and never subsequently altered.  During live
migration, the p2m is cross-verified with the m2p, and frames are sent
using pfns as a reference, as they will be located in different frames
on the receiving side.

Should the guest change the p2m structure during live migration, the
toolstack ends up with a stale p2m with a non-p2m frame in the middle,
resulting in bogus cross-referencing.  Should the guest change an entry
in the p2m, the p2m frame itself will be resent as it would be marked as
dirty in the logdirty bitmap, but the target pfn will remain unsent and
probably stale on the receiving side.


Another factor which needs to be taken into account is Remus/COLO, which
run the domains under live migration conditions for the duration of
their lifetime.

During the live part of migration, the toolstack already has to be able
to tolerate failures to normalise the pagetables, which occur as a
consequence of the pagetables being in active use.  These failures are fatal
on the final iteration after the guest has been paused, but the same
logic could be extended to p2m/m2p issues, if needed.


There are several potential solutions to these problems.

1) Freeze the guest's p2m during live migrate

This is the simplest sounding option, but is quite problematic from the
point of view of the guest.  It is essentially a shared spinlock between
the toolstack and the guest kernel.  It would prevent any grant
map/unmap operations from occurring, and might interact badly with
certain p2m updated in the guest which would previously be expected to
unconditionally succeed.

Pros) (Can't think of any)
Cons) Not easy to implement (even conceptually), requires invasive guest
changes, will cripple Remus/COLO


2) Deep p2m dirty tracking

In the case that a p2m frame is discovered dirty in the logdirty bitmap,
we can be certain that a write has occurred to it, and in the common
case, means that the mapping has changed.  The toolstack could maintain
a non-live copy of the p2m which is updated as new frames are sent.
When a dirty p2m frame is found, the live and non-live copies can be
consulted to find which pfn mappings have changed, a

[Xen-devel] Hypervisor error messages after xl block-detach with linux 3.18-rc5

2014-11-21 Thread Juergen Gross

Hi,

while testing my "linear p2m list" patches I saw the following
problem (even without my patches in place):

In dom0 running linux 3.18-rc5 on top of Xen 4.4.1 I modified the
disk image of a guest by attaching it to dom0:

xl block-attach 0 file:/var/lib/libvirt/images/opensuse13-1/xvda,xvda,w
mount /dev/xvda2 /mnt
...
umount /mnt
xl block-detach 0 xvda

Worked without any problem. After some seconds the following messages
were issued on the console:

(XEN) mm.c:2352:d0 Bad type (saw 7402 != exp 1000) for mfn 61110 (pfn 1f3f21c)
(XEN) mm.c:2995:d0 Error while pinning mfn 61110
(XEN) mm.c:2352:d0 Bad type (saw 7402 != exp 1000) for mfn 61110 (pfn 1f3f21c)
(XEN) mm.c:906:d0 Attempt to create linear p.t. with write perms
(XEN) mm.c:2352:d0 Bad type (saw 7402 != exp 1000) for mfn 6 (pfn 1f3f21d)
(XEN) mm.c:2995:d0 Error while pinning mfn 6
(XEN) mm.c:2352:d0 Bad type (saw 7402 != exp 1000) for mfn 6 (pfn 1f3f21d)
(XEN) mm.c:906:d0 Attempt to create linear p.t. with write perms
(XEN) mm.c:2352:d0 Bad type (saw 7402 != exp 1000) for mfn 61120 (pfn 1f3f22c)
(XEN) mm.c:2995:d0 Error while pinning mfn 61120
(XEN) mm.c:2352:d0 Bad type (saw 7402 != exp 1000) for mfn 61120 (pfn 1f3f22c)
(XEN) mm.c:906:d0 Attempt to create linear p.t. with write perms
(XEN) mm.c:2352:d0 Bad type (saw 7402 != exp 1000) for mfn 61121 (pfn 1f3f22d)
(XEN) mm.c:2995:d0 Error while pinning mfn 61121
(XEN) mm.c:2352:d0 Bad type (saw 7402 != exp 1000) for mfn 61121 (pfn 1f3f22d)
(XEN) mm.c:906:d0 Attempt to create linear p.t. with write perms
(XEN) mm.c:2352:d0 Bad type (saw 7402 != exp 1000) for mfn 61102 (pfn 1f3f20e)
(XEN) mm.c:2995:d0 Error while pinning mfn 61102
(XEN) mm.c:2352:d0 Bad type (saw 7402 != exp 1000) for mfn 61102 (pfn 1f3f20e)
(XEN) mm.c:906:d0 Attempt to create linear p.t. with write perms
(XEN) mm.c:2352:d0 Bad type (saw 7402 != exp 1000) for mfn 61103 (pfn 1f3f20f)
(XEN) mm.c:2995:d0 Error while pinning mfn 61103
(XEN) mm.c:2352:d0 Bad type (saw 7402 != exp 1000) for mfn 61103 (pfn 1f3f20f)
(XEN) mm.c:906:d0 Attempt to create linear p.t. with write perms

Is this a known issue?


Juergen

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


[Xen-devel] WARNings in guest during xl save/restore

2014-11-21 Thread Juergen Gross

Hi,

during tests of my "linear p2m list" patches I stumbled over some
WARNs issued during xl save and xl restore of a pv-domU with
unpatched linux 3.18-rc5:

during save I saw multiple entries like:
[  176.900393] WARNING: CPU: 0 PID: 9 at arch/x86/xen/enlighten.c:968 clear_local_APIC+0xa5/0x2b0()
[  176.900393] Modules linked in: cfg80211 rfkill nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc evdev x86_pkg_temp_thermal thermal_sys snd_pcm coretemp snd_timer crc32_pclmul aesni_intel snd xts soundcore aes_i586 lrw gf128mul ablk_helper pcspkr cryptd fuse autofs4 ext4 crc16 mbcache jbd2 crc32c_intel
[  176.900393] CPU: 0 PID: 9 Comm: migration/0 Tainted: G W 3.18.0-rc5 #30
[  176.900393]  0009 c14c40b2  c1054b10 c1599538  0009 c158bdc2
[  176.900393]  03c8 c103c925 c103c925 03c8 0002  c15d25eb e8867e64
[  176.900393]  c1054bd9 0009  c103c925  c103cb54 0002 

[  176.900393] Call Trace:
[  176.900393]  [] ? dump_stack+0x3e/0x4e
[  176.900393]  [] ? warn_slowpath_common+0x90/0xc0
[  176.900393]  [] ? clear_local_APIC+0xa5/0x2b0
[  176.900393]  [] ? clear_local_APIC+0xa5/0x2b0
[  176.900393]  [] ? warn_slowpath_null+0x19/0x20
[  176.900393]  [] ? clear_local_APIC+0xa5/0x2b0
[  176.900393]  [] ? disable_local_APIC+0x24/0x90
[  176.900393]  [] ? lapic_suspend+0x11e/0x170
[  176.900393]  [] ? syscore_suspend+0x79/0x220
[  176.900393]  [] ? set_next_entity+0x62/0x80
[  176.900393]  [] ? xen_suspend+0x2d/0x110
[  176.900393]  [] ? xen_mc_flush+0x13f/0x170
[  176.900393]  [] ? multi_cpu_stop+0xa9/0xd0
[  176.900393]  [] ? cpu_stop_should_run+0x50/0x50
[  176.900393]  [] ? cpu_stopper_thread+0x71/0x100
[  176.900393]  [] ? finish_task_switch+0x34/0xd0
[  176.900393]  [] ? __schedule+0x23d/0x7f0
[  176.900393]  [] ? __wake_up_common+0x44/0x70
[  176.900393]  [] ? _raw_spin_lock_irqsave+0x12/0x60
[  176.900393]  [] ? smpboot_thread_fn+0xd2/0x170
[  176.900393]  [] ? SyS_setgroups+0x110/0x110
[  176.900393]  [] ? kthread+0xa1/0xc0
[  176.900393]  [] ? ret_from_kernel_thread+0x21/0x30
[  176.900393]  [] ? kthread_create_on_node+0x120/0x120
[  176.900393] ---[ end trace b38596d5cfdcde8d ]---

and during restore:
[  176.900393] WARNING: CPU: 0 PID: 9 at arch/x86/xen/enlighten.c:968 lapic_resume+0xc6/0x270()
[  176.900393] Modules linked in: cfg80211 rfkill nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc evdev x86_pkg_temp_thermal thermal_sys snd_pcm coretemp snd_timer crc32_pclmul aesni_intel snd xts soundcore aes_i586 lrw gf128mul ablk_helper pcspkr cryptd fuse autofs4 ext4 crc16 mbcache jbd2 crc32c_intel
[  176.900393] CPU: 0 PID: 9 Comm: migration/0 Tainted: G W 3.18.0-rc5 #30
[  176.900393]  0009 c14c40b2  c1054b10 c1599538  0009 c158bdc2
[  176.900393]  03c8 c103c1e6 c103c1e6 03c8 c1030020 0002 001b 
[  176.900393]  c1054bd9 0009  c103c1e6  c16432c0 0108cdfe c15d25dc

[  176.900393] Call Trace:
[  176.900393]  [] ? dump_stack+0x3e/0x4e
[  176.900393]  [] ? warn_slowpath_common+0x90/0xc0
[  176.900393]  [] ? lapic_resume+0xc6/0x270
[  176.900393]  [] ? lapic_resume+0xc6/0x270
[  176.900393]  [] ? mcheck_cpu_init+0x170/0x4f0
[  176.900393]  [] ? warn_slowpath_null+0x19/0x20
[  176.900393]  [] ? lapic_resume+0xc6/0x270
[  176.900393]  [] ? syscore_resume+0x46/0x160
[  176.900393]  [] ? xen_timer_resume+0x42/0x60
[  176.900393]  [] ? xen_suspend+0x7c/0x110
[  176.900393]  [] ? multi_cpu_stop+0xa9/0xd0
[  176.900393]  [] ? cpu_stop_should_run+0x50/0x50
[  176.900393]  [] ? cpu_stopper_thread+0x71/0x100
[  176.900393]  [] ? finish_task_switch+0x34/0xd0
[  176.900393]  [] ? __schedule+0x23d/0x7f0
[  176.900393]  [] ? __wake_up_common+0x44/0x70
[  176.900393]  [] ? _raw_spin_lock_irqsave+0x12/0x60
[  176.900393]  [] ? smpboot_thread_fn+0xd2/0x170
[  176.900393]  [] ? SyS_setgroups+0x110/0x110
[  176.900393]  [] ? kthread+0xa1/0xc0
[  176.900393]  [] ? ret_from_kernel_thread+0x21/0x30
[  176.900393]  [] ? kthread_create_on_node+0x120/0x120
[  176.900393] ---[ end trace b38596d5cfdcde93 ]---

While this seems not to be critical (the system is running after the
restore) I assume disabling/enabling a local APIC on a pv-domain isn't
something we want to happen...


Juergen

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] Buggy interaction of live migration and p2m updates

2014-11-21 Thread Juergen Gross

On 11/21/2014 12:15 PM, Ian Campbell wrote:

On Fri, 2014-11-21 at 11:07 +, Andrew Cooper wrote:

On 21/11/14 10:46, Ian Campbell wrote:

On Fri, 2014-11-21 at 10:24 +, Andrew Cooper wrote:

On 21/11/14 09:43, Ian Campbell wrote:

I don't see any (explicit) mention of the pfn_to_mfn_frame_list_list
here, where does that fit in?


It is referenced several times, although not by its exact name.

Hence no explicit mention.

It's ambiguous when you refer to "higher level frames" (which I presume
are the reference you are referring to) because some kernels (perhaps
only historic ones, I've not been keeping up) keep both an N-level tree
of their own internally and the toolstack visible frame_list_list
(sometimes partially overlapping at some level). Is every reference to
"higher level frames" actually intended to be a reference to
pfn_to_mfn_frame_list_list or not?


"higher level frames" would be the toolstack-abi-defined first and
second level lists.  The logdirty infrastructure can be used to detect
writes to these frames, and therefore detect structural changes to the p2m.

I would like to hope that every kernel out there keeps this information
correctly up-to-date and updates it in an appropriate order...
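
Concretely, something like this on the toolstack side (an illustrative
helper, not existing code):

#include <stdbool.h>
#include <stdint.h>

/* Returns true if any frame of the first/second level frame lists is
 * marked dirty, i.e. the p2m structure may have changed. */
static bool p2m_structure_dirty(const unsigned long *logdirty,
                                const uint64_t *frame_list_pfns,
                                unsigned int nr_frames)
{
    const unsigned int bpw = 8 * sizeof(unsigned long);
    unsigned int i;

    for ( i = 0; i < nr_frames; i++ )
    {
        uint64_t pfn = frame_list_pfns[i];

        if ( logdirty[pfn / bpw] & (1UL << (pfn % bpw)) )
            return true;
    }

    return false;
}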



It seems like sometimes you are talking at times about tracking the
kernel's internal structure and not just pfn_to_mfn_frame_list_list and
I'm not sure why that would be.


I apologise for giving this impression.  It was not intended.


Great, I just wanted to be sure we were all on the same page, since
scrobbling around in the kernel's internal data structures would clearly
be mad...




I'm also not sure why
pfn_to_mfn_frame_list_list is apparently discounted in the linear case,
AFAIK the guest is still obliged to keep that up to date regardless of
the scheme it uses internally for accessing the p2m.


There are two reasons for the virtual linear p2m, the primary one being
to break the hard 512GB limit given the old 3-level table.

A 64bit PV guest cannot possibly use the pfn_to_mfn_frame_list_list if
it needs to actually exceed 512GB of RAM.  Therefore, to signal the use
the virtual linear method, a PV guest explicitly sets
pfn_to_mfn_frame_list_list to INVALID_MFN, and fills in the brand new
adjacent information.


Oh, I hadn't realised this linear p2m stuff involved a guest ABI change.
Have I somehow completely missed the xen.git side of these patches? I
thought I'd only seen linux.git ones (and hence wasn't looking very
closely).


V1 of the patches suggesting such a change have been posted a week ago:

http://lists.xen.org/archives/html/xen-devel/2014-11/msg01276.html

The linear p2m stuff is a prerequisite for this change, not the reason
for it.


Juergen

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH 1/4] expand x86 arch_shared_info to support linear p2m list

2014-11-21 Thread Juergen Gross

On 11/21/2014 01:23 PM, Jan Beulich wrote:

On 14.11.14 at 10:37, <jgr...@suse.com> wrote:

--- a/xen/include/public/arch-x86/xen.h
+++ b/xen/include/public/arch-x86/xen.h
@@ -224,7 +224,12 @@ struct arch_shared_info {
  /* Frame containing list of mfns containing list of mfns containing p2m. */
  xen_pfn_t pfn_to_mfn_frame_list_list;
  unsigned long nmi_reason;
-uint64_t pad[32];
+/*
+ * Following two fields are valid if pfn_to_mfn_frame_list_list contains
+ * ~0UL.
+ */
+unsigned long p2m_vaddr;/* virtual address of the p2m list */
+unsigned long p2m_as_root;  /* mfn of the top level page table */


xen_pfn_t please. And what does the "as" in the name stand for?


"as" is address space. I can rename it to e.g. "p2m_pgd_mfn".


It's also kind of unclear in the description what "the page table root"
is, as I don't think there are many OSes which use just a single set
of page tables (i.e. just a single address space). Not having followed
the discussion closely - what is this needed for anyway?


It's a replacement of the pfn_to_mfn_frame_list_list using the same
page table as the kernel for accessing the p2m list. We need the root
of the page table and the virtual address of the p2m list.




--- a/xen/include/public/features.h
+++ b/xen/include/public/features.h
@@ -99,6 +99,9 @@
  #define XENFEAT_grant_map_identity12
   */

+/* x86: guest may specify virtual address of p2m list */
+#define XENFEAT_virtual_p2m   13


The name to me suggests something that's not real. Perhaps better
XENFEAT_virtually_mapped_p2m or XENFEAT_p2m_va{,ddr}?


Yeah, that's better. I'll use XENFEAT_p2m_vaddr.
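
So the public header would end up looking roughly like this (a sketch
with the renames applied, not the final committed form):

/* Both fields are valid only if pfn_to_mfn_frame_list_list == ~0UL. */
unsigned long p2m_vaddr;        /* virtual address of the p2m list */
xen_pfn_t     p2m_pgd_mfn;      /* mfn of the top level page table */

/* x86: guest may specify virtual address of p2m list */
#define XENFEAT_p2m_vaddr 13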


Juergen


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


[Xen-devel] PCI-passthrough for 32 bit guests and high MMIO addresses

2014-11-21 Thread Juergen Gross

Hi,

again a fallout from my "linear p2m list" tests:

Trying to do PCI-passthrough with a 32-bit pv-domain I passed the
wrong device to the domain. The MMIO address was too large for a
MFN of a 32-bit system (it was 38000320-3800036f).

Instead of rejecting the operation, Xen tried to perform it, resulting
in a (quite understandable) failure in the domU.

I think either the hypervisor or the tools should refuse to do
PCI-passthrough in this case.
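
The check itself would be simple (a sketch with illustrative names): a
32-bit pv guest stores MFNs in a 32-bit unsigned long, so any MMIO frame
number at or above 2^32 can't be represented:

static bool mmio_mfns_fit_guest(uint64_t end_addr, bool guest_is_32bit)
{
    uint64_t last_mfn = end_addr >> 12;     /* 4 KB frames */

    return !guest_is_32bit || last_mfn < (1ULL << 32);
}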


Juergen

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

