date:20150911

On Fri, 2015-09-11 at 16:34 +0100, Ian Jackson wrote:
> From 29e08dfa3a5c5a5aeb51fd01c67345e20cbb33c5 Mon Sep 17 00:00:00 2001
> From: Ian Jackson 
> Date: Fri, 11 Sep 2015 16:27:08 +0100
> Subject: [OSSTEST PATCH] cs-bisection-step: Cope with graph-out (testids)
>  containing ( ) etc.
> 
> cr-try-bisect launders / in the testid but relies on other characters
> being handled appropriately by cs-bisection-step.  So for example it
> can pass
> 
>   graph-out=/home/logs/results/bisect/linux-linus/test-armhf-armhf-xl-arndale.leak-check--basis(8)
> 
> But cs-bisection-step foolishly assumed that the --graph-out argument
> did not contain any shell metacharacters.  Fix this.
> 
> Specifically:
> 
>  * Change invocations of perl's open to use the 3-argument form
>  * Change invocations of system to pass individual arguments rather
>    than constructing a shell script fragment and relying on the shell
>    to split it up.
>  * In particular, in the png processing pipeline, use the "sh -ec
>

[Xen-devel] [PATCH RFC v3 3/6] HVM x86 deprivileged mode: Trap handlers for deprivileged mode

Added trap handlers to catch exceptions such as a page fault, general
protection fault, etc. These handlers will crash the domain as such exceptions
would indicate that either there is a bug in deprivileged mode or it has been
compromised by an attacker.

On calling a domain_crash() whilst in deprivileged mode, we need to restore
the host's context so that we do not have guest-defined registers and values
in use after this point due to lazy loading of these values in the SVM and VMX
implementations.

Signed-off-by: Ben Catterall 

Changed since v1

 * Changed to domain_crash(), domain_crash_synchronous was used previously.
 * Updated to perform a HVM context switch on crashing a domain
 * Updated hvm_deprivileged_check_trap() to return a testable error
   code and return based on this.

Changed since v2

 * Coding style: Added space after if, for, etc.
 * hvm_deprivileged_user_mode() now returns a value to indicate success or
   failure.
---
 xen/arch/x86/hvm/deprivileged.c| 70 +-
 xen/arch/x86/traps.c   | 55 ++
 xen/include/xen/hvm/deprivileged.h | 25 +-
 3 files changed, 148 insertions(+), 2 deletions(-)

diff --git a/xen/arch/x86/hvm/deprivileged.c b/xen/arch/x86/hvm/deprivileged.c
index 5574c50..68c40ad 100644
--- a/xen/arch/x86/hvm/deprivileged.c
+++ b/xen/arch/x86/hvm/deprivileged.c
@@ -560,7 +560,7 @@ void hvm_deprivileged_destroy_vcpu(struct vcpu *vcpu)
  * This method is then jumped into to restore execution context after
  * exiting user mode.
  */
-void hvm_deprivileged_user_mode(void)
+int hvm_deprivileged_user_mode(void)
 {
 struct vcpu *vcpu = get_current();
 
@@ -576,6 +576,20 @@ void hvm_deprivileged_user_mode(void)
 
 vcpu->arch.hvm_vcpu.depriv_user_mode = 0;
 vcpu->arch.hvm_vcpu.depriv_rsp = 0;
+
+/*
+ * If we need to crash the domain at this point. We will return up the call
+ * stack, undoing any allocations and then the event testers in the exit
+ * assembly stubs will test for the SOFTIRQ_TIMER event generated by a
+ * domain_crash and will crash the domain for us.
+ */
+if ( vcpu->arch.hvm_vcpu.depriv_destroy )
+{
+domain_crash(vcpu->domain);
+return 1;
+}
+
+return 0;
 }
 
 /*
@@ -639,3 +653,57 @@ void hvm_deprivileged_finish_user_mode(void)
 
 hvm_deprivileged_finish_user_mode_asm();
 }
+
+/* Check if we are in deprivileged mode */
+int is_hvm_deprivileged_vcpu(void)
+{
+struct vcpu *v = get_current();
+
+if ( is_hvm_vcpu(v) && (v->arch.hvm_vcpu.depriv_user_mode) )
+return 1;
+
+return 0;
+}
+
+/*
+ * Crash the domain. This should not be called if there are any memory
+ * allocations which will be freed by code following its invocation in the
+ * current execution context (current stack). This is because it causes a
+ * permanent 'context switch' and the current stack will be clobbered so
+ * any allocations made which are not freed by other paths will leak.
+ * This function should only be used after deprivileged mode has been
+ * successfully switched into, otherwise, the normal domain_crash function
+ * should be used.
+ *
+ * The domain which is crashed is that of the current vcpu.
+ *
+ * To crash the domain, we need to return to our privileged stack as we may have
+ * memory allocations which need to be cleaned up. Then, after we have returned
+ * to this stack, we can then crash the domain. We set a flag which we check
+ * when returning.
+ */
+void hvm_deprivileged_crash_domain(const char *reason)
+{
+struct vcpu *vcpu = get_current();
+
+vcpu->arch.hvm_vcpu.depriv_destroy = 1;
+
+printk(XENLOG_ERR "HVM Deprivileged Mode: Crashing domain. Reason: %s\n",
+   reason);
+
+/*
+ * Restore the processor's state. We need to do the privileged return
+ * path to undo any allocations that got us to this state
+ */
+hvm_deprivileged_finish_user_mode();
+/* DOES NOT RETURN */
+}
+
+/* Handle a trap event */
+int hvm_deprivileged_check_trap(const char* func_name)
+{
+if ( is_hvm_deprivileged_vcpu() )
+hvm_deprivileged_crash_domain(func_name);
+
+return 0;
+}
diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
index 9f5a6c6..f14a845 100644
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -74,6 +74,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /*
  * opt_nmi: one of 'ignore', 'dom0', or 'fatal'.
@@ -500,6 +501,13 @@ static void do_guest_trap(
 struct trap_bounce *tb;
 const struct trap_info *ti;
 
+/*
+ * If we take the trap whilst in HVM deprivileged mode
+ * then we should crash the domain.
+ */
+if ( hvm_deprivileged_check_trap(__func__) )
+return;
+
 trace_pv_trap(trapnr, regs->eip, use_error_code, regs->error_code);
 
 tb = &v->arch.pv_vcpu.trap_bounce;
@@ -617,6 +625,13 @@ static void do_trap(struct cpu_user_regs *regs, int

[Xen-devel] [PATCH RFC v3 4/6] HVM x86 deprivileged mode: Watchdog for DoS prevention

A watchdog timer is used to prevent the deprivileged mode running for too long,
aimed at handling a bug or attempted DoS. If the watchdog has fired more than
once whilst we have been in the same deprivileged mode context, then we crash
the domain. This can be adjusted for longer running times in future.
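
The two-tick rule can be modelled outside Xen as follows. This is only a sketch: `struct example_vcpu` and the `example_*` functions are hypothetical stand-ins for the vcpu state and hooks touched by this patch, and a flag replaces the real domain crash.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the per-vcpu state used by the watchdog. */
struct example_vcpu {
    bool in_depriv_mode;          /* currently in deprivileged mode?  */
    unsigned int watchdog_count;  /* ticks seen in this depriv context */
    bool crashed;                 /* models the domain being crashed   */
};

/* Called on every watchdog tick. */
static void example_watchdog_tick(struct example_vcpu *v)
{
    if ( !v->in_depriv_mode )
        return;

    if ( v->watchdog_count )
        v->crashed = true;        /* second tick in the same context */
    else
        v->watchdog_count = 1;    /* first tick: just remember it    */
}

/* Leaving deprivileged mode resets the count for the next operation. */
static void example_leave_depriv(struct example_vcpu *v)
{
    v->in_depriv_mode = false;
    v->watchdog_count = 0;
}
```

Because the first tick may arrive at any point after entry, an operation is allowed somewhere between one and two tick periods before it is killed, which is the coarse bound the description refers to.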

Signed-off-by: Ben Catterall 

Changed since v2:
 * Coding style: Added space after if
---
 xen/arch/x86/hvm/deprivileged.c |  4 
 xen/arch/x86/nmi.c  | 17 +
 2 files changed, 21 insertions(+)

diff --git a/xen/arch/x86/hvm/deprivileged.c b/xen/arch/x86/hvm/deprivileged.c
index 68c40ad..0b02065 100644
--- a/xen/arch/x86/hvm/deprivileged.c
+++ b/xen/arch/x86/hvm/deprivileged.c
@@ -8,6 +8,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -17,6 +18,7 @@
 #include 
 #include 
 #include 
+#include 
 
 void hvm_deprivileged_init(struct domain *d, l4_pgentry_t *l4t_base)
 {
@@ -577,6 +579,8 @@ int hvm_deprivileged_user_mode(void)
 vcpu->arch.hvm_vcpu.depriv_user_mode = 0;
 vcpu->arch.hvm_vcpu.depriv_rsp = 0;
 
+vcpu->arch.hvm_vcpu.depriv_watchdog_count = 0;
+
 /*
  * If we need to crash the domain at this point. We will return up the call
  * stack, undoing any allocations and then the event testers in the exit
diff --git a/xen/arch/x86/nmi.c b/xen/arch/x86/nmi.c
index 2ab97a0..e5598a2 100644
--- a/xen/arch/x86/nmi.c
+++ b/xen/arch/x86/nmi.c
@@ -26,6 +26,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -463,9 +464,25 @@ int __init watchdog_setup(void)
 /* Returns false if this was not a watchdog NMI, true otherwise */
 bool_t nmi_watchdog_tick(const struct cpu_user_regs *regs)
 {
+struct vcpu *vcpu = current;
 bool_t watchdog_tick = 1;
 unsigned int sum = this_cpu(nmi_timer_ticks);
 
+/*
+ * If the domain has been running in deprivileged mode for two watchdog
+ * ticks, then we kill it to prevent a DoS. We use two ticks as a coarse
+ * measure as this ensures that at least a full watchdog tick duration has
+ * occurred. This means that we do not need to track entry time and do
+ * time calculations.
+ */
+if ( is_hvm_deprivileged_vcpu() )
+{
+if ( vcpu->arch.hvm_vcpu.depriv_watchdog_count )
+hvm_deprivileged_crash_domain("HVM Deprivileged domain: Domain exceeded running time.");
+else
+vcpu->arch.hvm_vcpu.depriv_watchdog_count = 1;
+}
+
 if ( (this_cpu(last_irq_sums) == sum) && watchdog_enabled() )
 {
 /*
-- 
2.1.4


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

[Xen-devel] [PATCH RFC v3 5/6] HVM x86 deprivileged mode: Syscall and deprivileged operation dispatcher

We have two operations:
1) dispatching a deprivileged mode operation
2) deprivileged mode executing a system call

For (1):
We have a table of methods which can be dispatched. All deprivileged mode
methods which can be dispatched need to be in this array. This aims to
prevent dispatching functions which are not designed for deprivileged mode
and means that we do not dispatch on an arbitrary pointer. We then dispatch
to the function pointer stored in this array. This goes via an assembly stub
in deprivileged mode which calls the function and then issues a syscall to
return to privileged mode when the operation completes. This allows the
deprivileged function to return normally.
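
The table-based dispatch described for (1) can be sketched as follows. This is a minimal illustration rather than the code from this series: `example_op_fn`, `example_op_table`, `example_dispatch` and the two operations are hypothetical names, and the assembly stub and mode switch used by the real dispatch are omitted.

```c
#include <stddef.h>

/* Hypothetical operation type: up to five register-sized parameters. */
typedef long (*example_op_fn)(unsigned long a, unsigned long b,
                              unsigned long c, unsigned long d,
                              unsigned long e);

/* Hypothetical operations standing in for real deprivileged methods. */
static long example_op_add(unsigned long a, unsigned long b,
                           unsigned long c, unsigned long d,
                           unsigned long e)
{
    (void)c; (void)d; (void)e;
    return (long)(a + b);
}

static long example_op_negate(unsigned long a, unsigned long b,
                              unsigned long c, unsigned long d,
                              unsigned long e)
{
    (void)b; (void)c; (void)d; (void)e;
    return -(long)a;
}

/* Only functions listed in this table can ever be dispatched. */
static const example_op_fn example_op_table[] = {
    example_op_add,
    example_op_negate,
};

#define EXAMPLE_NR_OPS \
    (sizeof(example_op_table) / sizeof(example_op_table[0]))

/*
 * Dispatch by table index rather than by caller-supplied pointer.
 * Returns 0 and stores the result in *ret, or -1 if op is out of range.
 */
static int example_dispatch(unsigned long op, long *ret,
                            unsigned long a, unsigned long b,
                            unsigned long c, unsigned long d,
                            unsigned long e)
{
    if ( op >= EXAMPLE_NR_OPS )
        return -1;

    *ret = example_op_table[op](a, b, c, d, e);
    return 0;
}
```

Selecting by index means an out-of-range request is rejected before any call is made, so a corrupted operation number cannot redirect execution to arbitrary code.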

For (2):
We again have a table of methods which are the syscall handlers. All
system calls which we handle need to be in this table. Deprivileged mode
passes an integer to select which operation to call. The system call is
wrapped to marshal the parameters as necessary and then jumps to a stub to
issue the syscall.

Data transfer for (1):
To pass data to deprivileged mode, we can pass up to five integer or
pointer parameters in registers. This is thanks to the 64-bit Linux calling
convention which puts these into 64-bit registers. The dispatch code takes
these parameters and arranges them so that when the deprivileged mode
operation executes, they are in the registers specified by the calling
convention. This means that it is transparent to the operation that it was
not invoked by a function call.

To pass the data which pointers correspond to, we use the deprivileged data
section. We copy this data to the section and change the pointer so that it
points into this section. Any extra parameters are also copied into the
data section.

To return data from deprivileged mode, the operation can supply a return
value which we pass through back to the caller. If extra data is needed,
which may be needed to make logical decisions after invocation of the
operation, then this is placed at the end of the data section. The caller
of the operation can then access this data. We copy back the data which we
initially copied in so that the caller sees any changes made by the callee.
NOTE: You need to handle the case where these structures can be updated
whilst in deprivileged mode.

It is necessary to clear out the data page between deprivileged mode
operations to prevent data leakage between operations which _may_ be
useful to an attacker.
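
The copy-in, copy-back and clear sequence described above can be sketched as follows. All names here (`example_run_with_copy`, `example_op_fn`, `struct example_state`, `EXAMPLE_DATA_SIZE`) are hypothetical, the data section is modelled as a caller-supplied buffer, and the operation is called directly instead of through a mode switch.

```c
#include <stddef.h>
#include <string.h>

#define EXAMPLE_DATA_SIZE 4096u  /* assumed size of the data section */

/* Hypothetical operation: receives a pointer into the data section. */
typedef long (*example_op_fn)(void *data);

/*
 * Copy obj into the data section, run op on the copy, copy the
 * (possibly modified) data back, then clear the section so nothing
 * leaks into the next operation. Returns -1 if obj does not fit.
 */
static long example_run_with_copy(void *page, size_t page_size,
                                  example_op_fn op, void *obj, size_t len)
{
    long ret;

    if ( len > page_size )
        return -1;

    memcpy(page, obj, len);        /* copy in                        */
    ret = op(page);                /* pointer now refers to the copy */
    memcpy(obj, page, len);        /* copy back the callee's changes */
    memset(page, 0, page_size);    /* clear between operations       */

    return ret;
}

/* Example payload and operation used to exercise the helper. */
struct example_state {
    int counter;
};

static long example_increment(void *data)
{
    struct example_state *s = data;

    s->counter++;
    return 0;
}
```

The caller keeps using its own object throughout; only the operation sees the copy, which is what keeps the scheme transparent to existing callers.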

Data transfer for (2):
We need to transfer data to the syscall handler and then back to the
deprivileged mode operation. To pass data, we use the same method as in (1)
for the first five parameters. For extra data, this will be placed at the
end of the data section and will be fetched by the handler.  We also use
the same method as in (1) for passing data back to the operation.

The general process to create a deprivileged mode operation is as follows:
 - Keep the old method prototype the same so that callers do not need to be
   modified. This helps to reduce the impact of this feature on the rest of the
   code base
 - Move the old code into a new depriv_F version of the function.
 - Marshal and unmarshal arguments as needed in the old function
 - Call the depriv version using depriv#n(F, params) function which is a wrapper
   around hvm_deprivileged_user_mode(F, params) in case we want to change this
   interface later or need better/extra argument marshalling.
 - Use the return code to work out what further processing is needed then return
 - Add an entry into the depriv_operation_table and add an operation number

With this done, there are no edits which need to be made to callers. If aliasing
of data is added to the feature, then this may no longer be the case.
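
The steps above can be sketched as follows. This is an illustration under stated assumptions: `example_get_priority`, `depriv_example_get_priority` and `example_depriv1` are hypothetical names, and `example_depriv1` calls the function directly where the real depriv#n() wrapper would switch into deprivileged mode.

```c
/* New home of the original function body (hypothetical operation). */
static int depriv_example_get_priority(unsigned int mask)
{
    int priority = 0;

    if ( mask == 0 )
        return 8;

    /* Find the lowest set bit. */
    while ( (mask & (1u << priority)) == 0 )
        priority++;

    return priority;
}

/*
 * Stand-in for a one-parameter depriv#n() wrapper. It returns non-zero
 * on failure and hands the operation's result back through *ret.
 */
static int example_depriv1(int (*fn)(unsigned int), unsigned int a, int *ret)
{
    *ret = fn(a);
    return 0;
}

/* The old prototype is unchanged, so callers need no edits. */
int example_get_priority(unsigned int mask)
{
    int ret;

    if ( example_depriv1(depriv_example_get_priority, mask, &ret) )
        return -1;

    return ret;
}
```

Keeping the original name as a thin wrapper confines the marshalling and the return-code handling to one place per operation.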

The process to create a syscall is as follows:
 - Create a syscall with a name do_depriv_* using the depriv_syscall_t type
 - Write the syscall body
 - Return a result to depriv mode
 - Add an entry to the depriv_syscall_table and create a syscall number

Syscalls are made using DEPRIV_SYSCALL_CALL(op, ret, params) which
takes the operation number, the return variable and the parameters for the
system call, executes the system call using the Linux 64-bit calling convention
and then sets ret to the return value.

TODO:
 - Alias data for deprivileged mode. There is a large comment at the top of
   deprivileged_syscall.c which outlines considerations.
 - Check if we need to map_domain_page the pages when we do the copy in
   hvm_deprivileged_copy_data{to/from}
 - Check for unsigned integer wrapping on addition in
   hvm_deprivileged_copy_data_{to/from}
 - Move hvm_deprivileged_syscall into the syscall macro. It's a stub and
   unless extra code is needed there it can be folded into the macro.
 - Check maintainers' thoughts on the deprivileged mode function checks in
   hvm_deprivileged_user_mode. See the TODO comment.
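
For the unsigned wrapping item, one common way to write the bounds check so that no addition can overflow is sketched below. `example_range_ok` is a hypothetical helper, not a function from this series.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Return true if the byte range [offset, offset + len) lies within a
 * region of region_size bytes. The check subtracts rather than adds,
 * so it cannot wrap even when len is close to the maximum size_t.
 */
static bool example_range_ok(size_t offset, size_t len, size_t region_size)
{
    if ( offset > region_size )
        return false;

    return len <= region_size - offset;
}
```

A naive `offset + len <= region_size` test would accept a very large `len` whose sum wraps around to a small value; the subtraction form rejects it.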

We copy the data for ease of implementation and for small enough
structures, this is acceptable. For larger structures, or

[Xen-devel] [PATCH RFC v3 1/6] HVM x86 deprivileged mode: Create deprivileged page tables

The paging structure mappings for the deprivileged mode are added to the monitor
page table for HVM guests for HAP and shadow table paging. The entries are
generated by walking the page tables and mapping in new pages. Access bits are
flipped as needed.

The page entries are generated for deprivileged .text, .data and a stack. The
.text section is only allocated once at HVM domain initialisation and then we
alias it from then onwards. The data section is copied from sections allocated
by the linker. The mappings are setup in an unused portion of the Xen virtual
address space. The pages are mapped in as user mode accessible, with NX bits set
for the data and stack regions and the code region is set to be executable and
read-only.

The needed pages are allocated on the paging heap and are deallocated when
those heap pages are deallocated (on domain destruction).

Signed-off-by: Ben Catterall 

Changes since v1

 * .text section is now aliased when needed
 * Reduced user stack size to two pages
 * Changed allocator used for pages
 * Changed types to using __hvm_$foo[] for linker variables
 * Moved some #define's to page.h
 * Small bug fix: Testing global bit on L3 not relevant

Changes since v2:
 * Bug fix: Pass return value back through page table generation code
 * Coding style: Added space before if, for, etc.
---
 xen/arch/x86/hvm/Makefile  |   1 +
 xen/arch/x86/hvm/deprivileged.c| 538 +
 xen/arch/x86/mm/hap/hap.c  |   8 +
 xen/arch/x86/mm/shadow/multi.c |   8 +
 xen/arch/x86/xen.lds.S |  19 ++
 xen/include/asm-x86/config.h   |  29 +-
 xen/include/asm-x86/x86_64/page.h  |  15 ++
 xen/include/xen/hvm/deprivileged.h |  95 +++
 xen/include/xen/sched.h|   4 +
 9 files changed, 710 insertions(+), 7 deletions(-)
 create mode 100644 xen/arch/x86/hvm/deprivileged.c
 create mode 100644 xen/include/xen/hvm/deprivileged.h

diff --git a/xen/arch/x86/hvm/Makefile b/xen/arch/x86/hvm/Makefile
index 794e793..df5ebb8 100644
--- a/xen/arch/x86/hvm/Makefile
+++ b/xen/arch/x86/hvm/Makefile
@@ -2,6 +2,7 @@ subdir-y += svm
 subdir-y += vmx
 
 obj-y += asid.o
+obj-y += deprivileged.o
 obj-y += emulate.o
 obj-y += event.o
 obj-y += hpet.o
diff --git a/xen/arch/x86/hvm/deprivileged.c b/xen/arch/x86/hvm/deprivileged.c
new file mode 100644
index 000..0075523
--- /dev/null
+++ b/xen/arch/x86/hvm/deprivileged.c
@@ -0,0 +1,538 @@
+/*
+ * HVM deprivileged mode to provide support for running operations in
+ * user mode from Xen
+ */
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+void hvm_deprivileged_init(struct domain *d, l4_pgentry_t *l4t_base)
+{
+void *p;
+unsigned long size;
+unsigned int l4t_idx_code = l4_table_offset(HVM_DEPRIVILEGED_TEXT_ADDR);
+int ret;
+
+/* If there is already an entry here */
+ASSERT(!l4e_get_intpte(l4t_base[l4t_idx_code]));
+
+/*
+ * We alias the .text segment for deprivileged mode to save memory.
+ * Additionally, to save allocating page tables for each vcpu's deprivileged
+ * mode .text segment, we reuse them.
+ *
+ * If we have not already created a mapping (valid_l4e_code is false) then
+ * we create one and generate the page tables. To save doing this for each
+ * vcpu, if we already have a set of valid page tables then we reuse them.
+ * So, if we have the page tables and there is no entry at the desired PML4
+ * slot, then we can just reuse those page tables.
+ *
+ * The mappings are per-domain as we use the domain's page pool memory
+ * allocator for the new page structure and page frame pages.
+ */
+if ( !d->hvm_depriv_valid_l4e_code )
+{
+/*
+ * Build the alias mappings for the .text segment for deprivileged code
+ *
+ * NOTE: If there are other pages here, then this method will map around
+ * them. Which means that any future alias will use this mapping. If the
+ * HVM depriv section no longer has a unique PML4 entry in the Xen
+ * memory map, this will need to be accounted for.
+ */
+size = (unsigned long)__hvm_deprivileged_text_end -
+   (unsigned long)__hvm_deprivileged_text_start;
+
+ret = hvm_deprivileged_map_l4(d, l4t_base,
+   (unsigned long)__hvm_deprivileged_text_start,
+   (unsigned long)HVM_DEPRIVILEGED_TEXT_ADDR,
+   size, 0 /* No write */, HVM_DEPRIV_ALIAS);
+
+if ( ret )
+{
+printk(XENLOG_ERR "HVM: Error when initialising depriv .text. Code: %d",
+   ret);
+
+domain_crash(d);
+return;
+}
+
+d->hvm_depriv_l4e_code = l4t_base[l4t_idx_code];
+

[Xen-devel] [PATCH RFC v3 2/6] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode

The process to switch into and out of deprivileged mode can be likened to
setjmp/longjmp.
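
The analogy can be made concrete with a small user-space sketch. It only illustrates the control flow (save a resume point, run other code that never returns normally, resume at the saved point); the `example_*` names are hypothetical and nothing here reflects the actual stack or register handling in the patch.

```c
#include <setjmp.h>

static jmp_buf example_privileged_ctx;  /* saved "privileged" resume point */
static int example_result;              /* value handed back on exit       */

/* Stand-in for deprivileged code: it exits by jumping, not returning. */
static void example_deprivileged_work(int value)
{
    example_result = value;
    longjmp(example_privileged_ctx, 1);
}

/* Enter the other "mode" and resume here once it has finished. */
static int example_enter(int value)
{
    if ( setjmp(example_privileged_ctx) == 0 )
        example_deprivileged_work(value);   /* does not return here */

    /* Execution continues from the saved point; unwind as usual. */
    return example_result;
}
```

As with the real switch, the caller of `example_enter()` sees an ordinary function call and return, even though control left and re-entered the function in between.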

Xen is non-preemptive and taking an interrupt/exception, SYSCALL, SYSENTER,
NMI or any IST will currently clobber the Xen privileged stack. We need this
stack to be preserved so that after executing deprivileged mode, we can
return to our previous privileged execution point. This allows us to unwind the
stack, cleaning up memory allocations.

To enter deprivileged mode, we move the interrupt/exception rsp,
SYSENTER rsp and SYSCALL rsp to point to lower down Xen's privileged stack
to prevent them from clobbering it. The IST NMI and DF handlers used to copy
themselves onto the privileged stack. This is no longer the case, they now
leave themselves on their predefined stacks.

This means that we can continue execution from that point. This is similar
behaviour to a context switch.

To exit deprivileged mode, we restore the original interrupt/exception rsp,
SYSENTER rsp and SYSCALL rsp. We can then continue execution from where we left
off, which will unwind the stack and free up resources. This method means that
we do not need to change any other code paths and its invocation will be
transparent to callers. This should allow the feature to be more easily
deployed to different parts of Xen.

The switch to and from deprivileged mode is performed using sysret and syscall
respectively.

Signed-off-by: Ben Catterall 

Changed since v1

 * Added support for AMD SVM
 * Moved to the new stack approach
 * IST handlers no longer copy themselves
 * Updated context switching code to perform a full context-switch.
 This means that depriv mode will execute with host register states not
 (partial) guest register state. This allows for crashing the domain (later
 patch) whilst in depriv mode, alleviates potential security vulnerabilities
 and is necessary to work around the AMD TR issue.
 * Moved processor-specific code to processor-specific files.
 * Changed call/jmp pair in deprivileged_asm.S to call/ret pair to not confuse
   processor branch predictors.

Changed since v2:
 * Coding style: Add space after if, for, etc.
---
 xen/arch/x86/domain.c   |  12 +++
 xen/arch/x86/hvm/Makefile   |   1 +
 xen/arch/x86/hvm/deprivileged.c | 103 ++
 xen/arch/x86/hvm/deprivileged_asm.S | 167 
 xen/arch/x86/hvm/svm/svm.c  | 130 +++-
 xen/arch/x86/hvm/vmx/vmx.c  | 118 +
 xen/arch/x86/mm/hap/hap.c   |   2 +-
 xen/arch/x86/x86_64/asm-offsets.c   |   5 ++
 xen/arch/x86/x86_64/entry.S |  38 ++--
 xen/arch/x86/x86_64/traps.c |  13 ++-
 xen/include/asm-x86/current.h   |   2 +
 xen/include/asm-x86/hvm/svm/svm.h   |  13 +++
 xen/include/asm-x86/hvm/vcpu.h  |  15 
 xen/include/asm-x86/hvm/vmx/vmx.h   |   2 +
 xen/include/asm-x86/processor.h |   2 +
 xen/include/asm-x86/system.h|   3 +
 xen/include/xen/hvm/deprivileged.h  |  45 ++
 xen/include/xen/sched.h |  18 +++-
 18 files changed, 674 insertions(+), 15 deletions(-)
 create mode 100644 xen/arch/x86/hvm/deprivileged_asm.S

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 045f6ff..a0e5e70 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -62,6 +62,7 @@
 #include 
 #include 
 #include 
+#include 
 
 DEFINE_PER_CPU(struct vcpu *, curr_vcpu);
 DEFINE_PER_CPU(unsigned long, cr4);
@@ -446,6 +447,12 @@ int vcpu_initialise(struct vcpu *v)
 if ( has_hvm_container_domain(d) )
 {
 rc = hvm_vcpu_initialise(v);
+
+/* Initialise HVM deprivileged mode */
+printk("HVM initialising deprivileged mode ...");
+hvm_deprivileged_prepare_vcpu(v);
+printk("Done.\n");
+
 goto done;
 }
 
@@ -523,7 +530,12 @@ void vcpu_destroy(struct vcpu *v)
 vcpu_destroy_fpu(v);
 
 if ( has_hvm_container_vcpu(v) )
+{
+/* Destroy the deprivileged mode on this vcpu */
+hvm_deprivileged_destroy_vcpu(v);
+
 hvm_vcpu_destroy(v);
+}
 else
 xfree(v->arch.pv_vcpu.trap_ctxt);
 }
diff --git a/xen/arch/x86/hvm/Makefile b/xen/arch/x86/hvm/Makefile
index df5ebb8..e16960a 100644
--- a/xen/arch/x86/hvm/Makefile
+++ b/xen/arch/x86/hvm/Makefile
@@ -3,6 +3,7 @@ subdir-y += vmx
 
 obj-y += asid.o
 obj-y += deprivileged.o
+obj-y += deprivileged_asm.o
 obj-y += emulate.o
 obj-y += event.o
 obj-y += hpet.o
diff --git a/xen/arch/x86/hvm/deprivileged.c b/xen/arch/x86/hvm/deprivileged.c
index 0075523..5574c50 100644
--- a/xen/arch/x86/hvm/deprivileged.c
+++ b/xen/arch/x86/hvm/deprivileged.c
@@ -536,3 +536,106 @@ struct page_info *hvm_deprivileged_alloc_page(struct domain *d)
 
 return pg;
 }
+
+/* Used to prepare each vcpu's data for user mode. Call for each HVM vcpu. */
+int hvm_deprivileged_prepare_vcpu(struct vcpu *vcpu)
+{
+

[Xen-devel] [PATCH RFC v3 6/6] HVM x86 deprivileged mode: Move VPIC to deprivileged mode

First steps of moving the VPIC into deprivileged mode.

For the VPIC, some of its functions are called from both privileged code and
deprivileged code. Some of these are also called from non-hvm domains. This
means that we cannot just convert the entire function to a depriv only one, but
need to handle this case. vpic_get_priority() shows one way of doing this but
there may be other ways. The main aim will be to minimise code duplication and
logic needed to determine where the call is coming from. This will not be a
unique problem so some thought will be needed as to the best way to resolve this
in general.

A clean up method handles a deprivileged mode domain crashing whilst holding
resources. For example, if we hold a lock or have allocated memory which Xen
will not clean up for us when we crash, we need to release these. Otherwise we
fail on ASSERT_NOT_IN_ATOMIC in vmx_asm_vmexit_handler due to an unreleased lock
and then panic. We could also leak memory if we allocate from a pool which Xen
does not clean up for us on crashing the domain.

TODO

Patches 5 & 6:
 - Fix the GCC switch statement issue which causes a page fault

Patch 6:
 - Fix vpic lock release on domain crash.
 - Finish moving parts of the VPIC into deprivileged mode

KNOWN ISSUES

 - Page fault for vpic_ioport_write due to GCC switch statements placing the
   jump table in .rodata which is in the privileged mode area.

   This has been traced to the first of the switch statements in the function.
   Though other switches in that function may also be affected.
   Compiled using GCC 4.9.2-10.

   You can get the offset into this function by doing:
   (RIP - (depriv_vpic_ioport_write - __hvm_deprivileged_text_start))

   It appears to be a built-in default of GCC to put switch jump tables in
   .rodata or .text and there does not appear to be a way to change this
   (except to patch the compiler). Note that GCC will not necessarily allocate
   jump tables for each switch statement; it depends on the optimiser.

   Thus, when we relocate a deprivileged method containing code using a switch
   statement which GCC has created a jump table for, this leads to a page
   fault. This is because we have not mapped in the rodata section
   as we should not (depriv should not have access to it).

   A workaround would be to patch the generated assembly so that this table is
   moved into hvm_deprivileged.rodata. This can be done by adding,
   .section .hvm_deprivileged.rodata, around the generated table. We can then
   relocate this.

   Note that GCC is using RIP-relative addressing for this, so the offset
   of depriv .rodata to the depriv .text segment will need to be the same
   when it is mapped in.

Signed-off-by: Ben Catterall 
---
 xen/arch/x86/hvm/deprivileged.c |  49 +++
 xen/arch/x86/hvm/deprivileged_syscall.c |   4 +-
 xen/arch/x86/hvm/vpic.c | 151 
 xen/arch/x86/traps.c|   5 +-
 xen/include/asm-x86/hvm/vcpu.h  |   2 +
 xen/include/xen/hvm/deprivileged.h  |   3 +
 6 files changed, 192 insertions(+), 22 deletions(-)

diff --git a/xen/arch/x86/hvm/deprivileged.c b/xen/arch/x86/hvm/deprivileged.c
index 5606f9a..9561054 100644
--- a/xen/arch/x86/hvm/deprivileged.c
+++ b/xen/arch/x86/hvm/deprivileged.c
@@ -20,7 +20,14 @@
 #include 
 #include 
 
+/* TODO: move to a better place than here */
+int depriv_vpic_ioport_write(unsigned long *ret_data_ptr,
+ struct hvm_hw_vpic *vpic, int32_t addr,
+ uint32_t val) DEPRIV_TEXT_SEGMENT;
+
+
 static depriv_syscall_t depriv_operation_table[] = {
+DEPRIV_OPERATION(vpic_ioport_write, 4)
 };
 
 void hvm_deprivileged_init(struct domain *d, l4_pgentry_t *l4t_base)
@@ -641,6 +648,11 @@ int hvm_deprivileged_user_mode(unsigned long operation, register_t a,
  */
 if ( vcpu->arch.hvm_vcpu.depriv_destroy )
 {
+/*
+ * Track which operation we are currently performing so that we can
+ * clean up if we have to crash the domain whilst doing it.
+ */
+hvm_deprivileged_clean_up(vcpu, operation);
 domain_crash(vcpu->domain);
 return 1;
 }
@@ -763,3 +775,40 @@ int hvm_deprivileged_check_trap(const char* func_name)
 
 return 0;
 }
+
+/*
+ * Clean up when destroying the domain
+ * When we destroy the domain whilst performing a deprivileged mode operation,
+ * we need to make sure that we do not hold any locks or have any memory which
+ * we have allocated related to the deprivileged mode operation which will
+ * not be cleared up by Xen automatically as part of domain destruction.
+ *
+ * An example is when we crash whilst holding a lock, we need to release this
+ * lock.
+ */
+void hvm_deprivileged_clean_up(struct vcpu *vcpu, unsigned long op)
+{
+struct hvm_hw_vpic *vpic;
+
+/* The vpic lock is not released if we crash the domain. This means that the
+

[Xen-devel] [PATCH RFC v3 0/6] HVM x86 deprivileged mode summary

Hi all,

I have now finished my internship at Citrix and am posting this final version of
my RFC series. I would like to express my thanks to all of those who have taken
the time to review, comment and discuss this series, as well as to my colleagues
who have provided excellent guidance and help. I have learned a great deal and
have greatly enjoyed working with all of you. Thank you.

Hopefully the series will be beneficial. I believe that it has shown that a
deprivileged mode in Xen is a possible and viable option, as long as performance
impact vs security is carefully considered on a case-by-case basis. The end of
this series contains an example of moving some of the vpic into deprivileged
mode which has allowed me to test and verify that the feature works. There are
enhancements and some clean up which is needed but, after that, the feature
could be deployed to HVM devices currently found in Xen such as the VPIC.

Patches one to four are (hopefully) now fairly stable. Patch 5 is the new
system call and deprivileged dispatch mode which is new to this series. Patch 6
is also new and is a demonstration of using this for the vpic and has mainly
been used to test and exercise this feature.

As this patch series is in RFC, there are some debug printks which should be
removed when/if it leaves RFC, but they are useful in fixing the known issue so
I have left them in until that can be resolved.

There are some efficiency savings that can be made and an instance of a general
issue (detailed later) which will need to be addressed.

Many thanks once again,
Ben

TODOs
There is a set of TODOs in this patch series, some issues in the later patches
which need addressing and some other considerations which I've summarised here.

Patch 1:
 - Consider hvm_deprivileged_map_* and an efficiency saving by mapping in larger
   pages. See the TODO at the top of the L4 version of this method.

Patch 2:
 - We have a much more heavyweight version of the deprivileged mode context
   switch after testing for AMD SVM found that this was necessary. However,
   the FPU is currently also saved and this may not be necessary. Consideration
   is needed to work out if we can cut this down even more.

Patch 4:
 - The watchdog timer is hooked currently to kill deprivileged mode operations
   that run for too long and is hardcoded to be at least one watchdog tick and
   at most two. This may want to be refined.

Patch 5:
 - Alias data for deprivileged mode. There is a large comment at the top of
   deprivileged_syscall.c which outlines considerations.
 - Check if we need to map_domain_page the pages when we do the copy in
   hvm_deprivileged_copy_data{to/from}
 - Check for unsigned integer wrapping on addition in
   hvm_deprivileged_copy_data_{to/from}
 - Move hvm_deprivileged_syscall into the syscall macro. It's a stub and
   unless extra code is needed there it can be folded into the macro.
 - Check maintainers' thoughts on the deprivileged mode function checks in
   hvm_deprivileged_user_mode. See the TODO comment.

Patches 5 & 6:
 - Fix/work around the GCC switch statement issue.


KNOWN ISSUES

 - Page fault for vpic_ioport_write due to GCC switch statements placing the
   jump table in .rodata which is in the privileged mode area.

   This has been traced to the first of the switch statements in the function.
   Though other switches in that function may also be affected.
   Compiled using GCC 4.9.2-10.

   You can get the offset into this function by doing:
   (RIP - (depriv_vpic_ioport_write - __hvm_deprivileged_text_start))

   It appears to be a built-in default of GCC to put switch jump tables in
   .rodata or .text and there does not appear to be a way to change this
   (except to patch the compiler, though hopefully there _is_ another
   option I just haven't been able to find...). Note that GCC will not
   necessarily allocate jump tables for each switch statement; it appears to
   depend on a number of factors such as the optimiser, the number of cases,
   the type of the case, compiler version etc.

   Thus, when we relocate a deprivileged method containing code using a switch
   statement which GCC has created a jump table for, this leads to a page
   fault. This is because we have not mapped in the rodata section
   as we should not (depriv should not have access to it).

   A workaround would be to patch the generated assembly so that this table is
   moved into hvm_deprivileged.rodata. This can be done by adding,
   .section .hvm_deprivileged.rodata, around the generated table. We can then
   relocate this.

   Note that GCC is using RIP-relative addressing for this, so the offset
   of depriv .rodata to the depriv .text segment will need to be the same
   when it is mapped in.
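
   One thing that may be worth trying before patching the assembly: GCC
   documents a -fno-jump-tables code generation option, which asks it not to
   use jump tables for switch statements. Whether that is sufficient for the
   depriv text has not been tested here. The function below is a made-up
   stand-in for a dense switch of the kind found in vpic_ioport_write, not
   code from the series.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical device state, loosely PIC-shaped; not the real vpic struct. */
struct pic_example { uint8_t irr, imr, isr, mode; };

/*
 * A dense switch like this is a typical candidate for a jump table when
 * optimising. Built with -fno-jump-tables, GCC is expected to lower it to
 * compare-and-branch sequences instead, so no table lands in .rodata.
 */
void depriv_switch_example(struct pic_example *s, uint8_t cmd, uint8_t val)
{
    assert(s != NULL);
    switch ( cmd & 7 )
    {
    case 0: s->irr |= val;            break;
    case 1: s->irr &= (uint8_t)~val;  break;
    case 2: s->imr = val;             break;
    case 3: s->isr ^= val;            break;
    case 4: s->mode = val & 3;        break;
    case 5: s->isr = 0;               break;
    default:                          break;
    }
}
```

   Comparing the output of "gcc -O2 -S" with and without the flag should show
   whether a table is still emitted. Note that simple value-returning switches
   can separately be turned into lookup arrays by -ftree-switch-conversion,
   which would need checking as well.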







___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

Re: [Xen-devel] [PATCH v2] arm/xen: Enable user access to the kernel before issuing a privcmd call

2015-09-11 Thread Russell King - ARM Linux

On Fri, Sep 11, 2015 at 05:25:59PM +0100, Julien Grall wrote:
> + /*
> +  * Privcmd calls are issued by the userspace. We need to allow the
> +  * kernel to access the userspace memory before issuing the hypercall.
> +  */
> + uaccess_enable r4
> +
> + /* r4 is loaded now as we use it as scratch register before */
>   ldr r4, [sp, #4]

As I mentioned in one of my previous mails, "ip" should be safe to use
here - it's a caller-corrupted register, just like r0-r3 and lr.  So,
you could do:

ldr r4, [sp, #4]
+   uaccess_enable ip

which fractionally tightens the window.

However, there's nothing actually wrong with your version - there's no
way we could've got this far with sp pointing at userspace.

I'm happy with either version, so:

Acked-by: Russell King 

How do you want to handle the patch?  I already have some other uaccess
fixes queued up to send to Linus before the merge window closes.

>   __HVC(XEN_IMM)
> +
> + /*
> +  * Disable userspace access from kernel. This is fine to do it
> +  * unconditionally as no set_fs(KERNEL_DS)/set_fs(get_ds()) is
> +  * called before.
> +  */
> + uaccess_disable r4
> +
>   ldm sp!, {r4}
>   ret lr
>  ENDPROC(privcmd_call);
> -- 
> 2.1.4
> 

-- 
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.

Re: [Xen-devel] [PATCH for 4.6 v4 2/3] xl/libxl: disallow saving a guest with vNUMA configured

On Fri, Sep 11, 2015 at 04:53:57PM +0100, Ian Campbell wrote:
> On Fri, 2015-09-11 at 16:09 +0100, Wei Liu wrote:
> 
> > > >  Note that virtual NUMA for PV guest is not yet supported, because
> > > >  there is an issue with cpuid handling that affects PV virtual NUMA.
> > > > +Further more, guest with virtual NUMA cannot be saved or migrated
> > > 
> > > I _think_ (but am not 100% sure) that in the sense you mean it is
> > > "Furthermore". I don't think "Further more," actually means anything.
> > > 
> > > I can fix as I commit.
> > > 
> > 
> > Yes please.
> 
> I went with:
> +Furthermore, guests with virtual NUMA cannot be saved or migrated
> +because the migration stream does not preserve node information.
> 
> I've applied all three patches in this series to staging and 4.6. The
> conflict you mentioned elsewhere (patch #1) was just the presence of
>  xc_domain_soft_reset in staging but not staging-4.6. I resolved it, I
> don't think I can have gotten it wrong, but do check ;-)
> 

The backported patch looks correct. Thanks.

> Ian.

Re: [Xen-devel] [PATCH] efi/libstub/fdt: Standardize the names of EFI stub parameters

2015-09-11 Thread Mark Rutland

On Fri, Sep 11, 2015 at 01:46:43PM +0100, Daniel Kiper wrote:
> On Thu, Sep 10, 2015 at 05:23:02PM +0100, Mark Rutland wrote:
> > > > C) When you could go:
> > > >
> > > >DT -> Discover Xen -> Xen-specific stuff -> Xen-specific EFI/ACPI 
> > > > discovery
> > >
> > > I take you mean discovering Xen with the usual Xen hypervisor node on
> > > device tree. I think that C) is a good option actually. I like it. Not
> > > sure why we didn't think about this earlier. Is there anything EFI or
> > > ACPI which is needed before Xen support is discovered by
> > > arch/arm64/kernel/setup.c:setup_arch -> xen_early_init()?
> >
> > Currently lots (including the memory map). With the stuff to support
> > SPCR, the ACPI discovery would be moved before xen_early_init().
> >
> > > If not, we could just go for this. A lot of complexity would go away.
> >
> > I suspect this would still be fairly complex, but would at least prevent
> > the Xen-specific EFI handling from adversely affecting the native case.
> >
> > > > D) If you want to be generic:
> > > >EFI -> EFI application -> EFI tables -> ACPI tables -> Xen-specific 
> > > > stuff
> > > >   \--/
> > > >(virtualize these, provide shims to Dom0, but handle
> > > > everything in Xen itself)
> > >
> > > I think that this is good in theory but could turn out to be a lot of
> > > work in practice. We could probably virtualize the RuntimeServices but
> > > the BootServices are troublesome.
> >
> > What's troublesome with the boot services?
> >
> > What can't be simulated?
> 
> How do you want to access bare metal EFI boot services from dom0 if they
> were shutdown long time ago before loading dom0 image?

I don't want to.

I asked "What can't be simulated?" because I assumed everything
necessary/mandatory could be simulated without needing access to any
real EFI boot services.

As far as I can see all that's necessary is to provide a compatible
interface.

> What do you need from EFI boot services in dom0?

The ability to call ExitBootServices() and SetVirtualAddressMap() on a
_virtual_ address map for _virtual_ services provided by the hypervisor.
A console so that I can log things early on.

Mark.

Re: [Xen-devel] [PATCH] efi/libstub/fdt: Standardize the names of EFI stub parameters

2015-09-11 Thread Mark Rutland

> It feels like this discussion is going in circles.
> 
> When we discussed this six months ago, we already concluded that,
> since UEFI is the only specified way that the presence of ACPI is
> advertised on an ARM system, we need to emulate UEFI to some extent.

My understanding from the last time I was present at such a discussion
was that the emulation was complete from the kernel's PoV (i.e. no
special case handling was required). 

Evidently I misunderstood.

One of the main points of rationale for requiring EFI was that we'd have
a well-defined system state as per the EFI (and ACPI) standards. We'd
know we had the EFI memory map, services, etc (with the memory map and
configuration tables being the most important elements). We didn't want
to have to try to reconcile a DT memory map and ACPI, for instance.

That is somewhat (though admittedly not entirely) broken if we require a
set of rules to be applied beyond what the standards mandate.  That is
broken if we require a non-standard set of rules to be applied, and
limits what we can do in the !Xen case.

> So we need the EFI system table to expose the UEFI configuration table
> that carries the ACPI root pointer.
> 
> Since ACPI support also relies on the UEFI memory map (I think?), we
> need that as well.
> 
> These two items are exactly what we pass via the UEFI DT properties,
> so we should indeed promote the current de-facto binding to a proper
> binding, and renaming the properties makes sense in that context.

I agree that we need to sort these out.

> I agree that this should also include a description of the expected
> state of the firmware, i.e., that ExitBootServices() has been called,
> and that the memory map has been populated with virtual address, which
> have been installed using SetVirtualAddressMap() if they differ from
> the physical addresses. (The current implementation on the kernel side
> is perfectly capable of dealing with a 1:1 mapping).
> 
> Beyond that, there is no point in pretending to be a full UEFI
> implementation, imo. Boot services are not required, nor are runtime
> services (only the current EFI init code on arm needs to be modified
> to deal with a NULL runtime services pointer)

I'm not keen on this because it involves applying Xen-specific caveats
atop of what the UEFI spec says (e.g. as runtime services might be NULL
under Xen). As the spec and Xen evolve, those caveats shift, and that's
going to be fragile for all users regardless of Xen.

If Xen needs special-casing, then I'd rather that Xen were detected
first and provided us with what it knows is safe for us to use, rather
than we tip-toe around until we're sure Xen isn't present and/or doesn't
need additional constraints met.

Thanks,
Mark.

[Xen-devel] [PATCH v2] arm/xen: Enable user access to the kernel before issuing a privcmd call

When Xen is copying data to/from the guest it will check if the kernel
has the right to do the access. If not, the hypercall will return an
error.

After the commit a5e090acbf545c0a3b04080f8a488b17ec41fe02 "ARM:
software-based privileged-no-access support", the kernel can no longer
access user space by default. This causes every hypercall made by
userspace (i.e. via privcmd) to fail.

We have to enable userspace access and then restore the correct
permission every time privcmd is used to make a hypercall.

I didn't find generic helpers to do these operations, so the change
is arm32-specific.

Reported-by: Riku Voipio 
Signed-off-by: Julien Grall 

---
Cc: Stefano Stabellini 
Cc: Russell King 

Changes in v2:
- Directly enable/disable the user space access in assembly
- Typos

ARM64 doesn't seem to have privileged no-access support yet so there
is nothing to do for now.

I haven't looked at x86 at all.
---
 arch/arm/xen/hypercall.S | 15 +++
 1 file changed, 15 insertions(+)

diff --git a/arch/arm/xen/hypercall.S b/arch/arm/xen/hypercall.S
index f00e080..10fd99c 100644
--- a/arch/arm/xen/hypercall.S
+++ b/arch/arm/xen/hypercall.S
@@ -98,8 +98,23 @@ ENTRY(privcmd_call)
mov r1, r2
mov r2, r3
ldr r3, [sp, #8]
+   /*
+* Privcmd calls are issued by the userspace. We need to allow the
+* kernel to access the userspace memory before issuing the hypercall.
+*/
+   uaccess_enable r4
+
+   /* r4 is loaded now as we use it as scratch register before */
ldr r4, [sp, #4]
__HVC(XEN_IMM)
+
+   /*
+* Disable userspace access from kernel. This is fine to do it
+* unconditionally as no set_fs(KERNEL_DS)/set_fs(get_ds()) is
+* called before.
+*/
+   uaccess_disable r4
+
ldm sp!, {r4}
ret lr
 ENDPROC(privcmd_call);
-- 
2.1.4
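
For readers who don't speak ARM assembly, the ordering in the hunk above can
be modelled in C. This is only a toy model: the flag and the function names
are hypothetical, and the real uaccess_enable/uaccess_disable are assembler
macros, not C functions.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Models "the kernel may touch user memory"; off by default. */
static bool user_access_enabled = false;

/* Stand-in for Xen copying to/from the guest: fails if access is off. */
static int model_hypercall(void)
{
    return user_access_enabled ? 0 : -EFAULT;
}

/* Same order as the patch: enable, issue the call, disable unconditionally. */
static int model_privcmd_call(void)
{
    int rc;

    /* Mirrors the comment in the patch: nothing enabled access earlier. */
    assert(!user_access_enabled);

    user_access_enabled = true;
    rc = model_hypercall();
    user_access_enabled = false;

    return rc;
}
```

Calling model_hypercall() on its own fails with -EFAULT, which is the
behaviour described in the commit message; going through model_privcmd_call()
succeeds and leaves access disabled again afterwards.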


Re: [Xen-devel] [PATCH] x86/fpu: CR0.TS should be set before trap into PV guest's #NM exception handler

2015-09-11 Thread Konrad Rzeszutek Wilk

On Wed, Nov 06, 2013 at 02:41:12PM +0800, Zhu Yanhai wrote:
> As we know Intel X86's CR0.TS is a sticky bit, which means once set
> it remains set until cleared by some software routines, in other words,
> the exception handler expects the bit is set when it starts to execute.
> 
> However xen doesn't simulate this behavior quite well for PV guests -
> vcpu_restore_fpu_lazy() clears CR0.TS unconditionally in the very beginning,
> so the guest kernel's #NM handler runs with CR0.TS cleared. Generally speaking
> it's fine since the linux kernel executes the exception handler with
> interrupt disabled and a sane #NM handler will clear the bit anyway
> before it exits, but there's a catch: if it's the first FPU trap for the
> process,
> the linux kernel must allocate a piece of SLAB memory for it to save
> the FPU registers, which opens a schedule window as the memory
> allocation might sleep -- and with CR0.TS keeps clear!
> 

With Ingo's FPU rewrite we haven't been able to retrigger this.
(Tests ran for 2 weeks, whereas before they would have failed within
two hours.)

And when I dug into this I found the reason:

commit 0c8c0f03e3a292e031596484275c14cf39c0ab7a
Author: Dave Hansen 
Date:   Fri Jul 17 12:28:11 2015 +0200

x86/fpu, sched: Dynamically allocate 'struct fpu'

The FPU rewrite removed the dynamic allocations of 'struct fpu'.
But, this potentially wastes massive amounts of memory (2k per
task on systems that do not have AVX-512 for instance).

Instead of having a separate slab, this patch just appends the
space that we need to the 'task_struct' which we dynamically
allocate already.  This saves from doing an extra slab
allocation at fork().

And when the #NM handler ('do_device_not_available') is called,
it does:

 fpu__restore(&current->thread.fpu); /* interrupts still off */
   |+- fpu__activate_curr (which just inits the already allocated space)
   | \- memset(state, 0, xstate_size);
   |+- fpregs_activate
 \- stts()

So there is no scheduling window during this time, while in
kernels prior to Linux 4.2 there was.

And it took a bit of time to figure out what exactly the problem was.

I appreciate folks' emails (and this giant thread) about this but
without some sort of diagram it was hard to understand this
(at least to me).

So here it is in case somebody is doing code archaeology:

For simplicity we assume the guest/baremetal use the lazy mechanism
not eager. That makes 'switch_fpu_prepare' (called by schedule()) effectively:

if (previous task had PF_USED_MATH set)
   stts (CR0.TS=1)
else
   ;

I am ignoring the case where the task has used the FPU more than
five times - where we do things a bit differently.
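
To make the failure mode concrete, here is a toy model of that lazy rule in C.
It is not kernel code: the struct, the global and the function names are made
up, and it assumes the simplified ordering where task A sleeps inside its
first #NM handler before it has been marked as an FPU user.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct task_model { bool used_math; };   /* stands in for PF_USED_MATH */

static bool cr0_ts;                      /* models the CR0.TS bit */

/* The lazy switch_fpu_prepare summarised above. */
static void switch_fpu_prepare_model(const struct task_model *prev)
{
    assert(prev != NULL);
    if ( prev->used_math )
        cr0_ts = true;                   /* stts */
}

/*
 * Task A takes its first #NM, sleeps in the handler's allocation, and task B
 * is scheduled. Returns whether B's first FPU instruction traps.
 * ts_cleared_on_entry models CR0.TS already being clear when the guest's
 * handler starts running.
 */
static bool task_b_traps(bool ts_cleared_on_entry)
{
    struct task_model a = { .used_math = false };

    cr0_ts = true;                       /* A's FPU op raises #NM */
    if ( ts_cleared_on_entry )
        cr0_ts = false;

    /* Handler sleeps before marking A as an FPU user; schedule() runs. */
    switch_fpu_prepare_model(&a);

    return cr0_ts;                       /* B's FPU op traps iff TS is set */
}
```

With ts_cleared_on_entry false, task B still gets its #NM. With it true, B
runs its first FPU instruction without trapping, which is the window the
original report was about.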

The time diagram looks great at 132x42.

Anyhow, let's assume that we have two tasks: A and B. Both
haven't used the FPU. This is on PVHVM:

[Time diagram; the original layout was lost to line wrapping and the tail is
cut off. What can be recovered:]

Columns, left to right:
  task A | #NM | task B | (#NM) | task B | task A | | task A
CR0.TS labels along the top, in order: 1, 1, 0, 1, 0

task A: MMX op
  #NM: math_state_restore
        \- fpu_init
            \- .. schedule()
               [swap task B]
               [since task A hadn't set PF_USED_MATH
                we don't muck with CR0.TS]
task B: MMX op
  #NM: math_state_restore
        fpu_init worked
        clts()

Re: [Xen-devel] [PATCH v2] arm/xen: Enable user access to the kernel before issuing a privcmd call

2015-09-11 Thread Stefano Stabellini

On Fri, 11 Sep 2015, Russell King - ARM Linux wrote:
> On Fri, Sep 11, 2015 at 06:36:05PM +0100, Julien Grall wrote:
> > On 11/09/15 18:32, Julien Grall wrote:
> > > On 11/09/15 18:00, Russell King - ARM Linux wrote:
> > >> On Fri, Sep 11, 2015 at 05:25:59PM +0100, Julien Grall wrote:
> > >>> +   /*
> > >>> +* Privcmd calls are issued by the userspace. We need to allow 
> > >>> the
> > >>> +* kernel to access the userspace memory before issuing the 
> > >>> hypercall.
> > >>> +*/
> > >>> +   uaccess_enable r4
> > >>> +
> > >>> +   /* r4 is loaded now as we use it as scratch register before */
> > >>> ldr r4, [sp, #4]
> > >>
> > >> As I mentioned in one of my previous mails, "ip" should be safe to use
> > >> here - it's a caller-corrupted register, just like r0-r3 and lr.  So,
> > >> you could do:
> > >>
> > >>  ldr r4, [sp, #4]
> > >> +uaccess_enable ip
> > > 
> > > The register ip (aka r12) is used to store the hypercall number. So we
> > > can't reuse it as scratch register.
> > > 
> > > The easiest one is r4.
> > > 
> > >>
> > >> which fractionally tightens the window.
> > >>
> > >> However, there's nothing actually wrong with your version - there's no
> > >> way we could've got this far with sp pointing at userspace.
> > >>
> > >> I'm happy with either version, so:
> > >>
> > >> Acked-by: Russell King 
> > >>
> > >> How do you want to handle the patch?  I already have some other uaccess
> > >> fixes queued up to send to Linus before the merge window closes.
> > 
> > Forgot to answer to this bits. I was thinking to ask Stefano carrying
> > the patch in xentip. Although it won't go until rc1.
> > 
> > I don't mind if it's going earlier in Linux/master.
> 
> Thanks, I've applied your patch as-is now.

That's fine by me, the patch looks good.

Thanks,

Stefano

Re: [Xen-devel] [PATCH RFC v3 0/6] HVM x86 deprivileged mode summary


Hi all,

Here are two Python scripts which I have used to collect performance 
benchmarks for this series. I am putting them here in case they are useful.


Ben

On 11/09/15 17:08, Ben Catterall wrote:

Hi all,

I have now finished my internship at Citrix and am posting this final version of
my RFC series. I would like to express my thanks to all of those who have taken
the time to review, comment and discuss this series, as well as to my colleagues
who have provided excellent guidance and help. I have learned a great deal and
have greatly enjoyed working with all of you. Thank you.

Hopefully the series will be beneficial. I believe that it has shown that a
deprivileged mode in Xen is a possible and viable option, as long as performance
impact vs security is carefully considered on a case-by-case basis. The end of
this series contains an example of moving some of the vpic into deprivileged
mode which has allowed me to test and verify that the feature works. There are
enhancements and some clean up which is needed but, after that, the feature
could be deployed to HVM devices currently found in Xen such as the VPIC.

Patches one to four are (hopefully) now fairly stable. Patch 5 is the new
system call and deprivileged dispatch mode which is new to this series. Patch 6
is also new and is a demonstration of using this for the vpic and has mainly
been used to test and exercise this feature.

As this patch series is in RFC, there are some debug printks which should be
removed when/if it leaves RFC, but they are useful in fixing the known issue,
so I have left them in until that can be resolved.

There are some efficiency savings that can be made and an instance of a general
issue (detailed later) which will need to be addressed.

Many thanks once again,
Ben

TODOs
-
There is a set of TODOs in this patch series, some issues in the later patches
which need addressing and some other considerations which I've summarised here.

Patch 1:
  - Consider hvm_deprivileged_map_* and an efficiency saving by mapping in
larger pages. See the TODO at the top of the L4 version of this method.

Patch 2:
  - We have a much more heavyweight version of the deprivileged mode context
switch after testing for AMD SVM found that this was necessary. However,
the FPU is currently also saved and this may not be necessary. Consideration
is needed to work out if we can cut this down even more.

Patch 4:
  - The watchdog timer is hooked currently to kill deprivileged mode operations
that run for too long and is hardcoded to be at least one watchdog tick and
at most two. This may want to be refined.

Patch 5:
  - Alias data for deprivileged mode. There is a large comment at the top of
deprivileged_syscall.c which outlines considerations.
  - Check if we need to map_domain_page the pages when we do the copy in
hvm_deprivileged_copy_data{to/from}
  - Check for unsigned integer wrapping on addition in
hvm_deprivileged_copy_data_{to/from}
  - Move hvm_deprivileged_syscall into the syscall macro. It's a stub and
unless extra code is needed there it can be folded into the macro.
  - Check maintainers' thoughts on the deprivileged mode function checks in
hvm_deprivileged_user_mode. See the TODO comment.

Patches 5 & 6:
  - Fix/work around the GCC switch statement issue.


KNOWN ISSUES

  - Page fault for vpic_ioport_write due to GCC switch statements placing the
jump table in .rodata which is in the privileged mode area.

This has been traced to the first of the switch statements in the function,
though other switches in that function may also be affected.
Compiled using GCC 4.9.2-10.

You can get the offset into this function by doing:
(RIP - (depriv_vpic_ioport_write - __hvm_deprivileged_text_start))

It appears to be a built-in default of GCC to put switch jump tables in
.rodata or .text and there does not appear to be a way to change this
(except to patch the compiler, though hopefully there _is_ another
option I just haven't been able to find...). Note that GCC will not
necessarily allocate jump tables for each switch statement; it appears to
depend on a number of factors such as the optimiser, the number of cases,
the type of the case, compiler version etc.

Thus, when we relocate a deprivileged method containing code using a switch
statement which GCC has created a jump table for, this leads to a page
fault. This is because we have not mapped in the rodata section
as we should not (depriv should not have access to it).

A workaround would be to patch the generated assembly so that this table is
moved into hvm_deprivileged.rodata. This can be done by adding,
.section .hvm_deprivileged.rodata, around the generated table. We can then
relocate this.

Note that GCC is using RIP-relative addressing for this, so the offset
of depriv .rodata to the depriv .text segment will need to be the same

Re: [Xen-devel] [PATCH v5 2/2] xen/arm: support gzip compressed kernels

2015-09-11 Thread Stefano Stabellini

On Tue, 8 Sep 2015, Ian Campbell wrote:
> On Mon, 2015-09-07 at 15:25 +0100, Stefano Stabellini wrote:
> > Free the memory used for the compressed kernel and update the relative
> > mod->start and mod->size parameters with the uncompressed ones.
> > 
> > Signed-off-by: Stefano Stabellini 
> > Reviewed-by: Julien Grall 
> > CC: ian.campb...@citrix.com
> > 
> > ---
> > 
> > Changes in v5:
> > - code style
> > 
> > Changes in v4:
> > - return uint32_t from output_length
> > - __init kernel_decompress
> > - code style
> > - add comment
> > - if kernel_decompress fails with error, return
> > 
> > Changes in v3:
> > - better error checks in kernel_decompress
> > - free unneeded pages between output_size and kernel_order_out
> > - alloc pages from domheap
> > 
> > Changes in v2:
> > - use gzip_check
> > - avoid useless casts
> > - free original kernel image and update the mod with the new address and
> > size
> > - remove changes to common Makefile
> > - remove CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
> > ---
> >  xen/arch/arm/kernel.c   |   75
> > +++
> >  xen/arch/arm/setup.c|2 +-
> >  xen/include/asm-arm/setup.h |2 ++
> >  3 files changed, 78 insertions(+), 1 deletion(-)
> > 
> > diff --git a/xen/arch/arm/kernel.c b/xen/arch/arm/kernel.c
> > index f641b12..baa5717 100644
> > --- a/xen/arch/arm/kernel.c
> > +++ b/xen/arch/arm/kernel.c
> > @@ -13,6 +13,8 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> > +#include 
> >  
> >  #include "kernel.h"
> >  
> > @@ -257,6 +259,63 @@ static int kernel_uimage_probe(struct kernel_info
> > *info,
> >  return 0;
> >  }
> >  
> > +static __init uint32_t output_length(char *image, unsigned long
> > image_len)
> > +{
> > +return *(uint32_t *)&image[image_len - 4];
> > +}
> > +
> > +static __init int kernel_decompress(struct kernel_info *info,
> > + paddr_t *addr, paddr_t *size)
> > +{
> > +char *output, *input, *end;
> > +char magic[2];
> > +int rc;
> > +unsigned kernel_order_out;
> > +paddr_t output_size;
> > +struct page_info *pages;
> > +
> > +if ( *size < 2 )
> > +return -EINVAL;
> > +
> > +copy_from_paddr(magic, *addr, sizeof(magic));
> > +
> > +/* only gzip is supported */
> > +if ( !gzip_check(magic, *size) )
> > +return -EINVAL;
> > +
> > +input = ioremap_cache(*addr, *size);
> > +if ( input == NULL )
> > +return -EFAULT;
> > +
> > +output_size = output_length(input, *size);
> > +kernel_order_out = get_order_from_bytes(output_size);
> > +pages = alloc_domheap_pages(NULL, kernel_order_out, 0);
> > +if ( pages == NULL )
> > +{
> > +iounmap(input);
> > +return -ENOMEM;
> > +}
> > +output = page_to_virt(pages);
> > +
> > +rc = perform_gunzip(output, input, *size);
> > +clean_dcache_va_range(output, output_size);
> > +iounmap(input);
> > +
> > +*addr = virt_to_maddr(output);
> 
> I don't think virt_to_maddr is strictly speaking valid (at the arch
> interface level, our actual implementation may happen to cope) for domheap
> pages, it's only valid for things which are in the linear 1:1 map (~=
> xenheap).
> 
> I think you need page_to_maddr(pages) instead.
> 
> 
> > +*size = output_size;
> > +
> > +end = output + (1 << (kernel_order_out + PAGE_SHIFT));
> > +/* 
> > + * Need to free pages after output_size here because they won't be
> > + * freed by discard_initial_modules
> > + */
> > +output += (output_size + PAGE_SIZE - 1) & PAGE_MASK;
> > +for ( ; output < end; output += PAGE_SIZE )
> > +free_domheap_page(virt_to_page(output));
> 
> And here again I don't think you can use virt_to_page.

I replaced it all with vmap/vunmap.
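
For reference, the tail-page arithmetic in the quoted hunk works out as in the
sketch below. It assumes 4 KiB pages, and the ex_* helpers are hypothetical
stand-ins (ex_order_from_bytes only imitates what get_order_from_bytes is
expected to return); none of this is the code that went into the next version.

```c
#include <assert.h>
#include <stdint.h>

#define EX_PAGE_SHIFT 12
#define EX_PAGE_SIZE  (UINT64_C(1) << EX_PAGE_SHIFT)
#define EX_PAGE_MASK  (~(EX_PAGE_SIZE - 1))

/* Smallest order such that (PAGE_SIZE << order) >= size. */
static unsigned int ex_order_from_bytes(uint64_t size)
{
    unsigned int order = 0;

    assert(size > 0);
    while ( (EX_PAGE_SIZE << order) < size )
        order++;
    return order;
}

/* Pages of the 2^order allocation that lie beyond the rounded-up output. */
static uint64_t ex_tail_pages(uint64_t output_size)
{
    uint64_t total = EX_PAGE_SIZE << ex_order_from_bytes(output_size);
    uint64_t used  = (output_size + EX_PAGE_SIZE - 1) & EX_PAGE_MASK;

    return (total - used) >> EX_PAGE_SHIFT;
}
```

For a 9000 byte image this gives an order-2 allocation (four pages), of which
three are used and one has to be freed; a 5000 byte image fills its order-1
allocation and leaves nothing to free.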


> > +
> > +return 0;
> > +}
> > +
> >  #ifdef CONFIG_ARM_64
> >  /*
> >   * Check if the image is a 64-bit Image.
> > @@ -463,6 +522,22 @@ int kernel_probe(struct kernel_info *info)
> >  printk("Loading ramdisk from boot module @ %"PRIpaddr"\n",
> > info->initrd_bootmodule->start);
> >  
> > +/* if it is a gzip'ed image, 32bit or 64bit, uncompress it */
> > +rc = kernel_decompress(info, , );
> > +if (rc < 0 && rc != -EINVAL)
> 
> IMHO kernel_decompress should return success when the decompress is a nop
> (as represented by EINVAL here) and an error only when the thing needs to
> be decompressed but cannot be.

That would mean collapsing the "nothing to do" and the "decompression
successful" cases into a single return value, which I think is not a
good idea. We would be losing information compared to what we have now.
I am quite happy to replace EINVAL with any other return value you think
is most appropriate though.
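
To illustrate what I mean by keeping the cases apart, a rough sketch with
three distinguishable outcomes. All ex_* names are hypothetical, the inflate
callback merely stands in for perform_gunzip, and -EIO is an arbitrary choice
of error code for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

enum ex_decompress_result {
    EX_NOT_COMPRESSED = 0,  /* nothing to do, image is used as-is */
    EX_DECOMPRESSED   = 1,  /* image replaced by its uncompressed form */
};

typedef int (*ex_inflate_fn)(const unsigned char *image, size_t len);

/* gzip members start with the magic bytes 0x1f 0x8b (RFC 1952). */
static int ex_is_gzip(const unsigned char *image, size_t len)
{
    return len >= 2 && image[0] == 0x1f && image[1] == 0x8b;
}

/* Returns an ex_decompress_result on success, a negative errno on failure. */
static int ex_kernel_decompress(const unsigned char *image, size_t len,
                                ex_inflate_fn inflate)
{
    assert(image != NULL && inflate != NULL);

    if ( !ex_is_gzip(image, len) )
        return EX_NOT_COMPRESSED;

    return inflate(image, len) < 0 ? -EIO : EX_DECOMPRESSED;
}

/* Trivial callbacks so the sketch can be exercised on its own. */
static int ex_inflate_ok(const unsigned char *image, size_t len)
{
    (void)image; (void)len;
    return 0;
}

static int ex_inflate_fail(const unsigned char *image, size_t len)
{
    (void)image; (void)len;
    return -1;
}
```

The caller can then test "rc < 0" for real errors and still tell whether the
module's start and size need updating.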


> That would mean putting the free of the original kernel and the updates of
> mod->* into kernel_decompress. But I think that also makes more sense
> because it confines

Re: [Xen-devel] [PATCH] arm/xen: Enable user access to the kernel before issuing a privcmd call

Hi Ian,

On 11/09/15 15:29, Ian Campbell wrote:
>> After the commit a5e090acbf545c0a3b04080f8a488b17ec41fe02 "ARM:
>> software-based priviledged-no-access support", the kernel can't access
> 
> "privileged"

That was a typo in the commit title of the patch. So I won't fix this one.

All the others will be fixed in the next version.

Regards,

-- 
Julien Grall

Re: [Xen-devel] [PATCH v2] arm/xen: Enable user access to the kernel before issuing a privcmd call

On 11/09/15 18:00, Russell King - ARM Linux wrote:
> On Fri, Sep 11, 2015 at 05:25:59PM +0100, Julien Grall wrote:
>> +/*
>> + * Privcmd calls are issued by the userspace. We need to allow the
>> + * kernel to access the userspace memory before issuing the hypercall.
>> + */
>> +uaccess_enable r4
>> +
>> +/* r4 is loaded now as we use it as scratch register before */
>>  ldr r4, [sp, #4]
> 
> As I mentioned in one of my previous mails, "ip" should be safe to use
> here - it's a caller-corrupted register, just like r0-r3 and lr.  So,
> you could do:
> 
>   ldr r4, [sp, #4]
> + uaccess_enable ip

The register ip (aka r12) is used to store the hypercall number. So we
can't reuse it as scratch register.

The easiest one is r4.

> 
> which fractionally tightens the window.
> 
> However, there's nothing actually wrong with your version - there's no
> way we could've got this far with sp pointing at userspace.
> 
> I'm happy with either version, so:
> 
> Acked-by: Russell King 
> 
> How do you want to handle the patch?  I already have some other uaccess
> fixes queued up to send to Linus before the merge window closes.
> 
>>  __HVC(XEN_IMM)
>> +
>> +/*
>> + * Disable userspace access from kernel. This is fine to do it
>> + * unconditionally as no set_fs(KERNEL_DS)/set_fs(get_ds()) is
>> + * called before.
>> + */
>> +uaccess_disable r4
>> +
>>  ldm sp!, {r4}
>>  ret lr
>>  ENDPROC(privcmd_call);
>> -- 
>> 2.1.4
>>
> 


-- 
Julien Grall

Re: [Xen-devel] [PATCH] efi/libstub/fdt: Standardize the names of EFI stub parameters

2015-09-11 Thread Mark Rutland

> >> Considering that the EFI support is just for Dom0, and Dom0 (at
> >> the time) had to be PV anyway, it was the more natural solution to
> >> expose the interface via hypercalls, the more that this allows better
> >> control over what is and primarily what is not being exposed to
> >> Dom0. With the wrapper approach we'd be back to the same
> >> problem (discussed elsewhere) of which EFI version to surface: The
> >> host one would impose potentially missing extensions, while the
> >> most recent hypervisor known one might imply hiding valuable
> >> information from Dom0. Plus there are incompatible changes like
> >> the altered meaning of EFI_MEMORY_WP in 2.5.
> > 
> > I'm not sure I follow how hypercalls solve any impedance mismatch here;
> > you're still expecting Dom0 to call up to Xen in order to perform calls,
> > and all I suggested was a different location for those hypercalls.
> > 
> > If Xen is happy to make such calls blindly, why does it matter if the
> > hypercall was in the kernel binary or an external shim?
> 
> Because there could be new entries in SystemTable->RuntimeServices
> (expected and blindly but validly called by the kernel). Even worse
> (because likely harder to deal with) would be new fields in other
> structures.

Any of these could cause Xen to blow up, while Xen could always provide
a known-safe (but potentially sub-optimal) view to the kernel by
default.

> > Incompatible changes are a spec problem regardless of how this is
> > handled.
> 
> Not necessarily - we don't expose the memory map (we'd have to
> if we were to mimic EFI for Dom0), and hence the mentioned issue
> doesn't exist in our model.

We have to expose _some_ memory map, so I don't follow this point.

Mark.

[Xen-devel] [PATCH for-4.6] libxl: handle read-only drives with qemu-xen

2015-09-11 Thread Stefano Stabellini

The current libxl code doesn't deal with read-only drives at all.

Upstream QEMU and qemu-xen only support read-only cdrom drives: make
sure to specify "readonly=on" for cdrom drives and return error in case
the user requested a non-cdrom read-only drive.

Signed-off-by: Stefano Stabellini 
---
 tools/libxl/libxl_dm.c |   13 +
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/tools/libxl/libxl_dm.c b/tools/libxl/libxl_dm.c
index 02c0162..468ff9c 100644
--- a/tools/libxl/libxl_dm.c
+++ b/tools/libxl/libxl_dm.c
@@ -1110,13 +1110,18 @@ static int libxl__build_device_model_args_new(libxl__gc *gc,
 if (disks[i].is_cdrom) {
 if (disks[i].format == LIBXL_DISK_FORMAT_EMPTY)
 drive = libxl__sprintf
-(gc, "if=ide,index=%d,media=cdrom,cache=writeback,id=ide-%i",
- disk, dev_number);
+(gc, "if=ide,index=%d,readonly=%s,media=cdrom,cache=writeback,id=ide-%i",
+ disk, disks[i].readwrite ? "off" : "on", dev_number);
 else
 drive = libxl__sprintf
-(gc, "file=%s,if=ide,index=%d,media=cdrom,format=%s,cache=writeback,id=ide-%i",
- disks[i].pdev_path, disk, format, dev_number);
+(gc, "file=%s,if=ide,index=%d,readonly=%s,media=cdrom,format=%s,cache=writeback,id=ide-%i",
+ disks[i].pdev_path, disk, disks[i].readwrite ? "off" : "on", format, dev_number);
 } else {
+if (!disks[i].readwrite) {
+LIBXL__LOG(ctx, LIBXL__LOG_ERROR, "QEMU doesn't support read-only disk drivers");
+return ERROR_INVAL;
+}
+return ERROR_INVAL;
+}
+
 if (disks[i].format == LIBXL_DISK_FORMAT_EMPTY) {
 LIBXL__LOG(ctx, LIBXL__LOG_WARNING, "cannot support"
" empty disk format for %s", disks[i].vdev);
-- 
1.7.10.4
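
The rule the patch applies can be shown in isolation with a small standalone
sketch. ex_build_drive_arg is a hypothetical helper, not libxl API, and the
non-cdrom option string is shortened for illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: builds a shortened -drive argument.
 * Returns 0 on success, -1 if the combination is rejected or buf is too small.
 */
static int ex_build_drive_arg(char *buf, size_t len, bool is_cdrom,
                              bool readwrite, int index)
{
    int n;

    assert(buf != NULL && len > 0);

    /* Same rule as the patch: a read-only non-cdrom drive is an error. */
    if ( !is_cdrom && !readwrite )
        return -1;

    if ( is_cdrom )
        n = snprintf(buf, len, "if=ide,index=%d,readonly=%s,media=cdrom",
                     index, readwrite ? "off" : "on");
    else
        n = snprintf(buf, len, "if=ide,index=%d,media=disk", index);

    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

A read-only cdrom at index 2 yields "if=ide,index=2,readonly=on,media=cdrom",
while a read-only hard disk is refused before any string is built.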


Re: [Xen-devel] [PATCH v2] arm/xen: Enable user access to the kernel before issuing a privcmd call

2015-09-11 Thread Russell King - ARM Linux

On Fri, Sep 11, 2015 at 06:56:50PM +0100, Stefano Stabellini wrote:
> On Fri, 11 Sep 2015, Russell King - ARM Linux wrote:
> > On Fri, Sep 11, 2015 at 06:36:05PM +0100, Julien Grall wrote:
> > > On 11/09/15 18:32, Julien Grall wrote:
> > > > On 11/09/15 18:00, Russell King - ARM Linux wrote:
> > > >> On Fri, Sep 11, 2015 at 05:25:59PM +0100, Julien Grall wrote:
> > > >>> + /*
> > > >>> +  * Privcmd calls are issued by the userspace. We need to allow the
> > > >>> +  * kernel to access the userspace memory before issuing the hypercall.
> > > >>> +  */
> > > >>> + uaccess_enable r4
> > > >>> +
> > > >>> + /* r4 is loaded now as we use it as scratch register before */
> > > >>>   ldr r4, [sp, #4]
> > > >>
> > > >> As I mentioned in one of my previous mails, "ip" should be safe to use
> > > >> here - it's a caller-corrupted register, just like r0-r3 and lr.  So,
> > > >> you could do:
> > > >>
> > > >>ldr r4, [sp, #4]
> > > >> +  uaccess_enable ip
> > > > 
> > > > The register ip (aka r12) is used to store the hypercall number. So we
> > > > can't reuse it as scratch register.
> > > > 
> > > > The easiest one is r4.
> > > > 
> > > >>
> > > >> which fractionally tightens the window.
> > > >>
> > > >> However, there's nothing actually wrong with your version - there's no
> > > >> way we could've got this far with sp pointing at userspace.
> > > >>
> > > >> I'm happy with either version, so:
> > > >>
> > > >> Acked-by: Russell King 
> > > >>
> > > >> How do you want to handle the patch?  I already have some other uaccess
> > > >> fixes queued up to send to Linus before the merge window closes.
> > > 
> > > Forgot to answer this bit. I was thinking of asking Stefano to carry
> > > the patch in xentip, although it won't go in until rc1.
> > > 
> > > I don't mind if it's going earlier in Linux/master.
> > 
> > Thanks, I've applied your patch as-is now.
> 
> That's fine by me, the patch looks good.

If you'd like your ack on it, please send one, I can still do that.

-- 
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.


[Xen-devel] [PATCH v7 01/17] VT-d Posted-interrupt (PI) design

Add the design doc for VT-d PI.

CC: Kevin Tian 
CC: Yang Zhang 
CC: Jan Beulich 
CC: Keir Fraser 
CC: Andrew Cooper 
CC: George Dunlap 
Signed-off-by: Feng Wu 
Reviewed-by: Kevin Tian 
Reviewed-by: Konrad Rzeszutek Wilk 
---
 docs/misc/vtd-pi.txt | 332 +++
 1 file changed, 332 insertions(+)
 create mode 100644 docs/misc/vtd-pi.txt

diff --git a/docs/misc/vtd-pi.txt b/docs/misc/vtd-pi.txt
new file mode 100644
index 000..af5409a
--- /dev/null
+++ b/docs/misc/vtd-pi.txt
@@ -0,0 +1,332 @@
+Authors: Feng Wu 
+
+VT-d Posted-interrupt (PI) design for XEN
+
+Background
+==========
+With the development of virtualization, there are more and more device
+assignment requirements. However, today when a VM is running with
+assigned devices (such as, NIC), external interrupt handling for the assigned
+devices always needs VMM intervention.
+
+VT-d Posted-interrupt is an enhanced method of handling interrupts
+in the virtualization environment. Interrupt posting is the process by
+which an interrupt request is recorded in a memory-resident
+posted-interrupt-descriptor structure by the root-complex, followed by
+an optional notification event issued to the CPU complex.
+
+With VT-d Posted-interrupt we can get the following advantages:
+- Direct delivery of external interrupts to running vCPUs without VMM
+intervention
+- Decrease the interrupt migration complexity. On vCPU migration, software
+can atomically co-migrate all interrupts targeting the migrating vCPU. For
+virtual machines with assigned devices, migrating a vCPU across pCPUs
+either incurs the overhead of forwarding interrupts in software (e.g. via VMM
+generated IPIs), or complexity to independently migrate each interrupt
+targeting the vCPU to the new pCPU. However, after enabling VT-d PI, the
+destination vCPU of an external interrupt from assigned devices is stored in
+the IRTE (i.e. Posted-interrupt Descriptor Address). When the vCPU is migrated
+to another pCPU, we set this new pCPU in the 'NDST' field of the
+Posted-interrupt descriptor, which makes the interrupt migration automatic.
+
+Here is what Xen currently does for external interrupts from assigned devices:
+
+When a VM is running and an external interrupt from an assigned device occurs
+for it, a VM-Exit happens, then:
+
+vmx_do_extint() --> do_IRQ() --> __do_IRQ_guest() --> hvm_do_IRQ_dpci() -->
+raise_softirq_for(pirq_dpci) --> raise_softirq(HVM_DPCI_SOFTIRQ)
+
+softirq HVM_DPCI_SOFTIRQ is bound to dpci_softirq()
+
+dpci_softirq() --> hvm_dirq_assist() --> vmsi_deliver_pirq() -->
+vmsi_deliver() --> vmsi_inj_irq() --> vlapic_set_irq()
+
+vlapic_set_irq() does the following things:
+1. If CPU-side posted-interrupt is supported, call vmx_deliver_posted_intr()
+to deliver the virtual interrupt via the posted-interrupt infrastructure.
+2. Otherwise, set the related vIRR in the vLAPIC page and call vcpu_kick() to
+kick the related vCPU. Before VM-Entry, vmx_intr_assist() will help to inject
+the interrupt into the guest.
+
+However, once VT-d PI is supported, when a guest is running in non-root mode
+and an external interrupt from an assigned device occurs for it, no VM-Exit is
+needed: the guest can handle the interrupt entirely in non-root mode, thus
+avoiding all of the above code flow.
+
+Posted-interrupt Introduction
+
+There are two components to the Posted-interrupt architecture:
+Processor Support and Root-Complex Support
+
+- Processor Support
+Posted-interrupt processing is a feature by which a processor processes
+the virtual interrupts by recording them as pending on the virtual-APIC
+page.
+
+Posted-interrupt processing is enabled by setting the process posted
+interrupts VM-execution control. The processing is performed in response
+to the arrival of an interrupt with the posted-interrupt notification vector.
+In response to such an interrupt, the processor processes virtual interrupts
+recorded in a data structure called a posted-interrupt descriptor.
+
+For more information about APICv and CPU-side Posted-interrupt, please refer
+to Chapter 29, and Section 29.6 in particular, in the Intel SDM:
+http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-manual-325462.pdf
+
+- Root-Complex Support
+Interrupt posting is the process by which an interrupt request (from IOAPIC
+or MSI/MSIx capable sources) is recorded in a memory-resident
+posted-interrupt-descriptor structure by the root-complex, followed by
+an optional notification event issued to the CPU complex. The interrupt
+request arriving at the root-complex carries the identity of the interrupt
+request source and a 'remapping-index'. The remapping-index is used to
+look-up an entry from the

[Xen-devel] [PATCH v7 00/17] Add VT-d Posted-Interrupts support

VT-d Posted-Interrupts is an enhancement to CPU side Posted-Interrupt.
With VT-d Posted-Interrupts enabled, external interrupts from
direct-assigned devices can be delivered to guests without VMM
intervention when the guest is running in non-root mode.

You can find the VT-d Posted-Interrupts Spec. at the following URL:
http://www.intel.com/content/www/us/en/intelligent-systems/intel-technology/vt-directed-io-spec.html

Feng Wu (17):
  VT-d Posted-interrupt (PI) design
  Add cmpxchg16b support for x86-64
  iommu: Add iommu_intpost to control VT-d Posted-Interrupts feature
  vt-d: VT-d Posted-Interrupts feature detection
  vmx: Extend struct pi_desc to support VT-d Posted-Interrupts
  vmx: Add some helper functions for Posted-Interrupts
  vmx: Initialize VT-d Posted-Interrupts Descriptor
  vmx: Suppress posting interrupts when 'SN' is set
  VT-d: Remove pointless casts
  vt-d: Extend struct iremap_entry to support VT-d Posted-Interrupts
  vt-d: Add API to update IRTE when VT-d PI is used
  x86: move some APIC related macros to apicdef.h
  Update IRTE according to guest interrupt config changes
  vmx: Properly handle notification event when vCPU is running
  vmx: VT-d posted-interrupt core logic handling
  VT-d: Dump the posted format IRTE
  Add a command line parameter for VT-d posted-interrupts

 docs/misc/vtd-pi.txt   | 332 +
 docs/misc/xen-command-line.markdown|   9 +-
 xen/arch/x86/domain.c  |  21 +++
 xen/arch/x86/hvm/hvm.c |   6 +
 xen/arch/x86/hvm/vlapic.c  |   5 -
 xen/arch/x86/hvm/vmx/vmcs.c|  24 +++
 xen/arch/x86/hvm/vmx/vmx.c | 312 ++-
 xen/common/schedule.c  |   2 +
 xen/drivers/passthrough/io.c   | 118 +++-
 xen/drivers/passthrough/iommu.c|  16 +-
 xen/drivers/passthrough/vtd/intremap.c | 213 -
 xen/drivers/passthrough/vtd/iommu.c|  14 +-
 xen/drivers/passthrough/vtd/iommu.h|  51 +++--
 xen/drivers/passthrough/vtd/utils.c|  42 +++--
 xen/include/asm-arm/domain.h   |   2 +
 xen/include/asm-x86/apicdef.h  |   3 +
 xen/include/asm-x86/domain.h   |   3 +
 xen/include/asm-x86/hvm/hvm.h  |   4 +
 xen/include/asm-x86/hvm/vmx/vmcs.h |  25 ++-
 xen/include/asm-x86/hvm/vmx/vmx.h  |  27 +++
 xen/include/asm-x86/iommu.h|   2 +
 xen/include/asm-x86/x86_64/system.h|  31 +++
 xen/include/xen/iommu.h|   2 +-
 23 files changed, 1176 insertions(+), 88 deletions(-)
 create mode 100644 docs/misc/vtd-pi.txt

-- 
2.1.0



[Xen-devel] [PATCH v7 10/17] vt-d: Extend struct iremap_entry to support VT-d Posted-Interrupts

Extend struct iremap_entry according to VT-d Posted-Interrupts Spec.

CC: Yang Zhang 
CC: Kevin Tian 
Signed-off-by: Feng Wu 
Acked-by: Kevin Tian 
---
v7:
- Add a __uint128_t member to the union in struct iremap_entry

v4:
- res_4 is not a bitfield, correct it.
- Expose 'im' to remapped irte as well.

v3:
- Use u32 instead of u64 to define the bitfields in 'struct iremap_entry'
- Limit using bitfield if possible

 xen/drivers/passthrough/vtd/intremap.c | 92 +-
 xen/drivers/passthrough/vtd/iommu.h| 44 ++--
 xen/drivers/passthrough/vtd/utils.c|  8 +--
 3 files changed, 81 insertions(+), 63 deletions(-)

diff --git a/xen/drivers/passthrough/vtd/intremap.c b/xen/drivers/passthrough/vtd/intremap.c
index 987bbe9..e9fffa6 100644
--- a/xen/drivers/passthrough/vtd/intremap.c
+++ b/xen/drivers/passthrough/vtd/intremap.c
@@ -122,9 +122,9 @@ static u16 hpetid_to_bdf(unsigned int hpet_id)
 static void set_ire_sid(struct iremap_entry *ire,
 unsigned int svt, unsigned int sq, unsigned int sid)
 {
-ire->hi.svt = svt;
-ire->hi.sq = sq;
-ire->hi.sid = sid;
+ire->remap.svt = svt;
+ire->remap.sq = sq;
+ire->remap.sid = sid;
 }
 
 static void set_ioapic_source_id(int apic_id, struct iremap_entry *ire)
@@ -219,7 +219,7 @@ static unsigned int alloc_remap_entry(struct iommu *iommu, unsigned int nr)
 else
 p = &iremap_entries[i % (1 << IREMAP_ENTRY_ORDER)];
 
-if ( p->lo_val || p->hi_val ) /* not a free entry */
+if ( p->lo || p->hi ) /* not a free entry */
 found = 0;
 else if ( ++found == nr )
 break;
@@ -253,7 +253,7 @@ static int remap_entry_to_ioapic_rte(
 GET_IREMAP_ENTRY(ir_ctrl->iremap_maddr, index,
  iremap_entries, iremap_entry);
 
-if ( iremap_entry->hi_val == 0 && iremap_entry->lo_val == 0 )
+if ( iremap_entry->hi == 0 && iremap_entry->lo == 0 )
 {
 dprintk(XENLOG_ERR VTDPREFIX,
 "%s: index (%d) get an empty entry!\n",
@@ -263,13 +263,13 @@ static int remap_entry_to_ioapic_rte(
 return -EFAULT;
 }
 
-old_rte->vector = iremap_entry->lo.vector;
-old_rte->delivery_mode = iremap_entry->lo.dlm;
-old_rte->dest_mode = iremap_entry->lo.dm;
-old_rte->trigger = iremap_entry->lo.tm;
+old_rte->vector = iremap_entry->remap.vector;
+old_rte->delivery_mode = iremap_entry->remap.dlm;
+old_rte->dest_mode = iremap_entry->remap.dm;
+old_rte->trigger = iremap_entry->remap.tm;
 old_rte->__reserved_2 = 0;
 old_rte->dest.logical.__reserved_1 = 0;
-old_rte->dest.logical.logical_dest = iremap_entry->lo.dst >> 8;
+old_rte->dest.logical.logical_dest = iremap_entry->remap.dst >> 8;
 
 unmap_vtd_domain_page(iremap_entries);
 spin_unlock_irqrestore(&ir_ctrl->iremap_lock, flags);
@@ -317,27 +317,28 @@ static int ioapic_rte_to_remap_entry(struct iommu *iommu,
 if ( rte_upper )
 {
 if ( x2apic_enabled )
-new_ire.lo.dst = value;
+new_ire.remap.dst = value;
 else
-new_ire.lo.dst = (value >> 24) << 8;
+new_ire.remap.dst = (value >> 24) << 8;
 }
 else
 {
 *(((u32 *)&new_rte) + 0) = value;
-new_ire.lo.fpd = 0;
-new_ire.lo.dm = new_rte.dest_mode;
-new_ire.lo.tm = new_rte.trigger;
-new_ire.lo.dlm = new_rte.delivery_mode;
+new_ire.remap.fpd = 0;
+new_ire.remap.dm = new_rte.dest_mode;
+new_ire.remap.tm = new_rte.trigger;
+new_ire.remap.dlm = new_rte.delivery_mode;
 /* Hardware require RH = 1 for LPR delivery mode */
-new_ire.lo.rh = (new_ire.lo.dlm == dest_LowestPrio);
-new_ire.lo.avail = 0;
-new_ire.lo.res_1 = 0;
-new_ire.lo.vector = new_rte.vector;
-new_ire.lo.res_2 = 0;
+new_ire.remap.rh = (new_ire.remap.dlm == dest_LowestPrio);
+new_ire.remap.avail = 0;
+new_ire.remap.res_1 = 0;
+new_ire.remap.vector = new_rte.vector;
+new_ire.remap.res_2 = 0;
 
 set_ioapic_source_id(IO_APIC_ID(apic), &new_ire);
-new_ire.hi.res_1 = 0;
-new_ire.lo.p = 1; /* finally, set present bit */
+new_ire.remap.res_3 = 0;
+new_ire.remap.res_4 = 0;
+new_ire.remap.p = 1; /* finally, set present bit */
 
 /* now construct new ioapic rte entry */
 remap_rte->vector = new_rte.vector;
@@ -510,7 +511,7 @@ static int remap_entry_to_msi_msg(
 GET_IREMAP_ENTRY(ir_ctrl->iremap_maddr, index,
  iremap_entries, iremap_entry);
 
-if ( iremap_entry->hi_val == 0 && iremap_entry->lo_val == 0 )
+if ( iremap_entry->hi == 0 && iremap_entry->lo == 0 )
 {
 dprintk(XENLOG_ERR VTDPREFIX,
 "%s: index (%d) get an empty entry!\n",
@@ -523,25 +524,25 @@ static int

[Xen-devel] [PATCH v7 15/17] vmx: VT-d posted-interrupt core logic handling

This patch includes the following aspects:
- Handling logic when vCPU is blocked:
* Add a global vector to wake up the blocked vCPU
  when an interrupt is being posted to it (this part
  was suggested by Yang Zhang).
* Define two per-cpu variables:
  1. pi_blocked_vcpu:
A list storing the vCPUs which were blocked
on this pCPU.

  2. pi_blocked_vcpu_lock:
The spinlock to protect pi_blocked_vcpu.

- Add some scheduler hooks; this part was suggested
  by Dario Faggioli.
* vmx_pre_ctx_switch_pi()
  It is called before context switch; we update the
  posted interrupt descriptor when the vCPU is preempted,
  goes to sleep, or is blocked.

* vmx_post_ctx_switch_pi()
  It is called after context switch, we update the posted
  interrupt descriptor when the vCPU is going to run.

* arch_vcpu_wake_prepare()
  It will be called when waking up the vCPU, we update
  the posted interrupt descriptor when the vCPU is
  unblocked.

CC: Keir Fraser 
CC: Jan Beulich 
CC: Andrew Cooper 
CC: Kevin Tian 
CC: George Dunlap 
CC: Dario Faggioli 
Suggested-by: Dario Faggioli 
Signed-off-by: Feng Wu 
Reviewed-by: Dario Faggioli 
---
v7:
- Merge "[PATCH v6 16/18] vmx: Add some scheduler hooks for VT-d posted
  interrupts" and "[PATCH v6 14/18] vmx: posted-interrupt handling when vCPU
  is blocked" into this patch, so it is self-contained and more convenient
  for code review.
- Make 'pi_blocked_vcpu' and 'pi_blocked_vcpu_lock' static
- Coding style
- Use per_cpu() instead of this_cpu() in pi_wakeup_interrupt()
- Move ack_APIC_irq() to the beginning of pi_wakeup_interrupt()
- Rename 'pi_ctxt_switch_from' to 'ctxt_switch_prepare'
- Rename 'pi_ctxt_switch_to' to 'ctxt_switch_cancel'
- Use 'has_hvm_container_vcpu' instead of 'is_hvm_vcpu'
- Use 'spin_lock' and 'spin_unlock' when the interrupt has been
  already disabled.
- Rename arch_vcpu_wake_prepare to vmx_vcpu_wake_prepare
- Define vmx_vcpu_wake_prepare in xen/arch/x86/hvm/hvm.c
- Call .pi_ctxt_switch_to() in __context_switch() instead of directly
  calling vmx_post_ctx_switch_pi() in vmx_ctxt_switch_to()
- Make .pi_block_cpu unsigned int
- Use list_del() instead of list_del_init()
- Coding style

One remaining item:
Jan has a concern about calling vcpu_unblock() in vmx_pre_ctx_switch_pi();
Dario's or George's input is needed on this.

Changelog for "vmx: Add some scheduler hooks for VT-d posted interrupts"
v6:
- Add two static inline functions for pi context switch
- Fix typos

v5:
- Rename arch_vcpu_wake to arch_vcpu_wake_prepare
- Make arch_vcpu_wake_prepare() inline for ARM
- Merge the ARM dummy hooks together
- Changes to some code comments
- Leave 'pi_ctxt_switch_from' and 'pi_ctxt_switch_to' NULL if
  PI is disabled or the vCPU is not in HVM
- Coding style

v4:
- Newly added

Changelog for "vmx: posted-interrupt handling when vCPU is blocked"
v6:
- Fix some typos
- Ack the interrupt right after the spin_unlock in pi_wakeup_interrupt()

v4:
- Use local variables in pi_wakeup_interrupt()
- Remove vcpu from the blocked list when pi_desc.on==1; this
  avoids kicking the vcpu multiple times.
- Remove tasklet

v3:
- This patch is generated by merging the following three patches in v2:
   [RFC v2 09/15] Add a new per-vCPU tasklet to wakeup the blocked vCPU
   [RFC v2 10/15] vmx: Define two per-cpu variables
   [RFC v2 11/15] vmx: Add a global wake-up vector for VT-d Posted-Interrupts
- rename 'vcpu_wakeup_tasklet' to 'pi_vcpu_wakeup_tasklet'
- Move the definition of 'pi_vcpu_wakeup_tasklet' to 'struct arch_vmx_struct'
- rename 'vcpu_wakeup_tasklet_handler' to 'pi_vcpu_wakeup_tasklet_handler'
- Make pi_wakeup_interrupt() static
- Rename 'blocked_vcpu_list' to 'pi_blocked_vcpu_list'
- move 'pi_blocked_vcpu_list' to 'struct arch_vmx_struct'
- Rename 'blocked_vcpu' to 'pi_blocked_vcpu'
- Rename 'blocked_vcpu_lock' to 'pi_blocked_vcpu_lock'

 xen/arch/x86/domain.c  |  21 
 xen/arch/x86/hvm/hvm.c |   6 +
 xen/arch/x86/hvm/vmx/vmcs.c|   2 +
 xen/arch/x86/hvm/vmx/vmx.c | 229 +
 xen/common/schedule.c  |   2 +
 xen/include/asm-arm/domain.h   |   2 +
 xen/include/asm-x86/domain.h   |   3 +
 xen/include/asm-x86/hvm/hvm.h  |   4 +
 xen/include/asm-x86/hvm/vmx/vmcs.h |  11 ++
 xen/include/asm-x86/hvm/vmx/vmx.h  |   4 +
 10 files changed, 284 insertions(+)

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 045f6ff..d64d4eb 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -1531,6 +1531,8 @@ static void __context_switch(void)
 }
 vcpu_restore_fpu_eager(n);
 n->arch.ctxt_switch_to(n);
+

[Xen-devel] [PATCH v7 14/17] vmx: Properly handle notification event when vCPU is running

When a vCPU is running in root mode and a notification event
has been injected to it, we need to set VCPU_KICK_SOFTIRQ for
the current cpu, so that the pending interrupt in PIRR will be
synced to vIRR before VM-Exit in time.

CC: Kevin Tian 
CC: Keir Fraser 
CC: Jan Beulich 
CC: Andrew Cooper 
Signed-off-by: Feng Wu 
Acked-by: Kevin Tian 
---
v7:
- Retain 'cli' in the comments to make it more understandable.
- Register another notification event handler when VT-d PI is enabled

v6:
- Ack the interrupt in the beginning of pi_notification_interrupt()

v4:
- Coding style.

v3:
- Make pi_notification_interrupt() static

 xen/arch/x86/hvm/vmx/vmx.c | 54 +-
 1 file changed, 53 insertions(+), 1 deletion(-)

diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index 5f01629..8e41f4b 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -1975,6 +1975,53 @@ static struct hvm_function_table __initdata vmx_function_table = {
 .altp2m_vcpu_emulate_vmfunc = vmx_vcpu_emulate_vmfunc,
 };
 
+/* Handle VT-d posted-interrupt when VCPU is running. */
+static void pi_notification_interrupt(struct cpu_user_regs *regs)
+{
+ack_APIC_irq();
+this_cpu(irq_count)++;
+
+/*
+ * We get here when a vCPU is running in root-mode (such as via hypercall,
+ * or any other reasons which can result in VM-Exit), and before vCPU is
+ * back to non-root, external interrupts from an assigned device happen
+ * and a notification event is delivered to this logical CPU.
+ *
+ * we need to set VCPU_KICK_SOFTIRQ for the current cpu, just like
+ * __vmx_deliver_posted_interrupt(). So the pending interrupt in PIRR will
+ * be synced to vIRR before VM-Exit in time.
+ *
+ * Please refer to the following code fragments from
+ * xen/arch/x86/hvm/vmx/entry.S:
+ *
+ * .Lvmx_do_vmentry
+ *
+ *  ..
+ *
+ *  point 1
+ *
+ *  cli
+ *  cmp  %ecx,(%rdx,%rax,1)
+ *  jnz  .Lvmx_process_softirqs
+ *
+ *  ..
+ *
+ *  je   .Lvmx_launch
+ *
+ *  ..
+ *
+ * .Lvmx_process_softirqs:
+ *  sti
+ *  call do_softirq
+ *  jmp  .Lvmx_do_vmentry
+ *
+ * If VT-d engine issues a notification event at point 1 above, it cannot
+ * be delivered to the guest during this VM-entry without raising the
+ * softirq in this notification handler.
+ */
+raise_softirq(VCPU_KICK_SOFTIRQ);
+}
+
 const struct hvm_function_table * __init start_vmx(void)
 {
 set_in_cr4(X86_CR4_VMXE);
@@ -2012,7 +2059,12 @@ const struct hvm_function_table * __init start_vmx(void)
 }
 
 if ( cpu_has_vmx_posted_intr_processing )
-alloc_direct_apic_vector(&posted_intr_vector, event_check_interrupt);
+{
+if ( iommu_intpost )
+alloc_direct_apic_vector(&posted_intr_vector, pi_notification_interrupt);
+else
+alloc_direct_apic_vector(&posted_intr_vector, event_check_interrupt);
+}
 else
 {
 vmx_function_table.deliver_posted_intr = NULL;
-- 
2.1.0



[Xen-devel] [PATCH v7 06/17] vmx: Add some helper functions for Posted-Interrupts

This patch adds some helper functions to manipulate the
Posted-Interrupts Descriptor.

CC: Kevin Tian 
CC: Keir Fraser 
CC: Jan Beulich 
CC: Andrew Cooper 
Signed-off-by: Feng Wu 
Reviewed-by: Konrad Rzeszutek Wilk 
---
v7:
- Use bitfield in pi_test_on() and pi_test_sn()

v4:
- Newly added

 xen/include/asm-x86/hvm/vmx/vmx.h | 21 +
 1 file changed, 21 insertions(+)

diff --git a/xen/include/asm-x86/hvm/vmx/vmx.h b/xen/include/asm-x86/hvm/vmx/vmx.h
index 3fbfa44..8d91110 100644
--- a/xen/include/asm-x86/hvm/vmx/vmx.h
+++ b/xen/include/asm-x86/hvm/vmx/vmx.h
@@ -101,6 +101,7 @@ void vmx_update_cpu_exec_control(struct vcpu *v);
 void vmx_update_secondary_exec_control(struct vcpu *v);
 
 #define POSTED_INTR_ON  0
+#define POSTED_INTR_SN  1
 static inline int pi_test_and_set_pir(int vector, struct pi_desc *pi_desc)
 {
 return test_and_set_bit(vector, pi_desc->pir);
@@ -121,11 +122,31 @@ static inline int pi_test_and_clear_on(struct pi_desc *pi_desc)
 return test_and_clear_bit(POSTED_INTR_ON, &pi_desc->control);
 }
 
+static inline int pi_test_on(struct pi_desc *pi_desc)
+{
+return pi_desc->on;
+}
+
 static inline unsigned long pi_get_pir(struct pi_desc *pi_desc, int group)
 {
 return xchg(&pi_desc->pir[group], 0);
 }
 
+static inline int pi_test_sn(struct pi_desc *pi_desc)
+{
+return pi_desc->sn;
+}
+
+static inline void pi_set_sn(struct pi_desc *pi_desc)
+{
+set_bit(POSTED_INTR_SN, &pi_desc->control);
+}
+
+static inline void pi_clear_sn(struct pi_desc *pi_desc)
+{
+clear_bit(POSTED_INTR_SN, &pi_desc->control);
+}
+
 /*
  * Exit Reasons
  */
-- 
2.1.0



[Xen-devel] [PATCH v7 07/17] vmx: Initialize VT-d Posted-Interrupts Descriptor

This patch initializes the VT-d Posted-interrupt Descriptor.

CC: Kevin Tian 
CC: Keir Fraser 
CC: Jan Beulich 
CC: Andrew Cooper 
Signed-off-by: Feng Wu 
Acked-by: Kevin Tian 
Reviewed-by: Konrad Rzeszutek Wilk 
---
v7:
- Add comments to function 'pi_desc_init' to clarify why we
  update the posted-interrupt descriptor in a non-atomic way
  in it.

v3:
- Move pi_desc_init() to xen/arch/x86/hvm/vmx/vmcs.c
- Remove the 'inline' flag of pi_desc_init()

 xen/arch/x86/hvm/vmx/vmcs.c   | 22 ++
 xen/include/asm-x86/hvm/vmx/vmx.h |  2 ++
 2 files changed, 24 insertions(+)

diff --git a/xen/arch/x86/hvm/vmx/vmcs.c b/xen/arch/x86/hvm/vmx/vmcs.c
index a0a97e7..5f67797 100644
--- a/xen/arch/x86/hvm/vmx/vmcs.c
+++ b/xen/arch/x86/hvm/vmx/vmcs.c
@@ -39,6 +39,7 @@
 #include 
 #include 
 #include 
+#include 
 
 static bool_t __read_mostly opt_vpid_enabled = 1;
 boolean_param("vpid", opt_vpid_enabled);
@@ -951,6 +952,24 @@ void virtual_vmcs_vmwrite(void *vvmcs, u32 vmcs_encoding, u64 val)
 virtual_vmcs_exit(vvmcs);
 }
 
+/*
+ * This function is only called in a vCPU's initialization phase,
+ * so we can update the posted-interrupt descriptor in non-atomic way.
+ */
+static void pi_desc_init(struct vcpu *v)
+{
+uint32_t dest;
+
+v->arch.hvm_vmx.pi_desc.nv = posted_intr_vector;
+
+dest = cpu_physical_id(v->processor);
+
+if ( x2apic_enabled )
+v->arch.hvm_vmx.pi_desc.ndst = dest;
+else
+v->arch.hvm_vmx.pi_desc.ndst = MASK_INSR(dest, PI_xAPIC_NDST_MASK);
+}
+
 static int construct_vmcs(struct vcpu *v)
 {
 struct domain *d = v->domain;
@@ -1089,6 +1108,9 @@ static int construct_vmcs(struct vcpu *v)
 
 if ( cpu_has_vmx_posted_intr_processing )
 {
+if ( iommu_intpost )
+pi_desc_init(v);
+
 __vmwrite(PI_DESC_ADDR, virt_to_maddr(&v->arch.hvm_vmx.pi_desc));
 __vmwrite(POSTED_INTR_NOTIFICATION_VECTOR, posted_intr_vector);
 }
diff --git a/xen/include/asm-x86/hvm/vmx/vmx.h b/xen/include/asm-x86/hvm/vmx/vmx.h
index 8d91110..70b254f 100644
--- a/xen/include/asm-x86/hvm/vmx/vmx.h
+++ b/xen/include/asm-x86/hvm/vmx/vmx.h
@@ -88,6 +88,8 @@ typedef enum {
 #define EPT_EMT_WB  6
 #define EPT_EMT_RSV2  7
 
+#define PI_xAPIC_NDST_MASK  0xFF00
+
 void vmx_asm_vmexit_handler(struct cpu_user_regs);
 void vmx_asm_do_vmentry(void);
 void vmx_intr_assist(void);
-- 
2.1.0
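
The NDST encoding done by pi_desc_init() above can be illustrated on its own. This is a sketch under stated assumptions: `mask_insr()` imitates what Xen's MASK_INSR() is understood to do (shift a value into the field described by a mask), and `pi_ndst_encode()` is a hypothetical helper name, not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

#define PI_XAPIC_NDST_MASK 0xFF00u

/* Insert v into the bit field described by mask m (non-zero). */
static inline uint32_t mask_insr(uint32_t v, uint32_t m)
{
    assert(m != 0);
    /* (m & -m) isolates the lowest set bit of m, i.e. the field's shift. */
    return (v * (m & (0u - m))) & m;
}

/*
 * Compute the Notification Destination value for a physical APIC ID:
 * x2APIC mode uses the full 32-bit ID, xAPIC mode places the 8-bit ID
 * in bits 15:8.
 */
static uint32_t pi_ndst_encode(uint32_t apic_id, int x2apic_enabled)
{
    if ( x2apic_enabled )
        return apic_id;
    return mask_insr(apic_id, PI_XAPIC_NDST_MASK);
}
```

For example, APIC ID 0x12 becomes 0x1200 in xAPIC mode and stays 0x12 in x2APIC mode; bits that do not fit the 8-bit xAPIC field are dropped by the mask.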



[Xen-devel] [PATCH v7 02/17] Add cmpxchg16b support for x86-64

This patch adds cmpxchg16b support for x86-64, so software
can perform 128-bit atomic write/read.

CC: Keir Fraser 
CC: Jan Beulich 
CC: Andrew Cooper 
Signed-off-by: Feng Wu 
---
v7:
- Make the last two parameters of __cmpxchg16b() const
- Remove memory clobber
- Add run-time and build-time checks in cmpxchg16b()
- Cast the last two parameters to void * when calling __cmpxchg16b()

v6:
- Fix a typo

v5:
- Change back the parameters of __cmpxchg16b() to __uint128_t *
- Remove pointless cast for 'ptr'
- Remove pointless parentheses
- Use A constraint for the output

v4:
- Use pointer as the parameter of __cmpxchg16b().
- Use gcc's __uint128_t built-in type
- Make the parameters of __cmpxchg16b() void *

v3:
- Newly added.

 xen/include/asm-x86/x86_64/system.h | 31 +++
 1 file changed, 31 insertions(+)

diff --git a/xen/include/asm-x86/x86_64/system.h b/xen/include/asm-x86/x86_64/system.h
index 662813a..defb770 100644
--- a/xen/include/asm-x86/x86_64/system.h
+++ b/xen/include/asm-x86/x86_64/system.h
@@ -6,6 +6,37 @@
(unsigned long)(n),sizeof(*(ptr
 
 /*
+ * Atomic 16 bytes compare and exchange.  Compare OLD with MEM, if
+ * identical, store NEW in MEM.  Return the initial value in MEM.
+ * Success is indicated by comparing RETURN with OLD.
+ *
+ * This function can only be called when cpu_has_cx16 is true.
+ */
+
+static always_inline __uint128_t __cmpxchg16b(
+volatile void *ptr, const __uint128_t *old, const __uint128_t *new)
+{
+__uint128_t prev;
+uint64_t new_high = *new >> 64;
+uint64_t new_low = (uint64_t)*new;
+
+ASSERT(cpu_has_cx16);
+
+asm volatile ( "lock; cmpxchg16b %3"
+   : "=A" (prev)
+   : "c" (new_high), "b" (new_low), "m" (*__xg(ptr)), "0" (*old)
+ );
+
+return prev;
+}
+
+#define cmpxchg16b(ptr,o,n)\
+( ({ ASSERT(((unsigned long)ptr & 0xF) == 0); }),  \
+  (BUILD_BUG_ON(sizeof(*o) != sizeof(__uint128_t))),   \
+  (BUILD_BUG_ON(sizeof(*n) != sizeof(__uint128_t))),   \
+  (__cmpxchg16b((ptr), (void *)(o), (void *)(n))) )
+
+/*
  * This function causes value _o to be changed to _n at location _p.
  * If this access causes a fault then we return 1, otherwise we return 0.
  * If no fault occurs then _o is updated to the value we saw at _p. If this
-- 
2.1.0
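
The calling contract described in the comment above (return the previous value, detect success by comparing it with OLD) can be tried outside Xen with a standalone sketch. It is not the patch's code: `cmpxchg16b_sketch()` is a hypothetical name, it takes its operands by value, uses explicit RAX/RDX operands plus a memory clobber instead of the "A" constraint, and falls back to a non-atomic version when not built for x86-64.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Sketch of a 16-byte compare-and-exchange: if *ptr == old, store new_val.
 * Returns the value found in *ptr; the caller detects success by comparing
 * the return value with old.
 */
static inline __uint128_t cmpxchg16b_sketch(volatile __uint128_t *ptr,
                                            __uint128_t old,
                                            __uint128_t new_val)
{
    /* cmpxchg16b requires a 16-byte aligned memory operand. */
    assert(((uintptr_t)ptr & 0xF) == 0);

#if defined(__x86_64__)
    uint64_t old_lo = (uint64_t)old, old_hi = (uint64_t)(old >> 64);
    uint64_t new_lo = (uint64_t)new_val, new_hi = (uint64_t)(new_val >> 64);

    /*
     * Compares RDX:RAX with *ptr; on a match stores RCX:RBX into *ptr,
     * otherwise loads the current *ptr into RDX:RAX.
     */
    __asm__ volatile ( "lock cmpxchg16b %0"
                       : "+m" (*ptr), "+a" (old_lo), "+d" (old_hi)
                       : "b" (new_lo), "c" (new_hi)
                       : "cc", "memory" );

    return ((__uint128_t)old_hi << 64) | old_lo;
#else
    /* Non-atomic fallback so the sketch still builds on other targets. */
    __uint128_t prev = *ptr;

    if ( prev == old )
        *ptr = new_val;
    return prev;
#endif
}
```

A typical caller reads the current value, computes the replacement, and retries while the returned value differs from the value it read.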



[Xen-devel] [PATCH v7 09/17] VT-d: Remove pointless casts

Remove pointless casts.

CC: Yang Zhang 
CC: Kevin Tian 
Signed-off-by: Feng Wu 
Reviewed-by: Konrad Rzeszutek Wilk 
---
v7:
- Remove an 'u32' casting omitted in v5

v5:
- Newly added.

 xen/drivers/passthrough/vtd/utils.c | 16 +++-
 1 file changed, 7 insertions(+), 9 deletions(-)

diff --git a/xen/drivers/passthrough/vtd/utils.c b/xen/drivers/passthrough/vtd/utils.c
index 44c4ef5..a75059f 100644
--- a/xen/drivers/passthrough/vtd/utils.c
+++ b/xen/drivers/passthrough/vtd/utils.c
@@ -234,10 +234,9 @@ static void dump_iommu_info(unsigned char key)
 continue;
 printk("  %04x:  %x   %x  %04x %08x %02x%x   %x  %x  %x  %x"
 "   %x %x\n", i,
-(u32)p->hi.svt, (u32)p->hi.sq, (u32)p->hi.sid,
-(u32)p->lo.dst, (u32)p->lo.vector, (u32)p->lo.avail,
-(u32)p->lo.dlm, (u32)p->lo.tm, (u32)p->lo.rh,
-(u32)p->lo.dm, (u32)p->lo.fpd, (u32)p->lo.p);
+p->hi.svt, p->hi.sq, p->hi.sid, p->lo.dst, p->lo.vector,
+p->lo.avail, p->lo.dlm, p->lo.tm, p->lo.rh, p->lo.dm,
+p->lo.fpd, p->lo.p);
 print_cnt++;
 }
 if ( iremap_entries )
@@ -281,11 +280,10 @@ static void dump_iommu_info(unsigned char key)
 
 printk("   %02x:  %04x   %x%x   %x   %x   %x%x"
 "%x %02x\n", i,
-(u32)remap->index_0_14 | ((u32)remap->index_15 << 15),
-(u32)remap->format, (u32)remap->mask, (u32)remap->trigger,
-(u32)remap->irr, (u32)remap->polarity,
-(u32)remap->delivery_status, (u32)remap->delivery_mode,
-(u32)remap->vector);
+remap->index_0_14 | (remap->index_15 << 15),
+remap->format, remap->mask, remap->trigger, remap->irr,
+remap->polarity, remap->delivery_status, remap->delivery_mode,
+remap->vector);
 }
 }
 }
-- 
2.1.0



[Xen-devel] [PATCH v7 08/17] vmx: Suppress posting interrupts when 'SN' is set

Currently, we don't support urgent interrupts: all interrupts
are treated as non-urgent, so we cannot send a
posted-interrupt when 'SN' is set.

CC: Kevin Tian 
CC: Keir Fraser 
CC: Jan Beulich 
CC: Andrew Cooper 
Signed-off-by: Feng Wu 
Reviewed-by: Konrad Rzeszutek Wilk 
---
v7:
- Coding style
- Read the current pi_desc.control as the initial value of prev.control

v6:
- Add some comments

v5:
- keep the vcpu_kick() at the end of vmx_deliver_posted_intr()
- Keep the 'return' after calling __vmx_deliver_posted_interrupt()

v4:
- Coding style.
- V3 removes a vcpu_kick() from the eoi_exitmap_changed path
  incorrectly, fix it.

v3:
- use cmpxchg to test SN/ON and set ON

 xen/arch/x86/hvm/vmx/vmx.c | 29 -
 1 file changed, 28 insertions(+), 1 deletion(-)

diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index c32d863..5f01629 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -1701,8 +1701,35 @@ static void vmx_deliver_posted_intr(struct vcpu *v, u8 vector)
  */
 pi_set_on(&v->arch.hvm_vmx.pi_desc);
 }
-else if ( !pi_test_and_set_on(&v->arch.hvm_vmx.pi_desc) )
+else
 {
+struct pi_desc old, new, prev;
+
+prev.control = v->arch.hvm_vmx.pi_desc.control;
+
+do {
+/*
+ * Currently, we don't support urgent interrupt, all
+ * interrupts are recognized as non-urgent interrupt,
+ * so we cannot send posted-interrupt when 'SN' is set.
+ * Besides that, if 'ON' is already set, we cannot set
+ * posted-interrupts as well.
+ */
+if ( pi_test_sn(&prev) || pi_test_on(&prev) )
+{
+vcpu_kick(v);
+return;
+}
+
+old.control = v->arch.hvm_vmx.pi_desc.control &
+  ~(1 << POSTED_INTR_ON | 1 << POSTED_INTR_SN);
+new.control = v->arch.hvm_vmx.pi_desc.control |
+  (1 << POSTED_INTR_ON);
+
+prev.control = cmpxchg(&v->arch.hvm_vmx.pi_desc.control,
+   old.control, new.control);
+} while ( prev.control != old.control );
+
 __vmx_deliver_posted_interrupt(v);
 return;
 }
-- 
2.1.0



[Xen-devel] [PATCH v7 05/17] vmx: Extend struct pi_desc to support VT-d Posted-Interrupts

Extend struct pi_desc according to VT-d Posted-Interrupts Spec.

CC: Kevin Tian 
CC: Keir Fraser 
CC: Jan Beulich 
CC: Andrew Cooper 
Signed-off-by: Feng Wu 
Reviewed-by: Andrew Cooper 
Acked-by: Kevin Tian 
Reviewed-by: Konrad Rzeszutek Wilk 
---
v7:
- Coding style.

v3:
- Use u32 instead of u64 for the bitfield in 'struct pi_desc'

 xen/include/asm-x86/hvm/vmx/vmcs.h | 14 --
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/xen/include/asm-x86/hvm/vmx/vmcs.h b/xen/include/asm-x86/hvm/vmx/vmcs.h
index f1126d4..b7f78e3 100644
--- a/xen/include/asm-x86/hvm/vmx/vmcs.h
+++ b/xen/include/asm-x86/hvm/vmx/vmcs.h
@@ -80,8 +80,18 @@ struct vmx_domain {
 
 struct pi_desc {
 DECLARE_BITMAP(pir, NR_VECTORS);
-u32 control;
-u32 rsvd[7];
+union {
+struct {
+u16 on : 1,  /* bit 256 - Outstanding Notification */
+sn : 1,  /* bit 257 - Suppress Notification */
+rsvd_1 : 14; /* bit 271:258 - Reserved */
+u8  nv;  /* bit 279:272 - Notification Vector */
+u8  rsvd_2;  /* bit 287:280 - Reserved */
+u32 ndst;/* bit 319:288 - Notification Destination */
+};
+u64 control;
+};
+u32 rsvd[6];
 } __attribute__ ((aligned (64)));
 
 #define ept_get_wl(ept)   ((ept)->ept_wl)
-- 
2.1.0
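
As a rough check of the layout described in the comments above, the extended
descriptor can be mirrored in a standalone struct and verified with static
assertions. This is a sketch only: pi_desc_sketch and pi_make_control() are
hypothetical names, NR_VECTORS is assumed to be 256, and the bit-field
ordering relies on GCC or Clang on a little-endian target.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for the patched 'struct pi_desc' (256 assumed for NR_VECTORS). */
struct pi_desc_sketch {
    uint32_t pir[256 / 32];          /* bits 255:0  - posted interrupt requests */
    union {
        struct {
            uint16_t on     : 1,     /* bit 256     - Outstanding Notification */
                     sn     : 1,     /* bit 257     - Suppress Notification */
                     rsvd_1 : 14;    /* bit 271:258 - Reserved */
            uint8_t  nv;             /* bit 279:272 - Notification Vector */
            uint8_t  rsvd_2;         /* bit 287:280 - Reserved */
            uint32_t ndst;           /* bit 319:288 - Notification Destination */
        };
        uint64_t control;
    };
    uint32_t rsvd[6];
} __attribute__ ((aligned (64)));

_Static_assert(sizeof(struct pi_desc_sketch) == 64, "descriptor is 64 bytes");
_Static_assert(offsetof(struct pi_desc_sketch, control) == 32, "bit 256");
_Static_assert(offsetof(struct pi_desc_sketch, ndst) == 36, "bit 288");

/* Hypothetical helper: fill the named fields and read back the raw word. */
uint64_t pi_make_control(unsigned int on, unsigned int sn,
                         uint8_t nv, uint32_t ndst)
{
    struct pi_desc_sketch d;

    assert(on <= 1 && sn <= 1);
    memset(&d, 0, sizeof(d));
    d.on = on;
    d.sn = sn;
    d.nv = nv;
    d.ndst = ndst;

    return d.control;
}
```

The named fields together occupy exactly the 8 bytes of 'control', which is
why rsvd[] shrinks from 7 to 6 words in the patch.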



[Xen-devel] [PATCH v7 04/17] vt-d: VT-d Posted-Interrupts feature detection

VT-d Posted-Interrupts is an enhancement to CPU side Posted-Interrupt.
With VT-d Posted-Interrupts enabled, external interrupts from
direct-assigned devices can be delivered to guests without VMM
intervention when guest is running in non-root mode.

CC: Yang Zhang 
CC: Kevin Tian 
Signed-off-by: Feng Wu 
Reviewed-by: Konrad Rzeszutek Wilk 
---
v7:
- Remove pointless "if non iommu_intremap then disable iommu_intpost" logic
- Don't need to check !iommu_intremap or !iommu_intpost when setting
  iommu_intpost to 0

v5:
- Remove blank line

v4:
- Correct a logic error when setting iommu_intpost to 0

v3:
- Remove the "if no intremap then no intpost" logic in
  intel_vtd_setup(), it is covered in the iommu_setup().
- Add "if no intremap then no intpost" logic in the end
  of init_vtd_hw() which is called by vtd_resume().

So the logic exists in the following three places:
- parse_iommu_param()
- iommu_setup()
- init_vtd_hw()

 xen/drivers/passthrough/vtd/iommu.c | 14 --
 xen/drivers/passthrough/vtd/iommu.h |  1 +
 2 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/xen/drivers/passthrough/vtd/iommu.c b/xen/drivers/passthrough/vtd/iommu.c
index 1dffc40..8dee731 100644
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -2147,8 +2147,8 @@ int __init intel_vtd_setup(void)
 }
 
 /* We enable the following features only if they are supported by all VT-d
- * engines: Snoop Control, DMA passthrough, Queued Invalidation and
- * Interrupt Remapping.
+ * engines: Snoop Control, DMA passthrough, Queued Invalidation, Interrupt
+ * Remapping, and Posted Interrupt
  */
 for_each_drhd_unit ( drhd )
 {
@@ -2176,6 +2176,14 @@ int __init intel_vtd_setup(void)
 if ( iommu_intremap && !ecap_intr_remap(iommu->ecap) )
 iommu_intremap = 0;
 
+/*
+ * We cannot use posted interrupt if X86_FEATURE_CX16 is
+ * not supported, since we count on this feature to
+ * atomically update 16-byte IRTE in posted format.
+ */
+if ( !cap_intr_post(iommu->cap) || !cpu_has_cx16 )
+iommu_intpost = 0;
+
 if ( !vtd_ept_page_compatible(iommu) )
 iommu_hap_pt_share = 0;
 
@@ -2201,6 +2209,7 @@ int __init intel_vtd_setup(void)
 P(iommu_passthrough, "Dom0 DMA Passthrough");
 P(iommu_qinval, "Queued Invalidation");
 P(iommu_intremap, "Interrupt Remapping");
+P(iommu_intpost, "Posted Interrupt");
 P(iommu_hap_pt_share, "Shared EPT tables");
 #undef P
 
@@ -2220,6 +2229,7 @@ int __init intel_vtd_setup(void)
 iommu_passthrough = 0;
 iommu_qinval = 0;
 iommu_intremap = 0;
+iommu_intpost = 0;
 return ret;
 }
 
diff --git a/xen/drivers/passthrough/vtd/iommu.h b/xen/drivers/passthrough/vtd/iommu.h
index ac71ed1..22abefe 100644
--- a/xen/drivers/passthrough/vtd/iommu.h
+++ b/xen/drivers/passthrough/vtd/iommu.h
@@ -61,6 +61,7 @@
 /*
  * Decoding Capability Register
  */
+#define cap_intr_post(c)   (((c) >> 59) & 1)
 #define cap_read_drain(c)  (((c) >> 55) & 1)
 #define cap_write_drain(c) (((c) >> 54) & 1)
 #define cap_max_amask_val(c)   (((c) >> 48) & 0x3f)
-- 
2.1.0
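
The detection rule in this patch (keep posting enabled only if every VT-d
engine advertises it and the CPU has CX16) can be illustrated in isolation
as below. This is a minimal sketch: intpost_usable() and its parameters are
hypothetical, and only the cap_intr_post() macro is taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* From the patch: bit 59 of the capability register. */
#define cap_intr_post(c)   (((c) >> 59) & 1)

/*
 * Hypothetical helper. 'requested' plays the role of iommu_intpost as set
 * by the administrator; 'caps' holds one capability register per engine.
 */
bool intpost_usable(bool requested, const uint64_t *caps, unsigned int n,
                    bool cpu_has_cx16)
{
    unsigned int i;

    assert(caps != NULL || n == 0);

    for ( i = 0; i < n; i++ )
        if ( !cap_intr_post(caps[i]) || !cpu_has_cx16 )
            requested = false;

    return requested;
}
```

A single engine without the capability is enough to turn the feature off,
matching the "supported by all VT-d engines" comment in intel_vtd_setup().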



Re: [Xen-devel] [v2][PATCH] xen/vtd/iommu: permit group devices to passthrough in relaxed mode

>>> On 11.09.15 at 01:22,  wrote:
> Sorry it's a bad example. My actual concern is that we can't count
> on this per-VM relax/strict policy to prevent group devices assigned
> to different VM. In that case it's definitely a security hole since
> one VM may clobber shared RMRR to impact another VM. So right
> example for that scenario is both VMs specified with 'relax'. 

Sorry, no, the idea of "relax" is to allow the admin to state "I have
no security concerns". Hence we'd have a security issue only if the
default was "relax" (which iiuc it isn't, or if it were _that's_ what
would need to be alongside the presented change). Whether that
statement of the admin is because of
- knowing that the RMRR won't be used post-boot
- group-assigning the devices manually
- simply not caring (i.e. trusting the guests)
is not our business.

IOW, provided there's no way for "relax" to become the default
(Tiejun - please confirm), the patch as is should be fine.

Jan




Re: [Xen-devel] xhci_hcd intterrupt affinity in Dom0/DomU limited to single interrupt

>>> On 10.09.15 at 18:20,  wrote:
> On Wed, 2015-09-09 at 00:48 -0600, Jan Beulich wrote:
>> >>> On 08.09.15 at 18:02,  wrote:
>> > I believe the driver does support use of multiple interrupts based on
>> > the previous explanation of the lspci output where it was established
>> > that the device could use up to 8 interrupts which is what I see on bare
>> > metal.
>> 
>> Where is the proof of that? All I've seen is output like this
>> 
>> Capabilities: [80] MSI: Enable+ Count=1/8 Maskable- 64bit+
>> 
>> which says that one out of eight interrupts is being used. And
>> if in the native case this would indeed be the case, I don't think
>> you've provided complete hypervisor and kernel logs for the
>> Xen case so far, which would allow us to look for respective error
>> indications. And this (ignoring the line wrapping, which makes
>> things hard to read - it would be appreciated if you could fix
>> your mail client)...
>> 
>> > Bare metal:
>> > 
>> > cat /proc/interrupts 
>> >CPU0   CPU1   CPU2   CPU3   CPU4   CPU5
>> > CPU6   CPU7   
>> >   0: 36  0  0  0  0  0
>> > 0  0  IR-IO-APIC-edge  timer
>> >[...]
>> >  27: 337125  47893 708965   4049   53940667 263303
>> > 87847   4958  IR-PCI-MSI-edge  xhci_hcd
>> 
>> ... also shows just a single interrupt being in use.
> 
> Kernel logs for native and Dom0 with 'debug' appended to grub. xl-dmesg
> with log_lvl=all guest_loglvl=all set. Please let me know if there are
> other logs or log levels that I should provide. 

The native kernel log supports there only being a single interrupt
in use. I'm still not seeing any proof of your claim for this to be
different. Did you double check lspci output in the native case?

Jan



[Xen-devel] tools: any user of xc_dom_image->allocate?


While testing xen tools patches to start a pv-domU >512GB I stumbled
over a problem in the domain builder: it keeps track of the last
allocated virtual address in the memory image it is creating. For
very large domains (>1TB) this virtual address will wrap around, as it
starts at -2GB and the p2m for such a domain is >2GB.
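
To make the arithmetic concrete, here is a small sketch of the wrap. It
assumes 4 KiB pages and 8-byte p2m entries (the usual values for a 64-bit
PV guest) and considers the p2m alone, ignoring everything else placed in
the image; p2m_bytes() and p2m_wraps() are illustrative names only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT      12                      /* assumed: 4 KiB pages */
#define P2M_ENTRY_SIZE  8ull                    /* assumed: 8-byte entries */
#define VIRT_BASE       0xffffffff80000000ull   /* -2GB */

/* Size of a flat p2m covering 'mem_bytes' of guest memory. */
uint64_t p2m_bytes(uint64_t mem_bytes)
{
    assert((mem_bytes & ((1ull << PAGE_SHIFT) - 1)) == 0);
    return (mem_bytes >> PAGE_SHIFT) * P2M_ENTRY_SIZE;
}

/* True if placing the p2m at VIRT_BASE runs past the top of the address space. */
bool p2m_wraps(uint64_t mem_bytes)
{
    /* Unsigned overflow is well defined, so a wrapped end is below the base. */
    uint64_t end = VIRT_BASE + p2m_bytes(mem_bytes);

    return end < VIRT_BASE;
}
```

With these assumptions a 1 TiB guest needs 2^28 entries, i.e. exactly 2 GiB
of p2m, which already reaches the top of the address space when started
at -2GB.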

With a modern pvops kernel (4.3) this would be no problem, as it
supports mapping the p2m to an arbitrary address.

In the domain builder, however, I can't currently just ignore the wrap,
as after each memory allocation for the domain image dom->allocate() is
being called with the last allocated virtual address as an argument.
dom->allocate() is allowed to be NULL (in which case it isn't called).

I have found no user of dom->allocate(); deleting it from struct
xc_dom_image and removing the call sites in xc_dom_core.c didn't break
the build.

Are there any objections to removing the allocate() function from struct
xc_dom_image?


Juergen


[Xen-devel] [xen-4.4-testing test] 61695: regressions - FAIL

flight 61695 xen-4.4-testing real [real]
http://logs.test-lab.xenproject.org/osstest/logs/61695/

Regressions :-(

Tests which did not succeed and are blocking,
including tests which could not be run:
 test-amd64-i386-xl-qcow2                  9 debian-di-install               fail REGR. vs. 60727
 test-amd64-i386-xl-raw                    9 debian-di-install               fail REGR. vs. 60727

Tests which are failing intermittently (not blocking):
 test-amd64-i386-xl-qemuu-win7-amd64      16 guest-localmigrate/x10          fail in 61599 pass in 61695
 test-amd64-i386-xl-qemuu-debianhvm-amd64 19 guest-start/debianhvm.repeat    fail pass in 61599

Regressions which are regarded as allowable (not blocking):
 test-amd64-i386-libvirt-vhd               9 debian-di-install               fail REGR. vs. 60727
 test-amd64-amd64-libvirt-raw              9 debian-di-install               fail REGR. vs. 60727
 test-amd64-amd64-libvirt-vhd              9 debian-di-install               fail REGR. vs. 60727
 test-armhf-armhf-xl-multivcpu            16 guest-start/debian.repeat       fail like 60696
 test-amd64-i386-xl-vhd                    9 debian-di-install               fail like 60727
 test-amd64-i386-libvirt-qcow2             9 debian-di-install               fail like 60727
 test-amd64-amd64-xl-vhd                   9 debian-di-install               fail like 60727
 test-amd64-i386-libvirt                  11 guest-start                     fail like 60727
 test-amd64-amd64-libvirt                 11 guest-start                     fail like 60727
 test-amd64-amd64-xl-qemuu-win7-amd64     17 guest-stop                      fail like 60727

Tests which did not succeed, but are not blocking:
 test-amd64-amd64-rumpuserxen-amd64        1 build-check(1)                  blocked n/a
 test-amd64-amd64-migrupgrade              1 build-check(1)                  blocked n/a
 test-amd64-i386-migrupgrade               1 build-check(1)                  blocked n/a
 test-amd64-i386-rumpuserxen-i386          1 build-check(1)                  blocked n/a
 test-armhf-armhf-libvirt-qcow2            9 debian-di-install               fail never pass
 test-armhf-armhf-xl-qcow2                 9 debian-di-install               fail never pass
 build-amd64-rumpuserxen                   6 xen-build                       fail never pass
 build-i386-rumpuserxen                    6 xen-build                       fail never pass
 test-armhf-armhf-xl-raw                   9 debian-di-install               fail never pass
 build-amd64-prev                          5 xen-build                       fail never pass
 test-armhf-armhf-libvirt-raw              9 debian-di-install               fail never pass
 test-armhf-armhf-xl-vhd                   9 debian-di-install               fail never pass
 test-amd64-i386-libvirt-pair             21 guest-migrate/src_host/dst_host fail never pass
 build-i386-prev                           5 xen-build                       fail never pass
 test-amd64-amd64-xl-qcow2                 9 debian-di-install               fail never pass
 test-armhf-armhf-libvirt                 11 guest-start                     fail never pass
 test-amd64-amd64-libvirt-pair            21 guest-migrate/src_host/dst_host fail never pass
 test-armhf-armhf-xl-arndale              12 migrate-support-check           fail never pass
 test-armhf-armhf-xl-arndale              13 saverestore-support-check       fail never pass
 test-amd64-amd64-libvirt-qcow2           11 migrate-support-check           fail never pass
 test-amd64-i386-libvirt-raw              11 migrate-support-check           fail never pass
 test-armhf-armhf-xl-multivcpu            13 saverestore-support-check       fail never pass
 test-armhf-armhf-xl-multivcpu            12 migrate-support-check           fail never pass
 test-armhf-armhf-xl-credit2              13 saverestore-support-check       fail never pass
 test-armhf-armhf-xl-credit2              12 migrate-support-check           fail never pass
 test-armhf-armhf-xl-cubietruck           12 migrate-support-check           fail never pass
 test-armhf-armhf-xl-cubietruck           13 saverestore-support-check       fail never pass
 test-amd64-amd64-xl-qemut-win7-amd64     17 guest-stop                      fail never pass
 test-amd64-i386-xl-qemut-win7-amd64      17 guest-stop                      fail never pass
 test-amd64-i386-xl-qemuu-win7-amd64      17 guest-stop                      fail never pass
 test-amd64-i386-xend-qemut-winxpsp3      21 leak-check/check                fail never pass
 test-armhf-armhf-xl                      12 migrate-support-check           fail never pass
 test-armhf-armhf-xl                      13 saverestore-support-check       fail never pass
 test-armhf-armhf-libvirt-vhd              9 debian-di-install               fail never pass

version targeted for testing:
 xen  339f5743e84a28dd01ffa7498372e410301cd0b4
baseline version:
 xen  3646b134c1673f09c0a239de10b0da4c9265c8e8

Last test of basis    60727  2015-08-16 16:15:09 Z   25 days
Failing since         60802  2015-08-20 14:41:37 Z   21 days   10 attempts
Testing same since    61512  2015-09-07 11:42:03 Z    3 days    3 attempts


People who touched revisions under test:
  Ian Campbell 
  Jan Beulich 
  Julien Grall 

jobs:
 build-amd64-xend pass
 build-i386-xend  pass
 build-amd64

[Xen-devel] [PATCH v7 12/17] x86: move some APIC related macros to apicdef.h

Move some APIC related macros to apicdef.h, so they can be used
outside of vlapic.c.

CC: Keir Fraser 
CC: Jan Beulich 
CC: Andrew Cooper 
Signed-off-by: Feng Wu 
---
v7:
- Put the Macros to the right place inside the file.

v6:
- Newly introduced.

 xen/arch/x86/hvm/vlapic.c | 5 -
 xen/include/asm-x86/apicdef.h | 3 +++
 2 files changed, 3 insertions(+), 5 deletions(-)

diff --git a/xen/arch/x86/hvm/vlapic.c b/xen/arch/x86/hvm/vlapic.c
index b893b40..9b7c871 100644
--- a/xen/arch/x86/hvm/vlapic.c
+++ b/xen/arch/x86/hvm/vlapic.c
@@ -65,11 +65,6 @@ static const unsigned int vlapic_lvt_mask[VLAPIC_LVT_NUM] =
  LVT_MASK
 };
 
-/* Following could belong in apicdef.h */
-#define APIC_SHORT_MASK  0xc
-#define APIC_DEST_NOSHORT0x0
-#define APIC_DEST_MASK   0x800
-
 #define vlapic_lvt_vector(vlapic, lvt_type) \
 (vlapic_get_reg(vlapic, lvt_type) & APIC_VECTOR_MASK)
 
diff --git a/xen/include/asm-x86/apicdef.h b/xen/include/asm-x86/apicdef.h
index 6069fce..f197ff6 100644
--- a/xen/include/asm-x86/apicdef.h
+++ b/xen/include/asm-x86/apicdef.h
@@ -57,6 +57,8 @@
 #defineAPIC_DEST_SELF  0x4
 #defineAPIC_DEST_ALLINC0x8
 #defineAPIC_DEST_ALLBUT0xC
+#defineAPIC_SHORT_MASK 0xC
+#defineAPIC_DEST_NOSHORT   0x0
 #defineAPIC_ICR_RR_MASK0x3
 #defineAPIC_ICR_RR_INVALID 0x0
 #defineAPIC_ICR_RR_INPROG  0x1
@@ -64,6 +66,7 @@
 #defineAPIC_INT_LEVELTRIG  0x08000
 #defineAPIC_INT_ASSERT 0x04000
 #defineAPIC_ICR_BUSY   0x01000
+#defineAPIC_DEST_MASK  0x800
 #defineAPIC_DEST_LOGICAL   0x00800
 #defineAPIC_DEST_PHYSICAL  0x0
 #defineAPIC_DM_FIXED   0x0
-- 
2.1.0



Re: [Xen-devel] [OSSTest Nested v12 09/21] Wrapper and use core_dump_setup() for nested host and normal host to setup coredump sysctl

On Thu, 2015-09-10 at 18:23 +0100, Ian Jackson wrote:
> Robert Ho writes ("[OSSTest Nested v12 09/21] Wrapper and use
> core_dump_setup() for nested host and normal host to setup coredump
> sysctl"):
> > This patch does these 4 things:
> > 1. wrapper coredump setup code from original ts-host-install into
> > TestSupport.pm
> > 2. replace ts-host-install original code with this wrapper function
> 
> You mean
>break coredump setup code into new function `core_dump_setup'
> 
> Please break this part (the refactoring - ie points 1,2) out into its
> own patch.  You can then say `no functional change' (assuming that's
> true).
> 
> > 3. in debian-hvm-install, create '/var/core' in hvm host post
> > installation.
> > 4. in ts-nested-setup, call this function for l1 host.
> 
> Is there some reason why the mkdir isn't done in core_dump_setup ?

It is, so it's unclear why it is also done in debian-hvm-install, I suspect
that is left over from a previous version of the series which didn't do the
refactoring. Probably it can just be dropped.

> Also it should do `mkdir -p' in case the directory already exists
> somehow.

This is what the code which is refactored into core_dump_setup does
already.

Ian.



[Xen-devel] [PATCH v7 13/17] Update IRTE according to guest interrupt config changes

When guest changes its interrupt configuration (such as, vector, etc.)
for direct-assigned devices, we need to update the associated IRTE
with the new guest vector, so external interrupts from the assigned
devices can be injected to guests without VM-Exit.

For lowest-priority interrupts, we use a vector-hashing mechanism to find
the destination vCPU. This follows the hardware behavior, since modern
Intel CPUs use vector hashing to handle the lowest-priority interrupt.
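
The selection rule can be sketched on its own as follows: among the vCPUs
that match the destination, pick the one whose position in the set equals
"gvec % number of matches". This is an illustration only, using a plain
64-bit mask in place of the bitmap; vector_hash_pick() is a hypothetical
name and is not the function added by this patch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper. 'dest_mask' has bit i set if vCPU i matches the
 * destination. Returns the chosen vCPU id, or -1 if nothing matches.
 */
int vector_hash_pick(uint64_t dest_mask, uint8_t gvec)
{
    unsigned int i, nr_dest = 0, target, seen = 0;

    /* Count the candidate destinations. */
    for ( i = 0; i < 64; i++ )
        if ( (dest_mask >> i) & 1 )
            nr_dest++;

    if ( nr_dest == 0 )
        return -1;

    /* The guest vector selects one candidate by its position in the set. */
    target = gvec % nr_dest;

    for ( i = 0; i < 64; i++ )
    {
        if ( !((dest_mask >> i) & 1) )
            continue;
        if ( seen++ == target )
            return (int)i;
    }

    assert(!"unreachable: target is always below nr_dest");
    return -1;
}
```

For example, with vCPUs 2, 3 and 5 as candidates, vector 49 gives
49 % 3 = 1, so the second candidate (vCPU 3) is chosen.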

For multicast/broadcast interrupts, we cannot use interrupt posting, so we
still use interrupt remapping.

CC: Jan Beulich 
Signed-off-by: Feng Wu 
---
v7:
- Remove some pointless debug printk
- Fix a logic error when assigning 'delivery_mode'
- Adjust the definition of local variable 'idx'
- Add a dprintk if we cannot find the vCPU from 'pi_find_dest_vcpu'
- Add 'else if ( delivery_mode == dest_Fixed )' in 'pi_find_dest_vcpu'

v6:
- Use macro to replace plain numbers
- Correct the overflow error in a loop

v5:
- Make 'struct vcpu *vcpu' const

v4:
- Make some 'int' variables 'unsigned int' in pi_find_dest_vcpu()
- Make 'dest_id' uint32_t
- Rename 'size' to 'bitmap_array_size'
- find_next_bit() and find_first_bit() always return unsigned int,
  so no need to check whether the return value is less than 0.
- Message error level XENLOG_G_WARNING -> XENLOG_G_INFO
- Remove useless warning message
- Create a separate function vector_hashing_dest() to find the
  destination of lowest-priority interrupts.
- Change some comments

v3:
- Use a bitmap to store all the possible destination vCPUs of an
  interrupt, then try to find the right destination from the bitmap
- Typo and some small changes

 xen/drivers/passthrough/io.c | 118 ++-
 1 file changed, 117 insertions(+), 1 deletion(-)

diff --git a/xen/drivers/passthrough/io.c b/xen/drivers/passthrough/io.c
index bda9374..5b0b11e 100644
--- a/xen/drivers/passthrough/io.c
+++ b/xen/drivers/passthrough/io.c
@@ -25,6 +25,7 @@
 #include 
 #include 
 #include 
+#include 
 
 static DEFINE_PER_CPU(struct list_head, dpci_list);
 
@@ -198,6 +199,103 @@ void free_hvm_irq_dpci(struct hvm_irq_dpci *dpci)
 xfree(dpci);
 }
 
+/*
+ * This routine handles lowest-priority interrupts using vector-hashing
+ * mechanism. As an example, modern Intel CPUs use this method to handle
+ * lowest-priority interrupts.
+ *
+ * Here is the details about the vector-hashing mechanism:
+ * 1. For lowest-priority interrupts, store all the possible destination
+ *vCPUs in an array.
+ * 2. Use "gvec % max number of destination vCPUs" to find the right
+ *destination vCPU in the array for the lowest-priority interrupt.
+ */
+static struct vcpu *vector_hashing_dest(const struct domain *d,
+uint32_t dest_id,
+bool_t dest_mode,
+uint8_t gvec)
+
+{
+unsigned long *dest_vcpu_bitmap;
+unsigned int dest_vcpus = 0;
+unsigned int bitmap_array_size = BITS_TO_LONGS(d->max_vcpus);
+struct vcpu *v, *dest = NULL;
+unsigned int i;
+
+dest_vcpu_bitmap = xzalloc_array(unsigned long, bitmap_array_size);
+if ( !dest_vcpu_bitmap )
+return NULL;
+
+for_each_vcpu ( d, v )
+{
+if ( !vlapic_match_dest(vcpu_vlapic(v), NULL, APIC_DEST_NOSHORT,
+dest_id, dest_mode) )
+continue;
+
+__set_bit(v->vcpu_id, dest_vcpu_bitmap);
+dest_vcpus++;
+}
+
+if ( dest_vcpus != 0 )
+{
+unsigned int mod = gvec % dest_vcpus;
+unsigned int idx = 0;
+
+for ( i = 0; i <= mod; i++ )
+{
+idx = find_next_bit(dest_vcpu_bitmap, d->max_vcpus, idx) + 1;
+BUG_ON(idx >= d->max_vcpus);
+}
+
+dest = d->vcpu[idx-1];
+}
+
+xfree(dest_vcpu_bitmap);
+
+return dest;
+}
+
+/*
+ * The purpose of this routine is to find the right destination vCPU for
+ * an interrupt which will be delivered by VT-d posted-interrupt. There
+ * are several cases as below:
+ *
+ * - For lowest-priority interrupts, use vector-hashing mechanism to find
+ *   the destination.
+ * - Otherwise, for single destination interrupt, it is straightforward to
+ *   find the destination vCPU and return true.
+ * - For multicast/broadcast vCPU, we cannot handle it via interrupt posting,
+ *   so return NULL.
+ */
+static struct vcpu *pi_find_dest_vcpu(const struct domain *d, uint32_t dest_id,
+  bool_t dest_mode, uint8_t delivery_mode,
+  uint8_t gvec)
+{
+unsigned int dest_vcpus = 0;
+struct vcpu *v, *dest = NULL;
+
+if ( delivery_mode == dest_LowestPrio )
+return vector_hashing_dest(d, dest_id, dest_mode, gvec);
+else if ( delivery_mode == dest_Fixed )
+{
+for_each_vcpu ( d, v )
+{
+if (

[Xen-devel] [PATCH v7 03/17] iommu: Add iommu_intpost to control VT-d Posted-Interrupts feature

VT-d Posted-Interrupts is an enhancement to CPU side Posted-Interrupt.
With VT-d Posted-Interrupts enabled, external interrupts from
direct-assigned devices can be delivered to guests without VMM
intervention when guest is running in non-root mode.

This patch adds the variable 'iommu_intpost' to the generic IOMMU code to
control whether VT-d posted interrupts are enabled.

CC: Jan Beulich 
CC: Kevin Tian 
Signed-off-by: Feng Wu 
Reviewed-by: Kevin Tian 
Reviewed-by: Konrad Rzeszutek Wilk 
Acked-by: Jan Beulich 
---
v5:
- Remove the "if no intremap then no intpost" logic in parse_iommu_param(),
  which can be covered in iommu_setup()

v3:
- Remove pointless initializer for 'iommu_intpost'.
- Some adjustment for "if no intremap then no intpost" logic.
* For parse_iommu_param(), move it to the end of the function,
  so we don't need to add the same logic when introducing the
  new kernel parameter 'intpost' in later patch.
* Add this logic in iommu_setup() after iommu_hardware_setup()
  is called.

 xen/drivers/passthrough/iommu.c | 13 -
 xen/include/xen/iommu.h |  2 +-
 2 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/xen/drivers/passthrough/iommu.c b/xen/drivers/passthrough/iommu.c
index fc7831e..36d5cc0 100644
--- a/xen/drivers/passthrough/iommu.c
+++ b/xen/drivers/passthrough/iommu.c
@@ -51,6 +51,14 @@ bool_t __read_mostly iommu_passthrough;
 bool_t __read_mostly iommu_snoop = 1;
 bool_t __read_mostly iommu_qinval = 1;
 bool_t __read_mostly iommu_intremap = 1;
+
+/*
+ * In the current implementation of VT-d posted interrupts, in some extreme
+ * cases, the per cpu list which saves the blocked vCPU will be very long,
+ * and this will affect the interrupt latency, so let this feature off by
+ * default until we find a good solution to resolve it.
+ */
+bool_t __read_mostly iommu_intpost;
 bool_t __read_mostly iommu_hap_pt_share = 1;
 bool_t __read_mostly iommu_debug;
 bool_t __read_mostly amd_iommu_perdev_intremap = 1;
@@ -307,6 +315,9 @@ int __init iommu_setup(void)
 panic("Couldn't enable %s and iommu=required/force",
   !iommu_enabled ? "IOMMU" : "Interrupt Remapping");
 
+if ( !iommu_intremap )
+iommu_intpost = 0;
+
 if ( !iommu_enabled )
 {
 iommu_snoop = 0;
@@ -374,7 +385,7 @@ void iommu_crash_shutdown(void)
 const struct iommu_ops *ops = iommu_get_ops();
 if ( iommu_enabled )
 ops->crash_shutdown();
-iommu_enabled = iommu_intremap = 0;
+iommu_enabled = iommu_intremap = iommu_intpost = 0;
 }
 
 int iommu_get_reserved_device_memory(iommu_grdm_t *func, void *ctxt)
diff --git a/xen/include/xen/iommu.h b/xen/include/xen/iommu.h
index 8f3a20e..1f5d04a 100644
--- a/xen/include/xen/iommu.h
+++ b/xen/include/xen/iommu.h
@@ -30,7 +30,7 @@
 extern bool_t iommu_enable, iommu_enabled;
 extern bool_t force_iommu, iommu_verbose;
 extern bool_t iommu_workaround_bios_bug, iommu_igfx, iommu_passthrough;
-extern bool_t iommu_snoop, iommu_qinval, iommu_intremap;
+extern bool_t iommu_snoop, iommu_qinval, iommu_intremap, iommu_intpost;
 extern bool_t iommu_hap_pt_share;
 extern bool_t iommu_debug;
 extern bool_t amd_iommu_perdev_intremap;
-- 
2.1.0



[Xen-devel] [PATCH v7 16/17] VT-d: Dump the posted format IRTE

Add the utility to dump the posted format IRTE.

CC: Yang Zhang 
CC: Kevin Tian 
Signed-off-by: Feng Wu 
---
v7:
- Remove the two stage loop

v6:
- Fix a typo

v4:
- Newly added

 xen/drivers/passthrough/vtd/utils.c | 30 +++---
 1 file changed, 23 insertions(+), 7 deletions(-)

diff --git a/xen/drivers/passthrough/vtd/utils.c b/xen/drivers/passthrough/vtd/utils.c
index 6daa156..54db519 100644
--- a/xen/drivers/passthrough/vtd/utils.c
+++ b/xen/drivers/passthrough/vtd/utils.c
@@ -203,6 +203,9 @@ static void dump_iommu_info(unsigned char key)
 ecap_intr_remap(iommu->ecap) ? "" : "not ",
 (status & DMA_GSTS_IRES) ? " and enabled" : "" );
 
+printk("  Interrupt Posting: %ssupported.\n",
+cap_intr_post(iommu->cap) ? "" : "not ");
+
 if ( status & DMA_GSTS_IRES )
 {
 /* Dump interrupt remapping table. */
@@ -213,8 +216,11 @@ static void dump_iommu_info(unsigned char key)
 
 printk("  Interrupt remapping table (nr_entry=%#x. "
 "Only dump P=1 entries here):\n", nr_entry);
-printk("   SVT  SQ   SID  DST  V  AVL DLM TM RH DM "
-   "FPD P\n");
+printk("R means remapped format, P means posted format.\n");
+printk("R:   SVT  SQ   SID  V  AVL FPD  DST DLM TM RH DM "
+   "P\n");
+printk("P:   SVT  SQ   SID  V  AVL FPD  PDA  URG "
+   "P\n");
 for ( i = 0; i < nr_entry; i++ )
 {
 struct iremap_entry *p;
@@ -232,11 +238,21 @@ static void dump_iommu_info(unsigned char key)
 
 if ( !p->remap.p )
 continue;
-printk("  %04x:  %x   %x  %04x %08x %02x%x   %x  %x  %x  %x"
-"   %x %x\n", i,
-p->remap.svt, p->remap.sq, p->remap.sid, p->remap.dst,
-p->remap.vector, p->remap.avail, p->remap.dlm, p->remap.tm,
-p->remap.rh, p->remap.dm, p->remap.fpd, p->remap.p);
+if ( !p->remap.im )
+printk("R:  %04x:  %x   %x  %04x %02x%x   %x %08x   "
+"%x  %x  %x  %x %x\n", i,
+p->remap.svt, p->remap.sq, p->remap.sid,
+p->remap.vector, p->remap.avail, p->remap.fpd,
+p->remap.dst, p->remap.dlm, p->remap.tm, p->remap.rh,
+p->remap.dm, p->remap.p);
+else
+printk("P:  %04x:  %x   %x  %04x %02x%x   %x %16lx"
+"%x %x\n", i,
+p->post.svt, p->post.sq, p->post.sid, p->post.vector,
+p->post.avail, p->post.fpd,
+((u64)p->post.pda_h << 32) | (p->post.pda_l << 6),
+p->post.urg, p->post.p);
+
 print_cnt++;
 }
 if ( iremap_entries )
-- 
2.1.0
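
For the posted format, the dump reassembles the posted-interrupt descriptor
address from its two IRTE fields. A standalone sketch of that step is below;
irte_pda() is a hypothetical name, and the 26-bit width of the low field is
an assumption based on the "<< 6" in the patch (descriptor addresses are
64-byte aligned).

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: combine the high 32 bits and the low field
 * (assumed to hold address bits 31:6) into a 64-bit descriptor address.
 */
uint64_t irte_pda(uint32_t pda_h, uint32_t pda_l)
{
    assert(pda_l < (1u << 26));

    /* Widen before shifting so that no bits are lost. */
    return ((uint64_t)pda_h << 32) | ((uint64_t)pda_l << 6);
}
```

The low six bits of the result are always zero, consistent with the
64-byte alignment of struct pi_desc.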



Re: [Xen-devel] xl: libxl_domain_info: getting domain info list: Bad address


On 11/09/2015 08:55, Riku Voipio wrote:

It looks like the errors started Sep 4th, while Sep 3rd was still OK.
The Xen binary was same for both test runs, only the kernel (which
follows mainline) was changed.

Failing kernel was 807249d3ada1ff28a47c4054ca4edd479421b671
While last succeeding was 1e1a4e8f439113b7820bc7150569f685e1cc2b43


Thank you for narrowing it down.


This range includes merge of ARM development updates from Russell King
for 4.3, which probably contains the change that breaks Xen.


I've been able to reproduce it on midway so it's not related to the 
Arndale board.


Although, I don't see anything obvious in the log which could break Xen.

I will try a manual bisection to see if I can finger a specific commit.

--
Julien Grall


Re: [Xen-devel] xl: libxl_domain_info: getting domain info list: Bad address

On Thu, 2015-09-10 at 18:07 +0100, Julien Grall wrote:
> Hi,
> 
> Riku reported me an error on their CI loop while run Xen on the Arndale:
> 
> Starting /usr/sbin/xenstored...
> Setting domain 0 name, domid and JSON config...
> libxl: error: libxl.c:675:libxl_domain_info: getting domain info list:
> Bad address
> libxl: error: libxl_dom.c:1869:libxl__userdata_path: unable to find
> domain info for domain 0: Bad address
> cannot store stub json config for Dom0
> Starting xenconsoled...
> Starting QEMU as disk backend for dom0
> /etc/init.d/xencommons: line 102: qemu-system-i386: command not found
> libxl: error: libxl.c:656:libxl_list_domain: getting domain info list:
> Bad address
> libxl_list_domain failed.

FWIW this caused this recent test failure of linux-next:
http://logs.test-lab.xenproject.org/osstest/logs/61690/test-armhf-armhf-xl-arndale/info.html
I don't know for how long it has been failing, but may or may not be
bisectable by the automated bisector.

> The full log can be found here: https://paste.debian.net/311187
> 
> I've looked at the osstest log and was able to find the same errors very often
> on the arndale. Although, it seems that the tests are still passing. For
> instance [1].

I don't see this message in any of the test logs.

It does appear in the serial logs, but at "Sep  9 00:47:23" and
"Sep  9 01:47:29", while this test was running from "2015-09-09 08:01:32 Z"
until just after "2015-09-09 15:16:16 Z".

This is because the serial logs are not rotated for each new job and they
aren't trimmed to only the relevant time span, so they can (and almost
always do) contain stuff from previous tests. You generally need to scroll
to the end.

There are two related potential improvements which could be made to osstest
here. The first would be to arrange for the serial logs to be trimmed to the
relevant time span. The second would be a new ts-logs-audit step which runs
after ts-logs-capture and checks for anything amiss (e.g. segfaults in the
kernel logs).

Obviously if you do the second without the first the code would need to be
careful to only look at relevant lines in serial.log (other logs should be
ok).

Ian.


Re: [Xen-devel] [PATCH] xen/domctl: lower loglevel of XEN_DOMCTL_memory_mapping

>>> On 11.09.15 at 02:59,  wrote:
> If you want a formula I would do:
> 
> #define MAX_SOCKETS 8
> 
>  max_pfns = pow(2,(MAX_SOCKETS - (max(nr_iommus(), MAX_SOCKETS * 64;
> 
> Where nr_iommus would have to be somehow implemented, ditto for pow.
> 
> This should give you:
>  8-> 64
>  7-> 128
>  6-> 256
>  5-> 512
>  4-> 1024
>  3-> 2048
>  2-> 4096
>  1-> 16384

16k seems excessive as a default. Also - why would this be related
to the number of sockets? I don't think there's a one-IOMMU-per-
socket rule; fixed-number-of-IOMMUs-per-node might come closer,
but there we'd have the problem of what "fixed number" is. Wouldn't
something as simple as 1024 / nr_iommus() do?
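
To compare the two proposals side by side, here is a small sketch. Both
function names are made up, and the first one is only one possible reading
of the quoted formula (taking min() where the quote has max(), so that the
table's 8 -> 64 ... 2 -> 4096 rows come out); under that reading a single
IOMMU would give 8192 rather than the 16384 shown in the table.

```c
#include <assert.h>

#define MAX_SOCKETS 8

/* One reading of the quoted proposal: 64 * 2^(MAX_SOCKETS - min(n, MAX_SOCKETS)). */
unsigned int max_pfns_pow2(unsigned int nr_iommus)
{
    unsigned int n;

    assert(nr_iommus > 0);
    n = nr_iommus < MAX_SOCKETS ? nr_iommus : MAX_SOCKETS;

    return 64u << (MAX_SOCKETS - n);
}

/* The simpler alternative suggested above: 1024 / nr_iommus(). */
unsigned int max_pfns_simple(unsigned int nr_iommus)
{
    assert(nr_iommus > 0);

    return 1024u / nr_iommus;
}
```

The simple form never exceeds 1024, whereas the power-of-two form grows
quickly as the IOMMU count drops.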

I also don't follow what cache flushes you talked about earlier: I
don't think the IOMMUs drive any global cache flushes, and I
would have thought the size limited IOTLB and (CPU side) cache
ones should be pretty limited in terms of bus load (unless the TLB
ones would get converted to global ones due to lacking IOMMU
capabilities). Is that not the case?

Jan


Re: [Xen-devel] xl: libxl_domain_info: getting domain info list: Bad address

On Fri, 2015-09-11 at 10:26 +0100, Julien Grall wrote:
> Hi Ian,
> 
> On 11/09/2015 09:52, Ian Campbell wrote:
> > On Thu, 2015-09-10 at 18:07 +0100, Julien Grall wrote:
> > > Hi,
> > > 
> > > Riku reported me an error on their CI loop while run Xen on the
> > > Arndale:
> > > 
> > > Starting /usr/sbin/xenstored...
> > > Setting domain 0 name, domid and JSON config...
> > > libxl: error: libxl.c:675:libxl_domain_info: getting domain info
> > > list:
> > > Bad address
> > > libxl: error: libxl_dom.c:1869:libxl__userdata_path: unable to find
> > > domain info for domain 0: Bad address
> > > cannot store stub json config for Dom0
> > > Starting xenconsoled...
> > > Starting QEMU as disk backend for dom0
> > > /etc/init.d/xencommons: line 102: qemu-system-i386: command not found
> > > libxl: error: libxl.c:656:libxl_list_domain: getting domain info
> > > list:
> > > Bad address
> > > libxl_list_domain failed.
> > 
> > FWIW this caused this recent test failure of linux-next:
> > http://logs.test-lab.xenproject.org/osstest/logs/61690/test-armhf-armhf-xl-arndale/info.html
> > I don't know for how long it has been failing, but may or may not be
> > bisectable by the automated bisector.
> 
> It's causing the issue on Linux next since the end of august:
> http://logs.test-lab.xenproject.org/osstest/results/history/test-armhf-armhf-xl-arndale/linux-next.html
> 
> The same problem appears in linux-linus from the 8th of september:
> http://logs.test-lab.xenproject.org/osstest/results/history/test-armhf-armhf-xl-arndale/linux-linus.html
> 
> Maybe the last job can be bisect from v4.2 tag?

It looks like it tried and got some really weird error:
http://logs.test-lab.xenproject.org/osstest/results/bisect/linux-linus/test-armhf-armhf-xl-arndale.leak-check--basis%288%29.html

Revision graph generation failed!
Error message:
 dot -Tps -o/home/logs/results/bisect/linux-linus/test-armhf-armhf-xl-arndale.leak-check--basis(8).ps /home/logs/results/bisect/linux-linus/test-armhf-armhf-xl-arndale.leak-check--basis(8).dot: 512  at Osstest.pm line 357.


It worked for me manually, I wonder if the issue is lack of escaping for
the ()'s?
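
To illustrate the suspected escaping problem, here is a minimal Python
sketch. It is not osstest code (osstest is Perl), and the file name under
/tmp is made up to resemble the path in the error message. If the 512 above
is a raw wait status, it would correspond to exit status 2, which is what sh
commonly returns for a syntax error; that is an inference, not something the
log confirms.

```python
import shlex
import subprocess

# Hypothetical file name containing "(" and ")", modelled on the graph
# output path in the error message above.
path = "/tmp/test-armhf-armhf-xl-arndale.leak-check--basis(8).ps"

# One command string handed to a shell: the unquoted "(" is a shell syntax
# error, so the command fails before echo ever runs.
unquoted = subprocess.run(["sh", "-c", "echo " + path],
                          capture_output=True, text=True)

# Separate arguments and no shell: the path is passed through untouched.
as_list = subprocess.run(["echo", path], capture_output=True, text=True)

# If a shell cannot be avoided, quote each word before interpolating it.
quoted = subprocess.run(["sh", "-c", "echo " + shlex.quote(path)],
                        capture_output=True, text=True)

print("unquoted exit status:", unquoted.returncode)
print("argument list output:", as_list.stdout.strip())
print("quoted output:       ", quoted.stdout.strip())
```

Passing the arguments as a list is usually the more robust of the two
working variants, since nothing has to be quoted correctly in the first
place.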

Ian?

> 
> > 
> > > The full log can be found here: https://paste.debian.net/311187
> > > 
> > > I've looked at the osstest log and was able to find the same errors
> > > very often on the arndale. Although, it seems that the tests are still
> > > passing. For instance [1].
> > 
> > I don't see this message in any of the test logs.
> > 
> > It does appear in the serial logs, but with at "Sep  9 00:47:23" and
> > "Sep  9 01:47:29" while this test was running from "2015-09-09 08:01:32 Z"
> > until just after "2015-09-09 15:16:16 Z".
> > 
> > This is because the serial logs are not rotated for each new job and they
> > aren't trimmed to only the relevant time span, so they can (and almost
> > always do) contain stuff from previous tests. You generally need to scroll
> > to the end.
> 
> Damn, I though the serial logs was only containing the serial messages 
> for the current job.
> 
> I was doing some grep in the logs to see when it first appears. Sorry 
> for the confusion
> 
> Regards,
> 

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

[Xen-devel] [PATCH OSSTEST] ts-xen-build: Do not set QEMU_REMOTE unless $r{tree_qemu} is set

4.4 and earlier do not check if QEMU_REMOTE is empty before using it.
From 4.5 onwards, if QEMU_REMOTE is empty then the default is used.

This should fix the build-*-prev job for 4.5 and earlier. In this job
we deliberately don't specify tree_qemu since we want whatever
that branch gives us.

Signed-off-by: Ian Campbell 
---
 ts-xen-build | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/ts-xen-build b/ts-xen-build
index cebfaf3..b02e737 100755
--- a/ts-xen-build
+++ b/ts-xen-build
@@ -52,12 +52,14 @@ sub checkout () {
echo >>.config debug=$debug_build
echo >>.config GIT_HTTP=y
echo >>.config LIBLEAFDIR_x86_64=lib
-   echo >>.config QEMU_REMOTE='$r{tree_qemu}'
echo >>.config KERNELS=''
 END
(nonempty($r{enable_xsm}) ? <>.config XSM_ENABLE='${build_xsm}'
 END
+   (nonempty($r{tree_qemu}) ? <>.config QEMU_REMOTE='$r{tree_qemu}'
+END
(nonempty($r{revision_qemu}) ? <>.config QEMU_TAG='$r{revision_qemu}'
 END
-- 
2.5.1
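
The idea of the patch (only emit a setting when its runvar is non-empty, so
that the build falls back to its own default) can be sketched in Python as
follows. This is an illustration only: the function name and the dictionary
standing in for %r are hypothetical, and the real script is Perl generating
shell.

```python
def config_lines(r):
    """Return .config lines, skipping optional settings whose runvar is empty.

    `r` is a plain dict standing in for osstest's runvars (hypothetical).
    """
    lines = ["GIT_HTTP=y", "KERNELS=''"]
    # (config variable, runvar) pairs that are written only when set.
    optional = [("QEMU_REMOTE", "tree_qemu"), ("QEMU_TAG", "revision_qemu")]
    for var, key in optional:
        value = r.get(key)
        if value:  # None and "" both count as "not set"
            lines.append(f"{var}='{value}'")
    return lines


# With no tree_qemu, QEMU_REMOTE is left out entirely rather than set empty.
without_tree = config_lines({})
# The URL below is a placeholder, not a real repository.
with_tree = config_lines({"tree_qemu": "git://example.invalid/qemu.git"})
print(without_tree)
print(with_tree)
```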



[Xen-devel] [PATCH OSSTEST] cri-common: Refactor select_prevxenbranch to cri-getprevxenbranch

This moves it outside any prevailing set -x and reduces the amount of
noise in various logs.

Signed-off-by: Ian Campbell 
---
 cri-common   | 16 +---
 cri-getprevxenbranch | 19 +++
 2 files changed, 20 insertions(+), 15 deletions(-)
 create mode 100755 cri-getprevxenbranch

diff --git a/cri-common b/cri-common
index 94696ab..2669485 100644
--- a/cri-common
+++ b/cri-common
@@ -61,21 +61,7 @@ repo_tree_rev_fetch_git () {
 }
 
 select_prevxenbranch () {
-   local b
-   local p
-   for b in $(./mg-list-all-branches) ; do # already sorted by version
-   case "$b" in
-   xen*)
-   if [ "x$b" = "x$xenbranch" ] ; then
-   break
-   else
-   p=$b
-   fi
-   ;;
-   *)  ;;
-   esac
-   done
-   prevxenbranch=$p
+   prevxenbranch=`./cri-getprevxenbranch $xenbranch`
 }
 
 select_xenbranch () {
diff --git a/cri-getprevxenbranch b/cri-getprevxenbranch
new file mode 100755
index 000..dce52c2
--- /dev/null
+++ b/cri-getprevxenbranch
@@ -0,0 +1,19 @@
+#!/bin/bash
+
+xenbranch=$1
+p=
+
+for b in $(./mg-list-all-branches) ; do # already sorted by version
+case "$b" in
+   xen*)
+   if [ "x$b" = "x$xenbranch" ] ; then
+   break
+   else
+   p=$b
+   fi
+   ;;
+   *)  ;;
+esac
+done
+
+echo $p
-- 
2.5.1
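
For readers who want to check the selection logic outside the shell, here is
a rough Python equivalent of the loop in cri-getprevxenbranch. The branch
list is invented for the example; the real script takes it from
./mg-list-all-branches, which is assumed to be sorted by version as the
comment in the patch says.

```python
def prev_xen_branch(branches, xenbranch):
    """Return the last xen* branch listed before `xenbranch`, or "" if none."""
    prev = ""
    for b in branches:
        if not b.startswith("xen"):  # mirrors the xen*) case pattern
            continue
        if b == xenbranch:
            break
        prev = b
    return prev


# Hypothetical branch list, already sorted by version.
branches = [
    "linux-3.14",
    "xen-4.3-testing",
    "xen-4.4-testing",
    "qemu-mainline",
    "xen-4.5-testing",
    "xen-unstable",
]
print(prev_xen_branch(branches, "xen-4.5-testing"))
print(repr(prev_xen_branch(branches, "xen-4.3-testing")))
```

As in the shell version, a branch that never appears in the list yields the
last xen* branch seen, because the loop simply runs to the end.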



Re: [Xen-devel] [OSSTest Nested v12 01/21] Optimize and re-format previous code of 'submenu' parsing

On Thu, 2015-09-10 at 17:16 +0100, Ian Jackson wrote:
> Robert Ho writes ("[OSSTest Nested v12 01/21] Optimize and re-format
> previous code of 'submenu' parsing"):
> > * space between ')' and '{'; and after '='
> > * omit unnecessary 'define' and '!defined' usage
> > * break long '{}' into several lines
> 
> These changes are all good.
> 
> But:
> 
> >  if (m/^\s*submenu\s+[\'\"](.*)[\'\"].*\{\s*$/) {
> > -$submenu={ StartLine =>$., MenuEntryPath => join ">", @offsets };
> > +$submenu= { StartLine =>$. };
> 
> This drops the setting of MenuEntryPath from $submenu.  This isn't
> mentioned in the commit message, and (without looking at the code to
> double-check) I'm not sure that it's right.

From memory, MenuEntryPath is unused, but I left it there because when
debugging it was very convenient to use Data::Dumper to dump $submenu
and/or $entry at each iteration, and in that case having that
information to hand can be useful.

I suppose one could argue that when debugging you can add this field
back, but personally I'd prefer to just leave it, as it's harmless.

Ian.

> 
> Thanks,
> Ian.


Re: [Xen-devel] xl: libxl_domain_info: getting domain info list: Bad address

2015-09-11 Thread Riku Voipio

Hi,

On 10 September 2015 at 20:07, Julien Grall  wrote:
> Hi,
>
> Riku reported me an error on their CI loop while run Xen on the Arndale:
>
> Starting /usr/sbin/xenstored...
> Setting domain 0 name, domid and JSON config...
> libxl: error: libxl.c:675:libxl_domain_info: getting domain info list: Bad 
> address
> libxl: error: libxl_dom.c:1869:libxl__userdata_path: unable to find domain 
> info for domain 0: Bad address
> cannot store stub json config for Dom0
> Starting xenconsoled...
> Starting QEMU as disk backend for dom0
> /etc/init.d/xencommons: line 102: qemu-system-i386: command not found
> libxl: error: libxl.c:656:libxl_list_domain: getting domain info list: Bad 
> address
> libxl_list_domain failed.
>
> The full log can be found here: https://paste.debian.net/311187
>
> I've looked at the osstest log and was able to find the same errors very often
> on the arndale. Although, it seems that the tests are still passing. For
> instance [1].

Hi,

It looks like the errors started on Sep 4th, while Sep 3rd was still OK.
The Xen binary was the same for both test runs; only the kernel (which
follows mainline) changed.

The failing kernel was 807249d3ada1ff28a47c4054ca4edd479421b671,
while the last succeeding one was 1e1a4e8f439113b7820bc7150569f685e1cc2b43.

This range includes the merge of ARM development updates from Russell King
for 4.3, which probably contains the change that breaks Xen.

Riku


Re: [Xen-devel] [PATCH v3] x86/hvm: fix saved pmtimer value

>>> On 11.09.15 at 03:18,  wrote:
> Jan Beulich  writes:
>> From: Kouya Shimura 
>>
>> The ACPI PM timer is sometimes broken on live migration.
>> Since vcpu->arch.hvm_vcpu.guest_time is always zero in other than
>> "delay for missed ticks mode". Even in "delay for missed ticks mode",
>> vcpu's guest_time field is not valid (i.e. zero) when
>> the state of vcpu is "blocked". (see pt_save_timer function)
>>
>> The original author (Tim Deegan) of pmtimer_save() must have intended
>> that it saves the last scheduled time of the vcpu. Unfortunately it was
>> already implied this bug. FYI, there is no other timer mode than 
>> "delay for missed ticks mode" then.
>>
>> For consistency with HPET, pmtimer_save() should refer hvm_get_guest_time()
>> to update the counter as well as hpet_save() does. 
>>
>> Without this patch, the clock of windows server 2012R2 without HPET
>> might leap forward several minutes on live migration.
>>
>> Signed-off-by: Kouya Shimura 
>>
>> Retain use of ->arch.hvm_vcpu.guest_time when non-zero. Do the inverse
>> adjustment for vHPET.
> 
> I'm fine with this change.
> Why not modify the subject too?

Right - adding "and hpet" to it.

Jan



Re: [Xen-devel] [libvirt] [PATCH LIBVIRT] libxl: don't end job for ephemeal domain on start failure

2015-09-11 Thread Michal Privoznik

On 10.09.2015 17:45, Ian Campbell wrote:
> commit 4b53d0d4ac9c "libxl: don't remove persistent domain on start
> failure" cleans up the vm object and sets it to NULL if the vm is not
> persistent, however at end job vm (now NULL) is dereferenced via the call to
> libxlDomainObjEndJob. Avoid this by skipping "endjob" and going
> straight to "cleanup" in this case.
> 
> Signed-off-by: Ian Campbell 
> ---
>  src/libxl/libxl_driver.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/src/libxl/libxl_driver.c b/src/libxl/libxl_driver.c
> index 5f69b49..e2797d5 100644
> --- a/src/libxl/libxl_driver.c
> +++ b/src/libxl/libxl_driver.c
> @@ -992,6 +992,7 @@ libxlDomainCreateXML(virConnectPtr conn, const char *xml,
>  if (!vm->persistent) {
>  virDomainObjListRemove(driver->domains, vm);
>  vm = NULL;
> +goto cleanup;
>  }
>  goto endjob;
>  }
> 

While jumping to the cleanup label between BeginJob and EndJob
usually causes trouble, here it is desired.

ACKed and pushed.

That said, looking at the code, maybe it's time to make it look more like
the qemu driver: wrap EndJob(); vm = NULL; into one function, do
proper refcounting, etc.

Michal


Re: [Xen-devel] xl: libxl_domain_info: getting domain info list: Bad address

Hi Ian,

On 11/09/2015 09:52, Ian Campbell wrote:

> On Thu, 2015-09-10 at 18:07 +0100, Julien Grall wrote:
> > Hi,
> >
> > Riku reported me an error on their CI loop while run Xen on the Arndale:
> >
> > Starting /usr/sbin/xenstored...
> > Setting domain 0 name, domid and JSON config...
> > libxl: error: libxl.c:675:libxl_domain_info: getting domain info list:
> > Bad address
> > libxl: error: libxl_dom.c:1869:libxl__userdata_path: unable to find
> > domain info for domain 0: Bad address
> > cannot store stub json config for Dom0
> > Starting xenconsoled...
> > Starting QEMU as disk backend for dom0
> > /etc/init.d/xencommons: line 102: qemu-system-i386: command not found
> > libxl: error: libxl.c:656:libxl_list_domain: getting domain info list:
> > Bad address
> > libxl_list_domain failed.
>
> FWIW this caused this recent test failure of linux-next:
> http://logs.test-lab.xenproject.org/osstest/logs/61690/test-armhf-armhf-xl-arndale/info.html
> I don't know for how long it has been failing, but may or may not be
> bisectable by the automated bisector.

It has been causing the issue on linux-next since the end of August:
http://logs.test-lab.xenproject.org/osstest/results/history/test-armhf-armhf-xl-arndale/linux-next.html

The same problem appears in linux-linus from the 8th of September:
http://logs.test-lab.xenproject.org/osstest/results/history/test-armhf-armhf-xl-arndale/linux-linus.html

Maybe the last job can be bisected from the v4.2 tag?

> > The full log can be found here: https://paste.debian.net/311187
> >
> > I've looked at the osstest log and was able to find the same errors very often
> > on the arndale. Although, it seems that the tests are still passing. For
> > instance [1].
>
> I don't see this message in any of the test logs.
>
> It does appear in the serial logs, but with at "Sep 9 00:47:23" and "Sep
> 9 01:47:29" while this test was running from "2015-09-09 08:01:32 Z" until
> just after "2015-09-09 15:16:16 Z".
>
> This is because the serial logs are not rotated for each new job and they
> aren't trimmed to only the relevant time span, so they can (and almost
> always do) contain stuff from previous tests. You generally need to scroll
> to the end.

Damn, I thought the serial logs only contained the serial messages
for the current job.

I was grepping the logs to see when it first appeared. Sorry
for the confusion.

Regards,

--
Julien Grall


[Xen-devel] [qemu-upstream-4.5-testing test] 61733: regressions - trouble: broken/fail/pass

flight 61733 qemu-upstream-4.5-testing real [real]
http://logs.test-lab.xenproject.org/osstest/logs/61733/

Regressions :-(

Tests which did not succeed and are blocking,
including tests which could not be run:
 test-amd64-i386-xl-vhd        9 debian-di-install         fail REGR. vs. 60577
 test-amd64-i386-xl-raw        9 debian-di-install         fail REGR. vs. 60577
 test-amd64-amd64-xl-raw       9 debian-di-install         fail REGR. vs. 60577

Tests which are failing intermittently (not blocking):
 test-armhf-armhf-xl-vhd       3 host-install(3)     broken in 61618 pass in 61733
 test-armhf-armhf-xl-rtds      3 host-install(3)     broken pass in 61618
 test-armhf-armhf-xl-multivcpu 7 host-ping-check-xen fail in 61618 pass in 61733
 test-armhf-armhf-libvirt-vhd  6 xen-boot            fail pass in 61618
 test-amd64-i386-xl-qemuu-winxpsp3 15 guest-localmigrate.2 fail pass in 61618

Regressions which are regarded as allowable (not blocking):
 test-amd64-amd64-libvirt-raw  9 debian-di-install         fail REGR. vs. 60577
 test-amd64-i386-xl-qemuu-win7-amd64 17 guest-stop         fail blocked in 60577
 test-amd64-i386-libvirt      11 guest-start               fail like 60577
 test-amd64-amd64-xl-qemuu-win7-amd64 17 guest-stop        fail like 60577

Tests which did not succeed, but are not blocking:
 test-armhf-armhf-libvirt-vhd    9 debian-di-install         fail in 61618 never pass
 test-armhf-armhf-xl-rtds       13 saverestore-support-check fail in 61618 never pass
 test-armhf-armhf-xl-rtds       12 migrate-support-check     fail in 61618 never pass
 test-armhf-armhf-xl-rtds       16 guest-start/debian.repeat fail in 61618 never pass
 test-amd64-amd64-xl-pvh-intel  11 guest-start               fail never pass
 test-amd64-amd64-xl-pvh-amd    11 guest-start               fail never pass
 test-armhf-armhf-xl-qcow2       9 debian-di-install         fail never pass
 test-amd64-i386-xl-qcow2        9 debian-di-install         fail never pass
 test-armhf-armhf-libvirt-qcow2  9 debian-di-install         fail never pass
 test-armhf-armhf-libvirt-raw    9 debian-di-install         fail never pass
 test-armhf-armhf-xl-raw         9 debian-di-install         fail never pass
 test-armhf-armhf-libvirt       14 guest-saverestore         fail never pass
 test-armhf-armhf-libvirt       12 migrate-support-check     fail never pass
 test-amd64-i386-libvirt-pair   21 guest-migrate/src_host/dst_host fail never pass
 test-amd64-amd64-libvirt-pair  21 guest-migrate/src_host/dst_host fail never pass
 test-amd64-amd64-libvirt       12 migrate-support-check     fail never pass
 test-armhf-armhf-xl-multivcpu  13 saverestore-support-check fail never pass
 test-armhf-armhf-xl-multivcpu  12 migrate-support-check     fail never pass
 test-amd64-amd64-libvirt-qcow2 11 migrate-support-check     fail never pass
 test-armhf-armhf-xl-arndale    12 migrate-support-check     fail never pass
 test-armhf-armhf-xl-arndale    13 saverestore-support-check fail never pass
 test-amd64-i386-libvirt-raw    11 migrate-support-check     fail never pass
 test-armhf-armhf-xl-credit2    13 saverestore-support-check fail never pass
 test-armhf-armhf-xl-credit2    12 migrate-support-check     fail never pass
 test-amd64-i386-libvirt-qcow2  11 migrate-support-check     fail never pass
 test-armhf-armhf-xl-cubietruck 12 migrate-support-check     fail never pass
 test-armhf-armhf-xl-cubietruck 13 saverestore-support-check fail never pass
 test-amd64-i386-libvirt-vhd    11 migrate-support-check     fail never pass
 test-amd64-amd64-libvirt-vhd   11 migrate-support-check     fail never pass
 test-armhf-armhf-xl            12 migrate-support-check     fail never pass
 test-armhf-armhf-xl            13 saverestore-support-check fail never pass
 test-armhf-armhf-xl-vhd         9 debian-di-install         fail never pass

version targeted for testing:
 qemuu                c6dc376c4b5292769582137867d1be6c3960b5c7
baseline version:
 qemuu                f74d682ee4878af6a8e943f5f0b595af92b20084

Last test of basis    60577  2015-08-04 12:45:54 Z   38 days
Failing since         60964  2015-08-28 09:10:02 Z   14 days    3 attempts
Testing same since    61618  2015-09-08 12:11:46 Z    3 days    2 attempts


People who touched revisions under test:
  Gerd Hoffmann 
  Peter Lieven 

jobs:
 build-amd64                                pass
 build-armhf                                pass
 build-i386                                 pass
 build-amd64-libvirt                        pass
 build-armhf-libvirt                        pass
 build-i386-libvirt                         pass
 build-amd64-pvops                          pass
 build-armhf-pvops

[Xen-devel] [qemu-upstream-4.3-testing test] 61729: regressions - FAIL

flight 61729 qemu-upstream-4.3-testing real [real]
http://logs.test-lab.xenproject.org/osstest/logs/61729/

Regressions :-(

Tests which did not succeed and are blocking,
including tests which could not be run:
 test-amd64-i386-xl-qemuu-debianhvm-amd64 19 guest-start/debianhvm.repeat fail REGR. vs. 60700

Regressions which are regarded as allowable (not blocking):
 test-amd64-i386-xl-raw                9 debian-di-install     fail like 60700
 test-amd64-i386-libvirt              11 guest-start           fail like 60700
 test-amd64-i386-xl-qemuu-win7-amd64  17 guest-stop            fail like 60700
 test-amd64-amd64-xl-qemuu-win7-amd64 17 guest-stop            fail like 60700

Tests which did not succeed, but are not blocking:
 test-amd64-amd64-xl-qemuu-ovmf-amd64  9 debian-hvm-install    fail never pass
 test-amd64-i386-xl-qemuu-ovmf-amd64   9 debian-hvm-install    fail never pass
 test-amd64-amd64-xl-raw               9 debian-di-install     fail never pass
 test-amd64-amd64-libvirt             12 migrate-support-check fail never pass
 test-amd64-amd64-libvirt-raw         11 migrate-support-check fail never pass
 test-amd64-i386-libvirt-qcow2        11 migrate-support-check fail never pass
 test-amd64-amd64-libvirt-qcow2       11 migrate-support-check fail never pass
 test-amd64-i386-libvirt-raw          11 migrate-support-check fail never pass
 test-amd64-amd64-libvirt-vhd         11 migrate-support-check fail never pass
 test-amd64-i386-libvirt-vhd          11 migrate-support-check fail never pass

version targeted for testing:
 qemuu                92dae02ba02166cfcce020cb71021a73903ada2f
baseline version:
 qemuu                20c1b1812de98ed789d55e22a43a4700fb765596

Last test of basis    60700  2015-08-14 10:50:55 Z   28 days
Failing since         60903  2015-08-27 01:40:43 Z   15 days    3 attempts
Testing same since    61620  2015-09-08 12:11:41 Z    3 days    2 attempts


People who touched revisions under test:
  Gerd Hoffmann 
  Peter Lieven 

jobs:
 build-amd64                                pass
 build-i386                                 pass
 build-amd64-libvirt                        pass
 build-i386-libvirt                         pass
 build-amd64-pvops                          pass
 build-i386-pvops                           pass
 test-amd64-amd64-xl                        pass
 test-amd64-i386-xl                         pass
 test-amd64-i386-qemuu-rhel6hvm-amd         pass
 test-amd64-amd64-xl-qemuu-debianhvm-amd64  pass
 test-amd64-i386-xl-qemuu-debianhvm-amd64   fail
 test-amd64-i386-freebsd10-amd64            pass
 test-amd64-amd64-xl-qemuu-ovmf-amd64       fail
 test-amd64-i386-xl-qemuu-ovmf-amd64        fail
 test-amd64-amd64-xl-qemuu-win7-amd64       fail
 test-amd64-i386-xl-qemuu-win7-amd64        fail
 test-amd64-amd64-xl-credit2                pass
 test-amd64-i386-freebsd10-i386             pass
 test-amd64-i386-qemuu-rhel6hvm-intel       pass
 test-amd64-amd64-libvirt                   pass
 test-amd64-i386-libvirt                    fail
 test-amd64-amd64-xl-multivcpu              pass
 test-amd64-amd64-pair                      pass
 test-amd64-i386-pair                       pass
 test-amd64-amd64-pv                        pass
 test-amd64-i386-pv                         pass
 test-amd64-amd64-amd64-pvgrub              pass
 test-amd64-amd64-i386-pvgrub               pass
 test-amd64-amd64-pygrub                    pass
 test-amd64-amd64-libvirt-qcow2             pass
 test-amd64-i386-libvirt-qcow2              pass
 test-amd64-amd64-xl-qcow2                  pass
 test-amd64-i386-xl-qcow2                   pass
 test-amd64-amd64-libvirt-raw               pass
 test-amd64-i386-libvirt-raw                pass
 test-amd64-amd64-xl-raw                    fail
 test-amd64-i386-xl-raw                     fail
 test-amd64-i386-xl-qemuu-winxpsp3-vcpus1   pass
 test-amd64-amd64-libvirt-vhd               pass
 test-amd64-i386-libvirt-vhd                pass
 test-amd64-amd64-xl-vhd

[Xen-devel] [PATCH 2/2] block/xen-blkfront: Handle non-indirect grant with 64KB pages

The minimal size of a request in the block framework is always PAGE_SIZE.
This means that when a 64KB guest is supported, a request will be at least
64KB.

However, if the backend doesn't support indirect grants (such as QDISK
in QEMU), a ring request can only accommodate 11 segments of 4KB
(i.e. 44KB).

The current frontend assumes that an I/O request will always fit in
a ring request. This is no longer true when using 64KB page
granularity, and the guest will therefore crash during boot.

On ARM64, the ABI is completely neutral to the page granularity used by
the domU. The guest has the choice between different page granularity
supported by the processors (for instance on ARM64: 4KB, 16KB, 64KB).
This can't be enforced by the hypervisor and therefore it's possible to
run guests using different page granularity.

So we can't mandate that the block backend support indirect grants
when the frontend is using 64KB page granularity, and have to fix it
properly in the frontend.

The solution below is based on directly modifying the frontend
rather than asking the block framework to support smaller sizes
(i.e. < PAGE_SIZE). This is because the changes in the block framework are
not trivial, as everything seems to rely on a struct *page (see [1]).
It may be possible that someone succeeds in doing it in the future,
and we would then be able to take advantage of it.

Given that a block request may not fit in a single ring request, a
second request is introduced for the data that cannot fit in the first
one. This means that the second request should never be used on Linux
configuration using a page granularity < 44KB.
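
The arithmetic behind the 44KB limit and the need for a second request can
be checked with a small Python sketch. The two constants take the values
quoted above (11 segments of 4KB); the function name is made up for the
example and does not exist in the driver.

```python
XEN_PAGE_SIZE = 4096                 # 4KB grant granularity, as stated above
BLKIF_MAX_SEGMENTS_PER_REQUEST = 11  # segments per non-indirect ring request

# Maximum payload of one non-indirect ring request: 11 * 4KB = 44KB.
MAX_BYTES_PER_RING_REQUEST = BLKIF_MAX_SEGMENTS_PER_REQUEST * XEN_PAGE_SIZE


def ring_requests_needed(linux_page_size):
    """Ring requests needed to carry one Linux page without indirect grants."""
    grants = -(-linux_page_size // XEN_PAGE_SIZE)           # ceiling division
    return -(-grants // BLKIF_MAX_SEGMENTS_PER_REQUEST)     # ceiling division


for size_kb in (4, 16, 64):
    print(f"{size_kb}KB page -> {ring_requests_needed(size_kb * 1024)} ring request(s)")
```

A 64KB page needs 16 grants, which is more than 11 but no more than 22, so
exactly one extra request is enough; 4KB and 16KB pages fit in a single
request.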

Note that the parameters passed to the blk_queue_max_* helpers haven't been
updated. The block code will set the minimum size supported, and we may be
able to directly support any change in the block framework that lowers the
minimal size of a request.

[1] http://lists.xen.org/archives/html/xen-devel/2015-08/msg02200.html

Signed-off-by: Julien Grall 

---
Cc: Konrad Rzeszutek Wilk 
Cc: "Roger Pau Monné" 
Cc: Boris Ostrovsky 
Cc: David Vrabel 
---
 drivers/block/xen-blkfront.c | 199 +++
 1 file changed, 183 insertions(+), 16 deletions(-)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index f9d55c3..03772c9 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -60,6 +60,20 @@
 
 #include 
 
+/*
+ * The block framework is always working on segment of PAGE_SIZE minimum.
+ * When Linux is using a different page size than xen, it may not be possible
+ * to put all the data in a single segment.
+ * This can happen when the backend doesn't support indirect grant and
+ * therefore the maximum amount of data that a request can carry is
+ * BLKIF_MAX_SEGMENTS_PER_REQUEST * XEN_PAGE_SIZE = 44KB
+ *
+ * Note that we only support one extra request. So the Linux page size
+ * should be <= ( 2 * BLKIF_MAX_SEGMENTS_PER_REQUEST * XEN_PAGE_SIZE) =
+ * 88KB.
+ */
+#define HAS_EXTRA_REQ (BLKIF_MAX_SEGMENTS_PER_REQUEST < XEN_PFN_PER_PAGE)
+
 enum blkif_state {
BLKIF_STATE_DISCONNECTED,
BLKIF_STATE_CONNECTED,
@@ -79,6 +93,19 @@ struct blk_shadow {
struct grant **indirect_grants;
struct scatterlist *sg;
unsigned int num_sg;
+   enum
+   {
+   REQ_WAITING,
+   REQ_DONE,
+   REQ_FAIL
+   } status;
+
+   #define NO_ASSOCIATED_ID ~0UL
+   /*
+* Id of the sibling if we ever need 2 requests when handling a
+* block I/O request
+*/
+   unsigned long associated_id;
 };
 
 struct split_bio {
@@ -467,6 +494,8 @@ static unsigned long blkif_ring_get_request(struct blkfront_info *info,
 
id = get_id_from_freelist(info);
info->shadow[id].request = req;
+   info->shadow[id].status = REQ_WAITING;
+   info->shadow[id].associated_id = NO_ASSOCIATED_ID;
 
(*ring_req)->u.rw.id = id;
 
@@ -508,6 +537,9 @@ struct setup_rw_req {
bool need_copy;
unsigned int bvec_off;
char *bvec_data;
+
+   bool require_extra_req;
+   struct blkif_request *ring_req2;
 };
 
 static void blkif_setup_rw_req_grant(unsigned long gfn, unsigned int offset,
@@ -521,8 +553,24 @@ static void blkif_setup_rw_req_grant(unsigned long gfn, unsigned int offset,
unsigned int grant_idx = setup->grant_idx;
struct blkif_request *ring_req = setup->ring_req;
struct blkfront_info *info = setup->info;
+   /*
+* We always use the shadow of the first request to store the list
+* of grant associated to the block I/O request. This made the
+* completion more easy to handle even if the block I/O request is
+* split.
+*/
struct blk_shadow *shadow = &info->shadow[setup->id];
 
+   if (unlikely(setup->require_extra_req

[Xen-devel] [PATCH 1/2] block/xen-blkfront: Introduce blkif_ring_get_request

The code to get a request is always the same, so we can factor it out
into a single function.

Signed-off-by: Julien Grall 

---
Cc: Konrad Rzeszutek Wilk 
Cc: "Roger Pau Monné" 
Cc: Boris Ostrovsky 
Cc: David Vrabel 
---
 drivers/block/xen-blkfront.c | 30 +++---
 1 file changed, 19 insertions(+), 11 deletions(-)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 43cda94..f9d55c3 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -456,6 +456,23 @@ static int blkif_ioctl(struct block_device *bdev, fmode_t mode,
return 0;
 }
 
+static unsigned long blkif_ring_get_request(struct blkfront_info *info,
+   struct request *req,
+   struct blkif_request **ring_req)
+{
+   unsigned long id;
+
+   *ring_req = RING_GET_REQUEST(&info->ring, info->ring.req_prod_pvt);
+   info->ring.req_prod_pvt++;
+
+   id = get_id_from_freelist(info);
+   info->shadow[id].request = req;
+
+   (*ring_req)->u.rw.id = id;
+
+   return id;
+}
+
 static int blkif_queue_discard_req(struct request *req)
 {
struct blkfront_info *info = req->rq_disk->private_data;
@@ -463,9 +480,7 @@ static int blkif_queue_discard_req(struct request *req)
unsigned long id;
 
/* Fill out a communications ring structure. */
-   ring_req = RING_GET_REQUEST(&info->ring, info->ring.req_prod_pvt);
-   id = get_id_from_freelist(info);
-   info->shadow[id].request = req;
+   id = blkif_ring_get_request(info, req, &ring_req);
 
ring_req->operation = BLKIF_OP_DISCARD;
ring_req->u.discard.nr_sectors = blk_rq_sectors(req);
@@ -476,8 +491,6 @@ static int blkif_queue_discard_req(struct request *req)
else
ring_req->u.discard.flag = 0;
 
-   info->ring.req_prod_pvt++;
-
/* Keep a private copy so we can reissue requests when recovering. */
info->shadow[id].req = *ring_req;
 
@@ -613,9 +626,7 @@ static int blkif_queue_rw_req(struct request *req)
new_persistent_gnts = 0;
 
/* Fill out a communications ring structure. */
-   ring_req = RING_GET_REQUEST(&info->ring, info->ring.req_prod_pvt);
-   id = get_id_from_freelist(info);
-   info->shadow[id].request = req;
+   id = blkif_ring_get_request(info, req, &ring_req);
 
BUG_ON(info->max_indirect_segments == 0 &&
   GREFS(req->nr_phys_segments) > BLKIF_MAX_SEGMENTS_PER_REQUEST);
@@ -628,7 +639,6 @@ static int blkif_queue_rw_req(struct request *req)
for_each_sg(info->shadow[id].sg, sg, num_sg, i)
   num_grant += gnttab_count_grant(sg->offset, sg->length);
 
-   ring_req->u.rw.id = id;
info->shadow[id].num_sg = num_sg;
if (num_grant > BLKIF_MAX_SEGMENTS_PER_REQUEST) {
/*
@@ -694,8 +704,6 @@ static int blkif_queue_rw_req(struct request *req)
if (setup.segments)
kunmap_atomic(setup.segments);
 
-   info->ring.req_prod_pvt++;
-
/* Keep a private copy so we can reissue requests when recovering. */
info->shadow[id].req = *ring_req;
 
-- 
2.1.4



[Xen-devel] [PATCH 0/2] block/xen-blkfront: Support non-indirect with 64KB page granularity

Hi all,

This is a follow-up to the previous discussion [1] about guests using 64KB
page granularity not booting with a backend using non-indirect grants.

This has been successfully tested on ARM64 with both 64KB and 4KB page
granularity guests and QEMU as the backend. Indeed, QEMU does not support
indirect grants.

For a summary of the previous discussion see patch #2.

This series is based on top of my 64KB page granularity support [2].

Comments are welcome.

Sincerely yours,

[1] http://lists.xen.org/archives/html/xen-devel/2015-08/msg01659.html
[2] https://lwn.net/Articles/656797/

Cc: Konrad Rzeszutek Wilk 
Cc: "Roger Pau Monné" 
Cc: Boris Ostrovsky 
Cc: David Vrabel 

Julien Grall (2):
  block/xen-blkfront: Introduce blkif_ring_get_request
  block/xen-blkfront: Handle non-indirect grant with 64KB pages

 drivers/block/xen-blkfront.c | 229 ++-
 1 file changed, 202 insertions(+), 27 deletions(-)

-- 
2.1.4



[Xen-devel] [qemu-upstream-4.2-testing test] 61726: tolerable FAIL - PUSHED

flight 61726 qemu-upstream-4.2-testing real [real]
http://logs.test-lab.xenproject.org/osstest/logs/61726/

Failures :-/ but no regressions.

Tests which are failing intermittently (not blocking):
 test-i386-i386-xl-qemuu-winxpsp3 16 guest-localmigrate/x10 fail in 61619 pass in 61726
 test-amd64-amd64-xl-qemuu-win7-amd64 16 guest-localmigrate/x10 fail pass in 61619

Regressions which are regarded as allowable (not blocking):
 test-amd64-amd64-xl-qemuu-win7-amd64 17 guest-stop fail in 61619 like 60611

Tests which did not succeed, but are not blocking:
 test-amd64-i386-xl-qemuu-ovmf-amd64   9 debian-hvm-install fail never pass
 test-amd64-amd64-xl-qemuu-ovmf-amd64  9 debian-hvm-install fail never pass
 test-amd64-i386-xl-qemuu-win7-amd64  17 guest-stop         fail never pass
 test-amd64-i386-xend-qemuu-winxpsp3  21 leak-check/check   fail never pass

version targeted for testing:
 qemuu                2a5956801545ff4122dc9551bcc4c4e3053f30ba
baseline version:
 qemuu                138906105dd47b9dc6b1e5010e81fc606983dd75

Last test of basis    60611  2015-08-06 01:42:11 Z   36 days
Testing same since    61619  2015-09-08 12:11:08 Z    3 days    2 attempts


People who touched revisions under test:
  Gerd Hoffmann 
  Peter Lieven 

jobs:
 build-amd64                                pass
 build-i386                                 pass
 build-amd64-libvirt                        pass
 build-i386-libvirt                         pass
 build-amd64-pvops                          pass
 build-i386-pvops                           pass
 test-amd64-i386-qemuu-rhel6hvm-amd         pass
 test-amd64-amd64-xl-qemuu-debianhvm-amd64  pass
 test-amd64-i386-xl-qemuu-debianhvm-amd64   pass
 test-amd64-amd64-xl-qemuu-ovmf-amd64       fail
 test-amd64-i386-xl-qemuu-ovmf-amd64        fail
 test-amd64-amd64-xl-qemuu-win7-amd64       fail
 test-amd64-i386-xl-qemuu-win7-amd64        fail
 test-amd64-i386-qemuu-rhel6hvm-intel       pass
 test-amd64-i386-xl-qemuu-winxpsp3-vcpus1   pass
 test-amd64-i386-xend-qemuu-winxpsp3        fail
 test-amd64-amd64-xl-qemuu-winxpsp3         pass
 test-i386-i386-xl-qemuu-winxpsp3           pass



sg-report-flight on osstest.test-lab.xenproject.org
logs: /home/logs/logs
images: /home/logs/images

Logs, config files, etc. are available at
http://logs.test-lab.xenproject.org/osstest/logs

Explanation of these reports, and of osstest in general, is at
http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README.email;hb=master
http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README;hb=master

Test harness code can be found at
http://xenbits.xen.org/gitweb?p=osstest.git;a=summary


Pushing revision :

+ branch=qemu-upstream-4.2-testing
+ revision=2a5956801545ff4122dc9551bcc4c4e3053f30ba
+ . cri-lock-repos
++ . cri-common
+++ . cri-getconfig
+++ umask 002
+++ getrepos
++++ getconfig Repos
++++ perl -e '
use Osstest;
readglobalconfig();
print $c{"Repos"} or die $!;
'
+++ local repos=/home/osstest/repos
+++ '[' -z /home/osstest/repos ']'
+++ '[' '!' -d /home/osstest/repos ']'
+++ echo /home/osstest/repos
++ repos=/home/osstest/repos
++ repos_lock=/home/osstest/repos/lock
++ '[' x '!=' x/home/osstest/repos/lock ']'
++ OSSTEST_REPOS_LOCK_LOCKED=/home/osstest/repos/lock
++ exec with-lock-ex -w /home/osstest/repos/lock ./ap-push 
qemu-upstream-4.2-testing 2a5956801545ff4122dc9551bcc4c4e3053f30ba
+ branch=qemu-upstream-4.2-testing
+ revision=2a5956801545ff4122dc9551bcc4c4e3053f30ba
+ . cri-lock-repos
++ . cri-common
+++ . cri-getconfig
+++ umask 002
+++ getrepos
++++ getconfig Repos
++++ perl -e '
use Osstest;
readglobalconfig();
print $c{"Repos"} or die $!;
'
+++ local repos=/home/osstest/repos
+++ '[' -z /home/osstest/repos ']'
+++ '[' '!' -d /home/osstest/repos ']'
+++ echo /home/osstest/repos
++ repos=/home/osstest/repos
++ repos_lock=/home/osstest/repos/lock
++ '[' x/home/osstest/repos/lock '!=' x/home/osstest/repos/lock ']'
+ . cri-common
++ . cri-getconfig
++ umask 002
+ select_xenbranch
+ case "$branch" in
+ tree=qemuu
+ xenbranch=xen-4.2-testing
+ '[' xqemuu = xlinux ']'
+ linuxbranch=
+ '[' x = x ']'
+ qemuubranch=qemu-upstream-4.2-testing
+ select_prevxenbranch
+ local b
+ local p
++ ./mg-list-all-branches
+ for b in '$(./mg-list-all-branches)'
+ case "$b" in
+ for b in '$(./mg-list-all-branches)'
+ case "$b"

Re: [Xen-devel] OVMF/Xen, Debian wheezy can't boot with NX on stack (Was: Re: [edk2] [PATCH] OvmfPkg: prevent code execution from DXE stack)

2015-09-11 Thread Josh Triplett

On Fri, Sep 11, 2015 at 05:28:06PM +0200, Laszlo Ersek wrote:
> On 09/11/15 16:10, Josh Triplett wrote:
> > On Fri, Sep 11, 2015 at 01:43:53PM +0200, Laszlo Ersek wrote:
> >> On 09/09/15 12:48, Laszlo Ersek wrote:
> >>> On 09/09/15 11:37, Ian Campbell wrote:
>  On Wed, 2015-09-09 at 01:06 -0600, Jan Beulich wrote:
>  On 09.09.15 at 00:23,  wrote:
> >> On 09/08/15 19:26, Anthony PERARD wrote:
> >>> And I get this on the console:
> >>> Welcome to GRUB!
> >>>
> >>>  X64 Exception Type - 0E(#PF - Page-Fault)  CPU Apic ID -
> >>>  
> >>> RIP  - 0F5F8918, CS  - 0028, RFLAGS -
> >>> 00210206
> >>> ExceptionData - 0011
> >>> RAX  - , RCX - 07FCE000, RDX -
> >>> 
> >>> RBX  - 0B6092C0, RSP - 0F5F8590, RBP -
> >>> 0B608EA0
> >>> RSI  - 0F5F8838, RDI - 0B608EA0
> >>> R8   - , R9  - 0B609200, R10 -
> >>> 
> >>> R11  - 000A, R12 - , R13 -
> >>> 001B
> >>> R14  - 0B609360, R15 - 
> >>> DS   - 0008, ES  - 0008, FS  -
> >>> 0008
> >>> GS   - 0008, SS  - 0008
> >>> CR0  - 8033, CR2 - 0F5F8918, CR3 -
> >>> 0F597000
> >>> CR4  - 0668, CR8 - 
> >>> DR0  - , DR1 - , DR2 -
> >>> 
> >>> DR3  - , DR6 - 0FF0, DR7 -
> >>> 0400
> >>> GDTR - 0F57BF18 003F, LDTR - 
> >>> IDTR - 0EEA5018 0FFF,   TR - 
> >>> FXSAVE_STATE - 0F5F81F0
> >>>  Find PE image 
> >> /build/xen-unstable/src/xen-unstable/tools/firmware/ovmf-dir
> >> -remote/Build
> >> /OvmfX64/DEBUG_GCC49/X64/IntelFrameworkModulePkg/Universal/StatusCode/R
> >> untime
> >> Dxe/StatusCodeRuntimeDxe/DEBUG/StatusCodeRuntimeDxe.dll 
> >> (ImageBase=0F556000, EntryPoint=0F55628F) 
> >>>
> >>> I did check with other guest (Windows, Ubuntu, Debian Jessie), and
> >>> they are
> >>> working correctly. Debian Wheezy is the only one that fail.
> >>
> >> I don't have an environment to reproduce this in. I think we should try
> >> to understand this problem better, before deciding how to make it go
> >> away.
> >>
> >> Please locate the "StatusCodeRuntimeDxe.debug" file in your Build
> >> directory (ie. under the location listed in the error report). Then,
> >> please disassemble it with "objdump -S". The fault location in the
> >> disassembly can be found based on RIP, ImageBase and EntryPoint;
> >
> > I don't think the exact instruction at that address really matters. The
> > main question appears to be why RIP and RSP both point into the
> > same page (see also the subject of Anthony's mail).
> 
>  I'm not 100% what is going on,
> >>>
> >>> me neither :)
> >>>
>  but if this (executable code on stack) is
>  happening in grub is there something which is explicitly forbidden to 
>  UEFI
>  apps by the UEFI spec?
> >>>
> >>> Yes, there is. This small OvmfPkg patch only enables the edk2 feature
> >>> added by Star Zeng in
> >>>  for OVMF. That patch
> >>> (also referenced in my commit message by SVN rev) says,
> >>>
> >>> This feature is added for UEFI spec that says
> >>> "Stack may be marked as non-executable in identity mapped page
> >>> tables".
> >>> A PCD PcdSetNxForStack is added to turn on/off this feature, and it
> >>> is FALSE by default.
> >>>
> >>> A UEFI app runs (well, *starts*, anyway) before ExitBootServices() /
> >>> SetVirtualAddressMap(), so it's bound by the above.
> >>>
> >>> The spec passage above is quoted from "2.3.2 IA-32 Platforms", and
> >>> "2.3.4 x64 Platforms", in chapter "2.3 Calling Conventions", where the
> >>> boot services time environment is specified.
> >>>
> >>> This is new in UEFI-2.5, and it comes from Mantis ticket 1224: "Adding
> >>> support for No executable data areas".
> >>>
> >>> ... The question could be then if grub (in Wheezy) should be adapted to
> >>> UEFI-2.5 (if that's possible) or if OVMF should be built without this
> >>> feature.
> >>>
> >>> Hmmm. Actually, I'm torn about the default for PcdSetNxForStack.
> >>>
> >>> Namely, Mantis ticket 1224 has come up before. There's another edk2
> >>> sub-feature related to this UEFI spec feature / Mantis ticket; the
> >>> properties table (controlled by "PcdPropertiesTableEnable"), and the
> >>> effects it has on the UEFI memory map, and the requirements it presents
> >>> for UEFI OSes.
> >>>
> >>> *That* sub-feature

Re: [Xen-devel] [PATCH v4 00/20] xen/arm64: Add support for 64KB page in Linux

Hi,

A quick update on the TODO.

On 07/09/15 16:33, Julien Grall wrote:
> ARM64 Linux supports both 4KB and 64KB page granularity. However, the Xen
> hypercall interface and PV protocol are always based on 4KB page granularity.
> 
> Any attempt to boot a Linux guest with 64KB pages enabled will result in a
> guest crash.
> 
> This series is a first attempt to allow such Linux guests to run with the
> current hypercall interface and PV protocol.
> 
> This solution has been chosen because we want to run 64KB Linux on released
> Xen ARM versions and/or platforms using an old version of Linux DOM0.
> 
> There is room for improvement, such as support of 64KB grant, modification
> of PV protocol to support different page size... They will be explored in a
> separate patch series later.
> 
> TODO list:
> - Convert swiotlb to 64KB

Sent http://lists.xen.org/archives/html/xen-devel/2015-09/msg01292.html

> - Convert xenfb to 64KB
> - Support for multiple page ring support
> - Support for 64KB in gnttdev
> - Support of non-indirect grant with 64KB frontend

Sent http://lists.xen.org/archives/html/xen-devel/2015-09/msg01577.html

> - It may be possible to move some common define between
> netback/netfront and blkfront/blkback in an header
> 
> I've got most of the patches for the TODO items. I'm planning to send them as
> a follow-up as they are not a requirement for basic guests.
> 
> All patches have been build-tested for ARM32, ARM64 and x86, but I haven't
> run-tested them on x86 as I don't have a box running Xen x86. I would be
> happy if someone could give them a try and look for possible regressions on x86.
> 
> I know that Konrad has a test suite for x86. Konrad, would it be possible to
> give this series a run?
> 
> A branch based on the latest xentip/for-linus-4.3 can be found here:
> 
> git://xenbits.xen.org/people/julieng/linux-arm.git branch xen-64k-v4

I will resend a new version when 4.3-rc1 is out in order to fix any
possible conflict with linux/master. I already know that the netback
patch (#18) is conflicting with 1d5d48523900a4b0f25d6b52f1a93c84bd671186
"xen-netback: require fewer guest Rx slots when not using GSO".

If the 2 series above are completely acked/reviewed, I will fold them in
this series.

Regards,

-- 
Julien Grall
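
To make the granularity mismatch described in the cover letter concrete, here is a minimal sketch of the arithmetic in C. The macro and function names are hypothetical and chosen for illustration only; they are not taken from the patch series. The only facts assumed are the ones stated above: Xen's interface works in 4KB frames, while the guest may use 4KB or 64KB pages.

```c
#include <assert.h>

/* Hypothetical names for illustration; not taken from the patch series. */
#define XEN_GRANULE_SHIFT 12                          /* Xen ABI: 4KB frames */
#define XEN_GRANULE_SIZE  (1UL << XEN_GRANULE_SHIFT)

/* Number of 4KB Xen frames covered by one guest page of the given size. */
static unsigned long xen_frames_per_page(unsigned long guest_page_size)
{
    return guest_page_size / XEN_GRANULE_SIZE;
}

/* First Xen frame number covered by guest page frame number pfn. */
static unsigned long guest_pfn_to_xen_pfn(unsigned long pfn,
                                          unsigned long guest_page_size)
{
    return pfn * xen_frames_per_page(guest_page_size);
}
```

With 64KB guest pages each Linux page spans 16 Xen frames, so any code that hands a "page" to a hypercall or a PV ring has to iterate over those 16 frames (or pick one of them) instead of assuming a 1:1 mapping.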

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

[Xen-devel] [ovmf test] 61736: regressions - FAIL

flight 61736 ovmf real [real]
http://logs.test-lab.xenproject.org/osstest/logs/61736/

Regressions :-(

Tests which did not succeed and are blocking,
including tests which could not be run:
 test-amd64-amd64-xl-qemuu-ovmf-amd64 9 debian-hvm-install fail REGR. vs. 60869
 test-amd64-i386-xl-qemuu-ovmf-amd64  9 debian-hvm-install fail REGR. vs. 60869

version targeted for testing:
 ovmf 78c8ec8a3f1c09856f0d70027d6a9d814208a77f
baseline version:
 ovmf ba1806251ff8ff695175b92ab5732eadbcd2f72e

Last test of basis    60869  2015-08-25 03:03:43 Z   17 days
Failing since         60904  2015-08-27 01:40:43 Z   15 days   10 attempts
Testing same since    61736  2015-09-10 05:21:23 Z    1 days    1 attempts


People who touched revisions under test:
  "Yao, Jiewen" 
  Ard Biesheuvel 
  Cecil Sheng 
  Cecil Sheng 
  Dandan Bi 
  eric Dong 
  Feng Tian 
  Fu Siyuan 
  Gary Ching-Pang Lin 
  Hao Wu 
  Heyi Guo 
  Jeff Fan 
  Jiaxin Wu 
  Jonathan Panozzo 
  Laszlo Ersek 
  Leif Lindholm 
  Liming Gao 
  Masamitsu MURASE 
  Qin Long 
  Qiu Shumin 
  Ruiyu Ni 
  Samer El-Haj-Mahmoud 
  Shifei Lu 
  Star Zeng 
  Sunny Wang 
  Yao, Jiewen 
  Yingke Liu 
  Zhang Lubo 

jobs:
 build-amd64-xsm                          pass
 build-i386-xsm                           pass
 build-amd64                              pass
 build-i386                               pass
 build-amd64-libvirt                      pass
 build-i386-libvirt                       pass
 build-amd64-pvops                        pass
 build-i386-pvops                         pass
 test-amd64-amd64-xl-qemuu-ovmf-amd64     fail
 test-amd64-i386-xl-qemuu-ovmf-amd64      fail



sg-report-flight on osstest.test-lab.xenproject.org
logs: /home/logs/logs
images: /home/logs/images

Logs, config files, etc. are available at
http://logs.test-lab.xenproject.org/osstest/logs

Explanation of these reports, and of osstest in general, is at
http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README.email;hb=master
http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README;hb=master

Test harness code can be found at
http://xenbits.xen.org/gitweb?p=osstest.git;a=summary


Not pushing.

(No revision log; it would be 1991 lines long.)


Re: [Xen-devel] OVMF/Xen, Debian wheezy can't boot with NX on stack (Was: Re: [edk2] [PATCH] OvmfPkg: prevent code execution from DXE stack)

2015-09-11 Thread Josh Triplett

On Fri, Sep 11, 2015 at 11:27:32PM +0200, Laszlo Ersek wrote:
> On 09/11/15 21:30, Josh Triplett wrote:
> > On Fri, Sep 11, 2015 at 05:28:06PM +0200, Laszlo Ersek wrote:
> >> Breaking Debian Wheezy's and BITS's GRUB is also bad, but the former is
> >> very old (and has a clear upgrade path), while the latter is mainly used
> >> by developers (who can learn about the -fw_cfg switch by googling or
> >> asking on the list without huge trouble). In this case I'm leaning
> >> towards OVMF being "bleeding edge" by default. But, I could be convinced
> >> otherwise.
> > 
> > I certainly think it makes sense for OVMF to adopt the feature sooner
> > than normal, and I agree that OVMF serves as a test case.  But going
> > directly from "not possible to turn on" to "turned on by default",
> > without any period of "off by default but possible to turn on", seems a
> > bit unfortunate.
> > 
> > That said, we could certainly fix BITS to use newer GRUB2, and use
> > (and document) -fw_cfg in the meantime.  So I won't push *too* hard for
> > changing the default, just mildly.
> 
> Okay. If I'll need to send a v2 for any reason, I'll incorporate this.
> If not, then I can post a followup patch later (stating that it's due to
> community feedback).

Thanks!

> > On a vaguely related note, what's the canonical place to report bugs in
> > OVMF?
> 
> (Bugs? What bugs? :))
> 
> It's this list, .

There isn't a tracker of some kind?  That's unfortunate.

But thanks; I'll send mail to the list when we discover an issue while
experimenting with BITS.

(Also, if you don't intend to use github's issue tracker, you might want
to turn it off so people don't file things there and expect a response.)

- Josh Triplett


Re: [Xen-devel] OVMF/Xen, Debian wheezy can't boot with NX on stack (Was: Re: [edk2] [PATCH] OvmfPkg: prevent code execution from DXE stack)

2015-09-11 Thread Laszlo Ersek

On 09/11/15 21:30, Josh Triplett wrote:
> On Fri, Sep 11, 2015 at 05:28:06PM +0200, Laszlo Ersek wrote:
>> On 09/11/15 16:10, Josh Triplett wrote:
>>> On Fri, Sep 11, 2015 at 01:43:53PM +0200, Laszlo Ersek wrote:
 On 09/09/15 12:48, Laszlo Ersek wrote:
> On 09/09/15 11:37, Ian Campbell wrote:
>> On Wed, 2015-09-09 at 01:06 -0600, Jan Beulich wrote:
>> On 09.09.15 at 00:23,  wrote:
 On 09/08/15 19:26, Anthony PERARD wrote:
> And I get this on the console:
> Welcome to GRUB!
>
>  X64 Exception Type - 0E(#PF - Page-Fault)  CPU Apic ID -
>  
> RIP  - 0F5F8918, CS  - 0028, RFLAGS -
> 00210206
> ExceptionData - 0011
> RAX  - , RCX - 07FCE000, RDX -
> 
> RBX  - 0B6092C0, RSP - 0F5F8590, RBP -
> 0B608EA0
> RSI  - 0F5F8838, RDI - 0B608EA0
> R8   - , R9  - 0B609200, R10 -
> 
> R11  - 000A, R12 - , R13 -
> 001B
> R14  - 0B609360, R15 - 
> DS   - 0008, ES  - 0008, FS  -
> 0008
> GS   - 0008, SS  - 0008
> CR0  - 8033, CR2 - 0F5F8918, CR3 -
> 0F597000
> CR4  - 0668, CR8 - 
> DR0  - , DR1 - , DR2 -
> 
> DR3  - , DR6 - 0FF0, DR7 -
> 0400
> GDTR - 0F57BF18 003F, LDTR - 
> IDTR - 0EEA5018 0FFF,   TR - 
> FXSAVE_STATE - 0F5F81F0
>  Find PE image 
 /build/xen-unstable/src/xen-unstable/tools/firmware/ovmf-dir
 -remote/Build
 /OvmfX64/DEBUG_GCC49/X64/IntelFrameworkModulePkg/Universal/StatusCode/R
 untime
 Dxe/StatusCodeRuntimeDxe/DEBUG/StatusCodeRuntimeDxe.dll 
 (ImageBase=0F556000, EntryPoint=0F55628F) 
>
> I did check with other guest (Windows, Ubuntu, Debian Jessie), and
> they are
> working correctly. Debian Wheezy is the only one that fail.

 I don't have an environment to reproduce this in. I think we should try
 to understand this problem better, before deciding how to make it go
 away.

 Please locate the "StatusCodeRuntimeDxe.debug" file in your Build
 directory (ie. under the location listed in the error report). Then,
 please disassemble it with "objdump -S". The fault location in the
 disassembly can be found based on RIP, ImageBase and EntryPoint;
>>>
>>> I don't think the exact instruction at that address really matters. The
>>> main question appears to be why RIP and RSP both point into the
>>> same page (see also the subject of Anthony's mail).
>>
>> I'm not 100% what is going on,
>
> me neither :)
>
>> but if this (executable code on stack) is
>> happening in grub is there something which is explicitly forbidden to 
>> UEFI
>> apps by the UEFI spec?
>
> Yes, there is. This small OvmfPkg patch only enables the edk2 feature
> added by Star Zeng in
>  for OVMF. That patch
> (also referenced in my commit message by SVN rev) says,
>
> This feature is added for UEFI spec that says
> "Stack may be marked as non-executable in identity mapped page
> tables".
> A PCD PcdSetNxForStack is added to turn on/off this feature, and it
> is FALSE by default.
>
> A UEFI app runs (well, *starts*, anyway) before ExitBootServices() /
> SetVirtualAddressMap(), so it's bound by the above.
>
> The spec passage above is quoted from "2.3.2 IA-32 Platforms", and
> "2.3.4 x64 Platforms", in chapter "2.3 Calling Conventions", where the
> boot services time environment is specified.
>
> This is new in UEFI-2.5, and it comes from Mantis ticket 1224: "Adding
> support for No executable data areas".
>
> ... The question could be then if grub (in Wheezy) should be adapted to
> UEFI-2.5 (if that's possible) or if OVMF should be built without this
> feature.
>
> Hmmm. Actually, I'm torn about the default for PcdSetNxForStack.
>
> Namely, Mantis ticket 1224 has come up before. There's another edk2
> sub-feature related to this UEFI spec feature / Mantis ticket; the
> properties table (controlled by "PcdPropertiesTableEnable"), and the
> effects it has on the UEFI memory map, and the requirements it presents
> for

Re: [Xen-devel] [v2][PATCH] xen/vtd/iommu: permit group devices to passthrough in relaxed mode

2015-09-11 Thread Tian, Kevin

> From: Jan Beulich [mailto:jbeul...@suse.com]
> Sent: Friday, September 11, 2015 4:56 PM
> 
> >>> On 11.09.15 at 01:22,  wrote:
> > Sorry it's a bad example. My actual concern is that we can't count
> > on this per-VM relax/strict policy to prevent group devices assigned
> > to different VM. In that case it's definitely a security hole since
> > one VM may clobber shared RMRR to impact another VM. So right
> > example for that scenario is both VMs specified with 'relax'.
> 
> Sorry, no, the idea of "relax" is to allow the admin to state "I have
> no security concerns". Hence we'd have a security issue only if the
> default was "relax" (which iiuc it isn't, or if it were _that's_ what
> would need to be alongside the presented change). Whether that
> statement of the admin is because of
> - knowing that the RMRR won't be used post-boot
> - group-assigning the devices manually
> - simply not caring (i.e. trusting the guests)
> is not our business.
> 
> IOW, provided there's no way for "relax" to become the default
> (Tiejun - please confirm), the patch as is should be fine.
> 
> Jan
> 

OK, that explanation is fine to me as long as it's made clear no
security guarantee once admin uses 'relax' for any domain. Tiejun
could you resend patch with right warning/error type?

Thanks
Kevin


[Xen-devel] [PATCH for-4.6] libxl: clear O_NONBLOCK|O_NDELAY on migration fd and reinstate afterwards

The fd passed to us by libvirt for both save and restore has at least
O_NONBLOCK set, which libxl does not expect and therefore fails to
handle any EAGAIN which might arise.

This has been observed with migration v2, but if v1 used to work I
think that would just be by luck and/or coincidence.

Unix convention (and the principle of least surprise) is usually to
ensure that an fd has no "strange" properties, such as being
non-blocking, when handing it to another component.

However, for the convenience of the application, arrange instead for
libxl to clear any unexpected flags on the file descriptors it is
given for save or restore, and to restore them to their original state
at the end. O_NDELAY could be similarly problematic, so clear that as
well as O_NONBLOCK.

To do this, introduce a pair of new helper functions, one to modify+save
the flags and another to restore them, and call them in the appropriate
places.
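
As a standalone illustration of the save/clear/restore pattern described above, here is a minimal sketch using plain fcntl() outside of libxl. The function names fd_flags_clear_save() and fd_flags_restore() are hypothetical and are not the helpers added by this patch (those are in the diff below and use libxl's logging and error codes); only fcntl(), F_GETFL, F_SETFL and O_NONBLOCK are standard.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Clear the bits in clear_mask from fd's file status flags and store the
 * previous flags in *old_flags. Returns 0 on success, -1 on error (errno
 * is set by fcntl). Hypothetical helper, for illustration only.
 */
static int fd_flags_clear_save(int fd, int clear_mask, int *old_flags)
{
    int fl = fcntl(fd, F_GETFL);

    if (fl < 0)
        return -1;
    *old_flags = fl;
    return fcntl(fd, F_SETFL, fl & ~clear_mask) < 0 ? -1 : 0;
}

/* Put back the flags previously saved by fd_flags_clear_save(). */
static int fd_flags_restore(int fd, int old_flags)
{
    return fcntl(fd, F_SETFL, old_flags) < 0 ? -1 : 0;
}
```

A caller would save and clear the flags before starting blocking I/O on the descriptor and restore them on every exit path, which is the role domain_suspend_cb() plays for the suspend case in the patch.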

The migration v1 code appeared to do some things with O_NONBLOCK in
the checkpoint case. Migration v2 doesn't seem to do so, and in any
case I wouldn't expect it to be relying on libvirt's setting of
O_NONBLOCK when xl doesn't use that flag.

Signed-off-by: Ian Campbell 
Cc: Jim Fehlig 
Cc: Andrew Cooper 
Cc: Shriram Rajagopalan 
Cc: Yang Hongyang 
---
For 4.6: This fixes migration with libvirt, which I think is worth
doing before the release.

For backports: Once "ts-xen-install: Rewrite /etc/hosts to comment out
127.0.1.1 entry" passes through osstest's pretest gate and has run on
some of the older branches we should then know if this is necessary
for migration v1. Or we could backport it regardless.
---
 tools/libxl/libxl.c  | 65 
 tools/libxl/libxl_create.c   | 23 +++-
 tools/libxl/libxl_internal.h | 13 +
 3 files changed, 100 insertions(+), 1 deletion(-)

diff --git a/tools/libxl/libxl.c b/tools/libxl/libxl.c
index 4f2eb24..d6efdd8 100644
--- a/tools/libxl/libxl.c
+++ b/tools/libxl/libxl.c
@@ -952,6 +952,12 @@ static void domain_suspend_cb(libxl__egc *egc,
   libxl__domain_suspend_state *dss, int rc)
 {
 STATE_AO_GC(dss->ao);
+int flrc;
+
+flrc = libxl__fd_flags_restore(gc, dss->fd, dss->fdfl);
+/* If suspend has failed already then report that error not this one. */
+if (flrc && !rc) rc = flrc;
+
 libxl__ao_complete(egc,ao,rc);
 
 }
@@ -980,6 +986,11 @@ int libxl_domain_suspend(libxl_ctx *ctx, uint32_t domid, int fd, int flags,
 dss->live = flags & LIBXL_SUSPEND_LIVE;
 dss->debug = flags & LIBXL_SUSPEND_DEBUG;
 
+rc = libxl__fd_flags_modify_save(gc, dss->fd,
+ ~(O_NONBLOCK|O_NDELAY), 0,
+ &dss->fdfl);
+if (rc < 0) goto out_err;
+
 libxl__domain_save(egc, dss);
 return AO_INPROGRESS;
 
@@ -6507,6 +6518,60 @@ int libxl_fd_set_cloexec(libxl_ctx *ctx, int fd, int cloexec)
 int libxl_fd_set_nonblock(libxl_ctx *ctx, int fd, int nonblock)
   { return fd_set_flags(ctx,fd, F_GETFL,F_SETFL,"FL", O_NONBLOCK, nonblock); }
 
+int libxl__fd_flags_modify_save(libxl__gc *gc, int fd,
+int mask, int val, int *r_oldflags)
+{
+int rc, ret, fdfl;
+
+fdfl = fcntl(fd, F_GETFL);
+if (fdfl < 0) {
+LOGE(ERROR, "failed to fcntl.F_GETFL for fd %d", fd);
+rc = ERROR_FAIL;
+goto out_err;
+}
+
+LOG(DEBUG, "fnctl F_GETFL flags for fd %d are %x", fd, fdfl);
+
+if (r_oldflags)
+*r_oldflags = fdfl;
+
+fdfl &= mask;
+fdfl |= val;
+
+LOG(DEBUG, "fnctl F_SETFL of fd %d to %x", fd, fdfl);
+
+ret = fcntl(fd, F_SETFL, fdfl);
+if (ret < 0) {
+LOGE(ERROR, "failed to fcntl.F_SETFL for fd %d", fd);
+rc = ERROR_FAIL;
+goto out_err;
+}
+
+rc = 0;
+
+out_err:
+return rc;
+}
+
+int libxl__fd_flags_restore(libxl__gc *gc, int fd, int fdfl)
+{
+int ret, rc;
+
+LOG(DEBUG, "fnctl F_SETFL of fd %d to %x", fd, fdfl);
+
+ret = fcntl(fd, F_SETFL, fdfl);
+if (ret < 0) {
+LOGE(ERROR, "failed to fcntl.F_SETFL for fd %x", fd);
+rc = ERROR_FAIL;
+goto out_err;
+}
+
+rc = 0;
+
+out_err:
+return rc;
+
+}
 
 void libxl_hwcap_copy(libxl_ctx *ctx,libxl_hwcap *dst, libxl_hwcap *src)
 {
diff --git a/tools/libxl/libxl_create.c b/tools/libxl/libxl_create.c
index 5128160..099c7e8 100644
--- a/tools/libxl/libxl_create.c
+++ b/tools/libxl/libxl_create.c
@@ -1555,6 +1555,7 @@ static int do_domain_create(libxl_ctx *ctx, libxl_domain_config *d_config,
 {
 AO_CREATE(ctx, 0, ao_how);
 libxl__app_domain_create_state *cdcs;
+int rc;
 
 GCNEW(cdcs);
 cdcs->dcs.ao = ao;
@@ -1562,8 +1563,13 @@ static int do_domain_create(libxl_ctx *ctx, libxl_domain_config *d_config,
 libxl_domain_config_init(&cdcs->dcs.guest_config_saved);

Re: [Xen-devel] [PATCH for 4.6 v3 2/3] xl/libxl: disallow saving a guest with vNUMA configured

On Thu, 2015-09-10 at 18:05 +0100, Wei Liu wrote:
> On Thu, Sep 10, 2015 at 05:53:35PM +0100, Ian Campbell wrote:
> > On Thu, 2015-09-10 at 17:15 +0100, Wei Liu wrote:
> > > On Thu, Sep 10, 2015 at 05:10:57PM +0100, Ian Campbell wrote:
> > > > On Thu, 2015-09-10 at 15:50 +0100, Wei Liu wrote:
> > > > > This is because the migration stream does not preserve node
> > > > > information.
> > > > > 
> > > > > Note this is not a regression for migration v2 vs legacy
> > > > > migration
> > > > > because neither of them preserve node information.
> > > > > 
> > > > > Signed-off-by: Wei Liu 
> > > > > ---
> > > > > Cc: andrew.coop...@citrix.com
> > > > > 
> > > > > v3:
> > > > > 1. Update manpage, code comment and commit message.
> > > > > 2. *Don't* check if nomigrate is set.
> > > > > ---
> > > > >  docs/man/xl.cfg.pod.5   |  2 ++
> > > > >  tools/libxl/libxl_dom.c | 14 ++
> > > > >  2 files changed, 16 insertions(+)
> > > > > 
> > > > > diff --git a/docs/man/xl.cfg.pod.5 b/docs/man/xl.cfg.pod.5
> > > > > index 80e51bb..555f8ba 100644
> > > > > --- a/docs/man/xl.cfg.pod.5
> > > > > +++ b/docs/man/xl.cfg.pod.5
> > > > > @@ -263,6 +263,8 @@ virtual node.
> > > > >  
> > > > >  Note that virtual NUMA for PV guest is not yet supported,
> > > > > because
> > > > >  there is an issue with cpuid handling that affects PV virtual
> > > > > NUMA.
> > > > > +Further more, guest with virtual NUMA cannot be saved or
> > > > > migrated
> > > > > +because migration stream does not preserve node information.
> > > > >  
> > > > >  Each B is a list, which has a form of
> > > > >  "[VNODE_CONFIG_OPTION,VNODE_CONFIG_OPTION, ... ]"  (without
> > > > > quotes).
> > > > > diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c
> > > > > index c2518a3..a4d37dc 100644
> > > > > --- a/tools/libxl/libxl_dom.c
> > > > > +++ b/tools/libxl/libxl_dom.c
> > > > > @@ -24,6 +24,7 @@
> > > > >  #include 
> > > > >  #include 
> > > > >  #include 
> > > > > +#include 
> > > > >  
> > > > >  libxl_domain_type libxl__domain_type(libxl__gc *gc, uint32_t
> > > > > domid)
> > > > >  {
> > > > > @@ -1612,6 +1613,7 @@ void libxl__domain_save(libxl__egc *egc,
> > > > > libxl__domain_suspend_state *dss)
> > > > >  const libxl_domain_remus_info *const r_info = dss->remus;
> > > > >  libxl__srm_save_autogen_callbacks *const callbacks =
> > > > > >  &dss->sws.shs.callbacks.save.a;
> > > > > +unsigned int nr_vnodes = 0, nr_vmemranges = 0, nr_vcpus = 0;
> > > > >  
> > > > >  dss->rc = 0;
> > > > >  logdirty_init(>logdirty);
> > > > > @@ -1636,6 +1638,18 @@ void libxl__domain_save(libxl__egc *egc,
> > > > > libxl__domain_suspend_state *dss)
> > > > >| (debug ? XCFLAGS_DEBUG : 0)
> > > > >| (dss->hvm ? XCFLAGS_HVM : 0);
> > > > >  
> > > > > +/* Disallow saving a guest with vNUMA configured because
> > > > > migration
> > > > > + * stream does not preserve node information.
> > > > > + */
> > > > > > +rc = xc_domain_getvnuma(CTX->xch, domid, &nr_vnodes,
> > > > > > &nr_vmemranges,
> > > > > > +&nr_vcpus, NULL, NULL, NULL);
> > > > > +assert(rc == -1 && (errno == XEN_ENOBUFS || errno ==
> > > > > XEN_EOPNOTSUPP));
> > > > 
> > > > Has this been tested with a domain _without_ vnuma config.
> > > > 
> > > 
> > > Yes.
> > > 
> > > > Specifically if there is no vnuma config and therefore 0 vnodes and
> > > > 0
> > > > vmemranges will the hypervisor actually return XEN_ENOBUFS rather
> > > > than
> > > > success (because it succeeded to put 0 things into a zero length
> > > > array).
> > > > 
> > > 
> > > If there is no vnuma configuration at all, hv returns XEN_EOPNOTSUPP
> > > (hence the assertion in code).
> > 
> > Ah, I took that to be "Xen cannot do vnuma at all", rather than "This
> > particular domain has no vnuma".
> > 
> > > > It looks like the non-zero number of vcpus in the domain will
> > > > indeed
> > > 
> > > I guess you meant "zero number"?
> > 
> > No, I meant non-zero. A domain with no vnuma still has some vcpus I
> > think.
> > Hence the NULL for the vcpus_to_vnodes array would trigger XEN_ENOBUFS.
> 
> Ah, you meant d->vcpus inside HV.
> 
> Yes, that's right. XEN_ENOBUFS is guaranteed in the above
> xc_domain_getvnuma call if there is d->vnuma structure inside HV,
> because a d->nr_vcpus is not zero.

But "is d->vnuma" corresponds to there being vnuma config for the domain.
I'm specifically worried about the case where there is no vnuma config for
the domain.

Ian.



[Xen-devel] [distros-debian-sid test] 37922: regressions - FAIL

2015-09-11 Thread Platform Team regression test user

flight 37922 distros-debian-sid real [real]
http://osstest.xs.citrite.net/~osstest/testlogs/logs/37922/

Regressions :-(

Tests which did not succeed and are blocking,
including tests which could not be run:
 test-amd64-amd64-i386-sid-netboot-pygrub 9 debian-di-install fail REGR. vs. 37873
 test-amd64-amd64-amd64-sid-netboot-pvgrub 9 debian-di-install fail REGR. vs. 37873
 test-amd64-i386-i386-sid-netboot-pvgrub 9 debian-di-install fail REGR. vs. 37873
 test-amd64-i386-amd64-sid-netboot-pygrub 13 guest-saverestore fail REGR. vs. 37873

Tests which did not succeed, but are not blocking:
 test-armhf-armhf-armhf-sid-netboot-pygrub  9 debian-di-install fail never pass

baseline version:
 flight   37873

jobs:
 build-amd64                                   pass
 build-armhf                                   pass
 build-i386                                    pass
 build-amd64-pvops                             pass
 build-armhf-pvops                             pass
 build-i386-pvops                              pass
 test-amd64-amd64-amd64-sid-netboot-pvgrub     fail
 test-amd64-i386-i386-sid-netboot-pvgrub       fail
 test-amd64-i386-amd64-sid-netboot-pygrub      fail
 test-armhf-armhf-armhf-sid-netboot-pygrub     fail
 test-amd64-amd64-i386-sid-netboot-pygrub      fail



sg-report-flight on osstest.xs.citrite.net
logs: /home/osstest/logs
images: /home/osstest/images

Logs, config files, etc. are available at
http://osstest.xs.citrite.net/~osstest/testlogs/logs

Test harness code can be found at
http://xenbits.xensource.com/gitweb?p=osstest.git;a=summary


Push not applicable.



Re: [Xen-devel] [PATCH 0/5] libxc: support building large pv-domains


On 09/11/2015 03:28 PM, Ian Campbell wrote:

> On Fri, 2015-09-11 at 14:32 +0200, Juergen Gross wrote:
>> The Xen hypervisor supports starting a dom0 with large memory (up to
>> the TB range) by not including the initrd and p2m list in the initial
>> kernel mapping. Especially the p2m list can grow larger than the
>> available virtual space in the initial mapping.
>>
>> The started kernel is indicating the support of each feature via
>> elf notes.
>>
>> This series enables the domain builder in libxc to do the same as the
>> hypervisor. This enables starting of huge pv-domUs via xl.
>>
>> Unmapped initrd is supported for 64 and 32 bit domains, omitting the
>> p2m from initial kernel mapping is possible for 64 bit domains only.
>>
>> Tested with:
>> - 32 bit domU (kernel not supporting unmapped initrd)
>> - 32 bit domU (kernel supporting unmapped initrd)
>> - 1 GB 64 bit domU (kernel supporting unmapped initrd, not p2m)
>> - 1 GB 64 bit domU (kernel supporting unmapped initrd and p2m)
>> - 900GB 64 bit domU (kernel supporting unmapped initrd and p2m)
>>
>> Juergen Gross (5):
>>    libxc: remove allocate member from struct xc_dom_image
>>    libxc: do initrd processing of domain builder in own function
>>    libxc: create unmapped initrd in domain builder if supported
>>    libxc: split p2m allocation in domain builder from other magic pages
>>    libxc: create p2m list outside of kernel mapping if supported
>>
>>   tools/libxc/include/xc_dom.h |   4 +-
>>   tools/libxc/xc_dom_core.c| 123 +--
>>   tools/libxc/xc_dom_x86.c | 120 -
>
> How much is this going to conflict with Roger's "Introduce HVM without dm
> and new boot ABI" changes to HVM building?


As it is touching the pv domain builder only, I don't think there will
be a conflict. All rights of being wrong reserved. :-)

Juergen
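
For a sense of scale on the p2m list mentioned in the quoted cover letter, here is a rough back-of-the-envelope sketch in C. It assumes 4KB pages and one 8-byte entry per guest page for a 64-bit PV guest; the names are hypothetical and the calculation ignores any alignment or page-table overhead.

```c
#include <assert.h>
#include <stdint.h>

#define PV_PAGE_SIZE   4096ULL  /* x86 PV guests use 4KB pages */
#define P2M_ENTRY_SIZE 8ULL     /* assumption: one 64-bit entry per guest page */

/* Bytes needed for a flat p2m list covering mem_bytes of guest memory. */
static uint64_t p2m_list_bytes(uint64_t mem_bytes)
{
    uint64_t pages = mem_bytes / PV_PAGE_SIZE;

    return pages * P2M_ENTRY_SIZE;
}
```

Under these assumptions a 1GB domU needs about 2MB for the list, while the 900GB domU from the test list needs about 1800MB, which is the kind of size the cover letter says may no longer fit in the initial kernel mapping.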



[Xen-devel] [PATCH for 4.6 v4 3/3] xl: handle empty vnuma configuration

When the user specifies vnuma = [], we need to skip the whole parser
function; otherwise the parser sets b_info->max_memkb to a garbage value.

Signed-off-by: Wei Liu 
Acked-by: Ian Campbell 
---
 tools/libxl/xl_cmdimpl.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
index c9bd839..bfbd421 100644
--- a/tools/libxl/xl_cmdimpl.c
+++ b/tools/libxl/xl_cmdimpl.c
@@ -1093,6 +1093,9 @@ static void parse_vnuma_config(const XLU_Config *config,
 if (xlu_cfg_get_list(config, "vnuma", &vnuma, &num_vnuma, 1))
 return;
 
+if (!num_vnuma)
+return;
+
 b_info->num_vnuma_nodes = num_vnuma;
 b_info->vnuma_nodes = xcalloc(num_vnuma, sizeof(libxl_vnode_info));
 vcpu_parsed = xcalloc(num_vnuma, sizeof(libxl_bitmap));
-- 
2.1.4



[Xen-devel] [PATCH for 4.6 v4 1/3] libxc: introduce xc_domain_getvnuma

A simple wrapper for XENMEM_get_vnumainfo.

Signed-off-by: Wei Liu 
Acked-by: Ian Campbell 
---
v4: rebase on top of staging
---
 tools/libxc/include/xenctrl.h | 18 +++
 tools/libxc/xc_domain.c   | 53 +++
 2 files changed, 71 insertions(+)

diff --git a/tools/libxc/include/xenctrl.h b/tools/libxc/include/xenctrl.h
index e019474..3482544 100644
--- a/tools/libxc/include/xenctrl.h
+++ b/tools/libxc/include/xenctrl.h
@@ -1287,6 +1287,24 @@ int xc_domain_setvnuma(xc_interface *xch,
 unsigned int *vdistance,
 unsigned int *vcpu_to_vnode,
 unsigned int *vnode_to_pnode);
+/*
+ * Retrieve vnuma configuration
+ * domid: IN, target domid
+ * nr_vnodes: IN/OUT, number of vnodes, not NULL
+ * nr_vmemranges: IN/OUT, number of vmemranges, not NULL
+ * nr_vcpus: IN/OUT, number of vcpus, not NULL
+ * vmemranges: OUT, an array which has length of nr_vmemranges
+ * vdistance: OUT, an array which has length of nr_vnodes * nr_vnodes
+ * vcpu_to_vnode: OUT, an array which has length of nr_vcpus
+ */
+int xc_domain_getvnuma(xc_interface *xch,
+   uint32_t domid,
+   uint32_t *nr_vnodes,
+   uint32_t *nr_vmemranges,
+   uint32_t *nr_vcpus,
+   xen_vmemrange_t *vmemrange,
+   unsigned int *vdistance,
+   unsigned int *vcpu_to_vnode);
 
 int xc_domain_soft_reset(xc_interface *xch,
  uint32_t domid);
diff --git a/tools/libxc/xc_domain.c b/tools/libxc/xc_domain.c
index 62b2e45..e7278dd 100644
--- a/tools/libxc/xc_domain.c
+++ b/tools/libxc/xc_domain.c
@@ -2493,6 +2493,59 @@ int xc_domain_setvnuma(xc_interface *xch,
     return rc;
 }
 
+int xc_domain_getvnuma(xc_interface *xch,
+                       uint32_t domid,
+                       uint32_t *nr_vnodes,
+                       uint32_t *nr_vmemranges,
+                       uint32_t *nr_vcpus,
+                       xen_vmemrange_t *vmemrange,
+                       unsigned int *vdistance,
+                       unsigned int *vcpu_to_vnode)
+{
+    int rc;
+    DECLARE_HYPERCALL_BOUNCE(vmemrange, sizeof(*vmemrange) * *nr_vmemranges,
+                             XC_HYPERCALL_BUFFER_BOUNCE_OUT);
+    DECLARE_HYPERCALL_BOUNCE(vdistance, sizeof(*vdistance) *
+                             *nr_vnodes * *nr_vnodes,
+                             XC_HYPERCALL_BUFFER_BOUNCE_OUT);
+    DECLARE_HYPERCALL_BOUNCE(vcpu_to_vnode, sizeof(*vcpu_to_vnode) * *nr_vcpus,
+                             XC_HYPERCALL_BUFFER_BOUNCE_OUT);
+
+    struct xen_vnuma_topology_info vnuma_topo;
+
+    if ( xc_hypercall_bounce_pre(xch, vmemrange)  ||
+         xc_hypercall_bounce_pre(xch, vdistance)  ||
+         xc_hypercall_bounce_pre(xch, vcpu_to_vnode) )
+    {
+        rc = -1;
+        errno = ENOMEM;
+        goto vnumaget_fail;
+    }
+
+    set_xen_guest_handle(vnuma_topo.vmemrange.h, vmemrange);
+    set_xen_guest_handle(vnuma_topo.vdistance.h, vdistance);
+    set_xen_guest_handle(vnuma_topo.vcpu_to_vnode.h, vcpu_to_vnode);
+
+    vnuma_topo.nr_vnodes = *nr_vnodes;
+    vnuma_topo.nr_vcpus = *nr_vcpus;
+    vnuma_topo.nr_vmemranges = *nr_vmemranges;
+    vnuma_topo.domid = domid;
+    vnuma_topo.pad = 0;
+
+    rc = do_memory_op(xch, XENMEM_get_vnumainfo, &vnuma_topo,
+                      sizeof(vnuma_topo));
+
+    *nr_vnodes = vnuma_topo.nr_vnodes;
+    *nr_vcpus = vnuma_topo.nr_vcpus;
+    *nr_vmemranges = vnuma_topo.nr_vmemranges;
+
+ vnumaget_fail:
+    xc_hypercall_bounce_post(xch, vmemrange);
+    xc_hypercall_bounce_post(xch, vdistance);
+    xc_hypercall_bounce_post(xch, vcpu_to_vnode);
+
+    return rc;
+}
 
 int xc_domain_soft_reset(xc_interface *xch,
                          uint32_t domid)
-- 
2.1.4


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

[Xen-devel] Xen 4.4 & 4.5 - Various problems (mostly undefined references to libxenctrl functions)

2015-09-11 Thread Sébastien Frémal

Hello,

I'm working with Xen to develop new communication modules to improve data
transfer between Xen domains. As I had a really old version of Xen
(installed in 2012!) and some functionality didn't work (sharing
already allocated pages with grant references), I reinstalled my entire
system (new Linux, new Xen). I installed Ubuntu 14.04 LTS and the Xen
hypervisor that comes with its packages (Xen 4.4). I had no problem
creating a virtual machine, but when I tried to compile the test program
which uses my modules, the battle began. Here is my compilation line:

gcc gntring3_read_async.c
/home/fremals/GVirtus9/modules/gntring/libgntring4.o -lxenctrl -o
ring3_read_async -lm -I /home/fremals/GVirtus9/modules/

libgntring4.o contains the code using libxenctrl: xc_interface_open,
xc_map_foreign_pages and xc_interface_close. With Xen 4.4, I had no problem
compiling libgntring.o, but when I tried to compile ring3_read_async, I
got these error messages:
libgntring4.c:(.text+0x328): undefined reference to «
xc_interface_open(xentoollog_logger*, xentoollog_logger*, unsigned int) »
libgntring4.c:(.text+0x328): undefined reference to «
xc_map_foreign_pages(arg list) »
libgntring4.c:(.text+0x365): undefined reference to «
xc_interface_close(xc_interface_core*) »

I checked libxenctrl.so with nm and it was empty. However, libxenctrl.a had
all the needed symbols. I tried again with the following command line:
gcc gntring3_read_async.c
/home/fremals/GVirtus9/modules/gntring/libgntring4.o
/usr/local/lib/libxenctrl.a -o ring3_read_async -lm -I
/home/fremals/GVirtus9/modules/
and the compilation failed with only two errors:
libgntring4.c:(.text+0x328): undefined reference to «
xc_interface_open(xentoollog_logger*, xentoollog_logger*, unsigned int) »
libgntring4.c:(.text+0x365): undefined reference to «
xc_interface_close(xc_interface_core*) »
It found xc_map_foreign_pages! But not the other two functions. I
thought that there could be a problem with libxenctrl, so I
downloaded the Xen source code and tried to compile the code of this tool,
but I had errors with the Xen headers (I didn't keep the errors).

At this point, I wanted to do things right and reinstall Xen from source
code, not from an Ubuntu package. I removed Xen 4.4 and took Xen 4.5.1. As
I quickly got an error with this code (I don't remember which one), I used
Xen 4.5 instead. Once I had successfully installed Xen, I tried to run my
compilation again, this time with -lxenctrl. I had the same problem as
before:
libgntring4.c:(.text+0x328): undefined reference to «
xc_interface_open(xentoollog_logger*, xentoollog_logger*, unsigned int) »
libgntring4.c:(.text+0x365): undefined reference to «
xc_interface_close(xc_interface_core*) »

To check that the previous steps were all right, I tried to recompile
libgntring4.o and it led to new errors:
g++ -c -O3 -fPIC ../modules/gntring/libgntring4.c -o
../modules/gntring/libgntring4.o
In file included from /usr/local/include/xenctrl.h:50:0,
 from ../modules/gntring/libgntring4.c:12:
/usr/local/include/xen/platform.h:156:31: error: field ‘set_time’ has
incomplete type
 struct xenpf_efi_time set_time;
   ^
/usr/local/include/xen/platform.h:160:31: error: field ‘get_wakeup_time’
has incomplete type
 struct xenpf_efi_time get_wakeup_time;
   ^
/usr/local/include/xen/platform.h:164:31: error: field ‘set_wakeup_time’
has incomplete type
 struct xenpf_efi_time set_wakeup_time;
   ^
/usr/local/include/xen/platform.h:184:35: error: field ‘vendor_guid’ has
incomplete type
 struct xenpf_efi_guid vendor_guid;
   ^
make: *** [libgntring4.o] Erreur 1

I talked to some members of my team about the "undefined reference"
problems, but no one knows what the problem is. Does anyone have an
idea of what's wrong here, please?

Best regards,

Sebastien Fremal

Re: [Xen-devel] [PATCH for 4.6 v4 1/3] libxc: introduce xc_domain_getvnuma

On Fri, Sep 11, 2015 at 02:50:07PM +0100, Wei Liu wrote:
> A simple wrapper for XENMEM_get_vnumainfo.
> 
> Signed-off-by: Wei Liu 
> Acked-by: Ian Campbell 
> ---
> v4: rebase on top of staging

Note that this patch needs some trivial contextual adjustment when being
applied to staging-4.6.

If you need a patch for staging-4.6 I can also provide one.

Wei.


Re: [Xen-devel] [PATCH 0/5] libxc: support building large pv-domains

On Fri, 2015-09-11 at 15:42 +0200, Juergen Gross wrote:
> On 09/11/2015 03:28 PM, Ian Campbell wrote:
> > On Fri, 2015-09-11 at 14:32 +0200, Juergen Gross wrote:
> > > The Xen hypervisor supports starting a dom0 with large memory (up to
> > > the TB range) by not including the initrd and p2m list in the initial
> > > kernel mapping. Especially the p2m list can grow larger than the
> > > available virtual space in the initial mapping.
> > > 
> > > The started kernel is indicating the support of each feature via
> > > elf notes.
> > > 
> > > This series enables the domain builder in libxc to do the same as the
> > > hypervisor. This enables starting of huge pv-domUs via xl.
> > > 
> > > Unmapped initrd is supported for 64 and 32 bit domains, omitting the
> > > p2m from initial kernel mapping is possible for 64 bit domains only.
> > > 
> > > Tested with:
> > > - 32 bit domU (kernel not supporting unmapped initrd)
> > > - 32 bit domU (kernel supporting unmapped initrd)
> > > - 1 GB 64 bit domU (kernel supporting unmapped initrd, not p2m)
> > > - 1 GB 64 bit domU (kernel supporting unmapped initrd and p2m)
> > > - 900GB 64 bit domU (kernel supporting unmapped initrd and p2m)
> > > 
> > > Juergen Gross (5):
> > >libxc: remove allocate member from struct xc_dom_image
> > >libxc: do initrd processing of domain builder in own function
> > >libxc: create unmapped initrd in domain builder if supported
> > >libxc: split p2m allocation in domain builder from other magic
> > > pages
> > >libxc: create p2m list outside of kernel mapping if supported
> > > 
> > >   tools/libxc/include/xc_dom.h |   4 +-
> > >   tools/libxc/xc_dom_core.c| 123 +---
> > > ---
> > >   tools/libxc/xc_dom_x86.c | 120
> > > -
> > 
> > How much is this going to conflict with Roger's "Introduce HVM without
> > dm
> > and new boot ABI" changes to HVM building?
> 
> As it is touching the pv domain builder only, I don't think there will
> be a conflict.

The reason I asked is that the first thing Roger's series does is cause HVM
domains to be built using the PV domain builder...

>  All rights of being wrong reserved. :-)

Warranty void to the limit of your statutory rights ;-)

Ian.




Re: [Xen-devel] [PATCH 0/5] libxc: support building large pv-domains


On 09/11/2015 03:53 PM, Ian Campbell wrote:

> On Fri, 2015-09-11 at 15:42 +0200, Juergen Gross wrote:
> > On 09/11/2015 03:28 PM, Ian Campbell wrote:
> > > On Fri, 2015-09-11 at 14:32 +0200, Juergen Gross wrote:
> > > > The Xen hypervisor supports starting a dom0 with large memory (up to
> > > > the TB range) by not including the initrd and p2m list in the initial
> > > > kernel mapping. Especially the p2m list can grow larger than the
> > > > available virtual space in the initial mapping.
> > > >
> > > > The started kernel is indicating the support of each feature via
> > > > elf notes.
> > > >
> > > > This series enables the domain builder in libxc to do the same as the
> > > > hypervisor. This enables starting of huge pv-domUs via xl.
> > > >
> > > > Unmapped initrd is supported for 64 and 32 bit domains, omitting the
> > > > p2m from initial kernel mapping is possible for 64 bit domains only.
> > > >
> > > > Tested with:
> > > > - 32 bit domU (kernel not supporting unmapped initrd)
> > > > - 32 bit domU (kernel supporting unmapped initrd)
> > > > - 1 GB 64 bit domU (kernel supporting unmapped initrd, not p2m)
> > > > - 1 GB 64 bit domU (kernel supporting unmapped initrd and p2m)
> > > > - 900GB 64 bit domU (kernel supporting unmapped initrd and p2m)
> > > >
> > > > Juergen Gross (5):
> > > >   libxc: remove allocate member from struct xc_dom_image
> > > >   libxc: do initrd processing of domain builder in own function
> > > >   libxc: create unmapped initrd in domain builder if supported
> > > >   libxc: split p2m allocation in domain builder from other magic pages
> > > >   libxc: create p2m list outside of kernel mapping if supported
> > > >
> > > >   tools/libxc/include/xc_dom.h |   4 +-
> > > >   tools/libxc/xc_dom_core.c| 123 +---
> > > > ---
> > > >   tools/libxc/xc_dom_x86.c | 120
> > > > -
> > >
> > > How much is this going to conflict with Roger's "Introduce HVM without
> > > dm and new boot ABI" changes to HVM building?
> >
> > As it is touching the pv domain builder only, I don't think there will
> > be a conflict.
>
> The reason I asked is that the first thing Roger's series does is cause HVM
> domains to be built using the PV domain builder...

Aah, okay.

OTOH I'm doing nothing different than the hypervisor when loading dom0.
As long as the ELFNOTEs in question (or the corresponding elements in
dom->parms) are not set, the resulting domain image should be the same
as today.

> > All rights of being wrong reserved. :-)
>
> Warranty void to the limit of your statutory rights ;-)

Ha, good intuition! The disclaimer wasn't a bad idea at the end. ;-)


Juergen



Re: [Xen-devel] [OSSTest Nested v12 09/21] Wrapper and use core_dump_setup() for nested host and normal host to setup coredump sysctl

2015-09-11 Thread Ian Jackson

Ian Campbell writes ("Re: [OSSTest Nested v12 09/21] Wrapper and use 
core_dump_setup() for nested host and normal host to setup coredump sysctl"):
> On Thu, 2015-09-10 at 18:23 +0100, Ian Jackson wrote:
> > Also it should do `mkdir -p' in case the directory already exists
> > somehow.
> 
> This is what the code which is refactored into core_dump_setup does
> already.

So it does!

Ian.


Re: [Xen-devel] [PATCH v3] xen: arm: Support <32MB frametables

On Tue, 2015-09-01 at 16:43 +0100, Ian Campbell wrote:
> On Wed, 2015-08-26 at 17:44 -0700, Julien Grall wrote:
> > Hi Chris,
> > 
> > On 21/08/2015 14:30, Chris Brand wrote:
> > > setup_frametable_mappings() rounds frametable_size up to a multiple
> > > of 32MB. This is wasteful on systems with less than 4GB of RAM,
> > > although it does allow the "contig" bit to be set in the PTEs.
> > > 
> > > Where the frametable is less than 32MB in size, instead round up
> > > to a multiple of 2MB, not setting the "contig" bit in the PTEs.
> > > 
> > > Signed-off-by: Chris Brand 
> > 
> > Reviewed-by: Julien Grall 
> 
> Acked-by: Ian Campbell 
> 
> Chris, please ping me if I haven't applied this within some reasonable
> period after the tree opens for 4.7 development.

Applied.
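
The sizing rule described in the quoted commit message can be sketched as follows. This is only an illustration under assumptions: `round_up`, `frametable_mapped_size` and the `MB` macro are hypothetical names chosen here, not the code from the patch.

```c
#include <assert.h>
#include <stdint.h>

#define MB(x) ((uint64_t)(x) << 20)

/* Round `size` up to a multiple of `align`, which must be a power of two. */
static uint64_t round_up(uint64_t size, uint64_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (size + align - 1) & ~(align - 1);
}

/*
 * Hypothetical model of the rule in the commit message: a frametable
 * smaller than 32MB is rounded up to a multiple of 2MB (no "contig"
 * bit), anything else is rounded up to a multiple of 32MB.
 */
static uint64_t frametable_mapped_size(uint64_t frametable_size)
{
    if ( frametable_size < MB(32) )
        return round_up(frametable_size, MB(2));

    return round_up(frametable_size, MB(32));
}
```

With this model a 3MB frametable occupies 4MB rather than 32MB, which is the saving the patch is after on small systems.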


Re: [Xen-devel] [PATCH v3 2/2] xen: arm: Be explicit about bit values in mfn_to_xen_entry()

On Thu, 2015-09-10 at 11:56 -0700, Chris Brand wrote:
> Ensure that every relevant bit is given an explicit value.
> This has no effect on the generated code, but makes it
> a little easier to follow.
> 
> Reported-by: Julien Grall 
> Signed-off-by: Chris Brand 

Acked + applied for 4.7 along with the first one.

I don't think there is any need for either for 4.6, since it's just a code
clarity thing.

> ---
> v3 trims down the list of bits given explicit values
> v2 adds comments on pxn and avail
> 
>  xen/include/asm-arm/page.h | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/xen/include/asm-arm/page.h b/xen/include/asm-arm/page.h
> index 01628f3e96cb..a94e978a9995 100644
> --- a/xen/include/asm-arm/page.h
> +++ b/xen/include/asm-arm/page.h
> @@ -202,9 +202,12 @@ static inline lpae_t mfn_to_xen_entry(unsigned long mfn, unsigned attr)
>              .ai = attr,
>              .ns = 1,      /* Hyp mode is in the non-secure world */
>              .user = 1,    /* See below */
> +            .ro = 0,      /* Assume read-write */
>              .af = 1,      /* No need for access tracking */
>              .ng = 1,      /* Makes TLB flushes easier */
> +            .contig = 0,  /* Assume non-contiguous */
>              .xn = 1,      /* No need to execute outside .text */
> +            .avail = 0,   /* Reference count for domheap mapping */
>          }};;
>  /* Setting the User bit is strange, but the ATS1H[RW] instructions
>   * don't seem to work otherwise, and since we never run on Xen


Re: [Xen-devel] OVMF/Xen, Debian wheezy can't boot with NX on stack (Was: Re: [edk2] [PATCH] OvmfPkg: prevent code execution from DXE stack)

2015-09-11 Thread Josh Triplett

On Fri, Sep 11, 2015 at 01:43:53PM +0200, Laszlo Ersek wrote:
> On 09/09/15 12:48, Laszlo Ersek wrote:
> > On 09/09/15 11:37, Ian Campbell wrote:
> >> On Wed, 2015-09-09 at 01:06 -0600, Jan Beulich wrote:
> >> On 09.09.15 at 00:23,  wrote:
>  On 09/08/15 19:26, Anthony PERARD wrote:
> > And I get this on the console:
> > Welcome to GRUB!
> >
> >  X64 Exception Type - 0E(#PF - Page-Fault)  CPU Apic ID -
> >  
> > RIP  - 0F5F8918, CS  - 0028, RFLAGS -
> > 00210206
> > ExceptionData - 0011
> > RAX  - , RCX - 07FCE000, RDX -
> > 
> > RBX  - 0B6092C0, RSP - 0F5F8590, RBP -
> > 0B608EA0
> > RSI  - 0F5F8838, RDI - 0B608EA0
> > R8   - , R9  - 0B609200, R10 -
> > 
> > R11  - 000A, R12 - , R13 -
> > 001B
> > R14  - 0B609360, R15 - 
> > DS   - 0008, ES  - 0008, FS  -
> > 0008
> > GS   - 0008, SS  - 0008
> > CR0  - 8033, CR2 - 0F5F8918, CR3 -
> > 0F597000
> > CR4  - 0668, CR8 - 
> > DR0  - , DR1 - , DR2 -
> > 
> > DR3  - , DR6 - 0FF0, DR7 -
> > 0400
> > GDTR - 0F57BF18 003F, LDTR - 
> > IDTR - 0EEA5018 0FFF,   TR - 
> > FXSAVE_STATE - 0F5F81F0
> >  Find PE image 
>  /build/xen-unstable/src/xen-unstable/tools/firmware/ovmf-dir
>  -remote/Build
>  /OvmfX64/DEBUG_GCC49/X64/IntelFrameworkModulePkg/Universal/StatusCode/R
>  untime
>  Dxe/StatusCodeRuntimeDxe/DEBUG/StatusCodeRuntimeDxe.dll 
>  (ImageBase=0F556000, EntryPoint=0F55628F) 
> >
> > I did check with other guest (Windows, Ubuntu, Debian Jessie), and
> > they are
> > working correctly. Debian Wheezy is the only one that fail.
> 
>  I don't have an environment to reproduce this in. I think we should try
>  to understand this problem better, before deciding how to make it go
>  away.
> 
>  Please locate the "StatusCodeRuntimeDxe.debug" file in your Build
>  directory (ie. under the location listed in the error report). Then,
>  please disassemble it with "objdump -S". The fault location in the
>  disassembly can be found based on RIP, ImageBase and EntryPoint;
> >>>
> >>> I don't think the exact instruction at that address really matters. The
> >>> main question appears to be why RIP and RSP both point into the
> >>> same page (see also the subject of Anthony's mail).
> >>
> >> I'm not 100% what is going on,
> > 
> > me neither :)
> > 
> >> but if this (executable code on stack) is
> >> happening in grub is there something which is explicitly forbidden to UEFI
> >> apps by the UEFI spec?
> > 
> > Yes, there is. This small OvmfPkg patch only enables the edk2 feature
> > added by Star Zeng in
> >  for OVMF. That patch
> > (also referenced in my commit message by SVN rev) says,
> > 
> > This feature is added for UEFI spec that says
> > "Stack may be marked as non-executable in identity mapped page
> > tables".
> > A PCD PcdSetNxForStack is added to turn on/off this feature, and it
> > is FALSE by default.
> > 
> > A UEFI app runs (well, *starts*, anyway) before ExitBootServices() /
> > SetVirtualAddressMap(), so it's bound by the above.
> > 
> > The spec passage above is quoted from "2.3.2 IA-32 Platforms", and
> > "2.3.4 x64 Platforms", in chapter "2.3 Calling Conventions", where the
> > boot services time environment is specified.
> > 
> > This is new in UEFI-2.5, and it comes from Mantis ticket 1224: "Adding
> > support for No executable data areas".
> > 
> > ... The question could be then if grub (in Wheezy) should be adapted to
> > UEFI-2.5 (if that's possible) or if OVMF should be built without this
> > feature.
> > 
> > Hmmm. Actually, I'm torn about the default for PcdSetNxForStack.
> > 
> > Namely, Mantis ticket 1224 has come up before. There's another edk2
> > sub-feature related to this UEFI spec feature / Mantis ticket; the
> > properties table (controlled by "PcdPropertiesTableEnable"), and the
> > effects it has on the UEFI memory map, and the requirements it presents
> > for UEFI OSes.
> > 
> > *That* sub-feature is extremely intrusive.
> > "MdeModulePkg/MdeModulePkg.dec" sets "PcdPropertiesTableEnable" TRUE by
> > default, and OvmfPkg inherits it. I have not overridden that default
> > just yet in OvmfPkg because the properties table feature depends on
> > something *else* too: sections in runtime DXE driver binaries

Re: [Xen-devel] [PATCH V6 3/7] libxl: add pvusb API

On Fri, 2015-09-11 at 15:55 +0200, Juergen Gross wrote:
> On 09/11/2015 03:26 PM, Ian Campbell wrote:
> > On Thu, 2015-09-10 at 23:42 -0600, Chun Yan Liu wrote:
> > > 
> > > > Do these fields have any particular size requirements arising from
> > > > e.g. the
> > > > USB spec or from possible dom0 implementations?
> > > > 
> > > > If they have a well defined fixed size from a USB spec then maybe
> > > > we
> > > > could
> > > > use the appropriate fixed size types?
> > > 
> > > Didn't see the size limitation. In Linux kernel code, busnum and
> > > devnum (here
> > > 'hostbus, hostaddr') are both 'int' type.
> > 
> > Is that a Linux-specific implementation detail or a fundamental
> > property of
> > USB? We shouldn't be designing the interface around Linux implementation
> > details. It seems like something in the USB spec ought to define
> > precisely
> > the number of bits in both a bus number and a device address within
> > that
> > bus.
> 
> The USB spec is only about _the_ bus. How many buses a host can
> operate and how they are numbered is outside the USB spec.
> 
> Devices are addressed via their ports in the USB protocol. devnum
> is a unique index for a device on the bus, the USB protocol equivalent
> is a list of ports of:
> - 1 member in case of direct attached devices
> - multiple members in case of hubs between bus and device

Thanks for the info. So an "address" in the USB protocol is actually a
"path" and "hostbus" is an implementation dependent shorthand for all but
the last link in that path.

What is the size of each element in the chain? That would seem to be the
correct size of "hostaddr".

Ian.


Re: [Xen-devel] [PATCH for 4.6 v3 2/3] xl/libxl: disallow saving a guest with vNUMA configured

On Fri, Sep 11, 2015 at 02:59:07PM +0100, Ian Campbell wrote:
> On Fri, 2015-09-11 at 14:43 +0100, Wei Liu wrote:
> > On Fri, Sep 11, 2015 at 02:21:17PM +0100, Ian Campbell wrote:
> > > On Fri, 2015-09-11 at 11:50 +0100, Ian Campbell wrote:
> > > > But "is d->vnuma" corresponds to there being vnuma config for the
> > > > domain. 
> > > 
> > > We discussed this IRL and concluded that we should stop trying to
> > > differentiate "no vnuma configuration" from "has empty vnuma
> > > configuration".
> > > 
> > > So this code should raise this error if xc_domain_getvnuma returns
> > > anything
> > > other than rc == -1 && errno == XEN_EOPNOTSUPP. So the check is
> > > 
> > > if ( rc != -1 || errno != XEN_EOPNOTSUPP )
> > > 
> > 
> > To be precise, this should be
> > 
> >   if ( rc != -1 || errno == XEN_EOPNOTSUPP )
> > 
> > (your if expression contradicts what you said)
> 
> I don't think it did, but they are inverses of each other, due to the
> "other than" wording in the prose.
> 
>              errno == OPNOTSUPP    errno != OPNOTSUPP
> rc >= 0      ???                   Some vnuma config
> rc == -1     No vnuma config(*)    Some other error
> 
> (*) is the only situation which is allowed, which is what I described in
> the text.
> 
> But the if needs to reject the other 3 cases, so it is in the inverse test.
> rc != -1 covers the top row, and errno != OPNOTSUPP covers the second
> column, if either are true then we do not want to proceed.
> 

Oh, right, I misinterpreted your expression. Sorry for the noise.

Wei.

> Ian.
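
The rule the thread settles on can be written out as a small predicate. This is a sketch only: `vnuma_save_allowed` is a hypothetical helper, `rc` and `err` model the return value and errno seen after calling xc_domain_getvnuma, and the standard `EOPNOTSUPP` from <errno.h> stands in for `XEN_EOPNOTSUPP`.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical helper: may a save proceed, given the vNUMA query result?
 * Only "rc == -1 with EOPNOTSUPP" means "no vNUMA configuration"; the
 * other three cells of the table are rejected.
 * EOPNOTSUPP stands in for XEN_EOPNOTSUPP here (an assumption).
 */
static bool vnuma_save_allowed(int rc, int err)
{
    /* The rejecting test from the thread, unchanged. */
    if ( rc != -1 || err != EOPNOTSUPP )
        return false;

    return true;
}

/* Walk the four cells of the table quoted above. */
static void check_vnuma_table(void)
{
    assert(!vnuma_save_allowed(0, EOPNOTSUPP));   /* ??? */
    assert(!vnuma_save_allowed(0, 0));            /* some vnuma config */
    assert(vnuma_save_allowed(-1, EOPNOTSUPP));   /* no vnuma config (*) */
    assert(!vnuma_save_allowed(-1, EINVAL));      /* some other error */
}
```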


Re: [Xen-devel] [PATCH for-4.6] libxl: clear O_NONBLOCK|O_NDELAY on migration fd and reinstate afterwards

On Fri, 2015-09-11 at 13:56 +0100, Wei Liu wrote:
> On Fri, Sep 11, 2015 at 11:50:14AM +0100, Ian Jackson wrote:
> > Ian Campbell writes ("[PATCH for-4.6] libxl: clear O_NONBLOCK|O_NDELAY
> > on migration fd and reinstate afterwards"):
> > > The fd passed to us by libvirt for both save and restore has at least
> > > O_NONBLOCK set, which libxl does not expect and therefore fails to
> > > handle any EAGAIN which might arise.
> > 
> > Acked-by: Ian Jackson 
> > 
> > > For 4.6: This fixes migration with libvirt, which I think is worth
> > > doing before the release.
> > 
> > Indeed.
> > 
> 
> +1
> 
> Acked-by: Wei Liu 

Pushed to staging and staging-4.6, thanks.



Re: [Xen-devel] [PATCH for-4.6] libxl: clear O_NONBLOCK|O_NDELAY on migration fd and reinstate afterwards

On Fri, 2015-09-11 at 14:44 +0100, Andrew Cooper wrote:
> On 11/09/15 11:42, Ian Campbell wrote:
> > The fd passed to us by libvirt for both save and restore has at least
> > O_NONBLOCK set, which libxl does not expect and therefore fails to
> > handle any EAGAIN which might arise.
> > 
> > This has been observed with migration v2, but if v1 used to work I
> > think that would just be by luck and/or coincidence.
> > 
> > Unix convention (and the principle of least surprise) is usually to
> > ensure that an fd has no "strange" properties, such as being
> > non-blocking, when handing it to another component.
> > 
> > However for the convenience of the application arrange instead for
> > libxl to clear any unexpected flags on the file descriptors it is
> > given for save or restore and restore them to their original state at
> > the end. O_NDELAY could be similarly problematic so clear that as
> > well as O_NONBLOCK.
> > 
> > To do this introduce a pair of new helper functions one to modify+save
> > the flags and another to restore them and call them in the appropriate
> > places.
> > 
> > The migration v1 code appeared to do some things with O_NONBLOCK in
> > the checkpoint case. Migration v2 doesn't seem to do so, and in any
> > case I wouldn't expect it to be relying on libvirt's setting of
> > O_NONBLOCK when xl doesn't use that flag.
> > 
> > Signed-off-by: Ian Campbell 
> > Cc: Jim Fehlig 
> > Cc: Andrew Cooper 
> > Cc: Shriram Rajagopalan 
> > Cc: Yang Hongyang 
> > ---
> > For 4.6: This fixes migration with libvirt, which I think is worth
> > doing before the release.
> > 
> > For backports: Once "ts-xen-install: Rewrite /etc/hosts to comment out
> > 127.0.1.1 entry" passes through osstest's pretest gate and has run on
> > some of the older branches we should then know if this is necessary
> > for migration v1. Or we could backport it regardless.
> 
> I don't believe any special consideration is needed for the legacy 
> conversion case, as all other fds used there are created by components 
> we control.

Thanks, I was actually talking about actual migration v1 as in 4.5
migrating to 4.5, but the above is useful info nonetheless.

> > +LOG(DEBUG, "fnctl F_GETFL flags for fd %d are %x", fd, fdfl);
> 
> %#x to distinguish decimal and hex numbers in the same message (and 
> other debug messages)

Gah, didn't see this until after I pushed, sorry. Will post a followup

Ian.


Re: [Xen-devel] [PATCH] efi/libstub/fdt: Standardize the names of EFI stub parameters

2015-09-11 Thread Ard Biesheuvel

On 11 September 2015 at 15:14, Stefano Stabellini
 wrote:
> On Fri, 11 Sep 2015, Daniel Kiper wrote:
>> On Thu, Sep 10, 2015 at 05:23:02PM +0100, Mark Rutland wrote:
>> > > > C) When you could go:
>> > > >
>> > > >DT -> Discover Xen -> Xen-specific stuff -> Xen-specific EFI/ACPI 
>> > > > discovery
>> > >
>> > > I take you mean discovering Xen with the usual Xen hypervisor node on
>> > > device tree. I think that C) is a good option actually. I like it. Not
>> > > sure why we didn't think about this earlier. Is there anything EFI or
>> > > ACPI which is needed before Xen support is discovered by
>> > > arch/arm64/kernel/setup.c:setup_arch -> xen_early_init()?
>> >
>> > Currently lots (including the memory map). With the stuff to support
>> > SPCR, the ACPI discovery would be moved before xen_early_init().
>> >
>> > > If not, we could just go for this. A lot of complexity would go away.
>> >
>> > I suspect this would still be fairly complex, but would at least prevent
>> > the Xen-specific EFI handling from adversely affecting the native case.
>> >
>> > > > D) If you want to be generic:
>> > > >EFI -> EFI application -> EFI tables -> ACPI tables -> Xen-specific 
>> > > > stuff
>> > > >   \--/
>> > > >(virtualize these, provide shims to Dom0, but handle
>> > > > everything in Xen itself)
>> > >
>> > > I think that this is good in theory but could turn out to be a lot of
>> > > work in practice. We could probably virtualize the RuntimeServices but
>> > > the BootServices are troublesome.
>> >
>> > What's troublesome with the boot services?
>> >
>> > What can't be simulated?
>>
>> How do you want to access bare metal EFI boot services from dom0 if they
>> were shutdown long time ago before loading dom0 image? What do you need
>> from EFI boot services in dom0?
>
> That's right. Trying to emulate BootServices after the real
> ExitBootServices has already been called seems like a very bad plan.
>
> I think that whatever interface we come up with, would need to be past
> ExitBootServices.

It feels like this discussion is going in circles.

When we discussed this six months ago, we already concluded that,
since UEFI is the only specified way that the presence of ACPI is
advertised on an ARM system, we need to emulate UEFI to some extent.

So we need the EFI system table to expose the UEFI configuration table
that carries the ACPI root pointer.

Since ACPI support also relies on the UEFI memory map (I think?), we
need that as well.

These two items are exactly what we pass via the UEFI DT properties,
so we should indeed promote the current de-facto binding to a proper
binding, and renaming the properties makes sense in that context.

I agree that this should also include a description of the expected
state of the firmware, i.e., that ExitBootServices() has been called,
and that the memory map has been populated with virtual addresses, which
have been installed using SetVirtualAddressMap() if they differ from
the physical addresses. (The current implementation on the kernel side
is perfectly capable of dealing with a 1:1 mapping).

Beyond that, there is no point in pretending to be a full UEFI
implementation, imo. Boot services are not required, nor are runtime
services (only the current EFI init code on arm needs to be modified
to deal with a NULL runtime services pointer)

-- 
Ard.


Re: [Xen-devel] [PATCH 3/5] libxc: create unmapped initrd in domain builder if supported


On 09/11/2015 02:54 PM, Ian Jackson wrote:

> Juergen Gross writes ("[PATCH 3/5] libxc: create unmapped initrd in domain builder if supported"):
> > In case the kernel of a new pv-domU indicates it is supporting an
> > unmapped initrd, don't waste precious virtual space for the initrd,
> > but allocate only guest physical memory for it.
> ...
>
> The name of this ELFNOTE suggests that it applies to all multiboot
> modules, not just ramdisks.  In particular, that means perhaps it
> ought to apply to device tree blobs too ?

Hmm, in theory, yes.

AFAIK it is only used for initrd on x86. I think support for other
modules can be added as needed.

> > -    /* load ramdisk */
> > -    if ( dom->ramdisk_blob )
> > +    /* Load ramdisk if initial mapping required. */
> > +    if ( dom->ramdisk_blob &&
> > +         (!dom->parms.elf_notes[XEN_ELFNOTE_MOD_START_PFN].data.num ||
> > +          dom->ramdisk_seg.vstart) )
>
> After this patch the resulting structure of the code is rather
> unfortunate, in that the order of the main processing steps depends on
> this ELFNOTE.
>
> Wouldn't it be better to generalise xc_dom_alloc_segment ?

How?

You have to create (and allocate space for) the page tables after
all allocations which should be covered by those page tables. And
you must not allocate the other stuff before that, as this would
again waste virtual address space, which is 1:1 with guest physical
memory.

The only solution would be to calculate the needed sizes of the
single memory chunks first and then do the allocations either in
the mapped or the unmapped region according to the ELFNOTEs. This
would rip at least the initrd processing into two parts, as the
needed memory size is calculated depending on the initrd being
compressed or not.

I thought about building a table containing the sequence of the
single processing steps dependent on the ELFNOTEs and processing this
table in a generic loop afterwards. If you like this approach, I can
give it a try. I just wanted to avoid a complete rework of the main
building function.

Juergen
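
The table-driven approach floated above might look something like the sketch below. Everything in it is hypothetical (the `build_ctx` structure, the step functions and the two orderings); it only illustrates selecting a step sequence from an ELF-note-derived flag and running it in a generic loop, with the unmapped initrd handled after the page tables.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_STEPS 3

/* Hypothetical build context; the real builder state is far larger. */
struct build_ctx {
    bool unmapped_initrd;   /* kernel advertised support via an ELF note */
    char log[8];            /* records the order in which steps ran */
    size_t nr;
};

typedef void (*build_step_fn)(struct build_ctx *ctx);

/* Append one character to the log so the step order can be inspected. */
static void record(struct build_ctx *ctx, char c)
{
    assert(ctx->nr < sizeof(ctx->log) - 1);
    ctx->log[ctx->nr++] = c;
    ctx->log[ctx->nr] = '\0';
}

static void step_kernel(struct build_ctx *ctx)     { record(ctx, 'K'); }
static void step_initrd(struct build_ctx *ctx)     { record(ctx, 'I'); }
static void step_pagetables(struct build_ctx *ctx) { record(ctx, 'P'); }

/* Initrd inside the initial mapping: allocate it before the page tables. */
static const build_step_fn mapped_order[NR_STEPS] = {
    step_kernel, step_initrd, step_pagetables
};

/* Unmapped initrd: page tables first, so the initrd costs no virtual space. */
static const build_step_fn unmapped_order[NR_STEPS] = {
    step_kernel, step_pagetables, step_initrd
};

/* Generic loop: pick the table once, then run its steps in sequence. */
static void run_build(struct build_ctx *ctx)
{
    const build_step_fn *steps =
        ctx->unmapped_initrd ? unmapped_order : mapped_order;
    size_t i;

    for ( i = 0; i < NR_STEPS; i++ )
        steps[i](ctx);
}
```

The trade-off is the one mentioned in the mail: the ordering logic moves out of the main building function into data, at the cost of reworking that function.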


Re: [Xen-devel] [PATCH 3/5] libxc: create unmapped initrd in domain builder if supported


On 09/11/2015 03:15 PM, Julien Grall wrote:

> On 11/09/15 13:54, Ian Jackson wrote:
> > Juergen Gross writes ("[PATCH 3/5] libxc: create unmapped initrd in domain builder if supported"):
> > > In case the kernel of a new pv-domU indicates it is supporting an
> > > unmapped initrd, don't waste precious virtual space for the initrd,
> > > but allocate only guest physical memory for it.
> > ...
> >
> > The name of this ELFNOTE suggests that it applies to all multiboot
> > modules, not just ramdisks.  In particular, that means perhaps it
> > ought to apply to device tree blobs too ?
>
> The device tree blob is not a multiboot module; it is passed directly
> to the kernel in a register.
>
> FWIW, we don't have any ELF support right now on ARM.

Okay, I thought so, but I wasn't sure.

> > > -    /* load ramdisk */
> > > -    if ( dom->ramdisk_blob )
> > > +    /* Load ramdisk if initial mapping required. */
> > > +    if ( dom->ramdisk_blob &&
> > > +         (!dom->parms.elf_notes[XEN_ELFNOTE_MOD_START_PFN].data.num ||
> > > +          dom->ramdisk_seg.vstart) )
> >
> > After this patch the resulting structure of the code is rather
> > unfortunate, in that the order of the main processing steps depends on
> > this ELFNOTE.
>
> Shouldn't we have common code that is ELF agnostic? I.e. we may have
> other kernel image formats where we have notes but not ELF notes.

dom->parms is the same for all architectures. I think it would have to
be extended in that case.

Juergen


Re: [Xen-devel] [PATCH for 4.6 v3 2/3] xl/libxl: disallow saving a guest with vNUMA configured

On Fri, Sep 11, 2015 at 02:21:17PM +0100, Ian Campbell wrote:
> On Fri, 2015-09-11 at 11:50 +0100, Ian Campbell wrote:
> > But "is d->vnuma" corresponds to there being vnuma config for the domain. 
> 
> We discussed this IRL and concluded that we should stop trying to
> differentiate "no vnuma configuration" from "has empty vnuma
> configuration".
> 
> So this code should raise this error if xc_domain_getvnuma returns anything
> other than rc == -1 && errno == XEN_EOPNOTSUPP. So the check is
> 
> if ( rc != -1 || errno != XEN_EOPNOTSUPP )
> 

To be precise, this should be

  if ( rc != -1 || errno == XEN_EOPNOTSUPP )

(your if expression contradicts what you said)

> I think.
> 
> This then avoids any confusion about what it means to have a d->vnuma with
> nr_something == 0 in it.
> 
> Ian.
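
For reference, the rule stated in prose above ("anything other than
rc == -1 && errno == XEN_EOPNOTSUPP") can be negated mechanically with
De Morgan's law. The sketch below is only a logic check: it uses the host's
EOPNOTSUPP as a stand-in for XEN_EOPNOTSUPP, and both function names are
hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Prose rule: saving is acceptable only when the call fails with
 * "operation not supported". */
bool save_allowed(int rc, int err)
{
    return rc == -1 && err == EOPNOTSUPP;
}

/* De Morgan: !(A && B) is (!A || !B). */
bool must_refuse(int rc, int err)
{
    return rc != -1 || err != EOPNOTSUPP;
}
```

For every (rc, err) pair, must_refuse() is the exact complement of
save_allowed().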


Re: [Xen-devel] [PATCH for-4.6] libxl: clear O_NONBLOCK|O_NDELAY on migration fd and reinstate afterwards

2015-09-11 Thread Andrew Cooper


On 11/09/15 11:42, Ian Campbell wrote:

The fd passed to us by libvirt for both save and restore has at least
O_NONBLOCK set, which libxl does not expect and therefore fails to
handle any EAGAIN which might arise.

This has been observed with migration v2, but if v1 used to work I
think that would just be by luck and/or coincidence.

Unix convention (and the principle of least surprise) is usually to
ensure that an fd has no "strange" properties, such as being
non-blocking, when handing it to another component.

However, for the convenience of the application, arrange instead for
libxl to clear any unexpected flags on the file descriptors it is
given for save or restore and restore them to their original state at
the end. O_NDELAY could be similarly problematic, so clear that as
well as O_NONBLOCK.

To do this, introduce a pair of new helper functions, one to modify+save
the flags and another to restore them, and call them in the appropriate
places.

The migration v1 code appeared to do some things with O_NONBLOCK in
the checkpoint case. Migration v2 doesn't seem to do so, and in any
case I wouldn't expect it to be relying on libvirt's setting of
O_NONBLOCK when xl doesn't use that flag.

Signed-off-by: Ian Campbell 
Cc: Jim Fehlig 
Cc: Andrew Cooper 
Cc: Shriram Rajagopalan 
Cc: Yang Hongyang 
---
For 4.6: This fixes migration with libvirt, which I think is worth
doing before the release.

For backports: Once "ts-xen-install: Rewrite /etc/hosts to comment out
127.0.1.1 entry" passes through osstest's pretest gate and has run on
some of the older branches we should then know if this is necessary
for migration v1. Or we could backport it regardless.


I don't believe any special consideration is needed for the legacy 
conversion case, as all other fds used there are created by components 
we control.



---
  tools/libxl/libxl.c  | 65 
  tools/libxl/libxl_create.c   | 23 +++-
  tools/libxl/libxl_internal.h | 13 +
  3 files changed, 100 insertions(+), 1 deletion(-)

diff --git a/tools/libxl/libxl.c b/tools/libxl/libxl.c
index 4f2eb24..d6efdd8 100644
--- a/tools/libxl/libxl.c
+++ b/tools/libxl/libxl.c
@@ -952,6 +952,12 @@ static void domain_suspend_cb(libxl__egc *egc,
libxl__domain_suspend_state *dss, int rc)
  {
  STATE_AO_GC(dss->ao);
+int flrc;
+
+flrc = libxl__fd_flags_restore(gc, dss->fd, dss->fdfl);
+/* If suspend has failed already then report that error not this one. */
+if (flrc && !rc) rc = flrc;
+
  libxl__ao_complete(egc,ao,rc);
  
  }

@@ -980,6 +986,11 @@ int libxl_domain_suspend(libxl_ctx *ctx, uint32_t domid, 
int fd, int flags,
  dss->live = flags & LIBXL_SUSPEND_LIVE;
  dss->debug = flags & LIBXL_SUSPEND_DEBUG;
  
+rc = libxl__fd_flags_modify_save(gc, dss->fd,
+ ~(O_NONBLOCK|O_NDELAY), 0,
+ &dss->fdfl);
+if (rc < 0) goto out_err;
+
  libxl__domain_save(egc, dss);
  return AO_INPROGRESS;
  
@@ -6507,6 +6518,60 @@ int libxl_fd_set_cloexec(libxl_ctx *ctx, int fd, int cloexec)

  int libxl_fd_set_nonblock(libxl_ctx *ctx, int fd, int nonblock)
{ return fd_set_flags(ctx,fd, F_GETFL,F_SETFL,"FL", O_NONBLOCK, nonblock); }
  
+int libxl__fd_flags_modify_save(libxl__gc *gc, int fd,
+int mask, int val, int *r_oldflags)
+{
+int rc, ret, fdfl;
+
+fdfl = fcntl(fd, F_GETFL);
+if (fdfl < 0) {
+LOGE(ERROR, "failed to fcntl.F_GETFL for fd %d", fd);
+rc = ERROR_FAIL;
+goto out_err;
+}
+
+LOG(DEBUG, "fcntl F_GETFL flags for fd %d are %x", fd, fdfl);


Use %#x here to distinguish decimal and hex numbers in the same message
(and likewise in the other debug messages).


~Andrew
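
As a rough illustration of the save/modify/restore pattern discussed above,
here is a standalone sketch using plain fcntl(). The function names are
hypothetical stand-ins, not the libxl helpers themselves, and the
"(old & mask) | val" semantics are an assumption inferred from the call
site, since the quoted diff stops before the function body ends.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>   /* pipe(), handy for trying the helpers out */

/* Hypothetical stand-in for the modify+save helper: remembers the current
 * file status flags, then applies (old & mask) | val. */
int fd_flags_modify_save(int fd, int mask, int val, int *r_oldflags)
{
    int fdfl = fcntl(fd, F_GETFL);
    if (fdfl < 0)
        return -1;

    /* %#x prints nonzero values with a 0x prefix, so the hex flags are
     * not mistaken for the decimal fd in the same message. */
    fprintf(stderr, "F_GETFL flags for fd %d are %#x\n", fd, (unsigned)fdfl);

    if (r_oldflags)
        *r_oldflags = fdfl;

    fdfl = (fdfl & mask) | val;
    return fcntl(fd, F_SETFL, fdfl) < 0 ? -1 : 0;
}

/* Hypothetical stand-in for the restore helper: puts the saved flags back. */
int fd_flags_restore(int fd, int oldflags)
{
    return fcntl(fd, F_SETFL, oldflags) < 0 ? -1 : 0;
}
```

A caller would pass ~(O_NONBLOCK|O_NDELAY) as the mask and 0 as the value
before starting the save, keep the returned flags, and hand them to the
restore function once the operation completes.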


[Xen-devel] [PATCH for 4.6 v4 2/3] xl/libxl: disallow saving a guest with vNUMA configured

This is because the migration stream does not preserve node information.

Note this is not a regression for migration v2 vs legacy migration
because neither of them preserves node information.

Signed-off-by: Wei Liu 
---
Cc: andrew.coop...@citrix.com

v4:
1. Don't differentiate "no vnuma" from "empty vnuma".

v3:
1. Update manpage, code comment and commit message.
2. *Don't* check if nomigrate is set.
---
 docs/man/xl.cfg.pod.5   |  2 ++
 tools/libxl/libxl_dom.c | 16 
 2 files changed, 18 insertions(+)

diff --git a/docs/man/xl.cfg.pod.5 b/docs/man/xl.cfg.pod.5
index c6345b8..157c855 100644
--- a/docs/man/xl.cfg.pod.5
+++ b/docs/man/xl.cfg.pod.5
@@ -263,6 +263,8 @@ virtual node.
 
 Note that virtual NUMA for PV guest is not yet supported, because
 there is an issue with cpuid handling that affects PV virtual NUMA.
+Furthermore, guests with virtual NUMA cannot be saved or migrated
+because the migration stream does not preserve node information.
 
 Each B is a list, which has a form of
 "[VNODE_CONFIG_OPTION,VNODE_CONFIG_OPTION, ... ]"  (without quotes).
diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c
index c2518a3..7227f35 100644
--- a/tools/libxl/libxl_dom.c
+++ b/tools/libxl/libxl_dom.c
@@ -24,6 +24,7 @@
 #include 
 #include 
 #include 
+#include 
 
 libxl_domain_type libxl__domain_type(libxl__gc *gc, uint32_t domid)
 {
@@ -1612,6 +1613,7 @@ void libxl__domain_save(libxl__egc *egc, 
libxl__domain_suspend_state *dss)
 const libxl_domain_remus_info *const r_info = dss->remus;
 libxl__srm_save_autogen_callbacks *const callbacks =
 &dss->sws.shs.callbacks.save.a;
+unsigned int nr_vnodes = 0, nr_vmemranges = 0, nr_vcpus = 0;
 
 dss->rc = 0;
 logdirty_init(&dss->logdirty);
@@ -1636,6 +1638,20 @@ void libxl__domain_save(libxl__egc *egc, 
libxl__domain_suspend_state *dss)
   | (debug ? XCFLAGS_DEBUG : 0)
   | (dss->hvm ? XCFLAGS_HVM : 0);
 
+/* Disallow saving a guest with vNUMA configured because migration
+ * stream does not preserve node information.
+ *
+ * Do not differentiate "no vnuma configuration" from "empty vnuma
+ * configuration".
+ */
+rc = xc_domain_getvnuma(CTX->xch, domid, &nr_vnodes, &nr_vmemranges,
+&nr_vcpus, NULL, NULL, NULL);
+if (rc != -1 || errno == XEN_ENOBUFS) {
+LOG(ERROR, "Cannot save a guest with vNUMA configured");
+rc = ERROR_FAIL;
+goto out;
+}
+
 dss->guest_evtchn.port = -1;
 dss->guest_evtchn_lockfd = -1;
 dss->guest_responded = 0;
-- 
2.1.4



Re: [Xen-devel] [OSSTEST Nested PATCH v11 6/7] Compose the main recipe of nested test job

2015-09-11 Thread Ian Jackson

Hu, Robert writes ("RE: [OSSTEST Nested PATCH v11 6/7] Compose the main recipe 
of nested test job"):
> So strange, seems this mail was sent ' Tuesday, September 1, 2015
> 10:42 PM ', but I have just received it.

How annoying.

>  I'm fully occupied by some
> release testing, so I will have to read your comments carefully in
> 1-2 weeks. Sorry about this.

That's quite OK.  I have been slow too.  I want to reassure you again
that I want this feature in osstest.  I think we are making reasonable
progress.

Ian.


Re: [Xen-devel] [PATCH] xen/domctl: lower loglevel of XEN_DOMCTL_memory_mapping

>>> On 11.09.15 at 14:05,  wrote:
> The flush_all(FLUSH_CACHE) in mtrr.c will result in a flush_area_mask for
> all CPUs in the host.
> It will take more time to issue an IPI to all logical cores the more cores
> there are. I admit that x2apic_cluster mode may speed this up, but not all
> hosts will have that enabled.
> 
> The data flush will force all data out to the memory controllers, and it's
> possible that CPUs in different packages have cached data all corresponding
> to a particular memory controller, which will become a bottleneck.
> 
> In the worst case, with a large delay between XEN_DOMCTL_memory_mapping
> hypercalls and on an 8-socket system, you may end up writing out
> 45MB (L3 cache) * 8 = 360MB to a single memory controller for every 64
> pages (256KiB) of domU p2m updated.

True.

Considering that BARs need to be properly aligned in both guest
and host address spaces, I wonder why we aren't using large
pages to map such huge BARs then. As it looks this would require
redefining the semantics of the domctl once again, but that's not
a big problem since - it's a domctl. I'll see if I can cook up something
(assuming that hosts used for passing through devices with such
huge BARs will have support for at least 2MB pages in both EPT
[NPT always has] and IOMMU).

Jan
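
As a rough illustration of why larger pages would help, the sketch below
counts how many mapping entries and 64-page batches a BAR needs. The helper
names are invented for this example and the 1 GiB BAR size in the note
after the code is a hypothetical figure; only the 64-page batch (256KiB,
i.e. 4 KiB pages) and the 2 MiB large page come from the discussion above.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_4K      (UINT64_C(4) << 10)   /* 4 KiB base page */
#define PAGE_2M      (UINT64_C(2) << 20)   /* 2 MiB large page */
#define BATCH_PAGES  UINT64_C(64)          /* pages updated per batch */

/* Number of page-sized mapping entries needed to cover bar_bytes. */
uint64_t entries_needed(uint64_t bar_bytes, uint64_t page_bytes)
{
    return (bar_bytes + page_bytes - 1) / page_bytes;
}

/* Number of 64-page batches when the BAR is mapped with 4 KiB pages. */
uint64_t batches_needed(uint64_t bar_bytes)
{
    uint64_t pages = entries_needed(bar_bytes, PAGE_4K);

    return (pages + BATCH_PAGES - 1) / BATCH_PAGES;
}
```

For a hypothetical 1 GiB BAR this gives 262144 entries with 4 KiB pages,
processed in 4096 batches of 64 pages, against 512 entries with 2 MiB pages.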


Re: [Xen-devel] [PATCH for 4.6 v4 2/3] xl/libxl: disallow saving a guest with vNUMA configured