Re: [PATCH v2 02/11] xen/hvmlite: Bootstrap HVMlite guest

2016-02-03 Thread Boris Ostrovsky
 force us to provide a well-defined structure
for considering code that should not run, by pegging it as required or supported
for different early x86 code stubs.


As I said before, I don't see how we can avoid having another entry 
point without making Xen change its load procedure, which is highly 
unlikely to happen.




I had hinted that perhaps we might be able to piggyback on top of the ELF loader
protocol as well; since that's a standard, I wonder whether it could instead
be extended to help unify a mechanism for different OSes rather than making
this just a solution for Linux.

Code review below.

On Mon, Feb 01, 2016 at 10:38:48AM -0500, Boris Ostrovsky wrote:

Start HVMlite guest at XEN_ELFNOTE_PHYS32_ENTRY address. Set up the
hypercall page, initialize boot_params, enable early page tables.

Since this stub is executed before the kernel entry point we cannot use
variables in .bss, which is cleared by the kernel. We explicitly place
variables that are initialized here into .data.

Signed-off-by: Boris Ostrovsky <boris.ostrov...@oracle.com>
---
diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
index 5774800..5f05fa2 100644
--- a/arch/x86/xen/enlighten.c
+++ b/arch/x86/xen/enlighten.c
@@ -171,6 +172,17 @@ struct tls_descs {
   */
  static DEFINE_PER_CPU(struct tls_descs, shadow_tls_desc);
  
+#ifdef CONFIG_XEN_PVHVM

+/*
+ * HVMlite variables. These need to live in the data segment since they
+ * are initialized before startup_{32|64}, which clears .bss, is invoked.
+ */
+int xen_hvmlite __attribute__((section(".data"))) = 0;
+struct hvm_start_info hvmlite_start_info __attribute__((section(".data")));
+uint hvmlite_start_info_sz = sizeof(hvmlite_start_info);
+struct boot_params xen_hvmlite_boot_params __attribute__((section(".data")));
+#endif
+

The section annotation seems like a very special use case but is likely worth
documenting and defining a new macro for in include/linux/compiler.h. That
would make it easier to change the section used here later, and would let
others easily find the reason for these annotations in a single place.


I wonder whether __initdata would be a good attribute. We only need this 
early in the boot.


And xen_hvmlite is gone now so we don't need to worry about it.

-boris


[PATCH v2 04/11] xen/hvmlite: Allow HVMlite guests delay initializing grant table

2016-02-01 Thread Boris Ostrovsky
.. just like we currently do for PVH guests

Signed-off-by: Boris Ostrovsky <boris.ostrov...@oracle.com>
---
 arch/x86/xen/grant-table.c |4 ++--
 drivers/xen/grant-table.c  |8 
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/arch/x86/xen/grant-table.c b/arch/x86/xen/grant-table.c
index e079500..40ad9c2 100644
--- a/arch/x86/xen/grant-table.c
+++ b/arch/x86/xen/grant-table.c
@@ -110,7 +110,7 @@ int arch_gnttab_init(unsigned long nr_shared)
	return arch_gnttab_valloc(&gnttab_shared_vm_area, nr_shared);
 }
 
-#ifdef CONFIG_XEN_PVH
+#ifdef CONFIG_XEN_PVHVM
 #include 
 #include 
 #include 
@@ -164,7 +164,7 @@ static int __init xlated_setup_gnttab_pages(void)
 
 static int __init xen_pvh_gnttab_setup(void)
 {
-   if (!xen_pvh_domain())
+   if (!xen_pvh_domain() && !xen_hvmlite)
return -ENODEV;
 
return xlated_setup_gnttab_pages();
diff --git a/drivers/xen/grant-table.c b/drivers/xen/grant-table.c
index c49f79e..9a239d5 100644
--- a/drivers/xen/grant-table.c
+++ b/drivers/xen/grant-table.c
@@ -1147,13 +1147,13 @@ EXPORT_SYMBOL_GPL(gnttab_init);
 
 static int __gnttab_init(void)
 {
+   if (!xen_domain())
+   return -ENODEV;
+
/* Delay grant-table initialization in the PV on HVM case */
-   if (xen_hvm_domain())
+   if (xen_hvm_domain() && !xen_hvmlite)
return 0;
 
-   if (!xen_pv_domain())
-   return -ENODEV;
-
return gnttab_init();
 }
 /* Starts after core_initcall so that xen_pvh_gnttab_setup can be called
-- 
1.7.1



[PATCH v2 06/11] xen/hvmlite: Prepare cpu_initialize_context() routine for HVMlite SMP

2016-02-01 Thread Boris Ostrovsky
A subsequent patch will add support for starting secondary VCPUs in an
HVMlite guest. This patch exists to simplify code review.

No functional changes (except for introduction of 'if (!xen_hvmlite)').

Signed-off-by: Boris Ostrovsky <boris.ostrov...@oracle.com>
---
 arch/x86/xen/smp.c |  104 ---
 1 files changed, 57 insertions(+), 47 deletions(-)

diff --git a/arch/x86/xen/smp.c b/arch/x86/xen/smp.c
index 3f4ebf0..5fc4afb 100644
--- a/arch/x86/xen/smp.c
+++ b/arch/x86/xen/smp.c
@@ -390,70 +390,80 @@ cpu_initialize_context(unsigned int cpu, struct 
task_struct *idle)
if (cpumask_test_and_set_cpu(cpu, xen_cpu_initialized_map))
return 0;
 
-   ctxt = kzalloc(sizeof(*ctxt), GFP_KERNEL);
-   if (ctxt == NULL)
-   return -ENOMEM;
+   if (!xen_hvmlite) {
 
-   gdt = get_cpu_gdt_table(cpu);
+   ctxt = kzalloc(sizeof(*ctxt), GFP_KERNEL);
+   if (ctxt == NULL)
+   return -ENOMEM;
+
+   gdt = get_cpu_gdt_table(cpu);
 
 #ifdef CONFIG_X86_32
-   /* Note: PVH is not yet supported on x86_32. */
-   ctxt->user_regs.fs = __KERNEL_PERCPU;
-   ctxt->user_regs.gs = __KERNEL_STACK_CANARY;
+   /* Note: PVH is not yet supported on x86_32. */
+   ctxt->user_regs.fs = __KERNEL_PERCPU;
+   ctxt->user_regs.gs = __KERNEL_STACK_CANARY;
 #endif
-   memset(&ctxt->fpu_ctxt, 0, sizeof(ctxt->fpu_ctxt));
+   memset(&ctxt->fpu_ctxt, 0, sizeof(ctxt->fpu_ctxt));
 
-   if (!xen_feature(XENFEAT_auto_translated_physmap)) {
-   ctxt->user_regs.eip = (unsigned long)cpu_bringup_and_idle;
-   ctxt->flags = VGCF_IN_KERNEL;
-   ctxt->user_regs.eflags = 0x1000; /* IOPL_RING1 */
-   ctxt->user_regs.ds = __USER_DS;
-   ctxt->user_regs.es = __USER_DS;
-   ctxt->user_regs.ss = __KERNEL_DS;
+   if (!xen_feature(XENFEAT_auto_translated_physmap)) {
+   ctxt->user_regs.eip =
+   (unsigned long)cpu_bringup_and_idle;
+   ctxt->flags = VGCF_IN_KERNEL;
+   ctxt->user_regs.eflags = 0x1000; /* IOPL_RING1 */
+   ctxt->user_regs.ds = __USER_DS;
+   ctxt->user_regs.es = __USER_DS;
+   ctxt->user_regs.ss = __KERNEL_DS;
 
-   xen_copy_trap_info(ctxt->trap_ctxt);
+   xen_copy_trap_info(ctxt->trap_ctxt);
 
-   ctxt->ldt_ents = 0;
+   ctxt->ldt_ents = 0;
 
-   BUG_ON((unsigned long)gdt & ~PAGE_MASK);
+   BUG_ON((unsigned long)gdt & ~PAGE_MASK);
 
-   gdt_mfn = arbitrary_virt_to_mfn(gdt);
-   make_lowmem_page_readonly(gdt);
-   make_lowmem_page_readonly(mfn_to_virt(gdt_mfn));
+   gdt_mfn = arbitrary_virt_to_mfn(gdt);
+   make_lowmem_page_readonly(gdt);
+   make_lowmem_page_readonly(mfn_to_virt(gdt_mfn));
 
-   ctxt->gdt_frames[0] = gdt_mfn;
-   ctxt->gdt_ents  = GDT_ENTRIES;
+   ctxt->gdt_frames[0] = gdt_mfn;
+   ctxt->gdt_ents  = GDT_ENTRIES;
 
-   ctxt->kernel_ss = __KERNEL_DS;
-   ctxt->kernel_sp = idle->thread.sp0;
+   ctxt->kernel_ss = __KERNEL_DS;
+   ctxt->kernel_sp = idle->thread.sp0;
 
 #ifdef CONFIG_X86_32
-   ctxt->event_callback_cs = __KERNEL_CS;
-   ctxt->failsafe_callback_cs  = __KERNEL_CS;
+   ctxt->event_callback_cs = __KERNEL_CS;
+   ctxt->failsafe_callback_cs  = __KERNEL_CS;
 #else
-   ctxt->gs_base_kernel = per_cpu_offset(cpu);
+   ctxt->gs_base_kernel = per_cpu_offset(cpu);
 #endif
-   ctxt->event_callback_eip=
-   (unsigned long)xen_hypervisor_callback;
-   ctxt->failsafe_callback_eip =
-   (unsigned long)xen_failsafe_callback;
-   ctxt->user_regs.cs = __KERNEL_CS;
-   per_cpu(xen_cr3, cpu) = __pa(swapper_pg_dir);
-   }
+   ctxt->event_callback_eip=
+   (unsigned long)xen_hypervisor_callback;
+   ctxt->failsafe_callback_eip =
+   (unsigned long)xen_failsafe_callback;
+   ctxt->user_regs.cs = __KERNEL_CS;
+   per_cpu(xen_cr3, cpu) = __pa(swapper_pg_dir);
+   }
 #ifdef CONFIG_XEN_PVH
-   else {
-   /*
-* The vcpu comes on kernel page tab

[PATCH v2 07/11] xen/hvmlite: Initialize context for secondary VCPUs

2016-02-01 Thread Boris Ostrovsky
Signed-off-by: Boris Ostrovsky <boris.ostrov...@oracle.com>
---
 arch/x86/xen/smp.c |   57 
 arch/x86/xen/smp.h |4 +++
 arch/x86/xen/xen-hvmlite.S |7 +
 3 files changed, 63 insertions(+), 5 deletions(-)

diff --git a/arch/x86/xen/smp.c b/arch/x86/xen/smp.c
index 5fc4afb..b265c4f 100644
--- a/arch/x86/xen/smp.c
+++ b/arch/x86/xen/smp.c
@@ -27,6 +27,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -384,6 +385,7 @@ cpu_initialize_context(unsigned int cpu, struct task_struct 
*idle)
struct vcpu_guest_context *ctxt;
struct desc_struct *gdt;
unsigned long gdt_mfn;
+   void *ctxt_arg;
 
/* used to tell cpu_init() that it can proceed with initialization */
cpumask_set_cpu(cpu, cpu_callout_mask);
@@ -392,7 +394,7 @@ cpu_initialize_context(unsigned int cpu, struct task_struct 
*idle)
 
if (!xen_hvmlite) {
 
-   ctxt = kzalloc(sizeof(*ctxt), GFP_KERNEL);
+   ctxt_arg = ctxt = kzalloc(sizeof(*ctxt), GFP_KERNEL);
if (ctxt == NULL)
return -ENOMEM;
 
@@ -460,14 +462,59 @@ cpu_initialize_context(unsigned int cpu, struct 
task_struct *idle)
ctxt->user_regs.esp = idle->thread.sp0 - sizeof(struct pt_regs);
ctxt->ctrlreg[3] = xen_pfn_to_cr3(virt_to_gfn(swapper_pg_dir));
} else {
-   ctxt = NULL; /* To quiet down compiler */
-   BUG();
+#ifdef CONFIG_XEN_PVHVM
+   struct vcpu_hvm_context *hctxt;
+
+   ctxt_arg = hctxt = kzalloc(sizeof(*hctxt), GFP_KERNEL);
+   if (hctxt == NULL)
+   return -ENOMEM;
+
+#ifdef CONFIG_X86_64
+   hctxt->mode = VCPU_HVM_MODE_64B;
+   hctxt->cpu_regs.x86_64.rip =
+   (unsigned long)secondary_startup_64;
+   hctxt->cpu_regs.x86_64.rsp = stack_start;
+
+   hctxt->cpu_regs.x86_64.cr0 =
+   X86_CR0_PG | X86_CR0_WP | X86_CR0_PE;
+   hctxt->cpu_regs.x86_64.cr4 = X86_CR4_PAE;
+   hctxt->cpu_regs.x86_64.cr3 =
+   xen_pfn_to_cr3(virt_to_mfn(init_level4_pgt));
+   hctxt->cpu_regs.x86_64.efer = EFER_LME | EFER_NX;
+#else
+   hctxt->mode = VCPU_HVM_MODE_32B;
+   /*
+* startup_32_smp expects GDT loaded so we can't jump
+* there directly.
+*/
+   hctxt->cpu_regs.x86_32.eip =
+   (unsigned long)hvmlite_smp_32 - __START_KERNEL_map;
+
+   hctxt->cpu_regs.x86_32.cr0 = X86_CR0_PE;
+
+   hctxt->cpu_regs.x86_32.cs_base = 0;
+   hctxt->cpu_regs.x86_32.cs_limit = ~0u;
+   hctxt->cpu_regs.x86_32.cs_ar = 0xc9b;
+   hctxt->cpu_regs.x86_32.ds_base = 0;
+   hctxt->cpu_regs.x86_32.ds_limit = ~0u;
+   hctxt->cpu_regs.x86_32.ds_ar = 0xc93;
+   hctxt->cpu_regs.x86_32.es_base = 0;
+   hctxt->cpu_regs.x86_32.es_limit = ~0u;
+   hctxt->cpu_regs.x86_32.es_ar = 0xc93;
+   hctxt->cpu_regs.x86_32.ss_base = 0;
+   hctxt->cpu_regs.x86_32.ss_limit = ~0u;
+   hctxt->cpu_regs.x86_32.ss_ar = 0xc93;
+   hctxt->cpu_regs.x86_32.tr_base = 0;
+   hctxt->cpu_regs.x86_32.tr_limit = 0xff;
+   hctxt->cpu_regs.x86_32.tr_ar = 0x8b;
+#endif
+#endif
}
 
-   if (HYPERVISOR_vcpu_op(VCPUOP_initialise, cpu, ctxt))
+   if (HYPERVISOR_vcpu_op(VCPUOP_initialise, cpu, ctxt_arg))
BUG();
 
-   kfree(ctxt);
+   kfree(ctxt_arg);
return 0;
 }
 
diff --git a/arch/x86/xen/smp.h b/arch/x86/xen/smp.h
index 963d62a..b4a833c 100644
--- a/arch/x86/xen/smp.h
+++ b/arch/x86/xen/smp.h
@@ -8,6 +8,10 @@ extern void xen_send_IPI_allbutself(int vector);
 extern void xen_send_IPI_all(int vector);
 extern void xen_send_IPI_self(int vector);
 
+#ifdef CONFIG_X86_32
+extern void hvmlite_smp_32(void);
+#endif
+
 #ifdef CONFIG_XEN_PVH
 extern void xen_pvh_early_cpu_init(int cpu, bool entry);
 #else
diff --git a/arch/x86/xen/xen-hvmlite.S b/arch/x86/xen/xen-hvmlite.S
index fc7c08c..805e6a0 100644
--- a/arch/x86/xen/xen-hvmlite.S
+++ b/arch/x86/xen/xen-hvmlite.S
@@ -144,6 +144,13 @@ ENTRY(hvmlite_start_xen)
	ljmp	$0x10, $_pa(startup_32)
 #endif
 
+#ifdef CONFIG_X86_32
+ENTRY(hvmlite_smp_32)
+mov $_pa(boot_gdt_descr), %eax
+lgdt (%eax)
+jmp startup_32_smp
+#endif
+
.data
 gdt:
.word   gdt_end - gdt
-- 
1.7.1



[PATCH v2 09/11] xen/hvmlite: Use x86's default timer init for HVMlite guests

2016-02-01 Thread Boris Ostrovsky
xen_timer_init() will be called from apic_bsp_setup().

Signed-off-by: Boris Ostrovsky <boris.ostrov...@oracle.com>
---
 arch/x86/xen/time.c |5 -
 1 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
index f1ba6a0..d77b398 100644
--- a/arch/x86/xen/time.c
+++ b/arch/x86/xen/time.c
@@ -492,7 +492,10 @@ void __init xen_init_time_ops(void)
 {
pv_time_ops = xen_time_ops;
 
-   x86_init.timers.timer_init = xen_time_init;
+   if (!xen_hvmlite)
+   x86_init.timers.timer_init = xen_time_init;
+   else
+   x86_init.timers.timer_init = x86_init_noop;
x86_init.timers.setup_percpu_clockev = x86_init_noop;
x86_cpuinit.setup_percpu_clockev = x86_init_noop;
 
-- 
1.7.1



[PATCH v2 01/11] xen/hvmlite: Import hvmlite-related Xen public interfaces

2016-02-01 Thread Boris Ostrovsky
Signed-off-by: Boris Ostrovsky <boris.ostrov...@oracle.com>
---
 include/xen/interface/elfnote.h  |   12 +++-
 include/xen/interface/hvm/hvm_vcpu.h |  143 ++
 include/xen/interface/xen.h  |   24 ++
 3 files changed, 178 insertions(+), 1 deletions(-)
 create mode 100644 include/xen/interface/hvm/hvm_vcpu.h

diff --git a/include/xen/interface/elfnote.h b/include/xen/interface/elfnote.h
index f90b034..9e9f9bf 100644
--- a/include/xen/interface/elfnote.h
+++ b/include/xen/interface/elfnote.h
@@ -193,9 +193,19 @@
 #define XEN_ELFNOTE_SUPPORTED_FEATURES 17
 
 /*
+ * Physical entry point into the kernel.
+ *
+ * 32bit entry point into the kernel. When requested to launch the
+ * guest kernel in an HVM container, Xen will use this entry point to
+ * launch the guest in 32bit protected mode with paging disabled.
+ * Ignored otherwise.
+ */
+#define XEN_ELFNOTE_PHYS32_ENTRY 18
+
+/*
  * The number of the highest elfnote defined.
  */
-#define XEN_ELFNOTE_MAX XEN_ELFNOTE_SUPPORTED_FEATURES
+#define XEN_ELFNOTE_MAX XEN_ELFNOTE_PHYS32_ENTRY
 
 #endif /* __XEN_PUBLIC_ELFNOTE_H__ */
 
diff --git a/include/xen/interface/hvm/hvm_vcpu.h 
b/include/xen/interface/hvm/hvm_vcpu.h
new file mode 100644
index 000..32ca83e
--- /dev/null
+++ b/include/xen/interface/hvm/hvm_vcpu.h
@@ -0,0 +1,143 @@
+/*
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to
+ * deal in the Software without restriction, including without limitation the
+ * rights to use, copy, modify, merge, publish, distribute, sublicense, and/or
+ * sell copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+ * DEALINGS IN THE SOFTWARE.
+ *
+ * Copyright (c) 2015, Roger Pau Monne <roger@citrix.com>
+ */
+
+#ifndef __XEN_PUBLIC_HVM_HVM_VCPU_H__
+#define __XEN_PUBLIC_HVM_HVM_VCPU_H__
+
+#include "../xen.h"
+
+struct vcpu_hvm_x86_32 {
+uint32_t eax;
+uint32_t ecx;
+uint32_t edx;
+uint32_t ebx;
+uint32_t esp;
+uint32_t ebp;
+uint32_t esi;
+uint32_t edi;
+uint32_t eip;
+uint32_t eflags;
+
+uint32_t cr0;
+uint32_t cr3;
+uint32_t cr4;
+
+uint32_t pad1;
+
+/*
+ * EFER should only be used to set the NXE bit (if required)
+ * when starting a vCPU in 32bit mode with paging enabled or
+ * to set the LME/LMA bits in order to start the vCPU in
+ * compatibility mode.
+ */
+uint64_t efer;
+
+uint32_t cs_base;
+uint32_t ds_base;
+uint32_t ss_base;
+uint32_t es_base;
+uint32_t tr_base;
+uint32_t cs_limit;
+uint32_t ds_limit;
+uint32_t ss_limit;
+uint32_t es_limit;
+uint32_t tr_limit;
+uint16_t cs_ar;
+uint16_t ds_ar;
+uint16_t ss_ar;
+uint16_t es_ar;
+uint16_t tr_ar;
+
+uint16_t pad2[3];
+};
+
+/*
+ * The layout of the _ar fields of the segment registers is the
+ * following:
+ *
+ * Bits   [0,3]: type (bits 40-43).
+ * Bit      4: s    (descriptor type, bit 44).
+ * Bits  [5,6]: dpl  (descriptor privilege level, bits 45-46).
+ * Bit      7: p    (segment-present, bit 47).
+ * Bit      8: avl  (available for system software, bit 52).
+ * Bit      9: l    (64-bit code segment, bit 53).
+ * Bit     10: db   (meaning depends on the segment, bit 54).
+ * Bit     11: g    (granularity, bit 55).
+ * Bits [12,15]: unused, must be blank.
+ *
+ * A more complete description of the meaning of these fields can be
+ * obtained from the Intel SDM, Volume 3, section 3.4.5.
+ */
+
+struct vcpu_hvm_x86_64 {
+uint64_t rax;
+uint64_t rcx;
+uint64_t rdx;
+uint64_t rbx;
+uint64_t rsp;
+uint64_t rbp;
+uint64_t rsi;
+uint64_t rdi;
+uint64_t rip;
+uint64_t rflags;
+
+uint64_t cr0;
+uint64_t cr3;
+uint64_t cr4;
+uint64_t efer;
+
+/*
+ * Using VCPU_HVM_MODE_64B implies that the vCPU is launched
+ * directly in long mode, so the cached parts of the segment
+ * registers get set to match that environment.
+ *
+ * If the user wants to launch the vCPU in compatibility mode
+ * the 32-bit structure should be used instead.
+ */
+};
+
+struct vcpu_hvm_context {
+#define VCPU_HVM_MODE_32B 0  /* 32bit fields 

[PATCH v2 08/11] xen/hvmlite: Extend APIC operations for HVMlite guests

2016-02-01 Thread Boris Ostrovsky
HVMlite guests need to be viewed as having an APIC; otherwise the smpboot
code, for example, will complain.

Signed-off-by: Boris Ostrovsky <boris.ostrov...@oracle.com>
---

Not sure about xen_cpu_present_to_apicid() being an identity function, given
xen_x86_32_early_logical_apicid().

Suspend/resume will cause xen_apic_write() warnings.

 arch/x86/xen/apic.c |   39 +--
 arch/x86/xen/smp.c  |2 +-
 2 files changed, 38 insertions(+), 3 deletions(-)

diff --git a/arch/x86/xen/apic.c b/arch/x86/xen/apic.c
index acda713..9bf27f2 100644
--- a/arch/x86/xen/apic.c
+++ b/arch/x86/xen/apic.c
@@ -6,6 +6,7 @@
 
 #include 
 #include 
+#include 
 #include "xen-ops.h"
 #include "pmu.h"
 #include "smp.h"
@@ -78,6 +79,21 @@ static void xen_apic_write(u32 reg, u32 val)
return;
}
 
+   if (xen_hvmlite) {
+   switch (reg) {
+   case APIC_TASKPRI:
+   case APIC_SPIV:
+   case APIC_ESR:
+   case APIC_LVTT:
+   case APIC_LVT0:
+   case APIC_LVT1:
+   case APIC_LVTERR:
+   pr_debug("Unimplemented APIC register %x,"
+" value: %x\n", reg, val);
+   return;
+   }
+   }
+
/* Warn to see if there's any stray references */
WARN(1,"register: %x, value: %x\n", reg, val);
 }
@@ -100,7 +116,7 @@ static u32 xen_safe_apic_wait_icr_idle(void)
 
 static int xen_apic_probe_pv(void)
 {
-   if (xen_pv_domain())
+   if (xen_pv_domain() || xen_hvmlite)
return 1;
 
return 0;
@@ -142,6 +158,19 @@ static void xen_silent_inquire(int apicid)
 {
 }
 
+static int xen_cpu_present_to_apicid(int cpu)
+{
+   return cpu;
+}
+
+static int xen_wakeup_secondary_cpu(int cpu, unsigned long start_eip)
+{
+   if (!xen_hvmlite)
+   return -EINVAL;
+
+   return HYPERVISOR_vcpu_op(VCPUOP_up, cpu, NULL);
+}
+
 static struct apic xen_pv_apic = {
.name   = "Xen PV",
.probe  = xen_apic_probe_pv,
@@ -162,7 +191,7 @@ static struct apic xen_pv_apic = {
 
.ioapic_phys_id_map = default_ioapic_phys_id_map, /* Used 
on 32-bit */
.setup_apic_routing = NULL,
-   .cpu_present_to_apicid  = default_cpu_present_to_apicid,
+   .cpu_present_to_apicid  = xen_cpu_present_to_apicid,
.apicid_to_cpu_present  = physid_set_mask_of_physid, /* Used on 
32-bit */
.check_phys_apicid_present  = default_check_phys_apicid_present, /* 
smp_sanity_check needs it */
.phys_pkg_id= xen_phys_pkg_id, /* detect_ht */
@@ -180,6 +209,9 @@ static struct apic xen_pv_apic = {
.send_IPI_all   = xen_send_IPI_all,
.send_IPI_self  = xen_send_IPI_self,
 #endif
+
+   .wakeup_secondary_cpu   = xen_wakeup_secondary_cpu,
+
/* .wait_for_init_deassert- used  by AP bootup - smp_callin which we 
don't use */
.inquire_remote_apic= xen_silent_inquire,
 
@@ -216,5 +248,8 @@ void __init xen_init_apic(void)
	apic = &xen_pv_apic;
 
x86_platform.apic_post_init = xen_apic_check;
+
+   if (xen_hvmlite)
+   setup_force_cpu_cap(X86_FEATURE_APIC);
 }
 apic_driver(xen_pv_apic);
diff --git a/arch/x86/xen/smp.c b/arch/x86/xen/smp.c
index b265c4f..fb085ef 100644
--- a/arch/x86/xen/smp.c
+++ b/arch/x86/xen/smp.c
@@ -214,7 +214,7 @@ static int xen_smp_intr_init(unsigned int cpu)
 * The IRQ worker on PVHVM goes through the native path and uses the
 * IPI mechanism.
 */
-   if (xen_hvm_domain())
+   if (xen_hvm_domain() && !xen_hvmlite)
return 0;
 
callfunc_name = kasprintf(GFP_KERNEL, "irqwork%d", cpu);
-- 
1.7.1



[PATCH v2 05/11] xen/hvmlite: HVMlite guests always have PV devices

2016-02-01 Thread Boris Ostrovsky
Signed-off-by: Boris Ostrovsky <boris.ostrov...@oracle.com>
---
 arch/x86/xen/platform-pci-unplug.c |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/xen/platform-pci-unplug.c 
b/arch/x86/xen/platform-pci-unplug.c
index 9586ff3..802ec90 100644
--- a/arch/x86/xen/platform-pci-unplug.c
+++ b/arch/x86/xen/platform-pci-unplug.c
@@ -73,8 +73,8 @@ bool xen_has_pv_devices(void)
if (!xen_domain())
return false;
 
-   /* PV domains always have them. */
-   if (xen_pv_domain())
+   /* PV and HVMlite domains always have them. */
+   if (xen_pv_domain() || xen_hvmlite)
return true;
 
/* And user has xen_platform_pci=0 set in guest config as
-- 
1.7.1



[PATCH v2 02/11] xen/hvmlite: Bootstrap HVMlite guest

2016-02-01 Thread Boris Ostrovsky
Start HVMlite guest at XEN_ELFNOTE_PHYS32_ENTRY address. Set up the
hypercall page, initialize boot_params, enable early page tables.

Since this stub is executed before the kernel entry point we cannot use
variables in .bss, which is cleared by the kernel. We explicitly place
variables that are initialized here into .data.

Signed-off-by: Boris Ostrovsky <boris.ostrov...@oracle.com>
---
 arch/x86/xen/Makefile  |1 +
 arch/x86/xen/enlighten.c   |   86 +-
 arch/x86/xen/xen-hvmlite.S |  175 
 include/xen/xen.h  |6 ++
 4 files changed, 267 insertions(+), 1 deletions(-)
 create mode 100644 arch/x86/xen/xen-hvmlite.S

diff --git a/arch/x86/xen/Makefile b/arch/x86/xen/Makefile
index e47e527..1d913d7 100644
--- a/arch/x86/xen/Makefile
+++ b/arch/x86/xen/Makefile
@@ -23,3 +23,4 @@ obj-$(CONFIG_XEN_DEBUG_FS)+= debugfs.o
 obj-$(CONFIG_XEN_DOM0) += vga.o
 obj-$(CONFIG_SWIOTLB_XEN)  += pci-swiotlb-xen.o
 obj-$(CONFIG_XEN_EFI)  += efi.o
+obj-$(CONFIG_XEN_PVHVM)+= xen-hvmlite.o
diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
index 5774800..5f05fa2 100644
--- a/arch/x86/xen/enlighten.c
+++ b/arch/x86/xen/enlighten.c
@@ -118,7 +118,8 @@ DEFINE_PER_CPU(struct vcpu_info *, xen_vcpu);
  */
 DEFINE_PER_CPU(struct vcpu_info, xen_vcpu_info);
 
-enum xen_domain_type xen_domain_type = XEN_NATIVE;
+enum xen_domain_type xen_domain_type
+   __attribute__((section(".data"))) = XEN_NATIVE;
 EXPORT_SYMBOL_GPL(xen_domain_type);
 
 unsigned long *machine_to_phys_mapping = (void *)MACH2PHYS_VIRT_START;
@@ -171,6 +172,17 @@ struct tls_descs {
  */
 static DEFINE_PER_CPU(struct tls_descs, shadow_tls_desc);
 
+#ifdef CONFIG_XEN_PVHVM
+/*
+ * HVMlite variables. These need to live in the data segment since they
+ * are initialized before startup_{32|64}, which clears .bss, is invoked.
+ */
+int xen_hvmlite __attribute__((section(".data"))) = 0;
+struct hvm_start_info hvmlite_start_info __attribute__((section(".data")));
+uint hvmlite_start_info_sz = sizeof(hvmlite_start_info);
+struct boot_params xen_hvmlite_boot_params __attribute__((section(".data")));
+#endif
+
 static void clamp_max_cpus(void)
 {
 #ifdef CONFIG_SMP
@@ -1731,6 +1743,78 @@ asmlinkage __visible void __init xen_start_kernel(void)
 #endif
 }
 
+#ifdef CONFIG_XEN_PVHVM
+static void __init hvmlite_bootparams(void)
+{
+   struct xen_memory_map memmap;
+   int i;
+
+   memset(&xen_hvmlite_boot_params, 0, sizeof(xen_hvmlite_boot_params));
+
+   memmap.nr_entries = ARRAY_SIZE(xen_hvmlite_boot_params.e820_map);
+   set_xen_guest_handle(memmap.buffer, xen_hvmlite_boot_params.e820_map);
+   if (HYPERVISOR_memory_op(XENMEM_memory_map, &memmap)) {
+   xen_raw_console_write("XENMEM_memory_map failed\n");
+   BUG();
+   }
+
+   xen_hvmlite_boot_params.e820_map[memmap.nr_entries].addr =
+   ISA_START_ADDRESS;
+   xen_hvmlite_boot_params.e820_map[memmap.nr_entries].size =
+   ISA_END_ADDRESS - ISA_START_ADDRESS;
+   xen_hvmlite_boot_params.e820_map[memmap.nr_entries++].type =
+   E820_RESERVED;
+
+   sanitize_e820_map(xen_hvmlite_boot_params.e820_map,
+ ARRAY_SIZE(xen_hvmlite_boot_params.e820_map),
+ &memmap.nr_entries);
+
+   xen_hvmlite_boot_params.e820_entries = memmap.nr_entries;
+   for (i = 0; i < xen_hvmlite_boot_params.e820_entries; i++)
+   e820_add_region(xen_hvmlite_boot_params.e820_map[i].addr,
+   xen_hvmlite_boot_params.e820_map[i].size,
+   xen_hvmlite_boot_params.e820_map[i].type);
+
+   xen_hvmlite_boot_params.hdr.cmd_line_ptr =
+   hvmlite_start_info.cmdline_paddr;
+
+   /* The first module is always ramdisk */
+   if (hvmlite_start_info.nr_modules) {
+   struct hvm_modlist_entry *modaddr =
+   __va(hvmlite_start_info.modlist_paddr);
+   xen_hvmlite_boot_params.hdr.ramdisk_image = modaddr->paddr;
+   xen_hvmlite_boot_params.hdr.ramdisk_size = modaddr->size;
+   }
+
+   /*
+* See Documentation/x86/boot.txt.
+*
+* Version 2.12 supports the Xen entry point but we will use the default
+* x86/PC environment (i.e. hardware_subarch 0).
+*/
+   xen_hvmlite_boot_params.hdr.version = 0x212;
+   xen_hvmlite_boot_params.hdr.type_of_loader = 9; /* Xen loader */
+}
+
+/*
+ * This routine (and those that it might call) should not use
+ * anything that lives in .bss since that segment will be cleared later
+ */
+void __init xen_prepare_hvmlite(void)
+{
+   u32 eax, ecx, edx, msr;
+   u64 pfn;
+
+   xen_hvmlite = 1;
+
+   cpuid(xen_cpuid_base() + 2, &eax, &msr, &ecx, &edx);
+   pfn = __pa(hypercall_page);
+   wrmsr_safe(msr, (u32)pfn, (u32)(pfn >> 32));
+
+   hvmli

[PATCH v2 00/11] HVMlite domU support

2016-02-01 Thread Boris Ostrovsky
This series introduces HVMlite support for unprivileged guests.

It has been tested on Intel/AMD, both 32- and 64-bit, including CPU on- and
offlining and save/restore. (Restore will result in APIC write warnings,
which exist now for 32-bit PV guests as well, so I didn't address this in
this series.)

Compile-tested on ARM

v2:
* Dropped first patch (don't use initial_page_table/initial_pg_pmd and
  start_secondary)
* Much simplified kernel init code (after realizing that xen_hvm_guest_init()/
  init_hvm_pv_info() already do most of it). As a result, factoring out of
  xen_start_kernel() is no longer necessary. (patch 3)
* Dropped pcifront use until we decide that we *are* using it. (patch 5)


Boris Ostrovsky (11):
  xen/hvmlite: Import hvmlite-related Xen public interfaces
  xen/hvmlite: Bootstrap HVMlite guest
  xen/hvmlite: Initialize HVMlite kernel
  xen/hvmlite: Allow HVMlite guests delay initializing grant table
  xen/hvmlite: HVMlite guests always have PV devices
  xen/hvmlite: Prepare cpu_initialize_context() routine for HVMlite SMP
  xen/hvmlite: Initialize context for secondary VCPUs
  xen/hvmlite: Extend APIC operations for HVMlite guests
  xen/hvmlite: Use x86's default timer init for HVMlite guests
  xen/hvmlite: Boot secondary CPUs
  xen/hvmlite: Enable CPU on-/offlining

 arch/x86/xen/Makefile|1 +
 arch/x86/xen/apic.c  |   39 +-
 arch/x86/xen/enlighten.c |  106 +-
 arch/x86/xen/grant-table.c   |4 +-
 arch/x86/xen/platform-pci-unplug.c   |4 +-
 arch/x86/xen/pmu.c   |4 +-
 arch/x86/xen/smp.c   |  252 --
 arch/x86/xen/smp.h   |4 +
 arch/x86/xen/time.c  |5 +-
 arch/x86/xen/xen-hvmlite.S   |  190 +
 drivers/xen/grant-table.c|8 +-
 include/xen/interface/elfnote.h  |   12 ++-
 include/xen/interface/hvm/hvm_vcpu.h |  143 +++
 include/xen/interface/xen.h  |   24 
 include/xen/xen.h|6 +
 15 files changed, 706 insertions(+), 96 deletions(-)
 create mode 100644 arch/x86/xen/xen-hvmlite.S
 create mode 100644 include/xen/interface/hvm/hvm_vcpu.h



[PATCH v2 03/11] xen/hvmlite: Initialize HVMlite kernel

2016-02-01 Thread Boris Ostrovsky
HVMlite guests need to make a few additional initialization calls
compared to regular HVM guests.

Signed-off-by: Boris Ostrovsky <boris.ostrov...@oracle.com>
---
 arch/x86/xen/enlighten.c |   18 --
 1 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
index 5f05fa2..1409de6 100644
--- a/arch/x86/xen/enlighten.c
+++ b/arch/x86/xen/enlighten.c
@@ -1863,15 +1863,21 @@ static void __init init_hvm_pv_info(void)
minor = eax & 0x;
printk(KERN_INFO "Xen version %d.%d.\n", major, minor);
 
-   cpuid(base + 2, &pages, &msr, &ecx, &edx);
-
-   pfn = __pa(hypercall_page);
-   wrmsr_safe(msr, (u32)pfn, (u32)(pfn >> 32));
+   /* HVMlite set up the hypercall page earlier, in xen_prepare_hvmlite() */
+   if (xen_hvmlite) {
+   pv_info.name = "Xen HVMlite";
+   pv_info.paravirt_enabled = 1;
+   xen_init_apic();
+   machine_ops = xen_machine_ops;
+   } else {
+   pv_info.name = "Xen HVM";
+   cpuid(base + 2, &pages, &msr, &ecx, &edx);
+   pfn = __pa(hypercall_page);
+   wrmsr_safe(msr, (u32)pfn, (u32)(pfn >> 32));
+   }
 
xen_setup_features();
 
-   pv_info.name = "Xen HVM";
-
xen_domain_type = XEN_HVM_DOMAIN;
 }
 
-- 
1.7.1



[PATCH v2 10/11] xen/hvmlite: Boot secondary CPUs

2016-02-01 Thread Boris Ostrovsky
HVMlite secondary VCPUs use the bare-metal bringup path (i.e. native_*
smp_ops) but need to do some preparation in PV code.

Signed-off-by: Boris Ostrovsky <boris.ostrov...@oracle.com>
---
 arch/x86/xen/enlighten.c |2 +
 arch/x86/xen/pmu.c   |4 +-
 arch/x86/xen/smp.c   |   64 +
 3 files changed, 51 insertions(+), 19 deletions(-)

diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
index 1409de6..76fb0b2 100644
--- a/arch/x86/xen/enlighten.c
+++ b/arch/x86/xen/enlighten.c
@@ -1933,6 +1933,8 @@ static void __init xen_hvm_guest_init(void)
xen_have_vector_callback = 1;
xen_hvm_smp_init();
	register_cpu_notifier(&xen_hvm_cpu_notifier);
+   if (xen_hvmlite)
+   smp_found_config = 1;
xen_unplug_emulated_devices();
x86_init.irqs.intr_init = xen_init_IRQ;
xen_hvm_init_time_ops();
diff --git a/arch/x86/xen/pmu.c b/arch/x86/xen/pmu.c
index 724a087..7bc209b 100644
--- a/arch/x86/xen/pmu.c
+++ b/arch/x86/xen/pmu.c
@@ -518,7 +518,7 @@ void xen_pmu_init(int cpu)
 
BUILD_BUG_ON(sizeof(struct xen_pmu_data) > PAGE_SIZE);
 
-   if (xen_hvm_domain())
+   if (xen_hvm_domain() && !xen_hvmlite)
return;
 
xenpmu_data = (struct xen_pmu_data *)get_zeroed_page(GFP_KERNEL);
@@ -556,7 +556,7 @@ void xen_pmu_finish(int cpu)
 {
struct xen_pmu_params xp;
 
-   if (xen_hvm_domain())
+   if (xen_hvm_domain() && !xen_hvmlite)
return;
 
xp.vcpu = cpu;
diff --git a/arch/x86/xen/smp.c b/arch/x86/xen/smp.c
index fb085ef..a22cae2 100644
--- a/arch/x86/xen/smp.c
+++ b/arch/x86/xen/smp.c
@@ -348,26 +348,31 @@ static void __init xen_smp_prepare_cpus(unsigned int 
max_cpus)
}
xen_init_lock_cpu(0);
 
-   smp_store_boot_cpu_info();
-   cpu_data(0).x86_max_cores = 1;
+   if (!xen_hvmlite) {
+   smp_store_boot_cpu_info();
+   cpu_data(0).x86_max_cores = 1;
+
+   for_each_possible_cpu(i) {
+   zalloc_cpumask_var(&per_cpu(cpu_sibling_map, i),
+  GFP_KERNEL);
+   zalloc_cpumask_var(&per_cpu(cpu_core_map, i),
+  GFP_KERNEL);
+   zalloc_cpumask_var(&per_cpu(cpu_llc_shared_map, i),
+  GFP_KERNEL);
+   }
+   set_cpu_sibling_map(0);
 
-   for_each_possible_cpu(i) {
-   zalloc_cpumask_var(&per_cpu(cpu_sibling_map, i), GFP_KERNEL);
-   zalloc_cpumask_var(&per_cpu(cpu_core_map, i), GFP_KERNEL);
-   zalloc_cpumask_var(&per_cpu(cpu_llc_shared_map, i), GFP_KERNEL);
+   if (!alloc_cpumask_var(&xen_cpu_initialized_map, GFP_KERNEL))
+   panic("could not allocate xen_cpu_initialized_map\n");
+
+   cpumask_copy(xen_cpu_initialized_map, cpumask_of(0));
}
-   set_cpu_sibling_map(0);
 
xen_pmu_init(0);
 
if (xen_smp_intr_init(0))
BUG();
 
-   if (!alloc_cpumask_var(&xen_cpu_initialized_map, GFP_KERNEL))
-   panic("could not allocate xen_cpu_initialized_map\n");
-
-   cpumask_copy(xen_cpu_initialized_map, cpumask_of(0));
-
/* Restrict the possible_map according to max_cpus. */
while ((num_possible_cpus() > 1) && (num_possible_cpus() > max_cpus)) {
for (cpu = nr_cpu_ids - 1; !cpu_possible(cpu); cpu--)
@@ -375,8 +380,11 @@ static void __init xen_smp_prepare_cpus(unsigned int max_cpus)
set_cpu_possible(cpu, false);
}
 
-   for_each_possible_cpu(cpu)
+   for_each_possible_cpu(cpu) {
set_cpu_present(cpu, true);
+   if (xen_hvmlite)
+   physid_set(cpu, phys_cpu_present_map);
+   }
 }
 
 static int
@@ -810,10 +818,15 @@ void __init xen_smp_init(void)
 
 static void __init xen_hvm_smp_prepare_cpus(unsigned int max_cpus)
 {
+   if (xen_hvmlite)
+   xen_smp_prepare_cpus(max_cpus);
+
native_smp_prepare_cpus(max_cpus);
-   WARN_ON(xen_smp_intr_init(0));
 
-   xen_init_lock_cpu(0);
+   if (!xen_hvmlite) {
+   WARN_ON(xen_smp_intr_init(0));
+   xen_init_lock_cpu(0);
+   }
 }
 
 static int xen_hvm_cpu_up(unsigned int cpu, struct task_struct *tidle)
@@ -836,8 +849,21 @@ static int xen_hvm_cpu_up(unsigned int cpu, struct task_struct *tidle)
*/
rc = xen_smp_intr_init(cpu);
WARN_ON(rc);
-   if (!rc)
-   rc =  native_cpu_up(cpu, tidle);
+
+   if (xen_hvmlite) {
+   rc = cpu_initialize_context(cpu, tidle);
+   if (rc) {
+   xen_smp_intr_free(cpu);
+   return rc;
+   }
+   xen_pmu_init(cpu);
+   }
+
+   if (!rc) {
+  

[PATCH v2 11/11] xen/hvmlite: Enable CPU on-/offlining

2016-02-01 Thread Boris Ostrovsky
When offlining, we should properly clean up interrupts and wait
until hypervisor declares VCPU as down before cleaning up.

After VCPU that was previously offlined is brought back to life
we want to jump back to bare-metal entry points. It's a simple
jump on 64-bit but requires minor tweaking for 32-bit case.

Signed-off-by: Boris Ostrovsky <boris.ostrov...@oracle.com>
---
 arch/x86/xen/smp.c |   35 +--
 arch/x86/xen/xen-hvmlite.S |8 
 2 files changed, 33 insertions(+), 10 deletions(-)

diff --git a/arch/x86/xen/smp.c b/arch/x86/xen/smp.c
index a22cae2..d768c4e 100644
--- a/arch/x86/xen/smp.c
+++ b/arch/x86/xen/smp.c
@@ -143,7 +143,7 @@ static void xen_smp_intr_free(unsigned int cpu)
kfree(per_cpu(xen_callfuncsingle_irq, cpu).name);
per_cpu(xen_callfuncsingle_irq, cpu).name = NULL;
}
-   if (xen_hvm_domain())
+   if (xen_hvm_domain() && !xen_hvmlite)
return;
 
if (per_cpu(xen_irq_work, cpu).irq >= 0) {
@@ -585,7 +585,8 @@ static int xen_cpu_disable(void)
 
 static void xen_cpu_die(unsigned int cpu)
 {
-   while (xen_pv_domain() && HYPERVISOR_vcpu_op(VCPUOP_is_up, cpu, NULL)) {
+   while ((xen_pv_domain() || xen_hvmlite) &&
+  HYPERVISOR_vcpu_op(VCPUOP_is_up, cpu, NULL)) {
__set_current_state(TASK_UNINTERRUPTIBLE);
schedule_timeout(HZ/10);
}
@@ -602,14 +603,28 @@ static void xen_play_dead(void) /* used only with HOTPLUG_CPU */
 {
play_dead_common();
HYPERVISOR_vcpu_op(VCPUOP_down, smp_processor_id(), NULL);
-   cpu_bringup();
-   /*
-* commit 4b0c0f294 (tick: Cleanup NOHZ per cpu data on cpu down)
-* clears certain data that the cpu_idle loop (which called us
-* and that we return from) expects. The only way to get that
-* data back is to call:
-*/
-   tick_nohz_idle_enter();
+
+   if (!xen_hvm_domain()) {
+   cpu_bringup();
+   /*
+* commit 4b0c0f294 (tick: Cleanup NOHZ per cpu data on cpu
+* down) clears certain data that the cpu_idle loop (which
+* called us and that we return from) expects. The only way to
+* get that data back is to call:
+*/
+   tick_nohz_idle_enter();
+   } else {
+   /*
+* For 64-bit we can jump directly to SMP entry point but for
+* 32-bit we need to disable paging and load boot GDT (just
+* like in cpu_initialize_context()).
+*/
+#ifdef CONFIG_X86_64
+   asm("jmp secondary_startup_64");
+#else
+   asm("jmp hvmlite_smp_32_hp");
+#endif
+   }
 }
 
 #else /* !CONFIG_HOTPLUG_CPU */
diff --git a/arch/x86/xen/xen-hvmlite.S b/arch/x86/xen/xen-hvmlite.S
index 805e6a0..1db2e96 100644
--- a/arch/x86/xen/xen-hvmlite.S
+++ b/arch/x86/xen/xen-hvmlite.S
@@ -145,6 +145,14 @@ ENTRY(hvmlite_start_xen)
 #endif
 
 #ifdef CONFIG_X86_32
+ENTRY(hvmlite_smp_32_hp)
+   movl $_pa(initial_page_table), %eax
+   movl %eax, %cr3
+   ljmp $__KERNEL_CS,$_pa(5f)
+5:
+   movl $X86_CR0_PE, %eax
+   movl %eax, %cr0
+
 ENTRY(hvmlite_smp_32)
 mov $_pa(boot_gdt_descr), %eax
 lgdt (%eax)
-- 
1.7.1



[PATCH v2] xen/x86: Zero out .bss for PV guests

2016-02-24 Thread Boris Ostrovsky
Baremetal kernels clear .bss early in the boot but Xen PV guests don't
execute that code. They have been able to run without problems because
Xen domain builder happens to give out zeroed pages. However, since this
is not really guaranteed, .bss should be explicitly cleared.

(Since we introduce macros for specifying 32- and 64-bit registers we
can get rid of ifdefs in startup_xen())

Signed-off-by: Boris Ostrovsky <boris.ostrov...@oracle.com>
Cc: sta...@vger.kernel.org
---
 arch/x86/xen/xen-head.S | 29 ++---
 1 file changed, 22 insertions(+), 7 deletions(-)

diff --git a/arch/x86/xen/xen-head.S b/arch/x86/xen/xen-head.S
index b65f59a..2af87d1 100644
--- a/arch/x86/xen/xen-head.S
+++ b/arch/x86/xen/xen-head.S
@@ -35,16 +35,31 @@
 #define PVH_FEATURES (0)
 #endif
 
-   __INIT
-ENTRY(startup_xen)
-   cld
 #ifdef CONFIG_X86_32
-   mov %esi,xen_start_info
-   mov $init_thread_union+THREAD_SIZE,%esp
+#define REG(register)  %e##register
+#define WSIZE_SHIFT2
+#define STOS   stosl
 #else
-   mov %rsi,xen_start_info
-   mov $init_thread_union+THREAD_SIZE,%rsp
+#define REG(register)  %r##register
+#define WSIZE_SHIFT3
+#define STOS   stosq
 #endif
+
+   __INIT
+ENTRY(startup_xen)
+   cld
+
+   /* Clear .bss */
+   xor REG(ax),REG(ax)
+   mov $__bss_start,REG(di)
+   mov $__bss_stop,REG(cx)
+   sub REG(di),REG(cx)
+   shr $WSIZE_SHIFT,REG(cx)
+   rep STOS
+
+   mov REG(si),xen_start_info
+   mov $init_thread_union+THREAD_SIZE,REG(sp)
+
jmp xen_start_kernel
 
__FINIT
-- 
2.1.0



Re: [PATCH 1/2] hvc_xen: add earlycon support

2016-02-24 Thread Boris Ostrovsky

On 02/24/2016 07:23 AM, Stefano Stabellini wrote:

Introduce EARLYCON support in hvc_xen, useful for early debugging on arm
and arm64, where xen early_printk is not available. Differently from
xenboot_write_console on x86, we won't just return if !xen_pv_domain(),
because arm and arm64 guests are actually xen_hvm_domain() from linux
pov, and also because we want to capture all possible writes, including
the ones earlier than xen_early_init().

Signed-off-by: Stefano Stabellini
---
  drivers/tty/hvc/hvc_xen.c |   48 ++---
  1 file changed, 37 insertions(+), 11 deletions(-)

diff --git a/drivers/tty/hvc/hvc_xen.c b/drivers/tty/hvc/hvc_xen.c
index fa816b7..34e8e9f 100644
--- a/drivers/tty/hvc/hvc_xen.c
+++ b/drivers/tty/hvc/hvc_xen.c
@@ -25,6 +25,7 @@
  #include 
  #include 
  #include 
+#include <linux/serial_core.h>
  
  #include 

  #include 
@@ -597,21 +598,11 @@ static int xen_cons_init(void)
  }
  console_initcall(xen_cons_init);
  
-#ifdef CONFIG_EARLY_PRINTK

-static void xenboot_write_console(struct console *console, const char *string,
- unsigned len)
+static void domU_write_early_console(const char *string, unsigned len)
  {
unsigned int linelen, off = 0;
const char *pos;
  
-	if (!xen_pv_domain())

-   return;
-
-   dom0_write_console(0, string, len);
-
-   if (xen_initial_domain())
-   return;
-
domU_write_console(0, "(early) ", 8);
while (off < len && NULL != (pos = strchr(string+off, '\n'))) {
linelen = pos-string+off;
@@ -625,6 +616,21 @@ static void xenboot_write_console(struct console *console, const char *string,
domU_write_console(0, string+off, len-off);
  }
  
+#ifdef CONFIG_EARLY_PRINTK

+static void xenboot_write_console(struct console *console, const char *string,
+ unsigned len)
+{
+   if (!xen_pv_domain())
+   return;


This, BTW, will break PVHv2 code (which is not there yet). Your patch is 
not making the code any different from what it is now but I wonder 
whether xenboot_write_console() can be made to work for HVM guests at 
this time since you making changes there anyway.


Or we can delay it until HVMlite/PVHv2 patches make it in.


+
+   dom0_write_console(0, string, len);
+
+   if (xen_initial_domain())
+   return;
+
+   domU_write_early_console(string, len);
+}
+
  struct console xenboot_console = {
.name   = "xenboot",
.write  = xenboot_write_console,
@@ -664,3 +670,23 @@ void xen_raw_printk(const char *fmt, ...)
  
  	xen_raw_console_write(buf);

  }
+
+static void xenboot_earlycon_write(struct console *console,
+ const char *string,
+ unsigned len)
+{
+   dom0_write_console(0, string, len);
+
+   if (xen_initial_domain())
+   return;
+
+   domU_write_early_console(string, len);
+}


xenboot_earlycon_write() and xenboot_write_console() share most of the 
code, can it be factored out?
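A hedged sketch of the factoring asked about: both console callbacks delegate to one shared body. The dom0/domU writer functions below are stand-ins that merely record output so the shape is runnable outside the kernel; xenboot_common_write and the stub bodies are hypothetical names, not the mainline code.

```c
/*
 * Sketch only: xenboot_write_console() and xenboot_earlycon_write()
 * could both delegate to a helper like xenboot_common_write(). The
 * writer functions are stand-ins that record output into a buffer so
 * this compiles and runs outside the kernel.
 */
#include <stdio.h>
#include <string.h>

static char log_buf[128];

static void dom0_write_console(int vtermno, const char *s, unsigned int len)
{
	(void)vtermno;
	strncat(log_buf, "dom0:", sizeof(log_buf) - strlen(log_buf) - 1);
	strncat(log_buf, s, len);
}

static void domU_write_early_console(const char *s, unsigned int len)
{
	strncat(log_buf, " domU:", sizeof(log_buf) - strlen(log_buf) - 1);
	strncat(log_buf, s, len);
}

static int xen_initial_domain(void)
{
	return 0;	/* pretend we are a domU for the demo */
}

/* The shared body both console callbacks would call. */
static void xenboot_common_write(const char *string, unsigned int len)
{
	dom0_write_console(0, string, len);

	if (xen_initial_domain())
		return;

	domU_write_early_console(string, len);
}
```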


-boris


+
+static int __init xenboot_earlycon_setup(struct earlycon_device *device,
+   const char *opt)
+{
+   device->con->write = xenboot_earlycon_write;
+   return 0;
+}
+EARLYCON_DECLARE(xenboot, xenboot_earlycon_setup);




Re: [Xen-devel] [PATCH] xen/x86: Zero out .bss for PV guests

2016-02-24 Thread Boris Ostrovsky

On 02/24/2016 09:15 AM, Andrew Cooper wrote:

On 24/02/16 14:12, David Vrabel wrote:

On 22/02/16 22:06, Boris Ostrovsky wrote:

Baremetal kernels clear .bss early in the boot. Since Xen PV guests don't
execute that early code they should do it too.

(Since we introduce macros for specifying 32- and 64-bit registers we
can get rid of ifdefs in startup_xen())

.bss must have been cleared for PV guests otherwise they would be
horribly broken.  What was the method and why is it no longer sufficient?


I couldn't find this being done anywhere, hence this patch.


The domain builder hands out zeroed pages.  I don't believe we guarantee
that the guests RAM is clean, but it is in practice.


OK, that's what I suspected but didn't actually look.

I, in fact, wonder whether this should go to stable trees as well.

-boris


Re: [PATCH v2] xen/x86: Zero out .bss for PV guests

2016-02-24 Thread Boris Ostrovsky

On 02/24/2016 11:05 AM, Brian Gerst wrote:


Use the macros in  instead of defining your own.  Also,
xorl %eax,%eax is good for 64-bit too, since the upper bits are
cleared.


I suspected this would have to be defined somewhere but couldn't find 
it. Thanks!


-boris



Re: [Xen-devel] [PATCH v2] xen/x86: Zero out .bss for PV guests

2016-02-24 Thread Boris Ostrovsky

On 02/24/2016 12:26 PM, Andrew Cooper wrote:

On 24/02/16 15:19, Boris Ostrovsky wrote:

+   /* Clear .bss */
+   xor REG(ax),REG(ax)

If we are nitpicking,

This should be xor %eax, %eax even in 64bit.  Functionally identical,
and shorter to encode.


Right, Brian Gerst pointed this out too in another message.

-boris



Re: [Xen-devel] [PATCH] xen/x86: Zero out .bss for PV guests

2016-02-24 Thread Boris Ostrovsky

On 02/24/2016 11:22 AM, Luis R. Rodriguez wrote:

On Wed, Feb 24, 2016 at 6:58 AM, David Vrabel  wrote:

Yes.  Can you respin with a commit message explaining?  (Or just provide
the message here and I'll fix it up).

Is there no way to re-use somehow the clear_bss() from bare metal?


xen_start_info lives in .bss and it is initialized in startup assembly code.

We could move it to .data but I'd still want to do this first thing in 
startup_xen(). We may add more code there later that touches something 
in .bss and with delayed section clearing we may blow away that data.


-boris


This uses a section range:

/* Don't add a printk in there. printk relies on the PDA which is not
   initialized yet. */
static void __init clear_bss(void)
{
 memset(__bss_start, 0,
(unsigned long) __bss_stop - (unsigned long) __bss_start);
}

Perhaps the section range might be different for PV guests? Or can
this simply not work even if one added a guest bss section size, or
would it be too late for PV guests, ie we need to do it in asm on PV
guests as you did? If so why.

  Luis




Re: [Xen-devel] [PATCH 8/9] x86/rtc: replace paravirt_enabled() check with subarch check

2016-02-22 Thread Boris Ostrovsky

On 02/22/2016 05:27 AM, Borislav Petkov wrote:

On Mon, Feb 22, 2016 at 07:07:56AM +0100, Luis R. Rodriguez wrote:

diff --git a/arch/x86/include/asm/x86_init.h b/arch/x86/include/asm/x86_init.h
index 1ae89a2721d6..fe0d579b63e3 100644
--- a/arch/x86/include/asm/x86_init.h
+++ b/arch/x86/include/asm/x86_init.h
@@ -84,11 +84,14 @@ struct x86_init_paging {
   *boot cpu
  * @timer_init:   initialize the platform timer (default PIT/HPET)
   * @wallclock_init:   init the wallclock device
+ * @no_cmos_rtc:   set when platform has no CMOS real-time clock
+ * present
   */
  struct x86_init_timers {
void (*setup_percpu_clockev)(void);
void (*timer_init)(void);
void (*wallclock_init)(void);
+   bool no_cmos_rtc;

I'd add

u64 flags;

to x86_init_ops and then set X86_PLATFORM_NO_RTC or so in there. The
reason being, others could use that flags field too, for other stuff and
define more bits.


Maybe timer_flags or platform_flags (or something else) to be a little 
more cscope-friendly?


-boris



Re: [PATCH 5/9] apm32: remove paravirt_enabled() use

2016-02-22 Thread Boris Ostrovsky

On 02/19/2016 07:42 PM, Luis R. Rodriguez wrote:

On Fri, Feb 19, 2016 at 05:17:27PM -0500, Boris Ostrovsky wrote:


On 02/19/2016 03:58 PM, Luis R. Rodriguez wrote:

On Fri, Feb 19, 2016 at 10:08:43AM -0500, Boris Ostrovsky wrote:


in xen_start_kernel(). Better yet, clear whole .bss.

(This applies to the next patch as well).

So clear_bss() -- oh look, another call that xen_start_kernel() could have made
good use of. :)  Can you send a respective patch I can old into this series?
I'm afraid it is by no means obvious to me where it would be safe to do this on
xen_start_kernel().

OK, I'll test it next week, it really should be done irrespective of
what you are doing here.

OK so it can go in separately?



Yes, I'll send it as a separate patch.

-boris


[PATCH] xen/x86: Zero out .bss for PV guests

2016-02-22 Thread Boris Ostrovsky
Baremetal kernels clear .bss early in the boot. Since Xen PV guests don't
execute that early code they should do it too.

(Since we introduce macros for specifying 32- and 64-bit registers we
can get rid of ifdefs in startup_xen())

Signed-off-by: Boris Ostrovsky <boris.ostrov...@oracle.com>
---
 arch/x86/xen/xen-head.S | 29 ++---
 1 file changed, 22 insertions(+), 7 deletions(-)

diff --git a/arch/x86/xen/xen-head.S b/arch/x86/xen/xen-head.S
index b65f59a..2af87d1 100644
--- a/arch/x86/xen/xen-head.S
+++ b/arch/x86/xen/xen-head.S
@@ -35,16 +35,31 @@
 #define PVH_FEATURES (0)
 #endif
 
-   __INIT
-ENTRY(startup_xen)
-   cld
 #ifdef CONFIG_X86_32
-   mov %esi,xen_start_info
-   mov $init_thread_union+THREAD_SIZE,%esp
+#define REG(register)  %e##register
+#define WSIZE_SHIFT2
+#define STOS   stosl
 #else
-   mov %rsi,xen_start_info
-   mov $init_thread_union+THREAD_SIZE,%rsp
+#define REG(register)  %r##register
+#define WSIZE_SHIFT3
+#define STOS   stosq
 #endif
+
+   __INIT
+ENTRY(startup_xen)
+   cld
+
+   /* Clear .bss */
+   xor REG(ax),REG(ax)
+   mov $__bss_start,REG(di)
+   mov $__bss_stop,REG(cx)
+   sub REG(di),REG(cx)
+   shr $WSIZE_SHIFT,REG(cx)
+   rep STOS
+
+   mov REG(si),xen_start_info
+   mov $init_thread_union+THREAD_SIZE,REG(sp)
+
jmp xen_start_kernel
 
__FINIT
-- 
2.1.0



Re: [PATCH v2 1/3] hvc_xen: add earlycon support

2016-02-25 Thread Boris Ostrovsky

On 02/25/2016 07:10 AM, Stefano Stabellini wrote:

Introduce EARLYCON support in hvc_xen, useful for early debugging on arm
and arm64, where xen early_printk is not available.

It is different from xenboot_write_console on x86 in two ways:

- it does not return if !xen_pv_domain(), not only because ARM guests
   are xen_hvm_domain(), but also because we want to capture all the
   early boot messages, before xen support is discovered
- it does not try to print to the domU console at all, because xen
   support will only be discovered at a later point

Signed-off-by: Stefano Stabellini <stefano.stabell...@eu.citrix.com>


Reviewed-by: Boris Ostrovsky <boris.ostrov...@oracle.com>




Re: [PATCH v2 2/3] hvc_xen: fix xenboot for DomUs

2016-02-25 Thread Boris Ostrovsky

On 02/25/2016 07:10 AM, Stefano Stabellini wrote:

The xenboot early console has been partially broken for DomU for a long
time: the output would only go to the hypervisor via hypercall
(HYPERVISOR_console_io), while it wouldn't actually go to the DomU
console. The reason is that domU_write_console would return early as no
xencons structs are configured for it.

Add an appropriate xencons struct for xenboot from the xenboot setup
callback.

Signed-off-by: Stefano Stabellini <stefano.stabell...@eu.citrix.com>

---
Changes in v2:
- add return to xenboot_setup_console
---
  drivers/tty/hvc/hvc_xen.c |   32 ++--
  1 file changed, 26 insertions(+), 6 deletions(-)

diff --git a/drivers/tty/hvc/hvc_xen.c b/drivers/tty/hvc/hvc_xen.c
index 68b8ec8..bf787aa 100644
--- a/drivers/tty/hvc/hvc_xen.c
+++ b/drivers/tty/hvc/hvc_xen.c
@@ -246,6 +246,18 @@ err:
return -ENODEV;
  }
  
+static int xen_early_pv_console_init(struct xencons_info *info, int vtermno)


Nit: not sure whether "early" is the right word to use here: when called 
from xen_pv_console_init() it's not really on the early path.


Other than that:
Reviewed-by: Boris Ostrovsky <boris.ostrov...@oracle.com>


+{
+   info->evtchn = xen_start_info->console.domU.evtchn;
+   /* GFN == MFN for PV guest */
+   info->intf = gfn_to_virt(xen_start_info->console.domU.mfn);
+   info->vtermno = vtermno;
+
+   list_add_tail(&info->list, &xenconsoles);
+
+   return 0;
+}
+
  static int xen_pv_console_init(void)
  {
struct xencons_info *info;
@@ -265,13 +277,8 @@ static int xen_pv_console_init(void)
/* already configured */
return 0;
}
-   info->evtchn = xen_start_info->console.domU.evtchn;
-   /* GFN == MFN for PV guest */
-   info->intf = gfn_to_virt(xen_start_info->console.domU.mfn);
-   info->vtermno = HVC_COOKIE;
-
	spin_lock(&xencons_lock);
-   list_add_tail(&info->list, &xenconsoles);
+   xen_early_pv_console_init(info, HVC_COOKIE);
	spin_unlock(&xencons_lock);
  
  	return 0;

@@ -599,6 +606,18 @@ static int xen_cons_init(void)
  console_initcall(xen_cons_init);
  
  #ifdef CONFIG_EARLY_PRINTK

+static int __init xenboot_setup_console(struct console *console, char *string)
+{
+   static struct xencons_info xenboot;
+
+   if (xen_initial_domain())
+   return 0;
+   if (!xen_pv_domain())
+   return -ENODEV;
+
+   return xen_early_pv_console_init(&xenboot, 0);
+}
+
  static void xenboot_write_console(struct console *console, const char *string,
  unsigned len)
  {
@@ -629,6 +648,7 @@ static void xenboot_write_console(struct console *console, const char *string,
  struct console xenboot_console = {
.name   = "xenboot",
.write  = xenboot_write_console,
+   .setup  = xenboot_setup_console,
.flags  = CON_PRINTBUFFER | CON_BOOT | CON_ANYTIME,
.index  = -1,
  };




Re: [PATCH v2 3/3] hvc_xen: make early_printk work with HVM guests

2016-02-25 Thread Boris Ostrovsky

On 02/25/2016 07:10 AM, Stefano Stabellini wrote:

Refactor the existing code in xen_raw_console_write to get the generic
early_printk console work with HVM guests.

Take the opportunity to replace the outb loop with a single outsb call
to reduce the number of vmexit.

Signed-off-by: Stefano Stabellini <stefano.stabell...@eu.citrix.com>


Reviewed-by: Boris Ostrovsky <boris.ostrov...@oracle.com>




[PATCH v3 2/2] xen/x86: Drop mode-selecting ifdefs in startup_xen()

2016-02-25 Thread Boris Ostrovsky
Use asm/asm.h macros instead.

Signed-off-by: Boris Ostrovsky <boris.ostrov...@oracle.com>
---
 arch/x86/xen/xen-head.S | 10 +++---
 1 file changed, 3 insertions(+), 7 deletions(-)

diff --git a/arch/x86/xen/xen-head.S b/arch/x86/xen/xen-head.S
index 11dbe49..2366895 100644
--- a/arch/x86/xen/xen-head.S
+++ b/arch/x86/xen/xen-head.S
@@ -52,13 +52,9 @@ ENTRY(startup_xen)
shr $WSIZE_SHIFT, %__ASM_REG(cx)
rep __ASM_SIZE(stos)
 
-#ifdef CONFIG_X86_32
-   mov %esi,xen_start_info
-   mov $init_thread_union+THREAD_SIZE,%esp
-#else
-   mov %rsi,xen_start_info
-   mov $init_thread_union+THREAD_SIZE,%rsp
-#endif
+   mov %__ASM_REG(si), xen_start_info
+   mov $init_thread_union+THREAD_SIZE, %__ASM_REG(sp)
+
jmp xen_start_kernel
 
__FINIT
-- 
2.1.0



[PATCH v3 0/2] Clear .bss for VP guests

2016-02-25 Thread Boris Ostrovsky
PV guests need to have their .bss zeroed out since it is not guaranteed
to be cleared by Xen's domain builder

v3:
* Use existing macros for selecting mode-appropriate instructions/registers
* Use 32-bit registers in XOR
* Split removal of mode-selecting ifdefs into a separate patch

Boris Ostrovsky (2):
  xen/x86: Zero out .bss for PV guests
  xen/x86: Drop mode-selecting ifdefs in startup_xen()

 arch/x86/xen/xen-head.S | 18 ++
 1 file changed, 14 insertions(+), 4 deletions(-)

-- 
2.1.0



[PATCH v3 1/2] xen/x86: Zero out .bss for PV guests

2016-02-25 Thread Boris Ostrovsky
Baremetal kernels clear .bss early in the boot but Xen PV guests don't
execute that code. They have been able to run without problems because
Xen domain builder happens to give out zeroed pages. However, since this
is not really guaranteed, .bss should be explicitly cleared.

Signed-off-by: Boris Ostrovsky <boris.ostrov...@oracle.com>
Cc: sta...@vger.kernel.org
---
 arch/x86/xen/xen-head.S | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/arch/x86/xen/xen-head.S b/arch/x86/xen/xen-head.S
index b65f59a..11dbe49 100644
--- a/arch/x86/xen/xen-head.S
+++ b/arch/x86/xen/xen-head.S
@@ -38,6 +38,20 @@
__INIT
 ENTRY(startup_xen)
cld
+
+#ifdef CONFIG_X86_32
+#define WSIZE_SHIFT2
+#else
+#define WSIZE_SHIFT3
+#endif
+   /* Clear .bss */
+   xor %eax,%eax
+   mov $__bss_start, %__ASM_REG(di)
+   mov $__bss_stop, %__ASM_REG(cx)
+   sub %__ASM_REG(di), %__ASM_REG(cx)
+   shr $WSIZE_SHIFT, %__ASM_REG(cx)
+   rep __ASM_SIZE(stos)
+
 #ifdef CONFIG_X86_32
mov %esi,xen_start_info
mov $init_thread_union+THREAD_SIZE,%esp
-- 
2.1.0



Re: [Xen-devel] [PATCH v3 0/2] Clear .bss for VP guests

2016-02-26 Thread Boris Ostrovsky

On 02/26/2016 09:42 AM, Brian Gerst wrote:

On Fri, Feb 26, 2016 at 8:51 AM, Boris Ostrovsky
<boris.ostrov...@oracle.com> wrote:

On 02/26/2016 05:53 AM, Roger Pau Monné wrote:

El 25/2/16 a les 16:16, Boris Ostrovsky ha escrit:

PV guests need to have their .bss zeroed out since it is not guaranteed
to be cleared by Xen's domain builder

I guess I'm missing something, but elf_load_image (in libelf-loader.c)
seems to be able to clear segments (it will zero the memory between
p_paddr + p_filesz and p_paddr + p_memsz) while loading the ELF into
memory, so if the program headers are correctly setup the .bss should be
zeroed out AFAICT.


Right, but I don't think this is guaranteed. It's uninitialized data so in
principle it can be anything.

The ELF spec says "the system initializes the data with zero when the
program begins to run" which I read as it's up to runtime and not the loader
to do so.

And since kernel does it explicitly on baremetal path I think it's a good
idea for PV to do the same.

It does it on bare metal because bzImage is a raw binary image, not ELF.


OK, I didn't think about this.

But nevertheless, is it guaranteed that .bss is cleared by the loader? 
My reading of the spec is that it's not.



-boris



Re: [PATCH v2 2/3] hvc_xen: fix xenboot for DomUs

2016-02-26 Thread Boris Ostrovsky

On 02/26/2016 10:39 AM, Stefano Stabellini wrote:

On Thu, 25 Feb 2016, Boris Ostrovsky wrote:

On 02/25/2016 07:10 AM, Stefano Stabellini wrote:

The xenboot early console has been partially broken for DomU for a long
time: the output would only go to the hypervisor via hypercall
(HYPERVISOR_console_io), while it wouldn't actually go to the DomU
console. The reason is that domU_write_console would return early as no
xencons structs are configured for it.

Add an appropriate xencons struct for xenboot from the xenboot setup
callback.

Signed-off-by: Stefano Stabellini <stefano.stabell...@eu.citrix.com>

---
Changes in v2:
- add return to xenboot_setup_console
---
   drivers/tty/hvc/hvc_xen.c |   32 ++--
   1 file changed, 26 insertions(+), 6 deletions(-)

diff --git a/drivers/tty/hvc/hvc_xen.c b/drivers/tty/hvc/hvc_xen.c
index 68b8ec8..bf787aa 100644
--- a/drivers/tty/hvc/hvc_xen.c
+++ b/drivers/tty/hvc/hvc_xen.c
@@ -246,6 +246,18 @@ err:
return -ENODEV;
   }
+static int xen_early_pv_console_init(struct xencons_info *info, int vtermno)

Nit: not sure whether "early" is the right word to use here: when called from
xen_pv_console_init() it's not really on the early path.

What about xencons_info_pv_init?


Sure, that sounds good.

(BTW, I didn't mean that my R-b was conditional on this)


-boris





Other than that:
Reviewed-by: Boris Ostrovsky <boris.ostrov...@oracle.com>

Thanks!



+{
+   info->evtchn = xen_start_info->console.domU.evtchn;
+   /* GFN == MFN for PV guest */
+   info->intf = gfn_to_virt(xen_start_info->console.domU.mfn);
+   info->vtermno = vtermno;
+
+   list_add_tail(&info->list, &xenconsoles);
+
+   return 0;
+}
+
   static int xen_pv_console_init(void)
   {
struct xencons_info *info;
@@ -265,13 +277,8 @@ static int xen_pv_console_init(void)
/* already configured */
return 0;
}
-   info->evtchn = xen_start_info->console.domU.evtchn;
-   /* GFN == MFN for PV guest */
-   info->intf = gfn_to_virt(xen_start_info->console.domU.mfn);
-   info->vtermno = HVC_COOKIE;
-
	spin_lock(&xencons_lock);
-   list_add_tail(&info->list, &xenconsoles);
+   xen_early_pv_console_init(info, HVC_COOKIE);
	spin_unlock(&xencons_lock);
return 0;
@@ -599,6 +606,18 @@ static int xen_cons_init(void)
   console_initcall(xen_cons_init);
 #ifdef CONFIG_EARLY_PRINTK
+static int __init xenboot_setup_console(struct console *console, char *string)
+{
+   static struct xencons_info xenboot;
+
+   if (xen_initial_domain())
+   return 0;
+   if (!xen_pv_domain())
+   return -ENODEV;
+
+   return xen_early_pv_console_init(&xenboot, 0);
+}
+
   static void xenboot_write_console(struct console *console, const char *string,
  unsigned len)
   {
@@ -629,6 +648,7 @@ static void xenboot_write_console(struct console *console, const char *string,
   struct console xenboot_console = {
.name   = "xenboot",
.write  = xenboot_write_console,
+   .setup  = xenboot_setup_console,
.flags  = CON_PRINTBUFFER | CON_BOOT | CON_ANYTIME,
.index  = -1,
   };




Re: [Xen-devel] [PATCH v3 0/2] Clear .bss for VP guests

2016-02-26 Thread Boris Ostrovsky

On 02/26/2016 10:26 AM, David Vrabel wrote:


The tools support loading bzImages, not just ELF images.


We load them as ELF images though, I believe, after uncompressing. E.g.:

static int xc_dom_load_bzimage_kernel(struct xc_dom_image *dom)
{
return elf_loader.loader(dom);
}


-boris


Re: [Xen-devel] [PATCH v3 0/2] Clear .bss for VP guests

2016-02-26 Thread Boris Ostrovsky

On 02/26/2016 10:22 AM, Roger Pau Monné wrote:

El 26/2/16 a les 16:10, Boris Ostrovsky ha escrit:

On 02/26/2016 09:42 AM, Brian Gerst wrote:

On Fri, Feb 26, 2016 at 8:51 AM, Boris Ostrovsky
<boris.ostrov...@oracle.com> wrote:

On 02/26/2016 05:53 AM, Roger Pau Monné wrote:

El 25/2/16 a les 16:16, Boris Ostrovsky ha escrit:

PV guests need to have their .bss zeroed out since it is not
guaranteed
to be cleared by Xen's domain builder

I guess I'm missing something, but elf_load_image (in libelf-loader.c)
seems to be able to clear segments (it will zero the memory between
p_paddr + p_filesz and p_paddr + p_memsz) while loading the ELF into
memory, so if the program headers are correctly setup the .bss
should be
zeroed out AFAICT.

Right, but I don't think this is guaranteed. It's uninitialized data
so in
principle it can be anything.

The ELF spec says "the system initializes the data with zero when the
program begins to run" which I read as it's up to runtime and not the
loader
to do so.

And since kernel does it explicitly on baremetal path I think it's a
good
idea for PV to do the same.

It does it on bare metal because bzImage is a raw binary image, not ELF.

OK, I didn't think about this.

But nevertheless, is it guaranteed that .bss is cleared by the loader?
My reading of the spec is that it's not.

I think this is very blur in general. The copy of the spec I have says:

"the system initializes the data with zeros when the program begins to run"

What is "the system" here, Xen or the guest kernel?

Just to be clear, I'm not opposing to this change in any way, but the
message in patch 1/2 needs to be fixed:

"They have been able to run without problems because Xen domain builder
happens to give out zeroed pages."

This is wrong IMHO, .bss is not cleared because we are using zeroed
pages, but because elf_load_image explicitly zeroes the space between
p_filesz and p_memsz in ELF program headers (which is were .bss resides
on properly arranged ELF binaries) when loading them.
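The zeroing Roger describes can be sketched in a few lines: an ELF loader copies p_filesz bytes from the file image and clears the remainder up to p_memsz, which is exactly the region .bss occupies in a well-formed binary. This is an illustrative userspace sketch, not the actual libelf-loader.c code.

```c
/*
 * Illustrative sketch (not the libelf code itself): how an ELF loader
 * zeroes the gap between p_filesz and p_memsz when placing a PT_LOAD
 * segment -- the region where .bss normally lives.
 */
#include <stdint.h>
#include <string.h>

struct phdr_sketch {
	uint64_t p_filesz;	/* bytes present in the file image */
	uint64_t p_memsz;	/* bytes occupied in memory (>= p_filesz) */
};

static void load_segment(uint8_t *dest, const uint8_t *src,
			 const struct phdr_sketch *ph)
{
	memcpy(dest, src, ph->p_filesz);	/* file-backed data */
	memset(dest + ph->p_filesz, 0,		/* zero-fill: .bss  */
	       ph->p_memsz - ph->p_filesz);
}
```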


That's what I meant --- that the builder/loader gives out zeroed pages, 
not that Xen's allocator clears them in general. I'll update the commit 
message.




I'm quite sure NetBSD also relies on this, so I would say it's
intrinsically part of the Xen boot ABI now, and this change just adds
seatbelts to Linux.


Maybe NetBSD should drive carefully then ;-)

-boris


Re: [PATCH, RESEND] xen: allocate gntdev_copy_batch dynamically

2016-02-26 Thread Boris Ostrovsky

On 02/25/2016 04:25 PM, Arnd Bergmann wrote:

struct gntdev_copy_batch is arguably too large to fit on the kernel stack,
and we get a warning about the stack usage in gntdev_ioctl_grant_copy:

drivers/xen/gntdev.c:949:1: error: the frame size of 1240 bytes is larger than 1024 bytes

This changes the code to use a dynamic allocation instead.

Signed-off-by: Arnd Bergmann 
Fixes: a4cdb556cae0 ("xen/gntdev: add ioctl for grant copy")
---
  drivers/xen/gntdev.c | 17 ++---
  1 file changed, 10 insertions(+), 7 deletions(-)

I sent this in January, Boris sent an almost identical patch
as http://www.gossamer-threads.com/lists/xen/devel/414056
but the bug remains present in mainline and linux-next as of
Feb 25.

Could you apply one of the patches before the bug makes it
into v4.5?


David wanted to shrink the structure size instead:
http://www.gossamer-threads.com/lists/xen/devel/414535

-boris
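The pattern at issue is generic: a struct larger than roughly 1 KiB on the kernel stack triggers the frame-size warning, and the proposed fix moves it to the heap. A self-contained userspace analogue of the after-the-patch shape follows; calloc() stands in for kzalloc(GFP_KERNEL), and all names here are hypothetical, not the mainline diff.

```c
/*
 * Userspace analogue of the heap-allocation fix under discussion: the
 * ~1240-byte batch no longer lives in the caller's stack frame.
 */
#include <stdlib.h>

struct copy_batch_sketch {
	char scratch[1240];	/* roughly the frame size from the warning */
};

static int do_grant_copy_sketch(void)
{
	struct copy_batch_sketch *batch;
	int ret;

	batch = calloc(1, sizeof(*batch));	/* kzalloc() in the kernel */
	if (!batch)
		return -1;	/* would be -ENOMEM in the kernel */

	batch->scratch[0] = 1;	/* ... perform the copy using *batch ... */
	ret = batch->scratch[0];

	free(batch);
	return ret;
}
```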



Re: [Xen-devel] [PATCH v3 0/2] Clear .bss for VP guests

2016-02-26 Thread Boris Ostrovsky

On 02/26/2016 05:53 AM, Roger Pau Monné wrote:

El 25/2/16 a les 16:16, Boris Ostrovsky ha escrit:

PV guests need to have their .bss zeroed out since it is not guaranteed
to be cleared by Xen's domain builder

I guess I'm missing something, but elf_load_image (in libelf-loader.c)
seems to be able to clear segments (it will zero the memory between
p_paddr + p_filesz and p_paddr + p_memsz) while loading the ELF into
memory, so if the program headers are correctly setup the .bss should be
zeroed out AFAICT.


Right, but I don't think this is guaranteed. It's uninitialized data so 
in principle it can be anything.


The ELF spec says "the system initializes the data with zero when the 
program begins to run" which I read as it's up to runtime and not the 
loader to do so.


And since kernel does it explicitly on baremetal path I think it's a 
good idea for PV to do the same.


-boris


[PATCH v4 2/2] xen/x86: Drop mode-selecting ifdefs in startup_xen()

2016-02-26 Thread Boris Ostrovsky
Use asm/asm.h macros instead.

Signed-off-by: Boris Ostrovsky <boris.ostrov...@oracle.com>
---
 arch/x86/xen/xen-head.S | 10 +++---
 1 file changed, 3 insertions(+), 7 deletions(-)

diff --git a/arch/x86/xen/xen-head.S b/arch/x86/xen/xen-head.S
index 5c63d2d..de93b20 100644
--- a/arch/x86/xen/xen-head.S
+++ b/arch/x86/xen/xen-head.S
@@ -47,13 +47,9 @@ ENTRY(startup_xen)
shr $__ASM_SEL(2, 3), %_ASM_CX
rep __ASM_SIZE(stos)
 
-#ifdef CONFIG_X86_32
-   mov %esi,xen_start_info
-   mov $init_thread_union+THREAD_SIZE,%esp
-#else
-   mov %rsi,xen_start_info
-   mov $init_thread_union+THREAD_SIZE,%rsp
-#endif
+   mov %_ASM_SI, xen_start_info
+   mov $init_thread_union+THREAD_SIZE, %_ASM_SP
+
jmp xen_start_kernel
 
__FINIT
-- 
2.1.0



[PATCH v4 0/2] Clear .bss for PV guests

2016-02-26 Thread Boris Ostrovsky
PV guests need to have their .bss zeroed out since it is not guaranteed
to be cleared by Xen's domain builder (even if it is done so currently)

v4:
* Better use of mode-selecting macros.
* Drop stable tag.
* Update commit message for patch 1.

Boris Ostrovsky (2):
  xen/x86: Zero out .bss for PV guests
  xen/x86: Drop mode-selecting ifdefs in startup_xen()

 arch/x86/xen/xen-head.S | 19 ---
 1 file changed, 12 insertions(+), 7 deletions(-)

-- 
2.1.0



[PATCH v4 1/2] xen/x86: Zero out .bss for PV guests

2016-02-26 Thread Boris Ostrovsky
ELF spec is unclear about whether .bss must be cleared by the loader.
Currently the domain builder does it when loading the guest but because
it is not (or rather may not be) guaranteed we should zero it out
explicitly.

Signed-off-by: Boris Ostrovsky <boris.ostrov...@oracle.com>
---
 arch/x86/xen/xen-head.S | 9 +
 1 file changed, 9 insertions(+)

diff --git a/arch/x86/xen/xen-head.S b/arch/x86/xen/xen-head.S
index b65f59a..5c63d2d 100644
--- a/arch/x86/xen/xen-head.S
+++ b/arch/x86/xen/xen-head.S
@@ -38,6 +38,15 @@
__INIT
 ENTRY(startup_xen)
cld
+
+   /* Clear .bss */
+   xor %eax,%eax
+   mov $__bss_start, %_ASM_DI
+   mov $__bss_stop, %_ASM_CX
+   sub %_ASM_DI, %_ASM_CX
+   shr $__ASM_SEL(2, 3), %_ASM_CX
+   rep __ASM_SIZE(stos)
+
 #ifdef CONFIG_X86_32
mov %esi,xen_start_info
mov $init_thread_union+THREAD_SIZE,%esp
-- 
2.1.0



Re: [RFC v1 4/8] x86/init: add linker table support

2016-01-21 Thread Boris Ostrovsky

On 01/21/2016 03:38 AM, Roger Pau Monné wrote:

El 20/01/16 a les 22.33, Luis R. Rodriguez ha escrit:

On Wed, Jan 20, 2016 at 1:00 PM, Konrad Rzeszutek Wilk wrote:

+static bool x86_init_fn_supports_subarch(struct x86_init_fn *fn)
+{
+ if (!fn->supp_hardware_subarch) {
+		pr_err("Init sequence fails to declare any supported subarchs: %pF\n", fn->early_init);
+ WARN_ON(1);
+ }
+ if (BIT(boot_params.hdr.hardware_subarch) & fn->supp_hardware_subarch)
+ return true;
+ return false;
+}

So the logic for this working is that boot_params.hdr.hardware_subarch

And the macros define two: BIT(X86_SUBARCH_PC) or BIT(X86_SUBARCH_XEN).

But hardware_subarch by default is set to zero. Which means if GRUB2, PXELinux, 
or Xen multiboot1 don't set it - then X86_SUBARCH_PC is chosen, right?

  1 << 0 & 1 << X86_SUBARCH_PC (which is zero).

For this to nicely work with Xen it ought to do this:

diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
index 993b7a7..6cf9afd 100644
--- a/arch/x86/xen/enlighten.c
+++ b/arch/x86/xen/enlighten.c
@@ -1676,6 +1676,7 @@ asmlinkage __visible void __init xen_start_kernel(void)
 boot_params.hdr.ramdisk_image = initrd_start;
 boot_params.hdr.ramdisk_size = xen_start_info->mod_len;
 boot_params.hdr.cmd_line_ptr = __pa(xen_start_info->cmd_line);
+   boot_params.hdr.hardware_subarch = X86_SUBARCH_XEN;

 if (!xen_initial_domain()) {
 add_preferred_console("xenboot", 0, NULL);


?

That's correct for PV and PVH, likewise when qemu is required for HVM
qemu could set it. I have the qemu change done but that should only
cover HVM. A common place to set this as well could be the hypervisor,
but currently the hypervisor doesn't set any boot_params, instead a
generic struct is passed and the kernel code (for any OS) is expected
to interpret this and then set the required values for the OS in the
init path. Long term though if we wanted to merge init further one way
could be to have the hypervisor just set the zero page cleanly for the
different modes. If we needed more data other than the
hardware_subarch we also have the hardware_subarch_data, that's a u64,
and how that is used would be up to the subarch. In Xen's case it
could do what it wants with it. That would still mean perhaps defining
as part of a Xen boot protocol a place where xen specific code can
count on finding more Xen data passed by the hypervisor, the
xen_start_info. That is, if we wanted to merge init paths this is
something to consider.

One thing I considered on the question of who should set the zero page
for Xen with the prospect of merging inits, or at least this subarch
for both short term and long term are the obvious implications in
terms of hypervisor / kernel / qemu combination requirements if the
subarch is needed. Having it set in the kernel is an obvious immediate
choice for PV / PVH but it means we can't merge init paths completely
(down to asm inits), we'd still be able to merge some C init paths
though, the first entry would still be different. Having the zero page
set on the hypervisor would go long ways but it would mean a
hypervisor change required.

I don't think the hypervisor should be setting Linux specific boot
related parameters, the boot ABI should be OS agnostic. IMHO, a small
shim should be added to Linux in order to set what Linux requires when
entering from a Xen entry point.


And that's exactly what HVMlite does. Most of this shim layer is setting 
up boot_params, after which we jump to standard x86 boot path (i.e. 
startup_{32|64}). With hardware_subarch set to zero.


-boris



[PATCH v1 10/12] xen/hvmlite: Use x86's default timer init for HVMlite guests

2016-01-22 Thread Boris Ostrovsky
xen_timer_init() will be called from apic_bsp_setup().

Signed-off-by: Boris Ostrovsky <boris.ostrov...@oracle.com>
---
 arch/x86/xen/time.c |5 -
 1 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
index a0a4e55..93745e7 100644
--- a/arch/x86/xen/time.c
+++ b/arch/x86/xen/time.c
@@ -439,7 +439,10 @@ void __init xen_init_time_ops(void)
 {
pv_time_ops = xen_time_ops;
 
-   x86_init.timers.timer_init = xen_time_init;
+   if (!xen_hvmlite)
+   x86_init.timers.timer_init = xen_time_init;
+   else
+   x86_init.timers.timer_init = x86_init_noop;
x86_init.timers.setup_percpu_clockev = x86_init_noop;
x86_cpuinit.setup_percpu_clockev = x86_init_noop;
 
-- 
1.7.1



[PATCH v1 08/12] xen/hvmlite: Initialize context for secondary VCPUs

2016-01-22 Thread Boris Ostrovsky
Signed-off-by: Boris Ostrovsky <boris.ostrov...@oracle.com>
---
 arch/x86/xen/smp.c |   57 
 arch/x86/xen/smp.h |4 +++
 arch/x86/xen/xen-hvmlite.S |7 +
 3 files changed, 63 insertions(+), 5 deletions(-)

diff --git a/arch/x86/xen/smp.c b/arch/x86/xen/smp.c
index 5fc4afb..b265c4f 100644
--- a/arch/x86/xen/smp.c
+++ b/arch/x86/xen/smp.c
@@ -27,6 +27,7 @@
 #include 
 #include 
 #include 
+#include <xen/interface/hvm/hvm_vcpu.h>
 
 #include 
 #include 
@@ -384,6 +385,7 @@ cpu_initialize_context(unsigned int cpu, struct task_struct *idle)
struct vcpu_guest_context *ctxt;
struct desc_struct *gdt;
unsigned long gdt_mfn;
+   void *ctxt_arg;
 
/* used to tell cpu_init() that it can proceed with initialization */
cpumask_set_cpu(cpu, cpu_callout_mask);
@@ -392,7 +394,7 @@ cpu_initialize_context(unsigned int cpu, struct task_struct *idle)
 
if (!xen_hvmlite) {
 
-   ctxt = kzalloc(sizeof(*ctxt), GFP_KERNEL);
+   ctxt_arg = ctxt = kzalloc(sizeof(*ctxt), GFP_KERNEL);
if (ctxt == NULL)
return -ENOMEM;
 
@@ -460,14 +462,59 @@ cpu_initialize_context(unsigned int cpu, struct task_struct *idle)
ctxt->user_regs.esp = idle->thread.sp0 - sizeof(struct pt_regs);
ctxt->ctrlreg[3] = xen_pfn_to_cr3(virt_to_gfn(swapper_pg_dir));
} else {
-   ctxt = NULL; /* To quiet down compiler */
-   BUG();
+#ifdef CONFIG_XEN_PVHVM
+   struct vcpu_hvm_context *hctxt;
+
+   ctxt_arg = hctxt = kzalloc(sizeof(*hctxt), GFP_KERNEL);
+   if (hctxt == NULL)
+   return -ENOMEM;
+
+#ifdef CONFIG_X86_64
+   hctxt->mode = VCPU_HVM_MODE_64B;
+   hctxt->cpu_regs.x86_64.rip =
+   (unsigned long)secondary_startup_64;
+   hctxt->cpu_regs.x86_64.rsp = stack_start;
+
+   hctxt->cpu_regs.x86_64.cr0 =
+   X86_CR0_PG | X86_CR0_WP | X86_CR0_PE;
+   hctxt->cpu_regs.x86_64.cr4 = X86_CR4_PAE;
+   hctxt->cpu_regs.x86_64.cr3 =
+   xen_pfn_to_cr3(virt_to_mfn(init_level4_pgt));
+   hctxt->cpu_regs.x86_64.efer = EFER_LME | EFER_NX;
+#else
+   hctxt->mode = VCPU_HVM_MODE_32B;
+   /*
+* startup_32_smp expects GDT loaded so we can't jump
+* there directly.
+*/
+   hctxt->cpu_regs.x86_32.eip =
+   (unsigned long)hvmlite_smp_32 - __START_KERNEL_map;
+
+   hctxt->cpu_regs.x86_32.cr0 = X86_CR0_PE;
+
+   hctxt->cpu_regs.x86_32.cs_base = 0;
+   hctxt->cpu_regs.x86_32.cs_limit = ~0u;
+   hctxt->cpu_regs.x86_32.cs_ar = 0xc9b;
+   hctxt->cpu_regs.x86_32.ds_base = 0;
+   hctxt->cpu_regs.x86_32.ds_limit = ~0u;
+   hctxt->cpu_regs.x86_32.ds_ar = 0xc93;
+   hctxt->cpu_regs.x86_32.es_base = 0;
+   hctxt->cpu_regs.x86_32.es_limit = ~0u;
+   hctxt->cpu_regs.x86_32.es_ar = 0xc93;
+   hctxt->cpu_regs.x86_32.ss_base = 0;
+   hctxt->cpu_regs.x86_32.ss_limit = ~0u;
+   hctxt->cpu_regs.x86_32.ss_ar = 0xc93;
+   hctxt->cpu_regs.x86_32.tr_base = 0;
+   hctxt->cpu_regs.x86_32.tr_limit = 0xff;
+   hctxt->cpu_regs.x86_32.tr_ar = 0x8b;
+#endif
+#endif
}
 
-   if (HYPERVISOR_vcpu_op(VCPUOP_initialise, cpu, ctxt))
+   if (HYPERVISOR_vcpu_op(VCPUOP_initialise, cpu, ctxt_arg))
BUG();
 
-   kfree(ctxt);
+   kfree(ctxt_arg);
return 0;
 }
 
diff --git a/arch/x86/xen/smp.h b/arch/x86/xen/smp.h
index 963d62a..b4a833c 100644
--- a/arch/x86/xen/smp.h
+++ b/arch/x86/xen/smp.h
@@ -8,6 +8,10 @@ extern void xen_send_IPI_allbutself(int vector);
 extern void xen_send_IPI_all(int vector);
 extern void xen_send_IPI_self(int vector);
 
+#ifdef CONFIG_X86_32
+extern void hvmlite_smp_32(void);
+#endif
+
 #ifdef CONFIG_XEN_PVH
 extern void xen_pvh_early_cpu_init(int cpu, bool entry);
 #else
diff --git a/arch/x86/xen/xen-hvmlite.S b/arch/x86/xen/xen-hvmlite.S
index 90f03d0..8d6a642 100644
--- a/arch/x86/xen/xen-hvmlite.S
+++ b/arch/x86/xen/xen-hvmlite.S
@@ -134,6 +134,13 @@ ENTRY(hvmlite_start_xen)
	ljmp $0x10, $_pa(startup_32)
 #endif
 
+#ifdef CONFIG_X86_32
+ENTRY(hvmlite_smp_32)
+mov $_pa(boot_gdt_descr), %eax
+lgdt (%eax)
+jmp startup_32_smp
+#endif
+
.data
 gdt:
.word   gdt_end - gdt
-- 
1.7.1



[PATCH v1 01/12] x86/smp: Make start_secondary() and initial_pg_pmd visible globally

2016-01-22 Thread Boris Ostrovsky
Xen's HVMlite guests will want to use them.

Signed-off-by: Boris Ostrovsky <boris.ostrov...@oracle.com>
---
 arch/x86/include/asm/smp.h |1 +
 arch/x86/kernel/head_32.S  |2 +-
 arch/x86/kernel/smpboot.c  |2 +-
 3 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/smp.h b/arch/x86/include/asm/smp.h
index dfcf072..739abb0 100644
--- a/arch/x86/include/asm/smp.h
+++ b/arch/x86/include/asm/smp.h
@@ -147,6 +147,7 @@ void x86_idle_thread_init(unsigned int cpu, struct task_struct *idle);
 
 void smp_store_boot_cpu_info(void);
 void smp_store_cpu_info(int id);
+void start_secondary(void *unused);
 #define cpu_physical_id(cpu)   per_cpu(x86_cpu_to_apicid, cpu)
 
 #else /* !CONFIG_SMP */
diff --git a/arch/x86/kernel/head_32.S b/arch/x86/kernel/head_32.S
index 6bc9ae2..eeb5c42 100644
--- a/arch/x86/kernel/head_32.S
+++ b/arch/x86/kernel/head_32.S
@@ -666,7 +666,7 @@ ENTRY(setup_once_ref)
 __PAGE_ALIGNED_BSS
.align PAGE_SIZE
 #ifdef CONFIG_X86_PAE
-initial_pg_pmd:
+ENTRY(initial_pg_pmd)
.fill 1024*KPMDS,4,0
 #else
 ENTRY(initial_page_table)
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 24d57f7..cba00bc 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -199,7 +199,7 @@ static int enable_start_cpu0;
 /*
  * Activate a secondary processor.
  */
-static void notrace start_secondary(void *unused)
+void notrace start_secondary(void *unused)
 {
/*
 * Don't put *anything* before cpu_init(), SMP booting is too
-- 
1.7.1



[PATCH v1 05/12] xen/hvmlite: Allow HVMlite guests delay initializing grant table

2016-01-22 Thread Boris Ostrovsky
.. just like we currently do for PVH guests

Signed-off-by: Boris Ostrovsky <boris.ostrov...@oracle.com>
---
 arch/x86/xen/grant-table.c |4 ++--
 drivers/xen/grant-table.c  |8 
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/arch/x86/xen/grant-table.c b/arch/x86/xen/grant-table.c
index e079500..40ad9c2 100644
--- a/arch/x86/xen/grant-table.c
+++ b/arch/x86/xen/grant-table.c
@@ -110,7 +110,7 @@ int arch_gnttab_init(unsigned long nr_shared)
	return arch_gnttab_valloc(&gnttab_shared_vm_area, nr_shared);
 }
 
-#ifdef CONFIG_XEN_PVH
+#ifdef CONFIG_XEN_PVHVM
 #include 
 #include 
 #include 
@@ -164,7 +164,7 @@ static int __init xlated_setup_gnttab_pages(void)
 
 static int __init xen_pvh_gnttab_setup(void)
 {
-   if (!xen_pvh_domain())
+   if (!xen_pvh_domain() && !xen_hvmlite)
return -ENODEV;
 
return xlated_setup_gnttab_pages();
diff --git a/drivers/xen/grant-table.c b/drivers/xen/grant-table.c
index effbaf9..8c6c0cf 100644
--- a/drivers/xen/grant-table.c
+++ b/drivers/xen/grant-table.c
@@ -1147,13 +1147,13 @@ EXPORT_SYMBOL_GPL(gnttab_init);
 
 static int __gnttab_init(void)
 {
+   if (!xen_domain())
+   return -ENODEV;
+
/* Delay grant-table initialization in the PV on HVM case */
-   if (xen_hvm_domain())
+   if (xen_hvm_domain() && !xen_hvmlite)
return 0;
 
-   if (!xen_pv_domain())
-   return -ENODEV;
-
return gnttab_init();
 }
 /* Starts after core_initcall so that xen_pvh_gnttab_setup can be called
-- 
1.7.1



[PATCH v1 09/12] xen/hvmlite: Extend APIC operations for HVMlite guests

2016-01-22 Thread Boris Ostrovsky
HVMlite guests need to be viewed as having APIC, otherwise smpboot code,
for example, will complain.

Signed-off-by: Boris Ostrovsky <boris.ostrov...@oracle.com>
---

Not sure about xen_cpu_present_to_apicid() being an identity function, given
xen_x86_32_early_logical_apicid().

 arch/x86/xen/apic.c |   39 +--
 arch/x86/xen/smp.c  |2 +-
 2 files changed, 38 insertions(+), 3 deletions(-)

diff --git a/arch/x86/xen/apic.c b/arch/x86/xen/apic.c
index abf4901..2a59c39 100644
--- a/arch/x86/xen/apic.c
+++ b/arch/x86/xen/apic.c
@@ -6,6 +6,7 @@
 
 #include 
 #include 
+#include 
 #include "xen-ops.h"
 #include "pmu.h"
 #include "smp.h"
@@ -78,6 +79,21 @@ static void xen_apic_write(u32 reg, u32 val)
return;
}
 
+   if (xen_hvmlite) {
+   switch (reg) {
+   case APIC_TASKPRI:
+   case APIC_SPIV:
+   case APIC_ESR:
+   case APIC_LVTT:
+   case APIC_LVT0:
+   case APIC_LVT1:
+   case APIC_LVTERR:
+   pr_debug("Unimplemented APIC register %x,"
+" value: %x\n", reg, val);
+   return;
+   }
+   }
+
/* Warn to see if there's any stray references */
WARN(1,"register: %x, value: %x\n", reg, val);
 }
@@ -100,7 +116,7 @@ static u32 xen_safe_apic_wait_icr_idle(void)
 
 static int xen_apic_probe_pv(void)
 {
-   if (xen_pv_domain())
+   if (xen_pv_domain() || xen_hvmlite)
return 1;
 
return 0;
@@ -142,6 +158,19 @@ static void xen_silent_inquire(int apicid)
 {
 }
 
+static int xen_cpu_present_to_apicid(int cpu)
+{
+   return cpu;
+}
+
+static int xen_wakeup_secondary_cpu(int cpu, unsigned long start_eip)
+{
+   if (!xen_hvmlite)
+   return -EINVAL;
+
+   return HYPERVISOR_vcpu_op(VCPUOP_up, cpu, NULL);
+}
+
 static struct apic xen_pv_apic = {
.name   = "Xen PV",
.probe  = xen_apic_probe_pv,
@@ -162,7 +191,7 @@ static struct apic xen_pv_apic = {
 
	.ioapic_phys_id_map		= default_ioapic_phys_id_map, /* Used on 32-bit */
	.setup_apic_routing		= NULL,
-	.cpu_present_to_apicid		= default_cpu_present_to_apicid,
+	.cpu_present_to_apicid		= xen_cpu_present_to_apicid,
	.apicid_to_cpu_present		= physid_set_mask_of_physid, /* Used on 32-bit */
	.check_phys_apicid_present	= default_check_phys_apicid_present, /* smp_sanity_check needs it */
	.phys_pkg_id			= xen_phys_pkg_id, /* detect_ht */
.phys_pkg_id= xen_phys_pkg_id, /* detect_ht */
@@ -180,6 +209,9 @@ static struct apic xen_pv_apic = {
.send_IPI_all   = xen_send_IPI_all,
.send_IPI_self  = xen_send_IPI_self,
 #endif
+
+   .wakeup_secondary_cpu   = xen_wakeup_secondary_cpu,
+
	/* .wait_for_init_deassert - used by AP bootup - smp_callin which we don't use */
.inquire_remote_apic= xen_silent_inquire,
 
@@ -216,5 +248,8 @@ void __init xen_init_apic(void)
	apic = &xen_pv_apic;
 
x86_platform.apic_post_init = xen_apic_check;
+
+   if (xen_hvmlite)
+   setup_force_cpu_cap(X86_FEATURE_APIC);
 }
 apic_driver(xen_pv_apic);
diff --git a/arch/x86/xen/smp.c b/arch/x86/xen/smp.c
index b265c4f..fb085ef 100644
--- a/arch/x86/xen/smp.c
+++ b/arch/x86/xen/smp.c
@@ -214,7 +214,7 @@ static int xen_smp_intr_init(unsigned int cpu)
 * The IRQ worker on PVHVM goes through the native path and uses the
 * IPI mechanism.
 */
-   if (xen_hvm_domain())
+   if (xen_hvm_domain() && !xen_hvmlite)
return 0;
 
callfunc_name = kasprintf(GFP_KERNEL, "irqwork%d", cpu);
-- 
1.7.1



[PATCH v1 07/12] xen/hvmlite: Prepare cpu_initialize_context() routine for HVMlite SMP

2016-01-22 Thread Boris Ostrovsky
A subsequent patch will add support for starting secondary VCPUs in
HVMlite guests. This patch exists to make review easier.

No functional changes (except for introduction of 'if (!xen_hvmlite)').

Signed-off-by: Boris Ostrovsky <boris.ostrov...@oracle.com>
---
 arch/x86/xen/smp.c |  104 ---
 1 files changed, 57 insertions(+), 47 deletions(-)

diff --git a/arch/x86/xen/smp.c b/arch/x86/xen/smp.c
index 3f4ebf0..5fc4afb 100644
--- a/arch/x86/xen/smp.c
+++ b/arch/x86/xen/smp.c
@@ -390,70 +390,80 @@ cpu_initialize_context(unsigned int cpu, struct task_struct *idle)
if (cpumask_test_and_set_cpu(cpu, xen_cpu_initialized_map))
return 0;
 
-   ctxt = kzalloc(sizeof(*ctxt), GFP_KERNEL);
-   if (ctxt == NULL)
-   return -ENOMEM;
+   if (!xen_hvmlite) {
 
-   gdt = get_cpu_gdt_table(cpu);
+   ctxt = kzalloc(sizeof(*ctxt), GFP_KERNEL);
+   if (ctxt == NULL)
+   return -ENOMEM;
+
+   gdt = get_cpu_gdt_table(cpu);
 
 #ifdef CONFIG_X86_32
-   /* Note: PVH is not yet supported on x86_32. */
-   ctxt->user_regs.fs = __KERNEL_PERCPU;
-   ctxt->user_regs.gs = __KERNEL_STACK_CANARY;
+   /* Note: PVH is not yet supported on x86_32. */
+   ctxt->user_regs.fs = __KERNEL_PERCPU;
+   ctxt->user_regs.gs = __KERNEL_STACK_CANARY;
 #endif
-	memset(&ctxt->fpu_ctxt, 0, sizeof(ctxt->fpu_ctxt));
+	memset(&ctxt->fpu_ctxt, 0, sizeof(ctxt->fpu_ctxt));
 
-   if (!xen_feature(XENFEAT_auto_translated_physmap)) {
-   ctxt->user_regs.eip = (unsigned long)cpu_bringup_and_idle;
-   ctxt->flags = VGCF_IN_KERNEL;
-   ctxt->user_regs.eflags = 0x1000; /* IOPL_RING1 */
-   ctxt->user_regs.ds = __USER_DS;
-   ctxt->user_regs.es = __USER_DS;
-   ctxt->user_regs.ss = __KERNEL_DS;
+   if (!xen_feature(XENFEAT_auto_translated_physmap)) {
+   ctxt->user_regs.eip =
+   (unsigned long)cpu_bringup_and_idle;
+   ctxt->flags = VGCF_IN_KERNEL;
+   ctxt->user_regs.eflags = 0x1000; /* IOPL_RING1 */
+   ctxt->user_regs.ds = __USER_DS;
+   ctxt->user_regs.es = __USER_DS;
+   ctxt->user_regs.ss = __KERNEL_DS;
 
-   xen_copy_trap_info(ctxt->trap_ctxt);
+   xen_copy_trap_info(ctxt->trap_ctxt);
 
-   ctxt->ldt_ents = 0;
+   ctxt->ldt_ents = 0;
 
-   BUG_ON((unsigned long)gdt & ~PAGE_MASK);
+   BUG_ON((unsigned long)gdt & ~PAGE_MASK);
 
-   gdt_mfn = arbitrary_virt_to_mfn(gdt);
-   make_lowmem_page_readonly(gdt);
-   make_lowmem_page_readonly(mfn_to_virt(gdt_mfn));
+   gdt_mfn = arbitrary_virt_to_mfn(gdt);
+   make_lowmem_page_readonly(gdt);
+   make_lowmem_page_readonly(mfn_to_virt(gdt_mfn));
 
-   ctxt->gdt_frames[0] = gdt_mfn;
-   ctxt->gdt_ents  = GDT_ENTRIES;
+   ctxt->gdt_frames[0] = gdt_mfn;
+   ctxt->gdt_ents  = GDT_ENTRIES;
 
-   ctxt->kernel_ss = __KERNEL_DS;
-   ctxt->kernel_sp = idle->thread.sp0;
+   ctxt->kernel_ss = __KERNEL_DS;
+   ctxt->kernel_sp = idle->thread.sp0;
 
 #ifdef CONFIG_X86_32
-   ctxt->event_callback_cs = __KERNEL_CS;
-   ctxt->failsafe_callback_cs  = __KERNEL_CS;
+   ctxt->event_callback_cs = __KERNEL_CS;
+   ctxt->failsafe_callback_cs  = __KERNEL_CS;
 #else
-   ctxt->gs_base_kernel = per_cpu_offset(cpu);
+   ctxt->gs_base_kernel = per_cpu_offset(cpu);
 #endif
-   ctxt->event_callback_eip=
-   (unsigned long)xen_hypervisor_callback;
-   ctxt->failsafe_callback_eip =
-   (unsigned long)xen_failsafe_callback;
-   ctxt->user_regs.cs = __KERNEL_CS;
-   per_cpu(xen_cr3, cpu) = __pa(swapper_pg_dir);
-   }
+   ctxt->event_callback_eip=
+   (unsigned long)xen_hypervisor_callback;
+   ctxt->failsafe_callback_eip =
+   (unsigned long)xen_failsafe_callback;
+   ctxt->user_regs.cs = __KERNEL_CS;
+   per_cpu(xen_cr3, cpu) = __pa(swapper_pg_dir);
+   }
 #ifdef CONFIG_XEN_PVH
-   else {
-   /*
-* The vcpu comes on kernel page tables which

[PATCH v1 12/12] xen/hvmlite: Enable CPU on-/offlining

2016-01-22 Thread Boris Ostrovsky
When offlining, we should properly clean up interrupts and wait
until hypervisor declares VCPU as down before cleaning up.

After a VCPU that was previously offlined is brought back to life
we want to jump back to bare-metal entry points. It's a simple
jump on 64-bit but requires minor tweaking for the 32-bit case.

Signed-off-by: Boris Ostrovsky <boris.ostrov...@oracle.com>
---
 arch/x86/xen/smp.c |   35 +--
 arch/x86/xen/xen-hvmlite.S |8 
 2 files changed, 33 insertions(+), 10 deletions(-)

diff --git a/arch/x86/xen/smp.c b/arch/x86/xen/smp.c
index fbad829..7e96a23 100644
--- a/arch/x86/xen/smp.c
+++ b/arch/x86/xen/smp.c
@@ -143,7 +143,7 @@ static void xen_smp_intr_free(unsigned int cpu)
kfree(per_cpu(xen_callfuncsingle_irq, cpu).name);
per_cpu(xen_callfuncsingle_irq, cpu).name = NULL;
}
-   if (xen_hvm_domain())
+   if (xen_hvm_domain() && !xen_hvmlite)
return;
 
if (per_cpu(xen_irq_work, cpu).irq >= 0) {
@@ -585,7 +585,8 @@ static int xen_cpu_disable(void)
 
 static void xen_cpu_die(unsigned int cpu)
 {
-   while (xen_pv_domain() && HYPERVISOR_vcpu_op(VCPUOP_is_up, cpu, NULL)) {
+   while ((xen_pv_domain() || xen_hvmlite) &&
+  HYPERVISOR_vcpu_op(VCPUOP_is_up, cpu, NULL)) {
__set_current_state(TASK_UNINTERRUPTIBLE);
schedule_timeout(HZ/10);
}
@@ -602,14 +603,28 @@ static void xen_play_dead(void) /* used only with HOTPLUG_CPU */
 {
play_dead_common();
HYPERVISOR_vcpu_op(VCPUOP_down, smp_processor_id(), NULL);
-   cpu_bringup();
-   /*
-* commit 4b0c0f294 (tick: Cleanup NOHZ per cpu data on cpu down)
-* clears certain data that the cpu_idle loop (which called us
-* and that we return from) expects. The only way to get that
-* data back is to call:
-*/
-   tick_nohz_idle_enter();
+
+   if (!xen_hvm_domain()) {
+   cpu_bringup();
+   /*
+* commit 4b0c0f294 (tick: Cleanup NOHZ per cpu data on cpu
+* down) clears certain data that the cpu_idle loop (which
+* called us and that we return from) expects. The only way to
+* get that data back is to call:
+*/
+   tick_nohz_idle_enter();
+   } else {
+   /*
+* For 64-bit we can jump directly to SMP entry point but for
+* 32-bit we need to disable paging and load boot GDT (just
+* like in cpu_initialize_context()).
+*/
+#ifdef CONFIG_X86_64
+   asm("jmp secondary_startup_64");
+#else
+   asm("jmp hvmlite_smp_32_hp");
+#endif
+   }
 }
 
 #else /* !CONFIG_HOTPLUG_CPU */
diff --git a/arch/x86/xen/xen-hvmlite.S b/arch/x86/xen/xen-hvmlite.S
index 8d6a642..4edd6ef 100644
--- a/arch/x86/xen/xen-hvmlite.S
+++ b/arch/x86/xen/xen-hvmlite.S
@@ -135,6 +135,14 @@ ENTRY(hvmlite_start_xen)
 #endif
 
 #ifdef CONFIG_X86_32
+ENTRY(hvmlite_smp_32_hp)
+   movl $_pa(initial_page_table), %eax
+   movl %eax, %cr3
+   ljmp $__KERNEL_CS,$_pa(5f)
+5:
+   movl $X86_CR0_PE, %eax
+   movl %eax, %cr0
+
 ENTRY(hvmlite_smp_32)
 mov $_pa(boot_gdt_descr), %eax
 lgdt (%eax)
-- 
1.7.1



[PATCH v1 03/12] xen/hvmlite: Import hvmlite-related Xen public interfaces

2016-01-22 Thread Boris Ostrovsky
Signed-off-by: Boris Ostrovsky <boris.ostrov...@oracle.com>
---
 include/xen/interface/elfnote.h  |   12 +++-
 include/xen/interface/hvm/hvm_vcpu.h |  143 ++
 include/xen/interface/xen.h  |   24 ++
 3 files changed, 178 insertions(+), 1 deletions(-)
 create mode 100644 include/xen/interface/hvm/hvm_vcpu.h

diff --git a/include/xen/interface/elfnote.h b/include/xen/interface/elfnote.h
index f90b034..9e9f9bf 100644
--- a/include/xen/interface/elfnote.h
+++ b/include/xen/interface/elfnote.h
@@ -193,9 +193,19 @@
 #define XEN_ELFNOTE_SUPPORTED_FEATURES 17
 
 /*
+ * Physical entry point into the kernel.
+ *
+ * 32bit entry point into the kernel. When requested to launch the
+ * guest kernel in a HVM container, Xen will use this entry point to
+ * launch the guest in 32bit protected mode with paging disabled.
+ * Ignored otherwise.
+ */
+#define XEN_ELFNOTE_PHYS32_ENTRY 18
+
+/*
  * The number of the highest elfnote defined.
  */
-#define XEN_ELFNOTE_MAX XEN_ELFNOTE_SUPPORTED_FEATURES
+#define XEN_ELFNOTE_MAX XEN_ELFNOTE_PHYS32_ENTRY
 
 #endif /* __XEN_PUBLIC_ELFNOTE_H__ */
 
diff --git a/include/xen/interface/hvm/hvm_vcpu.h b/include/xen/interface/hvm/hvm_vcpu.h
new file mode 100644
index 000..32ca83e
--- /dev/null
+++ b/include/xen/interface/hvm/hvm_vcpu.h
@@ -0,0 +1,143 @@
+/*
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to
+ * deal in the Software without restriction, including without limitation the
+ * rights to use, copy, modify, merge, publish, distribute, sublicense, and/or
+ * sell copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+ * DEALINGS IN THE SOFTWARE.
+ *
+ * Copyright (c) 2015, Roger Pau Monne <roger@citrix.com>
+ */
+
+#ifndef __XEN_PUBLIC_HVM_HVM_VCPU_H__
+#define __XEN_PUBLIC_HVM_HVM_VCPU_H__
+
+#include "../xen.h"
+
+struct vcpu_hvm_x86_32 {
+uint32_t eax;
+uint32_t ecx;
+uint32_t edx;
+uint32_t ebx;
+uint32_t esp;
+uint32_t ebp;
+uint32_t esi;
+uint32_t edi;
+uint32_t eip;
+uint32_t eflags;
+
+uint32_t cr0;
+uint32_t cr3;
+uint32_t cr4;
+
+uint32_t pad1;
+
+/*
+ * EFER should only be used to set the NXE bit (if required)
+ * when starting a vCPU in 32bit mode with paging enabled or
+ * to set the LME/LMA bits in order to start the vCPU in
+ * compatibility mode.
+ */
+uint64_t efer;
+
+uint32_t cs_base;
+uint32_t ds_base;
+uint32_t ss_base;
+uint32_t es_base;
+uint32_t tr_base;
+uint32_t cs_limit;
+uint32_t ds_limit;
+uint32_t ss_limit;
+uint32_t es_limit;
+uint32_t tr_limit;
+uint16_t cs_ar;
+uint16_t ds_ar;
+uint16_t ss_ar;
+uint16_t es_ar;
+uint16_t tr_ar;
+
+uint16_t pad2[3];
+};
+
+/*
+ * The layout of the _ar fields of the segment registers is the
+ * following:
+ *
+ * Bits   [0,3]: type (bits 40-43).
+ * Bit4: s(descriptor type, bit 44).
+ * Bit[5,6]: dpl  (descriptor privilege level, bits 45-46).
+ * Bit7: p(segment-present, bit 47).
+ * Bit8: avl  (available for system software, bit 52).
+ * Bit9: l(64-bit code segment, bit 53).
+ * Bit   10: db   (meaning depends on the segment, bit 54).
+ * Bit   11: g(granularity, bit 55)
+ * Bits [12,15]: unused, must be blank.
+ *
+ * A more complete description of the meaning of this fields can be
+ * obtained from the Intel SDM, Volume 3, section 3.4.5.
+ */
+
+struct vcpu_hvm_x86_64 {
+uint64_t rax;
+uint64_t rcx;
+uint64_t rdx;
+uint64_t rbx;
+uint64_t rsp;
+uint64_t rbp;
+uint64_t rsi;
+uint64_t rdi;
+uint64_t rip;
+uint64_t rflags;
+
+uint64_t cr0;
+uint64_t cr3;
+uint64_t cr4;
+uint64_t efer;
+
+/*
+ * Using VCPU_HVM_MODE_64B implies that the vCPU is launched
+ * directly in long mode, so the cached parts of the segment
+ * registers get set to match that environment.
+ *
+ * If the user wants to launch the vCPU in compatibility mode
+ * the 32-bit structure should be used instead.
+ */
+};
+
+struct vcpu_hvm_context {
+#define VCPU_HVM_MODE_32B 0  /* 32bit fields 

[PATCH v1 04/12] xen/hvmlite: Bootstrap HVMlite guest

2016-01-22 Thread Boris Ostrovsky
Start HVMlite guest at XEN_ELFNOTE_PHYS32_ENTRY address. Setup hypercall
page, initialize boot_params, enable early page tables.

Signed-off-by: Boris Ostrovsky <boris.ostrov...@oracle.com>
---
 arch/x86/xen/Makefile  |1 +
 arch/x86/xen/enlighten.c   |   91 +-
 arch/x86/xen/xen-hvmlite.S |  158 
 include/xen/xen.h  |6 ++
 4 files changed, 255 insertions(+), 1 deletions(-)
 create mode 100644 arch/x86/xen/xen-hvmlite.S

diff --git a/arch/x86/xen/Makefile b/arch/x86/xen/Makefile
index e47e527..1d913d7 100644
--- a/arch/x86/xen/Makefile
+++ b/arch/x86/xen/Makefile
@@ -23,3 +23,4 @@ obj-$(CONFIG_XEN_DEBUG_FS)+= debugfs.o
 obj-$(CONFIG_XEN_DOM0) += vga.o
 obj-$(CONFIG_SWIOTLB_XEN)  += pci-swiotlb-xen.o
 obj-$(CONFIG_XEN_EFI)  += efi.o
+obj-$(CONFIG_XEN_PVHVM)+= xen-hvmlite.o
diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
index 2cf446a..2ed8b2b 100644
--- a/arch/x86/xen/enlighten.c
+++ b/arch/x86/xen/enlighten.c
@@ -118,7 +118,8 @@ DEFINE_PER_CPU(struct vcpu_info *, xen_vcpu);
  */
 DEFINE_PER_CPU(struct vcpu_info, xen_vcpu_info);
 
-enum xen_domain_type xen_domain_type = XEN_NATIVE;
+enum xen_domain_type xen_domain_type
+   __attribute__((section(".data"))) = XEN_NATIVE;
 EXPORT_SYMBOL_GPL(xen_domain_type);
 
 unsigned long *machine_to_phys_mapping = (void *)MACH2PHYS_VIRT_START;
@@ -171,6 +172,17 @@ struct tls_descs {
  */
 static DEFINE_PER_CPU(struct tls_descs, shadow_tls_desc);
 
+#ifdef CONFIG_XEN_PVHVM
+/*
+ * HVMlite variables. These need to live in data segment since they are
+ * initialized before startup_{32|64}, which clear .bss, are invoked.
+ */
+int xen_hvmlite __attribute__((section(".data"))) = 0;
+struct hvm_start_info hvmlite_start_info __attribute__((section(".data")));
+uint hvmlite_start_info_sz = sizeof(hvmlite_start_info);
+struct boot_params xen_hvmlite_boot_params __attribute__((section(".data")));
+#endif
+
 static void clamp_max_cpus(void)
 {
 #ifdef CONFIG_SMP
@@ -1736,6 +1748,83 @@ asmlinkage __visible void __init xen_start_kernel(void)
 #endif
 }
 
+#ifdef CONFIG_XEN_PVHVM
+static void __init hvmlite_bootparams(void)
+{
+   struct xen_memory_map memmap;
+   int i;
+
+	memset(&xen_hvmlite_boot_params, 0, sizeof(xen_hvmlite_boot_params));
+
+   memmap.nr_entries = ARRAY_SIZE(xen_hvmlite_boot_params.e820_map);
+   set_xen_guest_handle(memmap.buffer, xen_hvmlite_boot_params.e820_map);
+	if (HYPERVISOR_memory_op(XENMEM_memory_map, &memmap)) {
+   xen_raw_console_write("XENMEM_memory_map failed\n");
+   BUG();
+   }
+
+   xen_hvmlite_boot_params.e820_map[memmap.nr_entries].addr =
+   ISA_START_ADDRESS;
+   xen_hvmlite_boot_params.e820_map[memmap.nr_entries].size =
+   ISA_END_ADDRESS - ISA_START_ADDRESS;
+   xen_hvmlite_boot_params.e820_map[memmap.nr_entries++].type =
+   E820_RESERVED;
+
+   sanitize_e820_map(xen_hvmlite_boot_params.e820_map,
+ ARRAY_SIZE(xen_hvmlite_boot_params.e820_map),
+			  &memmap.nr_entries);
+
+   xen_hvmlite_boot_params.e820_entries = memmap.nr_entries;
+   for (i = 0; i < xen_hvmlite_boot_params.e820_entries; i++)
+   e820_add_region(xen_hvmlite_boot_params.e820_map[i].addr,
+   xen_hvmlite_boot_params.e820_map[i].size,
+   xen_hvmlite_boot_params.e820_map[i].type);
+
+   xen_hvmlite_boot_params.hdr.cmd_line_ptr =
+   hvmlite_start_info.cmdline_paddr;
+
+   /* The first module is always ramdisk */
+   if (hvmlite_start_info.nr_modules) {
+   struct hvm_modlist_entry *modaddr =
+   __va(hvmlite_start_info.modlist_paddr);
+   xen_hvmlite_boot_params.hdr.ramdisk_image = modaddr->paddr;
+   xen_hvmlite_boot_params.hdr.ramdisk_size = modaddr->size;
+   }
+
+   /*
+* See Documentation/x86/boot.txt.
+*
+* Version 2.12 supports Xen entry point but we will use default x86/PC
+* environment (i.e. hardware_subarch 0).
+*/
+   xen_hvmlite_boot_params.hdr.version = 0x212;
+   xen_hvmlite_boot_params.hdr.type_of_loader = 9; /* Xen loader */
+}
+
+/*
+ * This routine (and those that it might call) should not use
+ * anything that lives in .bss since that segment will be cleared later
+ */
+void __init xen_prepare_hvmlite(void)
+{
+   u32 eax, ecx, edx, msr;
+   u64 pfn;
+
+	cpuid(xen_cpuid_base() + 2, &eax, &msr, &ecx, &edx);
+   pfn = __pa(hypercall_page);
+   wrmsr_safe(msr, (u32)pfn, (u32)(pfn >> 32));
+
+   pv_info.name = "Xen HVMlite";
+   xen_domain_type = XEN_HVM_DOMAIN;
+   xen_hvmlite = 1;
+
+   x86_init.oem.arch_setup = xen_init_kernel;
+   x86_init.oem.banner = xen_banner;
+
+  

[PATCH v1 00/12] HVMlite domU support

2016-01-22 Thread Boris Ostrovsky
This series introduces HVMlite support for unprivileged guests.

It has been tested on Intel/AMD, both 32- and 64-bit, including CPU on- and
offlining and save/restore. (Restore will result in APIC write warnings
which exist now for 32-bit PV guests as well so I didn't address this in
this series)

Compile-tested on ARM

Boris Ostrovsky (12):
  x86/smp: Make start_secondary() and initial_pg_pmd visible globally
  xen/hvmlite: Factor out common kernel init code
  xen/hvmlite: Import hvmlite-related Xen public interfaces
  xen/hvmlite: Bootstrap HVMlite guest
  xen/hvmlite: Allow HVMlite guests delay initializing grant table
  xen/hvmlite: Initialize PCI
  xen/hvmlite: Prepare cpu_initialize_context() routine for HVMlite SMP
  xen/hvmlite: Initialize context for secondary VCPUs
  xen/hvmlite: Extend APIC operations for HVMlite guests
  xen/hvmlite: Use x86's default timer init for HVMlite guests
  xen/hvmlite: Boot secondary CPUs
  xen/hvmlite: Enable CPU on-/offlining

 arch/x86/include/asm/smp.h   |1 +
 arch/x86/kernel/head_32.S|2 +-
 arch/x86/kernel/smpboot.c|2 +-
 arch/x86/pci/xen.c   |2 +-
 arch/x86/xen/Makefile|1 +
 arch/x86/xen/apic.c  |   39 -
 arch/x86/xen/enlighten.c |  318 ++
 arch/x86/xen/grant-table.c   |4 +-
 arch/x86/xen/platform-pci-unplug.c   |4 +-
 arch/x86/xen/pmu.c   |4 +-
 arch/x86/xen/smp.c   |  248 ++
 arch/x86/xen/smp.h   |4 +
 arch/x86/xen/time.c  |5 +-
 arch/x86/xen/xen-hvmlite.S   |  173 ++
 drivers/xen/grant-table.c|8 +-
 include/xen/interface/elfnote.h  |   12 ++-
 include/xen/interface/hvm/hvm_vcpu.h |  143 +++
 include/xen/interface/xen.h  |   24 +++
 include/xen/xen.h|6 +
 19 files changed, 801 insertions(+), 199 deletions(-)
 create mode 100644 arch/x86/xen/xen-hvmlite.S
 create mode 100644 include/xen/interface/hvm/hvm_vcpu.h



[PATCH v1 02/12] xen/hvmlite: Factor out common kernel init code

2016-01-22 Thread Boris Ostrovsky
HVMlite guests (to be introduced in subsequent patches) share most
of the kernel initialization code with PV(H).

Signed-off-by: Boris Ostrovsky <boris.ostrov...@oracle.com>
---
 arch/x86/xen/enlighten.c |  225 --
 1 files changed, 119 insertions(+), 106 deletions(-)

diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
index d09e4c9..2cf446a 100644
--- a/arch/x86/xen/enlighten.c
+++ b/arch/x86/xen/enlighten.c
@@ -1505,19 +1505,14 @@ static void __init xen_pvh_early_guest_init(void)
 }
 #endif /* CONFIG_XEN_PVH */
 
-/* First C function to be called on Xen boot */
-asmlinkage __visible void __init xen_start_kernel(void)
+
+static void __init xen_init_kernel(void)
 {
struct physdev_set_iopl set_iopl;
unsigned long initrd_start = 0;
u64 pat;
int rc;
 
-   if (!xen_start_info)
-   return;
-
-   xen_domain_type = XEN_PV_DOMAIN;
-
xen_setup_features();
 #ifdef CONFIG_XEN_PVH
xen_pvh_early_guest_init();
@@ -1529,48 +1524,9 @@ asmlinkage __visible void __init xen_start_kernel(void)
if (xen_initial_domain())
pv_info.features |= PV_SUPPORTED_RTC;
pv_init_ops = xen_init_ops;
-   if (!xen_pvh_domain()) {
-   pv_cpu_ops = xen_cpu_ops;
-
-   x86_platform.get_nmi_reason = xen_get_nmi_reason;
-   }
-
-   if (xen_feature(XENFEAT_auto_translated_physmap))
-   x86_init.resources.memory_setup = xen_auto_xlated_memory_setup;
-   else
-   x86_init.resources.memory_setup = xen_memory_setup;
-   x86_init.oem.arch_setup = xen_arch_setup;
-   x86_init.oem.banner = xen_banner;
 
xen_init_time_ops();
 
-   /*
-* Set up some pagetable state before starting to set any ptes.
-*/
-
-   xen_init_mmu_ops();
-
-   /* Prevent unwanted bits from being set in PTEs. */
-   __supported_pte_mask &= ~_PAGE_GLOBAL;
-
-   /*
-* Prevent page tables from being allocated in highmem, even
-* if CONFIG_HIGHPTE is enabled.
-*/
-   __userpte_alloc_gfp &= ~__GFP_HIGHMEM;
-
-   /* Work out if we support NX */
-   x86_configure_nx();
-
-   /* Get mfn list */
-   xen_build_dynamic_phys_to_machine();
-
-   /*
-* Set up kernel GDT and segment registers, mainly so that
-* -fstack-protector code can be executed.
-*/
-   xen_setup_gdt(0);
-
xen_init_irq_ops();
xen_init_cpuid_mask();
 
@@ -1580,21 +1536,8 @@ asmlinkage __visible void __init xen_start_kernel(void)
 */
xen_init_apic();
 #endif
-
-   if (xen_feature(XENFEAT_mmu_pt_update_preserve_ad)) {
-   pv_mmu_ops.ptep_modify_prot_start = xen_ptep_modify_prot_start;
-   pv_mmu_ops.ptep_modify_prot_commit = xen_ptep_modify_prot_commit;
-   }
-
machine_ops = xen_machine_ops;
 
-   /*
-* The only reliable way to retain the initial address of the
-* percpu gdt_page is to remember it here, so we can go and
-* mark it RW later, when the initial percpu area is freed.
-*/
-   xen_initial_gdt = &per_cpu(gdt_page, 0);
-
xen_smp_init();
 
 #ifdef CONFIG_ACPI_NUMA
@@ -1609,66 +1552,126 @@ asmlinkage __visible void __init xen_start_kernel(void)
   possible map and a non-dummy shared_info. */
per_cpu(xen_vcpu, 0) = &HYPERVISOR_shared_info->vcpu_info[0];
 
-   local_irq_disable();
-   early_boot_irqs_disabled = true;
+   if (!xen_hvm_domain()) {
+   if (!xen_pvh_domain()) {
+   pv_cpu_ops = xen_cpu_ops;
 
-   xen_raw_console_write("mapping kernel into physical memory\n");
-   xen_setup_kernel_pagetable((pgd_t *)xen_start_info->pt_base,
-  xen_start_info->nr_pages);
-   xen_reserve_special_pages();
+   x86_platform.get_nmi_reason = xen_get_nmi_reason;
+   }
 
-   /*
-* Modify the cache mode translation tables to match Xen's PAT
-* configuration.
-*/
-   rdmsrl(MSR_IA32_CR_PAT, pat);
-   pat_init_cache_modes(pat);
+   if (xen_feature(XENFEAT_auto_translated_physmap))
+   x86_init.resources.memory_setup =
+   xen_auto_xlated_memory_setup;
+   else
+   x86_init.resources.memory_setup = xen_memory_setup;
+   x86_init.oem.arch_setup = xen_arch_setup;
+   x86_init.oem.banner = xen_banner;
 
-   /* keep using Xen gdt for now; no urgent need to change it */
+   /*
+* Set up some pagetable state before starting to set any ptes.
+*/
 
-#ifdef CONFIG_X86_32
-   pv_info.kernel_rpl = 1;
-   if (xen_feature(XENFEAT_supervisor_mode_kernel))
-   pv_info.kernel_rpl = 0;
-#else
-   pv_info.kernel_rpl = 0;
-#endif

[PATCH v1 06/12] xen/hvmlite: Initialize PCI

2016-01-22 Thread Boris Ostrovsky
HVMlite guests need PCI frontend and always have PV devices

Signed-off-by: Boris Ostrovsky <boris.ostrov...@oracle.com>
---
 arch/x86/pci/xen.c |2 +-
 arch/x86/xen/platform-pci-unplug.c |4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/pci/xen.c b/arch/x86/pci/xen.c
index ff31ab4..d847f7d 100644
--- a/arch/x86/pci/xen.c
+++ b/arch/x86/pci/xen.c
@@ -405,7 +405,7 @@ static void xen_teardown_msi_irq(unsigned int irq)
 
 int __init pci_xen_init(void)
 {
-   if (!xen_pv_domain() || xen_initial_domain())
+   if ((!xen_pv_domain() && !xen_hvmlite) || xen_initial_domain())
return -ENODEV;
 
printk(KERN_INFO "PCI: setting up Xen PCI frontend stub\n");
diff --git a/arch/x86/xen/platform-pci-unplug.c b/arch/x86/xen/platform-pci-unplug.c
index 9586ff3..802ec90 100644
--- a/arch/x86/xen/platform-pci-unplug.c
+++ b/arch/x86/xen/platform-pci-unplug.c
@@ -73,8 +73,8 @@ bool xen_has_pv_devices(void)
if (!xen_domain())
return false;
 
-   /* PV domains always have them. */
-   if (xen_pv_domain())
+   /* PV and HVMlite domains always have them. */
+   if (xen_pv_domain() || xen_hvmlite)
return true;
 
/* And user has xen_platform_pci=0 set in guest config as
-- 
1.7.1



[PATCH v1 11/12] xen/hvmlite: Boot secondary CPUs

2016-01-22 Thread Boris Ostrovsky
HVMlite secondary VCPUs use the bare-metal bringup path (i.e. native_*
smp_ops) but need to do some preparation in PV code.

Signed-off-by: Boris Ostrovsky <boris.ostrov...@oracle.com>
---
 arch/x86/xen/enlighten.c |2 +
 arch/x86/xen/pmu.c   |4 +-
 arch/x86/xen/smp.c   |   60 +-
 3 files changed, 47 insertions(+), 19 deletions(-)

diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
index 2ed8b2b..850ce66 100644
--- a/arch/x86/xen/enlighten.c
+++ b/arch/x86/xen/enlighten.c
@@ -1937,6 +1937,8 @@ static void __init xen_hvm_guest_init(void)
xen_have_vector_callback = 1;
xen_hvm_smp_init();
register_cpu_notifier(&xen_hvm_cpu_notifier);
+   if (xen_hvmlite)
+   smp_found_config = 1;
xen_unplug_emulated_devices();
x86_init.irqs.intr_init = xen_init_IRQ;
xen_hvm_init_time_ops();
diff --git a/arch/x86/xen/pmu.c b/arch/x86/xen/pmu.c
index 724a087..7bc209b 100644
--- a/arch/x86/xen/pmu.c
+++ b/arch/x86/xen/pmu.c
@@ -518,7 +518,7 @@ void xen_pmu_init(int cpu)
 
BUILD_BUG_ON(sizeof(struct xen_pmu_data) > PAGE_SIZE);
 
-   if (xen_hvm_domain())
+   if (xen_hvm_domain() && !xen_hvmlite)
return;
 
xenpmu_data = (struct xen_pmu_data *)get_zeroed_page(GFP_KERNEL);
@@ -556,7 +556,7 @@ void xen_pmu_finish(int cpu)
 {
struct xen_pmu_params xp;
 
-   if (xen_hvm_domain())
+   if (xen_hvm_domain() && !xen_hvmlite)
return;
 
xp.vcpu = cpu;
diff --git a/arch/x86/xen/smp.c b/arch/x86/xen/smp.c
index fb085ef..fbad829 100644
--- a/arch/x86/xen/smp.c
+++ b/arch/x86/xen/smp.c
@@ -348,26 +348,31 @@ static void __init xen_smp_prepare_cpus(unsigned int max_cpus)
}
xen_init_lock_cpu(0);
 
-   smp_store_boot_cpu_info();
-   cpu_data(0).x86_max_cores = 1;
+   if (!xen_hvmlite) {
+   smp_store_boot_cpu_info();
+   cpu_data(0).x86_max_cores = 1;
+
+   for_each_possible_cpu(i) {
+   zalloc_cpumask_var(&per_cpu(cpu_sibling_map, i),
+  GFP_KERNEL);
+   zalloc_cpumask_var(&per_cpu(cpu_core_map, i),
+  GFP_KERNEL);
+   zalloc_cpumask_var(&per_cpu(cpu_llc_shared_map, i),
+  GFP_KERNEL);
+   }
+   set_cpu_sibling_map(0);
 
-   for_each_possible_cpu(i) {
-   zalloc_cpumask_var(&per_cpu(cpu_sibling_map, i), GFP_KERNEL);
-   zalloc_cpumask_var(&per_cpu(cpu_core_map, i), GFP_KERNEL);
-   zalloc_cpumask_var(&per_cpu(cpu_llc_shared_map, i), GFP_KERNEL);
+   if (!alloc_cpumask_var(&xen_cpu_initialized_map, GFP_KERNEL))
+   panic("could not allocate xen_cpu_initialized_map\n");
+
+   cpumask_copy(xen_cpu_initialized_map, cpumask_of(0));
}
-   set_cpu_sibling_map(0);
 
xen_pmu_init(0);
 
if (xen_smp_intr_init(0))
BUG();
 
-   if (!alloc_cpumask_var(&xen_cpu_initialized_map, GFP_KERNEL))
-   panic("could not allocate xen_cpu_initialized_map\n");
-
-   cpumask_copy(xen_cpu_initialized_map, cpumask_of(0));
-
/* Restrict the possible_map according to max_cpus. */
while ((num_possible_cpus() > 1) && (num_possible_cpus() > max_cpus)) {
for (cpu = nr_cpu_ids - 1; !cpu_possible(cpu); cpu--)
@@ -375,8 +380,11 @@ static void __init xen_smp_prepare_cpus(unsigned int max_cpus)
set_cpu_possible(cpu, false);
}
 
-   for_each_possible_cpu(cpu)
+   for_each_possible_cpu(cpu) {
set_cpu_present(cpu, true);
+   if (xen_hvmlite)
+   physid_set(cpu, phys_cpu_present_map);
+   }
 }
 
 static int
@@ -810,10 +818,15 @@ void __init xen_smp_init(void)
 
 static void __init xen_hvm_smp_prepare_cpus(unsigned int max_cpus)
 {
+   if (xen_hvmlite)
+   xen_smp_prepare_cpus(max_cpus);
+
native_smp_prepare_cpus(max_cpus);
-   WARN_ON(xen_smp_intr_init(0));
 
-   xen_init_lock_cpu(0);
+   if (!xen_hvmlite) {
+   WARN_ON(xen_smp_intr_init(0));
+   xen_init_lock_cpu(0);
+   }
 }
 
 static int xen_hvm_cpu_up(unsigned int cpu, struct task_struct *tidle)
@@ -836,8 +849,21 @@ static int xen_hvm_cpu_up(unsigned int cpu, struct task_struct *tidle)
*/
rc = xen_smp_intr_init(cpu);
WARN_ON(rc);
-   if (!rc)
-   rc =  native_cpu_up(cpu, tidle);
+
+   if (xen_hvmlite) {
+   rc = cpu_initialize_context(cpu, tidle);
+   if (rc) {
+   xen_smp_intr_free(cpu);
+   return rc;
+   }
+   xen_pmu_init(cpu);
+   }
+
+   if (!rc) {
+  

Re: [PATCH v1 02/12] xen/hvmlite: Factor out common kernel init code

2016-01-22 Thread Boris Ostrovsky

On 01/22/2016 06:01 PM, Luis R. Rodriguez wrote:

On Fri, Jan 22, 2016 at 04:35:48PM -0500, Boris Ostrovsky wrote:

HVMlite guests (to be introduced in subsequent patches) share most
of the kernel initialization code with PV(H).

Signed-off-by: Boris Ostrovsky <boris.ostrov...@oracle.com>
---
  arch/x86/xen/enlighten.c |  225 --
  1 files changed, 119 insertions(+), 106 deletions(-)

diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
index d09e4c9..2cf446a 100644
--- a/arch/x86/xen/enlighten.c
+++ b/arch/x86/xen/enlighten.c

Whoa, I'm lost, it's hard for me to tell what exactly stayed and what
got pulled into a helper, etc. Is there a possibility to split this
patch in 2 somehow to make the actual functional changes easier to
read? There are too many changes here and I just can't tell easily
what's going on.



The only real change that this patch introduces is that it reorders some of 
the operations that used to be in xen_start_kernel(). This is done so 
that in the next patch, when we add hvmlite, we can easily put those 
specific to PV(H) inside 'if (!xen_hvm_domain())'. I probably should 
have said so in the commit message.


It is indeed difficult to review but I don't see how I can split this. 
Even if I just moved it (without reordering) it would still be hard to read.


-boris


Re: [PATCH v1 02/12] xen/hvmlite: Factor out common kernel init code

2016-01-22 Thread Boris Ostrovsky

On 01/22/2016 06:12 PM, Boris Ostrovsky wrote:

On 01/22/2016 06:01 PM, Luis R. Rodriguez wrote:

On Fri, Jan 22, 2016 at 04:35:48PM -0500, Boris Ostrovsky wrote:

HVMlite guests (to be introduced in subsequent patches) share most
of the kernel initialization code with PV(H).

Signed-off-by: Boris Ostrovsky <boris.ostrov...@oracle.com>
---
  arch/x86/xen/enlighten.c |  225 --

  1 files changed, 119 insertions(+), 106 deletions(-)

diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
index d09e4c9..2cf446a 100644
--- a/arch/x86/xen/enlighten.c
+++ b/arch/x86/xen/enlighten.c

Whoa, I'm lost, its hard for me to tell what exactly stayed and what
got pulled into a helper, etc. Is there a possibility to split this
patch in 2 somehow to make the actual functional changes easier to
read? There are too many changes here and I just can't tell easily
what's going on.



The only real changes that this patch introduces is it reorders some 
of the operations that used to be in xen_start_kernel(). This is done 
so that in the next patch when we add hvmlite we can easily put those 
specific to PV(H) inside 'if (!xen_hvm_domain())'. I probably should 
have said so in the commit message.



Actually, I forgot that I merged the 'if' clause into this patch already 
so I'll see if I can separate them again to make it easier on the eye.


-boris




It is indeed difficult to review but I don't see how I can split this. 
Even if I just moved it (without reordering) it would still be hard to 
read.


-boris




Re: [PATCH v1 00/12] HVMlite domU support

2016-01-25 Thread Boris Ostrovsky

On 01/25/2016 05:51 AM, David Vrabel wrote:

On 22/01/16 21:35, Boris Ostrovsky wrote:

This series introduces HVMlite support for unprivileged guests.

It has been tested on Intel/AMD, both 32- and 64-bit, including CPU on- and
offlining and save/restore. (Restore will result in APIC write warnings
which exist now for 32-bit PV guests as well so I didn't address this in
this series)

Can you remove PVH support in this series as well?  We won't necessarily
remove PVH support immediately but I'd like to see the ultimate end result.


I'd rather wait until we have HVMlite dom0 before dropping PVH. If 
nothing else it may help debugging.


BTW, I assume we are going to rename HVMlite to PVH once the original 
implementation is removed.



-boris


Re: [Xen-devel] [PATCH v1 04/12] xen/hvmlite: Bootstrap HVMlite guest

2016-01-25 Thread Boris Ostrovsky

On 01/22/2016 07:30 PM, Andrew Cooper wrote:

On 22/01/2016 23:32, Luis R. Rodriguez wrote:

On Fri, Jan 22, 2016 at 04:35:50PM -0500, Boris Ostrovsky wrote:

+   /*
+* See Documentation/x86/boot.txt.
+*
+* Version 2.12 supports Xen entry point but we will use default x86/PC
+* environment (i.e. hardware_subarch 0).
+*/
+   xen_hvmlite_boot_params.hdr.version = 0x212;
+   xen_hvmlite_boot_params.hdr.type_of_loader = 9; /* Xen loader */
+}

I realize PV got away with setting up boot_params in C code, but best to
ask now that this new code is being introduced: why can't we just have
the Xen hypervisor fill this in? It'd save us all this code.

I agree that this looks to be a mess.  Having said that, the DMLite boot
protocol is OS agnostic, and will be staying that way.

It happens to look suspiciously like multiboot; a flat 32bit protected
mode entry (at a location chosen in an ELF note), with %ebx pointing to
an in-ram structure containing things like a command line and module list.

I would have thought the correct way to do direct Linux support would be
to have a very small init stub which constructs an appropriate zero
page, and lets the native entry point get on with things.


Which is really what 
hvmlite_start_xen()->xen_prepare_hvmlite()->hvmlite_bootparams() is 
doing. Not much more than that (for 64-bit it also loads identity 
mapping because that's what startup_64 wants)


-boris



This covers the usecase where people wish to boot a specific Linux
kernel straight out of the dom0 filesystem.

For the alternative usecase of general OS support, dom0 would boot
something such as grub2 as the DMLite "kernel", at which point all
stooging around in the guest's filesystem is done from guest context,
rather than control context (mitigating a substantial attack surface).

~Andrew




Re: [PATCH v1 01/12] x86/smp: Make start_secondary() and initial_pg_pmd visible globally

2016-01-25 Thread Boris Ostrovsky

On 01/25/2016 05:53 AM, David Vrabel wrote:

On 22/01/16 21:35, Boris Ostrovsky wrote:

Xen's HVMlite guests will want to use them.

Please explain in detail why they are needed.




Actually, start_secondary is not needed anymore, I removed its use at 
some point.


initial_pg_pmd (together with initial_page_table) is not really 
required --- I just used them for temporary page tables in the hvmlite 
startup code instead of allocating dedicated pages for that. Perhaps 
there is another (Xen-specific) area that I can use for that. Once we jump 
to startup_32 these tables are no longer in use.


-boris


Re: [PATCH v1 04/12] xen/hvmlite: Bootstrap HVMlite guest

2016-01-25 Thread Boris Ostrovsky

On 01/22/2016 06:32 PM, Luis R. Rodriguez wrote:

On Fri, Jan 22, 2016 at 04:35:50PM -0500, Boris Ostrovsky wrote:


+/*
+ * This routine (and those that it might call) should not use
+ * anything that lives in .bss since that segment will be cleared later
+ */
+void __init xen_prepare_hvmlite(void)
+{
+   u32 eax, ecx, edx, msr;
+   u64 pfn;
+
+   cpuid(xen_cpuid_base() + 2, &eax, &msr, &ecx, &edx);
+   pfn = __pa(hypercall_page);
+   wrmsr_safe(msr, (u32)pfn, (u32)(pfn >> 32));
+
+   pv_info.name = "Xen HVMlite";
+   xen_domain_type = XEN_HVM_DOMAIN;
+   xen_hvmlite = 1;
+
+   x86_init.oem.arch_setup = xen_init_kernel;
+   x86_init.oem.banner = xen_banner;
+
+   hvmlite_bootparams();
+}
+#endif

If the boot_params.hdr.hardware_subarch_data pointed to a custom
struct then the first C entry point for Xen could shuffle this and
set this too, by still using less asm entry helpers. We'd still
need this run but with the linker table I think we could have
a stub small stub for hvm run, it would not be a call from asm.


Perhaps, but someone would still have to set hardware_subarch. And it's 
hvmlite_bootparams() that does it.


And that's not sufficient, I think. There are still some things that 
trampoline code sets up (e.g. page tables for 64-bit).


-boris




Re: [Xen-devel] [PATCH v1 02/12] xen/hvmlite: Factor out common kernel init code

2016-01-25 Thread Boris Ostrovsky

On 01/25/2016 06:04 AM, David Vrabel wrote:

On 22/01/16 21:35, Boris Ostrovsky wrote:

HVMlite guests (to be introduced in subsequent patches) share most
of the kernel initialization code with PV(H).

Where possible, HVMlite should share initialization with bare metal/HVM
and not PV(H).


This is really platform initialization; for HVMlite it's invoked as 
x86_init.oem.arch_setup(). Things like xen_setup_features(), 
xen_init_apic(), etc.


I suppose I can move it to xen_hvm_guest_init() but then we will have 
some amount of code duplication. There is also a chunk of code there (in 
xen_init_kernel()) that will probably be used for HVMlite dom0.




Sharing any code with the existing PVH code isn't useful, since PVH
support will be removed.


There is nothing PVH-specific; I said "(H)" mostly because that code is 
the same as PV.


-boris



Re: [PATCH v1 04/12] xen/hvmlite: Bootstrap HVMlite guest

2016-01-25 Thread Boris Ostrovsky

On 01/25/2016 04:21 PM, H. Peter Anvin wrote:

On 01/25/16 13:12, Luis R. Rodriguez wrote:

Perhaps, but someone would still have to set hardware_subarch. And
it's hvmlite_bootparams() that does it.

No, Xen would do it as well, essentially all of hvmlite_bootparams() could be
done in Xen.


Or a stub code.


This patch in fact is the stub for Xen HVMlite guests; after we are done 
with it we jump to bare-metal startup code (i.e. startup_32|64).


-boris



Re: [Xen-devel] [PATCH v1 04/12] xen/hvmlite: Bootstrap HVMlite guest

2016-01-25 Thread Boris Ostrovsky

On 01/25/2016 05:19 PM, Luis R. Rodriguez wrote:

On Sat, Jan 23, 2016 at 02:49:36PM +, Andrew Cooper wrote:



it causes inappropriate linkage between the
toolstack and the version of Linux wishing to be booted.

There are ways to do it in such a way that it does not create dependency issues
on Linux specific code.


0) The Xen way and EFI way

A generic data structure is passed to the entry point of the kernel to
load the kernel. The kernel in turn must parse this generic struct,
interpret it, and translate it to kernel-specific values.

1) The qemu way:

One way is to simply not refer to the boot_params data structures, but IMHO that
produces really ugly code. The qemu folks did it this way, so qemu does not use
arch/x86/include/uapi/asm/bootparam.h in any way; instead it uses offsets from
a char *header. Refer to load_linux() in hw/i386/pc.c, for instance:

header[0x211] |= 0x80;  /* CAN_USE_HEAP */

2) The grub way:

Another way, which grub uses, is to define their own data structures based
on arch/x86/include/uapi/asm/bootparam.h, with their own data types, and refer
to them in code; for instance refer to grub_cmd_linux() in the grub file
grub-core/loader/i386/pc/linux.c and its copied definition of the header
in include/grub/i386/linux.h.

lh.loadflags |= GRUB_LINUX_FLAG_CAN_USE_HEAP;

The way lguest does it in the lguest launcher is IMHO the cleanest of course,
but it includes asm/bootparam.h and uses that in code:

/* Boot protocol version: 2.07 supports the fields for lguest. */
boot->hdr.version = 0x207;

But that does mean copying over or using the bootparam.h file and using
linux data types.

3) Merge of xen way and using subarch_data

Only set the subarch and subarch_data pointer, and have the kernel then
read the generic data structure and parse it as it used to, the benefit
is sharing a common entry point.

No one uses this yet. But I think its a reasonable compromise.

Perhaps there are other ways as well.


If true, then by no means, no matter how hard we try, and no matter
what we do on the Linux front to help clean things up will we be able
to have a unified bare metal / Xen entry. I'm noting it could be
possible though provided we do just set the zero page, the subarch to
Xen and subarch_data to the Xen custom data structure.

All you need is a very small stub which starts per the DMLite ABI, sets
up an appropriate zero_page, and jumps to the native entrypoint.

Right.



I am trying to understand what your objection is to what is proposed in 
this patch. Is it the size of the stub? If yes -- what would you like to 
leave out to be done later?



-boris




However, this stub belongs in Linux, not in the Xen toolstack.  That
way, when the Linux boot protocol is modified, both sides can be updated
accordingly.

Makes sense; the different entry points just seem best avoided if possible.

   Luis




Re: [PATCH 3/3] tty: hvc_xen: hide xen_console_remove when unused

2016-01-25 Thread Boris Ostrovsky

On 01/25/2016 04:54 PM, Arnd Bergmann wrote:

xencons_disconnect_backend() is only called from xen_console_remove(),


and also from xencons_probe()/xencons_resume(). But those two are also
under the same ifdef.

-boris


which is conditionally compiled, so we get a harmless warning when
CONFIG_HVC_XEN_FRONTEND is unset:

hvc/hvc_xen.c:350:12: error: 'xen_console_remove' defined but not used [-Werror=unused-function]

This moves the function down into the same #ifdef section to silence
the warning.

Signed-off-by: Arnd Bergmann 
---
  drivers/tty/hvc/hvc_xen.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/tty/hvc/hvc_xen.c b/drivers/tty/hvc/hvc_xen.c
index fa816b7193b6..11725422dacb 100644
--- a/drivers/tty/hvc/hvc_xen.c
+++ b/drivers/tty/hvc/hvc_xen.c
@@ -323,6 +323,7 @@ void xen_console_resume(void)
}
  }
  
+#ifdef CONFIG_HVC_XEN_FRONTEND

  static void xencons_disconnect_backend(struct xencons_info *info)
  {
if (info->irq > 0)
@@ -363,7 +364,6 @@ static int xen_console_remove(struct xencons_info *info)
return 0;
  }
  
-#ifdef CONFIG_HVC_XEN_FRONTEND

  static int xencons_remove(struct xenbus_device *dev)
  {
return xen_console_remove(dev_get_drvdata(&dev->dev));




Re: [PATCH v1 04/12] xen/hvmlite: Bootstrap HVMlite guest

2016-01-26 Thread Boris Ostrovsky

On 01/26/2016 01:46 PM, Andy Lutomirski wrote:

On Tue, Jan 26, 2016 at 10:34 AM, Luis R. Rodriguez <mcg...@suse.com> wrote:

On Mon, Jan 25, 2016 at 05:28:08PM -0500, Boris Ostrovsky wrote:

On 01/25/2016 04:21 PM, H. Peter Anvin wrote:

On 01/25/16 13:12, Luis R. Rodriguez wrote:

Perhaps, but someone would still have to set hardware_subarch. And
it's hvmlite_bootparams() that does it.

No, Xen would do it as well, essentially all of hvmlite_bootparams() could be
done in Xen.


Or a stub code.

This patch in fact is the stub for Xen HVMlite guests, after we are
done with it we jump to bare-metal startup code (i.e startup_32|64)

Right the point is the stub need not be in Linux, I'll explain in the other
thread where I provided more details on the different known approaches.


ISTM if the Xen ABI-specified entry point has a different convention
than the Linux native entry, then the stub should live in Linux.  It
would be just a couple of lines of code, right?


It's not exactly a couple of lines but it's not large either. It mainly 
sets up boot_params (similar to what make_boot_params() does for EFI). 
Plus, for 64-bit, it loads identity page tables and switches to long 
mode. And then jumps to bare-metal startup code.





The issue that caused headaches in the past isn't that there's code
that's executed only on native, it's that there are whole big
functions that are executed only on native for no good reason and that
aren't clearly marked.


This won't happen with HVMlite.



If we had native_start_kernel and xen_start_kernel, and they both
called very quickly in to common_start_kernel, it would be very clear
what's going on.


What is now xen_start_kernel() is no longer the entry point for HVMlite. 
It is called as x86_init.oem.arch_setup() (or I may even move it to 
x86_hyper_xen.init_platform() or something like that).


-boris


Re: [Xen-devel] [PATCH v1 04/12] xen/hvmlite: Bootstrap HVMlite guest

2016-01-26 Thread Boris Ostrovsky

On 01/26/2016 03:30 PM, Luis R. Rodriguez wrote:


What I'm proposing?

1) Lets see if we can put a proactive stop-gap to issues such as the cr4 shadow
bug and Kasan bugs from creeping in. This should not only help PV but perhaps
HVMLite. If you'd like to help with that refer to this thread:

http://lkml.kernel.org/r/CAB=NE6VTCRCazcNpCdJ7pN1eD3=x_fcgodh37mzvpxkken5...@mail.gmail.com

2) asm code sharing prospects - rebranding of PVH to HVMlite

It sounds like HVMlite is really just a clean PVH implementation so we'll
be doing this it seems according to this approach:

a) Since the PVH brand is taken, add the new clean Xen solution as the
"HVMlite implementation"
b) Drop PVH
c) Re-brand HVMLite as PVH

Is there really no asm code to share between PV and HVMlite ?


Not the early boot code.


  How about between
PV and other Linux asm ? Specifically I'm talking about a huge new entry point
called hvmlite_start_xen() of asm code. How much sharing or duplication
is being avoided here?


I don't see that we can share much here. Especially given that 
hvmlite_start() is a 32-bit entry point so we can't start a 64-bit PV guest.


I suppose we can find some common (C) code that sets boot_params but on 
the first sight they are quite different too.


(And whether it's huge is a matter of opinion ;-). It is more than a 
couple of instructions, I'll give you that)




3) C code sharing prospects: rebranding of PVH to HVMlite

This code seems to share a lot from PV guest. Can we combine?


HVMlite closely follows HVM (and baremetal) code path. There is some 
common setup code with PV (which is what the second patch in this series 
is trying to factor out)




4) hardware_subarch, hardware_subarch_data and future prospects

Your patch relies on a *new* Linux entry point. Sure, we had one
for EFI, and sure there is another for Xen PV, but since you're just
rebranding PVH to HVMlite and given historic issues with any new
Linux entry points I'd like for us to take a breather and evaluate
the extent our lofty unification goals, and how the x86 boot protocol
could help with this already.


I am not sure I see how you can avoid having a new entry point. For 
example, all HVMlite guests start in 32-bit mode. Who will switch to 
long mode?




Now, perhaps it may seem difficult today to unify asm entry points
all over, but if we can take baby steps towards that I think that
should be seriously evaluated.

For instance, we should consider relying on hardware_subarch and
hardware_subarch_data to fetch the hvmlite_start_info by taking advantage of
the x86 boot protocol.  The hvmlite_start_info is what Xen uses to send us info
about the guest as your patch proposes (this matches the Xen PV style entry),
we stash it into a standard Linux boot_params variable called
xen_hvmlite_boot_params for the Xen guest in hvmlite_bootparam(). This
data structure and code seems to match very much the PV guest structure,


No, the two don't match at all.


why not just use a union and differentiate on the PV subtype? If you want to
avoid a lot of PV calls for HVMlite it seems you could just take advantage of
the Xen subarch type, and differentiate on the subarch_data within Xen code
to make that code just PV-specific.

I only see gains by using the Xen subarch, so would like to know why PC is
being pushed.


It's not that subarch 0 is being pushed here. It's that I don't see how 
it can be useful for this particular guest type. Maybe as we add new 
features (or discover something that we've missed up till now) we can 
switch to using it. If you think we should delay initializing 
boot_params until then --- I will disagree: boot_params are used before 
we look at subarch and I don't believe it makes sense to pick and choose 
what to initialize before and what can wait.


(And I am not sure it can be useful on PV either, at least the way it 
is used now. You will not reach the point in the (32-bit) code where it 
is tested. You will die way earlier (I think on startup_32()'s fourth 
instruction).)



-boris


Re: [Xen-devel] [PATCH v1 04/12] xen/hvmlite: Bootstrap HVMlite guest

2016-01-27 Thread Boris Ostrovsky

On 01/27/2016 09:50 AM, David Vrabel wrote:

On 27/01/16 14:42, Konrad Rzeszutek Wilk wrote:

On Tue, Jan 26, 2016 at 08:54:56PM -0800, Luis R. Rodriguez wrote:

On Jan 26, 2016 6:16 PM, "Luis R. Rodriguez"  wrote:

On Tue, Jan 26, 2016 at 4:04 PM, Luis R. Rodriguez 

wrote:

You go:

hvmlite_start_xen() -->
 HVM stub
 startup_64() | startup_32()

Hrm, does HVMlite work well with load_ucode_bsp(), note the patches to
rebrand pv_enabled() to pv_legacy() or whatever, this PV type will not
be legacy or crap / old, so we'd need a way to catch it if we should
not use that code for this PV type. This begs the question, are you
also sure other callers in startup_32() or startup_64() might be OK as
well where previously guarded with pv_enabled() ?

Actually this call can't be used, and if early code used it prior to
setup_arch() it'd be a bug as it's only properly set later. Vetting
for correctness of all code calls is still required though, and perhaps we do
need something to catch this PV type in early code such as this one if
we don't want it. From what I've gathered before on other BSP ucode we
don't want ucode loaded for PV guest types through these mechanisms.

It may help to not think of PVH/hvmlite as PV. It really is HVM with a lot
of emulated devices removed.

How does early microcode work on EFI? Does the EFI stub code have an early
microcode loading code?

Surely the interesting comparison here is how is (early) microcode
loading disabled in KVM guests?  We should use the same mechanism for
HVMlite guests.



Why would we ever want to have a guest load microcode during boot? I can 
see how a (privileged) guest may want to load microcode from a shell 
(via microcode driver).


-boris



Re: [Xen-devel] [PATCH v1 04/12] xen/hvmlite: Bootstrap HVMlite guest

2016-01-27 Thread Boris Ostrovsky

On 01/27/2016 10:09 AM, David Vrabel wrote:

On 27/01/16 15:06, Boris Ostrovsky wrote:

On 01/27/2016 09:50 AM, David Vrabel wrote:

On 27/01/16 14:42, Konrad Rzeszutek Wilk wrote:

On Tue, Jan 26, 2016 at 08:54:56PM -0800, Luis R. Rodriguez wrote:

On Jan 26, 2016 6:16 PM, "Luis R. Rodriguez" <mcg...@suse.com> wrote:

On Tue, Jan 26, 2016 at 4:04 PM, Luis R. Rodriguez <mcg...@suse.com>

wrote:

You go:

hvmlite_start_xen() -->
  HVM stub
  startup_64() | startup_32()

Hrm, does HVMlite work well with load_ucode_bsp(), note the patches to
rebrand pv_enabled() to pv_legacy() or whatever, this PV type will not
be legacy or crap / old, so we'd need a way to catch it if we should
not use that code for this PV type. This begs the question, are you
also sure other callers in startup_32() or startup_64() might be OK as
well where previously guarded with pv_enabled() ?

Actually this call can't be used, and if early code used it prior to
setup_arch() it'd be a bug as it's only properly set later. Vetting
for correctness of all code calls is still required though and perhaps
we do need something to catch this PV type now in early code such as
this one if we don't want it. From what I've gathered before on other
bsp ucode we don't want ucode loaded for PV guest types through these
mechanisms.

It may help to not think of PVH/hvmlite as PV. It really is HVM with a
lot of emulated devices removed.

How does early microcode work on EFI? Does the EFI stub code have an
early microcode loading code?

Surely the interesting comparison here is how is (early) microcode
loading disabled in KVM guests?  We should use the same mechanism for



HVMlite guests.


Why would we ever want to have a guest load microcode during boot? I can
see how a (privileged) guest may want to load microcode from a shell
(via microcode driver).

I think you missed a word when you read my reply.


Yes, I missed it ;-)

Why not continue relying on paravirt_enabled()? We are going to keep 
this in some form for HVMlite.


-boris



Re: [PATCH 4/4] xen/pcifront: Fix mysterious crashes when NUMA locality information was extracted.

2016-02-13 Thread Boris Ostrovsky

On 02/11/2016 04:10 PM, Konrad Rzeszutek Wilk wrote:


This patch fixes the issue by:
  1) Use kzalloc to initialize to a well known state.
  2) Put 'struct pci_sysdata' at the start of 'pcifront_sd'. That
 way access to the 'node' will access the right offset.

CC: sta...@vger.kernel.org
Signed-off-by: Konrad Rzeszutek Wilk <konrad.w...@oracle.com>


Reviewed-by: Boris Ostrovsky <boris.ostrov...@oracle.com>

(This, btw, is the second time we got bitten by pcifront_sd not 
being pci_sysdata. dc4fdaf0e48 was a workaround for a similar problem 
and we should have fixed it then.)


-boris


Re: [PATCH v2 3/3] paravirt: rename paravirt_enabled to paravirt_legacy

2016-02-17 Thread Boris Ostrovsky

On 02/17/2016 03:49 PM, Borislav Petkov wrote:

On Wed, Feb 17, 2016 at 12:07:13PM -0800, Luis R. Rodriguez wrote:

OK so here's a wiki to keep track of progress of the difference uses:

http://kernelnewbies.org/KernelProjects/remove-paravirt-enabled

It seems we have a resolution one way or another for all except for
the use on arch/x86/mm/dump_pagetables.c, is that right?


Why not?

I think we should simply check the range as ffff800000000000 -
ffff87ffffffffff is practically an ABI and nothing should be mapped

^

That's exactly the point: if something is mapped it's an error for a 
non-PV kernel.


By removing paravirt_enabled() we may miss those errors. Worse, I think 
we may even crash while doing pagetable walk (although it's probably 
better to crash here than to use an unexpected translation in real code 
somewhere)


-boris



there anyway. No need for paravirt_enabled() there either.





Re: [PATCH v2 3/3] paravirt: rename paravirt_enabled to paravirt_legacy

2016-02-17 Thread Boris Ostrovsky

On 02/17/2016 05:03 PM, Borislav Petkov wrote:

On Wed, Feb 17, 2016 at 04:21:56PM -0500, Boris Ostrovsky wrote:

That's exactly the point: if something is mapped it's an error for a
non-PV kernel.

How would something be mapped there? __PAGE_OFFSET is
0xffff880000000000.

Or are you thinking about some insanely b0rked kernel code mapping stuff
in there?


The latter. This is to detect things that clearly shouldn't be happening.




By removing paravirt_enabled() we may miss those errors. Worse, I think we
may even crash while doing pagetable walk (although it's probably better to
crash here than to use an unexpected translation in real code somewhere)

Well, if this is the only site which keeps paravirt_enabled() from being
buried, we need to think about a better way to detect a hypervisor.
Maybe we should look at x86_hyper, use CPUID(0x4...)


Can't use CPUID 0x40000000 because it will return hypervisor (Xen or 
otherwise) for non-PV guests as well. In Xen's case, you can't determine 
guest type from hypervisor leaves.




or something else.


We could say xen_pv_domain(). But this means using Xen-specific code in 
x86-generic file to detect things specified by ABI. I don't know if I'd 
like that.


-boris



What's your preference?





Re: [PATCH v2 3/3] paravirt: rename paravirt_enabled to paravirt_legacy

2016-02-17 Thread Boris Ostrovsky

On 02/17/2016 05:18 PM, Andy Lutomirski wrote:

On Wed, Feb 17, 2016 at 2:03 PM, Borislav Petkov <b...@alien8.de> wrote:

On Wed, Feb 17, 2016 at 04:21:56PM -0500, Boris Ostrovsky wrote:

That's exactly the point: if something is mapped it's an error for a
non-PV kernel.

How would something be mapped there? __PAGE_OFFSET is
0xffff880000000000.

Or are you thinking about some insanely b0rked kernel code mapping stuff
in there?


By removing paravirt_enabled() we may miss those errors. Worse, I think we
may even crash while doing pagetable walk (although it's probably better to
crash here than to use an unexpected translation in real code somewhere)

Well, if this is the only site which keeps paravirt_enabled() from being
buried, we need to think about a better way to detect a hypervisor.
Maybe we should look at x86_hyper, use CPUID(0x4...) or something else.

What's your preference?

I'm confused.  Isn't it the other way around?  That is, checking for
the hypervisor range on all systems should be safer, right?  Or am I
missing something?


Hmm. I think you are right --- I was following wrong branch of the 'if' 
statement. We are always going straight to note_page().


Then yes, we should be able to remove paravirt_enabled(). Sorry for the 
noise.


-boris


Re: [PATCH] x86/ptdump: Remove paravirt_enabled()

2016-02-18 Thread Boris Ostrovsky



On 02/18/2016 03:00 PM, Borislav Petkov wrote:

From: Borislav Petkov <b...@suse.de>

is_hypervisor_range() can simply check if the PGD index is
within ffff800000000000 - ffff87ffffffffff which is the range
reserved for a hypervisor. That range is practically an ABI, see
Documentation/x86/x86_64/mm.txt.

Signed-off-by: Borislav Petkov <b...@suse.de>
Cc: Andy Lutomirski <l...@amacapital.net>
Cc: Boris Ostrovsky <boris.ostrov...@oracle.com>
Cc: "Luis R. Rodriguez" <mcg...@kernel.org>


Under Xen, as PV guest:

Tested-by: Boris Ostrovsky <boris.ostrov...@oracle.com>




Re: [PATCH 5/9] apm32: remove paravirt_enabled() use

2016-02-19 Thread Boris Ostrovsky



On 02/19/2016 08:08 AM, Luis R. Rodriguez wrote:

There is already a check for apm_info.bios == 0; the
apm_info.bios is set from the boot_params.apm_bios_info.
Both Xen and lguest, which are also the only ones that set
paravirt_enabled to true, never set the apm_bios_info, so
the paravirt_enabled() check is simply not needed.


We need to guarantee that boot_params is filled with zeroes. On the 
baremetal path we clear .bss (which is where boot_params lives) before 
copying data from the zero page.


So we need to at least memset(&boot_params, 0, sz) in 
xen_start_kernel(). Better yet, clear the whole .bss.


(This applies to the next patch as well).

-boris



Re: [RFC v2 1/6] x86/boot: add BIT() to boot/bitops.h

2016-02-19 Thread Boris Ostrovsky



On 02/19/2016 09:15 AM, Luis R. Rodriguez wrote:

diff --git a/arch/x86/boot/bitops.h b/arch/x86/boot/bitops.h
index 878e4b9940d9..232cff0ff4e3 100644
--- a/arch/x86/boot/bitops.h
+++ b/arch/x86/boot/bitops.h
@@ -40,4 +40,6 @@ static inline void set_bit(int nr, void *addr)
asm("btsl %1,%0" : "+m" (*(u32 *)addr) : "Ir" (nr));
  }
  
+#define BIT(x)	(1 << x)


Parenthesize x.



Re: [PATCH 5/9] apm32: remove paravirt_enabled() use

2016-02-19 Thread Boris Ostrovsky



On 02/19/2016 03:58 PM, Luis R. Rodriguez wrote:

On Fri, Feb 19, 2016 at 10:08:43AM -0500, Boris Ostrovsky wrote:


in xen_start_kernel(). Better yet, clear whole .bss.

(This applies to the next patch as well).

So clear_bss() -- oh look, another call that xen_start_kernel() could have made
good use of. :)  Can you send a respective patch I can fold into this series?
I'm afraid it is by no means obvious to me where it would be safe to do this in
xen_start_kernel().


OK, I'll test it next week, it really should be done irrespective of 
what you are doing here.


-boris



Re: [PATCH 4/4] xen/pcifront: Fix mysterious crashes when NUMA locality information was extracted.

2016-02-15 Thread Boris Ostrovsky

On 02/15/2016 09:05 AM, Konrad Rzeszutek Wilk wrote:

On Sat, Feb 13, 2016 at 08:23:14PM -0500, Boris Ostrovsky wrote:

On 02/11/2016 04:10 PM, Konrad Rzeszutek Wilk wrote:

This patch fixes the issue by:
  1) Use kzalloc to initialize to a well known state.
  2) Put 'struct pci_sysdata' at the start of 'pcifront_sd'. That
 way access to the 'node' will access the right offset.

CC: sta...@vger.kernel.org
Signed-off-by: Konrad Rzeszutek Wilk <konrad.w...@oracle.com>

Reviewed-by: Boris Ostrovsky <boris.ostrov...@oracle.com>

(This, btw, is the second time we got bitten by pcifront_sd not being
pci_sysdata. dc4fdaf0e48 was a workaround for a similar problem and we
should have fixed it then.)

I think that the dc4fdaf0e4839109169d8261814813816951c75f commit can be
reverted then?

Ah no, b/c:
"Fixes: 97badf873ab6 (device property: Make it possible to use secondary 
firmware nodes)"


Right --- the problem which that commit fixed was not specific to Xen 
but could be observed on ia64 as well.


-boris



Re: [Xen-devel] [patch 1/4] hotplug: Prevent alloc/free of irq descriptors during cpu up/down

2016-03-14 Thread Boris Ostrovsky

On 03/12/2016 04:19 AM, Thomas Gleixner wrote:

Boris,

On Tue, 14 Jul 2015, Boris Ostrovsky wrote:

On 07/14/2015 04:15 PM, Thomas Gleixner wrote:

The issue here is that all architectures need that protection and just
Xen does irq allocations in cpu_up.

So moving that protection into architecture code is not really an
option.


Otherwise we will need to have something like arch_post_cpu_up()
after the lock is released.

I'm not sure, that this will work. You probably want to do this in the
cpu prepare stage, i.e. before calling __cpu_up().

For PV guests (the ones that use xen_cpu_up()) it will work either
before or after __cpu_up(). At least my (somewhat limited) testing
didn't show any problems so far.

However, HVM CPUs use xen_hvm_cpu_up() and if you read comments there
you will see that xen_smp_intr_init() needs to be called before
native_cpu_up() but xen_init_lock_cpu() (which eventually calls
irq_alloc_descs()) needs to be called after.

I think I can split xen_init_lock_cpu() so that the part that needs to
be called after will avoid going into irq core code. And then the rest
will go into arch_cpu_prepare().

I think we should revisit this for 4.3. For 4.2 we can do the trivial
variant and move the locking in native_cpu_up() and x86 only. x86 was
the only arch on which such wreckage has been seen in the wild, but we
should have that protection for all archs in the long run.

Patch below should fix the issue.

Thanks! Most of my tests passed, I had a couple of failures but I will need to
see whether they are related to this patch.

Did you ever come around to address that irq allocation from within cpu_up()?

I really want to generalize the protection instead of carrying that x86 only
hack forever.


Sorry, I completely forgot about this. Let me see how I can move the 
allocations out from under the lock. I might just be able to put them in CPU 
notifiers --- most into CPU_UP_PREPARE but spinlock interrupt may need 
to go into CPU_ONLINE.


-boris



[PATCH 1/2] xen/x86: Move irq allocation from Xen smp_op.cpu_up()

2016-03-19 Thread Boris Ostrovsky
Commit ce0d3c0a6fb1 ("genirq: Revert sparse irq locking around
__cpu_up() and move it to x86 for now") reverted irq locking
introduced by commit a89941816726 ("hotplug: Prevent alloc/free
of irq descriptors during cpu up/down") because of Xen allocating
irqs in both of its cpu_up ops.

We can move those allocations into CPU notifiers so that original
patch can be reinstated.

Signed-off-by: Boris Ostrovsky <boris.ostrov...@oracle.com>
---
 arch/x86/xen/enlighten.c |   53 ++---
 arch/x86/xen/smp.c   |   45 +-
 arch/x86/xen/smp.h   |3 ++
 3 files changed, 49 insertions(+), 52 deletions(-)

diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
index 2c26108..d1a86db 100644
--- a/arch/x86/xen/enlighten.c
+++ b/arch/x86/xen/enlighten.c
@@ -137,6 +137,8 @@ RESERVE_BRK(shared_info_page_brk, PAGE_SIZE);
 __read_mostly int xen_have_vector_callback;
 EXPORT_SYMBOL_GPL(xen_have_vector_callback);
 
+static struct notifier_block xen_cpu_notifier;
+
 /*
  * Point at some empty memory to start with. We map the real shared_info
  * page as soon as fixmap is up and running.
@@ -1596,6 +1598,7 @@ asmlinkage __visible void __init xen_start_kernel(void)
	xen_initial_gdt = &per_cpu(gdt_page, 0);
 
xen_smp_init();
+   register_cpu_notifier(&xen_cpu_notifier);
 
 #ifdef CONFIG_ACPI_NUMA
/*
@@ -1783,17 +1786,49 @@ static void __init init_hvm_pv_info(void)
xen_domain_type = XEN_HVM_DOMAIN;
 }
 
-static int xen_hvm_cpu_notify(struct notifier_block *self, unsigned long 
action,
- void *hcpu)
+static int xen_cpu_notify(struct notifier_block *self, unsigned long action,
+ void *hcpu)
 {
int cpu = (long)hcpu;
+   int rc;
+
switch (action) {
case CPU_UP_PREPARE:
-   xen_vcpu_setup(cpu);
-   if (xen_have_vector_callback) {
-   if (xen_feature(XENFEAT_hvm_safe_pvclock))
-   xen_setup_timer(cpu);
+   if (xen_hvm_domain()) {
+   /*
+* This can happen if CPU was offlined earlier and
+* offlining timed out in common_cpu_die().
+*/
+   if (cpu_report_state(cpu) == CPU_DEAD_FROZEN) {
+   xen_smp_intr_free(cpu);
+   xen_uninit_lock_cpu(cpu);
+   }
+
+   xen_vcpu_setup(cpu);
}
+
+   if (xen_pv_domain() ||
+   (xen_have_vector_callback &&
+xen_feature(XENFEAT_hvm_safe_pvclock)))
+   xen_setup_timer(cpu);
+
+   rc = xen_smp_intr_init(cpu);
+   if (rc) {
+   WARN(1, "xen_smp_intr_init() for CPU %d failed: %d\n",
+cpu, rc);
+   return NOTIFY_BAD;
+   }
+
+   break;
+   case CPU_ONLINE:
+   xen_init_lock_cpu(cpu);
+   break;
+   case CPU_UP_CANCELED:
+   xen_smp_intr_free(cpu);
+   if (xen_pv_domain() ||
+   (xen_have_vector_callback &&
+xen_feature(XENFEAT_hvm_safe_pvclock)))
+   xen_teardown_timer(cpu);
break;
default:
break;
@@ -1801,8 +1836,8 @@ static int xen_hvm_cpu_notify(struct notifier_block 
*self, unsigned long action,
return NOTIFY_OK;
 }
 
-static struct notifier_block xen_hvm_cpu_notifier = {
-   .notifier_call  = xen_hvm_cpu_notify,
+static struct notifier_block xen_cpu_notifier = {
+   .notifier_call  = xen_cpu_notify,
 };
 
 #ifdef CONFIG_KEXEC_CORE
@@ -1834,7 +1869,7 @@ static void __init xen_hvm_guest_init(void)
if (xen_feature(XENFEAT_hvm_callback_vector))
xen_have_vector_callback = 1;
xen_hvm_smp_init();
-   register_cpu_notifier(&xen_hvm_cpu_notifier);
+   register_cpu_notifier(&xen_cpu_notifier);
xen_unplug_emulated_devices();
x86_init.irqs.intr_init = xen_init_IRQ;
xen_hvm_init_time_ops();
diff --git a/arch/x86/xen/smp.c b/arch/x86/xen/smp.c
index 719cf29..09d5cc0 100644
--- a/arch/x86/xen/smp.c
+++ b/arch/x86/xen/smp.c
@@ -115,7 +115,7 @@ asmlinkage __visible void cpu_bringup_and_idle(int cpu)
cpu_startup_entry(CPUHP_AP_ONLINE_IDLE);
 }
 
-static void xen_smp_intr_free(unsigned int cpu)
+void xen_smp_intr_free(unsigned int cpu)
 {
if (per_cpu(xen_resched_irq, cpu).irq >= 0) {
unbind_from_irqhandler(per_cpu(xen_resched_irq, cpu).irq, NULL);
@@ -159,7 +159,7 @@ static void xen_smp_intr_free(unsigned int cpu)
per_cpu(xen_pmu_irq, cpu).name = NULL;
}
 };
-static int xen_smp_intr_init(unsigned int cpu)
+in

[PATCH 1/2] xen/apic: Provide Xen-specific version of cpu_present_to_apicid APIC op

2016-03-19 Thread Boris Ostrovsky
Currently Xen uses default_cpu_present_to_apicid() which will always
report BAD_APICID for PV guests since x86_bios_cpu_apic_id is initialised
to that value and is never updated.

With commit 1f12e32f4cd5 ("x86/topology: Create logical package id"), this
op is now called by smp_init_package_map() when deciding whether to call
topology_update_package_map() which sets cpu_data(cpu).logical_proc_id.
The latter (as topology_logical_package_id(cpu)) may be used, for example,
by cpu_to_rapl_pmu() as an array index. Since uninitialized
logical_package_id is set to -1, the index will become 64K which is
obviously problematic.

While RAPL code (and any other users of logical_package_id) should be
careful in their assumptions about id's validity, Xen's
cpu_present_to_apicid op should still provide value consistent with its
own xen_apic_read(APIC_ID).

Signed-off-by: Boris Ostrovsky <boris.ostrov...@oracle.com>
---
 arch/x86/xen/apic.c |   12 ++--
 1 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/arch/x86/xen/apic.c b/arch/x86/xen/apic.c
index abf4901..db52a7f 100644
--- a/arch/x86/xen/apic.c
+++ b/arch/x86/xen/apic.c
@@ -66,7 +66,7 @@ static u32 xen_apic_read(u32 reg)
 
	ret = HYPERVISOR_platform_op(&op);
if (ret)
-   return 0;
+   op.u.pcpu_info.apic_id = BAD_APICID;
 
return op.u.pcpu_info.apic_id << 24;
 }
@@ -142,6 +142,14 @@ static void xen_silent_inquire(int apicid)
 {
 }
 
+static int xen_cpu_present_to_apicid(int cpu)
+{
+   if (cpu_present(cpu))
+   return xen_get_apic_id(xen_apic_read(APIC_ID));
+   else
+   return BAD_APICID;
+}
+
 static struct apic xen_pv_apic = {
.name   = "Xen PV",
.probe  = xen_apic_probe_pv,
@@ -162,7 +170,7 @@ static struct apic xen_pv_apic = {
 
.ioapic_phys_id_map = default_ioapic_phys_id_map, /* Used 
on 32-bit */
.setup_apic_routing = NULL,
-   .cpu_present_to_apicid  = default_cpu_present_to_apicid,
+   .cpu_present_to_apicid  = xen_cpu_present_to_apicid,
.apicid_to_cpu_present  = physid_set_mask_of_physid, /* Used on 
32-bit */
.check_phys_apicid_present  = default_check_phys_apicid_present, /* 
smp_sanity_check needs it */
.phys_pkg_id= xen_phys_pkg_id, /* detect_ht */
-- 
1.7.1



[PATCH v2] xen/events: Mask a moving irq

2016-03-19 Thread Boris Ostrovsky
Moving an unmasked irq may result in irq handler being invoked on both
source and target CPUs.

With 2-level this can happen as follows:

On source CPU:
evtchn_2l_handle_events() ->
generic_handle_irq() ->
handle_edge_irq() ->
   eoi_pirq():
   irq_move_irq(data);

   /* WE ARE HERE */

   if (VALID_EVTCHN(evtchn))
   clear_evtchn(evtchn);

If at this moment target processor is handling an unrelated event in
evtchn_2l_handle_events()'s loop it may pick up our event since target's
cpu_evtchn_mask claims that this event belongs to it *and* the event is
unmasked and still pending. At the same time, source CPU will continue
executing its own handle_edge_irq().

With FIFO interrupt the scenario is similar: irq_move_irq() may result
in a EVTCHNOP_unmask hypercall which, in turn, may make the event
pending on the target CPU.

We can avoid this situation by moving and clearing the event while
keeping event masked.

Signed-off-by: Boris Ostrovsky <boris.ostrov...@oracle.com>
Cc: sta...@vger.kernel.org
---

v2:
 * Use irq_move_masked_irq() instead of irq_move_irq()
 * Reorganize code flow

 drivers/xen/events/events_base.c |   28 
 1 files changed, 24 insertions(+), 4 deletions(-)

diff --git a/drivers/xen/events/events_base.c b/drivers/xen/events/events_base.c
index 524c221..4436778 100644
--- a/drivers/xen/events/events_base.c
+++ b/drivers/xen/events/events_base.c
@@ -484,9 +484,19 @@ static void eoi_pirq(struct irq_data *data)
struct physdev_eoi eoi = { .irq = pirq_from_irq(data->irq) };
int rc = 0;
 
-   irq_move_irq(data);
+   if (!VALID_EVTCHN(evtchn))
+   return;
 
-   if (VALID_EVTCHN(evtchn))
+   if (unlikely(irqd_is_setaffinity_pending(data))) {
+   int masked = test_and_set_mask(evtchn);
+
+   clear_evtchn(evtchn);
+
+   irq_move_masked_irq(data);
+
+   if (!masked)
+   unmask_evtchn(evtchn);
+   } else
clear_evtchn(evtchn);
 
if (pirq_needs_eoi(data->irq)) {
@@ -1357,9 +1367,19 @@ static void ack_dynirq(struct irq_data *data)
 {
int evtchn = evtchn_from_irq(data->irq);
 
-   irq_move_irq(data);
+   if (!VALID_EVTCHN(evtchn))
+   return;
 
-   if (VALID_EVTCHN(evtchn))
+   if (unlikely(irqd_is_setaffinity_pending(data))) {
+   int masked = test_and_set_mask(evtchn);
+
+   clear_evtchn(evtchn);
+
+   irq_move_masked_irq(data);
+
+   if (!masked)
+   unmask_evtchn(evtchn);
+   } else
clear_evtchn(evtchn);
 }
 
-- 
1.7.1



[PATCH 0/2] Reinstate irq alloc/dealloc locking patch

2016-03-19 Thread Boris Ostrovsky
Original version of that patch (commit a89941816726) had to be reverted
due to Xen allocating irqs in its cpu_up ops.

The first patch moves allocations into hotplug notifiers and the second
one restores the original patch (with minor adjustments to new hotplug
framework)

Boris Ostrovsky (2):
  xen/x86: Move irq allocation from Xen smp_op.cpu_up()
  hotplug: Prevent alloc/free of irq descriptors during cpu up/down
(again)

 arch/x86/kernel/smpboot.c |   11 -
 arch/x86/xen/enlighten.c  |   53 +---
 arch/x86/xen/smp.c|   45 +
 arch/x86/xen/smp.h|3 ++
 kernel/cpu.c  |8 ++
 5 files changed, 57 insertions(+), 63 deletions(-)



Re: [Xen-devel] [PATCH v4 0/5] [PATCH v3 0/5] Improve non-"safe" MSR access failure handling

2016-03-14 Thread Boris Ostrovsky

On 03/12/2016 01:08 PM, Andy Lutomirski wrote:

Setting CONFIG_PARAVIRT=y has an unintended side effect: it silently
turns all rdmsr and wrmsr operations into the safe variants without
any checks that the operations actually succeed.

With CONFIG_PARAVIRT=n, unchecked MSR failures OOPS and probably
cause boot to fail if they happen before init starts.

Neither behavior is very good, and it's particularly unfortunate that
the behavior changes depending on CONFIG_PARAVIRT.

In particular, KVM guests might be unwittingly depending on the
PARAVIRT=y behavior because CONFIG_KVM_GUEST currently depends on
CONFIG_PARAVIRT, and, because accesses in that case are completely
unchecked, we wouldn't even see a warning.

This series changes the native behavior, regardless of
CONFIG_PARAVIRT.  A non-"safe" MSR failure will give an informative
warning once and will be fixed up -- native_read_msr will return
zero, and both reads and writes will continue where they left off.

If panic_on_oops is set, they will still OOPS and panic.

By using the shiny new custom exception handler infrastructure,
there should be no overhead on the success paths.

I didn't change the behavior on Xen, but, with this series applied,
it would be straightforward for the Xen maintainers to make the
corresponding change -- knowledge of whether the access is "safe" is
now propagated into the pvops.

Doing this is probably a prerequisite to sanely decoupling
CONFIG_KVM_GUEST and CONFIG_PARAVIRT, which would probably make
Arjan and the rest of the Clear Containers people happy :)

There's also room to reduce the code size of the "safe" variants
using custom exception handlers in the future.

Changes from v3:
  - WARN_ONCE instead of WARN (Ingo)
  - In the warning text, s/unsafe/unchecked/ (Ingo, sort of)

Changes from earlier versions: lots of changes!

Andy Lutomirski (5):
   x86/paravirt: Add _safe to the read_msr and write_msr PV hooks
   x86/msr: Carry on after a non-"safe" MSR access fails without
 !panic_on_oops
   x86/paravirt: Add paravirt_{read,write}_msr
   x86/paravirt: Make "unsafe" MSR accesses unsafe even if PARAVIRT=y
   x86/msr: Set the return value to zero when native_rdmsr_safe fails

  arch/x86/include/asm/msr.h| 20 
  arch/x86/include/asm/paravirt.h   | 45 +--
  arch/x86/include/asm/paravirt_types.h | 14 +++
  arch/x86/kernel/paravirt.c|  6 +++--
  arch/x86/mm/extable.c | 33 +
  arch/x86/xen/enlighten.c  | 27 +++--
  6 files changed, 114 insertions(+), 31 deletions(-)



I don't see any issues as far as Xen is concerned but let me run this 
through our nightly. I'll wait for the next version (which I think you 
might have based on the comments). Please copy me.


-boris


Re: [PATCH v4 04/14] x86/rtc: replace paravirt rtc check with platform legacy quirk

2016-04-07 Thread Boris Ostrovsky

On 04/06/2016 08:06 PM, Luis R. Rodriguez wrote:

We have 4 types of x86 platforms that disable RTC:

   * Intel MID
   * Lguest - uses paravirt
   * Xen dom-U - uses paravirt
   * x86 on legacy systems annotated with an ACPI legacy flag

We can consolidate all of these into a platform specific legacy
quirk set early in boot through i386_start_kernel() and through
x86_64_start_reservations(). This deals with the RTC quirks which
we can rely on through the hardware subarch, the ACPI check can
be dealt with separately.

v2: split the subarch check from the ACPI check, clarify
 on the ACPI change commit log why ordering works

Suggested-by: Ingo Molnar 
Signed-off-by: Luis R. Rodriguez 
---
  arch/x86/Makefile |  1 +
  arch/x86/include/asm/paravirt.h   |  6 --
  arch/x86/include/asm/paravirt_types.h |  5 -
  arch/x86/include/asm/processor.h  |  1 -
  arch/x86/include/asm/x86_init.h   | 13 +
  arch/x86/kernel/Makefile  |  6 +-
  arch/x86/kernel/head32.c  |  2 ++
  arch/x86/kernel/head64.c  |  1 +
  arch/x86/kernel/platform-quirks.c | 18 ++
  arch/x86/kernel/rtc.c |  7 ++-
  arch/x86/lguest/boot.c|  1 -
  arch/x86/xen/enlighten.c  |  3 ---
  12 files changed, 42 insertions(+), 22 deletions(-)
  create mode 100644 arch/x86/kernel/platform-quirks.c

diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index 4086abca0b32..f9ed8a7ce2b6 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -209,6 +209,7 @@ endif
  head-y := arch/x86/kernel/head_$(BITS).o
  head-y += arch/x86/kernel/head$(BITS).o
  head-y += arch/x86/kernel/head.o
+head-y += arch/x86/kernel/platform-quirks.o
  
  libs-y  += arch/x86/lib/
  
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h

index 601f1b8f9961..6c7a4a192032 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -20,12 +20,6 @@ static inline int paravirt_enabled(void)
return pv_info.paravirt_enabled;
  }
  
-static inline int paravirt_has_feature(unsigned int feature)

-{
-   WARN_ON_ONCE(!pv_info.paravirt_enabled);
-   return (pv_info.features & feature);
-}
-
  static inline void load_sp0(struct tss_struct *tss,
 struct thread_struct *thread)
  {
diff --git a/arch/x86/include/asm/paravirt_types.h 
b/arch/x86/include/asm/paravirt_types.h
index e8c2326478c8..6acc1b26cf40 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -70,14 +70,9 @@ struct pv_info {
  #endif
  
  	int paravirt_enabled;

-   unsigned int features;/* valid only if paravirt_enabled is set */
const char *name;
  };
  
-#define paravirt_has(x) paravirt_has_feature(PV_SUPPORTED_##x)

-/* Supported features */
-#define PV_SUPPORTED_RTC	(1<<0)
-
  struct pv_init_ops {
/*
 * Patch may replace one of the defined code sequences with
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 9264476f3d57..0c70c7daa6b8 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -474,7 +474,6 @@ static inline unsigned long current_top_of_stack(void)
  #else
  #define __cpuid   native_cpuid
#define paravirt_enabled()	0
-#define paravirt_has(x)	0
  
  static inline void load_sp0(struct tss_struct *tss,

struct thread_struct *thread)
diff --git a/arch/x86/include/asm/x86_init.h b/arch/x86/include/asm/x86_init.h
index 1ae89a2721d6..27d5c3fe5198 100644
--- a/arch/x86/include/asm/x86_init.h
+++ b/arch/x86/include/asm/x86_init.h
@@ -142,6 +142,15 @@ struct x86_cpuinit_ops {
  struct timespec;
  
  /**

+ * struct x86_legacy_features - legacy x86 features
+ *
+ * @rtc: this device has a CMOS real-time clock present
+ */
+struct x86_legacy_features {
+   int rtc;
+};
+
+/**
   * struct x86_platform_ops - platform specific runtime functions
   * @calibrate_tsc:calibrate TSC
   * @get_wallclock:get time from HW clock like RTC etc.
@@ -152,6 +161,7 @@ struct timespec;
   * @save_sched_clock_state:   save state for sched_clock() on suspend
   * @restore_sched_clock_state:restore state for sched_clock() on 
resume
   * @apic_post_init:   adjust apic if needed
+ * @legacy:legacy features
   */
  struct x86_platform_ops {
unsigned long (*calibrate_tsc)(void);
@@ -165,6 +175,7 @@ struct x86_platform_ops {
void (*save_sched_clock_state)(void);
void (*restore_sched_clock_state)(void);
void (*apic_post_init)(void);
+   struct x86_legacy_features legacy;
  };
  
  struct pci_dev;

@@ -186,6 +197,8 @@ extern struct x86_cpuinit_ops x86_cpuinit;
  extern struct x86_platform_ops x86_platform;
  extern struct x86_msi_ops x86_msi;
  extern struct 

Re: [PATCH v4 08/14] apm32: remove paravirt_enabled() use

2016-04-07 Thread Boris Ostrovsky

On 04/06/2016 08:06 PM, Luis R. Rodriguez wrote:

There is already a check for apm_info.bios == 0; the
apm_info.bios is set from the boot_params.apm_bios_info.
Both Xen and lguest, which are also the only ones that set
paravirt_enabled to true, never set the apm_bios_info.

The Xen folks are sure force-disabling it to 0 is not needed,


Because apm_info lives in .bss (which we recently made sure is cleared 
on Xen PV). May be worth mentioning in the commit message so that we 
don't forget why this is not needed.


I think you also have this statement in other patches.

-boris


we
recently forced disabled this on lguest. With this in place
the paravirt_enabled() check is simply not needed anymore.

Signed-off-by: Luis R. Rodriguez 
---
  arch/x86/kernel/apm_32.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/apm_32.c b/arch/x86/kernel/apm_32.c
index 9307f182fe30..c7364bd633e1 100644
--- a/arch/x86/kernel/apm_32.c
+++ b/arch/x86/kernel/apm_32.c
@@ -2267,7 +2267,7 @@ static int __init apm_init(void)
  
  	dmi_check_system(apm_dmi_table);
  
-	if (apm_info.bios.version == 0 || paravirt_enabled() || machine_is_olpc()) {

+   if (apm_info.bios.version == 0 || machine_is_olpc()) {
printk(KERN_INFO "apm: BIOS not found.\n");
return -ENODEV;
}




Re: [Xen-devel] [PATCH v5 0/9] Improve non-"safe" MSR access failure handling

2016-04-05 Thread Boris Ostrovsky

On 04/02/2016 10:01 AM, Andy Lutomirski wrote:


Andy Lutomirski (9):
   x86/head: Pass a real pt_regs and trapnr to early_fixup_exception
   x86/head: Move the early NMI fixup into C
   x86/head: Move early exception panic code into early_fixup_exception
   x86/traps: Enable all exception handler callbacks early
   x86/paravirt: Add _safe to the read_msr and write_msr PV hooks
   x86/msr: Carry on after a non-"safe" MSR access fails
   x86/paravirt: Add paravirt_{read,write}_msr
   x86/paravirt: Make "unsafe" MSR accesses unsafe even if PARAVIRT=y
   x86/msr: Set the return value to zero when native_rdmsr_safe fails


With Xen (on top of eb1af3b):

Tested-by: Boris Ostrovsky <boris.ostrov...@oracle.com>

-boris


Re: Xen regression, Was: [PATCH] x86/irq: Probe for PIC presence before allocating descs for legacy IRQs

2016-04-12 Thread Boris Ostrovsky

On 04/12/2016 02:06 PM, Stefano Stabellini wrote:

On Tue, 12 Apr 2016, Boris Ostrovsky wrote:

On 04/11/2016 10:08 PM, Stefano Stabellini wrote:

Hi all,

Unfortunately this patch (now commit
8c058b0b9c34d8c8d7912880956543769323e2d8) causes a regression on Xen
when running on top of QEMU: the number of PIT irqs get set to 0 by
probe_8259A but actually there are 16.

Any suggestions on how to fix this?

1) we could revert 8c058b0b9c34d8c8d7912880956543769323e2d8
2) we could introduce an 'if (!xen_domain())' in probe_8259A
3) suggestions welcome

Stefano, do you have b4ff8389ed14b849354b59ce9b360bdefcdbf99c ?

It was supposed to fix this problem for Xen. However, I just noticed that
arch/arm64/include/asm/irq.h makes nr_legacy_irqs() return 0 (unlike
arch/arm/include/asm/irq.h). Could that be the problem?

I have b4ff8389ed14b849354b59ce9b360bdefcdbf99c but it doesn't fix the
issue for me.

Is the idea of your patch that xen_allocate_irq_gsi will allocate the
descriptor dynamically instead?


Right.


If so, it doesn't work because it
doesn't get called for irq 14:


So how has it worked until now then?



piix_init_one -> ata_pci_sff_activate_host -> devm_request_irq ->
devm_request_threaded_irq-> request_threaded_irq -> irq_to_desc(14) ->
-EINVAL

If you look at pci_xen_initial_domain, the loop:

for (irq = 0; irq < nr_legacy_irqs(); irq++) {

won't work anymore because by the time it is called, nr_legacy_irqs()
already returns 0: it has been changed by probe_8259A().

We also need the following patch:

---

xen/x86: actually allocate legacy interrupts on PV guests

b4ff8389ed14 is incomplete: it relies on nr_legacy_irqs() to get the number
of legacy interrupts, but nr_legacy_irqs() returns 0 after
probe_8259A(). Use NR_IRQS_LEGACY instead.

Signed-off-by: Stefano Stabellini <sstabell...@kernel.org>

diff --git a/arch/x86/pci/xen.c b/arch/x86/pci/xen.c
index beac4df..6db0060 100644
--- a/arch/x86/pci/xen.c
+++ b/arch/x86/pci/xen.c
@@ -492,7 +492,7 @@ int __init pci_xen_initial_domain(void)
__acpi_register_gsi = acpi_register_gsi_xen;
__acpi_unregister_gsi = NULL;
/* Pre-allocate legacy irqs */
-   for (irq = 0; irq < nr_legacy_irqs(); irq++) {
+   for (irq = 0; irq < NR_IRQS_LEGACY; irq++) {
int trigger, polarity;
  
		if (acpi_get_override_irq(irq, &trigger, &polarity) == -1)



Won't we need the same change in the 'if (0 == nr_ioapics)' clause?

-boris


Re: Xen regression, Was: [PATCH] x86/irq: Probe for PIC presence before allocating descs for legacy IRQs

2016-04-12 Thread Boris Ostrovsky

On 04/12/2016 05:56 PM, Stefano Stabellini wrote:


  
I am not sure, maybe you didn't have CONFIG_SPARSE_IRQ?

But I am certain that 4.6-rc2, with the attached config, fails as Dom0
on QEMU with the following sequence of calls:


I did have CONFIG_SPARSE_IRQ and I just rebuilt 4.5.0 with your config 
(4.6-rc3 doesn't build for me for some reason) and that booted dom0 as well.


BTW, what do you mean by "dom0 on QEMU"?

-boris




Re: Xen regression, Was: [PATCH] x86/irq: Probe for PIC presence before allocating descs for legacy IRQs

2016-04-12 Thread Boris Ostrovsky

On 04/12/2016 05:14 PM, Stefano Stabellini wrote:

On Tue, 12 Apr 2016, Boris Ostrovsky wrote:

On 04/12/2016 02:06 PM, Stefano Stabellini wrote:

On Tue, 12 Apr 2016, Boris Ostrovsky wrote:

On 04/11/2016 10:08 PM, Stefano Stabellini wrote:

Hi all,

Unfortunately this patch (now commit
8c058b0b9c34d8c8d7912880956543769323e2d8) causes a regression on Xen
when running on top of QEMU: the number of legacy PIC irqs gets set to 0 by
probe_8259A but actually there are 16.

Any suggestions on how to fix this?

1) we could revert 8c058b0b9c34d8c8d7912880956543769323e2d8
2) we could introduce an 'if (!xen_domain())' in probe_8259A
3) suggestions welcome

Stefano, do you have b4ff8389ed14b849354b59ce9b360bdefcdbf99c ?

It was supposed to fix this problem for Xen. However, I just noticed that
arch/arm64/include/asm/irq.h makes nr_legacy_irqs() return 0 (unlike
arch/arm/include/asm/irq.h). Could that be the problem?

I have b4ff8389ed14b849354b59ce9b360bdefcdbf99c but it doesn't fix the
issue for me.

Is the idea of your patch that xen_allocate_irq_gsi will allocate the
descriptor dynamically instead?

Right.


If so, it doesn't work because it
doesn't get called for irq 14:

So how has it worked until now then?

It didn't for me :-)
Maybe it was working for DomUs, but not for Dom0?


I most certainly run this as both dom0 and domU on a variety of machines.

So perhaps there is some other problem?





piix_init_one -> ata_pci_sff_activate_host -> devm_request_irq ->
devm_request_threaded_irq-> request_threaded_irq -> irq_to_desc(14) ->
-EINVAL

If you look at pci_xen_initial_domain, the loop:

for (irq = 0; irq < nr_legacy_irqs(); irq++) {

won't work anymore because by the time it is called, nr_legacy_irqs()
already returns 0: it has been changed by probe_8259A().

We also need the following patch:

---

xen/x86: actually allocate legacy interrupts on PV guests

b4ff8389ed14 is incomplete: it relies on nr_legacy_irqs() to get the number
of legacy interrupts, but nr_legacy_irqs() returns 0 after
probe_8259A(). Use NR_IRQS_LEGACY instead.

Signed-off-by: Stefano Stabellini <sstabell...@kernel.org>

diff --git a/arch/x86/pci/xen.c b/arch/x86/pci/xen.c
index beac4df..6db0060 100644
--- a/arch/x86/pci/xen.c
+++ b/arch/x86/pci/xen.c
@@ -492,7 +492,7 @@ int __init pci_xen_initial_domain(void)
__acpi_register_gsi = acpi_register_gsi_xen;
__acpi_unregister_gsi = NULL;
/* Pre-allocate legacy irqs */
-   for (irq = 0; irq < nr_legacy_irqs(); irq++) {
+   for (irq = 0; irq < NR_IRQS_LEGACY; irq++) {
int trigger, polarity;
		if (acpi_get_override_irq(irq, &trigger, &polarity) == -1)


Won't we need the same change in the 'if (0 == nr_ioapics)' clause?

That's not a problem: there is 1 ioapic in my system and is detected
correctly.


True, but if a system has no ioapics then we will again fail to manage 
the pirq, no?


-boris


Re: Xen regression, Was: [PATCH] x86/irq: Probe for PIC presence before allocating descs for legacy IRQs

2016-04-12 Thread Boris Ostrovsky

On 04/12/2016 07:15 PM, Stefano Stabellini wrote:

On Tue, 12 Apr 2016, Boris Ostrovsky wrote:

On 04/12/2016 05:56 PM, Stefano Stabellini wrote:

   I am not sure, maybe you didn't have CONFIG_SPARSE_IRQ?
But I am certain that 4.6-rc2, with the attached config, fails as Dom0
on QEMU with the following sequence of calls:

I did have CONFIG_SPARSE_IRQ and I just rebuilt 4.5.0 with your config
(4.6-rc3 doesn't build for me for some reason) and that booted dom0 as well.

BTW, what do you mean by "dom0 on QEMU"?
  
I am running Xen and Linux inside a QEMU x86_64 emulated machine (nested
virt).


This I, of course, never tried.

But given that things work in a single-level virt, doesn't this imply 
that perhaps there is something in the emulation that's not quite right?


-boris


Re: [PATCH v4 04/14] x86/rtc: replace paravirt rtc check with platform legacy quirk

2016-04-08 Thread Boris Ostrovsky

On 04/08/2016 02:29 AM, Luis R. Rodriguez wrote:

On Thu, Apr 7, 2016 at 10:18 PM, Juergen Gross <jgr...@suse.com> wrote:

On 08/04/16 02:32, Luis R. Rodriguez wrote:

On Thu, Apr 07, 2016 at 08:55:54AM -0400, Boris Ostrovsky wrote:

On 04/06/2016 08:06 PM, Luis R. Rodriguez wrote:

We have 4 types of x86 platforms that disable RTC:

   * Intel MID
   * Lguest - uses paravirt
   * Xen dom-U - uses paravirt
   * x86 on legacy systems annotated with an ACPI legacy flag

We can consolidate all of these into a platform specific legacy
quirk set early in boot through i386_start_kernel() and through
x86_64_start_reservations(). This deals with the RTC quirks which
we can rely on through the hardware subarch; the ACPI check can
be dealt with separately.

v2: split the subarch check from the ACPI check, clarify
 on the ACPI change commit log why ordering works

Suggested-by: Ingo Molnar <mi...@kernel.org>
Signed-off-by: Luis R. Rodriguez <mcg...@kernel.org>

<-- snip -->


diff --git a/arch/x86/kernel/platform-quirks.c b/arch/x86/kernel/platform-quirks.c
new file mode 100644
index ..1b114ac5996f
--- /dev/null
+++ b/arch/x86/kernel/platform-quirks.c
@@ -0,0 +1,18 @@
+#include 
+#include 
+
+#include 
+#include 
+
+void __init x86_early_init_platform_quirks(void)
+{
+   x86_platform.legacy.rtc = 1;
+
+   switch (boot_params.hdr.hardware_subarch) {
+   case X86_SUBARCH_XEN:
+   case X86_SUBARCH_LGUEST:
+   case X86_SUBARCH_INTEL_MID:
+   x86_platform.legacy.rtc = 0;
+   break;
+   }
+}

What about Xen dom0 (aka initial domain)?

Indeed, thanks for catching this, the hunk below removes the re-enablement of
the RTC for dom0:


--- a/arch/x86/xen/enlighten.c
+++ b/arch/x86/xen/enlighten.c
@@ -1192,7 +1192,6 @@ static const struct pv_info xen_info __initconst = {
  #ifdef CONFIG_X86_64
 .extra_user_64bit_cs = FLAT_USER_CS64,
  #endif
-   .features = 0,
 .name = "Xen",
  };
@@ -1525,8 +1524,6 @@ asmlinkage __visible void __init xen_start_kernel(void)
 /* Install Xen paravirt ops */
 pv_info = xen_info;
-   if (xen_initial_domain())
-   pv_info.features |= PV_SUPPORTED_RTC;
 pv_init_ops = xen_init_ops;
 if (!xen_pvh_domain()) {
 pv_cpu_ops = xen_cpu_ops;

This should then break dom0 unless of course you have the respective next
patch applied and that disabled the RTC due to an ACPI setting on your
platform. Juergen, can you check to see if that was the case for your
testing platform on dom0 ?

Are you sure it would break?

No, I suspected that it should, though.


Wouldn't it just fall back to another
clock source, e.g. hpet?

I suppose so.


I looked into my test system: seems as if add_rtc_cmos() is returning
before the .legacy.rtc test.

OK thanks...


It works because the clock must have been discovered by ACPI prior to the 
add_rtc_cmos() call. It's the PNP0B00 object, I believe. The rest of the 
routine is to handle the case when the RTC is not found in the ACPI tables 
for whatever reason (I think).


That's why we added paravirt_has(RTC) --- dom0 should be able to handle 
such cases, just like bare metal.


-boris


Re: [PATCH v4 04/14] x86/rtc: replace paravirt rtc check with platform legacy quirk

2016-04-08 Thread Boris Ostrovsky

On 04/08/2016 03:59 AM, Juergen Gross wrote:

On 08/04/16 09:36, Luis R. Rodriguez wrote:

On Fri, Apr 8, 2016 at 12:13 AM, Juergen Gross  wrote:

On 08/04/16 08:56, Luis R. Rodriguez wrote:

On Thu, Apr 7, 2016 at 11:38 PM, Juergen Gross  wrote:

Okay. Another idea (not sure whether this is really a good one):

Add X86_SUBARCH_XEN_DOM0. As hardware_subarch is 32 bits wide I don't
think the number of subarchs is a scarce resource. :-)

This would mean bumping the x86 boot protocol, we shouldn't take that
lightly, but given that in this case the new subarch would really only
be set by the kernel (or future loaders for perhaps HVMLite) I'd think
this is not such an intrusive alternative.

I think adding an own subarch for dom0 isn't that bad. It really is
different from domU as dom0 has per default access to the real hardware
(or at least to most of it).


Can we do this (overwrite quirks) in x86_init_ops.arch_setup? I'd really 
like to avoid adding what is essentially a sub-subarch.


-boris


Re: [PATCH v5 04/14] x86/rtc: replace paravirt rtc check with platform legacy quirk

2016-04-11 Thread Boris Ostrovsky

On 04/08/2016 07:40 PM, Luis R. Rodriguez wrote:

diff --git a/arch/x86/include/asm/x86_init.h b/arch/x86/include/asm/x86_init.h
index 1ae89a2721d6..8bb8c1a4615a 100644
--- a/arch/x86/include/asm/x86_init.h
+++ b/arch/x86/include/asm/x86_init.h
@@ -142,6 +142,15 @@ struct x86_cpuinit_ops {
 struct timespec;
 
 /**

+ * struct x86_legacy_features - legacy x86 features
+ *
+ * @rtc: this device has a CMOS real-time clock present
+ */
+struct x86_legacy_features {
+   int rtc;
+};
+
+/**
  * struct x86_platform_ops - platform specific runtime functions
  * @calibrate_tsc: calibrate TSC
  * @get_wallclock: get time from HW clock like RTC etc.
@@ -152,6 +161,14 @@ struct timespec;
  * @save_sched_clock_state:	save state for sched_clock() on suspend
  * @restore_sched_clock_state:	restore state for sched_clock() on resume
  * @apic_post_init:		adjust apic if needed
 + * @legacy:			legacy features
 + * @set_legacy_features:	override legacy features. Use of this callback
 + *				is highly discouraged. You should only need
 + *				this if your hardware platform requires further
 + *				custom fine tuning far beyond what may be
+ * possible in x86_early_init_platform_quirks() by
+ * only using the current x86_hardware_subarch
+ * semantics.
  */
 struct x86_platform_ops {
unsigned long (*calibrate_tsc)(void);
@@ -165,6 +182,8 @@ struct x86_platform_ops {
void (*save_sched_clock_state)(void);
void (*restore_sched_clock_state)(void);
void (*apic_post_init)(void);
+   struct x86_legacy_features legacy;


I don't think this belongs here --- we are in the ops structure.

-boris


+   void (*set_legacy_features)(void);
  };
  
  struct pci_dev;

@@ -186,6 +205,8 @@ extern struct x86_cpuinit_ops x86_cpuinit;
  extern struct x86_platform_ops x86_platform;
  extern struct x86_msi_ops x86_msi;
  extern struct x86_io_apic_ops x86_io_apic_ops;
+
+extern void x86_early_init_platform_quirks(void);
  extern void x86_init_noop(void);
  extern void x86_init_uint_noop(unsigned int unused);
  





Re: Xen regression, Was: [PATCH] x86/irq: Probe for PIC presence before allocating descs for legacy IRQs

2016-04-12 Thread Boris Ostrovsky

On 04/11/2016 10:08 PM, Stefano Stabellini wrote:

Hi all,

Unfortunately this patch (now commit
8c058b0b9c34d8c8d7912880956543769323e2d8) causes a regression on Xen
when running on top of QEMU: the number of legacy PIC irqs gets set to 0 by
probe_8259A but actually there are 16.

Any suggestions on how to fix this?

1) we could revert 8c058b0b9c34d8c8d7912880956543769323e2d8
2) we could introduce an 'if (!xen_domain())' in probe_8259A
3) suggestions welcome


Stefano, do you have b4ff8389ed14b849354b59ce9b360bdefcdbf99c ?

It was supposed to fix this problem for Xen. However, I just noticed 
that arch/arm64/include/asm/irq.h makes nr_legacy_irqs() return 0 
(unlike arch/arm/include/asm/irq.h). Could that be the problem?



-boris



[PATCH 2/2] xen/x86: Call cpu_startup_entry(CPUHP_AP_ONLINE_IDLE) from xen_play_dead()

2016-03-19 Thread Boris Ostrovsky
This call has always been missing from xen_play_dead() but until
recently this was rather benign. With the new cpu hotplug framework,
however, this call is required, otherwise a hot-plugged CPU will not
be properly brought up (cpuhp_online_idle() is never called).

Signed-off-by: Boris Ostrovsky <boris.ostrov...@oracle.com>
---
 arch/x86/xen/smp.c |2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/arch/x86/xen/smp.c b/arch/x86/xen/smp.c
index 3c6d17f..719cf29 100644
--- a/arch/x86/xen/smp.c
+++ b/arch/x86/xen/smp.c
@@ -545,6 +545,8 @@ static void xen_play_dead(void) /* used only with HOTPLUG_CPU */
 * data back is to call:
 */
tick_nohz_idle_enter();
+
+   cpu_startup_entry(CPUHP_AP_ONLINE_IDLE);
 }
 
 #else /* !CONFIG_HOTPLUG_CPU */
-- 
1.7.1



[PATCH 0/2] Hotplug fixes

2016-03-19 Thread Boris Ostrovsky
Two patches fixing regressions introduced by the new CPU hotplug framework.

Boris Ostrovsky (2):
  xen/apic: Provide Xen-specific version of cpu_present_to_apicid APIC
op
  xen/x86: Call cpu_startup_entry(CPUHP_AP_ONLINE_IDLE) from
xen_play_dead()

 arch/x86/xen/apic.c |   12 ++--
 arch/x86/xen/smp.c  |2 ++
 2 files changed, 12 insertions(+), 2 deletions(-)



Re: [Xen-devel] [PATCH] xen/events: Mask a moving irq

2016-03-19 Thread Boris Ostrovsky

On 03/17/2016 12:03 PM, David Vrabel wrote:

On 17/03/16 12:45, Boris Ostrovsky wrote:

Moving an unmasked irq may result in irq handler being invoked on both
source and target CPUs.

With 2-level this can happen as follows:

On source CPU:
 evtchn_2l_handle_events() ->
 generic_handle_irq() ->
 handle_edge_irq() ->
eoi_pirq():
irq_move_irq(data);

/* WE ARE HERE */

if (VALID_EVTCHN(evtchn))
clear_evtchn(evtchn);

If at this moment target processor is handling an unrelated event in
evtchn_2l_handle_events()'s loop it may pick up our event since target's
cpu_evtchn_mask claims that this event belongs to it *and* the event is
unmasked and still pending. At the same time, source CPU will continue
executing its own handle_edge_irq().

With FIFO interrupt the scenario is similar: irq_move_irq() may result
in a EVTCHNOP_unmask hypercall which, in turn, may make the event
pending on the target CPU.

We can avoid this situation by moving and clearing the event while
keeping event masked.

Can you do:

if (unlikely(irqd_is_setaffinity_pending(data))) {
masked = test_and_set_mask()

clear_evtchn()
irq_move_masked_irq()


I did think about this but then I wasn't sure whether this might open 
some other window for things to sneak in. It shouldn't but these things 
are rather subtle so I'd rather leave the order of how operations are 
done unchanged.


But I should indeed use irq_move_masked_irq() instead of irq_move_irq().

-boris



unmask(masked);
} else
clear_evtchn()






Re: [Xen-devel] [PATCH] xen/events: Mask a moving irq

2016-03-19 Thread Boris Ostrovsky

On 03/17/2016 01:29 PM, David Vrabel wrote:

On 17/03/16 16:53, Boris Ostrovsky wrote:

On 03/17/2016 12:03 PM, David Vrabel wrote:

On 17/03/16 12:45, Boris Ostrovsky wrote:

Moving an unmasked irq may result in irq handler being invoked on both
source and target CPUs.

With 2-level this can happen as follows:

On source CPU:
  evtchn_2l_handle_events() ->
  generic_handle_irq() ->
  handle_edge_irq() ->
 eoi_pirq():
 irq_move_irq(data);

 /* WE ARE HERE */

 if (VALID_EVTCHN(evtchn))
 clear_evtchn(evtchn);

If at this moment target processor is handling an unrelated event in
evtchn_2l_handle_events()'s loop it may pick up our event since target's
cpu_evtchn_mask claims that this event belongs to it *and* the event is
unmasked and still pending. At the same time, source CPU will continue
executing its own handle_edge_irq().

With FIFO interrupt the scenario is similar: irq_move_irq() may result
in a EVTCHNOP_unmask hypercall which, in turn, may make the event
pending on the target CPU.

We can avoid this situation by moving and clearing the event while
keeping event masked.

Can you do:

 if (unlikely(irqd_is_setaffinity_pending(data))) {
 masked = test_and_set_mask()

 clear_evtchn()
 irq_move_masked_irq()

I did think about this but then I wasn't sure whether this might open
some other window for things to sneak in. It shouldn't but these things
are rather subtle so I'd rather leave the order of how operations are
done unchanged.

This is the order your patch has though.  I'm confused.


Ugh, sorry --- I misread what you wrote, I thought you wanted to clear 
before masking. Which wouldn't make any sense.


So yes, what you are suggesting is better.

-boris


[PATCH] xen/events: Mask a moving irq

2016-03-19 Thread Boris Ostrovsky
Moving an unmasked irq may result in irq handler being invoked on both
source and target CPUs.

With 2-level this can happen as follows:

On source CPU:
evtchn_2l_handle_events() ->
generic_handle_irq() ->
handle_edge_irq() ->
   eoi_pirq():
   irq_move_irq(data);

   /* WE ARE HERE */

   if (VALID_EVTCHN(evtchn))
   clear_evtchn(evtchn);

If at this moment the target processor is handling an unrelated event in
evtchn_2l_handle_events()'s loop it may pick up our event, since the target's
cpu_evtchn_mask claims that this event belongs to it *and* the event is
unmasked and still pending. At the same time, the source CPU will continue
executing its own handle_edge_irq().

With FIFO interrupts the scenario is similar: irq_move_irq() may result
in a EVTCHNOP_unmask hypercall which, in turn, may make the event
pending on the target CPU.

We can avoid this situation by moving and clearing the event while
keeping the event masked.

Signed-off-by: Boris Ostrovsky <boris.ostrov...@oracle.com>
Cc: sta...@vger.kernel.org
---
 drivers/xen/events/events_base.c |   26 --
 1 files changed, 24 insertions(+), 2 deletions(-)

diff --git a/drivers/xen/events/events_base.c b/drivers/xen/events/events_base.c
index 524c221..c5725ee 100644
--- a/drivers/xen/events/events_base.c
+++ b/drivers/xen/events/events_base.c
@@ -483,12 +483,23 @@ static void eoi_pirq(struct irq_data *data)
int evtchn = evtchn_from_irq(data->irq);
struct physdev_eoi eoi = { .irq = pirq_from_irq(data->irq) };
int rc = 0;
+   int need_unmask = 0;
 
-   irq_move_irq(data);
+   if (unlikely(irqd_is_setaffinity_pending(data))) {
+   if (VALID_EVTCHN(evtchn))
+   need_unmask = !test_and_set_mask(evtchn);
+   }
 
if (VALID_EVTCHN(evtchn))
clear_evtchn(evtchn);
 
+   irq_move_irq(data);
+
+   if (VALID_EVTCHN(evtchn)) {
+   if (unlikely(need_unmask))
+   unmask_evtchn(evtchn);
+   }
+
if (pirq_needs_eoi(data->irq)) {
		rc = HYPERVISOR_physdev_op(PHYSDEVOP_eoi, &eoi);
WARN_ON(rc);
@@ -1356,11 +1367,22 @@ static void disable_dynirq(struct irq_data *data)
 static void ack_dynirq(struct irq_data *data)
 {
int evtchn = evtchn_from_irq(data->irq);
+   int need_unmask = 0;
 
-   irq_move_irq(data);
+   if (unlikely(irqd_is_setaffinity_pending(data))) {
+   if (VALID_EVTCHN(evtchn))
+   need_unmask = !test_and_set_mask(evtchn);
+   }
 
if (VALID_EVTCHN(evtchn))
clear_evtchn(evtchn);
+
+   irq_move_irq(data);
+
+   if (VALID_EVTCHN(evtchn)) {
+   if (unlikely(need_unmask))
+   unmask_evtchn(evtchn);
+   }
 }
 
 static void mask_ack_dynirq(struct irq_data *data)
-- 
1.7.7.6



[PATCH 2/2] hotplug: Prevent alloc/free of irq descriptors during cpu up/down (again)

2016-03-19 Thread Boris Ostrovsky
Now that Xen no longer allocates irqs in _cpu_up() we can restore
commit a89941816726 ("hotplug: Prevent alloc/free of irq descriptors
during cpu up/down")

Signed-off-by: Boris Ostrovsky <boris.ostrov...@oracle.com>
---
 arch/x86/kernel/smpboot.c |   11 ---
 kernel/cpu.c  |8 
 2 files changed, 8 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 643dbdc..cabe21e 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -1083,17 +1083,8 @@ int native_cpu_up(unsigned int cpu, struct task_struct *tidle)
 
common_cpu_up(cpu, tidle);
 
-   /*
-* We have to walk the irq descriptors to setup the vector
-* space for the cpu which comes online.  Prevent irq
-* alloc/free across the bringup.
-*/
-   irq_lock_sparse();
-
err = do_boot_cpu(apicid, cpu, tidle);
-
if (err) {
-   irq_unlock_sparse();
pr_err("do_boot_cpu failed(%d) to wakeup CPU#%u\n", err, cpu);
return -EIO;
}
@@ -,8 +1102,6 @@ int native_cpu_up(unsigned int cpu, struct task_struct *tidle)
touch_nmi_watchdog();
}
 
-   irq_unlock_sparse();
-
return 0;
 }
 
diff --git a/kernel/cpu.c b/kernel/cpu.c
index 6ea42e8..2ff63b3 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -342,8 +342,16 @@ static int bringup_cpu(unsigned int cpu)
struct task_struct *idle = idle_thread_get(cpu);
int ret;
 
+   /*
+* Some architectures have to walk the irq descriptors to
+* setup the vector space for the cpu which comes online.
+* Prevent irq alloc/free across the bringup.
+*/
+   irq_lock_sparse();
+
/* Arch-specific enabling code. */
ret = __cpu_up(cpu, idle);
+   irq_unlock_sparse();
if (ret) {
cpu_notify(CPU_UP_CANCELED, cpu);
return ret;
-- 
1.7.1



Re: [Xen-devel] 4.4: INFO: rcu_sched self-detected stall on CPU

2016-03-25 Thread Boris Ostrovsky

On 03/24/2016 10:53 PM, Steven Haigh wrote:

Hi all,

Firstly, I've cross-posted this to xen-devel and the lkml - as this
problem seems to only exist when using kernel 4.4 as a Xen DomU kernel.
I have also CC'ed Greg KH for his awesome insight as maintainer.

Please CC myself into replies - as I'm not a member of the kernel
mailing list - I may miss replies from monitoring the archives.

I've noticed recently that heavy disk IO is causing rcu_sched to detect
stalls. The process mentioned usually goes to 100% CPU usage, and
eventually processes start segfaulting and dying. The only fix to
recover the system is to use 'xl destroy' to force-kill the VM and to
start it again.

The majority of these issues seem to mention ext4 in the trace. This may
indicate an issue there - or may be a red herring.

The gritty details:
INFO: rcu_sched self-detected stall on CPU
#0110-...: (20999 ticks this GP) idle=327/141/0
softirq=1101493/1101493 fqs=6973
#011 (t=21000 jiffies g=827095 c=827094 q=524)
Task dump for CPU 0:
rsync   R  running task0  2446   2444 0x0088
818d0c00 88007fc03c58 810a625f 
818d0c00 88007fc03c70 810a8699 0001
88007fc03ca0 810d0e5a 88007fc170c0 818d0c00
Call Trace:
  [] sched_show_task+0xaf/0x110
[] dump_cpu_task+0x39/0x40
[] rcu_dump_cpu_stacks+0x8a/0xc0
[] rcu_check_callbacks+0x424/0x7a0
[] ? account_system_time+0x81/0x110
[] ? account_process_tick+0x61/0x160
[] ? tick_sched_do_timer+0x30/0x30
[] update_process_times+0x39/0x60
[] tick_sched_handle.isra.15+0x36/0x50
[] tick_sched_timer+0x3d/0x70
[] __hrtimer_run_queues+0xf2/0x250
[] hrtimer_interrupt+0xa8/0x190
[] xen_timer_interrupt+0x2e/0x140
[] handle_irq_event_percpu+0x55/0x1e0
[] handle_percpu_irq+0x3a/0x50
[] generic_handle_irq+0x22/0x30
[] __evtchn_fifo_handle_events+0x15f/0x180
[] evtchn_fifo_handle_events+0x10/0x20
[] __xen_evtchn_do_upcall+0x43/0x80
[] xen_evtchn_do_upcall+0x30/0x50
[] xen_hvm_callback_vector+0x82/0x90
  [] ? queued_write_lock_slowpath+0x3d/0x80
[] _raw_write_lock+0x1e/0x30


This looks to me like ext4 failing to grab a lock. Everything above it 
(in Xen code) is regular tick interrupt handling which detects the stall.


Your config does not have CONFIG_PARAVIRT_SPINLOCKS so that eliminates 
any possible issues with pv locks.


Do you see anything "interesting" in dom0? (e.g. dmesg, xl dmesg, 
/var/log/xen/) Are you oversubscribing your guest (CPU-wise)?


-boris


[] ext4_es_remove_extent+0x43/0xc0
[] ext4_clear_inode+0x39/0x80
[] ext4_evict_inode+0x8d/0x4e0
[] evict+0xb7/0x180
[] dispose_list+0x36/0x50
[] prune_icache_sb+0x4b/0x60
[] super_cache_scan+0x141/0x190
[] shrink_slab.part.37+0x1ee/0x390
[] shrink_zone+0x26c/0x280
[] do_try_to_free_pages+0x15c/0x410
[] try_to_free_pages+0xba/0x170
[] __alloc_pages_nodemask+0x525/0xa60
[] ? kmem_cache_free+0xcc/0x2c0
[] alloc_pages_current+0x8d/0x120
[] __page_cache_alloc+0x91/0xc0
[] pagecache_get_page+0x56/0x1e0
[] grab_cache_page_write_begin+0x26/0x40
[] ext4_da_write_begin+0xa1/0x300
[] ? ext4_da_write_end+0x124/0x2b0
[] generic_perform_write+0xc0/0x1a0
[] __generic_file_write_iter+0x188/0x1e0
[] ext4_file_write_iter+0xf0/0x340
[] __vfs_write+0xaa/0xe0
[] vfs_write+0xa2/0x1a0
[] SyS_write+0x46/0xa0
[] entry_SYSCALL_64_fastpath+0x12/0x71

Some 11 hours later:
sshd[785]: segfault at 1f0 ip 7f03bb94ae5c sp 7ffe9eb54470 error
4 in ld-2.17.so[7f03bb94+21000]
sh[787]: segfault at 1f0 ip 7f6b4a0dfe5c sp 7ffe3d4a71e0 error 4
in ld-2.17.so[7f6b4a0d5000+21000]
systemd-cgroups[788]: segfault at 1f0 ip 7f4baa82ce5c sp
7ffd28e4c4b0 error 4 in ld-2.17.so[7f4baa822000+21000]
sshd[791]: segfault at 1f0 ip 7ff8c8a8ce5c sp 7ffede9e1c20 error
4 in ld-2.17.so[7ff8c8a82000+21000]
sshd[792]: segfault at 1f0 ip 7f183cf75e5c sp 7ffc81ab7160 error
4 in ld-2.17.so[7f183cf6b000+21000]
sshd[793]: segfault at 1f0 ip 7f3c665ece5c sp 7ffd9a13c850 error
4 in ld-2.17.so[7f3c665e2000+21000]

 From isolated testing, this does not occur on kernel 4.5.x - however I
have not verified this myself.

The kernel config used can be found in the kernel-xen git repo if it
assists in debugging:
http://xen.crc.id.au/git/?p=kernel-xen;a=summary



___
Xen-devel mailing list
xen-de...@lists.xen.org
http://lists.xen.org/xen-devel




Re: [PATCH 2/2] xen/x86: Call cpu_startup_entry(CPUHP_AP_ONLINE_IDLE) from xen_play_dead()

2016-03-25 Thread Boris Ostrovsky

On 03/25/2016 10:53 AM, Konrad Rzeszutek Wilk wrote:

On Thu, Mar 17, 2016 at 09:03:25AM -0400, Boris Ostrovsky wrote:

This call has always been missing from xen_play_dead() but until
recently this was rather benign. With the new cpu hotplug framework,
however, this call is required, otherwise a hot-plugged CPU will not

Could you include the commit id of the 'new cpu hotplug' in case
anybody wants to backport this?


Sure.

It's commit 8df3e07e7f21 ("cpu/hotplug: Let upcoming cpu bring itself 
fully up").


Do you (or David) want me to re-send it?

-boris





Thanks!

be properly brought up (cpuhp_online_idle() is never called)

Signed-off-by: Boris Ostrovsky <boris.ostrov...@oracle.com>
---
  arch/x86/xen/smp.c |2 ++
  1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/arch/x86/xen/smp.c b/arch/x86/xen/smp.c
index 3c6d17f..719cf29 100644
--- a/arch/x86/xen/smp.c
+++ b/arch/x86/xen/smp.c
@@ -545,6 +545,8 @@ static void xen_play_dead(void) /* used only with HOTPLUG_CPU */
 * data back is to call:
 */
tick_nohz_idle_enter();
+
+   cpu_startup_entry(CPUHP_AP_ONLINE_IDLE);
  }
  
  #else /* !CONFIG_HOTPLUG_CPU */

--
1.7.1





Re: [Xen-devel] 4.4: INFO: rcu_sched self-detected stall on CPU

2016-03-25 Thread Boris Ostrovsky



On 03/25/2016 10:05 AM, Steven Haigh wrote:

On 25/03/2016 11:23 PM, Boris Ostrovsky wrote:

On 03/24/2016 10:53 PM, Steven Haigh wrote:

Hi all,

Firstly, I've cross-posted this to xen-devel and the lkml - as this
problem seems to only exist when using kernel 4.4 as a Xen DomU kernel.
I have also CC'ed Greg KH for his awesome insight as maintainer.

Please CC myself into replies - as I'm not a member of the kernel
mailing list - I may miss replies from monitoring the archives.

I've noticed recently that heavy disk IO is causing rcu_sched to detect
stalls. The process mentioned usually goes to 100% CPU usage, and
eventually processes start segfaulting and dying. The only fix to
recover the system is to use 'xl destroy' to force-kill the VM and to
start it again.

The majority of these issues seem to mention ext4 in the trace. This may
indicate an issue there - or may be a red herring.

The gritty details:
INFO: rcu_sched self-detected stall on CPU
#0110-...: (20999 ticks this GP) idle=327/141/0
softirq=1101493/1101493 fqs=6973
#011 (t=21000 jiffies g=827095 c=827094 q=524)
Task dump for CPU 0:
rsync   R  running task0  2446   2444 0x0088
818d0c00 88007fc03c58 810a625f 
818d0c00 88007fc03c70 810a8699 0001
88007fc03ca0 810d0e5a 88007fc170c0 818d0c00
Call Trace:
  [] sched_show_task+0xaf/0x110
[] dump_cpu_task+0x39/0x40
[] rcu_dump_cpu_stacks+0x8a/0xc0
[] rcu_check_callbacks+0x424/0x7a0
[] ? account_system_time+0x81/0x110
[] ? account_process_tick+0x61/0x160
[] ? tick_sched_do_timer+0x30/0x30
[] update_process_times+0x39/0x60
[] tick_sched_handle.isra.15+0x36/0x50
[] tick_sched_timer+0x3d/0x70
[] __hrtimer_run_queues+0xf2/0x250
[] hrtimer_interrupt+0xa8/0x190
[] xen_timer_interrupt+0x2e/0x140
[] handle_irq_event_percpu+0x55/0x1e0
[] handle_percpu_irq+0x3a/0x50
[] generic_handle_irq+0x22/0x30
[] __evtchn_fifo_handle_events+0x15f/0x180
[] evtchn_fifo_handle_events+0x10/0x20
[] __xen_evtchn_do_upcall+0x43/0x80
[] xen_evtchn_do_upcall+0x30/0x50
[] xen_hvm_callback_vector+0x82/0x90
  [] ? queued_write_lock_slowpath+0x3d/0x80
[] _raw_write_lock+0x1e/0x30

This looks to me like ext4 failing to grab a lock. Everything above it
(in Xen code) is regular tick interrupt handling which detects the stall.

Your config does not have CONFIG_PARAVIRT_SPINLOCKS so that eliminates
any possible issues with pv locks.

Do you see anything "interesting" in dom0? (e.g. dmesg, xl dmesg,
/var/log/xen/) Are you oversubscribing your guest (CPU-wise)?

There is nothing special being logged anywhere that I can see. dmesg /
xl dmesg on the Dom0 show nothing unusual.

I do share CPUs - but I don't give any DomU more than 2 vcpus. The
physical host has 4 cores - 1 pinned to the Dom0.

I log to a remote syslog on this system - and I've uploaded the entire
log to a pastebin (don't want to do a 45Kb attachment here):
 http://paste.fedoraproject.org/345095/58914452


That doesn't look like a full log. In any case, the RCU stall may be a 
secondary problem --- there is a bunch of splats before the stall.


-boris




Not sure if it makes any difference at all, but my DomU config is:
# cat /etc/xen/backup.vm
name        = "backup.vm"
memory      = 2048
vcpus       = 2
cpus        = "1-3"
disk        = [ 'phy:/dev/vg_raid1_new/backup.vm,xvda,w' ]
vif         = [ "mac=00:11:36:35:35:09, bridge=br203, vifname=vm.backup, script=vif-bridge" ]
bootloader  = 'pygrub'
pvh         = 1

on_poweroff = 'destroy'
on_reboot   = 'restart'
on_crash    = 'restart'
cpu_weight  = 64
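One way to check the oversubscription question against this config (a hypothetical sketch: `parse_cpus` is not an xl helper, and it assumes dom0 is pinned to physical CPU 0, matching the "4 cores - 1 pinned to the Dom0" description above):

```python
def parse_cpus(spec):
    """Parse an xl-style cpus string like "1-3" or "0,2-3" into a set of CPU ids."""
    cpus = set()
    for part in spec.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus

dom0_pin = {0}                 # assumption: dom0 pinned to core 0
domu_pin = parse_cpus("1-3")   # cpus = "1-3" from the config above
print(sorted(domu_pin))        # [1, 2, 3]
print(domu_pin & dom0_pin)     # set() -> no overlap with dom0
```

With 2 vcpus allowed on cores 1-3 and dom0 on core 0, this guest does not overlap dom0; any oversubscription would come from the other DomUs sharing cores 1-3.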

I never had this problem when running kernel 4.1.x - it only started
when I upgraded everything to 4.4. Not exactly a great help, but it may
help narrow things down?


[] ext4_es_remove_extent+0x43/0xc0
[] ext4_clear_inode+0x39/0x80
[] ext4_evict_inode+0x8d/0x4e0
[] evict+0xb7/0x180
[] dispose_list+0x36/0x50
[] prune_icache_sb+0x4b/0x60
[] super_cache_scan+0x141/0x190
[] shrink_slab.part.37+0x1ee/0x390
[] shrink_zone+0x26c/0x280
[] do_try_to_free_pages+0x15c/0x410
[] try_to_free_pages+0xba/0x170
[] __alloc_pages_nodemask+0x525/0xa60
[] ? kmem_cache_free+0xcc/0x2c0
[] alloc_pages_current+0x8d/0x120
[] __page_cache_alloc+0x91/0xc0
[] pagecache_get_page+0x56/0x1e0
[] grab_cache_page_write_begin+0x26/0x40
[] ext4_da_write_begin+0xa1/0x300
[] ? ext4_da_write_end+0x124/0x2b0
[] generic_perform_write+0xc0/0x1a0
[] __generic_file_write_iter+0x188/0x1e0
[] ext4_file_write_iter+0xf0/0x340
[] __vfs_write+0xaa/0xe0
[] vfs_write+0xa2/0x1a0
[] SyS_write+0x46/0xa0
[] entry_SYSCALL_64_fastpath+0x12/0x71

Some 11 hours later:
sshd[785]: segfault at 1f0 ip 7f03bb94ae5c sp 7ffe9eb54470 error 4 in ld-2.17.so[7f03bb94+21000]
sh[787]: segfault at 1f0 ip 7f6b4a0dfe5c sp 7ffe3d4a71e0 error 4 in ld-2.17.so[7f6b4a0d5000+21000]
systemd-cgroups[788]: segfault at 1f0 ip 
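A small sanity check on the sh[787] line, the only one whose mapping base survived intact in the log (the sshd base address is truncated): the faulting instruction pointer sits at a fixed offset inside ld-2.17.so, and the fault address 1f0 is identical across all three processes, which is more consistent with one corrupted shared state than with independent bugs:

```python
# Compute the faulting offset inside ld-2.17.so for the sh[787] report.
# Assumption: the mapping "ld-2.17.so[7f6b4a0d5000+21000]" is complete.
ip   = 0x7f6b4a0dfe5c          # faulting instruction pointer
base = 0x7f6b4a0d5000          # ld-2.17.so mapping base
size = 0x21000                 # mapping length
offset = ip - base
assert 0 <= offset < size      # fault lies inside the ld.so mapping
print(hex(offset))             # 0xae5c
```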
