Re: [PATCH v2 02/11] xen/hvmlite: Bootstrap HVMlite guest
force us to provide a well-defined structure for considering code that should not run, by pegging it as required or supported for different early x86 code stubs. As I said before, I don't see how we can avoid having another entry point without making Xen change its load procedure, which is highly unlikely to happen. I had hinted that perhaps we might be able to piggyback on top of the ELF loader protocol as well; since that's a standard, I do wonder if it could instead be extended to help unify a mechanism for different OSes instead of making this just a solution for Linux.

Code review below.

On Mon, Feb 01, 2016 at 10:38:48AM -0500, Boris Ostrovsky wrote:
> Start HVMlite guest at XEN_ELFNOTE_PHYS32_ENTRY address. Setup hypercall
> page, initialize boot_params, enable early page tables.
>
> Since this stub is executed before kernel entry point we cannot use
> variables in .bss which is cleared by kernel. We explicitly place
> variables that are initialized here into .data.
>
> Signed-off-by: Boris Ostrovsky <boris.ostrov...@oracle.com>
> ---
> diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
> index 5774800..5f05fa2 100644
> --- a/arch/x86/xen/enlighten.c
> +++ b/arch/x86/xen/enlighten.c
> @@ -171,6 +172,17 @@ struct tls_descs {
>   */
>  static DEFINE_PER_CPU(struct tls_descs, shadow_tls_desc);
>  
> +#ifdef CONFIG_XEN_PVHVM
> +/*
> + * HVMlite variables. These need to live in data segment since they are
> + * initialized before startup_{32|64}, which clear .bss, are invoked.
> + */
> +int xen_hvmlite __attribute__((section(".data"))) = 0;
> +struct hvm_start_info hvmlite_start_info __attribute__((section(".data")));
> +uint hvmlite_start_info_sz = sizeof(hvmlite_start_info);
> +struct boot_params xen_hvmlite_boot_params __attribute__((section(".data")));
> +#endif
> +

The section annotations seem like a very special use case, but they are likely worth documenting and defining a new macro for in include/linux/compiler.h.
That would make it easier to change the section used here later, should we want to, and would let others easily find the reason for these annotations in a single place.

I wonder whether __initdata would be a good attribute; we only need this early in boot. And xen_hvmlite is gone now, so we don't need to worry about it.

-boris
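For reference, a macro along the lines suggested above might look like this. This is only a sketch: the name `__protected_data` is hypothetical and not an existing kernel macro, and a real version would live in include/linux/compiler.h with a comment explaining the .bss-clearing hazard.

```c
/*
 * Sketch of a macro wrapping the explicit ".data" placement, so that
 * variables written before startup_{32|64} clear .bss can be marked
 * uniformly and the reason documented in one place.
 * The name __protected_data is illustrative only.
 */
#define __protected_data __attribute__((section(".data")))

/* Usage mirroring the patch: forced into .data rather than .bss. */
int xen_hvmlite __protected_data = 0;
```

Grepping for one macro name is then enough to find every variable with this early-boot lifetime requirement.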
[PATCH v2 04/11] xen/hvmlite: Allow HVMlite guests to delay initializing grant table
.. just like we currently do for PVH guests.

Signed-off-by: Boris Ostrovsky <boris.ostrov...@oracle.com>
---
 arch/x86/xen/grant-table.c | 4 ++--
 drivers/xen/grant-table.c  | 8 ++++----
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/arch/x86/xen/grant-table.c b/arch/x86/xen/grant-table.c
index e079500..40ad9c2 100644
--- a/arch/x86/xen/grant-table.c
+++ b/arch/x86/xen/grant-table.c
@@ -110,7 +110,7 @@ int arch_gnttab_init(unsigned long nr_shared)
 	return arch_gnttab_valloc(&gnttab_shared_vm_area, nr_shared);
 }
 
-#ifdef CONFIG_XEN_PVH
+#ifdef CONFIG_XEN_PVHVM
 #include
 #include
 #include
@@ -164,7 +164,7 @@ static int __init xlated_setup_gnttab_pages(void)
 static int __init xen_pvh_gnttab_setup(void)
 {
-	if (!xen_pvh_domain())
+	if (!xen_pvh_domain() && !xen_hvmlite)
 		return -ENODEV;
 
 	return xlated_setup_gnttab_pages();
diff --git a/drivers/xen/grant-table.c b/drivers/xen/grant-table.c
index c49f79e..9a239d5 100644
--- a/drivers/xen/grant-table.c
+++ b/drivers/xen/grant-table.c
@@ -1147,13 +1147,13 @@ EXPORT_SYMBOL_GPL(gnttab_init);
 static int __gnttab_init(void)
 {
+	if (!xen_domain())
+		return -ENODEV;
+
 	/* Delay grant-table initialization in the PV on HVM case */
-	if (xen_hvm_domain())
+	if (xen_hvm_domain() && !xen_hvmlite)
 		return 0;
 
-	if (!xen_pv_domain())
-		return -ENODEV;
-
 	return gnttab_init();
 }
 /* Starts after core_initcall so that xen_pvh_gnttab_setup can be called
-- 
1.7.1
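The resulting decision flow in __gnttab_init() can be sketched as a standalone predicate. This is an illustrative stand-in only: the boolean parameters model xen_domain()/xen_hvm_domain()/xen_hvmlite, and the function name is not kernel code.

```c
#include <stdbool.h>

enum init_action { INIT_ENODEV, INIT_DEFER, INIT_NOW };

/*
 * Sketch of __gnttab_init() after this patch:
 *  - non-Xen domains: nothing to do;
 *  - plain PVHVM: defer, grant table is set up later (platform-pci path);
 *  - PV and HVMlite: initialize the grant table now.
 */
static enum init_action gnttab_init_action(bool xen_dom, bool hvm_dom,
					   bool hvmlite)
{
	if (!xen_dom)
		return INIT_ENODEV;
	if (hvm_dom && !hvmlite)
		return INIT_DEFER;
	return INIT_NOW;
}
```

The key change is that HVMlite domains, although they report as HVM, fall through to immediate initialization just like PV.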
[PATCH v2 06/11] xen/hvmlite: Prepare cpu_initialize_context() routine for HVMlite SMP
Subsequent patch will add support for starting secondary VCPUs in an HVMlite guest. This patch exists to simplify code review. No functional changes (except for introduction of 'if (!xen_hvmlite)').

Signed-off-by: Boris Ostrovsky <boris.ostrov...@oracle.com>
---
 arch/x86/xen/smp.c | 104 +++++++++++++++++++++++++++-----------------------
 1 files changed, 57 insertions(+), 47 deletions(-)

diff --git a/arch/x86/xen/smp.c b/arch/x86/xen/smp.c
index 3f4ebf0..5fc4afb 100644
--- a/arch/x86/xen/smp.c
+++ b/arch/x86/xen/smp.c
@@ -390,70 +390,80 @@ cpu_initialize_context(unsigned int cpu, struct task_struct *idle)
 	if (cpumask_test_and_set_cpu(cpu, xen_cpu_initialized_map))
 		return 0;
 
-	ctxt = kzalloc(sizeof(*ctxt), GFP_KERNEL);
-	if (ctxt == NULL)
-		return -ENOMEM;
+	if (!xen_hvmlite) {
 
-	gdt = get_cpu_gdt_table(cpu);
+		ctxt = kzalloc(sizeof(*ctxt), GFP_KERNEL);
+		if (ctxt == NULL)
+			return -ENOMEM;
+
+		gdt = get_cpu_gdt_table(cpu);
 
 #ifdef CONFIG_X86_32
-	/* Note: PVH is not yet supported on x86_32. */
-	ctxt->user_regs.fs = __KERNEL_PERCPU;
-	ctxt->user_regs.gs = __KERNEL_STACK_CANARY;
+		/* Note: PVH is not yet supported on x86_32. */
+		ctxt->user_regs.fs = __KERNEL_PERCPU;
+		ctxt->user_regs.gs = __KERNEL_STACK_CANARY;
 #endif
-	memset(&ctxt->fpu_ctxt, 0, sizeof(ctxt->fpu_ctxt));
+		memset(&ctxt->fpu_ctxt, 0, sizeof(ctxt->fpu_ctxt));
 
-	if (!xen_feature(XENFEAT_auto_translated_physmap)) {
-		ctxt->user_regs.eip = (unsigned long)cpu_bringup_and_idle;
-		ctxt->flags = VGCF_IN_KERNEL;
-		ctxt->user_regs.eflags = 0x1000; /* IOPL_RING1 */
-		ctxt->user_regs.ds = __USER_DS;
-		ctxt->user_regs.es = __USER_DS;
-		ctxt->user_regs.ss = __KERNEL_DS;
+		if (!xen_feature(XENFEAT_auto_translated_physmap)) {
+			ctxt->user_regs.eip =
+				(unsigned long)cpu_bringup_and_idle;
+			ctxt->flags = VGCF_IN_KERNEL;
+			ctxt->user_regs.eflags = 0x1000; /* IOPL_RING1 */
+			ctxt->user_regs.ds = __USER_DS;
+			ctxt->user_regs.es = __USER_DS;
+			ctxt->user_regs.ss = __KERNEL_DS;
 
-		xen_copy_trap_info(ctxt->trap_ctxt);
+			xen_copy_trap_info(ctxt->trap_ctxt);
 
-		ctxt->ldt_ents = 0;
+			ctxt->ldt_ents = 0;
 
-		BUG_ON((unsigned long)gdt & ~PAGE_MASK);
+			BUG_ON((unsigned long)gdt & ~PAGE_MASK);
 
-		gdt_mfn = arbitrary_virt_to_mfn(gdt);
-		make_lowmem_page_readonly(gdt);
-		make_lowmem_page_readonly(mfn_to_virt(gdt_mfn));
+			gdt_mfn = arbitrary_virt_to_mfn(gdt);
+			make_lowmem_page_readonly(gdt);
+			make_lowmem_page_readonly(mfn_to_virt(gdt_mfn));
 
-		ctxt->gdt_frames[0] = gdt_mfn;
-		ctxt->gdt_ents      = GDT_ENTRIES;
+			ctxt->gdt_frames[0] = gdt_mfn;
+			ctxt->gdt_ents      = GDT_ENTRIES;
 
-		ctxt->kernel_ss = __KERNEL_DS;
-		ctxt->kernel_sp = idle->thread.sp0;
+			ctxt->kernel_ss = __KERNEL_DS;
+			ctxt->kernel_sp = idle->thread.sp0;
 
 #ifdef CONFIG_X86_32
-		ctxt->event_callback_cs     = __KERNEL_CS;
-		ctxt->failsafe_callback_cs  = __KERNEL_CS;
+			ctxt->event_callback_cs     = __KERNEL_CS;
+			ctxt->failsafe_callback_cs  = __KERNEL_CS;
 #else
-		ctxt->gs_base_kernel = per_cpu_offset(cpu);
+			ctxt->gs_base_kernel = per_cpu_offset(cpu);
 #endif
-		ctxt->event_callback_eip    =
-			(unsigned long)xen_hypervisor_callback;
-		ctxt->failsafe_callback_eip =
-			(unsigned long)xen_failsafe_callback;
-		ctxt->user_regs.cs = __KERNEL_CS;
-		per_cpu(xen_cr3, cpu) = __pa(swapper_pg_dir);
-	}
+			ctxt->event_callback_eip    =
+				(unsigned long)xen_hypervisor_callback;
+			ctxt->failsafe_callback_eip =
+				(unsigned long)xen_failsafe_callback;
+			ctxt->user_regs.cs = __KERNEL_CS;
+			per_cpu(xen_cr3, cpu) = __pa(swapper_pg_dir);
+		}
 #ifdef CONFIG_XEN_PVH
-	else {
-		/*
-		 * The vcpu comes on kernel page tab
[PATCH v2 07/11] xen/hvmlite: Initialize context for secondary VCPUs
Signed-off-by: Boris Ostrovsky <boris.ostrov...@oracle.com> --- arch/x86/xen/smp.c | 57 arch/x86/xen/smp.h |4 +++ arch/x86/xen/xen-hvmlite.S |7 + 3 files changed, 63 insertions(+), 5 deletions(-) diff --git a/arch/x86/xen/smp.c b/arch/x86/xen/smp.c index 5fc4afb..b265c4f 100644 --- a/arch/x86/xen/smp.c +++ b/arch/x86/xen/smp.c @@ -27,6 +27,7 @@ #include #include #include +#include #include #include @@ -384,6 +385,7 @@ cpu_initialize_context(unsigned int cpu, struct task_struct *idle) struct vcpu_guest_context *ctxt; struct desc_struct *gdt; unsigned long gdt_mfn; + void *ctxt_arg; /* used to tell cpu_init() that it can proceed with initialization */ cpumask_set_cpu(cpu, cpu_callout_mask); @@ -392,7 +394,7 @@ cpu_initialize_context(unsigned int cpu, struct task_struct *idle) if (!xen_hvmlite) { - ctxt = kzalloc(sizeof(*ctxt), GFP_KERNEL); + ctxt_arg = ctxt = kzalloc(sizeof(*ctxt), GFP_KERNEL); if (ctxt == NULL) return -ENOMEM; @@ -460,14 +462,59 @@ cpu_initialize_context(unsigned int cpu, struct task_struct *idle) ctxt->user_regs.esp = idle->thread.sp0 - sizeof(struct pt_regs); ctxt->ctrlreg[3] = xen_pfn_to_cr3(virt_to_gfn(swapper_pg_dir)); } else { - ctxt = NULL; /* To quiet down compiler */ - BUG(); +#ifdef CONFIG_XEN_PVHVM + struct vcpu_hvm_context *hctxt; + + ctxt_arg = hctxt = kzalloc(sizeof(*hctxt), GFP_KERNEL); + if (hctxt == NULL) + return -ENOMEM; + +#ifdef CONFIG_X86_64 + hctxt->mode = VCPU_HVM_MODE_64B; + hctxt->cpu_regs.x86_64.rip = + (unsigned long)secondary_startup_64; + hctxt->cpu_regs.x86_64.rsp = stack_start; + + hctxt->cpu_regs.x86_64.cr0 = + X86_CR0_PG | X86_CR0_WP | X86_CR0_PE; + hctxt->cpu_regs.x86_64.cr4 = X86_CR4_PAE; + hctxt->cpu_regs.x86_64.cr3 = + xen_pfn_to_cr3(virt_to_mfn(init_level4_pgt)); + hctxt->cpu_regs.x86_64.efer = EFER_LME | EFER_NX; +#else + hctxt->mode = VCPU_HVM_MODE_32B; + /* +* startup_32_smp expects GDT loaded so we can't jump +* there directly. 
+*/ + hctxt->cpu_regs.x86_32.eip = + (unsigned long)hvmlite_smp_32 - __START_KERNEL_map; + + hctxt->cpu_regs.x86_32.cr0 = X86_CR0_PE; + + hctxt->cpu_regs.x86_32.cs_base = 0; + hctxt->cpu_regs.x86_32.cs_limit = ~0u; + hctxt->cpu_regs.x86_32.cs_ar = 0xc9b; + hctxt->cpu_regs.x86_32.ds_base = 0; + hctxt->cpu_regs.x86_32.ds_limit = ~0u; + hctxt->cpu_regs.x86_32.ds_ar = 0xc93; + hctxt->cpu_regs.x86_32.es_base = 0; + hctxt->cpu_regs.x86_32.es_limit = ~0u; + hctxt->cpu_regs.x86_32.es_ar = 0xc93; + hctxt->cpu_regs.x86_32.ss_base = 0; + hctxt->cpu_regs.x86_32.ss_limit = ~0u; + hctxt->cpu_regs.x86_32.ss_ar = 0xc93; + hctxt->cpu_regs.x86_32.tr_base = 0; + hctxt->cpu_regs.x86_32.tr_limit = 0xff; + hctxt->cpu_regs.x86_32.tr_ar = 0x8b; +#endif +#endif } - if (HYPERVISOR_vcpu_op(VCPUOP_initialise, cpu, ctxt)) + if (HYPERVISOR_vcpu_op(VCPUOP_initialise, cpu, ctxt_arg)) BUG(); - kfree(ctxt); + kfree(ctxt_arg); return 0; } diff --git a/arch/x86/xen/smp.h b/arch/x86/xen/smp.h index 963d62a..b4a833c 100644 --- a/arch/x86/xen/smp.h +++ b/arch/x86/xen/smp.h @@ -8,6 +8,10 @@ extern void xen_send_IPI_allbutself(int vector); extern void xen_send_IPI_all(int vector); extern void xen_send_IPI_self(int vector); +#ifdef CONFIG_X86_32 +extern void hvmlite_smp_32(void); +#endif + #ifdef CONFIG_XEN_PVH extern void xen_pvh_early_cpu_init(int cpu, bool entry); #else diff --git a/arch/x86/xen/xen-hvmlite.S b/arch/x86/xen/xen-hvmlite.S index fc7c08c..805e6a0 100644 --- a/arch/x86/xen/xen-hvmlite.S +++ b/arch/x86/xen/xen-hvmlite.S @@ -144,6 +144,13 @@ ENTRY(hvmlite_start_xen) ljmp$0x10, $_pa(startup_32) #endif +#ifdef CONFIG_X86_32 +ENTRY(hvmlite_smp_32) +mov $_pa(boot_gdt_descr), %eax +lgdt (%eax) +jmp startup_32_smp +#endif + .data gdt: .word gdt_end - gdt -- 1.7.1
[PATCH v2 09/11] xen/hvmlite: Use x86's default timer init for HVMlite guests
xen_timer_init() will be called from apic_bsp_setup().

Signed-off-by: Boris Ostrovsky <boris.ostrov...@oracle.com>
---
 arch/x86/xen/time.c | 5 ++++-
 1 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
index f1ba6a0..d77b398 100644
--- a/arch/x86/xen/time.c
+++ b/arch/x86/xen/time.c
@@ -492,7 +492,10 @@ void __init xen_init_time_ops(void)
 {
 	pv_time_ops = xen_time_ops;
 
-	x86_init.timers.timer_init = xen_time_init;
+	if (!xen_hvmlite)
+		x86_init.timers.timer_init = xen_time_init;
+	else
+		x86_init.timers.timer_init = x86_init_noop;
 	x86_init.timers.setup_percpu_clockev = x86_init_noop;
 	x86_cpuinit.setup_percpu_clockev = x86_init_noop;
-- 
1.7.1
[PATCH v2 01/11] xen/hvmlite: Import hvmlite-related Xen public interfaces
Signed-off-by: Boris Ostrovsky <boris.ostrov...@oracle.com> --- include/xen/interface/elfnote.h | 12 +++- include/xen/interface/hvm/hvm_vcpu.h | 143 ++ include/xen/interface/xen.h | 24 ++ 3 files changed, 178 insertions(+), 1 deletions(-) create mode 100644 include/xen/interface/hvm/hvm_vcpu.h diff --git a/include/xen/interface/elfnote.h b/include/xen/interface/elfnote.h index f90b034..9e9f9bf 100644 --- a/include/xen/interface/elfnote.h +++ b/include/xen/interface/elfnote.h @@ -193,9 +193,19 @@ #define XEN_ELFNOTE_SUPPORTED_FEATURES 17 /* + * Physical entry point into the kernel. + * + * 32bit entry point into the kernel. When requested to launch the + * guest kernel in a HVM container, Xen will use this entry point to + * launch the guest in 32bit protected mode with paging disabled. + * Ignored otherwise. + */ +#define XEN_ELFNOTE_PHYS32_ENTRY 18 + +/* * The number of the highest elfnote defined. */ -#define XEN_ELFNOTE_MAX XEN_ELFNOTE_SUPPORTED_FEATURES +#define XEN_ELFNOTE_MAX XEN_ELFNOTE_PHYS32_ENTRY #endif /* __XEN_PUBLIC_ELFNOTE_H__ */ diff --git a/include/xen/interface/hvm/hvm_vcpu.h b/include/xen/interface/hvm/hvm_vcpu.h new file mode 100644 index 000..32ca83e --- /dev/null +++ b/include/xen/interface/hvm/hvm_vcpu.h @@ -0,0 +1,143 @@ +/* + * Permission is hereby granted, free of charge, to any person obtaining a copy + * of this software and associated documentation files (the "Software"), to + * deal in the Software without restriction, including without limitation the + * rights to use, copy, modify, merge, publish, distribute, sublicense, and/or + * sell copies of the Software, and to permit persons to whom the Software is + * furnished to do so, subject to the following conditions: + * + * The above copyright notice and this permission notice shall be included in + * all copies or substantial portions of the Software. 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE + * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING + * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER + * DEALINGS IN THE SOFTWARE. + * + * Copyright (c) 2015, Roger Pau Monne <roger@citrix.com> + */ + +#ifndef __XEN_PUBLIC_HVM_HVM_VCPU_H__ +#define __XEN_PUBLIC_HVM_HVM_VCPU_H__ + +#include "../xen.h" + +struct vcpu_hvm_x86_32 { +uint32_t eax; +uint32_t ecx; +uint32_t edx; +uint32_t ebx; +uint32_t esp; +uint32_t ebp; +uint32_t esi; +uint32_t edi; +uint32_t eip; +uint32_t eflags; + +uint32_t cr0; +uint32_t cr3; +uint32_t cr4; + +uint32_t pad1; + +/* + * EFER should only be used to set the NXE bit (if required) + * when starting a vCPU in 32bit mode with paging enabled or + * to set the LME/LMA bits in order to start the vCPU in + * compatibility mode. + */ +uint64_t efer; + +uint32_t cs_base; +uint32_t ds_base; +uint32_t ss_base; +uint32_t es_base; +uint32_t tr_base; +uint32_t cs_limit; +uint32_t ds_limit; +uint32_t ss_limit; +uint32_t es_limit; +uint32_t tr_limit; +uint16_t cs_ar; +uint16_t ds_ar; +uint16_t ss_ar; +uint16_t es_ar; +uint16_t tr_ar; + +uint16_t pad2[3]; +}; + +/* + * The layout of the _ar fields of the segment registers is the + * following: + * + * Bits [0,3]: type (bits 40-43). + * Bit4: s(descriptor type, bit 44). + * Bit[5,6]: dpl (descriptor privilege level, bits 45-46). + * Bit7: p(segment-present, bit 47). + * Bit8: avl (available for system software, bit 52). + * Bit9: l(64-bit code segment, bit 53). + * Bit 10: db (meaning depends on the segment, bit 54). + * Bit 11: g(granularity, bit 55) + * Bits [12,15]: unused, must be blank. 
+ * + * A more complete description of the meaning of this fields can be + * obtained from the Intel SDM, Volume 3, section 3.4.5. + */ + +struct vcpu_hvm_x86_64 { +uint64_t rax; +uint64_t rcx; +uint64_t rdx; +uint64_t rbx; +uint64_t rsp; +uint64_t rbp; +uint64_t rsi; +uint64_t rdi; +uint64_t rip; +uint64_t rflags; + +uint64_t cr0; +uint64_t cr3; +uint64_t cr4; +uint64_t efer; + +/* + * Using VCPU_HVM_MODE_64B implies that the vCPU is launched + * directly in long mode, so the cached parts of the segment + * registers get set to match that environment. + * + * If the user wants to launch the vCPU in compatibility mode + * the 32-bit structure should be used instead. + */ +}; + +struct vcpu_hvm_context { +#define VCPU_HVM_MODE_32B 0 /* 32bit fields
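The _ar bit layout described above can be cross-checked against the access-rights values that patch 07 loads into these fields (0xc9b for the code segment, 0xc93 for data segments, 0x8b for the TSS). A small sketch, assuming only the bit positions quoted in the header comment; the helper name is illustrative, not part of any API:

```c
#include <stdint.h>

/*
 * Compose a segment _ar value from the documented layout:
 * type [0,3], s bit 4, dpl [5,6], p bit 7, avl bit 8,
 * l bit 9, db bit 10, g bit 11; bits [12,15] must be zero.
 */
static uint16_t seg_ar(unsigned type, unsigned s, unsigned dpl,
		       unsigned p, unsigned avl, unsigned l,
		       unsigned db, unsigned g)
{
	return (uint16_t)(type | (s << 4) | (dpl << 5) | (p << 7) |
			  (avl << 8) | (l << 9) | (db << 10) | (g << 11));
}
```

For example, 0xc9b is an accessed execute/read code segment (type 0xb), s=1, dpl=0, present, with db and g set; 0x8b is a busy 32-bit TSS (type 0xb, s=0, present).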
[PATCH v2 08/11] xen/hvmlite: Extend APIC operations for HVMlite guests
HVMlite guests need to be viewed as having an APIC, otherwise smpboot code, for example, will complain.

Signed-off-by: Boris Ostrovsky <boris.ostrov...@oracle.com>
---
Not sure about xen_cpu_present_to_apicid() being an identity function, given xen_x86_32_early_logical_apicid().

Suspend/resume will cause xen_apic_write() warnings.

 arch/x86/xen/apic.c | 39 +++++++++++++++++++++++++++++++++++++--
 arch/x86/xen/smp.c  |  2 +-
 2 files changed, 38 insertions(+), 3 deletions(-)

diff --git a/arch/x86/xen/apic.c b/arch/x86/xen/apic.c
index acda713..9bf27f2 100644
--- a/arch/x86/xen/apic.c
+++ b/arch/x86/xen/apic.c
@@ -6,6 +6,7 @@
 #include
 #include
+#include
 #include "xen-ops.h"
 #include "pmu.h"
 #include "smp.h"
@@ -78,6 +79,21 @@ static void xen_apic_write(u32 reg, u32 val)
 		return;
 	}
 
+	if (xen_hvmlite) {
+		switch (reg) {
+		case APIC_TASKPRI:
+		case APIC_SPIV:
+		case APIC_ESR:
+		case APIC_LVTT:
+		case APIC_LVT0:
+		case APIC_LVT1:
+		case APIC_LVTERR:
+			pr_debug("Unimplemented APIC register %x,"
+				 " value: %x\n", reg, val);
+			return;
+		}
+	}
+
 	/* Warn to see if there's any stray references */
 	WARN(1, "register: %x, value: %x\n", reg, val);
 }
@@ -100,7 +116,7 @@ static u32 xen_safe_apic_wait_icr_idle(void)
 static int xen_apic_probe_pv(void)
 {
-	if (xen_pv_domain())
+	if (xen_pv_domain() || xen_hvmlite)
 		return 1;
 
 	return 0;
@@ -142,6 +158,19 @@ static void xen_silent_inquire(int apicid)
 {
 }
 
+static int xen_cpu_present_to_apicid(int cpu)
+{
+	return cpu;
+}
+
+static int xen_wakeup_secondary_cpu(int cpu, unsigned long start_eip)
+{
+	if (!xen_hvmlite)
+		return -EINVAL;
+
+	return HYPERVISOR_vcpu_op(VCPUOP_up, cpu, NULL);
+}
+
 static struct apic xen_pv_apic = {
 	.name				= "Xen PV",
 	.probe				= xen_apic_probe_pv,
@@ -162,7 +191,7 @@ static struct apic xen_pv_apic = {
 	.ioapic_phys_id_map		= default_ioapic_phys_id_map, /* Used on 32-bit */
 	.setup_apic_routing		= NULL,
-	.cpu_present_to_apicid		= default_cpu_present_to_apicid,
+	.cpu_present_to_apicid		= xen_cpu_present_to_apicid,
 	.apicid_to_cpu_present		= physid_set_mask_of_physid, /* Used on 32-bit */
 	.check_phys_apicid_present	= default_check_phys_apicid_present, /* smp_sanity_check needs it */
 	.phys_pkg_id			= xen_phys_pkg_id, /* detect_ht */
@@ -180,6 +209,9 @@ static struct apic xen_pv_apic = {
 	.send_IPI_all			= xen_send_IPI_all,
 	.send_IPI_self			= xen_send_IPI_self,
 #endif
+
+	.wakeup_secondary_cpu		= xen_wakeup_secondary_cpu,
+
 	/* .wait_for_init_deassert - used by AP bootup - smp_callin which we don't use */
 	.inquire_remote_apic		= xen_silent_inquire,
@@ -216,5 +248,8 @@ void __init xen_init_apic(void)
 	apic = &xen_pv_apic;
 	x86_platform.apic_post_init = xen_apic_check;
+
+	if (xen_hvmlite)
+		setup_force_cpu_cap(X86_FEATURE_APIC);
 }
 apic_driver(xen_pv_apic);
diff --git a/arch/x86/xen/smp.c b/arch/x86/xen/smp.c
index b265c4f..fb085ef 100644
--- a/arch/x86/xen/smp.c
+++ b/arch/x86/xen/smp.c
@@ -214,7 +214,7 @@ static int xen_smp_intr_init(unsigned int cpu)
 	 * The IRQ worker on PVHVM goes through the native path and uses the
 	 * IPI mechanism.
 	 */
-	if (xen_hvm_domain())
+	if (xen_hvm_domain() && !xen_hvmlite)
 		return 0;
 
 	callfunc_name = kasprintf(GFP_KERNEL, "irqwork%d", cpu);
-- 
1.7.1
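The filter added to xen_apic_write() above amounts to a whitelist of registers whose writes are silently dropped for HVMlite, while anything else still triggers the WARN. A user-space sketch of that predicate (register offsets as defined in asm/apicdef.h; the helper name is illustrative only):

```c
#include <stdbool.h>
#include <stdint.h>

/* APIC register offsets, as in asm/apicdef.h. */
#define APIC_TASKPRI	0x80
#define APIC_SPIV	0xF0
#define APIC_ESR	0x280
#define APIC_LVTT	0x320
#define APIC_LVT0	0x350
#define APIC_LVT1	0x360
#define APIC_LVTERR	0x370

/* Writes to these registers are ignored for HVMlite; others warn. */
static bool hvmlite_apic_write_ignored(uint32_t reg)
{
	switch (reg) {
	case APIC_TASKPRI:
	case APIC_SPIV:
	case APIC_ESR:
	case APIC_LVTT:
	case APIC_LVT0:
	case APIC_LVT1:
	case APIC_LVTERR:
		return true;
	default:
		return false;
	}
}
```

This is why suspend/resume still warns: the registers touched on that path (e.g. the ICR) are not in the ignored set.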
[PATCH v2 05/11] xen/hvmlite: HVMlite guests always have PV devices
Signed-off-by: Boris Ostrovsky <boris.ostrov...@oracle.com>
---
 arch/x86/xen/platform-pci-unplug.c | 4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/xen/platform-pci-unplug.c b/arch/x86/xen/platform-pci-unplug.c
index 9586ff3..802ec90 100644
--- a/arch/x86/xen/platform-pci-unplug.c
+++ b/arch/x86/xen/platform-pci-unplug.c
@@ -73,8 +73,8 @@ bool xen_has_pv_devices(void)
 	if (!xen_domain())
 		return false;
 
-	/* PV domains always have them. */
-	if (xen_pv_domain())
+	/* PV and HVMlite domains always have them. */
+	if (xen_pv_domain() || xen_hvmlite)
 		return true;
 
 	/* And user has xen_platform_pci=0 set in guest config as
-- 
1.7.1
[PATCH v2 02/11] xen/hvmlite: Bootstrap HVMlite guest
Start HVMlite guest at XEN_ELFNOTE_PHYS32_ENTRY address. Setup hypercall page, initialize boot_params, enable early page tables.

Since this stub is executed before kernel entry point we cannot use variables in .bss which is cleared by kernel. We explicitly place variables that are initialized here into .data.

Signed-off-by: Boris Ostrovsky <boris.ostrov...@oracle.com>
---
 arch/x86/xen/Makefile      |   1 +
 arch/x86/xen/enlighten.c   |  86 ++++++++++++-
 arch/x86/xen/xen-hvmlite.S | 175 ++++++++++++++++++++++++++
 include/xen/xen.h          |   6 ++
 4 files changed, 267 insertions(+), 1 deletions(-)
 create mode 100644 arch/x86/xen/xen-hvmlite.S

diff --git a/arch/x86/xen/Makefile b/arch/x86/xen/Makefile
index e47e527..1d913d7 100644
--- a/arch/x86/xen/Makefile
+++ b/arch/x86/xen/Makefile
@@ -23,3 +23,4 @@ obj-$(CONFIG_XEN_DEBUG_FS)	+= debugfs.o
 obj-$(CONFIG_XEN_DOM0)		+= vga.o
 obj-$(CONFIG_SWIOTLB_XEN)	+= pci-swiotlb-xen.o
 obj-$(CONFIG_XEN_EFI)		+= efi.o
+obj-$(CONFIG_XEN_PVHVM)	+= xen-hvmlite.o
diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
index 5774800..5f05fa2 100644
--- a/arch/x86/xen/enlighten.c
+++ b/arch/x86/xen/enlighten.c
@@ -118,7 +118,8 @@ DEFINE_PER_CPU(struct vcpu_info *, xen_vcpu);
  */
 DEFINE_PER_CPU(struct vcpu_info, xen_vcpu_info);
 
-enum xen_domain_type xen_domain_type = XEN_NATIVE;
+enum xen_domain_type xen_domain_type
+	__attribute__((section(".data"))) = XEN_NATIVE;
 EXPORT_SYMBOL_GPL(xen_domain_type);
 
 unsigned long *machine_to_phys_mapping = (void *)MACH2PHYS_VIRT_START;
@@ -171,6 +172,17 @@ struct tls_descs {
  */
 static DEFINE_PER_CPU(struct tls_descs, shadow_tls_desc);
 
+#ifdef CONFIG_XEN_PVHVM
+/*
+ * HVMlite variables. These need to live in data segment since they are
+ * initialized before startup_{32|64}, which clear .bss, are invoked.
+ */
+int xen_hvmlite __attribute__((section(".data"))) = 0;
+struct hvm_start_info hvmlite_start_info __attribute__((section(".data")));
+uint hvmlite_start_info_sz = sizeof(hvmlite_start_info);
+struct boot_params xen_hvmlite_boot_params __attribute__((section(".data")));
+#endif
+
 static void clamp_max_cpus(void)
 {
 #ifdef CONFIG_SMP
@@ -1731,6 +1743,78 @@ asmlinkage __visible void __init xen_start_kernel(void)
 #endif
 }
 
+#ifdef CONFIG_XEN_PVHVM
+static void __init hvmlite_bootparams(void)
+{
+	struct xen_memory_map memmap;
+	int i;
+
+	memset(&xen_hvmlite_boot_params, 0, sizeof(xen_hvmlite_boot_params));
+
+	memmap.nr_entries = ARRAY_SIZE(xen_hvmlite_boot_params.e820_map);
+	set_xen_guest_handle(memmap.buffer, xen_hvmlite_boot_params.e820_map);
+	if (HYPERVISOR_memory_op(XENMEM_memory_map, &memmap)) {
+		xen_raw_console_write("XENMEM_memory_map failed\n");
+		BUG();
+	}
+
+	xen_hvmlite_boot_params.e820_map[memmap.nr_entries].addr =
+		ISA_START_ADDRESS;
+	xen_hvmlite_boot_params.e820_map[memmap.nr_entries].size =
+		ISA_END_ADDRESS - ISA_START_ADDRESS;
+	xen_hvmlite_boot_params.e820_map[memmap.nr_entries++].type =
+		E820_RESERVED;
+
+	sanitize_e820_map(xen_hvmlite_boot_params.e820_map,
+			  ARRAY_SIZE(xen_hvmlite_boot_params.e820_map),
+			  &memmap.nr_entries);
+
+	xen_hvmlite_boot_params.e820_entries = memmap.nr_entries;
+	for (i = 0; i < xen_hvmlite_boot_params.e820_entries; i++)
+		e820_add_region(xen_hvmlite_boot_params.e820_map[i].addr,
+				xen_hvmlite_boot_params.e820_map[i].size,
+				xen_hvmlite_boot_params.e820_map[i].type);
+
+	xen_hvmlite_boot_params.hdr.cmd_line_ptr =
+		hvmlite_start_info.cmdline_paddr;
+
+	/* The first module is always ramdisk */
+	if (hvmlite_start_info.nr_modules) {
+		struct hvm_modlist_entry *modaddr =
+			__va(hvmlite_start_info.modlist_paddr);
+		xen_hvmlite_boot_params.hdr.ramdisk_image = modaddr->paddr;
+		xen_hvmlite_boot_params.hdr.ramdisk_size = modaddr->size;
+	}
+
+	/*
+	 * See Documentation/x86/boot.txt.
+	 *
+	 * Version 2.12 supports Xen entry point but we will use default x86/PC
+	 * environment (i.e. hardware_subarch 0).
+	 */
+	xen_hvmlite_boot_params.hdr.version = 0x212;
+	xen_hvmlite_boot_params.hdr.type_of_loader = 9; /* Xen loader */
+}
+
+/*
+ * This routine (and those that it might call) should not use
+ * anything that lives in .bss since that segment will be cleared later
+ */
+void __init xen_prepare_hvmlite(void)
+{
+	u32 eax, ecx, edx, msr;
+	u64 pfn;
+
+	xen_hvmlite = 1;
+
+	cpuid(xen_cpuid_base() + 2, &eax, &msr, &ecx, &edx);
+	pfn = __pa(hypercall_page);
+	wrmsr_safe(msr, (u32)pfn, (u32)(pfn >> 32));
+
+	hvmli
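hvmlite_bootparams() above appends a reserved e820 entry covering the legacy ISA hole ([0xa0000, 0x100000)) to the hypervisor-provided map before sanitizing it. A minimal user-space sketch of that fixup; the struct and helper below are simplified stand-ins for the kernel types, not kernel API:

```c
#include <stdint.h>

#define ISA_START_ADDRESS 0xa0000
#define ISA_END_ADDRESS   0x100000
#define E820_RESERVED     2

/* Simplified stand-in for the kernel's e820 entry layout. */
struct e820entry {
	uint64_t addr;
	uint64_t size;
	uint32_t type;
};

/*
 * Append a reserved entry for the legacy ISA hole after the nr
 * hypervisor-provided entries; returns the new entry count.
 */
static int append_isa_hole(struct e820entry *map, int nr)
{
	map[nr].addr = ISA_START_ADDRESS;
	map[nr].size = ISA_END_ADDRESS - ISA_START_ADDRESS;
	map[nr].type = E820_RESERVED;
	return nr + 1;
}

/* The hole is 384 KiB: 0x100000 - 0xa0000 = 0x60000 bytes. */
static uint64_t isa_hole_size(void)
{
	return ISA_END_ADDRESS - ISA_START_ADDRESS;
}
```

Reserving this range keeps the kernel from treating the VGA/BIOS window as usable RAM, matching what a BIOS-provided map would report on bare metal.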
[PATCH v2 00/11] HVMlite domU support
This series introduces HVMlite support for unprivileged guests.

It has been tested on Intel/AMD, both 32- and 64-bit, including CPU on- and offlining and save/restore. (Restore will result in APIC write warnings, which exist now for 32-bit PV guests as well, so I didn't address this in this series.)

Compile-tested on ARM.

v2:
* Dropped first patch (don't use initial_page_table/initial_pg_pmd and start_secondary)
* Much simplified kernel init code (after realizing that xen_hvm_guest_init()/init_hvm_pv_info() already do most of it). As a result, factoring out of xen_start_kernel() is no longer necessary. (patch 3)
* Dropped pcifront use until we decide that we *are* using it. (patch 5)

Boris Ostrovsky (11):
  xen/hvmlite: Import hvmlite-related Xen public interfaces
  xen/hvmlite: Bootstrap HVMlite guest
  xen/hvmlite: Initialize HVMlite kernel
  xen/hvmlite: Allow HVMlite guests delay initializing grant table
  xen/hvmlite: HVMlite guests always have PV devices
  xen/hvmlite: Prepare cpu_initialize_context() routine for HVMlite SMP
  xen/hvmlite: Initialize context for secondary VCPUs
  xen/hvmlite: Extend APIC operations for HVMlite guests
  xen/hvmlite: Use x86's default timer init for HVMlite guests
  xen/hvmlite: Boot secondary CPUs
  xen/hvmlite: Enable CPU on-/offlining

 arch/x86/xen/Makefile                |   1 +
 arch/x86/xen/apic.c                  |  39 ++++-
 arch/x86/xen/enlighten.c             | 106 ++++++++++++-
 arch/x86/xen/grant-table.c           |   4 +-
 arch/x86/xen/platform-pci-unplug.c   |   4 +-
 arch/x86/xen/pmu.c                   |   4 +-
 arch/x86/xen/smp.c                   | 252 ++++++++++++++++++++++-------
 arch/x86/xen/smp.h                   |   4 +
 arch/x86/xen/time.c                  |   5 +-
 arch/x86/xen/xen-hvmlite.S           | 190 ++++++++++++++++++++++
 drivers/xen/grant-table.c            |   8 +-
 include/xen/interface/elfnote.h      |  12 ++-
 include/xen/interface/hvm/hvm_vcpu.h | 143 ++++++++++++++++
 include/xen/interface/xen.h          |  24 +++
 include/xen/xen.h                    |   6 +
 15 files changed, 706 insertions(+), 96 deletions(-)
 create mode 100644 arch/x86/xen/xen-hvmlite.S
 create mode 100644 include/xen/interface/hvm/hvm_vcpu.h
[PATCH v2 03/11] xen/hvmlite: Initialize HVMlite kernel
HVMlite guests need to make a few additional initialization calls compared to regular HVM guests.

Signed-off-by: Boris Ostrovsky <boris.ostrov...@oracle.com>
---
 arch/x86/xen/enlighten.c | 18 ++++++++++++------
 1 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
index 5f05fa2..1409de6 100644
--- a/arch/x86/xen/enlighten.c
+++ b/arch/x86/xen/enlighten.c
@@ -1863,15 +1863,21 @@ static void __init init_hvm_pv_info(void)
 	minor = eax & 0xffff;
 	printk(KERN_INFO "Xen version %d.%d.\n", major, minor);
 
-	cpuid(base + 2, &pages, &msr, &ecx, &edx);
-
-	pfn = __pa(hypercall_page);
-	wrmsr_safe(msr, (u32)pfn, (u32)(pfn >> 32));
+	/* HVMlite set up hypercall page earlier in xen_prepare_hvmlite() */
+	if (xen_hvmlite) {
+		pv_info.name = "Xen HVMlite";
+		pv_info.paravirt_enabled = 1;
+		xen_init_apic();
+		machine_ops = xen_machine_ops;
+	} else {
+		pv_info.name = "Xen HVM";
+		cpuid(base + 2, &pages, &msr, &ecx, &edx);
+		pfn = __pa(hypercall_page);
+		wrmsr_safe(msr, (u32)pfn, (u32)(pfn >> 32));
+	}
 
 	xen_setup_features();
 
-	pv_info.name = "Xen HVM";
-
 	xen_domain_type = XEN_HVM_DOMAIN;
 }
-- 
1.7.1
[PATCH v2 10/11] xen/hvmlite: Boot secondary CPUs
HVMlite secondary VCPUs use the baremetal bringup path (i.e. native_* smp_ops) but need to do some preparation in PV code.

Signed-off-by: Boris Ostrovsky <boris.ostrov...@oracle.com>
---
 arch/x86/xen/enlighten.c |  2 +
 arch/x86/xen/pmu.c       |  4 +-
 arch/x86/xen/smp.c       | 64 ++++++++++++++++++++++++++++++-----------
 3 files changed, 51 insertions(+), 19 deletions(-)

diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
index 1409de6..76fb0b2 100644
--- a/arch/x86/xen/enlighten.c
+++ b/arch/x86/xen/enlighten.c
@@ -1933,6 +1933,8 @@ static void __init xen_hvm_guest_init(void)
 		xen_have_vector_callback = 1;
 	xen_hvm_smp_init();
 	register_cpu_notifier(&xen_hvm_cpu_notifier);
+	if (xen_hvmlite)
+		smp_found_config = 1;
 	xen_unplug_emulated_devices();
 	x86_init.irqs.intr_init = xen_init_IRQ;
 	xen_hvm_init_time_ops();
diff --git a/arch/x86/xen/pmu.c b/arch/x86/xen/pmu.c
index 724a087..7bc209b 100644
--- a/arch/x86/xen/pmu.c
+++ b/arch/x86/xen/pmu.c
@@ -518,7 +518,7 @@ void xen_pmu_init(int cpu)
 	BUILD_BUG_ON(sizeof(struct xen_pmu_data) > PAGE_SIZE);
 
-	if (xen_hvm_domain())
+	if (xen_hvm_domain() && !xen_hvmlite)
 		return;
 
 	xenpmu_data = (struct xen_pmu_data *)get_zeroed_page(GFP_KERNEL);
@@ -556,7 +556,7 @@ void xen_pmu_finish(int cpu)
 {
 	struct xen_pmu_params xp;
 
-	if (xen_hvm_domain())
+	if (xen_hvm_domain() && !xen_hvmlite)
 		return;
 
 	xp.vcpu = cpu;
diff --git a/arch/x86/xen/smp.c b/arch/x86/xen/smp.c
index fb085ef..a22cae2 100644
--- a/arch/x86/xen/smp.c
+++ b/arch/x86/xen/smp.c
@@ -348,26 +348,31 @@ static void __init xen_smp_prepare_cpus(unsigned int max_cpus)
 	}
 	xen_init_lock_cpu(0);
 
-	smp_store_boot_cpu_info();
-	cpu_data(0).x86_max_cores = 1;
+	if (!xen_hvmlite) {
+		smp_store_boot_cpu_info();
+		cpu_data(0).x86_max_cores = 1;
+
+		for_each_possible_cpu(i) {
+			zalloc_cpumask_var(&per_cpu(cpu_sibling_map, i),
+					   GFP_KERNEL);
+			zalloc_cpumask_var(&per_cpu(cpu_core_map, i),
+					   GFP_KERNEL);
+			zalloc_cpumask_var(&per_cpu(cpu_llc_shared_map, i),
+					   GFP_KERNEL);
+		}
+		set_cpu_sibling_map(0);
 
-	for_each_possible_cpu(i) {
-		zalloc_cpumask_var(&per_cpu(cpu_sibling_map, i), GFP_KERNEL);
-		zalloc_cpumask_var(&per_cpu(cpu_core_map, i), GFP_KERNEL);
-		zalloc_cpumask_var(&per_cpu(cpu_llc_shared_map, i), GFP_KERNEL);
+		if (!alloc_cpumask_var(&xen_cpu_initialized_map, GFP_KERNEL))
+			panic("could not allocate xen_cpu_initialized_map\n");
+
+		cpumask_copy(xen_cpu_initialized_map, cpumask_of(0));
 	}
-	set_cpu_sibling_map(0);
 
 	xen_pmu_init(0);
 
 	if (xen_smp_intr_init(0))
 		BUG();
 
-	if (!alloc_cpumask_var(&xen_cpu_initialized_map, GFP_KERNEL))
-		panic("could not allocate xen_cpu_initialized_map\n");
-
-	cpumask_copy(xen_cpu_initialized_map, cpumask_of(0));
-
 	/* Restrict the possible_map according to max_cpus. */
 	while ((num_possible_cpus() > 1) && (num_possible_cpus() > max_cpus)) {
 		for (cpu = nr_cpu_ids - 1; !cpu_possible(cpu); cpu--)
@@ -375,8 +380,11 @@ static void __init xen_smp_prepare_cpus(unsigned int max_cpus)
 		set_cpu_possible(cpu, false);
 	}
 
-	for_each_possible_cpu(cpu)
+	for_each_possible_cpu(cpu) {
 		set_cpu_present(cpu, true);
+		if (xen_hvmlite)
+			physid_set(cpu, phys_cpu_present_map);
+	}
 }
 
 static int
@@ -810,10 +818,15 @@ void __init xen_smp_init(void)
 
 static void __init xen_hvm_smp_prepare_cpus(unsigned int max_cpus)
 {
+	if (xen_hvmlite)
+		xen_smp_prepare_cpus(max_cpus);
+
 	native_smp_prepare_cpus(max_cpus);
-	WARN_ON(xen_smp_intr_init(0));
 
-	xen_init_lock_cpu(0);
+	if (!xen_hvmlite) {
+		WARN_ON(xen_smp_intr_init(0));
+		xen_init_lock_cpu(0);
+	}
 }
 
 static int xen_hvm_cpu_up(unsigned int cpu, struct task_struct *tidle)
@@ -836,8 +849,21 @@ static int xen_hvm_cpu_up(unsigned int cpu, struct task_struct *tidle)
 	 */
 	rc = xen_smp_intr_init(cpu);
 	WARN_ON(rc);
-	if (!rc)
-		rc = native_cpu_up(cpu, tidle);
+
+	if (xen_hvmlite) {
+		rc = cpu_initialize_context(cpu, tidle);
+		if (rc) {
+			xen_smp_intr_free(cpu);
+			return rc;
+		}
+		xen_pmu_init(cpu);
+	}
+
+	if (!rc) {
[PATCH v2 11/11] xen/hvmlite: Enable CPU on-/offlining
When offlining, we should properly clean up interrupts and wait until hypervisor declares VCPU as down before cleaning up. After VCPU that was previously offlined is brought back to life we want to jump back to bare-metal entry points. It's a simple jump on 64-bit but requires minor tweaking for 32-bit case. Signed-off-by: Boris Ostrovsky <boris.ostrov...@oracle.com> --- arch/x86/xen/smp.c | 35 +-- arch/x86/xen/xen-hvmlite.S |8 2 files changed, 33 insertions(+), 10 deletions(-) diff --git a/arch/x86/xen/smp.c b/arch/x86/xen/smp.c index a22cae2..d768c4e 100644 --- a/arch/x86/xen/smp.c +++ b/arch/x86/xen/smp.c @@ -143,7 +143,7 @@ static void xen_smp_intr_free(unsigned int cpu) kfree(per_cpu(xen_callfuncsingle_irq, cpu).name); per_cpu(xen_callfuncsingle_irq, cpu).name = NULL; } - if (xen_hvm_domain()) + if (xen_hvm_domain() && !xen_hvmlite) return; if (per_cpu(xen_irq_work, cpu).irq >= 0) { @@ -585,7 +585,8 @@ static int xen_cpu_disable(void) static void xen_cpu_die(unsigned int cpu) { - while (xen_pv_domain() && HYPERVISOR_vcpu_op(VCPUOP_is_up, cpu, NULL)) { + while ((xen_pv_domain() || xen_hvmlite) && + HYPERVISOR_vcpu_op(VCPUOP_is_up, cpu, NULL)) { __set_current_state(TASK_UNINTERRUPTIBLE); schedule_timeout(HZ/10); } @@ -602,14 +603,28 @@ static void xen_play_dead(void) /* used only with HOTPLUG_CPU */ { play_dead_common(); HYPERVISOR_vcpu_op(VCPUOP_down, smp_processor_id(), NULL); - cpu_bringup(); - /* -* commit 4b0c0f294 (tick: Cleanup NOHZ per cpu data on cpu down) -* clears certain data that the cpu_idle loop (which called us -* and that we return from) expects. The only way to get that -* data back is to call: -*/ - tick_nohz_idle_enter(); + + if (!xen_hvm_domain()) { + cpu_bringup(); + /* +* commit 4b0c0f294 (tick: Cleanup NOHZ per cpu data on cpu +* down) clears certain data that the cpu_idle loop (which +* called us and that we return from) expects. 
The only way to +* get that data back is to call: +*/ + tick_nohz_idle_enter(); + } else { + /* +* For 64-bit we can jump directly to SMP entry point but for +* 32-bit we need to disable paging and load boot GDT (just +* like in cpu_initialize_context()). +*/ +#ifdef CONFIG_X86_64 + asm("jmp secondary_startup_64"); +#else + asm("jmp hvmlite_smp_32_hp"); +#endif + } } #else /* !CONFIG_HOTPLUG_CPU */ diff --git a/arch/x86/xen/xen-hvmlite.S b/arch/x86/xen/xen-hvmlite.S index 805e6a0..1db2e96 100644 --- a/arch/x86/xen/xen-hvmlite.S +++ b/arch/x86/xen/xen-hvmlite.S @@ -145,6 +145,14 @@ ENTRY(hvmlite_start_xen) #endif #ifdef CONFIG_X86_32 +ENTRY(hvmlite_smp_32_hp) + movl $_pa(initial_page_table), %eax + movl %eax, %cr3 + ljmp $__KERNEL_CS,$_pa(5f) +5: + movl $X86_CR0_PE, %eax + movl %eax, %cr0 + ENTRY(hvmlite_smp_32) mov $_pa(boot_gdt_descr), %eax lgdt (%eax) -- 1.7.1
[PATCH v2] xen/x86: Zero out .bss for PV guests
Baremetal kernels clear .bss early in the boot but Xen PV guests don't execute that code. They have been able to run without problems because Xen domain builder happens to give out zeroed pages. However, since this is not really guaranteed, .bss should be explicitly cleared. (Since we introduce macros for specifying 32- and 64-bit registers we can get rid of ifdefs in startup_xen()) Signed-off-by: Boris Ostrovsky <boris.ostrov...@oracle.com> Cc: sta...@vger.kernel.org --- arch/x86/xen/xen-head.S | 29 ++--- 1 file changed, 22 insertions(+), 7 deletions(-) diff --git a/arch/x86/xen/xen-head.S b/arch/x86/xen/xen-head.S index b65f59a..2af87d1 100644 --- a/arch/x86/xen/xen-head.S +++ b/arch/x86/xen/xen-head.S @@ -35,16 +35,31 @@ #define PVH_FEATURES (0) #endif - __INIT -ENTRY(startup_xen) - cld #ifdef CONFIG_X86_32 - mov %esi,xen_start_info - mov $init_thread_union+THREAD_SIZE,%esp +#define REG(register) %e##register +#define WSIZE_SHIFT 2 +#define STOS stosl #else - mov %rsi,xen_start_info - mov $init_thread_union+THREAD_SIZE,%rsp +#define REG(register) %r##register +#define WSIZE_SHIFT 3 +#define STOS stosq #endif + + __INIT +ENTRY(startup_xen) + cld + + /* Clear .bss */ + xor REG(ax),REG(ax) + mov $__bss_start,REG(di) + mov $__bss_stop,REG(cx) + sub REG(di),REG(cx) + shr $WSIZE_SHIFT,REG(cx) + rep STOS + + mov REG(si),xen_start_info + mov $init_thread_union+THREAD_SIZE,REG(sp) + jmp xen_start_kernel __FINIT -- 2.1.0
Re: [PATCH 1/2] hvc_xen: add earlycon support
On 02/24/2016 07:23 AM, Stefano Stabellini wrote: Introduce EARLYCON support in hvc_xen, useful for early debugging on arm and arm64, where xen early_printk is not available. Differently from xenboot_write_console on x86, we won't just return if !xen_pv_domain(), because arm and arm64 guests are actually xen_hvm_domain() from linux pov, and also because we want to capture all possible writes, including the ones earlier than xen_early_init(). Signed-off-by: Stefano Stabellini--- drivers/tty/hvc/hvc_xen.c | 48 ++--- 1 file changed, 37 insertions(+), 11 deletions(-) diff --git a/drivers/tty/hvc/hvc_xen.c b/drivers/tty/hvc/hvc_xen.c index fa816b7..34e8e9f 100644 --- a/drivers/tty/hvc/hvc_xen.c +++ b/drivers/tty/hvc/hvc_xen.c @@ -25,6 +25,7 @@ #include #include #include +#include #include #include @@ -597,21 +598,11 @@ static int xen_cons_init(void) } console_initcall(xen_cons_init); -#ifdef CONFIG_EARLY_PRINTK -static void xenboot_write_console(struct console *console, const char *string, - unsigned len) +static void domU_write_early_console(const char *string, unsigned len) { unsigned int linelen, off = 0; const char *pos; - if (!xen_pv_domain()) - return; - - dom0_write_console(0, string, len); - - if (xen_initial_domain()) - return; - domU_write_console(0, "(early) ", 8); while (off < len && NULL != (pos = strchr(string+off, '\n'))) { linelen = pos-string+off; @@ -625,6 +616,21 @@ static void xenboot_write_console(struct console *console, const char *string, domU_write_console(0, string+off, len-off); } +#ifdef CONFIG_EARLY_PRINTK +static void xenboot_write_console(struct console *console, const char *string, + unsigned len) +{ + if (!xen_pv_domain()) + return; This, BTW, will break PVHv2 code (which is not there yet). Your patch is not making the code any different from what it is now but I wonder whether xenboot_write_console() can be made to work for HVM guests at this time since you making changes there anyway. 
Or we can delay it until HVMlite/PVHv2 patches make it in. + + dom0_write_console(0, string, len); + + if (xen_initial_domain()) + return; + + domU_write_early_console(string, len); +} + struct console xenboot_console = { .name = "xenboot", .write = xenboot_write_console, @@ -664,3 +670,23 @@ void xen_raw_printk(const char *fmt, ...) xen_raw_console_write(buf); } + +static void xenboot_earlycon_write(struct console *console, + const char *string, + unsigned len) +{ + dom0_write_console(0, string, len); + + if (xen_initial_domain()) + return; + + domU_write_early_console(string, len); +} xenboot_earlycon_write() and xenboot_write_console() share most of the code, can it be factored out? -boris + +static int __init xenboot_earlycon_setup(struct earlycon_device *device, + const char *opt) +{ + device->con->write = xenboot_earlycon_write; + return 0; +} +EARLYCON_DECLARE(xenboot, xenboot_earlycon_setup);
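The line-splitting loop that this series factors into domU_write_early_console() can be exercised outside the kernel. Below is a sketch in plain C; `emit` is a hypothetical stand-in for domU_write_console() that collects output in a buffer instead of pushing it through the Xen console ring, so only the prefixing and newline handling are modeled.

```c
#include <stddef.h>
#include <string.h>

static char out[256];

/* Hypothetical stand-in for domU_write_console(): append to a buffer. */
static void emit(const char *s, size_t len)
{
    size_t cur = strlen(out);
    memcpy(out + cur, s, len);
    out[cur + len] = '\0';
}

/* Same shape as the domU early-write loop: prefix the message, write
 * each full line followed by "\r\n", then flush any trailing partial
 * line. */
static const char *early_write(const char *string, size_t len)
{
    size_t off = 0, linelen;
    const char *pos;

    out[0] = '\0';
    emit("(early) ", 8);
    while (off < len && (pos = memchr(string + off, '\n', len - off))) {
        linelen = (size_t)(pos - (string + off));
        emit(string + off, linelen);
        emit("\r\n", 2);
        off += linelen + 1;
    }
    if (off < len)
        emit(string + off, len - off);
    return out;
}
```

The "\r\n" rewrite is what makes guest output display correctly on consoles that expect a carriage return before the newline.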
Re: [Xen-devel] [PATCH] xen/x86: Zero out .bss for PV guests
On 02/24/2016 09:15 AM, Andrew Cooper wrote: On 24/02/16 14:12, David Vrabel wrote: On 22/02/16 22:06, Boris Ostrovsky wrote: Baremetal kernels clear .bss early in the boot. Since Xen PV guests don't execute that early code they should do it too. (Since we introduce macros for specifying 32- and 64-bit registers we can get rid of ifdefs in startup_xen()) .bss must have been cleared for PV guests otherwise they would be horribly broken. What was the method and why is it no longer sufficient? I couldn't find this being done anywhere, hence this patch. The domain builder hands out zeroed pages. I don't believe we guarantee that the guest's RAM is clean, but it is in practice. OK, that's what I suspected but didn't actually look. I, in fact, wonder whether this should go to stable trees as well. -boris
Re: [PATCH v2] xen/x86: Zero out .bss for PV guests
On 02/24/2016 11:05 AM, Brian Gerst wrote: Use the macros in <asm/asm.h> instead of defining your own. Also, xorl %eax,%eax is good for 64-bit too, since the upper bits are cleared. I suspected this would have to be defined somewhere but couldn't find it. Thanks! -boris
Re: [Xen-devel] [PATCH v2] xen/x86: Zero out .bss for PV guests
On 02/24/2016 12:26 PM, Andrew Cooper wrote: On 24/02/16 15:19, Boris Ostrovsky wrote: + /* Clear .bss */ + xor REG(ax),REG(ax) If we are nitpicking, This should be xor %eax, %eax even in 64bit. Functionally identical, and shorter to encode. Right, Brian Gerst pointed this out too in another message. -boris
Re: [Xen-devel] [PATCH] xen/x86: Zero out .bss for PV guests
On 02/24/2016 11:22 AM, Luis R. Rodriguez wrote: On Wed, Feb 24, 2016 at 6:58 AM, David Vrabelwrote: Yes. Can you respin with a commit message explaining? (Or just provide the message here and I'll fix it up). Is there no way to re-use somehow the clear_bss() from bare metal? xen_start_info lives in .bss and it is initialized in startup assembly code. We could move it to .data but I'd still want to do this first thing in startup_xen(). We may add more code there later that touches something in .bss and with delayed section clearing we may blow away that data. -boris This uses a section range: /* Don't add a printk in there. printk relies on the PDA which is not initialized yet. */ static void __init clear_bss(void) { memset(__bss_start, 0, (unsigned long) __bss_stop - (unsigned long) __bss_start); } Perhaps the section range might be different for PV guests? Or can this simply not work even if one added a guest bss section size, or would it be too late for PV guests, ie we need to do it in asm on PV guests as you did? If so why. Luis
Re: [Xen-devel] [PATCH 8/9] x86/rtc: replace paravirt_enabled() check with subarch check
On 02/22/2016 05:27 AM, Borislav Petkov wrote: On Mon, Feb 22, 2016 at 07:07:56AM +0100, Luis R. Rodriguez wrote: diff --git a/arch/x86/include/asm/x86_init.h b/arch/x86/include/asm/x86_init.h index 1ae89a2721d6..fe0d579b63e3 100644 --- a/arch/x86/include/asm/x86_init.h +++ b/arch/x86/include/asm/x86_init.h @@ -84,11 +84,14 @@ struct x86_init_paging { *boot cpu * @timer_init: initialize the platform timer (default PIT/HPET) * @wallclock_init: init the wallclock device + * @no_cmos_rtc: set when platform has no CMOS real-time clock + * present */ struct x86_init_timers { void (*setup_percpu_clockev)(void); void (*timer_init)(void); void (*wallclock_init)(void); + bool no_cmos_rtc; I'd add u64 flags; to x86_init_ops and then set X86_PLATFORM_NO_RTC or so in there. The reason being, others could use that flags field too, for other stuff and define more bits. Maybe timer_flags or platform_flags (or something else) to be a little more cscope-friendly? -boris
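A minimal sketch of the flags approach Borislav suggests, with a hypothetical bit name (neither the macro nor the field exists in mainline as of this thread); the point is that callers test a bit in a shared word rather than growing one bool per platform quirk:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical platform feature bit, in the vein of the suggestion
 * above. Other quirks would claim further bits in the same word. */
#define X86_PLATFORM_NO_RTC (1ULL << 0)

/* Platforms without a CMOS RTC would set X86_PLATFORM_NO_RTC in the
 * shared flags word; the RTC driver then checks it once at init. */
static bool platform_has_cmos_rtc(uint64_t platform_flags)
{
    return !(platform_flags & X86_PLATFORM_NO_RTC);
}
```

A single `u64` keeps the struct layout stable as new quirk bits are added, which is the advantage over per-quirk booleans.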
Re: [PATCH 5/9] apm32: remove paravirt_enabled() use
On 02/19/2016 07:42 PM, Luis R. Rodriguez wrote: On Fri, Feb 19, 2016 at 05:17:27PM -0500, Boris Ostrovsky wrote: On 02/19/2016 03:58 PM, Luis R. Rodriguez wrote: On Fri, Feb 19, 2016 at 10:08:43AM -0500, Boris Ostrovsky wrote: in xen_start_kernel(). Better yet, clear whole .bss. (This applies to the next patch as well). So clear_bss() -- oh look, another call that xen_start_kernel() could have made good use of. :) Can you send a respective patch I can old into this series? I'm afraid it is by no means obvious to me where it would be safe to do this on xen_start_kernel(). OK, I'll test it next week, it really should be done irrespective of what you are doing here. OK so it can go in separately? Yes, I'll send it as a separate patch. -boris
[PATCH] xen/x86: Zero out .bss for PV guests
Baremetal kernels clear .bss early in the boot. Since Xen PV guests don't execute that early code they should do it too. (Since we introduce macros for specifying 32- and 64-bit registers we can get rid of ifdefs in startup_xen()) Signed-off-by: Boris Ostrovsky <boris.ostrov...@oracle.com> --- arch/x86/xen/xen-head.S | 29 ++--- 1 file changed, 22 insertions(+), 7 deletions(-) diff --git a/arch/x86/xen/xen-head.S b/arch/x86/xen/xen-head.S index b65f59a..2af87d1 100644 --- a/arch/x86/xen/xen-head.S +++ b/arch/x86/xen/xen-head.S @@ -35,16 +35,31 @@ #define PVH_FEATURES (0) #endif - __INIT -ENTRY(startup_xen) - cld #ifdef CONFIG_X86_32 - mov %esi,xen_start_info - mov $init_thread_union+THREAD_SIZE,%esp +#define REG(register) %e##register +#define WSIZE_SHIFT 2 +#define STOS stosl #else - mov %rsi,xen_start_info - mov $init_thread_union+THREAD_SIZE,%rsp +#define REG(register) %r##register +#define WSIZE_SHIFT 3 +#define STOS stosq #endif + + __INIT +ENTRY(startup_xen) + cld + + /* Clear .bss */ + xor REG(ax),REG(ax) + mov $__bss_start,REG(di) + mov $__bss_stop,REG(cx) + sub REG(di),REG(cx) + shr $WSIZE_SHIFT,REG(cx) + rep STOS + + mov REG(si),xen_start_info + mov $init_thread_union+THREAD_SIZE,REG(sp) + jmp xen_start_kernel __FINIT -- 2.1.0
Re: [PATCH v2 1/3] hvc_xen: add earlycon support
On 02/25/2016 07:10 AM, Stefano Stabellini wrote: Introduce EARLYCON support in hvc_xen, useful for early debugging on arm and arm64, where xen early_printk is not available. It is different from xenboot_write_console on x86 in two ways: - it does not return if !xen_pv_domain(), not only because ARM guests are xen_hvm_domain(), but also because we want to capture all the early boot messages, before xen support is discovered - it does not try to print to the domU console at all, because xen support will only be discovered at a later point Signed-off-by: Stefano Stabellini <stefano.stabell...@eu.citrix.com> Reviewed-by: Boris Ostrovsky <boris.ostrov...@oracle.com>
Re: [PATCH v2 2/3] hvc_xen: fix xenboot for DomUs
On 02/25/2016 07:10 AM, Stefano Stabellini wrote: The xenboot early console has been partially broken for DomU for a long time: the output would only go to the hypervisor via hypercall (HYPERVISOR_console_io), while it wouldn't actually go to the DomU console. The reason is that domU_write_console would return early as no xencons structs are configured for it. Add an appropriate xencons struct for xenboot from the xenboot setup callback. Signed-off-by: Stefano Stabellini <stefano.stabell...@eu.citrix.com> --- Changes in v2: - add return to xenboot_setup_console --- drivers/tty/hvc/hvc_xen.c | 32 ++-- 1 file changed, 26 insertions(+), 6 deletions(-) diff --git a/drivers/tty/hvc/hvc_xen.c b/drivers/tty/hvc/hvc_xen.c index 68b8ec8..bf787aa 100644 --- a/drivers/tty/hvc/hvc_xen.c +++ b/drivers/tty/hvc/hvc_xen.c @@ -246,6 +246,18 @@ err: return -ENODEV; } +static int xen_early_pv_console_init(struct xencons_info *info, int vtermno) Nit: not sure whether "early" is the right word to use here: when called from xen_pv_console_init() it's not really on the early path. 
Other than that: Reviewed-by: Boris Ostrovsky <boris.ostrov...@oracle.com> +{ + info->evtchn = xen_start_info->console.domU.evtchn; + /* GFN == MFN for PV guest */ + info->intf = gfn_to_virt(xen_start_info->console.domU.mfn); + info->vtermno = vtermno; + + list_add_tail(&info->list, &xenconsoles); + + return 0; +} + static int xen_pv_console_init(void) { struct xencons_info *info; @@ -265,13 +277,8 @@ static int xen_pv_console_init(void) /* already configured */ return 0; } - info->evtchn = xen_start_info->console.domU.evtchn; - /* GFN == MFN for PV guest */ - info->intf = gfn_to_virt(xen_start_info->console.domU.mfn); - info->vtermno = HVC_COOKIE; - spin_lock(&xencons_lock); - list_add_tail(&info->list, &xenconsoles); + xen_early_pv_console_init(info, HVC_COOKIE); spin_unlock(&xencons_lock); return 0; @@ -599,6 +606,18 @@ static int xen_cons_init(void) console_initcall(xen_cons_init); #ifdef CONFIG_EARLY_PRINTK +static int __init xenboot_setup_console(struct console *console, char *string) +{ + static struct xencons_info xenboot; + + if (xen_initial_domain()) + return 0; + if (!xen_pv_domain()) + return -ENODEV; + + return xen_early_pv_console_init(&xenboot, 0); +} + static void xenboot_write_console(struct console *console, const char *string, unsigned len) { @@ -629,6 +648,7 @@ static void xenboot_write_console(struct console *console, const char *string, struct console xenboot_console = { .name = "xenboot", .write = xenboot_write_console, + .setup = xenboot_setup_console, .flags = CON_PRINTBUFFER | CON_BOOT | CON_ANYTIME, .index = -1, };
Re: [PATCH v2 3/3] hvc_xen: make early_printk work with HVM guests
On 02/25/2016 07:10 AM, Stefano Stabellini wrote: Refactor the existing code in xen_raw_console_write to get the generic early_printk console work with HVM guests. Take the opportunity to replace the outb loop with a single outsb call to reduce the number of vmexit. Signed-off-by: Stefano Stabellini <stefano.stabell...@eu.citrix.com> Reviewed-by: Boris Ostrovsky <boris.ostrov...@oracle.com>
[PATCH v3 2/2] xen/x86: Drop mode-selecting ifdefs in startup_xen()
Use asm/asm.h macros instead. Signed-off-by: Boris Ostrovsky <boris.ostrov...@oracle.com> --- arch/x86/xen/xen-head.S | 10 +++--- 1 file changed, 3 insertions(+), 7 deletions(-) diff --git a/arch/x86/xen/xen-head.S b/arch/x86/xen/xen-head.S index 11dbe49..2366895 100644 --- a/arch/x86/xen/xen-head.S +++ b/arch/x86/xen/xen-head.S @@ -52,13 +52,9 @@ ENTRY(startup_xen) shr $WSIZE_SHIFT, %__ASM_REG(cx) rep __ASM_SIZE(stos) -#ifdef CONFIG_X86_32 - mov %esi,xen_start_info - mov $init_thread_union+THREAD_SIZE,%esp -#else - mov %rsi,xen_start_info - mov $init_thread_union+THREAD_SIZE,%rsp -#endif + mov %__ASM_REG(si), xen_start_info + mov $init_thread_union+THREAD_SIZE, %__ASM_REG(sp) + jmp xen_start_kernel __FINIT -- 2.1.0
[PATCH v3 0/2] Clear .bss for PV guests
PV guests need to have their .bss zeroed out since it is not guaranteed to be cleared by Xen's domain builder v3: * Use existing macros for selecting mode-appropriate instructions/registers * Use 32-bit registers in XOR * Split removal of mode-selecting ifdefs into a separate patch Boris Ostrovsky (2): xen/x86: Zero out .bss for PV guests xen/x86: Drop mode-selecting ifdefs in startup_xen() arch/x86/xen/xen-head.S | 18 ++ 1 file changed, 14 insertions(+), 4 deletions(-) -- 2.1.0
[PATCH v3 1/2] xen/x86: Zero out .bss for PV guests
Baremetal kernels clear .bss early in the boot but Xen PV guests don't execute that code. They have been able to run without problems because Xen domain builder happens to give out zeroed pages. However, since this is not really guaranteed, .bss should be explicitly cleared. Signed-off-by: Boris Ostrovsky <boris.ostrov...@oracle.com> Cc: sta...@vger.kernel.org --- arch/x86/xen/xen-head.S | 14 ++ 1 file changed, 14 insertions(+) diff --git a/arch/x86/xen/xen-head.S b/arch/x86/xen/xen-head.S index b65f59a..11dbe49 100644 --- a/arch/x86/xen/xen-head.S +++ b/arch/x86/xen/xen-head.S @@ -38,6 +38,20 @@ __INIT ENTRY(startup_xen) cld + +#ifdef CONFIG_X86_32 +#define WSIZE_SHIFT 2 +#else +#define WSIZE_SHIFT 3 +#endif + /* Clear .bss */ + xor %eax,%eax + mov $__bss_start, %__ASM_REG(di) + mov $__bss_stop, %__ASM_REG(cx) + sub %__ASM_REG(di), %__ASM_REG(cx) + shr $WSIZE_SHIFT, %__ASM_REG(cx) + rep __ASM_SIZE(stos) + #ifdef CONFIG_X86_32 mov %esi,xen_start_info mov $init_thread_union+THREAD_SIZE,%esp -- 2.1.0
Re: [Xen-devel] [PATCH v3 0/2] Clear .bss for PV guests
On 02/26/2016 09:42 AM, Brian Gerst wrote: On Fri, Feb 26, 2016 at 8:51 AM, Boris Ostrovsky <boris.ostrov...@oracle.com> wrote: On 02/26/2016 05:53 AM, Roger Pau Monné wrote: El 25/2/16 a les 16:16, Boris Ostrovsky ha escrit: PV guests need to have their .bss zeroed out since it is not guaranteed to be cleared by Xen's domain builder I guess I'm missing something, but elf_load_image (in libelf-loader.c) seems to be able to clear segments (it will zero the memory between p_paddr + p_filesz and p_paddr + p_memsz) while loading the ELF into memory, so if the program headers are correctly setup the .bss should be zeroed out AFAICT. Right, but I don't think this is guaranteed. It's uninitialized data so in principle it can be anything. The ELF spec says "the system initializes the data with zero when the program begins to run" which I read as it's up to runtime and not the loader to do so. And since kernel does it explicitly on baremetal path I think it's a good idea for PV to do the same. It does it on bare metal because bzImage is a raw binary image, not ELF. OK, I didn't think about this. But nevertheless, is it guaranteed that .bss is cleared by the loader? My reading of the spec is that it's not. -boris
Re: [PATCH v2 2/3] hvc_xen: fix xenboot for DomUs
On 02/26/2016 10:39 AM, Stefano Stabellini wrote: On Thu, 25 Feb 2016, Boris Ostrovsky wrote: On 02/25/2016 07:10 AM, Stefano Stabellini wrote: The xenboot early console has been partially broken for DomU for a long time: the output would only go to the hypervisor via hypercall (HYPERVISOR_console_io), while it wouldn't actually go to the DomU console. The reason is that domU_write_console would return early as no xencons structs are configured for it. Add an appropriate xencons struct for xenboot from the xenboot setup callback. Signed-off-by: Stefano Stabellini <stefano.stabell...@eu.citrix.com> --- Changes in v2: - add return to xenboot_setup_console --- drivers/tty/hvc/hvc_xen.c | 32 ++-- 1 file changed, 26 insertions(+), 6 deletions(-) diff --git a/drivers/tty/hvc/hvc_xen.c b/drivers/tty/hvc/hvc_xen.c index 68b8ec8..bf787aa 100644 --- a/drivers/tty/hvc/hvc_xen.c +++ b/drivers/tty/hvc/hvc_xen.c @@ -246,6 +246,18 @@ err: return -ENODEV; } +static int xen_early_pv_console_init(struct xencons_info *info, int vtermno) Nit: not sure whether "early" is the right word to use here: when called from xen_pv_console_init() it's not really on the early path. What about xencons_info_pv_init? Sure, that sounds good. (BTW, I didn't mean that my R-b was conditional on this) -boris Other than that: Reviewed-by: Boris Ostrovsky <boris.ostrov...@oracle.com> Thanks! 
+{ + info->evtchn = xen_start_info->console.domU.evtchn; + /* GFN == MFN for PV guest */ + info->intf = gfn_to_virt(xen_start_info->console.domU.mfn); + info->vtermno = vtermno; + + list_add_tail(>list, ); + + return 0; +} + static int xen_pv_console_init(void) { struct xencons_info *info; @@ -265,13 +277,8 @@ static int xen_pv_console_init(void) /* already configured */ return 0; } - info->evtchn = xen_start_info->console.domU.evtchn; - /* GFN == MFN for PV guest */ - info->intf = gfn_to_virt(xen_start_info->console.domU.mfn); - info->vtermno = HVC_COOKIE; - spin_lock(_lock); - list_add_tail(>list, ); + xen_early_pv_console_init(info, HVC_COOKIE); spin_unlock(_lock); return 0; @@ -599,6 +606,18 @@ static int xen_cons_init(void) console_initcall(xen_cons_init); #ifdef CONFIG_EARLY_PRINTK +static int __init xenboot_setup_console(struct console *console, char *string) +{ + static struct xencons_info xenboot; + + if (xen_initial_domain()) + return 0; + if (!xen_pv_domain()) + return -ENODEV; + + return xen_early_pv_console_init(, 0); +} + static void xenboot_write_console(struct console *console, const char *string, unsigned len) { @@ -629,6 +648,7 @@ static void xenboot_write_console(struct console *console, const char *string, struct console xenboot_console = { .name = "xenboot", .write = xenboot_write_console, + .setup = xenboot_setup_console, .flags = CON_PRINTBUFFER | CON_BOOT | CON_ANYTIME, .index = -1, };
Re: [Xen-devel] [PATCH v3 0/2] Clear .bss for PV guests
On 02/26/2016 10:26 AM, David Vrabel wrote: The tools support loading bzImages, not just ELF images. We load them as ELF images though, I believe, after uncompressing. E.g.: static int xc_dom_load_bzimage_kernel(struct xc_dom_image *dom) { return elf_loader.loader(dom); } -boris
Re: [Xen-devel] [PATCH v3 0/2] Clear .bss for PV guests
On 02/26/2016 10:22 AM, Roger Pau Monné wrote: El 26/2/16 a les 16:10, Boris Ostrovsky ha escrit: On 02/26/2016 09:42 AM, Brian Gerst wrote: On Fri, Feb 26, 2016 at 8:51 AM, Boris Ostrovsky <boris.ostrov...@oracle.com> wrote: On 02/26/2016 05:53 AM, Roger Pau Monné wrote: El 25/2/16 a les 16:16, Boris Ostrovsky ha escrit: PV guests need to have their .bss zeroed out since it is not guaranteed to be cleared by Xen's domain builder I guess I'm missing something, but elf_load_image (in libelf-loader.c) seems to be able to clear segments (it will zero the memory between p_paddr + p_filesz and p_paddr + p_memsz) while loading the ELF into memory, so if the program headers are correctly setup the .bss should be zeroed out AFAICT. Right, but I don't think this is guaranteed. It's uninitialized data so in principle it can be anything. The ELF spec says "the system initializes the data with zero when the program begins to run" which I read as it's up to runtime and not the loader to do so. And since kernel does it explicitly on baremetal path I think it's a good idea for PV to do the same. It does it on bare metal because bzImage is a raw binary image, not ELF. OK, I didn't think about this. But nevertheless, is it guaranteed that .bss is cleared by the loader? My reading of the spec is that it's not. I think this is very blur in general. The copy of the spec I have says: "the system initializes the data with zeros when the program begins to run" What is "the system" here, Xen or the guest kernel? Just to be clear, I'm not opposing to this change in any way, but the message in patch 1/2 needs to be fixed: "They have been able to run without problems because Xen domain builder happens to give out zeroed pages." This is wrong IMHO, .bss is not cleared because we are using zeroed pages, but because elf_load_image explicitly zeroes the space between p_filesz and p_memsz in ELF program headers (which is were .bss resides on properly arranged ELF binaries) when loading them. 
That's what I meant --- that the builder/loader gives out zeroed pages, not that Xen's allocator clears them in general. I'll update the commit message. I'm quite sure NetBSD also relies on this, so I would say it's intrinsically part of the Xen boot ABI now, and this change just adds seatbelts to Linux. Maybe NetBSD should drive carefully then ;-) -boris
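For reference, the p_filesz/p_memsz behaviour Roger describes, which is what elf_load_image() effectively does per program header, can be sketched as follows (hypothetical helper, not libelf's actual signature):

```c
#include <string.h>

/* Sketch of ELF segment loading: the first p_filesz bytes come from
 * the image file, and the loader zero-fills the remainder up to
 * p_memsz. In a well-formed kernel binary, .bss lives entirely in
 * that zero-filled tail. */
static void load_segment(unsigned char *dest, const unsigned char *file,
                         size_t filesz, size_t memsz)
{
    memcpy(dest, file, filesz);
    memset(dest + filesz, 0, memsz - filesz);
}
```

Whether a given loader performs that memset is exactly the guarantee being debated in this thread, which is why the kernel-side clear is a useful seatbelt.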
Re: [PATCH, RESEND] xen: allocate gntdev_copy_batch dynamically
On 02/25/2016 04:25 PM, Arnd Bergmann wrote: struct gntdev_copy_batch is arguably too large to fit on the kernel stack, and we get a warning about the stack usage in gntdev_ioctl_grant_copy: drivers/xen/gntdev.c:949:1: error: the frame size of 1240 bytes is larger than 1024 bytes This changes the code to use a dynamic allocation instead. Signed-off-by: Arnd Bergmann Fixes: a4cdb556cae0 ("xen/gntdev: add ioctl for grant copy") --- drivers/xen/gntdev.c | 17 ++--- 1 file changed, 10 insertions(+), 7 deletions(-) I sent this in January, Boris sent an almost identical patch as http://www.gossamer-threads.com/lists/xen/devel/414056 but the bug remains present in mainline and linux-next as of Feb 25. Could you apply one of the patches before the bug makes it into v4.5? David wanted to shrink the structure size instead: http://www.gossamer-threads.com/lists/xen/devel/414535 -boris
Re: [Xen-devel] [PATCH v3 0/2] Clear .bss for PV guests
On 02/26/2016 05:53 AM, Roger Pau Monné wrote: El 25/2/16 a les 16:16, Boris Ostrovsky ha escrit: PV guests need to have their .bss zeroed out since it is not guaranteed to be cleared by Xen's domain builder I guess I'm missing something, but elf_load_image (in libelf-loader.c) seems to be able to clear segments (it will zero the memory between p_paddr + p_filesz and p_paddr + p_memsz) while loading the ELF into memory, so if the program headers are correctly setup the .bss should be zeroed out AFAICT. Right, but I don't think this is guaranteed. It's uninitialized data so in principle it can be anything. The ELF spec says "the system initializes the data with zero when the program begins to run" which I read as it's up to runtime and not the loader to do so. And since kernel does it explicitly on baremetal path I think it's a good idea for PV to do the same. -boris
[PATCH v4 2/2] xen/x86: Drop mode-selecting ifdefs in startup_xen()
Use asm/asm.h macros instead. Signed-off-by: Boris Ostrovsky <boris.ostrov...@oracle.com> --- arch/x86/xen/xen-head.S | 10 +++--- 1 file changed, 3 insertions(+), 7 deletions(-) diff --git a/arch/x86/xen/xen-head.S b/arch/x86/xen/xen-head.S index 5c63d2d..de93b20 100644 --- a/arch/x86/xen/xen-head.S +++ b/arch/x86/xen/xen-head.S @@ -47,13 +47,9 @@ ENTRY(startup_xen) shr $__ASM_SEL(2, 3), %_ASM_CX rep __ASM_SIZE(stos) -#ifdef CONFIG_X86_32 - mov %esi,xen_start_info - mov $init_thread_union+THREAD_SIZE,%esp -#else - mov %rsi,xen_start_info - mov $init_thread_union+THREAD_SIZE,%rsp -#endif + mov %_ASM_SI, xen_start_info + mov $init_thread_union+THREAD_SIZE, %_ASM_SP + jmp xen_start_kernel __FINIT -- 2.1.0
[PATCH v4 0/2] Clear .bss for PV guests
PV guests need to have their .bss zeroed out since it is not guaranteed to be cleared by Xen's domain builder (even if it is done so currently) v4: * Better use of mode-selecting macros. * Drop stable tag. * Update commit message for patch 1. Boris Ostrovsky (2): xen/x86: Zero out .bss for PV guests xen/x86: Drop mode-selecting ifdefs in startup_xen() arch/x86/xen/xen-head.S | 19 --- 1 file changed, 12 insertions(+), 7 deletions(-) -- 2.1.0
[PATCH v4 1/2] xen/x86: Zero out .bss for PV guests
The ELF spec is unclear about whether .bss must be cleared by the loader. Currently the domain builder does it when loading the guest but because it is not (or rather may not be) guaranteed we should zero it out explicitly. Signed-off-by: Boris Ostrovsky <boris.ostrov...@oracle.com> --- arch/x86/xen/xen-head.S | 9 + 1 file changed, 9 insertions(+) diff --git a/arch/x86/xen/xen-head.S b/arch/x86/xen/xen-head.S index b65f59a..5c63d2d 100644 --- a/arch/x86/xen/xen-head.S +++ b/arch/x86/xen/xen-head.S @@ -38,6 +38,15 @@ __INIT ENTRY(startup_xen) cld + + /* Clear .bss */ + xor %eax,%eax + mov $__bss_start, %_ASM_DI + mov $__bss_stop, %_ASM_CX + sub %_ASM_DI, %_ASM_CX + shr $__ASM_SEL(2, 3), %_ASM_CX + rep __ASM_SIZE(stos) + #ifdef CONFIG_X86_32 mov %esi,xen_start_info mov $init_thread_union+THREAD_SIZE,%esp -- 2.1.0
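In C, the rep-stos sequence in the patch above amounts to the following sketch. It assumes, as the real code does via the linker script, that __bss_stop - __bss_start is a multiple of the word size:

```c
#include <stddef.h>
#include <stdint.h>

/* C model of the startup_xen .bss clear: the byte length is shifted
 * right by 2 (32-bit, stosl) or 3 (64-bit, stosq) to get the rep
 * count, and each store writes one zeroed word from %eax/%rax. */
static size_t clear_bss_words(unsigned long *bss_start,
                              unsigned long *bss_stop)
{
    size_t shift = (sizeof(unsigned long) == 8) ? 3 : 2; /* __ASM_SEL(2, 3) */
    size_t words = ((uintptr_t)bss_stop - (uintptr_t)bss_start) >> shift;
    size_t i;

    for (i = 0; i < words; i++)  /* rep stos */
        bss_start[i] = 0;
    return words;
}
```

Doing this word-at-a-time rather than byte-at-a-time is why the shift appears at all; the count in %ecx/%rcx is words, not bytes.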
Re: [RFC v1 4/8] x86/init: add linker table support
On 01/21/2016 03:38 AM, Roger Pau Monné wrote: El 20/01/16 a les 22.33, Luis R. Rodriguez ha escrit: On Wed, Jan 20, 2016 at 1:00 PM, Konrad Rzeszutek Wilkwrote: +static bool x86_init_fn_supports_subarch(struct x86_init_fn *fn) +{ + if (!fn->supp_hardware_subarch) { + pr_err("Init sequence fails to declares any supported subarchs: %pF\n", fn->early_init); + WARN_ON(1); + } + if (BIT(boot_params.hdr.hardware_subarch) & fn->supp_hardware_subarch) + return true; + return false; +} So the logic for this working is that boot_params.hdr.hardware_subarch And the macros define two: BIT(X86_SUBARCH_PC) or BIT(X86_SUBARCH_XEN). But hardware_subarch by default is set to zero. Which means if GRUB2, PXELinux, Xen multiboot1 don't set it - then the X86_SUBARCH_PC is choosen right? 1 << 0 & 1 << X86_SUBARCH_PC (which is zero). For this to nicely work with Xen it ought to do this: diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c index 993b7a7..6cf9afd 100644 --- a/arch/x86/xen/enlighten.c +++ b/arch/x86/xen/enlighten.c @@ -1676,6 +1676,7 @@ asmlinkage __visible void __init xen_start_kernel(void) boot_params.hdr.ramdisk_image = initrd_start; boot_params.hdr.ramdisk_size = xen_start_info->mod_len; boot_params.hdr.cmd_line_ptr = __pa(xen_start_info->cmd_line); + boot_params.hdr.hardware_subarch = X86_SUBARCH_XEN; if (!xen_initial_domain()) { add_preferred_console("xenboot", 0, NULL); ? That's correct for PV and PVH, likewise when qemu is required for HVM qemu could set it. I have the qemu change done but that should only cover HVM. A common place to set this as well could be the hypervisor, but currently the hypervisor doesn't set any boot_params, instead a generic struct is passed and the kernel code (for any OS) is expected to interpret this and then set the required values for the OS in the init path. Long term though if we wanted to merge init further one way could be to have the hypervisor just set the zero page cleanly for the different modes. 
If we needed more data other than the hardware_subarch we also have the hardware_subarch_data, that's a u64 , and how that is used would be up to the subarch. In Xen's case it could do what it wants with it. That would still mean perhaps defining as part of a Xen boot protocol a place where xen specific code can count on finding more Xen data passed by the hypervisor, the xen_start_info. That is, if we wanted to merge init paths this is something to consider. One thing I considered on the question of who should set the zero page for Xen with the prospect of merging inits, or at least this subarch for both short term and long term are the obvious implications in terms of hypervisor / kernel / qemu combination requirements if the subarch is needed. Having it set in the kernel is an obvious immediate choice for PV / PVH but it means we can't merge init paths completely (down to asm inits), we'd still be able to merge some C init paths though, the first entry would still be different. Having the zero page set on the hypervisor would go long ways but it would mean a hypervisor change required. I don't think the hypervisor should be setting Linux specific boot related parameters, the boot ABI should be OS agnostic. IMHO, a small shim should be added to Linux in order to set what Linux requires when entering from a Xen entry point. And that's exactly what HVMlite does. Most of this shim layer is setting up boot_params, after which we jump to standard x86 boot path (i.e. startup_{32|64}). With hardware_subarch set to zero. -boris
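The subarch gating discussed above boils down to a one-bit test. A sketch, using the subarch IDs from arch/x86/include/uapi/asm/bootparam.h (the function name mirrors the one in the RFC but is simplified here):

```c
#include <stdbool.h>
#include <stdint.h>

/* Subarch IDs from the x86 boot protocol (bootparam.h). */
#define X86_SUBARCH_PC  0
#define X86_SUBARCH_XEN 2

/* An init stub declares a bitmask of the subarchs it supports; it runs
 * only when the bit for boot_params.hdr.hardware_subarch is set in
 * that mask. A zeroed subarch field therefore selects the PC path. */
static bool init_fn_supports_subarch(uint32_t supp_mask, uint32_t subarch)
{
    return (UINT32_C(1) << subarch) & supp_mask;
}
```

This is also why the enlighten.c hunk quoted above matters: unless xen_start_kernel() (or the shim) sets hardware_subarch to X86_SUBARCH_XEN, the field stays zero and Xen guests take the BIT(X86_SUBARCH_PC) paths.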
[PATCH v1 10/12] xen/hvmlite: Use x86's default timer init for HVMlite guests
xen_timer_init() will be called from apic_bsp_setup(). Signed-off-by: Boris Ostrovsky <boris.ostrov...@oracle.com> --- arch/x86/xen/time.c |5 - 1 files changed, 4 insertions(+), 1 deletions(-) diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c index a0a4e55..93745e7 100644 --- a/arch/x86/xen/time.c +++ b/arch/x86/xen/time.c @@ -439,7 +439,10 @@ void __init xen_init_time_ops(void) { pv_time_ops = xen_time_ops; - x86_init.timers.timer_init = xen_time_init; + if (!xen_hvmlite) + x86_init.timers.timer_init = xen_time_init; + else + x86_init.timers.timer_init = x86_init_noop; x86_init.timers.setup_percpu_clockev = x86_init_noop; x86_cpuinit.setup_percpu_clockev = x86_init_noop; -- 1.7.1
[PATCH v1 08/12] xen/hvmlite: Initialize context for secondary VCPUs
Signed-off-by: Boris Ostrovsky <boris.ostrov...@oracle.com> --- arch/x86/xen/smp.c | 57 arch/x86/xen/smp.h |4 +++ arch/x86/xen/xen-hvmlite.S |7 + 3 files changed, 63 insertions(+), 5 deletions(-) diff --git a/arch/x86/xen/smp.c b/arch/x86/xen/smp.c index 5fc4afb..b265c4f 100644 --- a/arch/x86/xen/smp.c +++ b/arch/x86/xen/smp.c @@ -27,6 +27,7 @@ #include #include #include +#include #include #include @@ -384,6 +385,7 @@ cpu_initialize_context(unsigned int cpu, struct task_struct *idle) struct vcpu_guest_context *ctxt; struct desc_struct *gdt; unsigned long gdt_mfn; + void *ctxt_arg; /* used to tell cpu_init() that it can proceed with initialization */ cpumask_set_cpu(cpu, cpu_callout_mask); @@ -392,7 +394,7 @@ cpu_initialize_context(unsigned int cpu, struct task_struct *idle) if (!xen_hvmlite) { - ctxt = kzalloc(sizeof(*ctxt), GFP_KERNEL); + ctxt_arg = ctxt = kzalloc(sizeof(*ctxt), GFP_KERNEL); if (ctxt == NULL) return -ENOMEM; @@ -460,14 +462,59 @@ cpu_initialize_context(unsigned int cpu, struct task_struct *idle) ctxt->user_regs.esp = idle->thread.sp0 - sizeof(struct pt_regs); ctxt->ctrlreg[3] = xen_pfn_to_cr3(virt_to_gfn(swapper_pg_dir)); } else { - ctxt = NULL; /* To quiet down compiler */ - BUG(); +#ifdef CONFIG_XEN_PVHVM + struct vcpu_hvm_context *hctxt; + + ctxt_arg = hctxt = kzalloc(sizeof(*hctxt), GFP_KERNEL); + if (hctxt == NULL) + return -ENOMEM; + +#ifdef CONFIG_X86_64 + hctxt->mode = VCPU_HVM_MODE_64B; + hctxt->cpu_regs.x86_64.rip = + (unsigned long)secondary_startup_64; + hctxt->cpu_regs.x86_64.rsp = stack_start; + + hctxt->cpu_regs.x86_64.cr0 = + X86_CR0_PG | X86_CR0_WP | X86_CR0_PE; + hctxt->cpu_regs.x86_64.cr4 = X86_CR4_PAE; + hctxt->cpu_regs.x86_64.cr3 = + xen_pfn_to_cr3(virt_to_mfn(init_level4_pgt)); + hctxt->cpu_regs.x86_64.efer = EFER_LME | EFER_NX; +#else + hctxt->mode = VCPU_HVM_MODE_32B; + /* +* startup_32_smp expects GDT loaded so we can't jump +* there directly. 
+*/ + hctxt->cpu_regs.x86_32.eip = + (unsigned long)hvmlite_smp_32 - __START_KERNEL_map; + + hctxt->cpu_regs.x86_32.cr0 = X86_CR0_PE; + + hctxt->cpu_regs.x86_32.cs_base = 0; + hctxt->cpu_regs.x86_32.cs_limit = ~0u; + hctxt->cpu_regs.x86_32.cs_ar = 0xc9b; + hctxt->cpu_regs.x86_32.ds_base = 0; + hctxt->cpu_regs.x86_32.ds_limit = ~0u; + hctxt->cpu_regs.x86_32.ds_ar = 0xc93; + hctxt->cpu_regs.x86_32.es_base = 0; + hctxt->cpu_regs.x86_32.es_limit = ~0u; + hctxt->cpu_regs.x86_32.es_ar = 0xc93; + hctxt->cpu_regs.x86_32.ss_base = 0; + hctxt->cpu_regs.x86_32.ss_limit = ~0u; + hctxt->cpu_regs.x86_32.ss_ar = 0xc93; + hctxt->cpu_regs.x86_32.tr_base = 0; + hctxt->cpu_regs.x86_32.tr_limit = 0xff; + hctxt->cpu_regs.x86_32.tr_ar = 0x8b; +#endif +#endif } - if (HYPERVISOR_vcpu_op(VCPUOP_initialise, cpu, ctxt)) + if (HYPERVISOR_vcpu_op(VCPUOP_initialise, cpu, ctxt_arg)) BUG(); - kfree(ctxt); + kfree(ctxt_arg); return 0; } diff --git a/arch/x86/xen/smp.h b/arch/x86/xen/smp.h index 963d62a..b4a833c 100644 --- a/arch/x86/xen/smp.h +++ b/arch/x86/xen/smp.h @@ -8,6 +8,10 @@ extern void xen_send_IPI_allbutself(int vector); extern void xen_send_IPI_all(int vector); extern void xen_send_IPI_self(int vector); +#ifdef CONFIG_X86_32 +extern void hvmlite_smp_32(void); +#endif + #ifdef CONFIG_XEN_PVH extern void xen_pvh_early_cpu_init(int cpu, bool entry); #else diff --git a/arch/x86/xen/xen-hvmlite.S b/arch/x86/xen/xen-hvmlite.S index 90f03d0..8d6a642 100644 --- a/arch/x86/xen/xen-hvmlite.S +++ b/arch/x86/xen/xen-hvmlite.S @@ -134,6 +134,13 @@ ENTRY(hvmlite_start_xen) ljmp$0x10, $_pa(startup_32) #endif +#ifdef CONFIG_X86_32 +ENTRY(hvmlite_smp_32) +mov $_pa(boot_gdt_descr), %eax +lgdt (%eax) +jmp startup_32_smp +#endif + .data gdt: .word gdt_end - gdt -- 1.7.1
[PATCH v1 01/12] x86/smp: Make start_secondary() and initial_pg_pmd visible globally
Xen's HVMlite guests will want to use them. Signed-off-by: Boris Ostrovsky <boris.ostrov...@oracle.com> --- arch/x86/include/asm/smp.h |1 + arch/x86/kernel/head_32.S |2 +- arch/x86/kernel/smpboot.c |2 +- 3 files changed, 3 insertions(+), 2 deletions(-) diff --git a/arch/x86/include/asm/smp.h b/arch/x86/include/asm/smp.h index dfcf072..739abb0 100644 --- a/arch/x86/include/asm/smp.h +++ b/arch/x86/include/asm/smp.h @@ -147,6 +147,7 @@ void x86_idle_thread_init(unsigned int cpu, struct task_struct *idle); void smp_store_boot_cpu_info(void); void smp_store_cpu_info(int id); +void start_secondary(void *unused); #define cpu_physical_id(cpu) per_cpu(x86_cpu_to_apicid, cpu) #else /* !CONFIG_SMP */ diff --git a/arch/x86/kernel/head_32.S b/arch/x86/kernel/head_32.S index 6bc9ae2..eeb5c42 100644 --- a/arch/x86/kernel/head_32.S +++ b/arch/x86/kernel/head_32.S @@ -666,7 +666,7 @@ ENTRY(setup_once_ref) __PAGE_ALIGNED_BSS .align PAGE_SIZE #ifdef CONFIG_X86_PAE -initial_pg_pmd: +ENTRY(initial_pg_pmd) .fill 1024*KPMDS,4,0 #else ENTRY(initial_page_table) diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c index 24d57f7..cba00bc 100644 --- a/arch/x86/kernel/smpboot.c +++ b/arch/x86/kernel/smpboot.c @@ -199,7 +199,7 @@ static int enable_start_cpu0; /* * Activate a secondary processor. */ -static void notrace start_secondary(void *unused) +void notrace start_secondary(void *unused) { /* * Don't put *anything* before cpu_init(), SMP booting is too -- 1.7.1
[PATCH v1 05/12] xen/hvmlite: Allow HVMlite guests delay initializing grant table
.. just like we currently do for PVH guests Signed-off-by: Boris Ostrovsky <boris.ostrov...@oracle.com> --- arch/x86/xen/grant-table.c |4 ++-- drivers/xen/grant-table.c |8 2 files changed, 6 insertions(+), 6 deletions(-) diff --git a/arch/x86/xen/grant-table.c b/arch/x86/xen/grant-table.c index e079500..40ad9c2 100644 --- a/arch/x86/xen/grant-table.c +++ b/arch/x86/xen/grant-table.c @@ -110,7 +110,7 @@ int arch_gnttab_init(unsigned long nr_shared) return arch_gnttab_valloc(_shared_vm_area, nr_shared); } -#ifdef CONFIG_XEN_PVH +#ifdef CONFIG_XEN_PVHVM #include #include #include @@ -164,7 +164,7 @@ static int __init xlated_setup_gnttab_pages(void) static int __init xen_pvh_gnttab_setup(void) { - if (!xen_pvh_domain()) + if (!xen_pvh_domain() && !xen_hvmlite) return -ENODEV; return xlated_setup_gnttab_pages(); diff --git a/drivers/xen/grant-table.c b/drivers/xen/grant-table.c index effbaf9..8c6c0cf 100644 --- a/drivers/xen/grant-table.c +++ b/drivers/xen/grant-table.c @@ -1147,13 +1147,13 @@ EXPORT_SYMBOL_GPL(gnttab_init); static int __gnttab_init(void) { + if (!xen_domain()) + return -ENODEV; + /* Delay grant-table initialization in the PV on HVM case */ - if (xen_hvm_domain()) + if (xen_hvm_domain() && !xen_hvmlite) return 0; - if (!xen_pv_domain()) - return -ENODEV; - return gnttab_init(); } /* Starts after core_initcall so that xen_pvh_gnttab_setup can be called -- 1.7.1
[PATCH v1 09/12] xen/hvmlite: Extend APIC operations for HVMlite guests
HVMlite guests need to be viewed as having APIC, otherwise smpboot code, for example, will complain. Signed-off-by: Boris Ostrovsky <boris.ostrov...@oracle.com> --- Not sure about xen_cpu_present_to_apicid() being an identity function, given xen_x86_32_early_logical_apicid(). arch/x86/xen/apic.c | 39 +-- arch/x86/xen/smp.c |2 +- 2 files changed, 38 insertions(+), 3 deletions(-) diff --git a/arch/x86/xen/apic.c b/arch/x86/xen/apic.c index abf4901..2a59c39 100644 --- a/arch/x86/xen/apic.c +++ b/arch/x86/xen/apic.c @@ -6,6 +6,7 @@ #include #include +#include #include "xen-ops.h" #include "pmu.h" #include "smp.h" @@ -78,6 +79,21 @@ static void xen_apic_write(u32 reg, u32 val) return; } + if (xen_hvmlite) { + switch (reg) { + case APIC_TASKPRI: + case APIC_SPIV: + case APIC_ESR: + case APIC_LVTT: + case APIC_LVT0: + case APIC_LVT1: + case APIC_LVTERR: + pr_debug("Unimplemented APIC register %x," +" value: %x\n", reg, val); + return; + } + } + /* Warn to see if there's any stray references */ WARN(1,"register: %x, value: %x\n", reg, val); } @@ -100,7 +116,7 @@ static u32 xen_safe_apic_wait_icr_idle(void) static int xen_apic_probe_pv(void) { - if (xen_pv_domain()) + if (xen_pv_domain() || xen_hvmlite) return 1; return 0; @@ -142,6 +158,19 @@ static void xen_silent_inquire(int apicid) { } +static int xen_cpu_present_to_apicid(int cpu) +{ + return cpu; +} + +static int xen_wakeup_secondary_cpu(int cpu, unsigned long start_eip) +{ + if (!xen_hvmlite) + return -EINVAL; + + return HYPERVISOR_vcpu_op(VCPUOP_up, cpu, NULL); +} + static struct apic xen_pv_apic = { .name = "Xen PV", .probe = xen_apic_probe_pv, @@ -162,7 +191,7 @@ static struct apic xen_pv_apic = { .ioapic_phys_id_map = default_ioapic_phys_id_map, /* Used on 32-bit */ .setup_apic_routing = NULL, - .cpu_present_to_apicid = default_cpu_present_to_apicid, + .cpu_present_to_apicid = xen_cpu_present_to_apicid, .apicid_to_cpu_present = physid_set_mask_of_physid, /* Used on 32-bit */ .check_phys_apicid_present = 
default_check_phys_apicid_present, /* smp_sanity_check needs it */ .phys_pkg_id= xen_phys_pkg_id, /* detect_ht */ @@ -180,6 +209,9 @@ static struct apic xen_pv_apic = { .send_IPI_all = xen_send_IPI_all, .send_IPI_self = xen_send_IPI_self, #endif + + .wakeup_secondary_cpu = xen_wakeup_secondary_cpu, + /* .wait_for_init_deassert- used by AP bootup - smp_callin which we don't use */ .inquire_remote_apic= xen_silent_inquire, @@ -216,5 +248,8 @@ void __init xen_init_apic(void) apic = _pv_apic; x86_platform.apic_post_init = xen_apic_check; + + if (xen_hvmlite) + setup_force_cpu_cap(X86_FEATURE_APIC); } apic_driver(xen_pv_apic); diff --git a/arch/x86/xen/smp.c b/arch/x86/xen/smp.c index b265c4f..fb085ef 100644 --- a/arch/x86/xen/smp.c +++ b/arch/x86/xen/smp.c @@ -214,7 +214,7 @@ static int xen_smp_intr_init(unsigned int cpu) * The IRQ worker on PVHVM goes through the native path and uses the * IPI mechanism. */ - if (xen_hvm_domain()) + if (xen_hvm_domain() && !xen_hvmlite) return 0; callfunc_name = kasprintf(GFP_KERNEL, "irqwork%d", cpu); -- 1.7.1
[PATCH v1 07/12] xen/hvmlite: Prepare cpu_initialize_context() routine for HVMlite SMP
Subsequent patch will add support for starting secondary VCPUs in HVMlite guest. This patch exists to make review easier. No functional changes (except for introduction of 'if (!xen_hvmlite)'). Signed-off-by: Boris Ostrovsky <boris.ostrov...@oracle.com> --- arch/x86/xen/smp.c | 104 --- 1 files changed, 57 insertions(+), 47 deletions(-) diff --git a/arch/x86/xen/smp.c b/arch/x86/xen/smp.c index 3f4ebf0..5fc4afb 100644 --- a/arch/x86/xen/smp.c +++ b/arch/x86/xen/smp.c @@ -390,70 +390,80 @@ cpu_initialize_context(unsigned int cpu, struct task_struct *idle) if (cpumask_test_and_set_cpu(cpu, xen_cpu_initialized_map)) return 0; - ctxt = kzalloc(sizeof(*ctxt), GFP_KERNEL); - if (ctxt == NULL) - return -ENOMEM; + if (!xen_hvmlite) { - gdt = get_cpu_gdt_table(cpu); + ctxt = kzalloc(sizeof(*ctxt), GFP_KERNEL); + if (ctxt == NULL) + return -ENOMEM; + + gdt = get_cpu_gdt_table(cpu); #ifdef CONFIG_X86_32 - /* Note: PVH is not yet supported on x86_32. */ - ctxt->user_regs.fs = __KERNEL_PERCPU; - ctxt->user_regs.gs = __KERNEL_STACK_CANARY; + /* Note: PVH is not yet supported on x86_32. 
*/ + ctxt->user_regs.fs = __KERNEL_PERCPU; + ctxt->user_regs.gs = __KERNEL_STACK_CANARY; #endif - memset(>fpu_ctxt, 0, sizeof(ctxt->fpu_ctxt)); + memset(>fpu_ctxt, 0, sizeof(ctxt->fpu_ctxt)); - if (!xen_feature(XENFEAT_auto_translated_physmap)) { - ctxt->user_regs.eip = (unsigned long)cpu_bringup_and_idle; - ctxt->flags = VGCF_IN_KERNEL; - ctxt->user_regs.eflags = 0x1000; /* IOPL_RING1 */ - ctxt->user_regs.ds = __USER_DS; - ctxt->user_regs.es = __USER_DS; - ctxt->user_regs.ss = __KERNEL_DS; + if (!xen_feature(XENFEAT_auto_translated_physmap)) { + ctxt->user_regs.eip = + (unsigned long)cpu_bringup_and_idle; + ctxt->flags = VGCF_IN_KERNEL; + ctxt->user_regs.eflags = 0x1000; /* IOPL_RING1 */ + ctxt->user_regs.ds = __USER_DS; + ctxt->user_regs.es = __USER_DS; + ctxt->user_regs.ss = __KERNEL_DS; - xen_copy_trap_info(ctxt->trap_ctxt); + xen_copy_trap_info(ctxt->trap_ctxt); - ctxt->ldt_ents = 0; + ctxt->ldt_ents = 0; - BUG_ON((unsigned long)gdt & ~PAGE_MASK); + BUG_ON((unsigned long)gdt & ~PAGE_MASK); - gdt_mfn = arbitrary_virt_to_mfn(gdt); - make_lowmem_page_readonly(gdt); - make_lowmem_page_readonly(mfn_to_virt(gdt_mfn)); + gdt_mfn = arbitrary_virt_to_mfn(gdt); + make_lowmem_page_readonly(gdt); + make_lowmem_page_readonly(mfn_to_virt(gdt_mfn)); - ctxt->gdt_frames[0] = gdt_mfn; - ctxt->gdt_ents = GDT_ENTRIES; + ctxt->gdt_frames[0] = gdt_mfn; + ctxt->gdt_ents = GDT_ENTRIES; - ctxt->kernel_ss = __KERNEL_DS; - ctxt->kernel_sp = idle->thread.sp0; + ctxt->kernel_ss = __KERNEL_DS; + ctxt->kernel_sp = idle->thread.sp0; #ifdef CONFIG_X86_32 - ctxt->event_callback_cs = __KERNEL_CS; - ctxt->failsafe_callback_cs = __KERNEL_CS; + ctxt->event_callback_cs = __KERNEL_CS; + ctxt->failsafe_callback_cs = __KERNEL_CS; #else - ctxt->gs_base_kernel = per_cpu_offset(cpu); + ctxt->gs_base_kernel = per_cpu_offset(cpu); #endif - ctxt->event_callback_eip= - (unsigned long)xen_hypervisor_callback; - ctxt->failsafe_callback_eip = - (unsigned long)xen_failsafe_callback; - ctxt->user_regs.cs = 
__KERNEL_CS; - per_cpu(xen_cr3, cpu) = __pa(swapper_pg_dir); - } + ctxt->event_callback_eip= + (unsigned long)xen_hypervisor_callback; + ctxt->failsafe_callback_eip = + (unsigned long)xen_failsafe_callback; + ctxt->user_regs.cs = __KERNEL_CS; + per_cpu(xen_cr3, cpu) = __pa(swapper_pg_dir); + } #ifdef CONFIG_XEN_PVH - else { - /* -* The vcpu comes on kernel page tables which
[PATCH v1 12/12] xen/hvmlite: Enable CPU on-/offlining
When offlining, we should properly clean up interrupts and wait until hypervisor declares VCPU as down before cleaning up. After VCPU that was previously offlined is brought back to life we want to jump back to bare-metal entry points. It's a simple jump on 64-bit but requires minor tweaking for 32-bit case. Signed-off-by: Boris Ostrovsky <boris.ostrov...@oracle.com> --- arch/x86/xen/smp.c | 35 +-- arch/x86/xen/xen-hvmlite.S |8 2 files changed, 33 insertions(+), 10 deletions(-) diff --git a/arch/x86/xen/smp.c b/arch/x86/xen/smp.c index fbad829..7e96a23 100644 --- a/arch/x86/xen/smp.c +++ b/arch/x86/xen/smp.c @@ -143,7 +143,7 @@ static void xen_smp_intr_free(unsigned int cpu) kfree(per_cpu(xen_callfuncsingle_irq, cpu).name); per_cpu(xen_callfuncsingle_irq, cpu).name = NULL; } - if (xen_hvm_domain()) + if (xen_hvm_domain() && !xen_hvmlite) return; if (per_cpu(xen_irq_work, cpu).irq >= 0) { @@ -585,7 +585,8 @@ static int xen_cpu_disable(void) static void xen_cpu_die(unsigned int cpu) { - while (xen_pv_domain() && HYPERVISOR_vcpu_op(VCPUOP_is_up, cpu, NULL)) { + while ((xen_pv_domain() || xen_hvmlite) && + HYPERVISOR_vcpu_op(VCPUOP_is_up, cpu, NULL)) { __set_current_state(TASK_UNINTERRUPTIBLE); schedule_timeout(HZ/10); } @@ -602,14 +603,28 @@ static void xen_play_dead(void) /* used only with HOTPLUG_CPU */ { play_dead_common(); HYPERVISOR_vcpu_op(VCPUOP_down, smp_processor_id(), NULL); - cpu_bringup(); - /* -* commit 4b0c0f294 (tick: Cleanup NOHZ per cpu data on cpu down) -* clears certain data that the cpu_idle loop (which called us -* and that we return from) expects. The only way to get that -* data back is to call: -*/ - tick_nohz_idle_enter(); + + if (!xen_hvm_domain()) { + cpu_bringup(); + /* +* commit 4b0c0f294 (tick: Cleanup NOHZ per cpu data on cpu +* down) clears certain data that the cpu_idle loop (which +* called us and that we return from) expects. 
The only way to +* get that data back is to call: +*/ + tick_nohz_idle_enter(); + } else { + /* +* For 64-bit we can jump directly to SMP entry point but for +* 32-bit we need to disable paging and load boot GDT (just +* like in cpu_initialize_context()). +*/ +#ifdef CONFIG_X86_64 + asm("jmp secondary_startup_64"); +#else + asm("jmp hvmlite_smp_32_hp"); +#endif + } } #else /* !CONFIG_HOTPLUG_CPU */ diff --git a/arch/x86/xen/xen-hvmlite.S b/arch/x86/xen/xen-hvmlite.S index 8d6a642..4edd6ef 100644 --- a/arch/x86/xen/xen-hvmlite.S +++ b/arch/x86/xen/xen-hvmlite.S @@ -135,6 +135,14 @@ ENTRY(hvmlite_start_xen) #endif #ifdef CONFIG_X86_32 +ENTRY(hvmlite_smp_32_hp) + movl $_pa(initial_page_table), %eax + movl %eax, %cr3 + ljmp $__KERNEL_CS,$_pa(5f) +5: + movl $X86_CR0_PE, %eax + movl %eax, %cr0 + ENTRY(hvmlite_smp_32) mov $_pa(boot_gdt_descr), %eax lgdt (%eax) -- 1.7.1
[PATCH v1 03/12] xen/hvmlite: Import hvmlite-related Xen public interfaces
Signed-off-by: Boris Ostrovsky <boris.ostrov...@oracle.com> --- include/xen/interface/elfnote.h | 12 +++- include/xen/interface/hvm/hvm_vcpu.h | 143 ++ include/xen/interface/xen.h | 24 ++ 3 files changed, 178 insertions(+), 1 deletions(-) create mode 100644 include/xen/interface/hvm/hvm_vcpu.h diff --git a/include/xen/interface/elfnote.h b/include/xen/interface/elfnote.h index f90b034..9e9f9bf 100644 --- a/include/xen/interface/elfnote.h +++ b/include/xen/interface/elfnote.h @@ -193,9 +193,19 @@ #define XEN_ELFNOTE_SUPPORTED_FEATURES 17 /* + * Physical entry point into the kernel. + * + * 32bit entry point into the kernel. When requested to launch the + * guest kernel in a HVM container, Xen will use this entry point to + * launch the guest in 32bit protected mode with paging disabled. + * Ignored otherwise. + */ +#define XEN_ELFNOTE_PHYS32_ENTRY 18 + +/* * The number of the highest elfnote defined. */ -#define XEN_ELFNOTE_MAX XEN_ELFNOTE_SUPPORTED_FEATURES +#define XEN_ELFNOTE_MAX XEN_ELFNOTE_PHYS32_ENTRY #endif /* __XEN_PUBLIC_ELFNOTE_H__ */ diff --git a/include/xen/interface/hvm/hvm_vcpu.h b/include/xen/interface/hvm/hvm_vcpu.h new file mode 100644 index 000..32ca83e --- /dev/null +++ b/include/xen/interface/hvm/hvm_vcpu.h @@ -0,0 +1,143 @@ +/* + * Permission is hereby granted, free of charge, to any person obtaining a copy + * of this software and associated documentation files (the "Software"), to + * deal in the Software without restriction, including without limitation the + * rights to use, copy, modify, merge, publish, distribute, sublicense, and/or + * sell copies of the Software, and to permit persons to whom the Software is + * furnished to do so, subject to the following conditions: + * + * The above copyright notice and this permission notice shall be included in + * all copies or substantial portions of the Software. 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE + * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING + * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER + * DEALINGS IN THE SOFTWARE. + * + * Copyright (c) 2015, Roger Pau Monne <roger@citrix.com> + */ + +#ifndef __XEN_PUBLIC_HVM_HVM_VCPU_H__ +#define __XEN_PUBLIC_HVM_HVM_VCPU_H__ + +#include "../xen.h" + +struct vcpu_hvm_x86_32 { +uint32_t eax; +uint32_t ecx; +uint32_t edx; +uint32_t ebx; +uint32_t esp; +uint32_t ebp; +uint32_t esi; +uint32_t edi; +uint32_t eip; +uint32_t eflags; + +uint32_t cr0; +uint32_t cr3; +uint32_t cr4; + +uint32_t pad1; + +/* + * EFER should only be used to set the NXE bit (if required) + * when starting a vCPU in 32bit mode with paging enabled or + * to set the LME/LMA bits in order to start the vCPU in + * compatibility mode. + */ +uint64_t efer; + +uint32_t cs_base; +uint32_t ds_base; +uint32_t ss_base; +uint32_t es_base; +uint32_t tr_base; +uint32_t cs_limit; +uint32_t ds_limit; +uint32_t ss_limit; +uint32_t es_limit; +uint32_t tr_limit; +uint16_t cs_ar; +uint16_t ds_ar; +uint16_t ss_ar; +uint16_t es_ar; +uint16_t tr_ar; + +uint16_t pad2[3]; +}; + +/* + * The layout of the _ar fields of the segment registers is the + * following: + * + * Bits [0,3]: type (bits 40-43). + * Bit4: s(descriptor type, bit 44). + * Bit[5,6]: dpl (descriptor privilege level, bits 45-46). + * Bit7: p(segment-present, bit 47). + * Bit8: avl (available for system software, bit 52). + * Bit9: l(64-bit code segment, bit 53). + * Bit 10: db (meaning depends on the segment, bit 54). + * Bit 11: g(granularity, bit 55) + * Bits [12,15]: unused, must be blank. 
+ * + * A more complete description of the meaning of this fields can be + * obtained from the Intel SDM, Volume 3, section 3.4.5. + */ + +struct vcpu_hvm_x86_64 { +uint64_t rax; +uint64_t rcx; +uint64_t rdx; +uint64_t rbx; +uint64_t rsp; +uint64_t rbp; +uint64_t rsi; +uint64_t rdi; +uint64_t rip; +uint64_t rflags; + +uint64_t cr0; +uint64_t cr3; +uint64_t cr4; +uint64_t efer; + +/* + * Using VCPU_HVM_MODE_64B implies that the vCPU is launched + * directly in long mode, so the cached parts of the segment + * registers get set to match that environment. + * + * If the user wants to launch the vCPU in compatibility mode + * the 32-bit structure should be used instead. + */ +}; + +struct vcpu_hvm_context { +#define VCPU_HVM_MODE_32B 0 /* 32bit fields
[PATCH v1 04/12] xen/hvmlite: Bootstrap HVMlite guest
Start HVMlite guest at XEN_ELFNOTE_PHYS32_ENTRY address. Setup hypercall page, initialize boot_params, enable early page tables. Signed-off-by: Boris Ostrovsky <boris.ostrov...@oracle.com> --- arch/x86/xen/Makefile |1 + arch/x86/xen/enlighten.c | 91 +- arch/x86/xen/xen-hvmlite.S | 158 include/xen/xen.h |6 ++ 4 files changed, 255 insertions(+), 1 deletions(-) create mode 100644 arch/x86/xen/xen-hvmlite.S diff --git a/arch/x86/xen/Makefile b/arch/x86/xen/Makefile index e47e527..1d913d7 100644 --- a/arch/x86/xen/Makefile +++ b/arch/x86/xen/Makefile @@ -23,3 +23,4 @@ obj-$(CONFIG_XEN_DEBUG_FS) += debugfs.o obj-$(CONFIG_XEN_DOM0) += vga.o obj-$(CONFIG_SWIOTLB_XEN) += pci-swiotlb-xen.o obj-$(CONFIG_XEN_EFI) += efi.o +obj-$(CONFIG_XEN_PVHVM) += xen-hvmlite.o diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c index 2cf446a..2ed8b2b 100644 --- a/arch/x86/xen/enlighten.c +++ b/arch/x86/xen/enlighten.c @@ -118,7 +118,8 @@ DEFINE_PER_CPU(struct vcpu_info *, xen_vcpu); */ DEFINE_PER_CPU(struct vcpu_info, xen_vcpu_info); -enum xen_domain_type xen_domain_type = XEN_NATIVE; +enum xen_domain_type xen_domain_type + __attribute__((section(".data"))) = XEN_NATIVE; EXPORT_SYMBOL_GPL(xen_domain_type); unsigned long *machine_to_phys_mapping = (void *)MACH2PHYS_VIRT_START; @@ -171,6 +172,17 @@ struct tls_descs { */ static DEFINE_PER_CPU(struct tls_descs, shadow_tls_desc); +#ifdef CONFIG_XEN_PVHVM +/* + * HVMlite variables. These need to live in data segment since they are + * initialized before startup_{32|64}, which clear .bss, are invoked. 
+ */ +int xen_hvmlite __attribute__((section(".data"))) = 0; +struct hvm_start_info hvmlite_start_info __attribute__((section(".data"))); +uint hvmlite_start_info_sz = sizeof(hvmlite_start_info); +struct boot_params xen_hvmlite_boot_params __attribute__((section(".data"))); +#endif + static void clamp_max_cpus(void) { #ifdef CONFIG_SMP @@ -1736,6 +1748,83 @@ asmlinkage __visible void __init xen_start_kernel(void) #endif } +#ifdef CONFIG_XEN_PVHVM +static void __init hvmlite_bootparams(void) +{ + struct xen_memory_map memmap; + int i; + + memset(_hvmlite_boot_params, 0, sizeof(xen_hvmlite_boot_params)); + + memmap.nr_entries = ARRAY_SIZE(xen_hvmlite_boot_params.e820_map); + set_xen_guest_handle(memmap.buffer, xen_hvmlite_boot_params.e820_map); + if (HYPERVISOR_memory_op(XENMEM_memory_map, )) { + xen_raw_console_write("XENMEM_memory_map failed\n"); + BUG(); + } + + xen_hvmlite_boot_params.e820_map[memmap.nr_entries].addr = + ISA_START_ADDRESS; + xen_hvmlite_boot_params.e820_map[memmap.nr_entries].size = + ISA_END_ADDRESS - ISA_START_ADDRESS; + xen_hvmlite_boot_params.e820_map[memmap.nr_entries++].type = + E820_RESERVED; + + sanitize_e820_map(xen_hvmlite_boot_params.e820_map, + ARRAY_SIZE(xen_hvmlite_boot_params.e820_map), + _entries); + + xen_hvmlite_boot_params.e820_entries = memmap.nr_entries; + for (i = 0; i < xen_hvmlite_boot_params.e820_entries; i++) + e820_add_region(xen_hvmlite_boot_params.e820_map[i].addr, + xen_hvmlite_boot_params.e820_map[i].size, + xen_hvmlite_boot_params.e820_map[i].type); + + xen_hvmlite_boot_params.hdr.cmd_line_ptr = + hvmlite_start_info.cmdline_paddr; + + /* The first module is always ramdisk */ + if (hvmlite_start_info.nr_modules) { + struct hvm_modlist_entry *modaddr = + __va(hvmlite_start_info.modlist_paddr); + xen_hvmlite_boot_params.hdr.ramdisk_image = modaddr->paddr; + xen_hvmlite_boot_params.hdr.ramdisk_size = modaddr->size; + } + + /* +* See Documentation/x86/boot.txt. 
+* +* Version 2.12 supports Xen entry point but we will use default x86/PC +* environment (i.e. hardware_subarch 0). +*/ + xen_hvmlite_boot_params.hdr.version = 0x212; + xen_hvmlite_boot_params.hdr.type_of_loader = 9; /* Xen loader */ +} + +/* + * This routine (and those that it might call) should not use + * anything that lives in .bss since that segment will be cleared later + */ +void __init xen_prepare_hvmlite(void) +{ + u32 eax, ecx, edx, msr; + u64 pfn; + + cpuid(xen_cpuid_base() + 2, , , , ); + pfn = __pa(hypercall_page); + wrmsr_safe(msr, (u32)pfn, (u32)(pfn >> 32)); + + pv_info.name = "Xen HVMlite"; + xen_domain_type = XEN_HVM_DOMAIN; + xen_hvmlite = 1; + + x86_init.oem.arch_setup = xen_init_kernel; + x86_init.oem.banner = xen_banner; + +
[PATCH v1 00/12] HVMlite domU support
This series introduces HVMlite support for unprivileged guests. It has been tested on Intel/AMD, both 32- and 64-bit, including CPU on- and offlining and save/restore. (Restore will result in APIC write warnings which exist now for 32-bit PV guests as well so I didn't address this in this series) Compile-tested on ARM Boris Ostrovsky (12): x86/smp: Make start_secondary() and initial_pg_pmd visible globally xen/hvmlite: Factor out common kernel init code xen/hvmlite: Import hvmlite-related Xen public interfaces xen/hvmlite: Bootstrap HVMlite guest xen/hvmlite: Allow HVMlite guests delay initializing grant table xen/hvmlite: Initialize PCI xen/hvmlite: Prepare cpu_initialize_context() routine for HVMlite SMP xen/hvmlite: Initialize context for secondary VCPUs xen/hvmlite: Extend APIC operations for HVMlite guests xen/hvmlite: Use x86's default timer init for HVMlite guests xen/hvmlite: Boot secondary CPUs xen/hvmlite: Enable CPU on-/offlining arch/x86/include/asm/smp.h |1 + arch/x86/kernel/head_32.S|2 +- arch/x86/kernel/smpboot.c|2 +- arch/x86/pci/xen.c |2 +- arch/x86/xen/Makefile|1 + arch/x86/xen/apic.c | 39 - arch/x86/xen/enlighten.c | 318 ++ arch/x86/xen/grant-table.c |4 +- arch/x86/xen/platform-pci-unplug.c |4 +- arch/x86/xen/pmu.c |4 +- arch/x86/xen/smp.c | 248 ++ arch/x86/xen/smp.h |4 + arch/x86/xen/time.c |5 +- arch/x86/xen/xen-hvmlite.S | 173 ++ drivers/xen/grant-table.c|8 +- include/xen/interface/elfnote.h | 12 ++- include/xen/interface/hvm/hvm_vcpu.h | 143 +++ include/xen/interface/xen.h | 24 +++ include/xen/xen.h|6 + 19 files changed, 801 insertions(+), 199 deletions(-) create mode 100644 arch/x86/xen/xen-hvmlite.S create mode 100644 include/xen/interface/hvm/hvm_vcpu.h
[PATCH v1 02/12] xen/hvmlite: Factor out common kernel init code
HVMlite guests (to be introduced in subsequent patches) share most of the kernel initialization code with PV(H). Signed-off-by: Boris Ostrovsky <boris.ostrov...@oracle.com> --- arch/x86/xen/enlighten.c | 225 -- 1 files changed, 119 insertions(+), 106 deletions(-) diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c index d09e4c9..2cf446a 100644 --- a/arch/x86/xen/enlighten.c +++ b/arch/x86/xen/enlighten.c @@ -1505,19 +1505,14 @@ static void __init xen_pvh_early_guest_init(void) } #endif/* CONFIG_XEN_PVH */ -/* First C function to be called on Xen boot */ -asmlinkage __visible void __init xen_start_kernel(void) + +static void __init xen_init_kernel(void) { struct physdev_set_iopl set_iopl; unsigned long initrd_start = 0; u64 pat; int rc; - if (!xen_start_info) - return; - - xen_domain_type = XEN_PV_DOMAIN; - xen_setup_features(); #ifdef CONFIG_XEN_PVH xen_pvh_early_guest_init(); @@ -1529,48 +1524,9 @@ asmlinkage __visible void __init xen_start_kernel(void) if (xen_initial_domain()) pv_info.features |= PV_SUPPORTED_RTC; pv_init_ops = xen_init_ops; - if (!xen_pvh_domain()) { - pv_cpu_ops = xen_cpu_ops; - - x86_platform.get_nmi_reason = xen_get_nmi_reason; - } - - if (xen_feature(XENFEAT_auto_translated_physmap)) - x86_init.resources.memory_setup = xen_auto_xlated_memory_setup; - else - x86_init.resources.memory_setup = xen_memory_setup; - x86_init.oem.arch_setup = xen_arch_setup; - x86_init.oem.banner = xen_banner; xen_init_time_ops(); - /* -* Set up some pagetable state before starting to set any ptes. -*/ - - xen_init_mmu_ops(); - - /* Prevent unwanted bits from being set in PTEs. */ - __supported_pte_mask &= ~_PAGE_GLOBAL; - - /* -* Prevent page tables from being allocated in highmem, even -* if CONFIG_HIGHPTE is enabled. 
-*/ - __userpte_alloc_gfp &= ~__GFP_HIGHMEM; - - /* Work out if we support NX */ - x86_configure_nx(); - - /* Get mfn list */ - xen_build_dynamic_phys_to_machine(); - - /* -* Set up kernel GDT and segment registers, mainly so that -* -fstack-protector code can be executed. -*/ - xen_setup_gdt(0); - xen_init_irq_ops(); xen_init_cpuid_mask(); @@ -1580,21 +1536,8 @@ asmlinkage __visible void __init xen_start_kernel(void) */ xen_init_apic(); #endif - - if (xen_feature(XENFEAT_mmu_pt_update_preserve_ad)) { - pv_mmu_ops.ptep_modify_prot_start = xen_ptep_modify_prot_start; - pv_mmu_ops.ptep_modify_prot_commit = xen_ptep_modify_prot_commit; - } - machine_ops = xen_machine_ops; - /* -* The only reliable way to retain the initial address of the -* percpu gdt_page is to remember it here, so we can go and -* mark it RW later, when the initial percpu area is freed. -*/ - xen_initial_gdt = _cpu(gdt_page, 0); - xen_smp_init(); #ifdef CONFIG_ACPI_NUMA @@ -1609,66 +1552,126 @@ asmlinkage __visible void __init xen_start_kernel(void) possible map and a non-dummy shared_info. */ per_cpu(xen_vcpu, 0) = _shared_info->vcpu_info[0]; - local_irq_disable(); - early_boot_irqs_disabled = true; + if (!xen_hvm_domain()) { + if (!xen_pvh_domain()) { + pv_cpu_ops = xen_cpu_ops; - xen_raw_console_write("mapping kernel into physical memory\n"); - xen_setup_kernel_pagetable((pgd_t *)xen_start_info->pt_base, - xen_start_info->nr_pages); - xen_reserve_special_pages(); + x86_platform.get_nmi_reason = xen_get_nmi_reason; + } - /* -* Modify the cache mode translation tables to match Xen's PAT -* configuration. 
-*/ - rdmsrl(MSR_IA32_CR_PAT, pat); - pat_init_cache_modes(pat); + if (xen_feature(XENFEAT_auto_translated_physmap)) + x86_init.resources.memory_setup = + xen_auto_xlated_memory_setup; + else + x86_init.resources.memory_setup = xen_memory_setup; + x86_init.oem.arch_setup = xen_arch_setup; + x86_init.oem.banner = xen_banner; - /* keep using Xen gdt for now; no urgent need to change it */ + /* +* Set up some pagetable state before starting to set any ptes. +*/ -#ifdef CONFIG_X86_32 - pv_info.kernel_rpl = 1; - if (xen_feature(XENFEAT_supervisor_mode_kernel)) - pv_info.kernel_rpl = 0; -#else - pv_info.kernel_rpl = 0; -#endi
[PATCH v1 06/12] xen/hvmlite: Initialize PCI
HVMlite guests need PCI frontend and always have PV devices Signed-off-by: Boris Ostrovsky <boris.ostrov...@oracle.com> --- arch/x86/pci/xen.c |2 +- arch/x86/xen/platform-pci-unplug.c |4 ++-- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/arch/x86/pci/xen.c b/arch/x86/pci/xen.c index ff31ab4..d847f7d 100644 --- a/arch/x86/pci/xen.c +++ b/arch/x86/pci/xen.c @@ -405,7 +405,7 @@ static void xen_teardown_msi_irq(unsigned int irq) int __init pci_xen_init(void) { - if (!xen_pv_domain() || xen_initial_domain()) + if ((!xen_pv_domain() && !xen_hvmlite) || xen_initial_domain()) return -ENODEV; printk(KERN_INFO "PCI: setting up Xen PCI frontend stub\n"); diff --git a/arch/x86/xen/platform-pci-unplug.c b/arch/x86/xen/platform-pci-unplug.c index 9586ff3..802ec90 100644 --- a/arch/x86/xen/platform-pci-unplug.c +++ b/arch/x86/xen/platform-pci-unplug.c @@ -73,8 +73,8 @@ bool xen_has_pv_devices(void) if (!xen_domain()) return false; - /* PV domains always have them. */ - if (xen_pv_domain()) + /* PV and HVMlite domains always have them. */ + if (xen_pv_domain() || xen_hvmlite) return true; /* And user has xen_platform_pci=0 set in guest config as -- 1.7.1
[PATCH v1 11/12] xen/hvmlite: Boot secondary CPUs
HVMlite secondary VCPUs use baremetal bringup path (i.e. native_* smp_ops)
but need to do some preparation in PV code.

Signed-off-by: Boris Ostrovsky <boris.ostrov...@oracle.com>
---
 arch/x86/xen/enlighten.c |  2 +
 arch/x86/xen/pmu.c       |  4 +-
 arch/x86/xen/smp.c       | 60 +-
 3 files changed, 47 insertions(+), 19 deletions(-)

diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
index 2ed8b2b..850ce66 100644
--- a/arch/x86/xen/enlighten.c
+++ b/arch/x86/xen/enlighten.c
@@ -1937,6 +1937,8 @@ static void __init xen_hvm_guest_init(void)
 	xen_have_vector_callback = 1;
 	xen_hvm_smp_init();
 	register_cpu_notifier(&xen_hvm_cpu_notifier);
+	if (xen_hvmlite)
+		smp_found_config = 1;
 	xen_unplug_emulated_devices();
 	x86_init.irqs.intr_init = xen_init_IRQ;
 	xen_hvm_init_time_ops();
diff --git a/arch/x86/xen/pmu.c b/arch/x86/xen/pmu.c
index 724a087..7bc209b 100644
--- a/arch/x86/xen/pmu.c
+++ b/arch/x86/xen/pmu.c
@@ -518,7 +518,7 @@ void xen_pmu_init(int cpu)
 	BUILD_BUG_ON(sizeof(struct xen_pmu_data) > PAGE_SIZE);

-	if (xen_hvm_domain())
+	if (xen_hvm_domain() && !xen_hvmlite)
 		return;

 	xenpmu_data = (struct xen_pmu_data *)get_zeroed_page(GFP_KERNEL);
@@ -556,7 +556,7 @@ void xen_pmu_finish(int cpu)
 {
 	struct xen_pmu_params xp;

-	if (xen_hvm_domain())
+	if (xen_hvm_domain() && !xen_hvmlite)
 		return;

 	xp.vcpu = cpu;
diff --git a/arch/x86/xen/smp.c b/arch/x86/xen/smp.c
index fb085ef..fbad829 100644
--- a/arch/x86/xen/smp.c
+++ b/arch/x86/xen/smp.c
@@ -348,26 +348,31 @@ static void __init xen_smp_prepare_cpus(unsigned int max_cpus)
 	}
 	xen_init_lock_cpu(0);

-	smp_store_boot_cpu_info();
-	cpu_data(0).x86_max_cores = 1;
+	if (!xen_hvmlite) {
+		smp_store_boot_cpu_info();
+		cpu_data(0).x86_max_cores = 1;
+
+		for_each_possible_cpu(i) {
+			zalloc_cpumask_var(&per_cpu(cpu_sibling_map, i),
+					   GFP_KERNEL);
+			zalloc_cpumask_var(&per_cpu(cpu_core_map, i),
+					   GFP_KERNEL);
+			zalloc_cpumask_var(&per_cpu(cpu_llc_shared_map, i),
+					   GFP_KERNEL);
+		}
+		set_cpu_sibling_map(0);

-	for_each_possible_cpu(i) {
-		zalloc_cpumask_var(&per_cpu(cpu_sibling_map, i), GFP_KERNEL);
-		zalloc_cpumask_var(&per_cpu(cpu_core_map, i), GFP_KERNEL);
-		zalloc_cpumask_var(&per_cpu(cpu_llc_shared_map, i), GFP_KERNEL);
+		if (!alloc_cpumask_var(&xen_cpu_initialized_map, GFP_KERNEL))
+			panic("could not allocate xen_cpu_initialized_map\n");
+
+		cpumask_copy(xen_cpu_initialized_map, cpumask_of(0));
 	}
-	set_cpu_sibling_map(0);

 	xen_pmu_init(0);

 	if (xen_smp_intr_init(0))
 		BUG();

-	if (!alloc_cpumask_var(&xen_cpu_initialized_map, GFP_KERNEL))
-		panic("could not allocate xen_cpu_initialized_map\n");
-
-	cpumask_copy(xen_cpu_initialized_map, cpumask_of(0));
-
 	/* Restrict the possible_map according to max_cpus. */
 	while ((num_possible_cpus() > 1) && (num_possible_cpus() > max_cpus)) {
 		for (cpu = nr_cpu_ids - 1; !cpu_possible(cpu); cpu--)
@@ -375,8 +380,11 @@ static void __init xen_smp_prepare_cpus(unsigned int max_cpus)
 		set_cpu_possible(cpu, false);
 	}

-	for_each_possible_cpu(cpu)
+	for_each_possible_cpu(cpu) {
 		set_cpu_present(cpu, true);
+		if (xen_hvmlite)
+			physid_set(cpu, phys_cpu_present_map);
+	}
 }

 static int
@@ -810,10 +818,15 @@ void __init xen_smp_init(void)

 static void __init xen_hvm_smp_prepare_cpus(unsigned int max_cpus)
 {
+	if (xen_hvmlite)
+		xen_smp_prepare_cpus(max_cpus);
+
 	native_smp_prepare_cpus(max_cpus);
-	WARN_ON(xen_smp_intr_init(0));

-	xen_init_lock_cpu(0);
+	if (!xen_hvmlite) {
+		WARN_ON(xen_smp_intr_init(0));
+		xen_init_lock_cpu(0);
+	}
 }

 static int xen_hvm_cpu_up(unsigned int cpu, struct task_struct *tidle)
@@ -836,8 +849,21 @@ static int xen_hvm_cpu_up(unsigned int cpu, struct task_struct *tidle)
 	 */
 	rc = xen_smp_intr_init(cpu);
 	WARN_ON(rc);
-	if (!rc)
-		rc = native_cpu_up(cpu, tidle);
+
+	if (xen_hvmlite) {
+		rc = cpu_initialize_context(cpu, tidle);
+		if (rc) {
+			xen_smp_intr_free(cpu);
+			return rc;
+		}
+		xen_pmu_init(cpu);
+	}
+
+	if (!rc) {
+
Re: [PATCH v1 02/12] xen/hvmlite: Factor out common kernel init code
On 01/22/2016 06:01 PM, Luis R. Rodriguez wrote: On Fri, Jan 22, 2016 at 04:35:48PM -0500, Boris Ostrovsky wrote: HVMlite guests (to be introduced in subsequent patches) share most of the kernel initialization code with PV(H). Signed-off-by: Boris Ostrovsky <boris.ostrov...@oracle.com> --- arch/x86/xen/enlighten.c | 225 -- 1 files changed, 119 insertions(+), 106 deletions(-) diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c index d09e4c9..2cf446a 100644 --- a/arch/x86/xen/enlighten.c +++ b/arch/x86/xen/enlighten.c Whoa, I'm lost, its hard for me to tell what exactly stayed and what got pulled into a helper, etc. Is there a possibility to split this patch in 2 somehow to make the actual functional changes easier to read? There are too many changes here and I just can't tell easily what's going on. The only real changes that this patch introduces is it reorders some of the operations that used to be in xen_start_kernel(). This is done so that in the next patch when we add hvmlite we can easily put those specific to PV(H) inside 'if (!xen_hvm_domain())'. I probably should have said so in the commit message. It is indeed difficult to review but I don't see how I can split this. Even if I just moved it (without reordering) it would still be hard to read. -boris
Re: [PATCH v1 02/12] xen/hvmlite: Factor out common kernel init code
On 01/22/2016 06:12 PM, Boris Ostrovsky wrote: On 01/22/2016 06:01 PM, Luis R. Rodriguez wrote: On Fri, Jan 22, 2016 at 04:35:48PM -0500, Boris Ostrovsky wrote: HVMlite guests (to be introduced in subsequent patches) share most of the kernel initialization code with PV(H). Signed-off-by: Boris Ostrovsky <boris.ostrov...@oracle.com> --- arch/x86/xen/enlighten.c | 225 -- 1 files changed, 119 insertions(+), 106 deletions(-) diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c index d09e4c9..2cf446a 100644 --- a/arch/x86/xen/enlighten.c +++ b/arch/x86/xen/enlighten.c Whoa, I'm lost, its hard for me to tell what exactly stayed and what got pulled into a helper, etc. Is there a possibility to split this patch in 2 somehow to make the actual functional changes easier to read? There are too many changes here and I just can't tell easily what's going on. The only real changes that this patch introduces is it reorders some of the operations that used to be in xen_start_kernel(). This is done so that in the next patch when we add hvmlite we can easily put those specific to PV(H) inside 'if (!xen_hvm_domain())'. I probably should have said so in the commit message. Actually, I forgot that I merged the 'if' clause into this patch already so I'll see if I can separate them again to make it easier on the eye. -boris It is indeed difficult to review but I don't see how I can split this. Even if I just moved it (without reordering) it would still be hard to read. -boris
Re: [PATCH v1 00/12] HVMlite domU support
On 01/25/2016 05:51 AM, David Vrabel wrote: On 22/01/16 21:35, Boris Ostrovsky wrote: This series introduces HVMlite support for unprivileged guests. It has been tested on Intel/AMD, both 32- and 64-bit, including CPU on- and offlining and save/restore. (Restore will result in APIC write warnings which exist now for 32-bit PV guests as well so I didn't address this in this series) Can you remove PVH support in this series as well? We won't necessarily remove PVH support immediately but I'd like to see the ultimate end result. I'd rather wait until we have HVMlite dom0 before dropping PVH. If nothing else it may help debugging. BTW, I assume we are going to rename HVMlite to PVH once the original implementation is removed. -boris
Re: [Xen-devel] [PATCH v1 04/12] xen/hvmlite: Bootstrap HVMlite guest
On 01/22/2016 07:30 PM, Andrew Cooper wrote: On 22/01/2016 23:32, Luis R. Rodriguez wrote: On Fri, Jan 22, 2016 at 04:35:50PM -0500, Boris Ostrovsky wrote: + /* +* See Documentation/x86/boot.txt. +* +* Version 2.12 supports Xen entry point but we will use default x86/PC +* environment (i.e. hardware_subarch 0). +*/ + xen_hvmlite_boot_params.hdr.version = 0x212; + xen_hvmlite_boot_params.hdr.type_of_loader = 9; /* Xen loader */ +} I realize PV got away with setting up boot_params on C code but best ask now that this new code is being introduced: why can't we just have the Xen hypervisor fill this in? It'd save us all this code. I agree that this looks to be a mess. Having said that, the DMLite boot protocol is OS agnostic, and will be staying that way. It happens to look suspiciously like multiboot; a flat 32bit protected mode entry (at a location chosen in an ELF note), with %ebx pointing to an in-ram structure containing things like a command line and module list. I would have though the correct way to do direct Linux support would be to have a very small init stub which constructs an appropriate zero page, and lets the native entry point get on with things. Which is really what hvmlite_start_xen()->xen_prepare_hvmlite()->hvmlite_bootparams() is doing. Not much more than that (for 64-bit it also loads identity mapping because that's what startup_64 wants) -boris This covers the usecase where people wish to boot a specific Linux kernel straight out of the dom0 filesystem. For the alternative usecase of general OS support, dom0 would boot something such as grub2 as the DMLite "kernel", at which point all stooging around in the guests filesystem is done from guest context, rather than control context (mitigating a substantial attack surface). ~Andrew
Re: [PATCH v1 01/12] x86/smp: Make start_secondary() and initial_pg_pmd visible globally
On 01/25/2016 05:53 AM, David Vrabel wrote: On 22/01/16 21:35, Boris Ostrovsky wrote: Xen's HVMlite guests will want to use them. Please explain in detail why they are needed. Actually, start_secondary is not needed anymore, I removed its use at some point. initial_pg_pmd (together with initial_page_table) are not really required --- I just used them for temporary page tables in the hvmlite startup code instead of allocating dedicated pages for that. Perhaps there is other (Xen-specific) area that I can use for that. Once we jump to startup_32 these tables are no longer in use. -boris
Re: [PATCH v1 04/12] xen/hvmlite: Bootstrap HVMlite guest
On 01/22/2016 06:32 PM, Luis R. Rodriguez wrote: On Fri, Jan 22, 2016 at 04:35:50PM -0500, Boris Ostrovsky wrote: +/* + * This routine (and those that it might call) should not use + * anything that lives in .bss since that segment will be cleared later + */ +void __init xen_prepare_hvmlite(void) +{ + u32 eax, ecx, edx, msr; + u64 pfn; + + cpuid(xen_cpuid_base() + 2, , , , ); + pfn = __pa(hypercall_page); + wrmsr_safe(msr, (u32)pfn, (u32)(pfn >> 32)); + + pv_info.name = "Xen HVMlite"; + xen_domain_type = XEN_HVM_DOMAIN; + xen_hvmlite = 1; + + x86_init.oem.arch_setup = xen_init_kernel; + x86_init.oem.banner = xen_banner; + + hvmlite_bootparams(); +} +#endif If the boot_params.hdr.hardware_subarch_data pointed to a custom struct then the first C entry point for Xen could shuffle this and set this too, by still using less asm entry helpers. We'd still need this run but with the linker table I think we could have a stub small stub for hvm run, it would not be a call from asm. Perhaps, but someone would still have to set hardware_subarch. And it's hvmlite_bootparams() that does it. And that's not sufficient, I think. There are still some things that trampoline code sets up (e.g. page tables for 64-bit). -boris
Re: [Xen-devel] [PATCH v1 02/12] xen/hvmlite: Factor out common kernel init code
On 01/25/2016 06:04 AM, David Vrabel wrote: On 22/01/16 21:35, Boris Ostrovsky wrote: HVMlite guests (to be introduced in subsequent patches) share most of the kernel initialization code with PV(H). Where possible, HVMlite should share initialization with bare metal/HVM and not PV(H). This is really platform initialization, for HVMlite it's invoked as x86_init.oem.arch_setup(). Things like xen_setup_features(), xen_init_apic() etc. I suppose I can move it to xen_hvm_guest_init() but then we will have some amount of code duplication. There is also a chunk of code there (in xen_init_kernel()) that will probably be used for HVMlite dom0. Sharing any code with the existing PVH code isn't useful, since PVH support will be removed. There is nothing PVH-specific, I said "(H)" mostly because that code is the same is PV. -boris
Re: [PATCH v1 04/12] xen/hvmlite: Bootstrap HVMlite guest
On 01/25/2016 04:21 PM, H. Peter Anvin wrote: On 01/25/16 13:12, Luis R. Rodriguez wrote: Perhaps, but someone would still have to set hardware_subarch. And it's hvmlite_bootparams() that does it. No, Xen would do it as well, essentially all of hvmlite_bootparams() could be done in Xen. Or a stub code. This patch in fact is the stub for Xen HVMlite guests; after we are done with it we jump to bare-metal startup code (i.e., startup_32|64). -boris
Re: [Xen-devel] [PATCH v1 04/12] xen/hvmlite: Bootstrap HVMlite guest
On 01/25/2016 05:19 PM, Luis R. Rodriguez wrote: On Sat, Jan 23, 2016 at 02:49:36PM +, Andrew Cooper wrote: it causes inappropriate linkage between the toolstack and the version of Linux wishing to be booted. There are ways to do it in such a way that it does not create dependency issues on Linux specific code. 0) The Xen way and EFI way A generic data structure is passed to the entry point on the kernel to load the kernel. The kernel in turn must parse this generic specific struct and interprets it and translate it to the kernel specific values. 1) The qemu way: One way is to simply not refer to the boot_params data structures but IMHO that produces really ugly code. The qemu folks did it this way, so qemu does not use arch/x86/include/uapi/asm/bootparam.h in any way, instead it uses offsets from a char *header. Refer to load_linux() in hw/i386/pc.c, for instance: header[0x211] |= 0x80; /* CAN_USE_HEAP */ 2) The grub way: Another way, which grub uses, is to define their own data structures based on arch/x86/include/uapi/asm/bootparam.h, with their own data types, and refer to them in code, for instance refer to grub_cmd_linux() on the grub file grub-core/loader/i386/pc/linux.c and its copy definition of the header definition in include/grub/i386/linux.h. lh.loadflags |= GRUB_LINUX_FLAG_CAN_USE_HEAP; The way lguest does it in the lguest launcher is IMHO the cleanest of course, but it includes asm/bootparam.h and uses that in code: /* Boot protocol version: 2.07 supports the fields for lguest. */ boot->hdr.version = 0x207; But that does mean copying over or using the bootparam.h file and using linux data types. 3) Merge of xen way and using subarch_data Only set the subarch and subarch_data pointer, and have the kernel then read the generic data structure and parse it as it used to, the benefit is sharing a common entry point. No one uses this yet. But I think its a reasonable compromise. Perhaps there are other ways as well. 
If true, then by no means, no matter how hard we try, and no matter what we do on the Linux front to help clean things up will we be able to have a unified bare metal / Xen entry. I'm noting it could be possible though provided we do just set the zero page, the subarch to Xen and subarch_data to the Xen custom data structure. All you need is a very small stub which starts per the DMLite ABI, sets up an appropriate zero_page, and jumps to the native entrypoint. Right. I am trying to understand what your objection is to what is proposed in this patch. It is the size of the stub? If yes -- what would you like to leave out to be done later? -boris However, this stub belongs in Linux, not in the Xen toolstack. That way, when the Linux boot protocol is modified, both sides can be updated accordingly. Makes sense the different entry points just seems best avoided if possible. Luis
Re: [PATCH 3/3] tty: hvc_xen: hide xen_console_remove when unused
On 01/25/2016 04:54 PM, Arnd Bergmann wrote:
> xencons_disconnect_backend() is only called from xen_console_remove(),

and also from xencons_probe()/xencons_resume(). But those two are also under the same ifdef.

-boris

> which is conditionally compiled, so we get a harmless warning when
> CONFIG_HVC_XEN_FRONTEND is unset:
>
> hvc/hvc_xen.c:350:12: error: 'xen_console_remove' defined but not used [-Werror=unused-function]
>
> This moves the function down into the same #ifdef section to silence the warning.
>
> Signed-off-by: Arnd Bergmann
> ---
>  drivers/tty/hvc/hvc_xen.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/tty/hvc/hvc_xen.c b/drivers/tty/hvc/hvc_xen.c
> index fa816b7193b6..11725422dacb 100644
> --- a/drivers/tty/hvc/hvc_xen.c
> +++ b/drivers/tty/hvc/hvc_xen.c
> @@ -323,6 +323,7 @@ void xen_console_resume(void)
>  	}
>  }
>
> +#ifdef CONFIG_HVC_XEN_FRONTEND
>  static void xencons_disconnect_backend(struct xencons_info *info)
>  {
>  	if (info->irq > 0)
> @@ -363,7 +364,6 @@ static int xen_console_remove(struct xencons_info *info)
>  	return 0;
>  }
>
> -#ifdef CONFIG_HVC_XEN_FRONTEND
>  static int xencons_remove(struct xenbus_device *dev)
>  {
>  	return xen_console_remove(dev_get_drvdata(&dev->dev));
Re: [PATCH v1 04/12] xen/hvmlite: Bootstrap HVMlite guest
On 01/26/2016 01:46 PM, Andy Lutomirski wrote: On Tue, Jan 26, 2016 at 10:34 AM, Luis R. Rodriguez <mcg...@suse.com> wrote: On Mon, Jan 25, 2016 at 05:28:08PM -0500, Boris Ostrovsky wrote: On 01/25/2016 04:21 PM, H. Peter Anvin wrote: On 01/25/16 13:12, Luis R. Rodriguez wrote: Perhaps, but someone would still have to set hardware_subarch. And it's hvmlite_bootparams() that does it. No, Xen would do it as well, essentially all of hvmlite_bootparams() could be done in Xen. Or a stub code. This patch in fact is the stub for Xen HVMlite guests, after we are done with it we jump to bare-metal startup code (i.e. startup_32|64) Right, the point is the stub need not be in Linux, I'll explain in the other thread where I provided more details on the different known approaches. ISTM if the Xen ABI-specified entry point has a different convention than the Linux native entry, then the stub should live in Linux. It would be just a couple of lines of code, right? It's not exactly a couple of lines but it's not large either. It mainly sets up boot_params (similar to what make_boot_params() does for EFI). Plus, for 64-bit, it loads identity page tables and switches to long mode. And then jumps to bare-metal startup code. The issue that caused headaches in the past isn't that there's code that's executed only on native, it's that there are whole big functions that are executed only on native for no good reason and that aren't clearly marked. This won't happen with HVMlite. If we had native_start_kernel and xen_start_kernel, and they both called very quickly into common_start_kernel, it would be very clear what's going on. What is now xen_start_kernel() is no longer the entry point for HVMlite. It is called as x86_init.oem.arch_setup() (or I may even move it to x86_hyper_xen.init_platform() or something like that). -boris
Re: [Xen-devel] [PATCH v1 04/12] xen/hvmlite: Bootstrap HVMlite guest
On 01/26/2016 03:30 PM, Luis R. Rodriguez wrote: What I'm proposing? 1) Lets see if we can put a proactive stop-gap to issues such as the cr4 shadow bug and Kasan bugs from creeping in. This should not only help PV but perhaps HVMLite. If you'd like to help with that refer to this thread: http://lkml.kernel.org/r/CAB=NE6VTCRCazcNpCdJ7pN1eD3=x_fcgodh37mzvpxkken5...@mail.gmail.com 2) asm code sharing prospects - rebranding of PVH to HVMlite It sounds like HVMlite is really just a clean PVH implementation so we'll be doing this it seems according to this approach: a) Since PVH brand is taken add new new Xen clean solution as "HVMLite implementation" b) Drop PVH c) Re-brand HVMLite as PVH Is there really no asm code to share between PV and HVMlite ? Not the early boot code. How about between PV and other Linux asm ? Specifically I'm talking about a huge new entry point called hvmlite_start_xen() of asm code. How much sharing or duplication is being avoided here? I don't see that we can share much here. Especially given that hvmlite_start() is a 32-bit entry point so we can't start a 64-bit PV guest. I suppose we can find some common (C) code that sets boot_params but on the first sight they are quite different too. (And whether it's huge is a matter of opinion ;-). It is more than a couple of instructions, I'll give you that) 3) C code sharing prospects: rebranding of PVH to HVMlite This code seems to share a lot from PV guest. Can we combine? HVMlite closely follows HVM (and baremetal) code path. There is some common setup code with PV (which is what the second patch in this series is trying to factor out) 4) hardware_subarch, hardware_subarch_data and future prospects Your patch relies on a *new* Linux entry point. 
Sure, we had one for EFI, and sure there is another for Xen PV, but since you're just rebranding PVH to HVMlite and given historic issues with any new Linux entry points I'd like for us to take a breather and evaluate the extent our lofty unification goals, and how the x86 boot protocol could help with this already. I am not sure I see how you can avoid having new entry point. For example, all HVMlite guests start in 32-bit mode. Who will switch to long mode? Now, perhaps today it may seem as difficult to unify asm entry points today all over, but if we can take baby steps towards that I think that should be seriously evaluated. For instance, we should consider on relying on hardware_subarch and hardware_subarch_data to fetch the hvmlite_start_info by taking advantage of the x86 boot protocol. The hvmlite_start_info is what Xen uses to send us info about the guest as your patch proposes (this matches the Xen PV style entry), we stash it into a standard Linux boot_params variable called xen_hvmlite_boot_params for the Xen guest in hvmlite_bootparam(). This data structure and code seems to match very much the PV guest structure, No, the two don't match at all. why not just use a union and differentiate on PV subtype ? If you want to avoid a lot of PV calls for HVMlite it seems you could just take advantage of subarch Xen type, and differentiate on the subarch_data within Xen code to make that code just PV sepecific. I only see gains by using the Xen subarch, so would like to know why PC is being pushed. It's not that subarch 0 is being pushed here. It's that I don't see how it can be useful for this particular guest type. Maybe as we add new features (or discover something that we've missed up till now) we can switch to using it. If you think we should delay initializing boot_params until then --- I will disagree: boot_params are used before we look at subarch and I don't believe it makes sense to pick and choose what to initialize before and what can wait. 
(And I am not sure it can be useful on PV neither, at least the way it is used now. You will not reach the point in the (32-bit) code where it is tested. You will die way earlier (I think on startup_32()'s fourth instruction).) -boris
Re: [Xen-devel] [PATCH v1 04/12] xen/hvmlite: Bootstrap HVMlite guest
On 01/27/2016 09:50 AM, David Vrabel wrote: On 27/01/16 14:42, Konrad Rzeszutek Wilk wrote: On Tue, Jan 26, 2016 at 08:54:56PM -0800, Luis R. Rodriguez wrote: On Jan 26, 2016 6:16 PM, "Luis R. Rodriguez"wrote: On Tue, Jan 26, 2016 at 4:04 PM, Luis R. Rodriguez wrote: You go: hvmlite_start_xen() --> HVM stub startup_64() | (startup_32() Hrm, does HVMlite work well with load_ucode_bsp(), note the patches to rebrand pv_enabled() to pv_legacy() or whatever, this PV type will not be legacy or crap / old, so we'd need a way to catch it if we should not use that code for this PV type. This begs the question, are you also sure other callers in startup_32() or startup_64() might be OK as well where previously guarded with pv_enabled() ? Actually this call can't be used, and if early code used it prior to setup_arch() it'd be a bug as its only properly set until later. Vetting for correctness of all code call is still required though and perhaps we do need something to catch now this PV type on early code such as this one if we don't want it. From what I've gathered before on other bsp ucode we don't want ucode loaded for PV guest types through these mechanisms. It may help to not think of PVH/hvmlite as PV. It really is HVM with a lot of emulated devices removed. How does early microcode work on EFI? Does the EFI stub code have an early microcode loading code ? Surely the interesting comparison here is how is (early) microcode loading disabled in KVM guests? We should use the same mechanism for HVMlite guests. Why would we ever want to have a guest load microcode during boot? I can see how a (privileged) guest may want to load microcode from a shell (via microcode driver). -boris
Re: [Xen-devel] [PATCH v1 04/12] xen/hvmlite: Bootstrap HVMlite guest
On 01/27/2016 10:09 AM, David Vrabel wrote: On 27/01/16 15:06, Boris Ostrovsky wrote: On 01/27/2016 09:50 AM, David Vrabel wrote: On 27/01/16 14:42, Konrad Rzeszutek Wilk wrote: On Tue, Jan 26, 2016 at 08:54:56PM -0800, Luis R. Rodriguez wrote: On Jan 26, 2016 6:16 PM, "Luis R. Rodriguez" <mcg...@suse.com> wrote: On Tue, Jan 26, 2016 at 4:04 PM, Luis R. Rodriguez <mcg...@suse.com> wrote: You go: hvmlite_start_xen() --> HVM stub startup_64() | (startup_32() Hrm, does HVMlite work well with load_ucode_bsp(), note the patches to rebrand pv_enabled() to pv_legacy() or whatever, this PV type will not be legacy or crap / old, so we'd need a way to catch it if we should not use that code for this PV type. This begs the question, are you also sure other callers in startup_32() or startup_64() might be OK as well where previously guarded with pv_enabled() ? Actually this call can't be used, and if early code used it prior to setup_arch() it'd be a bug as its only properly set until later. Vetting for correctness of all code call is still required though and perhaps we do need something to catch now this PV type on early code such as this one if we don't want it. From what I've gathered before on other bsp ucode we don't want ucode loaded for PV guest types through these mechanisms. It may help to not think of PVH/hvmlite as PV. It really is HVM with a lot of emulated devices removed. How does early microcode work on EFI? Does the EFI stub code have an early microcode loading code ? Surely the interesting comparison here is how is (early) microcode loading disabled in KVM guests? We should use the same mechanism for HVMlite guests. Why would we ever want to have a guest load microcode during boot? I can see how a (privileged) guest may want to load microcode from a shell (via microcode driver). I think you missed a word when you read my reply. Yes, I missed it ;-) Why not continue relying on paravirt_enabled()? We are going to keep this in some form for HVMlite. -boris
Re: [PATCH 4/4] xen/pcifront: Fix mysterious crashes when NUMA locality information was extracted.
On 02/11/2016 04:10 PM, Konrad Rzeszutek Wilk wrote: This patch fixes the issue by: 1) Use kzalloc to initialize to a well known state. 2) Put 'struct pci_sysdata' at the start of 'pcifront_sd'. That way access to the 'node' will access the right offset. CC: sta...@vger.kernel.org Signed-off-by: Konrad Rzeszutek Wilk <konrad.w...@oracle.com> Reviewed-by: Boris Ostrovsky <boris.ostrov...@oracle.com> (This, btw, is the second time we got bitten by pcifront_sd not being pci_sysdata. dc4fdaf0e48 was a workaround for a similar problem and we should have fixed it then.) -boris
Re: [PATCH v2 3/3] paravirt: rename paravirt_enabled to paravirt_legacy
On 02/17/2016 03:49 PM, Borislav Petkov wrote: On Wed, Feb 17, 2016 at 12:07:13PM -0800, Luis R. Rodriguez wrote: OK so here's a wiki to keep track of progress of the difference uses: http://kernelnewbies.org/KernelProjects/remove-paravirt-enabled It seems we have a resolution one way or another for all except for the use on arch/x86/mm/dump_pagetables.c, is that right? Why not? I think we should simply check the range as 8000 - 87ff is practically an ABI and nothing should be mapped ^ That's exactly the point: if something is mapped it's an error for a non-PV kernel. By removing paravirt_enabled() we may miss those errors. Worse, I think we may even crash while doing pagetable walk (although it's probably better to crash here than to use an unexpected translation in real code somewhere) -boris there anyway. No need for paravirt_enabled() there either.
Re: [PATCH v2 3/3] paravirt: rename paravirt_enabled to paravirt_legacy
On 02/17/2016 05:03 PM, Borislav Petkov wrote: On Wed, Feb 17, 2016 at 04:21:56PM -0500, Boris Ostrovsky wrote: That's exactly the point: if something is mapped it's an error for a non-PV kernel. How would something be mapped there? __PAGE_OFFSET is 0x8800. Or are you thinking about some insanely b0rked kernel code mapping stuff in there? The latter. This is to detect things that clearly shouldn't be happening. By removing paravirt_enabled() we may miss those errors. Worse, I think we may even crash while doing pagetable walk (although it's probably better to crash here than to use an unexpected translation in real code somewhere) Well, if this is the only site which keeps paravirt_enabled() from being buried, we need to think about a better way to detect a hypervisor. Maybe we should look at x86_hyper, use CPUID(0x4...) Can't use CPUID 0x4000 because it will return hypervisor (Xen or otherwise) for non-PV guests as well. In Xen's case, you can't determine guest type from hypervisor leaves. or something else. We could say xen_pv_domain(). But this means using Xen-specific code in x86-generic file to detect things specified by ABI. I don't know if I'd like that. -boris What's your preference?
Re: [PATCH v2 3/3] paravirt: rename paravirt_enabled to paravirt_legacy
On 02/17/2016 05:18 PM, Andy Lutomirski wrote: On Wed, Feb 17, 2016 at 2:03 PM, Borislav Petkov <b...@alien8.de> wrote: On Wed, Feb 17, 2016 at 04:21:56PM -0500, Boris Ostrovsky wrote: That's exactly the point: if something is mapped it's an error for a non-PV kernel. How would something be mapped there? __PAGE_OFFSET is 0x8800. Or are you thinking about some insanely b0rked kernel code mapping stuff in there? By removing paravirt_enabled() we may miss those errors. Worse, I think we may even crash while doing pagetable walk (although it's probably better to crash here than to use an unexpected translation in real code somewhere) Well, if this is the only site which keeps paravirt_enabled() from being buried, we need to think about a better way to detect a hypervisor. Maybe we should look at x86_hyper, use CPUID(0x4...) or something else. What's your preference? I'm confused. Isn't it the other way around? That is, checking for the hypervisor range on all systems should be safer, right? Or am I missing something? Hmm. I think you are right --- I was following wrong branch of the 'if' statement. We are always going straight to note_page(). Then yes, we should be able to remove paravirt_enabled(). Sorry for the noise. -boris
Re: [PATCH] x86/ptdump: Remove paravirt_enabled()
On 02/18/2016 03:00 PM, Borislav Petkov wrote: From: Borislav Petkov <b...@suse.de> is_hypervisor_range() can simply check if the PGD index is within 8000 - 87ff which is the range reserved for a hypervisor. That range is practically an ABI, see Documentation/x86/x86_64/mm.txt. Signed-off-by: Borislav Petkov <b...@suse.de> Cc: Andy Lutomirski <l...@amacapital.net> Cc: Boris Ostrovsky <boris.ostrov...@oracle.com> Cc: "Luis R. Rodriguez" <mcg...@kernel.org> Under Xen, as PV guest: Tested-by: Boris Ostrovsky <boris.ostrov...@oracle.com>
Re: [PATCH 5/9] apm32: remove paravirt_enabled() use
On 02/19/2016 08:08 AM, Luis R. Rodriguez wrote: There is already a check for apm_info.bios == 0, the apm_info.bios is set from the boot_params.apm_bios_info. Both Xen and lguest, which are also the only ones that set paravirt_enabled to true) do never set the apm_bios_info, the paravirt_enabled() check is simply not needed. We need to guarantee that boot_params is filled with zeroes. On baremetal path we clear .bss (which is where boot_params live) before copying data from zero page. So we need to at least memset(_params, 0, sz) in xen_start_kernel(). Better yet, clear whole .bss. (This applies to the next patch as well). -boris
Re: [RFC v2 1/6] x86/boot: add BIT() to boot/bitops.h
On 02/19/2016 09:15 AM, Luis R. Rodriguez wrote: diff --git a/arch/x86/boot/bitops.h b/arch/x86/boot/bitops.h index 878e4b9940d9..232cff0ff4e3 100644 --- a/arch/x86/boot/bitops.h +++ b/arch/x86/boot/bitops.h @@ -40,4 +40,6 @@ static inline void set_bit(int nr, void *addr) asm("btsl %1,%0" : "+m" (*(u32 *)addr) : "Ir" (nr)); } +#define BIT(x) (1 << x) Parenthesize x.
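Boris's one-line review — parenthesize x — guards against operator-precedence surprises when the macro argument is itself an expression. A hedged, stand-alone illustration (the BIT_UNPARENTHESIZED name is invented here purely for contrast):

```c
#include <assert.h>

/* The reviewed hunk adds:  #define BIT(x) (1 << x)
 * Without parentheses around x, low-precedence operators in the argument
 * bind to the shift instead of forming the bit number: */
#define BIT_UNPARENTHESIZED(x) (1 << x)
#define BIT(x)                 (1 << (x))
```

For example, BIT_UNPARENTHESIZED(1 | 2) expands to (1 << 1 | 2), which groups as ((1 << 1) | 2) == 2, whereas the parenthesized BIT(1 | 2) yields 1 << 3 == 8 as intended.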
Re: [PATCH 5/9] apm32: remove paravirt_enabled() use
On 02/19/2016 03:58 PM, Luis R. Rodriguez wrote: On Fri, Feb 19, 2016 at 10:08:43AM -0500, Boris Ostrovsky wrote: in xen_start_kernel(). Better yet, clear whole .bss. (This applies to the next patch as well). So clear_bss() -- oh look, another call that xen_start_kernel() could have made good use of. :) Can you send a respective patch I can fold into this series? I'm afraid it is by no means obvious to me where it would be safe to do this in xen_start_kernel(). OK, I'll test it next week; it really should be done irrespective of what you are doing here. -boris
Re: [PATCH 4/4] xen/pcifront: Fix mysterious crashes when NUMA locality information was extracted.
On 02/15/2016 09:05 AM, Konrad Rzeszutek Wilk wrote: On Sat, Feb 13, 2016 at 08:23:14PM -0500, Boris Ostrovsky wrote: On 02/11/2016 04:10 PM, Konrad Rzeszutek Wilk wrote: This patch fixes the issue by: 1) Use kzalloc to initialize to a well known state. 2) Put 'struct pci_sysdata' at the start of 'pcifront_sd'. That way access to the 'node' will access the right offset. CC: sta...@vger.kernel.org Signed-off-by: Konrad Rzeszutek Wilk <konrad.w...@oracle.com> Reviewed-by: Boris Ostrovsky <boris.ostrov...@oracle.com> (This, btw, is the second time we got bitten by pcifront_sd not being a pci_sysdata. dc4fdaf0e48 was a workaround for a similar problem and we should have fixed it then). I think that the dc4fdaf0e4839109169d8261814813816951c75f commit can be reverted then? Ah no, b/c: "Fixes: 97badf873ab6 (device property: Make it possible to use secondary firmware nodes)" Right --- the problem which that commit fixed was not specific to Xen but could be observed on ia64 as well. -boris
Re: [Xen-devel] [patch 1/4] hotplug: Prevent alloc/free of irq descriptors during cpu up/down
On 03/12/2016 04:19 AM, Thomas Gleixner wrote: Boris, On Tue, 14 Jul 2015, Boris Ostrovsky wrote: On 07/14/2015 04:15 PM, Thomas Gleixner wrote: The issue here is that all architectures need that protection and just Xen does irq allocations in cpu_up. So moving that protection into architecture code is not really an option. Otherwise we will need to have something like arch_post_cpu_up() after the lock is released. I'm not sure, that this will work. You probably want to do this in the cpu prepare stage, i.e. before calling __cpu_up(). For PV guests (the ones that use xen_cpu_up()) it will work either before or after __cpu_up(). At least my (somewhat limited) testing didn't show any problems so far. However, HVM CPUs use xen_hvm_cpu_up() and if you read comments there you will see that xen_smp_intr_init() needs to be called before native_cpu_up() but xen_init_lock_cpu() (which eventually calls irq_alloc_descs()) needs to be called after. I think I can split xen_init_lock_cpu() so that the part that needs to be called after will avoid going into irq core code. And then the rest will go into arch_cpu_prepare(). I think we should revisit this for 4.3. For 4.2 we can do the trivial variant and move the locking in native_cpu_up() and x86 only. x86 was the only arch on which such wreckage has been seen in the wild, but we should have that protection for all archs in the long run. Patch below should fix the issue. Thanks! Most of my tests passed, I had a couple of failures but I will need to see whether they are related to this patch. Did you ever come around to address that irq allocation from within cpu_up()? I really want to generalize the protection instead of carrying that x86 only hack forever. Sorry, I completely forgot about this. Let me see how I can take allocations from under the lock. I might just be able to put them in CPU notifiers --- most into CPU_UP_PREPARE but spinlock interrupt may need to go into CPU_ONLINE. -boris
[PATCH 1/2] xen/x86: Move irq allocation from Xen smp_op.cpu_up()
Commit ce0d3c0a6fb1 ("genirq: Revert sparse irq locking around __cpu_up() and move it to x86 for now") reverted irq locking introduced by commit a89941816726 ("hotplug: Prevent alloc/free of irq descriptors during cpu up/down") because of Xen allocating irqs in both of its cpu_up ops. We can move those allocations into CPU notifiers so that the original patch can be reinstated. Signed-off-by: Boris Ostrovsky <boris.ostrov...@oracle.com> --- arch/x86/xen/enlighten.c | 53 ++--- arch/x86/xen/smp.c | 45 +- arch/x86/xen/smp.h |3 ++ 3 files changed, 49 insertions(+), 52 deletions(-) diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c index 2c26108..d1a86db 100644 --- a/arch/x86/xen/enlighten.c +++ b/arch/x86/xen/enlighten.c @@ -137,6 +137,8 @@ RESERVE_BRK(shared_info_page_brk, PAGE_SIZE); __read_mostly int xen_have_vector_callback; EXPORT_SYMBOL_GPL(xen_have_vector_callback); +static struct notifier_block xen_cpu_notifier; + /* * Point at some empty memory to start with. We map the real shared_info * page as soon as fixmap is up and running. @@ -1596,6 +1598,7 @@ asmlinkage __visible void __init xen_start_kernel(void) xen_initial_gdt = &per_cpu(gdt_page, 0); xen_smp_init(); + register_cpu_notifier(&xen_cpu_notifier); #ifdef CONFIG_ACPI_NUMA /* @@ -1783,17 +1786,49 @@ static void __init init_hvm_pv_info(void) xen_domain_type = XEN_HVM_DOMAIN; } -static int xen_hvm_cpu_notify(struct notifier_block *self, unsigned long action, - void *hcpu) +static int xen_cpu_notify(struct notifier_block *self, unsigned long action, + void *hcpu) { int cpu = (long)hcpu; + int rc; + switch (action) { case CPU_UP_PREPARE: - xen_vcpu_setup(cpu); - if (xen_have_vector_callback) { - if (xen_feature(XENFEAT_hvm_safe_pvclock)) - xen_setup_timer(cpu); + if (xen_hvm_domain()) { + /* +* This can happen if CPU was offlined earlier and +* offlining timed out in common_cpu_die(). 
+*/ + if (cpu_report_state(cpu) == CPU_DEAD_FROZEN) { + xen_smp_intr_free(cpu); + xen_uninit_lock_cpu(cpu); + } + + xen_vcpu_setup(cpu); } + + if (xen_pv_domain() || + (xen_have_vector_callback && +xen_feature(XENFEAT_hvm_safe_pvclock))) + xen_setup_timer(cpu); + + rc = xen_smp_intr_init(cpu); + if (rc) { + WARN(1, "xen_smp_intr_init() for CPU %d failed: %d\n", +cpu, rc); + return NOTIFY_BAD; + } + + break; + case CPU_ONLINE: + xen_init_lock_cpu(cpu); + break; + case CPU_UP_CANCELED: + xen_smp_intr_free(cpu); + if (xen_pv_domain() || + (xen_have_vector_callback && +xen_feature(XENFEAT_hvm_safe_pvclock))) + xen_teardown_timer(cpu); break; default: break; @@ -1801,8 +1836,8 @@ static int xen_hvm_cpu_notify(struct notifier_block *self, unsigned long action, return NOTIFY_OK; } -static struct notifier_block xen_hvm_cpu_notifier = { - .notifier_call = xen_hvm_cpu_notify, +static struct notifier_block xen_cpu_notifier = { + .notifier_call = xen_cpu_notify, }; #ifdef CONFIG_KEXEC_CORE @@ -1834,7 +1869,7 @@ static void __init xen_hvm_guest_init(void) if (xen_feature(XENFEAT_hvm_callback_vector)) xen_have_vector_callback = 1; xen_hvm_smp_init(); - register_cpu_notifier(&xen_hvm_cpu_notifier); + register_cpu_notifier(&xen_cpu_notifier); xen_unplug_emulated_devices(); x86_init.irqs.intr_init = xen_init_IRQ; xen_hvm_init_time_ops(); diff --git a/arch/x86/xen/smp.c b/arch/x86/xen/smp.c index 719cf29..09d5cc0 100644 --- a/arch/x86/xen/smp.c +++ b/arch/x86/xen/smp.c @@ -115,7 +115,7 @@ asmlinkage __visible void cpu_bringup_and_idle(int cpu) cpu_startup_entry(CPUHP_AP_ONLINE_IDLE); } -static void xen_smp_intr_free(unsigned int cpu) +void xen_smp_intr_free(unsigned int cpu) { if (per_cpu(xen_resched_irq, cpu).irq >= 0) { unbind_from_irqhandler(per_cpu(xen_resched_irq, cpu).irq, NULL); @@ -159,7 +159,7 @@ static void xen_smp_intr_free(unsigned int cpu) per_cpu(xen_pmu_irq, cpu).name = NULL; } }; -static int xen_smp_intr_init(unsigned int cpu) +int xen_smp_intr_init(unsigned int cpu)
[PATCH 1/2] xen/apic: Provide Xen-specific version of cpu_present_to_apicid APIC op
Currently Xen uses default_cpu_present_to_apicid() which will always report BAD_APICID for PV guests since x86_bios_cpu_apic_id is initialised to that value and is never updated. With commit 1f12e32f4cd5 ("x86/topology: Create logical package id"), this op is now called by smp_init_package_map() when deciding whether to call topology_update_package_map() which sets cpu_data(cpu).logical_proc_id. The latter (as topology_logical_package_id(cpu)) may be used, for example, by cpu_to_rapl_pmu() as an array index. Since uninitialized logical_package_id is set to -1, the index will become 64K, which is obviously problematic. While RAPL code (and any other users of logical_package_id) should be careful in their assumptions about id's validity, Xen's cpu_present_to_apicid op should still provide a value consistent with its own xen_apic_read(APIC_ID). Signed-off-by: Boris Ostrovsky <boris.ostrov...@oracle.com> --- arch/x86/xen/apic.c | 12 ++-- 1 files changed, 10 insertions(+), 2 deletions(-) diff --git a/arch/x86/xen/apic.c b/arch/x86/xen/apic.c index abf4901..db52a7f 100644 --- a/arch/x86/xen/apic.c +++ b/arch/x86/xen/apic.c @@ -66,7 +66,7 @@ static u32 xen_apic_read(u32 reg) ret = HYPERVISOR_platform_op(&op); if (ret) - return 0; + op.u.pcpu_info.apic_id = BAD_APICID; return op.u.pcpu_info.apic_id << 24; } @@ -142,6 +142,14 @@ static void xen_silent_inquire(int apicid) { } +static int xen_cpu_present_to_apicid(int cpu) +{ + if (cpu_present(cpu)) + return xen_get_apic_id(xen_apic_read(APIC_ID)); + else + return BAD_APICID; +} + static struct apic xen_pv_apic = { .name = "Xen PV", .probe = xen_apic_probe_pv, @@ -162,7 +170,7 @@ static struct apic xen_pv_apic = { .ioapic_phys_id_map = default_ioapic_phys_id_map, /* Used on 32-bit */ .setup_apic_routing = NULL, - .cpu_present_to_apicid = default_cpu_present_to_apicid, + .cpu_present_to_apicid = xen_cpu_present_to_apicid, .apicid_to_cpu_present = physid_set_mask_of_physid, /* Used on 32-bit */ .check_phys_apicid_present = 
default_check_phys_apicid_present, /* smp_sanity_check needs it */ .phys_pkg_id= xen_phys_pkg_id, /* detect_ht */ -- 1.7.1
[PATCH v2] xen/events: Mask a moving irq
Moving an unmasked irq may result in irq handler being invoked on both source and target CPUs. With 2-level this can happen as follows: On source CPU: evtchn_2l_handle_events() -> generic_handle_irq() -> handle_edge_irq() -> eoi_pirq(): irq_move_irq(data); /* WE ARE HERE */ if (VALID_EVTCHN(evtchn)) clear_evtchn(evtchn); If at this moment target processor is handling an unrelated event in evtchn_2l_handle_events()'s loop it may pick up our event since target's cpu_evtchn_mask claims that this event belongs to it *and* the event is unmasked and still pending. At the same time, source CPU will continue executing its own handle_edge_irq(). With FIFO interrupt the scenario is similar: irq_move_irq() may result in a EVTCHNOP_unmask hypercall which, in turn, may make the event pending on the target CPU. We can avoid this situation by moving and clearing the event while keeping event masked. Signed-off-by: Boris Ostrovsky <boris.ostrov...@oracle.com> Cc: sta...@vger.kernel.org --- v2: * Use irq_move_masked_irq() instead of irq_move_irq() * Reorganize code flow drivers/xen/events/events_base.c | 28 1 files changed, 24 insertions(+), 4 deletions(-) diff --git a/drivers/xen/events/events_base.c b/drivers/xen/events/events_base.c index 524c221..4436778 100644 --- a/drivers/xen/events/events_base.c +++ b/drivers/xen/events/events_base.c @@ -484,9 +484,19 @@ static void eoi_pirq(struct irq_data *data) struct physdev_eoi eoi = { .irq = pirq_from_irq(data->irq) }; int rc = 0; - irq_move_irq(data); + if (!VALID_EVTCHN(evtchn)) + return; - if (VALID_EVTCHN(evtchn)) + if (unlikely(irqd_is_setaffinity_pending(data))) { + int masked = test_and_set_mask(evtchn); + + clear_evtchn(evtchn); + + irq_move_masked_irq(data); + + if (!masked) + unmask_evtchn(evtchn); + } else clear_evtchn(evtchn); if (pirq_needs_eoi(data->irq)) { @@ -1357,9 +1367,19 @@ static void ack_dynirq(struct irq_data *data) { int evtchn = evtchn_from_irq(data->irq); - irq_move_irq(data); + if (!VALID_EVTCHN(evtchn)) + 
return; - if (VALID_EVTCHN(evtchn)) + if (unlikely(irqd_is_setaffinity_pending(data))) { + int masked = test_and_set_mask(evtchn); + + clear_evtchn(evtchn); + + irq_move_masked_irq(data); + + if (!masked) + unmask_evtchn(evtchn); + } else clear_evtchn(evtchn); } -- 1.7.1
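Both eoi_pirq() and ack_dynirq() in the patch above gain the same mask-clear-move-unmask sequence. The following is a rough, userspace-runnable sketch of just that control flow; the stubbed primitives are stand-ins, not the real events_base.c code, and the shared-helper idea is only an illustration of the pattern, not part of the posted patch:

```c
#include <assert.h>

/* Minimal stand-ins for the event-channel primitives (NOT the real
 * implementations) so the control flow can be exercised in isolation. */
static int evtchn_masked, evtchn_pending, affinity_move_pending;

static int test_and_set_mask(void)
{
	int was_masked = evtchn_masked;
	evtchn_masked = 1;
	return was_masked;
}

static void clear_evtchn(void)        { evtchn_pending = 0; }
static void unmask_evtchn(void)       { evtchn_masked = 0; }
static void irq_move_masked_irq(void) { affinity_move_pending = 0; }

/*
 * The pattern from the patch: if an affinity change is pending, mask the
 * event first, clear and migrate it while masked, then restore the prior
 * mask state.  The target CPU therefore cannot pick up the still-pending
 * event while the source CPU is inside its own handler.
 */
static void ack_and_maybe_move(void)
{
	if (affinity_move_pending) {
		int masked = test_and_set_mask();

		clear_evtchn();
		irq_move_masked_irq();
		if (!masked)
			unmask_evtchn();
	} else {
		clear_evtchn();
	}
}
```

Note the `if (!masked)` restore: an event that was already masked before the move stays masked afterwards, which is the same behavior the patch preserves via test_and_set_mask().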
[PATCH 0/2] Reinstate irq alloc/dealloc locking patch
Original version of that patch (commit a89941816726) had to be reverted due to Xen allocating irqs in its cpu_up ops. The first patch moves allocations into hotplug notifiers and the second one restores the original patch (with minor adjustments to new hotplug framework) Boris Ostrovsky (2): xen/x86: Move irq allocation from Xen smp_op.cpu_up() hotplug: Prevent alloc/free of irq descriptors during cpu up/down (again) arch/x86/kernel/smpboot.c | 11 - arch/x86/xen/enlighten.c | 53 +--- arch/x86/xen/smp.c| 45 + arch/x86/xen/smp.h|3 ++ kernel/cpu.c |8 ++ 5 files changed, 57 insertions(+), 63 deletions(-)
Re: [Xen-devel] [PATCH v4 0/5] [PATCH v3 0/5] Improve non-"safe" MSR access failure handling
On 03/12/2016 01:08 PM, Andy Lutomirski wrote: Setting CONFIG_PARAVIRT=y has an unintended side effect: it silently turns all rdmsr and wrmsr operations into the safe variants without any checks that the operations actually succeed. With CONFIG_PARAVIRT=n, unchecked MSR failures OOPS and probably cause boot to fail if they happen before init starts. Neither behavior is very good, and it's particularly unfortunate that the behavior changes depending on CONFIG_PARAVIRT. In particular, KVM guests might be unwittingly depending on the PARAVIRT=y behavior because CONFIG_KVM_GUEST currently depends on CONFIG_PARAVIRT, and, because accesses in that case are completely unchecked, we wouldn't even see a warning. This series changes the native behavior, regardless of CONFIG_PARAVIRT. A non-"safe" MSR failure will give an informative warning once and will be fixed up -- native_read_msr will return zero, and both reads and writes will continue where they left off. If panic_on_oops is set, they will still OOPS and panic. By using the shiny new custom exception handler infrastructure, there should be no overhead on the success paths. I didn't change the behavior on Xen, but, with this series applied, it would be straightforward for the Xen maintainers to make the corresponding change -- knowledge of whether the access is "safe" is now propagated into the pvops. Doing this is probably a prerequisite to sanely decoupling CONFIG_KVM_GUEST and CONFIG_PARAVIRT, which would probably make Arjan and the rest of the Clear Containers people happy :) There's also room to reduce the code size of the "safe" variants using custom exception handlers in the future. Changes from v3: - WARN_ONCE instead of WARN (Ingo) - In the warning text, s/unsafe/unchecked/ (Ingo, sort of) Changes from earlier versions: lots of changes! 
Andy Lutomirski (5): x86/paravirt: Add _safe to the read_msr and write_msr PV hooks x86/msr: Carry on after a non-"safe" MSR access fails without !panic_on_oops x86/paravirt: Add paravirt_{read,write}_msr x86/paravirt: Make "unsafe" MSR accesses unsafe even if PARAVIRT=y x86/msr: Set the return value to zero when native_rdmsr_safe fails arch/x86/include/asm/msr.h| 20 arch/x86/include/asm/paravirt.h | 45 +-- arch/x86/include/asm/paravirt_types.h | 14 +++ arch/x86/kernel/paravirt.c| 6 +++-- arch/x86/mm/extable.c | 33 + arch/x86/xen/enlighten.c | 27 +++-- 6 files changed, 114 insertions(+), 31 deletions(-) I don't see any issues as far as Xen is concerned but let me run this through our nightly. I'll wait for the next version (which I think you might have based on the comments). Please copy me. -boris
Re: [PATCH v4 04/14] x86/rtc: replace paravirt rtc check with platform legacy quirk
On 04/06/2016 08:06 PM, Luis R. Rodriguez wrote: We have 4 types of x86 platforms that disable RTC: * Intel MID * Lguest - uses paravirt * Xen dom-U - uses paravirt * x86 on legacy systems annotated with an ACPI legacy flag We can consolidate all of these into a platform specific legacy quirk set early in boot through i386_start_kernel() and through x86_64_start_reservations(). This deals with the RTC quirks which we can rely on through the hardware subarch, the ACPI check can be dealt with separately. v2: split the subarch check from the ACPI check, clarify on the ACPI change commit log why ordering works Suggested-by: Ingo MolnarSigned-off-by: Luis R. Rodriguez --- arch/x86/Makefile | 1 + arch/x86/include/asm/paravirt.h | 6 -- arch/x86/include/asm/paravirt_types.h | 5 - arch/x86/include/asm/processor.h | 1 - arch/x86/include/asm/x86_init.h | 13 + arch/x86/kernel/Makefile | 6 +- arch/x86/kernel/head32.c | 2 ++ arch/x86/kernel/head64.c | 1 + arch/x86/kernel/platform-quirks.c | 18 ++ arch/x86/kernel/rtc.c | 7 ++- arch/x86/lguest/boot.c| 1 - arch/x86/xen/enlighten.c | 3 --- 12 files changed, 42 insertions(+), 22 deletions(-) create mode 100644 arch/x86/kernel/platform-quirks.c diff --git a/arch/x86/Makefile b/arch/x86/Makefile index 4086abca0b32..f9ed8a7ce2b6 100644 --- a/arch/x86/Makefile +++ b/arch/x86/Makefile @@ -209,6 +209,7 @@ endif head-y := arch/x86/kernel/head_$(BITS).o head-y += arch/x86/kernel/head$(BITS).o head-y += arch/x86/kernel/head.o +head-y += arch/x86/kernel/platform-quirks.o libs-y += arch/x86/lib/ diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h index 601f1b8f9961..6c7a4a192032 100644 --- a/arch/x86/include/asm/paravirt.h +++ b/arch/x86/include/asm/paravirt.h @@ -20,12 +20,6 @@ static inline int paravirt_enabled(void) return pv_info.paravirt_enabled; } -static inline int paravirt_has_feature(unsigned int feature) -{ - WARN_ON_ONCE(!pv_info.paravirt_enabled); - return (pv_info.features & feature); -} - static inline 
void load_sp0(struct tss_struct *tss, struct thread_struct *thread) { diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h index e8c2326478c8..6acc1b26cf40 100644 --- a/arch/x86/include/asm/paravirt_types.h +++ b/arch/x86/include/asm/paravirt_types.h @@ -70,14 +70,9 @@ struct pv_info { #endif int paravirt_enabled; - unsigned int features;/* valid only if paravirt_enabled is set */ const char *name; }; -#define paravirt_has(x) paravirt_has_feature(PV_SUPPORTED_##x) -/* Supported features */ -#define PV_SUPPORTED_RTC(1<<0) - struct pv_init_ops { /* * Patch may replace one of the defined code sequences with diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h index 9264476f3d57..0c70c7daa6b8 100644 --- a/arch/x86/include/asm/processor.h +++ b/arch/x86/include/asm/processor.h @@ -474,7 +474,6 @@ static inline unsigned long current_top_of_stack(void) #else #define __cpuid native_cpuid #define paravirt_enabled()0 -#define paravirt_has(x)0 static inline void load_sp0(struct tss_struct *tss, struct thread_struct *thread) diff --git a/arch/x86/include/asm/x86_init.h b/arch/x86/include/asm/x86_init.h index 1ae89a2721d6..27d5c3fe5198 100644 --- a/arch/x86/include/asm/x86_init.h +++ b/arch/x86/include/asm/x86_init.h @@ -142,6 +142,15 @@ struct x86_cpuinit_ops { struct timespec; /** + * struct x86_legacy_features - legacy x86 features + * + * @rtc: this device has a CMOS real-time clock present + */ +struct x86_legacy_features { + int rtc; +}; + +/** * struct x86_platform_ops - platform specific runtime functions * @calibrate_tsc:calibrate TSC * @get_wallclock:get time from HW clock like RTC etc. 
@@ -152,6 +161,7 @@ struct timespec; * @save_sched_clock_state: save state for sched_clock() on suspend * @restore_sched_clock_state:restore state for sched_clock() on resume * @apic_post_init: adjust apic if neeeded + * @legacy:legacy features */ struct x86_platform_ops { unsigned long (*calibrate_tsc)(void); @@ -165,6 +175,7 @@ struct x86_platform_ops { void (*save_sched_clock_state)(void); void (*restore_sched_clock_state)(void); void (*apic_post_init)(void); + struct x86_legacy_features legacy; }; struct pci_dev; @@ -186,6 +197,8 @@ extern struct x86_cpuinit_ops x86_cpuinit; extern struct x86_platform_ops x86_platform; extern struct x86_msi_ops x86_msi; extern struct
Re: [PATCH v4 08/14] apm32: remove paravirt_enabled() use
On 04/06/2016 08:06 PM, Luis R. Rodriguez wrote: There is already a check for apm_info.bios == 0; the apm_info.bios is set from boot_params.apm_bios_info. Both Xen and lguest, which are also the only ones that set paravirt_enabled to true, never set the apm_bios_info. The Xen folks are sure the force disable to 0 is not needed, Because apm_info lives in .bss (which we recently made sure is cleared on Xen PV). May be worth mentioning in the commit message so that we don't forget why this is not needed. I think you also have this statement in other patches. -boris we recently force-disabled this on lguest. With this in place the paravirt_enabled() check is simply not needed anymore. Signed-off-by: Luis R. Rodriguez --- arch/x86/kernel/apm_32.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/x86/kernel/apm_32.c b/arch/x86/kernel/apm_32.c index 9307f182fe30..c7364bd633e1 100644 --- a/arch/x86/kernel/apm_32.c +++ b/arch/x86/kernel/apm_32.c @@ -2267,7 +2267,7 @@ static int __init apm_init(void) dmi_check_system(apm_dmi_table); - if (apm_info.bios.version == 0 || paravirt_enabled() || machine_is_olpc()) { + if (apm_info.bios.version == 0 || machine_is_olpc()) { printk(KERN_INFO "apm: BIOS not found.\n"); return -ENODEV; }
Re: [Xen-devel] [PATCH v5 0/9] Improve non-"safe" MSR access failure handling
On 04/02/2016 10:01 AM, Andy Lutomirski wrote: Andy Lutomirski (9): x86/head: Pass a real pt_regs and trapnr to early_fixup_exception x86/head: Move the early NMI fixup into C x86/head: Move early exception panic code into early_fixup_exception x86/traps: Enable all exception handler callbacks early x86/paravirt: Add _safe to the read_msr and write_msr PV hooks x86/msr: Carry on after a non-"safe" MSR access fails x86/paravirt: Add paravirt_{read,write}_msr x86/paravirt: Make "unsafe" MSR accesses unsafe even if PARAVIRT=y x86/msr: Set the return value to zero when native_rdmsr_safe fails With Xen (on top of eb1af3b): Tested-by: Boris Ostrovsky <boris.ostrov...@oracle.com> -boris
Re: Xen regression, Was: [PATCH] x86/irq: Probe for PIC presence before allocating descs for legacy IRQs
On 04/12/2016 02:06 PM, Stefano Stabellini wrote: On Tue, 12 Apr 2016, Boris Ostrovsky wrote: On 04/11/2016 10:08 PM, Stefano Stabellini wrote: Hi all, Unfortunately this patch (now commit 8c058b0b9c34d8c8d7912880956543769323e2d8) causes a regression on Xen when running on top of QEMU: the number of PIC irqs gets set to 0 by probe_8259A but actually there are 16. Any suggestions on how to fix this? 1) we could revert 8c058b0b9c34d8c8d7912880956543769323e2d8 2) we could introduce an 'if (!xen_domain())' in probe_8259A 3) suggestions welcome Stefano, do you have b4ff8389ed14b849354b59ce9b360bdefcdbf99c? It was supposed to fix this problem for Xen. However, I just noticed that arch/arm64/include/asm/irq.h makes nr_legacy_irqs() return 0 (unlike arch/arm/include/asm/irq.h). Could that be the problem? I have b4ff8389ed14b849354b59ce9b360bdefcdbf99c but it doesn't fix the issue for me. Is the idea of your patch that xen_allocate_irq_gsi will allocate the descriptor dynamically instead? Right. If so, it doesn't work because it doesn't get called for irq 14: So how has it worked until now then? piix_init_one -> ata_pci_sff_activate_host -> devm_request_irq -> devm_request_threaded_irq -> request_threaded_irq -> irq_to_desc(14) -> -EINVAL If you look at pci_xen_initial_domain, the loop: for (irq = 0; irq < nr_legacy_irqs(); irq++) { won't work anymore because by the time it is called, nr_legacy_irqs() already returns 0, because it has been changed by probe_8259A(). We also need the following patch: --- xen/x86: actually allocate legacy interrupts on PV guests b4ff8389ed14 is incomplete: it relies on nr_legacy_irqs() to get the number of legacy interrupts when actually nr_legacy_irqs() returns 0 after probe_8259A(). Use NR_IRQS_LEGACY instead. 
Signed-off-by: Stefano Stabellini <sstabell...@kernel.org> diff --git a/arch/x86/pci/xen.c b/arch/x86/pci/xen.c index beac4df..6db0060 100644 --- a/arch/x86/pci/xen.c +++ b/arch/x86/pci/xen.c @@ -492,7 +492,7 @@ int __init pci_xen_initial_domain(void) __acpi_register_gsi = acpi_register_gsi_xen; __acpi_unregister_gsi = NULL; /* Pre-allocate legacy irqs */ - for (irq = 0; irq < nr_legacy_irqs(); irq++) { + for (irq = 0; irq < NR_IRQS_LEGACY; irq++) { int trigger, polarity; if (acpi_get_override_irq(irq, &trigger, &polarity) == -1) Won't we need the same change in the 'if (0 == nr_ioapics)' clause? -boris
Re: Xen regression, Was: [PATCH] x86/irq: Probe for PIC presence before allocating descs for legacy IRQs
On 04/12/2016 05:56 PM, Stefano Stabellini wrote: I am not sure, maybe you didn't have CONFIG_SPARSE_IRQ? But I am certain that 4.6-rc2, with the attached config, fails as Dom0 on QEMU with the following sequence of calls: I did have CONFIG_SPARSE_IRQ and I just rebuilt 4.5.0 with your config (4.6-rc3 doesn't build for me for some reason) and that booted dom0 as well. BTW, what do you mean by "dom0 on QEMU"? -boris
Re: Xen regression, Was: [PATCH] x86/irq: Probe for PIC presence before allocating descs for legacy IRQs
On 04/12/2016 05:14 PM, Stefano Stabellini wrote: On Tue, 12 Apr 2016, Boris Ostrovsky wrote: On 04/12/2016 02:06 PM, Stefano Stabellini wrote: On Tue, 12 Apr 2016, Boris Ostrovsky wrote: On 04/11/2016 10:08 PM, Stefano Stabellini wrote: Hi all, Unfortunately this patch (now commit 8c058b0b9c34d8c8d7912880956543769323e2d8) causes a regression on Xen when running on top of QEMU: the number of PIC irqs gets set to 0 by probe_8259A but actually there are 16. Any suggestions on how to fix this? 1) we could revert 8c058b0b9c34d8c8d7912880956543769323e2d8 2) we could introduce an 'if (!xen_domain())' in probe_8259A 3) suggestions welcome Stefano, do you have b4ff8389ed14b849354b59ce9b360bdefcdbf99c? It was supposed to fix this problem for Xen. However, I just noticed that arch/arm64/include/asm/irq.h makes nr_legacy_irqs() return 0 (unlike arch/arm/include/asm/irq.h). Could that be the problem? I have b4ff8389ed14b849354b59ce9b360bdefcdbf99c but it doesn't fix the issue for me. Is the idea of your patch that xen_allocate_irq_gsi will allocate the descriptor dynamically instead? Right. If so, it doesn't work because it doesn't get called for irq 14: So how has it worked until now then? It didn't for me :-) Maybe it was working for DomUs, but not for Dom0? I most certainly run this as both dom0 and domU on variety of machines. So perhaps there is some other problem? piix_init_one -> ata_pci_sff_activate_host -> devm_request_irq -> devm_request_threaded_irq -> request_threaded_irq -> irq_to_desc(14) -> -EINVAL If you look at pci_xen_initial_domain, the loop: for (irq = 0; irq < nr_legacy_irqs(); irq++) { won't work anymore because by the time it is called, nr_legacy_irqs() already returns 0, because it has been changed by probe_8259A(). 
We also need the following patch: --- xen/x86: actually allocate legacy interrupts on PV guests b4ff8389ed14 is incomplete: it relies on nr_legacy_irqs() to get the number of legacy interrupts when actually nr_legacy_irqs() returns 0 after probe_8259A(). Use NR_IRQS_LEGACY instead. Signed-off-by: Stefano Stabellini <sstabell...@kernel.org> diff --git a/arch/x86/pci/xen.c b/arch/x86/pci/xen.c index beac4df..6db0060 100644 --- a/arch/x86/pci/xen.c +++ b/arch/x86/pci/xen.c @@ -492,7 +492,7 @@ int __init pci_xen_initial_domain(void) __acpi_register_gsi = acpi_register_gsi_xen; __acpi_unregister_gsi = NULL; /* Pre-allocate legacy irqs */ - for (irq = 0; irq < nr_legacy_irqs(); irq++) { + for (irq = 0; irq < NR_IRQS_LEGACY; irq++) { int trigger, polarity; if (acpi_get_override_irq(irq, &trigger, &polarity) == -1) Won't we need the same change in the 'if (0 == nr_ioapics)' clause? That's not a problem: there is 1 ioapic in my system and is detected correctly. True, but if a system has no ioapics then we will again fail to manage the pirq, no? -boris
Re: Xen regression, Was: [PATCH] x86/irq: Probe for PIC presence before allocating descs for legacy IRQs
On 04/12/2016 07:15 PM, Stefano Stabellini wrote: On Tue, 12 Apr 2016, Boris Ostrovsky wrote: On 04/12/2016 05:56 PM, Stefano Stabellini wrote: I am not sure, maybe you didn't have CONFIG_SPARSE_IRQ? But I am certain that 4.6-rc2, with the attached config, fails as Dom0 on QEMU with the following sequence of calls: I did have CONFIG_SPARSE_IRQ and I just rebuilt 4.5.0 with your config (4.6-rc3 doesn't build for me for some reason) and that booted dom0 as well. BTW, what do you mean by "dom0 on QEMU"? I am running Xen and Linux inside a QEMU x86_64 emulated machine (nested virt). This I, of course, never tried. But given that things work in a single-level virt, doesn't this imply that perhaps there is something in the emulation that's not quite right? -boris
Re: [PATCH v4 04/14] x86/rtc: replace paravirt rtc check with platform legacy quirk
On 04/08/2016 02:29 AM, Luis R. Rodriguez wrote: On Thu, Apr 7, 2016 at 10:18 PM, Juergen Gross <jgr...@suse.com> wrote: On 08/04/16 02:32, Luis R. Rodriguez wrote: On Thu, Apr 07, 2016 at 08:55:54AM -0400, Boris Ostrovsky wrote: On 04/06/2016 08:06 PM, Luis R. Rodriguez wrote: We have 4 types of x86 platforms that disable RTC: * Intel MID * Lguest - uses paravirt * Xen dom-U - uses paravirt * x86 on legacy systems annotated with an ACPI legacy flag We can consolidate all of these into a platform specific legacy quirk set early in boot through i386_start_kernel() and through x86_64_start_reservations(). This deals with the RTC quirks which we can rely on through the hardware subarch, the ACPI check can be dealt with separately. v2: split the subarch check from the ACPI check, clarify on the ACPI change commit log why ordering works Suggested-by: Ingo Molnar <mi...@kernel.org> Signed-off-by: Luis R. Rodriguez <mcg...@kernel.org> <-- snip --> diff --git a/arch/x86/kernel/platform-quirks.c b/arch/x86/kernel/platform-quirks.c new file mode 100644 index ..1b114ac5996f --- /dev/null +++ b/arch/x86/kernel/platform-quirks.c @@ -0,0 +1,18 @@ +#include +#include + +#include +#include + +void __init x86_early_init_platform_quirks(void) +{ + x86_platform.legacy.rtc = 1; + + switch (boot_params.hdr.hardware_subarch) { + case X86_SUBARCH_XEN: + case X86_SUBARCH_LGUEST: + case X86_SUBARCH_INTEL_MID: + x86_platform.legacy.rtc = 0; + break; + } +} What about Xen dom0 (aka initial domain)? 
Indeed, thanks for catching this, the hunk below removes the re-enablement of the RTC for dom0: --- a/arch/x86/xen/enlighten.c +++ b/arch/x86/xen/enlighten.c @@ -1192,7 +1192,6 @@ static const struct pv_info xen_info __initconst = { #ifdef CONFIG_X86_64 .extra_user_64bit_cs = FLAT_USER_CS64, #endif - .features = 0, .name = "Xen", }; @@ -1525,8 +1524,6 @@ asmlinkage __visible void __init xen_start_kernel(void) /* Install Xen paravirt ops */ pv_info = xen_info; - if (xen_initial_domain()) - pv_info.features |= PV_SUPPORTED_RTC; pv_init_ops = xen_init_ops; if (!xen_pvh_domain()) { pv_cpu_ops = xen_cpu_ops; This should then break dom0 unless of course you have the respective next patch applied and that disabled the RTC due to an ACPI setting on your platform. Juergen, can you check to see if that was the case for your testing platform on dom0? Are you sure it would break? No, I suspected that it should, though. Wouldn't it just fall back to another clock source, e.g. hpet? I suppose so. I looked into my test system: seems as if add_rtc_cmos() is returning before the .legacy.rtc test. OK thanks... It works because the clock must have been discovered by ACPI prior to the add_rtc_cmos() call. It's a PNP0b00 object, I believe. The rest of the routine is to handle the case when the RTC is not found in ACPI tables for whatever reasons (I think). That's why we added paravirt_has(RTC) --- dom0 should be able to handle such cases, just like bare metal. -boris
Re: [PATCH v4 04/14] x86/rtc: replace paravirt rtc check with platform legacy quirk
On 04/08/2016 03:59 AM, Juergen Gross wrote: On 08/04/16 09:36, Luis R. Rodriguez wrote: On Fri, Apr 8, 2016 at 12:13 AM, Juergen Gross wrote: On 08/04/16 08:56, Luis R. Rodriguez wrote: On Thu, Apr 7, 2016 at 11:38 PM, Juergen Gross wrote: Okay. Another idea (not sure whether this is really a good one): Add X86_SUBARCH_XEN_DOM0. As hardware_subarch is 32 bits wide I don't think the number of subarchs is a scarce resource. :-) This would mean bumping the x86 boot protocol, we shouldn't take that lightly, but given that in this case the new subarch would really only be set by the kernel (or future loaders for perhaps HVMLite) I'd think this is not such an intrusive alternative. I think adding a separate subarch for dom0 isn't that bad. It really is different from domU as dom0 by default has access to the real hardware (or at least to most of it). Can we do this (overwrite quirks) in x86_init_ops.arch_setup? I'd really like to avoid adding what essentially is a sub-subarch. -boris
Re: [PATCH v5 04/14] x86/rtc: replace paravirt rtc check with platform legacy quirk
On 04/08/2016 07:40 PM, Luis R. Rodriguez wrote: diff --git a/arch/x86/include/asm/x86_init.h b/arch/x86/include/asm/x86_init.h index 1ae89a2721d6..8bb8c1a4615a 100644 --- a/arch/x86/include/asm/x86_init.h +++ b/arch/x86/include/asm/x86_init.h @@ -142,6 +142,15 @@ struct x86_cpuinit_ops { struct timespec; /** + * struct x86_legacy_features - legacy x86 features + * + * @rtc: this device has a CMOS real-time clock present + */ +struct x86_legacy_features { + int rtc; +}; + +/** * struct x86_platform_ops - platform specific runtime functions * @calibrate_tsc: calibrate TSC * @get_wallclock: get time from HW clock like RTC etc. @@ -152,6 +161,14 @@ struct timespec; * @save_sched_clock_state: save state for sched_clock() on suspend * @restore_sched_clock_state: restore state for sched_clock() on resume * @apic_post_init: adjust apic if needed + * @legacy: legacy features + * @set_legacy_features: override legacy features. Use of this callback + * is highly discouraged. You should only need + * this if your hardware platform requires further + * custom fine tuning far beyond what may be + * possible in x86_early_init_platform_quirks() by + * only using the current x86_hardware_subarch + * semantics. */ struct x86_platform_ops { unsigned long (*calibrate_tsc)(void); @@ -165,6 +182,8 @@ struct x86_platform_ops { void (*save_sched_clock_state)(void); void (*restore_sched_clock_state)(void); void (*apic_post_init)(void); + struct x86_legacy_features legacy; I don't think this belongs here --- we are in the ops structure. -boris + void (*set_legacy_features)(void); }; struct pci_dev; @@ -186,6 +205,8 @@ extern struct x86_cpuinit_ops x86_cpuinit; extern struct x86_platform_ops x86_platform; extern struct x86_msi_ops x86_msi; extern struct x86_io_apic_ops x86_io_apic_ops; + +extern void x86_early_init_platform_quirks(void); extern void x86_init_noop(void); extern void x86_init_uint_noop(unsigned int unused);
Re: Xen regression, Was: [PATCH] x86/irq: Probe for PIC presence before allocating descs for legacy IRQs
On 04/11/2016 10:08 PM, Stefano Stabellini wrote: Hi all, Unfortunately this patch (now commit 8c058b0b9c34d8c8d7912880956543769323e2d8) causes a regression on Xen when running on top of QEMU: the number of PIT irqs get set to 0 by probe_8259A but actually there are 16. Any suggestions on how to fix this? 1) we could revert 8c058b0b9c34d8c8d7912880956543769323e2d8 2) we could introduce an 'if (!xen_domain())' in probe_8259A 3) suggestions welcome Stefano, do you have b4ff8389ed14b849354b59ce9b360bdefcdbf99c ? It was supposed to fix this problem for Xen. However, I just noticed that arch/arm64/include/asm/irq.h makes nr_legacy_irqs() return 0 (unlike arch/arm/include/asm/irq.h). Could that be the problem? -boris
[PATCH 2/2] xen/x86: Call cpu_startup_entry(CPUHP_AP_ONLINE_IDLE) from xen_play_dead()
This call has always been missing from xen_play_dead() but until recently this was rather benign. With the new cpu hotplug framework, however, this call is required; otherwise a hot-plugged CPU will never call cpuhp_online_idle() and so will not be properly brought up. Signed-off-by: Boris Ostrovsky <boris.ostrov...@oracle.com> --- arch/x86/xen/smp.c |2 ++ 1 files changed, 2 insertions(+), 0 deletions(-) diff --git a/arch/x86/xen/smp.c b/arch/x86/xen/smp.c index 3c6d17f..719cf29 100644 --- a/arch/x86/xen/smp.c +++ b/arch/x86/xen/smp.c @@ -545,6 +545,8 @@ static void xen_play_dead(void) /* used only with HOTPLUG_CPU */ * data back is to call: */ tick_nohz_idle_enter(); + + cpu_startup_entry(CPUHP_AP_ONLINE_IDLE); } #else /* !CONFIG_HOTPLUG_CPU */ -- 1.7.1
[PATCH 0/2] Hotplug fixes
Two patches fixing regressions introduced by the new CPU hotplug framework Boris Ostrovsky (2): xen/apic: Provide Xen-specific version of cpu_present_to_apicid APIC op xen/x86: Call cpu_startup_entry(CPUHP_AP_ONLINE_IDLE) from xen_play_dead() arch/x86/xen/apic.c | 12 ++-- arch/x86/xen/smp.c |2 ++ 2 files changed, 12 insertions(+), 2 deletions(-)
Re: [Xen-devel] [PATCH] xen/events: Mask a moving irq
On 03/17/2016 12:03 PM, David Vrabel wrote: On 17/03/16 12:45, Boris Ostrovsky wrote: Moving an unmasked irq may result in irq handler being invoked on both source and target CPUs. With 2-level this can happen as follows: On source CPU: evtchn_2l_handle_events() -> generic_handle_irq() -> handle_edge_irq() -> eoi_pirq(): irq_move_irq(data); /* WE ARE HERE */ if (VALID_EVTCHN(evtchn)) clear_evtchn(evtchn); If at this moment target processor is handling an unrelated event in evtchn_2l_handle_events()'s loop it may pick up our event since target's cpu_evtchn_mask claims that this event belongs to it *and* the event is unmasked and still pending. At the same time, source CPU will continue executing its own handle_edge_irq(). With FIFO interrupt the scenario is similar: irq_move_irq() may result in a EVTCHNOP_unmask hypercall which, in turn, may make the event pending on the target CPU. We can avoid this situation by moving and clearing the event while keeping event masked. Can you do: if (unlikely(irqd_is_setaffinity_pending(data))) { masked = test_and_set_mask() clear_evtchn() irq_move_masked_irq() I did think about this but then I wasn't sure whether this might open some other window for things to sneak in. It shouldn't but these things are rather subtle so I'd rather leave the order of how operations are done unchanged. But I should indeed use irq_move_masked_irq() instead of irq_move_irq(). -boris unmask(masked); } else clear_evtchn()
Re: [Xen-devel] [PATCH] xen/events: Mask a moving irq
On 03/17/2016 01:29 PM, David Vrabel wrote: On 17/03/16 16:53, Boris Ostrovsky wrote: On 03/17/2016 12:03 PM, David Vrabel wrote: On 17/03/16 12:45, Boris Ostrovsky wrote: Moving an unmasked irq may result in irq handler being invoked on both source and target CPUs. With 2-level this can happen as follows: On source CPU: evtchn_2l_handle_events() -> generic_handle_irq() -> handle_edge_irq() -> eoi_pirq(): irq_move_irq(data); /* WE ARE HERE */ if (VALID_EVTCHN(evtchn)) clear_evtchn(evtchn); If at this moment target processor is handling an unrelated event in evtchn_2l_handle_events()'s loop it may pick up our event since target's cpu_evtchn_mask claims that this event belongs to it *and* the event is unmasked and still pending. At the same time, source CPU will continue executing its own handle_edge_irq(). With FIFO interrupt the scenario is similar: irq_move_irq() may result in a EVTCHNOP_unmask hypercall which, in turn, may make the event pending on the target CPU. We can avoid this situation by moving and clearing the event while keeping event masked. Can you do: if (unlikely(irqd_is_setaffinity_pending(data))) { masked = test_and_set_mask() clear_evtchn() irq_move_masked_irq() I did think about this but then I wasn't sure whether this might open some other window for things to sneak in. It shouldn't but these things are rather subtle so I'd rather leave the order of how operations are done unchanged. This is the order your patch has though. I'm confused. Ugh, sorry --- I misread what you wrote, I thought you wanted to clear before masking. Which wouldn't make any sense. So yes, what you are suggesting is better. -boris
[PATCH] xen/events: Mask a moving irq
Moving an unmasked irq may result in irq handler being invoked on both source and target CPUs. With 2-level this can happen as follows: On source CPU: evtchn_2l_handle_events() -> generic_handle_irq() -> handle_edge_irq() -> eoi_pirq(): irq_move_irq(data); /* WE ARE HERE */ if (VALID_EVTCHN(evtchn)) clear_evtchn(evtchn); If at this moment target processor is handling an unrelated event in evtchn_2l_handle_events()'s loop it may pick up our event since target's cpu_evtchn_mask claims that this event belongs to it *and* the event is unmasked and still pending. At the same time, source CPU will continue executing its own handle_edge_irq(). With FIFO interrupt the scenario is similar: irq_move_irq() may result in a EVTCHNOP_unmask hypercall which, in turn, may make the event pending on the target CPU. We can avoid this situation by moving and clearing the event while keeping event masked. Signed-off-by: Boris Ostrovsky <boris.ostrov...@oracle.com> Cc: sta...@vger.kernel.org --- drivers/xen/events/events_base.c | 26 -- 1 files changed, 24 insertions(+), 2 deletions(-) diff --git a/drivers/xen/events/events_base.c b/drivers/xen/events/events_base.c index 524c221..c5725ee 100644 --- a/drivers/xen/events/events_base.c +++ b/drivers/xen/events/events_base.c @@ -483,12 +483,23 @@ static void eoi_pirq(struct irq_data *data) int evtchn = evtchn_from_irq(data->irq); struct physdev_eoi eoi = { .irq = pirq_from_irq(data->irq) }; int rc = 0; + int need_unmask = 0; - irq_move_irq(data); + if (unlikely(irqd_is_setaffinity_pending(data))) { + if (VALID_EVTCHN(evtchn)) + need_unmask = !test_and_set_mask(evtchn); + } if (VALID_EVTCHN(evtchn)) clear_evtchn(evtchn); + irq_move_irq(data); + + if (VALID_EVTCHN(evtchn)) { + if (unlikely(need_unmask)) + unmask_evtchn(evtchn); + } + if (pirq_needs_eoi(data->irq)) { rc = HYPERVISOR_physdev_op(PHYSDEVOP_eoi, &eoi); WARN_ON(rc); @@ -1356,11 +1367,22 @@ static void disable_dynirq(struct irq_data *data) static void ack_dynirq(struct irq_data *data) {
int evtchn = evtchn_from_irq(data->irq); + int need_unmask = 0; - irq_move_irq(data); + if (unlikely(irqd_is_setaffinity_pending(data))) { + if (VALID_EVTCHN(evtchn)) + need_unmask = !test_and_set_mask(evtchn); + } if (VALID_EVTCHN(evtchn)) clear_evtchn(evtchn); + + irq_move_irq(data); + + if (VALID_EVTCHN(evtchn)) { + if (unlikely(need_unmask)) + unmask_evtchn(evtchn); + } } static void mask_ack_dynirq(struct irq_data *data) -- 1.7.7.6
[PATCH 2/2] hotplug: Prevent alloc/free of irq descriptors during cpu up/down (again)
Now that Xen no longer allocates irqs in _cpu_up() we can restore commit a89941816726 ("hotplug: Prevent alloc/free of irq descriptors during cpu up/down") Signed-off-by: Boris Ostrovsky <boris.ostrov...@oracle.com> --- arch/x86/kernel/smpboot.c | 11 --- kernel/cpu.c |8 2 files changed, 8 insertions(+), 11 deletions(-) diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c index 643dbdc..cabe21e 100644 --- a/arch/x86/kernel/smpboot.c +++ b/arch/x86/kernel/smpboot.c @@ -1083,17 +1083,8 @@ int native_cpu_up(unsigned int cpu, struct task_struct *tidle) common_cpu_up(cpu, tidle); - /* -* We have to walk the irq descriptors to setup the vector -* space for the cpu which comes online. Prevent irq -* alloc/free across the bringup. -*/ - irq_lock_sparse(); - err = do_boot_cpu(apicid, cpu, tidle); - if (err) { - irq_unlock_sparse(); pr_err("do_boot_cpu failed(%d) to wakeup CPU#%u\n", err, cpu); return -EIO; } @@ -,8 +1102,6 @@ int native_cpu_up(unsigned int cpu, struct task_struct *tidle) touch_nmi_watchdog(); } - irq_unlock_sparse(); - return 0; } diff --git a/kernel/cpu.c b/kernel/cpu.c index 6ea42e8..2ff63b3 100644 --- a/kernel/cpu.c +++ b/kernel/cpu.c @@ -342,8 +342,16 @@ static int bringup_cpu(unsigned int cpu) struct task_struct *idle = idle_thread_get(cpu); int ret; + /* +* Some architectures have to walk the irq descriptors to +* setup the vector space for the cpu which comes online. +* Prevent irq alloc/free across the bringup. +*/ + irq_lock_sparse(); + /* Arch-specific enabling code. */ ret = __cpu_up(cpu, idle); + irq_unlock_sparse(); if (ret) { cpu_notify(CPU_UP_CANCELED, cpu); return ret; -- 1.7.1
Re: [Xen-devel] 4.4: INFO: rcu_sched self-detected stall on CPU
On 03/24/2016 10:53 PM, Steven Haigh wrote: Hi all, Firstly, I've cross-posted this to xen-devel and the lkml - as this problem seems to only exist when using kernel 4.4 as a Xen DomU kernel. I have also CC'ed Greg KH for his awesome insight as maintainer. Please CC myself into replies - as I'm not a member of the kernel mailing list - I may miss replies from monitoring the archives. I've noticed recently that heavy disk IO is causing rcu_sched to detect stalls. The process mentioned usually goes to 100% CPU usage, and eventually processes start segfaulting and dying. The only fix to recover the system is to use 'xl destroy' to force-kill the VM and to start it again. The majority of these issues seem to mention ext4 in the trace. This may indicate an issue there - or may be a red herring. The gritty details: INFO: rcu_sched self-detected stall on CPU #0110-...: (20999 ticks this GP) idle=327/141/0 softirq=1101493/1101493 fqs=6973 #011 (t=21000 jiffies g=827095 c=827094 q=524) Task dump for CPU 0: rsync R running task0 2446 2444 0x0088 818d0c00 88007fc03c58 810a625f 818d0c00 88007fc03c70 810a8699 0001 88007fc03ca0 810d0e5a 88007fc170c0 818d0c00 Call Trace: [] sched_show_task+0xaf/0x110 [] dump_cpu_task+0x39/0x40 [] rcu_dump_cpu_stacks+0x8a/0xc0 [] rcu_check_callbacks+0x424/0x7a0 [] ? account_system_time+0x81/0x110 [] ? account_process_tick+0x61/0x160 [] ? tick_sched_do_timer+0x30/0x30 [] update_process_times+0x39/0x60 [] tick_sched_handle.isra.15+0x36/0x50 [] tick_sched_timer+0x3d/0x70 [] __hrtimer_run_queues+0xf2/0x250 [] hrtimer_interrupt+0xa8/0x190 [] xen_timer_interrupt+0x2e/0x140 [] handle_irq_event_percpu+0x55/0x1e0 [] handle_percpu_irq+0x3a/0x50 [] generic_handle_irq+0x22/0x30 [] __evtchn_fifo_handle_events+0x15f/0x180 [] evtchn_fifo_handle_events+0x10/0x20 [] __xen_evtchn_do_upcall+0x43/0x80 [] xen_evtchn_do_upcall+0x30/0x50 [] xen_hvm_callback_vector+0x82/0x90 [] ? 
queued_write_lock_slowpath+0x3d/0x80 [] _raw_write_lock+0x1e/0x30 This looks to me like ext4 failing to grab a lock. Everything above it (in Xen code) is regular tick interrupt handling which detects the stall. Your config does not have CONFIG_PARAVIRT_SPINLOCKS so that eliminates any possible issues with pv locks. Do you see anything "interesting" in dom0? (e.g. dmesg, xl dmesg, /var/log/xen/) Are you oversubscribing your guest (CPU-wise)? -boris [] ext4_es_remove_extent+0x43/0xc0 [] ext4_clear_inode+0x39/0x80 [] ext4_evict_inode+0x8d/0x4e0 [] evict+0xb7/0x180 [] dispose_list+0x36/0x50 [] prune_icache_sb+0x4b/0x60 [] super_cache_scan+0x141/0x190 [] shrink_slab.part.37+0x1ee/0x390 [] shrink_zone+0x26c/0x280 [] do_try_to_free_pages+0x15c/0x410 [] try_to_free_pages+0xba/0x170 [] __alloc_pages_nodemask+0x525/0xa60 [] ? kmem_cache_free+0xcc/0x2c0 [] alloc_pages_current+0x8d/0x120 [] __page_cache_alloc+0x91/0xc0 [] pagecache_get_page+0x56/0x1e0 [] grab_cache_page_write_begin+0x26/0x40 [] ext4_da_write_begin+0xa1/0x300 [] ? 
ext4_da_write_end+0x124/0x2b0 [] generic_perform_write+0xc0/0x1a0 [] __generic_file_write_iter+0x188/0x1e0 [] ext4_file_write_iter+0xf0/0x340 [] __vfs_write+0xaa/0xe0 [] vfs_write+0xa2/0x1a0 [] SyS_write+0x46/0xa0 [] entry_SYSCALL_64_fastpath+0x12/0x71 Some 11 hours later: sshd[785]: segfault at 1f0 ip 7f03bb94ae5c sp 7ffe9eb54470 error 4 in ld-2.17.so[7f03bb94+21000] sh[787]: segfault at 1f0 ip 7f6b4a0dfe5c sp 7ffe3d4a71e0 error 4 in ld-2.17.so[7f6b4a0d5000+21000] systemd-cgroups[788]: segfault at 1f0 ip 7f4baa82ce5c sp 7ffd28e4c4b0 error 4 in ld-2.17.so[7f4baa822000+21000] sshd[791]: segfault at 1f0 ip 7ff8c8a8ce5c sp 7ffede9e1c20 error 4 in ld-2.17.so[7ff8c8a82000+21000] sshd[792]: segfault at 1f0 ip 7f183cf75e5c sp 7ffc81ab7160 error 4 in ld-2.17.so[7f183cf6b000+21000] sshd[793]: segfault at 1f0 ip 7f3c665ece5c sp 7ffd9a13c850 error 4 in ld-2.17.so[7f3c665e2000+21000] From isolated testing, this does not occur on kernel 4.5.x - however I have not verified this myself. The kernel config used can be found in the kernel-xen git repo if it assists in debugging: http://xen.crc.id.au/git/?p=kernel-xen;a=summary ___ Xen-devel mailing list xen-de...@lists.xen.org http://lists.xen.org/xen-devel
Re: [PATCH 2/2] xen/x86: Call cpu_startup_entry(CPUHP_AP_ONLINE_IDLE) from xen_play_dead()
On 03/25/2016 10:53 AM, Konrad Rzeszutek Wilk wrote: On Thu, Mar 17, 2016 at 09:03:25AM -0400, Boris Ostrovsky wrote: This call has always been missing from xen_play_dead() but until recently this was rather benign. With new cpu hotplug framework however this call is required, otherwise a hot-plugged CPU will not Could you include the commit id of the 'new cpu hotplug' in case anybody wants to backport this? Sure. It's commit 8df3e07e7f21 ("cpu/hotplug: Let upcoming cpu bring itself fully up"). Do you (or David) want me to re-send it? -boris Thanks! be properly brought up (by never calling cpuhp_online_idle()) Signed-off-by: Boris Ostrovsky <boris.ostrov...@oracle.com> --- arch/x86/xen/smp.c |2 ++ 1 files changed, 2 insertions(+), 0 deletions(-) diff --git a/arch/x86/xen/smp.c b/arch/x86/xen/smp.c index 3c6d17f..719cf29 100644 --- a/arch/x86/xen/smp.c +++ b/arch/x86/xen/smp.c @@ -545,6 +545,8 @@ static void xen_play_dead(void) /* used only with HOTPLUG_CPU */ * data back is to call: */ tick_nohz_idle_enter(); + + cpu_startup_entry(CPUHP_AP_ONLINE_IDLE); } #else /* !CONFIG_HOTPLUG_CPU */ -- 1.7.1
Re: [Xen-devel] 4.4: INFO: rcu_sched self-detected stall on CPU
On 03/25/2016 10:05 AM, Steven Haigh wrote: On 25/03/2016 11:23 PM, Boris Ostrovsky wrote: On 03/24/2016 10:53 PM, Steven Haigh wrote: Hi all, Firstly, I've cross-posted this to xen-devel and the lkml - as this problem seems to only exist when using kernel 4.4 as a Xen DomU kernel. I have also CC'ed Greg KH for his awesome insight as maintainer. Please CC myself into replies - as I'm not a member of the kernel mailing list - I may miss replies from monitoring the archives. I've noticed recently that heavy disk IO is causing rcu_sched to detect stalls. The process mentioned usually goes to 100% CPU usage, and eventually processes start segfaulting and dying. The only fix to recover the system is to use 'xl destroy' to force-kill the VM and to start it again. The majority of these issues seem to mention ext4 in the trace. This may indicate an issue there - or may be a red herring. The gritty details: INFO: rcu_sched self-detected stall on CPU #0110-...: (20999 ticks this GP) idle=327/141/0 softirq=1101493/1101493 fqs=6973 #011 (t=21000 jiffies g=827095 c=827094 q=524) Task dump for CPU 0: rsync R running task0 2446 2444 0x0088 818d0c00 88007fc03c58 810a625f 818d0c00 88007fc03c70 810a8699 0001 88007fc03ca0 810d0e5a 88007fc170c0 818d0c00 Call Trace: [] sched_show_task+0xaf/0x110 [] dump_cpu_task+0x39/0x40 [] rcu_dump_cpu_stacks+0x8a/0xc0 [] rcu_check_callbacks+0x424/0x7a0 [] ? account_system_time+0x81/0x110 [] ? account_process_tick+0x61/0x160 [] ? tick_sched_do_timer+0x30/0x30 [] update_process_times+0x39/0x60 [] tick_sched_handle.isra.15+0x36/0x50 [] tick_sched_timer+0x3d/0x70 [] __hrtimer_run_queues+0xf2/0x250 [] hrtimer_interrupt+0xa8/0x190 [] xen_timer_interrupt+0x2e/0x140 [] handle_irq_event_percpu+0x55/0x1e0 [] handle_percpu_irq+0x3a/0x50 [] generic_handle_irq+0x22/0x30 [] __evtchn_fifo_handle_events+0x15f/0x180 [] evtchn_fifo_handle_events+0x10/0x20 [] __xen_evtchn_do_upcall+0x43/0x80 [] xen_evtchn_do_upcall+0x30/0x50 [] xen_hvm_callback_vector+0x82/0x90 [] ? 
queued_write_lock_slowpath+0x3d/0x80 [] _raw_write_lock+0x1e/0x30 This looks to me like ext4 failing to grab a lock. Everything above it (in Xen code) is regular tick interrupt handling which detects the stall. Your config does not have CONFIG_PARAVIRT_SPINLOCKS so that eliminates any possible issues with pv locks. Do you see anything "interesting" in dom0? (e.g. dmesg, xl dmesg, /var/log/xen/) Are you oversubscribing your guest (CPU-wise)? There is nothing special being logged anywhere that I can see. dmesg / xl dmesg on the Dom0 show nothing unusual. I do share CPUs - but I don't give any DomU more than 2 vcpus. The physical host has 4 cores - 1 pinned to the Dom0. I log to a remote syslog on this system - and I've uploaded the entire log to a pastebin (don't want to do a 45Kb attachment here): http://paste.fedoraproject.org/345095/58914452 That doesn't look like a full log. In any case, the RCU stall may be a secondary problem --- there is a bunch of splats before the stall. -boris Not sure if it makes any difference at all, but my DomU config is: # cat /etc/xen/backup.vm name= "backup.vm" memory = 2048 vcpus = 2 cpus= "1-3" disk= [ 'phy:/dev/vg_raid1_new/backup.vm,xvda,w' ] vif = [ "mac=00:11:36:35:35:09, bridge=br203, vifname=vm.backup, script=vif-bridge" ] bootloader = 'pygrub' pvh = 1 on_poweroff = 'destroy' on_reboot = 'restart' on_crash= 'restart' cpu_weight = 64 I never had this problem when running kernel 4.1.x - it only started when I upgraded everything to 4.4 - not exactly a great help - but may help narrow things down? [] ext4_es_remove_extent+0x43/0xc0 [] ext4_clear_inode+0x39/0x80 [] ext4_evict_inode+0x8d/0x4e0 [] evict+0xb7/0x180 [] dispose_list+0x36/0x50 [] prune_icache_sb+0x4b/0x60 [] super_cache_scan+0x141/0x190 [] shrink_slab.part.37+0x1ee/0x390 [] shrink_zone+0x26c/0x280 [] do_try_to_free_pages+0x15c/0x410 [] try_to_free_pages+0xba/0x170 [] __alloc_pages_nodemask+0x525/0xa60 [] ? 
kmem_cache_free+0xcc/0x2c0 [] alloc_pages_current+0x8d/0x120 [] __page_cache_alloc+0x91/0xc0 [] pagecache_get_page+0x56/0x1e0 [] grab_cache_page_write_begin+0x26/0x40 [] ext4_da_write_begin+0xa1/0x300 [] ? ext4_da_write_end+0x124/0x2b0 [] generic_perform_write+0xc0/0x1a0 [] __generic_file_write_iter+0x188/0x1e0 [] ext4_file_write_iter+0xf0/0x340 [] __vfs_write+0xaa/0xe0 [] vfs_write+0xa2/0x1a0 [] SyS_write+0x46/0xa0 [] entry_SYSCALL_64_fastpath+0x12/0x71 Some 11 hours later: sshd[785]: segfault at 1f0 ip 7f03bb94ae5c sp 7ffe9eb54470 error 4 in ld-2.17.so[7f03bb94+21000] sh[787]: segfault at 1f0 ip 7f6b4a0dfe5c sp 7ffe3d4a71e0 error 4 in ld-2.17.so[7f6b4a0d5000+21000] systemd-cgroups[788]: segfault at 1f0 ip