Re: [net-next RFC V4 PATCH 0/4] Multiqueue virtio-net
On 06/26/2012 01:49 AM, Sridhar Samudrala wrote:
> On 6/25/2012 2:16 AM, Jason Wang wrote:
>> Hello All:
>>
>> This series is an updated version of the multiqueue virtio-net driver, based on Krishna Kumar's work, to let virtio-net use multiple rx/tx queues for packet reception and transmission. Please review and comment.
>>
>> Test Environment:
>> - Intel(R) Xeon(R) CPU E5620 @ 2.40GHz, 8 cores 2 numa nodes
>> - Two directly connected 82599
>>
>> Test Summary:
>> - Highlights: huge improvements on TCP_RR test
>> - Lowlights: regression on small packet transmission, higher cpu utilization than single queue, need further optimization
>
> Does this also scale with increased number of VMs?

Hi Sridhar:

Good suggestion, I didn't measure that. I will run the tests and post the results.

Thanks

> Thanks
> Sridhar
>
>> Analysis of the performance result:
>> - I counted the number of packets sent and received during the test, and multiqueue shows much more capacity in terms of packets per second.
>> - For the tx regression, multiqueue sends about 1-2 times more packets than single queue, and the packets are much smaller than with single queue. I suspect tcp does less batching in multiqueue, so I hacked tcp_write_xmit() to force more batching; with that, multiqueue works as well as single queue for both small transmission and throughput.
>> - I didn't include accelerated RFS for virtio-net in this series as it still needs further shaping; for those interested, please see: http://www.mail-archive.com/kvm@vger.kernel.org/msg64111.html

-- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
-- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [net-next RFC V4 PATCH 0/4] Multiqueue virtio-net
On 06/26/2012 02:01 AM, Shirley Ma wrote:
> Hello Jason,
>
> Good work. Do you have local guest to guest results?
>
> Thanks
> Shirley

Hi Shirley:

I will run tests to measure the performance and post the results here.

Thanks
Re: [PATCHv2 4/5] KVM: emulator: move linearize() out of emulator code.
On Mon, Jun 25, 2012 at 06:50:06PM +0300, Avi Kivity wrote:
> Later we can extend x86_decode_insn() and the other functions to follow the same rule.
>
> What rule? We cannot not initialize a context. You can reduce things that should be initialized to minimum (getting GP registers on demand, etc), but still some initialization is needed since ctxt holds emulation state and it needs to be reset before each emulation.
>
> An alternative is to use two contexts, the base context only holds ops and is the parameter to all the callbacks on the non-state APIs, the derived context holds the state:
>
> struct x86_emulation_ctxt {
>         struct x86_ops *ops;
>         /* state that always needs to be initialized, preferably none */
> };
>
> struct x86_insn_ctxt {
>         struct x86_emulation_ctxt em;
>         /* instruction state */
> };
>
> and so we have a compile-time split between users of the state and non-users.
>
> I do not understand how you will divide current ctxt structure between those two. Where will you put those for instance: interruptibility, have_exception, perm_ok, only_vendor_specific_insn and how can they not be initialized before each instruction emulation?
>
> x86_emulate_ops::get_interruptibility()
> x86_emulate_ops::set_interruptibility()
> x86_emulate_ops::exception()

They do not remove the need for initialization before instruction execution, they just move things that need to be initialized somewhere else (to kvm_arch_vcpu likely).

> x86_decode_insn(struct x86_insn_ctxt *ctxt, unsigned flags)
> {
>         ctxt->flags = flags;
>         ctxt->perm_ok = false;
> }
>
> In short, instruction emulation state is only seen by instruction emulation functions, the others don't get to see it.

So you want to divide emulator.c into two types of function: those without side effects, which do some kind of calculation on vcpu state according to weird x86 rules, and those that change vcpu state and write it back eventually. I do not see the justification for that complication really. emulator.c is complicated enough already and the line between the two may be blurred.

If you dislike the linearize() callback so much I can make kvm_linearize_address() do its calculation based on its parameters only. It is almost there, only cpl and seg base/desc are missing from the parameter list. I can put it into a header and x86.c/emulator.c will both be able to use it.

--
Gleb.
Re: [PATCHv2 4/5] KVM: emulator: move linearize() out of emulator code.
On 06/26/2012 11:30 AM, Gleb Natapov wrote:
> Where will you put those for instance: interruptibility, have_exception, perm_ok, only_vendor_specific_insn and how can they not be initialized before each instruction emulation?
>
> x86_emulate_ops::get_interruptibility()
> x86_emulate_ops::set_interruptibility()
> x86_emulate_ops::exception()
>
> They do not remove the need for initialization before instruction execution, they just move things that need to be initialized somewhere else (to kvm_arch_vcpu likely).
>
> x86_decode_insn(struct x86_insn_ctxt *ctxt, unsigned flags)
> {
>         ctxt->flags = flags;
>         ctxt->perm_ok = false;
> }
>
> In short, instruction emulation state is only seen by instruction emulation functions, the others don't get to see it.
>
> So you want to divide emulator.c to two types of function: those without side effect, that do some kind of calculations on vcpu state according to weird x86 rules, and those that change vcpu state and write it back eventually. I do not see the justification for that complication really. emulator.c is complicated enough already and the line between two may be blurred.

Really, the only issue is that the read/write callbacks sometimes cannot return a result. Otherwise the entire thing would be stateless.

> If you dislike linearize() callback so much I can make kvm_linearize_address() to do calculation base on its parameters only. It is almost there, only cpl and seg base/desc are missing from parameter list. I can put it into header and x86.c/emulator.c will both be able to use it.

And all the stack mask and stuff? Yuck.

--
error compiling committee.c: too many arguments to function
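For readers following the two-context proposal in this thread, here is a minimal user-space sketch of the idea, not the actual KVM code. The struct names and the flags and perm_ok fields come from the thread; the single callback in x86_ops and the has_ops() helper are hypothetical and exist only to make the sketch compile.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, cut-down callback table; the real ops table has many more hooks. */
struct x86_ops {
    int (*get_interruptibility)(void *vcpu);
};

/* Base context: holds only the ops pointer, no per-instruction state. */
struct x86_emulation_ctxt {
    const struct x86_ops *ops;
};

/* Derived context: embeds the base context and adds per-instruction state. */
struct x86_insn_ctxt {
    struct x86_emulation_ctxt em;
    unsigned flags;
    bool perm_ok;
};

/* Only the decode entry point resets the per-instruction state. */
static void x86_decode_insn(struct x86_insn_ctxt *ctxt, unsigned flags)
{
    assert(ctxt != NULL);
    ctxt->flags = flags;
    ctxt->perm_ok = false;
}

/*
 * Hypothetical "non-state" helper: it receives the base context only, so
 * the compiler prevents it from touching flags or perm_ok.
 */
static bool has_ops(const struct x86_emulation_ctxt *em)
{
    return em->ops != NULL;
}
```

The split is enforced by types: a function that takes struct x86_emulation_ctxt * has no way to reach the instruction state, which is the "compile-time split" the proposal describes. Whether that is worth the extra structure is exactly what the thread disputes.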
[PATCH 0/1] vhost, use_mm and KERNEL_DS
Folks,

here is a patch that fixes vhost to use USER_DS before doing a use_mm/usercopy operation. This was found during vhost prototyping on s390, where we have separate user and kernel address spaces.

Jens Freimann (1):
  use USER_DS in vhost_worker thread

 drivers/vhost/vhost.c |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)
Re: [PATCH 1/1] use USER_DS in vhost_worker thread
On Tue, Jun 26, 2012 at 12:59:58PM +0200, Christian Borntraeger wrote:
> From: Jens Freimann <jf...@linux.vnet.ibm.com>
>
> On some architectures address spaces are set up in a way that this is not necessary to work properly but on some others (like s390) it is. Make sure we operate on the user address space to allow copy_xxx_user() from the vhost_worker() thread by setting it explicitly before calling use_mm() and revert it after unuse_mm().
>
> Signed-off-by: Jens Freimann <jf...@linux.vnet.ibm.com>
> Signed-off-by: Christian Borntraeger <borntrae...@de.ibm.com>

Acked-by: Michael S. Tsirkin <m...@redhat.com>

Dave, can you queue this up for 3.5 please?
Thanks.

> ---
>  drivers/vhost/vhost.c |    3 +++
>  1 files changed, 3 insertions(+), 0 deletions(-)
>
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 94dbd25..112156f 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -191,7 +191,9 @@ static int vhost_worker(void *data)
>          struct vhost_dev *dev = data;
>          struct vhost_work *work = NULL;
>          unsigned uninitialized_var(seq);
> +        mm_segment_t oldfs = get_fs();
>
> +        set_fs(USER_DS);
>          use_mm(dev->mm);
>
>          for (;;) {
> @@ -229,6 +231,7 @@ static int vhost_worker(void *data)
>          }
>          unuse_mm(dev->mm);
> +        set_fs(oldfs);
>          return 0;
>  }
> --
> 1.7.0.4
grub is stuck in grub_biosdisk_check_int13_extensions() with qemu-kvm
Greetings everybody,

We are running qemu-kvm-0.14.0 on stock ubuntu-natty 2.6.38-8, and booting a stock ubuntu-precise image (3.2.0-25-generic) as a guest. Occasionally, grub gets stuck before it reaches the OS selection menu. After adding prints to grub, we discovered that it gets stuck in grub_biosdisk_check_int13_extensions(), which does:

    struct grub_bios_int_registers regs;

    regs.edx = drive & 0xff;
    regs.eax = 0x4100;
    regs.ebx = 0x55aa;
    regs.flags = GRUB_CPU_INT_FLAGS_DEFAULT;
    grub_bios_interrupt (0x13, &regs);

We have confirmed that drive=0x80 at this point. Also, during normal boots we see that this function always returns the same values:

    flags:0x3202 eax:0x3000 ebx:0xaa55 ecx:0x7

This is in line with the code in extboot.S: int13_handler() and check_if_extensions_present().

Can anybody advise on how to debug this problem further? I am also attaching a screenshot of a stuck grub.

Thanks,
Alex.

attachment: grub_hang_with_drive.png
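As a reading aid for the register values quoted above, the following stand-alone sketch decodes the result of the INT 13h AH=41h installation check the way the interface is conventionally defined: carry clear, BX swapped to 0xaa55, CX bit 0 set for extended disk access, and the version in AH. The struct and function names are hypothetical and are not grub's own identifiers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CPU_FLAGS_CARRY 0x1u /* x86 EFLAGS carry bit */

/* Hypothetical snapshot of the registers after INT 13h returns. */
struct int13_result {
    uint32_t flags;
    uint32_t eax;
    uint32_t ebx;
    uint32_t ecx;
};

/* Returns the extension version from AH, or 0 if extensions are not usable. */
static unsigned int13_extension_version(const struct int13_result *r)
{
    assert(r != NULL);
    if (r->flags & CPU_FLAGS_CARRY)     /* carry set: call failed */
        return 0;
    if ((r->ebx & 0xffffu) != 0xaa55u)  /* BIOS must swap 0x55aa to 0xaa55 */
        return 0;
    if (!(r->ecx & 0x1u))               /* bit 0: extended disk access */
        return 0;
    return (r->eax >> 8) & 0xffu;       /* AH holds the version */
}
```

With the values from a normal boot (flags 0x3202, eax 0x3000, ebx 0xaa55, ecx 0x7) this reports version 0x30, so the return path looks healthy; the hang is therefore more likely inside the interrupt call itself than in the interpretation of its result.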
[PATCH 5/6] KVM: s390: use sigp condition code defines
From: Heiko Carstens heiko.carst...@de.ibm.com Just use the defines instead of using plain numbers and adding a comment behind each line. Signed-off-by: Heiko Carstens heiko.carst...@de.ibm.com Signed-off-by: Cornelia Huck cornelia.h...@de.ibm.com --- arch/s390/kvm/sigp.c | 58 +- 1 file changed, 29 insertions(+), 29 deletions(-) diff --git a/arch/s390/kvm/sigp.c b/arch/s390/kvm/sigp.c index ca544d5..97c9f36 100644 --- a/arch/s390/kvm/sigp.c +++ b/arch/s390/kvm/sigp.c @@ -26,19 +26,19 @@ static int __sigp_sense(struct kvm_vcpu *vcpu, u16 cpu_addr, int rc; if (cpu_addr = KVM_MAX_VCPUS) - return 3; /* not operational */ + return SIGP_CC_NOT_OPERATIONAL; spin_lock(fi-lock); if (fi-local_int[cpu_addr] == NULL) - rc = 3; /* not operational */ + rc = SIGP_CC_NOT_OPERATIONAL; else if (!(atomic_read(fi-local_int[cpu_addr]-cpuflags) CPUSTAT_STOPPED)) { *reg = 0xUL; - rc = 1; /* status stored */ + rc = SIGP_CC_STATUS_STORED; } else { *reg = 0xUL; *reg |= SIGP_STATUS_STOPPED; - rc = 1; /* status stored */ + rc = SIGP_CC_STATUS_STORED; } spin_unlock(fi-lock); @@ -54,7 +54,7 @@ static int __sigp_emergency(struct kvm_vcpu *vcpu, u16 cpu_addr) int rc; if (cpu_addr = KVM_MAX_VCPUS) - return 3; /* not operational */ + return SIGP_CC_NOT_OPERATIONAL; inti = kzalloc(sizeof(*inti), GFP_KERNEL); if (!inti) @@ -66,7 +66,7 @@ static int __sigp_emergency(struct kvm_vcpu *vcpu, u16 cpu_addr) spin_lock(fi-lock); li = fi-local_int[cpu_addr]; if (li == NULL) { - rc = 3; /* not operational */ + rc = SIGP_CC_NOT_OPERATIONAL; kfree(inti); goto unlock; } @@ -77,7 +77,7 @@ static int __sigp_emergency(struct kvm_vcpu *vcpu, u16 cpu_addr) if (waitqueue_active(li-wq)) wake_up_interruptible(li-wq); spin_unlock_bh(li-lock); - rc = 0; /* order accepted */ + rc = SIGP_CC_ORDER_CODE_ACCEPTED; VCPU_EVENT(vcpu, 4, sent sigp emerg to cpu %x, cpu_addr); unlock: spin_unlock(fi-lock); @@ -92,7 +92,7 @@ static int __sigp_external_call(struct kvm_vcpu *vcpu, u16 cpu_addr) int rc; if (cpu_addr = KVM_MAX_VCPUS) - 
return 3; /* not operational */ + return SIGP_CC_NOT_OPERATIONAL; inti = kzalloc(sizeof(*inti), GFP_KERNEL); if (!inti) @@ -104,7 +104,7 @@ static int __sigp_external_call(struct kvm_vcpu *vcpu, u16 cpu_addr) spin_lock(fi-lock); li = fi-local_int[cpu_addr]; if (li == NULL) { - rc = 3; /* not operational */ + rc = SIGP_CC_NOT_OPERATIONAL; kfree(inti); goto unlock; } @@ -115,7 +115,7 @@ static int __sigp_external_call(struct kvm_vcpu *vcpu, u16 cpu_addr) if (waitqueue_active(li-wq)) wake_up_interruptible(li-wq); spin_unlock_bh(li-lock); - rc = 0; /* order accepted */ + rc = SIGP_CC_ORDER_CODE_ACCEPTED; VCPU_EVENT(vcpu, 4, sent sigp ext call to cpu %x, cpu_addr); unlock: spin_unlock(fi-lock); @@ -143,7 +143,7 @@ static int __inject_sigp_stop(struct kvm_s390_local_interrupt *li, int action) out: spin_unlock_bh(li-lock); - return 0; /* order accepted */ + return SIGP_CC_ORDER_CODE_ACCEPTED; } static int __sigp_stop(struct kvm_vcpu *vcpu, u16 cpu_addr, int action) @@ -153,12 +153,12 @@ static int __sigp_stop(struct kvm_vcpu *vcpu, u16 cpu_addr, int action) int rc; if (cpu_addr = KVM_MAX_VCPUS) - return 3; /* not operational */ + return SIGP_CC_NOT_OPERATIONAL; spin_lock(fi-lock); li = fi-local_int[cpu_addr]; if (li == NULL) { - rc = 3; /* not operational */ + rc = SIGP_CC_NOT_OPERATIONAL; goto unlock; } @@ -182,11 +182,11 @@ static int __sigp_set_arch(struct kvm_vcpu *vcpu, u32 parameter) switch (parameter 0xff) { case 0: - rc = 3; /* not operational */ + rc = SIGP_CC_NOT_OPERATIONAL; break; case 1: case 2: - rc = 0; /* order accepted */ + rc = SIGP_CC_ORDER_CODE_ACCEPTED; break; default: rc = -EOPNOTSUPP; @@ -209,12 +209,12 @@ static int __sigp_set_prefix(struct kvm_vcpu *vcpu, u16 cpu_addr, u32 address, copy_from_guest_absolute(vcpu, tmp, address + PAGE_SIZE, 1)) { *reg = 0xUL; *reg |= SIGP_STATUS_INVALID_PARAMETER; -
[PATCH 4/6] KVM: s390: fix sigp set prefix status stored cases
From: Heiko Carstens <heiko.carst...@de.ibm.com>

If an invalid parameter is passed or the addressed cpu is in an incorrect state, sigp set prefix will store a status. This status must only have bits set as defined by the architecture. The current kvm implementation failed to clear bits and also did not set the intended status bit (it used an "and" instead of an "or" operation).

Signed-off-by: Heiko Carstens <heiko.carst...@de.ibm.com>
Signed-off-by: Cornelia Huck <cornelia.h...@de.ibm.com>
---
 arch/s390/kvm/sigp.c |    7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/arch/s390/kvm/sigp.c b/arch/s390/kvm/sigp.c
index caccc0e..ca544d5 100644
--- a/arch/s390/kvm/sigp.c
+++ b/arch/s390/kvm/sigp.c
@@ -207,6 +207,7 @@ static int __sigp_set_prefix(struct kvm_vcpu *vcpu, u16 cpu_addr, u32 address,
         address = address & 0x7fffe000u;
         if (copy_from_guest_absolute(vcpu, &tmp, address, 1) ||
            copy_from_guest_absolute(vcpu, &tmp, address + PAGE_SIZE, 1)) {
+                *reg &= 0xffffffff00000000UL;
                 *reg |= SIGP_STATUS_INVALID_PARAMETER;
                 return 1; /* invalid parameter */
         }
@@ -220,8 +221,9 @@ static int __sigp_set_prefix(struct kvm_vcpu *vcpu, u16 cpu_addr, u32 address,
         li = fi->local_int[cpu_addr];
         if (li == NULL) {
+                *reg &= 0xffffffff00000000UL;
+                *reg |= SIGP_STATUS_INCORRECT_STATE;
                 rc = 1; /* incorrect state */
-                *reg &= SIGP_STATUS_INCORRECT_STATE;
                 kfree(inti);
                 goto out_fi;
         }
@@ -229,8 +231,9 @@ static int __sigp_set_prefix(struct kvm_vcpu *vcpu, u16 cpu_addr, u32 address,
         spin_lock_bh(&li->lock);
         /* cpu must be in stopped state */
         if (!(atomic_read(li->cpuflags) & CPUSTAT_STOPPED)) {
+                *reg &= 0xffffffff00000000UL;
+                *reg |= SIGP_STATUS_INCORRECT_STATE;
                 rc = 1; /* incorrect state */
-                *reg &= SIGP_STATUS_INCORRECT_STATE;
                 kfree(inti);
                 goto out_li;
         }
--
1.7.10.4
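To make the "and instead of or" problem concrete, here is a small user-space sketch. It assumes the status lives in the low 32 bits of a 64-bit register and that the upper half must be preserved, which is how the patch description reads; the helper names are hypothetical and the constant is the full-width form of the INCORRECT_STATE define from this series.

```c
#include <assert.h>
#include <stdint.h>

#define SIGP_STATUS_INCORRECT_STATE 0x00000200ULL
#define STATUS_CLEAR_MASK           0xffffffff00000000ULL

/* Old behaviour: AND-ing with the status bit wipes the upper half as well
 * and only leaves the bit set if it happened to be set before. */
static uint64_t store_status_buggy(uint64_t reg)
{
    return reg & SIGP_STATUS_INCORRECT_STATE;
}

/* Fixed behaviour: clear the low 32 bits, then OR in the status. */
static uint64_t store_status_fixed(uint64_t reg, uint64_t status)
{
    assert((status & STATUS_CLEAR_MASK) == 0); /* status fits the low half */
    reg &= STATUS_CLEAR_MASK;
    reg |= status;
    return reg;
}
```

For a register holding 0x12345678000000ff, the old form yields 0 (the status bit is never reported and the upper half is lost), while the fixed form yields 0x1234567800000200.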
[PATCH 0/6] kvm/s390: sigp related changes for 3.6
Avi, Marcelo,

here are some more s390 patches for the next release.

Patches 1 and 2 are included for dependency reasons; they will also be sent through Martin's s390 tree.

The other patches fix several problems in our sigp handling code and make it nicer to read.

Cornelia Huck (1):
  KVM: s390: Fix sigp sense handling.

Heiko Carstens (5):
  s390/smp: remove redundant check
  s390/smp/kvm: unifiy sigp definitions
  KVM: s390: fix sigp sense running condition code handling
  KVM: s390: fix sigp set prefix status stored cases
  KVM: s390: use sigp condition code defines

 arch/s390/include/asm/sigp.h |   32
 arch/s390/kernel/smp.c       |   76 ++-
 arch/s390/kvm/sigp.c         |  117 +-
 3 files changed, 106 insertions(+), 119 deletions(-)
 create mode 100644 arch/s390/include/asm/sigp.h

--
1.7.10.4
[PATCH 2/6] s390/smp/kvm: unifiy sigp definitions
From: Heiko Carstens heiko.carst...@de.ibm.com The smp and the kvm code have different defines for the sigp order codes. Let's just have a single place where these are defined. Also move the sigp condition code and sigp cpu status bits to the new sigp.h header file. Signed-off-by: Heiko Carstens heiko.carst...@de.ibm.com Signed-off-by: Martin Schwidefsky schwidef...@de.ibm.com Signed-off-by: Cornelia Huck cornelia.h...@de.ibm.com --- arch/s390/include/asm/sigp.h | 31 ++ arch/s390/kernel/smp.c | 72 ++ arch/s390/kvm/sigp.c | 46 ++- 3 files changed, 64 insertions(+), 85 deletions(-) create mode 100644 arch/s390/include/asm/sigp.h diff --git a/arch/s390/include/asm/sigp.h b/arch/s390/include/asm/sigp.h new file mode 100644 index 000..7306270 --- /dev/null +++ b/arch/s390/include/asm/sigp.h @@ -0,0 +1,31 @@ +#ifndef __S390_ASM_SIGP_H +#define __S390_ASM_SIGP_H + +/* SIGP order codes */ +#define SIGP_SENSE 1 +#define SIGP_EXTERNAL_CALL 2 +#define SIGP_EMERGENCY_SIGNAL3 +#define SIGP_STOP5 +#define SIGP_RESTART 6 +#define SIGP_STOP_AND_STORE_STATUS9 +#define SIGP_INITIAL_CPU_RESET 11 +#define SIGP_SET_PREFIX 13 +#define SIGP_STORE_STATUS_AT_ADDRESS 14 +#define SIGP_SET_ARCHITECTURE 18 +#define SIGP_SENSE_RUNNING 21 + +/* SIGP condition codes */ +#define SIGP_CC_ORDER_CODE_ACCEPTED 0 +#define SIGP_CC_STATUS_STORED 1 +#define SIGP_CC_BUSY 2 +#define SIGP_CC_NOT_OPERATIONAL3 + +/* SIGP cpu status bits */ + +#define SIGP_STATUS_CHECK_STOP 0x0010UL +#define SIGP_STATUS_STOPPED0x0040UL +#define SIGP_STATUS_INVALID_PARAMETER 0x0100UL +#define SIGP_STATUS_INCORRECT_STATE0x0200UL +#define SIGP_STATUS_NOT_RUNNING0x0400UL + +#endif /* __S390_ASM_SIGP_H */ diff --git a/arch/s390/kernel/smp.c b/arch/s390/kernel/smp.c index c78074c..6e4047e 100644 --- a/arch/s390/kernel/smp.c +++ b/arch/s390/kernel/smp.c @@ -44,34 +44,10 @@ #include asm/vdso.h #include asm/debug.h #include asm/os_info.h +#include asm/sigp.h #include entry.h enum { - sigp_sense = 1, - sigp_external_call = 2, - 
sigp_emergency_signal = 3, - sigp_start = 4, - sigp_stop = 5, - sigp_restart = 6, - sigp_stop_and_store_status = 9, - sigp_initial_cpu_reset = 11, - sigp_cpu_reset = 12, - sigp_set_prefix = 13, - sigp_store_status_at_address = 14, - sigp_store_extended_status_at_address = 15, - sigp_set_architecture = 18, - sigp_conditional_emergency_signal = 19, - sigp_sense_running = 21, -}; - -enum { - sigp_order_code_accepted = 0, - sigp_status_stored = 1, - sigp_busy = 2, - sigp_not_operational = 3, -}; - -enum { ec_schedule = 0, ec_call_function, ec_call_function_single, @@ -124,7 +100,7 @@ static inline int __pcpu_sigp_relax(u16 addr, u8 order, u32 parm, u32 *status) while (1) { cc = __pcpu_sigp(addr, order, parm, status); - if (cc != sigp_busy) + if (cc != SIGP_CC_BUSY) return cc; cpu_relax(); } @@ -136,7 +112,7 @@ static int pcpu_sigp_retry(struct pcpu *pcpu, u8 order, u32 parm) for (retry = 0; ; retry++) { cc = __pcpu_sigp(pcpu-address, order, parm, pcpu-status); - if (cc != sigp_busy) + if (cc != SIGP_CC_BUSY) break; if (retry = 3) udelay(10); @@ -146,8 +122,8 @@ static int pcpu_sigp_retry(struct pcpu *pcpu, u8 order, u32 parm) static inline int pcpu_stopped(struct pcpu *pcpu) { - if (__pcpu_sigp(pcpu-address, sigp_sense, - 0, pcpu-status) != sigp_status_stored) + if (__pcpu_sigp(pcpu-address, SIGP_SENSE, + 0, pcpu-status) != SIGP_CC_STATUS_STORED) return 0; /* Check for stopped and check stop state */ return !!(pcpu-status 0x50); @@ -155,8 +131,8 @@ static inline int pcpu_stopped(struct pcpu *pcpu) static inline int pcpu_running(struct pcpu *pcpu) { - if (__pcpu_sigp(pcpu-address, sigp_sense_running, - 0, pcpu-status) != sigp_status_stored) + if (__pcpu_sigp(pcpu-address, SIGP_SENSE_RUNNING, + 0, pcpu-status) != SIGP_CC_STATUS_STORED) return 1; /* Status stored condition code is equivalent to cpu not running. */ return 0; @@ -181,7 +157,7 @@ static void pcpu_ec_call(struct pcpu *pcpu, int ec_bit) set_bit(ec_bit, pcpu-ec_mask); order = pcpu_running(pcpu) ? 
- sigp_external_call : sigp_emergency_signal; + SIGP_EXTERNAL_CALL : SIGP_EMERGENCY_SIGNAL;
[PATCH 3/6] KVM: s390: fix sigp sense running condition code handling
From: Heiko Carstens <heiko.carst...@de.ibm.com>

Only if the sensed cpu is not running a status is stored, which is reflected by condition code 1. If the cpu is running, condition code 0 should be returned. Just the opposite of what the code is doing.

Acked-by: Cornelia Huck <cornelia.h...@de.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carst...@de.ibm.com>
Signed-off-by: Cornelia Huck <cornelia.h...@de.ibm.com>
---
 arch/s390/kvm/sigp.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/s390/kvm/sigp.c b/arch/s390/kvm/sigp.c
index fda1d64..caccc0e 100644
--- a/arch/s390/kvm/sigp.c
+++ b/arch/s390/kvm/sigp.c
@@ -268,12 +268,12 @@ static int __sigp_sense_running(struct kvm_vcpu *vcpu, u16 cpu_addr,
                 if (atomic_read(fi->local_int[cpu_addr]->cpuflags)
                     & CPUSTAT_RUNNING) {
                         /* running */
-                        rc = 1;
+                        rc = 0;
                 } else {
                         /* not running */
                         *reg &= 0xffffffff00000000UL;
                         *reg |= SIGP_STATUS_NOT_RUNNING;
-                        rc = 0;
+                        rc = 1;
                 }
         }
         spin_unlock(&fi->lock);
--
1.7.10.4
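The corrected convention can be restated as a small user-space sketch. The condition code and status names follow the defines introduced elsewhere in this series; the function names and the boolean parameter are hypothetical stand-ins for the vcpu lookup.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum {
    SIGP_CC_ORDER_CODE_ACCEPTED = 0,
    SIGP_CC_STATUS_STORED       = 1,
};

#define SIGP_STATUS_NOT_RUNNING 0x00000400ULL

/* Handler side: a status is stored (cc 1) only when the target is NOT running. */
static int sigp_sense_running(bool target_running, uint64_t *reg)
{
    assert(reg != NULL);
    if (target_running)
        return SIGP_CC_ORDER_CODE_ACCEPTED;  /* register left untouched */
    *reg &= 0xffffffff00000000ULL;           /* clear the status half */
    *reg |= SIGP_STATUS_NOT_RUNNING;
    return SIGP_CC_STATUS_STORED;
}

/* Caller side: anything other than "status stored" means the cpu is running. */
static bool cpu_is_running(int cc)
{
    return cc != SIGP_CC_STATUS_STORED;
}
```

The caller-side helper mirrors the simplification in patch 1/6: since cc 1 already implies the not-running bit, the status word does not need to be inspected.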
[PATCH 1/6] s390/smp: remove redundant check
From: Heiko Carstens <heiko.carst...@de.ibm.com>

Condition code "status stored" for sigp sense running always implies that only the "not running" status bit is set. Therefore there is no need to check if it is set.

Signed-off-by: Heiko Carstens <heiko.carst...@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidef...@de.ibm.com>
Signed-off-by: Cornelia Huck <cornelia.h...@de.ibm.com>
---
 arch/s390/kernel/smp.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/s390/kernel/smp.c b/arch/s390/kernel/smp.c
index 15cca26..c78074c 100644
--- a/arch/s390/kernel/smp.c
+++ b/arch/s390/kernel/smp.c
@@ -158,8 +158,8 @@ static inline int pcpu_running(struct pcpu *pcpu)
         if (__pcpu_sigp(pcpu->address, sigp_sense_running,
                         0, &pcpu->status) != sigp_status_stored)
                 return 1;
-        /* Check for running status */
-        return !(pcpu->status & 0x400);
+        /* Status stored condition code is equivalent to cpu not running. */
+        return 0;
 }

 /*
--
1.7.10.4
[PATCH 6/6] KVM: s390: Fix sigp sense handling.
If sigp sense doesn't have any status bits to report, it should set cc 0 and leave the register as-is.

Since we know about the external call pending bit, we should report it if it is set as well.

Acked-by: Heiko Carstens <heiko.carst...@de.ibm.com>
Reviewed-by: Christian Borntraeger <borntrae...@de.ibm.com>
Signed-off-by: Cornelia Huck <cornelia.h...@de.ibm.com>
---
 arch/s390/include/asm/sigp.h |    1 +
 arch/s390/kvm/sigp.c         |   14 +++++++++-----
 2 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/arch/s390/include/asm/sigp.h b/arch/s390/include/asm/sigp.h
index 7306270..5a87d16 100644
--- a/arch/s390/include/asm/sigp.h
+++ b/arch/s390/include/asm/sigp.h
@@ -24,6 +24,7 @@
 #define SIGP_STATUS_CHECK_STOP          0x00000010UL
 #define SIGP_STATUS_STOPPED             0x00000040UL
+#define SIGP_STATUS_EXT_CALL_PENDING    0x00000080UL
 #define SIGP_STATUS_INVALID_PARAMETER   0x00000100UL
 #define SIGP_STATUS_INCORRECT_STATE     0x00000200UL
 #define SIGP_STATUS_NOT_RUNNING         0x00000400UL
diff --git a/arch/s390/kvm/sigp.c b/arch/s390/kvm/sigp.c
index 97c9f36..6ed8175 100644
--- a/arch/s390/kvm/sigp.c
+++ b/arch/s390/kvm/sigp.c
@@ -32,12 +32,16 @@ static int __sigp_sense(struct kvm_vcpu *vcpu, u16 cpu_addr,
         if (fi->local_int[cpu_addr] == NULL)
                 rc = SIGP_CC_NOT_OPERATIONAL;
         else if (!(atomic_read(fi->local_int[cpu_addr]->cpuflags)
-                  & CPUSTAT_STOPPED)) {
-                *reg &= 0xffffffff00000000UL;
-                rc = SIGP_CC_STATUS_STORED;
-        } else {
+                   & (CPUSTAT_ECALL_PEND | CPUSTAT_STOPPED)))
+                rc = SIGP_CC_ORDER_CODE_ACCEPTED;
+        else {
                 *reg &= 0xffffffff00000000UL;
-                *reg |= SIGP_STATUS_STOPPED;
+                if (atomic_read(fi->local_int[cpu_addr]->cpuflags)
+                    & CPUSTAT_ECALL_PEND)
+                        *reg |= SIGP_STATUS_EXT_CALL_PENDING;
+                if (atomic_read(fi->local_int[cpu_addr]->cpuflags)
+                    & CPUSTAT_STOPPED)
+                        *reg |= SIGP_STATUS_STOPPED;
                 rc = SIGP_CC_STATUS_STORED;
         }
         spin_unlock(&fi->lock);
--
1.7.10.4
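The resulting decision table can be sketched in user space as follows. The SIGP_STATUS_* values are the ones from the header in this series; the DEMO_CPU_* flag bits are placeholders chosen for illustration and are not the kernel's CPUSTAT_* values, and sigp_sense() here is a hypothetical simplification without locking or cpu lookup.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SIGP_STATUS_STOPPED           0x00000040ULL
#define SIGP_STATUS_EXT_CALL_PENDING  0x00000080ULL

/* Placeholder cpu flag bits for this sketch only (not the CPUSTAT_* values). */
#define DEMO_CPU_STOPPED     0x1u
#define DEMO_CPU_ECALL_PEND  0x2u

/* Returns the condition code: 0 = order accepted, 1 = status stored. */
static int sigp_sense(unsigned cpuflags, uint64_t *reg)
{
    assert(reg != NULL);
    if (!(cpuflags & (DEMO_CPU_ECALL_PEND | DEMO_CPU_STOPPED)))
        return 0;                       /* nothing to report: leave reg as-is */
    *reg &= 0xffffffff00000000ULL;      /* clear the status half */
    if (cpuflags & DEMO_CPU_ECALL_PEND)
        *reg |= SIGP_STATUS_EXT_CALL_PENDING;
    if (cpuflags & DEMO_CPU_STOPPED)
        *reg |= SIGP_STATUS_STOPPED;
    return 1;
}
```

The behavioural change compared to the old code is the first branch: a cpu with nothing to report now gets cc 0 and an untouched register instead of cc 1 with a cleared status.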
Re: [PATCH 3/6] KVM: s390: fix sigp sense running condition code handling
On 26.06.2012, at 16:06, Cornelia Huck wrote:

> From: Heiko Carstens <heiko.carst...@de.ibm.com>
>
> Only if the sensed cpu is not running a status is stored, which is reflected by condition code 1. If the cpu is running, condition code 0 should be returned. Just the opposite of what the code is doing.
>
> Acked-by: Cornelia Huck <cornelia.h...@de.ibm.com>
> Signed-off-by: Heiko Carstens <heiko.carst...@de.ibm.com>
> Signed-off-by: Cornelia Huck <cornelia.h...@de.ibm.com>

Yikes. Is this a stable candidate?

Alex

> ---
>  arch/s390/kvm/sigp.c |    4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/arch/s390/kvm/sigp.c b/arch/s390/kvm/sigp.c
> index fda1d64..caccc0e 100644
> --- a/arch/s390/kvm/sigp.c
> +++ b/arch/s390/kvm/sigp.c
> @@ -268,12 +268,12 @@ static int __sigp_sense_running(struct kvm_vcpu *vcpu, u16 cpu_addr,
>                  if (atomic_read(fi->local_int[cpu_addr]->cpuflags)
>                      & CPUSTAT_RUNNING) {
>                          /* running */
> -                        rc = 1;
> +                        rc = 0;
>                  } else {
>                          /* not running */
>                          *reg &= 0xffffffff00000000UL;
>                          *reg |= SIGP_STATUS_NOT_RUNNING;
> -                        rc = 0;
> +                        rc = 1;
>                  }
>          }
>          spin_unlock(&fi->lock);
> --
> 1.7.10.4
Re: [PATCH 4/6] KVM: s390: fix sigp set prefix status stored cases
On 26.06.2012, at 16:06, Cornelia Huck wrote:

> From: Heiko Carstens <heiko.carst...@de.ibm.com>
>
> If an invalid parameter is passed or the addressed cpu is in an incorrect state sigp set prefix will store a status. This status must only have bits set as defined by the architecture. The current kvm implementation missed to clear bits and also did not set the intended status bit (and instead of or operation).
>
> Signed-off-by: Heiko Carstens <heiko.carst...@de.ibm.com>
> Signed-off-by: Cornelia Huck <cornelia.h...@de.ibm.com>

What was the net effect of this for a guest? Any problems arising from it?

Alex

> ---
>  arch/s390/kvm/sigp.c |    7 +++++--
>  1 file changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/arch/s390/kvm/sigp.c b/arch/s390/kvm/sigp.c
> index caccc0e..ca544d5 100644
> --- a/arch/s390/kvm/sigp.c
> +++ b/arch/s390/kvm/sigp.c
> @@ -207,6 +207,7 @@ static int __sigp_set_prefix(struct kvm_vcpu *vcpu, u16 cpu_addr, u32 address,
>          address = address & 0x7fffe000u;
>          if (copy_from_guest_absolute(vcpu, &tmp, address, 1) ||
>             copy_from_guest_absolute(vcpu, &tmp, address + PAGE_SIZE, 1)) {
> +                *reg &= 0xffffffff00000000UL;
>                  *reg |= SIGP_STATUS_INVALID_PARAMETER;
>                  return 1; /* invalid parameter */
>          }
> @@ -220,8 +221,9 @@ static int __sigp_set_prefix(struct kvm_vcpu *vcpu, u16 cpu_addr, u32 address,
>          li = fi->local_int[cpu_addr];
>          if (li == NULL) {
> +                *reg &= 0xffffffff00000000UL;
> +                *reg |= SIGP_STATUS_INCORRECT_STATE;
>                  rc = 1; /* incorrect state */
> -                *reg &= SIGP_STATUS_INCORRECT_STATE;
>                  kfree(inti);
>                  goto out_fi;
>          }
> @@ -229,8 +231,9 @@ static int __sigp_set_prefix(struct kvm_vcpu *vcpu, u16 cpu_addr, u32 address,
>          spin_lock_bh(&li->lock);
>          /* cpu must be in stopped state */
>          if (!(atomic_read(li->cpuflags) & CPUSTAT_STOPPED)) {
> +                *reg &= 0xffffffff00000000UL;
> +                *reg |= SIGP_STATUS_INCORRECT_STATE;
>                  rc = 1; /* incorrect state */
> -                *reg &= SIGP_STATUS_INCORRECT_STATE;
>                  kfree(inti);
>                  goto out_li;
>          }
> --
> 1.7.10.4
Re: qemu-kvm-1.0 crashes with threaded vnc server?
On 13.03.2012 16:06, Alexander Graf wrote:
> On 13.03.2012, at 16:05, Corentin Chary wrote:
>
>> On Tue, Mar 13, 2012 at 12:29 PM, Peter Lieven <p...@dlh.net> wrote:
>>> On 11.02.2012 09:55, Corentin Chary wrote:
>>>> On Thu, Feb 9, 2012 at 7:08 PM, Peter Lieven <p...@dlh.net> wrote:
>>>>> Hi,
>>>>>
>>>>> is anyone aware if there are still problems when enabling the threaded vnc server? I saw some VMs crashing when using a qemu-kvm build with --enable-vnc-thread.
>>>>>
>>>>> qemu-kvm-1.0[22646]: segfault at 0 ip 7fec1ca7ea0b sp 7fec19d056d0 error 6 in libz.so.1.2.3.3[7fec1ca75000+16000]
>>>>> qemu-kvm-1.0[26056]: segfault at 7f06d8d6e010 ip 7f06e0a30d71 sp 7f06df035748 error 6 in libc-2.11.1.so[7f06e09aa000+17a000]
>>>>>
>>>>> I had no time to debug further. It seems to happen shortly after migrating, but that's uncertain. At least the segfault in libz seems to hint at VNC, since I cannot imagine any other part of qemu-kvm using libz except for the VNC server.
>>>>>
>>>>> Thanks,
>>>>> Peter
>>>>
>>>> Hi Peter,
>>>> I found two patches in my git tree that I sent long ago but that somehow got lost on the mailing list. I rebased the tree but did not have the time (yet) to test them.
>>>> http://git.iksaif.net/?p=qemu.git;a=shortlog;h=refs/heads/wip
>>>>
>>>> Feel free to try them. If QEMU segfaults again, please send a full gdb backtrace / valgrind trace / way to reproduce :).
>>>> Thanks,
>>>
>>> I have seen no more crashes with these two patches applied. I would suggest it would be good to push them to the master repository.
>>>
>>> Thank you,
>>> Peter
>>
>> Ccing Alexander,
>
> Ah, cool. Corentin, I think you're right now the closest thing we have to a maintainer for VNC. Could you please just send out a pull request for those?

hi all,

i suspect there is still a problem with the threaded vnc server. it's just a guess, but we saw a reasonable number of vms hanging in the last weeks. hanging meaning the emulation is stopped and the qemu-kvm process no longer reacts: not on monitor, not on vnc, not on qmp. the reason i suspect the threaded vnc server is that in all cases we have analyzed this happened with an open vnc session and only on nodes with the threaded vnc server enabled. it might also be the case that this happens at a resolution change. is there anything known, or does someone have an idea?

we are running qemu-kvm 1.0.1 with the following patches compiled in:

  vnc: don't mess up with iohandlers in the vnc thread
  vnc: Limit r/w access to size of allocated memory

unfortunately, i was not yet able to reproduce this with a debugger attached.

thanks,
peter

> Alex
[Bug 42782] IO_PAGE_FAULT while starting xorg
https://bugzilla.kernel.org/show_bug.cgi?id=42782

--- Comment #11 from Reartes Guillermo <rtgui...@gmail.com> 2012-06-26 15:11:16 ---
I booted with iommu=pt kernel parameter and i stopped getting these messages. I am experimenting with pcie kvm pass-through, so i ended using that parameter. Sadly the whole host freezes when i power on the guest, but that is outside of this bug-report.

In my case, it only happens at POWER-ON, not when doing a RESET. So to reproduce it one must POWER-CYCLE.

I don't think that was actually not 100% accurate.

# dmesg| grep -i amd-vi
[1.774021] AMD-Vi: device: 00:00.2 cap: 0040 seg: 0 flags: 3e info 1300
[1.774024] AMD-Vi:mmio-addr: feb2
[1.774259] AMD-Vi: DEV_SELECT_RANGE_START devid: 00:00.0 flags: 00
[1.774262] AMD-Vi: DEV_RANGE_END devid: 00:00.2
[1.774264] AMD-Vi: DEV_SELECT devid: 00:02.0 flags: 00
[1.774266] AMD-Vi: DEV_SELECT_RANGE_START devid: 01:00.0 flags: 00
[1.774268] AMD-Vi: DEV_RANGE_END devid: 01:00.1
[1.774270] AMD-Vi: DEV_SELECT devid: 00:04.0 flags: 00
[1.774272] AMD-Vi: DEV_SELECT devid: 02:00.0 flags: 00
[1.774274] AMD-Vi: DEV_SELECT devid: 00:05.0 flags: 00
[1.774275] AMD-Vi: DEV_SELECT devid: 03:00.0 flags: 00
[1.774277] AMD-Vi: DEV_SELECT devid: 00:06.0 flags: 00
[1.774279] AMD-Vi: DEV_SELECT devid: 04:00.0 flags: 00
[1.774281] AMD-Vi: DEV_SELECT devid: 00:09.0 flags: 00
[1.774283] AMD-Vi: DEV_SELECT devid: 05:00.0 flags: 00
[1.774284] AMD-Vi: DEV_SELECT devid: 00:11.0 flags: 00
[1.774286] AMD-Vi: DEV_SELECT_RANGE_START devid: 00:12.0 flags: 00
[1.774288] AMD-Vi: DEV_RANGE_END devid: 00:12.2
[1.774290] AMD-Vi: DEV_SELECT_RANGE_START devid: 00:13.0 flags: 00
[1.774292] AMD-Vi: DEV_RANGE_END devid: 00:13.2
[1.774294] AMD-Vi: DEV_SELECT devid: 00:14.0 flags: d7
[1.774296] AMD-Vi: DEV_SELECT devid: 00:14.3 flags: 00
[1.774297] AMD-Vi: DEV_SELECT devid: 00:14.4 flags: 00
[1.774300] AMD-Vi: DEV_ALIAS_RANGE devid: 06:00.0 flags: 00 devid_to: 00:14.4
[1.774302] AMD-Vi: DEV_RANGE_END devid: 06:1f.7
[1.774311] AMD-Vi: DEV_SELECT devid: 00:14.5 flags: 00
[1.774313] AMD-Vi: DEV_SELECT devid: 00:15.0 flags: 00
[1.774315] AMD-Vi: DEV_SELECT devid: 07:00.0 flags: 00
[1.774317] AMD-Vi: DEV_SELECT devid: 00:15.1 flags: 00
[1.774319] AMD-Vi: DEV_SELECT devid: 08:00.0 flags: 00
[1.774320] AMD-Vi: DEV_SELECT_RANGE_START devid: 00:16.0 flags: 00
[1.774322] AMD-Vi: DEV_RANGE_END devid: 00:16.2
[1.774428] AMD-Vi: Enabling IOMMU at :00:00.2 cap 0x40
[1.827755] AMD-Vi: Initialized for Passthrough Mode

--
Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email
--- You are receiving this mail because: ---
You are watching the assignee of the bug.
Re: [PATCH 3/6] KVM: s390: fix sigp sense running condition code handling
On Tue, 26 Jun 2012 16:52:56 +0200 Alexander Graf ag...@suse.de wrote:

> On 26.06.2012, at 16:06, Cornelia Huck wrote:
>
> > From: Heiko Carstens heiko.carst...@de.ibm.com
> >
> > Only if the sensed cpu is not running a status is stored, which is reflected by condition code 1. If the cpu is running, condition code 0 should be returned. Just the opposite of what the code is doing.
> >
> > Acked-by: Cornelia Huck cornelia.h...@de.ibm.com
> > Signed-off-by: Heiko Carstens heiko.carst...@de.ibm.com
> > Signed-off-by: Cornelia Huck cornelia.h...@de.ibm.com
>
> Yikes. Is this a stable candidate?

This code will only hit when running on a host running virtualized itself (where sigp sense running will cause an intercept), so I doubt many people will see the effects.

Cornelia
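For readers following the thread, the condition-code rule from the patch description can be sketched in a few lines of C. This is only an illustration under stated assumptions: the function name, the macro names and the status value are hypothetical and are not taken from the kernel source.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names and values, chosen only to illustrate the rule. */
#define CC_ORDER_ACCEPTED   0
#define CC_STATUS_STORED    1
#define STATUS_NOT_RUNNING  0x00000400u

/*
 * Sense-running as described in the patch text: a status is stored, and
 * condition code 1 returned, only when the sensed cpu is NOT running.
 * A running cpu yields condition code 0 and leaves *status untouched.
 */
static int sense_running_cc(bool target_running, uint32_t *status)
{
    if (target_running) {
        return CC_ORDER_ACCEPTED;
    }
    *status = STATUS_NOT_RUNNING;
    return CC_STATUS_STORED;
}
```

The bug being fixed amounts to having the two branches swapped.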
Re: [PATCH 3/6] KVM: s390: fix sigp sense running condition code handling
On 26.06.2012, at 17:33, Cornelia Huck wrote:

> On Tue, 26 Jun 2012 16:52:56 +0200 Alexander Graf ag...@suse.de wrote:
>
> > On 26.06.2012, at 16:06, Cornelia Huck wrote:
> >
> > > From: Heiko Carstens heiko.carst...@de.ibm.com
> > >
> > > Only if the sensed cpu is not running a status is stored, which is reflected by condition code 1. If the cpu is running, condition code 0 should be returned. Just the opposite of what the code is doing.
> > >
> > > Acked-by: Cornelia Huck cornelia.h...@de.ibm.com
> > > Signed-off-by: Heiko Carstens heiko.carst...@de.ibm.com
> > > Signed-off-by: Cornelia Huck cornelia.h...@de.ibm.com
> >
> > Yikes. Is this a stable candidate?
>
> This code will only hit when running on a host running virtualized itself (where sigp sense running will cause an intercept), so I doubt many people will see the effects.

You mean this will hit when running kvm inside of a z/VM VM? That's a pretty valid use case.

Alex
Re: [PATCH 4/6] KVM: s390: fix sigp set prefix status stored cases
On Tue, 26 Jun 2012 16:56:08 +0200 Alexander Graf ag...@suse.de wrote:

> On 26.06.2012, at 16:06, Cornelia Huck wrote:
>
> > From: Heiko Carstens heiko.carst...@de.ibm.com
> >
> > If an invalid parameter is passed or the addressed cpu is in an incorrect state sigp set prefix will store a status. This status must only have bits set as defined by the architecture. The current kvm implementation missed to clear bits and also did not set the intended status bit (and instead of or operation).
> >
> > Signed-off-by: Heiko Carstens heiko.carst...@de.ibm.com
> > Signed-off-by: Cornelia Huck cornelia.h...@de.ibm.com
>
> What was the net effect of this for a guest? Any problems rising from it?

The guest might see some unexpected status bits if it did call sigp set prefix in an incorrect way. I'm not aware that anybody has actually seen that.

Cornelia
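The "and instead of or" remark in the patch description is easier to see with a small example. The sketch below is an assumption-laden illustration, not the kernel code: the function names and the status value are hypothetical, and it assumes the status lives in the low 32 bits of a 64-bit register whose upper half must be preserved.

```c
#include <stdint.h>

/* Hypothetical status bit; the real value is defined by the architecture. */
#define STATUS_INVALID_PARAMETER 0x00000100u

/* Intended behaviour: drop stale status bits, then set the wanted bit. */
static uint64_t store_status(uint64_t reg)
{
    reg &= 0xffffffff00000000ULL;      /* clear the old 32-bit status */
    reg |= STATUS_INVALID_PARAMETER;   /* set exactly the intended bit */
    return reg;
}

/* Contrast: using AND where OR was meant never sets the intended bit. */
static uint64_t store_status_with_and(uint64_t reg)
{
    return reg & STATUS_INVALID_PARAMETER;
}
```

With the AND variant, a register that does not already carry the bit ends up with no status at all, which matches the "did not set the intended status bit" wording above.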
Re: [PATCH 4/6] KVM: s390: fix sigp set prefix status stored cases
On 26.06.2012 at 17:39, Cornelia Huck cornelia.h...@de.ibm.com wrote:

> On Tue, 26 Jun 2012 16:56:08 +0200 Alexander Graf ag...@suse.de wrote:
>
> > On 26.06.2012, at 16:06, Cornelia Huck wrote:
> >
> > > From: Heiko Carstens heiko.carst...@de.ibm.com
> > >
> > > If an invalid parameter is passed or the addressed cpu is in an incorrect state sigp set prefix will store a status. This status must only have bits set as defined by the architecture. The current kvm implementation missed to clear bits and also did not set the intended status bit (and instead of or operation).
> > >
> > > Signed-off-by: Heiko Carstens heiko.carst...@de.ibm.com
> > > Signed-off-by: Cornelia Huck cornelia.h...@de.ibm.com
> >
> > What was the net effect of this for a guest? Any problems rising from it?
>
> The guest might see some unexpected status bits if it did call sigp set prefix in an incorrect way. I'm not aware that anybody has actually seen that.

Yeah, we only need the set prefix on vm init and here it should always succeed, so this one is not all that urgent :).

Alex

> Cornelia
Re: [PATCH 3/6] KVM: s390: fix sigp sense running condition code handling
On Tue, 26 Jun 2012 17:36:19 +0200 Alexander Graf ag...@suse.de wrote:

> On 26.06.2012, at 17:33, Cornelia Huck wrote:
>
> > On Tue, 26 Jun 2012 16:52:56 +0200 Alexander Graf ag...@suse.de wrote:
> >
> > > On 26.06.2012, at 16:06, Cornelia Huck wrote:
> > >
> > > > From: Heiko Carstens heiko.carst...@de.ibm.com
> > > >
> > > > Only if the sensed cpu is not running a status is stored, which is reflected by condition code 1. If the cpu is running, condition code 0 should be returned. Just the opposite of what the code is doing.
> > > >
> > > > Acked-by: Cornelia Huck cornelia.h...@de.ibm.com
> > > > Signed-off-by: Heiko Carstens heiko.carst...@de.ibm.com
> > > > Signed-off-by: Cornelia Huck cornelia.h...@de.ibm.com
> > >
> > > Yikes. Is this a stable candidate?
> >
> > This code will only hit when running on a host running virtualized itself (where sigp sense running will cause an intercept), so I doubt many people will see the effects.
>
> You mean this will hit when running kvm inside of a z/VM VM? That's a pretty valid use case.

I'd have thought it was a very uncommon one. But I certainly don't object against putting the fix into stable.

Cornelia
Re: [PATCH 3/6] KVM: s390: fix sigp sense running condition code handling
On 26.06.2012, at 17:57, Cornelia Huck wrote:

> On Tue, 26 Jun 2012 17:36:19 +0200 Alexander Graf ag...@suse.de wrote:
>
> > On 26.06.2012, at 17:33, Cornelia Huck wrote:
> >
> > > On Tue, 26 Jun 2012 16:52:56 +0200 Alexander Graf ag...@suse.de wrote:
> > >
> > > > On 26.06.2012, at 16:06, Cornelia Huck wrote:
> > > >
> > > > > From: Heiko Carstens heiko.carst...@de.ibm.com
> > > > >
> > > > > Only if the sensed cpu is not running a status is stored, which is reflected by condition code 1. If the cpu is running, condition code 0 should be returned. Just the opposite of what the code is doing.
> > > > >
> > > > > Acked-by: Cornelia Huck cornelia.h...@de.ibm.com
> > > > > Signed-off-by: Heiko Carstens heiko.carst...@de.ibm.com
> > > > > Signed-off-by: Cornelia Huck cornelia.h...@de.ibm.com
> > > >
> > > > Yikes. Is this a stable candidate?
> > >
> > > This code will only hit when running on a host running virtualized itself (where sigp sense running will cause an intercept), so I doubt many people will see the effects.
> >
> > You mean this will hit when running kvm inside of a z/VM VM? That's a pretty valid use case.
>
> I'd have thought it was a very uncommon one. But I certainly don't object against putting the fix into stable.

It's not exactly useful for productive things, but just for prototyping KVM, people tend to not have spare LPARs lying around :) So yes, this should go into stable.

Alex
[PATCH v2 1/3] KVM: Add new -cpu best
During discussions on whether to make -cpu host the default in SLE, I found myself disagreeing with the idea, because it potentially opens a big can of worms for potential bugs. But if I already am so opposed to it for SLE, how can it possibly be reasonable to default to -cpu host in upstream QEMU? And what would a sane default look like?

So I had this idea of looping through all available CPU definitions. We can pretty well tell if our host is able to execute any of them by checking the respective flags and seeing if our host has all features the CPU definition requires. With that, we can create a -cpu type that would fall back to the best known CPU definition that our host can fulfill. On my Phenom II system for example, that would be -cpu phenom.

With this approach we can test and verify that CPU types actually work at any random user setup, because we can always verify that all the -cpu types we ship actually work. And we only default to some clever mechanism that chooses from one of these.

Signed-off-by: Alexander Graf ag...@suse.de
---
 target-i386/cpu.c |   81 +
 1 files changed, 81 insertions(+), 0 deletions(-)

diff --git a/target-i386/cpu.c b/target-i386/cpu.c
index fdd95be..98cc1ec 100644
--- a/target-i386/cpu.c
+++ b/target-i386/cpu.c
@@ -558,6 +558,85 @@ static int cpu_x86_fill_host(x86_def_t *x86_cpu_def)
     return 0;
 }
 
+/* Are all guest feature bits present on the host? */
+static bool cpu_x86_feature_subset(uint32_t host, uint32_t guest)
+{
+    int i;
+
+    for (i = 0; i < 32; i++) {
+        uint32_t mask = 1 << i;
+        if ((guest & mask) && !(host & mask)) {
+            return false;
+        }
+    }
+
+    return true;
+}
+
+/* Does the host support all the features of the CPU definition? */
+static bool cpu_x86_fits_host(x86_def_t *x86_cpu_def)
+{
+    uint32_t eax = 0, ebx = 0, ecx = 0, edx = 0;
+
+    host_cpuid(0x0, 0, &eax, &ebx, &ecx, &edx);
+    if (x86_cpu_def->level > eax) {
+        return false;
+    }
+    if ((x86_cpu_def->vendor1 != ebx) ||
+        (x86_cpu_def->vendor2 != edx) ||
+        (x86_cpu_def->vendor3 != ecx)) {
+        return false;
+    }
+
+    host_cpuid(0x1, 0, &eax, &ebx, &ecx, &edx);
+    if (!cpu_x86_feature_subset(ecx, x86_cpu_def->ext_features) ||
+        !cpu_x86_feature_subset(edx, x86_cpu_def->features)) {
+        return false;
+    }
+
+    host_cpuid(0x80000000, 0, &eax, &ebx, &ecx, &edx);
+    if (x86_cpu_def->xlevel > eax) {
+        return false;
+    }
+
+    host_cpuid(0x80000001, 0, &eax, &ebx, &ecx, &edx);
+    if (!cpu_x86_feature_subset(edx, x86_cpu_def->ext2_features) ||
+        !cpu_x86_feature_subset(ecx, x86_cpu_def->ext3_features)) {
+        return false;
+    }
+
+    return true;
+}
+
+/* Returns true when new_def is higher versioned than old_def */
+static int cpu_x86_fits_higher(x86_def_t *new_def, x86_def_t *old_def)
+{
+    int old_fammod = (old_def->family << 24) | (old_def->model << 8)
+                   | (old_def->stepping);
+    int new_fammod = (new_def->family << 24) | (new_def->model << 8)
+                   | (new_def->stepping);
+
+    return new_fammod > old_fammod;
+}
+
+static void cpu_x86_fill_best(x86_def_t *x86_cpu_def)
+{
+    x86_def_t *def;
+
+    x86_cpu_def->family = 0;
+    x86_cpu_def->model = 0;
+    for (def = x86_defs; def; def = def->next) {
+        if (cpu_x86_fits_host(def) && cpu_x86_fits_higher(def, x86_cpu_def)) {
+            memcpy(x86_cpu_def, def, sizeof(*def));
+        }
+    }
+
+    if (!x86_cpu_def->family && !x86_cpu_def->model) {
+        fprintf(stderr, "No fitting CPU model found!\n");
+        exit(1);
+    }
+}
+
 static int unavailable_host_feature(struct model_features_t *f, uint32_t mask)
 {
     int i;
@@ -878,6 +957,8 @@ static int cpu_x86_find_by_name(x86_def_t *x86_cpu_def, const char *cpu_model)
             break;
     if (kvm_enabled() && name && strcmp(name, "host") == 0) {
         cpu_x86_fill_host(x86_cpu_def);
+    } else if (kvm_enabled() && name && strcmp(name, "best") == 0) {
+        cpu_x86_fill_best(x86_cpu_def);
     } else if (!def) {
         goto error;
     } else {
-- 
1.6.0.2
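The selection idea in this patch (keep every definition whose required feature bits are a subset of the host's, then take the highest-ranked one) can also be tried outside QEMU. The following is a minimal standalone sketch, not the patch's code: struct cpu_model, pick_best() and rank() are hypothetical simplifications of x86_def_t and the helpers above, and the bit-by-bit loop is replaced by an equivalent single mask test.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for QEMU's x86_def_t. */
struct cpu_model {
    const char *name;
    uint32_t family, model, stepping;
    uint32_t features;              /* feature bits the model requires */
};

/* True when every bit set in guest is also set in host. */
static bool feature_subset(uint32_t host, uint32_t guest)
{
    return (guest & ~host) == 0;
}

/* Order models by family, then model, then stepping, as the patch does. */
static uint32_t rank(const struct cpu_model *m)
{
    return (m->family << 24) | (m->model << 8) | m->stepping;
}

/* Return the highest-ranked model the host can run, or NULL if none fits. */
static const struct cpu_model *pick_best(const struct cpu_model *models,
                                         size_t n, uint32_t host_features)
{
    const struct cpu_model *best = NULL;

    for (size_t i = 0; i < n; i++) {
        if (!feature_subset(host_features, models[i].features)) {
            continue;               /* host lacks a required feature */
        }
        if (best == NULL || rank(&models[i]) > rank(best)) {
            best = &models[i];
        }
    }
    return best;
}
```

Returning NULL instead of calling exit(1) leaves the "no fitting model" policy to the caller, which is a design choice of this sketch rather than something the patch does.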
[PATCH v2 2/3] KVM: Use -cpu best as default on x86
When running QEMU without -cpu parameter, the user usually wants a sane default. So far, we're using the qemu64/qemu32 CPU type, which basically means "the maximum TCG can emulate". That's a really good default when using TCG, but when running with KVM we much rather want a default saying "the maximum performance I can get".

Fortunately we just added an option that gives us the best performance while still staying safe on the testability side of things: -cpu best. So all we need to do is make -cpu best the default when the user doesn't explicitly specify a CPU type.

This fixes a lot of subtle breakage in the GNU toolchain (libgmp) which hiccups on QEMU's non-existent CPU models.

This patch also adds a new pc-1.2 machine type to stay backwards compatible with older versions of QEMU.

Signed-off-by: Alexander Graf ag...@suse.de
---
v1 -> v2:
  - rebase
---
 hw/pc_piix.c |   45 -
 1 files changed, 36 insertions(+), 9 deletions(-)

diff --git a/hw/pc_piix.c b/hw/pc_piix.c
index eae258c..eafd383 100644
--- a/hw/pc_piix.c
+++ b/hw/pc_piix.c
@@ -127,7 +127,8 @@ static void pc_init1(MemoryRegion *system_memory,
                      const char *initrd_filename,
                      const char *cpu_model,
                      int pci_enabled,
-                     int kvmclock_enabled)
+                     int kvmclock_enabled,
+                     int may_cpu_best)
 {
     int i;
     ram_addr_t below_4g_mem_size, above_4g_mem_size;
@@ -149,6 +150,9 @@ static void pc_init1(MemoryRegion *system_memory,
     MemoryRegion *rom_memory;
     void *fw_cfg = NULL;
 
+    if (!cpu_model && kvm_enabled() && may_cpu_best) {
+        cpu_model = "best";
+    }
     pc_cpus_init(cpu_model);
 
     if (kvmclock_enabled) {
@@ -298,7 +302,21 @@ static void pc_init_pci(ram_addr_t ram_size,
              get_system_io(),
              ram_size, boot_device,
              kernel_filename, kernel_cmdline,
-             initrd_filename, cpu_model, 1, 1);
+             initrd_filename, cpu_model, 1, 1, 1);
+}
+
+static void pc_init_pci_oldcpu(ram_addr_t ram_size,
+                               const char *boot_device,
+                               const char *kernel_filename,
+                               const char *kernel_cmdline,
+                               const char *initrd_filename,
+                               const char *cpu_model)
+{
+    pc_init1(get_system_memory(),
+             get_system_io(),
+             ram_size, boot_device,
+             kernel_filename, kernel_cmdline,
+             initrd_filename, cpu_model, 1, 1, 0);
 }
 
 static void pc_init_pci_no_kvmclock(ram_addr_t ram_size,
@@ -312,7 +330,7 @@ static void pc_init_pci_no_kvmclock(ram_addr_t ram_size,
              get_system_io(),
              ram_size, boot_device,
              kernel_filename, kernel_cmdline,
-             initrd_filename, cpu_model, 1, 0);
+             initrd_filename, cpu_model, 1, 0, 0);
 }
 
 static void pc_init_isa(ram_addr_t ram_size,
@@ -328,7 +346,7 @@ static void pc_init_isa(ram_addr_t ram_size,
              get_system_io(),
              ram_size, boot_device,
              kernel_filename, kernel_cmdline,
-             initrd_filename, cpu_model, 0, 1);
+             initrd_filename, cpu_model, 0, 1, 0);
 }
 
 #ifdef CONFIG_XEN
@@ -349,8 +367,8 @@ static void pc_xen_hvm_init(ram_addr_t ram_size,
 }
 #endif
 
-static QEMUMachine pc_machine_v1_1 = {
-    .name = "pc-1.1",
+static QEMUMachine pc_machine_v1_2 = {
+    .name = "pc-1.2",
     .alias = "pc",
     .desc = "Standard PC",
     .init = pc_init_pci,
@@ -358,6 +376,14 @@ static QEMUMachine pc_machine_v1_1 = {
     .is_default = 1,
 };
 
+static QEMUMachine pc_machine_v1_1 = {
+    .name = "pc-1.1",
+    .desc = "Standard PC",
+    .init = pc_init_pci_oldcpu,
+    .max_cpus = 255,
+    .is_default = 1,
+};
+
 #define PC_COMPAT_1_0 \
     {\
         .driver = "pc-sysfw",\
@@ -384,7 +410,7 @@ static QEMUMachine pc_machine_v1_1 = {
 static QEMUMachine pc_machine_v1_0 = {
     .name = "pc-1.0",
     .desc = "Standard PC",
-    .init = pc_init_pci,
+    .init = pc_init_pci_oldcpu,
     .max_cpus = 255,
     .compat_props = (GlobalProperty[]) {
         PC_COMPAT_1_0,
@@ -399,7 +425,7 @@ static QEMUMachine pc_machine_v1_0 = {
 static QEMUMachine pc_machine_v0_15 = {
     .name = "pc-0.15",
     .desc = "Standard PC",
-    .init = pc_init_pci,
+    .init = pc_init_pci_oldcpu,
     .max_cpus = 255,
     .compat_props = (GlobalProperty[]) {
         PC_COMPAT_0_15,
@@ -431,7 +457,7 @@ static QEMUMachine pc_machine_v0_15 = {
 static QEMUMachine pc_machine_v0_14 = {
     .name = "pc-0.14",
     .desc = "Standard PC",
-    .init = pc_init_pci,
+    .init = pc_init_pci_oldcpu,
     .max_cpus = 255,
     .compat_props = (GlobalProperty[]) {
         PC_COMPAT_0_14,
@@ -612,6 +638,7 @@ static QEMUMachine xenfv_machine = {
 static void pc_machine_init(void)
 {
+    qemu_register_machine(&pc_machine_v1_2);
[PATCH v2 3/3] i386: KVM: List -cpu host and best in -cpu ?
The kvm_enabled() helper doesn't work in a function as early as -cpu ? yet. It also doesn't make sense to list the -cpu ? output conditional on the -enable-kvm parameter. So let's always mention -cpu host in the CPU list when KVM is supported on that configuration.

In addition, this patch also adds listing of -cpu best in the -cpu ? list, so that people know that this option exists.

Signed-off-by: Alexander Graf ag...@suse.de
---
 target-i386/cpu.c |    7 ---
 1 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/target-i386/cpu.c b/target-i386/cpu.c
index 98cc1ec..6c20798 100644
--- a/target-i386/cpu.c
+++ b/target-i386/cpu.c
@@ -1199,9 +1199,10 @@ void x86_cpu_list(FILE *f, fprintf_function cpu_fprintf, const char *optarg)
             (*cpu_fprintf)(f, "\n");
         }
     }
-    if (kvm_enabled()) {
-        (*cpu_fprintf)(f, "x86 %16s\n", "[host]");
-    }
+#ifdef CONFIG_KVM
+    (*cpu_fprintf)(f, "x86 %16s\n", "KVM only: [host]");
+    (*cpu_fprintf)(f, "x86 %16s\n", "KVM only: [best]");
+#endif
 }
 
 int cpu_x86_register(X86CPU *cpu, const char *cpu_model)
-- 
1.6.0.2
KVM call minutes (2012-06-26)
Hi

These are the minutes for today's call:

- q35 integration
  - Why not ICH10? ICH9 is already obsolete. What are the differences?
  - We need to check guests from Windows XP and some *BSD.
  - Having it default for 1.2? Anthony.
  - Having it as an option in 1.2 and making it default in 1.3. Alex?
  - Anthony doesn't like today's patches. He wants to have the refactoring first, and then merge q35 sharing things with current code.
  - Add requirements to the Q35 wiki page (Anthony's task).
  - People use ISA devices?
- irq chip integration? Continue discussion on mailing list
- having pc-next, and don't change each version. Continue discussion on mailing list

Thanks, Juan.
Re: [PATCH] kvm: First step to push iothread lock out of inner run loop
On Sat, Jun 23, 2012 at 12:55:49AM +0200, Jan Kiszka wrote:

> Should have declared this [RFC] in the subject and CC'ed kvm...
>
> On 2012-06-23 00:45, Jan Kiszka wrote:
>
> > This sketches a possible path to get rid of the iothread lock on vmexits in KVM mode. On x86, the in-kernel irqchips have to be used because we otherwise need to synchronize APIC and other per-cpu state accesses that could be changed concurrently. Not yet fully analyzed is the NMI injection path in the absence of an APIC.
> >
> > s390x should be fine without specific locking as their pre/post-run callbacks are empty. Power requires locking for the pre-run callback.
> >
> > This patch is untested, but a similar version was successfully used in a x86 setup with a network I/O path that needed no central iothread locking anymore (required special MMIO exit handling).
> > ---
> >  kvm-all.c         |   18 --
> >  target-i386/kvm.c |    7 +++
> >  target-ppc/kvm.c  |    4
> >  3 files changed, 27 insertions(+), 2 deletions(-)
> >
> > diff --git a/kvm-all.c b/kvm-all.c
> > index f8e4328..9c3e26f 100644
> > --- a/kvm-all.c
> > +++ b/kvm-all.c
> > @@ -1460,6 +1460,8 @@ int kvm_cpu_exec(CPUArchState *env)
> >          return EXCP_HLT;
> >      }
> >
> > +    qemu_mutex_unlock_iothread();
> > +
> >      do {
> >          if (env->kvm_vcpu_dirty) {
> >              kvm_arch_put_registers(env, KVM_PUT_RUNTIME_STATE);
> > @@ -1476,14 +1478,16 @@ int kvm_cpu_exec(CPUArchState *env)
> >               */
> >              qemu_cpu_kick_self();
> >          }
> > -        qemu_mutex_unlock_iothread();
> >
> >          run_ret = kvm_vcpu_ioctl(env, KVM_RUN, 0);
> >
> > -        qemu_mutex_lock_iothread();
> >          kvm_arch_post_run(env, run);
> >
> > +        /* TODO: push coalesced mmio flushing to the point where we access
> > +         * devices that are using it (currently VGA and E1000). */
> > +        qemu_mutex_lock_iothread();
> >          kvm_flush_coalesced_mmio_buffer();
> > +        qemu_mutex_unlock_iothread();
> >
> >          if (run_ret < 0) {
> >              if (run_ret == -EINTR || run_ret == -EAGAIN) {
> > @@ -1499,19 +1503,23 @@ int kvm_cpu_exec(CPUArchState *env)
> >          switch (run->exit_reason) {
> >          case KVM_EXIT_IO:
> >              DPRINTF("handle_io\n");
> > +            qemu_mutex_lock_iothread();
> >              kvm_handle_io(run->io.port,
> >                            (uint8_t *)run + run->io.data_offset,
> >                            run->io.direction,
> >                            run->io.size,
> >                            run->io.count);
> > +            qemu_mutex_unlock_iothread();
> >              ret = 0;
> >              break;
> >          case KVM_EXIT_MMIO:
> >              DPRINTF("handle_mmio\n");
> > +            qemu_mutex_lock_iothread();
> >              cpu_physical_memory_rw(run->mmio.phys_addr,
> >                                     run->mmio.data,
> >                                     run->mmio.len,
> >                                     run->mmio.is_write);
> > +            qemu_mutex_unlock_iothread();
> >              ret = 0;
> >              break;
> >          case KVM_EXIT_IRQ_WINDOW_OPEN:
> > @@ -1520,7 +1528,9 @@ int kvm_cpu_exec(CPUArchState *env)
> >              break;
> >          case KVM_EXIT_SHUTDOWN:
> >              DPRINTF("shutdown\n");
> > +            qemu_mutex_lock_iothread();
> >              qemu_system_reset_request();
> > +            qemu_mutex_unlock_iothread();
> >              ret = EXCP_INTERRUPT;
> >              break;
> >          case KVM_EXIT_UNKNOWN:
> > @@ -1533,11 +1543,15 @@ int kvm_cpu_exec(CPUArchState *env)
> >              break;
> >          default:
> >              DPRINTF("kvm_arch_handle_exit\n");
> > +            qemu_mutex_lock_iothread();
> >              ret = kvm_arch_handle_exit(env, run);
> > +            qemu_mutex_unlock_iothread();
> >              break;
> >          }
> >      } while (ret == 0);
> >
> > +    qemu_mutex_lock_iothread();
> > +
> >      if (ret < 0) {
> >          cpu_dump_state(env, stderr, fprintf, CPU_DUMP_CODE);
> >          vm_stop(RUN_STATE_INTERNAL_ERROR);
> > diff --git a/target-i386/kvm.c b/target-i386/kvm.c
> > index 0d0d8f6..0ad64d1 100644
> > --- a/target-i386/kvm.c
> > +++ b/target-i386/kvm.c
> > @@ -1631,7 +1631,10 @@ void kvm_arch_pre_run(CPUX86State *env, struct kvm_run *run)
> >
> >      /* Inject NMI */
> >      if (env->interrupt_request & CPU_INTERRUPT_NMI) {
> > +        qemu_mutex_lock_iothread();
> >          env->interrupt_request &= ~CPU_INTERRUPT_NMI;
> > +        qemu_mutex_unlock_iothread();
> > +
> >          DPRINTF("injected NMI\n");
> >          ret = kvm_vcpu_ioctl(env, KVM_NMI);
> >          if (ret < 0) {
> > @@ -1641,6 +1644,8 @@ void kvm_arch_pre_run(CPUX86State *env, struct kvm_run *run)
> >      }
> >
> >      if (!kvm_irqchip_in_kernel()) {
> > +        qemu_mutex_lock_iothread();
> > +
> >          /* Force the VCPU out of its inner loop to process any INIT requests
> >           * or pending TPR access reports. */
> >          if (env->interrupt_request &
> > @@ -1682,6 +1687,8 @@ void kvm_arch_pre_run(CPUX86State *env, struct kvm_run *run)
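The locking change the patch proposes is a general pattern: release the big lock once before the run loop, re-take it only around the handlers that touch shared state, and re-acquire it once on the way out. The sketch below illustrates that pattern with a plain pthread mutex; it is an assumption-based model, not QEMU code. big_lock, handle_io(), run_loop() and enum exit_reason are hypothetical names, and the array of exit reasons stands in for what KVM_RUN would return.

```c
#include <pthread.h>

/* Hypothetical stand-in for QEMU's global iothread lock. */
static pthread_mutex_t big_lock = PTHREAD_MUTEX_INITIALIZER;
static int device_state;            /* shared state guarded by big_lock */

enum exit_reason { EXIT_IO, EXIT_NONE, EXIT_STOP };

/* Touches shared device state, so it must run with big_lock held. */
static void handle_io(void)
{
    device_state++;
}

/*
 * The caller holds big_lock on entry and again on return, mirroring the
 * contract of kvm_cpu_exec() in the patch. Inside, the lock is dropped
 * once and re-taken only around the handler that needs it.
 */
static int run_loop(const enum exit_reason *exits, int n)
{
    int handled = 0;

    pthread_mutex_unlock(&big_lock);
    for (int i = 0; i < n && exits[i] != EXIT_STOP; i++) {
        /* The guest would execute here without the lock held. */
        if (exits[i] == EXIT_IO) {
            pthread_mutex_lock(&big_lock);
            handle_io();
            pthread_mutex_unlock(&big_lock);
            handled++;
        }
    }
    pthread_mutex_lock(&big_lock);
    return handled;
}
```

Exits that need no shared state (EXIT_NONE here) never touch the lock at all, which is where the contention win would come from.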
[PATCH] Add a page cache-backed balloon device driver.
This implementation of a virtio balloon driver uses the page cache to store pages that have been released to the host. The communication (outside of target counts) is one way--the guest notifies the host when it adds a page to the page cache, allowing the host to madvise(2) with MADV_DONTNEED. Reclaim in the guest is therefore automatic and implicit (via the regular page reclaim). This means that inflating the balloon is similar to the existing balloon mechanism, but the deflate is different--it re-uses existing Linux kernel functionality to automatically reclaim.

Signed-off-by: Frank Swiderski f...@google.com
---
 drivers/virtio/Kconfig              |   13 +
 drivers/virtio/Makefile             |    1 +
 drivers/virtio/virtio_fileballoon.c |  636 +++
 include/linux/virtio_balloon.h      |    9 +
 include/linux/virtio_ids.h          |    1 +
 5 files changed, 660 insertions(+), 0 deletions(-)
 create mode 100644 drivers/virtio/virtio_fileballoon.c

diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
index f38b17a..cffa2a7 100644
--- a/drivers/virtio/Kconfig
+++ b/drivers/virtio/Kconfig
@@ -35,6 +35,19 @@ config VIRTIO_BALLOON
 
 	 If unsure, say M.
 
+config VIRTIO_FILEBALLOON
+	tristate "Virtio page cache-backed balloon driver"
+	select VIRTIO
+	select VIRTIO_RING
+	---help---
+	 This driver supports decreasing and automatically reclaiming the
+	 memory within a guest VM. Unlike VIRTIO_BALLOON, this driver instead
+	 tries to maintain a specific target balloon size using the page cache.
+	 This allows the guest to implicitly deflate the balloon by flushing
+	 pages from the cache and touching the page.
+
+	 If unsure, say N.
+
 config VIRTIO_MMIO
 	tristate "Platform bus driver for memory mapped virtio devices (EXPERIMENTAL)"
 	depends on HAS_IOMEM && EXPERIMENTAL
diff --git a/drivers/virtio/Makefile b/drivers/virtio/Makefile
index 5a4c63c..7ca0a3f 100644
--- a/drivers/virtio/Makefile
+++ b/drivers/virtio/Makefile
@@ -3,3 +3,4 @@ obj-$(CONFIG_VIRTIO_RING) += virtio_ring.o
 obj-$(CONFIG_VIRTIO_MMIO) += virtio_mmio.o
 obj-$(CONFIG_VIRTIO_PCI) += virtio_pci.o
 obj-$(CONFIG_VIRTIO_BALLOON) += virtio_balloon.o
+obj-$(CONFIG_VIRTIO_FILEBALLOON) += virtio_fileballoon.o
diff --git a/drivers/virtio/virtio_fileballoon.c b/drivers/virtio/virtio_fileballoon.c
new file mode 100644
index 000..ff252ec
--- /dev/null
+++ b/drivers/virtio/virtio_fileballoon.c
@@ -0,0 +1,636 @@
+/* Virtio file (page cache-backed) balloon implementation, inspired by
+ * Dor Loar and Marcelo Tosatti's implementations, and based on Rusty Russel's
+ * implementation.
+ *
+ * This implementation of the virtio balloon driver re-uses the page cache to
+ * allow memory consumed by inflating the balloon to be reclaimed by linux. It
+ * creates and mounts a bare-bones filesystem containing a single inode. When
+ * the host requests the balloon to inflate, it does so by reading pages at
+ * offsets into the inode mapping's page_tree. The host is notified when the
+ * pages are added to the page_tree, allowing it (the host) to madvise(2) the
+ * corresponding host memory, reducing the RSS of the virtual machine. In this
+ * implementation, the host is only notified when a page is added to the
+ * balloon. Reclaim happens under the existing TTFP logic, which flushes unused
+ * pages in the page cache. If the host used MADV_DONTNEED, then when the guest
+ * uses the page, the zero page will be mapped in, allowing automatic (and fast,
+ * compared to requiring a host notification via a virtio queue to get memory
+ * back) reclaim.
+ *
+ * Copyright 2008 Rusty Russell IBM Corporation
+ * Copyright 2011 Frank Swiderski Google Inc
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+#include <linux/backing-dev.h>
+#include <linux/delay.h>
+#include <linux/file.h>
+#include <linux/freezer.h>
+#include <linux/fs.h>
+#include <linux/jiffies.h>
+#include <linux/kthread.h>
+#include <linux/module.h>
+#include <linux/mount.h>
+#include <linux/pagemap.h>
+#include <linux/slab.h>
+#include <linux/swap.h>
+#include <linux/virtio.h>
+#include <linux/virtio_balloon.h>
+#include <linux/writeback.h>
+
+#define VIRTBALLOON_PFN_ARRAY_SIZE 256
+
+struct virtio_balloon {
+	struct
Re: [PATCH] Add a page cache-backed balloon device driver.
On 06/26/2012 04:32 PM, Frank Swiderski wrote:

> This implementation of a virtio balloon driver uses the page cache to store pages that have been released to the host. The communication (outside of target counts) is one way--the guest notifies the host when it adds a page to the page cache, allowing the host to madvise(2) with MADV_DONTNEED. Reclaim in the guest is therefore automatic and implicit (via the regular page reclaim). This means that inflating the balloon is similar to the existing balloon mechanism, but the deflate is different--it re-uses existing Linux kernel functionality to automatically reclaim.
>
> Signed-off-by: Frank Swiderski f...@google.com

It is a great idea, but how can this memory balancing possibly work if someone uses memory cgroups inside a guest?

Having said that, we currently do not have proper memory reclaim balancing between cgroups at all, so requiring that of this balloon driver would be unreasonable.

The code looks good to me, my only worry is the code duplication. We now have 5 balloon drivers, for 4 hypervisors, all implementing everything from scratch...
Re: Request VFIO inclusion in linux-next
On Mon, 2012-06-25 at 22:55 -0600, Alex Williamson wrote:

> Hi,
>
> VFIO has been kicking around for well over a year now and has been posted numerous times for review. The pre-requirements are finally available in linux-next (or will be in the 20120626 build) so I'd like to request a new branch be included in linux-next with a goal of being accepted into v3.6.

Ack. Let's get that in, it's been simmering for too long and we'll need that to do PCI pass-through on KVM powerpc.

Cheers,
Ben.

> VFIO is a userspace driver interface designed to support assignment of devices into virtual machines using IOMMU level access control. This IOMMU requirement, secure resource access, and flexible interrupt support make VFIO unique from existing drivers, like UIO. VFIO supports modular backends for both IOMMU and device access. Initial backends are included for PCI device assignment using the IOMMU API in a manner compatible with x86 device assignment. POWER support is also under development, making use of the same PCI device backend, but adding new IOMMU support for their platforms.
>
> As with previous versions of VFIO, Qemu is targeted as a primary user and a working development tree including vfio-pci support can be found here:
>
> git://github.com/awilliam/qemu-vfio.git iommu-group-vfio
>
> Eventually we hope VFIO can deprecate the x86, PCI-specific device assignment currently used by KVM.
>
> The info for linux-next:
>
> Tree: git://github.com/awilliam/linux-vfio.git
> Branch: next
> Contact: Alex Williamson alex.william...@redhat.com
>
> This branch should be applied after both Bjorn's PCI next branch and Joerg's IOMMU next branch and contains the following changes:
>
>  Documentation/ioctl/ioctl-number.txt |    1
>  Documentation/vfio.txt               |  315 +++
>  MAINTAINERS                          |    8
>  drivers/Kconfig                      |    2
>  drivers/Makefile                     |    1
>  drivers/vfio/Kconfig                 |   16
>  drivers/vfio/Makefile                |    3
>  drivers/vfio/pci/Kconfig             |    8
>  drivers/vfio/pci/Makefile            |    4
>  drivers/vfio/pci/vfio_pci.c          |  565
>  drivers/vfio/pci/vfio_pci_config.c   | 1528 +++
>  drivers/vfio/pci/vfio_pci_intrs.c    |  727
>  drivers/vfio/pci/vfio_pci_private.h  |   91 ++
>  drivers/vfio/pci/vfio_pci_rdwr.c     |  269 ++
>  drivers/vfio/vfio.c                  | 1420
>  drivers/vfio/vfio_iommu_type1.c      |  754 +
>  include/linux/vfio.h                 |  445 ++
>  17 files changed, 6157 insertions(+)
>
> If there are any objections to including this, please speak now. If anything looks amiss in the branch, let me know. I've never hosted a next branch. Review comments welcome and I'll be glad to post the series in email again if requested. Thanks,
>
> Alex
Re: [PATCH] Add a page cache-backed balloon device driver.
On Tue, Jun 26, 2012 at 1:40 PM, Rik van Riel r...@redhat.com wrote:

> On 06/26/2012 04:32 PM, Frank Swiderski wrote:
>
> > This implementation of a virtio balloon driver uses the page cache to store pages that have been released to the host. The communication (outside of target counts) is one way--the guest notifies the host when it adds a page to the page cache, allowing the host to madvise(2) with MADV_DONTNEED. Reclaim in the guest is therefore automatic and implicit (via the regular page reclaim). This means that inflating the balloon is similar to the existing balloon mechanism, but the deflate is different--it re-uses existing Linux kernel functionality to automatically reclaim.
> >
> > Signed-off-by: Frank Swiderski f...@google.com
>
> It is a great idea, but how can this memory balancing possibly work if someone uses memory cgroups inside a guest?

Thanks and good point--this isn't something that I considered in the implementation.

> Having said that, we currently do not have proper memory reclaim balancing between cgroups at all, so requiring that of this balloon driver would be unreasonable.
>
> The code looks good to me, my only worry is the code duplication. We now have 5 balloon drivers, for 4 hypervisors, all implementing everything from scratch...

Do you have any recommendations on this? I could (I think reasonably so) modify the existing virtio_balloon.c and have it change behavior based on a feature bit or other configuration. I'm not sure that really addresses the root of what you're pointing out--it's still adding a different implementation, but doing so as an extension of an existing one.

fes
Re: [PATCH] Add a page cache-backed balloon device driver.
On Tue, Jun 26, 2012 at 01:32:58PM -0700, Frank Swiderski wrote: This implementation of a virtio balloon driver uses the page cache to store pages that have been released to the host. The communication (outside of target counts) is one way--the guest notifies the host when it adds a page to the page cache, allowing the host to madvise(2) with MADV_DONTNEED. Reclaim in the guest is therefore automatic and implicit (via the regular page reclaim). This means that inflating the balloon is similar to the existing balloon mechanism, but the deflate is different--it re-uses existing Linux kernel functionality to automatically reclaim. Signed-off-by: Frank Swiderski f...@google.com I'm pondering this: Should it really be a separate driver/device ID? If it behaves the same from host POV, maybe it should be up to the guest how to inflate/deflate the balloon internally? --- drivers/virtio/Kconfig | 13 + drivers/virtio/Makefile |1 + drivers/virtio/virtio_fileballoon.c | 636 +++ include/linux/virtio_balloon.h |9 + include/linux/virtio_ids.h |1 + 5 files changed, 660 insertions(+), 0 deletions(-) create mode 100644 drivers/virtio/virtio_fileballoon.c diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig index f38b17a..cffa2a7 100644 --- a/drivers/virtio/Kconfig +++ b/drivers/virtio/Kconfig @@ -35,6 +35,19 @@ config VIRTIO_BALLOON If unsure, say M. +config VIRTIO_FILEBALLOON + tristate Virtio page cache-backed balloon driver + select VIRTIO + select VIRTIO_RING + ---help--- + This driver supports decreasing and automatically reclaiming the + memory within a guest VM. Unlike VIRTIO_BALLOON, this driver instead + tries to maintain a specific target balloon size using the page cache. + This allows the guest to implicitly deflate the balloon by flushing + pages from the cache and touching the page. + + If unsure, say N. 
+ config VIRTIO_MMIO tristate Platform bus driver for memory mapped virtio devices (EXPERIMENTAL) depends on HAS_IOMEM EXPERIMENTAL diff --git a/drivers/virtio/Makefile b/drivers/virtio/Makefile index 5a4c63c..7ca0a3f 100644 --- a/drivers/virtio/Makefile +++ b/drivers/virtio/Makefile @@ -3,3 +3,4 @@ obj-$(CONFIG_VIRTIO_RING) += virtio_ring.o obj-$(CONFIG_VIRTIO_MMIO) += virtio_mmio.o obj-$(CONFIG_VIRTIO_PCI) += virtio_pci.o obj-$(CONFIG_VIRTIO_BALLOON) += virtio_balloon.o +obj-$(CONFIG_VIRTIO_FILEBALLOON) += virtio_fileballoon.o diff --git a/drivers/virtio/virtio_fileballoon.c b/drivers/virtio/virtio_fileballoon.c new file mode 100644 index 000..ff252ec --- /dev/null +++ b/drivers/virtio/virtio_fileballoon.c @@ -0,0 +1,636 @@ +/* Virtio file (page cache-backed) balloon implementation, inspired by + * Dor Loar and Marcelo Tosatti's implementations, and based on Rusty Russel's + * implementation. + * + * This implementation of the virtio balloon driver re-uses the page cache to + * allow memory consumed by inflating the balloon to be reclaimed by linux. It + * creates and mounts a bare-bones filesystem containing a single inode. When + * the host requests the balloon to inflate, it does so by reading pages at + * offsets into the inode mapping's page_tree. The host is notified when the + * pages are added to the page_tree, allowing it (the host) to madvise(2) the + * corresponding host memory, reducing the RSS of the virtual machine. In this + * implementation, the host is only notified when a page is added to the + * balloon. Reclaim happens under the existing TTFP logic, which flushes unused + * pages in the page cache. If the host used MADV_DONTNEED, then when the guest + * uses the page, the zero page will be mapped in, allowing automatic (and fast, + * compared to requiring a host notification via a virtio queue to get memory + * back) reclaim. 
+ * + * Copyright 2008 Rusty Russell IBM Corporation + * Copyright 2011 Frank Swiderski Google Inc + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA + */ +#include linux/backing-dev.h +#include linux/delay.h +#include linux/file.h +#include linux/freezer.h +#include
Re: [PATCH] Add a page cache-backed balloon device driver.
On 06/26/2012 05:31 PM, Frank Swiderski wrote:
> On Tue, Jun 26, 2012 at 1:40 PM, Rik van Riel r...@redhat.com wrote:
>> The code looks good to me, my only worry is the code duplication. We now have 5 balloon drivers, for 4 hypervisors, all implementing everything from scratch...
>
> Do you have any recommendations on this? I could (I think reasonably so) modify the existing virtio_balloon.c and have it change behavior based on a feature bit or other configuration. I'm not sure that really addresses the root of what you're pointing out--it's still adding a different implementation, but doing so as an extension of an existing one.

Ideally, I believe we would have two balloon top parts in a guest (one classical balloon, one on the LRU), and four bottom parts (kvm, xen, vmware & s390).

That way the virt specific bits of a balloon driver would be essentially a ->balloon_page and ->release_page callback for pages, as well as methods to communicate with the host.

All the management of pages, including stuff like putting them on the LRU, or isolating them for migration, would be done with the same common code, regardless of what virt software we are running on.

Of course, that is a substantial amount of work and I feel it would be unreasonable to block anyone's code on that kind of thing (especially considering that your code is good), but I do believe the explosion of balloon code is a little worrying.
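The split described above (common "top" page management over hypervisor-specific "bottom" callbacks) might look roughly like the following. This is only a userspace sketch to illustrate the shape of the idea, not code from any posted patch: apart from the balloon_page and release_page callback names taken from the mail, every type and function here (demo_page, demo_balloon, demo_balloon_inflate, and so on) is a hypothetical stand-in.

```c
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct page. */
struct demo_page {
	unsigned long pfn;
	int ballooned;
};

/* "Bottom part": the hypervisor-specific hooks named in the mail. */
struct demo_balloon_ops {
	int (*balloon_page)(struct demo_page *page);	/* hand the page to the host */
	int (*release_page)(struct demo_page *page);	/* take the page back */
};

/* "Top part": bookkeeping shared by every backend (hypothetical layout). */
struct demo_balloon {
	const struct demo_balloon_ops *ops;
	size_t nr_pages;
};

/* Common inflate path: only the callback differs per hypervisor. */
int demo_balloon_inflate(struct demo_balloon *b, struct demo_page *pages, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++) {
		int ret;

		if (pages[i].ballooned)
			continue;
		ret = b->ops->balloon_page(&pages[i]);
		if (ret)
			return ret;
		pages[i].ballooned = 1;
		b->nr_pages++;
	}
	return 0;
}

/* Common deflate path, mirroring inflate. */
int demo_balloon_deflate(struct demo_balloon *b, struct demo_page *pages, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++) {
		int ret;

		if (!pages[i].ballooned)
			continue;
		ret = b->ops->release_page(&pages[i]);
		if (ret)
			return ret;
		pages[i].ballooned = 0;
		b->nr_pages--;
	}
	return 0;
}

/* Trivial backend used only to exercise the common code. */
static int demo_noop(struct demo_page *page)
{
	(void)page;
	return 0;
}

const struct demo_balloon_ops demo_noop_ops = {
	.balloon_page = demo_noop,
	.release_page = demo_noop,
};
```

A kvm, xen, vmware or s390 backend would then supply its own demo_balloon_ops instance, while LRU placement and migration isolation would live in the shared inflate/deflate paths.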
Re: [PATCH] Add a page cache-backed balloon device driver.
On Tue, Jun 26, 2012 at 02:31:26PM -0700, Frank Swiderski wrote:
> On Tue, Jun 26, 2012 at 1:40 PM, Rik van Riel r...@redhat.com wrote:
>> On 06/26/2012 04:32 PM, Frank Swiderski wrote:
>>> This implementation of a virtio balloon driver uses the page cache to store pages that have been released to the host. The communication (outside of target counts) is one way--the guest notifies the host when it adds a page to the page cache, allowing the host to madvise(2) with MADV_DONTNEED. Reclaim in the guest is therefore automatic and implicit (via the regular page reclaim). This means that inflating the balloon is similar to the existing balloon mechanism, but the deflate is different--it re-uses existing Linux kernel functionality to automatically reclaim.
>>>
>>> Signed-off-by: Frank Swiderski f...@google.com
>>
>> It is a great idea, but how can this memory balancing possibly work if someone uses memory cgroups inside a guest?
>
> Thanks and good point--this isn't something that I considered in the implementation.
>
>> Having said that, we currently do not have proper memory reclaim balancing between cgroups at all, so requiring that of this balloon driver would be unreasonable.
>>
>> The code looks good to me, my only worry is the code duplication. We now have 5 balloon drivers, for 4 hypervisors, all implementing everything from scratch...
>
> Do you have any recommendations on this? I could (I think reasonably so) modify the existing virtio_balloon.c and have it change behavior based on a feature bit or other configuration. I'm not sure that really addresses the root of what you're pointing out--it's still adding a different implementation, but doing so as an extension of an existing one.
>
> fes

Let's assume it's a feature bit: how would you formulate what the feature does *from host point of view*?

-- MST
Re: [RFC PATCH 10/17] PowerPC: booke64: Refactor exception prolog for save/restore regs
On Mon, 2012-06-25 at 15:26 +0300, Mihai Caraman wrote: Refactor exception prolog to allow save/restore register parameters. Add addition none definition for exception prolog usage. This is needed for exceptions like Guest Doorbell that use GSRRx regsiters which do not map on exception type. Signed-off-by: Mihai Caraman mihai.cara...@freescale.com --- arch/powerpc/kernel/exceptions-64e.S | 23 --- 1 files changed, 8 insertions(+), 15 deletions(-) diff --git a/arch/powerpc/kernel/exceptions-64e.S b/arch/powerpc/kernel/exceptions-64e.S index 7215cc2..52aa96b 100644 --- a/arch/powerpc/kernel/exceptions-64e.S +++ b/arch/powerpc/kernel/exceptions-64e.S @@ -35,7 +35,7 @@ #define SPECIAL_EXC_FRAME_SIZE INT_FRAME_SIZE /* Exception prolog code for all exceptions */ -#define EXCEPTION_PROLOG(n, type, addition) \ +#define EXCEPTION_PROLOG(n, type, srr0, srr1, addition) \ mtspr SPRN_SPRG_##type##_SCRATCH,r13; /* get spare registers */ \ mfspr r13,SPRN_SPRG_PACA; /* get PACA */ \ std r10,PACA_EX##type+EX_R10(r13); \ @@ -44,54 +44,47 @@ addition; /* additional code for that exc. */ \ std r1,PACA_EX##type+EX_R1(r13); /* save old r1 in the PACA */ \ stw r10,PACA_EX##type+EX_CR(r13); /* save old CR in the PACA */ \ - mfspr r11,SPRN_##type##_SRR1;/* what are we coming from */\ + mfspr r11,srr1;/* what are we coming from */ \ type##_SET_KSTACK; /* get special stack if necessary */\ andi. r10,r11,MSR_PR; /* save stack pointer */\ beq 1f; /* branch around if supervisor */ \ ld r1,PACAKSAVE(r13); /* get kernel stack coming from usr */\ 1: cmpdi cr1,r1,0; /* check if SP makes sense */ \ bge-cr1,exc_##n##_bad_stack;/* bad stack (TODO: out of line) */ \ - mfspr r10,SPRN_##type##_SRR0; /* read SRR0 before touching stack */ + mfspr r10,srr0; /* read SRR0 before touching stack */ No, use the existing macro, use a ##type## specific to guest doorbells, with appropriate definitions of the corresponding SPRN_ macros. Cheers, Ben. 
Re: [RFC PATCH 11/17] PowerPC: booke64: Fix machine check handler to use the right prolog
On Mon, 2012-06-25 at 15:26 +0300, Mihai Caraman wrote: Machine check exception handler was using a wrong prolog. Hypervisors, like KVM, which are called early from the exception handler rely on the interrupt source. Signed-off-by: Mihai Caraman mihai.cara...@freescale.com Ack. Please separate your core patches from your KVM series and submit them separately. I'll take care of the core Book3E part. Cheers, Ben. --- arch/powerpc/kernel/exceptions-64e.S |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/arch/powerpc/kernel/exceptions-64e.S b/arch/powerpc/kernel/exceptions-64e.S index 52aa96b..06f7aec 100644 --- a/arch/powerpc/kernel/exceptions-64e.S +++ b/arch/powerpc/kernel/exceptions-64e.S @@ -290,7 +290,7 @@ interrupt_end_book3e: /* Machine Check Interrupt */ START_EXCEPTION(machine_check); - CRIT_EXCEPTION_PROLOG(0x200, PROLOG_ADDITION_NONE) + MC_EXCEPTION_PROLOG(0x200, PROLOG_ADDITION_NONE) // EXCEPTION_COMMON(0x200, PACA_EXMC, INTS_DISABLE) // bl special_reg_save_mc // addir3,r1,STACK_FRAME_OVERHEAD -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 13/17] PowerPC: booke64: Use SPRG0/3 scratch for bolted TLB miss crit int
On Mon, 2012-06-25 at 15:26 +0300, Mihai Caraman wrote: Embedded.Hypervisor category defines GSPRG0..3 physical registers for guests. Avoid SPRG4-7 usage as scratch in host exception handlers, otherwise guest SPRG4-7 registers will be clobbered. For bolted TLB miss exception handlers, which is the version currently supported by KVM, use SPRN_SPRG_GEN_SCRATCH (aka SPRG0) instead of SPRN_SPRG_TLB_SCRATCH (aka SPRG6) and replace TLB with GEN PACA slots to keep consitency. For critical exception handler use SPRG3 instead of SPRG7. Beware with SPRG3 usage. It's user space visible and we plan to use it for other things (see Anton's patch to stick topology information in there for use by the vdso). If you clobber it, you may want to restore it later. I think Anton's patch should put the proper value we want in the PACA anyway since we also need to restore it on exit from KVM, so you can still use it as scratch, just restore the value before going to C. Cheers, Ben. Signed-off-by: Mihai Caraman mihai.cara...@freescale.com --- arch/powerpc/include/asm/exception-64e.h | 14 +++--- arch/powerpc/include/asm/reg.h |6 +++--- arch/powerpc/mm/tlb_low_64e.S| 28 ++-- 3 files changed, 24 insertions(+), 24 deletions(-) diff --git a/arch/powerpc/include/asm/exception-64e.h b/arch/powerpc/include/asm/exception-64e.h index ac13add..c90a9a4 100644 --- a/arch/powerpc/include/asm/exception-64e.h +++ b/arch/powerpc/include/asm/exception-64e.h @@ -38,8 +38,11 @@ */ -/* We are out of SPRGs so we save some things in the PACA. The normal - * exception frame is smaller than the CRIT or MC one though +/* We are out of SPRGs so we save some things in the 8 slots available in PACA. + * The normal exception frame is smaller than the CRIT or MC one though + * + * Bolted TLB miss exception variant also uses these slots which in combination + * with pgd and kernel_pgd fits in one 64-byte cache line. 
*/ #define EX_R1(0 * 8) #define EX_CR(1 * 8) @@ -47,13 +50,10 @@ #define EX_R11 (3 * 8) #define EX_R14 (4 * 8) #define EX_R15 (5 * 8) +#define EX_R16 (6 * 8) /* - * The TLB miss exception uses different slots. - * - * The bolted variant uses only the first six fields, - * which in combination with pgd and kernel_pgd fits in - * one 64-byte cache line. + * PACA slots offset for standard TLB miss exception. */ #define EX_TLB_R10 ( 0 * 8) diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h index f0cb7f4..51c14a7 100644 --- a/arch/powerpc/include/asm/reg.h +++ b/arch/powerpc/include/asm/reg.h @@ -760,10 +760,10 @@ * 64-bit embedded * - SPRG0 generic exception scratch * - SPRG2 TLB exception stack - * - SPRG3 unused (user visible) + * - SPRG3 critical exception scratch (user visible) * - SPRG4 unused (user visible) * - SPRG6 TLB miss scratch (user visible, sorry !) - * - SPRG7 critical exception scratch + * - SPRG7 unused (user visible) * - SPRG8 machine check exception scratch * - SPRG9 debug exception scratch * @@ -857,7 +857,7 @@ #ifdef CONFIG_PPC_BOOK3E_64 #define SPRN_SPRG_MC_SCRATCH SPRN_SPRG8 -#define SPRN_SPRG_CRIT_SCRATCH SPRN_SPRG7 +#define SPRN_SPRG_CRIT_SCRATCH SPRN_SPRG3 #define SPRN_SPRG_DBG_SCRATCHSPRN_SPRG9 #define SPRN_SPRG_TLB_EXFRAMESPRN_SPRG2 #define SPRN_SPRG_TLB_SCRATCHSPRN_SPRG6 diff --git a/arch/powerpc/mm/tlb_low_64e.S b/arch/powerpc/mm/tlb_low_64e.S index 88feaaa..4192ade 100644 --- a/arch/powerpc/mm/tlb_low_64e.S +++ b/arch/powerpc/mm/tlb_low_64e.S @@ -40,36 +40,36 @@ **/ .macro tlb_prolog_bolted intnum addr - mtspr SPRN_SPRG_TLB_SCRATCH,r13 + mtspr SPRN_SPRG_GEN_SCRATCH,r13 mfspr r13,SPRN_SPRG_PACA - std r10,PACA_EXTLB+EX_TLB_R10(r13) + std r10,PACA_EXGEN+EX_R10(r13) mfcrr10 - std r11,PACA_EXTLB+EX_TLB_R11(r13) + std r11,PACA_EXGEN+EX_R11(r13) #ifdef CONFIG_KVM_BOOKE_HV BEGIN_FTR_SECTION mfspr r11, SPRN_SRR1 END_FTR_SECTION_IFSET(CPU_FTR_EMB_HV) #endif DO_KVM \intnum, SPRN_SRR1 - std r16,PACA_EXTLB+EX_TLB_R16(r13) + 
std r16,PACA_EXGEN+EX_R16(r13) mfspr r16,\addr /* get faulting address */ - std r14,PACA_EXTLB+EX_TLB_R14(r13) + std r14,PACA_EXGEN+EX_R14(r13) ld r14,PACAPGD(r13) - std r15,PACA_EXTLB+EX_TLB_R15(r13) - std r10,PACA_EXTLB+EX_TLB_CR(r13) + std r15,PACA_EXGEN+EX_R15(r13) + std r10,PACA_EXGEN+EX_CR(r13) TLB_MISS_PROLOG_STATS_BOLTED .endm .macro tlb_epilog_bolted - ld r14,PACA_EXTLB+EX_TLB_CR(r13) - ld r10,PACA_EXTLB+EX_TLB_R10(r13) - ld
Re: [RFC PATCH 13/17] PowerPC: booke64: Use SPRG0/3 scratch for bolted TLB miss crit int
On 06/25/2012 07:26 AM, Mihai Caraman wrote:
> Embedded.Hypervisor category defines GSPRG0..3 physical registers for guests. Avoid SPRG4-7 usage as scratch in host exception handlers, otherwise guest SPRG4-7 registers will be clobbered. For bolted TLB miss exception handlers, which is the version currently supported by KVM, use SPRN_SPRG_GEN_SCRATCH (aka SPRG0) instead of SPRN_SPRG_TLB_SCRATCH (aka SPRG6) and replace TLB with GEN PACA slots to keep consistency. For critical exception handler use SPRG3 instead of SPRG7.

extlb is in the same cache line as other TLB stuff we need, while exgen isn't. Let's stick with extlb.

-Scott
Re: [RFC PATCH 03/17] KVM: PPC64: booke: Add EPCR support in sregs
On 06/25/2012 07:26 AM, Mihai Caraman wrote:
> Add KVM_SREGS_E_64 feature and EPCR spr support in get/set sregs for 64-bit hosts.
>
> Signed-off-by: Mihai Caraman mihai.cara...@freescale.com
> ---
>  arch/powerpc/kvm/booke.c |   14 ++++++++++++++
>  1 files changed, 14 insertions(+), 0 deletions(-)
>
> diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
> index f9fa260..d15c4b5 100644
> --- a/arch/powerpc/kvm/booke.c
> +++ b/arch/powerpc/kvm/booke.c
> @@ -1052,6 +1052,9 @@ static void get_sregs_base(struct kvm_vcpu *vcpu,
>  	u64 tb = get_tb();
>
>  	sregs->u.e.features |= KVM_SREGS_E_BASE;
> +#ifdef CONFIG_64BIT
> +	sregs->u.e.features |= KVM_SREGS_E_64;
> +#endif
>
>  	sregs->u.e.csrr0 = vcpu->arch.csrr0;
>  	sregs->u.e.csrr1 = vcpu->arch.csrr1;
> @@ -1063,6 +1066,9 @@ static void get_sregs_base(struct kvm_vcpu *vcpu,
>  	sregs->u.e.dec = kvmppc_get_dec(vcpu, tb);
>  	sregs->u.e.tb = tb;
>  	sregs->u.e.vrsave = vcpu->arch.vrsave;
> +#ifdef CONFIG_64BIT
> +	sregs->u.e.epcr = vcpu->arch.epcr;
> +#endif
>  }
>
>  static int set_sregs_base(struct kvm_vcpu *vcpu,
> @@ -1071,6 +1077,11 @@ static int set_sregs_base(struct kvm_vcpu *vcpu,
>  	if (!(sregs->u.e.features & KVM_SREGS_E_BASE))
>  		return 0;
>
> +#ifdef CONFIG_64BIT
> +	if (!(sregs->u.e.features & KVM_SREGS_E_64))
> +		return 0;
> +#endif

This means that a QEMU targeting a 32-bit guest won't be able to set any special registers, if it sets feature bits manually rather than getting them from GET_SREGS. This check should only qualify whether we look at sregs.u.e.epcr, not whether this function works at all.

BTW, shouldn't the BASE check return an error rather than silently no-op?

-Scott
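One way to read the review comment above is that the E_64 feature bit should gate only the EPCR copy, while the base registers are still applied. The following userspace model is a sketch under that reading, not the eventual kernel change: demo_sregs, demo_vcpu, the DEMO_* flag values and demo_set_sregs_base are all hypothetical stand-ins for the real KVM structures, and the BASE check is left returning 0 as in the quoted patch, since the mail only raises the error return as a question.

```c
#include <stdint.h>

/* Hypothetical flag values for illustration only. */
#define DEMO_SREGS_E_BASE	(1u << 0)
#define DEMO_SREGS_E_64		(1u << 1)

/* Hypothetical, heavily reduced stand-ins for kvm_sregs and the vcpu. */
struct demo_sregs {
	uint32_t features;
	uint64_t csrr0;
	uint32_t epcr;
};

struct demo_vcpu {
	uint64_t csrr0;
	uint32_t epcr;
};

int demo_set_sregs_base(struct demo_vcpu *vcpu, const struct demo_sregs *sregs)
{
	/* Unchanged from the quoted patch: no BASE bit means nothing to do. */
	if (!(sregs->features & DEMO_SREGS_E_BASE))
		return 0;

	/* Base registers are applied whether or not E_64 is set. */
	vcpu->csrr0 = sregs->csrr0;

	/* The 64-bit feature bit only decides whether epcr is consumed. */
	if (sregs->features & DEMO_SREGS_E_64)
		vcpu->epcr = sregs->epcr;

	return 0;
}
```

With this shape, a caller that sets only the BASE bit still gets its base registers written, and EPCR is simply left alone.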
Re: [PATCH] Add a page cache-backed balloon device driver.
On Tue, Jun 26, 2012 at 2:47 PM, Michael S. Tsirkin m...@redhat.com wrote:
> On Tue, Jun 26, 2012 at 02:31:26PM -0700, Frank Swiderski wrote:
>> On Tue, Jun 26, 2012 at 1:40 PM, Rik van Riel r...@redhat.com wrote:
>>> On 06/26/2012 04:32 PM, Frank Swiderski wrote:
>>>> This implementation of a virtio balloon driver uses the page cache to store pages that have been released to the host. The communication (outside of target counts) is one way--the guest notifies the host when it adds a page to the page cache, allowing the host to madvise(2) with MADV_DONTNEED. Reclaim in the guest is therefore automatic and implicit (via the regular page reclaim). This means that inflating the balloon is similar to the existing balloon mechanism, but the deflate is different--it re-uses existing Linux kernel functionality to automatically reclaim.
>>>>
>>>> Signed-off-by: Frank Swiderski f...@google.com
>>>
>>> It is a great idea, but how can this memory balancing possibly work if someone uses memory cgroups inside a guest?
>>
>> Thanks and good point--this isn't something that I considered in the implementation.
>>
>>> Having said that, we currently do not have proper memory reclaim balancing between cgroups at all, so requiring that of this balloon driver would be unreasonable.
>>>
>>> The code looks good to me, my only worry is the code duplication. We now have 5 balloon drivers, for 4 hypervisors, all implementing everything from scratch...
>>
>> Do you have any recommendations on this? I could (I think reasonably so) modify the existing virtio_balloon.c and have it change behavior based on a feature bit or other configuration. I'm not sure that really addresses the root of what you're pointing out--it's still adding a different implementation, but doing so as an extension of an existing one.
>>
>> fes
>
> Let's assume it's a feature bit: how would you formulate what the feature does *from host point of view*?
>
> -- MST

In this implementation, the host doesn't keep track of pages in the balloon, as there is no explicit deflate path.
The host device for this implementation should merely, for example, MADV_DONTNEED on the pages sent in an inflate. Thus, the inflate becomes a notification that the guest doesn't need those pages mapped in, but that they should be available if the guest touches them. In that sense, it's not a rigid shrink of guest memory.

I'm not sure what I'd call the feature bit though. Was that the question you were asking, or did I misread?

fes
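The host-side behaviour described above can be observed from plain userspace on Linux. The sketch below is not the host device code; it only demonstrates the madvise(2) semantics being relied on, for a private anonymous mapping: after MADV_DONTNEED the range stays mapped, and the next touch sees a zero-filled page. The function name demo_discard_page is hypothetical.

```c
#define _DEFAULT_SOURCE
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Dirty one anonymous page, discard it with MADV_DONTNEED, then touch it
 * again. Returns the first byte read back after the discard, or -1 on error.
 */
int demo_discard_page(void)
{
	long psz = sysconf(_SC_PAGESIZE);
	unsigned char *p;
	int first;

	if (psz <= 0)
		return -1;

	p = mmap(NULL, (size_t)psz, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;

	memset(p, 0xaa, (size_t)psz);		/* page is now resident and dirty */

	if (madvise(p, (size_t)psz, MADV_DONTNEED) != 0) {
		munmap(p, (size_t)psz);
		return -1;
	}

	first = p[0];				/* touching it faults in a fresh page */
	munmap(p, (size_t)psz);
	return first;
}
```

The mapping never goes away, which matches the point in the mail that an inflate is a hint rather than a rigid shrink of guest memory.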
Re: [PATCH] Add a page cache-backed balloon device driver.
On Tue, Jun 26, 2012 at 2:45 PM, Rik van Riel r...@redhat.com wrote:
> On 06/26/2012 05:31 PM, Frank Swiderski wrote:
>> On Tue, Jun 26, 2012 at 1:40 PM, Rik van Riel r...@redhat.com wrote:
>>> The code looks good to me, my only worry is the code duplication. We now have 5 balloon drivers, for 4 hypervisors, all implementing everything from scratch...
>>
>> Do you have any recommendations on this? I could (I think reasonably so) modify the existing virtio_balloon.c and have it change behavior based on a feature bit or other configuration. I'm not sure that really addresses the root of what you're pointing out--it's still adding a different implementation, but doing so as an extension of an existing one.
>
> Ideally, I believe we would have two balloon top parts in a guest (one classical balloon, one on the LRU), and four bottom parts (kvm, xen, vmware & s390).
>
> That way the virt specific bits of a balloon driver would be essentially a ->balloon_page and ->release_page callback for pages, as well as methods to communicate with the host.
>
> All the management of pages, including stuff like putting them on the LRU, or isolating them for migration, would be done with the same common code, regardless of what virt software we are running on.
>
> Of course, that is a substantial amount of work and I feel it would be unreasonable to block anyone's code on that kind of thing (especially considering that your code is good), but I do believe the explosion of balloon code is a little worrying.

Hm, that makes a lot of sense. That would be a few patches definitely worth doing, IMHO. I'm not entirely sure how I feel about inflating the balloon drivers in the meantime. Sigh, and I didn't even mean that as a pun.

fes
Re: Request VFIO inclusion in linux-next
Hi Alex,

On Mon, 25 Jun 2012 22:55:52 -0600 Alex Williamson alex.william...@redhat.com wrote:
> VFIO has been kicking around for well over a year now and has been posted numerous times for review. The pre-requirements are finally available in linux-next (or will be in the 20120626 build) so I'd like to request a new branch be included in linux-next with a goal of being accepted into v3.6.
>
> The info for linux-next:
>
> Tree: git://github.com/awilliam/linux-vfio.git
> Branch: next
> Contact: Alex Williamson alex.william...@redhat.com
>
> This branch should be applied after both Bjorn's PCI next branch and Joerg's IOMMU next branch

I have added this from today. Since this tree merges (parts of) the pci and iommu trees, I am hoping that those are to remain stable - if not, then you need to stay alert if they are rebased.

Thanks for adding your subsystem tree as a participant of linux-next. As you may know, this is not a judgment of your code. The purpose of linux-next is for integration testing and to lower the impact of conflicts between subsystems in the next merge window.

You will need to ensure that the patches/commits in your tree/series have been:
 * submitted under GPL v2 (or later) and include the Contributor's Signed-off-by,
 * posted to the relevant mailing list,
 * reviewed by you (or another maintainer of your subsystem tree),
 * successfully unit tested, and
 * destined for the current or next Linux merge window.

Basically, this should be just what you would send to Linus (or ask him to fetch). It is allowed to be rebased if you deem it necessary.

--
Cheers,
Stephen Rothwell s...@canb.auug.org.au

Legal Stuff: By participating in linux-next, your subsystem tree contributions are public and will be included in the linux-next trees. You may be sent e-mail messages indicating errors or other issues when the patches/commits from your subsystem tree are merged and tested in linux-next.
These messages may also be cross-posted to the linux-next mailing list, the linux-kernel mailing list, etc. The linux-next tree project and IBM (my employer) make no warranties regarding the linux-next project, the testing procedures, the results, the e-mails, etc. If you don't agree to these ground rules, let me know and I'll remove your tree from participation in linux-next.
[PATCH 2/4] KVM: Use __print_hex() for kvm_emulate_insn tracepoint
From: Namhyung Kim namhyung@lge.com The kvm_emulate_insn tracepoint used __print_insn() for printing its instructions. However it makes the format of the event hard to parse as it reveals TP internals. Fortunately, kernel provides __print_hex for almost same purpose, we can use it instead of open coding it. The user-space can be changed to parse it later. That means raw kernel tracing will not be affected by this change: # cd /sys/kernel/debug/tracing/ # cat events/kvm/kvm_emulate_insn/format name: kvm_emulate_insn ID: 29 format: ... print fmt: %x:%llx:%s (%s)%s, REC-csbase, REC-rip, __print_hex(REC-insn, REC-len), \ __print_symbolic(REC-flags, { 0, real }, { (1 0) | (1 1), vm16 }, \ { (1 0), prot16 }, { (1 0) | (1 2), prot32 }, { (1 0) | (1 3), prot64 }), \ REC-failed ? failed : # echo 1 events/kvm/kvm_emulate_insn/enable # cat trace # tracer: nop # # entries-in-buffer/entries-written: 2183/2183 #P:12 # # _-= irqs-off # / _= need-resched #| / _---= hardirq/softirq #|| / _--= preempt-depth #||| / delay # TASK-PID CPU# TIMESTAMP FUNCTION # | | | | | qemu-kvm-1782 [002] ...1 140.931636: kvm_emulate_insn: 0:c102fa25:89 10 (prot32) qemu-kvm-1781 [004] ...1 140.931637: kvm_emulate_insn: 0:c102fa25:89 10 (prot32) Cc: kvm@vger.kernel.org Link: http://lkml.kernel.org/n/tip-wfw6y3b9ugtey8snaow9n...@git.kernel.org Signed-off-by: Namhyung Kim namhy...@kernel.org --- arch/x86/kvm/trace.h | 12 +--- include/trace/ftrace.h |1 + 2 files changed, 2 insertions(+), 11 deletions(-) diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h index 911d2641f14c..62d02e3c3ed6 100644 --- a/arch/x86/kvm/trace.h +++ b/arch/x86/kvm/trace.h @@ -710,16 +710,6 @@ TRACE_EVENT(kvm_skinit, __entry-rip, __entry-slb) ); -#define __print_insn(insn, ilen) ({ \ - int i; \ - const char *ret = p-buffer + p-len;\ -\ - for (i = 0; i ilen; ++i) \ - trace_seq_printf(p, %02x, insn[i]); \ - trace_seq_printf(p, %c, 0);\ - ret; \ - }) - #define KVM_EMUL_INSN_F_CR0_PE (1 0) #define KVM_EMUL_INSN_F_EFL_VM (1 1) 
#define KVM_EMUL_INSN_F_CS_D (1 2) @@ -786,7 +776,7 @@ TRACE_EVENT(kvm_emulate_insn, TP_printk(%x:%llx:%s (%s)%s, __entry-csbase, __entry-rip, - __print_insn(__entry-insn, __entry-len), + __print_hex(__entry-insn, __entry-len), __print_symbolic(__entry-flags, kvm_trace_symbol_emul_flags), __entry-failed ? failed : diff --git a/include/trace/ftrace.h b/include/trace/ftrace.h index 769724944fc6..c6bc2faaf261 100644 --- a/include/trace/ftrace.h +++ b/include/trace/ftrace.h @@ -571,6 +571,7 @@ static inline void ftrace_test_probe_##call(void) \ #undef __print_flags #undef __print_symbolic +#undef __print_hex #undef __get_dynamic_array #undef __get_str -- 1.7.10.2 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
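For readers who want to see the output format this patch moves to, the trace lines quoted above show the instruction bytes as space-separated two-digit hex ("89 10"). The following userspace helper is a sketch that produces the same textual form; it is not the kernel's __print_hex implementation, and demo_print_hex is a hypothetical name.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Format len bytes from buf as space-separated two-digit hex into out.
 * Returns the number of characters written, not counting the trailing NUL.
 * Bytes that do not fit are dropped and the output stays NUL-terminated.
 */
size_t demo_print_hex(char *out, size_t outlen,
		      const unsigned char *buf, size_t len)
{
	size_t i, used = 0;

	if (outlen == 0)
		return 0;
	out[0] = '\0';

	for (i = 0; i < len; i++) {
		int n = snprintf(out + used, outlen - used, "%s%02x",
				 i ? " " : "", buf[i]);

		if (n < 0 || (size_t)n >= outlen - used) {
			out[used] = '\0';	/* undo the partial write */
			break;
		}
		used += (size_t)n;
	}
	return used;
}
```

Because the format is a fixed, well-known one, a userspace parser only has to split on spaces instead of understanding a tracepoint-private macro.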
[PATCH v2 3/6] kvm: Sanitize KVM_IRQFD flags
We only know of one so far.

Signed-off-by: Alex Williamson alex.william...@redhat.com
---
 virt/kvm/eventfd.c |    3 +++
 1 file changed, 3 insertions(+)

diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index c307c24..7d7e2aa 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -340,6 +340,9 @@ kvm_irqfd_deassign(struct kvm *kvm, struct kvm_irqfd *args)
 int
 kvm_irqfd(struct kvm *kvm, struct kvm_irqfd *args)
 {
+	if (args->flags & ~KVM_IRQFD_FLAG_DEASSIGN)
+		return -EINVAL;
+
 	if (args->flags & KVM_IRQFD_FLAG_DEASSIGN)
 		return kvm_irqfd_deassign(kvm, args);

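The check added by this patch is the usual "reject unknown flag bits" pattern, which keeps unused bits free for later extensions. A minimal userspace sketch of the same idea follows; the DEMO_* names and demo_check_irqfd_flags are hypothetical, and only the single DEASSIGN bit (bit 0, as in the patch series) is treated as known.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical mirror of the one flag known at this point in the series. */
#define DEMO_IRQFD_FLAG_DEASSIGN	(1u << 0)
#define DEMO_IRQFD_KNOWN_FLAGS		DEMO_IRQFD_FLAG_DEASSIGN

/* Return 0 if only known flag bits are set, -EINVAL otherwise. */
int demo_check_irqfd_flags(uint32_t flags)
{
	if (flags & ~DEMO_IRQFD_KNOWN_FLAGS)
		return -EINVAL;
	return 0;
}
```

When a new flag is introduced, it is added to the known mask in the same change, so older kernels reject it cleanly instead of silently ignoring it.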
[PATCH v2 4/6] kvm: Extend irqfd to support level interrupts
In order to inject an interrupt from an external source using an irqfd, we need to allocate a new irq_source_id. This allows us to assert and (later) de-assert an interrupt line independently from users of KVM_IRQ_LINE and avoid lost interrupts.

We also add what may appear like a bit of excessive infrastructure around an object for storing this irq_source_id. However, notice that we only provide a way to assert the interrupt here. A follow-on interface will make use of the same irq_source_id to allow de-assert.

Signed-off-by: Alex Williamson alex.william...@redhat.com
---

 Documentation/virtual/kvm/api.txt |    5 ++
 arch/x86/kvm/x86.c                |    1
 include/linux/kvm.h               |    3 +
 virt/kvm/eventfd.c                |   95 +++--
 4 files changed, 99 insertions(+), 5 deletions(-)

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index ea9edce..b216709 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -1981,6 +1981,11 @@ the guest using the specified gsi pin. The irqfd is removed using
 the KVM_IRQFD_FLAG_DEASSIGN flag, specifying both kvm_irqfd.fd
 and kvm_irqfd.gsi.

+With KVM_IRQFD_FLAG_LEVEL KVM_IRQFD allocates a new IRQ source ID for
+the requested irqfd. This is necessary to share level triggered
+interrupts with those injected through KVM_IRQ_LINE. IRQFDs created
+with KVM_IRQFD_FLAG_LEVEL must also set this flag when de-assigning.
+KVM_IRQFD_FLAG_LEVEL support is indicated by KVM_CAP_IRQFD_LEVEL.

 5. The kvm_run structure

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index a01a424..80bed07 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2148,6 +2148,7 @@ int kvm_dev_ioctl_check_extension(long ext)
 	case KVM_CAP_GET_TSC_KHZ:
 	case KVM_CAP_PCI_2_3:
 	case KVM_CAP_KVMCLOCK_CTRL:
+	case KVM_CAP_IRQFD_LEVEL:
 		r = 1;
 		break;
 	case KVM_CAP_COALESCED_MMIO:
diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index 2ce09aa..b2e6e4f 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -618,6 +618,7 @@ struct kvm_ppc_smmu_info {
 #define KVM_CAP_PPC_GET_SMMU_INFO 78
 #define KVM_CAP_S390_COW 79
 #define KVM_CAP_PPC_ALLOC_HTAB 80
+#define KVM_CAP_IRQFD_LEVEL 81

 #ifdef KVM_CAP_IRQ_ROUTING

@@ -683,6 +684,8 @@ struct kvm_xen_hvm_config {
 #endif

 #define KVM_IRQFD_FLAG_DEASSIGN (1 << 0)
+/* Available with KVM_CAP_IRQFD_LEVEL */
+#define KVM_IRQFD_FLAG_LEVEL (1 << 1)

 struct kvm_irqfd {
 	__u32 fd;
diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index 7d7e2aa..18cc284 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -36,6 +36,64 @@
 #include "iodev.h"

 /*
+ * An irq_source_id can be created from KVM_IRQFD for level interrupt
+ * injections and shared with other interfaces for EOI or de-assert.
+ * Create an object with reference counting to make it easy to use.
+ */
+struct _irq_source {
+	int id;
+	struct kvm *kvm;
+	struct kref kref;
+};
+
+static void release_irq_source(struct kref *kref)
+{
+	struct _irq_source *source;
+
+	source = container_of(kref, struct _irq_source, kref);
+
+	kvm_free_irq_source_id(source->kvm, source->id);
+	kfree(source);
+}
+
+static void put_irq_source(struct _irq_source *source)
+{
+	if (source)
+		kref_put(&source->kref, release_irq_source);
+}
+
+static struct _irq_source *__attribute__ ((used)) /* white lie for now */
+get_irq_source(struct _irq_source *source)
+{
+	if (source)
+		kref_get(&source->kref);
+
+	return source;
+}
+
+static struct _irq_source *new_irq_source(struct kvm *kvm)
+{
+	struct _irq_source *source;
+	int id;
+
+	source = kzalloc(sizeof(*source), GFP_KERNEL);
+	if (!source)
+		return ERR_PTR(-ENOMEM);
+
+	id = kvm_request_irq_source_id(kvm);
+	if (id < 0) {
+		kfree(source);
+		return ERR_PTR(id);
+	}
+
+	kref_init(&source->kref);
+	source->kvm = kvm;
+	source->id = id;
+
+	return source;
+}
+
+/*
  *
  * irqfd: Allows an fd to be used to inject an interrupt to the guest
  *
@@ -52,6 +110,7 @@ struct _irqfd {
 	/* Used for level IRQ fast-path */
 	int gsi;
 	struct work_struct inject;
+	struct _irq_source *source;
 	/* Used for setup/shutdown */
 	struct eventfd_ctx *eventfd;
 	struct list_head list;
@@ -62,7 +121,7 @@ struct _irqfd {
 static struct workqueue_struct *irqfd_cleanup_wq;

 static void
-irqfd_inject(struct work_struct *work)
+irqfd_inject_edge(struct work_struct *work)
 {
 	struct _irqfd *irqfd = container_of(work, struct _irqfd, inject);
 	struct kvm *kvm = irqfd->kvm;
@@ -71,6 +130,14 @@ irqfd_inject(struct work_struct *work)
 	kvm_set_irq(kvm,
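The reference-counted `_irq_source` object in this patch can be illustrated in plain userspace C. This is only a sketch under stated assumptions, not the kernel code: `struct irq_source`, `source_new`, `source_get`, and `source_put` are hypothetical names, a plain integer stands in for `struct kref`, and a counter stands in for `kvm_free_irq_source_id()`.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in for the patch's struct _irq_source. */
struct irq_source {
	int id;        /* stands in for the allocated irq_source_id */
	int refcount;  /* stands in for struct kref (not thread-safe) */
};

/* Counts released IDs; stands in for kvm_free_irq_source_id(). */
static int released_ids;

/* Allocate a source holding one reference, like kref_init(). */
static struct irq_source *source_new(int id)
{
	struct irq_source *s = calloc(1, sizeof(*s));

	if (!s)
		return NULL;
	s->id = id;
	s->refcount = 1;
	return s;
}

/* Take an extra reference; NULL is passed through, as in the patch. */
static struct irq_source *source_get(struct irq_source *s)
{
	if (s)
		s->refcount++;
	return s;
}

/* Drop a reference; the last put releases the ID and frees the object. */
static void source_put(struct irq_source *s)
{
	if (s && --s->refcount == 0) {
		released_ids++;
		free(s);
	}
}
```

Whichever holder calls `source_put()` last triggers the release, which is why an irqfd and a later consumer of the same source ID could be torn down in either order.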
[PATCH v2 5/6] kvm: KVM_EOIFD, an eventfd for EOIs
This new ioctl enables an eventfd to be triggered when an EOI is written for a specified irqchip pin. By default this is a simple notification, but we can also tie the eoifd to a level irqfd, which enables the irqchip pin to be automatically de-asserted on EOI. This mode is particularly useful for device-assignment applications where the unmask and notify triggers a hardware unmask. The default mode is most applicable to simple notify with no side-effects for userspace usage, such as Qemu.

Here we make use of the reference counting of the _irq_source object, allowing us to share it with an irqfd and clean up regardless of the release order.

Signed-off-by: Alex Williamson alex.william...@redhat.com
---

 Documentation/virtual/kvm/api.txt |   24 +
 arch/x86/kvm/x86.c                |    1
 include/linux/kvm.h               |   14 +++
 include/linux/kvm_host.h          |   13 +++
 virt/kvm/eventfd.c                |  189 +
 virt/kvm/kvm_main.c               |   11 ++
 6 files changed, 250 insertions(+), 2 deletions(-)

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index b216709..87a2558 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -1987,6 +1987,30 @@ interrupts with those injected through KVM_IRQ_LINE. IRQFDs created
 with KVM_IRQFD_FLAG_LEVEL must also set this flag when de-assigning.
 KVM_IRQFD_FLAG_LEVEL support is indicated by KVM_CAP_IRQFD_LEVEL.

+4.77 KVM_EOIFD
+
+Capability: KVM_CAP_EOIFD
+Architectures: x86
+Type: vm ioctl
+Parameters: struct kvm_eoifd (in)
+Returns: 0 on success, -1 on error
+
+KVM_EOIFD allows userspace to receive EOI notification through an
+eventfd for level triggered irqchip interrupts. Behavior for edge
+triggered interrupts is undefined. kvm_eoifd.fd specifies the eventfd
+used for notification and kvm_eoifd.gsi specifies the irqchip pin,
+similar to KVM_IRQFD. KVM_EOIFD_FLAG_DEASSIGN is used to deassign
+a previously enabled eoifd and should also set fd and gsi to match.
+
+The KVM_EOIFD_FLAG_LEVEL_IRQFD flag indicates that the EOI is for
+a level triggered EOI and the kvm_eoifd structure includes
+kvm_eoifd.irqfd, which must be previously configured using KVM_IRQFD
+with the KVM_IRQFD_FLAG_LEVEL flag. This allows both EOI notification
+through kvm_eoifd.fd as well as automatically de-asserting level
+irqfds on EOI. Both KVM_EOIFD_FLAG_DEASSIGN and
+KVM_EOIFD_FLAG_LEVEL_IRQFD should be used to de-assign an eoifd
+initially set up with KVM_EOIFD_FLAG_LEVEL_IRQFD.
+
 5. The kvm_run structure

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 80bed07..62d6eca 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2149,6 +2149,7 @@ int kvm_dev_ioctl_check_extension(long ext)
 	case KVM_CAP_PCI_2_3:
 	case KVM_CAP_KVMCLOCK_CTRL:
 	case KVM_CAP_IRQFD_LEVEL:
+	case KVM_CAP_EOIFD:
 		r = 1;
 		break;
 	case KVM_CAP_COALESCED_MMIO:
diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index b2e6e4f..7567e7d 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -619,6 +619,7 @@ struct kvm_ppc_smmu_info {
 #define KVM_CAP_S390_COW 79
 #define KVM_CAP_PPC_ALLOC_HTAB 80
 #define KVM_CAP_IRQFD_LEVEL 81
+#define KVM_CAP_EOIFD 82

 #ifdef KVM_CAP_IRQ_ROUTING

@@ -694,6 +695,17 @@ struct kvm_irqfd {
 	__u8 pad[20];
 };

+#define KVM_EOIFD_FLAG_DEASSIGN (1 << 0)
+#define KVM_EOIFD_FLAG_LEVEL_IRQFD (1 << 1)
+
+struct kvm_eoifd {
+	__u32 fd;
+	__u32 gsi;
+	__u32 flags;
+	__u32 irqfd;
+	__u8 pad[16];
+};
+
 struct kvm_clock_data {
 	__u64 clock;
 	__u32 flags;
@@ -834,6 +846,8 @@ struct kvm_s390_ucas_mapping {
 #define KVM_PPC_GET_SMMU_INFO _IOR(KVMIO, 0xa6, struct kvm_ppc_smmu_info)
 /* Available with KVM_CAP_PPC_ALLOC_HTAB */
 #define KVM_PPC_ALLOCATE_HTAB _IOWR(KVMIO, 0xa7, __u32)
+/* Available with KVM_CAP_EOIFD */
+#define KVM_EOIFD _IOW(KVMIO, 0xa8, struct kvm_eoifd)

 /*
  * ioctls for vcpu fds
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index ae3b426..83472eb 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -285,6 +285,10 @@ struct kvm {
 		struct list_head items;
 	} irqfds;
 	struct list_head ioeventfds;
+	struct {
+		spinlock_t lock;
+		struct list_head items;
+	} eoifds;
 #endif
 	struct kvm_vm_stat stat;
 	struct kvm_arch arch;
@@ -828,6 +832,8 @@ int kvm_irqfd(struct kvm *kvm, struct kvm_irqfd *args);
 void kvm_irqfd_release(struct kvm *kvm);
 void kvm_irq_routing_update(struct kvm *, struct kvm_irq_routing_table *);
 int kvm_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args);
+int kvm_eoifd(struct kvm *kvm, struct kvm_eoifd *args);
+void kvm_eoifd_release(struct kvm *kvm);

 #else

@@ -853,6 +859,13 @@ static inline int kvm_ioeventfd(struct kvm
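To make the two KVM_EOIFD modes concrete, here is a rough sketch of how userspace might fill in the request structure proposed by this RFC. Everything prefixed `sketch_` or `SKETCH_` is a local, hypothetical mirror of the definitions in the patch rather than something taken from installed kernel headers, and the ioctl itself is deliberately not issued.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Local mirrors of the flags proposed in the patch (hypothetical names). */
#define SKETCH_EOIFD_FLAG_DEASSIGN    (1u << 0)
#define SKETCH_EOIFD_FLAG_LEVEL_IRQFD (1u << 1)

/* Local mirror of the proposed struct kvm_eoifd layout. */
struct sketch_eoifd {
	uint32_t fd;
	uint32_t gsi;
	uint32_t flags;
	uint32_t irqfd;
	uint8_t  pad[16];
};

/*
 * Build an assign request. Pass level_irqfd < 0 for plain EOI
 * notification, or the fd of a level irqfd to also request
 * automatic de-assert on EOI.
 */
static struct sketch_eoifd sketch_eoifd_assign(int fd, uint32_t gsi,
					       int level_irqfd)
{
	struct sketch_eoifd req;

	memset(&req, 0, sizeof(req));
	req.fd = (uint32_t)fd;
	req.gsi = gsi;
	if (level_irqfd >= 0) {
		req.flags |= SKETCH_EOIFD_FLAG_LEVEL_IRQFD;
		req.irqfd = (uint32_t)level_irqfd;
	}
	return req;
}

/* Build the matching de-assign request: same fields plus DEASSIGN. */
static struct sketch_eoifd sketch_eoifd_deassign(struct sketch_eoifd assigned)
{
	assigned.flags |= SKETCH_EOIFD_FLAG_DEASSIGN;
	return assigned;
}
```

Deriving the de-assign request from the assign request keeps `KVM_EOIFD_FLAG_LEVEL_IRQFD` set, which is what the documentation text in this patch asks for.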
[PATCH v2 6/6] kvm: Level IRQ de-assert for KVM_IRQFD
This is an alternate level irqfd de-assert mode that's potentially useful for emulated drivers. It's included here to show how easy it is to implement with the new level irqfd and eoifd support. It's possible this mode might also prove interesting for device-assignment where we inject via level irqfd, receive an EOI (w/o de-assert), and use the level de-assert irqfd here.

Signed-off-by: Alex Williamson alex.william...@redhat.com
---

 Documentation/virtual/kvm/api.txt |    8
 include/linux/kvm.h               |    6 +-
 virt/kvm/eventfd.c                |   28 ++--
 3 files changed, 39 insertions(+), 3 deletions(-)

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index 87a2558..b356937 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -1987,6 +1987,14 @@ interrupts with those injected through KVM_IRQ_LINE. IRQFDs created
 with KVM_IRQFD_FLAG_LEVEL must also set this flag when de-assigning.
 KVM_IRQFD_FLAG_LEVEL support is indicated by KVM_CAP_IRQFD_LEVEL.

+The KVM_IRQFD_FLAG_LEVEL_DEASSERT flag creates an irqfd similar to
+KVM_IRQFD_FLAG_LEVEL, except the irqfd de-asserts the irqchip pin
+rather than asserts it. The level irqfd must first be created as
+above and passed to this ioctl in kvm_irqfd.irqfd. This ensures
+the de-assert irqfd uses the same IRQ source ID as the assert irqfd.
+This flag should also be specified on de-assign. This feature is
+present when KVM_CAP_IRQFD_LEVEL_DEASSERT is available.
+
 4.77 KVM_EOIFD

 Capability: KVM_CAP_EOIFD
diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index 7567e7d..0bbfd47 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -620,6 +620,7 @@ struct kvm_ppc_smmu_info {
 #define KVM_CAP_PPC_ALLOC_HTAB 80
 #define KVM_CAP_IRQFD_LEVEL 81
 #define KVM_CAP_EOIFD 82
+#define KVM_CAP_IRQFD_LEVEL_DEASSERT 83

 #ifdef KVM_CAP_IRQ_ROUTING

@@ -687,12 +688,15 @@ struct kvm_xen_hvm_config {
 #define KVM_IRQFD_FLAG_DEASSIGN (1 << 0)
 /* Available with KVM_CAP_IRQFD_LEVEL */
 #define KVM_IRQFD_FLAG_LEVEL (1 << 1)
+/* Available with KVM_CAP_IRQFD_LEVEL_DEASSERT */
+#define KVM_IRQFD_FLAG_LEVEL_DEASSERT (1 << 2)

 struct kvm_irqfd {
 	__u32 fd;
 	__u32 gsi;
 	__u32 flags;
-	__u8 pad[20];
+	__u32 irqfd;
+	__u8 pad[16];
 };

 #define KVM_EOIFD_FLAG_DEASSIGN (1 << 0)
diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index 02ca50f..50ace0e 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -172,6 +172,14 @@ irqfd_inject_level(struct work_struct *work)
 	kvm_set_irq(irqfd->kvm, irqfd->source->id, irqfd->gsi, 1);
 }

+static void
+irqfd_inject_level_deassert(struct work_struct *work)
+{
+	struct _irqfd *irqfd = container_of(work, struct _irqfd, inject);
+
+	kvm_set_irq(irqfd->kvm, irqfd->source->id, irqfd->gsi, 0);
+}
+
 /*
  * Race-free decouple logic (ordering is critical)
  */
@@ -320,6 +328,9 @@ kvm_irqfd_assign(struct kvm *kvm, struct kvm_irqfd *args)
 	INIT_LIST_HEAD(&irqfd->list);

 	if (args->flags & KVM_IRQFD_FLAG_LEVEL) {
+		if (args->flags & KVM_IRQFD_FLAG_LEVEL_DEASSERT)
+			return -EINVAL; /* mutually exclusive */
+
 		irqfd->source = new_irq_source(kvm);
 		if (IS_ERR(irqfd->source)) {
 			ret = PTR_ERR(irqfd->source);
@@ -328,6 +339,16 @@ kvm_irqfd_assign(struct kvm *kvm, struct kvm_irqfd *args)
 		}

 		INIT_WORK(&irqfd->inject, irqfd_inject_level);
+
+	} else if (args->flags & KVM_IRQFD_FLAG_LEVEL_DEASSERT) {
+		irqfd->source = get_irq_source_from_irqfd(kvm, args->irqfd);
+		if (IS_ERR(irqfd->source)) {
+			ret = PTR_ERR(irqfd->source);
+			irqfd->source = NULL;
+			goto fail;
+		}
+
+		INIT_WORK(&irqfd->inject, irqfd_inject_level_deassert);
 	} else
 		INIT_WORK(&irqfd->inject, irqfd_inject_edge);

@@ -421,7 +442,8 @@ kvm_irqfd_deassign(struct kvm *kvm, struct kvm_irqfd *args)
 {
 	struct _irqfd *irqfd, *tmp;
 	struct eventfd_ctx *eventfd;
-	bool is_level = (args->flags & KVM_IRQFD_FLAG_LEVEL) != 0;
+	bool is_level = (args->flags & (KVM_IRQFD_FLAG_LEVEL |
+					KVM_IRQFD_FLAG_LEVEL_DEASSERT)) != 0;

 	eventfd = eventfd_ctx_fdget(args->fd);
 	if (IS_ERR(eventfd))
@@ -461,7 +483,9 @@ kvm_irqfd_deassign(struct kvm *kvm, struct kvm_irqfd *args)
 int
 kvm_irqfd(struct kvm *kvm, struct kvm_irqfd *args)
 {
-	if (args->flags & ~(KVM_IRQFD_FLAG_DEASSIGN | KVM_IRQFD_FLAG_LEVEL))
+	if (args->flags & ~(KVM_IRQFD_FLAG_DEASSIGN |
+			    KVM_IRQFD_FLAG_LEVEL |
+			    KVM_IRQFD_FLAG_LEVEL_DEASSERT))
 		return -EINVAL;

 	if (args->flags & KVM_IRQFD_FLAG_DEASSIGN)
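Two details of this patch are easy to check outside the kernel: carving `irqfd` out of the padding leaves the size of the ioctl argument unchanged, and the flag test rejects any bit outside the known set. The sketch below assumes local, hypothetical mirrors of the definitions (`irqfd_v1`, `irqfd_v2`, `IRQFD_FLAG_*`, `check_irqfd_flags`). It also folds the LEVEL versus LEVEL_DEASSERT exclusion into the same helper for brevity, whereas the patch performs that check inside kvm_irqfd_assign().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Local mirrors of the flag bits used in the patch (hypothetical names). */
#define IRQFD_FLAG_DEASSIGN       (1u << 0)
#define IRQFD_FLAG_LEVEL          (1u << 1)
#define IRQFD_FLAG_LEVEL_DEASSERT (1u << 2)

/* Layout before the patch: 12 bytes of fields plus 20 bytes of padding. */
struct irqfd_v1 {
	uint32_t fd, gsi, flags;
	uint8_t  pad[20];
};

/* Layout after the patch: irqfd takes 4 bytes out of the padding. */
struct irqfd_v2 {
	uint32_t fd, gsi, flags, irqfd;
	uint8_t  pad[16];
};

_Static_assert(sizeof(struct irqfd_v1) == 32, "old layout is 32 bytes");
_Static_assert(sizeof(struct irqfd_v2) == sizeof(struct irqfd_v1),
	       "new field must not change the struct size");
_Static_assert(offsetof(struct irqfd_v2, irqfd) ==
	       offsetof(struct irqfd_v1, pad),
	       "new field reuses the start of the old padding");

/* Reject unknown flag bits and the LEVEL + LEVEL_DEASSERT combination. */
static int check_irqfd_flags(uint32_t flags)
{
	const uint32_t known = IRQFD_FLAG_DEASSIGN |
			       IRQFD_FLAG_LEVEL |
			       IRQFD_FLAG_LEVEL_DEASSERT;

	if (flags & ~known)
		return -EINVAL;
	if ((flags & IRQFD_FLAG_LEVEL) && (flags & IRQFD_FLAG_LEVEL_DEASSERT))
		return -EINVAL;
	return 0;
}
```

Because the old padding had to be zero for the call to be forward compatible, an older userspace that never sets the new flag passes `irqfd == 0`, which the kernel side in this patch ignores.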
[PATCH v2 0/6] kvm: level triggered irqfd support
Ok, let's see how this flies. I actually quite like this, so be gentle tearing it apart ;)

I just couldn't bring myself to contort KVM_IRQFD into something that either sets up an irqfd or specifies a nearly unrelated EOI eventfd. The solution I've come up with, that also avoids exposing irq_source_ids to userspace, is to work through the irqfd. If we set up a level irqfd, we can optionally associate an eoifd with the same irq_source_id, by passing the irqfd. To do this, we just need to create a new reference counted object for the source ID so we don't run into problems ordering release. This means we end up with a KVM_EOIFD ioctl that has both general usefulness and can still tie into an irqfd.

In patch 6/6 I also include an alternate de-assert mechanism via an irqfd with the opposite polarity. I don't currently use this, but it's pretty trivial and at least available in archives now.

I don't address whether injecting an edge irqfd really needs an assert followed by de-assert (I don't know). This new interface really unties itself from caring. We might be able to consolidate inject functions at some future point, but it doesn't change how we'd name flags as it did in the previous version. Thanks,

Alex

---

Alex Williamson (6):
      kvm: Level IRQ de-assert for KVM_IRQFD
      kvm: KVM_EOIFD, an eventfd for EOIs
      kvm: Extend irqfd to support level interrupts
      kvm: Sanitize KVM_IRQFD flags
      kvm: Add missing KVM_IRQFD API documentation
      kvm: Pass kvm_irqfd to functions

 Documentation/virtual/kvm/api.txt |   53 ++
 arch/x86/kvm/x86.c                |    2
 include/linux/kvm.h               |   23 +++
 include/linux/kvm_host.h          |   17 ++
 virt/kvm/eventfd.c                |  323 -
 virt/kvm/kvm_main.c               |   13 +
 6 files changed, 414 insertions(+), 17 deletions(-)
[PATCH v2 1/6] kvm: Pass kvm_irqfd to functions
Prune this down to just the struct kvm_irqfd so we can avoid changing the function definitions for every flag or field we use.

Signed-off-by: Alex Williamson alex.william...@redhat.com
---

 include/linux/kvm_host.h |    4 ++--
 virt/kvm/eventfd.c       |   20 ++--
 virt/kvm/kvm_main.c      |    2 +-
 3 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 27ac8a4..ae3b426 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -824,7 +824,7 @@ static inline void kvm_free_irq_routing(struct kvm *kvm) {}
 #ifdef CONFIG_HAVE_KVM_EVENTFD

 void kvm_eventfd_init(struct kvm *kvm);
-int kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags);
+int kvm_irqfd(struct kvm *kvm, struct kvm_irqfd *args);
 void kvm_irqfd_release(struct kvm *kvm);
 void kvm_irq_routing_update(struct kvm *, struct kvm_irq_routing_table *);
 int kvm_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args);
@@ -833,7 +833,7 @@ int kvm_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args);

 static inline void kvm_eventfd_init(struct kvm *kvm) {}

-static inline int kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags)
+static inline int kvm_irqfd(struct kvm *kvm, struct kvm_irqfd *args)
 {
 	return -EINVAL;
 }
diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index f59c1e8..c307c24 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -198,7 +198,7 @@ static void irqfd_update(struct kvm *kvm, struct _irqfd *irqfd,
 }

 static int
-kvm_irqfd_assign(struct kvm *kvm, int fd, int gsi)
+kvm_irqfd_assign(struct kvm *kvm, struct kvm_irqfd *args)
 {
 	struct kvm_irq_routing_table *irq_rt;
 	struct _irqfd *irqfd, *tmp;
@@ -212,12 +212,12 @@ kvm_irqfd_assign(struct kvm *kvm, int fd, int gsi)
 		return -ENOMEM;

 	irqfd->kvm = kvm;
-	irqfd->gsi = gsi;
+	irqfd->gsi = args->gsi;
 	INIT_LIST_HEAD(&irqfd->list);
 	INIT_WORK(&irqfd->inject, irqfd_inject);
 	INIT_WORK(&irqfd->shutdown, irqfd_shutdown);

-	file = eventfd_fget(fd);
+	file = eventfd_fget(args->fd);
 	if (IS_ERR(file)) {
 		ret = PTR_ERR(file);
 		goto fail;
@@ -298,19 +298,19 @@ kvm_eventfd_init(struct kvm *kvm)
  * shutdown any irqfd's that match fd+gsi
  */
 static int
-kvm_irqfd_deassign(struct kvm *kvm, int fd, int gsi)
+kvm_irqfd_deassign(struct kvm *kvm, struct kvm_irqfd *args)
 {
 	struct _irqfd *irqfd, *tmp;
 	struct eventfd_ctx *eventfd;

-	eventfd = eventfd_ctx_fdget(fd);
+	eventfd = eventfd_ctx_fdget(args->fd);
 	if (IS_ERR(eventfd))
 		return PTR_ERR(eventfd);

 	spin_lock_irq(&kvm->irqfds.lock);

 	list_for_each_entry_safe(irqfd, tmp, &kvm->irqfds.items, list) {
-		if (irqfd->eventfd == eventfd && irqfd->gsi == gsi) {
+		if (irqfd->eventfd == eventfd && irqfd->gsi == args->gsi) {
 			/*
 			 * This rcu_assign_pointer is needed for when
 			 * another thread calls kvm_irq_routing_update before
@@ -338,12 +338,12 @@ kvm_irqfd_deassign(struct kvm *kvm, int fd, int gsi)
 }

 int
-kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags)
+kvm_irqfd(struct kvm *kvm, struct kvm_irqfd *args)
 {
-	if (flags & KVM_IRQFD_FLAG_DEASSIGN)
-		return kvm_irqfd_deassign(kvm, fd, gsi);
+	if (args->flags & KVM_IRQFD_FLAG_DEASSIGN)
+		return kvm_irqfd_deassign(kvm, args);

-	return kvm_irqfd_assign(kvm, fd, gsi);
+	return kvm_irqfd_assign(kvm, args);
 }

 /*
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 02cb440..b4ad14cc 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2059,7 +2059,7 @@ static long kvm_vm_ioctl(struct file *filp,
 		r = -EFAULT;
 		if (copy_from_user(&data, argp, sizeof data))
 			goto out;
-		r = kvm_irqfd(kvm, data.fd, data.gsi, data.flags);
+		r = kvm_irqfd(kvm, &data);
 		break;
 	}
 	case KVM_IOEVENTFD: {
[PATCH v2 2/6] kvm: Add missing KVM_IRQFD API documentation
Signed-off-by: Alex Williamson alex.william...@redhat.com
---

 Documentation/virtual/kvm/api.txt |   16
 1 file changed, 16 insertions(+)

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index 310fe50..ea9edce 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -1965,6 +1965,22 @@ return the hash table order in the parameter. (If the guest is using
 the virtualized real-mode area (VRMA) facility, the kernel will
 re-create the VMRA HPTEs on the next KVM_RUN of any vcpu.)

+4.76 KVM_IRQFD
+
+Capability: KVM_CAP_IRQFD
+Architectures: x86
+Type: vm ioctl
+Parameters: struct kvm_irqfd (in)
+Returns: 0 on success, -1 on error
+
+Allows setting an eventfd to directly trigger a guest interrupt.
+kvm_irqfd.fd specifies the file descriptor to use as the eventfd and
+kvm_irqfd.gsi specifies the irqchip pin toggled by this event. When
+an event is triggered on the eventfd, an interrupt is injected into
+the guest using the specified gsi pin. The irqfd is removed using
+the KVM_IRQFD_FLAG_DEASSIGN flag, specifying both kvm_irqfd.fd
+and kvm_irqfd.gsi.
+
 5. The kvm_run structure
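For readers unfamiliar with the eventfd side of this interface, the sketch below shows what "an event is triggered on the eventfd" means from userspace, using only the Linux eventfd(2) calls. The helper names `sketch_signal` and `sketch_drain` are hypothetical. The KVM part is intentionally left out: in a VMM the descriptor would be handed to KVM with the KVM_IRQFD vm ioctl, after which KVM consumes the signals rather than a reader in userspace.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Signal the eventfd once by adding 1 to its 64-bit counter. */
static int sketch_signal(int efd)
{
	uint64_t one = 1;

	return write(efd, &one, sizeof(one)) == (ssize_t)sizeof(one) ? 0 : -1;
}

/*
 * Read and reset the counter. Returns the number of signals that were
 * pending, or 0 when nothing is pending (or the read fails).
 */
static uint64_t sketch_drain(int efd)
{
	uint64_t count = 0;

	if (read(efd, &count, sizeof(count)) != (ssize_t)sizeof(count))
		return 0;
	return count;
}
```

The descriptor is expected to come from `eventfd(0, EFD_NONBLOCK)`, so that draining an idle eventfd returns immediately instead of blocking.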
Re: [RFC PATCH 10/17] PowerPC: booke64: Refactor exception prolog for save/restore regs
On Mon, 2012-06-25 at 15:26 +0300, Mihai Caraman wrote:
> Refactor exception prolog to allow save/restore register parameters.
> Add addition none definition for exception prolog usage. This is needed
> for exceptions like Guest Doorbell that use GSRRx registers which do not
> map on exception type.
>
> Signed-off-by: Mihai Caraman mihai.cara...@freescale.com
> ---
>  arch/powerpc/kernel/exceptions-64e.S |   23 ---
>  1 files changed, 8 insertions(+), 15 deletions(-)
>
> diff --git a/arch/powerpc/kernel/exceptions-64e.S b/arch/powerpc/kernel/exceptions-64e.S
> index 7215cc2..52aa96b 100644
> --- a/arch/powerpc/kernel/exceptions-64e.S
> +++ b/arch/powerpc/kernel/exceptions-64e.S
> @@ -35,7 +35,7 @@
>  #define SPECIAL_EXC_FRAME_SIZE	INT_FRAME_SIZE
>
>  /* Exception prolog code for all exceptions */
> -#define EXCEPTION_PROLOG(n, type, addition) \
> +#define EXCEPTION_PROLOG(n, type, srr0, srr1, addition) \
>  	mtspr	SPRN_SPRG_##type##_SCRATCH,r13; /* get spare registers */ \
>  	mfspr	r13,SPRN_SPRG_PACA; /* get PACA */ \
>  	std	r10,PACA_EX##type+EX_R10(r13); \
> @@ -44,54 +44,47 @@
>  	addition; /* additional code for that exc. */ \
>  	std	r1,PACA_EX##type+EX_R1(r13); /* save old r1 in the PACA */ \
>  	stw	r10,PACA_EX##type+EX_CR(r13); /* save old CR in the PACA */ \
> -	mfspr	r11,SPRN_##type##_SRR1;/* what are we coming from */ \
> +	mfspr	r11,srr1; /* what are we coming from */ \
>  	type##_SET_KSTACK; /* get special stack if necessary */ \
>  	andi.	r10,r11,MSR_PR; /* save stack pointer */ \
>  	beq	1f; /* branch around if supervisor */ \
>  	ld	r1,PACAKSAVE(r13); /* get kernel stack coming from usr */ \
>  1:	cmpdi	cr1,r1,0; /* check if SP makes sense */ \
>  	bge-	cr1,exc_##n##_bad_stack;/* bad stack (TODO: out of line) */ \
> -	mfspr	r10,SPRN_##type##_SRR0; /* read SRR0 before touching stack */
> +	mfspr	r10,srr0; /* read SRR0 before touching stack */

No, use the existing macro, use a ##type## specific to guest doorbells, with appropriate definitions of the corresponding SPRN_ macros.

Cheers,
Ben.
Re: [RFC PATCH 13/17] PowerPC: booke64: Use SPRG0/3 scratch for bolted TLB miss crit int
On Mon, 2012-06-25 at 15:26 +0300, Mihai Caraman wrote:
> Embedded.Hypervisor category defines GSPRG0..3 physical registers for
> guests. Avoid SPRG4-7 usage as scratch in host exception handlers,
> otherwise guest SPRG4-7 registers will be clobbered. For bolted TLB miss
> exception handlers, which is the version currently supported by KVM, use
> SPRN_SPRG_GEN_SCRATCH (aka SPRG0) instead of SPRN_SPRG_TLB_SCRATCH (aka
> SPRG6) and replace TLB with GEN PACA slots to keep consistency. For
> critical exception handler use SPRG3 instead of SPRG7.

Beware with SPRG3 usage. It's user space visible and we plan to use it for other things (see Anton's patch to stick topology information in there for use by the vdso). If you clobber it, you may want to restore it later.

I think Anton's patch should put the proper value we want in the PACA anyway since we also need to restore it on exit from KVM, so you can still use it as scratch, just restore the value before going to C.

Cheers,
Ben.

> Signed-off-by: Mihai Caraman mihai.cara...@freescale.com
> ---
>  arch/powerpc/include/asm/exception-64e.h |   14 +++---
>  arch/powerpc/include/asm/reg.h           |    6 +++---
>  arch/powerpc/mm/tlb_low_64e.S            |   28 ++--
>  3 files changed, 24 insertions(+), 24 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/exception-64e.h b/arch/powerpc/include/asm/exception-64e.h
> index ac13add..c90a9a4 100644
> --- a/arch/powerpc/include/asm/exception-64e.h
> +++ b/arch/powerpc/include/asm/exception-64e.h
> @@ -38,8 +38,11 @@
>   */
>
>
> -/* We are out of SPRGs so we save some things in the PACA. The normal
> - * exception frame is smaller than the CRIT or MC one though
> +/* We are out of SPRGs so we save some things in the 8 slots available in PACA.
> + * The normal exception frame is smaller than the CRIT or MC one though
> + *
> + * Bolted TLB miss exception variant also uses these slots which in combination
> + * with pgd and kernel_pgd fits in one 64-byte cache line.
>   */
>  #define EX_R1		(0 * 8)
>  #define EX_CR		(1 * 8)
> @@ -47,13 +50,10 @@
>  #define EX_R11		(3 * 8)
>  #define EX_R14		(4 * 8)
>  #define EX_R15		(5 * 8)
> +#define EX_R16		(6 * 8)
>
>  /*
> - * The TLB miss exception uses different slots.
> - *
> - * The bolted variant uses only the first six fields,
> - * which in combination with pgd and kernel_pgd fits in
> - * one 64-byte cache line.
> + * PACA slots offset for standard TLB miss exception.
>   */
>
>  #define EX_TLB_R10	( 0 * 8)
> diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
> index f0cb7f4..51c14a7 100644
> --- a/arch/powerpc/include/asm/reg.h
> +++ b/arch/powerpc/include/asm/reg.h
> @@ -760,10 +760,10 @@
>   * 64-bit embedded
>   *	- SPRG0 generic exception scratch
>   *	- SPRG2 TLB exception stack
> - *	- SPRG3 unused (user visible)
> + *	- SPRG3 critical exception scratch (user visible)
>   *	- SPRG4 unused (user visible)
>   *	- SPRG6 TLB miss scratch (user visible, sorry !)
> - *	- SPRG7 critical exception scratch
> + *	- SPRG7 unused (user visible)
>   *	- SPRG8 machine check exception scratch
>   *	- SPRG9 debug exception scratch
>   *
> @@ -857,7 +857,7 @@
>
>  #ifdef CONFIG_PPC_BOOK3E_64
>  #define SPRN_SPRG_MC_SCRATCH	SPRN_SPRG8
> -#define SPRN_SPRG_CRIT_SCRATCH	SPRN_SPRG7
> +#define SPRN_SPRG_CRIT_SCRATCH	SPRN_SPRG3
>  #define SPRN_SPRG_DBG_SCRATCH	SPRN_SPRG9
>  #define SPRN_SPRG_TLB_EXFRAME	SPRN_SPRG2
>  #define SPRN_SPRG_TLB_SCRATCH	SPRN_SPRG6
> diff --git a/arch/powerpc/mm/tlb_low_64e.S b/arch/powerpc/mm/tlb_low_64e.S
> index 88feaaa..4192ade 100644
> --- a/arch/powerpc/mm/tlb_low_64e.S
> +++ b/arch/powerpc/mm/tlb_low_64e.S
> @@ -40,36 +40,36 @@
>   **/
>
>  .macro tlb_prolog_bolted intnum addr
> -	mtspr	SPRN_SPRG_TLB_SCRATCH,r13
> +	mtspr	SPRN_SPRG_GEN_SCRATCH,r13
>  	mfspr	r13,SPRN_SPRG_PACA
> -	std	r10,PACA_EXTLB+EX_TLB_R10(r13)
> +	std	r10,PACA_EXGEN+EX_R10(r13)
>  	mfcr	r10
> -	std	r11,PACA_EXTLB+EX_TLB_R11(r13)
> +	std	r11,PACA_EXGEN+EX_R11(r13)
>  #ifdef CONFIG_KVM_BOOKE_HV
>  BEGIN_FTR_SECTION
>  	mfspr	r11, SPRN_SRR1
>  END_FTR_SECTION_IFSET(CPU_FTR_EMB_HV)
>  #endif
>  	DO_KVM	\intnum, SPRN_SRR1
> -	std	r16,PACA_EXTLB+EX_TLB_R16(r13)
> +	std	r16,PACA_EXGEN+EX_R16(r13)
>  	mfspr	r16,\addr /* get faulting address */
> -	std	r14,PACA_EXTLB+EX_TLB_R14(r13)
> +	std	r14,PACA_EXGEN+EX_R14(r13)
>  	ld	r14,PACAPGD(r13)
> -	std	r15,PACA_EXTLB+EX_TLB_R15(r13)
> -	std	r10,PACA_EXTLB+EX_TLB_CR(r13)
> +	std	r15,PACA_EXGEN+EX_R15(r13)
> +	std	r10,PACA_EXGEN+EX_CR(r13)
>  	TLB_MISS_PROLOG_STATS_BOLTED
>  .endm
>
>  .macro tlb_epilog_bolted
> -	ld	r14,PACA_EXTLB+EX_TLB_CR(r13)
> -	ld	r10,PACA_EXTLB+EX_TLB_R10(r13)
> -	ld
Re: [RFC PATCH 13/17] PowerPC: booke64: Use SPRG0/3 scratch for bolted TLB miss crit int
On 06/25/2012 07:26 AM, Mihai Caraman wrote:
> Embedded.Hypervisor category defines GSPRG0..3 physical registers for
> guests. Avoid SPRG4-7 usage as scratch in host exception handlers,
> otherwise guest SPRG4-7 registers will be clobbered. For bolted TLB miss
> exception handlers, which is the version currently supported by KVM, use
> SPRN_SPRG_GEN_SCRATCH (aka SPRG0) instead of SPRN_SPRG_TLB_SCRATCH (aka
> SPRG6) and replace TLB with GEN PACA slots to keep consistency. For
> critical exception handler use SPRG3 instead of SPRG7.

extlb is in the same cache line as other TLB stuff we need, while exgen isn't. Let's stick with extlb.

-Scott
Re: [RFC PATCH 03/17] KVM: PPC64: booke: Add EPCR support in sregs
On 06/25/2012 07:26 AM, Mihai Caraman wrote:
> Add KVM_SREGS_E_64 feature and EPCR spr support in get/set sregs
> for 64-bit hosts.
>
> Signed-off-by: Mihai Caraman mihai.cara...@freescale.com
> ---
>  arch/powerpc/kvm/booke.c |   14 ++
>  1 files changed, 14 insertions(+), 0 deletions(-)
>
> diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
> index f9fa260..d15c4b5 100644
> --- a/arch/powerpc/kvm/booke.c
> +++ b/arch/powerpc/kvm/booke.c
> @@ -1052,6 +1052,9 @@ static void get_sregs_base(struct kvm_vcpu *vcpu,
>  	u64 tb = get_tb();
>
>  	sregs->u.e.features |= KVM_SREGS_E_BASE;
> +#ifdef CONFIG_64BIT
> +	sregs->u.e.features |= KVM_SREGS_E_64;
> +#endif
>
>  	sregs->u.e.csrr0 = vcpu->arch.csrr0;
>  	sregs->u.e.csrr1 = vcpu->arch.csrr1;
> @@ -1063,6 +1066,9 @@ static void get_sregs_base(struct kvm_vcpu *vcpu,
>  	sregs->u.e.dec = kvmppc_get_dec(vcpu, tb);
>  	sregs->u.e.tb = tb;
>  	sregs->u.e.vrsave = vcpu->arch.vrsave;
> +#ifdef CONFIG_64BIT
> +	sregs->u.e.epcr = vcpu->arch.epcr;
> +#endif
>  }
>
>  static int set_sregs_base(struct kvm_vcpu *vcpu,
> @@ -1071,6 +1077,11 @@ static int set_sregs_base(struct kvm_vcpu *vcpu,
>  	if (!(sregs->u.e.features & KVM_SREGS_E_BASE))
>  		return 0;
>
> +#ifdef CONFIG_64BIT
> +	if (!(sregs->u.e.features & KVM_SREGS_E_64))
> +		return 0;
> +#endif

This means that a QEMU targeting a 32-bit guest won't be able to set any special registers, if it sets feature bits manually rather than getting them from GET_SREGS. This check should only qualify whether we look at sregs.u.e.epcr, not whether this function works at all.

BTW, shouldn't the BASE check return an error rather than silently no-op?

-Scott
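One way to read the suggestion above is sketched below in standalone C. This is an interpretation, not the eventual kernel change: the `sketch_` types are hypothetical and heavily reduced, the flag values are placeholders rather than the ones in the kernel headers, and the BASE check is kept as a silent no-op because the follow-up question about returning an error is left open in the review.

```c
#include <assert.h>
#include <stdint.h>

/* Placeholder feature bits; not taken from the kernel headers. */
#define SKETCH_SREGS_E_BASE (1u << 0)
#define SKETCH_SREGS_E_64   (1u << 1)

/* Reduced, hypothetical stand-ins for the sregs and vcpu state. */
struct sketch_sregs {
	uint32_t features;
	uint32_t csrr0;
	uint32_t epcr;
};

struct sketch_vcpu {
	uint32_t csrr0;
	uint32_t epcr;
};

static int sketch_set_sregs_base(struct sketch_vcpu *vcpu,
				 const struct sketch_sregs *sregs)
{
	/* Without BASE nothing is applied (silent no-op, as in the patch). */
	if (!(sregs->features & SKETCH_SREGS_E_BASE))
		return 0;

	/* Base registers are set regardless of the 64-bit feature bit. */
	vcpu->csrr0 = sregs->csrr0;

	/* Only the EPCR copy is qualified by the 64-bit feature bit. */
	if (sregs->features & SKETCH_SREGS_E_64)
		vcpu->epcr = sregs->epcr;

	return 0;
}
```

With this shape, a caller that sets only the BASE bit still gets its base registers applied, and `epcr` is simply left untouched.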