[Maeil Business Newspaper] Virus Detected - Warning
From: kvm@vger.kernel.org
To: SEMIANGEL
Subject: Returned mail: Data format error
Attached file: attachment.zip (attachment.exe)
Scan result: attachment.zip/attachment.exe (Win32/MyDoom.worm.M) - archive (please scan again after extracting). attachment.zip - deleted

Hello. This is the mail server administrator at the Maeil Business Newspaper. A virus was found in the mail you sent. If disinfection is possible it is attempted; otherwise the file is deleted. This message is sent automatically and cannot be replied to.
Re: [PATCH v2 5/6] x86: Enable ack interrupt on vmexit
On Mon, Nov 26, 2012 at 05:44:29AM +0000, Zhang, Yang Z wrote:
Avi Kivity wrote on 2012-11-25:
On 11/25/2012 03:03 PM, Gleb Natapov wrote:
On Sun, Nov 25, 2012 at 02:55:26PM +0200, Avi Kivity wrote:
On 11/22/2012 05:22 PM, Gleb Natapov wrote:
On Wed, Nov 21, 2012 at 04:09:38PM +0800, Yang Zhang wrote:

Ack interrupt on vmexit is required by Posted Interrupt. With it, when an external interrupt causes a vmexit, the cpu will acknowledge the interrupt controller and save the interrupt's vector in the vmcs. There are several approaches to enable it. This patch uses a simple way: re-generate an interrupt via self IPI.

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 7949d21..f6ef090 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2525,7 +2525,8 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf)
 #ifdef CONFIG_X86_64
 	min |= VM_EXIT_HOST_ADDR_SPACE_SIZE;
 #endif
-	opt = VM_EXIT_SAVE_IA32_PAT | VM_EXIT_LOAD_IA32_PAT;
+	opt = VM_EXIT_SAVE_IA32_PAT | VM_EXIT_LOAD_IA32_PAT |
+		VM_EXIT_ACK_INTR_ON_EXIT;

Always? Do it only if posted interrupts are actually available and going to be used.

Why not always? Better to have a single code path for host interrupts (and as Yang notes, the new path is faster as well).

Is it? The current path is: vm exit -> KVM vmexit handler (interrupts disabled) -> KVM re-enables interrupts -> cpu acks the interrupt, and the interrupt is delivered through the host IDT. The proposed path is: CPU acks interrupt -> vm exit -> KVM vmexit handler (interrupts disabled) -> eoi -> self IPI -> KVM re-enables interrupts -> cpu acks the interrupt, and the interrupt is delivered through the host IDT. Am I missing something?

Yes, you're missing the part where I didn't write that the new path should avoid the IDT and dispatch the interrupt directly, by emulating an interrupt frame directly. Can be as simple as pushf; push cs; call interrupt_table[vector * 8]. Of course we need to verify that no interrupt uses the IST or a task gate.

How can we call the interrupt table directly?
I don't think we can expose the idt_table to a module.

No, but we can add a function to entry_(64|32).S that dispatches via idt_table and expose that. Avi's idea is worth exploring before going the self-IPI way.

Anyway, to simplify the implementation, I will follow Gleb's suggestion: only enable ack intr on exit when PI is enabled, and self IPI should be enough. Any comments?

Best regards,
Yang

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
KVM call agenda for 2012-11-27
Hi

Please send in any agenda topics you are interested in.

Later, Juan.
Re: [PATCH v2 6/6] x86, apicv: Add Posted Interrupt supporting
On Mon, Nov 26, 2012 at 03:51:04AM +0000, Zhang, Yang Z wrote:
Gleb Natapov wrote on 2012-11-25:
On Wed, Nov 21, 2012 at 04:09:39PM +0800, Yang Zhang wrote:

Posted Interrupt allows vAPIC interrupts to be injected into the guest directly without any vmexit.

- When delivering an interrupt to the guest, if the target vcpu is running, update the Posted-interrupt requests bitmap and send a notification event to the vcpu. Then the vcpu will handle this interrupt automatically, without any software involvement.

Looks like you are allocating one irq vector per vcpu per pcpu and then migrating it or reallocating it when the vcpu moves from one pcpu to another. This is not scalable, and irq migration slows things down. What's wrong with allocating one global vector for posted interrupts during vmx initialization and using it for all vcpus?

Consider the following situation: if vcpu A is running when a notification event belonging to vcpu B arrives, then since the vector matches vcpu A's notification vector, this event will be consumed by vcpu A (even though it does nothing) and the interrupt cannot be handled in time.

The exact same situation is possible with your code. vcpu B can be migrated from a pcpu and vcpu A will take its place and will be assigned the same vector as vcpu B. But I fail to see why this is a problem. vcpu A will ignore the PI since the pir will be empty, and vcpu B should detect the new event during the next vmentry.

- If the target vcpu is not running or there is already a notification event pending in the vcpu, do nothing. The interrupt will be handled in the old way.
Signed-off-by: Yang Zhang yang.z.zh...@intel.com
---
 arch/x86/include/asm/kvm_host.h |   3 +
 arch/x86/include/asm/vmx.h      |   4 +
 arch/x86/kernel/apic/io_apic.c  | 138
 arch/x86/kvm/lapic.c            |  31 ++-
 arch/x86/kvm/lapic.h            |   8 ++
 arch/x86/kvm/vmx.c              | 192 +--
 arch/x86/kvm/x86.c              |   2 +
 include/linux/kvm_host.h        |   1 +
 virt/kvm/kvm_main.c             |   2 +
 9 files changed, 372 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 8e07a86..1145894 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -683,9 +683,12 @@ struct kvm_x86_ops {
 	void (*enable_irq_window)(struct kvm_vcpu *vcpu);
 	void (*update_cr8_intercept)(struct kvm_vcpu *vcpu, int tpr, int irr);
 	int (*has_virtual_interrupt_delivery)(struct kvm_vcpu *vcpu);
+	int (*has_posted_interrupt)(struct kvm_vcpu *vcpu);
 	void (*update_irq)(struct kvm_vcpu *vcpu);
 	void (*set_eoi_exitmap)(struct kvm_vcpu *vcpu, int vector,
 			int need_eoi, int global);
+	int (*send_nv)(struct kvm_vcpu *vcpu, int vector);
+	void (*pi_migrate)(struct kvm_vcpu *vcpu);
 	int (*set_tss_addr)(struct kvm *kvm, unsigned int addr);
 	int (*get_tdp_level)(void);
 	u64 (*get_mt_mask)(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);
diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 1003341..7b9e1d0 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -152,6 +152,7 @@
 #define PIN_BASED_EXT_INTR_MASK                 0x00000001
 #define PIN_BASED_NMI_EXITING                   0x00000008
 #define PIN_BASED_VIRTUAL_NMIS                  0x00000020
+#define PIN_BASED_POSTED_INTR                   0x00000080
 
 #define VM_EXIT_SAVE_DEBUG_CONTROLS             0x00000002
 #define VM_EXIT_HOST_ADDR_SPACE_SIZE            0x00000200
@@ -174,6 +175,7 @@
 /* VMCS Encodings */
 enum vmcs_field {
 	VIRTUAL_PROCESSOR_ID            = 0x00000000,
+	POSTED_INTR_NV                  = 0x00000002,
 	GUEST_ES_SELECTOR               = 0x00000800,
 	GUEST_CS_SELECTOR               = 0x00000802,
 	GUEST_SS_SELECTOR               = 0x00000804,
@@ -208,6 +210,8 @@ enum vmcs_field {
 	VIRTUAL_APIC_PAGE_ADDR_HIGH     = 0x00002013,
 	APIC_ACCESS_ADDR                = 0x00002014,
 	APIC_ACCESS_ADDR_HIGH           = 0x00002015,
+	POSTED_INTR_DESC_ADDR           = 0x00002016,
+	POSTED_INTR_DESC_ADDR_HIGH      = 0x00002017,
 	EPT_POINTER                     = 0x0000201a,
 	EPT_POINTER_HIGH                = 0x0000201b,
 	EOI_EXIT_BITMAP0                = 0x0000201c,
diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
index 1817fa9..97cb8ee 100644
--- a/arch/x86/kernel/apic/io_apic.c
+++ b/arch/x86/kernel/apic/io_apic.c
@@ -3277,6 +3277,144 @@ int arch_setup_dmar_msi(unsigned int irq)
 }
 #endif
 
+static int
+pi_set_affinity(struct irq_data *data, const struct cpumask *mask,
+		bool force)
+{
+	unsigned int dest;
+	struct irq_cfg *cfg = (struct irq_cfg *)data->chip_data;
+
+	if (cpumask_equal(cfg->domain, mask))
+		return
Re: qemu-kvm-1.2.0: double free or corruption in VNC code
On Fri, Nov 23, 2012 at 08:24:32PM +0100, Nikola Ciprich wrote:

Please also post the exact package version you are using - the line numbers change between releases and depend on which patches have been applied to the source tree. The exact distro package version allows me to download the source tree that was used to build this binary and check the correct line numbers.

Hello Stefan, it's based on the Fedora rawhide pkg 2:1.2.0-16 with a few minor tweaks to compile on CentOS 6. I've uploaded the sources used for the build here: http://nik.lbox.cz/download/qemu-kvm-1.2.0.tar.bz2 (after make clean) or http://nik.lbox.cz/download/qemu-1.2.0-lb6.01.src.rpm - will this help?

Thanks, I looked at the backtrace in the source tree. Unfortunately the root cause is not obvious to me. I was looking for a double-free of the zrle buffers. If this bug repeatedly bites you, try a different VNC encoding as a workaround (not ZRLE). Perhaps someone more familiar with the VNC code will be able to see it. All the information you have provided is helpful.

Stefan
Re: KVM Disk i/o or VM activities causes soft lockup?
On Fri, Nov 23, 2012 at 10:34:16AM -0800, Vincent Li wrote:
On Thu, Nov 22, 2012 at 11:29 PM, Stefan Hajnoczi stefa...@gmail.com wrote:
On Wed, Nov 21, 2012 at 03:36:50PM -0800, Vincent Li wrote:

We have users running on a redhat based distro (kernel 2.6.32-131.21.1.el6.x86_64) with kvm. When a customer made a cron job script to copy large files between kvm guests, or some other user space program leads to disk i/o or VM activity, users get the following soft lockup message on the console:

Nov 17 13:44:46 slot1/luipaard100a err kernel: BUG: soft lockup - CPU#4 stuck for 61s! [qemu-kvm:6795]
Nov 17 13:44:46 slot1/luipaard100a warning kernel: Modules linked in: ebt_vlan nls_utf8 isofs ebtable_filter ebtables 8021q garp bridge stp llc ipt_REJECT iptable_filter xt_NOTRACK nf_conntrack iptable_raw ip_tables loop ext2 binfmt_misc hed womdict(U) vnic(U) parport_pc lp parport predis(U) lasthop(U) ipv6 toggler vhost_net tun kvm_intel kvm jiffies(U) sysstats hrsleep i2c_dev datastor(U) linux_user_bde(P)(U) linux_kernel_bde(P)(U) tg3 libphy serio_raw i2c_i801 i2c_core ehci_hcd raid1 raid0 virtio_pci virtio_blk virtio virtio_ring mvsas libsas scsi_transport_sas mptspi mptscsih mptbase scsi_transport_spi 3w_9xxx sata_svw(U) ahci serverworks sata_sil ata_piix libata sd_mod crc_t10dif amd74xx piix ide_gd_mod ide_core dm_snapshot dm_mirror dm_region_hash dm_log dm_mod ext3 jbd mbcache
Nov 17 13:44:46 slot1/luipaard100a warning kernel: Pid: 6795, comm: qemu-kvm Tainted: P 2.6.32-131.21.1.el6.f5.x86_64 #1
Nov 17 13:44:46 slot1/luipaard100a warning kernel: Call Trace:
Nov 17 13:44:46 slot1/luipaard100a warning kernel: IRQ [81084f95] ? get_timestamp+0x9/0xf
Nov 17 13:44:46 slot1/luipaard100a warning kernel: [810855d6] ? watchdog_timer_fn+0x130/0x178
Nov 17 13:44:46 slot1/luipaard100a warning kernel: [81059f11] ? __run_hrtimer+0xa3/0xff
Nov 17 13:44:46 slot1/luipaard100a warning kernel: [8105a188] ? hrtimer_interrupt+0xe6/0x190
Nov 17 13:44:46 slot1/luipaard100a warning kernel: [8105a14b] ? hrtimer_interrupt+0xa9/0x190
Nov 17 13:44:46 slot1/luipaard100a warning kernel: [8101e5a9] ? hpet_interrupt_handler+0x26/0x2d
Nov 17 13:44:46 slot1/luipaard100a warning kernel: [8105a26f] ? hrtimer_peek_ahead_timers+0x9/0xd
Nov 17 13:44:46 slot1/luipaard100a warning kernel: [81044fcc] ? __do_softirq+0xc5/0x17a
Nov 17 13:44:46 slot1/luipaard100a warning kernel: [81003adc] ? call_softirq+0x1c/0x28
Nov 17 13:44:46 slot1/luipaard100a warning kernel: [8100506b] ? do_softirq+0x31/0x66
Nov 17 13:44:46 slot1/luipaard100a warning kernel: [81003673] ? call_function_interrupt+0x13/0x20
Nov 17 13:44:46 slot1/luipaard100a warning kernel: EOI [a0219986] ? vmx_get_msr+0x0/0x123 [kvm_intel]
Nov 17 13:44:46 slot1/luipaard100a warning kernel: [a01d11c0] ? kvm_arch_vcpu_ioctl_run+0x80e/0xaf1 [kvm]
Nov 17 13:44:46 slot1/luipaard100a warning kernel: [a01d11b4] ? kvm_arch_vcpu_ioctl_run+0x802/0xaf1 [kvm]
Nov 17 13:44:46 slot1/luipaard100a warning kernel: [8114e59b] ? inode_has_perm+0x65/0x72
Nov 17 13:44:46 slot1/luipaard100a warning kernel: [a01c77f5] ? kvm_vcpu_ioctl+0xf2/0x5ba [kvm]
Nov 17 13:44:46 slot1/luipaard100a warning kernel: [8114e642] ? file_has_perm+0x9a/0xac
Nov 17 13:44:46 slot1/luipaard100a warning kernel: [810f9ec2] ? vfs_ioctl+0x21/0x6b
Nov 17 13:44:46 slot1/luipaard100a warning kernel: [810fa406] ? do_vfs_ioctl+0x487/0x4da
Nov 17 13:44:46 slot1/luipaard100a warning kernel: [810fa4aa] ? sys_ioctl+0x51/0x70
Nov 17 13:44:46 slot1/luipaard100a warning kernel: [810029d1] ? system_call_fastpath+0x3c/0x41

This soft lockup is reported on the host?

Stefan

Yes, it is on the host. We just recommend users avoid large file copying; we're wondering if there is a potential kernel bug. The softlockup backtrace seems to point to hrtimer and softirq. My naive understanding is that the watchdog thread is on top of hrtimer, which is on top of softirq.

Since the soft lockup detector is firing on the host, this seems like a hardware/driver problem.
Have you ever had soft lockups running non-KVM workloads on this host?

Stefan
Re: Re: Re: Re: Re: [RFC PATCH 0/2] kvm/vmx: Output TSC offset
Hi Marcelo,

(2012/11/24 7:46), Marcelo Tosatti wrote:
On Thu, Nov 22, 2012 at 02:21:20PM +0900, Yoshihiro YUNOMAE wrote:
(2012/11/21 7:51), Marcelo Tosatti wrote:
On Tue, Nov 20, 2012 at 07:36:33PM +0900, Yoshihiro YUNOMAE wrote:
Sorry for the late reply.
(2012/11/17 4:15), Marcelo Tosatti wrote:
On Wed, Nov 14, 2012 at 05:26:10PM +0900, Yoshihiro YUNOMAE wrote:
Thank you for commenting on my patch set.
(2012/11/14 11:31), Steven Rostedt wrote:
On Tue, 2012-11-13 at 18:03 -0800, David Sharp wrote:
On Tue, Nov 13, 2012 at 6:00 PM, Steven Rostedt rost...@goodmis.org wrote:
On Wed, 2012-11-14 at 10:36 +0900, Yoshihiro YUNOMAE wrote:

To merge the data like the previous pattern, we apply this patch set. Then, we can get the TSC offset of the guest as follows:

$ dmesg | grep kvm
[   57.717180] kvm: (2687) write TSC offset 18446743360465545001, now clock ##
                     |                |                            |
                    PID          TSC offset                 HOST TSC value

Using printk to export something like this is IMO a nasty hack. Can't we create a /sys or /proc file to export the same thing?

Since the value changes over the course of the trace, and seems to be part of the context of the trace, I think I'd include it as a tracepoint.

I'm fine with that too.

Using a tracepoint is a nice idea, but there is one problem. Here, our discussion point is that the event in which the TSC offset is changed does not occur frequently, but the buffer must keep the event data. There are two ideas for using a tracepoint. First, we define a new tracepoint for a changed TSC offset. This is simple and the overhead will be low. However, this trace event stored in the buffer will be overwritten by other trace events because the TSC offset event does not occur frequently. Second, we add TSC offset information to a tracepoint that occurs frequently. For example, we assume that TSC offset information is added to the arguments of trace_kvm_exit().

The TSC offset is in the host trace.
So given a host trace with two TSC offset updates, how do you know which events in the guest trace (containing a number of events) refer to which tsc offset update? Unless I am missing something, you can't solve this easily (well, except by exporting information to the guest that allows it to transform RDTSC -> host TSC value, which can be done via pvclock).

As you say, TSC offset events are in the host trace, but we don't need to notify guests of TSC offset updates. The offset event will output the next TSC offset value and the current TSC value, so we can calculate the guest TSC (T1) for the event. Guest TSCs since T1 can be converted to host TSCs using the TSC offset, so we can integrate those trace data.

Think of this scenario. Host trace:

1h.   event tsc_write tsc_offset=-1000
3h.   vmenter
4h.   vmexit
...   (event sequence)
99h.  vmexit
100h. event tsc_write tsc_offset=-2000
101h. vmenter
...   (event sequence)
500h. event tsc_write tsc_offset=-3000

Then a guest trace containing events with a TSC timestamp. Which tsc_offset to use? (That is the problem, which unless I am mistaken can only be solved easily if the guest can convert RDTSC -> TSC of host.)

There are three cases in which the TSC offset changes:

1. Reset TSC at guest boot time
2. Adjust TSC offset due to some host problem
3. Write TSC on guests

The scenario you mentioned is case 3, so we'll discuss this case. Here, we assume that a guest is allocated a single CPU for the sake of ease.

If a guest executes write_tsc, TSC values jump forward or backward. For the forward case, the trace data are as follows:

   host                          guest
   cycles  events                cycles  events
   3000    tsc_offset=-2950
   3001    kvm_enter
                                 53      eventX
                                 100     (write_tsc=+900)
   3060    kvm_exit
   3075    tsc_offset=-2050
   3080    kvm_enter
                                 1050    event1
                                 1055    event2
   ...

This case is simple. The guest TSC of the first kvm_enter is calculated as follows:

  (host TSC of kvm_enter) + (current tsc_offset) = 3001 - 2950 = 51

Similarly, the guest TSC of the second kvm_enter is 130. So the guest events between 51 and 130, that is, "53 eventX", are inserted between the first pair of kvm_enter and kvm_exit. To insert guest events between 51 and 130, we convert the guest TSC to the host TSC using TSC offset 2950.

For the backward case, the trace data are as follows:

   host                          guest
   cycles  events                cycles  events
   3000    tsc_offset=-2950
   3001    kvm_enter
                                 53      eventX
                                 100     (write_tsc=-50)
   3060    kvm_exit
Re: Invoking guest os script, without guest having network connectivity?
On Sat, Nov 24, 2012 at 06:40:39PM +0200, Shlomi Tsadok wrote:

I'm looking for a way to configure the guest networking (including IP) dynamically, using a custom script, right after VM creation. Is there a feature in KVM/libvirt similar to Invoke-VMScript in VMware's PowerCLI? It allows you to run a script in the guest OS, even before the guest has network connectivity (the host talks to the vmtools agent that's installed in the guest).

The QEMU guest agent (qemu-ga) has features that may allow you to do this. I'm not familiar enough with it to give details; here are some alternatives:

If you provide a kernel + initramfs externally (outside the guest disk image) you can add files to the initramfs. This allows you to customize boot up. Alternatively you can use PXE booting to achieve the same thing.

Finally, you could edit the disk image using libguestfs or qemu-nbd before booting it for the first time. This gives you a chance to customize configuration and startup files.

Stefan
[Bug 50921] kvm hangs booting Windows 2000
https://bugzilla.kernel.org/show_bug.cgi?id=50921

Alan a...@lxorguk.ukuu.org.uk changed:

           What    |Removed |Added
----------------------------------------------------------------------
                 CC|        |a...@lxorguk.ukuu.org.uk

--- Comment #12 from Alan a...@lxorguk.ukuu.org.uk 2012-11-26 12:09:34 ---

vboxpci 22709 0 - Live 0xf89bb000 (O)
vboxnetadp 25431 0 - Live 0xf8aa6000 (O)
vboxnetflt 22987 0 - Live 0xf8aae000 (O)
vboxdrv 227471 3 vboxpci,vboxnetadp,vboxnetflt, Live 0xf91d4000 (O)

Shouldn't be interfering, but probably a good idea to test without.

--
Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email
--- You are receiving this mail because: ---
You are watching the assignee of the bug.
[PATCH V3 RFC 0/2] kvm: Improving undercommit scenarios
In some special scenarios like #vcpu <= #pcpu, the PLE handler may prove very costly, because there is no need to iterate over vcpus and do unsuccessful yield_to burning CPU.

The first patch optimizes yield_to by bailing out when there is no need to continue (i.e., when there is only one task in the source and target rq). The second patch uses that in the PLE handler. Further, when a yield_to fails we do not immediately leave the PLE handler; instead we try thrice, to have better statistical protection against a false return. Otherwise that would affect moderate overcommit cases.

Results on a 3.7.0-rc6 kernel show around 140% improvement for ebizzy 1x and around 51% for dbench 1x on a 32-core PLE machine with a 32-vcpu guest.

base = 3.7.0-rc6
machine: 32 core mx3850 x5 PLE mc

ebizzy (rec/sec, higher is better)
        base        stdev      patched      stdev    %improve
1x   2511.3000    21.5409    6051.8000   170.2592   140.98276
2x   2679.4000   332.4482    2692.3000   251.4005     0.48145
3x   2253.5000   266.4243    2192.1667   178.9753    -2.72169
4x   1784.3750   102.2699    2018.7500   187.5723    13.13485

dbench (throughput in MB/sec, higher is better)
        base        stdev      patched      stdev    %improve
1x   6677.4080   638.5048   10098.0060  3449.7026    51.22643
2x   2012.6760    64.7642    2019.0440    62.6702     0.31639
3x   1302.0783    40.8336    1292.7517    27.0515    -0.71629
4x   3043.1725  3243.7281    4664.4662  5946.5741    53.27643

For reference, the no-PLE results:

ebizzy_1x_nople 7592.6000 rec/sec
dbench_1x_nople 7853.6960 MB/sec

The results say we can still improve by 60% for ebizzy, but overall we are getting impressive performance with the patches.

Changes since V2:
- Dropped global measures usage patch (Peter Zijlstra)
- Do not bail out on first failure (Avi Kivity)
- Try thrice on failure of yield_to to get statistically more correct behaviour.
Changes since V1:
- Discard the idea of exporting nr_running and optimize in core scheduler (Peter)
- Use yield() instead of schedule in overcommit scenarios (Rik)
- Use loadavg knowledge to detect undercommit/overcommit

Peter Zijlstra (1):
  Bail out of yield_to when source and target runqueue has one task

Raghavendra K T (1):
  Handle yield_to failure return for potential undercommit case

Please let me know your comments and suggestions.

Link for V2: https://lkml.org/lkml/2012/10/29/287
Link for V1: https://lkml.org/lkml/2012/9/21/168

 kernel/sched/core.c | 25 +++-
 virt/kvm/kvm_main.c | 26 +++-
 2 files changed, 35 insertions(+), 16 deletions(-)
[PATCH V3 RFC 1/2] sched: Bail out of yield_to when source and target runqueue has one task
From: Peter Zijlstra pet...@infradead.org

In case of undercommitted scenarios, especially in large guests, yield_to overhead is significantly high. When the run queue length of source and target is one, take the opportunity to bail out and return -ESRCH. This return condition can be further exploited to quickly come out of the PLE handler.

(History: Raghavendra initially worked on breaking out of the kvm ple handler upon seeing source runqueue length = 1, but it had to export rq length). Peter came up with the elegant idea of returning -ESRCH from scheduler core.

Signed-off-by: Peter Zijlstra pet...@infradead.org
Raghavendra, checking the rq length of the target vcpu condition added. (thanks Avi)
Reviewed-by: Srikar Dronamraju sri...@linux.vnet.ibm.com
Signed-off-by: Raghavendra K T raghavendra...@linux.vnet.ibm.com
---
 kernel/sched/core.c | 25 +++-
 1 file changed, 19 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2d8927f..fc219a5 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4289,7 +4289,10 @@ EXPORT_SYMBOL(yield);
  * It's the caller's job to ensure that the target task struct
  * can't go away on us before we can do any checks.
  *
- * Returns true if we indeed boosted the target task.
+ * Returns:
+ *	true (>0) if we indeed boosted the target task.
+ *	false (0) if we failed to boost the target.
+ *	-ESRCH if there's no task to yield to.
  */
 bool __sched yield_to(struct task_struct *p, bool preempt)
 {
@@ -4303,6 +4306,15 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
 again:
 	p_rq = task_rq(p);
 
+	/*
+	 * If we're the only runnable task on the rq and target rq also
+	 * has only one task, there's absolutely no point in yielding.
+	 */
+	if (rq->nr_running == 1 && p_rq->nr_running == 1) {
+		yielded = -ESRCH;
+		goto out_irq;
+	}
+
 	double_rq_lock(rq, p_rq);
 	while (task_rq(p) != p_rq) {
 		double_rq_unlock(rq, p_rq);
@@ -4310,13 +4322,13 @@ again:
 	}
 
 	if (!curr->sched_class->yield_to_task)
-		goto out;
+		goto out_unlock;
 
 	if (curr->sched_class != p->sched_class)
-		goto out;
+		goto out_unlock;
 
 	if (task_running(p_rq, p) || p->state)
-		goto out;
+		goto out_unlock;
 
 	yielded = curr->sched_class->yield_to_task(rq, p, preempt);
 	if (yielded) {
@@ -4329,11 +4341,12 @@ again:
 		resched_task(p_rq->curr);
 	}
 
-out:
+out_unlock:
 	double_rq_unlock(rq, p_rq);
+out_irq:
 	local_irq_restore(flags);
 
-	if (yielded)
+	if (yielded > 0)
 		schedule();
 
 	return yielded;
[PATCH V3 RFC 2/2] kvm: Handle yield_to failure return code for potential undercommit case
From: Raghavendra K T raghavendra...@linux.vnet.ibm.com

yield_to returns -ESRCH when source and target of yield_to have a run queue length of one. When we see three successive failures of yield_to we assume we are in a potential undercommit case and abort from the PLE handler. The assumption is backed by the low probability of a wrong decision even for worst case scenarios such as an average runqueue length between 1 and 2.

Note that we do not update the last boosted vcpu in failure cases. Thanks to Avi for raising the question about aborting after the first failure of yield_to.

Reviewed-by: Srikar Dronamraju sri...@linux.vnet.ibm.com
Signed-off-by: Raghavendra K T raghavendra...@linux.vnet.ibm.com
---
 virt/kvm/kvm_main.c | 26 +++-
 1 file changed, 16 insertions(+), 10 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index be70035..053f494 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1639,6 +1639,7 @@ bool kvm_vcpu_yield_to(struct kvm_vcpu *target)
 {
 	struct pid *pid;
 	struct task_struct *task = NULL;
+	bool ret = false;
 
 	rcu_read_lock();
 	pid = rcu_dereference(target->pid);
@@ -1646,17 +1647,15 @@ bool kvm_vcpu_yield_to(struct kvm_vcpu *target)
 		task = get_pid_task(target->pid, PIDTYPE_PID);
 	rcu_read_unlock();
 	if (!task)
-		return false;
+		return ret;
 	if (task->flags & PF_VCPU) {
 		put_task_struct(task);
-		return false;
-	}
-	if (yield_to(task, 1)) {
-		put_task_struct(task);
-		return true;
+		return ret;
 	}
+	ret = yield_to(task, 1);
 	put_task_struct(task);
-	return false;
+
+	return ret;
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_yield_to);
 
@@ -1697,12 +1696,14 @@ bool kvm_vcpu_eligible_for_directed_yield(struct kvm_vcpu *vcpu)
 	return eligible;
 }
 #endif
+
 void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 {
 	struct kvm *kvm = me->kvm;
 	struct kvm_vcpu *vcpu;
 	int last_boosted_vcpu = me->kvm->last_boosted_vcpu;
 	int yielded = 0;
+	int try = 3;
 	int pass;
 	int i;
 
@@ -1714,7 +1715,7 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 	 * VCPU is holding the lock that we need and will release it.
 	 * We approximate round-robin by starting at the last boosted VCPU.
 	 */
-	for (pass = 0; pass < 2 && !yielded; pass++) {
+	for (pass = 0; pass < 2 && !yielded && try; pass++) {
 		kvm_for_each_vcpu(i, vcpu, kvm) {
 			if (!pass && i <= last_boosted_vcpu) {
 				i = last_boosted_vcpu;
@@ -1727,10 +1728,15 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 				continue;
 			if (!kvm_vcpu_eligible_for_directed_yield(vcpu))
 				continue;
-			if (kvm_vcpu_yield_to(vcpu)) {
+
+			yielded = kvm_vcpu_yield_to(vcpu);
+			if (yielded > 0) {
 				kvm->last_boosted_vcpu = i;
-				yielded = 1;
 				break;
+			} else if (yielded < 0) {
+				try--;
+				if (!try)
+					break;
 			}
 		}
 	}
RE: [PATCH v2 6/6] x86, apicv: Add Posted Interrupt supporting
Gleb Natapov wrote on 2012-11-26:
On Mon, Nov 26, 2012 at 03:51:04AM +0000, Zhang, Yang Z wrote:
Gleb Natapov wrote on 2012-11-25:
On Wed, Nov 21, 2012 at 04:09:39PM +0800, Yang Zhang wrote:

Posted Interrupt allows vAPIC interrupts to be injected into the guest directly without any vmexit.

- When delivering an interrupt to the guest, if the target vcpu is running, update the Posted-interrupt requests bitmap and send a notification event to the vcpu. Then the vcpu will handle this interrupt automatically, without any software involvement.

Looks like you are allocating one irq vector per vcpu per pcpu and then migrating it or reallocating it when the vcpu moves from one pcpu to another. This is not scalable, and irq migration slows things down. What's wrong with allocating one global vector for posted interrupts during vmx initialization and using it for all vcpus?

Consider the following situation: if vcpu A is running when a notification event belonging to vcpu B arrives, then since the vector matches vcpu A's notification vector, this event will be consumed by vcpu A (even though it does nothing) and the interrupt cannot be handled in time.

The exact same situation is possible with your code. vcpu B can be migrated from a pcpu and vcpu A will take its place and will be assigned the same vector as vcpu B.

No, the on bit will be set to prevent the notification event when vcpu B starts migration. And it only frees the vector before it is going to run on another pcpu.

But I fail to see why this is a problem. vcpu A will ignore the PI since the pir will be empty, and vcpu B should detect the new event during the next vmentry.

Yes, but the next vmentry may happen a long time later and the interrupt cannot be serviced until then. In the current way, it will cause a vmexit and re-schedule vcpu B.
+	if (!cfg) {
+		free_irq_at(irq, NULL);
+		return 0;
+	}
+
+	raw_spin_lock_irqsave(&vector_lock, flags);
+	if (!__assign_irq_vector(irq, cfg, mask))
+		ret = irq;
+	raw_spin_unlock_irqrestore(&vector_lock, flags);
+
+	if (ret) {
+		irq_set_chip_data(irq, cfg);
+		irq_clear_status_flags(irq, IRQ_NOREQUEST);
+	} else {
+		free_irq_at(irq, cfg);
+	}
+	return ret;
+}

This function is mostly cut-and-paste of create_irq_nr().

Yes, this function allows allocating a vector from a specified cpu.

That does not justify code duplication.

OK, will change it in the next version.

 	if (kvm_x86_ops->has_virtual_interrupt_delivery(vcpu))
 		apic->vid_enabled = true;
+
+	if (kvm_x86_ops->has_posted_interrupt(vcpu))
+		apic->pi_enabled = true;
+

This is global state, no need for a per-apic variable.

Even though all vcpus use the same setting, according to the SDM apicv really is a per-apic variable. Anyway, if you think we should not put it here, where is the best place?

@@ -1555,6 +1611,11 @@ static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 	struct desc_ptr *gdt = &__get_cpu_var(host_gdt);
 	unsigned long sysenter_esp;
 
+	if (enable_apicv_pi && to_vmx(vcpu)->pi)
+		pi_set_on(&to_vmx(vcpu)->pi);
+

Why?

Here means the vcpu starts migration. So we should prevent the notification event until migration ends.

You check for IN_GUEST_MODE while sending the notification. Why is this not enough?

For an interrupt from an emulated device, it is enough. But a VT-d device doesn't know the vcpu is migrating, so we set the on bit to prevent the notification event while the target vcpu is migrating.

Also, why does a vmx_vcpu_load() call mean that the vcpu starts migration?

I think the following check can ensure the vcpu is in migration, am I wrong?
	if (vmx->loaded_vmcs->cpu != cpu) {
		if (enable_apicv_pi && to_vmx(vcpu)->pi)
			pi_set_on(&to_vmx(vcpu)->pi);
	}

+		kvm_make_request(KVM_REQ_POSTED_INTR, vcpu);
+
 		kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu);
 		local_irq_disable();
 		list_add(&vmx->loaded_vmcs->loaded_vmcss_on_cpu_link,
@@ -1582,6 +1643,8 @@ static void vmx_vcpu_put(struct kvm_vcpu *vcpu)
 		vcpu->cpu = -1;
 		kvm_cpu_vmxoff();
 	}
+	if (enable_apicv_pi && to_vmx(vcpu)->pi)
+		pi_set_on(&to_vmx(vcpu)->pi);

Why? When the vcpu is scheduled out, there is no need to send a notification event to it; just setting the PIR and waking it up is enough.

Same as above. When the vcpu is scheduled out it will not be in IN_GUEST_MODE mode.

Right.

Also in this case we probably should set the bit directly in the IRR and leave the PIR alone.

From the view of the hypervisor, IRR and PIR are the same. For each vmentry, if PI is enabled, the IRR equals (IRR | PIR). So there is no difference between setting the IRR or the PIR if the target vcpu is not running.

 }
 
 static void vmx_fpu_activate(struct kvm_vcpu *vcpu)
@@ -2451,12 +2514,6 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf)
 	u32 _vmexit_control = 0;
 	u32 _vmentry_control = 0;
 
-	min = PIN_BASED_EXT_INTR_MASK | PIN_BASED_NMI_EXITING;
-	opt = PIN_BASED_VIRTUAL_NMIS;
-	if
Re: [PATCH 4/5] KVM: PPC: Book3S HV: Don't give the guest RW access to RO pages
On 24.11.2012, at 10:32, Paul Mackerras wrote: On Sat, Nov 24, 2012 at 10:05:37AM +0100, Alexander Graf wrote: On 23.11.2012, at 23:13, Paul Mackerras pau...@samba.org wrote: On Fri, Nov 23, 2012 at 04:47:45PM +0100, Alexander Graf wrote: On 22.11.2012, at 10:28, Paul Mackerras wrote: Currently, if the guest does an H_PROTECT hcall requesting that the permissions on a HPT entry be changed to allow writing, we make the requested change even if the page is marked read-only in the host Linux page tables. This is a problem since it would for instance allow a guest to modify a page that KSM has decided can be shared between multiple guests. To fix this, if the new permissions for the page allow writing, we need to look up the memslot for the page, work out the host virtual address, and look up the Linux page tables to get the PTE for the page. If that PTE is read-only, we reduce the HPTE permissions to read-only. How does KSM handle this usually? If you reduce the permissions to R/O, how do you ever get a R/W page from a deduplicated one? The scenario goes something like this: 1. Guest creates an HPTE with RO permissions. 2. KSM decides the page is identical to another page and changes the HPTE to point to a shared copy. Permissions are still RO. 3. Guest decides it wants write access to the page and does an H_PROTECT hcall to change the permissions on the HPTE to RW. The bug is that we actually make the requested change in step 3. Instead we should leave it at RO, then when the guest tries to write to the page, we take a hypervisor page fault, copy the page and give the guest write access to its own copy of the page. So what this patch does is add code to H_PROTECT so that if the guest is requesting RW access, we check the Linux PTE to see if the underlying guest page is RO, and if so reduce the permissions in the HPTE to RO. But this will be guest visible, because now H_PROTECT doesn't actually mark the page R/W in the HTAB, right? 
No - the guest view of the HPTE has R/W permissions. The guest view of the HPTE is made up of doubleword 0 from the real HPT plus rev->guest_rpte for doubleword 1 (where rev is the entry in the revmap array, kvm->arch.revmap, for the HPTE). The guest view can be different from the host/hardware view, which is in the real HPT. For instance, the guest view of a HPTE might be valid but the host view might be invalid because the underlying real page has been paged out - in that case we use a software bit which we call HPTE_V_ABSENT to remind ourselves that there is something valid there from the guest's point of view. Or the guest view can be R/W but the host view is RO, as in the case where KSM has merged the page. So the flow with this patch is: - guest page permission fault This comes through the host (kvmppc_hpte_hv_fault()) which looks at the guest view of the HPTE, sees that it has RO permissions, and sends the page fault to the guest. - guest does H_PROTECT to mark page r/w - H_PROTECT doesn't do anything - guest returns from permission handler, triggers write fault This comes once again to kvmppc_hpte_hv_fault(), which sees that the guest view of the HPTE has R/W permissions now, and sends the page fault to kvmppc_book3s_hv_page_fault(), which requests write access to the page, possibly triggering copy-on-write or whatever, and updates the real HPTE to have R/W permissions and possibly point to a new page of memory. 2 questions here: How does the host know that the page is actually r/w? I assume you mean RO? It looks up the memslot for the guest physical address (which it gets from rev->guest_rpte), uses that to work out the host virtual address (i.e. the address in qemu's address space), looks up the Linux PTE in qemu's Linux page tables, and looks at the _PAGE_RW bit there. How does this work on 970? I thought page faults always go straight to the guest there. They do, which is why PPC970 can't do any of this. 
On PPC970 we have kvm->arch.using_mmu_notifiers == 0, and that makes the code pin every page of guest memory that is mapped by a guest HPTE (with a Linux guest, that means every page, because of the linear mapping). On POWER7 we have kvm->arch.using_mmu_notifiers == 1, which enables host paging and deduplication of guest memory. Thanks a lot for the detailed explanation! Maybe you guys should just release an HV capable p7 system publicly, so we can deprecate 970 support. That would make a few things quite a bit easier ;) Thanks, applied to kvm-ppc-next. Alex -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
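The guest-view vs. host-view split described in this thread can be sketched as a tiny model. This is a hedged illustration, not the real PowerPC HPTE encoding: `hpte_model`, `h_protect`, and the `PERM_*` flag values are all hypothetical. The point is the clamping rule: the guest view records whatever the guest asked for, while the host view only grants write access if the backing Linux PTE is writable (otherwise a later write fault triggers copy-on-write):

```c
#include <stdbool.h>

/* Hypothetical model of the H_PROTECT clamping described above.
 * Flag values are illustrative, not the real HPTE bit layout. */
#define PERM_RO 0x1
#define PERM_RW 0x3

struct hpte_model {
    unsigned guest_view; /* what the guest requested and will see */
    unsigned host_view;  /* what the real (hardware) HPT holds */
};

static void h_protect(struct hpte_model *h, unsigned requested,
                      bool host_pte_writable)
{
    h->guest_view = requested;
    /* Clamp: grant R/W in the real HPT only if the host page is
     * writable (e.g. not a KSM-merged page); otherwise keep RO and
     * let the eventual write fault do copy-on-write. */
    h->host_view = (requested == PERM_RW && !host_pte_writable)
                       ? PERM_RO
                       : requested;
}
```

With a KSM-merged (read-only) backing page, the guest sees R/W while the hardware keeps RO, which is exactly the state that makes the second fault in the flow above land in the host.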
Re: [PATCH 3/5] KVM: PPC: Book3S HV: Improve handling of local vs. global TLB invalidations
On 23.11.2012, at 23:07, Paul Mackerras wrote: On Fri, Nov 23, 2012 at 04:43:03PM +0100, Alexander Graf wrote: On 22.11.2012, at 10:28, Paul Mackerras wrote: - With the possibility of the host paging out guest pages, the use of H_LOCAL by an SMP guest is dangerous since the guest could possibly retain and use a stale TLB entry pointing to a page that had been removed from the guest. I don't understand this part. Don't we flush the TLB when the page gets evicted from the shadow HTAB? The H_LOCAL flag is something that we invented to allow the guest to tell the host I only ever used this translation (HPTE) on the current vcpu when it's removing or modifying an HPTE. The idea is that that would then let the host use the tlbiel instruction (local TLB invalidate) rather than the usual global tlbie instruction. Tlbiel is faster because it doesn't need to go out on the fabric and get processed by all cpus. In fact our guests don't use it at present, but we put it in because we thought we should be able to get a performance improvement, particularly on large machines. However, the catch is that the guest's setting of H_LOCAL might be incorrect, in which case we could have a stale TLB entry on another physical cpu. While the physical page that it refers to is still owned by the guest, that stale entry doesn't matter from the host's point of view. But if the host wants to take that page away from the guest, the stale entry becomes a problem. That's exactly where my question lies. Does that mean we don't flush the TLB entry regardless when we take the page away from the guest? Alex The solution I implement here is just not to use tlbiel in SMP guests. UP guests are not so much of a problem because the potential attack from the guest relies on having one cpu remove the HPTE and do tlbiel while another cpu uses the stale TLB entry, which you can't do if you only have one cpu. Paul. 
[Bug 50921] kvm hangs booting Windows 2000
https://bugzilla.kernel.org/show_bug.cgi?id=50921 --- Comment #13 from Lucio Crusca lu...@sulweb.org 2012-11-26 13:13:56 --- @Alan: see comment #5, since then I've always tested with and without vbox modules. @Gleb: can't run on 3.5.0 right now, I'll take the stack trace ASAP. -- Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email --- You are receiving this mail because: --- You are watching the assignee of the bug.
Re: [PATCH 1/5] KVM: PPC: Book3S HV: Handle guest-caused machine checks on POWER7 without panicking
On 23.11.2012, at 22:42, Paul Mackerras wrote: On Fri, Nov 23, 2012 at 03:13:09PM +0100, Alexander Graf wrote: On 22.11.2012, at 10:25, Paul Mackerras wrote: + /* Do they have an SLB shadow buffer registered? */ + slb = vcpu->arch.slb_shadow.pinned_addr; + if (!slb) + return; Mind to explain this case? What happens here? Do we leave the guest with an empty SLB? Why would this ever happen? What happens next as soon as we go back into the guest? Yes, we leave the guest with an empty SLB, the access gets retried and this time the guest gets an SLB miss interrupt, which it can hopefully handle using an SLB miss handler that runs entirely in real mode. This could happen for instance while the guest is in SLOF or yaboot or some other code that runs basically in real mode but occasionally turns the MMU on for some accesses, and happens to have a bug where it creates a duplicate SLB entry. Is this what pHyp does? Also, is this what we want? Why don't we populate an #MC into the guest so it knows it did something wrong? Alex + /* Sanity check */ + n = slb->persistent; + if (n < SLB_MIN_SIZE) + n = SLB_MIN_SIZE; Please use a max() macro here. OK. + rb = 0x800; /* IS field = 0b10, flush congruence class */ + for (i = 0; i < 128; ++i) { Please put up a #define for this. POWER7_TLB_SIZE or so. Is there any way to fetch that number from an SPR? I don't really want to have a p7+ and a p8 function in here too ;). + asm volatile("tlbiel %0" : : "r" (rb)); + rb += 0x1000; I assume this also means (1 << TLBIE_ENTRY_SHIFT)? Would be nice to keep the code readable without guessing :). The 0x800 and 0x1000 are taken from the architecture - it defines fields in the RB value for the flush type and TLB index. The 128 is POWER7-specific and isn't in any SPR that I know of. Eventually we'll probably have to put it (the number of TLB congruence classes) in the cputable, but for now I'll just do a define. So I take it that p7 does not implement tlbia? Correct. So we never return 0? 
How about ECC errors and the likes? Wouldn't those also be #MCs that the host needs to handle? Yes, true. In fact the OPAL firmware gets to see the machine checks before we do (see the opal_register_exception_handler() calls in arch/powerpc/platforms/powernv/opal.c), so it should have already handled recoverable things like L1 cache parity errors. I'll make the function return 0 if there's an error that it doesn't know about. ld r8, HSTATE_VMHANDLER(r13) ld r7, HSTATE_HOST_MSR(r13) cmpwi r12, BOOK3S_INTERRUPT_EXTERNAL +BEGIN_FTR_SECTION beq 11f - cmpwi r12, BOOK3S_INTERRUPT_MACHINE_CHECK +END_FTR_SECTION_IFSET(CPU_FTR_ARCH_206) Mind to explain the logic that happens here? Do we get external interrupts on 970? If not, the cmpwi should also be inside the FTR section. Also, if we do a beq here, why do the beqctr below again? I was making it not call the host kernel machine check handler if it was a machine check that pulled us out of the guest. In fact we probably do still want to call the handler, but we don't want to jump to 0x200, since that has been patched by OPAL, and we don't want to make OPAL think we just got another machine check. Instead we would need to jump to machine_check_pSeries. The feature section is because POWER7 sets HSRR0/1 on external interrupts, whereas PPC970 sets SRR0/1. Paul.
Re: [PATCH 0/3] KVM: PPC: Fixes for PR-KVM on POWER7
On 05.11.2012, at 04:40, Paul Mackerras wrote: Here are some fixes for PR-style KVM. With these I can successfully run a pseries (PAPR) guest under PR KVM on a POWER7. (This is all running inside a HV KVM virtual machine.) The patches are against Alex Graf's kvm-ppc-next branch. Thanks a lot! Applied all to kvm-ppc-next. Alex
Re: [PATCH V3 RFC 1/2] sched: Bail out of yield_to when source and target runqueue has one task
On Mon, Nov 26, 2012 at 05:37:54PM +0530, Raghavendra K T wrote: From: Peter Zijlstra pet...@infradead.org In case of undercommitted scenarios, especially in large guests, yield_to overhead is significantly high. When the run queue length of both source and target is one, take the opportunity to bail out and return -ESRCH. This return condition can be further exploited to quickly come out of the PLE handler. (History: Raghavendra initially worked on breaking out of the kvm ple handler upon seeing source runqueue length = 1, but it had to export rq length). Peter came up with the elegant idea of returning -ESRCH in the scheduler core. Signed-off-by: Peter Zijlstra pet...@infradead.org Raghavendra, Checking the rq length of target vcpu condition added.(thanks Avi) Reviewed-by: Srikar Dronamraju sri...@linux.vnet.ibm.com Signed-off-by: Raghavendra K T raghavendra...@linux.vnet.ibm.com --- kernel/sched/core.c | 25 +++-- 1 file changed, 19 insertions(+), 6 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 2d8927f..fc219a5 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -4289,7 +4289,10 @@ EXPORT_SYMBOL(yield); * It's the caller's job to ensure that the target task struct * can't go away on us before we can do any checks. * - * Returns true if we indeed boosted the target task. + * Returns: + * true (>0) if we indeed boosted the target task. + * false (0) if we failed to boost the target. + * -ESRCH if there's no task to yield to. */ bool __sched yield_to(struct task_struct *p, bool preempt) { @@ -4303,6 +4306,15 @@ bool __sched yield_to(struct task_struct *p, bool preempt) again: p_rq = task_rq(p); + /* + * If we're the only runnable task on the rq and target rq also + * has only one task, there's absolutely no point in yielding. 
+ */ + if (rq->nr_running == 1 && p_rq->nr_running == 1) { + yielded = -ESRCH; + goto out_irq; + } + double_rq_lock(rq, p_rq); while (task_rq(p) != p_rq) { double_rq_unlock(rq, p_rq); @@ -4310,13 +4322,13 @@ again: } if (!curr->sched_class->yield_to_task) - goto out; + goto out_unlock; if (curr->sched_class != p->sched_class) - goto out; + goto out_unlock; if (task_running(p_rq, p) || p->state) - goto out; + goto out_unlock; yielded = curr->sched_class->yield_to_task(rq, p, preempt); if (yielded) { @@ -4329,11 +4341,12 @@ again: resched_task(p_rq->curr); } -out: +out_unlock: double_rq_unlock(rq, p_rq); +out_irq: local_irq_restore(flags); - if (yielded) + if (yielded > 0) schedule(); return yielded; Acked-by: Andrew Jones drjo...@redhat.com
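The return contract introduced by this patch (boosted > 0, failed == 0, no point == -ESRCH) is the part callers must get right, since a plain truth test would now treat -ESRCH as success. A hedged userspace sketch of that contract, with runqueue lengths standing in for the real rq structures and `yield_to_model` / `can_boost` as hypothetical names:

```c
#include <errno.h>

/* Minimal model of the yield_to() return convention from the patch:
 *   > 0    target was boosted
 *   0      yield attempt failed
 *   -ESRCH both runqueues have a single task, yielding is pointless
 * The -ESRCH check happens before any locking, mirroring how the real
 * patch bails out before double_rq_lock(). */
static int yield_to_model(int src_rq_len, int dst_rq_len, int can_boost)
{
    if (src_rq_len == 1 && dst_rq_len == 1)
        return -ESRCH; /* undercommitted: bail out cheaply */
    return can_boost ? 1 : 0;
}
```

Callers therefore need `if (result > 0)` rather than `if (result)`, which is exactly the `if (yielded > 0) schedule();` change in the diff above.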
Re: [PATCH V3 RFC 2/2] kvm: Handle yield_to failure return code for potential undercommit case
On Mon, Nov 26, 2012 at 05:38:04PM +0530, Raghavendra K T wrote: From: Raghavendra K T raghavendra...@linux.vnet.ibm.com yield_to returns -ESRCH when the run queue length of both source and target of yield_to is one. When we see three successive failures of yield_to we assume we are in a potential undercommit case and abort from the PLE handler. The assumption is backed by the low probability of a wrong decision even for worst case scenarios such as an average runqueue length between 1 and 2. Note that we do not update the last boosted vcpu in failure cases. Thanks to Avi for raising the question on aborting after the first failure from yield_to. Reviewed-by: Srikar Dronamraju sri...@linux.vnet.ibm.com Signed-off-by: Raghavendra K T raghavendra...@linux.vnet.ibm.com --- virt/kvm/kvm_main.c | 26 -- 1 file changed, 16 insertions(+), 10 deletions(-) diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index be70035..053f494 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -1639,6 +1639,7 @@ bool kvm_vcpu_yield_to(struct kvm_vcpu *target) { struct pid *pid; struct task_struct *task = NULL; + bool ret = false; rcu_read_lock(); pid = rcu_dereference(target->pid); @@ -1646,17 +1647,15 @@ bool kvm_vcpu_yield_to(struct kvm_vcpu *target) task = get_pid_task(target->pid, PIDTYPE_PID); rcu_read_unlock(); if (!task) - return false; + return ret; if (task->flags & PF_VCPU) { put_task_struct(task); - return false; - } - if (yield_to(task, 1)) { - put_task_struct(task); - return true; + return ret; } + ret = yield_to(task, 1); put_task_struct(task); - return false; + + return ret; } EXPORT_SYMBOL_GPL(kvm_vcpu_yield_to); @@ -1697,12 +1696,14 @@ bool kvm_vcpu_eligible_for_directed_yield(struct kvm_vcpu *vcpu) return eligible; } #endif + void kvm_vcpu_on_spin(struct kvm_vcpu *me) { struct kvm *kvm = me->kvm; struct kvm_vcpu *vcpu; int last_boosted_vcpu = me->kvm->last_boosted_vcpu; int yielded = 0; + int try = 3; int pass; int i; @@ -1714,7 +1715,7 @@ void 
kvm_vcpu_on_spin(struct kvm_vcpu *me) * VCPU is holding the lock that we need and will release it. * We approximate round-robin by starting at the last boosted VCPU. */ - for (pass = 0; pass < 2 && !yielded; pass++) { + for (pass = 0; pass < 2 && !yielded && try; pass++) { kvm_for_each_vcpu(i, vcpu, kvm) { if (!pass && i <= last_boosted_vcpu) { i = last_boosted_vcpu; @@ -1727,10 +1728,15 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me) continue; if (!kvm_vcpu_eligible_for_directed_yield(vcpu)) continue; - if (kvm_vcpu_yield_to(vcpu)) { + + yielded = kvm_vcpu_yield_to(vcpu); + if (yielded > 0) { kvm->last_boosted_vcpu = i; - yielded = 1; break; + } else if (yielded < 0) { + try--; + if (!try) + break; } } } The check done in patch 1/2 is done before the double_rq_lock, so it's cheap. Now, this patch is to avoid doing too many get_pid_task calls. I wonder if it would make more sense to change the vcpu state from tracking the pid to tracking the task. If that was done, then I don't believe this patch is necessary. Rik, for 34bb10b79de7 was there a reason pid was used instead of task? Drew
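The try-counter policy in the diff above can be isolated into a few lines. This is a hedged sketch, not the kernel loop: `boosted_vcpu` and the `yield_results` array are hypothetical stand-ins for what `kvm_vcpu_yield_to()` would return for each candidate vcpu (> 0 boosted, 0 failed, < 0 i.e. -ESRCH both runqueues had one task):

```c
/* Sketch of the PLE-handler policy from the patch: stop scanning for
 * a yield candidate after three negative (-ESRCH-style) results, on
 * the assumption that the host is undercommitted.  Returns the index
 * of the boosted vcpu, or -1 if the handler gave up. */
static int boosted_vcpu(const int *yield_results, int n)
{
    int try = 3;

    for (int i = 0; i < n; i++) {
        if (yield_results[i] > 0)
            return i;              /* boosted this vcpu */
        if (yield_results[i] < 0 && --try == 0)
            break;                 /* likely undercommitted: give up */
    }
    return -1;
}
```

Note that, as in the patch, a plain failure (0) does not consume a try; only the -ESRCH case does, since that is the signal that both runqueues are short.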
Re: [PATCH v2 6/6] x86, apicv: Add Posted Interrupt supporting
On Mon, Nov 26, 2012 at 12:29:54PM +, Zhang, Yang Z wrote: Gleb Natapov wrote on 2012-11-26: On Mon, Nov 26, 2012 at 03:51:04AM +, Zhang, Yang Z wrote: Gleb Natapov wrote on 2012-11-25: On Wed, Nov 21, 2012 at 04:09:39PM +0800, Yang Zhang wrote: Posted Interrupt allows APICv interrupts to be injected into the guest directly without any vmexit. - When delivering an interrupt to the guest, if the target vcpu is running, update the Posted-interrupt requests bitmap and send a notification event to the vcpu. Then the vcpu will handle this interrupt automatically, without any software involvement. Looks like you are allocating one irq vector per vcpu per pcpu and then migrating or reallocating it when a vcpu moves from one pcpu to another. This is not scalable, and irq migration slows things down. What's wrong with allocating one global vector for posted interrupts during vmx initialization and using it for all vcpus? Consider the following situation: if vcpu A is running when a notification event belonging to vcpu B arrives, since the vector matches vcpu A's notification vector, the event will be consumed by vcpu A (even though it does nothing) and the interrupt cannot be handled in time. The exact same situation is possible with your code: vcpu B can be migrated from a pcpu and vcpu A will take its place and be assigned the same vector as vcpu B. But I fail to see why this is a problem: vcpu A will ignore the PI since its pir will be empty, and vcpu B should detect the new event during the next vmentry. No, the on bit will be set to prevent the notification event when vcpu B starts migration, and it only frees the vector before it starts running on another pcpu. There is a race: the sender checks the on bit, vcpu B migrates to another pcpu and starts running there, vcpu A takes vcpu B's vector, the sender sends the PI, and vcpu A gets it. Yes, but the next vmentry may happen a long time later and the interrupt cannot be serviced until then. In the current approach, it will cause a vmexit and re-schedule vcpu B. 
Vmentry will happen when the scheduler decides that the vcpu can run. There is no problem here. What you probably want to say is that the vcpu may not be aware of the interrupt since it was migrated to a different pcpu just after the PI IPI was sent and thus missed it. But then the PIR interrupts should be processed during vmentry on the other pcpu. Sender: set pir; set on; if (vcpu in guest mode on pcpu1) send PI IPI to pcpu1. Guest, meanwhile: vmexit on pcpu1; vmentry on pcpu2; process pir, deliver interrupt. +if (!cfg) { +free_irq_at(irq, NULL); +return 0; +} + +raw_spin_lock_irqsave(&vector_lock, flags); +if (!__assign_irq_vector(irq, cfg, mask)) +ret = irq; +raw_spin_unlock_irqrestore(&vector_lock, flags); + +if (ret) { +irq_set_chip_data(irq, cfg); +irq_clear_status_flags(irq, IRQ_NOREQUEST); +} else { +free_irq_at(irq, cfg); +} +return ret; +} This function is mostly a cut-and-paste of create_irq_nr(). Yes, this function allows allocating a vector from a specified cpu. That does not justify code duplication. ok. will change it in next version. Please use a single global vector in the next version. if (kvm_x86_ops->has_virtual_interrupt_delivery(vcpu)) apic->vid_enabled = true; + +if (kvm_x86_ops->has_posted_interrupt(vcpu)) +apic->pi_enabled = true; + This is global state, no need for a per-apic variable. Even though all vcpus use the same setting, according to the SDM, apicv really is a per-apic variable. It is not per vapic in our implementation and this is what is important here. Anyway, if you think we should not put it here, where is the best place? It is not needed, just use has_posted_interrupt(vcpu) instead. @@ -1555,6 +1611,11 @@ static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu) struct desc_ptr *gdt = &__get_cpu_var(host_gdt); unsigned long sysenter_esp; +if (enable_apicv_pi && to_vmx(vcpu)->pi) +pi_set_on(to_vmx(vcpu)->pi); + Why? This is where the vcpu starts migration, so we should prevent the notification event until migration ends. You check for IN_GUEST_MODE while sending the notification. 
Why is this not enough? For an interrupt from an emulated device, it is enough. But a VT-d device doesn't know the vcpu is migrating, so set the on bit to prevent the notification event while the target vcpu is migrating. Why should the VT-d device care about that? It sets bits in pir and sends the IPI. If the vcpu is running it processes pir immediately; if not, it will do so during the next vmentry. Also why does a vmx_vcpu_load() call mean
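The sender-side protocol being debated (set PIR first, then decide whether a notification IPI is needed, with the "on" bit suppressing redundant IPIs) can be sketched in userspace. This is a hedged model with hypothetical names (`pi_desc_model`, `pi_post`), not the hardware posted-interrupt descriptor:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the sender side of posted interrupts:
 * the request bit goes into PIR first, and the "on" (outstanding
 * notification) bit ensures at most one IPI is in flight.  A vcpu
 * not in guest mode simply picks PIR up at its next vmentry. */
struct pi_desc_model {
    _Atomic uint64_t pir;
    _Atomic bool on; /* outstanding-notification bit */
};

/* Returns true if the caller should send the notification IPI. */
static bool pi_post(struct pi_desc_model *pi, unsigned vector,
                    bool vcpu_in_guest_mode)
{
    atomic_fetch_or(&pi->pir, 1ull << (vector % 64));
    if (!vcpu_in_guest_mode)
        return false; /* vmentry will process PIR anyway */
    /* test-and-set "on": only the first poster sends the IPI */
    return !atomic_exchange(&pi->on, true);
}
```

Setting "on" across migration, as proposed in the thread, then simply makes every `pi_post` during that window take the no-IPI path while still recording the vector in PIR.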
Re: [PATCH] MAINTAINERS: Add git tree link for PPC KVM
On 16.10.2012, at 07:01, Michael Ellerman wrote: Signed-off-by: Michael Ellerman mich...@ellerman.id.au Thanks, applied to kvm-ppc-next. Alex --- MAINTAINERS |1 + 1 file changed, 1 insertion(+) diff --git a/MAINTAINERS b/MAINTAINERS index e73060f..32dc107 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -4244,6 +4244,7 @@ KERNEL VIRTUAL MACHINE (KVM) FOR POWERPC M:Alexander Graf ag...@suse.de L:kvm-...@vger.kernel.org W:http://kvm.qumranet.com +T: git git://github.com/agraf/linux-2.6.git S:Supported F:arch/powerpc/include/asm/kvm* F:arch/powerpc/kvm/ -- 1.7.9.5
Re: [PATCH V3 RFC 2/2] kvm: Handle yield_to failure return code for potential undercommit case
On Mon, Nov 26, 2012 at 02:43:02PM +0100, Andrew Jones wrote: On Mon, Nov 26, 2012 at 05:38:04PM +0530, Raghavendra K T wrote: From: Raghavendra K T raghavendra...@linux.vnet.ibm.com yield_to returns -ESRCH when the run queue length of both source and target of yield_to is one. When we see three successive failures of yield_to we assume we are in a potential undercommit case and abort from the PLE handler. The assumption is backed by the low probability of a wrong decision even for worst case scenarios such as an average runqueue length between 1 and 2. Note that we do not update the last boosted vcpu in failure cases. Thanks to Avi for raising the question on aborting after the first failure from yield_to. Reviewed-by: Srikar Dronamraju sri...@linux.vnet.ibm.com Signed-off-by: Raghavendra K T raghavendra...@linux.vnet.ibm.com --- virt/kvm/kvm_main.c | 26 -- 1 file changed, 16 insertions(+), 10 deletions(-) diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index be70035..053f494 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -1639,6 +1639,7 @@ bool kvm_vcpu_yield_to(struct kvm_vcpu *target) { struct pid *pid; struct task_struct *task = NULL; + bool ret = false; rcu_read_lock(); pid = rcu_dereference(target->pid); @@ -1646,17 +1647,15 @@ bool kvm_vcpu_yield_to(struct kvm_vcpu *target) task = get_pid_task(target->pid, PIDTYPE_PID); rcu_read_unlock(); if (!task) - return false; + return ret; if (task->flags & PF_VCPU) { put_task_struct(task); - return false; - } - if (yield_to(task, 1)) { - put_task_struct(task); - return true; + return ret; } + ret = yield_to(task, 1); put_task_struct(task); - return false; + + return ret; } EXPORT_SYMBOL_GPL(kvm_vcpu_yield_to); @@ -1697,12 +1696,14 @@ bool kvm_vcpu_eligible_for_directed_yield(struct kvm_vcpu *vcpu) return eligible; } #endif + void kvm_vcpu_on_spin(struct kvm_vcpu *me) { struct kvm *kvm = me->kvm; struct kvm_vcpu *vcpu; int last_boosted_vcpu = me->kvm->last_boosted_vcpu; int yielded = 0; + int try = 3; int pass; int i; @@ -1714,7 +1715,7 @@ void 
kvm_vcpu_on_spin(struct kvm_vcpu *me) * VCPU is holding the lock that we need and will release it. * We approximate round-robin by starting at the last boosted VCPU. */ - for (pass = 0; pass < 2 && !yielded; pass++) { + for (pass = 0; pass < 2 && !yielded && try; pass++) { kvm_for_each_vcpu(i, vcpu, kvm) { if (!pass && i <= last_boosted_vcpu) { i = last_boosted_vcpu; @@ -1727,10 +1728,15 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me) continue; if (!kvm_vcpu_eligible_for_directed_yield(vcpu)) continue; - if (kvm_vcpu_yield_to(vcpu)) { + + yielded = kvm_vcpu_yield_to(vcpu); + if (yielded > 0) { kvm->last_boosted_vcpu = i; - yielded = 1; break; + } else if (yielded < 0) { + try--; + if (!try) + break; } } } The check done in patch 1/2 is done before the double_rq_lock, so it's cheap. Now, this patch is to avoid doing too many get_pid_task calls. I wonder if it would make more sense to change the vcpu state from tracking the pid to tracking the task. If that was done, then I don't believe this patch is necessary. Rik, for 34bb10b79de7 was there a reason pid was used instead of task? Nevermind, I guess there's no way to validate the task pointer without checking the pid, since, as your git commit says, there is no guarantee that the same task always keeps the same vcpu. We'd only know it's valid if it's running, and if it's running, it's of no interest. Drew
Re: [PATCH] vfio powerpc: enabled and supported on powernv platform
On Thu, 2012-11-22 at 11:56 +, Sethi Varun-B16395 wrote: -Original Message- From: linux-kernel-ow...@vger.kernel.org [mailto:linux-kernel- ow...@vger.kernel.org] On Behalf Of Alex Williamson Sent: Tuesday, November 20, 2012 11:50 PM To: Alexey Kardashevskiy Cc: Benjamin Herrenschmidt; Paul Mackerras; linuxppc- d...@lists.ozlabs.org; linux-ker...@vger.kernel.org; kvm@vger.kernel.org; David Gibson Subject: Re: [PATCH] vfio powerpc: enabled and supported on powernv platform On Tue, 2012-11-20 at 11:48 +1100, Alexey Kardashevskiy wrote: VFIO implements platform independent stuff such as a PCI driver, BAR access (via read/write on a file descriptor or direct mapping when possible) and IRQ signaling. The platform dependent part includes IOMMU initialization and handling. This patch initializes IOMMU groups based on the IOMMU configuration discovered during the PCI scan; only the POWERNV platform is supported at the moment. Also the patch implements a VFIO-IOMMU driver which manages DMA mapping/unmapping requests coming from the client (now QEMU). It also returns DMA window information to let the guest initialize the device tree for a guest OS properly. Although this driver has been tested only on POWERNV, it should work on any platform supporting TCE tables. To enable VFIO on POWER, enable the SPAPR_TCE_IOMMU config option. 
Cc: David Gibson da...@gibson.dropbear.id.au Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru --- arch/powerpc/include/asm/iommu.h |6 + arch/powerpc/kernel/iommu.c | 140 +++ arch/powerpc/platforms/powernv/pci.c | 135 +++ drivers/iommu/Kconfig|8 ++ drivers/vfio/Kconfig |6 + drivers/vfio/Makefile|1 + drivers/vfio/vfio_iommu_spapr_tce.c | 247 ++ include/linux/vfio.h | 20 +++ 8 files changed, 563 insertions(+) create mode 100644 drivers/vfio/vfio_iommu_spapr_tce.c diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h index cbfe678..5ba66cb 100644 --- a/arch/powerpc/include/asm/iommu.h +++ b/arch/powerpc/include/asm/iommu.h @@ -64,30 +64,33 @@ struct iommu_pool { } ____cacheline_aligned_in_smp; struct iommu_table { unsigned long it_busno; /* Bus number this table belongs to */ unsigned long it_size; /* Size of iommu table in entries */ unsigned long it_offset;/* Offset into global table */ unsigned long it_base; /* mapped address of tce table */ unsigned long it_index; /* which iommu table this is */ unsigned long it_type; /* type: PCI or Virtual Bus */ unsigned long it_blocksize; /* Entries in each block (cacheline) */ unsigned long poolsize; unsigned long nr_pools; struct iommu_pool large_pool; struct iommu_pool pools[IOMMU_NR_POOLS]; unsigned long *it_map; /* A simple allocation bitmap for now */ +#ifdef CONFIG_IOMMU_API + struct iommu_group *it_group; +#endif }; struct scatterlist; static inline void set_iommu_table_base(struct device *dev, void *base) { dev->archdata.dma_data.iommu_table_base = base; } static inline void *get_iommu_table_base(struct device *dev) { return dev->archdata.dma_data.iommu_table_base; } /* Frees table for an individual device node */ @@ -135,17 +138,20 @@ static inline void pci_iommu_init(void) { } extern void alloc_dart_table(void); #if defined(CONFIG_PPC64) && defined(CONFIG_PM) static inline void iommu_save(void) { if (ppc_md.iommu_save) ppc_md.iommu_save(); } static inline void iommu_restore(void) { if 
(ppc_md.iommu_restore) ppc_md.iommu_restore(); } #endif +extern long iommu_put_tces(struct iommu_table *tbl, unsigned long entry, uint64_t tce, + enum dma_data_direction direction, unsigned long pages); + #endif /* __KERNEL__ */ #endif /* _ASM_IOMMU_H */ diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c index ff5a6ce..94f614b 100644 --- a/arch/powerpc/kernel/iommu.c +++ b/arch/powerpc/kernel/iommu.c @@ -32,30 +32,31 @@ #include <linux/dma-mapping.h> #include <linux/bitmap.h> #include <linux/iommu-helper.h> #include <linux/crash_dump.h> #include <linux/hash.h> #include <linux/fault-inject.h> #include <linux/pci.h> #include <asm/io.h> #include <asm/prom.h> #include <asm/iommu.h> #include <asm/pci-bridge.h> #include <asm/machdep.h> #include <asm/kdump.h> #include <asm/fadump.h> #include <asm/vio.h> +#include <asm/tce.h> #define DBG(...) static int novmerge; static void __iommu_free(struct iommu_table *, dma_addr_t, unsigned int); static int __init
Re: [PATCH v8 1/2] x86/kexec: add a new atomic notifier list for kdump
Zhang Yanfei zhangyan...@cn.fujitsu.com writes: This patch adds an atomic notifier list named crash_notifier_list. Currently, when loading the kvm-intel module, a notifier will be registered in the list to enable vmcss loaded on all cpus to be VMCLEAR'd if needed. crash_notifier_list ick gag please no. Effectively this makes the kexec on panic code path undebuggable. Instead we need to use direct function calls to whatever you are doing. If a direct function call is too complex then the piece of code you want to call is almost certainly too complex to be calling on a code path like this. Eric Signed-off-by: Zhang Yanfei zhangyan...@cn.fujitsu.com --- arch/x86/include/asm/kexec.h |2 ++ arch/x86/kernel/crash.c |9 + 2 files changed, 11 insertions(+), 0 deletions(-) diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h index 317ff17..5e22b00 100644 --- a/arch/x86/include/asm/kexec.h +++ b/arch/x86/include/asm/kexec.h @@ -163,6 +163,8 @@ struct kimage_arch { }; #endif +extern struct atomic_notifier_head crash_notifier_list; + #endif /* __ASSEMBLY__ */ #endif /* _ASM_X86_KEXEC_H */ diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c index 13ad899..c5b2f70 100644 --- a/arch/x86/kernel/crash.c +++ b/arch/x86/kernel/crash.c @@ -16,6 +16,8 @@ #include <linux/delay.h> #include <linux/elf.h> #include <linux/elfcore.h> +#include <linux/module.h> +#include <linux/notifier.h> #include <asm/processor.h> #include <asm/hardirq.h> @@ -30,6 +32,9 @@ int in_crash_kexec; +ATOMIC_NOTIFIER_HEAD(crash_notifier_list); +EXPORT_SYMBOL_GPL(crash_notifier_list); + #if defined(CONFIG_SMP) && defined(CONFIG_X86_LOCAL_APIC) static void kdump_nmi_callback(int cpu, struct pt_regs *regs) @@ -46,6 +51,8 @@ static void kdump_nmi_callback(int cpu, struct pt_regs *regs) #endif crash_save_cpu(regs, cpu); + atomic_notifier_call_chain(&crash_notifier_list, 0, NULL); + /* Disable VMX or SVM if needed. * * We need to disable virtualization on all CPUs. 
@@ -88,6 +95,8 @@ void native_machine_crash_shutdown(struct pt_regs *regs) kdump_nmi_shootdown_cpus(); + atomic_notifier_call_chain(crash_notifier_list, 0, NULL); + /* Booting kdump kernel with VMX or SVM enabled won't work, * because (among other limitations) we can't disable paging * with the virt flags. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] vhost-blk: Add vhost-blk support v5
On Mon, Nov 19, 2012 at 10:26:41PM +0200, Michael S. Tsirkin wrote: Userspace bits: - 1) LKVM The latest vhost-blk userspace bits for kvm tool can be found here: g...@github.com:asias/linux-kvm.git blk.vhost-blk 2) QEMU The latest vhost-blk userspace prototype for QEMU can be found here: g...@github.com:asias/qemu.git blk.vhost-blk Changes in v5: - Do not assume the buffer layout - Fix wakeup race Changes in v4: - Mark req-status as userspace pointer - Use __copy_to_user() instead of copy_to_user() in vhost_blk_set_status() - Add if (need_resched()) schedule() in blk thread - Kill vhost_blk_stop_vq() and move it into vhost_blk_stop() - Use vq_err() instead of pr_warn() - Fail unsupported requests - Add flush in vhost_blk_set_features() Changes in v3: - Sending REQ_FLUSH bio instead of vfs_fsync, thanks Christoph! - Check file passed by user is a raw block device file Signed-off-by: Asias He as...@redhat.com Since there are files shared by this and vhost net it's easiest for me to merge this all through the vhost tree. Hi Dave, are you ok with this proposal? -- MST
Re: [PATCH] vfio powerpc: enabled and supported on powernv platform
On Fri, 2012-11-23 at 13:02 +1100, Alexey Kardashevskiy wrote: On 22/11/12 22:56, Sethi Varun-B16395 wrote: -Original Message- From: linux-kernel-ow...@vger.kernel.org [mailto:linux-kernel- ow...@vger.kernel.org] On Behalf Of Alex Williamson Sent: Tuesday, November 20, 2012 11:50 PM To: Alexey Kardashevskiy Cc: Benjamin Herrenschmidt; Paul Mackerras; linuxppc- d...@lists.ozlabs.org; linux-ker...@vger.kernel.org; kvm@vger.kernel.org; David Gibson Subject: Re: [PATCH] vfio powerpc: enabled and supported on powernv platform On Tue, 2012-11-20 at 11:48 +1100, Alexey Kardashevskiy wrote: VFIO implements platform independent stuff such as a PCI driver, BAR access (via read/write on a file descriptor or direct mapping when possible) and IRQ signaling. The platform dependent part includes IOMMU initialization and handling. This patch initializes IOMMU groups based on the IOMMU configuration discovered during the PCI scan, only POWERNV platform is supported at the moment. Also the patch implements an VFIO-IOMMU driver which manages DMA mapping/unmapping requests coming from the client (now QEMU). It also returns a DMA window information to let the guest initialize the device tree for a guest OS properly. Although this driver has been tested only on POWERNV, it should work on any platform supporting TCE tables. To enable VFIO on POWER, enable SPAPR_TCE_IOMMU config option. 
Cc: David Gibson da...@gibson.dropbear.id.au Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru --- arch/powerpc/include/asm/iommu.h |6 + arch/powerpc/kernel/iommu.c | 140 +++ arch/powerpc/platforms/powernv/pci.c | 135 +++ drivers/iommu/Kconfig|8 ++ drivers/vfio/Kconfig |6 + drivers/vfio/Makefile|1 + drivers/vfio/vfio_iommu_spapr_tce.c | 247 ++ include/linux/vfio.h | 20 +++ 8 files changed, 563 insertions(+) create mode 100644 drivers/vfio/vfio_iommu_spapr_tce.c diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h index cbfe678..5ba66cb 100644 --- a/arch/powerpc/include/asm/iommu.h +++ b/arch/powerpc/include/asm/iommu.h @@ -64,30 +64,33 @@ struct iommu_pool { } cacheline_aligned_in_smp; struct iommu_table { unsigned long it_busno; /* Bus number this table belongs to */ unsigned long it_size; /* Size of iommu table in entries */ unsigned long it_offset;/* Offset into global table */ unsigned long it_base; /* mapped address of tce table */ unsigned long it_index; /* which iommu table this is */ unsigned long it_type; /* type: PCI or Virtual Bus */ unsigned long it_blocksize; /* Entries in each block (cacheline) */ unsigned long poolsize; unsigned long nr_pools; struct iommu_pool large_pool; struct iommu_pool pools[IOMMU_NR_POOLS]; unsigned long *it_map; /* A simple allocation bitmap for now */ +#ifdef CONFIG_IOMMU_API + struct iommu_group *it_group; +#endif }; struct scatterlist; static inline void set_iommu_table_base(struct device *dev, void *base) { dev-archdata.dma_data.iommu_table_base = base; } static inline void *get_iommu_table_base(struct device *dev) { return dev-archdata.dma_data.iommu_table_base; } /* Frees table for an individual device node */ @@ -135,17 +138,20 @@ static inline void pci_iommu_init(void) { } extern void alloc_dart_table(void); #if defined(CONFIG_PPC64) defined(CONFIG_PM) static inline void iommu_save(void) { if (ppc_md.iommu_save) ppc_md.iommu_save(); } static inline void iommu_restore(void) { if 
(ppc_md.iommu_restore) ppc_md.iommu_restore(); } #endif +extern long iommu_put_tces(struct iommu_table *tbl, unsigned long entry, uint64_t tce, + enum dma_data_direction direction, unsigned long pages); + #endif /* __KERNEL__ */ #endif /* _ASM_IOMMU_H */ diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c index ff5a6ce..94f614b 100644 --- a/arch/powerpc/kernel/iommu.c +++ b/arch/powerpc/kernel/iommu.c @@ -32,30 +32,31 @@ #include linux/dma-mapping.h #include linux/bitmap.h #include linux/iommu-helper.h #include linux/crash_dump.h #include linux/hash.h #include linux/fault-inject.h #include linux/pci.h #include asm/io.h #include asm/prom.h #include asm/iommu.h #include asm/pci-bridge.h #include asm/machdep.h #include asm/kdump.h #include asm/fadump.h #include asm/vio.h +#include asm/tce.h #define DBG(...)
Re: The smp_affinity cannot work correctly on guest os when PCI passthrough device using msi/msi-x with KVM
On Fri, 2012-11-23 at 11:06 +0800, yi li wrote: Hi Guys, there is an issue where smp_affinity cannot work correctly on the guest OS when a PCI passthrough device uses MSI/MSI-X with KVM. My reasoning: the pcpu will raise a lot of IPI interrupts to find the vcpu that should handle the irq, so the guest OS will VM_EXIT frequently, right? If smp_affinity worked correctly on the guest OS, the best arrangement would be to pin (cputune) the vcpu that handles the irq to the pcpu that handles the kvm:pci-bus irq on the host. But unfortunately, I find that smp_affinity does not work correctly on the guest OS with MSI/MSI-X. How to reproduce: 1: pass through a network card (Broadcom BCM5716S) to the guest OS 2: ifup the card -- it will use MSI-X interrupts by default -- and stop the irqbalance service 3: echo 4 > /proc/irq/NETCARDIRQ/smp_affinity, so we assume vcpu2 handles the irq 4: we have set <vcpupin vcpu='2' cpuset='1'/> and bound the kvm:pci-bus irq to pcpu1 on the host. We think this configuration should reduce the IPI interrupts when injecting the interrupt into the guest OS, but the irq is not handled only on vcpu2. Maybe this is not what we expect. What version of qemu-kvm/qemu are you using? There's been some work recently specifically to enable this. Thanks, Alex
Re: [PATCH 0/1] [PULL] qemu-kvm.git uq/master queue
Marcelo Tosatti mtosa...@redhat.com writes: The following changes since commit 1ccbc2851282564308f790753d7158487b6af8e2: qemu-sockets: Fix parsing of the inet option 'to'. (2012-11-21 12:07:59 +0400) are available in the git repository at: git://git.kernel.org/pub/scm/virt/kvm/qemu-kvm.git uq/master Bruce Rogers (1): Legacy qemu-kvm options have no argument Pulled. Thanks. Regards, Anthony Liguori qemu-options.hx |8 1 files changed, 4 insertions(+), 4 deletions(-)
Re: The smp_affinity cannot work correctly on guest os when PCI passthrough device using msi/msi-x with KVM
hi Alex, the qemu-kvm version is 1.2. Thanks. YiLi 2012/11/26 Alex Williamson alex.william...@redhat.com: On Fri, 2012-11-23 at 11:06 +0800, yi li wrote: Hi Guys, there is an issue where smp_affinity cannot work correctly on the guest OS when a PCI passthrough device uses MSI/MSI-X with KVM. My reasoning: the pcpu will raise a lot of IPI interrupts to find the vcpu that should handle the irq, so the guest OS will VM_EXIT frequently, right? If smp_affinity worked correctly on the guest OS, the best arrangement would be to pin (cputune) the vcpu that handles the irq to the pcpu that handles the kvm:pci-bus irq on the host. But unfortunately, I find that smp_affinity does not work correctly on the guest OS with MSI/MSI-X. How to reproduce: 1: pass through a network card (Broadcom BCM5716S) to the guest OS 2: ifup the card -- it will use MSI-X interrupts by default -- and stop the irqbalance service 3: echo 4 > /proc/irq/NETCARDIRQ/smp_affinity, so we assume vcpu2 handles the irq 4: we have set <vcpupin vcpu='2' cpuset='1'/> and bound the kvm:pci-bus irq to pcpu1 on the host. We think this configuration should reduce the IPI interrupts when injecting the interrupt into the guest OS, but the irq is not handled only on vcpu2. Maybe this is not what we expect. What version of qemu-kvm/qemu are you using? There's been some work recently specifically to enable this. Thanks, Alex
Re: [PATCH v8 1/2] x86/kexec: add a new atomic notifier list for kdump
On Mon, Nov 26, 2012 at 09:08:54AM -0600, Eric W. Biederman wrote: Zhang Yanfei zhangyan...@cn.fujitsu.com writes: This patch adds an atomic notifier list named crash_notifier_list. Currently, when loading kvm-intel module, a notifier will be registered in the list to enable vmcss loaded on all cpus to be VMCLEAR'd if needed. crash_notifier_list ick gag please no. Effectively this makes the kexec on panic code path undebuggable. Instead we need to use direct function calls to whatever you are doing. The code walks a linked list in the kvm-intel module and calls vmclear on whatever it finds there. Since the function has to reside in the kvm-intel module it cannot be called directly. Is a callback pointer that is set by kvm-intel more acceptable? If a direct function call is too complex then the piece of code you want to call is almost certainly too complex to be calling on a code path like this. Eric Signed-off-by: Zhang Yanfei zhangyan...@cn.fujitsu.com --- arch/x86/include/asm/kexec.h |2 ++ arch/x86/kernel/crash.c |9 + 2 files changed, 11 insertions(+), 0 deletions(-) diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h index 317ff17..5e22b00 100644 --- a/arch/x86/include/asm/kexec.h +++ b/arch/x86/include/asm/kexec.h @@ -163,6 +163,8 @@ struct kimage_arch { }; #endif +extern struct atomic_notifier_head crash_notifier_list; + #endif /* __ASSEMBLY__ */ #endif /* _ASM_X86_KEXEC_H */ diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c index 13ad899..c5b2f70 100644 --- a/arch/x86/kernel/crash.c +++ b/arch/x86/kernel/crash.c @@ -16,6 +16,8 @@ #include linux/delay.h #include linux/elf.h #include linux/elfcore.h +#include linux/module.h +#include linux/notifier.h #include asm/processor.h #include asm/hardirq.h @@ -30,6 +32,9 @@ int in_crash_kexec; +ATOMIC_NOTIFIER_HEAD(crash_notifier_list); +EXPORT_SYMBOL_GPL(crash_notifier_list); + #if defined(CONFIG_SMP) && defined(CONFIG_X86_LOCAL_APIC) static void kdump_nmi_callback(int cpu, struct pt_regs *regs) @@
-46,6 +51,8 @@ static void kdump_nmi_callback(int cpu, struct pt_regs *regs) #endif crash_save_cpu(regs, cpu); + atomic_notifier_call_chain(crash_notifier_list, 0, NULL); + /* Disable VMX or SVM if needed. * * We need to disable virtualization on all CPUs. @@ -88,6 +95,8 @@ void native_machine_crash_shutdown(struct pt_regs *regs) kdump_nmi_shootdown_cpus(); + atomic_notifier_call_chain(crash_notifier_list, 0, NULL); + /* Booting kdump kernel with VMX or SVM enabled won't work, * because (among other limitations) we can't disable paging * with the virt flags. -- Gleb.
Re: [PATCH v8 1/2] x86/kexec: add a new atomic notifier list for kdump
Gleb Natapov g...@redhat.com writes: On Mon, Nov 26, 2012 at 09:08:54AM -0600, Eric W. Biederman wrote: Zhang Yanfei zhangyan...@cn.fujitsu.com writes: This patch adds an atomic notifier list named crash_notifier_list. Currently, when loading kvm-intel module, a notifier will be registered in the list to enable vmcss loaded on all cpus to be VMCLEAR'd if needed. crash_notifier_list ick gag please no. Effectively this makes the kexec on panic code path undebuggable. Instead we need to use direct function calls to whatever you are doing. The code walks a linked list in the kvm-intel module and calls vmclear on whatever it finds there. Since the function has to reside in the kvm-intel module it cannot be called directly. Is a callback pointer that is set by kvm-intel more acceptable? Yes a specific callback function is more acceptable. Looking a little deeper vmclear_local_loaded_vmcss is not particularly acceptable. It is doing a lot of work that is unnecessary to save the virtual registers on the kexec on panic path. In fact I wonder if it might not just be easier to call vmcs_clear to a fixed per cpu buffer. Performing list walking in interrupt context without locking in vmclear_local_loaded_vmcss looks a bit scary. Not that locking would make it any better, as locking would simply add one more way to deadlock the system. Only an rcu list walk is at all safe. A list walk that modifies the list as vmclear_local_loaded_vmcss does is definitely not safe. Eric
Re: [PATCH v8 1/2] x86/kexec: add a new atomic notifier list for kdump
On Mon, Nov 26, 2012 at 11:43:10AM -0600, Eric W. Biederman wrote: Gleb Natapov g...@redhat.com writes: On Mon, Nov 26, 2012 at 09:08:54AM -0600, Eric W. Biederman wrote: Zhang Yanfei zhangyan...@cn.fujitsu.com writes: This patch adds an atomic notifier list named crash_notifier_list. Currently, when loading kvm-intel module, a notifier will be registered in the list to enable vmcss loaded on all cpus to be VMCLEAR'd if needed. crash_notifier_list ick gag please no. Effectively this makes the kexec on panic code path undebuggable. Instead we need to use direct function calls to whatever you are doing. The code walks a linked list in the kvm-intel module and calls vmclear on whatever it finds there. Since the function has to reside in the kvm-intel module it cannot be called directly. Is a callback pointer that is set by kvm-intel more acceptable? Yes a specific callback function is more acceptable. Looking a little deeper vmclear_local_loaded_vmcss is not particularly acceptable. It is doing a lot of work that is unnecessary to save the virtual registers on the kexec on panic path. What work are you referring to in particular that may not be acceptable? In fact I wonder if it might not just be easier to call vmcs_clear to a fixed per cpu buffer. There may be more than one vmcs loaded on a cpu, hence the list. Performing list walking in interrupt context without locking in vmclear_local_loaded_vmcss looks a bit scary. Not that locking would make it any better, as locking would simply add one more way to deadlock the system. Only an rcu list walk is at all safe. A list walk that modifies the list as vmclear_local_loaded_vmcss does is definitely not safe. The list vmclear_local_loaded_vmcss walks is per cpu. Zhang's kvm patch disables the kexec callback while the list is modified. -- Gleb.
Re: [PATCH] vfio powerpc: enabled and supported on powernv platform
On Mon, 2012-11-26 at 08:18 -0700, Alex Williamson wrote: On Fri, 2012-11-23 at 13:02 +1100, Alexey Kardashevskiy wrote: On 22/11/12 22:56, Sethi Varun-B16395 wrote: -Original Message- From: linux-kernel-ow...@vger.kernel.org [mailto:linux-kernel- ow...@vger.kernel.org] On Behalf Of Alex Williamson Sent: Tuesday, November 20, 2012 11:50 PM To: Alexey Kardashevskiy Cc: Benjamin Herrenschmidt; Paul Mackerras; linuxppc- d...@lists.ozlabs.org; linux-ker...@vger.kernel.org; kvm@vger.kernel.org; David Gibson Subject: Re: [PATCH] vfio powerpc: enabled and supported on powernv platform On Tue, 2012-11-20 at 11:48 +1100, Alexey Kardashevskiy wrote: VFIO implements platform independent stuff such as a PCI driver, BAR access (via read/write on a file descriptor or direct mapping when possible) and IRQ signaling. The platform dependent part includes IOMMU initialization and handling. This patch initializes IOMMU groups based on the IOMMU configuration discovered during the PCI scan, only POWERNV platform is supported at the moment. Also the patch implements an VFIO-IOMMU driver which manages DMA mapping/unmapping requests coming from the client (now QEMU). It also returns a DMA window information to let the guest initialize the device tree for a guest OS properly. Although this driver has been tested only on POWERNV, it should work on any platform supporting TCE tables. To enable VFIO on POWER, enable SPAPR_TCE_IOMMU config option. 
Cc: David Gibson da...@gibson.dropbear.id.au Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru --- arch/powerpc/include/asm/iommu.h |6 + arch/powerpc/kernel/iommu.c | 140 +++ arch/powerpc/platforms/powernv/pci.c | 135 +++ drivers/iommu/Kconfig|8 ++ drivers/vfio/Kconfig |6 + drivers/vfio/Makefile|1 + drivers/vfio/vfio_iommu_spapr_tce.c | 247 ++ include/linux/vfio.h | 20 +++ 8 files changed, 563 insertions(+) create mode 100644 drivers/vfio/vfio_iommu_spapr_tce.c diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h index cbfe678..5ba66cb 100644 --- a/arch/powerpc/include/asm/iommu.h +++ b/arch/powerpc/include/asm/iommu.h @@ -64,30 +64,33 @@ struct iommu_pool { } cacheline_aligned_in_smp; struct iommu_table { unsigned long it_busno; /* Bus number this table belongs to */ unsigned long it_size; /* Size of iommu table in entries */ unsigned long it_offset;/* Offset into global table */ unsigned long it_base; /* mapped address of tce table */ unsigned long it_index; /* which iommu table this is */ unsigned long it_type; /* type: PCI or Virtual Bus */ unsigned long it_blocksize; /* Entries in each block (cacheline) */ unsigned long poolsize; unsigned long nr_pools; struct iommu_pool large_pool; struct iommu_pool pools[IOMMU_NR_POOLS]; unsigned long *it_map; /* A simple allocation bitmap for now */ +#ifdef CONFIG_IOMMU_API + struct iommu_group *it_group; +#endif }; struct scatterlist; static inline void set_iommu_table_base(struct device *dev, void *base) { dev-archdata.dma_data.iommu_table_base = base; } static inline void *get_iommu_table_base(struct device *dev) { return dev-archdata.dma_data.iommu_table_base; } /* Frees table for an individual device node */ @@ -135,17 +138,20 @@ static inline void pci_iommu_init(void) { } extern void alloc_dart_table(void); #if defined(CONFIG_PPC64) defined(CONFIG_PM) static inline void iommu_save(void) { if (ppc_md.iommu_save) ppc_md.iommu_save(); } static inline void iommu_restore(void) { if 
(ppc_md.iommu_restore) ppc_md.iommu_restore(); } #endif +extern long iommu_put_tces(struct iommu_table *tbl, unsigned long entry, uint64_t tce, + enum dma_data_direction direction, unsigned long pages); + #endif /* __KERNEL__ */ #endif /* _ASM_IOMMU_H */ diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c index ff5a6ce..94f614b 100644 --- a/arch/powerpc/kernel/iommu.c +++ b/arch/powerpc/kernel/iommu.c @@ -32,30 +32,31 @@ #include linux/dma-mapping.h #include linux/bitmap.h #include linux/iommu-helper.h #include linux/crash_dump.h #include linux/hash.h #include linux/fault-inject.h #include linux/pci.h #include asm/io.h #include asm/prom.h #include asm/iommu.h #include
Re: [PATCH v8 1/2] x86/kexec: add a new atomic notifier list for kdump
Gleb Natapov g...@redhat.com writes: On Mon, Nov 26, 2012 at 11:43:10AM -0600, Eric W. Biederman wrote: Gleb Natapov g...@redhat.com writes: On Mon, Nov 26, 2012 at 09:08:54AM -0600, Eric W. Biederman wrote: Zhang Yanfei zhangyan...@cn.fujitsu.com writes: This patch adds an atomic notifier list named crash_notifier_list. Currently, when loading kvm-intel module, a notifier will be registered in the list to enable vmcss loaded on all cpus to be VMCLEAR'd if needed. crash_notifier_list ick gag please no. Effectively this makes the kexec on panic code path undebuggable. Instead we need to use direct function calls to whatever you are doing. The code walks linked list in kvm-intel module and calls vmclear on whatever it finds there. Since the function have to resides in kvm-intel module it cannot be called directly. Is callback pointer that is set by kvm-intel more acceptable? Yes a specific callback function is more acceptable. Looking a little deeper vmclear_local_loaded_vmcss is not particularly acceptable. It is doing a lot of work that is unnecessary to save the virtual registers on the kexec on panic path. What work are you referring to in particular that may not be acceptable? The unnecessary work that I was see is all of the software state changing. Unlinking things from linked lists flipping variables. None of that appears related to the fundamental issue saving cpu state. Simply reusing a function that does more than what is strictly required makes me nervous. What is the chance that the function will grow with maintenance and add constructs that are not safe in a kexec on panic situtation. In fact I wonder if it might not just be easier to call vmcs_clear to a fixed per cpu buffer. There may be more than one vmcs loaded on a cpu, hence the list. Performing list walking in interrupt context without locking in vmclear_local_loaded vmcss looks a bit scary. 
Not that locking would make it any better, as locking would simply add one more way to deadlock the system. Only an rcu list walk is at all safe. A list walk that modifies the list as vmclear_local_loaded_vmcss does is definitely not safe. The list vmclear_local_loaded_vmcss walks is per cpu. Zhang's kvm patch disables the kexec callback while the list is modified. If the list is only modified on its cpu and we are running on that cpu that does look like it will give the necessary protections. It isn't particularly clear at first glance that this is the case, unfortunately. Eric
Re: [PATCH 1/2] vfio powerpc: implemented IOMMU driver for VFIO
On Fri, 2012-11-23 at 20:03 +1100, Alexey Kardashevskiy wrote: VFIO implements platform independent stuff such as a PCI driver, BAR access (via read/write on a file descriptor or direct mapping when possible) and IRQ signaling. The platform dependent part includes IOMMU initialization and handling. This patch implements an IOMMU driver for VFIO which does mapping/unmapping pages for the guest IO and provides information about DMA window (required by a POWERPC guest). The counterpart in QEMU is required to support this functionality. Cc: David Gibson da...@gibson.dropbear.id.au Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru --- drivers/vfio/Kconfig|6 + drivers/vfio/Makefile |1 + drivers/vfio/vfio_iommu_spapr_tce.c | 247 +++ include/linux/vfio.h| 20 +++ 4 files changed, 274 insertions(+) create mode 100644 drivers/vfio/vfio_iommu_spapr_tce.c diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig index 7cd5dec..b464687 100644 --- a/drivers/vfio/Kconfig +++ b/drivers/vfio/Kconfig @@ -3,10 +3,16 @@ config VFIO_IOMMU_TYPE1 depends on VFIO default n +config VFIO_IOMMU_SPAPR_TCE + tristate + depends on VFIO SPAPR_TCE_IOMMU + default n + menuconfig VFIO tristate VFIO Non-Privileged userspace driver framework depends on IOMMU_API select VFIO_IOMMU_TYPE1 if X86 + select VFIO_IOMMU_SPAPR_TCE if PPC_POWERNV help VFIO provides a framework for secure userspace device drivers. See Documentation/vfio.txt for more details. 
diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile index 2398d4a..72bfabc 100644 --- a/drivers/vfio/Makefile +++ b/drivers/vfio/Makefile @@ -1,3 +1,4 @@ obj-$(CONFIG_VFIO) += vfio.o obj-$(CONFIG_VFIO_IOMMU_TYPE1) += vfio_iommu_type1.o +obj-$(CONFIG_VFIO_IOMMU_SPAPR_TCE) += vfio_iommu_spapr_tce.o obj-$(CONFIG_VFIO_PCI) += pci/ diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c new file mode 100644 index 000..46a6298 --- /dev/null +++ b/drivers/vfio/vfio_iommu_spapr_tce.c @@ -0,0 +1,247 @@ +/* + * VFIO: IOMMU DMA mapping support for TCE on POWER + * + * Copyright (C) 2012 IBM Corp. All rights reserved. + * Author: Alexey Kardashevskiy a...@ozlabs.ru + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 as + * published by the Free Software Foundation. + * + * Derived from original vfio_iommu_type1.c: + * Copyright (C) 2012 Red Hat, Inc. All rights reserved. + * Author: Alex Williamson alex.william...@redhat.com + */ + +#include linux/module.h +#include linux/pci.h +#include linux/slab.h +#include linux/uaccess.h +#include linux/err.h +#include linux/vfio.h +#include asm/iommu.h + +#define DRIVER_VERSION 0.1 +#define DRIVER_AUTHOR a...@ozlabs.ru +#define DRIVER_DESC VFIO IOMMU SPAPR TCE + +static void tce_iommu_detach_group(void *iommu_data, + struct iommu_group *iommu_group); + +/* + * VFIO IOMMU fd for SPAPR_TCE IOMMU implementation + */ + +/* + * The container descriptor supports only a single group per container. + * Required by the API as the container is not supplied with the IOMMU group + * at the moment of initialization. 
+ */ +struct tce_container { + struct mutex lock; + struct iommu_table *tbl; +}; + +static void *tce_iommu_open(unsigned long arg) +{ + struct tce_container *container; + + if (arg != VFIO_SPAPR_TCE_IOMMU) { + printk(KERN_ERR tce_vfio: Wrong IOMMU type\n); + return ERR_PTR(-EINVAL); + } + + container = kzalloc(sizeof(*container), GFP_KERNEL); + if (!container) + return ERR_PTR(-ENOMEM); + + mutex_init(container-lock); + + return container; +} + +static void tce_iommu_release(void *iommu_data) +{ + struct tce_container *container = iommu_data; + + WARN_ON(container-tbl !container-tbl-it_group); I think your patch ordering is backwards here. it_group isn't added until 2/2. I'd really like to see the arch/powerpc code approved and merged by the powerpc maintainer before we add the code that makes use of it into vfio. Otherwise we just get lots of churn if interfaces change or they disapprove of it altogether. + if (container-tbl container-tbl-it_group) + tce_iommu_detach_group(iommu_data, container-tbl-it_group); + + mutex_destroy(container-lock); + + kfree(container); +} + +static long tce_iommu_ioctl(void *iommu_data, + unsigned int cmd, unsigned long arg) +{ + struct tce_container *container = iommu_data; + unsigned long minsz; + + switch (cmd) { + case VFIO_CHECK_EXTENSION: { + return (arg == VFIO_SPAPR_TCE_IOMMU) ? 1 : 0; + }
[PATCH V3 0/2] Resend - IA32_TSC_ADJUST support for KVM
Resending these as the mail seems to have not fully worked last Wed. Marcelo, I have addressed your comments for this patch set (V3); the following patch for QEMU-KVM and one adding a test case for tsc_adjust will also follow today. Thanks, Will Will Auld (2): Add code to track call origin for msr assignment. Enabling IA32_TSC_ADJUST for KVM guest VM support arch/x86/include/asm/cpufeature.h | 1 + arch/x86/include/asm/kvm_host.h | 15 ++--- arch/x86/include/asm/msr-index.h | 1 + arch/x86/kvm/cpuid.c | 2 ++ arch/x86/kvm/cpuid.h | 8 +++ arch/x86/kvm/svm.c| 28 ++-- arch/x86/kvm/vmx.c| 33 ++-- arch/x86/kvm/x86.c| 45 +-- arch/x86/kvm/x86.h| 2 +- 9 files changed, 112 insertions(+), 23 deletions(-) -- 1.8.0.rc0
[PATCH V3 1/2] Resend - Add code to track call origin for msr assignment.
In order to track who initiated the call (host or guest) to modify an msr value I have changed function call parameters along the call path. The specific change is to add a struct pointer parameter that points to (index, data, caller) information rather than having this information passed as individual parameters. The initial use for this capability is for updating the IA32_TSC_ADJUST msr while setting the tsc value. It is anticipated that this capability is useful for other tasks. Signed-off-by: Will Auld will.a...@intel.com --- arch/x86/include/asm/kvm_host.h | 12 +--- arch/x86/kvm/svm.c | 21 +++-- arch/x86/kvm/vmx.c | 24 +--- arch/x86/kvm/x86.c | 23 +-- arch/x86/kvm/x86.h | 2 +- 5 files changed, 59 insertions(+), 23 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 09155d6..da34027 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -598,6 +598,12 @@ struct kvm_vcpu_stat { struct x86_instruction_info; +struct msr_data { +bool host_initiated; +u32 index; +u64 data; +}; + struct kvm_x86_ops { int (*cpu_has_kvm_support)(void); /* __init */ int (*disabled_by_bios)(void); /* __init */ @@ -621,7 +627,7 @@ struct kvm_x86_ops { void (*set_guest_debug)(struct kvm_vcpu *vcpu, struct kvm_guest_debug *dbg); int (*get_msr)(struct kvm_vcpu *vcpu, u32 msr_index, u64 *pdata); - int (*set_msr)(struct kvm_vcpu *vcpu, u32 msr_index, u64 data); + int (*set_msr)(struct kvm_vcpu *vcpu, struct msr_data *msr); u64 (*get_segment_base)(struct kvm_vcpu *vcpu, int seg); void (*get_segment)(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg); @@ -772,7 +778,7 @@ static inline int emulate_instruction(struct kvm_vcpu *vcpu, void kvm_enable_efer_bits(u64); int kvm_get_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 *data); -int kvm_set_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 data); +int kvm_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr); struct x86_emulate_ctxt; @@ -799,7 +805,7 @@ void
[PATCH V3 2/2] Resend - Enabling IA32_TSC_ADJUST for KVM guest VM support
CPUID.7.0.EBX[1]=1 indicates IA32_TSC_ADJUST MSR 0x3b is supported. The basic design is to emulate the MSR by allowing reads and writes to a guest-vcpu-specific location that stores the value of the emulated MSR, while adding that value to the vmcs tsc_offset. In this way the IA32_TSC_ADJUST value will be included in all reads of the TSC MSR, whether through rdmsr or rdtsc. This is of course as long as the "use TSC offsetting" VM-execution control is enabled, as well as the IA32_TSC_ADJUST control. However, because hardware will only return TSC + IA32_TSC_ADJUST + vmcs tsc_offset for a guest process when it does an rdtsc (with the correct settings), the value of our virtualized IA32_TSC_ADJUST must be stored in one of these three locations. The argument against storing it in the actual MSR is performance: this MSR is likely to be seldom used, while the save/restore would be required on every transition. IA32_TSC_ADJUST was created as a way to solve some issues with writing the TSC itself, so that is not an option either. The remaining option, defined above as our solution, has the problem of returning incorrect vmcs tsc_offset values (unless we intercept and fix them, not done here) as mentioned above. More problematic, however, is that storing the data in vmcs tsc_offset would have a different semantic effect on the system than using the actual MSR. This is illustrated by the following example: the hypervisor sets IA32_TSC_ADJUST, then the guest sets it, and a guest process performs an rdtsc. In this case the guest process will get TSC + IA32_TSC_ADJUST_hypervisor + vmcs tsc_offset, the latter including IA32_TSC_ADJUST_guest. While the total system semantics change, the semantics as seen by the guest do not, and hence this will not cause a problem.
Signed-off-by: Will Auld will.a...@intel.com --- arch/x86/include/asm/cpufeature.h | 1 + arch/x86/include/asm/kvm_host.h | 3 +++ arch/x86/include/asm/msr-index.h | 1 + arch/x86/kvm/cpuid.c | 2 ++ arch/x86/kvm/cpuid.h | 8 arch/x86/kvm/svm.c| 7 +++ arch/x86/kvm/vmx.c| 9 + arch/x86/kvm/x86.c| 22 ++ 8 files changed, 53 insertions(+) diff --git a/arch/x86/include/asm/cpufeature.h b/arch/x86/include/asm/cpufeature.h index 6b7ee5f..e574d81 100644 --- a/arch/x86/include/asm/cpufeature.h +++ b/arch/x86/include/asm/cpufeature.h @@ -199,6 +199,7 @@ /* Intel-defined CPU features, CPUID level 0x0007:0 (ebx), word 9 */ #define X86_FEATURE_FSGSBASE (9*32+ 0) /* {RD/WR}{FS/GS}BASE instructions*/ +#define X86_FEATURE_TSC_ADJUST (9*32+ 1) /* TSC adjustment MSR 0x3b */ #define X86_FEATURE_BMI1 (9*32+ 3) /* 1st group bit manipulation extensions */ #define X86_FEATURE_HLE(9*32+ 4) /* Hardware Lock Elision */ #define X86_FEATURE_AVX2 (9*32+ 5) /* AVX2 instructions */ diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index da34027..cf8c7e0 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -442,6 +442,8 @@ struct kvm_vcpu_arch { u32 virtual_tsc_mult; u32 virtual_tsc_khz; + s64 ia32_tsc_adjust_msr; + atomic_t nmi_queued; /* unprocessed asynchronous NMIs */ unsigned nmi_pending; /* NMI queued after currently running handler */ bool nmi_injected;/* Trying to inject an NMI this entry */ @@ -690,6 +692,7 @@ struct kvm_x86_ops { bool (*has_wbinvd_exit)(void); void (*set_tsc_khz)(struct kvm_vcpu *vcpu, u32 user_tsc_khz, bool scale); + u64 (*read_tsc_offset)(struct kvm_vcpu *vcpu); void (*write_tsc_offset)(struct kvm_vcpu *vcpu, u64 offset); u64 (*compute_tsc_offset)(struct kvm_vcpu *vcpu, u64 target_tsc); diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h index 957ec87..6486569 100644 --- a/arch/x86/include/asm/msr-index.h +++ b/arch/x86/include/asm/msr-index.h @@ -231,6 +231,7 @@ #define 
MSR_IA32_EBL_CR_POWERON0x002a #define MSR_EBC_FREQUENCY_ID 0x002c #define MSR_IA32_FEATURE_CONTROL0x003a +#define MSR_IA32_TSC_ADJUST 0x003b #define FEATURE_CONTROL_LOCKED (10) #define FEATURE_CONTROL_VMXON_ENABLED_INSIDE_SMX (11) diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index 0595f13..e817bac 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -320,6 +320,8 @@ static int do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function, if (index == 0) { entry-ebx = kvm_supported_word9_x86_features; cpuid_mask(entry-ebx, 9); + // TSC_ADJUST is emulated + entry-ebx |= F(TSC_ADJUST); } else entry-ebx = 0; entry-eax = 0; diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h index
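[Editor's note] The folding scheme described in the commit message can be illustrated with a small stand-alone model. This is a sketch only, not the KVM code: the struct and function names are invented, and the vmcs tsc_offset is reduced to a plain field. It shows why a guest rdtsc automatically observes a write to the emulated IA32_TSC_ADJUST once the delta has been added to the offset.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model (not KVM code) of the patch's scheme: the virtualized
 * IA32_TSC_ADJUST value lives in a per-vcpu field, and every guest
 * write folds the delta into the vmcs tsc_offset, so guest rdtsc
 * picks it up with no extra work on the read path. */
struct toy_vcpu {
    int64_t ia32_tsc_adjust_msr; /* emulated MSR value */
    int64_t tsc_offset;          /* stands in for the vmcs tsc_offset */
};

/* Guest wrmsr to IA32_TSC_ADJUST: update both the stored MSR value
 * and the offset by the same delta. */
static void toy_set_tsc_adjust(struct toy_vcpu *v, int64_t val)
{
    int64_t delta = val - v->ia32_tsc_adjust_msr;

    v->tsc_offset += delta;
    v->ia32_tsc_adjust_msr = val;
}

/* What a guest rdtsc returns under "use TSC offsetting". */
static uint64_t toy_guest_rdtsc(const struct toy_vcpu *v, uint64_t host_tsc)
{
    return host_tsc + (uint64_t)v->tsc_offset;
}
```

With this arrangement the emulated MSR value is still readable on rdmsr (from ia32_tsc_adjust_msr), while rdtsc sees it via the offset — the two views the commit message argues must stay consistent.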
[PATCH V2] Resend - Enabling IA32_TSC_ADJUST for Qemu KVM guest VMs
CPUID.7.0.EBX[1]=1 indicates IA32_TSC_ADJUST MSR 0x3b is supported Basic design is to emulate the MSR by allowing reads and writes to the hypervisor vcpu specific locations to store the value of the emulated MSRs. In this way the IA32_TSC_ADJUST value will be included in all reads to the TSC MSR whether through rdmsr or rdtsc. As this is a new MSR that the guest may access and modify its value needs to be migrated along with the other MRSs. The changes here are specifically for recognizing when IA32_TSC_ADJUST is enabled in CPUID and code added for migrating its value. Signed-off-by: Will Auld will.a...@intel.com --- target-i386/cpu.h | 2 ++ target-i386/kvm.c | 15 +++ target-i386/machine.c | 21 + 3 files changed, 38 insertions(+) diff --git a/target-i386/cpu.h b/target-i386/cpu.h index aabf993..13d4152 100644 --- a/target-i386/cpu.h +++ b/target-i386/cpu.h @@ -284,6 +284,7 @@ #define MSR_IA32_APICBASE_BSP (18) #define MSR_IA32_APICBASE_ENABLE(111) #define MSR_IA32_APICBASE_BASE (0xf12) +#define MSR_TSC_ADJUST 0x003b #define MSR_IA32_TSCDEADLINE0x6e0 #define MSR_MTRRcap0xfe @@ -701,6 +702,7 @@ typedef struct CPUX86State { uint64_t async_pf_en_msr; uint64_t tsc; +uint64_t tsc_adjust; uint64_t tsc_deadline; uint64_t mcg_status; diff --git a/target-i386/kvm.c b/target-i386/kvm.c index 696b14a..e974c42 100644 --- a/target-i386/kvm.c +++ b/target-i386/kvm.c @@ -61,6 +61,7 @@ const KVMCapabilityInfo kvm_arch_required_capabilities[] = { static bool has_msr_star; static bool has_msr_hsave_pa; +static bool has_msr_tsc_adjust; static bool has_msr_tsc_deadline; static bool has_msr_async_pf_en; static bool has_msr_misc_enable; @@ -641,6 +642,10 @@ static int kvm_get_supported_msrs(KVMState *s) has_msr_hsave_pa = true; continue; } +if (kvm_msr_list-indices[i] == MSR_TSC_ADJUST) { +has_msr_tsc_adjust = true; +continue; +} if (kvm_msr_list-indices[i] == MSR_IA32_TSCDEADLINE) { has_msr_tsc_deadline = true; continue; @@ -978,6 +983,10 @@ static int kvm_put_msrs(CPUX86State *env, int 
level) if (has_msr_hsave_pa) { kvm_msr_entry_set(msrs[n++], MSR_VM_HSAVE_PA, env-vm_hsave); } +if (has_msr_tsc_adjust) { +kvm_msr_entry_set(msrs[n++], + MSR_TSC_ADJUST, env-tsc_adjust); +} if (has_msr_tsc_deadline) { kvm_msr_entry_set(msrs[n++], MSR_IA32_TSCDEADLINE, env-tsc_deadline); } @@ -1234,6 +1243,9 @@ static int kvm_get_msrs(CPUX86State *env) if (has_msr_hsave_pa) { msrs[n++].index = MSR_VM_HSAVE_PA; } +if (has_msr_tsc_adjust) { +msrs[n++].index = MSR_TSC_ADJUST; +} if (has_msr_tsc_deadline) { msrs[n++].index = MSR_IA32_TSCDEADLINE; } @@ -1308,6 +1320,9 @@ static int kvm_get_msrs(CPUX86State *env) case MSR_IA32_TSC: env-tsc = msrs[i].data; break; +case MSR_TSC_ADJUST: +env-tsc_adjust = msrs[i].data; +break; case MSR_IA32_TSCDEADLINE: env-tsc_deadline = msrs[i].data; break; diff --git a/target-i386/machine.c b/target-i386/machine.c index a8be058..95bda9b 100644 --- a/target-i386/machine.c +++ b/target-i386/machine.c @@ -310,6 +310,24 @@ static const VMStateDescription vmstate_fpop_ip_dp = { } }; +static bool tsc_adjust_needed(void *opaque) +{ +CPUX86State *cpu = opaque; + +return cpu-tsc_adjust != 0; +} + +static const VMStateDescription vmstate_msr_tsc_adjust = { +.name = cpu/msr_tsc_adjust, +.version_id = 1, +.minimum_version_id = 1, +.minimum_version_id_old = 1, +.fields = (VMStateField []) { +VMSTATE_UINT64(tsc_adjust, CPUX86State), +VMSTATE_END_OF_LIST() +} +}; + static bool tscdeadline_needed(void *opaque) { CPUX86State *env = opaque; @@ -457,6 +475,9 @@ static const VMStateDescription vmstate_cpu = { .vmsd = vmstate_fpop_ip_dp, .needed = fpop_ip_dp_needed, }, { +.vmsd = vmstate_msr_tsc_adjust, +.needed = tsc_adjust_needed, +}, { .vmsd = vmstate_msr_tscdeadline, .needed = tscdeadline_needed, }, { -- 1.8.0.rc0 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH V3 1/2] Resend - Add code to track call origin for msr assignment.
Comments are still not addressed. On Mon, Nov 26, 2012 at 10:40:51AM -0800, Will Auld wrote: In order to track who initiated the call (host or guest) to modify an msr value I have changed function call parameters along the call path. The specific change is to add a struct pointer parameter that points to (index, data, caller) information rather than having this information passed as individual parameters. The initial use for this capability is for updating the IA32_TSC_ADJUST msr while setting the tsc value. It is anticipated that this capability is useful other tasks. Signed-off-by: Will Auld will.a...@intel.com --- arch/x86/include/asm/kvm_host.h | 12 +--- arch/x86/kvm/svm.c | 21 +++-- arch/x86/kvm/vmx.c | 24 +--- arch/x86/kvm/x86.c | 23 +-- arch/x86/kvm/x86.h | 2 +- 5 files changed, 59 insertions(+), 23 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 09155d6..da34027 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -598,6 +598,12 @@ struct kvm_vcpu_stat { struct x86_instruction_info; +struct msr_data { +bool host_initiated; +u32 index; +u64 data; +}; + struct kvm_x86_ops { int (*cpu_has_kvm_support)(void); /* __init */ int (*disabled_by_bios)(void); /* __init */ @@ -621,7 +627,7 @@ struct kvm_x86_ops { void (*set_guest_debug)(struct kvm_vcpu *vcpu, struct kvm_guest_debug *dbg); int (*get_msr)(struct kvm_vcpu *vcpu, u32 msr_index, u64 *pdata); - int (*set_msr)(struct kvm_vcpu *vcpu, u32 msr_index, u64 data); + int (*set_msr)(struct kvm_vcpu *vcpu, struct msr_data *msr); u64 (*get_segment_base)(struct kvm_vcpu *vcpu, int seg); void (*get_segment)(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg); @@ -772,7 +778,7 @@ static inline int emulate_instruction(struct kvm_vcpu *vcpu, void kvm_enable_efer_bits(u64); int kvm_get_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 *data); -int kvm_set_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 data); +int kvm_set_msr(struct kvm_vcpu 
*vcpu, struct msr_data *msr); struct x86_emulate_ctxt; @@ -799,7 +805,7 @@ void kvm_get_cs_db_l_bits(struct kvm_vcpu *vcpu, int *db, int *l); int kvm_set_xcr(struct kvm_vcpu *vcpu, u32 index, u64 xcr); int kvm_get_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata); -int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data); +int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr); unsigned long kvm_get_rflags(struct kvm_vcpu *vcpu); void kvm_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags); diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index baead95..5ac11f0 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -1211,6 +1211,7 @@ static struct kvm_vcpu *svm_create_vcpu(struct kvm *kvm, unsigned int id) struct page *msrpm_pages; struct page *hsave_page; struct page *nested_msrpm_pages; + struct msr_data msr; int err; svm = kmem_cache_zalloc(kvm_vcpu_cache, GFP_KERNEL); @@ -1255,7 +1256,10 @@ static struct kvm_vcpu *svm_create_vcpu(struct kvm *kvm, unsigned int id) svm-vmcb_pa = page_to_pfn(page) PAGE_SHIFT; svm-asid_generation = 0; init_vmcb(svm); - kvm_write_tsc(svm-vcpu, 0); + msr.data = 0x0; + msr.index = MSR_IA32_TSC; + msr.host_initiated = true; + kvm_write_tsc(svm-vcpu, msr); err = fx_init(svm-vcpu); if (err) @@ -3147,13 +3151,15 @@ static int svm_set_vm_cr(struct kvm_vcpu *vcpu, u64 data) return 0; } -static int svm_set_msr(struct kvm_vcpu *vcpu, unsigned ecx, u64 data) +static int svm_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr) { struct vcpu_svm *svm = to_svm(vcpu); + u32 ecx = msr-index; + u64 data = msr-data; switch (ecx) { case MSR_IA32_TSC: - kvm_write_tsc(vcpu, data); + kvm_write_tsc(vcpu, msr); break; case MSR_STAR: svm-vmcb-save.star = data; @@ -3208,20 +3214,23 @@ static int svm_set_msr(struct kvm_vcpu *vcpu, unsigned ecx, u64 data) vcpu_unimpl(vcpu, unimplemented wrmsr: 0x%x data 0x%llx\n, ecx, data); break; default: - return kvm_set_msr_common(vcpu, ecx, data); + return kvm_set_msr_common(vcpu, 
msr); } return 0; } static int wrmsr_interception(struct vcpu_svm *svm) { + struct msr_data msr; u32 ecx = svm-vcpu.arch.regs[VCPU_REGS_RCX]; u64 data = (svm-vcpu.arch.regs[VCPU_REGS_RAX] -1u) | ((u64)(svm-vcpu.arch.regs[VCPU_REGS_RDX] -1u) 32); - + msr.data = data; + msr.index
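[Editor's note] The shape of the change in this patch can be shown outside the kernel. The sketch below mirrors the msr_data struct the patch introduces (the surrounding vcpu struct and MSR constant are invented for illustration) and shows the point of the new parameter: a single set_msr path can now tell host-side writes, such as a migration restore, apart from guest wrmsr.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mirrors the msr_data struct introduced by the patch. */
struct msr_data {
    bool host_initiated;
    uint32_t index;
    uint64_t data;
};

#define TOY_MSR_IA32_TSC 0x10 /* illustrative index, not the real one */

struct toy_vcpu {
    uint64_t tsc;
    int guest_tsc_writes; /* count only writes that came from the guest */
};

/* One common write path that can branch on the caller's origin,
 * which is exactly what the (index, data, caller) struct enables. */
static int toy_set_msr(struct toy_vcpu *v, const struct msr_data *msr)
{
    if (msr->index != TOY_MSR_IA32_TSC)
        return -1; /* unhandled in this sketch */

    v->tsc = msr->data;
    if (!msr->host_initiated)
        v->guest_tsc_writes++; /* e.g. adjust IA32_TSC_ADJUST here */
    return 0;
}
```

In the real patch the branch is used so that a host-initiated TSC write (vcpu creation, migration) does not disturb the guest-visible IA32_TSC_ADJUST value, while a guest wrmsr does.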
Re: [PATCH v2 00/18] KVM for MIPS32 Processors
I have several general questions about this patch... On 11/21/2012 06:33 PM, Sanjay Lal wrote: The following patchset implements KVM support for MIPS32R2 processors, using Trap & Emulate, with basic runtime binary translation to improve performance. The goal has been to keep the Guest kernel changes to a minimum. What is the point of minimizing guest kernel changes? Because you are using an invented memory map, instead of the architecturally defined map, there is no hope of running a single kernel image both natively and as a guest. So why do you care how many changes there are? The patch is against Linux 3.7-rc6. This is Version 2 of the patch set. There is a companion patchset for QEMU that adds KVM support for the MIPS target. KVM/MIPS should support MIPS32-R2 processors and beyond. It has been tested on the following platforms: - Malta Board with FPGA based 34K (Little Endian). - Sigma Designs TangoX board with a 24K based 8654 SoC (Little Endian). - Malta Board with 74K @ 1GHz (Little Endian). - OVPSim MIPS simulator from Imperas emulating a Malta board with 24Kc and 1074Kc cores (Little Endian). Unlike x86, there is no concept of a canonical MIPS system for you to implement. So the choice of emulating a Malta or one of the SigmaDesigns boards doesn't seem to me to give you anything. Why not just define the guest system to be exactly the facilities provided by the VirtIO drivers? [...] Perhaps it is obvious from the patches, but I wasn't able to figure out how you solve the problem of the Root/Host kernel clobbering the K0 and K1 registers in its exception handlers. These registers are also used by the Guest kernel (aren't they)? David Daney
Re: [Qemu-devel] [PATCH V2] Resend - Enabling IA32_TSC_ADJUST for Qemu KVM guest VMs
Hello, Am 26.11.2012 19:42, schrieb Will Auld: CPUID.7.0.EBX[1]=1 indicates IA32_TSC_ADJUST MSR 0x3b is supported Basic design is to emulate the MSR by allowing reads and writes to the hypervisor vcpu specific locations to store the value of the emulated MSRs. In this way the IA32_TSC_ADJUST value will be included in all reads to the TSC MSR whether through rdmsr or rdtsc. As this is a new MSR that the guest may access and modify its value needs to be migrated along with the other MRSs. The changes here are specifically for recognizing when IA32_TSC_ADJUST is enabled in CPUID and code added for migrating its value. Signed-off-by: Will Auld will.a...@intel.com $subject should get a prefix of target-i386: and resend is better used inside a tag so that it doesn't end up in the commit. And it's QEMU. ;) Some more stylistic issues inline: --- target-i386/cpu.h | 2 ++ target-i386/kvm.c | 15 +++ target-i386/machine.c | 21 + 3 files changed, 38 insertions(+) diff --git a/target-i386/cpu.h b/target-i386/cpu.h index aabf993..13d4152 100644 --- a/target-i386/cpu.h +++ b/target-i386/cpu.h @@ -284,6 +284,7 @@ #define MSR_IA32_APICBASE_BSP (18) #define MSR_IA32_APICBASE_ENABLE(111) #define MSR_IA32_APICBASE_BASE (0xf12) +#define MSR_TSC_ADJUST 0x003b Tabs. You can use scripts/checkpatch.pl to verify. 
#define MSR_IA32_TSCDEADLINE0x6e0 #define MSR_MTRRcap 0xfe @@ -701,6 +702,7 @@ typedef struct CPUX86State { uint64_t async_pf_en_msr; uint64_t tsc; +uint64_t tsc_adjust; uint64_t tsc_deadline; uint64_t mcg_status; diff --git a/target-i386/kvm.c b/target-i386/kvm.c index 696b14a..e974c42 100644 --- a/target-i386/kvm.c +++ b/target-i386/kvm.c @@ -61,6 +61,7 @@ const KVMCapabilityInfo kvm_arch_required_capabilities[] = { static bool has_msr_star; static bool has_msr_hsave_pa; +static bool has_msr_tsc_adjust; static bool has_msr_tsc_deadline; static bool has_msr_async_pf_en; static bool has_msr_misc_enable; @@ -641,6 +642,10 @@ static int kvm_get_supported_msrs(KVMState *s) has_msr_hsave_pa = true; continue; } +if (kvm_msr_list-indices[i] == MSR_TSC_ADJUST) { +has_msr_tsc_adjust = true; +continue; +} if (kvm_msr_list-indices[i] == MSR_IA32_TSCDEADLINE) { has_msr_tsc_deadline = true; continue; @@ -978,6 +983,10 @@ static int kvm_put_msrs(CPUX86State *env, int level) if (has_msr_hsave_pa) { kvm_msr_entry_set(msrs[n++], MSR_VM_HSAVE_PA, env-vm_hsave); } +if (has_msr_tsc_adjust) { +kvm_msr_entry_set(msrs[n++], + MSR_TSC_ADJUST, env-tsc_adjust); Tabs. 
+} if (has_msr_tsc_deadline) { kvm_msr_entry_set(msrs[n++], MSR_IA32_TSCDEADLINE, env-tsc_deadline); } @@ -1234,6 +1243,9 @@ static int kvm_get_msrs(CPUX86State *env) if (has_msr_hsave_pa) { msrs[n++].index = MSR_VM_HSAVE_PA; } +if (has_msr_tsc_adjust) { +msrs[n++].index = MSR_TSC_ADJUST; +} if (has_msr_tsc_deadline) { msrs[n++].index = MSR_IA32_TSCDEADLINE; } @@ -1308,6 +1320,9 @@ static int kvm_get_msrs(CPUX86State *env) case MSR_IA32_TSC: env-tsc = msrs[i].data; break; +case MSR_TSC_ADJUST: +env-tsc_adjust = msrs[i].data; +break; case MSR_IA32_TSCDEADLINE: env-tsc_deadline = msrs[i].data; break; diff --git a/target-i386/machine.c b/target-i386/machine.c index a8be058..95bda9b 100644 --- a/target-i386/machine.c +++ b/target-i386/machine.c @@ -310,6 +310,24 @@ static const VMStateDescription vmstate_fpop_ip_dp = { } }; +static bool tsc_adjust_needed(void *opaque) +{ +CPUX86State *cpu = opaque; Please name this env to differentiate from CPUState / X86CPU. Since there are other tsc_* fields already I won't request that you move your new field to the containing X86CPU struct but at some point we will need to convert the VMSDs to X86CPU. + +return cpu-tsc_adjust != 0; +} + +static const VMStateDescription vmstate_msr_tsc_adjust = { +.name = cpu/msr_tsc_adjust, +.version_id = 1, +.minimum_version_id = 1, +.minimum_version_id_old = 1, +.fields = (VMStateField []) { +VMSTATE_UINT64(tsc_adjust, CPUX86State), +VMSTATE_END_OF_LIST() +} +}; + static bool tscdeadline_needed(void *opaque) { CPUX86State *env = opaque; @@ -457,6 +475,9 @@ static const VMStateDescription vmstate_cpu = { .vmsd = vmstate_fpop_ip_dp, .needed = fpop_ip_dp_needed, }, { +.vmsd =
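[Editor's note] The reviewer's remark about tsc_adjust_needed() reflects QEMU's optional-subsection convention: a migration subsection is emitted only when its .needed callback returns true, so a destination that predates the field is not forced to understand it while the value is still zero. A stand-alone distillation of that idea (this is not QEMU's real VMState API; the names and the flat buffer are invented for illustration):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct toy_state {
    uint64_t tsc_adjust;
};

/* Mirror of tsc_adjust_needed(): only send the subsection when the
 * value actually carries information, i.e. is non-zero. */
static bool toy_tsc_adjust_needed(const struct toy_state *s)
{
    return s->tsc_adjust != 0;
}

/* Serialize into buf; returns how many uint64_t slots were written.
 * The optional subsection is simply skipped when not needed, keeping
 * the stream compatible with destinations that lack the field. */
static int toy_save(const struct toy_state *s, uint64_t *buf)
{
    int n = 0;

    if (toy_tsc_adjust_needed(s))
        buf[n++] = s->tsc_adjust;
    return n;
}
```

The same predicate runs on the source at save time, which is why the patch keys it on the guest-visible value rather than on the host capability flag.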
Re: The smp_affinity cannot work correctly on guest os when PCI passthrough device using msi/msi-x with KVM
On Tue, 2012-11-27 at 00:47 +0800, yi li wrote: hi Alex, the qemu-kvm version is 1.2. And is the device making use of MSI-X or MSI interrupts? MSI-X should work on 1.2; MSI does not yet support vector updates for affinity, but patches are welcome. Thanks, Alex 2012/11/26 Alex Williamson alex.william...@redhat.com: On Fri, 2012-11-23 at 11:06 +0800, yi li wrote: Hi Guys, there is an issue where smp_affinity does not work correctly on the guest OS when a PCI passthrough device uses MSI/MSI-X with KVM. My reasoning: the pcpu will incur a lot of IPI interrupts to find the vcpu to handle the irq, so the guest OS will VM_EXIT frequently, right? If smp_affinity worked correctly on the guest OS, the best arrangement would be for the vcpu handling the irq to be pinned (cputune) to the pcpu which handles the kvm:pci-bus irq on the host. But unfortunately, I find that smp_affinity does not work correctly on the guest OS with MSI/MSI-X. How to reproduce: 1: pass through a netcard (Broadcom BCM5716S) to the guest OS 2: ifup the netcard; the card will use MSI-X interrupts by default; close the irqbalance service 3: echo 4 > /proc/irq/NETCARDIRQ/smp_affinity, so we assume vcpu2 handles the irq 4: we have set <vcpupin vcpu='2' cpuset='1'/> and bound the kvm:pci-bus irq to pcpu1 on the host. We think this configuration will reduce the IPI interrupts when injecting interrupts into the guest OS, but this irq is not handled only on vcpu2. Maybe this is not what we expect. What version of qemu-kvm/qemu are you using? There's been some work recently specifically to enable this. Thanks, Alex
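[Editor's note] For readers following the reproduction steps: /proc/irq/*/smp_affinity takes a hexadecimal CPU bitmask, where bit N selects CPU N, so "echo 4" in step 3 targets CPU 2. A small sketch of the mask arithmetic (the IRQ number below is a placeholder, not from the report):

```shell
# smp_affinity is a hex bitmask of CPUs: bit N selects CPU N.
cpu=2
mask=$(( 1 << cpu ))
printf 'mask for cpu %d: %x\n' "$cpu" "$mask"

# On a real system it would be applied and verified like this
# (IRQ number 29 is a placeholder):
#   echo 4 > /proc/irq/29/smp_affinity
#   cat /proc/irq/29/smp_affinity
```

Note that for multi-CPU masks the bits are simply OR-ed, e.g. CPUs 0 and 2 give 5.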
Re: Performance issue
On Sun, Nov 25, 2012 at 6:17 PM, George-Cristian Bîrzan g...@birzan.org wrote: On Sun, Nov 25, 2012 at 5:19 PM, Gleb Natapov g...@redhat.com wrote: What Windows is this? Can you try changing -cpu host to -cpu host,+hv_relaxed? This is on Windows Server 2008 R2 (sorry, forgot to mention that I guess), and I can try it tomorrow (US time), as getting a stream my way depends on complicated stuff. I will though, and let you know how it goes. I changed that, no difference. -- George-Cristian Bîrzan
[Bug 50891] The smp_affinity cannot work correctly on guest os when PCI passthrough device using msi/msi-x with KVM
https://bugzilla.kernel.org/show_bug.cgi?id=50891 Alex Williamson alex.william...@redhat.com changed: What|Removed |Added CC||alex.william...@redhat.com --- Comment #1 from Alex Williamson alex.william...@redhat.com 2012-11-26 19:32:15 --- MSI-X SMP affinity should be working, MSI SMP affinity is not currently implemented. Please clarify whether the device in question is actually making use of MSI or MSI-X. Thanks. -- Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email --- You are receiving this mail because: --- You are on the CC list for the bug. You are watching the assignee of the bug.
RE: [PATCH V3 1/2] Resend - Add code to track call origin for msr assignment.
Gleb, Marcelo, Sorry Gleb, I did not see comments from you, but I have now found them. In doing so I also found one from Marcelo that I missed. What I believe is now outstanding to be addressed: From Gleb: - You've changed the function pointer signature here, but emulator_set_msr() remained the same - Also I would prefer adding a host_initiated parameter to kvm_set_msr() instead of introducing the msr_data structure. From Marcelo: - false, this is guest instruction emulation I will address these points. However Gleb, regarding your second item above: a host_initiated parameter was implemented but then rejected, with agreement that the msr_data structure would be a better solution. This was based on discussion with both Avi and Marcelo. I will leave this as is. Thanks, Will -Original Message- From: Gleb Natapov [mailto:g...@redhat.com] Sent: Monday, November 26, 2012 10:47 AM To: Auld, Will Cc: qemu-devel; mtosa...@redhat.com; kvm@vger.kernel.org; Dugger, Donald D; Liu, Jinsong; Zhang, Xiantao; a...@redhat.com Subject: Re: [PATCH V3 1/2] Resend - Add code to track call origin for msr assignment. Comments are still not addressed. On Mon, Nov 26, 2012 at 10:40:51AM -0800, Will Auld wrote: In order to track who initiated the call (host or guest) to modify an msr value I have changed function call parameters along the call path. The specific change is to add a struct pointer parameter that points to (index, data, caller) information rather than having this information passed as individual parameters. The initial use for this capability is for updating the IA32_TSC_ADJUST msr while setting the tsc value. It is anticipated that this capability will be useful for other tasks.
Signed-off-by: Will Auld will.a...@intel.com --- arch/x86/include/asm/kvm_host.h | 12 +--- arch/x86/kvm/svm.c | 21 +++-- arch/x86/kvm/vmx.c | 24 +--- arch/x86/kvm/x86.c | 23 +-- arch/x86/kvm/x86.h | 2 +- 5 files changed, 59 insertions(+), 23 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 09155d6..da34027 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -598,6 +598,12 @@ struct kvm_vcpu_stat { struct x86_instruction_info; +struct msr_data { +bool host_initiated; +u32 index; +u64 data; +}; + struct kvm_x86_ops { int (*cpu_has_kvm_support)(void); /* __init */ int (*disabled_by_bios)(void); /* __init */ @@ -621,7 +627,7 @@ struct kvm_x86_ops { void (*set_guest_debug)(struct kvm_vcpu *vcpu, struct kvm_guest_debug *dbg); int (*get_msr)(struct kvm_vcpu *vcpu, u32 msr_index, u64 *pdata); - int (*set_msr)(struct kvm_vcpu *vcpu, u32 msr_index, u64 data); + int (*set_msr)(struct kvm_vcpu *vcpu, struct msr_data *msr); u64 (*get_segment_base)(struct kvm_vcpu *vcpu, int seg); void (*get_segment)(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg); @@ -772,7 +778,7 @@ static inline int emulate_instruction(struct kvm_vcpu *vcpu, void kvm_enable_efer_bits(u64); int kvm_get_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 *data); -int kvm_set_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 data); +int kvm_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr); struct x86_emulate_ctxt; @@ -799,7 +805,7 @@ void kvm_get_cs_db_l_bits(struct kvm_vcpu *vcpu, int *db, int *l); int kvm_set_xcr(struct kvm_vcpu *vcpu, u32 index, u64 xcr); int kvm_get_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata); -int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data); +int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr); unsigned long kvm_get_rflags(struct kvm_vcpu *vcpu); void kvm_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags); diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c 
index baead95..5ac11f0 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -1211,6 +1211,7 @@ static struct kvm_vcpu *svm_create_vcpu(struct kvm *kvm, unsigned int id) struct page *msrpm_pages; struct page *hsave_page; struct page *nested_msrpm_pages; + struct msr_data msr; int err; svm = kmem_cache_zalloc(kvm_vcpu_cache, GFP_KERNEL); @@ -1255,7 +1256,10 @@ static struct kvm_vcpu *svm_create_vcpu(struct kvm *kvm, unsigned int id) svm-vmcb_pa = page_to_pfn(page) PAGE_SHIFT; svm-asid_generation = 0; init_vmcb(svm); - kvm_write_tsc(svm-vcpu, 0); + msr.data = 0x0; + msr.index = MSR_IA32_TSC; + msr.host_initiated = true; + kvm_write_tsc(svm-vcpu, msr); err = fx_init(svm-vcpu); if (err) @@ -3147,13 +3151,15 @@ static int svm_set_vm_cr(struct kvm_vcpu *vcpu, u64 data) return 0; } -static int
Re: [PATCH RFC V2 0/5] Separate consigned (expected steal) from steal time.
On 10/22/2012 10:33 AM, Rik van Riel wrote: On 10/16/2012 10:23 PM, Michael Wolf wrote: In the case where you have a system that is running in a capped or overcommitted environment the user may see steal time being reported in accounting tools such as top or vmstat. This can cause confusion for the end user. How do s390 and Power systems deal with reporting that kind of information? IMHO it would be good to see what those do, so we do not end up re-inventing the wheel, and confusing admins with yet another way of reporting the information... Sorry for the delay in the response. I'm assuming you are asking about s390 and Power lpars. In the case of lpar on POWER systems they simply report steal time and do not alter it in any way. They do however report how much processor is assigned to the partition, and that information is in /proc/ppc64/lparcfg. Mike
Re: pci_enable_msix() fails with ENOMEM/EINVAL
On Thu, 2012-11-22 at 10:52 +0200, Alex Lyakas wrote: Hi Alex, thanks for your response. I printed out the vector and entry values of dev->host_msix_entries[i] within assigned_device_enable_host_msix() before the call to request_threaded_irq(). I see that they are all 0s: kernel: [ 3332.610980] kvm-8095: KVM_ASSIGN_DEV_IRQ assigned_dev_id=924 kernel: [ 3332.610985] kvm-8095: assigned_device_enable_host_msix() assigned_dev_id=924 #0: [v=0 e=0] kernel: [ 3332.610989] kvm-8095: assigned_device_enable_host_msix() assigned_dev_id=924 #1: [v=0 e=1] kernel: [ 3332.610992] kvm-8095: assigned_device_enable_host_msix() assigned_dev_id=924 #2: [v=0 e=2] So I don't really understand how they all ask for irq=0; I must be missing something. Is there any other way for request_threaded_irq() to return EBUSY? From the code I don't see that there is. The vectors all being zero sounds like an indication that pci_enable_msix() didn't actually work. Each of those should be a unique vector. Does booting the host with nointremap perhaps make a difference? Maybe we can isolate the problem to the interrupt remapper code. This issue is reproducible and is not going to go away by itself. Working around it is also problematic. We thought to check whether all IRQs are properly attached after QEMU sets the VM state to running. However, the VM state is set to running before the IRQ attachments are performed; we debugged this and found out that they are done from a different thread, with a stack trace like this:
kvm_assign_irq()
assigned_dev_update_msix()
assigned_dev_pci_write_config()
pci_host_config_write_common()
pci_data_write()
pci_host_data_write()
memory_region_write_accessor()
access_with_adjusted_size()
memory_region_iorange_write()
ioport_writew_thunk()
ioport_write()
cpu_outw()
kvm_handle_io()
kvm_cpu_exec()
qemu_kvm_cpu_thread_fn()
So it looks like this is performed on demand (on first I/O), so there is no reliable point at which to check that the IRQs are attached properly.
Correct, MSI-X is set up when the guest enables MSI-X on the device, which is likely a long way into guest boot. There's no guarantee that the guest ever enables MSI-X, so there's no association to whether the guest is running. Another issue is that in the KVM code the return value of pci_host_config_write_common() is not checked, so there is no way to report a failure. A common problem in qemu, imho. Is there any way you think you can help me debug this further? It seems like pci_enable_msix is still failing, but perhaps silently without irqbalance. We need to figure out where and why. Isolating it to the interrupt remapper with nointremap might give us some clues (this is an Intel VT-d system, right?). Thanks, Alex -Original Message- From: Alex Williamson Sent: 22 November, 2012 12:25 AM To: Alex Lyakas Cc: kvm@vger.kernel.org Subject: Re: pci_enable_msix() fails with ENOMEM/EINVAL On Wed, 2012-11-21 at 16:19 +0200, Alex Lyakas wrote: Hi, I was advised to turn off irqbalance and reproduced this issue, but the failure is in a different place now. Now request_threaded_irq() fails with EBUSY. According to the code, this can only happen on the path: request_threaded_irq() -> __setup_irq(). Now in __setup_irq, the only place where EBUSY can show up for us is here:
...
raw_spin_lock_irqsave(&desc->lock, flags);
old_ptr = &desc->action;
old = *old_ptr;
if (old) {
	/*
	 * Can't share interrupts unless both agree to and are
	 * the same type (level, edge, polarity). So both flag
	 * fields must have IRQF_SHARED set and the bits which
	 * set the trigger type must match. Also all must
	 * agree on ONESHOT.
	 */
	if (!((old->flags & new->flags) & IRQF_SHARED) ||
	    ((old->flags ^ new->flags) & IRQF_TRIGGER_MASK) ||
	    ((old->flags ^ new->flags) & IRQF_ONESHOT)) {
		old_name = old->name;
		goto mismatch;
	}
	/* All handlers must agree on per-cpuness */
	if ((old->flags & IRQF_PERCPU) != (new->flags & IRQF_PERCPU))
		goto mismatch;
KVM calls request_threaded_irq() with flags==0, so can it be that different KVM processes request the same IRQ?
Shouldn't be possible; irqs are allocated from a bitmap protected by a mutex, see __irq_alloc_descs. How do different KVM processes spawned simultaneously agree between them on IRQ numbers? They don't; MSI/X vectors are not currently share-able. Can you show that you're actually getting duplicate irq vectors? Thanks, Alex -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug 50921] kvm hangs booting Windows 2000
On 11/24/2012 09:44 PM, bugzilla-dae...@bugzilla.kernel.org wrote: https://bugzilla.kernel.org/show_bug.cgi?id=50921 --- Comment #5 from Lucio Crusca lu...@sulweb.org 2012-11-24 13:44:16 --- Here are the first test results: vbox modules do not make a difference (tried rmmod vboxpci vboxnetadp vboxnetflt vboxdrv and then kvm ...). The trace.dat is about 60M, I could upload it somewhere; however, I tried looking at it and I'm reasonably sure it hangs here:

    $ trace-cmd report | grep 125\\.332 | tail
    kvm-6588 [000] 125.332264: kvm_entry: vcpu 0
    kvm-6588 [000] 125.332264: kvm_emulate_insn: 1:44f8: 75 27

Hmm... no 'kvm_exit' message. It looks like the infinite loop is caused by:

    /* Don't enter VMX if guest state is invalid, let the exit handler
       start emulation until we arrive back to a valid state */
    if (vmx->emulation_required && emulate_invalid_guest_state)
        return;

(vmx_vcpu_run in arch/x86/kvm/vmx.c) And I noticed 'ept' is not supported on your box, which means 'enable_unrestricted_guest' is disabled. I guess something goes wrong when emulating big real mode. Could you reload kvm-intel.ko with 'emulate_invalid_guest_state = 0' and see what happens?
Re: KVM Disk i/o or VM activities causes soft lockup?
On Mon, Nov 26, 2012 at 2:58 AM, Stefan Hajnoczi stefa...@gmail.com wrote: On Fri, Nov 23, 2012 at 10:34:16AM -0800, Vincent Li wrote: On Thu, Nov 22, 2012 at 11:29 PM, Stefan Hajnoczi stefa...@gmail.com wrote: On Wed, Nov 21, 2012 at 03:36:50PM -0800, Vincent Li wrote: We have users running on redhat based distro (Kernel 2.6.32-131.21.1.el6.x86_64 ) with kvm, when customer made cron job script to copy large files between kvm guest or some other user space program leads to disk i/o or VM activities, users get following soft lockup message from console: Nov 17 13:44:46 slot1/luipaard100a err kernel: BUG: soft lockup - CPU#4 stuck for 61s! [qemu-kvm:6795] Nov 17 13:44:46 slot1/luipaard100a warning kernel: Modules linked in: ebt_vlan nls_utf8 isofs ebtable_filter ebtables 8021q garp bridge stp llc ipt_REJECT iptable_filter xt_NOTRACK nf_conntrack iptable_raw ip_tables loop ext2 binfmt_misc hed womdict(U) vnic(U) parport_pc lp parport predis(U) lasthop(U) ipv6 toggler vhost_net tun kvm_intel kvm jiffies(U) sysstats hrsleep i2c_dev datastor(U) linux_user_bde(P)(U) linux_kernel_bde(P)(U) tg3 libphy serio_raw i2c_i801 i2c_core ehci_hcd raid1 raid0 virtio_pci virtio_blk virtio virtio_ring mvsas libsas scsi_transport_sas mptspi mptscsih mptbase scsi_transport_spi 3w_9xxx sata_svw(U) ahci serverworks sata_sil ata_piix libata sd_mod crc_t10dif amd74xx piix ide_gd_mod ide_core dm_snapshot dm_mirror dm_region_hash dm_log dm_mod ext3 jbd mbcache Nov 17 13:44:46 slot1/luipaard100a warning kernel: Pid: 6795, comm: qemu-kvm Tainted: P 2.6.32-131.21.1.el6.f5.x86_64 #1 Nov 17 13:44:46 slot1/luipaard100a warning kernel: Call Trace: Nov 17 13:44:46 slot1/luipaard100a warning kernel: IRQ [81084f95] ? get_timestamp+0x9/0xf Nov 17 13:44:46 slot1/luipaard100a warning kernel: [810855d6] ? watchdog_timer_fn+0x130/0x178 Nov 17 13:44:46 slot1/luipaard100a warning kernel: [81059f11] ? __run_hrtimer+0xa3/0xff Nov 17 13:44:46 slot1/luipaard100a warning kernel: [8105a188] ? 
hrtimer_interrupt+0xe6/0x190 Nov 17 13:44:46 slot1/luipaard100a warning kernel: [8105a14b] ? hrtimer_interrupt+0xa9/0x190 Nov 17 13:44:46 slot1/luipaard100a warning kernel: [8101e5a9] ? hpet_interrupt_handler+0x26/0x2d Nov 17 13:44:46 slot1/luipaard100a warning kernel: [8105a26f] ? hrtimer_peek_ahead_timers+0x9/0xd Nov 17 13:44:46 slot1/luipaard100a warning kernel: [81044fcc] ? __do_softirq+0xc5/0x17a Nov 17 13:44:46 slot1/luipaard100a warning kernel: [81003adc] ? call_softirq+0x1c/0x28 Nov 17 13:44:46 slot1/luipaard100a warning kernel: [8100506b] ? do_softirq+0x31/0x66 Nov 17 13:44:46 slot1/luipaard100a warning kernel: [81003673] ? call_function_interrupt+0x13/0x20 Nov 17 13:44:46 slot1/luipaard100a warning kernel: EOI [a0219986] ? vmx_get_msr+0x0/0x123 [kvm_intel] Nov 17 13:44:46 slot1/luipaard100a warning kernel: [a01d11c0] ? kvm_arch_vcpu_ioctl_run+0x80e/0xaf1 [kvm] Nov 17 13:44:46 slot1/luipaard100a warning kernel: [a01d11b4] ? kvm_arch_vcpu_ioctl_run+0x802/0xaf1 [kvm] Nov 17 13:44:46 slot1/luipaard100a warning kernel: [8114e59b] ? inode_has_perm+0x65/0x72 Nov 17 13:44:46 slot1/luipaard100a warning kernel: [a01c77f5] ? kvm_vcpu_ioctl+0xf2/0x5ba [kvm] Nov 17 13:44:46 slot1/luipaard100a warning kernel: [8114e642] ? file_has_perm+0x9a/0xac Nov 17 13:44:46 slot1/luipaard100a warning kernel: [810f9ec2] ? vfs_ioctl+0x21/0x6b Nov 17 13:44:46 slot1/luipaard100a warning kernel: [810fa406] ? do_vfs_ioctl+0x487/0x4da Nov 17 13:44:46 slot1/luipaard100a warning kernel: [810fa4aa] ? sys_ioctl+0x51/0x70 Nov 17 13:44:46 slot1/luipaard100a warning kernel: [810029d1] ? system_call_fastpath+0x3c/0x41 This soft lockup is report on the host? Stefan Yes, it is on host. we just recommend users not doing large file copying, just wondering if there is potential kernel bug. it seems the softlockup backtrace pointing to hrtimer and softirq. my naive knowledge is that the watchdog thread is on top of hrtimer which is on top of softirq. 
Since the soft lockup detector is firing on the host, this seems like a hardware/driver problem. Have you ever had soft lockups running non-KVM workloads on this host? Stefan

This soft lockup only triggers when running KVM. Also, users used another script in a cron job to restart 4 kvm instances every 5 minutes (insane to me), which also caused tons of soft lockup messages during kvm instance startup. We have already told the customer to stop doing that and the soft lockup messages disappeared. Vincent
Re: [Bug 50921] kvm hangs booting Windows 2000
Sorry, forgot to CC Lucio Crusca.

On 11/27/2012 04:09 AM, Xiao Guangrong wrote: On 11/24/2012 09:44 PM, bugzilla-dae...@bugzilla.kernel.org wrote: https://bugzilla.kernel.org/show_bug.cgi?id=50921 --- Comment #5 from Lucio Crusca lu...@sulweb.org 2012-11-24 13:44:16 --- Here are the first test results: vbox modules do not make a difference (tried rmmod vboxpci vboxnetadp vboxnetflt vboxdrv and then kvm ...). The trace.dat is about 60M, I could upload it somewhere; however, I tried looking at it and I'm reasonably sure it hangs here:

    $ trace-cmd report | grep 125\\.332 | tail
    kvm-6588 [000] 125.332264: kvm_entry: vcpu 0
    kvm-6588 [000] 125.332264: kvm_emulate_insn: 1:44f8: 75 27

Hmm... no 'kvm_exit' message. It looks like the infinite loop is caused by:

    /* Don't enter VMX if guest state is invalid, let the exit handler
       start emulation until we arrive back to a valid state */
    if (vmx->emulation_required && emulate_invalid_guest_state)
        return;

(vmx_vcpu_run in arch/x86/kvm/vmx.c) And I noticed 'ept' is not supported on your box, which means 'enable_unrestricted_guest' is disabled. I guess something goes wrong when emulating big real mode. Could you reload kvm-intel.ko with 'emulate_invalid_guest_state = 0' and see what happens?
Re: [PATCH V3 1/2] Resend - Add code to track call origin for msr assignment.
On Mon, Nov 26, 2012 at 07:42:28PM +0000, Auld, Will wrote: Gleb, Marcelo, Sorry Gleb, I did not see comments from you, but I have now found them. In doing so I also found one from Marcelo that I missed. What I believe is now outstanding to be addressed are:

From Gleb: - You've changed the function pointer signature here, but emulator_set_msr() remained the same - Also, I would prefer adding a host_initiated parameter to kvm_set_msr() instead of introducing the msr_data structure.

From Marcelo: - false, this is guest instruction emulation

I will address these points. However Gleb, your second item above, the host_initiated parameter, was implemented but then rejected, agreeing that the msr_data structure would be a better solution. This was based on discussion with both Avi and Marcelo. I will leave this as is. OK. Thanks. Thanks, Will

-Original Message- From: Gleb Natapov [mailto:g...@redhat.com] Sent: Monday, November 26, 2012 10:47 AM To: Auld, Will Cc: qemu-devel; mtosa...@redhat.com; kvm@vger.kernel.org; Dugger, Donald D; Liu, Jinsong; Zhang, Xiantao; a...@redhat.com Subject: Re: [PATCH V3 1/2] Resend - Add code to track call origin for msr assignment. Comments are still not addressed.

On Mon, Nov 26, 2012 at 10:40:51AM -0800, Will Auld wrote: In order to track who initiated the call (host or guest) to modify an msr value, I have changed function call parameters along the call path. The specific change is to add a struct pointer parameter that points to (index, data, caller) information rather than having this information passed as individual parameters. The initial use for this capability is for updating the IA32_TSC_ADJUST msr while setting the tsc value. It is anticipated that this capability will be useful for other tasks.
Signed-off-by: Will Auld will.a...@intel.com --- arch/x86/include/asm/kvm_host.h | 12 +--- arch/x86/kvm/svm.c | 21 +++-- arch/x86/kvm/vmx.c | 24 +--- arch/x86/kvm/x86.c | 23 +-- arch/x86/kvm/x86.h | 2 +- 5 files changed, 59 insertions(+), 23 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 09155d6..da34027 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -598,6 +598,12 @@ struct kvm_vcpu_stat { struct x86_instruction_info; +struct msr_data { +bool host_initiated; +u32 index; +u64 data; +}; + struct kvm_x86_ops { int (*cpu_has_kvm_support)(void); /* __init */ int (*disabled_by_bios)(void); /* __init */ @@ -621,7 +627,7 @@ struct kvm_x86_ops { void (*set_guest_debug)(struct kvm_vcpu *vcpu, struct kvm_guest_debug *dbg); int (*get_msr)(struct kvm_vcpu *vcpu, u32 msr_index, u64 *pdata); - int (*set_msr)(struct kvm_vcpu *vcpu, u32 msr_index, u64 data); + int (*set_msr)(struct kvm_vcpu *vcpu, struct msr_data *msr); u64 (*get_segment_base)(struct kvm_vcpu *vcpu, int seg); void (*get_segment)(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg); @@ -772,7 +778,7 @@ static inline int emulate_instruction(struct kvm_vcpu *vcpu, void kvm_enable_efer_bits(u64); int kvm_get_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 *data); -int kvm_set_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 data); +int kvm_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr); struct x86_emulate_ctxt; @@ -799,7 +805,7 @@ void kvm_get_cs_db_l_bits(struct kvm_vcpu *vcpu, int *db, int *l); int kvm_set_xcr(struct kvm_vcpu *vcpu, u32 index, u64 xcr); int kvm_get_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata); -int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data); +int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr); unsigned long kvm_get_rflags(struct kvm_vcpu *vcpu); void kvm_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags); diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c 
index baead95..5ac11f0 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -1211,6 +1211,7 @@ static struct kvm_vcpu *svm_create_vcpu(struct kvm *kvm, unsigned int id) struct page *msrpm_pages; struct page *hsave_page; struct page *nested_msrpm_pages; + struct msr_data msr; int err; svm = kmem_cache_zalloc(kvm_vcpu_cache, GFP_KERNEL); @@ -1255,7 +1256,10 @@ static struct kvm_vcpu *svm_create_vcpu(struct kvm *kvm, unsigned int id) svm-vmcb_pa = page_to_pfn(page) PAGE_SHIFT; svm-asid_generation = 0; init_vmcb(svm); - kvm_write_tsc(svm-vcpu, 0); + msr.data = 0x0; + msr.index = MSR_IA32_TSC; + msr.host_initiated = true; + kvm_write_tsc(svm-vcpu, msr); err =
[PATCH 0/5] Alter steal time reporting in KVM
In the case of where you have a system that is running in a capped or overcommitted environment the user may see steal time being reported in accounting tools such as top or vmstat. This can cause confusion for the end user. To ease the confusion this patch set adds the idea of consigned (expected steal) time. The host will separate the consigned time from the steal time. The consignment limit passed to the host will be the amount of steal time expected within a fixed period of time. Any other steal time accruing during that period will show as the traditional steal time. --- Michael Wolf (5): Alter the amount of steal time reported by the guest. Expand the steal time msr to also contain the consigned time. Add the code to send the consigned time from the host to the guest Add a timer to allow the separation of consigned from steal time. Add an ioctl to communicate the consign limit to the host. arch/x86/include/asm/kvm_host.h | 11 +++ arch/x86/include/asm/kvm_para.h |3 +- arch/x86/include/asm/paravirt.h |4 +-- arch/x86/include/asm/paravirt_types.h |2 + arch/x86/kernel/kvm.c |8 ++--- arch/x86/kernel/paravirt.c|4 +-- arch/x86/kvm/x86.c| 50 - fs/proc/stat.c|9 +- include/linux/kernel_stat.h |2 + include/linux/kvm_host.h |2 + include/uapi/linux/kvm.h |2 + kernel/sched/core.c | 10 ++- kernel/sched/cputime.c| 21 +- kernel/sched/sched.h |2 + virt/kvm/kvm_main.c |7 + 15 files changed, 120 insertions(+), 17 deletions(-) -- Signature -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/5] Alter the amount of steal time reported by the guest.
Modify the amount of stealtime that the kernel reports via the /proc interface. Steal time will now be broken down into steal_time and consigned_time. Consigned_time will represent the amount of time that is expected to be lost due to overcommitment of the physical cpu or by using cpu capping. The amount consigned_time will be passed in using an ioctl. The time will be expressed in the number of nanoseconds to be lost in during the fixed period. The fixed period is currently 1/10th of a second. Signed-off-by: Michael Wolf m...@linux.vnet.ibm.com --- fs/proc/stat.c |9 +++-- include/linux/kernel_stat.h |1 + 2 files changed, 8 insertions(+), 2 deletions(-) diff --git a/fs/proc/stat.c b/fs/proc/stat.c index e296572..cb7fe80 100644 --- a/fs/proc/stat.c +++ b/fs/proc/stat.c @@ -82,7 +82,7 @@ static int show_stat(struct seq_file *p, void *v) int i, j; unsigned long jif; u64 user, nice, system, idle, iowait, irq, softirq, steal; - u64 guest, guest_nice; + u64 guest, guest_nice, consign; u64 sum = 0; u64 sum_softirq = 0; unsigned int per_softirq_sums[NR_SOFTIRQS] = {0}; @@ -90,10 +90,11 @@ static int show_stat(struct seq_file *p, void *v) user = nice = system = idle = iowait = irq = softirq = steal = 0; - guest = guest_nice = 0; + guest = guest_nice = consign = 0; getboottime(boottime); jif = boottime.tv_sec; + for_each_possible_cpu(i) { user += kcpustat_cpu(i).cpustat[CPUTIME_USER]; nice += kcpustat_cpu(i).cpustat[CPUTIME_NICE]; @@ -105,6 +106,7 @@ static int show_stat(struct seq_file *p, void *v) steal += kcpustat_cpu(i).cpustat[CPUTIME_STEAL]; guest += kcpustat_cpu(i).cpustat[CPUTIME_GUEST]; guest_nice += kcpustat_cpu(i).cpustat[CPUTIME_GUEST_NICE]; + consign += kcpustat_cpu(i).cpustat[CPUTIME_CONSIGN]; sum += kstat_cpu_irqs_sum(i); sum += arch_irq_stat_cpu(i); @@ -128,6 +130,7 @@ static int show_stat(struct seq_file *p, void *v) seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(steal)); seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(guest)); seq_put_decimal_ull(p, ' ', 
cputime64_to_clock_t(guest_nice)); + seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(consign)); seq_putc(p, '\n'); for_each_online_cpu(i) { @@ -142,6 +145,7 @@ static int show_stat(struct seq_file *p, void *v) steal = kcpustat_cpu(i).cpustat[CPUTIME_STEAL]; guest = kcpustat_cpu(i).cpustat[CPUTIME_GUEST]; guest_nice = kcpustat_cpu(i).cpustat[CPUTIME_GUEST_NICE]; + consign = kcpustat_cpu(i).cpustat[CPUTIME_CONSIGN]; seq_printf(p, cpu%d, i); seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(user)); seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(nice)); @@ -153,6 +157,7 @@ static int show_stat(struct seq_file *p, void *v) seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(steal)); seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(guest)); seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(guest_nice)); + seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(consign)); seq_putc(p, '\n'); } seq_printf(p, intr %llu, (unsigned long long)sum); diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h index 1865b1f..e5978b0 100644 --- a/include/linux/kernel_stat.h +++ b/include/linux/kernel_stat.h @@ -28,6 +28,7 @@ enum cpu_usage_stat { CPUTIME_STEAL, CPUTIME_GUEST, CPUTIME_GUEST_NICE, + CPUTIME_CONSIGN, NR_STATS, }; -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/5] Expand the steal time msr to also contain the consigned time.
Add a consigned field. This field will hold the time lost due to capping or overcommit. The rest of the time will still show up in the steal-time field. Signed-off-by: Michael Wolf m...@linux.vnet.ibm.com --- arch/x86/include/asm/paravirt.h |4 ++-- arch/x86/include/asm/paravirt_types.h |2 +- arch/x86/kernel/kvm.c |7 ++- kernel/sched/core.c | 10 +- kernel/sched/cputime.c|2 +- 5 files changed, 15 insertions(+), 10 deletions(-) diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h index a0facf3..a5f9f30 100644 --- a/arch/x86/include/asm/paravirt.h +++ b/arch/x86/include/asm/paravirt.h @@ -196,9 +196,9 @@ struct static_key; extern struct static_key paravirt_steal_enabled; extern struct static_key paravirt_steal_rq_enabled; -static inline u64 paravirt_steal_clock(int cpu) +static inline u64 paravirt_steal_clock(int cpu, u64 *steal) { - return PVOP_CALL1(u64, pv_time_ops.steal_clock, cpu); + PVOP_VCALL2(pv_time_ops.steal_clock, cpu, steal); } static inline unsigned long long paravirt_read_pmc(int counter) diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h index 142236e..5d4fc8b 100644 --- a/arch/x86/include/asm/paravirt_types.h +++ b/arch/x86/include/asm/paravirt_types.h @@ -95,7 +95,7 @@ struct pv_lazy_ops { struct pv_time_ops { unsigned long long (*sched_clock)(void); - unsigned long long (*steal_clock)(int cpu); + void (*steal_clock)(int cpu, unsigned long long *steal); unsigned long (*get_tsc_khz)(void); }; diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c index 4180a87..ac357b3 100644 --- a/arch/x86/kernel/kvm.c +++ b/arch/x86/kernel/kvm.c @@ -372,9 +372,8 @@ static struct notifier_block kvm_pv_reboot_nb = { .notifier_call = kvm_pv_reboot_notify, }; -static u64 kvm_steal_clock(int cpu) +static void kvm_steal_clock(int cpu, u64 *steal) { - u64 steal; struct kvm_steal_time *src; int version; @@ -382,11 +381,9 @@ static u64 kvm_steal_clock(int cpu) do { version = src-version; rmb(); - steal = 
src-steal; + *steal = src-steal; rmb(); } while ((version 1) || (version != src-version)); - - return steal; } void kvm_disable_steal_time(void) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index c2e077c..b21d92d 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -748,6 +748,7 @@ static void update_rq_clock_task(struct rq *rq, s64 delta) */ #if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING) s64 steal = 0, irq_delta = 0; + u64 consigned = 0; #endif #ifdef CONFIG_IRQ_TIME_ACCOUNTING irq_delta = irq_time_read(cpu_of(rq)) - rq-prev_irq_time; @@ -776,8 +777,15 @@ static void update_rq_clock_task(struct rq *rq, s64 delta) #ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING if (static_key_false((paravirt_steal_rq_enabled))) { u64 st; + u64 cs; - steal = paravirt_steal_clock(cpu_of(rq)); + paravirt_steal_clock(cpu_of(rq), steal, consigned); + /* +* since we are not assigning the steal time to cpustats +* here, just combine the steal and consigned times to +* do the rest of the calculations. +*/ + steal += consigned; steal -= rq-prev_steal_time_rq; if (unlikely(steal delta)) diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c index 8d859da..593b647 100644 --- a/kernel/sched/cputime.c +++ b/kernel/sched/cputime.c @@ -275,7 +275,7 @@ static __always_inline bool steal_account_process_tick(void) if (static_key_false(paravirt_steal_enabled)) { u64 steal, st = 0; - steal = paravirt_steal_clock(smp_processor_id()); + paravirt_steal_clock(smp_processor_id(), steal); steal -= this_rq()-prev_steal_time; st = steal_ticks(steal); -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 3/5] Add the code to send the consigned time from the host to the guest
Add the code to send the consigned time from the host to the guest. Signed-off-by: Michael Wolf m...@linux.vnet.ibm.com --- arch/x86/include/asm/kvm_host.h |1 + arch/x86/include/asm/kvm_para.h |3 ++- arch/x86/include/asm/paravirt.h |4 ++-- arch/x86/kernel/kvm.c |3 ++- arch/x86/kernel/paravirt.c |4 ++-- arch/x86/kvm/x86.c |2 ++ include/linux/kernel_stat.h |1 + kernel/sched/cputime.c | 21 +++-- kernel/sched/sched.h|2 ++ 9 files changed, 33 insertions(+), 8 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index b2e11f4..434d378 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -426,6 +426,7 @@ struct kvm_vcpu_arch { u64 msr_val; u64 last_steal; u64 accum_steal; + u64 accum_consigned; struct gfn_to_hva_cache stime; struct kvm_steal_time steal; } st; diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h index eb3e9d8..1763369 100644 --- a/arch/x86/include/asm/kvm_para.h +++ b/arch/x86/include/asm/kvm_para.h @@ -42,9 +42,10 @@ struct kvm_steal_time { __u64 steal; + __u64 consigned; __u32 version; __u32 flags; - __u32 pad[12]; + __u32 pad[10]; }; #define KVM_STEAL_ALIGNMENT_BITS 5 diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h index a5f9f30..d39e8d0 100644 --- a/arch/x86/include/asm/paravirt.h +++ b/arch/x86/include/asm/paravirt.h @@ -196,9 +196,9 @@ struct static_key; extern struct static_key paravirt_steal_enabled; extern struct static_key paravirt_steal_rq_enabled; -static inline u64 paravirt_steal_clock(int cpu, u64 *steal) +static inline u64 paravirt_steal_clock(int cpu, u64 *steal, u64 *consigned) { - PVOP_VCALL2(pv_time_ops.steal_clock, cpu, steal); + PVOP_VCALL3(pv_time_ops.steal_clock, cpu, steal, consigned); } static inline unsigned long long paravirt_read_pmc(int counter) diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c index ac357b3..4439a5c 100644 --- a/arch/x86/kernel/kvm.c +++ b/arch/x86/kernel/kvm.c @@ -372,7 
+372,7 @@ static struct notifier_block kvm_pv_reboot_nb = { .notifier_call = kvm_pv_reboot_notify, }; -static void kvm_steal_clock(int cpu, u64 *steal) +static void kvm_steal_clock(int cpu, u64 *steal, u64 *consigned) { struct kvm_steal_time *src; int version; @@ -382,6 +382,7 @@ static void kvm_steal_clock(int cpu, u64 *steal) version = src-version; rmb(); *steal = src-steal; + *consigned = src-consigned; rmb(); } while ((version 1) || (version != src-version)); } diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c index 17fff18..3797683 100644 --- a/arch/x86/kernel/paravirt.c +++ b/arch/x86/kernel/paravirt.c @@ -207,9 +207,9 @@ static void native_flush_tlb_single(unsigned long addr) struct static_key paravirt_steal_enabled; struct static_key paravirt_steal_rq_enabled; -static u64 native_steal_clock(int cpu) +static void native_steal_clock(int cpu, u64 *steal, u64 *consigned) { - return 0; + *steal = *consigned = 0; } /* These are in entry.S */ diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 1eefebe..683b531 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -1565,8 +1565,10 @@ static void record_steal_time(struct kvm_vcpu *vcpu) return; vcpu-arch.st.steal.steal += vcpu-arch.st.accum_steal; + vcpu-arch.st.steal.consigned += vcpu-arch.st.accum_consigned; vcpu-arch.st.steal.version += 2; vcpu-arch.st.accum_steal = 0; + vcpu-arch.st.accum_consigned = 0; kvm_write_guest_cached(vcpu-kvm, vcpu-arch.st.stime, vcpu-arch.st.steal, sizeof(struct kvm_steal_time)); diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h index e5978b0..91afaa3 100644 --- a/include/linux/kernel_stat.h +++ b/include/linux/kernel_stat.h @@ -126,6 +126,7 @@ extern unsigned long long task_delta_exec(struct task_struct *); extern void account_user_time(struct task_struct *, cputime_t, cputime_t); extern void account_system_time(struct task_struct *, int, cputime_t, cputime_t); extern void account_steal_time(cputime_t); +extern void 
account_consigned_time(cputime_t); extern void account_idle_time(cputime_t); extern void account_process_tick(struct task_struct *, int user); diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c index 593b647..53bd0be 100644 --- a/kernel/sched/cputime.c +++ b/kernel/sched/cputime.c @@ -244,6 +244,18 @@ void account_system_time(struct task_struct *p, int hardirq_offset, } /* + * This accounts for the time that is split out of steal time. + * Consigned time represents the amount of time
[PATCH 4/5] Add a timer to allow the separation of consigned from steal time.
[PATCH 5/5] Add an ioctl to communicate the consign limit to the host.
Add an ioctl to communicate the consign limit to the host. Signed-off-by: Michael Wolf m...@linux.vnet.ibm.com --- arch/x86/kvm/x86.c |6 ++ include/linux/kvm_host.h |2 ++ include/uapi/linux/kvm.h |2 ++ virt/kvm/kvm_main.c |7 +++ 4 files changed, 17 insertions(+) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index c91f4c9..5d57469 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -5938,6 +5938,12 @@ int kvm_arch_vcpu_ioctl_translate(struct kvm_vcpu *vcpu, return 0; } +int kvm_arch_vcpu_ioctl_set_entitlement(struct kvm_vcpu *vcpu, long entitlement) +{ + vcpu->arch.consigned_limit = entitlement; + return 0; +} + int kvm_arch_vcpu_ioctl_get_fpu(struct kvm_vcpu *vcpu, struct kvm_fpu *fpu) { struct i387_fxsave_struct *fxsave = diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 0e2212f..de13648 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -590,6 +590,8 @@ void kvm_arch_hardware_unsetup(void); void kvm_arch_check_processor_compat(void *rtn); int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu); int kvm_arch_vcpu_should_kick(struct kvm_vcpu *vcpu); +int kvm_arch_vcpu_ioctl_set_entitlement(struct kvm_vcpu *vcpu, + long entitlement); void kvm_free_physmem(struct kvm *kvm); diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h index 0a6d6ba..86f24bb 100644 --- a/include/uapi/linux/kvm.h +++ b/include/uapi/linux/kvm.h @@ -921,6 +921,8 @@ struct kvm_s390_ucas_mapping { #define KVM_SET_ONE_REG _IOW(KVMIO, 0xac, struct kvm_one_reg) /* VM is being stopped by host */ #define KVM_KVMCLOCK_CTRL _IO(KVMIO, 0xad) +/* Set the consignment limit which will be used to separate steal time */ +#define KVM_SET_ENTITLEMENT _IOW(KVMIO, 0xae, unsigned long) #define KVM_DEV_ASSIGN_ENABLE_IOMMU (1 << 0) #define KVM_DEV_ASSIGN_PCI_2_3 (1 << 1) diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index be70035..c712fe5 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -2062,6 +2062,13 @@ out_free2: r = 0; break; } + case
KVM_SET_ENTITLEMENT: { + r = kvm_arch_vcpu_ioctl_set_entitlement(vcpu, arg); + if (r) + goto out; + r = 0; + break; + } default: r = kvm_arch_vcpu_ioctl(filp, ioctl, arg); } -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
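The entitlement passed to KVM_SET_ENTITLEMENT is an opaque `long` here; patch 1/5 describes it as the number of nanoseconds of steal time expected within the fixed accounting period (currently 1/10th of a second). A minimal sketch of how userspace might derive that value from a CPU cap — the helper name, the 100 ms period constant, and the percentage interface are assumptions for illustration, not part of this series:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed accounting period: 1/10th of a second, per the patch 1/5 text. */
#define CONSIGN_PERIOD_NS 100000000ULL

/*
 * Hypothetical helper: convert a CPU cap (e.g. 60 means the guest is
 * entitled to 60% of a physical CPU) into the expected steal time in
 * nanoseconds per period, i.e. the value to hand to KVM_SET_ENTITLEMENT.
 */
static uint64_t consign_limit_ns(unsigned int cap_percent)
{
    if (cap_percent >= 100)
        return 0; /* uncapped: no steal time is "expected" */
    return CONSIGN_PERIOD_NS * (100 - cap_percent) / 100;
}
```

The result would then be passed per vcpu, e.g. `ioctl(vcpu_fd, KVM_SET_ENTITLEMENT, consign_limit_ns(60))`, assuming the ioctl number from this (never-merged) series.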
[PATCH 3/5] Add the code to send the consigned time from the host to the guest
Add the code to send the consigned time from the host to the guest. Signed-off-by: Michael Wolf m...@linux.vnet.ibm.com --- arch/x86/include/asm/kvm_host.h |1 + arch/x86/include/asm/kvm_para.h |3 ++- arch/x86/include/asm/paravirt.h |4 ++-- arch/x86/kernel/kvm.c |3 ++- arch/x86/kernel/paravirt.c |4 ++-- arch/x86/kvm/x86.c |2 ++ include/linux/kernel_stat.h |1 + kernel/sched/cputime.c | 21 +++-- kernel/sched/sched.h|2 ++ 9 files changed, 33 insertions(+), 8 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index b2e11f4..434d378 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -426,6 +426,7 @@ struct kvm_vcpu_arch { u64 msr_val; u64 last_steal; u64 accum_steal; + u64 accum_consigned; struct gfn_to_hva_cache stime; struct kvm_steal_time steal; } st; diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h index eb3e9d8..1763369 100644 --- a/arch/x86/include/asm/kvm_para.h +++ b/arch/x86/include/asm/kvm_para.h @@ -42,9 +42,10 @@ struct kvm_steal_time { __u64 steal; + __u64 consigned; __u32 version; __u32 flags; - __u32 pad[12]; + __u32 pad[10]; }; #define KVM_STEAL_ALIGNMENT_BITS 5 diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h index a5f9f30..d39e8d0 100644 --- a/arch/x86/include/asm/paravirt.h +++ b/arch/x86/include/asm/paravirt.h @@ -196,9 +196,9 @@ struct static_key; extern struct static_key paravirt_steal_enabled; extern struct static_key paravirt_steal_rq_enabled; -static inline u64 paravirt_steal_clock(int cpu, u64 *steal) +static inline u64 paravirt_steal_clock(int cpu, u64 *steal, u64 *consigned) { - PVOP_VCALL2(pv_time_ops.steal_clock, cpu, steal); + PVOP_VCALL3(pv_time_ops.steal_clock, cpu, steal, consigned); } static inline unsigned long long paravirt_read_pmc(int counter) diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c index ac357b3..4439a5c 100644 --- a/arch/x86/kernel/kvm.c +++ b/arch/x86/kernel/kvm.c @@ -372,7 
+372,7 @@ static struct notifier_block kvm_pv_reboot_nb = { .notifier_call = kvm_pv_reboot_notify, }; -static void kvm_steal_clock(int cpu, u64 *steal) +static void kvm_steal_clock(int cpu, u64 *steal, u64 *consigned) { struct kvm_steal_time *src; int version; @@ -382,6 +382,7 @@ static void kvm_steal_clock(int cpu, u64 *steal) version = src->version; rmb(); *steal = src->steal; + *consigned = src->consigned; rmb(); } while ((version & 1) || (version != src->version)); } diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c index 17fff18..3797683 100644 --- a/arch/x86/kernel/paravirt.c +++ b/arch/x86/kernel/paravirt.c @@ -207,9 +207,9 @@ static void native_flush_tlb_single(unsigned long addr) struct static_key paravirt_steal_enabled; struct static_key paravirt_steal_rq_enabled; -static u64 native_steal_clock(int cpu) +static void native_steal_clock(int cpu, u64 *steal, u64 *consigned) { - return 0; + *steal = *consigned = 0; } /* These are in entry.S */ diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 1eefebe..683b531 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -1565,8 +1565,10 @@ static void record_steal_time(struct kvm_vcpu *vcpu) return; vcpu->arch.st.steal.steal += vcpu->arch.st.accum_steal; + vcpu->arch.st.steal.consigned += vcpu->arch.st.accum_consigned; vcpu->arch.st.steal.version += 2; vcpu->arch.st.accum_steal = 0; + vcpu->arch.st.accum_consigned = 0; kvm_write_guest_cached(vcpu->kvm, &vcpu->arch.st.stime, &vcpu->arch.st.steal, sizeof(struct kvm_steal_time)); diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h index e5978b0..91afaa3 100644 --- a/include/linux/kernel_stat.h +++ b/include/linux/kernel_stat.h @@ -126,6 +126,7 @@ extern unsigned long long task_delta_exec(struct task_struct *); extern void account_user_time(struct task_struct *, cputime_t, cputime_t); extern void account_system_time(struct task_struct *, int, cputime_t, cputime_t); extern void account_steal_time(cputime_t); +extern void
account_consigned_time(cputime_t); extern void account_idle_time(cputime_t); extern void account_process_tick(struct task_struct *, int user); diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c index 593b647..53bd0be 100644 --- a/kernel/sched/cputime.c +++ b/kernel/sched/cputime.c @@ -244,6 +244,18 @@ void account_system_time(struct task_struct *p, int hardirq_offset, } /* + * This accounts for the time that is split out of steal time. + * Consigned time represents the amount of time
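The guest-side accounting added here routes consigned ticks to their own cpustat bucket instead of inflating CPUTIME_STEAL. A toy model of that split (the enum is cut down to two buckets, and `uint64_t` stands in for `cputime_t` — both simplifications, not the kernel types):

```c
#include <assert.h>
#include <stdint.h>

/* Cut-down model of the per-cpu cpustat buckets touched by this series. */
enum { CPUTIME_STEAL, CPUTIME_CONSIGN, NR_STATS };

static uint64_t cpustat[NR_STATS];

/* Mirrors the intent of the new account_consigned_time(): expected
 * (consigned) lost time lands in its own bucket... */
static void account_consigned_time(uint64_t cputime)
{
    cpustat[CPUTIME_CONSIGN] += cputime;
}

/* ...while unexpected lost time still counts as steal. */
static void account_steal_time(uint64_t cputime)
{
    cpustat[CPUTIME_STEAL] += cputime;
}
```

With this split, tools that sum both buckets recover the old steal-time number, while tools aware of the new column can report only the unexpected portion.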
[PATCH 4/5] Add a timer to allow the separation of consigned from steal time.
Add a timer to the host. This will define the period. During a period the first n ticks will go into the consigned bucket. Any other ticks that occur within the period will be placed in the stealtime bucket. Signed-off-by: Michael Wolf m...@linux.vnet.ibm.com --- arch/x86/include/asm/kvm_host.h | 10 + arch/x86/include/asm/paravirt.h |2 +- arch/x86/kvm/x86.c | 42 ++- 3 files changed, 52 insertions(+), 2 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 434d378..4794c95 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -41,6 +41,8 @@ #define KVM_PIO_PAGE_OFFSET 1 #define KVM_COALESCED_MMIO_PAGE_OFFSET 2 +#define KVM_STEAL_TIMER_DELAY 1UL + #define CR0_RESERVED_BITS \ (~(unsigned long)(X86_CR0_PE | X86_CR0_MP | X86_CR0_EM | X86_CR0_TS \ | X86_CR0_ET | X86_CR0_NE | X86_CR0_WP | X86_CR0_AM \ @@ -353,6 +355,14 @@ struct kvm_vcpu_arch { bool tpr_access_reporting; /* +* timer used to determine if the time should be counted as +* steal time or consigned time. 
+*/ + struct hrtimer steal_timer; + u64 current_consigned; + u64 consigned_limit; + + /* * Paging state of the vcpu * * If the vcpu runs in guest mode with two level paging this still saves diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h index d39e8d0..6db79f9 100644 --- a/arch/x86/include/asm/paravirt.h +++ b/arch/x86/include/asm/paravirt.h @@ -196,7 +196,7 @@ struct static_key; extern struct static_key paravirt_steal_enabled; extern struct static_key paravirt_steal_rq_enabled; -static inline u64 paravirt_steal_clock(int cpu, u64 *steal, u64 *consigned) +static inline void paravirt_steal_clock(int cpu, u64 *steal, u64 *consigned) { PVOP_VCALL3(pv_time_ops.steal_clock, cpu, steal, consigned); } diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 683b531..c91f4c9 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -1546,13 +1546,32 @@ static void kvmclock_reset(struct kvm_vcpu *vcpu) static void accumulate_steal_time(struct kvm_vcpu *vcpu) { u64 delta; + u64 steal_delta; + u64 consigned_delta; if (!(vcpu->arch.st.msr_val & KVM_MSR_ENABLED)) return; delta = current->sched_info.run_delay - vcpu->arch.st.last_steal; vcpu->arch.st.last_steal = current->sched_info.run_delay; - vcpu->arch.st.accum_steal = delta; + + /* split the delta into steal and consigned */ + if (vcpu->arch.current_consigned < vcpu->arch.consigned_limit) { + vcpu->arch.current_consigned += delta; + if (vcpu->arch.current_consigned > vcpu->arch.consigned_limit) { + steal_delta = vcpu->arch.current_consigned + - vcpu->arch.consigned_limit; + consigned_delta = delta - steal_delta; + } else { + consigned_delta = delta; + steal_delta = 0; + } + } else { + consigned_delta = 0; + steal_delta = delta; + } + vcpu->arch.st.accum_steal = steal_delta; + vcpu->arch.st.accum_consigned = consigned_delta; } static void record_steal_time(struct kvm_vcpu *vcpu) @@ -6203,11 +6222,25 @@ bool kvm_vcpu_compatible(struct kvm_vcpu *vcpu) struct static_key kvm_no_apic_vcpu __read_mostly; +enum hrtimer_restart steal_timer_fn(struct hrtimer *data) +{ + struct kvm_vcpu *vcpu; + ktime_t now; + + vcpu = container_of(data, struct kvm_vcpu, arch.steal_timer); + vcpu->arch.current_consigned = 0; + now = ktime_get(); + hrtimer_forward(&vcpu->arch.steal_timer, now, + ktime_set(0, KVM_STEAL_TIMER_DELAY)); + return HRTIMER_RESTART; +} + int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu) { struct page *page; struct kvm *kvm; int r; + ktime_t ktime; BUG_ON(vcpu->kvm == NULL); kvm = vcpu->kvm; @@ -6251,6 +6284,12 @@ int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu) kvm_async_pf_hash_reset(vcpu); kvm_pmu_init(vcpu); + /* Initialize and start a timer to capture steal and consigned time */ + hrtimer_init(&vcpu->arch.steal_timer, CLOCK_MONOTONIC, + HRTIMER_MODE_REL); + vcpu->arch.steal_timer.function = steal_timer_fn; + ktime = ktime_set(0, KVM_STEAL_TIMER_DELAY); + hrtimer_start(&vcpu->arch.steal_timer, ktime, HRTIMER_MODE_REL); return 0; fail_free_mce_banks: @@ -6269,6 +6308,7 @@ void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu) { int idx; + hrtimer_cancel(&vcpu->arch.steal_timer); kvm_pmu_destroy(vcpu);
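The bookkeeping in accumulate_steal_time() can be lifted out into a plain function to make the split easier to follow: within one period, run_delay deltas count as consigned until current_consigned crosses consigned_limit, and only the excess counts as steal; the hrtimer callback then zeroes current_consigned at each period boundary. A standalone sketch of that same logic (names are illustrative):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Sketch of the delta split from accumulate_steal_time() above.
 * *current_consigned is the consigned time already charged this
 * period; the per-period timer (steal_timer_fn) resets it to 0.
 */
static void split_delta(uint64_t *current_consigned, uint64_t limit,
                        uint64_t delta, uint64_t *steal, uint64_t *consigned)
{
    if (*current_consigned < limit) {
        *current_consigned += delta;
        if (*current_consigned > limit) {
            /* delta straddles the limit: only the excess is steal */
            *steal = *current_consigned - limit;
            *consigned = delta - *steal;
        } else {
            *consigned = delta;
            *steal = 0;
        }
    } else {
        /* limit already consumed this period: all of it is steal */
        *consigned = 0;
        *steal = delta;
    }
}
```

For example, with a limit of 100 units and a fresh period, a delta of 150 is split into 100 consigned and 50 steal; any further delta in the same period is pure steal until the timer resets the counter.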
[PATCH 0/5] Alter stealtime reporting in KVM
In the case where you have a system running in a capped or overcommitted environment, the user may see steal time being reported in accounting tools such as top or vmstat. This can cause confusion for the end user. To ease the confusion this patch set adds the idea of consigned (expected steal) time. The host will separate the consigned time from the steal time. The consignment limit passed to the host will be the amount of steal time expected within a fixed period of time. Any other steal time accruing during that period will show as the traditional steal time. --- Michael Wolf (5): Alter the amount of steal time reported by the guest. Expand the steal time msr to also contain the consigned time. Add the code to send the consigned time from the host to the guest Add a timer to allow the separation of consigned from steal time. Add an ioctl to communicate the consign limit to the host.
[PATCH 1/5] Alter the amount of steal time reported by the guest.
Modify the amount of stealtime that the kernel reports via the /proc interface. Steal time will now be broken down into steal_time and consigned_time. Consigned_time will represent the amount of time that is expected to be lost due to overcommitment of the physical cpu or by using cpu capping. The amount of consigned_time will be passed in using an ioctl. The time will be expressed as the number of nanoseconds expected to be lost during the fixed period. The fixed period is currently 1/10th of a second. Signed-off-by: Michael Wolf m...@linux.vnet.ibm.com --- fs/proc/stat.c |9 +++-- include/linux/kernel_stat.h |1 + 2 files changed, 8 insertions(+), 2 deletions(-) diff --git a/fs/proc/stat.c b/fs/proc/stat.c index e296572..cb7fe80 100644 --- a/fs/proc/stat.c +++ b/fs/proc/stat.c @@ -82,7 +82,7 @@ static int show_stat(struct seq_file *p, void *v) int i, j; unsigned long jif; u64 user, nice, system, idle, iowait, irq, softirq, steal; - u64 guest, guest_nice; + u64 guest, guest_nice, consign; u64 sum = 0; u64 sum_softirq = 0; unsigned int per_softirq_sums[NR_SOFTIRQS] = {0}; @@ -90,10 +90,11 @@ static int show_stat(struct seq_file *p, void *v) user = nice = system = idle = iowait = irq = softirq = steal = 0; - guest = guest_nice = 0; + guest = guest_nice = consign = 0; getboottime(&boottime); jif = boottime.tv_sec; + for_each_possible_cpu(i) { user += kcpustat_cpu(i).cpustat[CPUTIME_USER]; nice += kcpustat_cpu(i).cpustat[CPUTIME_NICE]; @@ -105,6 +106,7 @@ static int show_stat(struct seq_file *p, void *v) steal += kcpustat_cpu(i).cpustat[CPUTIME_STEAL]; guest += kcpustat_cpu(i).cpustat[CPUTIME_GUEST]; guest_nice += kcpustat_cpu(i).cpustat[CPUTIME_GUEST_NICE]; + consign += kcpustat_cpu(i).cpustat[CPUTIME_CONSIGN]; sum += kstat_cpu_irqs_sum(i); sum += arch_irq_stat_cpu(i); @@ -128,6 +130,7 @@ static int show_stat(struct seq_file *p, void *v) seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(steal)); seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(guest)); seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(guest_nice)); + seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(consign)); seq_putc(p, '\n'); for_each_online_cpu(i) { @@ -142,6 +145,7 @@ static int show_stat(struct seq_file *p, void *v) steal = kcpustat_cpu(i).cpustat[CPUTIME_STEAL]; guest = kcpustat_cpu(i).cpustat[CPUTIME_GUEST]; guest_nice = kcpustat_cpu(i).cpustat[CPUTIME_GUEST_NICE]; + consign = kcpustat_cpu(i).cpustat[CPUTIME_CONSIGN]; seq_printf(p, "cpu%d", i); seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(user)); seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(nice)); @@ -153,6 +157,7 @@ static int show_stat(struct seq_file *p, void *v) seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(steal)); seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(guest)); seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(guest_nice)); + seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(consign)); seq_putc(p, '\n'); } seq_printf(p, "intr %llu", (unsigned long long)sum); diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h index 1865b1f..e5978b0 100644 --- a/include/linux/kernel_stat.h +++ b/include/linux/kernel_stat.h @@ -28,6 +28,7 @@ enum cpu_usage_stat { CPUTIME_STEAL, CPUTIME_GUEST, CPUTIME_GUEST_NICE, + CPUTIME_CONSIGN, NR_STATS, }; --
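With this patch, the per-cpu lines in /proc/stat would carry one extra column after guest_nice. A sketch of how a monitoring tool might read the extended line — the field order matches /proc/stat plus the proposed trailing consign column, and the sample values below are made up for illustration:

```c
#include <assert.h>
#include <inttypes.h>
#include <stdio.h>

/*
 * Parse an aggregate "cpu" line from a /proc/stat extended by this
 * series: user nice system idle iowait irq softirq steal guest
 * guest_nice consign. Returns 1 on success.
 */
static int parse_stat_line(const char *line, uint64_t *steal, uint64_t *consign)
{
    uint64_t user, nice, sys, idle, iowait, irq, softirq, guest, guest_nice;

    return sscanf(line,
                  "cpu %" SCNu64 " %" SCNu64 " %" SCNu64 " %" SCNu64
                  " %" SCNu64 " %" SCNu64 " %" SCNu64 " %" SCNu64
                  " %" SCNu64 " %" SCNu64 " %" SCNu64,
                  &user, &nice, &sys, &idle, &iowait, &irq, &softirq,
                  steal, &guest, &guest_nice, consign) == 11;
}
```

Note the compatibility concern this raises on the list-side: existing parsers that count columns strictly would need to tolerate the extra field, which is why the new column is appended at the end of the line.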
[PATCH 2/5] Expand the steal time msr to also contain the consigned time.
Add a consigned field. This field will hold the time lost due to capping or overcommit. The rest of the time will still show up in the steal-time field. Signed-off-by: Michael Wolf m...@linux.vnet.ibm.com --- arch/x86/include/asm/paravirt.h |4 ++-- arch/x86/include/asm/paravirt_types.h |2 +- arch/x86/kernel/kvm.c |7 ++- kernel/sched/core.c | 10 +- kernel/sched/cputime.c|2 +- 5 files changed, 15 insertions(+), 10 deletions(-) diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h index a0facf3..a5f9f30 100644 --- a/arch/x86/include/asm/paravirt.h +++ b/arch/x86/include/asm/paravirt.h @@ -196,9 +196,9 @@ struct static_key; extern struct static_key paravirt_steal_enabled; extern struct static_key paravirt_steal_rq_enabled; -static inline u64 paravirt_steal_clock(int cpu) +static inline u64 paravirt_steal_clock(int cpu, u64 *steal) { - return PVOP_CALL1(u64, pv_time_ops.steal_clock, cpu); + PVOP_VCALL2(pv_time_ops.steal_clock, cpu, steal); } static inline unsigned long long paravirt_read_pmc(int counter) diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h index 142236e..5d4fc8b 100644 --- a/arch/x86/include/asm/paravirt_types.h +++ b/arch/x86/include/asm/paravirt_types.h @@ -95,7 +95,7 @@ struct pv_lazy_ops { struct pv_time_ops { unsigned long long (*sched_clock)(void); - unsigned long long (*steal_clock)(int cpu); + void (*steal_clock)(int cpu, unsigned long long *steal); unsigned long (*get_tsc_khz)(void); }; diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c index 4180a87..ac357b3 100644 --- a/arch/x86/kernel/kvm.c +++ b/arch/x86/kernel/kvm.c @@ -372,9 +372,8 @@ static struct notifier_block kvm_pv_reboot_nb = { .notifier_call = kvm_pv_reboot_notify, }; -static u64 kvm_steal_clock(int cpu) +static void kvm_steal_clock(int cpu, u64 *steal) { - u64 steal; struct kvm_steal_time *src; int version; @@ -382,11 +381,9 @@ static u64 kvm_steal_clock(int cpu) do { version = src->version; rmb(); - steal = src->steal; + *steal = src->steal; rmb(); } while ((version & 1) || (version != src->version)); - - return steal; } void kvm_disable_steal_time(void) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index c2e077c..b21d92d 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -748,6 +748,7 @@ static void update_rq_clock_task(struct rq *rq, s64 delta) */ #if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING) s64 steal = 0, irq_delta = 0; + u64 consigned = 0; #endif #ifdef CONFIG_IRQ_TIME_ACCOUNTING irq_delta = irq_time_read(cpu_of(rq)) - rq->prev_irq_time; @@ -776,8 +777,15 @@ static void update_rq_clock_task(struct rq *rq, s64 delta) #ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING if (static_key_false((&paravirt_steal_rq_enabled))) { u64 st; + u64 cs; - steal = paravirt_steal_clock(cpu_of(rq)); + paravirt_steal_clock(cpu_of(rq), &steal, &consigned); + /* +* since we are not assigning the steal time to cpustats +* here, just combine the steal and consigned times to +* do the rest of the calculations. +*/ + steal += consigned; steal -= rq->prev_steal_time_rq; if (unlikely(steal > delta)) diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c index 8d859da..593b647 100644 --- a/kernel/sched/cputime.c +++ b/kernel/sched/cputime.c @@ -275,7 +275,7 @@ static __always_inline bool steal_account_process_tick(void) if (static_key_false(&paravirt_steal_enabled)) { u64 steal, st = 0; - steal = paravirt_steal_clock(smp_processor_id()); + paravirt_steal_clock(smp_processor_id(), &steal); steal -= this_rq()->prev_steal_time; st = steal_ticks(steal); --
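The kvm_steal_time area shared with the guest is published with a seqcount-style version field: the host bumps it to an odd value before updating, and back to even afterwards; the guest read loop retries while the version is odd or changed mid-read. A single-threaded model of that protocol (struct and function names are illustrative; the real code uses memory barriers where the comments indicate):

```c
#include <assert.h>
#include <stdint.h>

/* Model of the steal-time publication area shared by host and guest. */
struct steal_time_model {
    uint64_t steal;
    uint64_t consigned;
    uint32_t version;
};

/* Host side: odd version while an update is in flight, even when stable. */
static void host_publish(struct steal_time_model *st,
                         uint64_t steal, uint64_t consigned)
{
    st->version++;              /* now odd: update in progress */
    st->steal = steal;          /* real code orders these with wmb() */
    st->consigned = consigned;
    st->version++;              /* now even: stable */
}

/* Guest side: retry until a consistent snapshot is observed. */
static void guest_read(const struct steal_time_model *st,
                       uint64_t *steal, uint64_t *consigned)
{
    uint32_t version;
    do {
        version = st->version;  /* rmb() here in the kernel */
        *steal = st->steal;
        *consigned = st->consigned;
        /* rmb() here in the kernel */
    } while ((version & 1) || (version != st->version));
}
```

This is why record_steal_time() bumps `version += 2` per update: the field stays even when idle, and readers only ever accept even, unchanged versions.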
Re: [PATCH 3/5] KVM: PPC: Book3S HV: Improve handling of local vs. global TLB invalidations
On Mon, Nov 26, 2012 at 02:10:33PM +0100, Alexander Graf wrote: On 23.11.2012, at 23:07, Paul Mackerras wrote: On Fri, Nov 23, 2012 at 04:43:03PM +0100, Alexander Graf wrote: On 22.11.2012, at 10:28, Paul Mackerras wrote: - With the possibility of the host paging out guest pages, the use of H_LOCAL by an SMP guest is dangerous since the guest could possibly retain and use a stale TLB entry pointing to a page that had been removed from the guest. I don't understand this part. Don't we flush the TLB when the page gets evicted from the shadow HTAB? The H_LOCAL flag is something that we invented to allow the guest to tell the host I only ever used this translation (HPTE) on the current vcpu when it's removing or modifying an HPTE. The idea is that that would then let the host use the tlbiel instruction (local TLB invalidate) rather than the usual global tlbie instruction. Tlbiel is faster because it doesn't need to go out on the fabric and get processed by all cpus. In fact our guests don't use it at present, but we put it in because we thought we should be able to get a performance improvement, particularly on large machines. However, the catch is that the guest's setting of H_LOCAL might be incorrect, in which case we could have a stale TLB entry on another physical cpu. While the physical page that it refers to is still owned by the guest, that stale entry doesn't matter from the host's point of view. But if the host wants to take that page away from the guest, the stale entry becomes a problem. That's exactly where my question lies. Does that mean we don't flush the TLB entry regardless when we take the page away from the guest? The question is how to find the TLB entry if the HPTE it came from is no longer present. Flushing a TLB entry requires a virtual address. When we're taking a page away from the guest we have the real address of the page, not the virtual address. 
We can use the reverse-mapping chains to loop through all the HPTEs that map the page, and from each HPTE we can (and do) calculate a virtual address and do a TLBIE on that virtual address (each HPTE could be at a different virtual address). The difficulty comes when we no longer have the HPTE but we potentially have a stale TLB entry, due to having used tlbiel when we removed the HPTE. Without the HPTE the only way to get rid of the stale TLB entry would be to completely flush all the TLB entries for the guest's LPID on every physical CPU it had ever run on. Since I don't want to go to that much effort, what I am proposing, and what this patch implements, is to not ever use tlbiel when removing HPTEs in SMP guests on POWER7. In other words, what this patch is about is making sure we don't get these troublesome stale TLB entries. Paul. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/5] KVM: PPC: Book3S HV: Handle guest-caused machine checks on POWER7 without panicking
On Mon, Nov 26, 2012 at 02:15:16PM +0100, Alexander Graf wrote: On 23.11.2012, at 22:42, Paul Mackerras wrote: On Fri, Nov 23, 2012 at 03:13:09PM +0100, Alexander Graf wrote: On 22.11.2012, at 10:25, Paul Mackerras wrote: + /* Do they have an SLB shadow buffer registered? */ + slb = vcpu->arch.slb_shadow.pinned_addr; + if (!slb) + return; Mind to explain this case? What happens here? Do we leave the guest with an empty SLB? Why would this ever happen? What happens next as soon as we go back into the guest? Yes, we leave the guest with an empty SLB, the access gets retried and this time the guest gets an SLB miss interrupt, which it can hopefully handle using an SLB miss handler that runs entirely in real mode. This could happen for instance while the guest is in SLOF or yaboot or some other code that runs basically in real mode but occasionally turns the MMU on for some accesses, and happens to have a bug where it creates a duplicate SLB entry. Is this what pHyp does? Also, is this what we want? Why don't we populate an #MC into the guest so it knows it did something wrong? Yes, yes and we do. Anytime we get a machine check while in the guest we give the guest a machine check interrupt. Ultimately we want to implement the FWNMI (Firmware-assisted NMI) thing defined in PAPR which makes the handling of system reset and machine check slightly nicer for the guest, but that's for later. It will build on top of the stuff in this patch. Paul.
Re: [PATCH 1/5] KVM: PPC: Book3S HV: Handle guest-caused machine checks on POWER7 without panicking
On 26.11.2012, at 22:33, Paul Mackerras wrote: On Mon, Nov 26, 2012 at 02:15:16PM +0100, Alexander Graf wrote: On 23.11.2012, at 22:42, Paul Mackerras wrote: On Fri, Nov 23, 2012 at 03:13:09PM +0100, Alexander Graf wrote: On 22.11.2012, at 10:25, Paul Mackerras wrote: + /* Do they have an SLB shadow buffer registered? */ + slb = vcpu-arch.slb_shadow.pinned_addr; + if (!slb) + return; Mind to explain this case? What happens here? Do we leave the guest with an empty SLB? Why would this ever happen? What happens next as soon as we go back into the guest? Yes, we leave the guest with an empty SLB, the access gets retried and this time the guest gets an SLB miss interrupt, which it can hopefully handle using an SLB miss handler that runs entirely in real mode. This could happen for instance while the guest is in SLOF or yaboot or some other code that runs basically in real mode but occasionally turns the MMU on for some accesses, and happens to have a bug where it creates a duplicate SLB entry. Is this what pHyp does? Also, is this what we want? Why don't we populate an #MC into the guest so it knows it did something wrong? Yes, yes and we do. Anytime we get a machine check while in the guest we give the guest a machine check interrupt. Ultimately we want to implement the FWNMI (Firmware-assisted NMI) thing defined in PAPR which makes the handling of system reset and machine check slightly nicer for the guest, but that's for later. It will build on top of the stuff in this patch. So why would the function return 1 then which means MC is handled, forget about it rather than 0, which means inject MC into the guest? Alex -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/5] KVM: PPC: Book3S HV: Improve handling of local vs. global TLB invalidations
On 26.11.2012, at 22:48, Paul Mackerras wrote: On Mon, Nov 26, 2012 at 02:10:33PM +0100, Alexander Graf wrote: On 23.11.2012, at 23:07, Paul Mackerras wrote: On Fri, Nov 23, 2012 at 04:43:03PM +0100, Alexander Graf wrote: On 22.11.2012, at 10:28, Paul Mackerras wrote: - With the possibility of the host paging out guest pages, the use of H_LOCAL by an SMP guest is dangerous since the guest could possibly retain and use a stale TLB entry pointing to a page that had been removed from the guest. I don't understand this part. Don't we flush the TLB when the page gets evicted from the shadow HTAB? The H_LOCAL flag is something that we invented to allow the guest to tell the host I only ever used this translation (HPTE) on the current vcpu when it's removing or modifying an HPTE. The idea is that that would then let the host use the tlbiel instruction (local TLB invalidate) rather than the usual global tlbie instruction. Tlbiel is faster because it doesn't need to go out on the fabric and get processed by all cpus. In fact our guests don't use it at present, but we put it in because we thought we should be able to get a performance improvement, particularly on large machines. However, the catch is that the guest's setting of H_LOCAL might be incorrect, in which case we could have a stale TLB entry on another physical cpu. While the physical page that it refers to is still owned by the guest, that stale entry doesn't matter from the host's point of view. But if the host wants to take that page away from the guest, the stale entry becomes a problem. That's exactly where my question lies. Does that mean we don't flush the TLB entry regardless when we take the page away from the guest? The question is how to find the TLB entry if the HPTE it came from is no longer present. Flushing a TLB entry requires a virtual address. When we're taking a page away from the guest we have the real address of the page, not the virtual address. 
We can use the reverse-mapping chains to loop through all the HPTEs that map the page, and from each HPTE we can (and do) calculate a virtual address and do a TLBIE on that virtual address (each HPTE could be at a different virtual address). The difficulty comes when we no longer have the HPTE but we potentially have a stale TLB entry, due to having used tlbiel when we removed the HPTE. Without the HPTE the only way to get rid of the stale TLB entry would be to completely flush all the TLB entries for the guest's LPID on every physical CPU it had ever run on. Since I don't want to go to that much effort, what I am proposing, and what this patch implements, is to not ever use tlbiel when removing HPTEs in SMP guests on POWER7. In other words, what this patch is about is making sure we don't get these troublesome stale TLB entries. I see. You could keep a list of to-be-flushed VAs around that you could skim through when taking a page away from the guest. That way you make the fast case fast (add/remove of page from the guest) and the slow path slow (paging). But I'm fine with disallowing local flushes on remove completely for now. It would be nice to get performance data on how much this would be a net win though. There are certainly ways of keeping local flushes alive with the scheme above. Thanks, applied to kvm-ppc-next. Alex -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
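Alex's suggestion above — keep a list of to-be-flushed VAs and skim it when taking a page away — can be sketched as a small deferred-flush structure: record the virtual address whenever an HPTE is removed with only a local invalidate (tlbiel), and drain the list with global invalidates (tlbie) before the host reclaims a page. Everything here is illustrative, not code from the series, and the real hardware invalidate is stubbed out:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative deferred-flush list for VAs that were only invalidated
 * locally (tlbiel); a global tlbie must be issued for each of them
 * before the host takes the backing page away from the guest. */
#define MAX_PENDING 64

struct pending_flush {
    uint64_t va[MAX_PENDING];
    size_t n;
};

/* Returns 0 if the VA was recorded, -1 if the list is full (the caller
 * would then have to fall back to an immediate global invalidate). */
static int record_local_invalidate(struct pending_flush *pf, uint64_t va)
{
    if (pf->n == MAX_PENDING)
        return -1;
    pf->va[pf->n++] = va;
    return 0;
}

/* Drain the list, issuing one global invalidate per VA (stubbed here),
 * and return how many were flushed. */
static size_t drain_pending(struct pending_flush *pf)
{
    size_t flushed = pf->n;
    /* for each pf->va[i]: tlbie(pf->va[i]) on real hardware */
    pf->n = 0;
    return flushed;
}
```

The trade-off Alex describes falls out of this shape: the common add/remove path stays on the fast tlbiel plus a cheap list append, while the rare paging path pays for the drain.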
Re: [PATCH 1/5] KVM: PPC: Book3S HV: Handle guest-caused machine checks on POWER7 without panicking
On 26.11.2012, at 22:55, Alexander Graf wrote: On 26.11.2012, at 22:33, Paul Mackerras wrote: On Mon, Nov 26, 2012 at 02:15:16PM +0100, Alexander Graf wrote: On 23.11.2012, at 22:42, Paul Mackerras wrote: On Fri, Nov 23, 2012 at 03:13:09PM +0100, Alexander Graf wrote: On 22.11.2012, at 10:25, Paul Mackerras wrote: +/* Do they have an SLB shadow buffer registered? */ +slb = vcpu->arch.slb_shadow.pinned_addr; +if (!slb) +return; Mind to explain this case? What happens here? Do we leave the guest with an empty SLB? Why would this ever happen? What happens next as soon as we go back into the guest? Yes, we leave the guest with an empty SLB, the access gets retried and this time the guest gets an SLB miss interrupt, which it can hopefully handle using an SLB miss handler that runs entirely in real mode. This could happen for instance while the guest is in SLOF or yaboot or some other code that runs basically in real mode but occasionally turns the MMU on for some accesses, and happens to have a bug where it creates a duplicate SLB entry. Is this what pHyp does? Also, is this what we want? Why don't we populate an #MC into the guest so it knows it did something wrong? Yes, yes and we do. Anytime we get a machine check while in the guest we give the guest a machine check interrupt. Ultimately we want to implement the FWNMI (Firmware-assisted NMI) thing defined in PAPR which makes the handling of system reset and machine check slightly nicer for the guest, but that's for later. It will build on top of the stuff in this patch. So why would the function return 1 then, which means "MC is handled, forget about it", rather than 0, which means "inject MC into the guest"? Oh wait - 1 means have the host handle it. Let me check up the code again. Alex
[Bug 50921] kvm hangs booting Windows 2000
https://bugzilla.kernel.org/show_bug.cgi?id=50921 --- Comment #16 from Lucio Crusca lu...@sulweb.org 2012-11-26 22:14:15 --- @xiaoguangrong: YOU ARE THE MAN! 'emulate_invalid_guest_state = 0' did the trick, now I have win2000 running in a 3.6.7 kvm guest! Thanks. Still guessing why it works with plain kvm-intel.ko in Debian kernels is out of my reach, but I can safely shove that mystery in the when-I'll-have-time-stuff drawer and live happy with this solution meanwhile.
Re: [PATCH] x86, kvm: Remove incorrect redundant assembly constraint
On 11/25/2012 11:22 PM, Paolo Bonzini wrote: On 21/11/2012 23:41, H. Peter Anvin wrote: From: H. Peter Anvin h...@linux.intel.com In __emulate_1op_rax_rdx, we use "+a" and "+d" which are input/output constraints, and *then* use "a" and "d" as input constraints. This is incorrect, but happens to work on some versions of gcc. However, it breaks gcc with -O0 and icc, and may break on future versions of gcc. Reported-and-tested-by: Melanie Blower melanie.blo...@intel.com Signed-off-by: H. Peter Anvin h...@linux.intel.com Link: http://lkml.kernel.org/r/b3584e72cfebed439a3eca9bce67a4ef1b17a...@fmsmsx107.amr.corp.intel.com --- arch/x86/kvm/emulate.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index 39171cb..bba39bf 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -426,8 +426,7 @@ static void invalidate_registers(struct x86_emulate_ctxt *ctxt) _ASM_EXTABLE(1b, 3b)\ : "=m" ((ctxt)->eflags), "=r" (_tmp), \ "+a" (*rax), "+d" (*rdx), "+qm"(_ex) \ - : "i" (EFLAGS_MASK), "m" ((ctxt)->src.val), \ - "a" (*rax), "d" (*rdx)); \ + : "i" (EFLAGS_MASK), "m" ((ctxt)->src.val));\ } while (0) /* instruction has only one source operand, destination is implicit (e.g. mul, div, imul, idiv) */ Reviewed-by: Paolo Bonzini pbonz...@redhat.com Gleb, Marcelo: are you going to apply this or would you prefer I took it in x86/urgent? -hpa
Re: [PATCH 1/5] KVM: PPC: Book3S HV: Handle guest-caused machine checks on POWER7 without panicking
On Mon, Nov 26, 2012 at 11:03:48PM +0100, Alexander Graf wrote: On 26.11.2012, at 22:55, Alexander Graf wrote: On 26.11.2012, at 22:33, Paul Mackerras wrote: On Mon, Nov 26, 2012 at 02:15:16PM +0100, Alexander Graf wrote: On 23.11.2012, at 22:42, Paul Mackerras wrote: On Fri, Nov 23, 2012 at 03:13:09PM +0100, Alexander Graf wrote: On 22.11.2012, at 10:25, Paul Mackerras wrote: + /* Do they have an SLB shadow buffer registered? */ + slb = vcpu-arch.slb_shadow.pinned_addr; + if (!slb) + return; Mind to explain this case? What happens here? Do we leave the guest with an empty SLB? Why would this ever happen? What happens next as soon as we go back into the guest? Yes, we leave the guest with an empty SLB, the access gets retried and this time the guest gets an SLB miss interrupt, which it can hopefully handle using an SLB miss handler that runs entirely in real mode. This could happen for instance while the guest is in SLOF or yaboot or some other code that runs basically in real mode but occasionally turns the MMU on for some accesses, and happens to have a bug where it creates a duplicate SLB entry. Is this what pHyp does? Also, is this what we want? Why don't we populate an #MC into the guest so it knows it did something wrong? Yes, yes and we do. Anytime we get a machine check while in the guest we give the guest a machine check interrupt. Ultimately we want to implement the FWNMI (Firmware-assisted NMI) thing defined in PAPR which makes the handling of system reset and machine check slightly nicer for the guest, but that's for later. It will build on top of the stuff in this patch. So why would the function return 1 then which means MC is handled, forget about it rather than 0, which means inject MC into the guest? Oh wait - 1 means have the host handle it. Let me check up the code again. 1 means the problem is fixed, now give the guest a machine check interrupt. 0 means exit the guest, have the host's MC handler look at it, then give the guest a machine check. 
In this case the delivery of the MC to the guest happens in kvmppc_handle_exit(). Paul.
Re: [PATCH 3/5] KVM: PPC: Book3S HV: Improve handling of local vs. global TLB invalidations
On Mon, Nov 26, 2012 at 11:03:19PM +0100, Alexander Graf wrote: On 26.11.2012, at 22:48, Paul Mackerras wrote: On Mon, Nov 26, 2012 at 02:10:33PM +0100, Alexander Graf wrote: On 23.11.2012, at 23:07, Paul Mackerras wrote: On Fri, Nov 23, 2012 at 04:43:03PM +0100, Alexander Graf wrote: On 22.11.2012, at 10:28, Paul Mackerras wrote: - With the possibility of the host paging out guest pages, the use of H_LOCAL by an SMP guest is dangerous since the guest could possibly retain and use a stale TLB entry pointing to a page that had been removed from the guest. I don't understand this part. Don't we flush the TLB when the page gets evicted from the shadow HTAB? The H_LOCAL flag is something that we invented to allow the guest to tell the host I only ever used this translation (HPTE) on the current vcpu when it's removing or modifying an HPTE. The idea is that that would then let the host use the tlbiel instruction (local TLB invalidate) rather than the usual global tlbie instruction. Tlbiel is faster because it doesn't need to go out on the fabric and get processed by all cpus. In fact our guests don't use it at present, but we put it in because we thought we should be able to get a performance improvement, particularly on large machines. However, the catch is that the guest's setting of H_LOCAL might be incorrect, in which case we could have a stale TLB entry on another physical cpu. While the physical page that it refers to is still owned by the guest, that stale entry doesn't matter from the host's point of view. But if the host wants to take that page away from the guest, the stale entry becomes a problem. That's exactly where my question lies. Does that mean we don't flush the TLB entry regardless when we take the page away from the guest? The question is how to find the TLB entry if the HPTE it came from is no longer present. Flushing a TLB entry requires a virtual address. 
When we're taking a page away from the guest we have the real address of the page, not the virtual address. We can use the reverse-mapping chains to loop through all the HPTEs that map the page, and from each HPTE we can (and do) calculate a virtual address and do a TLBIE on that virtual address (each HPTE could be at a different virtual address). The difficulty comes when we no longer have the HPTE but we potentially have a stale TLB entry, due to having used tlbiel when we removed the HPTE. Without the HPTE the only way to get rid of the stale TLB entry would be to completely flush all the TLB entries for the guest's LPID on every physical CPU it had ever run on. Since I don't want to go to that much effort, what I am proposing, and what this patch implements, is to not ever use tlbiel when removing HPTEs in SMP guests on POWER7. In other words, what this patch is about is making sure we don't get these troublesome stale TLB entries. I see. You could keep a list of to-be-flushed VAs around that you could skim through when taking a page away from the guest. That way you make the fast case fast (add/remove of page from the guest) and the slow path slow (paging). Yes, I thought about that, but the problem is that the list of VAs could get arbitrarily long and take up a lot of host memory. But I'm fine with disallowing local flushes on remove completely for now. It would be nice to get performance data on how much this would be a net win though. There are certainly ways of keeping local flushes alive with the scheme above. Yes, I definitely want to get some good performance data to see how much of a win it would be, and if there is a good win, work out some scheme to let us use the local flushes. Thanks, applied to kvm-ppc-next. Thanks, Paul.
Re: [PATCH 3/3] KVM: x86: improve reexecute_instruction
On Tue, Nov 20, 2012 at 07:59:53AM +0800, Xiao Guangrong wrote: The current reexecute_instruction can not well detect the failed instruction emulation. It allows guest to retry all the instructions except it accesses on error pfn. For example, some cases are nested-write-protect - if the page we want to write is used as PDE but it chains to itself. Under this case, we should stop the emulation and report the case to userspace. Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com --- arch/x86/include/asm/kvm_host.h |2 + arch/x86/kvm/paging_tmpl.h |2 + arch/x86/kvm/x86.c | 54 --- 3 files changed, 43 insertions(+), 15 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index b2e11f4..c5eb52f 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -566,6 +566,8 @@ struct kvm_arch { u64 hv_guest_os_id; u64 hv_hypercall; + /* synchronizing reexecute_instruction and page fault path. */ + u64 page_fault_count; #ifdef CONFIG_KVM_MMU_AUDIT int audit_point; #endif diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h index 891eb6d..d55ad89 100644 --- a/arch/x86/kvm/paging_tmpl.h +++ b/arch/x86/kvm/paging_tmpl.h @@ -568,6 +568,8 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr, u32 error_code, if (mmu_notifier_retry(vcpu-kvm, mmu_seq)) goto out_unlock; + vcpu-kvm-arch.page_fault_count++; + kvm_mmu_audit(vcpu, AUDIT_PRE_PAGE_FAULT); kvm_mmu_free_some_pages(vcpu); if (!force_pt_level) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 5fe72cc..2fe484b 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -4473,37 +4473,61 @@ static bool reexecute_instruction(struct kvm_vcpu *vcpu, unsigned long cr2) { gpa_t gpa = cr2; pfn_t pfn; - - if (!ACCESS_ONCE(vcpu-kvm-arch.indirect_shadow_pages)) - return false; + u64 page_fault_count; + int emulate; if (!vcpu-arch.mmu.direct_map) { gpa = kvm_mmu_gva_to_gpa_read(vcpu, cr2, NULL); + /* + * If the mapping is invalid 
in guest, let cpu retry + * it to generate fault. + */ if (gpa == UNMAPPED_GVA) - return true; /* let cpu generate fault */ + return true; } /* - * if emulation was due to access to shadowed page table - * and it failed try to unshadow page and re-enter the - * guest to let CPU execute the instruction. - */ - if (kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(gpa))) - return true; - - /* * Do not retry the unhandleable instruction if it faults on the * readonly host memory, otherwise it will goto a infinite loop: * retry instruction - write #PF - emulation fail - retry * instruction - ... */ pfn = gfn_to_pfn(vcpu->kvm, gpa_to_gfn(gpa)); - if (!is_error_noslot_pfn(pfn)) { - kvm_release_pfn_clean(pfn); + + /* + * If the instruction failed on the error pfn, it can not be fixed, + * report the error to userspace. + */ + if (is_error_noslot_pfn(pfn)) + return false; + + kvm_release_pfn_clean(pfn); + + /* The instructions are well-emulated on direct mmu. */ + if (vcpu->arch.mmu.direct_map) { + if (ACCESS_ONCE(vcpu->kvm->arch.indirect_shadow_pages)) + kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(gpa)); + return true; } - return false; +again: + page_fault_count = ACCESS_ONCE(vcpu->kvm->arch.page_fault_count); + + /* + * if emulation was due to access to shadowed page table + * and it failed try to unshadow page and re-enter the + * guest to let CPU execute the instruction. + */ + kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(gpa)); + emulate = vcpu->arch.mmu.page_fault(vcpu, cr3, PFERR_WRITE_MASK, false); Can you explain what is the objective here? + /* The page fault path called above can increase the count. */ + if (page_fault_count + 1 != + ACCESS_ONCE(vcpu->kvm->arch.page_fault_count)) + goto again; + + return !emulate; } static bool retry_instruction(struct x86_emulate_ctxt *ctxt,
Re: [PATCH 2/3] KVM: x86: let reexecute_instruction work for tdp
On Tue, Nov 20, 2012 at 07:59:10AM +0800, Xiao Guangrong wrote: Currently, reexecute_instruction refused to retry all instructions. If nested npt is used, the emulation may be caused by shadow page, it can be fixed by dropping the shadow page Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com --- arch/x86/kvm/x86.c | 14 -- 1 files changed, 8 insertions(+), 6 deletions(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 7be8452..5fe72cc 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -4469,17 +4469,19 @@ static int handle_emulation_failure(struct kvm_vcpu *vcpu) return r; } -static bool reexecute_instruction(struct kvm_vcpu *vcpu, gva_t gva) +static bool reexecute_instruction(struct kvm_vcpu *vcpu, unsigned long cr2) { - gpa_t gpa; + gpa_t gpa = cr2; pfn_t pfn; - if (tdp_enabled) + if (!ACCESS_ONCE(vcpu->kvm->arch.indirect_shadow_pages)) return false; How is indirect_shadow_pages protected? Why is ACCESS_ONCE() being used to read it? - gpa = kvm_mmu_gva_to_gpa_read(vcpu, gva, NULL); - if (gpa == UNMAPPED_GVA) - return true; /* let cpu generate fault */ + if (!vcpu->arch.mmu.direct_map) { + gpa = kvm_mmu_gva_to_gpa_read(vcpu, cr2, NULL); + if (gpa == UNMAPPED_GVA) + return true; /* let cpu generate fault */ + } /* * if emulation was due to access to shadowed page table -- 1.7.7.6
Re: Re: Re: Re: Re: [RFC PATCH 0/2] kvm/vmx: Output TSC offset
On Mon, Nov 26, 2012 at 08:05:10PM +0900, Yoshihiro YUNOMAE wrote: 500h. event tsc_write tsc_offset=-3000 Then a guest trace containing events with a TSC timestamp. Which tsc_offset to use? (That is the problem, which unless I am mistaken can only be solved easily if the guest can convert RDTSC to the TSC of the host.) There are three cases in which the TSC offset changes: 1. Reset TSC at guest boot time 2. Adjust TSC offset due to some host problem 3. Write TSC on the guest The scenario you mentioned is case 3, so we'll discuss this case. Here, we assume that the guest is allocated a single CPU for the sake of simplicity. If a guest executes write_tsc, the TSC value jumps forward or backward. For the forward case, the trace data are as follows:

    host                      guest
    cycles  events            cycles  events
    3000    tsc_offset=-2950
    3001    kvm_enter
                              53      eventX
                              100     (write_tsc=+900)
    3060    kvm_exit
    3075    tsc_offset=-2050
    3080    kvm_enter
                              1050    event1
                              1055    event2
    ...

This case is simple. The guest TSC of the first kvm_enter is calculated as follows: (host TSC of kvm_enter) + (current tsc_offset) = 3001 - 2950 = 51. Similarly, the guest TSC of the second kvm_enter is 130. So the guest events between 51 and 130, that is, "53 eventX", are inserted between the first pair of kvm_enter and kvm_exit. To insert guest events between 51 and 130, we convert the guest TSC to the host TSC using TSC offset 2950. For the backward case, the trace data are as follows:

    host                      guest
    cycles  events            cycles  events
    3000    tsc_offset=-2950
    3001    kvm_enter
                              53      eventX
                              100     (write_tsc=-50)
    3060    kvm_exit
    3075    tsc_offset=-3000
    3080    kvm_enter
                              90      event1
                              95      event2
    ...
    3400                      100     (write_tsc=-50)
                              90      event3
                              95      event4

As you say, in this case the previous method is invalid. When we calculate the guest TSC value for the tsc_offset=-3000 event, the value is 75 on the guest. This seems like a prior event of the write_tsc=-50 event. So we need to consider this further.
In this case, it is important that we can determine where the guest executes write_tsc or the host rewrites the TSC offset. write_tsc on the guest is equivalent to wrmsr 0x0010, so this instruction induces a vmexit. This implies that the guest is not running while the host changes the TSC offset on the CPU; in other words, the guest cannot use the new TSC before the host has written the new TSC offset. So, if the timestamp on the guest does not increase monotonically, we know the guest executed write_tsc. Moreover, from the region where the timestamp decreases, we can tell in the guest trace data when the host rewrote the TSC offset. Therefore, we can sort the trace data in chronological order. This requires an entire trace of events. That is, to be able to reconstruct the timeline you require the entire trace from the moment the guest starts, so that you can correlate wrmsr-to-tsc on the guest with vmexit-due-to-tsc-write on the host. Which means that running out of space in the trace buffer equals losing the ability to order events. Is that desirable? It seems cumbersome to me. As you say, tracing events can overwrite important events like kvm_exit/entry or write_tsc_offset. So Steven's multiple-buffer work is needed by this feature: frequently occurring events are recorded in buffer A, and rare but important events in buffer B. In our case, the important event is write_tsc_offset. Also, the need to correlate each write_tsc event in the guest trace with a corresponding tsc_offset write in the host trace means that it is _necessary_ for the guest and host to enable tracing simultaneously. Correct? Also, there are WRMSR executions in the guest for which there is no event in the trace buffer: from SeaBIOS, during boot. In that case, there is no explicit event in the guest trace which you can correlate with tsc_offset changes on the host side. I understand what you want to say, but we don't correlate the write_tsc event and the write_tsc_offset event directly.
This is because no tracepoint for write_tsc (i.e. the WRMSR instruction) exists in the current kernel. So, in the previous mail (https://lkml.org/lkml/2012/11/22/53), I suggested a method in which we don't need a write_tsc tracepoint. In that method, we enable ftrace before the guest boots, and we need to keep all
Re: [PATCH v2] KVM: PPC: Book3S HV: Handle guest-caused machine checks on POWER7 without panicking
On Tue, Nov 27, 2012 at 12:16:28AM +0100, Alexander Graf wrote: On 24.11.2012, at 09:37, Paul Mackerras wrote: Currently, if a machine check interrupt happens while we are in the guest, we exit the guest and call the host's machine check handler, which tends to cause the host to panic. Some machine checks can be triggered by the guest; for example, if the guest creates two entries in the SLB that map the same effective address, and then accesses that effective address, the CPU will take a machine check interrupt. To handle this better, when a machine check happens inside the guest, we call a new function, kvmppc_realmode_machine_check(), while still in real mode before exiting the guest. On POWER7, it handles the cases that the guest can trigger, either by flushing and reloading the SLB, or by flushing the TLB, and then it delivers the machine check interrupt directly to the guest without going back to the host. On POWER7, the OPAL firmware patches the machine check interrupt vector so that it gets control first, and it leaves behind its analysis of the situation in a structure pointed to by the opal_mc_evt field of the paca. The kvmppc_realmode_machine_check() function looks at this, and if OPAL reports that there was no error, or that it has handled the error, we also go straight back to the guest with a machine check. We have to deliver a machine check to the guest since the machine check interrupt might have trashed valid values in SRR0/1. If the machine check is one we can't handle in real mode, and one that OPAL hasn't already handled, or on PPC970, we exit the guest and call the host's machine check handler. We do this by jumping to the machine_check_fwnmi label, rather than absolute address 0x200, because we don't want to re-execute OPAL's handler on POWER7. On PPC970, the two are equivalent because address 0x200 just contains a branch. 
Then, if the host machine check handler decides that the system can continue executing, kvmppc_handle_exit() delivers a machine check interrupt to the guest -- once again to let the guest know that SRR0/1 have been modified. Signed-off-by: Paul Mackerras pau...@samba.org Thanks for the semantic explanations :). From that POV things are clear and good with me now. That leaves only checkpatch ;) WARNING: please, no space before tabs #142: FILE: arch/powerpc/kvm/book3s_hv_ras.c:21: +#define SRR1_MC_IFETCH_SLBMULTI ^I3^I/* SLB multi-hit */$ WARNING: please, no space before tabs #143: FILE: arch/powerpc/kvm/book3s_hv_ras.c:22: +#define SRR1_MC_IFETCH_SLBPARMULTI ^I4^I/* SLB parity + multi-hit */$ WARNING: min() should probably be min_t(u32, slb->persistent, SLB_MIN_SIZE) #168: FILE: arch/powerpc/kvm/book3s_hv_ras.c:47: + n = min(slb->persistent, (u32) SLB_MIN_SIZE); total: 0 errors, 3 warnings, 357 lines checked Phooey. Do you want me to resubmit the patch, or will you fix it up? Paul.
Re: [PATCH 3/5] KVM: PPC: Book3S HV: Improve handling of local vs. global TLB invalidations
On 27.11.2012, at 00:16, Paul Mackerras wrote: On Mon, Nov 26, 2012 at 11:03:19PM +0100, Alexander Graf wrote: On 26.11.2012, at 22:48, Paul Mackerras wrote: On Mon, Nov 26, 2012 at 02:10:33PM +0100, Alexander Graf wrote: On 23.11.2012, at 23:07, Paul Mackerras wrote: On Fri, Nov 23, 2012 at 04:43:03PM +0100, Alexander Graf wrote: On 22.11.2012, at 10:28, Paul Mackerras wrote: - With the possibility of the host paging out guest pages, the use of H_LOCAL by an SMP guest is dangerous since the guest could possibly retain and use a stale TLB entry pointing to a page that had been removed from the guest. I don't understand this part. Don't we flush the TLB when the page gets evicted from the shadow HTAB? The H_LOCAL flag is something that we invented to allow the guest to tell the host I only ever used this translation (HPTE) on the current vcpu when it's removing or modifying an HPTE. The idea is that that would then let the host use the tlbiel instruction (local TLB invalidate) rather than the usual global tlbie instruction. Tlbiel is faster because it doesn't need to go out on the fabric and get processed by all cpus. In fact our guests don't use it at present, but we put it in because we thought we should be able to get a performance improvement, particularly on large machines. However, the catch is that the guest's setting of H_LOCAL might be incorrect, in which case we could have a stale TLB entry on another physical cpu. While the physical page that it refers to is still owned by the guest, that stale entry doesn't matter from the host's point of view. But if the host wants to take that page away from the guest, the stale entry becomes a problem. That's exactly where my question lies. Does that mean we don't flush the TLB entry regardless when we take the page away from the guest? The question is how to find the TLB entry if the HPTE it came from is no longer present. Flushing a TLB entry requires a virtual address. 
When we're taking a page away from the guest we have the real address of the page, not the virtual address. We can use the reverse-mapping chains to loop through all the HPTEs that map the page, and from each HPTE we can (and do) calculate a virtual address and do a TLBIE on that virtual address (each HPTE could be at a different virtual address). The difficulty comes when we no longer have the HPTE but we potentially have a stale TLB entry, due to having used tlbiel when we removed the HPTE. Without the HPTE the only way to get rid of the stale TLB entry would be to completely flush all the TLB entries for the guest's LPID on every physical CPU it had ever run on. Since I don't want to go to that much effort, what I am proposing, and what this patch implements, is to not ever use tlbiel when removing HPTEs in SMP guests on POWER7. In other words, what this patch is about is making sure we don't get these troublesome stale TLB entries. I see. You could keep a list of to-be-flushed VAs around that you could skim through when taking a page away from the guest. That way you make the fast case fast (add/remove of page from the guest) and the slow path slow (paging). Yes, I thought about that, but the problem is that the list of VAs could get arbitrarily long and take up a lot of host memory. You can always cap it at an arbitrary number, similar to how the TLB itself is limited too. Alex
Re: [PATCH v2] KVM: PPC: Book3S HV: Handle guest-caused machine checks on POWER7 without panicking
On 27.11.2012, at 00:18, Paul Mackerras wrote: On Tue, Nov 27, 2012 at 12:16:28AM +0100, Alexander Graf wrote: On 24.11.2012, at 09:37, Paul Mackerras wrote: Currently, if a machine check interrupt happens while we are in the guest, we exit the guest and call the host's machine check handler, which tends to cause the host to panic. Some machine checks can be triggered by the guest; for example, if the guest creates two entries in the SLB that map the same effective address, and then accesses that effective address, the CPU will take a machine check interrupt. To handle this better, when a machine check happens inside the guest, we call a new function, kvmppc_realmode_machine_check(), while still in real mode before exiting the guest. On POWER7, it handles the cases that the guest can trigger, either by flushing and reloading the SLB, or by flushing the TLB, and then it delivers the machine check interrupt directly to the guest without going back to the host. On POWER7, the OPAL firmware patches the machine check interrupt vector so that it gets control first, and it leaves behind its analysis of the situation in a structure pointed to by the opal_mc_evt field of the paca. The kvmppc_realmode_machine_check() function looks at this, and if OPAL reports that there was no error, or that it has handled the error, we also go straight back to the guest with a machine check. We have to deliver a machine check to the guest since the machine check interrupt might have trashed valid values in SRR0/1. If the machine check is one we can't handle in real mode, and one that OPAL hasn't already handled, or on PPC970, we exit the guest and call the host's machine check handler. We do this by jumping to the machine_check_fwnmi label, rather than absolute address 0x200, because we don't want to re-execute OPAL's handler on POWER7. On PPC970, the two are equivalent because address 0x200 just contains a branch. 
Then, if the host machine check handler decides that the system can continue executing, kvmppc_handle_exit() delivers a machine check interrupt to the guest -- once again to let the guest know that SRR0/1 have been modified. Signed-off-by: Paul Mackerras pau...@samba.org Thanks for the semantic explanations :). From that POV things are clear and good with me now. That leaves only checkpatch ;) WARNING: please, no space before tabs #142: FILE: arch/powerpc/kvm/book3s_hv_ras.c:21: +#define SRR1_MC_IFETCH_SLBMULTI ^I3^I/* SLB multi-hit */$ WARNING: please, no space before tabs #143: FILE: arch/powerpc/kvm/book3s_hv_ras.c:22: +#define SRR1_MC_IFETCH_SLBPARMULTI ^I4^I/* SLB parity + multi-hit */$ WARNING: min() should probably be min_t(u32, slb->persistent, SLB_MIN_SIZE) #168: FILE: arch/powerpc/kvm/book3s_hv_ras.c:47: +n = min(slb->persistent, (u32) SLB_MIN_SIZE); total: 0 errors, 3 warnings, 357 lines checked Phooey. Do you want me to resubmit the patch, or will you fix it up? Hrm. Promise to run checkpatch yourself next time and I'll fix it up for you this time ;) Alex
Re: [PATCH 0/4] AER-KVM: Error containment of PCI pass-thru devices assigned to KVM guests
On Tue, Nov 20, 2012 at 02:09:46PM +, Pandarathil, Vijaymohan R wrote: -Original Message- From: Stefan Hajnoczi [mailto:stefa...@gmail.com] Sent: Tuesday, November 20, 2012 5:41 AM To: Pandarathil, Vijaymohan R Cc: kvm@vger.kernel.org; linux-...@vger.kernel.org; qemu-de...@nongnu.org; linux-ker...@vger.kernel.org Subject: Re: [PATCH 0/4] AER-KVM: Error containment of PCI pass-thru devices assigned to KVM guests On Tue, Nov 20, 2012 at 06:31:48AM +, Pandarathil, Vijaymohan R wrote: Add support for error containment when a PCI pass-thru device assigned to a KVM guest encounters an error. This is for PCIe devices/drivers that support AER functionality. When the OS is notified of an error in a device either through the firmware first approach or through an interrupt handled by the AER root port driver, concerned subsystems are notified by invoking callbacks registered by these subsystems. The device is also marked as tainted till the corresponding driver recovery routines are successful. KVM module registers for a notification of such errors. In the KVM callback routine, a global counter is incremented to keep track of the error notification. Before each CPU enters guest mode to execute guest code, appropriate checks are done to see if the impacted device belongs to the guest or not. If the device belongs to the guest, qemu hypervisor for the guest is informed and the guest is immediately brought down, thus preventing or minimizing chances of any bad data being written out by the guest driver after the device has encountered an error. I'm surprised that the hypervisor would shut down the guest when PCIe AER kicks in for a pass-through device. Shouldn't we pass the AER event into the guest and deal with it there? Agreed. That would be the ideal behavior and is planned in a future patch. Lack of control over the capabilities/type of the OS/drivers running in the guest is also a concern in passing along the event to the guest. 
My understanding is that in the current implementation of Linux/KVM, these errors are not handled at all and can potentially cause a guest hang or crash or even data corruption depending on the implementation of the guest driver for the device. As a first step, these patches make the behavior better by doing error containment with a predictable behavior when such errors occur. For both ACPI notifications and Linux PCI AER driver there is a way for the PCI driver to receive a notification, correct? Can just have virt/kvm/assigned-dev.c code register such a notifier (as a PCI driver) and then perform appropriate action? Also the semantics of tainted driver is not entirely clear. Is there any reason for not having this feature for VFIO only, as KVM device assigment is being phased out? -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] x86, kvm: Remove incorrect redundant assembly constraint
On Mon, Nov 26, 2012 at 02:48:50PM -0800, H. Peter Anvin wrote: On 11/25/2012 11:22 PM, Paolo Bonzini wrote: On 21/11/2012 23:41, H. Peter Anvin wrote: From: H. Peter Anvin h...@linux.intel.com In __emulate_1op_rax_rdx, we use "+a" and "+d", which are input/output constraints, and *then* use "a" and "d" as input constraints. This is incorrect, but happens to work on some versions of gcc. However, it breaks gcc with -O0 and icc, and may break on future versions of gcc. Reported-and-tested-by: Melanie Blower melanie.blo...@intel.com Signed-off-by: H. Peter Anvin h...@linux.intel.com Link: http://lkml.kernel.org/r/b3584e72cfebed439a3eca9bce67a4ef1b17a...@fmsmsx107.amr.corp.intel.com
---
 arch/x86/kvm/emulate.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 39171cb..bba39bf 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -426,8 +426,7 @@ static void invalidate_registers(struct x86_emulate_ctxt *ctxt)
 		_ASM_EXTABLE(1b, 3b)				\
 		: "=m" ((ctxt)->eflags), "=r" (_tmp),		\
 		  "+a" (*rax), "+d" (*rdx), "+qm" (_ex)		\
-		: "i" (EFLAGS_MASK), "m" ((ctxt)->src.val),	\
-		  "a" (*rax), "d" (*rdx));			\
+		: "i" (EFLAGS_MASK), "m" ((ctxt)->src.val));	\
 	} while (0)

/* instruction has only one source operand, destination is implicit (e.g. mul, div, imul, idiv) */ Reviewed-by: Paolo Bonzini pbonz...@redhat.com Gleb, Marcelo: are you going to apply this or would you prefer I took it in x86/urgent? -hpa Feel free to merge it through x86/urgent.
Re: [PATCH] x86, kvm: Remove incorrect redundant assembly constraint
On 11/26/2012 03:48 PM, Marcelo Tosatti wrote: Gleb, Marcelo: are you going to apply this or would you prefer I took it in x86/urgent? -hpa Feel free to merge it through x86/urgent. I presume that's an Acked-by? -hpa
Re: [PATCH v2] KVM: PPC: Book3S HV: Handle guest-caused machine checks on POWER7 without panicking
On Tue, Nov 27, 2012 at 12:20:08AM +0100, Alexander Graf wrote: Hrm. Promise to run checkpatch yourself next time and I'll fix it up for you this time ;) OK, will do, thanks. :) Paul.
Re: [PATCH] vfio powerpc: enabled and supported on powernv platform
On Mon, 2012-11-26 at 11:04 -0700, Alex Williamson wrote: Ok, I see tces are put on shutdown via tce_iommu_detach_group, so you're more concerned about the guest simply mapping over top of its own mappings. Is that common? Is it common enough for every multi-page mapping to assume it will happen? I know this is a performance sensitive path for you and it seems like a map-only w/ fallback to unmap, remap would be better in the general case. On x86 we do exactly that, but we do the unmap, remap from userspace when we get an EBUSY. Thanks, Right, Linux as a guest at least will never map over an existing mapping; it will always unmap first. I.e. the only transitions we do on H_PUT_TCE are 0 -> valid and valid -> 0. So it would be fine to simplify the code and keep the map-over-map case as a slow fallback. I can't tell for other operating systems but we don't care about those at this point :-) Cheers, Ben.
Re: The smp_affinity cannot work correctly on guest os when PCI passthrough device using msi/msi-x with KVM
Alex, Thanks for your reply, and I will check it again with MSI-X. YiLi 2012/11/27 Alex Williamson alex.william...@redhat.com: On Tue, 2012-11-27 at 00:47 +0800, yi li wrote: hi Alex, the qemu-kvm version is 1.2. And is the device making use of MSI-X or MSI interrupts? MSI-X should work on 1.2, MSI does not yet support vector updates for affinity, but patches are welcome. Thanks, Alex 2012/11/26 Alex Williamson alex.william...@redhat.com: On Fri, 2012-11-23 at 11:06 +0800, yi li wrote: Hi Guys, there is an issue where smp_affinity cannot work correctly on the guest OS when a PCI passthrough device uses MSI/MSI-X with KVM. My reasoning: the pcpu will raise a lot of IPI interrupts to find the vcpu to handle the irq, so the guest OS will VM_EXIT frequently. Right? If smp_affinity worked correctly on the guest OS, the best setup would be to pin (cputune) the vcpu that handles the irq to the pcpu that handles the kvm:pci-bus irq on the host. But unfortunately, I find that smp_affinity does not work correctly on the guest OS with MSI/MSI-X. How to reproduce: 1: passthrough a netcard (Broadcom BCM5716S) to the guest OS 2: ifup the netcard; the card will use MSI-X interrupts by default; stop the irqbalance service 3: echo 4 > /proc/irq/NETCARDIRQ/smp_affinity, so we assume vcpu2 handles the irq 4: we have set <vcpupin vcpu='2' cpuset='1'/> and bound the kvm:pci-bus irq to pcpu1 on the host. We think this configuration will reduce IPI interrupts when injecting interrupts into the guest OS, but the irq is not handled only on vcpu2. Maybe it is not what we expect. What version of qemu-kvm/qemu are you using? There's been some work recently specifically to enable this. Thanks, Alex
Re: [PATCH] x86, kvm: Remove incorrect redundant assembly constraint
On Mon, Nov 26, 2012 at 03:49:36PM -0800, H. Peter Anvin wrote: On 11/26/2012 03:48 PM, Marcelo Tosatti wrote: Gleb, Marcelo: are you going to apply this or would you prefer I took it in x86/urgent? -hpa Feel free to merge it through x86/urgent. I presume that's an Acked-by? -hpa Yes.
[Bug 50891] The smp_affinity cannot work correctly on guest os when PCI passthrough device using msi/msi-x with KVM
https://bugzilla.kernel.org/show_bug.cgi?id=50891 --- Comment #2 from liyi yiliker...@gmail.com 2012-11-27 00:57:28 --- Sorry, I was not clear about it. 1: I am sure the device is using MSI-X, and the test failed. Checking the attribute, entry->msi_attrib.is_msix is 1. Also, the qemu-kvm version is 1.2. When using the virtio driver, I find that the netcard uses MSI-X, but the test passes. I have also tested an Intel 82599 SR-IOV card, passing the VF through to the guest OS; the test failed with MSI-X the same way as with the BCM5716S.
[PATCH V4 0/2] Enable guest use of TSC_ADJUST functionality
This revision, V4, addresses a couple of issues I missed from Gleb and Marcelo. Thanks, Will

Will Auld (2):
  Add code to track call origin for msr assignment.
  Enabling IA32_TSC_ADJUST for KVM guest VM support

 arch/x86/include/asm/cpufeature.h |  1 +
 arch/x86/include/asm/kvm_host.h   | 15 ++---
 arch/x86/include/asm/msr-index.h  |  1 +
 arch/x86/kvm/cpuid.c              |  2 ++
 arch/x86/kvm/cpuid.h              |  8 +++
 arch/x86/kvm/svm.c                | 28 ++--
 arch/x86/kvm/vmx.c                | 33 ++--
 arch/x86/kvm/x86.c                | 45 ---
 arch/x86/kvm/x86.h                |  2 +-
 9 files changed, 110 insertions(+), 25 deletions(-)
--
1.8.0.rc0
Re: [PATCH v8 1/2] x86/kexec: add a new atomic notifier list for kdump
On 2012-11-27 02:18, Eric W. Biederman wrote: Gleb Natapov g...@redhat.com writes: On Mon, Nov 26, 2012 at 11:43:10AM -0600, Eric W. Biederman wrote: Gleb Natapov g...@redhat.com writes: On Mon, Nov 26, 2012 at 09:08:54AM -0600, Eric W. Biederman wrote: Zhang Yanfei zhangyan...@cn.fujitsu.com writes: This patch adds an atomic notifier list named crash_notifier_list. Currently, when loading the kvm-intel module, a notifier will be registered in the list to enable VMCSs loaded on all cpus to be VMCLEAR'd if needed. crash_notifier_list ick gag please no. Effectively this makes the kexec on panic code path undebuggable. Instead we need to use direct function calls to whatever you are doing. The code walks a linked list in the kvm-intel module and calls vmclear on whatever it finds there. Since the function has to reside in the kvm-intel module it cannot be called directly. Is a callback pointer that is set by kvm-intel more acceptable? Yes, a specific callback function is more acceptable. Looking a little deeper, vmclear_local_loaded_vmcss is not particularly acceptable. It is doing a lot of work that is unnecessary to save the virtual registers on the kexec on panic path. What work are you referring to in particular that may not be acceptable? The unnecessary work that I see is all of the software state changing: unlinking things from linked lists, flipping variables. None of that appears related to the fundamental issue of saving cpu state. Simply reusing a function that does more than what is strictly required makes me nervous. What is the chance that the function will grow with maintenance and add constructs that are not safe in a kexec on panic situation? So in summary: 1. a specific callback function instead of a notifier? 2. Instead of calling vmclear_local_loaded_vmcss, the vmclear operation will just call vmclear on every VMCS loaded on the cpu?
like below:

static void crash_vmclear_local_loaded_vmcss(void)
{
	int cpu = raw_smp_processor_id();
	struct loaded_vmcs *v, *n;

	if (!crash_local_vmclear_enabled(cpu))
		return;

	list_for_each_entry_safe(v, n, &per_cpu(loaded_vmcss_on_cpu, cpu),
				 loaded_vmcss_on_cpu_link)
		vmcs_clear(v->vmcs);
}

right? Thanks Zhang In fact I wonder if it might not just be easier to call vmcs_clear on a fixed per cpu buffer. There may be more than one vmcs loaded on a cpu, hence the list. Performing list walking in interrupt context without locking in vmclear_local_loaded_vmcss looks a bit scary. Not that locking would make it any better, as locking would simply add one more way to deadlock the system. Only an rcu list walk is at all safe. A list walk that modifies the list as vmclear_local_loaded_vmcss does is definitely not safe. The list vmclear_local_loaded walks is per cpu. Zhang's kvm patch disables the kexec callback while the list is modified. If the list is only modified on its cpu and we are running on that cpu, that does look like it will give the necessary protections. It isn't particularly clear at first glance that is the case, unfortunately. Eric
[PATCH V4 1/2] Add code to track call origin for msr assignment.
In order to track who initiated the call (host or guest) to modify an msr value I have changed function call parameters along the call path. The specific change is to add a struct pointer parameter that points to (index, data, caller) information rather than having this information passed as individual parameters. The initial use for this capability is for updating the IA32_TSC_ADJUST msr while setting the tsc value. It is anticipated that this capability is useful other tasks. Signed-off-by: Will Auld will.a...@intel.com --- arch/x86/include/asm/kvm_host.h | 12 +--- arch/x86/kvm/svm.c | 21 +++-- arch/x86/kvm/vmx.c | 24 +--- arch/x86/kvm/x86.c | 23 +++ arch/x86/kvm/x86.h | 2 +- 5 files changed, 57 insertions(+), 25 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 09155d6..da34027 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -598,6 +598,12 @@ struct kvm_vcpu_stat { struct x86_instruction_info; +struct msr_data { +bool host_initiated; +u32 index; +u64 data; +}; + struct kvm_x86_ops { int (*cpu_has_kvm_support)(void); /* __init */ int (*disabled_by_bios)(void); /* __init */ @@ -621,7 +627,7 @@ struct kvm_x86_ops { void (*set_guest_debug)(struct kvm_vcpu *vcpu, struct kvm_guest_debug *dbg); int (*get_msr)(struct kvm_vcpu *vcpu, u32 msr_index, u64 *pdata); - int (*set_msr)(struct kvm_vcpu *vcpu, u32 msr_index, u64 data); + int (*set_msr)(struct kvm_vcpu *vcpu, struct msr_data *msr); u64 (*get_segment_base)(struct kvm_vcpu *vcpu, int seg); void (*get_segment)(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg); @@ -772,7 +778,7 @@ static inline int emulate_instruction(struct kvm_vcpu *vcpu, void kvm_enable_efer_bits(u64); int kvm_get_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 *data); -int kvm_set_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 data); +int kvm_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr); struct x86_emulate_ctxt; @@ -799,7 +805,7 @@ void 
kvm_get_cs_db_l_bits(struct kvm_vcpu *vcpu, int *db, int *l); int kvm_set_xcr(struct kvm_vcpu *vcpu, u32 index, u64 xcr); int kvm_get_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata); -int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data); +int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr); unsigned long kvm_get_rflags(struct kvm_vcpu *vcpu); void kvm_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags); diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index baead95..5ac11f0 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -1211,6 +1211,7 @@ static struct kvm_vcpu *svm_create_vcpu(struct kvm *kvm, unsigned int id) struct page *msrpm_pages; struct page *hsave_page; struct page *nested_msrpm_pages; + struct msr_data msr; int err; svm = kmem_cache_zalloc(kvm_vcpu_cache, GFP_KERNEL); @@ -1255,7 +1256,10 @@ static struct kvm_vcpu *svm_create_vcpu(struct kvm *kvm, unsigned int id) svm-vmcb_pa = page_to_pfn(page) PAGE_SHIFT; svm-asid_generation = 0; init_vmcb(svm); - kvm_write_tsc(svm-vcpu, 0); + msr.data = 0x0; + msr.index = MSR_IA32_TSC; + msr.host_initiated = true; + kvm_write_tsc(svm-vcpu, msr); err = fx_init(svm-vcpu); if (err) @@ -3147,13 +3151,15 @@ static int svm_set_vm_cr(struct kvm_vcpu *vcpu, u64 data) return 0; } -static int svm_set_msr(struct kvm_vcpu *vcpu, unsigned ecx, u64 data) +static int svm_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr) { struct vcpu_svm *svm = to_svm(vcpu); + u32 ecx = msr-index; + u64 data = msr-data; switch (ecx) { case MSR_IA32_TSC: - kvm_write_tsc(vcpu, data); + kvm_write_tsc(vcpu, msr); break; case MSR_STAR: svm-vmcb-save.star = data; @@ -3208,20 +3214,23 @@ static int svm_set_msr(struct kvm_vcpu *vcpu, unsigned ecx, u64 data) vcpu_unimpl(vcpu, unimplemented wrmsr: 0x%x data 0x%llx\n, ecx, data); break; default: - return kvm_set_msr_common(vcpu, ecx, data); + return kvm_set_msr_common(vcpu, msr); } return 0; } static int wrmsr_interception(struct vcpu_svm *svm) { + struct 
msr_data msr; u32 ecx = svm-vcpu.arch.regs[VCPU_REGS_RCX]; u64 data = (svm-vcpu.arch.regs[VCPU_REGS_RAX] -1u) | ((u64)(svm-vcpu.arch.regs[VCPU_REGS_RDX] -1u) 32); - + msr.data = data; + msr.index = ecx; + msr.host_initiated = false; svm-next_rip = kvm_rip_read(svm-vcpu) + 2; - if (svm_set_msr(svm-vcpu,
[PATCH V4 2/2] Enabling IA32_TSC_ADJUST for KVM guest VM support
CPUID.7.0.EBX[1]=1 indicates the IA32_TSC_ADJUST MSR 0x3b is supported. The basic design is to emulate the MSR by allowing reads and writes to a guest vcpu specific location to store the value of the emulated MSR while adding that value to the vmcs tsc_offset. In this way the IA32_TSC_ADJUST value will be included in all reads of the TSC MSR, whether through rdmsr or rdtsc. This is of course as long as the "use TSC counter offsetting" VM-execution control is enabled, as well as the IA32_TSC_ADJUST control. However, because hardware will only return TSC + IA32_TSC_ADJUST + vmcs tsc_offset for a guest process when it does a rdtsc (with the correct settings), the value of our virtualized IA32_TSC_ADJUST must be stored in one of these three locations. The argument against storing it in the actual MSR is performance: this MSR is likely to be seldom used, while the save/restore would be required on every transition. IA32_TSC_ADJUST was created as a way to solve some issues with writing the TSC itself, so that is not an option either. The remaining option, defined above as our solution, has the problem of returning incorrect vmcs tsc_offset values (unless we intercept and fix them, not done here) as mentioned above. More problematic, however, is that storing the data in the vmcs tsc_offset has a different semantic effect on the system than using the actual MSR. This is illustrated in the following example: the hypervisor sets IA32_TSC_ADJUST, then the guest sets it, and a guest process performs a rdtsc. In this case the guest process will get TSC + IA32_TSC_ADJUST_hypervisor + vmcs tsc_offset including IA32_TSC_ADJUST_guest. While the total system semantics change, the semantics as seen by the guest do not, and hence this will not cause a problem.
Signed-off-by: Will Auld will.a...@intel.com --- arch/x86/include/asm/cpufeature.h | 1 + arch/x86/include/asm/kvm_host.h | 3 +++ arch/x86/include/asm/msr-index.h | 1 + arch/x86/kvm/cpuid.c | 2 ++ arch/x86/kvm/cpuid.h | 8 arch/x86/kvm/svm.c| 7 +++ arch/x86/kvm/vmx.c| 9 + arch/x86/kvm/x86.c| 22 ++ 8 files changed, 53 insertions(+) diff --git a/arch/x86/include/asm/cpufeature.h b/arch/x86/include/asm/cpufeature.h index 6b7ee5f..e574d81 100644 --- a/arch/x86/include/asm/cpufeature.h +++ b/arch/x86/include/asm/cpufeature.h @@ -199,6 +199,7 @@ /* Intel-defined CPU features, CPUID level 0x0007:0 (ebx), word 9 */ #define X86_FEATURE_FSGSBASE (9*32+ 0) /* {RD/WR}{FS/GS}BASE instructions*/ +#define X86_FEATURE_TSC_ADJUST (9*32+ 1) /* TSC adjustment MSR 0x3b */ #define X86_FEATURE_BMI1 (9*32+ 3) /* 1st group bit manipulation extensions */ #define X86_FEATURE_HLE(9*32+ 4) /* Hardware Lock Elision */ #define X86_FEATURE_AVX2 (9*32+ 5) /* AVX2 instructions */ diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index da34027..cf8c7e0 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -442,6 +442,8 @@ struct kvm_vcpu_arch { u32 virtual_tsc_mult; u32 virtual_tsc_khz; + s64 ia32_tsc_adjust_msr; + atomic_t nmi_queued; /* unprocessed asynchronous NMIs */ unsigned nmi_pending; /* NMI queued after currently running handler */ bool nmi_injected;/* Trying to inject an NMI this entry */ @@ -690,6 +692,7 @@ struct kvm_x86_ops { bool (*has_wbinvd_exit)(void); void (*set_tsc_khz)(struct kvm_vcpu *vcpu, u32 user_tsc_khz, bool scale); + u64 (*read_tsc_offset)(struct kvm_vcpu *vcpu); void (*write_tsc_offset)(struct kvm_vcpu *vcpu, u64 offset); u64 (*compute_tsc_offset)(struct kvm_vcpu *vcpu, u64 target_tsc); diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h index 957ec87..6486569 100644 --- a/arch/x86/include/asm/msr-index.h +++ b/arch/x86/include/asm/msr-index.h @@ -231,6 +231,7 @@ #define 
MSR_IA32_EBL_CR_POWERON0x002a #define MSR_EBC_FREQUENCY_ID 0x002c #define MSR_IA32_FEATURE_CONTROL0x003a +#define MSR_IA32_TSC_ADJUST 0x003b #define FEATURE_CONTROL_LOCKED (10) #define FEATURE_CONTROL_VMXON_ENABLED_INSIDE_SMX (11) diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index 0595f13..e817bac 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -320,6 +320,8 @@ static int do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function, if (index == 0) { entry-ebx = kvm_supported_word9_x86_features; cpuid_mask(entry-ebx, 9); + // TSC_ADJUST is emulated + entry-ebx |= F(TSC_ADJUST); } else entry-ebx = 0; entry-eax = 0; diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h index
[PATCH V3] target-i386: Enabling IA32_TSC_ADJUST for QEMU KVM guest VMs
CPUID.7.0.EBX[1]=1 indicates IA32_TSC_ADJUST MSR 0x3b is supported Basic design is to emulate the MSR by allowing reads and writes to the hypervisor vcpu specific locations to store the value of the emulated MSRs. In this way the IA32_TSC_ADJUST value will be included in all reads to the TSC MSR whether through rdmsr or rdtsc. As this is a new MSR that the guest may access and modify its value needs to be migrated along with the other MRSs. The changes here are specifically for recognizing when IA32_TSC_ADJUST is enabled in CPUID and code added for migrating its value. Signed-off-by: Will Auld will.a...@intel.com --- target-i386/cpu.h | 2 ++ target-i386/kvm.c | 15 +++ target-i386/machine.c | 21 + 3 files changed, 38 insertions(+) diff --git a/target-i386/cpu.h b/target-i386/cpu.h index aabf993..13d4152 100644 --- a/target-i386/cpu.h +++ b/target-i386/cpu.h @@ -284,6 +284,7 @@ #define MSR_IA32_APICBASE_BSP (18) #define MSR_IA32_APICBASE_ENABLE(111) #define MSR_IA32_APICBASE_BASE (0xf12) +#define MSR_TSC_ADJUST 0x003b #define MSR_IA32_TSCDEADLINE0x6e0 #define MSR_MTRRcap0xfe @@ -701,6 +702,7 @@ typedef struct CPUX86State { uint64_t async_pf_en_msr; uint64_t tsc; +uint64_t tsc_adjust; uint64_t tsc_deadline; uint64_t mcg_status; diff --git a/target-i386/kvm.c b/target-i386/kvm.c index 696b14a..e974c42 100644 --- a/target-i386/kvm.c +++ b/target-i386/kvm.c @@ -61,6 +61,7 @@ const KVMCapabilityInfo kvm_arch_required_capabilities[] = { static bool has_msr_star; static bool has_msr_hsave_pa; +static bool has_msr_tsc_adjust; static bool has_msr_tsc_deadline; static bool has_msr_async_pf_en; static bool has_msr_misc_enable; @@ -641,6 +642,10 @@ static int kvm_get_supported_msrs(KVMState *s) has_msr_hsave_pa = true; continue; } +if (kvm_msr_list-indices[i] == MSR_TSC_ADJUST) { +has_msr_tsc_adjust = true; +continue; +} if (kvm_msr_list-indices[i] == MSR_IA32_TSCDEADLINE) { has_msr_tsc_deadline = true; continue; @@ -978,6 +983,10 @@ static int kvm_put_msrs(CPUX86State *env, int 
level) if (has_msr_hsave_pa) { kvm_msr_entry_set(msrs[n++], MSR_VM_HSAVE_PA, env-vm_hsave); } +if (has_msr_tsc_adjust) { +kvm_msr_entry_set(msrs[n++], + MSR_TSC_ADJUST, env-tsc_adjust); +} if (has_msr_tsc_deadline) { kvm_msr_entry_set(msrs[n++], MSR_IA32_TSCDEADLINE, env-tsc_deadline); } @@ -1234,6 +1243,9 @@ static int kvm_get_msrs(CPUX86State *env) if (has_msr_hsave_pa) { msrs[n++].index = MSR_VM_HSAVE_PA; } +if (has_msr_tsc_adjust) { +msrs[n++].index = MSR_TSC_ADJUST; +} if (has_msr_tsc_deadline) { msrs[n++].index = MSR_IA32_TSCDEADLINE; } @@ -1308,6 +1320,9 @@ static int kvm_get_msrs(CPUX86State *env) case MSR_IA32_TSC: env-tsc = msrs[i].data; break; +case MSR_TSC_ADJUST: +env-tsc_adjust = msrs[i].data; +break; case MSR_IA32_TSCDEADLINE: env-tsc_deadline = msrs[i].data; break; diff --git a/target-i386/machine.c b/target-i386/machine.c index a8be058..95bda9b 100644 --- a/target-i386/machine.c +++ b/target-i386/machine.c @@ -310,6 +310,24 @@ static const VMStateDescription vmstate_fpop_ip_dp = { } }; +static bool tsc_adjust_needed(void *opaque) +{ +CPUX86State *cpu = opaque; + +return cpu-tsc_adjust != 0; +} + +static const VMStateDescription vmstate_msr_tsc_adjust = { +.name = cpu/msr_tsc_adjust, +.version_id = 1, +.minimum_version_id = 1, +.minimum_version_id_old = 1, +.fields = (VMStateField []) { +VMSTATE_UINT64(tsc_adjust, CPUX86State), +VMSTATE_END_OF_LIST() +} +}; + static bool tscdeadline_needed(void *opaque) { CPUX86State *env = opaque; @@ -457,6 +475,9 @@ static const VMStateDescription vmstate_cpu = { .vmsd = vmstate_fpop_ip_dp, .needed = fpop_ip_dp_needed, }, { +.vmsd = vmstate_msr_tsc_adjust, +.needed = tsc_adjust_needed, +}, { .vmsd = vmstate_msr_tscdeadline, .needed = tscdeadline_needed, }, { -- 1.8.0.rc0 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [Qemu-devel] [PATCH V2] Resend - Enabling IA32_TSC_ADJUST for QEMU KVM guest VMs
Andreas, Thanks. I just sent the update patch (V3) to address your comments. Will -Original Message- From: kvm-ow...@vger.kernel.org [mailto:kvm-ow...@vger.kernel.org] On Behalf Of Andreas Färber Sent: Monday, November 26, 2012 11:05 AM To: Auld, Will Cc: Will Auld; qemu-devel; Gleb; Marcelo Tosatti; kvm@vger.kernel.org; Dugger, Donald D; Liu, Jinsong; Zhang, Xiantao; a...@redhat.com Subject: Re: [Qemu-devel] [PATCH V2] Resend - Enabling IA32_TSC_ADJUST for Qemu KVM guest VMs Hello, Am 26.11.2012 19:42, schrieb Will Auld: CPUID.7.0.EBX[1]=1 indicates IA32_TSC_ADJUST MSR 0x3b is supported Basic design is to emulate the MSR by allowing reads and writes to the hypervisor vcpu specific locations to store the value of the emulated MSRs. In this way the IA32_TSC_ADJUST value will be included in all reads to the TSC MSR whether through rdmsr or rdtsc. As this is a new MSR that the guest may access and modify its value needs to be migrated along with the other MRSs. The changes here are specifically for recognizing when IA32_TSC_ADJUST is enabled in CPUID and code added for migrating its value. Signed-off-by: Will Auld will.a...@intel.com $subject should get a prefix of target-i386: and resend is better used inside a tag so that it doesn't end up in the commit. And it's QEMU. ;) Some more stylistic issues inline: --- target-i386/cpu.h | 2 ++ target-i386/kvm.c | 15 +++ target-i386/machine.c | 21 + 3 files changed, 38 insertions(+) diff --git a/target-i386/cpu.h b/target-i386/cpu.h index aabf993..13d4152 100644 --- a/target-i386/cpu.h +++ b/target-i386/cpu.h @@ -284,6 +284,7 @@ #define MSR_IA32_APICBASE_BSP (18) #define MSR_IA32_APICBASE_ENABLE(111) #define MSR_IA32_APICBASE_BASE (0xf12) +#define MSR_TSC_ADJUST 0x003b Tabs. You can use scripts/checkpatch.pl to verify. 
#define MSR_IA32_TSCDEADLINE0x6e0 #define MSR_MTRRcap0xfe @@ -701,6 +702,7 @@ typedef struct CPUX86State { uint64_t async_pf_en_msr; uint64_t tsc; +uint64_t tsc_adjust; uint64_t tsc_deadline; uint64_t mcg_status; diff --git a/target-i386/kvm.c b/target-i386/kvm.c index 696b14a..e974c42 100644 --- a/target-i386/kvm.c +++ b/target-i386/kvm.c @@ -61,6 +61,7 @@ const KVMCapabilityInfo kvm_arch_required_capabilities[] = { static bool has_msr_star; static bool has_msr_hsave_pa; +static bool has_msr_tsc_adjust; static bool has_msr_tsc_deadline; static bool has_msr_async_pf_en; static bool has_msr_misc_enable; @@ -641,6 +642,10 @@ static int kvm_get_supported_msrs(KVMState *s) has_msr_hsave_pa = true; continue; } +if (kvm_msr_list-indices[i] == MSR_TSC_ADJUST) { +has_msr_tsc_adjust = true; +continue; +} if (kvm_msr_list-indices[i] == MSR_IA32_TSCDEADLINE) { has_msr_tsc_deadline = true; continue; @@ -978,6 +983,10 @@ static int kvm_put_msrs(CPUX86State *env, int level) if (has_msr_hsave_pa) { kvm_msr_entry_set(msrs[n++], MSR_VM_HSAVE_PA, env- vm_hsave); } +if (has_msr_tsc_adjust) { +kvm_msr_entry_set(msrs[n++], + MSR_TSC_ADJUST, env-tsc_adjust); Tabs. 
+} if (has_msr_tsc_deadline) { kvm_msr_entry_set(msrs[n++], MSR_IA32_TSCDEADLINE, env- tsc_deadline); } @@ -1234,6 +1243,9 @@ static int kvm_get_msrs(CPUX86State *env) if (has_msr_hsave_pa) { msrs[n++].index = MSR_VM_HSAVE_PA; } +if (has_msr_tsc_adjust) { +msrs[n++].index = MSR_TSC_ADJUST; +} if (has_msr_tsc_deadline) { msrs[n++].index = MSR_IA32_TSCDEADLINE; } @@ -1308,6 +1320,9 @@ static int kvm_get_msrs(CPUX86State *env) case MSR_IA32_TSC: env-tsc = msrs[i].data; break; +case MSR_TSC_ADJUST: +env-tsc_adjust = msrs[i].data; +break; case MSR_IA32_TSCDEADLINE: env-tsc_deadline = msrs[i].data; break; diff --git a/target-i386/machine.c b/target-i386/machine.c index a8be058..95bda9b 100644 --- a/target-i386/machine.c +++ b/target-i386/machine.c @@ -310,6 +310,24 @@ static const VMStateDescription vmstate_fpop_ip_dp = { } }; +static bool tsc_adjust_needed(void *opaque) { +CPUX86State *cpu = opaque; Please name this env to differentiate from CPUState / X86CPU. Since there are other tsc_* fields already I won't request that you move your new field to the containing X86CPU struct but at some point we will need to convert the VMSDs to X86CPU. + +return
Re: [PATCH v8 1/2] x86/kexec: add a new atomic notifier list for kdump
Zhang Yanfei zhangyan...@cn.fujitsu.com writes: So in summary, 1. a specific callback function instead of a notifier? Yes. 2. Instead of calling vmclear_local_loaded_vmcss, the vmclear operation will just call vmclear on every VMCS loaded on the cpu? like below:

static void crash_vmclear_local_loaded_vmcss(void)
{
	int cpu = raw_smp_processor_id();
	struct loaded_vmcs *v, *n;

	if (!crash_local_vmclear_enabled(cpu))
		return;

	list_for_each_entry_safe(v, n, &per_cpu(loaded_vmcss_on_cpu, cpu),
				 loaded_vmcss_on_cpu_link)
		vmcs_clear(v->vmcs);
}

right? Yeah that looks good. I would do list_for_each_entry because the list isn't changing. Eric
Re: [Qemu-devel] [PATCH V3] target-i386: Enabling IA32_TSC_ADJUST for QEMU KVM guest VMs
On 27.11.2012 02:40, Will Auld wrote: CPUID.7.0.EBX[1]=1 indicates IA32_TSC_ADJUST MSR 0x3b is supported Basic design is to emulate the MSR by allowing reads and writes to the hypervisor vcpu specific locations to store the value of the emulated MSRs. In this way the IA32_TSC_ADJUST value will be included in all reads to the TSC MSR whether through rdmsr or rdtsc. As this is a new MSR that the guest may access and modify, its value needs to be migrated along with the other MSRs. The changes here are specifically for recognizing when IA32_TSC_ADJUST is enabled in CPUID and code added for migrating its value. Signed-off-by: Will Auld will.a...@intel.com Something went wrong here, none of the V2 review comments are addressed. Maybe you sent the wrong patch file? Cheers, Andreas -- SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer; HRB 16746 AG Nürnberg
Re: [PATCH v8 1/2] x86/kexec: add a new atomic notifier list for kdump
On 2012-11-27 09:49, Eric W. Biederman wrote: Zhang Yanfei zhangyan...@cn.fujitsu.com writes: So in summary, 1. a specific callback function instead of a notifier? Yes. 2. Instead of calling vmclear_local_loaded_vmcss, the vmclear operation will just call vmclear on every VMCS loaded on the cpu? like below:

static void crash_vmclear_local_loaded_vmcss(void)
{
	int cpu = raw_smp_processor_id();
	struct loaded_vmcs *v, *n;

	if (!crash_local_vmclear_enabled(cpu))
		return;

	list_for_each_entry_safe(v, n, &per_cpu(loaded_vmcss_on_cpu, cpu),
				 loaded_vmcss_on_cpu_link)
		vmcs_clear(v->vmcs);
}

right? Yeah that looks good. I would do list_for_each_entry because the list isn't changing. OK. I will update the patch and resend it. Zhang
RE: [Qemu-devel] [PATCH V3] target-i386: Enabling IA32_TSC_ADJUST for QEMU KVM guest VMs
Sorry, let me figure this out and resend.

Thanks,

Will

-----Original Message-----
From: kvm-ow...@vger.kernel.org [mailto:kvm-ow...@vger.kernel.org] On Behalf Of Andreas Färber
Sent: Monday, November 26, 2012 5:51 PM
To: Auld, Will
Cc: Will Auld; qemu-devel; Gleb; mtosa...@redhat.com; kvm@vger.kernel.org; Dugger, Donald D; Liu, Jinsong; Zhang, Xiantao; a...@redhat.com
Subject: Re: [Qemu-devel] [PATCH V3] target-i386: Enabling IA32_TSC_ADJUST for QEMU KVM guest VMs

On 27.11.2012 02:40, Will Auld wrote:
> CPUID.7.0.EBX[1]=1 indicates IA32_TSC_ADJUST MSR 0x3b is supported.
>
> Basic design is to emulate the MSR by allowing reads and writes to
> hypervisor vcpu-specific locations that store the value of the
> emulated MSR. In this way the IA32_TSC_ADJUST value will be included
> in all reads of the TSC MSR, whether through rdmsr or rdtsc.
>
> As this is a new MSR that the guest may access and modify, its value
> needs to be migrated along with the other MSRs.
>
> The changes here are specifically for recognizing when IA32_TSC_ADJUST
> is enabled in CPUID, plus the code added for migrating its value.
>
> Signed-off-by: Will Auld will.a...@intel.com

Something went wrong here; none of the V2 review comments are addressed. Maybe you sent the wrong patch file?

Cheers,
Andreas
[PATCH V4] target-i386: Enabling IA32_TSC_ADJUST for Qemu KVM guest VMs
CPUID.7.0.EBX[1]=1 indicates IA32_TSC_ADJUST MSR 0x3b is supported.

Basic design is to emulate the MSR by allowing reads and writes to hypervisor vcpu-specific locations that store the value of the emulated MSR. In this way the IA32_TSC_ADJUST value will be included in all reads of the TSC MSR, whether through rdmsr or rdtsc.

As this is a new MSR that the guest may access and modify, its value needs to be migrated along with the other MSRs.

The changes here are specifically for recognizing when IA32_TSC_ADJUST is enabled in CPUID, plus the code added for migrating its value.

Signed-off-by: Will Auld will.a...@intel.com
---
 target-i386/cpu.h     |  2 ++
 target-i386/kvm.c     | 14 ++++++++++++++
 target-i386/machine.c | 21 +++++++++++++++++++++
 3 files changed, 37 insertions(+)

diff --git a/target-i386/cpu.h b/target-i386/cpu.h
index aabf993..9dedaa6 100644
--- a/target-i386/cpu.h
+++ b/target-i386/cpu.h
@@ -284,6 +284,7 @@
 #define MSR_IA32_APICBASE_BSP           (1<<8)
 #define MSR_IA32_APICBASE_ENABLE        (1<<11)
 #define MSR_IA32_APICBASE_BASE          (0xfffff<<12)
+#define MSR_TSC_ADJUST                  0x003b
 #define MSR_IA32_TSCDEADLINE            0x6e0
 
 #define MSR_MTRRcap                     0xfe
@@ -701,6 +702,7 @@ typedef struct CPUX86State {
     uint64_t async_pf_en_msr;
 
     uint64_t tsc;
+    uint64_t tsc_adjust;
     uint64_t tsc_deadline;
 
     uint64_t mcg_status;
diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index 696b14a..6d2a061 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -61,6 +61,7 @@ const KVMCapabilityInfo kvm_arch_required_capabilities[] = {
 
 static bool has_msr_star;
 static bool has_msr_hsave_pa;
+static bool has_msr_tsc_adjust;
 static bool has_msr_tsc_deadline;
 static bool has_msr_async_pf_en;
 static bool has_msr_misc_enable;
@@ -641,6 +642,10 @@ static int kvm_get_supported_msrs(KVMState *s)
                 has_msr_hsave_pa = true;
                 continue;
             }
+            if (kvm_msr_list->indices[i] == MSR_TSC_ADJUST) {
+                has_msr_tsc_adjust = true;
+                continue;
+            }
             if (kvm_msr_list->indices[i] == MSR_IA32_TSCDEADLINE) {
                 has_msr_tsc_deadline = true;
                 continue;
             }
@@ -978,6 +983,9 @@ static int kvm_put_msrs(CPUX86State *env, int level)
     if (has_msr_hsave_pa) {
         kvm_msr_entry_set(&msrs[n++], MSR_VM_HSAVE_PA, env->vm_hsave);
     }
+    if (has_msr_tsc_adjust) {
+        kvm_msr_entry_set(&msrs[n++], MSR_TSC_ADJUST, env->tsc_adjust);
+    }
     if (has_msr_tsc_deadline) {
         kvm_msr_entry_set(&msrs[n++], MSR_IA32_TSCDEADLINE, env->tsc_deadline);
     }
@@ -1234,6 +1242,9 @@ static int kvm_get_msrs(CPUX86State *env)
     if (has_msr_hsave_pa) {
         msrs[n++].index = MSR_VM_HSAVE_PA;
     }
+    if (has_msr_tsc_adjust) {
+        msrs[n++].index = MSR_TSC_ADJUST;
+    }
     if (has_msr_tsc_deadline) {
         msrs[n++].index = MSR_IA32_TSCDEADLINE;
     }
@@ -1308,6 +1319,9 @@ static int kvm_get_msrs(CPUX86State *env)
         case MSR_IA32_TSC:
             env->tsc = msrs[i].data;
             break;
+        case MSR_TSC_ADJUST:
+            env->tsc_adjust = msrs[i].data;
+            break;
         case MSR_IA32_TSCDEADLINE:
             env->tsc_deadline = msrs[i].data;
             break;
diff --git a/target-i386/machine.c b/target-i386/machine.c
index a8be058..df3f779 100644
--- a/target-i386/machine.c
+++ b/target-i386/machine.c
@@ -310,6 +310,24 @@ static const VMStateDescription vmstate_fpop_ip_dp = {
     }
 };
 
+static bool tsc_adjust_needed(void *opaque)
+{
+    CPUX86State *env = opaque;
+
+    return env->tsc_adjust != 0;
+}
+
+static const VMStateDescription vmstate_msr_tsc_adjust = {
+    .name = "cpu/msr_tsc_adjust",
+    .version_id = 1,
+    .minimum_version_id = 1,
+    .minimum_version_id_old = 1,
+    .fields = (VMStateField []) {
+        VMSTATE_UINT64(tsc_adjust, CPUX86State),
+        VMSTATE_END_OF_LIST()
+    }
+};
+
 static bool tscdeadline_needed(void *opaque)
 {
     CPUX86State *env = opaque;
@@ -457,6 +475,9 @@ static const VMStateDescription vmstate_cpu = {
             .vmsd = &vmstate_fpop_ip_dp,
             .needed = fpop_ip_dp_needed,
         }, {
+            .vmsd = &vmstate_msr_tsc_adjust,
+            .needed = tsc_adjust_needed,
+        }, {
             .vmsd = &vmstate_msr_tscdeadline,
             .needed = tscdeadline_needed,
         }, {
-- 
1.8.0.rc0