[매일경제신문사 (Maeil Business Newspaper)] Virus detection warning

2012-11-26 Thread Administrator
From: kvm@vger.kernel.org
To: SEMIANGEL
Subject: Returned mail: Data format error
Attached file: attachment.zip (attachment.exe)
Scan result: attachment.zip/attachment.exe ( Win32/MyDoom.worm.M ) - compressed file (please scan
again after extracting).
attachment.zip - deleted


Hello.
This is the mail server administrator of 매일경제신문사 (Maeil Business Newspaper).
A virus was detected in the mail you sent.
If it can be disinfected, disinfection is attempted; otherwise the mail is deleted.
This message is sent automatically and cannot be replied to.

Re: [PATCH v2 5/6] x86: Enable ack interrupt on vmexit

2012-11-26 Thread Gleb Natapov
On Mon, Nov 26, 2012 at 05:44:29AM +, Zhang, Yang Z wrote:
 Avi Kivity wrote on 2012-11-25:
  On 11/25/2012 03:03 PM, Gleb Natapov wrote:
  On Sun, Nov 25, 2012 at 02:55:26PM +0200, Avi Kivity wrote:
  On 11/22/2012 05:22 PM, Gleb Natapov wrote:
  On Wed, Nov 21, 2012 at 04:09:38PM +0800, Yang Zhang wrote:
  Ack interrupt on vmexit is required by Posted Interrupt. With it,
  when an external interrupt causes a vmexit, the cpu will acknowledge the
  interrupt controller and save the interrupt's vector in the vmcs.
  
  There are several approaches to enable it. This patch uses a simple
  way: re-generate the interrupt via a self IPI.
  
  
  diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
  index 7949d21..f6ef090 100644
  --- a/arch/x86/kvm/vmx.c
  +++ b/arch/x86/kvm/vmx.c
  @@ -2525,7 +2525,8 @@ static __init int setup_vmcs_config(struct
  vmcs_config *vmcs_conf)
   #ifdef CONFIG_X86_64
  min |= VM_EXIT_HOST_ADDR_SPACE_SIZE;
   #endif
  -   opt = VM_EXIT_SAVE_IA32_PAT | VM_EXIT_LOAD_IA32_PAT;
  +   opt = VM_EXIT_SAVE_IA32_PAT | VM_EXIT_LOAD_IA32_PAT |
  +   VM_EXIT_ACK_INTR_ON_EXIT;
  Always? Do it only if posted interrupts are actually available
  and going to be used.
  
  Why not always?  Better to have a single code path for host interrupts
  (and as Yang notes, the new path is faster as well).
  
  Is it? The current path is:
  
  vm exit -> KVM vmexit handler (interrupt disabled) -> KVM re-enable
  interrupt -> cpu ack the interrupt and interrupt deliver through the
  host IDT.
  
  The proposed path is:
  
  CPU acks interrupt -> vm exit -> KVM vmexit handler (interrupt disabled)
  -> eoi -> self IPI -> KVM re-enable interrupt -> cpu ack the interrupt
  and interrupt deliver through the host IDT.
  
  Am I missing something?
  
  Yes, you're missing the part where I didn't write that the new path
  should avoid the IDT and dispatch the interrupt directly, by emulating
  an interrupt frame directly.  Can be as simple as pushf; push cs; call
  interrupt_table[vector * 8].  Of course we need to verify that no
  interrupt uses the IST or a task gate.
 
 How can we call interrupt table directly? I don't think we can expose the 
 idt_table to a module.
No, but we can add a function to entry_(64|32).S that dispatches via
idt_table and expose it. Avi's idea is worth exploring before going the
self-IPI way.

 Anyway, to simplify the implementation, I will follow Gleb's suggestion: only 
 enable ack intr on exit when PI is enabled, and a self IPI should be enough. 
 Any comments?
 
 Best regards,
 Yang
 

--
Gleb.


KVM call agenda for 2012-11-27

2012-11-26 Thread Juan Quintela

Hi

Please send in any agenda topics you are interested in.

Later, Juan.


Re: [PATCH v2 6/6] x86, apicv: Add Posted Interrupt supporting

2012-11-26 Thread Gleb Natapov
On Mon, Nov 26, 2012 at 03:51:04AM +, Zhang, Yang Z wrote:
 Gleb Natapov wrote on 2012-11-25:
  On Wed, Nov 21, 2012 at 04:09:39PM +0800, Yang Zhang wrote:
  Posted Interrupt allows vAPICV interrupts to be injected into the guest directly
  without any vmexit.
  
  - When delivering an interrupt to the guest, if the target vcpu is running,
    update the Posted-interrupt requests bitmap and send a notification event
    to the vcpu. Then the vcpu will handle this interrupt automatically,
    without any software involvement.
  Looks like you are allocating one irq vector per vcpu per pcpu and then
  migrating it or reallocating it when a vcpu moves from one pcpu to another.
  This is not scalable, and migrating the irq slows things down.
  What's wrong with allocating one global vector for posted interrupts
  during vmx initialization and using it for all vcpus?
 
 Consider the following situation: 
 If vcpu A is running when a notification event which belongs to vcpu B 
 arrives, since the vector matches vcpu A's notification vector, the 
 event will be consumed by vcpu A (even though it does nothing with it) and the 
 interrupt cannot be handled in time.
The exact same situation is possible with your code. vcpu B can be
migrated from a pcpu and vcpu A will take its place and will be assigned
the same vector as vcpu B. But I fail to see why this is a problem. vcpu
A will ignore the PI since pir will be empty, and vcpu B should detect the
new event during the next vmentry. 

 
  - If the target vcpu is not running or there is already a notification event
    pending in the vcpu, do nothing. The interrupt will be handled the old
    way.
  Signed-off-by: Yang Zhang yang.z.zh...@intel.com
  ---
   arch/x86/include/asm/kvm_host.h |    3 +
   arch/x86/include/asm/vmx.h      |    4 +
   arch/x86/kernel/apic/io_apic.c  |  138
   arch/x86/kvm/lapic.c            |   31 ++-
   arch/x86/kvm/lapic.h            |    8 ++
   arch/x86/kvm/vmx.c              |  192 +--
   arch/x86/kvm/x86.c              |    2 +
   include/linux/kvm_host.h        |    1 +
   virt/kvm/kvm_main.c             |    2 +
   9 files changed, 372 insertions(+), 9 deletions(-)
  diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
  index 8e07a86..1145894 100644
  --- a/arch/x86/include/asm/kvm_host.h
  +++ b/arch/x86/include/asm/kvm_host.h
  @@ -683,9 +683,12 @@ struct kvm_x86_ops {
   	void (*enable_irq_window)(struct kvm_vcpu *vcpu);
   	void (*update_cr8_intercept)(struct kvm_vcpu *vcpu, int tpr, int irr);
   	int (*has_virtual_interrupt_delivery)(struct kvm_vcpu *vcpu);
  +	int (*has_posted_interrupt)(struct kvm_vcpu *vcpu);
   	void (*update_irq)(struct kvm_vcpu *vcpu);
   	void (*set_eoi_exitmap)(struct kvm_vcpu *vcpu, int vector,
   				int need_eoi, int global);
  +	int (*send_nv)(struct kvm_vcpu *vcpu, int vector);
  +	void (*pi_migrate)(struct kvm_vcpu *vcpu);
   	int (*set_tss_addr)(struct kvm *kvm, unsigned int addr);
   	int (*get_tdp_level)(void);
   	u64 (*get_mt_mask)(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);
  diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
  index 1003341..7b9e1d0 100644
  --- a/arch/x86/include/asm/vmx.h
  +++ b/arch/x86/include/asm/vmx.h
  @@ -152,6 +152,7 @@
   #define PIN_BASED_EXT_INTR_MASK 0x0001
   #define PIN_BASED_NMI_EXITING   0x0008
   #define PIN_BASED_VIRTUAL_NMIS  0x0020
  +#define PIN_BASED_POSTED_INTR   0x0080
  
   #define VM_EXIT_SAVE_DEBUG_CONTROLS             0x0002
   #define VM_EXIT_HOST_ADDR_SPACE_SIZE            0x0200
  @@ -174,6 +175,7 @@
   /* VMCS Encodings */
   enum vmcs_field {
   	VIRTUAL_PROCESSOR_ID            = 0x00000000,
  +	POSTED_INTR_NV                  = 0x0002,
   	GUEST_ES_SELECTOR               = 0x0800,
   	GUEST_CS_SELECTOR               = 0x0802,
   	GUEST_SS_SELECTOR               = 0x0804,
  @@ -208,6 +210,8 @@ enum vmcs_field {
   	VIRTUAL_APIC_PAGE_ADDR_HIGH     = 0x2013,
   	APIC_ACCESS_ADDR                = 0x2014,
   	APIC_ACCESS_ADDR_HIGH           = 0x2015,
  +	POSTED_INTR_DESC_ADDR           = 0x2016,
  +	POSTED_INTR_DESC_ADDR_HIGH      = 0x2017,
   	EPT_POINTER                     = 0x201a,
   	EPT_POINTER_HIGH                = 0x201b,
   	EOI_EXIT_BITMAP0                = 0x201c,
  diff --git a/arch/x86/kernel/apic/io_apic.c 
  b/arch/x86/kernel/apic/io_apic.c
  index 1817fa9..97cb8ee 100644
  --- a/arch/x86/kernel/apic/io_apic.c
  +++ b/arch/x86/kernel/apic/io_apic.c
  @@ -3277,6 +3277,144 @@ int arch_setup_dmar_msi(unsigned int irq)
   }
   #endif
  +static int
  +pi_set_affinity(struct irq_data *data, const struct cpumask *mask,
  +bool force)
  +{
  +  unsigned int dest;
  +	struct irq_cfg *cfg = (struct irq_cfg *)data->chip_data;
  +	if (cpumask_equal(cfg->domain, mask))
  +  return 

Re: qemu-kvm-1.2.0: double free or corruption in VNC code

2012-11-26 Thread Stefan Hajnoczi
On Fri, Nov 23, 2012 at 08:24:32PM +0100, Nikola Ciprich wrote:
  Please also post the exact package version you are using - the line
  numbers change between releases and depend on which patches have been
  applied to the source tree.  The distro exact package version allows me
  to download the source tree that was used to build this binary and check
  the correct line numbers.
 
 Hello Stefan,
 
 it's based on the fedora rawhide pkg 2:1.2.0-16 with a few minor tweaks to compile
 on centos6.
 I've uploaded sources used for build here:
 
 http://nik.lbox.cz/download/qemu-kvm-1.2.0.tar.bz2 (after make clean)
 
 or
 
 http://nik.lbox.cz/download/qemu-1.2.0-lb6.01.src.rpm 
 
 will this help?

Thanks, I looked at the backtrace in the source tree.  Unfortunately the
root cause is not obvious to me.  I was looking for a double-free of the
zrle buffers.

If this bug repeatedly bites you, try a different VNC encoding as a
workaround (not ZRLE).

Perhaps someone more familiar with the VNC code will be able to see it.
All the information you have provided is helpful.

Stefan


Re: KVM Disk i/o or VM activities causes soft lockup?

2012-11-26 Thread Stefan Hajnoczi
On Fri, Nov 23, 2012 at 10:34:16AM -0800, Vincent Li wrote:
 On Thu, Nov 22, 2012 at 11:29 PM, Stefan Hajnoczi stefa...@gmail.com wrote:
  On Wed, Nov 21, 2012 at 03:36:50PM -0800, Vincent Li wrote:
  We have users running on redhat based distro (Kernel
  2.6.32-131.21.1.el6.x86_64 ) with kvm, when customer made cron job
  script to copy large files between kvm guest or some other user space
  program leads to disk i/o or VM activities, users get following soft
  lockup message from console:
 
  Nov 17 13:44:46 slot1/luipaard100a err kernel: BUG: soft lockup -
  CPU#4 stuck for 61s! [qemu-kvm:6795]
  Nov 17 13:44:46 slot1/luipaard100a warning kernel: Modules linked in:
  ebt_vlan nls_utf8 isofs ebtable_filter ebtables 8021q garp bridge stp
  llc ipt_REJECT iptable_filter xt_NOTRACK nf_conntrack iptable_raw
  ip_tables loop ext2 binfmt_misc hed womdict(U) vnic(U) parport_pc lp
  parport predis(U) lasthop(U) ipv6 toggler vhost_net tun kvm_intel kvm
  jiffies(U) sysstats hrsleep i2c_dev datastor(U) linux_user_bde(P)(U)
  linux_kernel_bde(P)(U) tg3 libphy serio_raw i2c_i801 i2c_core ehci_hcd
  raid1 raid0 virtio_pci virtio_blk virtio virtio_ring mvsas libsas
  scsi_transport_sas mptspi mptscsih mptbase scsi_transport_spi 3w_9xxx
  sata_svw(U) ahci serverworks sata_sil ata_piix libata sd_mod
  crc_t10dif amd74xx piix ide_gd_mod ide_core dm_snapshot dm_mirror
  dm_region_hash dm_log dm_mod ext3 jbd mbcache
  Nov 17 13:44:46 slot1/luipaard100a warning kernel: Pid: 6795, comm:
  qemu-kvm Tainted: P   
  2.6.32-131.21.1.el6.f5.x86_64 #1
  Nov 17 13:44:46 slot1/luipaard100a warning kernel: Call Trace:
  Nov 17 13:44:46 slot1/luipaard100a warning kernel: IRQ
  [81084f95] ? get_timestamp+0x9/0xf
  Nov 17 13:44:46 slot1/luipaard100a warning kernel:
  [810855d6] ? watchdog_timer_fn+0x130/0x178
  Nov 17 13:44:46 slot1/luipaard100a warning kernel:
  [81059f11] ? __run_hrtimer+0xa3/0xff
  Nov 17 13:44:46 slot1/luipaard100a warning kernel:
  [8105a188] ? hrtimer_interrupt+0xe6/0x190
  Nov 17 13:44:46 slot1/luipaard100a warning kernel:
  [8105a14b] ? hrtimer_interrupt+0xa9/0x190
  Nov 17 13:44:46 slot1/luipaard100a warning kernel:
  [8101e5a9] ? hpet_interrupt_handler+0x26/0x2d
  Nov 17 13:44:46 slot1/luipaard100a warning kernel:
  [8105a26f] ? hrtimer_peek_ahead_timers+0x9/0xd
  Nov 17 13:44:46 slot1/luipaard100a warning kernel:
  [81044fcc] ? __do_softirq+0xc5/0x17a
  Nov 17 13:44:46 slot1/luipaard100a warning kernel:
  [81003adc] ? call_softirq+0x1c/0x28
  Nov 17 13:44:46 slot1/luipaard100a warning kernel:
  [8100506b] ? do_softirq+0x31/0x66
  Nov 17 13:44:46 slot1/luipaard100a warning kernel:
  [81003673] ? call_function_interrupt+0x13/0x20
  Nov 17 13:44:46 slot1/luipaard100a warning kernel: EOI
  [a0219986] ? vmx_get_msr+0x0/0x123 [kvm_intel]
  Nov 17 13:44:46 slot1/luipaard100a warning kernel:
  [a01d11c0] ? kvm_arch_vcpu_ioctl_run+0x80e/0xaf1 [kvm]
  Nov 17 13:44:46 slot1/luipaard100a warning kernel:
  [a01d11b4] ? kvm_arch_vcpu_ioctl_run+0x802/0xaf1 [kvm]
  Nov 17 13:44:46 slot1/luipaard100a warning kernel:
  [8114e59b] ? inode_has_perm+0x65/0x72
  Nov 17 13:44:46 slot1/luipaard100a warning kernel:
  [a01c77f5] ? kvm_vcpu_ioctl+0xf2/0x5ba [kvm]
  Nov 17 13:44:46 slot1/luipaard100a warning kernel:
  [8114e642] ? file_has_perm+0x9a/0xac
  Nov 17 13:44:46 slot1/luipaard100a warning kernel:
  [810f9ec2] ? vfs_ioctl+0x21/0x6b
  Nov 17 13:44:46 slot1/luipaard100a warning kernel:
  [810fa406] ? do_vfs_ioctl+0x487/0x4da
  Nov 17 13:44:46 slot1/luipaard100a warning kernel:
  [810fa4aa] ? sys_ioctl+0x51/0x70
  Nov 17 13:44:46 slot1/luipaard100a warning kernel:
  [810029d1] ? system_call_fastpath+0x3c/0x41
 
  This soft lockup is report on the host?
 
  Stefan
 
 Yes, it is on the host. We just recommend that users avoid large file
 copying; we are just wondering if there is a potential kernel bug. It seems the
 softlockup backtrace points to hrtimer and softirq. My naive
 understanding is that the watchdog thread is on top of hrtimer, which is on
 top of softirq.

Since the soft lockup detector is firing on the host, this seems like a
hardware/driver problem.  Have you ever had soft lockups running non-KVM
workloads on this host?

Stefan


Re: Re: Re: Re: Re: [RFC PATCH 0/2] kvm/vmx: Output TSC offset

2012-11-26 Thread Yoshihiro YUNOMAE

Hi Marcelo,

(2012/11/24 7:46), Marcelo Tosatti wrote:

On Thu, Nov 22, 2012 at 02:21:20PM +0900, Yoshihiro YUNOMAE wrote:

Hi Marcelo,

(2012/11/21 7:51), Marcelo Tosatti wrote:

On Tue, Nov 20, 2012 at 07:36:33PM +0900, Yoshihiro YUNOMAE wrote:

Hi Marcelo,

Sorry for the late reply.

(2012/11/17 4:15), Marcelo Tosatti wrote:

On Wed, Nov 14, 2012 at 05:26:10PM +0900, Yoshihiro YUNOMAE wrote:

Thank you for commenting on my patch set.

(2012/11/14 11:31), Steven Rostedt wrote:

On Tue, 2012-11-13 at 18:03 -0800, David Sharp wrote:

On Tue, Nov 13, 2012 at 6:00 PM, Steven Rostedt rost...@goodmis.org wrote:

On Wed, 2012-11-14 at 10:36 +0900, Yoshihiro YUNOMAE wrote:


To merge the data like previous pattern, we apply this patch set. Then, we can
get TSC offset of the guest as follows:

$ dmesg | grep kvm
[   57.717180] kvm: (2687) write TSC offset 18446743360465545001, now clock ##
                (2687): PID    18446743360465545001: TSC offset    ##: host TSC value



Using printk to export something like this is IMO a nasty hack.

Can't we create a /sys or /proc file to export the same thing?


Since the value changes over the course of the trace, and seems to be
part of the context of the trace, I think I'd include it as a
tracepoint.



I'm fine with that too.


Using a tracepoint is a nice idea, but there is one problem. Here,
our discussion point is that the event in which the TSC offset is changed does
not occur frequently, but the buffer must keep the event data.

There are two ideas for using a tracepoint. First, we define a new
tracepoint for the changed TSC offset. This is simple and the overhead will
be low. However, this trace event stored in the buffer will be
overwritten by other trace events, because the TSC offset event does
not occur frequently. Second, we add TSC offset information to a
tracepoint that occurs frequently. For example, assume that TSC offset
information is added to the arguments of trace_kvm_exit().
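
A minimal sketch of the first idea (a dedicated tracepoint for TSC offset writes) could
look like the following; the event and field names are illustrative, not part of the
patch set under discussion:

/* Illustrative only: a dedicated tracepoint for TSC offset writes,
 * which would live next to the existing kvm events in arch/x86/kvm/trace.h. */
TRACE_EVENT(kvm_write_tsc_offset,
	TP_PROTO(unsigned int vcpu_id, __u64 previous_offset, __u64 next_offset),
	TP_ARGS(vcpu_id, previous_offset, next_offset),

	TP_STRUCT__entry(
		__field(unsigned int, vcpu_id)
		__field(__u64, previous_offset)
		__field(__u64, next_offset)
	),

	TP_fast_assign(
		__entry->vcpu_id         = vcpu_id;
		__entry->previous_offset = previous_offset;
		__entry->next_offset     = next_offset;
	),

	TP_printk("vcpu=%u previous=%llu next=%llu",
		  __entry->vcpu_id, __entry->previous_offset,
		  __entry->next_offset)
);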


The TSC offset is in the host trace. So given a host trace with two TSC
offset updates, how do you know which events in the guest trace
(containing a number of events) refer to which tsc offset update?

Unless I am missing something, you can't solve this easily (well, except by
exporting information to the guest that allows it to transform RDTSC ->
host TSC value, which can be done via pvclock).


As you say, TSC offset events are in the host trace, but we don't need
to notify guests of updating TSC offset. The offset event will output
the next TSC offset value and the current TSC value, so we can
calculate the guest TSC (T1) for the event. Guest TSCs since T1 can be
converted to host TSC using the TSC offset, so we can integrate those
trace data.


Think of this scenario:

host trace
1h. event tsc write tsc_offset=-1000
3h. vmenter
4h. vmexit
... (event sequence)
99h. vmexit
100h. event tsc_write tsc_offset=-2000
101h. vmenter
... (event sequence).
500h. event tsc_write tsc_offset=-3000

Then a guest trace containing events with a TSC timestamp.
Which tsc_offset to use?

(that is the problem, which unless I am mistaken can only be solved
easily if the guest can convert RDTSC -> TSC of host).


There are three following cases of changing TSC offset:
  1. Reset TSC at guest boot time
  2. Adjust TSC offset due to some host's problems
  3. Write TSC on guests
The scenario which you mentioned is case 3, so we'll discuss this case.
Here, we assume that a guest is allocated single CPU for the sake of
ease.

If a guest executes write_tsc, TSC values jumps to forward or backward.
For the forward case, trace data are as follows:

host                            guest
cycles   events                 cycles   events
  3000   tsc_offset=-2950
  3001   kvm_enter
                                    53   eventX
                                   100   (write_tsc=+900)
  3060   kvm_exit
  3075   tsc_offset=-2050
  3080   kvm_enter
                                  1050   event1
                                  1055   event2
  ...


This case is simple. The guest TSC of the first kvm_enter is calculated
as follows:

   (host TSC of kvm_enter) + (current tsc_offset) = 3001 - 2950 = 51

Similarly, the guest TSC of the second kvm_enter is 130. So, the guest
events between 51 and 130, that is, 53 eventX is inserted between the
first pair of kvm_enter and kvm_exit. To insert events of the guests
between 51 and 130, we convert the guest TSC to the host TSC using TSC
offset 2950.
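
A minimal sketch of this arithmetic, using the hypothetical numbers from the trace
above (the helper names are made up for the example):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	/* Values from the example trace above. */
	int64_t tsc_offset       = -2950;	/* guest_tsc = host_tsc + tsc_offset */
	uint64_t host_kvm_enter  = 3001;
	uint64_t guest_eventX    = 53;

	/* Guest TSC at the first kvm_enter: 3001 - 2950 = 51. */
	uint64_t guest_at_enter  = host_kvm_enter + tsc_offset;

	/* Converting a guest timestamp back to host TSC uses the same offset
	 * in the other direction: 53 + 2950 = 3003, which falls between the
	 * first kvm_enter (3001) and kvm_exit (3060). */
	uint64_t host_of_eventX  = guest_eventX - tsc_offset;

	printf("guest TSC at kvm_enter: %llu\n",
	       (unsigned long long)guest_at_enter);
	printf("host TSC of eventX: %llu\n",
	       (unsigned long long)host_of_eventX);
	return 0;
}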

For the backward case, trace data are as follows:

host                            guest
cycles   events                 cycles   events
  3000   tsc_offset=-2950
  3001   kvm_enter
                                    53   eventX
                                   100   (write_tsc=-50)
  3060   kvm_exit
  

Re: Invoking guest os script, without guest having network connectivity?

2012-11-26 Thread Stefan Hajnoczi
On Sat, Nov 24, 2012 at 06:40:39PM +0200, Shlomi Tsadok wrote:
 I'm looking for a way to configure the guest networking(including IP)
 dynamically, using a custom script, right after VM creation.
 
 Is there a similar feature in KVM/Libvirt as the Invoke-VMScript in of
 VMware's PowerCLI?
 It allows you to run a script in the guest OS, even before the guest
 has networking connectivity (the host talks to the vmtools agent
 that's  installed in the guest).

The QEMU guest agent (qemu-ga) has features that may allow you to do
this.  I'm not familiar enough with it to give details, here are some
alternatives:

If you provide a kernel + initramfs externally (outside the guest disk
image) you can add files to the initramfs.  This allows you to customize
boot up.

Alternatively you can use PXE booting to achieve the same thing.

Finally, you could edit the disk image using libguestfs or qemu-nbd
before booting it for the first time.  This gives you a chance to
customize configuration and startup files.

Stefan


[Bug 50921] kvm hangs booting Windows 2000

2012-11-26 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=50921


Alan a...@lxorguk.ukuu.org.uk changed:

   What|Removed |Added

 CC||a...@lxorguk.ukuu.org.uk




--- Comment #12 from Alan a...@lxorguk.ukuu.org.uk  2012-11-26 12:09:34 ---
vboxpci 22709 0 - Live 0xf89bb000 (O)
vboxnetadp 25431 0 - Live 0xf8aa6000 (O)
vboxnetflt 22987 0 - Live 0xf8aae000 (O)
vboxdrv 227471 3 vboxpci,vboxnetadp,vboxnetflt, Live 0xf91d4000 (O)


Shouldn't be interfering, but it is probably a good idea to test without them.

-- 
Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email
--- You are receiving this mail because: ---
You are watching the assignee of the bug.


[PATCH V3 RFC 0/2] kvm: Improving undercommit scenarios

2012-11-26 Thread Raghavendra K T
 In some special scenarios like #vcpu <= #pcpu, the PLE handler may
prove very costly, because there is no need to iterate over vcpus
and do unsuccessful yield_to calls burning CPU.

 The first patch optimizes all the yield_to by bailing out when there
 is no need to continue in yield_to (i.e., when there is only one task 
 in source and target rq).

 The second patch uses that in the PLE handler. Further, when a yield_to fails
 we do not immediately go out of the PLE handler; instead we try thrice 
 to have a better statistical possibility of a false return. Otherwise that
 would affect moderate overcommit cases.
 
 Result on 3.7.0-rc6 kernel shows around 140% improvement for ebizzy 1x and
 around 51% for dbench 1x  with 32 core PLE machine with 32 vcpu guest.


base = 3.7.0-rc6 
machine: 32 core mx3850 x5 PLE mc

----------+------------+------------+------------+------------+-----------+
               ebizzy (rec/sec, higher is better)
----------+------------+------------+------------+------------+-----------+
              base         stdev        patched      stdev       %improve
----------+------------+------------+------------+------------+-----------+
1x         2511.3000      21.5409     6051.8000     170.2592    140.98276
2x         2679.4000     332.4482     2692.3000     251.4005      0.48145
3x         2253.5000     266.4243     2192.1667     178.9753     -2.72169
4x         1784.3750     102.2699     2018.7500     187.5723     13.13485
----------+------------+------------+------------+------------+-----------+

----------+------------+------------+------------+------------+-----------+
               dbench (throughput in MB/sec, higher is better)
----------+------------+------------+------------+------------+-----------+
              base         stdev        patched      stdev       %improve
----------+------------+------------+------------+------------+-----------+
1x         6677.4080     638.5048    10098.0060    3449.7026     51.22643
2x         2012.6760      64.7642     2019.0440      62.6702      0.31639
3x         1302.0783      40.8336     1292.7517      27.0515     -0.71629
4x         3043.1725    3243.7281     4664.4662    5946.5741     53.27643
----------+------------+------------+------------+------------+-----------+

Here is the reference result with no PLE:
 ebizzy-1x_nople 7592.6000 rec/sec
 dbench_1x_nople 7853.6960 MB/sec

The result says we can still improve by 60% for ebizzy, but overall we are
getting impressive performance with the patches.

 Changes Since V2:
 - Dropped global measures usage patch (Peter Zijlstra)
 - Do not bail out on first failure (Avi Kivity)
 - Try thrice for the failure of yield_to to get statistically more correct
   behaviour.

 Changes since V1:
 - Discard the idea of exporting nrrunning and optimize in core scheduler 
(Peter)
 - Use yield() instead of schedule in overcommit scenarios (Rik)
 - Use loadavg knowledge to detect undercommit/overcommit

 Peter Zijlstra (1):
  Bail out of yield_to when source and target runqueue has one task

 Raghavendra K T (1):
  Handle yield_to failure return for potential undercommit case

 Please let me know your comments and suggestions.

 Link for V2:
 https://lkml.org/lkml/2012/10/29/287

 Link for V1:
 https://lkml.org/lkml/2012/9/21/168

 kernel/sched/core.c | 25 +++--
 virt/kvm/kvm_main.c | 26 --
 2 files changed, 35 insertions(+), 16 deletions(-)



[PATCH V3 RFC 1/2] sched: Bail out of yield_to when source and target runqueue has one task

2012-11-26 Thread Raghavendra K T
From: Peter Zijlstra pet...@infradead.org

In case of undercommitted scenarios, especially in large guests,
yield_to overhead is significantly high. When the run queue length of
source and target is one, take the opportunity to bail out and return
-ESRCH. This return condition can be further exploited to quickly come
out of the PLE handler.

(History: Raghavendra initially worked on breaking out of the kvm PLE handler upon
 seeing source runqueue length = 1, but it had to export the rq length.)
 Peter came up with the elegant idea of returning -ESRCH in the scheduler core.

Signed-off-by: Peter Zijlstra pet...@infradead.org
Raghavendra, Checking the rq length of target vcpu condition added.(thanks Avi)
Reviewed-by: Srikar Dronamraju sri...@linux.vnet.ibm.com
Signed-off-by: Raghavendra K T raghavendra...@linux.vnet.ibm.com
---

 kernel/sched/core.c |   25 +++--
 1 file changed, 19 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2d8927f..fc219a5 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4289,7 +4289,10 @@ EXPORT_SYMBOL(yield);
  * It's the caller's job to ensure that the target task struct
  * can't go away on us before we can do any checks.
  *
- * Returns true if we indeed boosted the target task.
+ * Returns:
+ * true (>0) if we indeed boosted the target task.
+ * false (0) if we failed to boost the target.
+ * -ESRCH if there's no task to yield to.
  */
 bool __sched yield_to(struct task_struct *p, bool preempt)
 {
@@ -4303,6 +4306,15 @@ bool __sched yield_to(struct task_struct *p, bool 
preempt)
 
 again:
p_rq = task_rq(p);
+   /*
+* If we're the only runnable task on the rq and target rq also
+* has only one task, there's absolutely no point in yielding.
+*/
+	if (rq->nr_running == 1 && p_rq->nr_running == 1) {
+   yielded = -ESRCH;
+   goto out_irq;
+   }
+
double_rq_lock(rq, p_rq);
while (task_rq(p) != p_rq) {
double_rq_unlock(rq, p_rq);
@@ -4310,13 +4322,13 @@ again:
}
 
	if (!curr->sched_class->yield_to_task)
-   goto out;
+   goto out_unlock;
 
	if (curr->sched_class != p->sched_class)
-   goto out;
+   goto out_unlock;
 
	if (task_running(p_rq, p) || p->state)
-   goto out;
+   goto out_unlock;
 
	yielded = curr->sched_class->yield_to_task(rq, p, preempt);
if (yielded) {
@@ -4329,11 +4341,12 @@ again:
		resched_task(p_rq->curr);
}
 
-out:
+out_unlock:
double_rq_unlock(rq, p_rq);
+out_irq:
local_irq_restore(flags);
 
-   if (yielded)
+	if (yielded > 0)
schedule();
 
return yielded;



[PATCH V3 RFC 2/2] kvm: Handle yield_to failure return code for potential undercommit case

2012-11-26 Thread Raghavendra K T
From: Raghavendra K T raghavendra...@linux.vnet.ibm.com

yield_to returns -ESRCH when the run queue length of both the source and
target of yield_to is one. When we see three successive failures of
yield_to we assume we are in a potential undercommit case and abort
from the PLE handler.
The assumption is backed by the low probability of a wrong decision
even for worst case scenarios such as an average runqueue length
between 1 and 2.

Note that we do not update the last boosted vcpu in failure cases.
Thanks to Avi for raising the question of aborting after the first failure of yield_to.

Reviewed-by: Srikar Dronamraju sri...@linux.vnet.ibm.com
Signed-off-by: Raghavendra K T raghavendra...@linux.vnet.ibm.com
---
 virt/kvm/kvm_main.c |   26 --
 1 file changed, 16 insertions(+), 10 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index be70035..053f494 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1639,6 +1639,7 @@ bool kvm_vcpu_yield_to(struct kvm_vcpu *target)
 {
struct pid *pid;
struct task_struct *task = NULL;
+   bool ret = false;
 
rcu_read_lock();
	pid = rcu_dereference(target->pid);
@@ -1646,17 +1647,15 @@ bool kvm_vcpu_yield_to(struct kvm_vcpu *target)
	task = get_pid_task(target->pid, PIDTYPE_PID);
rcu_read_unlock();
if (!task)
-   return false;
+   return ret;
	if (task->flags & PF_VCPU) {
put_task_struct(task);
-   return false;
-   }
-   if (yield_to(task, 1)) {
-   put_task_struct(task);
-   return true;
+   return ret;
}
+   ret = yield_to(task, 1);
put_task_struct(task);
-   return false;
+
+   return ret;
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_yield_to);
 
@@ -1697,12 +1696,14 @@ bool kvm_vcpu_eligible_for_directed_yield(struct 
kvm_vcpu *vcpu)
return eligible;
 }
 #endif
+
 void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 {
	struct kvm *kvm = me->kvm;
struct kvm_vcpu *vcpu;
	int last_boosted_vcpu = me->kvm->last_boosted_vcpu;
int yielded = 0;
+   int try = 3;
int pass;
int i;
 
@@ -1714,7 +1715,7 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 * VCPU is holding the lock that we need and will release it.
 * We approximate round-robin by starting at the last boosted VCPU.
 */
-	for (pass = 0; pass < 2 && !yielded; pass++) {
+	for (pass = 0; pass < 2 && !yielded && try; pass++) {
kvm_for_each_vcpu(i, vcpu, kvm) {
		if (!pass && i <= last_boosted_vcpu) {
i = last_boosted_vcpu;
@@ -1727,10 +1728,15 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
continue;
if (!kvm_vcpu_eligible_for_directed_yield(vcpu))
continue;
-   if (kvm_vcpu_yield_to(vcpu)) {
+
+   yielded = kvm_vcpu_yield_to(vcpu);
+		if (yielded > 0) {
			kvm->last_boosted_vcpu = i;
-   yielded = 1;
break;
+		} else if (yielded < 0) {
+   try--;
+   if (!try)
+   break;
}
}
}



RE: [PATCH v2 6/6] x86, apicv: Add Posted Interrupt supporting

2012-11-26 Thread Zhang, Yang Z
Gleb Natapov wrote on 2012-11-26:
 On Mon, Nov 26, 2012 at 03:51:04AM +, Zhang, Yang Z wrote:
 Gleb Natapov wrote on 2012-11-25:
 On Wed, Nov 21, 2012 at 04:09:39PM +0800, Yang Zhang wrote:
 Posted Interrupt allows vAPICV interrupts to inject into guest directly
 without any vmexit.
 
 - When delivering a interrupt to guest, if target vcpu is running,
   update Posted-interrupt requests bitmap and send a notification event
   to the vcpu. Then the vcpu will handle this interrupt automatically,
   without any software involvemnt.
 Looks like you allocating one irq vector per vcpu per pcpu and then
 migrate it or reallocate when vcpu move from one pcpu to another.
 This is not scalable and migrating irq migration slows things down.
 What's wrong with allocating one global vector for posted interrupt
 during vmx initialization and use it for all vcpus?
 
 Consider the following situation:
 If vcpu A is running when notification event which belong to vcpu B is 
 arrived,
 since the vector match the vcpu A's notification vector, then this event
 will be consumed by vcpu A(even it do nothing) and the interrupt cannot
 be handled in time. The exact same situation is possible with your code.
 vcpu B can be migrated from pcpu and vcpu A will take its place and will
 be assigned the same vector as vcpu B. But I fail to see why is this a
No, the on bit will be set to prevent the notification event when vcpu B starts 
migration. And it only frees the vector right before it starts running on another pcpu. 

 problem. vcpu A will ignore PI since pir will be empty and vcpu B should
 detect new event during next vmentry.
Yes, but the next vmentry may happen a long time later and the interrupt cannot be 
serviced until then. The current way will cause a vmexit and 
re-schedule vcpu B.
 
 
 
 +  if (!cfg) {
 +  free_irq_at(irq, NULL);
 +  return 0;
 +  }
 +
 +	raw_spin_lock_irqsave(&vector_lock, flags);
 +  if (!__assign_irq_vector(irq, cfg, mask))
 +  ret = irq;
 +	raw_spin_unlock_irqrestore(&vector_lock, flags);
 +
 +  if (ret) {
 +  irq_set_chip_data(irq, cfg);
 +  irq_clear_status_flags(irq, IRQ_NOREQUEST);
 +  } else {
 +  free_irq_at(irq, cfg);
 +  }
 +  return ret;
 +}
 
 This function is mostly cutpaste of create_irq_nr().
 
 Yes, this function allows allocating a vector from a specified cpu.
 
 Does not justify code duplication.
ok. will change it in next version.
 
 
	if (kvm_x86_ops->has_virtual_interrupt_delivery(vcpu))
		apic->vid_enabled = true;
 +
 +	if (kvm_x86_ops->has_posted_interrupt(vcpu))
 +		apic->pi_enabled = true;
 +
 This is global state, no need per apic variable.
Even though all vcpus use the same setting, according to the SDM apicv really is a 
per-APIC variable.
Anyway, if you think we should not put it here, where is the best place?

 @@ -1555,6 +1611,11 @@ static void vmx_vcpu_load(struct kvm_vcpu
 *vcpu,
 int cpu)
struct desc_ptr *gdt = __get_cpu_var(host_gdt);
unsigned long sysenter_esp;
 +	if (enable_apicv_pi && to_vmx(vcpu)->pi)
 +		pi_set_on(to_vmx(vcpu)->pi);
 +
 Why?
 
 This is where the vcpu starts migration. So we should prevent the
 notification event until the migration ends.
 
 You check for IN_GUEST_MODE while sending notification. Why is this not
For interrupts from emulated devices, it is enough. But a VT-d device doesn't know that the 
vcpu is migrating, so we set the on bit to suppress the notification event while the 
target vcpu is migrating.

 enough? Also why vmx_vcpu_load() call means that vcpu start migration?
I think the following check can ensure the vcpu is migrating, am I wrong?
if (vmx->loaded_vmcs->cpu != cpu) {
	if (enable_apicv_pi && to_vmx(vcpu)->pi)
		pi_set_on(to_vmx(vcpu)->pi);
}

 +  kvm_make_request(KVM_REQ_POSTED_INTR, vcpu);
 +
 	kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu);
 	local_irq_disable();
 	list_add(&vmx->loaded_vmcs->loaded_vmcss_on_cpu_link,
@@ -1582,6 +1643,8 @@ static void vmx_vcpu_put(struct kvm_vcpu *vcpu)
 		vcpu->cpu = -1;
 		kvm_cpu_vmxoff();
 	}
+	if (enable_apicv_pi && to_vmx(vcpu)->pi)
+		pi_set_on(to_vmx(vcpu)->pi);
 Why?
 
 When the vcpu is scheduled out, there is no need to send a notification event
 to it; just setting the PIR and waking it up is enough.
 Same as above. When vcpu is scheduled out it will no be in IN_GUEST_MODE
Right.

 mode. Also in this case we probably should set bit directly in IRR and leave
 PIR alone.
From the hypervisor's point of view, IRR and PIR are the same. On each vmentry, if PI is 
enabled, the effective IRR equals (IRR | PIR). So there is no difference between setting IRR or 
PIR if the target vcpu is not running.
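
As a rough, illustrative model of what "the IRR equals (IRR | PIR)" means at vmentry
(not the actual KVM code; the names are made up for the sketch):

#include <stdint.h>

#define APIC_VECTORS 256

struct vlapic_model {
	uint64_t irr[APIC_VECTORS / 64];	/* virtual-APIC IRR bits      */
	uint64_t pir[APIC_VECTORS / 64];	/* posted-interrupt requests  */
};

/* On vmentry with posted interrupts enabled, the effective IRR is IRR | PIR,
 * so setting a bit in either place looks the same to the guest once it
 * actually enters. */
static inline void sync_pir_to_irr(struct vlapic_model *apic)
{
	int i;

	for (i = 0; i < APIC_VECTORS / 64; i++) {
		apic->irr[i] |= apic->pir[i];
		apic->pir[i] = 0;
	}
}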

 
 
  }
  
  static void vmx_fpu_activate(struct kvm_vcpu *vcpu)
 @@ -2451,12 +2514,6 @@ static __init int setup_vmcs_config(struct
 vmcs_config *vmcs_conf)
u32 _vmexit_control = 0;
u32 _vmentry_control = 0;
 -  min = PIN_BASED_EXT_INTR_MASK | PIN_BASED_NMI_EXITING;
 -  opt = PIN_BASED_VIRTUAL_NMIS;
 -  if 

Re: [PATCH 4/5] KVM: PPC: Book3S HV: Don't give the guest RW access to RO pages

2012-11-26 Thread Alexander Graf

On 24.11.2012, at 10:32, Paul Mackerras wrote:

 On Sat, Nov 24, 2012 at 10:05:37AM +0100, Alexander Graf wrote:
 
 
 On 23.11.2012, at 23:13, Paul Mackerras pau...@samba.org wrote:
 
 On Fri, Nov 23, 2012 at 04:47:45PM +0100, Alexander Graf wrote:
 
 On 22.11.2012, at 10:28, Paul Mackerras wrote:
 
 Currently, if the guest does an H_PROTECT hcall requesting that the
 permissions on a HPT entry be changed to allow writing, we make the
 requested change even if the page is marked read-only in the host
 Linux page tables.  This is a problem since it would for instance
 allow a guest to modify a page that KSM has decided can be shared
 between multiple guests.
 
 To fix this, if the new permissions for the page allow writing, we need
 to look up the memslot for the page, work out the host virtual address,
 and look up the Linux page tables to get the PTE for the page.  If that
 PTE is read-only, we reduce the HPTE permissions to read-only.
 
 How does KSM handle this usually? If you reduce the permissions to R/O, 
 how do you ever get a R/W page from a deduplicated one?
 
 The scenario goes something like this:
 
 1. Guest creates an HPTE with RO permissions.
 2. KSM decides the page is identical to another page and changes the
  HPTE to point to a shared copy.  Permissions are still RO.
 3. Guest decides it wants write access to the page and does an
  H_PROTECT hcall to change the permissions on the HPTE to RW.
 
 The bug is that we actually make the requested change in step 3.
 Instead we should leave it at RO, then when the guest tries to write
 to the page, we take a hypervisor page fault, copy the page and give
 the guest write access to its own copy of the page.
 
 So what this patch does is add code to H_PROTECT so that if the guest
 is requesting RW access, we check the Linux PTE to see if the
 underlying guest page is RO, and if so reduce the permissions in the
 HPTE to RO.
 
 But this will be guest visible, because now H_PROTECT doesn't actually mark 
 the page R/W in the HTAB, right?
 
 No - the guest view of the HPTE has R/W permissions.  The guest view
 of the HPTE is made up of doubleword 0 from the real HPT plus
 rev-guest_rpte for doubleword 1 (where rev is the entry in the revmap
 array, kvm-arch.revmap, for the HPTE).  The guest view can be
 different from the host/hardware view, which is in the real HPT.  For
 instance, the guest view of a HPTE might be valid but the host view
 might be invalid because the underlying real page has been paged out -
 in that case we use a software bit which we call HPTE_V_ABSENT to
 remind ourselves that there is something valid there from the guest's
 point of view.  Or the guest view can be R/W but the host view is RO,
 as in the case where KSM has merged the page.
 
 So the flow with this patch is:
 
  - guest page permission fault
 
 This comes through the host (kvmppc_hpte_hv_fault()) which looks at
 the guest view of the HPTE, sees that it has RO permissions, and sends
 the page fault to the guest.
 
  - guest does H_PROTECT to mark page r/w
  - H_PROTECT doesn't do anything
  - guest returns from permission handler, triggers write fault
 
 This comes once again to kvmppc_hpte_hv_fault(), which sees that the
 guest view of the HPTE has R/W permissions now, and sends the page
 fault to kvmppc_book3s_hv_page_fault(), which requests write access to
 the page, possibly triggering copy-on-write or whatever, and updates
 the real HPTE to have R/W permissions and possibly point to a new page
 of memory.
 
 
 2 questions here:
 
 How does the host know that the page is actually r/w?
 
 I assume you mean RO?  It looks up the memslot for the guest physical
 address (which it gets from rev-guest_rpte), uses that to work out
 the host virtual address (i.e. the address in qemu's address space),
 looks up the Linux PTE in qemu's Linux page tables, and looks at the
 _PAGE_RW bit there.
 
 How does this work on 970? I thought page faults always go straight to the 
 guest there.
 
 They do, which is why PPC970 can't do any of this.  On PPC970 we have
 kvm-arch.using_mmu_notifiers == 0, and that makes the code pin every
 page of guest memory that is mapped by a guest HPTE (with a Linux
 guest, that means every page, because of the linear mapping).  On
 POWER7 we have kvm-arch.using_mmu_notifiers == 1, which enables
 host paging and deduplication of guest memory.

Thanks a lot for the detailed explanation! Maybe you guys should just release 
an HV capable p7 system publicly, so we can deprecate 970 support. That would 
make a few things quite a bit easier ;)

Thanks, applied to kvm-ppc-next.

Alex



Re: [PATCH 3/5] KVM: PPC: Book3S HV: Improve handling of local vs. global TLB invalidations

2012-11-26 Thread Alexander Graf

On 23.11.2012, at 23:07, Paul Mackerras wrote:

 On Fri, Nov 23, 2012 at 04:43:03PM +0100, Alexander Graf wrote:
 
 On 22.11.2012, at 10:28, Paul Mackerras wrote:
 
 - With the possibility of the host paging out guest pages, the use of
 H_LOCAL by an SMP guest is dangerous since the guest could possibly
 retain and use a stale TLB entry pointing to a page that had been
 removed from the guest.
 
 I don't understand this part. Don't we flush the TLB when the page gets 
 evicted from the shadow HTAB?
 
 The H_LOCAL flag is something that we invented to allow the guest to
 tell the host I only ever used this translation (HPTE) on the current
 vcpu when it's removing or modifying an HPTE.  The idea is that that
 would then let the host use the tlbiel instruction (local TLB
 invalidate) rather than the usual global tlbie instruction.  Tlbiel is
 faster because it doesn't need to go out on the fabric and get
 processed by all cpus.  In fact our guests don't use it at present,
 but we put it in because we thought we should be able to get a
 performance improvement, particularly on large machines.
 
 However, the catch is that the guest's setting of H_LOCAL might be
 incorrect, in which case we could have a stale TLB entry on another
 physical cpu.  While the physical page that it refers to is still
 owned by the guest, that stale entry doesn't matter from the host's
 point of view.  But if the host wants to take that page away from the
 guest, the stale entry becomes a problem.

That's exactly where my question lies. Does that mean we don't flush the TLB 
entry regardless when we take the page away from the guest?


Alex

 
 The solution I implement here is just not to use tlbiel in SMP guests.
 UP guests are not so much of a problem because the potential attack
 from the guest relies on having one cpu remove the HPTE and do tlbiel
 while another cpu uses the stale TLB entry, which you can't do if you
 only have one cpu.
 
 Paul.



[Bug 50921] kvm hangs booting Windows 2000

2012-11-26 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=50921





--- Comment #13 from Lucio Crusca lu...@sulweb.org  2012-11-26 13:13:56 ---
@Alan: see comment #5, since then I've always tested with and without vbox
modules.

@Gleb: can't run on 3.5.0 right now, I'll take the stack trace ASAP.

-- 
Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email
--- You are receiving this mail because: ---
You are watching the assignee of the bug.


Re: [PATCH 1/5] KVM: PPC: Book3S HV: Handle guest-caused machine checks on POWER7 without panicking

2012-11-26 Thread Alexander Graf

On 23.11.2012, at 22:42, Paul Mackerras wrote:

 On Fri, Nov 23, 2012 at 03:13:09PM +0100, Alexander Graf wrote:
 
 On 22.11.2012, at 10:25, Paul Mackerras wrote:
 
 +   /* Do they have an SLB shadow buffer registered? */
 +   slb = vcpu-arch.slb_shadow.pinned_addr;
 +   if (!slb)
 +   return;
 
 Mind to explain this case? What happens here? Do we leave the guest with an 
 empty SLB? Why would this ever happen? What happens next as soon as we go 
 back into the guest?
 
 Yes, we leave the guest with an empty SLB, the access gets retried and
 this time the guest gets an SLB miss interrupt, which it can hopefully
 handle using an SLB miss handler that runs entirely in real mode.
 This could happen for instance while the guest is in SLOF or yaboot or
 some other code that runs basically in real mode but occasionally
 turns the MMU on for some accesses, and happens to have a bug where it
 creates a duplicate SLB entry.

Is this what pHyp does? Also, is this what we want? Why don't we populate an 
#MC into the guest so it knows it did something wrong?


Alex

 
 +   /* Sanity check */
  +	n = slb->persistent;
  +	if (n < SLB_MIN_SIZE)
 +   n = SLB_MIN_SIZE;
 
 Please use a max() macro here.
 
 OK.
 
 +   rb = 0x800; /* IS field = 0b10, flush congruence class */
  +	for (i = 0; i < 128; ++i) {
 
 Please put up a #define for this. POWER7_TLB_SIZE or so. Is there any way to 
 fetch that number from an SPR? I don't really want to have a p7+ and a p8 
 function in here too ;).
 
  +		asm volatile("tlbiel %0" : : "r" (rb));
 +   rb += 0x1000;
 
  I assume this also means (1 << TLBIE_ENTRY_SHIFT)? Would be nice to keep the 
 code readable without guessing :).
 
 The 0x800 and 0x1000 are taken from the architecture - it defines
 fields in the RB value for the flush type and TLB index.  The 128 is
 POWER7-specific and isn't in any SPR that I know of.  Eventually we'll
 probably have to put it (the number of TLB congruence classes) in the
 cputable, but for now I'll just do a define.
 
 So I take it that p7 does not implement tlbia?
 
 Correct.
 
 So we never return 0? How about ECC errors and the likes? Wouldn't those 
 also be #MCs that the host needs to handle?
 
 Yes, true.  In fact the OPAL firmware gets to see the machine checks
 before we do (see the opal_register_exception_handler() calls in
 arch/powerpc/platforms/powernv/opal.c), so it should have already
 handled recoverable things like L1 cache parity errors.
 
 I'll make the function return 0 if there's an error that it doesn't
 know about.
 
 ld  r8, HSTATE_VMHANDLER(r13)
 ld  r7, HSTATE_HOST_MSR(r13)
 
 cmpwi   r12, BOOK3S_INTERRUPT_EXTERNAL
 +BEGIN_FTR_SECTION
 beq 11f
 -   cmpwi   r12, BOOK3S_INTERRUPT_MACHINE_CHECK
 +END_FTR_SECTION_IFSET(CPU_FTR_ARCH_206)
 
 Mind to explain the logic that happens here? Do we get external interrupts 
 on 970? If not, the cmpwi should also be inside the FTR section. Also, if we 
 do a beq here, why do the beqctr below again?
 
 I was making it not call the host kernel machine check handler if it
 was a machine check that pulled us out of the guest.  In fact we
 probably do still want to call the handler, but we don't want to jump
 to 0x200, since that has been patched by OPAL, and we don't want to
 make OPAL think we just got another machine check.  Instead we would
 need to jump to machine_check_pSeries.
 
 The feature section is because POWER7 sets HSRR0/1 on external
 interrupts, whereas PPC970 sets SRR0/1.
 
 Paul.



Re: [PATCH 0/3] KVM: PPC: Fixes for PR-KVM on POWER7

2012-11-26 Thread Alexander Graf

On 05.11.2012, at 04:40, Paul Mackerras wrote:

 Here are some fixes for PR-style KVM.  With these I can successfully
 run a pseries (PAPR) guest under PR KVM on a POWER7.  (This is all
 running inside a HV KVM virtual machine.)  The patches are against
 Alex Graf's kvm-ppc-next branch.

Thanks a lot! Applied all to kvm-ppc-next.


Alex



Re: [PATCH V3 RFC 1/2] sched: Bail out of yield_to when source and target runqueue has one task

2012-11-26 Thread Andrew Jones
On Mon, Nov 26, 2012 at 05:37:54PM +0530, Raghavendra K T wrote:
 From: Peter Zijlstra pet...@infradead.org
 
 In case of undercomitted scenarios, especially in large guests
 yield_to overhead is significantly high. when run queue length of
 source and target is one, take an opportunity to bail out and return
 -ESRCH. This return condition can be further exploited to quickly come
 out of PLE handler.
 
 (History: Raghavendra initially worked on break out of kvm ple handler upon
  seeing source runqueue length = 1, but it had to export rq length).
  Peter came up with the elegant idea of return -ESRCH in scheduler core.
 
 Signed-off-by: Peter Zijlstra pet...@infradead.org
 Raghavendra, Checking the rq length of target vcpu condition added.(thanks 
 Avi)
 Reviewed-by: Srikar Dronamraju sri...@linux.vnet.ibm.com
 Signed-off-by: Raghavendra K T raghavendra...@linux.vnet.ibm.com
 ---
 
  kernel/sched/core.c |   25 +++--
  1 file changed, 19 insertions(+), 6 deletions(-)
 
 diff --git a/kernel/sched/core.c b/kernel/sched/core.c
 index 2d8927f..fc219a5 100644
 --- a/kernel/sched/core.c
 +++ b/kernel/sched/core.c
 @@ -4289,7 +4289,10 @@ EXPORT_SYMBOL(yield);
   * It's the caller's job to ensure that the target task struct
   * can't go away on us before we can do any checks.
   *
 - * Returns true if we indeed boosted the target task.
 + * Returns:
 + *   true (0) if we indeed boosted the target task.
 + *   false (0) if we failed to boost the target.
 + *   -ESRCH if there's no task to yield to.
   */
  bool __sched yield_to(struct task_struct *p, bool preempt)
  {
 @@ -4303,6 +4306,15 @@ bool __sched yield_to(struct task_struct *p, bool 
 preempt)
  
  again:
   p_rq = task_rq(p);
 + /*
 +  * If we're the only runnable task on the rq and target rq also
 +  * has only one task, there's absolutely no point in yielding.
 +  */
 + if (rq-nr_running == 1  p_rq-nr_running == 1) {
 + yielded = -ESRCH;
 + goto out_irq;
 + }
 +
   double_rq_lock(rq, p_rq);
   while (task_rq(p) != p_rq) {
   double_rq_unlock(rq, p_rq);
 @@ -4310,13 +4322,13 @@ again:
   }
  
   if (!curr-sched_class-yield_to_task)
 - goto out;
 + goto out_unlock;
  
   if (curr-sched_class != p-sched_class)
 - goto out;
 + goto out_unlock;
  
   if (task_running(p_rq, p) || p-state)
 - goto out;
 + goto out_unlock;
  
   yielded = curr-sched_class-yield_to_task(rq, p, preempt);
   if (yielded) {
 @@ -4329,11 +4341,12 @@ again:
   resched_task(p_rq-curr);
   }
  
 -out:
 +out_unlock:
   double_rq_unlock(rq, p_rq);
 +out_irq:
   local_irq_restore(flags);
  
 - if (yielded)
 + if (yielded  0)
   schedule();
  
   return yielded;


Acked-by: Andrew Jones drjo...@redhat.com


Re: [PATCH V3 RFC 2/2] kvm: Handle yield_to failure return code for potential undercommit case

2012-11-26 Thread Andrew Jones
On Mon, Nov 26, 2012 at 05:38:04PM +0530, Raghavendra K T wrote:
 From: Raghavendra K T raghavendra...@linux.vnet.ibm.com
 
 yield_to returns -ESRCH, When source and target of yield_to
 run queue length is one. When we see three successive failures of
 yield_to we assume we are in potential undercommit case and abort
 from PLE handler.
 The assumption is backed by low probability of wrong decision
 for even worst case scenarios such as average runqueue length
 between 1 and 2.
 
 note that we do not update last boosted vcpu in failure cases.
 Thank Avi for raising question on aborting after first fail from yield_to.
 
 Reviewed-by: Srikar Dronamraju sri...@linux.vnet.ibm.com
 Signed-off-by: Raghavendra K T raghavendra...@linux.vnet.ibm.com
 ---
  virt/kvm/kvm_main.c |   26 --
  1 file changed, 16 insertions(+), 10 deletions(-)
 
 diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
 index be70035..053f494 100644
 --- a/virt/kvm/kvm_main.c
 +++ b/virt/kvm/kvm_main.c
 @@ -1639,6 +1639,7 @@ bool kvm_vcpu_yield_to(struct kvm_vcpu *target)
  {
   struct pid *pid;
   struct task_struct *task = NULL;
 + bool ret = false;
  
   rcu_read_lock();
   pid = rcu_dereference(target-pid);
 @@ -1646,17 +1647,15 @@ bool kvm_vcpu_yield_to(struct kvm_vcpu *target)
   task = get_pid_task(target-pid, PIDTYPE_PID);
   rcu_read_unlock();
   if (!task)
 - return false;
 + return ret;
   if (task-flags  PF_VCPU) {
   put_task_struct(task);
 - return false;
 - }
 - if (yield_to(task, 1)) {
 - put_task_struct(task);
 - return true;
 + return ret;
   }
 + ret = yield_to(task, 1);
   put_task_struct(task);
 - return false;
 +
 + return ret;
  }
  EXPORT_SYMBOL_GPL(kvm_vcpu_yield_to);
  
 @@ -1697,12 +1696,14 @@ bool kvm_vcpu_eligible_for_directed_yield(struct 
 kvm_vcpu *vcpu)
   return eligible;
  }
  #endif
 +
  void kvm_vcpu_on_spin(struct kvm_vcpu *me)
  {
   struct kvm *kvm = me-kvm;
   struct kvm_vcpu *vcpu;
   int last_boosted_vcpu = me-kvm-last_boosted_vcpu;
   int yielded = 0;
 + int try = 3;
   int pass;
   int i;
  
 @@ -1714,7 +1715,7 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
* VCPU is holding the lock that we need and will release it.
* We approximate round-robin by starting at the last boosted VCPU.
*/
 - for (pass = 0; pass  2  !yielded; pass++) {
 + for (pass = 0; pass  2  !yielded  try; pass++) {
   kvm_for_each_vcpu(i, vcpu, kvm) {
   if (!pass  i = last_boosted_vcpu) {
   i = last_boosted_vcpu;
 @@ -1727,10 +1728,15 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
   continue;
   if (!kvm_vcpu_eligible_for_directed_yield(vcpu))
   continue;
 - if (kvm_vcpu_yield_to(vcpu)) {
 +
 + yielded = kvm_vcpu_yield_to(vcpu);
 + if (yielded  0) {
   kvm-last_boosted_vcpu = i;
 - yielded = 1;
   break;
 + } else if (yielded  0) {
 + try--;
 + if (!try)
 + break;
   }
   }
   }
 

The check done in patch 1/2 is done before the double_rq_lock, so it's
cheap. Now, this patch is to avoid doing too many get_pid_task calls. I
wonder if it would make more sense to change the vcpu state from tracking
the pid to tracking the task. If that was done, then I don't believe this
patch is necessary.

Rik,
for 34bb10b79de7 was there a reason pid was used instead of task?

Drew


Re: [PATCH v2 6/6] x86, apicv: Add Posted Interrupt supporting

2012-11-26 Thread Gleb Natapov
On Mon, Nov 26, 2012 at 12:29:54PM +, Zhang, Yang Z wrote:
 Gleb Natapov wrote on 2012-11-26:
  On Mon, Nov 26, 2012 at 03:51:04AM +, Zhang, Yang Z wrote:
  Gleb Natapov wrote on 2012-11-25:
  On Wed, Nov 21, 2012 at 04:09:39PM +0800, Yang Zhang wrote:
  Posted Interrupt allows vAPICV interrupts to inject into guest directly
  without any vmexit.
  
  - When delivering a interrupt to guest, if target vcpu is running,
update Posted-interrupt requests bitmap and send a notification event
to the vcpu. Then the vcpu will handle this interrupt automatically,
without any software involvemnt.
  Looks like you allocating one irq vector per vcpu per pcpu and then
  migrate it or reallocate when vcpu move from one pcpu to another.
  This is not scalable and migrating irq migration slows things down.
  What's wrong with allocating one global vector for posted interrupt
  during vmx initialization and use it for all vcpus?
  
  Consider the following situation:
  If vcpu A is running when notification event which belong to vcpu B is 
  arrived,
  since the vector match the vcpu A's notification vector, then this event
  will be consumed by vcpu A(even it do nothing) and the interrupt cannot
  be handled in time. The exact same situation is possible with your code.
  vcpu B can be migrated from pcpu and vcpu A will take its place and will
  be assigned the same vector as vcpu B. But I fail to see why is this a
 No, the on bit will be set to prevent notification event when vcpu B start 
 migration. And it only free the vector before it going to run in another 
 pcpu. 
There is a race. The sender checks the on bit, vcpu B migrates to another pcpu and
starts running there, vcpu A takes vcpu B's vector, the sender sends the PI, and vcpu
A gets it.


 
  problem. vcpu A will ignore PI since pir will be empty and vcpu B should
  detect new event during next vmentry.
 Yes, but the next vmentry may happen long time later and interrupt cannot be 
 serviced until next vmentry. In current way, it will cause vmexit and 
 re-schedule the vcpu B.
Vmentry will happen when the scheduler decides that the vcpu can run. There
is no problem here. What you probably want to say is that the vcpu may not be
aware of the interrupt happening since it was migrated to a different pcpu
just after the PI IPI was sent, and thus missed it. But then the PIR interrupts
should be processed during vmentry on another pcpu:

Sender: Guest:

set pir
set on
if (vcpu in guest mode on pcpu1)
   vmexit on pcpu1
   vmentry on pcpu2
   process pir, deliver interrupt
send PI IPI to pcpu1
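
A rough sketch of that sender-side sequence, with illustrative names rather than the
real KVM/VMX symbols:

#include <stdbool.h>
#include <stdint.h>

struct pi_desc_model {
	uint64_t pir[4];	/* 256 posted-interrupt request bits */
	uint64_t on;		/* outstanding-notification bit      */
};

enum vcpu_mode { OUTSIDE_GUEST_MODE, IN_GUEST_MODE };

/* Sender side: publish the vector in PIR, set ON, and only send the
 * notification IPI if the target vcpu is currently in guest mode.  If it
 * is not (or it migrates right after the check), the bits are simply
 * picked up when the vcpu next enters the guest. */
static void deliver_posted_interrupt(struct pi_desc_model *pi,
				     enum vcpu_mode target_mode,
				     uint8_t vector,
				     void (*send_notification_ipi)(void))
{
	pi->pir[vector / 64] |= 1ULL << (vector % 64);	/* set pir */
	pi->on = 1;					/* set on  */

	if (target_mode == IN_GUEST_MODE)
		send_notification_ipi();
}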


  
  
  
  +if (!cfg) {
  +free_irq_at(irq, NULL);
  +return 0;
  +}
  +
  +raw_spin_lock_irqsave(vector_lock, flags);
  +if (!__assign_irq_vector(irq, cfg, mask))
  +ret = irq;
  +raw_spin_unlock_irqrestore(vector_lock, flags);
  +
  +if (ret) {
  +irq_set_chip_data(irq, cfg);
  +irq_clear_status_flags(irq, IRQ_NOREQUEST);
  +} else {
  +free_irq_at(irq, cfg);
  +}
  +return ret;
  +}
  
  This function is mostly cutpaste of create_irq_nr().
  
  Yes, this function allow to allocate vector from specified cpu.
  
  Does not justify code duplication.
 ok. will change it in next version.
  
Please use single global vector in the next version.

  
   if (kvm_x86_ops-has_virtual_interrupt_delivery(vcpu))
   apic-vid_enabled = true;
  +
  +if (kvm_x86_ops-has_posted_interrupt(vcpu))
  +apic-pi_enabled = true;
  +
  This is global state, no need per apic variable.
 Even all vcpus use the same setting, but according to SDM, apicv really is a 
 per apic variable.
It is not per vapic in our implementation and this is what is
important here.

 Anyway, if you think we should not put it here, where is the best place?
It is not needed, just use has_posted_interrupt(vcpu) instead.

 
  @@ -1555,6 +1611,11 @@ static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
   	struct desc_ptr *gdt = &__get_cpu_var(host_gdt);
   	unsigned long sysenter_esp;
  +	if (enable_apicv_pi && to_vmx(vcpu)->pi)
  +		pi_set_on(to_vmx(vcpu)->pi);
  +
  Why?
  
   Here the vcpu starts migration. So we should prevent the
   notification event until the migration ends.
   
   You check for IN_GUEST_MODE while sending notification. Why is this not
  For interrupts from an emulated device, it is enough. But a VT-d device
  doesn't know the vcpu is migrating, so we set the on bit to prevent the
  notification event while the target vcpu is migrating.
Why should a VT-d device care about that? It sets bits in pir and sends an
IPI. If the vcpu is running it processes pir immediately; if not, it will do
it during the next vmentry.

 
  enough? Also why vmx_vcpu_load() call means 

Re: [PATCH] MAINTAINERS: Add git tree link for PPC KVM

2012-11-26 Thread Alexander Graf

On 16.10.2012, at 07:01, Michael Ellerman wrote:

 Signed-off-by: Michael Ellerman mich...@ellerman.id.au

Thanks, applied to kvm-ppc-next.


Alex

 ---
 MAINTAINERS |1 +
 1 file changed, 1 insertion(+)
 
 diff --git a/MAINTAINERS b/MAINTAINERS
 index e73060f..32dc107 100644
 --- a/MAINTAINERS
 +++ b/MAINTAINERS
 @@ -4244,6 +4244,7 @@ KERNEL VIRTUAL MACHINE (KVM) FOR POWERPC
 M:Alexander Graf ag...@suse.de
 L:kvm-...@vger.kernel.org
 W:http://kvm.qumranet.com
 +T:   git git://github.com/agraf/linux-2.6.git
 S:Supported
 F:arch/powerpc/include/asm/kvm*
 F:arch/powerpc/kvm/
 -- 
 1.7.9.5
 



Re: [PATCH V3 RFC 2/2] kvm: Handle yield_to failure return code for potential undercommit case

2012-11-26 Thread Andrew Jones
On Mon, Nov 26, 2012 at 02:43:02PM +0100, Andrew Jones wrote:
 On Mon, Nov 26, 2012 at 05:38:04PM +0530, Raghavendra K T wrote:
  From: Raghavendra K T raghavendra...@linux.vnet.ibm.com
  
  yield_to returns -ESRCH when the run queue length of both the source and
  the target of yield_to is one. When we see three successive failures of
  yield_to we assume we are in a potential undercommit case and abort
  from the PLE handler.
  The assumption is backed by the low probability of a wrong decision
  even for worst case scenarios such as an average runqueue length
  between 1 and 2.
  
  Note that we do not update the last boosted vcpu in failure cases.
  Thanks to Avi for raising the question on aborting after the first failure
  from yield_to.
  
  Reviewed-by: Srikar Dronamraju sri...@linux.vnet.ibm.com
  Signed-off-by: Raghavendra K T raghavendra...@linux.vnet.ibm.com
  ---
   virt/kvm/kvm_main.c |   26 --
   1 file changed, 16 insertions(+), 10 deletions(-)
  
  diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
  index be70035..053f494 100644
  --- a/virt/kvm/kvm_main.c
  +++ b/virt/kvm/kvm_main.c
  @@ -1639,6 +1639,7 @@ bool kvm_vcpu_yield_to(struct kvm_vcpu *target)
   {
  struct pid *pid;
  struct task_struct *task = NULL;
  +   bool ret = false;
   
  rcu_read_lock();
  pid = rcu_dereference(target-pid);
  @@ -1646,17 +1647,15 @@ bool kvm_vcpu_yield_to(struct kvm_vcpu *target)
  task = get_pid_task(target-pid, PIDTYPE_PID);
  rcu_read_unlock();
  if (!task)
  -   return false;
  +   return ret;
  if (task-flags  PF_VCPU) {
  put_task_struct(task);
  -   return false;
  -   }
  -   if (yield_to(task, 1)) {
  -   put_task_struct(task);
  -   return true;
  +   return ret;
  }
  +   ret = yield_to(task, 1);
  put_task_struct(task);
  -   return false;
  +
  +   return ret;
   }
   EXPORT_SYMBOL_GPL(kvm_vcpu_yield_to);
   
  @@ -1697,12 +1696,14 @@ bool kvm_vcpu_eligible_for_directed_yield(struct 
  kvm_vcpu *vcpu)
  return eligible;
   }
   #endif
  +
   void kvm_vcpu_on_spin(struct kvm_vcpu *me)
   {
  struct kvm *kvm = me-kvm;
  struct kvm_vcpu *vcpu;
  int last_boosted_vcpu = me-kvm-last_boosted_vcpu;
  int yielded = 0;
  +   int try = 3;
  int pass;
  int i;
   
  @@ -1714,7 +1715,7 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
   * VCPU is holding the lock that we need and will release it.
   * We approximate round-robin by starting at the last boosted VCPU.
   */
  -   for (pass = 0; pass  2  !yielded; pass++) {
  +   for (pass = 0; pass  2  !yielded  try; pass++) {
  kvm_for_each_vcpu(i, vcpu, kvm) {
  if (!pass  i = last_boosted_vcpu) {
  i = last_boosted_vcpu;
  @@ -1727,10 +1728,15 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
  continue;
  if (!kvm_vcpu_eligible_for_directed_yield(vcpu))
  continue;
  -   if (kvm_vcpu_yield_to(vcpu)) {
  +
  +   yielded = kvm_vcpu_yield_to(vcpu);
  +   if (yielded  0) {
  kvm-last_boosted_vcpu = i;
  -   yielded = 1;
  break;
  +   } else if (yielded  0) {
  +   try--;
  +   if (!try)
  +   break;
  }
  }
  }
  
 
 The check done in patch 1/2 is done before the double_rq_lock, so it's
 cheap. Now, this patch is to avoid doing too many get_pid_task calls. I
 wonder if it would make more sense to change the vcpu state from tracking
 the pid to tracking the task. If that was done, then I don't believe this
 patch is necessary.
 
 Rik,
 for 34bb10b79de7 was there a reason pid was used instead of task?

Nevermind, I guess there's no way to validate the task pointer without
checking the pid, since, as your git commit says, there is no guarantee
that the same task always keeps the same vcpu. We'd only know it's valid
if it's running, and if it's running, it's of no interest.

 
 Drew


Re: [PATCH] vfio powerpc: enabled and supported on powernv platform

2012-11-26 Thread Alex Williamson
On Thu, 2012-11-22 at 11:56 +, Sethi Varun-B16395 wrote:
 
  -Original Message-
  From: linux-kernel-ow...@vger.kernel.org [mailto:linux-kernel-
  ow...@vger.kernel.org] On Behalf Of Alex Williamson
  Sent: Tuesday, November 20, 2012 11:50 PM
  To: Alexey Kardashevskiy
  Cc: Benjamin Herrenschmidt; Paul Mackerras; linuxppc-
  d...@lists.ozlabs.org; linux-ker...@vger.kernel.org; kvm@vger.kernel.org;
  David Gibson
  Subject: Re: [PATCH] vfio powerpc: enabled and supported on powernv
  platform
  
  On Tue, 2012-11-20 at 11:48 +1100, Alexey Kardashevskiy wrote:
   VFIO implements platform independent stuff such as a PCI driver, BAR
   access (via read/write on a file descriptor or direct mapping when
   possible) and IRQ signaling.
   The platform dependent part includes IOMMU initialization and
   handling.
  
   This patch initializes IOMMU groups based on the IOMMU configuration
   discovered during the PCI scan, only POWERNV platform is supported at
   the moment.
  
   Also the patch implements an VFIO-IOMMU driver which manages DMA
   mapping/unmapping requests coming from the client (now QEMU). It also
   returns a DMA window information to let the guest initialize the
   device tree for a guest OS properly. Although this driver has been
   tested only on POWERNV, it should work on any platform supporting TCE
   tables.
  
   To enable VFIO on POWER, enable SPAPR_TCE_IOMMU config option.
  
   Cc: David Gibson da...@gibson.dropbear.id.au
   Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
   ---
arch/powerpc/include/asm/iommu.h |6 +
arch/powerpc/kernel/iommu.c  |  140 +++
arch/powerpc/platforms/powernv/pci.c |  135 +++
drivers/iommu/Kconfig|8 ++
drivers/vfio/Kconfig |6 +
drivers/vfio/Makefile|1 +
drivers/vfio/vfio_iommu_spapr_tce.c  |  247
  ++
include/linux/vfio.h |   20 +++
8 files changed, 563 insertions(+)
create mode 100644 drivers/vfio/vfio_iommu_spapr_tce.c
  
   diff --git a/arch/powerpc/include/asm/iommu.h
   b/arch/powerpc/include/asm/iommu.h
   index cbfe678..5ba66cb 100644
   --- a/arch/powerpc/include/asm/iommu.h
   +++ b/arch/powerpc/include/asm/iommu.h
   @@ -64,30 +64,33 @@ struct iommu_pool {  }
   cacheline_aligned_in_smp;
  
struct iommu_table {
 unsigned long  it_busno; /* Bus number this table belongs to */
 unsigned long  it_size;  /* Size of iommu table in entries */
 unsigned long  it_offset;/* Offset into global table */
 unsigned long  it_base;  /* mapped address of tce table */
 unsigned long  it_index; /* which iommu table this is */
 unsigned long  it_type;  /* type: PCI or Virtual Bus */
 unsigned long  it_blocksize; /* Entries in each block (cacheline)
  */
 unsigned long  poolsize;
 unsigned long  nr_pools;
 struct iommu_pool large_pool;
 struct iommu_pool pools[IOMMU_NR_POOLS];
 unsigned long *it_map;   /* A simple allocation bitmap for now
  */
   +#ifdef CONFIG_IOMMU_API
   + struct iommu_group *it_group;
   +#endif
};
  
struct scatterlist;
  
static inline void set_iommu_table_base(struct device *dev, void
   *base)  {
 dev-archdata.dma_data.iommu_table_base = base;  }
  
static inline void *get_iommu_table_base(struct device *dev)  {
 return dev-archdata.dma_data.iommu_table_base;
}
  
/* Frees table for an individual device node */ @@ -135,17 +138,20 @@
   static inline void pci_iommu_init(void) { }  extern void
   alloc_dart_table(void);  #if defined(CONFIG_PPC64) 
   defined(CONFIG_PM)  static inline void iommu_save(void)  {
 if (ppc_md.iommu_save)
 ppc_md.iommu_save();
}
  
static inline void iommu_restore(void)  {
 if (ppc_md.iommu_restore)
 ppc_md.iommu_restore();
}
#endif
  
   +extern long iommu_put_tces(struct iommu_table *tbl, unsigned long
  entry, uint64_t tce,
   + enum dma_data_direction direction, unsigned long pages);
   +
#endif /* __KERNEL__ */
#endif /* _ASM_IOMMU_H */
   diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
   index ff5a6ce..94f614b 100644
   --- a/arch/powerpc/kernel/iommu.c
   +++ b/arch/powerpc/kernel/iommu.c
   @@ -32,30 +32,31 @@
#include linux/dma-mapping.h
#include linux/bitmap.h
#include linux/iommu-helper.h
#include linux/crash_dump.h
#include linux/hash.h
#include linux/fault-inject.h
#include linux/pci.h
#include asm/io.h
#include asm/prom.h
#include asm/iommu.h
#include asm/pci-bridge.h
#include asm/machdep.h
#include asm/kdump.h
#include asm/fadump.h
#include asm/vio.h
   +#include asm/tce.h
  
#define DBG(...)
  
static int novmerge;
  
static void __iommu_free(struct iommu_table *, dma_addr_t, unsigned
   int);
  
static int __init 

Re: [PATCH v8 1/2] x86/kexec: add a new atomic notifier list for kdump

2012-11-26 Thread Eric W. Biederman
Zhang Yanfei zhangyan...@cn.fujitsu.com writes:

 This patch adds an atomic notifier list named crash_notifier_list.
 Currently, when loading the kvm-intel module, a notifier will be registered
 in the list to enable the VMCSs loaded on all cpus to be VMCLEAR'd if
 needed.

crash_notifier_list ick gag please no.  Effectively this makes the kexec
on panic code path undebuggable.

Instead we need to use direct function calls to whatever you are doing.

If a direct function call is too complex then the piece of code you want
to call is almost certainly too complex to be calling on a code path
like this.

Eric

 Signed-off-by: Zhang Yanfei zhangyan...@cn.fujitsu.com
 ---
  arch/x86/include/asm/kexec.h |2 ++
  arch/x86/kernel/crash.c  |9 +
  2 files changed, 11 insertions(+), 0 deletions(-)

 diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h
 index 317ff17..5e22b00 100644
 --- a/arch/x86/include/asm/kexec.h
 +++ b/arch/x86/include/asm/kexec.h
 @@ -163,6 +163,8 @@ struct kimage_arch {
  };
  #endif
  
 +extern struct atomic_notifier_head crash_notifier_list;
 +
  #endif /* __ASSEMBLY__ */
  
  #endif /* _ASM_X86_KEXEC_H */
 diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
 index 13ad899..c5b2f70 100644
 --- a/arch/x86/kernel/crash.c
 +++ b/arch/x86/kernel/crash.c
 @@ -16,6 +16,8 @@
  #include linux/delay.h
  #include linux/elf.h
  #include linux/elfcore.h
 +#include linux/module.h
 +#include linux/notifier.h
  
  #include asm/processor.h
  #include asm/hardirq.h
 @@ -30,6 +32,9 @@
  
  int in_crash_kexec;
  
 +ATOMIC_NOTIFIER_HEAD(crash_notifier_list);
 +EXPORT_SYMBOL_GPL(crash_notifier_list);
 +
  #if defined(CONFIG_SMP)  defined(CONFIG_X86_LOCAL_APIC)
  
  static void kdump_nmi_callback(int cpu, struct pt_regs *regs)
 @@ -46,6 +51,8 @@ static void kdump_nmi_callback(int cpu, struct pt_regs 
 *regs)
  #endif
   crash_save_cpu(regs, cpu);
  
 + atomic_notifier_call_chain(crash_notifier_list, 0, NULL);
 +
   /* Disable VMX or SVM if needed.
*
* We need to disable virtualization on all CPUs.
 @@ -88,6 +95,8 @@ void native_machine_crash_shutdown(struct pt_regs *regs)
  
   kdump_nmi_shootdown_cpus();
  
 + atomic_notifier_call_chain(crash_notifier_list, 0, NULL);
 +
   /* Booting kdump kernel with VMX or SVM enabled won't work,
* because (among other limitations) we can't disable paging
* with the virt flags.


Re: [PATCH] vhost-blk: Add vhost-blk support v5

2012-11-26 Thread Michael S. Tsirkin
On Mon, Nov 19, 2012 at 10:26:41PM +0200, Michael S. Tsirkin wrote:
  
  Userspace bits:
  -
  1) LKVM
  The latest vhost-blk userspace bits for kvm tool can be found here:
  g...@github.com:asias/linux-kvm.git blk.vhost-blk
  
  2) QEMU
  The latest vhost-blk userspace prototype for QEMU can be found here:
  g...@github.com:asias/qemu.git blk.vhost-blk
  
  Changes in v5:
  - Do not assume the buffer layout
  - Fix wakeup race
  
  Changes in v4:
  - Mark req-status as userspace pointer
  - Use __copy_to_user() instead of copy_to_user() in vhost_blk_set_status()
  - Add if (need_resched()) schedule() in blk thread
  - Kill vhost_blk_stop_vq() and move it into vhost_blk_stop()
  - Use vq_err() instead of pr_warn()
  - Fail unsupported requests
  - Add flush in vhost_blk_set_features()
  
  Changes in v3:
  - Sending REQ_FLUSH bio instead of vfs_fsync, thanks Christoph!
  - Check file passed by user is a raw block device file
  
  Signed-off-by: Asias He as...@redhat.com
 
 Since there are files shared by this and vhost net
 it's easiest for me to merge this all through the
 vhost tree.

Hi Dave, are you ok with this proposal?

-- 
MST


Re: [PATCH] vfio powerpc: enabled and supported on powernv platform

2012-11-26 Thread Alex Williamson
On Fri, 2012-11-23 at 13:02 +1100, Alexey Kardashevskiy wrote:
 On 22/11/12 22:56, Sethi Varun-B16395 wrote:
 
 
  -Original Message-
  From: linux-kernel-ow...@vger.kernel.org [mailto:linux-kernel-
  ow...@vger.kernel.org] On Behalf Of Alex Williamson
  Sent: Tuesday, November 20, 2012 11:50 PM
  To: Alexey Kardashevskiy
  Cc: Benjamin Herrenschmidt; Paul Mackerras; linuxppc-
  d...@lists.ozlabs.org; linux-ker...@vger.kernel.org; kvm@vger.kernel.org;
  David Gibson
  Subject: Re: [PATCH] vfio powerpc: enabled and supported on powernv
  platform
 
  On Tue, 2012-11-20 at 11:48 +1100, Alexey Kardashevskiy wrote:
  VFIO implements platform independent stuff such as a PCI driver, BAR
  access (via read/write on a file descriptor or direct mapping when
  possible) and IRQ signaling.
  The platform dependent part includes IOMMU initialization and
  handling.
 
  This patch initializes IOMMU groups based on the IOMMU configuration
  discovered during the PCI scan, only POWERNV platform is supported at
  the moment.
 
  Also the patch implements an VFIO-IOMMU driver which manages DMA
  mapping/unmapping requests coming from the client (now QEMU). It also
  returns a DMA window information to let the guest initialize the
  device tree for a guest OS properly. Although this driver has been
  tested only on POWERNV, it should work on any platform supporting TCE
  tables.
 
  To enable VFIO on POWER, enable SPAPR_TCE_IOMMU config option.
 
  Cc: David Gibson da...@gibson.dropbear.id.au
  Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
  ---
arch/powerpc/include/asm/iommu.h |6 +
arch/powerpc/kernel/iommu.c  |  140 +++
arch/powerpc/platforms/powernv/pci.c |  135 +++
drivers/iommu/Kconfig|8 ++
drivers/vfio/Kconfig |6 +
drivers/vfio/Makefile|1 +
drivers/vfio/vfio_iommu_spapr_tce.c  |  247
  ++
include/linux/vfio.h |   20 +++
8 files changed, 563 insertions(+)
create mode 100644 drivers/vfio/vfio_iommu_spapr_tce.c
 
  diff --git a/arch/powerpc/include/asm/iommu.h
  b/arch/powerpc/include/asm/iommu.h
  index cbfe678..5ba66cb 100644
  --- a/arch/powerpc/include/asm/iommu.h
  +++ b/arch/powerpc/include/asm/iommu.h
  @@ -64,30 +64,33 @@ struct iommu_pool {  }
  cacheline_aligned_in_smp;
 
struct iommu_table {
unsigned long  it_busno; /* Bus number this table belongs 
  to */
unsigned long  it_size;  /* Size of iommu table in entries 
  */
unsigned long  it_offset;/* Offset into global table */
unsigned long  it_base;  /* mapped address of tce table */
unsigned long  it_index; /* which iommu table this is */
unsigned long  it_type;  /* type: PCI or Virtual Bus */
unsigned long  it_blocksize; /* Entries in each block 
  (cacheline)
  */
unsigned long  poolsize;
unsigned long  nr_pools;
struct iommu_pool large_pool;
struct iommu_pool pools[IOMMU_NR_POOLS];
unsigned long *it_map;   /* A simple allocation bitmap for 
  now
  */
  +#ifdef CONFIG_IOMMU_API
  + struct iommu_group *it_group;
  +#endif
};
 
struct scatterlist;
 
static inline void set_iommu_table_base(struct device *dev, void
  *base)  {
dev-archdata.dma_data.iommu_table_base = base;  }
 
static inline void *get_iommu_table_base(struct device *dev)  {
return dev-archdata.dma_data.iommu_table_base;
}
 
/* Frees table for an individual device node */ @@ -135,17 +138,20 @@
  static inline void pci_iommu_init(void) { }  extern void
  alloc_dart_table(void);  #if defined(CONFIG_PPC64) 
  defined(CONFIG_PM)  static inline void iommu_save(void)  {
if (ppc_md.iommu_save)
ppc_md.iommu_save();
}
 
static inline void iommu_restore(void)  {
if (ppc_md.iommu_restore)
ppc_md.iommu_restore();
}
#endif
 
  +extern long iommu_put_tces(struct iommu_table *tbl, unsigned long
  entry, uint64_t tce,
  + enum dma_data_direction direction, unsigned long pages);
  +
#endif /* __KERNEL__ */
#endif /* _ASM_IOMMU_H */
  diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
  index ff5a6ce..94f614b 100644
  --- a/arch/powerpc/kernel/iommu.c
  +++ b/arch/powerpc/kernel/iommu.c
  @@ -32,30 +32,31 @@
#include linux/dma-mapping.h
#include linux/bitmap.h
#include linux/iommu-helper.h
#include linux/crash_dump.h
#include linux/hash.h
#include linux/fault-inject.h
#include linux/pci.h
#include asm/io.h
#include asm/prom.h
#include asm/iommu.h
#include asm/pci-bridge.h
#include asm/machdep.h
#include asm/kdump.h
#include asm/fadump.h
#include asm/vio.h
  +#include asm/tce.h
 
#define DBG(...)
 

Re: The smp_affinity cannot work correctly on guest os when PCI passthrough device using msi/msi-x with KVM

2012-11-26 Thread Alex Williamson
On Fri, 2012-11-23 at 11:06 +0800, yi li wrote:
 Hi Guys,
 
 there is an issue where smp_affinity cannot work correctly in the guest
 os when a PCI passthrough device uses msi/msi-x with KVM.
 
 My reasoning:
 the pcpu will incur a lot of IPI interrupts to find the vcpu that handles
 the irq, so the guest os will VM_EXIT frequently. right?
 
 if smp_affinity could work correctly in the guest os, the best setup would
 be to pin the vcpu that handles the irq (via cputune) to the pcpu that
 handles the kvm:pci-bus irq on the host. but unfortunately, i find that
 smp_affinity does not work correctly in the guest os with msi/msi-x.
 
 how to reproduce:
 1: passthrough a netcard (Broadcom BCM5716S) to the guest os
 
 2: ifup the netcard (the card will use msi-x interrupts by default), and
 stop the irqbalance service
 
 3: echo 4 > /proc/irq/NETCARDIRQ/smp_affinity, so we assume vcpu2
 handles the irq.
 
 4: we have set <vcpupin vcpu='2' cpuset='1'/> and set the irq kvm:pci-bus to
 pcpu1 on the host.
 
 we think this configuration will reduce the IPI interrupts when injecting
 interrupts into the guest os. but this irq is not handled only by vcpu2.
 maybe this is not what we expect.

What version of qemu-kvm/qemu are you using?  There's been some work
recently specifically to enable this.  Thanks,

Alex



Re: [PATCH 0/1] [PULL] qemu-kvm.git uq/master queue

2012-11-26 Thread Anthony Liguori
Marcelo Tosatti mtosa...@redhat.com writes:

 The following changes since commit 1ccbc2851282564308f790753d7158487b6af8e2:

   qemu-sockets: Fix parsing of the inet option 'to'. (2012-11-21 12:07:59 
 +0400)

 are available in the git repository at:
   git://git.kernel.org/pub/scm/virt/kvm/qemu-kvm.git uq/master

 Bruce Rogers (1):
   Legacy qemu-kvm options have no argument

Pulled. Thanks.

Regards,

Anthony Liguori


  qemu-options.hx |8 
  1 files changed, 4 insertions(+), 4 deletions(-)


Re: The smp_affinity cannot work correctly on guest os when PCI passthrough device using msi/msi-x with KVM

2012-11-26 Thread yi li
hi Alex,

the qemu-kvm version 1.2.

Thanks.
YiLi

2012/11/26 Alex Williamson alex.william...@redhat.com:
 On Fri, 2012-11-23 at 11:06 +0800, yi li wrote:
 Hi Guys,

 there have a issue about smp_affinity cannot work correctly on guest
 os when PCI passthrough device using msi/msi-x with KVM.

 My reason:
 pcpu will occur a lot of ipi interrupt to find the vcpu to handle the
 irq.  so the guest os will VM_EXIT frequelty. right?

 if smp_affinity can work correctly on guest os,  the best way is that
 the vcpu handle the irq is cputune at the pcpu which handle the
 kvm:pci-bus irq on the host.but  unfortunly, i find that smp_affinity
 can not work correctly on guest os when msi/msi-x.

 how to reproduce:
 1: passthrough a netcard (Brodcom BCM5716S) to the guest os

 2: ifup the netcard, the card will use msi-x interrupt default, and close the
 irqbalance service

 3:  echo 4  cat /proc/irq/NETCARDIRQ/smp_affinity, so we assume the vcpu2
 handle the irq.

 4: we have set vcpupin vcpu='2' cpuset='1'/ and set the irq kvm:pci-bus to
 the pcpu1 on the host.

 we think this configure will reduce the ipi interrupt when inject interrupt 
 to
 the guest os. but this irq is not only handle on vcpu2. maybe it is
 not our expect。

 What version of qemu-kvm/qemu are you using?  There's been some work
 recently specifically to enable this.  Thanks,

 Alex



Re: [PATCH v8 1/2] x86/kexec: add a new atomic notifier list for kdump

2012-11-26 Thread Gleb Natapov
On Mon, Nov 26, 2012 at 09:08:54AM -0600, Eric W. Biederman wrote:
 Zhang Yanfei zhangyan...@cn.fujitsu.com writes:
 
  This patch adds an atomic notifier list named crash_notifier_list.
  Currently, when loading kvm-intel module, a notifier will be registered
  in the list to enable vmcss loaded on all cpus to be VMCLEAR'd if
  needed.
 
 crash_notifier_list ick gag please no.  Effectively this makes the kexec
 on panic code path undebuggable.
 
 Instead we need to use direct function calls to whatever you are doing.
 
The code walks a linked list in the kvm-intel module and calls vmclear on
whatever it finds there. Since the function has to reside in the kvm-intel
module it cannot be called directly. Is a callback pointer that is set
by kvm-intel more acceptable?
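
As an illustration only, a small user-space sketch of the callback-pointer
alternative; the name crash_vmclear_loaded_vmcss, the registration flow and
the missing locking/RCU details are assumptions, not the actual patch.

#include <stdatomic.h>
#include <stdio.h>

typedef void (*crash_vmclear_fn)(void);

/* set by kvm-intel on module load, cleared on unload */
static _Atomic crash_vmclear_fn crash_vmclear_loaded_vmcss;

/* crash/kdump side: a single, easy-to-audit indirect call */
static void kdump_cpu_callback_model(void)
{
	crash_vmclear_fn fn = atomic_load(&crash_vmclear_loaded_vmcss);

	if (fn)
		fn();		/* VMCLEAR whatever is loaded on this cpu */
	/* ...then disable VMX/SVM and continue the crash path */
}

/* kvm-intel side */
static void vmclear_local_loaded_vmcss_model(void)
{
	puts("VMCLEAR loaded VMCSs on this cpu");
}

int main(void)
{
	kdump_cpu_callback_model();			/* module not loaded: no-op */

	atomic_store(&crash_vmclear_loaded_vmcss,	/* kvm-intel init */
		     vmclear_local_loaded_vmcss_model);
	kdump_cpu_callback_model();			/* crash: one indirect call */

	atomic_store(&crash_vmclear_loaded_vmcss,	/* kvm-intel exit */
		     (crash_vmclear_fn)NULL);
	return 0;
}

In the kernel the pointer would of course need whatever synchronization the
crash path can tolerate (e.g. RCU) so that module unload cannot race with a
panic, but the crash code itself stays a single direct call site.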

 If a direct function call is too complex then the piece of code you want
 to call is almost certainly too complex to be calling on a code path
 like this.
 
 Eric
 
  Signed-off-by: Zhang Yanfei zhangyan...@cn.fujitsu.com
  ---
   arch/x86/include/asm/kexec.h |2 ++
   arch/x86/kernel/crash.c  |9 +
   2 files changed, 11 insertions(+), 0 deletions(-)
 
  diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h
  index 317ff17..5e22b00 100644
  --- a/arch/x86/include/asm/kexec.h
  +++ b/arch/x86/include/asm/kexec.h
  @@ -163,6 +163,8 @@ struct kimage_arch {
   };
   #endif
   
  +extern struct atomic_notifier_head crash_notifier_list;
  +
   #endif /* __ASSEMBLY__ */
   
   #endif /* _ASM_X86_KEXEC_H */
  diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
  index 13ad899..c5b2f70 100644
  --- a/arch/x86/kernel/crash.c
  +++ b/arch/x86/kernel/crash.c
  @@ -16,6 +16,8 @@
   #include linux/delay.h
   #include linux/elf.h
   #include linux/elfcore.h
  +#include linux/module.h
  +#include linux/notifier.h
   
   #include asm/processor.h
   #include asm/hardirq.h
  @@ -30,6 +32,9 @@
   
   int in_crash_kexec;
   
  +ATOMIC_NOTIFIER_HEAD(crash_notifier_list);
  +EXPORT_SYMBOL_GPL(crash_notifier_list);
  +
   #if defined(CONFIG_SMP)  defined(CONFIG_X86_LOCAL_APIC)
   
   static void kdump_nmi_callback(int cpu, struct pt_regs *regs)
  @@ -46,6 +51,8 @@ static void kdump_nmi_callback(int cpu, struct pt_regs 
  *regs)
   #endif
  crash_save_cpu(regs, cpu);
   
  +   atomic_notifier_call_chain(crash_notifier_list, 0, NULL);
  +
  /* Disable VMX or SVM if needed.
   *
   * We need to disable virtualization on all CPUs.
  @@ -88,6 +95,8 @@ void native_machine_crash_shutdown(struct pt_regs *regs)
   
  kdump_nmi_shootdown_cpus();
   
  +   atomic_notifier_call_chain(crash_notifier_list, 0, NULL);
  +
  /* Booting kdump kernel with VMX or SVM enabled won't work,
   * because (among other limitations) we can't disable paging
   * with the virt flags.

--
Gleb.


Re: [PATCH v8 1/2] x86/kexec: add a new atomic notifier list for kdump

2012-11-26 Thread Eric W. Biederman
Gleb Natapov g...@redhat.com writes:

 On Mon, Nov 26, 2012 at 09:08:54AM -0600, Eric W. Biederman wrote:
 Zhang Yanfei zhangyan...@cn.fujitsu.com writes:
 
  This patch adds an atomic notifier list named crash_notifier_list.
  Currently, when loading kvm-intel module, a notifier will be registered
  in the list to enable vmcss loaded on all cpus to be VMCLEAR'd if
  needed.
 
 crash_notifier_list ick gag please no.  Effectively this makes the kexec
 on panic code path undebuggable.
 
 Instead we need to use direct function calls to whatever you are doing.
 
 The code walks linked list in kvm-intel module and calls vmclear on
 whatever it finds there. Since the function have to resides in kvm-intel
 module it cannot be called directly. Is callback pointer that is set
 by kvm-intel more acceptable?

Yes, a specific callback function is more acceptable.  Looking a little
deeper, vmclear_local_loaded_vmcss is not particularly acceptable. It is
doing a lot of work that is unnecessary for saving the virtual registers
on the kexec-on-panic path.

In fact I wonder if it might not just be easier to call vmcs_clear to a
fixed per cpu buffer.

Performing list walking in interrupt context without locking in
vmclear_local_loaded_vmcss looks a bit scary.  Not that locking would
make it any better, as locking would simply add one more way to deadlock
the system.  Only an rcu list walk is at all safe.  A list walk that
modifies the list as vmclear_local_loaded_vmcss does is definitely not safe.

Eric


Re: [PATCH v8 1/2] x86/kexec: add a new atomic notifier list for kdump

2012-11-26 Thread Gleb Natapov
On Mon, Nov 26, 2012 at 11:43:10AM -0600, Eric W. Biederman wrote:
 Gleb Natapov g...@redhat.com writes:
 
  On Mon, Nov 26, 2012 at 09:08:54AM -0600, Eric W. Biederman wrote:
  Zhang Yanfei zhangyan...@cn.fujitsu.com writes:
  
   This patch adds an atomic notifier list named crash_notifier_list.
   Currently, when loading kvm-intel module, a notifier will be registered
   in the list to enable vmcss loaded on all cpus to be VMCLEAR'd if
   needed.
  
  crash_notifier_list ick gag please no.  Effectively this makes the kexec
  on panic code path undebuggable.
  
  Instead we need to use direct function calls to whatever you are doing.
  
  The code walks linked list in kvm-intel module and calls vmclear on
  whatever it finds there. Since the function have to resides in kvm-intel
  module it cannot be called directly. Is callback pointer that is set
  by kvm-intel more acceptable?
 
 Yes a specific callback function is more acceptable.  Looking a little
 deeper vmclear_local_loaded_vmcss is not particularly acceptable. It is
 doing a lot of work that is unnecessary to save the virtual registers
 on the kexec on panic path.
 
What work are you referring to in particular that may not be acceptable?

 In fact I wonder if it might not just be easier to call vmcs_clear to a
 fixed per cpu buffer.
 
There may be more than one vmcs loaded on a cpu, hence the list.

 Performing list walking in interrupt context without locking in
 vmclear_local_loaded vmcss looks a bit scary.  Not that locking would
 make it any better, as locking would simply add one more way to deadlock
 the system.  Only an rcu list walk is at all safe.  A list walk that
 modifies the list as vmclear_local_loaded_vmcss does is definitely not safe.
 
The list that vmclear_local_loaded_vmcss walks is per cpu. Zhang's kvm patch
disables the kexec callback while the list is modified.
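
A rough model of that synchronization, with illustrative names rather than
the actual kvm-intel code: the per-cpu list is only modified inside a window
where the crash callback is disabled, so the NMI-time walk either sees the
callback off or a stable list.

#include <stdatomic.h>
#include <stdio.h>

#define MAX_VMCS 8

static struct {
	_Atomic int crash_enabled;	/* is the crash callback armed here?  */
	int nr;				/* per-cpu loaded-VMCS list (array)   */
	const char *loaded[MAX_VMCS];
} cpu_state = { .crash_enabled = 1 };

static void crash_disable_local_vmclear(void)
{
	atomic_store(&cpu_state.crash_enabled, 0);
}

static void crash_enable_local_vmclear(void)
{
	atomic_store(&cpu_state.crash_enabled, 1);
}

/* list modifications happen only inside the disabled window */
static void loaded_vmcs_add(const char *name)
{
	crash_disable_local_vmclear();
	cpu_state.loaded[cpu_state.nr++] = name;
	crash_enable_local_vmclear();
}

/* what the crash NMI would run on this cpu */
static void crash_vmclear_local_loaded_vmcss(void)
{
	if (!atomic_load(&cpu_state.crash_enabled))
		return;			/* list in flux: nothing safe to do */
	for (int i = 0; i < cpu_state.nr; i++)
		printf("VMCLEAR %s\n", cpu_state.loaded[i]);
}

int main(void)
{
	loaded_vmcs_add("vmcs01");
	loaded_vmcs_add("vmcs02");
	crash_vmclear_local_loaded_vmcss();
	return 0;
}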

--
Gleb.


Re: [PATCH] vfio powerpc: enabled and supported on powernv platform

2012-11-26 Thread Alex Williamson
On Mon, 2012-11-26 at 08:18 -0700, Alex Williamson wrote:
 On Fri, 2012-11-23 at 13:02 +1100, Alexey Kardashevskiy wrote:
  On 22/11/12 22:56, Sethi Varun-B16395 wrote:
  
  
   -Original Message-
   From: linux-kernel-ow...@vger.kernel.org [mailto:linux-kernel-
   ow...@vger.kernel.org] On Behalf Of Alex Williamson
   Sent: Tuesday, November 20, 2012 11:50 PM
   To: Alexey Kardashevskiy
   Cc: Benjamin Herrenschmidt; Paul Mackerras; linuxppc-
   d...@lists.ozlabs.org; linux-ker...@vger.kernel.org; kvm@vger.kernel.org;
   David Gibson
   Subject: Re: [PATCH] vfio powerpc: enabled and supported on powernv
   platform
  
   On Tue, 2012-11-20 at 11:48 +1100, Alexey Kardashevskiy wrote:
   VFIO implements platform independent stuff such as a PCI driver, BAR
   access (via read/write on a file descriptor or direct mapping when
   possible) and IRQ signaling.
   The platform dependent part includes IOMMU initialization and
   handling.
  
   This patch initializes IOMMU groups based on the IOMMU configuration
   discovered during the PCI scan, only POWERNV platform is supported at
   the moment.
  
   Also the patch implements an VFIO-IOMMU driver which manages DMA
   mapping/unmapping requests coming from the client (now QEMU). It also
   returns a DMA window information to let the guest initialize the
   device tree for a guest OS properly. Although this driver has been
   tested only on POWERNV, it should work on any platform supporting TCE
   tables.
  
   To enable VFIO on POWER, enable SPAPR_TCE_IOMMU config option.
  
   Cc: David Gibson da...@gibson.dropbear.id.au
   Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
   ---
 arch/powerpc/include/asm/iommu.h |6 +
 arch/powerpc/kernel/iommu.c  |  140 +++
 arch/powerpc/platforms/powernv/pci.c |  135 +++
 drivers/iommu/Kconfig|8 ++
 drivers/vfio/Kconfig |6 +
 drivers/vfio/Makefile|1 +
 drivers/vfio/vfio_iommu_spapr_tce.c  |  247
   ++
 include/linux/vfio.h |   20 +++
 8 files changed, 563 insertions(+)
 create mode 100644 drivers/vfio/vfio_iommu_spapr_tce.c
  
   diff --git a/arch/powerpc/include/asm/iommu.h
   b/arch/powerpc/include/asm/iommu.h
   index cbfe678..5ba66cb 100644
   --- a/arch/powerpc/include/asm/iommu.h
   +++ b/arch/powerpc/include/asm/iommu.h
   @@ -64,30 +64,33 @@ struct iommu_pool {  }
   cacheline_aligned_in_smp;
  
 struct iommu_table {
   unsigned long  it_busno; /* Bus number this table belongs 
   to */
   unsigned long  it_size;  /* Size of iommu table in entries 
   */
   unsigned long  it_offset;/* Offset into global table */
   unsigned long  it_base;  /* mapped address of tce table */
   unsigned long  it_index; /* which iommu table this is */
   unsigned long  it_type;  /* type: PCI or Virtual Bus */
   unsigned long  it_blocksize; /* Entries in each block 
   (cacheline)
   */
   unsigned long  poolsize;
   unsigned long  nr_pools;
   struct iommu_pool large_pool;
   struct iommu_pool pools[IOMMU_NR_POOLS];
   unsigned long *it_map;   /* A simple allocation bitmap for 
   now
   */
   +#ifdef CONFIG_IOMMU_API
   +   struct iommu_group *it_group;
   +#endif
 };
  
 struct scatterlist;
  
 static inline void set_iommu_table_base(struct device *dev, void
   *base)  {
   dev-archdata.dma_data.iommu_table_base = base;  }
  
 static inline void *get_iommu_table_base(struct device *dev)  {
   return dev-archdata.dma_data.iommu_table_base;
 }
  
 /* Frees table for an individual device node */ @@ -135,17 +138,20 @@
   static inline void pci_iommu_init(void) { }  extern void
   alloc_dart_table(void);  #if defined(CONFIG_PPC64) 
   defined(CONFIG_PM)  static inline void iommu_save(void)  {
   if (ppc_md.iommu_save)
   ppc_md.iommu_save();
 }
  
 static inline void iommu_restore(void)  {
   if (ppc_md.iommu_restore)
   ppc_md.iommu_restore();
 }
 #endif
  
   +extern long iommu_put_tces(struct iommu_table *tbl, unsigned long
   entry, uint64_t tce,
   +   enum dma_data_direction direction, unsigned long pages);
   +
 #endif /* __KERNEL__ */
 #endif /* _ASM_IOMMU_H */
   diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
   index ff5a6ce..94f614b 100644
   --- a/arch/powerpc/kernel/iommu.c
   +++ b/arch/powerpc/kernel/iommu.c
   @@ -32,30 +32,31 @@
 #include linux/dma-mapping.h
 #include linux/bitmap.h
 #include linux/iommu-helper.h
 #include linux/crash_dump.h
 #include linux/hash.h
 #include linux/fault-inject.h
 #include linux/pci.h
 #include asm/io.h
 #include asm/prom.h
 #include asm/iommu.h
 #include 

Re: [PATCH v8 1/2] x86/kexec: add a new atomic notifier list for kdump

2012-11-26 Thread Eric W. Biederman
Gleb Natapov g...@redhat.com writes:

 On Mon, Nov 26, 2012 at 11:43:10AM -0600, Eric W. Biederman wrote:
 Gleb Natapov g...@redhat.com writes:
 
  On Mon, Nov 26, 2012 at 09:08:54AM -0600, Eric W. Biederman wrote:
  Zhang Yanfei zhangyan...@cn.fujitsu.com writes:
  
   This patch adds an atomic notifier list named crash_notifier_list.
   Currently, when loading kvm-intel module, a notifier will be registered
   in the list to enable vmcss loaded on all cpus to be VMCLEAR'd if
   needed.
  
  crash_notifier_list ick gag please no.  Effectively this makes the kexec
  on panic code path undebuggable.
  
  Instead we need to use direct function calls to whatever you are doing.
  
  The code walks linked list in kvm-intel module and calls vmclear on
  whatever it finds there. Since the function have to resides in kvm-intel
  module it cannot be called directly. Is callback pointer that is set
  by kvm-intel more acceptable?
 
 Yes a specific callback function is more acceptable.  Looking a little
 deeper vmclear_local_loaded_vmcss is not particularly acceptable. It is
 doing a lot of work that is unnecessary to save the virtual registers
 on the kexec on panic path.
 
 What work are you referring to in particular that may not be
 acceptable?

The unnecessary work that I see is all of the software state
changing: unlinking things from linked lists, flipping variables.
None of that appears related to the fundamental issue of saving cpu
state.

Simply reusing a function that does more than what is strictly required
makes me nervous.  What is the chance that the function will grow
with maintenance and add constructs that are not safe in a kexec-on-panic
situation?

 In fact I wonder if it might not just be easier to call vmcs_clear to a
 fixed per cpu buffer.
 
 There may be more than one vmcs loaded on a cpu, hence the list.

 Performing list walking in interrupt context without locking in
 vmclear_local_loaded vmcss looks a bit scary.  Not that locking would
 make it any better, as locking would simply add one more way to deadlock
 the system.  Only an rcu list walk is at all safe.  A list walk that
 modifies the list as vmclear_local_loaded_vmcss does is definitely not safe.
 
 The list vmclear_local_loaded walks is per cpu. Zhang's kvm patch
 disables kexec callback while list is modified.

If the list is only modified on its cpu and we are running on that cpu,
that does look like it will give the necessary protections.  It isn't
particularly clear at first glance that this is the case, unfortunately.

Eric


Re: [PATCH 1/2] vfio powerpc: implemented IOMMU driver for VFIO

2012-11-26 Thread Alex Williamson
On Fri, 2012-11-23 at 20:03 +1100, Alexey Kardashevskiy wrote:
 VFIO implements platform independent stuff such as
 a PCI driver, BAR access (via read/write on a file descriptor
 or direct mapping when possible) and IRQ signaling.
 
 The platform dependent part includes IOMMU initialization
 and handling. This patch implements an IOMMU driver for VFIO
 which does mapping/unmapping pages for the guest IO and
 provides information about DMA window (required by a POWERPC
 guest).
 
 The counterpart in QEMU is required to support this functionality.
 
 Cc: David Gibson da...@gibson.dropbear.id.au
 Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
 ---
  drivers/vfio/Kconfig|6 +
  drivers/vfio/Makefile   |1 +
  drivers/vfio/vfio_iommu_spapr_tce.c |  247 
 +++
  include/linux/vfio.h|   20 +++
  4 files changed, 274 insertions(+)
  create mode 100644 drivers/vfio/vfio_iommu_spapr_tce.c
 
 diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
 index 7cd5dec..b464687 100644
 --- a/drivers/vfio/Kconfig
 +++ b/drivers/vfio/Kconfig
 @@ -3,10 +3,16 @@ config VFIO_IOMMU_TYPE1
   depends on VFIO
   default n
  
 +config VFIO_IOMMU_SPAPR_TCE
 + tristate
 + depends on VFIO  SPAPR_TCE_IOMMU
 + default n
 +
  menuconfig VFIO
   tristate VFIO Non-Privileged userspace driver framework
   depends on IOMMU_API
   select VFIO_IOMMU_TYPE1 if X86
 + select VFIO_IOMMU_SPAPR_TCE if PPC_POWERNV
   help
 VFIO provides a framework for secure userspace device drivers.
 See Documentation/vfio.txt for more details.
 diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile
 index 2398d4a..72bfabc 100644
 --- a/drivers/vfio/Makefile
 +++ b/drivers/vfio/Makefile
 @@ -1,3 +1,4 @@
  obj-$(CONFIG_VFIO) += vfio.o
  obj-$(CONFIG_VFIO_IOMMU_TYPE1) += vfio_iommu_type1.o
 +obj-$(CONFIG_VFIO_IOMMU_SPAPR_TCE) += vfio_iommu_spapr_tce.o
  obj-$(CONFIG_VFIO_PCI) += pci/
 diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c 
 b/drivers/vfio/vfio_iommu_spapr_tce.c
 new file mode 100644
 index 000..46a6298
 --- /dev/null
 +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
 @@ -0,0 +1,247 @@
 +/*
 + * VFIO: IOMMU DMA mapping support for TCE on POWER
 + *
 + * Copyright (C) 2012 IBM Corp.  All rights reserved.
 + * Author: Alexey Kardashevskiy a...@ozlabs.ru
 + *
 + * This program is free software; you can redistribute it and/or modify
 + * it under the terms of the GNU General Public License version 2 as
 + * published by the Free Software Foundation.
 + *
 + * Derived from original vfio_iommu_type1.c:
 + * Copyright (C) 2012 Red Hat, Inc.  All rights reserved.
 + * Author: Alex Williamson alex.william...@redhat.com
 + */
 +
 +#include linux/module.h
 +#include linux/pci.h
 +#include linux/slab.h
 +#include linux/uaccess.h
 +#include linux/err.h
 +#include linux/vfio.h
 +#include asm/iommu.h
 +
 +#define DRIVER_VERSION  0.1
 +#define DRIVER_AUTHOR   a...@ozlabs.ru
 +#define DRIVER_DESC VFIO IOMMU SPAPR TCE
 +
 +static void tce_iommu_detach_group(void *iommu_data,
 + struct iommu_group *iommu_group);
 +
 +/*
 + * VFIO IOMMU fd for SPAPR_TCE IOMMU implementation
 + */
 +
 +/*
 + * The container descriptor supports only a single group per container.
 + * Required by the API as the container is not supplied with the IOMMU group
 + * at the moment of initialization.
 + */
 +struct tce_container {
 + struct mutex lock;
 + struct iommu_table *tbl;
 +};
 +
 +static void *tce_iommu_open(unsigned long arg)
 +{
 + struct tce_container *container;
 +
 + if (arg != VFIO_SPAPR_TCE_IOMMU) {
 + printk(KERN_ERR tce_vfio: Wrong IOMMU type\n);
 + return ERR_PTR(-EINVAL);
 + }
 +
 + container = kzalloc(sizeof(*container), GFP_KERNEL);
 + if (!container)
 + return ERR_PTR(-ENOMEM);
 +
 + mutex_init(container-lock);
 +
 + return container;
 +}
 +
 +static void tce_iommu_release(void *iommu_data)
 +{
 + struct tce_container *container = iommu_data;
 +
 + WARN_ON(container-tbl  !container-tbl-it_group);

I think your patch ordering is backwards here.  it_group isn't added
until 2/2.  I'd really like to see the arch/powerpc code approved and
merged by the powerpc maintainer before we add the code that makes use
of it into vfio.  Otherwise we just get lots of churn if interfaces
change or they disapprove of it altogether.

 + if (container-tbl  container-tbl-it_group)
 + tce_iommu_detach_group(iommu_data, container-tbl-it_group);
 +
 + mutex_destroy(container-lock);
 +
 + kfree(container);
 +}
 +
 +static long tce_iommu_ioctl(void *iommu_data,
 +  unsigned int cmd, unsigned long arg)
 +{
 + struct tce_container *container = iommu_data;
 + unsigned long minsz;
 +
 + switch (cmd) {
 + case VFIO_CHECK_EXTENSION: {
 + return (arg == VFIO_SPAPR_TCE_IOMMU) ? 1 : 0;
 + }
 

[PATCH V3 0/2] Resend - IA32_TSC_ADJUST support for KVM

2012-11-26 Thread Will Auld
Resending these as the mail seems to have not fully worked last Wed.


Marcelo,

I have addressed your comments for this patch set (V3); the following patch for
QEMU-KVM and a patch adding a test case for tsc_adjust will also follow today.

Thanks,

Will

Will Auld (2):
  Add code to track call origin for msr assignment.
  Enabling IA32_TSC_ADJUST for KVM guest VM support

 arch/x86/include/asm/cpufeature.h |  1 +
 arch/x86/include/asm/kvm_host.h   | 15 ++---
 arch/x86/include/asm/msr-index.h  |  1 +
 arch/x86/kvm/cpuid.c  |  2 ++
 arch/x86/kvm/cpuid.h  |  8 +++
 arch/x86/kvm/svm.c| 28 ++--
 arch/x86/kvm/vmx.c| 33 ++--
 arch/x86/kvm/x86.c| 45 +--
 arch/x86/kvm/x86.h|  2 +-
 9 files changed, 112 insertions(+), 23 deletions(-)

-- 
1.8.0.rc0





[PATCH V3 1/2] Resend - Add code to track call origin for msr assignment.

2012-11-26 Thread Will Auld
In order to track who initiated the call (host or guest) to modify an msr
value I have changed function call parameters along the call path. The
specific change is to add a struct pointer parameter that points to (index,
data, caller) information rather than having this information passed as
individual parameters.

The initial use for this capability is for updating the IA32_TSC_ADJUST
msr while setting the tsc value. It is anticipated that this capability
will be useful for other tasks as well.

Signed-off-by: Will Auld will.a...@intel.com
---
 arch/x86/include/asm/kvm_host.h | 12 +---
 arch/x86/kvm/svm.c  | 21 +++--
 arch/x86/kvm/vmx.c  | 24 +---
 arch/x86/kvm/x86.c  | 23 +--
 arch/x86/kvm/x86.h  |  2 +-
 5 files changed, 59 insertions(+), 23 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 09155d6..da34027 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -598,6 +598,12 @@ struct kvm_vcpu_stat {
 
 struct x86_instruction_info;
 
+struct msr_data {
+bool host_initiated;
+u32 index;
+u64 data;
+};
+
 struct kvm_x86_ops {
int (*cpu_has_kvm_support)(void);  /* __init */
int (*disabled_by_bios)(void); /* __init */
@@ -621,7 +627,7 @@ struct kvm_x86_ops {
void (*set_guest_debug)(struct kvm_vcpu *vcpu,
struct kvm_guest_debug *dbg);
int (*get_msr)(struct kvm_vcpu *vcpu, u32 msr_index, u64 *pdata);
-   int (*set_msr)(struct kvm_vcpu *vcpu, u32 msr_index, u64 data);
+   int (*set_msr)(struct kvm_vcpu *vcpu, struct msr_data *msr);
u64 (*get_segment_base)(struct kvm_vcpu *vcpu, int seg);
void (*get_segment)(struct kvm_vcpu *vcpu,
struct kvm_segment *var, int seg);
@@ -772,7 +778,7 @@ static inline int emulate_instruction(struct kvm_vcpu *vcpu,
 
 void kvm_enable_efer_bits(u64);
 int kvm_get_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 *data);
-int kvm_set_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 data);
+int kvm_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr);
 
 struct x86_emulate_ctxt;
 
@@ -799,7 +805,7 @@ void kvm_get_cs_db_l_bits(struct kvm_vcpu *vcpu, int *db, 
int *l);
 int kvm_set_xcr(struct kvm_vcpu *vcpu, u32 index, u64 xcr);
 
 int kvm_get_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata);
-int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data);
+int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr);
 
 unsigned long kvm_get_rflags(struct kvm_vcpu *vcpu);
 void kvm_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags);
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index baead95..5ac11f0 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -1211,6 +1211,7 @@ static struct kvm_vcpu *svm_create_vcpu(struct kvm *kvm, 
unsigned int id)
struct page *msrpm_pages;
struct page *hsave_page;
struct page *nested_msrpm_pages;
+   struct msr_data msr;
int err;
 
svm = kmem_cache_zalloc(kvm_vcpu_cache, GFP_KERNEL);
@@ -1255,7 +1256,10 @@ static struct kvm_vcpu *svm_create_vcpu(struct kvm *kvm, 
unsigned int id)
svm-vmcb_pa = page_to_pfn(page)  PAGE_SHIFT;
svm-asid_generation = 0;
init_vmcb(svm);
-   kvm_write_tsc(svm-vcpu, 0);
+   msr.data = 0x0;
+   msr.index = MSR_IA32_TSC;
+   msr.host_initiated = true;
+   kvm_write_tsc(svm-vcpu, msr);
 
err = fx_init(svm-vcpu);
if (err)
@@ -3147,13 +3151,15 @@ static int svm_set_vm_cr(struct kvm_vcpu *vcpu, u64 
data)
return 0;
 }
 
-static int svm_set_msr(struct kvm_vcpu *vcpu, unsigned ecx, u64 data)
+static int svm_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr)
 {
struct vcpu_svm *svm = to_svm(vcpu);
 
+   u32 ecx = msr-index;
+   u64 data = msr-data;
switch (ecx) {
case MSR_IA32_TSC:
-   kvm_write_tsc(vcpu, data);
+   kvm_write_tsc(vcpu, msr);
break;
case MSR_STAR:
svm-vmcb-save.star = data;
@@ -3208,20 +3214,23 @@ static int svm_set_msr(struct kvm_vcpu *vcpu, unsigned 
ecx, u64 data)
vcpu_unimpl(vcpu, unimplemented wrmsr: 0x%x data 0x%llx\n, 
ecx, data);
break;
default:
-   return kvm_set_msr_common(vcpu, ecx, data);
+   return kvm_set_msr_common(vcpu, msr);
}
return 0;
 }
 
 static int wrmsr_interception(struct vcpu_svm *svm)
 {
+   struct msr_data msr;
u32 ecx = svm-vcpu.arch.regs[VCPU_REGS_RCX];
u64 data = (svm-vcpu.arch.regs[VCPU_REGS_RAX]  -1u)
| ((u64)(svm-vcpu.arch.regs[VCPU_REGS_RDX]  -1u)  32);
 
-
+   msr.data = data;
+   msr.index = ecx;
+   msr.host_initiated = false;
svm-next_rip = kvm_rip_read(svm-vcpu) + 2;
-   if (svm_set_msr(svm-vcpu, 
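
To illustrate the call-origin tracking that this patch plumbs through, here
is a minimal stand-alone model (hypothetical consumer, not part of the
patch): the same setter now receives enough information to distinguish a
host-initiated write, e.g. from migration restore, from a guest WRMSR.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct msr_data {
	bool host_initiated;
	uint32_t index;
	uint64_t data;
};

#define MSR_IA32_TSC 0x10

static void kvm_write_tsc_model(struct msr_data *msr)
{
	/* a consumer can now special-case host writes (see patch 2/2) */
	printf("TSC <- %llu (%s initiated)\n",
	       (unsigned long long)msr->data,
	       msr->host_initiated ? "host" : "guest");
}

int main(void)
{
	struct msr_data guest_wr = {
		.host_initiated = false, .index = MSR_IA32_TSC, .data = 1000,
	};
	struct msr_data host_wr = {
		.host_initiated = true, .index = MSR_IA32_TSC, .data = 0,
	};

	kvm_write_tsc_model(&guest_wr);	/* wrmsr interception path */
	kvm_write_tsc_model(&host_wr);	/* vcpu create / migration path */
	return 0;
}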

[PATCH V3 2/2] Resend - Enabling IA32_TSC_ADJUST for KVM guest VM support

2012-11-26 Thread Will Auld
CPUID.7.0.EBX[1]=1 indicates IA32_TSC_ADJUST MSR 0x3b is supported

The basic design is to emulate the MSR by allowing reads and writes to a guest
vcpu specific location to store the value of the emulated MSR while adding
the value to the vmcs tsc_offset. In this way the IA32_TSC_ADJUST value will
be included in all reads of the TSC MSR, whether through rdmsr or rdtsc. This
is of course as long as the "use TSC counter offsetting" VM-execution
control is enabled, as well as the IA32_TSC_ADJUST control.

However, because hardware will only return TSC + IA32_TSC_ADJUST + vmcs
tsc_offset for a guest process when it does a rdtsc (with the correct
settings), the value of our virtualized IA32_TSC_ADJUST must be stored in
one of these three locations. The argument against storing it in the actual
MSR is performance: it is likely to be seldom used, while the save/restore
would be required on every transition. IA32_TSC_ADJUST was created as a way to
solve some issues with writing the TSC itself, so that is not an option either.
The remaining option, described above as our solution, has the problem of
returning incorrect vmcs tsc_offset values (unless we intercept and fix, not
done here) as mentioned above. More problematic, however, is that storing the
data in vmcs tsc_offset has a different semantic effect on the system
than using the actual MSR. This is illustrated in the following example:
the hypervisor sets IA32_TSC_ADJUST, then the guest sets it, and a guest
process performs a rdtsc. In this case the guest process will get TSC +
IA32_TSC_ADJUST_hypervisor + vmcs tsc_offset including IA32_TSC_ADJUST_guest.
While the total system semantics change, the semantics as seen by the guest
do not, and hence this will not cause a problem.

Signed-off-by: Will Auld will.a...@intel.com
---
 arch/x86/include/asm/cpufeature.h |  1 +
 arch/x86/include/asm/kvm_host.h   |  3 +++
 arch/x86/include/asm/msr-index.h  |  1 +
 arch/x86/kvm/cpuid.c  |  2 ++
 arch/x86/kvm/cpuid.h  |  8 
 arch/x86/kvm/svm.c|  7 +++
 arch/x86/kvm/vmx.c|  9 +
 arch/x86/kvm/x86.c| 22 ++
 8 files changed, 53 insertions(+)

diff --git a/arch/x86/include/asm/cpufeature.h 
b/arch/x86/include/asm/cpufeature.h
index 6b7ee5f..e574d81 100644
--- a/arch/x86/include/asm/cpufeature.h
+++ b/arch/x86/include/asm/cpufeature.h
@@ -199,6 +199,7 @@
 
 /* Intel-defined CPU features, CPUID level 0x0007:0 (ebx), word 9 */
 #define X86_FEATURE_FSGSBASE   (9*32+ 0) /* {RD/WR}{FS/GS}BASE instructions*/
+#define X86_FEATURE_TSC_ADJUST  (9*32+ 1) /* TSC adjustment MSR 0x3b */
 #define X86_FEATURE_BMI1   (9*32+ 3) /* 1st group bit manipulation 
extensions */
 #define X86_FEATURE_HLE(9*32+ 4) /* Hardware Lock Elision */
 #define X86_FEATURE_AVX2   (9*32+ 5) /* AVX2 instructions */
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index da34027..cf8c7e0 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -442,6 +442,8 @@ struct kvm_vcpu_arch {
u32 virtual_tsc_mult;
u32 virtual_tsc_khz;
 
+   s64 ia32_tsc_adjust_msr;
+
atomic_t nmi_queued;  /* unprocessed asynchronous NMIs */
unsigned nmi_pending; /* NMI queued after currently running handler */
bool nmi_injected;/* Trying to inject an NMI this entry */
@@ -690,6 +692,7 @@ struct kvm_x86_ops {
bool (*has_wbinvd_exit)(void);
 
void (*set_tsc_khz)(struct kvm_vcpu *vcpu, u32 user_tsc_khz, bool 
scale);
+   u64 (*read_tsc_offset)(struct kvm_vcpu *vcpu);
void (*write_tsc_offset)(struct kvm_vcpu *vcpu, u64 offset);
 
u64 (*compute_tsc_offset)(struct kvm_vcpu *vcpu, u64 target_tsc);
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 957ec87..6486569 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -231,6 +231,7 @@
 #define MSR_IA32_EBL_CR_POWERON0x002a
 #define MSR_EBC_FREQUENCY_ID   0x002c
 #define MSR_IA32_FEATURE_CONTROL0x003a
+#define MSR_IA32_TSC_ADJUST 0x003b
 
 #define FEATURE_CONTROL_LOCKED (10)
 #define FEATURE_CONTROL_VMXON_ENABLED_INSIDE_SMX   (11)
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 0595f13..e817bac 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -320,6 +320,8 @@ static int do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 
function,
if (index == 0) {
entry-ebx = kvm_supported_word9_x86_features;
cpuid_mask(entry-ebx, 9);
+   // TSC_ADJUST is emulated 
+   entry-ebx |= F(TSC_ADJUST);
} else
entry-ebx = 0;
entry-eax = 0;
diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h
index 
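
The scheme described in the changelog can be summarized with a small
stand-alone model. The helper names and the vcpu fields below are
illustrative assumptions, not the patch itself; the arithmetic is the point:
a guest write to IA32_TSC_ADJUST moves the vmcs tsc_offset by the delta and
remembers the written value, while a host-initiated write only reloads the
stored value.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct vcpu_model {
	int64_t ia32_tsc_adjust_msr;	/* emulated MSR value              */
	int64_t vmcs_tsc_offset;	/* what hardware adds to guest TSC */
};

/* WRMSR IA32_TSC_ADJUST as seen by the emulation */
static void wrmsr_tsc_adjust(struct vcpu_model *v, int64_t data,
			     bool host_initiated)
{
	if (!host_initiated)
		v->vmcs_tsc_offset += data - v->ia32_tsc_adjust_msr;
	v->ia32_tsc_adjust_msr = data;	/* rdmsr must return the written value */
}

/* guest rdtsc with the "use TSC offsetting" control enabled */
static uint64_t guest_rdtsc(const struct vcpu_model *v, uint64_t host_tsc)
{
	return host_tsc + (uint64_t)v->vmcs_tsc_offset;
}

int main(void)
{
	struct vcpu_model v = { 0, 0 };

	wrmsr_tsc_adjust(&v, -500, false);	/* guest moves its TSC back */
	printf("rdtsc=%llu adjust=%lld\n",
	       (unsigned long long)guest_rdtsc(&v, 10000),
	       (long long)v.ia32_tsc_adjust_msr);

	wrmsr_tsc_adjust(&v, 0, true);		/* host restore: offset untouched */
	printf("rdtsc=%llu adjust=%lld\n",
	       (unsigned long long)guest_rdtsc(&v, 10000),
	       (long long)v.ia32_tsc_adjust_msr);
	return 0;
}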

[PATCH V2] Resend - Enabling IA32_TSC_ADJUST for Qemu KVM guest VMs

2012-11-26 Thread Will Auld
CPUID.7.0.EBX[1]=1 indicates IA32_TSC_ADJUST MSR 0x3b is supported

Basic design is to emulate the MSR by allowing reads and writes to the
hypervisor vcpu specific locations to store the value of the emulated MSRs.
In this way the IA32_TSC_ADJUST value will be included in all reads to
the TSC MSR whether through rdmsr or rdtsc.

As this is a new MSR that the guest may access and modify, its value needs
to be migrated along with the other MSRs. The changes here are specifically
for recognizing when IA32_TSC_ADJUST is enabled in CPUID, plus code added
for migrating its value.

Signed-off-by: Will Auld will.a...@intel.com
---
 target-i386/cpu.h |  2 ++
 target-i386/kvm.c | 15 +++
 target-i386/machine.c | 21 +
 3 files changed, 38 insertions(+)

diff --git a/target-i386/cpu.h b/target-i386/cpu.h
index aabf993..13d4152 100644
--- a/target-i386/cpu.h
+++ b/target-i386/cpu.h
@@ -284,6 +284,7 @@
 #define MSR_IA32_APICBASE_BSP   (18)
 #define MSR_IA32_APICBASE_ENABLE(111)
 #define MSR_IA32_APICBASE_BASE  (0xf12)
+#define MSR_TSC_ADJUST 0x003b
 #define MSR_IA32_TSCDEADLINE0x6e0
 
 #define MSR_MTRRcap0xfe
@@ -701,6 +702,7 @@ typedef struct CPUX86State {
 uint64_t async_pf_en_msr;
 
 uint64_t tsc;
+uint64_t tsc_adjust;
 uint64_t tsc_deadline;
 
 uint64_t mcg_status;
diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index 696b14a..e974c42 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -61,6 +61,7 @@ const KVMCapabilityInfo kvm_arch_required_capabilities[] = {
 
 static bool has_msr_star;
 static bool has_msr_hsave_pa;
+static bool has_msr_tsc_adjust;
 static bool has_msr_tsc_deadline;
 static bool has_msr_async_pf_en;
 static bool has_msr_misc_enable;
@@ -641,6 +642,10 @@ static int kvm_get_supported_msrs(KVMState *s)
 has_msr_hsave_pa = true;
 continue;
 }
+if (kvm_msr_list-indices[i] == MSR_TSC_ADJUST) {
+has_msr_tsc_adjust = true;
+continue;
+}
 if (kvm_msr_list-indices[i] == MSR_IA32_TSCDEADLINE) {
 has_msr_tsc_deadline = true;
 continue;
@@ -978,6 +983,10 @@ static int kvm_put_msrs(CPUX86State *env, int level)
 if (has_msr_hsave_pa) {
 kvm_msr_entry_set(msrs[n++], MSR_VM_HSAVE_PA, env-vm_hsave);
 }
+if (has_msr_tsc_adjust) {
+kvm_msr_entry_set(msrs[n++], 
+   MSR_TSC_ADJUST, env-tsc_adjust);
+}
 if (has_msr_tsc_deadline) {
 kvm_msr_entry_set(msrs[n++], MSR_IA32_TSCDEADLINE, env-tsc_deadline);
 }
@@ -1234,6 +1243,9 @@ static int kvm_get_msrs(CPUX86State *env)
 if (has_msr_hsave_pa) {
 msrs[n++].index = MSR_VM_HSAVE_PA;
 }
+if (has_msr_tsc_adjust) {
+msrs[n++].index = MSR_TSC_ADJUST;
+}
 if (has_msr_tsc_deadline) {
 msrs[n++].index = MSR_IA32_TSCDEADLINE;
 }
@@ -1308,6 +1320,9 @@ static int kvm_get_msrs(CPUX86State *env)
 case MSR_IA32_TSC:
 env-tsc = msrs[i].data;
 break;
+case MSR_TSC_ADJUST:
+env-tsc_adjust = msrs[i].data;
+break;
 case MSR_IA32_TSCDEADLINE:
 env-tsc_deadline = msrs[i].data;
 break;
diff --git a/target-i386/machine.c b/target-i386/machine.c
index a8be058..95bda9b 100644
--- a/target-i386/machine.c
+++ b/target-i386/machine.c
@@ -310,6 +310,24 @@ static const VMStateDescription vmstate_fpop_ip_dp = {
 }
 };
 
+static bool tsc_adjust_needed(void *opaque)
+{
+CPUX86State *cpu = opaque;
+
+return cpu-tsc_adjust != 0;
+}
+
+static const VMStateDescription vmstate_msr_tsc_adjust = {
+.name = cpu/msr_tsc_adjust,
+.version_id = 1,
+.minimum_version_id = 1,
+.minimum_version_id_old = 1,
+.fields  = (VMStateField []) {
+VMSTATE_UINT64(tsc_adjust, CPUX86State),
+VMSTATE_END_OF_LIST()
+}
+};
+
 static bool tscdeadline_needed(void *opaque)
 {
 CPUX86State *env = opaque;
@@ -457,6 +475,9 @@ static const VMStateDescription vmstate_cpu = {
 .vmsd = vmstate_fpop_ip_dp,
 .needed = fpop_ip_dp_needed,
 }, {
+.vmsd = vmstate_msr_tsc_adjust,
+.needed = tsc_adjust_needed,
+}, {
 .vmsd = vmstate_msr_tscdeadline,
 .needed = tscdeadline_needed,
 }, {
-- 
1.8.0.rc0





Re: [PATCH V3 1/2] Resend - Add code to track call origin for msr assignment.

2012-11-26 Thread Gleb Natapov
Comments are still not addressed.

On Mon, Nov 26, 2012 at 10:40:51AM -0800, Will Auld wrote:
 In order to track who initiated the call (host or guest) to modify an msr
 value I have changed function call parameters along the call path. The
 specific change is to add a struct pointer parameter that points to (index,
 data, caller) information rather than having this information passed as
 individual parameters.
 
 The initial use for this capability is for updating the IA32_TSC_ADJUST
 msr while setting the tsc value. It is anticipated that this capability
 is useful for other tasks.
 
 Signed-off-by: Will Auld will.a...@intel.com
 ---
  arch/x86/include/asm/kvm_host.h | 12 +---
  arch/x86/kvm/svm.c  | 21 +++--
  arch/x86/kvm/vmx.c  | 24 +---
  arch/x86/kvm/x86.c  | 23 +--
  arch/x86/kvm/x86.h  |  2 +-
  5 files changed, 59 insertions(+), 23 deletions(-)
 
 diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
 index 09155d6..da34027 100644
 --- a/arch/x86/include/asm/kvm_host.h
 +++ b/arch/x86/include/asm/kvm_host.h
 @@ -598,6 +598,12 @@ struct kvm_vcpu_stat {
  
  struct x86_instruction_info;
  
 +struct msr_data {
 +bool host_initiated;
 +u32 index;
 +u64 data;
 +};
 +
  struct kvm_x86_ops {
   int (*cpu_has_kvm_support)(void);  /* __init */
   int (*disabled_by_bios)(void); /* __init */
 @@ -621,7 +627,7 @@ struct kvm_x86_ops {
   void (*set_guest_debug)(struct kvm_vcpu *vcpu,
   struct kvm_guest_debug *dbg);
   int (*get_msr)(struct kvm_vcpu *vcpu, u32 msr_index, u64 *pdata);
 - int (*set_msr)(struct kvm_vcpu *vcpu, u32 msr_index, u64 data);
 + int (*set_msr)(struct kvm_vcpu *vcpu, struct msr_data *msr);
   u64 (*get_segment_base)(struct kvm_vcpu *vcpu, int seg);
   void (*get_segment)(struct kvm_vcpu *vcpu,
   struct kvm_segment *var, int seg);
 @@ -772,7 +778,7 @@ static inline int emulate_instruction(struct kvm_vcpu 
 *vcpu,
  
  void kvm_enable_efer_bits(u64);
  int kvm_get_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 *data);
 -int kvm_set_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 data);
 +int kvm_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr);
  
  struct x86_emulate_ctxt;
  
 @@ -799,7 +805,7 @@ void kvm_get_cs_db_l_bits(struct kvm_vcpu *vcpu, int *db, 
 int *l);
  int kvm_set_xcr(struct kvm_vcpu *vcpu, u32 index, u64 xcr);
  
  int kvm_get_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata);
 -int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data);
 +int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr);
  
  unsigned long kvm_get_rflags(struct kvm_vcpu *vcpu);
  void kvm_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags);
 diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
 index baead95..5ac11f0 100644
 --- a/arch/x86/kvm/svm.c
 +++ b/arch/x86/kvm/svm.c
 @@ -1211,6 +1211,7 @@ static struct kvm_vcpu *svm_create_vcpu(struct kvm 
 *kvm, unsigned int id)
   struct page *msrpm_pages;
   struct page *hsave_page;
   struct page *nested_msrpm_pages;
 + struct msr_data msr;
   int err;
  
   svm = kmem_cache_zalloc(kvm_vcpu_cache, GFP_KERNEL);
 @@ -1255,7 +1256,10 @@ static struct kvm_vcpu *svm_create_vcpu(struct kvm 
 *kvm, unsigned int id)
   svm-vmcb_pa = page_to_pfn(page)  PAGE_SHIFT;
   svm-asid_generation = 0;
   init_vmcb(svm);
 - kvm_write_tsc(svm-vcpu, 0);
 + msr.data = 0x0;
 + msr.index = MSR_IA32_TSC;
 + msr.host_initiated = true;
 + kvm_write_tsc(svm-vcpu, msr);
  
   err = fx_init(svm-vcpu);
   if (err)
 @@ -3147,13 +3151,15 @@ static int svm_set_vm_cr(struct kvm_vcpu *vcpu, u64 
 data)
   return 0;
  }
  
 -static int svm_set_msr(struct kvm_vcpu *vcpu, unsigned ecx, u64 data)
 +static int svm_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr)
  {
   struct vcpu_svm *svm = to_svm(vcpu);
  
 + u32 ecx = msr-index;
 + u64 data = msr-data;
   switch (ecx) {
   case MSR_IA32_TSC:
 - kvm_write_tsc(vcpu, data);
 + kvm_write_tsc(vcpu, msr);
   break;
   case MSR_STAR:
   svm-vmcb-save.star = data;
 @@ -3208,20 +3214,23 @@ static int svm_set_msr(struct kvm_vcpu *vcpu, 
 unsigned ecx, u64 data)
   vcpu_unimpl(vcpu, unimplemented wrmsr: 0x%x data 0x%llx\n, 
 ecx, data);
   break;
   default:
 - return kvm_set_msr_common(vcpu, ecx, data);
 + return kvm_set_msr_common(vcpu, msr);
   }
   return 0;
  }
  
  static int wrmsr_interception(struct vcpu_svm *svm)
  {
 + struct msr_data msr;
   u32 ecx = svm-vcpu.arch.regs[VCPU_REGS_RCX];
   u64 data = (svm-vcpu.arch.regs[VCPU_REGS_RAX]  -1u)
   | ((u64)(svm-vcpu.arch.regs[VCPU_REGS_RDX]  -1u)  32);
  
 -
 + msr.data = data;
 + msr.index 

Re: [PATCH v2 00/18] KVM for MIPS32 Processors

2012-11-26 Thread David Daney


I have several general questions about this patch...

On 11/21/2012 06:33 PM, Sanjay Lal wrote:

The following patchset implements KVM support for MIPS32R2 processors,
using Trap  Emulate, with basic runtime binary translation to improve
performance.  The goal has been to keep the Guest kernel changes to a
minimum.


What is the point of minimizing guest kernel changes?

Because you are using an invented memory map, instead of the 
architecturally defined map, there is no hope of running a single kernel 
image both natively and as a guest.  So why do you care about how many 
changes there are?




The patch is against Linux 3.7-rc6.  This is Version 2 of the patch set.

There is a companion patchset for QEMU that adds KVM support for the
MIPS target.

KVM/MIPS should support MIPS32-R2 processors and beyond.
It has been tested on the following platforms:
  - Malta Board with FPGA based 34K (Little Endian).
  - Sigma Designs TangoX board with a 24K based 8654 SoC (Little Endian).
  - Malta Board with 74K @ 1GHz (Little Endian).
  - OVPSim MIPS simulator from Imperas emulating a Malta board with
24Kc and 1074Kc cores (Little Endian).


Unlike x86, there is no concept of a canonical MIPS system for you to 
implement.  So the choice of emulating a Malta or one of the 
SigmaDesigns boards doesn't seem to me to give you anything.


Why not just define the guest system to be exactly the facilities 
provided by the VirtIO drivers?



[...]


Perhaps it is obvious from the patches, but I wasn't able to figure out 
how you solve the problem of the Root/Host kernel clobbering the K0 and 
K1 registers in its exception handlers.  These registers are also used 
by the Guest kernel (aren't they)?


David Daney


Re: [Qemu-devel] [PATCH V2] Resend - Enabling IA32_TSC_ADJUST for Qemu KVM guest VMs

2012-11-26 Thread Andreas Färber
Hello,

On 26.11.2012 19:42, Will Auld wrote:
 CPUID.7.0.EBX[1]=1 indicates IA32_TSC_ADJUST MSR 0x3b is supported
 
 Basic design is to emulate the MSR by allowing reads and writes to the
 hypervisor vcpu specific locations to store the value of the emulated MSRs.
 In this way the IA32_TSC_ADJUST value will be included in all reads to
 the TSC MSR whether through rdmsr or rdtsc.
 
 As this is a new MSR that the guest may access and modify, its value needs
 to be migrated along with the other MSRs. The changes here are specifically
 for recognizing when IA32_TSC_ADJUST is enabled in CPUID, plus the code added
 for migrating its value.
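
In other words, the guest-visible semantics being described boil down to
something like this (an illustrative sketch based only on the text above;
tsc_hw and tsc_adjust are made-up names, not fields from the patch):

    /* Illustration: the IA32_TSC_ADJUST offset is folded into every read
     * of the TSC, whether the guest uses RDTSC or RDMSR. */
    static uint64_t guest_visible_tsc(uint64_t tsc_hw, uint64_t tsc_adjust)
    {
        return tsc_hw + tsc_adjust;
    }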
 
 Signed-off-by: Will Auld will.a...@intel.com

$subject should get a prefix of "target-i386: ", and "Resend" is better
used inside a tag so that it doesn't end up in the commit.
And it's QEMU. ;)

Some more stylistic issues inline:

 ---
  target-i386/cpu.h |  2 ++
  target-i386/kvm.c | 15 +++
  target-i386/machine.c | 21 +
  3 files changed, 38 insertions(+)
 
 diff --git a/target-i386/cpu.h b/target-i386/cpu.h
 index aabf993..13d4152 100644
 --- a/target-i386/cpu.h
 +++ b/target-i386/cpu.h
 @@ -284,6 +284,7 @@
  #define MSR_IA32_APICBASE_BSP   (18)
  #define MSR_IA32_APICBASE_ENABLE(111)
  #define MSR_IA32_APICBASE_BASE  (0xf12)
 +#define MSR_TSC_ADJUST   0x003b

Tabs. You can use scripts/checkpatch.pl to verify.

  #define MSR_IA32_TSCDEADLINE0x6e0
  
  #define MSR_MTRRcap  0xfe
 @@ -701,6 +702,7 @@ typedef struct CPUX86State {
  uint64_t async_pf_en_msr;
  
  uint64_t tsc;
 +uint64_t tsc_adjust;
  uint64_t tsc_deadline;
  
  uint64_t mcg_status;
 diff --git a/target-i386/kvm.c b/target-i386/kvm.c
 index 696b14a..e974c42 100644
 --- a/target-i386/kvm.c
 +++ b/target-i386/kvm.c
 @@ -61,6 +61,7 @@ const KVMCapabilityInfo kvm_arch_required_capabilities[] = {
  
  static bool has_msr_star;
  static bool has_msr_hsave_pa;
 +static bool has_msr_tsc_adjust;
  static bool has_msr_tsc_deadline;
  static bool has_msr_async_pf_en;
  static bool has_msr_misc_enable;
 @@ -641,6 +642,10 @@ static int kvm_get_supported_msrs(KVMState *s)
  has_msr_hsave_pa = true;
  continue;
  }
 +if (kvm_msr_list-indices[i] == MSR_TSC_ADJUST) {
 +has_msr_tsc_adjust = true;
 +continue;
 +}
  if (kvm_msr_list-indices[i] == MSR_IA32_TSCDEADLINE) {
  has_msr_tsc_deadline = true;
  continue;
 @@ -978,6 +983,10 @@ static int kvm_put_msrs(CPUX86State *env, int level)
  if (has_msr_hsave_pa) {
  kvm_msr_entry_set(msrs[n++], MSR_VM_HSAVE_PA, env-vm_hsave);
  }
 +if (has_msr_tsc_adjust) {
 +kvm_msr_entry_set(msrs[n++], 
 + MSR_TSC_ADJUST, env-tsc_adjust);

Tabs.

 +}
  if (has_msr_tsc_deadline) {
  kvm_msr_entry_set(msrs[n++], MSR_IA32_TSCDEADLINE, 
 env-tsc_deadline);
  }
 @@ -1234,6 +1243,9 @@ static int kvm_get_msrs(CPUX86State *env)
  if (has_msr_hsave_pa) {
  msrs[n++].index = MSR_VM_HSAVE_PA;
  }
 +if (has_msr_tsc_adjust) {
 +msrs[n++].index = MSR_TSC_ADJUST;
 +}
  if (has_msr_tsc_deadline) {
  msrs[n++].index = MSR_IA32_TSCDEADLINE;
  }
 @@ -1308,6 +1320,9 @@ static int kvm_get_msrs(CPUX86State *env)
  case MSR_IA32_TSC:
  env-tsc = msrs[i].data;
  break;
 +case MSR_TSC_ADJUST:
 +env-tsc_adjust = msrs[i].data;
 +break;
  case MSR_IA32_TSCDEADLINE:
  env-tsc_deadline = msrs[i].data;
  break;
 diff --git a/target-i386/machine.c b/target-i386/machine.c
 index a8be058..95bda9b 100644
 --- a/target-i386/machine.c
 +++ b/target-i386/machine.c
 @@ -310,6 +310,24 @@ static const VMStateDescription vmstate_fpop_ip_dp = {
  }
  };
  
 +static bool tsc_adjust_needed(void *opaque)
 +{
 +CPUX86State *cpu = opaque;

Please name this env to differentiate from CPUState / X86CPU.
Since there are other tsc_* fields already I won't request that you move
your new field to the containing X86CPU struct but at some point we will
need to convert the VMSDs to X86CPU.
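
Presumably something like this is what is being asked for (an untested
sketch; only the local variable name changes):

    static bool tsc_adjust_needed(void *opaque)
    {
        CPUX86State *env = opaque;

        return env->tsc_adjust != 0;
    }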

 +
 +return cpu-tsc_adjust != 0;
 +}
 +
 +static const VMStateDescription vmstate_msr_tsc_adjust = {
 +.name = cpu/msr_tsc_adjust,
 +.version_id = 1,
 +.minimum_version_id = 1,
 +.minimum_version_id_old = 1,
 +.fields  = (VMStateField []) {
 +VMSTATE_UINT64(tsc_adjust, CPUX86State),
 +VMSTATE_END_OF_LIST()
 +}
 +};
 +
  static bool tscdeadline_needed(void *opaque)
  {
  CPUX86State *env = opaque;
 @@ -457,6 +475,9 @@ static const VMStateDescription vmstate_cpu = {
  .vmsd = vmstate_fpop_ip_dp,
  .needed = fpop_ip_dp_needed,
  }, {
 +.vmsd = 

Re: The smp_affinity cannot work correctly on guest os when PCI passthrough device using msi/msi-x with KVM

2012-11-26 Thread Alex Williamson
On Tue, 2012-11-27 at 00:47 +0800, yi li wrote:
 hi Alex,
 
 the qemu-kvm version 1.2.

And is the device making use of MSI-X or MSI interrupts.  MSI-X should
work on 1.2, MSI does not yet support vector updates for affinity, but
patches are welcome.  Thanks,

Alex

 2012/11/26 Alex Williamson alex.william...@redhat.com:
  On Fri, 2012-11-23 at 11:06 +0800, yi li wrote:
  Hi Guys,
 
  there have a issue about smp_affinity cannot work correctly on guest
  os when PCI passthrough device using msi/msi-x with KVM.
 
  My reason:
  pcpu will occur a lot of ipi interrupt to find the vcpu to handle the
  irq.  so the guest os will VM_EXIT frequelty. right?
 
  if smp_affinity can work correctly on guest os,  the best way is that
  the vcpu handle the irq is cputune at the pcpu which handle the
  kvm:pci-bus irq on the host.but  unfortunly, i find that smp_affinity
  can not work correctly on guest os when msi/msi-x.
 
  how to reproduce:
  1: passthrough a netcard (Brodcom BCM5716S) to the guest os
 
  2: ifup the netcard, the card will use msi-x interrupt default, and close 
  the
  irqbalance service
 
  3:  echo 4  cat /proc/irq/NETCARDIRQ/smp_affinity, so we assume the vcpu2
  handle the irq.
 
  4: we have set vcpupin vcpu='2' cpuset='1'/ and set the irq kvm:pci-bus 
  to
  the pcpu1 on the host.
 
  we think this configure will reduce the ipi interrupt when inject 
  interrupt to
  the guest os. but this irq is not only handle on vcpu2. maybe it is
  not what we expect.
 
  What version of qemu-kvm/qemu are you using?  There's been some work
  recently specifically to enable this.  Thanks,
 
  Alex
 





Re: Performance issue

2012-11-26 Thread George-Cristian Bîrzan
On Sun, Nov 25, 2012 at 6:17 PM, George-Cristian Bîrzan g...@birzan.org wrote:
 On Sun, Nov 25, 2012 at 5:19 PM, Gleb Natapov g...@redhat.com wrote:
 What Windows is this? Can you try changing -cpu host to -cpu
 host,+hv_relaxed?

 This is on Windows Server 2008 R2 (sorry, forgot to mention that I
 guess), and I can try it tomorrow (US time), as getting a stream my
 way depends on complicated stuff. I will though, and let you know how
 it goes.

I changed that, no difference.


--
George-Cristian Bîrzan


[Bug 50891] The smp_affinity cannot work correctly on guest os when PCI passthrough device using msi/msi-x with KVM

2012-11-26 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=50891


Alex Williamson alex.william...@redhat.com changed:

   What|Removed |Added

 CC||alex.william...@redhat.com




--- Comment #1 from Alex Williamson alex.william...@redhat.com  2012-11-26 
19:32:15 ---
MSI-X SMP affinity should be working, MSI SMP affinity is not currently
implemented.  Please clarify whether the device in question is actually making
use of MSI or MSI-X.  Thanks.

-- 
Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email
--- You are receiving this mail because: ---
You are on the CC list for the bug.
You are watching the assignee of the bug.


RE: [PATCH V3 1/2] Resend - Add code to track call origin for msr assignment.

2012-11-26 Thread Auld, Will
Gleb, Marcelo,

Sorry Gleb, I did not see comments from you but I have now found them. In doing 
so I also found one from Marcelo that I missed. 

What I believe is now outstanding to be addressed are:
From Gleb:
- You've changed function pointer signature here, but emulator_set_msr() 
remained the same
- Also I would prefer adding host_initiated parameter to kvm_set_msr() instead 
of introducing msr_data structure.

From Marcelo:
- false, this is guest instruction emulation

I will address these points. However, Gleb, your second item above (the 
host_initiated parameter) was implemented earlier but then rejected in favor of 
the msr_data structure, based on discussion with both Avi and Marcelo. I will 
leave this as is. 

Thanks,

Will 

 -Original Message-
 From: Gleb Natapov [mailto:g...@redhat.com]
 Sent: Monday, November 26, 2012 10:47 AM
 To: Auld, Will
 Cc: qemu-devel; mtosa...@redhat.com; kvm@vger.kernel.org; Dugger,
 Donald D; Liu, Jinsong; Zhang, Xiantao; a...@redhat.com
 Subject: Re: [PATCH V3 1/2] Resend - Add code to track call origin for
 msr assignment.
 
 Comments are still not addressed.
 
 On Mon, Nov 26, 2012 at 10:40:51AM -0800, Will Auld wrote:
  In order to track who initiated the call (host or guest) to modify an
  msr value I have changed function call parameters along the call
 path.
  The specific change is to add a struct pointer parameter that points
  to (index, data, caller) information rather than having this
  information passed as individual parameters.
 
  The initial use for this capability is for updating the
  IA32_TSC_ADJUST msr while setting the tsc value. It is anticipated
  that this capability is useful for other tasks.
 
  Signed-off-by: Will Auld will.a...@intel.com
  ---
   arch/x86/include/asm/kvm_host.h | 12 +---
   arch/x86/kvm/svm.c  | 21 +++--
   arch/x86/kvm/vmx.c  | 24 +---
   arch/x86/kvm/x86.c  | 23 +--
   arch/x86/kvm/x86.h  |  2 +-
   5 files changed, 59 insertions(+), 23 deletions(-)
 
  diff --git a/arch/x86/include/asm/kvm_host.h
  b/arch/x86/include/asm/kvm_host.h index 09155d6..da34027 100644
  --- a/arch/x86/include/asm/kvm_host.h
  +++ b/arch/x86/include/asm/kvm_host.h
  @@ -598,6 +598,12 @@ struct kvm_vcpu_stat {
 
   struct x86_instruction_info;
 
  +struct msr_data {
  +bool host_initiated;
  +u32 index;
  +u64 data;
  +};
  +
   struct kvm_x86_ops {
  int (*cpu_has_kvm_support)(void);  /* __init */
  int (*disabled_by_bios)(void); /* __init */
  @@ -621,7 +627,7 @@ struct kvm_x86_ops {
  void (*set_guest_debug)(struct kvm_vcpu *vcpu,
  struct kvm_guest_debug *dbg);
  int (*get_msr)(struct kvm_vcpu *vcpu, u32 msr_index, u64 *pdata);
  -   int (*set_msr)(struct kvm_vcpu *vcpu, u32 msr_index, u64 data);
  +   int (*set_msr)(struct kvm_vcpu *vcpu, struct msr_data *msr);
  u64 (*get_segment_base)(struct kvm_vcpu *vcpu, int seg);
  void (*get_segment)(struct kvm_vcpu *vcpu,
  struct kvm_segment *var, int seg); @@ -772,7
 +778,7 @@ static
  inline int emulate_instruction(struct kvm_vcpu *vcpu,
 
   void kvm_enable_efer_bits(u64);
   int kvm_get_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 *data);
  -int kvm_set_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 data);
  +int kvm_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr);
 
   struct x86_emulate_ctxt;
 
  @@ -799,7 +805,7 @@ void kvm_get_cs_db_l_bits(struct kvm_vcpu *vcpu,
  int *db, int *l);  int kvm_set_xcr(struct kvm_vcpu *vcpu, u32 index,
  u64 xcr);
 
   int kvm_get_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata);
  -int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data);
  +int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr);
 
   unsigned long kvm_get_rflags(struct kvm_vcpu *vcpu);  void
  kvm_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags); diff
  --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index
 baead95..5ac11f0
  100644
  --- a/arch/x86/kvm/svm.c
  +++ b/arch/x86/kvm/svm.c
  @@ -1211,6 +1211,7 @@ static struct kvm_vcpu *svm_create_vcpu(struct
 kvm *kvm, unsigned int id)
  struct page *msrpm_pages;
  struct page *hsave_page;
  struct page *nested_msrpm_pages;
  +   struct msr_data msr;
  int err;
 
  svm = kmem_cache_zalloc(kvm_vcpu_cache, GFP_KERNEL); @@ -1255,7
  +1256,10 @@ static struct kvm_vcpu *svm_create_vcpu(struct kvm *kvm,
 unsigned int id)
  svm-vmcb_pa = page_to_pfn(page)  PAGE_SHIFT;
  svm-asid_generation = 0;
  init_vmcb(svm);
  -   kvm_write_tsc(svm-vcpu, 0);
  +   msr.data = 0x0;
  +   msr.index = MSR_IA32_TSC;
  +   msr.host_initiated = true;
  +   kvm_write_tsc(svm-vcpu, msr);
 
  err = fx_init(svm-vcpu);
  if (err)
  @@ -3147,13 +3151,15 @@ static int svm_set_vm_cr(struct kvm_vcpu
 *vcpu, u64 data)
  return 0;
   }
 
  -static int 

Re: [PATCH RFC V2 0/5] Separate consigned (expected steal) from steal time.

2012-11-26 Thread Michael Wolf

On 10/22/2012 10:33 AM, Rik van Riel wrote:

On 10/16/2012 10:23 PM, Michael Wolf wrote:

In the case of where you have a system that is running in a
capped or overcommitted environment the user may see steal time
being reported in accounting tools such as top or vmstat.  This can
cause confusion for the end user.


How do s390 and Power systems deal with reporting that kind
of information?

IMHO it would be good to see what those do, so we do not end
up re-inventing the wheel, and confusing admins with yet another
way of reporting the information...

Sorry for the delay in the response.  I'm assuming you are asking about 
s390 and Power LPARs.
In the case of an LPAR on POWER systems they simply report steal time and 
do not alter it in any way.
They do, however, report how much processor capacity is assigned to the 
partition, and that information is in /proc/ppc64/lparcfg.

Mike



Re: pci_enable_msix() fails with ENOMEM/EINVAL

2012-11-26 Thread Alex Williamson
On Thu, 2012-11-22 at 10:52 +0200, Alex Lyakas wrote:
 Hi Alex,
 thanks for your response.
 
 I printed out the vector and entry values of dev-host_msix_entries[i] 
 within assigned_device_enable_host_msix() before call to 
 request_threaded_irq(). I see that they are all 0s:
 kernel: [ 3332.610980] kvm-8095: KVM_ASSIGN_DEV_IRQ assigned_dev_id=924
 kernel: [ 3332.610985] kvm-8095: assigned_device_enable_host_msix() 
 assigned_dev_id=924 #0: [v=0 e=0]
 kernel: [ 3332.610989] kvm-8095: assigned_device_enable_host_msix() 
 assigned_dev_id=924 #1: [v=0 e=1]
 kernel: [ 3332.610992] kvm-8095: assigned_device_enable_host_msix() 
 assigned_dev_id=924 #2: [v=0 e=2]
 
 So I don't really understand how they all ask for irq=0; I must be missing 
 something. Is there any other explanation of request_threaded_irq() to 
 return EBUSY? From the code I don't see that there is.

The vectors all being zero sounds like an indication that
pci_enable_msix() didn't actually work.  Each of those should be a
unique vector.   Does booting the host with nointremap perhaps make a
difference?  Maybe we can isolate the problem to the interrupt remapper
code.
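
For reference, the usual sequence with the pci_enable_msix() interface of
that era looks roughly like the sketch below; after a successful call each
entry should carry a distinct, non-zero vector (pdev and the count of 3 are
made up for illustration):

    struct msix_entry entries[3];
    int i, ret;

    for (i = 0; i < 3; i++)
        entries[i].entry = i;   /* which MSI-X table entries we want */

    ret = pci_enable_msix(pdev, entries, 3);
    if (ret == 0) {
        /* entries[i].vector now holds a unique host IRQ number suitable
         * for request_threaded_irq(); all-zero vectors here would mean
         * the enable did not really take effect. */
    }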

 This issue is reproducible and is not going to go away by itself. Working 
 around it is also problematic. We thought to check whether all IRQs are 
 properly attached after QEMU sets the vm state to running. However, vm 
 state is set to running before IRQ attachments are performed; we debugged 
 this and found out that they are done from a different thread, from a stack 
 trace like this:
 kvm_assign_irq()
 assigned_dev_update_msix()
 assigned_dev_pci_write_config()
 pci_host_config_write_common()
 pci_data_write()
 pci_host_data_write()
 memory_region_write_accessor()
 access_with_adjusted_size()
 memory_region_iorange_write()
 ioport_writew_thunk()
 ioport_write()
 cpu_outw()
 kvm_handle_io()
 kvm_cpu_exec()
 qemu_kvm_cpu_thread_fn()
 
 So looks like this is performed on-demand (on first IO), so no reliable 
 point to check that IRQs are attached properly.

Correct, MSI-X is setup when the guest enables MSI-X on the device,
which is likely a long way into guest boot.  There's no guarantee that
the guest ever enables MSI-X, so there's no association to whether the
guest is running.

  Another issue that in KVM 
 code the return value of pci_host_config_write_common() is not checked, so 
 there is no way to report a failure.

A common problem in qemu, imho

 Is there any way you think you can help me debug this further?

It seems like pci_enable_msix is still failing, but perhaps silently
without irqbalance.  We need to figure out where and why.  Isolating it
to the interrupt remapper with nointremap might give us some clues
(this is an Intel VT-d system, right?).  Thanks,

Alex


 -Original Message- 
 From: Alex Williamson
 Sent: 22 November, 2012 12:25 AM
 To: Alex Lyakas
 Cc: kvm@vger.kernel.org
 Subject: Re: pci_enable_msix() fails with ENOMEM/EINVAL
 
 On Wed, 2012-11-21 at 16:19 +0200, Alex Lyakas wrote:
  Hi,
  I was advised to turn off irqbalance and reproduced this issue, but
  the failure is in a different place now. Now request_threaded_irq()
  fails with EBUSY.
  According to the code, this can only happen on the path:
  request_threaded_irq() - __setup_irq()
  Now in setup irq, the only place where EBUSY can show up for us is here:
  ...
  raw_spin_lock_irqsave(desc-lock, flags);
  old_ptr = desc-action;
  old = *old_ptr;
  if (old) {
  /*
  * Can't share interrupts unless both agree to and are
  * the same type (level, edge, polarity). So both flag
  * fields must have IRQF_SHARED set and the bits which
  * set the trigger type must match. Also all must
  * agree on ONESHOT.
  */
  if (!((old-flags  new-flags)  IRQF_SHARED) ||
  ((old-flags ^ new-flags)  IRQF_TRIGGER_MASK) ||
  ((old-flags ^ new-flags)  IRQF_ONESHOT)) {
  old_name = old-name;
  goto mismatch;
  }
 
  /* All handlers must agree on per-cpuness */
  if ((old-flags  IRQF_PERCPU) !=
  (new-flags  IRQF_PERCPU))
  goto mismatch;
 
  KVM calls request_threaded_irq() with flags==0, so can it be that
  different KVM processes request the same IRQ?
 
 Shouldn't be possible, irqs are allocated from a bitmap protected by a
 mutex, see __irq_alloc_descs
 
   How different KVM
  processes spawned simultaneously agree between them on IRQ numbers?
 
 They don't, MSI/X vectors are not currently share-able.  Can you show
 that you're actually getting duplicate irq vectors?  Thanks,
 
 Alex
 
 --
 To unsubscribe from this list: send the line unsubscribe kvm in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html





Re: [Bug 50921] kvm hangs booting Windows 2000

2012-11-26 Thread Xiao Guangrong
On 11/24/2012 09:44 PM, bugzilla-dae...@bugzilla.kernel.org wrote:
 https://bugzilla.kernel.org/show_bug.cgi?id=50921
 
 
 
 
 
 --- Comment #5 from Lucio Crusca lu...@sulweb.org  2012-11-24 13:44:16 ---
 Here the first tests results:
 
 vbox modules do not make a difference (tried rmmod vboxpci vboxnetadp
 vboxnetflt vboxdrv and then kvm ...).
 
 The trace.dat is about 60M, I could upload it somewhere, however I tried
 looking at it and I'm reasonably sure it hangs here:
 
 $ trace-cmd report | grep 125\\.332 | tail
  kvm-6588  [000]   125.332264: kvm_entry:vcpu 0
  kvm-6588  [000]   125.332264: kvm_emulate_insn: 1:44f8: 
 75
 27

Hmm... no 'kvm_exit' message. It looks like the infinite loop is caused by:

|   /* Don't enter VMX if guest state is invalid, let the exit handler
|  start emulation until we arrive back to a valid state */
|   if (vmx->emulation_required && emulate_invalid_guest_state)
|   return;

(vmx_vcpu_run in arch/x86/kvm/vmx.c)

And, I noticed 'ept' is not supported on your box, which means
'enable_unrestricted_guest' is disabled. I guess something goes wrong
when emulating big real mode.

Could you reload kvm-intel.ko with 'emulate_invalid_guest_state = 0',
and see what will happen.



[Bug 50921] kvm hangs booting Windows 2000

2012-11-26 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=50921





--- Comment #14 from Anonymous Emailer anonym...@kernel-bugs.osdl.org  
2012-11-26 20:10:11 ---
Reply-To: xiaoguangr...@linux.vnet.ibm.com

On 11/24/2012 09:44 PM, bugzilla-dae...@bugzilla.kernel.org wrote:
 https://bugzilla.kernel.org/show_bug.cgi?id=50921
 
 
 
 
 
 --- Comment #5 from Lucio Crusca lu...@sulweb.org  2012-11-24 13:44:16 ---
 Here the first tests results:
 
 vbox modules do not make a difference (tried rmmod vboxpci vboxnetadp
 vboxnetflt vboxdrv and then kvm ...).
 
 The trace.dat is about 60M, I could upload it somewhere, however I tried
 looking at it and I'm reasonably sure it hangs here:
 
 $ trace-cmd report | grep 125\\.332 | tail
  kvm-6588  [000]   125.332264: kvm_entry:vcpu 0
  kvm-6588  [000]   125.332264: kvm_emulate_insn: 1:44f8: 
 75
 27

Hmm... no 'kvm_exit' message. It looks like the infinite loop is caused by:

|/* Don't enter VMX if guest state is invalid, let the exit handler
|   start emulation until we arrive back to a valid state */
|if (vmx-emulation_required  emulate_invalid_guest_state)
|return;

(vmx_vcpu_run in arch/x86/kvm/vmx.c)

And, i noticed 'ept' is not supported on your box, that means
'enable_unrestricted_guest' is disabled. I guess something was wrong
when emulate big real mode.

Could you reload kvm-intel.ko with 'emulate_invalid_guest_state = 0',
and see what will happen.

-- 
Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email
--- You are receiving this mail because: ---
You are watching the assignee of the bug.


Re: KVM Disk i/o or VM activities causes soft lockup?

2012-11-26 Thread Vincent Li
On Mon, Nov 26, 2012 at 2:58 AM, Stefan Hajnoczi stefa...@gmail.com wrote:
 On Fri, Nov 23, 2012 at 10:34:16AM -0800, Vincent Li wrote:
 On Thu, Nov 22, 2012 at 11:29 PM, Stefan Hajnoczi stefa...@gmail.com wrote:
  On Wed, Nov 21, 2012 at 03:36:50PM -0800, Vincent Li wrote:
  We have users running on redhat based distro (Kernel
  2.6.32-131.21.1.el6.x86_64 ) with kvm, when customer made cron job
  script to copy large files between kvm guest or some other user space
  program leads to disk i/o or VM activities, users get following soft
  lockup message from console:
 
  Nov 17 13:44:46 slot1/luipaard100a err kernel: BUG: soft lockup -
  CPU#4 stuck for 61s! [qemu-kvm:6795]
  Nov 17 13:44:46 slot1/luipaard100a warning kernel: Modules linked in:
  ebt_vlan nls_utf8 isofs ebtable_filter ebtables 8021q garp bridge stp
  llc ipt_REJECT iptable_filter xt_NOTRACK nf_conntrack iptable_raw
  ip_tables loop ext2 binfmt_misc hed womdict(U) vnic(U) parport_pc lp
  parport predis(U) lasthop(U) ipv6 toggler vhost_net tun kvm_intel kvm
  jiffies(U) sysstats hrsleep i2c_dev datastor(U) linux_user_bde(P)(U)
  linux_kernel_bde(P)(U) tg3 libphy serio_raw i2c_i801 i2c_core ehci_hcd
  raid1 raid0 virtio_pci virtio_blk virtio virtio_ring mvsas libsas
  scsi_transport_sas mptspi mptscsih mptbase scsi_transport_spi 3w_9xxx
  sata_svw(U) ahci serverworks sata_sil ata_piix libata sd_mod
  crc_t10dif amd74xx piix ide_gd_mod ide_core dm_snapshot dm_mirror
  dm_region_hash dm_log dm_mod ext3 jbd mbcache
  Nov 17 13:44:46 slot1/luipaard100a warning kernel: Pid: 6795, comm:
  qemu-kvm Tainted: P   
  2.6.32-131.21.1.el6.f5.x86_64 #1
  Nov 17 13:44:46 slot1/luipaard100a warning kernel: Call Trace:
  Nov 17 13:44:46 slot1/luipaard100a warning kernel: IRQ
  [81084f95] ? get_timestamp+0x9/0xf
  Nov 17 13:44:46 slot1/luipaard100a warning kernel:
  [810855d6] ? watchdog_timer_fn+0x130/0x178
  Nov 17 13:44:46 slot1/luipaard100a warning kernel:
  [81059f11] ? __run_hrtimer+0xa3/0xff
  Nov 17 13:44:46 slot1/luipaard100a warning kernel:
  [8105a188] ? hrtimer_interrupt+0xe6/0x190
  Nov 17 13:44:46 slot1/luipaard100a warning kernel:
  [8105a14b] ? hrtimer_interrupt+0xa9/0x190
  Nov 17 13:44:46 slot1/luipaard100a warning kernel:
  [8101e5a9] ? hpet_interrupt_handler+0x26/0x2d
  Nov 17 13:44:46 slot1/luipaard100a warning kernel:
  [8105a26f] ? hrtimer_peek_ahead_timers+0x9/0xd
  Nov 17 13:44:46 slot1/luipaard100a warning kernel:
  [81044fcc] ? __do_softirq+0xc5/0x17a
  Nov 17 13:44:46 slot1/luipaard100a warning kernel:
  [81003adc] ? call_softirq+0x1c/0x28
  Nov 17 13:44:46 slot1/luipaard100a warning kernel:
  [8100506b] ? do_softirq+0x31/0x66
  Nov 17 13:44:46 slot1/luipaard100a warning kernel:
  [81003673] ? call_function_interrupt+0x13/0x20
  Nov 17 13:44:46 slot1/luipaard100a warning kernel: EOI
  [a0219986] ? vmx_get_msr+0x0/0x123 [kvm_intel]
  Nov 17 13:44:46 slot1/luipaard100a warning kernel:
  [a01d11c0] ? kvm_arch_vcpu_ioctl_run+0x80e/0xaf1 [kvm]
  Nov 17 13:44:46 slot1/luipaard100a warning kernel:
  [a01d11b4] ? kvm_arch_vcpu_ioctl_run+0x802/0xaf1 [kvm]
  Nov 17 13:44:46 slot1/luipaard100a warning kernel:
  [8114e59b] ? inode_has_perm+0x65/0x72
  Nov 17 13:44:46 slot1/luipaard100a warning kernel:
  [a01c77f5] ? kvm_vcpu_ioctl+0xf2/0x5ba [kvm]
  Nov 17 13:44:46 slot1/luipaard100a warning kernel:
  [8114e642] ? file_has_perm+0x9a/0xac
  Nov 17 13:44:46 slot1/luipaard100a warning kernel:
  [810f9ec2] ? vfs_ioctl+0x21/0x6b
  Nov 17 13:44:46 slot1/luipaard100a warning kernel:
  [810fa406] ? do_vfs_ioctl+0x487/0x4da
  Nov 17 13:44:46 slot1/luipaard100a warning kernel:
  [810fa4aa] ? sys_ioctl+0x51/0x70
  Nov 17 13:44:46 slot1/luipaard100a warning kernel:
  [810029d1] ? system_call_fastpath+0x3c/0x41
 
  This soft lockup is report on the host?
 
  Stefan

 Yes, it is on host. we just recommend users not doing large file
 copying, just wondering if there is potential kernel bug. it seems the
 softlockup backtrace pointing to hrtimer and softirq. my naive
 knowledge is that the watchdog thread is on top of hrtimer which is on
 top of softirq.

 Since the soft lockup detector is firing on the host, this seems like a
 hardware/driver problem.  Have you ever had soft lockups running non-KVM
 workloads on this host?

 Stefan

This soft lockup only triggers when running KVM. The users also had a cron
job script that restarted 4 KVM instances every 5 minutes (insane to me),
which also caused tons of soft lockup messages during KVM instance startup.
We have already told the customer to stop doing that and the soft lockup
messages disappeared.

Vincent


Re: [Bug 50921] kvm hangs booting Windows 2000

2012-11-26 Thread Xiao Guangrong
Sorry, forgot to CC Lucio Crusca.

On 11/27/2012 04:09 AM, Xiao Guangrong wrote:
 On 11/24/2012 09:44 PM, bugzilla-dae...@bugzilla.kernel.org wrote:
 https://bugzilla.kernel.org/show_bug.cgi?id=50921





 --- Comment #5 from Lucio Crusca lu...@sulweb.org  2012-11-24 13:44:16 ---
 Here the first tests results:

 vbox modules do not make a difference (tried rmmod vboxpci vboxnetadp
 vboxnetflt vboxdrv and then kvm ...).

 The trace.dat is about 60M, I could upload it somewhere, however I tried
 looking at it and I'm reasonably sure it hangs here:

 $ trace-cmd report | grep 125\\.332 | tail
  kvm-6588  [000]   125.332264: kvm_entry:vcpu 0
  kvm-6588  [000]   125.332264: kvm_emulate_insn: 1:44f8: 
 75
 27
 
 Hmm... no 'kvm_exit' message. It looks like the infinite loop is caused by:
 
 | /* Don't enter VMX if guest state is invalid, let the exit handler
 |start emulation until we arrive back to a valid state */
 | if (vmx-emulation_required  emulate_invalid_guest_state)
 | return;
 
 (vmx_vcpu_run in arch/x86/kvm/vmx.c)
 
 And, i noticed 'ept' is not supported on your box, that means
 'enable_unrestricted_guest' is disabled. I guess something was wrong
 when emulate big real mode.
 
 Could you reload kvm-intel.ko with 'emulate_invalid_guest_state = 0',
 and see what will happen.
 
 --
 To unsubscribe from this list: send the line unsubscribe kvm in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 
 



[Bug 50921] kvm hangs booting Windows 2000

2012-11-26 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=50921





--- Comment #15 from Anonymous Emailer anonym...@kernel-bugs.osdl.org  
2012-11-26 20:29:28 ---
Reply-To: xiaoguangr...@linux.vnet.ibm.com

Sorry, forgot to CC Lucio Crusca.

On 11/27/2012 04:09 AM, Xiao Guangrong wrote:
 On 11/24/2012 09:44 PM, bugzilla-dae...@bugzilla.kernel.org wrote:
 https://bugzilla.kernel.org/show_bug.cgi?id=50921





 --- Comment #5 from Lucio Crusca lu...@sulweb.org  2012-11-24 13:44:16 ---
 Here the first tests results:

 vbox modules do not make a difference (tried rmmod vboxpci vboxnetadp
 vboxnetflt vboxdrv and then kvm ...).

 The trace.dat is about 60M, I could upload it somewhere, however I tried
 looking at it and I'm reasonably sure it hangs here:

 $ trace-cmd report | grep 125\\.332 | tail
  kvm-6588  [000]   125.332264: kvm_entry:vcpu 0
  kvm-6588  [000]   125.332264: kvm_emulate_insn: 1:44f8: 
 75
 27
 
 Hmm... no 'kvm_exit' message. It looks like the infinite loop is caused by:
 
 | /* Don't enter VMX if guest state is invalid, let the exit handler
 |start emulation until we arrive back to a valid state */
 | if (vmx-emulation_required  emulate_invalid_guest_state)
 | return;
 
 (vmx_vcpu_run in arch/x86/kvm/vmx.c)
 
 And, i noticed 'ept' is not supported on your box, that means
 'enable_unrestricted_guest' is disabled. I guess something was wrong
 when emulate big real mode.
 
 Could you reload kvm-intel.ko with 'emulate_invalid_guest_state = 0',
 and see what will happen.
 
 --
 To unsubscribe from this list: send the line unsubscribe kvm in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 


-- 
Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email
--- You are receiving this mail because: ---
You are watching the assignee of the bug.


Re: [PATCH V3 1/2] Resend - Add code to track call origin for msr assignment.

2012-11-26 Thread Gleb Natapov
On Mon, Nov 26, 2012 at 07:42:28PM +, Auld, Will wrote:
 Gleb, Marcelo,
 
 Sorry Gleb, I did not see comments from you but I have now found them. In 
 doing so I also found one from Marcelo that I missed. 
 
 What I believe is now outstanding to be addressed are:
 From Gleb:
 - You've changed function pointer signature here, but emulator_set_msr() 
 remained the same
 - Also I would prefer adding host_initiated parameter to kvm_set_msr() 
 instead of introducing msr_data structure.
 
 From Marcelo:
 - false, this is guest instruction emulation
 
 I will address these points. However Gleb, your second item above, 
 host_initiated parameter was implemented but then rejected agreeing that the 
 msr_data structure would be a better solution. This was based on discussion 
 with both Avi and Marcelo. I will leave this as is. 
 
OK. Thanks.


 Thanks,
 
 Will 
 
  -Original Message-
  From: Gleb Natapov [mailto:g...@redhat.com]
  Sent: Monday, November 26, 2012 10:47 AM
  To: Auld, Will
  Cc: qemu-devel; mtosa...@redhat.com; kvm@vger.kernel.org; Dugger,
  Donald D; Liu, Jinsong; Zhang, Xiantao; a...@redhat.com
  Subject: Re: [PATCH V3 1/2] Resend - Add code to track call origin for
  msr assignment.
  
  Comments are still not addressed.
  
  On Mon, Nov 26, 2012 at 10:40:51AM -0800, Will Auld wrote:
   In order to track who initiated the call (host or guest) to modify an
   msr value I have changed function call parameters along the call
  path.
   The specific change is to add a struct pointer parameter that points
   to (index, data, caller) information rather than having this
   information passed as individual parameters.
  
   The initial use for this capability is for updating the
   IA32_TSC_ADJUST msr while setting the tsc value. It is anticipated
   that this capability is useful for other tasks.
  
   Signed-off-by: Will Auld will.a...@intel.com
   ---
arch/x86/include/asm/kvm_host.h | 12 +---
arch/x86/kvm/svm.c  | 21 +++--
arch/x86/kvm/vmx.c  | 24 +---
arch/x86/kvm/x86.c  | 23 +--
arch/x86/kvm/x86.h  |  2 +-
5 files changed, 59 insertions(+), 23 deletions(-)
  
   diff --git a/arch/x86/include/asm/kvm_host.h
   b/arch/x86/include/asm/kvm_host.h index 09155d6..da34027 100644
   --- a/arch/x86/include/asm/kvm_host.h
   +++ b/arch/x86/include/asm/kvm_host.h
   @@ -598,6 +598,12 @@ struct kvm_vcpu_stat {
  
struct x86_instruction_info;
  
   +struct msr_data {
   +bool host_initiated;
   +u32 index;
   +u64 data;
   +};
   +
struct kvm_x86_ops {
 int (*cpu_has_kvm_support)(void);  /* __init */
 int (*disabled_by_bios)(void); /* __init */
   @@ -621,7 +627,7 @@ struct kvm_x86_ops {
 void (*set_guest_debug)(struct kvm_vcpu *vcpu,
 struct kvm_guest_debug *dbg);
 int (*get_msr)(struct kvm_vcpu *vcpu, u32 msr_index, u64 *pdata);
   - int (*set_msr)(struct kvm_vcpu *vcpu, u32 msr_index, u64 data);
   + int (*set_msr)(struct kvm_vcpu *vcpu, struct msr_data *msr);
 u64 (*get_segment_base)(struct kvm_vcpu *vcpu, int seg);
 void (*get_segment)(struct kvm_vcpu *vcpu,
 struct kvm_segment *var, int seg); @@ -772,7
  +778,7 @@ static
   inline int emulate_instruction(struct kvm_vcpu *vcpu,
  
void kvm_enable_efer_bits(u64);
int kvm_get_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 *data);
   -int kvm_set_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 data);
   +int kvm_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr);
  
struct x86_emulate_ctxt;
  
   @@ -799,7 +805,7 @@ void kvm_get_cs_db_l_bits(struct kvm_vcpu *vcpu,
   int *db, int *l);  int kvm_set_xcr(struct kvm_vcpu *vcpu, u32 index,
   u64 xcr);
  
int kvm_get_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata);
   -int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data);
   +int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr);
  
unsigned long kvm_get_rflags(struct kvm_vcpu *vcpu);  void
   kvm_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags); diff
   --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index
  baead95..5ac11f0
   100644
   --- a/arch/x86/kvm/svm.c
   +++ b/arch/x86/kvm/svm.c
   @@ -1211,6 +1211,7 @@ static struct kvm_vcpu *svm_create_vcpu(struct
  kvm *kvm, unsigned int id)
 struct page *msrpm_pages;
 struct page *hsave_page;
 struct page *nested_msrpm_pages;
   + struct msr_data msr;
 int err;
  
 svm = kmem_cache_zalloc(kvm_vcpu_cache, GFP_KERNEL); @@ -1255,7
   +1256,10 @@ static struct kvm_vcpu *svm_create_vcpu(struct kvm *kvm,
  unsigned int id)
 svm-vmcb_pa = page_to_pfn(page)  PAGE_SHIFT;
 svm-asid_generation = 0;
 init_vmcb(svm);
   - kvm_write_tsc(svm-vcpu, 0);
   + msr.data = 0x0;
   + msr.index = MSR_IA32_TSC;
   + msr.host_initiated = true;
   + kvm_write_tsc(svm-vcpu, msr);
  
 err = 

[PATCH 0/5] Alter steal time reporting in KVM

2012-11-26 Thread Michael Wolf
In the case of where you have a system that is running in a
capped or overcommitted environment the user may see steal time
being reported in accounting tools such as top or vmstat.  This can
cause confusion for the end user.  To ease the confusion this patch set
adds the idea of consigned (expected steal) time.  The host will separate
the consigned time from the steal time.  The consignment limit passed to the
host will be the amount of steal time expected within a fixed period of
time.  Any other steal time accruing during that period will show as the
traditional steal time.
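
As a rough illustration of the intended split (the numbers and the helper
below are made up, not taken from the patches): with a 100ms period and a
consignment limit of 30ms, the first 30ms of involuntary wait seen in each
period would be reported as consigned time and only the remainder as steal
time, along these lines:

    /* sketch of the accounting split within one period */
    static u64 split_delta(u64 delta, u64 *cur_consigned, u64 limit,
                           u64 *consigned_delta)
    {
        u64 steal_delta;

        if (*cur_consigned + delta <= limit) {
            *consigned_delta = delta;   /* still within the expected amount */
            steal_delta = 0;
        } else {
            *consigned_delta = limit - *cur_consigned;
            steal_delta = delta - *consigned_delta;
        }
        *cur_consigned += *consigned_delta;
        return steal_delta;
    }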

---

Michael Wolf (5):
  Alter the amount of steal time reported by the guest.
  Expand the steal time msr to also contain the consigned time.
  Add the code to send the consigned time from the host to the guest
  Add a timer to allow the separation of consigned from steal time.
  Add an ioctl to communicate the consign limit to the host.


 arch/x86/include/asm/kvm_host.h   |   11 +++
 arch/x86/include/asm/kvm_para.h   |3 +-
 arch/x86/include/asm/paravirt.h   |4 +--
 arch/x86/include/asm/paravirt_types.h |2 +
 arch/x86/kernel/kvm.c |8 ++---
 arch/x86/kernel/paravirt.c|4 +--
 arch/x86/kvm/x86.c|   50 -
 fs/proc/stat.c|9 +-
 include/linux/kernel_stat.h   |2 +
 include/linux/kvm_host.h  |2 +
 include/uapi/linux/kvm.h  |2 +
 kernel/sched/core.c   |   10 ++-
 kernel/sched/cputime.c|   21 +-
 kernel/sched/sched.h  |2 +
 virt/kvm/kvm_main.c   |7 +
 15 files changed, 120 insertions(+), 17 deletions(-)

-- 
Signature



[PATCH 1/5] Alter the amount of steal time reported by the guest.

2012-11-26 Thread Michael Wolf
Modify the amount of steal time that the kernel reports via the /proc interface.
Steal time will now be broken down into steal_time and consigned_time.
Consigned_time will represent the amount of time that is expected to be lost
due to overcommitment of the physical cpu or by using cpu capping.  The
consigned_time amount will be passed in using an ioctl.  The time will be
expressed as the number of nanoseconds expected to be lost during the fixed
period.  The fixed period is currently 1/10th of a second.
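
With this change each cpu line in /proc/stat grows one more column after
guest_nice.  A user-space reader that only wants the new field could do
something like the following (illustrative only; the values are in USER_HZ
ticks like the other columns):

    #include <stdio.h>

    int main(void)
    {
        unsigned long long v[10], consign;
        FILE *f = fopen("/proc/stat", "r");

        if (!f)
            return 1;
        /* aggregate "cpu" line: user nice system idle iowait irq softirq
         * steal guest guest_nice consign */
        if (fscanf(f, "cpu %llu %llu %llu %llu %llu %llu %llu %llu %llu %llu %llu",
                   &v[0], &v[1], &v[2], &v[3], &v[4], &v[5], &v[6],
                   &v[7], &v[8], &v[9], &consign) == 11)
            printf("steal=%llu consigned=%llu\n", v[7], consign);
        fclose(f);
        return 0;
    }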

Signed-off-by: Michael Wolf m...@linux.vnet.ibm.com
---
 fs/proc/stat.c  |9 +++--
 include/linux/kernel_stat.h |1 +
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/fs/proc/stat.c b/fs/proc/stat.c
index e296572..cb7fe80 100644
--- a/fs/proc/stat.c
+++ b/fs/proc/stat.c
@@ -82,7 +82,7 @@ static int show_stat(struct seq_file *p, void *v)
int i, j;
unsigned long jif;
u64 user, nice, system, idle, iowait, irq, softirq, steal;
-   u64 guest, guest_nice;
+   u64 guest, guest_nice, consign;
u64 sum = 0;
u64 sum_softirq = 0;
unsigned int per_softirq_sums[NR_SOFTIRQS] = {0};
@@ -90,10 +90,11 @@ static int show_stat(struct seq_file *p, void *v)
 
user = nice = system = idle = iowait =
irq = softirq = steal = 0;
-   guest = guest_nice = 0;
+   guest = guest_nice = consign = 0;
getboottime(boottime);
jif = boottime.tv_sec;
 
+
for_each_possible_cpu(i) {
user += kcpustat_cpu(i).cpustat[CPUTIME_USER];
nice += kcpustat_cpu(i).cpustat[CPUTIME_NICE];
@@ -105,6 +106,7 @@ static int show_stat(struct seq_file *p, void *v)
steal += kcpustat_cpu(i).cpustat[CPUTIME_STEAL];
guest += kcpustat_cpu(i).cpustat[CPUTIME_GUEST];
guest_nice += kcpustat_cpu(i).cpustat[CPUTIME_GUEST_NICE];
+   consign += kcpustat_cpu(i).cpustat[CPUTIME_CONSIGN];
sum += kstat_cpu_irqs_sum(i);
sum += arch_irq_stat_cpu(i);
 
@@ -128,6 +130,7 @@ static int show_stat(struct seq_file *p, void *v)
seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(steal));
seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(guest));
seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(guest_nice));
+   seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(consign));
seq_putc(p, '\n');
 
for_each_online_cpu(i) {
@@ -142,6 +145,7 @@ static int show_stat(struct seq_file *p, void *v)
steal = kcpustat_cpu(i).cpustat[CPUTIME_STEAL];
guest = kcpustat_cpu(i).cpustat[CPUTIME_GUEST];
guest_nice = kcpustat_cpu(i).cpustat[CPUTIME_GUEST_NICE];
+   consign = kcpustat_cpu(i).cpustat[CPUTIME_CONSIGN];
seq_printf(p, "cpu%d", i);
seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(user));
seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(nice));
@@ -153,6 +157,7 @@ static int show_stat(struct seq_file *p, void *v)
seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(steal));
seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(guest));
seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(guest_nice));
+   seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(consign));
seq_putc(p, '\n');
}
seq_printf(p, "intr %llu", (unsigned long long)sum);
diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
index 1865b1f..e5978b0 100644
--- a/include/linux/kernel_stat.h
+++ b/include/linux/kernel_stat.h
@@ -28,6 +28,7 @@ enum cpu_usage_stat {
CPUTIME_STEAL,
CPUTIME_GUEST,
CPUTIME_GUEST_NICE,
+   CPUTIME_CONSIGN,
NR_STATS,
 };
 



[PATCH 2/5] Expand the steal time msr to also contain the consigned time.

2012-11-26 Thread Michael Wolf
Add a consigned field.  This field will hold the time lost due to capping or 
overcommit.
The rest of the time will still show up in the steal-time field.

Signed-off-by: Michael Wolf m...@linux.vnet.ibm.com
---
 arch/x86/include/asm/paravirt.h   |4 ++--
 arch/x86/include/asm/paravirt_types.h |2 +-
 arch/x86/kernel/kvm.c |7 ++-
 kernel/sched/core.c   |   10 +-
 kernel/sched/cputime.c|2 +-
 5 files changed, 15 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index a0facf3..a5f9f30 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -196,9 +196,9 @@ struct static_key;
 extern struct static_key paravirt_steal_enabled;
 extern struct static_key paravirt_steal_rq_enabled;
 
-static inline u64 paravirt_steal_clock(int cpu)
+static inline u64 paravirt_steal_clock(int cpu, u64 *steal)
 {
-   return PVOP_CALL1(u64, pv_time_ops.steal_clock, cpu);
+   PVOP_VCALL2(pv_time_ops.steal_clock, cpu, steal);
 }
 
 static inline unsigned long long paravirt_read_pmc(int counter)
diff --git a/arch/x86/include/asm/paravirt_types.h 
b/arch/x86/include/asm/paravirt_types.h
index 142236e..5d4fc8b 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -95,7 +95,7 @@ struct pv_lazy_ops {
 
 struct pv_time_ops {
unsigned long long (*sched_clock)(void);
-   unsigned long long (*steal_clock)(int cpu);
+   void (*steal_clock)(int cpu, unsigned long long *steal);
unsigned long (*get_tsc_khz)(void);
 };
 
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 4180a87..ac357b3 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -372,9 +372,8 @@ static struct notifier_block kvm_pv_reboot_nb = {
.notifier_call = kvm_pv_reboot_notify,
 };
 
-static u64 kvm_steal_clock(int cpu)
+static void kvm_steal_clock(int cpu, u64 *steal)
 {
-   u64 steal;
struct kvm_steal_time *src;
int version;
 
@@ -382,11 +381,9 @@ static u64 kvm_steal_clock(int cpu)
do {
version = src->version;
rmb();
-   steal = src->steal;
+   *steal = src->steal;
rmb();
} while ((version & 1) || (version != src->version));
-
-   return steal;
 }
 
 void kvm_disable_steal_time(void)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c2e077c..b21d92d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -748,6 +748,7 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
  */
 #if defined(CONFIG_IRQ_TIME_ACCOUNTING) || 
defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
s64 steal = 0, irq_delta = 0;
+   u64 consigned = 0;
 #endif
 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
irq_delta = irq_time_read(cpu_of(rq)) - rq->prev_irq_time;
@@ -776,8 +777,15 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
 #ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
if (static_key_false((&paravirt_steal_rq_enabled))) {
u64 st;
+   u64 cs;
 
-   steal = paravirt_steal_clock(cpu_of(rq));
+   paravirt_steal_clock(cpu_of(rq), &steal, &consigned);
+   /*
+* since we are not assigning the steal time to cpustats
+* here, just combine the steal and consigned times to
+* do the rest of the calculations.
+*/
+   steal += consigned;
steal -= rq->prev_steal_time_rq;
 
if (unlikely(steal > delta))
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 8d859da..593b647 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -275,7 +275,7 @@ static __always_inline bool steal_account_process_tick(void)
if (static_key_false(&paravirt_steal_enabled)) {
u64 steal, st = 0;
 
-   steal = paravirt_steal_clock(smp_processor_id());
+   paravirt_steal_clock(smp_processor_id(), &steal);
steal -= this_rq()->prev_steal_time;
 
st = steal_ticks(steal);



[PATCH 3/5] Add the code to send the consigned time from the host to the guest

2012-11-26 Thread Michael Wolf
Add the code to send the consigned time from the host to the guest.

Signed-off-by: Michael Wolf m...@linux.vnet.ibm.com
---
 arch/x86/include/asm/kvm_host.h |1 +
 arch/x86/include/asm/kvm_para.h |3 ++-
 arch/x86/include/asm/paravirt.h |4 ++--
 arch/x86/kernel/kvm.c   |3 ++-
 arch/x86/kernel/paravirt.c  |4 ++--
 arch/x86/kvm/x86.c  |2 ++
 include/linux/kernel_stat.h |1 +
 kernel/sched/cputime.c  |   21 +++--
 kernel/sched/sched.h|2 ++
 9 files changed, 33 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index b2e11f4..434d378 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -426,6 +426,7 @@ struct kvm_vcpu_arch {
u64 msr_val;
u64 last_steal;
u64 accum_steal;
+   u64 accum_consigned;
struct gfn_to_hva_cache stime;
struct kvm_steal_time steal;
} st;
diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index eb3e9d8..1763369 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -42,9 +42,10 @@
 
 struct kvm_steal_time {
__u64 steal;
+   __u64 consigned;
__u32 version;
__u32 flags;
-   __u32 pad[12];
+   __u32 pad[10];
 };
 
 #define KVM_STEAL_ALIGNMENT_BITS 5
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index a5f9f30..d39e8d0 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -196,9 +196,9 @@ struct static_key;
 extern struct static_key paravirt_steal_enabled;
 extern struct static_key paravirt_steal_rq_enabled;
 
-static inline u64 paravirt_steal_clock(int cpu, u64 *steal)
+static inline u64 paravirt_steal_clock(int cpu, u64 *steal, u64 *consigned)
 {
-   PVOP_VCALL2(pv_time_ops.steal_clock, cpu, steal);
+   PVOP_VCALL3(pv_time_ops.steal_clock, cpu, steal, consigned);
 }
 
 static inline unsigned long long paravirt_read_pmc(int counter)
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index ac357b3..4439a5c 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -372,7 +372,7 @@ static struct notifier_block kvm_pv_reboot_nb = {
.notifier_call = kvm_pv_reboot_notify,
 };
 
-static void kvm_steal_clock(int cpu, u64 *steal)
+static void kvm_steal_clock(int cpu, u64 *steal, u64 *consigned)
 {
struct kvm_steal_time *src;
int version;
@@ -382,6 +382,7 @@ static void kvm_steal_clock(int cpu, u64 *steal)
version = src->version;
rmb();
*steal = src->steal;
+   *consigned = src->consigned;
rmb();
} while ((version & 1) || (version != src->version));
 }
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 17fff18..3797683 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -207,9 +207,9 @@ static void native_flush_tlb_single(unsigned long addr)
 struct static_key paravirt_steal_enabled;
 struct static_key paravirt_steal_rq_enabled;
 
-static u64 native_steal_clock(int cpu)
+static void native_steal_clock(int cpu, u64 *steal, u64 *consigned)
 {
-   return 0;
+   *steal = *consigned = 0;
 }
 
 /* These are in entry.S */
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 1eefebe..683b531 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1565,8 +1565,10 @@ static void record_steal_time(struct kvm_vcpu *vcpu)
return;
 
vcpu->arch.st.steal.steal += vcpu->arch.st.accum_steal;
+   vcpu->arch.st.steal.consigned += vcpu->arch.st.accum_consigned;
vcpu->arch.st.steal.version += 2;
vcpu->arch.st.accum_steal = 0;
+   vcpu->arch.st.accum_consigned = 0;
 
kvm_write_guest_cached(vcpu->kvm, &vcpu->arch.st.stime,
&vcpu->arch.st.steal, sizeof(struct kvm_steal_time));
diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
index e5978b0..91afaa3 100644
--- a/include/linux/kernel_stat.h
+++ b/include/linux/kernel_stat.h
@@ -126,6 +126,7 @@ extern unsigned long long task_delta_exec(struct 
task_struct *);
 extern void account_user_time(struct task_struct *, cputime_t, cputime_t);
 extern void account_system_time(struct task_struct *, int, cputime_t, 
cputime_t);
 extern void account_steal_time(cputime_t);
+extern void account_consigned_time(cputime_t);
 extern void account_idle_time(cputime_t);
 
 extern void account_process_tick(struct task_struct *, int user);
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 593b647..53bd0be 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -244,6 +244,18 @@ void account_system_time(struct task_struct *p, int 
hardirq_offset,
 }
 
 /*
+ * This accounts for the time that is split out of steal time.
+ * Consigned time represents the amount of time 

[PATCH 4/5] Add a timer to allow the separation of consigned from steal time.

2012-11-26 Thread Michael Wolf
Add a timer to the host.  This will define the period.  During a period
the first n ticks will go into the consigned bucket.  Any other ticks that
occur within the period will be placed in the stealtime bucket.

Signed-off-by: Michael Wolf m...@linux.vnet.ibm.com
---
 arch/x86/include/asm/kvm_host.h |   10 +
 arch/x86/include/asm/paravirt.h |2 +-
 arch/x86/kvm/x86.c  |   42 ++-
 3 files changed, 52 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 434d378..4794c95 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -41,6 +41,8 @@
 #define KVM_PIO_PAGE_OFFSET 1
 #define KVM_COALESCED_MMIO_PAGE_OFFSET 2
 
+#define KVM_STEAL_TIMER_DELAY 1UL
+
 #define CR0_RESERVED_BITS   \
(~(unsigned long)(X86_CR0_PE | X86_CR0_MP | X86_CR0_EM | X86_CR0_TS \
  | X86_CR0_ET | X86_CR0_NE | X86_CR0_WP | X86_CR0_AM \
@@ -353,6 +355,14 @@ struct kvm_vcpu_arch {
bool tpr_access_reporting;
 
/*
+* timer used to determine if the time should be counted as
+* steal time or consigned time.
+*/
+   struct hrtimer steal_timer;
+   u64 current_consigned;
+   u64 consigned_limit;
+
+   /*
 * Paging state of the vcpu
 *
 * If the vcpu runs in guest mode with two level paging this still saves
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index d39e8d0..6db79f9 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -196,7 +196,7 @@ struct static_key;
 extern struct static_key paravirt_steal_enabled;
 extern struct static_key paravirt_steal_rq_enabled;
 
-static inline u64 paravirt_steal_clock(int cpu, u64 *steal, u64 *consigned)
+static inline void paravirt_steal_clock(int cpu, u64 *steal, u64 *consigned)
 {
PVOP_VCALL3(pv_time_ops.steal_clock, cpu, steal, consigned);
 }
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 683b531..c91f4c9 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1546,13 +1546,32 @@ static void kvmclock_reset(struct kvm_vcpu *vcpu)
 static void accumulate_steal_time(struct kvm_vcpu *vcpu)
 {
u64 delta;
+   u64 steal_delta;
+   u64 consigned_delta;
 
	if (!(vcpu->arch.st.msr_val & KVM_MSR_ENABLED))
		return;

	delta = current->sched_info.run_delay - vcpu->arch.st.last_steal;
	vcpu->arch.st.last_steal = current->sched_info.run_delay;
-	vcpu->arch.st.accum_steal = delta;
+
+	/* split the delta into steal and consigned */
+	if (vcpu->arch.current_consigned < vcpu->arch.consigned_limit) {
+		vcpu->arch.current_consigned += delta;
+		if (vcpu->arch.current_consigned > vcpu->arch.consigned_limit) {
+			steal_delta = vcpu->arch.current_consigned
+					- vcpu->arch.consigned_limit;
+			consigned_delta = delta - steal_delta;
+		} else {
+			consigned_delta = delta;
+			steal_delta = 0;
+		}
+	} else {
+		consigned_delta = 0;
+		steal_delta = delta;
+	}
+	vcpu->arch.st.accum_steal = steal_delta;
+	vcpu->arch.st.accum_consigned = consigned_delta;
 }
 
 static void record_steal_time(struct kvm_vcpu *vcpu)
@@ -6203,11 +6222,25 @@ bool kvm_vcpu_compatible(struct kvm_vcpu *vcpu)
 
 struct static_key kvm_no_apic_vcpu __read_mostly;
 
+enum hrtimer_restart steal_timer_fn(struct hrtimer *data)
+{
+   struct kvm_vcpu *vcpu;
+   ktime_t now;
+
+   vcpu = container_of(data, struct kvm_vcpu, arch.steal_timer);
+	vcpu->arch.current_consigned = 0;
+	now = ktime_get();
+	hrtimer_forward(&vcpu->arch.steal_timer, now,
+			ktime_set(0, KVM_STEAL_TIMER_DELAY));
+   return HRTIMER_RESTART;
+}
+
 int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu)
 {
struct page *page;
struct kvm *kvm;
int r;
+   ktime_t ktime;
 
	BUG_ON(vcpu->kvm == NULL);
kvm = vcpu-kvm;
@@ -6251,6 +6284,12 @@ int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu)
 
kvm_async_pf_hash_reset(vcpu);
kvm_pmu_init(vcpu);
+   /* Initialize and start a timer to capture steal and consigned time */
+	hrtimer_init(&vcpu->arch.steal_timer, CLOCK_MONOTONIC,
+			HRTIMER_MODE_REL);
+	vcpu->arch.steal_timer.function = steal_timer_fn;
+	ktime = ktime_set(0, KVM_STEAL_TIMER_DELAY);
+	hrtimer_start(&vcpu->arch.steal_timer, ktime, HRTIMER_MODE_REL);
 
return 0;
 fail_free_mce_banks:
@@ -6269,6 +6308,7 @@ void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu)
 {
int idx;
 
+	hrtimer_cancel(&vcpu->arch.steal_timer);
kvm_pmu_destroy(vcpu);

[PATCH 5/5] Add an ioctl to communicate the consign limit to the host.

2012-11-26 Thread Michael Wolf
Add an ioctl to communicate the consign limit to the host.

Signed-off-by: Michael Wolf m...@linux.vnet.ibm.com
---
 arch/x86/kvm/x86.c   |6 ++
 include/linux/kvm_host.h |2 ++
 include/uapi/linux/kvm.h |2 ++
 virt/kvm/kvm_main.c  |7 +++
 4 files changed, 17 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c91f4c9..5d57469 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5938,6 +5938,12 @@ int kvm_arch_vcpu_ioctl_translate(struct kvm_vcpu *vcpu,
return 0;
 }
 
+int kvm_arch_vcpu_ioctl_set_entitlement(struct kvm_vcpu *vcpu, long entitlement)
+{
+	vcpu->arch.consigned_limit = entitlement;
+	return 0;
+}
+
 int kvm_arch_vcpu_ioctl_get_fpu(struct kvm_vcpu *vcpu, struct kvm_fpu *fpu)
 {
struct i387_fxsave_struct *fxsave =
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 0e2212f..de13648 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -590,6 +590,8 @@ void kvm_arch_hardware_unsetup(void);
 void kvm_arch_check_processor_compat(void *rtn);
 int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu);
 int kvm_arch_vcpu_should_kick(struct kvm_vcpu *vcpu);
+int kvm_arch_vcpu_ioctl_set_entitlement(struct kvm_vcpu *vcpu,
+   long entitlement);
 
 void kvm_free_physmem(struct kvm *kvm);
 
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 0a6d6ba..86f24bb 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -921,6 +921,8 @@ struct kvm_s390_ucas_mapping {
 #define KVM_SET_ONE_REG  _IOW(KVMIO,  0xac, struct kvm_one_reg)
 /* VM is being stopped by host */
 #define KVM_KVMCLOCK_CTRL	  _IO(KVMIO,   0xad)
+/* Set the consignment limit which will be used to separate steal time */
+#define KVM_SET_ENTITLEMENT	  _IOW(KVMIO,  0xae, unsigned long)
 
 #define KVM_DEV_ASSIGN_ENABLE_IOMMU	(1 << 0)
 #define KVM_DEV_ASSIGN_PCI_2_3		(1 << 1)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index be70035..c712fe5 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2062,6 +2062,13 @@ out_free2:
r = 0;
break;
}
+   case KVM_SET_ENTITLEMENT: {
+   r = kvm_arch_vcpu_ioctl_set_entitlement(vcpu, arg);
+   if (r)
+   goto out;
+   r = 0;
+   break;
+   }
default:
r = kvm_arch_vcpu_ioctl(filp, ioctl, arg);
}
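
A rough user-space sketch of driving the new ioctl (illustrative only, not
part of the series; the helper name is made up).  Note that the handler
above uses the ioctl argument directly as the limit, so the value is passed
by value rather than through a pointer:

#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* With an unpatched <linux/kvm.h>, the request number would have to be
 * defined locally, mirroring the definition added by this patch. */
#ifndef KVM_SET_ENTITLEMENT
#define KVM_SET_ENTITLEMENT  _IOW(KVMIO, 0xae, unsigned long)
#endif

/* Hypothetical helper: set the consignment limit (nanoseconds of expected
 * steal per accounting period) on an already-created vcpu file descriptor. */
static int set_entitlement(int vcpu_fd, unsigned long limit_ns)
{
	if (ioctl(vcpu_fd, KVM_SET_ENTITLEMENT, limit_ns) < 0) {
		perror("KVM_SET_ENTITLEMENT");
		return -1;
	}
	return 0;
}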



[PATCH 0/5] Alter stealtime reporting in KVM

2012-11-26 Thread Michael Wolf
When a system is running in a capped or overcommitted environment, the
user may see steal time being reported in accounting tools such as top or
vmstat.  This can cause confusion for the end user.  To ease the confusion,
this patch set adds the idea of consigned (expected steal) time.  The host
will separate the consigned time from the steal time.  The consignment
limit passed to the host will be the amount of steal time expected within
a fixed period of time.  Any other steal time accruing during that period
will show as the traditional steal time.
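
To make the split concrete, a small stand-alone sketch of the accounting
rule (illustrative only, with made-up names; the in-kernel version is the
accumulate_steal_time() change in patch 4/5): within each period, lost time
is charged to the consigned bucket until the per-period limit is used up,
and any excess is charged to the steal bucket.

#include <stdio.h>

static unsigned long long consigned_so_far;	/* reset at each period boundary */

static void split_delta(unsigned long long delta, unsigned long long limit,
			unsigned long long *consigned, unsigned long long *steal)
{
	if (consigned_so_far < limit) {
		consigned_so_far += delta;
		if (consigned_so_far > limit) {
			*steal = consigned_so_far - limit;
			*consigned = delta - *steal;
		} else {
			*consigned = delta;
			*steal = 0;
		}
	} else {
		*consigned = 0;
		*steal = delta;
	}
}

int main(void)
{
	unsigned long long c, s;

	split_delta(30, 100, &c, &s);	/* entirely within the limit: 30/0 */
	printf("consigned=%llu steal=%llu\n", c, s);
	split_delta(90, 100, &c, &s);	/* crosses the limit: 70/20 */
	printf("consigned=%llu steal=%llu\n", c, s);
	return 0;
}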

---

Michael Wolf (5):
  Alter the amount of steal time reported by the guest.
  Expand the steal time msr to also contain the consigned time.
  Add the code to send the consigned time from the host to the guest
  Add a timer to allow the separation of consigned from steal time.
  Add an ioctl to communicate the consign limit to the host.


 CREDITS|5 
 Documentation/arm64/memory.txt |   12 
 Documentation/cgroups/memory.txt   |4 
 .../devicetree/bindings/net/mdio-gpio.txt  |9 
 Documentation/filesystems/proc.txt |   16 
 Documentation/hwmon/fam15h_power   |2 
 Documentation/kernel-parameters.txt|   20 
 Documentation/networking/netdev-features.txt   |2 
 Documentation/scheduler/numa-problem.txt   |   20 
 MAINTAINERS|   87 +
 Makefile   |2 
 arch/alpha/kernel/osf_sys.c|6 
 arch/arm/boot/Makefile |   10 
 arch/arm/boot/dts/tegra30.dtsi |4 
 arch/arm/include/asm/io.h  |4 
 arch/arm/include/asm/sched_clock.h |2 
 arch/arm/include/asm/vfpmacros.h   |   12 
 arch/arm/include/uapi/asm/hwcap.h  |3 
 arch/arm/kernel/sched_clock.c  |   18 
 arch/arm/mach-at91/at91rm9200_devices.c|2 
 arch/arm/mach-at91/at91sam9260_devices.c   |2 
 arch/arm/mach-at91/at91sam9261_devices.c   |2 
 arch/arm/mach-at91/at91sam9263_devices.c   |2 
 arch/arm/mach-at91/at91sam9g45_devices.c   |   12 
 arch/arm/mach-davinci/dm644x.c |3 
 arch/arm/mach-highbank/system.c|3 
 arch/arm/mach-imx/clk-gate2.c  |2 
 arch/arm/mach-imx/ehci-imx25.c |2 
 arch/arm/mach-imx/ehci-imx35.c |2 
 arch/arm/mach-omap2/board-igep0020.c   |5 
 arch/arm/mach-omap2/clockdomains44xx_data.c|2 
 arch/arm/mach-omap2/devices.c  |   79 +
 arch/arm/mach-omap2/omap_hwmod.c   |   63 +
 arch/arm/mach-omap2/omap_hwmod_44xx_data.c |   36 
 arch/arm/mach-omap2/twl-common.c   |3 
 arch/arm/mach-omap2/vc.c   |2 
 arch/arm/mach-pxa/hx4700.c |8 
 arch/arm/mach-pxa/spitz_pm.c   |8 
 arch/arm/mm/alignment.c|2 
 arch/arm/plat-omap/include/plat/omap_hwmod.h   |6 
 arch/arm/tools/Makefile|2 
 arch/arm/vfp/vfpmodule.c   |9 
 arch/arm/xen/enlighten.c   |   11 
 arch/arm/xen/hypercall.S   |   14 
 arch/arm64/Kconfig |1 
 arch/arm64/include/asm/elf.h   |5 
 arch/arm64/include/asm/fpsimd.h|5 
 arch/arm64/include/asm/io.h|   10 
 arch/arm64/include/asm/pgtable-hwdef.h |6 
 arch/arm64/include/asm/pgtable.h   |   40 -
 arch/arm64/include/asm/processor.h |2 
 arch/arm64/include/asm/unistd.h|1 
 arch/arm64/kernel/perf_event.c |   10 
 arch/arm64/kernel/process.c|   18 
 arch/arm64/kernel/smp.c|3 
 arch/arm64/mm/init.c   |2 
 arch/frv/Kconfig   |1 
 arch/frv/boot/Makefile |   10 
 arch/frv/include/asm/unistd.h  |1 
 arch/frv/kernel/entry.S|   28 
 arch/frv/kernel/process.c  |5 
 arch/frv/mb93090-mb00/pci-dma-nommu.c  |1 
 arch/h8300/include/asm/cache.h |3 
 arch/ia64/mm/init.c|1 
 arch/m68k/include/asm/signal.h |6 
 arch/mips/cavium-octeon/executive/cvmx-l2c.c   |  900 
 arch/unicore32/include/asm/byteorder.h |   24 
 

[PATCH 1/5] Alter the amount of steal time reported by the guest.

2012-11-26 Thread Michael Wolf
Modify the amount of steal time that the kernel reports via the /proc interface.
Steal time will now be broken down into steal_time and consigned_time.
Consigned_time will represent the amount of time that is expected to be lost
due to overcommitment of the physical cpu or by using cpu capping.  The
consigned_time value will be passed in using an ioctl.  The time will be
expressed as the number of nanoseconds to be lost during the fixed period.
The fixed period is currently 1/10th of a second.

Signed-off-by: Michael Wolf m...@linux.vnet.ibm.com
---
 fs/proc/stat.c  |9 +++--
 include/linux/kernel_stat.h |1 +
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/fs/proc/stat.c b/fs/proc/stat.c
index e296572..cb7fe80 100644
--- a/fs/proc/stat.c
+++ b/fs/proc/stat.c
@@ -82,7 +82,7 @@ static int show_stat(struct seq_file *p, void *v)
int i, j;
unsigned long jif;
u64 user, nice, system, idle, iowait, irq, softirq, steal;
-   u64 guest, guest_nice;
+   u64 guest, guest_nice, consign;
u64 sum = 0;
u64 sum_softirq = 0;
unsigned int per_softirq_sums[NR_SOFTIRQS] = {0};
@@ -90,10 +90,11 @@ static int show_stat(struct seq_file *p, void *v)
 
user = nice = system = idle = iowait =
irq = softirq = steal = 0;
-   guest = guest_nice = 0;
+   guest = guest_nice = consign = 0;
getboottime(boottime);
jif = boottime.tv_sec;
 
+
for_each_possible_cpu(i) {
user += kcpustat_cpu(i).cpustat[CPUTIME_USER];
nice += kcpustat_cpu(i).cpustat[CPUTIME_NICE];
@@ -105,6 +106,7 @@ static int show_stat(struct seq_file *p, void *v)
steal += kcpustat_cpu(i).cpustat[CPUTIME_STEAL];
guest += kcpustat_cpu(i).cpustat[CPUTIME_GUEST];
guest_nice += kcpustat_cpu(i).cpustat[CPUTIME_GUEST_NICE];
+   consign += kcpustat_cpu(i).cpustat[CPUTIME_CONSIGN];
sum += kstat_cpu_irqs_sum(i);
sum += arch_irq_stat_cpu(i);
 
@@ -128,6 +130,7 @@ static int show_stat(struct seq_file *p, void *v)
seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(steal));
seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(guest));
seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(guest_nice));
+   seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(consign));
seq_putc(p, '\n');
 
for_each_online_cpu(i) {
@@ -142,6 +145,7 @@ static int show_stat(struct seq_file *p, void *v)
steal = kcpustat_cpu(i).cpustat[CPUTIME_STEAL];
guest = kcpustat_cpu(i).cpustat[CPUTIME_GUEST];
guest_nice = kcpustat_cpu(i).cpustat[CPUTIME_GUEST_NICE];
+   consign = kcpustat_cpu(i).cpustat[CPUTIME_CONSIGN];
	seq_printf(p, "cpu%d", i);
seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(user));
seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(nice));
@@ -153,6 +157,7 @@ static int show_stat(struct seq_file *p, void *v)
seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(steal));
seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(guest));
seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(guest_nice));
+   seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(consign));
seq_putc(p, '\n');
}
	seq_printf(p, "intr %llu", (unsigned long long)sum);
diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
index 1865b1f..e5978b0 100644
--- a/include/linux/kernel_stat.h
+++ b/include/linux/kernel_stat.h
@@ -28,6 +28,7 @@ enum cpu_usage_stat {
CPUTIME_STEAL,
CPUTIME_GUEST,
CPUTIME_GUEST_NICE,
+   CPUTIME_CONSIGN,
NR_STATS,
 };
 

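For illustration only (not part of the patch): a user-space reader would see
one extra column per cpu line in /proc/stat, with consign appended after
guest_nice, so a minimal sketch of picking it up could look like this:

#include <stdio.h>

int main(void)
{
	char cpu[16];
	unsigned long long user, nice, system, idle, iowait, irq, softirq,
			   steal, guest, guest_nice, consign;
	FILE *f = fopen("/proc/stat", "r");

	if (!f)
		return 1;
	/* first line: the aggregate "cpu" row; consign is the 11th value */
	if (fscanf(f, "%15s %llu %llu %llu %llu %llu %llu %llu %llu %llu %llu %llu",
		   cpu, &user, &nice, &system, &idle, &iowait, &irq, &softirq,
		   &steal, &guest, &guest_nice, &consign) == 12)
		printf("%s: steal=%llu consign=%llu\n", cpu, steal, consign);
	fclose(f);
	return 0;
}
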


[PATCH 2/5] Expand the steal time msr to also contain the consigned time.

2012-11-26 Thread Michael Wolf
Add a consigned field.  This field will hold the time lost due to capping or 
overcommit.
The rest of the time will still show up in the steal-time field.

Signed-off-by: Michael Wolf m...@linux.vnet.ibm.com
---
 arch/x86/include/asm/paravirt.h   |4 ++--
 arch/x86/include/asm/paravirt_types.h |2 +-
 arch/x86/kernel/kvm.c |7 ++-
 kernel/sched/core.c   |   10 +-
 kernel/sched/cputime.c|2 +-
 5 files changed, 15 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index a0facf3..a5f9f30 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -196,9 +196,9 @@ struct static_key;
 extern struct static_key paravirt_steal_enabled;
 extern struct static_key paravirt_steal_rq_enabled;
 
-static inline u64 paravirt_steal_clock(int cpu)
+static inline u64 paravirt_steal_clock(int cpu, u64 *steal)
 {
-   return PVOP_CALL1(u64, pv_time_ops.steal_clock, cpu);
+   PVOP_VCALL2(pv_time_ops.steal_clock, cpu, steal);
 }
 
 static inline unsigned long long paravirt_read_pmc(int counter)
diff --git a/arch/x86/include/asm/paravirt_types.h 
b/arch/x86/include/asm/paravirt_types.h
index 142236e..5d4fc8b 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -95,7 +95,7 @@ struct pv_lazy_ops {
 
 struct pv_time_ops {
unsigned long long (*sched_clock)(void);
-   unsigned long long (*steal_clock)(int cpu);
+   void (*steal_clock)(int cpu, unsigned long long *steal);
unsigned long (*get_tsc_khz)(void);
 };
 
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 4180a87..ac357b3 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -372,9 +372,8 @@ static struct notifier_block kvm_pv_reboot_nb = {
.notifier_call = kvm_pv_reboot_notify,
 };
 
-static u64 kvm_steal_clock(int cpu)
+static void kvm_steal_clock(int cpu, u64 *steal)
 {
-   u64 steal;
struct kvm_steal_time *src;
int version;
 
@@ -382,11 +381,9 @@ static u64 kvm_steal_clock(int cpu)
do {
		version = src->version;
		rmb();
-		steal = src->steal;
+		*steal = src->steal;
		rmb();
	} while ((version & 1) || (version != src->version));
-
-   return steal;
 }
 
 void kvm_disable_steal_time(void)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c2e077c..b21d92d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -748,6 +748,7 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
  */
 #if defined(CONFIG_IRQ_TIME_ACCOUNTING) || 
defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
s64 steal = 0, irq_delta = 0;
+   u64 consigned = 0;
 #endif
 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
	irq_delta = irq_time_read(cpu_of(rq)) - rq->prev_irq_time;
@@ -776,8 +777,15 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
 #ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
if (static_key_false((paravirt_steal_rq_enabled))) {
u64 st;
+   u64 cs;
 
-   steal = paravirt_steal_clock(cpu_of(rq));
+		paravirt_steal_clock(cpu_of(rq), &steal, &consigned);
+   /*
+* since we are not assigning the steal time to cpustats
+* here, just combine the steal and consigned times to
+* do the rest of the calculations.
+*/
+   steal += consigned;
		steal -= rq->prev_steal_time_rq;

		if (unlikely(steal > delta))
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 8d859da..593b647 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -275,7 +275,7 @@ static __always_inline bool steal_account_process_tick(void)
if (static_key_false(paravirt_steal_enabled)) {
u64 steal, st = 0;
 
-   steal = paravirt_steal_clock(smp_processor_id());
+		paravirt_steal_clock(smp_processor_id(), &steal);
		steal -= this_rq()->prev_steal_time;
 
st = steal_ticks(steal);



Re: [PATCH 3/5] KVM: PPC: Book3S HV: Improve handling of local vs. global TLB invalidations

2012-11-26 Thread Paul Mackerras
On Mon, Nov 26, 2012 at 02:10:33PM +0100, Alexander Graf wrote:
 
 On 23.11.2012, at 23:07, Paul Mackerras wrote:
 
  On Fri, Nov 23, 2012 at 04:43:03PM +0100, Alexander Graf wrote:
  
  On 22.11.2012, at 10:28, Paul Mackerras wrote:
  
  - With the possibility of the host paging out guest pages, the use of
  H_LOCAL by an SMP guest is dangerous since the guest could possibly
  retain and use a stale TLB entry pointing to a page that had been
  removed from the guest.
  
  I don't understand this part. Don't we flush the TLB when the page gets 
  evicted from the shadow HTAB?
  
  The H_LOCAL flag is something that we invented to allow the guest to
  tell the host I only ever used this translation (HPTE) on the current
  vcpu when it's removing or modifying an HPTE.  The idea is that that
  would then let the host use the tlbiel instruction (local TLB
  invalidate) rather than the usual global tlbie instruction.  Tlbiel is
  faster because it doesn't need to go out on the fabric and get
  processed by all cpus.  In fact our guests don't use it at present,
  but we put it in because we thought we should be able to get a
  performance improvement, particularly on large machines.
  
  However, the catch is that the guest's setting of H_LOCAL might be
  incorrect, in which case we could have a stale TLB entry on another
  physical cpu.  While the physical page that it refers to is still
  owned by the guest, that stale entry doesn't matter from the host's
  point of view.  But if the host wants to take that page away from the
  guest, the stale entry becomes a problem.
 
 That's exactly where my question lies. Does that mean we don't flush the TLB 
 entry regardless when we take the page away from the guest?

The question is how to find the TLB entry if the HPTE it came from is
no longer present.  Flushing a TLB entry requires a virtual address.
When we're taking a page away from the guest we have the real address
of the page, not the virtual address.  We can use the reverse-mapping
chains to loop through all the HPTEs that map the page, and from each
HPTE we can (and do) calculate a virtual address and do a TLBIE on
that virtual address (each HPTE could be at a different virtual
address).

The difficulty comes when we no longer have the HPTE but we
potentially have a stale TLB entry, due to having used tlbiel when we
removed the HPTE.  Without the HPTE the only way to get rid of the
stale TLB entry would be to completely flush all the TLB entries for
the guest's LPID on every physical CPU it had ever run on.  Since I
don't want to go to that much effort, what I am proposing, and what
this patch implements, is to not ever use tlbiel when removing HPTEs
in SMP guests on POWER7.

In other words, what this patch is about is making sure we don't get
these troublesome stale TLB entries.

Paul.


Re: [PATCH 1/5] KVM: PPC: Book3S HV: Handle guest-caused machine checks on POWER7 without panicking

2012-11-26 Thread Paul Mackerras
On Mon, Nov 26, 2012 at 02:15:16PM +0100, Alexander Graf wrote:
 
 On 23.11.2012, at 22:42, Paul Mackerras wrote:
 
  On Fri, Nov 23, 2012 at 03:13:09PM +0100, Alexander Graf wrote:
  
  On 22.11.2012, at 10:25, Paul Mackerras wrote:
  
  + /* Do they have an SLB shadow buffer registered? */
  + slb = vcpu->arch.slb_shadow.pinned_addr;
  + if (!slb)
  + return;
  
  Mind to explain this case? What happens here? Do we leave the guest with 
  an empty SLB? Why would this ever happen? What happens next as soon as we 
  go back into the guest?
  
  Yes, we leave the guest with an empty SLB, the access gets retried and
  this time the guest gets an SLB miss interrupt, which it can hopefully
  handle using an SLB miss handler that runs entirely in real mode.
  This could happen for instance while the guest is in SLOF or yaboot or
  some other code that runs basically in real mode but occasionally
  turns the MMU on for some accesses, and happens to have a bug where it
  creates a duplicate SLB entry.
 
 Is this what pHyp does? Also, is this what we want? Why don't we populate an 
 #MC into the guest so it knows it did something wrong?

Yes, yes and we do.  Anytime we get a machine check while in the guest
we give the guest a machine check interrupt.

Ultimately we want to implement the FWNMI (Firmware-assisted NMI)
thing defined in PAPR which makes the handling of system reset and
machine check slightly nicer for the guest, but that's for later.  It
will build on top of the stuff in this patch.

Paul.


Re: [PATCH 1/5] KVM: PPC: Book3S HV: Handle guest-caused machine checks on POWER7 without panicking

2012-11-26 Thread Alexander Graf

On 26.11.2012, at 22:33, Paul Mackerras wrote:

 On Mon, Nov 26, 2012 at 02:15:16PM +0100, Alexander Graf wrote:
 
 On 23.11.2012, at 22:42, Paul Mackerras wrote:
 
 On Fri, Nov 23, 2012 at 03:13:09PM +0100, Alexander Graf wrote:
 
 On 22.11.2012, at 10:25, Paul Mackerras wrote:
 
 + /* Do they have an SLB shadow buffer registered? */
 + slb = vcpu->arch.slb_shadow.pinned_addr;
 + if (!slb)
 + return;
 
 Mind to explain this case? What happens here? Do we leave the guest with 
 an empty SLB? Why would this ever happen? What happens next as soon as we 
 go back into the guest?
 
 Yes, we leave the guest with an empty SLB, the access gets retried and
 this time the guest gets an SLB miss interrupt, which it can hopefully
 handle using an SLB miss handler that runs entirely in real mode.
 This could happen for instance while the guest is in SLOF or yaboot or
 some other code that runs basically in real mode but occasionally
 turns the MMU on for some accesses, and happens to have a bug where it
 creates a duplicate SLB entry.
 
 Is this what pHyp does? Also, is this what we want? Why don't we populate an 
 #MC into the guest so it knows it did something wrong?
 
 Yes, yes and we do.  Anytime we get a machine check while in the guest
 we give the guest a machine check interrupt.
 
 Ultimately we want to implement the FWNMI (Firmware-assisted NMI)
 thing defined in PAPR which makes the handling of system reset and
 machine check slightly nicer for the guest, but that's for later.  It
 will build on top of the stuff in this patch.

So why would the function return 1 then, which means "MC is handled, forget
about it", rather than 0, which means "inject MC into the guest"?


Alex



Re: [PATCH 3/5] KVM: PPC: Book3S HV: Improve handling of local vs. global TLB invalidations

2012-11-26 Thread Alexander Graf

On 26.11.2012, at 22:48, Paul Mackerras wrote:

 On Mon, Nov 26, 2012 at 02:10:33PM +0100, Alexander Graf wrote:
 
 On 23.11.2012, at 23:07, Paul Mackerras wrote:
 
 On Fri, Nov 23, 2012 at 04:43:03PM +0100, Alexander Graf wrote:
 
 On 22.11.2012, at 10:28, Paul Mackerras wrote:
 
 - With the possibility of the host paging out guest pages, the use of
 H_LOCAL by an SMP guest is dangerous since the guest could possibly
 retain and use a stale TLB entry pointing to a page that had been
 removed from the guest.
 
 I don't understand this part. Don't we flush the TLB when the page gets 
 evicted from the shadow HTAB?
 
 The H_LOCAL flag is something that we invented to allow the guest to
 tell the host I only ever used this translation (HPTE) on the current
 vcpu when it's removing or modifying an HPTE.  The idea is that that
 would then let the host use the tlbiel instruction (local TLB
 invalidate) rather than the usual global tlbie instruction.  Tlbiel is
 faster because it doesn't need to go out on the fabric and get
 processed by all cpus.  In fact our guests don't use it at present,
 but we put it in because we thought we should be able to get a
 performance improvement, particularly on large machines.
 
 However, the catch is that the guest's setting of H_LOCAL might be
 incorrect, in which case we could have a stale TLB entry on another
 physical cpu.  While the physical page that it refers to is still
 owned by the guest, that stale entry doesn't matter from the host's
 point of view.  But if the host wants to take that page away from the
 guest, the stale entry becomes a problem.
 
 That's exactly where my question lies. Does that mean we don't flush the TLB 
 entry regardless when we take the page away from the guest?
 
 The question is how to find the TLB entry if the HPTE it came from is
 no longer present.  Flushing a TLB entry requires a virtual address.
 When we're taking a page away from the guest we have the real address
 of the page, not the virtual address.  We can use the reverse-mapping
 chains to loop through all the HPTEs that map the page, and from each
 HPTE we can (and do) calculate a virtual address and do a TLBIE on
 that virtual address (each HPTE could be at a different virtual
 address).
 
 The difficulty comes when we no longer have the HPTE but we
 potentially have a stale TLB entry, due to having used tlbiel when we
 removed the HPTE.  Without the HPTE the only way to get rid of the
 stale TLB entry would be to completely flush all the TLB entries for
 the guest's LPID on every physical CPU it had ever run on.  Since I
 don't want to go to that much effort, what I am proposing, and what
 this patch implements, is to not ever use tlbiel when removing HPTEs
 in SMP guests on POWER7.
 
 In other words, what this patch is about is making sure we don't get
 these troublesome stale TLB entries.

I see. You could keep a list of to-be-flushed VAs around that you could skim 
through when taking a page away from the guest. That way you make the fast case 
fast (add/remove of page from the guest) and the slow path slow (paging).

But I'm fine with disallowing local flushes on remove completely for now. It 
would be nice to get performance data on how much this would be a net win 
though. There are certainly ways of keeping local flushes alive with the scheme 
above.


Thanks, applied to kvm-ppc-next.

Alex



Re: [PATCH 1/5] KVM: PPC: Book3S HV: Handle guest-caused machine checks on POWER7 without panicking

2012-11-26 Thread Alexander Graf

On 26.11.2012, at 22:55, Alexander Graf wrote:

 
 On 26.11.2012, at 22:33, Paul Mackerras wrote:
 
 On Mon, Nov 26, 2012 at 02:15:16PM +0100, Alexander Graf wrote:
 
 On 23.11.2012, at 22:42, Paul Mackerras wrote:
 
 On Fri, Nov 23, 2012 at 03:13:09PM +0100, Alexander Graf wrote:
 
 On 22.11.2012, at 10:25, Paul Mackerras wrote:
 
 +/* Do they have an SLB shadow buffer registered? */
 + slb = vcpu->arch.slb_shadow.pinned_addr;
 +if (!slb)
 +return;
 
 Mind to explain this case? What happens here? Do we leave the guest with 
 an empty SLB? Why would this ever happen? What happens next as soon as we 
 go back into the guest?
 
 Yes, we leave the guest with an empty SLB, the access gets retried and
 this time the guest gets an SLB miss interrupt, which it can hopefully
 handle using an SLB miss handler that runs entirely in real mode.
 This could happen for instance while the guest is in SLOF or yaboot or
 some other code that runs basically in real mode but occasionally
 turns the MMU on for some accesses, and happens to have a bug where it
 creates a duplicate SLB entry.
 
 Is this what pHyp does? Also, is this what we want? Why don't we populate 
 an #MC into the guest so it knows it did something wrong?
 
 Yes, yes and we do.  Anytime we get a machine check while in the guest
 we give the guest a machine check interrupt.
 
 Ultimately we want to implement the FWNMI (Firmware-assisted NMI)
 thing defined in PAPR which makes the handling of system reset and
 machine check slightly nicer for the guest, but that's for later.  It
 will build on top of the stuff in this patch.
 
 So why would the function return 1 then which means MC is handled, forget 
 about it rather than 0, which means inject MC into the guest?

Oh wait - 1 means "have the host handle it". Let me check up the code again.


Alex


[Bug 50921] kvm hangs booting Windows 2000

2012-11-26 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=50921





--- Comment #16 from Lucio Crusca lu...@sulweb.org  2012-11-26 22:14:15 ---
@xiaoguangrong: YOU ARE THE MAN! 'emulate_invalid_guest_state = 0' did the
trick, now I have win2000 running in a 3.6.7 kvm guest! Thanks.

Still guessing why it works with plain kvm-intel.ko in Debian kernels is out of
my reach, but I can safely shove that mystery in the when-I'll-have-time-stuff
drawer and live happy with this solution meanwhile.

-- 
Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email
--- You are receiving this mail because: ---
You are watching the assignee of the bug.


Re: [PATCH] x86, kvm: Remove incorrect redundant assembly constraint

2012-11-26 Thread H. Peter Anvin
On 11/25/2012 11:22 PM, Paolo Bonzini wrote:
 Il 21/11/2012 23:41, H. Peter Anvin ha scritto:
 From: H. Peter Anvin h...@linux.intel.com

 In __emulate_1op_rax_rdx, we use "+a" and "+d" which are input/output
 constraints, and *then* use "a" and "d" as input constraints.  This is
 incorrect, but happens to work on some versions of gcc.

 However, it breaks gcc with -O0 and icc, and may break on future
 versions of gcc.

 Reported-and-tested-by: Melanie Blower melanie.blo...@intel.com
 Signed-off-by: H. Peter Anvin h...@linux.intel.com
 Link: 
 http://lkml.kernel.org/r/b3584e72cfebed439a3eca9bce67a4ef1b17a...@fmsmsx107.amr.corp.intel.com
 ---
  arch/x86/kvm/emulate.c | 3 +--
  1 file changed, 1 insertion(+), 2 deletions(-)

 diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
 index 39171cb..bba39bf 100644
 --- a/arch/x86/kvm/emulate.c
 +++ b/arch/x86/kvm/emulate.c
 @@ -426,8 +426,7 @@ static void invalidate_registers(struct x86_emulate_ctxt 
 *ctxt)
  	_ASM_EXTABLE(1b, 3b)				\
  	: "=m" ((ctxt)->eflags), "=&r" (_tmp),		\
  	  "+a" (*rax), "+d" (*rdx), "+qm"(_ex)		\
 -	: "i" (EFLAGS_MASK), "m" ((ctxt)->src.val),	\
 -	  "a" (*rax), "d" (*rdx));			\
 +	: "i" (EFLAGS_MASK), "m" ((ctxt)->src.val));	\
  	} while (0)
  
  /* instruction has only one source operand, destination is implicit (e.g. 
 mul, div, imul, idiv) */

 
 Reviewed-by: Paolo Bonzini pbonz...@redhat.com
 

Gleb, Marcelo: are you going to apply this or would you prefer I took it
in x86/urgent?

-hpa
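
For reference, a stand-alone sketch (not kernel code) of the constraint rule
involved: a "+" operand is already both an input and an output, so naming the
same register again in the input list is redundant, and some compilers reject
the duplicate.

/* x86-64 only: rax is a read-write operand via "+a"; no separate
 * "a" input constraint is needed for the same value. */
static inline unsigned long double_rax(unsigned long x)
{
	asm("addq %%rax, %%rax" : "+a" (x));
	return x;
}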



Re: [PATCH 1/5] KVM: PPC: Book3S HV: Handle guest-caused machine checks on POWER7 without panicking

2012-11-26 Thread Paul Mackerras
On Mon, Nov 26, 2012 at 11:03:48PM +0100, Alexander Graf wrote:
 
 On 26.11.2012, at 22:55, Alexander Graf wrote:
 
  
  On 26.11.2012, at 22:33, Paul Mackerras wrote:
  
  On Mon, Nov 26, 2012 at 02:15:16PM +0100, Alexander Graf wrote:
  
  On 23.11.2012, at 22:42, Paul Mackerras wrote:
  
  On Fri, Nov 23, 2012 at 03:13:09PM +0100, Alexander Graf wrote:
  
  On 22.11.2012, at 10:25, Paul Mackerras wrote:
  
  +  /* Do they have an SLB shadow buffer registered? */
  +	slb = vcpu->arch.slb_shadow.pinned_addr;
  +  if (!slb)
  +  return;
  
  Mind to explain this case? What happens here? Do we leave the guest 
  with an empty SLB? Why would this ever happen? What happens next as 
  soon as we go back into the guest?
  
  Yes, we leave the guest with an empty SLB, the access gets retried and
  this time the guest gets an SLB miss interrupt, which it can hopefully
  handle using an SLB miss handler that runs entirely in real mode.
  This could happen for instance while the guest is in SLOF or yaboot or
  some other code that runs basically in real mode but occasionally
  turns the MMU on for some accesses, and happens to have a bug where it
  creates a duplicate SLB entry.
  
  Is this what pHyp does? Also, is this what we want? Why don't we populate 
  an #MC into the guest so it knows it did something wrong?
  
  Yes, yes and we do.  Anytime we get a machine check while in the guest
  we give the guest a machine check interrupt.
  
  Ultimately we want to implement the FWNMI (Firmware-assisted NMI)
  thing defined in PAPR which makes the handling of system reset and
  machine check slightly nicer for the guest, but that's for later.  It
  will build on top of the stuff in this patch.
  
  So why would the function return 1 then which means MC is handled, forget 
  about it rather than 0, which means inject MC into the guest?
 
 Oh wait - 1 means have the host handle it. Let me check up the code again.

1 means the problem is fixed, now give the guest a machine check
interrupt.

0 means exit the guest, have the host's MC handler look at it, then
give the guest a machine check.  In this case the delivery of the MC
to the guest happens in kvmppc_handle_exit().

Paul.


Re: [PATCH 3/5] KVM: PPC: Book3S HV: Improve handling of local vs. global TLB invalidations

2012-11-26 Thread Paul Mackerras
On Mon, Nov 26, 2012 at 11:03:19PM +0100, Alexander Graf wrote:
 
 On 26.11.2012, at 22:48, Paul Mackerras wrote:
 
  On Mon, Nov 26, 2012 at 02:10:33PM +0100, Alexander Graf wrote:
  
  On 23.11.2012, at 23:07, Paul Mackerras wrote:
  
  On Fri, Nov 23, 2012 at 04:43:03PM +0100, Alexander Graf wrote:
  
  On 22.11.2012, at 10:28, Paul Mackerras wrote:
  
  - With the possibility of the host paging out guest pages, the use of
  H_LOCAL by an SMP guest is dangerous since the guest could possibly
  retain and use a stale TLB entry pointing to a page that had been
  removed from the guest.
  
  I don't understand this part. Don't we flush the TLB when the page gets 
  evicted from the shadow HTAB?
  
  The H_LOCAL flag is something that we invented to allow the guest to
  tell the host I only ever used this translation (HPTE) on the current
  vcpu when it's removing or modifying an HPTE.  The idea is that that
  would then let the host use the tlbiel instruction (local TLB
  invalidate) rather than the usual global tlbie instruction.  Tlbiel is
  faster because it doesn't need to go out on the fabric and get
  processed by all cpus.  In fact our guests don't use it at present,
  but we put it in because we thought we should be able to get a
  performance improvement, particularly on large machines.
  
  However, the catch is that the guest's setting of H_LOCAL might be
  incorrect, in which case we could have a stale TLB entry on another
  physical cpu.  While the physical page that it refers to is still
  owned by the guest, that stale entry doesn't matter from the host's
  point of view.  But if the host wants to take that page away from the
  guest, the stale entry becomes a problem.
  
  That's exactly where my question lies. Does that mean we don't flush the 
  TLB entry regardless when we take the page away from the guest?
  
  The question is how to find the TLB entry if the HPTE it came from is
  no longer present.  Flushing a TLB entry requires a virtual address.
  When we're taking a page away from the guest we have the real address
  of the page, not the virtual address.  We can use the reverse-mapping
  chains to loop through all the HPTEs that map the page, and from each
  HPTE we can (and do) calculate a virtual address and do a TLBIE on
  that virtual address (each HPTE could be at a different virtual
  address).
  
  The difficulty comes when we no longer have the HPTE but we
  potentially have a stale TLB entry, due to having used tlbiel when we
  removed the HPTE.  Without the HPTE the only way to get rid of the
  stale TLB entry would be to completely flush all the TLB entries for
  the guest's LPID on every physical CPU it had ever run on.  Since I
  don't want to go to that much effort, what I am proposing, and what
  this patch implements, is to not ever use tlbiel when removing HPTEs
  in SMP guests on POWER7.
  
  In other words, what this patch is about is making sure we don't get
  these troublesome stale TLB entries.
 
 I see. You could keep a list of to-be-flushed VAs around that you could skim 
 through when taking a page away from the guest. That way you make the fast 
 case fast (add/remove of page from the guest) and the slow path slow (paging).

Yes, I thought about that, but the problem is that the list of VAs
could get arbitrarily long and take up a lot of host memory.

 But I'm fine with disallowing local flushes on remove completely for now. It 
 would be nice to get performance data on how much this would be a net win 
 though. There are certainly ways of keeping local flushes alive with the 
 scheme above.

Yes, I definitely want to get some good performance data to see how
much of a win it would be, and if there is a good win, work out some
scheme to let us use the local flushes.

 Thanks, applied to kvm-ppc-next.

Thanks,
Paul.


Re: [PATCH 3/3] KVM: x86: improve reexecute_instruction

2012-11-26 Thread Marcelo Tosatti
On Tue, Nov 20, 2012 at 07:59:53AM +0800, Xiao Guangrong wrote:
 The current reexecute_instruction can not well detect the failed instruction
 emulation. It allows guest to retry all the instructions except it accesses
 on error pfn.
 
 For example, some cases are nested-write-protect - if the page we want to
 write is used as PDE but it chains to itself. Under this case, we should
 stop the emulation and report the case to userspace.
 
 Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
 ---
  arch/x86/include/asm/kvm_host.h |2 +
  arch/x86/kvm/paging_tmpl.h  |2 +
  arch/x86/kvm/x86.c  |   54 
 ---
  3 files changed, 43 insertions(+), 15 deletions(-)
 
 diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
 index b2e11f4..c5eb52f 100644
 --- a/arch/x86/include/asm/kvm_host.h
 +++ b/arch/x86/include/asm/kvm_host.h
 @@ -566,6 +566,8 @@ struct kvm_arch {
   u64 hv_guest_os_id;
   u64 hv_hypercall;
 
 + /* synchronizing reexecute_instruction and page fault path. */
 + u64 page_fault_count;
   #ifdef CONFIG_KVM_MMU_AUDIT
   int audit_point;
   #endif
 diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
 index 891eb6d..d55ad89 100644
 --- a/arch/x86/kvm/paging_tmpl.h
 +++ b/arch/x86/kvm/paging_tmpl.h
 @@ -568,6 +568,8 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t 
 addr, u32 error_code,
  	if (mmu_notifier_retry(vcpu->kvm, mmu_seq))
   goto out_unlock;
 
  +	vcpu->kvm->arch.page_fault_count++;
 +
   kvm_mmu_audit(vcpu, AUDIT_PRE_PAGE_FAULT);
   kvm_mmu_free_some_pages(vcpu);
   if (!force_pt_level)
 diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
 index 5fe72cc..2fe484b 100644
 --- a/arch/x86/kvm/x86.c
 +++ b/arch/x86/kvm/x86.c
 @@ -4473,37 +4473,61 @@ static bool reexecute_instruction(struct kvm_vcpu 
 *vcpu, unsigned long cr2)
  {
   gpa_t gpa = cr2;
   pfn_t pfn;
 -
  -	if (!ACCESS_ONCE(vcpu->kvm->arch.indirect_shadow_pages))
 - return false;
 + u64 page_fault_count;
 + int emulate;
 
  	if (!vcpu->arch.mmu.direct_map) {
   gpa = kvm_mmu_gva_to_gpa_read(vcpu, cr2, NULL);
 + /*
 +  * If the mapping is invalid in guest, let cpu retry
 +  * it to generate fault.
 +  */
   if (gpa == UNMAPPED_GVA)
 - return true; /* let cpu generate fault */
 + return true;
   }
 
   /*
 -  * if emulation was due to access to shadowed page table
 -  * and it failed try to unshadow page and re-enter the
 -  * guest to let CPU execute the instruction.
 -  */
  -	if (kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(gpa)))
 - return true;
 -
 - /*
* Do not retry the unhandleable instruction if it faults on the
* readonly host memory, otherwise it will goto a infinite loop:
 * retry instruction -> write #PF -> emulation fail -> retry
 * instruction -> ...
*/
  	pfn = gfn_to_pfn(vcpu->kvm, gpa_to_gfn(gpa));
 - if (!is_error_noslot_pfn(pfn)) {
 - kvm_release_pfn_clean(pfn);
 +
 + /*
 +  * If the instruction failed on the error pfn, it can not be fixed,
 +  * report the error to userspace.
 +  */
 + if (is_error_noslot_pfn(pfn))
 + return false;
 +
 + kvm_release_pfn_clean(pfn);
 +
 + /* The instructions are well-emulated on direct mmu. */
  +	if (vcpu->arch.mmu.direct_map) {
  +		if (ACCESS_ONCE(vcpu->kvm->arch.indirect_shadow_pages))
  +			kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(gpa));
 +
   return true;
   }
 
 - return false;
 +again:
  +	page_fault_count = ACCESS_ONCE(vcpu->kvm->arch.page_fault_count);
 +
 + /*
 +  * if emulation was due to access to shadowed page table
 +  * and it failed try to unshadow page and re-enter the
 +  * guest to let CPU execute the instruction.
 +  */
  +	kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(gpa));
  +	emulate = vcpu->arch.mmu.page_fault(vcpu, cr3, PFERR_WRITE_MASK, false);

Can you explain what is the objective here?

 + /* The page fault path called above can increase the count. */
 + if (page_fault_count + 1 !=
  +		   ACCESS_ONCE(vcpu->kvm->arch.page_fault_count))
 + goto again;
 +
 + return !emulate;
  }
 
  static bool retry_instruction(struct x86_emulate_ctxt *ctxt,



Re: [PATCH 2/3] KVM: x86: let reexecute_instruction work for tdp

2012-11-26 Thread Marcelo Tosatti
On Tue, Nov 20, 2012 at 07:59:10AM +0800, Xiao Guangrong wrote:
 Currently, reexecute_instruction refused to retry all instructions. If
 nested npt is used, the emulation may be caused by shadow page, it can
 be fixed by dropping the shadow page
 
 Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
 ---
  arch/x86/kvm/x86.c |   14 --
  1 files changed, 8 insertions(+), 6 deletions(-)
 
 diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
 index 7be8452..5fe72cc 100644
 --- a/arch/x86/kvm/x86.c
 +++ b/arch/x86/kvm/x86.c
 @@ -4469,17 +4469,19 @@ static int handle_emulation_failure(struct kvm_vcpu 
 *vcpu)
   return r;
  }
 
 -static bool reexecute_instruction(struct kvm_vcpu *vcpu, gva_t gva)
 +static bool reexecute_instruction(struct kvm_vcpu *vcpu, unsigned long cr2)
  {
 - gpa_t gpa;
 + gpa_t gpa = cr2;
   pfn_t pfn;
 
 - if (tdp_enabled)
  +	if (!ACCESS_ONCE(vcpu->kvm->arch.indirect_shadow_pages))
   return false;

How is indirect_shadow_pages protected? Why is ACCESS_ONCE() being used
to read it?

 - gpa = kvm_mmu_gva_to_gpa_read(vcpu, gva, NULL);
 - if (gpa == UNMAPPED_GVA)
 - return true; /* let cpu generate fault */
  +	if (!vcpu->arch.mmu.direct_map) {
 + gpa = kvm_mmu_gva_to_gpa_read(vcpu, cr2, NULL);
 + if (gpa == UNMAPPED_GVA)
 + return true; /* let cpu generate fault */
 + }
 
   /*
* if emulation was due to access to shadowed page table
 -- 
 1.7.7.6


Re: Re: Re: Re: Re: [RFC PATCH 0/2] kvm/vmx: Output TSC offset

2012-11-26 Thread Marcelo Tosatti
On Mon, Nov 26, 2012 at 08:05:10PM +0900, Yoshihiro YUNOMAE wrote:
 500h. event tsc_write tsc_offset=-3000
 
 Then a guest trace containing events with a TSC timestamp.
 Which tsc_offset to use?
 
 (that is the problem, which unless i am mistaken can only be solved
 easily if the guest can convert RDTSC -> TSC of host).
 
 There are three following cases of changing TSC offset:
   1. Reset TSC at guest boot time
   2. Adjust TSC offset due to some host's problems
   3. Write TSC on guests
 The scenario which you mentioned is case 3, so we'll discuss this case.
 Here, we assume that a guest is allocated single CPU for the sake of
 ease.
 
 If a guest executes write_tsc, TSC values jumps to forward or backward.
 For the forward case, trace data are as follows:
 
 host                          guest
 cycles   events               cycles   events
  3000    tsc_offset=-2950
  3001    kvm_enter
                                 53     eventX

                                100     (write_tsc=+900)
  3060    kvm_exit
  3075    tsc_offset=-2050
  3080    kvm_enter
                               1050     event1
                               1055     event2
                                        ...
 
 
 This case is simple. The guest TSC of the first kvm_enter is calculated
 as follows:
 
(host TSC of kvm_enter) + (current tsc_offset) = 3001 - 2950 = 51
 
 Similarly, the guest TSC of the second kvm_enter is 130. So, the guest
 events between 51 and 130, that is, 53 eventX is inserted between the
 first pair of kvm_enter and kvm_exit. To insert events of the guests
 between 51 and 130, we convert the guest TSC to the host TSC using TSC
 offset 2950.
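
(A small sketch, not from the patches, of the conversion rule used in this
calculation: guest_tsc = host_tsc + tsc_offset, so a guest timestamp is placed
on the host timeline as host_tsc = guest_tsc - tsc_offset, using whichever
tsc_offset was in effect when the guest event was recorded.)

#include <stdint.h>

/* Illustrative helper: map a guest TSC timestamp onto the host timeline. */
static uint64_t guest_to_host_tsc(uint64_t guest_tsc, int64_t tsc_offset)
{
	return guest_tsc - (uint64_t)tsc_offset;
}

/* Example from the trace above: eventX at guest cycle 53 with
 * tsc_offset=-2950 maps to host cycle 3003, between kvm_enter (3001)
 * and kvm_exit (3060). */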
 
 For the backward case, trace data are as follows:
 
 host                          guest
 cycles   events               cycles   events
  3000    tsc_offset=-2950
  3001    kvm_enter
                                 53     eventX

                                100     (write_tsc=-50)
  3060    kvm_exit
  3075    tsc_offset=-2050
  3080    kvm_enter
                                 90     event1
                                 95     event2
                                        ...

  3400                          100     (write_tsc=-50)

                                 90     event3
                                 95     event4
 
 As you say, in this case, the previous method is invalid. When we
 calculate the guest TSC value for the tsc_offset=-3000 event, the value
 is 75 on the guest. This seems like prior event of write_tsc=-50 event.
 So, we need to consider more.
 
 In this case, it is important that we can understand where the guest
 executes write_tsc or the host rewrites the TSC offset. write_tsc on
 the guest equals wrmsr 0x0010, so this instruction induces vm_exit.
 This implies that the guest does not operate when the host changes TSC
 offset on the cpu. In other words, the guest cannot use new TSC before
 the host rewrites the new TSC offset. So, if timestamp on the guest is
 not monotonically increased, we can understand the guest executes
 write_tsc. Moreover, in the region where timestamp is decreasing, we
 can understand when the host rewrote the TSC offset in the guest trace
 data. Therefore, we can sort trace data in chronological order.
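A rough sketch of that ordering rule (illustrative only; assumes the
<stdint.h> types and that only backward jumps mark a write_tsc, as
described above; a forward jump keeps using the current offset):

    struct guest_event {
            uint64_t guest_tsc;     /* timestamp recorded in the guest */
            uint64_t host_tsc;      /* filled in: same instant in host TSC */
    };

    /* offsets[] holds the tsc_offset values from the host's
     * write_tsc_offset events, in the order they were recorded. */
    static void assign_host_tsc(struct guest_event *ev, int nev,
                                const int64_t *offsets, int noffsets)
    {
            int cur = 0;
            uint64_t prev = 0;
            int i;

            for (i = 0; i < nev; i++) {
                    /* a backwards jump in guest time means the guest
                     * executed write_tsc, so switch to the next offset */
                    if (ev[i].guest_tsc < prev && cur < noffsets - 1)
                            cur++;
                    ev[i].host_tsc = ev[i].guest_tsc - offsets[cur];
                    prev = ev[i].guest_tsc;
            }
    }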
 
 This requires an entire trace of events. That is, to be able
 to reconstruct timeline you require the entire trace from the moment
 guest starts. So that you can correlate wrmsr-to-tsc on the guest with
 vmexit-due-to-tsc-write on the host.
 
 Which means that running out of space for trace buffer equals losing
 ability to order events.
 
 Is that desirable? It seems cumbersome to me.
 
 As you say, tracing events can overwrite important events like
 kvm_exit/entry or write_tsc_offset. So, Steven's multiple buffer is
 needed by this feature. Normal events which often hit record the buffer
 A, and important events which rarely hit record the buffer B. In our
 case, the important event is write_tsc_offset.
 Also the need to correlate each write_tsc event in the guest trace
 with a corresponding tsc_offset write in the host trace means that it
 is _necessary_ for the guest and host to enable tracing simultaneously.
 Correct?
 
 Also, there are WRMSR executions in the guest for which there is
 no event in the trace buffer. From SeaBIOS, during boot.
 In that case, there is no explicit event in the guest trace which you
 can correlate with tsc_offset changes in the host side.
 
 I understand what you want to say, but we don't correlate between
 write_tsc event and write_tsc_offset event directly. This is because
 the write_tsc tracepoint (also WRMSR instruction) is not prepared in
 the current kernel. So, in the previous mail
 (https://lkml.org/lkml/2012/11/22/53), I suggested the method which we
 don't need to prepare the write_tsc tracepoint.
 
 In the method, we enable ftrace before the guest boots, and we need to
 keep all 

Re: [PATCH v2] KVM: PPC: Book3S HV: Handle guest-caused machine checks on POWER7 without panicking

2012-11-26 Thread Paul Mackerras
On Tue, Nov 27, 2012 at 12:16:28AM +0100, Alexander Graf wrote:
 
 On 24.11.2012, at 09:37, Paul Mackerras wrote:
 
  Currently, if a machine check interrupt happens while we are in the
  guest, we exit the guest and call the host's machine check handler,
  which tends to cause the host to panic.  Some machine checks can be
  triggered by the guest; for example, if the guest creates two entries
  in the SLB that map the same effective address, and then accesses that
  effective address, the CPU will take a machine check interrupt.
  
  To handle this better, when a machine check happens inside the guest,
  we call a new function, kvmppc_realmode_machine_check(), while still in
  real mode before exiting the guest.  On POWER7, it handles the cases
  that the guest can trigger, either by flushing and reloading the SLB,
  or by flushing the TLB, and then it delivers the machine check interrupt
  directly to the guest without going back to the host.  On POWER7, the
  OPAL firmware patches the machine check interrupt vector so that it
  gets control first, and it leaves behind its analysis of the situation
  in a structure pointed to by the opal_mc_evt field of the paca.  The
  kvmppc_realmode_machine_check() function looks at this, and if OPAL
  reports that there was no error, or that it has handled the error, we
  also go straight back to the guest with a machine check.  We have to
  deliver a machine check to the guest since the machine check interrupt
  might have trashed valid values in SRR0/1.
  
  If the machine check is one we can't handle in real mode, and one that
  OPAL hasn't already handled, or on PPC970, we exit the guest and call
  the host's machine check handler.  We do this by jumping to the
  machine_check_fwnmi label, rather than absolute address 0x200, because
  we don't want to re-execute OPAL's handler on POWER7.  On PPC970, the
  two are equivalent because address 0x200 just contains a branch.
  
  Then, if the host machine check handler decides that the system can
  continue executing, kvmppc_handle_exit() delivers a machine check
  interrupt to the guest -- once again to let the guest know that SRR0/1
  have been modified.
  
  Signed-off-by: Paul Mackerras pau...@samba.org
 
 Thanks for the semantic explanations :). From that POV things are clear and 
 good with me now. That leaves only checkpatch ;)
 
 
 WARNING: please, no space before tabs
 #142: FILE: arch/powerpc/kvm/book3s_hv_ras.c:21:
 +#define SRR1_MC_IFETCH_SLBMULTI ^I3^I/* SLB multi-hit */$
 
 WARNING: please, no space before tabs
 #143: FILE: arch/powerpc/kvm/book3s_hv_ras.c:22:
 +#define SRR1_MC_IFETCH_SLBPARMULTI ^I4^I/* SLB parity + multi-hit */$
 
 WARNING: min() should probably be min_t(u32, slb-persistent, SLB_MIN_SIZE)
 #168: FILE: arch/powerpc/kvm/book3s_hv_ras.c:47:
 + n = min(slb-persistent, (u32) SLB_MIN_SIZE);
 
 total: 0 errors, 3 warnings, 357 lines checked

Phooey.  Do you want me to resubmit the patch, or will you fix it up?

Paul.


Re: [PATCH 3/5] KVM: PPC: Book3S HV: Improve handling of local vs. global TLB invalidations

2012-11-26 Thread Alexander Graf

On 27.11.2012, at 00:16, Paul Mackerras wrote:

 On Mon, Nov 26, 2012 at 11:03:19PM +0100, Alexander Graf wrote:
 
 On 26.11.2012, at 22:48, Paul Mackerras wrote:
 
 On Mon, Nov 26, 2012 at 02:10:33PM +0100, Alexander Graf wrote:
 
 On 23.11.2012, at 23:07, Paul Mackerras wrote:
 
 On Fri, Nov 23, 2012 at 04:43:03PM +0100, Alexander Graf wrote:
 
 On 22.11.2012, at 10:28, Paul Mackerras wrote:
 
 - With the possibility of the host paging out guest pages, the use of
 H_LOCAL by an SMP guest is dangerous since the guest could possibly
 retain and use a stale TLB entry pointing to a page that had been
 removed from the guest.
 
 I don't understand this part. Don't we flush the TLB when the page gets 
 evicted from the shadow HTAB?
 
 The H_LOCAL flag is something that we invented to allow the guest to
 tell the host I only ever used this translation (HPTE) on the current
 vcpu when it's removing or modifying an HPTE.  The idea is that that
 would then let the host use the tlbiel instruction (local TLB
 invalidate) rather than the usual global tlbie instruction.  Tlbiel is
 faster because it doesn't need to go out on the fabric and get
 processed by all cpus.  In fact our guests don't use it at present,
 but we put it in because we thought we should be able to get a
 performance improvement, particularly on large machines.
 
 However, the catch is that the guest's setting of H_LOCAL might be
 incorrect, in which case we could have a stale TLB entry on another
 physical cpu.  While the physical page that it refers to is still
 owned by the guest, that stale entry doesn't matter from the host's
 point of view.  But if the host wants to take that page away from the
 guest, the stale entry becomes a problem.
 
 That's exactly where my question lies. Does that mean we don't flush the 
 TLB entry regardless when we take the page away from the guest?
 
 The question is how to find the TLB entry if the HPTE it came from is
 no longer present.  Flushing a TLB entry requires a virtual address.
 When we're taking a page away from the guest we have the real address
 of the page, not the virtual address.  We can use the reverse-mapping
 chains to loop through all the HPTEs that map the page, and from each
 HPTE we can (and do) calculate a virtual address and do a TLBIE on
 that virtual address (each HPTE could be at a different virtual
 address).
 
 The difficulty comes when we no longer have the HPTE but we
 potentially have a stale TLB entry, due to having used tlbiel when we
 removed the HPTE.  Without the HPTE the only way to get rid of the
 stale TLB entry would be to completely flush all the TLB entries for
 the guest's LPID on every physical CPU it had ever run on.  Since I
 don't want to go to that much effort, what I am proposing, and what
 this patch implements, is to not ever use tlbiel when removing HPTEs
 in SMP guests on POWER7.
 
 In other words, what this patch is about is making sure we don't get
 these troublesome stale TLB entries.
 
 I see. You could keep a list of to-be-flushed VAs around that you could skim 
 through when taking a page away from the guest. That way you make the fast 
 case fast (add/remove of page from the guest) and the slow path slow 
 (paging).
 
 Yes, I thought about that, but the problem is that the list of VAs
 could get arbitrarily long and take up a lot of host memory.

You can always cap it at an arbitrary number, similar to how the TLB itself is 
limited too.


Alex
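A rough sketch of that capped-list idea (purely illustrative, not from any
patch): remember up to N virtual addresses whose TLB entries were only
invalidated with tlbiel, and fall back to a full flush for the guest's LPID
once the list overflows.

    #define MAX_STALE_VAS   64              /* arbitrary cap */

    struct stale_tlb_list {
            unsigned long va[MAX_STALE_VAS];
            unsigned int n;
            bool overflowed;        /* cap exceeded: full LPID flush needed */
    };

    static void note_local_invalidate(struct stale_tlb_list *s, unsigned long va)
    {
            if (s->n < MAX_STALE_VAS)
                    s->va[s->n++] = va;
            else
                    s->overflowed = true;
    }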



Re: [PATCH v2] KVM: PPC: Book3S HV: Handle guest-caused machine checks on POWER7 without panicking

2012-11-26 Thread Alexander Graf

On 27.11.2012, at 00:18, Paul Mackerras wrote:

 On Tue, Nov 27, 2012 at 12:16:28AM +0100, Alexander Graf wrote:
 
 On 24.11.2012, at 09:37, Paul Mackerras wrote:
 
 Currently, if a machine check interrupt happens while we are in the
 guest, we exit the guest and call the host's machine check handler,
 which tends to cause the host to panic.  Some machine checks can be
 triggered by the guest; for example, if the guest creates two entries
 in the SLB that map the same effective address, and then accesses that
 effective address, the CPU will take a machine check interrupt.
 
 To handle this better, when a machine check happens inside the guest,
 we call a new function, kvmppc_realmode_machine_check(), while still in
 real mode before exiting the guest.  On POWER7, it handles the cases
 that the guest can trigger, either by flushing and reloading the SLB,
 or by flushing the TLB, and then it delivers the machine check interrupt
 directly to the guest without going back to the host.  On POWER7, the
 OPAL firmware patches the machine check interrupt vector so that it
 gets control first, and it leaves behind its analysis of the situation
 in a structure pointed to by the opal_mc_evt field of the paca.  The
 kvmppc_realmode_machine_check() function looks at this, and if OPAL
 reports that there was no error, or that it has handled the error, we
 also go straight back to the guest with a machine check.  We have to
 deliver a machine check to the guest since the machine check interrupt
 might have trashed valid values in SRR0/1.
 
 If the machine check is one we can't handle in real mode, and one that
 OPAL hasn't already handled, or on PPC970, we exit the guest and call
 the host's machine check handler.  We do this by jumping to the
 machine_check_fwnmi label, rather than absolute address 0x200, because
 we don't want to re-execute OPAL's handler on POWER7.  On PPC970, the
 two are equivalent because address 0x200 just contains a branch.
 
 Then, if the host machine check handler decides that the system can
 continue executing, kvmppc_handle_exit() delivers a machine check
 interrupt to the guest -- once again to let the guest know that SRR0/1
 have been modified.
 
 Signed-off-by: Paul Mackerras pau...@samba.org
 
 Thanks for the semantic explanations :). From that POV things are clear and 
 good with me now. That leaves only checkpatch ;)
 
 
 WARNING: please, no space before tabs
 #142: FILE: arch/powerpc/kvm/book3s_hv_ras.c:21:
 +#define SRR1_MC_IFETCH_SLBMULTI ^I3^I/* SLB multi-hit */$
 
 WARNING: please, no space before tabs
 #143: FILE: arch/powerpc/kvm/book3s_hv_ras.c:22:
 +#define SRR1_MC_IFETCH_SLBPARMULTI ^I4^I/* SLB parity + multi-hit */$
 
 WARNING: min() should probably be min_t(u32, slb-persistent, SLB_MIN_SIZE)
 #168: FILE: arch/powerpc/kvm/book3s_hv_ras.c:47:
 +n = min(slb-persistent, (u32) SLB_MIN_SIZE);
 
 total: 0 errors, 3 warnings, 357 lines checked
 
 Phooey.  Do you want me to resubmit the patch, or will you fix it up?

Hrm. Promise to run checkpatch yourself next time and I'll fix it up for you 
this time ;)


Alex



Re: [PATCH 0/4] AER-KVM: Error containment of PCI pass-thru devices assigned to KVM guests

2012-11-26 Thread Marcelo Tosatti
On Tue, Nov 20, 2012 at 02:09:46PM +, Pandarathil, Vijaymohan R wrote:
 
 
  -Original Message-
  From: Stefan Hajnoczi [mailto:stefa...@gmail.com]
  Sent: Tuesday, November 20, 2012 5:41 AM
  To: Pandarathil, Vijaymohan R
  Cc: kvm@vger.kernel.org; linux-...@vger.kernel.org; qemu-de...@nongnu.org;
  linux-ker...@vger.kernel.org
  Subject: Re: [PATCH 0/4] AER-KVM: Error containment of PCI pass-thru
  devices assigned to KVM guests
  
  On Tue, Nov 20, 2012 at 06:31:48AM +, Pandarathil, Vijaymohan R wrote:
   Add support for error containment when a PCI pass-thru device assigned to
  a KVM
   guest encounters an error. This is for PCIe devices/drivers that support
  AER
   functionality. When the OS is notified of an error in a device either
   through the firmware first approach or through an interrupt handled by
  the AER
   root port driver, concerned subsystems are notified by invoking callbacks
   registered by these subsystems. The device is also marked as tainted till
  the
   corresponding driver recovery routines are successful.
  
   KVM module registers for a notification of such errors. In the KVM
  callback
   routine, a global counter is incremented to keep track of the error
   notification. Before each CPU enters guest mode to execute guest code,
   appropriate checks are done to see if the impacted device belongs to the
  guest
   or not. If the device belongs to the guest, qemu hypervisor for the guest
  is
   informed and the guest is immediately brought down, thus preventing or
   minimizing chances of any bad data being written out by the guest driver
   after the device has encountered an error.
  
  I'm surprised that the hypervisor would shut down the guest when PCIe
  AER kicks in for a pass-through device.  Shouldn't we pass the AER event
  into the guest and deal with it there?
 
 Agreed. That would be the ideal behavior and is planned in a future patch.
 Lack of control over the capabilities/type of the OS/drivers running in 
 the guest is also a concern in passing along the event to the guest.
 
 My understanding is that in the current implementation of Linux/KVM, these 
 errors are not handled at all and can potentially cause a guest hang or 
 crash or even data corruption depending on the implementation of the guest
 driver for the device. As a first step, these patches make the behavior 
 better by doing error containment with a predictable behavior when such
 errors occur. 

For both ACPI notifications and Linux PCI AER driver there is a way for
the PCI driver to receive a notification, correct?

Can just have virt/kvm/assigned-dev.c code register such a notifier (as
a PCI driver) and then perform appropriate action?
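For reference, the hook a PCI driver can register for such notifications
looks roughly like this (a sketch only; the kvm_assigned_* names are
illustrative, not existing code):

    static pci_ers_result_t kvm_assigned_error_detected(struct pci_dev *pdev,
                                                enum pci_channel_state state)
    {
            /* mark the assigned device as failed / notify userspace here */
            return PCI_ERS_RESULT_NEED_RESET;
    }

    static const struct pci_error_handlers kvm_assigned_err_handlers = {
            .error_detected = kvm_assigned_error_detected,
    };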

Also the semantics of tainted driver is not entirely clear.

Is there any reason for not having this feature for VFIO only, as KVM
device assignment is being phased out?



Re: [PATCH] x86, kvm: Remove incorrect redundant assembly constraint

2012-11-26 Thread Marcelo Tosatti
On Mon, Nov 26, 2012 at 02:48:50PM -0800, H. Peter Anvin wrote:
 On 11/25/2012 11:22 PM, Paolo Bonzini wrote:
  On 21/11/2012 23:41, H. Peter Anvin wrote:
  From: H. Peter Anvin h...@linux.intel.com
 
  In __emulate_1op_rax_rdx, we use +a and +d which are input/output
  constraints, and *then* use a and d as input constraints.  This is
  incorrect, but happens to work on some versions of gcc.
 
  However, it breaks gcc with -O0 and icc, and may break on future
  versions of gcc.
 
  Reported-and-tested-by: Melanie Blower melanie.blo...@intel.com
  Signed-off-by: H. Peter Anvin h...@linux.intel.com
  Link: 
  http://lkml.kernel.org/r/b3584e72cfebed439a3eca9bce67a4ef1b17a...@fmsmsx107.amr.corp.intel.com
  ---
   arch/x86/kvm/emulate.c | 3 +--
   1 file changed, 1 insertion(+), 2 deletions(-)
 
  diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
  index 39171cb..bba39bf 100644
  --- a/arch/x86/kvm/emulate.c
  +++ b/arch/x86/kvm/emulate.c
  @@ -426,8 +426,7 @@ static void invalidate_registers(struct 
  x86_emulate_ctxt *ctxt)
 _ASM_EXTABLE(1b, 3b)\
  			: "=m" ((ctxt)->eflags), "=&r" (_tmp),  \
  			  "+a" (*rax), "+d" (*rdx), "+qm"(_ex)  \
   -			: "i" (EFLAGS_MASK), "m" ((ctxt)->src.val), \
   -			  "a" (*rax), "d" (*rdx));  \
   +			: "i" (EFLAGS_MASK), "m" ((ctxt)->src.val));\
 } while (0)
   
   /* instruction has only one source operand, destination is implicit (e.g. 
  mul, div, imul, idiv) */
 
  
  Reviewed-by: Paolo Bonzini pbonz...@redhat.com
  
 
 Gleb, Marcelo: are you going to apply this or would you prefer I took it
 in x86/urgent?
 
   -hpa

Feel free to merge it through x86/urgent.



Re: [PATCH] x86, kvm: Remove incorrect redundant assembly constraint

2012-11-26 Thread H. Peter Anvin
On 11/26/2012 03:48 PM, Marcelo Tosatti wrote:

 Gleb, Marcelo: are you going to apply this or would you prefer I took it
 in x86/urgent?

  -hpa
 
 Feel free to merge it through x86/urgent.
 

I presume that's an Acked-by?

-hpa



Re: [PATCH v2] KVM: PPC: Book3S HV: Handle guest-caused machine checks on POWER7 without panicking

2012-11-26 Thread Paul Mackerras
On Tue, Nov 27, 2012 at 12:20:08AM +0100, Alexander Graf wrote:
 Hrm. Promise to run checkpatch yourself next time and I'll fix it up for you 
 this time ;)

OK, will do, thanks. :)

Paul.


Re: [PATCH] vfio powerpc: enabled and supported on powernv platform

2012-11-26 Thread Benjamin Herrenschmidt
On Mon, 2012-11-26 at 11:04 -0700, Alex Williamson wrote:
 Ok, I see tces are put on shutdown via tce_iommu_detach_group, so you're
 more concerned about the guest simply mapping over top of it's own
 mappings.  Is that common?  Is it common enough for every multi-page
 mapping to assume it will happen?  I know this is a performance
 sensitive path for you and it seems like a map-only w/ fallback to
 unmap, remap would be better in the general case.
 
 On x86 we do exactly that, but we do the unmap, remap from userspace
 when we get an EBUSY.  Thanks, 

Right, Linux as guest at least will never map over an existing
mapping. It will always unmap first. IE. The only transition we do on
H_PUT_TCE are 0 -> valid and valid -> 0.

So it would be fine to simplify the code and keep the map over map as
a slow fallback. I can't tell for other operating systems but we don't
care about those at this point :-)

Cheers,
Ben.




Re: The smp_affinity cannot work correctly on guest os when PCI passthrough device using msi/msi-x with KVM

2012-11-26 Thread yi li
Alex,

Thanks for your reply, and i will check it agiain with msi-x.

YiLi

2012/11/27 Alex Williamson alex.william...@redhat.com:
 On Tue, 2012-11-27 at 00:47 +0800, yi li wrote:
 hi Alex,

 the qemu-kvm version 1.2.

 And is the device making use of MSI-X or MSI interrupts.  MSI-X should
 work on 1.2, MSI does not yet support vector updates for affinity, but
 patches are welcome.  Thanks,

 Alex

 2012/11/26 Alex Williamson alex.william...@redhat.com:
  On Fri, 2012-11-23 at 11:06 +0800, yi li wrote:
  Hi Guys,
 
  there is an issue where smp_affinity cannot work correctly on the guest
  os when a PCI passthrough device is using msi/msi-x with KVM.
 
  My reason:
  the pcpu will generate a lot of ipi interrupts to find the vcpu to handle the
  irq.  so the guest os will VM_EXIT frequently. right?
 
  if smp_affinity can work correctly on guest os,  the best way is that
  the vcpu handle the irq is cputune at the pcpu which handle the
  kvm:pci-bus irq on the host.but  unfortunly, i find that smp_affinity
  can not work correctly on guest os when msi/msi-x.
 
  how to reproduce:
  1: passthrough a netcard (Brodcom BCM5716S) to the guest os
 
  2: ifup the netcard, the card will use msi-x interrupt default, and close 
  the
  irqbalance service
 
  3:  echo 4 > /proc/irq/NETCARDIRQ/smp_affinity, so we assume the vcpu2
  handle the irq.
 
  4: we have set <vcpupin vcpu='2' cpuset='1'/> and set the irq kvm:pci-bus 
  to
  the pcpu1 on the host.
 
  we think this configure will reduce the ipi interrupt when inject 
  interrupt to
  the guest os. but this irq is not only handle on vcpu2. maybe it is
  not our expect。
 
  What version of qemu-kvm/qemu are you using?  There's been some work
  recently specifically to enable this.  Thanks,
 
  Alex
 





Re: [PATCH] x86, kvm: Remove incorrect redundant assembly constraint

2012-11-26 Thread Marcelo Tosatti
On Mon, Nov 26, 2012 at 03:49:36PM -0800, H. Peter Anvin wrote:
 On 11/26/2012 03:48 PM, Marcelo Tosatti wrote:
 
  Gleb, Marcelo: are you going to apply this or would you prefer I took it
  in x86/urgent?
 
 -hpa
  
  Feel free to merge it through x86/urgent.
  
 
 I presume that's an Acked-by?
 
   -hpa

Yes.



[Bug 50891] The smp_affinity cannot work correctly on guest os when PCI passthrough device using msi/msi-x with KVM

2012-11-26 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=50891





--- Comment #2 from liyi yiliker...@gmail.com  2012-11-27 00:57:28 ---
sorry, i did not state it clearly.

1: i am sure the device is using MSI-X, and the test failed.
   check the attribute: entry->msi_attrib.is_msix is 1.

pls note, the qemu kvm version is 1.2. 
also, when using the virtio driver, i find the netcard uses msi-x, but
the test is ok.
and i have tested intel 82599 SR-IOV, passthrough of the VF to the guest os; the
test failed using msi-x, same as with the BCM5716S.

-- 
Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email
--- You are receiving this mail because: ---
You are on the CC list for the bug.
You are watching the assignee of the bug.


[PATCH V4 0/2] Enable guest use of TSC_ADJUST functionality

2012-11-26 Thread Will Auld

This revision, V4, addresses a couple of issues I missed from Gleb and Marcelo.

Thanks, Will

Will Auld (2):
  Add code to track call origin for msr assignment.
  Enabling IA32_TSC_ADJUST for KVM guest VM support

 arch/x86/include/asm/cpufeature.h |  1 +
 arch/x86/include/asm/kvm_host.h   | 15 ++---
 arch/x86/include/asm/msr-index.h  |  1 +
 arch/x86/kvm/cpuid.c  |  2 ++
 arch/x86/kvm/cpuid.h  |  8 +++
 arch/x86/kvm/svm.c| 28 ++--
 arch/x86/kvm/vmx.c| 33 ++--
 arch/x86/kvm/x86.c| 45 ---
 arch/x86/kvm/x86.h|  2 +-
 9 files changed, 110 insertions(+), 25 deletions(-)

-- 
1.8.0.rc0





Re: [PATCH v8 1/2] x86/kexec: add a new atomic notifier list for kdump

2012-11-26 Thread Zhang Yanfei
On 2012-11-27 02:18, Eric W. Biederman wrote:
 Gleb Natapov g...@redhat.com writes:
 
 On Mon, Nov 26, 2012 at 11:43:10AM -0600, Eric W. Biederman wrote:
 Gleb Natapov g...@redhat.com writes:

 On Mon, Nov 26, 2012 at 09:08:54AM -0600, Eric W. Biederman wrote:
 Zhang Yanfei zhangyan...@cn.fujitsu.com writes:

 This patch adds an atomic notifier list named crash_notifier_list.
 Currently, when loading kvm-intel module, a notifier will be registered
 in the list to enable vmcss loaded on all cpus to be VMCLEAR'd if
 needed.

 crash_notifier_list ick gag please no.  Effectively this makes the kexec
 on panic code path undebuggable.

 Instead we need to use direct function calls to whatever you are doing.

 The code walks linked list in kvm-intel module and calls vmclear on
 whatever it finds there. Since the function have to resides in kvm-intel
 module it cannot be called directly. Is callback pointer that is set
 by kvm-intel more acceptable?

 Yes a specific callback function is more acceptable.  Looking a little
 deeper vmclear_local_loaded_vmcss is not particularly acceptable. It is
 doing a lot of work that is unnecessary to save the virtual registers
 on the kexec on panic path.

 What work are you referring to in particular that may not be
 acceptable?
 
 The unnecessary work that I was see is all of the software state
 changing.  Unlinking things from linked lists flipping variables.
 None of that appears related to the fundamental issue saving cpu
 state.
 
 Simply reusing a function that does more than what is strictly required
 makes me nervous.  What is the chance that the function will grow
 with maintenance and add constructs that are not safe in a kexec on
 panic situtation.

So in summary,

1. a specific callback function instead of a notifier?

2. Instead of calling vmclear_local_loaded_vmcss, the vmclear operation
   will just call the vmclear on every vmcss loaded on the cpu?

   like below:

   static void crash_vmclear_local_loaded_vmcss(void)
   {
int cpu = raw_smp_processor_id();
struct loaded_vmcs *v, *n;

if (!crash_local_vmclear_enabled(cpu))
return;

 list_for_each_entry_safe(v, n, &per_cpu(loaded_vmcss_on_cpu, cpu),
  loaded_vmcss_on_cpu_link)
 vmcs_clear(v->vmcs);
   }

   right?

Thanks
Zhang
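A rough sketch of the callback-pointer wiring being discussed (names are
illustrative, not the final patch): kvm-intel installs the function, and the
crash path calls it only if set.

    /* kexec side: a single hook, NULL when kvm-intel is not loaded */
    void (*crash_vmclear_loaded_vmcss)(void);

    static void crash_save_vmcs_state(void)
    {
            if (crash_vmclear_loaded_vmcss)
                    crash_vmclear_loaded_vmcss();
    }

    /* kvm-intel side: install on module init, clear on module exit, e.g.
     *   crash_vmclear_loaded_vmcss = crash_vmclear_local_loaded_vmcss;
     *   ...
     *   crash_vmclear_loaded_vmcss = NULL;
     */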

 
 In fact I wonder if it might not just be easier to call vmcs_clear to a
 fixed per cpu buffer.

 There may be more than one vmcs loaded on a cpu, hence the list.

 Performing list walking in interrupt context without locking in
 vmclear_local_loaded vmcss looks a bit scary.  Not that locking would
 make it any better, as locking would simply add one more way to deadlock
 the system.  Only an rcu list walk is at all safe.  A list walk that
 modifies the list as vmclear_local_loaded_vmcss does is definitely not safe.

 The list vmclear_local_loaded walks is per cpu. Zhang's kvm patch
 disables kexec callback while list is modified.
 
 If the list is only modified on it's cpu and we are running on that cpu
 that does look like it will give the necessary protections.  It isn't
 particularly clear at first glance that is the case unfortunately.
 
 Eric
 



[PATCH V4 1/2] Add code to track call origin for msr assignment.

2012-11-26 Thread Will Auld
In order to track who initiated the call (host or guest) to modify an msr
value I have changed function call parameters along the call path. The
specific change is to add a struct pointer parameter that points to (index,
data, caller) information rather than having this information passed as
individual parameters.

The initial use for this capability is for updating the IA32_TSC_ADJUST
msr while setting the tsc value. It is anticipated that this capability
is useful for other tasks.

Signed-off-by: Will Auld will.a...@intel.com
---
 arch/x86/include/asm/kvm_host.h | 12 +---
 arch/x86/kvm/svm.c  | 21 +++--
 arch/x86/kvm/vmx.c  | 24 +---
 arch/x86/kvm/x86.c  | 23 +++
 arch/x86/kvm/x86.h  |  2 +-
 5 files changed, 57 insertions(+), 25 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 09155d6..da34027 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -598,6 +598,12 @@ struct kvm_vcpu_stat {
 
 struct x86_instruction_info;
 
+struct msr_data {
+bool host_initiated;
+u32 index;
+u64 data;
+};
+
 struct kvm_x86_ops {
int (*cpu_has_kvm_support)(void);  /* __init */
int (*disabled_by_bios)(void); /* __init */
@@ -621,7 +627,7 @@ struct kvm_x86_ops {
void (*set_guest_debug)(struct kvm_vcpu *vcpu,
struct kvm_guest_debug *dbg);
int (*get_msr)(struct kvm_vcpu *vcpu, u32 msr_index, u64 *pdata);
-   int (*set_msr)(struct kvm_vcpu *vcpu, u32 msr_index, u64 data);
+   int (*set_msr)(struct kvm_vcpu *vcpu, struct msr_data *msr);
u64 (*get_segment_base)(struct kvm_vcpu *vcpu, int seg);
void (*get_segment)(struct kvm_vcpu *vcpu,
struct kvm_segment *var, int seg);
@@ -772,7 +778,7 @@ static inline int emulate_instruction(struct kvm_vcpu *vcpu,
 
 void kvm_enable_efer_bits(u64);
 int kvm_get_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 *data);
-int kvm_set_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 data);
+int kvm_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr);
 
 struct x86_emulate_ctxt;
 
@@ -799,7 +805,7 @@ void kvm_get_cs_db_l_bits(struct kvm_vcpu *vcpu, int *db, 
int *l);
 int kvm_set_xcr(struct kvm_vcpu *vcpu, u32 index, u64 xcr);
 
 int kvm_get_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata);
-int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data);
+int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr);
 
 unsigned long kvm_get_rflags(struct kvm_vcpu *vcpu);
 void kvm_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags);
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index baead95..5ac11f0 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -1211,6 +1211,7 @@ static struct kvm_vcpu *svm_create_vcpu(struct kvm *kvm, 
unsigned int id)
struct page *msrpm_pages;
struct page *hsave_page;
struct page *nested_msrpm_pages;
+   struct msr_data msr;
int err;
 
svm = kmem_cache_zalloc(kvm_vcpu_cache, GFP_KERNEL);
@@ -1255,7 +1256,10 @@ static struct kvm_vcpu *svm_create_vcpu(struct kvm *kvm, 
unsigned int id)
	svm->vmcb_pa = page_to_pfn(page) << PAGE_SHIFT;
	svm->asid_generation = 0;
	init_vmcb(svm);
-	kvm_write_tsc(&svm->vcpu, 0);
+	msr.data = 0x0;
+	msr.index = MSR_IA32_TSC;
+	msr.host_initiated = true;
+	kvm_write_tsc(&svm->vcpu, &msr);
 
	err = fx_init(&svm->vcpu);
if (err)
@@ -3147,13 +3151,15 @@ static int svm_set_vm_cr(struct kvm_vcpu *vcpu, u64 
data)
return 0;
 }
 
-static int svm_set_msr(struct kvm_vcpu *vcpu, unsigned ecx, u64 data)
+static int svm_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr)
 {
struct vcpu_svm *svm = to_svm(vcpu);
 
+	u32 ecx = msr->index;
+	u64 data = msr->data;
switch (ecx) {
case MSR_IA32_TSC:
-   kvm_write_tsc(vcpu, data);
+   kvm_write_tsc(vcpu, msr);
break;
case MSR_STAR:
	svm->vmcb->save.star = data;
@@ -3208,20 +3214,23 @@ static int svm_set_msr(struct kvm_vcpu *vcpu, unsigned 
ecx, u64 data)
vcpu_unimpl(vcpu, unimplemented wrmsr: 0x%x data 0x%llx\n, 
ecx, data);
break;
default:
-   return kvm_set_msr_common(vcpu, ecx, data);
+   return kvm_set_msr_common(vcpu, msr);
}
return 0;
 }
 
 static int wrmsr_interception(struct vcpu_svm *svm)
 {
+   struct msr_data msr;
	u32 ecx = svm->vcpu.arch.regs[VCPU_REGS_RCX];
	u64 data = (svm->vcpu.arch.regs[VCPU_REGS_RAX] & -1u)
	| ((u64)(svm->vcpu.arch.regs[VCPU_REGS_RDX] & -1u) << 32);
 
-
+   msr.data = data;
+   msr.index = ecx;
+   msr.host_initiated = false;
	svm->next_rip = kvm_rip_read(&svm->vcpu) + 2;
	if (svm_set_msr(&svm->vcpu, 

[PATCH V4 2/2] Enabling IA32_TSC_ADJUST for KVM guest VM support

2012-11-26 Thread Will Auld
CPUID.7.0.EBX[1]=1 indicates IA32_TSC_ADJUST MSR 0x3b is supported

Basic design is to emulate the MSR by allowing reads and writes to a guest
vcpu specific location to store the value of the emulated MSR while adding
the value to the vmcs tsc_offset. In this way the IA32_TSC_ADJUST value will
be included in all reads to the TSC MSR whether through rdmsr or rdtsc. This
is of course as long as the use TSC counter offsetting VM-execution
control is enabled as well as the IA32_TSC_ADJUST control.

However, because hardware will only return the TSC + IA32_TSC_ADJUST + vmcs
tsc_offset for a guest process when it does an rdtsc (with the correct
settings) the value of our virtualized IA32_TSC_ADJUST must be stored in
one of these three locations. The argument against storing it in the actual
MSR is performance. This is likely to be seldom used while the save/restore
is required on every transition. IA32_TSC_ADJUST was created as a way to
solve some issues with writing TSC itself so that is not an option either.
The remaining option, defined above as our solution has the problem of
returning incorrect vmcs tsc_offset values (unless we intercept and fix, not
done here) as mentioned above. However, more problematic is that storing the
data in vmcs tsc_offset will have a different semantic effect on the system
than does using the actual MSR. This is illustrated in the following example:
The hypervisor set the IA32_TSC_ADJUST, then the guest sets it and a guest
process performs a rdtsc. In this case the guest process will get TSC +
IA32_TSC_ADJUST_hypervisor + vmcs tsc_offset including IA32_TSC_ADJUST_guest.
While the total system semantics changed, the semantics as seen by the guest
do not and hence this will not cause a problem.
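
A worked illustration of that semantic (not part of the patch): if the host
TSC reads 10,000, the original vmcs tsc_offset is -9,000 and the guest writes
IA32_TSC_ADJUST = +500, the +500 is folded into tsc_offset, so a subsequent
guest rdtsc returns 10,000 + (-9,000 + 500) = 1,500, the same value the guest
would see from real hardware, which is why the guest-visible semantics are
unchanged.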

Signed-off-by: Will Auld will.a...@intel.com
---
 arch/x86/include/asm/cpufeature.h |  1 +
 arch/x86/include/asm/kvm_host.h   |  3 +++
 arch/x86/include/asm/msr-index.h  |  1 +
 arch/x86/kvm/cpuid.c  |  2 ++
 arch/x86/kvm/cpuid.h  |  8 
 arch/x86/kvm/svm.c|  7 +++
 arch/x86/kvm/vmx.c|  9 +
 arch/x86/kvm/x86.c| 22 ++
 8 files changed, 53 insertions(+)

diff --git a/arch/x86/include/asm/cpufeature.h 
b/arch/x86/include/asm/cpufeature.h
index 6b7ee5f..e574d81 100644
--- a/arch/x86/include/asm/cpufeature.h
+++ b/arch/x86/include/asm/cpufeature.h
@@ -199,6 +199,7 @@
 
 /* Intel-defined CPU features, CPUID level 0x0007:0 (ebx), word 9 */
 #define X86_FEATURE_FSGSBASE   (9*32+ 0) /* {RD/WR}{FS/GS}BASE instructions*/
+#define X86_FEATURE_TSC_ADJUST  (9*32+ 1) /* TSC adjustment MSR 0x3b */
 #define X86_FEATURE_BMI1   (9*32+ 3) /* 1st group bit manipulation 
extensions */
 #define X86_FEATURE_HLE(9*32+ 4) /* Hardware Lock Elision */
 #define X86_FEATURE_AVX2   (9*32+ 5) /* AVX2 instructions */
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index da34027..cf8c7e0 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -442,6 +442,8 @@ struct kvm_vcpu_arch {
u32 virtual_tsc_mult;
u32 virtual_tsc_khz;
 
+   s64 ia32_tsc_adjust_msr;
+
atomic_t nmi_queued;  /* unprocessed asynchronous NMIs */
unsigned nmi_pending; /* NMI queued after currently running handler */
bool nmi_injected;/* Trying to inject an NMI this entry */
@@ -690,6 +692,7 @@ struct kvm_x86_ops {
bool (*has_wbinvd_exit)(void);
 
void (*set_tsc_khz)(struct kvm_vcpu *vcpu, u32 user_tsc_khz, bool 
scale);
+   u64 (*read_tsc_offset)(struct kvm_vcpu *vcpu);
void (*write_tsc_offset)(struct kvm_vcpu *vcpu, u64 offset);
 
u64 (*compute_tsc_offset)(struct kvm_vcpu *vcpu, u64 target_tsc);
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 957ec87..6486569 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -231,6 +231,7 @@
 #define MSR_IA32_EBL_CR_POWERON		0x002a
 #define MSR_EBC_FREQUENCY_ID		0x002c
 #define MSR_IA32_FEATURE_CONTROL	0x003a
+#define MSR_IA32_TSC_ADJUST		0x003b
 
 #define FEATURE_CONTROL_LOCKED				(1<<0)
 #define FEATURE_CONTROL_VMXON_ENABLED_INSIDE_SMX	(1<<1)
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 0595f13..e817bac 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -320,6 +320,8 @@ static int do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 
function,
if (index == 0) {
 		entry->ebx &= kvm_supported_word9_x86_features;
 		cpuid_mask(&entry->ebx, 9);
+		// TSC_ADJUST is emulated 
+		entry->ebx |= F(TSC_ADJUST);
 	} else
 		entry->ebx = 0;
 	entry->eax = 0;
diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h
index 

[PATCH V3] target-i386: Enabling IA32_TSC_ADJUST for QEMU KVM guest VMs

2012-11-26 Thread Will Auld
CPUID.7.0.EBX[1]=1 indicates IA32_TSC_ADJUST MSR 0x3b is supported

Basic design is to emulate the MSR by allowing reads and writes to the
hypervisor vcpu specific locations to store the value of the emulated MSRs.
In this way the IA32_TSC_ADJUST value will be included in all reads to
the TSC MSR whether through rdmsr or rdtsc.

As this is a new MSR that the guest may access and modify, its value needs
to be migrated along with the other MSRs. The changes here are specifically
for recognizing when IA32_TSC_ADJUST is enabled in CPUID and code added
for migrating its value.

Signed-off-by: Will Auld will.a...@intel.com
---
 target-i386/cpu.h |  2 ++
 target-i386/kvm.c | 15 +++
 target-i386/machine.c | 21 +
 3 files changed, 38 insertions(+)

diff --git a/target-i386/cpu.h b/target-i386/cpu.h
index aabf993..13d4152 100644
--- a/target-i386/cpu.h
+++ b/target-i386/cpu.h
@@ -284,6 +284,7 @@
 #define MSR_IA32_APICBASE_BSP           (1<<8)
 #define MSR_IA32_APICBASE_ENABLE        (1<<11)
 #define MSR_IA32_APICBASE_BASE          (0xfffff<<12)
+#define MSR_TSC_ADJUST 0x003b
 #define MSR_IA32_TSCDEADLINE            0x6e0
 
 #define MSR_MTRRcap                     0xfe
@@ -701,6 +702,7 @@ typedef struct CPUX86State {
 uint64_t async_pf_en_msr;
 
 uint64_t tsc;
+uint64_t tsc_adjust;
 uint64_t tsc_deadline;
 
 uint64_t mcg_status;
diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index 696b14a..e974c42 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -61,6 +61,7 @@ const KVMCapabilityInfo kvm_arch_required_capabilities[] = {
 
 static bool has_msr_star;
 static bool has_msr_hsave_pa;
+static bool has_msr_tsc_adjust;
 static bool has_msr_tsc_deadline;
 static bool has_msr_async_pf_en;
 static bool has_msr_misc_enable;
@@ -641,6 +642,10 @@ static int kvm_get_supported_msrs(KVMState *s)
 has_msr_hsave_pa = true;
 continue;
 }
+if (kvm_msr_list->indices[i] == MSR_TSC_ADJUST) {
+has_msr_tsc_adjust = true;
+continue;
+}
 if (kvm_msr_list->indices[i] == MSR_IA32_TSCDEADLINE) {
 has_msr_tsc_deadline = true;
 continue;
@@ -978,6 +983,10 @@ static int kvm_put_msrs(CPUX86State *env, int level)
 if (has_msr_hsave_pa) {
 kvm_msr_entry_set(msrs[n++], MSR_VM_HSAVE_PA, env->vm_hsave);
 }
+if (has_msr_tsc_adjust) {
+kvm_msr_entry_set(msrs[n++], 
+   MSR_TSC_ADJUST, env->tsc_adjust);
+}
 if (has_msr_tsc_deadline) {
 kvm_msr_entry_set(msrs[n++], MSR_IA32_TSCDEADLINE, env->tsc_deadline);
 }
@@ -1234,6 +1243,9 @@ static int kvm_get_msrs(CPUX86State *env)
 if (has_msr_hsave_pa) {
 msrs[n++].index = MSR_VM_HSAVE_PA;
 }
+if (has_msr_tsc_adjust) {
+msrs[n++].index = MSR_TSC_ADJUST;
+}
 if (has_msr_tsc_deadline) {
 msrs[n++].index = MSR_IA32_TSCDEADLINE;
 }
@@ -1308,6 +1320,9 @@ static int kvm_get_msrs(CPUX86State *env)
 case MSR_IA32_TSC:
 env->tsc = msrs[i].data;
 break;
+case MSR_TSC_ADJUST:
+env->tsc_adjust = msrs[i].data;
+break;
 case MSR_IA32_TSCDEADLINE:
 env->tsc_deadline = msrs[i].data;
 break;
diff --git a/target-i386/machine.c b/target-i386/machine.c
index a8be058..95bda9b 100644
--- a/target-i386/machine.c
+++ b/target-i386/machine.c
@@ -310,6 +310,24 @@ static const VMStateDescription vmstate_fpop_ip_dp = {
 }
 };
 
+static bool tsc_adjust_needed(void *opaque)
+{
+CPUX86State *cpu = opaque;
+
+return cpu->tsc_adjust != 0;
+}
+
+static const VMStateDescription vmstate_msr_tsc_adjust = {
+.name = cpu/msr_tsc_adjust,
+.version_id = 1,
+.minimum_version_id = 1,
+.minimum_version_id_old = 1,
+.fields  = (VMStateField []) {
+VMSTATE_UINT64(tsc_adjust, CPUX86State),
+VMSTATE_END_OF_LIST()
+}
+};
+
 static bool tscdeadline_needed(void *opaque)
 {
 CPUX86State *env = opaque;
@@ -457,6 +475,9 @@ static const VMStateDescription vmstate_cpu = {
 .vmsd = vmstate_fpop_ip_dp,
 .needed = fpop_ip_dp_needed,
 }, {
+.vmsd = vmstate_msr_tsc_adjust,
+.needed = tsc_adjust_needed,
+}, {
 .vmsd = vmstate_msr_tscdeadline,
 .needed = tscdeadline_needed,
 }, {
-- 
1.8.0.rc0





RE: [Qemu-devel] [PATCH V2] Resend - Enabling IA32_TSC_ADJUST for QEMU KVM guest VMs

2012-11-26 Thread Auld, Will
Andreas, 

Thanks. I just sent the update patch (V3) to address your comments. 

Will

 -Original Message-
 From: kvm-ow...@vger.kernel.org [mailto:kvm-ow...@vger.kernel.org] On
 Behalf Of Andreas Färber
 Sent: Monday, November 26, 2012 11:05 AM
 To: Auld, Will
 Cc: Will Auld; qemu-devel; Gleb; Marcelo Tosatti; kvm@vger.kernel.org;
 Dugger, Donald D; Liu, Jinsong; Zhang, Xiantao; a...@redhat.com
 Subject: Re: [Qemu-devel] [PATCH V2] Resend - Enabling IA32_TSC_ADJUST
 for Qemu KVM guest VMs
 
 Hello,
 
 Am 26.11.2012 19:42, schrieb Will Auld:
  CPUID.7.0.EBX[1]=1 indicates IA32_TSC_ADJUST MSR 0x3b is supported
 
  Basic design is to emulate the MSR by allowing reads and writes to
 the
  hypervisor vcpu specific locations to store the value of the emulated
 MSRs.
  In this way the IA32_TSC_ADJUST value will be included in all reads
 to
  the TSC MSR whether through rdmsr or rdtsc.
 
  As this is a new MSR that the guest may access and modify its value
  needs to be migrated along with the other MRSs. The changes here are
  specifically for recognizing when IA32_TSC_ADJUST is enabled in CPUID
  and code added for migrating its value.
 
  Signed-off-by: Will Auld will.a...@intel.com
 
 $subject should get a prefix of target-i386:  and resend is better
 used inside a tag so that it doesn't end up in the commit.
 And it's QEMU. ;)
 
 Some more stylistic issues inline:
 
  ---
   target-i386/cpu.h |  2 ++
   target-i386/kvm.c | 15 +++
   target-i386/machine.c | 21 +
   3 files changed, 38 insertions(+)
 
  diff --git a/target-i386/cpu.h b/target-i386/cpu.h index
  aabf993..13d4152 100644
  --- a/target-i386/cpu.h
  +++ b/target-i386/cpu.h
  @@ -284,6 +284,7 @@
   #define MSR_IA32_APICBASE_BSP   (18)
   #define MSR_IA32_APICBASE_ENABLE(111)
   #define MSR_IA32_APICBASE_BASE  (0xf12)
  +#define MSR_TSC_ADJUST 0x003b
 
 Tabs. You can use scripts/checkpatch.pl to verify.
 
   #define MSR_IA32_TSCDEADLINE0x6e0
 
   #define MSR_MTRRcap0xfe
  @@ -701,6 +702,7 @@ typedef struct CPUX86State {
   uint64_t async_pf_en_msr;
 
   uint64_t tsc;
  +uint64_t tsc_adjust;
   uint64_t tsc_deadline;
 
   uint64_t mcg_status;
  diff --git a/target-i386/kvm.c b/target-i386/kvm.c index
  696b14a..e974c42 100644
  --- a/target-i386/kvm.c
  +++ b/target-i386/kvm.c
  @@ -61,6 +61,7 @@ const KVMCapabilityInfo
  kvm_arch_required_capabilities[] = {
 
   static bool has_msr_star;
   static bool has_msr_hsave_pa;
  +static bool has_msr_tsc_adjust;
   static bool has_msr_tsc_deadline;
   static bool has_msr_async_pf_en;
   static bool has_msr_misc_enable;
  @@ -641,6 +642,10 @@ static int kvm_get_supported_msrs(KVMState *s)
   has_msr_hsave_pa = true;
   continue;
   }
  +if (kvm_msr_list-indices[i] == MSR_TSC_ADJUST) {
  +has_msr_tsc_adjust = true;
  +continue;
  +}
   if (kvm_msr_list-indices[i] ==
 MSR_IA32_TSCDEADLINE) {
   has_msr_tsc_deadline = true;
   continue;
  @@ -978,6 +983,10 @@ static int kvm_put_msrs(CPUX86State *env, int
 level)
   if (has_msr_hsave_pa) {
   kvm_msr_entry_set(msrs[n++], MSR_VM_HSAVE_PA, env-
 vm_hsave);
   }
  +if (has_msr_tsc_adjust) {
  +kvm_msr_entry_set(msrs[n++],
  +   MSR_TSC_ADJUST, env-tsc_adjust);
 
 Tabs.
 
  +}
   if (has_msr_tsc_deadline) {
   kvm_msr_entry_set(msrs[n++], MSR_IA32_TSCDEADLINE, env-
 tsc_deadline);
   }
  @@ -1234,6 +1243,9 @@ static int kvm_get_msrs(CPUX86State *env)
   if (has_msr_hsave_pa) {
   msrs[n++].index = MSR_VM_HSAVE_PA;
   }
  +if (has_msr_tsc_adjust) {
  +msrs[n++].index = MSR_TSC_ADJUST;
  +}
   if (has_msr_tsc_deadline) {
   msrs[n++].index = MSR_IA32_TSCDEADLINE;
   }
  @@ -1308,6 +1320,9 @@ static int kvm_get_msrs(CPUX86State *env)
   case MSR_IA32_TSC:
   env-tsc = msrs[i].data;
   break;
  +case MSR_TSC_ADJUST:
  +env-tsc_adjust = msrs[i].data;
  +break;
   case MSR_IA32_TSCDEADLINE:
   env-tsc_deadline = msrs[i].data;
   break;
  diff --git a/target-i386/machine.c b/target-i386/machine.c index
  a8be058..95bda9b 100644
  --- a/target-i386/machine.c
  +++ b/target-i386/machine.c
  @@ -310,6 +310,24 @@ static const VMStateDescription
 vmstate_fpop_ip_dp = {
   }
   };
 
  +static bool tsc_adjust_needed(void *opaque) {
  +CPUX86State *cpu = opaque;
 
 Please name this env to differentiate from CPUState / X86CPU.
 Since there are other tsc_* fields already I won't request that you
 move your new field to the containing X86CPU struct but at some point
 we will need to convert the VMSDs to X86CPU.
 
  +
  +return 

Re: [PATCH v8 1/2] x86/kexec: add a new atomic notifier list for kdump

2012-11-26 Thread Eric W. Biederman
Zhang Yanfei zhangyan...@cn.fujitsu.com writes:

 So in summary,

 1. a specific callback function instead of a notifier?

Yes.

 2. Instead of calling vmclear_local_loaded_vmcss, the vmclear operation
will just call the vmclear on every vmcss loaded on the cpu?

like below:

static void crash_vmclear_local_loaded_vmcss(void)
{
 int cpu = raw_smp_processor_id();
 struct loaded_vmcs *v, *n;

 if (!crash_local_vmclear_enabled(cpu))
 return;

 list_for_each_entry_safe(v, n, &per_cpu(loaded_vmcss_on_cpu, cpu),
  loaded_vmcss_on_cpu_link)
 vmcs_clear(v->vmcs);
}

right?

Yeah that looks good.  I would do list_for_each_entry because the list
isn't changing.

Eric


Re: [Qemu-devel] [PATCH V3] target-i386: Enabling IA32_TSC_ADJUST for QEMU KVM guest VMs

2012-11-26 Thread Andreas Färber
Am 27.11.2012 02:40, schrieb Will Auld:
 CPUID.7.0.EBX[1]=1 indicates IA32_TSC_ADJUST MSR 0x3b is supported
 
 Basic design is to emulate the MSR by allowing reads and writes to the
 hypervisor vcpu specific locations to store the value of the emulated MSRs.
 In this way the IA32_TSC_ADJUST value will be included in all reads to
 the TSC MSR whether through rdmsr or rdtsc.
 
 As this is a new MSR that the guest may access and modify its value needs
 to be migrated along with the other MRSs. The changes here are specifically
 for recognizing when IA32_TSC_ADJUST is enabled in CPUID and code added
 for migrating its value.
 
 Signed-off-by: Will Auld will.a...@intel.com

Something went wrong here, none of the V2 review comments are addressed.
Maybe you sent the wrong patch file?

Cheers,
Andreas

-- 
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany
GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer; HRB 16746 AG Nürnberg


Re: [PATCH v8 1/2] x86/kexec: add a new atomic notifier list for kdump

2012-11-26 Thread Zhang Yanfei
On 2012-11-27 09:49, Eric W. Biederman wrote:
 Zhang Yanfei zhangyan...@cn.fujitsu.com writes:
 
 So in summary,

 1. a specific callback function instead of a notifier?
 
 Yes.
 
 2. Instead of calling vmclear_local_loaded_vmcss, the vmclear operation
will just call the vmclear on every vmcss loaded on the cpu?

like below:

static void crash_vmclear_local_loaded_vmcss(void)
{
 int cpu = raw_smp_processor_id();
 struct loaded_vmcs *v, *n;

 if (!crash_local_vmclear_enabled(cpu))
 return;

 list_for_each_entry_safe(v, n, &per_cpu(loaded_vmcss_on_cpu, cpu),
  loaded_vmcss_on_cpu_link)
 vmcs_clear(v->vmcs);
}

right?
 
 Yeah that looks good.  I would do list_for_each_entry because the list
 isn't changing.

OK.

I will update the patch and resend it.

Zhang




RE: [Qemu-devel] [PATCH V3] target-i386: Enabling IA32_TSC_ADJUST for QEMU KVM guest VMs

2012-11-26 Thread Auld, Will
Sorry, let me figure this out and resend.

Thanks,

Will

 -Original Message-
 From: kvm-ow...@vger.kernel.org [mailto:kvm-ow...@vger.kernel.org] On
 Behalf Of Andreas Färber
 Sent: Monday, November 26, 2012 5:51 PM
 To: Auld, Will
 Cc: Will Auld; qemu-devel; Gleb; mtosa...@redhat.com;
 kvm@vger.kernel.org; Dugger, Donald D; Liu, Jinsong; Zhang, Xiantao;
 a...@redhat.com
 Subject: Re: [Qemu-devel] [PATCH V3] target-i386: Enabling
 IA32_TSC_ADJUST for QEMU KVM guest VMs
 
 Am 27.11.2012 02:40, schrieb Will Auld:
  CPUID.7.0.EBX[1]=1 indicates IA32_TSC_ADJUST MSR 0x3b is supported
 
  Basic design is to emulate the MSR by allowing reads and writes to
 the
  hypervisor vcpu specific locations to store the value of the emulated
 MSRs.
  In this way the IA32_TSC_ADJUST value will be included in all reads
 to
  the TSC MSR whether through rdmsr or rdtsc.
 
  As this is a new MSR that the guest may access and modify its value
  needs to be migrated along with the other MRSs. The changes here are
  specifically for recognizing when IA32_TSC_ADJUST is enabled in CPUID
  and code added for migrating its value.
 
  Signed-off-by: Will Auld will.a...@intel.com
 
 Something went wrong here, none of the V2 review comments are
 addressed.
 Maybe you sent the wrong patch file?
 
 Cheers,
 Andreas
 
 --
 SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany
 GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer; HRB 16746 AG Nürnberg

[PATCH V4] target-i386: Enabling IA32_TSC_ADJUST for Qemu KVM guest VMs

2012-11-26 Thread Will Auld
CPUID.7.0.EBX[1]=1 indicates IA32_TSC_ADJUST MSR 0x3b is supported

Basic design is to emulate the MSR by allowing reads and writes to the
hypervisor vcpu specific locations to store the value of the emulated MSRs.
In this way the IA32_TSC_ADJUST value will be included in all reads to
the TSC MSR whether through rdmsr or rdtsc.

As this is a new MSR that the guest may access and modify, its value needs
to be migrated along with the other MSRs. The changes here are specifically
for recognizing when IA32_TSC_ADJUST is enabled in CPUID and code added
for migrating its value.

Signed-off-by: Will Auld will.a...@intel.com
---
 target-i386/cpu.h |  2 ++
 target-i386/kvm.c | 14 ++
 target-i386/machine.c | 21 +
 3 files changed, 37 insertions(+)

diff --git a/target-i386/cpu.h b/target-i386/cpu.h
index aabf993..9dedaa6 100644
--- a/target-i386/cpu.h
+++ b/target-i386/cpu.h
@@ -284,6 +284,7 @@
 #define MSR_IA32_APICBASE_BSP           (1<<8)
 #define MSR_IA32_APICBASE_ENABLE        (1<<11)
 #define MSR_IA32_APICBASE_BASE          (0xfffff<<12)
+#define MSR_TSC_ADJUST 0x003b
 #define MSR_IA32_TSCDEADLINE            0x6e0
 
 #define MSR_MTRRcap                     0xfe
@@ -701,6 +702,7 @@ typedef struct CPUX86State {
 uint64_t async_pf_en_msr;
 
 uint64_t tsc;
+uint64_t tsc_adjust;
 uint64_t tsc_deadline;
 
 uint64_t mcg_status;
diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index 696b14a..6d2a061 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -61,6 +61,7 @@ const KVMCapabilityInfo kvm_arch_required_capabilities[] = {
 
 static bool has_msr_star;
 static bool has_msr_hsave_pa;
+static bool has_msr_tsc_adjust;
 static bool has_msr_tsc_deadline;
 static bool has_msr_async_pf_en;
 static bool has_msr_misc_enable;
@@ -641,6 +642,10 @@ static int kvm_get_supported_msrs(KVMState *s)
 has_msr_hsave_pa = true;
 continue;
 }
+   if (kvm_msr_list->indices[i] == MSR_TSC_ADJUST) {
+   has_msr_tsc_adjust = true;
+   continue;
+   }
 if (kvm_msr_list->indices[i] == MSR_IA32_TSCDEADLINE) {
 has_msr_tsc_deadline = true;
 continue;
@@ -978,6 +983,9 @@ static int kvm_put_msrs(CPUX86State *env, int level)
 if (has_msr_hsave_pa) {
 kvm_msr_entry_set(msrs[n++], MSR_VM_HSAVE_PA, env->vm_hsave);
 }
+if (has_msr_tsc_adjust) {
+   kvm_msr_entry_set(msrs[n++], MSR_TSC_ADJUST, env->tsc_adjust);
+}
 if (has_msr_tsc_deadline) {
 kvm_msr_entry_set(msrs[n++], MSR_IA32_TSCDEADLINE, env->tsc_deadline);
 }
@@ -1234,6 +1242,9 @@ static int kvm_get_msrs(CPUX86State *env)
 if (has_msr_hsave_pa) {
 msrs[n++].index = MSR_VM_HSAVE_PA;
 }
+if (has_msr_tsc_adjust) {
+   msrs[n++].index = MSR_TSC_ADJUST;
+}
 if (has_msr_tsc_deadline) {
 msrs[n++].index = MSR_IA32_TSCDEADLINE;
 }
@@ -1308,6 +1319,9 @@ static int kvm_get_msrs(CPUX86State *env)
 case MSR_IA32_TSC:
 env->tsc = msrs[i].data;
 break;
+   case MSR_TSC_ADJUST:
+   env->tsc_adjust = msrs[i].data;
+   break;
 case MSR_IA32_TSCDEADLINE:
 env->tsc_deadline = msrs[i].data;
 break;
diff --git a/target-i386/machine.c b/target-i386/machine.c
index a8be058..df3f779 100644
--- a/target-i386/machine.c
+++ b/target-i386/machine.c
@@ -310,6 +310,24 @@ static const VMStateDescription vmstate_fpop_ip_dp = {
 }
 };
 
+static bool tsc_adjust_needed(void *opaque)
+{
+CPUX86State *env = opaque;
+
+return env->tsc_adjust != 0;
+}
+
+static const VMStateDescription vmstate_msr_tsc_adjust = {
+.name = cpu/msr_tsc_adjust,
+.version_id = 1,
+.minimum_version_id = 1,
+.minimum_version_id_old = 1,
+.fields  = (VMStateField []) {
+VMSTATE_UINT64(tsc_adjust, CPUX86State),
+VMSTATE_END_OF_LIST()
+}
+};
+
 static bool tscdeadline_needed(void *opaque)
 {
 CPUX86State *env = opaque;
@@ -457,6 +475,9 @@ static const VMStateDescription vmstate_cpu = {
 .vmsd = vmstate_fpop_ip_dp,
 .needed = fpop_ip_dp_needed,
 }, {
+.vmsd = vmstate_msr_tsc_adjust,
+.needed = tsc_adjust_needed,
+}, {
 .vmsd = vmstate_msr_tscdeadline,
 .needed = tscdeadline_needed,
 }, {
-- 
1.8.0.rc0




