Re: Windows slow boot: contractor wanted

2012-08-19 Thread Avi Kivity
On 08/17/2012 03:36 PM, Richard Davies wrote:
 Hi Avi,
 
 Thanks to you and several others for offering help. We will work with Avi at
 first, but are grateful for all the other offers of help. We have a number
 of other qemu-related projects which we'd be interested in getting done, and
 will get in touch with these names (and anyone else who comes forward) to
 see if any are of interest to you.
 
 
 This slow boot problem is intermittent and varies in how slow the boots are,
 but I managed to trigger it this morning with medium slow booting (5-10
 minutes) and link to the requested traces below.
 
 The host in question has 128GB RAM and dual AMD Opteron 6128 (16 cores
 total). It is running kernel 3.5.1 and qemu-kvm 1.1.1.
 
 In this morning's test, we have 3 guests, all booting Windows with 40GB RAM
 and 8 cores each (we have seen small VMs go slow as I originally said, but
 it is easier to trigger with big VMs):
 
 pid 15665: qemu-kvm -nodefaults -m 40960 -smp 8 -cpu host,hv_relaxed \
   -vga cirrus -usbdevice tablet -vnc :99 -monitor stdio -hda test1.raw
 pid 15676: qemu-kvm -nodefaults -m 40960 -smp 8 -cpu host,hv_relaxed \
   -vga cirrus -usbdevice tablet -vnc :98 -monitor stdio -hda test2.raw
 pid 15653: qemu-kvm -nodefaults -m 40960 -smp 8 -cpu host,hv_relaxed \
   -vga cirrus -usbdevice tablet -vnc :97 -monitor stdio -hda test3.raw
 

40+40+40=120, pretty close to your server specs.  Are you swapping?


-- 
error compiling committee.c: too many arguments to function
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC-v2 3/6] vhost-scsi: add -vhost-scsi host device for use with tcm-vhost

2012-08-19 Thread Michael S. Tsirkin
On Sat, Aug 18, 2012 at 05:36:26PM -0700, Nicholas A. Bellinger wrote:
 On Sat, 2012-08-18 at 22:12 +0300, Michael S. Tsirkin wrote:
  On Tue, Aug 14, 2012 at 01:31:14PM -0700, Nicholas A. Bellinger wrote:
   On Mon, 2012-08-13 at 11:53 +0300, Michael S. Tsirkin wrote:
On Mon, Aug 13, 2012 at 08:35:14AM +0000, Nicholas A. Bellinger wrote:
 From: Stefan Hajnoczi stefa...@linux.vnet.ibm.com
 
 SNIP
 
 +static VHostSCSI *vhost_scsi_add(const char *id, const char *wwpn,
 +                                 uint16_t tpgt)
 +{
 +    VHostSCSI *vs = g_malloc0(sizeof(*vs));
 +    int ret;
 +
 +    /* TODO set up vhost-scsi device and bind to tcm_vhost/$wwpm/tpgt_$tpgt */
 +    fprintf(stderr, "wwpn = \"%s\" tpgt = \"%u\"\n", id, tpgt);
 +
 +    ret = vhost_dev_init(&vs->dev, -1, "/dev/vhost-scsi", false);

This -1 is a hack. You need to support passing in fd from
the monitor, and pass it here.

   
   Mmm, looking at how vhost_net_init + tap.c does this, but am not quite sure
   what fd needs to be propagated up for virtio-scsi -> vhost-scsi..
   
   Can you please elaborate on this one a bit more..?
   
  
  The idea is to allow running as a user without access to
  /dev/vhost-scsi.
  For this, allow passing in the fd of /dev/vhost-scsi through unix domain 
  sockets.
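
For illustration, a minimal sketch of the fd-passing mechanism described above, assuming a connected AF_UNIX socket between a privileged helper and the unprivileged process. The helper names send_fd() and recv_fd() are hypothetical and are not QEMU functions; only standard socket calls (sendmsg, recvmsg, SCM_RIGHTS) are used.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper: send one open fd over a connected AF_UNIX socket.
 * Returns 0 on success, -1 on error. */
int send_fd(int sock, int fd)
{
    char dummy = 'F';   /* one byte of payload accompanies the control message */
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union {
        struct cmsghdr align;               /* forces correct alignment */
        char buf[CMSG_SPACE(sizeof(int))];
    } u;
    struct msghdr msg;
    struct cmsghdr *cmsg;

    memset(&msg, 0, sizeof(msg));
    memset(&u, 0, sizeof(u));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;           /* payload is an array of fds */
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Hypothetical helper: receive one fd; returns the new fd or -1 on error. */
int recv_fd(int sock)
{
    char dummy;
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union {
        struct cmsghdr align;
        char buf[CMSG_SPACE(sizeof(int))];
    } u;
    struct msghdr msg;
    struct cmsghdr *cmsg;
    int fd;

    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    if (recvmsg(sock, &msg, 0) <= 0)
        return -1;

    cmsg = CMSG_FIRSTHDR(&msg);
    if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS ||
        cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
        return -1;

    memcpy(&fd, CMSG_DATA(cmsg), sizeof(fd));
    return fd;
}
```

The receiving process gets its own descriptor referring to the same open file, so it never needs permission to open /dev/vhost-scsi itself.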
  
 
 Ah, that is a pretty neat trick..   So for vhost-scsi code, this would
 mean something along the lines of the following, yes..?

Yes but with one correction. See below.

 Thanks MST!

 diff --git a/hw/vhost-scsi.c b/hw/vhost-scsi.c
 index 4206a75..8af8758 100644
 --- a/hw/vhost-scsi.c
 +++ b/hw/vhost-scsi.c
 @@ -21,6 +21,7 @@ struct VHostSCSI {
      const char *id;
      const char *wwpn;
      uint16_t tpgt;
 +    int vhostfd;
      struct vhost_dev dev;
      struct vhost_virtqueue vqs[VHOST_SCSI_VQ_NUM];
      QLIST_ENTRY(VHostSCSI) list;
 @@ -114,13 +115,32 @@ void vhost_scsi_stop(VHostSCSI *vs, VirtIODevice *vdev)
  }
  
  static VHostSCSI *vhost_scsi_add(const char *id, const char *wwpn,
 -                                 uint16_t tpgt)
 +                                 uint16_t tpgt, const char *vhostfd_str)
  {
 -    VHostSCSI *vs = g_malloc0(sizeof(*vs));
 +    VHostSCSI *vs;
      int ret;
  
 +    vs = g_malloc0(sizeof(*vs));
 +    if (!vs) {
 +        error_report("vhost-scsi: unable to allocate *vs\n");
 +        return NULL;
 +    }
 +    vs->vhostfd = -1;
 +
 +    if (vhostfd_str) {
 +        if (!qemu_isdigit(vhostfd_str[0])) {
 +            error_report("vhost-scsi: passed vhostfd value is not a digit\n");
 +            return NULL;

This lets you use an fd which was open at exec time,
but does not allow the fd to be opened later in
case the device is hot-plugged.

See net_handle_fd_param - I think you can just rename it
qemu_handle_fd_param to avoid code duplication.

 +        }
 +
 +        vs->vhostfd = qemu_parse_fd(vhostfd_str);
 +        if (vs->vhostfd == -1) {
 +            error_report("vhost-scsi: unable to parse vs->vhostfd\n");
 +            return NULL;
 +        }
 +    }
      /* TODO set up vhost-scsi device and bind to tcm_vhost/$wwpm/tpgt_$tpgt */
 -    ret = vhost_dev_init(&vs->dev, -1, "/dev/vhost-scsi", false);
 +    ret = vhost_dev_init(&vs->dev, vs->vhostfd, "/dev/vhost-scsi", false);
      if (ret < 0) {
          error_report("vhost-scsi: vhost initialization failed: %s\n",
                  strerror(-ret));
 @@ -140,7 +160,7 @@ static VHostSCSI *vhost_scsi_add(const char *id, const char *wwpn,
  VHostSCSI *vhost_scsi_add_opts(QemuOpts *opts)
  {
      const char *id;
 -    const char *wwpn;
 +    const char *wwpn, *vhostfd;
      uint64_t tpgt;
  
      id = qemu_opts_id(opts);
 @@ -164,6 +184,7 @@ VHostSCSI *vhost_scsi_add_opts(QemuOpts *opts)
          error_report("vhost-scsi: \"%s\" needs a 16-bit tpgt\n", id);
          return NULL;
      }
 +    vhostfd = qemu_opt_get(opts, "vhostfd");
  
 -    return vhost_scsi_add(id, wwpn, tpgt);
 +    return vhost_scsi_add(id, wwpn, tpgt, vhostfd);
  }
 diff --git a/qemu-config.c b/qemu-config.c
 index 33399ea..2d4884c 100644
 --- a/qemu-config.c
 +++ b/qemu-config.c
 @@ -636,6 +636,9 @@ QemuOptsList qemu_vhost_scsi_opts = {
          }, {
              .name = "tpgt",
              .type = QEMU_OPT_NUMBER,
 +        }, {
 +            .name = "vhostfd",
 +            .type = QEMU_OPT_STRING,
          },
          { /* end of list */ }
      },


Re: Windows slow boot: contractor wanted

2012-08-19 Thread Richard Davies
Avi Kivity wrote:
 Richard Davies wrote:
  The host in question has 128GB RAM and dual AMD Opteron 6128 (16 cores
  total). It is running kernel 3.5.1 and qemu-kvm 1.1.1.
 
  In this morning's test, we have 3 guests, all booting Windows with 40GB RAM
  and 8 cores each (we have seen small VMs go slow as I originally said, but
  it is easier to trigger with big VMs):

 40+40+40=120, pretty close to your server specs. Are you swapping?

No - you can see on the top screenshot that there's no swap in use.

Richard.


Re: [regression] virtio net locks up

2012-08-19 Thread Michael S. Tsirkin
On Sat, Aug 18, 2012 at 01:25:05AM +0200, Bernd Schubert wrote:
 On 08/12/2012 01:45 PM, Michael S. Tsirkin wrote:
  On Mon, Jul 30, 2012 at 08:08:31PM +0200, Bernd Schubert wrote:
  On 07/30/2012 07:33 PM, Bernd Schubert wrote:
  Hello Stefan,
 
  Stefan Hajnoczi stefanha at gmail.com writes:
 
  On Wed, Jan 11, 2012 at 4:18 PM, Bernd Schubert
  bernd.schubert at itwm.fraunhofer.de wrote:
  On 01/11/2012 05:04 PM, Stefan Hajnoczi wrote:
  Try pinging the host's IP address from inside the guest.  Run tcpdump
  on the guest's tap interface from the host and observe whether or not
  you see any packets being sent from the guest.
 
 
 
  sorry for my terribly late reply. As usual I got distracted by too many 
  other
  things and then returned the hardware I was running the VMs on. My new 
  desktop
  system is better suitable to run kvm and I can easily reproduce it now 
  with 3.5
  on host and guest side. So it's not fixed in recent versions yet.
 
 
 
  Seems arp requests are still going out, but then don't go in:
 
  17:16:21.202547 ARP, Reply 192.168.123.1 is-at 00:25:90:38:09:cd (oui
  Unknown), length 28
  17:16:21.538724 ARP, Request who-has squeeze1 tell squeeze3, length 28
  17:16:21.539026 ARP, Reply squeeze1 is-at 52:54:00:12:34:11 (oui 
  Unknown),
  length 28
  17:16:22.200912 ARP, Request who-has 192.168.123.1 tell squeeze3, 
  length 28
 
  Okay, so it seems networking from the tap device and beyond is fine.
 
  rmmod virtio_net inside the guest and then modprobe virtio_net again.
  See if network connectivity is restored (remember to rerun DHCP or
  whatever, if necessary).
 
 
  Yep, that makes it work again. But probably is not the real solution ;)
 
  It's just another piece of information which helps debug this :).  At
  least nothing has wedged itself into an unrecoverable state.
 
  When you said the problem happens without vhost, did you explicitly
  run vhost=off?  Or did you just omit vhost=on?
 
  It was definitely off and I can confirm that it also locks up with 
  vhost=on and
  vhost=off with 3.5.
 
 
  This sounds like a guest kernel/driver issue.  I recommend testing
  git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git in
  the guest to see if this has already been fixed.
 
  If you have the -dbg RPMs installed it may be possible to insert a
  probe into the virtio_net kernel module and observe receive
  interrupts.  This does require the right kernel CONFIG_ but you might
  already have it enabled:
 
  $ sudo perf probe --add skb_recv_done
  $ sudo perf record -e probe:skb_recv_done -a
  ...send some packets to the guest...
  ^C
  $ sudo perf script
 
  If you see no skb_recv_done events then the guest driver is not
  receiving a notification when packets are received.
 
  You can find more about how to use perf-probe(1) at
  http://blog.vmsplice.net/2011/03/how-to-use-perf-probe.html.
 
  Ah nice, I would have used systemtap, but always wanted to check how to 
  do it
  with perf :)
 
  So once the virtio NIC has locked up, I don't get any events from it 
  anymore -
  until I remove/re-insert the virtio module (including ifup/ifdown). I 
  will try
  to find some time later on this week to look into it again.
  Any further ideas how to proceed (I haven't even checked yet how virtio 
  works at
  all...).
 
 
  I took a quick glance where skb_recv_done is registered at all and
  traced it back to vp_find_vqs(). Looking into that function I
  noticed MSI and so tried to boot with pci=nomsi. And indeed I
  guessed it right, with pci=nomsi I don't get any lockups anymore.
  Am I the only one booting kvm-qemu usually with enabled MSI?
 
  Cheers,
  Bernd
  
  No :)
  
  I am guessing it has to do with OOM handling in the guest -
  it is tested very little but maybe your guest is such that atomic
  pool gets exhausted for some reason.
  Could you pls check whether refill_work runs by tracing it?
  This is our OOM handler.
  
  
 
 Just checked it, it does not show up in perf script output.
 
 
 Cheers,
 Bernd


When running with vhost-net on, if you enable
DEBUG in *host* kernel build (or set CONFIG_DYNAMIC_DEBUG
and enable messages for the vhost_net module)
pr_debug will output some debug messages if guest
bug is detected.

Can you try this?

-- 
MST


Re: [PATCH 37/74] lto, KVM: Don't assume asm statements end up in the same assembler file

2012-08-19 Thread Avi Kivity
On 08/19/2012 05:56 AM, Andi Kleen wrote:
 From: Andi Kleen a...@linux.intel.com
 
 The VMX code references a local assembler label between two inline
 assembler statements. This assumes they both end up in the same
 assembler files. In some experimental builds of gcc this is not
 necessarily true, causing linker failures.
 
 Replace the local label reference with a more traditional asmlinkage
 extern.
 
 This also eliminates one assembler statement and
 generates a bit better code on 64bit: the compiler can
 use a RIP relative LEA instead of a movabs, saving
 a few bytes.

I'm happy to see work on lto-enabling the kernel.

  
 +extern __visible unsigned long kvm_vmx_return;
 +
  /*
   * Set up the vmcs's constant host-state fields, i.e., host-state fields that
   * will not change in the lifetime of the guest.
 @@ -3753,8 +3755,7 @@ static void vmx_set_constant_host_state(void)
   native_store_idt(dt);
   vmcs_writel(HOST_IDTR_BASE, dt.address);   /* 22.2.4 */
  
 - asm("mov $.Lkvm_vmx_return, %0" : "=r"(tmpl));
 - vmcs_writel(HOST_RIP, tmpl); /* 22.2.5 */
 + vmcs_writel(HOST_RIP, (unsigned long)&kvm_vmx_return); /* 22.2.5 */
  
   rdmsr(MSR_IA32_SYSENTER_CS, low32, high32);
   vmcs_write32(HOST_IA32_SYSENTER_CS, low32);
 @@ -6305,9 +6306,10 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu 
 *vcpu)
   /* Enter guest mode */
   "jne .Llaunched \n\t"
   __ex(ASM_VMX_VMLAUNCH) "\n\t"
 - "jmp .Lkvm_vmx_return \n\t"
 + "jmp kvm_vmx_return \n\t"
   ".Llaunched: " __ex(ASM_VMX_VMRESUME) "\n\t"
 - ".Lkvm_vmx_return: "
 + ".globl kvm_vmx_return\n"
 + "kvm_vmx_return: "
   /* Save guest registers, load host registers, keep flags */
   "mov %0, %c[wordsize](%%"R"sp) \n\t"
   "pop %0 \n\t"
 

The reason we use a local label is so that the function isn't split
into two from the profiler's point of view.  See cd2276a795b013d1.

One way to fix this is to have a .data variable initialized to point to
.Lkvm_vmx_return (this can be done from the same asm statement in
vmx_vcpu_run), and reference that variable in
vmx_set_constant_host_state().  If no one comes up with a better idea,
I'll write a patch doing this.

-- 
error compiling committee.c: too many arguments to function


Re: [PATCH v3] KVM: x86 emulator: access GPRs on demand

2012-08-19 Thread Avi Kivity
On 08/17/2012 08:29 PM, Marcelo Tosatti wrote:
 On Thu, Aug 16, 2012 at 05:54:49PM +0300, Avi Kivity wrote:
 Instead of populating the entire register file, read in registers
 as they are accessed, and write back only the modified ones.  This
 saves a VMREAD and VMWRITE on Intel (for rsp, since it is not usually
 used during emulation), and two 128-byte copies for the registers.
 
 
 @@ -2715,14 +2764,17 @@ int emulator_task_switch(struct x86_emulate_ctxt 
 *ctxt,
  {
  int rc;
  
 +invalidate_registers(ctxt);
  ctxt->_eip = ctxt->eip;
  ctxt->dst.type = OP_NONE;
  
  rc = emulator_do_task_switch(ctxt, tss_selector, idt_index, reason,
   has_error_code, error_code);
  
 -if (rc == X86EMUL_CONTINUE)
 +if (rc == X86EMUL_CONTINUE) {
  ctxt->eip = ctxt->_eip;
 +writeback_registers(ctxt);
 +}
  
  return (rc == X86EMUL_UNHANDLEABLE) ? EMULATION_FAILED : EMULATION_OK;
  }
 
 
 No clear point when emulator register cache is active, when it is
 not (AFAICS this patch does not invalidate registers on emulation start
 (the above being one of the exceptions) does not clear valid bit on
 writeback-to-vcpu-cache on emulation exit).

It is cleared when emulation starts.  For the non-insn-emulation entry
points, there is an explicit invalidate.  For the emulation entry point,
there is a memset() that clears everything up to _regs, which includes
the cache.  This discrepancy isn't nice, but it preexists.  I don't know
whether we should decompose the memset() or not, it is rather efficient.

 
 Concern is that emulator can start with cached registers marked as valid 
 but in fact are invalid from previous emulation round.
 
 Maybe move invalidate() to init_emulate_ctxt?
 

See the memset() in init_decode_cache().
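
To make the valid/dirty discussion concrete, here is a small self-contained sketch of an on-demand register cache. It is not the emulator code: the struct and function names (reg_cache, cache_read, cache_write, cache_writeback, cache_invalidate) and the backend counters are assumptions made for illustration. It shows registers being fetched on first access, only modified ones being written back, and why a missing invalidate would leave stale values behind.

```c
#include <stdint.h>

#define NR_REGS 16

/* Stand-in for the vcpu register file; the counters record backend traffic. */
struct backend {
    uint64_t regs[NR_REGS];
    unsigned reads;
    unsigned writes;
};

/* Hypothetical cache: one valid bit and one dirty bit per register. */
struct reg_cache {
    uint64_t regs[NR_REGS];
    uint32_t valid;
    uint32_t dirty;
    struct backend *be;
};

/* Must run at the start of every emulation round. */
void cache_invalidate(struct reg_cache *c)
{
    c->valid = 0;
    c->dirty = 0;
}

/* Fetch from the backend only on first access. */
uint64_t cache_read(struct reg_cache *c, unsigned nr)
{
    if (!(c->valid & (1u << nr))) {
        c->regs[nr] = c->be->regs[nr];
        c->be->reads++;
        c->valid |= 1u << nr;
    }
    return c->regs[nr];
}

/* A full overwrite needs no prior fetch; it just marks the register dirty. */
void cache_write(struct reg_cache *c, unsigned nr, uint64_t val)
{
    c->regs[nr] = val;
    c->valid |= 1u << nr;
    c->dirty |= 1u << nr;
}

/* Write back only the registers that were modified. */
void cache_writeback(struct reg_cache *c)
{
    for (unsigned nr = 0; nr < NR_REGS; nr++) {
        if (c->dirty & (1u << nr)) {
            c->be->regs[nr] = c->regs[nr];
            c->be->writes++;
        }
    }
    c->dirty = 0;
}
```

In this sketch, skipping cache_invalidate() between rounds makes cache_read() return the previous round's value, which is the failure mode raised in the review.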

-- 
error compiling committee.c: too many arguments to function


Re: [kvmarm] [PATCH v10 07/14] KVM: ARM: Memory virtualization setup

2012-08-19 Thread Peter Maydell
On 19 August 2012 05:34, Christoffer Dall c.d...@virtualopensystems.com wrote:
 On Thu, Aug 16, 2012 at 2:25 PM, Alexander Graf ag...@suse.de wrote:
 A single hva can have multiple gpas mapped, no? At least that's what I 
 gathered
 from the discussion about my attempt to a function similar to this :).

 I don't think this is the case for ARM, can you provide an example? We
 use gfn_to_pfn_prot and only allow user memory regions. What you
 suggest would be multiple physical addresses pointing to the same
 memory bank, I don't think that makes any sense on ARM hardware, for
 x86 and PPC I don't know.

I don't know what an hva is, but yes, ARM boards can have the same
block of RAM aliased into multiple places in the physical address space.
(we don't currently bother to implement the aliases in qemu's vexpress-a15
though because it's a bunch of mappings of the low 2GB into high
addresses mostly intended to let you test LPAE code without having to
put lots of RAM on the hardware).

-- PMM


Re: [PATCH 3/5] KVM: PPC: Book3S HV: Handle memory slot deletion and modification correctly

2012-08-19 Thread Avi Kivity
On 08/17/2012 09:39 PM, Marcelo Tosatti wrote:
 
 Yes. Well, Avi mentioned earlier that there are users for change of GPA
 base. But, if my understanding is correct, the code that emulates
 change of BAR in QEMU is:
 
 /* now do the real mapping */
 if (r->addr != PCI_BAR_UNMAPPED) {
     memory_region_del_subregion(r->address_space, r->memory);
 }
 r->addr = new_addr;
 if (r->addr != PCI_BAR_UNMAPPED) {
     memory_region_add_subregion_overlap(r->address_space,
                                         r->addr, r->memory, 1);
 
 These translate to two kvm_set_user_memory ioctls. 

Not directly.  These functions change a qemu-internal memory map, which
is then transferred to kvm.  Those two calls might be in a transaction
(they aren't now), in which case the memory map update is atomic.

So indeed we issue two ioctls now, but that's a side effect of the
implementation, not related to those two calls being separate.

 
  Without taking into consideration backwards compatibility, userspace 
   can first delete the slot and later create a new one.
 
  Current qemu will in fact do that.  Not sure about older ones.
 
 
 Avi, where does it do that?

By that I meant first deleting the first slot and then creating a new one.


-- 
error compiling committee.c: too many arguments to function


Re: qemu-kvm-1.0.1 - unable to exit if vcpu is in infinite loop

2012-08-19 Thread Avi Kivity
On 08/17/2012 06:04 PM, Jan Kiszka wrote:
  
 Can anyone imagine that such a barrier may actually be required? If it
 is currently possible that env->stop is evaluated before we called into
 sigtimedwait in qemu_kvm_eat_signals, then we could actually eat the
 signal without properly processing its reason (stop).
 
 Should not be required (TM): Both signal eating / stop checking and stop
 setting / signal generation happens under the BQL, thus the ordering
 must not make a difference here.

Agree.


 Don't see where we could lose a signal. Maybe due to a subtle memory
 corruption that sets thread_kicked to non-zero, preventing the kicking
 this way.

Cannot be ruled out, yet too much of a coincidence.

Could be a kernel bug (either in kvm or elsewhere), we've had several
before in this area.

Is this reproducible?

-- 
error compiling committee.c: too many arguments to function


Re: [PATCH v3 2/7] memory: Flush coalesced MMIO on selected region access

2012-08-19 Thread Avi Kivity
On 08/17/2012 01:55 PM, Jan Kiszka wrote:
 On 2012-07-10 12:41, Jan Kiszka wrote:
 On 2012-07-02 11:07, Avi Kivity wrote:
 On 06/29/2012 07:37 PM, Jan Kiszka wrote:
 Instead of flushing pending coalesced MMIO requests on every vmexit,
 this provides a mechanism to selectively flush when memory regions
 related to the coalesced one are accessed. This first of all includes
 the coalesced region itself but can also be applied to other regions, e.g.
 of the same device, by calling memory_region_set_flush_coalesced.

 Looks fine.

 I have a hard time deciding whether this should go through the kvm tree
 or memory tree.  Anthony, perhaps you can commit it directly to avoid
 the livelock?

 Reviewed-by: Avi Kivity a...@redhat.com

 
 Anthony, ping?
 
 Argh, missed that this series was forgotten. Patch 1 is a bug fix, will
 resend it separately so that it can make it into 1.2. Will repost the
 rest once master reopens.

My fault, I should have just taken it into memory/core and sent a pull
request.  Sorry about that.

-- 
error compiling committee.c: too many arguments to function


Re: perf uncore lkvm woes

2012-08-19 Thread Avi Kivity
On 08/17/2012 09:56 AM, Peter Zijlstra wrote:
 On Fri, 2012-08-17 at 09:40 +0800, Yan, Zheng wrote:
 
 Peter, do I need to submit a patch that disables uncore on virtualized CPUs?
 
 I think Avi prefers the method where KVM 'fakes' the MSRs and we have to
 detect if the MSRs actually work or not.

s/we have/we don't have/.

 
 If you're willing to have a go at that, please do so. If you're not sure
 how to do the KVM part, I'm sure Avi and/or Gleb can help you out.

Certainly, please see kvm_pmu_get_msr() and kvm_pmu_set_msr().

The approach is that if an msr write can be emulated correctly (for
example, it disables a counter) then we let it proceed; if it cannot be
emulated correctly (for example it enables a counter that we cannot
emulate), then we ignore it, but print out a message that tells the user
that we're faking something that may cause the guest to malfunction.
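
The policy above can be sketched as follows. This is not the kvm_pmu_set_msr() implementation; the struct, the function name fake_msr_write() and the FAKE_CTR_ENABLE bit are hypothetical, chosen only to show the "emulate if correct, otherwise ignore and warn" decision.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical "enable counter" bit; the real layout depends on the MSR. */
#define FAKE_CTR_ENABLE (1ull << 22)

struct fake_msr {
    uint64_t value;           /* last value that was emulated faithfully */
    unsigned ignored_writes;  /* writes that were dropped with a warning */
};

/* Returns true if the write was emulated correctly, false if it was ignored. */
bool fake_msr_write(struct fake_msr *msr, uint64_t data)
{
    if (data & FAKE_CTR_ENABLE) {
        /* Enabling a counter we cannot emulate: ignore it, but tell the user. */
        fprintf(stderr,
                "faking MSR write 0x%llx; guest may malfunction\n",
                (unsigned long long)data);
        msr->ignored_writes++;
        return false;
    }
    /* Disabling (or leaving disabled) a counter is trivially correct. */
    msr->value = data;
    return true;
}
```

The guest never sees a fault in either case, so it does not need to probe whether the MSR works.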


-- 
error compiling committee.c: too many arguments to function


[PATCH 3.6] KVM: x86 emulator: use stack size attribute to mask rsp in stack ops

2012-08-19 Thread Avi Kivity
The sub-register used to access the stack (sp, esp, or rsp) is not
determined by the address size attribute like other memory references,
but by the stack segment's B bit (if not in x86_64 mode).

Fix by using the existing stack_mask() to figure out the correct mask.

This long-existing bug was exposed by a combination of a27685c33ae
(emulate invalid guest state by default), which causes many more
instructions to be emulated, and a seabios change (possibly a bug) which
causes the high 16 bits of esp to become polluted across calls to real
mode software interrupts.

Signed-off-by: Avi Kivity a...@redhat.com
---
 arch/x86/kvm/emulate.c | 30 +-
 1 file changed, 21 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 97d9a99..a3b57a2 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -475,13 +475,26 @@ static int stack_size(struct x86_emulate_ctxt *ctxt)
 	return address_mask(ctxt, reg);
 }
 
+static void masked_increment(ulong *reg, ulong mask, int inc)
+{
+	assign_masked(reg, *reg + inc, mask);
+}
+
 static inline void
 register_address_increment(struct x86_emulate_ctxt *ctxt, unsigned long *reg, int inc)
 {
+	ulong mask;
+
 	if (ctxt->ad_bytes == sizeof(unsigned long))
-		*reg += inc;
+		mask = ~0UL;
 	else
-		*reg = (*reg & ~ad_mask(ctxt)) | ((*reg + inc) & ad_mask(ctxt));
+		mask = ad_mask(ctxt);
+	masked_increment(reg, mask, inc);
+}
+
+static void rsp_increment(struct x86_emulate_ctxt *ctxt, int inc)
+{
+	masked_increment(&ctxt->regs[VCPU_REGS_RSP], stack_mask(ctxt), inc);
 }
 
 static inline void jmp_rel(struct x86_emulate_ctxt *ctxt, int rel)
@@ -1522,8 +1535,8 @@ static int push(struct x86_emulate_ctxt *ctxt, void *data, int bytes)
 {
 	struct segmented_address addr;
 
-	register_address_increment(ctxt, &ctxt->regs[VCPU_REGS_RSP], -bytes);
-	addr.ea = register_address(ctxt, ctxt->regs[VCPU_REGS_RSP]);
+	rsp_increment(ctxt, -bytes);
+	addr.ea = ctxt->regs[VCPU_REGS_RSP] & stack_mask(ctxt);
 	addr.seg = VCPU_SREG_SS;
 
 	return segmented_write(ctxt, addr, data, bytes);
@@ -1542,13 +1555,13 @@ static int emulate_pop(struct x86_emulate_ctxt *ctxt,
 	int rc;
 	struct segmented_address addr;
 
-	addr.ea = register_address(ctxt, ctxt->regs[VCPU_REGS_RSP]);
+	addr.ea = ctxt->regs[VCPU_REGS_RSP] & stack_mask(ctxt);
 	addr.seg = VCPU_SREG_SS;
 	rc = segmented_read(ctxt, addr, dest, len);
 	if (rc != X86EMUL_CONTINUE)
 		return rc;
 
-	register_address_increment(ctxt, &ctxt->regs[VCPU_REGS_RSP], len);
+	rsp_increment(ctxt, len);
 	return rc;
 }
 
@@ -1688,8 +1701,7 @@ static int em_popa(struct x86_emulate_ctxt *ctxt)
 
 	while (reg >= VCPU_REGS_RAX) {
 		if (reg == VCPU_REGS_RSP) {
-			register_address_increment(ctxt, &ctxt->regs[VCPU_REGS_RSP],
-							ctxt->op_bytes);
+			rsp_increment(ctxt, ctxt->op_bytes);
 			--reg;
 		}
 
@@ -2825,7 +2837,7 @@ static int em_ret_near_imm(struct x86_emulate_ctxt *ctxt)
 	rc = emulate_pop(ctxt, &ctxt->dst.val, ctxt->op_bytes);
 	if (rc != X86EMUL_CONTINUE)
 		return rc;
-	register_address_increment(ctxt, &ctxt->regs[VCPU_REGS_RSP], ctxt->src.val);
+	rsp_increment(ctxt, ctxt->src.val);
 	return X86EMUL_CONTINUE;
 }
 
-- 
1.7.11.3
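
As a standalone illustration of what the patch changes, the sketch below reimplements the masked update in user space. The helper stack_mask_for() is an assumption for this example (the kernel derives the mask from the stack segment's B bit), and the bodies are written from the behaviour described in the patch, not copied from emulate.c. It shows that with a 16-bit stack only the low 16 bits of the register change, so polluted high bits are preserved but never reach the effective address.

```c
#include <stdint.h>

/* Replace only the bits selected by mask; leave the others untouched. */
void assign_masked(uint64_t *dest, uint64_t src, uint64_t mask)
{
    *dest = (*dest & ~mask) | (src & mask);
}

/* Add inc (possibly negative) to the masked part of the register. */
void masked_increment(uint64_t *reg, uint64_t mask, int inc)
{
    assign_masked(reg, *reg + (uint64_t)(int64_t)inc, mask);
}

/* Hypothetical helper: mask for a stack of 2, 4 or 8 bytes width. */
uint64_t stack_mask_for(unsigned stack_bytes)
{
    return stack_bytes == 8 ? ~0ull : (1ull << (stack_bytes * 8)) - 1;
}

/* Effective address used for the stack access after the update. */
uint64_t stack_ea(uint64_t rsp, uint64_t mask)
{
    return rsp & mask;
}
```

For example, starting from 0x12340000 and pushing 2 bytes on a 16-bit stack gives 0x1234fffe with effective address 0xfffe, whereas applying a 32-bit mask by mistake would borrow into the high half and give 0x1233fffe.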



Re: [kvmarm] [PATCH v10 07/14] KVM: ARM: Memory virtualization setup

2012-08-19 Thread Avi Kivity
On 08/19/2012 12:38 PM, Peter Maydell wrote:
 On 19 August 2012 05:34, Christoffer Dall c.d...@virtualopensystems.com 
 wrote:
 On Thu, Aug 16, 2012 at 2:25 PM, Alexander Graf ag...@suse.de wrote:
 A single hva can have multiple gpas mapped, no? At least that's what I 
 gathered
 from the discussion about my attempt to a function similar to this :).
 
 I don't think this is the case for ARM, can you provide an example? We
 use gfn_to_pfn_prot and only allow user memory regions. What you
 suggest would be multiple physical addresses pointing to the same
 memory bank, I don't think that makes any sense on ARM hardware, for
 x86 and PPC I don't know.
 
 I don't know what an hva is,

host virtual address

(see Documentation/virtual/kvm/mmu.txt for more TLAs in this area).

 but yes, ARM boards can have the same
 block of RAM aliased into multiple places in the physical address space.
 (we don't currently bother to implement the aliases in qemu's vexpress-a15
 though because it's a bunch of mappings of the low 2GB into high
 addresses mostly intended to let you test LPAE code without having to
 put lots of RAM on the hardware).

Even if it weren't common, the API allows it, so we must behave sensibly.
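
A minimal sketch of what "behaving sensibly" implies for a reverse lookup, assuming a simplified slot table. The struct slot layout and the function hva_to_gpas() are hypothetical, not the KVM memslot API. The lookup has to report every guest physical address that aliases a given host virtual address instead of stopping at the first match.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified memory slot: a guest physical range backed by host memory. */
struct slot {
    uint64_t gpa_base;
    uint64_t size;
    uint64_t hva_base;
};

/* Collect every gpa that maps to hva. Returns the number of matches found;
 * at most max_out of them are stored in out[]. */
size_t hva_to_gpas(const struct slot *slots, size_t nslots, uint64_t hva,
                   uint64_t *out, size_t max_out)
{
    size_t found = 0;

    for (size_t i = 0; i < nslots; i++) {
        const struct slot *s = &slots[i];

        if (hva >= s->hva_base && hva - s->hva_base < s->size) {
            if (found < max_out)
                out[found] = s->gpa_base + (hva - s->hva_base);
            found++;
        }
    }
    return found;
}
```

With two slots that share one hva_base, as in the aliased-RAM case described above, the function returns two gpas for a single hva.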

-- 
error compiling committee.c: too many arguments to function


Re: Windows slow boot: contractor wanted

2012-08-19 Thread Avi Kivity
On 08/17/2012 03:36 PM, Richard Davies wrote:
 Hi Avi,
 
 Thanks to you and several others for offering help. We will work with Avi at
 first, but are grateful for all the other offers of help. We have a number
 of other qemu-related projects which we'd be interested in getting done, and
 will get in touch with these names (and anyone else who comes forward) to
 see if any are of interest to you.
 
 
 This slow boot problem is intermittent and varies in how slow the boots are,
 but I managed to trigger it this morning with medium slow booting (5-10
 minutes) and link to the requested traces below.
 
 The host in question has 128GB RAM and dual AMD Opteron 6128 (16 cores
 total). It is running kernel 3.5.1 and qemu-kvm 1.1.1.
 
 In this morning's test, we have 3 guests, all booting Windows with 40GB RAM
 and 8 cores each (we have seen small VMs go slow as I originally said, but
 it is easier to trigger with big VMs):
 
 pid 15665: qemu-kvm -nodefaults -m 40960 -smp 8 -cpu host,hv_relaxed \
   -vga cirrus -usbdevice tablet -vnc :99 -monitor stdio -hda test1.raw
 pid 15676: qemu-kvm -nodefaults -m 40960 -smp 8 -cpu host,hv_relaxed \
   -vga cirrus -usbdevice tablet -vnc :98 -monitor stdio -hda test2.raw
 pid 15653: qemu-kvm -nodefaults -m 40960 -smp 8 -cpu host,hv_relaxed \
   -vga cirrus -usbdevice tablet -vnc :97 -monitor stdio -hda test3.raw
 
 We are running with hv_relaxed since this was suggested in the previous
 thread, but we see intermittent slow boots with and without this flag.
 
 
 All 3 VMs are booting slowly for most of the attached capture, which I
 started after confirming the slow boots and stopped as soon as the first of
 them (15665) had booted. In terms of visible symptoms, the VMs are showing
 the Windows boot progress bar, which is moving very slowly. In top, the VMs
 are at 400% CPU and their resident state size (RES) memory is slowly
 counting up until it reaches the full VM size, at which point they finish
 booting.
 
 
 Here are the trace files:
 
 http://users.org.uk/slow-win-boot-1/ps.txt (ps auxwwwf as root)
 http://users.org.uk/slow-win-boot-1/top.txt (top with 2 VMs still slow)
 http://users.org.uk/slow-win-boot-1/trace-console.txt (running trace-cmd)
 http://users.org.uk/slow-win-boot-1/trace.dat (the 1.7G trace data file)
 http://users.org.uk/slow-win-boot-1/trace-report.txt (the 4G trace report)
 
 
 Please let me know if there is anything else which I can provide?

There are tons of PAUSE exits indicating cpu overcommit (and indeed you
are overcommitted by about 50%).

What host kernel version are you running?

Does this reproduce without overcommit?
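
For reference, the 50% figure follows from the numbers quoted above: 3 guests with 8 vCPUs each is 24 vCPUs on 16 host cores. A small sketch of that arithmetic (the function name overcommit_percent() is made up for this example):

```c
/* Percentage by which the allocation exceeds what is available;
 * 0 when nothing is overcommitted (or when available is 0). */
unsigned overcommit_percent(unsigned long allocated, unsigned long available)
{
    if (available == 0 || allocated <= available)
        return 0;
    return (unsigned)((allocated - available) * 100 / available);
}
```

Applied to memory, the same function gives 0 for 3 x 40GB on a 128GB host, which matches the report that no swap is in use.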



-- 
error compiling committee.c: too many arguments to function


Big real mode use in ipxe

2012-08-19 Thread Avi Kivity
ipxe contains the following snippet:

/* Copy ROM to image source PMM block */
pushw   %es
xorw    %ax, %ax
movw    %ax, %es
movl    %esi, %edi
xorl    %esi, %esi
movzbl  romheader_size, %ecx
shll    $9, %ecx
addr32 rep movsb    /* PMM presence implies flat real mode */

Which copies an image to %edi, with %edi = 0x1.  This is in accordance 
with the PMM spec:

3.2.4 Accessing Extended Memory

This section specifies how clients should access extended memory blocks 
allocated by the PMM. When control is passed to an option ROM from a BIOS that 
supports PMM, the processor will be in big real mode, and Gate A20 will be 
disabled (segment wrap turned off). This allows access to extended memory 
blocks using real mode addressing.

In big real mode, access to memory above 1MB can be accomplished by using a 
32-bit extended index register (EDI, etc.) and setting the segment register to 
0000h. The following code example assumes that the pmmAllocate function was 
just called to allocate a block of extended memory, and DX:AX returned the 
32-bit buffer address.

; Assume here that DX:AX contains the 32-bit address of our allocated buffer.
; Clear the DS segment register.
push 0000h
pop ds
; Put the DX:AX 32-bit buffer address into EDI.
mov di, dx
; Get the upper word.
shl edi, 16
; Shift it to upper EDI.
mov di, ax
; Get the lower word.
; Example: clear the first four bytes of the extended memory buffer.
mov [edi], 00000000h
; DS:EDI is used as the memory pointer.

In a similar way, the other segment registers and 32-bit index registers can be 
used for extended memory
accessing.

So far so good.  But the Intel SDM says (20.1.1):

The IA-32 processors beginning with the Intel386 processor can generate 32-bit 
offsets using an address override prefix; however, in real-address mode, the 
value of a 32-bit offset may not exceed FFFFH without causing an exception. For 
full compatibility with Intel 286 real-address mode, pseudo-protection faults 
(interrupt 12 or 13) occur if a 32-bit offset is generated outside the range 0 
through FFFFH.

Which is exactly what happens here.  My understanding of big real mode is that 
to achieve a segment limit != 0xffff, you must go into 32-bit protected mode, 
load a segment with a larger limit, and return into real mode without touching 
the segment.  The next load of the segment will reset the limit to 0xffff.

Due to bugs in both qemu tcg and kvm, limit checks are not enforced in real 
mode, but once these bugs are fixed, the code above will break.
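
To make the limit check concrete, here is a minimal sketch in C of what an emulator that enforces limits would do for each memory access. The struct and function names are hypothetical and are not taken from kvm or qemu; expand-down segments are ignored.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a cached segment descriptor (not a KVM or QEMU type). */
struct seg_cache {
    uint32_t base;   /* linear base address */
    uint32_t limit;  /* highest valid offset, inclusive */
};

/*
 * Translate the access [offset, offset + size - 1] to a linear address.
 * Returns false where an emulator enforcing limits would raise a fault.
 */
static bool seg_translate(const struct seg_cache *seg, uint32_t offset,
                          uint32_t size, uint32_t *linear)
{
    if (size == 0 || offset > seg->limit || size - 1 > seg->limit - offset)
        return false;
    *linear = seg->base + offset;
    return true;
}
```

With a limit of 0xffff, any access at an offset above 64K is rejected; with a 4G limit the same access succeeds.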

The PMM spec also has this to say (1.3):

Big Real Mode

Big Real Mode is a modified version of the processor’s real mode with the 
segment limits changed from 1MB to 4GB. Big real mode allows the BIOS or an 
Option ROM to read and write extended memory without the overhead of protected 
mode. The BIOS puts the processor in big real mode during POST to allow 
simplified access to extended memory. The processor will be in big real mode 
while the PMM Services are callable.

This is more in line with the Intel spec, and means that the modification to 
%es must be avoided (and that seabios needs changes to either work in big real 
mode, or to put the processor back into big real mode after returning from a 
PMM service).

The whole thing is very unfortunate, as kvm is very slow while in big real 
mode, on certain processors.

-- 
error compiling committee.c: too many arguments to function


Re: [PATCH 37/74] lto, KVM: Don't assume asm statements end up in the same assembler file

2012-08-19 Thread Andi Kleen
 The reason we use a local label is so that the function isn't split
 into two from the profiler's point of view.  See cd2276a795b013d1.

Hmm that commit message is not very enlightening.

The goal was to force a compiler error?

With LTO there is no way to force two functions to be in the same assembler
file. The partitioner is always allowed to split.

 
 One way to fix this is to have a .data variable initialized to point to
 .Lkvm_vmx_return (this can be done from the same asm statement in
 vmx_vcpu_run), and reference that variable in
 vmx_set_constant_host_state().  If no one comes up with a better idea,
 I'll write a patch doing this.

I'm not clear how that is better than my patch.

-andi

-- 
a...@linux.intel.com -- Speaking for myself only.


Re: [PATCH 37/74] lto, KVM: Don't assume asm statements end up in the same assembler file

2012-08-19 Thread Avi Kivity
On 08/19/2012 06:09 PM, Andi Kleen wrote:
 The reason we use a local label is so that the function isn't split
 into two from the profiler's point of view.  See cd2276a795b013d1.
 
 Hmm that commit message is not very enlightening.
 
 The goal was to force a compiler error?

No, the goal was to avoid a global label in the middle of a function.
The profiler interprets it as a new function.  After your patch,
profiles will show a function named kvm_vmx_return taking a few percent
cpu, although there is no such function.

 
 With LTO there is no way to force two functions to be in the same assembler
 file. The partitioner is always allowed to split.

I'm not trying to force two functions to be in the same assembler file.

 
 
 One way to fix this is to have a .data variable initialized to point to
 .Lkvm_vmx_return (this can be done from the same asm statement in
 vmx_vcpu_run), and reference that variable in
 vmx_set_constant_host_state().  If no one comes up with a better idea,
 I'll write a patch doing this.
 
 I'm not clear how that is better than my patch.

My patch will not generate the artifact with kvm_vmx_return.

-- 
error compiling committee.c: too many arguments to function


Re: [PATCH 37/74] lto, KVM: Don't assume asm statements end up in the same assembler file

2012-08-19 Thread Andi Kleen
On Sun, Aug 19, 2012 at 06:12:57PM +0300, Avi Kivity wrote:
 On 08/19/2012 06:09 PM, Andi Kleen wrote:
  The reason we use a local label is so that the function isn't split
  into two from the profiler's point of view.  See cd2276a795b013d1.
  
  Hmm that commit message is not very enlightening.
  
  The goal was to force a compiler error?
 
 No, the goal was to avoid a global label in the middle of a function.
 The profiler interprets it as a new function.  After your patch,

Ah got it now. I always used to have the same problem with sys_call_return.

I wonder if there shouldn't be a way to tell perf to ignore a symbol.

  
  One way to fix this is to have a .data variable initialized to point to
  .Lkvm_vmx_return (this can be done from the same asm statement in
  vmx_vcpu_run), and reference that variable in
  vmx_set_constant_host_state().  If no one comes up with a better idea,
  I'll write a patch doing this.
  
  I'm not clear how that is better than my patch.
 
 My patch will not generate the artifact with kvm_vmx_return.

Ok fine for me. I'll keep this patch for now, until you have
something better.

-Andi


-- 
a...@linux.intel.com -- Speaking for myself only.


Re: [ipxe-devel] Big real mode use in ipxe

2012-08-19 Thread Avi Kivity
On 08/19/2012 06:34 PM, Michael Brown wrote:
 On Sunday 19 Aug 2012 16:07:05 Avi Kivity wrote:
 Which is exactly what happens here.  My understanding of big real mode is
 that to achieve a segment limit != 0xffff, you must go into 32-bit
 protected mode, load a segment with a larger limit, and return into real
 mode without touching the segment.  The next load of the segment will
 reset the limit to 0xffff.
 
 Not quite.  You can't return into real mode without touching the segment, 
 since part of the process of returning to real mode is to reload the segment 
 registers with real-mode values, and this happens _after_ setting CR0.PE=0.
 
 Whenever CR0.PE=0, loading a segment register with value N will load the
 literal value (N<<4) into the base address for that segment, without changing
 the limit.  This is the trick that allows flat real mode (aka big real mode)
 to work; the limit remains at 4G even after loading the segment register with
 a real-mode value.

So I see, from looking at the Xen source.  I'll also double-check with
bochs.  Looks like I'll need to fix kvm not to reset the segment limit
when reloading a segment in real mode.

 
 (and that seabios needs changes to either work in
 big real mode, or to put the processor back into big real mode after
 returning from a PMM service.
 
 If seabios switches into protected mode when performing a PMM service, then
 it _must_ leave the segment limits at 4G when returning to real mode.  To do
 otherwise will violate the PMM spec, and will break conforming clients such
 as iPXE.

This probably works, since iPXE works on kvm on AMD and on Intel
processors with unrestricted guest support.

-- 
error compiling committee.c: too many arguments to function


Re: [ipxe-devel] Big real mode use in ipxe

2012-08-19 Thread Michael Brown
On Sunday 19 Aug 2012 16:07:05 Avi Kivity wrote:
 Which is exactly what happens here.  My understanding of big real mode is
 that to achieve a segment limit != 0xffff, you must go into 32-bit
 protected mode, load a segment with a larger limit, and return into real
 mode without touching the segment.  The next load of the segment will
 reset the limit to 0xffff.

Not quite.  You can't return into real mode without touching the segment, 
since part of the process of returning to real mode is to reload the segment 
registers with real-mode values, and this happens _after_ setting CR0.PE=0.

Whenever CR0.PE=0, loading a segment register with value N will load the 
literal value (N<<4) into the base address for that segment, without changing 
the limit.  This is the trick that allows flat real mode (aka big real mode) to 
work; the limit remains at 4G even after loading the segment register with a 
real-mode value.
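
The rule above can be modelled in a few lines of C. This is only a sketch of the behaviour as described in this message; the struct and function names are made up for illustration and do not come from any emulator.

```c
#include <stdint.h>

/* Hypothetical model of one segment register's hidden state. */
struct seg_reg {
    uint16_t selector;
    uint32_t base;
    uint32_t limit;
};

/* Load with CR0.PE=0: base becomes selector << 4, limit is left alone. */
static void load_seg_real_mode(struct seg_reg *seg, uint16_t selector)
{
    seg->selector = selector;
    seg->base = (uint32_t)selector << 4;
}

/* Load with CR0.PE=1: base and limit both come from the descriptor. */
static void load_seg_protected(struct seg_reg *seg, uint16_t selector,
                               uint32_t desc_base, uint32_t desc_limit)
{
    seg->selector = selector;
    seg->base = desc_base;
    seg->limit = desc_limit;
}
```

Loading a 4G descriptor in protected mode and then reloading the register with CR0.PE=0 leaves the limit at 4G, which is exactly the flat real mode state.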

 (and that seabios needs changes to either work in
 big real mode, or to put the processor back into big real mode after
 returning from a PMM service.

If seabios switches into protected mode when performing a PMM service, then it 
_must_ leave the segment limits at 4G when returning to real mode.  To do 
otherwise will violate the PMM spec, and will break conforming clients such as 
iPXE.

Michael


Re: [Qemu-devel] Big real mode use in ipxe

2012-08-19 Thread Kevin O'Connor
On Sun, Aug 19, 2012 at 06:07:05PM +0300, Avi Kivity wrote:
 ipxe contains the following snippet:
 
   /* Copy ROM to image source PMM block */
   pushw   %es
   xorw    %ax, %ax
   movw    %ax, %es
   movl    %esi, %edi
   xorl    %esi, %esi
   movzbl  romheader_size, %ecx
   shll    $9, %ecx
   addr32 rep movsb    /* PMM presence implies flat real mode */
 
 Which copies an image to %edi, with %edi = 0x1.  This is in accordance 
 with the PMM spec:
[...]
 So far so good.  But the Intel SDM says (20.1.1):
 
 The IA-32 processors beginning with the Intel386 processor can generate
 32-bit offsets using an address override prefix; however, in real-address
 mode, the value of a 32-bit offset may not exceed FFFFH without causing an
 exception. For full compatibility with Intel 286 real-address mode,
 pseudo-protection faults (interrupt 12 or 13) occur if a 32-bit offset is
 generated outside the range 0 through FFFFH.

I interpreted the above to mean "however, in [normal real-mode where
the segment limits are set to 0xffff] real-address mode, the value
of a 32-bit offset may not exceed FFFFH without causing an exception".

 Which is exactly what happens here.  My understanding of big real
 mode is that to achieve a segment limit != 0xffff, you must go into
 32-bit protected mode, load a segment with a larger limit, and
 return into real mode without touching the segment.  The next load
 of the segment will reset the limit to 0xffff.

No, the segment limit is only changed when the protected mode bit is
set and the segment register is loaded.  When the protected mode bit
is not set, only the segment offset changes.

[...]
 The PMM spec also has this to say (1.3):
 
 Big Real Mode
 
 Big Real Mode is a modified version of the processor’s real mode
 with the segment limits changed from 1MB to 4GB. Big real mode
 allows the BIOS or an Option ROM to read and write extended memory
 without the overhead of protected mode. The BIOS puts the processor
 in big real mode during POST to allow simplified access to extended
 memory. The processor will be in big real mode while the PMM
 Services are callable.
 
 This is more in line with the Intel spec, and means that the
 modification to %es must be avoided (and that seabios needs changes
 to either work in big real mode, or to put the processor back into
 big real mode after returning from a PMM service.

The SeaBIOS code is regularly used on a variety of real processors
(which do enforce segment limits).  This includes several different
AMD processors and Intel processors.  It has also been tested in the
past with other manufacturers (eg, Via).  We've never seen an issue
with the big real mode support.

 The whole thing is very unfortunate, as kvm is very slow while in
 big real mode, on certain processors.

Unfortunately, big real mode is a requirement for option roms.

-Kevin


Re: [Qemu-devel] Big real mode use in ipxe

2012-08-19 Thread Avi Kivity
On 08/19/2012 06:44 PM, Kevin O'Connor wrote:
 On Sun, Aug 19, 2012 at 06:07:05PM +0300, Avi Kivity wrote:
 ipxe contains the following snippet:
 
  /* Copy ROM to image source PMM block */
  pushw   %es
  xorw    %ax, %ax
  movw    %ax, %es
  movl    %esi, %edi
  xorl    %esi, %esi
  movzbl  romheader_size, %ecx
  shll    $9, %ecx
  addr32 rep movsb    /* PMM presence implies flat real mode */
 
 Which copies an image to %edi, with %edi = 0x1.  This is in accordance 
 with the PMM spec:
 [...]
 So far so good.  But the Intel SDM says (20.1.1):
 
 The IA-32 processors beginning with the Intel386 processor can generate
 32-bit offsets using an address override prefix; however, in real-address
 mode, the value of a 32-bit offset may not exceed FFFFH without causing an
 exception. For full compatibility with Intel 286 real-address mode,
 pseudo-protection faults (interrupt 12 or 13) occur if a 32-bit offset is
 generated outside the range 0 through FFFFH.
 
 I interpreted the above to mean "however, in [normal real-mode where
 the segment limits are set to 0xffff] real-address mode, the value
 of a 32-bit offset may not exceed FFFFH without causing an exception".

I understood it the same way.

 
 Which is exactly what happens here.  My understanding of big real
 mode is that to achieve a segment limit != 0xffff, you must go into
 32-bit protected mode, load a segment with a larger limit, and
 return into real mode without touching the segment.  The next load
 of the segment will reset the limit to 0xffff.
 
 No, the segment limit is only changed when the protected mode bit is
 set and the segment register is loaded.  When the protected mode bit
 is not set, only the segment offset changes.

That's what I missed.  I always understood a segment reload in real mode
to reset the limit field, though I had no basis for it.  I'll fix kvm
not to do this.


-- 
error compiling committee.c: too many arguments to function


Re: [Qemu-devel] [ipxe-devel] Big real mode use in ipxe

2012-08-19 Thread Kevin O'Connor
On Sun, Aug 19, 2012 at 04:34:50PM +0100, Michael Brown wrote:
 On Sunday 19 Aug 2012 16:07:05 Avi Kivity wrote:
  (and that seabios needs changes to either work in
  big real mode, or to put the processor back into big real mode after
  returning from a PMM service.
 
 If seabios switches into protected mode when performing a PMM service, then
 it _must_ leave the segment limits at 4G when returning to real mode.  To do
 otherwise will violate the PMM spec, and will break conforming clients such
 as iPXE.

SeaBIOS does switch to 32bit mode during PMM calls and does switch to
16bit big real mode (segment limits set to 4G) on return.

-Kevin


Re: [kvmarm] [PATCH v10 07/14] KVM: ARM: Memory virtualization setup

2012-08-19 Thread Christoffer Dall
On Sun, Aug 19, 2012 at 9:00 AM, Avi Kivity a...@redhat.com wrote:
 On 08/19/2012 12:38 PM, Peter Maydell wrote:
 On 19 August 2012 05:34, Christoffer Dall c.d...@virtualopensystems.com 
 wrote:
 On Thu, Aug 16, 2012 at 2:25 PM, Alexander Graf ag...@suse.de wrote:
 A single hva can have multiple gpas mapped, no? At least that's what I 
 gathered
 from the discussion about my attempt to a function similar to this :).

 I don't think this is the case for ARM, can you provide an example? We
 use gfn_to_pfn_prot and only allow user memory regions. What you
 suggest would be multiple physical addresses pointing to the same
 memory bank, I don't think that makes any sense on ARM hardware, for
 x86 and PPC I don't know.

 I don't know what an hva is,

 host virtual address

 (see Documentation/virtual/kvm/mmu.txt for more TLAs in this area).

  but yes, ARM boards can have the same
 block of RAM aliased into multiple places in the physical address space.
 (we don't currently bother to implement the aliases in qemu's vexpress-a15
 though because it's a bunch of mappings of the low 2GB into high
 addresses mostly intended to let you test LPAE code without having to
 put lots of RAM on the hardware).

I stand corrected.


 Even if it weren't common, the API allows it, so we must behave sensibly.


true, this should be a solution:

commit 2a8661fd7e6c15889a20a4547bd7861e84b778a8
Author: Christoffer Dall c.d...@virtualopensystems.com
Date:   Sun Aug 19 15:52:10 2012 -0400

KVM: ARM: A single hva can map multiple gpas

Handle mmu notifier ops for every such mapping.

Signed-off-by: Christoffer Dall c.d...@virtualopensystems.com

diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
index 3df4fa8..9b23230 100644
--- a/arch/arm/kvm/mmu.c
+++ b/arch/arm/kvm/mmu.c
@@ -754,11 +754,14 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu, struct kvm_run *run)
 	return ret ? ret : 1;
 }

-static bool hva_to_gpa(struct kvm *kvm, unsigned long hva, gpa_t *gpa)
+static int handle_hva_to_gpa(struct kvm *kvm, unsigned long hva,
+			     void (*handler)(struct kvm *kvm, unsigned long hva,
+					     gpa_t gpa, void *data),
+			     void *data)
 {
 	struct kvm_memslots *slots;
 	struct kvm_memory_slot *memslot;
-	bool found = false;
+	int cnt = 0;

 	slots = kvm_memslots(kvm);

@@ -769,31 +772,36 @@ static bool hva_to_gpa(struct kvm *kvm, unsigned long hva, gpa_t *gpa)

 		end = start + (memslot->npages << PAGE_SHIFT);
 		if (hva >= start && hva < end) {
+			gpa_t gpa;
 			gpa_t gpa_offset = hva - start;
-			*gpa = (memslot->base_gfn << PAGE_SHIFT) + gpa_offset;
-			found = true;
-			/* no overlapping memslots allowed: break */
-			break;
+			gpa = (memslot->base_gfn << PAGE_SHIFT) + gpa_offset;
+			handler(kvm, hva, gpa, data);
+			cnt++;
 		}
 	}

-	return found;
+	return cnt;
+}
+
+static void kvm_unmap_hva_handler(struct kvm *kvm, unsigned long hva,
+				  gpa_t gpa, void *data)
+{
+	spin_lock(&kvm->arch.pgd_lock);
+	stage2_clear_pte(kvm, gpa);
+	spin_unlock(&kvm->arch.pgd_lock);
 }

 int kvm_unmap_hva(struct kvm *kvm, unsigned long hva)
 {
-	bool found;
-	gpa_t gpa;
+	int found;

 	if (!kvm->arch.pgd)
 		return 0;

-	found = hva_to_gpa(kvm, hva, &gpa);
-	if (found) {
-		spin_lock(&kvm->arch.pgd_lock);
-		stage2_clear_pte(kvm, gpa);
-		spin_unlock(&kvm->arch.pgd_lock);
-	}
+	found = handle_hva_to_gpa(kvm, hva, kvm_unmap_hva_handler, NULL);
+	if (found > 0)
+		__kvm_tlb_flush_vmid(kvm);
+
 	return 0;
 }

@@ -814,21 +822,27 @@ int kvm_unmap_hva_range(struct kvm *kvm,
 	return 0;
 }

+static void kvm_set_spte_handler(struct kvm *kvm, unsigned long hva,
+				 gpa_t gpa, void *data)
+{
+	pte_t *pte = (pte_t *)data;
+
+	spin_lock(&kvm->arch.pgd_lock);
+	stage2_set_pte(kvm, NULL, gpa, pte);
+	spin_unlock(&kvm->arch.pgd_lock);
+}
+
+
 void kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte)
 {
-	gpa_t gpa;
-	bool found;
+	int found;

 	if (!kvm->arch.pgd)
 		return;

-	found = hva_to_gpa(kvm, hva, &gpa);
-	if (found) {
-		spin_lock(&kvm->arch.pgd_lock);
-		stage2_set_pte(kvm, NULL, gpa, &pte);
-		spin_unlock(&kvm->arch.pgd_lock);
+	found = handle_hva_to_gpa(kvm, hva, kvm_set_spte_handler, &pte);
+	if (found > 0)
 		__kvm_tlb_flush_vmid(kvm);
-	}
 }

 void kvm_mmu_free_memory_caches(struct kvm_vcpu *vcpu)
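
For readers who want to try the idea outside the kernel, the following is a simplified userspace model of the same walk. It is a sketch only: struct memslot, for_each_gpa_of_hva() and record_gpa() are illustrative stand-ins, not kernel types, and 4 KiB pages are assumed.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12  /* assumed 4 KiB pages */

/* Simplified stand-in for struct kvm_memory_slot. */
struct memslot {
    uint64_t userspace_addr;  /* hva of the first byte */
    uint64_t npages;
    uint64_t base_gfn;        /* guest frame number of the first page */
};

typedef void (*gpa_handler)(uint64_t hva, uint64_t gpa, void *data);

/* Call handler once per slot that contains hva; return the number of hits. */
static int for_each_gpa_of_hva(const struct memslot *slots, size_t nslots,
                               uint64_t hva, gpa_handler handler, void *data)
{
    int cnt = 0;

    for (size_t i = 0; i < nslots; i++) {
        uint64_t start = slots[i].userspace_addr;
        uint64_t end = start + (slots[i].npages << PAGE_SHIFT);

        if (hva >= start && hva < end) {
            uint64_t gpa = (slots[i].base_gfn << PAGE_SHIFT) + (hva - start);
            handler(hva, gpa, data);
            cnt++;
        }
    }
    return cnt;
}

/* Example handler: append each gpa to a small caller-provided list. */
struct gpa_list {
    uint64_t gpa[4];
    int n;
};

static void record_gpa(uint64_t hva, uint64_t gpa, void *data)
{
    struct gpa_list *list = data;

    (void)hva;
    if (list->n < 4)
        list->gpa[list->n++] = gpa;
}
```

Two slots that share a userspace_addr (an aliased RAM block) both produce a callback for the same hva, which is the case the old early break missed.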


Re: perf uncore lkvm woes

2012-08-19 Thread Yan, Zheng
On 08/19/2012 05:55 PM, Avi Kivity wrote:
 On 08/17/2012 09:56 AM, Peter Zijlstra wrote:
 On Fri, 2012-08-17 at 09:40 +0800, Yan, Zheng wrote:

 Peter, do I need to submit a patch that disables uncore on virtualized CPUs?

 I think Avi prefers the method where KVM 'fakes' the MSRs and we have to
 detect if the MSRs actually work or not.
 
 s/we have/we don't have/.
 

 If you're willing to have a go at that, please do so. If you're not sure
 how to do the KVM part, I'm sure Avi and/or Gleb can help you out.
 
 Certainly, please see kvm_pmu_get_msr() and kvm_pmu_set_msr().
 
 The approach is that if an msr write can be emulated correctly (for
 example, it disables a counter) then we let it proceed; if it cannot be
 emulated correctly (for example it enables a counter that we cannot
 emulate), then we ignore it, but print out a message that tells the user
 that we're faking something that may cause the guest to malfunction.
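
As a rough illustration of the policy described above (not the actual kvm_pmu_set_msr() code), a write handler could look like the sketch below. The fake_pmu struct, the single enable bit and the message text are all assumptions made for the example.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical model: one control MSR whose bit 0 enables a counter. */
#define CTL_ENABLE_BIT 0x1ull

struct fake_pmu {
    uint64_t ctl;          /* value the guest reads back */
    bool can_count;        /* whether the host can really emulate counting */
    unsigned int ignored;  /* number of writes that were dropped */
};

/* Returns true if the write was emulated faithfully, false if it was dropped. */
static bool fake_pmu_write_ctl(struct fake_pmu *pmu, uint64_t data)
{
    if (!(data & CTL_ENABLE_BIT) || pmu->can_count) {
        pmu->ctl = data;   /* disabling always works; enabling needs support */
        return true;
    }
    /* Cannot emulate: drop the write, but tell the user about it. */
    pmu->ignored++;
    fprintf(stderr, "pmu: ignoring unsupported ctl write 0x%llx\n",
            (unsigned long long)data);
    return false;
}
```

The guest never sees a fault in either case; the only difference is whether the host logs that it is faking the register.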
 

There is only one kvm_pmu structure in struct kvm_vcpu_arch, but the uncore
driver may define dozens of PMUs. Besides, the uncore PMUs make extensive use
of extra registers, so I don't think we can store all of this information in
the kvm_pmu structure.

The uncore PMU collects system-wide events on a given socket, so it may not be
possible for a virtualized CPU to simulate it. I think it's better to just
disable uncore on virtualized CPUs.

Regards
Yan, Zheng


Re: perf uncore lkvm woes

2012-08-19 Thread Yan, Zheng
On 08/19/2012 05:55 PM, Avi Kivity wrote:
 On 08/17/2012 09:56 AM, Peter Zijlstra wrote:
 On Fri, 2012-08-17 at 09:40 +0800, Yan, Zheng wrote:

 Peter, do I need to submit a patch that disables uncore on virtualized CPUs?

 I think Avi prefers the method where KVM 'fakes' the MSRs and we have to
 detect if the MSRs actually work or not.
 
 s/we have/we don't have/.
 

 If you're willing to have a go at that, please do so. If you're not sure
 how to do the KVM part, I'm sure Avi and/or Gleb can help you out.
 
 Certainly, please see kvm_pmu_get_msr() and kvm_pmu_set_msr().
 
 The approach is that if an msr write can be emulated correctly (for
 example, it disables a counter) then we let it proceed; if it cannot be
 emulated correctly (for example it enables a counter that we cannot
 emulate), then we ignore it, but print out a message that tells the user
 that we're faking something that may cause the guest to malfunction.
 
 

Does anyone know how to detect whether the kernel is running on a virtualized CPU?
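
One common convention on x86 is the hypervisor-present bit, CPUID leaf 1, ECX bit 31, which the x86 kernel exposes as X86_FEATURE_HYPERVISOR. The userspace sketch below only illustrates that bit test; the helper names are made up, and a hypervisor is not obliged to set the bit, so a false result does not prove bare metal.

```c
#include <stdbool.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>  /* GCC/Clang helper for the CPUID instruction */
#endif

/* CPUID.1:ECX bit 31 is reserved for hypervisors to signal their presence. */
#define CPUID_1_ECX_HYPERVISOR (1u << 31)

static bool ecx_reports_hypervisor(uint32_t ecx)
{
    return (ecx & CPUID_1_ECX_HYPERVISOR) != 0;
}

/* Returns false on non-x86 builds or when CPUID leaf 1 is unavailable. */
static bool running_under_hypervisor(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;

    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return ecx_reports_hypervisor(ecx);
#endif
    return false;
}
```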

Regards
Yan, Zheng


Re: [PATCH 3/5] KVM: PPC: Book3S HV: Handle memory slot deletion and modification correctly

2012-08-19 Thread Avi Kivity
On 08/17/2012 09:39 PM, Marcelo Tosatti wrote:
 
 Yes. Well, Avi mentioned earlier that there are users for change of GPA
 base. But, if my understanding is correct, the code that emulates
 change of BAR in QEMU is:
 
 /* now do the real mapping */
 if (r->addr != PCI_BAR_UNMAPPED) {
     memory_region_del_subregion(r->address_space, r->memory);
 }
 r->addr = new_addr;
 if (r->addr != PCI_BAR_UNMAPPED) {
     memory_region_add_subregion_overlap(r->address_space,
                                         r->addr, r->memory, 1);
 
 These translate to two kvm_set_user_memory ioctls. 

Not directly.  These functions change a qemu-internal memory map, which
is then transferred to kvm.  Those two calls might be in a transaction
(they aren't now), in which case the memory map update is atomic.

So indeed we issue two ioctls now, but that's a side effect of the
implementation, not related to those two calls being separate.

 
  Without taking into consideration backwards compatibility, userspace 
   can first delete the slot and later create a new one.
 
  Current qemu will in fact do that.  Not sure about older ones.
 
 
 Avi, where does it do that?

By that I meant first deleting the first slot and then creating a new one.


-- 
error compiling committee.c: too many arguments to function