Re: [Qemu-devel] 9p broken?

2012-07-30 Thread Aneesh Kumar K.V
Avi Kivity  writes:

> Having an annoying bug on i386 kvm I decided to debug it buy running an
> i386 guest on my x86_64 host, use 9p to access a guest image, and run it
> using nested kvm.
>
> However, 9p appears to be broken: first, the configure test fails (patch
> sent).  Second, while mount works, ls on the mount point causes qemu to
> crash with

I missed that you have already sent a patch for configure fix. That
looks better that what i sent. I will ack that patch 

>
> (gdb) bt
> #0  error_set (errp=0x7fffe95fb128, fmt=0x558d4568 "{ 'class':
> 'VirtFSFeatureBlocksMigration', 'data': { 'path': %s, 'tag': %s } }") at
> /home/tlv/akivity/qemu/error.c:32
> #1  0x5567cb06 in v9fs_attach (opaque=0x7fffe95e3020) at
> /home/tlv/akivity/qemu/hw/9pfs/virtio-9p.c:988
> #2  0x5561d19f in coroutine_trampoline (i0=1449767888, i1=21845)
> at /home/tlv/akivity/qemu/coroutine-ucontext.c:138
> #3  0x75a93ef0 in ?? ()   from /lib64/libc.so.6
> #4  0x7fffce00 in ?? ()
> #5  0x in ?? (
>
> **errp already points to a VirtFSFeatureBlocksMigration error;
> v9fs_attach() has been called a second time (the first time,
> understandably, on mount; the second on ls).
>

Why are we calling attach a second time ?. I am also not able to reproduce this

root@qemu-img-64:~# mount -t 9p -otrans=virtio,version=9p2000.L v_tmp /mnt
root@qemu-img-64:~# ls /mnt/a.c
/mnt/a.c


> Command line:
>
> qemu-system-x86_64 -m 4G -smp 4 -drive
> file=/images/Fedora-i386.img,if=virtio,cache=none -cdrom
> /images/iso/bfo.iso -device virtio-9p-pci,fsdev=root,mount_tag=root
> -fsdev local,id=root,path=/,security_model=passthrough -enable-kvm -net
> nic,model=virtio,netdev=net -netdev user,id=net -monitor stdio -cpu host
>

-aneesh

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] 9p broken?

2012-07-30 Thread Aneesh Kumar K.V
"Richard W.M. Jones"  writes:

> On Mon, Jul 30, 2012 at 03:35:39PM +0300, Avi Kivity wrote:
>> Having an annoying bug on i386 kvm I decided to debug it buy running an
>> i386 guest on my x86_64 host, use 9p to access a guest image, and run it
>> using nested kvm.
>> 
>> However, 9p appears to be broken: first, the configure test fails (patch
>> sent).  Second, while mount works, ls on the mount point causes qemu to
>> crash with
>> 
>> (gdb) bt
>> #0  error_set (errp=0x7fffe95fb128, fmt=0x558d4568 "{ 'class':
>> 'VirtFSFeatureBlocksMigration', 'data': { 'path': %s, 'tag': %s } }") at
>> /home/tlv/akivity/qemu/error.c:32
>> #1  0x5567cb06 in v9fs_attach (opaque=0x7fffe95e3020) at
>> /home/tlv/akivity/qemu/hw/9pfs/virtio-9p.c:988
>> #2  0x5561d19f in coroutine_trampoline (i0=1449767888, i1=21845)
>> at /home/tlv/akivity/qemu/coroutine-ucontext.c:138
>> #3  0x75a93ef0 in ?? ()  from /lib64/libc.so.6
>> #4  0x7fffce00 in ?? ()
>> #5  0x in ?? (
>> 
>> **errp already points to a VirtFSFeatureBlocksMigration error;
>> v9fs_attach() has been called a second time (the first time,
>> understandably, on mount; the second on ls).
>
> Yes, I can reproduce this too.
>
> LIBGUESTFS_QEMU=~/d/qemu/qemu.wrapper \
> guestfish -v -- \
>   sparse /tmp/unused 100M : \
>   config -device 'virtio-9p-pci,fsdev=root,mount_tag=root' : \
>   config -fsdev 'local,id=root,path=/tmp,security_model=passthrough' : \
>   run : \
>   mount-9p root / : \
>   ls /
>
> Stack trace:
>
> #0  0x7fb1d4d19ba5 in __GI_raise (sig=sig@entry=6)
> at ../nptl/sysdeps/unix/sysv/linux/raise.c:63
> #1  0x7fb1d4d1b358 in __GI_abort () at abort.c:90
> #2  0x7fb1d4d12972 in __assert_fail_base (fmt=
> 0x7fb1d4e5c8e8 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", 
> assertion=assertion@entry=0x7fb1d8e2e87e "*errp == ((void *)0)", 
> file=file@entry=0x7fb1d8e56217 "error.c", line=line@entry=35, 
> function=function@entry=
> 0x7fb1d8e2e8ca <__PRETTY_FUNCTION__.13983> "error_set") at assert.c:92
> #3  0x7fb1d4d12a22 in __GI___assert_fail (assertion=assertion@entry=
> 0x7fb1d8e2e87e "*errp == ((void *)0)", file=file@entry=
> 0x7fb1d8e56217 "error.c", line=line@entry=35, function=function@entry=
> 0x7fb1d8e2e8ca <__PRETTY_FUNCTION__.13983> "error_set") at assert.c:101
> #4  0x7fb1d8c1147f in error_set (errp=errp@entry=0x7fb1ce36a128, 
> fmt=fmt@entry=
> 0x7fb1d8e39a78 "{ 'class': 'VirtFSFeatureBlocksMigration', 'data': { 
> 'path': %s, 'tag': %s } }") at error.c:35
> #5  0x7fb1d8c57f9b in v9fs_attach (opaque=0x7fb1ce352020)
> at /home/rjones/d/qemu/hw/9pfs/virtio-9p.c:988
> #6  0x7fb1d8c0fcfa in coroutine_trampoline (i0=, 
> i1=) at coroutine-ucontext.c:138
> #7  0x7fb1d4d2a2f0 in ?? () from /lib64/libc.so.6
> #8  0x7fff51061aa0 in ?? ()
> #9  0xb5b5b5b5b5b5b5b5 in ?? ()
> #10 0x in ?? ()
>
> I'll add a regression test for 9p to libguestfs so at least we will
> catch this in future during Fedora builds.
>

I am not able to reproduce this, I had to patch configure to fix some
compile errors, but then I am not hitting the crash. I am using the
latest qemu. We do the below in 9p

error_set(&s->migration_blocker, QERR_VIRTFS_FEATURE_BLOCKS_MIGRATION,
  s->ctx.fs_root ? s->ctx.fs_root : "NULL", s->tag);

I am not sure how we can hit that assert() in error_set() ?. 
We allocate s via 

s = (V9fsState *)virtio_common_init() which does

vdev = g_malloc0(struct_size);

-aneesh

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] kvm tools: don't bother tracking is_dir

2012-07-30 Thread Pekka Enberg
On Sun, 22 Jul 2012, Sasha Levin wrote:
> This is something we can calculate on the fly, and doesn't justify the 
> overhead
> of tracking it all over fid transitions.
> 
> Signed-off-by: Sasha Levin 

What kind of overhead are you talking about?

> ---
>  tools/kvm/include/kvm/virtio-9p.h |1 -
>  tools/kvm/virtio/9p.c |   16 +++-
>  2 files changed, 11 insertions(+), 6 deletions(-)
> 
> diff --git a/tools/kvm/include/kvm/virtio-9p.h 
> b/tools/kvm/include/kvm/virtio-9p.h
> index 186fe05..cb590d1 100644
> --- a/tools/kvm/include/kvm/virtio-9p.h
> +++ b/tools/kvm/include/kvm/virtio-9p.h
> @@ -26,7 +26,6 @@ struct p9_msg {
>  struct p9_fid {
>   u32 fid;
>   u32 uid;
> - u8  is_dir;
>   charabs_path[PATH_MAX];
>   char*path;
>   DIR *dir;
> diff --git a/tools/kvm/virtio/9p.c b/tools/kvm/virtio/9p.c
> index c663dab..2cf99a0 100644
> --- a/tools/kvm/virtio/9p.c
> +++ b/tools/kvm/virtio/9p.c
> @@ -228,6 +228,15 @@ static int virtio_p9_openflags(int flags)
>   return flags;
>  }
>  
> +static bool is_dir(struct p9_fid *fid)
> +{
> + struct stat st;
> +
> + stat(fid->abs_path, &st);
> +
> + return S_ISDIR(st.st_mode);
> +}
> +
>  static void virtio_p9_open(struct p9_dev *p9dev,
>  struct p9_pdu *pdu, u32 *outlen)
>  {
> @@ -245,7 +254,7 @@ static void virtio_p9_open(struct p9_dev *p9dev,
>  
>   stat2qid(&st, &qid);
>  
> - if (new_fid->is_dir) {
> + if (is_dir(new_fid)) {
>   new_fid->dir = opendir(new_fid->abs_path);
>   if (!new_fid->dir)
>   goto err_out;
> @@ -394,7 +403,6 @@ static void virtio_p9_walk(struct p9_dev *p9dev,
>   goto err_out;
>  
>   stat2qid(&st, &wqid);
> - new_fid->is_dir = S_ISDIR(st.st_mode);
>   strcpy(new_fid->path, tmp);
>   new_fid->uid = fid->uid;
>   nwqid++;
> @@ -406,7 +414,6 @@ static void virtio_p9_walk(struct p9_dev *p9dev,
>*/
>   pdu->write_offset += sizeof(u16);
>   old_fid = get_fid(p9dev, fid_val);
> - new_fid->is_dir = old_fid->is_dir;
>   strcpy(new_fid->path, old_fid->path);
>   new_fid->uid= old_fid->uid;
>   }
> @@ -445,7 +452,6 @@ static void virtio_p9_attach(struct p9_dev *p9dev,
>  
>   fid = get_fid(p9dev, fid_val);
>   fid->uid = uid;
> - fid->is_dir = 1;
>   strcpy(fid->path, "/");
>  
>   virtio_p9_pdu_writef(pdu, "Q", &qid);
> @@ -547,7 +553,7 @@ static void virtio_p9_readdir(struct p9_dev *p9dev,
>   virtio_p9_pdu_readf(pdu, "dqd", &fid_val, &offset, &count);
>   fid = get_fid(p9dev, fid_val);
>  
> - if (!fid->is_dir) {
> + if (!is_dir(fid)) {
>   errno = EINVAL;
>   goto err_out;
>   }
> -- 
> 1.7.8.6
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC/PATCH] Use kernel supplied MMU info for kvm tool

2012-07-30 Thread Pekka Enberg
On Wed, 18 Jul 2012, Michael Ellerman wrote:
> It occurred to me overnight that I forgot to mention that in order to
> build the new code you need the headers from a 3.5-rc1 era kernel (for
> the ioctl & KVM_CAP definitions).
> 
> The easiest way to do that is to merge linus' tree into kvmtool.
> 
> Are you planning on doing that in the master kvmtool tree anytime soon?
> It's still based on 3.4-rc1 it seems.

Done. Sorry for the delay!
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [GIT PULL (PATCH 0/4)] VFIO driver for v3.6

2012-07-30 Thread Linus Torvalds
On Mon, Jul 30, 2012 at 4:17 PM, Alex Williamson
 wrote:
>
> I'm pretty anxious to find out as well.  Linus, ping, any thoughts on
> including this in 3.6?  Thanks,

I just pulled it, but then I unpulled again when I realized it's not a
signed tag and it's on github.

Please, people. Do tagged releases with proper signatures if you're
not using kernel.org or other controlled servers. In fact, I prefer
signed tags even if you *do* use kernel.org etc.

 Linus
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: KVM: MMU: Tracking guest writes through EPT entries ?

2012-07-30 Thread Xiao Guangrong
On 07/31/2012 01:18 AM, Sunil wrote:
> Hello List,
> 
> I am a KVM newbie and studying KVM mmu code.
> 
> On the existing guest, I am trying to track all guest writes by
> marking page table entry as read-only in EPT entry [ I am using Intel
> machine with vmx and ept support ]. Looks like EPT support re-uses
> shadow page table(SPT) code and hence some of SPT routines.
> 
> I was thinking of below possible approach. Use pte_list_walk() to
> traverse through list of sptes and use mmu_spte_update()  to flip the
> PT_WRITABLE_MASK flag. But all SPTEs are not part of any single list;
> but on separate lists (based on gfn, page level, memory_slot). So,
> recording all the faulted guest GFN and then using above method work ?
> 

There are two ways to write-protect all sptes:
- use kvm_mmu_slot_remove_write_access() on all memslots
- walk the shadow page cache to get the shadow pages in the highest level
  (level = 4 on EPT), then write-protect its entries.

If you just want to do it for the specified gfn, you can use
rmap_write_protect().

Just inquisitive, what is your purpose? :)

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[GIT PULL] tcm_vhost: Initial merge of vhost level target fabric driver

2012-07-30 Thread Nicholas A. Bellinger
Hi Linus,

Here is the PULL request for the initial merge of tcm_vhost based on
RFC-v5 code with MST's ACK appended to the initial merge commit.
As promised, the commit is available from two different branches for you
to consider merging as for-3.6 code.

The 'for-next-merge' branch based on mainline commit 7409a6657ae using
3.5-rc2 code contains two duplicates of pre-merge vhost patch
dependencies that have already been merged into mainline via net-next.
This commit is also in the 07302012 -next patchset, and available here:

  git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending.git 
for-next-merge

Or the 'for-linus' branch containing an -rc0 head @ commit bdc0077af57:

   Merge tag 'scsi-misc' of git://git.kernel.org/../jejb/scsi)

rebased up to the last commit in scsi-misc required for virtio-scsi
client LLD scanning logic to function properly with tcm_vhost fabric
ports, is available here:

  git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending.git for-linus

Both branches have gotten recent testing and have been running
over-night small block random I/O tests connected to raw block flash
backends.  The same diffstat below will result from pulling either
branch.

Also, the incremental patch to address MST's last round of post-merge
comments has been sent to the lists for feedback this afternoon.  This
will be included into the usual post -rc1 PULL via 3.6-rc-fixes, along
with any other bits that end up changing post-merge.

Please let us know if you have any concerns.

Thank you!

--nab

Nicholas Bellinger (1):
  tcm_vhost: Initial merge for vhost level target fabric driver

 drivers/vhost/Kconfig |3 +
 drivers/vhost/Kconfig.tcm |6 +
 drivers/vhost/Makefile|2 +
 drivers/vhost/tcm_vhost.c | 1628 +
 drivers/vhost/tcm_vhost.h |  101 +++
 5 files changed, 1740 insertions(+), 0 deletions(-)
 create mode 100644 drivers/vhost/Kconfig.tcm
 create mode 100644 drivers/vhost/tcm_vhost.c
 create mode 100644 drivers/vhost/tcm_vhost.h

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v7 2/2] kvm: KVM_EOIFD, an eventfd for EOIs

2012-07-30 Thread Alex Williamson
On Tue, 2012-07-31 at 03:36 +0300, Michael S. Tsirkin wrote:
> On Mon, Jul 30, 2012 at 06:26:31PM -0600, Alex Williamson wrote:
> > On Tue, 2012-07-31 at 03:01 +0300, Michael S. Tsirkin wrote:
> > > On Mon, Jul 30, 2012 at 10:22:10AM -0600, Alex Williamson wrote:
> > > > On Sun, 2012-07-29 at 17:54 +0300, Michael S. Tsirkin wrote:
> > > > > On Tue, Jul 24, 2012 at 02:43:22PM -0600, Alex Williamson wrote:
> > > > > > This new ioctl enables an eventfd to be triggered when an EOI is
> > > > > > written for a specified irqchip pin.  The first user of this will
> > > > > > be external device assignment through VFIO, using a level irqfd
> > > > > > for asserting a PCI INTx interrupt and this interface for de-assert
> > > > > > and notification once the interrupt is serviced.
> > > > > > 
> > > > > > Here we make use of the reference counting of the _irq_source
> > > > > > object allowing us to share it with an irqfd and cleanup regardless
> > > > > > of the release order.
> > > > > > 
> > > > > > Signed-off-by: Alex Williamson 
> > > > > 
> > > > > > ---
> > > > > > 
> > > > > >  Documentation/virtual/kvm/api.txt |   21 ++
> > > > > >  arch/x86/kvm/x86.c|2 
> > > > > >  include/linux/kvm.h   |   15 ++
> > > > > >  include/linux/kvm_host.h  |   13 +
> > > > > >  virt/kvm/eventfd.c|  336 
> > > > > > +
> > > > > >  virt/kvm/kvm_main.c   |   11 +
> > > > > >  6 files changed, 398 insertions(+)
> > > > > > 
> > > > > > diff --git a/Documentation/virtual/kvm/api.txt 
> > > > > > b/Documentation/virtual/kvm/api.txt
> > > > > > index 3911e62..8cd6b36 100644
> > > > > > --- a/Documentation/virtual/kvm/api.txt
> > > > > > +++ b/Documentation/virtual/kvm/api.txt
> > > > > > @@ -1989,6 +1989,27 @@ return the hash table order in the 
> > > > > > parameter.  (If the guest is using
> > > > > >  the virtualized real-mode area (VRMA) facility, the kernel will
> > > > > >  re-create the VMRA HPTEs on the next KVM_RUN of any vcpu.)
> > > > > >  
> > > > > > +4.77 KVM_EOIFD
> > > > > > +
> > > > > > +Capability: KVM_CAP_EOIFD
> > > > > > +Architectures: x86
> > > > > > +Type: vm ioctl
> > > > > > +Parameters: struct kvm_eoifd (in)
> > > > > > +Returns: 0 on success, < 0 on error
> > > > > > +
> > > > > > +KVM_EOIFD allows userspace to receive interrupt EOI notification
> > > > > > +through an eventfd.
> > > > > 
> > > > > I thought about it some more, and I think it should be renamed to an
> > > > > interrupt ack notification than eoi notification.
> > > > > For example, consider userspace that uses threaded 
> > > > > interrupts.interrupts.
> > > > > Currently what will happen is each interrupt will be injected
> > > > > twice, since on eoi device is still asserting it.
> > > > 
> > > > I don't follow, why is userspace writing an eoi to the ioapic if it
> > > > hasn't handled the interrupt
> > > 
> > > It has handled it - it disabled the hardware interrupt.
> > 
> > So it's not injected twice, it's held pending at the ioapic the second
> > time, just like hardware.
> 
> It is not like hardware at all. in hardware there is no overhead
> here you interrupot the guest to run handler in host.

Obviously we have some overhead, we're emulating the guest hardware.
That doesn't make the behavior unlike hardware.

> >  Maybe there's a future optimization there,
> > but I don't think it's appropriate at this time.
> 
> Yes. But to make it *possible* in future we must remove
> the requirement to signal fd immediately on EOI.
> So rename it ackfd.

How does the name make that possible?  We can easily add a flag
EOIFD_FLAG_EOI_ON_REENABLE, or whatever.

> > > > and why wouldn't the same happen on bare
> > > > metal?
> > > 
> > > on bare metal level does not matter as long as interrupt
> > > is disabled.
> > > 
> > > > > One fix would be to delay event until interrupt is re-enabled.
> > > > > Now I am not asking you to fix this immediately,
> > > > > but I think we should make the interface generic by
> > > > > saying we report an ack to userspace and not specifically EOI.
> > > > 
> > > > Using the word "delay" in the context of interrupt delivery raises all
> > > > sorts of red flags for me, but I really don't understand your argument.
> > > 
> > > I am saying it's an "ack" of interrupt userspace cares about.
> > > The fact it is done by EOI is an implementation detail.
> > 
> > The implementation is how an EOI is generated on an ioapic, not that an
> > EOI exists.  How do I read a hardware spec and figure out what "ack of
> > interrupt" means?
> 
> It just means it will be called after guest has completed handling
> interrupt. How we detect that is our problem.

Conceptually, we're still looking for the EOI, we may just be able to
optimize to EOI && irqchip pin unmasked.  The name doesn't prohibit
anything here and eoifd is more descriptive that it relates to the end
of an interrupt where ackfd means nothing (what got acked?).

> > > > > 

Re: [PATCH 2/2] KVM: PPC: booke: Added debug handler

2012-07-30 Thread Scott Wood
On 07/30/2012 06:11 AM, Bharat Bhushan wrote:
> +1:   /* debug interrupt happened in guest */
> + mfspr   r4, \scratch
> + mtcrr3
> + mr  r3, r4
> + mfspr   r4, SPRN_SPRG_THREAD
> + lwz r4, THREAD_KVM_VCPU(r4)
> + stw r3, VCPU_GPR(r4)(r4)
> + stw r5, VCPU_GPR(r5)(r4)
> + stw r6, VCPU_GPR(r6)(r4)

You're not working on the latest tree -- all this stuff is
"VCPU_GPR(R6)(r4)" now.

-Scott


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v7 2/2] kvm: KVM_EOIFD, an eventfd for EOIs

2012-07-30 Thread Michael S. Tsirkin
On Mon, Jul 30, 2012 at 06:26:31PM -0600, Alex Williamson wrote:
> On Tue, 2012-07-31 at 03:01 +0300, Michael S. Tsirkin wrote:
> > On Mon, Jul 30, 2012 at 10:22:10AM -0600, Alex Williamson wrote:
> > > On Sun, 2012-07-29 at 17:54 +0300, Michael S. Tsirkin wrote:
> > > > On Tue, Jul 24, 2012 at 02:43:22PM -0600, Alex Williamson wrote:
> > > > > This new ioctl enables an eventfd to be triggered when an EOI is
> > > > > written for a specified irqchip pin.  The first user of this will
> > > > > be external device assignment through VFIO, using a level irqfd
> > > > > for asserting a PCI INTx interrupt and this interface for de-assert
> > > > > and notification once the interrupt is serviced.
> > > > > 
> > > > > Here we make use of the reference counting of the _irq_source
> > > > > object allowing us to share it with an irqfd and cleanup regardless
> > > > > of the release order.
> > > > > 
> > > > > Signed-off-by: Alex Williamson 
> > > > 
> > > > > ---
> > > > > 
> > > > >  Documentation/virtual/kvm/api.txt |   21 ++
> > > > >  arch/x86/kvm/x86.c|2 
> > > > >  include/linux/kvm.h   |   15 ++
> > > > >  include/linux/kvm_host.h  |   13 +
> > > > >  virt/kvm/eventfd.c|  336 
> > > > > +
> > > > >  virt/kvm/kvm_main.c   |   11 +
> > > > >  6 files changed, 398 insertions(+)
> > > > > 
> > > > > diff --git a/Documentation/virtual/kvm/api.txt 
> > > > > b/Documentation/virtual/kvm/api.txt
> > > > > index 3911e62..8cd6b36 100644
> > > > > --- a/Documentation/virtual/kvm/api.txt
> > > > > +++ b/Documentation/virtual/kvm/api.txt
> > > > > @@ -1989,6 +1989,27 @@ return the hash table order in the parameter.  
> > > > > (If the guest is using
> > > > >  the virtualized real-mode area (VRMA) facility, the kernel will
> > > > >  re-create the VMRA HPTEs on the next KVM_RUN of any vcpu.)
> > > > >  
> > > > > +4.77 KVM_EOIFD
> > > > > +
> > > > > +Capability: KVM_CAP_EOIFD
> > > > > +Architectures: x86
> > > > > +Type: vm ioctl
> > > > > +Parameters: struct kvm_eoifd (in)
> > > > > +Returns: 0 on success, < 0 on error
> > > > > +
> > > > > +KVM_EOIFD allows userspace to receive interrupt EOI notification
> > > > > +through an eventfd.
> > > > 
> > > > I thought about it some more, and I think it should be renamed to an
> > > > interrupt ack notification than eoi notification.
> > > > For example, consider userspace that uses threaded interrupts.
> > > > Currently what will happen is each interrupt will be injected
> > > > twice, since on eoi device is still asserting it.
> > > 
> > > I don't follow, why is userspace writing an eoi to the ioapic if it
> > > hasn't handled the interrupt
> > 
> > It has handled it - it disabled the hardware interrupt.
> 
> So it's not injected twice, it's held pending at the ioapic the second
> time, just like hardware.

It is not like hardware at all. in hardware there is no overhead
here you interrupot the guest to run handler in host.

>  Maybe there's a future optimization there,
> but I don't think it's appropriate at this time.

Yes. But to make it *possible* in future we must remove
the requirement to signal fd immediately on EOI.
So rename it ackfd.

> > > and why wouldn't the same happen on bare
> > > metal?
> > 
> > on bare metal level does not matter as long as interrupt
> > is disabled.
> > 
> > > > One fix would be to delay event until interrupt is re-enabled.
> > > > Now I am not asking you to fix this immediately,
> > > > but I think we should make the interface generic by
> > > > saying we report an ack to userspace and not specifically EOI.
> > > 
> > > Using the word "delay" in the context of interrupt delivery raises all
> > > sorts of red flags for me, but I really don't understand your argument.
> > 
> > I am saying it's an "ack" of interrupt userspace cares about.
> > The fact it is done by EOI is an implementation detail.
> 
> The implementation is how an EOI is generated on an ioapic, not that an
> EOI exists.  How do I read a hardware spec and figure out what "ack of
> interrupt" means?

It just means it will be called after guest has completed handling
interrupt. How we detect that is our problem.

> > > > >  kvm_eoifd.fd specifies the eventfd used for
> > > > > +notification.  KVM_EOIFD_FLAG_DEASSIGN is used to de-assign an eoifd
> > > > > +once assigned.  KVM_EOIFD also requires additional bits set in
> > > > > +kvm_eoifd.flags to bind to the proper interrupt line.  The
> > > > > +KVM_EOIFD_FLAG_LEVEL_IRQFD indicates that kvm_eoifd.key is provided
> > > > > +and is a key from a level triggered interrupt (configured from
> > > > > +KVM_IRQFD using KVM_IRQFD_FLAG_LEVEL).  The EOI notification is bound
> > > > > +to the same GSI and irqchip input as the irqfd.  Both kvm_eoifd.key
> > > > > +and KVM_EOIFD_FLAG_LEVEL_IRQFD must be specified on assignment and
> > > > > +de-assignment of KVM_EOIFD.  A level irqfd may only be bound to a
> > > > > +single eoifd.  KVM

Re: [PATCH v7 2/2] kvm: KVM_EOIFD, an eventfd for EOIs

2012-07-30 Thread Alex Williamson
On Tue, 2012-07-31 at 03:01 +0300, Michael S. Tsirkin wrote:
> On Mon, Jul 30, 2012 at 10:22:10AM -0600, Alex Williamson wrote:
> > On Sun, 2012-07-29 at 17:54 +0300, Michael S. Tsirkin wrote:
> > > On Tue, Jul 24, 2012 at 02:43:22PM -0600, Alex Williamson wrote:
> > > > This new ioctl enables an eventfd to be triggered when an EOI is
> > > > written for a specified irqchip pin.  The first user of this will
> > > > be external device assignment through VFIO, using a level irqfd
> > > > for asserting a PCI INTx interrupt and this interface for de-assert
> > > > and notification once the interrupt is serviced.
> > > > 
> > > > Here we make use of the reference counting of the _irq_source
> > > > object allowing us to share it with an irqfd and cleanup regardless
> > > > of the release order.
> > > > 
> > > > Signed-off-by: Alex Williamson 
> > > 
> > > > ---
> > > > 
> > > >  Documentation/virtual/kvm/api.txt |   21 ++
> > > >  arch/x86/kvm/x86.c|2 
> > > >  include/linux/kvm.h   |   15 ++
> > > >  include/linux/kvm_host.h  |   13 +
> > > >  virt/kvm/eventfd.c|  336 
> > > > +
> > > >  virt/kvm/kvm_main.c   |   11 +
> > > >  6 files changed, 398 insertions(+)
> > > > 
> > > > diff --git a/Documentation/virtual/kvm/api.txt 
> > > > b/Documentation/virtual/kvm/api.txt
> > > > index 3911e62..8cd6b36 100644
> > > > --- a/Documentation/virtual/kvm/api.txt
> > > > +++ b/Documentation/virtual/kvm/api.txt
> > > > @@ -1989,6 +1989,27 @@ return the hash table order in the parameter.  
> > > > (If the guest is using
> > > >  the virtualized real-mode area (VRMA) facility, the kernel will
> > > >  re-create the VMRA HPTEs on the next KVM_RUN of any vcpu.)
> > > >  
> > > > +4.77 KVM_EOIFD
> > > > +
> > > > +Capability: KVM_CAP_EOIFD
> > > > +Architectures: x86
> > > > +Type: vm ioctl
> > > > +Parameters: struct kvm_eoifd (in)
> > > > +Returns: 0 on success, < 0 on error
> > > > +
> > > > +KVM_EOIFD allows userspace to receive interrupt EOI notification
> > > > +through an eventfd.
> > > 
> > > I thought about it some more, and I think it should be renamed to an
> > > interrupt ack notification than eoi notification.
> > > For example, consider userspace that uses threaded interrupts.
> > > Currently what will happen is each interrupt will be injected
> > > twice, since on eoi device is still asserting it.
> > 
> > I don't follow, why is userspace writing an eoi to the ioapic if it
> > hasn't handled the interrupt
> 
> It has handled it - it disabled the hardware interrupt.

So it's not injected twice, it's held pending at the ioapic the second
time, just like hardware.  Maybe there's a future optimization there,
but I don't think it's appropriate at this time.

> > and why wouldn't the same happen on bare
> > metal?
> 
> on bare metal level does not matter as long as interrupt
> is disabled.
> 
> > > One fix would be to delay event until interrupt is re-enabled.
> > > Now I am not asking you to fix this immediately,
> > > but I think we should make the interface generic by
> > > saying we report an ack to userspace and not specifically EOI.
> > 
> > Using the word "delay" in the context of interrupt delivery raises all
> > sorts of red flags for me, but I really don't understand your argument.
> 
> I am saying it's an "ack" of interrupt userspace cares about.
> The fact it is done by EOI is an implementation detail.

The implementation is how an EOI is generated on an ioapic, not that an
EOI exists.  How do I read a hardware spec and figure out what "ack of
interrupt" means?

> > > >  kvm_eoifd.fd specifies the eventfd used for
> > > > +notification.  KVM_EOIFD_FLAG_DEASSIGN is used to de-assign an eoifd
> > > > +once assigned.  KVM_EOIFD also requires additional bits set in
> > > > +kvm_eoifd.flags to bind to the proper interrupt line.  The
> > > > +KVM_EOIFD_FLAG_LEVEL_IRQFD indicates that kvm_eoifd.key is provided
> > > > +and is a key from a level triggered interrupt (configured from
> > > > +KVM_IRQFD using KVM_IRQFD_FLAG_LEVEL).  The EOI notification is bound
> > > > +to the same GSI and irqchip input as the irqfd.  Both kvm_eoifd.key
> > > > +and KVM_EOIFD_FLAG_LEVEL_IRQFD must be specified on assignment and
> > > > +de-assignment of KVM_EOIFD.  A level irqfd may only be bound to a
> > > > +single eoifd.  KVM_CAP_EOIFD_LEVEL_IRQFD indicates support of
> > > > +KVM_EOIFD_FLAG_LEVEL_IRQFD.
> > > >  
> > > 
> > > Hmm returning the key means we'll need to keep refcounting for source
> > > IDs around forever. I liked passing the fd better: make implementation
> > > match interface and not the other way around.
> > 
> > False, a source ID has a finite lifecycle.  The fd approach was broken.
> > Holding the irqfd context imposed too many dependencies between eoifd
> > and irqfd necessitating things like one interface disabling another.  I
> > thoroughly disagree with that approach.
> 
> You keep saying this but it is 

Re: [PATCH v7 2/2] kvm: KVM_EOIFD, an eventfd for EOIs

2012-07-30 Thread Michael S. Tsirkin
On Mon, Jul 30, 2012 at 10:22:10AM -0600, Alex Williamson wrote:
> On Sun, 2012-07-29 at 17:54 +0300, Michael S. Tsirkin wrote:
> > On Tue, Jul 24, 2012 at 02:43:22PM -0600, Alex Williamson wrote:
> > > This new ioctl enables an eventfd to be triggered when an EOI is
> > > written for a specified irqchip pin.  The first user of this will
> > > be external device assignment through VFIO, using a level irqfd
> > > for asserting a PCI INTx interrupt and this interface for de-assert
> > > and notification once the interrupt is serviced.
> > > 
> > > Here we make use of the reference counting of the _irq_source
> > > object allowing us to share it with an irqfd and cleanup regardless
> > > of the release order.
> > > 
> > > Signed-off-by: Alex Williamson 
> > 
> > > ---
> > > 
> > >  Documentation/virtual/kvm/api.txt |   21 ++
> > >  arch/x86/kvm/x86.c|2 
> > >  include/linux/kvm.h   |   15 ++
> > >  include/linux/kvm_host.h  |   13 +
> > >  virt/kvm/eventfd.c|  336 
> > > +
> > >  virt/kvm/kvm_main.c   |   11 +
> > >  6 files changed, 398 insertions(+)
> > > 
> > > diff --git a/Documentation/virtual/kvm/api.txt 
> > > b/Documentation/virtual/kvm/api.txt
> > > index 3911e62..8cd6b36 100644
> > > --- a/Documentation/virtual/kvm/api.txt
> > > +++ b/Documentation/virtual/kvm/api.txt
> > > @@ -1989,6 +1989,27 @@ return the hash table order in the parameter.  (If 
> > > the guest is using
> > >  the virtualized real-mode area (VRMA) facility, the kernel will
> > >  re-create the VMRA HPTEs on the next KVM_RUN of any vcpu.)
> > >  
> > > +4.77 KVM_EOIFD
> > > +
> > > +Capability: KVM_CAP_EOIFD
> > > +Architectures: x86
> > > +Type: vm ioctl
> > > +Parameters: struct kvm_eoifd (in)
> > > +Returns: 0 on success, < 0 on error
> > > +
> > > +KVM_EOIFD allows userspace to receive interrupt EOI notification
> > > +through an eventfd.
> > 
> > I thought about it some more, and I think it should be renamed to an
> > interrupt ack notification than eoi notification.
> > For example, consider userspace that uses threaded interrupts.
> > Currently what will happen is each interrupt will be injected
> > twice, since on eoi device is still asserting it.
> 
> I don't follow, why is userspace writing an eoi to the ioapic if it
> hasn't handled the interrupt

It has handled it - it disabled the hardware interrupt.

> and why wouldn't the same happen on bare
> metal?

on bare metal level does not matter as long as interrupt
is disabled.

> > One fix would be to delay event until interrupt is re-enabled.
> > Now I am not asking you to fix this immediately,
> > but I think we should make the interface generic by
> > saying we report an ack to userspace and not specifically EOI.
> 
> Using the word "delay" in the context of interrupt delivery raises all
> sorts of red flags for me, but I really don't understand your argument.

I am saying it's an "ack" of interrupt userspace cares about.
The fact it is done by EOI is an implementation detail.

> > >  kvm_eoifd.fd specifies the eventfd used for
> > > +notification.  KVM_EOIFD_FLAG_DEASSIGN is used to de-assign an eoifd
> > > +once assigned.  KVM_EOIFD also requires additional bits set in
> > > +kvm_eoifd.flags to bind to the proper interrupt line.  The
> > > +KVM_EOIFD_FLAG_LEVEL_IRQFD indicates that kvm_eoifd.key is provided
> > > +and is a key from a level triggered interrupt (configured from
> > > +KVM_IRQFD using KVM_IRQFD_FLAG_LEVEL).  The EOI notification is bound
> > > +to the same GSI and irqchip input as the irqfd.  Both kvm_eoifd.key
> > > +and KVM_EOIFD_FLAG_LEVEL_IRQFD must be specified on assignment and
> > > +de-assignment of KVM_EOIFD.  A level irqfd may only be bound to a
> > > +single eoifd.  KVM_CAP_EOIFD_LEVEL_IRQFD indicates support of
> > > +KVM_EOIFD_FLAG_LEVEL_IRQFD.
> > >  
> > 
> > Hmm returning the key means we'll need to keep refcounting for source
> > IDs around forever. I liked passing the fd better: make implementation
> > match interface and not the other way around.
> 
> False, a source ID has a finite lifecycle.  The fd approach was broken.
> Holding the irqfd context imposed too many dependencies between eoifd
> and irqfd necessitating things like one interface disabling another.  I
> thoroughly disagree with that approach.

You keep saying this but it is still true: once irqfd
is closed eoifd does not get any more interrupts.

> > >  5. The kvm_run structure
> > >  
> > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > index 9ded39d..8f3164e 100644
> > > --- a/arch/x86/kvm/x86.c
> > > +++ b/arch/x86/kvm/x86.c
> > > @@ -2171,6 +2171,8 @@ int kvm_dev_ioctl_check_extension(long ext)
> > >   case KVM_CAP_PCI_2_3:
> > >   case KVM_CAP_KVMCLOCK_CTRL:
> > >   case KVM_CAP_IRQFD_LEVEL:
> > > + case KVM_CAP_EOIFD:
> > > + case KVM_CAP_EOIFD_LEVEL_IRQFD:
> > >   r = 1;
> > >   break;
> > >   case KVM_CAP_COALES

Re: [PATCH] KVM: Don't update PPR on any APIC read

2012-07-30 Thread Marcelo Tosatti
On Sun, Jul 22, 2012 at 05:41:00PM +0300, Avi Kivity wrote:
> The current code will update the PPR on almost any APIC read; however
> that's only required if we read the PPR.
> 
> kvm_update_ppr() shows up in some profiles, albeit with a low usage (~1%).
> This should reduce it further (it will still be called during interrupt
> processing).
> 
> Signed-off-by: Avi Kivity 

Applied, thanks.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: qemu-kvm-1.1.0 crashing with kernel 3.5.0-rc6

2012-07-30 Thread Marcelo Tosatti
On Mon, Jul 30, 2012 at 07:39:31PM +0300, Avi Kivity wrote:
> On 07/30/2012 05:07 PM, Chris Clayton wrote:
> >>
>  With kernel 3.5.0 with b2da15ac26a0c00 reverted, I have just had 15
>  clean invocations of vanilla qemu-kvm-1.1.1. So that commit would seem
>  to be the problem.
> >>>
> >>> Just to be sure, I've run some more tests today. No crashes occurred in
> >>> 20 runs of vanilla qemu-kvm-1.1.1 on kernel 3.5.0 with b2da15ac26a0c00
> >>> reverted.
> >>
> >> Ok.  I'm trying to reproduce it here on a nested-virt setup, since the
> >> code looks correct.
> >>
> >> What's your preemption settings?
> >>
> >>
> > [chris:~/kernel/linux-3.5.0]$ grep PREEMPT .config
> > CONFIG_TREE_PREEMPT_RCU=y
> > CONFIG_PREEMPT_RCU=y
> > CONFIG_PREEMPT_NOTIFIERS=y
> > # CONFIG_PREEMPT_NONE is not set
> > # CONFIG_PREEMPT_VOLUNTARY is not set
> > CONFIG_PREEMPT=y
> > CONFIG_PREEMPT_COUNT=y
> 
> Here's what I think that is happening
> 
>   vcpu_load
>   ...
>   vmx_save_host_state
>   vmx_vcpu_run
>   (ds.cpl, es.cpl cleared by hardware)
> 
>   interrupt
> push ds, es  # pushes bad ds, es
> schedule
>   vmx_vcpu_put
> vmx_load_host_state
>   reload ds, es
> pop ds, es  # of other thread's stack
> iret
>   # other thread runs
>   interrupt
> schedule  # back in vcpu thread
> interrupt return: pop ds, es  # <-- problem
> iret
> 
>...
>vcpu_put
> 
># bad ds, es, but !vmx->host_state.loaded
> 
> Marcelo, did I miss something here?

Don't think so.

> 
> Unfortunately, my reproducer has ceased to reproduce.  But the fix is
> easy if the analysis above is right.
> 
> -- 
> error compiling committee.c: too many arguments to function
> 
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [GIT PULL (PATCH 0/4)] VFIO driver for v3.6

2012-07-30 Thread Alex Williamson
On Fri, 2012-07-27 at 15:32 +1000, Paul Mackerras wrote:
> On Wed, Jul 25, 2012 at 08:53:06AM -0600, Alex Williamson wrote:
> > Hi Linus,
> > 
> > This series includes the VFIO userspace driver interface for the
> > 3.6 kernel merge window.  This driver is intended to provide a
> > secure interface for device access using IOMMU protection for
> > applications like assignment of physical devices to virtual
> > machines.  Qemu will be the first user of this interface, enabling
> > assignment of PCI devices to Qemu guests.  This interface is
> > intended to eventually replace the x86-specific assignment mechanism
> > currently available in KVM.  This interface has the advantage of
> > being more secure, by working with IOMMU groups to ensure device
> > isolation and providing it's own filtered resource access mechanism,
> > and also more flexible, in not being x86 or KVM specific (extensions
> > to enable POWER are already working).
> > 
> > As a new driver, I'm including both the individual patches in email,
> > as well as a branch to pull from:
> > 
> > git://github.com/awilliam/linux-vfio.git for-linus
> > 
> > This driver is originally the work of Tom Lyon, but has since been
> > handed over to me and gone through a complete overhaul thanks to the
> > input from David Gibson, Ben Herrenschmidt, Chris Wright, Joerg
> > Roedel, and others.  This driver has been available in linux-next for
> > the last month.  Thanks,
> 
> Linus,
> 
> Are you thinking of pulling this driver in for 3.6?  I would be glad
> to see it go in since we want to use it with KVM on PowerPC.  If
> possible we'd like the PowerPC bits for it to go in as well.

I'm pretty anxious to find out as well.  Linus, ping, any thoughts on
including this in 3.6?  Thanks,

Alex

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] KVM: PPC: Book3S_32: Fix MTMSR_EERI macro

2012-07-30 Thread Alexander Graf
Commit b38c77d82e4 moved the MTMSR_EERI macro from the KVM code to generic
ppc_asm.h code. However, while adding it in the headers for the ppc32 case,
it missed out to remove the former definition in the KVM code.

This patch fixes compilation on server type PPC32 targets with CONFIG_KVM
enabled.

Signed-off-by: Alexander Graf 
---
 arch/powerpc/kvm/book3s_rmhandlers.S |1 -
 1 files changed, 0 insertions(+), 1 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_rmhandlers.S 
b/arch/powerpc/kvm/book3s_rmhandlers.S
index ab523f3..9ecf6e3 100644
--- a/arch/powerpc/kvm/book3s_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_rmhandlers.S
@@ -67,7 +67,6 @@ kvmppc_skip_Hinterrupt:
 #elif defined(CONFIG_PPC_BOOK3S_32)
 
 #define FUNC(name) name
-#define MTMSR_EERI(reg)mtmsr   (reg)
 
 .macro INTERRUPT_TRAMPOLINE intno
 
-- 
1.6.0.2

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [RFC PATCH] vfio: VFIO PCI driver for Qemu

2012-07-30 Thread Alex Williamson
On Sun, 2012-07-29 at 16:47 +0300, Avi Kivity wrote:
> On 07/26/2012 08:40 PM, Alex Williamson wrote:
> > On Thu, 2012-07-26 at 19:34 +0300, Avi Kivity wrote:
> >> On 07/25/2012 08:03 PM, Alex Williamson wrote:
> >> 
> >> > +/*
> >> > + * Resource setup
> >> > + */
> >> > +static void vfio_unmap_bar(VFIODevice *vdev, int nr)
> >> > +{
> >> > +VFIOBAR *bar = &vdev->bars[nr];
> >> > +uint64_t size;
> >> > +
> >> > +if (!memory_region_size(&bar->mem)) {
> >> > +return;
> >> > +}
> > 
> > This one is the "slow" mapped MemoryRegion.  If there's nothing here,
> > the BAR isn't populated.
> > 
> >> > +
> >> > +size = memory_region_size(&bar->mmap_mem);
> >> > +if (size) {
> >> > + memory_region_del_subregion(&bar->mem, &bar->mmap_mem);
> >> > + munmap(bar->mmap, size);
> >> > +}
> > 
> > This is the direct mapped MemoryRegion that potentially overlays the
> > "slow" mapping above for MMIO BARs of sufficient alignment.  If the BAR
> > includes the MSI-X vector table, this maps the region in front of the
> > table
> 
> If the region size is zero, then both memory_region_del_subregion()
> (assuming the region is parented) and munmap() do nothing.  So you could
> call this unconditionally.

I suppose parenting them is the key.  I'm counting on memory_region_size
of zero for an uninitialized, g_malloc0() MemoryRegion.  Initializing
them just to have a parent so we can unconditionally remove them here
seems like it's just shifting complexity from one function to another.
The majority of BARs aren't even implemented, so we'd actually be
setting up a lot of dummy infrastructure for a slightly cleaner unmap
function.  I'll keep looking at this, but I'm not optimistic there's an
overall simplification here.
 
> >> > +
> >> > +if (vdev->msix && vdev->msix->table_bar == nr) {
> >> > +size = memory_region_size(&vdev->msix->mmap_mem);
> >> > +if (size) {
> >> > +memory_region_del_subregion(&bar->mem, 
> >> > &vdev->msix->mmap_mem);
> >> > +munmap(vdev->msix->mmap, size);
> >> > +}
> >> > +}
> > 
> > And this one potentially unmaps the overlap after the vector table if
> > there's any space for one.
> > 
> >> Are the three size checks needed? Everything should work without them
> >> from the memory core point of view.
> > 
> > I haven't tried, but I strongly suspect I shouldn't be munmap'ing
> > NULL... no?
> 
> NULL isn't the problem (well some kernels protect against mmaping NULL
> to avoid kernel exploits), but it seems the kernel doesn't like a zero
> length.

in mm/mmap.c:do_munmap() I see:

if ((len = PAGE_ALIGN(len)) == 0)
return -EINVAL;

Before anything scary happens, so that should be ok.  It's not really
worthwhile to call the munmaps unconditionally if we already have the
condition tests because the subregions are unparented though.

> >> > +
> >> > +static void vfio_map_bar(VFIODevice *vdev, int nr, uint8_t type)
> >> > +{
> >> > +VFIOBAR *bar = &vdev->bars[nr];
> >> > +unsigned size = bar->size;
> >> > +char name[64];
> >> > +
> >> > +sprintf(name, "VFIO %04x:%02x:%02x.%x BAR %d", vdev->host.domain,
> >> > +vdev->host.bus, vdev->host.slot, vdev->host.function, nr);
> >> > +
> >> > +/* A "slow" read/write mapping underlies all BARs */
> >> > +memory_region_init_io(&bar->mem, &vfio_bar_ops, bar, name, size);
> >> > +pci_register_bar(&vdev->pdev, nr, type, &bar->mem);
> >> 
> >> So far all container BARs have been pure containers, without RAM or I/O
> >> callbacks.  It should all work, but this sets precedent and requires it
> >> to work.  I guess there's no problem supporting it though.
> > 
> > KVM device assignment already makes use of this as well, if I understand
> > correctly.
> 
> Okay.
> 
> > 
> >> > +
> >> > +if (type & PCI_BASE_ADDRESS_SPACE_IO) {
> >> > +return; /* IO space is only slow, don't expect high perf here */
> >> > +}
> >> 
> >> What about non-x86 where IO is actually memory?  I think you can drop
> >> this and let the address space filtering in the listener drop it if it
> >> turns out to be in IO space.
> > 
> > They're probably saying "What's I/O port space?" ;)  Yeah, there may be
> > some room to do more here, but no need until we have something that can
> > make use of it. 
> 
> Most likely all that is needed is to drop the test.
> 
> > Note that these are the BAR mappings, which turn into
> > MemoryRegions, so I'm not sure what the listener has to do with
> > filtering these just yet.
> 
> +static bool vfio_listener_skipped_section(MemoryRegionSection *section)
> +{
> +return (section->address_space != get_system_memory() ||
> +!memory_region_is_ram(section->mr));
> +}
> 
> Or the filter argument to memory_listener_register() (which you use --
> you can drop the first check above).  On x86 those I/O regions will be
> filtered out, on non-x86 with a properly-wired chipset emulation they'll

[PATCH] tcm_vhost: Post-merge review changes requested by MST

2012-07-30 Thread Nicholas A. Bellinger
From: Nicholas Bellinger 

This patch contains the post RFC-v5 (post-merge) changes, this includes:

- Add locking comment
- Move vhost_scsi_complete_cmd ahead of TFO callbacks in order to
  drop forward declarations
- Drop extra '!= NULL' usage in vhost_scsi_complete_cmd_work()
- Change vhost_scsi_*_handle_kick() to use pr_debug
- Fix possible race in vhost_scsi_set_endpoint() for vs->vs_tpg checking
  + assignment.
- Convert tv_tpg->tpg_vhost_count + ->tv_tpg_port_count from atomic_t ->
  int, and make sure reference is protected by ->tv_tpg_mutex.
- Drop unnecessary vhost_scsi->vhost_ref_cnt
- Add 'err:' label for exception path in vhost_scsi_clear_endpoint()
- Add enum for VQ numbers, add usage in vhost_scsi_open()
- Add vhost_scsi_flush() + vhost_scsi_flush_vq() following
  drivers/vhost/net.c
- Add smp_wmb() + vhost_scsi_flush() call during vhost_scsi_set_features()
- Drop unnecessary copy_from_user() usage with GET_ABI_VERSION ioctl
- Add missing vhost_scsi_compat_ioctl() caller for vhost_scsi_fops
- Fix function parameter definition first line to follow existing
  vhost code style
- Change 'vHost' usage -> 'vhost' in handful of locations
- Change -EPERM -> -EBUSY usage for two failures in tcm_vhost_drop_nexus()
- Add comment for tcm_vhost_workqueue in tcm_vhost_init()
- Make GET_ABI_VERSION return 'int' + add comment in tcm_vhost.h

Reported-by: Michael S. Tsirkin 
Cc: Michael S. Tsirkin 
Cc: Stefan Hajnoczi 
Cc: Anthony Liguori 
Cc: Zhi Yong Wu 
Cc: Paolo Bonzini 
Signed-off-by: Nicholas Bellinger 
---
 drivers/vhost/tcm_vhost.c |  195 
 drivers/vhost/tcm_vhost.h |9 +-
 2 files changed, 111 insertions(+), 93 deletions(-)

diff --git a/drivers/vhost/tcm_vhost.c b/drivers/vhost/tcm_vhost.c
index 481af88..74b2eda 100644
--- a/drivers/vhost/tcm_vhost.c
+++ b/drivers/vhost/tcm_vhost.c
@@ -53,9 +53,14 @@
 #include "vhost.h"
 #include "tcm_vhost.h"
 
+enum {
+   VHOST_SCSI_VQ_CTL = 0,
+   VHOST_SCSI_VQ_EVT = 1,
+   VHOST_SCSI_VQ_IO = 2,
+};
+
 struct vhost_scsi {
-   atomic_t vhost_ref_cnt;
-   struct tcm_vhost_tpg *vs_tpg;
+   struct tcm_vhost_tpg *vs_tpg;   /* Protected by vhost_scsi->dev.mutex */
struct vhost_dev dev;
struct vhost_virtqueue vqs[3];
 
@@ -131,8 +136,7 @@ static u32 tcm_vhost_get_default_depth(struct 
se_portal_group *se_tpg)
return 1;
 }
 
-static u32 tcm_vhost_get_pr_transport_id(
-   struct se_portal_group *se_tpg,
+static u32 tcm_vhost_get_pr_transport_id(struct se_portal_group *se_tpg,
struct se_node_acl *se_nacl,
struct t10_pr_registration *pr_reg,
int *format_code,
@@ -162,8 +166,7 @@ static u32 tcm_vhost_get_pr_transport_id(
format_code, buf);
 }
 
-static u32 tcm_vhost_get_pr_transport_id_len(
-   struct se_portal_group *se_tpg,
+static u32 tcm_vhost_get_pr_transport_id_len(struct se_portal_group *se_tpg,
struct se_node_acl *se_nacl,
struct t10_pr_registration *pr_reg,
int *format_code)
@@ -192,8 +195,7 @@ static u32 tcm_vhost_get_pr_transport_id_len(
format_code);
 }
 
-static char *tcm_vhost_parse_pr_out_transport_id(
-   struct se_portal_group *se_tpg,
+static char *tcm_vhost_parse_pr_out_transport_id(struct se_portal_group 
*se_tpg,
const char *buf,
u32 *out_tid_len,
char **port_nexus_ptr)
@@ -236,8 +238,7 @@ static struct se_node_acl *tcm_vhost_alloc_fabric_acl(
return &nacl->se_node_acl;
 }
 
-static void tcm_vhost_release_fabric_acl(
-   struct se_portal_group *se_tpg,
+static void tcm_vhost_release_fabric_acl(struct se_portal_group *se_tpg,
struct se_node_acl *se_nacl)
 {
struct tcm_vhost_nacl *nacl = container_of(se_nacl,
@@ -297,7 +298,16 @@ static int tcm_vhost_get_cmd_state(struct se_cmd *se_cmd)
return 0;
 }
 
-static void vhost_scsi_complete_cmd(struct tcm_vhost_cmd *);
+static void vhost_scsi_complete_cmd(struct tcm_vhost_cmd *tv_cmd)
+{
+   struct vhost_scsi *vs = tv_cmd->tvc_vhost;
+
+   spin_lock_bh(&vs->vs_completion_lock);
+   list_add_tail(&tv_cmd->tvc_completion_list, &vs->vs_completion_list);
+   spin_unlock_bh(&vs->vs_completion_lock);
+
+   vhost_work_queue(&vs->dev, &vs->vs_completion_work);
+}
 
 static int tcm_vhost_queue_data_in(struct se_cmd *se_cmd)
 {
@@ -381,7 +391,7 @@ static void vhost_scsi_complete_cmd_work(struct vhost_work 
*work)
vs_completion_work);
struct tcm_vhost_cmd *tv_cmd;
 
-   while ((tv_cmd = vhost_scsi_get_cmd_from_completion(vs)) != NULL) {
+   while ((tv_cmd = vhost_scsi_get_cmd_from_completion(vs))) {
struct virtio_scsi_cmd_resp v_rsp;
struct se_cmd *se_cmd = &tv_cmd->tvc_se_cmd;
int ret;
@@ -408,19 +418,6 @@ static void vhost_scsi_complete_cmd_work(struct vhost_work 
*work)
vhost_signal(&vs->dev, &vs->vqs[2]);
 }
 
-static void vhost_sc

Re: [Qemu-devel] 9p broken?

2012-07-30 Thread Richard W.M. Jones
On Mon, Jul 30, 2012 at 03:35:39PM +0300, Avi Kivity wrote:
> Having an annoying bug on i386 kvm I decided to debug it buy running an
> i386 guest on my x86_64 host, use 9p to access a guest image, and run it
> using nested kvm.
> 
> However, 9p appears to be broken: first, the configure test fails (patch
> sent).  Second, while mount works, ls on the mount point causes qemu to
> crash with
> 
> (gdb) bt
> #0  error_set (errp=0x7fffe95fb128, fmt=0x558d4568 "{ 'class':
> 'VirtFSFeatureBlocksMigration', 'data': { 'path': %s, 'tag': %s } }") at
> /home/tlv/akivity/qemu/error.c:32
> #1  0x5567cb06 in v9fs_attach (opaque=0x7fffe95e3020) at
> /home/tlv/akivity/qemu/hw/9pfs/virtio-9p.c:988
> #2  0x5561d19f in coroutine_trampoline (i0=1449767888, i1=21845)
> at /home/tlv/akivity/qemu/coroutine-ucontext.c:138
> #3  0x75a93ef0 in ?? ()   from /lib64/libc.so.6
> #4  0x7fffce00 in ?? ()
> #5  0x in ?? (
> 
> **errp already points to a VirtFSFeatureBlocksMigration error;
> v9fs_attach() has been called a second time (the first time,
> understandably, on mount; the second on ls).

Yes, I can reproduce this too.

LIBGUESTFS_QEMU=~/d/qemu/qemu.wrapper \
guestfish -v -- \
  sparse /tmp/unused 100M : \
  config -device 'virtio-9p-pci,fsdev=root,mount_tag=root' : \
  config -fsdev 'local,id=root,path=/tmp,security_model=passthrough' : \
  run : \
  mount-9p root / : \
  ls /

Stack trace:

#0  0x7fb1d4d19ba5 in __GI_raise (sig=sig@entry=6)
at ../nptl/sysdeps/unix/sysv/linux/raise.c:63
#1  0x7fb1d4d1b358 in __GI_abort () at abort.c:90
#2  0x7fb1d4d12972 in __assert_fail_base (fmt=
0x7fb1d4e5c8e8 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", 
assertion=assertion@entry=0x7fb1d8e2e87e "*errp == ((void *)0)", 
file=file@entry=0x7fb1d8e56217 "error.c", line=line@entry=35, 
function=function@entry=
0x7fb1d8e2e8ca <__PRETTY_FUNCTION__.13983> "error_set") at assert.c:92
#3  0x7fb1d4d12a22 in __GI___assert_fail (assertion=assertion@entry=
0x7fb1d8e2e87e "*errp == ((void *)0)", file=file@entry=
0x7fb1d8e56217 "error.c", line=line@entry=35, function=function@entry=
0x7fb1d8e2e8ca <__PRETTY_FUNCTION__.13983> "error_set") at assert.c:101
#4  0x7fb1d8c1147f in error_set (errp=errp@entry=0x7fb1ce36a128, 
fmt=fmt@entry=
0x7fb1d8e39a78 "{ 'class': 'VirtFSFeatureBlocksMigration', 'data': { 
'path': %s, 'tag': %s } }") at error.c:35
#5  0x7fb1d8c57f9b in v9fs_attach (opaque=0x7fb1ce352020)
at /home/rjones/d/qemu/hw/9pfs/virtio-9p.c:988
#6  0x7fb1d8c0fcfa in coroutine_trampoline (i0=, 
i1=) at coroutine-ucontext.c:138
#7  0x7fb1d4d2a2f0 in ?? () from /lib64/libc.so.6
#8  0x7fff51061aa0 in ?? ()
#9  0xb5b5b5b5b5b5b5b5 in ?? ()
#10 0x in ?? ()

I'll add a regression test for 9p to libguestfs so at least we will
catch this in future during Fedora builds.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
virt-df lists disk usage of guests without needing to install any
software inside the virtual machine.  Supports Linux and Windows.
http://et.redhat.com/~rjones/virt-df/
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] KVM: PPC: booke/bookehv: Add guest debug support

2012-07-30 Thread Scott Wood
On 07/30/2012 02:37 AM, Bhushan Bharat-R65777 wrote:
> 
> 
>> -Original Message-
>> From: Wood Scott-B07421
>> Sent: Friday, July 27, 2012 7:00 AM
>> To: Bhushan Bharat-R65777
>> Cc: kvm-...@vger.kernel.org; kvm@vger.kernel.org; ag...@suse.de; Bhushan 
>> Bharat-
>> R65777
>> Subject: Re: [PATCH 2/2] KVM: PPC: booke/bookehv: Add guest debug support
>>
>> On 07/26/2012 12:32 AM, Bharat Bhushan wrote:
>>> This patch adds:
>>>  1) KVM debug handler added for e500v2.
>>>  2) Guest debug by qemu gdb stub.
>>
>> Does it make sense for these to both be in the same patch?  If there's common
>> code used by both, that could be added first.
> 
> ok
> 
>>
>>> Signed-off-by: Liu Yu 
>>> Signed-off-by: Varun Sethi 
>>> [bharat.bhus...@freescale.com: Substantial changes]
>>> Signed-off-by: Bharat Bhushan 
>>> ---
>>>  arch/powerpc/include/asm/kvm.h|   21 +
>>>  arch/powerpc/include/asm/kvm_host.h   |7 ++
>>>  arch/powerpc/include/asm/kvm_ppc.h|2 +
>>>  arch/powerpc/include/asm/reg_booke.h  |1 +
>>>  arch/powerpc/kernel/asm-offsets.c |   31 ++-
>>>  arch/powerpc/kvm/booke.c  |  146 +++---
>>>  arch/powerpc/kvm/booke_interrupts.S   |  160
>> -
>>>  arch/powerpc/kvm/bookehv_interrupts.S |  141 -
>>>  arch/powerpc/kvm/e500mc.c |3 +-
>>>  arch/powerpc/kvm/powerpc.c|2 +-
>>>  10 files changed, 492 insertions(+), 22 deletions(-)
>>>
>>> diff --git a/arch/powerpc/include/asm/kvm.h
>>> b/arch/powerpc/include/asm/kvm.h index 3c14202..da71c84 100644
>>> --- a/arch/powerpc/include/asm/kvm.h
>>> +++ b/arch/powerpc/include/asm/kvm.h
>>> @@ -25,6 +25,7 @@
>>>  /* Select powerpc specific features in  */  #define
>>> __KVM_HAVE_SPAPR_TCE  #define __KVM_HAVE_PPC_SMT
>>> +#define __KVM_HAVE_GUEST_DEBUG
>>>
>>>  struct kvm_regs {
>>> __u64 pc;
>>> @@ -265,10 +266,19 @@ struct kvm_fpu {  };
>>>
>>>  struct kvm_debug_exit_arch {
>>> +   __u32 exception;
>>> +   __u32 pc;
>>> +   __u32 status;
>>>  };
>>
>> PC must be 64-bit.  What goes in "status" and "exception"?
> 
> ok
> 
>>
>>>  /* for KVM_SET_GUEST_DEBUG */
>>>  struct kvm_guest_debug_arch {
>>> +   struct {
>>> +   __u64 addr;
>>> +   __u32 type;
>>> +   __u32 pad1;
>>> +   __u64 pad2;
>>> +   } bp[16];
>>>  };
>>
>> What goes in "type"?
> 
> Type denote breakpoint, read watchpoint, write watchpoint or watchpoint (both 
> read and write). Will adding a comment to describe this is ok?

Yes, please make sure all of this is well documented.

>>>  /* definition of registers in kvm_run */ @@ -285,6 +295,17 @@ struct
>>> kvm_sync_regs {
>>>  #define KVM_CPU_3S_64  4
>>>  #define KVM_CPU_E500MC 5
>>>
>>> +/* Debug related defines */
>>> +#define KVM_INST_GUESTGDB   0x7C00021C  /* ehpriv OC=0 */
>>
>> Will this work on all PPC?
>>
>> It certainly won't work on other architectures, so at a minimum it's
>> KVM_PPC_INST_GUEST_GDB, but maybe it needs to be determined at runtime.
> 
> How to determine at run time? adding another ioctl ?

Or extend an existing one.  Is there any other information about debug
capabilities that you expose -- number of hardware breakpoints
supported, etc?

>>> +#define KVM_GUESTDBG_USE_SW_BP  0x0001
>>> +#define KVM_GUESTDBG_USE_HW_BP  0x0002
>>
>> Where do these get used?  Any reason for these particular values?  If you're
>> trying to create a partition where the upper half is generic and the lower 
>> half
>> is arch-specific, say so.
> 
> KVM_SET_GUEST_DEBUG ioctl used to set/unset debug interrupts, which
> have a "u32 control" element. We have inherited this mechanism from
> x86 implementation and it looks like lower 16 bits are generic (like
> KVM_GUESTDBG_ENBLE, KVM_GUESTDBG_SINGLESTEP etc and upper 16 bits are
> Architecture specific.
> 
> I will add a comment to describe this.

I don't think the sw/hw distinction belongs here -- it should be per
breakpoint.

>>> +   run->exit_reason = KVM_EXIT_DEBUG;
>>> +   run->debug.arch.pc = vcpu->arch.pc;
>>> +   run->debug.arch.exception = exit_nr;
>>> +   run->debug.arch.status = 0;
>>> +   kvmppc_account_exit(vcpu, DEBUG_EXITS);
>>> +   return RESUME_HOST;
>>
>> The interface isn't (clearly labelled as) booke specific, but you return 
>> booke-
>> specific exception numbers.  How's userspace supposed to know what to do with
>> them?  What do you plan on doing with them in QEMU?
> 
> This is booke specific.

Then put booke in the name, but what about it really needs to be booke
specific?  Why does QEMU care about the exception type?

>>> +#ifndef CONFIG_PPC_FSL_BOOK3E
>>> +   PPC_LD(r7, VCPU_HOST_DBG+KVMPPC_DBG_IAC3, r4)
>>> +   PPC_LD(r8, VCPU_HOST_DBG+KVMPPC_DBG_IAC4, r4)
>>> +   mtspr   SPRN_IAC3, r7
>>> +   mtspr   SPRN_IAC4, r8
>>> +#endif
>>
>> Can you handle this at runtime with a feature section?
>

Qcow2 preallocation + backing file

2012-07-30 Thread Mark Moseley
I've been searching around trying to figure out if there's any benefit
to having a qcow2 backing file created with preallocation=metadata
set. That is, if there's any benefit to using a qcow2 file that is
based on a qcow2 backing file with preallocation=metadata turned on. I
know you can't turn on preallocation=metadata for the derived qcow2
file, but nothing seems to stop you from deriving an image from a base
image where the base image has preallocation=metadata turned on. Sorry
for the tongue twisters.

Looking around the net, it's not super clear (or at least I'm not
interpreting what I'm finding correctly) but it sounds like it
shouldn't have any benefit, i.e. that you lose the benefit of the base
image's preallocation=metadata on derived images. But goofing around
with fio randrw, running tests on derived images from backing files
with preallocation=metadata vs no preallocation=metadata, it seems to
have a huge benefit, at least much much higher r/w bw with the
preallocation=metadata-derived image.

While I might've answered my own question, I'm no fio expert. I don't
necessarily trust that I'm testing it correctly, which is why I'm
asking here. Running the same test multiple times gives me
30-40meg/sec for both read and write on the
preallocation=metadata-derived image, vs 0.5-1meg/sec for
non-preallocation=metadata-derived images, both which sound way too
high/low to be right. Same box (ubuntu 12.04 64-bit, running the stock
ubuntu qemu 1.0), same disk, even the same directory. I could also see
some bizarre side effect happening that I'm not aware of, ready to
bite me later on down the road, like, say, that it's silently actually
reading/writing the original image (just a made-up example).

I also figured that it'd be good to ask for future googlers.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


How KVM handles nested guest's ioctl?

2012-07-30 Thread siddhesh phadke
I tried to go through KVM code to check how KVM handles nested guest's
ioctl calls. Does L0 handles nested guest's ioctl calls or it passes
it to L1 guest to handle it?
Can anyone please help me with this?
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] kvm: Check if smp_cpus exceeds max cpus supported by kvm

2012-07-30 Thread riegamaths
From: Dunrong Huang 

Add a helper function for fetching max cpus supported by kvm.

Make QEMU exit with an error message if smp_cpus exceeds limit
of VCPU count retrieved by invoking this helper function.

Signed-off-by: Dunrong Huang 
---
 kvm-all.c |   25 +
 1 files changed, 25 insertions(+), 0 deletions(-)

diff --git a/kvm-all.c b/kvm-all.c
index 2148b20..8cb4b92 100644
--- a/kvm-all.c
+++ b/kvm-all.c
@@ -1207,6 +1207,23 @@ static int kvm_irqchip_create(KVMState *s)
 return 0;
 }
 
+static int kvm_max_vcpus(KVMState *s)
+{
+int max_vcpus = 4;
+int ret;
+ret = kvm_check_extension(s, KVM_CAP_MAX_VCPUS);
+if (ret) {
+max_vcpus = ret;
+} else {
+   ret = kvm_check_extension(s, KVM_CAP_NR_VCPUS);
+if (ret) {
+max_vcpus = ret;
+}
+   }
+
+return max_vcpus;
+}
+
 int kvm_init(void)
 {
 static const char upgrade_note[] =
@@ -1216,6 +1233,7 @@ int kvm_init(void)
 const KVMCapabilityInfo *missing_cap;
 int ret;
 int i;
+int max_vcpus;
 
 s = g_malloc0(sizeof(KVMState));
 
@@ -1256,6 +1274,13 @@ int kvm_init(void)
 goto err;
 }
 
+max_vcpus = kvm_max_vcpus(s);
+if (smp_cpus > max_vcpus) {
+fprintf(stderr, "Number of SMP cpus requested (%d) exceeds max cpus "
+"supported by KVM (%d)\n", smp_cpus, max_vcpus);
+exit(1);
+}
+
 s->vmfd = kvm_ioctl(s, KVM_CREATE_VM, 0);
 if (s->vmfd < 0) {
 #ifdef TARGET_S390X
-- 
1.7.8.6

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [regression] virtio net locks up

2012-07-30 Thread Bernd Schubert

On 07/30/2012 07:33 PM, Bernd Schubert wrote:

Hello Stefan,

Stefan Hajnoczi  gmail.com> writes:


On Wed, Jan 11, 2012 at 4:18 PM, Bernd Schubert
 itwm.fraunhofer.de> wrote:

On 01/11/2012 05:04 PM, Stefan Hajnoczi wrote:

Try pinging the host's IP address from inside the guest.  Run tcpdump
on the guest's tap interface from the host and observe whether or not
you see any packets being sent from the guest.





sorry for my terribly late reply. As usual I got distracted by too many other
things and then returned the hardware I was running the VMs on. My new desktop
system is better suitable to run kvm and I can easily reproduce it now with 3.5
on host and guest side. So its not fixed in recent versions yet.




Seems arp requests are still going out, but then don't go in:

17:16:21.202547 ARP, Reply 192.168.123.1 is-at 00:25:90:38:09:cd (oui
Unknown), length 28
17:16:21.538724 ARP, Request who-has squeeze1 tell squeeze3, length 28
17:16:21.539026 ARP, Reply squeeze1 is-at 52:54:00:12:34:11 (oui Unknown),
length 28
17:16:22.200912 ARP, Request who-has 192.168.123.1 tell squeeze3, length 28


Okay, so it seems networking from the tap device and beyond is fine.


rmmod virtio_net inside the guest and then modprobe virtio_net again.
See if network connectivity is restored (remember to rerun DHCP or
whatever, if necessary).



Yep, that makes it work again. But probably is not the real solution ;)


It's just another piece of information which helps debug this :).  At
least nothing has wedged itself into an unrecoverable state.

When you said the problem happens without vhost, did you explicitly
run vhost=off?  Or did you just omit "vhost=on"?


It was definitely off and I can confirm that it also locks up with vhost=on and
vhost=off with 3.5.



This sounds like a guest kernel/driver issue.  I recommend testing
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git in
the guest to see if this has already been fixed.

If you have the -dbg RPMs installed it may be possible to insert a
probe into the virtio_net kernel module and observe receive
interrupts.  This does require the right kernel CONFIG_ but you might
already have it enabled:

$ sudo perf probe --add skb_recv_done
$ sudo perf record -e probe:skb_recv_done -a
...send some packets to the guest...
^C
$ sudo perf script

If you see no skb_recv_done events then the guest driver is not
receiving a notification when packets are received.

You can find more about how to use perf-probe(1) at
http://blog.vmsplice.net/2011/03/how-to-use-perf-probe.html.


Ah nice, I would have used systemtap, but always wanted to check how to do it
with perf :)

So once the virtio NIC has locked up, I don't get any events from it anymore -
until I remove/re-insert the virtio module (including ifup/ifdown). I will try
to find some time later on this week to look into it again.
Any further ideas how to proceed (I haven't even checked yet how virtio works at
all...).



I took a quick glance where skb_recv_done is registered at all and 
traced it back to vp_find_vqs(). Looking into that function I noticed 
MSI and so tried to boot with pci=nomsi. And indeed I guessed it right, 
with pci=nomsi I don't get any lockups anymore.

Am I the only one booting kvm-qemu usually with enabled MSI?

Cheers,
Bernd
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [regression] virtio net locks up

2012-07-30 Thread Bernd Schubert
Hello Stefan,

Stefan Hajnoczi  gmail.com> writes:
> 
> On Wed, Jan 11, 2012 at 4:18 PM, Bernd Schubert
>  itwm.fraunhofer.de> wrote:
> > On 01/11/2012 05:04 PM, Stefan Hajnoczi wrote:
> >> Try pinging the host's IP address from inside the guest.  Run tcpdump
> >> on the guest's tap interface from the host and observe whether or not
> >> you see any packets being sent from the guest.
> >


sorry for my terribly late reply. As usual I got distracted by too many other
things and then returned the hardware I was running the VMs on. My new desktop
system is better suitable to run kvm and I can easily reproduce it now with 3.5
on host and guest side. So its not fixed in recent versions yet.


> >
> > Seems arp requests are still going out, but then don't go in:
> >
> > 17:16:21.202547 ARP, Reply 192.168.123.1 is-at 00:25:90:38:09:cd (oui
> > Unknown), length 28
> > 17:16:21.538724 ARP, Request who-has squeeze1 tell squeeze3, length 28
> > 17:16:21.539026 ARP, Reply squeeze1 is-at 52:54:00:12:34:11 (oui Unknown),
> > length 28
> > 17:16:22.200912 ARP, Request who-has 192.168.123.1 tell squeeze3, length 28
> 
> Okay, so it seems networking from the tap device and beyond is fine.
> 
> >> rmmod virtio_net inside the guest and then modprobe virtio_net again.
> >> See if network connectivity is restored (remember to rerun DHCP or
> >> whatever, if necessary).
> >
> >
> > Yep, that makes it work again. But probably is not the real solution ;)
> 
> It's just another piece of information which helps debug this :).  At
> least nothing has wedged itself into an unrecoverable state.
> 
> When you said the problem happens without vhost, did you explicitly
> run vhost=off?  Or did you just omit "vhost=on"?

It was definitely off and I can confirm that it also locks up with vhost=on and
vhost=off with 3.5.

> 
> This sounds like a guest kernel/driver issue.  I recommend testing
> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git in
> the guest to see if this has already been fixed.
> 
> If you have the -dbg RPMs installed it may be possible to insert a
> probe into the virtio_net kernel module and observe receive
> interrupts.  This does require the right kernel CONFIG_ but you might
> already have it enabled:
> 
> $ sudo perf probe --add skb_recv_done
> $ sudo perf record -e probe:skb_recv_done -a
> ...send some packets to the guest...
> ^C
> $ sudo perf script
> 
> If you see no skb_recv_done events then the guest driver is not
> receiving a notification when packets are received.
> 
> You can find more about how to use perf-probe(1) at
> http://blog.vmsplice.net/2011/03/how-to-use-perf-probe.html.

Ah nice, I would have used systemtap, but always wanted to check how to do it
with perf :) 

So once the virtio NIC has locked up, I don't get any events from it anymore -
until I remove/re-insert the virtio module (including ifup/ifdown). I will try
to find some time later on this week to look into it again.
Any further ideas how to proceed (I haven't even checked yet how virtio works at
all...).

Thanks,
Bernd

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


KVM: MMU: Tracking guest writes through EPT entries ?

2012-07-30 Thread Sunil
Hello List,

I am a KVM newbie and studying KVM mmu code.

On the existing guest, I am trying to track all guest writes by
marking page table entry as read-only in EPT entry [ I am using Intel
machine with vmx and ept support ]. Looks like EPT support re-uses
shadow page table(SPT) code and hence some of SPT routines.

I was thinking of below possible approach. Use pte_list_walk() to
traverse through list of sptes and use mmu_spte_update()  to flip the
PT_WRITABLE_MASK flag. But all SPTEs are not part of any single list;
but on separate lists (based on gfn, page level, memory_slot). So,
recording all the faulted guest GFN and then using above method work ?

Can you please help if there is more direct and cleaner way to do it ?
 (or if I am totally off the track)

Thanks !

--
Sunil
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: qemu-kvm-1.1.0 crashing with kernel 3.5.0-rc6

2012-07-30 Thread Avi Kivity
On 07/30/2012 05:07 PM, Chris Clayton wrote:
>>
 With kernel 3.5.0 with b2da15ac26a0c00 reverted, I have just had 15
 clean invocations of vanilla qemu-kvm-1.1.1. So that commit would seem
 to be the problem.
>>>
>>> Just to be sure, I've run some more tests today. No crashes occurred in
>>> 20 runs of vanilla qemu-kvm-1.1.1 on kernel 3.5.0 with b2da15ac26a0c00
>>> reverted.
>>
>> Ok.  I'm trying to reproduce it here on a nested-virt setup, since the
>> code looks correct.
>>
>> What's your preemption settings?
>>
>>
> [chris:~/kernel/linux-3.5.0]$ grep PREEMPT .config
> CONFIG_TREE_PREEMPT_RCU=y
> CONFIG_PREEMPT_RCU=y
> CONFIG_PREEMPT_NOTIFIERS=y
> # CONFIG_PREEMPT_NONE is not set
> # CONFIG_PREEMPT_VOLUNTARY is not set
> CONFIG_PREEMPT=y
> CONFIG_PREEMPT_COUNT=y

Here's what I think that is happening

  vcpu_load
  ...
  vmx_save_host_state
  vmx_vcpu_run
  (ds.cpl, es.cpl cleared by hardware)

  interrupt
push ds, es  # pushes bad ds, es
schedule
  vmx_vcpu_put
vmx_load_host_state
  reload ds, es
pop ds, es  # of other thread's stack
iret
  # other thread runs
  interrupt
schedule  # back in vcpu thread
interrupt return: pop ds, es  # <-- problem
iret

   ...
   vcpu_put

   # bad ds, es, but !vmx->host_state.loaded

Marcelo, did I miss something here?

Unfortunately, my reproducer has ceased to reproduce.  But the fix is
easy if the analysis above is right.

-- 
error compiling committee.c: too many arguments to function


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v7 2/2] kvm: KVM_EOIFD, an eventfd for EOIs

2012-07-30 Thread Alex Williamson
On Sun, 2012-07-29 at 17:54 +0300, Michael S. Tsirkin wrote:
> On Tue, Jul 24, 2012 at 02:43:22PM -0600, Alex Williamson wrote:
> > This new ioctl enables an eventfd to be triggered when an EOI is
> > written for a specified irqchip pin.  The first user of this will
> > be external device assignment through VFIO, using a level irqfd
> > for asserting a PCI INTx interrupt and this interface for de-assert
> > and notification once the interrupt is serviced.
> > 
> > Here we make use of the reference counting of the _irq_source
> > object allowing us to share it with an irqfd and cleanup regardless
> > of the release order.
> > 
> > Signed-off-by: Alex Williamson 
> 
> > ---
> > 
> >  Documentation/virtual/kvm/api.txt |   21 ++
> >  arch/x86/kvm/x86.c|2 
> >  include/linux/kvm.h   |   15 ++
> >  include/linux/kvm_host.h  |   13 +
> >  virt/kvm/eventfd.c|  336 
> > +
> >  virt/kvm/kvm_main.c   |   11 +
> >  6 files changed, 398 insertions(+)
> > 
> > diff --git a/Documentation/virtual/kvm/api.txt 
> > b/Documentation/virtual/kvm/api.txt
> > index 3911e62..8cd6b36 100644
> > --- a/Documentation/virtual/kvm/api.txt
> > +++ b/Documentation/virtual/kvm/api.txt
> > @@ -1989,6 +1989,27 @@ return the hash table order in the parameter.  (If 
> > the guest is using
> >  the virtualized real-mode area (VRMA) facility, the kernel will
> >  re-create the VMRA HPTEs on the next KVM_RUN of any vcpu.)
> >  
> > +4.77 KVM_EOIFD
> > +
> > +Capability: KVM_CAP_EOIFD
> > +Architectures: x86
> > +Type: vm ioctl
> > +Parameters: struct kvm_eoifd (in)
> > +Returns: 0 on success, < 0 on error
> > +
> > +KVM_EOIFD allows userspace to receive interrupt EOI notification
> > +through an eventfd.
> 
> I thought about it some more, and I think it should be renamed to an
> interrupt ack notification than eoi notification.
> For example, consider userspace that uses threaded interrupts.
> Currently what will happen is each interrupt will be injected
> twice, since on eoi device is still asserting it.

I don't follow, why is userspace writing an eoi to the ioapic if it
hasn't handled the interrupt and why wouldn't the same happen on bare
metal?

> One fix would be to delay event until interrupt is re-enabled.
> Now I am not asking you to fix this immediately,
> but I think we should make the interface generic by
> saying we report an ack to userspace and not specifically EOI.

Using the word "delay" in the context of interrupt delivery raises all
sorts of red flags for me, but I really don't understand your argument.

> >  kvm_eoifd.fd specifies the eventfd used for
> > +notification.  KVM_EOIFD_FLAG_DEASSIGN is used to de-assign an eoifd
> > +once assigned.  KVM_EOIFD also requires additional bits set in
> > +kvm_eoifd.flags to bind to the proper interrupt line.  The
> > +KVM_EOIFD_FLAG_LEVEL_IRQFD indicates that kvm_eoifd.key is provided
> > +and is a key from a level triggered interrupt (configured from
> > +KVM_IRQFD using KVM_IRQFD_FLAG_LEVEL).  The EOI notification is bound
> > +to the same GSI and irqchip input as the irqfd.  Both kvm_eoifd.key
> > +and KVM_EOIFD_FLAG_LEVEL_IRQFD must be specified on assignment and
> > +de-assignment of KVM_EOIFD.  A level irqfd may only be bound to a
> > +single eoifd.  KVM_CAP_EOIFD_LEVEL_IRQFD indicates support of
> > +KVM_EOIFD_FLAG_LEVEL_IRQFD.
> >  
> 
> Hmm returning the key means we'll need to keep refcounting for source
> IDs around forever. I liked passing the fd better: make implementation
> match interface and not the other way around.

False, a source ID has a finite lifecycle.  The fd approach was broken.
Holding the irqfd context imposed too many dependencies between eoifd
and irqfd necessitating things like one interface disabling another.  I
thoroughly disagree with that approach.

> >  5. The kvm_run structure
> >  
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 9ded39d..8f3164e 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -2171,6 +2171,8 @@ int kvm_dev_ioctl_check_extension(long ext)
> > case KVM_CAP_PCI_2_3:
> > case KVM_CAP_KVMCLOCK_CTRL:
> > case KVM_CAP_IRQFD_LEVEL:
> > +   case KVM_CAP_EOIFD:
> > +   case KVM_CAP_EOIFD_LEVEL_IRQFD:
> > r = 1;
> > break;
> > case KVM_CAP_COALESCED_MMIO:
> > diff --git a/include/linux/kvm.h b/include/linux/kvm.h
> > index b2e6e4f..effb916 100644
> > --- a/include/linux/kvm.h
> > +++ b/include/linux/kvm.h
> > @@ -619,6 +619,8 @@ struct kvm_ppc_smmu_info {
> >  #define KVM_CAP_S390_COW 79
> >  #define KVM_CAP_PPC_ALLOC_HTAB 80
> >  #define KVM_CAP_IRQFD_LEVEL 81
> > +#define KVM_CAP_EOIFD 82
> > +#define KVM_CAP_EOIFD_LEVEL_IRQFD 83
> >  
> >  #ifdef KVM_CAP_IRQ_ROUTING
> >  
> > @@ -694,6 +696,17 @@ struct kvm_irqfd {
> > __u8  pad[20];
> >  };
> >  
> > +#define KVM_EOIFD_FLAG_DEASSIGN (1 << 0)
> > +/* Available with KVM_CAP_EOIFD_

Re: [PATCH v7 1/2] kvm: Extend irqfd to support level interrupts

2012-07-30 Thread Alex Williamson
On Sun, 2012-07-29 at 18:01 +0300, Michael S. Tsirkin wrote:
> On Tue, Jul 24, 2012 at 02:43:14PM -0600, Alex Williamson wrote:
> > In order to inject a level interrupt from an external source using an
> > irqfd, we need to allocate a new irq_source_id.  This allows us to
> > assert and (later) de-assert an interrupt line independently from
> > users of KVM_IRQ_LINE and avoid lost interrupts.
> > 
> > We also add what may appear like a bit of excessive infrastructure
> > around an object for storing this irq_source_id.  However, notice
> > that we only provide a way to assert the interrupt here.  A follow-on
> > interface will make use of the same irq_source_id to allow de-assert.
> > 
> > Signed-off-by: Alex Williamson 
> 
> I think this tracking of source ids is the root of all the problems
> you see with this patchset.
> 
> A source ID is required for an irqfd to be created.
> But if source ID exists after irqfd is destroyed then
> the next create will fail.

Only if there are no available source IDs.

> So the only sane thing to do is to make irqfd manage this resource,
> clean it up completely when irqfd is gone.
> 
> Not to mention, the patch will be smaller :)

The only sane way to do that is to pull the eoifd into KVM_IRQFD and set
them up together.  That's actually what v1 of this endeavor did.  My
intention with splitting eoifd from irqfd is that I think EOI
notification is potentially useful outside of this usage with irqfds and
I wanted an interface that could be used independently.  Someday, an
irqfd may not be the only way to generate a key.  Userspace may also
wish to register to receive notification-only for the existing user
source ID.

I do not think it's sane to have an eoifd configured using KVM_EOIFD and
destroyed using KVM_IRQFD.  As for smaller patch, I'm not convinced.  We
still have to watch for POLLHUP, which pulls in the bulk of the code.
And using the above approach of pulling eoifd setup into irqfd we have
to address what happens to the combined set when either eventfd is
closed.  By your argument closing the irqfd closes the eoifd, but does
closing the eoifd necessarily close the irqfd.  If not then we end up
with the question of how can an eoifd be added to an existing irqfd.


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] KVM call minutes July 24th

2012-07-30 Thread Markus Armbruster
Juan Quintela  writes:

> Hi
>
> Minutes of Today call.
>
> ahci: agraf
> 
> - how to enable it?
>   * today is too difficult (alex)
>   * get it to work as everything else (markus)
>   And big discussion ensued. Markus & alex will reply with details.
>   No agreement, and too subtle for me to resume perfectly.
>   * cdrom still not working
> - dates for freeze? q35?
>   depends on what people does/review?
>
> Later, Juan.

Sorry for the delay.  Hope I still remember enough write a worthwhile
summary.  Alex, please correct me if I misrepresent you.


Background
--

* -drive if=ide asks the board to connect this drive to its "IDE
  controller".  It's silently ignored if the board doesn't have one.
  Even if it has one, it may silently ignore some drives.

  Examples:

  - PIIX3 boards such as current "pc" connect all if=ide drives to
controller "piix3-ide".  Bus numbers > 2 fail machine creation.

  - I expect the forthcoming "q35" board to connect all if=ide drives to
controller "ich9-ahci".  We hope to upgrade "pc" to this board.

  - Board "r2d" connects the if=ide,index=0 drive to a non-qdevified
controller hw/ide/mmio.c.  Any others get ignored.

  - Board "s390-virtio" connects some if=ide drives[*] to
virtio-blk-s390.  Yes, that's not really an IDE controller.

  - Board "highbank" appears to have an ich9-ahci controller, but it
doesn't seem to connect any if=ide drives to it.

* Likewise, -drive if=scsi asks the board to connect this drive to a
  SCSI controller.

  Examples:

  - PC boards create N+1 "lsi53c895a" controllers, where N is the
largest bus number in any -drive if=scsi.  The controller
auto-connects all if=scsi,bus=I drives for its own bus number I.
Ugly as hell.

  - Board "pseries" works the same, except it creates "spapr-vscsi"
controllers.

  - The various Sun4[cdm] boards create a single "esp" controller, for
bus=0.  -drive if=scsi with any other bus fails machine
initialization.

* The meaning of -hda, drive without if=, and so forth depends on the
  board, it's currently either like if=scsi or like if=ide.

* You can use -device to add additional controllers and drives.  For
  instance, if the board provides a PCI bus, then

  -device ich9-ahci,id=ahci0

  creates an ICH9 AHCI controller, and

  -drive if=none,id=drive0,...
  -device ide-hd,bus=ahci0.0,drive=drive0

  connects an IDE disk to it.

  
Alex's proposal
---

Alex wants more users for AHCI.  Using it with a board that doesn't
connect if=ide drives to an AHCI controller involves -device.  Alex
thinks that's too hard for users, and prevents adoption.

His proposed solution is to create a new interface type "ahci", so that
-drive if=ahci "just works".  It creates ich9-ahci controllers
automatically.


Critique


* Does the use case "AHCI with a non-AHCI board" justify a new interface
  type?  Would anybody care if we had q35 today?  Would we regret adding
  if=ahci once we have q35?

* If the board already has an ich9-ahci, you can't use if=ahci in its
  current form to add drives to it.  if=ahci creates *additional*
  ich9-ahci controllers.

* Do we want a separate interface type for each (sufficiently popular)
  controller?  What about "megasas" when the board provides only
  "lsi53c895a"?  What shall we do when we acquire an ich10 device model?
  Where does that stop?

* Should -drive ever auto-create controllers?  if=scsi does, but it's
  ugly.

* A possible alternative: machine option to replace the controller to
  use for if=ide.  Also possible for if=scsi.

* Probably more I can't remember anymore.


[*] "Some drives": looks like the code attempts to connect up to ten,
but I'm not at all sure more than two work.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCHv5 1/4] Provide userspace IO exit completion callback.

2012-07-30 Thread Gleb Natapov
Current code assumes that IO exit was due to instruction emulation
and handles execution back to emulator directly. This patch adds new
userspace IO exit completion callback that can be set by any other code
that caused IO exit to userspace.

Signed-off-by: Gleb Natapov 
---
 arch/x86/include/asm/kvm_host.h |1 +
 arch/x86/kvm/x86.c  |   92 +++
 2 files changed, 56 insertions(+), 37 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 48e7131..37f4f11 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -414,6 +414,7 @@ struct kvm_vcpu_arch {
struct x86_emulate_ctxt emulate_ctxt;
bool emulate_regs_need_sync_to_vcpu;
bool emulate_regs_need_sync_from_vcpu;
+   int (*complete_userspace_io)(struct kvm_vcpu *vcpu);
 
gpa_t time;
struct pvclock_vcpu_time_info hv_clock;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index b6379e5..41c0c3d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4539,6 +4539,9 @@ static bool retry_instruction(struct x86_emulate_ctxt 
*ctxt,
return true;
 }
 
+static int complete_emulated_mmio(struct kvm_vcpu *vcpu);
+static int complete_emulated_pio(struct kvm_vcpu *vcpu);
+
 int x86_emulate_instruction(struct kvm_vcpu *vcpu,
unsigned long cr2,
int emulation_type,
@@ -4609,13 +4612,16 @@ restart:
} else if (vcpu->arch.pio.count) {
if (!vcpu->arch.pio.in)
vcpu->arch.pio.count = 0;
-   else
+   else {
writeback = false;
+   vcpu->arch.complete_userspace_io = 
complete_emulated_pio;
+   }
r = EMULATE_DO_MMIO;
} else if (vcpu->mmio_needed) {
if (!vcpu->mmio_is_write)
writeback = false;
r = EMULATE_DO_MMIO;
+   vcpu->arch.complete_userspace_io = complete_emulated_mmio;
} else if (r == EMULATION_RESTART)
goto restart;
else
@@ -5471,6 +5477,24 @@ static int __vcpu_run(struct kvm_vcpu *vcpu)
return r;
 }
 
+static inline int complete_emulated_io(struct kvm_vcpu *vcpu)
+{
+   int r;
+   vcpu->srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
+   r = emulate_instruction(vcpu, EMULTYPE_NO_DECODE);
+   srcu_read_unlock(&vcpu->kvm->srcu, vcpu->srcu_idx);
+   if (r != EMULATE_DONE)
+   return 0;
+   return 1;
+}
+
+static int complete_emulated_pio(struct kvm_vcpu *vcpu)
+{
+   BUG_ON(!vcpu->arch.pio.count);
+
+   return complete_emulated_io(vcpu);
+}
+
 /*
  * Implements the following, as a state machine:
  *
@@ -5487,47 +5511,37 @@ static int __vcpu_run(struct kvm_vcpu *vcpu)
  *  copy data
  *  exit
  */
-static int complete_mmio(struct kvm_vcpu *vcpu)
+static int complete_emulated_mmio(struct kvm_vcpu *vcpu)
 {
struct kvm_run *run = vcpu->run;
struct kvm_mmio_fragment *frag;
-   int r;
 
-   if (!(vcpu->arch.pio.count || vcpu->mmio_needed))
-   return 1;
+   BUG_ON(!vcpu->mmio_needed);
 
-   if (vcpu->mmio_needed) {
-   /* Complete previous fragment */
-   frag = &vcpu->mmio_fragments[vcpu->mmio_cur_fragment++];
-   if (!vcpu->mmio_is_write)
-   memcpy(frag->data, run->mmio.data, frag->len);
-   if (vcpu->mmio_cur_fragment == vcpu->mmio_nr_fragments) {
-   vcpu->mmio_needed = 0;
-   if (vcpu->mmio_is_write)
-   return 1;
-   vcpu->mmio_read_completed = 1;
-   goto done;
-   }
-   /* Initiate next fragment */
-   ++frag;
-   run->exit_reason = KVM_EXIT_MMIO;
-   run->mmio.phys_addr = frag->gpa;
+   /* Complete previous fragment */
+   frag = &vcpu->mmio_fragments[vcpu->mmio_cur_fragment++];
+   if (!vcpu->mmio_is_write)
+   memcpy(frag->data, run->mmio.data, frag->len);
+   if (vcpu->mmio_cur_fragment == vcpu->mmio_nr_fragments) {
+   vcpu->mmio_needed = 0;
if (vcpu->mmio_is_write)
-   memcpy(run->mmio.data, frag->data, frag->len);
-   run->mmio.len = frag->len;
-   run->mmio.is_write = vcpu->mmio_is_write;
-   return 0;
-
-   }
-done:
-   vcpu->srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
-   r = emulate_instruction(vcpu, EMULTYPE_NO_DECODE);
-   srcu_read_unlock(&vcpu->kvm->srcu, vcpu->srcu_idx);
-   if (r != EMULATE_DONE)
-   return 0;
-   return 1;
+   return 1;
+   vcpu->mmio_read_completed = 1;
+   return complete_emulated_io(vcpu);
+   }
+   /* Initiate next fragment */

[PATCHv5 4/4] KVM: emulator: optimize "rep ins" handling.

2012-07-30 Thread Gleb Natapov
Optimize "rep ins" by allowing emulator to write back more than one
datum at a time. Introduce new operand type OP_MEM_STR which tells
writeback() that dst contains pointer to an array that should be written
back as opposite to just one data element.

Signed-off-by: Gleb Natapov 
---
 arch/x86/include/asm/kvm_emulate.h |4 +++-
 arch/x86/kvm/emulate.c |   33 -
 2 files changed, 31 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/kvm_emulate.h 
b/arch/x86/include/asm/kvm_emulate.h
index 8d0fe8f..d1777f8 100644
--- a/arch/x86/include/asm/kvm_emulate.h
+++ b/arch/x86/include/asm/kvm_emulate.h
@@ -200,8 +200,9 @@ typedef u32 __attribute__((vector_size(16))) sse128_t;
 
 /* Type, address-of, and value of an instruction's operand. */
 struct operand {
-   enum { OP_REG, OP_MEM, OP_IMM, OP_XMM, OP_MM, OP_NONE } type;
+   enum { OP_REG, OP_MEM, OP_MEM_STR, OP_IMM, OP_XMM, OP_MM, OP_NONE } 
type;
unsigned int bytes;
+   unsigned int count;
union {
unsigned long orig_val;
u64 orig_val64;
@@ -221,6 +222,7 @@ struct operand {
char valptr[sizeof(unsigned long) + 2];
sse128_t vec_val;
u64 mm_val;
+   void *data;
};
 };
 
diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index c22762d..c74bce8 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -1251,8 +1251,15 @@ static int pio_in_emulated(struct x86_emulate_ctxt *ctxt,
rc->end = n * size;
}
 
-   memcpy(dest, rc->data + rc->pos, size);
-   rc->pos += size;
+   if (ctxt->rep_prefix && !(ctxt->eflags & EFLG_DF)) {
+   ctxt->dst.data = rc->data + rc->pos;
+   ctxt->dst.type = OP_MEM_STR;
+   ctxt->dst.count = (rc->end - rc->pos) / size;
+   rc->pos = rc->end;
+   } else {
+   memcpy(dest, rc->data + rc->pos, size);
+   rc->pos += size;
+   }
return 1;
 }
 
@@ -1500,6 +1507,14 @@ static int writeback(struct x86_emulate_ctxt *ctxt)
if (rc != X86EMUL_CONTINUE)
return rc;
break;
+   case OP_MEM_STR:
+   rc = segmented_write(ctxt,
+   ctxt->dst.addr.mem,
+   ctxt->dst.data,
+   ctxt->dst.bytes * ctxt->dst.count);
+   if (rc != X86EMUL_CONTINUE)
+   return rc;
+   break;
case OP_XMM:
write_sse_reg(ctxt, &ctxt->dst.vec_val, ctxt->dst.addr.xmm);
break;
@@ -2732,7 +2747,7 @@ int emulator_task_switch(struct x86_emulate_ctxt *ctxt,
 static void string_addr_inc(struct x86_emulate_ctxt *ctxt, int reg,
struct operand *op)
 {
-   int df = (ctxt->eflags & EFLG_DF) ? -1 : 1;
+   int df = (ctxt->eflags & EFLG_DF) ? -op->count : op->count;
 
register_address_increment(ctxt, &ctxt->regs[reg], df * op->bytes);
op->addr.mem.ea = register_address(ctxt, ctxt->regs[reg]);
@@ -3672,7 +3687,7 @@ static struct opcode opcode_table[256] = {
I(DstReg | SrcMem | ModRM | Src2Imm, em_imul_3op),
I(SrcImmByte | Mov | Stack, em_push),
I(DstReg | SrcMem | ModRM | Src2ImmByte, em_imul_3op),
-   I2bvIP(DstDI | SrcDX | Mov | String, em_in, ins, check_perm_in), /* 
insb, insw/insd */
+   I2bvIP(DstDI | SrcDX | Mov | String | Unaligned, em_in, ins, 
check_perm_in), /* insb, insw/insd */
I2bvIP(SrcSI | DstDX | String, em_out, outs, check_perm_out), /* outsb, 
outsw/outsd */
/* 0x70 - 0x7F */
X16(D(SrcImmByte)),
@@ -3930,6 +3945,7 @@ static int decode_operand(struct x86_emulate_ctxt *ctxt, 
struct operand *op,
register_address(ctxt, ctxt->regs[VCPU_REGS_RDI]);
op->addr.mem.seg = VCPU_SREG_ES;
op->val = 0;
+   op->count = 1;
break;
case OpDX:
op->type = OP_REG;
@@ -3973,6 +3989,7 @@ static int decode_operand(struct x86_emulate_ctxt *ctxt, 
struct operand *op,
register_address(ctxt, ctxt->regs[VCPU_REGS_RSI]);
op->addr.mem.seg = seg_override(ctxt);
op->val = 0;
+   op->count = 1;
break;
case OpImmFAddr:
op->type = OP_IMM;
@@ -4513,8 +4530,14 @@ writeback:
string_addr_inc(ctxt, VCPU_REGS_RDI, &ctxt->dst);
 
if (ctxt->rep_prefix && (ctxt->d & String)) {
+   unsigned int count;
struct read_cache *r = &ctxt->io_read;
-   register_address_increment(ctxt, &ctxt->regs[VCPU_REGS_RCX], 
-1);
+   if ((ctxt->d & SrcMask) == SrcSI)
+   count = ctxt->src.count;
+   else
+   count = ctxt->dst.count;
+   register_addre

[PATCHv5 3/4] KVM: emulator: string_addr_inc() cleanup

2012-07-30 Thread Gleb Natapov
Remove unneeded segment argument. Address structure already has correct
segment which was put there during decode.

Signed-off-by: Gleb Natapov 
---
 arch/x86/kvm/emulate.c |   11 ---
 1 file changed, 4 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 4d33423..c22762d 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -2729,14 +2729,13 @@ int emulator_task_switch(struct x86_emulate_ctxt *ctxt,
return (rc == X86EMUL_UNHANDLEABLE) ? EMULATION_FAILED : EMULATION_OK;
 }
 
-static void string_addr_inc(struct x86_emulate_ctxt *ctxt, unsigned seg,
-   int reg, struct operand *op)
+static void string_addr_inc(struct x86_emulate_ctxt *ctxt, int reg,
+   struct operand *op)
 {
int df = (ctxt->eflags & EFLG_DF) ? -1 : 1;
 
register_address_increment(ctxt, &ctxt->regs[reg], df * op->bytes);
op->addr.mem.ea = register_address(ctxt, ctxt->regs[reg]);
-   op->addr.mem.seg = seg;
 }
 
 static int em_das(struct x86_emulate_ctxt *ctxt)
@@ -4508,12 +4507,10 @@ writeback:
ctxt->dst.type = saved_dst_type;
 
if ((ctxt->d & SrcMask) == SrcSI)
-   string_addr_inc(ctxt, seg_override(ctxt),
-   VCPU_REGS_RSI, &ctxt->src);
+   string_addr_inc(ctxt, VCPU_REGS_RSI, &ctxt->src);
 
if ((ctxt->d & DstMask) == DstDI)
-   string_addr_inc(ctxt, VCPU_SREG_ES, VCPU_REGS_RDI,
-   &ctxt->dst);
+   string_addr_inc(ctxt, VCPU_REGS_RDI, &ctxt->dst);
 
if (ctxt->rep_prefix && (ctxt->d & String)) {
struct read_cache *r = &ctxt->io_read;
-- 
1.7.10

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCHv5 2/4] KVM: emulator: make x86 emulation modes enum instead of defines

2012-07-30 Thread Gleb Natapov

Signed-off-by: Gleb Natapov 
---
 arch/x86/include/asm/kvm_emulate.h |   22 ++
 arch/x86/kvm/emulate.c |4 +++-
 2 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/arch/x86/include/asm/kvm_emulate.h 
b/arch/x86/include/asm/kvm_emulate.h
index c764f43..8d0fe8f 100644
--- a/arch/x86/include/asm/kvm_emulate.h
+++ b/arch/x86/include/asm/kvm_emulate.h
@@ -236,6 +236,15 @@ struct read_cache {
unsigned long end;
 };
 
+/* Execution mode, passed to the emulator. */
+enum x86emul_mode {
+   X86EMUL_MODE_REAL,  /* Real mode. */
+   X86EMUL_MODE_VM86,  /* Virtual 8086 mode. */
+   X86EMUL_MODE_PROT16,/* 16-bit protected mode. */
+   X86EMUL_MODE_PROT32,/* 32-bit protected mode. */
+   X86EMUL_MODE_PROT64,/* 64-bit (long) mode.*/
+};
+
 struct x86_emulate_ctxt {
struct x86_emulate_ops *ops;
 
@@ -243,7 +252,7 @@ struct x86_emulate_ctxt {
unsigned long eflags;
unsigned long eip; /* eip before instruction emulation */
/* Emulated execution mode, represented by an X86EMUL_MODE value. */
-   int mode;
+   enum x86emul_mode mode;
 
/* interruptibility state, as a result of execution of STI or MOV SS */
int interruptibility;
@@ -293,17 +302,6 @@ struct x86_emulate_ctxt {
 #define REPE_PREFIX0xf3
 #define REPNE_PREFIX   0xf2
 
-/* Execution mode, passed to the emulator. */
-#define X86EMUL_MODE_REAL 0/* Real mode. */
-#define X86EMUL_MODE_VM86 1/* Virtual 8086 mode. */
-#define X86EMUL_MODE_PROT16   2/* 16-bit protected mode. */
-#define X86EMUL_MODE_PROT32   4/* 32-bit protected mode. */
-#define X86EMUL_MODE_PROT64   8/* 64-bit (long) mode.*/
-
-/* any protected mode   */
-#define X86EMUL_MODE_PROT (X86EMUL_MODE_PROT16|X86EMUL_MODE_PROT32| \
-  X86EMUL_MODE_PROT64)
-
 /* CPUID vendors */
 #define X86EMUL_CPUID_VENDOR_AuthenticAMD_ebx 0x68747541
 #define X86EMUL_CPUID_VENDOR_AuthenticAMD_ecx 0x444d4163
diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 10f0136..4d33423 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -2210,6 +2210,8 @@ static int em_sysenter(struct x86_emulate_ctxt *ctxt)
if (msr_data == 0x0)
return emulate_gp(ctxt, 0);
break;
+   default:
+   break;
}
 
ctxt->eflags &= ~(EFLG_VM | EFLG_IF | EFLG_RF);
@@ -4338,7 +4340,7 @@ int x86_emulate_insn(struct x86_emulate_ctxt *ctxt)
}
 
/* Instruction can only be executed in protected mode */
-   if ((ctxt->d & Prot) && !(ctxt->mode & X86EMUL_MODE_PROT)) {
+   if ((ctxt->d & Prot) && ctxt->mode < X86EMUL_MODE_PROT16) {
rc = emulate_ud(ctxt);
goto done;
}
-- 
1.7.10

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCHv5 0/4] improve speed of "rep ins" emulation

2012-07-30 Thread Gleb Natapov
And now for something completely different.

So this series (or rather the last patch of it) takes different approach
to "rep ins" optimization. Instead of writing separate fast path for
it do the fast path inside emulator itself. This way nobody can say the
code is not reused!

Patch 1,2 are now, strictly speaking, not needed, but I think this is still
nice cleanup and, in case of patch 1, eliminates some ifs at each KVM_RUN ioctl.
 
Gleb Natapov (4):
  Provide userspace IO exit completion callback.
  KVM: emulator: make x86 emulation modes enum instead of defines
  KVM: emulator: string_addr_inc() cleanup
  KVM: emulator: optimize "rep ins" handling.

 arch/x86/include/asm/kvm_emulate.h |   26 +-
 arch/x86/include/asm/kvm_host.h|1 +
 arch/x86/kvm/emulate.c |   48 ++-
 arch/x86/kvm/x86.c |   92 +---
 4 files changed, 104 insertions(+), 63 deletions(-)

-- 
1.7.10

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: qemu-kvm-1.1.0 crashing with kernel 3.5.0-rc6

2012-07-30 Thread Chris Clayton

On 07/30/12 15:03, Avi Kivity wrote:

On 07/30/2012 05:00 PM, Chris Clayton wrote:

On 07/29/12 20:10, Chris Clayton wrote:

Possible culprit: b2da15ac26a0c00.



That commit isn't in qermu-kvm-1.1.1.


It is in kernel.



Sorry, so it is.

With kernel 3.5.0 with b2da15ac26a0c00 reverted, I have just had 15
clean invocations of vanilla qemu-kvm-1.1.1. So that commit would seem
to be the problem.


Just to be sure, I've run some more tests today. No crashes occurred in
20 runs of vanilla qemu-kvm-1.1.1 on kernel 3.5.0 with b2da15ac26a0c00
reverted.


Ok.  I'm trying to reproduce it here on a nested-virt setup, since the
code looks correct.

What's your preemption settings?



[chris:~/kernel/linux-3.5.0]$ grep PREEMPT .config
CONFIG_TREE_PREEMPT_RCU=y
CONFIG_PREEMPT_RCU=y
CONFIG_PREEMPT_NOTIFIERS=y
# CONFIG_PREEMPT_NONE is not set
# CONFIG_PREEMPT_VOLUNTARY is not set
CONFIG_PREEMPT=y
CONFIG_PREEMPT_COUNT=y

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: qemu-kvm-1.1.0 crashing with kernel 3.5.0-rc6

2012-07-30 Thread Avi Kivity
On 07/30/2012 05:00 PM, Chris Clayton wrote:
> On 07/29/12 20:10, Chris Clayton wrote:
> Possible culprit: b2da15ac26a0c00.
>
>
 That commit isn't in qermu-kvm-1.1.1.

>>> It is in kernel.
>>>
>>
>> Sorry, so it is.
>>
>> With kernel 3.5.0 with b2da15ac26a0c00 reverted, I have just had 15
>> clean invocations of vanilla qemu-kvm-1.1.1. So that commit would seem
>> to be the problem.
> 
> Just to be sure, I've run some more tests today. No crashes occurred in
> 20 runs of vanilla qemu-kvm-1.1.1 on kernel 3.5.0 with b2da15ac26a0c00
> reverted.

Ok.  I'm trying to reproduce it here on a nested-virt setup, since the
code looks correct.

What's your preemption settings?


-- 
error compiling committee.c: too many arguments to function


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: qemu-kvm-1.1.0 crashing with kernel 3.5.0-rc6

2012-07-30 Thread Chris Clayton

On 07/29/12 20:10, Chris Clayton wrote:

Possible culprit: b2da15ac26a0c00.



That commit isn't in qermu-kvm-1.1.1.


It is in kernel.



Sorry, so it is.

With kernel 3.5.0 with b2da15ac26a0c00 reverted, I have just had 15
clean invocations of vanilla qemu-kvm-1.1.1. So that commit would seem
to be the problem.


Just to be sure, I've run some more tests today. No crashes occurred in 
20 runs of vanilla qemu-kvm-1.1.1 on kernel 3.5.0 with b2da15ac26a0c00 
reverted.


Thanks.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH V4 3/3] virtio-blk: Add bio-based IO path for virtio-blk

2012-07-30 Thread Christoph Hellwig
On Mon, Jul 30, 2012 at 11:25:51AM +0930, Rusty Russell wrote:
> I consider this approach a half-way step.  Quick attempts on my laptop
> and I couldn't find a case where the bio path was a loss, but in theory
> if the host wasn't doing any reordering and it was a slow device, you'd
> want the guest to do so.
> 
> I'm not sure if current qemu can be configured to do such a thing?


The host kernel will do the I/O scheduling for you unless you explicitly
disable it.  And we should be able to assume an administrator will only
disable it when they have a reason for it - if not they'll get worse
performance for non-virtualized workloads as well.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH V3 3/3] virtio-blk: Add bio-based IO path for virtio-blk

2012-07-30 Thread Christoph Hellwig
On Mon, Jul 30, 2012 at 12:43:12PM +0800, Asias He wrote:
> I think we can add REQ_FLUSH & REQ_FUA support to bio path and that 
> deserves another patch.

Adding it is a requirement for merging the code.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH V3 3/3] virtio-blk: Add bio-based IO path for virtio-blk

2012-07-30 Thread Christoph Hellwig
On Mon, Jul 30, 2012 at 09:31:06AM +0200, Paolo Bonzini wrote:
> You only need to add REQ_FLUSH support.  The virtio-blk protocol does
> not support REQ_FUA, because there's no easy way to do it in userspace.

A bio-based driver needs to handle both REQ_FLUSH and REQ_FUA as it does
not get the sequencing of REQ_FUA into REQ_FLUSH that request based drivers
can request.  To what the REQ_FUA request gets translated is a different story.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [net-next RFC V5 4/5] virtio_net: multiqueue support

2012-07-30 Thread Sasha Levin
On 07/29/2012 11:44 AM, Michael S. Tsirkin wrote:
> On Sat, Jul 21, 2012 at 02:02:58PM +0200, Sasha Levin wrote:
>> On 07/20/2012 03:40 PM, Michael S. Tsirkin wrote:
 -  err = init_vqs(vi);
> + if (virtio_has_feature(vdev, VIRTIO_NET_F_CTRL_VQ))
> + vi->has_cvq = true;
> +
>>> How about we disable multiqueue if there's no cvq?
>>> Will make logic a bit simpler, won't it?
>>
>> multiqueues don't really depend on cvq. Does this added complexity really 
>> justifies adding an artificial limit?
> 
> Well !cvq support is a legacy feature: the reason we support it
> in driver is to avoid breaking on old hosts. Adding more code to that
> path just doesn't make much sense since old hosts won't have mq.

Is it really a legacy feature? The spec suggests that its an optional queue 
which is not necessary for the operation of the device.

Which is why we never implemented it in lkvm - we weren't interested in any of 
the features it provided at that time and we could provide high performance 
with vhost support even without it.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PULL] virtio

2012-07-30 Thread Rusty Russell
The following changes since commit f7da9cdf45cbbad5029d4858dcbc0134e06084ed:

  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net (2012-07-28 
06:00:39 -0700)

are available in the git repository at:


  git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus.git 
tags/virtio-for-linus

for you to fetch changes up to 6a743897144500fb4c4566ced3a498d5180fbb5b:

  virtio-blk: return VIRTIO_BLK_F_FLUSH to header. (2012-07-30 13:30:52 +0930)


Virtio patches, mainly hotplugging fixes.


Amit Shah (5):
  virtio ids: fix comment for virtio-rng
  virtio: rng: allow tasks to be killed that are waiting for rng input
  virtio: rng: don't wait on host when module is going away
  virtio: rng: split out common code in probe / remove for s3/s4 ops
  virtio: rng: s3/s4 support

Asias He (3):
  virtio-blk: Call del_gendisk() before disable guest kick
  virtio-blk: Reset device after blk_cleanup_queue()
  virtio-blk: Use block layer provided spinlock

Paolo Bonzini (1):
  virtio-blk: allow toggling host cache between writeback and writethrough

Rusty Russell (1):
  virtio-blk: return VIRTIO_BLK_F_FLUSH to header.

 drivers/block/virtio_blk.c  |  115 ---
 drivers/char/hw_random/virtio-rng.c |   37 ++-
 include/linux/virtio_blk.h  |   10 ++-
 include/linux/virtio_ids.h  |2 +-
 4 files changed, 137 insertions(+), 27 deletions(-)
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


9p broken?

2012-07-30 Thread Avi Kivity
Having an annoying bug on i386 kvm I decided to debug it buy running an
i386 guest on my x86_64 host, use 9p to access a guest image, and run it
using nested kvm.

However, 9p appears to be broken: first, the configure test fails (patch
sent).  Second, while mount works, ls on the mount point causes qemu to
crash with

(gdb) bt
#0  error_set (errp=0x7fffe95fb128, fmt=0x558d4568 "{ 'class':
'VirtFSFeatureBlocksMigration', 'data': { 'path': %s, 'tag': %s } }") at
/home/tlv/akivity/qemu/error.c:32
#1  0x5567cb06 in v9fs_attach (opaque=0x7fffe95e3020) at
/home/tlv/akivity/qemu/hw/9pfs/virtio-9p.c:988
#2  0x5561d19f in coroutine_trampoline (i0=1449767888, i1=21845)
at /home/tlv/akivity/qemu/coroutine-ucontext.c:138
#3  0x75a93ef0 in ?? () from /lib64/libc.so.6
#4  0x7fffce00 in ?? ()
#5  0x in ?? (

**errp already points to a VirtFSFeatureBlocksMigration error;
v9fs_attach() has been called a second time (the first time,
understandably, on mount; the second on ls).

Command line:

qemu-system-x86_64 -m 4G -smp 4 -drive
file=/images/Fedora-i386.img,if=virtio,cache=none -cdrom
/images/iso/bfo.iso -device virtio-9p-pci,fsdev=root,mount_tag=root
-fsdev local,id=root,path=/,security_model=passthrough -enable-kvm -net
nic,model=virtio,netdev=net -netdev user,id=net -monitor stdio -cpu host

-- 
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] powerpc/kvm: Allow qemu hypercalls to set all regs

2012-07-30 Thread Alexander Graf

On 27.07.2012, at 08:18, Benjamin Herrenschmidt wrote:

> Right now, whenever we exit for an hcall, upon return, we fetch
> some register values from the structure that carries the hcall
> results and update the vcpu accordingly.
> 
> However, if the hypercall chooses instead to update the registers
> itself by calling set_regs, then we end up clobbering those values.
> 
> This is for example the case of the rtas calls used to reboot the
> machine where a new CPU state is established and must not be clobbered.
> 
> The simple fix is to always clear the hcall_needed flag when a
> set-regs happen. This fixes reboot problems.

I do understand the race you're running into. However, I don't understand why. 
QEMU's hcall return code fetches its GPRs from env->gprs. That should be 
perfectly fine if the values are 0 before you return from the hcall function.

Could you please try to manually set the relevant regs to 0 in QEMU and see if 
that also fixes reboot for you? That'd keep the interface a bit less complex.


Alex

> 
> Signed-off-by: Benjamin Herrenschmidt 
> ---
> 
> Note: There are similar issues if the reboot is triggered by an MMIO
> emulation, or an OSI call, Alex, you might want to handle those cases
> as well. In any cases, I'd like this to still go into 3.6 but I can
> send it myself to Linus if you want.
> 
> diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
> index 3f2a836..9ab13d6 100644
> --- a/arch/powerpc/kvm/book3s.c
> +++ b/arch/powerpc/kvm/book3s.c
> @@ -463,6 +463,14 @@ int kvm_arch_vcpu_ioctl_set_regs(struct kvm_vcpu *vcpu, 
> struct kvm_regs *regs)
>   for (i = 0; i < ARRAY_SIZE(regs->gpr); i++)
>   kvmppc_set_gpr(vcpu, i, regs->gpr[i]);
> 
> + /*
> +  * If the hypercall decides to change register values
> +  * explicitly, we must ensure that we do not override
> +  * them upon return from that hypercall. Among others
> +  * this happens on system reset
> +  */
> + vcpu->arch.hcall_needed = 0;
> +
>   return 0;
> }
> 
> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> index 87f4dc8..75e196d 100644
> --- a/arch/powerpc/kvm/powerpc.c
> +++ b/arch/powerpc/kvm/powerpc.c
> @@ -592,6 +592,11 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, 
> struct kvm_run *run)
>   } else if (vcpu->arch.hcall_needed) {
>   int i;
> 
> + /*
> +  * Note: This might not be called if the hypervisor
> +  * call has done a set_regs(). In this case hcall_needed
> +  * is cleared. This is necessary for reset to work properly
> +  */
>   kvmppc_set_gpr(vcpu, 3, run->papr_hcall.ret);
>   for (i = 0; i < 9; ++i)
>   kvmppc_set_gpr(vcpu, 4 + i, run->papr_hcall.args[i]);
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/2] KVM: PPC: booke: Added debug handler

2012-07-30 Thread Bharat Bhushan
Installed debug handler will be used for guest debug support and
debug facility emulation features (patches for these features
will follow this patch).

Signed-off-by: Liu Yu 
[bharat.bhus...@freescale.com: Substantial changes]
Signed-off-by: Bharat Bhushan 
---
 arch/powerpc/include/asm/kvm_host.h |1 +
 arch/powerpc/kernel/asm-offsets.c   |1 +
 arch/powerpc/kvm/booke_interrupts.S |   45 +++
 3 files changed, 47 insertions(+), 0 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index 7a45194..daa 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -492,6 +492,7 @@ struct kvm_vcpu_arch {
u32 tlbcfg[4];
u32 mmucfg;
u32 epr;
+   u32 crit_save;
 #endif
gpa_t paddr_accessed;
gva_t vaddr_accessed;
diff --git a/arch/powerpc/kernel/asm-offsets.c 
b/arch/powerpc/kernel/asm-offsets.c
index 52c7ad7..6f7c1c9 100644
--- a/arch/powerpc/kernel/asm-offsets.c
+++ b/arch/powerpc/kernel/asm-offsets.c
@@ -562,6 +562,7 @@ int main(void)
DEFINE(VCPU_LAST_INST, offsetof(struct kvm_vcpu, arch.last_inst));
DEFINE(VCPU_FAULT_DEAR, offsetof(struct kvm_vcpu, arch.fault_dear));
DEFINE(VCPU_FAULT_ESR, offsetof(struct kvm_vcpu, arch.fault_esr));
+   DEFINE(VCPU_CRIT_SAVE, offsetof(struct kvm_vcpu, arch.crit_save));
 #endif /* CONFIG_PPC_BOOK3S */
 #endif /* CONFIG_KVM */
 
diff --git a/arch/powerpc/kvm/booke_interrupts.S 
b/arch/powerpc/kvm/booke_interrupts.S
index c7aebc5..e002c34 100644
--- a/arch/powerpc/kvm/booke_interrupts.S
+++ b/arch/powerpc/kvm/booke_interrupts.S
@@ -74,6 +74,51 @@ _GLOBAL(kvmppc_handler_\ivor_nr)
bctr
 .endm
 
+.macro KVM_DBG_HANDLER ivor_nr scratch srr0
+_GLOBAL(kvmppc_handler_\ivor_nr)
+   mtspr   \scratch, r4
+   mfspr   r4, SPRN_SPRG_THREAD
+   lwz r4, THREAD_KVM_VCPU(r4)
+   stw r3, VCPU_CRIT_SAVE(r4)
+   mfcrr3
+   mfspr   r4, SPRN_CSRR1
+   andi.   r4, r4, MSR_PR
+   bne 1f
+   /* debug interrupt happened in enter/exit path */
+   mfspr   r4, SPRN_CSRR1
+   rlwinm  r4, r4, 0, ~MSR_DE
+   mtspr   SPRN_CSRR1, r4
+   lis r4, 0x
+   ori r4, r4, 0x
+   mtspr   SPRN_DBSR, r4
+   mfspr   r4, SPRN_SPRG_THREAD
+   lwz r4, THREAD_KVM_VCPU(r4)
+   mtcrr3
+   lwz r3, VCPU_CRIT_SAVE(r4)
+   mfspr   r4, \scratch
+   rfci
+1: /* debug interrupt happened in guest */
+   mfspr   r4, \scratch
+   mtcrr3
+   mr  r3, r4
+   mfspr   r4, SPRN_SPRG_THREAD
+   lwz r4, THREAD_KVM_VCPU(r4)
+   stw r3, VCPU_GPR(r4)(r4)
+   stw r5, VCPU_GPR(r5)(r4)
+   stw r6, VCPU_GPR(r6)(r4)
+   lwz r3, VCPU_CRIT_SAVE(r4)
+   mfspr   r5, \srr0
+   stw r3, VCPU_GPR(r3)(r4)
+   stw r5, VCPU_PC(r4)
+   mfctr   r5
+   lis r6, kvmppc_resume_host@h
+   stw r5, VCPU_CTR(r4)
+   li  r5, \ivor_nr
+   ori r6, r6, kvmppc_resume_host@l
+   mtctr   r6
+   bctr
+.endm
+
 .macro KVM_HANDLER_ADDR ivor_nr
.long   kvmppc_handler_\ivor_nr
 .endm
-- 
1.7.0.4


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/2] KVM: PPC: booke: Allow multiple exception types

2012-07-30 Thread Bharat Bhushan
Current kvmppc_booke_handlers uses the same macro (KVM_HANDLER) and
all handlers are considered to be the same size. This will not be
the case if we want to use different macros for different handlers.

This patch improves the kvmppc_booke_handler so that it can
support different macros for different handlers.

Signed-off-by: Liu Yu 
[bharat.bhus...@freescale.com: Substantial changes]
Signed-off-by: Bharat Bhushan 
---
 arch/powerpc/include/asm/kvm_ppc.h  |2 -
 arch/powerpc/kvm/booke.c|9 ---
 arch/powerpc/kvm/booke.h|1 +
 arch/powerpc/kvm/booke_interrupts.S |   37 --
 arch/powerpc/kvm/e500.c |   13 +++
 5 files changed, 48 insertions(+), 14 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_ppc.h 
b/arch/powerpc/include/asm/kvm_ppc.h
index 823d563..6d3e665 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -47,8 +47,6 @@ enum emulation_result {
 
 extern int kvmppc_vcpu_run(struct kvm_run *kvm_run, struct kvm_vcpu *vcpu);
 extern int __kvmppc_vcpu_run(struct kvm_run *kvm_run, struct kvm_vcpu *vcpu);
-extern char kvmppc_handlers_start[];
-extern unsigned long kvmppc_handler_len;
 extern void kvmppc_handler_highmem(void);
 
 extern void kvmppc_dump_vcpu(struct kvm_vcpu *vcpu);
diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
index 47a7925..6fbdcfc 100644
--- a/arch/powerpc/kvm/booke.c
+++ b/arch/powerpc/kvm/booke.c
@@ -1528,6 +1528,7 @@ int __init kvmppc_booke_init(void)
 {
 #ifndef CONFIG_KVM_BOOKE_HV
unsigned long ivor[16];
+   unsigned long *handler = kvmppc_booke_handler_addr;
unsigned long max_ivor = 0;
int i;
 
@@ -1561,14 +1562,14 @@ int __init kvmppc_booke_init(void)
 
for (i = 0; i < 16; i++) {
if (ivor[i] > max_ivor)
-   max_ivor = ivor[i];
+   max_ivor = i;
 
memcpy((void *)kvmppc_booke_handlers + ivor[i],
-  kvmppc_handlers_start + i * kvmppc_handler_len,
-  kvmppc_handler_len);
+  (void *)handler[i], handler[i + 1] - handler[i]);
}
flush_icache_range(kvmppc_booke_handlers,
-  kvmppc_booke_handlers + max_ivor + 
kvmppc_handler_len);
+  kvmppc_booke_handlers + ivor[max_ivor] +
+   handler[max_ivor + 1] - handler[max_ivor]);
 #endif /* !BOOKE_HV */
return 0;
 }
diff --git a/arch/powerpc/kvm/booke.h b/arch/powerpc/kvm/booke.h
index ba61974..de9e526 100644
--- a/arch/powerpc/kvm/booke.h
+++ b/arch/powerpc/kvm/booke.h
@@ -65,6 +65,7 @@
  (1 << BOOKE_IRQPRIO_CRITICAL))
 
 extern unsigned long kvmppc_booke_handlers;
+extern unsigned long kvmppc_booke_handler_addr[];
 
 void kvmppc_set_msr(struct kvm_vcpu *vcpu, u32 new_msr);
 void kvmppc_mmu_msr_notify(struct kvm_vcpu *vcpu, u32 old_msr);
diff --git a/arch/powerpc/kvm/booke_interrupts.S 
b/arch/powerpc/kvm/booke_interrupts.S
index 09456c4..c7aebc5 100644
--- a/arch/powerpc/kvm/booke_interrupts.S
+++ b/arch/powerpc/kvm/booke_interrupts.S
@@ -74,6 +74,14 @@ _GLOBAL(kvmppc_handler_\ivor_nr)
bctr
 .endm
 
+.macro KVM_HANDLER_ADDR ivor_nr
+   .long   kvmppc_handler_\ivor_nr
+.endm
+
+.macro KVM_HANDLER_END
+   .long   kvmppc_handlers_end
+.endm
+
 _GLOBAL(kvmppc_handlers_start)
 KVM_HANDLER BOOKE_INTERRUPT_CRITICAL SPRN_SPRG_RSCRATCH_CRIT SPRN_CSRR0
 KVM_HANDLER BOOKE_INTERRUPT_MACHINE_CHECK  SPRN_SPRG_RSCRATCH_MC SPRN_MCSRR0
@@ -94,9 +102,7 @@ KVM_HANDLER BOOKE_INTERRUPT_DEBUG SPRN_SPRG_RSCRATCH_CRIT 
SPRN_CSRR0
 KVM_HANDLER BOOKE_INTERRUPT_SPE_UNAVAIL SPRN_SPRG_RSCRATCH0 SPRN_SRR0
 KVM_HANDLER BOOKE_INTERRUPT_SPE_FP_DATA SPRN_SPRG_RSCRATCH0 SPRN_SRR0
 KVM_HANDLER BOOKE_INTERRUPT_SPE_FP_ROUND SPRN_SPRG_RSCRATCH0 SPRN_SRR0
-
-_GLOBAL(kvmppc_handler_len)
-   .long kvmppc_handler_1 - kvmppc_handler_0
+_GLOBAL(kvmppc_handlers_end)
 
 /* Registers:
  *  SPRG_SCRATCH0: guest r4
@@ -464,6 +470,31 @@ lightweight_exit:
lwz r4, VCPU_GPR(r4)(r4)
rfi
 
+   .data
+   .align  4
+   .globl  kvmppc_booke_handler_addr
+kvmppc_booke_handler_addr:
+KVM_HANDLER_ADDR BOOKE_INTERRUPT_CRITICAL
+KVM_HANDLER_ADDR BOOKE_INTERRUPT_MACHINE_CHECK
+KVM_HANDLER_ADDR BOOKE_INTERRUPT_DATA_STORAGE
+KVM_HANDLER_ADDR BOOKE_INTERRUPT_INST_STORAGE
+KVM_HANDLER_ADDR BOOKE_INTERRUPT_EXTERNAL
+KVM_HANDLER_ADDR BOOKE_INTERRUPT_ALIGNMENT
+KVM_HANDLER_ADDR BOOKE_INTERRUPT_PROGRAM
+KVM_HANDLER_ADDR BOOKE_INTERRUPT_FP_UNAVAIL
+KVM_HANDLER_ADDR BOOKE_INTERRUPT_SYSCALL
+KVM_HANDLER_ADDR BOOKE_INTERRUPT_AP_UNAVAIL
+KVM_HANDLER_ADDR BOOKE_INTERRUPT_DECREMENTER
+KVM_HANDLER_ADDR BOOKE_INTERRUPT_FIT
+KVM_HANDLER_ADDR BOOKE_INTERRUPT_WATCHDOG
+KVM_HANDLER_ADDR BOOKE_INTERRUPT_DTLB_MISS
+KVM_HANDLER_ADDR BOOKE_INTERRUPT_ITLB_MISS
+KVM_HANDLER_ADDR BOOKE_INTERRUPT_DEBUG
+KVM_HANDLER_ADDR BOOKE_INTERRUPT_SPE_UNAVAIL
+KVM_HA

[PATCH] KVM: PPC: Book3S HV: Handle memory slot deletion and modification

2012-07-30 Thread Paul Mackerras
>From 4df5da62cb49b71c9a70072341dcb0557768c520 Mon Sep 17 00:00:00 2001
From: Paul Mackerras 
Date: Mon, 30 Jul 2012 16:40:54 +1000
Subject: [PATCH] KVM: PPC: Book3S HV: Handle memory slot deletion and
 modification correctly

At present the HV style of KVM doesn't handle deletion or modification
of memory slots correctly.  Deletion occurs when userspace does the
KVM_SET_USER_MEMORY_REGION with arguments specifying a new length of
zero for a slot that contains memory.  Modification occurs when the
arguments specify a new guest_phys_addr or flags.

To allow the HV code to know which operation (creation, deletion or
modification) is being requested, it needs to see the old and new
contents of the memslot.  kvm_arch_prepare_memory_region has this
information, so we modify it to pass it down to
kvmppc_core_prepare_memory_region.  We then modify the HV version
of that to check which operation is being performed.  If it is a
deletion, we call a new function kvmppc_unmap_memslot to remove any
HPT (hashed page table) entries referencing the memory being removed.
Similarly, if we are modifying the guest physical address we also
remove any HPT entries.  If the KVM_MEM_LOG_DIRTY_PAGES flag is now
set and wasn't before, we call kvmppc_hv_get_dirty_log to clear any
dirty bits so we can detect all modifications from now on.

Signed-off-by: Paul Mackerras 
---
 arch/powerpc/include/asm/kvm_ppc.h  |4 +++
 arch/powerpc/kvm/book3s_64_mmu_hv.c |   27 ++--
 arch/powerpc/kvm/book3s_hv.c|   61 +++
 arch/powerpc/kvm/book3s_hv_rm_mmu.c |2 +-
 arch/powerpc/kvm/book3s_pr.c|2 ++
 arch/powerpc/kvm/booke.c|2 ++
 arch/powerpc/kvm/powerpc.c  |2 +-
 7 files changed, 76 insertions(+), 24 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_ppc.h 
b/arch/powerpc/include/asm/kvm_ppc.h
index 0124937..044c921 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -140,11 +140,15 @@ extern void kvm_release_hpt(struct kvmppc_linear_info 
*li);
 extern int kvmppc_core_init_vm(struct kvm *kvm);
 extern void kvmppc_core_destroy_vm(struct kvm *kvm);
 extern int kvmppc_core_prepare_memory_region(struct kvm *kvm,
+struct kvm_memory_slot *memslot,
+struct kvm_memory_slot *old,
struct kvm_userspace_memory_region *mem);
 extern void kvmppc_core_commit_memory_region(struct kvm *kvm,
struct kvm_userspace_memory_region *mem);
 extern int kvm_vm_ioctl_get_smmu_info(struct kvm *kvm,
  struct kvm_ppc_smmu_info *info);
+extern void kvmppc_unmap_memslot(struct kvm *kvm,
+struct kvm_memory_slot *memslot);
 
 extern int kvmppc_bookehv_init(void);
 extern void kvmppc_bookehv_exit(void);
diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c 
b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index d03eb6f..69fee3b 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -830,7 +830,8 @@ static int kvm_unmap_rmapp(struct kvm *kvm, unsigned long 
*rmapp,
psize = hpte_page_size(hptep[0], ptel);
if ((hptep[0] & HPTE_V_VALID) &&
hpte_rpn(ptel, psize) == gfn) {
-   hptep[0] |= HPTE_V_ABSENT;
+   if (kvm->arch.using_mmu_notifiers)
+   hptep[0] |= HPTE_V_ABSENT;
kvmppc_invalidate_hpte(kvm, hptep, i);
/* Harvest R and C */
rcbits = hptep[1] & (HPTE_R_R | HPTE_R_C);
@@ -850,6 +851,28 @@ int kvm_unmap_hva(struct kvm *kvm, unsigned long hva)
return 0;
 }
 
+void kvmppc_unmap_memslot(struct kvm *kvm, struct kvm_memory_slot *memslot)
+{
+   unsigned long *rmapp;
+   unsigned long gfn;
+   unsigned long n;
+
+   rmapp = memslot->rmap;
+   gfn = memslot->base_gfn;
+   for (n = memslot->npages; n; --n) {
+   /*
+* Testing the present bit without locking is OK because
+* the memslot has been marked invalid already, and hence
+* no new HPTEs referencing this page can be created,
+* thus the present bit can't go from 0 to 1.
+*/
+   if (*rmapp & KVMPPC_RMAP_PRESENT)
+   kvm_unmap_rmapp(kvm, rmapp, gfn);
+   ++rmapp;
+   ++gfn;
+   }
+}
+
 static int kvm_age_rmapp(struct kvm *kvm, unsigned long *rmapp,
 unsigned long gfn)
 {
@@ -1012,7 +1035,7 @@ long kvmppc_hv_get_dirty_log(struct kvm *kvm, struct 
kvm_memory_slot *memslot)
rmapp = memslot->rmap;
map = memslot->dirty_bitmap;
for (i = 0; i < memslot->npages; ++i) {
-   if (kvm_test_clear_dirty(kvm, rmapp))
+   if (kvm_test_clear_dirty(kvm, rmapp) &

Re: KVM call agenda for Tuesday, July 31

2012-07-30 Thread Paolo Bonzini
Il 30/07/2012 09:34, Juan Quintela ha scritto:
> 
> Hi
> 
> Please send in any agenda items you are interested in covering.

How/when to merge the "new error" series.

My proposal: make soft feature freeze shorter, delay rc0 slightly, only
take care of errors in the meanwhile.

Paolo



--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: virtio(-scsi) vs. chained sg_lists (was Re: [PATCH] scsi: virtio-scsi: Fix address translation failure of HighMem pages used by sg list)

2012-07-30 Thread Boaz Harrosh
On 07/30/2012 10:12 AM, Paolo Bonzini wrote:

> Il 30/07/2012 01:50, Rusty Russell ha scritto:
>>> Also, being the first user of chained scatterlist doesn't exactly give
>>> me warm fuzzies.
>>
>> We're far from the first user: they've been in the kernel for well over
>> 7 years.  They were introduced for the block layer, but they tended to
>> ignore private uses of scatterlists like this one.
> 
> Yeah, but sg_chain has no users in drivers, only a private one in
> lib/scatterlist.c.  The internal API could be changed to something else
> and leave virtio-scsi screwed...
> 
>> Yes, we should do this.  But note that this means an iteration, so we
>> might as well combine the loops :)
> 
> I'm really bad at posting pseudo-code, but you can count the number of
> physically-contiguous entries at the beginning of the list only.  So if
> everything is contiguous, you use a single non-indirect buffer and save
> a kmalloc.  If you use indirect buffers, I suspect it's much less
> effective to collapse physically-contiguous entries.  More elaborate
> heuristics do need a loop, though.
> 


[All the below with a grain of salt, from my senile memory]

You must not forget some facts about the scatterlist received here at the LLD.
It has already been DMA mapped and locked by the generic layer.
Which means that the DMA engine has already collapsed physically-contiguous
entries. Those you get here are already unique physically.
(There were bugs in the past, where this was not true, please complain
 if you find them again)

A scatterlist is two different lists taking the same space, but with two
different length.
- One list is the PAGE pointers plus offset && length, which is bigger or
  equal to the 2nd list. The end marker corresponds to this list.

  This list is the input into the DMA engine.

- Second list is the physical DMA addresses list. With their physical-lengths.
  Offset is not needed because it is incorporated in the DMA address.

  This list is the output from the DMA engine.

  The reason 2nd list is shorter is because the DMA engine tries to minimize
  the physical scatter-list entries which is usually a limited HW resource.

  This list might follow chains but it's end is determined by the received
  sg_count from the DMA engine, not by the end marker.

At the time my opinion, and I think Rusty agreed, was that the scatterlist
should be split in two. The input page-ptr list is just the BIO, and the
output of the DMA-engine should just be the physical part of the sg_list,
as a separate parameter. But all this was berried under too much APIs and
the noise was two strong, for any single brave sole.

So I'd just trust blindly the returned sg_count from the DMA engine, it is
already optimized. I THINK

> Paolo


Boaz
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH stub] kvm: caching API for interrupts

2012-07-30 Thread Avi Kivity
On 07/29/2012 11:00 PM, Michael S. Tsirkin wrote:
> I've been looking at adding caching for IRQs so that we don't need to
> scan all VCPUs on each interrupt.  One issue I had a problem with, is
> how the cache structure can be used from both a thread (to fill out the
> cache) and interrupt (to actually send if cache is valid).
> 
> For now just added a lock field in the cache so we don't need to worry
> about this, and with such a lock in place we don't have to worry about
> RCU as cache can be invalidated simply under this lock.
> 
> For now this just declares the structure and updates the APIs
> so it's not intended to be applied, but just to give you
> the idea.
> 
> Comments, flames wellcome.

I hope you aren't expecting any of the latter.

> 
> +struct kvm_irq_cache {
> + spinlock_t lock;
> + bool valid;
> + struct kvm_vcpu *dest;
> + /* For now we only cache lapic irqs */
> + struct kvm_lapic_irq irq;
> + /* Protected by kvm->irq_cache_lock */
> + struct list_head list;
> +};
> +

Why an external structure?

Why not add something to kvm_kernel_irq_routing_entry?  It is already
protected by rcu.

The atomic context code could look like this:


  if (kire->cache.valid) {
  kvm_apic_set_irq(kire->cache.vcpu, &kire->cache.irq);
  return;
  }
  queue_work(process_irq_slowpath);

The slow path tries to fills the cache, and processes the interrupt in
any case.

-- 
error compiling committee.c: too many arguments to function


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: How does QEMU communicate with KVM?

2012-07-30 Thread Min-gyu Kim
Sorry for sending garbage mail



There are several ways that qemu and kvm can communicate.
One of them is the ioctl to /dev/kvm and it is documented in
Documentation/virtual/kvm/api.txt.

But there's other ways also: like the pages that are mmap-ed from
kvm_vcpu_fault function, 
I’m not sure about the documentation of such channels.

regards
Min-gyu Kim



--- Original Message ---
Sender : Richard Yao
Date : 2012-07-30 14:47 (GMT+09:00)
Title : How does QEMU communicate with KVM?

Is there any documentation on how QEMU communicates with KVM?



From: kvm-ow...@vger.kernel.org [mailto:kvm-ow...@vger.kernel.org] On
Behalf Of 김민규
Sent: Monday, July 30, 2012 5:23 PM
To: Richard Yao; kvm@vger.kernel.org
Subject: Re: How does QEMU communicate with KVM?



--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: How does QEMU communicate with KVM?

2012-07-30 Thread 김민규
There are several ways that qemu and kvm can communicate.
One of them is the ioctl to /dev/kvm and it is documented in 
Documentation/virtual/kvm/api.txt.

But there's other ways also: like the pages that are mmap-ed from 
kvm_vcpu_fault function, 
As far as I know, there is no documentation on such communication channel(but 
i'm not sure)

regards
Min-gyu Kim


--- Original Message ---
Sender : Richard Yao
Date : 2012-07-30 14:47 (GMT+09:00)
Title : How does QEMU communicate with KVM?

Is there any documentation on how QEMU communicates with KVM?

RE: [PATCH 2/2] KVM: PPC: booke/bookehv: Add guest debug support

2012-07-30 Thread Bhushan Bharat-R65777


> -Original Message-
> From: Wood Scott-B07421
> Sent: Friday, July 27, 2012 7:00 AM
> To: Bhushan Bharat-R65777
> Cc: kvm-...@vger.kernel.org; kvm@vger.kernel.org; ag...@suse.de; Bhushan 
> Bharat-
> R65777
> Subject: Re: [PATCH 2/2] KVM: PPC: booke/bookehv: Add guest debug support
> 
> On 07/26/2012 12:32 AM, Bharat Bhushan wrote:
> > This patch adds:
> >  1) KVM debug handler added for e500v2.
> >  2) Guest debug by qemu gdb stub.
> 
> Does it make sense for these to both be in the same patch?  If there's common
> code used by both, that could be added first.

ok

> 
> > Signed-off-by: Liu Yu 
> > Signed-off-by: Varun Sethi 
> > [bharat.bhus...@freescale.com: Substantial changes]
> > Signed-off-by: Bharat Bhushan 
> > ---
> >  arch/powerpc/include/asm/kvm.h|   21 +
> >  arch/powerpc/include/asm/kvm_host.h   |7 ++
> >  arch/powerpc/include/asm/kvm_ppc.h|2 +
> >  arch/powerpc/include/asm/reg_booke.h  |1 +
> >  arch/powerpc/kernel/asm-offsets.c |   31 ++-
> >  arch/powerpc/kvm/booke.c  |  146 +++---
> >  arch/powerpc/kvm/booke_interrupts.S   |  160
> -
> >  arch/powerpc/kvm/bookehv_interrupts.S |  141 -
> >  arch/powerpc/kvm/e500mc.c |3 +-
> >  arch/powerpc/kvm/powerpc.c|2 +-
> >  10 files changed, 492 insertions(+), 22 deletions(-)
> >
> > diff --git a/arch/powerpc/include/asm/kvm.h
> > b/arch/powerpc/include/asm/kvm.h index 3c14202..da71c84 100644
> > --- a/arch/powerpc/include/asm/kvm.h
> > +++ b/arch/powerpc/include/asm/kvm.h
> > @@ -25,6 +25,7 @@
> >  /* Select powerpc specific features in  */  #define
> > __KVM_HAVE_SPAPR_TCE  #define __KVM_HAVE_PPC_SMT
> > +#define __KVM_HAVE_GUEST_DEBUG
> >
> >  struct kvm_regs {
> > __u64 pc;
> > @@ -265,10 +266,19 @@ struct kvm_fpu {  };
> >
> >  struct kvm_debug_exit_arch {
> > +   __u32 exception;
> > +   __u32 pc;
> > +   __u32 status;
> >  };
> 
> PC must be 64-bit.  What goes in "status" and "exception"?

ok

> 
> >  /* for KVM_SET_GUEST_DEBUG */
> >  struct kvm_guest_debug_arch {
> > +   struct {
> > +   __u64 addr;
> > +   __u32 type;
> > +   __u32 pad1;
> > +   __u64 pad2;
> > +   } bp[16];
> >  };
> 
> What goes in "type"?

Type denote breakpoint, read watchpoint, write watchpoint or watchpoint (both 
read and write). Will adding a comment to describe this is ok?

> 
> >  /* definition of registers in kvm_run */ @@ -285,6 +295,17 @@ struct
> > kvm_sync_regs {
> >  #define KVM_CPU_3S_64  4
> >  #define KVM_CPU_E500MC 5
> >
> > +/* Debug related defines */
> > +#define KVM_INST_GUESTGDB   0x7C00021C  /* ehpriv OC=0 */
> 
> Will this work on all PPC?
> 
> It certainly won't work on other architectures, so at a minimum it's
> KVM_PPC_INST_GUEST_GDB, but maybe it needs to be determined at runtime.

How to determine at run time? adding another ioctl ?

> 
> > +#define KVM_GUESTDBG_USE_SW_BP  0x0001
> > +#define KVM_GUESTDBG_USE_HW_BP  0x0002
> 
> Where do these get used?  Any reason for these particular values?  If you're
> trying to create a partition where the upper half is generic and the lower 
> half
> is arch-specific, say so.

KVM_SET_GUEST_DEBUG ioctl used to set/unset debug interrupts, which have a "u32 
control" element. We have inherited this mechanism from x86 implementation and 
it looks like lower 16 bits are generic (like KVM_GUESTDBG_ENBLE, 
KVM_GUESTDBG_SINGLESTEP etc and upper 16 bits are Architecture specific.

I will add a comment to describe this.

> 
> > diff --git a/arch/powerpc/include/asm/kvm_host.h
> > b/arch/powerpc/include/asm/kvm_host.h
> > index 7a45194..524af7a 100644
> > --- a/arch/powerpc/include/asm/kvm_host.h
> > +++ b/arch/powerpc/include/asm/kvm_host.h
> > @@ -458,7 +458,12 @@ struct kvm_vcpu_arch {
> > u32 ccr0;
> > u32 ccr1;
> > u32 dbsr;
> > +   /* guest debug regiters*/
> > struct kvmppc_booke_debug_reg dbg_reg;
> > +   /* shadow debug registers */
> > +   struct kvmppc_booke_debug_reg shadow_dbg_reg;
> > +   /* host debug regiters*/
> > +   struct kvmppc_booke_debug_reg host_dbg_reg;
> 
> s/regiter/register/g
> 
> ...and put a space before */

> 
> > @@ -492,6 +497,7 @@ struct kvm_vcpu_arch {
> > u32 tlbcfg[4];
> > u32 mmucfg;
> > u32 epr;
> > +   u32 crit_save;
> >  #endif
> 
> What is crit_save?
> 
> > gpa_t paddr_accessed;
> > gva_t vaddr_accessed;
> > @@ -533,6 +539,7 @@ struct kvm_vcpu_arch {
> > struct kvm_vcpu_arch_shared *shared;
> > unsigned long magic_page_pa; /* phys addr to map the magic page to */
> > unsigned long magic_page_ea; /* effect. addr to map the magic page
> > to */
> > +   struct kvm_guest_debug_arch dbg; /* debug arg between kvm and qemu
> > +*/
> 
> Is kvm_guest_debug_arch generic or PPC-specific?  If the former, why is it in 
> a
> ppc struct?  If the latter, why doesn't it h

KVM call agenda for Tuesday, July 31

2012-07-30 Thread Juan Quintela

Hi

Please send in any agenda items you are interested in covering.

Thanks, Juan.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH V3 3/3] virtio-blk: Add bio-based IO path for virtio-blk

2012-07-30 Thread Paolo Bonzini
Il 30/07/2012 06:43, Asias He ha scritto:
>>
> 
> Yes. Something like this:
> 
>   qemu -drive file=foo.img,cache=writeback/unsafe
> 
> is not safe against power losses also?

cache=writeback and cache=none are safe, cache=unsafe isn't.

> I think we can add REQ_FLUSH & REQ_FUA support to bio path and that
> deserves another patch.

You only need to add REQ_FLUSH support.  The virtio-blk protocol does
not support REQ_FUA, because there's no easy way to do it in userspace.

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: virtio(-scsi) vs. chained sg_lists (was Re: [PATCH] scsi: virtio-scsi: Fix address translation failure of HighMem pages used by sg list)

2012-07-30 Thread Paolo Bonzini
Il 30/07/2012 01:50, Rusty Russell ha scritto:
>> Also, being the first user of chained scatterlist doesn't exactly give
>> me warm fuzzies.
> 
> We're far from the first user: they've been in the kernel for well over
> 7 years.  They were introduced for the block layer, but they tended to
> ignore private uses of scatterlists like this one.

Yeah, but sg_chain has no users in drivers, only a private one in
lib/scatterlist.c.  The internal API could be changed to something else
and leave virtio-scsi screwed...

> Yes, we should do this.  But note that this means an iteration, so we
> might as well combine the loops :)

I'm really bad at posting pseudo-code, but you can count the number of
physically-contiguous entries at the beginning of the list only.  So if
everything is contiguous, you use a single non-indirect buffer and save
a kmalloc.  If you use indirect buffers, I suspect it's much less
effective to collapse physically-contiguous entries.  More elaborate
heuristics do need a loop, though.

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH stub] kvm: caching API for interrupts

2012-07-30 Thread Gleb Natapov
On Sun, Jul 29, 2012 at 11:00:58PM +0300, Michael S. Tsirkin wrote:
> I've been looking at adding caching for IRQs so that we don't need to
> scan all VCPUs on each interrupt.  One issue I had a problem with, is
> how the cache structure can be used from both a thread (to fill out the
> cache) and interrupt (to actually send if cache is valid).
I do not get this. Cache should be fill out at the time of use.

> 
> For now just added a lock field in the cache so we don't need to worry
> about this, and with such a lock in place we don't have to worry about
> RCU as cache can be invalidated simply under this lock.
Cache should be invalidated at the time of use if irq routing changed.

> 
> For now this just declares the structure and updates the APIs
> so it's not intended to be applied, but just to give you
> the idea.
> 
> Comments, flames wellcome.
> 
There is no enough meat to discuss here I am afraid.

> Signed-off-by: Michael S. Tsirkin 
> 
> --
> 
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 96c158a..3384b39 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -214,11 +214,28 @@ static inline unsigned long 
> kvm_dirty_bitmap_bytes(struct kvm_memory_slot *memsl
>   return ALIGN(memslot->npages, BITS_PER_LONG) / 8;
>  }
>  
> +struct kvm_irq_cache {
> + spinlock_t lock;
> + bool valid;
> + struct kvm_vcpu *dest;
> + /* For now we only cache lapic irqs */
> + struct kvm_lapic_irq irq;
> + /* Protected by kvm->irq_cache_lock */
> + struct list_head list;
Why are you storing them in the list? This should be embedded structure.

> +};
> +
> +int kvm_set_irq_and_cache(struct kvm *kvm, int irq_source_id, u32 irq, int 
> level,
> +   struct kvm_irq_cache *cache);
> +int kvm_set_irq_cached(struct kvm *kvm, struct kvm_irq_cache *cache);
> +void kvm_irq_cache_init(struct kvm *kvm, struct kvm_irq_cache *cache);
> +void kvm_irq_cache_cleanup(struct kvm_irq_cache *cache);
> +
>  struct kvm_kernel_irq_routing_entry {
>   u32 gsi;
>   u32 type;
>   int (*set)(struct kvm_kernel_irq_routing_entry *e,
> -struct kvm *kvm, int irq_source_id, int level);
> +struct kvm *kvm, int irq_source_id, int level,
> +struct kvm_irq_cache *cache);
>   union {
>   struct {
>   unsigned irqchip;
> @@ -304,6 +321,8 @@ struct kvm {
>   struct kvm_irq_routing_table __rcu *irq_routing;
>   struct hlist_head mask_notifier_list;
>   struct hlist_head irq_ack_notifier_list;
> + struct mutex irq_cache_lock;
> + struct list_head irq_cache_list;
>  #endif
>  
>  #ifdef KVM_ARCH_WANT_MMU_NOTIFIER
> @@ -617,7 +636,8 @@ void kvm_get_intr_delivery_bitmask(struct kvm_ioapic 
> *ioapic,
>  #endif
>  int kvm_set_irq(struct kvm *kvm, int irq_source_id, u32 irq, int level);
>  int kvm_set_msi(struct kvm_kernel_irq_routing_entry *irq_entry, struct kvm 
> *kvm,
> - int irq_source_id, int level);
> + int irq_source_id, int level,
> + struct kvm_irq_cache *cache);
>  void kvm_notify_acked_irq(struct kvm *kvm, unsigned irqchip, unsigned pin);
>  void kvm_register_irq_ack_notifier(struct kvm *kvm,
>  struct kvm_irq_ack_notifier *kian);
> diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
> index 7d7e2aa..9622076 100644
> --- a/virt/kvm/eventfd.c
> +++ b/virt/kvm/eventfd.c
> @@ -138,7 +138,7 @@ irqfd_wakeup(wait_queue_t *wait, unsigned mode, int sync, 
> void *key)
>   irq = rcu_dereference(irqfd->irq_entry);
>   /* An event has been signaled, inject an interrupt */
>   if (irq)
> - kvm_set_msi(irq, kvm, KVM_USERSPACE_IRQ_SOURCE_ID, 1);
> + kvm_set_msi(irq, kvm, KVM_USERSPACE_IRQ_SOURCE_ID, 1, 
> NULL);
>   else
>   schedule_work(&irqfd->inject);
>   rcu_read_unlock();
> diff --git a/virt/kvm/irq_comm.c b/virt/kvm/irq_comm.c
> index 5afb431..d92f3d3 100644
> --- a/virt/kvm/irq_comm.c
> +++ b/virt/kvm/irq_comm.c
> @@ -46,7 +46,8 @@ static inline int kvm_irq_line_state(unsigned long 
> *irq_state,
>  }
>  
>  static int kvm_set_pic_irq(struct kvm_kernel_irq_routing_entry *e,
> -struct kvm *kvm, int irq_source_id, int level)
> +struct kvm *kvm, int irq_source_id, int level,
> +struct kvm_irq_cache *cache)
>  {
>  #ifdef CONFIG_X86
>   struct kvm_pic *pic = pic_irqchip(kvm);
> @@ -59,7 +60,8 @@ static int kvm_set_pic_irq(struct 
> kvm_kernel_irq_routing_entry *e,
>  }
>  
>  static int kvm_set_ioapic_irq(struct kvm_kernel_irq_routing_entry *e,
> -   struct kvm *kvm, int irq_source_id, int level)
> +   struct kvm *kvm, int irq_source_id, int level,
> +   struct kvm_irq_cache *cache)
>  {
>   struct kvm_ioapic *ioa