Re: [PATCH 00/17] RFC: userfault v2

2014-11-11 Thread zhanghailiang

Hi Andrea,

Is there any new about this discussion? ;)

Will you plan to support 'only wrprotect fault' in the userfault API?

Thanks,
zhanghailiang

On 2014/10/30 19:31, zhanghailiang wrote:

On 2014/10/30 1:46, Andrea Arcangeli wrote:

Hi Zhanghailiang,

On Mon, Oct 27, 2014 at 05:32:51PM +0800, zhanghailiang wrote:

Hi Andrea,

Thanks for your hard work on userfault;)

This is really a useful API.

I want to confirm a question:
Can we support distinguishing between writing and reading memory for userfault?
That is, we can decide whether writing a page, reading a page or both trigger 
userfault.

I think this will help supporting vhost-scsi,ivshmem for migration,
we can trace dirty page in userspace.

Actually, i'm trying to relize live memory snapshot based on pre-copy and 
userfault,
but reading memory from migration thread will also trigger userfault.
It will be easy to implement live memory snapshot, if we support configuring
userfault for writing memory only.


Mail is going to be long enough already so I'll just assume tracking
dirty memory in userland (instead of doing it in kernel) is worthy
feature to have here.

After some chat during the KVMForum I've been already thinking it
could be beneficial for some usage to give userland the information
about the fault being read or write, combined with the ability of
mapping pages wrprotected to mcopy_atomic (that would work without
false positives only with MADV_DONTFORK also set, but it's already set
in qemu). That will require "vma->vm_flags & VM_USERFAULT" to be
checked also in the wrprotect faults, not just in the not present
faults, but it's not a massive change. Returning the read/write
information is also a not massive change. This will then payoff mostly
if there's also a way to remove the memory atomically (kind of
remap_anon_pages).

Would that be enough? I mean are you still ok if non present read
fault traps too (you'd be notified it's a read) and you get
notification for both wrprotect and non present faults?


Hi Andrea,

Thanks for your reply, and your patience;)

Er, maybe i didn't describe clearly. What i really need for live memory snapshot
is only wrprotect fault, like kvm's dirty tracing mechanism, *only tracing 
write action*.

My initial solution scheme for live memory snapshot is:
(1) pause VM
(2) using userfaultfd to mark all memory of VM is wrprotect (readonly)
(3) save deivce state to snapshot file
(4) resume VM
(5) snapshot thread begin to save page of memory to snapshot file
(6) VM is going to run, and it is OK for VM or other thread to read ram (no 
fault trap),
 but if VM try to write page (dirty the page), there will be
 a userfault trap notification.
(7) a fault-handle-thread reads the page request from userfaultfd,
 it will copy content of the page to some buffers, and then remove the 
page's
 wrprotect limit(still using the userfaultfd to tell kernel).
(8) after step (7), VM can continue to write the page which is now can be write.
(9) snapshot thread save the page cached in step (7)
(10) repeat step (5)~(9) until all VM's memory is saved to snapshot file.

So, what i need for userfault is supporting only wrprotect fault. i don't
want to get notification for non present reading faults, it will influence
VM's performance and the efficiency of doing snapshot.

Also, i think this feature will benefit for migration of ivshmem and vhost-scsi
which have no dirty-page-tracing now.


The question then is how you mark the memory readonly to let the
wrprotect faults trap if the memory already existed and you didn't map
it yourself in the guest with mcopy_atomic with a readonly flag.

My current plan would be:

- keep MADV_USERFAULT|NOUSERFAULT just to set VM_USERFAULT for the
   fast path check in the not-present and wrprotect page fault

- if VM_USERFAULT is set, find if there's a userfaultfd registered
   into that vma too

 if yes engage userfaultfd protocol

 otherwise raise SIGBUS (single threaded apps should be fine with
 SIGBUS and it'll avoid them to spawn a thread in order to talk the
 userfaultfd protocol)

- if userfaultfd protocol is engaged, return read|write fault + fault
   address to read(ufd) syscalls

- leave the "userfault" resolution mechanism independent of the
   userfaultfd protocol so we keep the two problems separated and we
   don't mix them in the same API which makes it even harder to
   finalize it.

 add mcopy_atomic (with a flag to map the page readonly too)

 The alternative would be to hide mcopy_atomic (and even
 remap_anon_pages in order to "remove" the memory atomically for
 the externalization into the cloud) as userfaultfd commands to
 write into the fd. But then there would be no much point to keep
 MADV_USERFAULT around if I do so and I could just remove it
 too or it doesn't look clean having to open the userfaultfd just
 to issue an hidden mcopy_atomic.

 So it becomes a decision if the basic SIGBUS mode for single

Re: [PATCH v4 4/6] hw_random: fix unregister race.

2014-11-11 Thread Herbert Xu
On Mon, Nov 10, 2014 at 09:47:27PM +0800, Herbert Xu wrote:
> On Mon, Nov 03, 2014 at 11:56:24PM +0800, Amos Kong wrote:
> >
> > @@ -98,6 +99,8 @@ static inline void cleanup_rng(struct kref *kref)
> >  
> > if (rng->cleanup)
> > rng->cleanup(rng);
> 
> You need a compiler barrier here to prevent reordering.

Michael Büsch pointed out that we should actually have a memory
barrier here.  I thought we didn't need it because I was only
worried about the code in my original complaint.  However, expecting
driver writers to use correct synchronisation primitives is surely
asking too much.

So let's add an smp_wmb() here to ensure all side-effects of
cleanup is visible.

Thanks,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 3/6] hw_random: use reference counts on each struct hwrng.

2014-11-11 Thread Rusty Russell
Amos Kong  writes:
> From: Rusty Russell 
>
> current_rng holds one reference, and we bump it every time we want
> to do a read from it.
>
> This means we only hold the rng_mutex to grab or drop a reference,
> so accessing /sys/devices/virtual/misc/hw_random/rng_current doesn't
> block on read of /dev/hwrng.
>
> Using a kref is overkill (we're always under the rng_mutex), but
> a standard pattern.
>
> This also solves the problem that the hwrng_fillfn thread was
> accessing current_rng without a lock, which could change (eg. to NULL)
> underneath it.
>
> v4: decrease last reference for triggering the cleanup

This doesn't make any sense:

> +static void drop_current_rng(void)
> +{
> + struct hwrng *rng = current_rng;
> +
> + BUG_ON(!mutex_is_locked(&rng_mutex));
> + if (!current_rng)
> + return;
> +
> + /* release current_rng reference */
> + kref_put(¤t_rng->ref, cleanup_rng);
> + current_rng = NULL;
> +
> + /* decrease last reference for triggering the cleanup */
> + kref_put(&rng->ref, cleanup_rng);
> +}

Why would it drop the refcount twice?  This doesn't make sense.

Hmm, because you added kref_init, which initializes the reference count
to 1, you created this bug.

Leave out the kref_init, and let it naturally be 0 (until, and if, it
becomes current_rng).  Add a comment if you want.

Thanks,
Rusty.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 4/6] hw_random: fix unregister race.

2014-11-11 Thread Rusty Russell
Amos Kong  writes:
> From: Rusty Russell 
>
> The previous patch added one potential problem: we can still be
> reading from a hwrng when it's unregistered.  Add a wait for zero
> in the hwrng_unregister path.
>
> v4: add cleanup_done flag to insure that cleanup is done

That's a bit weird.  The usual pattern would be to hold a reference
until we're actually finished, but this reference is a bit weird.

We hold the mutex across cleanup, so we could grab that but we have to
take care sleeping inside wait_event, otherwise Peter will have to fix
my code again :)

AFAICT the wake_woken() stuff isn't merged yet, so your patch will
have to do for now.

> @@ -98,6 +99,8 @@ static inline void cleanup_rng(struct kref *kref)
>  
>   if (rng->cleanup)
>   rng->cleanup(rng);
> + rng->cleanup_done = true;
> + wake_up_all(&rng_done);
>  }
>  
>  static void set_current_rng(struct hwrng *rng)
> @@ -536,6 +539,11 @@ void hwrng_unregister(struct hwrng *rng)
>   kthread_stop(hwrng_fill);
>   } else
>   mutex_unlock(&rng_mutex);
> +
> + /* Just in case rng is reading right now, wait. */
> + wait_event(rng_done, rng->cleanup_done &&
> +atomic_read(&rng->ref.refcount) == 0);
> +

The atomic_read() isn't necessary here.

However, you should probably init cleanup_done in hwrng_register().
(Probably noone does unregister then register, but let's be clear).

Thanks,
Rusty.

>  }
>  EXPORT_SYMBOL_GPL(hwrng_unregister);
>  
> diff --git a/include/linux/hw_random.h b/include/linux/hw_random.h
> index c212e71..7832e50 100644
> --- a/include/linux/hw_random.h
> +++ b/include/linux/hw_random.h
> @@ -46,6 +46,7 @@ struct hwrng {
>   /* internal. */
>   struct list_head list;
>   struct kref ref;
> + bool cleanup_done;
>  };
>  
>  /** Register a new Hardware Random Number Generator driver. */
> -- 
> 1.9.3
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH 05/13] KVM: Update IRTE according to guest interrupt configuration changes

2014-11-11 Thread Zhang, Yang Z
Paolo Bonzini wrote on 2014-11-11:
> 
> 
> On 11/11/2014 10:20, Wu, Feng wrote:
>>> Since legacy KVM device assignment is effectively deprecated, have
>>> you considered how we might do this with VFIO?  Thanks,
>> 
>> I haven't thought about how to enable this in VFIO so far. I think I
>> can continue to implement that if needed after this patch set is finished.
> What do you think of this?
> 
> Hi Feng,
> 
> we are not applying new features to legacy KVM device assignment,
> since it is unsafe (it does not honor ACS).

Personally, I think this feature will be helpful to the legacy device 
assignment. Agree, vfio is the right solution for future feature enabling. But 
the old kvm without the good vfio supporting is still used largely today. The 
user really looking for this feature but they will not upgrade their kernel. 
It's easy for us to backport this feature to old kvm with the legacy device 
assignment, but it is impossible to backport the whole vfio. So I think you 
guys can take a consider to add this feature to both vfio and legacy device 
assignment.

> 
> I and Alex can help you with designing a way to interface VFIO with
> KVM posted interrupts.  Give us a few days to study these patches
> more, or feel free to request comments if you have ideas about it yourself.
> 
> Paolo


Best regards,
Yang



Re: compiler bug gcc4.6/4.7 with ACCESS_ONCE and workarounds

2014-11-11 Thread Linus Torvalds
On Tue, Nov 11, 2014 at 4:33 PM, Linus Torvalds
 wrote:
>
> I guess as a workaround it is fine, as long as we don't lose sight of
> trying to eventually do a better job.

Oh, and when it comes to the actual gcc bug - do you have any reason
to believe that it's somehow triggered more easily by something
particular in the arch/s390/kvm/gaccess.c code?

IOW, why does this problem not hit the x86 spinlocks that also use
volatile pointers to aggregate types? Or does it?

   Linus
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: compiler bug gcc4.6/4.7 with ACCESS_ONCE and workarounds

2014-11-11 Thread Linus Torvalds
On Tue, Nov 11, 2014 at 1:16 PM, Christian Borntraeger
 wrote:
>
> Are you ok with the patch as is in kvm/next for the time being or shall
> we revert that and replace it with the .val scheme?

Is that the one that was quoted at the beginning of the thread, that
uses barrier()?

I guess as a workaround it is fine, as long as we don't lose sight of
trying to eventually do a better job.

 Linus
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: nested KVM slower than QEMU with gnumach guest kernel

2014-11-11 Thread Jan Kiszka
On 2014-11-11 19:55, Samuel Thibault wrote:
> Hello,
> 
> jenkins.debian.net is running inside a KVM VM, and it runs nested
> KVM guests for its installation attempts.  This goes fine with Linux
> kernels, but it is extremely slow with gnumach kernels.  I have
> reproduced the issue with my laptop with a linux 3.17 host kernel, a
> 3.16 L1-guest kernel, and an i7-2720QM CPU, with similar results; it's
> actually even slower than letting qemu emulate the CPU... For these
> tests I'm using the following image:
> 
> http://people.debian.org/~sthibault/tmp/netinst.iso
> 
> The reference test here boils down to running qemu -cdrom netinst.iso -m
> 512, choosing the "Automated install" choice, and waiting for "Loading
> additional components" step to complete. (yes, the boot menu gets
> mangled ATM, there's apparently currently a bug between qemu and grub)
> 
> My host is A, my level1-KVM-guest is B.
> 
> KVM:
> A$ qemu -enable-kvm -cdrom netinst.iso -m 512M
> takes ~1 minute.
> 
> QEMU:
> A$ qemu -cdrom netinst.iso -m 512M
> takes ~7 minutes.
> 
> KVM-in-KVM:
> B$ qemu -enable-kvm -cdrom netinst.iso -m 512M
> takes ~10 minutes, when it doesn't gets completely stuck, which is quite
> often, actually...
> 
> QEMU-in-KVM:
> B$ qemu -cdrom netinst.iso -m 512M
> takes ~7 minutes.
> 
> I don't see such horrible slowdown with a linux image.  Is there
> something particular that could explain such a difference?  What tools
> or counters could I use to investigate which area of KVM is getting
> slow?

You can try to catch a trace (ftrace) on the physical host.

I suspect the setup forces a lot of instruction emulation, either on L0
or L1. And that is slower than QEMU is KVM does not optimize like QEMU does.

Jan



signature.asc
Description: OpenPGP digital signature


Re: [PATCH net] Revert "drivers/net: Disable UFO through virtio" in macvtap and tun

2014-11-11 Thread David Miller
From: Ben Hutchings 
Date: Tue, 11 Nov 2014 17:12:58 +

> This reverts commit 88e0e0e5aa722b193c8758c8b45d041de5316924 for
> the tap drivers, but leaves UFO disabled in virtio_net.
> 
> libvirt at least assumes that tap features will never be dropped
> in new kernel versions, and doing so prevents migration of VMs to
> the never kernel version while they are running with virtio net
> devices.
> 
> Fixes: 88e0e0e5aa7a ("drivers/net: Disable UFO through virtio")
> Signed-off-by: Ben Hutchings 
> ---
> Compile-tested only.

After you get some Tested-by:, please resubmit to netdev.

Thanks.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: compiler bug gcc4.6/4.7 with ACCESS_ONCE and workarounds

2014-11-11 Thread Christian Borntraeger
Am 10.11.2014 um 22:07 schrieb Linus Torvalds:
> On Mon, Nov 10, 2014 at 12:18 PM, Christian Borntraeger
>  wrote:
>>
>> Now: I can reproduces belows miscompile on gcc46 and gcc 47
>> gcc 45 seems ok, gcc 48 is fixed.  This makes blacklisting
>> a bit hard, especially since it is not limited to s390, but
>> covers all architectures.
>> In essence ACCESS_ONCE will not work reliably on aggregate
>> types with gcc 4.6 and gcc 4.7.
>> In Linux we seem to use ACCESS_ONCE mostly on scalar types,
>> below code is an example were we dont - and break.
> 
> Hmm. I think we should see how painful it would be to make it a rule
> that ACCESS_ONCE() only works on scalar types.
> 
> Even in the actual code you show as an example, the "fix" is really to
> use the "unsigned long val" member of the union for the ACCESS_ONCE().
> And that seems to be true in many other cases too.

Yes, using the val like in 
-   new = old = ACCESS_ONCE(*ic);
+   new.val = old.val = ACCESS_ONCE(ic->val);

does solve the problem as well. In fact, gcc does create the same binary
code on my 4.7.2.

Are you ok with the patch as is in kvm/next for the time being or shall
we revert that and replace it with the .val scheme?

We can also do the cleanup later on if we manage to get your initial patch
into a shape that works out.

Christian

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Live migration locks up 3.2 guests in do_timer(ticks ~ 500000)

2014-11-11 Thread Matt Mullins
On Tue, Oct 28, 2014 at 10:04:10AM +0100, Paolo Bonzini wrote:
> can you test using QEMU from git://git.qemu-project.org/qemu.git?

That seems to work great, yes.

Looking through the commit history, I see:
  kvmclock: Ensure time in migration never goes backward
  kvmclock: Ensure proper env->tsc value for kvmclock_current_nsec 
calculation

Assuming those are the right fixes for this issue, are they suitable to request
backported to distros' qemu 2.0 branches?  The merge commit for them seemed to
indicate they were problematic at first.

On second thought - I found the qemu-devel threads about reverting them in the
2.1 timeframe, so I'm going to do a little more research before I start trying
to suggest how to fix for existing install-base.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


nested KVM slower than QEMU with gnumach guest kernel

2014-11-11 Thread Samuel Thibault
Hello,

jenkins.debian.net is running inside a KVM VM, and it runs nested
KVM guests for its installation attempts.  This goes fine with Linux
kernels, but it is extremely slow with gnumach kernels.  I have
reproduced the issue with my laptop with a linux 3.17 host kernel, a
3.16 L1-guest kernel, and an i7-2720QM CPU, with similar results; it's
actually even slower than letting qemu emulate the CPU... For these
tests I'm using the following image:

http://people.debian.org/~sthibault/tmp/netinst.iso

The reference test here boils down to running qemu -cdrom netinst.iso -m
512, choosing the "Automated install" choice, and waiting for "Loading
additional components" step to complete. (yes, the boot menu gets
mangled ATM, there's apparently currently a bug between qemu and grub)

My host is A, my level1-KVM-guest is B.

KVM:
A$ qemu -enable-kvm -cdrom netinst.iso -m 512M
takes ~1 minute.

QEMU:
A$ qemu -cdrom netinst.iso -m 512M
takes ~7 minutes.

KVM-in-KVM:
B$ qemu -enable-kvm -cdrom netinst.iso -m 512M
takes ~10 minutes, when it doesn't gets completely stuck, which is quite
often, actually...

QEMU-in-KVM:
B$ qemu -cdrom netinst.iso -m 512M
takes ~7 minutes.

I don't see such horrible slowdown with a linux image.  Is there
something particular that could explain such a difference?  What tools
or counters could I use to investigate which area of KVM is getting
slow?

Samuel
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: TUN_F_UFO change breaks live migration

2014-11-11 Thread Ben Hutchings
On Tue, 2014-11-11 at 17:57 +0200, Michael S. Tsirkin wrote:
> On Tue, Nov 11, 2014 at 12:17:26PM +, Ben Hutchings wrote:
> > On Tue, 2014-11-11 at 10:58 +, Stefan Hajnoczi wrote:
> > > Commit 3d0ad09412ffe00c9afa201d01effdb6023d09b4 ("drivers/net: Disable
> > > UFO through virtio") breaks live migration of KVM guests from older to
> > > newer host kernels:
> > > 
> > > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=3d0ad09412ffe00c9afa201d01effdb6023d09b4
> > > 
> > > The problem occurs when a guest running on a host kernel without commit
> > > 3d0ad0941 in tun.ko attempts to live migration to a host with commit
> > > 3d0ad0941.
> > > 
> > > Live migration fails in QEMU with the following error message:
> > > 
> > >   virtio-net: saved image requires TUN_F_UFO support
> > > 
> > > The old host tun.ko advertised support for TUN_F_UFO.  The new host
> > > tun.ko does not and that's why QEMU aborts live migration.  QEMU cannot
> > > change the features of a running virtio-net device.
> > 
> > Yes, this is known and was mentioned in the DSA.
> > 
> > > tuxcrafter provided logs from two Debian hosts migrating from
> > > 3.2.60-1+deb7u3 to 3.2.63-2+deb7u1:
> > > 
> > > http://paste.debian.net/131264/
> > > 
> > > I haven't investigated enough to suggest a fix, just wanted to bring it
> > > to your attention.  Soon a lot of people will be hitting this problem as
> > > they upgrade their infrastructure and migrate guests - seems like a
> > > critical issue.
> > 
> > You can work around this by making macvtap and tun still claim to
> > support UFO.
> 
> If this is what we want userspace to do, let's just put the
> feature flag back?

Let's not *just* put the feature flag back.  I accept this is probably
needed as a workaround, but UFO/IPv6 still won't work correctly over
virtio.

> Basically userspace assumed that features will only
> ever be added, never removed, so this change is
> breaking it.

OK.

> >  They continue to support it even if it's not advertised
> > because the tap features don't reliably get propagated to virtio
> > devices.
> > 
> > Ben.
> 
> Hmm I don't understand this last sentence.
> features are actually reliably propagated to virtio devices.

They might be when using current QEMU and libvirt on the host.  They
weren't when I tested on Debian stable.  The warnings about 'using
disabled UFO feature' are reliably triggered if the host's tap driver is
patched and the guest's virtio_net driver is not.

Ben.

-- 
Ben Hutchings
Experience is directly proportional to the value of equipment destroyed.
 - Carolyn Scheppner


signature.asc
Description: This is a digitally signed message part


[PATCH net] Revert "drivers/net: Disable UFO through virtio" in macvtap and tun

2014-11-11 Thread Ben Hutchings
This reverts commit 88e0e0e5aa722b193c8758c8b45d041de5316924 for
the tap drivers, but leaves UFO disabled in virtio_net.

libvirt at least assumes that tap features will never be dropped
in new kernel versions, and doing so prevents migration of VMs to
the never kernel version while they are running with virtio net
devices.

Fixes: 88e0e0e5aa7a ("drivers/net: Disable UFO through virtio")
Signed-off-by: Ben Hutchings 
---
Compile-tested only.

Ben.

 drivers/net/macvtap.c | 13 -
 drivers/net/tun.c | 19 ---
 2 files changed, 16 insertions(+), 16 deletions(-)

diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
index 6f226de..aeaeb6d 100644
--- a/drivers/net/macvtap.c
+++ b/drivers/net/macvtap.c
@@ -66,7 +66,7 @@ static struct cdev macvtap_cdev;
 static const struct proto_ops macvtap_socket_ops;
 
 #define TUN_OFFLOADS (NETIF_F_HW_CSUM | NETIF_F_TSO_ECN | NETIF_F_TSO | \
- NETIF_F_TSO6)
+ NETIF_F_TSO6 | NETIF_F_UFO)
 #define RX_OFFLOADS (NETIF_F_GRO | NETIF_F_LRO)
 #define TAP_FEATURES (NETIF_F_GSO | NETIF_F_SG)
 
@@ -570,8 +570,6 @@ static int macvtap_skb_from_vnet_hdr(struct sk_buff *skb,
gso_type = SKB_GSO_TCPV6;
break;
case VIRTIO_NET_HDR_GSO_UDP:
-   pr_warn_once("macvtap: %s: using disabled UFO feature; 
please fix this program\n",
-current->comm);
gso_type = SKB_GSO_UDP;
if (skb->protocol == htons(ETH_P_IPV6))
ipv6_proxy_select_ident(skb);
@@ -619,6 +617,8 @@ static void macvtap_skb_to_vnet_hdr(const struct sk_buff 
*skb,
vnet_hdr->gso_type = VIRTIO_NET_HDR_GSO_TCPV4;
else if (sinfo->gso_type & SKB_GSO_TCPV6)
vnet_hdr->gso_type = VIRTIO_NET_HDR_GSO_TCPV6;
+   else if (sinfo->gso_type & SKB_GSO_UDP)
+   vnet_hdr->gso_type = VIRTIO_NET_HDR_GSO_UDP;
else
BUG();
if (sinfo->gso_type & SKB_GSO_TCP_ECN)
@@ -953,6 +953,9 @@ static int set_offload(struct macvtap_queue *q, unsigned 
long arg)
if (arg & TUN_F_TSO6)
feature_mask |= NETIF_F_TSO6;
}
+
+   if (arg & TUN_F_UFO)
+   feature_mask |= NETIF_F_UFO;
}
 
/* tun/tap driver inverts the usage for TSO offloads, where
@@ -963,7 +966,7 @@ static int set_offload(struct macvtap_queue *q, unsigned 
long arg)
 * When user space turns off TSO, we turn off GSO/LRO so that
 * user-space will not receive TSO frames.
 */
-   if (feature_mask & (NETIF_F_TSO | NETIF_F_TSO6))
+   if (feature_mask & (NETIF_F_TSO | NETIF_F_TSO6 | NETIF_F_UFO))
features |= RX_OFFLOADS;
else
features &= ~RX_OFFLOADS;
@@ -1064,7 +1067,7 @@ static long macvtap_ioctl(struct file *file, unsigned int 
cmd,
case TUNSETOFFLOAD:
/* let the user check for future flags */
if (arg & ~(TUN_F_CSUM | TUN_F_TSO4 | TUN_F_TSO6 |
-   TUN_F_TSO_ECN))
+   TUN_F_TSO_ECN | TUN_F_UFO))
return -EINVAL;
 
rtnl_lock();
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 7302398..a0987d1 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -175,7 +175,7 @@ struct tun_struct {
struct net_device   *dev;
netdev_features_t   set_features;
 #define TUN_USER_FEATURES (NETIF_F_HW_CSUM|NETIF_F_TSO_ECN|NETIF_F_TSO| \
- NETIF_F_TSO6)
+ NETIF_F_TSO6|NETIF_F_UFO)
 
int vnet_hdr_sz;
int sndbuf;
@@ -1152,20 +1152,10 @@ static ssize_t tun_get_user(struct tun_struct *tun, 
struct tun_file *tfile,
skb_shinfo(skb)->gso_type = SKB_GSO_TCPV6;
break;
case VIRTIO_NET_HDR_GSO_UDP:
-   {
-   static bool warned;
-
-   if (!warned) {
-   warned = true;
-   netdev_warn(tun->dev,
-   "%s: using disabled UFO feature; 
please fix this program\n",
-   current->comm);
-   }
skb_shinfo(skb)->gso_type = SKB_GSO_UDP;
if (skb->protocol == htons(ETH_P_IPV6))
ipv6_proxy_select_ident(skb);
break;
-   }
default:
tun->dev->stats.rx_frame_errors++;
kfree_skb(skb);
@@ -1265,6 +1255,8 @@ static ssize_t tun_put_user(struct tun_struct *tun,

Re: TUN_F_UFO change breaks live migration

2014-11-11 Thread David Miller
From: "Michael S. Tsirkin" 
Date: Tue, 11 Nov 2014 17:57:50 +0200

> Basically userspace assumed that features will only
> ever be added, never removed, so this change is
> breaking it.

I essentially agree.

We can't just toss feature bits like this which have been present for so
long.

"There is no way I can figure out how to easily make it work" is not an
excuse for breaking existing userspace like this.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: TUN_F_UFO change breaks live migration

2014-11-11 Thread Michael S. Tsirkin
On Tue, Nov 11, 2014 at 12:17:26PM +, Ben Hutchings wrote:
> On Tue, 2014-11-11 at 10:58 +, Stefan Hajnoczi wrote:
> > Commit 3d0ad09412ffe00c9afa201d01effdb6023d09b4 ("drivers/net: Disable
> > UFO through virtio") breaks live migration of KVM guests from older to
> > newer host kernels:
> > 
> > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=3d0ad09412ffe00c9afa201d01effdb6023d09b4
> > 
> > The problem occurs when a guest running on a host kernel without commit
> > 3d0ad0941 in tun.ko attempts to live migration to a host with commit
> > 3d0ad0941.
> > 
> > Live migration fails in QEMU with the following error message:
> > 
> >   virtio-net: saved image requires TUN_F_UFO support
> > 
> > The old host tun.ko advertised support for TUN_F_UFO.  The new host
> > tun.ko does not and that's why QEMU aborts live migration.  QEMU cannot
> > change the features of a running virtio-net device.
> 
> Yes, this is known and was mentioned in the DSA.
> 
> > tuxcrafter provided logs from two Debian hosts migrating from
> > 3.2.60-1+deb7u3 to 3.2.63-2+deb7u1:
> > 
> > http://paste.debian.net/131264/
> > 
> > I haven't investigated enough to suggest a fix, just wanted to bring it
> > to your attention.  Soon a lot of people will be hitting this problem as
> > they upgrade their infrastructure and migrate guests - seems like a
> > critical issue.
> 
> You can work around this by making macvtap and tun still claim to
> support UFO.

If this is what we want userspace to do, let's just put the
feature flag back?


Basically userspace assumed that features will only
ever be added, never removed, so this change is
breaking it.

>  They continue to support it even if it's not advertised
> because the tap features don't reliably get propagated to virtio
> devices.
> 
> Ben.

Hmm I don't understand this last sentence.
features are actually reliably propagated to virtio devices.


> -- 
> Ben Hutchings
> Experience is directly proportional to the value of equipment destroyed.
>  - Carolyn Scheppner


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[GIT PULL] KVM fixes for 3.18-rc5

2014-11-11 Thread Paolo Bonzini
Linus,

The following changes since commit 0df1f2487d2f0d04703f142813d53615d62a1da4:

  Linux 3.18-rc3 (2014-11-02 15:01:51 -0800)

are available in the git repository at:

  git://git.kernel.org/pub/scm/virt/kvm/kvm.git tags/for-linus

for you to fetch changes up to d29b9d7ed76c0b961603ca692b8a562556a20212:

  KVM: x86: Fix uninitialized op->type for some immediate values (2014-11-05 
12:36:58 +0100)


Two fixes---one of them not exactly a one liner, but things are
calming down on the KVM front at last.


Nadav Amit (1):
  KVM: x86: Fix uninitialized op->type for some immediate values

Paolo Bonzini (1):
  Merge tag 'kvm-s390-20141103' of git://git.kernel.org/.../kvms390/linux 
into kvm-master

Sebastian Ott (1):
  KVM: s390: virtio_ccw: remove unused variable

 arch/x86/kvm/emulate.c| 8 
 drivers/s390/kvm/virtio_ccw.c | 1 -
 2 files changed, 8 insertions(+), 1 deletion(-)
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: KVM call for agenda for 2014-11-11

2014-11-11 Thread Juan Quintela
Juan Quintela  wrote:
> Hi
>
> Please, send any topic that you are interested in covering.

As there are no agenda, call is cancelled.

Have a nice day.

Later, Juan.


>
>  Thanks, Juan.
>
>  Call details:
>
>  15:00 CEST
>  13:00 UTC
>  09:00 EDT
>
>  Every two weeks
>
> By popular demand, a google calendar public entry with it
>
>  
> https://www.google.com/calendar/embed?src=dG9iMXRqcXAzN3Y4ZXZwNzRoMHE4a3BqcXNAZ3JvdXAuY2FsZW5kYXIuZ29vZ2xlLmNvbQ
>
>  (Let me know if you have any problems with the calendar entry)
>
> If you need phone number details,  contact me privately
>
> Thanks, Juan.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 12/13] iommu/vt-d: No need to migrating irq for VT-d Posted-Interrtups

2014-11-11 Thread Jiang Liu


On 2014/11/10 14:26, Feng Wu wrote:
> We don't need to migrate the irqs for VT-d Posted-Interrtups here.
> When 'pst' is set in IRTE, the associated irq will be posted to
> guests instead of interrupt remapping. The destination of the
> interrupt is set in Posted-Interrupts Descriptor, and the migration
> happens during VCPU scheduling.
> 
> Signed-off-by: Feng Wu 
> ---
>  drivers/iommu/intel_irq_remapping.c |7 +++
>  1 files changed, 7 insertions(+), 0 deletions(-)
> 
> diff --git a/drivers/iommu/intel_irq_remapping.c 
> b/drivers/iommu/intel_irq_remapping.c
> index 87c02fe..249e2b1 100644
> --- a/drivers/iommu/intel_irq_remapping.c
> +++ b/drivers/iommu/intel_irq_remapping.c
> @@ -1038,6 +1038,13 @@ intel_ioapic_set_affinity(struct irq_data *data, const 
> struct cpumask *mask,
>   if (get_irte(irq, &irte))
>   return -EBUSY;
>  
> + /*
> +  * If the interrupt is for posting, it is used by guests,
> +  * we cannot change IRTE here.
> +  */
> + if (irte.irq_post_low.pst == 1)
> + return 0;
Hi Feng,
You should return some error code instead of 0, otherwise the
irq core will get confused.

> +
>   err = assign_irq_vector(irq, cfg, mask);
>   if (err)
>   return err;
> 
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 04/13] iommu/vt-d: Adjust 'struct irte' to better suit for VT-d Posted-Interrupts

2014-11-11 Thread Jiang Liu
Hi Feng,
Other than this solution, how about introducing new
struct irte_pi for posted interrupt?

On 2014/11/10 14:26, Feng Wu wrote:
> This patch adjusts the definition of 'struct irte', so that we can
> add the VT-d Posted-Interrtups format in this structure later.
> 
> Signed-off-by: Feng Wu 
> ---
>  drivers/iommu/intel_irq_remapping.c |   35 
> +++
>  include/linux/dmar.h|4 ++--
>  2 files changed, 21 insertions(+), 18 deletions(-)
> 
> diff --git a/drivers/iommu/intel_irq_remapping.c 
> b/drivers/iommu/intel_irq_remapping.c
> index f99f0f1..776da10 100644
> --- a/drivers/iommu/intel_irq_remapping.c
> +++ b/drivers/iommu/intel_irq_remapping.c
> @@ -310,9 +310,9 @@ static void set_irte_sid(struct irte *irte, unsigned int 
> svt,
>  {
>   if (disable_sourceid_checking)
>   svt = SVT_NO_VERIFY;
> - irte->svt = svt;
> - irte->sq = sq;
> - irte->sid = sid;
> + irte->irq_remap_high.svt = svt;
> + irte->irq_remap_high.sq = sq;
> + irte->irq_remap_high.sid = sid;
>  }
>  
>  static int set_ioapic_sid(struct irte *irte, int apic)
> @@ -917,8 +917,8 @@ static void prepare_irte(struct irte *irte, int vector,
>  {
>   memset(irte, 0, sizeof(*irte));
>  
> - irte->present = 1;
> - irte->dst_mode = apic->irq_dest_mode;
> + irte->irq_remap_low.present = 1;
> + irte->irq_remap_low.dst_mode = apic->irq_dest_mode;
>   /*
>* Trigger mode in the IRTE will always be edge, and for IO-APIC, the
>* actual level or edge trigger will be setup in the IO-APIC
> @@ -926,11 +926,11 @@ static void prepare_irte(struct irte *irte, int vector,
>* For more details, see the comments (in io_apic.c) explainig IO-APIC
>* irq migration in the presence of interrupt-remapping.
>   */
> - irte->trigger_mode = 0;
> - irte->dlvry_mode = apic->irq_delivery_mode;
> - irte->vector = vector;
> - irte->dest_id = IRTE_DEST(dest);
> - irte->redir_hint = 1;
> + irte->irq_remap_low.trigger_mode = 0;
> + irte->irq_remap_low.dlvry_mode = apic->irq_delivery_mode;
> + irte->irq_remap_low.vector = vector;
> + irte->irq_remap_low.dest_id = IRTE_DEST(dest);
> + irte->irq_remap_low.redir_hint = 1;
>  }
>  
>  static int intel_setup_ioapic_entry(int irq,
> @@ -973,10 +973,13 @@ static int intel_setup_ioapic_entry(int irq,
>   "Redir_hint:%d Trig_Mode:%d Dlvry_Mode:%X "
>   "Avail:%X Vector:%02X Dest:%08X "
>   "SID:%04X SQ:%X SVT:%X)\n",
> - attr->ioapic, irte.present, irte.fpd, irte.dst_mode,
> - irte.redir_hint, irte.trigger_mode, irte.dlvry_mode,
> - irte.avail, irte.vector, irte.dest_id,
> - irte.sid, irte.sq, irte.svt);
> + attr->ioapic, irte.irq_remap_low.present,
> + irte.irq_remap_low.fpd, irte.irq_remap_low.dst_mode,
> + irte.irq_remap_low.redir_hint, irte.irq_remap_low.trigger_mode,
> + irte.irq_remap_low.dlvry_mode, irte.irq_remap_low.avail,
> + irte.irq_remap_low.vector, irte.irq_remap_low.dest_id,
> + irte.irq_remap_high.sid, irte.irq_remap_high.sq,
> + irte.irq_remap_high.svt);
>  
>   entry = (struct IR_IO_APIC_route_entry *)route_entry;
>   memset(entry, 0, sizeof(*entry));
> @@ -1046,8 +1049,8 @@ intel_ioapic_set_affinity(struct irq_data *data, const 
> struct cpumask *mask,
>   return err;
>   }
>  
> - irte.vector = cfg->vector;
> - irte.dest_id = IRTE_DEST(dest);
> + irte.irq_remap_low.vector = cfg->vector;
> + irte.irq_remap_low.dest_id = IRTE_DEST(dest);
>  
>   /*
>* Atomically updates the IRTE with the new destination, vector
> diff --git a/include/linux/dmar.h b/include/linux/dmar.h
> index 593fff9..8be5d42 100644
> --- a/include/linux/dmar.h
> +++ b/include/linux/dmar.h
> @@ -159,7 +159,7 @@ struct irte {
>   vector  : 8,
>   __reserved_2: 8,
>   dest_id : 32;
> - };
> + } irq_remap_low;
>   __u64 low;
>   };
>  
> @@ -169,7 +169,7 @@ struct irte {
>   sq  : 2,
>   svt : 2,
>   __reserved_3: 44;
> - };
> + } irq_remap_high;
>   __u64 high;
>   };
>  };
> 
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 01/13] iommu/vt-d: VT-d Posted-Interrupts feature detection

2014-11-11 Thread Jiang Liu


On 2014/11/10 14:26, Feng Wu wrote:
> VT-d Posted-Interrupts is an enhancement to CPU side Posted-Interrupt.
> With VT-d Posted-Interrupts enabled, external interrupts from
> direct-assigned devices can be delivered to guests without VMM
> intervention when guest is running in non-root mode.
> 
> This patch adds feature detection logic for VT-d posted-interrupt.
> 
> Signed-off-by: Feng Wu 
> ---
>  drivers/iommu/intel_irq_remapping.c |   13 +
>  drivers/iommu/irq_remapping.c   |4 
>  drivers/iommu/irq_remapping.h   |5 +
>  include/linux/intel-iommu.h |1 +
>  4 files changed, 23 insertions(+), 0 deletions(-)
> 
> diff --git a/drivers/iommu/intel_irq_remapping.c 
> b/drivers/iommu/intel_irq_remapping.c
> index 7c80661..f99f0f1 100644
> --- a/drivers/iommu/intel_irq_remapping.c
> +++ b/drivers/iommu/intel_irq_remapping.c
> @@ -580,6 +580,19 @@ static int __init intel_irq_remapping_supported(void)
>   if (!ecap_ir_support(iommu->ecap))
>   return 0;
>  
> + /* VT-d posted-interrupt feature detection*/
> + if (disable_irq_post == 0)
> + for_each_drhd_unit(drhd) {
> + struct intel_iommu *iommu = drhd->iommu;
Hi Feng,
You may use for_each_active_iommu() here.
Regards!
Gerry

> +
> + if (!cap_pi_support(iommu->cap)) {
> + irq_post_enabled = 0;
> + disable_irq_post = 1;
> + break;
> + }
> + irq_post_enabled = 1;
> + }
> +
>   return 1;
>  }
>  
> diff --git a/drivers/iommu/irq_remapping.c b/drivers/iommu/irq_remapping.c
> index 74a1767..2f8ee00 100644
> --- a/drivers/iommu/irq_remapping.c
> +++ b/drivers/iommu/irq_remapping.c
> @@ -23,6 +23,10 @@ int irq_remap_broken;
>  int disable_sourceid_checking;
>  int no_x2apic_optout;
>  
> +int disable_irq_post = 1;
> +int irq_post_enabled = 0;
> +EXPORT_SYMBOL_GPL(irq_post_enabled);
> +
>  static struct irq_remap_ops *remap_ops;
>  
>  static int msi_alloc_remapped_irq(struct pci_dev *pdev, int irq, int nvec);
> diff --git a/drivers/iommu/irq_remapping.h b/drivers/iommu/irq_remapping.h
> index fde250f..7bb5913 100644
> --- a/drivers/iommu/irq_remapping.h
> +++ b/drivers/iommu/irq_remapping.h
> @@ -37,6 +37,9 @@ extern int disable_sourceid_checking;
>  extern int no_x2apic_optout;
>  extern int irq_remapping_enabled;
>  
> +extern int disable_irq_post;
> +extern int irq_post_enabled;
> +
>  struct irq_remap_ops {
>   /* Check whether Interrupt Remapping is supported */
>   int (*supported)(void);
> @@ -91,6 +94,8 @@ extern struct irq_remap_ops amd_iommu_irq_ops;
>  #define irq_remapping_enabled 0
>  #define disable_irq_remap 1
>  #define irq_remap_broken  0
> +#define disable_irq_post  1
> +#define irq_post_enabled  0
>  
>  #endif /* CONFIG_IRQ_REMAP */
>  
> diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
> index a65208a..5b1a124 100644
> --- a/include/linux/intel-iommu.h
> +++ b/include/linux/intel-iommu.h
> @@ -87,6 +87,7 @@ static inline void dmar_writeq(void __iomem *addr, u64 val)
>  /*
>   * Decoding Capability Register
>   */
> +#define cap_pi_support(c)(((c) >> 59) & 1)
>  #define cap_read_drain(c)(((c) >> 55) & 1)
>  #define cap_write_drain(c)   (((c) >> 54) & 1)
>  #define cap_max_amask_val(c) (((c) >> 48) & 0x3f)
> 
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 02/13] KVM: Initialize VT-d Posted-Interrtups Descriptor

2014-11-11 Thread Jiang Liu
On 2014/11/10 14:26, Feng Wu wrote:
> This patch initialize the VT-d Posted-interrupt Descritpor.
> 
> Signed-off-by: Feng Wu 
> ---
>  arch/x86/include/asm/irq_remapping.h |1 +
>  arch/x86/kernel/apic/apic.c  |1 +
>  arch/x86/kvm/vmx.c   |   56 -
>  3 files changed, 56 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/include/asm/irq_remapping.h 
> b/arch/x86/include/asm/irq_remapping.h
> index b7747c4..a3cc437 100644
> --- a/arch/x86/include/asm/irq_remapping.h
> +++ b/arch/x86/include/asm/irq_remapping.h
> @@ -57,6 +57,7 @@ extern bool setup_remapped_irq(int irq,
>  struct irq_chip *chip);
>  
>  void irq_remap_modify_chip_defaults(struct irq_chip *chip);
> +extern int irq_post_enabled;
>  
>  #else  /* CONFIG_IRQ_REMAP */
>  
> diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
> index ba6cc04..987408d 100644
> --- a/arch/x86/kernel/apic/apic.c
> +++ b/arch/x86/kernel/apic/apic.c
> @@ -162,6 +162,7 @@ __setup("apicpmtimer", setup_apicpmtimer);
>  #endif
>  
>  int x2apic_mode;
> +EXPORT_SYMBOL_GPL(x2apic_mode);
>  #ifdef CONFIG_X86_X2APIC
>  /* x2apic enabled before OS handover */
>  int x2apic_preenabled;
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 3e556c6..a4670d3 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -45,6 +45,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include "trace.h"
>  
> @@ -408,13 +409,32 @@ struct nested_vmx {
>  };
>  
>  #define POSTED_INTR_ON  0
> +#define POSTED_INTR_SN  1
> +
>  /* Posted-Interrupt Descriptor */
>  struct pi_desc {
>   u32 pir[8]; /* Posted interrupt requested */
> - u32 control;/* bit 0 of control is outstanding notification bit */
> - u32 rsvd[7];
> + union {
> + struct {
> + u64 on  : 1,
> + sn  : 1,
> + rsvd_1  : 13,
> + ndm : 1,
> + nv  : 8,
> + rsvd_2  : 8,
> + ndst: 32;
> + };
> + u64 control;
> + };
> + u32 rsvd[6];
>  } __aligned(64);
>  
> +static void pi_clear_sn(struct pi_desc *pi_desc)
> +{
> + return clear_bit(POSTED_INTR_SN,
> + (unsigned long *)&pi_desc->control);
> +}
> +
>  static bool pi_test_and_set_on(struct pi_desc *pi_desc)
>  {
>   return test_and_set_bit(POSTED_INTR_ON,
> @@ -4396,6 +4416,33 @@ static void ept_set_mmio_spte_mask(void)
>   kvm_mmu_set_mmio_spte_mask((0x3ull << 62) | 0x6ull);
>  }
>  
> +static bool pi_desc_init(struct vcpu_vmx *vmx)
> +{
> + unsigned int dest;
> +
> + if (irq_post_enabled == 0)
> + return true;
> +
> + /*
> +  * Initialize Posted-Interrupt Descriptor
> +  */
> +
> + pi_clear_sn(&vmx->pi_desc);
> + vmx->pi_desc.nv = POSTED_INTR_VECTOR;
> +
> + /* Physical mode for Notificaiton Event */
> + vmx->pi_desc.ndm = 0;
> + dest = cpu_physical_id(vmx->vcpu.cpu);
> +
> + if (x2apic_mode)
Hi Feng,
Could you try to use x2apic_enabled() here so you don't
need to export x2apic_mode?
Regards!
Gerry
> + vmx->pi_desc.ndst = dest;
> + else
> + vmx->pi_desc.ndst = (dest << 8) & 0xFF00;
> +
> + return true;
> +}
> +
> +
>  /*
>   * Sets up the vmcs for emulated real mode.
>   */
> @@ -4439,6 +4486,11 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx)
>  
>   vmcs_write64(POSTED_INTR_NV, POSTED_INTR_VECTOR);
>   vmcs_write64(POSTED_INTR_DESC_ADDR, __pa((&vmx->pi_desc)));
> +
> + if (!pi_desc_init(vmx)) {
> + printk(KERN_ERR "Initialize PI descriptor error!\n");
> + return 1;
> + }
>   }
>  
>   if (ple_gap) {
> 
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH 05/13] KVM: Update IRTE according to guest interrupt configuration changes

2014-11-11 Thread Wu, Feng


> -Original Message-
> From: Paolo Bonzini [mailto:pbonz...@redhat.com]
> Sent: Tuesday, November 11, 2014 7:02 PM
> To: Wu, Feng; Alex Williamson
> Cc: g...@kernel.org; dw...@infradead.org; j...@8bytes.org;
> t...@linutronix.de; mi...@redhat.com; h...@zytor.com; x...@kernel.org;
> kvm@vger.kernel.org; io...@lists.linux-foundation.org;
> linux-ker...@vger.kernel.org
> Subject: Re: [PATCH 05/13] KVM: Update IRTE according to guest interrupt
> configuration changes
> 
> 
> 
> On 11/11/2014 10:20, Wu, Feng wrote:
> > > Since legacy KVM device assignment is effectively deprecated, have you
> > > considered how we might do this with VFIO?  Thanks,
> >
> > I haven't thought about how to enable this in VFIO so far. I think I can 
> > continue
> to
> > implement that if needed after this patch set is finished. What do you 
> > think of
> this?
> 
> Hi Feng,
> 
> we are not applying new features to legacy KVM device assignment, since
> it is unsafe (it does not honor ACS).
> 
> I and Alex can help you with designing a way to interface VFIO with KVM
> posted interrupts.  Give us a few days to study these patches more, or
> feel free to request comments if you have ideas about it yourself.
> 
> Paolo

Okay, then I will put some efforts on getting familiar with VFIO mechanism. If
You have any questions about these patches, we can discuss it together.

Thanks,
Feng
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH 05/13] KVM: Update IRTE according to guest interrupt configuration changes

2014-11-11 Thread Wu, Feng


> -Original Message-
> From: Paolo Bonzini [mailto:pbonz...@redhat.com]
> Sent: Tuesday, November 11, 2014 7:02 PM
> To: Wu, Feng; Alex Williamson
> Cc: g...@kernel.org; dw...@infradead.org; j...@8bytes.org;
> t...@linutronix.de; mi...@redhat.com; h...@zytor.com; x...@kernel.org;
> kvm@vger.kernel.org; io...@lists.linux-foundation.org;
> linux-ker...@vger.kernel.org
> Subject: Re: [PATCH 05/13] KVM: Update IRTE according to guest interrupt
> configuration changes
> 
> 
> 
> On 11/11/2014 10:20, Wu, Feng wrote:
> > > Since legacy KVM device assignment is effectively deprecated, have you
> > > considered how we might do this with VFIO?  Thanks,
> >
> > I haven't thought about how to enable this in VFIO so far. I think I can 
> > continue
> to
> > implement that if needed after this patch set is finished. What do you 
> > think of
> this?
> 
> Hi Feng,
> 
> we are not applying new features to legacy KVM device assignment, since
> it is unsafe (it does not honor ACS).
> 
> I and Alex can help you with designing a way to interface VFIO with KVM
> posted interrupts.  Give us a few days to study these patches more, or
> feel free to request comments if you have ideas about it yourself.
> 
> Paolo

Okay, then I will put some efforts on getting familiar with VFIO mechanism. If
You have any questions about these patches, we can discuss it together.

Thanks,
Feng
N�r��yb�X��ǧv�^�)޺{.n�+h����ܨ}���Ơz�&j:+v���zZ+��+zf���h���~i���z��w���?�&�)ߢf

Re: TUN_F_UFO change breaks live migration

2014-11-11 Thread Ben Hutchings
On Tue, 2014-11-11 at 10:58 +, Stefan Hajnoczi wrote:
> Commit 3d0ad09412ffe00c9afa201d01effdb6023d09b4 ("drivers/net: Disable
> UFO through virtio") breaks live migration of KVM guests from older to
> newer host kernels:
> 
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=3d0ad09412ffe00c9afa201d01effdb6023d09b4
> 
> The problem occurs when a guest running on a host kernel without commit
> 3d0ad0941 in tun.ko attempts to live migration to a host with commit
> 3d0ad0941.
> 
> Live migration fails in QEMU with the following error message:
> 
>   virtio-net: saved image requires TUN_F_UFO support
> 
> The old host tun.ko advertised support for TUN_F_UFO.  The new host
> tun.ko does not and that's why QEMU aborts live migration.  QEMU cannot
> change the features of a running virtio-net device.

Yes, this is known and was mentioned in the DSA.

> tuxcrafter provided logs from two Debian hosts migrating from
> 3.2.60-1+deb7u3 to 3.2.63-2+deb7u1:
> 
> http://paste.debian.net/131264/
> 
> I haven't investigated enough to suggest a fix, just wanted to bring it
> to your attention.  Soon a lot of people will be hitting this problem as
> they upgrade their infrastructure and migrate guests - seems like a
> critical issue.

You can work around this by making macvtap and tun still claim to
support UFO.  They continue to support it even if it's not advertised
because the tap features don't reliably get propagated to virtio
devices.

Ben.

-- 
Ben Hutchings
Experience is directly proportional to the value of equipment destroyed.
 - Carolyn Scheppner


signature.asc
Description: This is a digitally signed message part


Re: [PATCH 05/13] KVM: Update IRTE according to guest interrupt configuration changes

2014-11-11 Thread Paolo Bonzini


On 11/11/2014 10:20, Wu, Feng wrote:
> > Since legacy KVM device assignment is effectively deprecated, have you
> > considered how we might do this with VFIO?  Thanks,
> 
> I haven't thought about how to enable this in VFIO so far. I think I can 
> continue to
> implement that if needed after this patch set is finished. What do you think 
> of this?

Hi Feng,

we are not applying new features to legacy KVM device assignment, since
it is unsafe (it does not honor ACS).

I and Alex can help you with designing a way to interface VFIO with KVM
posted interrupts.  Give us a few days to study these patches more, or
feel free to request comments if you have ideas about it yourself.

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH 05/13] KVM: Update IRTE according to guest interrupt configuration changes

2014-11-11 Thread Wu, Feng


> -Original Message-
> From: Alex Williamson [mailto:alex.william...@redhat.com]
> Sent: Tuesday, November 11, 2014 5:58 AM
> To: Wu, Feng
> Cc: g...@kernel.org; pbonz...@redhat.com; dw...@infradead.org;
> j...@8bytes.org; t...@linutronix.de; mi...@redhat.com; h...@zytor.com;
> x...@kernel.org; kvm@vger.kernel.org; io...@lists.linux-foundation.org;
> linux-ker...@vger.kernel.org
> Subject: Re: [PATCH 05/13] KVM: Update IRTE according to guest interrupt
> configuration changes
> 
> On Mon, 2014-11-10 at 14:26 +0800, Feng Wu wrote:
> > When guest changes its interrupt configuration (such as, vector, etc.)
> > for direct-assigned devices, we need to update the associated IRTE
> > with the new guest vector, so external interrupts from the assigned
> > devices can be injected to guests without VM-Exit.
> >
> > The current method of handling guest lowest priority interrtups
> > is to use a counter 'apic_arb_prio' for each VCPU, we choose the
> > VCPU with smallest 'apic_arb_prio' and then increase it by 1.
> > However, for VT-d PI, we cannot re-use this, since we no longer
> > have control to 'apic_arb_prio' with posted interrupt direct
> > delivery by Hardware.
> >
> > Here, we introduce a similiar way with 'apic_arb_prio' to handle
> > guest lowest priority interrtups when VT-d PI is used. Here is the
> > ideas:
> > - Each VCPU has a counter 'round_robin_counter'.
> > - When guests sets an interrupts to lowest priority, we choose
> > the VCPU with smallest 'round_robin_counter' as the destination,
> > then increase it.
> >
> > Signed-off-by: Feng Wu 
> > ---
> >  arch/x86/include/asm/irq_remapping.h |6 ++
> >  arch/x86/include/asm/kvm_host.h  |2 +
> >  arch/x86/kvm/vmx.c   |   12 +++
> >  arch/x86/kvm/x86.c   |   11 +++
> >  drivers/iommu/amd_iommu.c|6 ++
> >  drivers/iommu/intel_irq_remapping.c  |   28 +++
> >  drivers/iommu/irq_remapping.c|9 ++
> >  drivers/iommu/irq_remapping.h|3 +
> >  include/linux/dmar.h |   26 ++
> >  include/linux/kvm_host.h |   22 +
> >  include/uapi/linux/kvm.h |1 +
> >  virt/kvm/assigned-dev.c  |  141
> ++
> >  virt/kvm/irq_comm.c  |4 +-
> >  virt/kvm/irqchip.c   |   11 ---
> >  14 files changed, 269 insertions(+), 13 deletions(-)
> >
> > diff --git a/arch/x86/include/asm/irq_remapping.h
> b/arch/x86/include/asm/irq_remapping.h
> > index a3cc437..32d6cc4 100644
> > --- a/arch/x86/include/asm/irq_remapping.h
> > +++ b/arch/x86/include/asm/irq_remapping.h
> > @@ -51,6 +51,7 @@ extern void compose_remapped_msi_msg(struct
> pci_dev *pdev,
> >  unsigned int irq, unsigned int dest,
> >  struct msi_msg *msg, u8 hpet_id);
> >  extern int setup_hpet_msi_remapped(unsigned int irq, unsigned int id);
> > +extern int update_pi_irte(unsigned int irq, u64 pi_desc_addr, u32 vector);
> >  extern void panic_if_irq_remap(const char *msg);
> >  extern bool setup_remapped_irq(int irq,
> >struct irq_cfg *cfg,
> > @@ -88,6 +89,11 @@ static inline int setup_hpet_msi_remapped(unsigned
> int irq, unsigned int id)
> > return -ENODEV;
> >  }
> >
> > +static inline int update_pi_irte(unsigned int irq, u64 pi_desc_addr, u32
> vector)
> > +{
> > +   return -ENODEV;
> > +}
> > +
> >  static inline void panic_if_irq_remap(const char *msg)
> >  {
> >  }
> > diff --git a/arch/x86/include/asm/kvm_host.h
> b/arch/x86/include/asm/kvm_host.h
> > index 6ed0c30..0630161 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -358,6 +358,7 @@ struct kvm_vcpu_arch {
> > struct kvm_lapic *apic;/* kernel irqchip context */
> > unsigned long apic_attention;
> > int32_t apic_arb_prio;
> > +   int32_t round_robin_counter;
> > int mp_state;
> > u64 ia32_misc_enable_msr;
> > bool tpr_access_reporting;
> > @@ -771,6 +772,7 @@ struct kvm_x86_ops {
> > int (*check_nested_events)(struct kvm_vcpu *vcpu, bool external_intr);
> >
> > void (*sched_in)(struct kvm_vcpu *kvm, int cpu);
> > +   u64 (*get_pi_desc_addr)(struct kvm_vcpu *vcpu);
> >  };
> >
> >  struct kvm_arch_async_pf {
> > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> > index a4670d3..ae91b72 100644
> > --- a/arch/x86/kvm/vmx.c
> > +++ b/arch/x86/kvm/vmx.c
> > @@ -544,6 +544,11 @@ static inline struct vcpu_vmx *to_vmx(struct
> kvm_vcpu *vcpu)
> > return container_of(vcpu, struct vcpu_vmx, vcpu);
> >  }
> >
> > +struct pi_desc *vcpu_to_pi_desc(struct kvm_vcpu *vcpu)
> > +{
> > +   return &(to_vmx(vcpu)->pi_desc);
> > +}
> > +
> >  #define VMCS12_OFFSET(x) offsetof(struct vmcs12, x)
> >  #define FIELD(number, name)[number] = VMCS12_OFFSET(name)
> >  #define FIELD64(number, name)  [number] = VMCS12_OFFSET(name), \
> > @@ -4280,6 +4285,11 @@ static void vmx_syn

Re: Seeking a KVM benchmark

2014-11-11 Thread Paolo Bonzini


On 10/11/2014 13:15, Paolo Bonzini wrote:
> 
> 
> On 10/11/2014 11:45, Gleb Natapov wrote:
>>> I tried making also the other shared MSRs the same between guest and
>>> host (STAR, LSTAR, CSTAR, SYSCALL_MASK), so that the user return notifier
>>> has nothing to do.  That saves about 4-500 cycles on inl_from_qemu.  I
>>> do want to dig out my old Core 2 and see how the new test fares, but it
>>> really looks like your patch will be in 3.19.
>>
>> Please test on wide variety of HW before final decision.
> 
> Yes, definitely.

I've reproduced Andy's results on Ivy Bridge:

NX off  ~6900 cycles (EFER)
NX on, SCE off ~14600 cycles (urn)
NX on, SCE on   ~6900 cycles (same value)

I also asked Intel about clarifications.

On Core 2 Duo the results are weird.  There is no LOAD_EFER control,
so Andy's patch does not apply and the only interesting paths are urn
and same value.

The pessimization of EFER writes does _seem_ to be there, since I can
profile for iTLB flushes (r4082 on this microarchitecture) and get:

   0.14%  qemu-kvm  [kernel.kallsyms]  [k] native_write_msr_safe
   0.14%  qemu-kvm  [kernel.kallsyms]  [k] native_flush_tlb

but these are the top two results and it is not clear to me why perf
only records them as "0.14%"...  Also, this machine has no EPT, so virt
suffers a lot from TLB misses anyway.

Nevertheless I tried running kvm-unit-tests with different values of the
MSRs to see what's the behavior.

NX=1/SCE=0   NX=1/SCE=1  all MSRs equal
cpuid   3374 34483608
vmcall  3274 33373478
mov_from_cr811   11  11
mov_to_cr8  15   15  15
inl_from_pmtimer1780316346   15156
inl_from_qemu   1785816375   15163
inl_from_kernel 6351 64926622
outl_to_kernel  3850 39004053
mov_dr  116  116 117
ple-round-robin 15   16  16
wr_tsc_adjust_msr   3334 34173570
rd_tsc_adjust_msr   3374 34043605
mmio-no-eventfd:pci-mem 1918817866   16660
mmio-wildcard-eventfd:pci-mem   7319 74147595
mmio-datamatch-eventfd:pci-mem  7304 74707605
portio-no-eventfd:pci-io1321911780   10447
portio-wildcard-eventfd:pci-io  3951 40244149
portio-datamatch-eventfd:pci-io 3940 40264228

In the last column, all shared MSRs are equal (*) host and guest.  The
difference is very noisy on newer processors, but quite visible on the
older processor.  It is weird though that the light-weight exits become
_more_ expensive as more MSRs are equal between guest and host.

Anyhow, this is more of a curiosity since the proposed patch has no effect.

Next will come Nehalem.  Nehalem has both LOAD_EFER and EPT, so it's
already a good target.  I can test Westmere too, as soon as I find
someone that has it, but it shouldn't give surprises.

Paolo

(*) run this:

#! /usr/bin/env python
class msr(object):
def __init__(self):
try:
self.f = open('/dev/cpu/0/msr', 'r', 0)
except:
self.f = open('/dev/msr0', 'r', 0)
def read(self, index, default = None):
import struct
self.f.seek(index)
try:
return struct.unpack('Q', self.f.read(8))[0]
except:
return default

m = msr()
for i in [0xc080, 0xc081, 0xc082, 0xc083, 0xc084]:
print ("wrmsr(0x%x, 0x%x);" % (i, m.read(i)))

and add the result to the enable_nx function.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


TUN_F_UFO change breaks live migration

2014-11-11 Thread Stefan Hajnoczi
Commit 3d0ad09412ffe00c9afa201d01effdb6023d09b4 ("drivers/net: Disable
UFO through virtio") breaks live migration of KVM guests from older to
newer host kernels:

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=3d0ad09412ffe00c9afa201d01effdb6023d09b4

The problem occurs when a guest running on a host kernel without commit
3d0ad0941 in tun.ko attempts to live migration to a host with commit
3d0ad0941.

Live migration fails in QEMU with the following error message:

  virtio-net: saved image requires TUN_F_UFO support

The old host tun.ko advertised support for TUN_F_UFO.  The new host
tun.ko does not and that's why QEMU aborts live migration.  QEMU cannot
change the features of a running virtio-net device.

tuxcrafter provided logs from two Debian hosts migrating from
3.2.60-1+deb7u3 to 3.2.63-2+deb7u1:

http://paste.debian.net/131264/

I haven't investigated enough to suggest a fix, just wanted to bring it
to your attention.  Soon a lot of people will be hitting this problem as
they upgrade their infrastructure and migrate guests - seems like a
critical issue.

Stefan


pgp8hSn2xkMUC.pgp
Description: PGP signature


RE: [PATCH 05/13] KVM: Update IRTE according to guest interrupt configuration changes

2014-11-11 Thread Wu, Feng


> -Original Message-
> From: Alex Williamson [mailto:alex.william...@redhat.com]
> Sent: Tuesday, November 11, 2014 5:58 AM
> To: Wu, Feng
> Cc: g...@kernel.org; pbonz...@redhat.com; dw...@infradead.org;
> j...@8bytes.org; t...@linutronix.de; mi...@redhat.com; h...@zytor.com;
> x...@kernel.org; kvm@vger.kernel.org; io...@lists.linux-foundation.org;
> linux-ker...@vger.kernel.org
> Subject: Re: [PATCH 05/13] KVM: Update IRTE according to guest interrupt
> configuration changes
> 
> On Mon, 2014-11-10 at 14:26 +0800, Feng Wu wrote:
> > When guest changes its interrupt configuration (such as, vector, etc.)
> > for direct-assigned devices, we need to update the associated IRTE
> > with the new guest vector, so external interrupts from the assigned
> > devices can be injected to guests without VM-Exit.
> >
> > The current method of handling guest lowest priority interrtups
> > is to use a counter 'apic_arb_prio' for each VCPU, we choose the
> > VCPU with smallest 'apic_arb_prio' and then increase it by 1.
> > However, for VT-d PI, we cannot re-use this, since we no longer
> > have control to 'apic_arb_prio' with posted interrupt direct
> > delivery by Hardware.
> >
> > Here, we introduce a similiar way with 'apic_arb_prio' to handle
> > guest lowest priority interrtups when VT-d PI is used. Here is the
> > ideas:
> > - Each VCPU has a counter 'round_robin_counter'.
> > - When guests sets an interrupts to lowest priority, we choose
> > the VCPU with smallest 'round_robin_counter' as the destination,
> > then increase it.
> >
> > Signed-off-by: Feng Wu 
> > ---
> >  arch/x86/include/asm/irq_remapping.h |6 ++
> >  arch/x86/include/asm/kvm_host.h  |2 +
> >  arch/x86/kvm/vmx.c   |   12 +++
> >  arch/x86/kvm/x86.c   |   11 +++
> >  drivers/iommu/amd_iommu.c|6 ++
> >  drivers/iommu/intel_irq_remapping.c  |   28 +++
> >  drivers/iommu/irq_remapping.c|9 ++
> >  drivers/iommu/irq_remapping.h|3 +
> >  include/linux/dmar.h |   26 ++
> >  include/linux/kvm_host.h |   22 +
> >  include/uapi/linux/kvm.h |1 +
> >  virt/kvm/assigned-dev.c  |  141
> ++
> >  virt/kvm/irq_comm.c  |4 +-
> >  virt/kvm/irqchip.c   |   11 ---
> >  14 files changed, 269 insertions(+), 13 deletions(-)
> >
> > diff --git a/arch/x86/include/asm/irq_remapping.h
> b/arch/x86/include/asm/irq_remapping.h
> > index a3cc437..32d6cc4 100644
> > --- a/arch/x86/include/asm/irq_remapping.h
> > +++ b/arch/x86/include/asm/irq_remapping.h
> > @@ -51,6 +51,7 @@ extern void compose_remapped_msi_msg(struct
> pci_dev *pdev,
> >  unsigned int irq, unsigned int dest,
> >  struct msi_msg *msg, u8 hpet_id);
> >  extern int setup_hpet_msi_remapped(unsigned int irq, unsigned int id);
> > +extern int update_pi_irte(unsigned int irq, u64 pi_desc_addr, u32 vector);
> >  extern void panic_if_irq_remap(const char *msg);
> >  extern bool setup_remapped_irq(int irq,
> >struct irq_cfg *cfg,
> > @@ -88,6 +89,11 @@ static inline int setup_hpet_msi_remapped(unsigned
> int irq, unsigned int id)
> > return -ENODEV;
> >  }
> >
> > +static inline int update_pi_irte(unsigned int irq, u64 pi_desc_addr, u32
> vector)
> > +{
> > +   return -ENODEV;
> > +}
> > +
> >  static inline void panic_if_irq_remap(const char *msg)
> >  {
> >  }
> > diff --git a/arch/x86/include/asm/kvm_host.h
> b/arch/x86/include/asm/kvm_host.h
> > index 6ed0c30..0630161 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -358,6 +358,7 @@ struct kvm_vcpu_arch {
> > struct kvm_lapic *apic;/* kernel irqchip context */
> > unsigned long apic_attention;
> > int32_t apic_arb_prio;
> > +   int32_t round_robin_counter;
> > int mp_state;
> > u64 ia32_misc_enable_msr;
> > bool tpr_access_reporting;
> > @@ -771,6 +772,7 @@ struct kvm_x86_ops {
> > int (*check_nested_events)(struct kvm_vcpu *vcpu, bool external_intr);
> >
> > void (*sched_in)(struct kvm_vcpu *kvm, int cpu);
> > +   u64 (*get_pi_desc_addr)(struct kvm_vcpu *vcpu);
> >  };
> >
> >  struct kvm_arch_async_pf {
> > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> > index a4670d3..ae91b72 100644
> > --- a/arch/x86/kvm/vmx.c
> > +++ b/arch/x86/kvm/vmx.c
> > @@ -544,6 +544,11 @@ static inline struct vcpu_vmx *to_vmx(struct
> kvm_vcpu *vcpu)
> > return container_of(vcpu, struct vcpu_vmx, vcpu);
> >  }
> >
> > +struct pi_desc *vcpu_to_pi_desc(struct kvm_vcpu *vcpu)
> > +{
> > +   return &(to_vmx(vcpu)->pi_desc);
> > +}
> > +
> >  #define VMCS12_OFFSET(x) offsetof(struct vmcs12, x)
> >  #define FIELD(number, name)[number] = VMCS12_OFFSET(name)
> >  #define FIELD64(number, name)  [number] = VMCS12_OFFSET(name), \
> > @@ -4280,6 +4285,11 @@ static void vmx_syn

RE: [PATCH] x86: Update VT-d Posted-Interrupts related information

2014-11-11 Thread Wu, Feng


> -Original Message-
> From: Alex Williamson [mailto:alex.william...@redhat.com]
> Sent: Tuesday, November 11, 2014 5:58 AM
> To: Wu, Feng
> Cc: pbonz...@redhat.com; r...@twiddle.net; mtosa...@redhat.com;
> qemu-de...@nongnu.org; kvm@vger.kernel.org
> Subject: Re: [PATCH] x86: Update VT-d Posted-Interrupts related information
> 
> On Mon, 2014-11-10 at 14:31 +0800, Feng Wu wrote:
> > VT-d Posted-Interrupts(PI) is an enhancement to CPU side Posted-Interrupt.
> > With VT-d Posted-Interrupts enabled, external interrupts from
> > direct-assigned devices can be delivered to guests without VMM
> > involvement when guest is running in non-root mode.
> >
> > If VT-d PI is supported by KVM, we need to update the IRTE with
> > the new guest interrtup configuration.
> >
> > Signed-off-by: Feng Wu 
> > ---
> >  hw/i386/kvm/pci-assign.c  |   24 
> >  linux-headers/linux/kvm.h |2 ++
> >  target-i386/kvm.c |5 +
> >  target-i386/kvm_i386.h|1 +
> >  4 files changed, 32 insertions(+), 0 deletions(-)
> >
> > diff --git a/hw/i386/kvm/pci-assign.c b/hw/i386/kvm/pci-assign.c
> > index bb206da..e55a99b 100644
> > --- a/hw/i386/kvm/pci-assign.c
> > +++ b/hw/i386/kvm/pci-assign.c
> > @@ -1005,6 +1005,12 @@ static void assigned_dev_update_msi(PCIDevice
> *pci_dev)
> >  assigned_dev->intx_route.mode = PCI_INTX_DISABLED;
> >  assigned_dev->intx_route.irq = -1;
> >  assigned_dev->assigned_irq_type = ASSIGNED_IRQ_MSI;
> > +
> > +if (kvm_check_extension(kvm_state, KVM_CAP_PI)) {
> > +if (kvm_device_pi_update(kvm_state, assigned_dev->dev_id)
> < 0) {
> > +perror("assigned_dev_update_msi:
> kvm_device_pi_update");
> > +}
> > +}
> 
> 
> I don't really understand this ioctl, we need to call into KVM to set
> the new IRQ routing anyway, why can't KVM figure out that it needs to
> update the posted interrupt config at the same time?  Also, we'll need
> to support this through vfio-pci.  Thanks,
> 
> Alex

We need host irq information got from "struct kvm_assigned_dev_kernel *dev"
when updating IRTE for VT-d Posted-Interrtups. However, when guest tries to
update the msi message like the following: 

assigned_dev_update_msi_msg() --> kvm_irqchip_update_msi_route() --> 
kvm_update_routing_entry() -->
kvm_irqchip_commit_routes() --> kvm_irqchip_commit_routes()--> 
KVM_SET_GSI_ROUTING --> kvm_set_irq_routing()

It will finally go to kvm_set_irq_routing() in KVM, there are two problem:
1. It use RCU in this function, it is hard to find which entry in the irq 
routing table is being updated.
2. Even we find the updated entry, it is hard to find the associated assigned 
device with this irq routing entry in this function.

Do you have any ideas about this? If we can do all this inside KVM, that should 
be great!

Thanks,
Feng

> 
> >  } else {
> >  Error *local_err = NULL;
> >
> > @@ -1029,6 +1035,12 @@ static void
> assigned_dev_update_msi_msg(PCIDevice *pci_dev)
> >
> >  kvm_irqchip_update_msi_route(kvm_state,
> assigned_dev->msi_virq[0],
> >   msi_get_message(pci_dev, 0));
> > +
> > +if (kvm_check_extension(kvm_state, KVM_CAP_PI)) {
> > +if (kvm_device_pi_update(kvm_state, assigned_dev->dev_id) < 0) {
> > +perror("assigned_dev_update_msi_msg:
> kvm_device_pi_update");
> > +}
> > +}
> >  }
> >
> >  static bool assigned_dev_msix_masked(MSIXTableEntry *entry)
> > @@ -1149,6 +1161,12 @@ static void assigned_dev_update_msix(PCIDevice
> *pci_dev)
> >  perror("assigned_dev_enable_msix: assign irq");
> >  return;
> >  }
> > +
> > +if (kvm_check_extension(kvm_state, KVM_CAP_PI)) {
> > +if (kvm_device_pi_update(kvm_state,
> assigned_dev->dev_id) < 0) {
> > +perror("assigned_dev_update_msix:
> kvm_device_pi_update");
> > +}
> > +}
> >  }
> >  assigned_dev->intx_route.mode = PCI_INTX_DISABLED;
> >  assigned_dev->intx_route.irq = -1;
> > @@ -1618,6 +1636,12 @@ static void assigned_dev_msix_mmio_write(void
> *opaque, hwaddr addr,
> >  if (ret) {
> >  error_report("Error updating irq routing entry
> (%d)", ret);
> >  }
> > +if (kvm_check_extension(kvm_state, KVM_CAP_PI)) {
> > +if (kvm_device_pi_update(kvm_state, adev->dev_id)
> < 0) {
> > +perror("assigned_dev_update_msi_msg: "
> > +"kvm_device_pi_update");
> > +}
> > +}
> >  }
> >  }
> >  }
> > diff --git a/linux-headers/linux/kvm.h b/linux-headers/linux/kvm.h
> > index 2669938..b34f3c4 100644
> > --- a/linux-headers/linux/kvm.h
> > +++ b/linux-headers/linux/kvm.h
> > @@ -765,6 +765,7 @@ struct kvm_ppc_smmu_info {
> >  #define KVM_CAP_PPC_FIXU

RE: [PATCH 02/13] KVM: Initialize VT-d Posted-Interrtups Descriptor

2014-11-11 Thread Wu, Feng


> -Original Message-
> From: Alex Williamson [mailto:alex.william...@redhat.com]
> Sent: Tuesday, November 11, 2014 5:57 AM
> To: Wu, Feng
> Cc: g...@kernel.org; pbonz...@redhat.com; dw...@infradead.org;
> j...@8bytes.org; t...@linutronix.de; mi...@redhat.com; h...@zytor.com;
> x...@kernel.org; kvm@vger.kernel.org; io...@lists.linux-foundation.org;
> linux-ker...@vger.kernel.org
> Subject: Re: [PATCH 02/13] KVM: Initialize VT-d Posted-Interrtups Descriptor
> 
> On Mon, 2014-11-10 at 14:26 +0800, Feng Wu wrote:
> > This patch initialize the VT-d Posted-interrupt Descritpor.
> >
> > Signed-off-by: Feng Wu 
> > ---
> >  arch/x86/include/asm/irq_remapping.h |1 +
> >  arch/x86/kernel/apic/apic.c  |1 +
> >  arch/x86/kvm/vmx.c   |   56
> -
> >  3 files changed, 56 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/x86/include/asm/irq_remapping.h
> b/arch/x86/include/asm/irq_remapping.h
> > index b7747c4..a3cc437 100644
> > --- a/arch/x86/include/asm/irq_remapping.h
> > +++ b/arch/x86/include/asm/irq_remapping.h
> > @@ -57,6 +57,7 @@ extern bool setup_remapped_irq(int irq,
> >struct irq_chip *chip);
> >
> >  void irq_remap_modify_chip_defaults(struct irq_chip *chip);
> > +extern int irq_post_enabled;
> >
> >  #else  /* CONFIG_IRQ_REMAP */
> >
> > diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
> > index ba6cc04..987408d 100644
> > --- a/arch/x86/kernel/apic/apic.c
> > +++ b/arch/x86/kernel/apic/apic.c
> > @@ -162,6 +162,7 @@ __setup("apicpmtimer", setup_apicpmtimer);
> >  #endif
> >
> >  int x2apic_mode;
> > +EXPORT_SYMBOL_GPL(x2apic_mode);
> >  #ifdef CONFIG_X86_X2APIC
> >  /* x2apic enabled before OS handover */
> >  int x2apic_preenabled;
> > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> > index 3e556c6..a4670d3 100644
> > --- a/arch/x86/kvm/vmx.c
> > +++ b/arch/x86/kvm/vmx.c
> > @@ -45,6 +45,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >
> >  #include "trace.h"
> >
> > @@ -408,13 +409,32 @@ struct nested_vmx {
> >  };
> >
> >  #define POSTED_INTR_ON  0
> > +#define POSTED_INTR_SN  1
> > +
> >  /* Posted-Interrupt Descriptor */
> >  struct pi_desc {
> > u32 pir[8]; /* Posted interrupt requested */
> > -   u32 control;/* bit 0 of control is outstanding notification bit */
> > -   u32 rsvd[7];
> > +   union {
> > +   struct {
> > +   u64 on  : 1,
> > +   sn  : 1,
> > +   rsvd_1  : 13,
> > +   ndm : 1,
> > +   nv  : 8,
> > +   rsvd_2  : 8,
> > +   ndst: 32;
> > +   };
> > +   u64 control;
> > +   };
> > +   u32 rsvd[6];
> >  } __aligned(64);
> >
> > +static void pi_clear_sn(struct pi_desc *pi_desc)
> > +{
> > +   return clear_bit(POSTED_INTR_SN,
> > +   (unsigned long *)&pi_desc->control);
> > +}
> > +
> >  static bool pi_test_and_set_on(struct pi_desc *pi_desc)
> >  {
> > return test_and_set_bit(POSTED_INTR_ON,
> > @@ -4396,6 +4416,33 @@ static void ept_set_mmio_spte_mask(void)
> > kvm_mmu_set_mmio_spte_mask((0x3ull << 62) | 0x6ull);
> >  }
> >
> > +static bool pi_desc_init(struct vcpu_vmx *vmx)
> > +{
> > +   unsigned int dest;
> > +
> > +   if (irq_post_enabled == 0)
> > +   return true;
> > +
> > +   /*
> > +* Initialize Posted-Interrupt Descriptor
> > +*/
> > +
> > +   pi_clear_sn(&vmx->pi_desc);
> > +   vmx->pi_desc.nv = POSTED_INTR_VECTOR;
> > +
> > +   /* Physical mode for Notificaiton Event */
> > +   vmx->pi_desc.ndm = 0;
> > +   dest = cpu_physical_id(vmx->vcpu.cpu);
> > +
> > +   if (x2apic_mode)
> > +   vmx->pi_desc.ndst = dest;
> > +   else
> > +   vmx->pi_desc.ndst = (dest << 8) & 0xFF00;
> > +
> > +   return true;
> 
> Why does this bother to return anything since it can only return true?

Yes, I will make this function void. Thanks for the comments!

Thanks,
Feng

> 
> > +}
> > +
> > +
> >  /*
> >   * Sets up the vmcs for emulated real mode.
> >   */
> > @@ -4439,6 +4486,11 @@ static int vmx_vcpu_setup(struct vcpu_vmx
> *vmx)
> >
> > vmcs_write64(POSTED_INTR_NV, POSTED_INTR_VECTOR);
> > vmcs_write64(POSTED_INTR_DESC_ADDR, __pa((&vmx->pi_desc)));
> > +
> > +   if (!pi_desc_init(vmx)) {
> 
> And therefore this cannot happen.
> 
> > +   printk(KERN_ERR "Initialize PI descriptor error!\n");
> > +   return 1;
> 
> This is the wrong error anyway, vmx_create_vcpu() returns ERR_PTR(1)
> which fails the reverse IS_ERR()
> 
> Thanks,
> Alex
> 
> > +   }
> > }
> >
> > if (ple_gap) {
> 
> 



Re: [RFC PATCH 0/6] ARM64: KVM: PMU infrastructure support

2014-11-11 Thread Anup Patel
Hi All,

I have second thoughts about rebasing KVM PMU patches
to Marc's irq-forwarding patches.

The PMU IRQs (when virtualized by KVM) are not exactly
forwarded IRQs because they are shared between Host
and Guest.

Scenario1
-

We might have perf running on Host and no KVM guest
running. In this scenario, we wont get interrupts on Host
because the kvm_pmu_hyp_init() (similar to the function
kvm_timer_hyp_init() of Marc's IRQ-forwarding
implementation) has put all host PMU IRQs in forwarding
mode.

The only way solve this problem is to not set forwarding
mode for PMU IRQs in kvm_pmu_hyp_init() and instead
have special routines to turn on and turn off the forwarding
mode of PMU IRQs. These routines will be called from
kvm_arch_vcpu_ioctl_run() for toggling the PMU IRQ
forwarding state.

Scenario2
-

We might have perf running on Host and Guest simultaneously
which means it is quite likely that PMU HW trigger IRQ meant
for Host between "ret = kvm_call_hyp(__kvm_vcpu_run, vcpu);"
and "kvm_pmu_sync_hwstate(vcpu);" (similar to timer sync routine
of Marc's patchset which is called before local_irq_enable()).

In this scenario, the updated kvm_pmu_sync_hwstate(vcpu)
will accidentally forward IRQ meant for Host to Guest unless
we put additional checks to inspect VCPU PMU state.

Am I missing any detail about IRQ forwarding for above
scenarios?

If not then can we consider current mask/unmask approach
for forwarding PMU IRQs?

Marc?? Will??

Regards,
Anup
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html