Re: [PATCH net] vhost-net: fix use-after-free in vhost_net_flush

2013-06-20 Thread Jason Wang
On 06/20/2013 07:48 PM, Michael S. Tsirkin wrote:
> vhost_net_ubuf_put_and_wait has a confusing name:
> it will actually also free its argument.
> Thus since commit 1280c27f8e29acf4af2da914e80ec27c3dbd5c01
> vhost_net_flush tries to use the argument after passing it
> to vhost_net_ubuf_put_and_wait; this results
> in a use after free.
> To fix, don't free the argument in vhost_net_ubuf_put_and_wait;
> add a new API for callers that want to free ubufs.
>
> Signed-off-by: Michael S. Tsirkin 
> ---
>
> Dave, this is needed for stable as well, but
> the code has moved a bit since then.
> I'll post a patch tweaked to apply against 3.9 and 3.8 separately.
>
>  drivers/vhost/net.c | 9 +++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index 5c77d6a..534adb0 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -149,6 +149,11 @@ static void vhost_net_ubuf_put_and_wait(struct vhost_net_ubuf_ref *ubufs)
>  {
>   kref_put(&ubufs->kref, vhost_net_zerocopy_done_signal);
>   wait_event(ubufs->wait, !atomic_read(&ubufs->kref.refcount));
> +}
> +
> +static void vhost_net_ubuf_put_wait_and_free(struct vhost_net_ubuf_ref *ubufs)
> +{
> + vhost_net_ubuf_put_and_wait(ubufs);
>   kfree(ubufs);
>  }
>  
> @@ -1073,7 +1078,7 @@ static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
>   mutex_unlock(&vq->mutex);
>  
>   if (oldubufs) {
> - vhost_net_ubuf_put_and_wait(oldubufs);
> + vhost_net_ubuf_put_wait_and_free(oldubufs);
>   mutex_lock(&vq->mutex);
>   vhost_zerocopy_signal_used(n, vq);
>   mutex_unlock(&vq->mutex);
> @@ -1091,7 +1096,7 @@ err_used:
>   vq->private_data = oldsock;
>   vhost_net_enable_vq(n, vq);
>   if (ubufs)
> - vhost_net_ubuf_put_and_wait(ubufs);
> + vhost_net_ubuf_put_wait_and_free(ubufs);
>  err_ubufs:
>   fput(sock->file);
>  err_vq:

Acked-by: Jason Wang 
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] vfio: fix documentation

2013-06-20 Thread Alexey Kardashevskiy
Signed-off-by: Alexey Kardashevskiy 
---
 Documentation/vfio.txt |6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
index c55533c..d7993dc 100644
--- a/Documentation/vfio.txt
+++ b/Documentation/vfio.txt
@@ -172,12 +172,12 @@ group and can access them as follows:
struct vfio_device_info device_info = { .argsz = sizeof(device_info) };
 
/* Create a new container */
-   container = open("/dev/vfio/vfio, O_RDWR);
+   container = open("/dev/vfio/vfio", O_RDWR);
 
if (ioctl(container, VFIO_GET_API_VERSION) != VFIO_API_VERSION)
/* Unknown API version */
 
-   if (!ioctl(container, VFIO_CHECK_EXTENSION, VFIO_X86_IOMMU))
+   if (!ioctl(container, VFIO_CHECK_EXTENSION, VFIO_TYPE1_IOMMU))
/* Doesn't support the IOMMU driver we want. */
 
/* Open the group */
@@ -193,7 +193,7 @@ group and can access them as follows:
ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);
 
/* Enable the IOMMU model we want */
-   ioctl(container, VFIO_SET_IOMMU, VFIO_X86_IOMMU)
+   ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU)
 
/* Get addition IOMMU info */
ioctl(container, VFIO_IOMMU_GET_INFO, &iommu_info);
-- 
1.7.10.4



Re: [PATCH v2] KVM: MMU: Inform users of mmio generation wraparound

2013-06-20 Thread Takuya Yoshikawa
On Thu, 20 Jun 2013 23:29:22 +0200
Paolo Bonzini  wrote:

> > @@ -4385,8 +4385,10 @@ void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm)
> >  * The max value is MMIO_MAX_GEN - 1 since it is not called
> >  * when mark memslot invalid.
> >  */
> > -   if (unlikely(kvm_current_mmio_generation(kvm) >= (MMIO_MAX_GEN - 1)))
> > +   if (unlikely(kvm_current_mmio_generation(kvm) >= (MMIO_MAX_GEN - 1))) {
> 
> There is an off-by-one here (that I can fix in a separate patch).  The
> right test is >= MMIO_MAX_GEN, since we have:
> 
> #define MMIO_MAX_GEN((1 << MMIO_GEN_SHIFT) - 1)
> 
> Paolo

As Xiao's comment says, this is intentional because when we delete a
memory slot, we increase slots->generation twice: first for setting
the invalid flag, then for actually committing the slot.

To avoid this double zap_all(), we need to adjust the generation to
make it really wraparound when we hit:

kvm_current_mmio_generation(kvm) == MMIO_MAX_GEN - 1

> 
> > +   printk_ratelimited(KERN_INFO "kvm: zapping shadow pages for mmio generation wraparound\n");
> > kvm_mmu_invalidate_zap_all_pages(kvm);
> > +   }
> >  }

BTW, I'm now thinking of another way to know this information.

If it's possible to check slots->generation, plus current_mmio_gen if needed,
at some events, administrators can know how many times such wraparounds
happened before.

They do not need to trace all the time for that; hitting one such event
later makes it possible to tell whether the guest was doing something
problematic before.

I'm now looking for the best trace point for that.  Probably somewhere in
arch_commit_memory().

Takuya


Would a DOS on dovecot running under a VM cause host to crash?

2013-06-20 Thread Hugh Davenport

Hey All,

I'm just wondering whether this is what caused my server to crash.

It started last night in NZ land:

Jun 20 19:22:11 elm dovecot: imap-login: Disconnected (tried to use
disallowed plaintext auth): user=<>, rip=attackerip, lip=10.0.0.3,
session=<0C8LzpDfZQDINsQC>

I occasionally get

Jun 20 19:22:52 elm dovecot: imap-login: Disconnected (no auth attempts
in 1 secs): user=<>, rip=attackerip, lip=10.0.0.3,
session=

or "in 0 secs". The last one was at

Jun 20 19:26:24 elm dovecot: imap-login: Disconnected (tried to use
disallowed plaintext auth): user=<>, rip=attackerip, lip=10.0.0.3,
session=<1MUR3ZDfcwDINsQC>

A minute later the server lost contact with the world. When I checked a
bit later, the underlying host machine (dovecot runs in a KVM VM) had
been powered off.

Now, here in NZ land there was also a crazy storm last night, with lots
of brownouts. There could potentially have been a surge that killed it,
but the UPS was still running fine when I started it again.

The "attack" lasted around 4 minutes, during which there were 1161 lines
in the log for a single attacker IP, and no other similar logs
previously.

Would this be enough to kill not only the VM running dovecot, but the
underlying host machine as well?

The VM is up to date with patches, running Debian stable (wheezy):
dovecot-core Debian package version 1:2.1.7-7, dovecot version 2.1.7.
I notice there is a version 2.2.3 out, but it is not in Debian yet.
Could that fix this issue? I don't particularly want to have it happen
again :D.

The host is running Debian oldstable (squeeze), so it could be updated
further: libvirt0 Debian package version 0.8.3-5+squeeze5, libvirt
version 0.8.3. I notice there is a version 1.0.6 out (Debian stable
only has 0.9.12-11+deb7u1, which is 0.9.12); would either of these
versions fix an issue like this? qemu-kvm Debian package version
0.12.5+dfsg-5+squeeze10, kernel 2.6.32-5-amd64.

Any thoughts?

Cheers,

Hugh



Re: [PATCH 0/7] KVM: s390: further fixes for -next

2013-06-20 Thread Paolo Bonzini
On 20/06/2013 17:21, Cornelia Huck wrote:
> Hi,
> 
> some more fixes for kvm-next, bringing our handling of intercepted
> instructions more into line with what is actually architected.
> 
> Please apply.
> 
> Thomas Huth (7):
>   KVM: s390: Renamed PGM_PRIVILEGED_OPERATION
>   KVM: s390: Privileged operation check for TPROT
>   KVM: s390: Privileged operation checks moved to instruction handlers
>   KVM: s390: Check for PSTATE when handling DIAGNOSE
>   KVM: s390: Check for access exceptions during TPI
>   KVM: s390: Reworked LCTL and LCTLG instructions
>   KVM: s390: Fixed priority of execution in STSI
> 
>  arch/s390/include/asm/kvm_host.h |   2 +-
>  arch/s390/kvm/diag.c |   3 +
>  arch/s390/kvm/intercept.c|  85 +-
>  arch/s390/kvm/kvm-s390.h |   3 +-
>  arch/s390/kvm/priv.c | 184 ++-
>  arch/s390/kvm/sigp.c |   3 +-
>  6 files changed, 152 insertions(+), 128 deletions(-)
> 

Applied to next, thanks.

Paolo


Re: [PATCH v2] KVM: MMU: Inform users of mmio generation wraparound

2013-06-20 Thread Paolo Bonzini
On 20/06/2013 18:34, Takuya Yoshikawa wrote:
> From: Takuya Yoshikawa 
> 
> Without this information, users will just see unexpected performance
> problems and there is little chance we will get good reports from them:
> note that mmio generation is increased even when we just start, or stop,
> dirty logging for some memory slot, in which case users cannot expect
> all shadow pages to be zapped.
> 
> printk_ratelimited() is used for this, taking into account that we can
> see the information many times when we start multiple VMs and that
> guests can trigger this by reading ROM in a loop, for example.
> 
> Signed-off-by: Takuya Yoshikawa 
> ---
>  Interestingly, I saw this information printed twice every time.
>  Looks like current_mmio_gen can become mmio_max_gen...
>  arch/x86/kvm/mmu.c |4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index c60c5da..54e3968 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -4385,8 +4385,10 @@ void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm)
>* The max value is MMIO_MAX_GEN - 1 since it is not called
>* when mark memslot invalid.
>*/
> - if (unlikely(kvm_current_mmio_generation(kvm) >= (MMIO_MAX_GEN - 1)))
> + if (unlikely(kvm_current_mmio_generation(kvm) >= (MMIO_MAX_GEN - 1))) {

There is an off-by-one here (that I can fix in a separate patch).  The
right test is >= MMIO_MAX_GEN, since we have:

#define MMIO_MAX_GEN((1 << MMIO_GEN_SHIFT) - 1)

Paolo

> + printk_ratelimited(KERN_INFO "kvm: zapping shadow pages for mmio generation wraparound\n");
>   kvm_mmu_invalidate_zap_all_pages(kvm);
> + }
>  }
>  
>  static int mmu_shrink(struct shrinker *shrink, struct shrink_control *sc)
> 



Re: Regression after "Remove support for reporting coalesced APIC IRQs"

2013-06-20 Thread Jan Kiszka
On 2013-06-20 22:29, Gleb Natapov wrote:
> On Thu, Jun 20, 2013 at 10:10:18PM +0200, Jan Kiszka wrote:
>> On 2013-06-20 13:47, Gleb Natapov wrote:
>>> Jan ping, are you OK with what I proposed below?
>>>
>>> On Thu, Jun 06, 2013 at 11:53:52AM +0300, Gleb Natapov wrote:
>>>> Hi Jan,
>>>>
>>>> I bisected [1] to f1ed0450a5fac7067590317cbf027f566b6ccbca. Fortunately
>>>> further investigation showed that it is not really related to removing
>>>> APIC timer interrupt reinjection and the real problem is that we cannot
>>>> assume that __apic_accept_irq() always injects interrupts like the patch
>>>> does because the function skips interrupt injection if APIC is disabled.
>>>> This misreporting screws up RTC interrupt tracking, so further RTC interrupts
>>>> stop being injected. The simplest solution that I see is to revert
>>>> most of the commit and only leave APIC timer interrupt reinjection.
>>
>> I'm not understanding the precise error yet and how __apic_accept_irq
>> should be (properly) involved in its solution. Which code path depends on
>> the information that the APIC is enabled?
>>
> RTC interrupt injection tracking in virt/kvm/ioapic.c depends on
> accurate information about which vcpus an interrupt was injected into, since
> it expects an EOI from each vcpu before injecting the next RTC interrupt.
> Since kvm_apic_set_irq() now reports the interrupt as injected even for vcpus
> with a disabled APIC, the logic breaks because the EOI will never happen.

OK, so we may have to enhance kvm_apic_set_irq. Or implement the
apic_enabled check properly in __apic_accept_irq (though only
kvm_apic_set_irq will have a use for it).

The (historic) check looks very strange, misplaced, and carries a
comment that is probably also outdated. The check used to protect the
only functioning delivery mode, but then was left in place, ignoring all
the other modes.

Yes, that's not directly related to this regression. We can revert first
and then rework this interface. But the latter should definitely be done
as the revert will make the interface even worse IMHO.

Jan






Re: Regression after "Remove support for reporting coalesced APIC IRQs"

2013-06-20 Thread Gleb Natapov
On Thu, Jun 20, 2013 at 10:10:18PM +0200, Jan Kiszka wrote:
> On 2013-06-20 13:47, Gleb Natapov wrote:
> > Jan ping, are you OK with what I proposed below?
> > 
> > On Thu, Jun 06, 2013 at 11:53:52AM +0300, Gleb Natapov wrote:
> >> Hi Jan,
> >>
> >> I bisected [1] to f1ed0450a5fac7067590317cbf027f566b6ccbca. Fortunately
> >> further investigation showed that it is not really related to removing
> >> APIC timer interrupt reinjection and the real problem is that we cannot
> >> assume that __apic_accept_irq() always injects interrupts like the patch
> >> does because the function skips interrupt injection if APIC is disabled.
> >> This misreporting screws up RTC interrupt tracking, so further RTC interrupts
> >> stop being injected. The simplest solution that I see is to revert
> >> most of the commit and only leave APIC timer interrupt reinjection.
> 
> I'm not understanding the precise error yet and how __apic_accept_irq
> should be (properly) involved in its solution. Which code path depends on
> the information that the APIC is enabled?
> 
RTC interrupt injection tracking in virt/kvm/ioapic.c depends on
accurate information about which vcpus an interrupt was injected into, since
it expects an EOI from each vcpu before injecting the next RTC interrupt.
Since kvm_apic_set_irq() now reports the interrupt as injected even for vcpus
with a disabled APIC, the logic breaks because the EOI will never happen.

> The point is that preserving the return value of __apic_accept_irq, just
> redefining it to "delivery_mode != APIC_DM_FIXED || apic_enabled()"
> creates a pretty ugly interface, no? Can't we address the specific issue
> of the RTC at a different level?
> 
Do not see how.

--
Gleb.


Re: Regression after "Remove support for reporting coalesced APIC IRQs"

2013-06-20 Thread Jan Kiszka
On 2013-06-20 13:47, Gleb Natapov wrote:
> Jan ping, are you OK with what I proposed below?
> 
> On Thu, Jun 06, 2013 at 11:53:52AM +0300, Gleb Natapov wrote:
>> Hi Jan,
>>
>> I bisected [1] to f1ed0450a5fac7067590317cbf027f566b6ccbca. Fortunately
>> further investigation showed that it is not really related to removing
>> APIC timer interrupt reinjection and the real problem is that we cannot
>> assume that __apic_accept_irq() always injects interrupts like the patch
>> does because the function skips interrupt injection if APIC is disabled.
>> This misreporting screws up RTC interrupt tracking, so further RTC interrupts
>> stop being injected. The simplest solution that I see is to revert
>> most of the commit and only leave APIC timer interrupt reinjection.

I'm not understanding the precise error yet and how __apic_accept_irq
should be (properly) involved in its solution. Which code path depends on
the information that the APIC is enabled?

The point is that preserving the return value of __apic_accept_irq, just
redefining it to "delivery_mode != APIC_DM_FIXED || apic_enabled()"
creates a pretty ugly interface, no? Can't we address the specific issue
of the RTC at a different level?

Jan






Re: Bug#707257: linux-image-3.8-1-686-pae: KVM crashes with "entry failed, hardware error 0x80000021"

2013-06-20 Thread Stefan Pietsch
On 19.06.2013 16:38, Gleb Natapov wrote:
> On Wed, Jun 19, 2013 at 04:12:04PM +0200, Stefan Pietsch wrote:
>> On 19.06.2013 15:41, Gleb Natapov wrote:
>>> On Sun, Jun 16, 2013 at 02:25:04AM +0200, Stefan Pietsch wrote:

 first bad commit: [25391454e73e3156202264eb3c473825afe4bc94]
 KVM: VMX: don't clobber segment AR of unusable segments.


 25391454e73e3156202264eb3c473825afe4bc94
 emulate_invalid_guest_state=0 -> hangs and shows "KVM: entry failed"
 emulate_invalid_guest_state=1 -> hangs

>>> With emulate_invalid_guest_state=1 the commit does nothing. Can you
>>> double check that 218e763f458c44f30041c1b48b4371e130fd4317 works for you
>>> with emulate_invalid_guest_state=1?
>>
>>
>> 218e763f458c44f30041c1b48b4371e130fd4317
>> emulate_invalid_guest_state=0 -> works
>> emulate_invalid_guest_state=1 -> hangs
>>
>> 25391454e73e3156202264eb3c473825afe4bc94 broke
>> emulate_invalid_guest_state=0.
> Can you provide the output of 25391454e73e3156202264eb3c473825afe4bc94
> with emulate_invalid_guest_state=0. Also run "x/20i $pc-20" in qemu
> monitor after the hang.


25391454e73e3156202264eb3c473825afe4bc94
 emulate_invalid_guest_state=0

(qemu) info registers
EAX= EBX=0001 ECX=f000 EDX=f000
ESI=00195e93 EDI= EBP=de84c000 ESP=de84df64
EIP=c101611c EFL=00010246 [---Z-P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =007b   00c0f300 DPL=3 DS   [-WA]
CS =0060   00c09b00 DPL=0 CS32 [-RA]
SS =0068   00c09300 DPL=0 DS   [-WA]
DS =007b   00c0f300 DPL=3 DS   [-WA]
FS =  ffff 00f0ff00 DPL=3 CS64 [CRA]
GS =00e0 c1438b40 0018 00409100 DPL=0 DS   [--A]
LDT=  ffff 00f0ff00 DPL=3 CS64 [CRA]
TR =0080 c1400f00 206b 8b00 DPL=0 TSS32-busy
GDT= c13f6000 00ff
IDT= c13f5000 07ff
CR0=8005003b CR2= CR3=014bc000 CR4=0690
DR0= DR1= DR2=0007
DR3=
DR6=0ff0 DR7=0400
EFER=
FCW=037f FSW= [ST=0] FTW=00 MXCSR=1f80
FPR0=f44d002c6000 400d FPR1=80847fe7 400e
FPR2=fa007fa24000 400e FPR3=80e88055f000 400e
FPR4=ea61009c4000 400d FPR5=ea62009c4000 400c
FPR6=800bf600 4015 FPR7= 
XMM00=
XMM01=
XMM02=
XMM03=
XMM04=
XMM05=
XMM06=
XMM07=

(qemu) x/20i $pc-20
0xc1016108:  (bad)
0xc1016109:  decl   0x158bc310(%ecx)
0xc101610f:  les-0x7b723ec0(%ebx),%eax
0xc1016115:  adc%al,(%eax)
0xc1016117:  sar$0xff,%bh
0xc101611a:  mov(%eax),%eax
0xc101611c:  ret
0xc101611d:  jmp0xc1016121
0xc101611f:  pause
0xc1016121:  mov(%eax),%edx
0xc1016123:  test   %edx,%edx
0xc1016125:  je 0xc101611f
0xc1016127:  ret
0xc1016128:  cmp$0xfe,%eax
0xc101612d:  setle  %al
0xc1016130:  movzbl %al,%eax
0xc1016133:  ret
0xc1016134:  mov0xc1407b64,%edx
0xc101613a:  mov$0x20,%eax
0xc101613f:  call   *0x9c(%edx)


last 20 lines of the trace:
 qemu-system-x86-4042  [000]   295.592694: vcpu_match_mmio:  gva
0xb0d0 gpa 0xfee000d0 Read GVA
 qemu-system-x86-4042  [000]   295.592694: kvm_apic:
apic_read APIC_LDR = 0x100
 qemu-system-x86-4042  [000]   295.592695: kvm_mmio: mmio
read len 4 gpa 0xfee000d0 val 0x100
 qemu-system-x86-4042  [000]   295.592696: kvm_entry:vcpu 0
 qemu-system-x86-4042  [000]   295.592699: kvm_exit: reason
EXCEPTION_NMI rip 0xc101611a info b080 8b0e
 qemu-system-x86-4042  [000]   295.592700: kvm_page_fault:   address
b080 error_code 9
 qemu-system-x86-4042  [000]   295.592701: kvm_emulate_insn:
0:c101611a:8b 00 (prot32)
 qemu-system-x86-4042  [000]   295.592702: vcpu_match_mmio:  gva
0xb080 gpa 0xfee00080 Read GVA
 qemu-system-x86-4042  [000]   295.592703: kvm_apic:
apic_read APIC_TASKPRI = 0x0
 qemu-system-x86-4042  [000]   295.592703: kvm_mmio: mmio
read len 4 gpa 0xfee00080 val 0x0
 qemu-system-x86-4042  [000]   295.592704: kvm_userspace_exit:   reason
KVM_EXIT_TPR_ACCESS (12)
 qemu-system-x86-4042  [000]   295.592805: kvm_entry:vcpu 0
 qemu-system-x86-4042  [000]   295.592808: kvm_exit: reason
 rip 0xc101611c info 0 8b0e
 qemu-system-x86-4042  [000]   295.592809: kvm_userspace_exit:   reason
KVM_EXIT_FAIL_ENTRY (9)
 qemu-system-x86-4040  [001]   295.594993: kvm_set_irq:  gsi 0
level 1 source 0
 qemu-system-x86-4040  [001]   295.594995: kvm_pic_set_irq:  chip 0
pin 0 (edge)
 qemu-system-x86-4040  [001]   295.594997: kvm_ioapic_set_irq:   pin 2
dst 0 vec=0 

[PATCH v2] KVM: MMU: Inform users of mmio generation wraparound

2013-06-20 Thread Takuya Yoshikawa
From: Takuya Yoshikawa 

Without this information, users will just see unexpected performance
problems and there is little chance we will get good reports from them:
note that mmio generation is increased even when we just start, or stop,
dirty logging for some memory slot, in which case users cannot expect
all shadow pages to be zapped.

printk_ratelimited() is used for this, taking into account that we can
see the information many times when we start multiple VMs and that
guests can trigger this by reading ROM in a loop, for example.

Signed-off-by: Takuya Yoshikawa 
---
 Interestingly, I saw this information printed twice every time.
 Looks like current_mmio_gen can become mmio_max_gen...

 arch/x86/kvm/mmu.c |4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index c60c5da..54e3968 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -4385,8 +4385,10 @@ void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm)
 * The max value is MMIO_MAX_GEN - 1 since it is not called
 * when mark memslot invalid.
 */
-   if (unlikely(kvm_current_mmio_generation(kvm) >= (MMIO_MAX_GEN - 1)))
+   if (unlikely(kvm_current_mmio_generation(kvm) >= (MMIO_MAX_GEN - 1))) {
+   printk_ratelimited(KERN_INFO "kvm: zapping shadow pages for mmio generation wraparound\n");
kvm_mmu_invalidate_zap_all_pages(kvm);
+   }
 }
 
 static int mmu_shrink(struct shrinker *shrink, struct shrink_control *sc)
-- 
1.7.9.5



[PATCH 3/7] KVM: s390: Privileged operation checks moved to instruction handlers

2013-06-20 Thread Cornelia Huck
From: Thomas Huth 

We need more fine-grained control over the point in time when we check
for privileged instructions, since the exceptions that can happen during
an instruction have a well-defined priority. For example, for the PFMF
instruction, the check for PGM_PRIVILEGED_OP must happen after the check
for PGM_OPERATION since the latter has a higher precedence - thus the
check for privileged operation must not already be done in
kvm_s390_handle_b9().

Signed-off-by: Thomas Huth 
Acked-by: Cornelia Huck 
Signed-off-by: Cornelia Huck 
---
 arch/s390/kvm/priv.c | 63 
 1 file changed, 39 insertions(+), 24 deletions(-)

diff --git a/arch/s390/kvm/priv.c b/arch/s390/kvm/priv.c
index 04dc4a1..0b19e22 100644
--- a/arch/s390/kvm/priv.c
+++ b/arch/s390/kvm/priv.c
@@ -37,6 +37,9 @@ static int handle_set_prefix(struct kvm_vcpu *vcpu)
 
vcpu->stat.instruction_spx++;
 
+   if (vcpu->arch.sie_block->gpsw.mask & PSW_MASK_PSTATE)
+   return kvm_s390_inject_program_int(vcpu, PGM_PRIVILEGED_OP);
+
operand2 = kvm_s390_get_base_disp_s(vcpu);
 
/* must be word boundary */
@@ -68,6 +71,9 @@ static int handle_store_prefix(struct kvm_vcpu *vcpu)
 
vcpu->stat.instruction_stpx++;
 
+   if (vcpu->arch.sie_block->gpsw.mask & PSW_MASK_PSTATE)
+   return kvm_s390_inject_program_int(vcpu, PGM_PRIVILEGED_OP);
+
operand2 = kvm_s390_get_base_disp_s(vcpu);
 
/* must be word boundary */
@@ -92,6 +98,9 @@ static int handle_store_cpu_address(struct kvm_vcpu *vcpu)
 
vcpu->stat.instruction_stap++;
 
+   if (vcpu->arch.sie_block->gpsw.mask & PSW_MASK_PSTATE)
+   return kvm_s390_inject_program_int(vcpu, PGM_PRIVILEGED_OP);
+
useraddr = kvm_s390_get_base_disp_s(vcpu);
 
if (useraddr & 1)
@@ -108,6 +117,10 @@ static int handle_store_cpu_address(struct kvm_vcpu *vcpu)
 static int handle_skey(struct kvm_vcpu *vcpu)
 {
vcpu->stat.instruction_storage_key++;
+
+   if (vcpu->arch.sie_block->gpsw.mask & PSW_MASK_PSTATE)
+   return kvm_s390_inject_program_int(vcpu, PGM_PRIVILEGED_OP);
+
vcpu->arch.sie_block->gpsw.addr =
__rewind_psw(vcpu->arch.sie_block->gpsw, 4);
VCPU_EVENT(vcpu, 4, "%s", "retrying storage key operation");
@@ -186,6 +199,9 @@ static int handle_io_inst(struct kvm_vcpu *vcpu)
 {
VCPU_EVENT(vcpu, 4, "%s", "I/O instruction");
 
+   if (vcpu->arch.sie_block->gpsw.mask & PSW_MASK_PSTATE)
+   return kvm_s390_inject_program_int(vcpu, PGM_PRIVILEGED_OP);
+
if (vcpu->kvm->arch.css_support) {
/*
 * Most I/O instructions will be handled by userspace.
@@ -214,6 +230,10 @@ static int handle_stfl(struct kvm_vcpu *vcpu)
int rc;
 
vcpu->stat.instruction_stfl++;
+
+   if (vcpu->arch.sie_block->gpsw.mask & PSW_MASK_PSTATE)
+   return kvm_s390_inject_program_int(vcpu, PGM_PRIVILEGED_OP);
+
/* only pass the facility bits, which we can handle */
facility_list = S390_lowcore.stfl_fac_list & 0xff82fff3;
 
@@ -282,6 +302,9 @@ static int handle_lpswe(struct kvm_vcpu *vcpu)
psw_t new_psw;
u64 addr;
 
+   if (vcpu->arch.sie_block->gpsw.mask & PSW_MASK_PSTATE)
+   return kvm_s390_inject_program_int(vcpu, PGM_PRIVILEGED_OP);
+
addr = kvm_s390_get_base_disp_s(vcpu);
if (addr & 7)
return kvm_s390_inject_program_int(vcpu, PGM_SPECIFICATION);
@@ -300,6 +323,9 @@ static int handle_stidp(struct kvm_vcpu *vcpu)
 
vcpu->stat.instruction_stidp++;
 
+   if (vcpu->arch.sie_block->gpsw.mask & PSW_MASK_PSTATE)
+   return kvm_s390_inject_program_int(vcpu, PGM_PRIVILEGED_OP);
+
operand2 = kvm_s390_get_base_disp_s(vcpu);
 
if (operand2 & 7)
@@ -355,6 +381,9 @@ static int handle_stsi(struct kvm_vcpu *vcpu)
vcpu->stat.instruction_stsi++;
VCPU_EVENT(vcpu, 4, "stsi: fc: %x sel1: %x sel2: %x", fc, sel1, sel2);
 
+   if (vcpu->arch.sie_block->gpsw.mask & PSW_MASK_PSTATE)
+   return kvm_s390_inject_program_int(vcpu, PGM_PRIVILEGED_OP);
+
operand2 = kvm_s390_get_base_disp_s(vcpu);
 
if (operand2 & 0xfff && fc > 0)
@@ -436,20 +465,14 @@ int kvm_s390_handle_b2(struct kvm_vcpu *vcpu)
intercept_handler_t handler;
 
/*
-* a lot of B2 instructions are priviledged. We first check for
-* the privileged ones, that we can handle in the kernel. If the
-* kernel can handle this instruction, we check for the problem
-* state bit and (a) handle the instruction or (b) send a code 2
-* program check.
-* Anything else goes to userspace.*/
+* A lot of B2 instructions are priviledged. Here we check for
+* the privileged ones, that we can handle in the kernel.
+* Anything else goes to userspace.
+*/
handler = b2_handlers[v

[PATCH 4/7] KVM: s390: Check for PSTATE when handling DIAGNOSE

2013-06-20 Thread Cornelia Huck
From: Thomas Huth 

DIAGNOSE is a privileged instruction and thus we must make sure that we are
in supervisor mode before taking any other actions.

Signed-off-by: Thomas Huth 
Acked-by: Cornelia Huck 
Signed-off-by: Cornelia Huck 
---
 arch/s390/kvm/diag.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/s390/kvm/diag.c b/arch/s390/kvm/diag.c
index 1c01a99..3074475 100644
--- a/arch/s390/kvm/diag.c
+++ b/arch/s390/kvm/diag.c
@@ -132,6 +132,9 @@ int kvm_s390_handle_diag(struct kvm_vcpu *vcpu)
 {
int code = (vcpu->arch.sie_block->ipb & 0xfff) >> 16;
 
+   if (vcpu->arch.sie_block->gpsw.mask & PSW_MASK_PSTATE)
+   return kvm_s390_inject_program_int(vcpu, PGM_PRIVILEGED_OP);
+
trace_kvm_s390_handle_diag(vcpu, code);
switch (code) {
case 0x10:
-- 
1.8.2.2



[PATCH 2/7] KVM: s390: Privileged operation check for TPROT

2013-06-20 Thread Cornelia Huck
From: Thomas Huth 

TPROT is a privileged instruction and thus should generate a privileged
operation exception when the problem state bit is not cleared in the PSW.

Signed-off-by: Thomas Huth 
Acked-by: Cornelia Huck 
Signed-off-by: Cornelia Huck 
---
 arch/s390/kvm/priv.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/s390/kvm/priv.c b/arch/s390/kvm/priv.c
index a21e014..04dc4a1 100644
--- a/arch/s390/kvm/priv.c
+++ b/arch/s390/kvm/priv.c
@@ -596,6 +596,9 @@ static int handle_tprot(struct kvm_vcpu *vcpu)
 
vcpu->stat.instruction_tprot++;
 
+   if (vcpu->arch.sie_block->gpsw.mask & PSW_MASK_PSTATE)
+   return kvm_s390_inject_program_int(vcpu, PGM_PRIVILEGED_OP);
+
kvm_s390_get_base_disp_sse(vcpu, &address1, &address2);
 
/* we only handle the Linux memory detection case:
-- 
1.8.2.2



[PATCH 5/7] KVM: s390: Check for access exceptions during TPI

2013-06-20 Thread Cornelia Huck
From: Thomas Huth 

When a guest calls the TPI instruction, the second operand address could
point to an invalid location. In this case the problem should be signaled
to the guest by throwing an access exception.

Signed-off-by: Thomas Huth 
Acked-by: Cornelia Huck 
Signed-off-by: Cornelia Huck 
---
 arch/s390/kvm/priv.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/s390/kvm/priv.c b/arch/s390/kvm/priv.c
index 0b19e22..4b8fb6c 100644
--- a/arch/s390/kvm/priv.c
+++ b/arch/s390/kvm/priv.c
@@ -146,9 +146,10 @@ static int handle_tpi(struct kvm_vcpu *vcpu)
 * Store the two-word I/O interruption code into the
 * provided area.
 */
-   put_guest(vcpu, inti->io.subchannel_id, (u16 __user *) addr);
-   put_guest(vcpu, inti->io.subchannel_nr, (u16 __user *) (addr + 2));
-   put_guest(vcpu, inti->io.io_int_parm, (u32 __user *) (addr + 4));
+   if (put_guest(vcpu, inti->io.subchannel_id, (u16 __user *)addr)
+       || put_guest(vcpu, inti->io.subchannel_nr, (u16 __user *)(addr + 2))
+       || put_guest(vcpu, inti->io.io_int_parm, (u32 __user *)(addr + 4)))
+       return kvm_s390_inject_program_int(vcpu, PGM_ADDRESSING);
} else {
/*
 * Store the three-word I/O interruption code into
-- 
1.8.2.2



[PATCH 7/7] KVM: s390: Fixed priority of execution in STSI

2013-06-20 Thread Cornelia Huck
From: Thomas Huth 

Added some missing validity checks for the operands and fixed the
priority of exceptions for some function codes according to the
"Principles of Operation" document.

Signed-off-by: Thomas Huth 
Acked-by: Cornelia Huck 
Signed-off-by: Cornelia Huck 
---
 arch/s390/kvm/priv.c | 23 ---
 1 file changed, 16 insertions(+), 7 deletions(-)

diff --git a/arch/s390/kvm/priv.c b/arch/s390/kvm/priv.c
index c7603f5..0da3e6e 100644
--- a/arch/s390/kvm/priv.c
+++ b/arch/s390/kvm/priv.c
@@ -385,16 +385,27 @@ static int handle_stsi(struct kvm_vcpu *vcpu)
if (vcpu->arch.sie_block->gpsw.mask & PSW_MASK_PSTATE)
return kvm_s390_inject_program_int(vcpu, PGM_PRIVILEGED_OP);
 
-   operand2 = kvm_s390_get_base_disp_s(vcpu);
+   if (fc > 3) {
+   vcpu->arch.sie_block->gpsw.mask |= 3ul << 44; /* cc 3 */
+   return 0;
+   }
 
-   if (operand2 & 0xfff && fc > 0)
+   if (vcpu->run->s.regs.gprs[0] & 0x0f00
+   || vcpu->run->s.regs.gprs[1] & 0x)
return kvm_s390_inject_program_int(vcpu, PGM_SPECIFICATION);
 
-   switch (fc) {
-   case 0:
+   if (fc == 0) {
vcpu->run->s.regs.gprs[0] = 3 << 28;
-   vcpu->arch.sie_block->gpsw.mask &= ~(3ul << 44);
+   vcpu->arch.sie_block->gpsw.mask &= ~(3ul << 44);  /* cc 0 */
return 0;
+   }
+
+   operand2 = kvm_s390_get_base_disp_s(vcpu);
+
+   if (operand2 & 0xfff)
+   return kvm_s390_inject_program_int(vcpu, PGM_SPECIFICATION);
+
+   switch (fc) {
case 1: /* same handling for 1 and 2 */
case 2:
mem = get_zeroed_page(GFP_KERNEL);
@@ -411,8 +422,6 @@ static int handle_stsi(struct kvm_vcpu *vcpu)
goto out_no_data;
handle_stsi_3_2_2(vcpu, (void *) mem);
break;
-   default:
-   goto out_no_data;
}
 
if (copy_to_guest_absolute(vcpu, operand2, (void *) mem, PAGE_SIZE)) {
-- 
1.8.2.2



[PATCH 1/7] KVM: s390: Renamed PGM_PRIVILEGED_OPERATION

2013-06-20 Thread Cornelia Huck
From: Thomas Huth 

Renamed the PGM_PRIVILEGED_OPERATION define to PGM_PRIVILEGED_OP since this
define was much longer than the other PGM_* defines and often caused the code
to exceed the 80-column limit unless split across multiple lines.

Signed-off-by: Thomas Huth 
Acked-by: Cornelia Huck 
Signed-off-by: Cornelia Huck 
---
 arch/s390/include/asm/kvm_host.h |  2 +-
 arch/s390/kvm/priv.c | 16 +++-
 arch/s390/kvm/sigp.c |  3 +--
 3 files changed, 9 insertions(+), 12 deletions(-)

diff --git a/arch/s390/include/asm/kvm_host.h b/arch/s390/include/asm/kvm_host.h
index 4339069..3238d40 100644
--- a/arch/s390/include/asm/kvm_host.h
+++ b/arch/s390/include/asm/kvm_host.h
@@ -175,7 +175,7 @@ struct kvm_s390_ext_info {
 };
 
 #define PGM_OPERATION0x01
-#define PGM_PRIVILEGED_OPERATION 0x02
+#define PGM_PRIVILEGED_OP   0x02
 #define PGM_EXECUTE  0x03
 #define PGM_PROTECTION   0x04
 #define PGM_ADDRESSING   0x05
diff --git a/arch/s390/kvm/priv.c b/arch/s390/kvm/priv.c
index a0c63d7..a21e014 100644
--- a/arch/s390/kvm/priv.c
+++ b/arch/s390/kvm/priv.c
@@ -259,8 +259,8 @@ int kvm_s390_handle_lpsw(struct kvm_vcpu *vcpu)
u64 addr;
 
if (gpsw->mask & PSW_MASK_PSTATE)
-   return kvm_s390_inject_program_int(vcpu,
-  PGM_PRIVILEGED_OPERATION);
+   return kvm_s390_inject_program_int(vcpu, PGM_PRIVILEGED_OP);
+
addr = kvm_s390_get_base_disp_s(vcpu);
if (addr & 7)
return kvm_s390_inject_program_int(vcpu, PGM_SPECIFICATION);
@@ -446,7 +446,7 @@ int kvm_s390_handle_b2(struct kvm_vcpu *vcpu)
if (handler) {
if (vcpu->arch.sie_block->gpsw.mask & PSW_MASK_PSTATE)
return kvm_s390_inject_program_int(vcpu,
-  PGM_PRIVILEGED_OPERATION);
+  PGM_PRIVILEGED_OP);
else
return handler(vcpu);
}
@@ -493,7 +493,7 @@ static int handle_pfmf(struct kvm_vcpu *vcpu)
return kvm_s390_inject_program_int(vcpu, PGM_OPERATION);
 
if (vcpu->arch.sie_block->gpsw.mask & PSW_MASK_PSTATE)
-   return kvm_s390_inject_program_int(vcpu, PGM_PRIVILEGED_OPERATION);
+   return kvm_s390_inject_program_int(vcpu, PGM_PRIVILEGED_OP);
 
if (vcpu->run->s.regs.gprs[reg1] & PFMF_RESERVED)
return kvm_s390_inject_program_int(vcpu, PGM_SPECIFICATION);
@@ -564,7 +564,7 @@ int kvm_s390_handle_b9(struct kvm_vcpu *vcpu)
if ((handler != handle_epsw) &&
(vcpu->arch.sie_block->gpsw.mask & PSW_MASK_PSTATE))
return kvm_s390_inject_program_int(vcpu,
-  PGM_PRIVILEGED_OPERATION);
+  PGM_PRIVILEGED_OP);
else
return handler(vcpu);
}
@@ -581,8 +581,7 @@ int kvm_s390_handle_priv_eb(struct kvm_vcpu *vcpu)
 
/* All eb instructions that end up here are privileged. */
if (vcpu->arch.sie_block->gpsw.mask & PSW_MASK_PSTATE)
-   return kvm_s390_inject_program_int(vcpu,
-  PGM_PRIVILEGED_OPERATION);
+   return kvm_s390_inject_program_int(vcpu, PGM_PRIVILEGED_OP);
handler = eb_handlers[vcpu->arch.sie_block->ipb & 0xff];
if (handler)
return handler(vcpu);
@@ -642,8 +641,7 @@ static int handle_sckpf(struct kvm_vcpu *vcpu)
u32 value;
 
if (vcpu->arch.sie_block->gpsw.mask & PSW_MASK_PSTATE)
-   return kvm_s390_inject_program_int(vcpu,
-  PGM_PRIVILEGED_OPERATION);
+   return kvm_s390_inject_program_int(vcpu, PGM_PRIVILEGED_OP);
 
if (vcpu->run->s.regs.gprs[0] & 0x00000000ffff0000)
return kvm_s390_inject_program_int(vcpu,
diff --git a/arch/s390/kvm/sigp.c b/arch/s390/kvm/sigp.c
index 033c864..bec398c 100644
--- a/arch/s390/kvm/sigp.c
+++ b/arch/s390/kvm/sigp.c
@@ -333,8 +333,7 @@ int kvm_s390_handle_sigp(struct kvm_vcpu *vcpu)
 
/* sigp in userspace can exit */
if (vcpu->arch.sie_block->gpsw.mask & PSW_MASK_PSTATE)
-   return kvm_s390_inject_program_int(vcpu,
-  PGM_PRIVILEGED_OPERATION);
+   return kvm_s390_inject_program_int(vcpu, PGM_PRIVILEGED_OP);
 
order_code = kvm_s390_get_base_disp_rs(vcpu);
 
-- 
1.8.2.2



[PATCH 0/7] KVM: s390: further fixes for -next

2013-06-20 Thread Cornelia Huck
Hi,

some more fixes for kvm-next, bringing our handling of intercepted
instructions more into line with what is actually architected.

Please apply.

Thomas Huth (7):
  KVM: s390: Renamed PGM_PRIVILEGED_OPERATION
  KVM: s390: Privileged operation check for TPROT
  KVM: s390: Privileged operation checks moved to instruction handlers
  KVM: s390: Check for PSTATE when handling DIAGNOSE
  KVM: s390: Check for access exceptions during TPI
  KVM: s390: Reworked LCTL and LCTLG instructions
  KVM: s390: Fixed priority of execution in STSI

 arch/s390/include/asm/kvm_host.h |   2 +-
 arch/s390/kvm/diag.c |   3 +
 arch/s390/kvm/intercept.c|  85 +-
 arch/s390/kvm/kvm-s390.h |   3 +-
 arch/s390/kvm/priv.c | 184 ++-
 arch/s390/kvm/sigp.c |   3 +-
 6 files changed, 152 insertions(+), 128 deletions(-)

-- 
1.8.2.2



[PATCH 6/7] KVM: s390: Reworked LCTL and LCTLG instructions

2013-06-20 Thread Cornelia Huck
From: Thomas Huth 

LCTL and LCTLG are also privileged instructions, so there is no need to
treat them separately from the other instructions in priv.c. This patch
therefore moves these two instructions to priv.c, adds a check for supervisor
state, and simplifies the "handle_eb" instruction decoding by merging the
two eb_handlers jump tables from intercept.c and priv.c into a single table.

Signed-off-by: Thomas Huth 
Acked-by: Cornelia Huck 
Signed-off-by: Cornelia Huck 
---
 arch/s390/kvm/intercept.c | 85 ++-
 arch/s390/kvm/kvm-s390.h  |  3 +-
 arch/s390/kvm/priv.c  | 78 ++-
 3 files changed, 81 insertions(+), 85 deletions(-)

diff --git a/arch/s390/kvm/intercept.c b/arch/s390/kvm/intercept.c
index f0b8be0..5ee56e5 100644
--- a/arch/s390/kvm/intercept.c
+++ b/arch/s390/kvm/intercept.c
@@ -22,87 +22,6 @@
 #include "trace.h"
 #include "trace-s390.h"
 
-static int handle_lctlg(struct kvm_vcpu *vcpu)
-{
-   int reg1 = (vcpu->arch.sie_block->ipa & 0x00f0) >> 4;
-   int reg3 = vcpu->arch.sie_block->ipa & 0x000f;
-   u64 useraddr;
-   int reg, rc;
-
-   vcpu->stat.instruction_lctlg++;
-
-   useraddr = kvm_s390_get_base_disp_rsy(vcpu);
-
-   if (useraddr & 7)
-   return kvm_s390_inject_program_int(vcpu, PGM_SPECIFICATION);
-
-   reg = reg1;
-
-   VCPU_EVENT(vcpu, 5, "lctlg r1:%x, r3:%x, addr:%llx", reg1, reg3,
-  useraddr);
-   trace_kvm_s390_handle_lctl(vcpu, 1, reg1, reg3, useraddr);
-
-   do {
-   rc = get_guest(vcpu, vcpu->arch.sie_block->gcr[reg],
-  (u64 __user *) useraddr);
-   if (rc)
-   return kvm_s390_inject_program_int(vcpu, PGM_ADDRESSING);
-   useraddr += 8;
-   if (reg == reg3)
-   break;
-   reg = (reg + 1) % 16;
-   } while (1);
-   return 0;
-}
-
-static int handle_lctl(struct kvm_vcpu *vcpu)
-{
-   int reg1 = (vcpu->arch.sie_block->ipa & 0x00f0) >> 4;
-   int reg3 = vcpu->arch.sie_block->ipa & 0x000f;
-   u64 useraddr;
-   u32 val = 0;
-   int reg, rc;
-
-   vcpu->stat.instruction_lctl++;
-
-   useraddr = kvm_s390_get_base_disp_rs(vcpu);
-
-   if (useraddr & 3)
-   return kvm_s390_inject_program_int(vcpu, PGM_SPECIFICATION);
-
-   VCPU_EVENT(vcpu, 5, "lctl r1:%x, r3:%x, addr:%llx", reg1, reg3,
-  useraddr);
-   trace_kvm_s390_handle_lctl(vcpu, 0, reg1, reg3, useraddr);
-
-   reg = reg1;
-   do {
-   rc = get_guest(vcpu, val, (u32 __user *) useraddr);
-   if (rc)
-   return kvm_s390_inject_program_int(vcpu, PGM_ADDRESSING);
-   vcpu->arch.sie_block->gcr[reg] &= 0xffffffff00000000ul;
-   vcpu->arch.sie_block->gcr[reg] |= val;
-   useraddr += 4;
-   if (reg == reg3)
-   break;
-   reg = (reg + 1) % 16;
-   } while (1);
-   return 0;
-}
-
-static const intercept_handler_t eb_handlers[256] = {
-   [0x2f] = handle_lctlg,
-   [0x8a] = kvm_s390_handle_priv_eb,
-};
-
-static int handle_eb(struct kvm_vcpu *vcpu)
-{
-   intercept_handler_t handler;
-
-   handler = eb_handlers[vcpu->arch.sie_block->ipb & 0xff];
-   if (handler)
-   return handler(vcpu);
-   return -EOPNOTSUPP;
-}
 
 static const intercept_handler_t instruction_handlers[256] = {
[0x01] = kvm_s390_handle_01,
@@ -110,10 +29,10 @@ static const intercept_handler_t instruction_handlers[256] = {
[0x83] = kvm_s390_handle_diag,
[0xae] = kvm_s390_handle_sigp,
[0xb2] = kvm_s390_handle_b2,
-   [0xb7] = handle_lctl,
+   [0xb7] = kvm_s390_handle_lctl,
[0xb9] = kvm_s390_handle_b9,
[0xe5] = kvm_s390_handle_e5,
-   [0xeb] = handle_eb,
+   [0xeb] = kvm_s390_handle_eb,
 };
 
 static int handle_noop(struct kvm_vcpu *vcpu)
diff --git a/arch/s390/kvm/kvm-s390.h b/arch/s390/kvm/kvm-s390.h
index 15795b8..028ca9f 100644
--- a/arch/s390/kvm/kvm-s390.h
+++ b/arch/s390/kvm/kvm-s390.h
@@ -132,7 +132,8 @@ int kvm_s390_handle_e5(struct kvm_vcpu *vcpu);
 int kvm_s390_handle_01(struct kvm_vcpu *vcpu);
 int kvm_s390_handle_b9(struct kvm_vcpu *vcpu);
 int kvm_s390_handle_lpsw(struct kvm_vcpu *vcpu);
-int kvm_s390_handle_priv_eb(struct kvm_vcpu *vcpu);
+int kvm_s390_handle_lctl(struct kvm_vcpu *vcpu);
+int kvm_s390_handle_eb(struct kvm_vcpu *vcpu);
 
 /* implemented in sigp.c */
 int kvm_s390_handle_sigp(struct kvm_vcpu *vcpu);
diff --git a/arch/s390/kvm/priv.c b/arch/s390/kvm/priv.c
index 4b8fb6c..c7603f5 100644
--- a/arch/s390/kvm/priv.c
+++ b/arch/s390/kvm/priv.c
@@ -590,11 +590,87 @@ int kvm_s390_handle_b9(struct kvm_vcpu *vcpu)
return -EOPNOTSUPP;
 }
 
+int kvm_s390_handle_lctl(struct kvm_vcpu *vcpu)
+{
+   int reg1 = (vcpu->arch.sie_block->ipa & 0x00f0) >> 4;
+  

Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling

2013-06-20 Thread Alex Williamson
On Thu, 2013-06-20 at 18:48 +1000, Alexey Kardashevskiy wrote:
> On 06/20/2013 05:47 PM, Benjamin Herrenschmidt wrote:
> > On Thu, 2013-06-20 at 15:28 +1000, David Gibson wrote:
> >>> Just out of curiosity - would not get_file() and fput_atomic() on a
> >> group's
> >>> file* do the right job instead of vfio_group_add_external_user() and
> >>> vfio_group_del_external_user()?
> >>
> >> I was thinking that too.  Grabbing a file reference would certainly be
> >> the usual way of handling this sort of thing.
> > 
> > But that wouldn't prevent the group ownership to be returned to
> > the kernel or another user would it ?
> 
> 
> Holding the file pointer does not let the group->container_users counter go
> to zero

How so?  Holding the file pointer means the file won't go away, which
means the group release function won't be called.  That means the group
won't go away, but that doesn't mean it's attached to an IOMMU.  A user
could call UNSET_CONTAINER.

>  and this is exactly what vfio_group_add_external_user() and
> vfio_group_del_external_user() do. The difference is only in absolute value
> - 2 vs. 3.
> 
> No change in behaviour whether I use new vfio API or simply hold file* till
> KVM closes fd created when IOMMU was connected to LIOBN.

By that notion you could open(/dev/vfio/$GROUP) and you're safe, right?
But what about SET_CONTAINER & SET_IOMMU?  All that you guarantee
holding the file pointer is that the vfio_group exists.

> And while this counter is not zero, QEMU cannot take ownership over the group.
>
> I am definitely still missing the bigger picture...

The bigger picture is that the group needs to exist AND it needs to be
setup and maintained to have IOMMU protection.  Actually, my first stab
at add_external_user doesn't look sufficient, it needs to look more like
vfio_group_get_device_fd, checking group->container->iommu and
group_viable().  As written it would allow an external user after
SET_CONTAINER without SET_IOMMU.  It should also be part of the API that
the external user must hold the file reference between add_external_use
and del_external_user and do cleanup on any exit paths.  Thanks,

Alex
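
The contract Alex describes — an external user may only attach after both
SET_CONTAINER and SET_IOMMU have completed, must keep its reference while
using the group, and must drop it on every exit path — can be modeled with a
small userspace toy. All names and fields below are illustrative stand-ins,
not the real VFIO internals:

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for the real vfio structures. */
struct container { int iommu_set; };
struct group {
    struct container *container;   /* NULL until SET_CONTAINER */
    int container_users;           /* pins the container while > 0 */
};

/* Fails unless the group is fully set up (container + IOMMU attached),
 * mirroring the check the external-user API is argued to need. */
int add_external_user(struct group *g)
{
    if (g->container == NULL || !g->container->iommu_set)
        return -1;                 /* SET_CONTAINER/SET_IOMMU not done */
    g->container_users++;
    return 0;
}

void del_external_user(struct group *g)
{
    assert(g->container_users > 0);
    g->container_users--;          /* must run on every exit path */
}
```

A caller would pair these with holding the group's file reference for the
whole add/del window, so neither the group nor its container setup can go
away underneath it.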



[PATCH v4 3/3] kvm-unit-tests: Change two cases to use trap_emulator

2013-06-20 Thread Arthur Chunqi Li
Change two functions (test_mmx_movq_mf and test_movabs) to use the
unified trap_emulator.

Signed-off-by: Arthur Chunqi Li 
---
 x86/emulator.c |   70 
 1 file changed, 15 insertions(+), 55 deletions(-)

diff --git a/x86/emulator.c b/x86/emulator.c
index 876ee5a..68d2b93 100644
--- a/x86/emulator.c
+++ b/x86/emulator.c
@@ -789,38 +789,20 @@ static void advance_rip_by_3_and_note_exception(struct 
ex_regs *regs)
 static void test_mmx_movq_mf(uint64_t *mem, uint8_t *insn_page,
 uint8_t *alt_insn_page, void *insn_ram)
 {
-uint16_t fcw = 0;  // all exceptions unmasked
-ulong *cr3 = (ulong *)read_cr3();
-
-write_cr0(read_cr0() & ~6);  // TS, EM
-// Place a trapping instruction in the page to trigger a VMEXIT
-insn_page[0] = 0x89; // mov %eax, (%rax)
-insn_page[1] = 0x00;
-insn_page[2] = 0x90; // nop
-insn_page[3] = 0xc3; // ret
-// Place the instruction we want the hypervisor to see in the alternate page
-alt_insn_page[0] = 0x0f; // movq %mm0, (%rax)
-alt_insn_page[1] = 0x7f;
-alt_insn_page[2] = 0x00;
-alt_insn_page[3] = 0xc3; // ret
+uint16_t fcw = 0;  /* all exceptions unmasked */
+/* movq %mm0, (%rax) */
+void *stack = alloc_page();
 
+write_cr0(read_cr0() & ~6);  /* TS, EM */
 exceptions = 0;
 handle_exception(MF_VECTOR, advance_rip_by_3_and_note_exception);
-
-// Load the code TLB with insn_page, but point the page tables at
-// alt_insn_page (and keep the data TLB clear, for AMD decode assist).
-// This will make the CPU trap on the insn_page instruction but the
-// hypervisor will see alt_insn_page.
-install_page(cr3, virt_to_phys(insn_page), insn_ram);
 asm volatile("fninit; fldcw %0" : : "m"(fcw));
-asm volatile("fldz; fldz; fdivp"); // generate exception
-invlpg(insn_ram);
-// Load code TLB
-asm volatile("call *%0" : : "r"(insn_ram + 3));
-install_page(cr3, virt_to_phys(alt_insn_page), insn_ram);
-// Trap, let hypervisor emulate at alt_insn_page
-asm volatile("call *%0" : : "r"(insn_ram), "a"(mem));
-// exit MMX mode
+asm volatile("fldz; fldz; fdivp"); /* generate exception */
+
+MK_INSN(mmx_movq_mf, "movq %mm0, (%rax) \n\t");
+inregs = (struct regs){ .rsp=(u64)stack+1024 };
+trap_emulator(mem, alt_insn_page, &insn_mmx_movq_mf);
+/* exit MMX mode */
 asm volatile("fnclex; emms");
 report("movq mmx generates #MF", exceptions == 1);
 handle_exception(MF_VECTOR, 0);
@@ -829,33 +811,11 @@ static void test_mmx_movq_mf(uint64_t *mem, uint8_t *insn_page,
 static void test_movabs(uint64_t *mem, uint8_t *insn_page,
   uint8_t *alt_insn_page, void *insn_ram)
 {
-uint64_t val = 0;
-ulong *cr3 = (ulong *)read_cr3();
-
-// Pad with RET instructions
-memset(insn_page, 0xc3, 4096);
-memset(alt_insn_page, 0xc3, 4096);
-// Place a trapping instruction in the page to trigger a VMEXIT
-insn_page[0] = 0x89; // mov %eax, (%rax)
-insn_page[1] = 0x00;
-// Place the instruction we want the hypervisor to see in the alternate
-// page. A buggy hypervisor will fetch a 32-bit immediate and return
-// 0xc3c3c3c3.
-alt_insn_page[0] = 0x48; // mov $0xc3c3c3c3c3c3c3c3, %rcx
-alt_insn_page[1] = 0xb9;
-
-// Load the code TLB with insn_page, but point the page tables at
-// alt_insn_page (and keep the data TLB clear, for AMD decode assist).
-// This will make the CPU trap on the insn_page instruction but the
-// hypervisor will see alt_insn_page.
-install_page(cr3, virt_to_phys(insn_page), insn_ram);
-// Load code TLB
-invlpg(insn_ram);
-asm volatile("call *%0" : : "r"(insn_ram + 3));
-// Trap, let hypervisor emulate at alt_insn_page
-install_page(cr3, virt_to_phys(alt_insn_page), insn_ram);
-asm volatile("call *%1" : "=c"(val) : "r"(insn_ram), "a"(mem), "c"(0));
-report("64-bit mov imm", val == 0xc3c3c3c3c3c3c3c3);
+/* mov $0x9090909090909090, %rcx */
+MK_INSN(movabs, "mov $0x9090909090909090, %rcx\n\t");
+inregs = (struct regs){ 0 };
+trap_emulator(mem, alt_insn_page, &insn_movabs);
+report("64-bit mov imm2", outregs.rcx == 0x9090909090909090);
 }
 
 static void test_crosspage_mmio(volatile uint8_t *mem)
-- 
1.7.9.5



[PATCH v4 1/3] kvm-unit-tests: Add memcpy to lib/string.c

2013-06-20 Thread Arthur Chunqi Li
Add memcpy(void *dest, const void *src, size_t n) to lib/string.c.
This function acts the same as memcpy in libc.

Signed-off-by: Arthur Chunqi Li 
---
 lib/libcflat.h |1 +
 lib/string.c   |   12 
 2 files changed, 13 insertions(+)

diff --git a/lib/libcflat.h b/lib/libcflat.h
index 0875bd9..fadc33d 100644
--- a/lib/libcflat.h
+++ b/lib/libcflat.h
@@ -50,6 +50,7 @@ extern int vsnprintf(char *buf, int size, const char *fmt, va_list va);
 extern void puts(const char *s);
 
 extern void *memset(void *s, int c, size_t n);
+extern void *memcpy(void *dest, const void *src, size_t n);
 
 extern long atol(const char *ptr);
 #define ARRAY_SIZE(_a)  (sizeof(_a)/sizeof((_a)[0]))
diff --git a/lib/string.c b/lib/string.c
index 9dc94a1..e798f86 100644
--- a/lib/string.c
+++ b/lib/string.c
@@ -42,6 +42,18 @@ void *memset(void *s, int c, size_t n)
 return s;
 }
 
+void *memcpy(void *dest, const void *src, size_t n)
+{
+size_t i;
+char *a = dest;
+const char *b = src;
+
+for (i = 0; i < n; ++i)
+a[i] = b[i];
+
+return dest;
+}
+
 long atol(const char *ptr)
 {
 long acc = 0;
-- 
1.7.9.5



[PATCH v4 2/3] kvm-unit-tests: Add a func to run instruction in emulator

2013-06-20 Thread Arthur Chunqi Li
Add a function trap_emulator to run an instruction in the emulator.
Set inregs first (%rax is invalid because it is used as the return
address), put the instruction code in alt_insn, and call the function
with alt_insn_length. Get the results in outregs.

Signed-off-by: Arthur Chunqi Li 
---
 x86/emulator.c |   95 
 1 file changed, 95 insertions(+)

diff --git a/x86/emulator.c b/x86/emulator.c
index 96576e5..876ee5a 100644
--- a/x86/emulator.c
+++ b/x86/emulator.c
@@ -11,6 +11,20 @@ int fails, tests;
 
 static int exceptions;
 
+struct regs {
+   u64 rax, rbx, rcx, rdx;
+   u64 rsi, rdi, rsp, rbp;
+   u64 r8, r9, r10, r11;
+   u64 r12, r13, r14, r15;
+   u64 rip, rflags;
+};
+struct regs inregs, outregs, save;
+
+struct insn_desc {
+   u64 ptr;
+   size_t len;
+};
+
 void report(const char *name, int result)
 {
++tests;
@@ -685,6 +699,87 @@ static void test_shld_shrd(u32 *mem)
 report("shrd (cl)", *mem == ((0x12345678 >> 3) | (5u << 29)));
 }
 
+#define INSN_XCHG_ALL  \
+   "xchg %rax, 0+save \n\t"\
+   "xchg %rbx, 8+save \n\t"\
+   "xchg %rcx, 16+save \n\t"   \
+   "xchg %rdx, 24+save \n\t"   \
+   "xchg %rsi, 32+save \n\t"   \
+   "xchg %rdi, 40+save \n\t"   \
+   "xchg %rsp, 48+save \n\t"   \
+   "xchg %rbp, 56+save \n\t"   \
+   "xchg %r8, 64+save \n\t"\
+   "xchg %r9, 72+save \n\t"\
+   "xchg %r10, 80+save \n\t"   \
+   "xchg %r11, 88+save \n\t"   \
+   "xchg %r12, 96+save \n\t"   \
+   "xchg %r13, 104+save \n\t"  \
+   "xchg %r14, 112+save \n\t"  \
+   "xchg %r15, 120+save \n\t"
+
+asm(
+   ".align 4096\n\t"
+   "insn_page:\n\t"
+   "ret\n\t"
+   "pushf\n\t"
+   "push 136+save \n\t"
+   "popf \n\t"
+   INSN_XCHG_ALL
+   "test_insn:\n\t"
+   "in  (%dx),%al\n\t"
+   ".skip 31, 0x90\n\t"
+   "test_insn_end:\n\t"
+   INSN_XCHG_ALL
+   "pushf \n\t"
+   "pop 136+save \n\t"
+   "popf \n\t"
+   "ret \n\t"
+   "insn_page_end:\n\t"
+   ".align 4096\n\t"
+);
+
+#define MK_INSN(name, str) \
+asm (  \
+".pushsection .data.insn  \n\t"\
+"insn_" #name ": \n\t" \
+".quad 1001f, 1002f - 1001f \n\t"  \
+".popsection \n\t" \
+".pushsection .text.insn, \"ax\" \n\t" \
+"1001: \n\t"   \
+"insn_code_" #name ": " str " \n\t"\
+"1002: \n\t"   \
+".popsection"  \
+); \
+extern struct insn_desc insn_##name;
+
+static void trap_emulator(uint64_t *mem, void *alt_insn_page,
+   struct insn_desc *alt_insn)
+{
+   ulong *cr3 = (ulong *)read_cr3();
+   void *insn_ram;
+   extern u8 insn_page[], test_insn[];
+
+   insn_ram = vmap(virt_to_phys(insn_page), 4096);
+   memcpy(alt_insn_page, insn_page, 4096);
+   memcpy(alt_insn_page + (test_insn - insn_page),
+   (void *)(alt_insn->ptr), alt_insn->len);
+   save = inregs;
+
+   /* Load the code TLB with insn_page, but point the page tables at
+  alt_insn_page (and keep the data TLB clear, for AMD decode assist).
+  This will make the CPU trap on the insn_page instruction but the
+  hypervisor will see alt_insn_page. */
+   install_page(cr3, virt_to_phys(insn_page), insn_ram);
+   invlpg(insn_ram);
+   /* Load code TLB */
+   asm volatile("call *%0" : : "r"(insn_ram));
+   install_page(cr3, virt_to_phys(alt_insn_page), insn_ram);
+   /* Trap, let hypervisor emulate at alt_insn_page */
+   asm volatile("call *%0": : "r"(insn_ram+1));
+
+   outregs = save;
+}
+
 static void advance_rip_by_3_and_note_exception(struct ex_regs *regs)
 {
 ++exceptions;
-- 
1.7.9.5



Re: [PATCH] KVM: MMU: Inform users of mmio generation wraparound

2013-06-20 Thread Takuya Yoshikawa
On Thu, 20 Jun 2013 15:14:42 +0200
Paolo Bonzini  wrote:

> Il 20/06/2013 14:54, Gleb Natapov ha scritto:
> >> If they see mysterious peformance problems induced by this wraparound, the 
> >> only
> >> way to know the cause later is by this kind of information in the syslog.
> >> So even the first wraparound may better be printed out IMO.
> > Think about starting hundreds VMs on a freshly booted host. You will see
> > hundreds of those pretty quickly.
> 
> With the change I made to Xiao's patch (changing -13 to -150) you won't
> see it immediately after startup, but the first wraparound may still
> come very soon with a loop that reads the ROM.  (The second takes 5
> minutes).
> 
> >> I want to let administrators know the cause if possible, any better way?
> >>
> > Not that I can think of. Paolo what about print_once() and ignore first
> > wraparound?
> 
> printk_ratelimited is enough, even without ignoring the first
> wraparound.  It will handle the case of multiple VMs too.

OK, I'm now trying printk_ratelimited() version.
Will send v2 if it works.

Takuya


Re: [PATCH 0/7] KVM: arm-vgic: Implement save/restore of VGIC state

2013-06-20 Thread Alexander Graf

On 06/11/2013 06:51 AM, Christoffer Dall wrote:

This patch series adds support for user space save/restore of the VGIC
state.  Instead of expanding the ONE_REG interface, which works on
VCPUs, we first introduce support for the new KVM device control API and
the VGIC.  Now, instead of calling KVM_CREATE_IRQCHIP, user space can
call KVM_CREATE_DEVICE and perform operations on the device fd, such as
KVM_SET_DEVICE_ATTR to set a device attribute.

We leverage the KVM_{SET/GET}_DEVICE_ATTR API to export the state of the
VGIC to user space.  Instead of coming up with our own custom format for
exporting the VGIC state, we simply export all the state visible to an
emulated guest, which must contain the full GIC state to provide
save/restore of the GIC state for power management purposes.  This
further provides the benefit of being able to re-use the MMIO emulation
code for the distributor for save/restore.

However, the need to save/restore cpu-specific state demands that user
space can save/restore state accessible through the CPU interface, and
we therefore add an emulation interface for the CPU-specific interface.

This is considered a first attempt, and I am not married to the device
control API.  If there are good technical arguments to take another
approach, I am of course willing to discuss this.  However, my attempts
with the ONE_REG interface did not look very nice.

   [ WARNING: The core functionality of this patch set is completely
   untested; the basic KVM system has been briefly tested on TC2 and it
   doesn't seem like I've broken existing functionality. ]

I wanted to get this out early to get feedback on the overall API and
idea, and meanwhile I'm writing some QEMU code for the user space side
to test the new functionality.

Patches are against kvm-arm-next and also available here:
git://git.linaro.org/people/cdall/linux-kvm-arm.git vgic-migrate


I don't see major glitches :)

Reviewed-by: Alexander Graf 


Alex



Re: [PATCH] KVM: MMU: Inform users of mmio generation wraparound

2013-06-20 Thread Takuya Yoshikawa
On Thu, 20 Jun 2013 15:54:38 +0300
Gleb Natapov  wrote:

> On Thu, Jun 20, 2013 at 09:28:37PM +0900, Takuya Yoshikawa wrote:
> > On Thu, 20 Jun 2013 14:45:04 +0300
> > Gleb Natapov  wrote:
> > 
> > > On Thu, Jun 20, 2013 at 12:59:54PM +0200, Paolo Bonzini wrote:
> > > > Il 20/06/2013 10:59, Takuya Yoshikawa ha scritto:
> > > > > Without this information, users will just see unexpected performance
> > > > > problems and there is little chance we will get good reports from 
> > > > > them:
> > > > > note that mmio generation is increased even when we just start, or 
> > > > > stop,
> > > > > dirty logging for some memory slot, in which case users should never
> > > > > expect all shadow pages to be zapped.
> > > > > 
> > > > > Signed-off-by: Takuya Yoshikawa 
> > > > > ---
> > > > >  arch/x86/kvm/mmu.c |4 +++-
> > > > >  1 file changed, 3 insertions(+), 1 deletion(-)
> > > > > 
> > > > > diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> > > > > index c60c5da..bc8302f 100644
> > > > > --- a/arch/x86/kvm/mmu.c
> > > > > +++ b/arch/x86/kvm/mmu.c
> > > > > @@ -4385,8 +4385,10 @@ void kvm_mmu_invalidate_mmio_sptes(struct kvm 
> > > > > *kvm)
> > > > >* The max value is MMIO_MAX_GEN - 1 since it is not called
> > > > >* when mark memslot invalid.
> > > > >*/
> > > > > - if (unlikely(kvm_current_mmio_generation(kvm) >= (MMIO_MAX_GEN 
> > > > > - 1)))
> > > > > + if (unlikely(kvm_current_mmio_generation(kvm) >= (MMIO_MAX_GEN 
> > > > > - 1))) {
> > > > > + printk(KERN_INFO "kvm: zapping shadow pages for mmio 
> > > > > generation wraparound");
> > > > 
> > > > This should at least be rate-limited, because it is guest triggerable.
> > > > 
> > > It will be hard for guest to triggers it 1 << 19 times too fast though.
> > 
> > I think guest-triggerable zap_all itself is a threat for the host, rather
> > than a matter of log flooding, even if it can be preempted.
> > 
> It's not much we can do about it. Slot removal/creation is triggerable
> through HW emulation registers.

OK, I see.

> 
> > > 
> > > > But why isn't the kvm_mmu_invalidate_zap_all_pages tracepoint enough?
> > > > 
> > > This one will trigger during slot deletion/move too.
> > > 
> > > I would put it in to see if it actually triggers in some real world
> > > workloads (skipping the firs wraparound since it is intentional),
> > > we can always drop it if it will turn out to create a lot of noise.
> > >  
> > 
> > This patch is not for developers but for end users: of course they do not
> > use tracers during running their services normally.
> > 
> > If they see mysterious peformance problems induced by this wraparound, the 
> > only
> > way to know the cause later is by this kind of information in the syslog.
> > So even the first wraparound may better be printed out IMO.
> Think about starting hundreds VMs on a freshly booted host. You will see
> hundreds of those pretty quickly.

Yes.

> 
> > 
> > I want to let administrators know the cause if possible, any better way?
> > 
> Not that I can think of. Paolo what about print_once() and ignore first
> wraparound?

Assuming that the first one will be removed someday, it's for debugging anyway,
we can just do print_once() in the future?

That way, admins can check if there is any guest which did some problematic
things.

Takuya


Re: [PATCH] KVM: MMU: Inform users of mmio generation wraparound

2013-06-20 Thread Paolo Bonzini
Il 20/06/2013 14:54, Gleb Natapov ha scritto:
>> If they see mysterious peformance problems induced by this wraparound, the 
>> only
>> way to know the cause later is by this kind of information in the syslog.
>> So even the first wraparound may better be printed out IMO.
> Think about starting hundreds VMs on a freshly booted host. You will see
> hundreds of those pretty quickly.

With the change I made to Xiao's patch (changing -13 to -150) you won't
see it immediately after startup, but the first wraparound may still
come very soon with a loop that reads the ROM.  (The second takes 5
minutes).

>> I want to let administrators know the cause if possible, any better way?
>>
> Not that I can think of. Paolo what about print_once() and ignore first
> wraparound?

printk_ratelimited is enough, even without ignoring the first
wraparound.  It will handle the case of multiple VMs too.

Paolo
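
The ratelimit behaviour Paolo is referring to can be sketched in plain C:
allow a burst of messages, then suppress further ones until an interval has
passed. This is a userspace toy with an explicit clock, not the kernel's
printk_ratelimited implementation:

```c
#include <assert.h>
#include <stdbool.h>

struct ratelimit {
    long interval;   /* seconds per window */
    int  burst;      /* messages allowed per window */
    long start;      /* start of the current window */
    int  printed;    /* messages emitted in this window */
};

/* Returns true if the caller may emit a message at time `now`. */
bool ratelimit_ok(struct ratelimit *rl, long now)
{
    if (now - rl->start >= rl->interval) {  /* new window: reset */
        rl->start = now;
        rl->printed = 0;
    }
    if (rl->printed >= rl->burst)
        return false;                       /* suppressed */
    rl->printed++;
    return true;
}
```

With this shape, hundreds of VMs wrapping around at boot produce at most
`burst` log lines per window instead of one line per wraparound.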


[PULL 07/21] kvm: skip system call when msi route is unchanged

2013-06-20 Thread Michael S. Tsirkin
Some guests do a large number of mask/unmask
calls which currently trigger expensive route update
system calls.
Detect that the route is unchanged and skip the system call.
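
The optimization amounts to comparing the new entry with the stored one and
returning early when they are bytewise identical, so the expensive commit is
skipped. A minimal userspace sketch of that fast path (the struct is
illustrative, not the real kvm_irq_routing_entry):

```c
#include <assert.h>
#include <string.h>

struct route { unsigned gsi, type, flags, u; };

/* Returns 0 if the matching entry was already up to date (commit
 * skipped), 1 if it was updated, -1 if no entry matched. */
int update_routing_entry(struct route *table, int n,
                         const struct route *new_entry)
{
    for (int i = 0; i < n; i++) {
        if (table[i].gsi != new_entry->gsi)
            continue;
        if (!memcmp(&table[i], new_entry, sizeof *new_entry))
            return 0;            /* unchanged: skip the system call */
        table[i] = *new_entry;   /* changed: would commit routes here */
        return 1;
    }
    return -1;
}
```

Note the memcmp is only safe because every byte of both entries is
initialized — which is exactly what the previous patch in this series
arranges.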

Reported-by: "Zhanghaoyu (A)" 
Signed-off-by: Michael S. Tsirkin 
---
 kvm-all.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/kvm-all.c b/kvm-all.c
index f119ce1..891722b 100644
--- a/kvm-all.c
+++ b/kvm-all.c
@@ -1026,6 +1026,10 @@ static int kvm_update_routing_entry(KVMState *s,
 continue;
 }
 
+if (!memcmp(entry, new_entry, sizeof *entry)) {
+return 0;
+}
+
 *entry = *new_entry;
 
 kvm_irqchip_commit_routes(s);
-- 
MST



[PULL 06/21] kvm: zero-initialize KVM_SET_GSI_ROUTING input

2013-06-20 Thread Michael S. Tsirkin
kvm_add_routing_entry makes an attempt to
zero-initialize any new routing entry.
However, it fails to initialize padding
within the u field of the structure
kvm_irq_routing_entry.

Other functions like kvm_irqchip_update_msi_route
also fail to initialize the padding field in
kvm_irq_routing_entry.

While mostly harmless, this would prevent us from
reusing these fields for something useful in
the future.

It's better to just make sure all input is initialized.

Once it is, we can also drop complex field by field assignment and just
do the simple *a = *b to update a route entry.
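
The padding problem is easy to reproduce in userspace: field-by-field
assignment copies the members but not the bytes in between, so a later
memcmp can disagree even when every field matches. A hedged sketch (the
struct is illustrative, not the real kvm_irq_routing_entry):

```c
#include <assert.h>
#include <string.h>

struct entry {
    char type;       /* typically followed by 3 bytes of padding */
    int  u;
};

/* Returns 1 when the two entries compare equal bytewise. */
int entries_identical(const struct entry *a, const struct entry *b)
{
    return memcmp(a, b, sizeof *a) == 0;
}

/* memset guarantees the padding is cleared (a "= {}" initializer
 * usually does too in practice), so whole-struct assignment and
 * memcmp stay consistent afterwards. */
void make_entry(struct entry *e, char type, int u)
{
    memset(e, 0, sizeof *e);   /* like the kroute = {} in the patch */
    e->type = type;
    e->u = u;
}
```

Once every entry is fully zero-initialized like this, the simple `*a = *b`
copy is safe because struct assignment copies the padding bytes along with
the members.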

Signed-off-by: Michael S. Tsirkin 
---
 kvm-all.c | 19 +++
 1 file changed, 7 insertions(+), 12 deletions(-)

diff --git a/kvm-all.c b/kvm-all.c
index 405480e..f119ce1 100644
--- a/kvm-all.c
+++ b/kvm-all.c
@@ -1006,11 +1006,8 @@ static void kvm_add_routing_entry(KVMState *s,
 }
 n = s->irq_routes->nr++;
 new = &s->irq_routes->entries[n];
-memset(new, 0, sizeof(*new));
-new->gsi = entry->gsi;
-new->type = entry->type;
-new->flags = entry->flags;
-new->u = entry->u;
+
+*new = *entry;
 
 set_gsi(s, entry->gsi);
 
@@ -1029,9 +1026,7 @@ static int kvm_update_routing_entry(KVMState *s,
 continue;
 }
 
-entry->type = new_entry->type;
-entry->flags = new_entry->flags;
-entry->u = new_entry->u;
+*entry = *new_entry;
 
 kvm_irqchip_commit_routes(s);
 
@@ -1043,7 +1038,7 @@ static int kvm_update_routing_entry(KVMState *s,
 
 void kvm_irqchip_add_irq_route(KVMState *s, int irq, int irqchip, int pin)
 {
-struct kvm_irq_routing_entry e;
+struct kvm_irq_routing_entry e = {};
 
 assert(pin < s->gsi_count);
 
@@ -1156,7 +1151,7 @@ int kvm_irqchip_send_msi(KVMState *s, MSIMessage msg)
 return virq;
 }
 
-route = g_malloc(sizeof(KVMMSIRoute));
+route = g_malloc0(sizeof(KVMMSIRoute));
 route->kroute.gsi = virq;
 route->kroute.type = KVM_IRQ_ROUTING_MSI;
 route->kroute.flags = 0;
@@ -1177,7 +1172,7 @@ int kvm_irqchip_send_msi(KVMState *s, MSIMessage msg)
 
 int kvm_irqchip_add_msi_route(KVMState *s, MSIMessage msg)
 {
-struct kvm_irq_routing_entry kroute;
+struct kvm_irq_routing_entry kroute = {};
 int virq;
 
 if (!kvm_gsi_routing_enabled()) {
@@ -1203,7 +1198,7 @@ int kvm_irqchip_add_msi_route(KVMState *s, MSIMessage msg)
 
 int kvm_irqchip_update_msi_route(KVMState *s, int virq, MSIMessage msg)
 {
-struct kvm_irq_routing_entry kroute;
+struct kvm_irq_routing_entry kroute = {};
 
 if (!kvm_irqchip_in_kernel()) {
 return -ENOSYS;
-- 
MST



Re: [PATCH] KVM: MMU: Inform users of mmio generation wraparound

2013-06-20 Thread Gleb Natapov
On Thu, Jun 20, 2013 at 09:28:37PM +0900, Takuya Yoshikawa wrote:
> On Thu, 20 Jun 2013 14:45:04 +0300
> Gleb Natapov  wrote:
> 
> > On Thu, Jun 20, 2013 at 12:59:54PM +0200, Paolo Bonzini wrote:
> > > Il 20/06/2013 10:59, Takuya Yoshikawa ha scritto:
> > > > Without this information, users will just see unexpected performance
> > > > problems and there is little chance we will get good reports from them:
> > > > note that mmio generation is increased even when we just start, or stop,
> > > > dirty logging for some memory slot, in which case users should never
> > > > expect all shadow pages to be zapped.
> > > > 
> > > > Signed-off-by: Takuya Yoshikawa 
> > > > ---
> > > >  arch/x86/kvm/mmu.c |4 +++-
> > > >  1 file changed, 3 insertions(+), 1 deletion(-)
> > > > 
> > > > diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> > > > index c60c5da..bc8302f 100644
> > > > --- a/arch/x86/kvm/mmu.c
> > > > +++ b/arch/x86/kvm/mmu.c
> > > > @@ -4385,8 +4385,10 @@ void kvm_mmu_invalidate_mmio_sptes(struct kvm 
> > > > *kvm)
> > > >  * The max value is MMIO_MAX_GEN - 1 since it is not called
> > > >  * when mark memslot invalid.
> > > >  */
> > > > -   if (unlikely(kvm_current_mmio_generation(kvm) >= (MMIO_MAX_GEN 
> > > > - 1)))
> > > > +   if (unlikely(kvm_current_mmio_generation(kvm) >= (MMIO_MAX_GEN 
> > > > - 1))) {
> > > > +   printk(KERN_INFO "kvm: zapping shadow pages for mmio 
> > > > generation wraparound");
> > > 
> > > This should at least be rate-limited, because it is guest triggerable.
> > > 
> > It will be hard for a guest to trigger it 1 << 19 times too fast though.
> 
> I think guest-triggerable zap_all itself is a threat for the host, rather
> than a matter of log flooding, even if it can be preempted.
> 
There's not much we can do about it. Slot removal/creation is triggerable
through HW emulation registers.

> > 
> > > But why isn't the kvm_mmu_invalidate_zap_all_pages tracepoint enough?
> > > 
> > This one will trigger during slot deletion/move too.
> > 
> > I would put it in to see if it actually triggers in some real world
> > workloads (skipping the first wraparound since it is intentional),
> > we can always drop it if it will turn out to create a lot of noise.
> >  
> 
> This patch is not for developers but for end users: of course they do not
> use tracers during running their services normally.
> 
> If they see mysterious performance problems induced by this wraparound, the 
> only
> way to know the cause later is by this kind of information in the syslog.
> So even the first wraparound may better be printed out IMO.
Think about starting hundreds VMs on a freshly booted host. You will see
hundreds of those pretty quickly.

> 
> I want to let administrators know the cause if possible, any better way?
> 
Not that I can think of. Paolo what about print_once() and ignore first
wraparound?

--
Gleb.


Re: [PATCH net] vhost-net: fix use-after-free in vhost_net_flush

2013-06-20 Thread Sergei Shtylyov

Hello.

On 20-06-2013 15:48, Michael S. Tsirkin wrote:


vhost_net_ubuf_put_and_wait has a confusing name:
it will actually also free its argument.
Thus since commit 1280c27f8e29acf4af2da914e80ec27c3dbd5c01


   Please also specify that commit's summary line in parens.


vhost_net_flush tries to use the argument after passing it
to vhost_net_ubuf_put_and_wait, this results
in use after free.
To fix, don't free the argument in vhost_net_ubuf_put_and_wait,
add a new API for callers that want to free ubufs.



Signed-off-by: Michael S. Tsirkin 


WBR, Sergei



Re: [PATCH] KVM: MMU: Inform users of mmio generation wraparound

2013-06-20 Thread Gleb Natapov
On Thu, Jun 20, 2013 at 02:35:36PM +0200, Paolo Bonzini wrote:
> Il 20/06/2013 13:45, Gleb Natapov ha scritto:
> > On Thu, Jun 20, 2013 at 12:59:54PM +0200, Paolo Bonzini wrote:
> >> Il 20/06/2013 10:59, Takuya Yoshikawa ha scritto:
> >>> Without this information, users will just see unexpected performance
> >>> problems and there is little chance we will get good reports from them:
> >>> note that mmio generation is increased even when we just start, or stop,
> >>> dirty logging for some memory slot, in which case users should never
> >>> expect all shadow pages to be zapped.
> >>>
> >>> Signed-off-by: Takuya Yoshikawa 
> >>> ---
> >>>  arch/x86/kvm/mmu.c |4 +++-
> >>>  1 file changed, 3 insertions(+), 1 deletion(-)
> >>>
> >>> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> >>> index c60c5da..bc8302f 100644
> >>> --- a/arch/x86/kvm/mmu.c
> >>> +++ b/arch/x86/kvm/mmu.c
> >>> @@ -4385,8 +4385,10 @@ void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm)
> >>>* The max value is MMIO_MAX_GEN - 1 since it is not called
> >>>* when mark memslot invalid.
> >>>*/
> >>> - if (unlikely(kvm_current_mmio_generation(kvm) >= (MMIO_MAX_GEN - 1)))
> >>> + if (unlikely(kvm_current_mmio_generation(kvm) >= (MMIO_MAX_GEN - 1))) {
> >>> + printk(KERN_INFO "kvm: zapping shadow pages for mmio generation 
> >>> wraparound");
> >>
> >> This should at least be rate-limited, because it is guest triggerable.
> >>
> > It will be hard for a guest to trigger it 1 << 19 times too fast though.
> > 
> >> But why isn't the kvm_mmu_invalidate_zap_all_pages tracepoint enough?
> >
> > This one will trigger during slot deletion/move too.
> > 
> > I would put it in to see if it actually triggers in some real world
> > workloads (skipping the first wraparound since it is intentional),
> > we can always drop it if it will turn out to create a lot of noise.
> 
> Reading a ROM in a loop can trigger it in less than 5 minutes on my
> machine.  Not a lot of noise, but enough to be annoying.  I think the
> existing tracepoint is enough, or we can add a more specific one here.
> 
5 minutes after first wraparound?

--
Gleb.


KVM call agenda for 2013-06-25

2013-06-20 Thread Michael S. Tsirkin
Please, send any topic that you are interested in covering.

Thanks, MST

-- 
MST


Re: [PATCH] KVM: MMU: Inform users of mmio generation wraparound

2013-06-20 Thread Paolo Bonzini
Il 20/06/2013 13:45, Gleb Natapov ha scritto:
> On Thu, Jun 20, 2013 at 12:59:54PM +0200, Paolo Bonzini wrote:
>> Il 20/06/2013 10:59, Takuya Yoshikawa ha scritto:
>>> Without this information, users will just see unexpected performance
>>> problems and there is little chance we will get good reports from them:
>>> note that mmio generation is increased even when we just start, or stop,
>>> dirty logging for some memory slot, in which case users should never
>>> expect all shadow pages to be zapped.
>>>
>>> Signed-off-by: Takuya Yoshikawa 
>>> ---
>>>  arch/x86/kvm/mmu.c |4 +++-
>>>  1 file changed, 3 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
>>> index c60c5da..bc8302f 100644
>>> --- a/arch/x86/kvm/mmu.c
>>> +++ b/arch/x86/kvm/mmu.c
>>> @@ -4385,8 +4385,10 @@ void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm)
>>>  * The max value is MMIO_MAX_GEN - 1 since it is not called
>>>  * when mark memslot invalid.
>>>  */
>>> -   if (unlikely(kvm_current_mmio_generation(kvm) >= (MMIO_MAX_GEN - 1)))
>>> +   if (unlikely(kvm_current_mmio_generation(kvm) >= (MMIO_MAX_GEN - 1))) {
>>> +   printk(KERN_INFO "kvm: zapping shadow pages for mmio generation 
>>> wraparound");
>>
>> This should at least be rate-limited, because it is guest triggerable.
>>
> It will be hard for a guest to trigger it 1 << 19 times too fast though.
> 
>> But why isn't the kvm_mmu_invalidate_zap_all_pages tracepoint enough?
>
> This one will trigger during slot deletion/move too.
> 
> I would put it in to see if it actually triggers in some real world
> workloads (skipping the first wraparound since it is intentional),
> we can always drop it if it will turn out to create a lot of noise.

Reading a ROM in a loop can trigger it in less than 5 minutes on my
machine.  Not a lot of noise, but enough to be annoying.  I think the
existing tracepoint is enough, or we can add a more specific one here.

Paolo


Re: [PATCH 2/2] kvm-unit-tests: Change two cases to use trap_emulator

2013-06-20 Thread Gleb Natapov
On Thu, Jun 20, 2013 at 06:45:22PM +0800, Arthur Chunqi Li wrote:
> Change two functions (test_mmx_movq_mf and test_movabs) using
> unified trap_emulator.
> 
> Signed-off-by: Arthur Chunqi Li 
> ---
>  x86/emulator.c |   70 
> 
>  1 file changed, 15 insertions(+), 55 deletions(-)
> 
> diff --git a/x86/emulator.c b/x86/emulator.c
> index b3626fa..16d63e0 100644
> --- a/x86/emulator.c
> +++ b/x86/emulator.c
> @@ -772,38 +772,19 @@ static void advance_rip_by_3_and_note_exception(struct 
> ex_regs *regs)
>  static void test_mmx_movq_mf(uint64_t *mem, uint8_t *insn_page,
>uint8_t *alt_insn_page, void *insn_ram)
>  {
> -uint16_t fcw = 0;  // all exceptions unmasked
> -ulong *cr3 = (ulong *)read_cr3();
> -
> -write_cr0(read_cr0() & ~6);  // TS, EM
> -// Place a trapping instruction in the page to trigger a VMEXIT
> -insn_page[0] = 0x89; // mov %eax, (%rax)
> -insn_page[1] = 0x00;
> -insn_page[2] = 0x90; // nop
> -insn_page[3] = 0xc3; // ret
> -// Place the instruction we want the hypervisor to see in the alternate 
> page
> -alt_insn_page[0] = 0x0f; // movq %mm0, (%rax)
> -alt_insn_page[1] = 0x7f;
> -alt_insn_page[2] = 0x00;
> -alt_insn_page[3] = 0xc3; // ret
> +uint16_t fcw = 0;  /* all exceptions unmasked */
> +u8 alt_insn[] = {0x0f, 0x7f, 0x00}; /* movq %mm0, (%rax) */
Let's introduce something akin to MK_INSN in x86/realmode.c.

> +void *stack = alloc_page();
>  
> +write_cr0(read_cr0() & ~6);  /* TS, EM */
>  exceptions = 0;
>  handle_exception(MF_VECTOR, advance_rip_by_3_and_note_exception);
> -
> -// Load the code TLB with insn_page, but point the page tables at
> -// alt_insn_page (and keep the data TLB clear, for AMD decode assist).
> -// This will make the CPU trap on the insn_page instruction but the
> -// hypervisor will see alt_insn_page.
> -install_page(cr3, virt_to_phys(insn_page), insn_ram);
>  asm volatile("fninit; fldcw %0" : : "m"(fcw));
> -asm volatile("fldz; fldz; fdivp"); // generate exception
> -invlpg(insn_ram);
> -// Load code TLB
> -asm volatile("call *%0" : : "r"(insn_ram + 3));
> -install_page(cr3, virt_to_phys(alt_insn_page), insn_ram);
> -// Trap, let hypervisor emulate at alt_insn_page
> -asm volatile("call *%0" : : "r"(insn_ram), "a"(mem));
> -// exit MMX mode
> +asm volatile("fldz; fldz; fdivp"); /* generate exception */
> +
> +inregs = (struct regs){ .rsp=(u64)stack+1024 };
> +trap_emulator(mem, alt_insn, 3);
> +/* exit MMX mode */
>  asm volatile("fnclex; emms");
>  report("movq mmx generates #MF", exceptions == 1);
>  handle_exception(MF_VECTOR, 0);
> @@ -812,33 +793,12 @@ static void test_mmx_movq_mf(uint64_t *mem, uint8_t 
> *insn_page,
>  static void test_movabs(uint64_t *mem, uint8_t *insn_page,
>  uint8_t *alt_insn_page, void *insn_ram)
>  {
> -uint64_t val = 0;
> -ulong *cr3 = (ulong *)read_cr3();
> -
> -// Pad with RET instructions
> -memset(insn_page, 0xc3, 4096);
> -memset(alt_insn_page, 0xc3, 4096);
> -// Place a trapping instruction in the page to trigger a VMEXIT
> -insn_page[0] = 0x89; // mov %eax, (%rax)
> -insn_page[1] = 0x00;
> -// Place the instruction we want the hypervisor to see in the alternate
> -// page. A buggy hypervisor will fetch a 32-bit immediate and return
> -// 0xc3c3c3c3.
> -alt_insn_page[0] = 0x48; // mov $0xc3c3c3c3c3c3c3c3, %rcx
> -alt_insn_page[1] = 0xb9;
> -
> -// Load the code TLB with insn_page, but point the page tables at
> -// alt_insn_page (and keep the data TLB clear, for AMD decode assist).
> -// This will make the CPU trap on the insn_page instruction but the
> -// hypervisor will see alt_insn_page.
> -install_page(cr3, virt_to_phys(insn_page), insn_ram);
> -// Load code TLB
> -invlpg(insn_ram);
> -asm volatile("call *%0" : : "r"(insn_ram + 3));
> -// Trap, let hypervisor emulate at alt_insn_page
> -install_page(cr3, virt_to_phys(alt_insn_page), insn_ram);
> -asm volatile("call *%1" : "=c"(val) : "r"(insn_ram), "a"(mem), "c"(0));
> -report("64-bit mov imm", val == 0xc3c3c3c3c3c3c3c3);
> +/* mov $0x9090909090909090, %rcx */
> +uint8_t alt_insn[] = {0x48, 0xb9, 0x90, 0x90, 0x90,
> + 0x90, 0x90, 0x90, 0x90, 0x90};
> +inregs = (struct regs){ 0 };
> +trap_emulator(mem, alt_insn, 10);
> +report("64-bit mov imm2", outregs.rcx == 0x9090909090909090);
>  }
>  
>  static void test_crosspage_mmio(volatile uint8_t *mem)
> -- 
> 1.7.9.5

--
Gleb.


Re: [PATCH 1/2] kvm-unit-tests: Add a func to run instruction in emulator

2013-06-20 Thread Gleb Natapov
On Thu, Jun 20, 2013 at 06:45:21PM +0800, Arthur Chunqi Li wrote:
> Add a function trap_emulator to run an instruction in emulator.
> Set inregs first (%rax is invalid because it is used as return
> address), put the instruction code in alt_insn and call func with
> alt_insn_length. Get results in outregs.
> 
Looks good, some comments below.

> Signed-off-by: Arthur Chunqi Li 
> ---
>  lib/libcflat.h |1 +
>  lib/string.c   |   12 +
>  x86/emulator.c |   78 
> 
>  3 files changed, 91 insertions(+)
> 
> diff --git a/lib/libcflat.h b/lib/libcflat.h
> index 0875bd9..fadc33d 100644
> --- a/lib/libcflat.h
> +++ b/lib/libcflat.h
> @@ -50,6 +50,7 @@ extern int vsnprintf(char *buf, int size, const char *fmt, 
> va_list va);
>  extern void puts(const char *s);
>  
>  extern void *memset(void *s, int c, size_t n);
> +extern void *memcpy(void *dest, const void *src, size_t n);
>  
>  extern long atol(const char *ptr);
>  #define ARRAY_SIZE(_a)  (sizeof(_a)/sizeof((_a)[0]))
> diff --git a/lib/string.c b/lib/string.c
> index 9dc94a1..e798f86 100644
> --- a/lib/string.c
> +++ b/lib/string.c
> @@ -42,6 +42,18 @@ void *memset(void *s, int c, size_t n)
>  return s;
>  }
>  
> +void *memcpy(void *dest, const void *src, size_t n)
> +{
> +size_t i;
> +char *a = dest;
> +char *b = src;
> +
> +for (i = 0; i < n; ++i)
> +a[i] = b[i];
> +
> +return dest;
> +}
> +
The memcpy addition should usually go in a separate patch, but for a unit
test it is not a big deal.

>  long atol(const char *ptr)
>  {
>  long acc = 0;
> diff --git a/x86/emulator.c b/x86/emulator.c
> index 96576e5..b3626fa 100644
> --- a/x86/emulator.c
> +++ b/x86/emulator.c
> @@ -11,6 +11,15 @@ int fails, tests;
>  
>  static int exceptions;
>  
> +struct regs {
> + u64 rax, rbx, rcx, rdx;
> + u64 rsi, rdi, rsp, rbp;
> + u64 r8, r9, r10, r11;
> + u64 r12, r13, r14, r15;
> + u64 rip, rflags;
> +};
> +struct regs inregs, outregs, save;
> +
>  void report(const char *name, int result)
>  {
>   ++tests;
> @@ -685,6 +694,75 @@ static void test_shld_shrd(u32 *mem)
>  report("shrd (cl)", *mem == ((0x12345678 >> 3) | (5u << 29)));
>  }
>  
> +#define INSN_XCHG_ALL\
> + "xchg %rax, 0+save \n\t"\
> + "xchg %rbx, 8+save \n\t"\
> + "xchg %rcx, 16+save \n\t"   \
> + "xchg %rdx, 24+save \n\t"   \
> + "xchg %rsi, 32+save \n\t"   \
> + "xchg %rdi, 40+save \n\t"   \
> + "xchg %rsp, 48+save \n\t"   \
> + "xchg %rbp, 56+save \n\t"   \
> + "xchg %r8, 64+save \n\t"\
> + "xchg %r9, 72+save \n\t"\
> + "xchg %r10, 80+save \n\t"   \
> + "xchg %r11, 88+save \n\t"   \
> + "xchg %r12, 96+save \n\t"   \
> + "xchg %r13, 104+save \n\t"  \
> + "xchg %r14, 112+save \n\t"  \
> + "xchg %r15, 120+save \n\t"  \
> +
> +asm(
> + ".align 4096\n\t"
> + "insn_page:\n\t"
> + "ret\n\t"
> + "pushf\n\t"
> + "push 136+save \n\t"
> + "popf \n\t"
> + INSN_XCHG_ALL
> + "test_insn:\n\t"
> + "in  (%dx),%al\n\t"
> + ".skip 31, 0x90\n\t"
> + "test_insn_end:\n\t"
> + INSN_XCHG_ALL
> + "pushf \n\t"
> + "pop 136+save \n\t"
> + "popf \n\t"
> + "ret \n\t"
> + "insn_page_end:\n\t"
> + ".align 4096\n\t"
> +
> + "alt_insn_page:\n\t"
> + ". = . + 4096\n\t"
> + ".align 4096\n\t"
alt_insn_page can be allocated by alloc_page().

> +);
> +
> +static void trap_emulator(uint64_t *mem, uint8_t* alt_insn, int 
> alt_insn_length)
> +{
> + ulong *cr3 = (ulong *)read_cr3();
> + void *insn_ram;
> + extern u8 insn_page[], test_insn[], alt_insn_page[];
> +
> + insn_ram = vmap(virt_to_phys(insn_page), 4096);
> + memcpy(alt_insn_page, test_insn, 4096);
> + memcpy(alt_insn_page + (test_insn - insn_page), alt_insn, 
> alt_insn_length);
> + save = inregs;
> +
> + /* Load the code TLB with insn_page, but point the page tables at
> +alt_insn_page (and keep the data TLB clear, for AMD decode assist).
> +This will make the CPU trap on the insn_page instruction but the
> +hypervisor will see alt_insn_page. */
> + install_page(cr3, virt_to_phys(insn_page), insn_ram);
> + invlpg(insn_ram);
> + /* Load code TLB */
> + asm volatile("call *%0" : : "r"(insn_ram));
> + install_page(cr3, virt_to_phys(alt_insn_page), insn_ram);
> + /* Trap, let hypervisor emulate at alt_insn_page */
> + asm volatile("call *%0": : "r"(insn_ram+1));
> +
> + outregs = save;
> +}
> +
>  static void advance_rip_by_3_and_note_exception(struct ex_regs *regs)
>  {
>  ++exceptions;
> -- 
> 1.7.9.5

--
Gleb.

Re: [PATCH] KVM: MMU: Inform users of mmio generation wraparound

2013-06-20 Thread Takuya Yoshikawa
On Thu, 20 Jun 2013 14:45:04 +0300
Gleb Natapov  wrote:

> On Thu, Jun 20, 2013 at 12:59:54PM +0200, Paolo Bonzini wrote:
> > Il 20/06/2013 10:59, Takuya Yoshikawa ha scritto:
> > > Without this information, users will just see unexpected performance
> > > problems and there is little chance we will get good reports from them:
> > > note that mmio generation is increased even when we just start, or stop,
> > > dirty logging for some memory slot, in which case users should never
> > > expect all shadow pages to be zapped.
> > > 
> > > Signed-off-by: Takuya Yoshikawa 
> > > ---
> > >  arch/x86/kvm/mmu.c |4 +++-
> > >  1 file changed, 3 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> > > index c60c5da..bc8302f 100644
> > > --- a/arch/x86/kvm/mmu.c
> > > +++ b/arch/x86/kvm/mmu.c
> > > @@ -4385,8 +4385,10 @@ void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm)
> > >* The max value is MMIO_MAX_GEN - 1 since it is not called
> > >* when mark memslot invalid.
> > >*/
> > > - if (unlikely(kvm_current_mmio_generation(kvm) >= (MMIO_MAX_GEN - 1)))
> > > + if (unlikely(kvm_current_mmio_generation(kvm) >= (MMIO_MAX_GEN - 1))) {
> > > + printk(KERN_INFO "kvm: zapping shadow pages for mmio generation 
> > > wraparound");
> > 
> > This should at least be rate-limited, because it is guest triggerable.
> > 
> It will be hard for a guest to trigger it 1 << 19 times too fast though.

I think guest-triggerable zap_all itself is a threat for the host, rather
than a matter of log flooding, even if it can be preempted.

> 
> > But why isn't the kvm_mmu_invalidate_zap_all_pages tracepoint enough?
> > 
> This one will trigger during slot deletion/move too.
> 
> I would put it in to see if it actually triggers in some real world
> workloads (skipping the first wraparound since it is intentional),
> we can always drop it if it will turn out to create a lot of noise.
>  

This patch is not for developers but for end users: of course they do not
use tracers during running their services normally.

If they see mysterious performance problems induced by this wraparound, the only
way to know the cause later is by this kind of information in the syslog.
So even the first wraparound may better be printed out IMO.

I want to let administrators know the cause if possible, any better way?

Takuya


[PATCH RFC] vhost-net: make more functions static

2013-06-20 Thread Michael S. Tsirkin
Make two more functions static - they only have local callers.

Signed-off-by: Michael S. Tsirkin 
---

Will be merged through the vhost tree - no need to apply
directly.

 drivers/vhost/net.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 534adb0..4c5de87 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -167,7 +167,7 @@ static void vhost_net_clear_ubuf_info(struct vhost_net *n)
}
 }
 
-int vhost_net_set_ubuf_info(struct vhost_net *n)
+static int vhost_net_set_ubuf_info(struct vhost_net *n)
 {
bool zcopy;
int i;
@@ -188,7 +188,7 @@ err:
return -ENOMEM;
 }
 
-void vhost_net_vq_reset(struct vhost_net *n)
+static void vhost_net_vq_reset(struct vhost_net *n)
 {
int i;
 
-- 
MST


[PATCH net for-stable] vhost-net: fix use-after-free in vhost_net_flush

2013-06-20 Thread Michael S. Tsirkin
vhost_net_ubuf_put_and_wait has a confusing name:
it will actually also free its argument.
Thus since commit 1280c27f8e29acf4af2da914e80ec27c3dbd5c01
vhost_net_flush tries to use the argument after passing it
to vhost_net_ubuf_put_and_wait, this results
in use after free.
To fix, don't free the argument in vhost_net_ubuf_put_and_wait,
add a new API for callers that want to free ubufs.

Signed-off-by: Michael S. Tsirkin 
---

I sent the same patch against 3.10 separately.
This one is for stable, against 3.9/3.8 - the code moved around
since then.
Not sending to stable directly since Dave Miller handles
it for net drivers normally.

 drivers/vhost/net.c   | 4 ++--
 drivers/vhost/vhost.c | 5 +
 drivers/vhost/vhost.h | 1 +
 3 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index ec6fb3f..e5ff7a5 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -856,7 +856,7 @@ static long vhost_net_set_backend(struct vhost_net *n, 
unsigned index, int fd)
mutex_unlock(&vq->mutex);
 
if (oldubufs) {
-   vhost_ubuf_put_and_wait(oldubufs);
+   vhost_ubuf_put_wait_and_free(oldubufs);
mutex_lock(&vq->mutex);
vhost_zerocopy_signal_used(n, vq);
mutex_unlock(&vq->mutex);
@@ -874,7 +874,7 @@ err_used:
rcu_assign_pointer(vq->private_data, oldsock);
vhost_net_enable_vq(n, vq);
if (ubufs)
-   vhost_ubuf_put_and_wait(ubufs);
+   vhost_ubuf_put_wait_and_free(ubufs);
 err_ubufs:
fput(sock->file);
 err_vq:
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 9759249..c01d22f 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -1581,5 +1581,10 @@ void vhost_ubuf_put_and_wait(struct vhost_ubuf_ref 
*ubufs)
 {
kref_put(&ubufs->kref, vhost_zerocopy_done_signal);
wait_event(ubufs->wait, !atomic_read(&ubufs->kref.refcount));
+}
+
+void vhost_ubuf_put_wait_and_free(struct vhost_ubuf_ref *ubufs)
+{
+   vhost_ubuf_put_and_wait(ubufs);
kfree(ubufs);
 }
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 17261e2..b53cb28 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -63,6 +63,7 @@ struct vhost_ubuf_ref {
 struct vhost_ubuf_ref *vhost_ubuf_alloc(struct vhost_virtqueue *, bool zcopy);
 void vhost_ubuf_put(struct vhost_ubuf_ref *);
 void vhost_ubuf_put_and_wait(struct vhost_ubuf_ref *);
+void vhost_ubuf_put_wait_and_free(struct vhost_ubuf_ref *);
 
 struct ubuf_info;
 
-- 
MST


[PATCH net] vhost-net: fix use-after-free in vhost_net_flush

2013-06-20 Thread Michael S. Tsirkin
vhost_net_ubuf_put_and_wait has a confusing name:
it will actually also free its argument.
Thus since commit 1280c27f8e29acf4af2da914e80ec27c3dbd5c01
vhost_net_flush tries to use the argument after passing it
to vhost_net_ubuf_put_and_wait, this results
in use after free.
To fix, don't free the argument in vhost_net_ubuf_put_and_wait,
add a new API for callers that want to free ubufs.

Signed-off-by: Michael S. Tsirkin 
---

Dave, this is needed for stable as well, but
the code has moved a bit since then.
I'll post a patch tweaked to apply against 3.9 and 3.8 separately.

 drivers/vhost/net.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 5c77d6a..534adb0 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -149,6 +149,11 @@ static void vhost_net_ubuf_put_and_wait(struct 
vhost_net_ubuf_ref *ubufs)
 {
kref_put(&ubufs->kref, vhost_net_zerocopy_done_signal);
wait_event(ubufs->wait, !atomic_read(&ubufs->kref.refcount));
+}
+
+static void vhost_net_ubuf_put_wait_and_free(struct vhost_net_ubuf_ref *ubufs)
+{
+   vhost_net_ubuf_put_and_wait(ubufs);
kfree(ubufs);
 }
 
@@ -1073,7 +1078,7 @@ static long vhost_net_set_backend(struct vhost_net *n, 
unsigned index, int fd)
mutex_unlock(&vq->mutex);
 
if (oldubufs) {
-   vhost_net_ubuf_put_and_wait(oldubufs);
+   vhost_net_ubuf_put_wait_and_free(oldubufs);
mutex_lock(&vq->mutex);
vhost_zerocopy_signal_used(n, vq);
mutex_unlock(&vq->mutex);
@@ -1091,7 +1096,7 @@ err_used:
vq->private_data = oldsock;
vhost_net_enable_vq(n, vq);
if (ubufs)
-   vhost_net_ubuf_put_and_wait(ubufs);
+   vhost_net_ubuf_put_wait_and_free(ubufs);
 err_ubufs:
fput(sock->file);
 err_vq:
-- 
MST


Re: Regression after "Remove support for reporting coalesced APIC IRQs"

2013-06-20 Thread Gleb Natapov
Jan, ping. Are you OK with what I proposed below?

On Thu, Jun 06, 2013 at 11:53:52AM +0300, Gleb Natapov wrote:
> Hi Jan,
> 
> I bisected [1] to f1ed0450a5fac7067590317cbf027f566b6ccbca. Fortunately
> further investigation showed that it is not really related to removing
> APIC timer interrupt reinjection and the real problem is that we cannot
> assume that __apic_accept_irq() always injects interrupts like the patch
> does because the function skips interrupt injection if APIC is disabled.
> This misreporting screws RTC interrupt tracking, so further RTC interrupt
> are stopped to be injected. The simplest solution that I see is to revert
> most of the commit and only leave APIC timer interrupt reinjection.
> 
> If you have more elegant solution let me know.
> 
> [1] https://bugzilla.kernel.org/show_bug.cgi?id=58931
> --
>   Gleb.

--
Gleb.


Re: [PATCH] KVM: MMU: Inform users of mmio generation wraparound

2013-06-20 Thread Gleb Natapov
On Thu, Jun 20, 2013 at 12:59:54PM +0200, Paolo Bonzini wrote:
> Il 20/06/2013 10:59, Takuya Yoshikawa ha scritto:
> > Without this information, users will just see unexpected performance
> > problems and there is little chance we will get good reports from them:
> > note that mmio generation is increased even when we just start, or stop,
> > dirty logging for some memory slot, in which case users should never
> > expect all shadow pages to be zapped.
> > 
> > Signed-off-by: Takuya Yoshikawa 
> > ---
> >  arch/x86/kvm/mmu.c |4 +++-
> >  1 file changed, 3 insertions(+), 1 deletion(-)
> > 
> > diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> > index c60c5da..bc8302f 100644
> > --- a/arch/x86/kvm/mmu.c
> > +++ b/arch/x86/kvm/mmu.c
> > @@ -4385,8 +4385,10 @@ void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm)
> >  * The max value is MMIO_MAX_GEN - 1 since it is not called
> >  * when mark memslot invalid.
> >  */
> > -   if (unlikely(kvm_current_mmio_generation(kvm) >= (MMIO_MAX_GEN - 1)))
> > +   if (unlikely(kvm_current_mmio_generation(kvm) >= (MMIO_MAX_GEN - 1))) {
> > +   printk(KERN_INFO "kvm: zapping shadow pages for mmio generation 
> > wraparound");
> 
> This should at least be rate-limited, because it is guest triggerable.
> 
It will be hard for a guest to trigger it 1 << 19 times too fast though.

> But why isn't the kvm_mmu_invalidate_zap_all_pages tracepoint enough?
> 
This one will trigger during slot deletion/move too.

I would put it in to see if it actually triggers in some real world
workloads (skipping the first wraparound since it is intentional),
we can always drop it if it will turn out to create a lot of noise.
 
--
Gleb.


Re: [PATCH] KVM: MMU: Inform users of mmio generation wraparound

2013-06-20 Thread Paolo Bonzini
Il 20/06/2013 10:59, Takuya Yoshikawa ha scritto:
> Without this information, users will just see unexpected performance
> problems and there is little chance we will get good reports from them:
> note that mmio generation is increased even when we just start, or stop,
> dirty logging for some memory slot, in which case users should never
> expect all shadow pages to be zapped.
> 
> Signed-off-by: Takuya Yoshikawa 
> ---
>  arch/x86/kvm/mmu.c |4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index c60c5da..bc8302f 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -4385,8 +4385,10 @@ void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm)
>* The max value is MMIO_MAX_GEN - 1 since it is not called
>* when mark memslot invalid.
>*/
> - if (unlikely(kvm_current_mmio_generation(kvm) >= (MMIO_MAX_GEN - 1)))
> + if (unlikely(kvm_current_mmio_generation(kvm) >= (MMIO_MAX_GEN - 1))) {
> + printk(KERN_INFO "kvm: zapping shadow pages for mmio generation wraparound");

This should at least be rate-limited, because it is guest triggerable.

But why isn't the kvm_mmu_invalidate_zap_all_pages tracepoint enough?

Paolo

>   kvm_mmu_invalidate_zap_all_pages(kvm);
> + }
>  }
>  
>  static int mmu_shrink(struct shrinker *shrink, struct shrink_control *sc)
> 



Re: [PATCH 1/2] kvm-unit-tests: Add a func to run instruction in emulator

2013-06-20 Thread Jan Kiszka
On 2013-06-20 12:45, Arthur Chunqi Li wrote:
> Add a function trap_emulator to run an instruction in emulator.
> Set inregs first (%rax is invalid because it is used as return
> address), put instruction code in alt_insn and call func with
> alt_insn_length. Get results in outregs.

Small hint: you should version your patches to help differentiate the
postings (e.g. [PATCH v3 ...]).

Jan






[PATCH 1/2] kvm-unit-tests: Add a func to run instruction in emulator

2013-06-20 Thread Arthur Chunqi Li
Add a function trap_emulator to run an instruction in emulator.
Set inregs first (%rax is invalid because it is used as return
address), put instruction code in alt_insn and call func with
alt_insn_length. Get results in outregs.

Signed-off-by: Arthur Chunqi Li 
---
 lib/libcflat.h |1 +
 lib/string.c   |   12 +
 x86/emulator.c |   78 
 3 files changed, 91 insertions(+)

diff --git a/lib/libcflat.h b/lib/libcflat.h
index 0875bd9..fadc33d 100644
--- a/lib/libcflat.h
+++ b/lib/libcflat.h
@@ -50,6 +50,7 @@ extern int vsnprintf(char *buf, int size, const char *fmt, 
va_list va);
 extern void puts(const char *s);
 
 extern void *memset(void *s, int c, size_t n);
+extern void *memcpy(void *dest, const void *src, size_t n);
 
 extern long atol(const char *ptr);
 #define ARRAY_SIZE(_a)  (sizeof(_a)/sizeof((_a)[0]))
diff --git a/lib/string.c b/lib/string.c
index 9dc94a1..e798f86 100644
--- a/lib/string.c
+++ b/lib/string.c
@@ -42,6 +42,18 @@ void *memset(void *s, int c, size_t n)
 return s;
 }
 
+void *memcpy(void *dest, const void *src, size_t n)
+{
+size_t i;
+char *a = dest;
+char *b = src;
+
+for (i = 0; i < n; ++i)
+a[i] = b[i];
+
+return dest;
+}
+
 long atol(const char *ptr)
 {
 long acc = 0;
diff --git a/x86/emulator.c b/x86/emulator.c
index 96576e5..b3626fa 100644
--- a/x86/emulator.c
+++ b/x86/emulator.c
@@ -11,6 +11,15 @@ int fails, tests;
 
 static int exceptions;
 
+struct regs {
+   u64 rax, rbx, rcx, rdx;
+   u64 rsi, rdi, rsp, rbp;
+   u64 r8, r9, r10, r11;
+   u64 r12, r13, r14, r15;
+   u64 rip, rflags;
+};
+struct regs inregs, outregs, save;
+
 void report(const char *name, int result)
 {
++tests;
@@ -685,6 +694,75 @@ static void test_shld_shrd(u32 *mem)
 report("shrd (cl)", *mem == ((0x12345678 >> 3) | (5u << 29)));
 }
 
+#define INSN_XCHG_ALL  \
+   "xchg %rax, 0+save \n\t"\
+   "xchg %rbx, 8+save \n\t"\
+   "xchg %rcx, 16+save \n\t"   \
+   "xchg %rdx, 24+save \n\t"   \
+   "xchg %rsi, 32+save \n\t"   \
+   "xchg %rdi, 40+save \n\t"   \
+   "xchg %rsp, 48+save \n\t"   \
+   "xchg %rbp, 56+save \n\t"   \
+   "xchg %r8, 64+save \n\t"\
+   "xchg %r9, 72+save \n\t"\
+   "xchg %r10, 80+save \n\t"   \
+   "xchg %r11, 88+save \n\t"   \
+   "xchg %r12, 96+save \n\t"   \
+   "xchg %r13, 104+save \n\t"  \
+   "xchg %r14, 112+save \n\t"  \
+   "xchg %r15, 120+save \n\t"  \
+
+asm(
+   ".align 4096\n\t"
+   "insn_page:\n\t"
+   "ret\n\t"
+   "pushf\n\t"
+   "push 136+save \n\t"
+   "popf \n\t"
+   INSN_XCHG_ALL
+   "test_insn:\n\t"
+   "in  (%dx),%al\n\t"
+   ".skip 31, 0x90\n\t"
+   "test_insn_end:\n\t"
+   INSN_XCHG_ALL
+   "pushf \n\t"
+   "pop 136+save \n\t"
+   "popf \n\t"
+   "ret \n\t"
+   "insn_page_end:\n\t"
+   ".align 4096\n\t"
+
+   "alt_insn_page:\n\t"
+   ". = . + 4096\n\t"
+   ".align 4096\n\t"
+);
+
+static void trap_emulator(uint64_t *mem, uint8_t* alt_insn, int alt_insn_length)
+{
+   ulong *cr3 = (ulong *)read_cr3();
+   void *insn_ram;
+   extern u8 insn_page[], test_insn[], alt_insn_page[];
+
+   insn_ram = vmap(virt_to_phys(insn_page), 4096);
+   memcpy(alt_insn_page, test_insn, 4096);
+   memcpy(alt_insn_page + (test_insn - insn_page), alt_insn, alt_insn_length);
+   save = inregs;
+
+   /* Load the code TLB with insn_page, but point the page tables at
+  alt_insn_page (and keep the data TLB clear, for AMD decode assist).
+  This will make the CPU trap on the insn_page instruction but the
+  hypervisor will see alt_insn_page. */
+   install_page(cr3, virt_to_phys(insn_page), insn_ram);
+   invlpg(insn_ram);
+   /* Load code TLB */
+   asm volatile("call *%0" : : "r"(insn_ram));
+   install_page(cr3, virt_to_phys(alt_insn_page), insn_ram);
+   /* Trap, let hypervisor emulate at alt_insn_page */
+   asm volatile("call *%0": : "r"(insn_ram+1));
+
+   outregs = save;
+}
+
 static void advance_rip_by_3_and_note_exception(struct ex_regs *regs)
 {
 ++exceptions;
-- 
1.7.9.5



[PATCH 2/2] kvm-unit-tests: Change two cases to use trap_emulator

2013-06-20 Thread Arthur Chunqi Li
Change two functions (test_mmx_movq_mf and test_movabs) to use the
unified trap_emulator.

Signed-off-by: Arthur Chunqi Li 
---
 x86/emulator.c |   70 
 1 file changed, 15 insertions(+), 55 deletions(-)

diff --git a/x86/emulator.c b/x86/emulator.c
index b3626fa..16d63e0 100644
--- a/x86/emulator.c
+++ b/x86/emulator.c
@@ -772,38 +772,19 @@ static void advance_rip_by_3_and_note_exception(struct ex_regs *regs)
 static void test_mmx_movq_mf(uint64_t *mem, uint8_t *insn_page,
 uint8_t *alt_insn_page, void *insn_ram)
 {
-uint16_t fcw = 0;  // all exceptions unmasked
-ulong *cr3 = (ulong *)read_cr3();
-
-write_cr0(read_cr0() & ~6);  // TS, EM
-// Place a trapping instruction in the page to trigger a VMEXIT
-insn_page[0] = 0x89; // mov %eax, (%rax)
-insn_page[1] = 0x00;
-insn_page[2] = 0x90; // nop
-insn_page[3] = 0xc3; // ret
-// Place the instruction we want the hypervisor to see in the alternate page
-alt_insn_page[0] = 0x0f; // movq %mm0, (%rax)
-alt_insn_page[1] = 0x7f;
-alt_insn_page[2] = 0x00;
-alt_insn_page[3] = 0xc3; // ret
+uint16_t fcw = 0;  /* all exceptions unmasked */
+u8 alt_insn[] = {0x0f, 0x7f, 0x00}; /* movq %mm0, (%rax) */
+void *stack = alloc_page();
 
+write_cr0(read_cr0() & ~6);  /* TS, EM */
 exceptions = 0;
 handle_exception(MF_VECTOR, advance_rip_by_3_and_note_exception);
-
-// Load the code TLB with insn_page, but point the page tables at
-// alt_insn_page (and keep the data TLB clear, for AMD decode assist).
-// This will make the CPU trap on the insn_page instruction but the
-// hypervisor will see alt_insn_page.
-install_page(cr3, virt_to_phys(insn_page), insn_ram);
 asm volatile("fninit; fldcw %0" : : "m"(fcw));
-asm volatile("fldz; fldz; fdivp"); // generate exception
-invlpg(insn_ram);
-// Load code TLB
-asm volatile("call *%0" : : "r"(insn_ram + 3));
-install_page(cr3, virt_to_phys(alt_insn_page), insn_ram);
-// Trap, let hypervisor emulate at alt_insn_page
-asm volatile("call *%0" : : "r"(insn_ram), "a"(mem));
-// exit MMX mode
+asm volatile("fldz; fldz; fdivp"); /* generate exception */
+
+inregs = (struct regs){ .rsp=(u64)stack+1024 };
+trap_emulator(mem, alt_insn, 3);
+/* exit MMX mode */
 asm volatile("fnclex; emms");
 report("movq mmx generates #MF", exceptions == 1);
 handle_exception(MF_VECTOR, 0);
@@ -812,33 +793,12 @@ static void test_mmx_movq_mf(uint64_t *mem, uint8_t *insn_page,
 static void test_movabs(uint64_t *mem, uint8_t *insn_page,
   uint8_t *alt_insn_page, void *insn_ram)
 {
-uint64_t val = 0;
-ulong *cr3 = (ulong *)read_cr3();
-
-// Pad with RET instructions
-memset(insn_page, 0xc3, 4096);
-memset(alt_insn_page, 0xc3, 4096);
-// Place a trapping instruction in the page to trigger a VMEXIT
-insn_page[0] = 0x89; // mov %eax, (%rax)
-insn_page[1] = 0x00;
-// Place the instruction we want the hypervisor to see in the alternate
-// page. A buggy hypervisor will fetch a 32-bit immediate and return
-// 0xc3c3c3c3.
-alt_insn_page[0] = 0x48; // mov $0xc3c3c3c3c3c3c3c3, %rcx
-alt_insn_page[1] = 0xb9;
-
-// Load the code TLB with insn_page, but point the page tables at
-// alt_insn_page (and keep the data TLB clear, for AMD decode assist).
-// This will make the CPU trap on the insn_page instruction but the
-// hypervisor will see alt_insn_page.
-install_page(cr3, virt_to_phys(insn_page), insn_ram);
-// Load code TLB
-invlpg(insn_ram);
-asm volatile("call *%0" : : "r"(insn_ram + 3));
-// Trap, let hypervisor emulate at alt_insn_page
-install_page(cr3, virt_to_phys(alt_insn_page), insn_ram);
-asm volatile("call *%1" : "=c"(val) : "r"(insn_ram), "a"(mem), "c"(0));
-report("64-bit mov imm", val == 0xc3c3c3c3c3c3c3c3);
+/* mov $0x9090909090909090, %rcx */
+uint8_t alt_insn[] = {0x48, 0xb9, 0x90, 0x90, 0x90,
+   0x90, 0x90, 0x90, 0x90, 0x90};
+inregs = (struct regs){ 0 };
+trap_emulator(mem, alt_insn, 10);
+report("64-bit mov imm2", outregs.rcx == 0x9090909090909090);
 }
 
 static void test_crosspage_mmio(volatile uint8_t *mem)
-- 
1.7.9.5



[PATCH] KVM: MMU: Inform users of mmio generation wraparound

2013-06-20 Thread Takuya Yoshikawa
Without this information, users will just see unexpected performance
problems and there is little chance we will get good reports from them:
note that mmio generation is increased even when we just start, or stop,
dirty logging for some memory slot, in which case users should never
expect all shadow pages to be zapped.

Signed-off-by: Takuya Yoshikawa 
---
 arch/x86/kvm/mmu.c |4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index c60c5da..bc8302f 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -4385,8 +4385,10 @@ void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm)
 * The max value is MMIO_MAX_GEN - 1 since it is not called
 * when mark memslot invalid.
 */
-   if (unlikely(kvm_current_mmio_generation(kvm) >= (MMIO_MAX_GEN - 1)))
+   if (unlikely(kvm_current_mmio_generation(kvm) >= (MMIO_MAX_GEN - 1))) {
+   printk(KERN_INFO "kvm: zapping shadow pages for mmio generation wraparound");
kvm_mmu_invalidate_zap_all_pages(kvm);
+   }
 }
 
 static int mmu_shrink(struct shrinker *shrink, struct shrink_control *sc)
-- 
1.7.9.5



Re: [PATCH 1/2] kvm-unit-tests: Add a func to run instruction in emulator

2013-06-20 Thread Gmail
OK, I will handle all of the above in the following commit.

Arthur Chunqi Li
Department of Computer Science
School of EECS
Peking University
Beijing, China

From my iPhone

在 2013-6-20,16:48,Gleb Natapov  写道:

> On Wed, Jun 19, 2013 at 11:00:56PM +0800, Arthur Chunqi Li wrote:
>> Add a function trap_emulator to run an instruction in emulator.
>> Set inregs first (%rax is invalid because it is used as return
>> address), put instruction code in alt_insn and call func with
>> alt_insn_length. Get results in outregs.
>> 
>> Signed-off-by: Arthur Chunqi Li 
>> ---
>> x86/emulator.c |  110 
>> 
>> 1 file changed, 110 insertions(+)
>> mode change 100644 => 100755 x86/emulator.c
>> 
>> diff --git a/x86/emulator.c b/x86/emulator.c
>> old mode 100644
>> new mode 100755
>> index 96576e5..48d45c8
>> --- a/x86/emulator.c
>> +++ b/x86/emulator.c
>> @@ -11,6 +11,15 @@ int fails, tests;
>> 
>> static int exceptions;
>> 
>> +struct regs {
>> +u64 rax, rbx, rcx, rdx;
>> +u64 rsi, rdi, rsp, rbp;
>> +u64 r8, r9, r10, r11;
>> +u64 r12, r13, r14, r15;
>> +u64 rip, rflags;
>> +};
>> +struct regs inregs, outregs, save;
>> +
>> void report(const char *name, int result)
>> {
>>++tests;
>> @@ -685,6 +694,107 @@ static void test_shld_shrd(u32 *mem)
>> report("shrd (cl)", *mem == ((0x12345678 >> 3) | (5u << 29)));
>> }
>> 
>> +#define INSN_SAVE\
> No need for all the defines. Put all the code into insn_page, allocate
> alt_insn_page dynamically and copy the code there by memcpy.
> 
>> +"ret\n\t"\
>> +"pushf\n\t"\
>> +"push 136+save \n\t"\
>> +"popf \n\t"\
>> +"xchg %rax, 0+save \n\t"\
>> +"xchg %rbx, 8+save \n\t"\
>> +"xchg %rcx, 16+save \n\t"\
>> +"xchg %rdx, 24+save \n\t"\
>> +"xchg %rsi, 32+save \n\t"\
>> +"xchg %rdi, 40+save \n\t"\
>> +"xchg %rsp, 48+save \n\t"\
>> +"xchg %rbp, 56+save \n\t"\
>> +"xchg %r8, 64+save \n\t"\
>> +"xchg %r9, 72+save \n\t"\
>> +"xchg %r10, 80+save \n\t"\
>> +"xchg %r11, 88+save \n\t"\
>> +"xchg %r12, 96+save \n\t"\
>> +"xchg %r13, 104+save \n\t"\
>> +"xchg %r14, 112+save \n\t"\
>> +"xchg %r15, 120+save \n\t"\
>> +
>> +#define INSN_RESTORE\
>> +"xchg %rax, 0+save \n\t"\
>> +"xchg %rbx, 8+save \n\t"\
>> +"xchg %rcx, 16+save \n\t"\
>> +"xchg %rdx, 24+save \n\t"\
>> +"xchg %rsi, 32+save \n\t"\
>> +"xchg %rdi, 40+save \n\t"\
>> +"xchg %rsp, 48+save \n\t"\
>> +"xchg %rbp, 56+save \n\t"\
>> +"xchg %r8, 64+save \n\t"\
>> +"xchg %r9, 72+save \n\t"\
>> +"xchg %r10, 80+save \n\t"\
>> +"xchg %r11, 88+save \n\t"\
>> +"xchg %r12, 96+save \n\t"\
>> +"xchg %r13, 104+save \n\t"\
>> +"xchg %r14, 112+save \n\t"\
>> +"xchg %r15, 120+save \n\t"\
>> +"pushf \n\t"\
>> +"pop 136+save \n\t"\
>> +"popf \n\t"\
>> +"ret \n\t"\
>> +
>> +#define INSN_TRAP\
>> +"in  (%dx),%al\n\t"\
>> +". = . + 31\n\t"\
> If you do ".skip 31, 0x90\n\t" instead, you can drop the loop
> that inserts nops below.
> 
>> +
>> +asm(
>> +".align 4096\n\t"
>> +"insn_page:\n\t"
>> +INSN_SAVE
>> +"test_insn:\n\t"
>> +INSN_TRAP
>> +"test_insn_end:\n\t"
>> +INSN_RESTORE
>> +"insn_page_end:\n\t"
>> +".align 4096\n\t"
>> +
>> +"alt_insn_page:\n\t"
>> +INSN_SAVE
>> +"alt_test_insn:\n\t"
>> +INSN_TRAP
>> +"alt_test_insn_end:\n\t"
>> +INSN_RESTORE
>> +"alt_insn_page_end:\n\t"
>> +".align 4096\n\t"
>> +);
>> +
>> +static void trap_emulator(uint64_t *mem, uint8_t* alt_insn, int alt_insn_length)
>> +{
>> +ulong *cr3 = (ulong *)read_cr3();
>> +void *insn_ram;
>> +int i;
>> +extern u8 insn_page[], test_insn[], test_insn_end[];
>> +extern u8 alt_insn_page[], alt_test_insn[];
>> +
>> +insn_ram = vmap(virt_to_phys(insn_page), 4096);
>> +for (i=1; i<test_insn_end - test_insn; i++)
>> +alt_test_insn[i] = test_insn[i] = 0x90; // nop
>> +for (i=0; i<alt_insn_length; i++)
>> +alt_test_insn[i] = alt_insn[i];
>> +for(; i<test_insn_end - test_insn; i++)
>> +alt_test_insn[i] = 0x90; // nop
>> +save = inregs;
>> +
>> +// Load the code TLB with insn_page, but point the page tables at
>> +// alt_insn_page (and keep the data TLB clear, for AMD decode assist).
>> +// This will make the CPU trap on the insn_page instruction but the
>> +// hypervisor will see alt_insn_page.
> I prefer all the comments to be changed to /**/ style while we are at it.
> 
>> +install_page(cr3, virt_to_phys(insn_page), insn_ram);
>> +invlpg(insn_ram);
>> +// Load code TLB
>> +asm volatile("

Re: [PATCH v2 0/2] kvm: x86: Emulate MSR_PLATFORM_INFO

2013-06-20 Thread Gleb Natapov
On Thu, Jun 20, 2013 at 10:34:39AM +0200, Paolo Bonzini wrote:
> Il 20/06/2013 09:31, Gleb Natapov ha scritto:
> >> I agree with you on the potential problems but I think we are completely
> >> ignoring the "non-migration" use case. These users will probably benefit 
> >> from a correct value of (virtual) msr_platform_info. And it appears, the 
> >> easiest way to give both options to the user is using a new module_param, 
> >> say ignore_platform_info.
> >>
> > Migration is important part of virtualization. PMU emulation currently
> > is in demo status because migration is not implemented yet, but at least
> > it is possible to implement it.
> 
> But there's a lot more to do for migration than just moving the MSR
> contents from one machine to another.  We need to support customizing
> the values of CPUID leaf 0ah, so that PMU migration can work if you
> specify "lowest common denominator" CPU models.
> 
> I haven't looked at it yet because QEMU support for this is not in sight.
> 
Eduardo working on it.

> >> Scenarios :
> >> 1. ignore_platform_info = 0 (Default). Inject #GP if application tries to
> >> read this MSR.
> >> 2. ignore_platform_info = 1. User wants to read the calculated value, her
> >> environment doesn't require migration.
> >> 3. ignore_msrs = 1. If this is set, we always return 0 and application will
> >> hopefully resort to a workaround.
> >>
> > Module flag is global for all VMs on a host. Implementing it like this
> will guarantee that the feature will not be used in production ever.
> > ignore_msrs exists only for developers to quickly check if a problem goes
> > away if some MSR does not #GP, never as a real solution.
> > 
> > To make it somewhat useful the flag should be per-vm and exposed to all
> layers up to the management. Management is the one who enables it on a per-VM
> basis and guarantees that a VM with the feature enabled is never live
> migrated. Frankly IMO it will be another management knob that will never
> > be set.
> 
> I agree.
> 
> I think it's fine to apply Bandan's patches.  It's just one more thing
> to care about when migrating machines that use the PMU.  And I hope that
> Intel will add TSC scaling sooner or later, which will fix the issue
> automatically.  Hear, Jun! :)
> 
I am fine with adding things without migration support if we know how
migration can be fixed; in this case we are adding something that cannot
be fixed without HW support and, as such, should be enabled only on
hypothetical future HW that supports the required feature.

So enabling it by default is wrong, and I listed the pros and cons of a global
ignore_platform_info option and a per-VM enable option above.

--
Gleb.


Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling

2013-06-20 Thread Alexey Kardashevskiy
On 06/20/2013 05:47 PM, Benjamin Herrenschmidt wrote:
> On Thu, 2013-06-20 at 15:28 +1000, David Gibson wrote:
>>> Just out of curiosity - would not get_file() and fput_atomic() on a
>> group's
>>> file* do the right job instead of vfio_group_add_external_user() and
>>> vfio_group_del_external_user()?
>>
>> I was thinking that too.  Grabbing a file reference would certainly be
>> the usual way of handling this sort of thing.
> 
> But that wouldn't prevent the group ownership to be returned to
> the kernel or another user would it ?


Holding the file pointer does not let the group->container_users counter go
to zero and this is exactly what vfio_group_add_external_user() and
vfio_group_del_external_user() do. The difference is only in absolute value
- 2 vs. 3.

There is no change in behaviour whether I use the new vfio API or simply hold
the file* till KVM closes the fd created when the IOMMU was connected to LIOBN.

And while this counter is not zero, QEMU cannot take ownership over the group.

I am definitely still missing the bigger picture...


-- 
Alexey


Re: [PATCH 1/2] kvm-unit-tests: Add a func to run instruction in emulator

2013-06-20 Thread Gleb Natapov
On Wed, Jun 19, 2013 at 11:00:56PM +0800, Arthur Chunqi Li wrote:
> Add a function trap_emulator to run an instruction in emulator.
> Set inregs first (%rax is invalid because it is used as return
> address), put instruction code in alt_insn and call func with
> alt_insn_length. Get results in outregs.
> 
> Signed-off-by: Arthur Chunqi Li 
> ---
>  x86/emulator.c |  110 
> 
>  1 file changed, 110 insertions(+)
>  mode change 100644 => 100755 x86/emulator.c
> 
> diff --git a/x86/emulator.c b/x86/emulator.c
> old mode 100644
> new mode 100755
> index 96576e5..48d45c8
> --- a/x86/emulator.c
> +++ b/x86/emulator.c
> @@ -11,6 +11,15 @@ int fails, tests;
>  
>  static int exceptions;
>  
> +struct regs {
> + u64 rax, rbx, rcx, rdx;
> + u64 rsi, rdi, rsp, rbp;
> + u64 r8, r9, r10, r11;
> + u64 r12, r13, r14, r15;
> + u64 rip, rflags;
> +};
> +struct regs inregs, outregs, save;
> +
>  void report(const char *name, int result)
>  {
>   ++tests;
> @@ -685,6 +694,107 @@ static void test_shld_shrd(u32 *mem)
>  report("shrd (cl)", *mem == ((0x12345678 >> 3) | (5u << 29)));
>  }
>  
> +#define INSN_SAVE\
No need for all the defines. Put all the code into insn_page, allocate
alt_insn_page dynamically and copy the code there by memcpy.

> + "ret\n\t"   \
> + "pushf\n\t" \
> + "push 136+save \n\t"\
> + "popf \n\t" \
> + "xchg %rax, 0+save \n\t"\
> + "xchg %rbx, 8+save \n\t"\
> + "xchg %rcx, 16+save \n\t"   \
> + "xchg %rdx, 24+save \n\t"   \
> + "xchg %rsi, 32+save \n\t"   \
> + "xchg %rdi, 40+save \n\t"   \
> + "xchg %rsp, 48+save \n\t"   \
> + "xchg %rbp, 56+save \n\t"   \
> + "xchg %r8, 64+save \n\t"\
> + "xchg %r9, 72+save \n\t"\
> + "xchg %r10, 80+save \n\t"   \
> + "xchg %r11, 88+save \n\t"   \
> + "xchg %r12, 96+save \n\t"   \
> + "xchg %r13, 104+save \n\t"  \
> + "xchg %r14, 112+save \n\t"  \
> + "xchg %r15, 120+save \n\t"  \
> +
> +#define INSN_RESTORE \
> + "xchg %rax, 0+save \n\t"\
> + "xchg %rbx, 8+save \n\t"\
> + "xchg %rcx, 16+save \n\t"   \
> + "xchg %rdx, 24+save \n\t"   \
> + "xchg %rsi, 32+save \n\t"   \
> + "xchg %rdi, 40+save \n\t"   \
> + "xchg %rsp, 48+save \n\t"   \
> + "xchg %rbp, 56+save \n\t"   \
> + "xchg %r8, 64+save \n\t"\
> + "xchg %r9, 72+save \n\t"\
> + "xchg %r10, 80+save \n\t"   \
> + "xchg %r11, 88+save \n\t"   \
> + "xchg %r12, 96+save \n\t"   \
> + "xchg %r13, 104+save \n\t"  \
> + "xchg %r14, 112+save \n\t"  \
> + "xchg %r15, 120+save \n\t"  \
> + "pushf \n\t"\
> + "pop 136+save \n\t" \
> + "popf \n\t" \
> + "ret \n\t"  \
> +
> +#define INSN_TRAP\
> + "in  (%dx),%al\n\t" \
> + ". = . + 31\n\t"\
If you do ".skip 31, 0x90\n\t" instead, you can drop the loop
that inserts nops below.

> +
> +asm(
> + ".align 4096\n\t"
> + "insn_page:\n\t"
> + INSN_SAVE
> + "test_insn:\n\t"
> + INSN_TRAP
> + "test_insn_end:\n\t"
> + INSN_RESTORE
> + "insn_page_end:\n\t"
> + ".align 4096\n\t"
> +
> + "alt_insn_page:\n\t"
> + INSN_SAVE
> + "alt_test_insn:\n\t"
> + INSN_TRAP
> + "alt_test_insn_end:\n\t"
> + INSN_RESTORE
> + "alt_insn_page_end:\n\t"
> + ".align 4096\n\t"
> +);
> +
> +static void trap_emulator(uint64_t *mem, uint8_t* alt_insn, int alt_insn_length)
> +{
> + ulong *cr3 = (ulong *)read_cr3();
> + void *insn_ram;
> + int i;
> + extern u8 insn_page[], test_insn[], test_insn_end[];
> + extern u8 alt_insn_page[], alt_test_insn[];
> +
> + insn_ram = vmap(virt_to_phys(insn_page), 4096);
> + for (i=1; i<test_insn_end - test_insn; i++)
> + alt_test_insn[i] = test_insn[i] = 0x90; // nop
> + for (i=0; i<alt_insn_length; i++)
> + alt_test_insn[i] = alt_insn[i];
> + for(; i<test_insn_end - test_insn; i++)
> + alt_test_insn[i] = 0x90; // nop
> + save = inregs;
> +
> + // Load the code TLB with insn_page, but point the page tables at
> + // alt_insn_page (and keep the data TLB clear, for AMD decode assist).
> + // This will make the CPU trap on the insn_page instruction but the
> + // hypervisor will see alt_insn_page.
I prefer all the comments to be changed to /**/ style while we are at it.

> + install_page(cr3, virt_to_phys(

Re: [PATCHv1] kvm guest: fix uninitialized kvmclock read by KVM guest

2013-06-20 Thread Igor Mammedov
On Wed, 19 Jun 2013 15:29:31 +0200
Paolo Bonzini  wrote:

> Il 19/06/2013 15:20, Batalov Eugene ha scritto:
> > 
> > I've missed this detail. It looks like Igor's patch doesn't bring
> > secondary cpus kvm_clocksource behavior back to the one before the regression.
> > Before the regression per_cpu variables are used to allocate
> > kvm_pv_clock areas.
> > Due to the usage of percpu variables, bootstrap cpu kvm_clock area contents
> > were copied to smp secondary cpus kvm_clock areas when they were started.
> > Bootstrap cpu kvm_clock area was not zeroed at this time.
> > So kvm_pv_clock for secondary cpus never returned "zero" clock before
> > the regression.
> > 
> > During the analysis of the bug I introduced idea to return zero before
> > kvm clocksource is initialized for secondary cpus
> > just like bootstrap cpu does on kernel boot. You can read that in BZ.
> 
> Yes, this is why I prefer to invert the two function calls.  But Igor's
> patch fixes the hang (trivially because version is even) and is more
> appropriate for -rc6.

I'll post this swap shortly, but zeroing out hv_clock at init time
would still be needed to provide sane values there if ftrace is enabled
at that time.

> 
> Paolo



Re: [PATCHv1] kvm guest: fix uninitialized kvmclock read by KVM guest

2013-06-20 Thread Paolo Bonzini
Il 20/06/2013 10:30, Igor Mammedov ha scritto:
> On Wed, 19 Jun 2013 15:29:31 +0200
> Paolo Bonzini  wrote:
> 
>> Il 19/06/2013 15:20, Batalov Eugene ha scritto:
>>>
>>> I've missed this detail. It looks like Igor's patch doesn't bring
>>> secondary cpus kvm_clocksource behavior back to the one before the regression.
>>> Before the regression per_cpu variables are used to allocate
>>> kvm_pv_clock areas.
>>> Due to the usage of percpu variables, bootstrap cpu kvm_clock area contents
>>> were copied to smp secondary cpus kvm_clock areas when they were started.
>>> Bootstrap cpu kvm_clock area was not zeroed at this time.
>>> So kvm_pv_clock for secondary cpus never returned "zero" clock before
>>> the regression.
>>>
>>> During the analysis of the bug I introduced idea to return zero before
>>> kvm clocksource is initialized for secondary cpus
>>> just like bootstrap cpu does on kernel boot. You can read that in BZ.
>>
>> Yes, this is why I prefer to invert the two function calls.  But Igor's
>> patch fixes the hang (trivially because version is even) and is more
>> appropriate for -rc6.
> 
> I'll post this swap shortly, but zeroing out hv_clock at init time
> would still be needed to provide sane values there if ftrace is enabled
> at that time.

Fine!  Please mention it (with --verbose flag) in the commit message.

Paolo



Re: [PATCH v2 0/2] kvm: x86: Emulate MSR_PLATFORM_INFO

2013-06-20 Thread Paolo Bonzini
Il 20/06/2013 09:31, Gleb Natapov ha scritto:
>> I agree with you on the potential problems but I think we are completely
>> ignoring the "non-migration" use case. These users will probably benefit 
>> from a correct value of (virtual) msr_platform_info. And it appears, the 
>> easiest way to give both options to the user is using a new module_param, 
>> say ignore_platform_info.
>>
> Migration is important part of virtualization. PMU emulation currently
> is in demo status because migration is not implemented yet, but at least
> it is possible to implement it.

But there's a lot more to do for migration than just moving the MSR
contents from one machine to another.  We need to support customizing
the values of CPUID leaf 0ah, so that PMU migration can work if you
specify "lowest common denominator" CPU models.

I haven't looked at it yet because QEMU support for this is not in sight.

>> Scenarios :
>> 1. ignore_platform_info = 0 (Default). Inject #GP if application tries to
>> read this MSR.
>> 2. ignore_platform_info = 1. User wants to read the calculated value, her
>> environment doesn't require migration.
>> 3. ignore_msrs = 1. If this is set, we always return 0 and application will
>> hopefully resort to a workaround.
>>
> Module flag is global for all VMs on a host. Implementing it like this
> will guarantee that the feature will not be used in production ever.
> ignore_msrs exists only for developers to quickly check if a problem goes
> away if some MSR does not #GP, never as a real solution.
> 
> To make it somewhat useful the flag should be per-vm and exposed to all
> layers up to the management. Management is the one who enables it on a per-VM
> basis and guarantees that a VM with the feature enabled is never live
> migrated. Frankly IMO it will be another management knob that will never
> be set.

I agree.

I think it's fine to apply Bandan's patches.  It's just one more thing
to care about when migrating machines that use the PMU.  And I hope that
Intel will add TSC scaling sooner or later, which will fix the issue
automatically.  Hear, Jun! :)

Paolo


Re: [PATCH 1/2] kvm-unit-tests: Add a func to run instruction in emulator

2013-06-20 Thread Gleb Natapov
On Thu, Jun 20, 2013 at 10:29:42AM +0200, Paolo Bonzini wrote:
> Il 19/06/2013 18:03, Gleb Natapov ha scritto:
> > On Wed, Jun 19, 2013 at 11:07:18PM +0800, 李春奇  wrote:
> >> Hi Gleb,
> >> This version can set %rsp before trapping into emulator, because
> >> insn_page and alt_insn_page is statically defined and their relative
> >> position to (save) is fixed during execution.
> >>
> > The position of the code is not fixed during execution since you execute
> > it from a virtual address obtained dynamically by vmap() and the address
> > is definitely different from the one the code was compiled for, but if
> > you look at the code that compile actually produce you will see that it
> > uses an absolute address to access "save" and this is why it works. I
> > wonder why the compiler decided to use an absolute address this time, Paolo?
> 
> Because he's using assembly with operands that he wrote himself.  Before
> he was using "m" and the compiler decided to express the memory operand
> as "save(%rip)".
> 
> The assembler then emits different opcodes (of course) and also
> different relocations.  In the current code, it tells the linker to
> place an absolute address.  In the previous one, it tells the linker to
> place a delta from %rip.
> 
Heh, make sense. OK, so we will go with that. Will comment on the patch
itself.

> Paolo
> 
> >> In this way, test case of test_mmx_movq_mf needs to pre-define its own
> >> stack, this change is in the next patch.
> >>
> >> In this version, insn_ram is initially mapped to insn_page and them
> >> each call to insn_page/alt_insn_page are all via insn_ram. This trick
> >> runs well but I don't know why my previous version causes error.
> >>
> > Because the previous version tried to use install_page() on a large-page
> > mapped region and the function does not know how to handle that.
> > 
> >> Arthur.
> >> On Wed, Jun 19, 2013 at 11:00 PM, Arthur Chunqi Li  
> >> wrote:
> >>> Add a function trap_emulator to run an instruction in emulator.
> >>> Set inregs first (%rax is invalid because it is used as return
> >>> address), put instruction code in alt_insn and call func with
> >>> alt_insn_length. Get results in outregs.
> >>>
> >>> Signed-off-by: Arthur Chunqi Li 
> >>> ---
> >>>  x86/emulator.c |  110 
> >>> 
> >>>  1 file changed, 110 insertions(+)
> >>>  mode change 100644 => 100755 x86/emulator.c
> >>>
> >>> diff --git a/x86/emulator.c b/x86/emulator.c
> >>> old mode 100644
> >>> new mode 100755
> >>> index 96576e5..48d45c8
> >>> --- a/x86/emulator.c
> >>> +++ b/x86/emulator.c
> >>> @@ -11,6 +11,15 @@ int fails, tests;
> >>>
> >>>  static int exceptions;
> >>>
> >>> +struct regs {
> >>> +   u64 rax, rbx, rcx, rdx;
> >>> +   u64 rsi, rdi, rsp, rbp;
> >>> +   u64 r8, r9, r10, r11;
> >>> +   u64 r12, r13, r14, r15;
> >>> +   u64 rip, rflags;
> >>> +};
> >>> +struct regs inregs, outregs, save;
> >>> +
> >>>  void report(const char *name, int result)
> >>>  {
> >>> ++tests;
> >>> @@ -685,6 +694,107 @@ static void test_shld_shrd(u32 *mem)
> >>>  report("shrd (cl)", *mem == ((0x12345678 >> 3) | (5u << 29)));
> >>>  }
> >>>
> >>> +#define INSN_SAVE  \
> >>> +   "ret\n\t"   \
> >>> +   "pushf\n\t" \
> >>> +   "push 136+save \n\t"\
> >>> +   "popf \n\t" \
> >>> +   "xchg %rax, 0+save \n\t"\
> >>> +   "xchg %rbx, 8+save \n\t"\
> >>> +   "xchg %rcx, 16+save \n\t"   \
> >>> +   "xchg %rdx, 24+save \n\t"   \
> >>> +   "xchg %rsi, 32+save \n\t"   \
> >>> +   "xchg %rdi, 40+save \n\t"   \
> >>> +   "xchg %rsp, 48+save \n\t"   \
> >>> +   "xchg %rbp, 56+save \n\t"   \
> >>> +   "xchg %r8, 64+save \n\t"\
> >>> +   "xchg %r9, 72+save \n\t"\
> >>> +   "xchg %r10, 80+save \n\t"   \
> >>> +   "xchg %r11, 88+save \n\t"   \
> >>> +   "xchg %r12, 96+save \n\t"   \
> >>> +   "xchg %r13, 104+save \n\t"  \
> >>> +   "xchg %r14, 112+save \n\t"  \
> >>> +   "xchg %r15, 120+save \n\t"  \
> >>> +
> >>> +#define INSN_RESTORE   \
> >>> +   "xchg %rax, 0+save \n\t"\
> >>> +   "xchg %rbx, 8+save \n\t"\
> >>> +   "xchg %rcx, 16+save \n\t"   \
> >>> +   "xchg %rdx, 24+save \n\t"   \
> >>> +   "xchg %rsi, 32+save \n\t"   \
> >>> +   "xchg %rdi, 40+save \n\t"   \
> >>> +   "xchg %rsp, 48+save \n\t"   \
> >>> +   "xchg %rbp, 56+save \n\t"   \
> >>> +   "xchg %r8, 64+save \n\t"\
> >>> +   "xchg %r9, 72+save \n\t"\
> >>> +   "xchg %r10, 80+save \n\t"   \
> 

Re: [PATCH 1/2] kvm-unit-tests: Add a func to run instruction in emulator

2013-06-20 Thread Paolo Bonzini
On 19/06/2013 18:03, Gleb Natapov wrote:
> On Wed, Jun 19, 2013 at 11:07:18PM +0800, 李春奇  wrote:
>> Hi Gleb,
>> This version can set %rsp before trapping into the emulator, because
>> insn_page and alt_insn_page are statically defined and their relative
>> position to (save) is fixed during execution.
>>
> The position of the code is not fixed during execution, since you execute
> it from a virtual address obtained dynamically by vmap() and the address
> is definitely different from the one the code was compiled for. But if
> you look at the code the compiler actually produces, you will see that it
> uses an absolute address to access "save", and this is why it works. I
> wonder why the compiler decided to use an absolute address this time, Paolo?

Because he's using assembly with operands that he wrote himself.  Before
he was using "m" and the compiler decided to express the memory operand
as "save(%rip)".

The assembler then emits different opcodes (of course) and also
different relocations.  In the current code, it tells the linker to
place an absolute address.  In the previous one, it tells the linker to
place a delta from %rip.

Paolo

>> In this way, the test case test_mmx_movq_mf needs to pre-define its own
>> stack; this change is in the next patch.
>>
>> In this version, insn_ram is initially mapped to insn_page, and then
>> each call to insn_page/alt_insn_page goes via insn_ram. This trick
>> runs well, but I don't know why my previous version caused an error.
>>
> Because previous version tried to use install_page() on a large page
> mapped region and the function does not know how to handle that.
> 
>> Arthur.
>> On Wed, Jun 19, 2013 at 11:00 PM, Arthur Chunqi Li  wrote:
>>> Add a function trap_emulator to run an instruction in emulator.
>>> Set inregs first (%rax is invalid because it is used as the return
>>> address), put the instruction code in alt_insn, and call func with
>>> alt_insn_length. Get the results in outregs.
>>>
>>> Signed-off-by: Arthur Chunqi Li 
>>> ---
>>>  x86/emulator.c |  110 
>>> 
>>>  1 file changed, 110 insertions(+)
>>>  mode change 100644 => 100755 x86/emulator.c
>>>
>>> diff --git a/x86/emulator.c b/x86/emulator.c
>>> old mode 100644
>>> new mode 100755
>>> index 96576e5..48d45c8
>>> --- a/x86/emulator.c
>>> +++ b/x86/emulator.c
>>> @@ -11,6 +11,15 @@ int fails, tests;
>>>
>>>  static int exceptions;
>>>
>>> +struct regs {
>>> +   u64 rax, rbx, rcx, rdx;
>>> +   u64 rsi, rdi, rsp, rbp;
>>> +   u64 r8, r9, r10, r11;
>>> +   u64 r12, r13, r14, r15;
>>> +   u64 rip, rflags;
>>> +};
>>> +struct regs inregs, outregs, save;
>>> +
>>>  void report(const char *name, int result)
>>>  {
>>> ++tests;
>>> @@ -685,6 +694,107 @@ static void test_shld_shrd(u32 *mem)
>>>  report("shrd (cl)", *mem == ((0x12345678 >> 3) | (5u << 29)));
>>>  }
>>>
>>> +#define INSN_SAVE  \
>>> +   "ret\n\t"   \
>>> +   "pushf\n\t" \
>>> +   "push 136+save \n\t"\
>>> +   "popf \n\t" \
>>> +   "xchg %rax, 0+save \n\t"\
>>> +   "xchg %rbx, 8+save \n\t"\
>>> +   "xchg %rcx, 16+save \n\t"   \
>>> +   "xchg %rdx, 24+save \n\t"   \
>>> +   "xchg %rsi, 32+save \n\t"   \
>>> +   "xchg %rdi, 40+save \n\t"   \
>>> +   "xchg %rsp, 48+save \n\t"   \
>>> +   "xchg %rbp, 56+save \n\t"   \
>>> +   "xchg %r8, 64+save \n\t"\
>>> +   "xchg %r9, 72+save \n\t"\
>>> +   "xchg %r10, 80+save \n\t"   \
>>> +   "xchg %r11, 88+save \n\t"   \
>>> +   "xchg %r12, 96+save \n\t"   \
>>> +   "xchg %r13, 104+save \n\t"  \
>>> +   "xchg %r14, 112+save \n\t"  \
>>> +   "xchg %r15, 120+save \n\t"  \
>>> +
>>> +#define INSN_RESTORE   \
>>> +   "xchg %rax, 0+save \n\t"\
>>> +   "xchg %rbx, 8+save \n\t"\
>>> +   "xchg %rcx, 16+save \n\t"   \
>>> +   "xchg %rdx, 24+save \n\t"   \
>>> +   "xchg %rsi, 32+save \n\t"   \
>>> +   "xchg %rdi, 40+save \n\t"   \
>>> +   "xchg %rsp, 48+save \n\t"   \
>>> +   "xchg %rbp, 56+save \n\t"   \
>>> +   "xchg %r8, 64+save \n\t"\
>>> +   "xchg %r9, 72+save \n\t"\
>>> +   "xchg %r10, 80+save \n\t"   \
>>> +   "xchg %r11, 88+save \n\t"   \
>>> +   "xchg %r12, 96+save \n\t"   \
>>> +   "xchg %r13, 104+save \n\t"  \
>>> +   "xchg %r14, 112+save \n\t"  \
>>> +   "xchg %r15, 120+save \n\t"  \
>>> +   "pushf \n\t"\
>>> +   "pop 136+save \n\t" \
>>> +   

Re: [PATCH 2/2] kvm-unit-tests: Change two cases to use trap_emulator

2013-06-20 Thread Paolo Bonzini
On 19/06/2013 17:00, Arthur Chunqi Li wrote:
>  static void test_movabs(uint64_t *mem, uint8_t *insn_page,
>  uint8_t *alt_insn_page, void *insn_ram)
>  {
> -uint64_t val = 0;
> -ulong *cr3 = (ulong *)read_cr3();
> -
> -// Pad with RET instructions
> -memset(insn_page, 0xc3, 4096);
> -memset(alt_insn_page, 0xc3, 4096);
> -// Place a trapping instruction in the page to trigger a VMEXIT
> -insn_page[0] = 0x89; // mov %eax, (%rax)
> -insn_page[1] = 0x00;
> -// Place the instruction we want the hypervisor to see in the alternate
> -// page. A buggy hypervisor will fetch a 32-bit immediate and return
> -// 0xc3c3c3c3.
> -alt_insn_page[0] = 0x48; // mov $0xc3c3c3c3c3c3c3c3, %rcx
> -alt_insn_page[1] = 0xb9;
> -
> -// Load the code TLB with insn_page, but point the page tables at
> -// alt_insn_page (and keep the data TLB clear, for AMD decode assist).
> -// This will make the CPU trap on the insn_page instruction but the
> -// hypervisor will see alt_insn_page.
> -install_page(cr3, virt_to_phys(insn_page), insn_ram);
> -// Load code TLB
> -invlpg(insn_ram);
> -asm volatile("call *%0" : : "r"(insn_ram + 3));
> -// Trap, let hypervisor emulate at alt_insn_page
> -install_page(cr3, virt_to_phys(alt_insn_page), insn_ram);
> -asm volatile("call *%1" : "=c"(val) : "r"(insn_ram), "a"(mem), "c"(0));
> -report("64-bit mov imm", val == 0xc3c3c3c3c3c3c3c3);
> +// mov $0xc3c3c3c3c3c3c3c3, %rcx
> +uint8_t alt_insn[] = {0x48, 0xb9, 0xc3, 0xc3, 0xc3,
> + 0xc3, 0xc3, 0xc3, 0xc3, 0xc3};
> +inregs = (struct regs){ 0 };
> +trap_emulator(mem, alt_insn, 10);
> +report("64-bit mov imm2", outregs.rcx == 0xc3c3c3c3c3c3c3c3);
>  }

0xc3 is ret and it may mess up the test case if the buggy hypervisor
sees it as

mov $0xc3c3c3c3, %rcx (sign extended, no such instruction exists)
ret
ret
ret
ret

I suggest changing it to 0x90 as part of this patch.

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 7/7] KVM: MMU: document fast invalidate all mmio sptes

2013-06-20 Thread Paolo Bonzini
On 20/06/2013 07:21, Rob Landley wrote:
> On 06/19/2013 04:09:25 AM, Xiao Guangrong wrote:
>> Document it to Documentation/virtual/kvm/mmu.txt
> 
> Why break a change to a single documentation file into 7 pieces?
> 
> Are we going to bisect the documentation?

It is explaining 7 different optimizations, and each patch could have
been submitted together with the corresponding optimization.  So it
makes sense---and it certainly helped review.

Paolo


Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling

2013-06-20 Thread Benjamin Herrenschmidt
On Thu, 2013-06-20 at 15:28 +1000, David Gibson wrote:
> > Just out of curiosity - would not get_file() and fput_atomic() on a
> group's
> > file* do the right job instead of vfio_group_add_external_user() and
> > vfio_group_del_external_user()?
> 
> I was thinking that too.  Grabbing a file reference would certainly be
> the usual way of handling this sort of thing.

But that wouldn't prevent the group ownership from being returned to
the kernel or another user, would it?

Ben.




Re: [PATCH v2 0/2] kvm: x86: Emulate MSR_PLATFORM_INFO

2013-06-20 Thread Gleb Natapov
On Wed, Jun 19, 2013 at 01:50:45PM -0400, Bandan Das wrote:
> Gleb Natapov  writes:
> 
> > On Tue, Jun 18, 2013 at 11:29:27AM -0400, Bandan Das wrote:
> >> Gleb Natapov  writes:
> >> 
> >> > On Tue, Jun 18, 2013 at 04:05:08PM +0200, Paolo Bonzini wrote:
> >> >> On 05/06/2013 10:42, Gleb Natapov wrote:
> >> >> >> > These patches add an emulated MSR_PLATFORM_INFO that kvm guests
> >> >> >> > can read as described in section 14.3.2.4 of the Intel SDM. 
> >> >> >> > The relevant changes and details are in [2/2]; [1/2] makes 
> >> >> >> > vendor_intel
> >> >> >> > generic. There are at least two known applications that fail to run
> >> >> >> > because this MSR is missing - Sandra and vTune.
> >> >> > So I really want Intel's opinion on this. Right now it is impossible to
> >> >> > implement the MSR correctly in the face of migration (maybe with TSC
> >> >> > scaling it will become possible), and while it is unimplemented, an
> >> >> > application that tries to use it fails; but if we implement it, the
> >> >> > application will just produce incorrect results without any means for
> >> >> > the user to detect it.
> >> >> 
> >> >> Jun, ping?  (Perhaps Gleb you want to ask a more specific question 
> >> >> though).
> >> >> 
> >> >> I don't think this is much different from any other RDTSC usage in
> >> >> applications (they will typically do their calibration manually, and do
> >> >> it just once).  I'm applying it to queue.
> >> >> 
> >> > And we do not support applications that use RDTSC directly! If we
> >> > could catch those it would be good from a support point of view, so
> >> > the way MSR_PLATFORM_INFO behaves now is better than the proposed
> >> > alternative.
> >> 
> >> If support is the issue, can't we have a flag that disables this by
> >> default, so that users who want to take the plunge (and be responsible
> >> for the consequences) can enable it to read platform_info?
> >> 
> > We have it already :) ignore_msrs. If it is set, unimplemented MSRs will
> > not inject #GP, but return a zero value instead. Zero is as incorrect as
> > anything else in the case of migration.
> 
> But does the intended purpose of ignore_msrs fit here? Even if we return
> 0 after a migration, there's no guarantee that the application will
> give correct results based on the old value of the MSR it read.
> 
I am not sure I understand. I am saying that since we cannot have a
correct implementation of the MSR, any implementation that does not
crash a guest is as good as any other.

> I agree with you on the potential problems, but I think we are completely
> ignoring the "non-migration" use case. These users would probably benefit
> from a correct value of the (virtual) MSR_PLATFORM_INFO. And it appears the
> easiest way to give both options to the user is a new module_param,
> say ignore_platform_info.
> 
Migration is an important part of virtualization. PMU emulation is
currently in demo status because migration is not implemented yet, but at
least it is possible to implement it.

> Scenarios:
> 1. ignore_platform_info = 0 (default). Inject #GP if the application tries to
> read this MSR.
> 2. ignore_platform_info = 1. The user wants to read the calculated value; her
> environment doesn't require migration.
> 3. ignore_msrs = 1. If this is set, we always return 0 and the application will
> hopefully resort to a workaround.
> 
A module flag is global for all VMs on a host. Implementing it like this
will guarantee that the feature is never used in production. ignore_msrs
exists only for developers, to quickly check whether a problem goes away
if some MSR does not #GP; it was never meant as a real solution.

To make it somewhat useful, the flag should be per-VM and exposed to all
layers up to the management stack. Management is the one that enables it
on a per-VM basis and guarantees that a VM with the feature enabled is
never live migrated. Frankly, IMO it will be another management knob that
is never set.

--
Gleb.