date:20141125



On 25/11/2014 19:45, Eduardo Habkost wrote:
>> > +static const char *cpuid_xsave_feature_name[] = {
>> > +"xsaveopt", "xsavec", "xgetbv1", "xsaves",
> None of the above features introduce any new state that might need to be
> migrated, or will require other changes in QEMU to work, right?
> 
> It looks like they don't introduce any extra state, but if they do, they
> need to be added to unmigratable_flags until migration support is
> implemented.
> 
> If they require other QEMU changes, it would be nice if KVM reported
> them using KVM_CHECK_EXTENSION instead of GET_SUPPORTED_CPUID, so it
> wouldn't break "-cpu host".

No, they don't.

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [Qemu-devel] [PATCH 00/17] RFC: userfault v2

2014-11-25 Thread Andrea Arcangeli

On Fri, Nov 21, 2014 at 11:05:45PM +, Peter Maydell wrote:
> If it's mapped and readable-but-not-writable then it should still
> fault on write accesses, though? These are cases we currently get
> SEGV for, anyway.

Yes then it'll work just fine.

> Ah, I guess we have a terminology difference. I was considering
> "page fault" to mean (roughly) "anything that causes the CPU to
> take an exception on an attempted load/store" and expected that
> userfaultfd would notify userspace of any of those. (Well, not
> alignment faults, maybe, but I'm definitely surprised that
> access permission issues don't get reported the same way as
> page-completely-missing issues. In other words I was expecting
> that this was "everything previously reported via SIGSEGV or
> SIGBUS now comes via userfaultfd".)

Just not PROT_NONE SIGSEGV faults, i.e. PROT_NONE would still SIGSEGV
currently. Because it's not a not-present fault (the page is present,
just not mapped readable) and it's neither a wrprotect fault (it is
trapped with the vma vm_flags permission bits instead before the
actual page fault handler is invoked). userfaultfd hooks into the
common code of the page fault handler.

> > Temporarily removing/moving the page with remap_anon_pages shall be
> > much better than using PROT_NONE for this (or alternative syscall name
> > to differentiate it further from remap_file_pages, or equivalent
> > userfaultfd command if we decide to hide the pte/pmd mangling as
> > userfaultfd commands instead of adding new standalone syscalls).
> 
> We don't use PROT_NONE for the linux-user situation, we just use
> mprotect() to remove the PAGE_WRITE permission so it's still
> readable.

Like said above it'll work just fine then.

> I suspect actually linux-user would be better off implementing
> something like "if this is a page which we've mapped read-only
> because we translated code out of it, then go ahead and remap
> it r/w and throw away the translation and retry the access,
> otherwise report SEGV to the guest", because taking SEGVs shouldn't
> be a fast path in the guest binary. That would let us work without
> architecture-specific junk and without requiring new kernel
> features either. So you can ignore this whole tangent thread :-)

You might get a significant boost if you use userfaultfd.

For postcopy live snapshot and postcopy live migration the main
benefit is the removal mprotect as a whole and the performance
improvement is a secondary benefit.

You can cap the max size of the JIT translated cache (and in turn the
maximal number of vmas generated by the mprotects) but we can't cap
the address space fragmentation. The faults may invoke way too many
mprotect and we may fragment the vma too much to the point we get
-ENOMEM.

Marking a page wrprotected however is always tricky, no matter if it's
fork doing it or KSM or something else. KSM just skips page that could
be under gup pins and retries them at the next pass. Fork simply won't
work right currently and it needs MADV_DONTFORK to avoid the
wrprotection entirely where you may use O_DIRECT mixed with threads
and fork.

For this new vma-less syscall (or ufd command) the best we could do is
to print a warning if any page marked wrprotected could be under GUP
pin (the warning could generate false positives as result of
speculative cache lookups that run lockless get_page_unless_zero() on
any pfn).

To avoid races the postcopy live snapshot feature I think it should be
enough to wait all in-flight I/O to complete before marking the guest
address space readonly (the KVM gup() side can be taken care of by
marking the shadow MMU readonly which is a must anyway, the mmu
notifier will take care of that part).

The postcopy live snapshot will have to copy the page so it's
effectively a COW in userland, and in turn it must ensure there's no
O_DIRECT in flight still writing to the page (despite we marked it
readonly) while the wrprotection syscall runs.

For your case probably there's no gup() in the equation unless you use
O_DIRECT (I don't think you use shadow-MMU in the kernel in
linux-user) so you don't have to worry about those races and it's just
simpler.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch -rt 1/2] KVM: use simple waitqueue for vcpu->wq

On Tue, Nov 25, 2014 at 01:57:37PM -0500, Rik van Riel wrote:
> On 11/25/2014 12:21 PM, Marcelo Tosatti wrote:
> > The problem:
> > 
> > On -RT, an emulated LAPIC timer instances has the following path:
> > 
> > 1) hard interrupt
> > 2) ksoftirqd is scheduled
> > 3) ksoftirqd wakes up vcpu thread
> > 4) vcpu thread is scheduled
> > 
> > This extra context switch introduces unnecessary latency in the 
> > LAPIC path for a KVM guest.
> > 
> > The solution:
> > 
> > Allow waking up vcpu thread from hardirq context,
> > thus avoiding the need for ksoftirqd to be scheduled.
> > 
> > Normal waitqueues make use of spinlocks, which on -RT 
> > are sleepable locks. Therefore, waking up a waitqueue 
> > waiter involves locking a sleeping lock, which 
> > is not allowed from hard interrupt context.
> > 
> 
> What are the consequences for the realtime kernel of
> making waitqueue traversal non-preemptible?
> 
> Is waitqueue traversal something that ever takes so
> long we need to care about it being non-preemptible?

commit d872cfa018625071a3a6b1ea8a62a2790d2c157a
Author: Thomas Gleixner 
Date:   Mon Dec 12 12:29:04 2011 +0100

wait-simple: Simple waitqueue implementation

wait_queue is a swiss army knife and in most of the cases the
complexity is not needed. For RT waitqueues are a constant source of
trouble as we can't convert the head lock to a raw spinlock due to
fancy and long lasting callbacks.

Unsure about the details...

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch -rt 2/2] KVM: lapic: mark LAPIC timer handler as irqsafe

2014-11-25 Thread Rik van Riel

On 11/25/2014 12:21 PM, Marcelo Tosatti wrote:
> Since lapic timer handler only wakes up a simple waitqueue,
> it can be executed from hardirq context.
> 
> Reduces average cyclictest latency by 3us.

Can this patch be merged in the KVM tree, and go
upstream via Paolo?

> Signed-off-by: Marcelo Tosatti 

Acked-by: Rik van Riel 

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch -rt 1/2] KVM: use simple waitqueue for vcpu->wq

2014-11-25 Thread Rik van Riel

On 11/25/2014 01:57 PM, Rik van Riel wrote:
> On 11/25/2014 12:21 PM, Marcelo Tosatti wrote:
>> The problem:
>>
>> On -RT, an emulated LAPIC timer instances has the following path:
>>
>> 1) hard interrupt
>> 2) ksoftirqd is scheduled
>> 3) ksoftirqd wakes up vcpu thread
>> 4) vcpu thread is scheduled
>>
>> This extra context switch introduces unnecessary latency in the 
>> LAPIC path for a KVM guest.
>>
>> The solution:
>>
>> Allow waking up vcpu thread from hardirq context,
>> thus avoiding the need for ksoftirqd to be scheduled.
>>
>> Normal waitqueues make use of spinlocks, which on -RT 
>> are sleepable locks. Therefore, waking up a waitqueue 
>> waiter involves locking a sleeping lock, which 
>> is not allowed from hard interrupt context.
>>
> 
> What are the consequences for the realtime kernel of
> making waitqueue traversal non-preemptible?
> 
> Is waitqueue traversal something that ever takes so
> long we need to care about it being non-preemptible?

I answered my own question.

This patch only changes the kvm vcpu waitqueue,
which should only ever have the vcpu thread itself
waiting on it. In other words, it is a wait "queue"
of just one entry long, and the latency of traversing
it will be absolutely minimal.

Unless someone can think of a better way to implement
this patch, it gets my:

Acked-by: Rik van Riel 

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v3 7/8] KVM: kvm-vfio: generic forwarding control

2014-11-25 Thread Alex Williamson

On Tue, 2014-11-25 at 19:20 +0100, Eric Auger wrote:
> On 11/24/2014 09:56 PM, Alex Williamson wrote:
> > On Sun, 2014-11-23 at 19:35 +0100, Eric Auger wrote:
> >> This patch introduces a new KVM_DEV_VFIO_DEVICE group.
> >>
> >> This is a new control channel which enables KVM to cooperate with
> >> viable VFIO devices.
> >>
> >> Functions are introduced to check the validity of a VFIO device
> >> file descriptor, increment/decrement the ref counter of the VFIO
> >> device.
> >>
> >> The patch introduces 2 attributes for this new device group:
> >> KVM_DEV_VFIO_DEVICE_FORWARD_IRQ, KVM_DEV_VFIO_DEVICE_UNFORWARD_IRQ.
> >> Their purpose is to turn a VFIO device IRQ into a forwarded IRQ and
> >> unset respectively unset the feature.
> >>
> >> The VFIO device stores a list of registered forwarded IRQs. The reference
> >> counter of the device is incremented each time a new IRQ is forwarded.
> >> Reference counter is decremented when the IRQ forwarding is unset.
> >>
> >> The forwarding programmming is architecture specific, implemented in
> >> kvm_arch_set_fwd_state function. Architecture specific implementation is
> >> enabled when __KVM_HAVE_ARCH_KVM_VFIO_FORWARD is set. When not set those
> >> functions are void.
> >>
> >> Signed-off-by: Eric Auger 
> >>
> >> ---
> >>
> >> v2 -> v3:
> >> - add API comments in kvm_host.h
> >> - improve the commit message
> >> - create a private kvm_vfio_fwd_irq struct
> >> - fwd_irq_action replaced by a bool and removal of VFIO_IRQ_CLEANUP. This
> >>   latter action will be handled in vgic.
> >> - add a vfio_device handle argument to kvm_arch_set_fwd_state. The goal is
> >>   to move platform specific stuff in architecture specific code.
> >> - kvm_arch_set_fwd_state renamed into kvm_arch_vfio_set_forward
> >> - increment the ref counter each time we do an IRQ forwarding and decrement
> >>   this latter each time one IRQ forward is unset. Simplifies the whole
> >>   ref counting.
> >> - simplification of list handling: create, search, removal
> >>
> >> v1 -> v2:
> >> - __KVM_HAVE_ARCH_KVM_VFIO renamed into __KVM_HAVE_ARCH_KVM_VFIO_FORWARD
> >> - original patch file separated into 2 parts: generic part moved in vfio.c
> >>   and ARM specific part(kvm_arch_set_fwd_state)
> >> ---
> >>  include/linux/kvm_host.h |  28 ++
> >>  virt/kvm/vfio.c  | 249 
> >> ++-
> >>  2 files changed, 274 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> >> index ea53b04..0b9659d 100644
> >> --- a/include/linux/kvm_host.h
> >> +++ b/include/linux/kvm_host.h
> >> @@ -1076,6 +1076,15 @@ struct kvm_device_ops {
> >>  unsigned long arg);
> >>  };
> >>  
> >> +/* internal self-contained structure describing a forwarded IRQ */
> >> +struct kvm_fwd_irq {
> >> +  struct kvm *kvm; /* VM to inject the GSI into */
> >> +  struct vfio_device *vdev; /* vfio device the IRQ belongs to */
> >> +  __u32 index; /* VFIO device IRQ index */
> >> +  __u32 subindex; /* VFIO device IRQ subindex */
> >> +  __u32 gsi; /* gsi, ie. virtual IRQ number */
> >> +};
> >> +
> >>  void kvm_device_get(struct kvm_device *dev);
> >>  void kvm_device_put(struct kvm_device *dev);
> >>  struct kvm_device *kvm_device_from_filp(struct file *filp);
> >> @@ -1085,6 +1094,25 @@ void kvm_unregister_device_ops(u32 type);
> >>  extern struct kvm_device_ops kvm_mpic_ops;
> >>  extern struct kvm_device_ops kvm_xics_ops;
> >>  
> >> +#ifdef __KVM_HAVE_ARCH_KVM_VFIO_FORWARD
> >> +/**
> >> + * kvm_arch_vfio_set_forward - changes the forwarded state of an IRQ
> >> + *
> >> + * @fwd_irq: handle to the forwarded irq struct
> >> + * @forward: true means forwarded, false means not forwarded
> >> + * returns 0 on success, < 0 on failure
> >> + */
> >> +int kvm_arch_vfio_set_forward(struct kvm_fwd_irq *fwd_irq,
> >> +bool forward);
> > 
> > We could add a struct device* to the args list or into struct
> > kvm_fwd_irq so that arch code doesn't need to touch the vdev.  arch code
> > has no business dealing with references to the vfio_device.
> Hi Alex,
> 
> Currently It can't put struct device* into the kvm_fwd_irq struct since
> I need to release the vfio_device with
> vfio_device_put_external_user(struct vfio_device *vdev)
> typically in kvm_vfio_clean_fwd_irq. So I need to store the pointers to
> the vfio_device somewhere.
> 
> I see 2 solutions: change the proto of
> vfio_device_put_external_user(struct vfio_device *vdev) and pass a
> struct device* (??)
> 
> or change the proto of kvm_arch_vfio_set_forward into
> 
> kvm_arch_vfio_set_forward(struct kvm *kvm, struct device *dev, int
> index, [int subindex], int gsi, bool forward) or using index/start/count
> but loosing the interest of having a self-contained internal struct.

The latter is sort of what I was assuming, I think the interface between
VFIO and KVM-VFIO is good, we just don't need to expose VFIO-isms out to
the arch KVM code.  KVM-VFIO should

Re: [patch -rt 2/2] KVM: lapic: mark LAPIC timer handler as irqsafe

2014-11-25 Thread Jan Kiszka

On 2014-11-25 18:38, Paolo Bonzini wrote:
> 
> 
> On 25/11/2014 18:21, Marcelo Tosatti wrote:
>> +
>> +if (r == HRTIMER_RESTART) {
>> +do {
>> +ret = hrtimer_start_expires(data, HRTIMER_MODE_ABS);
>> +if (ret == -ETIME)
>> +hrtimer_add_expires_ns(&ktimer->timer,
>> +ktimer->period);
> 
> Is it possible to just compute the time where the next interrupt
> happens?  I suspect the printk and WARN_ON below can be easily triggered
> by a guest.

We have a lower bound for the period that a guest can program. Unless
that value is set too low, this should practically not happen if we
avoid disturbances while handling the event and reprogramming the next
one (irqs off?).

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch -rt 1/2] KVM: use simple waitqueue for vcpu->wq

2014-11-25 Thread Rik van Riel

On 11/25/2014 12:21 PM, Marcelo Tosatti wrote:
> The problem:
> 
> On -RT, an emulated LAPIC timer instances has the following path:
> 
> 1) hard interrupt
> 2) ksoftirqd is scheduled
> 3) ksoftirqd wakes up vcpu thread
> 4) vcpu thread is scheduled
> 
> This extra context switch introduces unnecessary latency in the 
> LAPIC path for a KVM guest.
> 
> The solution:
> 
> Allow waking up vcpu thread from hardirq context,
> thus avoiding the need for ksoftirqd to be scheduled.
> 
> Normal waitqueues make use of spinlocks, which on -RT 
> are sleepable locks. Therefore, waking up a waitqueue 
> waiter involves locking a sleeping lock, which 
> is not allowed from hard interrupt context.
> 

What are the consequences for the realtime kernel of
making waitqueue traversal non-preemptible?

Is waitqueue traversal something that ever takes so
long we need to care about it being non-preemptible?


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] target-i386: add feature flags for CPUID[EAX=0xd,ECX=1]

2014-11-25 Thread Eduardo Habkost

On Tue, Nov 25, 2014 at 06:35:42PM +0100, Paolo Bonzini wrote:
> These represent xsave-related capabilities of the processor, and KVM may
> or may not support them.
> 
> Add feature bits so that they are considered by "-cpu ...,enforce", and use
> the new feature work instead of calling kvm_arch_get_supported_cpuid.
> 
> Signed-off-by: Paolo Bonzini 

Cool, eliminating all kvm_arch_get_supported_cpuid() calls from
cpu_x86_cpuid() is on my TODO-list.

I just have one question below:

> ---
>  target-i386/cpu.c | 28 +++-
>  target-i386/cpu.h |  6 ++
>  2 files changed, 33 insertions(+), 1 deletion(-)
> 
> diff --git a/target-i386/cpu.c b/target-i386/cpu.c
> index e9df33e..a2dde11 100644
> --- a/target-i386/cpu.c
> +++ b/target-i386/cpu.c
> @@ -274,6 +274,17 @@ static const char *cpuid_apm_edx_feature_name[] = {
>  NULL, NULL, NULL, NULL,
>  };
>  
> +static const char *cpuid_xsave_feature_name[] = {
> +"xsaveopt", "xsavec", "xgetbv1", "xsaves",

None of the above features introduce any new state that might need to be
migrated, or will require other changes in QEMU to work, right?

It looks like they don't introduce any extra state, but if they do, they
need to be added to unmigratable_flags until migration support is
implemented.

If they require other QEMU changes, it would be nice if KVM reported
them using KVM_CHECK_EXTENSION instead of GET_SUPPORTED_CPUID, so it
wouldn't break "-cpu host".


> +NULL, NULL, NULL, NULL,
> +NULL, NULL, NULL, NULL,
> +NULL, NULL, NULL, NULL,
> +NULL, NULL, NULL, NULL,
> +NULL, NULL, NULL, NULL,
> +NULL, NULL, NULL, NULL,
> +NULL, NULL, NULL, NULL,
> +};
> +
>  #define I486_FEATURES (CPUID_FP87 | CPUID_VME | CPUID_PSE)
>  #define PENTIUM_FEATURES (I486_FEATURES | CPUID_DE | CPUID_TSC | \
>CPUID_MSR | CPUID_MCE | CPUID_CX8 | CPUID_MMX | CPUID_APIC)
> @@ -391,6 +402,13 @@ static FeatureWordInfo feature_word_info[FEATURE_WORDS] 
> = {
>  .tcg_features = TCG_APM_FEATURES,
>  .unmigratable_flags = CPUID_APM_INVTSC,
>  },
> +[FEAT_XSAVE] = {
> +.feat_names = cpuid_xsave_feature_name,
> +.cpuid_eax = 0xd,
> +.cpuid_needs_ecx = true, .cpuid_ecx = 1,
> +.cpuid_reg = R_EAX,
> +.tcg_features = 0,
> +},
>  };
>  
>  typedef struct X86RegisterInfo32 {
> @@ -1018,6 +1036,8 @@ static X86CPUDefinition builtin_x86_defs[] = {
>  CPUID_EXT2_SYSCALL,
>  .features[FEAT_8000_0001_ECX] =
>  CPUID_EXT3_LAHF_LM,
> +.features[FEAT_XSAVE] =
> +CPUID_XSAVE_XSAVEOPT,
>  .xlevel = 0x800A,
>  .model_id = "Intel Xeon E312xx (Sandy Bridge)",
>  },
> @@ -1051,6 +1071,8 @@ static X86CPUDefinition builtin_x86_defs[] = {
>  CPUID_7_0_EBX_HLE | CPUID_7_0_EBX_AVX2 | CPUID_7_0_EBX_SMEP |
>  CPUID_7_0_EBX_BMI2 | CPUID_7_0_EBX_ERMS | CPUID_7_0_EBX_INVPCID |
>  CPUID_7_0_EBX_RTM,
> +.features[FEAT_XSAVE] =
> +CPUID_XSAVE_XSAVEOPT,
>  .xlevel = 0x800A,
>  .model_id = "Intel Core Processor (Haswell)",
>  },
> @@ -1085,6 +1107,8 @@ static X86CPUDefinition builtin_x86_defs[] = {
>  CPUID_7_0_EBX_BMI2 | CPUID_7_0_EBX_ERMS | CPUID_7_0_EBX_INVPCID |
>  CPUID_7_0_EBX_RTM | CPUID_7_0_EBX_RDSEED | CPUID_7_0_EBX_ADX |
>  CPUID_7_0_EBX_SMAP,
> +.features[FEAT_XSAVE] =
> +CPUID_XSAVE_XSAVEOPT,
>  .xlevel = 0x800A,
>  .model_id = "Intel Core Processor (Broadwell)",
>  },
> @@ -1202,6 +1226,7 @@ static X86CPUDefinition builtin_x86_defs[] = {
>  CPUID_EXT3_3DNOWPREFETCH | CPUID_EXT3_MISALIGNSSE |
>  CPUID_EXT3_SSE4A | CPUID_EXT3_ABM | CPUID_EXT3_SVM |
>  CPUID_EXT3_LAHF_LM,
> +/* no xsaveopt! */
>  .xlevel = 0x801A,
>  .model_id = "AMD Opteron 62xx class CPU",
>  },
> @@ -1236,6 +1261,7 @@ static X86CPUDefinition builtin_x86_defs[] = {
>  CPUID_EXT3_3DNOWPREFETCH | CPUID_EXT3_MISALIGNSSE |
>  CPUID_EXT3_SSE4A | CPUID_EXT3_ABM | CPUID_EXT3_SVM |
>  CPUID_EXT3_LAHF_LM,
> +/* no xsaveopt! */
>  .xlevel = 0x801A,
>  .model_id = "AMD Opteron 63xx class CPU",
>  },
> @@ -2377,7 +2403,7 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, 
> uint32_t count,
>  *eax |= kvm_mask & (XSTATE_FP | XSTATE_SSE);
>  *ebx = *ecx;
>  } else if (count == 1) {
> -*eax = kvm_arch_get_supported_cpuid(s, 0xd, 1, R_EAX);
> +*eax = env->features[FEAT_XSAVE];
>  } else if (count < ARRAY_SIZE(ext_save_areas)) {
>  const ExtSaveArea *esa = &ext_save_areas[count];
>  if ((env->features[esa->feature] & esa->bits) == esa->bits &&
> diff --git a/target-i386/cpu.h b/target-i386/cpu.h
> index 015f5b5..f9d74c7 100644
> --- a/target-i386/cpu.h
> +++ b/target-i386/cpu.h
> @@

Re: KVM causes #GP on XRSTORS

2014-11-25 Thread Eduardo Habkost

On Fri, Nov 21, 2014 at 03:46:55PM +0100, Paolo Bonzini wrote:
> 
> 
> On 20/11/2014 17:34, Nadav Amit wrote:
> > Fenghua,
> > 
> > I got KVM (v3.17) crashing on a machine that supports XRSTORS - It appears 
> > to get a #GP when it is trying to load the guest FPU.
> > One reason for the #GP is that XCOMP_BV[63] is zeroed on the guest_fpu, but 
> > I am not sure it is the only problem.
> > Was KVM ever tested with XRSTORS?
> 
> What is the content of the CPUID[EAX=13,ECX=0] and CPUID[EAX=13,ECX=1]
> leaves on the host?
> 
> Fenghua, which processors have XSAVEC, which have XGETBV with ECX=1, and
> which have XSAVES?  We need to expose this in QEMU, for which I can send
> a patch later today or next week (CCing Eduardo for this).
> 
> We will also have to uncompact the XSAVE area either in KVM_GET_XSAVE or
> in QEMU.  It's probably not hard to do it in the kernel.

Oh, now I see that KVM uses xsave_state(), which uses XSAVEOPT and
XSAVES depending on the host CPU features. I didn't know that.

QEMU already expects the standard format to be part of the KVM_GET_XSAVE
ABI. So this would already cause problems in host CPUs supporting
XSAVES, even if XSAVES is not exposed to guests yet.

-- 
Eduardo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v3 7/8] KVM: kvm-vfio: generic forwarding control

On 11/24/2014 09:56 PM, Alex Williamson wrote:
> On Sun, 2014-11-23 at 19:35 +0100, Eric Auger wrote:
>> This patch introduces a new KVM_DEV_VFIO_DEVICE group.
>>
>> This is a new control channel which enables KVM to cooperate with
>> viable VFIO devices.
>>
>> Functions are introduced to check the validity of a VFIO device
>> file descriptor, increment/decrement the ref counter of the VFIO
>> device.
>>
>> The patch introduces 2 attributes for this new device group:
>> KVM_DEV_VFIO_DEVICE_FORWARD_IRQ, KVM_DEV_VFIO_DEVICE_UNFORWARD_IRQ.
>> Their purpose is to turn a VFIO device IRQ into a forwarded IRQ and
>> unset respectively unset the feature.
>>
>> The VFIO device stores a list of registered forwarded IRQs. The reference
>> counter of the device is incremented each time a new IRQ is forwarded.
>> Reference counter is decremented when the IRQ forwarding is unset.
>>
>> The forwarding programmming is architecture specific, implemented in
>> kvm_arch_set_fwd_state function. Architecture specific implementation is
>> enabled when __KVM_HAVE_ARCH_KVM_VFIO_FORWARD is set. When not set those
>> functions are void.
>>
>> Signed-off-by: Eric Auger 
>>
>> ---
>>
>> v2 -> v3:
>> - add API comments in kvm_host.h
>> - improve the commit message
>> - create a private kvm_vfio_fwd_irq struct
>> - fwd_irq_action replaced by a bool and removal of VFIO_IRQ_CLEANUP. This
>>   latter action will be handled in vgic.
>> - add a vfio_device handle argument to kvm_arch_set_fwd_state. The goal is
>>   to move platform specific stuff in architecture specific code.
>> - kvm_arch_set_fwd_state renamed into kvm_arch_vfio_set_forward
>> - increment the ref counter each time we do an IRQ forwarding and decrement
>>   this latter each time one IRQ forward is unset. Simplifies the whole
>>   ref counting.
>> - simplification of list handling: create, search, removal
>>
>> v1 -> v2:
>> - __KVM_HAVE_ARCH_KVM_VFIO renamed into __KVM_HAVE_ARCH_KVM_VFIO_FORWARD
>> - original patch file separated into 2 parts: generic part moved in vfio.c
>>   and ARM specific part(kvm_arch_set_fwd_state)
>> ---
>>  include/linux/kvm_host.h |  28 ++
>>  virt/kvm/vfio.c  | 249 
>> ++-
>>  2 files changed, 274 insertions(+), 3 deletions(-)
>>
>> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
>> index ea53b04..0b9659d 100644
>> --- a/include/linux/kvm_host.h
>> +++ b/include/linux/kvm_host.h
>> @@ -1076,6 +1076,15 @@ struct kvm_device_ops {
>>unsigned long arg);
>>  };
>>  
>> +/* internal self-contained structure describing a forwarded IRQ */
>> +struct kvm_fwd_irq {
>> +struct kvm *kvm; /* VM to inject the GSI into */
>> +struct vfio_device *vdev; /* vfio device the IRQ belongs to */
>> +__u32 index; /* VFIO device IRQ index */
>> +__u32 subindex; /* VFIO device IRQ subindex */
>> +__u32 gsi; /* gsi, ie. virtual IRQ number */
>> +};
>> +
>>  void kvm_device_get(struct kvm_device *dev);
>>  void kvm_device_put(struct kvm_device *dev);
>>  struct kvm_device *kvm_device_from_filp(struct file *filp);
>> @@ -1085,6 +1094,25 @@ void kvm_unregister_device_ops(u32 type);
>>  extern struct kvm_device_ops kvm_mpic_ops;
>>  extern struct kvm_device_ops kvm_xics_ops;
>>  
>> +#ifdef __KVM_HAVE_ARCH_KVM_VFIO_FORWARD
>> +/**
>> + * kvm_arch_vfio_set_forward - changes the forwarded state of an IRQ
>> + *
>> + * @fwd_irq: handle to the forwarded irq struct
>> + * @forward: true means forwarded, false means not forwarded
>> + * returns 0 on success, < 0 on failure
>> + */
>> +int kvm_arch_vfio_set_forward(struct kvm_fwd_irq *fwd_irq,
>> +  bool forward);
> 
> We could add a struct device* to the args list or into struct
> kvm_fwd_irq so that arch code doesn't need to touch the vdev.  arch code
> has no business dealing with references to the vfio_device.
Hi Alex,

Currently It can't put struct device* into the kvm_fwd_irq struct since
I need to release the vfio_device with
vfio_device_put_external_user(struct vfio_device *vdev)
typically in kvm_vfio_clean_fwd_irq. So I need to store the pointers to
the vfio_device somewhere.

I see 2 solutions: change the proto of
vfio_device_put_external_user(struct vfio_device *vdev) and pass a
struct device* (??)

or change the proto of kvm_arch_vfio_set_forward into

kvm_arch_vfio_set_forward(struct kvm *kvm, struct device *dev, int
index, [int subindex], int gsi, bool forward) or using index/start/count
but loosing the interest of having a self-contained internal struct.

> 
>> +
>> +#else
>> +static inline int kvm_arch_vfio_set_forward(struct kvm_fwd_irq *fwd_irq,
>> +bool forward)
>> +{
>> +return 0;
>> +}
>> +#endif
>> +
>>  #ifdef CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT
>>  
>>  static inline void kvm_vcpu_set_in_spin_loop(struct kvm_vcpu *vcpu, bool 
>> val)
>> diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
>> index 6f0cc34..af178bb 100644
>> --- a/virt/kvm

Re: [PATCH RFC v2 07/12] dataplane: allow virtio-1 devices

On Tue, 25 Nov 2014 18:55:56 +0100
Greg Kurz  wrote:

> On Tue, 25 Nov 2014 14:24:18 +0100
> Cornelia Huck  wrote:
> 
> > Handle endianness conversion for virtio-1 virtqueues correctly.
> > 
> > Note that dataplane now needs to be built per-target.
> > 
> > Signed-off-by: Cornelia Huck 
> > ---
> 
> We still have the same error as in your previous post...
> 
> In file included from include/hw/virtio/dataplane/vring.h:23:0,
>  from include/hw/virtio/virtio-scsi.h:21,
>  from hw/virtio/virtio-pci.c:24:
> include/hw/virtio/virtio-access.h: In function ‘virtio_access_is_big_endian’:
> include/hw/virtio/virtio-access.h:28:15: error: attempt to use poisoned 
> "TARGET_WORDS_BIGENDIAN"
>  #elif defined(TARGET_WORDS_BIGENDIAN)

Grmpf, I thought I had fixed this. I'll do an update tomorrow.

> 
> Either build virtio-pci per-target or probably better move the target tainted
> accessors to another file.
> 
> >  hw/block/dataplane/virtio-blk.c |3 +-
> >  hw/scsi/virtio-scsi-dataplane.c |2 +-
> >  hw/virtio/Makefile.objs |2 +-
> >  hw/virtio/dataplane/Makefile.objs   |2 +-
> >  hw/virtio/dataplane/vring.c |   85 
> > +++
> >  include/hw/virtio/dataplane/vring.h |   64 --
> >  6 files changed, 113 insertions(+), 45 deletions(-)

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH RFC v2 07/12] dataplane: allow virtio-1 devices

2014-11-25 Thread Greg Kurz

On Tue, 25 Nov 2014 14:24:18 +0100
Cornelia Huck  wrote:

> Handle endianness conversion for virtio-1 virtqueues correctly.
> 
> Note that dataplane now needs to be built per-target.
> 
> Signed-off-by: Cornelia Huck 
> ---

We still have the same error as in your previous post...

In file included from include/hw/virtio/dataplane/vring.h:23:0,
 from include/hw/virtio/virtio-scsi.h:21,
 from hw/virtio/virtio-pci.c:24:
include/hw/virtio/virtio-access.h: In function ‘virtio_access_is_big_endian’:
include/hw/virtio/virtio-access.h:28:15: error: attempt to use poisoned 
"TARGET_WORDS_BIGENDIAN"
 #elif defined(TARGET_WORDS_BIGENDIAN)

Either build virtio-pci per-target or probably better move the target tainted
accessors to another file.

>  hw/block/dataplane/virtio-blk.c |3 +-
>  hw/scsi/virtio-scsi-dataplane.c |2 +-
>  hw/virtio/Makefile.objs |2 +-
>  hw/virtio/dataplane/Makefile.objs   |2 +-
>  hw/virtio/dataplane/vring.c |   85 
> +++
>  include/hw/virtio/dataplane/vring.h |   64 --
>  6 files changed, 113 insertions(+), 45 deletions(-)
> 
> diff --git a/hw/block/dataplane/virtio-blk.c b/hw/block/dataplane/virtio-blk.c
> index 1222a37..c25878c 100644
> --- a/hw/block/dataplane/virtio-blk.c
> +++ b/hw/block/dataplane/virtio-blk.c
> @@ -16,6 +16,7 @@
>  #include "qemu/iov.h"
>  #include "qemu/thread.h"
>  #include "qemu/error-report.h"
> +#include "hw/virtio/virtio-access.h"
>  #include "hw/virtio/dataplane/vring.h"
>  #include "sysemu/block-backend.h"
>  #include "hw/virtio/virtio-blk.h"
> @@ -75,7 +76,7 @@ static void complete_request_vring(VirtIOBlockReq *req, 
> unsigned char status)
>  VirtIOBlockDataPlane *s = req->dev->dataplane;
>  stb_p(&req->in->status, status);
> 
> -vring_push(&req->dev->dataplane->vring, &req->elem,
> +vring_push(s->vdev, &req->dev->dataplane->vring, &req->elem,
> req->qiov.size + sizeof(*req->in));
> 
>  /* Suppress notification to guest by BH and its scheduled
> diff --git a/hw/scsi/virtio-scsi-dataplane.c b/hw/scsi/virtio-scsi-dataplane.c
> index 03a1e8c..418d73b 100644
> --- a/hw/scsi/virtio-scsi-dataplane.c
> +++ b/hw/scsi/virtio-scsi-dataplane.c
> @@ -94,7 +94,7 @@ void virtio_scsi_vring_push_notify(VirtIOSCSIReq *req)
>  {
>  VirtIODevice *vdev = VIRTIO_DEVICE(req->vring->parent);
> 
> -vring_push(&req->vring->vring, &req->elem,
> +vring_push(vdev, &req->vring->vring, &req->elem,
> req->qsgl.size + req->resp_iov.size);
> 
>  if (vring_should_notify(vdev, &req->vring->vring)) {
> diff --git a/hw/virtio/Makefile.objs b/hw/virtio/Makefile.objs
> index d21c397..19b224a 100644
> --- a/hw/virtio/Makefile.objs
> +++ b/hw/virtio/Makefile.objs
> @@ -2,7 +2,7 @@ common-obj-y += virtio-rng.o
>  common-obj-$(CONFIG_VIRTIO_PCI) += virtio-pci.o
>  common-obj-y += virtio-bus.o
>  common-obj-y += virtio-mmio.o
> -common-obj-$(CONFIG_VIRTIO) += dataplane/
> +obj-$(CONFIG_VIRTIO) += dataplane/
> 
>  obj-y += virtio.o virtio-balloon.o 
>  obj-$(CONFIG_LINUX) += vhost.o vhost-backend.o vhost-user.o
> diff --git a/hw/virtio/dataplane/Makefile.objs 
> b/hw/virtio/dataplane/Makefile.objs
> index 9a8cfc0..753a9ca 100644
> --- a/hw/virtio/dataplane/Makefile.objs
> +++ b/hw/virtio/dataplane/Makefile.objs
> @@ -1 +1 @@
> -common-obj-y += vring.o
> +obj-y += vring.o
> diff --git a/hw/virtio/dataplane/vring.c b/hw/virtio/dataplane/vring.c
> index a051775..69809ff 100644
> --- a/hw/virtio/dataplane/vring.c
> +++ b/hw/virtio/dataplane/vring.c
> @@ -18,6 +18,7 @@
>  #include "hw/hw.h"
>  #include "exec/memory.h"
>  #include "exec/address-spaces.h"
> +#include "hw/virtio/virtio-access.h"
>  #include "hw/virtio/dataplane/vring.h"
>  #include "qemu/error-report.h"
> 
> @@ -83,7 +84,7 @@ bool vring_setup(Vring *vring, VirtIODevice *vdev, int n)
>  vring_init(&vring->vr, virtio_queue_get_num(vdev, n), vring_ptr, 4096);
> 
>  vring->last_avail_idx = virtio_queue_get_last_avail_idx(vdev, n);
> -vring->last_used_idx = vring->vr.used->idx;
> +vring->last_used_idx = vring_get_used_idx(vdev, vring);
>  vring->signalled_used = 0;
>  vring->signalled_used_valid = false;
> 
> @@ -104,7 +105,7 @@ void vring_teardown(Vring *vring, VirtIODevice *vdev, int 
> n)
>  void vring_disable_notification(VirtIODevice *vdev, Vring *vring)
>  {
>  if (!(vdev->guest_features[0] & (1 << VIRTIO_RING_F_EVENT_IDX))) {
> -vring->vr.used->flags |= VRING_USED_F_NO_NOTIFY;
> +vring_set_used_flags(vdev, vring, VRING_USED_F_NO_NOTIFY);
>  }
>  }
> 
> @@ -117,10 +118,10 @@ bool vring_enable_notification(VirtIODevice *vdev, 
> Vring *vring)
>  if (vdev->guest_features[0] & (1 << VIRTIO_RING_F_EVENT_IDX)) {
>  vring_avail_event(&vring->vr) = vring->vr.avail->idx;
>  } else {
> -vring->vr.used->flags &= ~VRING_USED_F_NO_NOTIFY;
> +vring_clear_used_flags(vdev, vring, VRIN

Re: [patch -rt 2/2] KVM: lapic: mark LAPIC timer handler as irqsafe



On 25/11/2014 18:21, Marcelo Tosatti wrote:
> +
> + if (r == HRTIMER_RESTART) {
> + do {
> + ret = hrtimer_start_expires(data, HRTIMER_MODE_ABS);
> + if (ret == -ETIME)
> + hrtimer_add_expires_ns(&ktimer->timer,
> + ktimer->period);

Is it possible to just compute the time where the next interrupt
happens?  I suspect the printk and WARN_ON below can be easily triggered
by a guest.

Paolo

> + i++;
> + } while (ret == -ETIME && i < 10);
> +
> + if (ret == -ETIME) {
> + printk(KERN_ERR "%s: failed to reprogram timer\n",
> +__func__);
> + WARN_ON(1);
> + }
> + }
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH] target-i386: add feature flags for CPUID[EAX=0xd,ECX=1]

These represent xsave-related capabilities of the processor, and KVM may
or may not support them.

Add feature bits so that they are considered by "-cpu ...,enforce", and use
the new feature work instead of calling kvm_arch_get_supported_cpuid.

Signed-off-by: Paolo Bonzini 
---
 target-i386/cpu.c | 28 +++-
 target-i386/cpu.h |  6 ++
 2 files changed, 33 insertions(+), 1 deletion(-)

diff --git a/target-i386/cpu.c b/target-i386/cpu.c
index e9df33e..a2dde11 100644
--- a/target-i386/cpu.c
+++ b/target-i386/cpu.c
@@ -274,6 +274,17 @@ static const char *cpuid_apm_edx_feature_name[] = {
 NULL, NULL, NULL, NULL,
 };
 
+static const char *cpuid_xsave_feature_name[] = {
+"xsaveopt", "xsavec", "xgetbv1", "xsaves",
+NULL, NULL, NULL, NULL,
+NULL, NULL, NULL, NULL,
+NULL, NULL, NULL, NULL,
+NULL, NULL, NULL, NULL,
+NULL, NULL, NULL, NULL,
+NULL, NULL, NULL, NULL,
+NULL, NULL, NULL, NULL,
+};
+
 #define I486_FEATURES (CPUID_FP87 | CPUID_VME | CPUID_PSE)
 #define PENTIUM_FEATURES (I486_FEATURES | CPUID_DE | CPUID_TSC | \
   CPUID_MSR | CPUID_MCE | CPUID_CX8 | CPUID_MMX | CPUID_APIC)
@@ -391,6 +402,13 @@ static FeatureWordInfo feature_word_info[FEATURE_WORDS] = {
 .tcg_features = TCG_APM_FEATURES,
 .unmigratable_flags = CPUID_APM_INVTSC,
 },
+[FEAT_XSAVE] = {
+.feat_names = cpuid_xsave_feature_name,
+.cpuid_eax = 0xd,
+.cpuid_needs_ecx = true, .cpuid_ecx = 1,
+.cpuid_reg = R_EAX,
+.tcg_features = 0,
+},
 };
 
 typedef struct X86RegisterInfo32 {
@@ -1018,6 +1036,8 @@ static X86CPUDefinition builtin_x86_defs[] = {
 CPUID_EXT2_SYSCALL,
 .features[FEAT_8000_0001_ECX] =
 CPUID_EXT3_LAHF_LM,
+.features[FEAT_XSAVE] =
+CPUID_XSAVE_XSAVEOPT,
 .xlevel = 0x800A,
 .model_id = "Intel Xeon E312xx (Sandy Bridge)",
 },
@@ -1051,6 +1071,8 @@ static X86CPUDefinition builtin_x86_defs[] = {
 CPUID_7_0_EBX_HLE | CPUID_7_0_EBX_AVX2 | CPUID_7_0_EBX_SMEP |
 CPUID_7_0_EBX_BMI2 | CPUID_7_0_EBX_ERMS | CPUID_7_0_EBX_INVPCID |
 CPUID_7_0_EBX_RTM,
+.features[FEAT_XSAVE] =
+CPUID_XSAVE_XSAVEOPT,
 .xlevel = 0x800A,
 .model_id = "Intel Core Processor (Haswell)",
 },
@@ -1085,6 +1107,8 @@ static X86CPUDefinition builtin_x86_defs[] = {
 CPUID_7_0_EBX_BMI2 | CPUID_7_0_EBX_ERMS | CPUID_7_0_EBX_INVPCID |
 CPUID_7_0_EBX_RTM | CPUID_7_0_EBX_RDSEED | CPUID_7_0_EBX_ADX |
 CPUID_7_0_EBX_SMAP,
+.features[FEAT_XSAVE] =
+CPUID_XSAVE_XSAVEOPT,
 .xlevel = 0x800A,
 .model_id = "Intel Core Processor (Broadwell)",
 },
@@ -1202,6 +1226,7 @@ static X86CPUDefinition builtin_x86_defs[] = {
 CPUID_EXT3_3DNOWPREFETCH | CPUID_EXT3_MISALIGNSSE |
 CPUID_EXT3_SSE4A | CPUID_EXT3_ABM | CPUID_EXT3_SVM |
 CPUID_EXT3_LAHF_LM,
+/* no xsaveopt! */
 .xlevel = 0x801A,
 .model_id = "AMD Opteron 62xx class CPU",
 },
@@ -1236,6 +1261,7 @@ static X86CPUDefinition builtin_x86_defs[] = {
 CPUID_EXT3_3DNOWPREFETCH | CPUID_EXT3_MISALIGNSSE |
 CPUID_EXT3_SSE4A | CPUID_EXT3_ABM | CPUID_EXT3_SVM |
 CPUID_EXT3_LAHF_LM,
+/* no xsaveopt! */
 .xlevel = 0x801A,
 .model_id = "AMD Opteron 63xx class CPU",
 },
@@ -2377,7 +2403,7 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, 
uint32_t count,
 *eax |= kvm_mask & (XSTATE_FP | XSTATE_SSE);
 *ebx = *ecx;
 } else if (count == 1) {
-*eax = kvm_arch_get_supported_cpuid(s, 0xd, 1, R_EAX);
+*eax = env->features[FEAT_XSAVE];
 } else if (count < ARRAY_SIZE(ext_save_areas)) {
 const ExtSaveArea *esa = &ext_save_areas[count];
 if ((env->features[esa->feature] & esa->bits) == esa->bits &&
diff --git a/target-i386/cpu.h b/target-i386/cpu.h
index 015f5b5..f9d74c7 100644
--- a/target-i386/cpu.h
+++ b/target-i386/cpu.h
@@ -411,6 +411,7 @@ typedef enum FeatureWord {
 FEAT_C000_0001_EDX, /* CPUID[C000_0001].EDX */
 FEAT_KVM,   /* CPUID[4000_0001].EAX (KVM_CPUID_FEATURES) */
 FEAT_SVM,   /* CPUID[8000_000A].EDX */
+FEAT_XSAVE, /* CPUID[EAX=0xd,ECX=1].EAX */
 FEATURE_WORDS,
 } FeatureWord;
 
@@ -571,6 +572,11 @@ typedef uint32_t FeatureWordArray[FEATURE_WORDS];
 #define CPUID_7_0_EBX_AVX512ER (1U << 27) /* AVX-512 Exponential and 
Reciprocal */
 #define CPUID_7_0_EBX_AVX512CD (1U << 28) /* AVX-512 Conflict Detection */
 
+#define CPUID_XSAVE_XSAVEOPT   (1U << 0)
+#define CPUID_XSAVE_XSAVEC (1U << 1)
+#define CPUID_XSAVE_XGETBV1(1U << 2)
+#define CPUID_XSAVE_XSAVES (1U << 3)
+
 /* CPUID[0x8007].EDX flags: */
 #define CPUID_APM_INVTSC   (1U << 8)
 
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe

[patch -rt 1/2] KVM: use simple waitqueue for vcpu->wq

The problem:

On -RT, an emulated LAPIC timer instances has the following path:

1) hard interrupt
2) ksoftirqd is scheduled
3) ksoftirqd wakes up vcpu thread
4) vcpu thread is scheduled

This extra context switch introduces unnecessary latency in the 
LAPIC path for a KVM guest.

The solution:

Allow waking up vcpu thread from hardirq context,
thus avoiding the need for ksoftirqd to be scheduled.

Normal waitqueues make use of spinlocks, which on -RT 
are sleepable locks. Therefore, waking up a waitqueue 
waiter involves locking a sleeping lock, which 
is not allowed from hard interrupt context.

cyclictest command line:
# cyclictest -m -n -q -p99 -l 100 -h60  -D 1m

This patch reduces the average latency in my tests from 14us to 11us.

Signed-off-by: Marcelo Tosatti 

---
 arch/arm/kvm/arm.c  |4 ++--
 arch/arm/kvm/psci.c |4 ++--
 arch/mips/kvm/kvm_mips.c|8 
 arch/powerpc/include/asm/kvm_host.h |4 ++--
 arch/powerpc/kvm/book3s_hv.c|   20 ++--
 arch/s390/include/asm/kvm_host.h|2 +-
 arch/s390/kvm/interrupt.c   |   22 ++
 arch/s390/kvm/sigp.c|   16 
 arch/x86/kvm/lapic.c|6 +++---
 include/linux/kvm_host.h|4 ++--
 virt/kvm/async_pf.c |4 ++--
 virt/kvm/kvm_main.c |   16 
 12 files changed, 54 insertions(+), 56 deletions(-)

Index: linux-stable-rt/arch/arm/kvm/arm.c
===
--- linux-stable-rt.orig/arch/arm/kvm/arm.c 2014-11-25 14:13:39.188899952 
-0200
+++ linux-stable-rt/arch/arm/kvm/arm.c  2014-11-25 14:14:38.620810092 -0200
@@ -495,9 +495,9 @@
 
 static void vcpu_pause(struct kvm_vcpu *vcpu)
 {
-   wait_queue_head_t *wq = kvm_arch_vcpu_wq(vcpu);
+   struct swait_head *wq = kvm_arch_vcpu_wq(vcpu);
 
-   wait_event_interruptible(*wq, !vcpu->arch.pause);
+   swait_event_interruptible(*wq, !vcpu->arch.pause);
 }
 
 static int kvm_vcpu_initialized(struct kvm_vcpu *vcpu)
Index: linux-stable-rt/arch/arm/kvm/psci.c
===
--- linux-stable-rt.orig/arch/arm/kvm/psci.c2014-11-25 14:13:39.189899951 
-0200
+++ linux-stable-rt/arch/arm/kvm/psci.c 2014-11-25 14:14:38.620810092 -0200
@@ -36,7 +36,7 @@
 {
struct kvm *kvm = source_vcpu->kvm;
struct kvm_vcpu *vcpu = NULL, *tmp;
-   wait_queue_head_t *wq;
+   struct swait_head *wq;
unsigned long cpu_id;
unsigned long mpidr;
phys_addr_t target_pc;
@@ -80,7 +80,7 @@
smp_mb();   /* Make sure the above is visible */
 
wq = kvm_arch_vcpu_wq(vcpu);
-   wake_up_interruptible(wq);
+   swait_wake_interruptible(wq);
 
return KVM_PSCI_RET_SUCCESS;
 }
Index: linux-stable-rt/arch/mips/kvm/kvm_mips.c
===
--- linux-stable-rt.orig/arch/mips/kvm/kvm_mips.c   2014-11-25 
14:13:39.191899948 -0200
+++ linux-stable-rt/arch/mips/kvm/kvm_mips.c2014-11-25 14:14:38.621810091 
-0200
@@ -464,8 +464,8 @@
 
dvcpu->arch.wait = 0;
 
-   if (waitqueue_active(&dvcpu->wq)) {
-   wake_up_interruptible(&dvcpu->wq);
+   if (swaitqueue_active(&dvcpu->wq)) {
+   swait_wake_interruptible(&dvcpu->wq);
}
 
return 0;
@@ -971,8 +971,8 @@
kvm_mips_callbacks->queue_timer_int(vcpu);
 
vcpu->arch.wait = 0;
-   if (waitqueue_active(&vcpu->wq)) {
-   wake_up_interruptible(&vcpu->wq);
+   if (swaitqueue_active(&vcpu->wq)) {
+   swait_wake_nterruptible(&vcpu->wq);
}
 }
 
Index: linux-stable-rt/arch/powerpc/include/asm/kvm_host.h
===
--- linux-stable-rt.orig/arch/powerpc/include/asm/kvm_host.h2014-11-25 
14:13:39.193899944 -0200
+++ linux-stable-rt/arch/powerpc/include/asm/kvm_host.h 2014-11-25 
14:14:38.621810091 -0200
@@ -295,7 +295,7 @@
u8 in_guest;
struct list_head runnable_threads;
spinlock_t lock;
-   wait_queue_head_t wq;
+   struct swait_head wq;
u64 stolen_tb;
u64 preempt_tb;
struct kvm_vcpu *runner;
@@ -612,7 +612,7 @@
u8 prodded;
u32 last_inst;
 
-   wait_queue_head_t *wqp;
+   struct swait_head *wqp;
struct kvmppc_vcore *vcore;
int ret;
int trap;
Index: linux-stable-rt/arch/powerpc/kvm/book3s_hv.c
===
--- linux-stable-rt.orig/arch/powerpc/kvm/book3s_hv.c   2014-11-25 
14:13:39.195899942 -0200
+++ linux-stable-rt/arch/powerpc/kvm/book3s_hv.c2014-11-25 
14:14:38.625810085 -0200
@@ -74,11 +74,11 @@
 {
int me;
int cpu = vcpu->cpu;
-   wait_queue_head_t *wqp;
+   struct swait_head *wqp;
 
w

[patch -rt 2/2] KVM: lapic: mark LAPIC timer handler as irqsafe

Since lapic timer handler only wakes up a simple waitqueue,
it can be executed from hardirq context.

Reduces average cyclictest latency by 3us.

Signed-off-by: Marcelo Tosatti 

---
 arch/x86/kvm/lapic.c |   42 +++---
 1 file changed, 39 insertions(+), 3 deletions(-)

Index: linux-stable-rt/arch/x86/kvm/lapic.c
===
--- linux-stable-rt.orig/arch/x86/kvm/lapic.c   2014-11-25 14:14:38.636810068 
-0200
+++ linux-stable-rt/arch/x86/kvm/lapic.c2014-11-25 14:17:28.850567059 
-0200
@@ -1031,8 +1031,38 @@
   apic->divide_count);
 }
 
+
+static enum hrtimer_restart apic_timer_fn(struct hrtimer *data);
+
+static void apic_timer_expired(struct hrtimer *data)
+{
+   int ret, i = 0;
+   enum hrtimer_restart r;
+   struct kvm_timer *ktimer = container_of(data, struct kvm_timer, timer);
+
+   r = apic_timer_fn(data);
+
+   if (r == HRTIMER_RESTART) {
+   do {
+   ret = hrtimer_start_expires(data, HRTIMER_MODE_ABS);
+   if (ret == -ETIME)
+   hrtimer_add_expires_ns(&ktimer->timer,
+   ktimer->period);
+   i++;
+   } while (ret == -ETIME && i < 10);
+
+   if (ret == -ETIME) {
+   printk(KERN_ERR "%s: failed to reprogram timer\n",
+  __func__);
+   WARN_ON(1);
+   }
+   }
+}
+
+
 static void start_apic_timer(struct kvm_lapic *apic)
 {
+   int ret;
ktime_t now;
atomic_set(&apic->lapic_timer.pending, 0);
 
@@ -1062,9 +1092,11 @@
}
}
 
-   hrtimer_start(&apic->lapic_timer.timer,
+   ret = hrtimer_start(&apic->lapic_timer.timer,
  ktime_add_ns(now, apic->lapic_timer.period),
  HRTIMER_MODE_ABS);
+   if (ret == -ETIME)
+   apic_timer_expired(&apic->lapic_timer.timer);
 
apic_debug("%s: bus cycle is %" PRId64 "ns, now 0x%016"
   PRIx64 ", "
@@ -1094,8 +1126,10 @@
ns = (tscdeadline - guest_tsc) * 100ULL;
do_div(ns, this_tsc_khz);
}
-   hrtimer_start(&apic->lapic_timer.timer,
+   ret = hrtimer_start(&apic->lapic_timer.timer,
ktime_add_ns(now, ns), HRTIMER_MODE_ABS);
+   if (ret == -ETIME)
+   apic_timer_expired(&apic->lapic_timer.timer);
 
local_irq_restore(flags);
}
@@ -1581,6 +1615,7 @@
hrtimer_init(&apic->lapic_timer.timer, CLOCK_MONOTONIC,
 HRTIMER_MODE_ABS);
apic->lapic_timer.timer.function = apic_timer_fn;
+   apic->lapic_timer.timer.irqsafe = 1;
 
/*
 * APIC is created enabled. This will prevent kvm_lapic_set_base from
@@ -1699,7 +1734,8 @@
 
timer = &vcpu->arch.apic->lapic_timer.timer;
if (hrtimer_cancel(timer))
-   hrtimer_start_expires(timer, HRTIMER_MODE_ABS);
+   if (hrtimer_start_expires(timer, HRTIMER_MODE_ABS) == -ETIME)
+   apic_timer_expired(timer);
 }
 
 /*


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[patch -rt 0/2] use simple waitqueue for kvm vcpu waitqueue (v2)

The problem:

On -RT, an emulated LAPIC timer instances has the following path:

1) hard interrupt
2) ksoftirqd is scheduled
3) ksoftirqd wakes up vcpu thread
4) vcpu thread is scheduled

This extra context switch introduces unnecessary latency in the
LAPIC path for a KVM guest.

The solution:

Allow waking up vcpu thread from hardirq context,
thus avoiding the need for ksoftirqd to be scheduled.

Normal waitqueues make use of spinlocks, which on -RT
are sleepable locks. Therefore, waking up a waitqueue
waiter involves locking a sleeping lock, which
is not allowed from hard interrupt context.

cyclictest command line:
# cyclictest -m -n -q -p99 -l 100 -h60  -D 1m

This patch reduces the average latency in my tests from 14us to 11us.

v2: improve changelog (Rik van Riel)




--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 2/7] KVM: arm: guest debug, define API headers



On 25/11/2014 18:13, Peter Maydell wrote:
> On 25 November 2014 at 17:05, Paolo Bonzini  wrote:
>> > So there is no register that says "this breakpoint has triggered" or
>> > "this watchpoint has triggered"?
> Nope. You take a debug exception; the syndrome register tells
> you if it was a bp or a wp, and if it was a wp the fault address
> register tells you the address being accessed (if it was a bp
> you know the program counter, obviously). The debugger is expected
> to be able to figure it out from there, if it cares.

That's already good enough---do the KVM_DEBUG_EXIT_* constants match the
syndrome register, or if not why?

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v4 29/42] vhost/net: virtio 1.0 byte swap

Signed-off-by: Michael S. Tsirkin 
---
 drivers/vhost/net.c | 12 +++-
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index dce5c58..cae22f9 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -416,7 +416,7 @@ static void handle_tx(struct vhost_net *net)
struct ubuf_info *ubuf;
ubuf = nvq->ubuf_info + nvq->upend_idx;
 
-   vq->heads[nvq->upend_idx].id = head;
+   vq->heads[nvq->upend_idx].id = cpu_to_vhost32(vq, head);
vq->heads[nvq->upend_idx].len = VHOST_DMA_IN_PROGRESS;
ubuf->callback = vhost_zerocopy_callback;
ubuf->ctx = nvq->ubufs;
@@ -500,6 +500,7 @@ static int get_rx_bufs(struct vhost_virtqueue *vq,
int headcount = 0;
unsigned d;
int r, nlogs = 0;
+   u32 len;
 
while (datalen > 0 && headcount < quota) {
if (unlikely(seg >= UIO_MAXIOV)) {
@@ -527,13 +528,14 @@ static int get_rx_bufs(struct vhost_virtqueue *vq,
nlogs += *log_num;
log += *log_num;
}
-   heads[headcount].id = d;
-   heads[headcount].len = iov_length(vq->iov + seg, in);
-   datalen -= heads[headcount].len;
+   heads[headcount].id = cpu_to_vhost32(vq, d);
+   len = iov_length(vq->iov + seg, in);
+   heads[headcount].len = cpu_to_vhost32(vq, len);
+   datalen -= len;
++headcount;
seg += in;
}
-   heads[headcount - 1].len += datalen;
+   heads[headcount - 1].len = cpu_to_vhost32(vq, len - datalen);
*iovcount = seg;
if (unlikely(log))
*log_num = nlogs;
-- 
MST

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v4 28/42] vhost: make features 64 bit

We need to use bit 32 for virtio 1.0

Signed-off-by: Michael S. Tsirkin 
---
 drivers/vhost/vhost.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index b9032e8..1f321fd 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -106,7 +106,7 @@ struct vhost_virtqueue {
/* Protected by virtqueue mutex. */
struct vhost_memory *memory;
void *private_data;
-   unsigned acked_features;
+   u64 acked_features;
/* Log write descriptors */
void __user *log_base;
struct vhost_log *log;
-- 
MST

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH RFC v2 00/12] qemu: towards virtio-1 host support

On Tue, 25 Nov 2014 18:44:10 +0200
"Michael S. Tsirkin"  wrote:

> On Tue, Nov 25, 2014 at 02:24:11PM +0100, Cornelia Huck wrote:
> > Hi,
> > 
> > here's the next version of my virtio-1 qemu patchset. Using virtio-1
> > virtio-blk and virtio-net devices with a guest kernel built from
> > <1416829787-14252-1-git-send-email-...@redhat.com> still seems to
> > work for the virtio-ccw transport.
> 
> I posted v4, would appreciate an explicit tested-by from you
> on that thread.

I was still reviewing v3 :)

I'll put it on my todo list for tomorrow.

> 
> > 
> > Changes from v1:
> > - rebased against current master
> > - don't advertise VERSION_1 for all devices, make devices switch it on
> >   individually
> 
> So from ccw perspective, all that's left is to convert more devices?
> 

I'll probably need to look into the explicit command reject for
FEATURES_OK, but other than that, only the individual devices are left
AFAICS.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 2/7] KVM: arm: guest debug, define API headers

2014-11-25 Thread Peter Maydell

On 25 November 2014 at 17:05, Paolo Bonzini  wrote:
> So there is no register that says "this breakpoint has triggered" or
> "this watchpoint has triggered"?

Nope. You take a debug exception; the syndrome register tells
you if it was a bp or a wp, and if it was a wp the fault address
register tells you the address being accessed (if it was a bp
you know the program counter, obviously). The debugger is expected
to be able to figure it out from there, if it cares.

-- PMM
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 2/7] KVM: arm: guest debug, define API headers



On 25/11/2014 17:10, Alex Bennée wrote:
> +/* Exit types which define why we did a debug exit */
> +#define KVM_DEBUG_EXIT_ERROR 0x0
> +#define KVM_DEBUG_EXIT_SINGLE_STEP   0x1
> +#define KVM_DEBUG_EXIT_SW_BKPT   0x2
> +#define KVM_DEBUG_EXIT_HW_BKPT   0x3
> +#define KVM_DEBUG_EXIT_HW_WTPT   0x4
> +
>  struct kvm_debug_exit_arch {
> + __u64 address;
> + __u32 exit_type;
>  };
>  

So there is no register that says "this breakpoint has triggered" or
"this watchpoint has triggered"?

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v4 26/42] vhost/net: force len for TX to host endian

We use native endian-ness internally but never
expose it to guest.

Signed-off-by: Michael S. Tsirkin 
---
 drivers/vhost/net.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 8dae2f7..dce5c58 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -48,15 +48,15 @@ MODULE_PARM_DESC(experimental_zcopytx, "Enable Zero Copy 
TX;"
  * status internally; used for zerocopy tx only.
  */
 /* Lower device DMA failed */
-#define VHOST_DMA_FAILED_LEN   3
+#define VHOST_DMA_FAILED_LEN   ((__force __virtio32)3)
 /* Lower device DMA done */
-#define VHOST_DMA_DONE_LEN 2
+#define VHOST_DMA_DONE_LEN ((__force __virtio32)2)
 /* Lower device DMA in progress */
-#define VHOST_DMA_IN_PROGRESS  1
+#define VHOST_DMA_IN_PROGRESS  ((__force __virtio32)1)
 /* Buffer unused */
-#define VHOST_DMA_CLEAR_LEN0
+#define VHOST_DMA_CLEAR_LEN((__force __virtio32)0)
 
-#define VHOST_DMA_IS_DONE(len) ((len) >= VHOST_DMA_DONE_LEN)
+#define VHOST_DMA_IS_DONE(len) ((__force u32)(len) >= (__force 
u32)VHOST_DMA_DONE_LEN)
 
 enum {
VHOST_NET_FEATURES = VHOST_FEATURES |
-- 
MST

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v4 27/42] vhost: virtio 1.0 endian-ness support

Signed-off-by: Michael S. Tsirkin 
---
 drivers/vhost/vhost.c | 93 +++
 1 file changed, 56 insertions(+), 37 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index c90f437..4d379ed 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -33,8 +33,8 @@ enum {
VHOST_MEMORY_F_LOG = 0x1,
 };
 
-#define vhost_used_event(vq) ((u16 __user *)&vq->avail->ring[vq->num])
-#define vhost_avail_event(vq) ((u16 __user *)&vq->used->ring[vq->num])
+#define vhost_used_event(vq) ((__virtio16 __user *)&vq->avail->ring[vq->num])
+#define vhost_avail_event(vq) ((__virtio16 __user *)&vq->used->ring[vq->num])
 
 static void vhost_poll_func(struct file *file, wait_queue_head_t *wqh,
poll_table *pt)
@@ -1001,7 +1001,7 @@ EXPORT_SYMBOL_GPL(vhost_log_write);
 static int vhost_update_used_flags(struct vhost_virtqueue *vq)
 {
void __user *used;
-   if (__put_user(vq->used_flags, &vq->used->flags) < 0)
+   if (__put_user(cpu_to_vhost16(vq, vq->used_flags), &vq->used->flags) < 
0)
return -EFAULT;
if (unlikely(vq->log_used)) {
/* Make sure the flag is seen before log. */
@@ -1019,7 +1019,7 @@ static int vhost_update_used_flags(struct vhost_virtqueue 
*vq)
 
 static int vhost_update_avail_event(struct vhost_virtqueue *vq, u16 
avail_event)
 {
-   if (__put_user(vq->avail_idx, vhost_avail_event(vq)))
+   if (__put_user(cpu_to_vhost16(vq, vq->avail_idx), 
vhost_avail_event(vq)))
return -EFAULT;
if (unlikely(vq->log_used)) {
void __user *used;
@@ -1038,6 +1038,7 @@ static int vhost_update_avail_event(struct 
vhost_virtqueue *vq, u16 avail_event)
 
 int vhost_init_used(struct vhost_virtqueue *vq)
 {
+   __virtio16 last_used_idx;
int r;
if (!vq->private_data)
return 0;
@@ -1046,7 +1047,13 @@ int vhost_init_used(struct vhost_virtqueue *vq)
if (r)
return r;
vq->signalled_used_valid = false;
-   return get_user(vq->last_used_idx, &vq->used->idx);
+   if (!access_ok(VERIFY_READ, &vq->used->idx, sizeof vq->used->idx))
+   return -EFAULT;
+   r = __get_user(last_used_idx, &vq->used->idx);
+   if (r)
+   return r;
+   vq->last_used_idx = vhost16_to_cpu(vq, last_used_idx);
+   return 0;
 }
 EXPORT_SYMBOL_GPL(vhost_init_used);
 
@@ -1087,16 +1094,16 @@ static int translate_desc(struct vhost_virtqueue *vq, 
u64 addr, u32 len,
 /* Each buffer in the virtqueues is actually a chain of descriptors.  This
  * function returns the next descriptor in the chain,
  * or -1U if we're at the end. */
-static unsigned next_desc(struct vring_desc *desc)
+static unsigned next_desc(struct vhost_virtqueue *vq, struct vring_desc *desc)
 {
unsigned int next;
 
/* If this descriptor says it doesn't chain, we're done. */
-   if (!(desc->flags & VRING_DESC_F_NEXT))
+   if (!(desc->flags & cpu_to_vhost16(vq, VRING_DESC_F_NEXT)))
return -1U;
 
/* Check they're not leading us off end of descriptors. */
-   next = desc->next;
+   next = vhost16_to_cpu(vq, desc->next);
/* Make sure compiler knows to grab that: we don't want it changing! */
/* We will use the result as an index in an array, so most
 * architectures only need a compiler barrier here. */
@@ -1113,18 +1120,19 @@ static int get_indirect(struct vhost_virtqueue *vq,
 {
struct vring_desc desc;
unsigned int i = 0, count, found = 0;
+   u32 len = vhost32_to_cpu(vq, indirect->len);
int ret;
 
/* Sanity check */
-   if (unlikely(indirect->len % sizeof desc)) {
+   if (unlikely(len % sizeof desc)) {
vq_err(vq, "Invalid length in indirect descriptor: "
   "len 0x%llx not multiple of 0x%zx\n",
-  (unsigned long long)indirect->len,
+  (unsigned long long)vhost32_to_cpu(vq, indirect->len),
   sizeof desc);
return -EINVAL;
}
 
-   ret = translate_desc(vq, indirect->addr, indirect->len, vq->indirect,
+   ret = translate_desc(vq, vhost64_to_cpu(vq, indirect->addr), len, 
vq->indirect,
 UIO_MAXIOV);
if (unlikely(ret < 0)) {
vq_err(vq, "Translation failure %d in indirect.\n", ret);
@@ -1135,7 +1143,7 @@ static int get_indirect(struct vhost_virtqueue *vq,
 * architectures only need a compiler barrier here. */
read_barrier_depends();
 
-   count = indirect->len / sizeof desc;
+   count = len / sizeof desc;
/* Buffers are chained via a 16 bit next field, so
 * we can have at most 2^16 of these. */
if (unlikely(count > USHRT_MAX + 1)) {
@@ -1155,16 +1163,17 @@ static int get_indirect(struct vhost_virtqueue *vq,
if (unlikely(memcpy_fromiovec((unsig

[PATCH v4 25/42] vhost: add memory access wrappers

Signed-off-by: Michael S. Tsirkin 
---
 drivers/vhost/vhost.h | 33 -
 1 file changed, 32 insertions(+), 1 deletion(-)

diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 3eda654..b9032e8 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -174,6 +174,37 @@ enum {
 
 static inline int vhost_has_feature(struct vhost_virtqueue *vq, int bit)
 {
-   return vq->acked_features & (1 << bit);
+   return vq->acked_features & (1ULL << bit);
+}
+
+/* Memory accessors */
+static inline u16 vhost16_to_cpu(struct vhost_virtqueue *vq, __virtio16 val)
+{
+   return __virtio16_to_cpu(vhost_has_feature(vq, VIRTIO_F_VERSION_1), 
val);
+}
+
+static inline __virtio16 cpu_to_vhost16(struct vhost_virtqueue *vq, u16 val)
+{
+   return __cpu_to_virtio16(vhost_has_feature(vq, VIRTIO_F_VERSION_1), 
val);
+}
+
+static inline u32 vhost32_to_cpu(struct vhost_virtqueue *vq, __virtio32 val)
+{
+   return __virtio32_to_cpu(vhost_has_feature(vq, VIRTIO_F_VERSION_1), 
val);
+}
+
+static inline __virtio32 cpu_to_vhost32(struct vhost_virtqueue *vq, u32 val)
+{
+   return __cpu_to_virtio32(vhost_has_feature(vq, VIRTIO_F_VERSION_1), 
val);
+}
+
+static inline u64 vhost64_to_cpu(struct vhost_virtqueue *vq, __virtio64 val)
+{
+   return __virtio64_to_cpu(vhost_has_feature(vq, VIRTIO_F_VERSION_1), 
val);
+}
+
+static inline __virtio64 cpu_to_vhost64(struct vhost_virtqueue *vq, u64 val)
+{
+   return __cpu_to_virtio64(vhost_has_feature(vq, VIRTIO_F_VERSION_1), 
val);
 }
 #endif
-- 
MST

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v4 30/42] vhost/net: larger header for virtio 1.0

Signed-off-by: Michael S. Tsirkin 
---
 drivers/vhost/net.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index cae22f9..1ac58d0 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -1027,7 +1027,8 @@ static int vhost_net_set_features(struct vhost_net *n, 
u64 features)
size_t vhost_hlen, sock_hlen, hdr_len;
int i;
 
-   hdr_len = (features & (1 << VIRTIO_NET_F_MRG_RXBUF)) ?
+   hdr_len = (features & ((1ULL << VIRTIO_NET_F_MRG_RXBUF) |
+  (1ULL << VIRTIO_F_VERSION_1))) ?
sizeof(struct virtio_net_hdr_mrg_rxbuf) :
sizeof(struct virtio_net_hdr);
if (features & (1 << VHOST_NET_F_VIRTIO_NET_HDR)) {
-- 
MST

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Exposing host debug capabilities to userspace

On 25/11/2014 17:35, Alexander Graf wrote:
> Unfortunately on ARM you also have a few other constraints - the debug
> register space is partitioned into magic super debug registers at the
> top (with an implementation specific amount) and normal debug registers
> in the lower end of the space.

Does the gdbstub ever need to use the magic super debug registers?  Even
if it does, the logic is the same as x86.  On x86 you might run out of
breakpoints, on ARM you might run out of breakpoints in general or magic
super breakpoints in particular.

> I would just treat gdbstub debugging as the ugly step child that it is
> and not introduce anything special for it (except for debug event
> deflection to user space). Then only ever work on guest debug registers
> and call it a day. Chances are just too high that we get the interfaces
> wrong and shoot ourselves in the foot.

I agree.  But we do need a way to retrieve the number of valid fields in
struct kvm_guest_debug_arch, if that is not architecturally defined
(e.g. on x86 it's just always 4).  A "hidden" ONE_REG, whose existence
is determined by (ARM || ARM64) &&
kvm_check_extension(KVM_CAP_SET_GUEST_DEBUG), is certainly fine and I
like it because it doesn't pollute the global KVM_CAP_* namespace.

The alternative is a capability; however, if you start with two
capabilities you'll end up needing four, and then realize that returning
a struct via ONE_REG would have been a better idea.  :)  We already have
how many breakpoints, how many watchpoints, how many magic super debug
registers,...

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v4 31/42] vhost/net: enable virtio 1.0

Signed-off-by: Michael S. Tsirkin 
---
 drivers/vhost/net.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 1ac58d0..984242e 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -61,7 +61,8 @@ MODULE_PARM_DESC(experimental_zcopytx, "Enable Zero Copy TX;"
 enum {
VHOST_NET_FEATURES = VHOST_FEATURES |
 (1ULL << VHOST_NET_F_VIRTIO_NET_HDR) |
-(1ULL << VIRTIO_NET_F_MRG_RXBUF),
+(1ULL << VIRTIO_NET_F_MRG_RXBUF) |
+(1ULL << VIRTIO_F_VERSION_1),
 };
 
 enum {
-- 
MST

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v4 32/42] vhost/net: suppress compiler warning

len is always initialized since function is called with size > 0.

Signed-off-by: Michael S. Tsirkin 
---
 drivers/vhost/net.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 984242e..54ffbb0 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -501,7 +501,7 @@ static int get_rx_bufs(struct vhost_virtqueue *vq,
int headcount = 0;
unsigned d;
int r, nlogs = 0;
-   u32 len;
+   u32 uninitialized_var(len);
 
while (datalen > 0 && headcount < quota) {
if (unlikely(seg >= UIO_MAXIOV)) {
-- 
MST

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH RFC v2 00/12] qemu: towards virtio-1 host support

On Tue, Nov 25, 2014 at 02:24:11PM +0100, Cornelia Huck wrote:
> Hi,
> 
> here's the next version of my virtio-1 qemu patchset. Using virtio-1
> virtio-blk and virtio-net devices with a guest kernel built from
> <1416829787-14252-1-git-send-email-...@redhat.com> still seems to
> work for the virtio-ccw transport.

I posted v4, would appreciate an explicit tested-by from you
on that thread.

> 
> Changes from v1:
> - rebased against current master
> - don't advertise VERSION_1 for all devices, make devices switch it on
>   individually

So from ccw perspective, all that's left is to convert more devices?


> Cornelia Huck (9):
>   virtio: cull virtio_bus_set_vdev_features
>   virtio: support more feature bits
>   s390x/virtio-ccw: fix check for WRITE_FEAT
>   virtio: introduce legacy virtio devices
>   virtio: allow virtio-1 queue layout
>   dataplane: allow virtio-1 devices
>   s390x/virtio-ccw: support virtio-1 set_vq format
>   virtio-net/virtio-blk: enable virtio 1.0
>   s390x/virtio-ccw: enable virtio 1.0
> 
> Thomas Huth (3):
>   linux-headers/virtio_config: Update with VIRTIO_F_VERSION_1
>   s390x/css: Add a callback for when subchannel gets disabled
>   s390x/virtio-ccw: add virtio set-revision call
> 
>  hw/9pfs/virtio-9p-device.c  |7 +-
>  hw/block/dataplane/virtio-blk.c |3 +-
>  hw/block/virtio-blk.c   |   13 ++-
>  hw/char/virtio-serial-bus.c |9 +-
>  hw/net/virtio-net.c |   42 +---
>  hw/s390x/css.c  |   12 +++
>  hw/s390x/css.h  |1 +
>  hw/s390x/s390-virtio-bus.c  |9 +-
>  hw/s390x/virtio-ccw.c   |  186 
> +++
>  hw/s390x/virtio-ccw.h   |7 +-
>  hw/scsi/vhost-scsi.c|7 +-
>  hw/scsi/virtio-scsi-dataplane.c |2 +-
>  hw/scsi/virtio-scsi.c   |   10 +-
>  hw/virtio/Makefile.objs |2 +-
>  hw/virtio/dataplane/Makefile.objs   |2 +-
>  hw/virtio/dataplane/vring.c |   95 ++
>  hw/virtio/virtio-balloon.c  |8 +-
>  hw/virtio/virtio-bus.c  |   23 +
>  hw/virtio/virtio-mmio.c |9 +-
>  hw/virtio/virtio-pci.c  |   13 +--
>  hw/virtio/virtio-rng.c  |2 +-
>  hw/virtio/virtio.c  |   51 +++---
>  include/hw/virtio/dataplane/vring.h |   64 +++-
>  include/hw/virtio/virtio-access.h   |4 +
>  include/hw/virtio/virtio-bus.h  |   10 +-
>  include/hw/virtio/virtio.h  |   34 +--
>  linux-headers/linux/virtio_config.h |3 +
>  27 files changed, 449 insertions(+), 179 deletions(-)
> 
> -- 
> 1.7.9.5
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v4 41/42] vhost/scsi: partial virtio 1.0 support

Include all endian conversions as required by virtio 1.0.
Don't set virtio 1.0 yet, since that requires ANY_LAYOUT
which we don't yet support.

Signed-off-by: Michael S. Tsirkin 
---
 drivers/vhost/scsi.c | 22 --
 1 file changed, 12 insertions(+), 10 deletions(-)

diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index a17f118..01c01cb 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -168,6 +168,7 @@ enum {
VHOST_SCSI_VQ_IO = 2,
 };
 
+/* Note: can't set VIRTIO_F_VERSION_1 yet, since that implies ANY_LAYOUT. */
 enum {
VHOST_SCSI_FEATURES = VHOST_FEATURES | (1ULL << VIRTIO_SCSI_F_HOTPLUG) |
   (1ULL << VIRTIO_SCSI_F_T10_PI)
@@ -577,8 +578,8 @@ tcm_vhost_allocate_evt(struct vhost_scsi *vs,
return NULL;
}
 
-   evt->event.event = event;
-   evt->event.reason = reason;
+   evt->event.event = cpu_to_vhost32(vq, event);
+   evt->event.reason = cpu_to_vhost32(vq, reason);
vs->vs_events_nr++;
 
return evt;
@@ -636,7 +637,7 @@ again:
}
 
if (vs->vs_events_missed) {
-   event->event |= VIRTIO_SCSI_T_EVENTS_MISSED;
+   event->event |= cpu_to_vhost32(vq, VIRTIO_SCSI_T_EVENTS_MISSED);
vs->vs_events_missed = false;
}
 
@@ -695,12 +696,13 @@ static void vhost_scsi_complete_cmd_work(struct 
vhost_work *work)
cmd, se_cmd->residual_count, se_cmd->scsi_status);
 
memset(&v_rsp, 0, sizeof(v_rsp));
-   v_rsp.resid = se_cmd->residual_count;
+   v_rsp.resid = cpu_to_vhost32(cmd->tvc_vq, 
se_cmd->residual_count);
/* TODO is status_qualifier field needed? */
v_rsp.status = se_cmd->scsi_status;
-   v_rsp.sense_len = se_cmd->scsi_sense_length;
+   v_rsp.sense_len = cpu_to_vhost32(cmd->tvc_vq,
+se_cmd->scsi_sense_length);
memcpy(v_rsp.sense, cmd->tvc_sense_buf,
-  v_rsp.sense_len);
+  se_cmd->scsi_sense_length);
ret = copy_to_user(cmd->tvc_resp, &v_rsp, sizeof(v_rsp));
if (likely(ret == 0)) {
struct vhost_scsi_virtqueue *q;
@@ -1095,14 +1097,14 @@ vhost_scsi_handle_vq(struct vhost_scsi *vs, struct 
vhost_virtqueue *vq)
", but wrong data_direction\n");
goto err_cmd;
}
-   prot_bytes = v_req_pi.pi_bytesout;
+   prot_bytes = vhost32_to_cpu(vq, 
v_req_pi.pi_bytesout);
} else if (v_req_pi.pi_bytesin) {
if (data_direction != DMA_FROM_DEVICE) {
vq_err(vq, "Received non zero 
di_pi_niov"
", but wrong data_direction\n");
goto err_cmd;
}
-   prot_bytes = v_req_pi.pi_bytesin;
+   prot_bytes = vhost32_to_cpu(vq, 
v_req_pi.pi_bytesin);
}
if (prot_bytes) {
int tmp = 0;
@@ -1117,12 +1119,12 @@ vhost_scsi_handle_vq(struct vhost_scsi *vs, struct 
vhost_virtqueue *vq)
data_first += prot_niov;
data_niov = data_num - prot_niov;
}
-   tag = v_req_pi.tag;
+   tag = vhost64_to_cpu(vq, v_req_pi.tag);
task_attr = v_req_pi.task_attr;
cdb = &v_req_pi.cdb[0];
lun = ((v_req_pi.lun[2] << 8) | v_req_pi.lun[3]) & 
0x3FFF;
} else {
-   tag = v_req.tag;
+   tag = vhost64_to_cpu(vq, v_req.tag);
task_attr = v_req.task_attr;
cdb = &v_req.cdb[0];
lun = ((v_req.lun[2] << 8) | v_req.lun[3]) & 0x3FFF;
-- 
MST

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Exposing host debug capabilities to userspace

On 24/11/2014 15:52, Christoffer Dall wrote:
> 
> We had a lenghty IRC discussion on this, for the curious, go read it
> here:
> http://irclogs.linaro.org/2014/11/24/%23kvm-arm.html

Some notes...

> chazy but saying that you want to design an ABI that potentially exposes 
> fewer debug registers to gdbstubs than your hardware supports and breaks 
> gdbstub support if you happen to support emulation of a cpu with more debug 
> registers in the future, because you want to re-use ONE_REG is not a great 
> approach either
> 
> agraf chazy: but you wouldn't even remotely think of doing it, no?
> 
> agraf chazy: I'm saying that QEMU shouldn't have access to the excess 
> registers
> 
> chazy agraf: I wouldn’t even remotely think of doing nested virtualization, 
> but people have
> 
> agraf chazy: QEMU spawns an A57, it gets 8 debug registers
> 
> chazy agraf: I understand, but I don’t agree with your rationale
> 
> agraf chazy: not "QEMU spawns an A57 on an A67, so it gets 8 real and 8 
> hidden debug registers that it also needs to be aware of mappings of"

I disagree with Alex.  QEMU can use the 16 debug registers as it wishes,
since it sets them with KVM_SET_GUEST_DEBUG.  The guest's eight can be
mapped to the last 8 if those are the required semantics for the
architecture.  Just exit to QEMU if you do a debug register write while
gdbstub debugging is active; QEMU gets the contents of the 8
VCPU-visible registers with ONE_REG (equivalent of x86 GET_DEBUGREGS);
calls KVM_SET_GUEST_DEBUG to reflect them in the 16-register state; and
restarts.

It works as long as you can trap both reads and writes.

> chazy ajb-linaro: don’t migrate a guest that is getting debugged is
> the answer to that one I believe
> 
> ajb-linarochazy: at least with the overlay approach that happens 
> automatically
> 
> ajb-linarothe overlay isn't migrated and the new host isn't calling 
> SET_GUEST_DEBUG so whatever the debug registers where before are restorerd

Right.  The destination will lose the hardware breakpoints/watchpoints
because it starts the guest with KVM_SET_GUEST_DEBUG.  Apart from losing
them, everything should work fine.  The dest sets the 8 VCPU-visible
registers with ONE_REG, the hypervisor reflects them in a subset of the
16 physical registers, and "that's it".

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Exposing host debug capabilities to userspace

2014-11-25 Thread Alexander Graf

On 25.11.14 17:21, Paolo Bonzini wrote:
> 
> 
> On 24/11/2014 14:59, Alex Bennée wrote:
>> Alexander Graf pointed out that KVM_CHECK_EXTENSION can return any
>> positive number for success. How about using:
>>
>> max_hw_bps = kvm_check_extension(kvm_state, KVM_CAP_GUEST_DEBUG_HW_BPS);
>> max_hw_wps = kvm_check_extension(kvm_state, KVM_CAP_GUEST_DEBUG_HW_WPS);
>>
>> Seems pretty sane, doesn't change the semantics of an API and is
>> architecture agnostic if others need the number?
> 
> Yes, this was going to be my suggestion as well.  Just I would use a
> bitmask in case some register can act as both breakpoint and watchpoint.
> 
> On x86, each of the four bp/wp registers (each register can act as both)
> can be used for either guest or gdbstub debugging.  If the
> KVM_GUESTDBG_USE_HW_BP feature is enabled, the guest is entered with
> "made up" debug register contents, that we pass via
> KVM_SET_GUEST_DEBUG's struct kvm_guest_debug_arch.  Otherwise, the guest
> is entered with real debug register contents passed via
> KVM_SET_DEBUGREGS.  Reads/writes of the debug registers trap to KVM
> (which helps the guest see the expected values of the debug registers in
> the former case).  There is no KVM_GET_GUEST_DEBUG because the
> corresponding info is passed via struct kvm_debug_exit_arch.
> 
> If gdbstub hardware breakpoints are enabled, all hardware breakpoints
> exit to userspace.  QEMU then decides whether the breakpoint came from
> guest debugging (and then injects an exception), or from gdbstub
> debugging (and then suspends execution).  Same for software breakpoints.
>  If the total request is >4 hardware breakpoints, someone will have to
> lose and some gdbstub breakpoints will be missed.

Unfortunately on ARM you also have a few other constraints - the debug
register space is partitioned into magic super debug registers at the
top (with an implementation specific amount) and normal debug registers
in the lower end of the space.

The main pain I have with exposing host information is that it's going
to be interesting and challenging enough to get all of this right merely
for the guest debug register space. Exposing the host debug register
space as well means there is even more space for breakage.

I would just treat gdbstub debugging as the ugly step child that it is
and not introduce anything special for it (except for debug event
deflection to user space). Then only ever work on guest debug registers
and call it a day. Chances are just too high that we get the interfaces
wrong and shoot ourselves in the foot.

Alex
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Exposing host debug capabilities to userspace

On 24/11/2014 14:59, Alex Bennée wrote:
> Alexander Graf pointed out that KVM_CHECK_EXTENSION can return any
> positive number for success. How about using:
> 
> max_hw_bps = kvm_check_extension(kvm_state, KVM_CAP_GUEST_DEBUG_HW_BPS);
> max_hw_wps = kvm_check_extension(kvm_state, KVM_CAP_GUEST_DEBUG_HW_WPS);
> 
> Seems pretty sane, doesn't change the semantics of an API and is
> architecture agnostic if others need the number?

Yes, this was going to be my suggestion as well.  Just I would use a
bitmask in case some register can act as both breakpoint and watchpoint.

On x86, each of the four bp/wp registers (each register can act as both)
can be used for either guest or gdbstub debugging.  If the
KVM_GUESTDBG_USE_HW_BP feature is enabled, the guest is entered with
"made up" debug register contents, that we pass via
KVM_SET_GUEST_DEBUG's struct kvm_guest_debug_arch.  Otherwise, the guest
is entered with real debug register contents passed via
KVM_SET_DEBUGREGS.  Reads/writes of the debug registers trap to KVM
(which helps the guest see the expected values of the debug registers in
the former case).  There is no KVM_GET_GUEST_DEBUG because the
corresponding info is passed via struct kvm_debug_exit_arch.

If gdbstub hardware breakpoints are enabled, all hardware breakpoints
exit to userspace.  QEMU then decides whether the breakpoint came from
guest debugging (and then injects an exception), or from gdbstub
debugging (and then suspends execution).  Same for software breakpoints.
 If the total request is >4 hardware breakpoints, someone will have to
lose and some gdbstub breakpoints will be missed.

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 2/7] KVM: arm: guest debug, define API headers

2014-11-25 Thread Peter Maydell

On 25 November 2014 at 16:10, Alex Bennée  wrote:
> +/* Exit types which define why we did a debug exit */
> +#define KVM_DEBUG_EXIT_ERROR   0x0
> +#define KVM_DEBUG_EXIT_SINGLE_STEP 0x1
> +#define KVM_DEBUG_EXIT_SW_BKPT 0x2
> +#define KVM_DEBUG_EXIT_HW_BKPT 0x3
> +#define KVM_DEBUG_EXIT_HW_WTPT 0x4

The names of these imply that they're generic, but they're
defined in an arch-specific header file...

-- PMM
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 6/7] KVM: arm64: re-factor hyp.S debug register code

This is a pre-cursor to sharing the code with the guest debug support.
This replaces the big macro that fishes data out of a fixed location
with a more general helper macro to restore a set of debug registers. It
uses macro substitution so it can be re-used for debug control and value
registers. It does however rely on the debug registers being 64 bit
aligned (as they happen to be in the hyp ABI).

Signed-off-by: Alex Bennée 

diff --git a/arch/arm64/kvm/hyp.S b/arch/arm64/kvm/hyp.S
index c0bc218..b38ce3d 100644
--- a/arch/arm64/kvm/hyp.S
+++ b/arch/arm64/kvm/hyp.S
@@ -490,65 +490,12 @@
msr mdscr_el1,  x25
 .endm
 
-.macro restore_debug
-   // x2: base address for cpu context
-   // x3: tmp register
-
-   mrs x26, id_aa64dfr0_el1
-   ubfxx24, x26, #12, #4   // Extract BRPs
-   ubfxx25, x26, #20, #4   // Extract WRPs
-   mov w26, #15
-   sub w24, w26, w24   // How many BPs to skip
-   sub w25, w26, w25   // How many WPs to skip
-
-   add x3, x2, #CPU_SYSREG_OFFSET(DBGBCR0_EL1)
-
-   adr x26, 1f
-   add x26, x26, x24, lsl #2
-   br  x26
-1:
-   ldr x20, [x3, #(15 * 8)]
-   ldr x19, [x3, #(14 * 8)]
-   ldr x18, [x3, #(13 * 8)]
-   ldr x17, [x3, #(12 * 8)]
-   ldr x16, [x3, #(11 * 8)]
-   ldr x15, [x3, #(10 * 8)]
-   ldr x14, [x3, #(9 * 8)]
-   ldr x13, [x3, #(8 * 8)]
-   ldr x12, [x3, #(7 * 8)]
-   ldr x11, [x3, #(6 * 8)]
-   ldr x10, [x3, #(5 * 8)]
-   ldr x9, [x3, #(4 * 8)]
-   ldr x8, [x3, #(3 * 8)]
-   ldr x7, [x3, #(2 * 8)]
-   ldr x6, [x3, #(1 * 8)]
-   ldr x5, [x3, #(0 * 8)]
-
+.macro setup_debug_registers type
+// x3: pointer to register set
+// x4: number of registers to copy
+// x5..x20/x26 trashed
adr x26, 1f
-   add x26, x26, x24, lsl #2
-   br  x26
-1:
-   msr dbgbcr15_el1, x20
-   msr dbgbcr14_el1, x19
-   msr dbgbcr13_el1, x18
-   msr dbgbcr12_el1, x17
-   msr dbgbcr11_el1, x16
-   msr dbgbcr10_el1, x15
-   msr dbgbcr9_el1, x14
-   msr dbgbcr8_el1, x13
-   msr dbgbcr7_el1, x12
-   msr dbgbcr6_el1, x11
-   msr dbgbcr5_el1, x10
-   msr dbgbcr4_el1, x9
-   msr dbgbcr3_el1, x8
-   msr dbgbcr2_el1, x7
-   msr dbgbcr1_el1, x6
-   msr dbgbcr0_el1, x5
-
-   add x3, x2, #CPU_SYSREG_OFFSET(DBGBVR0_EL1)
-
-   adr x26, 1f
-   add x26, x26, x24, lsl #2
+   add x26, x26, x4, lsl #2
br  x26
 1:
ldr x20, [x3, #(15 * 8)]
@@ -569,116 +516,25 @@
ldr x5, [x3, #(0 * 8)]
 
adr x26, 1f
-   add x26, x26, x24, lsl #2
-   br  x26
-1:
-   msr dbgbvr15_el1, x20
-   msr dbgbvr14_el1, x19
-   msr dbgbvr13_el1, x18
-   msr dbgbvr12_el1, x17
-   msr dbgbvr11_el1, x16
-   msr dbgbvr10_el1, x15
-   msr dbgbvr9_el1, x14
-   msr dbgbvr8_el1, x13
-   msr dbgbvr7_el1, x12
-   msr dbgbvr6_el1, x11
-   msr dbgbvr5_el1, x10
-   msr dbgbvr4_el1, x9
-   msr dbgbvr3_el1, x8
-   msr dbgbvr2_el1, x7
-   msr dbgbvr1_el1, x6
-   msr dbgbvr0_el1, x5
-
-   add x3, x2, #CPU_SYSREG_OFFSET(DBGWCR0_EL1)
-
-   adr x26, 1f
-   add x26, x26, x25, lsl #2
+   add x26, x26, x4, lsl #2
br  x26
 1:
-   ldr x20, [x3, #(15 * 8)]
-   ldr x19, [x3, #(14 * 8)]
-   ldr x18, [x3, #(13 * 8)]
-   ldr x17, [x3, #(12 * 8)]
-   ldr x16, [x3, #(11 * 8)]
-   ldr x15, [x3, #(10 * 8)]
-   ldr x14, [x3, #(9 * 8)]
-   ldr x13, [x3, #(8 * 8)]
-   ldr x12, [x3, #(7 * 8)]
-   ldr x11, [x3, #(6 * 8)]
-   ldr x10, [x3, #(5 * 8)]
-   ldr x9, [x3, #(4 * 8)]
-   ldr x8, [x3, #(3 * 8)]
-   ldr x7, [x3, #(2 * 8)]
-   ldr x6, [x3, #(1 * 8)]
-   ldr x5, [x3, #(0 * 8)]
-
-   adr x26, 1f
-   add x26, x26, x25, lsl #2
-   br  x26
-1:
-   msr dbgwcr15_el1, x20
-   msr dbgwcr14_el1, x19
-   msr dbgwcr13_el1, x18
-   msr dbgwcr12_el1, x17
-   msr dbgwcr11_el1, x16
-   msr dbgwcr10_el1, x15
-   msr dbgwcr9_el1, x14
-   msr dbgwcr8_el1, x13
-   msr dbgwcr7_el1, x12
-   msr dbgwcr6_el1, x11
-   msr dbgwcr5_el1, x10
-   msr dbgwcr4_el1, x9
-   msr dbgwcr3_el1, x8
-   msr dbgwcr2_el1, x7
-   msr dbgwcr1_el1, x6
-   msr dbgwcr0_el1, x5
-
-   add x3, x2, #CPU_SYSREG_OFFSET(DBGWVR0_EL1)
-
-   adr x26, 1f
-   add x26, x26, x25, lsl #2
-   br  x26
-1:
-   ldr x20, [x3, #(15 * 8)]
-   ldr

[PATCH 4/7] KVM: arm64: guest debug, add SW break point support

This adds support for SW breakpoints inserted by userspace.

First we need to trap all BKPT exceptions in the hypervisor (ELS). This
in controlled through the MDCR_EL2 register. I've added a new field to
the vcpu structure to hold this value. There should be scope to
rationlise this with the VCPU_DEBUG_FLAGS/KVM_ARM64_DEBUG_DIRTY_SHIFT
manipulation in later patches.

Once the exception arrives we triggers an exit from the main hypervisor
loop to the hypervisor controller. The kvm_debug_exit_arch carries the
address of the exception. If the controller doesn't know of breakpoint
then we have a guest inserted breakpoint and the hypervisor needs to
start again and deliver the exception to guest.

Signed-off-by: Alex Bennée 

diff --git a/Documentation/virtual/kvm/api.txt 
b/Documentation/virtual/kvm/api.txt
index 2c6386e..9383359 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -2592,7 +2592,7 @@ when running. Common control bits are:
 The top 16 bits of the control field are architecture specific control
 flags which can include the following:
 
-  - KVM_GUESTDBG_USE_SW_BP: using software breakpoints [x86]
+  - KVM_GUESTDBG_USE_SW_BP: using software breakpoints [x86, arm64]
   - KVM_GUESTDBG_USE_HW_BP: using hardware breakpoints [x86, s390]
   - KVM_GUESTDBG_INJECT_DB: inject DB type exception [x86]
   - KVM_GUESTDBG_INJECT_BP: inject BP type exception [x86]
diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
index a0ff410..48d26bb 100644
--- a/arch/arm/kvm/arm.c
+++ b/arch/arm/kvm/arm.c
@@ -303,6 +303,9 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
 int kvm_arch_vcpu_ioctl_set_guest_debug(struct kvm_vcpu *vcpu,
struct kvm_guest_debug *dbg)
 {
+   /* Route debug traps to el2? */
+   bool route_el2 = false;
+
/* If it's not enabled clear all flags */
if (!(dbg->control & KVM_GUESTDBG_ENABLE)) {
vcpu->guest_debug = 0;
@@ -320,8 +323,8 @@ int kvm_arch_vcpu_ioctl_set_guest_debug(struct kvm_vcpu 
*vcpu,
 
/* Software Break Points */
if (vcpu->guest_debug & KVM_GUESTDBG_USE_SW_BP) {
-   kvm_info("SW BP support requested, not yet implemented\n");
-   return -EINVAL;
+   kvm_info("SW BP support requested\n");
+   route_el2 = true;
}
 
/* Hardware assisted Break and Watch points */
@@ -330,6 +333,20 @@ int kvm_arch_vcpu_ioctl_set_guest_debug(struct kvm_vcpu 
*vcpu,
return -EINVAL;
}
 
+   /* If we are going to handle any debug exceptions we need to
+* set MDCE_EL2.TDE to ensure they are routed to the
+* hypervisor. If userspace determines the exception was
+* actually internal to the guest it needs to handle
+* re-injecting the exception.
+*/
+   if (route_el2) {
+   kvm_info("routing debug exceptions");
+   vcpu->arch.mdcr_el2 = MDCR_EL2_TDE;
+   } else {
+   kvm_info("not routing debug exceptions");
+   vcpu->arch.mdcr_el2 = 0;
+   }
+
return 0;
 }
 
diff --git a/arch/arm64/include/asm/kvm_host.h 
b/arch/arm64/include/asm/kvm_host.h
index 2012c4b..38b0f07 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -98,6 +98,7 @@ struct kvm_vcpu_arch {
 
/* HYP configuration */
u64 hcr_el2;
+   u32 mdcr_el2;
 
/* Exception Information */
struct kvm_vcpu_fault_info fault;
diff --git a/arch/arm64/kernel/asm-offsets.c b/arch/arm64/kernel/asm-offsets.c
index 9a9fce0..8da1043 100644
--- a/arch/arm64/kernel/asm-offsets.c
+++ b/arch/arm64/kernel/asm-offsets.c
@@ -122,6 +122,7 @@ int main(void)
   DEFINE(VCPU_HPFAR_EL2,   offsetof(struct kvm_vcpu, 
arch.fault.hpfar_el2));
   DEFINE(VCPU_DEBUG_FLAGS, offsetof(struct kvm_vcpu, arch.debug_flags));
   DEFINE(VCPU_HCR_EL2, offsetof(struct kvm_vcpu, arch.hcr_el2));
+  DEFINE(VCPU_MDCR_EL2,offsetof(struct kvm_vcpu, arch.mdcr_el2));
   DEFINE(VCPU_IRQ_LINES,   offsetof(struct kvm_vcpu, arch.irq_lines));
   DEFINE(VCPU_HOST_CONTEXT,offsetof(struct kvm_vcpu, 
arch.host_cpu_context));
   DEFINE(VCPU_TIMER_CNTV_CTL,  offsetof(struct kvm_vcpu, 
arch.timer_cpu.cntv_ctl));
diff --git a/arch/arm64/kvm/handle_exit.c b/arch/arm64/kvm/handle_exit.c
index 34b8bd0..28dc92b 100644
--- a/arch/arm64/kvm/handle_exit.c
+++ b/arch/arm64/kvm/handle_exit.c
@@ -71,6 +71,26 @@ static int kvm_handle_wfx(struct kvm_vcpu *vcpu, struct 
kvm_run *run)
return 1;
 }
 
+/**
+ * kvm_handle_bkpt - handle a break-point instruction
+ *
+ * @vcpu:  the vcpu pointer
+ *
+ * By definition if we arrive here it's because we are routing
+ * all SW breakpoints to the hyper-visor as some may be a result of
+ * guest debugging. If user-space decides this particular break-point
+ * isn't for the host to handle it can re-feed the exception to the
+ * guest.
+ */
+st

[PATCH 3/7] KVM: arm: guest debug, add stub KVM_SET_GUEST_DEBUG ioctl

This commit adds a stub function to support the KVM_SET_GUEST_DEBUG
ioctl. Currently any operation flag will return EINVAL. Actual
functionality will be added with further patches.

Signed-off-by: Alex Bennée .

diff --git a/Documentation/virtual/kvm/api.txt 
b/Documentation/virtual/kvm/api.txt
index 7610eaa..2c6386e 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -2570,7 +2570,7 @@ handled.
 4.87 KVM_SET_GUEST_DEBUG
 
 Capability: KVM_CAP_SET_GUEST_DEBUG
-Architectures: x86, s390, ppc
+Architectures: x86, s390, ppc, arm64
 Type: vcpu ioctl
 Parameters: struct kvm_guest_debug (in)
 Returns: 0 on success; -1 on error
diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
index 9e193c8..a0ff410 100644
--- a/arch/arm/kvm/arm.c
+++ b/arch/arm/kvm/arm.c
@@ -180,6 +180,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
case KVM_CAP_ARM_PSCI:
case KVM_CAP_ARM_PSCI_0_2:
case KVM_CAP_READONLY_MEM:
+   case KVM_CAP_SET_GUEST_DEBUG:
r = 1;
break;
case KVM_CAP_COALESCED_MMIO:
@@ -302,7 +303,34 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
 int kvm_arch_vcpu_ioctl_set_guest_debug(struct kvm_vcpu *vcpu,
struct kvm_guest_debug *dbg)
 {
-   return -EINVAL;
+   /* If it's not enabled clear all flags */
+   if (!(dbg->control & KVM_GUESTDBG_ENABLE)) {
+   vcpu->guest_debug = 0;
+   return 0;
+   }
+
+   vcpu->guest_debug = dbg->control;
+   kvm_info("%s: guest_debug is 0x%lx\n", __func__, vcpu->guest_debug);
+
+   /* Single Step */
+   if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP) {
+   kvm_info("SS requested, not yet implemented\n");
+   return -EINVAL;
+   }
+
+   /* Software Break Points */
+   if (vcpu->guest_debug & KVM_GUESTDBG_USE_SW_BP) {
+   kvm_info("SW BP support requested, not yet implemented\n");
+   return -EINVAL;
+   }
+
+   /* Hardware assisted Break and Watch points */
+   if (vcpu->guest_debug & KVM_GUESTDBG_USE_HW_BP) {
+   kvm_info("HW BP support requested, not yet implemented\n");
+   return -EINVAL;
+   }
+
+   return 0;
 }
 
 
-- 
2.1.3

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 2/7] KVM: arm: guest debug, define API headers

This commit defines the API headers for guest debugging. There are two
architecture specific debug structures:

  - kvm_guest_debug_arch, allows us to pass in HW debug registers
  - kvm_debug_exit_arch, signals the exact debug exit and address

The type of debugging being used is control by the architecture specific
control bits of the kvm_guest_debug->control flags in the ioctl
structure.

Signed-off-by: Alex Bennée 

diff --git a/arch/arm64/include/uapi/asm/kvm.h 
b/arch/arm64/include/uapi/asm/kvm.h
index 8e38878..de2450c 100644
--- a/arch/arm64/include/uapi/asm/kvm.h
+++ b/arch/arm64/include/uapi/asm/kvm.h
@@ -93,10 +93,30 @@ struct kvm_sregs {
 struct kvm_fpu {
 };
 
+/* See ARM ARM D7.3: Debug Registers
+ *
+ * The control registers are stored as 64bit values as the setup code
+ * is shared with the normal cpu context restore code in hyp.S which
+ * is 64 bit only.
+ */
+#define KVM_ARM_NDBG_REGS 16
 struct kvm_guest_debug_arch {
+   __u64 dbg_bcr[KVM_ARM_NDBG_REGS];
+   __u64 dbg_bvr[KVM_ARM_NDBG_REGS];
+   __u64 dbg_wcr[KVM_ARM_NDBG_REGS];
+   __u64 dbg_wvr[KVM_ARM_NDBG_REGS];
 };
 
+/* Exit types which define why we did a debug exit */
+#define KVM_DEBUG_EXIT_ERROR   0x0
+#define KVM_DEBUG_EXIT_SINGLE_STEP 0x1
+#define KVM_DEBUG_EXIT_SW_BKPT 0x2
+#define KVM_DEBUG_EXIT_HW_BKPT 0x3
+#define KVM_DEBUG_EXIT_HW_WTPT 0x4
+
 struct kvm_debug_exit_arch {
+   __u64 address;
+   __u32 exit_type;
 };
 
 struct kvm_sync_regs {
@@ -198,4 +218,12 @@ struct kvm_arch_memory_slot {
 
 #endif
 
+/* Architecture related debug defines - upper 16 bits of
+ * kvm_guest_debug->control
+ */
+#define KVM_GUESTDBG_USE_SW_BP_SHIFT   16
+#define KVM_GUESTDBG_USE_SW_BP (1 << KVM_GUESTDBG_USE_SW_BP_SHIFT)
+#define KVM_GUESTDBG_USE_HW_BP_SHIFT   17
+#define KVM_GUESTDBG_USE_HW_BP (1 << KVM_GUESTDBG_USE_HW_BP_SHIFT)
+
 #endif /* __ARM_KVM_H__ */
-- 
2.1.3

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 5/7] KVM: arm64: guest debug, add support for single-step

This adds support for single-stepping the guest. As userspace can and
will manipulate guest registers before restarting any tweaking of the
registers has to occur just before control is passed back to the guest.
Furthermore while guest debugging is in effect we need to squash the
ability of the guest to single-step itself as we have no easy way of
re-entering the guest after the exception has been delivered to the
hypervisor.

Signed-off-by: Alex Bennée 

diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
index 48d26bb..a76daae 100644
--- a/arch/arm/kvm/arm.c
+++ b/arch/arm/kvm/arm.c
@@ -38,6 +38,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -300,6 +301,17 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
kvm_arm_set_running_vcpu(NULL);
 }
 
+/**
+ * kvm_arch_vcpu_ioctl_set_guest_debug - Setup guest debugging
+ * @kvm:   pointer to the KVM struct
+ * @kvm_guest_debug: the ioctl data buffer
+ *
+ * This sets up the VM for guest debugging. Care has to be taken when
+ * manipulating guest registers as these will be set/cleared by the
+ * hyper-visor controller, typically before each kvm_run event. As a
+ * result modification of the guest registers needs to take place
+ * after they have been restored in the hyp.S trampoline code.
+ */
 int kvm_arch_vcpu_ioctl_set_guest_debug(struct kvm_vcpu *vcpu,
struct kvm_guest_debug *dbg)
 {
@@ -317,8 +329,8 @@ int kvm_arch_vcpu_ioctl_set_guest_debug(struct kvm_vcpu 
*vcpu,
 
/* Single Step */
if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP) {
-   kvm_info("SS requested, not yet implemented\n");
-   return -EINVAL;
+   kvm_info("SS requested\n");
+   route_el2 = true;
}
 
/* Software Break Points */
diff --git a/arch/arm64/kernel/asm-offsets.c b/arch/arm64/kernel/asm-offsets.c
index 8da1043..78e5ae1 100644
--- a/arch/arm64/kernel/asm-offsets.c
+++ b/arch/arm64/kernel/asm-offsets.c
@@ -121,6 +121,7 @@ int main(void)
   DEFINE(VCPU_FAR_EL2, offsetof(struct kvm_vcpu, arch.fault.far_el2));
   DEFINE(VCPU_HPFAR_EL2,   offsetof(struct kvm_vcpu, 
arch.fault.hpfar_el2));
   DEFINE(VCPU_DEBUG_FLAGS, offsetof(struct kvm_vcpu, arch.debug_flags));
+  DEFINE(GUEST_DEBUG,  offsetof(struct kvm_vcpu, guest_debug));
   DEFINE(VCPU_HCR_EL2, offsetof(struct kvm_vcpu, arch.hcr_el2));
   DEFINE(VCPU_MDCR_EL2,offsetof(struct kvm_vcpu, arch.mdcr_el2));
   DEFINE(VCPU_IRQ_LINES,   offsetof(struct kvm_vcpu, arch.irq_lines));
diff --git a/arch/arm64/kvm/handle_exit.c b/arch/arm64/kvm/handle_exit.c
index 28dc92b..6def054 100644
--- a/arch/arm64/kvm/handle_exit.c
+++ b/arch/arm64/kvm/handle_exit.c
@@ -91,6 +91,25 @@ static int kvm_handle_bkpt(struct kvm_vcpu *vcpu, struct 
kvm_run *run)
return 0;
 }
 
+/**
+ * kvm_handle_ss - handle single step exceptions
+ *
+ * @vcpu:  the vcpu pointer
+ *
+ * See: ARM ARM D2.12 for the details. While the host is routing debug
+ * exceptions to it's handlers we have to suppress the ability of the
+ * guest to trigger exceptions.
+ */
+static int kvm_handle_ss(struct kvm_vcpu *vcpu, struct kvm_run *run)
+{
+   WARN_ON(!(vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP));
+
+   run->exit_reason = KVM_EXIT_DEBUG;
+   run->debug.arch.exit_type = KVM_DEBUG_EXIT_SINGLE_STEP;
+   run->debug.arch.address = *vcpu_pc(vcpu);
+   return 0;
+}
+
 static exit_handle_fn arm_exit_handlers[] = {
[ESR_EL2_EC_WFI]= kvm_handle_wfx,
[ESR_EL2_EC_CP15_32]= kvm_handle_cp15_32,
@@ -105,6 +124,7 @@ static exit_handle_fn arm_exit_handlers[] = {
[ESR_EL2_EC_SYS64]  = kvm_handle_sys_reg,
[ESR_EL2_EC_IABT]   = kvm_handle_guest_abort,
[ESR_EL2_EC_DABT]   = kvm_handle_guest_abort,
+   [ESR_EL2_EC_SOFTSTP]= kvm_handle_ss,
[ESR_EL2_EC_BKPT32] = kvm_handle_bkpt,
[ESR_EL2_EC_BRK64]  = kvm_handle_bkpt,
 };
diff --git a/arch/arm64/kvm/hyp.S b/arch/arm64/kvm/hyp.S
index 3c733ea..c0bc218 100644
--- a/arch/arm64/kvm/hyp.S
+++ b/arch/arm64/kvm/hyp.S
@@ -16,6 +16,7 @@
  */
 
 #include 
+#include 
 
 #include 
 #include 
@@ -168,6 +169,31 @@
// x19-x29, lr, sp*, elr*, spsr*
restore_common_regs
 
+   // After restoring the guest registers but before we return to the guest
+   // we may want to make some final tweaks to support guest debugging.
+   ldr x3, [x0, #GUEST_DEBUG]
+   tbz x3, #KVM_GUESTDBG_ENABLE_SHIFT, 2f  // No guest debug
+
+   // x0 - preserved as VCPU ptr
+   // x1 - spsr
+   // x2 - mdscr
+   mrs x1, spsr_el2
+   mrs x2, mdscr_el1
+
+   // See ARM ARM D2.12.3 The software step state machine
+   // If we are doing Single Step - set MDSCR_EL1.SS and PSTATE.SS
+   orr x1, x1, #DBG_SPSR_SS
+   orr x2, x2, #DBG_MDSCR_SS
+   tbnzx3, #KVM_GUESTDBG_SINGLESTEP_SHIFT

[PATCH 7/7] KVM: arm64: guest debug, HW assisted debug support

This adds support for userspace to control the HW debug registers for
guest debug. We'll only copy the $ARCH defined number across as that's
all that hyp.S will use anyway. I've moved some helper functions into
the hw_breakpoint.h header for re-use.

As with single step we need to tweak the guest registers to enable the
exceptions but we don't want to overwrite the guest copy of these
registers so this is done close to the guest entry.

Two new capabilities have been added to the KVM_EXTENSION ioctl to allow
userspace to query the number of hardware break and watch points
available on the host hardware.

Signed-off-by: Alex Bennée 

diff --git a/Documentation/virtual/kvm/api.txt 
b/Documentation/virtual/kvm/api.txt
index 9383359..5e8c673 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -2593,7 +2593,7 @@ The top 16 bits of the control field are architecture 
specific control
 flags which can include the following:
 
   - KVM_GUESTDBG_USE_SW_BP: using software breakpoints [x86, arm64]
-  - KVM_GUESTDBG_USE_HW_BP: using hardware breakpoints [x86, s390]
+  - KVM_GUESTDBG_USE_HW_BP: using hardware breakpoints [x86, s390, arm64]
   - KVM_GUESTDBG_INJECT_DB: inject DB type exception [x86]
   - KVM_GUESTDBG_INJECT_BP: inject BP type exception [x86]
   - KVM_GUESTDBG_EXIT_PENDING:  trigger an immediate guest exit [s390]
@@ -2606,7 +2606,10 @@ we need to ensure the guest vCPUs architecture specific 
registers are
 updated to the correct (supplied) values.
 
 The second part of the structure is architecture specific and
-typically contains a set of debug registers.
+typically contains a set of debug registers. For arm64 the number of
+debug registers is implementation defined and can be determined by
+querying the KVM_CAP_GUEST_DEBUG_HW_BPS and KVM_CAP_GUEST_DEBUG_HW_WPS
+capabilities.
 
 When debug events exit the main run loop with the reason
 KVM_EXIT_DEBUG with the kvm_debug_exit_arch part of the kvm_run
diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
index a76daae..c8ec23a 100644
--- a/arch/arm/kvm/arm.c
+++ b/arch/arm/kvm/arm.c
@@ -39,6 +39,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -341,8 +342,37 @@ int kvm_arch_vcpu_ioctl_set_guest_debug(struct kvm_vcpu 
*vcpu,
 
/* Hardware assisted Break and Watch points */
if (vcpu->guest_debug & KVM_GUESTDBG_USE_HW_BP) {
-   kvm_info("HW BP support requested, not yet implemented\n");
-   return -EINVAL;
+   int i;
+   int nb = get_num_brps();
+   int nw = get_num_wrps();
+
+   /* Copy across up to IMPDEF debug registers to our
+* shadow copy in the vcpu structure. The hyp.S code
+* will then set them up before we re-enter the guest.
+*/
+   memcpy(vcpu->arch.guest_debug_regs.dbg_bcr,
+   dbg->arch.dbg_bcr, sizeof(__u64)*nb);
+   memcpy(vcpu->arch.guest_debug_regs.dbg_bvr,
+   dbg->arch.dbg_bvr, sizeof(__u64)*nb);
+   memcpy(vcpu->arch.guest_debug_regs.dbg_wcr,
+   dbg->arch.dbg_wcr, sizeof(__u64)*nw);
+   memcpy(vcpu->arch.guest_debug_regs.dbg_wvr,
+   dbg->arch.dbg_wvr, sizeof(__u64)*nw);
+
+   kvm_info("HW BP support requested\n");
+   for (i = 0; i < nb; i++) {
+   kvm_info("%d: dbg_bcr=0x%llx dbg_bvr=0x%llx\n",
+i,
+   vcpu->arch.guest_debug_regs.dbg_bcr[i],
+   vcpu->arch.guest_debug_regs.dbg_bvr[i]);
+   }
+   for (i = 0; i < nw; i++) {
+   kvm_info("%d: dbg_wcr=0x%llx dbg_wvr=0x%llx\n",
+i,
+vcpu->arch.guest_debug_regs.dbg_wcr[i],
+vcpu->arch.guest_debug_regs.dbg_wvr[i]);
+   }
+   route_el2 = true;
}
 
/* If we are going to handle any debug exceptions we need to
diff --git a/arch/arm64/include/asm/hw_breakpoint.h 
b/arch/arm64/include/asm/hw_breakpoint.h
index 52b484b..c450552 100644
--- a/arch/arm64/include/asm/hw_breakpoint.h
+++ b/arch/arm64/include/asm/hw_breakpoint.h
@@ -130,6 +130,18 @@ static inline void ptrace_hw_copy_thread(struct 
task_struct *task)
 }
 #endif
 
+/* Determine number of BRP registers available. */
+static inline int get_num_brps(void)
+{
+   return ((read_cpuid(ID_AA64DFR0_EL1) >> 12) & 0xf) + 1;
+}
+
+/* Determine number of WRP registers available. */
+static inline int get_num_wrps(void)
+{
+   return ((read_cpuid(ID_AA64DFR0_EL1) >> 20) & 0xf) + 1;
+}
+
 extern struct pmu perf_ops_bp;
 
 #endif /* __KERNEL__ */
diff --git a/arch/arm64/include/asm/kvm_host.h 
b/arch/arm64/include/asm/kvm_host.h
index 38b0f07..e386bf4 100644
--- a/arch/arm64/include/asm/kvm_

[PATCH 1/7] KVM: add commentary for kvm_debug_exit_arch struct

Bring into line with the commentary for the other structures and their
KVM_EXIT_* cases.

Signed-off-by: Alex Bennée 

diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 6076882..523f476 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -226,6 +226,7 @@ struct kvm_run {
__u32 count;
__u64 data_offset; /* relative to kvm_run start */
} io;
+   /* KVM_EXIT_DEBUG */
struct {
struct kvm_debug_exit_arch arch;
} debug;
-- 
2.1.3

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 0/7] KVM Guest Debug support for arm64

The following patches implement support for debugging of KVM guests on
arm64 hardware. It supports SW and HW break points as well as
single-step functionality. I've created a branch of QEMU with the
userspace support to test this at:

https://github.com/stsquad/qemu/tree/kvm/guest-debug-kvm-submission

I've done some initial manual testing and everything seems fine but I
have yet to write my test suite. One area which has not yet been
tested is debug inside the guest while being debugged. The current
implementation effectively suspends the guests ability to use single
step and HW assisted debugging while KVM is debugging it. However the
door is open to userspace trying to accommodate this mode as the
guest's view of debugging registers is separate from actual values
when guest debugging is in operation.

The first 3 patches simply lay the groundwork for the later
functionality. Each patch then introduces each bit of functionality
except the re-factor patch which I split out from the HW assisted
debug patch to aid review.

In reference to the recent discussion about exposing the number of
debug registers to userspace I ended up using the KVM_EXTENSION ioctl
which helpfully allows to return of any positive number for success.
It seemed by far the simplest API to use from both the kernel and
userspace point of view.

The patches all pass checkpatch apart from the changes to
asm-offsets.c which complain about leading spaces and long lines. I
left them as is to fit in with the formatting of the rest of the file.

Alex Bennée (7):
  KVM: add commentary for kvm_debug_exit_arch struct
  KVM: arm: guest debug, define API headers
  KVM: arm: guest debug, add stub KVM_SET_GUEST_DEBUG ioctl
  KVM: arm64: guest debug, add SW break point support
  KVM: arm64: guest debug, add support for single-step
  KVM: arm64: re-factor hyp.S debug register code
  KVM: arm64: guest debug, HW assisted debug support

 Documentation/virtual/kvm/api.txt  |  11 +-
 arch/arm/kvm/arm.c |  89 +-
 arch/arm64/include/asm/hw_breakpoint.h |  12 ++
 arch/arm64/include/asm/kvm_host.h  |   4 +-
 arch/arm64/include/uapi/asm/kvm.h  |  28 
 arch/arm64/kernel/asm-offsets.c|   6 +
 arch/arm64/kernel/hw_breakpoint.c  |  12 --
 arch/arm64/kvm/handle_exit.c   |  80 +
 arch/arm64/kvm/hyp.S   | 292 ++---
 arch/arm64/kvm/reset.c |   6 +
 include/uapi/linux/kvm.h   |  21 ++-
 11 files changed, 371 insertions(+), 190 deletions(-)

-- 
2.1.3

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH v2 1/2] KVM: kvm-vfio: User API for VT-d Posted-Interrupts

2014-11-25 Thread Alex Williamson

On Tue, 2014-11-25 at 16:01 +0100, Eric Auger wrote:
> On 11/25/2014 01:23 PM, Feng Wu wrote:
> > This patch adds and documents a new attribute
> > KVM_DEV_VFIO_DEVICE_POSTING_IRQ in KVM_DEV_VFIO_DEVICE group.
> > This new attribute is used for VT-d Posted-Interrupts.
> > 
> > When guest OS changes the interrupt configuration for an
> > assigned device, such as, MSI/MSIx data/address fields,
> > QEMU will use this IRQ attribute to tell KVM to update the
> > related IRTE according the VT-d Posted-Interrrupts Specification,
> > such as, the guest vector should be updated in the related IRTE.
> > 
> > Signed-off-by: Feng Wu 
> > ---
> >  Documentation/virtual/kvm/devices/vfio.txt |9 +
> >  include/uapi/linux/kvm.h   |   10 ++
> >  2 files changed, 19 insertions(+), 0 deletions(-)
> > 
> > diff --git a/Documentation/virtual/kvm/devices/vfio.txt 
> > b/Documentation/virtual/kvm/devices/vfio.txt
> > index f7aff29..39dee86 100644
> > --- a/Documentation/virtual/kvm/devices/vfio.txt
> > +++ b/Documentation/virtual/kvm/devices/vfio.txt
> > @@ -42,3 +42,12 @@ activated before VFIO_DEVICE_SET_IRQS has been called to 
> > trigger the IRQ
> >  or associate an eventfd to it. Unforwarding can only be called while the
> >  signaling has been disabled with VFIO_DEVICE_SET_IRQS. If this condition is
> >  not satisfied, the command returns an -EBUSY.
> > +
> > +  KVM_DEV_VFIO_DEVICE_POSTING_IRQ: Use posted interrtups mechanism to post
> > +   the IRQ to guests.
> > +For this attribute, kvm_device_attr.addr points to a kvm_posted_intr 
> > struct.
> > +
> > +When guest OS changes the interrupt configuration for an assigned device,
> > +such as, MSI/MSIx data/address fields, QEMU will use this IRQ attribute
> > +to tell KVM to update the related IRTE according the VT-d 
> > Posted-Interrrupts
> > +Specification, such as, the guest vector should be updated in the related 
> > IRTE.
> > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> > index a269a42..e5f86ad 100644
> > --- a/include/uapi/linux/kvm.h
> > +++ b/include/uapi/linux/kvm.h
> > @@ -949,6 +949,7 @@ struct kvm_device_attr {
> >  #define  KVM_DEV_VFIO_DEVICE   2
> >  #define   KVM_DEV_VFIO_DEVICE_FORWARD_IRQ  1
> >  #define   KVM_DEV_VFIO_DEVICE_UNFORWARD_IRQ2
> > +#define   KVM_DEV_VFIO_DEVICE_POSTING_IRQ  3
> >  
> >  enum kvm_device_type {
> > KVM_DEV_TYPE_FSL_MPIC_20= 1,
> > @@ -973,6 +974,15 @@ struct kvm_arch_forwarded_irq {
> > __u32 gsi; /* gsi, ie. virtual IRQ number */
> >  };
> >  
> > +struct kvm_posted_intr {
> > +   __u32   argsz;
> > +   __u32   fd; /* file descriptor of the VFIO device */
> > +   __u32   index;  /* VFIO device IRQ index */
> > +   __u32   start;
> > +   __u32   count;
> > +   int virq[0];/* gsi, ie. virtual IRQ number */
> > +};
> Hi Feng,
> 
> This struct could be used by arm code too. If Alex agrees I could use
> that one instead. We just need to find a common sensible name

Yep, the interface might as well support batch setup.  The vfio code
uses -1 for teardown if we want to avoid FORWARD vs UNFORWARD we could
let the data in the structure define which operation to do.  Ideally the
code in virt/kvm/vfio.c would be almost entirely shared and just make
different arch_foo() callouts.  The PCI smarts in 2/2 here should
probably be moved out to that same arch_ code.  Thanks,

Alex

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH v2 2/2] KVM: kvm-vfio: implement the VFIO skeleton for VT-d Posted-Interrupts

2014-11-25 Thread Alex Williamson

On Tue, 2014-11-25 at 20:23 +0800, Feng Wu wrote:
> This patch adds the kvm-vfio interface for VT-d Posted-Interrrupts.
> When guests updates MSI/MSI-x information for an assigned-device,
> QEMU will use KVM_DEV_VFIO_DEVICE_POSTING_IRQ attribute to setup
> IRTE for VT-d PI. This patch implement this IRQ attribute.
> 
> Signed-off-by: Feng Wu 
> ---
>  virt/kvm/vfio.c |  115 
> +++
>  1 files changed, 115 insertions(+), 0 deletions(-)
> 
> diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
> index 6bc7001..435adf4 100644
> --- a/virt/kvm/vfio.c
> +++ b/virt/kvm/vfio.c
> @@ -446,6 +446,115 @@ out:
>   return ret;
>  }
>  
> +static int kvm_update_pi_irte(struct kvm *kvm, int host_irq, int guest_irq)
> +{
> + /*
> +  * TODO: need to add the real code to update the related IRTE,
> +  * Basically, This fucntion will do the following things:
> +  * - Get struct kvm_kernel_irq_routing_entry from guest irq
> +  * - Get the destination vCPU of the interrupts
> +  * - Update the IRTE according the VT-d PI Spec.
> +  *   1) guest vector
> +  *   2) Posted-Interrupts descritpor addresss
> +  */
> +
> + return 0;

Why not make this an arch_ call like Eric did?

> +}
> +
> +static int kvm_vfio_pci_get_irq_count(struct pci_dev *pdev, int irq_type)
> +{
> + if (irq_type == VFIO_PCI_INTX_IRQ_INDEX) {
> + u8 pin;
> + pci_read_config_byte(pdev, PCI_INTERRUPT_PIN, &pin);
> + if (pin)
> + return 1;
> + } else if (irq_type == VFIO_PCI_MSI_IRQ_INDEX) {
> + u8 pos;
> + u16 flags;
> +
> + pos = pdev->msi_cap;
> + if (pos) {
> + pci_read_config_word(pdev,
> +  pos + PCI_MSI_FLAGS, &flags);
> + return 1 << ((flags & PCI_MSI_FLAGS_QMASK) >> 1);
> + }

Looks like a copy of vfio code, but since that was written helpers have
been added to the kernel:

return pci_msi_vec_count(pdev);

> + } else if (irq_type == VFIO_PCI_MSIX_IRQ_INDEX) {
> + u8 pos;
> + u16 flags;
> +
> + pos = pdev->msix_cap;
> + if (pos) {
> + pci_read_config_word(pdev,
> +  pos + PCI_MSIX_FLAGS, &flags);
> +
> + return (flags & PCI_MSIX_FLAGS_QSIZE) + 1;
> + }

return pci_msix_vec_count(pdev);

> + } else if (irq_type == VFIO_PCI_ERR_IRQ_INDEX)
> + if (pci_is_pcie(pdev))
> + return 1;

This is a virtual interrupt, this interface should never be used for it.

> +
> + return 0;
> +}
> +
> +static int kvm_vfio_set_pi(struct kvm_device *kdev, int32_t __user *argp)
> +{
> + struct kvm_posted_intr pi_info;
> + int *virq;
> + unsigned long minsz;
> + struct vfio_device *vdev;
> + struct msi_desc *entry;
> + struct device *dev;
> + struct pci_dev *pdev;
> + int i, max, ret;
> +
> + minsz = offsetofend(struct kvm_posted_intr, count);
> +
> + if (copy_from_user(&pi_info, (void __user *)argp, minsz))
> + return -EFAULT;
> +
> + if (pi_info.argsz < minsz || pi_info.index >= VFIO_PCI_NUM_IRQS)
> + return -EINVAL;
> +
> + vdev = kvm_vfio_get_vfio_device(pi_info.fd);
> + if (IS_ERR(vdev))
> + return PTR_ERR(vdev);
> +
> + dev = kvm_vfio_external_base_device(vdev);
> + if (!dev)
> + return -EFAULT;

Missing put

> +
> + pdev = to_pci_dev(dev);

We can't assume the user called with a PCI device, test it.

if (!dev_is_pci(dev)) {
put
return errno
}

pdev = to_pci_dev(dev);
...

> +
> + max = kvm_vfio_pci_get_irq_count(pdev, pi_info.index);
> +
> + if (pi_info.argsz - minsz < pi_info.count * sizeof(int) ||
> + pi_info.start >= max || pi_info.start + pi_info.count > max)
> + return -EINVAL;

Missing put

> +
> + virq = memdup_user((void __user *)((unsigned long)argp + minsz),
> +pi_info.count * sizeof(int));
> + if (IS_ERR(virq))
> + return PTR_ERR(virq);

put...

> +
> + for (i=0; i + list_for_each_entry(entry, &pdev->msi_list, list) {

This is unique to MSI/X but INTx and others can still make it here.
Also, this won't compile unless CONFIG_PCI_MSI.  Does anything protect
this list?

> + if (entry->msi_attrib.entry_nr != pi_info.start+i)
> + continue;
> +
> + ret = kvm_update_pi_irte(kdev->kvm,
> +  entry->irq, virq[i]);
> + if (ret) {
> + kfree(virq);
> + return -EFAULT;

put...

> + }
> + }
> + }
> +
> + kfree(virq);

put...

> +
> + return 0;
> +}
> +
>  static int kvm_vfi

[PATCH RFC 0/2] assign each vcpu an owning thread and improve yielding

2014-11-25 Thread David Hildenbrand

This series improves yielding on architectures that cannot disable preemption
while entering the guest and makes the creating thread of a VCPU the owning
thread and therefore the yield target when yielding to that VCPU.

We should focus on the case creating thread == executing thread and therefore
remove the complicated handling of PIDs involving synchronize_rcus.

This way we can speed up the creation of VCPUs and directly yield to the
executing vcpu threads.

Please note that - in theory - all VCPU ioctls should be triggered from the same
VCPU thread, so changing threads is not a scenario we should optimize.


David Hildenbrand (2):
  KVM: don't check for PF_VCPU when yielding
  KVM: thread creating a vcpu is the owner of that vcpu

 include/linux/kvm_host.h |  1 +
 virt/kvm/kvm_main.c  | 22 ++
 2 files changed, 3 insertions(+), 20 deletions(-)

-- 
1.8.5.5

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH RFC 2/2] KVM: thread creating a vcpu is the owner of that vcpu

2014-11-25 Thread David Hildenbrand

Currently, we allow changing the PID of a VCPU. This PID is used to
identify the thread to yield to if we want to yield to this specific
VCPU.

In practice (e.g. QEMU), the thread creating and executing the VCPU remains
always the same. Temporarily exchanging the PID (e.g. because an ioctl is
triggered from a wrong thread) doesn't really make sense.

The PID is exchanged and a synchronize_rcu() is called. When the executing
thread tries to run the VCPU again, another synchronize_rcu() happens.

If a yield to that VCPU is triggered while the PID of the wrong thread is 
active,
the wrong thread might receive a yield, but this will most likely not
help the executing thread at all. The executing thread won't have a higher
priority after the wrong thread has finished with the ioctl. The wrong thread
will even receive yields afterwards that were targeted to the executing vcpu,
until the executing VCPU has replaced the PID on the next ioctl - doesn't feel
correct to me.

This patch makes the creating thread the owning thread, and therefore the only
valid yield candidate (especially because VCPU ioctls are - in theory - only
valid when triggered from the owning thread - old user space versions may not
stick to this rule). This should also speed up the initial start of all VCPUs,
when the PID is assigned for the first time.

Should be backwards compatible - if there is any old user space version out
there that doesn't stick to the creating == executing thread rule, yields will
not work as intended.

Signed-off-by: David Hildenbrand 
---
 include/linux/kvm_host.h |  1 +
 virt/kvm/kvm_main.c  | 18 ++
 2 files changed, 3 insertions(+), 16 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index aa56894..f1fe655 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -245,6 +245,7 @@ struct kvm_vcpu {
int fpu_active;
int guest_fpu_loaded, guest_xcr0_loaded;
wait_queue_head_t wq;
+   /* the pid owning this vcpu - target for vcpu yields */
struct pid *pid;
int sigset_active;
sigset_t sigset;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 184f52e..4ba7810 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -124,15 +124,6 @@ int vcpu_load(struct kvm_vcpu *vcpu)
 
if (mutex_lock_killable(&vcpu->mutex))
return -EINTR;
-   if (unlikely(vcpu->pid != current->pids[PIDTYPE_PID].pid)) {
-   /* The thread running this VCPU changed. */
-   struct pid *oldpid = vcpu->pid;
-   struct pid *newpid = get_task_pid(current, PIDTYPE_PID);
-   rcu_assign_pointer(vcpu->pid, newpid);
-   if (oldpid)
-   synchronize_rcu();
-   put_pid(oldpid);
-   }
cpu = get_cpu();
preempt_notifier_register(&vcpu->preempt_notifier);
kvm_arch_vcpu_load(vcpu, cpu);
@@ -220,7 +211,7 @@ int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, 
unsigned id)
vcpu->cpu = -1;
vcpu->kvm = kvm;
vcpu->vcpu_id = id;
-   vcpu->pid = NULL;
+   vcpu->pid = get_task_pid(current, PIDTYPE_PID);
init_waitqueue_head(&vcpu->wq);
kvm_async_pf_vcpu_init(vcpu);
 
@@ -1771,15 +1762,10 @@ EXPORT_SYMBOL_GPL(kvm_vcpu_kick);
 
 int kvm_vcpu_yield_to(struct kvm_vcpu *target)
 {
-   struct pid *pid;
struct task_struct *task = NULL;
int ret = 0;
 
-   rcu_read_lock();
-   pid = rcu_dereference(target->pid);
-   if (pid)
-   task = get_pid_task(pid, PIDTYPE_PID);
-   rcu_read_unlock();
+   task = get_pid_task(target->pid, PIDTYPE_PID);
if (!task)
return ret;
ret = yield_to(task, 1);
-- 
1.8.5.5

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH RFC 1/2] KVM: don't check for PF_VCPU when yielding

2014-11-25 Thread David Hildenbrand

As some architectures (e.g. s390) can't disable preemption while
entering/leaving the guest, they won't receive the yield in all situations.

kvm_enter_guest() has to be called with preemption_disabled and will set
PF_VCPU. After that point e.g. s390 reenables preemption and starts to execute 
the
guest. The thread might therefore be scheduled out between kvm_enter_guest() and
kvm_exit_guest(), resulting in PF_VCPU being set but not being run.

Please note that preemption has to stay enabled in order to correctly process
page faults on s390.

Current code takes PF_VCPU as a hint that the VCPU thread is running and
therefore needs no yield. yield_to() checks whether the target thread is 
running,
so let's use the inbuilt functionality to make it independent of PF_VCPU and
preemption.

Signed-off-by: David Hildenbrand 
---
 virt/kvm/kvm_main.c | 4 
 1 file changed, 4 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 5b45330..184f52e 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1782,10 +1782,6 @@ int kvm_vcpu_yield_to(struct kvm_vcpu *target)
rcu_read_unlock();
if (!task)
return ret;
-   if (task->flags & PF_VCPU) {
-   put_task_struct(task);
-   return ret;
-   }
ret = yield_to(task, 1);
put_task_struct(task);
 
-- 
1.8.5.5

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH v2 1/2] KVM: kvm-vfio: User API for VT-d Posted-Interrupts

On 11/25/2014 01:23 PM, Feng Wu wrote:
> This patch adds and documents a new attribute
> KVM_DEV_VFIO_DEVICE_POSTING_IRQ in KVM_DEV_VFIO_DEVICE group.
> This new attribute is used for VT-d Posted-Interrupts.
> 
> When guest OS changes the interrupt configuration for an
> assigned device, such as, MSI/MSIx data/address fields,
> QEMU will use this IRQ attribute to tell KVM to update the
> related IRTE according the VT-d Posted-Interrrupts Specification,
> such as, the guest vector should be updated in the related IRTE.
> 
> Signed-off-by: Feng Wu 
> ---
>  Documentation/virtual/kvm/devices/vfio.txt |9 +
>  include/uapi/linux/kvm.h   |   10 ++
>  2 files changed, 19 insertions(+), 0 deletions(-)
> 
> diff --git a/Documentation/virtual/kvm/devices/vfio.txt 
> b/Documentation/virtual/kvm/devices/vfio.txt
> index f7aff29..39dee86 100644
> --- a/Documentation/virtual/kvm/devices/vfio.txt
> +++ b/Documentation/virtual/kvm/devices/vfio.txt
> @@ -42,3 +42,12 @@ activated before VFIO_DEVICE_SET_IRQS has been called to 
> trigger the IRQ
>  or associate an eventfd to it. Unforwarding can only be called while the
>  signaling has been disabled with VFIO_DEVICE_SET_IRQS. If this condition is
>  not satisfied, the command returns an -EBUSY.
> +
> +  KVM_DEV_VFIO_DEVICE_POSTING_IRQ: Use posted interrtups mechanism to post
> +   the IRQ to guests.
> +For this attribute, kvm_device_attr.addr points to a kvm_posted_intr struct.
> +
> +When guest OS changes the interrupt configuration for an assigned device,
> +such as, MSI/MSIx data/address fields, QEMU will use this IRQ attribute
> +to tell KVM to update the related IRTE according the VT-d Posted-Interrrupts
> +Specification, such as, the guest vector should be updated in the related 
> IRTE.
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index a269a42..e5f86ad 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -949,6 +949,7 @@ struct kvm_device_attr {
>  #define  KVM_DEV_VFIO_DEVICE 2
>  #define   KVM_DEV_VFIO_DEVICE_FORWARD_IRQ1
>  #define   KVM_DEV_VFIO_DEVICE_UNFORWARD_IRQ  2
> +#define   KVM_DEV_VFIO_DEVICE_POSTING_IRQ3
>  
>  enum kvm_device_type {
>   KVM_DEV_TYPE_FSL_MPIC_20= 1,
> @@ -973,6 +974,15 @@ struct kvm_arch_forwarded_irq {
>   __u32 gsi; /* gsi, ie. virtual IRQ number */
>  };
>  
> +struct kvm_posted_intr {
> + __u32   argsz;
> + __u32   fd; /* file descriptor of the VFIO device */
> + __u32   index;  /* VFIO device IRQ index */
> + __u32   start;
> + __u32   count;
> + int virq[0];/* gsi, ie. virtual IRQ number */
> +};
Hi Feng,

This struct could be used by arm code too. If Alex agrees I could use
that one instead. We just need to find a common sensible name

Best Regards
Eric
> +
>  /*
>   * ioctls for VM fds
>   */
> 

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [CFT PATCH v2 0/2] KVM: support XSAVES usage in the host

2014-11-25 Thread Nadav Amit


> On Nov 25, 2014, at 16:17, Paolo Bonzini  wrote:
> 
> 
> 
> On 25/11/2014 15:05, Nadav Amit wrote:
>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>>> index 373b0ab9a32e..ca26681455c2 100644
>>> --- a/arch/x86/kvm/x86.c
>>> +++ b/arch/x86/kvm/x86.c
>>> @@ -6955,6 +6955,9 @@ int fx_init(struct kvm_vcpu *vcpu)
>>> return err;
>>> 
>>> fpu_finit(&vcpu->arch.guest_fpu);
>>> +   if (cpu_has_xsaves)
>>> +   vcpu->arch.guest_fpu.state->xsave.xsave_hdr.xcomp_bv =
>>> +   host_xcr0 | XSTATE_COMPACTION_ENABLED;
>>> 
>>> /*
>>>  * Ensure guest xcr0 is valid for loading
>> 
>> The second version works for me (w/qemu v2.1.0; Linux 3.13 guest). I
>> did not need to apply this patch on top. [although I am not sure whether
>> relying on userspace to call KVM_SET_XSAVE early enough is a good practice].
> 
> Did you actually try the patch? :)  If it works, I'm tempted to apply it
> anyway.
Yes, I tried it both with and without this patch.
Due to time constraints I only tested minimal functionality (Linux boot).
I will run more tests in the near future. Anyhow, you can put the:

Tested-by: Nadav Amit 

> 
>> One disclaimer: Since I got limited time with the machine, I executed
>> a slightly modified kernel/qemu, and not the latest version.
>> Anyhow, I don’t think these differences can have any impact.
> 
> Yes, that is no problem.

I am just worried that Wanpeng reported it fails, while I report it works...

Nadav

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

KVM call minutes for 2015-11-25

2014-11-25 Thread Juan Quintela


Hi

In today call ony one announcement in the agenda:

- Greesocs announces that they are starting ultithreading TCG.
- They would start with usermode (it looks easier).  Althought they are
  more interested in system mode.
- would start with a minimal wiki page, and ask for colaboration about
  ideas/plans for everybody else.
- Greensocs (as company) is interested in Embedded
- interested on multicore/multicluster emulation



- KVM disk days are on December
  - They are releasing a new disk everyday
  - They haven't filled everything
  - If you want to collabarate contact Stefan Hajnoczi 


See you in two weeks, Juan.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [CFT PATCH v2 0/2] KVM: support XSAVES usage in the host



On 25/11/2014 15:05, Nadav Amit wrote:
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 373b0ab9a32e..ca26681455c2 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -6955,6 +6955,9 @@ int fx_init(struct kvm_vcpu *vcpu)
> > return err;
> > 
> > fpu_finit(&vcpu->arch.guest_fpu);
> > +   if (cpu_has_xsaves)
> > +   vcpu->arch.guest_fpu.state->xsave.xsave_hdr.xcomp_bv =
> > +   host_xcr0 | XSTATE_COMPACTION_ENABLED;
> > 
> > /*
> >  * Ensure guest xcr0 is valid for loading
> 
> The second version works for me (w/qemu v2.1.0; Linux 3.13 guest). I
> did not need to apply this patch on top. [although I am not sure whether
> relying on userspace to call KVM_SET_XSAVE early enough is a good practice].

Did you actually try the patch? :)  If it works, I'm tempted to apply it
anyway.

> One disclaimer: Since I got limited time with the machine, I executed
> a slightly modified kernel/qemu, and not the latest version.
> Anyhow, I don’t think these differences can have any impact.

Yes, that is no problem.

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [CFT PATCH v2 0/2] KVM: support XSAVES usage in the host

2014-11-25 Thread Nadav Amit


> On Nov 25, 2014, at 12:36, Paolo Bonzini  wrote:
> 
> 
> 
> On 25/11/2014 11:13, Wanpeng Li wrote:
>> Hi Paolo,
>> On Mon, Nov 24, 2014 at 05:43:32PM +0100, Paolo Bonzini wrote:
>>> The first patch ensures that XSAVES is not exposed in the guest until
>>> we emulate MSR_IA32_XSS.  The second exports XSAVE data in the correct
>>> format.
>>> 
>>> I tested these on a non-XSAVES system so they should not be completely
>>> broken, but I need some help.  I am not even sure which XSAVE states
>>> are _not_ enabled, and thus compacted, in Linux.
>>> 
>>> Note that these patches do not add support for XSAVES in the guest yet,
>>> since MSR_IA32_XSS is not emulated.
>>> 
>>> If they fix the bug Nadav reported, I'll add Reported-by and commit.
>> 
>> I test this patchset w/ your "KVM: x86: export get_xsave_addr" patch on 
>> Skylake and guest hang during boot. The guest screen show "Probing EDD 
>> (edd=off to disable)... ok", and no more dump.
> 
> Anything in dmesg?  Can you try this patch on top?
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 373b0ab9a32e..ca26681455c2 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -6955,6 +6955,9 @@ int fx_init(struct kvm_vcpu *vcpu)
>   return err;
> 
>   fpu_finit(&vcpu->arch.guest_fpu);
> + if (cpu_has_xsaves)
> + vcpu->arch.guest_fpu.state->xsave.xsave_hdr.xcomp_bv =
> + host_xcr0 | XSTATE_COMPACTION_ENABLED;
> 
>   /*
>* Ensure guest xcr0 is valid for loading
> 

The second version works for me (w/qemu v2.1.0; Linux 3.13 guest). I did not 
need to apply this patch on top. [although I am not sure whether relying on 
userspace to call KVM_SET_XSAVE early enough is a good practice].

One disclaimer: Since I got limited time with the machine, I executed a 
slightly modified kernel/qemu, and not the latest version.
Anyhow, I don’t think these differences can have any impact.

Regards,
Nadav

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [question] lots of interrupts injected to vm when pressing somekey w/o releasing



On 25/11/2014 12:20, Zhang Haoyu wrote:
>>> I tested win-server-2008 with "-cpu 
>>> core2duo,hv_spinlocks=0x,hv_relaxed,hv_time",
>>> this problem still happened, about 200,000 vmexits per-second, 
>>> bringing very bad experience, just like being stuck.
>>
>> Please upload a full trace somewhere, or at least the "perf report" output.
>>
> # perf kvm stat report --event=vmexit
> Warning:
> Processed 4912765 events and lost 81 chunks!
> Check IO/CPU overload!
> 
> Analyze events for all VCPUs:
> 
>  VM-EXITSamples  Samples% Time% Avg time
> 
>PAUSE_INSTRUCTION124862060.74%23.33%  0.83us ( +-   0.02% )
>   IO_INSTRUCTION 79961938.89%74.49%  4.13us ( +-   6.93% )
>   EXTERNAL_INTERRUPT   2629 0.13% 0.47%  7.93us ( +-   1.27% )
>   VMCALL   1785 0.09% 0.04%  0.98us ( +-   0.30% )
>  APIC_ACCESS   1495 0.07% 0.11%  3.38us ( +-   1.65% )
>EXCEPTION_NMI   1188 0.06% 0.03%  1.26us ( +-   1.66% )
>  TPR_BELOW_THRESHOLD273 0.01% 0.01%  1.57us ( +-   1.32% )
>PENDING_INTERRUPT226 0.01% 0.01%  1.38us ( +-   2.59% )
>  HLT  6 0.00% 1.51%  11153.81us ( +-  15.24% )
>EPT_MISCONFIG  2 0.00% 0.00% 15.46us ( +-  23.93% )
> 
> Total Samples:2055843, Total events handled time:4438794.59us.
> 
> # perf kvm stat report --event=ioport
> Warning:
> Processed 4912765 events and lost 81 chunks!
> Check IO/CPU overload!
> 
> Analyze events for all VCPUs:
> 
>   IO Port AccessSamples  Samples% Time% Avg time
> 
> 0x64:PIN 79903299.93%99.84%  3.42us ( +-   8.35% )
>0x3fd:PIN300 0.04% 0.08%  6.88us ( +-   3.45% )
> 0x60:PIN274 0.03% 0.09%  8.53us ( +-   3.07% )
>  0xb000:POUT  2 0.00% 0.00%  7.96us ( +-   3.16% )
>   0xb000:PIN  2 0.00% 0.00%  8.16us ( +-   2.31% )
>   0xb008:PIN  2 0.00% 0.00%  1.53us ( +-  13.66% )
>   0xafe0:PIN  2 0.00% 0.00%  3.62us ( +-   0.84% )
>   0xafe1:PIN  2 0.00% 0.00%  2.76us ( +-   2.99% )
> 
> Total Samples:799616, Total events handled time:2739038.53us.
> 
> # trace-cmd report
> version = 6
> CPU 4 is empty
> CPU 5 is empty
> CPU 6 is empty
> CPU 7 is empty
> CPU 9 is empty
> CPU 11 is empty
> CPU 12 is empty
> CPU 13 is empty
> cpus=16
>  kvm-22783 [003]   721.255618: kvm_entry:vcpu 0
>  kvm-22784 [014]   721.255619: kvm_entry:vcpu 1
>  kvm-22783 [003]   721.255622: kvm_exit: reason 
> PAUSE_INSTRUCTION rip 0xf80001688a9e info 0 0
>  kvm-22784 [014]   721.255623: kvm_exit: reason 
> IO_INSTRUCTION rip 0xf88000cdd23b info 640008 0
>  kvm-22784 [014]   721.255623: kvm_pio:  pio_read at 
> 0x64 size 1 count 1
>  kvm-22784 [014]   721.255623: kvm_userspace_exit:   reason 
> KVM_EXIT_IO (2)
>  kvm-22783 [003]   721.255623: kvm_entry:vcpu 0
>  kvm-22784 [014]   721.255625: kvm_entry:vcpu 1
>  kvm-22783 [003]   721.255626: kvm_exit: reason 
> PAUSE_INSTRUCTION rip 0xf80001688a9e info 0 0
>  kvm-22783 [003]   721.255626: kvm_entry:vcpu 0
>  kvm-22784 [014]   721.255627: kvm_exit: reason 
> IO_INSTRUCTION rip 0xf88000cdd23b info 640008 0
>  kvm-22784 [014]   721.255628: kvm_pio:  pio_read at 
> 0x64 size 1 count 1
>  kvm-22784 [014]   721.255628: kvm_userspace_exit:   reason 
> KVM_EXIT_IO (2)
>  kvm-22783 [003]   721.255628: kvm_exit: reason 
> PAUSE_INSTRUCTION rip 0xf80001688a9e info 0 0
>  kvm-22783 [003]   721.255629: kvm_entry:vcpu 0
>  kvm-22784 [014]   721.255630: kvm_entry:vcpu 1
>  kvm-22783 [003]   721.255631: kvm_exit: reason 
> PAUSE_INSTRUCTION rip 0xf80001688a9e info 0 0
>  kvm-22784 [014]   721.255632: kvm_exit: reason 
> IO_INSTRUCTION rip 0xf88000cdd23b info 640008 0
>  kvm-22783 [003]   721.255632: kvm_entry:vcpu 0
>  kvm-22784 [014]   721.255632: kvm_pio:  pio_read at 
> 0x64 size 1 count 1
>  kvm-22784 [014]   721.255632: kvm_userspace_exit:   reason 
> KVM_EXIT_IO (2)
>  kvm-22783 [003]   721.255634: kvm_exit: reason 
> PAUSE_INSTRUCTION rip 0xf80001688a9e info 0 0
>  kvm-22784 [014]   721.255634: kvm_entry:vcpu 1
>  kvm-22783 [003]   721.255635: kvm_entry:vcpu 0
>  kvm-22784 [014]   721.255636: kvm_exi

Re: [RFC PATCH 0/6] ARM64: KVM: PMU infrastructure support

On Tue, Nov 25, 2014 at 06:17:03PM +0530, Anup Patel wrote:
> Hi Christoffer,
> 
> On Mon, Nov 24, 2014 at 8:07 PM, Christoffer Dall
>  wrote:
> > On Mon, Nov 24, 2014 at 02:14:48PM +0530, Anup Patel wrote:
> >> On Fri, Nov 21, 2014 at 5:19 PM, Christoffer Dall
> >>  wrote:
> >> > On Fri, Nov 21, 2014 at 04:06:05PM +0530, Anup Patel wrote:
> >> >> Hi Christoffer,
> >> >>
> >> >> On Fri, Nov 21, 2014 at 3:29 PM, Christoffer Dall
> >> >>  wrote:
> >> >> > On Thu, Nov 20, 2014 at 08:17:32PM +0530, Anup Patel wrote:
> >> >> >> On Wed, Nov 19, 2014 at 8:59 PM, Christoffer Dall
> >> >> >>  wrote:
> >> >> >> > On Tue, Nov 11, 2014 at 02:48:25PM +0530, Anup Patel wrote:
> >> >> >> >> Hi All,
> >> >> >> >>
> >> >> >> >> I have second thoughts about rebasing KVM PMU patches
> >> >> >> >> to Marc's irq-forwarding patches.
> >> >> >> >>
> >> >> >> >> The PMU IRQs (when virtualized by KVM) are not exactly
> >> >> >> >> forwarded IRQs because they are shared between Host
> >> >> >> >> and Guest.
> >> >> >> >>
> >> >> >> >> Scenario1
> >> >> >> >> -
> >> >> >> >>
> >> >> >> >> We might have perf running on Host and no KVM guest
> >> >> >> >> running. In this scenario, we wont get interrupts on Host
> >> >> >> >> because the kvm_pmu_hyp_init() (similar to the function
> >> >> >> >> kvm_timer_hyp_init() of Marc's IRQ-forwarding
> >> >> >> >> implementation) has put all host PMU IRQs in forwarding
> >> >> >> >> mode.
> >> >> >> >>
> >> >> >> >> The only way solve this problem is to not set forwarding
> >> >> >> >> mode for PMU IRQs in kvm_pmu_hyp_init() and instead
> >> >> >> >> have special routines to turn on and turn off the forwarding
> >> >> >> >> mode of PMU IRQs. These routines will be called from
> >> >> >> >> kvm_arch_vcpu_ioctl_run() for toggling the PMU IRQ
> >> >> >> >> forwarding state.
> >> >> >> >>
> >> >> >> >> Scenario2
> >> >> >> >> -
> >> >> >> >>
> >> >> >> >> We might have perf running on Host and Guest simultaneously
> >> >> >> >> which means it is quite likely that PMU HW trigger IRQ meant
> >> >> >> >> for Host between "ret = kvm_call_hyp(__kvm_vcpu_run, vcpu);"
> >> >> >> >> and "kvm_pmu_sync_hwstate(vcpu);" (similar to timer sync routine
> >> >> >> >> of Marc's patchset which is called before local_irq_enable()).
> >> >> >> >>
> >> >> >> >> In this scenario, the updated kvm_pmu_sync_hwstate(vcpu)
> >> >> >> >> will accidentally forward IRQ meant for Host to Guest unless
> >> >> >> >> we put additional checks to inspect VCPU PMU state.
> >> >> >> >>
> >> >> >> >> Am I missing any detail about IRQ forwarding for above
> >> >> >> >> scenarios?
> >> >> >> >>
> >> >> >> > Hi Anup,
> >> >> >>
> >> >> >> Hi Christoffer,
> >> >> >>
> >> >> >> >
> >> >> >> > I briefly discussed this with Marc.  What I don't understand is 
> >> >> >> > how it
> >> >> >> > would be possible to get an interrupt for the host while running 
> >> >> >> > the
> >> >> >> > guest?
> >> >> >> >
> >> >> >> > The rationale behind my question is that whenever you're running 
> >> >> >> > the
> >> >> >> > guest, the PMU should be programmed exclusively with guest state, 
> >> >> >> > and
> >> >> >> > since the PMU is per core, any interrupts should be for the guest, 
> >> >> >> > where
> >> >> >> > it would always be pending.
> >> >> >>
> >> >> >> Yes, thats right PMU is programmed exclusively for guest when
> >> >> >> guest is running and for host when host is running.
> >> >> >>
> >> >> >> Let us assume a situation (Scenario2 mentioned previously)
> >> >> >> where both host and guest are using PMU. When the guest is
> >> >> >> running we come back to host mode due to variety of reasons
> >> >> >> (stage2 fault, guest IO, regular host interrupt, host interrupt
> >> >> >> meant for guest, ) which means we will return from the
> >> >> >> "ret = kvm_call_hyp(__kvm_vcpu_run, vcpu);" statement in the
> >> >> >> kvm_arch_vcpu_ioctl_run() function with local IRQs disabled.
> >> >> >> At this point we would have restored back host PMU context and
> >> >> >> any PMU counter used by host can trigger PMU overflow interrup
> >> >> >> for host. Now we will be having "kvm_pmu_sync_hwstate(vcpu);"
> >> >> >> in the kvm_arch_vcpu_ioctl_run() function (similar to the
> >> >> >> kvm_timer_sync_hwstate() of Marc's IRQ forwarding patchset)
> >> >> >> which will try to detect PMU irq forwarding state in GIC hence it
> >> >> >> can accidentally discover PMU irq pending for guest while this
> >> >> >> PMU irq is actually meant for host.
> >> >> >>
> >> >> >> This above mentioned situation does not happen for timer
> >> >> >> because virtual timer interrupts are exclusively used for guest.
> >> >> >> The exclusive use of virtual timer interrupt for guest ensures that
> >> >> >> the function kvm_timer_sync_hwstate() will always see correct
> >> >> >> state of virtual timer IRQ from GIC.
> >> >> >>
> >> >> > I'm not quite following.
> >> >> >
> >> >> > When you call kvm_pmu_sync_hwstate(vcpu) in the non-preemtible 
> >> >> > section,

Re: [PATCH v3 7/8] KVM: kvm-vfio: generic forwarding control

On 11/25/2014 05:33 AM, Wu, Feng wrote:
> 
> 
>> -Original Message-
>> From: Eric Auger [mailto:eric.au...@linaro.org]
>> Sent: Monday, November 24, 2014 2:36 AM
>> To: eric.au...@st.com; eric.au...@linaro.org; christoffer.d...@linaro.org;
>> marc.zyng...@arm.com; linux-arm-ker...@lists.infradead.org;
>> kvm...@lists.cs.columbia.edu; kvm@vger.kernel.org;
>> alex.william...@redhat.com; joel.sch...@amd.com;
>> kim.phill...@freescale.com; pau...@samba.org; g...@kernel.org;
>> pbonz...@redhat.com; ag...@suse.de
>> Cc: linux-ker...@vger.kernel.org; patc...@linaro.org; will.dea...@arm.com;
>> a.mota...@virtualopensystems.com; a.r...@virtualopensystems.com;
>> john.li...@huawei.com; ming@canonical.com; Wu, Feng
>> Subject: [PATCH v3 7/8] KVM: kvm-vfio: generic forwarding control
>>
>> This patch introduces a new KVM_DEV_VFIO_DEVICE group.
>>
>> This is a new control channel which enables KVM to cooperate with
>> viable VFIO devices.
>>
>> Functions are introduced to check the validity of a VFIO device
>> file descriptor, increment/decrement the ref counter of the VFIO
>> device.
>>
>> The patch introduces 2 attributes for this new device group:
>> KVM_DEV_VFIO_DEVICE_FORWARD_IRQ,
>> KVM_DEV_VFIO_DEVICE_UNFORWARD_IRQ.
>> Their purpose is to turn a VFIO device IRQ into a forwarded IRQ and
>> unset respectively unset the feature.
>>
>> The VFIO device stores a list of registered forwarded IRQs. The reference
>> counter of the device is incremented each time a new IRQ is forwarded.
>> Reference counter is decremented when the IRQ forwarding is unset.
>>
>> The forwarding programmming is architecture specific, implemented in
>> kvm_arch_set_fwd_state function. Architecture specific implementation is
>> enabled when __KVM_HAVE_ARCH_KVM_VFIO_FORWARD is set. When not
>> set those
>> functions are void.
>>
>> Signed-off-by: Eric Auger 
>>
>> ---
>>
>> v2 -> v3:
>> - add API comments in kvm_host.h
>> - improve the commit message
>> - create a private kvm_vfio_fwd_irq struct
>> - fwd_irq_action replaced by a bool and removal of VFIO_IRQ_CLEANUP. This
>>   latter action will be handled in vgic.
>> - add a vfio_device handle argument to kvm_arch_set_fwd_state. The goal is
>>   to move platform specific stuff in architecture specific code.
>> - kvm_arch_set_fwd_state renamed into kvm_arch_vfio_set_forward
>> - increment the ref counter each time we do an IRQ forwarding and decrement
>>   this latter each time one IRQ forward is unset. Simplifies the whole
>>   ref counting.
>> - simplification of list handling: create, search, removal
>>
>> v1 -> v2:
>> - __KVM_HAVE_ARCH_KVM_VFIO renamed into
>> __KVM_HAVE_ARCH_KVM_VFIO_FORWARD
>> - original patch file separated into 2 parts: generic part moved in vfio.c
>>   and ARM specific part(kvm_arch_set_fwd_state)
>> ---
>>  include/linux/kvm_host.h |  28 ++
>>  virt/kvm/vfio.c  | 249
>> ++-
>>  2 files changed, 274 insertions(+), 3 deletions(-)
>>
>> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
>> index ea53b04..0b9659d 100644
>> --- a/include/linux/kvm_host.h
>> +++ b/include/linux/kvm_host.h
>> @@ -1076,6 +1076,15 @@ struct kvm_device_ops {
>>unsigned long arg);
>>  };
>>
>> +/* internal self-contained structure describing a forwarded IRQ */
>> +struct kvm_fwd_irq {
>> +struct kvm *kvm; /* VM to inject the GSI into */
>> +struct vfio_device *vdev; /* vfio device the IRQ belongs to */
>> +__u32 index; /* VFIO device IRQ index */
>> +__u32 subindex; /* VFIO device IRQ subindex */
>> +__u32 gsi; /* gsi, ie. virtual IRQ number */
>> +};
>> +
>>  void kvm_device_get(struct kvm_device *dev);
>>  void kvm_device_put(struct kvm_device *dev);
>>  struct kvm_device *kvm_device_from_filp(struct file *filp);
>> @@ -1085,6 +1094,25 @@ void kvm_unregister_device_ops(u32 type);
>>  extern struct kvm_device_ops kvm_mpic_ops;
>>  extern struct kvm_device_ops kvm_xics_ops;
>>
>> +#ifdef __KVM_HAVE_ARCH_KVM_VFIO_FORWARD
>> +/**
>> + * kvm_arch_vfio_set_forward - changes the forwarded state of an IRQ
>> + *
>> + * @fwd_irq: handle to the forwarded irq struct
>> + * @forward: true means forwarded, false means not forwarded
>> + * returns 0 on success, < 0 on failure
>> + */
>> +int kvm_arch_vfio_set_forward(struct kvm_fwd_irq *fwd_irq,
>> +  bool forward);
>> +
>> +#else
>> +static inline int kvm_arch_vfio_set_forward(struct kvm_fwd_irq *fwd_irq,
>> +bool forward)
>> +{
>> +return 0;
>> +}
>> +#endif
>> +
>>  #ifdef CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT
>>
>>  static inline void kvm_vcpu_set_in_spin_loop(struct kvm_vcpu *vcpu, bool 
>> val)
>> diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
>> index 6f0cc34..af178bb 100644
>> --- a/virt/kvm/vfio.c
>> +++ b/virt/kvm/vfio.c
>> @@ -25,8 +25,16 @@ struct kvm_vfio_group {
>>  struct vfio_group *vfio_group;
>>  };
>>
>> +/* private linkable kvm_fwd_irq struct */
>> +

Re: [PATCH v4 2/3] KVM: arm: add irqfd support

On 11/25/2014 02:12 PM, Eric Auger wrote:
> On 11/25/2014 11:19 AM, Christoffer Dall wrote:
>> [Re-adding cc list]
>>
>> On Mon, Nov 24, 2014 at 05:42:30PM +0100, Eric Auger wrote:
>>> On 11/24/2014 04:47 PM, Christoffer Dall wrote:
 On Mon, Nov 24, 2014 at 12:02 PM, Eric Auger  wrote:
> On 11/24/2014 11:00 AM, Christoffer Dall wrote:
>> On Sun, Nov 23, 2014 at 06:56:59PM +0100, Eric Auger wrote:
>>> This patch enables irqfd on arm.
>>>
>>> Both irqfd and resamplefd are supported. Injection is implemented
>>> in vgic.c without routing.
>>>
>>> This patch enables CONFIG_HAVE_KVM_EVENTFD and CONFIG_HAVE_KVM_IRQFD.
>>>
>>> KVM_CAP_IRQFD is now advertised. KVM_CAP_IRQFD_RESAMPLE capability
>>> automatically is advertised as soon as CONFIG_HAVE_KVM_IRQFD is set.
>>>
>>> Signed-off-by: Eric Auger 
>>>
>>> ---
>>
>> [...]
>>
>>> +int kvm_set_irq(struct kvm *kvm, int irq_source_id,
>>> +u32 irq, int level, bool line_status)
>>> +{
>>> +unsigned int spi = irq + VGIC_NR_PRIVATE_IRQS;
>>> +
>>> +kvm_debug("irqfd sets vIRQ %d to %d\n", irq, level);
>>> +
>>> +if (likely(vgic_initialized(kvm))) {
>>> +if (spi > kvm->arch.vgic.nr_irqs)
>>> +return -EINVAL;
>>> +return kvm_vgic_inject_irq(kvm, 0, spi, level);
>>> +} else if (level && kvm_irq_has_notifier(kvm, 0, irq)) {
>>> +/*
>>> + * any IRQ injected before vgic readiness is
>>> + * ignored and the notifier, if any, is called
>>> + * immediately as if the virtual IRQ were completed
>>> + * by the guest
>>> + */
>>> +kvm_notify_acked_irq(kvm, 0, irq);
>>> +return -EAGAIN;
>>
>> This looks fishy, I don't fully understand which case you're catering
>> towards here (the comment is hard to understand), but my gut feeling is
>> that you should back out (probably with an error) if the vgic is not
>> initialized when calling this function.  Setting up the routing in the
>> first place should probably error out if the vgic is not available.
> The issue is related to integration with QEMU and VFIO.
> Currently VFIO signaling (we tell VFIO to signal the eventfd on a
> peculiar physical IRQ) and irqfd setup (we tell KVM/IRQFD to handle
> previous eventfd when triggered and inject a GSI) are done by QEMU
> *before* the first vcpu run. so before VGIC is ready.
>
> Both are done in a so called QEMU machine init done notifier. It could
> be done in a QEMU reset notifier but I guess the problem would be the
> same. and I think the same at which QEMU initializes it is correct.
>
> As soon as both VFIO signaling and irqfd are setup, virtual IRQ are
> likely to be injected and this is what happens on some circumstances.
> This happens on the 2d QEMU run after killing the 1st QEMU session. For
> some reason I guess the HW device - in my case the xgmac - was released
> in such a way the IRQ wire was not reset. So as soon as VFIO driver
> calls request_irq, the IRQ hits.
>
> I tried to change that this xgmac driver behavior but I must acknowledge
> that for the time beeing I failed. I will continue investigating that.
>
> Indeed I might be fixing the issue at the wrong place. But anyway this
> relies on the fact the assigned device driver toggled down the IRQ
> properly when releasing. At the time the signaling is setup we have not
> entered yet any driver reset code.
>
 I see, it's because of the quirky way we initialize the vgic at time
 of running the first CPU, so user space can't really hook into the
 sweet spot after initializing the vgic but just before actually
 running guest code.
>>>
>>> Yes currently irqfd generic code has no way to test if the virtual
>>> interrupt controller is "ready" (KVM__IRQFD -> kvm_irqfd_assign )
>>> because I guess other archs don't have the problem.

 Could it be that we simply need to let user space init the vgic after
 creating all its vcpus and only then allow setting up the irqfd?
>>>
>>> I start VFIO signaling and setup KVM_IRQFD in a QEMU machine init done
>>> notifier. the vgic init then must happen before then.
>>
>> have all the vcpus that QEMU wants to create, been created by then?
> Hi Christoffer,
> 
> My understanding of QEMU start sequence is as follows:
> at machine init, CPU realize function is called for each CPU
> (target-arm/cpu.c arm_cpu_realizefn)
> qemu_init_vcpu (cpus.c) > qemu_kvm_start_vcpu (cpus.c) > launch VCPU
> thread (qemu_kvm_cpu_thread_fn) > kvm_init_vcpu
> kvm_init_vcpu does the while (1) loop issuing KVM_RUN ioctl if *cpu_can_run*
> 
> cpu_can_run returns true when vm_start is called
> (resume_all_vcpus>cpu_resume)
> 
> QEMU machine init done or reset callbacks happ

[PATCH RFC v2 03/12] virtio: support more feature bits

With virtio-1, we support more than 32 feature bits. Let's make
vdev->guest_features depend on the number of supported feature bits,
allowing us to grow the feature bits automatically.

We also need to enhance the internal functions dealing with getting
and setting features with an additional index field, so that all feature
bits may be accessed (in chunks of 32 bits).

vhost and migration have been ignored for now.

Reviewed-by: Thomas Huth 
Signed-off-by: Cornelia Huck 
---
 hw/9pfs/virtio-9p-device.c |7 ++-
 hw/block/virtio-blk.c  |9 +++--
 hw/char/virtio-serial-bus.c|9 +++--
 hw/net/virtio-net.c|   38 ++
 hw/s390x/s390-virtio-bus.c |9 +
 hw/s390x/virtio-ccw.c  |   17 ++---
 hw/scsi/vhost-scsi.c   |7 +--
 hw/scsi/virtio-scsi.c  |   10 +-
 hw/virtio/dataplane/vring.c|   10 +-
 hw/virtio/virtio-balloon.c |8 ++--
 hw/virtio/virtio-bus.c |9 +
 hw/virtio/virtio-mmio.c|9 +
 hw/virtio/virtio-pci.c |   13 +++--
 hw/virtio/virtio-rng.c |2 +-
 hw/virtio/virtio.c |   29 +
 include/hw/virtio/virtio-bus.h |7 ---
 include/hw/virtio/virtio.h |   19 ++-
 17 files changed, 135 insertions(+), 77 deletions(-)

diff --git a/hw/9pfs/virtio-9p-device.c b/hw/9pfs/virtio-9p-device.c
index 2572747..c29c8c8 100644
--- a/hw/9pfs/virtio-9p-device.c
+++ b/hw/9pfs/virtio-9p-device.c
@@ -21,8 +21,13 @@
 #include "virtio-9p-coth.h"
 #include "hw/virtio/virtio-access.h"
 
-static uint32_t virtio_9p_get_features(VirtIODevice *vdev, uint32_t features)
+static uint32_t virtio_9p_get_features(VirtIODevice *vdev, unsigned int index,
+   uint32_t features)
 {
+if (index > 0) {
+return features;
+}
+
 features |= 1 << VIRTIO_9P_MOUNT_TAG;
 return features;
 }
diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index b19b102..6d86f60 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c
@@ -564,10 +564,15 @@ static void virtio_blk_set_config(VirtIODevice *vdev, 
const uint8_t *config)
 aio_context_release(blk_get_aio_context(s->blk));
 }
 
-static uint32_t virtio_blk_get_features(VirtIODevice *vdev, uint32_t features)
+static uint32_t virtio_blk_get_features(VirtIODevice *vdev, unsigned int index,
+uint32_t features)
 {
 VirtIOBlock *s = VIRTIO_BLK(vdev);
 
+if (index > 0) {
+return features;
+}
+
 features |= (1 << VIRTIO_BLK_F_SEG_MAX);
 features |= (1 << VIRTIO_BLK_F_GEOMETRY);
 features |= (1 << VIRTIO_BLK_F_TOPOLOGY);
@@ -601,7 +606,7 @@ static void virtio_blk_set_status(VirtIODevice *vdev, 
uint8_t status)
 return;
 }
 
-features = vdev->guest_features;
+features = vdev->guest_features[0];
 
 /* A guest that supports VIRTIO_BLK_F_CONFIG_WCE must be able to send
  * cache flushes.  Thus, the "auto writethrough" behavior is never
diff --git a/hw/char/virtio-serial-bus.c b/hw/char/virtio-serial-bus.c
index a7b1b68..55de504 100644
--- a/hw/char/virtio-serial-bus.c
+++ b/hw/char/virtio-serial-bus.c
@@ -75,7 +75,7 @@ static VirtIOSerialPort *find_port_by_name(char *name)
 static bool use_multiport(VirtIOSerial *vser)
 {
 VirtIODevice *vdev = VIRTIO_DEVICE(vser);
-return vdev->guest_features & (1 << VIRTIO_CONSOLE_F_MULTIPORT);
+return vdev->guest_features[0] & (1 << VIRTIO_CONSOLE_F_MULTIPORT);
 }
 
 static size_t write_to_port(VirtIOSerialPort *port,
@@ -467,10 +467,15 @@ static void handle_input(VirtIODevice *vdev, VirtQueue 
*vq)
 {
 }
 
-static uint32_t get_features(VirtIODevice *vdev, uint32_t features)
+static uint32_t get_features(VirtIODevice *vdev, unsigned int index,
+ uint32_t features)
 {
 VirtIOSerial *vser;
 
+if (index > 0) {
+return features;
+}
+
 vser = VIRTIO_SERIAL(vdev);
 
 if (vser->bus.max_nr_ports > 1) {
diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index 9b88775..1e214b5 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -86,7 +86,7 @@ static void virtio_net_set_config(VirtIODevice *vdev, const 
uint8_t *config)
 
 memcpy(&netcfg, config, n->config_size);
 
-if (!(vdev->guest_features >> VIRTIO_NET_F_CTRL_MAC_ADDR & 1) &&
+if (!(vdev->guest_features[0] >> VIRTIO_NET_F_CTRL_MAC_ADDR & 1) &&
 memcmp(netcfg.mac, n->mac, ETH_ALEN)) {
 memcpy(n->mac, netcfg.mac, ETH_ALEN);
 qemu_format_nic_info_str(qemu_get_queue(n->nic), n->mac);
@@ -305,7 +305,7 @@ static RxFilterInfo 
*virtio_net_query_rxfilter(NetClientState *nc)
 info->multicast_table = str_list;
 info->vlan_table = get_vlan_table(n);
 
-if (!((1 << VIRTIO_NET_F_CTRL_VLAN) & vdev->guest_features)) {
+if (!((1 << VIRTIO_NET_F_CTRL_VLAN) & vdev->g

[PATCH RFC v2 06/12] virtio: allow virtio-1 queue layout

For virtio-1 devices, we allow a more complex queue layout that doesn't
require descriptor table and rings on a physically-contigous memory area:
add virtio_queue_set_rings() to allow transports to set this up.

Signed-off-by: Cornelia Huck 
---
 hw/virtio/virtio.c |   16 
 include/hw/virtio/virtio.h |2 ++
 2 files changed, 18 insertions(+)

diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
index 4149f45..2c6bb91 100644
--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -96,6 +96,13 @@ static void virtqueue_init(VirtQueue *vq)
 {
 hwaddr pa = vq->pa;
 
+if (pa == -1ULL) {
+/*
+ * This is a virtio-1 style vq that has already been setup
+ * in virtio_queue_set.
+ */
+return;
+}
 vq->vring.desc = pa;
 vq->vring.avail = pa + vq->vring.num * sizeof(VRingDesc);
 vq->vring.used = vring_align(vq->vring.avail +
@@ -719,6 +726,15 @@ hwaddr virtio_queue_get_addr(VirtIODevice *vdev, int n)
 return vdev->vq[n].pa;
 }
 
+void virtio_queue_set_rings(VirtIODevice *vdev, int n, hwaddr desc,
+hwaddr avail, hwaddr used)
+{
+vdev->vq[n].pa = -1ULL;
+vdev->vq[n].vring.desc = desc;
+vdev->vq[n].vring.avail = avail;
+vdev->vq[n].vring.used = used;
+}
+
 void virtio_queue_set_num(VirtIODevice *vdev, int n, int num)
 {
 /* Don't allow guest to flip queue between existent and
diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
index 40e567c..f840320 100644
--- a/include/hw/virtio/virtio.h
+++ b/include/hw/virtio/virtio.h
@@ -227,6 +227,8 @@ void virtio_queue_set_addr(VirtIODevice *vdev, int n, 
hwaddr addr);
 hwaddr virtio_queue_get_addr(VirtIODevice *vdev, int n);
 void virtio_queue_set_num(VirtIODevice *vdev, int n, int num);
 int virtio_queue_get_num(VirtIODevice *vdev, int n);
+void virtio_queue_set_rings(VirtIODevice *vdev, int n, hwaddr desc,
+hwaddr avail, hwaddr used);
 void virtio_queue_set_align(VirtIODevice *vdev, int n, int align);
 void virtio_queue_notify(VirtIODevice *vdev, int n);
 uint16_t virtio_queue_vector(VirtIODevice *vdev, int n);
-- 
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH RFC v2 08/12] s390x/css: Add a callback for when subchannel gets disabled

From: Thomas Huth 

We need a possibility to run code when a subchannel gets disabled.
This patch adds the necessary infrastructure.

Signed-off-by: Thomas Huth 
Signed-off-by: Cornelia Huck 
---
 hw/s390x/css.c |   12 
 hw/s390x/css.h |1 +
 2 files changed, 13 insertions(+)

diff --git a/hw/s390x/css.c b/hw/s390x/css.c
index b67c039..735ec55 100644
--- a/hw/s390x/css.c
+++ b/hw/s390x/css.c
@@ -588,6 +588,7 @@ int css_do_msch(SubchDev *sch, SCHIB *orig_schib)
 {
 SCSW *s = &sch->curr_status.scsw;
 PMCW *p = &sch->curr_status.pmcw;
+uint16_t oldflags;
 int ret;
 SCHIB schib;
 
@@ -610,6 +611,7 @@ int css_do_msch(SubchDev *sch, SCHIB *orig_schib)
 copy_schib_from_guest(&schib, orig_schib);
 /* Only update the program-modifiable fields. */
 p->intparm = schib.pmcw.intparm;
+oldflags = p->flags;
 p->flags &= ~(PMCW_FLAGS_MASK_ISC | PMCW_FLAGS_MASK_ENA |
   PMCW_FLAGS_MASK_LM | PMCW_FLAGS_MASK_MME |
   PMCW_FLAGS_MASK_MP);
@@ -625,6 +627,12 @@ int css_do_msch(SubchDev *sch, SCHIB *orig_schib)
 (PMCW_CHARS_MASK_MBFC | PMCW_CHARS_MASK_CSENSE);
 sch->curr_status.mba = schib.mba;
 
+/* Has the channel been disabled? */
+if (sch->disable_cb && (oldflags & PMCW_FLAGS_MASK_ENA) != 0
+&& (p->flags & PMCW_FLAGS_MASK_ENA) == 0) {
+sch->disable_cb(sch);
+}
+
 ret = 0;
 
 out:
@@ -1443,6 +1451,10 @@ void css_reset_sch(SubchDev *sch)
 {
 PMCW *p = &sch->curr_status.pmcw;
 
+if ((p->flags & PMCW_FLAGS_MASK_ENA) != 0 && sch->disable_cb) {
+sch->disable_cb(sch);
+}
+
 p->intparm = 0;
 p->flags &= ~(PMCW_FLAGS_MASK_ISC | PMCW_FLAGS_MASK_ENA |
   PMCW_FLAGS_MASK_LM | PMCW_FLAGS_MASK_MME |
diff --git a/hw/s390x/css.h b/hw/s390x/css.h
index 33104ac..7fa807b 100644
--- a/hw/s390x/css.h
+++ b/hw/s390x/css.h
@@ -81,6 +81,7 @@ struct SubchDev {
 uint8_t ccw_no_data_cnt;
 /* transport-provided data: */
 int (*ccw_cb) (SubchDev *, CCW1);
+void (*disable_cb)(SubchDev *);
 SenseId id;
 void *driver_data;
 };
-- 
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH RFC v2 04/12] s390x/virtio-ccw: fix check for WRITE_FEAT

We need to check guest feature size, not host feature size to find
out whether we should call virtio_set_features(). This check is
possible now that vdev->guest_features is an array.

Reviewed-by: Thomas Huth 
Signed-off-by: Cornelia Huck 
---
 hw/s390x/virtio-ccw.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/s390x/virtio-ccw.c b/hw/s390x/virtio-ccw.c
index 2f52b82..62ec852 100644
--- a/hw/s390x/virtio-ccw.c
+++ b/hw/s390x/virtio-ccw.c
@@ -399,7 +399,7 @@ static int virtio_ccw_cb(SubchDev *sch, CCW1 ccw)
 features.index = ldub_phys(&address_space_memory,
ccw.cda + sizeof(features.features));
 features.features = ldl_le_phys(&address_space_memory, ccw.cda);
-if (features.index < ARRAY_SIZE(dev->host_features)) {
+if (features.index < ARRAY_SIZE(vdev->guest_features)) {
 virtio_set_features(vdev, features.index, features.features);
 } else {
 /*
-- 
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH RFC v2 01/12] linux-headers/virtio_config: Update with VIRTIO_F_VERSION_1

From: Thomas Huth 

Add the new VIRTIO_F_VERSION_1 definition to the virtio_config.h
linux header.

Signed-off-by: Thomas Huth 
Signed-off-by: Cornelia Huck 
---
 linux-headers/linux/virtio_config.h |3 +++
 1 file changed, 3 insertions(+)

diff --git a/linux-headers/linux/virtio_config.h 
b/linux-headers/linux/virtio_config.h
index 75dc20b..16aa289 100644
--- a/linux-headers/linux/virtio_config.h
+++ b/linux-headers/linux/virtio_config.h
@@ -54,4 +54,7 @@
 /* Can the device handle any descriptor layout? */
 #define VIRTIO_F_ANY_LAYOUT27
 
+/* v1.0 compliant. */
+#define VIRTIO_F_VERSION_1 32
+
 #endif /* _LINUX_VIRTIO_CONFIG_H */
-- 
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH RFC v2 09/12] s390x/virtio-ccw: add virtio set-revision call

From: Thomas Huth 

Handle the virtio-ccw revision according to what the guest sets.
When revision 1 is selected, we have a virtio-1 standard device
with byteswapping for the virtio rings.

When a channel gets disabled, we have to revert to the legacy behavior
in case the next user of the device does not negotiate the revision 1
anymore (e.g. the boot firmware uses revision 1, but the operating
system only uses the legacy mode).

Note that revisions > 0 are still disabled; but we still extend the
feature bit size to be able to handle the VERSION_1 bit.

Signed-off-by: Thomas Huth 
Signed-off-by: Cornelia Huck 
---
 hw/s390x/virtio-ccw.c |   52 +
 hw/s390x/virtio-ccw.h |7 ++-
 2 files changed, 58 insertions(+), 1 deletion(-)

diff --git a/hw/s390x/virtio-ccw.c b/hw/s390x/virtio-ccw.c
index 62ec852..e79f3b8 100644
--- a/hw/s390x/virtio-ccw.c
+++ b/hw/s390x/virtio-ccw.c
@@ -20,9 +20,11 @@
 #include "hw/virtio/virtio-net.h"
 #include "hw/sysbus.h"
 #include "qemu/bitops.h"
+#include "hw/virtio/virtio-access.h"
 #include "hw/virtio/virtio-bus.h"
 #include "hw/s390x/adapter.h"
 #include "hw/s390x/s390_flic.h"
+#include "linux/virtio_config.h"
 
 #include "ioinst.h"
 #include "css.h"
@@ -260,6 +262,12 @@ typedef struct VirtioThinintInfo {
 uint8_t isc;
 } QEMU_PACKED VirtioThinintInfo;
 
+typedef struct VirtioRevInfo {
+uint16_t revision;
+uint16_t length;
+uint8_t data[0];
+} QEMU_PACKED VirtioRevInfo;
+
 /* Specify where the virtqueues for the subchannel are in guest memory. */
 static int virtio_ccw_set_vqs(SubchDev *sch, uint64_t addr, uint32_t align,
   uint16_t index, uint16_t num)
@@ -299,6 +307,7 @@ static int virtio_ccw_cb(SubchDev *sch, CCW1 ccw)
 {
 int ret;
 VqInfoBlock info;
+VirtioRevInfo revinfo;
 uint8_t status;
 VirtioFeatDesc features;
 void *config;
@@ -373,6 +382,13 @@ static int virtio_ccw_cb(SubchDev *sch, CCW1 ccw)
ccw.cda + sizeof(features.features));
 if (features.index < ARRAY_SIZE(dev->host_features)) {
 features.features = dev->host_features[features.index];
+/*
+ * Don't offer version 1 to the guest if it did not
+ * negotiate at least revision 1.
+ */
+if (features.index == 1 && dev->revision <= 0) {
+features.features &= ~(1 << (VIRTIO_F_VERSION_1 - 32));
+}
 } else {
 /* Return zeroes if the guest supports more feature bits. */
 features.features = 0;
@@ -400,6 +416,13 @@ static int virtio_ccw_cb(SubchDev *sch, CCW1 ccw)
ccw.cda + sizeof(features.features));
 features.features = ldl_le_phys(&address_space_memory, ccw.cda);
 if (features.index < ARRAY_SIZE(vdev->guest_features)) {
+/*
+ * The guest should not set version 1 if it didn't
+ * negotiate a revision >= 1.
+ */
+if (features.index == 1 && dev->revision <= 0) {
+features.features &= ~(1 << (VIRTIO_F_VERSION_1 - 32));
+}
 virtio_set_features(vdev, features.index, features.features);
 } else {
 /*
@@ -600,6 +623,25 @@ static int virtio_ccw_cb(SubchDev *sch, CCW1 ccw)
 }
 }
 break;
+case CCW_CMD_SET_VIRTIO_REV:
+len = sizeof(revinfo);
+if (ccw.count < len || (check_len && ccw.count > len)) {
+ret = -EINVAL;
+break;
+}
+if (!ccw.cda) {
+ret = -EFAULT;
+break;
+}
+cpu_physical_memory_read(ccw.cda, &revinfo, len);
+if (dev->revision >= 0 ||
+revinfo.revision > VIRTIO_CCW_REV_MAX) {
+ret = -ENOSYS;
+break;
+}
+ret = 0;
+dev->revision = revinfo.revision;
+break;
 default:
 ret = -ENOSYS;
 break;
@@ -607,6 +649,13 @@ static int virtio_ccw_cb(SubchDev *sch, CCW1 ccw)
 return ret;
 }
 
+static void virtio_sch_disable_cb(SubchDev *sch)
+{
+VirtioCcwDevice *dev = sch->driver_data;
+
+dev->revision = -1;
+}
+
 static int virtio_ccw_device_init(VirtioCcwDevice *dev, VirtIODevice *vdev)
 {
 unsigned int cssid = 0;
@@ -733,6 +782,7 @@ static int virtio_ccw_device_init(VirtioCcwDevice *dev, 
VirtIODevice *vdev)
 css_sch_build_virtual_schib(sch, 0, VIRTIO_CCW_CHPID_TYPE);
 
 sch->ccw_cb = virtio_ccw_cb;
+sch->disable_cb = virtio_sch_disable_cb;
 
 /* Build senseid data. */
 memset(&sch->id, 0, sizeof(SenseId));
@@ -740,6 +790,8 @@ static int virtio_ccw_device_init(VirtioCcwDevice *dev, 
VirtIODevice *vdev)
 sch->id.cu_type = VIRTIO_CCW_CU_TYPE;
 sch->id.cu_model = vdev->device_id;
 
+dev->revision = -1;
+

[PATCH RFC v2 05/12] virtio: introduce legacy virtio devices

Introduce a helper function to indicate  whether a virtio device is
operating in legacy or virtio standard mode.

It may be used to make decisions about the endianess of virtio accesses
and other virtio-1 specific changes, enabling us to support transitional
devices.

Reviewed-by: Thomas Huth 
Signed-off-by: Cornelia Huck 
---
 hw/virtio/virtio.c|6 +-
 include/hw/virtio/virtio-access.h |4 
 include/hw/virtio/virtio.h|   13 +++--
 3 files changed, 20 insertions(+), 3 deletions(-)

diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
index 2eb5d3c..4149f45 100644
--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -883,7 +883,11 @@ static bool virtio_device_endian_needed(void *opaque)
 VirtIODevice *vdev = opaque;
 
 assert(vdev->device_endian != VIRTIO_DEVICE_ENDIAN_UNKNOWN);
-return vdev->device_endian != virtio_default_endian();
+if (virtio_device_is_legacy(vdev)) {
+return vdev->device_endian != virtio_default_endian();
+}
+/* Devices conforming to VIRTIO 1.0 or later are always LE. */
+return vdev->device_endian != VIRTIO_DEVICE_ENDIAN_LITTLE;
 }
 
 static const VMStateDescription vmstate_virtio_device_endian = {
diff --git a/include/hw/virtio/virtio-access.h 
b/include/hw/virtio/virtio-access.h
index 46456fd..c123ee0 100644
--- a/include/hw/virtio/virtio-access.h
+++ b/include/hw/virtio/virtio-access.h
@@ -19,6 +19,10 @@
 
 static inline bool virtio_access_is_big_endian(VirtIODevice *vdev)
 {
+if (!virtio_device_is_legacy(vdev)) {
+/* Devices conforming to VIRTIO 1.0 or later are always LE. */
+return false;
+}
 #if defined(TARGET_IS_BIENDIAN)
 return virtio_is_big_endian(vdev);
 #elif defined(TARGET_WORDS_BIGENDIAN)
diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
index b408166..40e567c 100644
--- a/include/hw/virtio/virtio.h
+++ b/include/hw/virtio/virtio.h
@@ -275,9 +275,18 @@ void virtio_queue_set_host_notifier_fd_handler(VirtQueue 
*vq, bool assign,
 void virtio_queue_notify_vq(VirtQueue *vq);
 void virtio_irq(VirtQueue *vq);
 
+static inline bool virtio_device_is_legacy(VirtIODevice *vdev)
+{
+return !(vdev->guest_features[1] & (1 << (VIRTIO_F_VERSION_1 - 32)));
+}
+
 static inline bool virtio_is_big_endian(VirtIODevice *vdev)
 {
-assert(vdev->device_endian != VIRTIO_DEVICE_ENDIAN_UNKNOWN);
-return vdev->device_endian == VIRTIO_DEVICE_ENDIAN_BIG;
+if (virtio_device_is_legacy(vdev)) {
+assert(vdev->device_endian != VIRTIO_DEVICE_ENDIAN_UNKNOWN);
+return vdev->device_endian == VIRTIO_DEVICE_ENDIAN_BIG;
+}
+/* Devices conforming to VIRTIO 1.0 or later are always LE. */
+return false;
 }
 #endif
-- 
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH RFC v2 02/12] virtio: cull virtio_bus_set_vdev_features

The only user of this function was virtio-ccw, and it should use
virtio_set_features() like everybody else: We need to make sure
that bad features are masked out properly, which this function did
not do.

Reviewed-by: Thomas Huth 
Signed-off-by: Cornelia Huck 
---
 hw/s390x/virtio-ccw.c  |3 +--
 hw/virtio/virtio-bus.c |   14 --
 include/hw/virtio/virtio-bus.h |3 ---
 3 files changed, 1 insertion(+), 19 deletions(-)

diff --git a/hw/s390x/virtio-ccw.c b/hw/s390x/virtio-ccw.c
index ea236c9..84f17bc 100644
--- a/hw/s390x/virtio-ccw.c
+++ b/hw/s390x/virtio-ccw.c
@@ -400,8 +400,7 @@ static int virtio_ccw_cb(SubchDev *sch, CCW1 ccw)
ccw.cda + sizeof(features.features));
 features.features = ldl_le_phys(&address_space_memory, ccw.cda);
 if (features.index < ARRAY_SIZE(dev->host_features)) {
-virtio_bus_set_vdev_features(&dev->bus, features.features);
-vdev->guest_features = features.features;
+virtio_set_features(vdev, features.features);
 } else {
 /*
  * If the guest supports more feature bits, assert that it
diff --git a/hw/virtio/virtio-bus.c b/hw/virtio/virtio-bus.c
index eb77019..a8ffa07 100644
--- a/hw/virtio/virtio-bus.c
+++ b/hw/virtio/virtio-bus.c
@@ -109,20 +109,6 @@ uint32_t virtio_bus_get_vdev_features(VirtioBusState *bus,
 return k->get_features(vdev, requested_features);
 }
 
-/* Set the features of the plugged device. */
-void virtio_bus_set_vdev_features(VirtioBusState *bus,
-  uint32_t requested_features)
-{
-VirtIODevice *vdev = virtio_bus_get_device(bus);
-VirtioDeviceClass *k;
-
-assert(vdev != NULL);
-k = VIRTIO_DEVICE_GET_CLASS(vdev);
-if (k->set_features != NULL) {
-k->set_features(vdev, requested_features);
-}
-}
-
 /* Get bad features of the plugged device. */
 uint32_t virtio_bus_get_vdev_bad_features(VirtioBusState *bus)
 {
diff --git a/include/hw/virtio/virtio-bus.h b/include/hw/virtio/virtio-bus.h
index 0756545..0d2e7b4 100644
--- a/include/hw/virtio/virtio-bus.h
+++ b/include/hw/virtio/virtio-bus.h
@@ -84,9 +84,6 @@ size_t virtio_bus_get_vdev_config_len(VirtioBusState *bus);
 /* Get the features of the plugged device. */
 uint32_t virtio_bus_get_vdev_features(VirtioBusState *bus,
 uint32_t requested_features);
-/* Set the features of the plugged device. */
-void virtio_bus_set_vdev_features(VirtioBusState *bus,
-  uint32_t requested_features);
 /* Get bad features of the plugged device. */
 uint32_t virtio_bus_get_vdev_bad_features(VirtioBusState *bus);
 /* Get config of the plugged device. */
-- 
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH RFC v2 00/12] qemu: towards virtio-1 host support

Hi,

here's the next version of my virtio-1 qemu patchset. Using virtio-1
virtio-blk and virtio-net devices with a guest kernel built from
<1416829787-14252-1-git-send-email-...@redhat.com> still seems to
work for the virtio-ccw transport.

Changes from v1:
- rebased against current master
- don't advertise VERSION_1 for all devices, make devices switch it on
  individually

Cornelia Huck (9):
  virtio: cull virtio_bus_set_vdev_features
  virtio: support more feature bits
  s390x/virtio-ccw: fix check for WRITE_FEAT
  virtio: introduce legacy virtio devices
  virtio: allow virtio-1 queue layout
  dataplane: allow virtio-1 devices
  s390x/virtio-ccw: support virtio-1 set_vq format
  virtio-net/virtio-blk: enable virtio 1.0
  s390x/virtio-ccw: enable virtio 1.0

Thomas Huth (3):
  linux-headers/virtio_config: Update with VIRTIO_F_VERSION_1
  s390x/css: Add a callback for when subchannel gets disabled
  s390x/virtio-ccw: add virtio set-revision call

 hw/9pfs/virtio-9p-device.c  |7 +-
 hw/block/dataplane/virtio-blk.c |3 +-
 hw/block/virtio-blk.c   |   13 ++-
 hw/char/virtio-serial-bus.c |9 +-
 hw/net/virtio-net.c |   42 +---
 hw/s390x/css.c  |   12 +++
 hw/s390x/css.h  |1 +
 hw/s390x/s390-virtio-bus.c  |9 +-
 hw/s390x/virtio-ccw.c   |  186 +++
 hw/s390x/virtio-ccw.h   |7 +-
 hw/scsi/vhost-scsi.c|7 +-
 hw/scsi/virtio-scsi-dataplane.c |2 +-
 hw/scsi/virtio-scsi.c   |   10 +-
 hw/virtio/Makefile.objs |2 +-
 hw/virtio/dataplane/Makefile.objs   |2 +-
 hw/virtio/dataplane/vring.c |   95 ++
 hw/virtio/virtio-balloon.c  |8 +-
 hw/virtio/virtio-bus.c  |   23 +
 hw/virtio/virtio-mmio.c |9 +-
 hw/virtio/virtio-pci.c  |   13 +--
 hw/virtio/virtio-rng.c  |2 +-
 hw/virtio/virtio.c  |   51 +++---
 include/hw/virtio/dataplane/vring.h |   64 +++-
 include/hw/virtio/virtio-access.h   |4 +
 include/hw/virtio/virtio-bus.h  |   10 +-
 include/hw/virtio/virtio.h  |   34 +--
 linux-headers/linux/virtio_config.h |3 +
 27 files changed, 449 insertions(+), 179 deletions(-)

-- 
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH RFC v2 07/12] dataplane: allow virtio-1 devices

Handle endianness conversion for virtio-1 virtqueues correctly.

Note that dataplane now needs to be built per-target.

Signed-off-by: Cornelia Huck 
---
 hw/block/dataplane/virtio-blk.c |3 +-
 hw/scsi/virtio-scsi-dataplane.c |2 +-
 hw/virtio/Makefile.objs |2 +-
 hw/virtio/dataplane/Makefile.objs   |2 +-
 hw/virtio/dataplane/vring.c |   85 +++
 include/hw/virtio/dataplane/vring.h |   64 --
 6 files changed, 113 insertions(+), 45 deletions(-)

diff --git a/hw/block/dataplane/virtio-blk.c b/hw/block/dataplane/virtio-blk.c
index 1222a37..c25878c 100644
--- a/hw/block/dataplane/virtio-blk.c
+++ b/hw/block/dataplane/virtio-blk.c
@@ -16,6 +16,7 @@
 #include "qemu/iov.h"
 #include "qemu/thread.h"
 #include "qemu/error-report.h"
+#include "hw/virtio/virtio-access.h"
 #include "hw/virtio/dataplane/vring.h"
 #include "sysemu/block-backend.h"
 #include "hw/virtio/virtio-blk.h"
@@ -75,7 +76,7 @@ static void complete_request_vring(VirtIOBlockReq *req, 
unsigned char status)
 VirtIOBlockDataPlane *s = req->dev->dataplane;
 stb_p(&req->in->status, status);
 
-vring_push(&req->dev->dataplane->vring, &req->elem,
+vring_push(s->vdev, &req->dev->dataplane->vring, &req->elem,
req->qiov.size + sizeof(*req->in));
 
 /* Suppress notification to guest by BH and its scheduled
diff --git a/hw/scsi/virtio-scsi-dataplane.c b/hw/scsi/virtio-scsi-dataplane.c
index 03a1e8c..418d73b 100644
--- a/hw/scsi/virtio-scsi-dataplane.c
+++ b/hw/scsi/virtio-scsi-dataplane.c
@@ -94,7 +94,7 @@ void virtio_scsi_vring_push_notify(VirtIOSCSIReq *req)
 {
 VirtIODevice *vdev = VIRTIO_DEVICE(req->vring->parent);
 
-vring_push(&req->vring->vring, &req->elem,
+vring_push(vdev, &req->vring->vring, &req->elem,
req->qsgl.size + req->resp_iov.size);
 
 if (vring_should_notify(vdev, &req->vring->vring)) {
diff --git a/hw/virtio/Makefile.objs b/hw/virtio/Makefile.objs
index d21c397..19b224a 100644
--- a/hw/virtio/Makefile.objs
+++ b/hw/virtio/Makefile.objs
@@ -2,7 +2,7 @@ common-obj-y += virtio-rng.o
 common-obj-$(CONFIG_VIRTIO_PCI) += virtio-pci.o
 common-obj-y += virtio-bus.o
 common-obj-y += virtio-mmio.o
-common-obj-$(CONFIG_VIRTIO) += dataplane/
+obj-$(CONFIG_VIRTIO) += dataplane/
 
 obj-y += virtio.o virtio-balloon.o 
 obj-$(CONFIG_LINUX) += vhost.o vhost-backend.o vhost-user.o
diff --git a/hw/virtio/dataplane/Makefile.objs 
b/hw/virtio/dataplane/Makefile.objs
index 9a8cfc0..753a9ca 100644
--- a/hw/virtio/dataplane/Makefile.objs
+++ b/hw/virtio/dataplane/Makefile.objs
@@ -1 +1 @@
-common-obj-y += vring.o
+obj-y += vring.o
diff --git a/hw/virtio/dataplane/vring.c b/hw/virtio/dataplane/vring.c
index a051775..69809ff 100644
--- a/hw/virtio/dataplane/vring.c
+++ b/hw/virtio/dataplane/vring.c
@@ -18,6 +18,7 @@
 #include "hw/hw.h"
 #include "exec/memory.h"
 #include "exec/address-spaces.h"
+#include "hw/virtio/virtio-access.h"
 #include "hw/virtio/dataplane/vring.h"
 #include "qemu/error-report.h"
 
@@ -83,7 +84,7 @@ bool vring_setup(Vring *vring, VirtIODevice *vdev, int n)
 vring_init(&vring->vr, virtio_queue_get_num(vdev, n), vring_ptr, 4096);
 
 vring->last_avail_idx = virtio_queue_get_last_avail_idx(vdev, n);
-vring->last_used_idx = vring->vr.used->idx;
+vring->last_used_idx = vring_get_used_idx(vdev, vring);
 vring->signalled_used = 0;
 vring->signalled_used_valid = false;
 
@@ -104,7 +105,7 @@ void vring_teardown(Vring *vring, VirtIODevice *vdev, int n)
 void vring_disable_notification(VirtIODevice *vdev, Vring *vring)
 {
 if (!(vdev->guest_features[0] & (1 << VIRTIO_RING_F_EVENT_IDX))) {
-vring->vr.used->flags |= VRING_USED_F_NO_NOTIFY;
+vring_set_used_flags(vdev, vring, VRING_USED_F_NO_NOTIFY);
 }
 }
 
@@ -117,10 +118,10 @@ bool vring_enable_notification(VirtIODevice *vdev, Vring 
*vring)
 if (vdev->guest_features[0] & (1 << VIRTIO_RING_F_EVENT_IDX)) {
 vring_avail_event(&vring->vr) = vring->vr.avail->idx;
 } else {
-vring->vr.used->flags &= ~VRING_USED_F_NO_NOTIFY;
+vring_clear_used_flags(vdev, vring, VRING_USED_F_NO_NOTIFY);
 }
 smp_mb(); /* ensure update is seen before reading avail_idx */
-return !vring_more_avail(vring);
+return !vring_more_avail(vdev, vring);
 }
 
 /* This is stolen from linux/drivers/vhost/vhost.c:vhost_notify() */
@@ -134,12 +135,13 @@ bool vring_should_notify(VirtIODevice *vdev, Vring *vring)
 smp_mb();
 
 if ((vdev->guest_features[0] & VIRTIO_F_NOTIFY_ON_EMPTY) &&
-unlikely(vring->vr.avail->idx == vring->last_avail_idx)) {
+unlikely(!vring_more_avail(vdev, vring))) {
 return true;
 }
 
 if (!(vdev->guest_features[0] & VIRTIO_RING_F_EVENT_IDX)) {
-return !(vring->vr.avail->flags & VRING_AVAIL_F_NO_INTERRUPT);
+return !(vring_get_avail_flags(vdev, vring) &
+ VRING_AVAIL_F_NO_INTERRUPT);
 }
 ol

[PATCH RFC v2 12/12] s390x/virtio-ccw: enable virtio 1.0

virtio-ccw should now have everything in place to operate virtio 1.0
devices, so let's enable revision 1.

Signed-off-by: Cornelia Huck 
---
 hw/s390x/virtio-ccw.h |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/s390x/virtio-ccw.h b/hw/s390x/virtio-ccw.h
index 03d5955..08edd8d 100644
--- a/hw/s390x/virtio-ccw.h
+++ b/hw/s390x/virtio-ccw.h
@@ -73,7 +73,7 @@ typedef struct VirtIOCCWDeviceClass {
 #define VIRTIO_CCW_FEATURE_SIZE NR_VIRTIO_FEATURE_WORDS
 
 /* The maximum virtio revision we support. */
-#define VIRTIO_CCW_REV_MAX 0
+#define VIRTIO_CCW_REV_MAX 1
 
 /* Performance improves when virtqueue kick processing is decoupled from the
  * vcpu thread using ioeventfd for some devices. */
-- 
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH RFC v2 11/12] virtio-net/virtio-blk: enable virtio 1.0

virtio-net (non-vhost) and virtio-blk have everything in place to support
virtio 1.0: let's enable the feature bit for them.

Note that VIRTIO_F_VERSION_1 is technically a transport feature; once
every device is ready for virtio 1.0, we can move this setting this
feature bit out of the individual devices.

Signed-off-by: Cornelia Huck 
---
 hw/block/virtio-blk.c |4 
 hw/net/virtio-net.c   |4 
 2 files changed, 8 insertions(+)

diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index 6d86f60..3781f98 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c
@@ -569,6 +569,10 @@ static uint32_t virtio_blk_get_features(VirtIODevice 
*vdev, unsigned int index,
 {
 VirtIOBlock *s = VIRTIO_BLK(vdev);
 
+if (index == 1) {
+features |= (1 << (VIRTIO_F_VERSION_1 - 32));
+}
+
 if (index > 0) {
 return features;
 }
diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index 1e214b5..fcfc95f 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -447,6 +447,10 @@ static uint32_t virtio_net_get_features(VirtIODevice 
*vdev, unsigned int index,
 VirtIONet *n = VIRTIO_NET(vdev);
 NetClientState *nc = qemu_get_queue(n->nic);
 
+if (index == 1 && !get_vhost_net(nc->peer)) {
+features |= (1 << (VIRTIO_F_VERSION_1 - 32));
+}
+
 if (index > 0) {
 return features;
 }
-- 
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH RFC v2 10/12] s390x/virtio-ccw: support virtio-1 set_vq format

Support the new CCW_CMD_SET_VQ format for virtio-1 devices.

While we're at it, refactor the code a bit and enforce big endian
fields (which had always been required, even for legacy).

Reviewed-by: Thomas Huth 
Signed-off-by: Cornelia Huck 
---
 hw/s390x/virtio-ccw.c |  114 ++---
 1 file changed, 80 insertions(+), 34 deletions(-)

diff --git a/hw/s390x/virtio-ccw.c b/hw/s390x/virtio-ccw.c
index e79f3b8..60d67a3 100644
--- a/hw/s390x/virtio-ccw.c
+++ b/hw/s390x/virtio-ccw.c
@@ -238,11 +238,20 @@ VirtualCssBus *virtual_css_bus_init(void)
 }
 
 /* Communication blocks used by several channel commands. */
-typedef struct VqInfoBlock {
+typedef struct VqInfoBlockLegacy {
 uint64_t queue;
 uint32_t align;
 uint16_t index;
 uint16_t num;
+} QEMU_PACKED VqInfoBlockLegacy;
+
+typedef struct VqInfoBlock {
+uint64_t desc;
+uint32_t res0;
+uint16_t index;
+uint16_t num;
+uint64_t avail;
+uint64_t used;
 } QEMU_PACKED VqInfoBlock;
 
 typedef struct VqConfigBlock {
@@ -269,17 +278,20 @@ typedef struct VirtioRevInfo {
 } QEMU_PACKED VirtioRevInfo;
 
 /* Specify where the virtqueues for the subchannel are in guest memory. */
-static int virtio_ccw_set_vqs(SubchDev *sch, uint64_t addr, uint32_t align,
-  uint16_t index, uint16_t num)
+static int virtio_ccw_set_vqs(SubchDev *sch, VqInfoBlock *info,
+  VqInfoBlockLegacy *linfo)
 {
 VirtIODevice *vdev = virtio_ccw_get_vdev(sch);
+uint16_t index = info ? info->index : linfo->index;
+uint16_t num = info ? info->num : linfo->num;
+uint64_t desc = info ? info->desc : linfo->queue;
 
 if (index > VIRTIO_PCI_QUEUE_MAX) {
 return -EINVAL;
 }
 
 /* Current code in virtio.c relies on 4K alignment. */
-if (addr && (align != 4096)) {
+if (linfo && desc && (linfo->align != 4096)) {
 return -EINVAL;
 }
 
@@ -287,8 +299,12 @@ static int virtio_ccw_set_vqs(SubchDev *sch, uint64_t 
addr, uint32_t align,
 return -EINVAL;
 }
 
-virtio_queue_set_addr(vdev, index, addr);
-if (!addr) {
+if (info) {
+virtio_queue_set_rings(vdev, index, desc, info->avail, info->used);
+} else {
+virtio_queue_set_addr(vdev, index, desc);
+}
+if (!desc) {
 virtio_queue_set_vector(vdev, index, 0);
 } else {
 /* Fail if we don't have a big enough queue. */
@@ -303,10 +319,66 @@ static int virtio_ccw_set_vqs(SubchDev *sch, uint64_t 
addr, uint32_t align,
 return 0;
 }
 
-static int virtio_ccw_cb(SubchDev *sch, CCW1 ccw)
+static int virtio_ccw_handle_set_vq(SubchDev *sch, CCW1 ccw, bool check_len,
+bool is_legacy)
 {
 int ret;
 VqInfoBlock info;
+VqInfoBlockLegacy linfo;
+size_t info_len = is_legacy ? sizeof(linfo) : sizeof(info);
+
+if (check_len) {
+if (ccw.count != info_len) {
+return -EINVAL;
+}
+} else if (ccw.count < info_len) {
+/* Can't execute command. */
+return -EINVAL;
+}
+if (!ccw.cda) {
+return -EFAULT;
+}
+if (is_legacy) {
+linfo.queue = ldq_be_phys(&address_space_memory, ccw.cda);
+linfo.align = ldl_be_phys(&address_space_memory,
+  ccw.cda + sizeof(linfo.queue));
+linfo.index = lduw_be_phys(&address_space_memory,
+   ccw.cda + sizeof(linfo.queue)
+   + sizeof(linfo.align));
+linfo.num = lduw_be_phys(&address_space_memory,
+ ccw.cda + sizeof(linfo.queue)
+ + sizeof(linfo.align)
+ + sizeof(linfo.index));
+ret = virtio_ccw_set_vqs(sch, NULL, &linfo);
+} else {
+info.desc = ldq_be_phys(&address_space_memory, ccw.cda);
+info.index = lduw_be_phys(&address_space_memory,
+  ccw.cda + sizeof(info.desc)
+  + sizeof(info.res0));
+info.num = lduw_be_phys(&address_space_memory,
+ccw.cda + sizeof(info.desc)
+  + sizeof(info.res0)
+  + sizeof(info.index));
+info.avail = ldq_be_phys(&address_space_memory,
+ ccw.cda + sizeof(info.desc)
+ + sizeof(info.res0)
+ + sizeof(info.index)
+ + sizeof(info.num));
+info.used = ldq_be_phys(&address_space_memory,
+ccw.cda + sizeof(info.desc)
++ sizeof(info.res0)
++ sizeof(info.index)
++ sizeof(info.num)
++ sizeof(info.avail));
+ret = virtio_ccw_set_vqs(sch, &info, NULL);
+}
+

Re: [PATCH v4 2/3] KVM: arm: add irqfd support

On 11/25/2014 11:19 AM, Christoffer Dall wrote:
> [Re-adding cc list]
> 
> On Mon, Nov 24, 2014 at 05:42:30PM +0100, Eric Auger wrote:
>> On 11/24/2014 04:47 PM, Christoffer Dall wrote:
>>> On Mon, Nov 24, 2014 at 12:02 PM, Eric Auger  wrote:
 On 11/24/2014 11:00 AM, Christoffer Dall wrote:
> On Sun, Nov 23, 2014 at 06:56:59PM +0100, Eric Auger wrote:
>> This patch enables irqfd on arm.
>>
>> Both irqfd and resamplefd are supported. Injection is implemented
>> in vgic.c without routing.
>>
>> This patch enables CONFIG_HAVE_KVM_EVENTFD and CONFIG_HAVE_KVM_IRQFD.
>>
>> KVM_CAP_IRQFD is now advertised. KVM_CAP_IRQFD_RESAMPLE capability
>> automatically is advertised as soon as CONFIG_HAVE_KVM_IRQFD is set.
>>
>> Signed-off-by: Eric Auger 
>>
>> ---
> 
> [...]
> 
>> +int kvm_set_irq(struct kvm *kvm, int irq_source_id,
>> +u32 irq, int level, bool line_status)
>> +{
>> +unsigned int spi = irq + VGIC_NR_PRIVATE_IRQS;
>> +
>> +kvm_debug("irqfd sets vIRQ %d to %d\n", irq, level);
>> +
>> +if (likely(vgic_initialized(kvm))) {
>> +if (spi > kvm->arch.vgic.nr_irqs)
>> +return -EINVAL;
>> +return kvm_vgic_inject_irq(kvm, 0, spi, level);
>> +} else if (level && kvm_irq_has_notifier(kvm, 0, irq)) {
>> +/*
>> + * any IRQ injected before vgic readiness is
>> + * ignored and the notifier, if any, is called
>> + * immediately as if the virtual IRQ were completed
>> + * by the guest
>> + */
>> +kvm_notify_acked_irq(kvm, 0, irq);
>> +return -EAGAIN;
>
> This looks fishy, I don't fully understand which case you're catering
> towards here (the comment is hard to understand), but my gut feeling is
> that you should back out (probably with an error) if the vgic is not
> initialized when calling this function.  Setting up the routing in the
> first place should probably error out if the vgic is not available.
 The issue is related to integration with QEMU and VFIO.
 Currently VFIO signaling (we tell VFIO to signal the eventfd on a
 peculiar physical IRQ) and irqfd setup (we tell KVM/IRQFD to handle
 previous eventfd when triggered and inject a GSI) are done by QEMU
 *before* the first vcpu run. so before VGIC is ready.

 Both are done in a so called QEMU machine init done notifier. It could
 be done in a QEMU reset notifier but I guess the problem would be the
 same. and I think the same at which QEMU initializes it is correct.

 As soon as both VFIO signaling and irqfd are setup, virtual IRQ are
 likely to be injected and this is what happens on some circumstances.
 This happens on the 2d QEMU run after killing the 1st QEMU session. For
 some reason I guess the HW device - in my case the xgmac - was released
 in such a way the IRQ wire was not reset. So as soon as VFIO driver
 calls request_irq, the IRQ hits.

 I tried to change that this xgmac driver behavior but I must acknowledge
 that for the time beeing I failed. I will continue investigating that.

 Indeed I might be fixing the issue at the wrong place. But anyway this
 relies on the fact the assigned device driver toggled down the IRQ
 properly when releasing. At the time the signaling is setup we have not
 entered yet any driver reset code.

>>> I see, it's because of the quirky way we initialize the vgic at time
>>> of running the first CPU, so user space can't really hook into the
>>> sweet spot after initializing the vgic but just before actually
>>> running guest code.
>>
>> Yes currently irqfd generic code has no way to test if the virtual
>> interrupt controller is "ready" (KVM__IRQFD -> kvm_irqfd_assign )
>> because I guess other archs don't have the problem.
>>>
>>> Could it be that we simply need to let user space init the vgic after
>>> creating all its vcpus and only then allow setting up the irqfd?
>>
>> I start VFIO signaling and setup KVM_IRQFD in a QEMU machine init done
>> notifier. the vgic init then must happen before then.
> 
> have all the vcpus that QEMU wants to create, been created by then?
Hi Christoffer,

My understanding of QEMU start sequence is as follows:
at machine init, CPU realize function is called for each CPU
(target-arm/cpu.c arm_cpu_realizefn)
qemu_init_vcpu (cpus.c) > qemu_kvm_start_vcpu (cpus.c) > launch VCPU
thread (qemu_kvm_cpu_thread_fn) > kvm_init_vcpu
kvm_init_vcpu does the while (1) loop issuing KVM_RUN ioctl if *cpu_can_run*

cpu_can_run returns true when vm_start is called
(resume_all_vcpus>cpu_resume)

QEMU machine init done or reset callbacks happen after machine init and
before vm_start.

The actual vgic "readiness" is set in the first vcpu run, ie on the
vm_start.

vgic instantiation in virt machin

Re: [RFC PATCH 0/6] ARM64: KVM: PMU infrastructure support

2014-11-25 Thread Anup Patel

Hi Christoffer,

On Mon, Nov 24, 2014 at 8:07 PM, Christoffer Dall
 wrote:
> On Mon, Nov 24, 2014 at 02:14:48PM +0530, Anup Patel wrote:
>> On Fri, Nov 21, 2014 at 5:19 PM, Christoffer Dall
>>  wrote:
>> > On Fri, Nov 21, 2014 at 04:06:05PM +0530, Anup Patel wrote:
>> >> Hi Christoffer,
>> >>
>> >> On Fri, Nov 21, 2014 at 3:29 PM, Christoffer Dall
>> >>  wrote:
>> >> > On Thu, Nov 20, 2014 at 08:17:32PM +0530, Anup Patel wrote:
>> >> >> On Wed, Nov 19, 2014 at 8:59 PM, Christoffer Dall
>> >> >>  wrote:
>> >> >> > On Tue, Nov 11, 2014 at 02:48:25PM +0530, Anup Patel wrote:
>> >> >> >> Hi All,
>> >> >> >>
>> >> >> >> I have second thoughts about rebasing KVM PMU patches
>> >> >> >> to Marc's irq-forwarding patches.
>> >> >> >>
>> >> >> >> The PMU IRQs (when virtualized by KVM) are not exactly
>> >> >> >> forwarded IRQs because they are shared between Host
>> >> >> >> and Guest.
>> >> >> >>
>> >> >> >> Scenario1
>> >> >> >> -
>> >> >> >>
>> >> >> >> We might have perf running on Host and no KVM guest
>> >> >> >> running. In this scenario, we wont get interrupts on Host
>> >> >> >> because the kvm_pmu_hyp_init() (similar to the function
>> >> >> >> kvm_timer_hyp_init() of Marc's IRQ-forwarding
>> >> >> >> implementation) has put all host PMU IRQs in forwarding
>> >> >> >> mode.
>> >> >> >>
>> >> >> >> The only way solve this problem is to not set forwarding
>> >> >> >> mode for PMU IRQs in kvm_pmu_hyp_init() and instead
>> >> >> >> have special routines to turn on and turn off the forwarding
>> >> >> >> mode of PMU IRQs. These routines will be called from
>> >> >> >> kvm_arch_vcpu_ioctl_run() for toggling the PMU IRQ
>> >> >> >> forwarding state.
>> >> >> >>
>> >> >> >> Scenario2
>> >> >> >> -
>> >> >> >>
>> >> >> >> We might have perf running on Host and Guest simultaneously
>> >> >> >> which means it is quite likely that PMU HW trigger IRQ meant
>> >> >> >> for Host between "ret = kvm_call_hyp(__kvm_vcpu_run, vcpu);"
>> >> >> >> and "kvm_pmu_sync_hwstate(vcpu);" (similar to timer sync routine
>> >> >> >> of Marc's patchset which is called before local_irq_enable()).
>> >> >> >>
>> >> >> >> In this scenario, the updated kvm_pmu_sync_hwstate(vcpu)
>> >> >> >> will accidentally forward IRQ meant for Host to Guest unless
>> >> >> >> we put additional checks to inspect VCPU PMU state.
>> >> >> >>
>> >> >> >> Am I missing any detail about IRQ forwarding for above
>> >> >> >> scenarios?
>> >> >> >>
>> >> >> > Hi Anup,
>> >> >>
>> >> >> Hi Christoffer,
>> >> >>
>> >> >> >
>> >> >> > I briefly discussed this with Marc.  What I don't understand is how 
>> >> >> > it
>> >> >> > would be possible to get an interrupt for the host while running the
>> >> >> > guest?
>> >> >> >
>> >> >> > The rationale behind my question is that whenever you're running the
>> >> >> > guest, the PMU should be programmed exclusively with guest state, and
>> >> >> > since the PMU is per core, any interrupts should be for the guest, 
>> >> >> > where
>> >> >> > it would always be pending.
>> >> >>
>> >> >> Yes, thats right PMU is programmed exclusively for guest when
>> >> >> guest is running and for host when host is running.
>> >> >>
>> >> >> Let us assume a situation (Scenario2 mentioned previously)
>> >> >> where both host and guest are using PMU. When the guest is
>> >> >> running we come back to host mode due to variety of reasons
>> >> >> (stage2 fault, guest IO, regular host interrupt, host interrupt
>> >> >> meant for guest, ) which means we will return from the
>> >> >> "ret = kvm_call_hyp(__kvm_vcpu_run, vcpu);" statement in the
>> >> >> kvm_arch_vcpu_ioctl_run() function with local IRQs disabled.
>> >> >> At this point we would have restored back host PMU context and
>> >> >> any PMU counter used by host can trigger PMU overflow interrup
>> >> >> for host. Now we will be having "kvm_pmu_sync_hwstate(vcpu);"
>> >> >> in the kvm_arch_vcpu_ioctl_run() function (similar to the
>> >> >> kvm_timer_sync_hwstate() of Marc's IRQ forwarding patchset)
>> >> >> which will try to detect PMU irq forwarding state in GIC hence it
>> >> >> can accidentally discover PMU irq pending for guest while this
>> >> >> PMU irq is actually meant for host.
>> >> >>
>> >> >> This above mentioned situation does not happen for timer
>> >> >> because virtual timer interrupts are exclusively used for guest.
>> >> >> The exclusive use of virtual timer interrupt for guest ensures that
>> >> >> the function kvm_timer_sync_hwstate() will always see correct
>> >> >> state of virtual timer IRQ from GIC.
>> >> >>
>> >> > I'm not quite following.
>> >> >
>> >> > When you call kvm_pmu_sync_hwstate(vcpu) in the non-preemtible section,
>> >> > you would (1) capture the active state of the IRQ pertaining to the
>> >> > guest and (2) deactive the IRQ on the host, then (3) switch the state of
>> >> > the PMU to the host state, and finally (4) re-enable IRQs on the CPU
>> >> > you're running on.
>> >> >
>> >> > If the host PMU state restored in (3) ca

Re: Benchmarking for vhost polling patch

2014-11-25 Thread Razya Ladelsky

Hi Michael,

> Hi Razya,
> On the netperf benchmark, it looks like polling=10 gives a modest but
> measureable gain.  So from that perspective it might be worth it if it's
> not too much code, though we'll need to spend more time checking the
> macro effect - we barely moved the needle on the macro benchmark and
> that is suspicious.

I ran memcached with various values for the key & value arguments, and 
managed to see a bigger impact of polling than when I used the default values,
Here are the numbers:

key=250 TPS  netvhost vm   TPS/cpu  TPS/CPU
value=2048   rate   util  util  change

polling=0   101540   103.0  46   100   695.47
polling=5   136747   123.0  83   100   747.25   0.074440609
polling=7   140722   125.7  84   100   764.79   0.099663658
polling=10  141719   126.3  87   100   757.85   0.089688003
polling=15  142430   127.1  90   100   749.63   0.077863015
polling=25  146347   128.7  95   100   750.49   0.079107993
polling=50  150882   131.1  100  100   754.41   0.084733701

Macro benchmarks are less I/O intensive than the micro benchmark, which is why 
we can expect less impact for polling as compared to netperf. 
However, as shown above, we managed to get 10% TPS/CPU improvement with the 
polling patch.

> Is there a chance you are actually trading latency for throughput?
> do you observe any effect on latency?

No.

> How about trying some other benchmark, e.g. NFS?
> 

Tried, but didn't have enough I/O produced (vhost was at most at 15% util)

> 
> Also, I am wondering:
> 
> since vhost thread is polling in kernel anyway, shouldn't
> we try and poll the host NIC?
> that would likely reduce at least the latency significantly,
> won't it?
> 

Yes, it could be a great addition at some point, but needs a thorough 
investigation. In any case, not a part of this patch...

Thanks,
Razya

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[RFC PATCH v2 0/2] kvm-vfio: implement the vfio skeleton for VT-d Posted-Interrupts

2014-11-25 Thread Feng Wu

VT-d Posted-Interrupts is an enhancement to CPU side Posted-Interrupt.
With VT-d Posted-Interrupts enabled, external interrupts from
direct-assigned devices can be delivered to guests without VMM
intervention when guest is running in non-root mode.

You can find the VT-d Posted-Interrtups Spec. in the following URL:
http://www.intel.com/content/www/us/en/intelligent-systems/intel-technology/vt-directed-io-spec.html

This patchset adds the kvm-vfio interface for VT-d Posted-Interrrupts.

In the second patch of this patchset, I leave function
kvm_update_pi_irte() empty, since the purpose of this patch set is
to implement the VFIO related stuff for VT-d PI. kvm_update_pi_irte()
will do the real things, such as, updating IRTE. In fact, I think this
function will be implemented in another file instead of vfio.c. At the
current stage I just list it here to make the build successful. After
some other dependencies (such as, irq core changes in Linux kernel) is
resolved, I will send out the rest part of the VT-d PI patchset.

This patchset is based on the following Eric's VFIO patchset:
[PATCH v3 0/8] KVM-VFIO IRQ forward control

v1->v2
- Re-use KVM_DEV_VFIO_DEVICE group for VT-d PI.
- Define a new attribute in KVM_DEV_VFIO_DEVICE group.
- Teach KVM about sturct pci_dev, and get host irq from it.

Feng Wu (2):
KVM: kvm-vfio: User API for VT-d Posted-Interrupts
KVM: kvm-vfio: implement the VFIO skeleton for VT-d Posted-Interrupts

Documentation/virtual/kvm/devices/vfio.txt |9 ++
include/uapi/linux/kvm.h | 10 +++
virt/kvm/vfio.c| 115
3 files changed, 134 insertions(+), 0 deletions(-)

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

[RFC PATCH v2 2/2] KVM: kvm-vfio: implement the VFIO skeleton for VT-d Posted-Interrupts

2014-11-25 Thread Feng Wu

This patch adds the kvm-vfio interface for VT-d Posted-Interrrupts.
When guests updates MSI/MSI-x information for an assigned-device,
QEMU will use KVM_DEV_VFIO_DEVICE_POSTING_IRQ attribute to setup
IRTE for VT-d PI. This patch implement this IRQ attribute.

Signed-off-by: Feng Wu 
---
 virt/kvm/vfio.c |  115 +++
 1 files changed, 115 insertions(+), 0 deletions(-)

diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
index 6bc7001..435adf4 100644
--- a/virt/kvm/vfio.c
+++ b/virt/kvm/vfio.c
@@ -446,6 +446,115 @@ out:
return ret;
 }
 
+static int kvm_update_pi_irte(struct kvm *kvm, int host_irq, int guest_irq)
+{
+   /*
+* TODO: need to add the real code to update the related IRTE,
+* Basically, This fucntion will do the following things:
+* - Get struct kvm_kernel_irq_routing_entry from guest irq
+* - Get the destination vCPU of the interrupts
+* - Update the IRTE according the VT-d PI Spec.
+*   1) guest vector
+*   2) Posted-Interrupts descritpor addresss
+*/
+
+   return 0;
+}
+
+static int kvm_vfio_pci_get_irq_count(struct pci_dev *pdev, int irq_type)
+{
+   if (irq_type == VFIO_PCI_INTX_IRQ_INDEX) {
+   u8 pin;
+   pci_read_config_byte(pdev, PCI_INTERRUPT_PIN, &pin);
+   if (pin)
+   return 1;
+   } else if (irq_type == VFIO_PCI_MSI_IRQ_INDEX) {
+   u8 pos;
+   u16 flags;
+
+   pos = pdev->msi_cap;
+   if (pos) {
+   pci_read_config_word(pdev,
+pos + PCI_MSI_FLAGS, &flags);
+   return 1 << ((flags & PCI_MSI_FLAGS_QMASK) >> 1);
+   }
+   } else if (irq_type == VFIO_PCI_MSIX_IRQ_INDEX) {
+   u8 pos;
+   u16 flags;
+
+   pos = pdev->msix_cap;
+   if (pos) {
+   pci_read_config_word(pdev,
+pos + PCI_MSIX_FLAGS, &flags);
+
+   return (flags & PCI_MSIX_FLAGS_QSIZE) + 1;
+   }
+   } else if (irq_type == VFIO_PCI_ERR_IRQ_INDEX)
+   if (pci_is_pcie(pdev))
+   return 1;
+
+   return 0;
+}
+
+static int kvm_vfio_set_pi(struct kvm_device *kdev, int32_t __user *argp)
+{
+   struct kvm_posted_intr pi_info;
+   int *virq;
+   unsigned long minsz;
+   struct vfio_device *vdev;
+   struct msi_desc *entry;
+   struct device *dev;
+   struct pci_dev *pdev;
+   int i, max, ret;
+
+   minsz = offsetofend(struct kvm_posted_intr, count);
+
+   if (copy_from_user(&pi_info, (void __user *)argp, minsz))
+   return -EFAULT;
+
+   if (pi_info.argsz < minsz || pi_info.index >= VFIO_PCI_NUM_IRQS)
+   return -EINVAL;
+
+   vdev = kvm_vfio_get_vfio_device(pi_info.fd);
+   if (IS_ERR(vdev))
+   return PTR_ERR(vdev);
+
+   dev = kvm_vfio_external_base_device(vdev);
+   if (!dev)
+   return -EFAULT;
+
+   pdev = to_pci_dev(dev);
+
+   max = kvm_vfio_pci_get_irq_count(pdev, pi_info.index);
+
+   if (pi_info.argsz - minsz < pi_info.count * sizeof(int) ||
+   pi_info.start >= max || pi_info.start + pi_info.count > max)
+   return -EINVAL;
+
+   virq = memdup_user((void __user *)((unsigned long)argp + minsz),
+  pi_info.count * sizeof(int));
+   if (IS_ERR(virq))
+   return PTR_ERR(virq);
+
+   for (i=0; imsi_list, list) {
+   if (entry->msi_attrib.entry_nr != pi_info.start+i)
+   continue;
+
+   ret = kvm_update_pi_irte(kdev->kvm,
+entry->irq, virq[i]);
+   if (ret) {
+   kfree(virq);
+   return -EFAULT;
+   }
+   }
+   }
+
+   kfree(virq);
+
+   return 0;
+}
+
 static int kvm_vfio_set_device(struct kvm_device *kdev, long attr, u64 arg)
 {
int32_t __user *argp = (int32_t __user *)(unsigned long)arg;
@@ -456,6 +565,9 @@ static int kvm_vfio_set_device(struct kvm_device *kdev, 
long attr, u64 arg)
case KVM_DEV_VFIO_DEVICE_UNFORWARD_IRQ:
ret = kvm_vfio_control_irq_forward(kdev, attr, argp);
break;
+   case KVM_DEV_VFIO_DEVICE_POSTING_IRQ:
+   ret = kvm_vfio_set_pi(kdev, argp);
+   break;
default:
ret = -ENXIO;
}
@@ -511,6 +623,9 @@ static int kvm_vfio_has_attr(struct kvm_device *dev,
case KVM_DEV_VFIO_DEVICE_UNFORWARD_IRQ:
return 0;
 #endif
+   case KVM_DEV_VFIO_DEVICE_POSTING_IRQ:
+   return 0;
+
}
break;

[RFC PATCH v2 1/2] KVM: kvm-vfio: User API for VT-d Posted-Interrupts

2014-11-25 Thread Feng Wu

This patch adds and documents a new attribute
KVM_DEV_VFIO_DEVICE_POSTING_IRQ in KVM_DEV_VFIO_DEVICE group.
This new attribute is used for VT-d Posted-Interrupts.

When guest OS changes the interrupt configuration for an
assigned device, such as, MSI/MSIx data/address fields,
QEMU will use this IRQ attribute to tell KVM to update the
related IRTE according the VT-d Posted-Interrrupts Specification,
such as, the guest vector should be updated in the related IRTE.

Signed-off-by: Feng Wu 
---
 Documentation/virtual/kvm/devices/vfio.txt |9 +
 include/uapi/linux/kvm.h   |   10 ++
 2 files changed, 19 insertions(+), 0 deletions(-)

diff --git a/Documentation/virtual/kvm/devices/vfio.txt 
b/Documentation/virtual/kvm/devices/vfio.txt
index f7aff29..39dee86 100644
--- a/Documentation/virtual/kvm/devices/vfio.txt
+++ b/Documentation/virtual/kvm/devices/vfio.txt
@@ -42,3 +42,12 @@ activated before VFIO_DEVICE_SET_IRQS has been called to 
trigger the IRQ
 or associate an eventfd to it. Unforwarding can only be called while the
 signaling has been disabled with VFIO_DEVICE_SET_IRQS. If this condition is
 not satisfied, the command returns an -EBUSY.
+
+  KVM_DEV_VFIO_DEVICE_POSTING_IRQ: Use posted interrtups mechanism to post
+   the IRQ to guests.
+For this attribute, kvm_device_attr.addr points to a kvm_posted_intr struct.
+
+When guest OS changes the interrupt configuration for an assigned device,
+such as, MSI/MSIx data/address fields, QEMU will use this IRQ attribute
+to tell KVM to update the related IRTE according the VT-d Posted-Interrrupts
+Specification, such as, the guest vector should be updated in the related IRTE.
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index a269a42..e5f86ad 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -949,6 +949,7 @@ struct kvm_device_attr {
 #define  KVM_DEV_VFIO_DEVICE   2
 #define   KVM_DEV_VFIO_DEVICE_FORWARD_IRQ  1
 #define   KVM_DEV_VFIO_DEVICE_UNFORWARD_IRQ2
+#define   KVM_DEV_VFIO_DEVICE_POSTING_IRQ  3
 
 enum kvm_device_type {
KVM_DEV_TYPE_FSL_MPIC_20= 1,
@@ -973,6 +974,15 @@ struct kvm_arch_forwarded_irq {
__u32 gsi; /* gsi, ie. virtual IRQ number */
 };
 
+struct kvm_posted_intr {
+   __u32   argsz;
+   __u32   fd; /* file descriptor of the VFIO device */
+   __u32   index;  /* VFIO device IRQ index */
+   __u32   start;
+   __u32   count;
+   int virq[0];/* gsi, ie. virtual IRQ number */
+};
+
 /*
  * ioctls for VM fds
  */
-- 
1.7.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [question] lots of interrupts injected to vm when pressing somekey w/o releasing

2014-11-25 Thread Zhang Haoyu

>> I tested win-server-2008 with "-cpu 
>> core2duo,hv_spinlocks=0x,hv_relaxed,hv_time",
>> this problem still happened, about 200,000 vmexits per-second, 
>> bringing very bad experience, just like being stuck.
>
>Please upload a full trace somewhere, or at least the "perf report" output.
>
# perf kvm stat report --event=vmexit
Warning:
Processed 4912765 events and lost 81 chunks!
Check IO/CPU overload!

Analyze events for all VCPUs:

 VM-EXITSamples  Samples% Time% Avg time

   PAUSE_INSTRUCTION124862060.74%23.33%  0.83us ( +-   0.02% )
  IO_INSTRUCTION 79961938.89%74.49%  4.13us ( +-   6.93% )
  EXTERNAL_INTERRUPT   2629 0.13% 0.47%  7.93us ( +-   1.27% )
  VMCALL   1785 0.09% 0.04%  0.98us ( +-   0.30% )
 APIC_ACCESS   1495 0.07% 0.11%  3.38us ( +-   1.65% )
   EXCEPTION_NMI   1188 0.06% 0.03%  1.26us ( +-   1.66% )
 TPR_BELOW_THRESHOLD273 0.01% 0.01%  1.57us ( +-   1.32% )
   PENDING_INTERRUPT226 0.01% 0.01%  1.38us ( +-   2.59% )
 HLT  6 0.00% 1.51%  11153.81us ( +-  15.24% )
   EPT_MISCONFIG  2 0.00% 0.00% 15.46us ( +-  23.93% )

Total Samples:2055843, Total events handled time:4438794.59us.

# perf kvm stat report --event=ioport
Warning:
Processed 4912765 events and lost 81 chunks!
Check IO/CPU overload!

Analyze events for all VCPUs:

  IO Port AccessSamples  Samples% Time% Avg time

0x64:PIN 79903299.93%99.84%  3.42us ( +-   8.35% )
   0x3fd:PIN300 0.04% 0.08%  6.88us ( +-   3.45% )
0x60:PIN274 0.03% 0.09%  8.53us ( +-   3.07% )
 0xb000:POUT  2 0.00% 0.00%  7.96us ( +-   3.16% )
  0xb000:PIN  2 0.00% 0.00%  8.16us ( +-   2.31% )
  0xb008:PIN  2 0.00% 0.00%  1.53us ( +-  13.66% )
  0xafe0:PIN  2 0.00% 0.00%  3.62us ( +-   0.84% )
  0xafe1:PIN  2 0.00% 0.00%  2.76us ( +-   2.99% )

Total Samples:799616, Total events handled time:2739038.53us.

# trace-cmd report
version = 6
CPU 4 is empty
CPU 5 is empty
CPU 6 is empty
CPU 7 is empty
CPU 9 is empty
CPU 11 is empty
CPU 12 is empty
CPU 13 is empty
cpus=16
 kvm-22783 [003]   721.255618: kvm_entry:vcpu 0
 kvm-22784 [014]   721.255619: kvm_entry:vcpu 1
 kvm-22783 [003]   721.255622: kvm_exit: reason 
PAUSE_INSTRUCTION rip 0xf80001688a9e info 0 0
 kvm-22784 [014]   721.255623: kvm_exit: reason 
IO_INSTRUCTION rip 0xf88000cdd23b info 640008 0
 kvm-22784 [014]   721.255623: kvm_pio:  pio_read at 
0x64 size 1 count 1
 kvm-22784 [014]   721.255623: kvm_userspace_exit:   reason 
KVM_EXIT_IO (2)
 kvm-22783 [003]   721.255623: kvm_entry:vcpu 0
 kvm-22784 [014]   721.255625: kvm_entry:vcpu 1
 kvm-22783 [003]   721.255626: kvm_exit: reason 
PAUSE_INSTRUCTION rip 0xf80001688a9e info 0 0
 kvm-22783 [003]   721.255626: kvm_entry:vcpu 0
 kvm-22784 [014]   721.255627: kvm_exit: reason 
IO_INSTRUCTION rip 0xf88000cdd23b info 640008 0
 kvm-22784 [014]   721.255628: kvm_pio:  pio_read at 
0x64 size 1 count 1
 kvm-22784 [014]   721.255628: kvm_userspace_exit:   reason 
KVM_EXIT_IO (2)
 kvm-22783 [003]   721.255628: kvm_exit: reason 
PAUSE_INSTRUCTION rip 0xf80001688a9e info 0 0
 kvm-22783 [003]   721.255629: kvm_entry:vcpu 0
 kvm-22784 [014]   721.255630: kvm_entry:vcpu 1
 kvm-22783 [003]   721.255631: kvm_exit: reason 
PAUSE_INSTRUCTION rip 0xf80001688a9e info 0 0
 kvm-22784 [014]   721.255632: kvm_exit: reason 
IO_INSTRUCTION rip 0xf88000cdd23b info 640008 0
 kvm-22783 [003]   721.255632: kvm_entry:vcpu 0
 kvm-22784 [014]   721.255632: kvm_pio:  pio_read at 
0x64 size 1 count 1
 kvm-22784 [014]   721.255632: kvm_userspace_exit:   reason 
KVM_EXIT_IO (2)
 kvm-22783 [003]   721.255634: kvm_exit: reason 
PAUSE_INSTRUCTION rip 0xf80001688a9e info 0 0
 kvm-22784 [014]   721.255634: kvm_entry:vcpu 1
 kvm-22783 [003]   721.255635: kvm_entry:vcpu 0
 kvm-22784 [014]   721.255636: kvm_exit: reason 
IO_INSTRUCTION rip 0xf88000cdd23b info 640008 0
 kvm-22784 [014]   721.255636: kvm_pio:  pio_read at 
0x64 size 1 count 1
 kvm-22784 [014]   721.255636: kvm_userspace_exit

Re: [PATCH v2 3/3] KVM: arm/arm64: Enable Dirty Page logging for ARMv8

On Mon, Nov 24, 2014 at 01:22:16PM -0800, Mario Smarduch wrote:
> On 11/22/2014 12:02 PM, Christoffer Dall wrote:
> > On Sat, Nov 15, 2014 at 12:19:10AM -0800, m.smard...@samsung.com wrote:
> >> From: Mario Smarduch 
> >>
> >> This patch enables ARMv8 ditry page logging support. Plugs ARMv8 into 
> >> generic
> >> layer through Kconfig symbol, and drops earlier ARM64 constraints to enable
> >> logging at architecture layer.
> >>
> >> Signed-off-by: Mario Smarduch 
> > 
> > Just reminding you again of what I said in the previous thread (think
> > that was before you sent this out), that you need to handle the pud_huge
> > case in arch/arm/kvm/mmu.c for ARMv8 here.
> > 
> > -Christoffer
> > 
> 
> 
>   Yes, so like similar handling to what unmap_puds() does when
> it encounters a PUD Block?

yes

> 
> Should next revision be rebased to 'queued' 3.18.0-rc2?
> 

rebase onto kvmarm/next whenever we move that to kvm/next after the
merge window (keep an eye out), typically it will be 3.19-rc1 + some
stuff from Paolo.

-Christoffer
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [CFT PATCH v2 0/2] KVM: support XSAVES usage in the host



On 25/11/2014 11:13, Wanpeng Li wrote:
> Hi Paolo,
> On Mon, Nov 24, 2014 at 05:43:32PM +0100, Paolo Bonzini wrote:
>> The first patch ensures that XSAVES is not exposed in the guest until
>> we emulate MSR_IA32_XSS.  The second exports XSAVE data in the correct
>> format.
>>
>> I tested these on a non-XSAVES system so they should not be completely
>> broken, but I need some help.  I am not even sure which XSAVE states
>> are _not_ enabled, and thus compacted, in Linux.
>>
>> Note that these patches do not add support for XSAVES in the guest yet,
>> since MSR_IA32_XSS is not emulated.
>>
>> If they fix the bug Nadav reported, I'll add Reported-by and commit.
> 
> I test this patchset w/ your "KVM: x86: export get_xsave_addr" patch on 
> Skylake and guest hang during boot. The guest screen show "Probing EDD 
> (edd=off to disable)... ok", and no more dump.

Anything in dmesg?  Can you try this patch on top?

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 373b0ab9a32e..ca26681455c2 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6955,6 +6955,9 @@ int fx_init(struct kvm_vcpu *vcpu)
return err;

fpu_finit(&vcpu->arch.guest_fpu);
+   if (cpu_has_xsaves)
+   vcpu->arch.guest_fpu.state->xsave.xsave_hdr.xcomp_bv =
+   host_xcr0 | XSTATE_COMPACTION_ENABLED;

/*
 * Ensure guest xcr0 is valid for loading

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [CFT PATCH v2 0/2] KVM: support XSAVES usage in the host

2014-11-25 Thread Wanpeng Li

Hi Paolo,
On Mon, Nov 24, 2014 at 05:43:32PM +0100, Paolo Bonzini wrote:
>The first patch ensures that XSAVES is not exposed in the guest until
>we emulate MSR_IA32_XSS.  The second exports XSAVE data in the correct
>format.
>
>I tested these on a non-XSAVES system so they should not be completely
>broken, but I need some help.  I am not even sure which XSAVE states
>are _not_ enabled, and thus compacted, in Linux.
>
>Note that these patches do not add support for XSAVES in the guest yet,
>since MSR_IA32_XSS is not emulated.
>
>If they fix the bug Nadav reported, I'll add Reported-by and commit.

I test this patchset w/ your "KVM: x86: export get_xsave_addr" patch on 
Skylake and guest hang during boot. The guest screen show "Probing EDD 
(edd=off to disable)... ok", and no more dump.

Regards,
Wanpeng Li 

>
>Thanks,
>
>Paolo
>
>v1->v2: also adjust KVM_SET_XSAVE
>
>Paolo Bonzini (2):
>  kvm: x86: mask out XSAVES
>  KVM: x86: support XSAVES usage in the host
>
> arch/x86/kvm/cpuid.c | 11 ++-
> arch/x86/kvm/x86.c   | 87 +++-
> 2 files changed, 90 insertions(+), 8 deletions(-)
>
>-- 
>1.8.3.1
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v14 0/7] KVM/arm/x86: dirty page logging for ARMv7 (3.17.0-rc1)

Hi Mario,

On Thu, Nov 13, 2014 at 05:57:41PM -0800, Mario Smarduch wrote:
> Patch series adds support for ARMv7 and generic dirty page logging support. 
> As we try to move towards generic dirty page logging additional logic is 
> moved 
> to generic code. Initially x86, armv7 KVM_GET_DIRTY_LOG reuses generic 
> code, shortly followed by armv8 patches.
> 
So given the timing of this and the last-minute fixes to this series, it
is too late to merge for 3.19.

I would like if we could get one complete patch series with both v7 and
v8 support in there ready for right after the merge window closes, so
that Marc and I can queue this early for -next and then merge it for
3.20.

Thanks,
-Christoffer
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v4 2/3] KVM: arm: add irqfd support

[Re-adding cc list]

On Mon, Nov 24, 2014 at 05:42:30PM +0100, Eric Auger wrote:
> On 11/24/2014 04:47 PM, Christoffer Dall wrote:
> > On Mon, Nov 24, 2014 at 12:02 PM, Eric Auger  wrote:
> >> On 11/24/2014 11:00 AM, Christoffer Dall wrote:
> >>> On Sun, Nov 23, 2014 at 06:56:59PM +0100, Eric Auger wrote:
>  This patch enables irqfd on arm.
> 
>  Both irqfd and resamplefd are supported. Injection is implemented
>  in vgic.c without routing.
> 
>  This patch enables CONFIG_HAVE_KVM_EVENTFD and CONFIG_HAVE_KVM_IRQFD.
> 
>  KVM_CAP_IRQFD is now advertised. KVM_CAP_IRQFD_RESAMPLE capability
>  automatically is advertised as soon as CONFIG_HAVE_KVM_IRQFD is set.
> 
>  Signed-off-by: Eric Auger 
> 
>  ---

[...]

>  +int kvm_set_irq(struct kvm *kvm, int irq_source_id,
>  +u32 irq, int level, bool line_status)
>  +{
>  +unsigned int spi = irq + VGIC_NR_PRIVATE_IRQS;
>  +
>  +kvm_debug("irqfd sets vIRQ %d to %d\n", irq, level);
>  +
>  +if (likely(vgic_initialized(kvm))) {
>  +if (spi > kvm->arch.vgic.nr_irqs)
>  +return -EINVAL;
>  +return kvm_vgic_inject_irq(kvm, 0, spi, level);
>  +} else if (level && kvm_irq_has_notifier(kvm, 0, irq)) {
>  +/*
>  + * any IRQ injected before vgic readiness is
>  + * ignored and the notifier, if any, is called
>  + * immediately as if the virtual IRQ were completed
>  + * by the guest
>  + */
>  +kvm_notify_acked_irq(kvm, 0, irq);
>  +return -EAGAIN;
> >>>
> >>> This looks fishy, I don't fully understand which case you're catering
> >>> towards here (the comment is hard to understand), but my gut feeling is
> >>> that you should back out (probably with an error) if the vgic is not
> >>> initialized when calling this function.  Setting up the routing in the
> >>> first place should probably error out if the vgic is not available.
> >> The issue is related to integration with QEMU and VFIO.
> >> Currently VFIO signaling (we tell VFIO to signal the eventfd on a
> >> peculiar physical IRQ) and irqfd setup (we tell KVM/IRQFD to handle
> >> previous eventfd when triggered and inject a GSI) are done by QEMU
> >> *before* the first vcpu run. so before VGIC is ready.
> >>
> >> Both are done in a so called QEMU machine init done notifier. It could
> >> be done in a QEMU reset notifier but I guess the problem would be the
> >> same. and I think the same at which QEMU initializes it is correct.
> >>
> >> As soon as both VFIO signaling and irqfd are setup, virtual IRQ are
> >> likely to be injected and this is what happens on some circumstances.
> >> This happens on the 2d QEMU run after killing the 1st QEMU session. For
> >> some reason I guess the HW device - in my case the xgmac - was released
> >> in such a way the IRQ wire was not reset. So as soon as VFIO driver
> >> calls request_irq, the IRQ hits.
> >>
> >> I tried to change that this xgmac driver behavior but I must acknowledge
> >> that for the time beeing I failed. I will continue investigating that.
> >>
> >> Indeed I might be fixing the issue at the wrong place. But anyway this
> >> relies on the fact the assigned device driver toggled down the IRQ
> >> properly when releasing. At the time the signaling is setup we have not
> >> entered yet any driver reset code.
> >>
> > I see, it's because of the quirky way we initialize the vgic at time
> > of running the first CPU, so user space can't really hook into the
> > sweet spot after initializing the vgic but just before actually
> > running guest code.
> 
> Yes currently irqfd generic code has no way to test if the virtual
> interrupt controller is "ready" (KVM__IRQFD -> kvm_irqfd_assign )
> because I guess other archs don't have the problem.
> > 
> > Could it be that we simply need to let user space init the vgic after
> > creating all its vcpus and only then allow setting up the irqfd?
> 
> I start VFIO signaling and setup KVM_IRQFD in a QEMU machine init done
> notifier. the vgic init then must happen before then.

have all the vcpus that QEMU wants to create, been created by then?

> 
> Currently the VGIC KVM device has group/attributes that allow to set
> registers (vgic_attr_regs_access) and *partially* init the state
> (vgic_init_maps). Enhancing that we could indeed force the vgic init
> earlier.
> 

We would probably add a new attribute to the vgic device api if we were
to go down that road.

> > 
> > Alternatively we can refactor the whole thing so that we don't mess
> > with the pending state etc. directly in the vgic_update_irq function,
> > but just sets the level state (or remember that there was an edge,
> > hummm, maybe not) and later propagate this to the vcpus in
> > compute_pending_for_cpu().
> 
> yes we could indeed record a level info very early and not do anything
>

Re: [question] lots of interrupts injected to vm when pressing some key w/o releasing