[PATCH] kvm: remove KVM_MMIO_SIZE

2015-02-05 Thread Tiejun Chen
After commits f78146b0f923 ("KVM: Fix page-crossing MMIO") and
87da7e66a405 ("KVM: x86: fix vcpu->mmio_fragments overflow"),
KVM_MMIO_SIZE is no longer used anywhere.

Signed-off-by: Tiejun Chen 
---
 arch/x86/include/asm/kvm_host.h | 2 --
 include/linux/kvm_host.h        | 4 ----
 2 files changed, 6 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index cb19d05..b0c10fb 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -38,8 +38,6 @@
 #define KVM_PRIVATE_MEM_SLOTS 3
 #define KVM_MEM_SLOTS_NUM (KVM_USER_MEM_SLOTS + KVM_PRIVATE_MEM_SLOTS)
 
-#define KVM_MMIO_SIZE 16
-
 #define KVM_PIO_PAGE_OFFSET 1
 #define KVM_COALESCED_MMIO_PAGE_OFFSET 2
 
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 26f1060..cdbe5cc 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -33,10 +33,6 @@
 
 #include <asm/kvm_host.h>
 
-#ifndef KVM_MMIO_SIZE
-#define KVM_MMIO_SIZE 8
-#endif
-
 /*
  * The bit 16 ~ bit 31 of kvm_memory_region::flags are internally used
  * in kvm, other bits are visible for userspace which are defined in
-- 
1.9.1



Re: [PATCH] kvm: remove KVM_MMIO_SIZE

2015-02-05 Thread Paolo Bonzini
On 05/02/2015 10:22, Tiejun Chen wrote:
> After commits f78146b0f923 ("KVM: Fix page-crossing MMIO") and
> 87da7e66a405 ("KVM: x86: fix vcpu->mmio_fragments overflow"),
> KVM_MMIO_SIZE is no longer used anywhere.
> 
> Signed-off-by: Tiejun Chen 
> ---
>  arch/x86/include/asm/kvm_host.h | 2 --
>  include/linux/kvm_host.h        | 4 ----
>  2 files changed, 6 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index cb19d05..b0c10fb 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -38,8 +38,6 @@
>  #define KVM_PRIVATE_MEM_SLOTS 3
>  #define KVM_MEM_SLOTS_NUM (KVM_USER_MEM_SLOTS + KVM_PRIVATE_MEM_SLOTS)
>  
> -#define KVM_MMIO_SIZE 16
> -
>  #define KVM_PIO_PAGE_OFFSET 1
>  #define KVM_COALESCED_MMIO_PAGE_OFFSET 2
>  
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 26f1060..cdbe5cc 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -33,10 +33,6 @@
>  
>  #include <asm/kvm_host.h>
>  
> -#ifndef KVM_MMIO_SIZE
> -#define KVM_MMIO_SIZE 8
> -#endif
> -
>  /*
>   * The bit 16 ~ bit 31 of kvm_memory_region::flags are internally used
>   * in kvm, other bits are visible for userspace which are defined in
> 

Nice, thanks.

Paolo


Re: [PATCH 3/7] s390/kernel: Update /proc/sysinfo file with Extended Name and UUID

2015-02-05 Thread Paolo Bonzini


On 04/02/2015 20:32, Christian Borntraeger wrote:
> 
> A new architecture extends STSI 3.2.2 with UUID and long names. KVM will
> provide the first implementation. This patch adds the additional data 
> fields (Extended Name and UUID) from the 4KB block returned by the STSI 
> 3.2.2 command and reflects this information in the /proc/sysinfo file
> accordingly. 

Great.

> This is non-KVM code, but developed by the KVM team. The patch is
> acked by Heiko Carstens to go over the KVM tree.

The last two lines are not even necessary (the explanation of "why" is
in the first paragraph, the explanation of "how" is in the
Signed-off-bys and Acked-by).

Paolo

> Signed-off-by: Ekaterina Tumanova 
> Reviewed-by: David Hildenbrand 
> Reviewed-by: Cornelia Huck 
> Acked-by: Heiko Carstens 
> Signed-off-by: Christian Borntraeger 


Re: [Xen-devel] [PATCH v5 0/2] x86/xen: add xen hypercall preemption

2015-02-05 Thread David Vrabel
On 27/01/15 01:51, Luis R. Rodriguez wrote:
> From: "Luis R. Rodriguez" 
> 
> This v5 nukes tracing as David said it was useless, it also
> only adds support for 64-bit as it's the only thing I can test,
> and slightly modifies the documentation in code as to why we
> want this. The no-kprobe thing is left in place as I haven't
> heard confirmation it's kosher to remove it.

FYI, I went back and reviewed my last v4 patch and folded the
xen_end_upcall() change into it instead.

http://lists.xen.org/archives/html/xen-devel/2015-02/msg00682.html

This will be easier for arm/arm64 to make use of in the future since
there are fewer arch-specific changes.

David


Re: [PATCH RFC 1/1] KVM: s390: Add MEMOP ioctl for reading/writing guest memory

2015-02-05 Thread Cornelia Huck
On Wed, 04 Feb 2015 09:26:11 +0100
Christian Borntraeger  wrote:

> the classic channel I/O does set the storage key change/reference and
> also triggers errors in the storage key protection value mismatches.

Just a bit of background for the benefit of innocent bystanders:

When classic channel I/O is initiated, the caller provides a so-called
operation request block (orb) which references the actual channel
program it wants to start. In this same orb, the caller also provides
the storage key to be used for all memory fetches (either of the
individual channel commands or of the memory they reference).
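
To make the layout concrete, here is a simplified and partly
hypothetical sketch of the orb, loosely based on struct cmd_orb in the
kernel's s390 channel I/O code (drivers/s390/cio); most control bits
are folded into a reserved field here:

#include <stdint.h>

/* Abbreviated operation request block: the key below governs every
 * storage access the channel subsystem performs for this request. */
struct cmd_orb_sketch {
	uint32_t intparm;	/* interruption parameter returned to the caller */
	uint32_t key:4;		/* storage key for all fetches of the channel
				 * program and of the memory it references */
	uint32_t reserved:28;	/* suspend/format/prefetch controls omitted */
	uint32_t cpa;		/* address of the first channel command word */
};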

The qemu interpretation of channel commands currently ignores all of this.

> Conny:
> I am asking myself if we should explicitly add a comment in the
> virtio-ccw spec that all accesses are assumed to be with key 0 and
> thus never cause key protection. The change/reference bit is set
> by the underlying I/O or memory copy anyway.
> We can then add a ccw later on to set a different key if we ever need
> that.

I don't think we need to clarify anything for the normal channel
commands used by virtio: They are treated as any other channel command
wrt key protection; and the lack of key checking is a feature of qemu,
not of the spec.

For the memory used for the virtqueues, I agree that we should clarify
that we assume key 0. I'm not sure whether there's a case for extending
this in the future, but who knows.

I'll probably have to open an issue against the virtio spec for this.



Re: [PATCH 2/6] KVM: MMU: Add mmu help functions to support PML

2015-02-05 Thread Radim Krčmář
2015-02-05 13:59+0800, Kai Huang:
> Thanks for your review and sorry for the late reply. I was working on
> something else these days.

(No problem, a reply within a week is instantaneous for me :)

> For rest of your comments, as this patch series have already been in Paolo's
> queue branch, I think I can send further patches to enhance them, if Paolo
> agrees the enhancement is needed.

Ah, sorry, I didn't notice the patches were already in ...
it's better to wait for a bug in the common part now.


Re: [PATCH 5/6] KVM: x86: Add new dirty logging kvm_x86_ops for PML

2015-02-05 Thread Radim Krčmář
2015-02-05 14:29+0800, Kai Huang:
> >>+   /* Still write protect RO slot */
> >>+   if (new->flags & KVM_MEM_READONLY) {
> >>+   kvm_mmu_slot_remove_write_access(kvm, new);
> >We didn't write protect RO slots before, does this patch depend on it?
> No, PML doesn't depend on it to work. It's suggested by Paolo.

Thanks, it would have deserved a separate patch, IMO.


Re: [PATCH 6/6] KVM: VMX: Add PML support in VMX

2015-02-05 Thread Radim Krčmář
2015-02-05 14:23+0800, Kai Huang:
> On 02/03/2015 11:18 PM, Radim Krčmář wrote:
> >You have it protected by CONFIG_X86_64, but use it unconditionally.
> Thanks for catching. This has been fixed by another patch, and the fix has
> also been merged by Paolo.

(And I haven't noticed the followup either ...)

I don't know of any present bugs in this patch and all questions have
been cleared,

Thanks.

> >>+   exec_control &= ~SECONDARY_EXEC_ENABLE_PML;
> >What is the harm of enabling it here?
> >
> >(SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES seems similar and does it.)
> Because the PML feature detection is unconditional (meaning
> SECONDARY_EXEC_ENABLE_PML is always in vmcs_config.cpu_based_2nd_exec_ctrl),
> but the PML buffer is only created when vcpu is created, and it is
> controlled by 'enable_pml' module parameter, if we always enable
> SECONDARY_EXEC_ENABLE_PML here, no PML buffer will be created if PML is
> disabled by 'enable_pml' parameter,

I meant

	if (!enable_pml)
		exec_control &= ~SECONDARY_EXEC_ENABLE_PML;

here and no exec_control operations in vmx_create_vcpu().

> so it's better to enable it along with
> creating PML buffer.

I think the reason why KVM split the setup into vmx_create_vcpu() and
vmx_secondary_exec_control() is to simplify nested virtualization.

> >>+   if (!(to_vmx(vcpu)->idt_vectoring_info & VECTORING_INFO_VALID_MASK) &&
> >>+   cpu_has_virtual_nmis() &&
> >>+   (exit_qualification & INTR_INFO_UNBLOCK_NMI))
> >>+   vmcs_set_bits(GUEST_INTERRUPTIBILITY_INFO,
> >>+   GUEST_INTR_STATE_NMI);
> >Relevant part of the specification is pasted from 27.2.2 (Information
> >for VM Exits Due to Vectored Events), which makes me think that this bit
> >is mirrored to IDT-vectoring information field ...
> >
> >Isn't vmx_recover_nmi_blocking() enough?
> >
> >(I see the same code in handle_ept_violation(), but wasn't that needed
> >  just because of a hardware error?)
> This needs to be handled in both EPT violation and PML. If you look at the
> PML specification (the link is in cover letter), you can see the definition
> of bit 12 of exit_qualification (section 1.5), which explains above code.
> The same definition of exit_qualification is in EPT violation part in
> Intel's SDM.

True ... IDT-vectoring doesn't set bit 12 because PML exit isn't a fault
and 27.2.2: "— For all other relevant VM exits, bit 12 is cleared to 0."
(This was humourously pasted to PML as well.)

> >>+   /* PML index always points to next available PML buffer entity */
> >>+   if (pml_idx >= PML_ENTITY_NUM)
> >>+   pml_idx = 0;
> >>+   else
> >>+   pml_idx++;
> >If the pml_idx is >= PML_ENTITY_NUM and < 0xffff, the log is empty,
> In this case, the log is full, actually. PML index is set to PML_ENTITY_NUM
> - 1 initially, and hardware decreases it after a GPA is logged.
> 
> Below is the pseudocode of PML hardware logic.
> 
> IF (PML Index[15:9] ≠ 0)
> THEN VM exit;
> FI;

pml_idx >= PML_ENTITY_NUM exits without modifying the buffer,

> set accessed and dirty flags for EPT;
> IF (a dirty flag was updated from 0 to 1)
> THEN
> PML address[PML index] ← 4-KByte-aligned guest-physical address;
> PML index is decremented;

0xffff is the only value that specifies a full buffer, the rest means
that we incorrectly set up the initial index and VMX exited without
changing the buffer => we shouldn't read it.
(I should have said "untouched" instead of "empty".)

> FI;
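
To make that reading concrete, here is a hedged sketch (illustration
only, not the kernel's flush code) of how a PML index would map to a
count of valid entries, assuming PML_ENTITY_NUM = 512 and an initial
index of PML_ENTITY_NUM - 1:

#include <stdint.h>

#define PML_ENTITY_NUM	512	/* 4-KByte buffer, one 8-byte GPA per entry */

static uint16_t pml_valid_entries(uint16_t pml_idx)
{
	if (pml_idx == 0xffff)		/* decremented past entry 0 */
		return PML_ENTITY_NUM;	/* buffer is full */
	if (pml_idx >= PML_ENTITY_NUM)	/* bad initial index, exit on entry */
		return 0;		/* hardware left the buffer untouched */
	/* entries pml_idx + 1 .. PML_ENTITY_NUM - 1 hold logged GPAs */
	return PML_ENTITY_NUM - 1 - pml_idx;
}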


[PATCH RFC] kvm: x86: add halt_poll module parameter

2015-02-05 Thread Paolo Bonzini
This patch introduces a new module parameter for the KVM module; when it
is present, KVM attempts a bit of polling on every HLT before scheduling
itself out via kvm_vcpu_block.

This parameter helps a lot for latency-bound workloads---in particular
I tested it with O_DSYNC writes with a battery-backed disk in the host.
In this case, writes are fast (because the data doesn't have to go all
the way to the platters) but they cannot be merged by either the host or
the guest.  KVM's performance here is usually around 30% of bare metal,
or 50% if you use cache=directsync or cache=writethrough (these
parameters prevent the guest from sending pointless flush requests, and
at the same time they are not slow because of the battery-backed cache).
The bad performance happens because on every halt the host CPU decides
to halt itself too.  When the interrupt comes, the vCPU thread is then
migrated to a new physical CPU, and in general the latency is horrible
because the vCPU thread has to be scheduled back in.

With this patch performance reaches 60-65% of bare metal and, more
important, 99% of what you get if you use idle=poll in the guest.  This
means that the tunable gets rid of this particular bottleneck, and more
work can be done to improve performance in the kernel or QEMU.

Of course there is some price to pay; every time an otherwise idle vCPU
is interrupted by an interrupt, it will poll unnecessarily and thus
impose a little load on the host.  The above results were obtained with
a mostly random value of the parameter (2000000), and the load was around
1.5-2.5% CPU usage on one of the host's core for each idle guest vCPU.

The patch also adds a new stat, /sys/kernel/debug/kvm/halt_successful_poll,
that can be used to tune the parameter.  It counts how many HLT
instructions received an interrupt during the polling period; each
successful poll avoids that Linux schedules the VCPU thread out and back
in, and may also avoid a likely trip to C1 and back for the physical CPU.

While the VM is idle, a Linux 4 VCPU VM halts around 10 times per second.
Of these halts, almost all are failed polls.  During the benchmark,
instead, basically all halts end within the polling period, except a more
or less constant stream of 50 per second coming from vCPUs that are not
running the benchmark.  The wasted time is thus very low.  Things may
be slightly different for Windows VMs, which have a ~10 ms timer tick.

The effect is also visible on Marcelo's recently-introduced latency
test for the TSC deadline timer.  Though of course a non-RT kernel has
awful latency bounds, the latency of the timer is around 8000-10000 clock
cycles compared to 20000-120000 without setting halt_poll.  For the TSC
deadline timer, thus, the effect is both a smaller average latency and
a smaller variance.

Signed-off-by: Paolo Bonzini 
---
 arch/x86/include/asm/kvm_host.h |  1 +
 arch/x86/kvm/x86.c              | 28 ++++++++++++++++++++++++----
 include/linux/kvm_host.h        |  1 +
 virt/kvm/kvm_main.c             | 22 +++++++++++++++-------
 4 files changed, 41 insertions(+), 11 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 848947ac6ade..a236e39cc385 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -655,6 +655,7 @@ struct kvm_vcpu_stat {
u32 irq_window_exits;
u32 nmi_window_exits;
u32 halt_exits;
+   u32 halt_successful_poll;
u32 halt_wakeup;
u32 request_irq_exits;
u32 irq_exits;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 1373e04e1f19..b7b20828f01c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -96,6 +96,9 @@ EXPORT_SYMBOL_GPL(kvm_x86_ops);
 static bool ignore_msrs = 0;
 module_param(ignore_msrs, bool, S_IRUGO | S_IWUSR);
 
+unsigned int halt_poll = 0;
+module_param(halt_poll, uint, S_IRUGO | S_IWUSR);
+
 unsigned int min_timer_period_us = 500;
 module_param(min_timer_period_us, uint, S_IRUGO | S_IWUSR);
 
@@ -145,6 +148,7 @@ struct kvm_stats_debugfs_item debugfs_entries[] = {
{ "irq_window", VCPU_STAT(irq_window_exits) },
{ "nmi_window", VCPU_STAT(nmi_window_exits) },
{ "halt_exits", VCPU_STAT(halt_exits) },
+   { "halt_successful_poll", VCPU_STAT(halt_successful_poll) },
{ "halt_wakeup", VCPU_STAT(halt_wakeup) },
{ "hypercalls", VCPU_STAT(hypercalls) },
{ "request_irq", VCPU_STAT(request_irq_exits) },
@@ -5819,13 +5823,29 @@ void kvm_arch_exit(void)
 int kvm_emulate_halt(struct kvm_vcpu *vcpu)
 {
++vcpu->stat.halt_exits;
-   if (irqchip_in_kernel(vcpu->kvm)) {
-   vcpu->arch.mp_state = KVM_MP_STATE_HALTED;
-   return 1;
-   } else {
+   if (!irqchip_in_kernel(vcpu->kvm)) {
vcpu->run->exit_reason = KVM_EXIT_HLT;
return 0;
}
+
+   vcpu->arch.mp_state = KVM_MP_STATE_HALTED;
+   if (halt_poll) {
+   u64 start, curr;
+		rdtscll(start);
+		do {
+			/*
+			 * This sets KVM_REQ_UNHALT if an interrupt
+			 * arrives.
+			 */
+			if (kvm_vcpu_check_block(vcpu) < 0) {
+				++vcpu->stat.halt_successful_poll;
+				break;
+			}
+			rdtscll(curr);
+		} while(!need_resched() && curr - start < halt_poll);
+	}

Re: [PATCH 4/4] kvm,rcu: use RCU extended quiescent state when running KVM guest

2015-02-05 Thread Christian Borntraeger
Am 05.02.2015 um 17:35 schrieb r...@redhat.com:
> From: Rik van Riel 
> 
> The host kernel is not doing anything while the CPU is executing
> a KVM guest VCPU, so it can be marked as being in an extended
> quiescent state, identical to that used when running user space
> code.
> 
> The only exception to that rule is when the host handles an
> interrupt, which is already handled by the irq code, which
> calls rcu_irq_enter and rcu_irq_exit.
> 
> The guest_enter and guest_exit functions already switch vtime
> accounting independent of context tracking, so leave those calls
> where they are, instead of moving them into the context tracking
> code.
> 
> Signed-off-by: Rik van Riel 
> ---
>  include/linux/context_tracking.h   | 8 +++++++-
>  include/linux/context_tracking_state.h | 1 +
>  2 files changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
> index bd9f000fc98d..a5d3bb44b897 100644
> --- a/include/linux/context_tracking.h
> +++ b/include/linux/context_tracking.h
> @@ -43,7 +43,7 @@ static inline enum ctx_state exception_enter(void)
>  static inline void exception_exit(enum ctx_state prev_ctx)
>  {
>   if (context_tracking_is_enabled()) {
> - if (prev_ctx == IN_USER)
> + if (prev_ctx == IN_USER || prev_ctx == IN_GUEST)
>   context_tracking_user_enter(prev_ctx);
>   }
>  }
> @@ -78,6 +78,9 @@ static inline void guest_enter(void)
>   vtime_guest_enter(current);
>   else
>   current->flags |= PF_VCPU;
> +
> + if (context_tracking_is_enabled())
> + context_tracking_user_enter(IN_GUEST);
>  }


Couldn't we make
rcu_virt_note_context_switch(smp_processor_id());
conditional in include/linux/kvm_host.h (kvm_guest_enter)?

e.g. something like

	if (!context_tracking_is_enabled())
		rcu_virt_note_context_switch(smp_processor_id());




Re: [PATCH 4/4] kvm,rcu: use RCU extended quiescent state when running KVM guest

2015-02-05 Thread Rik van Riel
On 02/05/2015 11:44 AM, Christian Borntraeger wrote:
> Am 05.02.2015 um 17:35 schrieb r...@redhat.com:
>> From: Rik van Riel 
>>
>> The host kernel is not doing anything while the CPU is executing
>> a KVM guest VCPU, so it can be marked as being in an extended
>> quiescent state, identical to that used when running user space
>> code.
>>
>> The only exception to that rule is when the host handles an
>> interrupt, which is already handled by the irq code, which
>> calls rcu_irq_enter and rcu_irq_exit.
>>
>> The guest_enter and guest_exit functions already switch vtime
>> accounting independent of context tracking, so leave those calls
>> where they are, instead of moving them into the context tracking
>> code.
>>
>> Signed-off-by: Rik van Riel 
>> ---
>>  include/linux/context_tracking.h   | 8 +++++++-
>>  include/linux/context_tracking_state.h | 1 +
>>  2 files changed, 8 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
>> index bd9f000fc98d..a5d3bb44b897 100644
>> --- a/include/linux/context_tracking.h
>> +++ b/include/linux/context_tracking.h
>> @@ -43,7 +43,7 @@ static inline enum ctx_state exception_enter(void)
>>  static inline void exception_exit(enum ctx_state prev_ctx)
>>  {
>>  if (context_tracking_is_enabled()) {
>> -if (prev_ctx == IN_USER)
>> +if (prev_ctx == IN_USER || prev_ctx == IN_GUEST)
>>  context_tracking_user_enter(prev_ctx);
>>  }
>>  }
>> @@ -78,6 +78,9 @@ static inline void guest_enter(void)
>>  vtime_guest_enter(current);
>>  else
>>  current->flags |= PF_VCPU;
>> +
>> +if (context_tracking_is_enabled())
>> +context_tracking_user_enter(IN_GUEST);
>>  }
> 
> 
> > Couldn't we make
> rcu_virt_note_context_switch(smp_processor_id());
> conditional in include/linux/kvm_host.h (kvm_guest_enter)
> 
> e.g. something like
> if (!context_tracking_is_enabled())
>   rcu_virt_note_context_switch(smp_processor_id());

Possibly. I considered the same, but I do not know whether
or not just rcu_user_enter / rcu_user_exit is enough.

I could certainly try it out and see whether anything
explodes, but I am not convinced that is careful enough
when it comes to handling RCU code...

Paul? :)



[PATCH 3/4] nohz,kvm: export context_tracking_user_enter/exit

2015-02-05 Thread riel
From: Rik van Riel 

Export context_tracking_user_enter/exit so it can be used by KVM.

Signed-off-by: Rik van Riel 
---
 kernel/context_tracking.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index 97806c4deec5..a3f4a2840637 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -107,6 +107,7 @@ void context_tracking_user_enter(enum ctx_state state)
local_irq_restore(flags);
 }
 NOKPROBE_SYMBOL(context_tracking_user_enter);
+EXPORT_SYMBOL_GPL(context_tracking_user_enter);
 
 /**
  * context_tracking_user_exit - Inform the context tracking that the CPU is
@@ -146,6 +147,7 @@ void context_tracking_user_exit(enum ctx_state state)
local_irq_restore(flags);
 }
 NOKPROBE_SYMBOL(context_tracking_user_exit);
+EXPORT_SYMBOL_GPL(context_tracking_user_exit);
 
 /**
  * __context_tracking_task_switch - context switch the syscall callbacks
-- 
1.9.3



[PATCH 0/4] rcu,nohz,kvm: use RCU extended quiescent state when running KVM guest

2015-02-05 Thread riel
When running a KVM guest on a system with NOHZ_FULL enabled, and the
KVM guest running with idle=poll mode, we still get wakeups of the
rcuos/N threads.

This problem has already been solved for user space by telling the
RCU subsystem that the CPU is in an extended quiescent state while
running user space code.

This patch series extends that code a little bit to make it usable
to track KVM guest space, too.

I tested the code by booting a KVM guest with idle=poll, on a system
with NOHZ_FULL enabled on most CPUs, and a VCPU thread bound to a
CPU. In a 10 second interval, rcuos/N threads on other CPUs got woken
up several times, while the rcuos thread on the CPU running the bound
and always running VCPU thread never got woken up once.



[PATCH 1/4] rcu,nohz: add state parameter to context_tracking_user_enter/exit

2015-02-05 Thread riel
From: Rik van Riel 

Add the expected ctx_state as a parameter to context_tracking_user_enter
and context_tracking_user_exit, allowing the same functions to not just
track kernel <> user space switching, but also kernel <> guest transitions.

Signed-off-by: Rik van Riel 
---
 include/linux/context_tracking.h | 12 ++++++------
 kernel/context_tracking.c        | 10 +++++-----
 2 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
index 37b81bd51ec0..bd9f000fc98d 100644
--- a/include/linux/context_tracking.h
+++ b/include/linux/context_tracking.h
@@ -10,21 +10,21 @@
 #ifdef CONFIG_CONTEXT_TRACKING
 extern void context_tracking_cpu_set(int cpu);
 
-extern void context_tracking_user_enter(void);
-extern void context_tracking_user_exit(void);
+extern void context_tracking_user_enter(enum ctx_state state);
+extern void context_tracking_user_exit(enum ctx_state state);
 extern void __context_tracking_task_switch(struct task_struct *prev,
   struct task_struct *next);
 
 static inline void user_enter(void)
 {
if (context_tracking_is_enabled())
-   context_tracking_user_enter();
+   context_tracking_user_enter(IN_USER);
 
 }
 static inline void user_exit(void)
 {
if (context_tracking_is_enabled())
-   context_tracking_user_exit();
+   context_tracking_user_exit(IN_USER);
 }
 
 static inline enum ctx_state exception_enter(void)
@@ -35,7 +35,7 @@ static inline enum ctx_state exception_enter(void)
return 0;
 
prev_ctx = this_cpu_read(context_tracking.state);
-   context_tracking_user_exit();
+   context_tracking_user_exit(prev_ctx);
 
return prev_ctx;
 }
@@ -44,7 +44,7 @@ static inline void exception_exit(enum ctx_state prev_ctx)
 {
if (context_tracking_is_enabled()) {
if (prev_ctx == IN_USER)
-   context_tracking_user_enter();
+   context_tracking_user_enter(prev_ctx);
}
 }
 
diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index 937ecdfdf258..4c010787c9ec 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -47,7 +47,7 @@ void context_tracking_cpu_set(int cpu)
  * to execute won't use any RCU read side critical section because this
  * function sets RCU in extended quiescent state.
  */
-void context_tracking_user_enter(void)
+void context_tracking_user_enter(enum ctx_state state)
 {
unsigned long flags;
 
@@ -75,7 +75,7 @@ void context_tracking_user_enter(void)
WARN_ON_ONCE(!current->mm);
 
local_irq_save(flags);
-   if ( __this_cpu_read(context_tracking.state) != IN_USER) {
+   if ( __this_cpu_read(context_tracking.state) != state) {
if (__this_cpu_read(context_tracking.active)) {
trace_user_enter(0);
/*
@@ -101,7 +101,7 @@ void context_tracking_user_enter(void)
 * OTOH we can spare the calls to vtime and RCU when context_tracking.active
 * is false because we know that CPU is not tickless.
 */
-   __this_cpu_write(context_tracking.state, IN_USER);
+   __this_cpu_write(context_tracking.state, state);
}
local_irq_restore(flags);
 }
@@ -118,7 +118,7 @@ NOKPROBE_SYMBOL(context_tracking_user_enter);
  * This call supports re-entrancy. This way it can be called from any exception
  * handler without needing to know if we came from userspace or not.
  */
-void context_tracking_user_exit(void)
+void context_tracking_user_exit(enum ctx_state state)
 {
unsigned long flags;
 
@@ -129,7 +129,7 @@ void context_tracking_user_exit(void)
return;
 
local_irq_save(flags);
-   if (__this_cpu_read(context_tracking.state) == IN_USER) {
+   if (__this_cpu_read(context_tracking.state) == state) {
if (__this_cpu_read(context_tracking.active)) {
/*
 * We are going to run code that may use RCU. Inform
-- 
1.9.3



[PATCH 2/4] rcu,nohz: run vtime_user_enter/exit only when state == IN_USER

2015-02-05 Thread riel
From: Rik van Riel 

Only run vtime_user_enter and vtime_user_exit when we are entering
or exiting user state, respectively.

The RCU code only distinguishes between "idle" and "not idle or kernel".
There should be no need to add an additional (unused) state there.

Signed-off-by: Rik van Riel 
---
 kernel/context_tracking.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index 4c010787c9ec..97806c4deec5 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -85,7 +85,8 @@ void context_tracking_user_enter(enum ctx_state state)
 * user_exit() or rcu_irq_enter(). Let's remove RCU's dependency
 * on the tick.
 */
-   vtime_user_enter(current);
+   if (state == IN_USER)
+   vtime_user_enter(current);
rcu_user_enter();
}
/*
@@ -136,7 +137,8 @@ void context_tracking_user_exit(enum ctx_state state)
 * RCU core about that (ie: we may need the tick again).
 */
rcu_user_exit();
-   vtime_user_exit(current);
+   if (state == IN_USER)
+   vtime_user_exit(current);
trace_user_exit(0);
}
__this_cpu_write(context_tracking.state, IN_KERNEL);
-- 
1.9.3



[PATCH 4/4] kvm,rcu: use RCU extended quiescent state when running KVM guest

2015-02-05 Thread riel
From: Rik van Riel 

The host kernel is not doing anything while the CPU is executing
a KVM guest VCPU, so it can be marked as being in an extended
quiescent state, identical to that used when running user space
code.

The only exception to that rule is when the host handles an
interrupt, which is already handled by the irq code, which
calls rcu_irq_enter and rcu_irq_exit.

The guest_enter and guest_exit functions already switch vtime
accounting independent of context tracking, so leave those calls
where they are, instead of moving them into the context tracking
code.

Signed-off-by: Rik van Riel 
---
 include/linux/context_tracking.h       | 8 +++++++-
 include/linux/context_tracking_state.h | 1 +
 2 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
index bd9f000fc98d..a5d3bb44b897 100644
--- a/include/linux/context_tracking.h
+++ b/include/linux/context_tracking.h
@@ -43,7 +43,7 @@ static inline enum ctx_state exception_enter(void)
 static inline void exception_exit(enum ctx_state prev_ctx)
 {
if (context_tracking_is_enabled()) {
-   if (prev_ctx == IN_USER)
+   if (prev_ctx == IN_USER || prev_ctx == IN_GUEST)
context_tracking_user_enter(prev_ctx);
}
 }
@@ -78,6 +78,9 @@ static inline void guest_enter(void)
vtime_guest_enter(current);
else
current->flags |= PF_VCPU;
+
+   if (context_tracking_is_enabled())
+   context_tracking_user_enter(IN_GUEST);
 }
 
 static inline void guest_exit(void)
@@ -86,6 +89,9 @@ static inline void guest_exit(void)
vtime_guest_exit(current);
else
current->flags &= ~PF_VCPU;
+
+   if (context_tracking_is_enabled())
+   context_tracking_user_exit(IN_GUEST);
 }
 
 #else
diff --git a/include/linux/context_tracking_state.h b/include/linux/context_tracking_state.h
index 97a81225d037..f3ef027af749 100644
--- a/include/linux/context_tracking_state.h
+++ b/include/linux/context_tracking_state.h
@@ -15,6 +15,7 @@ struct context_tracking {
enum ctx_state {
IN_KERNEL = 0,
IN_USER,
+   IN_GUEST,
} state;
 };
 
-- 
1.9.3



Re: [PATCH RFC] kvm: x86: add halt_poll module parameter

2015-02-05 Thread Rik van Riel
On 02/05/2015 11:05 AM, Paolo Bonzini wrote:
> This patch introduces a new module parameter for the KVM module; when it
> is present, KVM attempts a bit of polling on every HLT before scheduling
> itself out via kvm_vcpu_block.
> 
> This parameter helps a lot for latency-bound workloads---in particular
> I tested it with O_DSYNC writes with a battery-backed disk in the host.
> In this case, writes are fast (because the data doesn't have to go all
> the way to the platters) but they cannot be merged by either the host or
> the guest.  KVM's performance here is usually around 30% of bare metal,
> or 50% if you use cache=directsync or cache=writethrough (these
> parameters prevent the guest from sending pointless flush requests, and
> at the same time they are not slow because of the battery-backed cache).
> The bad performance happens because on every halt the host CPU decides
> to halt itself too.  When the interrupt comes, the vCPU thread is then
> migrated to a new physical CPU, and in general the latency is horrible
> because the vCPU thread has to be scheduled back in.
> 
> With this patch performance reaches 60-65% of bare metal and, more
> important, 99% of what you get if you use idle=poll in the guest.  This
> means that the tunable gets rid of this particular bottleneck, and more
> work can be done to improve performance in the kernel or QEMU.
> 
> Of course there is some price to pay; every time an otherwise idle vCPU
> is interrupted by an interrupt, it will poll unnecessarily and thus
> impose a little load on the host.  The above results were obtained with
> a mostly random value of the parameter (2000000), and the load was around
> 1.5-2.5% CPU usage on one of the host's core for each idle guest vCPU.
> 
> The patch also adds a new stat, /sys/kernel/debug/kvm/halt_successful_poll,
> that can be used to tune the parameter.  It counts how many HLT
> instructions received an interrupt during the polling period; each
> successful poll avoids that Linux schedules the VCPU thread out and back
> in, and may also avoid a likely trip to C1 and back for the physical CPU.

In the long run, this value should probably be auto-tuned.
However, it seems like a good idea to introduce this kind
of thing one step at a time.

> While the VM is idle, a Linux 4 VCPU VM halts around 10 times per second.
> Of these halts, almost all are failed polls.  During the benchmark,
> instead, basically all halts end within the polling period, except a more
> or less constant stream of 50 per second coming from vCPUs that are not
> running the benchmark.  The wasted time is thus very low.  Things may
> be slightly different for Windows VMs, which have a ~10 ms timer tick.
> 
> The effect is also visible on Marcelo's recently-introduced latency
> test for the TSC deadline timer.  Though of course a non-RT kernel has
> awful latency bounds, the latency of the timer is around 8000-10000 clock
> cycles compared to 20000-120000 without setting halt_poll.  For the TSC
> deadline timer, thus, the effect is both a smaller average latency and
> a smaller variance.
> 
> Signed-off-by: Paolo Bonzini 

Acked-by: Rik van Riel 



Re: [PATCH 4/4] kvm,rcu: use RCU extended quiescent state when running KVM guest

2015-02-05 Thread Paul E. McKenney
On Thu, Feb 05, 2015 at 11:52:37AM -0500, Rik van Riel wrote:
> On 02/05/2015 11:44 AM, Christian Borntraeger wrote:
> > Am 05.02.2015 um 17:35 schrieb r...@redhat.com:
> >> From: Rik van Riel 
> >>
> >> The host kernel is not doing anything while the CPU is executing
> >> a KVM guest VCPU, so it can be marked as being in an extended
> >> quiescent state, identical to that used when running user space
> >> code.
> >>
> >> The only exception to that rule is when the host handles an
> >> interrupt, which is already handled by the irq code, which
> >> calls rcu_irq_enter and rcu_irq_exit.
> >>
> >> The guest_enter and guest_exit functions already switch vtime
> >> accounting independent of context tracking, so leave those calls
> >> where they are, instead of moving them into the context tracking
> >> code.
> >>
> >> Signed-off-by: Rik van Riel 
> >> ---
> >>  include/linux/context_tracking.h   | 8 +++++++-
> >>  include/linux/context_tracking_state.h | 1 +
> >>  2 files changed, 8 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
> >> index bd9f000fc98d..a5d3bb44b897 100644
> >> --- a/include/linux/context_tracking.h
> >> +++ b/include/linux/context_tracking.h
> >> @@ -43,7 +43,7 @@ static inline enum ctx_state exception_enter(void)
> >>  static inline void exception_exit(enum ctx_state prev_ctx)
> >>  {
> >>if (context_tracking_is_enabled()) {
> >> -  if (prev_ctx == IN_USER)
> >> +  if (prev_ctx == IN_USER || prev_ctx == IN_GUEST)
> >>context_tracking_user_enter(prev_ctx);
> >>}
> >>  }
> >> @@ -78,6 +78,9 @@ static inline void guest_enter(void)
> >>vtime_guest_enter(current);
> >>else
> >>current->flags |= PF_VCPU;
> >> +
> >> +  if (context_tracking_is_enabled())
> >> +  context_tracking_user_enter(IN_GUEST);
> >>  }
> > 
> > 
> > > Couldn't we make
> > rcu_virt_note_context_switch(smp_processor_id());
> > conditional in include/linux/kvm_host.h (kvm_guest_enter)
> > 
> > e.g. something like
> > if (!context_tracking_is_enabled())
> > rcu_virt_note_context_switch(smp_processor_id());
> 
> Possibly. I considered the same, but I do not know whether
> or not just rcu_user_enter / rcu_user_exit is enough.
> 
> I could certainly try it out and see whether anything
> explodes, but I am not convinced that is careful enough
> when it comes to handling RCU code...
> 
> Paul? :)

That can fail for some odd combinations of Kconfig and boot parameters.
As I understand it, you instead want the following:

	if (!tick_nohz_full_cpu(smp_processor_id()))
		rcu_virt_note_context_switch(smp_processor_id());

Frederic might know of a better approach.

Thanx, Paul



Re: [PATCH RFC] kvm: x86: add halt_poll module parameter

2015-02-05 Thread Radim Krčmář
2015-02-05 17:05+0100, Paolo Bonzini:
> This patch introduces a new module parameter for the KVM module; when it
> is present, KVM attempts a bit of polling on every HLT before scheduling
> itself out via kvm_vcpu_block.
> 
> This parameter helps a lot for latency-bound workloads---in particular
> I tested it with O_DSYNC writes with a battery-backed disk in the host.
> In this case, writes are fast (because the data doesn't have to go all
> the way to the platters) but they cannot be merged by either the host or
> the guest.  KVM's performance here is usually around 30% of bare metal,
> or 50% if you use cache=directsync or cache=writethrough (these
> parameters prevent the guest from sending pointless flush requests, and
> at the same time they are not slow because of the battery-backed cache).
> The bad performance happens because on every halt the host CPU decides
> to halt itself too.
>  When the interrupt comes, the vCPU thread is then
> migrated to a new physical CPU,

Unrelated:  are drawbacks of migrating vs waking up evaluated correctly?

> and in general the latency is horrible
> because the vCPU thread has to be scheduled back in.
> 
> With this patch performance reaches 60-65% of bare metal and, more
> important,
>99% of what you get if you use idle=poll in the guest.

(Hm, I would have thought that this can outperform idle=poll ...)

>This
> means that the tunable gets rid of this particular bottleneck, and more
> work can be done to improve performance in the kernel or QEMU.
> 
> Of course there is some price to pay; every time an otherwise idle vCPU
> is interrupted by an interrupt, it will poll unnecessarily and thus
> impose a little load on the host.  The above results were obtained with
> a mostly random value of the parameter (2000000)

(I guess that translated to about 0.66 ms.)

>   and the load was around
> 1.5-2.5% CPU usage on one of the host's core for each idle guest vCPU.

(10 exits_per_second / 4 VCPUs * 0.0066 second_per_exit = 1.65% load.)

> The patch also adds a new stat, /sys/kernel/debug/kvm/halt_successful_poll,
> that can be used to tune the parameter.  It counts how many HLT
> instructions received an interrupt during the polling period; each
> successful poll avoids that Linux schedules the VCPU thread out and back
> in, and may also avoid a likely trip to C1 and back for the physical CPU.
> 
> While the VM is idle, a Linux 4 VCPU VM halts around 10 times per second.

(Looks like tickless.)

> Of these halts, almost all are failed polls.  During the benchmark,
> instead, basically all halts end within the polling period, except a more
> or less constant stream of 50 per second coming from vCPUs that are not
> running the benchmark.  The wasted time is thus very low.  Things may
> be slightly different for Windows VMs, which have a ~10 ms timer tick.

(Windows userspace can force timer tick down to every 0.5 ms :)

> The effect is also visible on Marcelo's recently-introduced latency
> test for the TSC deadline timer.  Though of course a non-RT kernel has
> awful latency bounds, the latency of the timer is around 8000-10000 clock
> cycles compared to 20000-120000 without setting halt_poll.  For the TSC
> deadline timer, thus, the effect is both a smaller average latency and
> a smaller variance.
> 
> Signed-off-by: Paolo Bonzini 
> ---

It is going to be hard to balance between performance and wasted idle
time, so it's good to have it disabled by default,
(I guess the value is going to be overestimated when used.)

Reviewed-by: Radim Krčmář 

> +++ b/arch/x86/kvm/x86.c
> @@ -5819,13 +5823,29 @@ void kvm_arch_exit(void)
>  int kvm_emulate_halt(struct kvm_vcpu *vcpu)
> + if (halt_poll) {
> + u64 start, curr;
> + rdtscll(start);

(I agree that all x86 with VMM have TSC.)

> + do {
> + /*
> +  * This sets KVM_REQ_UNHALT if an interrupt
> +  * arrives.
> +  */
> + if (kvm_vcpu_check_block(vcpu) < 0) {
> + ++vcpu->stat.halt_successful_poll;
> + break;
> + }
> + rdtscll(curr);
> + } while(!need_resched() && curr - start < halt_poll);

(We can get preempted and possibly rescheduled to another CPU.
 With unstable TSC, this could loop longer than was requested;
 likely not by much thanks to unsigned though -- ok -- rare cases
 are ignored.)
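
A hedged sketch of one way to sidestep raw TSC reads in this loop,
assuming the parameter were reinterpreted in nanoseconds (hypothetical,
not what this patch does):

	ktime_t start = ktime_get();

	do {
		if (kvm_vcpu_check_block(vcpu) < 0) {
			++vcpu->stat.halt_successful_poll;
			break;
		}
	} while (!need_resched() &&
		 ktime_before(ktime_get(), ktime_add_ns(start, halt_poll)));

ktime_get() stays monotonic across CPU migrations, so the loop cannot
overshoot the requested window the way an unstable TSC can.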

> +++ b/virt/kvm/kvm_main.c
> @@ -1813,6 +1813,20 @@ void mark_page_dirty(struct kvm *kvm, gfn_t gfn)
>  }
>  EXPORT_SYMBOL_GPL(mark_page_dirty);
>  
> +int kvm_vcpu_check_block(struct kvm_vcpu *vcpu)

(I think the returned 'int' will be wasted on us, so it could be a
 'bool', which would allow a better name ...)

> +{
> + 

Re: [PATCH 4/4] kvm,rcu: use RCU extended quiescent state when running KVM guest

2015-02-05 Thread Rik van Riel
On 02/05/2015 12:50 PM, Paul E. McKenney wrote:
> On Thu, Feb 05, 2015 at 11:52:37AM -0500, Rik van Riel wrote:
>> On 02/05/2015 11:44 AM, Christian Borntraeger wrote:
>>> Am 05.02.2015 um 17:35 schrieb r...@redhat.com:
 From: Rik van Riel 

 The host kernel is not doing anything while the CPU is executing
 a KVM guest VCPU, so it can be marked as being in an extended
 quiescent state, identical to that used when running user space
 code.

 The only exception to that rule is when the host handles an
 interrupt, which is already handled by the irq code, which
 calls rcu_irq_enter and rcu_irq_exit.

 The guest_enter and guest_exit functions already switch vtime
 accounting independent of context tracking, so leave those calls
 where they are, instead of moving them into the context tracking
 code.

 Signed-off-by: Rik van Riel 
 ---
>   include/linux/context_tracking.h   | 8 +++++++-
  include/linux/context_tracking_state.h | 1 +
  2 files changed, 8 insertions(+), 1 deletion(-)

>  diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
 index bd9f000fc98d..a5d3bb44b897 100644
 --- a/include/linux/context_tracking.h
 +++ b/include/linux/context_tracking.h
 @@ -43,7 +43,7 @@ static inline enum ctx_state exception_enter(void)
  static inline void exception_exit(enum ctx_state prev_ctx)
  {
if (context_tracking_is_enabled()) {
 -  if (prev_ctx == IN_USER)
 +  if (prev_ctx == IN_USER || prev_ctx == IN_GUEST)
context_tracking_user_enter(prev_ctx);
}
  }
 @@ -78,6 +78,9 @@ static inline void guest_enter(void)
vtime_guest_enter(current);
else
current->flags |= PF_VCPU;
 +
 +  if (context_tracking_is_enabled())
 +  context_tracking_user_enter(IN_GUEST);
  }
>>>
>>>
> >>> Couldn't we make
>>> rcu_virt_note_context_switch(smp_processor_id());
>>> conditional in include/linux/kvm_host.h (kvm_guest_enter)
>>>
>>> e.g. something like
>>> if (!context_tracking_is_enabled())
>>> rcu_virt_note_context_switch(smp_processor_id());
>>
>> Possibly. I considered the same, but I do not know whether
>> or not just rcu_user_enter / rcu_user_exit is enough.
>>
>> I could certainly try it out and see whether anything
>> explodes, but I am not convinced that is careful enough
>> when it comes to handling RCU code...
>>
>> Paul? :)
> 
> That can fail for some odd combinations of Kconfig and boot parameters.
> As I understand it, you instead want the following:
> 
>   if (!tick_nohz_full_cpu(smp_processor_id()))
>   rcu_virt_note_context_switch(smp_processor_id());
> 
> Frederic might know of a better approach.

Interesting, in what cases would that happen?

If context_tracking_is_enabled() we end up eventually
calling into rcu_user_enter & rcu_user_exit.

If it is not enabled, we call rcu_virt_note_context_switch.

In what cases would we need to call both?

I don't see exit-to-userspace do the things that
rcu_virt_note_context_switch does, and do not understand
why userspace is fine with that...



Re: [Xen-devel] [PATCH v5 0/2] x86/xen: add xen hypercall preemption

2015-02-05 Thread Luis R. Rodriguez
On Thu, Feb 05, 2015 at 12:47:15PM +, David Vrabel wrote:
> On 27/01/15 01:51, Luis R. Rodriguez wrote:
> > From: "Luis R. Rodriguez" 
> > 
> > This v5 nukes tracing as David said it was useless, it also
> > only adds support for 64-bit as it's the only thing I can test,
> > and slightly modifies the documentation in code as to why we
> > want this. The no-kprobe thing is left in place as I haven't
> > heard confirmation it's kosher to remove it.
> 
> FYI, I went back and reviewed my last v4 patch and folded the
> xen_end_upcall() change into it instead.
> 
> http://lists.xen.org/archives/html/xen-devel/2015-02/msg00682.html
> 
> This will be easier for arm/arm64 to make use of in the future since
> there are fewer arch-specific changes.

I had taken a similar approach first and determined using pt_regs was
safer after advice from Andy. I'll provide feedback on your new patchset.

 Luis


Re: [PATCH RFC] kvm: x86: add halt_poll module parameter

2015-02-05 Thread Jan Kiszka
On 2015-02-05 17:05, Paolo Bonzini wrote:
> This patch introduces a new module parameter for the KVM module; when it
> is present, KVM attempts a bit of polling on every HLT before scheduling
> itself out via kvm_vcpu_block.

Wouldn't it be better to tune this on a per-VM basis? Think of mixed
workloads with some latency-sensitive and some standard VMs.

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux


Re: [PATCH 4/4] kvm,rcu: use RCU extended quiescent state when running KVM guest

2015-02-05 Thread Paul E. McKenney
On Thu, Feb 05, 2015 at 01:09:19PM -0500, Rik van Riel wrote:
> On 02/05/2015 12:50 PM, Paul E. McKenney wrote:
> > On Thu, Feb 05, 2015 at 11:52:37AM -0500, Rik van Riel wrote:
> >> On 02/05/2015 11:44 AM, Christian Borntraeger wrote:
> >>> Am 05.02.2015 um 17:35 schrieb r...@redhat.com:
>  From: Rik van Riel 
> 
>  The host kernel is not doing anything while the CPU is executing
>  a KVM guest VCPU, so it can be marked as being in an extended
>  quiescent state, identical to that used when running user space
>  code.
> 
>  The only exception to that rule is when the host handles an
>  interrupt, which is already handled by the irq code, which
>  calls rcu_irq_enter and rcu_irq_exit.
> 
>  The guest_enter and guest_exit functions already switch vtime
>  accounting independent of context tracking, so leave those calls
>  where they are, instead of moving them into the context tracking
>  code.
> 
>  Signed-off-by: Rik van Riel 
>  ---
>   include/linux/context_tracking.h   | 8 +++++++-
>   include/linux/context_tracking_state.h | 1 +
>   2 files changed, 8 insertions(+), 1 deletion(-)
> 
>  diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
>  index bd9f000fc98d..a5d3bb44b897 100644
>  --- a/include/linux/context_tracking.h
>  +++ b/include/linux/context_tracking.h
>  @@ -43,7 +43,7 @@ static inline enum ctx_state exception_enter(void)
>   static inline void exception_exit(enum ctx_state prev_ctx)
>   {
>   if (context_tracking_is_enabled()) {
>  -if (prev_ctx == IN_USER)
>  +if (prev_ctx == IN_USER || prev_ctx == IN_GUEST)
>   context_tracking_user_enter(prev_ctx);
>   }
>   }
>  @@ -78,6 +78,9 @@ static inline void guest_enter(void)
>   vtime_guest_enter(current);
>   else
>   current->flags |= PF_VCPU;
>  +
>  +if (context_tracking_is_enabled())
>  +context_tracking_user_enter(IN_GUEST);
>   }
> >>>
> >>>
> >>> Couldn't we make
> >>> rcu_virt_note_context_switch(smp_processor_id());
> >>> conditional in include/linux/kvm_host.h (kvm_guest_enter)
> >>>
> >>> e.g. something like
> >>> if (!context_tracking_is_enabled())
> >>>   rcu_virt_note_context_switch(smp_processor_id());
> >>
> >> Possibly. I considered the same, but I do not know whether
> >> or not just rcu_user_enter / rcu_user_exit is enough.
> >>
> >> I could certainly try it out and see whether anything
> >> explodes, but I am not convinced that is careful enough
> >> when it comes to handling RCU code...
> >>
> >> Paul? :)
> > 
> > That can fail for some odd combinations of Kconfig and boot parameters.
> > As I understand it, you instead want the following:
> > 
> > if (!tick_nohz_full_cpu(smp_processor_id()))
> > rcu_virt_note_context_switch(smp_processor_id());
> > 
> > Frederic might know of a better approach.
> 
> Interesting, in what cases would that happen?

My concern, perhaps misplaced, is that context_tracking_is_enabled(),
but that the current CPU is not a nohz_full= CPU.  In that case, the
context-tracking code would be within its rights to not tell RCU about
the transition to the guest, and thus RCU would be unaware that the
CPU was in a quiescent state.

In most workloads, you would expect the CPU to get interrupted or
preempted or something at some point, which would eventually inform
RCU, but too much delay would result in the infamous RCU CPU stall
warning message.  So let's code a bit more defensively.

> If context_tracking_is_enabled() we end up eventually
> calling into rcu_user_enter & rcu_user_exit.
> 
> If it is not enabled, we call rcu_virt_note_context_switch.
> 
> In what cases would we need to call both?
> 
> I don't see exit-to-userspace do the things that
> rcu_virt_note_context_switch does, and do not understand
> why userspace is fine with that...

The real danger is doing neither.

On tick_nohz_full_cpu() CPUs, the exit-to-userspace code should invoke
rcu_user_enter(), which sets some per-CPU state telling RCU to ignore
that CPU, since it cannot possibly do host RCU read-side critical sections
while running a guest.

In contrast, a non-tick_nohz_full_cpu() CPU doesn't let RCU
know that it is executing in a guest or in userspace.  So the
rcu_virt_note_context_switch() does the notification in that case.

Thanx, Paul



Re: [PATCH 4/4] kvm,rcu: use RCU extended quiescent state when running KVM guest

2015-02-05 Thread Rik van Riel
On 02/05/2015 01:56 PM, Paul E. McKenney wrote:
> On Thu, Feb 05, 2015 at 01:09:19PM -0500, Rik van Riel wrote:
>> On 02/05/2015 12:50 PM, Paul E. McKenney wrote:
>>> On Thu, Feb 05, 2015 at 11:52:37AM -0500, Rik van Riel wrote:
 On 02/05/2015 11:44 AM, Christian Borntraeger wrote:
> Am 05.02.2015 um 17:35 schrieb r...@redhat.com:
>> From: Rik van Riel 
>>
>> The host kernel is not doing anything while the CPU is executing
>> a KVM guest VCPU, so it can be marked as being in an extended
>> quiescent state, identical to that used when running user space
>> code.
>>
>> The only exception to that rule is when the host handles an
>> interrupt, which is already handled by the irq code, which
>> calls rcu_irq_enter and rcu_irq_exit.
>>
>> The guest_enter and guest_exit functions already switch vtime
>> accounting independent of context tracking, so leave those calls
>> where they are, instead of moving them into the context tracking
>> code.
>>
>> Signed-off-by: Rik van Riel 
>> ---
>>  include/linux/context_tracking.h   | 8 +++++++-
>>  include/linux/context_tracking_state.h | 1 +
>>  2 files changed, 8 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
>> index bd9f000fc98d..a5d3bb44b897 100644
>> --- a/include/linux/context_tracking.h
>> +++ b/include/linux/context_tracking.h
>> @@ -43,7 +43,7 @@ static inline enum ctx_state exception_enter(void)
>>  static inline void exception_exit(enum ctx_state prev_ctx)
>>  {
>>  if (context_tracking_is_enabled()) {
>> -if (prev_ctx == IN_USER)
>> +if (prev_ctx == IN_USER || prev_ctx == IN_GUEST)
>>  context_tracking_user_enter(prev_ctx);
>>  }
>>  }
>> @@ -78,6 +78,9 @@ static inline void guest_enter(void)
>>  vtime_guest_enter(current);
>>  else
>>  current->flags |= PF_VCPU;
>> +
>> +if (context_tracking_is_enabled())
>> +context_tracking_user_enter(IN_GUEST);
>>  }
>
>
> Couldn't we make
> rcu_virt_note_context_switch(smp_processor_id());
> conditional in include/linux/kvm_host.h (kvm_guest_enter)
>
> e.g. something like
> if (!context_tracking_is_enabled())
>   rcu_virt_note_context_switch(smp_processor_id());

 Possibly. I considered the same, but I do not know whether
 or not just rcu_user_enter / rcu_user_exit is enough.

 I could certainly try it out and see whether anything
 explodes, but I am not convinced that is careful enough
 when it comes to handling RCU code...

 Paul? :)
>>>
>>> That can fail for some odd combinations of Kconfig and boot parameters.
>>> As I understand it, you instead want the following:
>>>
>>> if (!tick_nohz_full_cpu(smp_processor_id()))
>>> rcu_virt_note_context_switch(smp_processor_id());
>>>
>>> Frederic might know of a better approach.
>>
>> Interesting, in what cases would that happen?
> 
> My concern, perhaps misplaced, is that context_tracking_is_enabled(),
> but that the current CPU is not a nohz_full= CPU.  In that case, the
> context-tracking code would be within its rights to not tell RCU about
> the transition to the guest, and thus RCU would be unaware that the
> CPU was in a quiescent state.
> 
> In most workloads, you would expect the CPU to get interrupted or
> preempted or something at some point, which would eventually inform
> RCU, but too much delay would result in the infamous RCU CPU stall
> warning message.  So let's code a bit more defensively.
> 
>> If context_tracking_is_enabled() we end up eventually
>> calling into rcu_user_enter & rcu_user_exit.
>>
>> If it is not enabled, we call rcu_virt_note_context_switch.
>>
>> In what cases would we need to call both?
>>
>> I don't see exit-to-userspace do the things that
>> rcu_virt_note_context_switch does, and do not understand
>> why userspace is fine with that...
> 
> The real danger is doing neither.
> 
> On tick_nohz_full_cpu() CPUs, the exit-to-userspace code should invoke
> rcu_user_enter(), which sets some per-CPU state telling RCU to ignore
> that CPU, since it cannot possibly do host RCU read-side critical sections
> while running a guest.

Ahhh, I understand now.  Thank you for your
explanation, Paul.

I will make sure your suggestion is in version 2 of
this series.
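
For illustration, a sketch of what a version 2 kvm_guest_enter() might
look like with Paul's suggestion folded in; this is hypothetical code,
assuming the IN_GUEST context-tracking patches from this series, and
the function name is illustrative:

#include <linux/context_tracking.h>
#include <linux/rcupdate.h>
#include <linux/smp.h>
#include <linux/tick.h>

static inline void kvm_guest_enter_sketch(void)
{
	/* vtime accounting, plus rcu_user_enter() via
	 * context_tracking_user_enter(IN_GUEST) where context
	 * tracking is active */
	guest_enter();

	/*
	 * A CPU that is not nohz_full may never reach rcu_user_enter()
	 * even with context tracking compiled in, so note the quiescent
	 * state explicitly to avoid RCU CPU stall warnings.
	 */
	if (!tick_nohz_full_cpu(smp_processor_id()))
		rcu_virt_note_context_switch(smp_processor_id());
}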



Re: [PATCH 4/4] kvm,rcu: use RCU extended quiescent state when running KVM guest

2015-02-05 Thread Rik van Riel
On 02/05/2015 01:56 PM, Paul E. McKenney wrote:

> The real danger is doing neither.
> 
> On tick_nohz_full_cpu() CPUs, the exit-to-userspace code should invoke
> rcu_user_enter(), which sets some per-CPU state telling RCU to ignore
> that CPU, since it cannot possibly do host RCU read-side critical sections
> while running a guest.
> 
> In contrast, a non-tick_nohz_full_cpu() CPU doesn't let RCU
> know that it is executing in a guest or in userspace.  So the
> rcu_virt_note_context_switch() does the notification in that case.

Looking at context_tracking.h, I see the
function context_tracking_cpu_is_enabled().

That looks like it should do the right thing
in this case.


Re: [PATCH RFC] kvm: x86: add halt_poll module parameter

2015-02-05 Thread Paolo Bonzini


On 05/02/2015 18:53, Radim Krčmář wrote:
> >99% of what you get if you use idle=poll in the guest.
> 
> (Hm, I would have thought that this can outperform idle=poll ...)

It outperforms idle=poll in the host.  A vmexit is probably too cheap to
outperform having idle=poll in the guest.

Paolo


Re: [PATCH RFC] kvm: x86: add halt_poll module parameter

2015-02-05 Thread Paolo Bonzini


On 05/02/2015 19:55, Jan Kiszka wrote:
> > This patch introduces a new module parameter for the KVM module; when it
> > is present, KVM attempts a bit of polling on every HLT before scheduling
> > itself out via kvm_vcpu_block.
>
> Wouldn't it be better to tune this on a per-VM basis? Think of mixed
> workloads with some latency-sensitive and some standard VMs.

Yes, but:

1) this turned out to be very cheap, so a per-host tunable is not too bad;

2) it also affects only very few workloads (for example network
workloads can already do polling in the guest) so it only affects few
people;

3) long term anyway we want it to auto tune, which is better than tuning
it per-VM.

Paolo


Re: [PATCH RFC] kvm: x86: add halt_poll module parameter

2015-02-05 Thread Rik van Riel
On 02/05/2015 02:20 PM, Paolo Bonzini wrote:
> 
> 
> On 05/02/2015 19:55, Jan Kiszka wrote:
>>> This patch introduces a new module parameter for the KVM module; when it
>>> is present, KVM attempts a bit of polling on every HLT before scheduling
>>> itself out via kvm_vcpu_block.
>>
>> Wouldn't it be better to tune this on a per-VM basis? Think of mixed
>> workloads with some latency-sensitive and some standard VMs.
> 
> Yes, but:
> 
> 1) this turned out to be very cheap, so a per-host tunable is not too bad;
> 
> 2) it also affects only very few workloads (for example network
> workloads can already do polling in the guest) so it only affects few
> people;
> 
> 3) long term anyway we want it to auto tune, which is better than tuning
> it per-VM.

We may want to auto tune it per VM.

However, if we make auto tuning work well, I do not
think we want to expose a user visible tunable per
VM, and commit to keeping that kind of interface
around forever.



Re: [PATCH 4/4] kvm,rcu: use RCU extended quiescent state when running KVM guest

2015-02-05 Thread Paul E. McKenney
On Thu, Feb 05, 2015 at 02:02:35PM -0500, Rik van Riel wrote:
> On 02/05/2015 01:56 PM, Paul E. McKenney wrote:
> 
> > The real danger is doing neither.
> > 
> > On tick_nohz_full_cpu() CPUs, the exit-to-userspace code should invoke
> > rcu_user_enter(), which sets some per-CPU state telling RCU to ignore
> > that CPU, since it cannot possibly do host RCU read-side critical sections
> > while running a guest.
> > 
> > In contrast, a non-tick_nohz_full_cpu() CPU doesn't let RCU
> > know that it is executing in a guest or in userspace.  So the
> > rcu_virt_note_context_switch() does the notification in that case.
> 
> Looking at context_tracking.h, I see the
> function context_tracking_cpu_is_enabled().
> 
> That looks like it should do the right thing
> in this case.

Right you are -- that same check is used to guard the
context_tracking_user_enter() function's call to rcu_user_enter().

Not sure why it open-codes the check rather than invoking
context_tracking_cpu_is_enabled().  Hmmm  One reason is that
the context_tracking_cpu_is_enabled() function isn't available in
that context, according to my compiler.  ;-)

Thanx, Paul
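For reference, the named helper and the open-coded check are the same
test; a sketch assuming the context_tracking_state.h layout quoted in
this thread:

        /* The named form: */
        static inline bool context_tracking_cpu_is_enabled(void)
        {
                return __this_cpu_read(context_tracking.active);
        }

        /*
         * ...which is exactly the guard context_tracking_user_enter()
         * open-codes before calling rcu_user_enter():
         *
         *      if (__this_cpu_read(context_tracking.active)) {
         *              ...
         *              rcu_user_enter();
         *      }
         */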



Re: [PATCH RFC] kvm: x86: add halt_poll module parameter

2015-02-05 Thread Paolo Bonzini


On 05/02/2015 20:23, Rik van Riel wrote:
>> > 3) long term anyway we want it to auto tune, which is better than tuning
>> > it per-VM.
> We may want to auto tune it per VM.

We may even want to auto tune it per VCPU.

> However, if we make auto tuning work well, I do not
> think we want to expose a user visible tunable per
> VM, and commit to keeping that kind of interface
> around forever.

Exactly.  We probably want module parameters to tune the minimum/maximum
values (which includes the special cases of disabling polling
altogether, and disabling the autotuning while leaving polling enabled),
but committing to a per-VM interface is premature.

Paolo
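A sketch of what such bounds could look like; halt_poll_ns_min/max and
the per-vCPU field are illustrative names, not code from any posted patch:

        /* Global bounds as module parameters, clamping a per-vCPU value
         * that an autotuner would adjust.  min == max == 0 disables
         * polling; min == max != 0 disables autotuning but keeps polling. */
        static unsigned int halt_poll_ns_min;
        module_param(halt_poll_ns_min, uint, S_IRUGO | S_IWUSR);

        static unsigned int halt_poll_ns_max = 200000;
        module_param(halt_poll_ns_max, uint, S_IRUGO | S_IWUSR);

        static unsigned int vcpu_halt_poll_ns(struct kvm_vcpu *vcpu)
        {
                /* vcpu->halt_poll_ns: hypothetical auto-tuned field */
                return clamp(vcpu->halt_poll_ns, halt_poll_ns_min,
                             halt_poll_ns_max);
        }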


Re: [PATCH 4/4] kvm,rcu: use RCU extended quiescent state when running KVM guest

2015-02-05 Thread Rik van Riel
On 02/05/2015 02:27 PM, Paul E. McKenney wrote:
> On Thu, Feb 05, 2015 at 02:02:35PM -0500, Rik van Riel wrote:

>> Looking at context_tracking.h, I see the
>> function context_tracking_cpu_is_enabled().
>>
>> That looks like it should do the right thing
>> in this case.
> 
> Right you are -- that same check is used to guard the
> context_tracking_user_enter() function's call to rcu_user_enter().

My test suggests it is working.

v2 incoming :)



[PATCH v2 0/5] rcu,nohz,kvm: use RCU extended quiescent state when running KVM guest

2015-02-05 Thread riel
When running a KVM guest on a system with NOHZ_FULL enabled, and the
KVM guest running with idle=poll mode, we still get wakeups of the
rcuos/N threads.

This problem has already been solved for user space by telling the
RCU subsystem that the CPU is in an extended quiescent state while
running user space code.

This patch series extends that code a little bit to make it usable
to track KVM guest space, too.

I tested the code by booting a KVM guest with idle=poll, on a system
with NOHZ_FULL enabled on most CPUs, and a VCPU thread bound to a
CPU. In a 10 second interval, rcuos/N threads on other CPUs got woken
up several times, while the rcuos thread on the CPU running the bound
and always running VCPU thread never got woken up once.

Thanks to Christian Borntraeger and Paul McKenney for reviewing the
first version of this patch series, and helping optimize patch 4/5.



[PATCH 5/5] nohz: add stub context_tracking_is_enabled

2015-02-05 Thread riel
From: Rik van Riel 

With code elsewhere doing something conditional on whether or not
context tracking is enabled, we want a stub function that tells us
context tracking is not enabled, when CONFIG_CONTEXT_TRACKING is
not set.

Signed-off-by: Rik van Riel 
---
 include/linux/context_tracking_state.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/linux/context_tracking_state.h 
b/include/linux/context_tracking_state.h
index f3ef027af749..90a7bab8779e 100644
--- a/include/linux/context_tracking_state.h
+++ b/include/linux/context_tracking_state.h
@@ -40,6 +40,8 @@ static inline bool context_tracking_in_user(void)
 #else
 static inline bool context_tracking_in_user(void) { return false; }
 static inline bool context_tracking_active(void) { return false; }
+static inline bool context_tracking_is_enabled(void) { return false; }
+static inline bool context_tracking_cpu_is_enabled(void) { return false; }
 #endif /* CONFIG_CONTEXT_TRACKING */
 
 #endif
-- 
1.9.3



[PATCH 1/5] rcu,nohz: add state parameter to context_tracking_user_enter/exit

2015-02-05 Thread riel
From: Rik van Riel 

Add the expected ctx_state as a parameter to context_tracking_user_enter
and context_tracking_user_exit, allowing the same functions to not just
track kernel <> user space switching, but also kernel <> guest transitions.

Signed-off-by: Rik van Riel 
---
 include/linux/context_tracking.h | 12 ++--
 kernel/context_tracking.c| 10 +-
 2 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
index 37b81bd51ec0..bd9f000fc98d 100644
--- a/include/linux/context_tracking.h
+++ b/include/linux/context_tracking.h
@@ -10,21 +10,21 @@
 #ifdef CONFIG_CONTEXT_TRACKING
 extern void context_tracking_cpu_set(int cpu);
 
-extern void context_tracking_user_enter(void);
-extern void context_tracking_user_exit(void);
+extern void context_tracking_user_enter(enum ctx_state state);
+extern void context_tracking_user_exit(enum ctx_state state);
 extern void __context_tracking_task_switch(struct task_struct *prev,
   struct task_struct *next);
 
 static inline void user_enter(void)
 {
if (context_tracking_is_enabled())
-   context_tracking_user_enter();
+   context_tracking_user_enter(IN_USER);
 
 }
 static inline void user_exit(void)
 {
if (context_tracking_is_enabled())
-   context_tracking_user_exit();
+   context_tracking_user_exit(IN_USER);
 }
 
 static inline enum ctx_state exception_enter(void)
@@ -35,7 +35,7 @@ static inline enum ctx_state exception_enter(void)
return 0;
 
prev_ctx = this_cpu_read(context_tracking.state);
-   context_tracking_user_exit();
+   context_tracking_user_exit(prev_ctx);
 
return prev_ctx;
 }
@@ -44,7 +44,7 @@ static inline void exception_exit(enum ctx_state prev_ctx)
 {
if (context_tracking_is_enabled()) {
if (prev_ctx == IN_USER)
-   context_tracking_user_enter();
+   context_tracking_user_enter(prev_ctx);
}
 }
 
diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index 937ecdfdf258..4c010787c9ec 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -47,7 +47,7 @@ void context_tracking_cpu_set(int cpu)
  * to execute won't use any RCU read side critical section because this
  * function sets RCU in extended quiescent state.
  */
-void context_tracking_user_enter(void)
+void context_tracking_user_enter(enum ctx_state state)
 {
unsigned long flags;
 
@@ -75,7 +75,7 @@ void context_tracking_user_enter(void)
WARN_ON_ONCE(!current->mm);
 
local_irq_save(flags);
-   if ( __this_cpu_read(context_tracking.state) != IN_USER) {
+   if ( __this_cpu_read(context_tracking.state) != state) {
if (__this_cpu_read(context_tracking.active)) {
trace_user_enter(0);
/*
@@ -101,7 +101,7 @@ void context_tracking_user_enter(void)
 * OTOH we can spare the calls to vtime and RCU when 
context_tracking.active
 * is false because we know that CPU is not tickless.
 */
-   __this_cpu_write(context_tracking.state, IN_USER);
+   __this_cpu_write(context_tracking.state, state);
}
local_irq_restore(flags);
 }
@@ -118,7 +118,7 @@ NOKPROBE_SYMBOL(context_tracking_user_enter);
  * This call supports re-entrancy. This way it can be called from any exception
  * handler without needing to know if we came from userspace or not.
  */
-void context_tracking_user_exit(void)
+void context_tracking_user_exit(enum ctx_state state)
 {
unsigned long flags;
 
@@ -129,7 +129,7 @@ void context_tracking_user_exit(void)
return;
 
local_irq_save(flags);
-   if (__this_cpu_read(context_tracking.state) == IN_USER) {
+   if (__this_cpu_read(context_tracking.state) == state) {
if (__this_cpu_read(context_tracking.active)) {
/*
 * We are going to run code that may use RCU. Inform
-- 
1.9.3



[PATCH 4/5] kvm,rcu,nohz: use RCU extended quiescent state when running KVM guest

2015-02-05 Thread riel
From: Rik van Riel 

The host kernel is not doing anything while the CPU is executing
a KVM guest VCPU, so it can be marked as being in an extended
quiescent state, identical to that used when running user space
code.

The only exception to that rule is when the host handles an
interrupt, which is already handled by the irq code, which
calls rcu_irq_enter and rcu_irq_exit.

The guest_enter and guest_exit functions already switch vtime
accounting independent of context tracking, so leave those calls
where they are, instead of moving them into the context tracking
code.

Signed-off-by: Rik van Riel 
---
 include/linux/context_tracking.h   | 8 +++-
 include/linux/context_tracking_state.h | 1 +
 include/linux/kvm_host.h   | 3 ++-
 3 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
index bd9f000fc98d..a5d3bb44b897 100644
--- a/include/linux/context_tracking.h
+++ b/include/linux/context_tracking.h
@@ -43,7 +43,7 @@ static inline enum ctx_state exception_enter(void)
 static inline void exception_exit(enum ctx_state prev_ctx)
 {
if (context_tracking_is_enabled()) {
-   if (prev_ctx == IN_USER)
+   if (prev_ctx == IN_USER || prev_ctx == IN_GUEST)
context_tracking_user_enter(prev_ctx);
}
 }
@@ -78,6 +78,9 @@ static inline void guest_enter(void)
vtime_guest_enter(current);
else
current->flags |= PF_VCPU;
+
+   if (context_tracking_is_enabled())
+   context_tracking_user_enter(IN_GUEST);
 }
 
 static inline void guest_exit(void)
@@ -86,6 +89,9 @@ static inline void guest_exit(void)
vtime_guest_exit(current);
else
current->flags &= ~PF_VCPU;
+
+   if (context_tracking_is_enabled())
+   context_tracking_user_exit(IN_GUEST);
 }
 
 #else
diff --git a/include/linux/context_tracking_state.h 
b/include/linux/context_tracking_state.h
index 97a81225d037..f3ef027af749 100644
--- a/include/linux/context_tracking_state.h
+++ b/include/linux/context_tracking_state.h
@@ -15,6 +15,7 @@ struct context_tracking {
enum ctx_state {
IN_KERNEL = 0,
IN_USER,
+   IN_GUEST,
} state;
 };
 
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 26f106022c88..c7828a6a9614 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -772,7 +772,8 @@ static inline void kvm_guest_enter(void)
 * one time slice). Lets treat guest mode as quiescent state, just like
 * we do with user-mode execution.
 */
-   rcu_virt_note_context_switch(smp_processor_id());
+   if (!context_tracking_cpu_is_enabled())
+   rcu_virt_note_context_switch(smp_processor_id());
 }
 
 static inline void kvm_guest_exit(void)
-- 
1.9.3



[PATCH 3/5] nohz,kvm: export context_tracking_user_enter/exit

2015-02-05 Thread riel
From: Rik van Riel 

Export context_tracking_user_enter/exit so it can be used by KVM.

Signed-off-by: Rik van Riel 
---
 kernel/context_tracking.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index 97806c4deec5..a3f4a2840637 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -107,6 +107,7 @@ void context_tracking_user_enter(enum ctx_state state)
local_irq_restore(flags);
 }
 NOKPROBE_SYMBOL(context_tracking_user_enter);
+EXPORT_SYMBOL_GPL(context_tracking_user_enter);
 
 /**
  * context_tracking_user_exit - Inform the context tracking that the CPU is
@@ -146,6 +147,7 @@ void context_tracking_user_exit(enum ctx_state state)
local_irq_restore(flags);
 }
 NOKPROBE_SYMBOL(context_tracking_user_exit);
+EXPORT_SYMBOL_GPL(context_tracking_user_exit);
 
 /**
  * __context_tracking_task_switch - context switch the syscall callbacks
-- 
1.9.3



[PATCH 2/5] rcu,nohz: run vtime_user_enter/exit only when state == IN_USER

2015-02-05 Thread riel
From: Rik van Riel 

Only run vtime_user_enter and vtime_user_exit when we are entering
or exiting user state, respectively.

The RCU code only distinguishes between "idle" and "not idle or kernel".
There should be no need to add an additional (unused) state there.

Signed-off-by: Rik van Riel 
---
 kernel/context_tracking.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index 4c010787c9ec..97806c4deec5 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -85,7 +85,8 @@ void context_tracking_user_enter(enum ctx_state state)
 * user_exit() or rcu_irq_enter(). Let's remove RCU's 
dependency
 * on the tick.
 */
-   vtime_user_enter(current);
+   if (state == IN_USER)
+   vtime_user_enter(current);
rcu_user_enter();
}
/*
@@ -136,7 +137,8 @@ void context_tracking_user_exit(enum ctx_state state)
 * RCU core about that (ie: we may need the tick again).
 */
rcu_user_exit();
-   vtime_user_exit(current);
+   if (state == IN_USER)
+   vtime_user_exit(current);
trace_user_exit(0);
}
__this_cpu_write(context_tracking.state, IN_KERNEL);
-- 
1.9.3



Re: [PATCH RFC] kvm: x86: add halt_poll module parameter

2015-02-05 Thread David Matlack
On Thu, Feb 5, 2015 at 8:05 AM, Paolo Bonzini  wrote:
> This patch introduces a new module parameter for the KVM module; when it
> is present, KVM attempts a bit of polling on every HLT before scheduling
> itself out via kvm_vcpu_block.

Awesome. I have been working on the same feature in parallel so I have
some suggestions :)

>
> This parameter helps a lot for latency-bound workloads---in particular
> I tested it with O_DSYNC writes with a battery-backed disk in the host.
> In this case, writes are fast (because the data doesn't have to go all
> the way to the platters) but they cannot be merged by either the host or
> the guest.  KVM's performance here is usually around 30% of bare metal,
> or 50% if you use cache=directsync or cache=writethrough (these
> parameters avoid that the guest sends pointless flush requests, and
> at the same time they are not slow because of the battery-backed cache).
> The bad performance happens because on every halt the host CPU decides
> to halt itself too.  When the interrupt comes, the vCPU thread is then
> migrated to a new physical CPU, and in general the latency is horrible
> because the vCPU thread has to be scheduled back in.
>
> With this patch performance reaches 60-65% of bare metal and, more
> important, 99% of what you get if you use idle=poll in the guest.  This

I used loopback TCP_RR and loopback memcache as benchmarks for halt
polling. I saw very similar results as you (before: 40% bare metal,
after: 60-65% bare metal and 95% of guest idle=poll).

> means that the tunable gets rid of this particular bottleneck, and more
> work can be done to improve performance in the kernel or QEMU.
>
> Of course there is some price to pay; every time an otherwise idle vCPUs
> is interrupted by an interrupt, it will poll unnecessarily and thus
> impose a little load on the host.  The above results were obtained with
> a mostly random value of the parameter (200), and the load was around
> 1.5-2.5% CPU usage on one of the host's core for each idle guest vCPU.
>
> The patch also adds a new stat, /sys/kernel/debug/kvm/halt_successful_poll,
> that can be used to tune the parameter.  It counts how many HLT
> instructions received an interrupt during the polling period; each
> successful poll avoids that Linux schedules the VCPU thread out and back
> in, and may also avoid a likely trip to C1 and back for the physical CPU.
>
> While the VM is idle, a Linux 4 VCPU VM halts around 10 times per second.
> Of these halts, almost all are failed polls.  During the benchmark,
> instead, basically all halts end within the polling period, except a more
> or less constant stream of 50 per second coming from vCPUs that are not
> running the benchmark.  The wasted time is thus very low.  Things may
> be slightly different for Windows VMs, which have a ~10 ms timer tick.
>
> The effect is also visible on Marcelo's recently-introduced latency
> test for the TSC deadline timer.  Though of course a non-RT kernel has
> awful latency bounds, the latency of the timer is around 8000-1 clock
> cycles compared to 2-12 without setting halt_poll.  For the TSC
> deadline timer, thus, the effect is both a smaller average latency and
> a smaller variance.
>
> Signed-off-by: Paolo Bonzini 
> ---

Reviewed-by: David Matlack 

>  arch/x86/include/asm/kvm_host.h |  1 +
>  arch/x86/kvm/x86.c  | 28 
>  include/linux/kvm_host.h|  1 +
>  virt/kvm/kvm_main.c | 22 +++---
>  4 files changed, 41 insertions(+), 11 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 848947ac6ade..a236e39cc385 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -655,6 +655,7 @@ struct kvm_vcpu_stat {
> u32 irq_window_exits;
> u32 nmi_window_exits;
> u32 halt_exits;
> +   u32 halt_successful_poll;
> u32 halt_wakeup;
> u32 request_irq_exits;
> u32 irq_exits;
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 1373e04e1f19..b7b20828f01c 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -96,6 +96,9 @@ EXPORT_SYMBOL_GPL(kvm_x86_ops);
>  static bool ignore_msrs = 0;
>  module_param(ignore_msrs, bool, S_IRUGO | S_IWUSR);
>
> +unsigned int halt_poll = 0;
> +module_param(halt_poll, uint, S_IRUGO | S_IWUSR);

Suggest encoding the units in the name. "halt_poll_cycles" in this case.

> +
>  unsigned int min_timer_period_us = 500;
>  module_param(min_timer_period_us, uint, S_IRUGO | S_IWUSR);
>
> @@ -145,6 +148,7 @@ struct kvm_stats_debugfs_item debugfs_entries[] = {
> { "irq_window", VCPU_STAT(irq_window_exits) },
> { "nmi_window", VCPU_STAT(nmi_window_exits) },
> { "halt_exits", VCPU_STAT(halt_exits) },
> +   { "halt_successful_poll", VCPU_STAT(halt_successful_poll) },
> { "halt_wakeup", VCPU_STAT(halt_wakeup) },
> { "hypercalls", VCPU_STAT(hyper

Re: [PATCH RFC] kvm: x86: add halt_poll module parameter

2015-02-05 Thread Paolo Bonzini


On 05/02/2015 21:39, David Matlack wrote:
>> This parameter helps a lot for latency-bound workloads [...]
>> KVM's performance here is usually around 30% of bare metal,
>> or 50% if you use cache=directsync or cache=writethrough.
>> With this patch performance reaches 60-65% of bare metal and, more
>> important, 99% of what you get if you use idle=poll in the guest.
> 
> I used loopback TCP_RR and loopback memcache as benchmarks for halt
> polling. I saw very similar results as you (before: 40% bare metal,
> after: 60-65% bare metal and 95% of guest idle=poll).

Good that it also works for network!

> Reviewed-by: David Matlack 
> 
>>  arch/x86/include/asm/kvm_host.h |  1 +
>>  arch/x86/kvm/x86.c  | 28 
>>  include/linux/kvm_host.h|  1 +
>>  virt/kvm/kvm_main.c | 22 +++---
>>  4 files changed, 41 insertions(+), 11 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/kvm_host.h 
>> b/arch/x86/include/asm/kvm_host.h
>> index 848947ac6ade..a236e39cc385 100644
>> --- a/arch/x86/include/asm/kvm_host.h
>> +++ b/arch/x86/include/asm/kvm_host.h
>> @@ -655,6 +655,7 @@ struct kvm_vcpu_stat {
>> u32 irq_window_exits;
>> u32 nmi_window_exits;
>> u32 halt_exits;
>> +   u32 halt_successful_poll;
>> u32 halt_wakeup;
>> u32 request_irq_exits;
>> u32 irq_exits;
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index 1373e04e1f19..b7b20828f01c 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -96,6 +96,9 @@ EXPORT_SYMBOL_GPL(kvm_x86_ops);
>>  static bool ignore_msrs = 0;
>>  module_param(ignore_msrs, bool, S_IRUGO | S_IWUSR);
>>
>> +unsigned int halt_poll = 0;
>> +module_param(halt_poll, uint, S_IRUGO | S_IWUSR);
> 
> Suggest encoding the units in the name. "halt_poll_cycles" in this case.

I left it out because of the parallel with ple_window/ple_gap.  But I
will call it "halt_poll_ns" in the next version.

>> +
>>  unsigned int min_timer_period_us = 500;
>>  module_param(min_timer_period_us, uint, S_IRUGO | S_IWUSR);
>>
>> @@ -145,6 +148,7 @@ struct kvm_stats_debugfs_item debugfs_entries[] = {
>> { "irq_window", VCPU_STAT(irq_window_exits) },
>> { "nmi_window", VCPU_STAT(nmi_window_exits) },
>> { "halt_exits", VCPU_STAT(halt_exits) },
>> +   { "halt_successful_poll", VCPU_STAT(halt_successful_poll) },
>> { "halt_wakeup", VCPU_STAT(halt_wakeup) },
>> { "hypercalls", VCPU_STAT(hypercalls) },
>> { "request_irq", VCPU_STAT(request_irq_exits) },
>> @@ -5819,13 +5823,29 @@ void kvm_arch_exit(void)
>>  int kvm_emulate_halt(struct kvm_vcpu *vcpu)
>>  {
>> ++vcpu->stat.halt_exits;
>> -   if (irqchip_in_kernel(vcpu->kvm)) {
>> -   vcpu->arch.mp_state = KVM_MP_STATE_HALTED;
>> -   return 1;
>> -   } else {
>> +   if (!irqchip_in_kernel(vcpu->kvm)) {
>> vcpu->run->exit_reason = KVM_EXIT_HLT;
>> return 0;
>> }
>> +
>> +   vcpu->arch.mp_state = KVM_MP_STATE_HALTED;
>> +   if (halt_poll) {
> 
> Would it be useful to poll in kvm_vcpu_block() for the benefit of all
> arch's?

Sure.  Especially if I use time instead of cycles.

>> +   u64 start, curr;
>> +   rdtscll(start);
> 
> Why cycles instead of time?

People's love for rdtsc grows every time you tell them it's wrong...

>> +   do {
>> +   /*
>> +* This sets KVM_REQ_UNHALT if an interrupt
>> +* arrives.
>> +*/
>> +   if (kvm_vcpu_check_block(vcpu) < 0) {
>> +   ++vcpu->stat.halt_successful_poll;
>> +   break;
>> +   }
>> +   rdtscll(curr);
>> +   } while(!need_resched() && curr - start < halt_poll);
> 
> I found that using need_resched() was not sufficient at preventing
> VCPUs from delaying their own progress. To test this try running with
> and without polling on a 2 VCPU VM, confined to 1 PCPU, that is running
> loopback TCP_RR in the VM. The problem goes away if you stop polling as
> soon as there are runnable threads on your cpu. (e.g. use
> "single_task_running()" instead of "!need_resched()"
> http://lxr.free-electrons.com/source/kernel/sched/core.c#L2398 ). This
> also guarantees polling only delays the idle thread.

Great, I'll include all of your suggestions in the next version of the
patch.

Paolo
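Putting those suggestions together, a sketch of the revised polling in
generic code: a time bound (the renamed halt_poll_ns) instead of TSC
cycles, and single_task_running() instead of !need_resched(). This is a
sketch of the direction discussed, not the final patch:

        void kvm_vcpu_block(struct kvm_vcpu *vcpu)
        {
                if (halt_poll_ns) {
                        ktime_t stop = ktime_add_ns(ktime_get(), halt_poll_ns);

                        do {
                                /* Sets KVM_REQ_UNHALT if an interrupt arrived. */
                                if (kvm_vcpu_check_block(vcpu) < 0) {
                                        /* halt_successful_poll is x86-only
                                         * here; shown for illustration. */
                                        ++vcpu->stat.halt_successful_poll;
                                        return;
                                }
                                cpu_relax();
                        } while (single_task_running() &&
                                 ktime_before(ktime_get(), stop));
                }

                /* ... fall through to the existing wait-queue loop ... */
        }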


Injecting a page fault into the guest OS?

2015-02-05 Thread Estrada, Zachary J

Hi all,

I'm trying to inject a page fault into a Linux guest, but when tracing the guest 
OS I don't see the injected fault making it to do_page_fault in the guest. That 
is, I don't see a do_page_fault call with a CR2 matching the address I'm passing.


I'm currently calling the kvm_inject_page_fault(vcpu, &exception) function from
a custom kernel module. I have a simple callback at the top of vmx_vcpu_run that 
invokes my module before the asm for vmlaunch/resume. I have tried with various 
permutations of exception.error_code, but haven't found anything that works. Is 
there something else I need to do? My system does not have tdp, but I am looking 
for something that's agnostic of the underlying virtual mmu.


Thanks!
--Zak
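For reference, a sketch of the kind of call described above with
illustrative field values; it does not by itself answer the timing
question of whether the injection survives until the next vmentry:

        u64 gva = 0xdead0000;           /* hypothetical guest address */
        struct x86_exception fault = {
                .vector            = PF_VECTOR,
                .error_code_valid  = true,
                .error_code        = PFERR_PRESENT_MASK | PFERR_USER_MASK,
                .nested_page_fault = false,
                .address           = gva,   /* lands in CR2 on delivery */
        };

        kvm_inject_page_fault(vcpu, &fault);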


Re: [PATCH RFC] kvm: x86: add halt_poll module parameter

2015-02-05 Thread Marcelo Tosatti
On Thu, Feb 05, 2015 at 05:05:25PM +0100, Paolo Bonzini wrote:
> This patch introduces a new module parameter for the KVM module; when it
> is present, KVM attempts a bit of polling on every HLT before scheduling
> itself out via kvm_vcpu_block.
> 
> This parameter helps a lot for latency-bound workloads---in particular
> I tested it with O_DSYNC writes with a battery-backed disk in the host.
> In this case, writes are fast (because the data doesn't have to go all
> the way to the platters) but they cannot be merged by either the host or
> the guest.  KVM's performance here is usually around 30% of bare metal,
> or 50% if you use cache=directsync or cache=writethrough (these
> parameters avoid that the guest sends pointless flush requests, and
> at the same time they are not slow because of the battery-backed cache).
> The bad performance happens because on every halt the host CPU decides
> to halt itself too.  When the interrupt comes, the vCPU thread is then
> migrated to a new physical CPU, and in general the latency is horrible
> because the vCPU thread has to be scheduled back in.
> 
> With this patch performance reaches 60-65% of bare metal and, more
> important, 99% of what you get if you use idle=poll in the guest.  This
> means that the tunable gets rid of this particular bottleneck, and more
> work can be done to improve performance in the kernel or QEMU.
> 
> Of course there is some price to pay; every time an otherwise idle vCPUs
> is interrupted by an interrupt, it will poll unnecessarily and thus
> impose a little load on the host.  The above results were obtained with
> a mostly random value of the parameter (200), and the load was around
> 1.5-2.5% CPU usage on one of the host's core for each idle guest vCPU.
> 
> The patch also adds a new stat, /sys/kernel/debug/kvm/halt_successful_poll,
> that can be used to tune the parameter.  It counts how many HLT
> instructions received an interrupt during the polling period; each
> successful poll avoids that Linux schedules the VCPU thread out and back
> in, and may also avoid a likely trip to C1 and back for the physical CPU.
> 
> While the VM is idle, a Linux 4 VCPU VM halts around 10 times per second.
> Of these halts, almost all are failed polls.  During the benchmark,
> instead, basically all halts end within the polling period, except a more
> or less constant stream of 50 per second coming from vCPUs that are not
> running the benchmark.  The wasted time is thus very low.  Things may
> be slightly different for Windows VMs, which have a ~10 ms timer tick.
> 
> The effect is also visible on Marcelo's recently-introduced latency
> test for the TSC deadline timer.  Though of course a non-RT kernel has
> awful latency bounds, the latency of the timer is around 8000-1 clock
> cycles compared to 2-12 without setting halt_poll.  For the TSC
> deadline timer, thus, the effect is both a smaller average latency and
> a smaller variance.
> 
> Signed-off-by: Paolo Bonzini 
> ---
>  arch/x86/include/asm/kvm_host.h |  1 +
>  arch/x86/kvm/x86.c  | 28 
>  include/linux/kvm_host.h|  1 +
>  virt/kvm/kvm_main.c | 22 +++---
>  4 files changed, 41 insertions(+), 11 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 848947ac6ade..a236e39cc385 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -655,6 +655,7 @@ struct kvm_vcpu_stat {
>   u32 irq_window_exits;
>   u32 nmi_window_exits;
>   u32 halt_exits;
> + u32 halt_successful_poll;
>   u32 halt_wakeup;
>   u32 request_irq_exits;
>   u32 irq_exits;
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 1373e04e1f19..b7b20828f01c 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -96,6 +96,9 @@ EXPORT_SYMBOL_GPL(kvm_x86_ops);
>  static bool ignore_msrs = 0;
>  module_param(ignore_msrs, bool, S_IRUGO | S_IWUSR);
>  
> +unsigned int halt_poll = 0;
> +module_param(halt_poll, uint, S_IRUGO | S_IWUSR);
> +
>  unsigned int min_timer_period_us = 500;
>  module_param(min_timer_period_us, uint, S_IRUGO | S_IWUSR);
>  
> @@ -145,6 +148,7 @@ struct kvm_stats_debugfs_item debugfs_entries[] = {
>   { "irq_window", VCPU_STAT(irq_window_exits) },
>   { "nmi_window", VCPU_STAT(nmi_window_exits) },
>   { "halt_exits", VCPU_STAT(halt_exits) },
> + { "halt_successful_poll", VCPU_STAT(halt_successful_poll) },
>   { "halt_wakeup", VCPU_STAT(halt_wakeup) },
>   { "hypercalls", VCPU_STAT(hypercalls) },
>   { "request_irq", VCPU_STAT(request_irq_exits) },
> @@ -5819,13 +5823,29 @@ void kvm_arch_exit(void)
>  int kvm_emulate_halt(struct kvm_vcpu *vcpu)
>  {
>   ++vcpu->stat.halt_exits;
> - if (irqchip_in_kernel(vcpu->kvm)) {
> - vcpu->arch.mp_state = KVM_MP_STATE_HALTED;
> - return 1;
> - } else {
> + if (!irqchip_in_kernel(vcpu->kvm)) {
>

Re: [PATCH RFC] kvm: x86: add halt_poll module parameter

2015-02-05 Thread Marcelo Tosatti
On Thu, Feb 05, 2015 at 09:34:06PM -0200, Marcelo Tosatti wrote:
> On Thu, Feb 05, 2015 at 05:05:25PM +0100, Paolo Bonzini wrote:
> > This patch introduces a new module parameter for the KVM module; when it
> > is present, KVM attempts a bit of polling on every HLT before scheduling
> > itself out via kvm_vcpu_block.
> > 
> > This parameter helps a lot for latency-bound workloads---in particular
> > I tested it with O_DSYNC writes with a battery-backed disk in the host.
> > In this case, writes are fast (because the data doesn't have to go all
> > the way to the platters) but they cannot be merged by either the host or
> > the guest.  KVM's performance here is usually around 30% of bare metal,
> > or 50% if you use cache=directsync or cache=writethrough (these
> > parameters avoid that the guest sends pointless flush requests, and
> > at the same time they are not slow because of the battery-backed cache).
> > The bad performance happens because on every halt the host CPU decides
> > to halt itself too.  When the interrupt comes, the vCPU thread is then
> > migrated to a new physical CPU, and in general the latency is horrible
> > because the vCPU thread has to be scheduled back in.
> > 
> > With this patch performance reaches 60-65% of bare metal and, more
> > important, 99% of what you get if you use idle=poll in the guest.  This
> > means that the tunable gets rid of this particular bottleneck, and more
> > work can be done to improve performance in the kernel or QEMU.
> > 
> > Of course there is some price to pay; every time an otherwise idle vCPUs
> > is interrupted by an interrupt, it will poll unnecessarily and thus
> > impose a little load on the host.  The above results were obtained with
> > a mostly random value of the parameter (200), and the load was around
> > 1.5-2.5% CPU usage on one of the host's core for each idle guest vCPU.
> > 
> > The patch also adds a new stat, /sys/kernel/debug/kvm/halt_successful_poll,
> > that can be used to tune the parameter.  It counts how many HLT
> > instructions received an interrupt during the polling period; each
> > successful poll avoids that Linux schedules the VCPU thread out and back
> > in, and may also avoid a likely trip to C1 and back for the physical CPU.
> > 
> > While the VM is idle, a Linux 4 VCPU VM halts around 10 times per second.
> > Of these halts, almost all are failed polls.  During the benchmark,
> > instead, basically all halts end within the polling period, except a more
> > or less constant stream of 50 per second coming from vCPUs that are not
> > running the benchmark.  The wasted time is thus very low.  Things may
> > be slightly different for Windows VMs, which have a ~10 ms timer tick.
> > 
> > The effect is also visible on Marcelo's recently-introduced latency
> > test for the TSC deadline timer.  Though of course a non-RT kernel has
> > awful latency bounds, the latency of the timer is around 8000-1 clock
> > cycles compared to 2-12 without setting halt_poll.  For the TSC
> > deadline timer, thus, the effect is both a smaller average latency and
> > a smaller variance.
> > 
> > Signed-off-by: Paolo Bonzini 
> > ---
> >  arch/x86/include/asm/kvm_host.h |  1 +
> >  arch/x86/kvm/x86.c  | 28 
> >  include/linux/kvm_host.h|  1 +
> >  virt/kvm/kvm_main.c | 22 +++---
> >  4 files changed, 41 insertions(+), 11 deletions(-)
> > 
> > diff --git a/arch/x86/include/asm/kvm_host.h 
> > b/arch/x86/include/asm/kvm_host.h
> > index 848947ac6ade..a236e39cc385 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -655,6 +655,7 @@ struct kvm_vcpu_stat {
> > u32 irq_window_exits;
> > u32 nmi_window_exits;
> > u32 halt_exits;
> > +   u32 halt_successful_poll;
> > u32 halt_wakeup;
> > u32 request_irq_exits;
> > u32 irq_exits;
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 1373e04e1f19..b7b20828f01c 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -96,6 +96,9 @@ EXPORT_SYMBOL_GPL(kvm_x86_ops);
> >  static bool ignore_msrs = 0;
> >  module_param(ignore_msrs, bool, S_IRUGO | S_IWUSR);
> >  
> > +unsigned int halt_poll = 0;
> > +module_param(halt_poll, uint, S_IRUGO | S_IWUSR);
> > +
> >  unsigned int min_timer_period_us = 500;
> >  module_param(min_timer_period_us, uint, S_IRUGO | S_IWUSR);
> >  
> > @@ -145,6 +148,7 @@ struct kvm_stats_debugfs_item debugfs_entries[] = {
> > { "irq_window", VCPU_STAT(irq_window_exits) },
> > { "nmi_window", VCPU_STAT(nmi_window_exits) },
> > { "halt_exits", VCPU_STAT(halt_exits) },
> > +   { "halt_successful_poll", VCPU_STAT(halt_successful_poll) },
> > { "halt_wakeup", VCPU_STAT(halt_wakeup) },
> > { "hypercalls", VCPU_STAT(hypercalls) },
> > { "request_irq", VCPU_STAT(request_irq_exits) },
> > @@ -5819,13 +5823,29 @@ void kvm_arch_exit(void)
> >  int kvm_emulate_halt(struct kvm_vcpu *vcpu)
> 

Re: [PATCH 5/5] nohz: add stub context_tracking_is_enabled

2015-02-05 Thread Paul E. McKenney
On Thu, Feb 05, 2015 at 03:23:52PM -0500, r...@redhat.com wrote:
> From: Rik van Riel 
> 
> With code elsewhere doing something conditional on whether or not
> context tracking is enabled, we want a stub function that tells us
> context tracking is not enabled, when CONFIG_CONTEXT_TRACKING is
> not set.
> 
> Signed-off-by: Rik van Riel 

Acked-by: Paul E. McKenney 

> ---
>  include/linux/context_tracking_state.h | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/include/linux/context_tracking_state.h 
> b/include/linux/context_tracking_state.h
> index f3ef027af749..90a7bab8779e 100644
> --- a/include/linux/context_tracking_state.h
> +++ b/include/linux/context_tracking_state.h
> @@ -40,6 +40,8 @@ static inline bool context_tracking_in_user(void)
>  #else
>  static inline bool context_tracking_in_user(void) { return false; }
>  static inline bool context_tracking_active(void) { return false; }
> +static inline bool context_tracking_is_enabled(void) { return false; }
> +static inline bool context_tracking_cpu_is_enabled(void) { return false; }
>  #endif /* CONFIG_CONTEXT_TRACKING */
> 
>  #endif
> -- 
> 1.9.3
> 



Re: [PATCH 1/5] rcu,nohz: add state parameter to context_tracking_user_enter/exit

2015-02-05 Thread Paul E. McKenney
On Thu, Feb 05, 2015 at 03:23:48PM -0500, r...@redhat.com wrote:
> From: Rik van Riel 
> 
> Add the expected ctx_state as a parameter to context_tracking_user_enter
> and context_tracking_user_exit, allowing the same functions to not just
> track kernel <> user space switching, but also kernel <> guest transitions.
> 
> Signed-off-by: Rik van Riel 

Acked-by: Paul E. McKenney 

> ---
>  include/linux/context_tracking.h | 12 ++--
>  kernel/context_tracking.c| 10 +-
>  2 files changed, 11 insertions(+), 11 deletions(-)
> 
> diff --git a/include/linux/context_tracking.h 
> b/include/linux/context_tracking.h
> index 37b81bd51ec0..bd9f000fc98d 100644
> --- a/include/linux/context_tracking.h
> +++ b/include/linux/context_tracking.h
> @@ -10,21 +10,21 @@
>  #ifdef CONFIG_CONTEXT_TRACKING
>  extern void context_tracking_cpu_set(int cpu);
> 
> -extern void context_tracking_user_enter(void);
> -extern void context_tracking_user_exit(void);
> +extern void context_tracking_user_enter(enum ctx_state state);
> +extern void context_tracking_user_exit(enum ctx_state state);
>  extern void __context_tracking_task_switch(struct task_struct *prev,
>  struct task_struct *next);
> 
>  static inline void user_enter(void)
>  {
>   if (context_tracking_is_enabled())
> - context_tracking_user_enter();
> + context_tracking_user_enter(IN_USER);
> 
>  }
>  static inline void user_exit(void)
>  {
>   if (context_tracking_is_enabled())
> - context_tracking_user_exit();
> + context_tracking_user_exit(IN_USER);
>  }
> 
>  static inline enum ctx_state exception_enter(void)
> @@ -35,7 +35,7 @@ static inline enum ctx_state exception_enter(void)
>   return 0;
> 
>   prev_ctx = this_cpu_read(context_tracking.state);
> - context_tracking_user_exit();
> + context_tracking_user_exit(prev_ctx);
> 
>   return prev_ctx;
>  }
> @@ -44,7 +44,7 @@ static inline void exception_exit(enum ctx_state prev_ctx)
>  {
>   if (context_tracking_is_enabled()) {
>   if (prev_ctx == IN_USER)
> - context_tracking_user_enter();
> + context_tracking_user_enter(prev_ctx);
>   }
>  }
> 
> diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
> index 937ecdfdf258..4c010787c9ec 100644
> --- a/kernel/context_tracking.c
> +++ b/kernel/context_tracking.c
> @@ -47,7 +47,7 @@ void context_tracking_cpu_set(int cpu)
>   * to execute won't use any RCU read side critical section because this
>   * function sets RCU in extended quiescent state.
>   */
> -void context_tracking_user_enter(void)
> +void context_tracking_user_enter(enum ctx_state state)
>  {
>   unsigned long flags;
> 
> @@ -75,7 +75,7 @@ void context_tracking_user_enter(void)
>   WARN_ON_ONCE(!current->mm);
> 
>   local_irq_save(flags);
> - if ( __this_cpu_read(context_tracking.state) != IN_USER) {
> + if ( __this_cpu_read(context_tracking.state) != state) {
>   if (__this_cpu_read(context_tracking.active)) {
>   trace_user_enter(0);
>   /*
> @@ -101,7 +101,7 @@ void context_tracking_user_enter(void)
>* OTOH we can spare the calls to vtime and RCU when 
> context_tracking.active
>* is false because we know that CPU is not tickless.
>*/
> - __this_cpu_write(context_tracking.state, IN_USER);
> + __this_cpu_write(context_tracking.state, state);
>   }
>   local_irq_restore(flags);
>  }
> @@ -118,7 +118,7 @@ NOKPROBE_SYMBOL(context_tracking_user_enter);
>   * This call supports re-entrancy. This way it can be called from any 
> exception
>   * handler without needing to know if we came from userspace or not.
>   */
> -void context_tracking_user_exit(void)
> +void context_tracking_user_exit(enum ctx_state state)
>  {
>   unsigned long flags;
> 
> @@ -129,7 +129,7 @@ void context_tracking_user_exit(void)
>   return;
> 
>   local_irq_save(flags);
> - if (__this_cpu_read(context_tracking.state) == IN_USER) {
> + if (__this_cpu_read(context_tracking.state) == state) {
>   if (__this_cpu_read(context_tracking.active)) {
>   /*
>* We are going to run code that may use RCU. Inform
> -- 
> 1.9.3
> 



Re: [PATCH 4/5] kvm,rcu,nohz: use RCU extended quiescent state when running KVM guest

2015-02-05 Thread Paul E. McKenney
On Thu, Feb 05, 2015 at 03:23:51PM -0500, r...@redhat.com wrote:
> From: Rik van Riel 
> 
> The host kernel is not doing anything while the CPU is executing
> a KVM guest VCPU, so it can be marked as being in an extended
> quiescent state, identical to that used when running user space
> code.
> 
> The only exception to that rule is when the host handles an
> interrupt, which is already handled by the irq code, which
> calls rcu_irq_enter and rcu_irq_exit.
> 
> The guest_enter and guest_exit functions already switch vtime
> accounting independent of context tracking, so leave those calls
> where they are, instead of moving them into the context tracking
> code.
> 
> Signed-off-by: Rik van Riel 

Acked-by: Paul E. McKenney 

> ---
>  include/linux/context_tracking.h   | 8 +++-
>  include/linux/context_tracking_state.h | 1 +
>  include/linux/kvm_host.h   | 3 ++-
>  3 files changed, 10 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/context_tracking.h 
> b/include/linux/context_tracking.h
> index bd9f000fc98d..a5d3bb44b897 100644
> --- a/include/linux/context_tracking.h
> +++ b/include/linux/context_tracking.h
> @@ -43,7 +43,7 @@ static inline enum ctx_state exception_enter(void)
>  static inline void exception_exit(enum ctx_state prev_ctx)
>  {
>   if (context_tracking_is_enabled()) {
> - if (prev_ctx == IN_USER)
> + if (prev_ctx == IN_USER || prev_ctx == IN_GUEST)
>   context_tracking_user_enter(prev_ctx);
>   }
>  }
> @@ -78,6 +78,9 @@ static inline void guest_enter(void)
>   vtime_guest_enter(current);
>   else
>   current->flags |= PF_VCPU;
> +
> + if (context_tracking_is_enabled())
> + context_tracking_user_enter(IN_GUEST);
>  }
> 
>  static inline void guest_exit(void)
> @@ -86,6 +89,9 @@ static inline void guest_exit(void)
>   vtime_guest_exit(current);
>   else
>   current->flags &= ~PF_VCPU;
> +
> + if (context_tracking_is_enabled())
> + context_tracking_user_exit(IN_GUEST);
>  }
> 
>  #else
> diff --git a/include/linux/context_tracking_state.h 
> b/include/linux/context_tracking_state.h
> index 97a81225d037..f3ef027af749 100644
> --- a/include/linux/context_tracking_state.h
> +++ b/include/linux/context_tracking_state.h
> @@ -15,6 +15,7 @@ struct context_tracking {
>   enum ctx_state {
>   IN_KERNEL = 0,
>   IN_USER,
> + IN_GUEST,
>   } state;
>  };
> 
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 26f106022c88..c7828a6a9614 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -772,7 +772,8 @@ static inline void kvm_guest_enter(void)
>* one time slice). Lets treat guest mode as quiescent state, just like
>* we do with user-mode execution.
>*/
> - rcu_virt_note_context_switch(smp_processor_id());
> + if (!context_tracking_cpu_is_enabled())
> + rcu_virt_note_context_switch(smp_processor_id());
>  }
> 
>  static inline void kvm_guest_exit(void)
> -- 
> 1.9.3
> 



Re: [PATCH 3/5] nohz,kvm: export context_tracking_user_enter/exit

2015-02-05 Thread Paul E. McKenney
On Thu, Feb 05, 2015 at 03:23:50PM -0500, r...@redhat.com wrote:
> From: Rik van Riel 
> 
> Export context_tracking_user_enter/exit so it can be used by KVM.
> 
> Signed-off-by: Rik van Riel 

Acked-by: Paul E. McKenney 

> ---
>  kernel/context_tracking.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
> index 97806c4deec5..a3f4a2840637 100644
> --- a/kernel/context_tracking.c
> +++ b/kernel/context_tracking.c
> @@ -107,6 +107,7 @@ void context_tracking_user_enter(enum ctx_state state)
>   local_irq_restore(flags);
>  }
>  NOKPROBE_SYMBOL(context_tracking_user_enter);
> +EXPORT_SYMBOL_GPL(context_tracking_user_enter);
> 
>  /**
>   * context_tracking_user_exit - Inform the context tracking that the CPU is
> @@ -146,6 +147,7 @@ void context_tracking_user_exit(enum ctx_state state)
>   local_irq_restore(flags);
>  }
>  NOKPROBE_SYMBOL(context_tracking_user_exit);
> +EXPORT_SYMBOL_GPL(context_tracking_user_exit);
> 
>  /**
>   * __context_tracking_task_switch - context switch the syscall callbacks
> -- 
> 1.9.3
> 



Re: [PATCH 6/6] KVM: VMX: Add PML support in VMX

2015-02-05 Thread Kai Huang


On 02/05/2015 11:04 PM, Radim Krčmář wrote:

2015-02-05 14:23+0800, Kai Huang:

On 02/03/2015 11:18 PM, Radim Krčmář wrote:

You have it protected by CONFIG_X86_64, but use it unconditionally.

Thanks for catching. This has been fixed by another patch, and the fix has
also been merged by Paolo.

(And I haven't noticed the followup either ...)

I don't know of any present bugs in this patch and all questions have
been cleared,

Thanks.


+   exec_control &= ~SECONDARY_EXEC_ENABLE_PML;

What is the harm of enabling it here?

(SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES seems similar and does it.)

Because the PML feature detection is unconditional (meaning
SECONDARY_EXEC_ENABLE_PML is always in vmcs_config.cpu_based_2nd_exec_ctrl),
but the PML buffer is only created when vcpu is created, and it is
controlled by 'enable_pml' module parameter, if we always enable
SECONDARY_EXEC_ENABLE_PML here, no PML buffer will be created if PML is
disabled by 'enable_pml' parameter,

I meant

        if (!enable_pml)
                exec_control &= ~SECONDARY_EXEC_ENABLE_PML

here and no exec_control operations in vmx_create_vcpu().
This also works. I originally thought putting the enabling part and the
buffer allocation together was more straightforward.



 so it's better to enable it along with
creating PML buffer.

I think the reason why KVM split the setup into vmx_create_vcpu() and
vmx_secondary_exec_control() is to simplify nested virtualization.

This is a good point and I'll think about it :)




+   if (!(to_vmx(vcpu)->idt_vectoring_info & VECTORING_INFO_VALID_MASK) &&
+   cpu_has_virtual_nmis() &&
+   (exit_qualification & INTR_INFO_UNBLOCK_NMI))
+   vmcs_set_bits(GUEST_INTERRUPTIBILITY_INFO,
+   GUEST_INTR_STATE_NMI);

Relevant part of the specification is pasted from 27.2.2 (Information
for VM Exits Due to Vectored Events), which makes me think that this bit
is mirrored to IDT-vectoring information field ...

Isn't vmx_recover_nmi_blocking() enough?

(I see the same code in handle_ept_violation(), but wasn't that needed
  just because of a hardware error?)

This needs to be handled in both EPT violation and PML. If you look at the
PML specification (the link is in cover letter), you can see the definition
of bit 12 of exit_qualification (section 1.5), which explains above code.
The same definition of exit_qualification is in EPT violation part in
Intel's SDM.

True ... IDT-vectoring doesn't set bit 12 because PML exit isn't a fault
and 27.2.2: "— For all other relevant VM exits, bit 12 is cleared to 0."
(This was humourously pasted to PML as well.)


+   /* PML index always points to next available PML buffer entity */
+   if (pml_idx >= PML_ENTITY_NUM)
+   pml_idx = 0;
+   else
+   pml_idx++;

If the pml_idx is >= PML_ENTITY_NUM and < 0xFFFF, the log is empty,

In this case, the log is full, actually. PML index is set to PML_ENTITY_NUM
- 1 initially, and hardware decreases it after a GPA is logged.

Below is the pseudocode of PML hardware logic.

IF (PML Index[15:9] ≠ 0)
 THEN VM exit;
FI;

pml_idx >= PML_ENTITY_NUM exits without modifying the buffer,

Yes.



set accessed and dirty flags for EPT;
IF (a dirty flag was updated from 0 to 1)
THEN
 PML address[PML index] ← 4-KByte-aligned guest-physical address;
 PML index is decremented;

0xFFFF is the only value that specifies a full buffer, the rest means
that we incorrectly set up the initial index and VMX exited without
changing the buffer => we shouldn't read it.
(I should have said "untouched" instead of "empty".)
Yes, theoretically, according to the pseudocode, 0xFFFF is the only
possible value that specifies a full buffer. Probably a better way is to
check that pml_idx is in the range [-1, 511] at the beginning and warn
if it's not, or just ASSERT it. Then we can simply do a ++pml_idx
(after changing pml_idx to a signed short type). But the current code
should also work anyway.
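A sketch of that range-checked variant (pml_buf, PML_ENTITY_NUM and the
dirty-marking call follow the patch; the WARN placement is the
suggestion above, not merged code):

        u64 *pml_buf = page_address(vmx->pml_pg);
        short pml_idx = vmcs_read16(GUEST_PML_INDEX);

        /* Valid values are -1 (buffer full) .. PML_ENTITY_NUM - 1. */
        WARN_ON_ONCE(pml_idx < -1 || pml_idx >= PML_ENTITY_NUM);

        /* Hardware decrements after logging, so pml_idx points at the
         * next free slot; the first valid entry is pml_idx + 1
         * (0 when the buffer is full). */
        for (++pml_idx; pml_idx < PML_ENTITY_NUM; pml_idx++) {
                u64 gpa = pml_buf[pml_idx];

                mark_page_dirty(kvm, gpa >> PAGE_SHIFT);
        }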


Thanks,
-Kai



FI;




Re: [PATCH 6/6] KVM: VMX: Add PML support in VMX

2015-02-05 Thread Kai Huang


On 02/05/2015 11:04 PM, Radim Krčmář wrote:

2015-02-05 14:23+0800, Kai Huang:

On 02/03/2015 11:18 PM, Radim Krčmář wrote:

You have it protected by CONFIG_X86_64, but use it unconditionally.

Thanks for catching. This has been fixed by another patch, and the fix has
also been merged by Paolo.

(And I haven't noticed the followup either ...)

I don't know of any present bugs in this patch and all questions have
been cleared,

So far I see no other bug reported :)

Thanks,
-Kai


Thanks.


+   exec_control &= ~SECONDARY_EXEC_ENABLE_PML;

What is the harm of enabling it here?

(SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES seems similar and does it.)

Because the PML feature detection is unconditional (meaning
SECONDARY_EXEC_ENABLE_PML is always in vmcs_config.cpu_based_2nd_exec_ctrl),
but the PML buffer is only created when vcpu is created, and it is
controlled by 'enable_pml' module parameter, if we always enable
SECONDARY_EXEC_ENABLE_PML here, no PML buffer will be created if PML is
disabled by 'enable_pml' parameter,

I meant

        if (!enable_pml)
                exec_control &= ~SECONDARY_EXEC_ENABLE_PML

here and no exec_control operations in vmx_create_vcpu().


 so it's better to enable it along with
creating PML buffer.

I think the reason why KVM split the setup into vmx_create_vcpu() and
vmx_secondary_exec_control() is to simplify nested virtualization.


+   if (!(to_vmx(vcpu)->idt_vectoring_info & VECTORING_INFO_VALID_MASK) &&
+   cpu_has_virtual_nmis() &&
+   (exit_qualification & INTR_INFO_UNBLOCK_NMI))
+   vmcs_set_bits(GUEST_INTERRUPTIBILITY_INFO,
+   GUEST_INTR_STATE_NMI);

Relevant part of the specification is pasted from 27.2.2 (Information
for VM Exits Due to Vectored Events), which makes me think that this bit
is mirrored to IDT-vectoring information field ...

Isn't vmx_recover_nmi_blocking() enough?

(I see the same code in handle_ept_violation(), but wasn't that needed
  just because of a hardware error?)

This needs to be handled in both EPT violation and PML. If you look at the
PML specification (the link is in cover letter), you can see the definition
of bit 12 of exit_qualification (section 1.5), which explains above code.
The same definition of exit_qualification is in EPT violation part in
Intel's SDM.

True ... IDT-vectoring doesn't set bit 12 because PML exit isn't a fault
and 27.2.2: "— For all other relevant VM exits, bit 12 is cleared to 0."
(This was humourously pasted to PML as well.)


+   /* PML index always points to next available PML buffer entity */
+   if (pml_idx >= PML_ENTITY_NUM)
+   pml_idx = 0;
+   else
+   pml_idx++;

If the pml_idx is >= PML_ENTITY_NUM and < 0xFFFF, the log is empty,

In this case, the log is full, actually. PML index is set to PML_ENTITY_NUM
- 1 initially, and hardware decreases it after a GPA is logged.

Below is the pseudocode of PML hardware logic.

IF (PML Index[15:9] ≠ 0)
 THEN VM exit;
FI;

pml_idx >= PML_ENTITY_NUM exits without modifying the buffer,


set accessed and dirty flags for EPT;
IF (a dirty flag was updated from 0 to 1)
THEN
 PML address[PML index] ← 4-KByte-aligned guest-physical address;
 PML index is decremented;

0xFFFF is the only value that specifies a full buffer, the rest means
that we incorrectly set up the initial index and VMX exited without
changing the buffer => we shouldn't read it.
(I should have said "untouched" instead of "empty".)


FI;




maybe a virtio-balloon-device issue ?

2015-02-05 Thread Mario Smarduch
Hi,

I'm looking into qemu/balloon driver VM overcommit. I noticed the
virtio-balloon driver will take any target setting from
virtio-balloon-device, to the point that the guest dies.

For a 1G guest
$ sudo echo balloon 100 | socat - tcp4-connect:127.0.0.1:

you get (same with libvirt setmem)

root@localhost:~# free
-bash: fork: Cannot allocate memory
root@localhost:~# ps
-bash: fork: Cannot allocate memory

$ sudo info balloon | socat ... - confirms setting

The balloon driver has been around for a while, so I'm not sure what I'm missing.

The virtio-balloon device provides free memory, i.e. memory made
externally accessible to the host.  But this appears to be more of a
hint for an inflate request; SNMP MIBs provide more detailed resource
info than that.

I'm wondering whether the driver should have some heuristic check for an
inflate request so it doesn't over-inflate, similar to kernel overcommit.
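
For illustration only, a minimal sketch of one possible guest-side
guard, under stated assumptions: si_meminfo() is the real kernel
helper, while the floor parameter and the function itself are
hypothetical and not existing virtio_balloon code.

#include <linux/mm.h>
#include <linux/kernel.h>

/* Hypothetical guard sketching the heuristic discussed above: cap an
 * inflate request so at least 'min_free_pages' of guest memory stays
 * free. */
static s64 clamp_inflate_target(s64 requested_pages,
				unsigned long min_free_pages)
{
	struct sysinfo si;

	if (requested_pages <= 0)	/* deflate requests pass through */
		return requested_pages;

	si_meminfo(&si);		/* si.freeram is in pages here */
	if (si.freeram <= min_free_pages)
		return 0;		/* already at the floor: refuse */

	return min_t(s64, requested_pages,
		     (s64)(si.freeram - min_free_pages));
}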

Thanks.


[PATCH-v4 0/8] vhost/scsi: Add ANY_LAYOUT + VERSION_1 support

2015-02-05 Thread Nicholas A. Bellinger
From: Nicholas Bellinger 

Hi MST, Paolo, Al & Co,

Here is the v4 patch series for vhost/scsi ANY_LAYOUT + VERSION_1 support.

It updates the vhost_virtqueue->handle_kick() -> vhost_scsi_handle_vq()
callback code to determine the start of the protection + data payload
iovecs past the leading virtio-scsi request + response headers.  It also
compares iovec out_size vs. req_size and iovec in_size vs. rsp_size in
order to determine data_direction for use with Al's new iov_iter
primitives.
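
To make that concrete, the direction decision works roughly as follows
(a hedged sketch using the series' variable names, not a verbatim
excerpt; req_size/rsp_size are the virtio-scsi request/response header
sizes for the command):

/* Classify a command from the iovec totals in each direction. */
static void classify(struct vhost_virtqueue *vq, unsigned out, unsigned in,
		     size_t req_size, size_t rsp_size,
		     enum dma_data_direction *dir, u32 *exp_data_len)
{
	size_t out_size = iov_length(vq->iov, out);	   /* driver -> device */
	size_t in_size  = iov_length(&vq->iov[out], in);   /* device -> driver */

	if (out_size > req_size) {		/* payload follows the request header */
		*dir = DMA_TO_DEVICE;
		*exp_data_len = out_size - req_size;
	} else if (in_size > rsp_size) {	/* payload follows the response header */
		*dir = DMA_FROM_DEVICE;
		*exp_data_len = in_size - rsp_size;
	} else {
		*dir = DMA_NONE;		/* no data payload at all */
		*exp_data_len = 0;
	}
}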

It assumes request/CDB and response/sense_buffer headers may span more
than a single iovec using mm/iov_iter.c logic.

It also allows virtio-scsi headers + T10_PI + Data SGL payloads to span
the same iovec when pinning user-space memory via get_user_pages_fast()
code.  (Not tested yet)

Based upon MST + Al's review feedback, the v3 -> v4 changes include:

  - Use iov_length() helper for out_size + in_size (MST)
  - Make out_iter init correct use out_size (MST)
  - Move data_direction + exp_data_len until after copy_from_iter() (MST)
  - Avoid specific ANY_LAYOUT wording in vhost_scsi_handle_vq() comments (MST)
  - Merge patch !ANY_LAYOUT removal into #5, use existing _vq() name (MST)
  - Further comments for vhost_scsi_handle_vq() code (Viro)
  - Fix iov_iter_init() to use proper WRITE/READ i->type assignment (Viro)

Note the one part that has been left unchanged is vhost_scsi_map_to_sgl()
-> get_user_pages_fast(), for which the existing iov_iter_get_pages() code
would need to allow a callback or *sgl parameter to perform the associated
scatterlist setup from **pages for protection + data payloads.

It's functioning against the v3.19-rc1 virtio-scsi LLD in T10_PI mode using
TYPE-1 DIF with ANY_LAYOUT -> VERSION_1 guest feature bits enabled, following
the existing layout convention with protection/data SGL payloads residing
within separate iovecs from the virtio-scsi LLD guest.

Please review.

Thank you,

--nab

Nicholas Bellinger (8):
  vhost/scsi: Convert completion path to use copy_to_iter
  vhost/scsi: Fix incorrect early vhost_scsi_handle_vq failures
  vhost/scsi: Change vhost_scsi_map_to_sgl to accept iov ptr + len
  vhost/scsi: Add ANY_LAYOUT iov -> sgl mapping prerequisites
  vhost/scsi: Add ANY_LAYOUT support in vhost_scsi_handle_vq
  vhost/scsi: Set VIRTIO_F_ANY_LAYOUT + VIRTIO_F_VERSION_1 feature bits
  vhost/scsi: Drop left-over scsi_tcq.h include
  vhost/scsi: Global tcm_vhost -> vhost_scsi rename

 drivers/vhost/scsi.c | 1062 +-
 1 file changed, 538 insertions(+), 524 deletions(-)

-- 
1.9.1



[PATCH-v4 4/8] vhost/scsi: Add ANY_LAYOUT iov -> sgl mapping prerequisites

2015-02-05 Thread Nicholas A. Bellinger
From: Nicholas Bellinger 

This patch adds the ANY_LAYOUT prerequisite logic for accepting a set of
protection + data payloads via iov_iter.  It also includes helpers for
calculating SGLs and for invoking vhost_scsi_map_to_sgl() with a known
number of iovecs.

Required by ANY_LAYOUT processing when struct iovec may be offset into
the first outgoing virtio-scsi request header.
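
To make the "offset into the first iovec" case concrete: after the
request header is consumed with copy_from_iter(), iter->iov_offset can
be non-zero, and only segment 0 is shortened by it.  A standalone
sketch of that accounting (mirroring the 'off' handling in
vhost_scsi_iov_to_sgl() below, not a verbatim excerpt):

/* Count the payload bytes an offset iov_iter still covers: segment 0
 * contributes iov_len - iov_offset bytes, later segments their full
 * length, which is exactly the walk vhost_scsi_iov_to_sgl() performs. */
static size_t iter_remaining_bytes(const struct iov_iter *iter)
{
	size_t off = iter->iov_offset;	/* non-zero only after a partial copy */
	size_t remaining = 0;
	unsigned long i;

	for (i = 0; i < iter->nr_segs; i++) {
		remaining += iter->iov[i].iov_len - off;
		off = 0;		/* offset applies to segment 0 only */
	}
	return remaining;
}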

Cc: Michael S. Tsirkin 
Cc: Paolo Bonzini 
Cc: Al Viro 
Signed-off-by: Nicholas Bellinger 
---
 drivers/vhost/scsi.c | 93 
 1 file changed, 93 insertions(+)

diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index b9800c7..5396b8a 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -914,6 +914,99 @@ vhost_scsi_map_iov_to_prot(struct tcm_vhost_cmd *cmd,
return 0;
 }
 
+static int
+vhost_scsi_calc_sgls(struct iov_iter *iter, size_t bytes, int max_sgls)
+{
+   int sgl_count = 0;
+
+   if (!iter || !iter->iov) {
+   pr_err("%s: iter->iov is NULL, but expected bytes: %zu"
+  " present\n", __func__, bytes);
+   return -EINVAL;
+   }
+
+   sgl_count = iov_iter_npages(iter, 0xffff);
+   if (sgl_count > max_sgls) {
+   pr_err("%s: requested sgl_count: %d exceeds pre-allocated"
+  " max_sgls: %d\n", __func__, sgl_count, max_sgls);
+   return -EINVAL;
+   }
+   return sgl_count;
+}
+
+static int
+vhost_scsi_iov_to_sgl(struct tcm_vhost_cmd *cmd, bool write,
+ struct iov_iter *iter, struct scatterlist *sg,
+ int sg_count)
+{
+   size_t off = iter->iov_offset;
+   int i, ret;
+
+   for (i = 0; i < iter->nr_segs; i++) {
+   void __user *base = iter->iov[i].iov_base + off;
+   size_t len = iter->iov[i].iov_len - off;
+
+   ret = vhost_scsi_map_to_sgl(cmd, base, len, sg, write);
+   if (ret < 0) {
+   for (i = 0; i < sg_count; i++) {
+   struct page *page = sg_page(&sg[i]);
+   if (page)
+   put_page(page);
+   }
+   return ret;
+   }
+   sg += ret;
+   off = 0;
+   }
+   return 0;
+}
+
+static int
+vhost_scsi_mapal(struct tcm_vhost_cmd *cmd,
+size_t prot_bytes, struct iov_iter *prot_iter,
+size_t data_bytes, struct iov_iter *data_iter)
+{
+   int sgl_count, ret;
+   bool write = (cmd->tvc_data_direction == DMA_FROM_DEVICE);
+
+   if (prot_bytes) {
+   sgl_count = vhost_scsi_calc_sgls(prot_iter, prot_bytes,
+TCM_VHOST_PREALLOC_PROT_SGLS);
+   if (sgl_count < 0)
+   return sgl_count;
+
+   sg_init_table(cmd->tvc_prot_sgl, sgl_count);
+   cmd->tvc_prot_sgl_count = sgl_count;
+   pr_debug("%s prot_sg %p prot_sgl_count %u\n", __func__,
+cmd->tvc_prot_sgl, cmd->tvc_prot_sgl_count);
+
+   ret = vhost_scsi_iov_to_sgl(cmd, write, prot_iter,
+   cmd->tvc_prot_sgl,
+   cmd->tvc_prot_sgl_count);
+   if (ret < 0) {
+   cmd->tvc_prot_sgl_count = 0;
+   return ret;
+   }
+   }
+   sgl_count = vhost_scsi_calc_sgls(data_iter, data_bytes,
+TCM_VHOST_PREALLOC_SGLS);
+   if (sgl_count < 0)
+   return sgl_count;
+
+   sg_init_table(cmd->tvc_sgl, sgl_count);
+   cmd->tvc_sgl_count = sgl_count;
+   pr_debug("%s data_sg %p data_sgl_count %u\n", __func__,
+ cmd->tvc_sgl, cmd->tvc_sgl_count);
+
+   ret = vhost_scsi_iov_to_sgl(cmd, write, data_iter,
+   cmd->tvc_sgl, cmd->tvc_sgl_count);
+   if (ret < 0) {
+   cmd->tvc_sgl_count = 0;
+   return ret;
+   }
+   return 0;
+}
+
 static void tcm_vhost_submission_work(struct work_struct *work)
 {
struct tcm_vhost_cmd *cmd =
-- 
1.9.1



[PATCH-v4 1/8] vhost/scsi: Convert completion path to use copy_to_iter

2015-02-05 Thread Nicholas A. Bellinger
From: Nicholas Bellinger 

Required for ANY_LAYOUT support, where the incoming virtio-scsi response
header + fixed-size sense buffer payload may span more than a single
iovec entry.

This changes the existing code to save cmd->tvc_resp_iov instead of the
first single iovec base pointer from &vq->iov[out].
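
For example, under ANY_LAYOUT a guest may legally split the 108-byte
virtio_scsi_cmd_resp across two 64-byte iovecs; the old single
copy_to_user() on iov[0].iov_base would overrun the first segment,
while copy_to_iter() walks both.  A standalone sketch, assuming
seg0/seg1 are guest buffer pointers (not the vhost code itself):

static int send_split_response(void __user *seg0, void __user *seg1,
			       struct virtio_scsi_cmd_resp *v_rsp)
{
	struct iovec resp_iov[2] = {
		{ .iov_base = seg0, .iov_len = 64 },
		{ .iov_base = seg1, .iov_len = 64 },
	};
	struct iov_iter iter;

	iov_iter_init(&iter, READ, resp_iov, 2, sizeof(*v_rsp));
	if (copy_to_iter(v_rsp, sizeof(*v_rsp), &iter) != sizeof(*v_rsp))
		return -EFAULT;	/* short copy: guest memory fault */
	return 0;
}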

Cc: Michael S. Tsirkin 
Cc: Paolo Bonzini 
Cc: Al Viro 
Signed-off-by: Nicholas Bellinger 
---
 drivers/vhost/scsi.c | 17 -
 1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index 01c01cb..29dfdf6 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -72,6 +72,8 @@ struct tcm_vhost_cmd {
int tvc_vq_desc;
/* virtio-scsi initiator task attribute */
int tvc_task_attr;
+   /* virtio-scsi response incoming iovecs */
+   int tvc_in_iovs;
/* virtio-scsi initiator data direction */
enum dma_data_direction tvc_data_direction;
/* Expected data transfer length from virtio-scsi header */
@@ -87,8 +89,8 @@ struct tcm_vhost_cmd {
struct scatterlist *tvc_sgl;
struct scatterlist *tvc_prot_sgl;
struct page **tvc_upages;
-   /* Pointer to response */
-   struct virtio_scsi_cmd_resp __user *tvc_resp;
+   /* Pointer to response header iovec */
+   struct iovec *tvc_resp_iov;
/* Pointer to vhost_scsi for our device */
struct vhost_scsi *tvc_vhost;
/* Pointer to vhost_virtqueue for the cmd */
@@ -682,6 +684,7 @@ static void vhost_scsi_complete_cmd_work(struct vhost_work *work)
struct tcm_vhost_cmd *cmd;
struct llist_node *llnode;
struct se_cmd *se_cmd;
+   struct iov_iter iov_iter;
int ret, vq;
 
bitmap_zero(signal, VHOST_SCSI_MAX_VQ);
@@ -703,8 +706,11 @@ static void vhost_scsi_complete_cmd_work(struct vhost_work *work)
 se_cmd->scsi_sense_length);
memcpy(v_rsp.sense, cmd->tvc_sense_buf,
   se_cmd->scsi_sense_length);
-   ret = copy_to_user(cmd->tvc_resp, &v_rsp, sizeof(v_rsp));
-   if (likely(ret == 0)) {
+
+   iov_iter_init(&iov_iter, READ, cmd->tvc_resp_iov,
+ cmd->tvc_in_iovs, sizeof(v_rsp));
+   ret = copy_to_iter(&v_rsp, sizeof(v_rsp), &iov_iter);
+   if (likely(ret == sizeof(v_rsp))) {
struct vhost_scsi_virtqueue *q;
vhost_add_used(cmd->tvc_vq, cmd->tvc_vq_desc, 0);
q = container_of(cmd->tvc_vq, struct vhost_scsi_virtqueue, vq);
@@ -1159,7 +1165,8 @@ vhost_scsi_handle_vq(struct vhost_scsi *vs, struct vhost_virtqueue *vq)
 
cmd->tvc_vhost = vs;
cmd->tvc_vq = vq;
-   cmd->tvc_resp = vq->iov[out].iov_base;
+   cmd->tvc_resp_iov = &vq->iov[out];
+   cmd->tvc_in_iovs = in;
 
pr_debug("vhost_scsi got command opcode: %#02x, lun: %d\n",
cmd->tvc_cdb[0], cmd->tvc_lun);
-- 
1.9.1



[PATCH-v4 7/8] vhost/scsi: Drop left-over scsi_tcq.h include

2015-02-05 Thread Nicholas A. Bellinger
From: Nicholas Bellinger 

With the recent removal of MSG_*_TAG defines in commit 68d81f40,
vhost-scsi is now using TCM_*_TAG and doesn't depend upon host
side scsi_tcq.h definitions anymore.

Cc: Michael S. Tsirkin 
Acked-by: Michael S. Tsirkin 
Cc: Paolo Bonzini 
Signed-off-by: Nicholas Bellinger 
---
 drivers/vhost/scsi.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index b5e2b40..1a66fcf 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -38,7 +38,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
-- 
1.9.1



[PATCH-v4 6/8] vhost/scsi: Set VIRTIO_F_ANY_LAYOUT + VIRTIO_F_VERSION_1 feature bits

2015-02-05 Thread Nicholas A. Bellinger
From: Nicholas Bellinger 

Signal support for the VIRTIO_F_ANY_LAYOUT + VIRTIO_F_VERSION_1 feature
bits, as required by the virtio-scsi 1.0 spec layout rules.
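
As background for how these bits reach the guest: userspace negotiates
features with the standard vhost ioctls, so VHOST_SCSI_FEATURES is
simply the backend's offer.  A hedged userspace sketch (error handling
elided; vhost_fd is an open /dev/vhost-scsi descriptor and guest_acked
is a stand-in for the bits the guest driver acknowledged):

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/vhost.h>

/* Intersect the backend's offer with what the guest acked. */
static void negotiate_features(int vhost_fd, uint64_t guest_acked)
{
	uint64_t features;

	ioctl(vhost_fd, VHOST_GET_FEATURES, &features);	/* offer: VHOST_SCSI_FEATURES */
	features &= guest_acked;	/* e.g. keep ANY_LAYOUT, VERSION_1 */
	ioctl(vhost_fd, VHOST_SET_FEATURES, &features);
}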

Cc: Michael S. Tsirkin 
Acked-by: Michael S. Tsirkin 
Cc: Paolo Bonzini 
Signed-off-by: Nicholas Bellinger 
---
 drivers/vhost/scsi.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index e53959f..b5e2b40 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -173,7 +173,9 @@ enum {
 /* Note: can't set VIRTIO_F_VERSION_1 yet, since that implies ANY_LAYOUT. */
 enum {
VHOST_SCSI_FEATURES = VHOST_FEATURES | (1ULL << VIRTIO_SCSI_F_HOTPLUG) |
-  (1ULL << VIRTIO_SCSI_F_T10_PI)
+  (1ULL << VIRTIO_SCSI_F_T10_PI) |
+  (1ULL << VIRTIO_F_ANY_LAYOUT) |
+  (1ULL << VIRTIO_F_VERSION_1)
 };
 
 #define VHOST_SCSI_MAX_TARGET  256
-- 
1.9.1



[PATCH-v4 2/8] vhost/scsi: Fix incorrect early vhost_scsi_handle_vq failures

2015-02-05 Thread Nicholas A. Bellinger
From: Nicholas Bellinger 

This patch fixes vhost_scsi_handle_vq() failure cases that result in BUG_ON()
being triggered when vhost_scsi_free_cmd() is called and ->tvc_se_cmd has
not been initialized by target_submit_cmd_map_sgls().

It changes tcm_vhost_release_cmd() to use tcm_vhost_cmd->tvc_nexus for
obtaining the se_session pointer reference.  It also avoids calling
put_page() on NULL sg->page entries in the vhost_scsi_map_to_sgl()
failure path.

Cc: Michael S. Tsirkin 
Cc: Paolo Bonzini 
Signed-off-by: Nicholas Bellinger 
---
 drivers/vhost/scsi.c | 52 +---
 1 file changed, 29 insertions(+), 23 deletions(-)

diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index 29dfdf6..62de820 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -462,7 +462,7 @@ static void tcm_vhost_release_cmd(struct se_cmd *se_cmd)
 {
struct tcm_vhost_cmd *tv_cmd = container_of(se_cmd,
struct tcm_vhost_cmd, tvc_se_cmd);
-   struct se_session *se_sess = se_cmd->se_sess;
+   struct se_session *se_sess = tv_cmd->tvc_nexus->tvn_se_sess;
int i;
 
if (tv_cmd->tvc_sgl_count) {
@@ -864,9 +864,11 @@ vhost_scsi_map_iov_to_sgl(struct tcm_vhost_cmd *cmd,
ret = vhost_scsi_map_to_sgl(cmd, sg, sgl_count, &iov[i],
cmd->tvc_upages, write);
if (ret < 0) {
-   for (i = 0; i < cmd->tvc_sgl_count; i++)
-   put_page(sg_page(&cmd->tvc_sgl[i]));
-
+   for (i = 0; i < cmd->tvc_sgl_count; i++) {
+   struct page *page = sg_page(&cmd->tvc_sgl[i]);
+   if (page)
+   put_page(page);
+   }
cmd->tvc_sgl_count = 0;
return ret;
}
@@ -905,9 +907,11 @@ vhost_scsi_map_iov_to_prot(struct tcm_vhost_cmd *cmd,
ret = vhost_scsi_map_to_sgl(cmd, prot_sg, prot_sgl_count, &iov[i],
cmd->tvc_upages, write);
if (ret < 0) {
-   for (i = 0; i < cmd->tvc_prot_sgl_count; i++)
-   put_page(sg_page(&cmd->tvc_prot_sgl[i]));
-
+   for (i = 0; i < cmd->tvc_prot_sgl_count; i++) {
+   struct page *page = sg_page(&cmd->tvc_prot_sgl[i]);
+   if (page)
+   put_page(page);
+   }
cmd->tvc_prot_sgl_count = 0;
return ret;
}
@@ -1065,12 +1069,14 @@ vhost_scsi_handle_vq(struct vhost_scsi *vs, struct vhost_virtqueue *vq)
if (unlikely(vq->iov[0].iov_len < req_size)) {
pr_err("Expecting virtio-scsi header: %zu, got %zu\n",
   req_size, vq->iov[0].iov_len);
-   break;
+   vhost_scsi_send_bad_target(vs, vq, head, out);
+   continue;
}
ret = memcpy_fromiovecend(req, &vq->iov[0], 0, req_size);
if (unlikely(ret)) {
vq_err(vq, "Faulted on virtio_scsi_cmd_req\n");
-   break;
+   vhost_scsi_send_bad_target(vs, vq, head, out);
+   continue;
}
 
/* virtio-scsi spec requires byte 0 of the lun to be 1 */
@@ -1101,14 +1107,16 @@ vhost_scsi_handle_vq(struct vhost_scsi *vs, struct vhost_virtqueue *vq)
if (data_direction != DMA_TO_DEVICE) {
vq_err(vq, "Received non zero 
do_pi_niov"
", but wrong data_direction\n");
-   goto err_cmd;
+   vhost_scsi_send_bad_target(vs, vq, head, out);
+   continue;
}
prot_bytes = vhost32_to_cpu(vq, v_req_pi.pi_bytesout);
} else if (v_req_pi.pi_bytesin) {
if (data_direction != DMA_FROM_DEVICE) {
vq_err(vq, "Received non zero 
di_pi_niov"
", but wrong data_direction\n");
-   goto err_cmd;
+   vhost_scsi_send_bad_target(vs, vq, head, out);
+   continue;
}
prot_bytes = vhost32_to_cpu(vq, v_req_pi.pi_bytesin);
}
@@ -1148,7 +1156,8 @@ vhost_scsi_handle_vq(struct vhost_scsi *vs, struct vhost_virtqueue *vq)
  

[PATCH-v4 5/8] vhost/scsi: Add ANY_LAYOUT support in vhost_scsi_handle_vq

2015-02-05 Thread Nicholas A. Bellinger
From: Nicholas Bellinger 

This patch adds ANY_LAYOUT compatible support within the existing
vhost_scsi_handle_vq() ->handle_kick() callback.

It calculates data_direction + exp_data_len for the new tcm_vhost_cmd
descriptor by walking both outgoing + incoming iovecs using iov_iter,
assuming the layout of outgoing request header + T10_PI + data payload
comes first.

It also uses copy_from_iter() to copy the leading virtio-scsi request
header, which may or may not include the SCSI CDB, leaving the iterator
re-positioned at the start of the T10_PI or data SGL memory.

Also, go ahead and drop the legacy pre virtio v1.0 !ANY_LAYOUT logic.

Cc: Michael S. Tsirkin 
Cc: Paolo Bonzini 
Cc: Al Viro 
Signed-off-by: Nicholas Bellinger 
---
 drivers/vhost/scsi.c | 306 ++-
 1 file changed, 110 insertions(+), 196 deletions(-)

diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index 5396b8a..e53959f 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -828,93 +828,6 @@ out:
 }
 
 static int
-vhost_scsi_map_iov_to_sgl(struct tcm_vhost_cmd *cmd,
- struct iovec *iov,
- int niov,
- bool write)
-{
-   struct scatterlist *sg = cmd->tvc_sgl;
-   unsigned int sgl_count = 0;
-   int ret, i;
-
-   for (i = 0; i < niov; i++)
-   sgl_count += iov_num_pages(iov[i].iov_base, iov[i].iov_len);
-
-   if (sgl_count > TCM_VHOST_PREALLOC_SGLS) {
-   pr_err("vhost_scsi_map_iov_to_sgl() sgl_count: %u greater than"
-   " preallocated TCM_VHOST_PREALLOC_SGLS: %u\n",
-   sgl_count, TCM_VHOST_PREALLOC_SGLS);
-   return -ENOBUFS;
-   }
-
-   pr_debug("%s sg %p sgl_count %u\n", __func__, sg, sgl_count);
-   sg_init_table(sg, sgl_count);
-   cmd->tvc_sgl_count = sgl_count;
-
-   pr_debug("Mapping iovec %p for %u pages\n", &iov[0], sgl_count);
-
-   for (i = 0; i < niov; i++) {
-   ret = vhost_scsi_map_to_sgl(cmd, iov[i].iov_base, iov[i].iov_len,
-   sg, write);
-   if (ret < 0) {
-   for (i = 0; i < cmd->tvc_sgl_count; i++) {
-   struct page *page = sg_page(&cmd->tvc_sgl[i]);
-   if (page)
-   put_page(page);
-   }
-   cmd->tvc_sgl_count = 0;
-   return ret;
-   }
-   sg += ret;
-   sgl_count -= ret;
-   }
-   return 0;
-}
-
-static int
-vhost_scsi_map_iov_to_prot(struct tcm_vhost_cmd *cmd,
-  struct iovec *iov,
-  int niov,
-  bool write)
-{
-   struct scatterlist *prot_sg = cmd->tvc_prot_sgl;
-   unsigned int prot_sgl_count = 0;
-   int ret, i;
-
-   for (i = 0; i < niov; i++)
-   prot_sgl_count += iov_num_pages(iov[i].iov_base, iov[i].iov_len);
-
-   if (prot_sgl_count > TCM_VHOST_PREALLOC_PROT_SGLS) {
-   pr_err("vhost_scsi_map_iov_to_prot() sgl_count: %u greater than"
-   " preallocated TCM_VHOST_PREALLOC_PROT_SGLS: %u\n",
-   prot_sgl_count, TCM_VHOST_PREALLOC_PROT_SGLS);
-   return -ENOBUFS;
-   }
-
-   pr_debug("%s prot_sg %p prot_sgl_count %u\n", __func__,
-prot_sg, prot_sgl_count);
-   sg_init_table(prot_sg, prot_sgl_count);
-   cmd->tvc_prot_sgl_count = prot_sgl_count;
-
-   for (i = 0; i < niov; i++) {
-   ret = vhost_scsi_map_to_sgl(cmd, iov[i].iov_base, iov[i].iov_len,
-   prot_sg, write);
-   if (ret < 0) {
-   for (i = 0; i < cmd->tvc_prot_sgl_count; i++) {
-   struct page *page = sg_page(&cmd->tvc_prot_sgl[i]);
-   if (page)
-   put_page(page);
-   }
-   cmd->tvc_prot_sgl_count = 0;
-   return ret;
-   }
-   prot_sg += ret;
-   prot_sgl_count -= ret;
-   }
-   return 0;
-}
-
-static int
 vhost_scsi_calc_sgls(struct iov_iter *iter, size_t bytes, int max_sgls)
 {
int sgl_count = 0;
@@ -1064,19 +977,20 @@ vhost_scsi_send_bad_target(struct vhost_scsi *vs,
 static void
 vhost_scsi_handle_vq(struct vhost_scsi *vs, struct vhost_virtqueue *vq)
 {
-   struct tcm_vhost_tpg **vs_tpg;
+   struct tcm_vhost_tpg **vs_tpg, *tpg;
struct virtio_scsi_cmd_req v_req;
struct virtio_scsi_cmd_req_pi v_req_pi;
-   struct tcm_vhost_tpg *tpg;
struct tcm_vhost_cmd *cmd;
+   struct iov_iter out_iter, in_iter, prot_iter, data_iter;
u64 tag;
-   u32 exp_data_len, data_first, data_num, data_direction, prot_first;
-   

[PATCH-v4 3/8] vhost/scsi: Change vhost_scsi_map_to_sgl to accept iov ptr + len

2015-02-05 Thread Nicholas A. Bellinger
From: Nicholas Bellinger 

This patch changes the vhost_scsi_map_to_sgl() parameters to accept a
virtio iovec ptr + len when determining pages_nr.

This is currently done with iov_num_pages() -> PAGE_ALIGN, so allow the
same parameters there as well.
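
As a quick sanity check of that PAGE_ALIGN arithmetic, a standalone
sketch with 4 KiB pages (the EX_* macros are local stand-ins for the
kernel's PAGE_* macros):

#define EX_PAGE_SHIFT	12
#define EX_PAGE_SIZE	(1UL << EX_PAGE_SHIFT)
#define EX_PAGE_MASK	(~(EX_PAGE_SIZE - 1))
#define EX_PAGE_ALIGN(x) (((x) + EX_PAGE_SIZE - 1) & EX_PAGE_MASK)

/* A 0x20-byte buffer starting 0x10 bytes before a page boundary
 * straddles two pages, and the formula agrees: */
static unsigned long example_nr_pages(void)
{
	unsigned long base = 0x0ff0, len = 0x20;

	return (EX_PAGE_ALIGN(base + len) - (base & EX_PAGE_MASK))
			>> EX_PAGE_SHIFT;	/* returns 2 */
}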

Cc: Michael S. Tsirkin 
Cc: Paolo Bonzini 
Signed-off-by: Nicholas Bellinger 
---
 drivers/vhost/scsi.c | 37 +++--
 1 file changed, 15 insertions(+), 22 deletions(-)

diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index 62de820..b9800c7 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -222,10 +222,10 @@ static struct workqueue_struct *tcm_vhost_workqueue;
 static DEFINE_MUTEX(tcm_vhost_mutex);
 static LIST_HEAD(tcm_vhost_list);
 
-static int iov_num_pages(struct iovec *iov)
+static int iov_num_pages(void __user *iov_base, size_t iov_len)
 {
-   return (PAGE_ALIGN((unsigned long)iov->iov_base + iov->iov_len) -
-  ((unsigned long)iov->iov_base & PAGE_MASK)) >> PAGE_SHIFT;
+   return (PAGE_ALIGN((unsigned long)iov_base + iov_len) -
+  ((unsigned long)iov_base & PAGE_MASK)) >> PAGE_SHIFT;
 }
 
 static void tcm_vhost_done_inflight(struct kref *kref)
@@ -782,25 +782,18 @@ vhost_scsi_get_tag(struct vhost_virtqueue *vq, struct tcm_vhost_tpg *tpg,
  * Returns the number of scatterlist entries used or -errno on error.
  */
 static int
-vhost_scsi_map_to_sgl(struct tcm_vhost_cmd *tv_cmd,
+vhost_scsi_map_to_sgl(struct tcm_vhost_cmd *cmd,
+ void __user *ptr,
+ size_t len,
  struct scatterlist *sgl,
- unsigned int sgl_count,
- struct iovec *iov,
- struct page **pages,
  bool write)
 {
-   unsigned int npages = 0, pages_nr, offset, nbytes;
+   unsigned int npages = 0, offset, nbytes;
+   unsigned int pages_nr = iov_num_pages(ptr, len);
struct scatterlist *sg = sgl;
-   void __user *ptr = iov->iov_base;
-   size_t len = iov->iov_len;
+   struct page **pages = cmd->tvc_upages;
int ret, i;
 
-   pages_nr = iov_num_pages(iov);
-   if (pages_nr > sgl_count) {
-   pr_err("vhost_scsi_map_to_sgl() pages_nr: %u greater than"
-  " sgl_count: %u\n", pages_nr, sgl_count);
-   return -ENOBUFS;
-   }
if (pages_nr > TCM_VHOST_PREALLOC_UPAGES) {
pr_err("vhost_scsi_map_to_sgl() pages_nr: %u greater than"
   " preallocated TCM_VHOST_PREALLOC_UPAGES: %u\n",
@@ -845,7 +838,7 @@ vhost_scsi_map_iov_to_sgl(struct tcm_vhost_cmd *cmd,
int ret, i;
 
for (i = 0; i < niov; i++)
-   sgl_count += iov_num_pages(&iov[i]);
+   sgl_count += iov_num_pages(iov[i].iov_base, iov[i].iov_len);
 
if (sgl_count > TCM_VHOST_PREALLOC_SGLS) {
pr_err("vhost_scsi_map_iov_to_sgl() sgl_count: %u greater than"
@@ -861,8 +854,8 @@ vhost_scsi_map_iov_to_sgl(struct tcm_vhost_cmd *cmd,
pr_debug("Mapping iovec %p for %u pages\n", &iov[0], sgl_count);
 
for (i = 0; i < niov; i++) {
-   ret = vhost_scsi_map_to_sgl(cmd, sg, sgl_count, &iov[i],
-   cmd->tvc_upages, write);
+   ret = vhost_scsi_map_to_sgl(cmd, iov[i].iov_base, iov[i].iov_len,
+   sg, write);
if (ret < 0) {
for (i = 0; i < cmd->tvc_sgl_count; i++) {
struct page *page = sg_page(&cmd->tvc_sgl[i]);
@@ -889,7 +882,7 @@ vhost_scsi_map_iov_to_prot(struct tcm_vhost_cmd *cmd,
int ret, i;
 
for (i = 0; i < niov; i++)
-   prot_sgl_count += iov_num_pages(&iov[i]);
+   prot_sgl_count += iov_num_pages(iov[i].iov_base, iov[i].iov_len);
 
if (prot_sgl_count > TCM_VHOST_PREALLOC_PROT_SGLS) {
pr_err("vhost_scsi_map_iov_to_prot() sgl_count: %u greater than"
@@ -904,8 +897,8 @@ vhost_scsi_map_iov_to_prot(struct tcm_vhost_cmd *cmd,
cmd->tvc_prot_sgl_count = prot_sgl_count;
 
for (i = 0; i < niov; i++) {
-   ret = vhost_scsi_map_to_sgl(cmd, prot_sg, prot_sgl_count, &iov[i],
-   cmd->tvc_upages, write);
+   ret = vhost_scsi_map_to_sgl(cmd, iov[i].iov_base, iov[i].iov_len,
+   prot_sg, write);
if (ret < 0) {
for (i = 0; i < cmd->tvc_prot_sgl_count; i++) {
struct page *page = sg_page(&cmd->tvc_prot_sgl[i]);
-- 
1.9.1



[PATCH-v4 8/8] vhost/scsi: Global tcm_vhost -> vhost_scsi rename

2015-02-05 Thread Nicholas A. Bellinger
From: Nicholas Bellinger 

There is a large amount of code that still references the original
'tcm_vhost' naming convention, instead of the modern 'vhost_scsi'.

Go ahead and do a global rename to make the usage consistent.

Cc: Michael S. Tsirkin 
Acked-by: Michael S. Tsirkin 
Cc: Paolo Bonzini 
Signed-off-by: Nicholas Bellinger 
---
 drivers/vhost/scsi.c | 662 +--
 1 file changed, 331 insertions(+), 331 deletions(-)

diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index 1a66fcf..daf10b7 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -51,13 +51,13 @@
 
 #include "vhost.h"
 
-#define TCM_VHOST_VERSION  "v0.1"
-#define TCM_VHOST_NAMELEN 256
-#define TCM_VHOST_MAX_CDB_SIZE 32
-#define TCM_VHOST_DEFAULT_TAGS 256
-#define TCM_VHOST_PREALLOC_SGLS 2048
-#define TCM_VHOST_PREALLOC_UPAGES 2048
-#define TCM_VHOST_PREALLOC_PROT_SGLS 512
+#define VHOST_SCSI_VERSION  "v0.1"
+#define VHOST_SCSI_NAMELEN 256
+#define VHOST_SCSI_MAX_CDB_SIZE 32
+#define VHOST_SCSI_DEFAULT_TAGS 256
+#define VHOST_SCSI_PREALLOC_SGLS 2048
+#define VHOST_SCSI_PREALLOC_UPAGES 2048
+#define VHOST_SCSI_PREALLOC_PROT_SGLS 512
 
 struct vhost_scsi_inflight {
/* Wait for the flush operation to finish */
@@ -66,7 +66,7 @@ struct vhost_scsi_inflight {
struct kref kref;
 };
 
-struct tcm_vhost_cmd {
+struct vhost_scsi_cmd {
/* Descriptor from vhost_get_vq_desc() for virt_queue segment */
int tvc_vq_desc;
/* virtio-scsi initiator task attribute */
@@ -82,7 +82,7 @@ struct tcm_vhost_cmd {
/* The number of scatterlists associated with this cmd */
u32 tvc_sgl_count;
u32 tvc_prot_sgl_count;
-   /* Saved unpacked SCSI LUN for tcm_vhost_submission_work() */
+   /* Saved unpacked SCSI LUN for vhost_scsi_submission_work() */
u32 tvc_lun;
/* Pointer to the SGL formatted memory from virtio-scsi */
struct scatterlist *tvc_sgl;
@@ -95,13 +95,13 @@ struct tcm_vhost_cmd {
/* Pointer to vhost_virtqueue for the cmd */
struct vhost_virtqueue *tvc_vq;
/* Pointer to vhost nexus memory */
-   struct tcm_vhost_nexus *tvc_nexus;
+   struct vhost_scsi_nexus *tvc_nexus;
/* The TCM I/O descriptor that is accessed via container_of() */
struct se_cmd tvc_se_cmd;
-   /* work item used for cmwq dispatch to tcm_vhost_submission_work() */
+   /* work item used for cmwq dispatch to vhost_scsi_submission_work() */
struct work_struct work;
/* Copy of the incoming SCSI command descriptor block (CDB) */
-   unsigned char tvc_cdb[TCM_VHOST_MAX_CDB_SIZE];
+   unsigned char tvc_cdb[VHOST_SCSI_MAX_CDB_SIZE];
/* Sense buffer that will be mapped into outgoing status */
unsigned char tvc_sense_buf[TRANSPORT_SENSE_BUFFER];
/* Completed commands list, serviced from vhost worker thread */
@@ -110,53 +110,53 @@ struct tcm_vhost_cmd {
struct vhost_scsi_inflight *inflight;
 };
 
-struct tcm_vhost_nexus {
+struct vhost_scsi_nexus {
/* Pointer to TCM session for I_T Nexus */
struct se_session *tvn_se_sess;
 };
 
-struct tcm_vhost_nacl {
+struct vhost_scsi_nacl {
/* Binary World Wide unique Port Name for Vhost Initiator port */
u64 iport_wwpn;
/* ASCII formatted WWPN for Sas Initiator port */
-   char iport_name[TCM_VHOST_NAMELEN];
-   /* Returned by tcm_vhost_make_nodeacl() */
+   char iport_name[VHOST_SCSI_NAMELEN];
+   /* Returned by vhost_scsi_make_nodeacl() */
struct se_node_acl se_node_acl;
 };
 
-struct tcm_vhost_tpg {
+struct vhost_scsi_tpg {
/* Vhost port target portal group tag for TCM */
u16 tport_tpgt;
/* Used to track number of TPG Port/Lun Links wrt to explict I_T Nexus shutdown */
int tv_tpg_port_count;
/* Used for vhost_scsi device reference to tpg_nexus, protected by tv_tpg_mutex */
int tv_tpg_vhost_count;
-   /* list for tcm_vhost_list */
+   /* list for vhost_scsi_list */
struct list_head tv_tpg_list;
/* Used to protect access for tpg_nexus */
struct mutex tv_tpg_mutex;
/* Pointer to the TCM VHost I_T Nexus for this TPG endpoint */
-   struct tcm_vhost_nexus *tpg_nexus;
-   /* Pointer back to tcm_vhost_tport */
-   struct tcm_vhost_tport *tport;
-   /* Returned by tcm_vhost_make_tpg() */
+   struct vhost_scsi_nexus *tpg_nexus;
+   /* Pointer back to vhost_scsi_tport */
+   struct vhost_scsi_tport *tport;
+   /* Returned by vhost_scsi_make_tpg() */
struct se_portal_group se_tpg;
/* Pointer back to vhost_scsi, protected by tv_tpg_mutex */
struct vhost_scsi *vhost_scsi;
 };
 
-struct tcm_vhost_tport {
+struct vhost_scsi_tport {
/* SCSI protocol the tport is providing */
u8 tport_proto_id;
/* Binary World Wide unique Port Name for Vhost Target port */
u64 tport_wwpn;
/*

Windows XP guest latch up on KVM with recent kernels

2015-02-05 Thread Brad Campbell

G'day all,

I'm at a bit of a loss. I run a CCTV storage server on a Windows XP 
guest running on an AMD64 (Piledriver) host with a 64-bit kernel and 
userspace. This has been running well for a number of years.


I recently upgraded from a 3.15 kernel to 3.17 and now 3.18. The 3.15 
kernel runs fine. Running 3.17 or later causes the VM to latch up after 
anywhere between 30 minutes and 24 hours of running. I say latch up 
rather than lock up because if I run virt-viewer and wiggle the mouse 
around it springs back to life, although the system clock is by then 
hours behind. I can see when it latches up as I have cacti monitoring 
its network interface. The system clock stops dead as soon as the 
traffic does.


When it latches up, qemu spins using 100% of however many cores I have 
allocated. The guest does not respond on the virtio network adapter, and 
the system clock ceases ticking.


I have several XP VMs on this machine, all configured with the same 
hardware (virtio storage & network), and only the CCTV VM exhibits this 
behavior.


Reverting to 3.15.x makes it go away. I've not had the opportunity 
to try 3.16 yet but it's on my todo list for this weekend.


I've tried several versions of qemu, and I'm currently on git HEAD. The 
problem tracks the kernel.


This is a production box, and given it can take hours to reproduce I 
can't really run a bisect on it. I've also been unable to replicate it 
thus far on an "identical" test machine, although I'll continue to try 
as it'll make bisecting more viable.


It's almost as if the guest stops responding to interrupts, yet console 
activity wakes it back up.


I'm only posting this in the vain hope it pricks someone's spidey senses 
and gives me a bit of a leg up in debugging it.


I'm using guest drivers from spice-guest-tools-0.74 on all the guests.

Regards,
Brad


Re: [PATCH RFC] kvm: x86: add halt_poll module parameter

2015-02-05 Thread Paolo Bonzini


On 06/02/2015 00:34, Marcelo Tosatti wrote:
> You want at least a basic procedure to estimate a value
> (it's a function of the device after all).

I will add a tracepoint.

> Rather than halt_successful_poll's, I suppose the optimum 
> can be estimated from a dataset containing entries
> in the form:
> 
> interrupt time - hlt time
> 
> Then choose a given value from that table.

Yes.

Paolo
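
For reference, a hedged sketch of that estimation procedure as an
offline tool over the future tracepoint's output (not kernel code;
every name here is illustrative):

#include <stdlib.h>

static int cmp_u64(const void *a, const void *b)
{
	unsigned long long x = *(const unsigned long long *)a;
	unsigned long long y = *(const unsigned long long *)b;
	return (x > y) - (x < y);
}

/* deltas[] holds per-wakeup "interrupt time - hlt time" samples in ns,
 * as collected from the proposed tracepoint; pick halt_poll as e.g.
 * the 90th percentile so most wakeups land inside the poll window. */
static unsigned long long pick_halt_poll_ns(unsigned long long *deltas,
					    size_t n, double pct)
{
	if (n == 0)
		return 0;
	qsort(deltas, n, sizeof(*deltas), cmp_u64);
	return deltas[(size_t)(pct * (double)(n - 1))];
}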