date:20150814

Re: x2APIC and number of the vcpu

2015-08-14 Thread Paolo Bonzini



On 15/08/2015 01:03, Ozgur O Kilic wrote:
> My question is: is it theoretically possible, or has anyone
> tried it? If it is, I checked that my PC's hardware supports x2APIC;
> which version of KVM should I use for that?

KVM doesn't support more than 256 VCPUs, even with x2APIC enabled in the
guest.

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
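
The arithmetic behind this exchange can be sketched in C. xAPIC uses an
8-bit APIC ID and x2APIC a 32-bit one, so the width of the ID field only
sets an architectural ceiling; the hypervisor's own limit (256 VCPUs in the
reply above) is what applies in practice. The function names are
illustrative assumptions and are not KVM code.

```c
#include <stdint.h>

/* Number of distinct IDs an id_bits-wide APIC ID field can encode.
 * (One of them is reserved for broadcast on real hardware.) */
uint64_t apic_id_space(unsigned id_bits)
{
    return id_bits >= 64 ? UINT64_MAX : (uint64_t)1 << id_bits;
}

/* The usable VCPU count is the smaller of the architectural ID space
 * and whatever limit the hypervisor imposes (hv_limit is an input here,
 * not something this sketch queries from KVM). */
uint64_t effective_vcpu_limit(unsigned id_bits, uint64_t hv_limit)
{
    uint64_t arch = apic_id_space(id_bits);
    return arch < hv_limit ? arch : hv_limit;
}
```

With an 8-bit ID the space is 256; with a 32-bit ID it is 2^32, but a
256-VCPU hypervisor limit still caps the result at 256.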

[GIT PULL] KVM patches for 4.2-rc7

2015-08-14 Thread Paolo Bonzini

Linus,

The following changes since commit fc1a8126bf8095b10f5a79893f2d2b19227f88f2:

  KVM: MTRR: Use default type for non-MTRR-covered gfn before WARN_ON (2015-08-05 11:57:57 +0200)

are available in the git repository at:

  git://git.kernel.org/pub/scm/virt/kvm/kvm.git tags/for-linus

for you to fetch changes up to d7add05458084a5e3d65925764a02ca9c8202c1e:

  KVM: x86: Use adjustment in guest cycles when handling MSR_IA32_TSC_ADJUST (2015-08-07 13:28:03 +0200)


Just two very small & simple patches.


Haozhong Zhang (1):
  KVM: x86: Use adjustment in guest cycles when handling MSR_IA32_TSC_ADJUST

Paolo Bonzini (1):
  KVM: x86: zero IDT limit on entry to SMM

 arch/x86/kvm/x86.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

Re: [PATCH 2/2] KVM: x86: fix edge EOI and IOAPIC reconfig race

2015-08-14 Thread Paolo Bonzini



On 14/08/2015 10:38, Radim Krčmář wrote:
>> How do you reproduce the bug?
> I run rhel4 (2.6.9) kernel on 2 VCPUs and frequently alternate
> smp_affinity of "timer".  The bug is hit within seconds.

Nice, I'll try to make a unit test for it on the plane. :)

Paolo

[GIT PULL] Early batch of KVM changes for 4.3 merge window

2015-08-14 Thread Paolo Bonzini

Linus,

the merge window will likely coincide with my two-week vacation
across August and September.

Rather than hoping for an -rc8, I'm sending now what I have.  The
PPC and ARM parts might come a few days after the official end
of the merge window.

The uncommon name for the tag (you said you look at things like
that) is due to the other pull request that is in flight for 4.2.
You can see on kernel.org that I and the other maintainers before
me have always used this format to archive what is sent during the
merge window.  The pull request also matches the next branch of
the repository.

The following changes since commit 0da029ed7ee5fdf49a2a0e14160c3ebe9292:

  KVM: x86: rename quirk constants to KVM_X86_QUIRK_* (2015-07-23 08:24:42 +0200)

are available in the git repository at:

  git://git.kernel.org/pub/scm/virt/kvm/kvm.git tags/kvm-4.3-1

for you to fetch changes up to 4d283ec908e617fa28bcb06bce310206f0655d67:

  x86/kvm: Rename VMX's segment access rights defines (2015-08-15 00:47:13 +0200)


A very small release for x86 and s390 KVM.

s390: timekeeping changes, cleanups and fixes

x86: support for Hyper-V MSRs to report crashes, and a bunch of cleanups.

One interesting feature that was planned for 4.3 (emulating the local
APIC in kernel while keeping the IOAPIC and 8254 in userspace) had to
be delayed because Intel complained about my reading of the manual.


Alex Williamson (1):
  KVM: MTRR: Use default type for non-MTRR-covered gfn before WARN_ON

Andrey Smetanin (4):
  kvm/x86: move Hyper-V MSR's/hypercall code into hyperv.c file
  kvm: introduce vcpu_debug = kvm_debug + vcpu context
  kvm/x86: added hyper-v crash msrs into kvm hyperv context
  kvm/x86: add sending hyper-v crash notification to user space

Andy Lutomirski (1):
  x86/kvm: Rename VMX's segment access rights defines

Christian Borntraeger (10):
  KVM: s390: add kvm stat counter for all diagnoses
  KVM: s390: Improve vcpu event debugging for diagnoses
  KVM: s390: VCPU_EVENT cleanup for prefix changes
  KVM: s390: add more debug data for the pfault diagnoses
  KVM: s390: Fixup interrupt vcpu event messages and levels
  KVM: s390: remove outdated documentation
  KVM: s390: improve debug feature usage
  KVM: s390: adapt debug entries for instruction handling
  KVM: s390: Provide global debug log
  KVM: s390: log capability enablement and vm attribute changes

David Hildenbrand (3):
  KVM: s390: filter space-switch events when PER is enforced
  KVM: s390: remove "from (user|kernel)" from irq injection messages
  KVM: s390: more irq names for trace events

Dominik Dingel (3):
  KVM: s390: propagate error from enable storage key
  KVM: s390: clean up cmma_enable check
  KVM: s390: only reset CMMA state if it was enabled before

Eugene Korenevsky (1):
  KVM: nVMX: VMX instructions: add checks for #GP/#SS exceptions

Fan Zhang (1):
  KVM: s390: host STP toleration for VMs

Mihai Donțu (1):
  kvm/x86: add support for MONITOR_TRAP_FLAG

Nicholas Krause (2):
  KVM: s390: Fix assumption that kvm_set_irq_routing is always run successfully
  kvm: x86: Fix error handling in the function kvm_lapic_sync_from_vapic

Paolo Bonzini (7):
  KVM: svm: handle KVM_X86_QUIRK_CD_NW_CLEARED in svm_get_mt_mask
  Merge tag 'kvm-s390-next-20150728' of git://git.kernel.org/.../kvms390/linux into kvm-next
  KVM: move code related to KVM_SET_BOOT_CPU_ID to x86
  KVM: x86: remove unnecessary memory barriers for shared MSRs
  KVM: document memory barriers for kvm->vcpus/kvm->online_vcpus
  KVM: x86: clean/fix memory barriers in irqchip_in_kernel
  Merge tag 'kvm-s390-next-20150812' of git://git.kernel.org/.../kvms390/linux into HEAD

Wei Huang (1):
  KVM: x86/vPMU: Fix unnecessary signed extension for AMD PERFCTRn

Xiao Guangrong (9):
  KVM: MMU: fix validation of mmio page fault
  KVM: MMU: move FNAME(is_rsvd_bits_set) to mmu.c
  KVM: MMU: introduce rsvd_bits_validate
  KVM: MMU: split reset_rsvds_bits_mask
  KVM: MMU: split reset_rsvds_bits_mask_ept
  KVM: MMU: introduce the framework to check zero bits on sptes
  KVM: MMU: introduce is_shadow_zero_bits_set()
  KVM: MMU: fully check zero bits for sptes
  KVM: VMX: drop ept misconfig check

 Documentation/s390/00-INDEX   |   2 -
 Documentation/s390/kvm.txt| 125 -
 Documentation/virtual/kvm/api.txt |   5 +
 arch/s390/include/asm/etr.h   |   3 +
 arch/s390/include/asm/kvm_host.h  |   4 +-
 arch/s390/kernel/time.c   |  16 +-
 arch/s390/kvm/diag.c  |  13 +-
 arch/s390/kvm/guestdbg.c  |  35 
 arch/s390/kvm/interrupt.c |  98 +-
 arch/s390/kvm/kvm-s390.c  | 114 ++--
 arch/s390/kvm/kvm-s390.h  |  11 +-
 arch/s390/kv

Re: [PATCH v3 0/5] KVM: optimize userspace exits with a new ioctl

2015-08-14 Thread Paolo Bonzini



On 14/08/2015 12:08, Radim Krčmář wrote:
> v3:
>  * acked by Christian [1/5]
>  * use ioctl argument directly (unsigned long as flags) [4/5]
>  * precisely #ifdef arch-specific ioctls [5/5]
> v2:
>  * move request_exits debug counter patch right after introduction of
>KVM_REQ_EXIT [3/5]
>  * use vcpu ioctl instead of vm one [4/5]
>  * shrink kvm_user_exit from 64 to 32 bytes [4/5]
>  * new [5/5]
> 
> QEMU uses SIGUSR1 to force a userspace exit and also to queue an early
> exit before calling VCPU_RUN -- the signal is blocked in user space and
> temporarily unblocked in VCPU_RUN.
> The temporary unblocking by sigprocmask() in kvm_arch_vcpu_ioctl_run()
> takes a shared siglock, which leads to cacheline bouncing in NUMA
> systems.
> 
> This series allows the same with a new request bit and VCPU ioctl that
> marks and kicks the target VCPU, hence no need to unblock.
> 
> inl_from_{pmtimer,qemu} vmexit benchmark from kvm-unit-tests shows ~5%
> speedup for 1-4 VCPUs (300-2000 saved cycles) without noticeably
> regressing kernel VM exits.
> (Paolo did a quick run of an older version of this series on a NUMA system
>  and the speedup was around 35% when utilizing more nodes.)
> 
> Radim Krčmář (5):
>   KVM: add kvm_has_request wrapper
>   KVM: add KVM_REQ_EXIT request for userspace exit
>   KVM: x86: add request_exits debug counter
>   KVM: add KVM_USER_EXIT vcpu ioctl for userspace exit
>   KVM: refactor asynchronous vcpu ioctl dispatch
> 
>  Documentation/virtual/kvm/api.txt | 25 +
>  arch/x86/include/asm/kvm_host.h   |  1 +
>  arch/x86/kvm/vmx.c|  4 ++--
>  arch/x86/kvm/x86.c| 23 +++
>  include/linux/kvm_host.h  | 15 +--
>  include/uapi/linux/kvm.h  |  4 
>  virt/kvm/kvm_main.c   | 15 ++-
>  7 files changed, 78 insertions(+), 9 deletions(-)
> 

Reviewed-by: Paolo Bonzini 

... however, we still need to decide what to do about machine-check
exceptions before enabling the capability, otherwise we'd need a new
KVM_CAP_USER_EXIT_MCE capability in the future.  So I'm holding up the
patches for now.

Paolo
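
The signal-based mechanism that the cover letter describes (keep SIGUSR1
blocked in user space so that a kick stays queued until the kernel
temporarily unblocks it) can be illustrated with plain POSIX calls. This is
a minimal sketch of the queuing behaviour only, under the assumption of a
single-threaded caller; it does not touch KVM, and the function name is
made up for illustration.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <signal.h>
#include <stddef.h>

/* Block SIGUSR1, raise it, and report whether it is left pending.
 * Returns 1 if the signal was queued, 0 if not, -1 on error. */
int queue_blocked_sigusr1(void)
{
    sigset_t block, old, pending;
    int sig, was_pending;

    sigemptyset(&block);
    sigaddset(&block, SIGUSR1);
    if (sigprocmask(SIG_BLOCK, &block, &old) != 0)
        return -1;

    /* The "kick": because the signal is blocked, it is only queued. */
    if (raise(SIGUSR1) != 0)
        return -1;

    if (sigpending(&pending) != 0)
        return -1;
    was_pending = sigismember(&pending, SIGUSR1);

    /* Consume the queued signal so that restoring the old mask
     * does not deliver it with the default (terminating) action. */
    if (was_pending == 1 && sigwait(&block, &sig) != 0)
        return -1;

    if (sigprocmask(SIG_SETMASK, &old, NULL) != 0)
        return -1;
    return was_pending;
}
```

The mask changes around the run loop are the part that takes the shared
siglock according to the cover letter, which is what the proposed request
bit avoids.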

x2APIC and number of the vcpu

2015-08-14 Thread Ozgur O Kilic

Hello,

I am doing some research about x2APIC and the maximum number of VCPUs.
According to Intel, with APIC we can only have 2^8 VCPUs, while x2APIC
allows 2^10 VCPUs because it has a 32-bit address space for APIC_ID.
My question is: is it theoretically possible, or has anyone tried it?
If it is, I checked that my PC's hardware supports x2APIC; which
version of KVM should I use for that?

Thank You
Ozgur Ozan Kilic

Re: [RFC/PATCH 1/3] x86/kvm: Rename VMX's segment access rights defines

2015-08-14 Thread Paolo Bonzini



On 13/08/2015 22:18, Andy Lutomirski wrote:
> VMX encodes access rights differently from LAR, and the latter is
> most likely what x86 people think of when they think of "access
> rights".
> 
> Rename them to avoid confusion.

Good idea, I've gone ahead and applied it for 4.3.

> Cc: kvm@vger.kernel.org
> Signed-off-by: Andy Lutomirski 
> ---
>  arch/x86/include/asm/vmx.h | 46 
> +++---
>  arch/x86/kvm/vmx.c | 14 +++---
>  2 files changed, 30 insertions(+), 30 deletions(-)
> 
> diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
> index da772edd19ab..78e243ae1786 100644
> --- a/arch/x86/include/asm/vmx.h
> +++ b/arch/x86/include/asm/vmx.h
> @@ -367,29 +367,29 @@ enum vmcs_field {
>  #define TYPE_PHYSICAL_APIC_EVENT(10 << 12)
>  #define TYPE_PHYSICAL_APIC_INST (15 << 12)
>  
> -/* segment AR */
> -#define SEGMENT_AR_L_MASK (1 << 13)
> -
> -#define AR_TYPE_ACCESSES_MASK 1
> -#define AR_TYPE_READABLE_MASK (1 << 1)
> -#define AR_TYPE_WRITEABLE_MASK (1 << 2)
> -#define AR_TYPE_CODE_MASK (1 << 3)
> -#define AR_TYPE_MASK 0x0f
> -#define AR_TYPE_BUSY_64_TSS 11
> -#define AR_TYPE_BUSY_32_TSS 11
> -#define AR_TYPE_BUSY_16_TSS 3
> -#define AR_TYPE_LDT 2
> -
> -#define AR_UNUSABLE_MASK (1 << 16)
> -#define AR_S_MASK (1 << 4)
> -#define AR_P_MASK (1 << 7)
> -#define AR_L_MASK (1 << 13)
> -#define AR_DB_MASK (1 << 14)
> -#define AR_G_MASK (1 << 15)
> -#define AR_DPL_SHIFT 5
> -#define AR_DPL(ar) (((ar) >> AR_DPL_SHIFT) & 3)
> -
> -#define AR_RESERVD_MASK 0xfffe0f00
> +/* segment AR in VMCS -- these are different from what LAR reports */
> +#define VMX_SEGMENT_AR_L_MASK (1 << 13)
> +
> +#define VMX_AR_TYPE_ACCESSES_MASK 1
> +#define VMX_AR_TYPE_READABLE_MASK (1 << 1)
> +#define VMX_AR_TYPE_WRITEABLE_MASK (1 << 2)
> +#define VMX_AR_TYPE_CODE_MASK (1 << 3)
> +#define VMX_AR_TYPE_MASK 0x0f
> +#define VMX_AR_TYPE_BUSY_64_TSS 11
> +#define VMX_AR_TYPE_BUSY_32_TSS 11
> +#define VMX_AR_TYPE_BUSY_16_TSS 3
> +#define VMX_AR_TYPE_LDT 2
> +
> +#define VMX_AR_UNUSABLE_MASK (1 << 16)
> +#define VMX_AR_S_MASK (1 << 4)
> +#define VMX_AR_P_MASK (1 << 7)
> +#define VMX_AR_L_MASK (1 << 13)
> +#define VMX_AR_DB_MASK (1 << 14)
> +#define VMX_AR_G_MASK (1 << 15)
> +#define VMX_AR_DPL_SHIFT 5
> +#define VMX_AR_DPL(ar) (((ar) >> VMX_AR_DPL_SHIFT) & 3)
> +
> +#define VMX_AR_RESERVD_MASK 0xfffe0f00
>  
>  #define TSS_PRIVATE_MEMSLOT  (KVM_USER_MEM_SLOTS + 0)
>  #define APIC_ACCESS_PAGE_PRIVATE_MEMSLOT (KVM_USER_MEM_SLOTS + 1)
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index e856dd566f4c..d7ff79a5135b 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -3423,12 +3423,12 @@ static void enter_lmode(struct kvm_vcpu *vcpu)
>   vmx_segment_cache_clear(to_vmx(vcpu));
>  
>   guest_tr_ar = vmcs_read32(GUEST_TR_AR_BYTES);
> - if ((guest_tr_ar & AR_TYPE_MASK) != AR_TYPE_BUSY_64_TSS) {
> + if ((guest_tr_ar & VMX_AR_TYPE_MASK) != VMX_AR_TYPE_BUSY_64_TSS) {
>   pr_debug_ratelimited("%s: tss fixup for long mode. \n",
>__func__);
>   vmcs_write32(GUEST_TR_AR_BYTES,
> -  (guest_tr_ar & ~AR_TYPE_MASK)
> -  | AR_TYPE_BUSY_64_TSS);
> +  (guest_tr_ar & ~VMX_AR_TYPE_MASK)
> +  | VMX_AR_TYPE_BUSY_64_TSS);
>   }
>   vmx_set_efer(vcpu, vcpu->arch.efer | EFER_LMA);
>  }
> @@ -3719,7 +3719,7 @@ static int vmx_get_cpl(struct kvm_vcpu *vcpu)
>   return 0;
>   else {
>   int ar = vmx_read_guest_seg_ar(vmx, VCPU_SREG_SS);
> - return AR_DPL(ar);
> + return VMX_AR_DPL(ar);
>   }
>  }
>  
> @@ -3847,11 +3847,11 @@ static bool code_segment_valid(struct kvm_vcpu *vcpu)
>  
>   if (cs.unusable)
>   return false;
> - if (~cs.type & (AR_TYPE_CODE_MASK|AR_TYPE_ACCESSES_MASK))
> + if (~cs.type & (VMX_AR_TYPE_CODE_MASK|VMX_AR_TYPE_ACCESSES_MASK))
>   return false;
>   if (!cs.s)
>   return false;
> - if (cs.type & AR_TYPE_WRITEABLE_MASK) {
> + if (cs.type & VMX_AR_TYPE_WRITEABLE_MASK) {
>   if (cs.dpl > cs_rpl)
>   return false;
>   } else {
> @@ -3901,7 +3901,7 @@ static bool data_segment_valid(struct kvm_vcpu *vcpu, 
> int seg)
>   return false;
>   if (!var.present)
>   return false;
> - if (~var.type & (AR_TYPE_CODE_MASK|AR_TYPE_WRITEABLE_MASK)) {
> + if (~var.type & (VMX_AR_TYPE_CODE_MASK|VMX_AR_TYPE_WRITEABLE_MASK)) {
>   if (var.dpl < rpl) /* DPL < RPL */
>   return false;
>   }
> 
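
To make the VMCS access-rights layout concrete, the following sketch
decodes a raw AR value using the renamed defines from the patch above. The
macro values are copied from the diff; the struct and function names are
illustrative assumptions and do not appear in the kernel.

```c
#include <stdint.h>

/* Copied from the renamed defines in arch/x86/include/asm/vmx.h. */
#define VMX_AR_TYPE_MASK      0x0f
#define VMX_AR_S_MASK         (1 << 4)
#define VMX_AR_DPL_SHIFT      5
#define VMX_AR_DPL(ar)        (((ar) >> VMX_AR_DPL_SHIFT) & 3)
#define VMX_AR_P_MASK         (1 << 7)
#define VMX_AR_L_MASK         (1 << 13)
#define VMX_AR_UNUSABLE_MASK  (1 << 16)

/* Hypothetical container for the decoded fields. */
struct vmx_ar_fields {
    unsigned type;      /* bits 0-3 */
    unsigned s;         /* bit 4: code/data (1) or system (0) segment */
    unsigned dpl;       /* bits 5-6 */
    unsigned present;   /* bit 7 */
    unsigned long_mode; /* bit 13 */
    unsigned unusable;  /* bit 16 */
};

struct vmx_ar_fields vmx_ar_decode(uint32_t ar)
{
    struct vmx_ar_fields f;

    f.type      = ar & VMX_AR_TYPE_MASK;
    f.s         = !!(ar & VMX_AR_S_MASK);
    f.dpl       = VMX_AR_DPL(ar);
    f.present   = !!(ar & VMX_AR_P_MASK);
    f.long_mode = !!(ar & VMX_AR_L_MASK);
    f.unusable  = !!(ar & VMX_AR_UNUSABLE_MASK);
    return f;
}
```

For example, 0xa09b decodes to type 11, S=1, DPL 0, present, L=1, and 0xf3
decodes to type 3 with DPL 3.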

Re: [RFC/PATCH 3/3] x86/signal/64: Add explicit controls for sigcontext SS handling

2015-08-14 Thread Cyrill Gorcunov

On Fri, Aug 14, 2015 at 01:57:42PM -0700, Andy Lutomirski wrote:
> 
> Don't bother testing yet.  I'm waffling between trying something like
> this and adding SA_SAVE_SS.  I have partially written patches for the
> latter.

ok, ping me if anything

Re: [RFC/PATCH 3/3] x86/signal/64: Add explicit controls for sigcontext SS handling

2015-08-14 Thread Andy Lutomirski

On Fri, Aug 14, 2015 at 1:55 PM, Cyrill Gorcunov  wrote:
> On Thu, Aug 13, 2015 at 01:18:50PM -0700, Andy Lutomirski wrote:
>> This adds two new uc_flags flags.  UC_SAVED_SS will be set for all
>> 64-bit signals (including x32).  It indicates that the saved SS field
>> is valid and that the kernel understands UC_RESTORE_SS.
>>
>> The kernel will *not* set UC_RESTORE_SS.  User signal handlers can
>> set UC_RESTORE_SS themselves to indicate that sigreturn should
>> restore SS from the sigcontext.
>>
>> 64-bit programs that use segmentation are encouraged to check
>> UC_SAVED_SS and set UC_RESTORE_SS in their signal handlers.  This is
>> the only straightforward way to cause sigreturn to restore SS.  (The
>> only non-test program that I know of that uses segmentation in a
>> 64-bit binary is DOSEMU, and DOSEMU currently uses a nasty
>> trampoline to work around the lack of this mechanism in old kernels.
>> It could detect UC_RESTORE_SS and use it to avoid needing a
>> trampoline.)
>>
>> Cc: Stas Sergeev 
>> Cc: Linus Torvalds 
>> Cc: Cyrill Gorcunov 
>> Cc: Pavel Emelyanov 
>> Signed-off-by: Andy Lutomirski 
>
> Looks reasonable to me. Andy, Linus, what's the final conclusion --
> are we about to introduce this flag or simply continue with the
> revert? Should I test this one? (From the code I don't expect it
> to break criu anyhow, but still.)

Don't bother testing yet.  I'm waffling between trying something like
this and adding SA_SAVE_SS.  I have partially written patches for the
latter.

--Andy




-- 
Andy Lutomirski
AMA Capital Management, LLC

Re: [RFC/PATCH 3/3] x86/signal/64: Add explicit controls for sigcontext SS handling

2015-08-14 Thread Cyrill Gorcunov

On Thu, Aug 13, 2015 at 01:18:50PM -0700, Andy Lutomirski wrote:
> This adds two new uc_flags flags.  UC_SAVED_SS will be set for all
> 64-bit signals (including x32).  It indicates that the saved SS field
> is valid and that the kernel understands UC_RESTORE_SS.
> 
> The kernel will *not* set UC_RESTORE_SS.  User signal handlers can
> set UC_RESTORE_SS themselves to indicate that sigreturn should
> restore SS from the sigcontext.
> 
> 64-bit programs that use segmentation are encouraged to check
> UC_SAVED_SS and set UC_RESTORE_SS in their signal handlers.  This is
> the only straightforward way to cause sigreturn to restore SS.  (The
> only non-test program that I know of that uses segmentation in a
> 64-bit binary is DOSEMU, and DOSEMU currently uses a nasty
> trampoline to work around the lack of this mechanism in old kernels.
> It could detect UC_RESTORE_SS and use it to avoid needing a
> trampoline.)
> 
> Cc: Stas Sergeev 
> Cc: Linus Torvalds 
> Cc: Cyrill Gorcunov 
> Cc: Pavel Emelyanov 
> Signed-off-by: Andy Lutomirski 

Looks reasonable to me. Andy, Linus, what's the final conclusion --
are we about to introduce this flag or simply continue with the
revert? Should I test this one? (From the code I don't expect it
to break criu anyhow, but still.)
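
The handshake described in the quoted patch text (the kernel sets
UC_SAVED_SS, and a handler that wants SS restored sets UC_RESTORE_SS) could
look roughly like the sketch below. The bit values are placeholders,
because the excerpt does not show the real definitions, and the function
operates on a plain flags word instead of a real ucontext; treat it as an
illustration of the check-then-set logic only.

```c
#include <stdbool.h>

/* Placeholder bit values, NOT taken from the patch. */
#define UC_SAVED_SS_SKETCH    (1UL << 1)
#define UC_RESTORE_SS_SKETCH  (1UL << 2)

/* Ask sigreturn to restore SS, but only if the kernel advertised that
 * the saved SS field is valid. Returns true if the request was made. */
bool request_ss_restore(unsigned long *uc_flags)
{
    if (!(*uc_flags & UC_SAVED_SS_SKETCH))
        return false;   /* older kernel: leave the flags untouched */

    *uc_flags |= UC_RESTORE_SS_SKETCH;
    return true;
}
```

In a real handler the flags word would be the uc_flags field of the
ucontext passed as the third argument of an SA_SIGINFO handler.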

Re: [Qemu-devel] Help debugging a regression in KVM Module

2015-08-14 Thread Alex Bennée


Peter Lieven  writes:

> Hi,
>
> some time ago I stumbled across a regression in the KVM module that was
> introduced somewhere between 3.17 and 3.19.
>
> I have a rather old openSUSE guest with an XFS filesystem which reliably
> crashes after some live migrations.
> I originally believed that the issue might be related to my setup with a
> 3.12 host kernel and kvm-kmod 3.19, but I have now found that it is also
> present with a 3.19 host kernel and its included 3.19 kvm module.
>
> My idea was to continue testing on a 3.12 host kernel and then bisect all
> commits to the kvm-related parts.
>
> Now my question is how to best bisect only kvm related changes (those
> that go into kvm-kmod)?

In general I don't bother. Since it is a bisection, you eliminate half the
commits at a time, so you get there fairly quickly anyway. However, you can
tell bisect which parts of the tree you care about:

  git bisect start -- arch/arm64/kvm include/linux/kvm* include/uapi/linux/kvm* virt/kvm/


>
> Thanks,
> Peter

-- 
Alex Bennée

Re: Help debugging a regression in KVM Module

2015-08-14 Thread Peter Lieven

Am 14.08.2015 um 15:01 schrieb Paolo Bonzini:
>
> - Original Message -
>> From: "Peter Lieven" 
>> To: qemu-de...@nongnu.org, kvm@vger.kernel.org
>> Cc: "Paolo Bonzini" 
>> Sent: Friday, August 14, 2015 1:11:34 PM
>> Subject: Help debugging a regression in KVM Module
>>
>> Hi,
>>
>> some time ago I stumbled across a regression in the KVM module that was
>> introduced somewhere between 3.17 and 3.19.
>>
>> I have a rather old openSUSE guest with an XFS filesystem which reliably
>> crashes after some live migrations.
>> I originally believed that the issue might be related to my setup with a
>> 3.12 host kernel and kvm-kmod 3.19, but I have now found that it is also
>> present with a 3.19 host kernel and its included 3.19 kvm module.
>>
>> My idea was to continue testing on a 3.12 host kernel and then bisect all
>> commits to the kvm-related parts.
>>
>> Now my question is how to best bisect only kvm related changes (those that go
>> into kvm-kmod)?
> I haven't forgotten this.  Sorry. :(
>
> Unfortunately I'll be away for three weeks, but I'll make it a priority
> when I'm back.

It's not time-critical, but I think it's worth investigating as it might
affect other systems as well; maybe XFS is just particularly sensitive.

I suppose you are going on vacation. Enjoy!

Peter

[PATCH v2 11/18] nvdimm: build ACPI nvdimm devices

2015-08-14 Thread Xiao Guangrong

NVDIMM devices are defined in ACPI 6.0, section 9.20 (NVDIMM Devices).

There is a root device under \_SB, and the individual NVDIMM devices sit
under the root device. Each NVDIMM device has an _ADR object which returns
its handle, used to associate it with the MEMDEV table in the NFIT.

We reserve handle 0 for the root device. In this patch, we save the handle,
arg0, arg1 and arg2. Arg3 is conditionally saved in a later patch.

Signed-off-by: Xiao Guangrong 
---
 hw/i386/acpi-build.c   |   2 +
 hw/mem/nvdimm/acpi.c   | 130 -
 include/hw/mem/pc-nvdimm.h |   2 +
 3 files changed, 132 insertions(+), 2 deletions(-)

diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
index 092ed2f..a792135 100644
--- a/hw/i386/acpi-build.c
+++ b/hw/i386/acpi-build.c
@@ -1342,6 +1342,8 @@ build_ssdt(GArray *table_data, GArray *linker,
 aml_append(sb_scope, scope);
 }
 }
+
+pc_nvdimm_build_acpi_devices(sb_scope);
 aml_append(ssdt, sb_scope);
 }
 
diff --git a/hw/mem/nvdimm/acpi.c b/hw/mem/nvdimm/acpi.c
index e0f2ad3..909a8ef 100644
--- a/hw/mem/nvdimm/acpi.c
+++ b/hw/mem/nvdimm/acpi.c
@@ -135,10 +135,11 @@ struct nfit_dcr {
 uint8_t reserved2[6];
 } QEMU_PACKED;
 
-#define REVSISON_ID1
-#define NFIT_FIC1  0x201
+#define REVSISON_ID 1
+#define NFIT_FIC1   0x201
 
 #define MAX_NVDIMM_NUMBER   10
+#define NOTIFY_VALUE0x99
 
 static int get_nvdimm_device_number(GSList *list)
 {
@@ -281,12 +282,15 @@ static size_t dsm_size;
 static uint64_t dsm_read(void *opaque, hwaddr addr,
  unsigned size)
 {
+fprintf(stderr, "BUG: we never read DSM notification MMIO.\n");
+assert(0);
 return 0;
 }
 
 static void dsm_write(void *opaque, hwaddr addr,
   uint64_t val, unsigned size)
 {
+assert(val == NOTIFY_VALUE);
 }
 
 static const MemoryRegionOps dsm_ops = {
@@ -361,3 +365,125 @@ void pc_nvdimm_build_nfit_table(GArray *table_offsets, 
GArray *table_data,
 exit:
 g_slist_free(list);
 }
+
+#define BUILD_STA_METHOD(_dev_, _method_)  \
+do {   \
+_method_ = aml_method("_STA", 0);  \
+aml_append(_method_, aml_return(aml_int(0x0f)));   \
+aml_append(_dev_, _method_);   \
+} while (0)
+
+#define SAVE_ARG012_HANDLE(_method_, _handle_) \
+do {   \
+aml_append(_method_, aml_store(_handle_, aml_name("HDLE")));   \
+aml_append(_method_, aml_store(aml_arg(0), aml_name("ARG0"))); \
+aml_append(_method_, aml_store(aml_arg(1), aml_name("ARG1"))); \
+aml_append(_method_, aml_store(aml_arg(2), aml_name("ARG2"))); \
+} while (0)
+
+#define NOTIFY_AND_RETURN(_method_)\
+do {   \
+aml_append(_method_, aml_store(aml_int(NOTIFY_VALUE),  \
+   aml_name("NOTI"))); \
+aml_append(_method_, aml_return(aml_name("ODAT")));\
+} while (0)
+
+static void build_nvdimm_devices(Aml *root_dev, GSList *list)
+{
+for (; list; list = list->next) {
+PCNVDIMMDevice *nvdimm = list->data;
+uint32_t handle = nvdimm_index_to_handle(nvdimm->device_index);
+Aml *dev, *method;
+
+dev = aml_device("NVD%d", nvdimm->device_index);
+aml_append(dev, aml_name_decl("_ADR", aml_int(handle)));
+
+BUILD_STA_METHOD(dev, method);
+
+method = aml_method("_DSM", 4);
+{
+SAVE_ARG012_HANDLE(method, aml_int(handle));
+NOTIFY_AND_RETURN(method);
+}
+aml_append(dev, method);
+
+aml_append(root_dev, dev);
+}
+}
+
+void pc_nvdimm_build_acpi_devices(Aml *sb_scope)
+{
+Aml *dev, *method, *field;
+struct dsm_buffer *dsm_buf;
+GSList *list = get_nvdimm_built_list();
+int nr = get_nvdimm_device_number(list);
+
+if (nr <= 0 || nr > MAX_NVDIMM_NUMBER) {
+g_slist_free(list);
+return;
+}
+
+dev = aml_device("NVDR");
+aml_append(dev, aml_name_decl("_HID", aml_string("ACPI0012")));
+
+/* map DSM buffer into ACPI namespace. */
+aml_append(dev, aml_operation_region("DSMR", AML_SYSTEM_MEMORY,
+   dsm_addr, dsm_size));
+
+/*
+ * DSM input:
+ * @HDLE: store device's handle, it's zero if the _DSM call happens
+ *on ROOT.
+ * @ARG0 ~ @ARG3: store the parameters of _DSM call.
+ *
+ * They are ram mapping on host so that these access never cause VM-EXIT.
+ */
+field = aml_field("DSMR", AML_DWORD_ACC, AML_PRESERVE);
+aml_append(field, aml_named_field("HDLE",
+   siz

[PATCH v2 12/18] nvdimm: save arg3 for NVDIMM device _DSM method

2015-08-14 Thread Xiao Guangrong

Check whether the function (arg2) has additional input info (arg3) and save
the info if needed.

We only do the save on NVDIMM devices since we are not going to support any
function on the root device.

Signed-off-by: Xiao Guangrong 
---
 hw/mem/nvdimm/acpi.c | 73 +++-
 1 file changed, 72 insertions(+), 1 deletion(-)

diff --git a/hw/mem/nvdimm/acpi.c b/hw/mem/nvdimm/acpi.c
index 909a8ef..0b09efa 100644
--- a/hw/mem/nvdimm/acpi.c
+++ b/hw/mem/nvdimm/acpi.c
@@ -259,6 +259,26 @@ static void build_nfit_table(GSList *device_list, char 
*buf)
 }
 }
 
+enum {
+NFIT_CMD_IMPLEMENTED = 0,
+
+/* bus commands */
+NFIT_CMD_ARS_CAP = 1,
+NFIT_CMD_ARS_START = 2,
+NFIT_CMD_ARS_QUERY = 3,
+
+/* per-dimm commands */
+NFIT_CMD_SMART = 1,
+NFIT_CMD_SMART_THRESHOLD = 2,
+NFIT_CMD_DIMM_FLAGS = 3,
+NFIT_CMD_GET_CONFIG_SIZE = 4,
+NFIT_CMD_GET_CONFIG_DATA = 5,
+NFIT_CMD_SET_CONFIG_DATA = 6,
+NFIT_CMD_VENDOR_EFFECT_LOG_SIZE = 7,
+NFIT_CMD_VENDOR_EFFECT_LOG = 8,
+NFIT_CMD_VENDOR = 9,
+};
+
 struct dsm_buffer {
 /* RAM page. */
 uint32_t handle;
@@ -366,6 +386,19 @@ exit:
 g_slist_free(list);
 }
 
+static bool device_cmd_has_arg3[] = {
+false,  /* NFIT_CMD_IMPLEMENTED */
+false,  /* NFIT_CMD_SMART */
+false,  /* NFIT_CMD_SMART_THRESHOLD */
+false,  /* NFIT_CMD_DIMM_FLAGS */
+false,  /* NFIT_CMD_GET_CONFIG_SIZE */
+true,   /* NFIT_CMD_GET_CONFIG_DATA */
+true,   /* NFIT_CMD_SET_CONFIG_DATA */
+false,  /* NFIT_CMD_VENDOR_EFFECT_LOG_SIZE */
+false,  /* NFIT_CMD_VENDOR_EFFECT_LOG */
+false,  /* NFIT_CMD_VENDOR */
+};
+
 #define BUILD_STA_METHOD(_dev_, _method_)  \
 do {   \
 _method_ = aml_method("_STA", 0);  \
@@ -390,10 +423,20 @@ exit:
 
 static void build_nvdimm_devices(Aml *root_dev, GSList *list)
 {
+Aml *has_arg3;
+int i, cmd_nr;
+
+cmd_nr = ARRAY_SIZE(device_cmd_has_arg3);
+has_arg3 = aml_package(cmd_nr);
+for (i = 0; i < cmd_nr; i++) {
+aml_append(has_arg3, aml_int(device_cmd_has_arg3[i]));
+}
+aml_append(root_dev, aml_name_decl("CAG3", has_arg3));
+
 for (; list; list = list->next) {
 PCNVDIMMDevice *nvdimm = list->data;
 uint32_t handle = nvdimm_index_to_handle(nvdimm->device_index);
-Aml *dev, *method;
+Aml *dev, *method, *ifctx;
 
 dev = aml_device("NVD%d", nvdimm->device_index);
 aml_append(dev, aml_name_decl("_ADR", aml_int(handle)));
@@ -403,6 +446,34 @@ static void build_nvdimm_devices(Aml *root_dev, GSList 
*list)
 method = aml_method("_DSM", 4);
 {
 SAVE_ARG012_HANDLE(method, aml_int(handle));
+
+/* Local5 = DeRefOf(Index(CAG3, Arg2)) */
+aml_append(method,
+   aml_store(aml_derefof(aml_index(aml_name("CAG3"),
+   aml_arg(2))), aml_local(5)));
+/* if 0 < local5 */
+ifctx = aml_if(aml_lless(aml_int(0), aml_local(5)));
+{
+/* Local0 = Index(Arg3, 0) */
+aml_append(ifctx, aml_store(aml_index(aml_arg(3), aml_int(0)),
+   aml_local(0)));
+/* Local1 = sizeof(Local0) */
+aml_append(ifctx, aml_store(aml_sizeof(aml_local(0)),
+   aml_local(1)));
+/* Local2 = Local1 << 3 */
+aml_append(ifctx, aml_store(aml_shiftleft(aml_local(1),
+   aml_int(3)), aml_local(2)));
+/* Local3 = DeRefOf(Local0) */
+aml_append(ifctx, aml_store(aml_derefof(aml_local(0)),
+   aml_local(3)));
+/* CreateField(Local3, 0, local2, IBUF) */
+aml_append(ifctx, aml_create_field(aml_local(3),
+   aml_int(0), aml_local(2), "IBUF"));
+/* ARG3 = IBUF */
+aml_append(ifctx, aml_store(aml_name("IBUF"),
+   aml_name("ARG3")));
+}
+aml_append(method, ifctx);
 NOTIFY_AND_RETURN(method);
 }
 aml_append(dev, method);
-- 
2.4.3
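
The decision that the generated AML makes at run time (look up Arg2 in the
CAG3 package, and only then copy Arg3) can be mirrored in plain C. The
table below repeats device_cmd_has_arg3 from the patch; the helper names
and the bounds check are additions for illustration, since the AML in the
patch indexes CAG3 with Arg2 directly.

```c
#include <stdbool.h>
#include <stddef.h>

/* Same contents as device_cmd_has_arg3[] in the patch. */
static const bool device_cmd_has_arg3[] = {
    false,  /* NFIT_CMD_IMPLEMENTED */
    false,  /* NFIT_CMD_SMART */
    false,  /* NFIT_CMD_SMART_THRESHOLD */
    false,  /* NFIT_CMD_DIMM_FLAGS */
    false,  /* NFIT_CMD_GET_CONFIG_SIZE */
    true,   /* NFIT_CMD_GET_CONFIG_DATA */
    true,   /* NFIT_CMD_SET_CONFIG_DATA */
    false,  /* NFIT_CMD_VENDOR_EFFECT_LOG_SIZE */
    false,  /* NFIT_CMD_VENDOR_EFFECT_LOG */
    false,  /* NFIT_CMD_VENDOR */
};

/* Hypothetical helper: does this per-DIMM command carry an Arg3 payload?
 * Unknown command numbers are treated as "no payload". */
bool cmd_has_arg3(size_t cmd)
{
    size_t n = sizeof(device_cmd_has_arg3) / sizeof(device_cmd_has_arg3[0]);

    return cmd < n && device_cmd_has_arg3[cmd];
}

/* Mirrors "Local2 = Local1 << 3": CreateField takes a length in bits,
 * so the byte size of the input buffer is multiplied by 8. */
size_t arg3_field_bits(size_t buffer_bytes)
{
    return buffer_bytes << 3;
}
```

Only GET_CONFIG_DATA (5) and SET_CONFIG_DATA (6) report a payload, and a
16-byte input buffer maps to a 128-bit field.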


[PATCH v2 13/18] nvdimm: build namespace config data

2015-08-14 Thread Xiao Guangrong

If @configdata is false, QEMU will build a static, read-only
namespace in memory and use it to serve
DSM GET_CONFIG_SIZE/GET_CONFIG_DATA requests.

Signed-off-by: Xiao Guangrong 
---
 hw/mem/Makefile.objs   |   3 +-
 hw/mem/nvdimm/acpi.c   |  10 ++
 hw/mem/nvdimm/internal.h   |  12 ++
 hw/mem/nvdimm/namespace.c  | 307 +
 include/hw/mem/pc-nvdimm.h |   2 +
 5 files changed, 333 insertions(+), 1 deletion(-)
 create mode 100644 hw/mem/nvdimm/namespace.c

diff --git a/hw/mem/Makefile.objs b/hw/mem/Makefile.objs
index 7a6948d..7f3fab2 100644
--- a/hw/mem/Makefile.objs
+++ b/hw/mem/Makefile.objs
@@ -1,2 +1,3 @@
 common-obj-$(CONFIG_MEM_HOTPLUG) += pc-dimm.o
-common-obj-$(CONFIG_NVDIMM) += nvdimm/pc-nvdimm.o nvdimm/acpi.o
+common-obj-$(CONFIG_NVDIMM) += nvdimm/pc-nvdimm.o nvdimm/acpi.o\
+  nvdimm/namespace.o
diff --git a/hw/mem/nvdimm/acpi.c b/hw/mem/nvdimm/acpi.c
index 0b09efa..c773954 100644
--- a/hw/mem/nvdimm/acpi.c
+++ b/hw/mem/nvdimm/acpi.c
@@ -240,6 +240,8 @@ static void build_nfit_table(GSList *device_list, char *buf)
 
 for (; device_list; device_list = device_list->next) {
 PCNVDIMMDevice *nvdimm = device_list->data;
+struct nfit_memdev *nfit_memdev;
+struct nfit_dcr *nfit_dcr;
 int spa_index, dcr_index;
 
 spa_index = ++index;
@@ -252,10 +254,15 @@ static void build_nfit_table(GSList *device_list, char 
*buf)
  * build Memory Device to System Physical Address Range Mapping
  * Table.
  */
+nfit_memdev = (struct nfit_memdev *)buf;
 buf += build_memdev_table(buf, nvdimm, spa_index, dcr_index);
 
 /* build Control Region Descriptor Table. */
+nfit_dcr = (struct nfit_dcr *)buf;
 buf += build_dcr_table(buf, nvdimm, dcr_index);
+
+calculate_nvdimm_isetcookie(nvdimm, nfit_memdev->region_spa_offset,
+nfit_dcr->serial_number);
 }
 }
 
@@ -382,6 +389,9 @@ void pc_nvdimm_build_nfit_table(GArray *table_offsets, 
GArray *table_data,
 
 build_header(linker, table_data, (void *)(table_data->data + nfit_start),
  "NFIT", table_data->len - nfit_start, 1);
+
+build_nvdimm_configdata(list);
+
 exit:
 g_slist_free(list);
 }
diff --git a/hw/mem/nvdimm/internal.h b/hw/mem/nvdimm/internal.h
index 90d54dc..b1f3f16 100644
--- a/hw/mem/nvdimm/internal.h
+++ b/hw/mem/nvdimm/internal.h
@@ -13,6 +13,14 @@
 #ifndef __NVDIMM_INTERNAL_H
 #define __NVDIMM_INTERNAL_H
 
+/* #define NVDIMM_DEBUG */
+
+#ifdef NVDIMM_DEBUG
+#define nvdebug(fmt, ...) fprintf(stderr, "nvdimm: " fmt, ## __VA_ARGS__)
+#else
+#define nvdebug(...)
+#endif
+
 #define PAGE_SIZE   (1UL << 12)
 
 typedef struct {
@@ -27,4 +35,8 @@ typedef struct {
 
 GSList *get_nvdimm_built_list(void);
 ram_addr_t reserved_range_push(uint64_t size);
+
+void calculate_nvdimm_isetcookie(PCNVDIMMDevice *nvdimm, uint64_t spa,
+ uint32_t sn);
+void build_nvdimm_configdata(GSList *device_list);
 #endif
diff --git a/hw/mem/nvdimm/namespace.c b/hw/mem/nvdimm/namespace.c
new file mode 100644
index 000..04626da
--- /dev/null
+++ b/hw/mem/nvdimm/namespace.c
@@ -0,0 +1,307 @@
+/*
+ * NVDIMM  Namespace Support
+ *
+ * Copyright(C) 2015 Intel Corporation.
+ *
+ * Author:
+ *  Xiao Guangrong 
+ *
+ * NVDIMM namespace specification can be found at:
+ *  http://pmem.io/documents/NVDIMM_Namespace_Spec.pdf
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, see 
+ */
+
+#include "hw/mem/pc-nvdimm.h"
+
+#include "internal.h"
+
+static uint64_t fletcher64(void *addr, size_t len)
+{
+uint32_t *buf = addr;
+uint32_t lo32 = 0;
+uint64_t hi32 = 0;
+int i;
+
+for (i = 0; i < len / sizeof(uint32_t); i++) {
+lo32 += cpu_to_le32(buf[i]);
+hi32 += lo32;
+}
+
+return hi32 << 32 | lo32;
+}
+
+struct interleave_set_info {
+struct interleave_set_info_map {
+uint64_t region_spa_offset;
+uint32_t serial_number;
+uint32_t zero;
+} mapping[1];
+};
+
+void calculate_nvdimm_isetcookie(PCNVDIMMDevice *nvdimm, uint64_t spa,
+ uint32_t sn)
+{
+struct interleave_set_info info;
+
+info.mapping[0].region_spa_offset = spa;
+info.mapping[0].serial_number = sn;
+info.mappin
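
The checksum used for the interleave set cookie can be tried outside QEMU. Below is a minimal standalone sketch of the same Fletcher-64 variant as fletcher64() in the patch. It is not QEMU code: load_le32() is a hypothetical stand-in for the byte-order conversion, and, as in the patch, trailing bytes beyond a multiple of four are ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: read one little-endian 32-bit word from bytes. */
static uint32_t load_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/*
 * Fletcher-64 variant as in the patch: lo32 is a 32-bit running sum of
 * the input words (wrapping modulo 2^32), hi32 is a 64-bit running sum
 * of the successive lo32 values.
 */
static uint64_t fletcher64_sketch(const void *addr, size_t len)
{
    const uint8_t *bytes = addr;
    uint32_t lo32 = 0;
    uint64_t hi32 = 0;
    size_t i;

    assert(addr != NULL || len == 0);

    for (i = 0; i < len / sizeof(uint32_t); i++) {
        lo32 += load_le32(bytes + i * sizeof(uint32_t));
        hi32 += lo32;
    }

    return hi32 << 32 | lo32;
}
```

For the two words 1 and 2, lo32 goes 1, 3 and hi32 goes 1, 4, so the result is 0x0000000400000003.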

[PATCH v2 18/18] nvdimm: add maintain info

2015-08-14 Thread Xiao Guangrong

Add NVDIMM maintainer

Signed-off-by: Xiao Guangrong 
---
 MAINTAINERS | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 978b717..86786e6 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -793,6 +793,12 @@ M: Jiri Pirko 
 S: Maintained
 F: hw/net/rocker/
 
+NVDIMM
+M: Xiao Guangrong 
+S: Maintained
+F: hw/mem/nvdimm/
+F: include/hw/mem/pc-nvdimm.h
+
 Subsystems
 --
 Audio
-- 
2.4.3

[PATCH v2 16/18] nvdimm: support NFIT_CMD_GET_CONFIG_DATA

2015-08-14 Thread Xiao Guangrong

Function 5 is used to get Namespace Label Data

Signed-off-by: Xiao Guangrong 
---
 hw/mem/nvdimm/acpi.c | 32 
 1 file changed, 32 insertions(+)

diff --git a/hw/mem/nvdimm/acpi.c b/hw/mem/nvdimm/acpi.c
index 0a5f2c2..517d710 100644
--- a/hw/mem/nvdimm/acpi.c
+++ b/hw/mem/nvdimm/acpi.c
@@ -352,6 +352,7 @@ struct dsm_buffer {
 uint32_t arg1;
 uint32_t arg2;
 union {
+struct cmd_in_get_config_data cmd_config_get;
 struct cmd_in_set_config_data cmd_config_set;
 char arg3[PAGE_SIZE - 3 * sizeof(uint32_t) - 16 * sizeof(uint8_t)];
 };
@@ -454,6 +455,34 @@ dsm_cmd_config_size(PCNVDIMMDevice *nvdimm, struct dsm_buffer *in,
 return NFIT_STATUS_SUCCESS;
 }
 
+static uint32_t
+dsm_cmd_config_get(PCNVDIMMDevice *nvdimm, struct dsm_buffer *in,
+   struct dsm_out *out)
+{
+struct cmd_in_get_config_data *cmd_in = &in->cmd_config_get;
+uint32_t status;
+
+le32_to_cpus(&cmd_in->length);
+le32_to_cpus(&cmd_in->offset);
+
+nvdebug("Read Config: offset %#x length %#x.\n", cmd_in->offset,
+cmd_in->length);
+
+if (nvdimm->config_data_size < cmd_in->length + cmd_in->offset) {
+nvdebug("position %#x is beyond config data (len = %#lx).\n",
+cmd_in->length + cmd_in->offset, nvdimm->config_data_size);
+status = NFIT_STATUS_INVALID_PARAS;
+goto exit;
+}
+
+status = NFIT_STATUS_SUCCESS;
+memcpy(out->cmd_config_get.out_buf, nvdimm->config_data_addr +
+   cmd_in->offset, cmd_in->length);
+
+exit:
+return status;
+}
+
 static void dsm_write_nvdimm(struct dsm_buffer *in, struct dsm_out *out)
 {
 GSList *list = get_nvdimm_built_list();
@@ -478,6 +507,9 @@ static void dsm_write_nvdimm(struct dsm_buffer *in, struct dsm_out *out)
 case NFIT_CMD_GET_CONFIG_SIZE:
 status = dsm_cmd_config_size(nvdimm, in, out);
 break;
+case NFIT_CMD_GET_CONFIG_DATA:
+status = dsm_cmd_config_get(nvdimm, in, out);
+break;
 default:
 status = NFIT_STATUS_NOT_SUPPORTED;
 };
-- 
2.4.3
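
A note on the range check in dsm_cmd_config_get(): offset and length are both guest-supplied uint32_t fields (see struct cmd_in_get_config_data), so the sum length + offset may wrap in 32 bits before it is compared with config_data_size. The sketch below shows one possible way to phrase the check without forming the sum. config_range_ok() is a hypothetical helper, not something in the patch, and it assumes the config size fits in a uint64_t.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of the patch): true when the byte range
 * [offset, offset + length) lies inside a buffer of config_size bytes.
 * It never computes offset + length, so nothing can wrap.
 */
static bool config_range_ok(uint64_t config_size, uint32_t offset,
                            uint32_t length)
{
    return offset <= config_size && length <= config_size - offset;
}
```

With a 128k config area, offset 0xFFFFFFFF and length 2 sum to 1 in 32-bit arithmetic and would pass a naive "size < length + offset" test; the form above rejects them.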

[PATCH v2 17/18] nvdimm: support NFIT_CMD_SET_CONFIG_DATA

2015-08-14 Thread Xiao Guangrong

Function 6 is used to set Namespace Label Data

Signed-off-by: Xiao Guangrong 
---
 hw/mem/nvdimm/acpi.c | 40 
 1 file changed, 40 insertions(+)

diff --git a/hw/mem/nvdimm/acpi.c b/hw/mem/nvdimm/acpi.c
index 517d710..283228d 100644
--- a/hw/mem/nvdimm/acpi.c
+++ b/hw/mem/nvdimm/acpi.c
@@ -382,12 +382,17 @@ struct cmd_out_get_config_data {
 uint8_t out_buf[0];
 } QEMU_PACKED;
 
+struct cmd_out_set_config_data {
+uint32_t status;
+} QEMU_PACKED;
+
 struct dsm_out {
 union {
 uint32_t status;
 struct cmd_out_implemented cmd_implemented;
 struct cmd_out_get_config_size cmd_config_size;
 struct cmd_out_get_config_data cmd_config_get;
+struct cmd_out_set_config_data cmd_config_set;
 uint8_t data[PAGE_SIZE];
 };
 };
@@ -483,6 +488,38 @@ exit:
 return status;
 }
 
+static uint32_t
+dsm_cmd_config_set(PCNVDIMMDevice *nvdimm, struct dsm_buffer *in,
+   struct dsm_out *out)
+{
+struct cmd_in_set_config_data *cmd_in = &in->cmd_config_set;
+uint32_t status;
+
+if (!nvdimm->configdata) {
+status = NFIT_STATUS_NOT_SUPPORTED;
+goto exit;
+}
+
+le32_to_cpus(&cmd_in->length);
+le32_to_cpus(&cmd_in->offset);
+
+nvdebug("Write Config: offset %#x length %#x.\n", cmd_in->offset,
+cmd_in->length);
+if (nvdimm->config_data_size < cmd_in->length + cmd_in->offset) {
+nvdebug("position %#x is beyond config data (len = %#lx).\n",
+cmd_in->length + cmd_in->offset, nvdimm->config_data_size);
+status = NFIT_STATUS_INVALID_PARAS;
+goto exit;
+}
+
+status = NFIT_STATUS_SUCCESS;
+memcpy(nvdimm->config_data_addr + cmd_in->offset, cmd_in->in_buf,
+   cmd_in->length);
+
+exit:
+return status;
+}
+
 static void dsm_write_nvdimm(struct dsm_buffer *in, struct dsm_out *out)
 {
 GSList *list = get_nvdimm_built_list();
@@ -510,6 +547,9 @@ static void dsm_write_nvdimm(struct dsm_buffer *in, struct dsm_out *out)
 case NFIT_CMD_GET_CONFIG_DATA:
 status = dsm_cmd_config_get(nvdimm, in, out);
 break;
+case NFIT_CMD_SET_CONFIG_DATA:
+status = dsm_cmd_config_set(nvdimm, in, out);
+break;
 default:
 status = NFIT_STATUS_NOT_SUPPORTED;
 };
-- 
2.4.3

[PATCH v2 15/18] nvdimm: support NFIT_CMD_GET_CONFIG_SIZE function

2015-08-14 Thread Xiao Guangrong

Function 4 is used to get the Namespace label size

Signed-off-by: Xiao Guangrong 
---
 hw/mem/nvdimm/acpi.c | 70 
 1 file changed, 70 insertions(+)

diff --git a/hw/mem/nvdimm/acpi.c b/hw/mem/nvdimm/acpi.c
index 20aefce..0a5f2c2 100644
--- a/hw/mem/nvdimm/acpi.c
+++ b/hw/mem/nvdimm/acpi.c
@@ -334,6 +334,17 @@ enum {
| (1 << NFIT_CMD_GET_CONFIG_SIZE)\
| (1 << NFIT_CMD_GET_CONFIG_DATA))
 
+struct cmd_in_get_config_data {
+uint32_t offset;
+uint32_t length;
+} QEMU_PACKED;
+
+struct cmd_in_set_config_data {
+uint32_t offset;
+uint32_t length;
+uint8_t in_buf[0];
+} QEMU_PACKED;
+
 struct dsm_buffer {
 /* RAM page. */
 uint32_t handle;
@@ -341,6 +352,7 @@ struct dsm_buffer {
 uint32_t arg1;
 uint32_t arg2;
 union {
+struct cmd_in_set_config_data cmd_config_set;
 char arg3[PAGE_SIZE - 3 * sizeof(uint32_t) - 16 * sizeof(uint8_t)];
 };
 
@@ -358,10 +370,23 @@ struct cmd_out_implemented {
 uint64_t cmd_list;
 };
 
+struct cmd_out_get_config_size {
+uint32_t status;
+uint32_t config_size;
+uint32_t max_xfer;
+} QEMU_PACKED;
+
+struct cmd_out_get_config_data {
+uint32_t status;
+uint8_t out_buf[0];
+} QEMU_PACKED;
+
 struct dsm_out {
 union {
 uint32_t status;
 struct cmd_out_implemented cmd_implemented;
+struct cmd_out_get_config_size cmd_config_size;
+struct cmd_out_get_config_data cmd_config_get;
 uint8_t data[PAGE_SIZE];
 };
 };
@@ -387,6 +412,48 @@ static void dsm_write_root(struct dsm_buffer *in, struct dsm_out *out)
 nvdebug("Return status %#x.\n", out->status);
 }
 
+/*
+ * the max transfer size is the max size transferred by both a
+ * NFIT_CMD_GET_CONFIG_DATA and a NFIT_CMD_SET_CONFIG_DATA
+ * command.
+ */
+static uint32_t max_xfer_config_size(void)
+{
+struct dsm_buffer *in;
+struct dsm_out *out;
+uint32_t max_get_size, max_set_size;
+
+/*
+ * the max data ACPI can read at one time, which is transferred by
+ * the response of NFIT_CMD_GET_CONFIG_DATA.
+ */
+max_get_size = sizeof(out->data) - sizeof(out->cmd_config_get);
+
+/*
+ * the max data ACPI can write at one time, which is transferred by
+ * NFIT_CMD_SET_CONFIG_DATA
+ */
+max_set_size = sizeof(in->arg3) - sizeof(in->cmd_config_set);
+return MIN(max_get_size, max_set_size);
+}
+
+static uint32_t
+dsm_cmd_config_size(PCNVDIMMDevice *nvdimm, struct dsm_buffer *in,
+struct dsm_out *out)
+{
+uint32_t config_size, mxfer;
+
+config_size = nvdimm->config_data_size;
+mxfer = max_xfer_config_size();
+
+out->cmd_config_size.config_size = cpu_to_le32(config_size);
+out->cmd_config_size.max_xfer = cpu_to_le32(mxfer);
+nvdebug("%s config_size %#x, max_xfer %#x.\n", __func__, config_size,
+mxfer);
+
+return NFIT_STATUS_SUCCESS;
+}
+
 static void dsm_write_nvdimm(struct dsm_buffer *in, struct dsm_out *out)
 {
 GSList *list = get_nvdimm_built_list();
@@ -408,6 +475,9 @@ static void dsm_write_nvdimm(struct dsm_buffer *in, struct dsm_out *out)
 
 out->cmd_implemented.cmd_list = cpu_to_le64(cmd_list);
 goto free;
+case NFIT_CMD_GET_CONFIG_SIZE:
+status = dsm_cmd_config_size(nvdimm, in, out);
+break;
 default:
 status = NFIT_STATUS_NOT_SUPPORTED;
 };
-- 
2.4.3
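
As a worked example of the arithmetic in max_xfer_config_size(), the sketch below recomputes both limits for a 4096-byte page. The struct and macro names are hypothetical; only the sizes are taken from the patch (a 4-byte status header on the output side, and an 8-byte offset/length header inside arg3 on the input side).

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u

/* Fixed-size headers only; the variable-length payload follows them. */
struct get_data_out_hdr {
    uint32_t status;
};

struct set_data_in_hdr {
    uint32_t offset;
    uint32_t length;
};

/* Size of arg3: one page minus handle, arg1, arg2 and the 16-byte arg0. */
#define SKETCH_ARG3_SIZE \
    (SKETCH_PAGE_SIZE - 3 * sizeof(uint32_t) - 16 * sizeof(uint8_t))

static uint32_t max_xfer_sketch(void)
{
    /* Output page minus the status word. */
    uint32_t max_get = (uint32_t)(SKETCH_PAGE_SIZE -
                                  sizeof(struct get_data_out_hdr));
    /* arg3 minus the offset/length header. */
    uint32_t max_set = (uint32_t)(SKETCH_ARG3_SIZE -
                                  sizeof(struct set_data_in_hdr));

    assert(max_get > 0 && max_set > 0);
    return max_get < max_set ? max_get : max_set;
}
```

Under these assumptions the read limit is 4096 - 4 = 4092 bytes, the write limit is 4068 - 8 = 4060 bytes, and the reported max_xfer is the smaller of the two.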

[PATCH v2 02/18] i386/acpi-build: allow SSDT to operate on 64 bit

2015-08-14 Thread Xiao Guangrong

Only 512M is left for MMIO below 4G, and it is already used by PCI, the
BIOS, etc. Other components also reserve regions for their internal use;
e.g. [0xFED00000, 0xFED00000 + 0x400) is reserved for the HPET.

Switch the SSDT to 64 bit to use the large free space above 4G. In later
patches, we will dynamically allocate space within this region for use by
the NVDIMM _DSM method.

Signed-off-by: Xiao Guangrong 
---
 hw/i386/acpi-build.c  | 4 ++--
 hw/i386/acpi-dsdt.dsl | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
index 46eddb8..8ead1c1 100644
--- a/hw/i386/acpi-build.c
+++ b/hw/i386/acpi-build.c
@@ -1348,7 +1348,7 @@ build_ssdt(GArray *table_data, GArray *linker,
 g_array_append_vals(table_data, ssdt->buf->data, ssdt->buf->len);
 build_header(linker, table_data,
 (void *)(table_data->data + table_data->len - ssdt->buf->len),
-"SSDT", ssdt->buf->len, 1);
+"SSDT", ssdt->buf->len, 2);
 free_aml_allocator();
 }
 
@@ -1586,7 +1586,7 @@ build_dsdt(GArray *table_data, GArray *linker, AcpiMiscInfo *misc)
 
 memset(dsdt, 0, sizeof *dsdt);
 build_header(linker, table_data, dsdt, "DSDT",
- misc->dsdt_size, 1);
+ misc->dsdt_size, 2);
 }
 
 static GArray *
diff --git a/hw/i386/acpi-dsdt.dsl b/hw/i386/acpi-dsdt.dsl
index a2d84ec..5cd3f0e 100644
--- a/hw/i386/acpi-dsdt.dsl
+++ b/hw/i386/acpi-dsdt.dsl
@@ -22,7 +22,7 @@ ACPI_EXTRACT_ALL_CODE AcpiDsdtAmlCode
 DefinitionBlock (
 "acpi-dsdt.aml",// Output Filename
 "DSDT", // Signature
-0x01,   // DSDT Compliance Revision
+0x02,   // DSDT Compliance Revision
 "BXPC", // OEMID
 "BXDSDT",   // TABLE ID
 0x1 // OEM Revision
-- 
2.4.3

[PATCH v2 14/18] nvdimm: support NFIT_CMD_IMPLEMENTED function

2015-08-14 Thread Xiao Guangrong

_DSM is defined in ACPI 6.0: 9.14.1 _DSM (Device Specific Method)

Function 0 is a query function. We do not support any other function on
the root device, and only 3 functions are supported for an NVDIMM device:
NFIT_CMD_GET_CONFIG_SIZE, NFIT_CMD_GET_CONFIG_DATA and
NFIT_CMD_SET_CONFIG_DATA. That means we currently only allow access to
the device's Label Namespace.
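
To illustrate what the query function returns, here is a small sketch of the command bitmask. The function numbers 0, 4, 5 and 6 are the ones named in this series; the SKETCH_* identifiers and both helpers are hypothetical and only mirror the logic of DIMM_SUPPORT_CMD and the configdata check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum {
    SKETCH_CMD_IMPLEMENTED = 0,
    SKETCH_CMD_GET_CONFIG_SIZE = 4,
    SKETCH_CMD_GET_CONFIG_DATA = 5,
    SKETCH_CMD_SET_CONFIG_DATA = 6,
};

/* Bit N set means function N is implemented for this NVDIMM device. */
static uint64_t dimm_cmd_list_sketch(bool configdata)
{
    uint64_t cmd_list = (1ULL << SKETCH_CMD_IMPLEMENTED)
                      | (1ULL << SKETCH_CMD_GET_CONFIG_SIZE)
                      | (1ULL << SKETCH_CMD_GET_CONFIG_DATA);

    /* Writing label data is only offered when configdata is set. */
    if (configdata) {
        cmd_list |= 1ULL << SKETCH_CMD_SET_CONFIG_DATA;
    }
    return cmd_list;
}

/* How a caller might test one function index against the mask. */
static bool cmd_supported(uint64_t cmd_list, unsigned function)
{
    assert(function < 64);
    return (cmd_list >> function) & 1;
}
```

Without configdata the mask is 0x31; with it, 0x71.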

Signed-off-by: Xiao Guangrong 
---
 hw/mem/nvdimm/acpi.c | 152 +++
 1 file changed, 152 insertions(+)

diff --git a/hw/mem/nvdimm/acpi.c b/hw/mem/nvdimm/acpi.c
index c773954..20aefce 100644
--- a/hw/mem/nvdimm/acpi.c
+++ b/hw/mem/nvdimm/acpi.c
@@ -31,6 +31,7 @@
 #include "exec/address-spaces.h"
 #include "hw/acpi/aml-build.h"
 #include "hw/mem/pc-nvdimm.h"
+#include "sysemu/sysemu.h"
 
 #include "internal.h"
 
@@ -41,6 +42,22 @@ static void nfit_spa_uuid_pm(void *uuid)
 memcpy(uuid, &uuid_pm, sizeof(uuid_pm));
 }
 
+static bool dsm_is_root_uuid(uint8_t *uuid)
+{
+uuid_le uuid_root = UUID_LE(0x2f10e7a4, 0x9e91, 0x11e4, 0x89,
+0xd3, 0x12, 0x3b, 0x93, 0xf7, 0x5c, 0xba);
+
+return !memcmp(uuid, &uuid_root, sizeof(uuid_root));
+}
+
+static bool dsm_is_dimm_uuid(uint8_t *uuid)
+{
+uuid_le uuid_dimm = UUID_LE(0x4309ac30, 0x0d11, 0x11e4, 0x91,
+0x91, 0x08, 0x00, 0x20, 0x0c, 0x9a, 0x66);
+
+return !memcmp(uuid, &uuid_dimm, sizeof(uuid_dimm));
+}
+
 enum {
 NFIT_TABLE_SPA = 0,
 NFIT_TABLE_MEM = 1,
@@ -162,6 +179,20 @@ static uint32_t nvdimm_index_to_handle(int index)
 return index + 1;
 }
 
+static PCNVDIMMDevice
+*get_nvdimm_device_by_handle(GSList *list, uint32_t handle)
+{
+for (; list; list = list->next) {
+PCNVDIMMDevice *nvdimm = list->data;
+
+if (nvdimm_index_to_handle(nvdimm->device_index) == handle) {
+return nvdimm;
+}
+}
+
+return NULL;
+}
+
 static size_t get_nfit_total_size(int nr)
 {
 /* each nvdimm has 3 tables. */
@@ -286,6 +317,23 @@ enum {
 NFIT_CMD_VENDOR = 9,
 };
 
+enum {
+NFIT_STATUS_SUCCESS = 0,
+NFIT_STATUS_NOT_SUPPORTED = 1,
+NFIT_STATUS_NON_EXISTING_MEM_DEV = 2,
+NFIT_STATUS_INVALID_PARAS = 3,
+NFIT_STATUS_VENDOR_SPECIFIC_ERROR = 4,
+};
+
+#define DSM_REVISION        (1)
+
+/* do not support any command except NFIT_CMD_IMPLEMENTED on root. */
+#define ROOT_SUPPORT_CMD    (1 << NFIT_CMD_IMPLEMENTED)
+/* support NFIT_CMD_SET_CONFIG_DATA iff nvdimm->configdata is true. */
+#define DIMM_SUPPORT_CMD    ((1 << NFIT_CMD_IMPLEMENTED)\
+   | (1 << NFIT_CMD_GET_CONFIG_SIZE)\
+   | (1 << NFIT_CMD_GET_CONFIG_DATA))
+
 struct dsm_buffer {
 /* RAM page. */
 uint32_t handle;
@@ -306,6 +354,18 @@ struct dsm_buffer {
 static ram_addr_t dsm_addr;
 static size_t dsm_size;
 
+struct cmd_out_implemented {
+uint64_t cmd_list;
+};
+
+struct dsm_out {
+union {
+uint32_t status;
+struct cmd_out_implemented cmd_implemented;
+uint8_t data[PAGE_SIZE];
+};
+};
+
 static uint64_t dsm_read(void *opaque, hwaddr addr,
  unsigned size)
 {
@@ -314,10 +374,102 @@ static uint64_t dsm_read(void *opaque, hwaddr addr,
 return 0;
 }
 
+static void dsm_write_root(struct dsm_buffer *in, struct dsm_out *out)
+{
+uint32_t function = in->arg2;
+
+if (function == NFIT_CMD_IMPLEMENTED) {
+out->cmd_implemented.cmd_list = cpu_to_le64(ROOT_SUPPORT_CMD);
+return;
+}
+
+out->status = cpu_to_le32(NFIT_STATUS_NOT_SUPPORTED);
+nvdebug("Return status %#x.\n", out->status);
+}
+
+static void dsm_write_nvdimm(struct dsm_buffer *in, struct dsm_out *out)
+{
+GSList *list = get_nvdimm_built_list();
+PCNVDIMMDevice *nvdimm = get_nvdimm_device_by_handle(list, in->handle);
+uint32_t function = in->arg2;
+uint32_t status = NFIT_STATUS_NON_EXISTING_MEM_DEV;
+uint64_t cmd_list;
+
+if (!nvdimm) {
+goto set_status_free;
+}
+
+switch (function) {
+case NFIT_CMD_IMPLEMENTED:
+cmd_list = DIMM_SUPPORT_CMD;
+if (nvdimm->configdata) {
+cmd_list |= 1 << NFIT_CMD_SET_CONFIG_DATA;
+}
+
+out->cmd_implemented.cmd_list = cpu_to_le64(cmd_list);
+goto free;
+default:
+status = NFIT_STATUS_NOT_SUPPORTED;
+};
+
+nvdebug("Return status %#x.\n", status);
+
+set_status_free:
+out->status = cpu_to_le32(status);
+free:
+g_slist_free(list);
+}
+
 static void dsm_write(void *opaque, hwaddr addr,
   uint64_t val, unsigned size)
 {
+struct MemoryRegion *dsm_ram_mr = opaque;
+struct dsm_buffer *dsm;
+struct dsm_out *out;
+void *buf;
+
 assert(val == NOTIFY_VALUE);
+
+buf = memory_region_get_ram_ptr(dsm_ram_mr);
+dsm = buf;
+out = buf;
+
+le32_to_cpus(&dsm->handle);
+le32_to_cpus(&dsm->arg1);
+le32_to_cpus(&dsm->arg2);
+
+nvdebug("Arg0 " UUID_FMT ".\n", dsm->arg0[0], dsm->

[PATCH v2 10/18] nvdimm: init the address region used by DSM method

2015-08-14 Thread Xiao Guangrong

This memory range is used to transfer data between ACPI in the guest and
QEMU. It occupies two pages:
- one is RAM-based; it holds the input of the _DSM method, and QEMU reuses
  it to store the output

- the other is MMIO-based; ACPI writes to this page to transfer control
  to QEMU

Signed-off-by: Xiao Guangrong 
---
 hw/mem/nvdimm/acpi.c  | 80 ++-
 hw/mem/nvdimm/internal.h  |  1 +
 hw/mem/nvdimm/pc-nvdimm.c |  2 +-
 3 files changed, 81 insertions(+), 2 deletions(-)

diff --git a/hw/mem/nvdimm/acpi.c b/hw/mem/nvdimm/acpi.c
index f28752f..e0f2ad3 100644
--- a/hw/mem/nvdimm/acpi.c
+++ b/hw/mem/nvdimm/acpi.c
@@ -28,6 +28,7 @@
 
 #include "qemu-common.h"
 
+#include "exec/address-spaces.h"
 #include "hw/acpi/aml-build.h"
 #include "hw/mem/pc-nvdimm.h"
 
@@ -257,14 +258,91 @@ static void build_nfit_table(GSList *device_list, char *buf)
 }
 }
 
+struct dsm_buffer {
+/* RAM page. */
+uint32_t handle;
+uint8_t arg0[16];
+uint32_t arg1;
+uint32_t arg2;
+union {
+char arg3[PAGE_SIZE - 3 * sizeof(uint32_t) - 16 * sizeof(uint8_t)];
+};
+
+/* MMIO page. */
+union {
+uint32_t notify;
+char pedding[PAGE_SIZE];
+};
+};
+
+static ram_addr_t dsm_addr;
+static size_t dsm_size;
+
+static uint64_t dsm_read(void *opaque, hwaddr addr,
+ unsigned size)
+{
+return 0;
+}
+
+static void dsm_write(void *opaque, hwaddr addr,
+  uint64_t val, unsigned size)
+{
+}
+
+static const MemoryRegionOps dsm_ops = {
+.read = dsm_read,
+.write = dsm_write,
+.endianness = DEVICE_LITTLE_ENDIAN,
+};
+
+static int build_dsm_buffer(void)
+{
+MemoryRegion *dsm_ram_mr, *dsm_mmio_mr;
+ram_addr_t addr;
+
+QEMU_BUILD_BUG_ON(PAGE_SIZE * 2 != sizeof(struct dsm_buffer));
+
+/* DSM buffer has already been built. */
+if (dsm_addr) {
+return 0;
+}
+
+addr = reserved_range_push(2 * PAGE_SIZE);
+if (!addr) {
+return -1;
+}
+
+dsm_addr = addr;
+dsm_size = PAGE_SIZE * 2;
+
+dsm_ram_mr = g_new(MemoryRegion, 1);
+memory_region_init_ram(dsm_ram_mr, NULL, "dsm_ram", PAGE_SIZE,
+   &error_abort);
+vmstate_register_ram_global(dsm_ram_mr);
+memory_region_add_subregion(get_system_memory(), addr, dsm_ram_mr);
+
+dsm_mmio_mr = g_new(MemoryRegion, 1);
+memory_region_init_io(dsm_mmio_mr, NULL, &dsm_ops, dsm_ram_mr,
+  "dsm_mmio", PAGE_SIZE);
+memory_region_add_subregion(get_system_memory(), addr + PAGE_SIZE,
+dsm_mmio_mr);
+return 0;
+}
+
 void pc_nvdimm_build_nfit_table(GArray *table_offsets, GArray *table_data,
 GArray *linker)
 {
-GSList *list = get_nvdimm_built_list();
+GSList *list;
 size_t total;
 char *buf;
 int nfit_start, nr;
 
+if (build_dsm_buffer()) {
+fprintf(stderr, "do not have enough space for DSM buffer.\n");
+return;
+}
+
+list = get_nvdimm_built_list();
 nr = get_nvdimm_device_number(list);
 total = get_nfit_total_size(nr);
 
diff --git a/hw/mem/nvdimm/internal.h b/hw/mem/nvdimm/internal.h
index 252a222..90d54dc 100644
--- a/hw/mem/nvdimm/internal.h
+++ b/hw/mem/nvdimm/internal.h
@@ -26,4 +26,5 @@ typedef struct {
 (d0), (d1), (d2), (d3), (d4), (d5), (d6), (d7) } })
 
 GSList *get_nvdimm_built_list(void);
+ram_addr_t reserved_range_push(uint64_t size);
 #endif
diff --git a/hw/mem/nvdimm/pc-nvdimm.c b/hw/mem/nvdimm/pc-nvdimm.c
index 2a6cfa2..752842a 100644
--- a/hw/mem/nvdimm/pc-nvdimm.c
+++ b/hw/mem/nvdimm/pc-nvdimm.c
@@ -45,7 +45,7 @@ void pc_nvdimm_reserve_range(ram_addr_t offset)
 nvdimms_info.current_addr = offset;
 }
 
-static ram_addr_t reserved_range_push(uint64_t size)
+ram_addr_t reserved_range_push(uint64_t size)
 {
 uint64_t current;
 
-- 
2.4.3
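
The two-page layout can be checked in isolation. The following is a sketch only: it mirrors struct dsm_buffer with hypothetical names, assumes a 4096-byte page, and uses a C11 static assertion in place of QEMU_BUILD_BUG_ON. The comments on arg0, arg1 and arg2 follow the _DSM argument order (UUID, revision, function index).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096

struct dsm_buffer_sketch {
    /* RAM page: _DSM input, later overwritten with the output. */
    uint32_t handle;
    uint8_t arg0[16];   /* UUID */
    uint32_t arg1;      /* revision */
    uint32_t arg2;      /* function index */
    uint8_t arg3[SKETCH_PAGE_SIZE - 3 * sizeof(uint32_t) - 16];

    /* MMIO page: a guest write here hands control to the emulator. */
    union {
        uint32_t notify;
        uint8_t padding[SKETCH_PAGE_SIZE];
    } mmio;
};

/* The RAM part must fill exactly one page, the whole buffer two. */
_Static_assert(offsetof(struct dsm_buffer_sketch, mmio) == SKETCH_PAGE_SIZE,
               "RAM part is not exactly one page");
_Static_assert(sizeof(struct dsm_buffer_sketch) == 2 * SKETCH_PAGE_SIZE,
               "DSM buffer is not exactly two pages");
```

The header fields take 28 bytes, so arg3 is 4068 bytes and the notify word lands at offset 4096, the first byte of the MMIO page.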

[PATCH v2 09/18] nvdimm: build ACPI NFIT table

2015-08-14 Thread Xiao Guangrong

NFIT is defined in ACPI 6.0: 5.2.25 NVDIMM Firmware Interface Table (NFIT)

Currently, we only support PMEM mode. Each device has 3 tables:
- SPA table, which defines the PMEM region info

- MEM DEV table, which holds the @handle used to associate the entry with
  the ACPI NVDIMM device introduced in a later patch.
  We can also safely ignore the memory device's interleave, since the real
  nvdimm hardware access is hidden behind the host

- DCR table, which defines the Vendor ID used to match a vendor-specific
  nvdimm driver. Since we only implement PMEM mode this time, the Command
  window and Data window are not needed

Signed-off-by: Xiao Guangrong 
---
 hw/i386/acpi-build.c   |   3 +
 hw/mem/Makefile.objs   |   2 +-
 hw/mem/nvdimm/acpi.c   | 285 +
 hw/mem/nvdimm/internal.h   |  29 +
 hw/mem/nvdimm/pc-nvdimm.c  |  27 -
 include/hw/mem/pc-nvdimm.h |   2 +
 6 files changed, 346 insertions(+), 2 deletions(-)
 create mode 100644 hw/mem/nvdimm/acpi.c
 create mode 100644 hw/mem/nvdimm/internal.h

diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
index 8ead1c1..092ed2f 100644
--- a/hw/i386/acpi-build.c
+++ b/hw/i386/acpi-build.c
@@ -39,6 +39,7 @@
 #include "hw/loader.h"
 #include "hw/isa/isa.h"
 #include "hw/acpi/memory_hotplug.h"
+#include "hw/mem/pc-nvdimm.h"
 #include "sysemu/tpm.h"
 #include "hw/acpi/tpm.h"
 #include "sysemu/tpm_backend.h"
@@ -1741,6 +1742,8 @@ void acpi_build(PcGuestInfo *guest_info, AcpiBuildTables *tables)
 build_dmar_q35(tables_blob, tables->linker);
 }
 
+pc_nvdimm_build_nfit_table(table_offsets, tables_blob, tables->linker);
+
 /* Add tables supplied by user (if any) */
 for (u = acpi_table_first(); u; u = acpi_table_next(u)) {
 unsigned len = acpi_table_len(u);
diff --git a/hw/mem/Makefile.objs b/hw/mem/Makefile.objs
index 4df7482..7a6948d 100644
--- a/hw/mem/Makefile.objs
+++ b/hw/mem/Makefile.objs
@@ -1,2 +1,2 @@
 common-obj-$(CONFIG_MEM_HOTPLUG) += pc-dimm.o
-common-obj-$(CONFIG_NVDIMM) += nvdimm/pc-nvdimm.o
+common-obj-$(CONFIG_NVDIMM) += nvdimm/pc-nvdimm.o nvdimm/acpi.o
diff --git a/hw/mem/nvdimm/acpi.c b/hw/mem/nvdimm/acpi.c
new file mode 100644
index 0000000..f28752f
--- /dev/null
+++ b/hw/mem/nvdimm/acpi.c
@@ -0,0 +1,285 @@
+/*
+ * NVDIMM (A Non-Volatile Dual In-line Memory Module) NFIT Implementation
+ *
+ * Copyright(C) 2015 Intel Corporation.
+ *
+ * Author:
+ *  Xiao Guangrong 
+ *
+ * NFIT is defined in ACPI 6.0: 5.2.25 NVDIMM Firmware Interface Table (NFIT)
+ * and the DSM specification can be found at:
+ *   http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf
+ *
+ * Currently, it only supports PMEM Virtualization.
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, see 
+ */
+
+#include "qemu-common.h"
+
+#include "hw/acpi/aml-build.h"
+#include "hw/mem/pc-nvdimm.h"
+
+#include "internal.h"
+
+static void nfit_spa_uuid_pm(void *uuid)
+{
+uuid_le uuid_pm = UUID_LE(0x66f0d379, 0xb4f3, 0x4074, 0xac, 0x43, 0x0d,
+  0x33, 0x18, 0xb7, 0x8c, 0xdb);
+memcpy(uuid, &uuid_pm, sizeof(uuid_pm));
+}
+
+enum {
+NFIT_TABLE_SPA = 0,
+NFIT_TABLE_MEM = 1,
+NFIT_TABLE_IDT = 2,
+NFIT_TABLE_SMBIOS = 3,
+NFIT_TABLE_DCR = 4,
+NFIT_TABLE_BDW = 5,
+NFIT_TABLE_FLUSH = 6,
+};
+
+enum {
+EFI_MEMORY_UC = 0x1ULL,
+EFI_MEMORY_WC = 0x2ULL,
+EFI_MEMORY_WT = 0x4ULL,
+EFI_MEMORY_WB = 0x8ULL,
+EFI_MEMORY_UCE = 0x10ULL,
+EFI_MEMORY_WP = 0x1000ULL,
+EFI_MEMORY_RP = 0x2000ULL,
+EFI_MEMORY_XP = 0x4000ULL,
+EFI_MEMORY_NV = 0x8000ULL,
+EFI_MEMORY_MORE_RELIABLE = 0x10000ULL,
+};
+
+/*
+ * struct nfit - Nvdimm Firmware Interface Table
+ * @signature: "NFIT"
+ */
+struct nfit {
+ACPI_TABLE_HEADER_DEF
+uint32_t reserved;
+} QEMU_PACKED;
+
+/*
+ * struct nfit_spa - System Physical Address Range Structure
+ */
+struct nfit_spa {
+uint16_t type;
+uint16_t length;
+uint16_t spa_index;
+uint16_t flags;
+uint32_t reserved;
+uint32_t proximity_domain;
+uint8_t type_uuid[16];
+uint64_t spa_base;
+uint64_t spa_length;
+uint64_t mem_attr;
+} QEMU_PACKED;
+
+/*
+ * struct nfit_memdev - Memory Device to SPA Map Structure
+ */
+struct nfit_memdev {
+uint16_t type;
+uint16_t length;
+uint32_t nfit_handle;
+uint16_t phys_id;
+uint16_t region_id;
+uint16_t s

[PATCH v2 01/18] acpi: allow aml_operation_region() working on 64 bit offset

2015-08-14 Thread Xiao Guangrong

Currently, the offset in OperationRegion is limited to 32 bits. Extend it
to 64 bits so that we can switch the SSDT to 64 bit in a later patch.

Signed-off-by: Xiao Guangrong 
---
 hw/acpi/aml-build.c | 2 +-
 include/hw/acpi/aml-build.h | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/hw/acpi/aml-build.c b/hw/acpi/aml-build.c
index 0d4b324..02f9e3d 100644
--- a/hw/acpi/aml-build.c
+++ b/hw/acpi/aml-build.c
@@ -752,7 +752,7 @@ Aml *aml_package(uint8_t num_elements)
 
 /* ACPI 1.0b: 16.2.5.2 Named Objects Encoding: DefOpRegion */
 Aml *aml_operation_region(const char *name, AmlRegionSpace rs,
-  uint32_t offset, uint32_t len)
+  uint64_t offset, uint32_t len)
 {
 Aml *var = aml_alloc();
 build_append_byte(var->buf, 0x5B); /* ExtOpPrefix */
diff --git a/include/hw/acpi/aml-build.h b/include/hw/acpi/aml-build.h
index e3afa13..996ac5b 100644
--- a/include/hw/acpi/aml-build.h
+++ b/include/hw/acpi/aml-build.h
@@ -222,7 +222,7 @@ Aml *aml_interrupt(AmlConsumerAndProducer con_and_pro,
 Aml *aml_io(AmlIODecode dec, uint16_t min_base, uint16_t max_base,
 uint8_t aln, uint8_t len);
 Aml *aml_operation_region(const char *name, AmlRegionSpace rs,
-  uint32_t offset, uint32_t len);
+  uint64_t offset, uint32_t len);
 Aml *aml_irq_no_flags(uint8_t irq);
 Aml *aml_named_field(const char *name, unsigned length);
 Aml *aml_reserved_field(unsigned length);
-- 
2.4.3
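
To see why the parameter type matters, the sketch below shows what happens to a region offset above 4G when it is passed through a 32-bit parameter. Both helpers are hypothetical and exist only for illustration; the example addresses are arbitrary.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* What a uint32_t parameter would silently do to a 64-bit offset. */
static uint32_t narrow_offset(uint64_t offset)
{
    return (uint32_t)offset;
}

/* True when the offset survives the narrowing unchanged. */
static bool fits_in_32_bits(uint64_t offset)
{
    return offset <= UINT32_MAX;
}
```

An offset of 0x100001000 (just above 4G) narrows to 0x1000, which would point the OperationRegion at an unrelated low address, while an offset such as 0xFED00000 is unaffected.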

[PATCH v2 05/18] acpi: add aml_create_field

2015-08-14 Thread Xiao Guangrong

Implement the CreateField term, which is used by the NVDIMM _DSM method in a later patch

Signed-off-by: Xiao Guangrong 
---
 hw/acpi/aml-build.c | 14 ++
 include/hw/acpi/aml-build.h |  1 +
 2 files changed, 15 insertions(+)

diff --git a/hw/acpi/aml-build.c b/hw/acpi/aml-build.c
index a526eed..debdad2 100644
--- a/hw/acpi/aml-build.c
+++ b/hw/acpi/aml-build.c
@@ -1151,6 +1151,20 @@ Aml *aml_sizeof(Aml *arg)
 return var;
 }
 
+/* ACPI 6.0: 20.2.5.2 Named Objects Encoding: DefCreateField */
+Aml *aml_create_field(Aml *srcbuf, Aml *index, Aml *len, const char *name)
+{
+Aml *var = aml_alloc();
+
+build_append_byte(var->buf, 0x5B); /* ExtOpPrefix */
+build_append_byte(var->buf, 0x13); /* CreateFieldOp */
+aml_append(var, srcbuf);
+aml_append(var, index);
+aml_append(var, len);
+build_append_namestring(var->buf, "%s", name);
+return var;
+}
+
 void
 build_header(GArray *linker, GArray *table_data,
  AcpiTableHeader *h, const char *sig, int len, uint8_t rev)
diff --git a/include/hw/acpi/aml-build.h b/include/hw/acpi/aml-build.h
index 6b591ab..d4dbd44 100644
--- a/include/hw/acpi/aml-build.h
+++ b/include/hw/acpi/aml-build.h
@@ -277,6 +277,7 @@ Aml *aml_touuid(const char *uuid);
 Aml *aml_unicode(const char *str);
 Aml *aml_derefof(Aml *arg);
 Aml *aml_sizeof(Aml *arg);
+Aml *aml_create_field(Aml *srcbuf, Aml *index, Aml *len, const char *name);
 
 void
 build_header(GArray *linker, GArray *table_data,
-- 
2.4.3
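
The byte layout produced by aml_create_field() can be sketched without the Aml builder. This is an illustration under assumptions, not QEMU code: struct aml_bytes and both functions are hypothetical, the operands are passed in already encoded, and the name is assumed to be a single 4-character name segment. Only the two opcode bytes (0x5B, 0x13) and the operand order come from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical view of an already-encoded AML operand. */
struct aml_bytes {
    const uint8_t *data;
    size_t len;
};

/* Append one operand at *pos and advance *pos. */
static void put_bytes(uint8_t *out, size_t *pos, struct aml_bytes b)
{
    memcpy(out + *pos, b.data, b.len);
    *pos += b.len;
}

/*
 * Emit: ExtOpPrefix CreateFieldOp SourceBuff BitIndex NumBits NameString.
 * Returns the number of bytes written, or 0 if out is too small.
 */
static size_t encode_create_field(uint8_t *out, size_t cap,
                                  struct aml_bytes srcbuf,
                                  struct aml_bytes bit_index,
                                  struct aml_bytes num_bits,
                                  const char *name4)
{
    size_t need = 2 + srcbuf.len + bit_index.len + num_bits.len + 4;
    size_t pos = 0;

    assert(strlen(name4) == 4);
    if (need > cap) {
        return 0;
    }

    out[pos++] = 0x5B; /* ExtOpPrefix */
    out[pos++] = 0x13; /* CreateFieldOp */
    put_bytes(out, &pos, srcbuf);
    put_bytes(out, &pos, bit_index);
    put_bytes(out, &pos, num_bits);
    memcpy(out + pos, name4, 4);
    return pos + 4;
}
```

For example, with operand bytes 0x60 (Local0), 0x00 (Zero) and 0x0A 0x20 (byte constant 32) and the name "STAT", the output is 5B 13 60 00 0A 20 followed by the four name characters.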

[PATCH v2 03/18] acpi: add aml_derefof

2015-08-14 Thread Xiao Guangrong

Implement the DeRefOf term, which is used by the NVDIMM _DSM method in a later patch

Signed-off-by: Xiao Guangrong 
---
 hw/acpi/aml-build.c | 8 
 include/hw/acpi/aml-build.h | 1 +
 2 files changed, 9 insertions(+)

diff --git a/hw/acpi/aml-build.c b/hw/acpi/aml-build.c
index 02f9e3d..9e89efc 100644
--- a/hw/acpi/aml-build.c
+++ b/hw/acpi/aml-build.c
@@ -1135,6 +1135,14 @@ Aml *aml_unicode(const char *str)
 return var;
 }
 
+/* ACPI 6.0: 20.2.5.4 Type 2 Opcodes Encoding: DefDerefOf */
+Aml *aml_derefof(Aml *arg)
+{
+Aml *var = aml_opcode(0x83 /* DerefOfOp */);
+aml_append(var, arg);
+return var;
+}
+
 void
 build_header(GArray *linker, GArray *table_data,
  AcpiTableHeader *h, const char *sig, int len, uint8_t rev)
diff --git a/include/hw/acpi/aml-build.h b/include/hw/acpi/aml-build.h
index 996ac5b..21dc5e9 100644
--- a/include/hw/acpi/aml-build.h
+++ b/include/hw/acpi/aml-build.h
@@ -275,6 +275,7 @@ Aml *aml_create_dword_field(Aml *srcbuf, Aml *index, const char *name);
 Aml *aml_varpackage(uint32_t num_elements);
 Aml *aml_touuid(const char *uuid);
 Aml *aml_unicode(const char *str);
+Aml *aml_derefof(Aml *arg);
 
 void
 build_header(GArray *linker, GArray *table_data,
-- 
2.4.3

[PATCH v2 07/18] nvdimm: reserve address range for NVDIMM

2015-08-14 Thread Xiao Guangrong

NVDIMM reserves all the free range above 4G in order to:
- map Persistent Memory (PMEM)
- implement the NVDIMM ACPI device _DSM method

Signed-off-by: Xiao Guangrong 
---
 hw/i386/pc.c   | 12 ++--
 hw/mem/nvdimm/pc-nvdimm.c  | 13 +
 include/hw/mem/pc-nvdimm.h |  1 +
 3 files changed, 24 insertions(+), 2 deletions(-)

diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index 7661ea9..41af6ea 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -64,6 +64,7 @@
 #include "hw/pci/pci_host.h"
 #include "acpi-build.h"
 #include "hw/mem/pc-dimm.h"
+#include "hw/mem/pc-nvdimm.h"
 #include "qapi/visitor.h"
 #include "qapi-visit.h"
 
@@ -1302,6 +1303,7 @@ FWCfgState *pc_memory_init(MachineState *machine,
 MemoryRegion *ram_below_4g, *ram_above_4g;
 FWCfgState *fw_cfg;
 PCMachineState *pcms = PC_MACHINE(machine);
+ram_addr_t offset;
 
 assert(machine->ram_size == below_4g_mem_size + above_4g_mem_size);
 
@@ -1339,6 +1341,8 @@ FWCfgState *pc_memory_init(MachineState *machine,
 exit(EXIT_FAILURE);
 }
 
+offset = 0x100000000ULL + above_4g_mem_size;
+
 /* initialize hotplug memory address space */
 if (guest_info->has_reserved_memory &&
 (machine->ram_size < machine->maxram_size)) {
@@ -1358,8 +1362,7 @@ FWCfgState *pc_memory_init(MachineState *machine,
 exit(EXIT_FAILURE);
 }
 
-pcms->hotplug_memory.base =
-ROUND_UP(0x100000000ULL + above_4g_mem_size, 1ULL << 30);
+pcms->hotplug_memory.base = ROUND_UP(offset, 1ULL << 30);
 
 if (pcms->enforce_aligned_dimm) {
 /* size hotplug region assuming 1G page max alignment per slot */
@@ -1377,8 +1380,13 @@ FWCfgState *pc_memory_init(MachineState *machine,
"hotplug-memory", hotplug_mem_size);
 memory_region_add_subregion(system_memory, pcms->hotplug_memory.base,
 &pcms->hotplug_memory.mr);
+
+offset = pcms->hotplug_memory.base + hotplug_mem_size;
 }
 
+ /* all the space left above 4G is reserved for NVDIMM. */
+pc_nvdimm_reserve_range(offset);
+
 /* Initialize PC system firmware */
 pc_system_firmware_init(rom_memory, guest_info->isapc_ram_fw);
 
diff --git a/hw/mem/nvdimm/pc-nvdimm.c b/hw/mem/nvdimm/pc-nvdimm.c
index a53d235..7a270a8 100644
--- a/hw/mem/nvdimm/pc-nvdimm.c
+++ b/hw/mem/nvdimm/pc-nvdimm.c
@@ -24,6 +24,19 @@
 
 #include "hw/mem/pc-nvdimm.h"
 
+#define PAGE_SIZE  (1UL << 12)
+
+static struct nvdimms_info {
+ram_addr_t current_addr;
+} nvdimms_info;
+
+/* the address range [offset, ~0ULL) is reserved for NVDIMM. */
+void pc_nvdimm_reserve_range(ram_addr_t offset)
+{
+offset = ROUND_UP(offset, PAGE_SIZE);
+nvdimms_info.current_addr = offset;
+}
+
 static char *get_file(Object *obj, Error **errp)
 {
 PCNVDIMMDevice *nvdimm = PC_NVDIMM(obj);
diff --git a/include/hw/mem/pc-nvdimm.h b/include/hw/mem/pc-nvdimm.h
index 51152b8..8601e9b 100644
--- a/include/hw/mem/pc-nvdimm.h
+++ b/include/hw/mem/pc-nvdimm.h
@@ -28,4 +28,5 @@ typedef struct PCNVDIMMDevice {
 #define PC_NVDIMM(obj) \
 OBJECT_CHECK(PCNVDIMMDevice, (obj), TYPE_PC_NVDIMM)
 
+void pc_nvdimm_reserve_range(ram_addr_t offset);
 #endif
-- 
2.4.3


[PATCH v2 06/18] pc: implement NVDIMM device abstract

2015-08-14 Thread Xiao Guangrong

Introduce the "pc-nvdimm" device, which has two parameters:
- @file, the backing memory file for the NVDIMM device

- @configdata, which specifies whether to reserve 128K at the end of
  @file for the NVDIMM device's config data. The default is false

If @configdata is false, Qemu builds a static, read-only namespace in
memory and uses it to serve DSM GET_CONFIG_SIZE/GET_CONFIG_DATA
requests. This suits users who want to pass a whole nvdimm device to
the guest and make all of its data visible there

We can use "-device pc-nvdimm,file=/dev/pmem,configdata" in the
Qemu command line to create an NVDIMM device for the guest

Signed-off-by: Xiao Guangrong 
---
 default-configs/i386-softmmu.mak   |  1 +
 default-configs/x86_64-softmmu.mak |  1 +
 hw/Makefile.objs   |  2 +-
 hw/mem/Makefile.objs   |  1 +
 hw/mem/nvdimm/pc-nvdimm.c  | 99 ++
 include/hw/mem/pc-nvdimm.h | 31 
 6 files changed, 134 insertions(+), 1 deletion(-)
 create mode 100644 hw/mem/nvdimm/pc-nvdimm.c
 create mode 100644 include/hw/mem/pc-nvdimm.h

diff --git a/default-configs/i386-softmmu.mak b/default-configs/i386-softmmu.mak
index 48b5762..67fc3a8 100644
--- a/default-configs/i386-softmmu.mak
+++ b/default-configs/i386-softmmu.mak
@@ -49,3 +49,4 @@ CONFIG_MEM_HOTPLUG=y
 CONFIG_XIO3130=y
 CONFIG_IOH3420=y
 CONFIG_I82801B11=y
+CONFIG_NVDIMM=y
diff --git a/default-configs/x86_64-softmmu.mak 
b/default-configs/x86_64-softmmu.mak
index 4962ed7..dfcde36 100644
--- a/default-configs/x86_64-softmmu.mak
+++ b/default-configs/x86_64-softmmu.mak
@@ -50,3 +50,4 @@ CONFIG_MEM_HOTPLUG=y
 CONFIG_XIO3130=y
 CONFIG_IOH3420=y
 CONFIG_I82801B11=y
+CONFIG_NVDIMM=y
diff --git a/hw/Makefile.objs b/hw/Makefile.objs
index 73afa41..1e25d3f 100644
--- a/hw/Makefile.objs
+++ b/hw/Makefile.objs
@@ -30,7 +30,7 @@ devices-dirs-$(CONFIG_SOFTMMU) += vfio/
 devices-dirs-$(CONFIG_VIRTIO) += virtio/
 devices-dirs-$(CONFIG_SOFTMMU) += watchdog/
 devices-dirs-$(CONFIG_SOFTMMU) += xen/
-devices-dirs-$(CONFIG_MEM_HOTPLUG) += mem/
+devices-dirs-y += mem/
 devices-dirs-y += core/
 common-obj-y += $(devices-dirs-y)
 obj-y += $(devices-dirs-y)
diff --git a/hw/mem/Makefile.objs b/hw/mem/Makefile.objs
index b000fb4..4df7482 100644
--- a/hw/mem/Makefile.objs
+++ b/hw/mem/Makefile.objs
@@ -1 +1,2 @@
 common-obj-$(CONFIG_MEM_HOTPLUG) += pc-dimm.o
+common-obj-$(CONFIG_NVDIMM) += nvdimm/pc-nvdimm.o
diff --git a/hw/mem/nvdimm/pc-nvdimm.c b/hw/mem/nvdimm/pc-nvdimm.c
new file mode 100644
index 0000000..a53d235
--- /dev/null
+++ b/hw/mem/nvdimm/pc-nvdimm.c
@@ -0,0 +1,99 @@
+/*
+ * NVDIMM (A Non-Volatile Dual In-line Memory Module) Virtualization Implement
+ *
+ * Copyright(C) 2015 Intel Corporation.
+ *
+ * Author:
+ *  Xiao Guangrong 
+ *
+ * Currently, it only supports PMEM Virtualization.
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, see 
+ */
+
+#include "hw/mem/pc-nvdimm.h"
+
+static char *get_file(Object *obj, Error **errp)
+{
+PCNVDIMMDevice *nvdimm = PC_NVDIMM(obj);
+
+return g_strdup(nvdimm->file);
+}
+
+static void set_file(Object *obj, const char *str, Error **errp)
+{
+PCNVDIMMDevice *nvdimm = PC_NVDIMM(obj);
+
+if (nvdimm->file) {
+g_free(nvdimm->file);
+}
+
+nvdimm->file = g_strdup(str);
+}
+
+static bool has_configdata(Object *obj, Error **errp)
+{
+PCNVDIMMDevice *nvdimm = PC_NVDIMM(obj);
+
+return nvdimm->configdata;
+}
+
+static void set_configdata(Object *obj, bool value, Error **errp)
+{
+PCNVDIMMDevice *nvdimm = PC_NVDIMM(obj);
+
+nvdimm->configdata = value;
+}
+
+static void pc_nvdimm_init(Object *obj)
+{
+object_property_add_str(obj, "file", get_file, set_file, NULL);
+object_property_add_bool(obj, "configdata", has_configdata,
+ set_configdata, NULL);
+}
+
+static void pc_nvdimm_realize(DeviceState *dev, Error **errp)
+{
+PCNVDIMMDevice *nvdimm = PC_NVDIMM(dev);
+
+if (!nvdimm->file) {
+error_setg(errp, "file property is not set");
+}
+}
+
+static void pc_nvdimm_class_init(ObjectClass *oc, void *data)
+{
+DeviceClass *dc = DEVICE_CLASS(oc);
+
+/* nvdimm hotplug has not been supported yet. */
+dc->hotpluggable = false;
+
+dc->realize = pc_nvdimm_realize;
+dc->desc = "NVDIMM memory module";
+}
+
+static TypeInfo pc_nvdimm_info = {
+.name  = TY

[PATCH v2 08/18] nvdimm: init backend memory mapping and config data area

2015-08-14 Thread Xiao Guangrong

The parameter @file is used as backing memory for the NVDIMM. If
@configdata is true, it is divided into two parts:
- the first part, [0, size - 128K), is used as PMEM (Persistent
  Memory)
- the 128K at the end of the file is used as the Config Data Area,
  which stores Label namespace data

@file may be either a regular file or a block device. Any file of
either kind works for testing and emulation; in the real world,
however, for performance reasons the NVDIMM backing file is usually
one of:
- a regular file in a DAX-enabled filesystem created on an NVDIMM
  device on the host
- the raw PMEM device on the host, e.g. /dev/pmem0

Signed-off-by: Xiao Guangrong 
---
 hw/mem/nvdimm/pc-nvdimm.c  | 109 -
 include/hw/mem/pc-nvdimm.h |   7 +++
 2 files changed, 115 insertions(+), 1 deletion(-)

diff --git a/hw/mem/nvdimm/pc-nvdimm.c b/hw/mem/nvdimm/pc-nvdimm.c
index 7a270a8..97710d1 100644
--- a/hw/mem/nvdimm/pc-nvdimm.c
+++ b/hw/mem/nvdimm/pc-nvdimm.c
@@ -22,12 +22,20 @@
  * License along with this library; if not, see 
  */
 
+#include 
+#include 
+#include 
+
+#include "exec/address-spaces.h"
 #include "hw/mem/pc-nvdimm.h"
 
-#define PAGE_SIZE  (1UL << 12)
+#define PAGE_SIZE   (1UL << 12)
+
+#define MIN_CONFIG_DATA_SIZE (128 << 10)
 
 static struct nvdimms_info {
 ram_addr_t current_addr;
+int device_index;
 } nvdimms_info;
 
 /* the address range [offset, ~0ULL) is reserved for NVDIMM. */
@@ -37,6 +45,26 @@ void pc_nvdimm_reserve_range(ram_addr_t offset)
 nvdimms_info.current_addr = offset;
 }
 
+static ram_addr_t reserved_range_push(uint64_t size)
+{
+uint64_t current;
+
+current = ROUND_UP(nvdimms_info.current_addr, PAGE_SIZE);
+
+/* do not have enough space? */
+if (current + size < current) {
+return 0;
+}
+
+nvdimms_info.current_addr = current + size;
+return current;
+}
+
+static uint32_t new_device_index(void)
+{
+return nvdimms_info.device_index++;
+}
+
 static char *get_file(Object *obj, Error **errp)
 {
 PCNVDIMMDevice *nvdimm = PC_NVDIMM(obj);
@@ -48,6 +76,11 @@ static void set_file(Object *obj, const char *str, Error 
**errp)
 {
 PCNVDIMMDevice *nvdimm = PC_NVDIMM(obj);
 
+if (memory_region_size(&nvdimm->mr)) {
+error_setg(errp, "cannot change property value");
+return;
+}
+
 if (nvdimm->file) {
 g_free(nvdimm->file);
 }
@@ -76,13 +109,87 @@ static void pc_nvdimm_init(Object *obj)
  set_configdata, NULL);
 }
 
+static uint64_t get_file_size(int fd)
+{
+struct stat stat_buf;
+uint64_t size;
+
+if (fstat(fd, &stat_buf) < 0) {
+return 0;
+}
+
+if (S_ISREG(stat_buf.st_mode)) {
+return stat_buf.st_size;
+}
+
+if (S_ISBLK(stat_buf.st_mode) && !ioctl(fd, BLKGETSIZE64, &size)) {
+return size;
+}
+
+return 0;
+}
+
 static void pc_nvdimm_realize(DeviceState *dev, Error **errp)
 {
 PCNVDIMMDevice *nvdimm = PC_NVDIMM(dev);
+char name[512];
+void *buf;
+ram_addr_t addr;
+uint64_t size, nvdimm_size, config_size = MIN_CONFIG_DATA_SIZE;
+int fd;
 
 if (!nvdimm->file) {
 error_setg(errp, "file property is not set");
 }
+
+fd = open(nvdimm->file, O_RDWR);
+if (fd < 0) {
+error_setg(errp, "can not open %s", nvdimm->file);
+return;
+}
+
+size = get_file_size(fd);
+buf = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+if (buf == MAP_FAILED) {
+error_setg(errp, "can not do mmap on %s", nvdimm->file);
+goto do_close;
+}
+
+nvdimm->config_data_size = config_size;
+if (nvdimm->configdata) {
+/* reserve MIN_CONFIG_DATA_SIZE for config data. */
+nvdimm_size = size - config_size;
+nvdimm->config_data_addr = buf + nvdimm_size;
+} else {
+nvdimm_size = size;
+nvdimm->config_data_addr = NULL;
+}
+
+if ((int64_t)nvdimm_size <= 0) {
+error_setg(errp, "file size is too small to store NVDIMM"
+ " configure data");
+goto do_unmap;
+}
+
+addr = reserved_range_push(nvdimm_size);
+if (!addr) {
+error_setg(errp, "do not have enough space for size %#lx.\n", size);
+goto do_unmap;
+}
+
+nvdimm->device_index = new_device_index();
+sprintf(name, "NVDIMM-%d", nvdimm->device_index);
+memory_region_init_ram_ptr(&nvdimm->mr, OBJECT(dev), name, nvdimm_size,
+   buf);
+vmstate_register_ram(&nvdimm->mr, DEVICE(dev));
+memory_region_add_subregion(get_system_memory(), addr, &nvdimm->mr);
+
+return;
+
+do_unmap:
+munmap(buf, size);
+do_close:
+close(fd);
 }
 
 static void pc_nvdimm_class_init(ObjectClass *oc, void *data)
diff --git a/include/hw/mem/pc-nvdimm.h b/include/hw/mem/pc-nvdimm.h
index 8601e9b..f617fd2 100644
--- a/include/hw/mem

[PATCH v2 00/18] implement vNVDIMM

2015-08-14 Thread Xiao Guangrong

Changelog:
- Use little endian for the DSM method, thanks to Stefan's suggestion

- Introduce a new parameter, @configdata. If it is false, Qemu will
  build a static, read-only namespace in memory and use it to serve
  DSM GET_CONFIG_SIZE/GET_CONFIG_DATA requests. In this case, no
  reserved region is needed at the end of @file, which suits users
  who want to pass a whole nvdimm device to the guest and make all of
  its data visible there

- Split the source code into separate files and add maintainer info

BTW, PCOMMIT virtualization on the KVM side is a work in progress and will
hopefully be posted next week

== Background ==
NVDIMM (Non-Volatile Dual In-line Memory Module) is going to be supported
on Intel's platform. NVDIMMs are discovered via ACPI and configured through
the _DSM method of the NVDIMM device in ACPI. Supporting documents can be
found at:
ACPI 6: http://www.uefi.org/sites/default/files/resources/ACPI_6.0.pdf
NVDIMM Namespace: http://pmem.io/documents/NVDIMM_Namespace_Spec.pdf
DSM Interface Example: http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf
Driver Writer's Guide: http://pmem.io/documents/NVDIMM_Driver_Writers_Guide.pdf

The NVDIMM driver has already been merged into the upstream Linux kernel;
this patchset tries to enable it for virtualization

== Design ==
NVDIMM supports two access modes. One is PMEM, which maps the NVDIMM into
the CPU's address space so that the CPU can access it directly as normal
memory. The other is BLK, which exposes it as a block device to reduce the
amount of CPU address space it occupies

BLK mode accesses the NVDIMM via a Command Register window and a Data
Register window. BLK virtualization has high overhead, since each sector
access causes at least two VM-EXITs. So we currently only implement vPMEM
in this patchset

--- vPMEM design ---
We introduce a new device named "pc-nvdimm". It has a parameter, file, which
is the file-based backing memory passed to the guest. The file can be a
regular file or a block device. Any file works for testing or emulation;
in the real world, however, the files passed to the guest are:
- a regular file in a DAX-enabled filesystem created on an NVDIMM device
  on the host
- the raw PMEM device on the host, e.g. /dev/pmem0
Memory accesses to addresses created by mmap on these kinds of files
reach the NVDIMM device on the host directly.

--- vConfigure data area design ---
Each NVDIMM device has a configure data area, which is used to store label
namespace data. In order to emulate this area, we divide the file into two
parts:
- the first part, [0, size - 128K), is used as PMEM
- the 128K at the end of the file is used as the Config Data Area
This way the label namespace data persists across power loss or system
failure
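
As a rough illustration of the split described above, the following C
sketch computes the two regions for a given backing file size. The names
nvdimm_split_file() and struct nvdimm_layout are hypothetical and are not
part of the patchset; only the 128K figure is taken from the text.

```c
#include <stdbool.h>
#include <stdint.h>

/* Size of the Config Data Area at the end of the file (128K, from the text). */
#define CONFIG_DATA_SIZE (128ULL << 10)

/* Hypothetical helper type used only for this illustration. */
struct nvdimm_layout {
    uint64_t pmem_size;     /* bytes exposed to the guest as PMEM */
    uint64_t config_offset; /* file offset of the Config Data Area */
    uint64_t config_size;   /* 0 when nothing is reserved in the file */
};

/*
 * Split a backing file of file_size bytes. Returns false when the file
 * cannot hold a non-empty PMEM region (plus the Config Data Area when
 * configdata is true).
 */
static bool nvdimm_split_file(uint64_t file_size, bool configdata,
                              struct nvdimm_layout *out)
{
    if (!configdata) {
        if (file_size == 0) {
            return false;
        }
        out->pmem_size = file_size;
        out->config_offset = 0;
        out->config_size = 0;
        return true;
    }

    /* The file must be strictly larger than the reserved area. */
    if (file_size <= CONFIG_DATA_SIZE) {
        return false;
    }
    out->pmem_size = file_size - CONFIG_DATA_SIZE;
    out->config_offset = out->pmem_size; /* area starts where PMEM ends */
    out->config_size = CONFIG_DATA_SIZE;
    return true;
}
```

Rejecting files that are not larger than 128K up front avoids the unsigned
underflow that a plain "size - 128K" would otherwise produce.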

--- _DSM method design ---
_DSM in ACPI is used to configure the NVDIMM. Currently we only allow access
to label namespace data, i.e. Get Namespace Label Size (Function Index 4),
Get Namespace Label Data (Function Index 5) and Set Namespace Label Data
(Function Index 6)
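
One way to picture this restriction is a small whitelist keyed by function
index. The sketch below is only an illustration: the enum names, the
bitmask and dsm_function_allowed() are made up here, and only the indices
4, 5 and 6 come from the description above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Function indices as listed in the description above. */
enum {
    DSM_GET_LABEL_SIZE = 4,
    DSM_GET_LABEL_DATA = 5,
    DSM_SET_LABEL_DATA = 6,
};

/* Hypothetical bitmask: bit N set means function index N is accepted. */
static const uint32_t dsm_allowed_mask =
    (1u << DSM_GET_LABEL_SIZE) |
    (1u << DSM_GET_LABEL_DATA) |
    (1u << DSM_SET_LABEL_DATA);

/* Returns true only for the three label-namespace functions. */
static bool dsm_function_allowed(uint32_t index)
{
    /* Guard the shift: indices of 32 or more cannot be in a 32-bit mask. */
    return index < 32 && (dsm_allowed_mask & (1u << index)) != 0;
}
```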

_DSM uses two pages to transfer data between ACPI and Qemu. The first page
is RAM-based: it holds the input of the _DSM method, and Qemu reuses it to
store the output. The other page is MMIO-based: ACPI writes data to this
page to transfer control to Qemu

We map these pages in the address region above 4G because there is a huge
amount of free space there, which avoids overlapping with PCI and other
components that reserve addresses (e.g. HPET). This is also the reason we
chose MMIO notification instead of PIO
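
Handing out guest-physical ranges from the free space above 4G can be
sketched as a page-aligned bump allocator, similar in spirit to
reserved_range_push() in this series. The code below is a simplified
stand-alone model, not the Qemu implementation: struct addr_pool and
addr_pool_alloc() are hypothetical names, and a return value of 0 is
assumed to mean failure.

```c
#include <stdint.h>

#define PAGE_SIZE_4K 4096ULL

/* Hypothetical allocator state: the next unused guest-physical address. */
struct addr_pool {
    uint64_t next;
};

/* Round v up to the next 4K boundary (wraps to a small value on overflow). */
static uint64_t round_up_page(uint64_t v)
{
    return (v + PAGE_SIZE_4K - 1) & ~(PAGE_SIZE_4K - 1);
}

/*
 * Reserve size bytes starting at the next page boundary.
 * Returns the base address, or 0 if the range does not fit in 64 bits.
 */
static uint64_t addr_pool_alloc(struct addr_pool *pool, uint64_t size)
{
    uint64_t base = round_up_page(pool->next);

    if (base < pool->next) {
        return 0; /* rounding wrapped around */
    }
    if (size == 0 || base + size < base) {
        return 0; /* empty request, or base + size overflowed */
    }
    pool->next = base + size;
    return base;
}
```

Starting the pool at 4G plus the size of RAM above 4G keeps every
allocation clear of the PCI hole and other low fixed ranges.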

== Test ==
On the host:
1) create the memory backing file, e.g. # dd if=/dev/zero of=/tmp/nvdimm bs=1G count=10
2) append '-device pc-nvdimm,file=/tmp/nvdimm' to the Qemu command line

In the guest, download the latest upstream kernel (4.2 merge window) and enable
ACPI_NFIT, LIBNVDIMM and BLK_DEV_PMEM.
1) insmod drivers/nvdimm/libnvdimm.ko
2) insmod drivers/acpi/nfit.ko
3) insmod drivers/nvdimm/nd_btt.ko
4) insmod drivers/nvdimm/nd_pmem.ko
The whole nvdimm device is then used as a single namespace and /dev/pmem0
appears. You can do anything on /dev/pmem0, including DAX access.

Currently the Linux NVDIMM driver does not support namespace operations on this
kind of PMEM. Apply the changes below to support dynamic namespaces:

@@ -798,7 +823,8 @@ static int acpi_nfit_register_dimms(struct acpi_nfit_desc *a
continue;
}
 
-   if (nfit_mem->bdw && nfit_mem->memdev_pmem)
+   //if (nfit_mem->bdw && nfit_mem->memdev_pmem)
+   if (nfit_mem->memdev_pmem)
flags |= NDD_ALIASING;

You can then add another NVDIMM device to the guest and run:
# cd /sys/bus/nd/devices/
# cd namespace1.0/
# echo `uuidgen` > uuid
# echo `expr 1024 \* 1024 \* 128` > size
then reload nd_pmem.ko

/dev/pmem1 then appears

== TODO ==
1) NVDIMM NUMA support
2) NVDIMM hotplug support

Xiao Guangrong (18):
  acpi: allow aml_operation_region() working on 64 bit off

[PATCH v2 04/18] acpi: add aml_sizeof

2015-08-14 Thread Xiao Guangrong

Implement the SizeOf term, which is used by the NVDIMM _DSM method in a later patch

Signed-off-by: Xiao Guangrong 
---
 hw/acpi/aml-build.c | 8 
 include/hw/acpi/aml-build.h | 1 +
 2 files changed, 9 insertions(+)

diff --git a/hw/acpi/aml-build.c b/hw/acpi/aml-build.c
index 9e89efc..a526eed 100644
--- a/hw/acpi/aml-build.c
+++ b/hw/acpi/aml-build.c
@@ -1143,6 +1143,14 @@ Aml *aml_derefof(Aml *arg)
 return var;
 }
 
+/* ACPI 6.0: 20.2.5.4 Type 2 Opcodes Encoding: DefSizeOf */
+Aml *aml_sizeof(Aml *arg)
+{
+Aml *var = aml_opcode(0x87 /* SizeOfOp */);
+aml_append(var, arg);
+return var;
+}
+
 void
 build_header(GArray *linker, GArray *table_data,
  AcpiTableHeader *h, const char *sig, int len, uint8_t rev)
diff --git a/include/hw/acpi/aml-build.h b/include/hw/acpi/aml-build.h
index 21dc5e9..6b591ab 100644
--- a/include/hw/acpi/aml-build.h
+++ b/include/hw/acpi/aml-build.h
@@ -276,6 +276,7 @@ Aml *aml_varpackage(uint32_t num_elements);
 Aml *aml_touuid(const char *uuid);
 Aml *aml_unicode(const char *str);
 Aml *aml_derefof(Aml *arg);
+Aml *aml_sizeof(Aml *arg);
 
 void
 build_header(GArray *linker, GArray *table_data,
-- 
2.4.3
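
For readers unfamiliar with AML, the byte stream that aml_sizeof() is
expected to produce can be sketched without Qemu's Aml helpers. In the AML
grammar, DefSizeOf is the SizeOfOp byte (0x87, as in the patch) followed by
the encoded operand. The function aml_encode_sizeof() below is a
hypothetical stand-alone illustration and is not part of aml-build.c; the
operand is assumed to be already encoded, and 0x68 is Arg0Op in the AML
opcode table.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define AML_SIZEOF_OP 0x87 /* SizeOfOp, same value the patch uses */
#define AML_ARG0_OP   0x68 /* Arg0Op, used here as a sample operand */

/*
 * Write SizeOfOp followed by an already-encoded operand into buf.
 * Returns the number of bytes written, or 0 if the operand is empty
 * or the buffer is too small.
 */
static size_t aml_encode_sizeof(uint8_t *buf, size_t cap,
                                const uint8_t *operand, size_t operand_len)
{
    if (operand_len == 0 || operand_len >= cap) {
        return 0;
    }
    buf[0] = AML_SIZEOF_OP;
    memcpy(buf + 1, operand, operand_len);
    return operand_len + 1;
}
```

Under these assumptions, encoding SizeOf(Arg0) yields the two bytes
0x87 0x68.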


RE: [PATCH v3 2/3] Detect vGIC presence at runtime

2015-08-14 Thread Pavel Fedin

 Hello!

> This is completely Linux-specific, unfortunately.

 Yes. But better than nothing.

> And it relies on
> userspace to expose a modified DT, so you need to be able to report back
> to userspace that you can't deal with the virtual timer.

 Easy. If KVM_CAP_IRQCHIP == 0, then we apparently don't have vGIC, and since
we know that vGIC and vTimer are paired, we know that there is no vTimer
either.
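
From the userspace side, such a check could look roughly like the sketch
below. It uses the standard KVM_CHECK_EXTENSION ioctl on /dev/kvm; the
helper names kvm_query_irqchip_cap() and vtimer_expected() are invented for
this illustration, and the assumption that a zero result also implies "no
virtual timer" is the pairing argument made above, not something the kernel
API states by itself.

```c
#include <fcntl.h>
#include <stdbool.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/kvm.h>

/*
 * Ask KVM whether an in-kernel irqchip is available.
 * Returns 1 if KVM_CAP_IRQCHIP is reported, 0 if it is not,
 * and -1 if /dev/kvm cannot be opened or queried.
 */
static int kvm_query_irqchip_cap(void)
{
    int fd = open("/dev/kvm", O_RDWR);
    int ret;

    if (fd < 0) {
        return -1;
    }
    ret = ioctl(fd, KVM_CHECK_EXTENSION, KVM_CAP_IRQCHIP);
    close(fd);
    if (ret < 0) {
        return -1;
    }
    return ret > 0 ? 1 : 0;
}

/* Pairing assumption from the discussion: no vGIC means no vTimer. */
static bool vtimer_expected(int irqchip_cap)
{
    return irqchip_cap > 0;
}
```

A VMM would call kvm_query_irqchip_cap() once at startup and, when
vtimer_expected() is false, leave the timer node out of the guest's device
tree.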

Kind regards,
Pavel Fedin
Expert Engineer
Samsung Electronics Research center Russia


Re: [PATCH v3 2/3] Detect vGIC presence at runtime

2015-08-14 Thread Marc Zyngier

On 14/08/15 13:26, Pavel Fedin wrote:
>  Hello! Thank you for quick response.
> 
>> This is fairly unreadable. Please use a switch statement instead.
> 
>  Christoffer disliked it in v1, so i thought a bit and changed it. Ok, will 
> change it back.
> 
>> And here, we're going to assume that the arch timer still usable. We
>> definitely need a way to *prevent* the timer to be used when there is no
>> GIC. Otherwise, we're going to start trying to setup the mapping for the
>> active state, and the guest may start poking it.
> 
> But, this seems to be already done, isn't it?
> According to http://lxr.free-electrons.com/source/arch/arm/kvm/arm.c#L439:
> --- cut ---
> 459 /*
> 460  * Enable the arch timers only if we have an in-kernel VGIC
> 461  * and it has been properly initialized, since we cannot handle
> 462  * interrupts from the virtual timer with a userspace gic.
> 463  */
> 464 if (irqchip_in_kernel(kvm) && vgic_initialized(kvm))
> 465 kvm_timer_enable(kvm);
> --- cut ---

Right, I failed to remember that one. Sorry. It should be safe then.
Hopefully.

[...]

>  And some more. Actually, it is possible to emulate generic timer in
> userspace, just not the virtual one. IIRC access to physical timer can be
> trapped. So, if we modify guest's device tree by removing virtual timer
> IRQ, the guest will fall back to physical timer. And this will be caught
> by the hypervisor. After this all we have to do is to add corresponding
> exit code which would allow the userspace to emulate missing CP15 (or
> system in case of ARM64) registers. So, this timer issue is not grave,
> just i postpone implementing it until GIC issues are settled down.

This is completely Linux-specific, unfortunately. And it relies on
userspace to expose a modified DT, so you need to be able to report back
to userspace that you can't deal with the virtual timer.

Which brings me to the next point: how do you tell userspace that your
timers are non-functional?

Thanks,

M.
-- 
Jazz is not dead. It just smells funny...

Re: Help debugging a regression in KVM Module

2015-08-14 Thread Paolo Bonzini



----- Original Message -----
> From: "Peter Lieven" 
> To: qemu-de...@nongnu.org, kvm@vger.kernel.org
> Cc: "Paolo Bonzini" 
> Sent: Friday, August 14, 2015 1:11:34 PM
> Subject: Help debugging a regression in KVM Module
> 
> Hi,
> 
> some time ago I stumbled across a regression in the KVM module that was
> introduced somewhere between 3.17 and 3.19.
> 
> I have a rather old openSUSE guest with an XFS filesystem which reliably
> crashes after some live migrations.
> I originally believed that the issue might be related to my setup with a
> 3.12 host kernel and kvm-kmod 3.19, but I have now found that it is also
> present with a 3.19 host kernel and its included 3.19 kvm module.
> 
> My idea was to continue testing on a 3.12 host kernel and then bisect all
> commits to the kvm related parts.
> 
> Now my question is how to best bisect only kvm related changes (those that
> go into kvm-kmod)?

I haven't forgotten this.  Sorry. :(

Unfortunately I'll be away for three weeks, but I'll make it a priority
when I'm back.

Paolo

Re: [PATCH v2 12/15] KVM: arm64: sync LPI configuration and pending tables

2015-08-14 Thread Eric Auger

On 08/14/2015 01:58 PM, Eric Auger wrote:
> On 07/10/2015 04:21 PM, Andre Przywara wrote:
>> The LPI configuration and pending tables of the GICv3 LPIs are held
>> in tables in (guest) memory. To achieve reasonable performance, we
>> cache this data in our own data structures, so we need to sync those
>> two views from time to time. This behaviour is well described in the
>> GICv3 spec and is also exercised by hardware, so the sync points are
>> well known.
>>
>> Provide functions that read the guest memory and store the
>> information from the configuration and pending tables in the kernel.
>>
>> Signed-off-by: Andre Przywara 
>> ---
> would help to have change log between v1 -> v2 (valid for the whole series)
>>  include/kvm/arm_vgic.h  |   2 +
>>  virt/kvm/arm/its-emul.c | 124 
>> 
>>  virt/kvm/arm/its-emul.h |   3 ++
>>  3 files changed, 129 insertions(+)
>>
>> diff --git a/include/kvm/arm_vgic.h b/include/kvm/arm_vgic.h
>> index 2a67a10..323c33a 100644
>> --- a/include/kvm/arm_vgic.h
>> +++ b/include/kvm/arm_vgic.h
>> @@ -167,6 +167,8 @@ struct vgic_its {
>>  int cwriter;
>>  struct list_headdevice_list;
>>  struct list_headcollection_list;
>> +/* memory used for buffering guest's memory */
>> +void*buffer_page;
>>  };
>>  
>>  struct vgic_dist {
>> diff --git a/virt/kvm/arm/its-emul.c b/virt/kvm/arm/its-emul.c
>> index b9c40d7..05245cb 100644
>> --- a/virt/kvm/arm/its-emul.c
>> +++ b/virt/kvm/arm/its-emul.c
>> @@ -50,6 +50,7 @@ struct its_itte {
>>  struct its_collection *collection;
>>  u32 lpi;
>>  u32 event_id;
>> +u8 priority;
>>  bool enabled;
>>  unsigned long *pending;
>>  };
>> @@ -70,8 +71,124 @@ static struct its_itte *find_itte_by_lpi(struct kvm 
>> *kvm, int lpi)
>>  return NULL;
>>  }
>>  
>> +#define LPI_PROP_ENABLE_BIT(p)  ((p) & LPI_PROP_ENABLED)
>> +#define LPI_PROP_PRIORITY(p)((p) & 0xfc)
>> +
>> +/* stores the priority and enable bit for a given LPI */
>> +static void update_lpi_config(struct kvm *kvm, struct its_itte *itte, u8 
>> prop)
>> +{
>> +itte->priority = LPI_PROP_PRIORITY(prop);
>> +itte->enabled  = LPI_PROP_ENABLE_BIT(prop);
>> +}
>> +
>> +#define GIC_LPI_OFFSET 8192
>> +
>> +/* We scan the table in chunks the size of the smallest page size */
> 4kB chunks?
>> +#define CHUNK_SIZE 4096U
>> +
>>  #define BASER_BASE_ADDRESS(x) ((x) & 0xf000ULL)
>>  
>> +static int nr_idbits_propbase(u64 propbaser)
>> +{
>> +int nr_idbits = (1U << (propbaser & 0x1f)) + 1;
>> +
>> +return max(nr_idbits, INTERRUPT_ID_BITS_ITS);
>> +}
>> +
>> +/*
>> + * Scan the whole LPI configuration table and put the LPI configuration
>> + * data in our own data structures. This relies on the LPI being
>> + * mapped before.
>> + */
>> +static bool its_update_lpis_configuration(struct kvm *kvm)
>> +{
>> +struct vgic_dist *dist = &kvm->arch.vgic;
>> +u8 *prop = dist->its.buffer_page;
>> +u32 tsize;
>> +gpa_t propbase;
>> +int lpi = GIC_LPI_OFFSET;
>> +struct its_itte *itte;
>> +struct its_device *device;
>> +int ret;
>> +
>> +propbase = BASER_BASE_ADDRESS(dist->propbaser);
>> +tsize = nr_idbits_propbase(dist->propbaser);
>> +
>> +while (tsize > 0) {
>> +int chunksize = min(tsize, CHUNK_SIZE);
>> +
>> +ret = kvm_read_guest(kvm, propbase, prop, chunksize);
> I think you still have the spin_lock issue  since if my understanding is
> correct this is called from
> vgic_handle_mmio_access/vcall_range_handler/gic_enable_lpis
> where vgic_handle_mmio_access. Or does it take another path?
> 
> Shouldn't we create a new kvm_io_device to avoid holding the dist lock?

Sorry, I forgot that this is already the case. But currently we always
register the same io ops (the registration entry point being
vgic_register_kvm_io_dev), and maybe we should have separate dispatcher
functions for dist, redist and its?

Eric
> 
> Eric
>> +if (ret)
>> +return false;
>> +
>> +spin_lock(&dist->its.lock);
>> +/*
>> + * Updating the status for all allocated LPIs. We catch
>> + * those LPIs that get disabled. We really don't care
>> + * about unmapped LPIs, as they need to be updated
>> + * later manually anyway once they get mapped.
>> + */
>> +for_each_lpi(device, itte, kvm) {
>> +if (itte->lpi < lpi || itte->lpi >= lpi + chunksize)
>> +continue;
>> +
>> +update_lpi_config(kvm, itte, prop[itte->lpi - lpi]);
>> +}
>> +spin_unlock(&dist->its.lock);
>> +tsize -= chunksize;
>> +lpi += chunksize;
>> +propbase += chunksize;
>> +}
>> +
>> +return true;
>> +}
>> +
>> +/*
>> + * Scan the whole LPI pending table and sync the pending bit in there
>> + * wi

RE: [PATCH v3 2/3] Detect vGIC presence at runtime

2015-08-14 Thread Pavel Fedin

 Hello! Thank you for quick response.

> This is fairly unreadable. Please use a switch statement instead.

 Christoffer disliked it in v1, so i thought a bit and changed it. Ok, will 
change it back.

> And here, we're going to assume that the arch timer still usable. We
> definitely need a way to *prevent* the timer to be used when there is no
> GIC. Otherwise, we're going to start trying to setup the mapping for the
> active state, and the guest may start poking it.

But, this seems to be already done, isn't it?
According to http://lxr.free-electrons.com/source/arch/arm/kvm/arm.c#L439:
--- cut ---
459 /*
460  * Enable the arch timers only if we have an in-kernel VGIC
461  * and it has been properly initialized, since we cannot handle
462  * interrupts from the virtual timer with a userspace gic.
463  */
464 if (irqchip_in_kernel(kvm) && vgic_initialized(kvm))
465 kvm_timer_enable(kvm);
--- cut ---
 
 Without kvm->arch.timer.enabled set to 1 by kvm_timer_enable(), the VM
context save/restore code will not actually touch timer registers. Therefore
the host part of the code will not do anything.
 As to the guest itself, only userspace can stop it from accessing timer
registers. My experimental qemu does this by removing the generic timer node
from the guest's device tree. Virtual timer access simply cannot be trapped,
otherwise there would be no problem at all. But, OK, even if the guest
programs the timer, we will just see "Unexpected IRQ 27" on the console, and
the guest will not work, so it's not terribly fatal.

 You know, i actually looked at it before posting v3. I tried to omit the
kvm_timer_hyp_init() call too, and got lots of crashes because:
1. kvm_timer_init() is called unconditionally
2. qemu does some initialization of timer registers unconditionally using
ioctl, and they end up in kvm_arm_timer_set_reg()
 Both of these points end up in kvm_phys_timer_read(), which dereferences
timecounter == NULL.
 Well, i could make kvm_phys_timer_read() just return 0 in this case, but
this could mis-trigger kvm_timer_should_fire() in some circumstances. I would
have to patch it too... At this point i decided to stop, because the result
is perhaps not worth the effort and amount of patching.

 While writing this message i was walking through this code once again,
and... I have a suggestion. Actually, if we are really paranoid, we could be
afraid of kvm_vgic_inject_irq() being called, which would do some weird
things without vGIC. It is possible to add a check for
kvm->arch.timer.enabled in kvm_timer_sync_hwstate() and
kvm_timer_flush_hwstate(). If the timer is disabled, those functions will
simply return, doing nothing. This would guarantee that interrupt injection
is never attempted.

 What do you think?
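
The suggested guard can be modelled outside the kernel as follows. This is
a toy model only: struct timer_model and the model_* functions are
stand-ins invented for illustration, with the "enabled" field mirroring
kvm->arch.timer.enabled and a counter replacing the real
kvm_vgic_inject_irq() call. It is not the actual arch timer code.

```c
#include <stdbool.h>

/* Toy stand-in for the per-VM timer state discussed above. */
struct timer_model {
    bool enabled;     /* mirrors kvm->arch.timer.enabled */
    bool should_fire; /* stand-in for kvm_timer_should_fire() */
    int injected;     /* counts calls that would reach the vGIC */
};

/* Stand-in for kvm_vgic_inject_irq(): only records the attempt. */
static void model_inject_irq(struct timer_model *t)
{
    t->injected++;
}

/* Flush path with the proposed early return. */
static void model_timer_flush_hwstate(struct timer_model *t)
{
    if (!t->enabled) {
        return; /* no vGIC, so never attempt injection */
    }
    if (t->should_fire) {
        model_inject_irq(t);
    }
}

/* Sync path with the same guard. */
static void model_timer_sync_hwstate(struct timer_model *t)
{
    if (!t->enabled) {
        return;
    }
    if (t->should_fire) {
        model_inject_irq(t);
    }
}
```

With the guard in place, a disabled timer never reaches the injection
stand-in, regardless of what should_fire reports.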

 And some more. Actually, it is possible to emulate the generic timer in
userspace, just not the virtual one. IIRC access to the physical timer can be
trapped. So, if we modify the guest's device tree by removing the virtual
timer IRQ, the guest will fall back to the physical timer. And this will be
caught by the hypervisor. After this all we have to do is to add a
corresponding exit code which would allow userspace to emulate the missing
CP15 (or system in case of ARM64) registers. So, this timer issue is not
grave, i am just postponing its implementation until the GIC issues are
settled.

Kind regards,
Pavel Fedin
Expert Engineer
Samsung Electronics Research center Russia



Re: [PATCH v2 11/15] KVM: arm64: handle pending bit for LPIs in ITS emulation

2015-08-14 Thread Eric Auger

On 07/10/2015 04:21 PM, Andre Przywara wrote:
> As the actual LPI number in a guest can be quite high, but is mostly
> assigned using a very sparse allocation scheme, bitmaps and arrays
> for storing the virtual interrupt status are a waste of memory.
> We use our equivalent of the "Interrupt Translation Table Entry"
> (ITTE) to hold this extra status information for a virtual LPI.
> As the normal VGIC code cannot use its fancy bitmaps to manage
> pending interrupts, we provide a hook in the VGIC code to let the
> ITS emulation handle the list register queueing itself.
> LPIs are located in a separate number range (>=8192), so
> distinguishing them is easy. With LPIs being only edge-triggered, we
> get away with a less complex IRQ handling.
> 
> Signed-off-by: Andre Przywara 
> ---
>  include/kvm/arm_vgic.h  |  2 ++
>  virt/kvm/arm/its-emul.c | 71 
>  virt/kvm/arm/its-emul.h |  3 ++
>  virt/kvm/arm/vgic-v3-emul.c |  2 ++
>  virt/kvm/arm/vgic.c | 72 
> ++---
>  5 files changed, 133 insertions(+), 17 deletions(-)
> 
> diff --git a/include/kvm/arm_vgic.h b/include/kvm/arm_vgic.h
> index 1648668..2a67a10 100644
> --- a/include/kvm/arm_vgic.h
> +++ b/include/kvm/arm_vgic.h
> @@ -147,6 +147,8 @@ struct vgic_vm_ops {
>   int (*init_model)(struct kvm *);
>   void(*destroy_model)(struct kvm *);
>   int (*map_resources)(struct kvm *, const struct vgic_params *);
> + bool(*queue_lpis)(struct kvm_vcpu *);
> + void(*unqueue_lpi)(struct kvm_vcpu *, int irq);
>  };
>  
>  struct vgic_io_device {
> diff --git a/virt/kvm/arm/its-emul.c b/virt/kvm/arm/its-emul.c
> index 7f217fa..b9c40d7 100644
> --- a/virt/kvm/arm/its-emul.c
> +++ b/virt/kvm/arm/its-emul.c
> @@ -50,8 +50,26 @@ struct its_itte {
>   struct its_collection *collection;
>   u32 lpi;
>   u32 event_id;
> + bool enabled;
> + unsigned long *pending;
>  };
>  
> +#define for_each_lpi(dev, itte, kvm) \
> + list_for_each_entry(dev, &(kvm)->arch.vgic.its.device_list, dev_list) \
> + list_for_each_entry(itte, &(dev)->itt, itte_list)
> +
You have a checkpatch error here:

ERROR: Macros with complex values should be enclosed in parentheses
#52: FILE: virt/kvm/arm/its-emul.c:57:
+#define for_each_lpi(dev, itte, kvm) \
+   list_for_each_entry(dev, &(kvm)->arch.vgic.its.device_list, dev_list) \
+   list_for_each_entry(itte, &(dev)->itt, itte_list)

> +static struct its_itte *find_itte_by_lpi(struct kvm *kvm, int lpi)
> +{
can't we have the same LPI present in different interrupt translation
tables? I don't know whether that is a sensible setting, but I could not
find anything saying it is not possible.
> + struct its_device *device;
> + struct its_itte *itte;
> +
> + for_each_lpi(device, itte, kvm) {
> + if (itte->lpi == lpi)
> + return itte;
> + }
> + return NULL;
> +}
> +
>  #define BASER_BASE_ADDRESS(x) ((x) & 0xf000ULL)
>  
>  /* The distributor lock is held by the VGIC MMIO handler. */
> @@ -145,6 +163,59 @@ static bool handle_mmio_gits_idregs(struct kvm_vcpu *vcpu,
>   return false;
>  }
>  
> +/*
> + * Find all enabled and pending LPIs and queue them into the list
> + * registers.
> + * The dist lock is held by the caller.
> + */
> +bool vits_queue_lpis(struct kvm_vcpu *vcpu)
> +{
> + struct vgic_its *its = &vcpu->kvm->arch.vgic.its;
> + struct its_device *device;
> + struct its_itte *itte;
> + bool ret = true;
> +
> + if (!vgic_has_its(vcpu->kvm))
> + return true;
> + if (!its->enabled || !vcpu->kvm->arch.vgic.lpis_enabled)
> + return true;
> +
> + spin_lock(&its->lock);
> + for_each_lpi(device, itte, vcpu->kvm) {
> + if (!itte->enabled || !test_bit(vcpu->vcpu_id, itte->pending))
> + continue;
> +
> + if (!itte->collection)
> + continue;
> +
> + if (itte->collection->target_addr != vcpu->vcpu_id)
> + continue;
> +
> + __clear_bit(vcpu->vcpu_id, itte->pending);
> +
> + ret &= vgic_queue_irq(vcpu, 0, itte->lpi);
What if vgic_queue_irq() fails because no LR can be found? itte->pending
was already cleared, so we forget that LPI. Shouldn't we restore the
pending state in the ITT? In vgic_queue_hwirq the state change is only
performed if vgic_queue_irq() succeeds.
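To make the suggestion concrete, here is a minimal userspace sketch of "clear the pending bit only once queueing succeeded". The struct and the model_* helpers are hypothetical stand-ins, assuming only that vgic_queue_irq() returns false when no list register is free; this is not the vgic code.

```c
#include <stdbool.h>

/* Hypothetical model: one pending bitmap word and a limited pool of list registers. */
struct lpi_model {
	unsigned long pending;	/* one bit per VCPU, like itte->pending */
	int free_lrs;		/* list registers still available */
};

/* Stand-in for vgic_queue_irq(): fails when no list register is free. */
static bool model_queue_irq(struct lpi_model *m)
{
	if (m->free_lrs == 0)
		return false;
	m->free_lrs--;
	return true;
}

/* Clear the pending bit only after the interrupt was actually queued. */
static bool model_queue_lpi(struct lpi_model *m, int vcpu_id)
{
	unsigned long bit = 1UL << vcpu_id;

	if (!(m->pending & bit))
		return true;		/* nothing to do */
	if (!model_queue_irq(m))
		return false;		/* still pending, retry later */
	m->pending &= ~bit;
	return true;
}
```

With this ordering a failed attempt leaves the LPI pending, so the next pass over the ITTs picks it up again.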
> + }
> +
> + spin_unlock(&its->lock);
> + return ret;
> +}
> +
> +/* Called with the distributor lock held by the caller. */
> +void vits_unqueue_lpi(struct kvm_vcpu *vcpu, int lpi)
I was a bit confused by the name of the function, with regard to
existing vgic_unqueue_irqs which restores the states in accordance to
what we have in LR. Wouldn't it make sense to call it
vits_lpi_set_pending(vcpu, lpi) or something that looks more similar to
vgic_dist_irq_set_pending setter which I think it mi

Re: [PATCH v2 12/15] KVM: arm64: sync LPI configuration and pending tables

2015-08-14 Thread Eric Auger

On 07/10/2015 04:21 PM, Andre Przywara wrote:
> The LPI configuration and pending tables of the GICv3 LPIs are held
> in tables in (guest) memory. To achieve reasonable performance, we
> cache this data in our own data structures, so we need to sync those
> two views from time to time. This behaviour is well described in the
> GICv3 spec and is also exercised by hardware, so the sync points are
> well known.
> 
> Provide functions that read the guest memory and store the
> information from the configuration and pending tables in the kernel.
> 
> Signed-off-by: Andre Przywara 
> ---
It would help to have a change log from v1 to v2 (this applies to the whole series).
>  include/kvm/arm_vgic.h  |   2 +
>  virt/kvm/arm/its-emul.c | 124 
> 
>  virt/kvm/arm/its-emul.h |   3 ++
>  3 files changed, 129 insertions(+)
> 
> diff --git a/include/kvm/arm_vgic.h b/include/kvm/arm_vgic.h
> index 2a67a10..323c33a 100644
> --- a/include/kvm/arm_vgic.h
> +++ b/include/kvm/arm_vgic.h
> @@ -167,6 +167,8 @@ struct vgic_its {
>   int cwriter;
>   struct list_headdevice_list;
>   struct list_headcollection_list;
> + /* memory used for buffering guest's memory */
> + void*buffer_page;
>  };
>  
>  struct vgic_dist {
> diff --git a/virt/kvm/arm/its-emul.c b/virt/kvm/arm/its-emul.c
> index b9c40d7..05245cb 100644
> --- a/virt/kvm/arm/its-emul.c
> +++ b/virt/kvm/arm/its-emul.c
> @@ -50,6 +50,7 @@ struct its_itte {
>   struct its_collection *collection;
>   u32 lpi;
>   u32 event_id;
> + u8 priority;
>   bool enabled;
>   unsigned long *pending;
>  };
> @@ -70,8 +71,124 @@ static struct its_itte *find_itte_by_lpi(struct kvm *kvm, int lpi)
>   return NULL;
>  }
>  
> +#define LPI_PROP_ENABLE_BIT(p)   ((p) & LPI_PROP_ENABLED)
> +#define LPI_PROP_PRIORITY(p) ((p) & 0xfc)
> +
> +/* stores the priority and enable bit for a given LPI */
> +static void update_lpi_config(struct kvm *kvm, struct its_itte *itte, u8 prop)
> +{
> + itte->priority = LPI_PROP_PRIORITY(prop);
> + itte->enabled  = LPI_PROP_ENABLE_BIT(prop);
> +}
> +
> +#define GIC_LPI_OFFSET 8192
> +
> +/* We scan the table in chunks the size of the smallest page size */
4kB chunks?
> +#define CHUNK_SIZE 4096U
> +
>  #define BASER_BASE_ADDRESS(x) ((x) & 0xf000ULL)
>  
> +static int nr_idbits_propbase(u64 propbaser)
> +{
> + int nr_idbits = (1U << (propbaser & 0x1f)) + 1;
> +
> + return max(nr_idbits, INTERRUPT_ID_BITS_ITS);
> +}
> +
> +/*
> + * Scan the whole LPI configuration table and put the LPI configuration
> + * data in our own data structures. This relies on the LPI being
> + * mapped before.
> + */
> +static bool its_update_lpis_configuration(struct kvm *kvm)
> +{
> + struct vgic_dist *dist = &kvm->arch.vgic;
> + u8 *prop = dist->its.buffer_page;
> + u32 tsize;
> + gpa_t propbase;
> + int lpi = GIC_LPI_OFFSET;
> + struct its_itte *itte;
> + struct its_device *device;
> + int ret;
> +
> + propbase = BASER_BASE_ADDRESS(dist->propbaser);
> + tsize = nr_idbits_propbase(dist->propbaser);
> +
> + while (tsize > 0) {
> + int chunksize = min(tsize, CHUNK_SIZE);
> +
> + ret = kvm_read_guest(kvm, propbase, prop, chunksize);
I think you still have the spin_lock issue, since, if my understanding is
correct, this is called from
vgic_handle_mmio_access/vcall_range_handler/gic_enable_lpis,
where vgic_handle_mmio_access holds the dist lock. Or does it take another path?

Shouldn't we create a new kvm_io_device to avoid holding the dist lock?

Eric
> + if (ret)
> + return false;
> +
> + spin_lock(&dist->its.lock);
> + /*
> +  * Updating the status for all allocated LPIs. We catch
> +  * those LPIs that get disabled. We really don't care
> +  * about unmapped LPIs, as they need to be updated
> +  * later manually anyway once they get mapped.
> +  */
> + for_each_lpi(device, itte, kvm) {
> + if (itte->lpi < lpi || itte->lpi >= lpi + chunksize)
> + continue;
> +
> + update_lpi_config(kvm, itte, prop[itte->lpi - lpi]);
> + }
> + spin_unlock(&dist->its.lock);
> + tsize -= chunksize;
> + lpi += chunksize;
> + propbase += chunksize;
> + }
> +
> + return true;
> +}
> +
> +/*
> + * Scan the whole LPI pending table and sync the pending bit in there
> + * with our own data structures. This relies on the LPI being
> + * mapped before.
> + */
> +static bool its_sync_lpi_pending_table(struct kvm_vcpu *vcpu)
> +{
> + struct vgic_dist *dist = &vcpu->kvm->arch.vgic;
> + unsigned long *pendmask = dist->its.buffer_page;
> + u32 nr_lpis = VITS_NR_LPIS;
> + gpa_t pendbase;
> + int lpi = 0;
> + struct its_itte *i

Re: [PATCH v3 2/3] Detect vGIC presence at runtime

2015-08-14 Thread Marc Zyngier

On 05/08/15 11:53, Pavel Fedin wrote:
> Before commit 662d9715840aef44dcb573b0f9fab9e8319c868a it was possible to
> compile the kernel without vGIC and vTimer support. The commit message
> mentions the possibility of detecting vGIC support at runtime, but this has
> never been implemented.
> 
> This patch introduces a runtime check, restoring the lost functionality. It
> again allows KVM to be used on hardware without vGIC. The interrupt controller
> has to be emulated in userspace in this case.
> 
> -ENODEV return code from probe function means there's no GIC at all.
> -ENXIO happens when, for example, there is GIC node in the device tree,
> but it does not specify vGIC resources. Normally this means that vGIC
> hardware is defunct. Any other error code is still treated as full stop
> because it might mean some really serious problems.
> 
> This patch does not touch any virtual timer code, suggesting that timer

And that's a problem, see below.

> hardware is actually in place. Normally on boards in question it is true,
> however since vGIC is missing, it is impossible to correctly utilize
> interrupts from the virtual timer. Since virtual timer handling is in
> active redevelopment now, handling it in userspace is out of scope at
> the moment. The guest is currently suggested to use some memory-mapped
> timer which can be emulated in userspace.
> 
> Signed-off-by: Pavel Fedin 
> ---
>  arch/arm/kvm/arm.c | 17 +++--
>  1 file changed, 15 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
> index 199a50a..1039161 100644
> --- a/arch/arm/kvm/arm.c
> +++ b/arch/arm/kvm/arm.c
> @@ -61,6 +61,8 @@ static atomic64_t kvm_vmid_gen = ATOMIC64_INIT(1);
>  static u8 kvm_next_vmid;
>  static DEFINE_SPINLOCK(kvm_vmid_lock);
>  
> +static bool vgic_present;
> +
>  static void kvm_arm_set_running_vcpu(struct kvm_vcpu *vcpu)
>  {
>   BUG_ON(preemptible());
> @@ -131,7 +133,8 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
>   kvm->arch.vmid_gen = 0;
>  
>   /* The maximum number of VCPUs is limited by the host's GIC model */
> - kvm->arch.max_vcpus = kvm_vgic_get_max_vcpus();
> + kvm->arch.max_vcpus = vgic_present ?
> + kvm_vgic_get_max_vcpus() : KVM_MAX_VCPUS;
>  
>   return ret;
>  out_free_stage2_pgd:
> @@ -171,6 +174,8 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>   int r;
>   switch (ext) {
>   case KVM_CAP_IRQCHIP:
> + r = vgic_present;
> + break;
>   case KVM_CAP_IOEVENTFD:
>   case KVM_CAP_DEVICE_CTRL:
>   case KVM_CAP_USER_MEMORY:
> @@ -849,6 +854,8 @@ static int kvm_vm_ioctl_set_device_addr(struct kvm *kvm,
>  
>   switch (dev_id) {
>   case KVM_ARM_DEVICE_VGIC_V2:
> + if (!vgic_present)
> + return -ENXIO;
>   return kvm_vgic_addr(kvm, type, &dev_addr->addr, true);
>   default:
>   return -ENODEV;
> @@ -863,6 +870,8 @@ long kvm_arch_vm_ioctl(struct file *filp,
>  
>   switch (ioctl) {
>   case KVM_CREATE_IRQCHIP: {
> + if (!vgic_present)
> + return -ENXIO;
>   return kvm_vgic_create(kvm, KVM_DEV_TYPE_ARM_VGIC_V2);
>   }
>   case KVM_ARM_SET_DEVICE_ADDR: {
> @@ -1045,8 +1054,12 @@ static int init_hyp_mode(void)
>* Init HYP view of VGIC
>*/
>   err = kvm_vgic_hyp_init();
> - if (err)
> + if (err == -ENODEV || err == -ENXIO)
> + vgic_present = false;

Which is the default value, isn't it?

> + else if (err)
>   goto out_free_context;
> + else
> + vgic_present = true;

This is fairly unreadable. Please use a switch statement instead.
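Something along these lines, as a rough standalone sketch. classify_vgic_init() is a hypothetical helper for illustration; in the patch the switch would sit directly in init_hyp_mode() and the default case would jump to out_free_context.

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical helper mirroring the if/else chain in the patch.
 * Returns 0 when init may continue and stores whether a vGIC is usable;
 * any other error is passed back to the caller as fatal.
 */
static int classify_vgic_init(int err, bool *vgic_present)
{
	switch (err) {
	case 0:
		*vgic_present = true;
		return 0;
	case -ENODEV:	/* no GIC at all */
	case -ENXIO:	/* GIC node without vGIC resources */
		*vgic_present = false;
		return 0;
	default:
		return err;	/* treated as a full stop */
	}
}
```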

>  
>   /*
>* Init HYP architected timer support
> 

And here, we're going to assume that the arch timer is still usable. We
definitely need a way to *prevent* the timer from being used when there is no
GIC. Otherwise, we're going to start trying to set up the mapping for the
active state, and the guest may start poking it.

Timer and GIC are really tied to each other. If you start making one
optional, you need to carry on working the dependency chain.

Thanks,

M.
-- 
Jazz is not dead. It just smells funny...

Help debugging a regression in KVM Module

2015-08-14 Thread Peter Lieven

Hi,

Some time ago I stumbled across a regression in the KVM module that was
introduced somewhere between 3.17 and 3.19.

I have a rather old openSUSE guest with an XFS filesystem which reliably
crashes after some live migrations. I originally believed that the issue
might be related to my setup with a 3.12 host kernel and kvm-kmod 3.19,
but I have now found that it is also present with a 3.19 host kernel and
its included 3.19 kvm module.

My idea was to continue testing on a 3.12 host kernel and then bisect all
commits to the kvm-related parts.

Now my question is: how do I best bisect only kvm-related changes (those
that go into kvm-kmod)?

Thanks,
Peter


[Bug 102651] vcpuX unhandled rdmsr: 0x570

2015-08-14 Thread bugzilla-daemon

https://bugzilla.kernel.org/show_bug.cgi?id=102651

Huaitong Han  changed:

           What    |Removed    |Added
----------------------------------------
                 CC|           |oen...@gmail.com

--- Comment #2 from Huaitong Han  ---
It's just a warning. Current KVM does not support the Intel PT feature, so you
can ignore it; I will fix the warning in the native kernel.

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

[PATCH v3 0/5] KVM: optimize userspace exits with a new ioctl

2015-08-14 Thread Radim Krčmář

v3:
 * acked by Christian [1/5]
 * use ioctl argument directly (unsigned long as flags) [4/5]
 * precisely #ifdef arch-specific ioctls [5/5]
v2:
 * move request_exits debug counter patch right after introduction of
   KVM_REQ_EXIT [3/5]
 * use vcpu ioctl instead of vm one [4/5]
 * shrink kvm_user_exit from 64 to 32 bytes [4/5]
 * new [5/5]

QEMU uses SIGUSR1 to force a userspace exit and also to queue an early
exit before calling VCPU_RUN -- the signal is blocked in user space and
temporarily unblocked in VCPU_RUN.
The temporary unblocking by sigprocmask() in kvm_arch_vcpu_ioctl_run()
takes a shared siglock, which leads to cacheline bouncing in NUMA
systems.

This series allows the same with a new request bit and a VCPU ioctl that
marks and kicks the target VCPU, hence no need to unblock.

inl_from_{pmtimer,qemu} vmexit benchmark from kvm-unit-tests shows ~5%
speedup for 1-4 VCPUs (300-2000 saved cycles) without noticeably
regressing kernel VM exits.
(Paolo did a quick run of an older version of this series on a NUMA system
 and the speedup was around 35% when utilizing more nodes.)

Radim Krčmář (5):
  KVM: add kvm_has_request wrapper
  KVM: add KVM_REQ_EXIT request for userspace exit
  KVM: x86: add request_exits debug counter
  KVM: add KVM_USER_EXIT vcpu ioctl for userspace exit
  KVM: refactor asynchronous vcpu ioctl dispatch

 Documentation/virtual/kvm/api.txt | 25 +
 arch/x86/include/asm/kvm_host.h   |  1 +
 arch/x86/kvm/vmx.c|  4 ++--
 arch/x86/kvm/x86.c| 23 +++
 include/linux/kvm_host.h  | 15 +--
 include/uapi/linux/kvm.h  |  4 
 virt/kvm/kvm_main.c   | 15 ++-
 7 files changed, 78 insertions(+), 9 deletions(-)

-- 
2.5.0


[PATCH v3 1/5] KVM: add kvm_has_request wrapper

2015-08-14 Thread Radim Krčmář

We want to have requests abstracted from bit operations.

Acked-by: Christian Borntraeger 
Signed-off-by: Radim Krčmář 
---
 v3: acked by Christian

 arch/x86/kvm/vmx.c   | 2 +-
 include/linux/kvm_host.h | 7 ++-
 2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 4cf25b90dbe0..40c6180a0ecb 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -5809,7 +5809,7 @@ static int handle_invalid_guest_state(struct kvm_vcpu *vcpu)
if (intr_window_requested && vmx_interrupt_allowed(vcpu))
return handle_interrupt_window(&vmx->vcpu);
 
-   if (test_bit(KVM_REQ_EVENT, &vcpu->requests))
+   if (kvm_has_request(KVM_REQ_EVENT, vcpu))
return 1;
 
err = emulate_instruction(vcpu, EMULTYPE_NO_REEXECUTE);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 27ccdf91a465..52e388367a26 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1089,9 +1089,14 @@ static inline void kvm_make_request(int req, struct kvm_vcpu *vcpu)
set_bit(req, &vcpu->requests);
 }
 
+static inline bool kvm_has_request(int req, struct kvm_vcpu *vcpu)
+{
+   return test_bit(req, &vcpu->requests);
+}
+
 static inline bool kvm_check_request(int req, struct kvm_vcpu *vcpu)
 {
-   if (test_bit(req, &vcpu->requests)) {
+   if (kvm_has_request(req, vcpu)) {
clear_bit(req, &vcpu->requests);
return true;
} else {
-- 
2.5.0
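
As an illustration of the difference between the two accessors touched by this patch, here is a small userspace model using C11 atomics. The struct and the model_* names are hypothetical; the kernel versions use set_bit(), test_bit() and clear_bit() on vcpu->requests.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace model of vcpu->requests as an atomic bitmask. */
struct vcpu_model {
	atomic_ulong requests;
};

static void model_make_request(int req, struct vcpu_model *v)
{
	atomic_fetch_or(&v->requests, 1UL << req);
}

/* Like kvm_has_request(): test only, the bit stays set. */
static bool model_has_request(int req, struct vcpu_model *v)
{
	return (atomic_load(&v->requests) & (1UL << req)) != 0;
}

/* Like kvm_check_request(): test and consume the bit. */
static bool model_check_request(int req, struct vcpu_model *v)
{
	if (!model_has_request(req, v))
		return false;
	atomic_fetch_and(&v->requests, ~(1UL << req));
	return true;
}
```

Calling model_has_request() twice in a row gives the same answer, while model_check_request() returns true only once per request.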


[PATCH v3 4/5] KVM: add KVM_USER_EXIT vcpu ioctl for userspace exit

2015-08-14 Thread Radim Krčmář

Userspace can use KVM_USER_EXIT instead of signal-based exiting to
userspace.  Availability depends on KVM_CAP_USER_EXIT.
Only x86 is implemented so far.

Signed-off-by: Radim Krčmář 
---
 v3:
  * use ioctl argument directly (unsigned long as flags) [Paolo]
 v2:
  * use vcpu ioctl instead of vm one [Paolo]
  * shrink kvm_user_exit from 64 to 32 bytes

 Documentation/virtual/kvm/api.txt | 25 +
 arch/x86/kvm/x86.c| 15 +++
 include/uapi/linux/kvm.h  |  3 +++
 virt/kvm/kvm_main.c   |  5 +++--
 4 files changed, 46 insertions(+), 2 deletions(-)

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index 3c714d43a717..df087ff3c5b6 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -3020,6 +3020,31 @@ Returns: 0 on success, -1 on error
 
 Queues an SMI on the thread's vcpu.
 
+
+4.97 KVM_USER_EXIT
+
+Capability: KVM_CAP_USER_EXIT
+Architectures: x86
+Type: vcpu ioctl
+Parameters: unsigned long flags (in)
+Returns: 0 on success,
+ -EINVAL if flags is not 0
+
+The ioctl is asynchronous to VCPU execution and can be issued from all threads.
+
+Make vcpu_id exit to userspace as soon as possible.  If the VCPU is not running
+in kernel at the time, it will exit early on the next call to KVM_RUN.
+If the VCPU was going to exit because of other reasons when KVM_USER_EXIT was
+issued, it will keep the original exit reason without exiting early on next
+KVM_RUN.
+If VCPU exited because of KVM_USER_EXIT, the exit reason is KVM_EXIT_REQUEST.
+
+This ioctl has very similar effect (same sans some races on userspace exit) as
+sending a signal (that is blocked in userspace and set in KVM_SET_SIGNAL_MASK)
+to the VCPU thread.
+
+
+
 5. The kvm_run structure
 
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 37db1b32684a..d985806b17b1 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2467,6 +2467,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
case KVM_CAP_ASSIGN_DEV_IRQ:
case KVM_CAP_PCI_2_3:
 #endif
+   case KVM_CAP_USER_EXIT:
r = 1;
break;
case KVM_CAP_X86_SMM:
@@ -3078,6 +3079,17 @@ static int kvm_set_guest_paused(struct kvm_vcpu *vcpu)
return 0;
 }
 
+static int kvm_vcpu_ioctl_user_exit(struct kvm_vcpu *vcpu, unsigned long flags)
+{
+   if (flags != 0)
+   return -EINVAL;
+
+   kvm_make_request(KVM_REQ_EXIT, vcpu);
+   kvm_vcpu_kick(vcpu);
+
+   return 0;
+}
+
 long kvm_arch_vcpu_ioctl(struct file *filp,
 unsigned int ioctl, unsigned long arg)
 {
@@ -3342,6 +3354,9 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
r = kvm_set_guest_paused(vcpu);
goto out;
}
+   case KVM_USER_EXIT:
+   r = kvm_vcpu_ioctl_user_exit(vcpu, arg);
+   break;
default:
r = -EINVAL;
}
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index d996a7cdb4d2..58b3a07adc81 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -826,6 +826,7 @@ struct kvm_ppc_smmu_info {
 #define KVM_CAP_X86_SMM 117
 #define KVM_CAP_MULTI_ADDRESS_SPACE 118
 #define KVM_CAP_SPLIT_IRQCHIP 119
+#define KVM_CAP_USER_EXIT 120
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -1213,6 +1214,8 @@ struct kvm_s390_ucas_mapping {
 #define KVM_S390_GET_IRQ_STATE   _IOW(KVMIO, 0xb6, struct kvm_s390_irq_state)
 /* Available with KVM_CAP_X86_SMM */
 #define KVM_SMI   _IO(KVMIO,   0xb7)
+/* Available with KVM_CAP_USER_EXIT */
+#define KVM_USER_EXIT _IO(KVMIO,   0xb8)
 
 #define KVM_DEV_ASSIGN_ENABLE_IOMMU(1 << 0)
 #define KVM_DEV_ASSIGN_PCI_2_3 (1 << 1)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 347899966178..dfa2d5f27713 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2251,15 +2251,16 @@ static long kvm_vcpu_ioctl(struct file *filp,
if (unlikely(_IOC_TYPE(ioctl) != KVMIO))
return -EINVAL;
 
-#if defined(CONFIG_S390) || defined(CONFIG_PPC) || defined(CONFIG_MIPS)
/*
 * Special cases: vcpu ioctls that are asynchronous to vcpu execution,
 * so vcpu_load() would break it.
 */
+#if defined(CONFIG_S390) || defined(CONFIG_PPC) || defined(CONFIG_MIPS)
if (ioctl == KVM_S390_INTERRUPT || ioctl == KVM_S390_IRQ || ioctl == KVM_INTERRUPT)
return kvm_arch_vcpu_ioctl(filp, ioctl, arg);
 #endif
-
+   if (ioctl == KVM_USER_EXIT)
+   return kvm_arch_vcpu_ioctl(filp, ioctl, arg);
 
r = vcpu_load(vcpu);
if (r)
-- 
2.5.0


[PATCH v3 2/5] KVM: add KVM_REQ_EXIT request for userspace exit

2015-08-14 Thread Radim Krčmář

When userspace wants KVM to exit to userspace, it sends a signal.
This has a disadvantage of requiring a change to the signal mask because
the signal needs to be blocked in userspace to stay pending when sending
to self.

Using a request flag allows us to shave 200-300 cycles from every
userspace exit and the speedup grows with NUMA because unblocking
touches a shared spinlock.

The disadvantage is that it adds an overhead of one bit check for all
kernel exits.  A quick tracing shows that the ratio of userspace exits
after boot is about 1/5, and during a subsequent run of nmap and a kernel
compile it is about 1/60, so the check should not regress global performance.

All signal_pending() calls are userspace exit requests, so we add a
check for KVM_REQ_EXIT there.  There is one omitted call in kvm_vcpu_run
because KVM_REQ_EXIT is implied in earlier check for requests.

Signed-off-by: Radim Krčmář 
---
 arch/x86/kvm/vmx.c   | 2 +-
 arch/x86/kvm/x86.c   | 6 ++
 include/linux/kvm_host.h | 8 +++-
 include/uapi/linux/kvm.h | 1 +
 virt/kvm/kvm_main.c  | 2 +-
 5 files changed, 16 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 40c6180a0ecb..2b789a869ef5 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -5833,7 +5833,7 @@ static int handle_invalid_guest_state(struct kvm_vcpu *vcpu)
goto out;
}
 
-   if (signal_pending(current))
+   if (kvm_need_exit(vcpu))
goto out;
if (need_resched())
schedule();
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index e5850076bf7b..c3df7733af09 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6548,6 +6548,11 @@ static int vcpu_run(struct kvm_vcpu *vcpu)
++vcpu->stat.signal_exits;
break;
}
+   if (unlikely(kvm_has_request(KVM_REQ_EXIT, vcpu))) {
+   r = 0;
+   vcpu->run->exit_reason = KVM_EXIT_REQUEST;
+   break;
+   }
if (need_resched()) {
srcu_read_unlock(&kvm->srcu, vcpu->srcu_idx);
cond_resched();
@@ -6684,6 +6689,7 @@ out:
post_kvm_run_save(vcpu);
if (vcpu->sigset_active)
sigprocmask(SIG_SETMASK, &sigsaved, NULL);
+   clear_bit(KVM_REQ_EXIT, &vcpu->requests);
 
return r;
 }
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 52e388367a26..dcc57171e3ec 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -121,7 +121,7 @@ static inline bool is_error_page(struct page *page)
 #define KVM_REQ_UNHALT 6
 #define KVM_REQ_MMU_SYNC   7
 #define KVM_REQ_CLOCK_UPDATE   8
-#define KVM_REQ_KICK   9
+#define KVM_REQ_EXIT   9
 #define KVM_REQ_DEACTIVATE_FPU10
 #define KVM_REQ_EVENT 11
 #define KVM_REQ_APF_HALT  12
@@ -1104,6 +1104,12 @@ static inline bool kvm_check_request(int req, struct kvm_vcpu *vcpu)
}
 }
 
+static inline bool kvm_need_exit(struct kvm_vcpu *vcpu)
+{
+   return signal_pending(current) ||
+  kvm_has_request(KVM_REQ_EXIT, vcpu);
+}
+
 extern bool kvm_rebooting;
 
 struct kvm_device {
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 26daafbba9ec..d996a7cdb4d2 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -184,6 +184,7 @@ struct kvm_s390_skeys {
 #define KVM_EXIT_SYSTEM_EVENT 24
 #define KVM_EXIT_S390_STSI25
 #define KVM_EXIT_IOAPIC_EOI   26
+#define KVM_EXIT_REQUEST  27
 
 /* For KVM_EXIT_INTERNAL_ERROR */
 /* Emulate instruction failed. */
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d8db2f8fce9c..347899966178 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1914,7 +1914,7 @@ static int kvm_vcpu_check_block(struct kvm_vcpu *vcpu)
}
if (kvm_cpu_has_pending_timer(vcpu))
return -EINTR;
-   if (signal_pending(current))
+   if (kvm_need_exit(vcpu))
return -EINTR;
 
return 0;
-- 
2.5.0
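
To illustrate the intended behaviour, here is a heavily simplified userspace model of the run loop. Everything named model_* or MODEL_* is hypothetical; only the bit number 9 and the exit reason 27 are taken from the patch, and the real loop does far more than count guest entries.

```c
#include <stdbool.h>

enum { MODEL_REQ_EXIT = 9 };	/* same bit number as KVM_REQ_EXIT in the patch */
enum { MODEL_EXIT_NONE = 0, MODEL_EXIT_REQUEST = 27 };

struct run_model {
	unsigned long requests;
	bool signal_pending;	/* stand-in for signal_pending(current) */
	int exit_reason;
	unsigned int guest_entries;
};

/* Like kvm_need_exit(): either source asks for a return to userspace. */
static bool model_need_exit(const struct run_model *m)
{
	return m->signal_pending ||
	       (m->requests & (1UL << MODEL_REQ_EXIT)) != 0;
}

/*
 * Simplified run loop: enter the "guest" until max_entries is reached, but
 * stop early when an exit was requested. The request bit is cleared on the
 * way out, as the patch does on its exit path.
 */
static void model_vcpu_run(struct run_model *m, unsigned int max_entries)
{
	m->exit_reason = MODEL_EXIT_NONE;
	while (m->guest_entries < max_entries) {
		if (m->requests & (1UL << MODEL_REQ_EXIT)) {
			m->exit_reason = MODEL_EXIT_REQUEST;
			break;
		}
		m->guest_entries++;
	}
	m->requests &= ~(1UL << MODEL_REQ_EXIT);
}
```

A request made before the run call makes the model return without entering the guest at all, and the following call runs normally because the bit has been consumed.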


[PATCH v3 3/5] KVM: x86: add request_exits debug counter

2015-08-14 Thread Radim Krčmář

We are still interested in the number of exits userspace requested and
signal_exits doesn't cover that anymore.

Signed-off-by: Radim Krčmář 
---
 v2: move request_exits debug counter patch right after introduction of
 KVM_REQ_EXIT

 arch/x86/include/asm/kvm_host.h | 1 +
 arch/x86/kvm/x86.c  | 2 ++
 2 files changed, 3 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 09acaa64ef8e..95c05a3d02d4 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -729,6 +729,7 @@ struct kvm_vcpu_stat {
u32 hypercalls;
u32 irq_injections;
u32 nmi_injections;
+   u32 request_exits;
 };
 
 struct x86_instruction_info;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c3df7733af09..37db1b32684a 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -145,6 +145,7 @@ struct kvm_stats_debugfs_item debugfs_entries[] = {
{ "io_exits", VCPU_STAT(io_exits) },
{ "mmio_exits", VCPU_STAT(mmio_exits) },
{ "signal_exits", VCPU_STAT(signal_exits) },
+   { "request_exits", VCPU_STAT(request_exits) },
{ "irq_window", VCPU_STAT(irq_window_exits) },
{ "nmi_window", VCPU_STAT(nmi_window_exits) },
{ "halt_exits", VCPU_STAT(halt_exits) },
@@ -6551,6 +6552,7 @@ static int vcpu_run(struct kvm_vcpu *vcpu)
if (unlikely(kvm_has_request(KVM_REQ_EXIT, vcpu))) {
r = 0;
vcpu->run->exit_reason = KVM_EXIT_REQUEST;
+   ++vcpu->stat.request_exits;
break;
}
if (need_resched()) {
-- 
2.5.0


[PATCH v3 5/5] KVM: refactor asynchronous vcpu ioctl dispatch

2015-08-14 Thread Radim Krčmář

I find the switch easier to read and modify.

Signed-off-by: Radim Krčmář 
---
 v3: precisely #ifdef arch-specific ioctls [Christian]
 v2: new

 virt/kvm/kvm_main.c | 12 
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index dfa2d5f27713..c059c01161fe 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2255,12 +2255,16 @@ static long kvm_vcpu_ioctl(struct file *filp,
 * Special cases: vcpu ioctls that are asynchronous to vcpu execution,
 * so vcpu_load() would break it.
 */
-#if defined(CONFIG_S390) || defined(CONFIG_PPC) || defined(CONFIG_MIPS)
-   if (ioctl == KVM_S390_INTERRUPT || ioctl == KVM_S390_IRQ || ioctl == KVM_INTERRUPT)
-   return kvm_arch_vcpu_ioctl(filp, ioctl, arg);
+   switch (ioctl) {
+#if defined(CONFIG_S390)
+   case KVM_S390_INTERRUPT:
+   case KVM_S390_IRQ:
+#elif defined(CONFIG_PPC) || defined(CONFIG_MIPS)
+   case KVM_INTERRUPT:
 #endif
-   if (ioctl == KVM_USER_EXIT)
+   case KVM_USER_EXIT:
return kvm_arch_vcpu_ioctl(filp, ioctl, arg);
+   }
 
r = vcpu_load(vcpu);
if (r)
-- 
2.5.0
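
The shape of the dispatch can be seen in isolation in the following sketch. The ioctl numbers and the MODEL_ARCH_HAS_INTERRUPT macro are made up for illustration and do not correspond to real KVM definitions.

```c
#include <stdbool.h>

/* Hypothetical ioctl numbers, for illustration only. */
enum model_ioctl {
	MODEL_INTERRUPT = 1,
	MODEL_USER_EXIT,
	MODEL_SET_REGS,
};

/*
 * True for ioctls that must bypass vcpu_load() because they are
 * asynchronous to VCPU execution. Arch-specific cases sit under their
 * own #if, so each build only lists the ioctls it can actually receive.
 */
static bool model_is_async_ioctl(enum model_ioctl ioctl)
{
	switch (ioctl) {
#if defined(MODEL_ARCH_HAS_INTERRUPT)
	case MODEL_INTERRUPT:
#endif
	case MODEL_USER_EXIT:
		return true;
	default:
		return false;
	}
}
```

Adding another asynchronous ioctl is then a one-line case label instead of another term in an if condition.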


PING: [PATCH v3 0/3] KVM: arm/arm64: Allow to use KVM without in-kernel irqchip

2015-08-14 Thread Pavel Fedin

PING

Kind regards,
Pavel Fedin
Expert Engineer
Samsung Electronics Research center Russia


> -Original Message-
> From: kvm-ow...@vger.kernel.org [mailto:kvm-ow...@vger.kernel.org] On Behalf 
> Of Pavel Fedin
> Sent: Wednesday, August 05, 2015 1:54 PM
> To: kvm...@lists.cs.columbia.edu; kvm@vger.kernel.org
> Cc: Christoffer Dall; Marc Zyngier
> Subject: [PATCH v3 0/3] KVM: arm/arm64: Allow to use KVM without in-kernel irqchip
> 
> This patch set brings back functionality which was broken in v4.0.
> Unfortunately, because of the restrictions of such hardware, it is impossible
> to take advantage of the virtual architected timer, so a guest running
> in such restricted mode has to use some memory-mapped timer. But it is
> still better than nothing.
> 
> v2 => v3:
> - Improved commit messages, added references to commits where the respective
>   functionality was broken
> - Explicitly specify that the solution currently affects only vGIC and has
>   nothing to do with timer.
> - Fixed code style according to previous notes
> - Removed ARM64 save/restore patch introduced in v2 because it was already
>   obsolete for linux-next
> - Modify KVM_CAP_IRQFD handling in correct place
> 
> v1 => v2:
> - Do not use defensive approach in patch 0001. Use correct conditions in
>   callers instead
> - Added ARM64-specific code, without which attempt to run a VM ends in a
>   HYP crash because of unset vGIC save/restore function pointers
> 
> Pavel Fedin (3):
>   Fix NULL pointer dereference if KVM is used without in-kernel irqchip
>   Detect vGIC presence at runtime
>   Make KVM_CAP_IRQFD dependent on KVM_CAP_IRQCHIP
> 
>  arch/arm/kvm/arm.c  | 19 ---
>  virt/kvm/kvm_main.c |  5 +++--
>  2 files changed, 19 insertions(+), 5 deletions(-)
> 
> --
> 2.4.4
> 


Re: [PATCH v2 1/5] KVM: add kvm_has_request wrapper

2015-08-14 Thread Radim Krčmář

2015-08-13 12:03+0200, Christian Borntraeger:
> Am 13.08.2015 um 11:29 schrieb Paolo Bonzini:
>> On 13/08/2015 11:11, Radim Krčmář wrote:
> for the new interface. maybe we can rename kvm_check_request in a 
> separate patch somewhen.
>>> I wonder why haven't we copied the naming convention from bit operations
| [...]
>> 
>> Yes, that would be much better.
> 
> +1

I'll send patches later.  Hope you won't mind keeping the doomed
kvm_has_request() in v3.

Re: [PATCH 2/2] KVM: x86: fix edge EOI and IOAPIC reconfig race

2015-08-14 Thread Radim Krčmář

2015-08-13 16:53+0200, Paolo Bonzini:
> On 13/08/2015 15:46, Radim Krčmář wrote:
>>  1) IOAPIC injects a vector from i8254
>>  2) guest reconfigures that vector's VCPU and therefore eoi_exit_bitmap
>> on original VCPU gets cleared
>>  3) guest's handler for the vector does EOI
>>  4) KVM's EOI handler doesn't pass that vector to IOAPIC because it is
>> not in that VCPU's eoi_exit_bitmap
>>  5) i8254 stops working
>> 
>> This creates an unwanted situation if the vector is reused by a
>> non-IOAPIC source, but I think it is so rare that we don't want to make
>> the solution more sophisticated. 
> 
> What happens if the vector is changed in step 2?
> __kvm_ioapic_update_eoi won't match the redirection table entry.

Yes, the EOI is going to be ignored.  (With APICv, VMX won't even exit.)
In the patch, I dismissed it as "shouldn't happen in the wild" because
we've always had the vector-change bug :) (Unlike the destination-change
one, which was APICv-only before recent changes.)

A simple solution to the vector-change bug would have a list of one-time
fixups (vector, *ioapic) and hooks in ioapic reconfig, scan and EOI.

A complex solution would replace ioapic scanning with an array of lists
of ioapics (it needs to be a list or small array because vectors can be
shared).
An ioapic would be added to list[vector] on reconfig and removed on
reconfig unless an edge fixup was needed, then it would last until the next
EOI.  (I guess we won't need to consider vector in IRR and ISR.)
Callbacks would update the eoi_exit_bitmap on relevant changes.

I considered doing the complex one, but then it occurred to me that we
want the destination-change fixed in stable as APICv machines are
starting to get used and people might migrate old guests to them.

> How do you reproduce the bug?

I run rhel4 (2.6.9) kernel on 2 VCPUs and frequently alternate
smp_affinity of "timer".  The bug is hit within seconds.
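
The five-step sequence quoted at the top can be reduced to a toy model, which may help when writing a unit test. The struct and the model_* functions below are hypothetical simplifications (two VCPUs, vectors below 32, one outstanding ack) and are not KVM's data structures.

```c
#include <stdbool.h>
#include <stdint.h>

struct eoi_model {
	uint32_t eoi_exit_bitmap[2];	/* per VCPU, one bit per (small) vector */
	bool ack_outstanding;		/* the source (i8254 here) waits for an EOI */
};

/* IOAPIC (re)configuration: only the destination VCPU keeps the vector's bit. */
static void model_route(struct eoi_model *m, int vcpu, int vector)
{
	m->eoi_exit_bitmap[0] &= ~(UINT32_C(1) << vector);
	m->eoi_exit_bitmap[1] &= ~(UINT32_C(1) << vector);
	m->eoi_exit_bitmap[vcpu] |= UINT32_C(1) << vector;
}

/* Step 1: an interrupt is injected and the source now waits for its EOI. */
static void model_inject(struct eoi_model *m)
{
	m->ack_outstanding = true;
}

/* Steps 3-4: the EOI reaches the IOAPIC only if the bit is set on that VCPU. */
static void model_eoi(struct eoi_model *m, int vcpu, int vector)
{
	if (m->eoi_exit_bitmap[vcpu] & (UINT32_C(1) << vector))
		m->ack_outstanding = false;
}
```

Routing to VCPU 0, injecting, re-routing to VCPU 1 and then signalling EOI on VCPU 0 leaves ack_outstanding set, which corresponds to step 5.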
