Re: [Qemu-devel] [PATCH v3 5/6] target-i386: Don't enable nested VMX by default

2014-10-30 Thread Paolo Bonzini

 Here I'm less certain what the best approach is. As you point out,
 there's an inconsistency that I agree should be fixed. I wonder however
 whether an approach similar to 3/6 for KVM only would be better? I.e.,
 have VMX as a sometimes-KVM-supported feature be listed in the model and
 filter it out for accel=kvm so that -cpu enforce works, but let
 accel=tcg fail with features not implemented.

This would mean that -cpu coreduo,enforce doesn't work on TCG, but -cpu
Nehalem,enforce works.  This does not make much sense to me.

In fact, I would even omit the x86_cpu_compat_set_features altogether.
The inclusion of vmx in these models was a mistake, and nested VMX is
not really useful with anything but -cpu host because there are too
many capabilities communicated via MSRs rather than CPUID.

Paolo
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [PATCH RFC 07/11] dataplane: allow virtio-1 devices

2014-10-30 Thread Cornelia Huck
On Tue, 28 Oct 2014 16:22:54 +0100
Greg Kurz gk...@linux.vnet.ibm.com wrote:

 On Tue,  7 Oct 2014 16:40:03 +0200
 Cornelia Huck cornelia.h...@de.ibm.com wrote:
 
  Handle endianness conversion for virtio-1 virtqueues correctly.
  
  Note that dataplane now needs to be built per-target.
  
 
 It also affects hw/virtio/virtio-pci.c:
 
 In file included from include/hw/virtio/dataplane/vring.h:23:0,
  from include/hw/virtio/virtio-scsi.h:21,
  from hw/virtio/virtio-pci.c:24:
 include/hw/virtio/virtio-access.h: In function ‘virtio_access_is_big_endian’:
 include/hw/virtio/virtio-access.h:28:15: error: attempt to use poisoned 
 TARGET_WORDS_BIGENDIAN
  #elif defined(TARGET_WORDS_BIGENDIAN)
^
 
 FWIW when I added endian ambivalent support to virtio, I remember *some 
 people*
 getting angry at the idea of turning common code into per-target... :)

Well, it probably can't be helped for something that is
endian-sensitive like virtio :( (Although we should try to keep it as
local as possible.)

 
 See comment below.
 
  Signed-off-by: Cornelia Huck cornelia.h...@de.ibm.com
  ---
   hw/block/dataplane/virtio-blk.c |3 +-
   hw/scsi/virtio-scsi-dataplane.c |2 +-
   hw/virtio/Makefile.objs |2 +-
   hw/virtio/dataplane/Makefile.objs   |2 +-
   hw/virtio/dataplane/vring.c |   85 
  +++
   include/hw/virtio/dataplane/vring.h |   64 --
   6 files changed, 113 insertions(+), 45 deletions(-)
  

  diff --git a/include/hw/virtio/dataplane/vring.h 
  b/include/hw/virtio/dataplane/vring.h
  index d3e086a..fde15f3 100644
  --- a/ 
  +++ b/include/hw/virtio/dataplane/vring.h
  @@ -20,6 +20,7 @@
    #include "qemu-common.h"
    #include "hw/virtio/virtio_ring.h"
    #include "hw/virtio/virtio.h"
   +#include "hw/virtio/virtio-access.h"
  
 
 Since the following commit:
 
 commit 244e2898b7a7735b3da114c120abe206af56a167
 Author: Fam Zheng f...@redhat.com
 Date:   Wed Sep 24 15:21:41 2014 +0800
 
 virtio-scsi: Add VirtIOSCSIVring in VirtIOSCSIReq
 
 The include/hw/virtio/dataplane/vring.h header is indirectly included
 by hw/virtio/virtio-pci.c. Why don't you move all these target-dependent
 helpers to another header?

Ah, this seems to have come in after I hacked on that code - I'll take a
look at splitting off the accessors.
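The underlying issue: legacy virtio follows the guest's (target's) byte order, while virtio-1 is always little-endian, so whether the host must byte-swap depends on both the device flavour and the target. A toy sketch of that decision (simplified model, not QEMU's actual helper; names are invented):

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model of the per-target endianness decision.  The real QEMU
 * accessor consults TARGET_WORDS_BIGENDIAN -- exactly the macro that is
 * poisoned in common code, hence the per-target build. */
static uint16_t bswap16(uint16_t v)
{
    return (uint16_t)((v >> 8) | (v << 8));
}

/* Legacy virtio uses the target's byte order; virtio-1 is always LE. */
static bool virtio_needs_swap(bool host_big_endian, bool target_big_endian,
                              bool is_virtio_1)
{
    bool device_big_endian = is_virtio_1 ? false : target_big_endian;
    return host_big_endian != device_big_endian;
}

static uint16_t virtio_lduw(uint16_t raw, bool host_be, bool target_be,
                            bool is_virtio_1)
{
    return virtio_needs_swap(host_be, target_be, is_virtio_1)
        ? bswap16(raw) : raw;
}
```

Since target_big_endian is a compile-time constant per target, any helper built on it cannot live in target-independent common code.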



Re: Benchmarking for vhost polling patch

2014-10-30 Thread Zhang Haoyu
 Hi Michael,
 
 Following the polling patch thread: 
 http://marc.info/?l=kvm&m=140853271510179&w=2, 
 I changed poll_stop_idle to be counted in microseconds, and carried out 
 experiments using varying sizes of this value. The setup for netperf 
 consisted of 
 1 vm and 1 vhost, each running on their own dedicated core.
 
Could you share your code changes?

Thanks,
Zhang Haoyu
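For context, the poll_stop_idle mechanism being benchmarked amounts to busy-polling the virtqueue and giving up only after a configurable idle budget. A rough sketch of that loop (hypothetical names and an injected clock, not the actual vhost patch):

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t (*now_us_fn)(void);

/* Busy-poll the virtqueue; stop only after poll_stop_idle_us
 * microseconds pass without finding any work, then the caller would
 * re-arm the guest notification and sleep.  Returns items processed. */
static int poll_virtqueue(now_us_fn now_us, bool (*has_work)(void *),
                          void (*do_work)(void *), void *vq,
                          uint64_t poll_stop_idle_us)
{
    uint64_t idle_since = now_us();
    int processed = 0;

    for (;;) {
        if (has_work(vq)) {
            do_work(vq);
            processed++;
            idle_since = now_us();      /* reset the idle window */
        } else if (now_us() - idle_since >= poll_stop_idle_us) {
            return processed;           /* budget spent: stop polling */
        }
    }
}
```

The trade-off being measured is exactly the size of poll_stop_idle_us: a larger budget burns more CPU on the vhost core but catches new work without an exit/notification round trip.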



Re: [PATCH 00/17] RFC: userfault v2

2014-10-30 Thread zhanghailiang

On 2014/10/30 1:46, Andrea Arcangeli wrote:

Hi Zhanghailiang,

On Mon, Oct 27, 2014 at 05:32:51PM +0800, zhanghailiang wrote:

Hi Andrea,

Thanks for your hard work on userfault;)

This is really a useful API.

I want to confirm a question:
Can we support distinguishing between writing and reading memory for userfault?
That is, we can decide whether writing a page, reading a page or both trigger 
userfault.

I think this will help support migration of vhost-scsi and ivshmem;
we can trace dirty pages in userspace.

Actually, I'm trying to realize a live memory snapshot based on pre-copy and
userfault, but reading memory from the migration thread will also trigger
userfault. It would be easy to implement live memory snapshots if we
supported configuring userfault for memory writes only.


Mail is going to be long enough already so I'll just assume tracking
dirty memory in userland (instead of doing it in kernel) is a worthy
feature to have here.

After some chat during the KVMForum I've been already thinking it
could be beneficial for some usage to give userland the information
about the fault being read or write, combined with the ability of
mapping pages wrprotected to mcopy_atomic (that would work without
false positives only with MADV_DONTFORK also set, but it's already set
in qemu). That will require vma->vm_flags & VM_USERFAULT to be
checked also in the wrprotect faults, not just in the not present
faults, but it's not a massive change. Returning the read/write
information is also not a massive change. This will then pay off mostly
if there's also a way to remove the memory atomically (kind of
remap_anon_pages).

Would that be enough? I mean are you still ok if non present read
fault traps too (you'd be notified it's a read) and you get
notification for both wrprotect and non present faults?


Hi Andrea,

Thanks for your reply, and your patience;)

Er, maybe I didn't describe it clearly. What I really need for live memory snapshot
is only the wrprotect fault, like KVM's dirty tracing mechanism, *only tracing
write actions*.

My initial solution scheme for live memory snapshot is:
(1) pause VM
(2) use userfaultfd to mark all memory of the VM wrprotect (readonly)
(3) save device state to the snapshot file
(4) resume VM
(5) snapshot thread begins to save pages of memory to the snapshot file
(6) VM runs on; it is OK for the VM or other threads to read ram (no
fault trap), but if the VM tries to write a page (dirty the page), there will be
a userfault trap notification.
(7) a fault-handle-thread reads the page request from userfaultfd,
copies the content of the page to some buffer, and then removes the page's
wrprotect limit (still using the userfaultfd to tell the kernel).
(8) after step (7), the VM can continue to write the page, which is now writable.
(9) snapshot thread saves the pages cached in step (7)
(10) repeat steps (5)~(9) until all of the VM's memory is saved to the snapshot file.

So, what I need from userfault is support for wrprotect faults only. I don't
want notifications for non-present read faults; they would hurt the
VM's performance and the efficiency of doing the snapshot.

Also, I think this feature will benefit migration of ivshmem and vhost-scsi,
which have no dirty-page tracing now.
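The scheme in steps (1)-(10) can be modeled in a few lines of C (a toy copy-before-write model; in the real scheme the trap is a userfault wrprotect notification, here it is just a flag):

```c
#include <stdint.h>
#include <string.h>

#define PAGES 4
#define PAGE_SIZE 16

static uint8_t ram[PAGES][PAGE_SIZE];
static uint8_t snapshot[PAGES][PAGE_SIZE];
static int wrprotected[PAGES];
static int snapshotted[PAGES];

/* steps (1)-(2): every page starts write-protected */
static void snapshot_start(void)
{
    for (int i = 0; i < PAGES; i++) {
        wrprotected[i] = 1;
        snapshotted[i] = 0;
    }
}

/* step (7): the fault-handle-thread copies the pre-write contents,
 * then lifts the protection so the write can proceed (step (8)). */
static void handle_write_fault(int page)
{
    memcpy(snapshot[page], ram[page], PAGE_SIZE);
    snapshotted[page] = 1;
    wrprotected[page] = 0;
}

/* step (6): only the first write to a protected page traps */
static void guest_write(int page, int off, uint8_t val)
{
    if (wrprotected[page])
        handle_write_fault(page);   /* copy-before-write */
    ram[page][off] = val;
}

/* steps (5)/(9): the snapshot thread picks up untouched pages lazily */
static void snapshot_page(int page)
{
    if (!snapshotted[page]) {
        memcpy(snapshot[page], ram[page], PAGE_SIZE);
        snapshotted[page] = 1;
        wrprotected[page] = 0;
    }
}
```

The key property is that reads never trap: only the first write to each still-protected page pays the fault cost, which is why a wrprotect-only userfault mode matters here.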


The question then is how you mark the memory readonly to let the
wrprotect faults trap if the memory already existed and you didn't map
it yourself in the guest with mcopy_atomic with a readonly flag.

My current plan would be:

- keep MADV_USERFAULT|NOUSERFAULT just to set VM_USERFAULT for the
   fast path check in the not-present and wrprotect page fault

- if VM_USERFAULT is set, find if there's a userfaultfd registered
   into that vma too

 if yes engage userfaultfd protocol

 otherwise raise SIGBUS (single threaded apps should be fine with
 SIGBUS and it'll spare them spawning a thread just to talk the
 userfaultfd protocol)

- if userfaultfd protocol is engaged, return read|write fault + fault
   address to read(ufd) syscalls

- leave the userfault resolution mechanism independent of the
   userfaultfd protocol so we keep the two problems separated and we
   don't mix them in the same API which makes it even harder to
   finalize it.

 add mcopy_atomic (with a flag to map the page readonly too)

 The alternative would be to hide mcopy_atomic (and even
 remap_anon_pages in order to remove the memory atomically for
 the externalization into the cloud) as userfaultfd commands to
 write into the fd. But then there would not be much point in keeping
 MADV_USERFAULT around if I do so and I could just remove it
 too, and it doesn't look clean having to open the userfaultfd just
 to issue a hidden mcopy_atomic.

 So it becomes a decision if the basic SIGBUS mode for single
 threaded apps should be supported or not. As long as we support
 SIGBUS too and we don't force to use userfaultfd as the only
 mechanism to be notified about userfaults, having a separate
 

Re: [PATCH v12 1/6] KVM: Add architecture-defined TLB flush support

2014-10-30 Thread Cornelia Huck
On Wed, 22 Oct 2014 15:34:06 -0700
Mario Smarduch m.smard...@samsung.com wrote:

 This patch adds support for architecture implemented VM TLB flush, currently
 ARMv7 defines HAVE_KVM_ARCH_TLB_FLUSH_ALL. This leaves other architectures 
 unaffected using the generic version. In subsequent patch ARMv7 defines
 HAVE_KVM_ARCH_TLB_FLUSH_ALL and it's own TLB flush interface.

Can you reword this a bit?

Allow architectures to override the generic kvm_flush_remote_tlbs()
function via HAVE_KVM_ARCH_TLB_FLUSH_ALL. ARMv7 will need this to
provide its own TLB flush interface.

 
 Signed-off-by: Mario Smarduch m.smard...@samsung.com
 ---
  virt/kvm/Kconfig|3 +++
  virt/kvm/kvm_main.c |2 ++
  2 files changed, 5 insertions(+)

Providing an override for the special cases looks sane to me.
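The override mechanism itself is the usual Kconfig-select pattern: compile the generic function only when the architecture does not declare the symbol. Schematically (symbol name from the patch, bodies are placeholders, the #define stands in for the arch Kconfig):

```c
static int flush_count;

#define HAVE_KVM_ARCH_TLB_FLUSH_ALL 1   /* would come from the arch Kconfig */

#ifdef HAVE_KVM_ARCH_TLB_FLUSH_ALL
/* arch-specific version, e.g. ARMv7's own TLB flush */
void kvm_flush_remote_tlbs(void)
{
    flush_count += 2;                   /* placeholder for the arch flush */
}
#else
/* generic version in virt/kvm/kvm_main.c */
void kvm_flush_remote_tlbs(void)
{
    flush_count += 1;                   /* placeholder for the generic flush */
}
#endif
```

Callers are unaffected either way, which is what keeps the other architectures untouched.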



Re: [PATCH v12 0/6] arm/KVM: dirty page logging support for ARMv7 (3.17.0-rc1)

2014-10-30 Thread Christian Borntraeger
On 23.10.2014 00:34, Mario Smarduch wrote:
 This patch series introduces dirty page logging for ARMv7 and adds some 
 degree 
 of generic dirty logging support for x86, armv7 and later armv8.
 
 I implemented Alex's  suggestion after he took a look at the patches at kvm
 forum to simplify the generic/arch split - leaving mips, powerpc, s390, 
 (ia64 although broken) unchanged. x86/armv7 now share some dirty logging 
 code. 
 armv8 dirty log patches have been posted and tested but for time being armv8
 is non-generic as well.
 
 I briefly spoke to most of you at kvm forum, and this is the patch series
 I was referring to. Implementation changed from the previous version (patches
 1 & 2); those who acked the previous revision, please review again.
 
 Last 4 patches (ARM) have been rebased for a newer kernel, with no significant
 changes.
 
 Testing:
 - Generally live migration + checksumming of source/destination memory 
 regions 
   is used to validate correctness. 
 - qemu machvirt, VExpress - Exynos 5440, FastModels - lmbench + dirty guest
   memory cycling.
 - ARMv8 Foundation Model/kvmtool - Due to slight overlap in 2nd stage handlers
   did a basic bringup using qemu.
 - x86_64 qemu  default machine model, tested migration on HP Z620, tested 
   convergence for several dirty page rates
 
 See https://github.com/mjsmar/arm-dirtylog-tests
 - Dirtlogtest-setup.pdf for ARMv7
 - https://github.com/mjsmar/arm-dirtylog-tests/tree/master/v7 - README
 
 The patch affects armv7,armv8, mips, ia64, powerpc, s390, x86_64. Patch
 series has been compiled for affected architectures:
 
 - x86_64 - defconfig 
 - ia64 - ia64-linux-gcc4.6.3 - defconfig; ia64 Kconfig defines BROKEN, worked 
   around that to make sure new changes don't break the build. Eventually the
   build breaks due to other reasons.
 - mips - mips64-linux-gcc4.6.3 - malta_kvm_defconfig
 - ppc - powerpc64-linux-gcc4.6.3 - pseries_defconfig
 - s390 - s390x-linux-gcc4.6.3 - defconfig
 - armv8 - aarch64-linux-gnu-gcc4.8.1 - defconfig
 
 ARMv7 Dirty page logging implementation overview:
 - initially write protects VM RAM memory region - 2nd stage page tables
 - add support to read dirty page log and again write protect the dirty pages 
   - second stage page table for next pass.
 - second stage huge pages are dissolved into small page tables to keep track of
   dirty pages at page granularity. Tracking at huge page granularity limits
   migration to an almost idle system. Small page size logging supports higher 
   memory dirty rates.
 - In the event migration is canceled, normal behavior is resumed and huge pages
   are rebuilt over time.
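That cycle (write-protect, trap, mark dirty, harvest and re-protect) can be sketched as a toy model (not the actual KVM code; 2nd stage tables are reduced to a per-page flag):

```c
#include <string.h>

#define NPAGES 8

static int wrprot[NPAGES];          /* stands in for 2nd stage write protection */
static unsigned char dirty[NPAGES]; /* the dirty bitmap */

/* start of a pass: everything write-protected, bitmap clear */
static void logging_start(void)
{
    for (int i = 0; i < NPAGES; i++) {
        wrprot[i] = 1;
        dirty[i] = 0;
    }
}

/* write fault: mark dirty, unprotect so later writes are free */
static void stage2_write_fault(int pfn)
{
    dirty[pfn] = 1;
    wrprot[pfn] = 0;
}

static void guest_write(int pfn)
{
    if (wrprot[pfn])
        stage2_write_fault(pfn);    /* only the first write per pass traps */
}

/* KVM_GET_DIRTY_LOG analogue: harvest the bitmap and re-protect the
 * dirty pages for the next pass; returns the dirty count. */
static int get_dirty_log(unsigned char *out)
{
    int n = 0;
    memcpy(out, dirty, NPAGES);
    for (int i = 0; i < NPAGES; i++) {
        if (dirty[i]) {
            n++;
            dirty[i] = 0;
            wrprot[i] = 1;
        }
    }
    return n;
}
```

Dissolving huge pages into small ones in the real series serves exactly this model: it shrinks the unit that a single fault dirties from a huge page to a small page.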
 
 Changes since v11:
 - Implemented Alex's comments to simplify generic layer.
 
 Changes since v10:
 - addressed wanghaibin comments 
 - addressed Christoffers comments
 
 Changes since v9:
 - Split patches into generic and architecture specific variants for TLB 
 Flushing
   and dirty log read (patches 1,2 & 3,4,5,6)
 - rebased to 3.16.0-rc1
 - Applied Christoffers comments.
 
 Mario Smarduch (6):
   KVM: Add architecture-defined TLB flush support
   KVM: Add generic support for dirty page logging
   arm: KVM: Add ARMv7 API to flush TLBs
   arm: KVM: Add initial dirty page locking infrastructure
   arm: KVM: dirty log read write protect support
   arm: KVM: ARMv7 dirty page logging 2nd stage page fault
 
  arch/arm/include/asm/kvm_asm.h|1 +
  arch/arm/include/asm/kvm_host.h   |   14 +++
  arch/arm/include/asm/kvm_mmu.h|   20 
  arch/arm/include/asm/pgtable-3level.h |1 +
  arch/arm/kvm/Kconfig  |2 +
  arch/arm/kvm/Makefile |1 +
  arch/arm/kvm/arm.c|2 +
  arch/arm/kvm/interrupts.S |   11 ++
  arch/arm/kvm/mmu.c|  209 
 +++--
  arch/x86/include/asm/kvm_host.h   |3 -
  arch/x86/kvm/Kconfig  |1 +
  arch/x86/kvm/Makefile |1 +
  arch/x86/kvm/x86.c|   86 --
  include/linux/kvm_host.h  |4 +
  virt/kvm/Kconfig  |6 +
  virt/kvm/dirtylog.c   |  112 ++
  virt/kvm/kvm_main.c   |2 +
  17 files changed, 380 insertions(+), 96 deletions(-)
  create mode 100644 virt/kvm/dirtylog.c
 

Patches 1-3 seem to work fine on s390. The other patches are arm-only (well, 
can't find 5 and 6) so I guess it's ok for s390.



Re: Benchmarking for vhost polling patch

2014-10-30 Thread Razya Ladelsky
Zhang Haoyu zhan...@sangfor.com wrote on 30/10/2014 01:30:08 PM:

 From: Zhang Haoyu zhan...@sangfor.com
 To: Razya Ladelsky/Haifa/IBM@IBMIL, mst m...@redhat.com
 Cc: Razya Ladelsky/Haifa/IBM@IBMIL, kvm kvm@vger.kernel.org
 Date: 30/10/2014 01:30 PM
 Subject: Re: Benchmarking for vhost polling patch
 
  Hi Michael,
  
  Following the polling patch thread: http://marc.info/?
 l=kvm&m=140853271510179&w=2, 
  I changed poll_stop_idle to be counted in microseconds, and carried 
out 
  experiments using varying sizes of this value. The setup for 
 netperf consisted of 
  1 vm and 1 vhost, each running on their own dedicated core.
  
 Could you share your code changes?
 
 Thanks,
 Zhang Haoyu
 
Hi Zhang,
Do you mean the change in code for poll_stop_idle?
Thanks,
Razya



Re: [PATCH v12 2/6] KVM: Add generic support for dirty page logging

2014-10-30 Thread Cornelia Huck
On Wed, 22 Oct 2014 15:34:07 -0700
Mario Smarduch m.smard...@samsung.com wrote:

 This patch defines KVM_GENERIC_DIRTYLOG, and moves dirty log read function
 to it's own file virt/kvm/dirtylog.c. x86 is updated to use the generic
 dirty log interface, selecting KVM_GENERIC_DIRTYLOG in its Kconfig and 
 makefile. No other architectures are affected, each uses it's own version.
 This changed from previous patch revision where non-generic architectures 
 were modified.
 
 In subsequent patch armv7 does samething. All other architectures continue
 use architecture defined version.
 

Hm.

The x86 specific version of dirty page logging is generic enough to be
used by other architectures, notably ARMv7. So let's move the x86 code
under virt/kvm/ and make it depend on KVM_GENERIC_DIRTYLOG. Other
architectures continue to use their own implementations.

?

 
 Signed-off-by: Mario Smarduch m.smard...@samsung.com
 ---
  arch/x86/include/asm/kvm_host.h |3 --
  arch/x86/kvm/Kconfig|1 +
  arch/x86/kvm/Makefile   |1 +
  arch/x86/kvm/x86.c  |   86 --
  include/linux/kvm_host.h|4 ++
  virt/kvm/Kconfig|3 ++
  virt/kvm/dirtylog.c |  112 
 +++
  7 files changed, 121 insertions(+), 89 deletions(-)
  create mode 100644 virt/kvm/dirtylog.c
 

 diff --git a/virt/kvm/dirtylog.c b/virt/kvm/dirtylog.c
 new file mode 100644
 index 000..67a
 --- /dev/null
 +++ b/virt/kvm/dirtylog.c
 @@ -0,0 +1,112 @@
 +/*
 + * kvm generic dirty logging support, used by architectures that share
 + * comman dirty page logging implementation.

s/comman/common/

The approach looks sane to me, especially as it does not change other
architectures needlessly.



Re: [Xen-devel] [RFC] Hypervisor RNG and enumeration

2014-10-30 Thread David Vrabel
On 29/10/14 05:19, Andy Lutomirski wrote:
 CPUID leaf 4F02H: miscellaneous features
 
 
[...]
 ### CommonHV RNG
 
 If CPUID.4F02H.EAX is nonzero, then it contains an MSR index used to
 communicate with a hypervisor random number generator.  This MSR is
 referred to as MSR_COMMONHV_RNG.
 
 rdmsr(MSR_COMMONHV_RNG) returns a 64-bit best-effort random number.  If the
 hypervisor is able to generate a 64-bit cryptographically secure random 
 number,
 it SHOULD return it.  If not, then the hypervisor SHOULD do its best to return
 a random number suitable for seeding a cryptographic RNG.
 
 A guest is expected to read MSR_COMMONHV_RNG several times in a row.
 The hypervisor SHOULD return different values each time.
 
 rdmsr(MSR_COMMONHV_RNG) MUST NOT result in an exception, but guests MUST
 NOT assume that its return value is indeed secure.  For example, a hypervisor
 is free to return zero in response to rdmsr(MSR_COMMONHV_RNG).

I would add:

  If the hypervisor's pool of random data is exhausted, it MAY
  return 0.  The hypervisor MUST provide at least 4 (?) non-zero
  numbers to each guest.

Xen does not have a continual source of entropy and the only feasible
way is for the toolstack to provide each guest with a fixed size pool of
random data during guest creation.

The fixed size pool could be refilled by the guest if further random
data is needed (e.g., before an in-guest kexec).

 wrmsr(MSR_COMMONHV_RNG) offers the hypervisor up to 64 bits of entropy.
 The hypervisor MAY use it as it sees fit to improve its own random number
 generator.  A hypervisor SHOULD make a reasonable effort to avoid making
 values written to MSR_COMMONHV_RNG visible to untrusted parties, but
 guests SHOULD NOT write sensitive values to wrmsr(MSR_COMMONHV_RNG).

I don't think unprivileged guests should be able to influence the
hypervisor's RNG. Unless the intention here is that it only affects the
numbers returned to this guest?

But since the write is optional, I don't object to it.

David
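For what it's worth, a guest consuming such an interface would presumably read the MSR several times and fold the results into a seed, treating a zero read as "pool possibly exhausted" rather than an error (the spec explicitly allows the hypervisor to return 0). A sketch - the MSR index and helper names are invented, and rdmsr is injected so the logic stands on its own:

```c
#include <stdint.h>

typedef uint64_t (*rdmsr_fn)(uint32_t msr);

/* Hypothetical index; the real one comes from CPUID.4F02H.EAX. */
#define MSR_COMMONHV_RNG_EXAMPLE 0x40000000u

/* Read the RNG MSR `reads` times, count the non-zero replies, and mix
 * everything into a seed.  Any mixing works here because the result
 * only seeds a proper CSPRNG; it is never used as randomness directly. */
static uint64_t gather_seed(rdmsr_fn rdmsr, int reads, int *nonzero)
{
    uint64_t seed = 0;
    *nonzero = 0;
    for (int i = 0; i < reads; i++) {
        uint64_t v = rdmsr(MSR_COMMONHV_RNG_EXAMPLE);
        if (v)
            (*nonzero)++;
        seed = (seed << 13 | seed >> 51) ^ v;   /* rotate-xor mix */
    }
    return seed;
}
```

Under David's proposed wording, a guest that cares could check the non-zero count against the minimum the hypervisor MUST provide and fall back to other entropy sources if the pool looks dry.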



Re: [Xen-devel] [RFC] Hypervisor RNG and enumeration

2014-10-30 Thread Paolo Bonzini
On 10/30/2014 01:21 PM, David Vrabel wrote:
 I would add:
 
   If the hypervisor's pool of random data is exhausted, it MAY
   return 0.  The hypervisor MUST provide at least 4 (?) non-zero
   numbers to each guest.

Mandating non-zero numbers sounds like a bad idea.  Just use the RNG
for what it was designed for; returning non-random numbers will not be a
problem.

Paolo


Re: [PATCH 00/17] RFC: userfault v2

2014-10-30 Thread Dr. David Alan Gilbert
* zhanghailiang (zhang.zhanghaili...@huawei.com) wrote:
 On 2014/10/30 1:46, Andrea Arcangeli wrote:
 Hi Zhanghailiang,
 
 On Mon, Oct 27, 2014 at 05:32:51PM +0800, zhanghailiang wrote:
 Hi Andrea,
 
 Thanks for your hard work on userfault;)
 
 This is really a useful API.
 
 I want to confirm a question:
 Can we support distinguishing between writing and reading memory for 
 userfault?
 That is, we can decide whether writing a page, reading a page or both 
 trigger userfault.
 
 I think this will help supporting vhost-scsi,ivshmem for migration,
 we can trace dirty page in userspace.
 
 Actually, i'm trying to realize live memory snapshot based on pre-copy and 
 userfault,
 but reading memory from migration thread will also trigger userfault.
 It will be easy to implement live memory snapshot, if we support configuring
 userfault for writing memory only.
 
 Mail is going to be long enough already so I'll just assume tracking
 dirty memory in userland (instead of doing it in kernel) is worthy
 feature to have here.
 
 After some chat during the KVMForum I've been already thinking it
 could be beneficial for some usage to give userland the information
 about the fault being read or write, combined with the ability of
 mapping pages wrprotected to mcopy_atomic (that would work without
 false positives only with MADV_DONTFORK also set, but it's already set
 in qemu). That will require vma->vm_flags & VM_USERFAULT to be
 checked also in the wrprotect faults, not just in the not present
 faults, but it's not a massive change. Returning the read/write
 information is also a not massive change. This will then payoff mostly
 if there's also a way to remove the memory atomically (kind of
 remap_anon_pages).
 
 Would that be enough? I mean are you still ok if non present read
 fault traps too (you'd be notified it's a read) and you get
 notification for both wrprotect and non present faults?
 
 Hi Andrea,
 
 Thanks for your reply, and your patience;)
 
 Er, maybe i didn't describe clearly. What i really need for live memory 
 snapshot
 is only wrprotect fault, like kvm's dirty tracing mechanism, *only tracing 
 write action*.
 
 My initial solution scheme for live memory snapshot is:
 (1) pause VM
 (2) using userfaultfd to mark all memory of VM is wrprotect (readonly)
 (3) save device state to snapshot file
 (4) resume VM
 (5) snapshot thread begin to save page of memory to snapshot file
 (6) VM is going to run, and it is OK for VM or other thread to read ram (no 
 fault trap),
 but if VM try to write page (dirty the page), there will be
 a userfault trap notification.
 (7) a fault-handle-thread reads the page request from userfaultfd,
 it will copy content of the page to some buffers, and then remove the 
 page's
 wrprotect limit(still using the userfaultfd to tell kernel).
 (8) after step (7), VM can continue to write the page which is now can be 
 write.
 (9) snapshot thread save the page cached in step (7)
 (10) repeat step (5)~(9) until all VM's memory is saved to snapshot file.

Hmm, I can see the same process being useful for fault-tolerance schemes
like COLO; it needs a memory state snapshot.

 So, what i need for userfault is supporting only wrprotect fault. i don't
 want to get notification for non present reading faults, it will influence
 VM's performance and the efficiency of doing snapshot.

What pages would be non-present at this point - just balloon?

Dave

 Also, i think this feature will benefit for migration of ivshmem and 
 vhost-scsi
 which have no dirty-page-tracing now.
 
 The question then is how you mark the memory readonly to let the
 wrprotect faults trap if the memory already existed and you didn't map
 it yourself in the guest with mcopy_atomic with a readonly flag.
 
 My current plan would be:
 
 - keep MADV_USERFAULT|NOUSERFAULT just to set VM_USERFAULT for the
fast path check in the not-present and wrprotect page fault
 
 - if VM_USERFAULT is set, find if there's a userfaultfd registered
into that vma too
 
  if yes engage userfaultfd protocol
 
  otherwise raise SIGBUS (single threaded apps should be fine with
  SIGBUS and it'll avoid them to spawn a thread in order to talk the
  userfaultfd protocol)
 
 - if userfaultfd protocol is engaged, return read|write fault + fault
address to read(ufd) syscalls
 
 - leave the userfault resolution mechanism independent of the
userfaultfd protocol so we keep the two problems separated and we
don't mix them in the same API which makes it even harder to
finalize it.
 
  add mcopy_atomic (with a flag to map the page readonly too)
 
  The alternative would be to hide mcopy_atomic (and even
  remap_anon_pages in order to remove the memory atomically for
  the externalization into the cloud) as userfaultfd commands to
  write into the fd. But then there would be no much point to keep
  MADV_USERFAULT around if I do so and I could just remove it
  too or it doesn't 

[PATCH 3/3] KVM: x86: optimize some accesses to LVTT and SPIV

2014-10-30 Thread Radim Krčmář
We mirror a subset of these registers in separate variables.
Using them directly should be faster.

Signed-off-by: Radim Krčmář rkrc...@redhat.com
---
 arch/x86/kvm/lapic.c | 10 +++---
 arch/x86/kvm/lapic.h |  6 +++---
 2 files changed, 6 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index d3a3a1c..67af5d2 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -239,21 +239,17 @@ static inline int apic_lvt_vector(struct kvm_lapic *apic, 
int lvt_type)
 
 static inline int apic_lvtt_oneshot(struct kvm_lapic *apic)
 {
-   return ((kvm_apic_get_reg(apic, APIC_LVTT) &
-   apic->lapic_timer.timer_mode_mask) == APIC_LVT_TIMER_ONESHOT);
+   return apic->lapic_timer.timer_mode == APIC_LVT_TIMER_ONESHOT;
 }
 
 static inline int apic_lvtt_period(struct kvm_lapic *apic)
 {
-   return ((kvm_apic_get_reg(apic, APIC_LVTT) &
-   apic->lapic_timer.timer_mode_mask) == APIC_LVT_TIMER_PERIODIC);
+   return apic->lapic_timer.timer_mode == APIC_LVT_TIMER_PERIODIC;
 }
 
 static inline int apic_lvtt_tscdeadline(struct kvm_lapic *apic)
 {
-   return ((kvm_apic_get_reg(apic, APIC_LVTT) &
-   apic->lapic_timer.timer_mode_mask) ==
-   APIC_LVT_TIMER_TSCDEADLINE);
+   return apic->lapic_timer.timer_mode == APIC_LVT_TIMER_TSCDEADLINE;
 }
 
 static inline int apic_lvt_nmi_mode(u32 lvt_val)
diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
index 755a954..2c56885 100644
--- a/arch/x86/kvm/lapic.h
+++ b/arch/x86/kvm/lapic.h
@@ -121,11 +121,11 @@ static inline int kvm_apic_hw_enabled(struct kvm_lapic 
*apic)
 
 extern struct static_key_deferred apic_sw_disabled;
 
-static inline int kvm_apic_sw_enabled(struct kvm_lapic *apic)
+static inline bool kvm_apic_sw_enabled(struct kvm_lapic *apic)
 {
if (static_key_false(apic_sw_disabled.key))
-   return kvm_apic_get_reg(apic, APIC_SPIV) & APIC_SPIV_APIC_ENABLED;
-   return APIC_SPIV_APIC_ENABLED;
+   return apic->sw_enabled;
+   return true;
 }
 
 static inline bool kvm_apic_present(struct kvm_vcpu *vcpu)
-- 
2.1.0
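The shape of the optimization, schematically (simplified stand-ins for the KVM structures, not the kernel code itself): keep a mirrored field, update it at the single place the register is written, and query the mirror on the fast path instead of re-masking the register.

```c
#include <stdint.h>

#define TIMER_MODE_MASK   (3u << 17)
#define TIMER_ONESHOT     (0u << 17)
#define TIMER_PERIODIC    (1u << 17)

struct lapic {
    uint32_t lvtt_reg;      /* the architectural register */
    uint32_t timer_mode;    /* mirror, kept in sync on write */
};

/* the only place the register changes -- the single sync point */
static void lvtt_write(struct lapic *apic, uint32_t val)
{
    apic->lvtt_reg = val;
    apic->timer_mode = val & TIMER_MODE_MASK;
}

/* fast path: no register read, no masking */
static int lvtt_period(const struct lapic *apic)
{
    return apic->timer_mode == TIMER_PERIODIC;
}
```

The correctness argument is the invariant timer_mode == (lvtt_reg & TIMER_MODE_MASK), which holds because every write funnels through one function - the same reasoning the patch relies on for LVTT and SPIV.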



[PATCH 2/3] KVM: x86: detect LVTT changes under APICv

2014-10-30 Thread Radim Krčmář
APICv traps register writes, so we can't retrieve previous value and
omit timer cancelation when mode changes.

timer_mode_mask shouldn't be changing as it depends on cpuid.

Signed-off-by: Radim Krčmář rkrc...@redhat.com
---
#define assign(a, b) (a == b ? false : (a = b, true))

 arch/x86/kvm/lapic.c | 12 
 arch/x86/kvm/lapic.h |  1 +
 2 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index f538b14..d3a3a1c 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -1205,17 +1205,20 @@ static int apic_reg_write(struct kvm_lapic *apic, u32 
reg, u32 val)
 
break;
 
-   case APIC_LVTT:
-   if ((kvm_apic_get_reg(apic, APIC_LVTT) &
-   apic->lapic_timer.timer_mode_mask) !=
-  (val & apic->lapic_timer.timer_mode_mask))
+   case APIC_LVTT: {
+   u32 timer_mode = val & apic->lapic_timer.timer_mode_mask;
+
+   if (apic->lapic_timer.timer_mode != timer_mode) {
+   apic->lapic_timer.timer_mode = timer_mode;
    hrtimer_cancel(&apic->lapic_timer.timer);
+   }
 
if (!kvm_apic_sw_enabled(apic))
val |= APIC_LVT_MASKED;
    val &= (apic_lvt_mask[0] | apic->lapic_timer.timer_mode_mask);
apic_set_reg(apic, APIC_LVTT, val);
break;
+   }
 
case APIC_TMICT:
if (apic_lvtt_tscdeadline(apic))
@@ -1449,6 +1452,7 @@ void kvm_lapic_reset(struct kvm_vcpu *vcpu)
 
for (i = 0; i  APIC_LVT_NUM; i++)
apic_set_reg(apic, APIC_LVTT + 0x10 * i, APIC_LVT_MASKED);
+   apic->lapic_timer.timer_mode = 0;
apic_set_reg(apic, APIC_LVT0,
 SET_APIC_DELIVERY_MODE(0, APIC_MODE_EXTINT));
 
diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
index 5fcc3d3..755a954 100644
--- a/arch/x86/kvm/lapic.h
+++ b/arch/x86/kvm/lapic.h
@@ -11,6 +11,7 @@
 struct kvm_timer {
struct hrtimer timer;
s64 period; /* unit: ns */
+   u32 timer_mode;
u32 timer_mode_mask;
u64 tscdeadline;
atomic_t pending;   /* accumulated triggered timers 
*/
-- 
2.1.0
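The change-detection idea, schematically: since APICv hides the old register value, a cached copy of the field stands in for it, and the expensive action fires only on a real transition. The assign() macro is adapted from the patch notes; cancel_count stands in for hrtimer_cancel().

```c
#include <stdbool.h>
#include <stdint.h>

/* evaluates to true only when b differs from a, updating a as a side
 * effect -- adapted from the macro quoted in this patch's notes */
#define assign(a, b) ((a) == (b) ? false : ((a) = (b), true))

static uint32_t cached_timer_mode;
static int cancel_count;

static void lvtt_write(uint32_t timer_mode)
{
    if (assign(cached_timer_mode, timer_mode))
        cancel_count++;     /* cancel the timer only on a real mode change */
}
```

Repeated writes of the same mode cost nothing, which preserves the behavior the old read-compare-write code had before APICv took the previous value away.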



[PATCH 0/3] kvm: APICv register write workaround

2014-10-30 Thread Radim Krčmář
APICv traps register writes, so we can't retrieve previous value, but
our code depends on detecting changes.

Apart from disabling APIC register virtualization, we can detect the
change by using extra memory.  One value history is enough, but we still
don't want to keep it for every APIC register, for performance reasons.
This leaves us with either a new framework, or exceptions ...
The latter option fits KVM's path better [1,2].

And when we already mirror a part of registers, optimizing access is
acceptable [3].  (Squashed to keep bisecters happy.)

---
Radim Krčmář (3):
  KVM: x86: detect SPIV changes under APICv
  KVM: x86: detect LVTT changes under APICv
  KVM: x86: optimize some accesses to LVTT and SPIV

 arch/x86/kvm/lapic.c | 32 +---
 arch/x86/kvm/lapic.h |  8 +---
 2 files changed, 22 insertions(+), 18 deletions(-)

-- 
2.1.0



[PATCH 1/3] KVM: x86: detect SPIV changes under APICv

2014-10-30 Thread Radim Krčmář
APICv traps register writes, so we can't retrieve previous value.
(A bit of blame on Intel.)

This caused a migration bug:  LAPIC is enabled, so our restore code
correctly lowers apic_sw_enabled, but doesn't increase it after APICv is
disabled, so we get below zero when freeing it; resulting in this trace:

  WARNING: at kernel/jump_label.c:81 __static_key_slow_dec+0xa6/0xb0()
  jump label: negative count!

  [816bf898] dump_stack+0x19/0x1b
  [8107c6f1] warn_slowpath_common+0x61/0x80
  [8107c76c] warn_slowpath_fmt+0x5c/0x80
  [811931e6] __static_key_slow_dec+0xa6/0xb0
  [81193226] static_key_slow_dec_deferred+0x16/0x20
  [a0637698] kvm_free_lapic+0x88/0xa0 [kvm]
  [a061c63e] kvm_arch_vcpu_uninit+0x2e/0xe0 [kvm]
  [a05ff301] kvm_vcpu_uninit+0x21/0x40 [kvm]
  [a067cec7] vmx_free_vcpu+0x47/0x70 [kvm_intel]
  [a061bc50] kvm_arch_vcpu_free+0x50/0x60 [kvm]
  [a061ca22] kvm_arch_destroy_vm+0x102/0x260 [kvm]
  [810b68fd] ? synchronize_srcu+0x1d/0x20
  [a06030d1] kvm_put_kvm+0xe1/0x1c0 [kvm]
  [a06036f8] kvm_vcpu_release+0x18/0x20 [kvm]
  [81215c62] __fput+0x102/0x310
  [81215f4e] fput+0xe/0x10
  [810ab664] task_work_run+0xb4/0xe0
  [81083944] do_exit+0x304/0xc60
  [816c8dfc] ? _raw_spin_unlock_irq+0x2c/0x50
  [810fd22d] ?  trace_hardirqs_on_caller+0xfd/0x1c0
  [8108432c] do_group_exit+0x4c/0xc0
  [810843b4] SyS_exit_group+0x14/0x20
  [816d33a9] system_call_fastpath+0x16/0x1b

Signed-off-by: Radim Krčmář rkrc...@redhat.com
---
 arch/x86/kvm/lapic.c | 10 ++
 arch/x86/kvm/lapic.h |  1 +
 2 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index b8345dd..f538b14 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -201,11 +201,13 @@ out:
 
 static inline void apic_set_spiv(struct kvm_lapic *apic, u32 val)
 {
-   u32 prev = kvm_apic_get_reg(apic, APIC_SPIV);
+   bool enabled = val & APIC_SPIV_APIC_ENABLED;
 
apic_set_reg(apic, APIC_SPIV, val);
-   if ((prev ^ val) & APIC_SPIV_APIC_ENABLED) {
-   if (val & APIC_SPIV_APIC_ENABLED) {
+
+   if (enabled != apic->sw_enabled) {
+   apic->sw_enabled = enabled;
+   if (enabled) {
static_key_slow_dec_deferred(apic_sw_disabled);
recalculate_apic_map(apic-vcpu-kvm);
} else
@@ -1320,7 +1322,7 @@ void kvm_free_lapic(struct kvm_vcpu *vcpu)
if (!(vcpu->arch.apic_base & MSR_IA32_APICBASE_ENABLE))
static_key_slow_dec_deferred(apic_hw_disabled);
 
-   if (!(kvm_apic_get_reg(apic, APIC_SPIV) & APIC_SPIV_APIC_ENABLED))
+   if (!apic->sw_enabled)
static_key_slow_dec_deferred(apic_sw_disabled);
 
if (apic-regs)
diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
index 6a11845..5fcc3d3 100644
--- a/arch/x86/kvm/lapic.h
+++ b/arch/x86/kvm/lapic.h
@@ -33,6 +33,7 @@ struct kvm_lapic {
 * Note: Only one register, the TPR, is used by the microcode.
 */
void *regs;
+   bool sw_enabled;
gpa_t vapic_addr;
struct gfn_to_hva_cache vapic_cache;
unsigned long pending_events;
-- 
2.1.0



Re: [Xen-devel] [RFC] Hypervisor RNG and enumeration

2014-10-30 Thread Roger Pau Monné
Adding the bhyve guys.

On 29/10/14 at 6.19, Andy Lutomirski wrote:
 Here's a draft CommonHV spec.  It's also on github:
 
 https://github.com/amluto/CommonHV
 
 So far, this provides a two-way RNG interface, a way to detect it, and
 a way to detect other hypervisor leaves.  The latter is because, after
 both the enormous public thread and some private discussions, it seems
 that detection of existing CPUID paravirt leaves is annoying and
 inefficient.  If we're going to define some cross-vendor CPUID leaves,
 it seems like it would be useful to offer a way to quickly enumerate
 other leaves.
 
 I've been told the AMD intends to update their manual to match Intel's
 so that hypervisors can use the entire 0x4F?? CPUID range.  I have
 intentionally not fixed an MSR value for the RNG because the range of
 allowed MSRs is very small in both the Intel and AMD manuals.  If any
 given hypervisor wants to ignore that small range and advertise a
 higher-numbered MSR, it is welcome to, but I don't want to codify
 something that doesn't comply with the manuals.
 
 Here's the draft.  Comments?  To the people who work on various
 hypervisors: Would you implement this?  Do you like it?  Is there
 anything, major or minor, that you'd like to see changed?  Do you
 think that this is a good idea at all?
 
 I've tried to get good coverage of various hypervisors.  There are
 Hyper-V, VMWare, KVM, and Xen people on the cc list.
 
 Thanks,
 Andy
 
 
 
 CommonHV, a common hypervisor interface
 ===
 
 This is CommonHV draft 1.
 
 The CommonHV specification is Copyright (c) 2014 Andrew Lutomirski.
 
 Licensing will be determined soon.  The license is expected to be extremely
 liberal.  I am currently leaning towards CC-BY-SA for the specification and
 an explicit license permitting anyone to implement the specification
 with no restrictions whatsoever.
 
 I have not patented, nor do I intend to patent, anything required to implement
 this specification.  I am not aware of any current or future intellectual
 property rights that would prevent a royalty-free implementation of
 this specification.
 
 I would like to find a stable, neutral steward of this specification
 going forward.  Help with this would be much appreciated.
 
 Scope
 -
 
 CommonHV is a simple interface for communication
 between hypervisors and their guests.
 
 CommonHV is intended to be very simple and to avoid interfering with
 existing paravirtual interfaces.  To that end, its scope is limited.
 CommonHV does only two types of things:
 
   * It provides a way to enumerate other paravirtual interfaces.
   * It provides a small, extensible set of paravirtual features that do not
 modify or replace standard system functionality.
 
 For example, CommonHV does not and will not define anything related to
 interrupt handling or virtual CPU management.
 
 For now, CommonHV is only applicable to the x86 platform.
 
 Discovery
 -
 
 A CommonHV hypervisor MUST set the hypervisor bit (bit 31 in CPUID.1H.0H.ECX)
 and provide the CPUID leaf 4F00H, containing:
 
   * CPUID.4F00H.0H.EAX = max_commonhv_leaf
   * CPUID.4F00H.0H.EBX = 0x6D6D6F43
   * CPUID.4F00H.0H.ECX = 0x56486E6F
   * CPUID.4F00H.0H.EDX = 0x66746e49
 
 EBX, ECX, and EDX form the string CommonHVIntf in little-endian ASCII.
 
 max_commonhv_leaf MUST be a number between 0x4F00 and 0x4FFF.  It
 indicates the largest leaf defined in this specification that is provided.
 Any leaves described in this specification with EAX values that exceed
 max_commonhv_leaf MUST be handled by guests as though they contain
 all zeros.
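
A guest-side check of this discovery leaf might look like the sketch below. The signature constants come straight from the draft; the helper names, and the choice of operating on pre-fetched CPUID output (so the logic can run outside a hypervisor), are illustrative assumptions, not part of the spec.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Signature words from the draft: "CommonHVIntf" in little-endian ASCII. */
#define COMMONHV_SIG_EBX 0x6D6D6F43u
#define COMMONHV_SIG_ECX 0x56486E6Fu
#define COMMONHV_SIG_EDX 0x66746E49u

/* Check CPUID.4F00H output that the caller has already fetched. */
static bool commonhv_sig_matches(uint32_t ebx, uint32_t ecx, uint32_t edx)
{
	return ebx == COMMONHV_SIG_EBX &&
	       ecx == COMMONHV_SIG_ECX &&
	       edx == COMMONHV_SIG_EDX;
}

/* Decode the three registers into a printable string (assumes a
 * little-endian host, as on x86). */
static void commonhv_sig_string(uint32_t ebx, uint32_t ecx, uint32_t edx,
				char out[13])
{
	memcpy(out, &ebx, 4);
	memcpy(out + 4, &ecx, 4);
	memcpy(out + 8, &edx, 4);
	out[12] = '\0';
}
```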
 
 CPUID leaf 4F01H: hypervisor interface enumeration
 --
 
 If max_commonhv_leaf >= 0x4F01, CommonHV provides a list of tuples
 (location, signature).  Each tuple indicates the presence of another
 paravirtual interface identified by the signature at the indicated
 CPUID location.  It is expected that CPUID.location.0H will have
 (EBX, ECX, EDX) == signature, although whether this is required
 is left to the specification associated with the given signature.
 
 If the list contains N tuples, then, for each 0 <= i < N:
 
   * CPUID.4F01H.i.EBX, CPUID.4F01H.i.ECX, and CPUID.4F01H.i.EDX
 are the signature.
   * CPUID.4F01H.i.EAX is the location.
 
 CPUID with EAX = 0x4F01 and ECX >= N MUST return all zeros.
 
 To the extent that the hypervisor prefers a given interface, it should
 specify that interface earlier in the list.  For example, KVM might place
 its KVMKVMKVM signature first in the list to indicate that it should be
 used by guests in preference to other supported interfaces.  Other hypervisors
 would likely use a different order.
 
 The exact semantics of the ordering of the list is beyond the scope of
 this specification.
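
The termination rule above (the first all-zero subleaf ends the list) could be walked as in this hypothetical sketch; the array stands in for real CPUID.4F01H invocations and the function name is invented for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct cpuid_regs {
	uint32_t eax, ebx, ecx, edx;
};

/*
 * Collect (location, signature) tuples from CPUID.4F01H subleaves until
 * the first all-zero entry.  Returns the number of interfaces found and
 * stores each one's CPUID location (EAX) in 'locations'.
 */
static size_t commonhv_enum_interfaces(const struct cpuid_regs *subleaf,
				       size_t max, uint32_t *locations)
{
	size_t n;

	for (n = 0; n < max; n++) {
		const struct cpuid_regs *r = &subleaf[n];

		if (!r->eax && !r->ebx && !r->ecx && !r->edx)
			break;			/* end-of-list marker */
		locations[n] = r->eax;		/* EAX holds the location */
	}
	return n;
}
```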
 
 CPUID leaf 4F02H: miscellaneous features
 
 
 

Re: [Xen-devel] [RFC] Hypervisor RNG and enumeration

2014-10-30 Thread Andy Lutomirski
On Thu, Oct 30, 2014 at 5:21 AM, David Vrabel david.vra...@citrix.com wrote:
 On 29/10/14 05:19, Andy Lutomirski wrote:
 CPUID leaf 4F02H: miscellaneous features
 

 [...]
 ### CommonHV RNG

 If CPUID.4F02H.EAX is nonzero, then it contains an MSR index used to
 communicate with a hypervisor random number generator.  This MSR is
 referred to as MSR_COMMONHV_RNG.

 rdmsr(MSR_COMMONHV_RNG) returns a 64-bit best-effort random number.  If the
 hypervisor is able to generate a 64-bit cryptographically secure random 
 number,
 it SHOULD return it.  If not, then the hypervisor SHOULD do its best to 
 return
 a random number suitable for seeding a cryptographic RNG.

 A guest is expected to read MSR_COMMONHV_RNG several times in a row.
 The hypervisor SHOULD return different values each time.

 rdmsr(MSR_COMMONHV_RNG) MUST NOT result in an exception, but guests MUST
 NOT assume that its return value is indeed secure.  For example, a hypervisor
 is free to return zero in response to rdmsr(MSR_COMMONHV_RNG).

 I would add:

   If the hypervisor's pool of random data is exhausted, it MAY
   return 0.  The hypervisor MUST provide at least 4 (?) non-zero
   numbers to each guest.

 Xen does not have a continual source of entropy and the only feasible
 way is for the toolstack to provide each guest with a fixed size pool of
 random data during guest creation.


Xen could seed a very simple per-guest DRBG at guest startup and then
let the rdmsr call read from it.
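
As a rough illustration of that plumbing (and nothing more): a per-guest generator could be seeded once at guest creation and drained by the rdmsr handler. The splitmix64 mixer below is emphatically not cryptographically secure; it only stands in for a real DRBG such as a ChaCha20- or AES-CTR-based construction, and all names here are made up.

```c
#include <assert.h>
#include <stdint.h>

struct guest_rng {
	uint64_t state;		/* seeded by the toolstack at guest creation */
};

/* splitmix64 step: deterministic and fast, NOT secure -- placeholder only. */
static uint64_t guest_rng_next(struct guest_rng *g)
{
	uint64_t z = (g->state += 0x9E3779B97F4A7C15ULL);

	z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
	z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
	return z ^ (z >> 31);
}

/* What a hypothetical rdmsr(MSR_COMMONHV_RNG) handler would call. */
static uint64_t handle_rdmsr_commonhv_rng(struct guest_rng *g)
{
	return guest_rng_next(g);
}
```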

 The fixed size pool could be refilled by the guest if further random
 data is needed (e.g., before an in-guest kexec).

That gets complicated.  Then you need an API to refill it.


 wrmsr(MSR_COMMONHV_RNG) offers the hypervisor up to 64 bits of entropy.
 The hypervisor MAY use it as it sees fit to improve its own random number
 generator.  A hypervisor SHOULD make a reasonable effort to avoid making
 values written to MSR_COMMONHV_RNG visible to untrusted parties, but
 guests SHOULD NOT write sensitive values to wrmsr(MSR_COMMONHV_RNG).

 I don't think unprivileged guests should be able to influence the
 hypervisor's RNG. Unless the intention here is it only affects the
 numbers returned to this guest?


An RNG can be designed to be secure even if malicious users can
provide input.  Linux has one of these, and I assume that Windows
does, too.  Xen doesn't for the entirely legitimate reason that Xen
has no need for such a thing.  (Xen dom0, on the other hand, has
Linux's.)

 But since the write is optional, I don't object to it.

Draft 2 has a bit that Xen could clear to ask the guest not to even
try to use this feature.

I'll send out draft 2 by email later today.  It's on github now, though.

--Andy


Re: [Xen-devel] [RFC] Hypervisor RNG and enumeration

2014-10-30 Thread Ian Campbell
On Thu, 2014-10-30 at 07:45 -0700, Andy Lutomirski wrote:
  Xen does not have a continual source of entropy and the only feasible
  way is for the toolstack to provide each guest with a fixed size pool of
  random data during guest creation.
 
 
 Xen could seed a very simple per-guest DRBG at guest startup and then
 let the rdmsr call read from it.

I think I'm a bit confused by the intended scope of this facility. The
original spec said:

Note that the CommonHV RNG is not intended to replace stronger, 
asynchronous
paravirtual random number generator interfaces.  It is intended 
primarily
for seeding guest RNGs early in boot.

Which to me reads that the guest should be using this facility to seed
it's own simple DRBG on boot (with some finite amount of seed data from
the hv) and then using that until it can switch to something better. Is
that not the intention?

I think it's important to nail down the intended scope of this
interface, since it has quite an impact on what would be considered a
reasonable common design. 

Post boot I would as you say expect most OSes to switch over to
something more capable, not continue to rely on this facility for the
duration.

Ian.



[kvm-unit-tests PATCH 0/6] arm: enable MMU

2014-10-30 Thread Andrew Jones
This first patch of this series fixes a bug caused by attempting
to use spinlocks without enabling the MMU. The next three do some
prep for the fifth, and also fix arm's PAGE_ALIGN. The fifth is
prep for the sixth, which finally turns the MMU on for arm unit
tests.

Andrew Jones (6):
  arm: fix crash on cubietruck
  lib: add ALIGN() macro
  lib: steal const.h from kernel
  arm: apply ALIGN() and const.h to arm files
  arm: import some Linux page table API
  arm: turn on the MMU

 arm/cstart.S| 33 +++
 config/config-arm.mak   |  3 ++-
 lib/alloc.c |  4 +--
 lib/arm/asm/mmu.h   | 43 ++
 lib/arm/asm/page.h  | 43 +++---
 lib/arm/asm/pgtable-hwdef.h | 65 +
 lib/arm/mmu.c   | 53 
 lib/arm/processor.c | 11 
 lib/arm/setup.c |  3 +++
 lib/arm/spinlock.c  |  7 +
 lib/asm-generic/page.h  | 17 ++--
 lib/const.h | 11 
 lib/libcflat.h  |  4 +++
 13 files changed, 275 insertions(+), 22 deletions(-)
 create mode 100644 lib/arm/asm/mmu.h
 create mode 100644 lib/arm/asm/pgtable-hwdef.h
 create mode 100644 lib/arm/mmu.c
 create mode 100644 lib/const.h

-- 
1.9.3



[PATCH 1/6] arm: fix crash on cubietruck

2014-10-30 Thread Andrew Jones
Cubietruck seems to be more sensitive than my Midway when
attempting to use [ldr|str]ex instructions without caches
enabled (mmu disabled). Fix this by making the spinlock
implementation (currently the only user of *ex instructions)
conditional on the mmu being enabled.

Signed-off-by: Andrew Jones drjo...@redhat.com
---
 lib/arm/asm/mmu.h  | 11 +++
 lib/arm/spinlock.c |  7 +++
 2 files changed, 18 insertions(+)
 create mode 100644 lib/arm/asm/mmu.h

diff --git a/lib/arm/asm/mmu.h b/lib/arm/asm/mmu.h
new file mode 100644
index 0..987928b2c432c
--- /dev/null
+++ b/lib/arm/asm/mmu.h
@@ -0,0 +1,11 @@
+#ifndef __ASMARM_MMU_H_
+#define __ASMARM_MMU_H_
+/*
+ * Copyright (C) 2014, Red Hat Inc, Andrew Jones drjo...@redhat.com
+ *
+ * This work is licensed under the terms of the GNU LGPL, version 2.
+ */
+
+#define mmu_enabled() (0)
+
+#endif /* __ASMARM_MMU_H_ */
diff --git a/lib/arm/spinlock.c b/lib/arm/spinlock.c
index d8a6d4c3383d6..e2bb1ace43c4e 100644
--- a/lib/arm/spinlock.c
+++ b/lib/arm/spinlock.c
@@ -1,12 +1,19 @@
 #include "libcflat.h"
 #include "asm/spinlock.h"
 #include "asm/barrier.h"
+#include "asm/mmu.h"
 
 void spin_lock(struct spinlock *lock)
 {
u32 val, fail;
 
dmb();
+
+   if (!mmu_enabled()) {
+   lock->v = 1;
+   return;
+   }
+
do {
asm volatile(
"1: ldrex   %0, [%2]\n"
-- 
1.9.3



[PATCH 5/6] arm: import some Linux page table API

2014-10-30 Thread Andrew Jones
Signed-off-by: Andrew Jones drjo...@redhat.com
---
 lib/arm/asm/page.h  | 22 +++
 lib/arm/asm/pgtable-hwdef.h | 65 +
 2 files changed, 87 insertions(+)
 create mode 100644 lib/arm/asm/pgtable-hwdef.h

diff --git a/lib/arm/asm/page.h b/lib/arm/asm/page.h
index 4602d735b7886..6ff849a0c0e3b 100644
--- a/lib/arm/asm/page.h
+++ b/lib/arm/asm/page.h
@@ -18,6 +18,28 @@
 
 #include "asm/setup.h"
 
+typedef u64 pteval_t;
+typedef u64 pmdval_t;
+typedef u64 pgdval_t;
+typedef struct { pteval_t pte; } pte_t;
+typedef struct { pmdval_t pmd; } pmd_t;
+typedef struct { pgdval_t pgd; } pgd_t;
+typedef struct { pteval_t pgprot; } pgprot_t;
+
+#define pte_val(x) ((x).pte)
+#define pmd_val(x) ((x).pmd)
+#define pgd_val(x) ((x).pgd)
+#define pgprot_val(x)  ((x).pgprot)
+
+#define __pte(x)   ((pte_t) { (x) } )
+#define __pmd(x)   ((pmd_t) { (x) } )
+#define __pgd(x)   ((pgd_t) { (x) } )
+#define __pgprot(x)((pgprot_t) { (x) } )
+
+typedef struct { pgd_t pgd; } pud_t;
+#define pud_val(x) (pgd_val((x).pgd))
+#define __pud(x)   ((pud_t) { __pgd(x) } )
+
 #ifndef __virt_to_phys
 #define __phys_to_virt(x)  ((unsigned long) (x))
 #define __virt_to_phys(x)  (x)
diff --git a/lib/arm/asm/pgtable-hwdef.h b/lib/arm/asm/pgtable-hwdef.h
new file mode 100644
index 0..a2564aaca05a3
--- /dev/null
+++ b/lib/arm/asm/pgtable-hwdef.h
@@ -0,0 +1,65 @@
+#ifndef _ASMARM_PGTABLE_HWDEF_H_
+#define _ASMARM_PGTABLE_HWDEF_H_
+/*
+ * From arch/arm/include/asm/pgtable-3level-hwdef.h
+ */
+
+/*
+ * Hardware page table definitions.
+ *
+ * + Level 1/2 descriptor
+ *   - common
+ */
+#define PMD_TYPE_MASK  (_AT(pmdval_t, 3) << 0)
+#define PMD_TYPE_FAULT (_AT(pmdval_t, 0) << 0)
+#define PMD_TYPE_TABLE (_AT(pmdval_t, 3) << 0)
+#define PMD_TYPE_SECT  (_AT(pmdval_t, 1) << 0)
+#define PMD_TABLE_BIT  (_AT(pmdval_t, 1) << 1)
+#define PMD_BIT4   (_AT(pmdval_t, 0))
+#define PMD_DOMAIN(x)  (_AT(pmdval_t, 0))
+#define PMD_APTABLE_SHIFT  (61)
+#define PMD_APTABLE    (_AT(pgdval_t, 3) << PGD_APTABLE_SHIFT)
+#define PMD_PXNTABLE   (_AT(pgdval_t, 1) << 59)
+
+/*
+ *   - section
+ */
+#define PMD_SECT_BUFFERABLE    (_AT(pmdval_t, 1) << 2)
+#define PMD_SECT_CACHEABLE (_AT(pmdval_t, 1) << 3)
+#define PMD_SECT_USER  (_AT(pmdval_t, 1) << 6) /* AP[1] */
+#define PMD_SECT_AP2   (_AT(pmdval_t, 1) << 7) /* read only */
+#define PMD_SECT_S (_AT(pmdval_t, 3) << 8)
+#define PMD_SECT_AF    (_AT(pmdval_t, 1) << 10)
+#define PMD_SECT_nG    (_AT(pmdval_t, 1) << 11)
+#define PMD_SECT_PXN   (_AT(pmdval_t, 1) << 53)
+#define PMD_SECT_XN    (_AT(pmdval_t, 1) << 54)
+#define PMD_SECT_AP_WRITE  (_AT(pmdval_t, 0))
+#define PMD_SECT_AP_READ   (_AT(pmdval_t, 0))
+#define PMD_SECT_AP1   (_AT(pmdval_t, 1) << 6)
+#define PMD_SECT_TEX(x)    (_AT(pmdval_t, 0))
+
+/*
+ * AttrIndx[2:0] encoding (mapping attributes defined in the MAIR* registers).
+ */
+#define PMD_SECT_UNCACHED  (_AT(pmdval_t, 0) << 2) /* strongly ordered */
+#define PMD_SECT_BUFFERED  (_AT(pmdval_t, 1) << 2) /* normal non-cacheable */
+#define PMD_SECT_WT    (_AT(pmdval_t, 2) << 2) /* normal inner write-through */
+#define PMD_SECT_WB    (_AT(pmdval_t, 3) << 2) /* normal inner write-back */
+#define PMD_SECT_WBWA  (_AT(pmdval_t, 7) << 2) /* normal inner write-alloc */
+
+/*
+ * + Level 3 descriptor (PTE)
+ */
+#define PTE_TYPE_MASK  (_AT(pteval_t, 3) << 0)
+#define PTE_TYPE_FAULT (_AT(pteval_t, 0) << 0)
+#define PTE_TYPE_PAGE  (_AT(pteval_t, 3) << 0)
+#define PTE_TABLE_BIT  (_AT(pteval_t, 1) << 1)
+#define PTE_BUFFERABLE (_AT(pteval_t, 1) << 2) /* AttrIndx[0] */
+#define PTE_CACHEABLE  (_AT(pteval_t, 1) << 3) /* AttrIndx[1] */
+#define PTE_AP2        (_AT(pteval_t, 1) << 7) /* AP[2] */
+#define PTE_EXT_SHARED (_AT(pteval_t, 3) << 8) /* SH[1:0], inner shareable */
+#define PTE_EXT_AF (_AT(pteval_t, 1) << 10)    /* Access Flag */
+#define PTE_EXT_NG (_AT(pteval_t, 1) << 11)    /* nG */
+#define PTE_EXT_XN (_AT(pteval_t, 1) << 54)    /* XN */
+
+#endif /* _ASMARM_PGTABLE_HWDEF_H_ */
-- 
1.9.3



[PATCH 2/6] lib: add ALIGN() macro

2014-10-30 Thread Andrew Jones
Add a type-considerate ALIGN[_UP] macro to libcflat, and apply
it to /lib code that can make use of it. This will be used to
fix PAGE_ALIGN on arm, which can be used on phys_addr_t
addresses, which may exceed 32 bits.

Signed-off-by: Andrew Jones drjo...@redhat.com
---
 lib/alloc.c| 4 +---
 lib/asm-generic/page.h | 9 ++---
 lib/libcflat.h | 4 
 3 files changed, 11 insertions(+), 6 deletions(-)

diff --git a/lib/alloc.c b/lib/alloc.c
index 5d55e285dcd1d..1abe4961ae9dd 100644
--- a/lib/alloc.c
+++ b/lib/alloc.c
@@ -7,8 +7,6 @@
 #include "asm/spinlock.h"
 #include "asm/io.h"
 
-#define ALIGN_UP_MASK(x, mask) (((x) + (mask)) & ~(mask))
-#define ALIGN_UP(x, a) ALIGN_UP_MASK(x, (typeof(x))(a) - 1)
 #define MIN(a, b)  ((a) < (b) ? (a) : (b))
 #define MAX(a, b)  ((a) > (b) ? (a) : (b))
 
@@ -70,7 +68,7 @@ static phys_addr_t phys_alloc_aligned_safe(phys_addr_t size,
 
spin_lock(lock);
 
-   addr = ALIGN_UP(base, align);
+   addr = ALIGN(base, align);
size += addr - base;
 
if ((top_safe - base) < size) {
diff --git a/lib/asm-generic/page.h b/lib/asm-generic/page.h
index 559938fcf0b3f..8602752002f71 100644
--- a/lib/asm-generic/page.h
+++ b/lib/asm-generic/page.h
@@ -16,13 +16,16 @@
 #define PAGE_SIZE  (1 << PAGE_SHIFT)
 #endif
 #define PAGE_MASK  (~(PAGE_SIZE-1))
-#define PAGE_ALIGN(addr)   (((addr) + (PAGE_SIZE-1)) & PAGE_MASK)
 
 #ifndef __ASSEMBLY__
+
+#define PAGE_ALIGN(addr)   ALIGN(addr, PAGE_SIZE)
+
 #define __va(x)((void *)((unsigned long) (x)))
 #define __pa(x)((unsigned long) (x))
 #define virt_to_pfn(kaddr) (__pa(kaddr) >> PAGE_SHIFT)
 #define pfn_to_virt(pfn)   __va((pfn) << PAGE_SHIFT)
-#endif
 
-#endif
+#endif /* __ASSEMBLY__ */
+
+#endif /* _ASM_GENERIC_PAGE_H_ */
diff --git a/lib/libcflat.h b/lib/libcflat.h
index a43eba0329f8e..7db29a4f4f3cb 100644
--- a/lib/libcflat.h
+++ b/lib/libcflat.h
@@ -30,6 +30,10 @@
 #define xstr(s) xxstr(s)
 #define xxstr(s) #s
 
+#define __ALIGN_MASK(x, mask)  (((x) + (mask)) & ~(mask))
+#define __ALIGN(x, a)  __ALIGN_MASK(x, (typeof(x))(a) - 1)
+#define ALIGN(x, a)__ALIGN((x), (a))
+
 typedef uint8_tu8;
 typedef int8_t s8;
 typedef uint16_t   u16;
-- 
1.9.3



[PATCH 4/6] arm: apply ALIGN() and const.h to arm files

2014-10-30 Thread Andrew Jones
This fixes PAGE_ALIGN for greater than 32-bit addresses.
Also fix up some whitespace in lib/arm/asm/page.h

Signed-off-by: Andrew Jones drjo...@redhat.com
---
 lib/arm/asm/page.h | 21 +++--
 1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/lib/arm/asm/page.h b/lib/arm/asm/page.h
index 606d76f5775cf..4602d735b7886 100644
--- a/lib/arm/asm/page.h
+++ b/lib/arm/asm/page.h
@@ -6,16 +6,16 @@
  * This work is licensed under the terms of the GNU LGPL, version 2.
  */
 
+#include "const.h"
+
 #define PAGE_SHIFT 12
-#ifndef __ASSEMBLY__
-#define PAGE_SIZE  (1UL << PAGE_SHIFT)
-#else
-#define PAGE_SIZE  (1 << PAGE_SHIFT)
-#endif
+#define PAGE_SIZE  (_AC(1,UL) << PAGE_SHIFT)
 #define PAGE_MASK  (~(PAGE_SIZE-1))
-#define PAGE_ALIGN(addr)   (((addr) + (PAGE_SIZE-1))  PAGE_MASK)
 
 #ifndef __ASSEMBLY__
+
+#define PAGE_ALIGN(addr)   ALIGN(addr, PAGE_SIZE)
+
 #include "asm/setup.h"
 
 #ifndef __virt_to_phys
@@ -26,8 +26,9 @@
 #define __va(x)((void *)__phys_to_virt((phys_addr_t)(x)))
 #define __pa(x)__virt_to_phys((unsigned long)(x))
 
-#define virt_to_pfn(kaddr)  (__pa(kaddr) >> PAGE_SHIFT)
-#define pfn_to_virt(pfn)__va((pfn) << PAGE_SHIFT)
-#endif
+#define virt_to_pfn(kaddr) (__pa(kaddr) >> PAGE_SHIFT)
+#define pfn_to_virt(pfn)   __va((pfn) << PAGE_SHIFT)
 
-#endif
+#endif /* __ASSEMBLY__ */
+
+#endif /* _ASMARM_PAGE_H_ */
-- 
1.9.3



[PATCH 3/6] lib: steal const.h from kernel

2014-10-30 Thread Andrew Jones
And apply it to /lib files. This prepares for the import of
kernel headers that make use of the const.h macros.

Signed-off-by: Andrew Jones drjo...@redhat.com
---
 lib/asm-generic/page.h |  8 +++-
 lib/const.h| 11 +++
 2 files changed, 14 insertions(+), 5 deletions(-)
 create mode 100644 lib/const.h

diff --git a/lib/asm-generic/page.h b/lib/asm-generic/page.h
index 8602752002f71..66c72a62bb0f7 100644
--- a/lib/asm-generic/page.h
+++ b/lib/asm-generic/page.h
@@ -9,12 +9,10 @@
  * This work is licensed under the terms of the GNU LGPL, version 2.
  */
 
+#include "const.h"
+
 #define PAGE_SHIFT 12
-#ifndef __ASSEMBLY__
-#define PAGE_SIZE  (1UL << PAGE_SHIFT)
-#else
-#define PAGE_SIZE  (1 << PAGE_SHIFT)
-#endif
+#define PAGE_SIZE  (_AC(1,UL) << PAGE_SHIFT)
 #define PAGE_MASK  (~(PAGE_SIZE-1))
 
 #ifndef __ASSEMBLY__
diff --git a/lib/const.h b/lib/const.h
new file mode 100644
index 0..5cd94d7067541
--- /dev/null
+++ b/lib/const.h
@@ -0,0 +1,11 @@
+#ifndef _CONST_H_
+#define _CONST_H_
+#ifdef __ASSEMBLY__
+#define _AC(X,Y)   X
+#define _AT(T,X)   X
+#else
+#define __AC(X,Y)  (X##Y)
+#define _AC(X,Y)   __AC(X,Y)
+#define _AT(T,X)   ((T)(X))
+#endif
+#endif
-- 
1.9.3



[PATCH 6/6] arm: turn on the MMU

2014-10-30 Thread Andrew Jones
We should probably always run with the mmu on, so let's
enable it from setup with an identity map.

Signed-off-by: Andrew Jones drjo...@redhat.com
---
 arm/cstart.S  | 33 
 config/config-arm.mak |  3 ++-
 lib/arm/asm/mmu.h | 34 -
 lib/arm/mmu.c | 53 +++
 lib/arm/processor.c   | 11 +++
 lib/arm/setup.c   |  3 +++
 6 files changed, 135 insertions(+), 2 deletions(-)
 create mode 100644 lib/arm/mmu.c

diff --git a/arm/cstart.S b/arm/cstart.S
index cc87ece4b6b40..a1ccfb24bb4e0 100644
--- a/arm/cstart.S
+++ b/arm/cstart.S
@@ -72,6 +72,39 @@ halt:
b   1b
 
 /*
+ * asm_mmu_enable
+ *   Inputs:
+ * (r0 - lo, r1 - hi) is the base address of the translation table
+ *   Outputs: none
+ */
+.equ   PRRR,   0xeeaa4400  @ MAIR0 (from Linux kernel)
+.equ   NMRR,   0xff04  @ MAIR1 (from Linux kernel)
+.globl asm_mmu_enable
+asm_mmu_enable:
+   /* TTBCR */
+   mrc p15, 0, r2, c2, c0, 2
+   orr r2, #(1 << 31)  @ TTB_EAE
+   mcr p15, 0, r2, c2, c0, 2
+
+   /* MAIR */
+   ldr r2, =PRRR
+   mcr p15, 0, r2, c10, c2, 0
+   ldr r2, =NMRR
+   mcr p15, 0, r2, c10, c2, 1
+
+   /* TTBR0 */
+   mcrr    p15, 0, r0, r1, c2
+
+   /* SCTLR */
+   mrc p15, 0, r2, c1, c0, 0
+   orr r2, #CR_C
+   orr r2, #CR_I
+   orr r2, #CR_M
+   mcr p15, 0, r2, c1, c0, 0
+
+   mov pc, lr
+
+/*
  * Vector stubs
  * Simplified version of the Linux kernel implementation
  *   arch/arm/kernel/entry-armv.S
diff --git a/config/config-arm.mak b/config/config-arm.mak
index 8a274c50332b0..86e1d75169b59 100644
--- a/config/config-arm.mak
+++ b/config/config-arm.mak
@@ -42,7 +42,8 @@ cflatobjs += \
lib/arm/io.o \
lib/arm/setup.o \
lib/arm/spinlock.o \
-   lib/arm/processor.o
+   lib/arm/processor.o \
+   lib/arm/mmu.o
 
 libeabi = lib/arm/libeabi.a
 eabiobjs = lib/arm/eabi_compat.o
diff --git a/lib/arm/asm/mmu.h b/lib/arm/asm/mmu.h
index 987928b2c432c..451c7493c2aba 100644
--- a/lib/arm/asm/mmu.h
+++ b/lib/arm/asm/mmu.h
@@ -5,7 +5,39 @@
  *
  * This work is licensed under the terms of the GNU LGPL, version 2.
  */
+#include "asm/page.h"
+#include "asm/barrier.h"
+#include "alloc.h"
 
-#define mmu_enabled() (0)
+#define PTRS_PER_PGD   4
+#define PGDIR_SHIFT30
+#define PGDIR_SIZE (1UL << PGDIR_SHIFT)
+#define PGDIR_MASK (~((1 << PGDIR_SHIFT) - 1))
+
+#define pgd_free(pgd) free(pgd)
+static inline pgd_t *pgd_alloc(void)
+{
+   pgd_t *pgd = memalign(L1_CACHE_BYTES, PTRS_PER_PGD * sizeof(pgd_t));
+   memset(pgd, 0, PTRS_PER_PGD * sizeof(pgd_t));
+   return pgd;
+}
+
+static inline void local_flush_tlb_all(void)
+{
+   asm volatile("mcr p15, 0, %0, c8, c7, 0" :: "r" (0));
+   dsb();
+   isb();
+}
+
+static inline void flush_tlb_all(void)
+{
+   //TODO
+   local_flush_tlb_all();
+}
+
+extern bool mmu_enabled(void);
+extern void mmu_enable(pgd_t *pgtable);
+extern void mmu_enable_idmap(void);
+extern void mmu_init_io_sect(pgd_t *pgtable);
 
 #endif /* __ASMARM_MMU_H_ */
diff --git a/lib/arm/mmu.c b/lib/arm/mmu.c
new file mode 100644
index 0..c9d39bf6464b8
--- /dev/null
+++ b/lib/arm/mmu.c
@@ -0,0 +1,53 @@
+/*
+ * MMU enable and page table manipulation functions
+ *
+ * Copyright (C) 2014, Red Hat Inc, Andrew Jones drjo...@redhat.com
+ *
+ * This work is licensed under the terms of the GNU LGPL, version 2.
+ */
+#include "asm/setup.h"
+#include "asm/mmu.h"
+#include "asm/pgtable-hwdef.h"
+
+static bool mmu_on;
+static pgd_t idmap[PTRS_PER_PGD] __attribute__((aligned(L1_CACHE_BYTES)));
+
+bool mmu_enabled(void)
+{
+   return mmu_on;
+}
+
+extern void asm_mmu_enable(phys_addr_t pgtable);
+void mmu_enable(pgd_t *pgtable)
+{
+   asm_mmu_enable(__pa(pgtable));
+   flush_tlb_all();
+   mmu_on = true;
+}
+
+void mmu_init_io_sect(pgd_t *pgtable)
+{
+   /*
+* mach-virt reserves the first 1G section for I/O
+*/
+   pgd_val(pgtable[0]) = PMD_TYPE_SECT | PMD_SECT_AF | PMD_SECT_USER;
+   pgd_val(pgtable[0]) |= PMD_SECT_UNCACHED;
+}
+
+void mmu_enable_idmap(void)
+{
+   unsigned long sect, end;
+
+   mmu_init_io_sect(idmap);
+
+   end = sizeof(long) == 8 || !(PHYS_END >> 32) ? PHYS_END : 0xf000;
+
+   for (sect = PHYS_OFFSET & PGDIR_MASK; sect < end; sect += PGDIR_SIZE) {
+   int i = sect >> PGDIR_SHIFT;
+   pgd_val(idmap[i]) = sect;
+   pgd_val(idmap[i]) |= PMD_TYPE_SECT | PMD_SECT_AF | PMD_SECT_USER;
+   pgd_val(idmap[i]) |= PMD_SECT_S | PMD_SECT_WBWA;
+   }
+
+   mmu_enable(idmap);
+}
diff --git a/lib/arm/processor.c b/lib/arm/processor.c
index 382a128edd415..866d11975b23b 100644
--- a/lib/arm/processor.c
+++ b/lib/arm/processor.c
@@ -92,6 +92,17 @@ void do_handle_exception(enum vector v, 

Re: [PATCH kvm-unit-tests] arm: fix crash when caches are off

2014-10-30 Thread Andrew Jones
On Fri, Sep 26, 2014 at 09:51:15AM +0200, Christoffer Dall wrote:
 On Tue, Sep 16, 2014 at 08:57:31AM -0400, Andrew Jones wrote:
  
  
  - Original Message -
On 16/09/2014 14:43, Andrew Jones wrote:
I don't think we need to worry about this case. AFAIU, enabling the
caches for a particular cpu shouldn't require any synchronization.
So we should be able to do

enable caches
spin_lock
start other processors
spin_unlock
   
   Ok, I'll test and apply your patch then.
   
   Once you change the code to enable caches, please consider hanging on
   spin_lock with caches disabled.
  
  Unfortunately I can't do that without changing spin_lock into a wrapper
  function. Early setup code calls functions that use spin_locks, e.g.
  puts(), and we won't want to move the cache enablement into early setup
  code, as that should be left for unit tests to turn on off as they wish.
  Thus we either need to be able to change the spin_lock implementation
  dynamically, or just leave the test/return as is.
  
 My take on this whole thing is that we're doing something fundamentally
 wrong.  I think what we should do is to always enable the MMU for
 running actual tests, bringing up multiple CPUs etc.  We could have an
 early_printf() that doesn't use the spinlock.  I think this will just be
 a more stable setup.
 
 Do we have clear ideas of which kinds of tests it would make sense to
 run without the MMU turned on?  If we can be more concrete on this
 subject, perhaps a special path (or build) that doesn't enable the MMU
 for running the aforementioned test cases could be added.
 

Finally carving out kvm-unit-tests time again and fixed this properly.
A series is on the list [kvm-unit-tests PATCH 0/6] arm: enable MMU.

drew


suspicious rcu_dereference_check() usage warning with 3.18-rc2

2014-10-30 Thread Takashi Iwai
Hi,

I've got a warning with the latest Linus tree like below:

[ INFO: suspicious RCU usage. ]
3.18.0-rc2-test2+ #70 Not tainted
---
include/linux/kvm_host.h:474 suspicious rcu_dereference_check() usage!

other info that might help us debug this:

rcu_scheduler_active = 1, debug_locks = 0
1 lock held by qemu-system-x86/2371:
 #0:  (&vcpu->mutex){+.+...}, at: [a037d800] vcpu_load+0x20/0xd0 [kvm]

stack backtrace:
CPU: 4 PID: 2371 Comm: qemu-system-x86 Not tainted 3.18.0-rc2-test2+ #70
Hardware name: Dell Inc. OptiPlex 9010/0M9KCM, BIOS A12 01/10/2013
 0001 880209983ca8 816f514f 
 8802099b8990 880209983cd8 810bd687 000fee00
 880208a2c000 880208a1 88020ef50040 880209983d08
Call Trace:
 [816f514f] dump_stack+0x4e/0x71
 [810bd687] lockdep_rcu_suspicious+0xe7/0x120
 [a037d055] gfn_to_memslot+0xd5/0xe0 [kvm]
 [a03807d3] __gfn_to_pfn+0x33/0x60 [kvm]
 [a0380885] gfn_to_page+0x25/0x90 [kvm]
 [a038aeec] kvm_vcpu_reload_apic_access_page+0x3c/0x80 [kvm]
 [a08f0a9c] vmx_vcpu_reset+0x20c/0x460 [kvm_intel]
 [a039ab8e] kvm_vcpu_reset+0x15e/0x1b0 [kvm]
 [a039ac0c] kvm_arch_vcpu_setup+0x2c/0x50 [kvm]
 [a037f7e0] kvm_vm_ioctl+0x1d0/0x780 [kvm]
 [810bc664] ? __lock_is_held+0x54/0x80
 [812231f0] do_vfs_ioctl+0x300/0x520
 [8122ee45] ? __fget+0x5/0x250
 [8122f0fa] ? __fget_light+0x2a/0xe0
 [81223491] SyS_ioctl+0x81/0xa0
 [816fed6d] system_call_fastpath+0x16/0x1b
kvm: zapping shadow pages for mmio generation wraparound
kvm [2369]: vcpu0 disabled perfctr wrmsr: 0xc1 data 0x


The machine itself and KVM work fine even after this warning.  I'm not
sure whether this is new, maybe it's triggered now since I changed my
Kconfig to cover more RCU testing recently.  The warning is
reproducible, I can see it at the first invocation of kvm after each
fresh boot.

Does this ring a bell to anyone?


thanks,

Takashi


Re: [PATCH RFC 00/11] qemu: towards virtio-1 host support

2014-10-30 Thread Cornelia Huck
On Tue, 28 Oct 2014 06:43:29 +0200
Michael S. Tsirkin m...@redhat.com wrote:

 On Fri, Oct 24, 2014 at 10:38:39AM +0200, Cornelia Huck wrote:
  On Fri, 24 Oct 2014 00:42:20 +0300
  Michael S. Tsirkin m...@redhat.com wrote:
  
   On Tue, Oct 07, 2014 at 04:39:56PM +0200, Cornelia Huck wrote:
This patchset aims to get us some way to implement virtio-1 compliant
and transitional devices in qemu. Branch available at

git://github.com/cohuck/qemu virtio-1

I've mainly focused on:
- endianness handling
- extended feature bits
- virtio-ccw new/changed commands
   
   So issues identified so far:
  
  Thanks for taking a look.
  
   - devices not converted yet should not advertize 1.0
  
  Neither should an unconverted transport. So we either can
  - have transport set the bit and rely on devices -get_features
callback to mask it out
(virtio-ccw has to change the calling order for get_features, btw.)
  - have device set the bit and the transport mask it out later. Feels a
bit weird, as virtio-1 is a transport feature bit.
 
 
 I thought more about it, I think the right thing
 would be for unconverted transports to clear
 high bits on ack and get features.

This should work out of the box with my patches (virtio-pci and
virtio-mmio return 0 for high feature bits).

 
 So bit 32 is set, but not exposed to guests.
 In fact at least for PCI, we have a 32 bit field for
 features in 0.9 so it's automatic.
 Didn't check mmio yet.

We still need to make sure the bit is not set for unconverted devices,
though. But you're probably right that having the device set the bit is
less error-prone.



Re: [Qemu-devel] [PATCH RFC 05/11] virtio: introduce legacy virtio devices

2014-10-30 Thread Cornelia Huck
On Tue, 28 Oct 2014 16:40:18 +0100
Greg Kurz gk...@linux.vnet.ibm.com wrote:

 On Tue,  7 Oct 2014 16:40:01 +0200
 Cornelia Huck cornelia.h...@de.ibm.com wrote:
 
  Introduce a helper function to indicate  whether a virtio device is
  operating in legacy or virtio standard mode.
  
  It may be used to make decisions about the endianess of virtio accesses
  and other virtio-1 specific changes, enabling us to support transitional
  devices.
  
  Reviewed-by: Thomas Huth th...@linux.vnet.ibm.com
  Signed-off-by: Cornelia Huck cornelia.h...@de.ibm.com
  ---
   hw/virtio/virtio.c|6 +-
   include/hw/virtio/virtio-access.h |4 
   include/hw/virtio/virtio.h|   13 +++--
   3 files changed, 20 insertions(+), 3 deletions(-)
  
  diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
  index 7aaa953..e6ae3a0 100644
  --- a/hw/virtio/virtio.c
  +++ b/hw/virtio/virtio.c
  @@ -883,7 +883,11 @@ static bool virtio_device_endian_needed(void *opaque)
   VirtIODevice *vdev = opaque;
  
    assert(vdev->device_endian != VIRTIO_DEVICE_ENDIAN_UNKNOWN);
   -return vdev->device_endian != virtio_default_endian();
   +if (virtio_device_is_legacy(vdev)) {
   +return vdev->device_endian != virtio_default_endian();
   +}
   +/* Devices conforming to VIRTIO 1.0 or later are always LE. */
   +return vdev->device_endian != VIRTIO_DEVICE_ENDIAN_LITTLE;
   }
  
 
 Shouldn't we have some code doing the following somewhere ?
 
 if (!virtio_device_is_legacy(vdev)) {
  vdev->device_endian = VIRTIO_DEVICE_ENDIAN_LITTLE;
 }
 
 also, since virtio-1 is LE only, do we expect device_endian to
 be different from VIRTIO_DEVICE_ENDIAN_LITTLE ?

device_endian should not depend on whether the device is legacy or not.
virtio_is_big_endian always returns false for virtio-1 devices, though.



Re: [PATCH v12 2/6] KVM: Add generic support for dirty page logging

2014-10-30 Thread Mario Smarduch
On 10/30/2014 05:14 AM, Cornelia Huck wrote:
 On Wed, 22 Oct 2014 15:34:07 -0700
 Mario Smarduch m.smard...@samsung.com wrote:
 
 This patch defines KVM_GENERIC_DIRTYLOG, and moves dirty log read function
 to it's own file virt/kvm/dirtylog.c. x86 is updated to use the generic
 dirty log interface, selecting KVM_GENERIC_DIRTYLOG in its Kconfig and 
 makefile. No other architectures are affected, each uses it's own version.
 This changed from previous patch revision where non-generic architectures 
 were modified.

 In subsequent patch armv7 does samething. All other architectures continue
 use architecture defined version.

 
 Hm.
 
 The x86 specific version of dirty page logging is generic enough to be
 used by other architectures, noteably ARMv7. So let's move the x86 code
 under virt/kvm/ and make it depend on KVM_GENERIC_DIRTYLOG. Other
 architectures continue to use their own implementations.
 
 ?

I'll update descriptions for both patches, with the more concise
descriptions.

Thanks.

 

 Signed-off-by: Mario Smarduch m.smard...@samsung.com
 ---
  arch/x86/include/asm/kvm_host.h |3 --
  arch/x86/kvm/Kconfig|1 +
  arch/x86/kvm/Makefile   |1 +
  arch/x86/kvm/x86.c  |   86 --
  include/linux/kvm_host.h|4 ++
  virt/kvm/Kconfig|3 ++
  virt/kvm/dirtylog.c |  112 
 +++
  7 files changed, 121 insertions(+), 89 deletions(-)
  create mode 100644 virt/kvm/dirtylog.c

 
 diff --git a/virt/kvm/dirtylog.c b/virt/kvm/dirtylog.c
 new file mode 100644
 index 000..67a
 --- /dev/null
 +++ b/virt/kvm/dirtylog.c
 @@ -0,0 +1,112 @@
 +/*
 + * kvm generic dirty logging support, used by architectures that share
 + * comman dirty page logging implementation.
 
 s/comman/common/
 
 The approach looks sane to me, especially as it does not change other
 architectures needlessly.
 



Re: [PATCH v12 0/6] arm/KVM: dirty page logging support for ARMv7 (3.17.0-rc1)

2014-10-30 Thread Mario Smarduch
On 10/30/2014 05:11 AM, Christian Borntraeger wrote:
 Am 23.10.2014 00:34, schrieb Mario Smarduch:
 This patch series introduces dirty page logging for ARMv7 and adds some 
 degree 
 of generic dirty logging support for x86, armv7 and later armv8.

 I implemented Alex's  suggestion after he took a look at the patches at kvm
 forum to simplify the generic/arch split - leaving mips, powerpc, s390, 
 (ia64 although broken) unchanged. x86/armv7 now share some dirty logging 
 code. 
 armv8 dirty log patches have been posted and tested but for time being armv8
 is non-generic as well.

 I briefly spoke to most of you at kvm forum, and this is the patch series
 I was referring to. Implementation changed from previous version (patches
 1  2), those who acked previous revision, please review again.

 Last 4 patches (ARM) have been rebased for newer kernel, with no signifcant
 changes.

 Testing:
 - Generally live migration + checksumming of source/destination memory 
 regions 
   is used validate correctness. 
 - qemu machvirt, VExpress - Exynos 5440, FastModels - lmbench + dirty guest
   memory cycling.
 - ARMv8 Foundation Model/kvmtool - Due to slight overlap in 2nd stage 
 handlers
   did a basic bringup using qemu.
 - x86_64 qemu  default machine model, tested migration on HP Z620, tested 
   convergence for several dirty page rates

 See https://github.com/mjsmar/arm-dirtylog-tests
 - Dirtlogtest-setup.pdf for ARMv7
 - https://github.com/mjsmar/arm-dirtylog-tests/tree/master/v7 - README

 The patch affects armv7,armv8, mips, ia64, powerpc, s390, x86_64. Patch
 series has been compiled for affected architectures:

 - x86_64 - defconfig 
 - ia64 - ia64-linux-gcc4.6.3 - defconfig, ia64 Kconfig defines BROKEN worked 
   around that to make sure new changes don't break build. Eventually build
   breaks due to other reasons.
 - mips - mips64-linux-gcc4.6.3 - malta_kvm_defconfig
 - ppc - powerpc64-linux-gcc4.6.3 - pseries_defconfig
 - s390 - s390x-linux-gcc4.6.3 - defconfig
 - armv8 - aarch64-linux-gnu-gcc4.8.1 - defconfig

 ARMv7 Dirty page logging implementation overivew-
 - initially write protects VM RAM memory region - 2nd stage page tables
 - add support to read dirty page log and again write protect the dirty pages 
   - second stage page table for next pass.
 - second stage huge page are dissolved into small page tables to keep track 
 of
   dirty pages at page granularity. Tracking at huge page granularity limits
   migration to an almost idle system. Small page size logging supports 
 higher 
   memory dirty rates.
 - In the event migration is canceled, normal behavior is resumed huge pages
   are rebuilt over time.

 Changes since v11:
 - Implemented Alex's comments to simplify generic layer.

 Changes since v10:
 - addressed wanghaibin comments 
 - addressed Christoffers comments

 Changes since v9:
 - Split patches into generic and architecture specific variants for TLB 
 Flushing
   and dirty log read (patches 1,2  3,4,5,6)
 - rebased to 3.16.0-rc1
 - Applied Christoffers comments.

 Mario Smarduch (6):
   KVM: Add architecture-defined TLB flush support
   KVM: Add generic support for dirty page logging
   arm: KVM: Add ARMv7 API to flush TLBs
   arm: KVM: Add initial dirty page locking infrastructure
   arm: KVM: dirty log read write protect support
   arm: KVM: ARMv7 dirty page logging 2nd stage page fault

  arch/arm/include/asm/kvm_asm.h|1 +
  arch/arm/include/asm/kvm_host.h   |   14 +++
  arch/arm/include/asm/kvm_mmu.h|   20 
  arch/arm/include/asm/pgtable-3level.h |1 +
  arch/arm/kvm/Kconfig  |2 +
  arch/arm/kvm/Makefile |1 +
  arch/arm/kvm/arm.c|2 +
  arch/arm/kvm/interrupts.S |   11 ++
  arch/arm/kvm/mmu.c|  209 
 +++--
  arch/x86/include/asm/kvm_host.h   |3 -
  arch/x86/kvm/Kconfig  |1 +
  arch/x86/kvm/Makefile |1 +
  arch/x86/kvm/x86.c|   86 --
  include/linux/kvm_host.h  |4 +
  virt/kvm/Kconfig  |6 +
  virt/kvm/dirtylog.c   |  112 ++
  virt/kvm/kvm_main.c   |2 +
  17 files changed, 380 insertions(+), 96 deletions(-)
  create mode 100644 virt/kvm/dirtylog.c

 
 Patches 1-3 seem to work fine on s390. The other patches are arm-only (well 
 cant find 5 and 6) so I guess its ok for s390.
 
The patches are there but threading is broken due to the mail server's
message rate threshold. Just in case, links below:
https://lists.cs.columbia.edu/pipermail/kvmarm/2014-October/011730.html
https://lists.cs.columbia.edu/pipermail/kvmarm/2014-October/011731.html

Thanks.



Re: suspicious rcu_dereference_check() usage warning with 3.18-rc2

2014-10-30 Thread Alexei Starovoitov
On Thu, Oct 30, 2014 at 9:44 AM, Takashi Iwai ti...@suse.de wrote:
 Hi,

 I've got a warning with the latest Linus tree like below:

 [ INFO: suspicious RCU usage. ]
 3.18.0-rc2-test2+ #70 Not tainted
 ---
 include/linux/kvm_host.h:474 suspicious rcu_dereference_check() usage!

 other info that might help us debug this:

 rcu_scheduler_active = 1, debug_locks = 0
 1 lock held by qemu-system-x86/2371:
  #0:  (vcpu-mutex){+.+...}, at: [a037d800] vcpu_load+0x20/0xd0 
 [kvm]

 stack backtrace:
 CPU: 4 PID: 2371 Comm: qemu-system-x86 Not tainted 3.18.0-rc2-test2+ #70
 Hardware name: Dell Inc. OptiPlex 9010/0M9KCM, BIOS A12 01/10/2013
  0001 880209983ca8 816f514f 
  8802099b8990 880209983cd8 810bd687 000fee00
  880208a2c000 880208a1 88020ef50040 880209983d08
 Call Trace:
  [816f514f] dump_stack+0x4e/0x71
  [810bd687] lockdep_rcu_suspicious+0xe7/0x120
  [a037d055] gfn_to_memslot+0xd5/0xe0 [kvm]
  [a03807d3] __gfn_to_pfn+0x33/0x60 [kvm]
  [a0380885] gfn_to_page+0x25/0x90 [kvm]
  [a038aeec] kvm_vcpu_reload_apic_access_page+0x3c/0x80 [kvm]
  [a08f0a9c] vmx_vcpu_reset+0x20c/0x460 [kvm_intel]
  [a039ab8e] kvm_vcpu_reset+0x15e/0x1b0 [kvm]
  [a039ac0c] kvm_arch_vcpu_setup+0x2c/0x50 [kvm]
  [a037f7e0] kvm_vm_ioctl+0x1d0/0x780 [kvm]
  [810bc664] ? __lock_is_held+0x54/0x80
  [812231f0] do_vfs_ioctl+0x300/0x520
  [8122ee45] ? __fget+0x5/0x250
  [8122f0fa] ? __fget_light+0x2a/0xe0
  [81223491] SyS_ioctl+0x81/0xa0
  [816fed6d] system_call_fastpath+0x16/0x1b
 kvm: zapping shadow pages for mmio generation wraparound
 kvm [2369]: vcpu0 disabled perfctr wrmsr: 0xc1 data 0x


 The machine itself and KVM work fine even after this warning.  I'm not
 sure whether this is new, maybe it's triggered now since I changed my
 Kconfig to cover more RCU testing recently.  The warning is
 reproducible, I can see it at the first invocation of kvm after each
 fresh boot.

 Does this ring a bell to anyone?

see exactly the same trace when lockdep is on.


Re: [Qemu-devel] [PATCH RFC 05/11] virtio: introduce legacy virtio devices

2014-10-30 Thread Greg Kurz
On Thu, 30 Oct 2014 19:02:01 +0100
Cornelia Huck cornelia.h...@de.ibm.com wrote:

 On Tue, 28 Oct 2014 16:40:18 +0100
 Greg Kurz gk...@linux.vnet.ibm.com wrote:
 
  On Tue,  7 Oct 2014 16:40:01 +0200
  Cornelia Huck cornelia.h...@de.ibm.com wrote:
  
   Introduce a helper function to indicate  whether a virtio device is
   operating in legacy or virtio standard mode.
   
   It may be used to make decisions about the endianess of virtio accesses
   and other virtio-1 specific changes, enabling us to support transitional
   devices.
   
   Reviewed-by: Thomas Huth th...@linux.vnet.ibm.com
   Signed-off-by: Cornelia Huck cornelia.h...@de.ibm.com
   ---
hw/virtio/virtio.c|6 +-
include/hw/virtio/virtio-access.h |4 
include/hw/virtio/virtio.h|   13 +++--
3 files changed, 20 insertions(+), 3 deletions(-)
   
   diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
   index 7aaa953..e6ae3a0 100644
   --- a/hw/virtio/virtio.c
   +++ b/hw/virtio/virtio.c
   @@ -883,7 +883,11 @@ static bool virtio_device_endian_needed(void *opaque)
VirtIODevice *vdev = opaque;
   
 assert(vdev->device_endian != VIRTIO_DEVICE_ENDIAN_UNKNOWN);
    -return vdev->device_endian != virtio_default_endian();
    +if (virtio_device_is_legacy(vdev)) {
    +return vdev->device_endian != virtio_default_endian();
    +}
    +/* Devices conforming to VIRTIO 1.0 or later are always LE. */
    +return vdev->device_endian != VIRTIO_DEVICE_ENDIAN_LITTLE;
}
   
  
  Shouldn't we have some code doing the following somewhere ?
  
  if (!virtio_device_is_legacy(vdev)) {
   vdev->device_endian = VIRTIO_DEVICE_ENDIAN_LITTLE;
  }
  
  also, since virtio-1 is LE only, do we expect device_endian to
  be different from VIRTIO_DEVICE_ENDIAN_LITTLE ?
 
 device_endian should not depend on whether the device is legacy or not.
 virtio_is_big_endian always returns false for virtio-1 devices, though.

Sorry, I had missed the virtio_is_big_endian() change: that makes
device_endian a legacy-virtio-only matter.
So why would we care to migrate the endian subsection when we have a
virtio-1 device? Shouldn't virtio_device_endian_needed() return false
for virtio-1?

--
Greg



Re: [PATCH v2 4/6] hw_random: fix unregister race.

2014-10-30 Thread Rusty Russell
Herbert Xu herb...@gondor.apana.org.au writes:
 On Thu, Sep 18, 2014 at 08:37:45PM +0800, Amos Kong wrote:
 From: Rusty Russell ru...@rustcorp.com.au
 
 The previous patch added one potential problem: we can still be
 reading from a hwrng when it's unregistered.  Add a wait for zero
 in the hwrng_unregister path.
 
 Signed-off-by: Rusty Russell ru...@rustcorp.com.au

 You totally corrupted Rusty's patch.  If you're going to repost
 his series you better make sure that you've got the right patches.

 Just as well though as it made me think a little more about this
 patch :)

OK Amos, can you please repost the complete series?

Thanks,
Rusty.


Re: [PATCH 00/17] RFC: userfault v2

2014-10-30 Thread zhanghailiang

On 2014/10/30 20:49, Dr. David Alan Gilbert wrote:

* zhanghailiang (zhang.zhanghaili...@huawei.com) wrote:

On 2014/10/30 1:46, Andrea Arcangeli wrote:

Hi Zhanghailiang,

On Mon, Oct 27, 2014 at 05:32:51PM +0800, zhanghailiang wrote:

Hi Andrea,

Thanks for your hard work on userfault;)

This is really a useful API.

I want to confirm a question:
Can we support distinguishing between writing and reading memory for userfault?
That is, we can decide whether writing a page, reading a page or both trigger 
userfault.

I think this will help supporting vhost-scsi,ivshmem for migration,
we can trace dirty page in userspace.

Actually, i'm trying to realize live memory snapshot based on pre-copy and
userfault,
but reading memory from the migration thread will also trigger userfault.
It will be easy to implement live memory snapshot, if we support configuring
userfault for writing memory only.


Mail is going to be long enough already so I'll just assume tracking
dirty memory in userland (instead of doing it in kernel) is worthy
feature to have here.

After some chat during the KVMForum I've been already thinking it
could be beneficial for some usage to give userland the information
about the fault being read or write, combined with the ability of
mapping pages wrprotected to mcopy_atomic (that would work without
false positives only with MADV_DONTFORK also set, but it's already set
in qemu). That will require vma-vm_flags  VM_USERFAULT to be
checked also in the wrprotect faults, not just in the not present
faults, but it's not a massive change. Returning the read/write
information is also a not massive change. This will then payoff mostly
if there's also a way to remove the memory atomically (kind of
remap_anon_pages).

Would that be enough? I mean are you still ok if non present read
fault traps too (you'd be notified it's a read) and you get
notification for both wrprotect and non present faults?


Hi Andrea,

Thanks for your reply, and your patience;)

Er, maybe i didn't describe clearly. What i really need for live memory snapshot
is only wrprotect fault, like kvm's dirty tracing mechanism, *only tracing 
write action*.

My initial solution scheme for live memory snapshot is:
(1) pause VM
(2) using userfaultfd to mark all memory of VM is wrprotect (readonly)
(3) save deivce state to snapshot file
(4) resume VM
(5) snapshot thread begin to save page of memory to snapshot file
(6) VM is going to run, and it is OK for the VM or other threads to read ram
 (no fault trap), but if the VM tries to write a page (dirty the page),
 there will be a userfault trap notification.
(7) a fault-handle-thread reads the page request from userfaultfd,
 it will copy the content of the page to some buffer, and then remove the
 page's wrprotect limit (still using the userfaultfd to tell the kernel).
(8) after step (7), the VM can continue to write the page, which is now writable.
(9) snapshot thread saves the page cached in step (7)
(10) repeat steps (5)~(9) until all of the VM's memory is saved to the snapshot file.


Hmm, I can see the same process being useful for the fault-tolerance schemes
like COLO, it needs a memory state snapshot.


So, what i need for userfault is supporting only wrprotect fault. i don't
want to get notification for non present reading faults, it will influence
VM's performance and the efficiency of doing snapshot.


What pages would be non-present at this point - just balloon?



Er, sorry, it should be 'no-present page faults';)


Dave


Also, i think this feature will benefit for migration of ivshmem and vhost-scsi
which have no dirty-page-tracing now.


The question then is how you mark the memory readonly to let the
wrprotect faults trap if the memory already existed and you didn't map
it yourself in the guest with mcopy_atomic with a readonly flag.

My current plan would be:

- keep MADV_USERFAULT|NOUSERFAULT just to set VM_USERFAULT for the
   fast path check in the not-present and wrprotect page fault

- if VM_USERFAULT is set, find if there's a userfaultfd registered
   into that vma too

 if yes engage userfaultfd protocol

 otherwise raise SIGBUS (single threaded apps should be fine with
 SIGBUS and it'll avoid them to spawn a thread in order to talk the
 userfaultfd protocol)

- if userfaultfd protocol is engaged, return read|write fault + fault
   address to read(ufd) syscalls

- leave the userfault resolution mechanism independent of the
   userfaultfd protocol so we keep the two problems separated and we
   don't mix them in the same API which makes it even harder to
   finalize it.

 add mcopy_atomic (with a flag to map the page readonly too)

 The alternative would be to hide mcopy_atomic (and even
 remap_anon_pages in order to remove the memory atomically for
 the externalization into the cloud) as userfaultfd commands to
 write into the fd. But then there would be no much point to keep
 MADV_USERFAULT around if I do so and I could just remove it
 too or it 

Re: suspicious rcu_dereference_check() usage warning with 3.18-rc2

2014-10-30 Thread Wanpeng Li
On Thu, Oct 30, 2014 at 05:44:48PM +0100, Takashi Iwai wrote:
Hi,

I've got a warning with the latest Linus tree like below:

[ INFO: suspicious RCU usage. ]
3.18.0-rc2-test2+ #70 Not tainted
---
include/linux/kvm_host.h:474 suspicious rcu_dereference_check() usage!

other info that might help us debug this:

rcu_scheduler_active = 1, debug_locks = 0
1 lock held by qemu-system-x86/2371:
 #0:  (vcpu-mutex){+.+...}, at: [a037d800] vcpu_load+0x20/0xd0 
 [kvm]

stack backtrace:
CPU: 4 PID: 2371 Comm: qemu-system-x86 Not tainted 3.18.0-rc2-test2+ #70
Hardware name: Dell Inc. OptiPlex 9010/0M9KCM, BIOS A12 01/10/2013
 0001 880209983ca8 816f514f 
 8802099b8990 880209983cd8 810bd687 000fee00
 880208a2c000 880208a1 88020ef50040 880209983d08
Call Trace:
 [816f514f] dump_stack+0x4e/0x71
 [810bd687] lockdep_rcu_suspicious+0xe7/0x120
 [a037d055] gfn_to_memslot+0xd5/0xe0 [kvm]
 [a03807d3] __gfn_to_pfn+0x33/0x60 [kvm]
 [a0380885] gfn_to_page+0x25/0x90 [kvm]
 [a038aeec] kvm_vcpu_reload_apic_access_page+0x3c/0x80 [kvm]

The srcu read lock must be held while accessing memslots (e.g. when
using gfn_to_* functions); however, kvm_vcpu_reload_apic_access_page()
doesn't do this. I will send a patch to fix it after reproducing the warning.

Regards,
Wanpeng Li 

 [a08f0a9c] vmx_vcpu_reset+0x20c/0x460 [kvm_intel]
 [a039ab8e] kvm_vcpu_reset+0x15e/0x1b0 [kvm]
 [a039ac0c] kvm_arch_vcpu_setup+0x2c/0x50 [kvm]
 [a037f7e0] kvm_vm_ioctl+0x1d0/0x780 [kvm]
 [810bc664] ? __lock_is_held+0x54/0x80
 [812231f0] do_vfs_ioctl+0x300/0x520
 [8122ee45] ? __fget+0x5/0x250
 [8122f0fa] ? __fget_light+0x2a/0xe0
 [81223491] SyS_ioctl+0x81/0xa0
 [816fed6d] system_call_fastpath+0x16/0x1b
kvm: zapping shadow pages for mmio generation wraparound
kvm [2369]: vcpu0 disabled perfctr wrmsr: 0xc1 data 0x


The machine itself and KVM work fine even after this warning.  I'm not
sure whether this is new, maybe it's triggered now since I changed my
Kconfig to cover more RCU testing recently.  The warning is
reproducible, I can see it at the first invocation of kvm after each
fresh boot.

Does this ring a bell to anyone?


thanks,

Takashi


Re: Benchmarking for vhost polling patch

2014-10-30 Thread Zhang Haoyu
  Hi Michael,
  
  Following the polling patch thread: http://marc.info/?l=kvm&m=140853271510179&w=2,
  I changed poll_stop_idle to be counted in microseconds, and carried out
  experiments using varying sizes of this value. The setup for netperf
  consisted of 1 vm and 1 vhost, each running on their own dedicated core.
  
 Could you provide your changing code?
 
 Thanks,
 Zhang Haoyu
 
Hi Zhang,
Do you mean the change in code for poll_stop_idle?
Yes, it's better to provide the complete code, including the polling patch.

Thanks,
Zhang Haoyu
Thanks,
Razya



Re: [PATCH 00/17] RFC: userfault v2

2014-10-30 Thread Peter Feiner
On Thu, Oct 30, 2014 at 07:31:48PM +0800, zhanghailiang wrote:
 On 2014/10/30 1:46, Andrea Arcangeli wrote:
 On Mon, Oct 27, 2014 at 05:32:51PM +0800, zhanghailiang wrote:
 I want to confirm a question:
 Can we support distinguishing between writing and reading memory for 
 userfault?
 That is, we can decide whether writing a page, reading a page or both 
 trigger userfault.
 Mail is going to be long enough already so I'll just assume tracking
 dirty memory in userland (instead of doing it in kernel) is worthy
 feature to have here.

I'll open that can of worms :-)

 [...]
 Er, maybe i didn't describe clearly. What i really need for live memory 
 snapshot
 is only wrprotect fault, like kvm's dirty tracing mechanism, *only tracing 
 write action*.
 
 So, what i need for userfault is supporting only wrprotect fault. i don't
 want to get notification for non present reading faults, it will influence
 VM's performance and the efficiency of doing snapshot.

Given that you do care about performance Zhanghailiang, I don't think that a
userfault handler is a good place to track dirty memory. Every dirtying write
will block on the userfault handler, which is an expensively slow proposition
compared to an in-kernel approach.

 Also, i think this feature will benefit for migration of ivshmem and 
 vhost-scsi
 which have no dirty-page-tracing now.

I do agree wholeheartedly with you here. Manually tracking non-guest writes
adds to the complexity of device emulation code. A central fault-driven means
for dirty tracking writes from the guest and host would be a welcome
simplification to implementing pre-copy migration. Indeed, that's exactly what
I'm working on! I'm using the softdirty bit, which was introduced recently for
CRIU migration, to replace the use of KVM's dirty logging and manual dirty
tracking by the VMM during pre-copy migration. See
Documentation/vm/soft-dirty.txt and pagemap.txt in case you aren't familiar. To
make softdirty usable for live migration, I've added an API to atomically
test-and-clear the bit and write protect the page.


Re: [PATCH 00/17] RFC: userfault v2

2014-10-30 Thread zhanghailiang

On 2014/10/31 10:23, Peter Feiner wrote:

On Thu, Oct 30, 2014 at 07:31:48PM +0800, zhanghailiang wrote:

On 2014/10/30 1:46, Andrea Arcangeli wrote:

On Mon, Oct 27, 2014 at 05:32:51PM +0800, zhanghailiang wrote:

I want to confirm a question:
Can we support distinguishing between writing and reading memory for userfault?
That is, we can decide whether writing a page, reading a page or both trigger 
userfault.

Mail is going to be long enough already so I'll just assume tracking
dirty memory in userland (instead of doing it in kernel) is worthy
feature to have here.


I'll open that can of worms :-)


[...]
Er, maybe i didn't describe clearly. What i really need for live memory snapshot
is only wrprotect fault, like kvm's dirty tracing mechanism, *only tracing 
write action*.

So, what i need for userfault is supporting only wrprotect fault. i don't
want to get notification for non present reading faults, it will influence
VM's performance and the efficiency of doing snapshot.


Given that you do care about performance Zhanghailiang, I don't think that a
userfault handler is a good place to track dirty memory. Every dirtying write
will block on the userfault handler, which is an expensively slow proposition
compared to an in-kernel approach.



Agreed, but for doing live memory snapshot (the VM is running while we do the
snapshot), we have to do this (block the write action), because we have to
save the page before it is dirtied by the write. This is the difference
compared to pre-copy migration.


Also, i think this feature will benefit for migration of ivshmem and vhost-scsi
which have no dirty-page-tracing now.


I do agree wholeheartedly with you here. Manually tracking non-guest writes
adds to the complexity of device emulation code. A central fault-driven means
for dirty tracking writes from the guest and host would be a welcome
simplification to implementing pre-copy migration. Indeed, that's exactly what
I'm working on! I'm using the softdirty bit, which was introduced recently for
CRIU migration, to replace the use of KVM's dirty logging and manual dirty
tracking by the VMM during pre-copy migration. See


Great! Do you plan to submit your patches to the community? I mean, is your
work based on qemu, or an independent tool (CRIU migration?) for
live migration?
Maybe i could fix the migration problem for ivshmem in qemu now,
based on the softdirty mechanism.


Documentation/vm/soft-dirty.txt and pagemap.txt in case you aren't familiar. To


I have read them cursorily; they are useful for pre-copy indeed. But it seems
they cannot meet my need for snapshot.


make softdirty usable for live migration, I've added an API to atomically
test-and-clear the bit and write protect the page.


How can i find the API? Has it been merged into the kernel's master branch already?


Thanks,
zhanghailiang



[PATCH] KVM: x86: fix access memslots w/o hold srcu read lock

2014-10-30 Thread Wanpeng Li
The srcu read lock must be held while accessing memslots (e.g.
when using gfn_to_* functions); however, commit c24ae0dcd3e8
(kvm: x86: Unpin and remove kvm_arch->apic_access_page) calls
gfn_to_page() in kvm_vcpu_reload_apic_access_page() without holding it,
which leads to a suspicious rcu_dereference_check() usage warning.
This patch fixes it by holding the srcu read lock around the
gfn_to_page() call in kvm_vcpu_reload_apic_access_page().


[ INFO: suspicious RCU usage. ]
3.18.0-rc2-test2+ #70 Not tainted
---
include/linux/kvm_host.h:474 suspicious rcu_dereference_check() usage!

other info that might help us debug this:

rcu_scheduler_active = 1, debug_locks = 0
1 lock held by qemu-system-x86/2371:
 #0:  (vcpu-mutex){+.+...}, at: [a037d800] vcpu_load+0x20/0xd0 [kvm]

stack backtrace:
CPU: 4 PID: 2371 Comm: qemu-system-x86 Not tainted 3.18.0-rc2-test2+ #70
Hardware name: Dell Inc. OptiPlex 9010/0M9KCM, BIOS A12 01/10/2013
 0001 880209983ca8 816f514f 
 8802099b8990 880209983cd8 810bd687 000fee00
 880208a2c000 880208a1 88020ef50040 880209983d08
Call Trace:
 [816f514f] dump_stack+0x4e/0x71
 [810bd687] lockdep_rcu_suspicious+0xe7/0x120
 [a037d055] gfn_to_memslot+0xd5/0xe0 [kvm]
 [a03807d3] __gfn_to_pfn+0x33/0x60 [kvm]
 [a0380885] gfn_to_page+0x25/0x90 [kvm]
 [a038aeec] kvm_vcpu_reload_apic_access_page+0x3c/0x80 [kvm]
 [a08f0a9c] vmx_vcpu_reset+0x20c/0x460 [kvm_intel]
 [a039ab8e] kvm_vcpu_reset+0x15e/0x1b0 [kvm]
 [a039ac0c] kvm_arch_vcpu_setup+0x2c/0x50 [kvm]
 [a037f7e0] kvm_vm_ioctl+0x1d0/0x780 [kvm]
 [810bc664] ? __lock_is_held+0x54/0x80
 [812231f0] do_vfs_ioctl+0x300/0x520
 [8122ee45] ? __fget+0x5/0x250
 [8122f0fa] ? __fget_light+0x2a/0xe0
 [81223491] SyS_ioctl+0x81/0xa0
 [816fed6d] system_call_fastpath+0x16/0x1b

Reported-by: Takashi Iwai ti...@suse.de
Reported-by: Alexei Starovoitov alexei.starovoi...@gmail.com
Signed-off-by: Wanpeng Li wanpeng...@linux.intel.com
---
 arch/x86/kvm/x86.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0033df3..2d97329 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6059,6 +6059,7 @@ static void kvm_vcpu_flush_tlb(struct kvm_vcpu *vcpu)
 void kvm_vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu)
 {
struct page *page = NULL;
+   int idx;
 
	if (!irqchip_in_kernel(vcpu->kvm))
		return;
@@ -6066,7 +6067,9 @@ void kvm_vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu)
	if (!kvm_x86_ops->set_apic_access_page_addr)
		return;
 
+	idx = srcu_read_lock(&vcpu->kvm->srcu);
	page = gfn_to_page(vcpu->kvm, APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT);
+	srcu_read_unlock(&vcpu->kvm->srcu, idx);
	kvm_x86_ops->set_apic_access_page_addr(vcpu, page_to_phys(page));
 
/*
-- 
1.9.1



Re: [PATCH 00/17] RFC: userfault v2

2014-10-30 Thread zhanghailiang

On 2014/10/31 11:29, zhanghailiang wrote:

On 2014/10/31 10:23, Peter Feiner wrote:

On Thu, Oct 30, 2014 at 07:31:48PM +0800, zhanghailiang wrote:

On 2014/10/30 1:46, Andrea Arcangeli wrote:

On Mon, Oct 27, 2014 at 05:32:51PM +0800, zhanghailiang wrote:

I want to confirm a question:
Can we support distinguishing between writing and reading memory for userfault?
That is, we can decide whether writing a page, reading a page or both trigger 
userfault.

Mail is going to be long enough already so I'll just assume tracking
dirty memory in userland (instead of doing it in kernel) is worthy
feature to have here.


I'll open that can of worms :-)


[...]
Er, maybe I didn't describe it clearly. What I really need for live memory
snapshot is only the wrprotect fault, like KVM's dirty tracking mechanism:
*only tracking write actions*.

So, what I need from userfault is support for only the wrprotect fault. I
don't want to get notifications for non-present read faults; they would hurt
the VM's performance and the efficiency of doing snapshots.


Given that you do care about performance Zhanghailiang, I don't think that a
userfault handler is a good place to track dirty memory. Every dirtying write
will block on the userfault handler, which is an expensively slow proposition
compared to an in-kernel approach.



Agreed, but for doing a live memory snapshot (the VM is running while the
snapshot is taken), we have to do this (block the write action), because we
have to save the page before it is dirtied by the write. This is the
difference compared to pre-copy migration.



Again ;) For snapshot, I don't use its dirty tracking ability; I just use it
to block the write action, save the page, and then remove the write
protection.


Also, I think this feature will benefit migration of ivshmem and vhost-scsi,
which have no dirty-page tracking now.


I do agree wholeheartedly with you here. Manually tracking non-guest writes
adds to the complexity of device emulation code. A central fault-driven means
for dirty tracking writes from the guest and host would be a welcome
simplification to implementing pre-copy migration. Indeed, that's exactly what
I'm working on! I'm using the softdirty bit, which was introduced recently for
CRIU migration, to replace the use of KVM's dirty logging and manual dirty
tracking by the VMM during pre-copy migration. See


Great! Do you plan to submit your patches to the community? I mean, is your
work based on qemu, or on an independent tool (CRIU migration?) for live
migration? Maybe I could fix the migration problem for ivshmem in qemu now,
based on the softdirty mechanism.


Documentation/vm/soft-dirty.txt and pagemap.txt in case you aren't familiar. To


I have read them cursorily; they are useful for pre-copy indeed. But it seems
that they cannot meet my need for snapshot.


make softdirty usable for live migration, I've added an API to atomically
test-and-clear the bit and write protect the page.


How can I find the API? Has it been merged into the kernel's master branch already?


Thanks,
zhanghailiang






Re: [PATCH 00/17] RFC: userfault v2

2014-10-30 Thread Andres Lagar-Cavilla
On Thu, Oct 30, 2014 at 9:38 PM, zhanghailiang
zhang.zhanghaili...@huawei.com wrote:
 On 2014/10/31 11:29, zhanghailiang wrote:

 On 2014/10/31 10:23, Peter Feiner wrote:

 On Thu, Oct 30, 2014 at 07:31:48PM +0800, zhanghailiang wrote:

 On 2014/10/30 1:46, Andrea Arcangeli wrote:

 On Mon, Oct 27, 2014 at 05:32:51PM +0800, zhanghailiang wrote:

 I want to confirm a question:
 Can we support distinguishing between writing and reading memory for
 userfault?
 That is, we can decide whether writing a page, reading a page or both
 trigger userfault.

 Mail is going to be long enough already so I'll just assume tracking
 dirty memory in userland (instead of doing it in kernel) is worthy
 feature to have here.


 I'll open that can of worms :-)

 [...]
 Er, maybe i didn't describe clearly. What i really need for live memory
 snapshot
 is only wrprotect fault, like kvm's dirty tracing mechanism, *only
 tracing write action*.

 So, what i need for userfault is supporting only wrprotect fault. i
 don't
 want to get notification for non present reading faults, it will
 influence
 VM's performance and the efficiency of doing snapshot.


 Given that you do care about performance Zhanghailiang, I don't think
 that a
 userfault handler is a good place to track dirty memory. Every dirtying
 write
 will block on the userfault handler, which is an expensively slow
 proposition
 compared to an in-kernel approach.


 Agreed, but for doing live memory snapshot (VM is running when do
 snapsphot),
 we have to do this (block the write action), because we have to save the
 page before it
 is dirtied by writing action. This is the difference, compared to pre-copy
 migration.


 Again;) For snapshot, i don't use its dirty tracing ability, i just use it
 to block write action,
 and save page, and then i will remove its write protect.

You could do a CoW in the kernel, post a notification, keep going, and
expose an interface for user-space to mmap the preserved copy. Getting
the life-cycle of the preserved page(s) right is tricky, but doable.
Anyway, it's easy to hand-wave without knowing your specific
requirements.

Opening the discussion a bit, this does look similar to the xen-access
interface, in which a xen domain vcpu could be stopped in its tracks
while user-space was notified (and acknowledged) a variety of
scenarios: page was written to, page was read from, vcpu is attempting
to execute from page, etc. Very applicable to anti-viruses right away,
for example you can enforce W^X properties on pages.

I don't know that Andrea wants to open the game so broadly for
userfault, and the code right now is very specific to triggering on
pte_none(), but that's a nice reward down this road.

Andres


 Also, i think this feature will benefit for migration of ivshmem and
 vhost-scsi
 which have no dirty-page-tracing now.


 I do agree wholeheartedly with you here. Manually tracking non-guest
 writes
 adds to the complexity of device emulation code. A central fault-driven
 means
 for dirty tracking writes from the guest and host would be a welcome
 simplification to implementing pre-copy migration. Indeed, that's exactly
 what
 I'm working on! I'm using the softdirty bit, which was introduced
 recently for
 CRIU migration, to replace the use of KVM's dirty logging and manual
 dirty
 tracking by the VMM during pre-copy migration. See


 Great! Do you plan to issue your patches to community? I mean is your work
 based on
 qemu? or an independent tool (CRIU migration?) for live-migration?
 Maybe i could fix the migration problem for ivshmem in qemu now,
 based on softdirty mechanism.

 Documentation/vm/soft-dirty.txt and pagemap.txt in case you aren't
 familiar. To


 I have read them cursorily, it is useful for pre-copy indeed. But it seems
 that
 it can not meet my need for snapshot.

 make softdirty usable for live migration, I've added an API to atomically
 test-and-clear the bit and write protect the page.


 How can i find the API? Is it been merged in kernel's master branch
 already?


 Thanks,
 zhanghailiang






-- 
Andres Lagar-Cavilla | Google Kernel Team | andre...@google.com


Re: [PATCH] KVM: x86: fix access memslots w/o hold srcu read lock

2014-10-30 Thread Chen, Tiejun

On 2014/10/31 12:33, Wanpeng Li wrote:

The srcu read lock must be held while accessing memslots (e.g.
when using gfn_to_* functions), however, commit c24ae0dcd3e8
(kvm: x86: Unpin and remove kvm_arch-apic_access_page) call
gfn_to_page() in kvm_vcpu_reload_apic_access_page() w/o hold it
which leads to suspicious rcu_dereference_check() usage warning.
This patch fix it by holding srcu read lock when call gfn_to_page()
in kvm_vcpu_reload_apic_access_page() function.


[ INFO: suspicious RCU usage. ]
3.18.0-rc2-test2+ #70 Not tainted
---
include/linux/kvm_host.h:474 suspicious rcu_dereference_check() usage!

other info that might help us debug this:

rcu_scheduler_active = 1, debug_locks = 0
1 lock held by qemu-system-x86/2371:
  #0:  (vcpu-mutex){+.+...}, at: [a037d800] vcpu_load+0x20/0xd0 
[kvm]

stack backtrace:
CPU: 4 PID: 2371 Comm: qemu-system-x86 Not tainted 3.18.0-rc2-test2+ #70
Hardware name: Dell Inc. OptiPlex 9010/0M9KCM, BIOS A12 01/10/2013
  0001 880209983ca8 816f514f 
  8802099b8990 880209983cd8 810bd687 000fee00
  880208a2c000 880208a1 88020ef50040 880209983d08
Call Trace:
  [816f514f] dump_stack+0x4e/0x71
  [810bd687] lockdep_rcu_suspicious+0xe7/0x120
  [a037d055] gfn_to_memslot+0xd5/0xe0 [kvm]
  [a03807d3] __gfn_to_pfn+0x33/0x60 [kvm]
  [a0380885] gfn_to_page+0x25/0x90 [kvm]
  [a038aeec] kvm_vcpu_reload_apic_access_page+0x3c/0x80 [kvm]
  [a08f0a9c] vmx_vcpu_reset+0x20c/0x460 [kvm_intel]
  [a039ab8e] kvm_vcpu_reset+0x15e/0x1b0 [kvm]
  [a039ac0c] kvm_arch_vcpu_setup+0x2c/0x50 [kvm]
  [a037f7e0] kvm_vm_ioctl+0x1d0/0x780 [kvm]
  [810bc664] ? __lock_is_held+0x54/0x80
  [812231f0] do_vfs_ioctl+0x300/0x520
  [8122ee45] ? __fget+0x5/0x250
  [8122f0fa] ? __fget_light+0x2a/0xe0
  [81223491] SyS_ioctl+0x81/0xa0
  [816fed6d] system_call_fastpath+0x16/0x1b

Reported-by: Takashi Iwai ti...@suse.de
Reported-by: Alexei Starovoitov alexei.starovoi...@gmail.com
Signed-off-by: Wanpeng Li wanpeng...@linux.intel.com
---
  arch/x86/kvm/x86.c | 3 +++
  1 file changed, 3 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0033df3..2d97329 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6059,6 +6059,7 @@ static void kvm_vcpu_flush_tlb(struct kvm_vcpu *vcpu)
  void kvm_vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu)
  {
struct page *page = NULL;
+   int idx;

	if (!irqchip_in_kernel(vcpu->kvm))
		return;
@@ -6066,7 +6067,9 @@ void kvm_vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu)
	if (!kvm_x86_ops->set_apic_access_page_addr)
		return;

+	idx = srcu_read_lock(&vcpu->kvm->srcu);


There's another scenario where we already hold srcu before calling
kvm_vcpu_reload_apic_access_page():


__vcpu_run()
|
+ vcpu->srcu_idx = srcu_read_lock(&kvm->srcu);
+ r = vcpu_enter_guest(vcpu);
|
+ kvm_vcpu_reload_apic_access_page(vcpu);

So according to the backtrace, I think we should fix it as follows:

kvm: x86: vmx: hold kvm-srcu while reload apic access page

kvm_vcpu_reload_apic_access_page() needs to access memslots via
gfn_to_page(), so it's necessary to hold kvm->srcu.

Signed-off-by: Tiejun Chen tiejun.c...@intel.com
---
 arch/x86/kvm/vmx.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index b25a588..9fa1f46 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -4442,6 +4442,7 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu)
 {
struct vcpu_vmx *vmx = to_vmx(vcpu);
struct msr_data apic_base_msr;
+   int idx;

	vmx->rmode.vm86_active = 0;

@@ -4509,7 +4510,9 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu)
vmcs_write32(TPR_THRESHOLD, 0);
}

+	idx = srcu_read_lock(&vcpu->kvm->srcu);
	kvm_vcpu_reload_apic_access_page(vcpu);
+	srcu_read_unlock(&vcpu->kvm->srcu, idx);

	if (vmx_vm_has_apicv(vcpu->kvm))
		memset(&vmx->pi_desc, 0, sizeof(struct pi_desc));
--
1.9.1

Thanks
Tiejun

	page = gfn_to_page(vcpu->kvm, APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT);
+	srcu_read_unlock(&vcpu->kvm->srcu, idx);
	kvm_x86_ops->set_apic_access_page_addr(vcpu, page_to_phys(page));

/*




Re: [PATCH v12 1/6] KVM: Add architecture-defined TLB flush support

2014-10-30 Thread Cornelia Huck
On Wed, 22 Oct 2014 15:34:06 -0700
Mario Smarduch m.smard...@samsung.com wrote:

 This patch adds support for architecture implemented VM TLB flush, currently
 ARMv7 defines HAVE_KVM_ARCH_TLB_FLUSH_ALL. This leaves other architectures 
 unaffected using the generic version. In subsequent patch ARMv7 defines
 HAVE_KVM_ARCH_TLB_FLUSH_ALL and it's own TLB flush interface.

Can you reword this a bit?

Allow architectures to override the generic kvm_flush_remote_tlbs()
function via HAVE_KVM_ARCH_TLB_FLUSH_ALL. ARMv7 will need this to
provide its own TLB flush interface.

 
 Signed-off-by: Mario Smarduch m.smard...@samsung.com
 ---
  virt/kvm/Kconfig|3 +++
  virt/kvm/kvm_main.c |2 ++
  2 files changed, 5 insertions(+)

Providing an override for the special cases looks sane to me.



Re: [PATCH v12 0/6] arm/KVM: dirty page logging support for ARMv7 (3.17.0-rc1)

2014-10-30 Thread Christian Borntraeger
Am 23.10.2014 00:34, schrieb Mario Smarduch:
 This patch series introduces dirty page logging for ARMv7 and adds some 
 degree 
 of generic dirty logging support for x86, armv7 and later armv8.
 
 I implemented Alex's  suggestion after he took a look at the patches at kvm
 forum to simplify the generic/arch split - leaving mips, powerpc, s390, 
 (ia64 although broken) unchanged. x86/armv7 now share some dirty logging 
 code. 
 armv8 dirty log patches have been posted and tested but for time being armv8
 is non-generic as well.
 
 I briefly spoke to most of you at kvm forum, and this is the patch series
 I was referring to. Implementation changed from previous version (patches
 1  2), those who acked previous revision, please review again.
 
 Last 4 patches (ARM) have been rebased for newer kernel, with no signifcant
 changes.
 
 Testing:
 - Generally live migration + checksumming of source/destination memory 
 regions 
   is used validate correctness. 
 - qemu machvirt, VExpress - Exynos 5440, FastModels - lmbench + dirty guest
   memory cycling.
 - ARMv8 Foundation Model/kvmtool - Due to slight overlap in 2nd stage handlers
   did a basic bringup using qemu.
 - x86_64 qemu  default machine model, tested migration on HP Z620, tested 
   convergence for several dirty page rates
 
 See https://github.com/mjsmar/arm-dirtylog-tests
 - Dirtlogtest-setup.pdf for ARMv7
 - https://github.com/mjsmar/arm-dirtylog-tests/tree/master/v7 - README
 
 The patch affects armv7,armv8, mips, ia64, powerpc, s390, x86_64. Patch
 series has been compiled for affected architectures:
 
 - x86_64 - defconfig 
 - ia64 - ia64-linux-gcc4.6.3 - defconfig, ia64 Kconfig defines BROKEN worked 
   around that to make sure new changes don't break build. Eventually build
   breaks due to other reasons.
 - mips - mips64-linux-gcc4.6.3 - malta_kvm_defconfig
 - ppc - powerpc64-linux-gcc4.6.3 - pseries_defconfig
 - s390 - s390x-linux-gcc4.6.3 - defconfig
 - armv8 - aarch64-linux-gnu-gcc4.8.1 - defconfig
 
 ARMv7 Dirty page logging implementation overivew-
 - initially write protects VM RAM memory region - 2nd stage page tables
 - add support to read dirty page log and again write protect the dirty pages 
   - second stage page table for next pass.
 - second stage huge page are dissolved into small page tables to keep track of
   dirty pages at page granularity. Tracking at huge page granularity limits
   migration to an almost idle system. Small page size logging supports higher 
   memory dirty rates.
 - In the event migration is canceled, normal behavior is resumed huge pages
   are rebuilt over time.
 
 Changes since v11:
 - Implemented Alex's comments to simplify generic layer.
 
 Changes since v10:
 - addressed wanghaibin comments 
 - addressed Christoffers comments
 
 Changes since v9:
 - Split patches into generic and architecture specific variants for TLB 
 Flushing
   and dirty log read (patches 1,2  3,4,5,6)
 - rebased to 3.16.0-rc1
 - Applied Christoffers comments.
 
 Mario Smarduch (6):
   KVM: Add architecture-defined TLB flush support
   KVM: Add generic support for dirty page logging
   arm: KVM: Add ARMv7 API to flush TLBs
   arm: KVM: Add initial dirty page locking infrastructure
   arm: KVM: dirty log read write protect support
   arm: KVM: ARMv7 dirty page logging 2nd stage page fault
 
  arch/arm/include/asm/kvm_asm.h|1 +
  arch/arm/include/asm/kvm_host.h   |   14 +++
  arch/arm/include/asm/kvm_mmu.h|   20 
  arch/arm/include/asm/pgtable-3level.h |1 +
  arch/arm/kvm/Kconfig  |2 +
  arch/arm/kvm/Makefile |1 +
  arch/arm/kvm/arm.c|2 +
  arch/arm/kvm/interrupts.S |   11 ++
  arch/arm/kvm/mmu.c|  209 
 +++--
  arch/x86/include/asm/kvm_host.h   |3 -
  arch/x86/kvm/Kconfig  |1 +
  arch/x86/kvm/Makefile |1 +
  arch/x86/kvm/x86.c|   86 --
  include/linux/kvm_host.h  |4 +
  virt/kvm/Kconfig  |6 +
  virt/kvm/dirtylog.c   |  112 ++
  virt/kvm/kvm_main.c   |2 +
  17 files changed, 380 insertions(+), 96 deletions(-)
  create mode 100644 virt/kvm/dirtylog.c
 

Patches 1-3 seem to work fine on s390. The other patches are arm-only (well,
can't find 5 and 6), so I guess it's ok for s390.



Re: [PATCH v12 2/6] KVM: Add generic support for dirty page logging

2014-10-30 Thread Cornelia Huck
On Wed, 22 Oct 2014 15:34:07 -0700
Mario Smarduch m.smard...@samsung.com wrote:

 This patch defines KVM_GENERIC_DIRTYLOG, and moves dirty log read function
 to it's own file virt/kvm/dirtylog.c. x86 is updated to use the generic
 dirty log interface, selecting KVM_GENERIC_DIRTYLOG in its Kconfig and 
 makefile. No other architectures are affected, each uses it's own version.
 This changed from previous patch revision where non-generic architectures 
 were modified.
 
 In subsequent patch armv7 does samething. All other architectures continue
 use architecture defined version.
 

Hm.

The x86 specific version of dirty page logging is generic enough to be
used by other architectures, noteably ARMv7. So let's move the x86 code
under virt/kvm/ and make it depend on KVM_GENERIC_DIRTYLOG. Other
architectures continue to use their own implementations.

?

 
 Signed-off-by: Mario Smarduch m.smard...@samsung.com
 ---
  arch/x86/include/asm/kvm_host.h |3 --
  arch/x86/kvm/Kconfig|1 +
  arch/x86/kvm/Makefile   |1 +
  arch/x86/kvm/x86.c  |   86 --
  include/linux/kvm_host.h|4 ++
  virt/kvm/Kconfig|3 ++
  virt/kvm/dirtylog.c |  112 
 +++
  7 files changed, 121 insertions(+), 89 deletions(-)
  create mode 100644 virt/kvm/dirtylog.c
 

 diff --git a/virt/kvm/dirtylog.c b/virt/kvm/dirtylog.c
 new file mode 100644
 index 000..67a
 --- /dev/null
 +++ b/virt/kvm/dirtylog.c
 @@ -0,0 +1,112 @@
 +/*
 + * kvm generic dirty logging support, used by architectures that share
 + * comman dirty page logging implementation.

s/comman/common/

The approach looks sane to me, especially as it does not change other
architectures needlessly.



Re: [PATCH v12 2/6] KVM: Add generic support for dirty page logging

2014-10-30 Thread Mario Smarduch
On 10/30/2014 05:14 AM, Cornelia Huck wrote:
 On Wed, 22 Oct 2014 15:34:07 -0700
 Mario Smarduch m.smard...@samsung.com wrote:
 
 This patch defines KVM_GENERIC_DIRTYLOG, and moves dirty log read function
 to it's own file virt/kvm/dirtylog.c. x86 is updated to use the generic
 dirty log interface, selecting KVM_GENERIC_DIRTYLOG in its Kconfig and 
 makefile. No other architectures are affected, each uses it's own version.
 This changed from previous patch revision where non-generic architectures 
 were modified.

 In subsequent patch armv7 does samething. All other architectures continue
 use architecture defined version.

 
 Hm.
 
 The x86 specific version of dirty page logging is generic enough to be
 used by other architectures, noteably ARMv7. So let's move the x86 code
 under virt/kvm/ and make it depend on KVM_GENERIC_DIRTYLOG. Other
 architectures continue to use their own implementations.
 
 ?

I'll update the descriptions for both patches with the more concise wording.

Thanks.

 

 Signed-off-by: Mario Smarduch m.smard...@samsung.com
 ---
  arch/x86/include/asm/kvm_host.h |3 --
  arch/x86/kvm/Kconfig|1 +
  arch/x86/kvm/Makefile   |1 +
  arch/x86/kvm/x86.c  |   86 --
  include/linux/kvm_host.h|4 ++
  virt/kvm/Kconfig|3 ++
  virt/kvm/dirtylog.c |  112 
 +++
  7 files changed, 121 insertions(+), 89 deletions(-)
  create mode 100644 virt/kvm/dirtylog.c

 
 diff --git a/virt/kvm/dirtylog.c b/virt/kvm/dirtylog.c
 new file mode 100644
 index 000..67a
 --- /dev/null
 +++ b/virt/kvm/dirtylog.c
 @@ -0,0 +1,112 @@
 +/*
 + * kvm generic dirty logging support, used by architectures that share
 + * comman dirty page logging implementation.
 
 s/comman/common/
 
 The approach looks sane to me, especially as it does not change other
 architectures needlessly.
 



Re: [PATCH v12 0/6] arm/KVM: dirty page logging support for ARMv7 (3.17.0-rc1)

2014-10-30 Thread Mario Smarduch
On 10/30/2014 05:11 AM, Christian Borntraeger wrote:
 Am 23.10.2014 00:34, schrieb Mario Smarduch:
 This patch series introduces dirty page logging for ARMv7 and adds some 
 degree 
 of generic dirty logging support for x86, armv7 and later armv8.

 I implemented Alex's  suggestion after he took a look at the patches at kvm
 forum to simplify the generic/arch split - leaving mips, powerpc, s390, 
 (ia64 although broken) unchanged. x86/armv7 now share some dirty logging 
 code. 
 armv8 dirty log patches have been posted and tested but for time being armv8
 is non-generic as well.

 I briefly spoke to most of you at kvm forum, and this is the patch series
 I was referring to. Implementation changed from previous version (patches
 1  2), those who acked previous revision, please review again.

 Last 4 patches (ARM) have been rebased for newer kernel, with no signifcant
 changes.

 Testing:
 - Generally live migration + checksumming of source/destination memory 
 regions 
   is used validate correctness. 
 - qemu machvirt, VExpress - Exynos 5440, FastModels - lmbench + dirty guest
   memory cycling.
 - ARMv8 Foundation Model/kvmtool - Due to slight overlap in 2nd stage 
 handlers
   did a basic bringup using qemu.
 - x86_64 qemu  default machine model, tested migration on HP Z620, tested 
   convergence for several dirty page rates

 See https://github.com/mjsmar/arm-dirtylog-tests
 - Dirtlogtest-setup.pdf for ARMv7
 - https://github.com/mjsmar/arm-dirtylog-tests/tree/master/v7 - README

 The patch affects armv7,armv8, mips, ia64, powerpc, s390, x86_64. Patch
 series has been compiled for affected architectures:

 - x86_64 - defconfig 
 - ia64 - ia64-linux-gcc4.6.3 - defconfig, ia64 Kconfig defines BROKEN worked 
   around that to make sure new changes don't break build. Eventually build
   breaks due to other reasons.
 - mips - mips64-linux-gcc4.6.3 - malta_kvm_defconfig
 - ppc - powerpc64-linux-gcc4.6.3 - pseries_defconfig
 - s390 - s390x-linux-gcc4.6.3 - defconfig
 - armv8 - aarch64-linux-gnu-gcc4.8.1 - defconfig

 ARMv7 Dirty page logging implementation overivew-
 - initially write protects VM RAM memory region - 2nd stage page tables
 - add support to read dirty page log and again write protect the dirty pages 
   - second stage page table for next pass.
 - second stage huge page are dissolved into small page tables to keep track 
 of
   dirty pages at page granularity. Tracking at huge page granularity limits
   migration to an almost idle system. Small page size logging supports 
 higher 
   memory dirty rates.
 - In the event migration is canceled, normal behavior is resumed huge pages
   are rebuilt over time.

 Changes since v11:
 - Implemented Alex's comments to simplify generic layer.

 Changes since v10:
 - addressed wanghaibin comments 
 - addressed Christoffers comments

 Changes since v9:
 - Split patches into generic and architecture specific variants for TLB 
 Flushing
   and dirty log read (patches 1,2  3,4,5,6)
 - rebased to 3.16.0-rc1
 - Applied Christoffers comments.

 Mario Smarduch (6):
   KVM: Add architecture-defined TLB flush support
   KVM: Add generic support for dirty page logging
   arm: KVM: Add ARMv7 API to flush TLBs
   arm: KVM: Add initial dirty page locking infrastructure
   arm: KVM: dirty log read write protect support
   arm: KVM: ARMv7 dirty page logging 2nd stage page fault

  arch/arm/include/asm/kvm_asm.h|1 +
  arch/arm/include/asm/kvm_host.h   |   14 +++
  arch/arm/include/asm/kvm_mmu.h|   20 
  arch/arm/include/asm/pgtable-3level.h |1 +
  arch/arm/kvm/Kconfig  |2 +
  arch/arm/kvm/Makefile |1 +
  arch/arm/kvm/arm.c|2 +
  arch/arm/kvm/interrupts.S |   11 ++
  arch/arm/kvm/mmu.c|  209 
 +++--
  arch/x86/include/asm/kvm_host.h   |3 -
  arch/x86/kvm/Kconfig  |1 +
  arch/x86/kvm/Makefile |1 +
  arch/x86/kvm/x86.c|   86 --
  include/linux/kvm_host.h  |4 +
  virt/kvm/Kconfig  |6 +
  virt/kvm/dirtylog.c   |  112 ++
  virt/kvm/kvm_main.c   |2 +
  17 files changed, 380 insertions(+), 96 deletions(-)
  create mode 100644 virt/kvm/dirtylog.c

 
 Patches 1-3 seem to work fine on s390. The other patches are arm-only (well 
 cant find 5 and 6) so I guess its ok for s390.
 
The patches are there, but threading is broken due to the mail server's
message rate threshold. Just in case, links below:

https://lists.cs.columbia.edu/pipermail/kvmarm/2014-October/011730.html
https://lists.cs.columbia.edu/pipermail/kvmarm/2014-October/011731.html

Thanks.
