[PATCH RESEND v15 07/10] KVM: arm: page logging 2nd stage fault handling

2015-01-09 Thread Mario Smarduch
This patch adds support for 2nd stage page fault handling while dirty page
logging. On huge page faults, huge pages are dissolved to normal pages, and
rebuilding of 2nd stage huge pages is blocked. In case migration is 
canceled, this restriction is removed and huge pages may be rebuilt again.

This patch applies cleanly on top of patch series posted Dec. 15'th:
https://lists.cs.columbia.edu/pipermail/kvmarm/2014-December/012826.html

Patch #11 has been dropped, and should not be applied.

Signed-off-by: Mario Smarduch m.smard...@samsung.com
---

Change Log since last RESEND, v1 -> v2:
- Disallow dirty page logging of IO region - fail for initial write protect
  and disable logging code in 2nd stage page fault handler.
- Fixed auto spell correction errors

Change Log RESEND, v0 -> v1:
- fixed a bug exposed by the new generic __get_user_pages_fast(): when a region
  is writable, prevent write protection of the pte on a read fault
- Removed marking entire huge page dirty on initial access
- don't dissolve huge pages of non-writable regions
- Made updates based on Christoffer's comments
  - renamed logging status function to memslot_is_logging()
  - changed few values to bool from longs
  - streamlined user_mem_abort() to eliminate extra conditional checks
---
 arch/arm/kvm/mmu.c |  113 ++++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 105 insertions(+), 8 deletions(-)

diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
index 73d506f..b878236 100644
--- a/arch/arm/kvm/mmu.c
+++ b/arch/arm/kvm/mmu.c
@@ -47,6 +47,18 @@ static phys_addr_t hyp_idmap_vector;
 #define kvm_pmd_huge(_x)   (pmd_huge(_x) || pmd_trans_huge(_x))
 #define kvm_pud_huge(_x)   pud_huge(_x)
 
+#define KVM_S2PTE_FLAG_IS_IOMAP		(1UL << 0)
+#define KVM_S2PTE_FLAG_LOGGING_ACTIVE	(1UL << 1)
+
+static bool memslot_is_logging(struct kvm_memory_slot *memslot)
+{
+#ifdef CONFIG_ARM
+	return !!memslot->dirty_bitmap;
+#else
+   return false;
+#endif
+}
+
 static void kvm_tlb_flush_vmid_ipa(struct kvm *kvm, phys_addr_t ipa)
 {
/*
@@ -59,6 +71,25 @@ static void kvm_tlb_flush_vmid_ipa(struct kvm *kvm, phys_addr_t ipa)
kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, kvm, ipa);
 }
 
+/**
+ * stage2_dissolve_pmd() - clear and flush huge PMD entry
+ * @kvm:   pointer to kvm structure.
+ * @addr:  IPA
+ * @pmd:   pmd pointer for IPA
+ *
+ * Function clears a PMD entry, flushes addr 1st and 2nd stage TLBs. Marks all
+ * pages in the range dirty.
+ */
+static void stage2_dissolve_pmd(struct kvm *kvm, phys_addr_t addr, pmd_t *pmd)
+{
+   if (!kvm_pmd_huge(*pmd))
+   return;
+
+   pmd_clear(pmd);
+   kvm_tlb_flush_vmid_ipa(kvm, addr);
+   put_page(virt_to_page(pmd));
+}
+
 static int mmu_topup_memory_cache(struct kvm_mmu_memory_cache *cache,
  int min, int max)
 {
@@ -703,10 +734,13 @@ static int stage2_set_pmd_huge(struct kvm *kvm, struct kvm_mmu_memory_cache
 }
 
 static int stage2_set_pte(struct kvm *kvm, struct kvm_mmu_memory_cache *cache,
- phys_addr_t addr, const pte_t *new_pte, bool iomap)
+ phys_addr_t addr, const pte_t *new_pte,
+ unsigned long flags)
 {
pmd_t *pmd;
pte_t *pte, old_pte;
+	bool iomap = flags & KVM_S2PTE_FLAG_IS_IOMAP;
+	bool logging_active = flags & KVM_S2PTE_FLAG_LOGGING_ACTIVE;
 
/* Create stage-2 page table mapping - Levels 0 and 1 */
pmd = stage2_get_pmd(kvm, cache, addr);
@@ -718,6 +752,13 @@ static int stage2_set_pte(struct kvm *kvm, struct kvm_mmu_memory_cache *cache,
return 0;
}
 
+   /*
+* While dirty page logging - dissolve huge PMD, then continue on to
+* allocate page.
+*/
+   if (logging_active)
+   stage2_dissolve_pmd(kvm, addr, pmd);
+
/* Create stage-2 page mappings - Level 2 */
if (pmd_none(*pmd)) {
if (!cache)
@@ -774,7 +815,8 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
if (ret)
goto out;
	spin_lock(&kvm->mmu_lock);
-	ret = stage2_set_pte(kvm, &cache, addr, &pte, true);
+	ret = stage2_set_pte(kvm, &cache, addr, &pte,
+				KVM_S2PTE_FLAG_IS_IOMAP);
	spin_unlock(&kvm->mmu_lock);
if (ret)
goto out;
@@ -1002,6 +1044,8 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
pfn_t pfn;
pgprot_t mem_type = PAGE_S2;
bool fault_ipa_uncached;
+   bool can_set_pte_rw = true;
+   unsigned long set_pte_flags = 0;
 
write_fault = kvm_is_write_fault(vcpu);
	if (fault_status == FSC_PERM && !write_fault) {
@@ -1009,6 +1053,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
return -EFAULT;
}
 
+
 
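The hunk is truncated in the archive here. Roughly, the logging-aware path the
changelog describes looks like the sketch below, reusing the locals declared
above (set_pte_flags, can_set_pte_rw); this is a sketch, not the literal
missing hunk:

	/* Sketch: while logging, force page-granularity mappings and log
	 * writes; on a read fault of a writable region, do not upgrade
	 * the pte to RW, so the next write still faults and is logged.
	 */
	if (memslot_is_logging(memslot)) {
		set_pte_flags |= KVM_S2PTE_FLAG_LOGGING_ACTIVE;
		if (write_fault)
			mark_page_dirty(kvm, gfn);
		else
			can_set_pte_rw = false;
	}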

Re: [RESEND PATCH v15 07/11] KVM: arm: page logging 2nd stage fault handling

2015-01-09 Thread Mario Smarduch
On 01/09/2015 02:24 AM, Christoffer Dall wrote:
 On Thu, Jan 08, 2015 at 08:28:46AM -0800, Mario Smarduch wrote:
 On 01/08/2015 02:45 AM, Christoffer Dall wrote:
 On Wed, Jan 07, 2015 at 05:43:18PM -0800, Mario Smarduch wrote:
 Hi Christoffer,
   before going through your comments, I discovered that
 in 3.18.0-rc2 a generic __get_user_pages_fast()
 was implemented; ARM now picks this up. This causes
 gfn_to_pfn_prot() to return a meaningful 'writable'
 value for a read fault, provided the region is writable.

 Prior to that the weak version returned 0 and 'writable'
 had no optimization effect of setting the pte/pmd -> RW on
 a read fault.

 As a consequence dirty logging broke in 3.18, I was seeing
 Correction on this: proper __get_user_pages_fast()
 behavior exposed a bug in the page logging code.

 weird but very intermittent issues. I just put in the
 few additional lines to fix it, preventing pte RW (only R) on
 read faults while logging a writable region.

 On 01/07/2015 04:38 AM, Christoffer Dall wrote:
 On Wed, Dec 17, 2014 at 06:07:29PM -0800, Mario Smarduch wrote:
 This patch is a followup to the v15 patch series, with the following changes:
 - When clearing/dissolving a huge PMD, mark the huge page range dirty, since
   the state of the whole range is unknown. After the huge page is dissolved,
   dirty page logging is at page granularity.

 What is the sequence of events where you could have dirtied another page
 within the PMD range after the user initially requested dirty page
 logging?

 No, there is none. My issue was the start point for tracking dirty pages,
 and that would be the second call to dirty log read, not the first
 call after the initial write protect, where any page in the range can
 be assumed dirty. I'll remove this; not sure if there would be any
 use case to call dirty log only once.


 Calling dirty log once cannot give you anything meaningful, right?  You
 must assume all memory is 'dirty' at this point, no?

 There is the interval between KVM_MEM_LOG_DIRTY_PAGES and the first
 call to KVM_GET_DIRTY_LOG. Not sure of any use case; maybe enable
 logging, wait a while, do a dirty log read, then disable logging:
 you get an accumulated snapshot of dirty page activity.

 ok, so from the time the user calls KVM_MEM_LOG_DIRTY_PAGES, then any
 fault on any huge page will dissolve that huge page into pages, and each
 dirty page will be logged accordingly for the first call to
 KVM_GET_DIRTY_LOG, right?  What am I missing here?

Yes, that's correct; this may or may not be meaningful in itself.
The original point was first-time access to a huge page (on the
first or some later call) and whether we consider the whole range
dirty. Keeping track at page granularity + the original image provides
everything needed to reconstruct the source, so it should
not matter.

I think I convoluted this issue a bit.

- Mario
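
(For concreteness, the one-shot accumulated-snapshot use case mentioned above
would look roughly like this from userspace, using the standard KVM ioctls;
error handling omitted:)

	mem.flags |= KVM_MEM_LOG_DIRTY_PAGES;
	ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &mem);	/* start logging */
	sleep(interval);
	ioctl(vm_fd, KVM_GET_DIRTY_LOG, &log);	/* snapshot since enabling */
	mem.flags &= ~KVM_MEM_LOG_DIRTY_PAGES;
	ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &mem);	/* stop logging */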
 
 -Christoffer
 



Re: [PATCH 3/3] x86_64,entry: Remove the syscall exit audit and schedule optimizations

2015-01-09 Thread Borislav Petkov
On Fri, Nov 07, 2014 at 03:58:19PM -0800, Andy Lutomirski wrote:
 We used to optimize rescheduling and audit on syscall exit.  Now that
 the full slow path is reasonably fast, remove these optimizations.
 
 This adds something like 10ns to the previously optimized paths on my
 computer, presumably due mostly to SAVE_REST / RESTORE_REST.
 
 I think that we should eventually replace both the syscall and
 non-paranoid interrupt exit slow paths with a pair of C functions
 along the lines of the syscall entry hooks.
 
 Signed-off-by: Andy Lutomirski l...@amacapital.net
 ---
  arch/x86/kernel/entry_64.S | 52 +++++-----------------------------------------
  1 file changed, 5 insertions(+), 47 deletions(-)
 
 diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
 index a5afdf0f7fa4..222dc5c45ac3 100644
 --- a/arch/x86/kernel/entry_64.S
 +++ b/arch/x86/kernel/entry_64.S
 @@ -427,15 +427,12 @@ system_call_fastpath:
   * Has incomplete stack frame and undefined top of stack.
   */
  ret_from_sys_call:
 - movl $_TIF_ALLWORK_MASK,%edi
 - /* edi: flagmask */
 -sysret_check:
 + testl $_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
 + jnz int_ret_from_sys_call_fixup /* Go to the slow path */
 +
   LOCKDEP_SYS_EXIT
   DISABLE_INTERRUPTS(CLBR_NONE)
   TRACE_IRQS_OFF
 - movl TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET),%edx
 - andl %edi,%edx
 - jnz  sysret_careful
   CFI_REMEMBER_STATE
   /*
* sysretq will re-enable interrupts:
 @@ -449,49 +446,10 @@ sysret_check:
   USERGS_SYSRET64
  
   CFI_RESTORE_STATE
 - /* Handle reschedules */
 - /* edx: work, edi: workmask */
 -sysret_careful:
 - bt $TIF_NEED_RESCHED,%edx
 - jnc sysret_signal
 - TRACE_IRQS_ON
 - ENABLE_INTERRUPTS(CLBR_NONE)
 - pushq_cfi %rdi
 - SCHEDULE_USER
 - popq_cfi %rdi
 - jmp sysret_check
  
 - /* Handle a signal */
 -sysret_signal:
 - TRACE_IRQS_ON
 - ENABLE_INTERRUPTS(CLBR_NONE)
 -#ifdef CONFIG_AUDITSYSCALL
 - bt $TIF_SYSCALL_AUDIT,%edx
 - jc sysret_audit
 -#endif
 - /*
 -  * We have a signal, or exit tracing or single-step.
 -  * These all wind up with the iret return path anyway,
 -  * so just join that path right now.
 -  */
 +int_ret_from_sys_call_fixup:
   FIXUP_TOP_OF_STACK %r11, -ARGOFFSET
 - jmp int_check_syscall_exit_work
 -
 -#ifdef CONFIG_AUDITSYSCALL
 - /*
 -  * Return fast path for syscall audit.  Call __audit_syscall_exit()
 -  * directly and then jump back to the fast path with TIF_SYSCALL_AUDIT
 -  * masked off.
 -  */
 -sysret_audit:
 - movq RAX-ARGOFFSET(%rsp),%rsi   /* second arg, syscall return value */
 - cmpq $-MAX_ERRNO,%rsi   /* is it < -MAX_ERRNO? */
 - setbe %al   /* 1 if so, 0 if not */
 - movzbl %al,%edi /* zero-extend that into %edi */
 - call __audit_syscall_exit
 - movl $(_TIF_ALLWORK_MASK & ~_TIF_SYSCALL_AUDIT),%edi
 - jmp sysret_check
 -#endif   /* CONFIG_AUDITSYSCALL */
 + jmp int_ret_from_sys_call

Ok, dumb question: what is replacing that audit functionality now?
Or are we remaining with the single path down syscall_trace_leave()?
AFAICT, the intention is to direct everything to int_ret_from_sys_call
after the syscall has been done so that paths can get simplified. Which
is a good thing, AFAICT.

Acked-by: Borislav Petkov b...@suse.de

-- 
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.


[RFC PATCH] kvmtool: remove 8250 IRQ line reset on device_init

2015-01-09 Thread Andre Przywara
Currently we reset the KVM interrupt line on initializing the 8250
serial device emulation.
For ARM this creates a problem where we use the in-kernel IRQ chip
before having fully initialized it. But with the new kernel interface
we cannot finish the GIC initialization before we know the number of
used IRQs, so we have to wait until all devices have been created and
initialized.
Since the in-kernel GIC emulation resets the IRQ line anyway and also
QEMU gets away without resetting it, the easiest solution is to drop
the IRQ line reset.

Signed-off-by: Andre Przywara andre.przyw...@arm.com
---
Hi Pekka,

this patch is an easy fix for our problems with Marc's kvmtool tree
and 3.19-rc (setting number of IRQs after the GIC has been used
already). I see that this basically reverts:

 commit 2ca9e37193ca5f5df5726e0061dc749829295435
 Author: Pekka Enberg penb...@kernel.org
 Date:   Sun Jan 23 11:49:39 2011 +0200
 
 kvm,8250: Fix device initial state
 
 This patch fixes 8250 device initial state for registers and IRQ based
 on what Qemu does.

Do you (or does anyone) know of the issue that this patch fixed?
It is four years old, and from what I can see QEMU no longer does the
mentioned IRQ reset(?).
Reworking kvmtool to avoid this issue sounds rather painful.

Cheers,
Andre.

 tools/kvm/hw/serial.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/tools/kvm/hw/serial.c b/tools/kvm/hw/serial.c
index 270e6182..2f19ba8 100644
--- a/tools/kvm/hw/serial.c
+++ b/tools/kvm/hw/serial.c
@@ -406,7 +406,6 @@ static int serial8250__device_init(struct kvm *kvm, struct serial8250_device *dev)
 
	ioport__map_irq(&dev->irq);
	r = ioport__register(kvm, dev->iobase, &serial8250_ops, 8, dev);
-	kvm__irq_line(kvm, dev->irq, 0);
 
return r;
 }
-- 
1.7.9.5



Re: [PATCH 4/4] arm/arm64: KVM: use kernel mapping to perform invalidation on page fault

2015-01-09 Thread Peter Maydell
On 9 January 2015 at 14:16, Marc Zyngier marc.zyng...@arm.com wrote:
 On 09/01/15 13:03, Peter Maydell wrote:
 When we reset a cpu by re-calling KVM_ARM_VCPU_INIT, that doesn't
 mean we get a new VMID for it, though, does it? I thought that
 what causes the icache flush to happen for the reset guest is
 that we unmap all of stage 2 and then fault it back in, via
 this code. That works for PIPT (we flush the range) and for
 VIPT (we do a full icache flush), but at the moment for VIVT
 ASID tagged we assume we can do nothing, and I don't think that's
 right for this case (though it is right for faulted because
 page was swapped out and OK for page was written by DMA).

 When we reset the guest, we also turn its Icache off. Before
 turning it back on, the guest has to invalidate it (the ARM ARM doesn't
 seem to define the state of the cache out of reset).

But implementations are allowed to hit in the cache even
when the cache is disabled. In particular, setting the guest
SCTLR_EL1.I to 0 means I-side accesses are Normal Noncacheable
for stage 1 attributes and (v8 ARM ARM D3.4.6) these can
be held in the instruction cache. So the guest is required
to do an icache-invalidate for any instructions that it writes
*itself* (or DMAs in) even with the icache off. But it cannot
possibly do so for its own initial startup code, and it must
be the job of KVM to do that for it. (You can think of this as
the VCPU provided by KVM always invalidates the icache on reset
and does not require an impdef magic cache-init routine as
described by D3.4.4 if you like.)

-- PMM


Re: [v3 13/26] KVM: Define a new interface kvm_find_dest_vcpu() for VT-d PI

2015-01-09 Thread Paolo Bonzini


On 09/01/2015 16:12, Radim Krčmář wrote:
  The chipset doesn't support it. :(
 
 I meant that we need to recompute PI entries for lowest priority
 interrupts every time guest's TPR changes.
 
 Luckily, Linux doesn't use TPR, but other OS might be a reason to drop
 lowest priority from PI optimizations.  (Or make it more complicated.)

Doing vector hashing is a possibility as well.  I would like to know
what existing chipsets do in practice, then we can mimic it.

Paolo
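
(For reference, "vector hashing" here means deriving the destination
deterministically from the interrupt vector rather than round-robin
balancing; a toy sketch, not KVM code:)

	/* Pick one of the n lowest-priority candidates by hashing the
	 * vector: a given vector always lands on the same CPU, while
	 * different vectors spread across the set.
	 */
	static struct kvm_vcpu *pick_by_vector_hash(struct kvm_vcpu **cand,
						    int n, u8 vector)
	{
		return cand[vector % n];
	}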


Re: [PATCH] Flush TLB when D bit is manually changed.

2015-01-09 Thread Paolo Bonzini


On 09/01/2015 09:44, Kai Huang wrote:
 When software changes the D bit (either from 1 to 0, or 0 to 1), the
 corresponding TLB entry in the hardware won't be updated immediately. We should
 flush it to guarantee the consistency of the D bit between the TLB and the MMU
 page table in memory. This is required if some specific hardware feature uses
 D-bit status to do specific things.
 
 Sanity test was done on my machine with Intel processor.
 
 Signed-off-by: Kai Huang kai.hu...@linux.intel.com
 ---
  arch/x86/kvm/mmu.c | 12 ++++++++++++
  1 file changed, 12 insertions(+)
 
 diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
 index 978f402..1feac0c 100644
 --- a/arch/x86/kvm/mmu.c
 +++ b/arch/x86/kvm/mmu.c
 @@ -547,6 +547,11 @@ static bool spte_is_bit_cleared(u64 old_spte, u64 new_spte, u64 bit_mask)
   return (old_spte & bit_mask) && !(new_spte & bit_mask);
  }
  
 +static bool spte_is_bit_changed(u64 old_spte, u64 new_spte, u64 bit_mask)
 +{
 + return (old_spte & bit_mask) != (new_spte & bit_mask);
 +}
 +
  /* Rules for using mmu_spte_set:
   * Set the sptep from nonpresent to present.
   * Note: the sptep being assigned *must* be either not present
 @@ -597,6 +602,13 @@ static bool mmu_spte_update(u64 *sptep, u64 new_spte)
   if (!shadow_accessed_mask)
   return ret;
  
 + /*
 +  * We also need to flush the TLB when the D-bit is changed by software to
 +  * guarantee the D-bit consistency between the TLB and the MMU page table.
 +  */
 + if (spte_is_bit_changed(old_spte, new_spte, shadow_dirty_mask))

I think shadow_accessed_mask needs to be checked too.  I made the change
and applied the patch.

Paolo

 + ret = true;
 +
   if (spte_is_bit_cleared(old_spte, new_spte, shadow_accessed_mask))
   kvm_set_pfn_accessed(spte_to_pfn(old_spte));
   if (spte_is_bit_cleared(old_spte, new_spte, shadow_dirty_mask))
 
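(The change Paolo describes presumably amounts to widening the check to cover
both masks; a sketch of the applied form, since the final commit isn't quoted
here:)

	/* Flush if either the accessed or the dirty bit changed. */
	if (spte_is_bit_changed(old_spte, new_spte,
				shadow_accessed_mask | shadow_dirty_mask))
		ret = true;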



Re: [RESEND PATCH v15 07/11] KVM: arm: page logging 2nd stage fault handling

2015-01-09 Thread Christoffer Dall
On Thu, Jan 08, 2015 at 08:28:46AM -0800, Mario Smarduch wrote:
 On 01/08/2015 02:45 AM, Christoffer Dall wrote:
  On Wed, Jan 07, 2015 at 05:43:18PM -0800, Mario Smarduch wrote:
  Hi Christoffer,
    before going through your comments, I discovered that
  in 3.18.0-rc2 a generic __get_user_pages_fast()
  was implemented; ARM now picks this up. This causes
  gfn_to_pfn_prot() to return a meaningful 'writable'
  value for a read fault, provided the region is writable.
 
  Prior to that the weak version returned 0 and 'writable'
  had no optimization effect of setting the pte/pmd -> RW on
  a read fault.
 
  As a consequence dirty logging broke in 3.18, I was seeing
 Correction on this: proper __get_user_pages_fast()
 behavior exposed a bug in the page logging code.
 
  weird but very intermittent issues. I just put in the
  few additional lines to fix it, preventing pte RW (only R) on
  read faults while logging a writable region.
 
  On 01/07/2015 04:38 AM, Christoffer Dall wrote:
  On Wed, Dec 17, 2014 at 06:07:29PM -0800, Mario Smarduch wrote:
  This patch is a followup to the v15 patch series, with the following changes:
  - When clearing/dissolving a huge PMD, mark the huge page range dirty, since
    the state of the whole range is unknown. After the huge page is dissolved,
    dirty page logging is at page granularity.
 
  What is the sequence of events where you could have dirtied another page
  within the PMD range after the user initially requested dirty page
  logging?
 
  No, there is none. My issue was the start point for tracking dirty pages,
  and that would be the second call to dirty log read, not the first
  call after the initial write protect, where any page in the range can
  be assumed dirty. I'll remove this; not sure if there would be any
  use case to call dirty log only once.
 
  
  Calling dirty log once cannot give you anything meaningful, right?  You
  must assume all memory is 'dirty' at this point, no?
 
 There is the interval between KVM_MEM_LOG_DIRTY_PAGES and the first
 call to KVM_GET_DIRTY_LOG. Not sure of any use case; maybe enable
 logging, wait a while, do a dirty log read, then disable logging:
 you get an accumulated snapshot of dirty page activity.
 
ok, so from the time the user calls KVM_MEM_LOG_DIRTY_PAGES, then any
fault on any huge page will dissolve that huge page into pages, and each
dirty page will be logged accordingly for the first call to
KVM_GET_DIRTY_LOG, right?  What am I missing here?

-Christoffer


Re: [PATCH v15 11/11] KVM: arm/arm64: Add support to dissolve huge PUD

2015-01-09 Thread Christoffer Dall
On Thu, Jan 08, 2015 at 08:41:15AM -0800, Mario Smarduch wrote:

[...]

 
  I'm just thinking here, why do we need to check if we get a valid pud
  back here, but we don't need the equivalent check in dissolve_pmd from
  patch 7?
 
  kvm_pud_huge() doesn't check bit 0 for invalid entry, but
  pud_none() is not the right way to check either, maybe pud_bad()
  first. Nothing is done in patch 7 since the pmd is retrieved from
  stage2_get_pmd().
 
  
  hmmm, but stage2_get_pmd() can return a NULL pointer if you have the
  IOMAP flag set...
  
 
  I think the rationale is that it should never happen because we never
  call these functions with the logging and iomap flags at the same
  time...
 
  I'm a little lost here, not sure how it's related to the above.
  But I think a VFIO device will have a memslot and
  it would be possible to enable logging. But to what
  end I'm not sure.
 
  
  As I said above, if you call the set_s2pte function with the IOMAP and
  LOGGING flags set, then you'll end up in a situation where you can get a
  NULL pointer back from stage2_get_pmd() but you're never checking
  against that.
 
 I see what you're saying now.
  
  Now, this raises an interesting point: we have now added code that
  prevents faults from ever happening on device maps, but we are introducing
  a path here where the user can set logging on a memslot with device memory
  regions, which introduces write faults on such regions.  My gut feeling
  is that we should avoid that from ever happening, and not allow this
  function to be called with both flags set.
 
 Maybe kvm_arch_prepare_memory_region() can check if
 KVM_MEM_LOG_DIRTY_PAGES is being enabled for an IO region
 and disallow it.
 

Yeah, I think we need to add a check for that somewhere as part of this
series (patch 7 perhaps?).

-Christoffer
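
(Such a check could look roughly like the sketch below;
kvm_is_device_region() is a hypothetical predicate, not code from the series:)

	/* In kvm_arch_prepare_memory_region(): refuse to enable dirty
	 * logging on memslots backing device (IO) memory, so the logging
	 * fault path can never race with IOMAP mappings.
	 */
	if ((mem->flags & KVM_MEM_LOG_DIRTY_PAGES) &&
	    kvm_is_device_region(kvm, memslot))	/* hypothetical helper */
		return -EINVAL;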


Re: [PATCH] Flush TLB when D bit is manually changed.

2015-01-09 Thread Xiao Guangrong



On 01/09/2015 04:44 PM, Kai Huang wrote:

When software changes the D bit (either from 1 to 0, or 0 to 1), the
corresponding TLB entry in the hardware won't be updated immediately. We should
flush it to guarantee the consistency of the D bit between the TLB and the MMU
page table in memory. This is required if some specific hardware feature uses
D-bit status to do specific things.



Currently, A/D is lazily synced, and it does not hurt anything, since we
mark the page dirty after clearing the bit and the mmu-notifier can keep
coherence on the host (it is the guest's responsibility to sync TLBs when
changing the bit).

So, i am just curious what some specific hardware is. :)


[PATCH] Flush TLB when D bit is manually changed.

2015-01-09 Thread Kai Huang
When software changes the D bit (either from 1 to 0, or 0 to 1), the
corresponding TLB entry in the hardware won't be updated immediately. We should
flush it to guarantee the consistency of the D bit between the TLB and the MMU
page table in memory. This is required if some specific hardware feature uses
D-bit status to do specific things.

Sanity test was done on my machine with Intel processor.

Signed-off-by: Kai Huang kai.hu...@linux.intel.com
---
 arch/x86/kvm/mmu.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 978f402..1feac0c 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -547,6 +547,11 @@ static bool spte_is_bit_cleared(u64 old_spte, u64 new_spte, u64 bit_mask)
	return (old_spte & bit_mask) && !(new_spte & bit_mask);
 }
 
+static bool spte_is_bit_changed(u64 old_spte, u64 new_spte, u64 bit_mask)
+{
+	return (old_spte & bit_mask) != (new_spte & bit_mask);
+}
+
 /* Rules for using mmu_spte_set:
  * Set the sptep from nonpresent to present.
  * Note: the sptep being assigned *must* be either not present
@@ -597,6 +602,13 @@ static bool mmu_spte_update(u64 *sptep, u64 new_spte)
if (!shadow_accessed_mask)
return ret;
 
+   /*
+	 * We also need to flush the TLB when the D-bit is changed by software to
+	 * guarantee the D-bit consistency between the TLB and the MMU page table.
+*/
+   if (spte_is_bit_changed(old_spte, new_spte, shadow_dirty_mask))
+   ret = true;
+
if (spte_is_bit_cleared(old_spte, new_spte, shadow_accessed_mask))
kvm_set_pfn_accessed(spte_to_pfn(old_spte));
if (spte_is_bit_cleared(old_spte, new_spte, shadow_dirty_mask))
-- 
2.1.0




Re: [PATCH 4/4] arm/arm64: KVM: use kernel mapping to perform invalidation on page fault

2015-01-09 Thread Marc Zyngier
On 09/01/15 15:28, Peter Maydell wrote:
 On 9 January 2015 at 14:16, Marc Zyngier marc.zyng...@arm.com wrote:
 On 09/01/15 13:03, Peter Maydell wrote:
 When we reset a cpu by re-calling KVM_ARM_VCPU_INIT, that doesn't
 mean we get a new VMID for it, though, does it? I thought that
 what causes the icache flush to happen for the reset guest is
 that we unmap all of stage 2 and then fault it back in, via
 this code. That works for PIPT (we flush the range) and for
 VIPT (we do a full icache flush), but at the moment for VIVT
 ASID tagged we assume we can do nothing, and I don't think that's
 right for this case (though it is right for faulted because
 page was swapped out and OK for page was written by DMA).

 When we reset the guest, we also turn its Icache off. Before
 turning it back on, the guest has to invalidate it (the ARM ARM doesn't
 seem to define the state of the cache out of reset).
 
 But implementations are allowed to hit in the cache even
 when the cache is disabled. In particular, setting the guest
 SCTLR_EL1.I to 0 means I-side accesses are Normal Noncacheable
 for stage 1 attributes and (v8 ARM ARM D3.4.6) these can
 be held in the instruction cache. So the guest is required
 to do an icache-invalidate for any instructions that it writes
 *itself* (or DMAs in) even with the icache off. But it cannot
 possibly do so for its own initial startup code, and it must
 be the job of KVM to do that for it. (You can think of this as
 the VCPU provided by KVM always invalidates the icache on reset
 and does not require an impdef magic cache-init routine as
 described by D3.4.4 if you like.)

Right. I think I finally see your point. We've been safe so far in that no
ASID-tagged VIVT icache platform is virtualisation-capable, AFAIK.

Probably best to just nuke the icache always for non-PIPT caches, then.

I'll fold that in when I respin the series.
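
(On the 32-bit side that policy could look something like the sketch below in
the fault-path cache handling, assuming the existing arch/arm helpers are
kept; hva and size are the arguments of coherent_cache_guest_page():)

	/* Sketch: only PIPT icaches can be maintained by VA; for VIPT
	 * or ASID-tagged VIVT, nuke the whole icache rather than assume
	 * nothing needs doing.
	 */
	if (icache_is_pipt()) {
		flush_icache_range(hva, hva + size);
	} else {
		__flush_icache_all();
	}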

Thanks,

M.
-- 
Jazz is not dead. It just smells funny...


Re: [v3 13/26] KVM: Define a new interface kvm_find_dest_vcpu() for VT-d PI

2015-01-09 Thread Radim Krčmář
2015-01-09 15:56+0100, Paolo Bonzini:
 
 
 On 09/01/2015 15:54, Radim Krčmář wrote:
  There are two points relevant to this patch in new KVM's implementation,
  (KVM: x86: amend APIC lowest priority arbitration,
   https://lkml.org/lkml/2015/1/9/362)
  
  1) lowest priority depends on TPR
  2) there is no need for balancing
  
  (1) has to be considered with PI as well.
 
 The chipset doesn't support it. :(

I meant that we need to recompute PI entries for lowest priority
interrupts every time guest's TPR changes.

Luckily, Linux doesn't use TPR, but other OS might be a reason to drop
lowest priority from PI optimizations.  (Or make it more complicated.)

  I kept (2) to avoid whining from people building on that behaviour, but
  lowest priority backed by PI could be transparent without it.
  
  Patch below removes the balancing, but I am not sure this is a price we
  allowed ourselves to pay ... what are your opinions?
 
 I wouldn't mind, but it requires a lot of benchmarking.

(I was afraid it would come to that.)


Re: [v3 13/26] KVM: Define a new interface kvm_find_dest_vcpu() for VT-d PI

2015-01-09 Thread Radim Krčmář
2015-01-09 16:18+0100, Paolo Bonzini:
 On 09/01/2015 16:12, Radim Krčmář wrote:
   The chipset doesn't support it. :(
  
  I meant that we need to recompute PI entries for lowest priority
  interrupts every time guest's TPR changes.
  
  Luckily, Linux doesn't use TPR, but other OS might be a reason to drop
  lowest priority from PI optimizations.  (Or make it more complicated.)
 
 Doing vector hashing is a possibility as well.  I would like to know
 what existing chipsets do in practice, then we can mimic it.

When looking at /proc/interrupts from time to time, I have only seen
interrupts landing on the first CPU of the set.

We could also distinguish between AMD and Intel ...
AMD should deliver to the highest APIC ID.
(If we still need to decide after focus processor and APR checks.)

I'll try to check using a more trustworthy approach.


Re: [PATCH 3/3] x86_64,entry: Remove the syscall exit audit and schedule optimizations

2015-01-09 Thread Andy Lutomirski
On Fri, Jan 9, 2015 at 7:53 AM, Borislav Petkov b...@alien8.de wrote:
 On Fri, Nov 07, 2014 at 03:58:19PM -0800, Andy Lutomirski wrote:
 We used to optimize rescheduling and audit on syscall exit.  Now that
 the full slow path is reasonably fast, remove these optimizations.

 This adds something like 10ns to the previously optimized paths on my
 computer, presumably due mostly to SAVE_REST / RESTORE_REST.

 I think that we should eventually replace both the syscall and
 non-paranoid interrupt exit slow paths with a pair of C functions
 along the lines of the syscall entry hooks.

 Signed-off-by: Andy Lutomirski l...@amacapital.net
 ---
   arch/x86/kernel/entry_64.S | 52 +++++-----------------------------------------
  1 file changed, 5 insertions(+), 47 deletions(-)

 diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
 index a5afdf0f7fa4..222dc5c45ac3 100644
 --- a/arch/x86/kernel/entry_64.S
 +++ b/arch/x86/kernel/entry_64.S
 @@ -427,15 +427,12 @@ system_call_fastpath:
   * Has incomplete stack frame and undefined top of stack.
   */
  ret_from_sys_call:
 - movl $_TIF_ALLWORK_MASK,%edi
 - /* edi: flagmask */
 -sysret_check:
 + testl $_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
  + jnz int_ret_from_sys_call_fixup /* Go to the slow path */
 +
   LOCKDEP_SYS_EXIT
   DISABLE_INTERRUPTS(CLBR_NONE)
   TRACE_IRQS_OFF
 - movl TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET),%edx
 - andl %edi,%edx
 - jnz  sysret_careful
   CFI_REMEMBER_STATE
   /*
* sysretq will re-enable interrupts:
 @@ -449,49 +446,10 @@ sysret_check:
   USERGS_SYSRET64

   CFI_RESTORE_STATE
 - /* Handle reschedules */
 - /* edx: work, edi: workmask */
 -sysret_careful:
 - bt $TIF_NEED_RESCHED,%edx
 - jnc sysret_signal
 - TRACE_IRQS_ON
 - ENABLE_INTERRUPTS(CLBR_NONE)
 - pushq_cfi %rdi
 - SCHEDULE_USER
 - popq_cfi %rdi
 - jmp sysret_check

 - /* Handle a signal */
 -sysret_signal:
 - TRACE_IRQS_ON
 - ENABLE_INTERRUPTS(CLBR_NONE)
 -#ifdef CONFIG_AUDITSYSCALL
 - bt $TIF_SYSCALL_AUDIT,%edx
 - jc sysret_audit
 -#endif
 - /*
 -  * We have a signal, or exit tracing or single-step.
 -  * These all wind up with the iret return path anyway,
 -  * so just join that path right now.
 -  */
 +int_ret_from_sys_call_fixup:
   FIXUP_TOP_OF_STACK %r11, -ARGOFFSET
 - jmp int_check_syscall_exit_work
 -
 -#ifdef CONFIG_AUDITSYSCALL
 - /*
 -  * Return fast path for syscall audit.  Call __audit_syscall_exit()
 -  * directly and then jump back to the fast path with TIF_SYSCALL_AUDIT
 -  * masked off.
 -  */
 -sysret_audit:
 - movq RAX-ARGOFFSET(%rsp),%rsi   /* second arg, syscall return value */
  - cmpq $-MAX_ERRNO,%rsi   /* is it < -MAX_ERRNO? */
 - setbe %al   /* 1 if so, 0 if not */
 - movzbl %al,%edi /* zero-extend that into %edi */
 - call __audit_syscall_exit
  - movl $(_TIF_ALLWORK_MASK & ~_TIF_SYSCALL_AUDIT),%edi
 - jmp sysret_check
 -#endif   /* CONFIG_AUDITSYSCALL */
 + jmp int_ret_from_sys_call

 Ok, dumb question: what is replacing that audit functionality now?
 Or are we remaining with the single path down syscall_trace_leave()?
 AFAICT, the intention is to direct everything to int_ret_from_sys_call
 after the syscall has been done so that paths can get simplified. Which
 is a good thing, AFAICT.

Exactly.  int_ret_from_sys_call calls into the audit code in
syscall_trace_leave.  IIRC the current code sometimes calls the audit
exit hook twice, which seems like a bug to me.

--Andy
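
(In other words, the consolidated exit behaves roughly like this C
pseudo-version; the outer function name is a descriptive placeholder, not the
kernel's:)

	/* Every "work pending" exit funnels through one slow path, which
	 * reaches syscall_trace_leave(), and thus the audit exit hook,
	 * exactly once per syscall.
	 */
	void syscall_exit_slowpath(struct pt_regs *regs)
	{
		if (current_thread_info()->flags & _TIF_WORK_SYSCALL_EXIT)
			syscall_trace_leave(regs);	/* audit + tracing */
		/* then the usual reschedule/signal handling on the way out */
	}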


 Acked-by: Borislav Petkov b...@suse.de

 --
 Regards/Gruss,
 Boris.

 Sent from a fat crate under my desk. Formatting is fine.



-- 
Andy Lutomirski
AMA Capital Management, LLC


[PATCH] kvm: update_memslots: clean flags for invalid memslots

2015-01-09 Thread Tiejun Chen
Indeed, any invalid memslot should have new->npages = 0,
new->base_gfn = 0 and new->flags = 0 at the same time.

Signed-off-by: Tiejun Chen tiejun.c...@intel.com
---
Paolo,

This is just a small cleanup to follow up commit efbeec7098ee, "fix
sorting of memslots with base_gfn == 0".

Tiejun

 virt/kvm/kvm_main.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 1cc6e2e..369c759 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -673,6 +673,7 @@ static void update_memslots(struct kvm_memslots *slots,
	if (!new->npages) {
		WARN_ON(!mslots[i].npages);
		new->base_gfn = 0;
+		new->flags = 0;
		if (mslots[i].npages)
			slots->used_slots--;
} else {
-- 
1.9.1



Re: [PATCH 2/4] arm/arm64: KVM: Use set/way op trapping to track the state of the caches

2015-01-09 Thread Marc Zyngier
On 09/01/15 11:19, Christoffer Dall wrote:
 On Thu, Jan 08, 2015 at 11:59:07AM +, Marc Zyngier wrote:
 Trying to emulate the behaviour of set/way cache ops is fairly
 pointless, as there are too many ways we can end-up missing stuff.
 Also, there are some system caches out there that simply ignore
 set/way operations.

 So instead of trying to implement them, let's convert it to VA ops,
 and use them as a way to re-enable the trapping of VM ops. That way,
 we can detect the point when the MMU/caches are turned off, and do
 a full VM flush (which is what the guest was trying to do anyway).

 This allows a 32bit zImage to boot on the APM thingy, and will
 probably help bootloaders in general.

 Signed-off-by: Marc Zyngier marc.zyng...@arm.com
 ---
  arch/arm/include/asm/kvm_host.h   |   3 ---
  arch/arm/kvm/arm.c                |  10 ----------
  arch/arm/kvm/coproc.c             |  90 +++++++++++++++++++++++++-------------
  arch/arm64/include/asm/kvm_host.h |   3 ---
  arch/arm64/kvm/sys_regs.c         | 102 ++++++++++++++++++++++++++------------
  5 files changed, 116 insertions(+), 92 deletions(-)

 diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h
 index 254e065..04b4ea0 100644
 --- a/arch/arm/include/asm/kvm_host.h
 +++ b/arch/arm/include/asm/kvm_host.h
 @@ -125,9 +125,6 @@ struct kvm_vcpu_arch {
* Anything that is not used directly from assembly code goes
* here.
*/
 - /* dcache set/way operation pending */
 - int last_pcpu;
 - cpumask_t require_dcache_flush;

   /* Don't run the guest on this vcpu */
   bool pause;
 diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
 index 2d6d910..0b0d58a 100644
 --- a/arch/arm/kvm/arm.c
 +++ b/arch/arm/kvm/arm.c
 @@ -281,15 +281,6 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
	vcpu->cpu = cpu;
	vcpu->arch.host_cpu_context = this_cpu_ptr(&kvm_host_cpu_state);

 - /*
 -  * Check whether this vcpu requires the cache to be flushed on
 -  * this physical CPU. This is a consequence of doing dcache
 -  * operations by set/way on this vcpu. We do it here to be in
 -  * a non-preemptible section.
 -  */
 - if (cpumask_test_and_clear_cpu(cpu, &vcpu->arch.require_dcache_flush))
 -		flush_cache_all(); /* We'd really want v7_flush_dcache_all() */
 -
   kvm_arm_set_running_vcpu(vcpu);
  }

 @@ -541,7 +532,6 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *run)
   ret = kvm_call_hyp(__kvm_vcpu_run, vcpu);

	vcpu->mode = OUTSIDE_GUEST_MODE;
 - vcpu->arch.last_pcpu = smp_processor_id();
   kvm_guest_exit();
   trace_kvm_exit(*vcpu_pc(vcpu));
   /*
 diff --git a/arch/arm/kvm/coproc.c b/arch/arm/kvm/coproc.c
 index 7928dbd..3d352a5 100644
 --- a/arch/arm/kvm/coproc.c
 +++ b/arch/arm/kvm/coproc.c
 @@ -189,43 +189,57 @@ static bool access_l2ectlr(struct kvm_vcpu *vcpu,
   return true;
  }

 -/* See note at ARM ARM B1.14.4 */
 +/*
 + * See note at ARMv7 ARM B1.14.4 (TL;DR: S/W ops are not easily virtualized).
 + *
 + * Main problems:
 + * - S/W ops are local to a CPU (not broadcast)
 + * - We have line migration behind our back (speculation)
 + * - System caches don't support S/W at all (damn!)
 + *
 + * In the face of the above, the best we can do is to try and convert
 + * S/W ops to VA ops. Because the guest is not allowed to infer the
 + * S/W to PA mapping, it can only use S/W to nuke the whole cache,
 + * which is a rather good thing for us.
 + *
 + * Also, it is only used when turning caches on/off ("The expected
 + * usage of the cache maintenance instructions that operate by set/way
 + * is associated with the cache maintenance instructions associated
 + * with the powerdown and powerup of caches, if this is required by
 + * the implementation.").
 + *
 + * We use the following policy:
 + *
 + * - If we trap a S/W operation, we enable VM trapping to detect
 + *   caches being turned on/off.
 + *
 + * - If the caches have already been turned off when doing the S/W op,
 + *   we nuke the whole VM cache.
 + *
 + * - We flush the cache on both caches being turned on and off.
 + *
 + * - Once the caches are enabled, we stop trapping VM ops.
 + */
  static bool access_dcsw(struct kvm_vcpu *vcpu,
   const struct coproc_params *p,
   const struct coproc_reg *r)
  {
 - unsigned long val;
 - int cpu;
 -
	if (!p->is_write)
   return read_from_write_only(vcpu, p);

 - cpu = get_cpu();
 -
 - cpumask_setall(&vcpu->arch.require_dcache_flush);
 - cpumask_clear_cpu(cpu, &vcpu->arch.require_dcache_flush);
 -
 - /* If we were already preempted, take the long way around */
 - if (cpu != vcpu->arch.last_pcpu) {
 - flush_cache_all();
 - goto done;
 - }
 -
 - val = *vcpu_reg(vcpu, p->Rt1);
 -
 - switch (p->CRm) {
 - case 6: /* Upgrade DCISW
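
The archive truncates the hunk here. For context, a hedged sketch of the shape
access_dcsw() takes under the policy described in the comment above
(kvm_set_way_flush() is the helper this series introduces; the respin may
differ in detail):

	static bool access_dcsw(struct kvm_vcpu *vcpu,
				const struct coproc_params *p,
				const struct coproc_reg *r)
	{
		if (!p->is_write)
			return read_from_write_only(vcpu, p);

		/* Upgrade the S/W op to a (deferred) full VM flush and
		 * re-arm trapping of the VM registers so we catch the
		 * caches being turned on or off.
		 */
		kvm_set_way_flush(vcpu);
		return true;
	}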

Re: [PATCH 2/4] arm/arm64: KVM: Use set/way op trapping to track the state of the caches

2015-01-09 Thread Christoffer Dall
On Fri, Jan 09, 2015 at 11:38:19AM +, Marc Zyngier wrote:

[...]

  @@ -258,12 +272,24 @@ bool access_sctlr(struct kvm_vcpu *vcpu,
  const struct coproc_params *p,
  const struct coproc_reg *r)
   {
  + bool was_enabled = vcpu_has_cache_enabled(vcpu);
  + bool now_enabled;
  +
access_vm_reg(vcpu, p, r);
 
  - if (vcpu_has_cache_enabled(vcpu)) { /* MMU+Caches enabled? */
  - vcpu->arch.hcr &= ~HCR_TVM;
  + now_enabled = vcpu_has_cache_enabled(vcpu);
  +
  + /*
  +  * If switching the MMU+caches on, need to invalidate the caches.
  +  * If switching it off, need to clean the caches.
  +  * Clean + invalidate does the trick always.
  
  couldn't a clean here then be writing bogus to RAM on an initial boot
  of the guest?
 
 It shouldn't. We're supposed to start a VM with a fully invalid cache
 (that's why we call coherent_cache_guest_page). And this is no different
 from what we had before, as an invalidate by set/way was upgraded to
 clean+invalidate.
 

right, that makes sense.

Thanks,
-Christoffer


Re: [PATCH 3/4] arm/arm64: KVM: Flush caches to memory on unmap

2015-01-09 Thread Christoffer Dall
On Thu, Jan 08, 2015 at 11:59:08AM +, Marc Zyngier wrote:
 Let's assume a guest has created an uncached mapping, and written
 to that page. Let's also assume that the host uses a cache-coherent
 IO subsystem. Let's finally assume that the host is under memory
 pressure and starts to swap things out.
 
 Before this uncached page is evicted, we need to make sure it
 gets invalidated, or the IO subsystem is going to swap out the
 cached view, losing the data that has been written there.
 
 Signed-off-by: Marc Zyngier marc.zyng...@arm.com
 ---
  arch/arm/include/asm/kvm_mmu.h   | 31 +++++++++++++++++++++++++++++++
  arch/arm/kvm/mmu.c               | 46 ++++++++++++++++++++++++++++++++++++++++-
  arch/arm64/include/asm/kvm_mmu.h | 18 ++++++++++++++++++
  3 files changed, 80 insertions(+), 15 deletions(-)
 
 diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h
 index 63e0ecc..7ceb836 100644
 --- a/arch/arm/include/asm/kvm_mmu.h
 +++ b/arch/arm/include/asm/kvm_mmu.h
 @@ -44,6 +44,7 @@
  
  #ifndef __ASSEMBLY__
  
 +#include linux/highmem.h
  #include asm/cacheflush.h
  #include asm/pgalloc.h
  
 @@ -190,6 +191,36 @@ static inline void coherent_cache_guest_page(struct kvm_vcpu *vcpu, hva_t hva,
  
  #define kvm_virt_to_phys(x)  virt_to_idmap((unsigned long)(x))
  
 +static inline void __kvm_flush_dcache_pte(pte_t pte)
 +{
 + void *va = kmap_atomic(pte_page(pte));
 +
 + kvm_flush_dcache_to_poc(va, PAGE_SIZE);
 +
 + kunmap_atomic(va);
 +}
 +
 +static inline void __kvm_flush_dcache_pmd(pmd_t pmd)
 +{
 + unsigned long size = PMD_SIZE;
 + pfn_t pfn = pmd_pfn(pmd);
 +
 + while (size) {
 + void *va = kmap_atomic_pfn(pfn);
 +
 + kvm_flush_dcache_to_poc(va, PAGE_SIZE);
 +
 + pfn++;
 + size -= PAGE_SIZE;
 +
 + kunmap_atomic(va);
 + }
 +}
 +
 +static inline void __kvm_flush_dcache_pud(pud_t pud)
 +{
 +}
 +
  void stage2_flush_vm(struct kvm *kvm);
  
  #endif   /* !__ASSEMBLY__ */
 diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
 index 1dc9778..1f5b793 100644
 --- a/arch/arm/kvm/mmu.c
 +++ b/arch/arm/kvm/mmu.c
 @@ -58,6 +58,21 @@ static void kvm_tlb_flush_vmid_ipa(struct kvm *kvm, phys_addr_t ipa)
   kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, kvm, ipa);
  }
  
 +static void kvm_flush_dcache_pte(pte_t pte)
 +{
 + __kvm_flush_dcache_pte(pte);
 +}
 +
 +static void kvm_flush_dcache_pmd(pmd_t pmd)
 +{
 + __kvm_flush_dcache_pmd(pmd);
 +}
 +
 +static void kvm_flush_dcache_pud(pud_t pud)
 +{
 + __kvm_flush_dcache_pud(pud);
 +}
 +
  static int mmu_topup_memory_cache(struct kvm_mmu_memory_cache *cache,
 int min, int max)
  {
 @@ -128,9 +143,12 @@ static void unmap_ptes(struct kvm *kvm, pmd_t *pmd,
   start_pte = pte = pte_offset_kernel(pmd, addr);
   do {
   if (!pte_none(*pte)) {
 + pte_t old_pte = *pte;
   kvm_set_pte(pte, __pte(0));
 - put_page(virt_to_page(pte));

was this a bug beforehand that we released the page before we flushed
the TLB?

   kvm_tlb_flush_vmid_ipa(kvm, addr);
 +		if ((pte_val(old_pte) & PAGE_S2_DEVICE) != PAGE_S2_DEVICE)
 +			kvm_flush_dcache_pte(old_pte);

this is confusing me: we are only flushing the cache for cached stage-2
mappings?  Weren't we trying to flush the cache for uncached mappings?
(we obviously also need to flush a cached stage-2 mapping where the
guest maps it as uncached, but we don't know that easily).

Am I missing something completely here?

In any case we probably need a comment here explaining this.

 + put_page(virt_to_page(pte));
   }
   } while (pte++, addr += PAGE_SIZE, addr != end);
  
 @@ -149,8 +167,10 @@ static void unmap_pmds(struct kvm *kvm, pud_t *pud,
   next = kvm_pmd_addr_end(addr, end);
   if (!pmd_none(*pmd)) {
   if (kvm_pmd_huge(*pmd)) {
 + pmd_t old_pmd = *pmd;
   pmd_clear(pmd);
   kvm_tlb_flush_vmid_ipa(kvm, addr);
 + kvm_flush_dcache_pmd(old_pmd);
   put_page(virt_to_page(pmd));
   } else {
   unmap_ptes(kvm, pmd, addr, next);
 @@ -173,8 +193,10 @@ static void unmap_puds(struct kvm *kvm, pgd_t *pgd,
   next = kvm_pud_addr_end(addr, end);
   if (!pud_none(*pud)) {
   if (pud_huge(*pud)) {
 + pud_t old_pud = *pud;
   pud_clear(pud);
   kvm_tlb_flush_vmid_ipa(kvm, addr);
 + kvm_flush_dcache_pud(old_pud);
   put_page(virt_to_page(pud));
   } else {
  

Re: [PATCH 4/4] arm/arm64: KVM: use kernel mapping to perform invalidation on page fault

2015-01-09 Thread Marc Zyngier
On 09/01/15 13:03, Peter Maydell wrote:
 On 9 January 2015 at 12:50, Christoffer Dall
 christoffer.d...@linaro.org wrote:
 On Thu, Jan 08, 2015 at 03:21:50PM +, Peter Maydell wrote:
 If this is the first instruction in the guest (ie we've just
 (warm) reset the VM and are running the kernel as loaded into the guest
 by QEMU/kvmtool) then the guest can't have invalidated the icache,
 and QEMU can't do the invalidate because it doesn't have the vaddr
 and VMID of the guest.

 The guest must clean its icache before turning on the MMU, no?
 
 Yes, but to execute the clean icache insn inside the guest,
 the guest is executing instructions, so we'd better not have
 stale info for that code...

But we never start a guest with caches on. It has to enable them on its own.

 Whenever we reuse a VMID (rollover), we flush the entire icache for that
 vmid.
 
 When we reset a cpu by re-calling KVM_ARM_VCPU_INIT, that doesn't
 mean we get a new VMID for it, though, does it? I thought that
 what causes the icache flush to happen for the reset guest is
 that we unmap all of stage 2 and then fault it back in, via
 this code. That works for PIPT (we flush the range) and for
 VIPT (we do a full icache flush), but at the moment for VIVT
 ASID tagged we assume we can do nothing, and I don't think that's
 right for this case (though it is right for faulted because
 page was swapped out and OK for page was written by DMA).

When we reset the guest, we also turn its Icache off. Before
turning it back on, the guest has to invalidate it (the ARM ARM doesn't
seem to define the state of the cache out of reset).

M.
-- 
Jazz is not dead. It just smells funny...


Re: [PATCH 3/4] arm/arm64: KVM: Flush caches to memory on unmap

2015-01-09 Thread Marc Zyngier
On 09/01/15 12:30, Christoffer Dall wrote:
 On Thu, Jan 08, 2015 at 11:59:08AM +, Marc Zyngier wrote:
 Let's assume a guest has created an uncached mapping, and written
 to that page. Let's also assume that the host uses a cache-coherent
 IO subsystem. Let's finally assume that the host is under memory
 pressure and starts to swap things out.

 Before this uncached page is evicted, we need to make sure it
 gets invalidated, or the IO subsystem is going to swap out the
 cached view, losing the data that has been written there.

 Signed-off-by: Marc Zyngier marc.zyng...@arm.com
 ---
  arch/arm/include/asm/kvm_mmu.h   | 31 +++++++++++++++++++++++++++++++
  arch/arm/kvm/mmu.c               | 46 ++++++++++++++++++++++++++++++++++++++++-
  arch/arm64/include/asm/kvm_mmu.h | 18 ++++++++++++++++++
  3 files changed, 80 insertions(+), 15 deletions(-)

 diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h
 index 63e0ecc..7ceb836 100644
 --- a/arch/arm/include/asm/kvm_mmu.h
 +++ b/arch/arm/include/asm/kvm_mmu.h
 @@ -44,6 +44,7 @@
  
  #ifndef __ASSEMBLY__
  
 +#include linux/highmem.h
  #include asm/cacheflush.h
  #include asm/pgalloc.h
  
 @@ -190,6 +191,36 @@ static inline void coherent_cache_guest_page(struct kvm_vcpu *vcpu, hva_t hva,
  
  #define kvm_virt_to_phys(x) virt_to_idmap((unsigned long)(x))
  
 +static inline void __kvm_flush_dcache_pte(pte_t pte)
 +{
 +void *va = kmap_atomic(pte_page(pte));
 +
 +kvm_flush_dcache_to_poc(va, PAGE_SIZE);
 +
 +kunmap_atomic(va);
 +}
 +
 +static inline void __kvm_flush_dcache_pmd(pmd_t pmd)
 +{
 +unsigned long size = PMD_SIZE;
 +pfn_t pfn = pmd_pfn(pmd);
 +
 +while (size) {
 +void *va = kmap_atomic_pfn(pfn);
 +
 +kvm_flush_dcache_to_poc(va, PAGE_SIZE);
 +
 +pfn++;
 +size -= PAGE_SIZE;
 +
 +kunmap_atomic(va);
 +}
 +}
 +
 +static inline void __kvm_flush_dcache_pud(pud_t pud)
 +{
 +}
 +
  void stage2_flush_vm(struct kvm *kvm);
  
  #endif  /* !__ASSEMBLY__ */
 diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
 index 1dc9778..1f5b793 100644
 --- a/arch/arm/kvm/mmu.c
 +++ b/arch/arm/kvm/mmu.c
 @@ -58,6 +58,21 @@ static void kvm_tlb_flush_vmid_ipa(struct kvm *kvm, phys_addr_t ipa)
  kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, kvm, ipa);
  }
  
 +static void kvm_flush_dcache_pte(pte_t pte)
 +{
 +__kvm_flush_dcache_pte(pte);
 +}
 +
 +static void kvm_flush_dcache_pmd(pmd_t pmd)
 +{
 +__kvm_flush_dcache_pmd(pmd);
 +}
 +
 +static void kvm_flush_dcache_pud(pud_t pud)
 +{
 +__kvm_flush_dcache_pud(pud);
 +}
 +
  static int mmu_topup_memory_cache(struct kvm_mmu_memory_cache *cache,
int min, int max)
  {
 @@ -128,9 +143,12 @@ static void unmap_ptes(struct kvm *kvm, pmd_t *pmd,
  start_pte = pte = pte_offset_kernel(pmd, addr);
  do {
  if (!pte_none(*pte)) {
 +pte_t old_pte = *pte;
  kvm_set_pte(pte, __pte(0));
 -put_page(virt_to_page(pte));
 
 was this a bug beforehand that we released the page before we flushed
 the TLB?

I don't think so. TLB maintenance doesn't require the mapping to exist
in the page tables (while cache maintenance does).

  kvm_tlb_flush_vmid_ipa(kvm, addr);
  +		if ((pte_val(old_pte) & PAGE_S2_DEVICE) != PAGE_S2_DEVICE)
  +			kvm_flush_dcache_pte(old_pte);
 
  this is confusing me: we are only flushing the cache for cached stage-2
  mappings?  Weren't we trying to flush the cache for uncached mappings?
  (we obviously also need to flush a cached stage-2 mapping where the
  guest maps it as uncached, but we don't know that easily).
 
 Am I missing something completely here?

I think you must be misreading something:
- we want to invalidate mappings because the guest may have created an
uncached mapping
- as we cannot know about the guest's uncached mappings, we flush things
unconditionally (basically anything that is RAM).
- we avoid flushing devices because it is pointless (and the kernel
doesn't have a linear mapping for those).

Is that clearer? Or is it me that is misunderstanding your concerns
(entirely possible, I'm not at my best right now...).

Thanks,

M.
-- 
Jazz is not dead. It just smells funny...


[PATCH] KVM: x86: amend APIC lowest priority arbitration

2015-01-09 Thread Radim Krčmář
Lowest priority should take the task priority into account.

SDM 10.6.2.4 Lowest Priority Delivery Mode.
(Too long to quote; second and last paragraphs are relevant.)

Before this patch, we strived to have the same amount of handled
lowest-priority interrupts on all VCPUs.
This is only a complication, but kept for compatibility.

Real modern Intels can't send lowest priority IPIs and the chipset
directs external ones using processors' TPR.
AMD still has rough edges.

Signed-off-by: Radim Krčmář rkrc...@redhat.com
---
 arch/x86/kvm/lapic.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index a688fbffb34e..5b9d8c589bba 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -833,7 +833,15 @@ static int __apic_accept_irq(struct kvm_lapic *apic, int delivery_mode,
 
 int kvm_apic_compare_prio(struct kvm_vcpu *vcpu1, struct kvm_vcpu *vcpu2)
 {
-	return vcpu1->arch.apic_arb_prio - vcpu2->arch.apic_arb_prio;
+   /* XXX: AMD (2:16.6.2 Lowest Priority Messages and Arbitration)
+*  - uses the APR register (which also considers ISR and IRR),
+*  - chooses the highest APIC ID when APRs are identical,
+*  - and allows a focus processor.
+* XXX: pseudo-balancing with apic_arb_prio is a KVM-specific feature
+*/
+	int tpr = kvm_apic_get_reg(vcpu1->arch.apic, APIC_TASKPRI) -
+		  kvm_apic_get_reg(vcpu2->arch.apic, APIC_TASKPRI);
+	return tpr ? : vcpu1->arch.apic_arb_prio - vcpu2->arch.apic_arb_prio;
 }
 
 static void kvm_ioapic_send_eoi(struct kvm_lapic *apic, int vector)
-- 
2.2.0



RE: [v3 00/26] Add VT-d Posted-Interrupts support

2015-01-09 Thread Wu, Feng


 -Original Message-
 From: j...@8bytes.org [mailto:j...@8bytes.org]
 Sent: Friday, January 09, 2015 8:46 PM
 To: Wu, Feng
 Cc: t...@linutronix.de; mi...@redhat.com; h...@zytor.com; x...@kernel.org;
 g...@kernel.org; pbonz...@redhat.com; dw...@infradead.org;
 alex.william...@redhat.com; jiang@linux.intel.com; eric.au...@linaro.org;
 linux-ker...@vger.kernel.org; io...@lists.linux-foundation.org;
 kvm@vger.kernel.org
 Subject: Re: [v3 00/26] Add VT-d Posted-Interrupts support
 
 Hi Feng,
 
 On Tue, Jan 06, 2015 at 01:10:19AM +, Wu, Feng wrote:
  Ping...
 
  Hi Joerg & David,
 
  Could you please have a look at the IOMMU part of this series
  (patch 02 - 04, patch 06 - 09, patch 26)?
 
  Hi Thomas, Ingo, & Peter,
 
  Could you please have a look at this series, especially at patches
  01, 05, and 21?
 
 I fear this conflicts somewhat with the irq-domain patches from Jiang
 Liu. Once we worked this out (should happen soon) I'll have a look at
 the VT-d posted interrupt patches.
 
 Regards,
 
   Joerg

Thanks a lot for your feedback on this. In fact, my patches are based on
Jiang Liu's irq-domain patches and have been discussed with Jiang offline,
so I don't think they will conflict with Jiang's work. Anyway, any comments
are welcome!

Thanks,
Feng



Re: [PATCH 4/4] arm/arm64: KVM: use kernel mapping to perform invalidation on page fault

2015-01-09 Thread Christoffer Dall
On Thu, Jan 08, 2015 at 11:59:09AM +, Marc Zyngier wrote:
 When handling a fault in stage-2, we need to resync I$ and D$, just
 to be sure we don't leave any old cache line behind.
 
 That's very good, except that we do so using the *user* address.
 Under heavy load (swapping like crazy), we may end up in a situation
 where the page gets mapped in stage-2 while being unmapped from
 userspace by another CPU.
 
 At that point, the DC/IC instructions can generate a fault, which
 we handle with kvm->mmu_lock held. The box quickly deadlocks, user
 is unhappy.
 
 Instead, perform this invalidation through the kernel mapping,
 which is guaranteed to be present. The box is much happier, and so
 am I.
 
 Signed-off-by: Marc Zyngier marc.zyng...@arm.com
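
(For readers skimming the archive, the shape of the fix is roughly the
sketch below. It is illustrative only: kvm_flush_range() is a stand-in
name for the existing VA-range cache maintenance helper, and pfn is the
page being mapped at stage-2.

	/* flush via a kernel mapping that is guaranteed to be present */
	void *va = kmap_atomic(pfn_to_page(pfn));
	kvm_flush_range((unsigned long)va, (unsigned long)va + PAGE_SIZE);
	kunmap_atomic(va);

The kernel mapping cannot be torn down under us by another CPU, which
is the whole point of the change.)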

This looks good to me!

Thanks,
-Christoffer


[virtio] dummy device

2015-01-09 Thread Vasile Catalin-B50542

Hi,

I'm trying to add a new virtio device.
I've managed to make qemu connect my virtio device to the pci bus.
I'm now trying to make a dummy driver in the guest's kernel.
My lspci binary isn't very verbose. For example, the outputs of
lspci and lspci -n are the same:
00:00.0 Class 0604: 1957:0030
00:01.0 Class 0200: 1af4:1000
00:02.0 Class 0b40: 1af4:100d
(That's all I get from them.)
My device is on the last line.
I have tried another approach to see if there is a driver doing
something with my device (at least probing it):
ls /sys/bus/pci/devices/\:00\:02.0/driver/module/drivers/
It outputs:
pci:virtio-pci
Should I see the name of my device/driver module here, or is this what
I would see for any device on an emulated machine with a virtio-PCI
based bus?
I currently just defined the struct virtio_driver with its related
things, and called:

module_virtio_driver();
MODULE_DEVICE_TABLE();
Is there anything left to do in order to have a functional dummy device?
Is there a better way to see the name of the module associated with a 
PCI device (virtio device)?
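
(A minimal sketch of such a dummy driver, for reference. Everything
below is an assumption, not code from this thread: the virtio device ID
13 is inferred from the transitional PCI ID 1af4:100d shown above, i.e.
0x100d - 0x1000, and all names are made up.)

	#include <linux/module.h>
	#include <linux/virtio.h>
	#include <linux/virtio_config.h>

	#define VIRTIO_ID_DUMMY 13	/* assumed: 1af4:100d => 0x100d - 0x1000 */

	static const struct virtio_device_id id_table[] = {
		{ VIRTIO_ID_DUMMY, VIRTIO_DEV_ANY_ID },
		{ 0 },
	};
	MODULE_DEVICE_TABLE(virtio, id_table);

	static int dummy_probe(struct virtio_device *vdev)
	{
		dev_info(&vdev->dev, "dummy virtio device probed\n");
		return 0;
	}

	static void dummy_remove(struct virtio_device *vdev)
	{
		dev_info(&vdev->dev, "dummy virtio device removed\n");
	}

	static struct virtio_driver dummy_driver = {
		.driver.name	= KBUILD_MODNAME,
		.driver.owner	= THIS_MODULE,
		.id_table	= id_table,
		.probe		= dummy_probe,
		.remove		= dummy_remove,
	};
	module_virtio_driver(dummy_driver);

	MODULE_LICENSE("GPL");

If the probe message shows up in dmesg, the match worked, and the
driver should then appear under /sys/bus/virtio/drivers/.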



Re: [v3 00/26] Add VT-d Posted-Interrupts support

2015-01-09 Thread j...@8bytes.org
Hi Feng,

On Tue, Jan 06, 2015 at 01:10:19AM +, Wu, Feng wrote:
 Ping...
 
 Hi Joerg & David,
 
 Could you please have a look at the IOMMU part of this series (patch 02 - 04, patch 06 - 09, patch 26)?
 
 Hi Thomas, Ingo, & Peter,
 
 Could you please have a look at this series, especially for patch 01, 05, 21?

I fear this conflicts somewhat with the irq-domain patches from Jiang
Liu. Once we worked this out (should happen soon) I'll have a look at
the VT-d posted interrupt patches.

Regards,

Joerg



Re: [PATCH 4/4] arm/arm64: KVM: use kernel mapping to perform invalidation on page fault

2015-01-09 Thread Christoffer Dall
On Thu, Jan 08, 2015 at 03:21:50PM +, Peter Maydell wrote:
 On 8 January 2015 at 15:06, Marc Zyngier marc.zyng...@arm.com wrote:
  On 08/01/15 13:16, Peter Maydell wrote:
  ASID cached VIVT icaches are also VMID tagged. It is thus impossible for
  stale cache lines to come with a new page. And if by synchronizing the
  caches you obtain a different instruction stream, it means you've
  restored the wrong page.
 
  ...is that true even if the dirty data in the dcache comes from
  the userspace process doing DMA or writing the initial boot
  image or whatever?
 
  We perform this on a page that is being brought in stage-2. Two cases:
 
  - This is a page mapped for the first time: the icache should be
  invalid for this page (the guest should have invalidated it in the
  first place),
 
 If this is the first instruction in the guest (ie we've just
 (warm) reset the VM and are running the kernel as loaded into the guest
 by QEMU/kvmtool) then the guest can't have invalidated the icache,
 and QEMU can't do the invalidate because it doesn't have the vaddr
 and VMID of the guest.
 
The guest must clean its icache before turning on the MMU, no?

Whenever we reuse a VMID (rollover), we flush the entire icache for that
vmid.

-Christoffer


Re: [PATCH 4/4] arm/arm64: KVM: use kernel mapping to perform invalidation on page fault

2015-01-09 Thread Peter Maydell
On 9 January 2015 at 12:50, Christoffer Dall
christoffer.d...@linaro.org wrote:
 On Thu, Jan 08, 2015 at 03:21:50PM +, Peter Maydell wrote:
 If this is the first instruction in the guest (ie we've just
 (warm) reset the VM and are running the kernel as loaded into the guest
 by QEMU/kvmtool) then the guest can't have invalidated the icache,
 and QEMU can't do the invalidate because it doesn't have the vaddr
 and VMID of the guest.

 The guest must clean its icache before turning on the MMU, no?

Yes, but to execute the clean icache insn inside the guest,
the guest is executing instructions, so we'd better not have
stale info for that code...

 Whenever we reuse a VMID (rollover), we flush the entire icache for that
 vmid.

When we reset a cpu by re-calling KVM_ARM_VCPU_INIT, that doesn't
mean we get a new VMID for it, though, does it? I thought that
what causes the icache flush to happen for the reset guest is
that we unmap all of stage 2 and then fault it back in, via
this code. That works for PIPT (we flush the range) and for
VIPT (we do a full icache flush), but at the moment for VIVT
ASID tagged we assume we can do nothing, and I don't think that's
right for this case (though it is right for "faulted because
page was swapped out" and OK for "page was written by DMA").
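
(For context, the per-cache-type policy being debated looks roughly
like this sketch, modelled loosely on the arm
coherent_cache_guest_page() logic of that era. hva/size describe the
faulted range, and coherent_icache_range() is a stand-in name:

	static void sketch_icache_policy(unsigned long hva, unsigned long size)
	{
		if (icache_is_pipt()) {
			/* PIPT: invalidating the faulted range by VA is enough */
			coherent_icache_range(hva, hva + size);
		} else if (!icache_is_vivt_asid_tagged()) {
			/* any VIPT icache: flush the whole thing */
			__flush_icache_all();
		}
		/* ASID-tagged VIVT: assumed safe -- the case questioned above */
	}
)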

-- PMM


Re: [PATCH 2/3] x86_64,entry: Use sysret to return to userspace when possible

2015-01-09 Thread Borislav Petkov
On Fri, Nov 07, 2014 at 03:58:18PM -0800, Andy Lutomirski wrote:
 + /*
 +  * Try to use SYSRET instead of IRET if we're returning to
 +  * a completely clean 64-bit userspace context.
 +  */
 + movq (RCX-R11)(%rsp), %rcx
 + cmpq %rcx,(RIP-R11)(%rsp)   /* RCX == RIP */
 + jne opportunistic_sysret_failed
 +
 + /*
 +  * On Intel CPUs, sysret with non-canonical RCX/RIP will #GP
 +  * in kernel space.  This essentially lets the user take over
 +  * the kernel, since userspace controls RSP.  It's not worth
 +  * testing for canonicalness exactly -- this check detects any
 +  * of the 17 high bits set, which is true for non-canonical
 +  * or kernel addresses.  (This will pessimize vsyscall=native.
 +  * Big deal.)
 +  */
 + shr $47, %rcx
 + jnz opportunistic_sysret_failed
 +
 + cmpq $__USER_CS,(CS-R11)(%rsp)  /* CS must match SYSRET */
 + jne opportunistic_sysret_failed
 +
 + movq (R11-R11)(%rsp), %r11
 + cmpq %r11,(EFLAGS-R11)(%rsp)/* R11 == RFLAGS */
 + jne opportunistic_sysret_failed
 +
 + testq $X86_EFLAGS_RF,%r11   /* sysret can't restore RF */
 + jnz opportunistic_sysret_failed
 +
 + /* nothing to check for RSP */
 +
 + cmpq $__USER_DS,(SS-R11)(%rsp)  /* SS must match SYSRET */
 + jne opportunistic_sysret_failed

Btw, Denys' R11-ARGOFFSET fix makes sense here too - using ARGOFFSET
instead of R11 would make this here clearer.
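
As an aside, the "shr $47" test in the quoted hunk can be read as the
following C sketch (illustrative only; it assumes 48-bit virtual
addresses, so any of bits 63..47 set means a non-canonical or kernel
address):

	static inline int rcx_ok_for_sysret(unsigned long rcx)
	{
		return (rcx >> 47) == 0;	/* mirrors "shr $47; jnz fail" */
	}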

-- 
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.


Re: [PATCH 2/4] arm/arm64: KVM: Use set/way op trapping to track the state of the caches

2015-01-09 Thread Christoffer Dall
On Thu, Jan 08, 2015 at 11:59:07AM +, Marc Zyngier wrote:
 Trying to emulate the behaviour of set/way cache ops is fairly
 pointless, as there are too many ways we can end up missing stuff.
 Also, there are some system caches out there that simply ignore
 set/way operations.
 
 So instead of trying to implement them, let's convert it to VA ops,
 and use them as a way to re-enable the trapping of VM ops. That way,
 we can detect the point when the MMU/caches are turned off, and do
 a full VM flush (which is what the guest was trying to do anyway).
 
 This allows a 32bit zImage to boot on the APM thingy, and will
 probably help bootloaders in general.
 
 Signed-off-by: Marc Zyngier marc.zyng...@arm.com
 ---
  arch/arm/include/asm/kvm_host.h   |   3 --
  arch/arm/kvm/arm.c|  10 
  arch/arm/kvm/coproc.c |  90 +
  arch/arm64/include/asm/kvm_host.h |   3 --
  arch/arm64/kvm/sys_regs.c | 102 ++
  5 files changed, 116 insertions(+), 92 deletions(-)
 
 diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h
 index 254e065..04b4ea0 100644
 --- a/arch/arm/include/asm/kvm_host.h
 +++ b/arch/arm/include/asm/kvm_host.h
 @@ -125,9 +125,6 @@ struct kvm_vcpu_arch {
* Anything that is not used directly from assembly code goes
* here.
*/
 - /* dcache set/way operation pending */
 - int last_pcpu;
 - cpumask_t require_dcache_flush;
  
   /* Don't run the guest on this vcpu */
   bool pause;
 diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
 index 2d6d910..0b0d58a 100644
 --- a/arch/arm/kvm/arm.c
 +++ b/arch/arm/kvm/arm.c
 @@ -281,15 +281,6 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
   vcpu->cpu = cpu;
   vcpu->arch.host_cpu_context = this_cpu_ptr(kvm_host_cpu_state);
  
 - /*
 -  * Check whether this vcpu requires the cache to be flushed on
 -  * this physical CPU. This is a consequence of doing dcache
 -  * operations by set/way on this vcpu. We do it here to be in
 -  * a non-preemptible section.
 -  */
 - if (cpumask_test_and_clear_cpu(cpu, &vcpu->arch.require_dcache_flush))
 - flush_cache_all(); /* We'd really want v7_flush_dcache_all() */
 -
   kvm_arm_set_running_vcpu(vcpu);
  }
  
 @@ -541,7 +532,6 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *run)
   ret = kvm_call_hyp(__kvm_vcpu_run, vcpu);
  
   vcpu->mode = OUTSIDE_GUEST_MODE;
 - vcpu->arch.last_pcpu = smp_processor_id();
   kvm_guest_exit();
   trace_kvm_exit(*vcpu_pc(vcpu));
   /*
 diff --git a/arch/arm/kvm/coproc.c b/arch/arm/kvm/coproc.c
 index 7928dbd..3d352a5 100644
 --- a/arch/arm/kvm/coproc.c
 +++ b/arch/arm/kvm/coproc.c
 @@ -189,43 +189,57 @@ static bool access_l2ectlr(struct kvm_vcpu *vcpu,
   return true;
  }
  
 -/* See note at ARM ARM B1.14.4 */
 +/*
 + * See note at ARMv7 ARM B1.14.4 (TL;DR: S/W ops are not easily virtualized).
 + *
 + * Main problems:
 + * - S/W ops are local to a CPU (not broadcast)
 + * - We have line migration behind our back (speculation)
 + * - System caches don't support S/W at all (damn!)
 + *
 + * In the face of the above, the best we can do is to try and convert
 + * S/W ops to VA ops. Because the guest is not allowed to infer the
 + * S/W to PA mapping, it can only use S/W to nuke the whole cache,
 + * which is a rather good thing for us.
 + *
 + * Also, it is only used when turning caches on/off ("The expected
 + * usage of the cache maintenance instructions that operate by set/way
 + * is associated with the cache maintenance instructions associated
 + * with the powerdown and powerup of caches, if this is required by
 + * the implementation.").
 + *
 + * We use the following policy:
 + *
 + * - If we trap a S/W operation, we enable VM trapping to detect
 + *   caches being turned on/off.
 + *
 + * - If the caches have already been turned off when doing the S/W op,
 + *   we nuke the whole VM cache.
 + *
 + * - We flush the cache on both caches being turned on and off.
 + *
 + * - Once the caches are enabled, we stop trapping VM ops.
 + */
  static bool access_dcsw(struct kvm_vcpu *vcpu,
   const struct coproc_params *p,
   const struct coproc_reg *r)
  {
 - unsigned long val;
 - int cpu;
 -
   if (!p->is_write)
   return read_from_write_only(vcpu, p);
  
 - cpu = get_cpu();
 -
 - cpumask_setall(&vcpu->arch.require_dcache_flush);
 - cpumask_clear_cpu(cpu, &vcpu->arch.require_dcache_flush);
 -
 - /* If we were already preempted, take the long way around */
 - if (cpu != vcpu->arch.last_pcpu) {
 - flush_cache_all();
 - goto done;
 - }
 -
 - val = *vcpu_reg(vcpu, p->Rt1);
 -
 - switch (p->CRm) {
 - case 6: /* Upgrade DCISW to DCCISW, as per HCR.SWIO */
 - 

Re: [v3 13/26] KVM: Define a new interface kvm_find_dest_vcpu() for VT-d PI

2015-01-09 Thread Radim Krčmář
2014-12-12 23:14+0800, Feng Wu:
 This patch defines a new interface, kvm_find_dest_vcpu(), for
 VT-d PI, which returns the destination vCPU of an
 interrupt for guests.
 
 Since VT-d PI cannot handle broadcast/multicast interrupts,
 here we only handle Fixed and Lowest priority interrupts.
 
 The current method of handling guest lowest-priority interrupts
 is to use a counter 'apic_arb_prio' for each vCPU: we choose the
 vCPU with the smallest 'apic_arb_prio' and then increase it by 1.
 However, for VT-d PI, we cannot reuse this, since we no longer
 have control of 'apic_arb_prio' with posted-interrupt direct
 delivery by hardware.
 
 Here, we introduce a similar way with 'apic_arb_prio' to handle
 guest lowest-priority interrupts when VT-d PI is used. Here is the
 idea:
 - Each vCPU has a counter 'round_robin_counter'.
 - When a guest sets an interrupt to lowest priority, we choose
 the vCPU with the smallest 'round_robin_counter' as the destination,
 then increase it. (See the sketch after this list.)
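
(A minimal sketch of the round_robin_counter idea described above, with
assumed names: the field's placement in struct kvm_vcpu and the helper
below are illustrative, not the code from this series.

	static struct kvm_vcpu *pi_find_lowest_prio_vcpu(struct kvm *kvm)
	{
		struct kvm_vcpu *vcpu, *dest = NULL;
		int i;

		/* pick the vCPU that has handled the fewest so far */
		kvm_for_each_vcpu(i, vcpu, kvm)
			if (!dest || vcpu->round_robin_counter <
				     dest->round_robin_counter)
				dest = vcpu;
		if (dest)
			dest->round_robin_counter++;
		return dest;
	}
)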

There are two points in KVM's new implementation relevant to this patch
(KVM: x86: amend APIC lowest priority arbitration,
 https://lkml.org/lkml/2015/1/9/362):

1) lowest priority depends on TPR
2) there is no need for balancing

(1) has to be considered with PI as well.
I kept (2) to avoid whining from people building on that behaviour, but
lowest priority backed by PI could be transparent without it.

The patch below removes the balancing, but I am not sure this is a price
we can allow ourselves to pay ... what are your opinions?


---8---
KVM: x86: don't balance lowest priority interrupts

Balancing is not mandated by the specification, and real hardware most likely
doesn't do it.  We break backward compatibility to allow optimizations.
(Posted interrupts can deliver to only one fixed destination.)

Signed-off-by: Radim Krčmář rkrc...@redhat.com
---
 arch/x86/include/asm/kvm_host.h | 1 -
 arch/x86/kvm/lapic.c| 8 ++--
 2 files changed, 2 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 97a5dd0222c8..aa4bd8286232 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -370,7 +370,6 @@ struct kvm_vcpu_arch {
u64 apic_base;
struct kvm_lapic *apic;/* kernel irqchip context */
unsigned long apic_attention;
-   int32_t apic_arb_prio;
int mp_state;
u64 ia32_misc_enable_msr;
bool tpr_access_reporting;
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 5b9d8c589bba..eb85af8e8fc0 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -749,7 +749,6 @@ static int __apic_accept_irq(struct kvm_lapic *apic, int delivery_mode,
  trig_mode, vector);
switch (delivery_mode) {
case APIC_DM_LOWEST:
-   vcpu->arch.apic_arb_prio++;
case APIC_DM_FIXED:
/* FIXME add logic for vcpu on reset */
if (unlikely(!apic_enabled(apic)))
@@ -837,11 +836,9 @@ int kvm_apic_compare_prio(struct kvm_vcpu *vcpu1, struct kvm_vcpu *vcpu2)
 *  - uses the APR register (which also considers ISR and IRR),
 *  - chooses the highest APIC ID when APRs are identical,
 *  - and allows a focus processor.
-* XXX: pseudo-balancing with apic_arb_prio is a KVM-specific feature
 */
-   int tpr = kvm_apic_get_reg(vcpu1->arch.apic, APIC_TASKPRI) -
- kvm_apic_get_reg(vcpu2->arch.apic, APIC_TASKPRI);
-   return tpr ? : vcpu1->arch.apic_arb_prio - vcpu2->arch.apic_arb_prio;
+   return kvm_apic_get_reg(vcpu1->arch.apic, APIC_TASKPRI) -
+  kvm_apic_get_reg(vcpu2->arch.apic, APIC_TASKPRI);
 }
 
 static void kvm_ioapic_send_eoi(struct kvm_lapic *apic, int vector)
@@ -1595,7 +1592,6 @@ void kvm_lapic_reset(struct kvm_vcpu *vcpu)
vcpu->arch.pv_eoi.msr_val = 0;
apic_update_ppr(apic);
 
-   vcpu->arch.apic_arb_prio = 0;
vcpu->arch.apic_attention = 0;
 
apic_debug("%s: vcpu=%p, id=%d, base_msr="
-- 
2.2.0



Re: [v3 13/26] KVM: Define a new interface kvm_find_dest_vcpu() for VT-d PI

2015-01-09 Thread Paolo Bonzini


On 09/01/2015 15:54, Radim Krčmář wrote:
 There are two points in KVM's new implementation relevant to this patch
 (KVM: x86: amend APIC lowest priority arbitration,
  https://lkml.org/lkml/2015/1/9/362):
 
 1) lowest priority depends on TPR
 2) there is no need for balancing
 
 (1) has to be considered with PI as well.

The chipset doesn't support it. :(

 I kept (2) to avoid whining from people building on that behaviour, but
 lowest priority backed by PI could be transparent without it.
 
 The patch below removes the balancing, but I am not sure this is a price
 we can allow ourselves to pay ... what are your opinions?

I wouldn't mind, but it requires a lot of benchmarking.

Paolo