from:"Benjamin Herrenschmidt"

Re: [Qemu-ppc] KVM memory slots limit on powerpc

2015-09-04 Thread Benjamin Herrenschmidt

On Fri, 2015-09-04 at 12:28 +0200, Thomas Huth wrote:

> > Maybe some rcu protected scheme that doubles the amount of memslots
> > for
> > each overrun? Yes, that would be good and even reduce the footprint
> > for
> > systems with only a small number of memslots.
> 
> Seems like Alex Williamson already posted a patchset for growable
> memslots a couple of years ago:
> 
>  http://www.spinics.net/lists/kvm/msg50491.html
> 
> But I didn't quite spot the result in that thread why it never has 
> been
> included upstream. Alex (W.), do you remember the outcome?

Isn't the memslot array *already* protected by RCU anyway ?

Cheers,
Ben.

--
To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] KVM: ppc: Fix size of the PSPB register

2015-09-01 Thread Benjamin Herrenschmidt

On Wed, 2015-09-02 at 08:24 +1000, Paul Mackerras wrote:
> On Tue, Sep 01, 2015 at 11:41:18PM +0200, Thomas Huth wrote:
> > The size of the Problem State Priority Boost Register is only
> > 32 bits, so let's change the type of the corresponding variable
> > accordingly to avoid future trouble.
> 
> Since we're already using lwz/stw in the assembly code in
> book3s_hv_rmhandlers.S, this is actually a bug fix, isn't it?
> How did you find it?  Did you observe a failure of some kind, or did
> you just find it by code inspection?

Won't the fix break migration ? Unless qemu doens't migrate it ...

> Paul.
> ___
> Linuxppc-dev mailing list
> linuxppc-...@lists.ozlabs.org
> https://lists.ozlabs.org/listinfo/linuxppc-dev

--
To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] KVM: ppc: Fix size of the PSPB register

2015-09-01 Thread Benjamin Herrenschmidt

On Wed, 2015-09-02 at 08:45 +1000, Paul Mackerras wrote:
> On Wed, Sep 02, 2015 at 08:25:05AM +1000, Benjamin Herrenschmidt
> wrote:
> > On Tue, 2015-09-01 at 23:41 +0200, Thomas Huth wrote:
> > > The size of the Problem State Priority Boost Register is only
> > > 32 bits, so let's change the type of the corresponding variable
> > > accordingly to avoid future trouble.
> > 
> > It's not future trouble, it's broken today for LE and this should
> > fix
> > it BUT 
> 
> No, it's broken today for BE hosts, which will always see 0 for the
> PSPB register value.  LE hosts are fine.

0 or PSPB << 32 ?

> > The asm accesses it using lwz/stw and C accesses it as a ulong. On
> > LE
> > that will mean that userspace will see the value << 32
> 
> No, that will happen on BE, and since KVM_REG_PPC_PSPB says it's a
> 32-bit register, we'll just pass 0 back to userspace when it reads
> it.

Ah ok, I missed that bit about KVM_REG_PPC_PSPB

> > Now "fixing" it might break migration if that field is already
> > stored/loaded in its "broken" form. We may have to keep the
> > "broken"
> > behaviour and document that qemu sees a value shifted by 32.
> 
> It will be being set to 0 on BE hosts across migration today
> (fortunately 0 is a benign value for PSPB).  If we fix this on both
> the source and destination host, then the value will get migrated
> across correctly.

Ok, I missed the part where KVM_REG_PPC_PSPB passed it down as a 32
-bit. That means Thomas patch should work indeed.

> I think Thomas's patch is fine, it just needs a stronger patch
> description saying that it fixes an actual bug.

Right.

Cheers,
Ben.

--
To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] KVM: ppc: Fix size of the PSPB register

2015-09-01 Thread Benjamin Herrenschmidt

On Tue, 2015-09-01 at 23:41 +0200, Thomas Huth wrote:
> The size of the Problem State Priority Boost Register is only
> 32 bits, so let's change the type of the corresponding variable
> accordingly to avoid future trouble.

It's not future trouble, it's broken today for LE and this should fix
it BUT 

The asm accesses it using lwz/stw and C accesses it as a ulong. On LE
that will mean that userspace will see the value << 32

Now "fixing" it might break migration if that field is already
stored/loaded in its "broken" form. We may have to keep the "broken"
behaviour and document that qemu sees a value shifted by 32.

Cheers,
Ben.

> Signed-off-by: Thomas Huth 
> ---
>  arch/powerpc/include/asm/kvm_host.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/include/asm/kvm_host.h
> b/arch/powerpc/include/asm/kvm_host.h
> index d91f65b..c825f3a 100644
> --- a/arch/powerpc/include/asm/kvm_host.h
> +++ b/arch/powerpc/include/asm/kvm_host.h
> @@ -473,7 +473,7 @@ struct kvm_vcpu_arch {
>   ulong ciabr;
>   ulong cfar;
>   ulong ppr;
> - ulong pspb;
> + u32 pspb;
>   ulong fscr;
>   ulong shadow_fscr;
>   ulong ebbhr;

--
To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: BUG: sleeping function called from ras_epow_interrupt context

2015-07-14 Thread Benjamin Herrenschmidt

On Tue, 2015-07-14 at 20:43 +0200, Thomas Huth wrote:
 Any suggestions how to fix this? Simply revert 587f83e8dd50d? Use
 mdelay() instead of msleep() in rtas_busy_delay()? Something more
 fancy?

A proper fix would be more fancy, the get_sensor should happen in a
kernel thread instead.

Cheers,
Ben.


--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 06/25] powerpc: Use bool function return values of true/false not 1/0

2015-03-30 Thread Benjamin Herrenschmidt

On Mon, 2015-03-30 at 16:46 -0700, Joe Perches wrote:
 Use the normal return values for bool functions

Acked-by: Benjamin Herrenschmidt b...@kernel.crashing.org

Should we merge it or will you ?

Cheers,
Ben.

 Signed-off-by: Joe Perches j...@perches.com
 ---
  arch/powerpc/include/asm/dcr-native.h| 2 +-
  arch/powerpc/include/asm/dma-mapping.h   | 4 ++--
  arch/powerpc/include/asm/kvm_book3s_64.h | 4 ++--
  arch/powerpc/sysdev/dcr.c| 2 +-
  4 files changed, 6 insertions(+), 6 deletions(-)
 
 diff --git a/arch/powerpc/include/asm/dcr-native.h 
 b/arch/powerpc/include/asm/dcr-native.h
 index 7d2e623..4efc11d 100644
 --- a/arch/powerpc/include/asm/dcr-native.h
 +++ b/arch/powerpc/include/asm/dcr-native.h
 @@ -31,7 +31,7 @@ typedef struct {
  
  static inline bool dcr_map_ok_native(dcr_host_native_t host)
  {
 - return 1;
 + return true;
  }
  
  #define dcr_map_native(dev, dcr_n, dcr_c) \
 diff --git a/arch/powerpc/include/asm/dma-mapping.h 
 b/arch/powerpc/include/asm/dma-mapping.h
 index 894d538..9103687 100644
 --- a/arch/powerpc/include/asm/dma-mapping.h
 +++ b/arch/powerpc/include/asm/dma-mapping.h
 @@ -191,11 +191,11 @@ static inline bool dma_capable(struct device *dev, 
 dma_addr_t addr, size_t size)
   struct dev_archdata *sd = dev-archdata;
  
   if (sd-max_direct_dma_addr  addr + size  sd-max_direct_dma_addr)
 - return 0;
 + return false;
  #endif
  
   if (!dev-dma_mask)
 - return 0;
 + return false;
  
   return addr + size - 1 = *dev-dma_mask;
  }
 diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h 
 b/arch/powerpc/include/asm/kvm_book3s_64.h
 index 2d81e20..2a244bf 100644
 --- a/arch/powerpc/include/asm/kvm_book3s_64.h
 +++ b/arch/powerpc/include/asm/kvm_book3s_64.h
 @@ -335,7 +335,7 @@ static inline bool hpte_read_permission(unsigned long pp, 
 unsigned long key)
  {
   if (key)
   return PP_RWRX = pp  pp = PP_RXRX;
 - return 1;
 + return true;
  }
  
  static inline bool hpte_write_permission(unsigned long pp, unsigned long key)
 @@ -373,7 +373,7 @@ static inline bool slot_is_aligned(struct kvm_memory_slot 
 *memslot,
   unsigned long mask = (pagesize  PAGE_SHIFT) - 1;
  
   if (pagesize = PAGE_SIZE)
 - return 1;
 + return true;
   return !(memslot-base_gfn  mask)  !(memslot-npages  mask);
  }
  
 diff --git a/arch/powerpc/sysdev/dcr.c b/arch/powerpc/sysdev/dcr.c
 index 2d8a101..121e26f 100644
 --- a/arch/powerpc/sysdev/dcr.c
 +++ b/arch/powerpc/sysdev/dcr.c
 @@ -54,7 +54,7 @@ bool dcr_map_ok_generic(dcr_host_t host)
   else if (host.type == DCR_HOST_MMIO)
   return dcr_map_ok_mmio(host.host.mmio);
   else
 - return 0;
 + return false;
  }
  EXPORT_SYMBOL_GPL(dcr_map_ok_generic);
  


--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 2/2] KVM: PPC: Remove page table walk helpers

2015-03-29 Thread Benjamin Herrenschmidt

On Mon, 2015-03-30 at 10:39 +0530, Aneesh Kumar K.V wrote:
 This patch remove helpers which we had used only once in the code.
 Limiting page table walk variants help in ensuring that we won't
 end up with code walking page table with wrong assumptions.
 
 Signed-off-by: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com

Alex, it would be preferable to have this (and the previous one) in the
powerpc tree due to dependencies with further fixes to our page table
walking vs. THP, so if you're happy, just give us an ack.

Cheers,
Ben.

 ---
  arch/powerpc/include/asm/pgtable.h  | 21 -
  arch/powerpc/kvm/book3s_hv_rm_mmu.c | 62 
 -
  arch/powerpc/kvm/e500_mmu_host.c|  2 +-
  3 files changed, 28 insertions(+), 57 deletions(-)
 
 diff --git a/arch/powerpc/include/asm/pgtable.h 
 b/arch/powerpc/include/asm/pgtable.h
 index 9835ac4173b7..92fe01c355a9 100644
 --- a/arch/powerpc/include/asm/pgtable.h
 +++ b/arch/powerpc/include/asm/pgtable.h
 @@ -249,27 +249,6 @@ extern int gup_hugepte(pte_t *ptep, unsigned long sz, 
 unsigned long addr,
  #endif
  pte_t *find_linux_pte_or_hugepte(pgd_t *pgdir, unsigned long ea,
unsigned *shift);
 -
 -static inline pte_t *lookup_linux_ptep(pgd_t *pgdir, unsigned long hva,
 -  unsigned long *pte_sizep)
 -{
 - pte_t *ptep;
 - unsigned long ps = *pte_sizep;
 - unsigned int shift;
 -
 - ptep = find_linux_pte_or_hugepte(pgdir, hva, shift);
 - if (!ptep)
 - return NULL;
 - if (shift)
 - *pte_sizep = 1ul  shift;
 - else
 - *pte_sizep = PAGE_SIZE;
 -
 - if (ps  *pte_sizep)
 - return NULL;
 -
 - return ptep;
 -}
  #endif /* __ASSEMBLY__ */
  
  #endif /* __KERNEL__ */
 diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c 
 b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
 index 625407e4d3b0..73e083cb9f7e 100644
 --- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
 +++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
 @@ -131,25 +131,6 @@ static void remove_revmap_chain(struct kvm *kvm, long 
 pte_index,
   unlock_rmap(rmap);
  }
  
 -static pte_t lookup_linux_pte_and_update(pgd_t *pgdir, unsigned long hva,
 -   int writing, unsigned long *pte_sizep)
 -{
 - pte_t *ptep;
 - unsigned long ps = *pte_sizep;
 - unsigned int hugepage_shift;
 -
 - ptep = find_linux_pte_or_hugepte(pgdir, hva, hugepage_shift);
 - if (!ptep)
 - return __pte(0);
 - if (hugepage_shift)
 - *pte_sizep = 1ul  hugepage_shift;
 - else
 - *pte_sizep = PAGE_SIZE;
 - if (ps  *pte_sizep)
 - return __pte(0);
 - return kvmppc_read_update_linux_pte(ptep, writing, hugepage_shift);
 -}
 -
  static inline void unlock_hpte(__be64 *hpte, unsigned long hpte_v)
  {
   asm volatile(PPC_RELEASE_BARRIER  : : : memory);
 @@ -166,10 +147,10 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long 
 flags,
   struct revmap_entry *rev;
   unsigned long g_ptel;
   struct kvm_memory_slot *memslot;
 - unsigned long pte_size;
 + unsigned hpage_shift;
   unsigned long is_io;
   unsigned long *rmap;
 - pte_t pte;
 + pte_t *ptep;
   unsigned int writing;
   unsigned long mmu_seq;
   unsigned long rcbits;
 @@ -208,22 +189,33 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long 
 flags,
  
   /* Translate to host virtual address */
   hva = __gfn_to_hva_memslot(memslot, gfn);
 + ptep = find_linux_pte_or_hugepte(pgdir, hva, hpage_shift);
 + if (ptep) {
 + pte_t pte;
 + unsigned int host_pte_size;
  
 - /* Look up the Linux PTE for the backing page */
 - pte_size = psize;
 - pte = lookup_linux_pte_and_update(pgdir, hva, writing, pte_size);
 - if (pte_present(pte)  !pte_protnone(pte)) {
 - if (writing  !pte_write(pte))
 - /* make the actual HPTE be read-only */
 - ptel = hpte_make_readonly(ptel);
 - is_io = hpte_cache_bits(pte_val(pte));
 - pa = pte_pfn(pte)  PAGE_SHIFT;
 - pa |= hva  (pte_size - 1);
 - pa |= gpa  ~PAGE_MASK;
 - }
 + if (hpage_shift)
 + host_pte_size = 1ul  hpage_shift;
 + else
 + host_pte_size = PAGE_SIZE;
 + /*
 +  * We should always find the guest page size
 +  * to = host page size, if host is using hugepage
 +  */
 + if (host_pte_size  psize)
 + return H_PARAMETER;
  
 - if (pte_size  psize)
 - return H_PARAMETER;
 + pte = kvmppc_read_update_linux_pte(ptep, writing, hpage_shift);
 + if (pte_present(pte)  !pte_protnone(pte)) {
 + if (writing  !pte_write(pte))
 + /* make the actual HPTE be read-only */
 +

Re: [PATCH v3] powerpc/kvm: support to handle sw breakpoint

2014-08-11 Thread Benjamin Herrenschmidt

On Mon, 2014-08-11 at 09:26 +0200, Alexander Graf wrote:
  diff --git a/arch/powerpc/kvm/emulate.c b/arch/powerpc/kvm/emulate.c
  index da86d9b..d95014e 100644
  --- a/arch/powerpc/kvm/emulate.c
  +++ b/arch/powerpc/kvm/emulate.c
 
 This should be book3s_emulate.c.

Any reason we can't make that 0000 opcode as breakpoint common to
all powerpc variants ?

Cheers,
Ben.


--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PULL 16/63] PPC: Add asm helpers for BE 32bit load/store

2014-08-01 Thread Benjamin Herrenschmidt

On Fri, 2014-08-01 at 11:17 +0200, Alexander Graf wrote:
 From assembly code we might not only have to explicitly BE access 64bit 
 values,
 but sometimes also 32bit ones. Add helpers that allow for easy use of 
 lwzx/stwx
 in their respective byte-reverse or native form.
 
 Signed-off-by: Alexander Graf ag...@suse.de

Acked-by: Benjamin Herrenschmidt b...@kernel.crashing.org
 ---
  arch/powerpc/include/asm/asm-compat.h | 4 
  1 file changed, 4 insertions(+)
 
 diff --git a/arch/powerpc/include/asm/asm-compat.h 
 b/arch/powerpc/include/asm/asm-compat.h
 index 4b237aa..21be8ae 100644
 --- a/arch/powerpc/include/asm/asm-compat.h
 +++ b/arch/powerpc/include/asm/asm-compat.h
 @@ -34,10 +34,14 @@
  #define PPC_MIN_STKFRM   112
  
  #ifdef __BIG_ENDIAN__
 +#define LWZX_BE  stringify_in_c(lwzx)
  #define LDX_BE   stringify_in_c(ldx)
 +#define STWX_BE  stringify_in_c(stwx)
  #define STDX_BE  stringify_in_c(stdx)
  #else
 +#define LWZX_BE  stringify_in_c(lwbrx)
  #define LDX_BE   stringify_in_c(ldbrx)
 +#define STWX_BE  stringify_in_c(stwbrx)
  #define STDX_BE  stringify_in_c(stdbrx)
  #endif
  


--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] powerpc: kvm: make the setup of hpte under the protection of KVMPPC_RMAP_LOCK_BIT

2014-07-29 Thread Benjamin Herrenschmidt

On Mon, 2014-07-28 at 15:58 +0800, Liu ping fan wrote:
 Hope I am right.  Take the following seq as an example
 
 if (hptep[0]  HPTE_V_VALID) {
 /* HPTE was previously valid, so we need to invalidate it */
 unlock_rmap(rmap);
 hptep[0] |= HPTE_V_ABSENT;
 kvmppc_invalidate_hpte(kvm, hptep, index);
 /* don't lose previous R and C bits */
 r |= hptep[1]  (HPTE_R_R | HPTE_R_C);
 } else {
 kvmppc_add_revmap_chain(kvm, rev, rmap, index, 0);
 }
   - if we try_to_unmap on
 pfn at here, then @r contains a invalid pfn
 hptep[1] = r;
 eieio();
 hptep[0] = hpte[0];
 asm volatile(ptesync : : : memory);

If that was the case we would have the same race in kvmppc_do_h_enter().

I think the fact that the HPTE is locked will prevent the race, ie,
HPTE_V_HVLOCK is set until hptep[0] is written to.

If I look at at the unmap case, my understanding is that it uses  
kvm_unmap_rmapp() which will also lock the HPTE (try_lock_hpte)
and so shouldn't have a race vs the above code.

Or do you see a race I don't ?

Cheers,
Ben.

 Thx.
 Fan
 
 On Mon, Jul 28, 2014 at 2:42 PM, Benjamin Herrenschmidt
 b...@kernel.crashing.org wrote:
  On Mon, 2014-07-28 at 14:09 +0800, Liu Ping Fan wrote:
  In current code, the setup of hpte is under the risk of race with
  mmu_notifier_invalidate, i.e we may setup a hpte with a invalid pfn.
  Resolve this issue by sync the two actions by KVMPPC_RMAP_LOCK_BIT.
 
  Please describe the race you think you see. I'm quite sure both Paul and
  I went over that code and somewhat convinced ourselves that it was ok
  but it's possible that we were both wrong :-)
 
  Cheers,
  Ben.
 
  Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
  ---
   arch/powerpc/kvm/book3s_64_mmu_hv.c | 15 ++-
   1 file changed, 10 insertions(+), 5 deletions(-)
 
  diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c 
  b/arch/powerpc/kvm/book3s_64_mmu_hv.c
  index 8056107..e6dcff4 100644
  --- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
  +++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
  @@ -754,19 +754,24 @@ int kvmppc_book3s_hv_page_fault(struct kvm_run *run, 
  struct kvm_vcpu *vcpu,
 
if (hptep[0]  HPTE_V_VALID) {
/* HPTE was previously valid, so we need to invalidate it */
  - unlock_rmap(rmap);
hptep[0] |= HPTE_V_ABSENT;
kvmppc_invalidate_hpte(kvm, hptep, index);
/* don't lose previous R and C bits */
r |= hptep[1]  (HPTE_R_R | HPTE_R_C);
  +
  + hptep[1] = r;
  + eieio();
  + hptep[0] = hpte[0];
  + asm volatile(ptesync : : : memory);
  + unlock_rmap(rmap);
} else {
  + hptep[1] = r;
  + eieio();
  + hptep[0] = hpte[0];
  + asm volatile(ptesync : : : memory);
kvmppc_add_revmap_chain(kvm, rev, rmap, index, 0);
}
 
  - hptep[1] = r;
  - eieio();
  - hptep[0] = hpte[0];
  - asm volatile(ptesync : : : memory);
preempt_enable();
if (page  hpte_is_writable(r))
SetPageDirty(page);
 
 


--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] powerpc: kvm: make the setup of hpte under the protection of KVMPPC_RMAP_LOCK_BIT

2014-07-28 Thread Benjamin Herrenschmidt

On Mon, 2014-07-28 at 14:09 +0800, Liu Ping Fan wrote:
 In current code, the setup of hpte is under the risk of race with
 mmu_notifier_invalidate, i.e we may setup a hpte with a invalid pfn.
 Resolve this issue by sync the two actions by KVMPPC_RMAP_LOCK_BIT.

Please describe the race you think you see. I'm quite sure both Paul and
I went over that code and somewhat convinced ourselves that it was ok
but it's possible that we were both wrong :-)

Cheers,
Ben.

 Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
 ---
  arch/powerpc/kvm/book3s_64_mmu_hv.c | 15 ++-
  1 file changed, 10 insertions(+), 5 deletions(-)
 
 diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c 
 b/arch/powerpc/kvm/book3s_64_mmu_hv.c
 index 8056107..e6dcff4 100644
 --- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
 +++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
 @@ -754,19 +754,24 @@ int kvmppc_book3s_hv_page_fault(struct kvm_run *run, 
 struct kvm_vcpu *vcpu,
  
   if (hptep[0]  HPTE_V_VALID) {
   /* HPTE was previously valid, so we need to invalidate it */
 - unlock_rmap(rmap);
   hptep[0] |= HPTE_V_ABSENT;
   kvmppc_invalidate_hpte(kvm, hptep, index);
   /* don't lose previous R and C bits */
   r |= hptep[1]  (HPTE_R_R | HPTE_R_C);
 +
 + hptep[1] = r;
 + eieio();
 + hptep[0] = hpte[0];
 + asm volatile(ptesync : : : memory);
 + unlock_rmap(rmap);
   } else {
 + hptep[1] = r;
 + eieio();
 + hptep[0] = hpte[0];
 + asm volatile(ptesync : : : memory);
   kvmppc_add_revmap_chain(kvm, rev, rmap, index, 0);
   }
  
 - hptep[1] = r;
 - eieio();
 - hptep[0] = hpte[0];
 - asm volatile(ptesync : : : memory);
   preempt_enable();
   if (page  hpte_is_writable(r))
   SetPageDirty(page);


--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v1 2/3] powerpc/powernv: Support PCI error injection

2014-07-21 Thread Benjamin Herrenschmidt

On Mon, 2014-07-21 at 16:06 +0800, Mike Qiu wrote:
  I don't like this. I much prefer have dedicated error injection files
  in their respective locations, something for PCI under the corresponding
  PCI bridge etc...
 
 So PowerNV error injection will be designed rely on debugfs been 
 configured, right?

Not necessarily. If we create a better debugfs layout for our PHBs, then
yes. It might be useful to provide more info in there for example access
to some of the counters ...

But on the other hand, for error injection in general, I wonder if we should
be under sysfs instead... something to study a bit.

Cheers,
Ben.


--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v1 2/3] powerpc/powernv: Support PCI error injection

2014-07-21 Thread Benjamin Herrenschmidt

On Tue, 2014-07-22 at 11:10 +0800, Mike Qiu wrote:
 On 07/22/2014 06:49 AM, Benjamin Herrenschmidt wrote:
  On Mon, 2014-07-21 at 16:06 +0800, Mike Qiu wrote:
  I don't like this. I much prefer have dedicated error injection files
  in their respective locations, something for PCI under the corresponding
  PCI bridge etc...
  So PowerNV error injection will be designed rely on debugfs been
  configured, right?
  Not necessarily. If we create a better debugfs layout for our PHBs, then
  yes. It might be useful to provide more info in there for example access
  to some of the counters ...
 
  But on the other hand, for error injection in general, I wonder if we should
  be under sysfs instead... something to study a bit.
 
 In pHyp, general error injection use syscall:
 
  #define __NR_rtas255
 
 I don't know if it is a good idea to reuse this syscall for PowerNV.
 
 At least, it is another choice without sysfs rely.

No, we certainly don't want that RTAS stuff. I though Linux had some
kind of error injection infrastructure nowadays... somebody needs to
have a look.

Cheers,
Ben.


--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v1 2/3] powerpc/powernv: Support PCI error injection

2014-06-24 Thread Benjamin Herrenschmidt


 Is it reasonable to do error injection with CONFIG_IOMMU_API ?
 
 That means if use default config(CONFIG_IOMMU_API = n),  we can not do 
 error injection to pci devices?

Well we can't pass them through either so ...

In any case, this is not a priority. First we need to implement a solid
error injection facility for the *host*. The guest one is really really
low on the list.

Cheers,
Ben.


--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v1 2/3] powerpc/powernv: Support PCI error injection

2014-06-24 Thread Benjamin Herrenschmidt

On Tue, 2014-06-24 at 14:57 +0800, Mike Qiu wrote:
 Is that mean *host* side error injection should base on 
 CONFIG_IOMMU_API ? If it is just host side(no guest, no pass through), 
 can't we do error inject?
 
 Maybe I misunderstand :)

Ah no, make different patches, we don't want to use IOMMU group ID, just
PE numbers. Maybe we should expose in sysfs the PEs from the platform
code with the error injection files underneath ... 

Cheers,
Ben.


--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v1 2/3] powerpc/powernv: Support PCI error injection

2014-06-24 Thread Benjamin Herrenschmidt

On Wed, 2014-06-25 at 11:05 +0800, Mike Qiu wrote:
 Here maybe  /sys/kernel/debug/powerpc/errinjct is better, because
 it 
 will supply PCI_domain_nr in parameters, so no need supply errinjct 
 for each PCI domain.
 
 Another reason is error inject not only for PCI(in future), so better 
 not in PCI domain entry.
 
 Also it simple for userland tools to has a fixed path.

I don't like this. I much prefer have dedicated error injection files
in their respective locations, something for PCI under the corresponding
PCI bridge etc...

Cheers,
Ben.


--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v1 1/3] powerpc/powernv: Sync header with firmware

2014-06-23 Thread Benjamin Herrenschmidt

On Mon, 2014-06-23 at 12:14 +1000, Gavin Shan wrote:
 The patch synchronizes firmware header file (opal.h) for PCI error
 injection

The FW API you expose is not PCI specific. I haven't seen the
corresponding FW patches yet but I'm not fan of that single call
that collates unrelated things.

I much'd prefer see a opal_pci_err_inject that is specific to
IO(D)A errors, which takes a PHB ID and goes via the normal dispatch
to PHB ops inside OPAL. For the rest, especially core specific
injections, we can provide a separate dedicated call.

Cheers, 
Ben.

 Signed-off-by: Gavin Shan gws...@linux.vnet.ibm.com
 ---
  arch/powerpc/include/asm/opal.h| 65 
 ++
  arch/powerpc/platforms/powernv/opal-wrappers.S |  1 +
  2 files changed, 66 insertions(+)
 
 diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
 index 66ad7a7..d982bb8 100644
 --- a/arch/powerpc/include/asm/opal.h
 +++ b/arch/powerpc/include/asm/opal.h
 @@ -175,6 +175,7 @@ extern int opal_enter_rtas(struct rtas_args *args,
  #define OPAL_SET_PARAM   90
  #define OPAL_DUMP_RESEND 91
  #define OPAL_DUMP_INFO2  94
 +#define OPAL_ERR_INJECT  96
  
  #ifndef __ASSEMBLY__
  
 @@ -219,6 +220,69 @@ enum OpalPciErrorSeverity {
   OPAL_EEH_SEV_INF= 5
  };
  
 +enum OpalErrinjctType {
 + OpalErrinjctTypeFirst   = 0,
 + OpalErrinjctTypeFatal   = 1,
 + OpalErrinjctTypeRecoverRandomEvent  = 2,
 + OpalErrinjctTypeRecoverSpecialEvent = 3,
 + OpalErrinjctTypeCorruptedPage   = 4,
 + OpalErrinjctTypeCorruptedSlb= 5,
 + OpalErrinjctTypeTranslatorFailure   = 6,
 + OpalErrinjctTypeIoaBusError = 7,
 + OpalErrinjctTypeIoaBusError64   = 8,
 + OpalErrinjctTypePlatformSpecific= 9,
 + OpalErrinjctTypeDcacheStart = 10,
 + OpalErrinjctTypeDcacheEnd   = 11,
 + OpalErrinjctTypeIcacheStart = 12,
 + OpalErrinjctTypeIcacheEnd   = 13,
 + OpalErrinjctTypeTlbStart= 14,
 + OpalErrinjctTypeTlbEnd  = 15,
 + OpalErrinjctTypeUpstreamIoError = 16,
 + OpalErrinjctTypeLast= 17,
 +
 + /* IoaBusError  IoaBusError64 */
 + OpalEitIoaLoadMemAddr   = 0,
 + OpalEitIoaLoadMemData   = 1,
 + OpalEitIoaLoadIoAddr= 2,
 + OpalEitIoaLoadIoData= 3,
 + OpalEitIoaLoadConfigAddr= 4,
 + OpalEitIoaLoadConfigData= 5,
 + OpalEitIoaStoreMemAddr  = 6,
 + OpalEitIoaStoreMemData  = 7,
 + OpalEitIoaStoreIoAddr   = 8,
 + OpalEitIoaStoreIoData   = 9,
 + OpalEitIoaStoreConfigAddr   = 10,
 + OpalEitIoaStoreConfigData   = 11,
 + OpalEitIoaDmaReadMemAddr= 12,
 + OpalEitIoaDmaReadMemData= 13,
 + OpalEitIoaDmaReadMemMaster  = 14,
 + OpalEitIoaDmaReadMemTarget  = 15,
 + OpalEitIoaDmaWriteMemAddr   = 16,
 + OpalEitIoaDmaWriteMemData   = 17,
 + OpalEitIoaDmaWriteMemMaster = 18,
 + OpalEitIoaDmaWriteMemTarget = 19,
 +};
 +
 +struct OpalErrinjct {
 + int32_t type;
 + union {
 + struct {
 + uint32_t addr;
 + uint32_t mask;
 + uint64_t phb_id;
 + uint32_t pe;
 + uint32_t function;
 + } ioa;
 + struct {
 + uint64_t addr;
 + uint64_t mask;
 + uint64_t phb_id;
 + uint32_t pe;
 + uint32_t function;
 + } ioa64;
 + };
 +};
 +
  enum OpalShpcAction {
   OPAL_SHPC_GET_LINK_STATE = 0,
   OPAL_SHPC_GET_SLOT_STATE = 1
 @@ -870,6 +934,7 @@ int64_t opal_update_flash(uint64_t blk_list);
  int64_t opal_dump_init(uint8_t dump_type);
  int64_t opal_dump_info(__be32 *dump_id, __be32 *dump_size);
  int64_t opal_dump_info2(__be32 *dump_id, __be32 *dump_size, __be32 
 *dump_type);
 +int64_t opal_err_injct(struct OpalErrinjct *ei);
  int64_t opal_dump_read(uint32_t dump_id, uint64_t buffer);
  int64_t opal_dump_ack(uint32_t dump_id);
  int64_t opal_dump_resend_notification(void);
 diff --git a/arch/powerpc/platforms/powernv/opal-wrappers.S 
 b/arch/powerpc/platforms/powernv/opal-wrappers.S
 index f531ffe..44b3d81 100644
 --- a/arch/powerpc/platforms/powernv/opal-wrappers.S
 +++ b/arch/powerpc/platforms/powernv/opal-wrappers.S
 @@ -136,6 +136,7 @@ OPAL_CALL(opal_resync_timebase,   
 OPAL_RESYNC_TIMEBASE);
  OPAL_CALL(opal_dump_init,OPAL_DUMP_INIT);

Re: [PATCH] powerpc/kvm: support to handle sw breakpoint

2014-06-17 Thread Benjamin Herrenschmidt

On Tue, 2014-06-17 at 11:25 +0200, Alexander Graf wrote:
 On 17.06.14 11:22, Benjamin Herrenschmidt wrote:
  On Tue, 2014-06-17 at 10:54 +0200, Alexander Graf wrote:
  Also, why don't we use twi always or something else that actually is
  defined as illegal instruction? I would like to see this shared with
  book3s_32 PR.
  twi will be directed to the guest on HV no ? We want a real illegal
  because those go to the host (for potential emulation by the HV).
 
 Ah, good point. I guess we need different one for PR and HV then to 
 ensure compatibility with older ISAs on PR.

Well, we also need to be careful with what happens if a PR guest puts
that instruction in, do that stop its HV guest/host ?

What if it's done in userspace ? Do that stop the kernel ? :-)

Maddy, I haven't checked, does your patch ensure that we only ever stop
if the instruction is at a recorded bkpt address ? It still means that a
userspace process can practically DOS its kernel by issuing a lot of
these causing a crapload of exits.

Cheers,
Ben.

 Alex
 
  I'm
  trying to see if I can get the architect to set one in stone in a future
  proof way.
 
  Cheers,
  Ben.
 
 
 --
 To unsubscribe from this list: send the line unsubscribe kvm in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html


--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v8 2/3] powerpc/eeh: EEH support for VFIO PCI device

2014-06-05 Thread Benjamin Herrenschmidt

On Thu, 2014-06-05 at 16:36 +1000, Gavin Shan wrote:
 +#define EEH_OPT_GET_PE_ADDR0   /* Get PE addr  */
 +#define EEH_OPT_GET_PE_MODE1   /* Get PE mode  */

I assume that's just some leftover from the previous patches :-)

Don't respin just yet, let's see what other comments come in.

Cheers,
Ben.


--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 3/3] PPC: KVM: Add support for 64bit TCE windows

2014-06-05 Thread Benjamin Herrenschmidt

On Thu, 2014-06-05 at 17:25 +1000, Alexey Kardashevskiy wrote:
 +This creates a virtual TCE (translation control entry) table, which
 +is an IOMMU for PAPR-style virtual I/O.  It is used to translate
 +logical addresses used in virtual I/O into guest physical addresses,
 +and provides a scatter/gather capability for PAPR virtual I/O.
 +
 +/* for KVM_CAP_SPAPR_TCE_64 */
 +struct kvm_create_spapr_tce_64 {
 +   __u64 liobn;
 +   __u64 window_size;
 +   __u64 bus_offset;
 +   __u32 page_shift;
 +   __u32 flags;
 +};
 +
 +The liobn field gives the logical IO bus number for which to create a
 +TCE table. The window_size field specifies the size of the DMA window
 +which this TCE table will translate - the table will contain one 64
 +bit TCE entry for every IOMMU page. The bus_offset field tells where
 +this window is mapped on the IO bus. 

Hrm, the bus_offset cannot be set arbitrarily, it has some pretty strong
HW limits depending on the type of bridge  architecture version...

Do you plan to have that knowledge in qemu ? Or do you have some other
mechanism to query it ? (I might be missing a piece of the puzzle here).

Also one thing I've been pondering ...

We'll end up wasting a ton of memory with those TCE tables. If you have
3 PEs mapped into a guest, it will try to create 3 DDW's mapping the
entire guest memory and so 3 TCE tables large enough for that ... and
which will contain exactly the same entries !

We really want to look into extending PAPR to allow the creation of
table aliases so that the guest can essentially create one table and
associate it with multiple PEs. We might still decide to do multiple
copies for NUMA reasons but no more than one per node for example... at
least we can have the policy in qemu/kvm.

Also, do you currently require allocating a single physically contiguous
table or do you support TCE trees in your implementation ?

Cheers,
Ben.

--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 3/3] PPC: KVM: Add support for 64bit TCE windows

2014-06-05 Thread Benjamin Herrenschmidt

On Thu, 2014-06-05 at 19:26 +1000, Alexey Kardashevskiy wrote:
 
 No trees yet. For 64GB window we need (6430)/(1620)*8 = 32K TCE table.
 Do we really need trees?

The above is assuming hugetlbfs backed guests. These are the least of my worry
indeed. But we need to deal with 4k and 64k guests.

Cheers,
Ben


--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 3/3] PPC: KVM: Add support for 64bit TCE windows

2014-06-05 Thread Benjamin Herrenschmidt

On Thu, 2014-06-05 at 13:56 +0200, Alexander Graf wrote:
 What if we ask user space to give us a pointer to user space allocated 
 memory along with the TCE registration? We would still ask user space to 
 only use the returned fd for TCE modifications, but would have some 
 nicely swappable memory we can store the TCE entries in.

That isn't going to work terribly well for VFIO :-) But yes, for
emulated devices, we could improve things a bit, including for
the 32-bit TCE tables.

For emulated, the real mode path could walk the page tables and fallback
to virtual mode  get_user if the page isn't present, thus operating
directly on qemu memory TCE tables instead of the current pinned stuff.

However that has a cost in performance, but since that's really only
used for emulated devices and PAPR VIOs, it might not be a huge issue.

But for VFIO we don't have much choice, we need to create something the
HW can access.

 In fact, the code as is today can allocate an arbitrary amount of pinned 
 kernel memory from within user space without any checks.

Right. We should at least account it in the locked limit.

Cheers,
Ben.


--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 4/4] powerpc/eeh: Avoid event on passed PE

2014-06-03 Thread Benjamin Herrenschmidt

On Tue, 2014-06-03 at 09:45 +0200, Alexander Graf wrote:
 For EEH it could as well be a dumb eventfd - really just a side
 channel that can tell user space that something happened
 asynchronously :).

Which the host kernel may have no way to detect without actively poking
at the device (fences in powernv or anything in PAPR host) and the only
user of this for now has no use for.

I insist don't bother.

Cheers,
Ben.


--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: powerpc/pseries: Use new defines when calling h_set_mode

2014-05-29 Thread Benjamin Herrenschmidt

On Thu, 2014-05-29 at 23:27 +0200, Alexander Graf wrote:
 On 29.05.14 09:45, Michael Neuling wrote:
  +/* Values for 2nd argument to H_SET_MODE */
  +#define H_SET_MODE_RESOURCE_SET_CIABR1
  +#define H_SET_MODE_RESOURCE_SET_DAWR2
  +#define H_SET_MODE_RESOURCE_ADDR_TRANS_MODE3
  +#define H_SET_MODE_RESOURCE_LE4
 
  Much better, but I think you want to make use of these in non-kvm code too,
  no? At least the LE one is definitely already implemented as call :)
  Sure but that's a different patch below.
 
 Ben, how would you like to handle these 2 patches? If you give me an ack 
 I can just put this patch into my kvm queue. Alternatively we could both 
 carry a patch that adds the H_SET_MODE header bits only and whoever hits 
 Linus' tree first wins ;).

No biggie. Worst case it's a trivial conflict.

Cheers,
Ben.

 
 Alex
 
 
  Mikey
 
 
  powerpc/pseries: Use new defines when calling h_set_mode
 
  Now that we define these in the KVM code, use these defines when we call
  h_set_mode.  No functional change.
 
  Signed-off-by: Michael Neuling mi...@neuling.org
  --
  This depends on the KVM h_set_mode patches.
 
  diff --git a/arch/powerpc/include/asm/plpar_wrappers.h 
  b/arch/powerpc/include/asm/plpar_wrappers.h
  index 12c32c5..67859ed 100644
  --- a/arch/powerpc/include/asm/plpar_wrappers.h
  +++ b/arch/powerpc/include/asm/plpar_wrappers.h
  @@ -273,7 +273,7 @@ static inline long plpar_set_mode(unsigned long mflags, 
  unsigned long resource,
static inline long enable_reloc_on_exceptions(void)
{
  /* mflags = 3: Exceptions at 0xC0004000 */
  -   return plpar_set_mode(3, 3, 0, 0);
  +   return plpar_set_mode(3, H_SET_MODE_RESOURCE_ADDR_TRANS_MODE, 0, 0);
}

/*
  @@ -284,7 +284,7 @@ static inline long enable_reloc_on_exceptions(void)
 * returns H_SUCCESS.
 */
static inline long disable_reloc_on_exceptions(void) {
  -   return plpar_set_mode(0, 3, 0, 0);
  +   return plpar_set_mode(0, H_SET_MODE_RESOURCE_ADDR_TRANS_MODE, 0, 0);
}

/*
  @@ -297,7 +297,7 @@ static inline long disable_reloc_on_exceptions(void) {
static inline long enable_big_endian_exceptions(void)
{
  /* mflags = 0: big endian exceptions */
  -   return plpar_set_mode(0, 4, 0, 0);
  +   return plpar_set_mode(0, H_SET_MODE_RESOURCE_LE, 0, 0);
}

/*
  @@ -310,17 +310,17 @@ static inline long enable_big_endian_exceptions(void)
static inline long enable_little_endian_exceptions(void)
{
  /* mflags = 1: little endian exceptions */
  -   return plpar_set_mode(1, 4, 0, 0);
  +   return plpar_set_mode(1, H_SET_MODE_RESOURCE_LE, 0, 0);
}

static inline long plapr_set_ciabr(unsigned long ciabr)
{
  -   return plpar_set_mode(0, 1, ciabr, 0);
  +   return plpar_set_mode(0, H_SET_MODE_RESOURCE_SET_CIABR, ciabr, 0);
}

static inline long plapr_set_watchpoint0(unsigned long dawr0, unsigned 
  long dawrx0)
{
  -   return plpar_set_mode(0, 2, dawr0, dawrx0);
  +   return plpar_set_mode(0, H_SET_MODE_RESOURCE_SET_DAWR, dawr0, dawrx0);
}

#endif /* _ASM_POWERPC_PLPAR_WRAPPERS_H */
 


--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v7 3/3] drivers/vfio: EEH support for VFIO PCI device

2014-05-28 Thread Benjamin Herrenschmidt

On Wed, 2014-05-28 at 22:49 +1000, Gavin Shan wrote:
 
 I will remove those address related macros in next revision because it's
 user-level bussiness, not related to host kernel any more.
 
 If the user is QEMU + guest, we need the address to identify the PE though PHB
 BUID could be used as same purpose to get PHB, which is one-to-one mapping 
 with
 IOMMU group on sPAPR platform. However, once the PE address is built and 
 returned
 to guest, guest will use the PE address as input parameter in subsequent RTAS
 calls.
 
 If the user is some kind of application who just uses the ioctl() without 
 supporting
 RTAS calls. We don't need care about PE address. 

I am a bit reluctant with that PE==PHB equation we seem to be introducing.

This isn't the case in HW. It's possible that this is how we handle VFIO *today*
in qemu but it doesn't have to be does it ?

It also paints us into a corner if we want to start implementing some kind of
emulated EEH for selected emulated devices and/or virtio.

Cheers,
Ben.


--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v7 3/3] drivers/vfio: EEH support for VFIO PCI device

2014-05-28 Thread Benjamin Herrenschmidt

On Thu, 2014-05-29 at 10:05 +1000, Gavin Shan wrote:
 The log stuff is TBD and I'll figure it out later.
 
 About to what are the errors, there are a lot. Most of them are related
 to hardware level, for example unstable PCI link. Usually, those error
 bits defined in AER fatal error state register contribute to EEH errors.
 It could be software related, e.g. violating IOMMU protection (read/write
 permission etc), or even one PCI device isn't capable of DMAing. Hopefully,
 it's the explaination you're looking for? :-)

Note to Alex('s) ...

The log we get from FW at the moment in the host is:

  - In the case of pHyp / RTAS host, opaque. Basically it's a blob that we store
and that can be sent to IBM service people :-) Not terribly practical.

  - On PowerNV, it's a IODA specific data structure (basically a dump of a 
bunch of register state and tables). IODA is our IO architecture (sadly the
document itself isn't public at this point) and we have two versions, IODA1
and IODA2. You can consider the structure as chipset specific basically.

What I want to do in the long run is:

  - In the case of pHyp/RTAS host, there's not much we can do, so basically
forward that log as-is.

  - In the case of PowerNV, forward the log *and* add a well-defined blob to
it that does some basic interpretation of it. In fact I want to do the latter
more generally in the host kernel for host kernel errors as well, but we
can forward it to the guest via VFIO too. What I mean by interpretation is
something along the lines of an error type DMA IOMMU fault, MMIO error,
Link loss, PCIe UE, ... among a list of well known error types that
represent the most common ones, with a little bit of added info when
available (for most DMA errors we can provide the DMA address that faulted
for example).

So Gavin and I need to work a bit on that, both in the context of the host
kernel to improve the reporting there, and in the context of what we pass to
user space.

However, no driver today cares about that information. The PCIe error recovery
system doesn't carry it and it has no impact on the EEH recovery procedures,
so EEH in that sense is useful without that reporting. It's useful for the
programmer (or user/admin) to identify what went wrong but it's not used by
the automated recovery process.

One last thing to look at is in the case of a VFIO device, we might want to
silence the host kernel printf's once we support guest EEH since otherwise
the guest has a path to flood the host kernel log by triggering a lot of
EEH errors purposefully.

Cheers,
Ben.


--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v7 3/3] drivers/vfio: EEH support for VFIO PCI device

2014-05-27 Thread Benjamin Herrenschmidt

On Tue, 2014-05-27 at 12:15 -0600, Alex Williamson wrote:

  +/*
  + * Reset is the major step to recover problematic PE. The following
  + * command helps on that.
  + */
  +struct vfio_eeh_pe_reset {
  +   __u32 argsz;
  +   __u32 flags;
  +   __u32 option;
  +#define VFIO_EEH_PE_RESET_DEACTIVATE   0   /* Deactivate reset 
  */
  +#define VFIO_EEH_PE_RESET_HOT  1   /* Hot reset
  */
  +#define VFIO_EEH_PE_RESET_FUNDAMENTAL  3   /* Fundamental reset
  */
 
 How does a user know which of these to use?

The usual way is the driver asks for one or the other, this plumbs back
into the guest EEH code which itself plumbs into the PCIe error recovery
framework in Linux.

However I do have a question for Gavin here: Why do we expose an
explicit deactivate ? The reset functions should do the whole
reset sequence (assertion, delay, deassertion). In fact the firmware
doesn't really give you a choice for PERST right ? Or do we have
a requirement to expose both phases for RTAS? (In that case I'm
happy to ignore the deassertion there too).

Cheers,
Ben.

--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v7 3/3] drivers/vfio: EEH support for VFIO PCI device

2014-05-27 Thread Benjamin Herrenschmidt

On Tue, 2014-05-27 at 14:37 -0600, Alex Williamson wrote:

  The usual way is the driver asks for one or the other, this plumbs back
  into the guest EEH code which itself plumbs into the PCIe error recovery
  framework in Linux.
 
 So magic?

Yes. The driver is expected to more or less knows what kind of reset it
wants for its device. Ideally hot reset is sufficient but some drivers
knows that the device they drive is crappy enough that it mostly ignores
hot reset and really needs a PERST for example...

Also we have other reasons to expose those interfaces outside of EEH. 

For example, some drivers might want to specifically trigger a PERST
after a microcode update. IE. There are path outside of EEH error
recovery where drivers in the guest might want to trigger a reset
to the device and they have control under some circumstances on
which kind of reset they are doing (and the guest Linux does  have
different code path to do a hot reset vs. a fundamental reset).

So we need to expose that distinction to be able to honor the guest
decision.

Cheers,
Ben.


--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v6 2/3] drivers/vfio: EEH support for VFIO PCI device

2014-05-22 Thread Benjamin Herrenschmidt

On Fri, 2014-05-23 at 14:37 +1000, Gavin Shan wrote:
 There's no notification, the user needs to observe the return value an
 poll?  Should we be enabling an eventfd to notify the user of the state
 change?
 
 
 Yes. The user needs to monitor the return value. we should have one 
 notification,
 but it's for later as we discussed :-)

 ../..

 How does the guest learn about the error?  Does it need to?
 
 When guest detects 0xFF's from reading PCI config space or IO, it's going
 check the device (PE) state. If the device (PE) has been put into frozen
 state, the recovery will be started.

Quick recap for Alex W (we discussed that with Alex G).

While a notification looks like a worthwhile addition in the long run, it
is not sufficient and not used today and I prefer that we keep that as something
to add later for those two main reasons:

 - First, the kernel itself isn't always notified. For example, if we implement
on top of an RTAS backend (PR KVM under pHyp) or if we are on top of PowerNV but
the error is a PHB fence (the entire PCI Host bridge gets fenced out in 
hardware
due to an internal error), then we get no notification. Only polling of the
hardware or firmware will tell us. Since we don't want to have a polling timer
in the kernel, that means that the userspace client of VFIO (or alternatively
the KVM guest) is the one that polls.

 - Second, this is how our primary user expects it: The primary (and only 
initial)
user of this will be qemu/KVM for PAPR guests and they don't have a notification
mechanism. Instead they query the EEH state after detecting an all 1's return 
from
MMIO or config space. This is how PAPR specifies it so we are just implementing 
the
spec here :-)

Because of these, I think we shouldn't worry too much about notification at
this stage.

Cheers,
Ben.


--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 3/4] drivers/vfio: New IOCTL command VFIO_EEH_INFO

2014-05-21 Thread Benjamin Herrenschmidt

On Wed, 2014-05-21 at 08:23 +0200, Alexander Graf wrote:
  Note to Alex: This definitely kills the notifier idea for now
 though,
  at least as a first class citizen of the design. We can add it as an
  optional optimization on top later.
 
 I don't think it does. The notifier would just get triggered on config
 space read failures for example :). It's really just an aid for the
 vfio user to have a common code path for error handling.

I'll let Gavin make the final call on that one, if he thinks we can
reliably trigger it and there isn't too much code churn as a
consequence.

Cheers,
Ben.

--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v5 3/4] drivers/vfio: EEH support for VFIO PCI device

2014-05-21 Thread Benjamin Herrenschmidt

On Wed, 2014-05-21 at 15:07 +0200, Alexander Graf wrote:

  +#ifdef CONFIG_VFIO_PCI_EEH
  +int eeh_vfio_open(struct pci_dev *pdev)
 
 Why vfio? Also that config option will not be set if vfio is compiled as 
 a module.
 
  +{
  +   struct eeh_dev *edev;
  +
  +   /* No PCI device ? */
  +   if (!pdev)
  +   return -ENODEV;
  +
  +   /* No EEH device ? */
  +   edev = pci_dev_to_eeh_dev(pdev);
  +   if (!edev || !edev-pe)
  +   return -ENODEV;
  +
  +   eeh_dev_set_passed(edev, true);
  +   eeh_pe_set_passed(edev-pe, true);
  +
  +   return 0;
  +}
  +EXPORT_SYMBOL_GPL(eeh_vfio_open);

Additionally, shouldn't we have some locking here ? (and in release too)

I don't like relying on the caller locking (if it does it at all).

  +   /* Device existing ? */
  +   ret = eeh_vfio_check_dev(pdev, edev, pe);
  +   if (ret) {
  +   pr_debug(%s: Cannot find device %s\n,
  +   __func__, pdev ? pci_name(pdev) : NULL);
  +   *retval = -7;
 
 What are these? Please use proper kernel internal return values for 
 errors. I don't want to see anything even remotely tied to RTAS in any 
 of these patches.

Hint: -ENODEV

Cheers,
Ben.


--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 4/4] powerpc/eeh: Avoid event on passed PE

2014-05-20 Thread Benjamin Herrenschmidt

On Tue, 2014-05-20 at 21:56 +1000, Gavin Shan wrote:

 .../...

 I think what you want is an irqfd that the in-kernel eeh code
 notifies when it sees a failure. When such an fd exists, the kernel
 skips its own error handling.
 
 
 Yeah, it's a good idea and something for me to improve in phase II. We
 can discuss for more later. For now, what I have in my head is something
 like this:

However, this would be a deviation from (or extension of) PAPR. At the
moment, the way things work in PAPR is that the guest is responsible for
querying the EEH state when something looks like an error (ie, getting
ff's back). This is also how it works in pHyp.

We have an interrupt path in the host when doing native EEH, and it
would be nice to extend PAPR to also be able to shoot an event to the
guest possibly using RTAS events, but let's get the basics working and
upstream first.

Cheers,
Ben.


--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 4/4] powerpc/eeh: Avoid event on passed PE

2014-05-20 Thread Benjamin Herrenschmidt

On Tue, 2014-05-20 at 15:49 +0200, Alexander Graf wrote:
 Instead of
 
if (passed_flag)
  return;
 
 you would do
 
if (trigger_irqfd) {
  trigger_irqfd();
  return;
}
 
 which would be a much nicer, generic interface.

But that's not how PAPR works.

Cheers,
Ben.


--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 4/4] powerpc/eeh: Avoid event on passed PE

2014-05-20 Thread Benjamin Herrenschmidt

On Tue, 2014-05-20 at 15:49 +0200, Alexander Graf wrote:
 So how about we just implement this whole thing properly as irqfd? 
 Whether QEMU can actually do anything with the interrupt is a different 
 question - we can leave it be for now. But we could model all the code 
 with the assumption that it should either handle the error itself or 
 trigger and irqfd write.

I don't object to the idea... however this smells of Deja Vu...

You often tend to want to turn something submitted that fills a specific
gap and implements a specific spec/function into some kind of idealized
grand design :-) And that means nothing gets upstream for weeks or monthes
as we churn and churn...

Sometimes it's probably worth it. Here I would argue against it and would
advocate for doing the basic functionality first, as it is used by guests,
and later add the irqfd option. I don't see any emergency here and adding
the irqfd will not cause fundamental design changes:

The passed flag (though I'm not fan of the name) is really something
we want in the low level handlers to avoid triggering host side EEH in
various places, regardless of whether we use irqfd or not.

This is totally orthogonal from the mechanism used for notifications.

Even in host, the detection path doesn't always involve interrupts, and
we can detect some things as a result of a host side config space access
for example etc...

So let's keep things nice and separate here. The interrupt notification
is just an optimization which will speed up discovery of the error in
*some* cases later on (but adds its own complexity since we have multiple
discovery path in host, so we need to keep track whether we have notified
yet or not etc...) so let's keep it for later.

Cheers,
Ben.


--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 3/4] drivers/vfio: New IOCTL command VFIO_EEH_INFO

2014-05-20 Thread Benjamin Herrenschmidt

On Tue, 2014-05-20 at 14:25 +0200, Alexander Graf wrote:
  - Move eeh-vfio.c to drivers/vfio/pci/
  - From eeh-vfio.c, dereference arch/powerpc/kernel/eeh.c::eeh_ops,
 which
 is arch/powerpc/plaforms/powernv/eeh-powernv.c::powernv_eeh_ops.
 Call
 
 Hrm, I think it'd be nicer to just export individual functions that do
 thing you want to do from eeh.c.

We already have an eeh_ops backend system with callbacks since we have
different backends for RTAS and powernv, so we could do what you suggest
but it would probably just boil down to wrappers around the EEH ops.

No big opinion either way on my side though.

Cheers,
Ben.


--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 3/4] drivers/vfio: New IOCTL command VFIO_EEH_INFO

2014-05-20 Thread Benjamin Herrenschmidt

On Tue, 2014-05-20 at 22:39 +1000, Gavin Shan wrote:
 Yeah. How about this? :-)
 
 - Move eeh-vfio.c to drivers/vfio/pci/
 - From eeh-vfio.c, dereference arch/powerpc/kernel/eeh.c::eeh_ops, which
is arch/powerpc/plaforms/powernv/eeh-powernv.c::powernv_eeh_ops. Call
 
 Hrm, I think it'd be nicer to just export individual functions that
 do thing you want to do from eeh.c.
 
 
 Ok. Got it. Thanks for your comments :)

The interesting thing with this approach is that VFIO per-se can work
with EEH RTAS backend too in the host.

IE, with PR KVM for example or with non-KVM uses of VFIO, it would be
possible to use a device in a user process and exploit EEH even when
running under a PAPR hypervisor.

That is, vfio-eeh uses generic exported EEH APIs from the EEH core
that will work on both powernv and RTAS backends.

Note to Alex: This definitely kills the notifier idea for now though,
at least as a first class citizen of the design. We can add it as an
optional optimization on top later.

Cheers,
Ben.


--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 2/8] powerpc/eeh: Info to trace passed devices

2014-05-19 Thread Benjamin Herrenschmidt

On Mon, 2014-05-19 at 14:46 +0200, Alexander Graf wrote:
 I don't see the point of VFIO knowing about guest addresses. They are 
 not unique across a system and the whole idea that a VFIO device has to 
 be owned by a guest is also pretty dubious.
 
 I suppose what you really care about here is just a token for a specific 
 device? But why do you need one where we don't have tokens yet?

I think this is going to be needed when doing in-kernel acceleration
of some of the RTAS calls, but yes, Gavin, why do we need that now ?

Cheers,
Ben.


--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 6/8] powerpc: Extend syscall ppc_rtas()

2014-05-19 Thread Benjamin Herrenschmidt

On Mon, 2014-05-19 at 14:55 +0200, Alexander Graf wrote:
 On 14.05.14 06:12, Gavin Shan wrote:
  Originally, syscall ppc_rtas() can be used to invoke RTAS call from
  user space. Utility errinjct is using it to inject various errors
  to the system for testing purpose. The patch intends to extend the
  syscall to support both pSeries and PowerNV platform. With that,
  RTAS and OPAL call can be invoked from user space. In turn, utility
  errinjct can be supported on pSeries and PowerNV platform at same
  time.
 
  The original syscall handler ppc_rtas() is renamed to ppc_firmware(),
  which calls ppc_call_rtas() or ppc_call_opal() depending on the
  running platform. The data transported between userland and kerenl is
 
 Please fix your spelling of kernel.
 
  by struct rtas_args. It's platform specific on how to use the data.
 
  Signed-off-by: Mike Qiu qiud...@linux.vnet.ibm.com
  Signed-off-by: Gavin Shan gws...@linux.vnet.ibm.com
 
 I think the basic idea to maintain the same interface between PAPR and 
 OPAL to user space is sound, but this is really Ben's call.

Yeah that worries me a bit, RTAS and OPAL are completely different
beasts.

We can keep that error injection separate from the rest of the EEH
enablement for now. I'll look at it when I get a chance.

Cheers,
Ben.


--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 8/8] powerpc/powernv: Error injection infrastructure

2014-05-19 Thread Benjamin Herrenschmidt

On Mon, 2014-05-19 at 15:04 +0200, Alexander Graf wrote:
 On 14.05.14 06:12, Gavin Shan wrote:
  The patch intends to implement the error injection infrastructure
  for PowerNV platform. The predetermined handlers will be called
  according to the type of injected error (e.g. OpalErrinjctTypeIoaBusError).
  For now, we just support PCI error injection. We need support
  injecting other types of errors in future.
 
 Your token to a VFIO device is the VFIO fd. If you want to inject an 
 error into that device, you should do it via that token. That gets you 
 all permission problems solved for free.
 
 But I still didn't quite grasp why you need to do this. Why do we need 
 to inject an error into a device via OPAL when we want to do EEH inside 
 of a guest? Are you trying to emulate guest side error injection?

Yes, that's what he's trying to do but let's keep that separate from
the core EEH. I'd like the latter to be reviewed /fixed and upstream
first, then we can look at guest side injection.

Cheers,
Ben.


--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 3/8] drivers/vfio: New IOCTL command VFIO_EEH_INFO

2014-05-19 Thread Benjamin Herrenschmidt

On Mon, 2014-05-19 at 16:33 -0600, Alex Williamson wrote:
 On Wed, 2014-05-14 at 14:11 +1000, Gavin Shan wrote:
  The patch adds new IOCTL command VFIO_EEH_INFO to VFIO container
  to support EEH functionality for PCI devices, which have been
  passed from host to guest via VFIO.
 
 Some comments throughout, but overall this seems to forgo every bit of
 the device ownership and protection model used by VFIO and lets the user
 pick arbitrary host devices and do various operations, mostly unchecked.
 That's not acceptable.

Right, I don't understand why we need that mapping for base EEH. The
VFIO fd owns the PE, it can do operations on that PE, it doesn't need to
know the guest token/address.

Only when we start doing in-kernel acceleration of RTAS do we need that
sort of mapping established, which we'll be able to do via something
akin to what we did with TCEs to ensure we validate that we have
ownership of the physical device.

But let's proceed step by step. First, have something that is strictly
EEH from qemu RTAS to VFIO via ioctl, and that requires no mapping of
any sort.

Cheers,
Ben.

  Signed-off-by: Gavin Shan gws...@linux.vnet.ibm.com
  ---
   arch/powerpc/platforms/powernv/Makefile   |   1 +
   arch/powerpc/platforms/powernv/eeh-vfio.c | 593 
  ++
   drivers/vfio/vfio_iommu_spapr_tce.c   |  12 +
   include/uapi/linux/vfio.h |  57 +++
   4 files changed, 663 insertions(+)
   create mode 100644 arch/powerpc/platforms/powernv/eeh-vfio.c
  
  diff --git a/arch/powerpc/platforms/powernv/Makefile 
  b/arch/powerpc/platforms/powernv/Makefile
  index 63cebb9..2b15a03 100644
  --- a/arch/powerpc/platforms/powernv/Makefile
  +++ b/arch/powerpc/platforms/powernv/Makefile
  @@ -6,5 +6,6 @@ obj-y   += opal-msglog.o
   obj-$(CONFIG_SMP)  += smp.o
   obj-$(CONFIG_PCI)  += pci.o pci-p5ioc2.o pci-ioda.o
   obj-$(CONFIG_EEH)  += eeh-ioda.o eeh-powernv.o
  +obj-$(CONFIG_VFIO_EEH) += eeh-vfio.o
   obj-$(CONFIG_PPC_SCOM) += opal-xscom.o
   obj-$(CONFIG_MEMORY_FAILURE)   += opal-memory-errors.o
  diff --git a/arch/powerpc/platforms/powernv/eeh-vfio.c 
  b/arch/powerpc/platforms/powernv/eeh-vfio.c
  new file mode 100644
  index 000..69d5f2d
  --- /dev/null
  +++ b/arch/powerpc/platforms/powernv/eeh-vfio.c
  @@ -0,0 +1,593 @@
  +/*
  +  * The file intends to support EEH funtionality for those PCI devices,
  +  * which have been passed through from host to guest via VFIO. So this
  +  * file is naturally part of VFIO implementation on PowerNV platform.
  +  *
  +  * Copyright Benjamin Herrenschmidt  Gavin Shan, IBM Corporation 2014.
  +  *
  +  * This program is free software; you can redistribute it and/or modify
  +  * it under the terms of the GNU General Public License as published by
  +  * the Free Software Foundation; either version 2 of the License, or
  +  * (at your option) any later version.
  +  */
  +
  +#include linux/init.h
  +#include linux/io.h
  +#include linux/irq.h
  +#include linux/kernel.h
  +#include linux/kvm_host.h
  +#include linux/msi.h
  +#include linux/pci.h
  +#include linux/string.h
  +#include linux/vfio.h
  +
  +#include asm/eeh.h
  +#include asm/eeh_event.h
  +#include asm/io.h
  +#include asm/iommu.h
  +#include asm/opal.h
  +#include asm/msi_bitmap.h
  +#include asm/pci-bridge.h
  +#include asm/ppc-pci.h
  +#include asm/tce.h
  +#include asm/uaccess.h
  +
  +#include powernv.h
  +#include pci.h
  +
  +static int powernv_eeh_vfio_map(struct vfio_eeh_info *info)
  +{
  +   struct pci_bus *bus, *pe_bus;
  +   struct pci_dev *pdev;
  +   struct eeh_dev *edev;
  +   struct eeh_pe *pe;
  +   int domain, bus_no, devfn;
  +
  +   /* Host address */
  +   domain = info-map.host_domain;
  +   bus_no = (info-map.host_cfg_addr  8)  0xff;
  +   devfn = info-map.host_cfg_addr  0xff;
 
 Where are we validating that the user has any legitimate claim to be
 touching this device?
 
  +   /* Find PCI bus */
  +   bus = pci_find_bus(domain, bus_no);
  +   if (!bus) {
  +   pr_warn(%s: PCI bus %04x:%02x not found\n,
  +   __func__, domain, bus_no);
  +   return -ENODEV;
  +   }
  +
  +   /* Find PCI device */
  +   pdev = pci_get_slot(bus, devfn);
  +   if (!pdev) {
  +   pr_warn(%s: PCI device %04x:%02x:%02x.%01x not found\n,
  +   __func__, domain, bus_no,
  +   PCI_SLOT(devfn), PCI_FUNC(devfn));
  +   return -ENODEV;
  +   }
  +
  +   /* No EEH device - almost impossible */
  +   edev = pci_dev_to_eeh_dev(pdev);
  +   if (unlikely(!edev)) {
  +   pci_dev_put(pdev);
  +   pr_warn(%s: No EEH dev for PCI device %s\n,
  +   __func__, pci_name(pdev));
  +   return -ENODEV;
  +   }
  +
  +   /* Doesn't support PE migration between different PHBs */
  +   pe = edev-pe;
  +   if (!eeh_pe_passed(pe)) {
  +   pe_bus = eeh_pe_bus_get(pe);
  +   BUG_ON(!pe_bus);
 
 Can a user trigger

Re: [PATCH RFC 00/22] EEH Support for VFIO PCI devices on PowerKVM guest

2014-05-06 Thread Benjamin Herrenschmidt

On Tue, 2014-05-06 at 08:56 +0200, Alexander Graf wrote:
  For the error injection, I guess I have to put the logic token
 management
  into QEMU and error injection request will be handled by QEMU and
 then
  routed to host kernel via additional syscall as we did for pSeries.
 
 Yes, start off without in-kernel XICS so everything simply lives in 
 QEMU. Then add callbacks into the in-kernel XICS to inject these 
 interrupts if we don't have wide enough interfaces already.

It's got nothing to do with XICS ... :-)

But yes, we can route everything via qemu for now, then we'll need
at least one of the call to have a direct path but we should probably
strive to even make it real mode if that's possible, it's the one that
Linux will call whenever an MMIO returns all f's to check if the
underlying PE is frozen.

But we can do that as a second stage.

In fact going via VFIO ioctl's does make the whole security and
translation model much simpler initially.

Cheers,
Ben.


--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] KVM: PPC: BOOK3S: HV: Don't try to allocate from kernel page allocator for hash page table.

2014-05-06 Thread Benjamin Herrenschmidt

On Tue, 2014-05-06 at 09:05 +0200, Alexander Graf wrote:
 On 06.05.14 02:06, Benjamin Herrenschmidt wrote:
  On Mon, 2014-05-05 at 17:16 +0200, Alexander Graf wrote:
  Isn't this a greater problem? We should start swapping before we hit
  the point where non movable kernel allocation fails, no?
  Possibly but the fact remains, this can be avoided by making sure that
  if we create a CMA reserve for KVM, then it uses it rather than using
  the rest of main memory for hash tables.
 
 So why were we preferring non-CMA memory before? Considering that Aneesh 
 introduced that logic in fa61a4e3 I suppose this was just a mistake?

I assume so.

  The fact that KVM uses a good number of normal kernel pages is maybe
  suboptimal, but shouldn't be a critical problem.
  The point is that we explicitly reserve those pages in CMA for use
  by KVM for that specific purpose, but the current code tries first
  to get them out of the normal pool.
 
  This is not an optimal behaviour and is what Aneesh patches are
  trying to fix.
 
 I agree, and I agree that it's worth it to make better use of our 
 resources. But we still shouldn't crash.

Well, Linux hitting out of memory conditions has never been a happy
story :-)

 However, reading through this thread I think I've slowly grasped what 
 the problem is. The hugetlbfs size calculation.

Not really.

 I guess something in your stack overreserves huge pages because it 
 doesn't account for the fact that some part of system memory is already 
 reserved for CMA.

Either that or simply Linux runs out because we dirty too fast...
really, Linux has never been good at dealing with OO situations,
especially when things like network drivers and filesystems try to do
ATOMIC or NOIO allocs...
 
 So the underlying problem is something completely orthogonal. The patch 
 body as is is fine, but the patch description should simply say that we 
 should prefer the CMA region because it's already reserved for us for 
 this purpose and we make better use of our available resources that way.

No.

We give a chunk of memory to hugetlbfs, it's all good and fine.

Whatever remains is split between CMA and the normal page allocator.

Without Aneesh latest patch, when creating guests, KVM starts allocating
it's hash tables from the latter instead of CMA (we never allocate from
hugetlb pool afaik, only guest pages do that, not hash tables).

So we exhaust the page allocator and get linux into OOM conditions
while there's plenty of space in CMA. But the kernel cannot use CMA for
it's own allocations, only to back user pages, which we don't care about
because our guest pages are covered by our hugetlb reserve :-)

 All the bits about pinning, numa, libvirt and whatnot don't really 
 matter and are just details that led Aneesh to find this non-optimal 
 allocation.

Cheers,
Ben.


--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH] KVM: PPC: BOOK3S: HV: THP support for guest

2014-05-06 Thread Benjamin Herrenschmidt

On Tue, 2014-05-06 at 11:12 +0200, Alexander Graf wrote:

 So if I understand this patch correctly, it simply introduces logic to 
 handle page sizes other than 4k, 64k, 16M by analyzing the actual page 
 size field in the HPTE. Mind to explain why exactly that enables us to 
 use THP?

 What exactly is the flow if the pages are not backed by huge pages? What 
 is the flow when they start to get backed by huge pages?

The hypervisor doesn't care about segments ... but it needs to properly
decode the page size requested by the guest, if anything, to issue the
right form of tlbie instruction.

The encoding in the HPTE for a 16M page inside a 64K segment is
different than the encoding for a 16M in a 16M segment, this is done so
that the encoding carries both information, which allows broadcast
tlbie to properly find the right set in the TLB for invalidations among
others.

So from a KVM perspective, we don't know whether the guest is doing THP
or something else (Linux calls it THP but all we care here is that this
is MPSS, another guest than Linux might exploit that differently).

What we do know is that if we advertise MPSS, we need to decode the page
sizes encoded in the HPTE so that we know what we are dealing with in
H_ENTER and can do the appropriate TLB invalidations in H_REMOVE 
evictions.

  +   if (a_size != -1)
  +   return 1ul  mmu_psize_defs[a_size].shift;
  +   }
  +
  +   }
  +   return 0;
}

static inline unsigned long hpte_rpn(unsigned long ptel, unsigned long 
  psize)
  diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
  index 8227dba5af0f..a38d3289320a 100644
  --- a/arch/powerpc/kvm/book3s_hv.c
  +++ b/arch/powerpc/kvm/book3s_hv.c
  @@ -1949,6 +1949,13 @@ static void kvmppc_add_seg_page_size(struct 
  kvm_ppc_one_seg_page_size **sps,
   * support pte_enc here
   */
  (*sps)-enc[0].pte_enc = def-penc[linux_psize];
  +   /*
  +* Add 16MB MPSS support
  +*/
  +   if (linux_psize != MMU_PAGE_16M) {
  +   (*sps)-enc[1].page_shift = 24;
  +   (*sps)-enc[1].pte_enc = def-penc[MMU_PAGE_16M];
  +   }
 
 So this basically indicates that every segment (except for the 16MB one) 
 can also handle 16MB MPSS page sizes? I suppose you want to remove the 
 comment in kvm_vm_ioctl_get_smmu_info_hv() that says we don't do MPSS here.

I haven't reviewed the code there, make sure it will indeed do a
different encoding for every combination of segment/actual page size.

 Can we also ensure that every system we run on can do MPSS?

P7 and P8 are identical in that regard. However 970 doesn't do MPSS so
let's make sure we get that right.

Cheers,
Ben.
 

--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH] KVM: PPC: BOOK3S: HV: THP support for guest

2014-05-06 Thread Benjamin Herrenschmidt

On Tue, 2014-05-06 at 21:38 +0530, Aneesh Kumar K.V wrote:

  I updated the commit message as below. Let me know if this is ok.
 
   KVM: PPC: BOOK3S: HV: THP support for guest
 
  This has nothing to do with THP.
 
 THP support in guest depend on KVM advertising MPSS feature. We already
 have rest of the changes needed to support transparent huge pages
 upstream. (We do support THP with PowerVM LPAR already). The primary
 motivation of this patch is to enable THP in powerkvm guest. 

I would argue (nit picking, I know ... :-) that the subject should be
Enable MPSS support for guests, and the description can then explain
that this allows Linux guests to use THP.

Cheers,
Ben.

 
   
   On recent IBM Power CPUs, while the hashed page table is looked up 
  using
   the page size from the segmentation hardware (i.e. the SLB), it is
   possible to have the HPT entry indicate a larger page size.  Thus for
   example it is possible to put a 16MB page in a 64kB segment, but since
   the hash lookup is done using a 64kB page size, it may be necessary to
   put multiple entries in the HPT for a single 16MB page.  This
   capability is called mixed page-size segment (MPSS).  With MPSS,
   there are two relevant page sizes: the base page size, which is the
   size used in searching the HPT, and the actual page size, which is the
   size indicated in the HPT entry. [ Note that the actual page size is
   always = base page size ].
   
   We advertise MPSS feature to guest only if the host CPU supports the
   same. We use ibm,segment-page-sizes device tree node to advertise
   the MPSS support. The penc encoding indicate whether we support
   a specific combination of base page size and actual page size
   in the same segment. It is also the value used in the L|LP encoding
   of HPTE entry.
   
   In-order to support MPSS in guest, KVM need to handle the below 
  details
   * advertise MPSS via ibm,segment-page-sizes
   * Decode the base and actual page size correctly from the HPTE entry
 so that we know what we are dealing with in H_ENTER and and can do
 
  Which code path exactly changes for H_ENTER?
 
 There is no real code path changes. Any code path that use
 hpte_page_size() is impacted. We return actual page size there. 
 
 
 the appropriate TLB invalidation in H_REMOVE and evictions.
 
  Apart from the grammar (which is pretty broken for the part that is not 
  copied from Paul) and the subject line this sounds quite reasonable.
 
 
 Wll try to fix.
 
 -aneesh


--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH V4] POWERPC: BOOK3S: KVM: Use the saved dar value and generic make_dsisr

2014-05-05 Thread Benjamin Herrenschmidt

On Mon, 2014-05-05 at 19:56 +0530, Aneesh Kumar K.V wrote:
 
 Paul mentioned that BOOK3S always had DAR value set on alignment
 interrupt. And the patch is to enable/collect correct DAR value when
 running with Little Endian PR guest. Now to limit the impact and to
 enable Little Endian PR guest, I ended up doing the conditional code
 only for book3s 64 for which we know for sure that we set DAR value.

Only BookS ? Afaik, the kernel align.c unconditionally uses DAR on
every processor type. It's DSISR that may or may not be populated
but afaik DAR always is.

Cheers,
Ben.


--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH V4] POWERPC: BOOK3S: KVM: Use the saved dar value and generic make_dsisr

2014-05-05 Thread Benjamin Herrenschmidt

On Mon, 2014-05-05 at 16:43 +0200, Alexander Graf wrote:
  Paul mentioned that BOOK3S always had DAR value set on alignment
  interrupt. And the patch is to enable/collect correct DAR value when
  running with Little Endian PR guest. Now to limit the impact and to
  enable Little Endian PR guest, I ended up doing the conditional code
  only for book3s 64 for which we know for sure that we set DAR value.
 
 Yes, and I'm asking whether we know that this statement holds true for 
 PA6T and G5 chips which I wouldn't consider IBM POWER. Since the G5 is 
 at least developed by IBM, I'd assume its semantics here are similar to 
 POWER4, but for PA6T I wouldn't be so sure.

I am not aware of any PowerPC processor that does not set DAR on
alignment interrupts. Paul, are you ?

Cheers,
Ben.


--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] PPC: KVM: Introduce hypervisor call H_GET_TCE

2014-02-21 Thread Benjamin Herrenschmidt

On Fri, 2014-02-21 at 16:31 +0100, Laurent Dufour wrote:
 This fix introduces the H_GET_TCE hypervisor call which is basically the
 reverse of H_PUT_TCE, as defined in the Power Architecture Platform
 Requirements (PAPR).
 
 The hcall H_GET_TCE is required by the kdump kernel which is calling it to
 retrieve the TCE set up by the panicing kernel.

Alexey, will that work for VFIO ? Or are those patches *still* not
upstream ?

 Signed-off-by: Laurent Dufour lduf...@linux.vnet.ibm.com
 ---
  arch/powerpc/include/asm/kvm_ppc.h  |2 ++
  arch/powerpc/kvm/book3s_64_vio_hv.c |   28 
  arch/powerpc/kvm/book3s_hv_rmhandlers.S |2 +-
  3 files changed, 31 insertions(+), 1 deletion(-)
 
 diff --git a/arch/powerpc/include/asm/kvm_ppc.h 
 b/arch/powerpc/include/asm/kvm_ppc.h
 index fcd53f0..4096f16 100644
 --- a/arch/powerpc/include/asm/kvm_ppc.h
 +++ b/arch/powerpc/include/asm/kvm_ppc.h
 @@ -129,6 +129,8 @@ extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
   struct kvm_create_spapr_tce *args);
  extern long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
unsigned long ioba, unsigned long tce);
 +extern long kvmppc_h_get_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 +  unsigned long ioba);
  extern struct kvm_rma_info *kvm_alloc_rma(void);
  extern void kvm_release_rma(struct kvm_rma_info *ri);
  extern struct page *kvm_alloc_hpt(unsigned long nr_pages);
 diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c 
 b/arch/powerpc/kvm/book3s_64_vio_hv.c
 index 2c25f54..89e96b3 100644
 --- a/arch/powerpc/kvm/book3s_64_vio_hv.c
 +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
 @@ -75,3 +75,31 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long 
 liobn,
   return H_TOO_HARD;
  }
  EXPORT_SYMBOL_GPL(kvmppc_h_put_tce);
 +
 +long kvmppc_h_get_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 +   unsigned long ioba)
 +{
 + struct kvm *kvm = vcpu-kvm;
 + struct kvmppc_spapr_tce_table *stt;
 +
 + list_for_each_entry(stt, kvm-arch.spapr_tce_tables, list) {
 + if (stt-liobn == liobn) {
 + unsigned long idx = ioba  SPAPR_TCE_SHIFT;
 + struct page *page;
 + u64 *tbl;
 +
 + if (ioba = stt-window_size)
 + return H_PARAMETER;
 +
 + page = stt-pages[idx / TCES_PER_PAGE];
 + tbl = (u64 *)page_address(page);
 +
 + vcpu-arch.gpr[4] = tbl[idx % TCES_PER_PAGE];
 + return H_SUCCESS;
 + }
 + }
 +
 + /* Didn't find the liobn, punt it to userspace */
 + return H_TOO_HARD;
 +}
 +EXPORT_SYMBOL_GPL(kvmppc_h_get_tce);
 diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S 
 b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
 index e66d4ec..7d4fe2a 100644
 --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
 +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
 @@ -1758,7 +1758,7 @@ hcall_real_table:
   .long   0   /* 0x10 - H_CLEAR_MOD */
   .long   0   /* 0x14 - H_CLEAR_REF */
   .long   .kvmppc_h_protect - hcall_real_table
 - .long   0   /* 0x1c - H_GET_TCE */
 + .long   .kvmppc_h_get_tce - hcall_real_table
   .long   .kvmppc_h_put_tce - hcall_real_table
   .long   0   /* 0x24 - H_SET_SPRG0 */
   .long   .kvmppc_h_set_dabr - hcall_real_table


--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH 02/10] KVM: PPC: BOOK3S: PR: Emulate virtual timebase register

2014-01-29 Thread Benjamin Herrenschmidt

On Wed, 2014-01-29 at 17:39 +0100, Alexander Graf wrote:
 static inline mfvtb(unsigned long)
 {
 #ifdef CONFIG_PPC_BOOK3S_64
  return mfspr(SPRN_VTB);
 #else
  BUG();
 #endif
 }
 
 is a lot easier to read and get right. But reg.h is Ben's call.

Agreed.

 Also could you please give me a pointer to the specification for it? I 
 tried to look up vtb in the 2.06 ISA and couldn't find it. Is it a CPU 
 specific register?


--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] KVM: PPC: Use schedule instead of cond_resched

2013-12-10 Thread Benjamin Herrenschmidt

On Tue, 2013-12-10 at 15:40 +0100, Alexander Graf wrote:
 On 10.12.2013, at 15:21, Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com 
 wrote:
 
  From: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com
  
  We already checked need_resched. So we can call schedule directly
  
  Signed-off-by: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com
 
 The real fix for the issue you're seeing is
 
   https://lkml.org/lkml/2013/11/28/241
 

And is still not upstream Peter ?

Cheers,
Ben.


--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Error in frreing hugepages with preemption enabled

2013-12-03 Thread Benjamin Herrenschmidt

On Tue, 2013-12-03 at 23:21 +0100, Andrea Arcangeli wrote:
 #ifdef CONFIG_PPC_FSL_BOOK3E
 hugepd_free(tlb, hugepte);
 ^^

This is the culprit

(Alex, you didn't specify this was embedded or did I miss it ?)

 #else
 pgtable_free_tlb(tlb, hugepte, pdshift - shift);
 #endif
 }

That function does:

batchp = __get_cpu_var(hugepd_freelist_cur);

IE, it tries to use a per-CPU batch. Basically, it's duplicating the
logic in mm/memory.c for RCU freeing using a per-cpu freelist. I suppose
it assumes being called under something like the page table lock ?

This code also never flushes the batch, which is a concern...

Alex, this is Freescale stuff, can you followup with them ?

Cheers,
Ben.


--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 3/3] powerpc/kvm: remove redundant assignment

2013-11-07 Thread Benjamin Herrenschmidt

On Thu, 2013-11-07 at 09:14 +0100, Alexander Graf wrote:
  And ? An explanation isn't going to be clearer than the code in that
  case ...
 
 It's pretty non-obvious when you do a git show on that patch in 1 year
 from now, as the redundancy is out of scope of what the diff shows.

And ? How would an explanation help ?

Either it's redundant or it's not ... but only look at the code can
prove it. An explanation won't because if the patch is wrong, so will be
the explanation.

Cheers,
Ben.


--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] powerpc: kvm: optimize sc 0 as fast return

2013-11-07 Thread Benjamin Herrenschmidt

On Fri, 2013-11-08 at 04:10 +0100, Alexander Graf wrote:
 On 08.11.2013, at 03:44, Liu Ping Fan kernelf...@gmail.com wrote:
 
  syscall is a very common behavior inside guest, and this patch
  optimizes the path for the emulation of BOOK3S_INTERRUPT_SYSCALL,
  so hypervisor can return to guest without heavy exit, i.e, no need
  to swap TLB, HTAB,.. etc
 
 The syscall exit you touch here only happens when you do an sc  0
 with MSR_PR set inside the guest. The only case you realistically see
 this is when you run PR KVM inside of an HV KVM guest.
 
 I don't think we should optimize for that case. Instead, we should
 rather try to not bounce to the 1st hypervisor in the first place in
 that scenario :).

Well, so unfortunately openstack CI uses PR inside HV pretty
heavily  it *might* be worthwhile optimizing that path if the patch
is simple enough... I'd make that Paul's call.

Cheers,
Ben.

 
 Alex
 
  
  Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
  ---
  Compiled, but lack of bare metal, I have not tested it yet.
  ---
  arch/powerpc/kvm/book3s_hv.c|  6 --
  arch/powerpc/kvm/book3s_hv_rmhandlers.S | 13 -
  2 files changed, 12 insertions(+), 7 deletions(-)
  
  diff --git a/arch/powerpc/kvm/book3s_hv.c
 b/arch/powerpc/kvm/book3s_hv.c
  index 62a2b5a..73dc852 100644
  --- a/arch/powerpc/kvm/book3s_hv.c
  +++ b/arch/powerpc/kvm/book3s_hv.c
  @@ -628,12 +628,6 @@ static int kvmppc_handle_exit(struct kvm_run
 *run, struct kvm_vcpu *vcpu,
  /* hcall - punt to userspace */
  int i;
  
  -   if (vcpu-arch.shregs.msr  MSR_PR) {
  -   /* sc 1 from userspace - reflect to guest syscall */
  -   kvmppc_book3s_queue_irqprio(vcpu, 
  BOOK3S_INTERRUPT_SYSCALL);
  -   r = RESUME_GUEST;
  -   break;
  -   }
  run-papr_hcall.nr = kvmppc_get_gpr(vcpu, 3);
  for (i = 0; i  9; ++i)
  run-papr_hcall.args[i] = kvmppc_get_gpr(vcpu, 4 + i);
  diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
 b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
  index c71103b..9f626c3 100644
  --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
  +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
  @@ -1388,7 +1388,8 @@ kvmppc_hisi:
  hcall_try_real_mode:
  ld  r3,VCPU_GPR(R3)(r9)
  andi.   r0,r11,MSR_PR
  -   bne guest_exit_cont
  +   /* sc 1 from userspace - reflect to guest syscall */
  +   bne sc_0_fast_return
  clrrdi  r3,r3,2
  cmpldi  r3,hcall_real_table_end - hcall_real_table
  bge guest_exit_cont
  @@ -1409,6 +1410,16 @@ hcall_try_real_mode:
  ld  r11,VCPU_MSR(r4)
  b   fast_guest_return
  
  +sc_0_fast_return:
  +   ld  r10,VCPU_PC(r9)
  +   ld  r11,VCPU_MSR(r9)
  +   mtspr   SPRN_SRR0,r10
  +   mtspr   SPRN_SRR1,r11
  +   li  r10, BOOK3S_INTERRUPT_SYSCALL
  +   LOAD_REG_IMMEDIATE(r3,0x87a0)   /* zero 33:36,42:47 */
  +   and r11,r11,r3
  +   b   fast_guest_return
  +
  /* We've attempted a real mode hcall, but it's punted it back
   * to userspace.  We need to restore some clobbered volatiles
   * before resuming the pass-it-to-qemu path */
  -- 
  1.8.1.4
  


--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 3/3] powerpc/kvm: remove redundant assignment

2013-11-06 Thread Benjamin Herrenschmidt

On Thu, 2013-11-07 at 08:52 +0100, Alexander Graf wrote:
 Am 06.11.2013 um 20:58 schrieb Benjamin Herrenschmidt 
 b...@kernel.crashing.org:
 
  On Wed, 2013-11-06 at 12:24 +0100, Alexander Graf wrote:
  On 05.11.2013, at 08:42, Liu Ping Fan kernelf...@gmail.com wrote:
  
  Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
  
  Patch description missing.
  
  Do you really need a description for trivial one-lines whose subject
  is a perfectly complete description already ?
 
 Would I ask for it otherwise? It's also not 100% obvious that the assignment 
 is redundant.

And ? An explanation isn't going to be clearer than the code in that
case ...

Ben.


--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PULL 34/51] powerpc: move debug registers in a structure

2013-11-03 Thread Benjamin Herrenschmidt

On Sun, 2013-11-03 at 16:30 +0200, Gleb Natapov wrote:
 On Thu, Oct 31, 2013 at 10:18:19PM +0100, Alexander Graf wrote:
  From: Bharat Bhushan r65...@freescale.com
  
  This way we can use same data type struct with KVM and
  also help in using other debug related function.
  
  Signed-off-by: Bharat Bhushan bharat.bhus...@freescale.com
  Signed-off-by: Alexander Graf ag...@suse.de
 It would be nice to have PPC maintainers (CCed) ACKs here. This patch
 also has merging conflicts with cbc9565e (should be easy to resolve)

Is it easier if I put it in powerpc -next and resolve the conflicts ?

I was ok with the patch but yes, I should probably have put it in a
topic branch in the first place and merge it both in my tree and Alex.

Cheers,
Ben.

  ---
   arch/powerpc/include/asm/processor.h |  38 +
   arch/powerpc/include/asm/reg_booke.h |   8 +-
   arch/powerpc/kernel/asm-offsets.c|   2 +-
   arch/powerpc/kernel/process.c|  42 +-
   arch/powerpc/kernel/ptrace.c | 154 
  +--
   arch/powerpc/kernel/ptrace32.c   |   2 +-
   arch/powerpc/kernel/signal_32.c  |   6 +-
   arch/powerpc/kernel/traps.c  |  35 
   8 files changed, 147 insertions(+), 140 deletions(-)
  
  diff --git a/arch/powerpc/include/asm/processor.h 
  b/arch/powerpc/include/asm/processor.h
  index e378ccc..b438444 100644
  --- a/arch/powerpc/include/asm/processor.h
  +++ b/arch/powerpc/include/asm/processor.h
  @@ -147,22 +147,7 @@ typedef struct {
   #define TS_FPR(i) fpr[i][TS_FPROFFSET]
   #define TS_TRANS_FPR(i) transact_fpr[i][TS_FPROFFSET]
   
  -struct thread_struct {
  -   unsigned long   ksp;/* Kernel stack pointer */
  -   unsigned long   ksp_limit;  /* if ksp = ksp_limit stack overflow */
  -
  -#ifdef CONFIG_PPC64
  -   unsigned long   ksp_vsid;
  -#endif
  -   struct pt_regs  *regs;  /* Pointer to saved register state */
  -   mm_segment_tfs; /* for get_fs() validation */
  -#ifdef CONFIG_BOOKE
  -   /* BookE base exception scratch space; align on cacheline */
  -   unsigned long   normsave[8] cacheline_aligned;
  -#endif
  -#ifdef CONFIG_PPC32
  -   void*pgdir; /* root of page-table tree */
  -#endif
  +struct debug_reg {
   #ifdef CONFIG_PPC_ADV_DEBUG_REGS
  /*
   * The following help to manage the use of Debug Control Registers
  @@ -199,6 +184,27 @@ struct thread_struct {
  unsigned long   dvc2;
   #endif
   #endif
  +};
  +
  +struct thread_struct {
  +   unsigned long   ksp;/* Kernel stack pointer */
  +   unsigned long   ksp_limit;  /* if ksp = ksp_limit stack overflow */
  +
  +#ifdef CONFIG_PPC64
  +   unsigned long   ksp_vsid;
  +#endif
  +   struct pt_regs  *regs;  /* Pointer to saved register state */
  +   mm_segment_tfs; /* for get_fs() validation */
  +#ifdef CONFIG_BOOKE
  +   /* BookE base exception scratch space; align on cacheline */
  +   unsigned long   normsave[8] cacheline_aligned;
  +#endif
  +#ifdef CONFIG_PPC32
  +   void*pgdir; /* root of page-table tree */
  +#endif
  +   /* Debug Registers */
  +   struct debug_reg debug;
  +
  /* FP and VSX 0-31 register set */
  double  fpr[32][TS_FPRWIDTH] __attribute__((aligned(16)));
  struct {
  diff --git a/arch/powerpc/include/asm/reg_booke.h 
  b/arch/powerpc/include/asm/reg_booke.h
  index ed8f836..2e31aac 100644
  --- a/arch/powerpc/include/asm/reg_booke.h
  +++ b/arch/powerpc/include/asm/reg_booke.h
  @@ -381,7 +381,7 @@
   #define DBCR0_IA34T0x4000  /* Instr Addr 3-4 range Toggle 
  */
   #define DBCR0_FT   0x0001  /* Freeze Timers on debug event */
   
  -#define dbcr_iac_range(task)   ((task)-thread.dbcr0)
  +#define dbcr_iac_range(task)   ((task)-thread.debug.dbcr0)
   #define DBCR_IAC12IDBCR0_IA12  /* Range 
  Inclusive */
   #define DBCR_IAC12X(DBCR0_IA12 | DBCR0_IA12X)  /* Range 
  Exclusive */
   #define DBCR_IAC12MODE (DBCR0_IA12 | DBCR0_IA12X)  /* IAC 1-2 Mode 
  Bits */
  @@ -395,7 +395,7 @@
   #define DBCR1_DAC1W0x2000  /* DAC1 Write Debug Event */
   #define DBCR1_DAC2W0x1000  /* DAC2 Write Debug Event */
   
  -#define dbcr_dac(task) ((task)-thread.dbcr1)
  +#define dbcr_dac(task) ((task)-thread.debug.dbcr1)
   #define DBCR_DAC1R DBCR1_DAC1R
   #define DBCR_DAC1W DBCR1_DAC1W
   #define DBCR_DAC2R DBCR1_DAC2R
  @@ -441,7 +441,7 @@
   #define DBCR0_CRET 0x0020  /* Critical Return Debug Event */
   #define DBCR0_FT   0x0001  /* Freeze Timers on debug event */
   
  -#define dbcr_dac(task) ((task)-thread.dbcr0)
  +#define dbcr_dac(task) ((task)-thread.debug.dbcr0)
   #define DBCR_DAC1R DBCR0_DAC1R
   #define DBCR_DAC1W DBCR0_DAC1W
   #define DBCR_DAC2R DBCR0_DAC2R
  @@ -475,7 +475,7 @@
   #define DBCR1_IAC34MX  0x00C0  /* Instr Addr 3-4 range

Re: [PULL 34/51] powerpc: move debug registers in a structure

2013-11-03 Thread Benjamin Herrenschmidt

On Mon, 2013-11-04 at 07:43 +0100, Alexander Graf wrote:
 Yeah, it's what Ben and me agreed on after I waited forever to get a
 topic branch created. Oh well, I guess this time we just have to
 manually resolve the conflicts and do a better job at communicating
 next time.

That specific one was my fault. Too much stuff on my plate and I
completely forgot about the topic branch, anyway, it's ok now.

Cheers,
Ben.


--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 3/3] KVM: PPC: Book3S: Add support for hwrng found on some powernv systems

2013-10-03 Thread Benjamin Herrenschmidt

On Thu, 2013-10-03 at 08:43 +0300, Gleb Natapov wrote:
 Why it can be a bad idea? User can drain hwrng continuously making other
 users of it much slower, or even worse, making them fall back to another
 much less reliable, source of entropy.

Not in a very significant way, we generate entropy at 1Mhz after all, which
is pretty good.

Cheers,
Ben.


--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 3/3] KVM: PPC: Book3S: Add support for hwrng found on some powernv systems

2013-10-02 Thread Benjamin Herrenschmidt

On Wed, 2013-10-02 at 10:46 +0200, Paolo Bonzini wrote:

 
 Thanks.  Any chance you can give some numbers of a kernel hypercall and
 a userspace hypercall on Power, so we have actual data?  For example a
 hypercall that returns H_PARAMETER as soon as possible.

I don't have (yet) numbers at hand but we have basically 3 places where
we can handle hypercalls:

 - Kernel real mode. This is where most of our MMU stuff goes for
example unless it needs to trigger a page fault in Linux. This is
executed with translation disabled and the MMU still in guest context.
This is the fastest path since we don't take out the other threads nor
perform any expensive context change. This is where we put the
accelerated H_RANDOM as well.

 - Kernel virtual mode. That's a full exit, so all threads are out and
MMU switched back to host Linux. Things like vhost MMIO emulation goes
there, page faults, etc...

 - Qemu. This adds the round trip to userspace on top of the above.

Cheers,
Ben.


--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 3/3] KVM: PPC: Book3S: Add support for hwrng found on some powernv systems

2013-10-02 Thread Benjamin Herrenschmidt

On Wed, 2013-10-02 at 11:11 +0200, Alexander Graf wrote:

 Right, and the difference for the patch in question is really whether
 we handle in in kernel virtual mode or in QEMU, so the bulk of the
 overhead (kicking threads out of  guest context, switching MMU
 context, etc) happens either way.
 
 So the additional overhead when handling it in QEMU here really boils
 down to the user space roundtrip (plus another random number read
 roundtrip).

Hrm, Michael had a real mode version ... not sure where it went then.

Cheers,
Ben.


--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 3/3] KVM: PPC: Book3S: Add support for hwrng found on some powernv systems

2013-10-02 Thread Benjamin Herrenschmidt

On Wed, 2013-10-02 at 13:02 +0300, Gleb Natapov wrote:

 Yes, I alluded to it in my email to Paul and Paolo asked also. How this
 interface is disabled? Also hwrnd is MMIO in a host why guest needs to
 use hypercall instead of emulating the device (in kernel or somewhere
 else?).

Migration will have to be dealt with one way or another, I suppose we
will indeed need a qemu fallback.

As for why hypercall instead of MMIO, well, you'd have to ask the folks
who wrote the PAPR spec :-) It's specified as a hypercall and
implemented as such in pHyp (PowerVM). The existing guests expect it
that way.

It might have to do with the required whitening done by the hypervisor
(H_RANDOM output is supposed to be clean). It also abstracts us from the
underlying HW implementation which could in theory change.
 
  Another things is that on a host hwrnd is protected from
 direct userspace access by virtue of been a device, but guest code (event
 kernel mode) is userspace as far as hosts security model goes, so by
 implementing this hypercall in a way that directly access hwrnd you
 expose hwrnd to a userspace unconditionally. Why is this a good idea? 

Why would this be a bad idea ?

Ben.

 --
   Gleb.
 --
 To unsubscribe from this list: send the line unsubscribe kvm-ppc in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html


--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 3/3] KVM: PPC: Book3S: Add support for hwrng found on some powernv systems

2013-10-02 Thread Benjamin Herrenschmidt

On Wed, 2013-10-02 at 13:02 +0300, Gleb Natapov wrote:

 Yes, I alluded to it in my email to Paul and Paolo asked also. How this
 interface is disabled? Also hwrnd is MMIO in a host why guest needs to
 use hypercall instead of emulating the device (in kernel or somewhere
 else?). Another things is that on a host hwrnd is protected from
 direct userspace access by virtue of been a device, but guest code (event
 kernel mode) is userspace as far as hosts security model goes, so by
 implementing this hypercall in a way that directly access hwrnd you
 expose hwrnd to a userspace unconditionally. Why is this a good idea? 

BTW. Is this always going to be like this ?

Every *single* architectural or design decision we make for our
architecture has to be justified 30 times over, every piece of code bike
shedded to oblivion for month, etc... ?

Do we always have to finally get to some kind of agreement on design, go
to the 6 month bike-shedding phase, just to have somebody else come up
and start re-questioning the whole original design (without any
understanding of our specific constraints of course) ?

You guys are the most horrendous community I have ever got to work with.
It's simply impossible to get anything done in any reasonable time
frame .

At this stage, it would have taken us an order of magnitude less time to
simply rewrite an entire hypervisor from scratch.

This is sad.

Ben.



--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 3/3] KVM: PPC: Book3S: Add support for hwrng found on some powernv systems

2013-10-02 Thread Benjamin Herrenschmidt

On Wed, 2013-10-02 at 17:10 +0300, Gleb Natapov wrote:
  The hwrng is accessible by host userspace via /dev/mem.
  
 Regular user has no access to /dev/mem, but he can start kvm guest and
 gain access to the device.

Seriously. You guys are really trying hard to make our life hell or
what ? That discussion about access permissions makes no sense
whatsoever. Please stop.

Ben.


--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 3/3] KVM: PPC: Book3S: Add support for hwrng found on some powernv systems

2013-10-02 Thread Benjamin Herrenschmidt

On Wed, 2013-10-02 at 17:37 +0300, Gleb Natapov wrote:
 On Wed, Oct 02, 2013 at 04:33:18PM +0200, Paolo Bonzini wrote:
  Il 02/10/2013 16:08, Alexander Graf ha scritto:
The hwrng is accessible by host userspace via /dev/mem.
   
   A guest should live on the same permission level as a user space
   application. If you run QEMU as UID 1000 without access to /dev/mem, why
   should the guest suddenly be able to directly access a memory location
   (MMIO) it couldn't access directly through a normal user space interface.
   
   It's basically a layering violation.
  
  With Michael's earlier patch in this series, the hwrng is accessible by
  host userspace via /dev/hwrng, no?
  
 Access to which can be controlled by its permission. Permission of
 /dev/kvm may be different. If we route hypercall via userspace and
 configure qemu to get entropy from /dev/hwrng everything will fall
 nicely together (except performance).

Yes, except abysmall performance and a lot more code for something
completely and utterly pointless  nice.

Ben.


--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 3/3] KVM: PPC: Book3S: Add support for hwrng found on some powernv systems

2013-10-01 Thread Benjamin Herrenschmidt

On Tue, 2013-10-01 at 11:39 +0300, Gleb Natapov wrote:
 On Tue, Oct 01, 2013 at 06:34:26PM +1000, Michael Ellerman wrote:
  On Thu, Sep 26, 2013 at 11:06:59AM +0200, Paolo Bonzini wrote:
   Il 26/09/2013 08:31, Michael Ellerman ha scritto:
Some powernv systems include a hwrng. Guests can access it via the
H_RANDOM hcall.
   
   Is there any reason to do this in the kernel?  
  
  It's less code, and it's faster :)
  
   It does not have to be a particularly fast path;
  
  Sure, but do we have to make it slow on purpose?
  
 We do not put non performance critical devices into the kernel.

So for the sake of that dogma you are going to make us do something that
is about 100 times slower ? (and possibly involves more lines of code)

It's not just speed ... H_RANDOM is going to be called by the guest
kernel. A round trip to qemu is going to introduce a kernel jitter
(complete stop of operations of the kernel on that virtual processor) of
a full exit + round trip to qemu + back to the kernel to get to some
source of random number ...  this is going to be in the dozens of ns at
least.

This makes no sense.

Ben.


--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 3/3] KVM: PPC: Book3S: Add support for hwrng found on some powernv systems

2013-10-01 Thread Benjamin Herrenschmidt

On Tue, 2013-10-01 at 13:19 +0200, Paolo Bonzini wrote:
 Il 01/10/2013 11:38, Benjamin Herrenschmidt ha scritto:
  So for the sake of that dogma you are going to make us do something that
  is about 100 times slower ? (and possibly involves more lines of code)
 
 If it's 100 times slower there is something else that's wrong.  It's
 most likely not 100 times slower, and this makes me wonder if you or
 Michael actually timed the code at all.

We haven't but it's pretty obvious:

 - The KVM real mode implementation: guest issues the hcall, we remain
in real mode, within the MMU context of the guest, all secondary threads
on the core are still running in the guest, and we do an MMIO  return.

 - The qemu variant: guest issues the hcall we need to exit the guest,
which means bring *all* threads on the core out of KVM, switch the full
MMU context back to host (which among others involves flushing the ERAT,
aka level 1 TLB), while sending the secondary threads into idle loops.
Then we return to qemu user context, which will then use /dev/random -
back into the kernel and out, at which point we can return to the guest,
so back into the kernel, back into run which means IPI the secondary
threads on the core, switch the MMU context again until we can finally
go back to executing guest instructions.

So no we haven't measured. But it is going to be VERY VERY VERY much
slower. Our exit latencies are bad with our current MMU *and* any exit
is going to cause all secondary threads on the core to have to exit as
well (remember P7 is 4 threads, P8 is 8)

  It's not just speed ... H_RANDOM is going to be called by the guest
  kernel. A round trip to qemu is going to introduce a kernel jitter
  (complete stop of operations of the kernel on that virtual processor) of
  a full exit + round trip to qemu + back to the kernel to get to some
  source of random number ...  this is going to be in the dozens of ns at
  least.
 
 I guess you mean dozens of *micro*seconds, which is somewhat exaggerated
 but not too much.  On x86 some reasonable timings are:

Yes.

   100 cyclesbare metal rdrand
   2000 cycles   guest-hypervisor-guest
   15000 cycles  guest-userspace-guest
 
 (100 cycles = 40 ns = 200 MB/sec; 2000 cycles = ~1 microseconds; 15000
 cycles = ~7.5 microseconds).  Even on 5 year old hardware, a userspace
 roundtrip is around a dozen microseconds.

So in your case going to qemu to emulate rdrand would indeed be 150
times slower, I don't see in what universe that would be considered a
good idea.

 Anyhow, I would like to know more about this hwrng and hypercall.
 
 Does the hwrng return random numbers (like rdrand) or real entropy (like
 rdseed that Intel will add in Broadwell)?

It's a random number obtained from sampling a set of oscillators. It's
slightly biased but we have very simple code (I believe shared with the
host kernel implementation) for whitening it as is required by PAPR.
 
   What about the hypercall?
 For example virtio-rng is specified to return actual entropy, it doesn't
 matter if it is from hardware or software.
 
 In either case, the patches have problems.
 
 1) If the hwrng returns random numbers, the whitening you're doing is
 totally insufficient and patch 2 is forging entropy that doesn't exist.

I will let Paul to comment on the whitening, it passes all the tests
we've been running it through.

 2) If the hwrng returns entropy, a read from the hwrng is going to even
 more expensive than an x86 rdrand (perhaps ~2000 cycles).

Depends how often you read, the HW I think is sampling asynchronously so
you only block on the MMIO if you already consumed the previous sample
but I'll let Paulus provide more details here.

   Hence, doing
 the emulation in the kernel is even less necessary.  Also, if the hwrng
 returns entropy patch 1 is unnecessary: you do not need to waste
 precious entropy bits by passing them to arch_get_random_long; just run
 rngd in the host as that will put the entropy to much better use.

 3) If the hypercall returns random numbers, then it is a pretty
 braindead interface since returning 8 bytes at a time limits the
 throughput to a handful of MB/s (compare to 200 MB/sec for x86 rdrand).
  But more important: in this case drivers/char/hw_random/pseries-rng.c
 is completely broken and insecure, just like patch 2 in case (1) above.

How so ?

 4) If the hypercall returns entropy (same as virtio-rng), the same
 considerations on speed apply.  If you can only produce entropy at say 1
 MB/s (so reading 8 bytes take 8 microseconds---which is actually very
 fast), it doesn't matter that much to spend 7 microseconds on a
 userspace roundtrip.  It's going to be only half the speed of bare
 metal, not 100 times slower.
 
 
 Also, you will need _anyway_ extra code that is not present here to
 either disable the rng based on userspace command-line, or to emulate
 the rng from userspace.  It is absolutely _not_ acceptable to have a
 hypercall disappear across

Re: [PATCH 1/3] powerpc: Implement arch_get_random_long/int() for powernv

2013-09-26 Thread Benjamin Herrenschmidt

On Thu, 2013-09-26 at 16:31 +1000, Michael Ellerman wrote:
 +   pr_info_once(registering arch random hook\n);

Either pr_debug or make it nicer looking :-)

Cheers,
Ben.


--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 2/3] hwrng: Add a driver for the hwrng found in power7+ systems

2013-09-26 Thread Benjamin Herrenschmidt

On Thu, 2013-09-26 at 16:31 +1000, Michael Ellerman wrote:

 +   pr_info(registered powernv hwrng.\n);

First letter of a line should get a capital :-) Also since
it's per-device, at least indicate the OF path or the chip number or
something ...

Cheers,
Ben.



--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 05/11] KVM: PPC: Book3S HV: Add support for guest Program Priority Register

2013-09-16 Thread Benjamin Herrenschmidt

On Fri, 2013-09-06 at 13:22 +1000, Paul Mackerras wrote:
 POWER7 and later IBM server processors have a register called the
 Program Priority Register (PPR), which controls the priority of
 each hardware CPU SMT thread, and affects how fast it runs compared
 to other SMT threads.  This priority can be controlled by writing to
 the PPR or by use of a set of instructions of the form or rN,rN,rN
 which are otherwise no-ops but have been defined to set the priority
 to particular levels.
 
 This adds code to context switch the PPR when entering and exiting
 guests and to make the PPR value accessible through the SET/GET_ONE_REG
 interface.  When entering the guest, we set the PPR as late as
 possible, because if we are setting a low thread priority it will
 make the code run slowly from that point on.  Similarly, the
 first-level interrupt handlers save the PPR value in the PACA very
 early on, and set the thread priority to the medium level, so that
 the interrupt handling code runs at a reasonable speed.
 
 Signed-off-by: Paul Mackerras pau...@samba.org

Acked-by: Benjamin Herrenschmidt b...@kernel.crashing.org

Alex, can you take this via your tree ?

Cheers,
Ben.

 ---
  Documentation/virtual/kvm/api.txt |  1 +
  arch/powerpc/include/asm/exception-64s.h  |  8 
  arch/powerpc/include/asm/kvm_book3s_asm.h |  1 +
  arch/powerpc/include/asm/kvm_host.h   |  1 +
  arch/powerpc/include/uapi/asm/kvm.h   |  1 +
  arch/powerpc/kernel/asm-offsets.c |  2 ++
  arch/powerpc/kvm/book3s_hv.c  |  6 ++
  arch/powerpc/kvm/book3s_hv_rmhandlers.S   | 12 +++-
  8 files changed, 31 insertions(+), 1 deletion(-)
 
 diff --git a/Documentation/virtual/kvm/api.txt 
 b/Documentation/virtual/kvm/api.txt
 index 1030ac9..34a32b6 100644
 --- a/Documentation/virtual/kvm/api.txt
 +++ b/Documentation/virtual/kvm/api.txt
 @@ -1836,6 +1836,7 @@ registers, find a list below:
PPC   | KVM_REG_PPC_ACOP   | 64
PPC   | KVM_REG_PPC_VRSAVE | 32
PPC   | KVM_REG_PPC_LPCR   | 64
 +  PPC   | KVM_REG_PPC_PPR| 64
PPC   | KVM_REG_PPC_TM_GPR0| 64
...
PPC   | KVM_REG_PPC_TM_GPR31   | 64
 diff --git a/arch/powerpc/include/asm/exception-64s.h 
 b/arch/powerpc/include/asm/exception-64s.h
 index 07ca627..b86c4db 100644
 --- a/arch/powerpc/include/asm/exception-64s.h
 +++ b/arch/powerpc/include/asm/exception-64s.h
 @@ -203,6 +203,10 @@ do_kvm_##n:  
 \
   ld  r10,area+EX_CFAR(r13);  \
   std r10,HSTATE_CFAR(r13);   \
   END_FTR_SECTION_NESTED(CPU_FTR_CFAR,CPU_FTR_CFAR,947);  \
 + BEGIN_FTR_SECTION_NESTED(948)   \
 + ld  r10,area+EX_PPR(r13);   \
 + std r10,HSTATE_PPR(r13);\
 + END_FTR_SECTION_NESTED(CPU_FTR_HAS_PPR,CPU_FTR_HAS_PPR,948);\
   ld  r10,area+EX_R10(r13);   \
   stw r9,HSTATE_SCRATCH1(r13);\
   ld  r9,area+EX_R9(r13); \
 @@ -216,6 +220,10 @@ do_kvm_##n:  
 \
   ld  r10,area+EX_R10(r13);   \
   beq 89f;\
   stw r9,HSTATE_SCRATCH1(r13);\
 + BEGIN_FTR_SECTION_NESTED(948)   \
 + ld  r9,area+EX_PPR(r13);\
 + std r9,HSTATE_PPR(r13); \
 + END_FTR_SECTION_NESTED(CPU_FTR_HAS_PPR,CPU_FTR_HAS_PPR,948);\
   ld  r9,area+EX_R9(r13); \
   std r12,HSTATE_SCRATCH0(r13);   \
   li  r12,n;  \
 diff --git a/arch/powerpc/include/asm/kvm_book3s_asm.h 
 b/arch/powerpc/include/asm/kvm_book3s_asm.h
 index 9039d3c..22f4606 100644
 --- a/arch/powerpc/include/asm/kvm_book3s_asm.h
 +++ b/arch/powerpc/include/asm/kvm_book3s_asm.h
 @@ -101,6 +101,7 @@ struct kvmppc_host_state {
  #endif
  #ifdef CONFIG_PPC_BOOK3S_64
   u64 cfar;
 + u64 ppr;
  #endif
  };
  
 diff --git a/arch/powerpc/include/asm/kvm_host.h 
 b/arch/powerpc/include/asm/kvm_host.h
 index 9741bf0..b0dcd18 100644
 --- a/arch/powerpc/include/asm/kvm_host.h
 +++ b/arch/powerpc/include/asm/kvm_host.h
 @@ -464,6 +464,7 @@ struct kvm_vcpu_arch {
   u32 ctrl;
   ulong dabr;
   ulong cfar;
 + ulong ppr;
  #endif
   u32 vrsave; /* also USPRG0 */
   u32 mmucr;
 diff --git a/arch/powerpc/include/uapi/asm/kvm.h 
 b/arch/powerpc/include/uapi/asm/kvm.h
 index e42127d..fab6bc1 100644
 --- a/arch/powerpc/include/uapi/asm/kvm.h
 +++ b/arch/powerpc/include/uapi/asm

Re: [PATCH 19/23] KVM: PPC: Book3S: Select PR vs HV separately for each guest

2013-09-12 Thread Benjamin Herrenschmidt

On Fri, 2013-09-13 at 10:17 +1000, Paul Mackerras wrote:

 Aneesh and I are currently investigating an alternative approach,
 which is much more like the x86 way of doing things.  We are looking
 at splitting the code into three modules: a kvm_pr.ko module with the
 PR-specific bits, a kvm_hv.ko module with the HV-specific bits, and a
 core kvm.ko module with the generic bits (basically book3s.c,
 powerpc.c, stuff from virt/kvm/, plus the stuff that both PR and HV
 use).  Basically the core module would have a pointer to a struct
 full of function pointers for the various ops that book3s_pr.c and
 book3s_hv.c both provide.  You would only be able to have one of
 kvm_pr and kvm_hv loaded at any one time.  If they were built in, you
 could have them both built in but only one could register its function
 pointer struct with the core.  Obviously the kvm_hv module would only
 load and register its struct on a machine that had hypervisor mode
 available.  If they were both built in I would think we would give HV
 the first chance to register itself, and let PR register if we can't
 do HV.
 
 How does that sound?

As long as we can force-load the PR one on a machine that normally runs
HV for the sake of testing ...

Also, all those KVM modules ... they don't auto-load do they ?

Cheers,
Ben.


--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v9 12/13] KVM: PPC: Add support for IOMMU in-kernel handling

2013-09-04 Thread Benjamin Herrenschmidt

On Tue, 2013-09-03 at 13:53 +0300, Gleb Natapov wrote:
  Or supporting all IOMMU links (and leaving emulated stuff as is) in on
  device is the last thing I have to do and then you'll ack the patch?
  
 I am concerned more about API here. Internal implementation details I
 leave to powerpc experts :)

So Gleb, I want to step in for a bit here.

While I understand that the new KVM device API is all nice and shiny and that 
this
whole thing should probably have been KVM devices in the first place (had they
existed or had we been told back then), the point is, the API for handling
HW IOMMUs that Alexey is trying to add is an extension of an existing mechanism
used for emulated IOMMUs.

The internal data structure is shared, and fundamentally, by forcing him to
use that new KVM device for the new stuff, we create a oddball API with
an ioctl for one type of iommu and a KVM device for the other, which makes
the implementation a complete mess in the kernel (and you should care :-)

So for something completely new, I would tend to agree with you. However, I
still think that for this specific case, we should just plonk-in the original
ioctl proposed by Alexey and be done with it.

Cheers,
Ben.


--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 02/10] KVM: PPC: reserve a capability number for multitce support

2013-08-27 Thread Benjamin Herrenschmidt

On Tue, 2013-08-27 at 09:40 +0300, Gleb Natapov wrote:
  Thanks. Since it's not in a topic branch that I can pull, I'm going to
  just cherry-pick them. However, they are in your queue branch, not
  next branch. Should I still assume this is a stable branch and that
  the numbers aren't going to change ?
  
 Queue will become next after I will test it and if test will fail the
 commit hash may change, but since you are going to cherry-pick and this
 does not preserve commit hash it does not matter.

Right, as long as the actual cap and ioctl numbers remain stable :-)

Cheers,
Ben.


--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 02/10] KVM: PPC: reserve a capability number for multitce support

2013-08-27 Thread Benjamin Herrenschmidt

On Tue, 2013-08-27 at 09:41 +0300, Gleb Natapov wrote:
  Oh and Alexey mentions that there are two capabilities and you only
  applied one :-)
  
 Another one is:
  [PATCH v8] KVM: PPC: reserve a capability and ioctl numbers for
 realmode VFIO
 ?

Yes, thanks !

Cheers,
Ben.


--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 02/10] KVM: PPC: reserve a capability number for multitce support

2013-08-26 Thread Benjamin Herrenschmidt

On Mon, 2013-08-26 at 15:37 +0300, Gleb Natapov wrote:
  Gleb, any chance you can put this (and the next one) into a tree to
  lock in the numbers ?
  
 Applied it. Sorry for slow response, was on vocation and still go
 through the email backlog.

Thanks. Since it's not in a topic branch that I can pull, I'm going to
just cherry-pick them. However, they are in your queue branch, not
next branch. Should I still assume this is a stable branch and that
the numbers aren't going to change ?

Cheers,
Ben.


--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] powerpc/kvm: Handle the boundary condition correctly

2013-08-22 Thread Benjamin Herrenschmidt

On Fri, 2013-08-23 at 09:01 +0530, Aneesh Kumar K.V wrote:
 Alexander Graf ag...@suse.de writes:
 
  On 22.08.2013, at 12:37, Aneesh Kumar K.V wrote:
 
  From: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com
 
  Isn't this you?
 
 Yes. The patches are generated using git format-patch and sent by
 git send-email. That's how it always created patches for me. I am not sure if
 there is a config I can change to avoid having From:

Don't bother, that's perfectly fine, and git am will do the right thing.

Cheers,
Ben.

 
  
  We should be able to copy upto count bytes
 
  Why?
 
 
 Without this we end up doing
 
  +struct kvm_get_htab_buf {
  +struct kvm_get_htab_header header;
  +/*
  + * Older kernel required one extra byte.
  + */
  +unsigned long hpte[3];
  +} hpte_buf;
 
 
 even though we are only looking for one hpte entry.
 
 http://mid.gmane.org/1376995766-16526-4-git-send-email-aneesh.ku...@linux.vnet.ibm.com
 
 
  Alex
 
  
  Signed-off-by: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com
  ---
  arch/powerpc/kvm/book3s_64_mmu_hv.c | 4 ++--
  1 file changed, 2 insertions(+), 2 deletions(-)
  
  diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c 
  b/arch/powerpc/kvm/book3s_64_mmu_hv.c
  index 710d313..0ae6bb6 100644
  --- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
  +++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
  @@ -1362,7 +1362,7 @@ static ssize_t kvm_htab_read(struct file *file, char 
  __user *buf,
 lbuf = (unsigned long __user *)buf;
  
 nb = 0;
  -  while (nb + sizeof(hdr) + HPTE_SIZE  count) {
  +  while (nb + sizeof(hdr) + HPTE_SIZE = count) {
 /* Initialize header */
 hptr = (struct kvm_get_htab_header __user *)buf;
 hdr.n_valid = 0;
  @@ -1385,7 +1385,7 @@ static ssize_t kvm_htab_read(struct file *file, char 
  __user *buf,
 /* Grab a series of valid entries */
 while (i  kvm-arch.hpt_npte 
hdr.n_valid  0x 
  - nb + HPTE_SIZE  count 
  + nb + HPTE_SIZE = count 
record_hpte(flags, hptp, hpte, revp, 1, first_pass)) {
 /* valid entry, write it out */
 ++hdr.n_valid;
  -- 
  1.8.1.2
  


--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [v2][PATCH 1/1] KVM: PPC: Book3E HV: call SOFT_DISABLE_INTS to sync the software state

2013-08-14 Thread Benjamin Herrenschmidt

On Wed, 2013-08-14 at 14:34 -0500, Scott Wood wrote:
 On Wed, 2013-08-14 at 13:56 +0200, Alexander Graf wrote:
  On 07.08.2013, at 04:05, Tiejun Chen wrote:
  
   We enter with interrupts disabled in hardware, but we need to
   call SOFT_DISABLE_INTS anyway to ensure that the software state
   is kept in sync instead of calling hard_irq_disable() directly.
   
   Signed-off-by: Tiejun Chen tiejun.c...@windriver.com
  
  Scott, Ben, could you please ack?
 
 ACK, but it needs to be updated due to this patch that renames
 SOFT_DISABLE_INTS:
 
 http://patchwork.ozlabs.org/patch/259337/

Note that the latter was applied to powerpc -next already.

Cheers,
Ben.


--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 5/6 v2] kvm: powerpc: booke: Add linux pte lookup like booke3s

2013-08-02 Thread Benjamin Herrenschmidt

On Fri, 2013-08-02 at 17:58 -0500, Scott Wood wrote:
 
 What about 64-bit PTEs on 32-bit kernels?
 
 In any case, this code does not belong in KVM.  It should be in the
 main
 PPC mm code, even if KVM is the only user.

Also don't we do similar things in BookS KVM ? At the very least that
sutff should become common. And yes, I agree, it should probably also
move to pgtable*

Cheers,
Ben.


--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 5/6 v2] kvm: powerpc: booke: Add linux pte lookup like booke3s

2013-08-02 Thread Benjamin Herrenschmidt

On Sat, 2013-08-03 at 02:58 +, Bhushan Bharat-R65777 wrote:
 One of the problem I saw was that if I put this code in
 asm/pgtable-32.h and asm/pgtable-64.h then pte_persent() and other
 friend function (on which this code depends) are defined in pgtable.h.
 And pgtable.h includes asm/pgtable-32.h and asm/pgtable-64.h before it
 defines pte_present() and friends functions.
 
 Ok I move wove this in asm/pgtable*.h, initially I fought with myself
 to take this code in pgtable* but finally end up doing here (got
 biased by book3s :)).

Is there a reason why these routines can not be completely generic
in pgtable.h ?

Ben.


--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 6/6 v2] kvm: powerpc: use caching attributes as per linux pte

2013-08-02 Thread Benjamin Herrenschmidt

On Sat, 2013-08-03 at 03:11 +, Bhushan Bharat-R65777 wrote:
 
  
  Could you explain why we need to set dirty/referenced on the PTE, when we 
  didn't
  need to do that before? All we're getting from the PTE is wimg.
  We have MMU notifiers to take care of the page being unmapped, and we've 
  already
  marked the page itself as dirty if the TLB entry is writeable.
 
 I pulled this code from book3s.
 
 Ben, can you describe why we need this on book3s ?

If you let the guest write to the page you must set the dirty bit on the PTE
(or the struct page, at least one of them), similar with accessed on any access.

If you don't, the VM might swap the page out without writing it back to disk
for example, assuming it contains no modified data.

Cheers,
Ben.


--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 4/4] kvm: powerpc: set cache coherency only for RAM pages

2013-07-26 Thread Benjamin Herrenschmidt

On Fri, 2013-07-26 at 15:03 +, Bhushan Bharat-R65777 wrote:
 Will not searching the Linux PTE is a overkill?

That's the best approach. Also we are searching it already to resolve
the page fault. That does mean we search twice but on the other hand
that also means it's hot in the cache.

Cheers,
Ben


--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 04/10] powerpc: Prepare to support kernel handling of IOMMU map/unmap

2013-07-24 Thread Benjamin Herrenschmidt

On Wed, 2013-07-24 at 15:43 -0700, Andrew Morton wrote:
 For what?  The three lines of comment in page-flags.h?   ack :)
 
 Manipulating page-_count directly is considered poor form.  Don't
 blame us if we break your code ;)
 
 Actually, the manipulation in realmode_get_page() duplicates the
 existing get_page_unless_zero() and the one in realmode_put_page()
 could perhaps be placed in mm.h with a suitable name and some
 documentation.  That would improve your form and might protect the code
 from getting broken later on.

Yes, this stuff makes me really nervous :-) If it didn't provide an order
of magnitude performance improvement in KVM I would avoid it but heh...

Alexey, I like having that stuff in generic code.

However the meaning of the words real mode can be ambiguous accross
architectures, it might be best to then name it mmu_off_put_page to
make things a bit clearer, along with a comment explaining that this is
called in a context where none of the virtual mappings are accessible
(vmalloc, vmemmap, IOs, ...), and that in the case of sparsemem vmemmap
the caller must have taken care of getting the physical address of the
struct page and of ensuring it isn't split accross two vmemmap blocks.

Cheers,
Ben.


--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [v1][PATCH 1/1] KVM: PPC: disable preemption when using hard_irq_disable()

2013-07-14 Thread Benjamin Herrenschmidt

On Mon, 2013-07-15 at 10:20 +0800, tiejun.chen wrote:
 What about SOFT_IRQ_DISABLE? This is close to name
 hard_irq_disable() :) And 
 then remove all DISABLE_INTS as well?

Or RECONCILE_IRQ_STATE...

Ben.


--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [v1][PATCH 1/1] KVM: PPC: disable preemption when using hard_irq_disable()

2013-07-12 Thread Benjamin Herrenschmidt

On Fri, 2013-07-12 at 12:50 -0500, Scott Wood wrote:
 
 [1] SOFT_DISABLE_INTS seems an odd name for something that updates the  
 software state to be consistent with interrupts being *hard* disabled.   
 I can sort of see the logic in it, but it's confusing when first  
 encountered.  From the name it looks like all it would do is set  
 soft_enabled to 1.

It's indeed odd. Also worse when we use DISABLE_INTS which is just a
macro on top of SOFT_DISABLE_INTS :-)

I've been wanting to change the macro name for a while now and never
got to it. Patch welcome :-)

Cheers,
Ben.


--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [v1][PATCH 1/1] KVM: PPC: disable preemption when using hard_irq_disable()

2013-07-11 Thread Benjamin Herrenschmidt

On Thu, 2013-07-11 at 11:49 +0200, Alexander Graf wrote:
 Ben, is soft_enabled == 0; hard_enabled == 1 a valid combination that
 may ever occur?

Yes of course, that's what we call soft disabled :-) It's even the
whole point of doing lazy disable...

Ben.


--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 8/8] KVM: PPC: Add hugepage support for IOMMU in-kernel handling

2013-07-11 Thread Benjamin Herrenschmidt

On Thu, 2013-07-11 at 11:52 +0200, Alexander Graf wrote:
  Where exactly (it is rather SPAPR_TCE_IOMMU but does not really
 matter)?
  Select it on KVM_BOOK3S_64? CONFIG_KVM_BOOK3S_64_HV?
  CONFIG_KVM_BOOK3S_64_PR? PPC_BOOK3S_64?
 
 I'd say the most logical choice would be to check the Makefile and see
 when it gets compiled. For those cases we want it enabled.

What *what* gets compiled ? You know our Makefile, it's crap :-)

We enable built-in things when CONFIG_KVM=m (which means you cannot take
a kernel build with CONFIG_KVM not set, enable CONFIG_KVM=m, and just
build the module, it won't work).

We could use KVM_BOOK3S_64 maybe ?

  I am trying to imagine a configuration where we really do not want
  IOMMU_API. Ben mentioned PPC32 and embedded PPC64 and that's it so
 any of
  BOOK3S (KVM_BOOK3S_64 is the best) should be fine, no?
 
 book3s_32 doesn't want this, but any book3s_64 implementation could
 potentially use it, yes. That's pretty much what the Makefile tells
 you too :).

Not really no. But that would do. You could have give a more useful
answer in the first place though rather than stringing him along.

Cheers,
Ben.


--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 6/8] KVM: PPC: Add support for multiple-TCE hcalls

2013-07-11 Thread Benjamin Herrenschmidt

On Thu, 2013-07-11 at 13:15 +0200, Alexander Graf wrote:
 There are 2 ways of dealing with this:
 
   1) Call the ENABLE_CAP on every vcpu. That way one CPU may handle
 this hypercall in the kernel while another one may not. The same as we
 handle PAPR today.
 
   2) Create a new ENABLE_CAP for the vm.
 
 I think in this case option 1 is fine - it's how we handle everything
 else already.

So, you are now asking him to chose between a gross horror or adding a
new piece of infrastructure for something that is entirely pointless to
begin with ?

Come on, give him a break. That stuff is fine as it is.

Ben.


--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [v1][PATCH 1/1] KVM: PPC: disable preemption when using hard_irq_disable()

2013-07-11 Thread Benjamin Herrenschmidt

On Thu, 2013-07-11 at 14:47 +0200, Alexander Graf wrote:
  Yes of course, that's what we call soft disabled :-) It's even the
  whole point of doing lazy disable...
 
 Meh. Of course it's soft_enabled = 1; hard_enabled = 0.

That doesn't happen in normal C code. It happens under very specific
circumstances, such as in the guts of entry_64.S, in areas where we just
want to temporarily mask HW interrupts without changing the SW state
(and thus without having to deal with replays etc...).

We typically also do that right before going to idle on some processors
where we come back from idle with interrupts hard enabled, possibly
right in an interrupt vector.

Typically that's a state that makes some sense on KVM entry. From a
Linux perspective, you enter KVM with interrupts enabled. But you
temporarily hard disable on the way down while doing the low level
context switch.

This works well as long as you know you have no pending replay event.
That should be fine if you do a direct transition from soft+hard enabled
to hard disabled (without soft disabling). In that case there should be
nothing in irq_happened.

It's equivalent to returning to userspace from the kernel. We are
soft-enabled, but the code in ret_from_except hard disables while
mucking around with TIF_FLAGS etc... until the final rfid

Cheers,
Ben.


--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 8/8] KVM: PPC: Add hugepage support for IOMMU in-kernel handling

2013-07-11 Thread Benjamin Herrenschmidt

On Thu, 2013-07-11 at 14:50 +0200, Alexander Graf wrote:
  Not really no. But that would do. You could have give a more useful
  answer in the first place though rather than stringing him along.
 
 Sorry, I figured it was obvious.

It wasn't no, because of the mess with modules and the nasty Makefile we
have in there. Even I had to scratch my head for a bit :-)

Ben.


--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 6/8] KVM: PPC: Add support for multiple-TCE hcalls

2013-07-11 Thread Benjamin Herrenschmidt

On Thu, 2013-07-11 at 14:51 +0200, Alexander Graf wrote:
 I don't like bloat usually. But Alexey even had an #ifdef DEBUG in
 there to selectively disable in-kernel handling of multi-TCE. Not
 calling ENABLE_CAP would give him exactly that without ugly #ifdefs in
 the kernel.

I don't see much point in disabling it... but ok, if that's a valuable
feature, then shoot some VM level ENABLE_CAP (please don't iterate all
VCPUs, that's gross).

Ben.

--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 6/8] KVM: PPC: Add support for multiple-TCE hcalls

2013-07-11 Thread Benjamin Herrenschmidt

On Thu, 2013-07-11 at 15:12 +1000, Alexey Kardashevskiy wrote:
  Any debug code is prohibited? Ok, I'll remove.
  
  Debug code that requires code changes is prohibited, yes.
  Debug code that is runtime switchable (pr_debug, trace points, etc)
  are allowed.

Bollox.

$ grep DBG\( arch/powerpc/ -r | wc -l
418

Also pr_devel is not runtime switchable in normal kernels either and
still an official kernel interface.

 Is there any easy way to enable just this specific udbg_printf (not all of
 them at once)? Trace points do not work in real mode as we figured out.

The cleaner way to do it is to use some kind of local macro that you
enable/disable by changing a #define at the top of the function, possibly
several.

Cheers,
Ben.


--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 6/8] KVM: PPC: Add support for multiple-TCE hcalls

2013-07-11 Thread Benjamin Herrenschmidt

On Thu, 2013-07-11 at 12:11 +0200, Alexander Graf wrote:
  So I must add one more ioctl to enable MULTITCE in kernel handling. Is it
  what you are saying?
  
  I can see KVM_CHECK_EXTENSION but I do not see KVM_ENABLE_EXTENSION or
  anything like that.
 
 KVM_ENABLE_CAP. It's how we enable sPAPR capabilities too.

But in that case I don't see the point.

Cheers,
Ben.


--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [v1][PATCH 1/1] KVM: PPC: disable preemption when using hard_irq_disable()

2013-07-11 Thread Benjamin Herrenschmidt

On Thu, 2013-07-11 at 15:07 +0200, Alexander Graf wrote:
 Ok, let me quickly explain the problem.
 
 We are leaving host context, switching slowly into guest context.
 During that transition we call get_paca() indirectly (apparently by
 another call to hard_disable() which sounds bogus, but that's another
 story).
 
 get_paca() warns when we're preemptible. We're only not preemptible
 when either preempt is disabled or irqs are disabled. Irqs are
 disabled, but arch_irqs_disabled() doesn't know, because it only
 checks for soft disabled IRQs.
 
 So we can fix this either by setting IRQs as soft disabled as well or
 by disabling preemption until we enter the guest for real. Any
 preferences?

Well, if you hard disable first (ie, direct transition from full enabled
to hard disabled), you know you have nothing lazy pending in
irq_pending, then it's ok to mess around with local_paca-soft_enabled
to make it look disabled.

IE. Call hard_irq_disable(), then only turn local_paca-soft_enabled
back on late in the process, some time before the final rfi(d).

That works as long as you had a transition from full enabled to full
disabled and don't hard re-enable in the process. IE, You are certain
that there is nothing pending in irq_happened.

HOWEVER !

If you do that, you *ALSO* need to clear irq_happened. You must *NEVER*
leave PACA_IRQ_HARD_DIS set in irq_happened if you are soft-enabled, and
since the above means that you *will* be seen as soft enabled on the way
out of the guest, you can kaboom.

BTW. I'm fine with a patch that does:

#define hard_irq_disable()  do {\
u8 _was_enabled = get_paca()-soft_enabled; \
__hard_irq_disable();   \
-   get_paca()-soft_enabled = 0;   \
+   local_paca-soft_enabled = 0;   \

In fact we should probably do it anyway.

Cheers,
Ben.



--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [v1][PATCH 1/1] KVM: PPC: disable preemption when using hard_irq_disable()

2013-07-11 Thread Benjamin Herrenschmidt

On Thu, 2013-07-11 at 11:18 -0500, Scott Wood wrote:
 
 If we set IRQs as soft-disabled prior to calling hard_irq_disable(),  
 then hard_irq_disable() will fail to call trace_hardirqs_off().

Sure because setting them as soft-disabled will have done it.

However by doing so, you also create the possibility of latching a new
event in irq_happened.

  or by disabling preemption until we enter the guest for real.
 
 I don't follow this one.  We're exiting, not entering.
 
  Any preferences?
 
 Use arch_local_save_flags() in hard_irq_disable() instead of reading  
 soft_enabled with C code.

That or just use local_paca... Though we'd had problems in the past
where gcc would defeat us there and essentially copy r13 to another
register and start indexing from there. That kills you if you preempt.

Cheers,
Ben.


--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [v1][PATCH 1/1] KVM: PPC: disable preemption when using hard_irq_disable()

2013-07-11 Thread Benjamin Herrenschmidt

On Fri, 2013-07-12 at 10:13 +0800, tiejun.chen wrote:
  #define hard_irq_disable()do {\
u8 _was_enabled = get_paca()-soft_enabled; \
 
 Current problem I met is issued from the above line.
 
__hard_irq_disable();   \
  - get_paca()-soft_enabled = 0;   \
 
 Not here.
 
 If I'm misunderstanding what you guys means, please correct me since this is 
 a 
 long discussion thread. I have to reread that carefully.

Then make it
u8 _was_enabled;
__hard_irq_disable();
was_enabled = local_paca-

Once you have hard disabled, using local_paca directly *should* be safe
(minus that gcc problem I mentioned).

Cheers,
Ben.


--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 8/8] KVM: PPC: Add hugepage support for IOMMU in-kernel handling

2013-07-10 Thread Benjamin Herrenschmidt

On Wed, 2013-07-10 at 12:33 +0200, Alexander Graf wrote:
 
 It's not exactly obvious that you're calling it with writing == 1 :).
 Can you create a new local variable is_write in the calling
 function, set that to 1 before the call to get_user_pages_fast and
 pass it in instead of the 1? The compiler should easily optimize all
 of that away, but it makes the code by far easier to read.

Ugh ?

Nobody else does that  (look at futex :-)

Ben.


--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 2/3] kvm/ppc: IRQ disabling cleanup

2013-07-10 Thread Benjamin Herrenschmidt

On Thu, 2013-07-11 at 00:57 +0200, Alexander Graf wrote:
  #ifdef CONFIG_PPC64
  + /*
  +  * To avoid races, the caller must have gone directly from having
  +  * interrupts fully-enabled to hard-disabled.
  +  */
  + WARN_ON(local_paca-irq_happened != PACA_IRQ_HARD_DIS);
 
 WARN_ON(lazy_irq_pending()); ?

Different semantics. What you propose will not catch irq_happened == 0 :-)

Cheers,
Ben.


--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 5/8] powerpc: add real mode support for dma operations on powernv

2013-07-09 Thread Benjamin Herrenschmidt

On Tue, 2013-07-09 at 18:02 +0200, Alexander Graf wrote:
 On 07/06/2013 05:07 PM, Alexey Kardashevskiy wrote:
  The existing TCE machine calls (tce_build and tce_free) only support
  virtual mode as they call __raw_writeq for TCE invalidation what
  fails in real mode.
 
  This introduces tce_build_rm and tce_free_rm real mode versions
  which do mostly the same but use Store Doubleword Caching Inhibited
  Indexed instruction for TCE invalidation.
 
 So would always using stdcix have any bad side effects?

Yes. Those instructions are only supposed to be used in hypervisor real
mode as per the architecture spec.

Cheers,
Ben.

 
 Alex
 
 
  This new feature is going to be utilized by real mode support of VFIO.
 
  Signed-off-by: Alexey Kardashevskiya...@ozlabs.ru
  ---
arch/powerpc/include/asm/machdep.h| 12 ++
arch/powerpc/platforms/powernv/pci-ioda.c | 26 +++--
arch/powerpc/platforms/powernv/pci.c  | 38 
  ++-
arch/powerpc/platforms/powernv/pci.h  |  2 +-
4 files changed, 64 insertions(+), 14 deletions(-)
 
  diff --git a/arch/powerpc/include/asm/machdep.h 
  b/arch/powerpc/include/asm/machdep.h
  index 92386fc..0c19eef 100644
  --- a/arch/powerpc/include/asm/machdep.h
  +++ b/arch/powerpc/include/asm/machdep.h
  @@ -75,6 +75,18 @@ struct machdep_calls {
  long index);
  void(*tce_flush)(struct iommu_table *tbl);
 
  +   /* _rm versions are for real mode use only */
  +   int (*tce_build_rm)(struct iommu_table *tbl,
  +long index,
  +long npages,
  +unsigned long uaddr,
  +enum dma_data_direction direction,
  +struct dma_attrs *attrs);
  +   void(*tce_free_rm)(struct iommu_table *tbl,
  +   long index,
  +   long npages);
  +   void(*tce_flush_rm)(struct iommu_table *tbl);
  +
  void __iomem *  (*ioremap)(phys_addr_t addr, unsigned long size,
 unsigned long flags, void *caller);
  void(*iounmap)(volatile void __iomem *token);
  diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
  b/arch/powerpc/platforms/powernv/pci-ioda.c
  index 2931d97..2797dec 100644
  --- a/arch/powerpc/platforms/powernv/pci-ioda.c
  +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
  @@ -68,6 +68,12 @@ define_pe_printk_level(pe_err, KERN_ERR);
define_pe_printk_level(pe_warn, KERN_WARNING);
define_pe_printk_level(pe_info, KERN_INFO);
 
  +static inline void rm_writed(unsigned long paddr, u64 val)
  +{
  +   __asm__ __volatile__(sync; stdcix %0,0,%1
  +   : : r (val), r (paddr) : memory);
  +}
  +
static int pnv_ioda_alloc_pe(struct pnv_phb *phb)
{
  unsigned long pe;
  @@ -442,7 +448,7 @@ static void pnv_pci_ioda_dma_dev_setup(struct pnv_phb 
  *phb, struct pci_dev *pdev
}
 
static void pnv_pci_ioda1_tce_invalidate(struct iommu_table *tbl,
  -u64 *startp, u64 *endp)
  +u64 *startp, u64 *endp, bool rm)
{
  u64 __iomem *invalidate = (u64 __iomem *)tbl-it_index;
  unsigned long start, end, inc;
  @@ -471,7 +477,10 @@ static void pnv_pci_ioda1_tce_invalidate(struct 
  iommu_table *tbl,
 
mb(); /* Ensure above stores are visible */
while (start= end) {
  -__raw_writeq(start, invalidate);
  +   if (rm)
  +   rm_writed((unsigned long) invalidate, start);
  +   else
  +   __raw_writeq(start, invalidate);
start += inc;
}
 
  @@ -483,7 +492,7 @@ static void pnv_pci_ioda1_tce_invalidate(struct 
  iommu_table *tbl,
 
static void pnv_pci_ioda2_tce_invalidate(struct pnv_ioda_pe *pe,
   struct iommu_table *tbl,
  -u64 *startp, u64 *endp)
  +u64 *startp, u64 *endp, bool rm)
{
  unsigned long start, end, inc;
  u64 __iomem *invalidate = (u64 __iomem *)tbl-it_index;
  @@ -502,22 +511,25 @@ static void pnv_pci_ioda2_tce_invalidate(struct 
  pnv_ioda_pe *pe,
  mb();
 
  while (start= end) {
  -   __raw_writeq(start, invalidate);
  +   if (rm)
  +   rm_writed((unsigned long) invalidate, start);
  +   else
  +   __raw_writeq(start, invalidate);
  start += inc;
  }
}
 
void pnv_pci_ioda_tce_invalidate(struct iommu_table *tbl,
  -u64 *startp, u64 *endp)
  +u64 *startp, u64 *endp, bool rm)
{
  struct pnv_ioda_pe *pe = container_of(tbl, struct pnv_ioda_pe,
tce32_table);

Re: [PATCH 4/8] powerpc: Prepare to support kernel handling of IOMMU map/unmap

2013-07-07 Thread Benjamin Herrenschmidt

On Sun, 2013-07-07 at 01:07 +1000, Alexey Kardashevskiy wrote:
 The current VFIO-on-POWER implementation supports only user mode
 driven mapping, i.e. QEMU is sending requests to map/unmap pages.
 However this approach is really slow, so we want to move that to KVM.
 Since H_PUT_TCE can be extremely performance sensitive (especially with
 network adapters where each packet needs to be mapped/unmapped) we chose
 to implement that as a fast hypercall directly in real
 mode (processor still in the guest context but MMU off).
 
 To be able to do that, we need to provide some facilities to
 access the struct page count within that real mode environment as things
 like the sparsemem vmemmap mappings aren't accessible.
 
 This adds an API to increment/decrement page counter as
 get_user_pages API used for user mode mapping does not work
 in the real mode.
 
 CONFIG_SPARSEMEM_VMEMMAP and CONFIG_FLATMEM are supported.

This patch will need an ack from mm people to make sure they are ok
with our approach and ack the change to the generic header.

(Added linux-mm).

Cheers,
Ben.

 Reviewed-by: Paul Mackerras pau...@samba.org
 Signed-off-by: Paul Mackerras pau...@samba.org
 Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
 
 ---
 
 Changes:
 2013/06/27:
 * realmode_get_page() fixed to use get_page_unless_zero(). If failed,
 the call will be passed from real to virtual mode and safely handled.
 * added comment to PageCompound() in include/linux/page-flags.h.
 
 2013/05/20:
 * PageTail() is replaced by PageCompound() in order to have the same checks
 for whether the page is huge in realmode_get_page() and realmode_put_page()
 
 Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
 ---
  arch/powerpc/include/asm/pgtable-ppc64.h |  4 ++
  arch/powerpc/mm/init_64.c| 78 
 +++-
  include/linux/page-flags.h   |  4 +-
  3 files changed, 84 insertions(+), 2 deletions(-)
 
 diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h 
 b/arch/powerpc/include/asm/pgtable-ppc64.h
 index e3d55f6f..7b46e5f 100644
 --- a/arch/powerpc/include/asm/pgtable-ppc64.h
 +++ b/arch/powerpc/include/asm/pgtable-ppc64.h
 @@ -376,6 +376,10 @@ static inline pte_t *find_linux_pte_or_hugepte(pgd_t 
 *pgdir, unsigned long ea,
  }
  #endif /* !CONFIG_HUGETLB_PAGE */
  
 +struct page *realmode_pfn_to_page(unsigned long pfn);
 +int realmode_get_page(struct page *page);
 +int realmode_put_page(struct page *page);
 +
  #endif /* __ASSEMBLY__ */
  
  #endif /* _ASM_POWERPC_PGTABLE_PPC64_H_ */
 diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
 index a90b9c4..7031be3 100644
 --- a/arch/powerpc/mm/init_64.c
 +++ b/arch/powerpc/mm/init_64.c
 @@ -297,5 +297,81 @@ void vmemmap_free(unsigned long start, unsigned long end)
  {
  }
  
 -#endif /* CONFIG_SPARSEMEM_VMEMMAP */
 +/*
 + * We do not have access to the sparsemem vmemmap, so we fallback to
 + * walking the list of sparsemem blocks which we already maintain for
 + * the sake of crashdump. In the long run, we might want to maintain
 + * a tree if performance of that linear walk becomes a problem.
 + *
 + * Any of realmode_ functions can fail due to:
 + * 1) As real sparsemem blocks do not lay in RAM continously (they
 + * are in virtual address space which is not available in the real mode),
 + * the requested page struct can be split between blocks so get_page/put_page
 + * may fail.
 + * 2) When huge pages are used, the get_page/put_page API will fail
 + * in real mode as the linked addresses in the page struct are virtual
 + * too.
 + * When 1) or 2) takes place, the API returns an error code to cause
 + * an exit to kernel virtual mode where the operation will be completed.
 + */
 +struct page *realmode_pfn_to_page(unsigned long pfn)
 +{
 + struct vmemmap_backing *vmem_back;
 + struct page *page;
 + unsigned long page_size = 1  mmu_psize_defs[mmu_vmemmap_psize].shift;
 + unsigned long pg_va = (unsigned long) pfn_to_page(pfn);
  
 + for (vmem_back = vmemmap_list; vmem_back; vmem_back = vmem_back-list) {
 + if (pg_va  vmem_back-virt_addr)
 + continue;
 +
 + /* Check that page struct is not split between real pages */
 + if ((pg_va + sizeof(struct page)) 
 + (vmem_back-virt_addr + page_size))
 + return NULL;
 +
 + page = (struct page *) (vmem_back-phys + pg_va -
 + vmem_back-virt_addr);
 + return page;
 + }
 +
 + return NULL;
 +}
 +EXPORT_SYMBOL_GPL(realmode_pfn_to_page);
 +
 +#elif defined(CONFIG_FLATMEM)
 +
 +struct page *realmode_pfn_to_page(unsigned long pfn)
 +{
 + struct page *page = pfn_to_page(pfn);
 + return page;
 +}
 +EXPORT_SYMBOL_GPL(realmode_pfn_to_page);
 +
 +#endif /* CONFIG_SPARSEMEM_VMEMMAP/CONFIG_FLATMEM */
 +
 +#if defined(CONFIG_SPARSEMEM_VMEMMAP) || defined(CONFIG_FLATMEM)
 +int realmode_get_page(struct page *page)
 +{
 +

Re: [PATCH 2/2] KVM: PPC: Book3E: Add LRAT error exception handler

2013-07-04 Thread Benjamin Herrenschmidt

On Thu, 2013-07-04 at 06:47 +, Caraman Mihai Claudiu-B02008 wrote:
 This is a solid reason. Ben it's ok for you to apply the combined
 patch? If so I will respin it.

Sure, but nowadays, all that stuff goes via Scott and Alex.

Cheers,
Ben.


--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH -V3 2/4] powerpc/kvm: Contiguous memory allocator based hash page table allocation

2013-07-02 Thread Benjamin Herrenschmidt

On Tue, 2013-07-02 at 17:12 +0200, Alexander Graf wrote:
 Is CMA a mandatory option in the kernel? Or can it be optionally 
 disabled? If it can be disabled, we should keep the preallocated 
 fallback case around for systems that have CMA disabled.

Why ? More junk code to keep around ...

If CMA is disabled, we can limit ourselves to dynamic allocation (with
limitation to 16M hash table).

Cheers,
Ben.


--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 3/8] vfio: add external user support

2013-06-27 Thread Benjamin Herrenschmidt

On Thu, 2013-06-27 at 16:59 +1000, Stephen Rothwell wrote:
  +/* Allows an external user (for example, KVM) to unlock an IOMMU
 group */
  +static void vfio_group_del_external_user(struct file *filep)
  +{
  + struct vfio_group *group = filep-private_data;
  +
  + BUG_ON(filep-f_op != vfio_group_fops);
 
 We usually reserve BUG_ON for situations where there is no way to
 continue running or continuing will corrupt the running kernel.  Maybe
 WARN_ON() and return?

Not even that. This is a user space provided fd, we shouldn't oops the
kernel because we passed a wrong argument, just return -EINVAL or
something like that (add a return code).

Ben.



--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling

2013-06-23 Thread Benjamin Herrenschmidt

On Mon, 2013-06-24 at 13:54 +1000, David Gibson wrote:
  DDW means an API by which the guest can request the creation of
  additional iommus for a given device (typically, in addition to the
  default smallish 32-bit one using 4k pages, the guest can request
  a larger window in 64-bit space using a larger page size).
 
 So, would a PAPR gest requesting this expect the new window to have
 a new liobn, or an existing liobn?

New liobn or there is no way to H_PUT_TCE it (it exists in addition
to the legacy window).

Cheers,
Ben.


--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

1 2 3 >

1 - 100 of 285 matches

Mail list logo