Re: [PATCH v2] KVM: PPC: Exit guest upon fatal machine check exception

2015-12-16 Thread David Gibson
> diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S 
> b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> index b98889e..f43c124 100644
> --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> @@ -147,7 +147,6 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S)
>   addi	r1, r1, 112
>   ld  r7, HSTATE_HOST_MSR(r13)
>

Seems like the comment a little above this should be updated to
reflect the fact that this path no longer handles machine checks.

Apart from that and the access width bug Thomas spotted, it looks ok
to me.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


signature.asc
Description: PGP signature


Re: [PATCH kernel 5/9] KVM: PPC: Account TCE-containing pages in locked_vm

2015-12-08 Thread David Gibson
On Tue, Sep 15, 2015 at 08:49:35PM +1000, Alexey Kardashevskiy wrote:
> At the moment pages used for TCE tables (in addition to pages addressed
> by TCEs) are not counted in locked_vm counter so a malicious userspace
> tool can call ioctl(KVM_CREATE_SPAPR_TCE) as many times as RLIMIT_NOFILE and
> lock a lot of memory.
> 
> This adds counting for pages used for TCE tables.
> 
> This counts the number of pages required for a table plus pages for
> the kvmppc_spapr_tce_table struct (TCE table descriptor) itself.

Hmm.  Does it make sense to account for the descriptor struct itself?
I mean there are lots of little structures the kernel will allocate on
a process's behalf, and I don't think most of them get accounted
against locked vm.
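If the descriptor were dropped from the accounting, the helper would only need the table pages themselves. Roughly like this (a stand-alone sketch with made-up page constants, echoing the shape of kvmppc_stt_npages() rather than quoting it):

```c
#include <stddef.h>

#define PAGE_SIZE  4096UL	/* illustrative; the kernel's PAGE_SIZE */
#define PAGE_SHIFT 12

/*
 * Hypothetical variant of kvmppc_stt_npages() that counts only the
 * TCE table pages, leaving the descriptor struct out of locked_vm.
 */
static long tce_table_pages(unsigned long window_size)
{
	unsigned long tces  = window_size >> PAGE_SHIFT;      /* one TCE per 4K page */
	unsigned long bytes = tces * sizeof(unsigned long);   /* one u64 per entry */

	return (long)((bytes + PAGE_SIZE - 1) / PAGE_SIZE);   /* ALIGN()/PAGE_SIZE */
}
```

A 16MB window then accounts 8 pages for the table and nothing for the descriptor.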

> This does not change the amount of (de)allocated memory.
> 
> Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru>
> ---
>  arch/powerpc/kvm/book3s_64_vio.c | 51 
> +++-
>  1 file changed, 50 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/kvm/book3s_64_vio.c 
> b/arch/powerpc/kvm/book3s_64_vio.c
> index 9526c34..b70787d 100644
> --- a/arch/powerpc/kvm/book3s_64_vio.c
> +++ b/arch/powerpc/kvm/book3s_64_vio.c
> @@ -45,13 +45,56 @@ static long kvmppc_stt_npages(unsigned long window_size)
>* sizeof(u64), PAGE_SIZE) / PAGE_SIZE;
>  }
>  
> +static long kvmppc_account_memlimit(long npages, bool inc)
> +{
> + long ret = 0;
> + const long bytes = sizeof(struct kvmppc_spapr_tce_table) +
> + (abs(npages) * sizeof(struct page *));
> + const long stt_pages = ALIGN(bytes, PAGE_SIZE) / PAGE_SIZE;

Overflow checks might be useful here, I'm not sure.
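Concretely, I was thinking of something like this (untested sketch; DESC_BYTES and PTR_BYTES are stand-ins for sizeof(struct kvmppc_spapr_tce_table) and sizeof(struct page *)):

```c
#include <limits.h>
#include <stddef.h>

#define DESC_BYTES 128UL		/* stand-in for the descriptor size */
#define PTR_BYTES  sizeof(void *)	/* stand-in for sizeof(struct page *) */

/*
 * Refuse npages values for which DESC_BYTES + npages * PTR_BYTES
 * would wrap an unsigned long; returns 0 on success, -1 (think
 * -EINVAL) on overflow.
 */
static int stt_bytes_checked(unsigned long npages, unsigned long *bytes)
{
	if (npages > (ULONG_MAX - DESC_BYTES) / PTR_BYTES)
		return -1;

	*bytes = DESC_BYTES + npages * PTR_BYTES;
	return 0;
}
```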

> +
> + if (!current || !current->mm)
> + return ret; /* process exited */
> +
> + npages += stt_pages;
> +
> + down_write(&current->mm->mmap_sem);
> +
> + if (inc) {
> + long locked, lock_limit;
> +
> + locked = current->mm->locked_vm + npages;
> + lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> + if (locked > lock_limit && !capable(CAP_IPC_LOCK))
> + ret = -ENOMEM;
> + else
> + current->mm->locked_vm += npages;
> + } else {
> + if (npages > current->mm->locked_vm)

Should this be a WARN_ON?  It means something has gone wrong
previously in the accounting, doesn't it?
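A loud clamp would look something like this (stand-alone sketch, with a plain function standing in for the kernel's WARN_ON so it compiles outside the kernel):

```c
#include <stdio.h>

/* Portable stand-in for the kernel's WARN_ON(): report and return
 * the condition so it can sit directly in the if(). */
static int warn_on(int cond, const char *what)
{
	if (cond)
		fprintf(stderr, "WARNING: %s\n", what);
	return cond;
}

/* Decrement path of the accounting, with the "can't happen" case
 * flagged instead of silently clamped. */
static unsigned long locked_vm_dec(unsigned long locked_vm, unsigned long npages)
{
	if (warn_on(npages > locked_vm, "locked_vm accounting underflow"))
		npages = locked_vm;	/* clamp rather than wrap past zero */

	return locked_vm - npages;
}
```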

> + npages = current->mm->locked_vm;
> +
> + current->mm->locked_vm -= npages;
> + }
> +
> + pr_debug("[%d] RLIMIT_MEMLOCK KVM %c%ld %ld/%ld%s\n", current->pid,
> + inc ? '+' : '-',
> + npages << PAGE_SHIFT,
> + current->mm->locked_vm << PAGE_SHIFT,
> + rlimit(RLIMIT_MEMLOCK),
> + ret ? " - exceeded" : "");
> +
> + up_write(&current->mm->mmap_sem);
> +
> + return ret;
> +}
> +
>  static void release_spapr_tce_table(struct rcu_head *head)
>  {
>   struct kvmppc_spapr_tce_table *stt = container_of(head,
>   struct kvmppc_spapr_tce_table, rcu);
>   int i;
> + long npages = kvmppc_stt_npages(stt->window_size);
>  
> - for (i = 0; i < kvmppc_stt_npages(stt->window_size); i++)
> + for (i = 0; i < npages; i++)
>   __free_page(stt->pages[i]);
>  
>   kfree(stt);
> @@ -89,6 +132,7 @@ static int kvm_spapr_tce_release(struct inode *inode, 
> struct file *filp)
>  
>   kvm_put_kvm(stt->kvm);
>  
> + kvmppc_account_memlimit(kvmppc_stt_npages(stt->window_size), false);
> + call_rcu(&stt->rcu, release_spapr_tce_table);
>  
>       return 0;
> @@ -114,6 +158,11 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>   }
>  
>   npages = kvmppc_stt_npages(args->window_size);
> + ret = kvmppc_account_memlimit(npages, true);
> + if (ret) {
> + stt = NULL;
> + goto fail;
> + }
>  
>   stt = kzalloc(sizeof(*stt) + npages * sizeof(struct page *),
> GFP_KERNEL);

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson




Re: [PATCH kernel 7/9] KVM: PPC: Move reusable bits of H_PUT_TCE handler to helpers

2015-12-08 Thread David Gibson
> +	return (u64 *) page_address(page);
> +}
> +
> +/*
> + * Handles TCE requests for emulated devices.
> + * Puts guest TCE values to the table and expects user space to convert them.
> + * Called in both real and virtual modes.
> + * Cannot fail so kvmppc_tce_validate must be called before it.
> + *
> + * WARNING: This will be called in real-mode on HV KVM and virtual
> + *  mode on PR KVM
> + */
> +void kvmppc_tce_put(struct kvmppc_spapr_tce_table *stt,
> + unsigned long idx, unsigned long tce)
> +{
> + struct page *page;
> + u64 *tbl;
> +
> + page = stt->pages[idx / TCES_PER_PAGE];
> + tbl = kvmppc_page_address(page);
> +
> + tbl[idx % TCES_PER_PAGE] = tce;
> +}
> +EXPORT_SYMBOL_GPL(kvmppc_tce_put);
>  
>  /* WARNING: This will be called in real-mode on HV KVM and virtual
>   *  mode on PR KVM
> @@ -85,9 +159,6 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long 
> liobn,
>  {
>   struct kvmppc_spapr_tce_table *stt = kvmppc_find_table(vcpu, liobn);
>   long ret = H_TOO_HARD;
> - unsigned long idx;
> - struct page *page;
> - u64 *tbl;
>  
>   /* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
>   /*  liobn, ioba, tce); */
> @@ -99,13 +170,11 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned 
> long liobn,
>   if (ret)
>   return ret;
>  
> - idx = ioba >> IOMMU_PAGE_SHIFT_4K;
> - page = stt->pages[idx / TCES_PER_PAGE];
> - tbl = (u64 *)page_address(page);
> + ret = kvmppc_tce_validate(stt, tce);
> + if (ret)
> + return ret;
>  
> - /* FIXME: Need to validate the TCE itself */
> - /* udbg_printf("tce @ %p\n", &tbl[idx % TCES_PER_PAGE]); */
> - tbl[idx % TCES_PER_PAGE] = tce;
> + kvmppc_tce_put(stt, ioba >> IOMMU_PAGE_SHIFT_4K, tce);
>  
>   return ret;
>  }

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson




Re: [PATCH kernel 6/9] KVM: PPC: Replace SPAPR_TCE_SHIFT with IOMMU_PAGE_SHIFT_4K

2015-12-08 Thread David Gibson
On Tue, Sep 15, 2015 at 08:49:36PM +1000, Alexey Kardashevskiy wrote:
> SPAPR_TCE_SHIFT is used in a few places only and since IOMMU_PAGE_SHIFT_4K
> can be easily used instead, remove SPAPR_TCE_SHIFT.
> 
> Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru>

Reviewed-by: David Gibson <da...@gibson.dropbear.id.au>

> ---
>  arch/powerpc/include/asm/kvm_book3s_64.h | 2 --
>  arch/powerpc/kvm/book3s_64_vio.c | 3 ++-
>  arch/powerpc/kvm/book3s_64_vio_hv.c  | 4 ++--
>  3 files changed, 4 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h 
> b/arch/powerpc/include/asm/kvm_book3s_64.h
> index 2aa79c8..7529aab 100644
> --- a/arch/powerpc/include/asm/kvm_book3s_64.h
> +++ b/arch/powerpc/include/asm/kvm_book3s_64.h
> @@ -33,8 +33,6 @@ static inline void svcpu_put(struct 
> kvmppc_book3s_shadow_vcpu *svcpu)
>  }
>  #endif
>  
> -#define SPAPR_TCE_SHIFT  12
> -
>  #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
>  #define KVM_DEFAULT_HPT_ORDER	24	/* 16MB HPT by default */
>  #endif
> diff --git a/arch/powerpc/kvm/book3s_64_vio.c 
> b/arch/powerpc/kvm/book3s_64_vio.c
> index b70787d..e347856 100644
> --- a/arch/powerpc/kvm/book3s_64_vio.c
> +++ b/arch/powerpc/kvm/book3s_64_vio.c
> @@ -36,12 +36,13 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #define TCES_PER_PAGE	(PAGE_SIZE / sizeof(u64))
>  
>  static long kvmppc_stt_npages(unsigned long window_size)
>  {
> - return ALIGN((window_size >> SPAPR_TCE_SHIFT)
> + return ALIGN((window_size >> IOMMU_PAGE_SHIFT_4K)
>* sizeof(u64), PAGE_SIZE) / PAGE_SIZE;
>  }
>  
> diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c 
> b/arch/powerpc/kvm/book3s_64_vio_hv.c
> index 8ae12ac..6cf1ab3 100644
> --- a/arch/powerpc/kvm/book3s_64_vio_hv.c
> +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
> @@ -99,7 +99,7 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long 
> liobn,
>   if (ret)
>   return ret;
>  
> - idx = ioba >> SPAPR_TCE_SHIFT;
> + idx = ioba >> IOMMU_PAGE_SHIFT_4K;
>   page = stt->pages[idx / TCES_PER_PAGE];
>   tbl = (u64 *)page_address(page);
>  
> @@ -121,7 +121,7 @@ long kvmppc_h_get_tce(struct kvm_vcpu *vcpu, unsigned 
> long liobn,
>   if (stt) {
>   ret = kvmppc_ioba_validate(stt, ioba, 1);
>   if (!ret) {
> - unsigned long idx = ioba >> SPAPR_TCE_SHIFT;
> + unsigned long idx = ioba >> IOMMU_PAGE_SHIFT_4K;
>   struct page *page = stt->pages[idx / TCES_PER_PAGE];
>   u64 *tbl = (u64 *)page_address(page);
>  

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson




Re: [PATCH kernel 9/9] KVM: PPC: Add support for multiple-TCE hcalls

2015-12-08 Thread David Gibson
On Tue, Sep 15, 2015 at 08:49:39PM +1000, Alexey Kardashevskiy wrote:
> This adds real and virtual mode handlers for the H_PUT_TCE_INDIRECT and
> H_STUFF_TCE hypercalls for user space emulated devices such as IBMVIO
> devices or emulated PCI.  These calls allow adding multiple entries
> (up to 512) into the TCE table in one call which saves time on
> transition between kernel and user space.
> 
> This implements the KVM_CAP_PPC_MULTITCE capability. When present,
> the kernel will try handling H_PUT_TCE_INDIRECT and H_STUFF_TCE.
> If they can not be handled by the kernel, they are passed on to
> the user space. The user space still has to have an implementation
> for these.
> 
> Both HV and PR-style KVM are supported.
> 
> Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru>
> ---
>  Documentation/virtual/kvm/api.txt   |  25 ++
>  arch/powerpc/include/asm/kvm_ppc.h  |  12 +++
>  arch/powerpc/kvm/book3s_64_vio.c| 111 +++-
>  arch/powerpc/kvm/book3s_64_vio_hv.c | 145 
> ++--
>  arch/powerpc/kvm/book3s_hv.c|  26 +-
>  arch/powerpc/kvm/book3s_hv_rmhandlers.S |   6 +-
>  arch/powerpc/kvm/book3s_pr_papr.c   |  35 
>  arch/powerpc/kvm/powerpc.c  |   3 +
>  8 files changed, 350 insertions(+), 13 deletions(-)
> 
> diff --git a/Documentation/virtual/kvm/api.txt 
> b/Documentation/virtual/kvm/api.txt
> index d86d831..593c62a 100644
> --- a/Documentation/virtual/kvm/api.txt
> +++ b/Documentation/virtual/kvm/api.txt
> @@ -3019,6 +3019,31 @@ Returns: 0 on success, -1 on error
>  
>  Queues an SMI on the thread's vcpu.
>  
> +4.97 KVM_CAP_PPC_MULTITCE
> +
> +Capability: KVM_CAP_PPC_MULTITCE
> +Architectures: ppc
> +Type: vm
> +
> +This capability means the kernel is capable of handling hypercalls
> +H_PUT_TCE_INDIRECT and H_STUFF_TCE without passing those into the user
> +space. This significantly accelerates DMA operations for PPC KVM guests.
> +User space should expect that its handlers for these hypercalls
> +are not going to be called if user space previously registered LIOBN
> +in KVM (via KVM_CREATE_SPAPR_TCE or similar calls).
> +
> +In order to enable H_PUT_TCE_INDIRECT and H_STUFF_TCE use in the guest,
> +user space might have to advertise it for the guest. For example,
> +IBM pSeries (sPAPR) guest starts using them if "hcall-multi-tce" is
> +present in the "ibm,hypertas-functions" device-tree property.
> +
> +The hypercalls mentioned above may or may not be processed successfully
> +in the kernel based fast path. If they can not be handled by the kernel,
> +they will get passed on to user space. So user space still has to have
> +an implementation for these despite the in kernel acceleration.
> +
> +This capability is always enabled.
> +
>  5. The kvm_run structure
>  
>  
> diff --git a/arch/powerpc/include/asm/kvm_ppc.h 
> b/arch/powerpc/include/asm/kvm_ppc.h
> index fcde896..e5b968e 100644
> --- a/arch/powerpc/include/asm/kvm_ppc.h
> +++ b/arch/powerpc/include/asm/kvm_ppc.h
> @@ -166,12 +166,24 @@ extern int kvmppc_pseries_do_hcall(struct kvm_vcpu 
> *vcpu);
>  
>  extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>   struct kvm_create_spapr_tce *args);
> +extern struct kvmppc_spapr_tce_table *kvmppc_find_table(
> + struct kvm_vcpu *vcpu, unsigned long liobn);
>  extern long kvmppc_ioba_validate(struct kvmppc_spapr_tce_table *stt,
>   unsigned long ioba, unsigned long npages);
>  extern long kvmppc_tce_validate(struct kvmppc_spapr_tce_table *tt,
>   unsigned long tce);
> +extern long kvmppc_gpa_to_ua(struct kvm *kvm, unsigned long gpa,
> + unsigned long *ua, unsigned long **prmap);
> +extern void kvmppc_tce_put(struct kvmppc_spapr_tce_table *tt,
> + unsigned long idx, unsigned long tce);
>  extern long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>unsigned long ioba, unsigned long tce);
> +extern long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> + unsigned long liobn, unsigned long ioba,
> + unsigned long tce_list, unsigned long npages);
> +extern long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
> + unsigned long liobn, unsigned long ioba,
> + unsigned long tce_value, unsigned long npages);
>  extern long kvmppc_h_get_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>unsigned long ioba);
>  extern struct page *kvm_alloc_hpt(unsigned long nr_pages);
> diff --git a/arch/powerpc/kvm/book3s_64_vio.c 
> b/arch/powerpc/kvm/book3s_64_vio.c
> in

Re: [PATCH kernel 3/9] KVM: PPC: Rework H_PUT_TCE/H_GET_TCE handlers

2015-12-07 Thread David Gibson
This relies on the fact that kvmppc_ioba_validate() would have
returned H_SUCCESS some distance above.  This seems rather fragile if
you insert anything else which alters ret in between.  Since this is
the success path, I think it's clearer to explicitly return H_SUCCESS.
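To illustrate why the implicit form is fragile (toy sketch; the constants are made up apart from H_SUCCESS being zero, and neither function is the kernel's):

```c
/* Illustrative return codes; only H_SUCCESS's value matters here. */
#define H_SUCCESS 0UL
#define H_BUSY    1UL

/*
 * Fragile shape: returns whatever 'ret' last held.  An innocent
 * "ret = something" inserted before the final return changes the
 * function's success value without any compiler complaint.
 */
static unsigned long put_tce_implicit(int inserted_step)
{
	unsigned long ret = H_SUCCESS;	/* from the earlier validate call */

	if (inserted_step)
		ret = H_BUSY;		/* later refactoring drops this in */

	/* ... the actual TCE store happens here ... */
	return ret;
}

/* Robust shape: the success path names its own result. */
static unsigned long put_tce_explicit(void)
{
	/* ... the actual TCE store happens here ... */
	return H_SUCCESS;
}
```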

>  }
>  EXPORT_SYMBOL_GPL(kvmppc_h_put_tce);
>  
>  long kvmppc_h_get_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
> unsigned long ioba)
>  {
> - struct kvm *kvm = vcpu->kvm;
> - struct kvmppc_spapr_tce_table *stt;
> + struct kvmppc_spapr_tce_table *stt = kvmppc_find_table(vcpu, liobn);
> + long ret = H_TOO_HARD;
>  
> - list_for_each_entry(stt, &kvm->arch.spapr_tce_tables, list) {
> - if (stt->liobn == liobn) {
> +
> + if (stt) {
> + ret = kvmppc_ioba_validate(stt, ioba, 1);
> + if (!ret) {

This relies on the fact that H_SUCCESS == 0, I'm not sure if that's
something we're already doing elsewhere or not.
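Spelling the comparison out would remove the assumption at no cost, e.g. (trivial sketch):

```c
#define H_SUCCESS 0UL

/* Equivalent today, but only the explicit form survives a world in
 * which H_SUCCESS stopped being zero. */
static int ok_implicit(unsigned long ret) { return !ret; }
static int ok_explicit(unsigned long ret) { return ret == H_SUCCESS; }
```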


>   unsigned long idx = ioba >> SPAPR_TCE_SHIFT;
> - struct page *page;
> - u64 *tbl;
> -
> - if (ioba >= stt->window_size)
> - return H_PARAMETER;
> -
> - page = stt->pages[idx / TCES_PER_PAGE];
> - tbl = (u64 *)page_address(page);
> + struct page *page = stt->pages[idx / TCES_PER_PAGE];
> + u64 *tbl = (u64 *)page_address(page);
>  
>   vcpu->arch.gpr[4] = tbl[idx % TCES_PER_PAGE];
> - return H_SUCCESS;
>   }
>   }
>  
> - /* Didn't find the liobn, punt it to userspace */
> - return H_TOO_HARD;
> +
> + return ret;
>  }
>  EXPORT_SYMBOL_GPL(kvmppc_h_get_tce);

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson




Re: [PATCH kernel 1/9] rcu: Define notrace version of list_for_each_entry_rcu

2015-12-07 Thread David Gibson
On Tue, Sep 15, 2015 at 08:49:31PM +1000, Alexey Kardashevskiy wrote:
> This defines list_for_each_entry_rcu_notrace and list_entry_rcu_notrace
> which use rcu_dereference_raw_notrace instead of rcu_dereference_raw.
> This allows using list_for_each_entry_rcu_notrace in real mode (MMU is off).
> 
> Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru>

Reviewed-by: David Gibson <da...@gibson.dropbear.id.au>

> ---
>  include/linux/rculist.h | 38 ++
>  1 file changed, 38 insertions(+)
> 
> diff --git a/include/linux/rculist.h b/include/linux/rculist.h
> index 17c6b1f..439c4d7 100644
> --- a/include/linux/rculist.h
> +++ b/include/linux/rculist.h
> @@ -253,6 +253,25 @@ static inline void list_splice_init_rcu(struct list_head 
> *list,
>  })
>  
>  /**
> + * list_entry_rcu_notrace - get the struct for this entry
> + * @ptr:	the list_head pointer.
> + * @type:   the type of the struct this is embedded in.
> + * @member: the name of the list_struct within the struct.
> + *
> + * This primitive may safely run concurrently with the _rcu list-mutation
> + * primitives such as list_add_rcu() as long as it's guarded by 
> rcu_read_lock().
> + *
> + * This is the same as list_entry_rcu() except that it does
> + * not do any RCU debugging or tracing.
> + */
> +#define list_entry_rcu_notrace(ptr, type, member) \
> +({ \
> + typeof(*ptr) __rcu *__ptr = (typeof(*ptr) __rcu __force *)ptr; \
> + container_of((typeof(ptr))rcu_dereference_raw_notrace(__ptr), \
> + type, member); \
> +})
> +
> +/**
>   * Where are list_empty_rcu() and list_first_entry_rcu()?
>   *
>   * Implementing those functions following their counterparts list_empty() and
> @@ -308,6 +327,25 @@ static inline void list_splice_init_rcu(struct list_head 
> *list,
>   pos = list_entry_rcu(pos->member.next, typeof(*pos), member))
>  
>  /**
> + * list_for_each_entry_rcu_notrace - iterate over rcu list of given type
> + * @pos: the type * to use as a loop cursor.
> + * @head:	the head for your list.
> + * @member:  the name of the list_struct within the struct.
> + *
> + * This list-traversal primitive may safely run concurrently with
> + * the _rcu list-mutation primitives such as list_add_rcu()
> + * as long as the traversal is guarded by rcu_read_lock().
> + *
> + * This is the same as list_for_each_entry_rcu() except that it does
> + * not do any RCU debugging or tracing.
> + */
> +#define list_for_each_entry_rcu_notrace(pos, head, member) \
> + for (pos = list_entry_rcu_notrace((head)->next, typeof(*pos), member); \
> + &pos->member != (head); \
> + pos = list_entry_rcu_notrace(pos->member.next, typeof(*pos), \
> +     member))
> +
> +/**
>   * list_for_each_entry_continue_rcu - continue iteration over list of given 
> type
>   * @pos: the type * to use as a loop cursor.
>   * @head:	the head for your list.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson




Re: [PATCH kernel 4/9] KVM: PPC: Use RCU for arch.spapr_tce_tables

2015-12-07 Thread David Gibson
On Tue, Sep 15, 2015 at 08:49:34PM +1000, Alexey Kardashevskiy wrote:
> At the moment spapr_tce_tables is not protected against races. This makes
> use of RCU-variants of list helpers. As some bits are executed in real
> mode, this makes use of just introduced list_for_each_entry_rcu_notrace().
> 
> This converts release_spapr_tce_table() to a RCU scheduled handler.
> 
> Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru>

Looks correct, to the best of my limited knowledge of RCU.

Reviewed-by: David Gibson <da...@gibson.dropbear.id.au>

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson




Re: [PATCH kernel 2/9] KVM: PPC: Make real_vmalloc_addr() public

2015-12-07 Thread David Gibson
On Tue, Sep 15, 2015 at 08:49:32PM +1000, Alexey Kardashevskiy wrote:
> This helper translates vmalloc'd addresses to linear addresses.
> It is only used by the KVM MMU code now and resides in the HV KVM code.
> We will need it further in the TCE code and the DMA memory preregistration
> code called in real mode.
> 
> This makes real_vmalloc_addr() public and moves it to the powerpc code as
> it does not do anything special for KVM.
> 
> Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru>

Hmm, I have a couple of small concerns.

First, I'm not sure if the name is clear enough for a public function.

Second, I'm not sure if mmu-hash64.h is the right place for it.  This
is still a function with very specific and limited usage, I wonder if
we should have somewhere else for such special real mode helpers.

Paulus, thoughts?

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson




Re: [PATCH] KVM: PPC: Fix emulation of H_SET_DABR/X on POWER8

2015-11-22 Thread David Gibson
On Fri, 20 Nov 2015 09:11:45 +0100
Thomas Huth <th...@redhat.com> wrote:

> In the old DABR register, the BT (Breakpoint Translation) bit
> is bit number 61. In the new DAWRX register, the WT (Watchpoint
> Translation) bit is bit number 59. So to move the DABR-BT bit
> into the position of the DAWRX-WT bit, it has to be shifted by
> two, not only by one. This fixes hardware watchpoints in gdb of
> older guests that only use the H_SET_DABR/X interface instead
> of the new H_SET_MODE interface.
> 
> Signed-off-by: Thomas Huth <th...@redhat.com>

Reviewed-by: David Gibson <da...@gibson.dropbear.id.au>
Reviewed-by: David Gibson <dgib...@redhat.com>
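For anyone checking the arithmetic: with IBM bit numbering, bit n of a 64-bit register is value bit (63 - n), so DABR bit 61 is value bit 2 and DAWRX bit 59 is value bit 4, i.e. a shift left by two, as the fix uses. A quick stand-alone sketch of that reasoning (the function is illustrative, not the rlwimi itself):

```c
#include <stdint.h>

/* IBM (MSB-first) bit numbering for a 64-bit register. */
#define IBM_BIT(n) (UINT64_C(1) << (63 - (n)))

/* DABR.BT sits at IBM bit 61, DAWRX.WT at IBM bit 59 (per the commit
 * message); moving BT into the WT slot is a left shift by two. */
static uint64_t bt_shifted_to_wt(uint64_t dabr)
{
	return (dabr & IBM_BIT(61)) << 2;
}
```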

> ---
>  arch/powerpc/kvm/book3s_hv_rmhandlers.S | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S 
> b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> index b98889e..3983b87 100644
> --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> @@ -2143,7 +2143,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S)
>  
>   /* Emulate H_SET_DABR/X on P8 for the sake of compat mode guests */
>  2:   rlwimi  r5, r4, 5, DAWRX_DR | DAWRX_DW
> - rlwimi  r5, r4, 1, DAWRX_WT
> + rlwimi  r5, r4, 2, DAWRX_WT
>   clrrdi  r4, r4, 3
>   std r4, VCPU_DAWR(r3)
>   std r5, VCPU_DAWRX(r3)
> -- 
> 1.8.3.1
> 


-- 
David Gibson <dgib...@redhat.com>
Senior Software Engineer, Virtualization, Red Hat




Re: [PATCH] KVM: PPC: Exit guest upon fatal machine check exception

2015-11-12 Thread David Gibson
On Thu, Nov 12, 2015 at 11:22:29PM +0530, Aravinda Prasad wrote:
> 
> 
> On Thursday 12 November 2015 10:13 AM, David Gibson wrote:
> > On Thu, Nov 12, 2015 at 10:02:10AM +0530, Aravinda Prasad wrote:
> >>
> >>
> >> On Thursday 12 November 2015 09:08 AM, David Gibson wrote:
> >>> On Thu, Nov 12, 2015 at 01:24:19PM +1100, Daniel Axtens wrote:
> >>>> Aravinda Prasad <aravi...@linux.vnet.ibm.com> writes:
> >>>>
> >>>>> This patch modifies KVM to cause a guest exit with
> >>>>> KVM_EXIT_NMI instead of immediately delivering a 0x200
> >>>>> interrupt to guest upon machine check exception in
> >>>>> guest address. Exiting the guest enables QEMU to build
> >>>>> error log and deliver machine check exception to guest
> >>>>> OS (either via guest OS registered machine check
> >>>>> handler or via 0x200 guest OS interrupt vector).
> >>>>>
> >>>>> This approach simplifies the delivering of machine
> >>>>> check exception to guest OS compared to the earlier approach
> >>>>> of KVM directly invoking 0x200 guest interrupt vector.
> >>>>> In the earlier approach QEMU patched the 0x200 interrupt
> >>>>> vector during boot. The patched code at 0x200 issued a
> >>>>> private hcall to pass the control to QEMU to build the
> >>>>> error log.
> >>>>>
> >>>>> This design/approach is based on the feedback for the
> >>>>> QEMU patches to handle machine check exception. Details
> >>>>> of earlier approach of handling machine check exception
> >>>>> in QEMU and related discussions can be found at:
> >>>>>
> >>>>> https://lists.nongnu.org/archive/html/qemu-devel/2014-11/msg00813.html
> >>>>
> >>>> I've poked at the MCE code, but not the KVM MCE code, so I may be
> >>>> mistaken here, but I'm not clear on how this handles errors that the
> >>>> guest can recover without terminating.
> >>>>
> >>>> For example, a Linux guest can handle a UE in guest userspace by killing
> >>>> the guest process. A hypothetical non-Linux guest with a microkernel
> >>>> could even survive UEs in drivers.
> >>>>
> >>>> It sounds from your patch like you're changing this behaviour. Is this
> >>>> right?
> >>>
> >>> So, IIUC.  Once the qemu pieces are in place as well it shouldn't
> >>> change this behaviour: KVM will exit to qemu, qemu will log the error
> >>> information (new), then reinject the MC to the guest which can still
> >>> handle it as you describe above.
> >>
> >> Yes. With KVM and QEMU both in place this will not change the behavior.
> >> QEMU will inject the UE to guest and the guest handles the UE based on
> >> where it occurred. For example if an UE happens in a guest process
> >> address space, that process will be killed.
> >>
> >>>
> >>> But, there could be a problem if you have a new kernel with an old
> >>> qemu, in that case qemu might not understand the new exit type and
> >>> treat it as a fatal error, even though the guest could actually cope
> >>> with it.
> >>
> >> In case of new kernel and old QEMU, the guest terminates as old QEMU
> >> does not understand the NMI exit reason. However, this is the case with
> >> old kernel and old QEMU as they do not handle UE belonging to guest. The
> >> difference is that the guest kernel terminates with different error
> >> code.
> > 
> > Ok.. assuming the guest has code to handle the UE in 0x200, why would
> > the guest terminate with old kernel and old qemu?  I haven't quite
> > followed the logic.
> 
> I overlooked it. I think I need to take into consideration whether guest
> issued "ibm,nmi-register". If the guest has issued "ibm,nmi-register"
> then we should not jump to 0x200 upon UE. With the old kernel and old
> QEMU this is broken as we always jump to 0x200.
> 
> However, if the guest has not issued "ibm,nmi-register" then upon UE we
> should jump to 0x200. If new kernel is used with old QEMU this
> functionality breaks as it causes guest to terminate with unhandled NMI
> exit.
> 
> So thinking whether qemu should explicitly enable the new NMI
> behavior.

So, I think the reasoning above tends towards having qemu control the
MC

Re: [PATCH] KVM: PPC: Exit guest upon fatal machine check exception

2015-11-11 Thread David Gibson
On Thu, Nov 12, 2015 at 10:02:10AM +0530, Aravinda Prasad wrote:
> 
> 
> On Thursday 12 November 2015 09:08 AM, David Gibson wrote:
> > On Thu, Nov 12, 2015 at 01:24:19PM +1100, Daniel Axtens wrote:
> >> Aravinda Prasad <aravi...@linux.vnet.ibm.com> writes:
> >>
> >>> This patch modifies KVM to cause a guest exit with
> >>> KVM_EXIT_NMI instead of immediately delivering a 0x200
> >>> interrupt to guest upon machine check exception in
> >>> guest address. Exiting the guest enables QEMU to build
> >>> error log and deliver machine check exception to guest
> >>> OS (either via guest OS registered machine check
> >>> handler or via 0x200 guest OS interrupt vector).
> >>>
> >>> This approach simplifies the delivering of machine
> >>> check exception to guest OS compared to the earlier approach
> >>> of KVM directly invoking 0x200 guest interrupt vector.
> >>> In the earlier approach QEMU patched the 0x200 interrupt
> >>> vector during boot. The patched code at 0x200 issued a
> >>> private hcall to pass the control to QEMU to build the
> >>> error log.
> >>>
> >>> This design/approach is based on the feedback for the
> >>> QEMU patches to handle machine check exception. Details
> >>> of earlier approach of handling machine check exception
> >>> in QEMU and related discussions can be found at:
> >>>
> >>> https://lists.nongnu.org/archive/html/qemu-devel/2014-11/msg00813.html
> >>
> >> I've poked at the MCE code, but not the KVM MCE code, so I may be
> >> mistaken here, but I'm not clear on how this handles errors that the
> >> guest can recover without terminating.
> >>
> >> For example, a Linux guest can handle a UE in guest userspace by killing
> >> the guest process. A hypothetical non-Linux guest with a microkernel
> >> could even survive UEs in drivers.
> >>
> >> It sounds from your patch like you're changing this behaviour. Is this
> >> right?
> > 
> > So, IIUC.  Once the qemu pieces are in place as well it shouldn't
> > change this behaviour: KVM will exit to qemu, qemu will log the error
> > information (new), then reinject the MC to the guest which can still
> > handle it as you describe above.
> 
> Yes. With KVM and QEMU both in place this will not change the behavior.
> QEMU will inject the UE to guest and the guest handles the UE based on
> where it occurred. For example if an UE happens in a guest process
> address space, that process will be killed.
> 
> > 
> > But, there could be a problem if you have a new kernel with an old
> > qemu, in that case qemu might not understand the new exit type and
> > treat it as a fatal error, even though the guest could actually cope
> > with it.
> 
> In case of new kernel and old QEMU, the guest terminates as old QEMU
> does not understand the NMI exit reason. However, this is the case with
> old kernel and old QEMU as they do not handle UE belonging to guest. The
> difference is that the guest kernel terminates with different error
> code.

Ok.. assuming the guest has code to handle the UE in 0x200, why would
the guest terminate with old kernel and old qemu?  I haven't quite
followed the logic.

> 
> old kernel and old QEMU -> guest panics [1] irrespective of where UE
>happened in guest address space.
> old kernel and new QEMU -> guest panics. same as above.
> new kernel and old QEMU -> guest terminates with unhandled NMI error
>irrespective of where UE happened in guest
> new kernel and new QEMU -> guest handles UEs in process address space
>by killing the process. guest terminates
>for UEs in guest kernel address space.
> 
> [1] https://lists.ozlabs.org/pipermail/linuxppc-dev/2014-June/118329.html
> 
> > 
> > Aravinda, do we need to change this so that qemu has to explicitly
> > enable the new NMI behaviour?  Or have I missed something that will
> > make that case work already.
> 
> I think we don't need to explicitly enable the new behavior. With new
> kernel and new QEMU this should just work. As mentioned above this is
> already broken for old kernel/QEMU. Any thoughts?
> 
> Regards,
> Aravinda
> 
> > 
> > 
> > 
> > ___
> > Linuxppc-dev mailing list
> > linuxppc-...@lists.ozlabs.org
> > https://lists.ozlabs.org/listinfo/linuxppc-dev
> > 
> 

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson




Re: [PATCH] KVM: PPC: Exit guest upon fatal machine check exception

2015-11-11 Thread David Gibson
et later based on trap). For handled errors
> +  * (non-fatal), go back to guest execution with current HSRR0
> +  * instead of exiting the guest. This approach will cause
> +  * the guest to exit in case of fatal machine check error.
>*/
> - ld  r10, VCPU_PC(r9)
> + bne 2f  /* Continue guest execution? */
> + /* If not, exit the guest. SRR0/1 are already set */
> + b   mc_cont
> +2:   ld  r10, VCPU_PC(r9)
>   ld  r11, VCPU_MSR(r9)
> - bne 2f  /* Continue guest execution. */
> - /* If not, deliver a machine check.  SRR0/1 are already set */
> - li  r10, BOOK3S_INTERRUPT_MACHINE_CHECK
> - ld  r11, VCPU_MSR(r9)
> - bl  kvmppc_msr_interrupt
> -2:   b   fast_interrupt_c_return
> + b   fast_interrupt_c_return
>  
>  /*
>   * Check the reason we woke from nap, and take appropriate action.
> 

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson




Re: [PATCH] KVM: PPC: Exit guest upon fatal machine check exception

2015-11-11 Thread David Gibson
On Thu, Nov 12, 2015 at 01:24:19PM +1100, Daniel Axtens wrote:
> Aravinda Prasad <aravi...@linux.vnet.ibm.com> writes:
> 
> > This patch modifies KVM to cause a guest exit with
> > KVM_EXIT_NMI instead of immediately delivering a 0x200
> > interrupt to guest upon machine check exception in
> > guest address. Exiting the guest enables QEMU to build
> > error log and deliver machine check exception to guest
> > OS (either via guest OS registered machine check
> > handler or via 0x200 guest OS interrupt vector).
> >
> > This approach simplifies the delivering of machine
> > check exception to guest OS compared to the earlier approach
> > of KVM directly invoking 0x200 guest interrupt vector.
> > In the earlier approach QEMU patched the 0x200 interrupt
> > vector during boot. The patched code at 0x200 issued a
> > private hcall to pass the control to QEMU to build the
> > error log.
> >
> > This design/approach is based on the feedback for the
> > QEMU patches to handle machine check exception. Details
> > of earlier approach of handling machine check exception
> > in QEMU and related discussions can be found at:
> >
> > https://lists.nongnu.org/archive/html/qemu-devel/2014-11/msg00813.html
> 
> I've poked at the MCE code, but not the KVM MCE code, so I may be
> mistaken here, but I'm not clear on how this handles errors that the
> guest can recover without terminating.
> 
> For example, a Linux guest can handle a UE in guest userspace by killing
> the guest process. A hypothetical non-Linux guest with a microkernel
> could even survive UEs in drivers.
> 
> It sounds from your patch like you're changing this behaviour. Is this
> right?

So, IIUC.  Once the qemu pieces are in place as well it shouldn't
change this behaviour: KVM will exit to qemu, qemu will log the error
information (new), then reinject the MC to the guest which can still
handle it as you describe above.

But, there could be a problem if you have a new kernel with an old
qemu, in that case qemu might not understand the new exit type and
treat it as a fatal error, even though the guest could actually cope
with it.

Aravinda, do we need to change this so that qemu has to explicitly
enable the new NMI behaviour?  Or have I missed something that will
make that case work already.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson




Re: [PATCH] KVM: PPC: Book3S HV: Synthesize segment fault if SLB lookup fails

2015-11-03 Thread David Gibson
On Tue, Oct 27, 2015 at 04:13:56PM +1100, Paul Mackerras wrote:
> When handling a hypervisor data or instruction storage interrupt (HDSI
> or HISI), we look up the SLB entry for the address being accessed in
> order to translate the effective address to a virtual address which can
> be looked up in the guest HPT.  This lookup can occasionally fail due
> to the guest replacing an SLB entry without invalidating the evicted
> SLB entry.  In this situation an ERAT (effective to real address
> translation cache) entry can persist and be used by the hardware even
> though there is no longer a corresponding SLB entry.
> 
> Previously we would just deliver a data or instruction storage interrupt
> (DSI or ISI) to the guest in this case.  However, this is not correct
> and has been observed to cause guests to crash, typically with a
> data storage protection interrupt on a store to the vmemmap area.
> 
> Instead, what we do now is to synthesize a data or instruction segment
> interrupt.  That should cause the guest to reload an appropriate entry
> into the SLB and retry the faulting instruction.  If it still faults,
> we should find an appropriate SLB entry next time and be able to handle
> the fault.
> 
> Signed-off-by: Paul Mackerras <pau...@samba.org>

Reviewed-by: David Gibson <da...@gibson.dropbear.id.au>

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson




Re: [Qemu-devel] [PATCH v4] ppc/spapr: Implement H_RANDOM hypercall in QEMU

2015-09-21 Thread David Gibson
On Mon, Sep 21, 2015 at 10:37:28AM +0200, Greg Kurz wrote:
> On Mon, 21 Sep 2015 10:26:52 +0200
> Thomas Huth <th...@redhat.com> wrote:
> 
> > On 21/09/15 10:01, Greg Kurz wrote:
> > > On Mon, 21 Sep 2015 12:10:00 +1000
> > > David Gibson <da...@gibson.dropbear.id.au> wrote:
> > > 
> > >> On Fri, Sep 18, 2015 at 11:05:52AM +0200, Greg Kurz wrote:
> > >>> On Thu, 17 Sep 2015 10:49:41 +0200
> > >>> Thomas Huth <th...@redhat.com> wrote:
> > >>>
> > >>>> The PAPR interface defines a hypercall to pass high-quality
> > >>>> hardware generated random numbers to guests. Recent kernels can
> > >>>> already provide this hypercall to the guest if the right hardware
> > >>>> random number generator is available. But in case the user wants
> > >>>> to use another source like EGD, or QEMU is running with an older
> > >>>> kernel, we should also have this call in QEMU, so that guests that
> > >>>> do not support virtio-rng yet can get good random numbers, too.
> > >>>>
> > >>>> This patch now adds a new pseudo-device to QEMU that either
> > >>>> directly provides this hypercall to the guest or is able to
> > >>>> enable the in-kernel hypercall if available.
> > ...
> > >>> It is a good thing that the user can choose between in-kernel and 
> > >>> backend,
> > >>> and this patch does the work.
> > >>>
> > >>> This being said, I am not sure about the use case where a user has a 
> > >>> hwrng
> > >>> capable platform and wants to run guests without any hwrng support at 
> > >>> all is
> > >>> an appropriate default behavior... I guess we will find more users that 
> > >>> want
> > >>> in-kernel being the default if it is available.
> > >>>
> > >>> The patch below modifies yours to do just this: the pseudo-device is 
> > >>> only
> > >>> created if hwrng is present and not already created.
> > >>
> > >> I have mixed feelings about this.  On the one hand, I agree that it
> > >> would be nice to allow H_RANDOM support by default.  On the other hand
> > >> the patch below leaves no way to turn it off for testing purposes.  It
> > >> also adds another place where the guest hardware depends on the host
> > >> configuration, which adds to the already substantial mess of ensuring
> > >> that source and destination hardware configuration matches for
> > >> migration.
> > > 
> > > Yeah, describing the guest hw is really essential for migration... this
> > > is best addressed at the libvirt level with a full XML description of
> > > the machine... but FWIW if we are talking about running pseries on a
> > > POWER8 or newer host, I am not aware about "hwrng-less" boards... but
> > > I am probably missing something :)
> > 
> > Maybe it would be at least ok to enable it by default as long as
> > "-nodefaults" has not been specified as command line option?

I like that in principle, but the -nodefaults option isn't exposed
outside vl.c

> It makes a lot of sense indeed. I guess David should take your patch
> as it is now and the default behavior could be a follow up.

That's the plan.  I've already taken the base patch.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson




Re: [PATCH] KVM: PPC: Take the kvm->srcu lock in kvmppc_h_logical_ci_load/store()

2015-09-20 Thread David Gibson
On Fri, Sep 18, 2015 at 08:57:28AM +0200, Thomas Huth wrote:
> Access to the kvm->buses (like with the kvm_io_bus_read() and -write()
> functions) has to be protected via the kvm->srcu lock.
> The kvmppc_h_logical_ci_load() and -store() functions are missing
> this lock so far, so let's add it there, too.
> This fixes the problem that the kernel reports "suspicious RCU usage"
> when lock debugging is enabled.
> 
> Fixes: 99342cf8044420eebdf9297ca03a14cb6a7085a1
> Signed-off-by: Thomas Huth <th...@redhat.com>

Nice catch.  Looks like I missed this because the places
kvm_io_bus_{read,write}() are called on x86 are buried about 5 layers
below where the srcu lock is taken :/.

Reviewed-by: David Gibson <da...@gibson.dropbear.id.au>

> ---
>  arch/powerpc/kvm/book3s.c | 6 ++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
> index d75bf32..096e5eb 100644
> --- a/arch/powerpc/kvm/book3s.c
> +++ b/arch/powerpc/kvm/book3s.c
> @@ -828,12 +828,15 @@ int kvmppc_h_logical_ci_load(struct kvm_vcpu *vcpu)
>   unsigned long size = kvmppc_get_gpr(vcpu, 4);
>   unsigned long addr = kvmppc_get_gpr(vcpu, 5);
>   u64 buf;
> + int srcu_idx;
>   int ret;
>  
>   if (!is_power_of_2(size) || (size > sizeof(buf)))
>   return H_TOO_HARD;
>  
> + srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
>   ret = kvm_io_bus_read(vcpu, KVM_MMIO_BUS, addr, size, &buf);
> + srcu_read_unlock(&vcpu->kvm->srcu, srcu_idx);
>   if (ret != 0)
>   return H_TOO_HARD;
>  
> @@ -868,6 +871,7 @@ int kvmppc_h_logical_ci_store(struct kvm_vcpu *vcpu)
>   unsigned long addr = kvmppc_get_gpr(vcpu, 5);
>   unsigned long val = kvmppc_get_gpr(vcpu, 6);
>   u64 buf;
> + int srcu_idx;
>   int ret;
>  
>   switch (size) {
> @@ -891,7 +895,9 @@ int kvmppc_h_logical_ci_store(struct kvm_vcpu *vcpu)
>   return H_TOO_HARD;
>   }
>  
> + srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
>   ret = kvm_io_bus_write(vcpu, KVM_MMIO_BUS, addr, size, &buf);
> + srcu_read_unlock(&vcpu->kvm->srcu, srcu_idx);
>   if (ret != 0)
>   return H_TOO_HARD;
>  
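The locking pattern the patch applies, taking the SRCU read-side lock around every kvm_io_bus access, can be sketched as a self-contained user-space mock. This is illustrative only; the `mock_*` names are not kernel APIs, and the real SRCU primitives do considerably more work:

```c
#include <assert.h>
#include <string.h>

/* User-space mock of the SRCU read-side pattern the patch applies:
 * every kvm_io_bus_read()/write() call must sit inside an SRCU
 * read-side critical section. */
static int srcu_nesting;                     /* read-side nesting depth */

static int mock_srcu_read_lock(void)
{
    return ++srcu_nesting;                   /* real srcu_read_lock() returns an index */
}

static void mock_srcu_read_unlock(int idx)
{
    assert(idx == srcu_nesting);
    --srcu_nesting;
}

/* Stand-in for kvm_io_bus_read(): only legal inside the critical section. */
static int mock_io_bus_read(unsigned long addr, unsigned long size, void *buf)
{
    (void)addr;
    assert(srcu_nesting > 0);                /* the "suspicious RCU usage" check */
    memset(buf, 0xAB, size);                 /* pretend the device returned data */
    return 0;
}

/* Mirrors the fixed kvmppc_h_logical_ci_load() control flow. */
static int logical_ci_load(unsigned long addr, unsigned long size, unsigned long *val)
{
    unsigned char buf[8];
    int srcu_idx, ret;

    if (size > sizeof(buf))
        return -1;                           /* H_TOO_HARD in the real code */

    srcu_idx = mock_srcu_read_lock();
    ret = mock_io_bus_read(addr, size, buf);
    mock_srcu_read_unlock(srcu_idx);
    if (ret != 0)
        return -1;

    *val = buf[0];                           /* the real code assembles all bytes */
    return 0;
}
```

The shape matters more than the mock: the lock index returned by the lock call must be handed back to the matching unlock, and the bus access must not escape the critical section.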

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson




Re: [PATCH 0/2] VFIO: Accept IOMMU group (PE) ID

2015-09-20 Thread David Gibson
On Sat, Sep 19, 2015 at 04:22:47PM +1000, David Gibson wrote:
> On Fri, Sep 18, 2015 at 09:47:32AM -0600, Alex Williamson wrote:
> > On Fri, 2015-09-18 at 16:24 +1000, Gavin Shan wrote:
> > > This allows accepting an IOMMU group (PE) ID in the parameter from userland
> > > when handling EEH operation so that the operation only affects the target
> > > IOMMU group (PE). If the IOMMU group (PE) ID in the parameter from 
> > > userland
> > > is invalid, all IOMMU groups (PEs) attached to the specified container are
> > > affected as before.
> > > 
> > > Gavin Shan (2):
> > >   drivers/vfio: Support EEH API revision
> > >   drivers/vfio: Support IOMMU group for EEH operations
> > > 
> > >  drivers/vfio/vfio_iommu_spapr_tce.c | 50 ++---
> > >  drivers/vfio/vfio_spapr_eeh.c   | 46 ++
> > >  include/linux/vfio.h| 13 +++---
> > >  include/uapi/linux/vfio.h   |  6 +
> > >  4 files changed, 93 insertions(+), 22 deletions(-)
> > 
> > This interface is terrible.  A function named foo_enabled() should
> > return a bool, yes or no, don't try to overload it to also return a
> > version.
> 
> Sorry, that one's my fault.  I suggested that approach to Gavin
> without really thinking it through.
> 
> 
> > AFAICT, patch 2/2 breaks current users by changing the offset
> > of the union in struct vfio_eeh_pe_err.
> 
> Yeah, this one's ugly.  We have to preserve the offset, but that means
> putting the group in a very awkward place.  Especially since I'm not
> sure if there even are any existing users of the single extant union
> branch.
> 
> Sigh.
> 
> > Also, we generally pass group
> > file descriptors rather than a group ID because we can prove the
> > ownership of the group through the file descriptor and we don't need to
> > worry about races with the group because we can hold a reference to it.

Duh.  I finally realised the better, simpler, obvious solution.

Rather than changing the parameter structure, we should move the
ioctl()s so they're on the group fd instead of the container fd.

Obviously we need to keep it on the container fd for backwards compat,
but I think we should just error out if there is more than one group
in the container there.

We will need a new capability too, obviously.  VFIO_EEH_GROUPFD maybe?

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson




Re: [PATCH v4] ppc/spapr: Implement H_RANDOM hypercall in QEMU

2015-09-20 Thread David Gibson
On Thu, Sep 17, 2015 at 10:49:41AM +0200, Thomas Huth wrote:
> The PAPR interface defines a hypercall to pass high-quality
> hardware generated random numbers to guests. Recent kernels can
> already provide this hypercall to the guest if the right hardware
> random number generator is available. But in case the user wants
> to use another source like EGD, or QEMU is running with an older
> kernel, we should also have this call in QEMU, so that guests that
> do not support virtio-rng yet can get good random numbers, too.
> 
> This patch now adds a new pseudo-device to QEMU that either
> directly provides this hypercall to the guest or is able to
> enable the in-kernel hypercall if available. The in-kernel
> hypercall can be enabled with the use-kvm property, e.g.:
> 
>  qemu-system-ppc64 -device spapr-rng,use-kvm=true
> 
> For handling the hypercall in QEMU instead, a "RngBackend" is
> required since the hypercall should provide "good" random data
> instead of pseudo-random (like from a "simple" library function
> like rand() or g_random_int()). Since there are multiple RngBackends
> available, the user must select an appropriate back-end via the
> "rng" property of the device, e.g.:
> 
>  qemu-system-ppc64 -object rng-random,filename=/dev/hwrng,id=gid0 \
>-device spapr-rng,rng=gid0 ...
> 
> See http://wiki.qemu-project.org/Features-Done/VirtIORNG for
> other example of specifying RngBackends.
> 
> Signed-off-by: Thomas Huth <th...@redhat.com>

Thanks, applied to spapr-next.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson




Re: [PATCH v4] ppc/spapr: Implement H_RANDOM hypercall in QEMU

2015-09-20 Thread David Gibson
On Fri, Sep 18, 2015 at 11:05:52AM +0200, Greg Kurz wrote:
> On Thu, 17 Sep 2015 10:49:41 +0200
> Thomas Huth <th...@redhat.com> wrote:
> 
> > The PAPR interface defines a hypercall to pass high-quality
> > hardware generated random numbers to guests. Recent kernels can
> > already provide this hypercall to the guest if the right hardware
> > random number generator is available. But in case the user wants
> > to use another source like EGD, or QEMU is running with an older
> > kernel, we should also have this call in QEMU, so that guests that
> > do not support virtio-rng yet can get good random numbers, too.
> > 
> > This patch now adds a new pseudo-device to QEMU that either
> > directly provides this hypercall to the guest or is able to
> > enable the in-kernel hypercall if available. The in-kernel
> > hypercall can be enabled with the use-kvm property, e.g.:
> > 
> >  qemu-system-ppc64 -device spapr-rng,use-kvm=true
> > 
> > For handling the hypercall in QEMU instead, a "RngBackend" is
> > required since the hypercall should provide "good" random data
> > instead of pseudo-random (like from a "simple" library function
> > like rand() or g_random_int()). Since there are multiple RngBackends
> > available, the user must select an appropriate back-end via the
> > "rng" property of the device, e.g.:
> > 
> >  qemu-system-ppc64 -object rng-random,filename=/dev/hwrng,id=gid0 \
> >-device spapr-rng,rng=gid0 ...
> > 
> > See http://wiki.qemu-project.org/Features-Done/VirtIORNG for
> > other example of specifying RngBackends.
> > 
> > Signed-off-by: Thomas Huth <th...@redhat.com>
> > ---
> 
> It is a good thing that the user can choose between in-kernel and backend,
> and this patch does the work.
> 
> This being said, I am not sure about the use case where a user has a hwrng
> capable platform and wants to run guests without any hwrng support at all is
> an appropriate default behavior... I guess we will find more users that want
> in-kernel being the default if it is available.
> 
> The patch below modifies yours to do just this: the pseudo-device is only
> created if hwrng is present and not already created.

I have mixed feelings about this.  On the one hand, I agree that it
would be nice to allow H_RANDOM support by default.  On the other hand
the patch below leaves no way to turn it off for testing purposes.  It
also adds another place where the guest hardware depends on the host
configuration, which adds to the already substantial mess of ensuring
that source and destination hardware configuration matches for
migration.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson




Re: [PATCH 0/2] VFIO: Accept IOMMU group (PE) ID

2015-09-19 Thread David Gibson
On Fri, Sep 18, 2015 at 09:47:32AM -0600, Alex Williamson wrote:
> On Fri, 2015-09-18 at 16:24 +1000, Gavin Shan wrote:
> > This allows accepting an IOMMU group (PE) ID in the parameter from userland
> > when handling EEH operation so that the operation only affects the target
> > IOMMU group (PE). If the IOMMU group (PE) ID in the parameter from userland
> > is invalid, all IOMMU groups (PEs) attached to the specified container are
> > affected as before.
> > 
> > Gavin Shan (2):
> >   drivers/vfio: Support EEH API revision
> >   drivers/vfio: Support IOMMU group for EEH operations
> > 
> >  drivers/vfio/vfio_iommu_spapr_tce.c | 50 ++---
> >  drivers/vfio/vfio_spapr_eeh.c   | 46 ++
> >  include/linux/vfio.h| 13 +++---
> >  include/uapi/linux/vfio.h   |  6 +
> >  4 files changed, 93 insertions(+), 22 deletions(-)
> 
> This interface is terrible.  A function named foo_enabled() should
> return a bool, yes or no, don't try to overload it to also return a
> version.

Sorry, that one's my fault.  I suggested that approach to Gavin
without really thinking it through.


> AFAICT, patch 2/2 breaks current users by changing the offset
> of the union in struct vfio_eeh_pe_err.

Yeah, this one's ugly.  We have to preserve the offset, but that means
putting the group in a very awkward place.  Especially since I'm not
sure if there even are any existing users of the single extant union
branch.

Sigh.

> Also, we generally pass group
> file descriptors rather than a group ID because we can prove the
> ownership of the group through the file descriptor and we don't need to
> worry about races with the group because we can hold a reference to it.
> 

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson




Re: [PATCH v3] ppc/spapr: Implement H_RANDOM hypercall in QEMU

2015-09-14 Thread David Gibson
On Mon, Sep 14, 2015 at 08:32:36AM +0200, Thomas Huth wrote:
> On 14/09/15 04:15, David Gibson wrote:
> > On Fri, Sep 11, 2015 at 11:17:01AM +0200, Thomas Huth wrote:
> >> The PAPR interface defines a hypercall to pass high-quality
> >> hardware generated random numbers to guests. Recent kernels can
> >> already provide this hypercall to the guest if the right hardware
> >> random number generator is available. But in case the user wants
> >> to use another source like EGD, or QEMU is running with an older
> >> kernel, we should also have this call in QEMU, so that guests that
> >> do not support virtio-rng yet can get good random numbers, too.
> >>
> >> This patch now adds a new pseudo-device to QEMU that either
> >> directly provides this hypercall to the guest or is able to
> >> enable the in-kernel hypercall if available. The in-kernel
> >> hypercall can be enabled with the use-kvm property, e.g.:
> >>
> >>  qemu-system-ppc64 -device spapr-rng,use-kvm=true
> >>
> >> For handling the hypercall in QEMU instead, a RngBackend is required
> >> since the hypercall should provide "good" random data instead of
> >> pseudo-random (like from a "simple" library function like rand()
> >> or g_random_int()). Since there are multiple RngBackends available,
> >> the user must select an appropriate backend via the "backend"
> >> property of the device, e.g.:
> >>
> >>  qemu-system-ppc64 -object rng-random,filename=/dev/hwrng,id=rng0 \
> >>-device spapr-rng,backend=rng0 ...
> >>
> >> See http://wiki.qemu-project.org/Features-Done/VirtIORNG for
> >> other example of specifying RngBackends.
> ...
> >> +
> >> +#include "qemu/error-report.h"
> >> +#include "sysemu/sysemu.h"
> >> +#include "sysemu/device_tree.h"
> >> +#include "sysemu/rng.h"
> >> +#include "hw/ppc/spapr.h"
> >> +#include "kvm_ppc.h"
> >> +
> >> +#define SPAPR_RNG(obj) \
> >> +OBJECT_CHECK(sPAPRRngState, (obj), TYPE_SPAPR_RNG)
> >> +
> >> +typedef struct sPAPRRngState {
> >> +/*< private >*/
> >> +DeviceState ds;
> >> +RngBackend *backend;
> >> +bool use_kvm;
> >> +} sPAPRRngState;
> >> +
> >> +typedef struct HRandomData {
> >> +QemuSemaphore sem;
> >> +union {
> >> +uint64_t v64;
> >> +uint8_t v8[8];
> >> +} val;
> >> +int received;
> >> +} HRandomData;
> >> +
> >> +/* Callback function for the RngBackend */
> >> +static void random_recv(void *dest, const void *src, size_t size)
> >> +{
> >> +HRandomData *hrdp = dest;
> >> +
> >> +if (src && size > 0) {
> >> +assert(size + hrdp->received <= sizeof(hrdp->val.v8));
> >> +memcpy(&hrdp->val.v8[hrdp->received], src, size);
> >> +hrdp->received += size;
> >> +}
> >> +
> >> +qemu_sem_post(&hrdp->sem);
> > 
> > I'm assuming qemu_sem_post() includes the necessary memory barrier to
> > make sure the requesting thread actually sees the data.
> 
> Not sure whether I fully got your point here... both callback function
> and main thread are calling an extern C-function, so the compiler should
> not assume that the memory stays the same in the main thread...?

I'm not talking about a compiler barrier: the callback will likely be
invoked on a different CPU from the vcpu thread that invoked H_RANDOM,
so on a weakly ordered arch like Power we need a real CPU memory barrier.

> Anyway, I've tested the hypercall by implementing it in SLOF and calling
> it a couple of times there to see that all bits in the result behave
> randomly, so for me this is working fine.

Right, I'd be almost certain anyway that qemu_sem_post() (actually
likely the pthreads functions it invokes) will include the necessary
barriers to stop accesses leaking outside the locked region.
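The callback/waiter protocol under discussion can be modeled without QEMU's infrastructure. In this single-threaded sketch the backend invokes the callback synchronously and the semaphore degenerates to a counter, so no barrier is needed; in QEMU the callback may run on another thread, which is exactly why qemu_sem_post()/qemu_sem_wait() must order the memcpy against the reader. The names and the 3-byte chunking below are illustrative:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Single-threaded model of the H_RANDOM callback protocol: the backend may
 * deliver fewer than 8 bytes per callback, so the hypercall handler keeps
 * asking until all 8 bytes have arrived.  "sem" is a plain counter here;
 * the real code uses qemu_sem_post()/qemu_sem_wait(). */
typedef struct {
    int sem;                                  /* stands in for QemuSemaphore */
    union { uint64_t v64; uint8_t v8[8]; } val;
    int received;
} HRandomMock;

static void random_recv(void *dest, const void *src, size_t size)
{
    HRandomMock *h = dest;
    if (src && size > 0) {
        assert(size + h->received <= sizeof(h->val.v8));
        memcpy(&h->val.v8[h->received], src, size);
        h->received += size;
    }
    h->sem++;                                 /* qemu_sem_post() in the real code */
}

/* Mock backend: hands out at most 3 bytes per request from a fixed pool. */
static void backend_request(void (*cb)(void *, const void *, size_t),
                            void *dest, size_t want)
{
    static const uint8_t pool[8] = { 1, 2, 3, 4, 5, 6, 7, 8 };
    size_t chunk = want > 3 ? 3 : want;
    cb(dest, pool + (8 - want), chunk);       /* next unread pool bytes */
}

static uint64_t h_random_mock(void)
{
    HRandomMock h = { .sem = 0, .received = 0 };

    while (h.received < 8) {                  /* real code: qemu_sem_wait() per chunk */
        backend_request(random_recv, &h, 8 - (size_t)h.received);
        assert(h.sem > 0);
        h.sem--;
    }
    return h.val.v64;
}
```

The loop-until-8-bytes structure is the point: a backend is allowed to return short reads, and the handler must keep re-requesting rather than assume one callback fills the buffer.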

> 
> >> +}
> >> +
> >> +/* Handler for the H_RANDOM hypercall */
> >> +static target_ulong h_random(PowerPCCPU *cpu, sPAPRMachineState *spapr,
> >> + target_ulong opcode, target_ulong *args)
> >> +{
> >> +sPAPRRngState *rngstate;
> >> +HRandomData hrdata;
> >> +
> >> +rngstate = SPAPR_RNG(object_resolve_path_type("", TYPE_SPAPR_RNG, NULL));
> >> +

Re: [PATCH v3] ppc/spapr: Implement H_RANDOM hypercall in QEMU

2015-09-13 Thread David Gibson
 false),
> +DEFINE_PROP_END_OF_LIST(),
> +};
> +
> +static void spapr_rng_class_init(ObjectClass *oc, void *data)
> +{
> +DeviceClass *dc = DEVICE_CLASS(oc);
> +
> +dc->realize = spapr_rng_realize;
> +set_bit(DEVICE_CATEGORY_MISC, dc->categories);
> +dc->props = spapr_rng_properties;
> +}
> +
> +static const TypeInfo spapr_rng_info = {
> +.name  = TYPE_SPAPR_RNG,
> +.parent= TYPE_DEVICE,
> +.instance_size = sizeof(sPAPRRngState),
> +.instance_init = spapr_rng_instance_init,
> +.class_init= spapr_rng_class_init,
> +};
> +
> +static void spapr_rng_register_type(void)
> +{
> +type_register_static(&spapr_rng_info);
> +}
> +type_init(spapr_rng_register_type)
> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> index 91a61ab..4e8aa2d 100644
> --- a/include/hw/ppc/spapr.h
> +++ b/include/hw/ppc/spapr.h
> @@ -331,6 +331,7 @@ struct sPAPRMachineState {
>  #define H_SET_MPP   0x2D0
>  #define H_GET_MPP   0x2D4
>  #define H_XIRR_X0x2FC
> +#define H_RANDOM0x300
>  #define H_SET_MODE  0x31C
>  #define MAX_HCALL_OPCODEH_SET_MODE
>  
> @@ -603,10 +604,13 @@ struct sPAPRConfigureConnectorState {
>  void spapr_ccs_reset_hook(void *opaque);
>  
>  #define TYPE_SPAPR_RTC "spapr-rtc"
> +#define TYPE_SPAPR_RNG "spapr-rng"
>  
>  void spapr_rtc_read(DeviceState *dev, struct tm *tm, uint32_t *ns);
>  int spapr_rtc_import_offset(DeviceState *dev, int64_t legacy_offset);
>  
> +int spapr_rng_populate_dt(void *fdt);
> +
>  #define SPAPR_MEMORY_BLOCK_SIZE (1 << 28) /* 256MB */
>  
>  #endif /* !defined (__HW_SPAPR_H__) */
> diff --git a/target-ppc/kvm.c b/target-ppc/kvm.c
> index 110436d..42f66fe 100644
> --- a/target-ppc/kvm.c
> +++ b/target-ppc/kvm.c
> @@ -2484,3 +2484,12 @@ int kvm_arch_msi_data_to_gsi(uint32_t data)
>  {
>  return data & 0x;
>  }
> +
> +int kvmppc_enable_hwrng(void)
> +{
> +if (!kvm_enabled() || !kvm_check_extension(kvm_state, KVM_CAP_PPC_HWRNG)) {
> +return -1;
> +}
> +
> +return kvmppc_enable_hcall(kvm_state, H_RANDOM);
> +}
> diff --git a/target-ppc/kvm_ppc.h b/target-ppc/kvm_ppc.h
> index 4d30e27..68836b4 100644
> --- a/target-ppc/kvm_ppc.h
> +++ b/target-ppc/kvm_ppc.h
> @@ -55,6 +55,7 @@ void kvmppc_hash64_free_pteg(uint64_t token);
>  void kvmppc_hash64_write_pte(CPUPPCState *env, target_ulong pte_index,
>   target_ulong pte0, target_ulong pte1);
>  bool kvmppc_has_cap_fixup_hcalls(void);
> +int kvmppc_enable_hwrng(void);
>  
>  #else
>  
> @@ -248,6 +249,10 @@ static inline bool kvmppc_has_cap_fixup_hcalls(void)
>  abort();
>  }
>  
> +static inline int kvmppc_enable_hwrng(void)
> +{
> +return -1;
> +}
>  #endif
>  
>  #ifndef CONFIG_KVM

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson




Re: [PATCH 2/2] KVM: PPC: Book3S HV: Exit on H_DOORBELL only if HOST_IPI is set

2015-09-02 Thread David Gibson
On Thu, Sep 03, 2015 at 03:21:23PM +1000, Paul Mackerras wrote:
> From: "Gautham R. Shenoy" <e...@linux.vnet.ibm.com>
> 
> The code that handles the case when we receive a H_DOORBELL interrupt
> has a comment which says "Hypervisor doorbell - exit only if host IPI
> flag set".  However, the current code does not actually check if the
> host IPI flag is set.  This is due to a comparison instruction that
> got missed.
> 
> As a result, the current code performs the exit to host only
> if some sibling thread or a sibling sub-core is exiting to the
> host.  This implies that, an IPI sent to a sibling core in
> (subcores-per-core != 1) mode will be missed by the host unless the
> sibling core is on the exit path to the host.
> 
> This patch adds the missing comparison operation which will ensure
> that when HOST_IPI flag is set, we unconditionally exit to the host.
> 
> Fixes: 66feed61cdf6
> Cc: sta...@vger.kernel.org # v4.1+
> Signed-off-by: Gautham R. Shenoy <e...@linux.vnet.ibm.com>
> Signed-off-by: Paul Mackerras <pau...@samba.org>

Reviewed-by: David Gibson <da...@gibson.dropbear.id.au>

> ---
>  arch/powerpc/kvm/book3s_hv_rmhandlers.S | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S 
> b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> index b07f045..2273dca 100644
> --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> @@ -1213,6 +1213,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_HAS_PPR)
>   cmpwi   r12, BOOK3S_INTERRUPT_H_DOORBELL
>   bne     3f
>   lbz r0, HSTATE_HOST_IPI(r13)
> + cmpwi   r0, 0
>   beq 4f
>   b   guest_exit_cont
>  3:

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson




Re: [PATCH 1/2] KVM: PPC: Book3S HV: Fix race in starting secondary threads

2015-09-02 Thread David Gibson
On Thu, Sep 03, 2015 at 03:20:50PM +1000, Paul Mackerras wrote:
> From: "Gautham R. Shenoy" <e...@linux.vnet.ibm.com>
> 
> The current dynamic micro-threading code has a race due to which a
> secondary thread naps when it is supposed to be running a vcpu. As a
> side effect of this, on a guest exit, the primary thread in
> kvmppc_wait_for_nap() finds that this secondary thread hasn't cleared
> its vcore pointer. This results in "CPU X seems to be stuck!"
> warnings.
> 
> The race is possible since the primary thread on exiting the guests
> only waits for all the secondaries to clear its vcore pointer. It
> subsequently expects the secondary threads to enter nap while it
> unsplits the core. A secondary thread which hasn't yet entered the nap
> will loop in kvm_no_guest until its vcore pointer and the do_nap flag
> are unset. Once the core has been unsplit, a new vcpu thread can grab
> the core and set the do_nap flag *before* setting the vcore pointers
> of the secondary. As a result, the secondary thread will now enter nap
> via kvm_unsplit_nap instead of running the guest vcpu.
> 
> Fix this by setting the do_nap flag after setting the vcore pointer in
> the PACA of the secondary in kvmppc_run_core. Also, ensure that a
> secondary thread doesn't nap in kvm_unsplit_nap when the vcore pointer
> in its PACA struct is set.
> 
> Fixes: b4deba5c41e9
> Signed-off-by: Gautham R. Shenoy <e...@linux.vnet.ibm.com>
> Signed-off-by: Paul Mackerras <pau...@samba.org>
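The publication ordering described in the quoted commit message can be reduced to a toy single-observer model (illustrative only, not the kernel code): if the primary publishes do_nap before the secondary's vcore pointer, a secondary that samples its PACA between the two stores takes the nap path instead of running the vcpu.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the publication order fixed by this patch.  The secondary
 * in kvm_no_guest loops until it sees either its vcore pointer (run the
 * vcpu) or the do_nap flag (go to nap).  Whichever the primary publishes
 * first decides what a secondary observing in between will do. */
enum action { RUN_VCPU, NAP };

struct paca_mock {
    bool vcore_set;                /* stands in for the vcore pointer in the PACA */
    bool do_nap;
};

static enum action race_outcome(bool nap_flag_published_first)
{
    struct paca_mock p = { false, false };

    /* The primary publishes its first field... */
    if (nap_flag_published_first)
        p.do_nap = true;           /* buggy order: do_nap before the vcore pointer */
    else
        p.vcore_set = true;        /* fixed order: vcore pointer before do_nap */

    /* ...and the secondary happens to sample its PACA right now. */
    if (p.vcore_set)
        return RUN_VCPU;
    if (p.do_nap)
        return NAP;
    return RUN_VCPU;               /* neither visible yet: keep looping (elided) */
}
```

With the buggy order the model naps; with the fixed order it runs the vcpu, which is the behavior change the patch makes.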

Reviewed-by: David Gibson <da...@gibson.dropbear.id.au>
-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson




Re: [kvm-unit-tests PATCH 11/14] powerpc/ppc64: add rtas_power_off

2015-08-04 Thread David Gibson
On Tue, Aug 04, 2015 at 09:54:44AM +0200, Andrew Jones wrote:
 On Tue, Aug 04, 2015 at 02:11:30PM +1000, David Gibson wrote:
  On Mon, Aug 03, 2015 at 04:41:28PM +0200, Andrew Jones wrote:
   Add enough RTAS support to support power-off, and apply it to
   exit().
   
   Signed-off-by: Andrew Jones drjo...@redhat.com
   ---
lib/powerpc/asm/rtas.h  | 27 
lib/powerpc/io.c|  3 ++
 lib/powerpc/rtas.c  | 84 +
lib/ppc64/asm/rtas.h|  1 +
powerpc/Makefile.common |  1 +
powerpc/cstart64.S  |  9 ++
6 files changed, 125 insertions(+)
create mode 100644 lib/powerpc/asm/rtas.h
create mode 100644 lib/powerpc/rtas.c
create mode 100644 lib/ppc64/asm/rtas.h
   
   diff --git a/lib/powerpc/asm/rtas.h b/lib/powerpc/asm/rtas.h
   new file mode 100644
   index 0..53ce126c69a42
   --- /dev/null
   +++ b/lib/powerpc/asm/rtas.h
   @@ -0,0 +1,27 @@
   +#ifndef _ASMPOWERPC_RTAS_H_
   +#define _ASMPOWERPC_RTAS_H_
   +/*
   + * Copyright (C) 2015, Red Hat Inc, Andrew Jones drjo...@redhat.com
   + *
   + * This work is licensed under the terms of the GNU LGPL, version 2.
   + */
+#include <libcflat.h>
   +
   +#define RTAS_UNKNOWN_SERVICE (-1)
   +
   +struct rtas_args {
   + u32 token;
   + u32 nargs;
   + u32 nret;
   + u32 args[16];
   + u32 *rets;
   +};
   +
   +extern void rtas_init(void);
   +extern void enter_rtas(unsigned long args);
   +extern int rtas_token(const char *service);
   +extern int rtas_call(int token, int nargs, int nret, int *outputs, ...);
   +
   +extern void rtas_power_off(void);
   +
   +#endif /* _ASMPOWERPC_RTAS_H_ */
   diff --git a/lib/powerpc/io.c b/lib/powerpc/io.c
   index a7eaafeca9205..25cdd0ad961ee 100644
   --- a/lib/powerpc/io.c
   +++ b/lib/powerpc/io.c
   @@ -7,6 +7,7 @@
 */
 #include <libcflat.h>
 #include <asm/spinlock.h>
    +#include <asm/rtas.h>

extern void halt(int code);
extern void putchar(int c);
   @@ -15,6 +16,7 @@ static struct spinlock uart_lock;

void io_init(void)
{
   + rtas_init();
}

void puts(const char *s)
   @@ -27,5 +29,6 @@ void puts(const char *s)

void exit(int code)
{
   + rtas_power_off();
 halt(code);
}
   diff --git a/lib/powerpc/rtas.c b/lib/powerpc/rtas.c
   new file mode 100644
   index 0..39f7d4428ee77
   --- /dev/null
   +++ b/lib/powerpc/rtas.c
   @@ -0,0 +1,84 @@
   +/*
   + * powerpc RTAS
   + *
   + * Copyright (C) 2015, Red Hat Inc, Andrew Jones drjo...@redhat.com
   + *
   + * This work is licensed under the terms of the GNU LGPL, version 2.
   + */
    +#include <libcflat.h>
    +#include <libfdt/libfdt.h>
    +#include <devicetree.h>
    +#include <asm/spinlock.h>
    +#include <asm/io.h>
    +#include <asm/rtas.h>
   +
   +static struct spinlock lock;
   +struct rtas_args rtas_args;
   +static int rtas_node;
   +
   +void rtas_init(void)
   +{
   + if (!dt_available()) {
    + printf("%s: No device tree!\n", __func__);
   + abort();
   + }
   +
    + rtas_node = fdt_path_offset(dt_fdt(), "/rtas");
    + if (rtas_node < 0) {
    + printf("%s: /rtas node: %s\n", __func__,
    + fdt_strerror(rtas_node));
   + abort();
   + }
   +}
   +
   +int rtas_token(const char *service)
   +{
   + const struct fdt_property *prop;
   + u32 *token;
   +
   + prop = fdt_get_property(dt_fdt(), rtas_node, service, NULL);
  
  Oh.. one other thing here.
  
  This is only safe if you never alter the device tree blob you're given
  (or at least, not after rtas_init()).  That may well be the case, but
  I don't know your code well enough to be sure.
  
  Otherwise, the rtas node could get moved by other dt changes, meaning
  the offset stored in rtas_node is no longer valid.
 
 I hadn't planned on modifying the DT, but if you can conceive of a test
 they may want to, then we should do this right.

That's probably ok then, as long as it's made clear that you can't
mess with the DT.  Or if tests might want to modify it, you could give
them their own copy of it.

 I guess doing it right
 just means to hunt down rtas_node each time, that's easy.

That's right.

Not having persistent handles is a bit of a pain, but it's the
necessary tradeoff to work with the tree in flattened form without
having some complicated pile of context state to go with it.
(Attempting to remind people that these aren't persistent handles is,
incidentally, why so many of the libfdt functions explicitly mention
offset in their names).

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson




Re: [kvm-unit-tests PATCH 11/14] powerpc/ppc64: add rtas_power_off

2015-08-04 Thread David Gibson
On Tue, Aug 04, 2015 at 09:47:59AM +0200, Andrew Jones wrote:
 On Tue, Aug 04, 2015 at 02:09:52PM +1000, David Gibson wrote:
  On Mon, Aug 03, 2015 at 07:08:17PM +0200, Paolo Bonzini wrote:
   
   
   On 03/08/2015 16:41, Andrew Jones wrote:
Add enough RTAS support to support power-off, and apply it to
exit().

Signed-off-by: Andrew Jones drjo...@redhat.com
   
   Why not use virtio-mmio + testdev on ppc as well?  Similar to how we're
   not using PSCI on ARM or ACPI on x86.
  
  Strange as it seems, MMIO is actually a PITA for a simple pseries
  guest like this.  Basically, you have to enable the MMU - which
  requires a whole bunch of setup - in order to perform cache-inhibited
  loads and stores, which is what you need for IO.
  
  There are hypercalls to sidestep this (H_LOGICAL_CI_LOAD and
  H_LOGICAL_CI_STORE), but having a hypercall and KVM exit for every IO
  access may be hideously slow.
  
  In early development we did have a hypercall mediated virtio model,
  but it was abandoned once we got PCI working.
 
 So I think by yours and Alex's responses, if we want testdev support
 then we should target using pci to expose it.

Um.. maybe.  I'm not really familiar with these testdevs, so I can't
answer directly.

 I'm ok with that, but
 prefer not to be distracted with it while getting ppc kickstarted.
 So, question for Paolo, are you OK with the exitcode snooper cheat?

If you wanted to add a special hypercall channel for use by the tests
I'd be ok with that too.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson




Re: [kvm-unit-tests PATCH 10/14] powerpc/ppc64: relocate linker VMAs

2015-08-03 Thread David Gibson
: .llong  stackptr
 +p_toc:   .llong  tocptr
 +p_dyn:   .llong  dynamic_start
 +
  .text
  .align 3
  
 diff --git a/powerpc/flat.lds b/powerpc/flat.lds
 index bd075efb2c51b..8a573d27346de 100644
 --- a/powerpc/flat.lds
 +++ b/powerpc/flat.lds
 @@ -6,11 +6,22 @@ SECTIONS
  etext = .;
  .opd : { *(.opd) }
  . = ALIGN(16);
 +.dynamic : {
 +dynamic_start = .;
 +*(.dynamic)
 +}
 +.dynsym : {
 +dynsym_start = .;
 +*(.dynsym)
 +}
 +.rela.dyn : { *(.rela*) }
 +. = ALIGN(16);
  .data : {
  *(.data)
 +*(.data.rel*)
  }
  . = ALIGN(16);
 -.rodata : { *(.rodata) }
 +.rodata : { *(.rodata) *(.rodata.*) }
  . = ALIGN(16);
  .bss : { *(.bss) }
  . = ALIGN(16);
 diff --git a/powerpc/reloc64.c b/powerpc/reloc64.c
 new file mode 100644
 index 0..2804823bdfee3
 --- /dev/null
 +++ b/powerpc/reloc64.c
 @@ -0,0 +1,55 @@
 +/*
 + * relocate R_PPC_RELATIVE RELA entries. Normally this is done in
 + * assembly code to avoid the risk of using absolute addresses before
 + * they're relocated. We use C, but cautiously (no global references).
 + *
 + * Copyright (C) 2015, Red Hat Inc, Andrew Jones drjo...@redhat.com
 + *
 + * This work is licensed under the terms of the GNU LGPL, version 2.
 + */
 +#define DT_NULL  0
 +#define DT_RELA  7
  +#define DT_RELACOUNT 0x6ffffff9
 +#define R_PPC_RELATIVE   22
 +
 +struct elf64_dyn {
 + signed long long tag;
 + unsigned long long val;
 +};
 +
  +#define RELA_GET_TYPE(rela_ptr) ((rela_ptr)->info & 0xffffffff)
 +struct elf64_rela {
 + unsigned long long offset;
 + unsigned long long info;
 + signed long long addend;
 +};
 +
 +void relocate(unsigned long load_addr, struct elf64_dyn *dyn_table)
 +{
 + unsigned long long rela_addr = 0, rela_count = 0, *addr;
 + struct elf64_dyn *d = dyn_table;
 + struct elf64_rela *r;
 +
  + while (d && d->tag != DT_NULL) {
  + if (d->tag == DT_RELA)
  + rela_addr = d->val;
  + else if (d->tag == DT_RELACOUNT)
  + rela_count = d->val;
  + if (rela_addr && rela_count)
  + break;
  + ++d;
 + }
 +
 + if (!rela_addr || !rela_count)
 + return;
 +
 + r = (void *)(rela_addr + load_addr);
 +
 + while (rela_count--) {
 + if (RELA_GET_TYPE(r) == R_PPC_RELATIVE) {
  + addr = (void *)(r->offset + load_addr);
  + *addr = r->addend + load_addr;
 + }
 + ++r;
 + }
 +}

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson




Re: [kvm-unit-tests PATCH 11/14] powerpc/ppc64: add rtas_power_off

2015-08-03 Thread David Gibson
..f4eff7f8eccb4 100644
 --- a/powerpc/Makefile.common
 +++ b/powerpc/Makefile.common
 @@ -39,6 +39,7 @@ cflatobjs += lib/powerpc/io.o
  cflatobjs += lib/powerpc/setup.o
  cflatobjs += lib/powerpc/mmu.o
  cflatobjs += lib/powerpc/smp.o
 +cflatobjs += lib/powerpc/rtas.o
  
  libgcc := $(shell $(CC) $(machine) --print-libgcc-file-name)
  start_addr := $(shell printf %x\n $$(( $(phys_base) + $(kernel_offset) )))
 diff --git a/powerpc/cstart64.S b/powerpc/cstart64.S
 index 8edaaa6e251fc..1c48083efa33c 100644
 --- a/powerpc/cstart64.S
 +++ b/powerpc/cstart64.S
 @@ -8,6 +8,7 @@
  
   #define HVSC .long 0x44000022
   #define H_PUT_TERM_CHAR  0x58
  +#define KVMPPC_H_RTAS 0xf000
  
  #define LOAD_REG_IMMEDIATE(reg,expr) \
   lis reg,(expr)@highest; \
 @@ -85,3 +86,11 @@ putchar:
   li  r5, 1   /* sending just 1 byte */
   HVSC
   blr
 +
 +.globl enter_rtas
 +enter_rtas:
 + mr  r4, r3
 + lis r3, KVMPPC_H_RTAS@h
 + ori r3, r3, KVMPPC_H_RTAS@l
 + HVSC
 + blr

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson




Re: [kvm-unit-tests PATCH 11/14] powerpc/ppc64: add rtas_power_off

2015-08-03 Thread David Gibson
 := $(shell printf %x\n $$(( $(phys_base) + $(kernel_offset) )))
 diff --git a/powerpc/cstart64.S b/powerpc/cstart64.S
 index 8edaaa6e251fc..1c48083efa33c 100644
 --- a/powerpc/cstart64.S
 +++ b/powerpc/cstart64.S
 @@ -8,6 +8,7 @@
  
   #define HVSC .long 0x44000022
   #define H_PUT_TERM_CHAR  0x58
  +#define KVMPPC_H_RTAS 0xf000
  
  #define LOAD_REG_IMMEDIATE(reg,expr) \
   lis reg,(expr)@highest; \
 @@ -85,3 +86,11 @@ putchar:
   li  r5, 1   /* sending just 1 byte */
   HVSC
   blr
 +
 +.globl enter_rtas
 +enter_rtas:
 + mr  r4, r3
 + lis r3, KVMPPC_H_RTAS@h
 + ori r3, r3, KVMPPC_H_RTAS@l
 + HVSC
 + blr

So, strictly speaking this isn't how a client program is supposed to
access RTAS.  Instead, you're supposed to call into SLOF to
instantiate RTAS, then jump into the blob it gives you.  Or, if you're
replacing rather than using SLOF, then you're supposed to get the
existing RTAS blob address from the device tree and jump into that (or
copy it somewhere else yourself).

In practice that RTAS blob will have the same 5 instructions as your
enter_rtas, so it's probably ok to do this.

What concerns me slightly, though, is that if we ever messed up the
RTAS blob handling in qemu or SLOF, then bypassing it in the kvm tests
like this means they wouldn't catch that breakage.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson




Re: [kvm-unit-tests PATCH 08/14] powerpc/ppc64: add HV putchar

2015-08-03 Thread David Gibson
On Mon, Aug 03, 2015 at 04:41:25PM +0200, Andrew Jones wrote:
 Add the hvcall for putchar and use it in puts. That, along with a
 couple more lines in start to prepare for C code, and a branch to
 main(), gets us hello world. Run with
 
 qemu-system-ppc64 -M pseries\
 -bios powerpc/boot_rom.bin  \
 -display none -serial stdio \
 -kernel powerpc/selftest.elf
 
 (We're still not relocating yet, that comes in a later patch. Thus,
  testing hello-world at this point requires a hacked QEMU and linking
  the unit test at QEMU's kernel load address.)
 
 Signed-off-by: Andrew Jones drjo...@redhat.com
 ---
  lib/powerpc/io.c| 15 +--
  powerpc/Makefile.common |  1 +
  powerpc/cstart64.S  | 27 +++
  3 files changed, 41 insertions(+), 2 deletions(-)
 
 diff --git a/lib/powerpc/io.c b/lib/powerpc/io.c
 index cf3b6347e1e46..a7eaafeca9205 100644
 --- a/lib/powerpc/io.c
 +++ b/lib/powerpc/io.c
 @@ -6,15 +6,26 @@
   * This work is licensed under the terms of the GNU LGPL, version 2.
   */
   #include <libcflat.h>
  +#include <asm/spinlock.h>
 +
 +extern void halt(int code);
 +extern void putchar(int c);
 +
 +static struct spinlock uart_lock;
  
  void io_init(void)
  {
  }
  
 -void puts(const char *s __unused)
 +void puts(const char *s)
  {
  + spin_lock(&uart_lock);
  + while (*s)
  + putchar(*s++);
  + spin_unlock(&uart_lock);
  }
  
 -void exit(int code __unused)
 +void exit(int code)
  {
 + halt(code);
  }
 diff --git a/powerpc/Makefile.common b/powerpc/Makefile.common
 index 5cd3ea8085038..d6356540918a5 100644
 --- a/powerpc/Makefile.common
 +++ b/powerpc/Makefile.common
 @@ -26,6 +26,7 @@ CFLAGS += -ffreestanding
  CFLAGS += -Wextra
  CFLAGS += -O2
  CFLAGS += -I lib -I lib/libfdt
 +CFLAGS += -Wa,-mregnames
  
  asm-offsets = lib/$(ARCH)/asm-offsets.h
  include scripts/asm-offsets.mak
 diff --git a/powerpc/cstart64.S b/powerpc/cstart64.S
 index d9b77f44f9e0e..d7c51cd352ee4 100644
 --- a/powerpc/cstart64.S
 +++ b/powerpc/cstart64.S
 @@ -6,14 +6,41 @@
   * This work is licensed under the terms of the GNU LGPL, version 2.
   */
  
  +#define HVSC .long 0x44000022

Unless your assembler is really old, you could probably use "sc 1"
inline, instead of this macro.  Not that it really matters.

 +#define H_PUT_TERM_CHAR  0x58
 +
 +#define LOAD_REG_IMMEDIATE(reg,expr) \
 + lis reg,(expr)@highest; \
 + ori reg,reg,(expr)@higher;  \
 + rldicr  reg,reg,32,31;  \
 + orisreg,reg,(expr)@h;   \
 + ori reg,reg,(expr)@l;
 +
 +#define LOAD_REG_ADDR(reg,name)  \
 + ld  reg,name@got(r2)
 +
  .section .init
  
  .globl start
  start:
 + LOAD_REG_IMMEDIATE(r1, stackptr)
 + LOAD_REG_IMMEDIATE(r2, tocptr)
 + bl  .main
 + bl  .exit

Is this built for ppc64 or ppc64le?  IIUC under the new ABI version
usually used on ppc64le, function descriptors are no longer used.  And
even under the old ABI, I think the assembler now has the smarts to
avoid explicitly referencing the .-symbols.

   b   halt
  
  .text
 +.align 3
  
  .globl halt
  halt:
  1:   b   1b
 +
 +.globl putchar
 +putchar:
 + sldir6, r3, 56
 + li  r3, H_PUT_TERM_CHAR
 + li  r4, 0   /* vty-reg 0 means to use the default vty */
 + li  r5, 1   /* sending just 1 byte */
 + HVSC
 + blr

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson




Re: [kvm-unit-tests PATCH 11/14] powerpc/ppc64: add rtas_power_off

2015-08-03 Thread David Gibson
On Mon, Aug 03, 2015 at 07:08:17PM +0200, Paolo Bonzini wrote:
 
 
 On 03/08/2015 16:41, Andrew Jones wrote:
  Add enough RTAS support to support power-off, and apply it to
  exit().
  
  Signed-off-by: Andrew Jones drjo...@redhat.com
 
 Why not use virtio-mmio + testdev on ppc as well?  Similar to how we're
 not using PSCI on ARM or ACPI on x86.

Strange as it seems, MMIO is actually a PITA for a simple pseries
guest like this.  Basically, you have to enable the MMU - which
requires a whole bunch of setup - in order to perform cache-inhibited
loads and stores, which is what you need for IO.

There are hypercalls to sidestep this (H_LOGICAL_CI_LOAD and
H_LOGICAL_CI_STORE), but having a hypercall and KVM exit for every IO
access may be hideously slow.

In early development we did have a hypercall mediated virtio model,
but it was abandoned once we got PCI working.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson




Re: [PATCH 0/2] Two fixes for dynamic micro-threading

2015-07-20 Thread David Gibson
On Thu, Jul 16, 2015 at 05:11:12PM +1000, Paul Mackerras wrote:
 This series contains two fixes for the new dynamic micro-threading
 code that was added recently for HV-mode KVM on Power servers.
 The patches are against Alex Graf's kvm-ppc-queue branch.  Please
 apply.

agraf,

Any word on these?  These appear to fix a really nasty host crash in
current upstream.  I'd really like to see them merged ASAP.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson




Re: [PATCH 2/2] KVM: PPC: Book3S HV: Implement dynamic micro-threading on POWER8

2015-06-02 Thread David Gibson
On Thu, May 28, 2015 at 03:17:20PM +1000, Paul Mackerras wrote:
 This builds on the ability to run more than one vcore on a physical
 core by using the micro-threading (split-core) modes of the POWER8
 chip.  Previously, only vcores from the same VM could be run together,
 and (on POWER8) only if they had just one thread per core.  With the
 ability to split the core on guest entry and unsplit it on guest exit,
 we can run up to 8 vcpu threads from up to 4 different VMs, and we can
 run multiple vcores with 2 or 4 vcpus per vcore.
 
 Dynamic micro-threading is only available if the static configuration
 of the cores is whole-core mode (unsplit), and only on POWER8.
 
 To manage this, we introduce a new kvm_split_mode struct which is
 shared across all of the subcores in the core, with a pointer in the
 paca on each thread.  In addition we extend the core_info struct to
 have information on each subcore.  When deciding whether to add a
 vcore to the set already on the core, we now have two possibilities:
 (a) piggyback the vcore onto an existing subcore, or (b) start a new
 subcore.
 
 Currently, when any vcpu needs to exit the guest and switch to host
 virtual mode, we interrupt all the threads in all subcores and switch
 the core back to whole-core mode.  It may be possible in future to
 allow some of the subcores to keep executing in the guest while
 subcore 0 switches to the host, but that is not implemented in this
 patch.
 
 This adds a module parameter called dynamic_mt_modes which controls
 which micro-threading (split-core) modes the code will consider, as a
 bitmap.  In other words, if it is 0, no micro-threading mode is
 considered; if it is 2, only 2-way micro-threading is considered; if
 it is 4, only 4-way, and if it is 6, both 2-way and 4-way
 micro-threading mode will be considered.  The default is 6.
 
 With this, we now have secondary threads which are the primary thread
 for their subcore and therefore need to do the MMU switch.  These
 threads will need to be started even if they have no vcpu to run, so
 we use the vcore pointer in the PACA rather than the vcpu pointer to
 trigger them.
 
 Signed-off-by: Paul Mackerras pau...@samba.org

Reviewed-by: David Gibson da...@gibson.dropbear.id.au

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson




Re: [PATCH 1/2] KVM: PPC: Book3S HV: Make use of unused threads when running guests

2015-06-02 Thread David Gibson
On Thu, May 28, 2015 at 03:17:19PM +1000, Paul Mackerras wrote:
 When running a virtual core of a guest that is configured with fewer
 threads per core than the physical cores have, the extra physical
 threads are currently unused.  This makes it possible to use them to
 run one or more other virtual cores from the same guest when certain
 conditions are met.  This applies on POWER7, and on POWER8 to guests
 with one thread per virtual core.  (It doesn't apply to POWER8 guests
 with multiple threads per vcore because they require a 1-1 virtual to
 physical thread mapping in order to be able to use msgsndp and the
 TIR.)
 
 The idea is that we maintain a list of preempted vcores for each
 physical cpu (i.e. each core, since the host runs single-threaded).
 Then, when a vcore is about to run, it checks to see if there are
 any vcores on the list for its physical cpu that could be
 piggybacked onto this vcore's execution.  If so, those additional
 vcores are put into state VCORE_PIGGYBACK and their runnable VCPU
 threads are started as well as the original vcore, which is called
 the master vcore.
 
 After the vcores have exited the guest, the extra ones are put back
 onto the preempted list if any of their VCPUs are still runnable and
 not idle.
 
 This means that vcpu-arch.ptid is no longer necessarily the same as
 the physical thread that the vcpu runs on.  In order to make it easier
 for code that wants to send an IPI to know which CPU to target, we
 now store that in a new field in struct vcpu_arch, called thread_cpu.
 
 Signed-off-by: Paul Mackerras pau...@samba.org

Reviewed-by: David Gibson da...@gibson.dropbear.id.au

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson




Re: [PATCHv4] kvmppc: Implement H_LOGICAL_CI_{LOAD,STORE} in KVM

2015-04-21 Thread David Gibson
On Tue, Apr 21, 2015 at 08:37:02AM +0200, Thomas Huth wrote:
 Am Tue, 21 Apr 2015 10:41:51 +1000
 schrieb David Gibson da...@gibson.dropbear.id.au:
 
  On POWER, storage caching is usually configured via the MMU - attributes
  such as cache-inhibited are stored in the TLB and the hashed page table.
  
  This makes correctly performing cache inhibited IO accesses awkward when
  the MMU is turned off (real mode).  Some CPU models provide special
  registers to control the cache attributes of real mode load and stores but
  this is not at all consistent.  This is a problem in particular for SLOF,
  the firmware used on KVM guests, which runs entirely in real mode, but
  which needs to do IO to load the kernel.
  
  To simplify this qemu implements two special hypercalls, H_LOGICAL_CI_LOAD
  and H_LOGICAL_CI_STORE which simulate a cache-inhibited load or store to
  a logical address (aka guest physical address).  SLOF uses these for IO.
  
  However, because these are implemented within qemu, not the host kernel,
  these bypass any IO devices emulated within KVM itself.  The simplest way
  to see this problem is to attempt to boot a KVM guest from a virtio-blk
  device with iothread / dataplane enabled.  The iothread code relies on an
  in kernel implementation of the virtio queue notification, which is not
  triggered by the IO hcalls, and so the guest will stall in SLOF unable to
  load the guest OS.
  
  This patch addresses this by providing in-kernel implementations of the
  2 hypercalls, which correctly scan the KVM IO bus.  Any access to an
  address not handled by the KVM IO bus will cause a VM exit, hitting the
  qemu implementation as before.
  
  Note that a userspace change is also required, in order to enable these
  new hcall implementations with KVM_CAP_PPC_ENABLE_HCALL.
  
  Signed-off-by: David Gibson da...@gibson.dropbear.id.au
  ---
   arch/powerpc/include/asm/kvm_book3s.h |  3 ++
   arch/powerpc/kvm/book3s.c | 76 
  +++
   arch/powerpc/kvm/book3s_hv.c  | 12 ++
   arch/powerpc/kvm/book3s_pr_papr.c | 28 +
   4 files changed, 119 insertions(+)
 ...
  diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
  index cfbcdc6..453a8a4 100644
  --- a/arch/powerpc/kvm/book3s.c
  +++ b/arch/powerpc/kvm/book3s.c
  @@ -821,6 +821,82 @@ void kvmppc_core_destroy_vm(struct kvm *kvm)
   #endif
   }
   
  +int kvmppc_h_logical_ci_load(struct kvm_vcpu *vcpu)
  +{
  +   unsigned long size = kvmppc_get_gpr(vcpu, 4);
  +   unsigned long addr = kvmppc_get_gpr(vcpu, 5);
  +   u64 buf;
  +   int ret;
  +
   +   if (!is_power_of_2(size) || (size > sizeof(buf)))
   +   return H_TOO_HARD;
   +
   +   ret = kvm_io_bus_read(vcpu, KVM_MMIO_BUS, addr, size, &buf);
  +   if (ret != 0)
  +   return H_TOO_HARD;
  +
  +   switch (size) {
  +   case 1:
   +   kvmppc_set_gpr(vcpu, 4, *(u8 *)&buf);
  +   break;
  +
 
 Most of the code in book3s.c seems not to use a empty line after a
 break;, so may I suggest to remove these empty lines here, too, to
 keep the coding style a little bit more consistent?

I don't think it's worth respinning just for that.

   +   case 2:
   +   kvmppc_set_gpr(vcpu, 4, be16_to_cpu(*(__be16 *)&buf));
   +   break;
   +
   +   case 4:
   +   kvmppc_set_gpr(vcpu, 4, be32_to_cpu(*(__be32 *)&buf));
   +   break;
   +
   +   case 8:
   +   kvmppc_set_gpr(vcpu, 4, be64_to_cpu(*(__be64 *)&buf));
   +   break;
  +
  +   default:
  +   BUG();
 
 If I got the code right, a malicious guest could easily trigger this
 BUG() statement, couldn't it? ... so a BUG() is maybe not the right
 thing to do here. Would it be appropriate to return an error value to
 the guest instead?

Actually no - the test at the top of the function for
is_power_of_2(size) etc. catches this safely before we get here.  The
BUG() is just paranoia.


 
  +   }
  +
  +   return H_SUCCESS;
  +}
  +EXPORT_SYMBOL_GPL(kvmppc_h_logical_ci_load);
 
  Thomas
 

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson




[PATCHv4] kvmppc: Implement H_LOGICAL_CI_{LOAD,STORE} in KVM

2015-04-20 Thread David Gibson
On POWER, storage caching is usually configured via the MMU - attributes
such as cache-inhibited are stored in the TLB and the hashed page table.

This makes correctly performing cache inhibited IO accesses awkward when
the MMU is turned off (real mode).  Some CPU models provide special
registers to control the cache attributes of real mode load and stores but
this is not at all consistent.  This is a problem in particular for SLOF,
the firmware used on KVM guests, which runs entirely in real mode, but
which needs to do IO to load the kernel.

To simplify this qemu implements two special hypercalls, H_LOGICAL_CI_LOAD
and H_LOGICAL_CI_STORE which simulate a cache-inhibited load or store to
a logical address (aka guest physical address).  SLOF uses these for IO.

However, because these are implemented within qemu, not the host kernel,
these bypass any IO devices emulated within KVM itself.  The simplest way
to see this problem is to attempt to boot a KVM guest from a virtio-blk
device with iothread / dataplane enabled.  The iothread code relies on an
in kernel implementation of the virtio queue notification, which is not
triggered by the IO hcalls, and so the guest will stall in SLOF unable to
load the guest OS.

This patch addresses this by providing in-kernel implementations of the
2 hypercalls, which correctly scan the KVM IO bus.  Any access to an
address not handled by the KVM IO bus will cause a VM exit, hitting the
qemu implementation as before.

Note that a userspace change is also required, in order to enable these
new hcall implementations with KVM_CAP_PPC_ENABLE_HCALL.

Signed-off-by: David Gibson da...@gibson.dropbear.id.au
---
 arch/powerpc/include/asm/kvm_book3s.h |  3 ++
 arch/powerpc/kvm/book3s.c | 76 +++
 arch/powerpc/kvm/book3s_hv.c  | 12 ++
 arch/powerpc/kvm/book3s_pr_papr.c | 28 +
 4 files changed, 119 insertions(+)

Changes in v4:
 * Rebase onto 4.0+, correct for changed signature of kvm_io_bus_{read,write}

Alex, I saw from some build system notifications that you seemed to
hit some troubles compiling the last version of this patch. This
should fix it - hope it's not too late to get into 4.1.

diff --git a/arch/powerpc/include/asm/kvm_book3s.h 
b/arch/powerpc/include/asm/kvm_book3s.h
index 9930904..b91e74a 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -288,6 +288,9 @@ static inline bool kvmppc_supports_magic_page(struct 
kvm_vcpu *vcpu)
	return !is_kvmppc_hv_enabled(vcpu->kvm);
 }
 
+extern int kvmppc_h_logical_ci_load(struct kvm_vcpu *vcpu);
+extern int kvmppc_h_logical_ci_store(struct kvm_vcpu *vcpu);
+
 /* Magic register values loaded into r3 and r4 before the 'sc' assembly
  * instruction for the OSI hypercalls */
 #define OSI_SC_MAGIC_R30x113724FA
diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
index cfbcdc6..453a8a4 100644
--- a/arch/powerpc/kvm/book3s.c
+++ b/arch/powerpc/kvm/book3s.c
@@ -821,6 +821,82 @@ void kvmppc_core_destroy_vm(struct kvm *kvm)
 #endif
 }
 
+int kvmppc_h_logical_ci_load(struct kvm_vcpu *vcpu)
+{
+   unsigned long size = kvmppc_get_gpr(vcpu, 4);
+   unsigned long addr = kvmppc_get_gpr(vcpu, 5);
+   u64 buf;
+   int ret;
+
+   if (!is_power_of_2(size) || (size > sizeof(buf)))
+   return H_TOO_HARD;
+
+   ret = kvm_io_bus_read(vcpu, KVM_MMIO_BUS, addr, size, &buf);
+   if (ret != 0)
+   return H_TOO_HARD;
+
+   switch (size) {
+   case 1:
+   kvmppc_set_gpr(vcpu, 4, *(u8 *)&buf);
+   break;
+
+   case 2:
+   kvmppc_set_gpr(vcpu, 4, be16_to_cpu(*(__be16 *)&buf));
+   break;
+
+   case 4:
+   kvmppc_set_gpr(vcpu, 4, be32_to_cpu(*(__be32 *)&buf));
+   break;
+
+   case 8:
+   kvmppc_set_gpr(vcpu, 4, be64_to_cpu(*(__be64 *)&buf));
+   break;
+
+   default:
+   BUG();
+   }
+
+   return H_SUCCESS;
+}
+EXPORT_SYMBOL_GPL(kvmppc_h_logical_ci_load);
+
+int kvmppc_h_logical_ci_store(struct kvm_vcpu *vcpu)
+{
+   unsigned long size = kvmppc_get_gpr(vcpu, 4);
+   unsigned long addr = kvmppc_get_gpr(vcpu, 5);
+   unsigned long val = kvmppc_get_gpr(vcpu, 6);
+   u64 buf;
+   int ret;
+
+   switch (size) {
+   case 1:
+   *(u8 *)&buf = val;
+   break;
+
+   case 2:
+   *(__be16 *)&buf = cpu_to_be16(val);
+   break;
+
+   case 4:
+   *(__be32 *)&buf = cpu_to_be32(val);
+   break;
+
+   case 8:
+   *(__be64 *)&buf = cpu_to_be64(val);
+   break;
+
+   default:
+   return H_TOO_HARD;
+   }
+
+   ret = kvm_io_bus_write(vcpu, KVM_MMIO_BUS, addr, size, &buf);
+   if (ret != 0)
+   return H_TOO_HARD;
+
+   return H_SUCCESS;
+}
+EXPORT_SYMBOL_GPL(kvmppc_h_logical_ci_store);

Re: [PATCHv3] kvmppc: Implement H_LOGICAL_CI_{LOAD,STORE} in KVM

2015-03-16 Thread David Gibson
On Thu, Feb 05, 2015 at 01:57:11AM +0100, Alexander Graf wrote:
 
 
 On 05.02.15 01:53, David Gibson wrote:
  On POWER, storage caching is usually configured via the MMU - attributes
  such as cache-inhibited are stored in the TLB and the hashed page table.
  
  This makes correctly performing cache inhibited IO accesses awkward when
  the MMU is turned off (real mode).  Some CPU models provide special
  registers to control the cache attributes of real mode load and stores but
  this is not at all consistent.  This is a problem in particular for SLOF,
  the firmware used on KVM guests, which runs entirely in real mode, but
  which needs to do IO to load the kernel.
  
  To simplify this qemu implements two special hypercalls, H_LOGICAL_CI_LOAD
  and H_LOGICAL_CI_STORE which simulate a cache-inhibited load or store to
  a logical address (aka guest physical address).  SLOF uses these for IO.
  
  However, because these are implemented within qemu, not the host kernel,
  these bypass any IO devices emulated within KVM itself.  The simplest way
  to see this problem is to attempt to boot a KVM guest from a virtio-blk
  device with iothread / dataplane enabled.  The iothread code relies on an
  in kernel implementation of the virtio queue notification, which is not
  triggered by the IO hcalls, and so the guest will stall in SLOF unable to
  load the guest OS.
  
  This patch addresses this by providing in-kernel implementations of the
  2 hypercalls, which correctly scan the KVM IO bus.  Any access to an
  address not handled by the KVM IO bus will cause a VM exit, hitting the
  qemu implementation as before.
  
  Note that a userspace change is also required, in order to enable these
  new hcall implementations with KVM_CAP_PPC_ENABLE_HCALL.
  
  Signed-off-by: David Gibson da...@gibson.dropbear.id.au
 
 Thanks, applied to kvm-ppc-queue.

Any news on when this might go up to mainline?

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson




[PATCHv3] kvmppc: Implement H_LOGICAL_CI_{LOAD,STORE} in KVM

2015-02-04 Thread David Gibson
On POWER, storage caching is usually configured via the MMU - attributes
such as cache-inhibited are stored in the TLB and the hashed page table.

This makes correctly performing cache inhibited IO accesses awkward when
the MMU is turned off (real mode).  Some CPU models provide special
registers to control the cache attributes of real mode load and stores but
this is not at all consistent.  This is a problem in particular for SLOF,
the firmware used on KVM guests, which runs entirely in real mode, but
which needs to do IO to load the kernel.

To simplify this qemu implements two special hypercalls, H_LOGICAL_CI_LOAD
and H_LOGICAL_CI_STORE which simulate a cache-inhibited load or store to
a logical address (aka guest physical address).  SLOF uses these for IO.

However, because these are implemented within qemu, not the host kernel,
these bypass any IO devices emulated within KVM itself.  The simplest way
to see this problem is to attempt to boot a KVM guest from a virtio-blk
device with iothread / dataplane enabled.  The iothread code relies on an
in kernel implementation of the virtio queue notification, which is not
triggered by the IO hcalls, and so the guest will stall in SLOF unable to
load the guest OS.

This patch addresses this by providing in-kernel implementations of the
2 hypercalls, which correctly scan the KVM IO bus.  Any access to an
address not handled by the KVM IO bus will cause a VM exit, hitting the
qemu implementation as before.

Note that a userspace change is also required, in order to enable these
new hcall implementations with KVM_CAP_PPC_ENABLE_HCALL.

Signed-off-by: David Gibson da...@gibson.dropbear.id.au
---
 arch/powerpc/include/asm/kvm_book3s.h |  3 ++
 arch/powerpc/kvm/book3s.c | 76 +++
 arch/powerpc/kvm/book3s_hv.c  | 12 ++
 arch/powerpc/kvm/book3s_pr_papr.c | 28 +
 4 files changed, 119 insertions(+)

v3:
  - Use __beXX types instead of plain uXX where appropriate
  - Remove some unhelpful comments
v2:
  - Removed some debugging printk()s that were accidentally left in
- Fix endianness; like all PAPR hypercalls, these should always act
big-endian, even if the guest is little-endian (in practice this
makes no difference, since the only user is SLOF, which is always
big-endian)

diff --git a/arch/powerpc/include/asm/kvm_book3s.h 
b/arch/powerpc/include/asm/kvm_book3s.h
index 942c7b1..578e550 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -292,6 +292,9 @@ static inline bool kvmppc_supports_magic_page(struct 
kvm_vcpu *vcpu)
	return !is_kvmppc_hv_enabled(vcpu->kvm);
 }
 
+extern int kvmppc_h_logical_ci_load(struct kvm_vcpu *vcpu);
+extern int kvmppc_h_logical_ci_store(struct kvm_vcpu *vcpu);
+
 /* Magic register values loaded into r3 and r4 before the 'sc' assembly
  * instruction for the OSI hypercalls */
 #define OSI_SC_MAGIC_R3	0x113724FA
diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
index 888bf46..397f433 100644
--- a/arch/powerpc/kvm/book3s.c
+++ b/arch/powerpc/kvm/book3s.c
@@ -820,6 +820,82 @@ void kvmppc_core_destroy_vm(struct kvm *kvm)
 #endif
 }
 
+int kvmppc_h_logical_ci_load(struct kvm_vcpu *vcpu)
+{
+   unsigned long size = kvmppc_get_gpr(vcpu, 4);
+   unsigned long addr = kvmppc_get_gpr(vcpu, 5);
+   u64 buf;
+   int ret;
+
+   if (!is_power_of_2(size) || (size > sizeof(buf)))
+   return H_TOO_HARD;
+
+   ret = kvm_io_bus_read(vcpu->kvm, KVM_MMIO_BUS, addr, size, &buf);
+   if (ret != 0)
+   return H_TOO_HARD;
+
+   switch (size) {
+   case 1:
+   kvmppc_set_gpr(vcpu, 4, *(u8 *)&buf);
+   break;
+
+   case 2:
+   kvmppc_set_gpr(vcpu, 4, be16_to_cpu(*(__be16 *)&buf));
+   break;
+
+   case 4:
+   kvmppc_set_gpr(vcpu, 4, be32_to_cpu(*(__be32 *)&buf));
+   break;
+
+   case 8:
+   kvmppc_set_gpr(vcpu, 4, be64_to_cpu(*(__be64 *)&buf));
+   break;
+
+   default:
+   BUG();
+   }
+
+   return H_SUCCESS;
+}
+EXPORT_SYMBOL_GPL(kvmppc_h_logical_ci_load);
+
+int kvmppc_h_logical_ci_store(struct kvm_vcpu *vcpu)
+{
+   unsigned long size = kvmppc_get_gpr(vcpu, 4);
+   unsigned long addr = kvmppc_get_gpr(vcpu, 5);
+   unsigned long val = kvmppc_get_gpr(vcpu, 6);
+   u64 buf;
+   int ret;
+
+   switch (size) {
+   case 1:
+   *(u8 *)&buf = val;
+   break;
+
+   case 2:
+   *(__be16 *)&buf = cpu_to_be16(val);
+   break;
+
+   case 4:
+   *(__be32 *)&buf = cpu_to_be32(val);
+   break;
+
+   case 8:
+   *(__be64 *)&buf = cpu_to_be64(val);
+   break;
+
+   default:
+   return H_TOO_HARD;
+   }
+
+   ret = kvm_io_bus_write(vcpu->kvm

Re: [PATCHv2] kvmppc: Implement H_LOGICAL_CI_{LOAD,STORE} in KVM

2015-02-04 Thread David Gibson
On Wed, Feb 04, 2015 at 04:24:46PM +0100, Alexander Graf wrote:
 
 
 On 03.02.15 06:44, David Gibson wrote:
  On POWER, storage caching is usually configured via the MMU - attributes
  such as cache-inhibited are stored in the TLB and the hashed page table.
  
  This makes correctly performing cache inhibited IO accesses awkward when
  the MMU is turned off (real mode).  Some CPU models provide special
  registers to control the cache attributes of real mode load and stores but
  this is not at all consistent.  This is a problem in particular for SLOF,
  the firmware used on KVM guests, which runs entirely in real mode, but
  which needs to do IO to load the kernel.
  
  To simplify this qemu implements two special hypercalls, H_LOGICAL_CI_LOAD
  and H_LOGICAL_CI_STORE which simulate a cache-inhibited load or store to
  a logical address (aka guest physical address).  SLOF uses these for IO.
  
  However, because these are implemented within qemu, not the host kernel,
  these bypass any IO devices emulated within KVM itself.  The simplest way
  to see this problem is to attempt to boot a KVM guest from a virtio-blk
  device with iothread / dataplane enabled.  The iothread code relies on an
  in kernel implementation of the virtio queue notification, which is not
  triggered by the IO hcalls, and so the guest will stall in SLOF unable to
  load the guest OS.
  
  This patch addresses this by providing in-kernel implementations of the
  2 hypercalls, which correctly scan the KVM IO bus.  Any access to an
  address not handled by the KVM IO bus will cause a VM exit, hitting the
  qemu implementation as before.
  
  Note that a userspace change is also required, in order to enable these
  new hcall implementations with KVM_CAP_PPC_ENABLE_HCALL.
  
  Signed-off-by: David Gibson da...@gibson.dropbear.id.au
  ---
   arch/powerpc/include/asm/kvm_book3s.h |  3 ++
   arch/powerpc/kvm/book3s.c | 76 
  +++
   arch/powerpc/kvm/book3s_hv.c  | 12 ++
   arch/powerpc/kvm/book3s_pr_papr.c | 28 +
   4 files changed, 119 insertions(+)
  
  v2:
- Removed some debugging printk()s that were accidentally left in
- Fix endianness; like all PAPR hypercalls, these should always act
  big-endian, even if the guest is little-endian (in practice this
  makes no difference, since the only user is SLOF, which is always
  big-endian)
  
  diff --git a/arch/powerpc/include/asm/kvm_book3s.h 
  b/arch/powerpc/include/asm/kvm_book3s.h
  index 942c7b1..578e550 100644
  --- a/arch/powerpc/include/asm/kvm_book3s.h
  +++ b/arch/powerpc/include/asm/kvm_book3s.h
  @@ -292,6 +292,9 @@ static inline bool kvmppc_supports_magic_page(struct 
  kvm_vcpu *vcpu)
  return !is_kvmppc_hv_enabled(vcpu->kvm);
   }
   
  +extern int kvmppc_h_logical_ci_load(struct kvm_vcpu *vcpu);
  +extern int kvmppc_h_logical_ci_store(struct kvm_vcpu *vcpu);
  +
   /* Magic register values loaded into r3 and r4 before the 'sc' assembly
* instruction for the OSI hypercalls */
   #define OSI_SC_MAGIC_R3	0x113724FA
  diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
  index 888bf46..7b51492 100644
  --- a/arch/powerpc/kvm/book3s.c
  +++ b/arch/powerpc/kvm/book3s.c
  @@ -820,6 +820,82 @@ void kvmppc_core_destroy_vm(struct kvm *kvm)
   #endif
   }
   
  +int kvmppc_h_logical_ci_load(struct kvm_vcpu *vcpu)
  +{
  +   unsigned long size = kvmppc_get_gpr(vcpu, 4);
  +   unsigned long addr = kvmppc_get_gpr(vcpu, 5);
  +   u64 buf;
  +   int ret;
  +
  +   if (!is_power_of_2(size) || (size > sizeof(buf)))
  +   return H_TOO_HARD;
  +
  +   ret = kvm_io_bus_read(vcpu->kvm, KVM_MMIO_BUS, addr, size, &buf);
  +   if (ret != 0)
  +   return H_TOO_HARD;
  +
  +   switch (size) {
  +   case 1:
  +   kvmppc_set_gpr(vcpu, 4, *(u8 *)&buf);
  +   break;
  +
  +   case 2:
  +   kvmppc_set_gpr(vcpu, 4, be16_to_cpu(*(u16 *)&buf));
  +   break;
  +
  +   case 4:
  +   kvmppc_set_gpr(vcpu, 4, be32_to_cpu(*(u32 *)&buf));
  +   break;
  +
  +   case 8:
  +   kvmppc_set_gpr(vcpu, 4, be64_to_cpu(*(u64 *)&buf));
 
 Shouldn't these casts be __be types?

Ah, yes they should.

  +   break;
  +
  +   default:
  +   BUG();
  +   }
  +
  +   return H_SUCCESS;
  +}
  +EXPORT_SYMBOL_GPL(kvmppc_h_logical_ci_load); /* For use by the kvm-pr 
  module */
 
 No need for the comment.

Ok.



[PATCHv2] kvmppc: Implement H_LOGICAL_CI_{LOAD,STORE} in KVM

2015-02-02 Thread David Gibson
On POWER, storage caching is usually configured via the MMU - attributes
such as cache-inhibited are stored in the TLB and the hashed page table.

This makes correctly performing cache inhibited IO accesses awkward when
the MMU is turned off (real mode).  Some CPU models provide special
registers to control the cache attributes of real mode load and stores but
this is not at all consistent.  This is a problem in particular for SLOF,
the firmware used on KVM guests, which runs entirely in real mode, but
which needs to do IO to load the kernel.

To simplify this qemu implements two special hypercalls, H_LOGICAL_CI_LOAD
and H_LOGICAL_CI_STORE which simulate a cache-inhibited load or store to
a logical address (aka guest physical address).  SLOF uses these for IO.

However, because these are implemented within qemu, not the host kernel,
these bypass any IO devices emulated within KVM itself.  The simplest way
to see this problem is to attempt to boot a KVM guest from a virtio-blk
device with iothread / dataplane enabled.  The iothread code relies on an
in kernel implementation of the virtio queue notification, which is not
triggered by the IO hcalls, and so the guest will stall in SLOF unable to
load the guest OS.

This patch addresses this by providing in-kernel implementations of the
2 hypercalls, which correctly scan the KVM IO bus.  Any access to an
address not handled by the KVM IO bus will cause a VM exit, hitting the
qemu implementation as before.

Note that a userspace change is also required, in order to enable these
new hcall implementations with KVM_CAP_PPC_ENABLE_HCALL.

Signed-off-by: David Gibson da...@gibson.dropbear.id.au
---
 arch/powerpc/include/asm/kvm_book3s.h |  3 ++
 arch/powerpc/kvm/book3s.c | 76 +++
 arch/powerpc/kvm/book3s_hv.c  | 12 ++
 arch/powerpc/kvm/book3s_pr_papr.c | 28 +
 4 files changed, 119 insertions(+)

v2:
  - Removed some debugging printk()s that were accidentally left in
  - Fix endianness; like all PAPR hypercalls, these should always act
big-endian, even if the guest is little-endian (in practice this
makes no difference, since the only user is SLOF, which is always
big-endian)

diff --git a/arch/powerpc/include/asm/kvm_book3s.h 
b/arch/powerpc/include/asm/kvm_book3s.h
index 942c7b1..578e550 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -292,6 +292,9 @@ static inline bool kvmppc_supports_magic_page(struct 
kvm_vcpu *vcpu)
	return !is_kvmppc_hv_enabled(vcpu->kvm);
 }
 
+extern int kvmppc_h_logical_ci_load(struct kvm_vcpu *vcpu);
+extern int kvmppc_h_logical_ci_store(struct kvm_vcpu *vcpu);
+
 /* Magic register values loaded into r3 and r4 before the 'sc' assembly
  * instruction for the OSI hypercalls */
 #define OSI_SC_MAGIC_R3	0x113724FA
diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
index 888bf46..7b51492 100644
--- a/arch/powerpc/kvm/book3s.c
+++ b/arch/powerpc/kvm/book3s.c
@@ -820,6 +820,82 @@ void kvmppc_core_destroy_vm(struct kvm *kvm)
 #endif
 }
 
+int kvmppc_h_logical_ci_load(struct kvm_vcpu *vcpu)
+{
+   unsigned long size = kvmppc_get_gpr(vcpu, 4);
+   unsigned long addr = kvmppc_get_gpr(vcpu, 5);
+   u64 buf;
+   int ret;
+
+   if (!is_power_of_2(size) || (size > sizeof(buf)))
+   return H_TOO_HARD;
+
+   ret = kvm_io_bus_read(vcpu->kvm, KVM_MMIO_BUS, addr, size, &buf);
+   if (ret != 0)
+   return H_TOO_HARD;
+
+   switch (size) {
+   case 1:
+   kvmppc_set_gpr(vcpu, 4, *(u8 *)&buf);
+   break;
+
+   case 2:
+   kvmppc_set_gpr(vcpu, 4, be16_to_cpu(*(u16 *)&buf));
+   break;
+
+   case 4:
+   kvmppc_set_gpr(vcpu, 4, be32_to_cpu(*(u32 *)&buf));
+   break;
+
+   case 8:
+   kvmppc_set_gpr(vcpu, 4, be64_to_cpu(*(u64 *)&buf));
+   break;
+
+   default:
+   BUG();
+   }
+
+   return H_SUCCESS;
+}
+EXPORT_SYMBOL_GPL(kvmppc_h_logical_ci_load); /* For use by the kvm-pr module */
+
+int kvmppc_h_logical_ci_store(struct kvm_vcpu *vcpu)
+{
+   unsigned long size = kvmppc_get_gpr(vcpu, 4);
+   unsigned long addr = kvmppc_get_gpr(vcpu, 5);
+   unsigned long val = kvmppc_get_gpr(vcpu, 6);
+   u64 buf;
+   int ret;
+
+   switch (size) {
+   case 1:
+   *(u8 *)&buf = val;
+   break;
+
+   case 2:
+   *(u16 *)&buf = cpu_to_be16(val);
+   break;
+
+   case 4:
+   *(u32 *)&buf = cpu_to_be32(val);
+   break;
+
+   case 8:
+   *(u64 *)&buf = cpu_to_be64(val);
+   break;
+
+   default:
+   return H_TOO_HARD;
+   }
+
+   ret = kvm_io_bus_write(vcpu->kvm, KVM_MMIO_BUS, addr, size, &buf);
+   if (ret != 0)
+   return H_TOO_HARD;
+
+   return

Re: [PATCH] kvmppc: Implement H_LOGICAL_CI_{LOAD,STORE} in KVM

2015-02-02 Thread David Gibson
On Mon, Feb 02, 2015 at 09:59:16AM +0100, Paolo Bonzini wrote:
 
 
 On 02/02/2015 08:45, David Gibson wrote:
  +   case H_LOGICAL_CI_LOAD:
  +   ret = kvmppc_h_logical_ci_load(vcpu);
  +   if (ret == H_TOO_HARD) {
  +   printk("Punting H_LOGICAL_CI_LOAD\n");
  +   return RESUME_HOST;
  +   }
  +   break;
  +   case H_LOGICAL_CI_STORE:
  +   ret = kvmppc_h_logical_ci_store(vcpu);
  +   if (ret == H_TOO_HARD) {
  +   printk("Punting H_LOGICAL_CI_STORE\n");
  +   return RESUME_HOST;
  +   }
  +   break;
  case H_SET_MODE:
  ret = kvmppc_h_set_mode(vcpu, kvmppc_get_gpr(vcpu, 4),
  kvmppc_get_gpr(vcpu, 5),
 
 KERN_DEBUG I guess?  Or even no printk at all perhaps.

Oh, bugger, removed most of the debug code, but not all of it.



[PATCH] kvmppc: Implement H_LOGICAL_CI_{LOAD,STORE} in KVM

2015-02-01 Thread David Gibson
On POWER, storage caching is usually configured via the MMU - attributes
such as cache-inhibited are stored in the TLB and the hashed page table.

This makes correctly performing cache inhibited IO accesses awkward when
the MMU is turned off (real mode).  Some CPU models provide special
registers to control the cache attributes of real mode load and stores but
this is not at all consistent.  This is a problem in particular for SLOF,
the firmware used on KVM guests, which runs entirely in real mode, but
which needs to do IO to load the kernel.

To simplify this qemu implements two special hypercalls, H_LOGICAL_CI_LOAD
and H_LOGICAL_CI_STORE which simulate a cache-inhibited load or store to
a logical address (aka guest physical address).  SLOF uses these for IO.

However, because these are implemented within qemu, not the host kernel,
these bypass any IO devices emulated within KVM itself.  The simplest way
to see this problem is to attempt to boot a KVM guest from a virtio-blk
device with iothread / dataplane enabled.  The iothread code relies on an
in kernel implementation of the virtio queue notification, which is not
triggered by the IO hcalls, and so the guest will stall in SLOF unable to
load the guest OS.

This patch addresses this by providing in-kernel implementations of the
2 hypercalls, which correctly scan the KVM IO bus.  Any access to an
address not handled by the KVM IO bus will cause a VM exit, hitting the
qemu implementation as before.

Note that a userspace change is also required, in order to enable these
new hcall implementations with KVM_CAP_PPC_ENABLE_HCALL.

Signed-off-by: David Gibson da...@gibson.dropbear.id.au
---
 arch/powerpc/include/asm/kvm_book3s.h |  3 ++
 arch/powerpc/kvm/book3s.c | 76 +++
 arch/powerpc/kvm/book3s_hv.c  | 16 
 arch/powerpc/kvm/book3s_pr_papr.c | 28 +
 4 files changed, 123 insertions(+)

diff --git a/arch/powerpc/include/asm/kvm_book3s.h 
b/arch/powerpc/include/asm/kvm_book3s.h
index 942c7b1..578e550 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -292,6 +292,9 @@ static inline bool kvmppc_supports_magic_page(struct 
kvm_vcpu *vcpu)
	return !is_kvmppc_hv_enabled(vcpu->kvm);
 }
 
+extern int kvmppc_h_logical_ci_load(struct kvm_vcpu *vcpu);
+extern int kvmppc_h_logical_ci_store(struct kvm_vcpu *vcpu);
+
 /* Magic register values loaded into r3 and r4 before the 'sc' assembly
  * instruction for the OSI hypercalls */
 #define OSI_SC_MAGIC_R3	0x113724FA
diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
index 888bf46..792c7cf 100644
--- a/arch/powerpc/kvm/book3s.c
+++ b/arch/powerpc/kvm/book3s.c
@@ -820,6 +820,82 @@ void kvmppc_core_destroy_vm(struct kvm *kvm)
 #endif
 }
 
+int kvmppc_h_logical_ci_load(struct kvm_vcpu *vcpu)
+{
+   unsigned long size = kvmppc_get_gpr(vcpu, 4);
+   unsigned long addr = kvmppc_get_gpr(vcpu, 5);
+   u64 buf;
+   int ret;
+
+   if (!is_power_of_2(size) || (size > sizeof(buf)))
+   return H_TOO_HARD;
+
+   ret = kvm_io_bus_read(vcpu->kvm, KVM_MMIO_BUS, addr, size, &buf);
+   if (ret != 0)
+   return H_TOO_HARD;
+
+   switch (size) {
+   case 1:
+   kvmppc_set_gpr(vcpu, 4, *(u8 *)&buf);
+   break;
+
+   case 2:
+   kvmppc_set_gpr(vcpu, 4, *(u16 *)&buf);
+   break;
+
+   case 4:
+   kvmppc_set_gpr(vcpu, 4, *(u32 *)&buf);
+   break;
+
+   case 8:
+   kvmppc_set_gpr(vcpu, 4, *(u64 *)&buf);
+   break;
+
+   default:
+   BUG();
+   }
+
+   return H_SUCCESS;
+}
+EXPORT_SYMBOL_GPL(kvmppc_h_logical_ci_load); /* For use by the kvm-pr module */
+
+int kvmppc_h_logical_ci_store(struct kvm_vcpu *vcpu)
+{
+   unsigned long size = kvmppc_get_gpr(vcpu, 4);
+   unsigned long addr = kvmppc_get_gpr(vcpu, 5);
+   unsigned long val = kvmppc_get_gpr(vcpu, 6);
+   u64 buf;
+   int ret;
+
+   switch (size) {
+   case 1:
+   *(u8 *)&buf = val;
+   break;
+
+   case 2:
+   *(u16 *)&buf = val;
+   break;
+
+   case 4:
+   *(u32 *)&buf = val;
+   break;
+
+   case 8:
+   *(u64 *)&buf = val;
+   break;
+
+   default:
+   return H_TOO_HARD;
+   }
+
+   ret = kvm_io_bus_write(vcpu->kvm, KVM_MMIO_BUS, addr, size, &buf);
+   if (ret != 0)
+   return H_TOO_HARD;
+
+   return H_SUCCESS;
+}
+EXPORT_SYMBOL_GPL(kvmppc_h_logical_ci_store); /* For use by the kvm-pr module 
*/
+
 int kvmppc_core_check_processor_compat(void)
 {
/*
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index de4018a..1013019 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -706,6 +706,20 @@ int kvmppc_pseries_do_hcall

Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling

2013-06-27 Thread David Gibson
On Sun, Jun 23, 2013 at 10:41:24PM -0600, Alex Williamson wrote:
 On Mon, 2013-06-24 at 13:52 +1000, David Gibson wrote:
  On Sat, Jun 22, 2013 at 08:28:06AM -0600, Alex Williamson wrote:
   On Sat, 2013-06-22 at 22:03 +1000, David Gibson wrote:
On Thu, Jun 20, 2013 at 08:55:13AM -0600, Alex Williamson wrote:
 On Thu, 2013-06-20 at 18:48 +1000, Alexey Kardashevskiy wrote:
  On 06/20/2013 05:47 PM, Benjamin Herrenschmidt wrote:
   On Thu, 2013-06-20 at 15:28 +1000, David Gibson wrote:
   Just out of curiosity - would not get_file() and fput_atomic() 
   on a
   group's
   file* do the right job instead of 
   vfio_group_add_external_user() and
   vfio_group_del_external_user()?
  
   I was thinking that too.  Grabbing a file reference would 
   certainly be
   the usual way of handling this sort of thing.
   
   But that wouldn't prevent the group ownership to be returned to
   the kernel or another user would it ?
  
  
  Holding the file pointer does not let the group->container_users 
  counter go
  to zero
 
 How so?  Holding the file pointer means the file won't go away, which
 means the group release function won't be called.  That means the 
 group
 won't go away, but that doesn't mean it's attached to an IOMMU.  A 
 user
 could call UNSET_CONTAINER.

Uhh... *thinks*.  Ah, I see.

I think the interface should not take the group fd, but the container
fd.  Holding a reference to *that* would keep the necessary things
around.  But more to the point, it's the right thing semantically:

The container is essentially the handle on a host iommu address space,
and so that's what should be bound by the KVM call to a particular
guest iommu address space.  e.g. it would make no sense to bind two
different groups to different guest iommu address spaces, if they were
in the same container - the guest thinks they are different spaces,
but if they're in the same container they must be the same space.
   
   While the container is the gateway to the iommu, what empowers the
   container to maintain an iommu is the group.  What happens to a
   container when all the groups are disconnected or closed?  Groups are
   the unit that indicates hardware access, not containers.  Thanks,
  
  Uh... huh?  I'm really not sure what you're getting at.
  
  The operation we're doing for KVM here is binding a guest iommu
  address space to a particular host iommu address space.  Why would we
  not want to use the obvious handle on the host iommu address space,
  which is the container fd?
 
 AIUI, the request isn't for an interface through which to do iommu
 mappings.  The request is for an interface to show that the user has
 sufficient privileges to do mappings.  Groups are what gives the user
 that ability.  The iommu is also possibly associated with multiple iommu
 groups and I believe what is being asked for here is a way to hold and
 lock a single iommu group with iommu protection.
 
 From a practical point of view, the iommu interface is de-privileged
 once the groups are disconnected or closed.  Holding a reference count
 on the iommu fd won't prevent that.  That means we'd have to use a
 notifier to have KVM stop the side-channel iommu access.  Meanwhile
 holding the file descriptor for the group and adding an interface that
 bumps use counter allows KVM to lock itself in, just as if it had a
 device opened itself.  Thanks,

Ah, good point.



Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling

2013-06-23 Thread David Gibson
On Sun, Jun 23, 2013 at 09:28:13AM +1000, Benjamin Herrenschmidt wrote:
 On Sat, 2013-06-22 at 22:03 +1000, David Gibson wrote:
  I think the interface should not take the group fd, but the container
  fd.  Holding a reference to *that* would keep the necessary things
  around.  But more to the point, it's the right thing semantically:
  
  The container is essentially the handle on a host iommu address space,
  and so that's what should be bound by the KVM call to a particular
  guest iommu address space.  e.g. it would make no sense to bind two
  different groups to different guest iommu address spaces, if they were
  in the same container - the guest thinks they are different spaces,
  but if they're in the same container they must be the same space.
 
 Interestingly, how are we going to extend that when/if we implement
 DDW ?
 
 DDW means an API by which the guest can request the creation of
 additional iommus for a given device (typically, in addition to the
 default smallish 32-bit one using 4k pages, the guest can request
 a larger window in 64-bit space using a larger page size).

So, would a PAPR guest requesting this expect the new window to have
a new liobn, or an existing liobn?



Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling

2013-06-23 Thread David Gibson
On Sat, Jun 22, 2013 at 08:28:06AM -0600, Alex Williamson wrote:
 On Sat, 2013-06-22 at 22:03 +1000, David Gibson wrote:
  On Thu, Jun 20, 2013 at 08:55:13AM -0600, Alex Williamson wrote:
   On Thu, 2013-06-20 at 18:48 +1000, Alexey Kardashevskiy wrote:
On 06/20/2013 05:47 PM, Benjamin Herrenschmidt wrote:
 On Thu, 2013-06-20 at 15:28 +1000, David Gibson wrote:
 Just out of curiosity - would not get_file() and fput_atomic() on a
 group's
 file* do the right job instead of vfio_group_add_external_user() and
 vfio_group_del_external_user()?

 I was thinking that too.  Grabbing a file reference would certainly 
 be
 the usual way of handling this sort of thing.
 
 But that wouldn't prevent the group ownership to be returned to
 the kernel or another user would it ?


 Holding the file pointer does not let the group->container_users 
counter go
to zero
   
   How so?  Holding the file pointer means the file won't go away, which
   means the group release function won't be called.  That means the group
   won't go away, but that doesn't mean it's attached to an IOMMU.  A user
   could call UNSET_CONTAINER.
  
  Uhh... *thinks*.  Ah, I see.
  
  I think the interface should not take the group fd, but the container
  fd.  Holding a reference to *that* would keep the necessary things
  around.  But more to the point, it's the right thing semantically:
  
  The container is essentially the handle on a host iommu address space,
  and so that's what should be bound by the KVM call to a particular
  guest iommu address space.  e.g. it would make no sense to bind two
  different groups to different guest iommu address spaces, if they were
  in the same container - the guest thinks they are different spaces,
  but if they're in the same container they must be the same space.
 
 While the container is the gateway to the iommu, what empowers the
 container to maintain an iommu is the group.  What happens to a
 container when all the groups are disconnected or closed?  Groups are
 the unit that indicates hardware access, not containers.  Thanks,

Uh... huh?  I'm really not sure what you're getting at.

The operation we're doing for KVM here is binding a guest iommu
address space to a particular host iommu address space.  Why would we
not want to use the obvious handle on the host iommu address space,
which is the container fd?



Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling

2013-06-22 Thread David Gibson
On Thu, Jun 20, 2013 at 08:55:13AM -0600, Alex Williamson wrote:
 On Thu, 2013-06-20 at 18:48 +1000, Alexey Kardashevskiy wrote:
  On 06/20/2013 05:47 PM, Benjamin Herrenschmidt wrote:
   On Thu, 2013-06-20 at 15:28 +1000, David Gibson wrote:
   Just out of curiosity - would not get_file() and fput_atomic() on a
   group's
   file* do the right job instead of vfio_group_add_external_user() and
   vfio_group_del_external_user()?
  
   I was thinking that too.  Grabbing a file reference would certainly be
   the usual way of handling this sort of thing.
   
   But that wouldn't prevent the group ownership to be returned to
   the kernel or another user would it ?
  
  
  Holding the file pointer does not let the group->container_users counter go
  to zero
 
 How so?  Holding the file pointer means the file won't go away, which
 means the group release function won't be called.  That means the group
 won't go away, but that doesn't mean it's attached to an IOMMU.  A user
 could call UNSET_CONTAINER.

Uhh... *thinks*.  Ah, I see.

I think the interface should not take the group fd, but the container
fd.  Holding a reference to *that* would keep the necessary things
around.  But more to the point, it's the right thing semantically:

The container is essentially the handle on a host iommu address space,
and so that's what should be bound by the KVM call to a particular
guest iommu address space.  e.g. it would make no sense to bind two
different groups to different guest iommu address spaces, if they were
in the same container - the guest thinks they are different spaces,
but if they're in the same container they must be the same space.



Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling

2013-05-24 Thread David Gibson
On Wed, May 22, 2013 at 04:06:57PM -0500, Scott Wood wrote:
 On 05/20/2013 10:06:46 PM, Alexey Kardashevskiy wrote:
 diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
 index 8465c2a..da6bf61 100644
 --- a/arch/powerpc/kvm/powerpc.c
 +++ b/arch/powerpc/kvm/powerpc.c
 @@ -396,6 +396,7 @@ int kvm_dev_ioctl_check_extension(long ext)
  break;
  #endif
  case KVM_CAP_SPAPR_MULTITCE:
 +case KVM_CAP_SPAPR_TCE_IOMMU:
  r = 1;
  break;
  default:
 
 Don't advertise SPAPR capabilities if it's not book3s -- and
 probably there's some additional limitation that would be
 appropriate.

So, in the case of MULTITCE, that's not quite right.  PR KVM can
emulate a PAPR system on a BookE machine, and there's no reason not to
allow TCE acceleration as well.  We can't make it dependent on PAPR
mode being selected, because that's enabled per-vcpu, whereas these
capabilities are queried on the VM before the vcpus are created.

CAP_SPAPR_TCE_IOMMU should be dependent on the presence of suitable
host side hardware (i.e. a PAPR style IOMMU), though.

 
 @@ -1025,6 +1026,17 @@ long kvm_arch_vm_ioctl(struct file *filp,
  r = kvm_vm_ioctl_create_spapr_tce(kvm, &create_tce);
  goto out;
  }
 +case KVM_CREATE_SPAPR_TCE_IOMMU: {
 +struct kvm_create_spapr_tce_iommu create_tce_iommu;
 +struct kvm *kvm = filp->private_data;
 +
 +r = -EFAULT;
 +if (copy_from_user(&create_tce_iommu, argp,
 +sizeof(create_tce_iommu)))
 +goto out;
 +r = kvm_vm_ioctl_create_spapr_tce_iommu(kvm, &create_tce_iommu);
 +goto out;
 +}
  #endif /* CONFIG_PPC_BOOK3S_64 */
 
  #ifdef CONFIG_KVM_BOOK3S_64_HV
 diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
 index 5a2afda..450c82a 100644
 --- a/include/uapi/linux/kvm.h
 +++ b/include/uapi/linux/kvm.h
 @@ -667,6 +667,7 @@ struct kvm_ppc_smmu_info {
  #define KVM_CAP_PPC_RTAS 91
  #define KVM_CAP_IRQ_XICS 92
  #define KVM_CAP_SPAPR_MULTITCE (0x11 + 89)
 +#define KVM_CAP_SPAPR_TCE_IOMMU (0x11 + 90)
 
 Hmm...

Ah, yeah, that needs to be fixed.  Those were interim numbers so that
we didn't have to keep changing our internal trees as new upstream
ioctls got added to the list.  We need to get a proper number for the
merge, though.

 @@ -939,6 +940,9 @@ struct kvm_s390_ucas_mapping {
  #define KVM_GET_DEVICE_ATTR   _IOW(KVMIO,  0xe2, struct
 kvm_device_attr)
  #define KVM_HAS_DEVICE_ATTR   _IOW(KVMIO,  0xe3, struct
 kvm_device_attr)
 
 +/* ioctl for SPAPR TCE IOMMU */
 +#define KVM_CREATE_SPAPR_TCE_IOMMU _IOW(KVMIO,  0xe4, struct
 kvm_create_spapr_tce_iommu)
 
 Shouldn't this go under the vm ioctl section?
 
 -Scott
 ___
 Linuxppc-dev mailing list
 linuxppc-...@lists.ozlabs.org
 https://lists.ozlabs.org/listinfo/linuxppc-dev
 



Re: [PATCH 2/6] KVM: PPC: Add support for multiple-TCE hcalls

2013-05-10 Thread David Gibson
On Mon, May 06, 2013 at 05:25:53PM +1000, Alexey Kardashevskiy wrote:
 This adds real mode handlers for the H_PUT_TCE_INDIRECT and
 H_STUFF_TCE hypercalls for QEMU emulated devices such as virtio
 devices or emulated PCI.  These calls allow adding multiple entries
 (up to 512) into the TCE table in one call which saves time on
 transition to/from real mode.
 
 This adds a guest physical to host real address converter
 and calls the existing H_PUT_TCE handler. The converting function
 is going to be fully utilized by upcoming VFIO supporting patches.
 
 This also implements the KVM_CAP_PPC_MULTITCE capability,
 so in order to support the functionality of this patch, QEMU
 needs to query for this capability and set the hcall-multi-tce
 hypertas property only if the capability is present, otherwise
 there will be serious performance degradation.


Hrm.  Clearly I didn't read this carefully enough before.  There are
some problems here.

[snip]
 diff --git a/arch/powerpc/kvm/book3s_64_vio.c 
 b/arch/powerpc/kvm/book3s_64_vio.c
 index 72ffc89..643ac1e 100644
 --- a/arch/powerpc/kvm/book3s_64_vio.c
 +++ b/arch/powerpc/kvm/book3s_64_vio.c
 @@ -14,6 +14,7 @@
   *
   * Copyright 2010 Paul Mackerras, IBM Corp. pau...@au1.ibm.com
   * Copyright 2011 David Gibson, IBM Corporation d...@au1.ibm.com
 + * Copyright 2013 Alexey Kardashevskiy, IBM Corporation a...@au1.ibm.com
   */
  
  #include <linux/types.h>
 @@ -36,9 +37,14 @@
  #include <asm/ppc-opcode.h>
  #include <asm/kvm_host.h>
  #include <asm/udbg.h>
 +#include <asm/iommu.h>
  
  #define TCES_PER_PAGE  (PAGE_SIZE / sizeof(u64))
 +#define ERROR_ADDR  (~(unsigned long)0x0)
  
 +/*
 + * TCE tables handlers.
 + */
  static long kvmppc_stt_npages(unsigned long window_size)
  {
   return ALIGN((window_size >> SPAPR_TCE_SHIFT)
 @@ -148,3 +154,111 @@ fail:
   }
   return ret;
  }
 +
 +/*
 + * Virtual mode handling of IOMMU map/unmap.
 + */
 +/* Converts guest physical address into host virtual */
 +static unsigned long get_virt_address(struct kvm_vcpu *vcpu,
 + unsigned long gpa)

This should probably return a void * rather than an unsigned long.
Well, actually a void __user *.

 +{
 + unsigned long hva, gfn = gpa >> PAGE_SHIFT;
 + struct kvm_memory_slot *memslot;
 +
 + memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn);
 + if (!memslot)
 + return ERROR_ADDR;
 +
 + /*
 +  * Convert gfn to hva preserving flags and an offset
 +  * within a system page
 +  */
 + hva = __gfn_to_hva_memslot(memslot, gfn) + (gpa & ~PAGE_MASK);
 + return hva;
 +}
 +
 +long kvmppc_virtmode_h_put_tce(struct kvm_vcpu *vcpu,
 + unsigned long liobn, unsigned long ioba,
 + unsigned long tce)
 +{
 + struct kvmppc_spapr_tce_table *tt;
 +
 + tt = kvmppc_find_tce_table(vcpu, liobn);
 + /* Didn't find the liobn, put it to userspace */
 + if (!tt)
 + return H_TOO_HARD;
 +
 + /* Emulated IO */
 + return kvmppc_emulated_h_put_tce(tt, ioba, tce);
 +}
 +
 +long kvmppc_virtmode_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 + unsigned long liobn, unsigned long ioba,
 + unsigned long tce_list, unsigned long npages)
 +{
 + struct kvmppc_spapr_tce_table *tt;
 + long i;
 + unsigned long tces;
 +
 + /* The whole table addressed by tce_list resides in 4K page */
 + if (npages > 512)
 + return H_PARAMETER;

So, that doesn't actually verify what the comment says it does - only
that the list is < 4K in total.  You need to check the alignment of
tce_list as well.
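The missing check can be sketched roughly as follows (illustrative C only: the helper name, the 4K constant, and pointer-sized entries are assumptions for illustration, not the kernel's actual API):

```c
#include <assert.h>
#include <stdbool.h>

#define TCE_LIST_PAGE_SIZE 4096UL

/* Hypothetical validation: besides capping npages at 512, also verify
 * that tce_list is aligned so the whole list fits within one 4K page. */
static bool tce_list_valid(unsigned long tce_list, unsigned long npages)
{
	if (npages == 0 || npages > 512)
		return false;
	/* each entry is pointer-sized, so the list address must be
	 * aligned to the entry size */
	if (tce_list & (sizeof(unsigned long) - 1))
		return false;
	/* the whole list must not cross a 4K page boundary */
	if ((tce_list & ~(TCE_LIST_PAGE_SIZE - 1)) !=
	    ((tce_list + npages * sizeof(unsigned long) - 1) &
	     ~(TCE_LIST_PAGE_SIZE - 1)))
		return false;
	return true;
}
```

With 8-byte entries, a 512-entry list starting at 0x1000 passes, while the same list starting at 0x1ff8 is rejected because it spills into the next page.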

 +
 + tt = kvmppc_find_tce_table(vcpu, liobn);
 + /* Didn't find the liobn, put it to userspace */
 + if (!tt)
 + return H_TOO_HARD;
 +
 + tces = get_virt_address(vcpu, tce_list);
 + if (tces == ERROR_ADDR)
 + return H_TOO_HARD;
 +
 + /* Emulated IO */

This comment doesn't seem to have any bearing on the test which
follows it.

 + if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
 + return H_PARAMETER;
 +
 + for (i = 0; i < npages; ++i) {
 + unsigned long tce;
 + unsigned long ptce = tces + i * sizeof(unsigned long);
 +
 + if (get_user(tce, (unsigned long __user *)ptce))
 + break;
 +
 + if (kvmppc_emulated_h_put_tce(tt,
 + ioba + (i << IOMMU_PAGE_SHIFT), tce))
 + break;
 + }
 + if (i == npages)
 + return H_SUCCESS;
 +
 + /* Failed, do cleanup */
 + do {
 + --i;
 + kvmppc_emulated_h_put_tce(tt, ioba + (i << IOMMU_PAGE_SHIFT),
 + 0);
 + } while (i);

Hrm, so, actually PAPR specifies that this hcall is supposed to first
copy the given tces to hypervisor memory, then translate (and
validate) them all, and only then touch the actual TCE table.  Rather
more complicated to do, but I guess we should - that would get rid
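The copy-first ordering described here can be sketched like this (illustrative C only: none of these names are the kernel's actual API, and the validity test is a stand-in):

```c
#include <assert.h>
#include <string.h>

#define MAX_TCES 512

/* stand-in validity test: a TCE here is a 4K-aligned address plus
 * two low permission bits */
static int tce_entry_valid(unsigned long tce)
{
	return ((tce & ~3UL) & 4095UL) == 0;
}

static int put_tce_indirect(unsigned long *table, unsigned long idx,
			    const unsigned long *guest_list,
			    unsigned long npages)
{
	unsigned long tces[MAX_TCES];
	unsigned long i;

	if (npages > MAX_TCES)
		return -1;
	/* 1. copy the whole guest list into hypervisor memory first */
	memcpy(tces, guest_list, npages * sizeof(tces[0]));
	/* 2. validate every entry before touching the table */
	for (i = 0; i < npages; i++)
		if (!tce_entry_valid(tces[i]))
			return -1;
	/* 3. commit only once all entries have passed, so a bad entry
	 * never leaves the table half-updated */
	for (i = 0; i < npages; i++)
		table[idx + i] = tces[i];
	return 0;
}
```

A failed call leaves the table exactly as it was, a property the partial-update-then-rollback loop in the quoted patch does not provide (its rollback clears entries rather than restoring them).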

Re: [PATCH 5/6] KVM: PPC: Add support for IOMMU in-kernel handling

2013-05-07 Thread David Gibson
On Tue, May 07, 2013 at 03:51:31PM +1000, Alexey Kardashevskiy wrote:
 On 05/07/2013 03:29 PM, David Gibson wrote:
  On Mon, May 06, 2013 at 05:25:56PM +1000, Alexey Kardashevskiy wrote:
  This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
  and H_STUFF_TCE requests without passing them to QEMU, which should
  save time on switching to QEMU and back.
 
  Both real and virtual modes are supported - whenever the kernel
  fails to handle TCE request, it passes it to the virtual mode.
 If the virtual mode handlers fail, then the request is passed
  to the user mode, for example, to QEMU.
 
 This adds a new KVM_CAP_SPAPR_TCE_IOMMU ioctl to associate
  a virtual PCI bus ID (LIOBN) with an IOMMU group, which enables
  in-kernel handling of IOMMU map/unmap.
 
  This adds a special case for huge pages (16MB).  The reference
  counting cannot be easily done for such pages in real mode (when
  MMU is off) so we added a list of huge pages.  It is populated in
 virtual mode and get_page is called just once per huge page.
  Real mode handlers check if the requested page is huge and in the list,
  then no reference counting is done, otherwise an exit to virtual mode
  happens.  The list is released at KVM exit.  At the moment the fastest
  card available for tests uses up to 9 huge pages so walking through this
  list is not very expensive.  However this can change and we may want
  to optimize this.
 
  This also adds the virt_only parameter to the KVM module
  for debug and performance check purposes.
 
  Tests show that this patch increases transmission speed from 220MB/s
 to 750..1020MB/s on 10Gb network (Chelsio CXGB3 10Gb ethernet card).
 
  Cc: David Gibson da...@gibson.dropbear.id.au
  Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
  Signed-off-by: Paul Mackerras pau...@samba.org
  ---
   Documentation/virtual/kvm/api.txt   |   28 
   arch/powerpc/include/asm/kvm_host.h |2 +
   arch/powerpc/include/asm/kvm_ppc.h  |2 +
   arch/powerpc/include/uapi/asm/kvm.h |7 +
   arch/powerpc/kvm/book3s_64_vio.c|  242 
  ++-
   arch/powerpc/kvm/book3s_64_vio_hv.c |  192 +++
   arch/powerpc/kvm/powerpc.c  |   12 ++
   include/uapi/linux/kvm.h|2 +
   8 files changed, 485 insertions(+), 2 deletions(-)
 
  diff --git a/Documentation/virtual/kvm/api.txt 
  b/Documentation/virtual/kvm/api.txt
  index f621cd6..2039767 100644
  --- a/Documentation/virtual/kvm/api.txt
  +++ b/Documentation/virtual/kvm/api.txt
  @@ -2127,6 +2127,34 @@ written, then `n_invalid' invalid entries, 
  invalidating any previously
   valid entries found.
   
   
  +4.79 KVM_CREATE_SPAPR_TCE_IOMMU
  +
  +Capability: KVM_CAP_SPAPR_TCE_IOMMU
  +Architectures: powerpc
  +Type: vm ioctl
  +Parameters: struct kvm_create_spapr_tce_iommu (in)
  +Returns: 0 on success, -1 on error
  +
  +This creates a link between IOMMU group and a hardware TCE (translation
  +control entry) table. This link lets the host kernel know what IOMMU
  +group (i.e. TCE table) to use for the LIOBN number passed with
  +H_PUT_TCE, H_PUT_TCE_INDIRECT, H_STUFF_TCE hypercalls.
  +
  +/* for KVM_CAP_SPAPR_TCE_IOMMU */
  +struct kvm_create_spapr_tce_iommu {
  +  __u64 liobn;
  +  __u32 iommu_id;
  
  Wouldn't it be more in keeping 
 
 
 pardon?

Sorry, I was going to suggest a change, but then realised it wasn't
actually any better than what you have now.

  +  __u32 flags;
  +};
  +
  +No flag is supported at the moment.
  +
  +When the guest issues TCE call on a liobn for which a TCE table has been
  +registered, the kernel will handle it in real mode, updating the hardware
  +TCE table. TCE table calls for other liobns will cause a vm exit and must
  +be handled by userspace.
  +
  +
   5. The kvm_run structure
   
   
  diff --git a/arch/powerpc/include/asm/kvm_host.h 
  b/arch/powerpc/include/asm/kvm_host.h
  index 36ceb0d..2b70cbc 100644
  --- a/arch/powerpc/include/asm/kvm_host.h
  +++ b/arch/powerpc/include/asm/kvm_host.h
  @@ -178,6 +178,8 @@ struct kvmppc_spapr_tce_table {
 struct kvm *kvm;
 u64 liobn;
 u32 window_size;
  +  bool virtmode_only;
  
  I see this is now initialized from the global parameter, but I think
  it would be better to just check the global (debug) parameter
  directly, rather than duplicating it here.
 
 
 The global parameter is in kvm.ko and the struct above is in the real mode
 part which cannot go to the module.

Ah, ok.  I'm half inclined to just drop the virtmode_only thing
entirely.

  +  struct iommu_group *grp;/* used for IOMMU groups */
 struct page *pages[0];
   };
   
  diff --git a/arch/powerpc/include/asm/kvm_ppc.h 
  b/arch/powerpc/include/asm/kvm_ppc.h
  index d501246..bdfa140 100644
  --- a/arch/powerpc/include/asm/kvm_ppc.h
  +++ b/arch/powerpc/include/asm/kvm_ppc.h
  @@ -139,6 +139,8 @@ extern void kvmppc_xics_free(struct kvm *kvm);
   
   extern long

Re: [PATCH 5/6] KVM: PPC: Add support for IOMMU in-kernel handling

2013-05-07 Thread David Gibson
On Tue, May 07, 2013 at 04:27:49PM +1000, Alexey Kardashevskiy wrote:
 On 05/07/2013 04:02 PM, David Gibson wrote:
  On Tue, May 07, 2013 at 03:51:31PM +1000, Alexey Kardashevskiy wrote:
  On 05/07/2013 03:29 PM, David Gibson wrote:
  On Mon, May 06, 2013 at 05:25:56PM +1000, Alexey Kardashevskiy wrote:
[snip]
  +#define DRIVER_VERSION  "0.1"
  +#define DRIVER_AUTHOR   "Paul Mackerras, IBM Corp. pau...@au1.ibm.com"
  +#define DRIVER_DESC "POWERPC KVM driver"
 
  Really?
 
 
  What is wrong here?
  
  Well, it seems entirely unrelated to the rest of the changes, 
 
 
 The patch adds a module parameter so I had to add those DRIVER_xxx.

Ah, ok.

  and not obviously accurate.
 
 Let's fix it then. How? Paul signed it...

Fair enough then.



Re: [PATCH 2/6] KVM: PPC: Add support for multiple-TCE hcalls

2013-05-06 Thread David Gibson
On Mon, May 06, 2013 at 05:25:53PM +1000, Alexey Kardashevskiy wrote:
 This adds real mode handlers for the H_PUT_TCE_INDIRECT and
 H_STUFF_TCE hypercalls for QEMU emulated devices such as virtio
 devices or emulated PCI.  These calls allow adding multiple entries
 (up to 512) into the TCE table in one call which saves time on
 transition to/from real mode.
 
 This adds a guest physical to host real address converter
 and calls the existing H_PUT_TCE handler. The converting function
 is going to be fully utilized by upcoming VFIO supporting patches.
 
 This also implements the KVM_CAP_PPC_MULTITCE capability,
 so in order to support the functionality of this patch, QEMU
 needs to query for this capability and set the hcall-multi-tce
 hypertas property only if the capability is present, otherwise
 there will be serious performance degradation.
 
 Cc: David Gibson da...@gibson.dropbear.id.au
 Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
 Signed-off-by: Paul Mackerras pau...@samba.org

Fwiw, it would be nice to get this patch merged, regardless of the
rest of the VFIO/powerpc patches.



Re: [PATCH 5/6] KVM: PPC: Add support for IOMMU in-kernel handling

2013-05-06 Thread David Gibson
On Mon, May 06, 2013 at 05:25:56PM +1000, Alexey Kardashevskiy wrote:
 This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
 and H_STUFF_TCE requests without passing them to QEMU, which should
 save time on switching to QEMU and back.
 
 Both real and virtual modes are supported - whenever the kernel
 fails to handle TCE request, it passes it to the virtual mode.
 If the virtual mode handlers fail, then the request is passed
 to the user mode, for example, to QEMU.
 
 This adds a new KVM_CAP_SPAPR_TCE_IOMMU ioctl to associate
 a virtual PCI bus ID (LIOBN) with an IOMMU group, which enables
 in-kernel handling of IOMMU map/unmap.
 
 This adds a special case for huge pages (16MB).  The reference
 counting cannot be easily done for such pages in real mode (when
 MMU is off) so we added a list of huge pages.  It is populated in
 virtual mode and get_page is called just once per huge page.
 Real mode handlers check if the requested page is huge and in the list,
 then no reference counting is done, otherwise an exit to virtual mode
 happens.  The list is released at KVM exit.  At the moment the fastest
 card available for tests uses up to 9 huge pages so walking through this
 list is not very expensive.  However this can change and we may want
 to optimize this.
 
 This also adds the virt_only parameter to the KVM module
 for debug and performance check purposes.
 
 Tests show that this patch increases transmission speed from 220MB/s
 to 750..1020MB/s on 10Gb network (Chelsio CXGB3 10Gb ethernet card).
 
 Cc: David Gibson da...@gibson.dropbear.id.au
 Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
 Signed-off-by: Paul Mackerras pau...@samba.org
 ---
  Documentation/virtual/kvm/api.txt   |   28 
  arch/powerpc/include/asm/kvm_host.h |2 +
  arch/powerpc/include/asm/kvm_ppc.h  |2 +
  arch/powerpc/include/uapi/asm/kvm.h |7 +
  arch/powerpc/kvm/book3s_64_vio.c|  242 
 ++-
  arch/powerpc/kvm/book3s_64_vio_hv.c |  192 +++
  arch/powerpc/kvm/powerpc.c  |   12 ++
  include/uapi/linux/kvm.h|2 +
  8 files changed, 485 insertions(+), 2 deletions(-)
 
 diff --git a/Documentation/virtual/kvm/api.txt 
 b/Documentation/virtual/kvm/api.txt
 index f621cd6..2039767 100644
 --- a/Documentation/virtual/kvm/api.txt
 +++ b/Documentation/virtual/kvm/api.txt
 @@ -2127,6 +2127,34 @@ written, then `n_invalid' invalid entries, 
 invalidating any previously
  valid entries found.
  
  
 +4.79 KVM_CREATE_SPAPR_TCE_IOMMU
 +
 +Capability: KVM_CAP_SPAPR_TCE_IOMMU
 +Architectures: powerpc
 +Type: vm ioctl
 +Parameters: struct kvm_create_spapr_tce_iommu (in)
 +Returns: 0 on success, -1 on error
 +
 +This creates a link between IOMMU group and a hardware TCE (translation
 +control entry) table. This link lets the host kernel know what IOMMU
 +group (i.e. TCE table) to use for the LIOBN number passed with
 +H_PUT_TCE, H_PUT_TCE_INDIRECT, H_STUFF_TCE hypercalls.
 +
 +/* for KVM_CAP_SPAPR_TCE_IOMMU */
 +struct kvm_create_spapr_tce_iommu {
 + __u64 liobn;
 + __u32 iommu_id;

Wouldn't it be more in keeping 

 + __u32 flags;
 +};
 +
 +No flag is supported at the moment.
 +
 +When the guest issues TCE call on a liobn for which a TCE table has been
 +registered, the kernel will handle it in real mode, updating the hardware
 +TCE table. TCE table calls for other liobns will cause a vm exit and must
 +be handled by userspace.
 +
 +
  5. The kvm_run structure
  
  
 diff --git a/arch/powerpc/include/asm/kvm_host.h 
 b/arch/powerpc/include/asm/kvm_host.h
 index 36ceb0d..2b70cbc 100644
 --- a/arch/powerpc/include/asm/kvm_host.h
 +++ b/arch/powerpc/include/asm/kvm_host.h
 @@ -178,6 +178,8 @@ struct kvmppc_spapr_tce_table {
   struct kvm *kvm;
   u64 liobn;
   u32 window_size;
 + bool virtmode_only;

I see this is now initialized from the global parameter, but I think
it would be better to just check the global (debug) parameter
directly, rather than duplicating it here.

 + struct iommu_group *grp;/* used for IOMMU groups */
   struct page *pages[0];
  };
  
 diff --git a/arch/powerpc/include/asm/kvm_ppc.h 
 b/arch/powerpc/include/asm/kvm_ppc.h
 index d501246..bdfa140 100644
 --- a/arch/powerpc/include/asm/kvm_ppc.h
 +++ b/arch/powerpc/include/asm/kvm_ppc.h
 @@ -139,6 +139,8 @@ extern void kvmppc_xics_free(struct kvm *kvm);
  
  extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
   struct kvm_create_spapr_tce *args);
 +extern long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm,
 + struct kvm_create_spapr_tce_iommu *args);
  extern struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(
   struct kvm_vcpu *vcpu, unsigned long liobn);
  extern long kvmppc_emulated_h_put_tce(struct kvmppc_spapr_tce_table *stt,
 diff --git a/arch/powerpc/include/uapi/asm/kvm.h 
 b/arch/powerpc/include/uapi/asm/kvm.h

Re: KVM: PPC: Book3S: Add API for in-kernel XICS emulation

2013-05-02 Thread David Gibson
On Sat, Apr 27, 2013 at 10:28:37AM -, Paul Mackerras wrote:
 This adds the API for userspace to instantiate an XICS device in a VM
 and connect VCPUs to it.  The API consists of a new device type for
 the KVM_CREATE_DEVICE ioctl, a new capability KVM_CAP_IRQ_XICS, which
 functions similarly to KVM_CAP_IRQ_MPIC, and the KVM_IRQ_LINE ioctl,
 which is used to assert and deassert interrupt inputs of the XICS.
 
 The XICS device has one attribute group, KVM_DEV_XICS_GRP_SOURCES.
 Each attribute within this group corresponds to the state of one
 interrupt source.  The attribute number is the same as the interrupt
 source number.
 
 This does not support irq routing or irqfd yet.
 
 Signed-off-by: Paul Mackerras pau...@samba.org

Acked-by: David Gibson da...@gibson.dropbear.id.au



Re: Reset problem vs. MMIO emulation, hypercalls, etc...

2012-08-08 Thread David Gibson
On Wed, Aug 08, 2012 at 11:58:58AM +0300, Avi Kivity wrote:
 On 08/08/2012 03:49 AM, David Gibson wrote:
   We never have irqchip in kernel (because we haven't written that yet)
   but we still sleep in-kernel for CEDE.  I haven't spotted any problem
   with that, but now I'm wondering if there is one, since x86 don't do
   it in what seems like the analogous situation.
   
   It's possible this works because our decrementer (timer) interrupts
   are different at the core level from external interrupts coming from
   the PIC, and *are* handled in kernel, but I haven't actually followed
   the logic to work out if this is the case.
   
Meaning the normal state of things is to sleep in
   the kernel (whether or not you have an emulated interrupt controller in
   the kernel -- the term irqchip in kernel is overloaded for x86).
   
   Uh.. overloaded in what way.
  
  On x86, irqchip-in-kernel means that the local APICs, the IOAPIC, and
  the two PICs are emulated in the kernel.  Now the IOAPIC and the PICs
  correspond to non-x86 interrupt controllers, but the local APIC is more
  tightly coupled to the core.  Interrupt acceptance by the core is an
  operation that involved synchronous communication with the local APIC:
  the APIC presents the interrupt, the core accepts it based on the value
  of the interrupt enable flag and possibly a register (CR8), then the
  APIC updates the ISR and IRR.
  
  The upshot is that if the local APIC is in userspace, interrupts must be
  synchronous with vcpu execution, so that KVM_INTERRUPT is a vcpu ioctl
  and HLT is emulated in userspace (so that local APIC emulation can check
  if an interrupt wakes it up or not).
  
  Sorry, still not 100% getting it.  When the vcpu is actually running
  code, that synchronous communication must still be accomplished via
  the KVM_INTERRUPT ioctl, yes?  So what makes HLT different, that the
  communication can't be accomplished in that case.
 
 No, you're correct.  HLT could have been emulated in userspace, it just
 wasn't.  The correct statement is that HLT was arbitrarily chosen to be
 emulated in userspace with the synchronous model, but the asynchronous
 model forced it into the kernel.

Aha!  Ok, understood.  Uh, assuming you meant kernelspace, not
userspace in the first line there, anyway.  Ok, so I am now reassured
that our current handling of CEDE in kernelspace does not cause
problems.


--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Reset problem vs. MMIO emulation, hypercalls, etc...

2012-08-07 Thread David Gibson
On Tue, Aug 07, 2012 at 11:46:35AM +0300, Avi Kivity wrote:
 On 08/07/2012 04:32 AM, David Gibson wrote:
  On Tue, Aug 07, 2012 at 06:57:57AM +1000, Benjamin Herrenschmidt wrote:
  On Mon, 2012-08-06 at 13:13 +1000, David Gibson wrote:
   So, I'm still trying to nut out the implications for H_CEDE, and think
   if there are any other hypercalls that might want to block the guest
   for a time.  We were considering blocking H_PUT_TCE if qemu devices
   had active dma maps on the previously mapped iovas.  I'm not sure if
   the discussions that led to the inclusion of the qemu IOMMU code
   decided that was wholly unnecessary or just not necessary for the
   time being.
  
  For sleeping hcalls they will simply have to set exit_request to
  complete the hcall from the kernel perspective, leaving us in a state
  where the kernel is about to restart at srr0 + 4, along with some other
  flag (stop or halt) to actually freeze the vcpu.
  
  If such an async hcall decides to return an error, it can then set
  gpr3 directly using ioctls before restarting the vcpu.
  
  Yeah, I'd pretty much convinced myself of that by the end of
  yesterday.  I hope to send patches implementing these fixes today.
  
  There are also some questions about why our in-kernel H_CEDE works
  kind of differently from x86's hlt instruction implementation (which
  comes out to qemu unless the irqchip is in-kernel as well).  I don't
  think we have an urgent problem there though.
 
 It's the other way round, hlt sleeps in the kernel unless the irqchip is
 not in the kernel.

That's the same as what I said.

We never have irqchip in kernel (because we haven't written that yet)
but we still sleep in-kernel for CEDE.  I haven't spotted any problem
with that, but now I'm wondering if there is one, since x86 don't do
it in what seems like the analogous situation.

It's possible this works because our decrementer (timer) interrupts
are different at the core level from external interrupts coming from
the PIC, and *are* handled in kernel, but I haven't actually followed
the logic to work out if this is the case.

  Meaning the normal state of things is to sleep in
 the kernel (whether or not you have an emulated interrupt controller in
 the kernel -- the term irqchip in kernel is overloaded for x86).

Uh.. overloaded in what way.



Re: Reset problem vs. MMIO emulation, hypercalls, etc...

2012-08-07 Thread David Gibson
On Tue, Aug 07, 2012 at 04:13:49PM +0300, Avi Kivity wrote:
 On 08/07/2012 03:14 PM, David Gibson wrote:
  On Tue, Aug 07, 2012 at 11:46:35AM +0300, Avi Kivity wrote:
  On 08/07/2012 04:32 AM, David Gibson wrote:
   On Tue, Aug 07, 2012 at 06:57:57AM +1000, Benjamin Herrenschmidt wrote:
   On Mon, 2012-08-06 at 13:13 +1000, David Gibson wrote:
So, I'm still trying to nut out the implications for H_CEDE, and think
if there are any other hypercalls that might want to block the guest
for a time.  We were considering blocking H_PUT_TCE if qemu devices
had active dma maps on the previously mapped iovas.  I'm not sure if
the discussions that led to the inclusion of the qemu IOMMU code
decided that was wholly unnecessary or just not necessary for the
time being.
   
   For sleeping hcalls they will simply have to set exit_request to
   complete the hcall from the kernel perspective, leaving us in a state
   where the kernel is about to restart at srr0 + 4, along with some other
   flag (stop or halt) to actually freeze the vcpu.
   
   If such an async hcall decides to return an error, it can then set
   gpr3 directly using ioctls before restarting the vcpu.
   
   Yeah, I'd pretty much convinced myself of that by the end of
   yesterday.  I hope to send patches implementing these fixes today.
   
   There are also some questions about why our in-kernel H_CEDE works
   kind of differently from x86's hlt instruction implementation (which
   comes out to qemu unless the irqchip is in-kernel as well).  I don't
   think we have an urgent problem there though.
  
  It's the other way round, hlt sleeps in the kernel unless the irqchip is
  not in the kernel.
  
  That's the same as what I said.
 
 I meant to stress that the normal way which other archs should emulate
 is sleep-in-kernel.

Ok.

  We never have irqchip in kernel (because we haven't written that yet)
  but we still sleep in-kernel for CEDE.  I haven't spotted any problem
  with that, but now I'm wondering if there is one, since x86 don't do
  it in what seems like the analogous situation.
  
  It's possible this works because our decrementer (timer) interrupts
  are different at the core level from external interrupts coming from
  the PIC, and *are* handled in kernel, but I haven't actually followed
  the logic to work out if this is the case.
  
   Meaning the normal state of things is to sleep in
  the kernel (whether or not you have an emulated interrupt controller in
  the kernel -- the term irqchip in kernel is overloaded for x86).
  
  Uh.. overloaded in what way.
 
 On x86, irqchip-in-kernel means that the local APICs, the IOAPIC, and
 the two PICs are emulated in the kernel.  Now the IOAPIC and the PICs
 correspond to non-x86 interrupt controllers, but the local APIC is more
 tightly coupled to the core.  Interrupt acceptance by the core is an
 operation that involved synchronous communication with the local APIC:
 the APIC presents the interrupt, the core accepts it based on the value
 of the interrupt enable flag and possibly a register (CR8), then the
 APIC updates the ISR and IRR.
 
 The upshot is that if the local APIC is in userspace, interrupts must be
 synchronous with vcpu execution, so that KVM_INTERRUPT is a vcpu ioctl
 and HLT is emulated in userspace (so that local APIC emulation can check
 if an interrupt wakes it up or not).

Sorry, still not 100% getting it.  When the vcpu is actually running
code, that synchronous communication must still be accomplished via
the KVM_INTERRUPT ioctl, yes?  So what makes HLT different, that the
communication can't be accomplished in that case.

 As soon as the local APIC is
 emulated in the kernel, HLT can be emulated there as well, and
 interrupts become asynchronous (KVM_IRQ_LINE, a vm ioctl).
 
 So irqchip_in_kernel, for most discussions, really means whether
 interrupt queuing is synchronous or asynchronous.  It has nothing to do
 with the interrupt controllers per se.  All non-x86 archs always have
 irqchip_in_kernel() in this sense.
 
 Peter has started to fix up this naming mess in qemu.  I guess we should
 do the same for the kernel (except for ABIs) and document it, because it
 keeps generating confusion.

Ok.
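The synchronous/asynchronous distinction described above maps onto two real KVM ioctls, KVM_INTERRUPT and KVM_IRQ_LINE; the following userspace-side sketch is illustration only (fd setup and error handling are omitted, and it is not tied to any particular VMM):

```c
#include <assert.h>
#include <linux/kvm.h>
#include <sys/ioctl.h>

/* synchronous model: the interrupt controller is emulated in
 * userspace, so interrupts are queued on the *vcpu* fd in between
 * KVM_RUN calls */
static int inject_sync(int vcpu_fd, __u32 irq)
{
	struct kvm_interrupt intr = { .irq = irq };
	return ioctl(vcpu_fd, KVM_INTERRUPT, &intr);
}

/* asynchronous model: the interrupt controller is emulated in the
 * kernel, so a line can be raised on the *vm* fd at any time */
static int inject_async(int vm_fd, __u32 gsi, int level)
{
	struct kvm_irq_level line = { .irq = gsi, .level = level };
	return ioctl(vm_fd, KVM_IRQ_LINE, &line);
}
```

The fd the ioctl targets is the whole point: a vcpu ioctl must be interleaved with that vcpu's KVM_RUN loop, while a vm ioctl can be issued from any thread.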



Re: Reset problem vs. MMIO emulation, hypercalls, etc...

2012-08-06 Thread David Gibson
On Tue, Aug 07, 2012 at 06:57:57AM +1000, Benjamin Herrenschmidt wrote:
 On Mon, 2012-08-06 at 13:13 +1000, David Gibson wrote:
  So, I'm still trying to nut out the implications for H_CEDE, and think
  if there are any other hypercalls that might want to block the guest
  for a time.  We were considering blocking H_PUT_TCE if qemu devices
  had active dma maps on the previously mapped iovas.  I'm not sure if
  the discussions that led to the inclusion of the qemu IOMMU code
  decided that was wholly unnecessary or just not necessary for the
  time being.
 
 For sleeping hcalls they will simply have to set exit_request to
 complete the hcall from the kernel perspective, leaving us in a state
 where the kernel is about to restart at srr0 + 4, along with some other
 flag (stop or halt) to actually freeze the vcpu.
 
 If such an async hcall decides to return an error, it can then set
 gpr3 directly using ioctls before restarting the vcpu.

Yeah, I'd pretty much convinced myself of that by the end of
yesterday.  I hope to send patches implementing these fixes today.

There are also some questions about why our in-kernel H_CEDE works
kind of differently from x86's hlt instruction implementation (which
comes out to qemu unless the irqchip is in-kernel as well).  I don't
think we have an urgent problem there though.



Re: [PATCH V2 4/6] kvm tools: Add PPC64 XICS interrupt controller support

2011-12-20 Thread David Gibson
On Tue, Dec 20, 2011 at 12:16:40PM +1100, Matt Evans wrote:
 Hi David,
 
 On 14/12/11 13:35, David Gibson wrote:
  On Tue, Dec 13, 2011 at 06:10:48PM +1100, Matt Evans wrote:
  This patch adds XICS emulation code (heavily borrowed from QEMU), and wires
  this into kvm_cpu__irq() to fire a CPU IRQ via KVM.  A device tree entry is
 also added.  IPIs work, xics_alloc_irqnum() is added to allocate an external
 IRQ (which will later be used by the PHB PCI code) and finally, kvm__irq_line()
 can be called to raise an IRQ on XICS.
  
  Hrm, looks like you took a somewhat old version of xics.c from qemu.
  It dangerously uses the same variable names for global irq numbers and
  numbers local to one ics unit.  It used to have at least one bug
  caused by confusing the two, which I'm not sure if you've also copied.
 
 Just had a look at the diffs between this and hw/xics.c from the master branch
 in your qemu-impreza.git (which I based the kvmtool stuff on) and I can't see
 anything standing out.
 
 Is there a particular commit/patch/variable name you have in mind that I can
 search for?

Sorry, my mistake, I was looking in the wrong place.



Re: [PATCH V2 4/6] kvm tools: Add PPC64 XICS interrupt controller support

2011-12-13 Thread David Gibson
On Tue, Dec 13, 2011 at 06:10:48PM +1100, Matt Evans wrote:
 This patch adds XICS emulation code (heavily borrowed from QEMU), and wires
 this into kvm_cpu__irq() to fire a CPU IRQ via KVM.  A device tree entry is
 also added.  IPIs work, xics_alloc_irqnum() is added to allocate an external
 IRQ (which will later be used by the PHB PCI code) and finally, kvm__irq_line()
 can be called to raise an IRQ on XICS.

Hrm, looks like you took a somewhat old version of xics.c from qemu.
It dangerously uses the same variable names for global irq numbers and
numbers local to one ics unit.  It used to have at least one bug
caused by confusing the two, which I'm not sure if you've also copied.
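The bug class David describes — one bare integer variable name doing double duty for machine-wide interrupt numbers and numbers local to one ICS unit — can be made a compile-time error by wrapping the two number spaces in distinct struct types. This is only an illustrative sketch of that idea, not code from xics.c; the type names and `ICS_IRQ_BASE` offset are assumptions.

```c
#include <assert.h>

/* Hypothetical sketch: distinct wrapper types for the two irq-number
 * spaces, so accidentally passing one where the other is expected
 * fails to compile instead of silently misbehaving. */
struct global_irq { int nr; };   /* number in the machine-wide space */
struct ics_irq    { int nr; };   /* number local to one ICS unit     */

#define ICS_IRQ_BASE 16          /* assumed base of the ICS block */

/* Convert a global irq number to its ICS-local source number. */
static struct ics_irq ics_local(struct global_irq g)
{
	assert(g.nr >= ICS_IRQ_BASE);
	return (struct ics_irq){ g.nr - ICS_IRQ_BASE };
}

/* Convert an ICS-local source number back to the global space. */
static struct global_irq ics_global(struct ics_irq s)
{
	return (struct global_irq){ s.nr + ICS_IRQ_BASE };
}
```

With this shape, every crossing between the two spaces is an explicit, checkable call rather than an implicit reuse of the same `int`.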



Re: [PATCH V2 2/6] kvm tools: Add SPAPR PPC64 hcall rtascall structure

2011-12-13 Thread David Gibson
On Tue, Dec 13, 2011 at 06:10:46PM +1100, Matt Evans wrote:
 This patch adds the basic structure for HV calls, their registration and some of
 the simpler calls.  A similar layout for RTAS calls is also added, again with
 some of the simpler RTAS calls used by the guest.  The SPAPR RTAS stub is
 generated inline.  Also, nodes for RTAS are added to the device
 tree.

[snip]
 diff --git a/tools/kvm/powerpc/spapr.h b/tools/kvm/powerpc/spapr.h
 new file mode 100644
 index 000..57cece1
 --- /dev/null
 +++ b/tools/kvm/powerpc/spapr.h
 @@ -0,0 +1,105 @@
 +/*
 + * SPAPR definitions and declarations
 + *
 + * Borrowed heavily from QEMU's spapr.h,
 + * Copyright (c) 2010 David Gibson, IBM Corporation.

So, most of the content of this file in qemu, I in turn took from
arch/powerpc/include/asm/hvcall.h in the kernel tree.  You might be
better off using that directly.

[snip]
 +static target_ulong h_logical_icbi(struct kvm_cpu *vcpu, target_ulong opcode, target_ulong *args)
 +{
 + /* Nothing to do on emulation, KVM will trap this in the kernel */
 + return H_SUCCESS;

hcalls that need to be handled by the host kernel should probably dump
an error here, rather than silently doing nothing, since if the host
kernel does handle them here they should never reach userspace at all.
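David's suggestion — warn loudly instead of silently succeeding when a supposedly in-kernel hcall leaks to userspace — might look like the sketch below. This is illustrative only, not the actual kvmtool handler; the types and `H_SUCCESS` value mirror the quoted patch.

```c
#include <stdio.h>

typedef unsigned long target_ulong;
#define H_SUCCESS 0

/* Sketch: if a logical-cache hcall reaches userspace at all, the host
 * kernel failed to trap it, so complain instead of silently returning
 * success.  (Hypothetical handler, not kvmtool's real code.) */
static target_ulong h_logical_icbi(target_ulong opcode, target_ulong *args)
{
	(void)args;
	fprintf(stderr,
		"warning: hcall 0x%lx reached userspace; "
		"expected the host kernel to handle it\n", opcode);
	return H_SUCCESS;
}
```

Returning success keeps a guest limping along, while the warning makes the missing kernel-side handling visible.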

 +}
 +
 +static target_ulong h_logical_dcbf(struct kvm_cpu *vcpu, target_ulong opcode, target_ulong *args)
 +{
 + /* Nothing to do on emulation, KVM will trap this in the kernel */
 + return H_SUCCESS;
 +}
 +
 +void spapr_register_hypercall(target_ulong opcode, spapr_hcall_fn fn)
 +{
 + spapr_hcall_fn *slot;
 +
 + if (opcode <= MAX_HCALL_OPCODE) {
 + assert((opcode & 0x3) == 0);
 +
 + slot = &papr_hypercall_table[opcode / 4];
 + } else {
 + assert((opcode >= KVMPPC_HCALL_BASE) &&
 +(opcode <= KVMPPC_HCALL_MAX));
 +
 + slot = &kvmppc_hypercall_table[opcode - KVMPPC_HCALL_BASE];
 + }
 +
 + assert(!(*slot) || (fn == *slot));
 + *slot = fn;
 +}
 +
 +target_ulong spapr_hypercall(struct kvm_cpu *vcpu, target_ulong opcode,
 +  target_ulong *args)
 +{
 + if ((opcode <= MAX_HCALL_OPCODE)
 +  && ((opcode & 0x3) == 0)) {
 + spapr_hcall_fn fn = papr_hypercall_table[opcode / 4];
 +
 + if (fn) {
 + return fn(vcpu, opcode, args);
 + }
 + } else if ((opcode >= KVMPPC_HCALL_BASE) &&
 +(opcode <= KVMPPC_HCALL_MAX)) {
 + spapr_hcall_fn fn = kvmppc_hypercall_table[opcode -
 +KVMPPC_HCALL_BASE];
 +
 + if (fn) {
 + return fn(vcpu, opcode, args);
 + }
 + }
 +
 + hcall_dprintf("Unimplemented hcall 0x%lx\n", opcode);
 + return H_FUNCTION;
 +}
 +
 +void hypercall_init(void)
 +{
 + /* hcall-dabr */
 + spapr_register_hypercall(H_SET_DABR, h_set_dabr);
 +
 + spapr_register_hypercall(H_LOGICAL_CI_LOAD, h_logical_load);
 + spapr_register_hypercall(H_LOGICAL_CI_STORE, h_logical_store);
 + spapr_register_hypercall(H_LOGICAL_CACHE_LOAD, h_logical_load);
 + spapr_register_hypercall(H_LOGICAL_CACHE_STORE, h_logical_store);
 + spapr_register_hypercall(H_LOGICAL_ICBI, h_logical_icbi);
 + spapr_register_hypercall(H_LOGICAL_DCBF, h_logical_dcbf);
 +
 + /* KVM-PPC specific hcalls */
 + spapr_register_hypercall(KVMPPC_H_RTAS, h_rtas);
 +}
 diff --git a/tools/kvm/powerpc/spapr_rtas.c b/tools/kvm/powerpc/spapr_rtas.c
 new file mode 100644
 index 000..72c6b02
 --- /dev/null
 +++ b/tools/kvm/powerpc/spapr_rtas.c
 @@ -0,0 +1,230 @@
 +/*
 + * SPAPR base RTAS calls
 + *
 + * Borrowed heavily from QEMU's spapr_rtas.c
 + * Copyright (c) 2010-2011 David Gibson, IBM Corporation.
 + *
 + * Modifications copyright 2011 Matt Evans m...@ozlabs.org, IBM Corporation.
 + *
 + * This program is free software; you can redistribute it and/or modify it
 + * under the terms of the GNU General Public License version 2 as published
 + * by the Free Software Foundation.
 + */
 +
 +#include "kvm/kvm.h"
 +#include "kvm/kvm-cpu.h"
 +#include "kvm/util.h"
 +#include "kvm/term.h"
 +
 +#include "spapr.h"
 +
 +#include <stdio.h>
 +#include <assert.h>
 +#include <libfdt.h>
 +
 +#define TOKEN_BASE  0x2000
 +#define TOKEN_MAX   0x100
 +
 +#define RTAS_CONSOLE
 +
 +static struct rtas_call {
 + const char *name;
 + spapr_rtas_fn fn;
 +} rtas_table[TOKEN_MAX];
 +
 +struct rtas_call *rtas_next = rtas_table;
 +
 +
 +static void rtas_display_character(struct kvm_cpu *vcpu,
 +   uint32_t token, uint32_t nargs,
 +   target_ulong args,
 +   uint32_t nret, target_ulong rets)
 +{
 + char c = rtas_ld(vcpu->kvm, args, 0);
 + term_putc(CONSOLE_HV, c, 1, 0);
 + rtas_st(vcpu->kvm, rets, 0, 0);
 +}
 +
 +#ifdef RTAS_CONSOLE
 +static void rtas_put_term_char(struct

Re: [PATCH V2 1/6] kvm tools: Generate SPAPR PPC64 guest device tree

2011-12-13 Thread David Gibson
On Tue, Dec 13, 2011 at 06:10:45PM +1100, Matt Evans wrote:
 The generated DT is the bare minimum structure required for SPAPR (on which
 subsequent patches for VIO, XICS, PCI etc. will build); root node, cpus, memory.
 
 Some aspects are currently hardwired for simplicity, for example advertised
 page sizes, HPT size, SLB size, VMX/DFP, etc.  Future support of a variety
 of POWER CPUs should acquire this info from the host and encode appropriately.
 
 This requires a 64-bit libfdt.

There's already a copy of libfdt embedded in the kernel tree
(scripts/dtc/libfdt), which you should be able to use to build one of
these as you go.
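Building against the kernel's embedded copy could be a small Makefile change; the sketch below is hypothetical (the relative path and object list are assumptions, not kvmtool's actual build rules):

```make
# Hypothetical sketch only: compile the kernel tree's embedded libfdt
# (scripts/dtc/libfdt) into the tool instead of requiring a system -lfdt.
LIBFDT_DIR := ../../scripts/dtc/libfdt
CFLAGS     += -I$(LIBFDT_DIR)
OBJS       += $(LIBFDT_DIR)/fdt.o $(LIBFDT_DIR)/fdt_ro.o \
              $(LIBFDT_DIR)/fdt_rw.o $(LIBFDT_DIR)/fdt_sw.o \
              $(LIBFDT_DIR)/fdt_strerror.o
```

That avoids a host dependency on a 64-bit libfdt package at the cost of reaching outside tools/kvm.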

 
 Signed-off-by: Matt Evans m...@ozlabs.org
 ---
  tools/kvm/Makefile   |3 +-
  tools/kvm/powerpc/include/kvm/kvm-arch.h |   10 ++
  tools/kvm/powerpc/kvm.c  |  141 ++
  3 files changed, 153 insertions(+), 1 deletions(-)
 
 diff --git a/tools/kvm/Makefile b/tools/kvm/Makefile
 index 5bb3f08..4ee4805 100644
 --- a/tools/kvm/Makefile
 +++ b/tools/kvm/Makefile
 @@ -132,7 +132,8 @@ ifeq ($(uname_M), ppc64)
   OBJS+= powerpc/kvm.o
   OBJS+= powerpc/kvm-cpu.o
   ARCH_INCLUDE := powerpc/include
 - CFLAGS += -m64
 + CFLAGS  += -m64
 + LIBS+= -lfdt
  endif
  
  ###
 diff --git a/tools/kvm/powerpc/include/kvm/kvm-arch.h b/tools/kvm/powerpc/include/kvm/kvm-arch.h
 index da61774..33a3827 100644
 --- a/tools/kvm/powerpc/include/kvm/kvm-arch.h
 +++ b/tools/kvm/powerpc/include/kvm/kvm-arch.h
 @@ -69,4 +69,14 @@ struct kvm {
   const char  *name;
  };
  
 +/* Helper for the various bits of code that generate FDT nodes */
 +#define _FDT(exp)\
 + do {\
 + int ret = (exp);\
 + if (ret < 0) {  \
 + die("Error creating device tree: %s: %s\n", \
 + #exp, fdt_strerror(ret));   \
 + }   \
 + } while (0)
 +
  #endif /* KVM__KVM_ARCH_H */
 diff --git a/tools/kvm/powerpc/kvm.c b/tools/kvm/powerpc/kvm.c
 index f838a8f..95ed1cc 100644
 --- a/tools/kvm/powerpc/kvm.c
 +++ b/tools/kvm/powerpc/kvm.c
 @@ -3,6 +3,9 @@
   *
   * Copyright 2011 Matt Evans m...@ozlabs.org, IBM Corporation.
   *
 + * Portions of FDT setup borrowed from QEMU, copyright 2010 David Gibson, IBM
 + * Corporation.
 + *
   * This program is free software; you can redistribute it and/or modify it
   * under the terms of the GNU General Public License version 2 as published
   * by the Free Software Foundation.
 @@ -29,6 +32,8 @@
  #include linux/byteorder.h
  #include libfdt.h
  
 +#define HPT_ORDER 24
 +
   #define HUGETLBFS_PATH "/var/lib/hugetlbfs/global/pagesize-16MB/"
  
  static char kern_cmdline[2048];
 @@ -168,9 +173,145 @@ bool load_bzimage(struct kvm *kvm, int fd_kernel,
   return false;
  }
  
 +#define SMT_THREADS 4
 +
 +static uint32_t mfpvr(void)
 +{
 + uint32_t r;
 + asm volatile ("mfpvr %0" : "=r"(r));
 + return r;
 +}
 +
  static void setup_fdt(struct kvm *kvm)
  {
 + uint64_t mem_reg_property[] = { 0, cpu_to_be64(kvm->ram_size) };
 + int smp_cpus = kvm->nrcpus;
 + char hypertas_prop_kvm[] = "hcall-pft\0hcall-term\0"
 + "hcall-dabr\0hcall-interrupt\0hcall-tce\0hcall-vio\0"
 + "hcall-splpar\0hcall-bulk";
 + int i, j;
 + char cpu_name[30];
 + u8 staging_fdt[FDT_MAX_SIZE];
 + uint32_t pvr = mfpvr();
 +
 + /* Generate an appropriate DT at kvm->fdt_gra */
 + void *fdt_dest = guest_flat_to_host(kvm, kvm->fdt_gra);
 + void *fdt = staging_fdt;
 +
 + _FDT(fdt_create(fdt, FDT_MAX_SIZE));
 + _FDT(fdt_finish_reservemap(fdt));
 +
 + _FDT(fdt_begin_node(fdt, ""));
 +
 + _FDT(fdt_property_string(fdt, "device_type", "chrp"));
 + _FDT(fdt_property_string(fdt, "model", "IBM pSeries (kvmtool)"));
 + _FDT(fdt_property_cell(fdt, "#address-cells", 0x2));
 + _FDT(fdt_property_cell(fdt, "#size-cells", 0x2));
 +
 + /* /chosen */
 + _FDT(fdt_begin_node(fdt, "chosen"));
 + /* cmdline */
 + _FDT(fdt_property_string(fdt, "bootargs", kern_cmdline));
 + /* Initrd */
 + if (kvm->initrd_size != 0) {
 + uint32_t ird_st_prop = cpu_to_be32(kvm->initrd_gra);
 + uint32_t ird_end_prop = cpu_to_be32(kvm->initrd_gra +
 + kvm->initrd_size);
 + _FDT(fdt_property(fdt, "linux,initrd-start",
 +&ird_st_prop, sizeof(ird_st_prop)));
 + _FDT(fdt_property(fdt, "linux,initrd-end",
 +&ird_end_prop, sizeof(ird_end_prop)));
 + }
 + _FDT(fdt_end_node(fdt));
 +
 + /*
 +  * Memory: We

Re: [PATCH 07/10] KVM: PPC: Add PAPR hypercall code for PR mode

2011-08-12 Thread David Gibson
On Fri, Aug 12, 2011 at 07:38:54AM +0200, Alexander Graf wrote:
 
 Am 12.08.2011 um 05:35 schrieb David Gibson da...@gibson.dropbear.id.au:
 
  On Tue, Aug 09, 2011 at 06:31:45PM +0200, Alexander Graf wrote:
   When running a PAPR guest, we need to handle a few hypercalls in kernel space,
   most prominently the page table invalidation (to sync the shadows).
   
   So this patch adds handling for a few PAPR hypercalls to PR mode KVM. I tried
  to share the code with HV mode, but it ended up being a lot easier this way
  around, as the two differ too much in those details.
  
  Are these strictly necessary, or just an optimization?  Because you're
  using the space allocated by qemu for the guest hash table, it seems
  to be you could just let h_enter fall through to qemu which will put
  the right thing into the guest hash table which you can then walk in
  the kernel translation code.
 
 Every time a PTE can be invalidated, we need to do so in kvm to keep
 the SPT in sync. IIRC h_enter can evict/overwrite a previous entry,
 so we need to handle it in kvm as well :). Removal definitely needs
 to happin in-kernel.

True.  I think you could actually delay this invalidation until the
guest issues the tlbie, but it's probably not worth it.



Re: [PATCH 09/10] KVM: PPC: Support SC1 hypercalls for PAPR in PR mode

2011-08-12 Thread David Gibson
On Fri, Aug 12, 2011 at 07:35:42AM +0200, Alexander Graf wrote:
 
 Am 12.08.2011 um 05:33 schrieb David Gibson da...@gibson.dropbear.id.au:
 
  On Tue, Aug 09, 2011 at 06:31:47PM +0200, Alexander Graf wrote:
  PAPR defines hypercalls as SC1 instructions. Using these, the guest modifies
  page tables and does other privileged operations that it wouldn't be allowed
  to do in supervisor mode.
  
  This patch adds support for PR KVM to trap these instructions and route them
  through the same PAPR hypercall interface that we already use for HV style
  KVM.
  
  This will work on a powermac or bare metal host.  Unfortunately, it's
  not enough on a pSeries LPAR host - the sc 1 instruction from the
  guest problem state will go direct to the hypervisor, which will
  return an error rather than trapping to the guest kernel.
  
  The only way around this I can see is for qemu to search for and patch
  up sc 1 instructions to something else.  Obviously that would also
  need some kernel support, and probably a capability to let it know if
  it's necessary.
 
 Well I'd like to keep Qemu out of the patching business, so the
 guest kernel would have to patch itself.

Well sure, but guest patching itself means it can't run existing
kernels.  I thought qemu already patched a few things, ugly though
that approach is.
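The patching idea being debated can be sketched mechanically: `sc 1` has a fixed 32-bit encoding (primary opcode 17 with LEV=1, i.e. 0x44000022), so a patcher would walk guest text word by word and rewrite matches. This is a hedged illustration only; real code would need kernel support for whatever replacement instruction traps, and would have to cope with data words that happen to match, which is part of why the approach is ugly.

```c
#include <stdint.h>
#include <stddef.h>

#define PPC_INST_SC_1  0x44000022u   /* "sc 1": opcode 17, LEV = 1 */

/* Hypothetical sketch: scan a buffer of instructions and replace every
 * "sc 1" with some other (trappable) instruction, returning how many
 * words were patched. */
static size_t patch_sc1(uint32_t *insns, size_t n, uint32_t replacement)
{
	size_t patched = 0;

	for (size_t i = 0; i < n; i++) {
		if (insns[i] == PPC_INST_SC_1) {
			insns[i] = replacement;
			patched++;
		}
	}
	return patched;
}
```

Patching in the guest kernel instead (self-patching at boot) avoids the false-positive problem entirely, but, as noted above, means unmodified existing kernels can't run.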

 But yes, PHyP guests can't
 run this target yet :). I'll take a stab at that too, but one
 continent at a time! ;)



Re: [PATCH 09/10] KVM: PPC: Support SC1 hypercalls for PAPR in PR mode

2011-08-11 Thread David Gibson
On Tue, Aug 09, 2011 at 06:31:47PM +0200, Alexander Graf wrote:
 PAPR defines hypercalls as SC1 instructions. Using these, the guest modifies
 page tables and does other privileged operations that it wouldn't be allowed
 to do in supervisor mode.
 
 This patch adds support for PR KVM to trap these instructions and route them
 through the same PAPR hypercall interface that we already use for HV style
 KVM.

This will work on a powermac or bare metal host.  Unfortunately, it's
not enough on a pSeries LPAR host - the sc 1 instruction from the
guest problem state will go direct to the hypervisor, which will
return an error rather than trapping to the guest kernel.

The only way around this I can see is for qemu to search for and patch
up sc 1 instructions to something else.  Obviously that would also
need some kernel support, and probably a capability to let it know if
it's necessary.



Re: [PATCH 07/10] KVM: PPC: Add PAPR hypercall code for PR mode

2011-08-11 Thread David Gibson
On Tue, Aug 09, 2011 at 06:31:45PM +0200, Alexander Graf wrote:
 When running a PAPR guest, we need to handle a few hypercalls in kernel space,
 most prominently the page table invalidation (to sync the shadows).
 
 So this patch adds handling for a few PAPR hypercalls to PR mode KVM. I tried
 to share the code with HV mode, but it ended up being a lot easier this way
 around, as the two differ too much in those details.

Are these strictly necessary, or just an optimization?  Because you're
using the space allocated by qemu for the guest hash table, it seems
to be you could just let h_enter fall through to qemu which will put
the right thing into the guest hash table which you can then walk in
the kernel translation code.



Re: [PATCH] [v2] powerpc/4xx: work around CHIP11 errata in a more PAGE_SIZE-friendly way

2008-11-11 Thread David Gibson
On Tue, Nov 11, 2008 at 07:52:18AM -0500, Josh Boyer wrote:
 On Mon, Nov 10, 2008 at 05:24:34PM -0600, Hollis Blanchard wrote:
  The current CHIP11 errata truncates the device tree memory node, and subtracts
 (hardcoded) 4096 bytes. This breaks kernels with larger PAGE_SIZE, since the
 bootmem allocator assumes that total memory is a multiple of PAGE_SIZE.
 
 Instead, use a device tree memory reservation to reserve only the 256 bytes
 actually affected by the errata, leaving the total memory size unaltered.
 
 Signed-off-by: Hollis Blanchard [EMAIL PROTECTED]
 
 ---
 
 Using large pages results in a huge performance improvement for KVM, and this
 patch is required to make Ilya's large page patch work. David and/or Josh,
 please apply.
 
 The patch looks fine to me, and once David acks the fdt parts I'll apply
 for -next.  I'll try to do some testing later today as well, since I have
 one of the boards with the Errata.

Um.. I sent something about the fdt stuff a while back, but didn't get a
response.



Re: [PATCH] [v3] powerpc/4xx: work around CHIP11 errata in a more PAGE_SIZE-friendly way

2008-11-11 Thread David Gibson
On Tue, Nov 11, 2008 at 06:06:46PM -0600, Hollis Blanchard wrote:
 The current CHIP11 errata truncates the device tree memory node, and subtracts
 (hardcoded) 4096 bytes. This breaks kernels with larger PAGE_SIZE, since the
 bootmem allocator assumes that total memory is a multiple of PAGE_SIZE.
 
 Instead, use a device tree memory reservation to reserve only the 256 bytes
 actually affected by the errata, leaving the total memory size unaltered.
 
 Signed-off-by: Hollis Blanchard [EMAIL PROTECTED]

libfdt usage changes look fine to me.

Acked-by: David Gibson [EMAIL PROTECTED]
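For reference, the reservation style Hollis's patch uses is expressed in device tree source as a `/memreserve/` entry covering just the affected range. The address below is purely illustrative; the real patch computes the errata-affected location from the memory node rather than hard-coding it:

```dts
/* Illustrative only: reserve the 256 bytes hit by the CHIP11 errata
 * instead of shrinking the memory node.  0x0fffff00 is a made-up
 * address standing in for the errata-affected location. */
/memreserve/ 0x0fffff00 0x100;
```

Because the reservation sits alongside an unaltered memory node, total memory stays a multiple of PAGE_SIZE and the bootmem allocator's assumption holds.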
