Re: [PATCH v2] KVM: PPC: Exit guest upon fatal machine check exception
> diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S > b/arch/powerpc/kvm/book3s_hv_rmhandlers.S > index b98889e..f43c124 100644 > --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S > +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S > @@ -147,7 +147,6 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S) > addi r1, r1, 112 > ld r7, HSTATE_HOST_MSR(r13) > Seems like the comment a little above this should be updated to reflect the fact that this path no longer handles machine checks. Apart from that and the access width bug Thomas spotted, it looks ok to me. -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson
Re: [PATCH kernel 5/9] KVM: PPC: Account TCE-containing pages in locked_vm
On Tue, Sep 15, 2015 at 08:49:35PM +1000, Alexey Kardashevskiy wrote: > At the moment pages used for TCE tables (in addition to pages addressed > by TCEs) are not counted in locked_vm counter so a malicious userspace > tool can call ioctl(KVM_CREATE_SPAPR_TCE) as many times as RLIMIT_NOFILE and > lock a lot of memory. > > This adds counting for pages used for TCE tables. > > This counts the number of pages required for a table plus pages for > the kvmppc_spapr_tce_table struct (TCE table descriptor) itself. Hmm. Does it make sense to account for the descriptor struct itself? I mean there are lots of little structures the kernel will allocate on a process's behalf, and I don't think most of them get accounted against locked vm. > This does not change the amount of (de)allocated memory. > > Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru> > --- > arch/powerpc/kvm/book3s_64_vio.c | 51 > +++- > 1 file changed, 50 insertions(+), 1 deletion(-) > > diff --git a/arch/powerpc/kvm/book3s_64_vio.c > b/arch/powerpc/kvm/book3s_64_vio.c > index 9526c34..b70787d 100644 > --- a/arch/powerpc/kvm/book3s_64_vio.c > +++ b/arch/powerpc/kvm/book3s_64_vio.c > @@ -45,13 +45,56 @@ static long kvmppc_stt_npages(unsigned long window_size) >* sizeof(u64), PAGE_SIZE) / PAGE_SIZE; > } > > +static long kvmppc_account_memlimit(long npages, bool inc) > +{ > + long ret = 0; > + const long bytes = sizeof(struct kvmppc_spapr_tce_table) + > + (abs(npages) * sizeof(struct page *)); > + const long stt_pages = ALIGN(bytes, PAGE_SIZE) / PAGE_SIZE; Overflow checks might be useful here, I'm not sure. 
> + > + if (!current || !current->mm) > + return ret; /* process exited */ > + > + npages += stt_pages; > + > + down_write(&current->mm->mmap_sem); > + > + if (inc) { > + long locked, lock_limit; > + > + locked = current->mm->locked_vm + npages; > + lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT; > + if (locked > lock_limit && !capable(CAP_IPC_LOCK)) > + ret = -ENOMEM; > + else > + current->mm->locked_vm += npages; > + } else { > + if (npages > current->mm->locked_vm) Should this be a WARN_ON? It means something has gone wrong previously in the accounting, doesn't it? > + npages = current->mm->locked_vm; > + > + current->mm->locked_vm -= npages; > + } > + > + pr_debug("[%d] RLIMIT_MEMLOCK KVM %c%ld %ld/%ld%s\n", current->pid, > + inc ? '+' : '-', > + npages << PAGE_SHIFT, > + current->mm->locked_vm << PAGE_SHIFT, > + rlimit(RLIMIT_MEMLOCK), > + ret ? " - exceeded" : ""); > + > + up_write(&current->mm->mmap_sem); > + > + return ret; > +} > + > static void release_spapr_tce_table(struct rcu_head *head) > { > struct kvmppc_spapr_tce_table *stt = container_of(head, > struct kvmppc_spapr_tce_table, rcu); > int i; > + long npages = kvmppc_stt_npages(stt->window_size); > > - for (i = 0; i < kvmppc_stt_npages(stt->window_size); i++) > + for (i = 0; i < npages; i++) > __free_page(stt->pages[i]); > > kfree(stt); > @@ -89,6 +132,7 @@ static int kvm_spapr_tce_release(struct inode *inode, > struct file *filp) > > kvm_put_kvm(stt->kvm); > > + kvmppc_account_memlimit(kvmppc_stt_npages(stt->window_size), false); > call_rcu(&stt->rcu, release_spapr_tce_table); > > return 0; > @@ -114,6 +158,11 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm, > } > > npages = kvmppc_stt_npages(args->window_size); > + ret = kvmppc_account_memlimit(npages, true); > + if (ret) { > + stt = NULL; > + goto fail; > + } > > stt = kzalloc(sizeof(*stt) + npages * sizeof(struct page *), > GFP_KERNEL); -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson
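[Editor's note: the RLIMIT_MEMLOCK accounting flow under discussion can be modeled in plain user-space C. This is a hedged sketch of the inc/dec logic only — `struct mm_sim`, `account_memlimit()` and `stt_pages()` are stand-ins for the kernel's per-mm state and helpers, and the `capable(CAP_IPC_LOCK)` escape hatch is omitted.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PAGE_SIZE 4096UL
#define ALIGN_UP(x, a) (((x) + (a) - 1) / (a) * (a))

struct mm_sim {
    long locked_vm;   /* pages currently accounted against the limit */
    long lock_limit;  /* RLIMIT_MEMLOCK expressed in pages */
};

/* Mirror of the inc/dec accounting flow: 0 on success, -1 as an ENOMEM stand-in. */
static int account_memlimit(struct mm_sim *mm, long npages, bool inc)
{
    if (inc) {
        long locked = mm->locked_vm + npages;
        if (locked > mm->lock_limit)
            return -1;               /* over the memlock limit */
        mm->locked_vm = locked;
    } else {
        if (npages > mm->locked_vm)  /* accounting went wrong earlier: clamp */
            npages = mm->locked_vm;
        mm->locked_vm -= npages;
    }
    return 0;
}

/* Pages needed for a table descriptor plus its page-pointer array. */
static long stt_pages(long npages, size_t desc_size)
{
    long bytes = (long)desc_size + npages * (long)sizeof(void *);
    return ALIGN_UP(bytes, PAGE_SIZE) / PAGE_SIZE;
}
```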
Re: [PATCH kernel 7/9] KVM: PPC: Move reusable bits of H_PUT_TCE handler to helpers
> + return (u64 *) page_address(page); > +} > + > +/* > + * Handles TCE requests for emulated devices. > + * Puts guest TCE values to the table and expects user space to convert them. > + * Called in both real and virtual modes. > + * Cannot fail so kvmppc_tce_validate must be called before it. > + * > + * WARNING: This will be called in real-mode on HV KVM and virtual > + * mode on PR KVM > + */ > +void kvmppc_tce_put(struct kvmppc_spapr_tce_table *stt, > + unsigned long idx, unsigned long tce) > +{ > + struct page *page; > + u64 *tbl; > + > + page = stt->pages[idx / TCES_PER_PAGE]; > + tbl = kvmppc_page_address(page); > + > + tbl[idx % TCES_PER_PAGE] = tce; > +} > +EXPORT_SYMBOL_GPL(kvmppc_tce_put); > > /* WARNING: This will be called in real-mode on HV KVM and virtual > * mode on PR KVM > @@ -85,9 +159,6 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long > liobn, > { > struct kvmppc_spapr_tce_table *stt = kvmppc_find_table(vcpu, liobn); > long ret = H_TOO_HARD; > - unsigned long idx; > - struct page *page; > - u64 *tbl; > > /* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */ > /* liobn, ioba, tce); */ > @@ -99,13 +170,11 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned > long liobn, > if (ret) > return ret; > > - idx = ioba >> IOMMU_PAGE_SHIFT_4K; > - page = stt->pages[idx / TCES_PER_PAGE]; > - tbl = (u64 *)page_address(page); > + ret = kvmppc_tce_validate(stt, tce); > + if (ret) > + return ret; > > - /* FIXME: Need to validate the TCE itself */ > - /* udbg_printf("tce @ %p\n", &tbl[idx % TCES_PER_PAGE]); */ > - tbl[idx % TCES_PER_PAGE] = tce; > + kvmppc_tce_put(stt, ioba >> IOMMU_PAGE_SHIFT_4K, tce); > > return ret; > } -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson
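[Editor's note: the `idx / TCES_PER_PAGE` and `idx % TCES_PER_PAGE` split used by kvmppc_tce_put() is plain index arithmetic, which a small user-space model makes concrete. The static arrays below are toy stand-ins for the `stt->pages` page array.]

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE 4096UL
#define TCES_PER_PAGE (PAGE_SIZE / sizeof(uint64_t))  /* 512 with 4K pages */

/* A toy two-page "TCE table": pages[i] plays the role of
 * page_address(stt->pages[i]) in the kernel code. */
static uint64_t page0[TCES_PER_PAGE], page1[TCES_PER_PAGE];
static uint64_t *pages[] = { page0, page1 };

static void tce_put(unsigned long idx, uint64_t tce)
{
    uint64_t *tbl = pages[idx / TCES_PER_PAGE]; /* which backing page */
    tbl[idx % TCES_PER_PAGE] = tce;             /* slot within that page */
}

static uint64_t tce_get(unsigned long idx)
{
    return pages[idx / TCES_PER_PAGE][idx % TCES_PER_PAGE];
}
```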
Re: [PATCH kernel 6/9] KVM: PPC: Replace SPAPR_TCE_SHIFT with IOMMU_PAGE_SHIFT_4K
On Tue, Sep 15, 2015 at 08:49:36PM +1000, Alexey Kardashevskiy wrote: > SPAPR_TCE_SHIFT is used in few places only and since IOMMU_PAGE_SHIFT_4K > can be easily used instead, remove SPAPR_TCE_SHIFT. > > Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru> Reviewed-by: David Gibson <da...@gibson.dropbear.id.au> > --- > arch/powerpc/include/asm/kvm_book3s_64.h | 2 -- > arch/powerpc/kvm/book3s_64_vio.c | 3 ++- > arch/powerpc/kvm/book3s_64_vio_hv.c | 4 ++-- > 3 files changed, 4 insertions(+), 5 deletions(-) > > diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h > b/arch/powerpc/include/asm/kvm_book3s_64.h > index 2aa79c8..7529aab 100644 > --- a/arch/powerpc/include/asm/kvm_book3s_64.h > +++ b/arch/powerpc/include/asm/kvm_book3s_64.h > @@ -33,8 +33,6 @@ static inline void svcpu_put(struct > kvmppc_book3s_shadow_vcpu *svcpu) > } > #endif > > -#define SPAPR_TCE_SHIFT 12 > - > #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE > #define KVM_DEFAULT_HPT_ORDER24 /* 16MB HPT by default */ > #endif > diff --git a/arch/powerpc/kvm/book3s_64_vio.c > b/arch/powerpc/kvm/book3s_64_vio.c > index b70787d..e347856 100644 > --- a/arch/powerpc/kvm/book3s_64_vio.c > +++ b/arch/powerpc/kvm/book3s_64_vio.c > @@ -36,12 +36,13 @@ > #include > #include > #include > +#include > > #define TCES_PER_PAGE(PAGE_SIZE / sizeof(u64)) > > static long kvmppc_stt_npages(unsigned long window_size) > { > - return ALIGN((window_size >> SPAPR_TCE_SHIFT) > + return ALIGN((window_size >> IOMMU_PAGE_SHIFT_4K) >* sizeof(u64), PAGE_SIZE) / PAGE_SIZE; > } > > diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c > b/arch/powerpc/kvm/book3s_64_vio_hv.c > index 8ae12ac..6cf1ab3 100644 > --- a/arch/powerpc/kvm/book3s_64_vio_hv.c > +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c > @@ -99,7 +99,7 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long > liobn, > if (ret) > return ret; > > - idx = ioba >> SPAPR_TCE_SHIFT; > + idx = ioba >> IOMMU_PAGE_SHIFT_4K; > page = stt->pages[idx / TCES_PER_PAGE]; > tbl = (u64 *)page_address(page); 
> > @@ -121,7 +121,7 @@ long kvmppc_h_get_tce(struct kvm_vcpu *vcpu, unsigned > long liobn, > if (stt) { > ret = kvmppc_ioba_validate(stt, ioba, 1); > if (!ret) { > - unsigned long idx = ioba >> SPAPR_TCE_SHIFT; > + unsigned long idx = ioba >> IOMMU_PAGE_SHIFT_4K; > struct page *page = stt->pages[idx / TCES_PER_PAGE]; > u64 *tbl = (u64 *)page_address(page); > -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson
Re: [PATCH kernel 9/9] KVM: PPC: Add support for multiple-TCE hcalls
On Tue, Sep 15, 2015 at 08:49:39PM +1000, Alexey Kardashevskiy wrote: > This adds real and virtual mode handlers for the H_PUT_TCE_INDIRECT and > H_STUFF_TCE hypercalls for user space emulated devices such as IBMVIO > devices or emulated PCI. These calls allow adding multiple entries > (up to 512) into the TCE table in one call which saves time on > transition between kernel and user space. > > This implements the KVM_CAP_PPC_MULTITCE capability. When present, > the kernel will try handling H_PUT_TCE_INDIRECT and H_STUFF_TCE. > If they can not be handled by the kernel, they are passed on to > the user space. The user space still has to have an implementation > for these. > > Both HV and PR-style KVM are supported. > > Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru> > --- > Documentation/virtual/kvm/api.txt | 25 ++ > arch/powerpc/include/asm/kvm_ppc.h | 12 +++ > arch/powerpc/kvm/book3s_64_vio.c| 111 +++- > arch/powerpc/kvm/book3s_64_vio_hv.c | 145 > ++-- > arch/powerpc/kvm/book3s_hv.c| 26 +- > arch/powerpc/kvm/book3s_hv_rmhandlers.S | 6 +- > arch/powerpc/kvm/book3s_pr_papr.c | 35 > arch/powerpc/kvm/powerpc.c | 3 + > 8 files changed, 350 insertions(+), 13 deletions(-) > > diff --git a/Documentation/virtual/kvm/api.txt > b/Documentation/virtual/kvm/api.txt > index d86d831..593c62a 100644 > --- a/Documentation/virtual/kvm/api.txt > +++ b/Documentation/virtual/kvm/api.txt > @@ -3019,6 +3019,31 @@ Returns: 0 on success, -1 on error > > Queues an SMI on the thread's vcpu. > > +4.97 KVM_CAP_PPC_MULTITCE > + > +Capability: KVM_CAP_PPC_MULTITCE > +Architectures: ppc > +Type: vm > + > +This capability means the kernel is capable of handling hypercalls > +H_PUT_TCE_INDIRECT and H_STUFF_TCE without passing those into the user > +space. This significantly accelerates DMA operations for PPC KVM guests.
> +User space should expect that its handlers for these hypercalls > +are not going to be called if user space previously registered LIOBN > +in KVM (via KVM_CREATE_SPAPR_TCE or similar calls). > + > +In order to enable H_PUT_TCE_INDIRECT and H_STUFF_TCE use in the guest, > +user space might have to advertise it for the guest. For example, > +IBM pSeries (sPAPR) guest starts using them if "hcall-multi-tce" is > +present in the "ibm,hypertas-functions" device-tree property. > + > +The hypercalls mentioned above may or may not be processed successfully > +in the kernel based fast path. If they can not be handled by the kernel, > +they will get passed on to user space. So user space still has to have > +an implementation for these despite the in kernel acceleration. > + > +This capability is always enabled. > + > 5. The kvm_run structure > > > diff --git a/arch/powerpc/include/asm/kvm_ppc.h > b/arch/powerpc/include/asm/kvm_ppc.h > index fcde896..e5b968e 100644 > --- a/arch/powerpc/include/asm/kvm_ppc.h > +++ b/arch/powerpc/include/asm/kvm_ppc.h > @@ -166,12 +166,24 @@ extern int kvmppc_pseries_do_hcall(struct kvm_vcpu > *vcpu); > > extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm, > struct kvm_create_spapr_tce *args); > +extern struct kvmppc_spapr_tce_table *kvmppc_find_table( > + struct kvm_vcpu *vcpu, unsigned long liobn); > extern long kvmppc_ioba_validate(struct kvmppc_spapr_tce_table *stt, > unsigned long ioba, unsigned long npages); > extern long kvmppc_tce_validate(struct kvmppc_spapr_tce_table *tt, > unsigned long tce); > +extern long kvmppc_gpa_to_ua(struct kvm *kvm, unsigned long gpa, > + unsigned long *ua, unsigned long **prmap); > +extern void kvmppc_tce_put(struct kvmppc_spapr_tce_table *tt, > + unsigned long idx, unsigned long tce); > extern long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn, >unsigned long ioba, unsigned long tce); > +extern long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu, > + unsigned long liobn, unsigned 
long ioba, > + unsigned long tce_list, unsigned long npages); > +extern long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu, > + unsigned long liobn, unsigned long ioba, > + unsigned long tce_value, unsigned long npages); > extern long kvmppc_h_get_tce(struct kvm_vcpu *vcpu, unsigned long liobn, >unsigned long ioba); > extern struct page *kvm_alloc_hpt(unsigned long nr_pages); > diff --git a/arch/powerpc/kvm/book3s_64_vio.c > b/arch/powerpc/kvm/book3s_64_vio.c > in
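[Editor's note: the benefit Alexey describes — one guest/host transition per batch instead of one per TCE — can be illustrated with a toy model. Everything below is a stand-in (a flat array for the table, `-1` for H_PARAMETER); it sketches only the batching shape of an H_STUFF_TCE-style call, not the real KVM handler.]

```c
#include <assert.h>
#include <stdint.h>

#define IOMMU_PAGE_SHIFT_4K 12

static uint64_t tbl[1024];          /* stand-in TCE table */
static unsigned long hcall_count;   /* guest<->host transitions taken */

/* One hypercall stuffs npages identical TCEs starting at ioba, instead
 * of npages separate H_PUT_TCE round trips. */
static long h_stuff_tce(unsigned long ioba, uint64_t tce_value,
                        unsigned long npages)
{
    unsigned long idx = ioba >> IOMMU_PAGE_SHIFT_4K;

    if (idx + npages > sizeof(tbl) / sizeof(tbl[0]))
        return -1;                  /* H_PARAMETER stand-in */

    hcall_count++;                  /* one transition for the whole batch */
    for (unsigned long i = 0; i < npages; i++)
        tbl[idx + i] = tce_value;
    return 0;
}
```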
Re: [PATCH kernel 3/9] KVM: PPC: Rework H_PUT_TCE/H_GET_TCE handlers
This relies on the fact that kvmppc_ioba_validate() would have returned H_SUCCESS some distance above. This seems rather fragile if you insert anything else which alters ret in between. Since this is the success path, I think it's clearer to explicitly return H_SUCCESS. > } > EXPORT_SYMBOL_GPL(kvmppc_h_put_tce); > > long kvmppc_h_get_tce(struct kvm_vcpu *vcpu, unsigned long liobn, > unsigned long ioba) > { > - struct kvm *kvm = vcpu->kvm; > - struct kvmppc_spapr_tce_table *stt; > + struct kvmppc_spapr_tce_table *stt = kvmppc_find_table(vcpu, liobn); > + long ret = H_TOO_HARD; > > - list_for_each_entry(stt, &kvm->arch.spapr_tce_tables, list) { > - if (stt->liobn == liobn) { > + > + if (stt) { > + ret = kvmppc_ioba_validate(stt, ioba, 1); > + if (!ret) { This relies on the fact that H_SUCCESS == 0, I'm not sure if that's something we're already doing elsewhere or not. > unsigned long idx = ioba >> SPAPR_TCE_SHIFT; > - struct page *page; > - u64 *tbl; > - > - if (ioba >= stt->window_size) > - return H_PARAMETER; > - > - page = stt->pages[idx / TCES_PER_PAGE]; > - tbl = (u64 *)page_address(page); > + struct page *page = stt->pages[idx / TCES_PER_PAGE]; > + u64 *tbl = (u64 *)page_address(page); > > vcpu->arch.gpr[4] = tbl[idx % TCES_PER_PAGE]; > - return H_SUCCESS; > } > } > > - /* Didn't find the liobn, punt it to userspace */ > - return H_TOO_HARD; > + > + return ret; > } > EXPORT_SYMBOL_GPL(kvmppc_h_get_tce); -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson
Re: [PATCH kernel 1/9] rcu: Define notrace version of list_for_each_entry_rcu
On Tue, Sep 15, 2015 at 08:49:31PM +1000, Alexey Kardashevskiy wrote: > This defines list_for_each_entry_rcu_notrace and list_entry_rcu_notrace > which use rcu_dereference_raw_notrace instead of rcu_dereference_raw. > This allows using list_for_each_entry_rcu_notrace in real mode (MMU is off). > > Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru> Reviewed-by: David Gibson <da...@gibson.dropbear.id.au> > --- > include/linux/rculist.h | 38 ++ > 1 file changed, 38 insertions(+) > > diff --git a/include/linux/rculist.h b/include/linux/rculist.h > index 17c6b1f..439c4d7 100644 > --- a/include/linux/rculist.h > +++ b/include/linux/rculist.h > @@ -253,6 +253,25 @@ static inline void list_splice_init_rcu(struct list_head > *list, > }) > > /** > + * list_entry_rcu_notrace - get the struct for this entry > + * @ptr:the list_head pointer. > + * @type: the type of the struct this is embedded in. > + * @member: the name of the list_struct within the struct. > + * > + * This primitive may safely run concurrently with the _rcu list-mutation > + * primitives such as list_add_rcu() as long as it's guarded by > rcu_read_lock(). > + * > + * This is the same as list_entry_rcu() except that it does > + * not do any RCU debugging or tracing. > + */ > +#define list_entry_rcu_notrace(ptr, type, member) \ > +({ \ > + typeof(*ptr) __rcu *__ptr = (typeof(*ptr) __rcu __force *)ptr; \ > + container_of((typeof(ptr))rcu_dereference_raw_notrace(__ptr), \ > + type, member); \ > +}) > + > +/** > * Where are list_empty_rcu() and list_first_entry_rcu()? > * > * Implementing those functions following their counterparts list_empty() and > @@ -308,6 +327,25 @@ static inline void list_splice_init_rcu(struct list_head > *list, > pos = list_entry_rcu(pos->member.next, typeof(*pos), member)) > > /** > + * list_for_each_entry_rcu_notrace - iterate over rcu list of given type > + * @pos: the type * to use as a loop cursor. > + * @head:the head for your list. 
> + * @member: the name of the list_struct within the struct. > + * > + * This list-traversal primitive may safely run concurrently with > + * the _rcu list-mutation primitives such as list_add_rcu() > + * as long as the traversal is guarded by rcu_read_lock(). > + * > + * This is the same as list_for_each_entry_rcu() except that it does > + * not do any RCU debugging or tracing. > + */ > +#define list_for_each_entry_rcu_notrace(pos, head, member) \ > + for (pos = list_entry_rcu_notrace((head)->next, typeof(*pos), member); \ > + &pos->member != (head); \ > + pos = list_entry_rcu_notrace(pos->member.next, typeof(*pos), \ > + member)) > + > +/** > * list_for_each_entry_continue_rcu - continue iteration over list of given > type > * @pos: the type * to use as a loop cursor. > * @head: the head for your list. -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson
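[Editor's note: the container_of-based traversal that these _notrace macros share with the plain RCU variants can be sketched in ordinary user-space C. This is a simplified, non-RCU model: there is no rcu_dereference step, and the element type is passed explicitly since standard C lacks the kernel's typeof usage.]

```c
#include <assert.h>
#include <stddef.h>

struct list_head { struct list_head *next, *prev; };

/* Recover the containing struct from a pointer to its embedded list_head. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

#define list_entry(ptr, type, member) container_of(ptr, type, member)

/* Non-RCU analogue of list_for_each_entry; the _rcu(_notrace) versions
 * differ only in which dereference helper fetches ->next. */
#define list_for_each_entry(pos, head, member, type)              \
    for (pos = list_entry((head)->next, type, member);            \
         &pos->member != (head);                                  \
         pos = list_entry(pos->member.next, type, member))

struct item { int val; struct list_head node; };

static void list_add_tail(struct list_head *new, struct list_head *head)
{
    new->prev = head->prev; new->next = head;
    head->prev->next = new; head->prev = new;
}
```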
Re: [PATCH kernel 4/9] KVM: PPC: Use RCU for arch.spapr_tce_tables
On Tue, Sep 15, 2015 at 08:49:34PM +1000, Alexey Kardashevskiy wrote: > At the moment spapr_tce_tables is not protected against races. This makes > use of RCU-variants of list helpers. As some bits are executed in real > mode, this makes use of the just introduced list_for_each_entry_rcu_notrace(). > > This converts release_spapr_tce_table() to an RCU scheduled handler. > > Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru> Looks correct, to my limited knowledge of RCU. Reviewed-by: David Gibson <da...@gibson.dropbear.id.au> -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson
Re: [PATCH kernel 2/9] KVM: PPC: Make real_vmalloc_addr() public
On Tue, Sep 15, 2015 at 08:49:32PM +1000, Alexey Kardashevskiy wrote: > This helper translates vmalloc'd addresses to linear addresses. > It is only used by the KVM MMU code now and resides in the HV KVM code. > We will need it further in the TCE code and the DMA memory preregistration > code called in real mode. > > This makes real_vmalloc_addr() public and moves it to the powerpc code as > it does not do anything special for KVM. > > Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru> Hmm, I have a couple of small concerns. First, I'm not sure if the name is clear enough for a public function. Second, I'm not sure if mmu-hash64.h is the right place for it. This is still a function with very specific and limited usage, I wonder if we should have somewhere else for such special real mode helpers. Paulus, thoughts? -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson
Re: [PATCH] KVM: PPC: Fix emulation of H_SET_DABR/X on POWER8
On Fri, 20 Nov 2015 09:11:45 +0100 Thomas Huth <th...@redhat.com> wrote: > In the old DABR register, the BT (Breakpoint Translation) bit > is bit number 61. In the new DAWRX register, the WT (Watchpoint > Translation) bit is bit number 59. So to move the DABR-BT bit > into the position of the DAWRX-WT bit, it has to be shifted by > two, not only by one. This fixes hardware watchpoints in gdb of > older guests that only use the H_SET_DABR/X interface instead > of the new H_SET_MODE interface. > > Signed-off-by: Thomas Huth <th...@redhat.com> Reviewed-by: David Gibson <dgib...@redhat.com> > --- > arch/powerpc/kvm/book3s_hv_rmhandlers.S | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S > b/arch/powerpc/kvm/book3s_hv_rmhandlers.S > index b98889e..3983b87 100644 > --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S > +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S > @@ -2143,7 +2143,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S) > > /* Emulate H_SET_DABR/X on P8 for the sake of compat mode guests */ > 2: rlwimi r5, r4, 5, DAWRX_DR | DAWRX_DW > - rlwimi r5, r4, 1, DAWRX_WT > + rlwimi r5, r4, 2, DAWRX_WT > clrrdi r4, r4, 3 > std r4, VCPU_DAWR(r3) > std r5, VCPU_DAWRX(r3) > -- > 1.8.3.1 > -- David Gibson <dgib...@redhat.com> Senior Software Engineer, Virtualization, Red Hat
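[Editor's note: the shift-by-two in Thomas's fix can be checked numerically. In IBM (big-endian) bit numbering on a 64-bit register, bit n has value 1 << (63 - n), so moving a bit from position 61 to position 59 is a left shift by 2. The helpers below are hypothetical illustrations; the real code uses the rlwimi rotate-and-insert instruction rather than a plain shift.]

```c
#include <assert.h>
#include <stdint.h>

/* Value of IBM-numbered bit n in a 64-bit register (bit 0 = MSB). */
static uint64_t ibm_bit(int n)
{
    return 1ULL << (63 - n);
}

/* Move the DABR BT bit (IBM bit 61) into the DAWRX WT position (IBM bit 59). */
static uint64_t dabr_bt_to_dawrx_wt(uint64_t dabr)
{
    return (dabr & ibm_bit(61)) << 2;  /* shift by two, not one */
}
```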
Re: [PATCH] KVM: PPC: Exit guest upon fatal machine check exception
On Thu, Nov 12, 2015 at 11:22:29PM +0530, Aravinda Prasad wrote: > > > On Thursday 12 November 2015 10:13 AM, David Gibson wrote: > > On Thu, Nov 12, 2015 at 10:02:10AM +0530, Aravinda Prasad wrote: > >> > >> > >> On Thursday 12 November 2015 09:08 AM, David Gibson wrote: > >>> On Thu, Nov 12, 2015 at 01:24:19PM +1100, Daniel Axtens wrote: > >>>> Aravinda Prasad <aravi...@linux.vnet.ibm.com> writes: > >>>> > >>>>> This patch modifies KVM to cause a guest exit with > >>>>> KVM_EXIT_NMI instead of immediately delivering a 0x200 > >>>>> interrupt to guest upon machine check exception in > >>>>> guest address. Exiting the guest enables QEMU to build > >>>>> error log and deliver machine check exception to guest > >>>>> OS (either via guest OS registered machine check > >>>>> handler or via 0x200 guest OS interrupt vector). > >>>>> > >>>>> This approach simplifies the delivering of machine > >>>>> check exception to guest OS compared to the earlier approach > >>>>> of KVM directly invoking 0x200 guest interrupt vector. > >>>>> In the earlier approach QEMU patched the 0x200 interrupt > >>>>> vector during boot. The patched code at 0x200 issued a > >>>>> private hcall to pass the control to QEMU to build the > >>>>> error log. > >>>>> > >>>>> This design/approach is based on the feedback for the > >>>>> QEMU patches to handle machine check exception. Details > >>>>> of earlier approach of handling machine check exception > >>>>> in QEMU and related discussions can be found at: > >>>>> > >>>>> https://lists.nongnu.org/archive/html/qemu-devel/2014-11/msg00813.html > >>>> > >>>> I've poked at the MCE code, but not the KVM MCE code, so I may be > >>>> mistaken here, but I'm not clear on how this handles errors that the > >>>> guest can recover without terminating. > >>>> > >>>> For example, a Linux guest can handle a UE in guest userspace by killing > >>>> the guest process. A hypthetical non-linux guest with a microkernel > >>>> could even survive UEs in drivers. 
> >>>> > >>>> It sounds from your patch like you're changing this behaviour. Is this > >>>> right? > >>> > >>> So, IIUC. Once the qemu pieces are in place as well it shouldn't > >>> change this behaviour: KVM will exit to qemu, qemu will log the error > >>> information (new), then reinject the MC to the guest which can still > >>> handle it as you describe above. > >> > >> Yes. With KVM and QEMU both in place this will not change the behavior. > >> QEMU will inject the UE to guest and the guest handles the UE based on > >> where it occurred. For example if an UE happens in a guest process > >> address space, that process will be killed. > >> > >>> > >>> But, there could be a problem if you have a new kernel with an old > >>> qemu, in that case qemu might not understand the new exit type and > >>> treat it as a fatal error, even though the guest could actually cope > >>> with it. > >> > >> In case of new kernel and old QEMU, the guest terminates as old QEMU > >> does not understand the NMI exit reason. However, this is the case with > >> old kernel and old QEMU as they do not handle UE belonging to guest. The > >> difference is that the guest kernel terminates with different error > >> code. > > > > Ok.. assuming the guest has code to handle the UE in 0x200, why would > > the guest terminate with old kernel and old qemu? I haven't quite > > followed the logic. > > I overlooked it. I think I need to take into consideration whether guest > issued "ibm, nmi-register". If the guest has issued "ibm, nmi-register" > then we should not jump to 0x200 upon UE. With the old kernel and old > QEMU this is broken as we always jump to 0x200. > > However, if the guest has not issued "ibm, nmi-register" then upon UE we > should jump to 0x200. If new kernel is used with old QEMU this > functionality breaks as it causes guest to terminate with unhandled NMI > exit. > > So thinking whether qemu should explicitly enable the new NMI > behavior. 
So, I think the reasoning above tends towards having qemu control the MC
Re: [PATCH] KVM: PPC: Exit guest upon fatal machine check exception
On Thu, Nov 12, 2015 at 10:02:10AM +0530, Aravinda Prasad wrote: > > > On Thursday 12 November 2015 09:08 AM, David Gibson wrote: > > On Thu, Nov 12, 2015 at 01:24:19PM +1100, Daniel Axtens wrote: > >> Aravinda Prasad <aravi...@linux.vnet.ibm.com> writes: > >> > >>> This patch modifies KVM to cause a guest exit with > >>> KVM_EXIT_NMI instead of immediately delivering a 0x200 > >>> interrupt to guest upon machine check exception in > >>> guest address. Exiting the guest enables QEMU to build > >>> error log and deliver machine check exception to guest > >>> OS (either via guest OS registered machine check > >>> handler or via 0x200 guest OS interrupt vector). > >>> > >>> This approach simplifies the delivering of machine > >>> check exception to guest OS compared to the earlier approach > >>> of KVM directly invoking 0x200 guest interrupt vector. > >>> In the earlier approach QEMU patched the 0x200 interrupt > >>> vector during boot. The patched code at 0x200 issued a > >>> private hcall to pass the control to QEMU to build the > >>> error log. > >>> > >>> This design/approach is based on the feedback for the > >>> QEMU patches to handle machine check exception. Details > >>> of earlier approach of handling machine check exception > >>> in QEMU and related discussions can be found at: > >>> > >>> https://lists.nongnu.org/archive/html/qemu-devel/2014-11/msg00813.html > >> > >> I've poked at the MCE code, but not the KVM MCE code, so I may be > >> mistaken here, but I'm not clear on how this handles errors that the > >> guest can recover without terminating. > >> > >> For example, a Linux guest can handle a UE in guest userspace by killing > >> the guest process. A hypthetical non-linux guest with a microkernel > >> could even survive UEs in drivers. > >> > >> It sounds from your patch like you're changing this behaviour. Is this > >> right? > > > > So, IIUC. 
Once the qemu pieces are in place as well it shouldn't > > change this behaviour: KVM will exit to qemu, qemu will log the error > > information (new), then reinject the MC to the guest which can still > > handle it as you describe above. > > Yes. With KVM and QEMU both in place this will not change the behavior. > QEMU will inject the UE to guest and the guest handles the UE based on > where it occurred. For example if an UE happens in a guest process > address space, that process will be killed. > > > > > But, there could be a problem if you have a new kernel with an old > > qemu, in that case qemu might not understand the new exit type and > > treat it as a fatal error, even though the guest could actually cope > > with it. > > In case of new kernel and old QEMU, the guest terminates as old QEMU > does not understand the NMI exit reason. However, this is the case with > old kernel and old QEMU as they do not handle UE belonging to guest. The > difference is that the guest kernel terminates with different error > code. Ok.. assuming the guest has code to handle the UE in 0x200, why would the guest terminate with old kernel and old qemu? I haven't quite followed the logic. > > old kernel and old QEMU -> guest panics [1] irrespective of where UE > happened in guest address space. > old kernel and new QEMU -> guest panics. same as above. > new kernel and old QEMU -> guest terminates with unhandled NMI error > irrespective of where UE happened in guest > new kernel and new QEMU -> guest handles UEs in process address space > by killing the process. guest terminates > for UEs in guest kernel address space. > > [1] https://lists.ozlabs.org/pipermail/linuxppc-dev/2014-June/118329.html > > > > > Aravinda, do we need to change this so that qemu has to explicitly > > enable the new NMI behaviour? Or have I missed something that will > > make that case work already. > > I think we don't need to explicitly enable the new behavior.
With new > kernel and new QEMU this should just work. As mentioned above this is > already broken for old kernel/QEMU. Any thoughts? > > Regards, > Aravinda > > > > ___ > > Linuxppc-dev mailing list > > linuxppc-...@lists.ozlabs.org > > https://lists.ozlabs.org/listinfo/linuxppc-dev > > > -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson
Re: [PATCH] KVM: PPC: Exit guest upon fatal machine check exception
et later based on trap). For handled errors > + * (non-fatal), go back to guest execution with current HSRR0 > + * instead of exiting the guest. This approach will cause > + * the guest to exit in case of fatal machine check error. > */ > - ld r10, VCPU_PC(r9) > + bne 2f /* Continue guest execution? */ > + /* If not, exit the guest. SRR0/1 are already set */ > + b mc_cont > +2: ld r10, VCPU_PC(r9) > ld r11, VCPU_MSR(r9) > - bne 2f /* Continue guest execution. */ > - /* If not, deliver a machine check. SRR0/1 are already set */ > - li r10, BOOK3S_INTERRUPT_MACHINE_CHECK > - ld r11, VCPU_MSR(r9) > - bl kvmppc_msr_interrupt > -2: b fast_interrupt_c_return > + b fast_interrupt_c_return > > /* > * Check the reason we woke from nap, and take appropriate action. > -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson
Re: [PATCH] KVM: PPC: Exit guest upon fatal machine check exception
On Thu, Nov 12, 2015 at 01:24:19PM +1100, Daniel Axtens wrote: > Aravinda Prasad <aravi...@linux.vnet.ibm.com> writes: > > > This patch modifies KVM to cause a guest exit with > > KVM_EXIT_NMI instead of immediately delivering a 0x200 > > interrupt to guest upon machine check exception in > > guest address. Exiting the guest enables QEMU to build > > error log and deliver machine check exception to guest > > OS (either via guest OS registered machine check > > handler or via 0x200 guest OS interrupt vector). > > > > This approach simplifies the delivering of machine > > check exception to guest OS compared to the earlier approach > > of KVM directly invoking 0x200 guest interrupt vector. > > In the earlier approach QEMU patched the 0x200 interrupt > > vector during boot. The patched code at 0x200 issued a > > private hcall to pass the control to QEMU to build the > > error log. > > > > This design/approach is based on the feedback for the > > QEMU patches to handle machine check exception. Details > > of earlier approach of handling machine check exception > > in QEMU and related discussions can be found at: > > > > https://lists.nongnu.org/archive/html/qemu-devel/2014-11/msg00813.html > > I've poked at the MCE code, but not the KVM MCE code, so I may be > mistaken here, but I'm not clear on how this handles errors that the > guest can recover without terminating. > > For example, a Linux guest can handle a UE in guest userspace by killing > the guest process. A hypthetical non-linux guest with a microkernel > could even survive UEs in drivers. > > It sounds from your patch like you're changing this behaviour. Is this > right? So, IIUC. Once the qemu pieces are in place as well it shouldn't change this behaviour: KVM will exit to qemu, qemu will log the error information (new), then reinject the MC to the guest which can still handle it as you describe above. 
But, there could be a problem if you have a new kernel with an old qemu, in that case qemu might not understand the new exit type and treat it as a fatal error, even though the guest could actually cope with it. Aravinda, do we need to change this so that qemu has to explicitly enable the new NMI behaviour? Or have I missed something that will make that case work already. -- David Gibson| I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson
Re: [PATCH] KVM: PPC: Book3S HV: Synthesize segment fault if SLB lookup fails
On Tue, Oct 27, 2015 at 04:13:56PM +1100, Paul Mackerras wrote: > When handling a hypervisor data or instruction storage interrupt (HDSI > or HISI), we look up the SLB entry for the address being accessed in > order to translate the effective address to a virtual address which can > be looked up in the guest HPT. This lookup can occasionally fail due > to the guest replacing an SLB entry without invalidating the evicted > SLB entry. In this situation an ERAT (effective to real address > translation cache) entry can persist and be used by the hardware even > though there is no longer a corresponding SLB entry. > > Previously we would just deliver a data or instruction storage interrupt > (DSI or ISI) to the guest in this case. However, this is not correct > and has been observed to cause guests to crash, typically with a > data storage protection interrupt on a store to the vmemmap area. > > Instead, what we do now is to synthesize a data or instruction segment > interrupt. That should cause the guest to reload an appropriate entry > into the SLB and retry the faulting instruction. If it still faults, > we should find an appropriate SLB entry next time and be able to handle > the fault. > > Signed-off-by: Paul Mackerras <pau...@samba.org> Reviewed-by: David Gibson <da...@gibson.dropbear.id.au> -- David Gibson| I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson
Re: [Qemu-devel] [PATCH v4] ppc/spapr: Implement H_RANDOM hypercall in QEMU
On Mon, Sep 21, 2015 at 10:37:28AM +0200, Greg Kurz wrote: > On Mon, 21 Sep 2015 10:26:52 +0200 > Thomas Huth <th...@redhat.com> wrote: > > > On 21/09/15 10:01, Greg Kurz wrote: > > > On Mon, 21 Sep 2015 12:10:00 +1000 > > > David Gibson <da...@gibson.dropbear.id.au> wrote: > > > > > >> On Fri, Sep 18, 2015 at 11:05:52AM +0200, Greg Kurz wrote: > > >>> On Thu, 17 Sep 2015 10:49:41 +0200 > > >>> Thomas Huth <th...@redhat.com> wrote: > > >>> > > >>>> The PAPR interface defines a hypercall to pass high-quality > > >>>> hardware generated random numbers to guests. Recent kernels can > > >>>> already provide this hypercall to the guest if the right hardware > > >>>> random number generator is available. But in case the user wants > > >>>> to use another source like EGD, or QEMU is running with an older > > >>>> kernel, we should also have this call in QEMU, so that guests that > > >>>> do not support virtio-rng yet can get good random numbers, too. > > >>>> > > >>>> This patch now adds a new pseudo-device to QEMU that either > > >>>> directly provides this hypercall to the guest or is able to > > >>>> enable the in-kernel hypercall if available. > > ... > > >>> It is a good thing that the user can choose between in-kernel and > > >>> backend, > > >>> and this patch does the work. > > >>> > > >>> This being said, I am not sure about the use case where a user has a > > >>> hwrng > > >>> capable platform and wants to run guests without any hwrng support at > > >>> all is > > >>> an appropriate default behavior... I guess we will find more users that > > >>> want > > >>> in-kernel being the default if it is available. > > >>> > > >>> The patch below modifies yours to do just this: the pseudo-device is > > >>> only > > >>> created if hwrng is present and not already created. > > >> > > >> I have mixed feelings about this. On the one hand, I agree that it > > >> would be nice to allow H_RANDOM support by default. 
On the other hand > > >> the patch below leaves no way to turn it off for testing purposes. It > > >> also adds another place where the guest hardware depends on the host > > >> configuration, which adds to the already substantial mess of ensuring > > >> that source and destination hardware configuration matches for > > >> migration. > > > > > > Yeah, describing the guest hw is really essential for migration... this > > > is best addressed at the libvirt level with a full XML description of > > > the machine... but FWIW if we are talking about running pseries on a > > > POWER8 or newer host, I am not aware about "hwrng-less" boards... but > > > I am probably missing something :) > > > > Maybe it would be at least ok to enable it by default as long as > > "-nodefaults" has not been specified as command line option? I like that in principle, but the -nodefaults option isn't exposed outside vl.c > It makes a lot of sense indeed. I guess David should take your patch > as it is now and the default behavior could be a follow up. That's the plan. I've already taken the base patch. -- David Gibson| I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson pgpfA9hFQZ4X9.pgp Description: PGP signature
Re: [PATCH] KVM: PPC: Take the kvm->srcu lock in kvmppc_h_logical_ci_load/store()
On Fri, Sep 18, 2015 at 08:57:28AM +0200, Thomas Huth wrote: > Access to the kvm->buses (like with the kvm_io_bus_read() and -write() > functions) has to be protected via the kvm->srcu lock. > The kvmppc_h_logical_ci_load() and -store() functions are missing > this lock so far, so let's add it there, too. > This fixes the problem that the kernel reports "suspicious RCU usage" > when lock debugging is enabled. > > Fixes: 99342cf8044420eebdf9297ca03a14cb6a7085a1 > Signed-off-by: Thomas Huth <th...@redhat.com> Nice catch. Looks like I missed this because the places kvm_io_bus_{read,write}() are called on x86 are buried about 5 layers below where the srcu lock is taken :/. Reviewed-by: David Gibson <da...@gibson.dropbear.id.au> > --- > arch/powerpc/kvm/book3s.c | 6 ++++++ > 1 file changed, 6 insertions(+) > > diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c > index d75bf32..096e5eb 100644 > --- a/arch/powerpc/kvm/book3s.c > +++ b/arch/powerpc/kvm/book3s.c > @@ -828,12 +828,15 @@ int kvmppc_h_logical_ci_load(struct kvm_vcpu *vcpu) > unsigned long size = kvmppc_get_gpr(vcpu, 4); > unsigned long addr = kvmppc_get_gpr(vcpu, 5); > u64 buf; > + int srcu_idx; > int ret; > > if (!is_power_of_2(size) || (size > sizeof(buf))) > return H_TOO_HARD; > > + srcu_idx = srcu_read_lock(&vcpu->kvm->srcu); > ret = kvm_io_bus_read(vcpu, KVM_MMIO_BUS, addr, size, &buf); > + srcu_read_unlock(&vcpu->kvm->srcu, srcu_idx); > if (ret != 0) > return H_TOO_HARD; > > @@ -868,6 +871,7 @@ int kvmppc_h_logical_ci_store(struct kvm_vcpu *vcpu) > unsigned long addr = kvmppc_get_gpr(vcpu, 5); > unsigned long val = kvmppc_get_gpr(vcpu, 6); > u64 buf; > + int srcu_idx; > int ret; > > switch (size) { > @@ -891,7 +895,9 @@ int kvmppc_h_logical_ci_store(struct kvm_vcpu *vcpu) > return H_TOO_HARD; > } > > + srcu_idx = srcu_read_lock(&vcpu->kvm->srcu); > ret = kvm_io_bus_write(vcpu, KVM_MMIO_BUS, addr, size, &buf); > + srcu_read_unlock(&vcpu->kvm->srcu, srcu_idx); > if (ret != 0) > return H_TOO_HARD; > -- David Gibson| I'll have
my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson
Re: [PATCH 0/2] VFIO: Accept IOMMU group (PE) ID
On Sat, Sep 19, 2015 at 04:22:47PM +1000, David Gibson wrote: > On Fri, Sep 18, 2015 at 09:47:32AM -0600, Alex Williamson wrote: > > On Fri, 2015-09-18 at 16:24 +1000, Gavin Shan wrote: > > > This allows to accept IOMMU group (PE) ID from the parameter from userland > > > when handling EEH operation so that the operation only affects the target > > > IOMMU group (PE). If the IOMMU group (PE) ID in the parameter from > > > userland > > > is invalid, all IOMMU groups (PEs) attached to the specified container are > > > affected as before. > > > > > > Gavin Shan (2): > > > drivers/vfio: Support EEH API revision > > > drivers/vfio: Support IOMMU group for EEH operations > > > > > > drivers/vfio/vfio_iommu_spapr_tce.c | 50 > > > ++--- > > > drivers/vfio/vfio_spapr_eeh.c | 46 > > > ++ > > > include/linux/vfio.h| 13 +++--- > > > include/uapi/linux/vfio.h | 6 + > > > 4 files changed, 93 insertions(+), 22 deletions(-) > > > > This interface is terrible. A function named foo_enabled() should > > return a bool, yes or no, don't try to overload it to also return a > > version. > > Sorry, that one's my fault. I suggested that approach to Gavin > without really thinking it through. > > > > AFAICT, patch 2/2 breaks current users by changing the offset > > of the union in struct vfio_eeh_pe_err. > > Yeah, this one's ugly. We have to preserve the offset, but that means > putting the group in a very awkward place. Especially since I'm not > sure if there even are any existing users of the single extant union > branch. > > Sigh. > > > Also, we generally pass group > > file descriptors rather than a group ID because we can prove the > > ownership of the group through the file descriptor and we don't need to > > worry about races with the group because we can hold a reference to it. Duh. I finally realised the better, simpler, obvious solution. Rather than changing the parameter structure, we should move the ioctl()s so they're on the group fd instead of the container fd. 
Obviously we need to keep it on the container fd for backwards compat, but I think we should just error out if there is more than one group in the container there. We will need a new capability too, obviously. VFIO_EEH_GROUPFD maybe? -- David Gibson| I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson
Re: [PATCH v4] ppc/spapr: Implement H_RANDOM hypercall in QEMU
On Thu, Sep 17, 2015 at 10:49:41AM +0200, Thomas Huth wrote: > The PAPR interface defines a hypercall to pass high-quality > hardware generated random numbers to guests. Recent kernels can > already provide this hypercall to the guest if the right hardware > random number generator is available. But in case the user wants > to use another source like EGD, or QEMU is running with an older > kernel, we should also have this call in QEMU, so that guests that > do not support virtio-rng yet can get good random numbers, too. > > This patch now adds a new pseudo-device to QEMU that either > directly provides this hypercall to the guest or is able to > enable the in-kernel hypercall if available. The in-kernel > hypercall can be enabled with the use-kvm property, e.g.: > > qemu-system-ppc64 -device spapr-rng,use-kvm=true > > For handling the hypercall in QEMU instead, a "RngBackend" is > required since the hypercall should provide "good" random data > instead of pseudo-random (like from a "simple" library function > like rand() or g_random_int()). Since there are multiple RngBackends > available, the user must select an appropriate back-end via the > "rng" property of the device, e.g.: > > qemu-system-ppc64 -object rng-random,filename=/dev/hwrng,id=gid0 \ >-device spapr-rng,rng=gid0 ... > > See http://wiki.qemu-project.org/Features-Done/VirtIORNG for > other example of specifying RngBackends. > > Signed-off-by: Thomas Huth <th...@redhat.com> Thanks, applied to spapr-next. -- David Gibson| I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson pgprmHRL0m8Qe.pgp Description: PGP signature
Re: [PATCH v4] ppc/spapr: Implement H_RANDOM hypercall in QEMU
On Fri, Sep 18, 2015 at 11:05:52AM +0200, Greg Kurz wrote: > On Thu, 17 Sep 2015 10:49:41 +0200 > Thomas Huth <th...@redhat.com> wrote: > > > The PAPR interface defines a hypercall to pass high-quality > > hardware generated random numbers to guests. Recent kernels can > > already provide this hypercall to the guest if the right hardware > > random number generator is available. But in case the user wants > > to use another source like EGD, or QEMU is running with an older > > kernel, we should also have this call in QEMU, so that guests that > > do not support virtio-rng yet can get good random numbers, too. > > > > This patch now adds a new pseudo-device to QEMU that either > > directly provides this hypercall to the guest or is able to > > enable the in-kernel hypercall if available. The in-kernel > > hypercall can be enabled with the use-kvm property, e.g.: > > > > qemu-system-ppc64 -device spapr-rng,use-kvm=true > > > > For handling the hypercall in QEMU instead, a "RngBackend" is > > required since the hypercall should provide "good" random data > > instead of pseudo-random (like from a "simple" library function > > like rand() or g_random_int()). Since there are multiple RngBackends > > available, the user must select an appropriate back-end via the > > "rng" property of the device, e.g.: > > > > qemu-system-ppc64 -object rng-random,filename=/dev/hwrng,id=gid0 \ > >-device spapr-rng,rng=gid0 ... > > > > See http://wiki.qemu-project.org/Features-Done/VirtIORNG for > > other example of specifying RngBackends. > > > > Signed-off-by: Thomas Huth <th...@redhat.com> > > --- > > It is a good thing that the user can choose between in-kernel and backend, > and this patch does the work. > > This being said, I am not sure about the use case where a user has a hwrng > capable platform and wants to run guests without any hwrng support at all is > an appropriate default behavior... I guess we will find more users that want > in-kernel being the default if it is available. 
> > The patch below modifies yours to do just this: the pseudo-device is only > created if hwrng is present and not already created. I have mixed feelings about this. On the one hand, I agree that it would be nice to allow H_RANDOM support by default. On the other hand the patch below leaves no way to turn it off for testing purposes. It also adds another place where the guest hardware depends on the host configuration, which adds to the already substantial mess of ensuring that source and destination hardware configuration matches for migration. -- David Gibson| I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson pgpUiaicZ6w4V.pgp Description: PGP signature
Re: [PATCH 0/2] VFIO: Accept IOMMU group (PE) ID
On Fri, Sep 18, 2015 at 09:47:32AM -0600, Alex Williamson wrote: > On Fri, 2015-09-18 at 16:24 +1000, Gavin Shan wrote: > > This allows to accept IOMMU group (PE) ID from the parameter from userland > > when handling EEH operation so that the operation only affects the target > > IOMMU group (PE). If the IOMMU group (PE) ID in the parameter from userland > > is invalid, all IOMMU groups (PEs) attached to the specified container are > > affected as before. > > > > Gavin Shan (2): > > drivers/vfio: Support EEH API revision > > drivers/vfio: Support IOMMU group for EEH operations > > > > drivers/vfio/vfio_iommu_spapr_tce.c | 50 > > ++--- > > drivers/vfio/vfio_spapr_eeh.c | 46 ++ > > include/linux/vfio.h| 13 +++--- > > include/uapi/linux/vfio.h | 6 + > > 4 files changed, 93 insertions(+), 22 deletions(-) > > This interface is terrible. A function named foo_enabled() should > return a bool, yes or no, don't try to overload it to also return a > version. Sorry, that one's my fault. I suggested that approach to Gavin without really thinking it through. > AFAICT, patch 2/2 breaks current users by changing the offset > of the union in struct vfio_eeh_pe_err. Yeah, this one's ugly. We have to preserve the offset, but that means putting the group in a very awkward place. Especially since I'm not sure if there even are any existing users of the single extant union branch. Sigh. > Also, we generally pass group > file descriptors rather than a group ID because we can prove the > ownership of the group through the file descriptor and we don't need to > worry about races with the group because we can hold a reference to it. > -- David Gibson| I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson pgp3yfhiIyNQC.pgp Description: PGP signature
Re: [PATCH v3] ppc/spapr: Implement H_RANDOM hypercall in QEMU
On Mon, Sep 14, 2015 at 08:32:36AM +0200, Thomas Huth wrote: > On 14/09/15 04:15, David Gibson wrote: > > On Fri, Sep 11, 2015 at 11:17:01AM +0200, Thomas Huth wrote: > >> The PAPR interface defines a hypercall to pass high-quality > >> hardware generated random numbers to guests. Recent kernels can > >> already provide this hypercall to the guest if the right hardware > >> random number generator is available. But in case the user wants > >> to use another source like EGD, or QEMU is running with an older > >> kernel, we should also have this call in QEMU, so that guests that > >> do not support virtio-rng yet can get good random numbers, too. > >> > >> This patch now adds a new pseude-device to QEMU that either > >> directly provides this hypercall to the guest or is able to > >> enable the in-kernel hypercall if available. The in-kernel > >> hypercall can be enabled with the use-kvm property, e.g.: > >> > >> qemu-system-ppc64 -device spapr-rng,use-kvm=true > >> > >> For handling the hypercall in QEMU instead, a RngBackend is required > >> since the hypercall should provide "good" random data instead of > >> pseudo-random (like from a "simple" library function like rand() > >> or g_random_int()). Since there are multiple RngBackends available, > >> the user must select an appropriate backend via the "backend" > >> property of the device, e.g.: > >> > >> qemu-system-ppc64 -object rng-random,filename=/dev/hwrng,id=rng0 \ > >>-device spapr-rng,backend=rng0 ... > >> > >> See http://wiki.qemu-project.org/Features-Done/VirtIORNG for > >> other example of specifying RngBackends. > ... 
> >> + > >> +#include "qemu/error-report.h" > >> +#include "sysemu/sysemu.h" > >> +#include "sysemu/device_tree.h" > >> +#include "sysemu/rng.h" > >> +#include "hw/ppc/spapr.h" > >> +#include "kvm_ppc.h" > >> + > >> +#define SPAPR_RNG(obj) \ > >> +OBJECT_CHECK(sPAPRRngState, (obj), TYPE_SPAPR_RNG) > >> + > >> +typedef struct sPAPRRngState { > >> +/*< private >*/ > >> +DeviceState ds; > >> +RngBackend *backend; > >> +bool use_kvm; > >> +} sPAPRRngState; > >> + > >> +typedef struct HRandomData { > >> +QemuSemaphore sem; > >> +union { > >> +uint64_t v64; > >> +uint8_t v8[8]; > >> +} val; > >> +int received; > >> +} HRandomData; > >> + > >> +/* Callback function for the RngBackend */ > >> +static void random_recv(void *dest, const void *src, size_t size) > >> +{ > >> +HRandomData *hrdp = dest; > >> + > >> +if (src && size > 0) { > >> +assert(size + hrdp->received <= sizeof(hrdp->val.v8)); > >> +memcpy(&hrdp->val.v8[hrdp->received], src, size); > >> +hrdp->received += size; > >> +} > >> + > >> +qemu_sem_post(&hrdp->sem); > > > > I'm assuming qemu_sem_post() includes the necessary memory barrier to > > make sure the requesting thread actually sees the data. > > Not sure whether I fully got your point here... both callback function > and main thread are calling an extern C-function, so the compiler should > not assume that the memory stays the same in the main thread...? I'm not talking about a compiler barrier: the callback will likely be invoked on a different CPU from the vcpu thread that invoked H_RANDOM, so on a weakly ordered arch like Power we need a real CPU memory barrier. > Anyway, I've tested the hypercall by implementing it in SLOF and calling > it a couple of times there to see that all bits in the result behave > randomly, so for me this is working fine. Right, I'd be almost certain anyway that qemu_sem_post() (actually likely the pthreads functions it invokes) will include the necessary barriers to stop accesses leaking outside the locked region.
> > >> +} > >> + > >> +/* Handler for the H_RANDOM hypercall */ > >> +static target_ulong h_random(PowerPCCPU *cpu, sPAPRMachineState *spapr, > >> + target_ulong opcode, target_ulong *args) > >> +{ > >> +sPAPRRngState *rngstate; > >> +HRandomData hrdata; > >> + > >> +rngstate = SPAPR_RNG(object_resolve_path_type("", TYPE_SPAPR_RNG, > >> NULL)); > >> +
Re: [PATCH v3] ppc/spapr: Implement H_RANDOM hypercall in QEMU
false), > +DEFINE_PROP_END_OF_LIST(), > +}; > + > +static void spapr_rng_class_init(ObjectClass *oc, void *data) > +{ > +DeviceClass *dc = DEVICE_CLASS(oc); > + > +dc->realize = spapr_rng_realize; > +set_bit(DEVICE_CATEGORY_MISC, dc->categories); > +dc->props = spapr_rng_properties; > +} > + > +static const TypeInfo spapr_rng_info = { > +.name = TYPE_SPAPR_RNG, > +.parent= TYPE_DEVICE, > +.instance_size = sizeof(sPAPRRngState), > +.instance_init = spapr_rng_instance_init, > +.class_init= spapr_rng_class_init, > +}; > + > +static void spapr_rng_register_type(void) > +{ > +type_register_static(&spapr_rng_info); > +} > +type_init(spapr_rng_register_type) > diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h > index 91a61ab..4e8aa2d 100644 > --- a/include/hw/ppc/spapr.h > +++ b/include/hw/ppc/spapr.h > @@ -331,6 +331,7 @@ struct sPAPRMachineState { > #define H_SET_MPP 0x2D0 > #define H_GET_MPP 0x2D4 > #define H_XIRR_X 0x2FC > +#define H_RANDOM 0x300 > #define H_SET_MODE 0x31C > #define MAX_HCALL_OPCODE H_SET_MODE > > @@ -603,10 +604,13 @@ struct sPAPRConfigureConnectorState { > void spapr_ccs_reset_hook(void *opaque); > > #define TYPE_SPAPR_RTC "spapr-rtc" > +#define TYPE_SPAPR_RNG "spapr-rng" > > void spapr_rtc_read(DeviceState *dev, struct tm *tm, uint32_t *ns); > int spapr_rtc_import_offset(DeviceState *dev, int64_t legacy_offset); > > +int spapr_rng_populate_dt(void *fdt); > + > #define SPAPR_MEMORY_BLOCK_SIZE (1 << 28) /* 256MB */ > > #endif /* !defined (__HW_SPAPR_H__) */ > diff --git a/target-ppc/kvm.c b/target-ppc/kvm.c > index 110436d..42f66fe 100644 > --- a/target-ppc/kvm.c > +++ b/target-ppc/kvm.c > @@ -2484,3 +2484,12 @@ int kvm_arch_msi_data_to_gsi(uint32_t data) > { > return data & 0xffff; > } > + > +int kvmppc_enable_hwrng(void) > +{ > +if (!kvm_enabled() || !kvm_check_extension(kvm_state, > KVM_CAP_PPC_HWRNG)) { > +return -1; > +} > + > +return kvmppc_enable_hcall(kvm_state, H_RANDOM); > +} > diff --git a/target-ppc/kvm_ppc.h b/target-ppc/kvm_ppc.h > index
4d30e27..68836b4 100644 > --- a/target-ppc/kvm_ppc.h > +++ b/target-ppc/kvm_ppc.h > @@ -55,6 +55,7 @@ void kvmppc_hash64_free_pteg(uint64_t token); > void kvmppc_hash64_write_pte(CPUPPCState *env, target_ulong pte_index, > target_ulong pte0, target_ulong pte1); > bool kvmppc_has_cap_fixup_hcalls(void); > +int kvmppc_enable_hwrng(void); > > #else > > @@ -248,6 +249,10 @@ static inline bool kvmppc_has_cap_fixup_hcalls(void) > abort(); > } > > +static inline int kvmppc_enable_hwrng(void) > +{ > +return -1; > +} > #endif > > #ifndef CONFIG_KVM -- David Gibson| I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson
Re: [PATCH 2/2] KVM: PPC: Book3S HV: Exit on H_DOORBELL only if HOST_IPI is set
On Thu, Sep 03, 2015 at 03:21:23PM +1000, Paul Mackerras wrote: > From: "Gautham R. Shenoy" <e...@linux.vnet.ibm.com> > > The code that handles the case when we receive a H_DOORBELL interrupt > has a comment which says "Hypervisor doorbell - exit only if host IPI > flag set". However, the current code does not actually check if the > host IPI flag is set. This is due to a comparison instruction that > got missed. > > As a result, the current code performs the exit to host only > if some sibling thread or a sibling sub-core is exiting to the > host. This implies that, an IPI sent to a sibling core in > (subcores-per-core != 1) mode will be missed by the host unless the > sibling core is on the exit path to the host. > > This patch adds the missing comparison operation which will ensure > that when HOST_IPI flag is set, we unconditionally exit to the host. > > Fixes: 66feed61cdf6 > Cc: sta...@vger.kernel.org # v4.1+ > Signed-off-by: Gautham R. Shenoy <e...@linux.vnet.ibm.com> > Signed-off-by: Paul Mackerras <pau...@samba.org> Reviewed-by: David Gibson <da...@gibson.dropbear.id.au> > --- > arch/powerpc/kvm/book3s_hv_rmhandlers.S | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S > b/arch/powerpc/kvm/book3s_hv_rmhandlers.S > index b07f045..2273dca 100644 > --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S > +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S > @@ -1213,6 +1213,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_HAS_PPR) > cmpwi r12, BOOK3S_INTERRUPT_H_DOORBELL > bne 3f > lbz r0, HSTATE_HOST_IPI(r13) > + cmpwi r0, 0 > beq 4f > b guest_exit_cont > 3: -- David Gibson| I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson
Re: [PATCH 1/2] KVM: PPC: Book3S HV: Fix race in starting secondary threads
On Thu, Sep 03, 2015 at 03:20:50PM +1000, Paul Mackerras wrote: > From: "Gautham R. Shenoy" <e...@linux.vnet.ibm.com> > > The current dynamic micro-threading code has a race due to which a > secondary thread naps when it is supposed to be running a vcpu. As a > side effect of this, on a guest exit, the primary thread in > kvmppc_wait_for_nap() finds that this secondary thread hasn't cleared > its vcore pointer. This results in "CPU X seems to be stuck!" > warnings. > > The race is possible since the primary thread on exiting the guests > only waits for all the secondaries to clear its vcore pointer. It > subsequently expects the secondary threads to enter nap while it > unsplits the core. A secondary thread which hasn't yet entered the nap > will loop in kvm_no_guest until its vcore pointer and the do_nap flag > are unset. Once the core has been unsplit, a new vcpu thread can grab > the core and set the do_nap flag *before* setting the vcore pointers > of the secondary. As a result, the secondary thread will now enter nap > via kvm_unsplit_nap instead of running the guest vcpu. > > Fix this by setting the do_nap flag after setting the vcore pointer in > the PACA of the secondary in kvmppc_run_core. Also, ensure that a > secondary thread doesn't nap in kvm_unsplit_nap when the vcore pointer > in its PACA struct is set. > > Fixes: b4deba5c41e9 > Signed-off-by: Gautham R. Shenoy <e...@linux.vnet.ibm.com> > Signed-off-by: Paul Mackerras <pau...@samba.org> Reviewed-by: David Gibson <da...@gibson.dropbear.id.au> -- David Gibson| I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson
Re: [kvm-unit-tests PATCH 11/14] powerpc/ppc64: add rtas_power_off
On Tue, Aug 04, 2015 at 09:54:44AM +0200, Andrew Jones wrote: On Tue, Aug 04, 2015 at 02:11:30PM +1000, David Gibson wrote: On Mon, Aug 03, 2015 at 04:41:28PM +0200, Andrew Jones wrote: Add enough RTAS support to support power-off, and apply it to exit(). Signed-off-by: Andrew Jones <drjo...@redhat.com> --- lib/powerpc/asm/rtas.h | 27 lib/powerpc/io.c| 3 ++ lib/powerpc/rtas.c | 84 + lib/ppc64/asm/rtas.h| 1 + powerpc/Makefile.common | 1 + powerpc/cstart64.S | 9 ++ 6 files changed, 125 insertions(+) create mode 100644 lib/powerpc/asm/rtas.h create mode 100644 lib/powerpc/rtas.c create mode 100644 lib/ppc64/asm/rtas.h diff --git a/lib/powerpc/asm/rtas.h b/lib/powerpc/asm/rtas.h new file mode 100644 index 0..53ce126c69a42 --- /dev/null +++ b/lib/powerpc/asm/rtas.h @@ -0,0 +1,27 @@ +#ifndef _ASMPOWERPC_RTAS_H_ +#define _ASMPOWERPC_RTAS_H_ +/* + * Copyright (C) 2015, Red Hat Inc, Andrew Jones <drjo...@redhat.com> + * + * This work is licensed under the terms of the GNU LGPL, version 2. + */ +#include <libcflat.h> + +#define RTAS_UNKNOWN_SERVICE (-1) + +struct rtas_args { + u32 token; + u32 nargs; + u32 nret; + u32 args[16]; + u32 *rets; +}; + +extern void rtas_init(void); +extern void enter_rtas(unsigned long args); +extern int rtas_token(const char *service); +extern int rtas_call(int token, int nargs, int nret, int *outputs, ...); + +extern void rtas_power_off(void); + +#endif /* _ASMPOWERPC_RTAS_H_ */ diff --git a/lib/powerpc/io.c b/lib/powerpc/io.c index a7eaafeca9205..25cdd0ad961ee 100644 --- a/lib/powerpc/io.c +++ b/lib/powerpc/io.c @@ -7,6 +7,7 @@ */ #include <libcflat.h> #include <asm/spinlock.h> +#include <asm/rtas.h> extern void halt(int code); extern void putchar(int c); @@ -15,6 +16,7 @@ static struct spinlock uart_lock; void io_init(void) { + rtas_init(); } void puts(const char *s) @@ -27,5 +29,6 @@ void puts(const char *s) void exit(int code) { + rtas_power_off(); halt(code); } diff --git a/lib/powerpc/rtas.c b/lib/powerpc/rtas.c new file mode 100644 index
0..39f7d4428ee77 --- /dev/null +++ b/lib/powerpc/rtas.c @@ -0,0 +1,84 @@ +/* + * powerpc RTAS + * + * Copyright (C) 2015, Red Hat Inc, Andrew Jones <drjo...@redhat.com> + * + * This work is licensed under the terms of the GNU LGPL, version 2. + */ +#include <libcflat.h> +#include <libfdt/libfdt.h> +#include <devicetree.h> +#include <asm/spinlock.h> +#include <asm/io.h> +#include <asm/rtas.h> + +static struct spinlock lock; +struct rtas_args rtas_args; +static int rtas_node; + +void rtas_init(void) +{ + if (!dt_available()) { + printf("%s: No device tree!\n", __func__); + abort(); + } + + rtas_node = fdt_path_offset(dt_fdt(), "/rtas"); + if (rtas_node < 0) { + printf("%s: /rtas node: %s\n", __func__, + fdt_strerror(rtas_node)); + abort(); + } +} + +int rtas_token(const char *service) +{ + const struct fdt_property *prop; + u32 *token; + + prop = fdt_get_property(dt_fdt(), rtas_node, service, NULL); Oh.. one other thing here. This is only safe if you never alter the device tree blob you're given (or at least, not after rtas_init()). That may well be the case, but I don't know your code well enough to be sure. Otherwise, the rtas node could get moved by other dt changes, meaning the offset stored in rtas_node is no longer valid. I hadn't planned on modifying the DT, but if you can conceive of a test that may want to, then we should do this right. That's probably ok then, as long as it's made clear that you can't mess with the DT. Or if tests might want to modify it, you could give them their own copy of it. I guess doing it right just means to hunt down rtas_node each time, that's easy. That's right. Not having persistent handles is a bit of a pain, but it's the necessary tradeoff to work with the tree in flattened form without having some complicated pile of context state to go with it. (Attempting to remind people that these aren't persistent handles is, incidentally, why so many of the libfdt functions explicitly mention offset in their names).
-- David Gibson| I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson
Re: [kvm-unit-tests PATCH 11/14] powerpc/ppc64: add rtas_power_off
On Tue, Aug 04, 2015 at 09:47:59AM +0200, Andrew Jones wrote: On Tue, Aug 04, 2015 at 02:09:52PM +1000, David Gibson wrote: On Mon, Aug 03, 2015 at 07:08:17PM +0200, Paolo Bonzini wrote: On 03/08/2015 16:41, Andrew Jones wrote: Add enough RTAS support to support power-off, and apply it to exit(). Signed-off-by: Andrew Jones <drjo...@redhat.com> Why not use virtio-mmio + testdev on ppc as well? Similar to how we're not using PSCI on ARM or ACPI on x86. Strange as it seems, MMIO is actually a PITA for a simple pseries guest like this. Basically, you have to enable the MMU - which requires a whole bunch of setup - in order to perform cache-inhibited loads and stores, which is what you need for IO. There are hypercalls to sidestep this (H_LOGICAL_CI_LOAD and H_LOGICAL_CI_STORE), but having a hypercall and KVM exit for every IO access may be hideously slow. In early development we did have a hypercall mediated virtio model, but it was abandoned once we got PCI working. So I think by yours and Alex's responses, if we want testdev support then we should target using pci to expose it. Um.. maybe. I'm not really familiar with these testdevs, so I can't answer directly. I'm ok with that, but prefer not to be distracted with it while getting ppc kickstarted. So, question for Paolo, are you OK with the exitcode snooper cheat? If you wanted to add a special hypercall channel for use by the tests I'd be ok with that too. -- David Gibson| I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson
Re: [kvm-unit-tests PATCH 10/14] powerpc/ppc64: relocate linker VMAs
:	.llong	stackptr
+p_toc:		.llong	tocptr
+p_dyn:		.llong	dynamic_start
+
 .text
 .align 3
diff --git a/powerpc/flat.lds b/powerpc/flat.lds
index bd075efb2c51b..8a573d27346de 100644
--- a/powerpc/flat.lds
+++ b/powerpc/flat.lds
@@ -6,11 +6,22 @@ SECTIONS
     etext = .;
     .opd : { *(.opd) }
     . = ALIGN(16);
+    .dynamic : {
+        dynamic_start = .;
+        *(.dynamic)
+    }
+    .dynsym : {
+        dynsym_start = .;
+        *(.dynsym)
+    }
+    .rela.dyn : { *(.rela*) }
+    . = ALIGN(16);
     .data : {
         *(.data)
+        *(.data.rel*)
     }
     . = ALIGN(16);
-    .rodata : { *(.rodata) }
+    .rodata : { *(.rodata) *(.rodata.*) }
     . = ALIGN(16);
     .bss : { *(.bss) }
     . = ALIGN(16);
diff --git a/powerpc/reloc64.c b/powerpc/reloc64.c
new file mode 100644
index 0..2804823bdfee3
--- /dev/null
+++ b/powerpc/reloc64.c
@@ -0,0 +1,55 @@
+/*
+ * relocate R_PPC_RELATIVE RELA entries. Normally this is done in
+ * assembly code to avoid the risk of using absolute addresses before
+ * they're relocated. We use C, but cautiously (no global references).
+ *
+ * Copyright (C) 2015, Red Hat Inc, Andrew Jones drjo...@redhat.com
+ *
+ * This work is licensed under the terms of the GNU LGPL, version 2.
+ */
+#define DT_NULL		0
+#define DT_RELA		7
+#define DT_RELACOUNT	0x6ffffff9
+#define R_PPC_RELATIVE	22
+
+struct elf64_dyn {
+	signed long long tag;
+	unsigned long long val;
+};
+
+#define RELA_GET_TYPE(rela_ptr) ((rela_ptr)->info & 0xffffffff)
+struct elf64_rela {
+	unsigned long long offset;
+	unsigned long long info;
+	signed long long addend;
+};
+
+void relocate(unsigned long load_addr, struct elf64_dyn *dyn_table)
+{
+	unsigned long long rela_addr = 0, rela_count = 0, *addr;
+	struct elf64_dyn *d = dyn_table;
+	struct elf64_rela *r;
+
+	while (d && d->tag != DT_NULL) {
+		if (d->tag == DT_RELA)
+			rela_addr = d->val;
+		else if (d->tag == DT_RELACOUNT)
+			rela_count = d->val;
+		if (rela_addr && rela_count)
+			break;
+		++d;
+	}
+
+	if (!rela_addr || !rela_count)
+		return;
+
+	r = (void *)(rela_addr + load_addr);
+
+	while (rela_count--) {
+		if (RELA_GET_TYPE(r) == R_PPC_RELATIVE) {
+			addr = (void *)(r->offset + load_addr);
+			*addr = r->addend + load_addr;
+		}
+		++r;
+	}
+}

--
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson
Re: [kvm-unit-tests PATCH 11/14] powerpc/ppc64: add rtas_power_off
..f4eff7f8eccb4 100644
--- a/powerpc/Makefile.common
+++ b/powerpc/Makefile.common
@@ -39,6 +39,7 @@
 cflatobjs += lib/powerpc/io.o
 cflatobjs += lib/powerpc/setup.o
 cflatobjs += lib/powerpc/mmu.o
 cflatobjs += lib/powerpc/smp.o
+cflatobjs += lib/powerpc/rtas.o

 libgcc := $(shell $(CC) $(machine) --print-libgcc-file-name)
 start_addr := $(shell printf "%x\n" $$(( $(phys_base) + $(kernel_offset) )))
diff --git a/powerpc/cstart64.S b/powerpc/cstart64.S
index 8edaaa6e251fc..1c48083efa33c 100644
--- a/powerpc/cstart64.S
+++ b/powerpc/cstart64.S
@@ -8,6 +8,7 @@
 #define HVSC			.long 0x44000022
 #define H_PUT_TERM_CHAR		0x58
+#define KVMPPC_H_RTAS		0xf000

 #define LOAD_REG_IMMEDIATE(reg,expr)	\
	lis	reg,(expr)@highest;	\
@@ -85,3 +86,11 @@ putchar:
	li	r5, 1	/* sending just 1 byte */
	HVSC
	blr
+
+.globl enter_rtas
+enter_rtas:
+	mr	r4, r3
+	lis	r3, KVMPPC_H_RTAS@h
+	ori	r3, r3, KVMPPC_H_RTAS@l
+	HVSC
+	blr

--
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson
Re: [kvm-unit-tests PATCH 11/14] powerpc/ppc64: add rtas_power_off
:= $(shell printf "%x\n" $$(( $(phys_base) + $(kernel_offset) )))
diff --git a/powerpc/cstart64.S b/powerpc/cstart64.S
index 8edaaa6e251fc..1c48083efa33c 100644
--- a/powerpc/cstart64.S
+++ b/powerpc/cstart64.S
@@ -8,6 +8,7 @@
 #define HVSC			.long 0x44000022
 #define H_PUT_TERM_CHAR		0x58
+#define KVMPPC_H_RTAS		0xf000

 #define LOAD_REG_IMMEDIATE(reg,expr)	\
	lis	reg,(expr)@highest;	\
@@ -85,3 +86,11 @@ putchar:
	li	r5, 1	/* sending just 1 byte */
	HVSC
	blr
+
+.globl enter_rtas
+enter_rtas:
+	mr	r4, r3
+	lis	r3, KVMPPC_H_RTAS@h
+	ori	r3, r3, KVMPPC_H_RTAS@l
+	HVSC
+	blr

So, strictly speaking this isn't how a client program is supposed to
access RTAS.  Instead, you're supposed to call into SLOF to
instantiate RTAS, then jump into the blob it gives you.  Or, if you're
replacing rather than using SLOF, then you're supposed to get the
existing RTAS blob address from the device tree and jump into that (or
copy it somewhere else yourself).

In practice that RTAS blob will have the same 5 instructions as your
enter_rtas, so it's probably ok to do this.  What concerns me
slightly, though, is that if we ever messed up the RTAS blob handling
in qemu or SLOF, then bypassing it in the kvm tests like this means
they wouldn't catch that breakage.

--
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson
Re: [kvm-unit-tests PATCH 08/14] powerpc/ppc64: add HV putchar
On Mon, Aug 03, 2015 at 04:41:25PM +0200, Andrew Jones wrote:

Add the hvcall for putchar and use it in puts.  That, along with a
couple more lines in start to prepare for C code, and a branch to
main(), gets us hello world.  Run with

  qemu-system-ppc64 -M pseries \
	-bios powerpc/boot_rom.bin \
	-display none -serial stdio \
	-kernel powerpc/selftest.elf

(We're still not relocating yet, that comes in a later patch.  Thus,
testing hello-world at this point requires a hacked QEMU and linking
the unit test at QEMU's kernel load address.)

Signed-off-by: Andrew Jones drjo...@redhat.com
---
 lib/powerpc/io.c        | 15 +++++++++++++--
 powerpc/Makefile.common |  1 +
 powerpc/cstart64.S      | 27 +++++++++++++++++++++++++++
 3 files changed, 41 insertions(+), 2 deletions(-)

diff --git a/lib/powerpc/io.c b/lib/powerpc/io.c
index cf3b6347e1e46..a7eaafeca9205 100644
--- a/lib/powerpc/io.c
+++ b/lib/powerpc/io.c
@@ -6,15 +6,26 @@
  * This work is licensed under the terms of the GNU LGPL, version 2.
  */
 #include <libcflat.h>
+#include <asm/spinlock.h>
+
+extern void halt(int code);
+extern void putchar(int c);
+
+static struct spinlock uart_lock;

 void io_init(void)
 {
 }

-void puts(const char *s __unused)
+void puts(const char *s)
 {
+	spin_lock(&uart_lock);
+	while (*s)
+		putchar(*s++);
+	spin_unlock(&uart_lock);
 }

-void exit(int code __unused)
+void exit(int code)
 {
+	halt(code);
 }
diff --git a/powerpc/Makefile.common b/powerpc/Makefile.common
index 5cd3ea8085038..d6356540918a5 100644
--- a/powerpc/Makefile.common
+++ b/powerpc/Makefile.common
@@ -26,6 +26,7 @@
 CFLAGS += -ffreestanding
 CFLAGS += -Wextra
 CFLAGS += -O2
 CFLAGS += -I lib -I lib/libfdt
+CFLAGS += -Wa,-mregnames

 asm-offsets = lib/$(ARCH)/asm-offsets.h
 include scripts/asm-offsets.mak
diff --git a/powerpc/cstart64.S b/powerpc/cstart64.S
index d9b77f44f9e0e..d7c51cd352ee4 100644
--- a/powerpc/cstart64.S
+++ b/powerpc/cstart64.S
@@ -6,14 +6,41 @@
  * This work is licensed under the terms of the GNU LGPL, version 2.
 */
+#define HVSC			.long 0x44000022

Unless your assembler is really old, you could probably use "sc 1"
inline, instead of this macro.  Not that it really matters.

+#define H_PUT_TERM_CHAR		0x58
+
+#define LOAD_REG_IMMEDIATE(reg,expr)	\
+	lis	reg,(expr)@highest;	\
+	ori	reg,reg,(expr)@higher;	\
+	rldicr	reg,reg,32,31;		\
+	oris	reg,reg,(expr)@h;	\
+	ori	reg,reg,(expr)@l;
+
+#define LOAD_REG_ADDR(reg,name)		\
+	ld	reg,name@got(r2)
+
 .section .init
 .globl start
 start:
+	LOAD_REG_IMMEDIATE(r1, stackptr)
+	LOAD_REG_IMMEDIATE(r2, tocptr)
+	bl	.main
+	bl	.exit

Is this built for ppc64 or ppc64le?  IIUC under the new ABI version
usually used on ppc64le, function descriptors are no longer used.  And
even under the old ABI, I think the assembler now has the smarts to
avoid explicitly referencing the .-symbols.

	b	halt

 .text
+.align 3
 .globl halt
 halt:
 1:	b	1b
+
+.globl putchar
+putchar:
+	sldi	r6, r3, 56
+	li	r3, H_PUT_TERM_CHAR
+	li	r4, 0	/* vty-reg 0 means to use the default vty */
+	li	r5, 1	/* sending just 1 byte */
+	HVSC
+	blr

--
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson
Re: [kvm-unit-tests PATCH 11/14] powerpc/ppc64: add rtas_power_off
On Mon, Aug 03, 2015 at 07:08:17PM +0200, Paolo Bonzini wrote:
On 03/08/2015 16:41, Andrew Jones wrote:

Add enough RTAS support to support power-off, and apply it to exit().

Signed-off-by: Andrew Jones drjo...@redhat.com

Why not use virtio-mmio + testdev on ppc as well?  Similar to how
we're not using PSCI on ARM or ACPI on x86.

Strange as it seems, MMIO is actually a PITA for a simple pseries
guest like this.  Basically, you have to enable the MMU - which
requires a whole bunch of setup - in order to perform cache-inhibited
loads and stores, which is what you need for IO.  There are hypercalls
to sidestep this (H_LOGICAL_CI_LOAD and H_LOGICAL_CI_STORE), but
having a hypercall and KVM exit for every IO access may be hideously
slow.  In early development we did have a hypercall mediated virtio
model, but it was abandoned once we got PCI working.

--
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson
Re: [PATCH 0/2] Two fixes for dynamic micro-threading
On Thu, Jul 16, 2015 at 05:11:12PM +1000, Paul Mackerras wrote:

This series contains two fixes for the new dynamic micro-threading
code that was added recently for HV-mode KVM on Power servers.  The
patches are against Alex Graf's kvm-ppc-queue branch.

Please apply.

agraf,

Any word on these?  These appear to fix a really nasty host crash in
current upstream.  I'd really like to see them merged ASAP.

--
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson
Re: [PATCH 2/2] KVM: PPC: Book3S HV: Implement dynamic micro-threading on POWER8
On Thu, May 28, 2015 at 03:17:20PM +1000, Paul Mackerras wrote: This builds on the ability to run more than one vcore on a physical core by using the micro-threading (split-core) modes of the POWER8 chip. Previously, only vcores from the same VM could be run together, and (on POWER8) only if they had just one thread per core. With the ability to split the core on guest entry and unsplit it on guest exit, we can run up to 8 vcpu threads from up to 4 different VMs, and we can run multiple vcores with 2 or 4 vcpus per vcore. Dynamic micro-threading is only available if the static configuration of the cores is whole-core mode (unsplit), and only on POWER8. To manage this, we introduce a new kvm_split_mode struct which is shared across all of the subcores in the core, with a pointer in the paca on each thread. In addition we extend the core_info struct to have information on each subcore. When deciding whether to add a vcore to the set already on the core, we now have two possibilities: (a) piggyback the vcore onto an existing subcore, or (b) start a new subcore. Currently, when any vcpu needs to exit the guest and switch to host virtual mode, we interrupt all the threads in all subcores and switch the core back to whole-core mode. It may be possible in future to allow some of the subcores to keep executing in the guest while subcore 0 switches to the host, but that is not implemented in this patch. This adds a module parameter called dynamic_mt_modes which controls which micro-threading (split-core) modes the code will consider, as a bitmap. In other words, if it is 0, no micro-threading mode is considered; if it is 2, only 2-way micro-threading is considered; if it is 4, only 4-way, and if it is 6, both 2-way and 4-way micro-threading mode will be considered. The default is 6. With this, we now have secondary threads which are the primary thread for their subcore and therefore need to do the MMU switch. 
These threads will need to be started even if they have no vcpu to
run, so we use the vcore pointer in the PACA rather than the vcpu
pointer to trigger them.

Signed-off-by: Paul Mackerras pau...@samba.org

Reviewed-by: David Gibson da...@gibson.dropbear.id.au

--
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson
Re: [PATCH 1/2] KVM: PPC: Book3S HV: Make use of unused threads when running guests
On Thu, May 28, 2015 at 03:17:19PM +1000, Paul Mackerras wrote:

When running a virtual core of a guest that is configured with fewer
threads per core than the physical cores have, the extra physical
threads are currently unused.  This makes it possible to use them to
run one or more other virtual cores from the same guest when certain
conditions are met.  This applies on POWER7, and on POWER8 to guests
with one thread per virtual core.  (It doesn't apply to POWER8 guests
with multiple threads per vcore because they require a 1-1 virtual to
physical thread mapping in order to be able to use msgsndp and the
TIR.)

The idea is that we maintain a list of preempted vcores for each
physical cpu (i.e. each core, since the host runs single-threaded).
Then, when a vcore is about to run, it checks to see if there are any
vcores on the list for its physical cpu that could be piggybacked onto
this vcore's execution.  If so, those additional vcores are put into
state VCORE_PIGGYBACK and their runnable VCPU threads are started as
well as the original vcore, which is called the master vcore.

After the vcores have exited the guest, the extra ones are put back
onto the preempted list if any of their VCPUs are still runnable and
not idle.

This means that vcpu->arch.ptid is no longer necessarily the same as
the physical thread that the vcpu runs on.  In order to make it easier
for code that wants to send an IPI to know which CPU to target, we now
store that in a new field in struct vcpu_arch, called thread_cpu.

Signed-off-by: Paul Mackerras pau...@samba.org

Reviewed-by: David Gibson da...@gibson.dropbear.id.au

--
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson
Re: [PATCHv4] kvmppc: Implement H_LOGICAL_CI_{LOAD,STORE} in KVM
On Tue, Apr 21, 2015 at 08:37:02AM +0200, Thomas Huth wrote:
On Tue, 21 Apr 2015 10:41:51 +1000, David Gibson
da...@gibson.dropbear.id.au wrote:

On POWER, storage caching is usually configured via the MMU -
attributes such as cache-inhibited are stored in the TLB and the
hashed page table.  This makes correctly performing cache inhibited IO
accesses awkward when the MMU is turned off (real mode).  Some CPU
models provide special registers to control the cache attributes of
real mode load and stores but this is not at all consistent.  This is
a problem in particular for SLOF, the firmware used on KVM guests,
which runs entirely in real mode, but which needs to do IO to load the
kernel.

To simplify this qemu implements two special hypercalls,
H_LOGICAL_CI_LOAD and H_LOGICAL_CI_STORE which simulate a
cache-inhibited load or store to a logical address (aka guest physical
address).  SLOF uses these for IO.

However, because these are implemented within qemu, not the host
kernel, these bypass any IO devices emulated within KVM itself.  The
simplest way to see this problem is to attempt to boot a KVM guest
from a virtio-blk device with iothread / dataplane enabled.  The
iothread code relies on an in kernel implementation of the virtio
queue notification, which is not triggered by the IO hcalls, and so
the guest will stall in SLOF unable to load the guest OS.

This patch addresses this by providing in-kernel implementations of
the 2 hypercalls, which correctly scan the KVM IO bus.  Any access to
an address not handled by the KVM IO bus will cause a VM exit, hitting
the qemu implementation as before.

Note that a userspace change is also required, in order to enable
these new hcall implementations with KVM_CAP_PPC_ENABLE_HCALL.

Signed-off-by: David Gibson da...@gibson.dropbear.id.au
---
 arch/powerpc/include/asm/kvm_book3s.h |  3 ++
 arch/powerpc/kvm/book3s.c             | 76 ++++++++++++++++++++++++
 arch/powerpc/kvm/book3s_hv.c          | 12 ++++
 arch/powerpc/kvm/book3s_pr_papr.c     | 28 +++++++++
 4 files changed, 119 insertions(+)
...
diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
index cfbcdc6..453a8a4 100644
--- a/arch/powerpc/kvm/book3s.c
+++ b/arch/powerpc/kvm/book3s.c
@@ -821,6 +821,82 @@ void kvmppc_core_destroy_vm(struct kvm *kvm)
 #endif
 }

+int kvmppc_h_logical_ci_load(struct kvm_vcpu *vcpu)
+{
+	unsigned long size = kvmppc_get_gpr(vcpu, 4);
+	unsigned long addr = kvmppc_get_gpr(vcpu, 5);
+	u64 buf;
+	int ret;
+
+	if (!is_power_of_2(size) || (size > sizeof(buf)))
+		return H_TOO_HARD;
+
+	ret = kvm_io_bus_read(vcpu, KVM_MMIO_BUS, addr, size, &buf);
+	if (ret != 0)
+		return H_TOO_HARD;
+
+	switch (size) {
+	case 1:
+		kvmppc_set_gpr(vcpu, 4, *(u8 *)&buf);
+		break;
+

Most of the code in book3s.c seems not to use an empty line after a
"break;", so may I suggest to remove these empty lines here, too, to
keep the coding style a little bit more consistent?

I don't think it's worth respinning just for that.

+	case 2:
+		kvmppc_set_gpr(vcpu, 4, be16_to_cpu(*(__be16 *)&buf));
+		break;
+
+	case 4:
+		kvmppc_set_gpr(vcpu, 4, be32_to_cpu(*(__be32 *)&buf));
+		break;
+
+	case 8:
+		kvmppc_set_gpr(vcpu, 4, be64_to_cpu(*(__be64 *)&buf));
+		break;
+
+	default:
+		BUG();

If I got the code right, a malicious guest could easily trigger this
BUG() statement, couldn't it? ... so a BUG() is maybe not the right
thing to do here.  Would it be appropriate to return an error value to
the guest instead?

Actually no - the test at the top of the function for
is_power_of_2(size) etc. catches this safely before we get here.  The
BUG() is just paranoia.

+	}
+
+	return H_SUCCESS;
+}
+EXPORT_SYMBOL_GPL(kvmppc_h_logical_ci_load);

 Thomas

--
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson
[PATCHv4] kvmppc: Implement H_LOGICAL_CI_{LOAD,STORE} in KVM
On POWER, storage caching is usually configured via the MMU - attributes such as cache-inhibited are stored in the TLB and the hashed page table. This makes correctly performing cache inhibited IO accesses awkward when the MMU is turned off (real mode). Some CPU models provide special registers to control the cache attributes of real mode load and stores but this is not at all consistent. This is a problem in particular for SLOF, the firmware used on KVM guests, which runs entirely in real mode, but which needs to do IO to load the kernel. To simplify this qemu implements two special hypercalls, H_LOGICAL_CI_LOAD and H_LOGICAL_CI_STORE which simulate a cache-inhibited load or store to a logical address (aka guest physical address). SLOF uses these for IO. However, because these are implemented within qemu, not the host kernel, these bypass any IO devices emulated within KVM itself. The simplest way to see this problem is to attempt to boot a KVM guest from a virtio-blk device with iothread / dataplane enabled. The iothread code relies on an in kernel implementation of the virtio queue notification, which is not triggered by the IO hcalls, and so the guest will stall in SLOF unable to load the guest OS. This patch addresses this by providing in-kernel implementations of the 2 hypercalls, which correctly scan the KVM IO bus. Any access to an address not handled by the KVM IO bus will cause a VM exit, hitting the qemu implementation as before. Note that a userspace change is also required, in order to enable these new hcall implementations with KVM_CAP_PPC_ENABLE_HCALL. 
Signed-off-by: David Gibson da...@gibson.dropbear.id.au
---
 arch/powerpc/include/asm/kvm_book3s.h |  3 ++
 arch/powerpc/kvm/book3s.c             | 76 ++++++++++++++++++++++++
 arch/powerpc/kvm/book3s_hv.c          | 12 ++++
 arch/powerpc/kvm/book3s_pr_papr.c     | 28 +++++++++
 4 files changed, 119 insertions(+)

Changes in v4:
 * Rebase onto 4.0+, correct for changed signature of
   kvm_io_bus_{read,write}

Alex, I saw from some build system notifications that you seemed to
hit some troubles compiling the last version of this patch.  This
should fix it - hope it's not too late to get into 4.1.

diff --git a/arch/powerpc/include/asm/kvm_book3s.h b/arch/powerpc/include/asm/kvm_book3s.h
index 9930904..b91e74a 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -288,6 +288,9 @@ static inline bool kvmppc_supports_magic_page(struct kvm_vcpu *vcpu)
 	return !is_kvmppc_hv_enabled(vcpu->kvm);
 }

+extern int kvmppc_h_logical_ci_load(struct kvm_vcpu *vcpu);
+extern int kvmppc_h_logical_ci_store(struct kvm_vcpu *vcpu);
+
 /* Magic register values loaded into r3 and r4 before the 'sc' assembly
  * instruction for the OSI hypercalls */
 #define OSI_SC_MAGIC_R3	0x113724FA
diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
index cfbcdc6..453a8a4 100644
--- a/arch/powerpc/kvm/book3s.c
+++ b/arch/powerpc/kvm/book3s.c
@@ -821,6 +821,82 @@ void kvmppc_core_destroy_vm(struct kvm *kvm)
 #endif
 }

+int kvmppc_h_logical_ci_load(struct kvm_vcpu *vcpu)
+{
+	unsigned long size = kvmppc_get_gpr(vcpu, 4);
+	unsigned long addr = kvmppc_get_gpr(vcpu, 5);
+	u64 buf;
+	int ret;
+
+	if (!is_power_of_2(size) || (size > sizeof(buf)))
+		return H_TOO_HARD;
+
+	ret = kvm_io_bus_read(vcpu, KVM_MMIO_BUS, addr, size, &buf);
+	if (ret != 0)
+		return H_TOO_HARD;
+
+	switch (size) {
+	case 1:
+		kvmppc_set_gpr(vcpu, 4, *(u8 *)&buf);
+		break;
+
+	case 2:
+		kvmppc_set_gpr(vcpu, 4, be16_to_cpu(*(__be16 *)&buf));
+		break;
+
+	case 4:
+		kvmppc_set_gpr(vcpu, 4, be32_to_cpu(*(__be32 *)&buf));
+		break;
+
+	case 8:
+		kvmppc_set_gpr(vcpu, 4, be64_to_cpu(*(__be64 *)&buf));
+		break;
+
+	default:
+		BUG();
+	}
+
+	return H_SUCCESS;
+}
+EXPORT_SYMBOL_GPL(kvmppc_h_logical_ci_load);
+
+int kvmppc_h_logical_ci_store(struct kvm_vcpu *vcpu)
+{
+	unsigned long size = kvmppc_get_gpr(vcpu, 4);
+	unsigned long addr = kvmppc_get_gpr(vcpu, 5);
+	unsigned long val = kvmppc_get_gpr(vcpu, 6);
+	u64 buf;
+	int ret;
+
+	switch (size) {
+	case 1:
+		*(u8 *)&buf = val;
+		break;
+
+	case 2:
+		*(__be16 *)&buf = cpu_to_be16(val);
+		break;
+
+	case 4:
+		*(__be32 *)&buf = cpu_to_be32(val);
+		break;
+
+	case 8:
+		*(__be64 *)&buf = cpu_to_be64(val);
+		break;
+
+	default:
+		return H_TOO_HARD;
+	}
+
+	ret = kvm_io_bus_write(vcpu, KVM_MMIO_BUS, addr, size, &buf);
+	if (ret != 0)
+		return H_TOO_HARD;
+
+	return H_SUCCESS;
+}
+EXPORT_SYMBOL_GPL
Re: [PATCHv3] kvmppc: Implement H_LOGICAL_CI_{LOAD,STORE} in KVM
On Thu, Feb 05, 2015 at 01:57:11AM +0100, Alexander Graf wrote: On 05.02.15 01:53, David Gibson wrote: On POWER, storage caching is usually configured via the MMU - attributes such as cache-inhibited are stored in the TLB and the hashed page table. This makes correctly performing cache inhibited IO accesses awkward when the MMU is turned off (real mode). Some CPU models provide special registers to control the cache attributes of real mode load and stores but this is not at all consistent. This is a problem in particular for SLOF, the firmware used on KVM guests, which runs entirely in real mode, but which needs to do IO to load the kernel. To simplify this qemu implements two special hypercalls, H_LOGICAL_CI_LOAD and H_LOGICAL_CI_STORE which simulate a cache-inhibited load or store to a logical address (aka guest physical address). SLOF uses these for IO. However, because these are implemented within qemu, not the host kernel, these bypass any IO devices emulated within KVM itself. The simplest way to see this problem is to attempt to boot a KVM guest from a virtio-blk device with iothread / dataplane enabled. The iothread code relies on an in kernel implementation of the virtio queue notification, which is not triggered by the IO hcalls, and so the guest will stall in SLOF unable to load the guest OS. This patch addresses this by providing in-kernel implementations of the 2 hypercalls, which correctly scan the KVM IO bus. Any access to an address not handled by the KVM IO bus will cause a VM exit, hitting the qemu implementation as before. Note that a userspace change is also required, in order to enable these new hcall implementations with KVM_CAP_PPC_ENABLE_HCALL. Signed-off-by: David Gibson da...@gibson.dropbear.id.au Thanks, applied to kvm-ppc-queue. Any news on when this might go up to mainline? -- David Gibson| I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! 
http://www.ozlabs.org/~dgibson
[PATCHv3] kvmppc: Implement H_LOGICAL_CI_{LOAD,STORE} in KVM
On POWER, storage caching is usually configured via the MMU - attributes such as cache-inhibited are stored in the TLB and the hashed page table. This makes correctly performing cache inhibited IO accesses awkward when the MMU is turned off (real mode). Some CPU models provide special registers to control the cache attributes of real mode load and stores but this is not at all consistent. This is a problem in particular for SLOF, the firmware used on KVM guests, which runs entirely in real mode, but which needs to do IO to load the kernel. To simplify this qemu implements two special hypercalls, H_LOGICAL_CI_LOAD and H_LOGICAL_CI_STORE which simulate a cache-inhibited load or store to a logical address (aka guest physical address). SLOF uses these for IO. However, because these are implemented within qemu, not the host kernel, these bypass any IO devices emulated within KVM itself. The simplest way to see this problem is to attempt to boot a KVM guest from a virtio-blk device with iothread / dataplane enabled. The iothread code relies on an in kernel implementation of the virtio queue notification, which is not triggered by the IO hcalls, and so the guest will stall in SLOF unable to load the guest OS. This patch addresses this by providing in-kernel implementations of the 2 hypercalls, which correctly scan the KVM IO bus. Any access to an address not handled by the KVM IO bus will cause a VM exit, hitting the qemu implementation as before. Note that a userspace change is also required, in order to enable these new hcall implementations with KVM_CAP_PPC_ENABLE_HCALL. 
Signed-off-by: David Gibson da...@gibson.dropbear.id.au
---
 arch/powerpc/include/asm/kvm_book3s.h |  3 ++
 arch/powerpc/kvm/book3s.c             | 76 ++++++++++++++++++++++++
 arch/powerpc/kvm/book3s_hv.c          | 12 ++++
 arch/powerpc/kvm/book3s_pr_papr.c     | 28 +++++++++
 4 files changed, 119 insertions(+)

v3:
 - Use __beXX types instead of plain uXX where appropriate
 - Remove some unhelpful comments

v2:
 - Removed some debugging printk()s that were accidentally left in
 - Fix endianness; like all PAPR hypercalls, these should always act
   big-endian, even if the guest is little-endian (in practice this
   makes no difference, since the only user is SLOF, which is always
   big-endian)

diff --git a/arch/powerpc/include/asm/kvm_book3s.h b/arch/powerpc/include/asm/kvm_book3s.h
index 942c7b1..578e550 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -292,6 +292,9 @@ static inline bool kvmppc_supports_magic_page(struct kvm_vcpu *vcpu)
 	return !is_kvmppc_hv_enabled(vcpu->kvm);
 }

+extern int kvmppc_h_logical_ci_load(struct kvm_vcpu *vcpu);
+extern int kvmppc_h_logical_ci_store(struct kvm_vcpu *vcpu);
+
 /* Magic register values loaded into r3 and r4 before the 'sc' assembly
  * instruction for the OSI hypercalls */
 #define OSI_SC_MAGIC_R3	0x113724FA
diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
index 888bf46..397f433 100644
--- a/arch/powerpc/kvm/book3s.c
+++ b/arch/powerpc/kvm/book3s.c
@@ -820,6 +820,82 @@ void kvmppc_core_destroy_vm(struct kvm *kvm)
 #endif
 }

+int kvmppc_h_logical_ci_load(struct kvm_vcpu *vcpu)
+{
+	unsigned long size = kvmppc_get_gpr(vcpu, 4);
+	unsigned long addr = kvmppc_get_gpr(vcpu, 5);
+	u64 buf;
+	int ret;
+
+	if (!is_power_of_2(size) || (size > sizeof(buf)))
+		return H_TOO_HARD;
+
+	ret = kvm_io_bus_read(vcpu->kvm, KVM_MMIO_BUS, addr, size, &buf);
+	if (ret != 0)
+		return H_TOO_HARD;
+
+	switch (size) {
+	case 1:
+		kvmppc_set_gpr(vcpu, 4, *(u8 *)&buf);
+		break;
+
+	case 2:
+		kvmppc_set_gpr(vcpu, 4, be16_to_cpu(*(__be16 *)&buf));
+		break;
+
+	case 4:
+		kvmppc_set_gpr(vcpu, 4, be32_to_cpu(*(__be32 *)&buf));
+		break;
+
+	case 8:
+		kvmppc_set_gpr(vcpu, 4, be64_to_cpu(*(__be64 *)&buf));
+		break;
+
+	default:
+		BUG();
+	}
+
+	return H_SUCCESS;
+}
+EXPORT_SYMBOL_GPL(kvmppc_h_logical_ci_load);
+
+int kvmppc_h_logical_ci_store(struct kvm_vcpu *vcpu)
+{
+	unsigned long size = kvmppc_get_gpr(vcpu, 4);
+	unsigned long addr = kvmppc_get_gpr(vcpu, 5);
+	unsigned long val = kvmppc_get_gpr(vcpu, 6);
+	u64 buf;
+	int ret;
+
+	switch (size) {
+	case 1:
+		*(u8 *)&buf = val;
+		break;
+
+	case 2:
+		*(__be16 *)&buf = cpu_to_be16(val);
+		break;
+
+	case 4:
+		*(__be32 *)&buf = cpu_to_be32(val);
+		break;
+
+	case 8:
+		*(__be64 *)&buf = cpu_to_be64(val);
+		break;
+
+	default:
+		return H_TOO_HARD;
+	}
+
+	ret = kvm_io_bus_write(vcpu->kvm
Re: [PATCHv2] kvmppc: Implement H_LOGICAL_CI_{LOAD,STORE} in KVM
On Wed, Feb 04, 2015 at 04:24:46PM +0100, Alexander Graf wrote: On 03.02.15 06:44, David Gibson wrote: On POWER, storage caching is usually configured via the MMU - attributes such as cache-inhibited are stored in the TLB and the hashed page table. This makes correctly performing cache inhibited IO accesses awkward when the MMU is turned off (real mode). Some CPU models provide special registers to control the cache attributes of real mode load and stores but this is not at all consistent. This is a problem in particular for SLOF, the firmware used on KVM guests, which runs entirely in real mode, but which needs to do IO to load the kernel. To simplify this qemu implements two special hypercalls, H_LOGICAL_CI_LOAD and H_LOGICAL_CI_STORE which simulate a cache-inhibited load or store to a logical address (aka guest physical address). SLOF uses these for IO. However, because these are implemented within qemu, not the host kernel, these bypass any IO devices emulated within KVM itself. The simplest way to see this problem is to attempt to boot a KVM guest from a virtio-blk device with iothread / dataplane enabled. The iothread code relies on an in kernel implementation of the virtio queue notification, which is not triggered by the IO hcalls, and so the guest will stall in SLOF unable to load the guest OS. This patch addresses this by providing in-kernel implementations of the 2 hypercalls, which correctly scan the KVM IO bus. Any access to an address not handled by the KVM IO bus will cause a VM exit, hitting the qemu implementation as before. Note that a userspace change is also required, in order to enable these new hcall implementations with KVM_CAP_PPC_ENABLE_HCALL. 
Signed-off-by: David Gibson da...@gibson.dropbear.id.au --- arch/powerpc/include/asm/kvm_book3s.h | 3 ++ arch/powerpc/kvm/book3s.c | 76 +++ arch/powerpc/kvm/book3s_hv.c | 12 ++ arch/powerpc/kvm/book3s_pr_papr.c | 28 + 4 files changed, 119 insertions(+) v2: - Removed some debugging printk()s that were accidentally left in - Fix endianness; like all PAPR hypercalls, these should always act big-endian, even if the guest is little-endian (in practice this makes no difference, since the only user is SLOF, which is always big-endian) diff --git a/arch/powerpc/include/asm/kvm_book3s.h b/arch/powerpc/include/asm/kvm_book3s.h index 942c7b1..578e550 100644 --- a/arch/powerpc/include/asm/kvm_book3s.h +++ b/arch/powerpc/include/asm/kvm_book3s.h @@ -292,6 +292,9 @@ static inline bool kvmppc_supports_magic_page(struct kvm_vcpu *vcpu) return !is_kvmppc_hv_enabled(vcpu-kvm); } +extern int kvmppc_h_logical_ci_load(struct kvm_vcpu *vcpu); +extern int kvmppc_h_logical_ci_store(struct kvm_vcpu *vcpu); + /* Magic register values loaded into r3 and r4 before the 'sc' assembly * instruction for the OSI hypercalls */ #define OSI_SC_MAGIC_R30x113724FA diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c index 888bf46..7b51492 100644 --- a/arch/powerpc/kvm/book3s.c +++ b/arch/powerpc/kvm/book3s.c @@ -820,6 +820,82 @@ void kvmppc_core_destroy_vm(struct kvm *kvm) #endif } +int kvmppc_h_logical_ci_load(struct kvm_vcpu *vcpu) +{ + unsigned long size = kvmppc_get_gpr(vcpu, 4); + unsigned long addr = kvmppc_get_gpr(vcpu, 5); + u64 buf; + int ret; + + if (!is_power_of_2(size) || (size sizeof(buf))) + return H_TOO_HARD; + + ret = kvm_io_bus_read(vcpu-kvm, KVM_MMIO_BUS, addr, size, buf); + if (ret != 0) + return H_TOO_HARD; + + switch (size) { + case 1: + kvmppc_set_gpr(vcpu, 4, *(u8 *)buf); + break; + + case 2: + kvmppc_set_gpr(vcpu, 4, be16_to_cpu(*(u16 *)buf)); + break; + + case 4: + kvmppc_set_gpr(vcpu, 4, be32_to_cpu(*(u32 *)buf)); + break; + + case 8: + kvmppc_set_gpr(vcpu, 4, 
be64_to_cpu(*(u64 *)&buf)); Shouldn't these casts be __be types? Ah, yes they should. + break; + + default: + BUG(); + } + + return H_SUCCESS; +} +EXPORT_SYMBOL_GPL(kvmppc_h_logical_ci_load); /* For use by the kvm-pr module */ No need for the comment. Ok. -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson
[PATCHv2] kvmppc: Implement H_LOGICAL_CI_{LOAD,STORE} in KVM
On POWER, storage caching is usually configured via the MMU - attributes such as cache-inhibited are stored in the TLB and the hashed page table. This makes correctly performing cache inhibited IO accesses awkward when the MMU is turned off (real mode). Some CPU models provide special registers to control the cache attributes of real mode load and stores but this is not at all consistent. This is a problem in particular for SLOF, the firmware used on KVM guests, which runs entirely in real mode, but which needs to do IO to load the kernel. To simplify this qemu implements two special hypercalls, H_LOGICAL_CI_LOAD and H_LOGICAL_CI_STORE which simulate a cache-inhibited load or store to a logical address (aka guest physical address). SLOF uses these for IO. However, because these are implemented within qemu, not the host kernel, these bypass any IO devices emulated within KVM itself. The simplest way to see this problem is to attempt to boot a KVM guest from a virtio-blk device with iothread / dataplane enabled. The iothread code relies on an in kernel implementation of the virtio queue notification, which is not triggered by the IO hcalls, and so the guest will stall in SLOF unable to load the guest OS. This patch addresses this by providing in-kernel implementations of the 2 hypercalls, which correctly scan the KVM IO bus. Any access to an address not handled by the KVM IO bus will cause a VM exit, hitting the qemu implementation as before. Note that a userspace change is also required, in order to enable these new hcall implementations with KVM_CAP_PPC_ENABLE_HCALL. 
Signed-off-by: David Gibson da...@gibson.dropbear.id.au --- arch/powerpc/include/asm/kvm_book3s.h | 3 ++ arch/powerpc/kvm/book3s.c | 76 +++ arch/powerpc/kvm/book3s_hv.c | 12 ++ arch/powerpc/kvm/book3s_pr_papr.c | 28 + 4 files changed, 119 insertions(+) v2: - Removed some debugging printk()s that were accidentally left in - Fix endianness; like all PAPR hypercalls, these should always act big-endian, even if the guest is little-endian (in practice this makes no difference, since the only user is SLOF, which is always big-endian) diff --git a/arch/powerpc/include/asm/kvm_book3s.h b/arch/powerpc/include/asm/kvm_book3s.h index 942c7b1..578e550 100644 --- a/arch/powerpc/include/asm/kvm_book3s.h +++ b/arch/powerpc/include/asm/kvm_book3s.h @@ -292,6 +292,9 @@ static inline bool kvmppc_supports_magic_page(struct kvm_vcpu *vcpu) return !is_kvmppc_hv_enabled(vcpu->kvm); } +extern int kvmppc_h_logical_ci_load(struct kvm_vcpu *vcpu); +extern int kvmppc_h_logical_ci_store(struct kvm_vcpu *vcpu); + /* Magic register values loaded into r3 and r4 before the 'sc' assembly * instruction for the OSI hypercalls */ #define OSI_SC_MAGIC_R3 0x113724FA diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c index 888bf46..7b51492 100644 --- a/arch/powerpc/kvm/book3s.c +++ b/arch/powerpc/kvm/book3s.c @@ -820,6 +820,82 @@ void kvmppc_core_destroy_vm(struct kvm *kvm) #endif } +int kvmppc_h_logical_ci_load(struct kvm_vcpu *vcpu) +{ + unsigned long size = kvmppc_get_gpr(vcpu, 4); + unsigned long addr = kvmppc_get_gpr(vcpu, 5); + u64 buf; + int ret; + + if (!is_power_of_2(size) || (size > sizeof(buf))) + return H_TOO_HARD; + + ret = kvm_io_bus_read(vcpu->kvm, KVM_MMIO_BUS, addr, size, &buf); + if (ret != 0) + return H_TOO_HARD; + + switch (size) { + case 1: + kvmppc_set_gpr(vcpu, 4, *(u8 *)&buf); + break; + + case 2: + kvmppc_set_gpr(vcpu, 4, be16_to_cpu(*(u16 *)&buf)); + break; + + case 4: + kvmppc_set_gpr(vcpu, 4, be32_to_cpu(*(u32 *)&buf)); + break; + + case 8: + kvmppc_set_gpr(vcpu, 4, 
be64_to_cpu(*(u64 *)&buf)); + break; + + default: + BUG(); + } + + return H_SUCCESS; +} +EXPORT_SYMBOL_GPL(kvmppc_h_logical_ci_load); /* For use by the kvm-pr module */ + +int kvmppc_h_logical_ci_store(struct kvm_vcpu *vcpu) +{ + unsigned long size = kvmppc_get_gpr(vcpu, 4); + unsigned long addr = kvmppc_get_gpr(vcpu, 5); + unsigned long val = kvmppc_get_gpr(vcpu, 6); + u64 buf; + int ret; + + switch (size) { + case 1: + *(u8 *)&buf = val; + break; + + case 2: + *(u16 *)&buf = cpu_to_be16(val); + break; + + case 4: + *(u32 *)&buf = cpu_to_be32(val); + break; + + case 8: + *(u64 *)&buf = cpu_to_be64(val); + break; + + default: + return H_TOO_HARD; + } + + ret = kvm_io_bus_write(vcpu->kvm, KVM_MMIO_BUS, addr, size, &buf); + if (ret != 0) + return H_TOO_HARD; + + return
Re: [PATCH] kvmppc: Implement H_LOGICAL_CI_{LOAD,STORE} in KVM
On Mon, Feb 02, 2015 at 09:59:16AM +0100, Paolo Bonzini wrote: On 02/02/2015 08:45, David Gibson wrote: + case H_LOGICAL_CI_LOAD: + ret = kvmppc_h_logical_ci_load(vcpu); + if (ret == H_TOO_HARD) { + printk("Punting H_LOGICAL_CI_LOAD\n"); + return RESUME_HOST; + } + break; + case H_LOGICAL_CI_STORE: + ret = kvmppc_h_logical_ci_store(vcpu); + if (ret == H_TOO_HARD) { + printk("Punting H_LOGICAL_CI_STORE\n"); + return RESUME_HOST; + } + break; case H_SET_MODE: ret = kvmppc_h_set_mode(vcpu, kvmppc_get_gpr(vcpu, 4), kvmppc_get_gpr(vcpu, 5), KERN_DEBUG I guess? Or even no printk at all perhaps. Oh, bugger, removed most of the debug code, but not all of it. -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson
[PATCH] kvmppc: Implement H_LOGICAL_CI_{LOAD,STORE} in KVM
On POWER, storage caching is usually configured via the MMU - attributes such as cache-inhibited are stored in the TLB and the hashed page table. This makes correctly performing cache inhibited IO accesses awkward when the MMU is turned off (real mode). Some CPU models provide special registers to control the cache attributes of real mode load and stores but this is not at all consistent. This is a problem in particular for SLOF, the firmware used on KVM guests, which runs entirely in real mode, but which needs to do IO to load the kernel. To simplify this qemu implements two special hypercalls, H_LOGICAL_CI_LOAD and H_LOGICAL_CI_STORE which simulate a cache-inhibited load or store to a logical address (aka guest physical address). SLOF uses these for IO. However, because these are implemented within qemu, not the host kernel, these bypass any IO devices emulated within KVM itself. The simplest way to see this problem is to attempt to boot a KVM guest from a virtio-blk device with iothread / dataplane enabled. The iothread code relies on an in kernel implementation of the virtio queue notification, which is not triggered by the IO hcalls, and so the guest will stall in SLOF unable to load the guest OS. This patch addresses this by providing in-kernel implementations of the 2 hypercalls, which correctly scan the KVM IO bus. Any access to an address not handled by the KVM IO bus will cause a VM exit, hitting the qemu implementation as before. Note that a userspace change is also required, in order to enable these new hcall implementations with KVM_CAP_PPC_ENABLE_HCALL. 
Signed-off-by: David Gibson da...@gibson.dropbear.id.au --- arch/powerpc/include/asm/kvm_book3s.h | 3 ++ arch/powerpc/kvm/book3s.c | 76 +++ arch/powerpc/kvm/book3s_hv.c | 16 arch/powerpc/kvm/book3s_pr_papr.c | 28 + 4 files changed, 123 insertions(+) diff --git a/arch/powerpc/include/asm/kvm_book3s.h b/arch/powerpc/include/asm/kvm_book3s.h index 942c7b1..578e550 100644 --- a/arch/powerpc/include/asm/kvm_book3s.h +++ b/arch/powerpc/include/asm/kvm_book3s.h @@ -292,6 +292,9 @@ static inline bool kvmppc_supports_magic_page(struct kvm_vcpu *vcpu) return !is_kvmppc_hv_enabled(vcpu->kvm); } +extern int kvmppc_h_logical_ci_load(struct kvm_vcpu *vcpu); +extern int kvmppc_h_logical_ci_store(struct kvm_vcpu *vcpu); + /* Magic register values loaded into r3 and r4 before the 'sc' assembly * instruction for the OSI hypercalls */ #define OSI_SC_MAGIC_R3 0x113724FA diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c index 888bf46..792c7cf 100644 --- a/arch/powerpc/kvm/book3s.c +++ b/arch/powerpc/kvm/book3s.c @@ -820,6 +820,82 @@ void kvmppc_core_destroy_vm(struct kvm *kvm) #endif } +int kvmppc_h_logical_ci_load(struct kvm_vcpu *vcpu) +{ + unsigned long size = kvmppc_get_gpr(vcpu, 4); + unsigned long addr = kvmppc_get_gpr(vcpu, 5); + u64 buf; + int ret; + + if (!is_power_of_2(size) || (size > sizeof(buf))) + return H_TOO_HARD; + + ret = kvm_io_bus_read(vcpu->kvm, KVM_MMIO_BUS, addr, size, &buf); + if (ret != 0) + return H_TOO_HARD; + + switch (size) { + case 1: + kvmppc_set_gpr(vcpu, 4, *(u8 *)&buf); + break; + + case 2: + kvmppc_set_gpr(vcpu, 4, *(u16 *)&buf); + break; + + case 4: + kvmppc_set_gpr(vcpu, 4, *(u32 *)&buf); + break; + + case 8: + kvmppc_set_gpr(vcpu, 4, *(u64 *)&buf); + break; + + default: + BUG(); + } + + return H_SUCCESS; +} +EXPORT_SYMBOL_GPL(kvmppc_h_logical_ci_load); /* For use by the kvm-pr module */ + +int kvmppc_h_logical_ci_store(struct kvm_vcpu *vcpu) +{ + unsigned long size = kvmppc_get_gpr(vcpu, 4); + unsigned long addr = kvmppc_get_gpr(vcpu, 5); + 
unsigned long val = kvmppc_get_gpr(vcpu, 6); + u64 buf; + int ret; + + switch (size) { + case 1: + *(u8 *)&buf = val; + break; + + case 2: + *(u16 *)&buf = val; + break; + + case 4: + *(u32 *)&buf = val; + break; + + case 8: + *(u64 *)&buf = val; + break; + + default: + return H_TOO_HARD; + } + + ret = kvm_io_bus_write(vcpu->kvm, KVM_MMIO_BUS, addr, size, &buf); + if (ret != 0) + return H_TOO_HARD; + + return H_SUCCESS; +} +EXPORT_SYMBOL_GPL(kvmppc_h_logical_ci_store); /* For use by the kvm-pr module */ + int kvmppc_core_check_processor_compat(void) { /* diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c index de4018a..1013019 100644 --- a/arch/powerpc/kvm/book3s_hv.c +++ b/arch/powerpc/kvm/book3s_hv.c @@ -706,6 +706,20 @@ int kvmppc_pseries_do_hcall
Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling
On Sun, Jun 23, 2013 at 10:41:24PM -0600, Alex Williamson wrote: On Mon, 2013-06-24 at 13:52 +1000, David Gibson wrote: On Sat, Jun 22, 2013 at 08:28:06AM -0600, Alex Williamson wrote: On Sat, 2013-06-22 at 22:03 +1000, David Gibson wrote: On Thu, Jun 20, 2013 at 08:55:13AM -0600, Alex Williamson wrote: On Thu, 2013-06-20 at 18:48 +1000, Alexey Kardashevskiy wrote: On 06/20/2013 05:47 PM, Benjamin Herrenschmidt wrote: On Thu, 2013-06-20 at 15:28 +1000, David Gibson wrote: Just out of curiosity - would not get_file() and fput_atomic() on a group's file* do the right job instead of vfio_group_add_external_user() and vfio_group_del_external_user()? I was thinking that too. Grabbing a file reference would certainly be the usual way of handling this sort of thing. But that wouldn't prevent the group ownership to be returned to the kernel or another user would it ? Holding the file pointer does not let the group-container_users counter go to zero How so? Holding the file pointer means the file won't go away, which means the group release function won't be called. That means the group won't go away, but that doesn't mean it's attached to an IOMMU. A user could call UNSET_CONTAINER. Uhh... *thinks*. Ah, I see. I think the interface should not take the group fd, but the container fd. Holding a reference to *that* would keep the necessary things around. But more to the point, it's the right thing semantically: The container is essentially the handle on a host iommu address space, and so that's what should be bound by the KVM call to a particular guest iommu address space. e.g. it would make no sense to bind two different groups to different guest iommu address spaces, if they were in the same container - the guest thinks they are different spaces, but if they're in the same container they must be the same space. While the container is the gateway to the iommu, what empowers the container to maintain an iommu is the group. 
What happens to a container when all the groups are disconnected or closed? Groups are the unit that indicates hardware access, not containers. Thanks, Uh... huh? I'm really not sure what you're getting at. The operation we're doing for KVM here is binding a guest iommu address space to a particular host iommu address space. Why would we not want to use the obvious handle on the host iommu address space, which is the container fd? AIUI, the request isn't for an interface through which to do iommu mappings. The request is for an interface to show that the user has sufficient privileges to do mappings. Groups are what gives the user that ability. The iommu is also possibly associated with multiple iommu groups and I believe what is being asked for here is a way to hold and lock a single iommu group with iommu protection. From a practical point of view, the iommu interface is de-privileged once the groups are disconnected or closed. Holding a reference count on the iommu fd won't prevent that. That means we'd have to use a notifier to have KVM stop the side-channel iommu access. Meanwhile holding the file descriptor for the group and adding an interface that bumps use counter allows KVM to lock itself in, just as if it had a device opened itself. Thanks, Ah, good point. -- David Gibson| I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson pgpAjDK6J9rb1.pgp Description: PGP signature
Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling
On Sun, Jun 23, 2013 at 09:28:13AM +1000, Benjamin Herrenschmidt wrote: On Sat, 2013-06-22 at 22:03 +1000, David Gibson wrote: I think the interface should not take the group fd, but the container fd. Holding a reference to *that* would keep the necessary things around. But more to the point, it's the right thing semantically: The container is essentially the handle on a host iommu address space, and so that's what should be bound by the KVM call to a particular guest iommu address space. e.g. it would make no sense to bind two different groups to different guest iommu address spaces, if they were in the same container - the guest thinks they are different spaces, but if they're in the same container they must be the same space. Interestingly, how are we going to extend that when/if we implement DDW ? DDW means an API by which the guest can request the creation of additional iommus for a given device (typically, in addition to the default smallish 32-bit one using 4k pages, the guest can request a larger window in 64-bit space using a larger page size). So, would a PAPR gest requesting this expect the new window to have a new liobn, or an existing liobn? -- David Gibson| I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson pgp5jkS8m1XZl.pgp Description: PGP signature
Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling
On Sat, Jun 22, 2013 at 08:28:06AM -0600, Alex Williamson wrote: On Sat, 2013-06-22 at 22:03 +1000, David Gibson wrote: On Thu, Jun 20, 2013 at 08:55:13AM -0600, Alex Williamson wrote: On Thu, 2013-06-20 at 18:48 +1000, Alexey Kardashevskiy wrote: On 06/20/2013 05:47 PM, Benjamin Herrenschmidt wrote: On Thu, 2013-06-20 at 15:28 +1000, David Gibson wrote: Just out of curiosity - would not get_file() and fput_atomic() on a group's file* do the right job instead of vfio_group_add_external_user() and vfio_group_del_external_user()? I was thinking that too. Grabbing a file reference would certainly be the usual way of handling this sort of thing. But that wouldn't prevent the group ownership to be returned to the kernel or another user would it ? Holding the file pointer does not let the group-container_users counter go to zero How so? Holding the file pointer means the file won't go away, which means the group release function won't be called. That means the group won't go away, but that doesn't mean it's attached to an IOMMU. A user could call UNSET_CONTAINER. Uhh... *thinks*. Ah, I see. I think the interface should not take the group fd, but the container fd. Holding a reference to *that* would keep the necessary things around. But more to the point, it's the right thing semantically: The container is essentially the handle on a host iommu address space, and so that's what should be bound by the KVM call to a particular guest iommu address space. e.g. it would make no sense to bind two different groups to different guest iommu address spaces, if they were in the same container - the guest thinks they are different spaces, but if they're in the same container they must be the same space. While the container is the gateway to the iommu, what empowers the container to maintain an iommu is the group. What happens to a container when all the groups are disconnected or closed? Groups are the unit that indicates hardware access, not containers. Thanks, Uh... huh? 
I'm really not sure what you're getting at. The operation we're doing for KVM here is binding a guest iommu address space to a particular host iommu address space. Why would we not want to use the obvious handle on the host iommu address space, which is the container fd? -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson
Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling
On Thu, Jun 20, 2013 at 08:55:13AM -0600, Alex Williamson wrote: On Thu, 2013-06-20 at 18:48 +1000, Alexey Kardashevskiy wrote: On 06/20/2013 05:47 PM, Benjamin Herrenschmidt wrote: On Thu, 2013-06-20 at 15:28 +1000, David Gibson wrote: Just out of curiosity - would not get_file() and fput_atomic() on a group's file* do the right job instead of vfio_group_add_external_user() and vfio_group_del_external_user()? I was thinking that too. Grabbing a file reference would certainly be the usual way of handling this sort of thing. But that wouldn't prevent the group ownership to be returned to the kernel or another user would it ? Holding the file pointer does not let the group-container_users counter go to zero How so? Holding the file pointer means the file won't go away, which means the group release function won't be called. That means the group won't go away, but that doesn't mean it's attached to an IOMMU. A user could call UNSET_CONTAINER. Uhh... *thinks*. Ah, I see. I think the interface should not take the group fd, but the container fd. Holding a reference to *that* would keep the necessary things around. But more to the point, it's the right thing semantically: The container is essentially the handle on a host iommu address space, and so that's what should be bound by the KVM call to a particular guest iommu address space. e.g. it would make no sense to bind two different groups to different guest iommu address spaces, if they were in the same container - the guest thinks they are different spaces, but if they're in the same container they must be the same space. -- David Gibson| I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson pgp9ekBfTPFTm.pgp Description: PGP signature
Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling
On Wed, May 22, 2013 at 04:06:57PM -0500, Scott Wood wrote: On 05/20/2013 10:06:46 PM, Alexey Kardashevskiy wrote: diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c index 8465c2a..da6bf61 100644 --- a/arch/powerpc/kvm/powerpc.c @@ -396,6 +396,7 @@ int kvm_dev_ioctl_check_extension(long ext) +++ b/arch/powerpc/kvm/powerpc.c break; #endif case KVM_CAP_SPAPR_MULTITCE: +case KVM_CAP_SPAPR_TCE_IOMMU: r = 1; break; default: Don't advertise SPAPR capabilities if it's not book3s -- and probably there's some additional limitation that would be appropriate. So, in the case of MULTITCE, that's not quite right. PR KVM can emulate a PAPR system on a BookE machine, and there's no reason not to allow TCE acceleration as well. We can't make it dependent on PAPR mode being selected, because that's enabled per-vcpu, whereas these capabilities are queried on the VM before the vcpus are created. CAP_SPAPR_TCE_IOMMU should be dependent on the presence of suitable host side hardware (i.e. a PAPR style IOMMU), though. @@ -1025,6 +1026,17 @@ long kvm_arch_vm_ioctl(struct file *filp, r = kvm_vm_ioctl_create_spapr_tce(kvm, create_tce); goto out; } +case KVM_CREATE_SPAPR_TCE_IOMMU: { +struct kvm_create_spapr_tce_iommu create_tce_iommu; +struct kvm *kvm = filp-private_data; + +r = -EFAULT; +if (copy_from_user(create_tce_iommu, argp, +sizeof(create_tce_iommu))) +goto out; +r = kvm_vm_ioctl_create_spapr_tce_iommu(kvm, create_tce_iommu); +goto out; +} #endif /* CONFIG_PPC_BOOK3S_64 */ #ifdef CONFIG_KVM_BOOK3S_64_HV diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h index 5a2afda..450c82a 100644 --- a/include/uapi/linux/kvm.h +++ b/include/uapi/linux/kvm.h @@ -667,6 +667,7 @@ struct kvm_ppc_smmu_info { #define KVM_CAP_PPC_RTAS 91 #define KVM_CAP_IRQ_XICS 92 #define KVM_CAP_SPAPR_MULTITCE (0x11 + 89) +#define KVM_CAP_SPAPR_TCE_IOMMU (0x11 + 90) Hmm... Ah, yeah, that needs to be fixed. 
Those were interim numbers so that we didn't have to keep changing our internal trees as new upstream ioctls got added to the list. We need to get a proper number for the merge, though. @@ -939,6 +940,9 @@ struct kvm_s390_ucas_mapping { #define KVM_GET_DEVICE_ATTR _IOW(KVMIO, 0xe2, struct kvm_device_attr) #define KVM_HAS_DEVICE_ATTR _IOW(KVMIO, 0xe3, struct kvm_device_attr) +/* ioctl for SPAPR TCE IOMMU */ +#define KVM_CREATE_SPAPR_TCE_IOMMU _IOW(KVMIO, 0xe4, struct kvm_create_spapr_tce_iommu) Shouldn't this go under the vm ioctl section? -Scott ___ Linuxppc-dev mailing list linuxppc-...@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev -- David Gibson| I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson signature.asc Description: Digital signature
Re: [PATCH 2/6] KVM: PPC: Add support for multiple-TCE hcalls
On Mon, May 06, 2013 at 05:25:53PM +1000, Alexey Kardashevskiy wrote: This adds real mode handlers for the H_PUT_TCE_INDIRECT and H_STUFF_TCE hypercalls for QEMU emulated devices such as virtio devices or emulated PCI. These calls allow adding multiple entries (up to 512) into the TCE table in one call which saves time on transition to/from real mode. This adds a guest physical to host real address converter and calls the existing H_PUT_TCE handler. The converting function is going to be fully utilized by upcoming VFIO supporting patches. This also implements the KVM_CAP_PPC_MULTITCE capability, so in order to support the functionality of this patch, QEMU needs to query for this capability and set the hcall-multi-tce hypertas property only if the capability is present, otherwise there will be serious performance degradation. Hrm. Clearly I didn't read this carefully enough before. There are some problems here. [snip] diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c index 72ffc89..643ac1e 100644 --- a/arch/powerpc/kvm/book3s_64_vio.c +++ b/arch/powerpc/kvm/book3s_64_vio.c @@ -14,6 +14,7 @@ * * Copyright 2010 Paul Mackerras, IBM Corp. pau...@au1.ibm.com * Copyright 2011 David Gibson, IBM Corporation d...@au1.ibm.com + * Copyright 2013 Alexey Kardashevskiy, IBM Corporation a...@au1.ibm.com */ #include <linux/types.h> @@ -36,9 +37,14 @@ #include <asm/ppc-opcode.h> #include <asm/kvm_host.h> #include <asm/udbg.h> +#include <asm/iommu.h> #define TCES_PER_PAGE (PAGE_SIZE / sizeof(u64)) +#define ERROR_ADDR (~(unsigned long)0x0) +/* + * TCE tables handlers. + */ static long kvmppc_stt_npages(unsigned long window_size) { return ALIGN((window_size >> SPAPR_TCE_SHIFT) @@ -148,3 +154,111 @@ fail: } return ret; } + +/* + * Virtual mode handling of IOMMU map/unmap. 
+ */ +/* Converts guest physical address into host virtual */ +static unsigned long get_virt_address(struct kvm_vcpu *vcpu, + unsigned long gpa) This should probably return a void * rather than an unsigned long. Well, actually a void __user *. +{ + unsigned long hva, gfn = gpa >> PAGE_SHIFT; + struct kvm_memory_slot *memslot; + + memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn); + if (!memslot) + return ERROR_ADDR; + + /* + * Convert gfn to hva preserving flags and an offset + * within a system page + */ + hva = __gfn_to_hva_memslot(memslot, gfn) + (gpa & ~PAGE_MASK); + return hva; +} + +long kvmppc_virtmode_h_put_tce(struct kvm_vcpu *vcpu, + unsigned long liobn, unsigned long ioba, + unsigned long tce) +{ + struct kvmppc_spapr_tce_table *tt; + + tt = kvmppc_find_tce_table(vcpu, liobn); + /* Didn't find the liobn, put it to userspace */ + if (!tt) + return H_TOO_HARD; + + /* Emulated IO */ + return kvmppc_emulated_h_put_tce(tt, ioba, tce); +} + +long kvmppc_virtmode_h_put_tce_indirect(struct kvm_vcpu *vcpu, + unsigned long liobn, unsigned long ioba, + unsigned long tce_list, unsigned long npages) +{ + struct kvmppc_spapr_tce_table *tt; + long i; + unsigned long tces; + + /* The whole table addressed by tce_list resides in 4K page */ + if (npages > 512) + return H_PARAMETER; So, that doesn't actually verify what the comment says it does - only that the list is 4K in total. You need to check the alignment of tce_list as well. + + tt = kvmppc_find_tce_table(vcpu, liobn); + /* Didn't find the liobn, put it to userspace */ + if (!tt) + return H_TOO_HARD; + + tces = get_virt_address(vcpu, tce_list); + if (tces == ERROR_ADDR) + return H_TOO_HARD; + + /* Emulated IO */ This comment doesn't seem to have any bearing on the test which follows it. 
+ if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size) + return H_PARAMETER; + + for (i = 0; i < npages; ++i) { + unsigned long tce; + unsigned long ptce = tces + i * sizeof(unsigned long); + + if (get_user(tce, (unsigned long __user *)ptce)) + break; + + if (kvmppc_emulated_h_put_tce(tt, + ioba + (i << IOMMU_PAGE_SHIFT), tce)) + break; + } + if (i == npages) + return H_SUCCESS; + + /* Failed, do cleanup */ + do { + --i; + kvmppc_emulated_h_put_tce(tt, ioba + (i << IOMMU_PAGE_SHIFT), + 0); + } while (i); Hrm, so, actually PAPR specifies that this hcall is supposed to first copy the given tces to hypervisor memory, then translate (and validate) them all, and only then touch the actual TCE table. Rather more complicated to do, but I guess we should - that would get rid
Re: [PATCH 5/6] KVM: PPC: Add support for IOMMU in-kernel handling
On Tue, May 07, 2013 at 03:51:31PM +1000, Alexey Kardashevskiy wrote: On 05/07/2013 03:29 PM, David Gibson wrote: On Mon, May 06, 2013 at 05:25:56PM +1000, Alexey Kardashevskiy wrote: This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT and H_STUFF_TCE requests without passing them to QEMU, which should save time on switching to QEMU and back. Both real and virtual modes are supported - whenever the kernel fails to handle TCE request, it passes it to the virtual mode. If it the virtual mode handlers fail, then the request is passed to the user mode, for example, to QEMU. This adds a new KVM_CAP_SPAPR_TCE_IOMMU ioctl to asssociate a virtual PCI bus ID (LIOBN) with an IOMMU group, which enables in-kernel handling of IOMMU map/unmap. This adds a special case for huge pages (16MB). The reference counting cannot be easily done for such pages in real mode (when MMU is off) so we added a list of huge pages. It is populated in virtual mode and get_page is called just once per a huge page. Real mode handlers check if the requested page is huge and in the list, then no reference counting is done, otherwise an exit to virtual mode happens. The list is released at KVM exit. At the moment the fastest card available for tests uses up to 9 huge pages so walking through this list is not very expensive. However this can change and we may want to optimize this. This also adds the virt_only parameter to the KVM module for debug and performance check purposes. Tests show that this patch increases transmission speed from 220MB/s to 750..1020MB/s on 10Gb network (Chelsea CXGB3 10Gb ethernet card). 
Cc: David Gibson da...@gibson.dropbear.id.au Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru Signed-off-by: Paul Mackerras pau...@samba.org --- Documentation/virtual/kvm/api.txt | 28 arch/powerpc/include/asm/kvm_host.h |2 + arch/powerpc/include/asm/kvm_ppc.h |2 + arch/powerpc/include/uapi/asm/kvm.h |7 + arch/powerpc/kvm/book3s_64_vio.c| 242 ++- arch/powerpc/kvm/book3s_64_vio_hv.c | 192 +++ arch/powerpc/kvm/powerpc.c | 12 ++ include/uapi/linux/kvm.h|2 + 8 files changed, 485 insertions(+), 2 deletions(-) diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt index f621cd6..2039767 100644 --- a/Documentation/virtual/kvm/api.txt +++ b/Documentation/virtual/kvm/api.txt @@ -2127,6 +2127,34 @@ written, then `n_invalid' invalid entries, invalidating any previously valid entries found. +4.79 KVM_CREATE_SPAPR_TCE_IOMMU + +Capability: KVM_CAP_SPAPR_TCE_IOMMU +Architectures: powerpc +Type: vm ioctl +Parameters: struct kvm_create_spapr_tce_iommu (in) +Returns: 0 on success, -1 on error + +This creates a link between IOMMU group and a hardware TCE (translation +control entry) table. This link lets the host kernel know what IOMMU +group (i.e. TCE table) to use for the LIOBN number passed with +H_PUT_TCE, H_PUT_TCE_INDIRECT, H_STUFF_TCE hypercalls. + +/* for KVM_CAP_SPAPR_TCE_IOMMU */ +struct kvm_create_spapr_tce_iommu { + __u64 liobn; + __u32 iommu_id; Wouldn't it be more in keeping pardon? Sorry, I was going to suggest a change, but then realised it wasn't actually any better than what you have now. + __u32 flags; +}; + +No flag is supported at the moment. + +When the guest issues TCE call on a liobn for which a TCE table has been +registered, the kernel will handle it in real mode, updating the hardware +TCE table. TCE table calls for other liobns will cause a vm exit and must +be handled by userspace. + + 5. 
The kvm_run structure diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h index 36ceb0d..2b70cbc 100644 --- a/arch/powerpc/include/asm/kvm_host.h +++ b/arch/powerpc/include/asm/kvm_host.h @@ -178,6 +178,8 @@ struct kvmppc_spapr_tce_table { struct kvm *kvm; u64 liobn; u32 window_size; + bool virtmode_only; I see this is now initialized from the global parameter, but I think it would be better to just check the global (debug) parameter directly, rather than duplicating it here. The global parameter is in kvm.ko and the struct above is in the real mode part which cannot go to the module. Ah, ok. I'm half inclined to just drop the virtmode_only thing entirely. + struct iommu_group *grp;/* used for IOMMU groups */ struct page *pages[0]; }; diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h index d501246..bdfa140 100644 --- a/arch/powerpc/include/asm/kvm_ppc.h +++ b/arch/powerpc/include/asm/kvm_ppc.h @@ -139,6 +139,8 @@ extern void kvmppc_xics_free(struct kvm *kvm); extern long
Re: [PATCH 5/6] KVM: PPC: Add support for IOMMU in-kernel handling
On Tue, May 07, 2013 at 04:27:49PM +1000, Alexey Kardashevskiy wrote: On 05/07/2013 04:02 PM, David Gibson wrote: On Tue, May 07, 2013 at 03:51:31PM +1000, Alexey Kardashevskiy wrote: On 05/07/2013 03:29 PM, David Gibson wrote: On Mon, May 06, 2013 at 05:25:56PM +1000, Alexey Kardashevskiy wrote: [snip] +#define DRIVER_VERSION 0.1 +#define DRIVER_AUTHOR Paul Mackerras, IBM Corp. pau...@au1.ibm.com +#define DRIVER_DESC POWERPC KVM driver Really? What is wrong here? Well, it seems entirely unrelated to the rest of the changes, The patch adds a module parameter so I had to add those DRIVER_xxx. Ah, ok. and not obviously accurate. Let's fix it then. How? Paul signed it... Fair enough then. -- David Gibson| I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson signature.asc Description: Digital signature
Re: [PATCH 2/6] KVM: PPC: Add support for multiple-TCE hcalls
On Mon, May 06, 2013 at 05:25:53PM +1000, Alexey Kardashevskiy wrote: This adds real mode handlers for the H_PUT_TCE_INDIRECT and H_STUFF_TCE hypercalls for QEMU emulated devices such as virtio devices or emulated PCI. These calls allow adding multiple entries (up to 512) into the TCE table in one call which saves time on transition to/from real mode. This adds a guest physical to host real address converter and calls the existing H_PUT_TCE handler. The converting function is going to be fully utilized by upcoming VFIO supporting patches. This also implements the KVM_CAP_PPC_MULTITCE capability, so in order to support the functionality of this patch, QEMU needs to query for this capability and set the hcall-multi-tce hypertas property only if the capability is present, otherwise there will be serious performance degradation. Cc: David Gibson da...@gibson.dropbear.id.au Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru Signed-off-by: Paul Mackerras pau...@samba.org Fwiw, it would be nice to get this patch merged, regardless of the rest of the VFIO/powerpc patches. -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson
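[Editor's note: the batching semantics described above can be sketched in plain user-space C. This is a minimal model, not the kernel implementation: the table size, page shift, and function names below are illustrative assumptions; only the up-to-512-entries-per-call behavior comes from the hcall described in the patch.]

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical user-space model of an sPAPR TCE table: one 64-bit
 * translation entry per 4K IOMMU page. Sizes are illustrative. */
#define TCE_PAGE_SHIFT    12
#define TCE_TABLE_ENTRIES 1024

static uint64_t tce_table[TCE_TABLE_ENTRIES];

/* Model of H_PUT_TCE: translate the I/O bus address (ioba) to a table
 * index and store one entry. Returns 0 for H_SUCCESS, -1 on error. */
int h_put_tce(uint64_t ioba, uint64_t tce)
{
    uint64_t idx = ioba >> TCE_PAGE_SHIFT;
    if (idx >= TCE_TABLE_ENTRIES)
        return -1;
    tce_table[idx] = tce;
    return 0;
}

/* Model of H_PUT_TCE_INDIRECT: the guest supplies a list of up to 512
 * TCE values; each one goes through the single-entry path, but the
 * guest pays only one guest<->host transition for the whole batch. */
int h_put_tce_indirect(uint64_t ioba, const uint64_t *tce_list,
                       uint64_t npages)
{
    if (npages > 512)   /* per-call cap from the hcall description */
        return -1;
    for (uint64_t i = 0; i < npages; i++)
        if (h_put_tce(ioba + (i << TCE_PAGE_SHIFT), tce_list[i]) < 0)
            return -1;
    return 0;
}
```

The point of the batch form is visible in the loop: the per-entry work is identical to H_PUT_TCE, so the saving is purely the amortized real-mode entry/exit cost.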
Re: [PATCH 5/6] KVM: PPC: Add support for IOMMU in-kernel handling
On Mon, May 06, 2013 at 05:25:56PM +1000, Alexey Kardashevskiy wrote: This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT and H_STUFF_TCE requests without passing them to QEMU, which should save time on switching to QEMU and back. Both real and virtual modes are supported - whenever the kernel fails to handle a TCE request, it passes it to the virtual mode. If the virtual mode handlers fail, then the request is passed to the user mode, for example, to QEMU. This adds a new KVM_CAP_SPAPR_TCE_IOMMU ioctl to associate a virtual PCI bus ID (LIOBN) with an IOMMU group, which enables in-kernel handling of IOMMU map/unmap. This adds a special case for huge pages (16MB). The reference counting cannot be easily done for such pages in real mode (when MMU is off) so we added a list of huge pages. It is populated in virtual mode and get_page is called just once per huge page. Real mode handlers check if the requested page is huge and in the list, then no reference counting is done, otherwise an exit to virtual mode happens. The list is released at KVM exit. At the moment the fastest card available for tests uses up to 9 huge pages so walking through this list is not very expensive. However this can change and we may want to optimize this. This also adds the virt_only parameter to the KVM module for debug and performance check purposes. Tests show that this patch increases transmission speed from 220MB/s to 750..1020MB/s on 10Gb network (Chelsio CXGB3 10Gb ethernet card). 
Cc: David Gibson da...@gibson.dropbear.id.au Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru Signed-off-by: Paul Mackerras pau...@samba.org --- Documentation/virtual/kvm/api.txt | 28 arch/powerpc/include/asm/kvm_host.h |2 + arch/powerpc/include/asm/kvm_ppc.h |2 + arch/powerpc/include/uapi/asm/kvm.h |7 + arch/powerpc/kvm/book3s_64_vio.c| 242 ++- arch/powerpc/kvm/book3s_64_vio_hv.c | 192 +++ arch/powerpc/kvm/powerpc.c | 12 ++ include/uapi/linux/kvm.h|2 + 8 files changed, 485 insertions(+), 2 deletions(-) diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt index f621cd6..2039767 100644 --- a/Documentation/virtual/kvm/api.txt +++ b/Documentation/virtual/kvm/api.txt @@ -2127,6 +2127,34 @@ written, then `n_invalid' invalid entries, invalidating any previously valid entries found. +4.79 KVM_CREATE_SPAPR_TCE_IOMMU + +Capability: KVM_CAP_SPAPR_TCE_IOMMU +Architectures: powerpc +Type: vm ioctl +Parameters: struct kvm_create_spapr_tce_iommu (in) +Returns: 0 on success, -1 on error + +This creates a link between IOMMU group and a hardware TCE (translation +control entry) table. This link lets the host kernel know what IOMMU +group (i.e. TCE table) to use for the LIOBN number passed with +H_PUT_TCE, H_PUT_TCE_INDIRECT, H_STUFF_TCE hypercalls. + +/* for KVM_CAP_SPAPR_TCE_IOMMU */ +struct kvm_create_spapr_tce_iommu { + __u64 liobn; + __u32 iommu_id; Wouldn't it be more in keeping + __u32 flags; +}; + +No flag is supported at the moment. + +When the guest issues TCE call on a liobn for which a TCE table has been +registered, the kernel will handle it in real mode, updating the hardware +TCE table. TCE table calls for other liobns will cause a vm exit and must +be handled by userspace. + + 5. 
The kvm_run structure diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h index 36ceb0d..2b70cbc 100644 --- a/arch/powerpc/include/asm/kvm_host.h +++ b/arch/powerpc/include/asm/kvm_host.h @@ -178,6 +178,8 @@ struct kvmppc_spapr_tce_table { struct kvm *kvm; u64 liobn; u32 window_size; + bool virtmode_only; I see this is now initialized from the global parameter, but I think it would be better to just check the global (debug) parameter directly, rather than duplicating it here. + struct iommu_group *grp;/* used for IOMMU groups */ struct page *pages[0]; }; diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h index d501246..bdfa140 100644 --- a/arch/powerpc/include/asm/kvm_ppc.h +++ b/arch/powerpc/include/asm/kvm_ppc.h @@ -139,6 +139,8 @@ extern void kvmppc_xics_free(struct kvm *kvm); extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm, struct kvm_create_spapr_tce *args); +extern long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm, + struct kvm_create_spapr_tce_iommu *args); extern struct kvmppc_spapr_tce_table *kvmppc_find_tce_table( struct kvm_vcpu *vcpu, unsigned long liobn); extern long kvmppc_emulated_h_put_tce(struct kvmppc_spapr_tce_table *stt, diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
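[Editor's note: the huge-page scheme described in this thread - populate a list in virtual mode where get_page() can be called, then look pages up without reference counting in real mode - can be sketched as below. This is a simplified user-space model; struct hpage_entry and hpage_gpa_to_hpa are invented names for illustration, not the patch's actual identifiers.]

```c
#include <stddef.h>
#include <stdint.h>

#define HPAGE_SHIFT 24          /* 16MB huge pages */
#define HPAGE_MASK  ((1ULL << HPAGE_SHIFT) - 1)

/* Hypothetical per-VM list node: one entry per huge page, added in
 * virtual mode, where get_page() was already called exactly once. */
struct hpage_entry {
    uint64_t gpa;               /* guest physical base of the huge page */
    uint64_t hpa;               /* host real base */
    struct hpage_entry *next;
};

/* Real-mode-style lookup: walk the prebuilt list only; no refcounting
 * and no MMU-dependent calls. Returns the host real address, or 0 to
 * signal "not in the list, exit to the virtual-mode handler". */
uint64_t hpage_gpa_to_hpa(const struct hpage_entry *head, uint64_t gpa)
{
    uint64_t base = gpa & ~HPAGE_MASK;
    for (const struct hpage_entry *e = head; e; e = e->next)
        if (e->gpa == base)
            return e->hpa | (gpa & HPAGE_MASK);
    return 0;   /* fall back to virtual mode */
}
```

This also illustrates why the thread notes the list walk is cheap today: with ~9 huge pages per adapter it is a handful of pointer chases, but a much larger list would motivate a sorted array or hash instead.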
Re: KVM: PPC: Book3S: Add API for in-kernel XICS emulation
On Sat, Apr 27, 2013 at 10:28:37AM -, Paul Mackerras wrote: This adds the API for userspace to instantiate an XICS device in a VM and connect VCPUs to it. The API consists of a new device type for the KVM_CREATE_DEVICE ioctl, a new capability KVM_CAP_IRQ_XICS, which functions similarly to KVM_CAP_IRQ_MPIC, and the KVM_IRQ_LINE ioctl, which is used to assert and deassert interrupt inputs of the XICS. The XICS device has one attribute group, KVM_DEV_XICS_GRP_SOURCES. Each attribute within this group corresponds to the state of one interrupt source. The attribute number is the same as the interrupt source number. This does not support irq routing or irqfd yet. Signed-off-by: Paul Mackerras pau...@samba.org Acked-by: David Gibson da...@gibson.dropbear.id.au -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson
Re: Reset problem vs. MMIO emulation, hypercalls, etc...
On Wed, Aug 08, 2012 at 11:58:58AM +0300, Avi Kivity wrote: On 08/08/2012 03:49 AM, David Gibson wrote: We never have irqchip in kernel (because we haven't written that yet) but we still sleep in-kernel for CEDE. I haven't spotted any problem with that, but now I'm wondering if there is one, since x86 doesn't do it in what seems like the analogous situation. It's possible this works because our decrementer (timer) interrupts are different at the core level from external interrupts coming from the PIC, and *are* handled in kernel, but I haven't actually followed the logic to work out if this is the case. Meaning the normal state of things is to sleep in the kernel (whether or not you have an emulated interrupt controller in the kernel -- the term irqchip in kernel is overloaded for x86). Uh.. overloaded in what way. On x86, irqchip-in-kernel means that the local APICs, the IOAPIC, and the two PICs are emulated in the kernel. Now the IOAPIC and the PICs correspond to non-x86 interrupt controllers, but the local APIC is more tightly coupled to the core. Interrupt acceptance by the core is an operation that involves synchronous communication with the local APIC: the APIC presents the interrupt, the core accepts it based on the value of the interrupt enable flag and possibly a register (CR8), then the APIC updates the ISR and IRR. The upshot is that if the local APIC is in userspace, interrupts must be synchronous with vcpu execution, so that KVM_INTERRUPT is a vcpu ioctl and HLT is emulated in userspace (so that local APIC emulation can check if an interrupt wakes it up or not). Sorry, still not 100% getting it. When the vcpu is actually running code, that synchronous communication must still be accomplished via the KVM_INTERRUPT ioctl, yes? So what makes HLT different, that the communication can't be accomplished in that case. No, you're correct. HLT could have been emulated in userspace, it just wasn't. 
The correct statement is that HLT was arbitrarily chosen to be emulated in userspace with the synchronous model, but the asynchronous model forced it into the kernel. Aha! Ok, understood. Uh, assuming you meant kernelspace, not userspace in the first line there, anyway. Ok, so I am now reassured that our current handling of CEDE in kernelspace does not cause problems. -- David Gibson| I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Reset problem vs. MMIO emulation, hypercalls, etc...
On Tue, Aug 07, 2012 at 11:46:35AM +0300, Avi Kivity wrote: On 08/07/2012 04:32 AM, David Gibson wrote: On Tue, Aug 07, 2012 at 06:57:57AM +1000, Benjamin Herrenschmidt wrote: On Mon, 2012-08-06 at 13:13 +1000, David Gibson wrote: So, I'm still trying to nut out the implications for H_CEDE, and think if there are any other hypercalls that might want to block the guest for a time. We were considering blocking H_PUT_TCE if qemu devices had active dma maps on the previously mapped iovas. I'm not sure if the discussions that led to the inclusion of the qemu IOMMU code decided that was wholly unnecessary or just not necessary for the time being. For sleeping hcalls they will simply have to set exit_request to complete the hcall from the kernel perspective, leaving us in a state where the kernel is about to restart at srr0 + 4, along with some other flag (stop or halt) to actually freeze the vcpu. If such an async hcall decides to return an error, it can then set gpr3 directly using ioctls before restarting the vcpu. Yeah, I'd pretty much convinced myself of that by the end of yesterday. I hope to send patches implementing these fixes today. There are also some questions about why our in-kernel H_CEDE works kind of differently from x86's hlt instruction implementation (which comes out to qemu unless the irqchip is in-kernel as well). I don't think we have an urgent problem there though. It's the other way round, hlt sleeps in the kernel unless the irqchip is not in the kernel. That's the same as what I said. We never have irqchip in kernel (because we haven't written that yet) but we still sleep in-kernel for CEDE. I haven't spotted any problem with that, but now I'm wondering if there is one, since x86 doesn't do it in what seems like the analogous situation. 
It's possible this works because our decrementer (timer) interrupts are different at the core level from external interrupts coming from the PIC, and *are* handled in kernel, but I haven't actually followed the logic to work out if this is the case. Meaning the normal state of things is to sleep in the kernel (whether or not you have an emulated interrupt controller in the kernel -- the term irqchip in kernel is overloaded for x86). Uh.. overloaded in what way. -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson
Re: Reset problem vs. MMIO emulation, hypercalls, etc...
On Tue, Aug 07, 2012 at 04:13:49PM +0300, Avi Kivity wrote: On 08/07/2012 03:14 PM, David Gibson wrote: On Tue, Aug 07, 2012 at 11:46:35AM +0300, Avi Kivity wrote: On 08/07/2012 04:32 AM, David Gibson wrote: On Tue, Aug 07, 2012 at 06:57:57AM +1000, Benjamin Herrenschmidt wrote: On Mon, 2012-08-06 at 13:13 +1000, David Gibson wrote: So, I'm still trying to nut out the implications for H_CEDE, and think if there are any other hypercalls that might want to block the guest for a time. We were considering blocking H_PUT_TCE if qemu devices had active dma maps on the previously mapped iovas. I'm not sure if the discussions that led to the inclusion of the qemu IOMMU code decided that was wholly unnecessary or just not necessary for the time being. For sleeping hcalls they will simply have to set exit_request to complete the hcall from the kernel perspective, leaving us in a state where the kernel is about to restart at srr0 + 4, along with some other flag (stop or halt) to actually freeze the vcpu. If such an async hcall decides to return an error, it can then set gpr3 directly using ioctls before restarting the vcpu. Yeah, I'd pretty much convinced myself of that by the end of yesterday. I hope to send patches implementing these fixes today. There are also some questions about why our in-kernel H_CEDE works kind of differently from x86's hlt instruction implementation (which comes out to qemu unless the irqchip is in-kernel as well). I don't think we have an urgent problem there though. It's the other way round, hlt sleeps in the kernel unless the irqchip is not in the kernel. That's the same as what I said. I meant to stress that the normal way which other archs should emulate is sleep-in-kernel. Ok. We never have irqchip in kernel (because we haven't written that yet) but we still sleep in-kernel for CEDE. I haven't spotted any problem with that, but now I'm wondering if there is one, since x86 doesn't do it in what seems like the analogous situation. 
It's possible this works because our decrementer (timer) interrupts are different at the core level from external interrupts coming from the PIC, and *are* handled in kernel, but I haven't actually followed the logic to work out if this is the case. Meaning the normal state of things is to sleep in the kernel (whether or not you have an emulated interrupt controller in the kernel -- the term irqchip in kernel is overloaded for x86). Uh.. overloaded in what way. On x86, irqchip-in-kernel means that the local APICs, the IOAPIC, and the two PICs are emulated in the kernel. Now the IOAPIC and the PICs correspond to non-x86 interrupt controllers, but the local APIC is more tightly coupled to the core. Interrupt acceptance by the core is an operation that involves synchronous communication with the local APIC: the APIC presents the interrupt, the core accepts it based on the value of the interrupt enable flag and possibly a register (CR8), then the APIC updates the ISR and IRR. The upshot is that if the local APIC is in userspace, interrupts must be synchronous with vcpu execution, so that KVM_INTERRUPT is a vcpu ioctl and HLT is emulated in userspace (so that local APIC emulation can check if an interrupt wakes it up or not). Sorry, still not 100% getting it. When the vcpu is actually running code, that synchronous communication must still be accomplished via the KVM_INTERRUPT ioctl, yes? So what makes HLT different, that the communication can't be accomplished in that case. As soon as the local APIC is emulated in the kernel, HLT can be emulated there as well, and interrupts become asynchronous (KVM_IRQ_LINE, a vm ioctl). So irqchip_in_kernel, for most discussions, really means whether interrupt queuing is synchronous or asynchronous. It has nothing to do with the interrupt controllers per se. All non-x86 archs always have irqchip_in_kernel() in this sense. Peter has started to fix up this naming mess in qemu. 
I guess we should do the same for the kernel (except for ABIs) and document it, because it keeps generating confusion. Ok. -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson
Re: Reset problem vs. MMIO emulation, hypercalls, etc...
On Tue, Aug 07, 2012 at 06:57:57AM +1000, Benjamin Herrenschmidt wrote: On Mon, 2012-08-06 at 13:13 +1000, David Gibson wrote: So, I'm still trying to nut out the implications for H_CEDE, and think if there are any other hypercalls that might want to block the guest for a time. We were considering blocking H_PUT_TCE if qemu devices had active dma maps on the previously mapped iovas. I'm not sure if the discussions that led to the inclusion of the qemu IOMMU code decided that was wholly unnecessary or just not necessary for the time being. For sleeping hcalls they will simply have to set exit_request to complete the hcall from the kernel perspective, leaving us in a state where the kernel is about to restart at srr0 + 4, along with some other flag (stop or halt) to actually freeze the vcpu. If such an async hcall decides to return an error, it can then set gpr3 directly using ioctls before restarting the vcpu. Yeah, I'd pretty much convinced myself of that by the end of yesterday. I hope to send patches implementing these fixes today. There are also some questions about why our in-kernel H_CEDE works kind of differently from x86's hlt instruction implementation (which comes out to qemu unless the irqchip is in-kernel as well). I don't think we have an urgent problem there though. -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson
Re: [PATCH V2 4/6] kvm tools: Add PPC64 XICS interrupt controller support
On Tue, Dec 20, 2011 at 12:16:40PM +1100, Matt Evans wrote: Hi David, On 14/12/11 13:35, David Gibson wrote: On Tue, Dec 13, 2011 at 06:10:48PM +1100, Matt Evans wrote: This patch adds XICS emulation code (heavily borrowed from QEMU), and wires this into kvm_cpu__irq() to fire a CPU IRQ via KVM. A device tree entry is also added. IPIs work, xics_alloc_irqnum() is added to allocate an external IRQ (which will later be used by the PHB PCI code) and finally, kvm__irq_line() can be called to raise an IRQ on XICS. Hrm, looks like you took a somewhat old version of xics.c from qemu. It dangerously uses the same variable names for global irq numbers and numbers local to one ics unit. It used to have at least one bug caused by confusing the two, which I'm not sure if you've also copied. Just had a look at the diffs between this and hw/xics.c from the master branch in your qemu-impreza.git (which I based the kvmtool stuff on) and I can't see anything standing out. Is there a particular commit/patch/variable name you have in mind that I can search for? Sorry, my mistake, I was looking in the wrong place. -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson
Re: [PATCH V2 4/6] kvm tools: Add PPC64 XICS interrupt controller support
On Tue, Dec 13, 2011 at 06:10:48PM +1100, Matt Evans wrote: This patch adds XICS emulation code (heavily borrowed from QEMU), and wires this into kvm_cpu__irq() to fire a CPU IRQ via KVM. A device tree entry is also added. IPIs work, xics_alloc_irqnum() is added to allocate an external IRQ (which will later be used by the PHB PCI code) and finally, kvm__irq_line() can be called to raise an IRQ on XICS. Hrm, looks like you took a somewhat old version of xics.c from qemu. It dangerously uses the same variable names for global irq numbers and numbers local to one ics unit. It used to have at least one bug caused by confusing the two, which I'm not sure if you've also copied. -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson
Re: [PATCH V2 2/6] kvm tools: Add SPAPR PPC64 hcall rtascall structure
On Tue, Dec 13, 2011 at 06:10:46PM +1100, Matt Evans wrote: This patch adds the basic structure for HV calls, their registration and some of the simpler calls. A similar layout for RTAS calls is also added, again with some of the simpler RTAS calls used by the guest. The SPAPR RTAS stub is generated inline. Also, nodes for RTAS are added to the device tree. [snip] diff --git a/tools/kvm/powerpc/spapr.h b/tools/kvm/powerpc/spapr.h new file mode 100644 index 000..57cece1 --- /dev/null +++ b/tools/kvm/powerpc/spapr.h @@ -0,0 +1,105 @@
+/*
+ * SPAPR definitions and declarations
+ *
+ * Borrowed heavily from QEMU's spapr.h,
+ * Copyright (c) 2010 David Gibson, IBM Corporation.
So, most of the content of this file in qemu I in turn took from arch/powerpc/include/asm/hvcall.h in the kernel tree. You might be better off using that directly. [snip]
+static target_ulong h_logical_icbi(struct kvm_cpu *vcpu, target_ulong opcode, target_ulong *args)
+{
+	/* Nothing to do on emulation, KVM will trap this in the kernel */
+	return H_SUCCESS;
hcalls that need to be handled by the host kernel should probably dump an error here, rather than silently doing nothing, since if the host kernel does handle them they should never reach userspace at all.
+}
+
+static target_ulong h_logical_dcbf(struct kvm_cpu *vcpu, target_ulong opcode, target_ulong *args)
+{
+	/* Nothing to do on emulation, KVM will trap this in the kernel */
+	return H_SUCCESS;
+}
+
+void spapr_register_hypercall(target_ulong opcode, spapr_hcall_fn fn)
+{
+	spapr_hcall_fn *slot;
+
+	if (opcode <= MAX_HCALL_OPCODE) {
+		assert((opcode & 0x3) == 0);
+
+		slot = &papr_hypercall_table[opcode / 4];
+	} else {
+		assert((opcode >= KVMPPC_HCALL_BASE) &&
+		       (opcode <= KVMPPC_HCALL_MAX));
+
+		slot = &kvmppc_hypercall_table[opcode - KVMPPC_HCALL_BASE];
+	}
+
+	assert(!(*slot) || (fn == *slot));
+	*slot = fn;
+}
+
+target_ulong spapr_hypercall(struct kvm_cpu *vcpu, target_ulong opcode,
+			     target_ulong *args)
+{
+	if ((opcode <= MAX_HCALL_OPCODE) &&
+	    ((opcode & 0x3) == 0)) {
+		spapr_hcall_fn fn = papr_hypercall_table[opcode / 4];
+
+		if (fn) {
+			return fn(vcpu, opcode, args);
+		}
+	} else if ((opcode >= KVMPPC_HCALL_BASE) &&
+		   (opcode <= KVMPPC_HCALL_MAX)) {
+		spapr_hcall_fn fn = kvmppc_hypercall_table[opcode -
+							   KVMPPC_HCALL_BASE];
+
+		if (fn) {
+			return fn(vcpu, opcode, args);
+		}
+	}
+
+	hcall_dprintf("Unimplemented hcall 0x%lx\n", opcode);
+	return H_FUNCTION;
+}
+
+void hypercall_init(void)
+{
+	/* hcall-dabr */
+	spapr_register_hypercall(H_SET_DABR, h_set_dabr);
+
+	spapr_register_hypercall(H_LOGICAL_CI_LOAD, h_logical_load);
+	spapr_register_hypercall(H_LOGICAL_CI_STORE, h_logical_store);
+	spapr_register_hypercall(H_LOGICAL_CACHE_LOAD, h_logical_load);
+	spapr_register_hypercall(H_LOGICAL_CACHE_STORE, h_logical_store);
+	spapr_register_hypercall(H_LOGICAL_ICBI, h_logical_icbi);
+	spapr_register_hypercall(H_LOGICAL_DCBF, h_logical_dcbf);
+
+	/* KVM-PPC specific hcalls */
+	spapr_register_hypercall(KVMPPC_H_RTAS, h_rtas);
+}
diff --git a/tools/kvm/powerpc/spapr_rtas.c b/tools/kvm/powerpc/spapr_rtas.c new file mode 100644 index 000..72c6b02 --- /dev/null +++ b/tools/kvm/powerpc/spapr_rtas.c @@ -0,0 +1,230 @@
+/*
+ * SPAPR base RTAS calls
+ *
+ * Borrowed heavily from QEMU's spapr_rtas.c
+ * Copyright (c) 2010-2011 David Gibson, IBM Corporation.
+ *
+ * Modifications copyright 2011 Matt Evans m...@ozlabs.org, IBM Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published
+ * by the Free Software Foundation.
+ */
+
+#include "kvm/kvm.h"
+#include "kvm/kvm-cpu.h"
+#include "kvm/util.h"
+#include "kvm/term.h"
+
+#include "spapr.h"
+
+#include <stdio.h>
+#include <assert.h>
+#include <libfdt.h>
+
+#define TOKEN_BASE	0x2000
+#define TOKEN_MAX	0x100
+
+#define RTAS_CONSOLE
+
+static struct rtas_call {
+	const char *name;
+	spapr_rtas_fn fn;
+} rtas_table[TOKEN_MAX];
+
+struct rtas_call *rtas_next = rtas_table;
+
+static void rtas_display_character(struct kvm_cpu *vcpu,
+				   uint32_t token, uint32_t nargs,
+				   target_ulong args,
+				   uint32_t nret, target_ulong rets)
+{
+	char c = rtas_ld(vcpu->kvm, args, 0);
+	term_putc(CONSOLE_HV, &c, 1, 0);
+	rtas_st(vcpu->kvm, rets, 0, 0);
+}
+
+#ifdef RTAS_CONSOLE
+static void rtas_put_term_char(struct
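[Editor's note: the two-table registration/dispatch pattern in the patch above - direct indexing by opcode/4 for architected PAPR opcodes, offset indexing for the private KVMPPC range - can be modeled in self-contained C. The bounds and names below are simplified stand-ins, not the real MAX_HCALL_OPCODE / KVMPPC_HCALL_BASE values.]

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t target_ulong;
typedef target_ulong (*hcall_fn)(target_ulong opcode);

/* Illustrative bounds standing in for MAX_HCALL_OPCODE and the
 * KVMPPC_HCALL_BASE..KVMPPC_HCALL_MAX private range. */
#define MAX_OPCODE  0x2c
#define PRIV_BASE   0xf000
#define PRIV_MAX    0xf002
#define H_SUCCESS   0
#define H_FUNCTION  ((target_ulong)-2)

static hcall_fn papr_table[MAX_OPCODE / 4 + 1];
static hcall_fn priv_table[PRIV_MAX - PRIV_BASE + 1];

void register_hcall(target_ulong opcode, hcall_fn fn)
{
    if (opcode <= MAX_OPCODE) {
        assert((opcode & 0x3) == 0);   /* PAPR opcodes are multiples of 4 */
        papr_table[opcode / 4] = fn;
    } else {
        assert(opcode >= PRIV_BASE && opcode <= PRIV_MAX);
        priv_table[opcode - PRIV_BASE] = fn;
    }
}

target_ulong do_hcall(target_ulong opcode)
{
    hcall_fn fn = NULL;

    if (opcode <= MAX_OPCODE && (opcode & 0x3) == 0)
        fn = papr_table[opcode / 4];
    else if (opcode >= PRIV_BASE && opcode <= PRIV_MAX)
        fn = priv_table[opcode - PRIV_BASE];

    return fn ? fn(opcode) : H_FUNCTION;   /* unknown hcall */
}

static target_ulong h_noop(target_ulong opcode)
{
    (void)opcode;
    return H_SUCCESS;
}
```

The design point is that both lookups are O(1) array indexes, which matters because this dispatch sits on the hot guest-exit path; a registration step at init time trades a little memory for that.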
Re: [PATCH V2 1/6] kvm tools: Generate SPAPR PPC64 guest device tree
On Tue, Dec 13, 2011 at 06:10:45PM +1100, Matt Evans wrote: The generated DT is the bare minimum structure required for SPAPR (on which subsequent patches for VIO, XICS, PCI etc. will build); root node, cpus, memory. Some aspects are currently hardwired for simplicity, for example advertised page sizes, HPT size, SLB size, VMX/DFP, etc. Future support of a variety of POWER CPUs should acquire this info from the host and encode appropriately. This requires a 64-bit libfdt. There's already a copy of libfdt embedded in the kernel tree (scripts/dtc/libfdt), which you should be able to use to build one of these as you go. Signed-off-by: Matt Evans m...@ozlabs.org --- tools/kvm/Makefile | 3 +- tools/kvm/powerpc/include/kvm/kvm-arch.h | 10 ++ tools/kvm/powerpc/kvm.c | 141 ++ 3 files changed, 153 insertions(+), 1 deletions(-)
diff --git a/tools/kvm/Makefile b/tools/kvm/Makefile index 5bb3f08..4ee4805 100644 --- a/tools/kvm/Makefile +++ b/tools/kvm/Makefile
@@ -132,7 +132,8 @@ ifeq ($(uname_M), ppc64)
 	OBJS	+= powerpc/kvm.o
 	OBJS	+= powerpc/kvm-cpu.o
 	ARCH_INCLUDE := powerpc/include
-	CFLAGS	+= -m64
+	CFLAGS	+= -m64
+	LIBS	+= -lfdt
 endif
 ###
diff --git a/tools/kvm/powerpc/include/kvm/kvm-arch.h b/tools/kvm/powerpc/include/kvm/kvm-arch.h index da61774..33a3827 100644 --- a/tools/kvm/powerpc/include/kvm/kvm-arch.h +++ b/tools/kvm/powerpc/include/kvm/kvm-arch.h
@@ -69,4 +69,14 @@ struct kvm {
 	const char *name;
 };
 
+/* Helper for the various bits of code that generate FDT nodes */
+#define _FDT(exp)							\
+	do {								\
+		int ret = (exp);					\
+		if (ret < 0) {						\
+			die("Error creating device tree: %s: %s\n",	\
+			    #exp, fdt_strerror(ret));			\
+		}							\
+	} while (0)
+
 #endif /* KVM__KVM_ARCH_H */
diff --git a/tools/kvm/powerpc/kvm.c b/tools/kvm/powerpc/kvm.c index f838a8f..95ed1cc 100644 --- a/tools/kvm/powerpc/kvm.c +++ b/tools/kvm/powerpc/kvm.c
@@ -3,6 +3,9 @@
  *
  * Copyright 2011 Matt Evans m...@ozlabs.org, IBM Corporation.
+ *
+ * Portions of FDT setup borrowed from QEMU, copyright 2010 David Gibson, IBM
+ * Corporation.
  *
  * This program is free software; you can redistribute it and/or modify it
  * under the terms of the GNU General Public License version 2 as published
  * by the Free Software Foundation.
@@ -29,6 +32,8 @@
 #include <linux/byteorder.h>
 #include <libfdt.h>
 
+#define HPT_ORDER 24
+
 #define HUGETLBFS_PATH "/var/lib/hugetlbfs/global/pagesize-16MB/"
 
 static char kern_cmdline[2048];
@@ -168,9 +173,145 @@ bool load_bzimage(struct kvm *kvm, int fd_kernel,
 	return false;
 }
 
+#define SMT_THREADS 4
+
+static uint32_t mfpvr(void)
+{
+	uint32_t r;
+	asm volatile ("mfpvr %0" : "=r"(r));
+	return r;
+}
+
 static void setup_fdt(struct kvm *kvm)
 {
+	uint64_t	mem_reg_property[] = { 0, cpu_to_be64(kvm->ram_size) };
+	int		smp_cpus = kvm->nrcpus;
+	char		hypertas_prop_kvm[] = "hcall-pft\0hcall-term\0"
+		"hcall-dabr\0hcall-interrupt\0hcall-tce\0hcall-vio\0"
+		"hcall-splpar\0hcall-bulk";
+	int		i, j;
+	char		cpu_name[30];
+	u8		staging_fdt[FDT_MAX_SIZE];
+	uint32_t	pvr = mfpvr();
+
+	/* Generate an appropriate DT at kvm->fdt_gra */
+	void *fdt_dest = guest_flat_to_host(kvm, kvm->fdt_gra);
+	void *fdt = staging_fdt;
+
+	_FDT(fdt_create(fdt, FDT_MAX_SIZE));
+	_FDT(fdt_finish_reservemap(fdt));
+
+	_FDT(fdt_begin_node(fdt, ""));
+
+	_FDT(fdt_property_string(fdt, "device_type", "chrp"));
+	_FDT(fdt_property_string(fdt, "model", "IBM pSeries (kvmtool)"));
+	_FDT(fdt_property_cell(fdt, "#address-cells", 0x2));
+	_FDT(fdt_property_cell(fdt, "#size-cells", 0x2));
+
+	/* /chosen */
+	_FDT(fdt_begin_node(fdt, "chosen"));
+	/* cmdline */
+	_FDT(fdt_property_string(fdt, "bootargs", kern_cmdline));
+	/* Initrd */
+	if (kvm->initrd_size != 0) {
+		uint32_t ird_st_prop = cpu_to_be32(kvm->initrd_gra);
+		uint32_t ird_end_prop = cpu_to_be32(kvm->initrd_gra +
+						    kvm->initrd_size);
+		_FDT(fdt_property(fdt, "linux,initrd-start",
+				  &ird_st_prop, sizeof(ird_st_prop)));
+		_FDT(fdt_property(fdt, "linux,initrd-end",
+				  &ird_end_prop, sizeof(ird_end_prop)));
+	}
+
+	_FDT(fdt_end_node(fdt));
+
+	/*
+	 * Memory: We
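[Editor's note: the _FDT() wrapper in the patch above is an evaluate-and-die-on-negative pattern around libfdt's sequential-write calls. A self-contained sketch follows, with die() and fdt_strerror() replaced by recording stubs (the stub names are invented here) so the behavior can be exercised outside kvmtool.]

```c
#include <stdio.h>
#include <string.h>

/* Stand-in for libfdt's fdt_strerror(); the real one maps all
 * FDT_ERR_* codes to strings. */
static const char *fdt_strerror_stub(int err)
{
    return err == -1 ? "FDT_ERR_NOTFOUND" : "unknown error";
}

static char last_error[128];

/* Model of kvmtool's die(): record the message instead of exiting,
 * so the macro's error path can be observed in a test. */
static void die_stub(const char *expr, const char *msg)
{
    snprintf(last_error, sizeof(last_error),
             "Error creating device tree: %s: %s", expr, msg);
}

/* The _FDT() pattern: evaluate an fdt call once, and bail out with the
 * stringified failing expression if it returns a negative error code. */
#define _FDT(exp)                                       \
    do {                                                \
        int ret = (exp);                                \
        if (ret < 0)                                    \
            die_stub(#exp, fdt_strerror_stub(ret));     \
    } while (0)

static int fake_fdt_ok(void)   { return 0; }
static int fake_fdt_fail(void) { return -1; }
```

The `#exp` stringification is the useful trick: on failure the error message names the exact libfdt call that failed, which is why the macro wraps every `fdt_*` call individually in `setup_fdt()`.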
Re: [PATCH 07/10] KVM: PPC: Add PAPR hypercall code for PR mode
On Fri, Aug 12, 2011 at 07:38:54AM +0200, Alexander Graf wrote: On 12.08.2011 at 05:35, David Gibson da...@gibson.dropbear.id.au wrote: On Tue, Aug 09, 2011 at 06:31:45PM +0200, Alexander Graf wrote: When running a PAPR guest, we need to handle a few hypercalls in kernel space, most prominently the page table invalidation (to sync the shadows). So this patch adds handling for a few PAPR hypercalls to PR mode KVM. I tried to share the code with HV mode, but it ended up being a lot easier this way around, as the two differ too much in those details. Are these strictly necessary, or just an optimization? Because you're using the space allocated by qemu for the guest hash table, it seems to me you could just let h_enter fall through to qemu which will put the right thing into the guest hash table which you can then walk in the kernel translation code. Every time a PTE can be invalidated, we need to do so in kvm to keep the SPT in sync. IIRC h_enter can evict/overwrite a previous entry, so we need to handle it in kvm as well :). Removal definitely needs to happen in-kernel. True. I think you could actually delay this invalidation until the guest issues the tlbie, but it's probably not worth it. -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson
Re: [PATCH 09/10] KVM: PPC: Support SC1 hypercalls for PAPR in PR mode
On Fri, Aug 12, 2011 at 07:35:42AM +0200, Alexander Graf wrote: On 12.08.2011 at 05:33, David Gibson da...@gibson.dropbear.id.au wrote: On Tue, Aug 09, 2011 at 06:31:47PM +0200, Alexander Graf wrote: PAPR defines hypercalls as SC1 instructions. Using these, the guest modifies page tables and does other privileged operations that it wouldn't be allowed to do in supervisor mode. This patch adds support for PR KVM to trap these instructions and route them through the same PAPR hypercall interface that we already use for HV style KVM. This will work on a powermac or bare metal host. Unfortunately, it's not enough on a pSeries LPAR host - the sc 1 instruction from the guest problem state will go direct to the hypervisor, which will return an error rather than trapping to the guest kernel. The only way around this I can see is for qemu to search for and patch up sc 1 instructions to something else. Obviously that would also need some kernel support, and probably a capability to let it know if it's necessary. Well I'd like to keep Qemu out of the patching business, so the guest kernel would have to patch itself. Well sure, but guest patching itself means it can't run existing kernels. I thought qemu already patched a few things, ugly though that approach is. But yes, PHyP guests can't run this target yet :). I'll take a stab at that too, but one continent at a time! ;) -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson
Re: [PATCH 09/10] KVM: PPC: Support SC1 hypercalls for PAPR in PR mode
On Tue, Aug 09, 2011 at 06:31:47PM +0200, Alexander Graf wrote:
> PAPR defines hypercalls as SC1 instructions.  Using these, the guest
> modifies page tables and does other privileged operations that it
> wouldn't be allowed to do in supervisor mode.
>
> This patch adds support for PR KVM to trap these instructions and route
> them through the same PAPR hypercall interface that we already use for
> HV style KVM.

This will work on a powermac or bare metal host.  Unfortunately, it's not
enough on a pSeries LPAR host - the sc 1 instruction from the guest
problem state will go direct to the hypervisor, which will return an error
rather than trapping to the guest kernel.

The only way around this I can see is for qemu to search for and patch up
sc 1 instructions to something else.  Obviously that would also need some
kernel support, and probably a capability to let it know if it's
necessary.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson
Re: [PATCH 07/10] KVM: PPC: Add PAPR hypercall code for PR mode
On Tue, Aug 09, 2011 at 06:31:45PM +0200, Alexander Graf wrote:
> When running a PAPR guest, we need to handle a few hypercalls in kernel
> space, most prominently the page table invalidation (to sync the
> shadows).
>
> So this patch adds handling for a few PAPR hypercalls to PR mode KVM.  I
> tried to share the code with HV mode, but it ended up being a lot easier
> this way around, as the two differ too much in those details.

Are these strictly necessary, or just an optimization?  Because you're
using the space allocated by qemu for the guest hash table, it seems to me
you could just let h_enter fall through to qemu, which will put the right
thing into the guest hash table, which you can then walk in the kernel
translation code.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson
Re: [PATCH] [v2] powerpc/4xx: work around CHIP11 errata in a more PAGE_SIZE-friendly way
On Tue, Nov 11, 2008 at 07:52:18AM -0500, Josh Boyer wrote:
> On Mon, Nov 10, 2008 at 05:24:34PM -0600, Hollis Blanchard wrote:
> > The current CHIP11 errata truncates the device tree memory node, and
> > subtracts (hardcoded) 4096 bytes.  This breaks kernels with larger
> > PAGE_SIZE, since the bootmem allocator assumes that total memory is a
> > multiple of PAGE_SIZE.
> >
> > Instead, use a device tree memory reservation to reserve only the 256
> > bytes actually affected by the errata, leaving the total memory size
> > unaltered.
> >
> > Signed-off-by: Hollis Blanchard [EMAIL PROTECTED]
> > ---
> > Using large pages results in a huge performance improvement for KVM,
> > and this patch is required to make Ilya's large page patch work.
> > David and/or Josh, please apply.
>
> The patch looks fine to me, and once David acks the fdt parts I'll apply
> for -next.  I'll try to do some testing later today as well, since I
> have one of the boards with the Errata.

Um.. I sent something about the fdt stuff a while back, but didn't get a
response.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson
Re: [PATCH] [v3] powerpc/4xx: work around CHIP11 errata in a more PAGE_SIZE-friendly way
On Tue, Nov 11, 2008 at 06:06:46PM -0600, Hollis Blanchard wrote:
> The current CHIP11 errata truncates the device tree memory node, and
> subtracts (hardcoded) 4096 bytes.  This breaks kernels with larger
> PAGE_SIZE, since the bootmem allocator assumes that total memory is a
> multiple of PAGE_SIZE.
>
> Instead, use a device tree memory reservation to reserve only the 256
> bytes actually affected by the errata, leaving the total memory size
> unaltered.
>
> Signed-off-by: Hollis Blanchard [EMAIL PROTECTED]

libfdt usage changes look fine to me.

Acked-by: David Gibson [EMAIL PROTECTED]

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson
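For readers unfamiliar with the mechanism being acked here: a device tree
memory reservation marks a range as off-limits to the kernel's allocators
without shrinking the memory node.  In device tree source form it looks
like the sketch below; the address shown is a placeholder, not the actual
erratum address from Hollis's patch.

```dts
/* Hypothetical example: reserve the 256 bytes affected by the CHIP11
 * erratum instead of truncating the memory node.  The address 0xfffff00
 * is a placeholder for illustration only. */
/memreserve/ 0x0fffff00 0x100;

/ {
	/* ... rest of the tree unchanged ... */
};
```

In a flattened tree the same effect is achieved programmatically with
libfdt's fdt_add_mem_rsv(fdt, address, size), which is the sort of usage
the ack above refers to.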