On 07/10/2013 08:05 PM, Alexander Graf wrote:
>
> On 10.07.2013, at 07:00, Alexey Kardashevskiy wrote:
>
>> On 07/10/2013 03:02 AM, Alexander Graf wrote:
>>> On 07/06/2013 05:07 PM, Alexey Kardashevskiy wrote:
>>>> This adds real mode handlers for the H_PUT_TCE_INDIRECT and
>>>> H_STUFF_TCE hypercalls for QEMU emulated devices such as IBMVIO
>>>> devices or emulated PCI. These calls allow adding multiple entries
>>>> (up to 512) into the TCE table in one call which saves time on
>>>> transition to/from real mode.
>>>
>>> We don't mention QEMU explicitly in KVM code usually.
>>>
>>>> This adds a tce_tmp cache to kvm_vcpu_arch to save valid TCEs
>>>> (copied from user and verified) before writing the whole list into
>>>> the TCE table. This cache will be utilized more in the upcoming
>>>> VFIO/IOMMU support to continue TCE list processing in the virtual
>>>> mode in the case if the real mode handler failed for some reason.
>>>>
>>>> This adds a guest physical to host real address converter
>>>> and calls the existing H_PUT_TCE handler. The converting function
>>>> is going to be fully utilized by upcoming VFIO supporting patches.
>>>>
>>>> This also implements the KVM_CAP_PPC_MULTITCE capability,
>>>> so in order to support the functionality of this patch, QEMU
>>>> needs to query for this capability and set the "hcall-multi-tce"
>>>> hypertas property only if the capability is present, otherwise
>>>> there will be serious performance degradation.
>>>
>>> Same as above. But really you're only giving recommendations here. What's
>>> the point? Please describe what the benefit of this patch is, not what some
>>> other random subsystem might do with the benefits it brings.
>>>
>>>>
>>>> Signed-off-by: Paul Mackerras<[email protected]>
>>>> Signed-off-by: Alexey Kardashevskiy<[email protected]>
>>>>
>>>> ---
>>>> Changelog:
>>>> 2013/07/06:
>>>> * fixed number of wrong get_page()/put_page() calls
>>>>
>>>> 2013/06/27:
>>>> * fixed clear of BUSY bit in kvmppc_lookup_pte()
>>>> * H_PUT_TCE_INDIRECT does realmode_get_page() now
>>>> * KVM_CAP_SPAPR_MULTITCE now depends on CONFIG_PPC_BOOK3S_64
>>>> * updated doc
>>>>
>>>> 2013/06/05:
>>>> * fixed mistype about IBMVIO in the commit message
>>>> * updated doc and moved it to another section
>>>> * changed capability number
>>>>
>>>> 2013/05/21:
>>>> * added kvm_vcpu_arch::tce_tmp
>>>> * removed cleanup if put_indirect failed, instead we do not even start
>>>> writing to TCE table if we cannot get TCEs from the user and they are
>>>> invalid
>>>> * kvmppc_emulated_h_put_tce is split to kvmppc_emulated_put_tce
>>>> and kvmppc_emulated_validate_tce (for the previous item)
>>>> * fixed bug with failthrough for H_IPI
>>>> * removed all get_user() from real mode handlers
>>>> * kvmppc_lookup_pte() added (instead of making lookup_linux_pte public)
>>>>
>>>> Signed-off-by: Alexey Kardashevskiy<[email protected]>
>>>> ---
>>>> Documentation/virtual/kvm/api.txt | 25 +++
>>>> arch/powerpc/include/asm/kvm_host.h | 9 ++
>>>> arch/powerpc/include/asm/kvm_ppc.h | 16 +-
>>>> arch/powerpc/kvm/book3s_64_vio.c | 154 ++++++++++++++++++-
>>>> arch/powerpc/kvm/book3s_64_vio_hv.c | 260
>>>> ++++++++++++++++++++++++++++----
>>>> arch/powerpc/kvm/book3s_hv.c | 41 ++++-
>>>> arch/powerpc/kvm/book3s_hv_rmhandlers.S | 6 +
>>>> arch/powerpc/kvm/book3s_pr_papr.c | 37 ++++-
>>>> arch/powerpc/kvm/powerpc.c | 3 +
>>>> 9 files changed, 517 insertions(+), 34 deletions(-)
>>>>
>>>> diff --git a/Documentation/virtual/kvm/api.txt
>>>> b/Documentation/virtual/kvm/api.txt
>>>> index 6365fef..762c703 100644
>>>> --- a/Documentation/virtual/kvm/api.txt
>>>> +++ b/Documentation/virtual/kvm/api.txt
>>>> @@ -2362,6 +2362,31 @@ calls by the guest for that service will be passed
>>>> to userspace to be
>>>> handled.
>>>>
>>>>
>>>> +4.86 KVM_CAP_PPC_MULTITCE
>>>> +
>>>> +Capability: KVM_CAP_PPC_MULTITCE
>>>> +Architectures: ppc
>>>> +Type: vm
>>>> +
>>>> +This capability means the kernel is capable of handling hypercalls
>>>> +H_PUT_TCE_INDIRECT and H_STUFF_TCE without passing those into the user
>>>> +space. This significanly accelerates DMA operations for PPC KVM guests.
>>>
>>> significanly? Please run this through a spell checker.
>>>
>>>> +The user space should expect that its handlers for these hypercalls
>>>
>>> s/The//
>>>
>>>> +are not going to be called.
>>>
>>> Is user space guaranteed they will not be called? Or can it still happen?
>>
>> ... if user space previously registered LIOBN in KVM (via
>> KVM_CREATE_SPAPR_TCE or similar calls).
>>
>> ok?
>
> How about this?
>
> The hypercalls mentioned above may or may not be processed successfully in
> the kernel based fast path. If they can not be handled by the kernel, they
> will get passed on to user space. So user space still has to have an
> implementation for these despite the in kernel acceleration.
>
> ---
>
> The target audience for this documentation is user space KVM API users.
> Someone developing kvm tool for example. They want to know implications
> specific CAPs have.
>
>>
>> There is also KVM_CREATE_SPAPR_TCE_IOMMU but it is not in the kernel yet
>> and may never get there.
>>
>>
>>>> +In order to enable H_PUT_TCE_INDIRECT and H_STUFF_TCE use in the guest,
>>>> +the user space might have to advertise it for the guest. For example,
>>>> +IBM pSeries guest starts using them if "hcall-multi-tce" is present in
>>>> +the "ibm,hypertas-functions" device-tree property.
>>>
>>> This paragraph describes sPAPR. That's fine, but please document it as
>>> such. Also please check your grammar.
>>
>>>> +
>>>> +Without this capability, only H_PUT_TCE is handled by the kernel and
>>>> +therefore the use of H_PUT_TCE_INDIRECT and H_STUFF_TCE is not recommended
>>>> +unless the capability is present as passing hypercalls to the userspace
>>>> +slows operations a lot.
>>>> +
>>>> +Unlike other capabilities of this section, this one is always enabled.
>>>
>>> Why? Wouldn't that confuse older user space?
>>
>>
>> How? Old user space won't check for this capability and won't tell the
>> guest to use it (via "hcall-multi-tce"). Old H_PUT_TCE is still there.
>>
>> If the guest always uses H_PUT_TCE_INDIRECT/H_STUFF_TCE no matter what,
>> then it is its problem - it won't work now anyway as neither QEMU nor host
>> kernel supports these calls.
> Always assume that you are a kernel developer without knowledge
> of any user space code using your interfaces. So there is the theoretical
> possibility that there is a user space client out there that implements
> H_PUT_TCE_INDIRECT and advertises hcall-multi-tce to the guest.
> Would that client break? If so, we should definitely have
> the CAP disabled by default.
No, it won't break. Why would it break? I really do not get it. This user
space client has to do an extra step to get this acceleration by calling
ioctl(KVM_CREATE_SPAPR_TCE) anyway. Previously that ioctl only had effect
on H_PUT_TCE, now on all three hcalls.
> But really, it's also as much about consistency as anything else.
> If we leave everything as is and always extend functionality
> by enabling new CAPs, we're pretty much guaranteed that we
> don't break anything by accident. It also makes debugging easier
> because you can for example disable this particular feature
> to see whether something has bad side effects.
So I must add one more ioctl to enable MULTITCE in kernel handling. Is it
what you are saying?
I can see KVM_CHECK_EXTENSION but I do not see KVM_ENABLE_EXTENSION or
anything like that.
>>>> +
>>>> +
>>>> 5. The kvm_run structure
>>>> ------------------------
>>>>
>>>> diff --git a/arch/powerpc/include/asm/kvm_host.h
>>>> b/arch/powerpc/include/asm/kvm_host.h
>>>> index af326cd..20d04bd 100644
>>>> --- a/arch/powerpc/include/asm/kvm_host.h
>>>> +++ b/arch/powerpc/include/asm/kvm_host.h
>>>> @@ -180,6 +180,7 @@ struct kvmppc_spapr_tce_table {
>>>> struct kvm *kvm;
>>>> u64 liobn;
>>>> u32 window_size;
>>>> + struct { struct { unsigned long put, indir, stuff; } rm, vm; } stat;
>>>
>>> You don't need this.
>>>
>>>> struct page *pages[0];
>>>> };
>>>>
>>>> @@ -609,6 +610,14 @@ struct kvm_vcpu_arch {
>>>> spinlock_t tbacct_lock;
>>>> u64 busy_stolen;
>>>> u64 busy_preempt;
>>>> +
>>>> + unsigned long *tce_tmp_hpas; /* TCE cache for TCE_PUT_INDIRECT
>>>> hcall */
>>>> + enum {
>>>> + TCERM_NONE,
>>>> + TCERM_GETPAGE,
>>>> + TCERM_PUTTCE,
>>>> + TCERM_PUTLIST,
>>>> + } tce_rm_fail; /* failed stage of request processing */
>>>> #endif
>>>> };
>>>>
>>>> diff --git a/arch/powerpc/include/asm/kvm_ppc.h
>>>> b/arch/powerpc/include/asm/kvm_ppc.h
>>>> index a5287fe..fa722a0 100644
>>>> --- a/arch/powerpc/include/asm/kvm_ppc.h
>>>> +++ b/arch/powerpc/include/asm/kvm_ppc.h
>>>> @@ -133,8 +133,20 @@ extern int kvmppc_pseries_do_hcall(struct kvm_vcpu
>>>> *vcpu);
>>>>
>>>> extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>>>> struct kvm_create_spapr_tce *args);
>>>> -extern long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>>>> - unsigned long ioba, unsigned long tce);
>>>> +extern struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(
>>>> + struct kvm_vcpu *vcpu, unsigned long liobn);
>>>> +extern long kvmppc_emulated_validate_tce(unsigned long tce);
>>>> +extern void kvmppc_emulated_put_tce(struct kvmppc_spapr_tce_table *tt,
>>>> + unsigned long ioba, unsigned long tce);
>>>> +extern long kvmppc_vm_h_put_tce(struct kvm_vcpu *vcpu,
>>>> + unsigned long liobn, unsigned long ioba,
>>>> + unsigned long tce);
>>>> +extern long kvmppc_vm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>>>> + unsigned long liobn, unsigned long ioba,
>>>> + unsigned long tce_list, unsigned long npages);
>>>> +extern long kvmppc_vm_h_stuff_tce(struct kvm_vcpu *vcpu,
>>>> + unsigned long liobn, unsigned long ioba,
>>>> + unsigned long tce_value, unsigned long npages);
>>>> extern long kvm_vm_ioctl_allocate_rma(struct kvm *kvm,
>>>> struct kvm_allocate_rma *rma);
>>>> extern struct kvmppc_linear_info *kvm_alloc_rma(void);
>>>> diff --git a/arch/powerpc/kvm/book3s_64_vio.c
>>>> b/arch/powerpc/kvm/book3s_64_vio.c
>>>> index b2d3f3b..99bf4e5 100644
>>>> --- a/arch/powerpc/kvm/book3s_64_vio.c
>>>> +++ b/arch/powerpc/kvm/book3s_64_vio.c
>>>> @@ -14,6 +14,7 @@
>>>> *
>>>> * Copyright 2010 Paul Mackerras, IBM Corp.<[email protected]>
>>>> * Copyright 2011 David Gibson, IBM Corporation<[email protected]>
>>>> + * Copyright 2013 Alexey Kardashevskiy, IBM Corporation<[email protected]>
>>>> */
>>>>
>>>> #include<linux/types.h>
>>>> @@ -36,8 +37,10 @@
>>>> #include<asm/ppc-opcode.h>
>>>> #include<asm/kvm_host.h>
>>>> #include<asm/udbg.h>
>>>> +#include<asm/iommu.h>
>>>> +#include<asm/tce.h>
>>>>
>>>> -#define TCES_PER_PAGE (PAGE_SIZE / sizeof(u64))
>>>> +#define ERROR_ADDR ((void *)~(unsigned long)0x0)
>>>>
>>>> static long kvmppc_stt_npages(unsigned long window_size)
>>>> {
>>>> @@ -50,6 +53,20 @@ static void release_spapr_tce_table(struct
>>>> kvmppc_spapr_tce_table *stt)
>>>> struct kvm *kvm = stt->kvm;
>>>> int i;
>>>>
>>>> +#define __SV(x) stt->stat.x
>>>> +#define __SVD(x) (__SV(rm.x)?(__SV(rm.x)-__SV(vm.x)):0)
>>>> + pr_debug("%s stat for liobn=%llx\n"
>>>> + "--------------- realmode ----- virtmode ---\n"
>>>> + "put_tce %10ld %10ld\n"
>>>> + "put_tce_indir %10ld %10ld\n"
>>>> + "stuff_tce %10ld %10ld\n",
>>>> + __func__, stt->liobn,
>>>> + __SVD(put), __SV(vm.put),
>>>> + __SVD(indir), __SV(vm.indir),
>>>> + __SVD(stuff), __SV(vm.stuff));
>>>> +#undef __SVD
>>>> +#undef __SV
>>>
>>> All of these stat points should just be trace points. You can do the
>>> statistic gathering from user space then.
>>>
>>>> +
>>>> mutex_lock(&kvm->lock);
>>>> list_del(&stt->list);
>>>> for (i = 0; i< kvmppc_stt_npages(stt->window_size); i++)
>>>> @@ -148,3 +165,138 @@ fail:
>>>> }
>>>> return ret;
>>>> }
>>>> +
>>>> +/* Converts guest physical address to host virtual address */
>>>> +static void __user *kvmppc_vm_gpa_to_hva_and_get(struct kvm_vcpu *vcpu,
>>>
>>> Please don't distinguish _vm versions. They're the normal case. _rm ones
>>> are the special ones.
>>>
>>>> + unsigned long gpa, struct page **pg)
>>>> +{
>>>> + unsigned long hva, gfn = gpa>> PAGE_SHIFT;
>>>> + struct kvm_memory_slot *memslot;
>>>> +
>>>> + memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn);
>>>> + if (!memslot)
>>>> + return ERROR_ADDR;
>>>> +
>>>> + hva = __gfn_to_hva_memslot(memslot, gfn) + (gpa& ~PAGE_MASK);
>>>
>>> s/+/|/
>>>
>>>> +
>>>> + if (get_user_pages_fast(hva& PAGE_MASK, 1, 0, pg) != 1)
>>>> + return ERROR_ADDR;
>>>> +
>>>> + return (void *) hva;
>>>> +}
>>>> +
>>>> +long kvmppc_vm_h_put_tce(struct kvm_vcpu *vcpu,
>>>> + unsigned long liobn, unsigned long ioba,
>>>> + unsigned long tce)
>>>> +{
>>>> + long ret;
>>>> + struct kvmppc_spapr_tce_table *tt;
>>>> +
>>>> + tt = kvmppc_find_tce_table(vcpu, liobn);
>>>> + /* Didn't find the liobn, put it to userspace */
>>>
>>> Unclear comment.
>>
>>
>> What detail is missing?
>
> Grammar wise "it" in the second half of the sentence refers to liobn.
> So you "put" the "liobn to userspace". That sentence doesn't
> make any sense.
Removed it. H_TOO_HARD itself says enough already.
> What you really want to say is:
>
> /* Couldn't find the liobn. Something went wrong. Let user space handle the
> hypercall. That has better ways of dealing with errors. */
>
>>
>>
>>>> + if (!tt)
>>>> + return H_TOO_HARD;
>>>> +
>>>> + ++tt->stat.vm.put;
>>>> +
>>>> + if (ioba>= tt->window_size)
>>>> + return H_PARAMETER;
>>>> +
>>>> + ret = kvmppc_emulated_validate_tce(tce);
>>>> + if (ret)
>>>> + return ret;
>>>> +
>>>> + kvmppc_emulated_put_tce(tt, ioba, tce);
>>>> +
>>>> + return H_SUCCESS;
>>>> +}
>>>> +
>>>> +long kvmppc_vm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>>>> + unsigned long liobn, unsigned long ioba,
>>>> + unsigned long tce_list, unsigned long npages)
>>>> +{
>>>> + struct kvmppc_spapr_tce_table *tt;
>>>> + long i, ret = H_SUCCESS;
>>>> + unsigned long __user *tces;
>>>> + struct page *pg = NULL;
>>>> +
>>>> + tt = kvmppc_find_tce_table(vcpu, liobn);
>>>> + /* Didn't find the liobn, put it to userspace */
>>>> + if (!tt)
>>>> + return H_TOO_HARD;
>>>> +
>>>> + ++tt->stat.vm.indir;
>>>> +
>>>> + /*
>>>> + * The spec says that the maximum size of the list is 512 TCEs so
>>>> + * so the whole table addressed resides in 4K page
>>>
>>> so so?
>>>
>>>> + */
>>>> + if (npages> 512)
>>>> + return H_PARAMETER;
>>>> +
>>>> + if (tce_list& ~IOMMU_PAGE_MASK)
>>>> + return H_PARAMETER;
>>>> +
>>>> + if ((ioba + (npages<< IOMMU_PAGE_SHIFT))> tt->window_size)
>>>> + return H_PARAMETER;
>>>> +
>>>> + tces = kvmppc_vm_gpa_to_hva_and_get(vcpu, tce_list,&pg);
>>>> + if (tces == ERROR_ADDR)
>>>> + return H_TOO_HARD;
>>>> +
>>>> + if (vcpu->arch.tce_rm_fail == TCERM_PUTLIST)
>>>> + goto put_list_page_exit;
>>>> +
>>>> + for (i = 0; i< npages; ++i) {
>>>> + if (get_user(vcpu->arch.tce_tmp_hpas[i], tces + i)) {
>>>> + ret = H_PARAMETER;
>>>> + goto put_list_page_exit;
>>>> + }
>>>> +
>>>> + ret = kvmppc_emulated_validate_tce(vcpu->arch.tce_tmp_hpas[i]);
>>>> + if (ret)
>>>> + goto put_list_page_exit;
>>>> + }
>>>> +
>>>> + for (i = 0; i< npages; ++i)
>>>> + kvmppc_emulated_put_tce(tt, ioba + (i<< IOMMU_PAGE_SHIFT),
>>>> + vcpu->arch.tce_tmp_hpas[i]);
>>>> +put_list_page_exit:
>>>> + if (pg)
>>>> + put_page(pg);
>>>> +
>>>> + if (vcpu->arch.tce_rm_fail != TCERM_NONE) {
>>>> + vcpu->arch.tce_rm_fail = TCERM_NONE;
>>>> + if (pg&& !PageCompound(pg))
>>>> + put_page(pg); /* finish pending realmode_put_page() */
>>>> + }
>>>> +
>>>> + return ret;
>>>> +}
>>>> +
>>>> +long kvmppc_vm_h_stuff_tce(struct kvm_vcpu *vcpu,
>>>> + unsigned long liobn, unsigned long ioba,
>>>> + unsigned long tce_value, unsigned long npages)
>>>> +{
>>>> + struct kvmppc_spapr_tce_table *tt;
>>>> + long i, ret;
>>>> +
>>>> + tt = kvmppc_find_tce_table(vcpu, liobn);
>>>> + /* Didn't find the liobn, put it to userspace */
>>>> + if (!tt)
>>>> + return H_TOO_HARD;
>>>> +
>>>> + ++tt->stat.vm.stuff;
>>>> +
>>>> + if ((ioba + (npages<< IOMMU_PAGE_SHIFT))> tt->window_size)
>>>> + return H_PARAMETER;
>>>> +
>>>> + ret = kvmppc_emulated_validate_tce(tce_value);
>>>> + if (ret || (tce_value& (TCE_PCI_WRITE | TCE_PCI_READ)))
>>>> + return H_PARAMETER;
>>>> +
>>>> + for (i = 0; i< npages; ++i, ioba += IOMMU_PAGE_SIZE)
>>>> + kvmppc_emulated_put_tce(tt, ioba, tce_value);
>>>> +
>>>> + return H_SUCCESS;
>>>> +}
>>>> diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c
>>>> b/arch/powerpc/kvm/book3s_64_vio_hv.c
>>>> index 30c2f3b..cd3e6f9 100644
>>>> --- a/arch/powerpc/kvm/book3s_64_vio_hv.c
>>>> +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
>>>> @@ -14,6 +14,7 @@
>>>> *
>>>> * Copyright 2010 Paul Mackerras, IBM Corp.<[email protected]>
>>>> * Copyright 2011 David Gibson, IBM Corporation<[email protected]>
>>>> + * Copyright 2013 Alexey Kardashevskiy, IBM Corporation<[email protected]>
>>>> */
>>>>
>>>> #include<linux/types.h>
>>>> @@ -35,42 +36,243 @@
>>>> #include<asm/ppc-opcode.h>
>>>> #include<asm/kvm_host.h>
>>>> #include<asm/udbg.h>
>>>> +#include<asm/iommu.h>
>>>> +#include<asm/tce.h>
>>>>
>>>> #define TCES_PER_PAGE (PAGE_SIZE / sizeof(u64))
>>>> +#define ERROR_ADDR (~(unsigned long)0x0)
>>>>
>>>> -/* WARNING: This will be called in real-mode on HV KVM and virtual
>>>> - * mode on PR KVM
>>>
>>> What's wrong with the warning?
>>
>>
>> It belongs to kvmppc_h_put_tce() which is not called in virtual mode anymore.
>
> I thought the comment applied to the whole file before? Hrm. Maybe I misread
> it then.
>
>> It is technically correct for kvmppc_find_tce_table() though. Should I put
>> this comment before every function which may be called from real and
>> virtual modes?
>
> Yes, please. Otherwise someone might stick an access to a non-linear address
> in there by accident.
>
>>
>>
>>
>>>> +/*
>>>> + * Finds a TCE table descriptor by LIOBN
>>>> */
>>>> +struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(struct kvm_vcpu
>>>> *vcpu,
>>>> + unsigned long liobn)
>>>> +{
>>>> + struct kvmppc_spapr_tce_table *tt;
>>>> +
>>>> + list_for_each_entry(tt,&vcpu->kvm->arch.spapr_tce_tables, list) {
>>>> + if (tt->liobn == liobn)
>>>> + return tt;
>>>> + }
>>>> +
>>>> + return NULL;
>>>> +}
>>>> +EXPORT_SYMBOL_GPL(kvmppc_find_tce_table);
>>>> +
>>>> +#ifdef DEBUG
>>>> +/*
>>>> + * Lets user mode disable realmode handlers by putting big number
>>>> + * in the bottom value of LIOBN
>>>
>>> What? Seriously? Just don't enable the CAP.
>>
>>
>> It is under DEBUG. It really, really helps to be able to disable real mode
>> handlers without reboot. Ok, no debug code, I'll remove.
>
> Debug code is good, but #ifdefs are bad. For you, an #ifdef reads like
> "code that doesn't do any hard when disabled". For me, #ifdefs read
> "code that definitely breaks because nobody turns the #define on".
>
> So please, avoid #ifdef'ed code whenever possible. Switching the CAP on and
> off is a much better debug approach in this case.
>
>>
>>
>>>> + */
>>>> +#define kvmppc_find_tce_table(a, b) \
>>>> + ((((b)&0xffff)>10000)?NULL:kvmppc_find_tce_table((a), (b)))
>>>> +#endif
>>>> +
>>>> +/*
>>>> + * Validates TCE address.
>>>> + * At the moment only flags are validated as other checks will
>>>> significantly slow
>>>> + * down or can make it even impossible to handle TCE requests in real
>>>> mode.
>>>
>>> What?
>>
>>
>> What is missing here (besides good english)?
>
> What badness could slip through by not validating everything?
I cannot think of any good check which could be done in real mode and not
be "more than 2 calls deep" (c) Ben. Check that the page is allocated at
all? How? Don't know.
>>>> + */
>>>> +long kvmppc_emulated_validate_tce(unsigned long tce)
>>>
>>> I don't like the naming scheme. Please turn this around and make it
>>> kvmppc_tce_validate().
>>
>>
>> Oh. "Like"... Ok.
>
> Yes. Like.
>
>>
>>
>>>> +{
>>>> + if (tce& ~(IOMMU_PAGE_MASK | TCE_PCI_WRITE | TCE_PCI_READ))
>>>> + return H_PARAMETER;
>>>> +
>>>> + return H_SUCCESS;
>>>> +}
>>>> +EXPORT_SYMBOL_GPL(kvmppc_emulated_validate_tce);
>>>> +
>>>> +/*
>>>> + * Handles TCE requests for QEMU emulated devices.
>>>
>>> We still don't mention QEMU in KVM code. And does it really matter whether
>>> they're emulated by QEMU? Devices could also be emulated by KVM.
>>>
>>>> + * Puts guest TCE values to the table and expects QEMU to convert them
>>>> + * later in a QEMU device implementation.
>>>> + * Called in both real and virtual modes.
>>>> + * Cannot fail so kvmppc_emulated_validate_tce must be called before it.
>>>> + */
>>>> +void kvmppc_emulated_put_tce(struct kvmppc_spapr_tce_table *tt,
>>>
>>> kvmppc_tce_put()
>>>
>>>> + unsigned long ioba, unsigned long tce)
>>>> +{
>>>> + unsigned long idx = ioba>> SPAPR_TCE_SHIFT;
>>>> + struct page *page;
>>>> + u64 *tbl;
>>>> +
>>>> + /*
>>>> + * Note on the use of page_address() in real mode,
>>>> + *
>>>> + * It is safe to use page_address() in real mode on ppc64 because
>>>> + * page_address() is always defined as lowmem_page_address()
>>>> + * which returns __va(PFN_PHYS(page_to_pfn(page))) which is
>>>> arithmetial
>>>> + * operation and does not access page struct.
>>>> + *
>>>> + * Theoretically page_address() could be defined different
>>>> + * but either WANT_PAGE_VIRTUAL or HASHED_PAGE_VIRTUAL
>>>> + * should be enabled.
>>>> + * WANT_PAGE_VIRTUAL is never enabled on ppc32/ppc64,
>>>> + * HASHED_PAGE_VIRTUAL could be enabled for ppc32 only and only
>>>> + * if CONFIG_HIGHMEM is defined. As CONFIG_SPARSEMEM_VMEMMAP
>>>> + * is not expected to be enabled on ppc32, page_address()
>>>> + * is safe for ppc32 as well.
>>>> + */
>>>> +#if defined(HASHED_PAGE_VIRTUAL) || defined(WANT_PAGE_VIRTUAL)
>>>> +#error TODO: fix to avoid page_address() here
>>>> +#endif
>>>
>>> Can you extract the text above, the check and the page_address call into a
>>> simple wrapper function?
>>
>>
>> Is this function also too big? Sorry, I do not understand the comment.
>
> All of the comment and #if here only deal with the fact that you
> have a real mode hack to call page_address() that happens
> to work under specific circumstances.
>
> There's nothing kvmppc_tce_put() specific about this.
> The page_address() code happens to get called here, sure.
> But if I read the kvmppc_tce_put() function I don't care about
> these details - I want to understand the code flow that ends
> up writing the TCE.
>
>>>> + page = tt->pages[idx / TCES_PER_PAGE];
>>>> + tbl = (u64 *)page_address(page);
>>>> +
>>>> + /* udbg_printf("tce @ %p\n",&tbl[idx % TCES_PER_PAGE]); */
>>>
>>> This is not an RFC, is it?
>>
>>
>> Any debug code is prohibited? Ok, I'll remove.
>
> Debug code that requires code changes is prohibited, yes.
> Debug code that is runtime switchable (pr_debug, trace points, etc)
> are allowed.
Is there any easy way to enable just this specific udbg_printf (not all of
them at once)? Trace points do not work in real mode as we figured out.
>>>> + tbl[idx % TCES_PER_PAGE] = tce;
>>>> +}
>>>> +EXPORT_SYMBOL_GPL(kvmppc_emulated_put_tce);
>>>> +
>>>> +#ifdef CONFIG_KVM_BOOK3S_64_HV
>>>> +/*
>>>> + * Converts guest physical address to host physical address.
>>>> + * Tries to increase page counter via realmode_get_page() and
>>>> + * returns ERROR_ADDR if failed.
>>>> + */
>>>> +static unsigned long kvmppc_rm_gpa_to_hpa_and_get(struct kvm_vcpu *vcpu,
>>>> + unsigned long gpa, struct page **pg)
>>>> +{
>>>> + struct kvm_memory_slot *memslot;
>>>> + pte_t *ptep, pte;
>>>> + unsigned long hva, hpa = ERROR_ADDR;
>>>> + unsigned long gfn = gpa>> PAGE_SHIFT;
>>>> + unsigned shift = 0;
>>>> +
>>>> + memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn);
>>>> + if (!memslot)
>>>> + return ERROR_ADDR;
>>>> +
>>>> + hva = __gfn_to_hva_memslot(memslot, gfn);
>>>> +
>>>> + ptep = find_linux_pte_or_hugepte(vcpu->arch.pgdir, hva,&shift);
>>>> + if (!ptep || !pte_present(*ptep))
>>>> + return ERROR_ADDR;
>>>> + pte = *ptep;
>>>> +
>>>> + if (((gpa& TCE_PCI_WRITE) || pte_write(pte))&& !pte_dirty(pte))
>>>> + return ERROR_ADDR;
>>>> +
>>>> + if (!pte_young(pte))
>>>> + return ERROR_ADDR;
>>>> +
>>>> + if (!shift)
>>>> + shift = PAGE_SHIFT;
>>>> +
>>>> + /* Put huge pages handling to the virtual mode */
>>>> + if (shift> PAGE_SHIFT)
>>>> + return ERROR_ADDR;
>>>> +
>>>> + *pg = realmode_pfn_to_page(pte_pfn(pte));
>>>> + if (!*pg || realmode_get_page(*pg))
>>>> + return ERROR_ADDR;
>>>> +
>>>> + /* pte_pfn(pte) returns address aligned to pg_size */
>>>> + hpa = (pte_pfn(pte)<< PAGE_SHIFT) + (gpa& ((1<< shift) - 1));
>>>> +
>>>> + if (unlikely(pte_val(pte) != pte_val(*ptep))) {
>>>> + hpa = ERROR_ADDR;
>>>> + realmode_put_page(*pg);
>>>> + *pg = NULL;
>>>> + }
>>>> +
>>>> + return hpa;
>>>> +}
>>>> +
>>>> long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>>>> unsigned long ioba, unsigned long tce)
>>>> {
>>>> - struct kvm *kvm = vcpu->kvm;
>>>> - struct kvmppc_spapr_tce_table *stt;
>>>> -
>>>> - /* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
>>>> - /* liobn, ioba, tce); */
>>>> -
>>>> - list_for_each_entry(stt,&kvm->arch.spapr_tce_tables, list) {
>>>> - if (stt->liobn == liobn) {
>>>> - unsigned long idx = ioba>> SPAPR_TCE_SHIFT;
>>>> - struct page *page;
>>>> - u64 *tbl;
>>>> -
>>>> - /* udbg_printf("H_PUT_TCE: liobn 0x%lx => stt=%p
>>>> window_size=0x%x\n", */
>>>> - /* liobn, stt, stt->window_size); */
>>>> - if (ioba>= stt->window_size)
>>>> - return H_PARAMETER;
>>>> -
>>>> - page = stt->pages[idx / TCES_PER_PAGE];
>>>> - tbl = (u64 *)page_address(page);
>>>> -
>>>> - /* FIXME: Need to validate the TCE itself */
>>>> - /* udbg_printf("tce @ %p\n",&tbl[idx % TCES_PER_PAGE]); */
>>>> - tbl[idx % TCES_PER_PAGE] = tce;
>>>> - return H_SUCCESS;
>>>> - }
>>>> + long ret;
>>>> + struct kvmppc_spapr_tce_table *tt = kvmppc_find_tce_table(vcpu,
>>>> liobn);
>>>> +
>>>> + if (!tt)
>>>> + return H_TOO_HARD;
>>>> +
>>>> + ++tt->stat.rm.put;
>>>> +
>>>> + if (ioba>= tt->window_size)
>>>> + return H_PARAMETER;
>>>> +
>>>> + ret = kvmppc_emulated_validate_tce(tce);
>>>> + if (!ret)
>>>> + kvmppc_emulated_put_tce(tt, ioba, tce);
>>>> +
>>>> + return ret;
>>>> +}
>>>> +
>>>> +long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>>>
>>> So the _vm version is the normal one and this is the _rm version? If so,
>>> please mark it as such. Is there any way to generate both from the same
>>> source? The way it's now there is a lot of duplicate code.
>>
>>
>> I tried, looked very ugly. If you insist, I will do so.
>
> If it looks ugly better don't. I just want to make sure you explored the
> option.
> But please keep the naming scheme consistent.
Removed _vm everywhere and put _rm in realmode handlers. I just was
confused by _vm in kvm_vm_ioctl_create_spapr_tce() at the first place.
--
Alexey
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [email protected]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/