Re: [RFC/RFT PATCH 1/3] memblock: update initialization of reserved pages

2021-04-15 Thread David Hildenbrand

Not sure we really need a new pagetype here; PG_Reserved seems to be quite
enough to say "don't touch this".  I generally agree that we could make
PG_Reserved a PageType and then have several sub-types for reserved memory.
This would definitely add clarity, but I'm not sure it justifies the
amount of churn and effort required to audit uses of PageReserved().
  

Then, we could mostly avoid having to query memblock at runtime to figure
out that this is special memory. This would obviously be an extension to
this series. Just a thought.


Stop pushing memblock out of kernel! ;-)


Can't stop. Won't stop. :D

It's lovely for booting up a kernel until we have other data-structures 
in place ;)



--
Thanks,

David / dhildenb



Re: [RFC/RFT PATCH 2/3] arm64: decouple check whether pfn is normal memory from pfn_valid()

2021-04-15 Thread David Hildenbrand

On 14.04.21 22:29, Mike Rapoport wrote:

On Wed, Apr 14, 2021 at 05:58:26PM +0200, David Hildenbrand wrote:

On 08.04.21 07:14, Anshuman Khandual wrote:


On 4/7/21 10:56 PM, Mike Rapoport wrote:

From: Mike Rapoport 

The intended semantics of pfn_valid() is to verify whether there is a
struct page for the pfn in question and nothing else.


Should there be a comment affirming this interpretation of the semantics, above
the generic pfn_valid() in include/linux/mmzone.h?



Yet, on arm64 it is used to distinguish memory areas that are mapped in the
linear map vs those that require ioremap() to access them.

Introduce a dedicated pfn_is_memory() to perform such a check and use it
where appropriate.

Signed-off-by: Mike Rapoport 
---
   arch/arm64/include/asm/memory.h | 2 +-
   arch/arm64/include/asm/page.h   | 1 +
   arch/arm64/kvm/mmu.c| 2 +-
   arch/arm64/mm/init.c| 6 ++
   arch/arm64/mm/ioremap.c | 4 ++--
   arch/arm64/mm/mmu.c | 2 +-
   6 files changed, 12 insertions(+), 5 deletions(-)

diff --git a/arch/arm64/include/asm/memory.h b/arch/arm64/include/asm/memory.h
index 0aabc3be9a75..7e77fdf71b9d 100644
--- a/arch/arm64/include/asm/memory.h
+++ b/arch/arm64/include/asm/memory.h
@@ -351,7 +351,7 @@ static inline void *phys_to_virt(phys_addr_t x)
   #define virt_addr_valid(addr)({  
\
__typeof__(addr) __addr = __tag_reset(addr);\
-   __is_lm_address(__addr) && pfn_valid(virt_to_pfn(__addr));  \
+   __is_lm_address(__addr) && pfn_is_memory(virt_to_pfn(__addr));  \
   })
   void dump_mem_limit(void);
diff --git a/arch/arm64/include/asm/page.h b/arch/arm64/include/asm/page.h
index 012cffc574e8..32b485bcc6ff 100644
--- a/arch/arm64/include/asm/page.h
+++ b/arch/arm64/include/asm/page.h
@@ -38,6 +38,7 @@ void copy_highpage(struct page *to, struct page *from);
   typedef struct page *pgtable_t;
   extern int pfn_valid(unsigned long);
+extern int pfn_is_memory(unsigned long);
   #include 
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 8711894db8c2..ad2ea65a3937 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -85,7 +85,7 @@ void kvm_flush_remote_tlbs(struct kvm *kvm)
   static bool kvm_is_device_pfn(unsigned long pfn)
   {
-   return !pfn_valid(pfn);
+   return !pfn_is_memory(pfn);
   }
   /*
diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 3685e12aba9b..258b1905ed4a 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -258,6 +258,12 @@ int pfn_valid(unsigned long pfn)
   }
   EXPORT_SYMBOL(pfn_valid);
+int pfn_is_memory(unsigned long pfn)
+{
+   return memblock_is_map_memory(PFN_PHYS(pfn));
+}
+EXPORT_SYMBOL(pfn_is_memory);
+


Should this not be generic, though? There is nothing platform- or arm64-
specific in here. Wondering, as pfn_is_memory() just indicates that the
pfn is linearly mapped, should it not be renamed pfn_is_linear_memory()
instead? Regardless, it's fine either way.


TBH, I dislike (generic) pfn_is_memory(). It feels like we're mixing
concepts.


Yeah, at the moment NOMAP is very much arm specific so I'd keep it this way
for now.


NOMAP memory vs !NOMAP memory; even NOMAP is some kind of memory
after all. pfn_is_map_memory() would be more expressive, although still
sub-optimal.

We'd actually want some kind of arm64-specific pfn_is_system_memory() or the
inverse pfn_is_device_memory() -- to be improved.


In my current version (to be posted soon) I started with
pfn_linearly_mapped() but then ended up with pfn_mapped() to make it
"upward" compatible with architectures that use a direct rather than a linear
map :)


And even that is moot. It doesn't tell you if a PFN is *actually* mapped 
(hello secretmem).


I'd suggest to just use memblock_is_map_memory() in arch specific code. 
Then it's clear what we are querying exactly and what the semantics 
might be.
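
For illustration (an assumption about what such a change could look like, not
a patch from this thread), the arm64 check would then collapse to a direct
memblock query:

/* Illustrative sketch only -- not from this series. */
#include <linux/memblock.h>
#include <linux/pfn.h>

static bool kvm_is_device_pfn(unsigned long pfn)
{
	/* Anything memblock does not map linearly is treated as device memory. */
	return !memblock_is_map_memory(PFN_PHYS(pfn));
}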


--
Thanks,

David / dhildenb



Re: [PATCH v3 2/2] kvm/arm64: Try stage2 block mapping for host device MMIO

2021-04-15 Thread Marc Zyngier
On Thu, 15 Apr 2021 03:20:52 +0100,
Keqian Zhu  wrote:
> 
> Hi Marc,
> 
> On 2021/4/14 17:05, Marc Zyngier wrote:
> > + Santosh, who found some interesting bugs in that area before.
> > 
> > On Wed, 14 Apr 2021 07:51:09 +0100,
> > Keqian Zhu  wrote:
> >>
> >> The MMIO region of a device may be huge (GB level), so try to use
> >> block mapping in stage2 to speed up both map and unmap.
> >>
> >> Compared to normal memory mapping, we should consider two more
> >> points when trying block mapping for an MMIO region:
> >>
> >> 1. For normal memory mapping, the PA (host physical address) and
> >> HVA have the same alignment within PUD_SIZE or PMD_SIZE when we use
> >> the HVA to request a hugepage, so we don't need to consider PA
> >> alignment when verifying block mapping. But for device memory
> >> mapping, the PA and HVA may have different alignment.
> >>
> >> 2. For normal memory mapping, we are sure the hugepage size properly
> >> fits into the vma, so we don't check whether the mapping size exceeds
> >> the boundary of the vma. But for device memory mapping, we should pay
> >> attention to this.
> >>
> >> This adds device_rough_page_shift() to check these two points when
> >> selecting block mapping size.
> >>
> >> Signed-off-by: Keqian Zhu 
> >> ---
> >>  arch/arm64/kvm/mmu.c | 37 +
> >>  1 file changed, 33 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> >> index c59af5ca01b0..1a6d96169d60 100644
> >> --- a/arch/arm64/kvm/mmu.c
> >> +++ b/arch/arm64/kvm/mmu.c
> >> @@ -624,6 +624,31 @@ static void kvm_send_hwpoison_signal(unsigned long 
> >> address, short lsb)
> >>send_sig_mceerr(BUS_MCEERR_AR, (void __user *)address, lsb, current);
> >>  }
> >>  
> >> +/*
> >> + * Find a max mapping size that properly insides the vma. And hva and pa 
> >> must
> >> + * have the same alignment to this mapping size. It's rough as there are 
> >> still
> >> + * other restrictions, will be checked by 
> >> fault_supports_stage2_huge_mapping().
> >> + */
> >> +static short device_rough_page_shift(struct vm_area_struct *vma,
> >> +   unsigned long hva)
> > 
> > My earlier question still stands. Under which circumstances would this
> > function return something that is *not* the final mapping size? I
> > really don't see a reason why this would not return the final mapping
> > size.
> 
> IIUC, all the restrictions are about alignment and area boundary.
> 
> That is to say, HVA, IPA and PA must have the same alignment within the
> mapping size.  And the areas are the memslot and the vma, which means the
> mapping size must properly fit into the memslot and the vma.
>
> In this function, we only check the alignment of HVA and PA, and
> the boundary of the vma.  So we still need to check the alignment of HVA
> and IPA, and the boundary of the memslot.  These will be checked by
> fault_supports_stage2_huge_mapping().

But that's no different from what we do with normal memory, is it? So
it really feels like we should have *one* function that deals with
establishing the basic mapping size from the VMA (see below for what I
have in mind).
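
A condensed sketch of that single-helper shape, drawn from the
get_vma_page_shift() helper that appears in Keqian's v4 posting further down
this digest (an illustration, not Marc's verbatim proposal):

/* Condensed from the v4 get_vma_page_shift() in arch/arm64/kvm/mmu.c. */
static int get_vma_page_shift(struct vm_area_struct *vma, unsigned long hva)
{
	unsigned long pa;

	/* hugetlbfs VMAs already encode their mapping size */
	if (is_vm_hugetlb_page(vma) && !(vma->vm_flags & VM_PFNMAP))
		return huge_page_shift(hstate_vma(vma));

	if (!(vma->vm_flags & VM_PFNMAP))
		return PAGE_SHIFT;

	/* PFNMAP: PA and HVA must be equally aligned, and the block must fit the VMA */
	pa = (vma->vm_pgoff << PAGE_SHIFT) + (hva - vma->vm_start);

	if ((hva & (PMD_SIZE - 1)) == (pa & (PMD_SIZE - 1)) &&
	    ALIGN_DOWN(hva, PMD_SIZE) >= vma->vm_start &&
	    ALIGN(hva, PMD_SIZE) <= vma->vm_end)
		return PMD_SHIFT;

	/* the PUD-level check in the real patch is analogous */
	return PAGE_SHIFT;
}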

> 
> > 
> >> +{
> >> +  phys_addr_t pa = (vma->vm_pgoff << PAGE_SHIFT) + (hva - vma->vm_start);
> >> +
> >> +#ifndef __PAGETABLE_PMD_FOLDED
> >> +  if ((hva & (PUD_SIZE - 1)) == (pa & (PUD_SIZE - 1)) &&
> >> +  ALIGN_DOWN(hva, PUD_SIZE) >= vma->vm_start &&
> >> +  ALIGN(hva, PUD_SIZE) <= vma->vm_end)
> >> +  return PUD_SHIFT;
> >> +#endif
> >> +
> >> +  if ((hva & (PMD_SIZE - 1)) == (pa & (PMD_SIZE - 1)) &&
> >> +  ALIGN_DOWN(hva, PMD_SIZE) >= vma->vm_start &&
> >> +  ALIGN(hva, PMD_SIZE) <= vma->vm_end)
> >> +  return PMD_SHIFT;
> >> +
> >> +  return PAGE_SHIFT;
> >> +}
> >> +
> >>  static bool fault_supports_stage2_huge_mapping(struct kvm_memory_slot 
> >> *memslot,
> >>   unsigned long hva,
> >>   unsigned long map_size)
> >> @@ -769,7 +794,10 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
> >> phys_addr_t fault_ipa,
> >>return -EFAULT;
> >>}
> >>  
> >> -  /* Let's check if we will get back a huge page backed by hugetlbfs */
> >> +  /*
> >> +   * Let's check if we will get back a huge page backed by hugetlbfs, or
> >> +   * get block mapping for device MMIO region.
> >> +   */
> >>mmap_read_lock(current->mm);
> >>vma = find_vma_intersection(current->mm, hva, hva + 1);
> >>if (unlikely(!vma)) {
> >> @@ -780,11 +808,12 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
> >> phys_addr_t fault_ipa,
> >>  
> >>if (is_vm_hugetlb_page(vma))
> >>vma_shift = huge_page_shift(hstate_vma(vma));
> >> +  else if (vma->vm_flags & VM_PFNMAP)
> >> +  vma_shift = device_rough_page_shift(vma, hva);
> > 
> > What prevents a VMA from having both VM_HUGETLB and VM_PFNMAP? This is
> > pretty unlikely, but I'd like to see this case catered for.
> > 
> I'm not sure whether VM_HUGETLB an

Re: [PATCH 1/5] KVM: arm64: Divorce the perf code from oprofile helpers

2021-04-15 Thread Marc Zyngier
On Thu, 15 Apr 2021 07:59:26 +0100,
Keqian Zhu  wrote:
> 
> Hi Marc,
> 
> On 2021/4/14 21:44, Marc Zyngier wrote:
> > KVM/arm64 is the sole user of perf_num_counters(), and really
> > could do without it. Stop using the obsolete API by relying on
> > the existing probing code.
> > 
> > Signed-off-by: Marc Zyngier 
> > ---
> >  arch/arm64/kvm/perf.c | 7 +--
> >  arch/arm64/kvm/pmu-emul.c | 2 +-
> >  include/kvm/arm_pmu.h | 4 
> >  3 files changed, 6 insertions(+), 7 deletions(-)
> > 
> > diff --git a/arch/arm64/kvm/perf.c b/arch/arm64/kvm/perf.c
> > index 739164324afe..b8b398670ef2 100644
> > --- a/arch/arm64/kvm/perf.c
> > +++ b/arch/arm64/kvm/perf.c
> > @@ -50,12 +50,7 @@ static struct perf_guest_info_callbacks kvm_guest_cbs = {
> >  
> >  int kvm_perf_init(void)
> >  {
> > -   /*
> > -* Check if HW_PERF_EVENTS are supported by checking the number of
> > -* hardware performance counters. This could ensure the presence of
> > -* a physical PMU and CONFIG_PERF_EVENT is selected.
> > -*/
> > -   if (IS_ENABLED(CONFIG_ARM_PMU) && perf_num_counters() > 0)
> > +   if (kvm_pmu_probe_pmuver() != 0xf)
> The probe() function may be called many times
> (kvm_arm_pmu_v3_set_attr also calls it).  I don't know whether the
> first call is enough. If so, can we use a static variable in it,
> so that subsequent calls can return the result right away?

No, because that wouldn't help with crappy big-little implementations
that could have PMUs with different versions. We want to find the
version at the point where the virtual PMU is created, which is why we
call the probe function once per vcpu.

This of course is broken in other ways (BL+KVM is a total disaster
when it comes to PMU), but making this static would just make it
worse.

M.

-- 
Without deviation from the norm, progress is not possible.


Re: [PATCH 3/5] s390: Get rid of oprofile leftovers

2021-04-15 Thread Marc Zyngier
On Thu, 15 Apr 2021 11:38:52 +0100,
Heiko Carstens  wrote:
> 
> On Wed, Apr 14, 2021 at 02:44:07PM +0100, Marc Zyngier wrote:
> > perf_pmu_name() and perf_num_counters() are unused. Drop them.
> > 
> > Signed-off-by: Marc Zyngier 
> > ---
> >  arch/s390/kernel/perf_event.c | 21 -
> >  1 file changed, 21 deletions(-)
> 
> Acked-by: Heiko Carstens 
> 
> ...or do you want me to pick this up and route via the s390 tree(?).

Either way works for me, but I just want to make sure the last patch
doesn't get applied before the previous ones.

Thanks,

M.

-- 
Without deviation from the norm, progress is not possible.


Re: [PATCH 3/5] s390: Get rid of oprofile leftovers

2021-04-15 Thread Heiko Carstens
On Wed, Apr 14, 2021 at 02:44:07PM +0100, Marc Zyngier wrote:
> perf_pmu_name() and perf_num_counters() are unused. Drop them.
> 
> Signed-off-by: Marc Zyngier 
> ---
>  arch/s390/kernel/perf_event.c | 21 -
>  1 file changed, 21 deletions(-)

Acked-by: Heiko Carstens 

...or do you want me to pick this up and route via the s390 tree(?).


Re: [PATCH v3 2/2] kvm/arm64: Try stage2 block mapping for host device MMIO

2021-04-15 Thread Keqian Zhu
Hi Marc,

On 2021/4/15 18:23, Marc Zyngier wrote:
> On Thu, 15 Apr 2021 03:20:52 +0100,
> Keqian Zhu  wrote:
>>
>> Hi Marc,
>>
>> On 2021/4/14 17:05, Marc Zyngier wrote:
>>> + Santosh, who found some interesting bugs in that area before.
>>>
>>> On Wed, 14 Apr 2021 07:51:09 +0100,
>>> Keqian Zhu  wrote:

 The MMIO region of a device may be huge (GB level), so try to use
 block mapping in stage2 to speed up both map and unmap.

 Compared to normal memory mapping, we should consider two more
 points when trying block mapping for an MMIO region:

 1. For normal memory mapping, the PA (host physical address) and
 HVA have the same alignment within PUD_SIZE or PMD_SIZE when we use
 the HVA to request a hugepage, so we don't need to consider PA
 alignment when verifying block mapping. But for device memory
 mapping, the PA and HVA may have different alignment.

 2. For normal memory mapping, we are sure the hugepage size properly
 fits into the vma, so we don't check whether the mapping size exceeds
 the boundary of the vma. But for device memory mapping, we should pay
 attention to this.

 This adds device_rough_page_shift() to check these two points when
 selecting block mapping size.

 Signed-off-by: Keqian Zhu 
 ---
  arch/arm64/kvm/mmu.c | 37 +
  1 file changed, 33 insertions(+), 4 deletions(-)

 diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
 index c59af5ca01b0..1a6d96169d60 100644
 --- a/arch/arm64/kvm/mmu.c
 +++ b/arch/arm64/kvm/mmu.c
 @@ -624,6 +624,31 @@ static void kvm_send_hwpoison_signal(unsigned long 
 address, short lsb)
send_sig_mceerr(BUS_MCEERR_AR, (void __user *)address, lsb, current);
  }
  
 +/*
 + * Find a max mapping size that properly insides the vma. And hva and pa 
 must
 + * have the same alignment to this mapping size. It's rough as there are 
 still
 + * other restrictions, will be checked by 
 fault_supports_stage2_huge_mapping().
 + */
 +static short device_rough_page_shift(struct vm_area_struct *vma,
 +   unsigned long hva)
>>>
>>> My earlier question still stands. Under which circumstances would this
>>> function return something that is *not* the final mapping size? I
>>> really don't see a reason why this would not return the final mapping
>>> size.
>>
>> IIUC, all the restrictions are about alignment and area boundary.
>>
>> That is to say, HVA, IPA and PA must have the same alignment within the
>> mapping size.  And the areas are the memslot and the vma, which means the
>> mapping size must properly fit into the memslot and the vma.
>>
>> In this function, we only check the alignment of HVA and PA, and
>> the boundary of the vma.  So we still need to check the alignment of HVA
>> and IPA, and the boundary of the memslot.  These will be checked by
>> fault_supports_stage2_huge_mapping().
> 
> But that's no different from what we do with normal memory, is it? So
> it really feels like we should have *one* function that deals with
> establishing the basic mapping size from the VMA (see below for what I
> have in mind).
Right. And it looks better.

> 
>>
>>>
 +{
 +  phys_addr_t pa = (vma->vm_pgoff << PAGE_SHIFT) + (hva - vma->vm_start);
 +
 +#ifndef __PAGETABLE_PMD_FOLDED
 +  if ((hva & (PUD_SIZE - 1)) == (pa & (PUD_SIZE - 1)) &&
 +  ALIGN_DOWN(hva, PUD_SIZE) >= vma->vm_start &&
 +  ALIGN(hva, PUD_SIZE) <= vma->vm_end)
 +  return PUD_SHIFT;
 +#endif
 +
 +  if ((hva & (PMD_SIZE - 1)) == (pa & (PMD_SIZE - 1)) &&
 +  ALIGN_DOWN(hva, PMD_SIZE) >= vma->vm_start &&
 +  ALIGN(hva, PMD_SIZE) <= vma->vm_end)
 +  return PMD_SHIFT;
 +
 +  return PAGE_SHIFT;
 +}
 +
  static bool fault_supports_stage2_huge_mapping(struct kvm_memory_slot 
 *memslot,
   unsigned long hva,
   unsigned long map_size)
 @@ -769,7 +794,10 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
 phys_addr_t fault_ipa,
return -EFAULT;
}
  
 -  /* Let's check if we will get back a huge page backed by hugetlbfs */
 +  /*
 +   * Let's check if we will get back a huge page backed by hugetlbfs, or
 +   * get block mapping for device MMIO region.
 +   */
mmap_read_lock(current->mm);
vma = find_vma_intersection(current->mm, hva, hva + 1);
if (unlikely(!vma)) {
 @@ -780,11 +808,12 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
 phys_addr_t fault_ipa,
  
if (is_vm_hugetlb_page(vma))
vma_shift = huge_page_shift(hstate_vma(vma));
 +  else if (vma->vm_flags & VM_PFNMAP)
 +  vma_shift = device_rough_page_shift(vma, hva);
>>>
>>> What prevents a VMA from having both VM_HUGETLB and VM_PFNMAP? This is
>>> 

Re: [PATCH 1/5] KVM: arm64: Divorce the perf code from oprofile helpers

2021-04-15 Thread Keqian Zhu
Hi Marc,

On 2021/4/15 18:42, Marc Zyngier wrote:
> On Thu, 15 Apr 2021 07:59:26 +0100,
> Keqian Zhu  wrote:
>>
>> Hi Marc,
>>
>> On 2021/4/14 21:44, Marc Zyngier wrote:
>>> KVM/arm64 is the sole user of perf_num_counters(), and really
>>> could do without it. Stop using the obsolete API by relying on
>>> the existing probing code.
>>>
>>> Signed-off-by: Marc Zyngier 
>>> ---
>>>  arch/arm64/kvm/perf.c | 7 +--
>>>  arch/arm64/kvm/pmu-emul.c | 2 +-
>>>  include/kvm/arm_pmu.h | 4 
>>>  3 files changed, 6 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/arch/arm64/kvm/perf.c b/arch/arm64/kvm/perf.c
>>> index 739164324afe..b8b398670ef2 100644
>>> --- a/arch/arm64/kvm/perf.c
>>> +++ b/arch/arm64/kvm/perf.c
>>> @@ -50,12 +50,7 @@ static struct perf_guest_info_callbacks kvm_guest_cbs = {
>>>  
>>>  int kvm_perf_init(void)
>>>  {
>>> -   /*
>>> -* Check if HW_PERF_EVENTS are supported by checking the number of
>>> -* hardware performance counters. This could ensure the presence of
>>> -* a physical PMU and CONFIG_PERF_EVENT is selected.
>>> -*/
>>> -   if (IS_ENABLED(CONFIG_ARM_PMU) && perf_num_counters() > 0)
>>> +   if (kvm_pmu_probe_pmuver() != 0xf)
>> The probe() function may be called many times
>> (kvm_arm_pmu_v3_set_attr also calls it).  I don't know whether the
>> first call is enough. If so, can we use a static variable in it,
>> so that subsequent calls can return the result right away?
> 
> No, because that wouldn't help with crappy big-little implementations
> that could have PMUs with different versions. We want to find the
> version at the point where the virtual PMU is created, which is why we
> call the probe function once per vcpu.
I see.

But AFAICS the pmuver is placed in kvm->arch, and the probe function is called
once per VM. Maybe I'm missing something.

> 
> This of course is broken in other ways (BL+KVM is a total disaster
> when it comes to PMU), but making this static would just make it
> worse.
OK.

Thanks,
Keqian


[PATCH v5 0/6] KVM: arm64: Improve efficiency of stage2 page table

2021-04-15 Thread Yanan Wang
Hi,

This series makes some efficiency improvements to the guest stage-2 page
table code, and there are some test results to quantify the benefit.
The code has been re-arranged based on the latest kvmarm/next tree.

Descriptions:
We currently uniformly perform CMOs of the D-cache and I-cache in
user_mem_abort() before calling the fault handlers. If we get concurrent
guest faults (e.g. translation faults, permission faults) or some really
unnecessary guest faults caused by BBM, CMOs are necessary for the first
vcpu but not for the later ones.

By moving CMOs to the fault handlers, we can easily identify conditions
where they are really needed and avoid the unnecessary ones. Since
performing CMOs is a time-consuming process, especially when flushing a
block range, this solution reduces much of the load on KVM and improves
the efficiency of the stage-2 page table code.

In this series, patches #1, #3 and #4 prepare for the movement of the
CMOs (adapting to the latest stage-2 page table framework). Patches
#2 and #5 move the D-cache and I-cache CMOs to the fault handlers. Patch
#6 introduces a new way to distinguish cases of memcache allocations.

The following are the results from v3, representing the benefit introduced
by the movement of CMOs; they were obtained with [1] (a kvm selftest) that
I posted recently.
[1] https://lore.kernel.org/lkml/20210302125751.19080-1-wangyana...@huawei.com/

When there are multiple vcpus concurrently accessing the same memory
region, we can test the execution time of KVM creating new mappings,
updating the permissions of old mappings from RO to RW, and the time
of re-creating the blocks after they have been split.

hardware platform: HiSilicon Kunpeng920 Server
host kernel: Linux mainline v5.12-rc2

cmdline: ./kvm_page_table_test -m 4 -s anonymous -b 1G -v 80
   (80 vcpus, 1G memory, page mappings(normal 4K))
KVM_CREATE_MAPPINGS: before 104.35s -> after  90.42s  +13.35%
KVM_UPDATE_MAPPINGS: before  78.64s -> after  75.45s  + 4.06%

cmdline: ./kvm_page_table_test -m 4 -s anonymous_thp -b 20G -v 40
   (40 vcpus, 20G memory, block mappings(THP 2M))
KVM_CREATE_MAPPINGS: before  15.66s -> after   6.92s  +55.80%
KVM_UPDATE_MAPPINGS: before 178.80s -> after 123.35s  +31.00%
KVM_REBUILD_BLOCKS:  before 187.34s -> after 131.76s  +30.65%

cmdline: ./kvm_page_table_test -m 4 -s anonymous_hugetlb_1gb -b 20G -v 40
   (40 vcpus, 20G memory, block mappings(HUGETLB 1G))
KVM_CREATE_MAPPINGS: before 104.54s -> after   3.70s  +96.46%
KVM_UPDATE_MAPPINGS: before 174.20s -> after 115.94s  +33.44%
KVM_REBUILD_BLOCKS:  before 103.95s -> after   2.96s  +97.15%

---

Changelogs:
v4->v5:
- rebased on the latest kvmarm/tree to adapt to the new stage-2 page-table code
- v4: 
https://lore.kernel.org/lkml/20210409033652.28316-1-wangyana...@huawei.com/

v3->v4:
- perform D-cache flush if we are not mapping device memory
- rebased on top of mainline v5.12-rc6
- v3: https://lore.kernel.org/lkml/20210326031654.3716-1-wangyana...@huawei.com/

v2->v3:
- drop patch #3 in v2
- retest v3 based on v5.12-rc2
- v2: 
https://lore.kernel.org/lkml/20210310094319.18760-1-wangyana...@huawei.com/

v1->v2:
- rebased on top of mainline v5.12-rc2
- also move CMOs of I-cache to the fault handlers
- retest v2 based on v5.12-rc2
- v1: 
https://lore.kernel.org/lkml/20210208112250.163568-1-wangyana...@huawei.com/

---

Yanan Wang (6):
  KVM: arm64: Introduce KVM_PGTABLE_S2_GUEST stage-2 flag
  KVM: arm64: Move D-cache flush to the fault handlers
  KVM: arm64: Add mm_ops member for structure stage2_attr_data
  KVM: arm64: Provide invalidate_icache_range at non-VHE EL2
  KVM: arm64: Move I-cache flush to the fault handlers
  KVM: arm64: Distinguish cases of memcache allocations completely

 arch/arm64/include/asm/kvm_mmu.h | 31 -
 arch/arm64/include/asm/kvm_pgtable.h | 38 ++--
 arch/arm64/kvm/hyp/nvhe/cache.S  | 11 +
 arch/arm64/kvm/hyp/pgtable.c | 65 +++-
 arch/arm64/kvm/mmu.c | 51 --
 5 files changed, 107 insertions(+), 89 deletions(-)

-- 
2.23.0



[PATCH v5 4/6] KVM: arm64: Provide invalidate_icache_range at non-VHE EL2

2021-04-15 Thread Yanan Wang
We want to move I-cache maintenance for the guest to the stage-2
page table code for a performance improvement. Before that can work,
we should first make the function invalidate_icache_range available
to non-VHE EL2 to avoid compile or runtime errors, as
pgtable.c is now linked into the non-VHE EL2 code for pKVM mode.

In this patch, we only introduce the symbol of invalidate_icache_range
with no real functionality in nvhe/cache.S, because no situations have
been found so far where I-cache maintenance is also needed in non-VHE
EL2 for pKVM mode.

Signed-off-by: Yanan Wang 
---
 arch/arm64/kvm/hyp/nvhe/cache.S | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/arch/arm64/kvm/hyp/nvhe/cache.S b/arch/arm64/kvm/hyp/nvhe/cache.S
index 36cef6915428..a125ec9aeed2 100644
--- a/arch/arm64/kvm/hyp/nvhe/cache.S
+++ b/arch/arm64/kvm/hyp/nvhe/cache.S
@@ -11,3 +11,14 @@ SYM_FUNC_START_PI(__flush_dcache_area)
dcache_by_line_op civac, sy, x0, x1, x2, x3
ret
 SYM_FUNC_END_PI(__flush_dcache_area)
+
+/*
+ * invalidate_icache_range(start,end)
+ *
+ * Ensure that the I cache is invalid within specified region.
+ *
+ * - start   - virtual start address of region
+ * - end - virtual end address of region
+ */
+SYM_FUNC_START(invalidate_icache_range)
+SYM_FUNC_END(invalidate_icache_range)
-- 
2.23.0



[PATCH v5 1/6] KVM: arm64: Introduce KVM_PGTABLE_S2_GUEST stage-2 flag

2021-04-15 Thread Yanan Wang
The stage-2 page table code in pgtable.c is now used for both the guest
stage-2 and the host stage-2. There may be some issues that differ
between the guest S2 page-table and the host S2 page-table that we should
consider, e.g., whether CMOs are needed when creating a new mapping.

So introduce the KVM_PGTABLE_S2_GUEST flag to indicate that we are
operating on a guest stage-2 page table. This flag will be used in a
coming patch, in which we will move the CMOs for the guest to pgtable.c.

Signed-off-by: Yanan Wang 
---
 arch/arm64/include/asm/kvm_pgtable.h | 38 ++--
 arch/arm64/kvm/mmu.c |  3 ++-
 2 files changed, 26 insertions(+), 15 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_pgtable.h 
b/arch/arm64/include/asm/kvm_pgtable.h
index c3674c47d48c..a43cbe697b37 100644
--- a/arch/arm64/include/asm/kvm_pgtable.h
+++ b/arch/arm64/include/asm/kvm_pgtable.h
@@ -61,10 +61,12 @@ struct kvm_pgtable_mm_ops {
  * @KVM_PGTABLE_S2_NOFWB:  Don't enforce Normal-WB even if the CPUs have
  * ARM64_HAS_STAGE2_FWB.
  * @KVM_PGTABLE_S2_IDMAP:  Only use identity mappings.
+ * @KVM_PGTABLE_S2_GUEST:  Whether the page-tables are guest stage-2.
  */
 enum kvm_pgtable_stage2_flags {
KVM_PGTABLE_S2_NOFWB= BIT(0),
KVM_PGTABLE_S2_IDMAP= BIT(1),
+   KVM_PGTABLE_S2_GUEST= BIT(2),
 };
 
 /**
@@ -221,12 +223,10 @@ int kvm_pgtable_stage2_init_flags(struct kvm_pgtable 
*pgt, struct kvm_arch *arch
  struct kvm_pgtable_mm_ops *mm_ops,
  enum kvm_pgtable_stage2_flags flags);
 
-#define kvm_pgtable_stage2_init(pgt, arch, mm_ops) \
-   kvm_pgtable_stage2_init_flags(pgt, arch, mm_ops, 0)
-
 /**
  * kvm_pgtable_stage2_destroy() - Destroy an unused guest stage-2 page-table.
- * @pgt:   Page-table structure initialised by kvm_pgtable_stage2_init*().
+ * @pgt:   Page-table structure initialised by function
+ * kvm_pgtable_stage2_init_flags().
  *
  * The page-table is assumed to be unreachable by any hardware walkers prior
  * to freeing and therefore no TLB invalidation is performed.
@@ -235,7 +235,8 @@ void kvm_pgtable_stage2_destroy(struct kvm_pgtable *pgt);
 
 /**
  * kvm_pgtable_stage2_map() - Install a mapping in a guest stage-2 page-table.
- * @pgt:   Page-table structure initialised by kvm_pgtable_stage2_init*().
+ * @pgt:   Page-table structure initialised by function
+ * kvm_pgtable_stage2_init_flags().
  * @addr:  Intermediate physical address at which to place the mapping.
  * @size:  Size of the mapping.
  * @phys:  Physical address of the memory to map.
@@ -268,7 +269,8 @@ int kvm_pgtable_stage2_map(struct kvm_pgtable *pgt, u64 
addr, u64 size,
 /**
  * kvm_pgtable_stage2_set_owner() - Unmap and annotate pages in the IPA space 
to
  * track ownership.
- * @pgt:   Page-table structure initialised by kvm_pgtable_stage2_init*().
+ * @pgt:   Page-table structure initialised by function
+ * kvm_pgtable_stage2_init_flags().
  * @addr:  Base intermediate physical address to annotate.
  * @size:  Size of the annotated range.
  * @mc:Cache of pre-allocated and zeroed memory from which to 
allocate
@@ -287,7 +289,8 @@ int kvm_pgtable_stage2_set_owner(struct kvm_pgtable *pgt, 
u64 addr, u64 size,
 
 /**
  * kvm_pgtable_stage2_unmap() - Remove a mapping from a guest stage-2 
page-table.
- * @pgt:   Page-table structure initialised by kvm_pgtable_stage2_init*().
+ * @pgt:   Page-table structure initialised by function
+ * kvm_pgtable_stage2_init_flags().
  * @addr:  Intermediate physical address from which to remove the mapping.
  * @size:  Size of the mapping.
  *
@@ -307,7 +310,8 @@ int kvm_pgtable_stage2_unmap(struct kvm_pgtable *pgt, u64 
addr, u64 size);
 /**
  * kvm_pgtable_stage2_wrprotect() - Write-protect guest stage-2 address range
  *  without TLB invalidation.
- * @pgt:   Page-table structure initialised by kvm_pgtable_stage2_init*().
+ * @pgt:   Page-table structure initialised by function
+ * kvm_pgtable_stage2_init_flags().
  * @addr:  Intermediate physical address from which to write-protect,
  * @size:  Size of the range.
  *
@@ -324,7 +328,8 @@ int kvm_pgtable_stage2_wrprotect(struct kvm_pgtable *pgt, 
u64 addr, u64 size);
 
 /**
  * kvm_pgtable_stage2_mkyoung() - Set the access flag in a page-table entry.
- * @pgt:   Page-table structure initialised by kvm_pgtable_stage2_init*().
+ * @pgt:   Page-table structure initialised by function
+ * kvm_pgtable_stage2_init_flags().
  * @addr:  Intermediate physical address to identify the page-table entry.
  *
  * The offset of @addr within a page is ignored.
@@ -338,7 +343,8 @@ kvm_pte_t kvm_pgtable_stage2_mkyoung(struct kvm_pgtable 
*pgt, u64 add

[PATCH v5 3/6] KVM: arm64: Add mm_ops member for structure stage2_attr_data

2021-04-15 Thread Yanan Wang
Add a mm_ops member to structure stage2_attr_data, since we
will move I-cache maintenance for the guest stage-2 to the permission
path and as a result will need mm_ops for address transformation.

Signed-off-by: Yanan Wang 
---
 arch/arm64/kvm/hyp/pgtable.c | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index e3606c9dcec7..b480f6d1171e 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -869,10 +869,11 @@ int kvm_pgtable_stage2_unmap(struct kvm_pgtable *pgt, u64 
addr, u64 size)
 }
 
 struct stage2_attr_data {
-   kvm_pte_t   attr_set;
-   kvm_pte_t   attr_clr;
-   kvm_pte_t   pte;
-   u32 level;
+   kvm_pte_t   attr_set;
+   kvm_pte_t   attr_clr;
+   kvm_pte_t   pte;
+   u32 level;
+   struct kvm_pgtable_mm_ops   *mm_ops;
 };
 
 static int stage2_attr_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
@@ -911,6 +912,7 @@ static int stage2_update_leaf_attrs(struct kvm_pgtable 
*pgt, u64 addr,
struct stage2_attr_data data = {
.attr_set   = attr_set & attr_mask,
.attr_clr   = attr_clr & attr_mask,
+   .mm_ops = pgt->mm_ops,
};
struct kvm_pgtable_walker walker = {
.cb = stage2_attr_walker,
-- 
2.23.0



[PATCH v5 2/6] KVM: arm64: Move D-cache flush to the fault handlers

2021-04-15 Thread Yanan Wang
We currently uniformly perform CMOs of the D-cache and I-cache in
user_mem_abort() before calling the fault handlers. If we get concurrent
guest faults (e.g. translation faults, permission faults) or some really
unnecessary guest faults caused by BBM, CMOs are necessary for the first
vcpu but not for the later ones.

By moving CMOs to the fault handlers, we can easily identify conditions
where they are really needed and avoid the unnecessary ones. Since
performing CMOs is a time-consuming process, especially when flushing a
block range, this solution reduces much of the load on KVM and improves
the efficiency of the page table code.

This patch only moves the cleaning of the D-cache to the map path, and
drops the original APIs in mmu.c/mmu.h for D-cache maintenance by using
what we already have in pgtable.c. Changes on the I-side will come in a
later patch.

Signed-off-by: Yanan Wang 
---
 arch/arm64/include/asm/kvm_mmu.h | 16 
 arch/arm64/kvm/hyp/pgtable.c | 20 ++--
 arch/arm64/kvm/mmu.c | 14 +++---
 3 files changed, 17 insertions(+), 33 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
index 25ed956f9af1..e9b163c5f023 100644
--- a/arch/arm64/include/asm/kvm_mmu.h
+++ b/arch/arm64/include/asm/kvm_mmu.h
@@ -187,22 +187,6 @@ static inline bool vcpu_has_cache_enabled(struct kvm_vcpu 
*vcpu)
return (vcpu_read_sys_reg(vcpu, SCTLR_EL1) & 0b101) == 0b101;
 }
 
-static inline void __clean_dcache_guest_page(kvm_pfn_t pfn, unsigned long size)
-{
-   void *va = page_address(pfn_to_page(pfn));
-
-   /*
-* With FWB, we ensure that the guest always accesses memory using
-* cacheable attributes, and we don't have to clean to PoC when
-* faulting in pages. Furthermore, FWB implies IDC, so cleaning to
-* PoU is not required either in this case.
-*/
-   if (cpus_have_const_cap(ARM64_HAS_STAGE2_FWB))
-   return;
-
-   kvm_flush_dcache_to_poc(va, size);
-}
-
 static inline void __invalidate_icache_guest_page(kvm_pfn_t pfn,
  unsigned long size)
 {
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index c37c1dc4feaf..e3606c9dcec7 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -562,6 +562,12 @@ static bool stage2_pte_is_counted(kvm_pte_t pte)
return !!pte;
 }
 
+static bool stage2_pte_cacheable(struct kvm_pgtable *pgt, kvm_pte_t pte)
+{
+   u64 memattr = pte & KVM_PTE_LEAF_ATTR_LO_S2_MEMATTR;
+   return memattr == KVM_S2_MEMATTR(pgt, NORMAL);
+}
+
 static void stage2_put_pte(kvm_pte_t *ptep, struct kvm_s2_mmu *mmu, u64 addr,
   u32 level, struct kvm_pgtable_mm_ops *mm_ops)
 {
@@ -583,6 +589,7 @@ static int stage2_map_walker_try_leaf(u64 addr, u64 end, 
u32 level,
 {
kvm_pte_t new, old = *ptep;
u64 granule = kvm_granule_size(level), phys = data->phys;
+   struct kvm_pgtable *pgt = data->mmu->pgt;
struct kvm_pgtable_mm_ops *mm_ops = data->mm_ops;
 
if (!kvm_block_mapping_supported(addr, end, phys, level))
@@ -606,6 +613,13 @@ static int stage2_map_walker_try_leaf(u64 addr, u64 end, 
u32 level,
stage2_put_pte(ptep, data->mmu, addr, level, mm_ops);
}
 
+   /* Perform CMOs before installation of the guest stage-2 PTE */
+   if (pgt->flags & KVM_PGTABLE_S2_GUEST) {
+   if (stage2_pte_cacheable(pgt, new) && !stage2_has_fwb(pgt))
+   __flush_dcache_area(mm_ops->phys_to_virt(phys),
+   granule);
+   }
+
smp_store_release(ptep, new);
if (stage2_pte_is_counted(new))
mm_ops->get_page(ptep);
@@ -798,12 +812,6 @@ int kvm_pgtable_stage2_set_owner(struct kvm_pgtable *pgt, 
u64 addr, u64 size,
return ret;
 }
 
-static bool stage2_pte_cacheable(struct kvm_pgtable *pgt, kvm_pte_t pte)
-{
-   u64 memattr = pte & KVM_PTE_LEAF_ATTR_LO_S2_MEMATTR;
-   return memattr == KVM_S2_MEMATTR(pgt, NORMAL);
-}
-
 static int stage2_unmap_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
   enum kvm_pgtable_walk_flags flag,
   void * const arg)
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 2cfcfc5f4e4e..86f7dd1c234f 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -694,11 +694,6 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm 
*kvm,
kvm_mmu_write_protect_pt_masked(kvm, slot, gfn_offset, mask);
 }
 
-static void clean_dcache_guest_page(kvm_pfn_t pfn, unsigned long size)
-{
-   __clean_dcache_guest_page(pfn, size);
-}
-
 static void invalidate_icache_guest_page(kvm_pfn_t pfn, unsigned long size)
 {
__invalidate_icache_guest_page(pfn, size);
@@ -972,9 +967,6 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
phys_addr_t fault_ipa,
if (writable

[PATCH v5 5/6] KVM: arm64: Move I-cache flush to the fault handlers

2021-04-15 Thread Yanan Wang
In this patch, we move the invalidation of the I-cache to the fault handlers
to avoid unnecessary I-cache maintenance. On the map path, invalidate the
I-cache if we are going to create an executable stage-2 mapping for the guest.
And on the permission path, invalidate the I-cache if we are going to add
executable permission to an existing guest stage-2 mapping.

Signed-off-by: Yanan Wang 
---
 arch/arm64/include/asm/kvm_mmu.h | 15 --
 arch/arm64/kvm/hyp/pgtable.c | 35 +++-
 arch/arm64/kvm/mmu.c |  9 +---
 3 files changed, 35 insertions(+), 24 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
index e9b163c5f023..155492fe5b15 100644
--- a/arch/arm64/include/asm/kvm_mmu.h
+++ b/arch/arm64/include/asm/kvm_mmu.h
@@ -187,21 +187,6 @@ static inline bool vcpu_has_cache_enabled(struct kvm_vcpu 
*vcpu)
return (vcpu_read_sys_reg(vcpu, SCTLR_EL1) & 0b101) == 0b101;
 }
 
-static inline void __invalidate_icache_guest_page(kvm_pfn_t pfn,
- unsigned long size)
-{
-   if (icache_is_aliasing()) {
-   /* any kind of VIPT cache */
-   __flush_icache_all();
-   } else if (is_kernel_in_hyp_mode() || !icache_is_vpipt()) {
-   /* PIPT or VPIPT at EL2 (see comment in 
__kvm_tlb_flush_vmid_ipa) */
-   void *va = page_address(pfn_to_page(pfn));
-
-   invalidate_icache_range((unsigned long)va,
-   (unsigned long)va + size);
-   }
-}
-
 void kvm_set_way_flush(struct kvm_vcpu *vcpu);
 void kvm_toggle_cache(struct kvm_vcpu *vcpu, bool was_enabled);
 
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index b480f6d1171e..9f4429d80df0 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -568,6 +568,26 @@ static bool stage2_pte_cacheable(struct kvm_pgtable *pgt, 
kvm_pte_t pte)
return memattr == KVM_S2_MEMATTR(pgt, NORMAL);
 }
 
+static bool stage2_pte_executable(kvm_pte_t pte)
+{
+   return !(pte & KVM_PTE_LEAF_ATTR_HI_S2_XN);
+}
+
+static void stage2_invalidate_icache(void *addr, u64 size)
+{
+   if (icache_is_aliasing()) {
+   /* Any kind of VIPT cache */
+   __flush_icache_all();
+   } else if (is_kernel_in_hyp_mode() || !icache_is_vpipt()) {
+   /*
+* See comment in __kvm_tlb_flush_vmid_ipa().
+* Invalidate PIPT, or VPIPT at EL2.
+*/
+   invalidate_icache_range((unsigned long)addr,
+   (unsigned long)addr + size);
+   }
+}
+
 static void stage2_put_pte(kvm_pte_t *ptep, struct kvm_s2_mmu *mmu, u64 addr,
   u32 level, struct kvm_pgtable_mm_ops *mm_ops)
 {
@@ -618,6 +638,10 @@ static int stage2_map_walker_try_leaf(u64 addr, u64 end, 
u32 level,
if (stage2_pte_cacheable(pgt, new) && !stage2_has_fwb(pgt))
__flush_dcache_area(mm_ops->phys_to_virt(phys),
granule);
+
+   if (stage2_pte_executable(new))
+   stage2_invalidate_icache(mm_ops->phys_to_virt(phys),
+granule);
}
 
smp_store_release(ptep, new);
@@ -896,8 +920,17 @@ static int stage2_attr_walker(u64 addr, u64 end, u32 
level, kvm_pte_t *ptep,
 * but worst-case the access flag update gets lost and will be
 * set on the next access instead.
 */
-   if (data->pte != pte)
+   if (data->pte != pte) {
+   /*
+* Invalidate the instruction cache before updating
+* if we are going to add the executable permission
+* for the guest stage-2 PTE.
+*/
+   if (!stage2_pte_executable(*ptep) && stage2_pte_executable(pte))
+   stage2_invalidate_icache(kvm_pte_follow(pte, 
data->mm_ops),
+kvm_granule_size(level));
WRITE_ONCE(*ptep, pte);
+   }
 
return 0;
 }
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 86f7dd1c234f..aa536392b308 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -694,11 +694,6 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm 
*kvm,
kvm_mmu_write_protect_pt_masked(kvm, slot, gfn_offset, mask);
 }
 
-static void invalidate_icache_guest_page(kvm_pfn_t pfn, unsigned long size)
-{
-   __invalidate_icache_guest_page(pfn, size);
-}
-
 static void kvm_send_hwpoison_signal(unsigned long address, short lsb)
 {
send_sig_mceerr(BUS_MCEERR_AR, (void __user *)address, lsb, current);
@@ -967,10 +962,8 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
phys_addr_t fault_ipa,
if (writable)
prot |= KVM_PGTABLE_PROT_W;
 
-   if (exec_fault) {
+   

[PATCH v5 6/6] KVM: arm64: Distinguish cases of memcache allocations completely

2021-04-15 Thread Yanan Wang
With a guest translation fault, the memcache pages are not needed if KVM
is only about to install a new leaf entry into the existing page table.
And with a guest permission fault, the memcache pages are also not needed
for a write fault at dirty-logging time if KVM is only about to update
the existing leaf entry instead of collapsing a block entry into a table.

By comparing fault_granule and vma_pagesize, cases that require allocations
from memcache and cases that don't can be distinguished completely.

Signed-off-by: Yanan Wang 
---
 arch/arm64/kvm/mmu.c | 25 -
 1 file changed, 12 insertions(+), 13 deletions(-)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index aa536392b308..9e35aa5d29f2 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -895,19 +895,6 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
phys_addr_t fault_ipa,
gfn = fault_ipa >> PAGE_SHIFT;
mmap_read_unlock(current->mm);
 
-   /*
-* Permission faults just need to update the existing leaf entry,
-* and so normally don't require allocations from the memcache. The
-* only exception to this is when dirty logging is enabled at runtime
-* and a write fault needs to collapse a block entry into a table.
-*/
-   if (fault_status != FSC_PERM || (logging_active && write_fault)) {
-   ret = kvm_mmu_topup_memory_cache(memcache,
-kvm_mmu_cache_min_pages(kvm));
-   if (ret)
-   return ret;
-   }
-
mmu_seq = vcpu->kvm->mmu_notifier_seq;
/*
 * Ensure the read of mmu_notifier_seq happens before we call
@@ -970,6 +957,18 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
phys_addr_t fault_ipa,
else if (cpus_have_const_cap(ARM64_HAS_CACHE_DIC))
prot |= KVM_PGTABLE_PROT_X;
 
+   /*
+* Allocations from the memcache are required only when granule of the
+* lookup level where the guest fault happened exceeds vma_pagesize,
+* which means new page tables will be created in the fault handlers.
+*/
+   if (fault_granule > vma_pagesize) {
+   ret = kvm_mmu_topup_memory_cache(memcache,
+kvm_mmu_cache_min_pages(kvm));
+   if (ret)
+   return ret;
+   }
+
/*
 * Under the premise of getting a FSC_PERM fault, we just need to relax
 * permissions only if vma_pagesize equals fault_granule. Otherwise,
-- 
2.23.0



[PATCH kvmtool] arm: Fail early if KVM_CAP_ARM_PMU_V3 is not supported

2021-04-15 Thread Alexandru Elisei
pmu__generate_fdt_nodes() checks if the host has support for PMU in a guest
and prints a warning if that's not the case. However, this check is too
late because the function is called after the VCPU has been created, and
VCPU creation fails if KVM_CAP_ARM_PMU_V3 is not available with a rather
unhelpful error:

$ ./vm run -c1 -m64 -f selftest.flat --pmu
  # lkvm run --firmware selftest.flat -m 64 -c 1 --name guest-1039
  Info: Placing fdt at 0x8020 - 0x8021
  Fatal: Unable to initialise vcpu

Move the check for KVM_CAP_ARM_PMU_V3 to kvm_cpu__arch_init() before the
VCPU is created so the user can get a more useful error message. This
also matches the behaviour of KVM_CAP_ARM_EL1_32BIT.

Signed-off-by: Alexandru Elisei 
---
 arm/kvm-cpu.c | 4 
 arm/pmu.c | 5 -
 2 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/arm/kvm-cpu.c b/arm/kvm-cpu.c
index 2acecaecb696..6a2408c632df 100644
--- a/arm/kvm-cpu.c
+++ b/arm/kvm-cpu.c
@@ -50,6 +50,10 @@ struct kvm_cpu *kvm_cpu__arch_init(struct kvm *kvm, unsigned 
long cpu_id)
!kvm__supports_extension(kvm, KVM_CAP_ARM_EL1_32BIT))
die("32bit guests are not supported\n");
 
+   if (kvm->cfg.arch.has_pmuv3 &&
+   !kvm__supports_extension(kvm, KVM_CAP_ARM_PMU_V3))
+   die("PMUv3 is not supported");
+
vcpu = calloc(1, sizeof(struct kvm_cpu));
if (!vcpu)
return NULL;
diff --git a/arm/pmu.c b/arm/pmu.c
index ffd152e27447..5b058eabb49d 100644
--- a/arm/pmu.c
+++ b/arm/pmu.c
@@ -43,11 +43,6 @@ void pmu__generate_fdt_nodes(void *fdt, struct kvm *kvm)
if (!kvm->cfg.arch.has_pmuv3)
return;
 
-   if (!kvm__supports_extension(kvm, KVM_CAP_ARM_PMU_V3)) {
-   pr_info("PMU unsupported\n");
-   return;
-   }
-
for (i = 0; i < kvm->nrcpus; i++) {
struct kvm_device_attr pmu_attr;
 
-- 
2.31.1



Re: [PATCH v3 2/2] kvm/arm64: Try stage2 block mapping for host device MMIO

2021-04-15 Thread Marc Zyngier
On Thu, 15 Apr 2021 12:26:26 +0100,
Keqian Zhu  wrote:
> 
> Hi Marc,
> 
> On 2021/4/15 18:23, Marc Zyngier wrote:
> > On Thu, 15 Apr 2021 03:20:52 +0100,
> > Keqian Zhu  wrote:
> >>
> >> Hi Marc,
> >>
> >> On 2021/4/14 17:05, Marc Zyngier wrote:
> >>> + Santosh, who found some interesting bugs in that area before.
> >>>
> >>> On Wed, 14 Apr 2021 07:51:09 +0100,
> >>> Keqian Zhu  wrote:
> 
>  The MMIO region of a device may be huge (GB level), so try to use
>  block mapping in stage2 to speed up both map and unmap.
> 
>  Compared to normal memory mapping, we should consider two more
>  points when trying block mapping for an MMIO region:
> 
>  1. For normal memory mapping, the PA (host physical address) and
>  HVA have the same alignment within PUD_SIZE or PMD_SIZE when we use
>  the HVA to request a hugepage, so we don't need to consider PA
>  alignment when verifying block mapping. But for device memory
>  mapping, the PA and HVA may have different alignment.
> 
>  2. For normal memory mapping, we are sure the hugepage size properly
>  fits into the vma, so we don't check whether the mapping size exceeds
>  the boundary of the vma. But for device memory mapping, we should pay
>  attention to this.
> 
>  This adds device_rough_page_shift() to check these two points when
>  selecting block mapping size.
> 
>  Signed-off-by: Keqian Zhu 
>  ---
>   arch/arm64/kvm/mmu.c | 37 +
>   1 file changed, 33 insertions(+), 4 deletions(-)
> 
>  diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
>  index c59af5ca01b0..1a6d96169d60 100644
>  --- a/arch/arm64/kvm/mmu.c
>  +++ b/arch/arm64/kvm/mmu.c
>  @@ -624,6 +624,31 @@ static void kvm_send_hwpoison_signal(unsigned long 
>  address, short lsb)
>   send_sig_mceerr(BUS_MCEERR_AR, (void __user *)address, lsb, 
>  current);
>   }
>   
>  +/*
>  + * Find a max mapping size that properly insides the vma. And hva and 
>  pa must
>  + * have the same alignment to this mapping size. It's rough as there 
>  are still
>  + * other restrictions, will be checked by 
>  fault_supports_stage2_huge_mapping().
>  + */
>  +static short device_rough_page_shift(struct vm_area_struct *vma,
>  + unsigned long hva)
> >>>
> >>> My earlier question still stands. Under which circumstances would this
> >>> function return something that is *not* the final mapping size? I
> >>> really don't see a reason why this would not return the final mapping
> >>> size.
> >>
> >> IIUC, all the restrictions are about alignment and area boundary.
> >>
> >> That is to say, HVA, IPA and PA must have the same alignment within the
> >> mapping size.  And the areas are the memslot and the vma, which means the
> >> mapping size must properly fit into the memslot and the vma.
> >>
> >> In this function, we only check the alignment of HVA and PA, and
> >> the boundary of the vma.  So we still need to check the alignment of HVA
> >> and IPA, and the boundary of the memslot.  These will be checked by
> >> fault_supports_stage2_huge_mapping().
> > 
> > But that's no different from what we do with normal memory, is it? So
> > it really feels like we should have *one* function that deals with
> > establishing the basic mapping size from the VMA (see below for what I
> > have in mind).
> Right. And it looks better.
> 
> > 
> >>
> >>>
>  +{
>  +phys_addr_t pa = (vma->vm_pgoff << PAGE_SHIFT) + (hva - 
>  vma->vm_start);
>  +
>  +#ifndef __PAGETABLE_PMD_FOLDED
>  +if ((hva & (PUD_SIZE - 1)) == (pa & (PUD_SIZE - 1)) &&
>  +ALIGN_DOWN(hva, PUD_SIZE) >= vma->vm_start &&
>  +ALIGN(hva, PUD_SIZE) <= vma->vm_end)
>  +return PUD_SHIFT;
>  +#endif
>  +
>  +if ((hva & (PMD_SIZE - 1)) == (pa & (PMD_SIZE - 1)) &&
>  +ALIGN_DOWN(hva, PMD_SIZE) >= vma->vm_start &&
>  +ALIGN(hva, PMD_SIZE) <= vma->vm_end)
>  +return PMD_SHIFT;
>  +
>  +return PAGE_SHIFT;
>  +}
>  +
>   static bool fault_supports_stage2_huge_mapping(struct kvm_memory_slot 
>  *memslot,
>  unsigned long hva,
>  unsigned long map_size)
>  @@ -769,7 +794,10 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
>  phys_addr_t fault_ipa,
>   return -EFAULT;
>   }
>   
>  -/* Let's check if we will get back a huge page backed by 
>  hugetlbfs */
>  +/*
>  + * Let's check if we will get back a huge page backed by 
>  hugetlbfs, or
>  + * get block mapping for device MMIO region.
>  + */
>   mmap_read_lock(current->mm);
>   vma = find_vma_intersec

Re: [PATCH 1/5] KVM: arm64: Divorce the perf code from oprofile helpers

2021-04-15 Thread Marc Zyngier
On Thu, 15 Apr 2021 12:34:40 +0100,
Keqian Zhu  wrote:
> 
> Hi Marc,
> 
> On 2021/4/15 18:42, Marc Zyngier wrote:
> > On Thu, 15 Apr 2021 07:59:26 +0100,
> > Keqian Zhu  wrote:
> >>
> >> Hi Marc,
> >>
> >> On 2021/4/14 21:44, Marc Zyngier wrote:
> >>> KVM/arm64 is the sole user of perf_num_counters(), and really
> >>> could do without it. Stop using the obsolete API by relying on
> >>> the existing probing code.
> >>>
> >>> Signed-off-by: Marc Zyngier 
> >>> ---
> >>>  arch/arm64/kvm/perf.c | 7 +--
> >>>  arch/arm64/kvm/pmu-emul.c | 2 +-
> >>>  include/kvm/arm_pmu.h | 4 
> >>>  3 files changed, 6 insertions(+), 7 deletions(-)
> >>>
> >>> diff --git a/arch/arm64/kvm/perf.c b/arch/arm64/kvm/perf.c
> >>> index 739164324afe..b8b398670ef2 100644
> >>> --- a/arch/arm64/kvm/perf.c
> >>> +++ b/arch/arm64/kvm/perf.c
> >>> @@ -50,12 +50,7 @@ static struct perf_guest_info_callbacks kvm_guest_cbs 
> >>> = {
> >>>  
> >>>  int kvm_perf_init(void)
> >>>  {
> >>> - /*
> >>> -  * Check if HW_PERF_EVENTS are supported by checking the number of
> >>> -  * hardware performance counters. This could ensure the presence of
> >>> -  * a physical PMU and CONFIG_PERF_EVENT is selected.
> >>> -  */
> >>> - if (IS_ENABLED(CONFIG_ARM_PMU) && perf_num_counters() > 0)
> >>> + if (kvm_pmu_probe_pmuver() != 0xf)
> >> The probe() function may be called many times
> >> (kvm_arm_pmu_v3_set_attr also calls it).  I don't know whether the
> >> first call is enough. If so, can we use a static variable in it,
> >> so that subsequent calls can return the result right away?
> > 
> > No, because that wouldn't help with crappy big-little implementations
> > that could have PMUs with different versions. We want to find the
> > version at the point where the virtual PMU is created, which is why we
> > call the probe function once per vcpu.
> I see.
> 
> But AFAICS the pmuver is placed in kvm->arch, and the probe function is called
> once per VM. Maybe I'm missing something.

You're right, I mis-remembered. This doesn't change much though.

Thanks,

M.

-- 
Without deviation from the norm, progress is not possible.


[PATCH v4 1/2] kvm/arm64: Remove the creation time's mapping of MMIO regions

2021-04-15 Thread Keqian Zhu
The MMIO regions may be unmapped for many reasons and can be remapped
by the stage2 fault path. Mapping MMIO regions at creation time is only a
minor optimization and makes these two mapping paths hard to keep in sync.

Remove the mapping code while keeping the useful sanity check.

Signed-off-by: Keqian Zhu 
---
 arch/arm64/kvm/mmu.c | 38 +++---
 1 file changed, 3 insertions(+), 35 deletions(-)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 8711894db8c2..c59af5ca01b0 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1301,7 +1301,6 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
 {
hva_t hva = mem->userspace_addr;
hva_t reg_end = hva + mem->memory_size;
-   bool writable = !(mem->flags & KVM_MEM_READONLY);
int ret = 0;
 
if (change != KVM_MR_CREATE && change != KVM_MR_MOVE &&
@@ -1318,8 +1317,7 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
mmap_read_lock(current->mm);
/*
 * A memory region could potentially cover multiple VMAs, and any holes
-* between them, so iterate over all of them to find out if we can map
-* any of them right now.
+* between them, so iterate over all of them.
 *
 * ++
 * +---++   ++
@@ -1330,50 +1328,20 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
 */
do {
struct vm_area_struct *vma = find_vma(current->mm, hva);
-   hva_t vm_start, vm_end;
 
if (!vma || vma->vm_start >= reg_end)
break;
 
-   /*
-* Take the intersection of this VMA with the memory region
-*/
-   vm_start = max(hva, vma->vm_start);
-   vm_end = min(reg_end, vma->vm_end);
-
if (vma->vm_flags & VM_PFNMAP) {
-   gpa_t gpa = mem->guest_phys_addr +
-   (vm_start - mem->userspace_addr);
-   phys_addr_t pa;
-
-   pa = (phys_addr_t)vma->vm_pgoff << PAGE_SHIFT;
-   pa += vm_start - vma->vm_start;
-
/* IO region dirty page logging not allowed */
if (memslot->flags & KVM_MEM_LOG_DIRTY_PAGES) {
ret = -EINVAL;
-   goto out;
-   }
-
-   ret = kvm_phys_addr_ioremap(kvm, gpa, pa,
-   vm_end - vm_start,
-   writable);
-   if (ret)
break;
+   }
}
-   hva = vm_end;
+   hva = min(reg_end, vma->vm_end);
} while (hva < reg_end);
 
-   if (change == KVM_MR_FLAGS_ONLY)
-   goto out;
-
-   spin_lock(&kvm->mmu_lock);
-   if (ret)
-   unmap_stage2_range(&kvm->arch.mmu, mem->guest_phys_addr, 
mem->memory_size);
-   else if (!cpus_have_final_cap(ARM64_HAS_STAGE2_FWB))
-   stage2_flush_memslot(kvm, memslot);
-   spin_unlock(&kvm->mmu_lock);
-out:
mmap_read_unlock(current->mm);
return ret;
 }
-- 
2.19.1



[PATCH v4 0/2] kvm/arm64: Try stage2 block mapping for host device MMIO

2021-04-15 Thread Keqian Zhu
Hi,

We have two paths to build stage2 mappings for MMIO regions:
the creation-time path and the stage2 fault path.

Patch #1 removes the creation-time mapping of MMIO regions.
Patch #2 tries stage2 block mapping for host device MMIO on the fault path.

Changelog:

v4:
 - use get_vma_page_shift() to handle all cases. (Marc)
 - get rid of force_pte for device mapping. (Marc)

v3:
 - No need to check the memslot boundary in device_rough_page_shift(). (Marc)

Thanks,
Keqian


Keqian Zhu (2):
  kvm/arm64: Remove the creation time's mapping of MMIO regions
  kvm/arm64: Try stage2 block mapping for host device MMIO

 arch/arm64/kvm/mmu.c | 99 
 1 file changed, 54 insertions(+), 45 deletions(-)

-- 
2.19.1



[PATCH v4 2/2] kvm/arm64: Try stage2 block mapping for host device MMIO

2021-04-15 Thread Keqian Zhu
The MMIO region of a device may be huge (GB level), so try to use
block mapping in stage2 to speed up both map and unmap.

Compared to normal memory mapping, we should consider two more
points when trying block mapping for an MMIO region:

1. For normal memory mapping, the PA (host physical address) and
HVA have the same alignment within PUD_SIZE or PMD_SIZE when we use
the HVA to request a hugepage, so we don't need to consider PA
alignment when verifying block mapping. But for device memory
mapping, the PA and HVA may have different alignment.

2. For normal memory mapping, we are sure the hugepage size properly
fits into the vma, so we don't check whether the mapping size exceeds
the boundary of the vma. But for device memory mapping, we should pay
attention to this.

This adds get_vma_page_shift() to get the page shift for both normal
memory and device MMIO regions, and checks these two points when
selecting the block mapping size for an MMIO region.

Signed-off-by: Keqian Zhu 
---
 arch/arm64/kvm/mmu.c | 61 
 1 file changed, 51 insertions(+), 10 deletions(-)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index c59af5ca01b0..5a1cc7751e6d 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -738,6 +738,35 @@ transparent_hugepage_adjust(struct kvm_memory_slot 
*memslot,
return PAGE_SIZE;
 }
 
+static int get_vma_page_shift(struct vm_area_struct *vma, unsigned long hva)
+{
+   unsigned long pa;
+
+   if (is_vm_hugetlb_page(vma) && !(vma->vm_flags & VM_PFNMAP))
+   return huge_page_shift(hstate_vma(vma));
+
+   if (!(vma->vm_flags & VM_PFNMAP))
+   return PAGE_SHIFT;
+
+   VM_BUG_ON(is_vm_hugetlb_page(vma));
+
+   pa = (vma->vm_pgoff << PAGE_SHIFT) + (hva - vma->vm_start);
+
+#ifndef __PAGETABLE_PMD_FOLDED
+   if ((hva & (PUD_SIZE - 1)) == (pa & (PUD_SIZE - 1)) &&
+   ALIGN_DOWN(hva, PUD_SIZE) >= vma->vm_start &&
+   ALIGN(hva, PUD_SIZE) <= vma->vm_end)
+   return PUD_SHIFT;
+#endif
+
+   if ((hva & (PMD_SIZE - 1)) == (pa & (PMD_SIZE - 1)) &&
+   ALIGN_DOWN(hva, PMD_SIZE) >= vma->vm_start &&
+   ALIGN(hva, PMD_SIZE) <= vma->vm_end)
+   return PMD_SHIFT;
+
+   return PAGE_SHIFT;
+}
+
 static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
  struct kvm_memory_slot *memslot, unsigned long hva,
  unsigned long fault_status)
@@ -769,7 +798,10 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
phys_addr_t fault_ipa,
return -EFAULT;
}
 
-   /* Let's check if we will get back a huge page backed by hugetlbfs */
+   /*
+* Let's check if we will get back a huge page backed by hugetlbfs, or
+* get block mapping for device MMIO region.
+*/
mmap_read_lock(current->mm);
vma = find_vma_intersection(current->mm, hva, hva + 1);
if (unlikely(!vma)) {
@@ -778,15 +810,15 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
phys_addr_t fault_ipa,
return -EFAULT;
}
 
-   if (is_vm_hugetlb_page(vma))
-   vma_shift = huge_page_shift(hstate_vma(vma));
-   else
-   vma_shift = PAGE_SHIFT;
-
-   if (logging_active ||
-   (vma->vm_flags & VM_PFNMAP)) {
+   /*
+* logging_active is guaranteed to never be true for VM_PFNMAP
+* memslots.
+*/
+   if (logging_active) {
force_pte = true;
vma_shift = PAGE_SHIFT;
+   } else {
+   vma_shift = get_vma_page_shift(vma, hva);
}
 
switch (vma_shift) {
@@ -854,8 +886,17 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
phys_addr_t fault_ipa,
return -EFAULT;
 
if (kvm_is_device_pfn(pfn)) {
+   /*
+* If the page was identified as device early by looking at
+* the VMA flags, vma_pagesize is already representing the
+* largest quantity we can map.  If instead it was mapped
+* via gfn_to_pfn_prot(), vma_pagesize is set to PAGE_SIZE
+* and must not be upgraded.
+*
+* In both cases, we don't let transparent_hugepage_adjust()
+* change things at the last minute.
+*/
device = true;
-   force_pte = true;
} else if (logging_active && !write_fault) {
/*
 * Only actually map the page as writable if this was a write
@@ -876,7 +917,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
phys_addr_t fault_ipa,
 * If we are not forced to use page mapping, check if we are
 * backed by a THP and thus use block mapping if possible.
 */
-   if (vma_pagesize == PAGE_SIZE && !force_pte)
+   if (vma_pagesize == PAGE_SIZE && !(force_pte || device))
vma_pagesize = transparent_hugepage_adjus

Re: [PATCH v4 2/2] kvm/arm64: Try stage2 block mapping for host device MMIO

2021-04-15 Thread Keqian Zhu
Hi Marc,

On 2021/4/15 22:03, Keqian Zhu wrote:
> The MMIO region of a device may be huge (GB level), so try to use
> block mapping in stage2 to speed up both map and unmap.
> 
> Compared to normal memory mapping, we should consider two more
> points when trying block mapping for an MMIO region:
> 
> 1. For normal memory mapping, the PA (host physical address) and
> HVA have the same alignment within PUD_SIZE or PMD_SIZE when we use
> the HVA to request a hugepage, so we don't need to consider PA
> alignment when verifying block mapping. But for device memory
> mapping, the PA and HVA may have different alignments.
> 
> 2. For normal memory mapping, we are sure the hugepage size properly
> fits into the vma, so we don't check whether the mapping size exceeds
> the boundary of the vma. But for device memory mapping, we should pay
> attention to this.
> 
> This adds get_vma_page_shift() to get the page shift for both normal
> memory and device MMIO regions, and checks these two points when
> selecting the block mapping size for MMIO regions.
> 
> Signed-off-by: Keqian Zhu 
> ---
>  arch/arm64/kvm/mmu.c | 61 
>  1 file changed, 51 insertions(+), 10 deletions(-)
> 
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index c59af5ca01b0..5a1cc7751e6d 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -738,6 +738,35 @@ transparent_hugepage_adjust(struct kvm_memory_slot 
> *memslot,
>   return PAGE_SIZE;
>  }
>  
> +static int get_vma_page_shift(struct vm_area_struct *vma, unsigned long hva)
> +{
> + unsigned long pa;
> +
> + if (is_vm_hugetlb_page(vma) && !(vma->vm_flags & VM_PFNMAP))
> + return huge_page_shift(hstate_vma(vma));
> +
> + if (!(vma->vm_flags & VM_PFNMAP))
> + return PAGE_SHIFT;
> +
> + VM_BUG_ON(is_vm_hugetlb_page(vma));
> +
> + pa = (vma->vm_pgoff << PAGE_SHIFT) + (hva - vma->vm_start);
> +
> +#ifndef __PAGETABLE_PMD_FOLDED
> + if ((hva & (PUD_SIZE - 1)) == (pa & (PUD_SIZE - 1)) &&
> + ALIGN_DOWN(hva, PUD_SIZE) >= vma->vm_start &&
> + ALIGN(hva, PUD_SIZE) <= vma->vm_end)
> + return PUD_SHIFT;
> +#endif
> +
> + if ((hva & (PMD_SIZE - 1)) == (pa & (PMD_SIZE - 1)) &&
> + ALIGN_DOWN(hva, PMD_SIZE) >= vma->vm_start &&
> + ALIGN(hva, PMD_SIZE) <= vma->vm_end)
> + return PMD_SHIFT;
> +
> + return PAGE_SHIFT;
> +}
> +
>  static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> struct kvm_memory_slot *memslot, unsigned long hva,
> unsigned long fault_status)
> @@ -769,7 +798,10 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
> phys_addr_t fault_ipa,
>   return -EFAULT;
>   }
>  
> - /* Let's check if we will get back a huge page backed by hugetlbfs */
> + /*
> +  * Let's check if we will get back a huge page backed by hugetlbfs, or
> +  * get block mapping for device MMIO region.
> +  */
>   mmap_read_lock(current->mm);
>   vma = find_vma_intersection(current->mm, hva, hva + 1);
>   if (unlikely(!vma)) {
> @@ -778,15 +810,15 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
> phys_addr_t fault_ipa,
>   return -EFAULT;
>   }
>  
> - if (is_vm_hugetlb_page(vma))
> - vma_shift = huge_page_shift(hstate_vma(vma));
> - else
> - vma_shift = PAGE_SHIFT;
> -
> - if (logging_active ||
> - (vma->vm_flags & VM_PFNMAP)) {
> + /*
> +  * logging_active is guaranteed to never be true for VM_PFNMAP
> +  * memslots.
> +  */
> + if (logging_active) {
>   force_pte = true;
>   vma_shift = PAGE_SHIFT;
> + } else {
> + vma_shift = get_vma_page_shift(vma, hva);
>   }
I use an if/else structure in v4, please check that. Thanks very much!


BRs,
Keqian

>  
>   switch (vma_shift) {
> @@ -854,8 +886,17 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
> phys_addr_t fault_ipa,
>   return -EFAULT;
>  
>   if (kvm_is_device_pfn(pfn)) {
> + /*
> +  * If the page was identified as device early by looking at
> +  * the VMA flags, vma_pagesize is already representing the
> +  * largest quantity we can map.  If instead it was mapped
> +  * via gfn_to_pfn_prot(), vma_pagesize is set to PAGE_SIZE
> +  * and must not be upgraded.
> +  *
> +  * In both cases, we don't let transparent_hugepage_adjust()
> +  * change things at the last minute.
> +  */
>   device = true;
> - force_pte = true;
>   } else if (logging_active && !write_fault) {
>   /*
>* Only actually map the page as writable if this was a write
> @@ -876,7 +917,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
> phys_addr_t fault_ipa,
>* If we are not forced to use page mapping, check if we are
> 

Re: [PATCH 3/5] s390: Get rid of oprofile leftovers

2021-04-15 Thread Heiko Carstens
On Thu, Apr 15, 2021 at 11:47:26AM +0100, Marc Zyngier wrote:
> On Thu, 15 Apr 2021 11:38:52 +0100,
> Heiko Carstens  wrote:
> > 
> > On Wed, Apr 14, 2021 at 02:44:07PM +0100, Marc Zyngier wrote:
> > > perf_pmu_name() and perf_num_counters() are unused. Drop them.
> > > 
> > > Signed-off-by: Marc Zyngier 
> > > ---
> > >  arch/s390/kernel/perf_event.c | 21 -
> > >  1 file changed, 21 deletions(-)
> > 
> > Acked-by: Heiko Carstens 
> > 
> > ...or do you want me to pick this up and route via the s390 tree(?).
> 
> Either way works for me, but I just want to make sure the last patch
> doesn't get applied before the previous ones.

Ok, I applied this one to the s390 tree. Thanks!


[PATCH v2 0/4] KVM statistics data fd-based binary interface

2021-04-15 Thread Jing Zhang
This patchset provides a file descriptor for every VM and VCPU to read
KVM statistics data in binary format.
It is meant to provide a lightweight, flexible, scalable and efficient
lock-free solution for user space telemetry applications to pull the
statistics data periodically on large-scale systems. The pulling
frequency could be as high as a few times per second.
In this patchset, every statistic is treated as having the following
attributes:
  * architecture dependent or common
  * VM statistics data or VCPU statistics data
  * type: cumulative or instantaneous
  * unit: none for a simple counter, nanosecond, microsecond,
millisecond, second, Byte, KiByte, MiByte, GiByte, or clock cycles
Since no locking/synchronization is used, consistency across all the
statistics is not guaranteed. That means not all statistics are read
out at the exact same time, since the statistics data are still being
updated by KVM subsystems while they are read out.
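
For illustration, the type and unit attributes above (together with a
scale factor) are packed into a per-statistic ``flags`` word by patch
3/4 of this series. A minimal decode using those masks might look like
the sketch below (the example value is made up):

#include <stdio.h>
#include <stdint.h>

#define KVM_STATS_TYPE_SHIFT	0
#define KVM_STATS_TYPE_MASK	(0xF << KVM_STATS_TYPE_SHIFT)
#define KVM_STATS_UNIT_SHIFT	4
#define KVM_STATS_UNIT_MASK	(0xF << KVM_STATS_UNIT_SHIFT)
#define KVM_STATS_SCALE_SHIFT	8
#define KVM_STATS_SCALE_MASK	(0xF << KVM_STATS_SCALE_SHIFT)

int main(void)
{
	uint32_t flags = 0x021;	/* made-up example: instant, seconds */

	printf("type  %u\n", (flags & KVM_STATS_TYPE_MASK) >> KVM_STATS_TYPE_SHIFT);
	printf("unit  %u\n", (flags & KVM_STATS_UNIT_MASK) >> KVM_STATS_UNIT_SHIFT);
	printf("scale %u\n", (flags & KVM_STATS_SCALE_MASK) >> KVM_STATS_SCALE_SHIFT);
	return 0;
}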

---

* v1 -> v2
  - Use ARRAY_SIZE to count the number of stats descriptors
  - Fix missing `size` field initialization in macro STATS_DESC

[1] https://lore.kernel.org/kvm/20210402224359.2297157-1-jingzhan...@google.com

---

Jing Zhang (4):
  KVM: stats: Separate common stats from architecture specific ones
  KVM: stats: Add fd-based API to read binary stats data
  KVM: stats: Add documentation for statistics data binary interface
  KVM: selftests: Add selftest for KVM statistics data binary interface

 Documentation/virt/kvm/api.rst| 169 
 arch/arm64/include/asm/kvm_host.h |   9 +-
 arch/arm64/kvm/guest.c|  42 +-
 arch/mips/include/asm/kvm_host.h  |   9 +-
 arch/mips/kvm/mips.c  |  67 +++-
 arch/powerpc/include/asm/kvm_host.h   |   9 +-
 arch/powerpc/kvm/book3s.c |  68 +++-
 arch/powerpc/kvm/book3s_hv.c  |  12 +-
 arch/powerpc/kvm/book3s_pr.c  |   2 +-
 arch/powerpc/kvm/book3s_pr_papr.c |   2 +-
 arch/powerpc/kvm/booke.c  |  63 ++-
 arch/s390/include/asm/kvm_host.h  |   9 +-
 arch/s390/kvm/kvm-s390.c  | 133 ++-
 arch/x86/include/asm/kvm_host.h   |   9 +-
 arch/x86/kvm/x86.c|  71 +++-
 include/linux/kvm_host.h  | 132 ++-
 include/linux/kvm_types.h |  12 +
 include/uapi/linux/kvm.h  |  48 +++
 tools/testing/selftests/kvm/.gitignore|   1 +
 tools/testing/selftests/kvm/Makefile  |   3 +
 .../testing/selftests/kvm/include/kvm_util.h  |   3 +
 .../selftests/kvm/kvm_bin_form_stats.c| 370 ++
 tools/testing/selftests/kvm/lib/kvm_util.c|  11 +
 virt/kvm/kvm_main.c   | 237 ++-
 24 files changed, 1401 insertions(+), 90 deletions(-)
 create mode 100644 tools/testing/selftests/kvm/kvm_bin_form_stats.c


base-commit: f96be2deac9bca3ef5a2b0b66b71fcef8bad586d
-- 
2.31.1.295.g9ea45b61b8-goog



[PATCH v2 1/4] KVM: stats: Separate common stats from architecture specific ones

2021-04-15 Thread Jing Zhang
Put all common statistics in a separate structure to ease
statistics handling for the incoming new statistics API.

No functional change intended.
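
For reference, the common structures presumably look like the sketch
below; it is inferred from the fields this patch removes from the
per-arch structs, not copied from include/linux/kvm_types.h (which is
not shown in this diff):

/* Kernel integer types spelled out so the sketch stands alone. */
typedef unsigned long ulong;
typedef unsigned long long u64;

struct kvm_vm_stat_common {
	ulong remote_tlb_flush;
};

struct kvm_vcpu_stat_common {
	u64 halt_successful_poll;
	u64 halt_attempted_poll;
	u64 halt_poll_success_ns;
	u64 halt_poll_fail_ns;
	u64 halt_poll_invalid;
	u64 halt_wakeup;
};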

Signed-off-by: Jing Zhang 
---
 arch/arm64/include/asm/kvm_host.h   |  9 ++---
 arch/arm64/kvm/guest.c  | 12 ++--
 arch/mips/include/asm/kvm_host.h|  9 ++---
 arch/mips/kvm/mips.c| 12 ++--
 arch/powerpc/include/asm/kvm_host.h |  9 ++---
 arch/powerpc/kvm/book3s.c   | 12 ++--
 arch/powerpc/kvm/book3s_hv.c| 12 ++--
 arch/powerpc/kvm/book3s_pr.c|  2 +-
 arch/powerpc/kvm/book3s_pr_papr.c   |  2 +-
 arch/powerpc/kvm/booke.c| 14 +++---
 arch/s390/include/asm/kvm_host.h|  9 ++---
 arch/s390/kvm/kvm-s390.c| 12 ++--
 arch/x86/include/asm/kvm_host.h |  9 ++---
 arch/x86/kvm/x86.c  | 14 +++---
 include/linux/kvm_host.h|  5 +
 include/linux/kvm_types.h   | 12 
 virt/kvm/kvm_main.c | 14 +++---
 17 files changed, 80 insertions(+), 88 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_host.h 
b/arch/arm64/include/asm/kvm_host.h
index 72e6b4600264..442e447e05d3 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -546,16 +546,11 @@ static inline bool __vcpu_write_sys_reg_to_cpu(u64 val, 
int reg)
 }
 
 struct kvm_vm_stat {
-   ulong remote_tlb_flush;
+   struct kvm_vm_stat_common common;
 };
 
 struct kvm_vcpu_stat {
-   u64 halt_successful_poll;
-   u64 halt_attempted_poll;
-   u64 halt_poll_success_ns;
-   u64 halt_poll_fail_ns;
-   u64 halt_poll_invalid;
-   u64 halt_wakeup;
+   struct kvm_vcpu_stat_common common;
u64 hvc_exit_stat;
u64 wfe_exit_stat;
u64 wfi_exit_stat;
diff --git a/arch/arm64/kvm/guest.c b/arch/arm64/kvm/guest.c
index 9bbd30e62799..cc5b1e2fdbd0 100644
--- a/arch/arm64/kvm/guest.c
+++ b/arch/arm64/kvm/guest.c
@@ -29,18 +29,18 @@
 #include "trace.h"
 
 struct kvm_stats_debugfs_item debugfs_entries[] = {
-   VCPU_STAT("halt_successful_poll", halt_successful_poll),
-   VCPU_STAT("halt_attempted_poll", halt_attempted_poll),
-   VCPU_STAT("halt_poll_invalid", halt_poll_invalid),
-   VCPU_STAT("halt_wakeup", halt_wakeup),
+   VCPU_STAT_COM("halt_successful_poll", halt_successful_poll),
+   VCPU_STAT_COM("halt_attempted_poll", halt_attempted_poll),
+   VCPU_STAT_COM("halt_poll_invalid", halt_poll_invalid),
+   VCPU_STAT_COM("halt_wakeup", halt_wakeup),
VCPU_STAT("hvc_exit_stat", hvc_exit_stat),
VCPU_STAT("wfe_exit_stat", wfe_exit_stat),
VCPU_STAT("wfi_exit_stat", wfi_exit_stat),
VCPU_STAT("mmio_exit_user", mmio_exit_user),
VCPU_STAT("mmio_exit_kernel", mmio_exit_kernel),
VCPU_STAT("exits", exits),
-   VCPU_STAT("halt_poll_success_ns", halt_poll_success_ns),
-   VCPU_STAT("halt_poll_fail_ns", halt_poll_fail_ns),
+   VCPU_STAT_COM("halt_poll_success_ns", halt_poll_success_ns),
+   VCPU_STAT_COM("halt_poll_fail_ns", halt_poll_fail_ns),
{ NULL }
 };
 
diff --git a/arch/mips/include/asm/kvm_host.h b/arch/mips/include/asm/kvm_host.h
index feaa77036b67..291a453ab447 100644
--- a/arch/mips/include/asm/kvm_host.h
+++ b/arch/mips/include/asm/kvm_host.h
@@ -143,10 +143,11 @@ static inline bool kvm_is_error_hva(unsigned long addr)
 }
 
 struct kvm_vm_stat {
-   ulong remote_tlb_flush;
+   struct kvm_vm_stat_common common;
 };
 
 struct kvm_vcpu_stat {
+   struct kvm_vcpu_stat_common common;
u64 wait_exits;
u64 cache_exits;
u64 signal_exits;
@@ -178,12 +179,6 @@ struct kvm_vcpu_stat {
u64 vz_cpucfg_exits;
 #endif
 #endif
-   u64 halt_successful_poll;
-   u64 halt_attempted_poll;
-   u64 halt_poll_success_ns;
-   u64 halt_poll_fail_ns;
-   u64 halt_poll_invalid;
-   u64 halt_wakeup;
 };
 
 struct kvm_arch_memory_slot {
diff --git a/arch/mips/kvm/mips.c b/arch/mips/kvm/mips.c
index 58a8812e2fa5..d78820777de1 100644
--- a/arch/mips/kvm/mips.c
+++ b/arch/mips/kvm/mips.c
@@ -71,12 +71,12 @@ struct kvm_stats_debugfs_item debugfs_entries[] = {
VCPU_STAT("vz_cpucfg", vz_cpucfg_exits),
 #endif
 #endif
-   VCPU_STAT("halt_successful_poll", halt_successful_poll),
-   VCPU_STAT("halt_attempted_poll", halt_attempted_poll),
-   VCPU_STAT("halt_poll_invalid", halt_poll_invalid),
-   VCPU_STAT("halt_wakeup", halt_wakeup),
-   VCPU_STAT("halt_poll_success_ns", halt_poll_success_ns),
-   VCPU_STAT("halt_poll_fail_ns", halt_poll_fail_ns),
+   VCPU_STAT_COM("halt_successful_poll", halt_successful_poll),
+   VCPU_STAT_COM("halt_attempted_poll", halt_attempted_poll),
+   VCPU_STAT_COM("halt_poll_invalid", halt_poll_invalid),
+   VCPU_STAT_COM("halt_wakeup", halt_wakeup),
+   VCPU_STAT_COM("halt_poll_success_ns", halt_poll_success_ns),
+   VC

[PATCH v2 3/4] KVM: stats: Add documentation for statistics data binary interface

2021-04-15 Thread Jing Zhang
Signed-off-by: Jing Zhang 
---
 Documentation/virt/kvm/api.rst | 169 +
 1 file changed, 169 insertions(+)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 2c4253718881..6474c31a4436 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -4941,6 +4941,167 @@ see KVM_XEN_VCPU_SET_ATTR above.
 The KVM_XEN_VCPU_ATTR_TYPE_RUNSTATE_ADJUST type may not be used
 with the KVM_XEN_VCPU_GET_ATTR ioctl.
 
+4.131 KVM_STATS_GETFD
+-
+
+:Capability: KVM_CAP_STATS_BINARY_FD
+:Architectures: all
+:Type: vm ioctl, vcpu ioctl
+:Parameters: none
+:Returns: statistics file descriptor on success, < 0 on error
+
+Errors:
+
+  == ==
+  ENOMEM if the fd could not be created due to lack of memory
+  EMFILE if the number of opened files exceeds the limit
+  == ==
+
+The file descriptor can be used to read VM/vCPU statistics data in binary
+format. The file data is organized into three blocks as below:
++-+
+|   Header|
++-+
+| Descriptors |
++-+
+| Stats Data  |
++-+
+
+The Header block is always at the start of the file. It only needs to be
+read once after a system boot.
+It is in the form of ``struct kvm_stats_header`` as below::
+
+   #define KVM_STATS_ID_MAXLEN 64
+   struct kvm_stats_header {
+   char id[KVM_STATS_ID_MAXLEN];
+   __u32 name_size;
+   __u32 count;
+   __u32 desc_offset;
+   __u32 data_offset;
+   };
+
+The ``id`` field is the identification string for the corresponding KVM
+statistics. For VM statistics, it is in the form of "kvm-{kvm pid}",
+like "kvm-12345". For VCPU statistics, it is in the form of
+"kvm-{kvm pid}/vcpu-{vcpu id}", like "kvm-12345/vcpu-12".
+
+The ``name_size`` field is the size (in bytes) of the statistics name string
+(including the trailing '\0') appended to the end of every statistics
+descriptor.
+
+The ``count`` field is the number of statistics.
+
+The ``desc_offset`` field is the offset of the Descriptors block from the start
+of the file indicated by the file descriptor.
+
+The ``data_offset`` field is the offset of the Stats Data block from the start
+of the file indicated by the file descriptor.
+
+The Descriptors block only needs to be read once after a system boot. It is
+an array of ``struct kvm_stats_desc`` as below::
+
+   #define KVM_STATS_TYPE_SHIFT0
+   #define KVM_STATS_TYPE_MASK (0xF << KVM_STATS_TYPE_SHIFT)
+   #define KVM_STATS_TYPE_CUMULATIVE   (0x0 << KVM_STATS_TYPE_SHIFT)
+   #define KVM_STATS_TYPE_INSTANT  (0x1 << KVM_STATS_TYPE_SHIFT)
+   #define KVM_STATS_TYPE_MAX  KVM_STATS_TYPE_INSTANT
+
+   #define KVM_STATS_UNIT_SHIFT4
+   #define KVM_STATS_UNIT_MASK (0xF << KVM_STATS_UNIT_SHIFT)
+   #define KVM_STATS_UNIT_NONE (0x0 << KVM_STATS_UNIT_SHIFT)
+   #define KVM_STATS_UNIT_BYTES(0x1 << KVM_STATS_UNIT_SHIFT)
+   #define KVM_STATS_UNIT_SECONDS  (0x2 << KVM_STATS_UNIT_SHIFT)
+   #define KVM_STATS_UNIT_CYCLES   (0x3 << KVM_STATS_UNIT_SHIFT)
+   #define KVM_STATS_UNIT_MAX  KVM_STATS_UNIT_CYCLES
+
+   #define KVM_STATS_SCALE_SHIFT   8
+   #define KVM_STATS_SCALE_MASK(0xF << KVM_STATS_SCALE_SHIFT)
+   #define KVM_STATS_SCALE_POW10   (0x0 << KVM_STATS_SCALE_SHIFT)
+   #define KVM_STATS_SCALE_POW2(0x1 << KVM_STATS_SCALE_SHIFT)
+   #define KVM_STATS_SCALE_MAX KVM_STATS_SCALE_POW2
+   struct kvm_stats_desc {
+   __u32 flags;
+   __s16 exponent;
+   __u16 size;
+   __u32 unused1;
+   __u32 unused2;
+   __u8 name[0];
+   };
+
+The ``flags`` field contains the type and unit of the statistics data described
+by this descriptor. The following flags are supported:
+  * ``KVM_STATS_TYPE_CUMULATIVE``
+The statistics data is cumulative. The value of data can only be increased.
+Most of the counters used in KVM are of this type.
+The corresponding ``count`` field for this type is always 1.
+  * ``KVM_STATS_TYPE_INSTANT``
+The statistics data is instantaneous. Its value can be increased or
+decreased. This type is usually used as a measurement of some resources,
+like the number of dirty pages, the number of large pages, etc.
+The corresponding ``count`` field for this type is always 1.
+  * ``KVM_STATS_UNIT_NONE``
+There is no unit for the value of statistics data. This usually means that
+the value is a simple counter of an event.
+  * ``KVM_STATS_UNIT_BYTES``
+It indicates that the statistics data is used to measure memory size, in 
the
+unit of Byte
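
To make the layout above concrete, a minimal userspace reader might
look like the sketch below. The fd is assumed to come from the
KVM_STATS_GETFD ioctl described in section 4.131; the layout of the
Stats Data block (``size`` 64-bit values per statistic, packed in
descriptor order) is an assumption here, not something spelled out in
the excerpt above.

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <unistd.h>

#define KVM_STATS_ID_MAXLEN 64

struct kvm_stats_header {
	char id[KVM_STATS_ID_MAXLEN];
	uint32_t name_size;
	uint32_t count;
	uint32_t desc_offset;
	uint32_t data_offset;
};

struct kvm_stats_desc {
	uint32_t flags;
	int16_t exponent;
	uint16_t size;
	uint32_t unused1;
	uint32_t unused2;
	char name[];		/* name_size bytes, '\0' terminated */
};

static int dump_stats(int stats_fd)
{
	struct kvm_stats_header hdr;
	struct kvm_stats_desc *desc;
	size_t desc_size, i;
	uint64_t value, data_off;
	char *descs;

	/* The Header and Descriptors blocks only need to be read once. */
	if (pread(stats_fd, &hdr, sizeof(hdr), 0) != sizeof(hdr))
		return -1;

	desc_size = sizeof(*desc) + hdr.name_size;
	descs = malloc(desc_size * hdr.count);
	if (!descs || pread(stats_fd, descs, desc_size * hdr.count,
			    hdr.desc_offset) != (ssize_t)(desc_size * hdr.count)) {
		free(descs);
		return -1;
	}

	/* The Stats Data block is what a telemetry app would re-read
	 * periodically; print the first value of each statistic. */
	data_off = hdr.data_offset;
	for (i = 0; i < hdr.count; i++) {
		desc = (struct kvm_stats_desc *)(descs + i * desc_size);
		if (pread(stats_fd, &value, sizeof(value), data_off) !=
		    sizeof(value))
			break;
		printf("%s: %s = %llu\n", hdr.id, desc->name,
		       (unsigned long long)value);
		data_off += desc->size * sizeof(uint64_t);
	}
	free(descs);
	return 0;
}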

[PATCH v2 4/4] KVM: selftests: Add selftest for KVM statistics data binary interface

2021-04-15 Thread Jing Zhang
Signed-off-by: Jing Zhang 
---
 tools/testing/selftests/kvm/.gitignore|   1 +
 tools/testing/selftests/kvm/Makefile  |   3 +
 .../testing/selftests/kvm/include/kvm_util.h  |   3 +
 .../selftests/kvm/kvm_bin_form_stats.c| 370 ++
 tools/testing/selftests/kvm/lib/kvm_util.c|  11 +
 5 files changed, 388 insertions(+)
 create mode 100644 tools/testing/selftests/kvm/kvm_bin_form_stats.c

diff --git a/tools/testing/selftests/kvm/.gitignore 
b/tools/testing/selftests/kvm/.gitignore
index 32b87cc77c8e..0c8241bd9a17 100644
--- a/tools/testing/selftests/kvm/.gitignore
+++ b/tools/testing/selftests/kvm/.gitignore
@@ -38,3 +38,4 @@
 /memslot_modification_stress_test
 /set_memory_region_test
 /steal_time
+/kvm_bin_form_stats
diff --git a/tools/testing/selftests/kvm/Makefile 
b/tools/testing/selftests/kvm/Makefile
index a6d61f451f88..5cdd52ccedf2 100644
--- a/tools/testing/selftests/kvm/Makefile
+++ b/tools/testing/selftests/kvm/Makefile
@@ -72,6 +72,7 @@ TEST_GEN_PROGS_x86_64 += kvm_create_max_vcpus
 TEST_GEN_PROGS_x86_64 += memslot_modification_stress_test
 TEST_GEN_PROGS_x86_64 += set_memory_region_test
 TEST_GEN_PROGS_x86_64 += steal_time
+TEST_GEN_PROGS_x86_64 += kvm_bin_form_stats
 
 TEST_GEN_PROGS_aarch64 += aarch64/get-reg-list
 TEST_GEN_PROGS_aarch64 += aarch64/get-reg-list-sve
@@ -81,6 +82,7 @@ TEST_GEN_PROGS_aarch64 += dirty_log_perf_test
 TEST_GEN_PROGS_aarch64 += kvm_create_max_vcpus
 TEST_GEN_PROGS_aarch64 += set_memory_region_test
 TEST_GEN_PROGS_aarch64 += steal_time
+TEST_GEN_PROGS_aarch64 += kvm_bin_form_stats
 
 TEST_GEN_PROGS_s390x = s390x/memop
 TEST_GEN_PROGS_s390x += s390x/resets
@@ -89,6 +91,7 @@ TEST_GEN_PROGS_s390x += demand_paging_test
 TEST_GEN_PROGS_s390x += dirty_log_test
 TEST_GEN_PROGS_s390x += kvm_create_max_vcpus
 TEST_GEN_PROGS_s390x += set_memory_region_test
+TEST_GEN_PROGS_s390x += kvm_bin_form_stats
 
 TEST_GEN_PROGS += $(TEST_GEN_PROGS_$(UNAME_M))
 LIBKVM += $(LIBKVM_$(UNAME_M))
diff --git a/tools/testing/selftests/kvm/include/kvm_util.h 
b/tools/testing/selftests/kvm/include/kvm_util.h
index 2d7eb6989e83..407f3a35ed3b 100644
--- a/tools/testing/selftests/kvm/include/kvm_util.h
+++ b/tools/testing/selftests/kvm/include/kvm_util.h
@@ -378,4 +378,7 @@ uint64_t get_ucall(struct kvm_vm *vm, uint32_t vcpu_id, 
struct ucall *uc);
 #define GUEST_ASSERT_4(_condition, arg1, arg2, arg3, arg4) \
__GUEST_ASSERT((_condition), 4, (arg1), (arg2), (arg3), (arg4))
 
+int vm_get_statsfd(struct kvm_vm *vm);
+int vcpu_get_statsfd(struct kvm_vm *vm, uint32_t vcpuid);
+
 #endif /* SELFTEST_KVM_UTIL_H */
diff --git a/tools/testing/selftests/kvm/kvm_bin_form_stats.c 
b/tools/testing/selftests/kvm/kvm_bin_form_stats.c
new file mode 100644
index ..5ed2fe74ce55
--- /dev/null
+++ b/tools/testing/selftests/kvm/kvm_bin_form_stats.c
@@ -0,0 +1,370 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * kvm_bin_form_stats
+ *
+ * Copyright (C) 2021, Google LLC.
+ *
+ * Test the fd-based interface for KVM statistics.
+ */
+
+#define _GNU_SOURCE /* for program_invocation_short_name */
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "test_util.h"
+
+#include "kvm_util.h"
+#include "asm/kvm.h"
+#include "linux/kvm.h"
+
+int vm_stats_test(struct kvm_vm *vm)
+{
+   ssize_t ret;
+   int i, stats_fd, err = -1;
+   size_t size_desc, size_data = 0;
+   struct kvm_stats_header header;
+   struct kvm_stats_desc *stats_desc, *pdesc;
+   struct kvm_vm_stats_data *stats_data;
+
+   // Get fd for VM stats
+   stats_fd = vm_get_statsfd(vm);
+   if (stats_fd < 0) {
+   perror("Get VM stats fd");
+   return err;
+   }
+   // Read kvm vm stats header
+   ret = read(stats_fd, &header, sizeof(header));
+   if (ret != sizeof(header)) {
+   perror("Read VM stats header");
+   goto out_close_fd;
+   }
+   size_desc = sizeof(*stats_desc) + header.name_size;
+   // Check id string in header, that should start with "kvm"
+   if (strncmp(header.id, "kvm", 3) ||
+   strlen(header.id) >= KVM_STATS_ID_MAXLEN) {
+   printf("Invalid KVM VM stats type!\n");
+   goto out_close_fd;
+   }
+   // Sanity check for other fields in header
+   if (header.count == 0) {
+   err = 0;
+   goto out_close_fd;
+   }
+   // Check overlap
+   if (header.desc_offset == 0 || header.data_offset == 0 ||
+   header.desc_offset < sizeof(header) ||
+   header.data_offset < sizeof(header)) {
+   printf("Invalid offset fields in header!\n");
+   goto out_close_fd;
+   }
+   if (header.desc_offset < header.data_offset &&
+   (header.desc_offset + size_desc * header.count >
+   header.data_offset)) {
+   printf("Descriptor block is overlapped with data block!\n");
+  

[PATCH v2 2/4] KVM: stats: Add fd-based API to read binary stats data

2021-04-15 Thread Jing Zhang
Provides a file descriptor per VM to read VM stats info/data.
Provides a file descriptor per vCPU to read vCPU stats info/data.

Signed-off-by: Jing Zhang 
---
 arch/arm64/kvm/guest.c|  30 +
 arch/mips/kvm/mips.c  |  55 ++
 arch/powerpc/kvm/book3s.c |  56 ++
 arch/powerpc/kvm/booke.c  |  49 +
 arch/s390/kvm/kvm-s390.c  | 121 +
 arch/x86/kvm/x86.c|  57 ++
 include/linux/kvm_host.h  | 127 +-
 include/uapi/linux/kvm.h  |  48 
 virt/kvm/kvm_main.c   | 223 ++
 9 files changed, 764 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/kvm/guest.c b/arch/arm64/kvm/guest.c
index cc5b1e2fdbd0..6286a9011002 100644
--- a/arch/arm64/kvm/guest.c
+++ b/arch/arm64/kvm/guest.c
@@ -28,6 +28,36 @@
 
 #include "trace.h"
 
+struct _kvm_stats_desc kvm_vm_stats_desc[] = {
+   STATS_VM_COMMON,
+};
+
+struct _kvm_stats_header kvm_vm_stats_header = {
+   .name_size = KVM_STATS_NAME_LEN,
+   .count = ARRAY_SIZE(kvm_vm_stats_desc),
+   .desc_offset = sizeof(struct kvm_stats_header),
+   .data_offset = sizeof(struct kvm_stats_header) +
+   sizeof(kvm_vm_stats_desc),
+};
+
+struct _kvm_stats_desc kvm_vcpu_stats_desc[] = {
+   STATS_VCPU_COMMON,
+   STATS_DESC_COUNTER("hvc_exit_stat"),
+   STATS_DESC_COUNTER("wfe_exit_stat"),
+   STATS_DESC_COUNTER("wfi_exit_stat"),
+   STATS_DESC_COUNTER("mmio_exit_user"),
+   STATS_DESC_COUNTER("mmio_exit_kernel"),
+   STATS_DESC_COUNTER("exits"),
+};
+
+struct _kvm_stats_header kvm_vcpu_stats_header = {
+   .name_size = KVM_STATS_NAME_LEN,
+   .count = ARRAY_SIZE(kvm_vcpu_stats_desc),
+   .desc_offset = sizeof(struct kvm_stats_header),
+   .data_offset = sizeof(struct kvm_stats_header) +
+   sizeof(kvm_vcpu_stats_desc),
+};
+
 struct kvm_stats_debugfs_item debugfs_entries[] = {
VCPU_STAT_COM("halt_successful_poll", halt_successful_poll),
VCPU_STAT_COM("halt_attempted_poll", halt_attempted_poll),
diff --git a/arch/mips/kvm/mips.c b/arch/mips/kvm/mips.c
index d78820777de1..3073886d37f9 100644
--- a/arch/mips/kvm/mips.c
+++ b/arch/mips/kvm/mips.c
@@ -39,6 +39,61 @@
 #define VECTORSPACING 0x100/* for EI/VI mode */
 #endif
 
+struct _kvm_stats_desc kvm_vm_stats_desc[] = {
+   STATS_VM_COMMON,
+};
+
+struct _kvm_stats_header kvm_vm_stats_header = {
+   .name_size = KVM_STATS_NAME_LEN,
+   .count = ARRAY_SIZE(kvm_vm_stats_desc),
+   .desc_offset = sizeof(struct kvm_stats_header),
+   .data_offset = sizeof(struct kvm_stats_header) +
+   sizeof(kvm_vm_stats_desc),
+};
+
+struct _kvm_stats_desc kvm_vcpu_stats_desc[] = {
+   STATS_VCPU_COMMON,
+   STATS_DESC_COUNTER("wait_exits"),
+   STATS_DESC_COUNTER("cache_exits"),
+   STATS_DESC_COUNTER("signal_exits"),
+   STATS_DESC_COUNTER("int_exits"),
+   STATS_DESC_COUNTER("cop_unusable_exits"),
+   STATS_DESC_COUNTER("tlbmod_exits"),
+   STATS_DESC_COUNTER("tlbmiss_ld_exits"),
+   STATS_DESC_COUNTER("tlbmiss_st_exits"),
+   STATS_DESC_COUNTER("addrerr_st_exits"),
+   STATS_DESC_COUNTER("addrerr_ld_exits"),
+   STATS_DESC_COUNTER("syscall_exits"),
+   STATS_DESC_COUNTER("resvd_inst_exits"),
+   STATS_DESC_COUNTER("break_inst_exits"),
+   STATS_DESC_COUNTER("trap_inst_exits"),
+   STATS_DESC_COUNTER("msa_fpe_exits"),
+   STATS_DESC_COUNTER("fpe_exits"),
+   STATS_DESC_COUNTER("msa_disabled_exits"),
+   STATS_DESC_COUNTER("flush_dcache_exits"),
+#ifdef CONFIG_KVM_MIPS_VZ
+   STATS_DESC_COUNTER("vz_gpsi_exits"),
+   STATS_DESC_COUNTER("vz_gsfc_exits"),
+   STATS_DESC_COUNTER("vz_hc_exits"),
+   STATS_DESC_COUNTER("vz_grr_exits"),
+   STATS_DESC_COUNTER("vz_gva_exits"),
+   STATS_DESC_COUNTER("vz_ghfc_exits"),
+   STATS_DESC_COUNTER("vz_gpa_exits"),
+   STATS_DESC_COUNTER("vz_resvd_exits"),
+#ifdef CONFIG_CPU_LOONGSON64
+   STATS_DESC_COUNTER("vz_cpucfg_exits"),
+#endif
+#endif
+};
+
+struct _kvm_stats_header kvm_vcpu_stats_header = {
+   .name_size = KVM_STATS_NAME_LEN,
+   .count = ARRAY_SIZE(kvm_vcpu_stats_desc),
+   .desc_offset = sizeof(struct kvm_stats_header),
+   .data_offset = sizeof(struct kvm_stats_header) +
+   sizeof(kvm_vcpu_stats_desc),
+};
+
 struct kvm_stats_debugfs_item debugfs_entries[] = {
VCPU_STAT("wait", wait_exits),
VCPU_STAT("cache", cache_exits),
diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
index d799029f2e55..7c24ad66fa38 100644
--- a/arch/powerpc/kvm/book3s.c
+++ b/arch/powerpc/kvm/book3s.c
@@ -38,6 +38,62 @@
 
 /* #define EXIT_DEBUG */
 
+struct _kvm_stats_desc kvm_vm_stats_desc[] = {
+   STATS_VM_COMMON,
+   STATS_DESC_ICOUNTER("num_2M_pages"),
+   STATS_DESC_ICOUNTER("num_1G_pages"),
+};
+
+struct _kvm_stats_header kvm_vm_stats_header = {
+   .name_size = KVM_STATS_NAME_LEN